Simple Data Analysis Project

 COVER PAGE: Project title, due date & name of the author (e.g., your name)

Name

Subject: hypothesized relationship between simple statistical data analysis and comparison

Date

TABLE OF CONTENTS: Optional

EXECUTIVE SUMMARY:

Statistics is a major mathematical instrument used for the quantitative analysis of numerical data. It therefore plays an integral role in extracting important information from a specific data set. In this project, we study two distinct research methods: empirical and theoretical data analysis. Empirical research often generates more than one outcome from the same data because it relies on replicated measurements, which are prone to error. Simple statistical data analysis can be used to summarize the observations by estimating the average value, which is also referred to as the true mean. Another important analysis is the determination of the variance, which is used to quantify the uncertainty in a measured variable (Peters, 1). Statistical data analysis is also crucial for propagating measurement error through a mathematical model in order to estimate the error in a derived quantity. These methods are only a few of the many ways of analysing an empirical or theoretical data set.

I. INTRODUCTION [0.5 pages]

Scientific research spans a wide range of interrelated disciplines. Within each discipline, a researcher can use several different methods of conducting research. Simple data analysis is divided into two broad categories, namely theoretical and empirical research (SAS, 2003). Theoretical research uses mathematics and logic to prove that certain propositions are true. Empirical (experimental) research draws its conclusions from observations. A good example is in biology, where one compares the genetic strands obtained from two species and might conclude that the species are likely to share a common ancestry.

Testing of the null hypothesis is one common language that social science researchers use (SAS, 2003). Theoretical data analysis is based on the assumption that the data are drawn from a normally distributed population, but experiments show that inconsistencies arise from this assumption of normality (SAS, 2003). However, if the investigated data are normally distributed, theoretical data analysis becomes a powerful research tool because it detects the more important and significant data (SAS, 2003). Empirical research, on the other hand, uses information about the relative size of the observations without making any assumptions about the variance and mean of the population under survey, and can therefore be used to analyse any type of dataset (SAS, 2003).

Although scientific methods are diverse, many of the data analysis methods used by researchers share common characteristics. Regardless of the field of study, most research involves a researcher gathering data and analysing it in order to determine its meaning (SAS, 2003). Additionally, researchers in social sciences such as sociology and psychology use one common language for conducting and reporting their research.

II. PROJECT WORK [4-6 pages]

Definition

Discussing how simple data analysis and comparison work requires several concepts that carry specific meanings. The concepts that were important in this project are defined first.

Error

In statistics, error is the random and unpredictable deviation between replicate measurements, which is quantified with a standard deviation (Peters, 6). Error in data analysis can also arise from a predictable, regular deviation from the true value, which is quantified as the mean difference. In other words, error can appear as the difference between the mean of replicate determinations and the true value (Peters, 6). Analytical error can also be attributed to a constant component that is unrelated to the data being analyzed.

Accuracy

Accuracy is defined as the closeness of an analytical result to the true value. It combines both systematic and random errors and therefore cannot be directly quantified (Novikova, 1). The test result may be the mean of several values, so an accurate determination may produce a precise and quantifiable value (Novikova, 1).

Precision

Precision means that the results of replicate analyses of a given sample are close to and tend to agree with each other (SAS, 2003). It is a measure of dispersion around the mean value and is usually expressed as a standard deviation or a range. The range is the difference between the lowest value and the highest value within a given data set (SAS, 2003).

Bias

Bias is the opposite of trueness, where trueness is defined as the agreement of the mean of the analytical results with the true value after the contribution of randomness, as represented by precision, has been excluded (Peters, 2). Bias is therefore the steady deviation of the results from the true value, brought about by systematic errors within a given procedure (Peters, 17). Several factors contribute to analytical data being biased:

Method bias is the difference between the mean test results obtained from various data sets. It depends on the level of analysis conducted.

Sample bias is the difference between the true value of the target dataset from which the sample data were taken and the mean value of the duplicate test results.
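The definitions of bias and precision above can be illustrated with a minimal sketch. The true value and the replicate readings below are hypothetical and are not taken from the project data; the sketch only assumes that bias is measured as the mean of the replicates minus the true value and that precision is expressed as a sample standard deviation.

```python
import statistics

# Hypothetical values for illustration only (not from D1, D2 or D3).
true_value = 100.0
replicates = [101.0, 102.0, 103.0, 102.0]

# Bias: steady deviation of the mean result from the true value.
bias = statistics.mean(replicates) - true_value

# Precision: spread of the replicates around their own mean,
# expressed here as the sample standard deviation.
precision = statistics.stdev(replicates)

print("bias:", bias)
print("precision (standard deviation):", precision)
```

In this sketch the readings agree closely with one another (good precision) while all sitting above the true value (a positive bias), which shows that the two properties are independent.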

Data collection

Data collection forms a very important part of any statistical research. Accurate data have a large impact on the results of a study and can ultimately determine the validity of those results. Data can be collected in a wide range of ways. The two types of data collection methods are quantitative and qualitative methods (SAS, 2003). Each method is suited to a specific kind of study, depending on the type of research carried out.

Tools of data analysis and representation

Measures of Central Tendency

The Mean

The mean of each dataset is calculated by adding up all the scores in the dataset and then dividing the total by the number of participants in that condition. Suppose we compare the three datasets D1, D2 and D3 in the workbook, where Sheet 1 holds D1, Sheet 2 holds D2 and Sheet 3 holds D3. The mean values of D1, D2 and D3 are 287.5533, 252.6467 and 259.2 respectively. According to the spreadsheet, the means of D2 and D3 differ only slightly from each other, whereas D1, whose values are quite extreme, has a considerably higher mean. This is illustrated in Figure 1 below.

The main advantage of the mean is that it takes into account all the scores in the dataset. This makes it an important measure of central tendency, especially if the scores resemble a normal distribution (Wike, 14). A normal distribution is bell shaped, so most scores are clustered close to the mean.

On the other hand, the mean can be extremely misleading if the distribution of scores differs markedly from the normal, with one or more extreme scores lying in one direction (SAS, 2003).
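The calculation of the mean, and its sensitivity to an extreme score, can be sketched as follows. The scores are hypothetical and are not taken from D1, D2 or D3; the function name arithmetic_mean is likewise only illustrative.

```python
# Hypothetical scores for illustration only (not from the workbook).
scores = [250, 252, 255, 258, 260]


def arithmetic_mean(values):
    """Add up all the scores and divide by the number of scores."""
    return sum(values) / len(values)


typical = arithmetic_mean(scores)

# Adding a single extreme score pulls the mean away from the
# region where most of the scores are clustered.
with_extreme = arithmetic_mean(scores + [400])

print("mean without extreme score:", typical)
print("mean with extreme score:", with_extreme)
```

The second result lies above every score except the extreme one, which is the sense in which the mean can mislead when the distribution is not close to normal.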

Figure 1

The Median

The median is used to describe the general performance level within each statistical condition (SAS, 2003). In other terms, the median is the middle value of the ordered scores, with an equal number of scores on either side. If the number of scores is odd, the median is the single middle value. If the number of scores is even, the median is obtained by determining the mean of the two middle values (SAS, 2003). Each of the provided datasets D1, D2 and D3 comprises one hundred and fifty values, so the median falls between the seventy-fifth and seventy-sixth values. The medians of D1, D2 and D3 are 287, 252 and 255.5 respectively. This is represented in Figure 2 below.

An advantage of the median as a measure of central tendency is that it is not affected by extreme scores, because it depends only on the scores in the middle of the distribution (SAS, 2003). Another advantage is that it tends to be easier to work out than the mean.

The main limitation of the median is that it does not take most of the scores into account; it is therefore a less sensitive measure than the mean (SAS, 2003). Additionally, the median is not always representative of the data, particularly if the scores are exceptionally few.
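The odd and even cases described above can be sketched in a few lines. The input lists are hypothetical; the 150-value list simply contains the numbers 1 to 150 so that the seventy-fifth and seventy-sixth values are easy to see, and it does not reproduce any of the project datasets.

```python
def median(values):
    """Middle value of the sorted scores; for an even count,
    the mean of the two middle values."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2


# Odd number of scores: the single middle value.
odd_case = median([3, 1, 2])

# 150 scores: the mean of the 75th and 76th ordered values.
even_case = median(list(range(1, 151)))

# An extreme score moves the median far less than it moves the mean.
robust_case = median([250, 252, 255, 258, 260, 400])

print(odd_case, even_case, robust_case)
```

With 150 values the function averages the values at positions 75 and 76 of the sorted list, which matches the description given for D1, D2 and D3.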

Figure 2

The Mode

The mode is simply defined as the most frequently occurring value within a dataset. The modal values of the three datasets D1, D2 and D3 are 279, 238 and 254 respectively.

The mode is not affected by one or more extreme scores in a dataset, and it is the easiest measure of central tendency to work out compared with the mean and median (Wike, 14). It is also of benefit because it can still be worked out when certain extreme scores have not been identified (Wike, 14). Even though the mode has its advantages, its disadvantages outweigh them. One disadvantage is that the mode is an unreliable measure: the information carried by the exact values of the other scores is ignored when calculating it, which makes it a less sensitive measure of central tendency (Wike, 14). Another disadvantage is that it is possible for a dataset to have more than one modal value (Wike, 14).
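The point that a dataset can have more than one modal value can be shown with a short sketch. The scores are hypothetical, and the helper name modes is only illustrative; it returns every value that shares the highest frequency.

```python
from collections import Counter


def modes(values):
    """Return all values that occur most often, in ascending order."""
    counts = Counter(values)
    highest = max(counts.values())
    return sorted(v for v, c in counts.items() if c == highest)


# Hypothetical scores for illustration only.
single_mode = modes([279, 279, 280, 285])
two_modes = modes([238, 240, 238, 254, 254, 260])

print(single_mode)
print(two_modes)
```

Returning a list rather than a single number makes the ambiguity explicit: when two values tie for the highest frequency, both are reported instead of one being chosen arbitrarily.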

Figure 3

Measures of Dispersion

Central tendency provides a useful summary of data; however, a fuller analysis also incorporates measures of dispersion such as the range, the standard deviation and the variation range, among other methods. The importance of these measures is that they indicate whether or not the scores within a dataset are similar to one another (SAS, 2003).

Range

This is the simplest measure of dispersion. It is defined as the difference between the highest score and the lowest score in a given dataset (Wike, 23). For the three datasets, the ranges of D1, D2 and D3 are 99, 104 and 100 respectively. Comparing the three datasets, as shown in Figure 4 below, indicates that their ranges differ by no more than 5. The range has advantages as well as disadvantages. Its advantage as a measure of dispersion is that it takes full account of the extreme values and is very easy to calculate (Wike, 23). One weakness is that it can be greatly influenced by a single score that differs greatly from all the rest (Wike, 23). Another weakness is that it is likely to give an inadequate measure of the general dispersion of the scores around the mean and median (Wike, 24). The range can be further subdivided into quartiles.
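The range, and the way a single score can dominate it, can be sketched as follows. The scores are hypothetical and the function name value_range is only illustrative.

```python
def value_range(values):
    """Difference between the highest and the lowest score."""
    return max(values) - min(values)


# Hypothetical scores for illustration only.
scores = [250, 252, 255, 258, 260]

plain = value_range(scores)

# One score that differs greatly from the rest inflates the range,
# even though the other scores have not changed.
with_extreme = value_range(scores + [400])

print(plain, with_extreme)
```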

Standard deviation

This is one of the most useful measures of dispersion, although it is harder to calculate than the range (SAS, 2003). In general terms, the standard deviation provides a more accurate and precise measure of how the scores are spread. Determining the standard deviation involves several steps, none of which is complex. The first step is to find the mean value of the given dataset. The mean values of our datasets have already been calculated. The second step is to subtract the mean value from each of the scores in the dataset. The third step is to square each of the differences obtained in the second step. The fourth step is to add up all the squared differences. The fifth step is to divide the result of the fourth step by one less than the number of participants in order to get the variance. For our datasets of one hundred and fifty participants, the fifth step entails dividing by one hundred and forty-nine. The square root of the variance is known as the standard deviation.
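The steps above can be sketched directly in code. The scores are hypothetical and are not taken from D1, D2 or D3, and the function name is only illustrative; the result is cross-checked against the standard library's statistics.stdev, which also divides by one less than the number of scores.

```python
import math
import statistics


def sample_standard_deviation(values):
    n = len(values)
    mean_value = sum(values) / n                    # step 1: the mean
    deviations = [v - mean_value for v in values]   # step 2: subtract the mean
    squared = [d ** 2 for d in deviations]          # step 3: square each difference
    total = sum(squared)                            # step 4: add the squares
    variance = total / (n - 1)                      # step 5: divide by n - 1
    return math.sqrt(variance)                      # square root of the variance


# Hypothetical scores for illustration only.
scores = [250, 252, 255, 258, 260]

sd = sample_standard_deviation(scores)

print("step-by-step:", sd)
print("statistics.stdev:", statistics.stdev(scores))
```

For a dataset of 150 scores the same function divides by 149 in the fifth step, as described in the text.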

Standard deviation is of great relevance, especially for a normal, bell-shaped data distribution. It provides information about the scores within the general population of the datasets (SAS, 2003). The standard deviation takes into account all the scores in a dataset, including the majority of scores that lie close to the mean value as well as the scores that lie far from it. It also takes into account other characteristics of a normal data distribution, including the height and weight of the scores, and therefore presents a more profound measure of dispersion. The standard deviation is also important because it describes the spread of the scores within a dataset, and in a normal distribution, with great precision, leaving fewer possibilities for error (Peters, 6). The major disadvantage of the standard deviation is that it is difficult to calculate (SAS, 2003).

III. RESULTS/DISCUSSION [1-2 pages]

Statistical analysis of data has a long and wide history in research. Numerous powerful statistical software packages, which allow in-depth and precise calculations, have been developed in recent years. These packages make calculations easy and fast, especially when the data set is too large to process manually. Some statistical packages are extremely complex to handle, whereas others are easy to use and have been used successfully by researchers without a strong statistical background.

Statistics provides a holistic advantage when analyzing data because it helps determine the most appropriate test to use (SAS, 2003). Such tests are determined partly by the experimental plan that has been used. The most commonly used statistical methods include theoretical methods and investigative methods (SAS, 2003). The theoretical method uses numerical statistics, including the standard deviation, the arithmetic mean, the median and the mode, to determine the significant differences between different sets of data (SAS, 2003). In order to make good use of these numerical statistics, certain assumptions regarding normality, the normal distribution of the data and the equality of variance among groups ought to be made (SAS, 2003). The good thing about these assumptions is that they allow a sufficient and satisfying analysis of statistical information; they are therefore extremely useful and act as a starting point for viable analysis (SAS, 2003).

However, if the data do not meet these assumptions, the theoretical methods of analysis become unreliable, because the variance and mean no longer describe the information as precisely as expected (SAS, 2003). If the data are skewed or the variance is not normal, the end result would be a false conclusion about the data sets (SAS, 2003).

From the research, comparing the mean and median values is the easiest and simplest method of investigating the distribution of statistical data. This method proves to be effective even though there are numerous other methods of testing whether data follow a normal distribution pattern (SAS, 2003). The mean is the sum of the values in a group dataset divided by the number of members in that group. The median, on the other hand, is the point at which the middle data value lies within the group dataset.
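The comparison of mean and median can be sketched as a simple check. The scores are hypothetical, the function name mean_median_gap is only illustrative, and reading a large gap as a sign of skew is a rough heuristic assumed here rather than a formal test of normality.

```python
import statistics


def mean_median_gap(values):
    """Mean minus median; a value near zero is consistent with a
    roughly symmetric distribution."""
    return statistics.mean(values) - statistics.median(values)


# Hypothetical scores for illustration only.
symmetric = [250, 252, 255, 258, 260]
skewed = [250, 252, 255, 258, 400]

gap_symmetric = mean_median_gap(symmetric)
gap_skewed = mean_median_gap(skewed)

print(gap_symmetric, gap_skewed)
```

Applied to the figures reported earlier, the mean exceeds the median by about 0.55 for D1, 0.65 for D2 and 3.7 for D3, so by this rough check D3 shows the largest gap between the two measures.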

IV. CONCLUSION

Simple data analysis provides a platform for the analysis and representation of data. Even with the advent of technological tools for statistical analysis, the mean, median, mode and standard deviation play an integral role in data analysis. These measures are comprehensive and easy to understand and interpret. Even though the standard deviation is more difficult to work out, it reduces the chance of errors occurring within a dataset compared with the mean, median and mode. This is because it takes into account every score and therefore provides a more precise measure of statistical analysis. In future, statistical data analysis will become much easier to work out and comprehend because of the emergence of statistical software that makes calculation and interpretation easy. The more common manual data representation and analysis would become obsolete in the near future.
