Author: Ed Nelson
Department of Sociology M/S SS97
California State University, Fresno
Fresno, CA 93740
Note to the Instructor: This is the seventh in a series of 13 exercises that were written for an introductory research methods class. The first exercise focuses on the research design, which is your plan of action that explains how you will try to answer your research questions. Exercises two through four focus on sampling, measurement, and data collection. The fifth exercise discusses hypotheses and hypothesis testing. The last eight exercises focus on data analysis. In these exercises we’re going to analyze data from one of the Monitoring the Future Surveys (i.e., the 2017 survey of high school seniors in the United States). This data set is part of the collection at the Inter-university Consortium for Political and Social Research at the University of Michigan. This data set is freely available to the public and you do not have to be a member of the Consortium to use it. To analyze the data we’re going to use SDA (Survey Documentation and Analysis), an online statistical package written by the Survey Methods Program at UC Berkeley that is available without cost wherever one has an internet connection. A weight variable is automatically applied to the data set so it better represents the population from which the sample was selected. You have permission to use this exercise and to revise it to fit your needs. Please send a copy of any revision to the author so I can see how people are using the exercises. Included with this exercise (as separate files) are more detailed notes to the instructors and the exercise itself. Please contact the author for additional information.
This page in MS Word (.docx) format is attached.
The goal of this exercise is to explore measures of central tendency (mode, median, and mean) and dispersion (range, standard deviation, and variance). The exercise also gives you practice in using FREQUENCIES in SDA.
Data analysis always starts with describing variables one-at-a-time. Sometimes this is referred to as univariate (one-variable) analysis. Central tendency refers to the center of the distribution.
There are three commonly used measures of central tendency – the mode, median, and mean of a distribution. The mode is the most common value or values in a distribution. The median is the middle value of a distribution. The mean is the sum of all the values divided by the number of values.
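The three definitions above can be tried out on a small list of numbers. The sketch below uses Python's standard `statistics` module; the values are invented for illustration and are not taken from the MTF data.

```python
from statistics import mean, median, multimode

# Hypothetical values, invented for illustration (not MTF data).
values = [2, 3, 3, 5, 7, 8, 14]

modes = multimode(values)   # most common value(s); a list, because ties are possible
middle = median(values)     # middle value once the data are sorted
average = mean(values)      # sum of the values divided by the number of values

print("mode(s):", modes)
print("median:", middle)
print("mean:", average)
```

Notice that the single large value (14) pulls the mean above the median, while the mode and median are unaffected by how large that value is.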
We’re going to use the Monitoring the Future (MTF) Survey of high school seniors for this exercise. The MTF survey is a multistage cluster sample of all high school seniors in the United States. The survey of seniors started in 1975 and has been an annual survey ever since. To access the MTF 2017 survey follow the instructions in the Appendix. Your screen should look like Figure 7-1. Notice that a weight variable has already been entered in the WEIGHT box. This will weight the data so the sample better represents the population from which the sample was selected.
MTF is an example of a social survey. The investigators selected a sample from the population of all high school seniors in the United States. This particular survey was conducted in 2017 and is a relatively large sample of a little more than 12,000 students. In a survey we ask respondents questions and use their answers as data for our analysis. The answers to these questions are used as measures of various concepts. In the language of survey research these measures are typically referred to as variables.
Run FREQUENCIES in SDA for the variable v2196. This variable is the number of miles per week that students drive. Here’s the question from the survey – “During an average week, how much do you usually drive a car, truck, or motorcycle?” To run the frequency distribution, enter the variable name, v2196, in the ROW box. The WEIGHT box is already filled in. Click on RUN THE TABLE to get the frequency distribution. Your screen should look like Figure 7-2.
The responses to this question were divided into a set of six categories – none, 1 to 10, 11 to 50, 51 to 100, 101 to 200, and more than 200. This was done to make the question easier to answer. It’s difficult for respondents to remember the precise number of miles they drove per week. It’s a lot easier to select one of these categories. But this means that we don’t have the exact number of miles driven. Keep that in mind as we think about measures of central tendency.
Rerun the table but this time check the box for SUMMARY STATISTICS under TABLE OPTIONS and click on the drop-down arrow next to TYPE OF CHART and select BAR CHART. Below the frequency distribution you should see the statistics that SDA computes for you and the bar chart. The summary statistics should look like Figure 7-3.
Your output will display a number of summary statistics. Three of these statistics are commonly used measures of central tendency – mode, median, and mean.
We can do this by changing the categorical values so they are the midpoint of the miles driven for each category. That would mean we would have to do the following.
How are we going to tell SDA to make these changes? By the way, this is called recoding. We’re recoding the categorical values of 1, 2, 3, 4, 5, and 6 into the values above. Follow these steps to recode in SDA.
Now tell SDA to compute the summary statistics for the recoded variable. The mean should be 60.00 this time. Notice that the mode is now 0 since that is the value for the first category and the median is 30 which is in the third category. Remember that this is based on the assumption that all the cases in each category fall at the midpoint of that category.
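For readers who want to see the arithmetic behind a recode, here is a minimal sketch in Python. It is not how SDA works internally. The midpoints are an assumption chosen to be consistent with the values mentioned above (0 for the first category, 30 in the third, and an assumed maximum of 300 miles for the open-ended category), and the category counts are invented, so the result will not match the SDA output.

```python
# Assumed midpoints for the six categories of v2196 (codes 1 through 6).
# The last category is open-ended, so 250 rests on the assumption that
# nobody drives more than 300 miles per week.
MIDPOINTS = {1: 0, 2: 5, 3: 30, 4: 75, 5: 150, 6: 250}

# Hypothetical category counts, invented for illustration (not MTF results).
counts = {1: 30, 2: 20, 3: 25, 4: 10, 5: 10, 6: 5}


def recoded_mean(counts, midpoints):
    """Mean of the recoded variable, treating every case as if it sat
    exactly at the midpoint of its category."""
    total_cases = sum(counts.values())
    total_miles = sum(midpoints[code] * n for code, n in counts.items())
    return total_miles / total_cases


print(recoded_mean(counts, MIDPOINTS))
```

Changing the assumed midpoint of the open-ended category (for example, from 250 to 300) changes the mean, which is a reminder that the result depends on the assumption.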
One of the variables in the data set is v2197 which is the number of driving tickets respondents received in the last twelve months. The response categories are 0, 1, 2, 3, and 4 or more. The only problem is the last open-ended category. Let’s assume that no one received more than six tickets. So the last category would be 4 to 6 with a midpoint of 5. Follow the procedure described in Part I and compute the mode, median, and mean. Write a paragraph discussing what these measures of central tendency mean.
The first thing to consider is the level of measurement (nominal, ordinal, interval, ratio) of your variable (see exercise 6RM).
Run FREQUENCIES for the following variables. Once you have entered the variable names in the ROW box, ask for the SUMMARY STATISTICS and a BAR CHART. For each variable write a sentence or two indicating which measure(s) of central tendency (i.e., mode or median) would be appropriate to use to describe the center of the distribution and what the values of those statistics mean. For some variables there will be more than one appropriate measure of central tendency.
Dispersion or variation refers to the degree that values in a distribution are spread out or dispersed. The most commonly used measures – range, standard deviation, variance – are only appropriate for interval and ratio level variables (see exercise 6RM). The variables in the MTF survey are entirely nominal and ordinal variables but as you have seen in this exercise we can recode some of these variables so they are ratio variables.
The range is the difference between the highest and the lowest values in the distribution. We don’t actually know the highest value for v2196 since the last category is more than 200 miles. Earlier in this exercise we assumed that the largest value was 300. If that is the case, what would the range be for the recoded variable?
The range is not a very stable measure since it depends on the two most extreme values – the highest and lowest values. These are the values most likely to change from sample to sample.
The variance is the sum of the squared deviations from the mean divided by the number of cases minus 1 and the standard deviation is just the square root of the variance. Your instructor may want to go into more detail on how to calculate the variance by hand. Look back at the summary statistics for your recode of v2196. The variance equals 5,458.65. What will the standard deviation equal?
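The definition above can be followed step by step in a few lines of Python. This is only a sketch on invented, unweighted values; SDA applies a weight variable, so its figures are computed from weighted cases.

```python
import math

# Hypothetical values, invented for illustration (not MTF data).
values = [2, 4, 4, 4, 5, 5, 7, 9]


def sample_variance(data):
    """Sum of squared deviations from the mean, divided by (number of cases - 1)."""
    m = sum(data) / len(data)
    squared_deviations = [(x - m) ** 2 for x in data]
    return sum(squared_deviations) / (len(data) - 1)


var = sample_variance(values)
sd = math.sqrt(var)  # the standard deviation is the square root of the variance

print("variance:", round(var, 2))
print("standard deviation:", round(sd, 2))
```

The same square-root step applies to any variance, including the one SDA reports for your recode.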
The variance and the standard deviation can never be negative. A value of 0 means that there is no variation or dispersion at all in the distribution. All the values are the same. The more variation there is, the larger the variance and standard deviation.
So what do the variance and the standard deviation for v2196 mean? That’s hard to answer because you don’t have anything to compare them to. But if you knew the standard deviation for both men and women you would be able to determine whether men or women have more variation. Rather than comparing the standard deviations for men and women directly, you would compute a statistic called the Coefficient of Relative Variation (CRV). CRV is equal to the standard deviation divided by the mean of the distribution. A CRV of 2 means that the standard deviation is twice the mean and a CRV of 0.5 means that the standard deviation is one-half of the mean. You would compare the CRVs for men and women to see whether men or women have more variation relative to their respective means.
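The CRV calculation is a single division, as the short Python sketch below shows. The two groups and their means and standard deviations are invented for illustration; they are not the MTF results for men and women.

```python
def crv(std_dev, mean):
    """Coefficient of Relative Variation: standard deviation divided by the mean."""
    return std_dev / mean


# Hypothetical group summaries, invented for illustration (not MTF results).
group_a = {"mean": 50.0, "std_dev": 100.0}
group_b = {"mean": 80.0, "std_dev": 40.0}

crv_a = crv(group_a["std_dev"], group_a["mean"])  # standard deviation is twice the mean
crv_b = crv(group_b["std_dev"], group_b["mean"])  # standard deviation is half the mean

print("CRV for group A:", crv_a)
print("CRV for group B:", crv_b)
```

In this made-up case group A has more variation relative to its mean than group B does.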
How do we get SDA to compute the means and standard deviations for both men and women? Click on ANALYSIS and then on COMPARISON OF MEANS in the blue horizontal bar at the top of your screen. Enter the variable for which you want to compute the mean and standard deviation in the DEPENDENT box. We’re going to use the same variable we used in Part I (v2196). Be sure to enter the recode that you used in Part I. Enter the variable (V2150) that you want to use to divide the sample into men and women in the ROW box. SDA will automatically calculate the mean number of miles driven for both men and women. To get the standard deviations, check the STD DEV box under TABLE OPTIONS. Uncheck the STD ERRORS box under TABLE OPTIONS since you won’t need this statistic. The mean will be the top number in each box of your output and the standard deviation will be right below the mean. Compute the Coefficient of Relative Variation (CRV) for both men and women and write a sentence or two discussing whether men or women have more variation.
By the way, you might also have wondered why you need both the variance and the standard deviation when the standard deviation is just the square root of the variance. You’ll just have to take my word for it that you will need both as you go further in statistics.
 Frequency distributions can be grouped or ungrouped. Think of age. We could have a distribution that lists all the ages in years of the respondents to our survey. We could also divide age into a series of categories such as under 30, 30 to 39, 40 to 49, 50 to 59, 60 to 69, and 70 and older. In a grouped frequency distribution the mode would be the most common category or categories.
 In a grouped frequency distribution the median would be the category that contains the middle value.
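As a rough illustration of both ideas, the Python sketch below finds the modal category and the category containing the middle value for a grouped age distribution. The counts are invented, and the function names are hypothetical helpers rather than anything from SDA.

```python
# Hypothetical grouped frequency distribution, invented for illustration.
categories = ["under 30", "30 to 39", "40 to 49", "50 to 59", "60 to 69", "70 and older"]
counts = [30, 15, 20, 15, 12, 8]


def modal_categories(categories, counts):
    """Category or categories with the largest count."""
    top = max(counts)
    return [c for c, n in zip(categories, counts) if n == top]


def median_category(categories, counts):
    """First category whose cumulative count reaches half of all cases.
    (If the cumulative count lands exactly on half, the middle value sits
    on the boundary between this category and the next one.)"""
    half = sum(counts) / 2
    running = 0
    for c, n in zip(categories, counts):
        running += n
        if running >= half:
            return c


print("mode:", modal_categories(categories, counts))
print("median category:", median_category(categories, counts))
```

Here the mode and the median fall in different categories, which often happens when a distribution is not symmetric.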
 We need to clear something up. Why is the total number of cases 12,169.1 and not a whole number? When you weight the cases by the weight variable, you will get a fractional number of cases. Don’t worry about this. It’s a technical issue and not important to us in this discussion.
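If you are curious why weighting produces fractions, the Python sketch below gives a simplified picture: each respondent counts as his or her weight rather than as exactly one case. The respondents and weights are invented, and this is not a description of SDA's actual procedure.

```python
# Hypothetical respondents as (answer category, weight); the weights are invented.
respondents = [(1, 0.8), (1, 1.3), (2, 1.1), (2, 0.9), (2, 1.25), (3, 0.75)]


def weighted_counts(rows):
    """Add up the weights, rather than counting rows, for each category."""
    totals = {}
    for category, weight in rows:
        totals[category] = totals.get(category, 0.0) + weight
    return totals


totals = weighted_counts(respondents)
weighted_n = sum(totals.values())

print("weighted counts:", totals)
print("weighted number of cases:", round(weighted_n, 1))
print("unweighted number of cases:", len(respondents))
```

Six actual respondents become a weighted total that is not a whole number, just as in the SDA output.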