Chapter 15
Descriptive Statistics
An overview of the field of statistics is shown in Figure 15.1. As you can
see, the field of statistics can be divided into descriptive statistics and
inferential statistics. Inferential statistics, which has further
subdivisions, is the topic of the next chapter.

This chapter is about descriptive statistics (i.e.,
the use of statistics to describe, summarize, and explain or make sense of a
given set of data).
- A data set (i.e., a set of data with the "cases" going down the rows and
the "variables" going across the columns) is shown in Table 15.1.
- Once you put your data set (such as the one in Table 15.1) into a
statistical program such as SPSS, you are ready to obtain all the
descriptive statistics that you want (i.e., the statistics that will help
you make sense of your data).
Frequency Distributions
One useful way to view the data of a variable is to
construct a frequency distribution (i.e., an arrangement in which the
frequencies, and sometimes percentages, of the occurrence of each unique data
value are shown).
- An
example is shown in Table 15.2 in the book and here for your convenience.
- When
a variable has a wide range of values, you may prefer using a grouped
frequency distribution (i.e., where the data values are grouped into
intervals and the frequencies of the intervals are shown).
- For
the above frequency distribution, one possible set of grouped intervals
would be 20,000-24,999; 25,000-29,999; 30,000-34,999; 35,000-39,999;
40,000-44,999.
- Note
that the categories developed for a grouped frequency distribution must be
mutually exclusive (the property that intervals do not overlap) and exhaustive (the property
that a set of intervals or categories covers the complete range of data
values).
- An
example of a grouped frequency distribution is shown on page 437.
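As a rough illustration of the two kinds of distribution described above, the following Python sketch tallies a frequency distribution and then a grouped one. The salary values are made up for the example (they are not the values in Table 15.2), and the function names are hypothetical; the intervals are the ones suggested above.

```python
from collections import Counter

# Hypothetical starting salaries; NOT the data from Table 15.2.
salaries = [24000, 25000, 25000, 28000, 31000, 31000, 31000, 36000, 41000, 44000]

def frequency_distribution(values):
    """Return (value, frequency, percentage) rows, one per unique value."""
    counts = Counter(values)
    n = len(values)
    return [(v, counts[v], 100 * counts[v] / n) for v in sorted(counts)]

def grouped_frequency_distribution(values, start, width, n_intervals):
    """Count values in equal-width intervals such as 20,000-24,999."""
    rows = []
    for i in range(n_intervals):
        low = start + i * width
        high = low + width - 1  # whole-dollar data, so intervals do not overlap
        freq = sum(1 for v in values if low <= v <= high)
        rows.append((low, high, freq))
    return rows

ungrouped = frequency_distribution(salaries)
grouped = grouped_frequency_distribution(salaries, start=20000, width=5000, n_intervals=5)

for low, high, freq in grouped:
    print(f"{low:,}-{high:,}: {freq}")
```

Because every salary lands in exactly one interval, the interval frequencies add up to the number of cases, which is a quick check that the intervals are mutually exclusive and exhaustive for these data.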
Graphic Representations of Data
Another excellent way to describe your data (especially for
visually oriented learners) is to construct graphical representations of the
data (i.e., pictorial representations of the data in two-dimensional space).
- Some
common graphical representations are bar graphs, histograms, line graphs,
and scatterplots.
Bar Graphs
A bar graph uses vertical bars to represent the data.
- The
height of the bars usually represents the frequencies for the categories
that sit on the X axis.
- Note
that, by tradition, the X axis is the horizontal axis and the Y axis is
the vertical axis.
- Bar
graphs are typically used for categorical variables.
- Here
is a bar graph of one of the categorical variables included in the data
set for this chapter (i.e., the data set shown on page 435).
Histograms
A histogram is a graphic that shows the frequencies
and shape that characterize a quantitative variable.
- In
statistics, we often want to see the shape of the distribution of
quantitative variables; having your computer program provide you with a
histogram is a simple way to do this.
- Here
is a histogram for a quantitative variable included in the data set for
this chapter:

Line Graphs
A line graph uses one or more lines to depict
information about one or more variables.
- A
simple line graph might be used to show a trend over time (e.g.,
with the years on the X axis and the population sizes on the Y axis).
- Here
is an example of a line graph (Figure 15.4):

- Line
graphs are used for many different purposes in research. For example, if
you turn to page 290, you will see a line graph (Figure 9.15) used in
factorial experimental designs to depict the relationship between two
categorical independent variables and the dependent variable.
- Yet
another line graph is shown on page 468 in the next chapter. This line
graph shows that the "sampling distribution of the mean" is
normally distributed.
- As
you can see in the Figures just listed, line graphs have in common their
use of one or more lines within the graph (to depict the levels or
characteristics of a variable or to depict the relationships among
variables).
Scatterplots
A scatterplot is used to depict the relationship
between two quantitative variables.
- Typically,
the independent or predictor variable is represented by the X axis (i.e.,
on the horizontal axis) and the dependent variable is represented by the Y
axis (i.e., on the vertical axis).
- Here
is an example of a scatterplot showing the relationship between two of the
quantitative variables from the data set for this chapter:
Measures of Central Tendency
Measures of central tendency provide descriptive
information about the single numerical value that is considered to be the most typical
of the values of a quantitative variable.
- Three
common measures of central tendency are the mode, the median, and the
mean.
The mode is simply the most frequently occurring
number.
The median is the center point in a set of numbers;
it is also the fiftieth percentile.
- To
get the median by hand, you first put your numbers in ascending or
descending order.
- Then
you check to see which of the following two rules applies:
- Rule One. If you have an odd number of numbers, the median is the center
number (e.g., 3 is the median for the numbers 1, 1, 3, 4, 9).
- Rule Two. If you have an even number of numbers, the median is the average
of the two innermost numbers (e.g., 2.5 is the median for the numbers
1, 2, 3, 7).
The mean is the arithmetic average (e.g., the average
of the numbers 2, 3, 3, and 4, is equal to 3).
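The three measures can be sketched in a few lines of Python. This is only an illustration: the function names are my own, the median function follows the two by-hand rules given above, and the inputs are the small examples already used in this section.

```python
from collections import Counter

def mode(values):
    """Most frequently occurring value (on a tie, the first one encountered)."""
    return Counter(values).most_common(1)[0][0]

def median(values):
    """Center point of the numbers, using the odd/even rules."""
    ordered = sorted(values)           # step 1: put the numbers in order
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:                     # Rule One: odd count, take the center number
        return ordered[mid]
    # Rule Two: even count, average the two innermost numbers
    return (ordered[mid - 1] + ordered[mid]) / 2

def mean(values):
    """Arithmetic average."""
    return sum(values) / len(values)

print(median([1, 1, 3, 4, 9]))   # 3
print(median([1, 2, 3, 7]))      # 2.5
print(mean([2, 3, 3, 4]))        # 3.0
print(mode([2, 3, 3, 4]))        # 3
```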
A Comparison of the Mean, Median, and Mode
The mean, median, and mode are affected by what is called skewness
(i.e., lack of symmetry) in the data.
- Here
is Figure 15.6, which shows a normal curve, a negatively skewed curve,
and a positively skewed curve:

- Look
at the above figure and note that when a variable is normally distributed,
the mean, median, and mode are the same number.
- When the variable is skewed to the left (i.e., negatively skewed), the
mean shifts to the left the most, the median shifts to the left the second
most, and the mode is the least affected by the presence of skew in the data.
- Therefore, when the data are negatively skewed, this happens:
mean < median < mode.
- When the variable is skewed to the right (i.e., positively skewed), the
mean is shifted to the right the most, the median is shifted to the right
the second most, and the mode is the least affected.
- Therefore, when the data are positively skewed, this happens:
mean > median > mode.
- If
you go to the end of the curve, to where it is pulled out the most, you
will see that the order goes mean, median, and mode as you “walk up the
curve” for negatively and positively skewed curves.
You can use the following two rules to provide some
information about skewness even when you cannot see a line graph of the data
(i.e., all you need is the mean and the median):
1. Rule One. If the mean is less than the median, the data are skewed to
the left.
2. Rule Two. If the mean is greater than the median, the data are skewed to
the right.
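The two rules above can be written as a small Python check. This is a sketch only: the function name and the two data sets are hypothetical, and the rule is a rough guide rather than a formal test of skewness.

```python
from statistics import mean, median

def skew_direction(values):
    """Apply the mean-versus-median rules to guess the direction of skew."""
    m, md = mean(values), median(values)
    if m < md:
        return "left (negatively skewed)"    # Rule One
    if m > md:
        return "right (positively skewed)"   # Rule Two
    return "no skew indicated"               # mean equals median

# Hypothetical data: one unusually high value pulls the mean to the right.
high_tail = [1, 2, 3, 4, 20]      # mean 6, median 3
# Hypothetical data: one unusually low value pulls the mean to the left.
low_tail = [1, 17, 18, 19, 20]    # mean 15, median 18

print(skew_direction(high_tail))
print(skew_direction(low_tail))
```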
Measures of Variability
Measures of variability tell you how "spread
out" or how much variability is present in a set of numbers. They tell you
how different your numbers tend to be. Note that measures of variability should
be reported along with measures of central tendency because they provide very
different but complementary and important information. To fully interpret one
(e.g., a mean), it is helpful to know about the other (e.g., a standard
deviation).
An easy way to get the idea of variability is to look at two
sets of data, one that is highly variable and one that is not very variable.
For example, which of these two sets of numbers appears to
be more spread out, Set A or Set B?
- Set
A. 93, 96, 98, 99, 99, 99, 100
- Set
B. 10, 29, 52, 69, 87, 92, 100
If you said Set B is more spread out, then you are right!
The numbers in Set B are more "spread out"; that is, they have more
variability.
All of the measures of variability should give us an
indication of the amount of variability in a set of data. We will discuss three
indices of variability: the range, the variance, and the standard deviation.
Range
A relatively crude indicator of variability is the range
(i.e., the difference between the highest and lowest numbers).
- For
example, the range in Set A shown above is 7, and the range in Set B shown
above is 90.
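As a quick check of these two values, here is a minimal Python sketch using Set A and Set B from above (the function name is my own; it shadows nothing in the standard library):

```python
def value_range(values):
    """Range: the highest number minus the lowest number."""
    return max(values) - min(values)

set_a = [93, 96, 98, 99, 99, 99, 100]
set_b = [10, 29, 52, 69, 87, 92, 100]

print(value_range(set_a))   # 7
print(value_range(set_b))   # 90
```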
Variance and Standard Deviation
Two commonly used indicators of variability are the variance
and the standard deviation.
- Higher
values for both of these indicators indicate a larger amount of
variability than do lower numbers.
- Zero stands for no variability at all
(e.g., for the data 3, 3, 3, 3, 3, 3, the variance and standard deviation
will equal zero).
- When
you have no variability, the numbers are a constant (i.e., the same
number).
Table 15.4 shows you how to easily calculate, by hand, the
variance and standard deviation.
- (Basically,
you set up the three columns shown, get the sum of the third column, and
then plug the relevant numbers into the variance formula.)
- The variance
tells you (exactly) the average squared deviation from the mean; that is,
it is expressed in "squared units."
- The standard
deviation is just the square root of the variance (i.e., it brings the
"squared units" back to regular units).
- The
standard deviation tells you (approximately) how far the numbers tend to
vary from the mean. (If the standard deviation is 7, then the numbers tend
to be about 7 units from the mean. If the standard deviation is 1500, then
the numbers tend to be about 1500 units from the mean.)
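The three-column procedure can be sketched in Python as follows. Two assumptions to flag: the data are made up for illustration (they are not the numbers in Table 15.4), and the sum of the third column is divided by the number of cases, n, which matches the "average" wording above; some programs divide by n - 1 instead, which gives a slightly larger value.

```python
import math

def variance_table(values):
    """Build the three columns: X, (X - mean), and (X - mean) squared."""
    m = sum(values) / len(values)
    return [(x, x - m, (x - m) ** 2) for x in values]

def variance(values):
    """Sum the third column, then divide by n (ASSUMPTION: n, not n - 1)."""
    sum_of_squares = sum(squared for _, _, squared in variance_table(values))
    return sum_of_squares / len(values)

def standard_deviation(values):
    """Square root of the variance, back in regular units."""
    return math.sqrt(variance(values))

data = [2, 4, 6, 8, 10]            # hypothetical numbers; the mean is 6
print(variance(data))              # squared deviations 16+4+0+4+16 = 40; 40/5 = 8.0
print(standard_deviation(data))    # square root of 8, about 2.83
print(variance([3, 3, 3, 3, 3, 3]))  # a constant has no variability: 0.0
```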
Virtually everyone in education is already familiar with the
normal curve (a picture of one is shown in Figure 15.7 on page 449).
If data are normally distributed, then an easy rule to apply
to the data is what we call "the 68, 95, 99.7 percent rule." That is . . .
- Approximately
68% of the cases will fall within one standard deviation of the mean.
- Approximately
95% of the cases will fall within two standard deviations of the mean.
- Approximately
99.7% of the cases will fall within three standard deviations of the mean.
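If you want to see where these three percentages come from, the sketch below computes the area under a normal curve within one, two, and three standard deviations of the mean. It uses `NormalDist` from Python's standard `statistics` module (Python 3.8 or later); the function name is hypothetical.

```python
from statistics import NormalDist

def proportion_within(k, mean=0.0, sd=1.0):
    """Proportion of a normal distribution within k SDs of the mean."""
    dist = NormalDist(mu=mean, sigma=sd)
    return dist.cdf(mean + k * sd) - dist.cdf(mean - k * sd)

for k in (1, 2, 3):
    print(f"within {k} SD: {100 * proportion_within(k):.1f}%")
```

The result does not depend on which mean and standard deviation you pass in (e.g., 100 and 15 for IQ), which is why the rule applies to any normally distributed variable.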
Measures of Relative Standing
Measures of relative standing are used to provide
information about where a particular score falls in relation to the other
scores in a distribution of data. Two commonly used measures of relative
standing are percentile ranks and Z-scores.
Here is Figure 15.8 which shows these and some additional
types of standard scores. You can determine the mean of the type of standard
scores below by simply looking under Mean. You can determine the standard
deviation by looking at how much the scores increase as you move from the mean
to 1 SD.
- Z-scores
have a mean of 0 and a standard deviation of 1. Therefore, if you
convert any set of scores (e.g., the set of student grades on a test) to
z-scores, then that new set will have a mean of zero and a standard
deviation of one.
- IQ
has a mean of 100 and a standard deviation of 15.
- SAT
has a mean of 500 and a standard deviation of 100.
- Note:
percentile ranks are a different type of score; because they only have
ordinal measurement properties, the concept of standard deviation is not
relevant.

Percentile Ranks
A percentile rank tells you the percentage of scores
in a reference group (i.e., in the norming group) that fall below a particular
raw score.
- For
example, if your percentile rank is 93 then you know that 93 percent of
the scores in the reference group fall below your score.
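A percentile rank, as defined here, can be sketched in Python like this. The reference group below is invented (the whole numbers 1 through 100) simply so that the arithmetic is easy to follow; real norming groups are much larger, and published tests may handle tied scores differently.

```python
def percentile_rank(raw_score, reference_scores):
    """Percentage of reference-group scores that fall below raw_score."""
    below = sum(1 for s in reference_scores if s < raw_score)
    return 100 * below / len(reference_scores)

reference_group = list(range(1, 101))   # hypothetical norming group: 1, 2, ..., 100
print(percentile_rank(94, reference_group))   # 93 of the 100 scores are below 94: 93.0
```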
Z-Scores
A z-score tells you how many standard deviations (SD)
a raw score falls from the mean.
- A z-score
of 2 says a score falls two standard deviations above the mean.
- A z-score
of -3.5 says the score falls three and a half standard deviations below
the mean.
To transform a raw score into z-score units, just use
the following formula:
Z-score = (Raw score - Mean) / Standard Deviation
For example, you know that the mean for IQ scores is 100 and
the standard deviation for IQ scores is 15 (because we told you this in the
book and because you can see it by examining Figure 15.8).
Therefore, if your IQ is 115, you can get your z-score...
Z-score = (115 - 100) / 15 = 15 / 15 = 1
An IQ of 115 falls one standard deviation above the mean.
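The same calculation can be sketched in Python. The first function is the z-score formula above; the second converts a whole set of scores and shows that the resulting z-scores have a mean of 0 and a standard deviation of 1. The test grades are hypothetical, and the sketch assumes the standard deviation is computed by dividing by n (`pstdev` in Python's `statistics` module).

```python
from statistics import mean, pstdev

def z_score(raw_score, group_mean, group_sd):
    """Number of standard deviations a raw score falls from the mean."""
    return (raw_score - group_mean) / group_sd

def to_z_scores(scores):
    """Convert every score in a set to a z-score."""
    m = mean(scores)
    sd = pstdev(scores)   # ASSUMPTION: SD computed by dividing by n
    return [z_score(x, m, sd) for x in scores]

iq_z = z_score(115, 100, 15)
print(iq_z)   # 1.0

grades = [70, 80, 90, 100]          # hypothetical test grades
grade_z = to_z_scores(grades)
print(mean(grade_z), pstdev(grade_z))   # approximately 0 and 1
```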
Note that once you have a set of z-scores, you can
convert to any other scale by using this formula: New score = Z-score(SD of
new scale) + mean of the new scale.
- For
example, let's convert a z-score of three to an IQ score.
- New
score = 3(15) + 100 (remember, the mean of IQ scores is 100 and the standard
deviation of IQ scores is 15). Therefore, the new score (i.e., the IQ
score converted from the z-score of 3 using the formula just provided)
is equal to 145 (3 times 15 is 45, and when 100 is added you get 145).
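The conversion formula is easy to express as a Python function. This is a minimal sketch with a hypothetical function name; the IQ and SAT means and standard deviations are the ones listed earlier in this section.

```python
def from_z(z, new_mean, new_sd):
    """New score = z-score * (SD of new scale) + mean of the new scale."""
    return z * new_sd + new_mean

iq = from_z(3, new_mean=100, new_sd=15)
sat = from_z(1, new_mean=500, new_sd=100)
print(iq)    # 3 * 15 + 100 = 145
print(sat)   # 1 * 100 + 500 = 600
```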
Examining Relationships Among Variables
We have been talking about relationships among variables
throughout your textbook. For example, we have already talked about correlation
(e.g., see Figure 2.2 on page 44), partial correlation (e.g., see page
341), analysis of variance, which is used for factorial designs (e.g.,
see pages 286-291), and analysis of covariance (e.g., see pages 274-275
and pages 341-342).
At this point in this chapter on descriptive statistics, I
introduce two additional techniques that you also can use for examining
relationships among variables: contingency tables and regression analysis.
Contingency Tables
When all of your variables are categorical, you can use
contingency tables to see if your variables are related.
- A
contingency table is a table displaying information in cells formed by the
intersection of two or more categorical variables.
- An
example is shown in Table 15.6.
When interpreting a contingency table, remember to
use the following two rules:
- Rule
One. If the percentages are calculated down the columns, compare across
the rows.
- Rule
Two. If the percentages are calculated across the rows, compare down the
columns.
- When
you follow these rules you will be comparing the appropriate rates (a rate
is the percentage of people in a group who have a specific
characteristic).
- When
you listen to the local and national news, you will often hear the
announcers compare rates.
- The
failure of some researchers to follow the two rules just provided has
resulted in misleading statements about how categorical variables are
related; so be careful.
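To make Rule One concrete, the sketch below builds a small contingency table in Python, calculates the percentages down the columns, and then compares across a row. The data and category names are invented for illustration (they are not from Table 15.6), and the function name is hypothetical.

```python
from collections import Counter

# Hypothetical cases as (row category, column category) pairs; NOT Table 15.6.
cases = (
    [("pass", "tutored")] * 30 + [("fail", "tutored")] * 10
    + [("pass", "not tutored")] * 20 + [("fail", "not tutored")] * 20
)

def column_percentages(pairs):
    """Percentage of each column's cases that fall in each row category."""
    cell_counts = Counter(pairs)
    column_totals = Counter(col for _, col in pairs)
    return {
        (row, col): 100 * count / column_totals[col]
        for (row, col), count in cell_counts.items()
    }

pct = column_percentages(cases)

# Percentages were calculated down the columns, so compare across the row:
print(pct[("pass", "tutored")])       # 30 of 40 tutored students passed: 75.0
print(pct[("pass", "not tutored")])   # 20 of 40 untutored students passed: 50.0
```

Each percentage is a rate (the pass rate within a group), so comparing 75 percent with 50 percent across the "pass" row is the appropriate comparison for these made-up data.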
Regression Analysis
Regression analysis is a set of statistical
procedures used to explain or predict the values of a quantitative dependent
variable based on the values of one or more independent variables.
- In simple
regression, there is one quantitative dependent variable and one
independent variable.
- In multiple
regression, there is one quantitative dependent variable and two or
more independent variables.
On pages 455-459, I show you the components of the
regression equations (e.g., the Y-intercept and the regression
coefficients). Here are the important definitions:
- Regression
equation: the equation that defines the regression line (see Figure 15.9 in
the book and below).

- Here
is the simple regression equation showing the relationship between
starting salary (Y or your dependent variable) and GPA (X or your
independent variable) (two of the variables in the data set included with
this chapter on page 435).
Ŷ = 9,234.56 + 7,638.85 (X)
- The
9,234.56 is the Y intercept (look at the above regression line; it crosses
the Y axis a little below $10,000; specifically, it crosses the Y axis at
$9,234.56).
- The
7,638.85 is the simple regression coefficient, which tells you the average
amount of increase in starting salary that occurs when GPA increases by
one unit. (It is also the slope or the rise over the run).
- Now,
you can plug in a value for X (i.e., GPA) and easily get the
predicted starting salary.
- If
you put in a 3.00 for GPA in the above equation and solve it, you will see
that the predicted starting salary is $32,151.11.
- Now
plug in another number within the range of the data (how about a 3.5) and
see what the predicted starting salary is. (Check your work: it is
$35,970.54.)
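Plugging values into the regression equation can also be done with a few lines of Python. The intercept and coefficient are the ones from the equation above; the function name is hypothetical, and the sketch only evaluates the equation (it does not estimate it from the data).

```python
Y_INTERCEPT = 9234.56        # where the regression line crosses the Y axis
GPA_COEFFICIENT = 7638.85    # predicted change in salary per one-unit change in GPA

def predicted_starting_salary(gpa):
    """Evaluate Y-hat = 9,234.56 + 7,638.85(X) for a given GPA (X)."""
    return Y_INTERCEPT + GPA_COEFFICIENT * gpa

for gpa in (3.0, 3.5):
    print(f"GPA {gpa:.2f}: predicted starting salary ${predicted_starting_salary(gpa):,.2f}")

# Moving up one GPA unit changes the prediction by the regression coefficient.
print(predicted_starting_salary(4.0) - predicted_starting_salary(3.0))
```

As in the text, it is safest to plug in GPA values that fall within the range of the data used to build the equation.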
On pages 458-459, I show a multiple regression equation with
two independent variables.
- The
main difference is that in multiple regression, the regression coefficient
is now called a partial regression coefficient, and this
coefficient provides the predicted change in the dependent variable given
a one unit change in the independent variable, controlling for the
other independent variables in the equation. In other words, you can
use multiple regression to control for other variables (i.e., for what we
called in earlier chapters statistical control).