Chapter 5
Standardized Measurement and Assessment
Defining Measurement
When we measure, we attempt to identify the dimensions,
quantity, capacity, or degree of something.
- Measurement
is formally defined as the act of measuring by assigning symbols or
numbers to something according to a specific set of rules.
Measurement can be categorized by the type of information
that is communicated by the symbols or numbers assigned to the variables of
interest. In particular, there are four levels or types of information, which are
discussed next in the chapter. They are called the four "scales of
measurement."
Scales of Measurement
1. Nominal Scale.
This is a nonquantitative measurement scale.
- It
is used to categorize, label, classify, name, or identify variables. It
classifies groups or types.
- Numbers
can be used to label the categories of a nominal variable but the numbers
serve only as markers, not as indicators of amount or quantity (e.g., if
you wanted to, you could mark the categories of the variable called
"gender" with 1=female and 2=male).
- Some
examples of nominal level variables are the country you were born in,
college major, personality type, and experimental condition (e.g., experimental
group or control group).
2. Ordinal Scale.
This level of measurement enables
one to make ordinal judgments (i.e., judgments about rank order).
- Any
variable where the levels can be ranked (but you don't know if the
distance between the levels is the same) is an ordinal variable.
- Some
examples are order of finish in a marathon, the Billboard Top 40, and
rank in class.
3. Interval Scale.
- This
scale or level of measurement has the characteristics of rank order and equal
intervals (i.e., the distance between adjacent points is the same).
It does not possess an absolute zero point.
- Some
examples are Celsius temperature, Fahrenheit temperature, and IQ scores.
- Here
is the idea of the lack of a true zero point: zero degrees Celsius does
not mean no temperature at all; it is simply the freezing point of water,
which is 32 degrees on the Fahrenheit scale. Zero degrees on either scale
does not mean zero or no temperature.
4. Ratio Scale.
This is a scale with a true zero point.
- It
also has all of the "lower level" characteristics (i.e., the key
characteristic of each of the lower level scales) of equal intervals
(interval scale), rank order (ordinal scale), and ability to mark a value
with a name (nominal scale).
- Some
examples of ratio level scales are number correct, weight, height,
response time, Kelvin temperature, and annual income.
- Here
is an example of the presence of a true zero point: If your annual income
is exactly zero dollars then you earned no annual income at all. (You can
buy absolutely nothing with zero dollars.) Zero means zero.
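The practical difference between an interval scale and a ratio scale can be checked with a little arithmetic. The Python sketch below is only an illustration: the function names and the two example temperatures are assumptions chosen for this example, not part of the chapter. It shows that differences are consistent between Celsius and Kelvin, but that a statement such as "twice as hot" depends on the scale unless the scale has a true zero point.

```python
def celsius_to_kelvin(c):
    """Convert a Celsius reading (interval scale) to Kelvin (ratio scale)."""
    return c + 273.15


def celsius_to_fahrenheit(c):
    """Convert a Celsius reading to Fahrenheit (also an interval scale)."""
    return c * 9 / 5 + 32


# Two hypothetical temperature readings in degrees Celsius.
cool_c, warm_c = 10.0, 20.0

# Equal intervals: a 10-degree Celsius gap is also a 10-kelvin gap.
diff_c = warm_c - cool_c
diff_k = celsius_to_kelvin(warm_c) - celsius_to_kelvin(cool_c)

# Ratios: only the Kelvin ratio is meaningful, because only Kelvin
# has a true zero point. The Celsius and Fahrenheit "ratios" disagree.
ratio_c = warm_c / cool_c                                                # 2.0
ratio_f = celsius_to_fahrenheit(warm_c) / celsius_to_fahrenheit(cool_c)  # about 1.36
ratio_k = celsius_to_kelvin(warm_c) / celsius_to_kelvin(cool_c)          # about 1.035

print(f"Celsius difference: {diff_c:.2f}, Kelvin difference: {diff_k:.2f}")
print(f"Ratio in Celsius: {ratio_c:.3f}")
print(f"Ratio in Fahrenheit: {ratio_f:.3f}")
print(f"Ratio in Kelvin: {ratio_k:.3f}")
```

Because the Celsius and Fahrenheit ratios differ for the same two temperatures, neither can be read as "how many times hotter"; the Kelvin ratio can.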
Assumptions Underlying Testing and Measurement
Before I list the assumptions, note the difference between
testing and assessment. According to the definitions that we use:
- Testing
is the process of measuring variables by means of devices or procedures
designed to obtain a sample of behavior and
- Assessment
is the gathering and integration of data for the purpose of making an
educational evaluation, accomplished through the use of tools such as
tests, interviews, case studies, behavioral observation, and specially
designed apparatus and measurement procedures.
In this section of the text, we also list the twelve assumptions that Cohen et
al. consider basic to testing and assessment:
1. Psychological traits and states exist.
- A trait
is a relatively enduring (i.e., long lasting) characteristic on which
people differ; a state is a less enduring or more transient
characteristic on which people differ.
- Traits
and states are actually social constructions, but they are real in the
sense that they are useful for classifying and organizing the world, they
can be used to understand and predict behavior, and they refer to
something in the world that we can measure.
2. Psychological traits and states can be quantified
and measured.
- For
nominal scales, the number is used as a marker. For the other scales, the
numbers become more and more quantitative as you move from ordinal scales
(shows ranking only) to interval scales (shows amount, but lacks a true
zero point) to ratio scales (shows amount or quantity as we usually
understand this concept in mathematics or everyday use of the term).
- Most
traits and states measured in education are taken to be at the interval
level of measurement.
3. Various approaches to measuring aspects of the same
thing can be useful.
- For
example, different tests of intelligence tap into somewhat different
aspects of the construct of intelligence.
4. Assessment can provide answers to some of life's
most momentous questions.
- It
is important that the users of assessment tools know when these tools will
provide answers to their questions.
5. Assessment can pinpoint phenomena that require
further attention or study.
- For
example, assessment may identify someone as having dyslexia or low
self-esteem or at-risk for drug use.
6. Various sources of data enrich and are part of the
assessment process.
- Information
from several sources usually should be obtained in order to make an
accurate and informed decision. For example, the idea of portfolio
assessment is useful.
7. Various sources of error are always part of the
assessment process.
- There
is no such thing as perfect measurement. All measurement has some error.
- We
defined error as the difference between a person’s true score and
that person’s observed score.
- The
two main types of error are random error (e.g., error due to
transient factors such as being sick or tired) and systematic error
(e.g., error present every time the measurement instrument is used such as
an essay exam being graded by an overly easy grader). (Later when we
discuss reliability and validity, you might note that unreliability is due
to random error and lack of validity is due to systematic error.)
8. Tests and other measurement techniques have
strengths and weaknesses.
- It
is essential that users of tests understand this so that they can use them
appropriately and intelligently.
- In
this chapter, we will be talking about the two major characteristics used
to judge tests: reliability and validity.
9. Test-related behavior predicts non-test-related
behavior.
- The
goal of testing usually is to predict behavior other than the exact
behaviors required while the exam is being taken.
- For
example, paper-and-pencil achievement tests given to children are used to
say something about their level of achievement.
- Another
paper-and-pencil test (also called a self-report test) that is popular in
counseling is the MMPI (i.e., the Minnesota Multiphasic Personality
Inventory). Clients' scores on this test are used as indicators of the
presence or absence of various mental disorders.
- The
point here is that the actual mechanics of measurement (e.g.,
self-reports, behavioral performance, projective) can vary widely and
still provide good measurement of educational, psychological, and other
types of variables.
10. Present-day behavior sampling predicts future
behavior.
- Perhaps
the most important reason for giving tests is to predict future behavior.
- Tests
provide a sample of present-day behavior. However, this "sample"
is used to predict future behavior.
- For
example, an employment test given by someone in a Personnel Office may be
used as a predictor of future work behavior.
- Another
example: the Beck Depression Inventory is used to measure depression and,
importantly, to predict test takers’ future behavior (e.g., are they a
risk to themselves?).
11. Testing and assessment can be conducted in a fair
and unbiased manner.
- This
requires careful construction of test items and testing of the items on
different types of people.
- Test
makers always have to be on the alert to make sure tests are fair and
unbiased.
- This
assumption also requires that the test be administered to those types of
people for whom it has been shown to operate properly.
12. Testing and assessment benefit society.
- Many
critical decisions are made on the basis of tests (e.g., teacher
competency, employability, presence of a psychological disorder, degree of
teacher satisfaction, degree of student satisfaction, etc.).
- Without
tests, the world would be much more unpredictable.
Identifying a Good Test or Assessment Procedure
As mentioned earlier in the chapter, good measurement is
fundamental for research. If we do not have good measurement then we cannot
have good research. That’s why it’s so important to use testing and assessment
procedures that are characterized by high reliability and high validity.
Overview of Reliability and Validity
As an introduction to reliability and validity and how they
are related, note the following:
- Reliability
refers to the consistency or stability of test scores
- Validity
refers to the accuracy of the inferences or interpretations we make from
test scores
- Reliability
is a necessary but not sufficient condition for validity (i.e., if you are
going to have validity, you must have reliability, but reliability in and
of itself is not enough to ensure validity).
- Assume
you weigh 125 pounds. If you weigh yourself five times and get 135, 134,
134, 135, 136 then your scales are reliable but not valid. The scores were
consistent but wrong! Again, you want your scales to be both reliable and
valid.
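The weighing example can be worked through numerically. The following Python sketch uses the five readings and the true weight given above; the variable names, and the choice of the mean offset and the range as summary figures, are assumptions made for illustration. A small spread among the readings reflects consistency (reliability), while a large average offset from the true weight reflects a consistent error that undermines validity.

```python
from statistics import mean

true_weight = 125                      # the person's actual weight, in pounds
readings = [135, 134, 134, 135, 136]   # five readings from the same scales

# Consistency: how far apart are the readings from one another?
spread = max(readings) - min(readings)     # 2 pounds

# Accuracy: how far is the typical reading from the true weight?
average_reading = mean(readings)           # about 134.8 pounds
bias = average_reading - true_weight       # about 9.8 pounds too heavy

print(f"Spread of readings: {spread} pounds")
print(f"Average reading: {average_reading:.1f} pounds")
print(f"Average offset from true weight: {bias:.1f} pounds")
```

The readings differ from one another by at most 2 pounds, yet they are all roughly 10 pounds too high, which is what "reliable but not valid" means in this example.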
Reliability
Reliability refers to consistency or stability. In
psychological and educational testing, it refers to the consistency or
stability of the scores that we get from a test or assessment procedure.
- Reliability
is usually determined using a correlation coefficient (it is called a reliability
coefficient in this context).
- Remember
(from chapter two) that a correlation coefficient is a measure of
relationship that varies from -1 to 0 to 1 and the farther the number is
from zero, the stronger the correlation. For example, minus one
(-1.00) indicates a perfect negative correlation, zero indicates no
correlation at all, and positive one (+1.00) indicates a perfect positive
correlation. Regarding strength, -.85 is stronger than +.55, and +.75 is
stronger than +.35. When you have a negative correlation, the variables
move in opposite directions (e.g., poor diet and life expectancy); when
you have a positive correlation, the variables move in the same direction
(e.g., education and income).
- When
looking at reliability coefficients we are interested in the values
ranging from 0 to 1; that is, we are only interested in positive
correlations. Note that zero means no reliability, and +1.00 means perfect
reliability.
- Reliability
coefficients of .70 or higher are generally considered to be acceptable
for research purposes. Reliability coefficients of .90 or higher are
needed to make decisions that have impacts on people's lives (e.g., the
clinical uses of tests).
- Reliability
is empirically determined; that is, we must check the reliability of test
scores with specific sets of people and obtain the
reliability coefficients of interest to us.
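To make the idea of a reliability coefficient concrete, here is a minimal Python sketch of the Pearson correlation coefficient. The function name `pearson_r` and the two sets of scores are hypothetical; the scores stand for one small group of people measured twice with the same test, and they are invented solely for illustration.

```python
from math import sqrt


def pearson_r(x, y):
    """Return the Pearson correlation between two equal-length score lists."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two lists of the same length (at least 2 scores)")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    # Deviations of each score from its own mean.
    dev_x = [a - mean_x for a in x]
    dev_y = [b - mean_y for b in y]
    # Sum of cross-products divided by the square root of the
    # product of the two sums of squared deviations.
    cross = sum(a * b for a, b in zip(dev_x, dev_y))
    return cross / sqrt(sum(a * a for a in dev_x) * sum(b * b for b in dev_y))


# Hypothetical scores for five people on two testing occasions.
first_scores = [10, 12, 15, 18, 20]
second_scores = [11, 12, 14, 19, 19]

reliability = pearson_r(first_scores, second_scores)
print(f"Reliability coefficient: {reliability:.2f}")

# A perfectly reversed ordering gives a perfect negative correlation.
reversed_r = pearson_r(first_scores, list(reversed(first_scores)))
print(f"Correlation with reversed scores: {reversed_r:.2f}")
```

With these invented scores the coefficient comes out above .90, because people who scored high the first time also scored high the second time.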
There are four primary ways to measure reliability.
1. The first type of reliability is called test-retest
reliability.
- This refers to the consistency of test scores over
time.
- It is measured by correlating the test scores obtained
at one point in time with the test scores obtained at a later point in time for
a group of people.
- A primary issue is identifying the appropriate time
interval between the two testing occasions.
- The longer the time interval between the two testing
occasions, the lower the reliability coefficient tends to be.
2. The second type of reliability is called equivalent
forms reliability.
- This
refers to the consistency of test scores obtained on two equivalent forms
of a test designed to measure the same thing.
- It
is measured by correlating the scores obtained by giving two forms of the
same test to a group of people.
- The
success of this method hinges on the equivalence of the two forms of the
test.
3. The third type of reliability is called internal
consistency reliability.
- It
refers to the consistency with which the items on a test measure a single
construct.
- Internal
consistency reliability only requires one administration of the test,
which makes it a very convenient form of reliability.
- One
type of internal consistency reliability is split-half reliability,
which involves splitting a test into two equivalent halves and checking
the consistency of the scores obtained from the two halves.
- The
measure of internal consistency that we emphasize in the chapter is coefficient
alpha. (It is also sometimes called Cronbach’s alpha.) The beauty of
coefficient alpha is that it is readily provided by statistical analysis
packages and it can be used when test items are quantitative and when they
are dichotomous (as in right or wrong).
- Researchers
use coefficient alpha when they want an estimate of the reliability of a
homogeneous test (i.e., a test that measures only one construct or trait)
or an estimate of the reliability of each dimension on a multidimensional
test. You will see it commonly reported in empirical research articles.
- Coefficient
alpha will be high (e.g., greater than .70) when the items on a test are
correlated with one another. But note that the number of items also
affects the strength of coefficient alpha (i.e., the more items you have
on a test, the higher coefficient alpha will be). This latter point is
important because it shows that it is possible to get a large alpha
coefficient even when the items are not very homogeneous or internally
consistent.
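Coefficient alpha can also be computed by hand for a small data set. The Python sketch below uses the usual formula, alpha = (k / (k - 1)) * (1 - sum of item variances / variance of total scores), where k is the number of items. The function names and the response data (five hypothetical test takers answering four right-or-wrong items) are assumptions made for this example. The second function uses the standardized form of alpha, which depends only on the number of items and the average inter-item correlation, to illustrate the point that adding items raises alpha even when the items are only weakly correlated.

```python
from statistics import variance


def coefficient_alpha(responses):
    """Coefficient (Cronbach's) alpha.

    responses: one row per test taker, one column per item.
    """
    k = len(responses[0])
    if k < 2:
        raise ValueError("alpha needs at least two items")
    items = list(zip(*responses))                  # regroup scores by item
    item_variances = [variance(item) for item in items]
    total_scores = [sum(row) for row in responses]
    return (k / (k - 1)) * (1 - sum(item_variances) / variance(total_scores))


def standardized_alpha(k, mean_r):
    """Standardized alpha from item count k and average inter-item correlation."""
    return (k * mean_r) / (1 + (k - 1) * mean_r)


# Hypothetical dichotomous data: 1 = right, 0 = wrong.
responses = [
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 0],
]
alpha = coefficient_alpha(responses)
print(f"Coefficient alpha: {alpha:.2f}")

# Same weak average inter-item correlation (.20), different test lengths.
short_test = standardized_alpha(5, 0.20)
long_test = standardized_alpha(20, 0.20)
print(f"5 items: {short_test:.2f}, 20 items: {long_test:.2f}")
```

In the second part, the average inter-item correlation stays at .20, yet alpha rises from below .70 with 5 items to above .80 with 20 items. A high alpha therefore does not by itself show that the items are homogeneous.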
4. The fourth and last major type of reliability is called
inter-scorer reliability.
- Inter-scorer
reliability refers to the consistency or degree of agreement between two
or more scorers, judges, or raters.
- You
could have two judges rate one set of papers. Then you would just
correlate their two sets of ratings to obtain the inter-scorer reliability
coefficient, showing the consistency of the two judges’ ratings.
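The two-judge procedure just described can be sketched in a few lines of Python. The ratings below are invented for illustration, and the helper name `correlate` is an assumption. The sketch also reports the average difference between the judges, because a correlation captures how consistently the judges order the papers, not whether one judge is generally more lenient than the other.

```python
from math import sqrt


def correlate(x, y):
    """Pearson correlation between two equal-length lists of ratings."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cross = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    return cross / sqrt(ss_x * ss_y)


# Hypothetical ratings of the same five papers by two judges.
judge_a = [70, 80, 85, 90, 95]
judge_b = [65, 78, 80, 88, 94]

inter_scorer_r = correlate(judge_a, judge_b)

# Average amount by which judge A rates higher than judge B.
mean_difference = sum(a - b for a, b in zip(judge_a, judge_b)) / len(judge_a)

print(f"Inter-scorer reliability: {inter_scorer_r:.2f}")
print(f"Judge A minus judge B, on average: {mean_difference:.1f} points")
```

Here the two judges rank the papers almost identically, so the coefficient is very high, even though judge A gives slightly higher ratings on average.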
Validity
Validity refers to the accuracy of the inferences,
interpretations, or actions made on the basis of test scores.
- Technically
speaking, it is incorrect to say that a test is valid or
invalid. It is the interpretations and actions taken based on the
test scores that are valid or invalid.
- All
of the ways of collecting validity evidence are really forms of what used
to be called construct validity. All that means is that in testing and
assessment, we are always measuring something (e.g., IQ, gender,
age, depression, self-efficacy).
Validation refers to gathering evidence supporting
some inference made on the basis of test scores.
There are three main methods of collecting validity
evidence.
1. Evidence Based on Content
Content-related evidence is based on a judgment of
the degree to which the items, tasks, or questions on a test adequately represent
the domain of interest. Expert judgment is used to provide evidence of content
validity.
To make a decision about content-related evidence, you
should try to answer these three questions:
- Do
the items appear to represent the thing you are trying to measure?
- Does
the set of items underrepresent the construct’s content (i.e., have you
excluded any important content areas or topics)?
- Do
any of the items represent something other than what you are trying to
measure (i.e., have you included any irrelevant items)?