Reliability and Comparability of TOEFL iBT Scores
TOEFL iBT Research Insight Series 1, Volume 3
Foreword

We are very excited to announce the TOEFL iBT™ Research Insight Series, a bimonthly publication to make important research on the TOEFL iBT available to all test score users in a user-friendly format.

Since the 1970s, the TOEFL test has had a rigorous, productive and far-ranging research program. But why should test score users care about the research base for a test? In short, because it is only through a rigorous program of research that a testing company can demonstrate its forward-looking vision and substantiate claims about what test takers know or can do based on their test scores. This is why ETS has made the establishment of a strong research base a consistent feature of the evolution of the TOEFL test.

The TOEFL iBT test is the most widely accepted English language assessment, used for admissions purposes in more than 130 countries, including the United Kingdom, Canada, Australia, New Zealand and the United States. Since its initial launch in 1964, the TOEFL test has undergone several major revisions motivated by advances in theories of language ability and changes in English teaching practices. The most recent revision, the TOEFL iBT test, was launched in 2005. It contains a number of innovative design features, including the use of integrated tasks that engage multiple language skills to simulate language use in academic settings, and the use of test materials that reflect the reading and listening demands of real-world academic environments.

At ETS we understand that you use TOEFL iBT test scores to help make important decisions about your students, and we would like to keep you up to date about the research results that assure the quality of these scores. Through the TOEFL iBT Research Insight Series we wish both to communicate to the institutions and English teachers who use TOEFL iBT test scores the strong research and development base that underlies the TOEFL iBT test, and to demonstrate our strong, continued commitment to research. We hope you will find this series relevant, informative and useful. We welcome your comments and suggestions about how to make it a better resource for you.

Ida Lawrence
Senior Vice President
Research & Development Division
Educational Testing Service

Preface

The TOEFL test is developed and supported by a world-class team of test developers, educational measurement specialists, statisticians and researchers. Our test developers have advanced degrees in such fields as English, language education and linguistics. They also possess extensive international experience, having taught English in Africa, Asia, Europe, North America and South America. Our research, measurement and statistics team includes some of the world's most distinguished scientists and internationally recognized leaders in diverse areas such as test validity, language learning and testing, and educational measurement and statistics. To date, more than 150 peer-reviewed TOEFL research reports, technical reports and monographs have been published by ETS, many of which have also appeared in academic journals and book volumes.
In addition to the 20-30 TOEFL-related research projects conducted by ETS Research & Development staff each year, the TOEFL Committee of Examiners (COE), comprised of language learning and testing experts from the academic community, funds an annual program of TOEFL research by external researchers from all over the world, including preeminent researchers from Australia, the UK, the US, Canada and Japan.

In Series One of the TOEFL iBT Research Insight Series, we provide a comprehensive account of the essential concepts, procedures and research results that assure the quality of scores on the TOEFL iBT test. The six issues in this Series will cover the following topics:

Issue 1: TOEFL iBT Test Framework and Development. The TOEFL iBT test is described along with the processes used to develop test questions and forms. These processes include rigorous review of test materials, with special attention to fairness concerns. Item pretesting, tryouts and scoring procedures are also detailed.

Issue 2: TOEFL Research. The TOEFL Program has supported rigorous research to maintain and improve test quality. Over 150 reports and monographs are catalogued on the TOEFL website. A brief overview of some recent research on fairness and automated scoring is presented here.

Issue 3: Reliability and Comparability of Test Scores. Given that hundreds of thousands of test takers take the TOEFL iBT test each year, many different test forms are developed and administered. Procedures to achieve score comparability on different forms are described in this section.

Issue 4: Validity Evidence Supporting Test Score Interpretation and Use. The many types of evidence supporting the proposed interpretation and use of test scores as a measure of English-language proficiency in academic contexts are discussed.

Issue 5: Information for Score Users, Teachers and Learners. Materials and guidelines are available to aid in the interpretation and appropriate use of test scores, as well as resources for teachers and learners that support English-language instruction and test preparation.

Issue 6: TOEFL Program History. A brief overview of the history and governance of the TOEFL Program is presented. The evolution of the TOEFL test constructs and contents from 1964 to the present is summarized.

Future series will feature summaries of recent studies on topics of interest to our score users, such as "what TOEFL iBT test scores tell us about how examinees perform in academic settings" and "how score users perceive and use TOEFL iBT test scores."

The close collaboration with TOEFL iBT score users, English language learning and teaching experts and university professors in the redesign of the TOEFL iBT test has contributed to its great success. Therefore, through this publication, we hope to foster an ever stronger connection with our score users by sharing the rigorous measurement and research base and solid test development that continue to ensure the quality of TOEFL iBT scores and meet the needs of score users.

Xiaoming Xi
Senior Research Scientist
Research & Development Division
Educational Testing Service

Contributors

The primary authors of this section are Mary Enright and Eileen Tyson. The following individuals also contributed to this section by providing their careful review as well as editorial suggestions (in alphabetical order):
Cris Breining
Rosalie Szabo
Xiaofei Tang
Xiaoming Xi

Reliability and Comparability of TOEFL iBT™ Scores

ETS has always been committed to the quality of its test scores. As an ETS assessment program, TOEFL® strives to ensure score reliability and comparability through strict adherence to guidelines and practices established for the development and operational implementation of its products and services. Evidence of score reliability and comparability is important because it suggests that test scores will have the same meaning across test forms. ETS strives to ensure that the test scores of the TOEFL Internet-based test (iBT) are reliable and comparable by:

• implementing and adhering to standardized administration and test security procedures
• using detailed test specifications to guide test development
• monitoring score reliability and generalizability
• employing an appropriate scale for reporting scores
• using equating and other means to maintain comparable scores across test forms

Standardized Administration and Security Procedures

In large-scale tests such as the TOEFL iBT test, a critical component of ensuring score validity and fairness is implementing and adhering to standardized procedures for test administration and related test security. Standardized test administrations and test security measures ensure the TOEFL iBT test is given under comparable conditions to all test takers, no matter where or when they take the test. As a result, the test scores reflect the test takers' abilities and are not unduly influenced by other, unrelated factors.

TOEFL iBT operational procedures for maintaining standardized conditions for test administration and security follow the requirements laid out in the ETS Standards for Quality and Fairness (ETS, 2002). The TOEFL program also provides extensive material to test administrators and test takers so that violations of such procedures can be reported to ETS for investigation. The major procedures are:

• certifying all test center facilities and equipment for administering TOEFL iBT tests, including hardware, software and Internet connections
• training test center staff on how to handle a test administration session, including test taker identity verification, test launch and incident and irregularity management
• providing online practice tests and other supporting information for test takers to become familiar with the test and test-taking conditions, including section sequence, test duration, use of headphones and microphones and navigating within and across test sections (information for test takers is available at www.toeflgoanywhere.org)
• using technology to control test delivery and transmission of test-related data to ensure the security of test contents and test results
• informing test takers about how to report fraudulent behavior in a test session

Test Specifications

Another important way to help ensure the comparability of scores across test forms is to create detailed specifications to guide the test development process. Test specifications are a detailed operational definition of test characteristics that form an exact content sampling plan. For example, test specifications may define the kind of content covered by the test, the number of test questions, the format of test questions and responses, and the response options. The Standards for Educational and Psychological Testing (AERA, APA, NCME, 1999, p. 43) provides general guidelines for developing and evaluating test specifications. When multiple forms of a test are developed according to well-defined test specifications, the test characteristics can be expected to remain very similar across the test forms and across test administrations.

The TOEFL iBT test offers multiple test administrations each year, so it is critical to ensure that the test forms used for these administrations are similar in content. This is accomplished by following the test specifications for the TOEFL iBT test in the test development process. Details of how these specifications were developed using Evidence Centered Design can be found in Pearlman (2008).
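To make the idea of a content sampling plan concrete, the sketch below shows how a test blueprint might be represented in code. The section item counts and content domains are invented for illustration; they are not the actual TOEFL iBT specifications, which are documented in Pearlman (2008).

```python
# A minimal, hypothetical test-specification ("blueprint") sketch.
# The section item counts and content domains below are invented for
# illustration only; they are NOT the actual TOEFL iBT specifications.
from dataclasses import dataclass

@dataclass
class SectionSpec:
    name: str                # test section
    n_items: int             # number of questions or tasks required on a form
    item_format: str         # selected response vs. constructed response
    content_domains: tuple   # kinds of content every form must sample

blueprint = [
    SectionSpec("Reading", 40, "selected response",
                ("factual information", "inference", "vocabulary")),
    SectionSpec("Listening", 34, "selected response",
                ("main idea", "detail", "speaker attitude")),
    SectionSpec("Speaking", 6, "constructed response",
                ("independent task", "integrated task")),
    SectionSpec("Writing", 2, "constructed response",
                ("independent essay", "integrated essay")),
]

def conforms(form_counts: dict, spec: list) -> bool:
    """Check that a draft form has the item counts the blueprint requires."""
    return all(form_counts.get(s.name) == s.n_items for s in spec)

# A draft form is accepted only if it matches the blueprint exactly.
draft_form = {"Reading": 40, "Listening": 34, "Speaking": 6, "Writing": 2}
print(conforms(draft_form, blueprint))  # True
```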
Score Reliability and Generalizability

An important measure of the quality of a test is how reliable the test scores are. Reliability is important because it indicates the replicability of the test scores if a test could be given twice or more to the same group of people, or if two tests constructed in the same manner could be given to the same group of people. The more reliable the scores are, the more confidence score users have in using the scores for making important decisions about test takers. Testing, like other measurement events, is subject to the influence of many factors that are not relevant to the ability being measured. Such irrelevant factors contribute to what is called "measurement error," which in turn determines how reliable test scores are.

In educational measurement, score reliability is a statistical index to quantify and evaluate the consistency of test scores. In essence, "the concern of reliability is to quantify the precision of test scores and other measurements" (Haertel, 2006, p. 65). A test score is a measurement outcome of a person's performance on a test, and as such, just like any other measurement, the score contains some degree of measurement error due to many factors. In fact, a person's real or true ability can never be observed in a test. A well-developed test is expected to yield a test score that reflects a person's real ability as much as possible and keeps measurement error to a minimum. This is what "precision of test scores" really means. Precision of test scores is also expressed as a measurement index called the standard error of measurement (SEM). A person's true ability score is never obtainable and is therefore estimated using statistical methods. Imagine that a person could take two tests that are constructed to the same specifications and are essentially equivalent. The person would receive two test scores, but neither of the two scores would be the person's exact true ability score; however, both scores should be around his or her true ability score. The SEM is a measure that defines a score range in which one's true ability score lies with a certain level of probability. Obviously, the smaller the SEM, the more precise the test scores.

Reliability is expressed as a statistical index whose value ranges from 0 (not at all reliable) to 1 (perfectly reliable). Such an index, called a coefficient, can be estimated in different ways depending on the intended use of the coefficient and the underlying theoretical frameworks. For the TOEFL iBT test, reliability estimation for the Reading and Listening sections, which contain selected response questions, is carried out using a method based on item response theory (IRT) (Lord, 1980). For the Speaking and Writing sections, which contain constructed response tasks, generalizability theory (G-theory) is used (Brennan, 1983).
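As a concrete illustration of how the SEM described above defines a range around an observed score, the short sketch below computes confidence bands under the usual assumption that measurement error is approximately normal. The observed score and SEM value are arbitrary illustrative numbers, not figures from an actual TOEFL iBT administration.

```python
# Illustrative only: an arbitrary observed score and SEM, assuming
# approximately normally distributed measurement error.
observed_score = 22.0   # hypothetical scaled section score
sem = 3.2               # hypothetical standard error of measurement

# The true score falls within 1 SEM of the observed score about 68% of
# the time, and within 1.96 SEMs about 95% of the time.
band_68 = (observed_score - sem, observed_score + sem)
band_95 = (observed_score - 1.96 * sem, observed_score + 1.96 * sem)

print(f"68% band: {band_68[0]:.1f} to {band_68[1]:.1f}")   # 18.8 to 25.2
print(f"95% band: {band_95[0]:.1f} to {band_95[1]:.1f}")   # 15.7 to 28.3
```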
Generally speaking, two steps are taken in producing IRT-based reliability estimates for the Reading and Listening scores. The first step estimates the SEM for each scaled score point after a new test form is equated; such an SEM specific to each score point is called a conditional SEM, or CSEM. Item parameters and personal ability estimates from an IRT model are used in calculating the CSEM values. In the second step, the CSEM values across all the scaled score points are averaged, and this averaged CSEM value is used in the calculation of the reliability estimate for the overall Reading or Listening test by substituting this overall SEM (averaged CSEM) for the SEM in the formula below:

SEM = σx √(1 − rxx)

where σx is the standard deviation of the scaled scores and rxx is the reliability estimate; solving for rxx then yields the estimated reliability. The SEM is on the same metric as the scaled scores that are reported.

Generalizability theory has seen extensive application in measuring the score reliability of tests with constructed response tasks, in which the test taker generates answers to test questions instead of selecting from a list of possible responses. Using an analysis of variance, generalizability theory separates different sources of variance (test taker ability, rater effects and task effects) so that the effect of each source of variance (called a facet) can be evaluated. The index for score reliability in this framework is a generalizability coefficient (G-coefficient), which is also on a scale of 0 to 1, with a value closer to 1 being desired. A person-by-task generalizability model is used for the Speaking scores, whereas a nested model is used for Writing (ratings nested within tasks, crossed with persons) (see Lee & Kantor, 2005, for details).

The above-mentioned reliability and generalizability analyses are conducted for every test form. Table 1 presents the average section and total score reliability estimates and standard errors of measurement based on operational data from 2007.

Table 1. Reliabilities and Standard Errors of Measurement

  Section     Score Scale   Reliability Estimate   SEM
  Reading     0-30          0.85                   3.35
  Listening   0-30          0.85                   3.20
  Speaking    0-30          0.88                   1.62
  Writing     0-30          0.74                   2.76
  Total       0-120         0.94                   5.64

The reliability estimates for the Reading, Listening, Speaking and Total scores are relatively high, while the reliability of the Writing score is somewhat lower. This is a typical result for writing measures composed of only two tasks (Breland, Bridgeman, & Fowles, 1999) and reflects one well-documented limitation of performance testing: reliability estimates for measures composed of a small number of time-consuming tasks are often lower than estimates for measures composed of many shorter, less time-consuming tasks. However, the construct of academic writing as defined for the TOEFL iBT test required the production of extended writing samples (Cumming, Kantor, Powers, Santos, & Taylor, 2000). One implication of these results is that, for making high-stakes decisions such as admissions to college or graduate school, the Total score provides the best information, both because it reflects all four language skills and because it is the most reliable.
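The relationship between the averaged CSEM, the scaled-score standard deviation and the reliability estimate in the formula above can be made concrete with a few lines of code. The sketch below simply rearranges that formula; the CSEM values and the standard deviation are invented for illustration and are not operational TOEFL iBT statistics.

```python
import math

# Hypothetical conditional SEMs (one per scaled score point) and a
# hypothetical scaled-score standard deviation; illustrative values only.
csem_by_score_point = [3.6, 3.4, 3.2, 3.1, 3.2, 3.4, 3.7]
sigma_x = 8.5  # standard deviation of the reported scaled scores (invented)

# Step 2 described above: average the conditional SEMs ...
overall_sem = sum(csem_by_score_point) / len(csem_by_score_point)

# ... then solve SEM = sigma_x * sqrt(1 - r_xx) for the reliability r_xx.
r_xx = 1.0 - (overall_sem / sigma_x) ** 2

print(f"averaged CSEM (overall SEM): {overall_sem:.2f}")  # 3.37
print(f"estimated reliability r_xx:  {r_xx:.2f}")         # 0.84

# Sanity check: plugging the reliability back in reproduces the overall SEM.
assert math.isclose(sigma_x * math.sqrt(1.0 - r_xx), overall_sem)
```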
Nevertheless, there are circumstances under which decision makers may want to examine a test taker's profile of section scores, for example when the demands of the curriculum or a need for additional language training are being considered. Also note that ETS encourages score users to consider a number of other factors when making admissions decisions, including grade point average, scores on other admissions exams, teacher recommendations and interviews.

The reliability estimates in Table 1 are the ones used for the TOEFL iBT operational test scores. Other types of reliability estimates also exist that take into account other sources of variability, such as differences in test forms or changes in examinees' performance from day to day. Alternate form reliability, for example, is calculated from examinees' scores on two different forms of a test. This requires examinees to take two different test forms, something only a few examinees would volunteer to do. But some examinees do take the test twice, for reasons of their own, within a period too short for much learning to occur. An analysis of the scores of these repeat test takers on the two test forms provides an approximation of alternate form reliability. Zhang (2008) compared the test scores of more than 12,000 examinees who were identified as having taken two TOEFL iBT tests within a period of one month. The correlations of their scores on the two test forms were 0.77 for the Listening and Writing sections, 0.78 for Reading, 0.84 for Speaking, and 0.91 for the total test score. Because these measures of reliability take into account additional sources of variability, they are typically lower than internal consistency measures. Nevertheless, they indicate a high degree of consistency in the rank ordering of the scores of these test repeaters.

Scaling TOEFL iBT Scores

Reported test scores are derived from performance on a test through a statistical process called "scaling." In a simple example, a student who answered 55 questions correctly out of 60 questions on a test would receive a score of 55 if each correct answer was worth one score point. This is the number-correct score, also called a raw score. For standardized tests, raw score scales are almost never used directly as reported scores. This is because the raw score is directly dependent on the specific items on a particular test form; this particular form may not have exactly the same difficulty level as other forms of the same test. As a result, the same raw scores from two different test forms may not represent the same level of performance or ability. Instead, the raw score scale is transformed to a reporting score scale, which produces a scaled score. A carefully developed score scale, together with an equating plan (see the following section), is important in maintaining score comparability and meaningful interpretation of scores across test forms and over time.

The scales for the measures on the TOEFL iBT test were established such that the same scale range (0-30) was chosen for all four sections to indicate that the sections should be viewed as equally important in measuring the construct of academic language ability. The total score is the sum of the four section scores. The decision to use a 0-to-30 scale was based primarily on the need to provide reasonable raw-to-scaled-score mappings for each of the sections, which differed in their maximum raw scores. The maximum number of raw score points on the four sections of the form used in the field study ranged from 20 for Writing to 44 for Reading.
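The sketch below illustrates the general idea of mapping raw scores onto a fixed 0-30 reporting scale with a lookup table. The conversion shown is a simple linear rounding invented for illustration; the actual TOEFL iBT conversions are derived form by form through the equating procedures described in the next section.

```python
# Illustrative only: a made-up raw-to-scaled conversion for a section whose
# raw scores run from 0 to 44 (the Reading maximum on the field-study form).
# Real conversion tables are produced by equating, not by this linear rule.
MAX_RAW = 44
SCALE_MAX = 30

conversion_table = {raw: round(raw * SCALE_MAX / MAX_RAW)
                    for raw in range(MAX_RAW + 1)}

def scaled_score(raw: int) -> int:
    """Look up the reported 0-30 scaled score for a raw score."""
    if raw not in conversion_table:
        raise ValueError(f"raw score must be between 0 and {MAX_RAW}")
    return conversion_table[raw]

print(scaled_score(44))  # 30
print(scaled_score(40))  # 27 (40 * 30 / 44 is about 27.3, which rounds to 27)
print(scaled_score(22))  # 15
```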
Maintaining Score Comparability across Test Forms

For testing programs that have multiple administrations with different test forms, it is necessary to maintain score comparability across test forms so that scores can be compared meaningfully. Score comparability across test forms is typically maintained using statistical processes called equating. Equating is a statistical process that is used to adjust scores on test forms so that scores on the forms can be used interchangeably. "Equating adjusts for differences in difficulty among forms that are built to be similar in difficulty and content" (Kolen & Brennan, 1995, p. 2).

For tests containing selected response items, equating is routinely carried out to produce reported scores for a new test form, as is the case for the TOEFL iBT Reading and Listening sections. Such an equating process, however, is not practical or feasible for performance-based tests that contain only a few tasks. For example, a writing test may have only one or two writing tasks that are scored by human raters using a scoring rubric (rules or guidelines for assigning scores to constructed-response questions or performance tasks). Many equating procedures require repeating previously administered items in the current administration, which may not be feasible if a writing test has only one or two tasks that are easily remembered and shared with other examinees. Threats to score comparability on such performance tests result both from differences in test form difficulty and from inconsistency in human raters' scoring. In the absence of any feasible equating procedure, various other statistical analyses can be used to evaluate and control the quality of scores (Baldwin, Fowles, & Livingston, 2005).

Equating TOEFL iBT Reading and Listening Sections

A non-equivalent anchor test design (also called a common-item non-equivalent groups design) is used as the equating data collection design for the TOEFL iBT Reading and Listening sections. This means that each new form of the TOEFL iBT test contains an anchor block, a set of items that have been pretested in previously administered forms and have already been analyzed. This design enables an adjustment for possible ability differences between the group in which the items were pretested in an old test form and the group in which the items are used in the current new test form. This is possible because the same items are given to two groups of candidates, so the differences in item statistics between the two groups reflect the ability differences between the groups. Such differences are adjusted for during the equating process.

A statistical model within the item response theory (IRT) framework is used to analyze the items and candidates' abilities. From the IRT-based analysis, item parameters and candidates' ability estimates are derived. The derived item parameters and ability values are placed on the TOEFL iBT IRT scales that were already established, so items and candidates from different test administrations can be directly compared.

The IRT true score equating method (see Kolen & Brennan, 1995, for a detailed description) is used to establish the relationship between scores on the current new form and scores on the base form. After equating, a raw score on the new form is adjusted to be equivalent to a raw score on the base form. Because each raw score on the base form already corresponds to a scaled score, each raw score on the new form is now related to a scaled score between 0 and 30. The scaled scores for any test forms are directly comparable, as they now indicate the same levels of performance and ability.
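For readers who want a concrete picture of how an anchor block lets two non-equivalent groups be linked, the sketch below works through a chained linear equating on made-up summary statistics. This is a deliberately simplified stand-in, not the operational IRT true score equating used for the TOEFL iBT test; all means and standard deviations are invented for illustration.

```python
# A simplified chained linear equating through a common anchor block.
# All summary statistics are invented; the operational TOEFL iBT method is
# IRT true score equating, which models individual items rather than
# form-level means and standard deviations.

# New form X and the anchor A, both taken by the new group of test takers.
mean_x_new, sd_x_new = 28.0, 7.0
mean_a_new, sd_a_new = 11.5, 3.0

# Base (old) form Y and the same anchor A, taken by the earlier group.
mean_a_old, sd_a_old = 12.2, 3.2
mean_y_old, sd_y_old = 30.5, 7.5

def x_to_anchor(x: float) -> float:
    """Linearly link a new-form raw score to the anchor metric (new group)."""
    return mean_a_new + (sd_a_new / sd_x_new) * (x - mean_x_new)

def anchor_to_y(a: float) -> float:
    """Linearly link an anchor score to the base-form metric (old group)."""
    return mean_y_old + (sd_y_old / sd_a_old) * (a - mean_a_old)

def equate_x_to_y(x: float) -> float:
    """Chain the two links: new-form raw score -> equivalent base-form score."""
    return anchor_to_y(x_to_anchor(x))

# A raw score of 35 on the new form is equated to about 35.9 on the base form.
print(round(equate_x_to_y(35.0), 1))  # 35.9
```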
Comparability of TOEFL iBT Speaking and Writing Sections across Forms

Speaking and Writing scores are not equated statistically because of technical and test security constraints. The two test sections have six and two constructed response tasks, respectively. Because these few tasks are prone to memorization, test security concerns preclude repeating tasks on every new test form, as equating would require. To minimize differences in test form difficulty and potential inconsistency due to human scoring, a number of non-statistical procedures are put in place in test development and scoring: careful test development effort and rigorous scoring standards are used to maintain score quality.

Detailed task specifications guide the development of parallel tasks. Then, small-scale tryouts are used to screen out poorly performing tasks. Training is given to raters using well-defined and articulated scoring rubrics. In addition, raters are certified before they can begin scoring work, and, prior to each scoring session, they must pass a calibration test. Furthermore, during all scoring sessions, raters are monitored and supervised by chief raters.

Several statistical methods are also implemented to monitor and evaluate the performance of the tasks in each test form and the performance of raters. All the constructed response tasks in the Writing and Speaking sections are analyzed after a test is given. The analysis examines statistics such as average scores on a task, distributions of scores on a task, and correlations of the Writing or Speaking sections with the Reading and Listening sections.

The performance of raters is evaluated with statistics on rater agreement rates, which include both exact agreement (no score difference between two raters) and adjacent agreement (a one-point difference). The average of all the scores a rater assigns in a scoring session is also compared with the average score of all the raters participating in the same session; a large difference between these two averages may alert a scoring leader to a possible problem in a rater's performance.

Whenever possible, monitor papers are also used to evaluate cross-administration scoring consistency. Monitor papers are selected responses to a task from a prior test administration that were scored previously. When such a task is included in a new test administration, these monitor papers are intermixed with the responses to the task on the new test for scoring. Because the monitor papers are indistinguishable from the responses on the new test, raters score them in the same way as they score the new responses. Then, the old and new scores on the monitor papers are compared, and the agreement rates between the two sets of scores indicate cross-administration rater consistency in scoring.

Another type of statistical evidence of score consistency across forms comes from the analysis of repeat test takers. As noted in the section on reliability, Zhang (2008) analyzed the test scores of examinees who chose to take the test twice within a short period of time. The correlational analyses established that the examinees were rank-ordered consistently on the two test forms. Zhang also reported that the differences in scores on the two test forms for all four sections and for the total score were negligible for most examinees.
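The rater agreement statistics described above are straightforward to compute. The sketch below calculates exact and adjacent agreement rates for a pair of raters and compares one rater's session average with the overall session average; the scores are invented for illustration and do not come from an actual scoring session.

```python
# Invented scores on a 0-5 rubric for ten responses, each scored by two raters.
rater_1 = [4, 3, 5, 2, 4, 3, 4, 5, 3, 2]
rater_2 = [4, 3, 4, 2, 5, 3, 4, 4, 3, 3]

n = len(rater_1)
exact = sum(a == b for a, b in zip(rater_1, rater_2)) / n
adjacent = sum(abs(a - b) == 1 for a, b in zip(rater_1, rater_2)) / n

print(f"exact agreement:    {exact:.0%}")     # 60%
print(f"adjacent agreement: {adjacent:.0%}")  # 40%

# Compare one rater's session average with the average of all raters in the
# session; a large gap could flag the rater for a scoring leader's review.
session_scores = rater_1 + rater_2            # stand-in for every score assigned
rater_1_mean = sum(rater_1) / len(rater_1)
session_mean = sum(session_scores) / len(session_scores)

print(f"rater 1 mean: {rater_1_mean:.2f}")    # 3.50
print(f"session mean: {session_mean:.2f}")    # 3.50
```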
Conclusion

Because different forms of the TOEFL iBT test are administered to test takers at different times and in different locations, score reliability and comparability are important criteria for evaluating the quality of the test. ETS has therefore implemented a variety of procedures to enhance test score reliability and comparability. Evidence of score reliability and comparability allows decision makers to evaluate the trustworthiness of the test scores when the scores are used to indicate candidates' abilities or performances. For TOEFL iBT scores, this evidence comes both from statistical analyses and from the application of accepted test development, administration and scoring practices.

References

American Educational Research Association, American Psychological Association & National Council on Measurement in Education. (1999). The standards for educational and psychological testing. Washington, DC: American Educational Research Association. www.apa.org/science/standards.html

Breland, H., Bridgeman, B., & Fowles, M. E. (1999). Writing assessment in admission to higher education: Review and framework (ETS Research Rep. No. 99-03). Princeton, NJ: ETS. www.ets.org/Media/Research/pdf/RR-99-03Breland.pdf

Brennan, R. L. (1983). Elements of generalizability theory. Iowa City, IA: American College Testing Program.

Cronbach, L. J. (1951). Coefficient alpha and the internal structure of tests. Psychometrika, 16, 297-334.

Cumming, A., Kantor, R., Powers, D., Santos, T., & Taylor, C. (2000). TOEFL 2000 writing framework: A working paper (TOEFL Monograph No. 18). Princeton, NJ: ETS. www.ets.org/Media/Research/pdf/RM-00-05.pdf

Educational Testing Service. (2002). ETS standards for quality and fairness. Princeton, NJ: Author. www.ets.org/Media/About_ETS/pdf/standards.pdf

Haertel, E. H. (2006). Reliability. In R. L. Brennan (Ed.), Educational measurement (4th ed., pp. 65-110). Westport, CT: American Council on Education and Praeger.

Kolen, M. J., & Brennan, R. L. (1995). Test equating: Methods and practices. New York, NY: Springer-Verlag.

Lee, Y.-W., & Kantor, R. (2005). Dependability of new ESL writing test scores: Evaluating prototype tasks and alternative rating schemes (ETS Research Rep. No. RR-05-14). Princeton, NJ: ETS.

Lord, F. M. (1980). Applications of item response theory to practical testing problems. Hillsdale, NJ: Lawrence Erlbaum Associates.

Pearlman, M. (2008). Finalizing the test blueprint. In C. Chapelle, M. K. Enright & J. M. Jamieson (Eds.), Building a validity argument for the Test of English as a Foreign Language™. New York, NY: Routledge.

Zhang, Y. (2008). Repeater analyses for TOEFL iBT (ETS Research Rep. No. RM-08-05). Princeton, NJ: ETS.

Contact Us
[email protected]

Copyright © 2011 by Educational Testing Service. All rights reserved. ETS, the ETS logo, LISTENING. LEARNING. LEADING.
and TOEFL are registered trademarks of Educational Testing Service (ETS) in the United States and other countries. TOEFL iBT is a trademark of ETS. www.ets.org