Watson-Glaser: Short Form Manual
Watson-Glaser® Critical Thinking Appraisal
Short Form Manual

Goodwin Watson & Edward M. Glaser

888-298-6227 • TalentLens.com

Copyright © 2008 Pearson Education, Inc., or its affiliates. All rights reserved.

Copyright © 2008 by Pearson Education, Inc., or its affiliate(s). All rights reserved. No part of this publication may be reproduced or transmitted in any form or by any means, electronic or mechanical, including photocopy, recording, or any information storage and retrieval system, without permission in writing from the copyright owner. The Pearson and TalentLens logos, and Watson-Glaser Critical Thinking Appraisal are trademarks, in the U.S. and/or other countries, of Pearson Education, Inc. or its affiliate(s). Portions of this work were previously published. Printed in the United States of America.

Table of Contents

Chapter 1: Introduction

Chapter 2: Critical Thinking Ability and the Development of the Original Watson-Glaser Forms
  The Watson-Glaser Short Form

Chapter 3: Directions for Paper-and-Pencil Administration and Scoring
  Preparing for Administration
  Testing Conditions
  Materials Needed to Administer the Test
  Answering Questions
  Administering the Test
  Timed Administration
  Untimed Administration
  Concluding Administration
  Scoring
  Scoring with the Hand-Scoring Key
  Machine Scoring
  Test Security
  Accommodating Examinees with Disabilities

Chapter 4: Directions for Computer-Based Administration
  Preparing for Administration
  Testing Conditions
  Answering Questions
  Administering the Test
  Scoring and Reporting
  Test Security
  Accommodating Examinees with Disabilities

Chapter 5: Norms
  Using Norms Tables to Interpret Scores
  Converting Raw Scores to Percentile Ranks

Chapter 6: Development of the Short Form
  Test Assembly Data Set
  Criteria for Item Selection
  Maintenance of Reading Level
  Updates to the Test
  Test Administration Time

Chapter 7: Equivalence of Forms
  Equivalence of Short Form to Form A
  Equivalent Raw Scores
  Equivalence of Computer-Based and Paper-and-Pencil Versions of the Short Form

Chapter 8: Evidence of Reliability
  Historical Reliability
  Previous Studies of Internal Consistency Reliability
  Previous Studies of Test-Retest Reliability
  Current Reliability Studies
  Evidence of Internal Consistency Reliability
  Evidence of Test-Retest Reliability

Chapter 9: Evidence of Validity
  Evidence of Validity Based on Content
  Evidence of Criterion-Related Validity
  Previous Studies of Evidence of Criterion-Related Validity
  Current Studies of Evidence of Criterion-Related Validity
  Evidence of Convergent and Discriminant Validity
  Previous Studies of Evidence of Convergent and Discriminant Validity
  Studies of the Relationship Between the Watson-Glaser and General Intelligence
  Current Studies of Evidence of Convergent and Discriminant Validity

Chapter 10: Using the Watson-Glaser as an Employment Selection Tool
  Employment Selection
  Fairness in Selection Testing
  Legal Considerations
  Group Differences/Adverse Impact
  Monitoring the Selection System
  Research

Appendix A: Description of the Normative Sample and Percentile Ranks
Appendix B: Final Item Statistics for the Watson-Glaser–Short Form Three-Parameter IRT Model
References
Research Bibliography
Glossary of Measurement Terms

Tables
  6.1 Distribution of Item Development Sample Form A Scores (N = 1,608)
  6.2 Grade Levels of Words on the Watson-Glaser–Short Form
  6.3 Frequency Distribution of Testing Time in Test-Retest Sample (n = 42)
  7.1 Part-Whole Correlations (rpw) of the Short Form and Form A
  7.2 Raw Score Equivalencies Between the Short Form and Form A
  7.3 Equivalency of Paper and Online Modes of Administration
  8.1 Means, Standard Deviations (SD), Standard Errors of Measurement (SEM), and Internal Consistency Reliability Coefficients (r alpha) for the Short Form Based on Previous Studies
  8.2 Means, Standard Deviations (SD), Standard Errors of Measurement (SEM), and Internal Consistency Reliability Coefficients (r alpha) for the Current Short Form Norm Groups
  8.3 Test-Retest Reliability of the Short Form
  9.1 Studies Showing Evidence of Criterion-Related Validity
  9.2 Watson-Glaser Convergent Evidence of Validity
  A.1 Description of the Normative Sample by Industry
  A.2 Description of the Normative Sample by Occupation
  A.3 Description of the Normative Sample by Position Type/Level
  A.4 Percentile Ranks of Total Raw Scores for Industry Groups
  A.5 Percentile Ranks of Total Raw Scores for Occupations
  A.6 Percentile Ranks of Total Raw Scores for Position Type/Level
  A.7 Percentile Ranks of Total Raw Scores for Position Type/Occupation Within Industry
  B.1 Final Item Statistics for the Watson-Glaser Short Form Three-Parameter IRT Model (reprinted from Watson & Glaser, 1994)

Acknowledgements

The development and publication of updated information on a test like the Watson-Glaser Critical Thinking Appraisal®–Short Form inevitably involves the helpful participation of many people in several phases of the project—design, data collection, statistical data analyses, editing, and publication. The Harcourt Assessment Talent Assessment team is indebted to the numerous professionals and organizations that provided assistance.

The Talent Assessment team thanks Julia Kearney, Sampling Special Projects Coordinator; Terri Garrard, Study Manager; and Victoria N. Locke, Director, Catalog Sampling Department, for coordinating the data collection phase of this project. David Quintero, Clinical Handscoring Supervisor, ensured accurate scoring of the paper-administered test data. We thank Zhiming Yang, PhD, Psychometrician, and Jianjun Zhu, PhD, Manager, Data Analysis Operations. Zhiming's technical expertise in analyzing the data and Jianjun's psychometric leadership ensured the high level of analytical rigor and psychometric integrity of the results reported.

Our thanks also go to Troy Beehler and Peter Schill, Project Managers, for skillfully managing the logistics of this project. Troy and Peter worked with several team members from the Technology Products Group, Harcourt Assessment, Inc. to ensure the high quality and accuracy of the computer interface. These dedicated individuals included Paula Oles, Manager, Software Quality Assurance; Matt Morris, Manager, System Development; Christina McCumber and Johnny Jackson, Software Quality Assurance Analysts; Terrill Freese, Requirements Analyst; and Maurya Duran, Technical Writer. Dawn Dunleavy, Senior Managing Editor, and Konstantin Tikhonov, Project Editor, provided editorial guidance and support.
Production assistance was provided by Stephanie Adams, Director, Production; Mark Cooley, Designer; Debbie Glaeser, Production Coordinator; and Robin Espiritu, Production Manager, Manufacturing. Finally, we wish to acknowledge the leadership, guidance, support, and commitment of the following people through all the phases of this project: Gene Bowles, Vice President, Publishing and Technology, Larry Weiss, PhD, Vice President, Psychological Assessment Products Group, and Aurelio Prifitera, PhD, Publisher, Harcourt Assessment, Inc., and President, Harcourt Assessment International.

Kingsley C. Ejiogu, PhD, Research Director
Mark Rose, PhD, Research Director
John Trent, M.S., Senior Research Analyst

Chapter 1: Introduction

The Watson-Glaser Critical Thinking Appraisal® (subsequently referred to in this manual as the Watson-Glaser) is designed to measure important abilities involved in critical thinking. Critical thinking ability plays a vital role in academic instruction and occupations that require careful analytical thinking to perform essential job functions. The Watson-Glaser has been used to predict performance in a variety of educational settings and has been a popular selection tool for executive, managerial, supervisory, administrative, and technical occupations for many years. When used in conjunction with information from multiple sources about the examinee's skills, abilities, and potential for success, the Watson-Glaser can contribute significantly to the quality of an organization's selection program.

The Watson-Glaser–Short Form was published in 1994 to enhance the use of the Watson-Glaser in assessing adult employment applicants, candidates for employment-related training, career and vocational counselees, college students, and students in technical schools and adult education programs. As an abbreviated version of the Watson-Glaser–Form A, the Short Form uses a subset of Form A scenarios and items to measure the same critical thinking abilities.

This manual provides the following information about the Short Form:

• Updated guidelines for administration. Chapter 3 includes guidelines for administering and scoring the traditional paper-and-pencil version. Chapter 4 provides guidelines for administering the new computer-based version.
• Updated normative information (norms). Twenty-three new norm groups, based on 6,713 cases collected in 2004 and 2005, are presented in chapter 5.
• Results of an equivalency study on the computer-based and paper-and-pencil versions. To ensure equivalence between the newly designed computer-based and traditional paper-and-pencil formats of the Watson-Glaser, a study was conducted comparing scores on the two versions. A full description of the study, which supported equivalence of the two versions, is presented in chapter 7, Equivalence of Forms.
• Updated reliability and validity information. New studies describing internal consistency and test-retest reliability are presented in chapter 8. New studies describing convergent and criterion-related validity are presented in chapter 9.

Information on Forms A and B was published in the 1994 Watson-Glaser manual.

Chapter 2: Critical Thinking Ability and the Development of the Original Watson-Glaser Forms

Development of the Watson-Glaser was driven by the conceptualization of critical thinking as a combination of attitudes, knowledge, and skills.
This conceptualization suggests that critical thinking includes:

• the ability to recognize the existence of problems and an acceptance of the general need for evidence in support of what is asserted to be true,
• knowledge of the nature of valid inferences, abstractions, and generalizations in which the weight or accuracy of different kinds of evidence are logically determined, and
• skills in employing and applying the above attitudes and knowledge.

The precursors of the Watson-Glaser were developed by Goodwin Watson in 1925 and Edward Glaser in 1937. These tests were developed with careful consideration given to the theoretical concept of critical thinking, as well as practical applications. In 1964, The Psychological Corporation (now Harcourt Assessment, Inc.) published Watson-Glaser Forms Ym and Zm. Each form contained 100 items and replaced an earlier version of the test, Form Am. In 1980, Form Ym and Form Zm were modified for clarity, current word usage, and the elimination of racial and sexual stereotypes. The revised instruments, each containing 80 items, were published as Form A and Form B.

The Watson-Glaser measures the extent to which examinees need training or have mastered certain critical thinking skills. The availability of comparable forms (i.e., the Short Form, Form A, and Form B) makes it possible to partially gauge the efficacy of instructional programs and to measure development of these skills over an extended period of time. The Watson-Glaser also has been a particularly popular tool for assessing the success of critical thinking instruction programs and courses, and for placing students in gifted and talented programs at the high school level and in honors curricula at the university level.

The Watson-Glaser is composed of a set of five tests. Each test is designed to tap a somewhat different aspect of critical thinking. A high level of competency in critical thinking, as measured by the Watson-Glaser, may be operationally defined as the ability to correctly perform the domain of tasks represented by the five tests.

1—Inference. Discriminating among degrees of truth or falsity of inferences drawn from given data.
2—Recognition of Assumptions. Recognizing unstated assumptions or presuppositions in given statements or assertions.
3—Deduction. Determining whether certain conclusions necessarily follow from information in given statements or premises.
4—Interpretation. Weighing evidence and deciding if generalizations or conclusions based on the given data are warranted.
5—Evaluation of Arguments. Distinguishing between arguments that are strong and relevant and those that are weak or irrelevant to a particular issue.

Each test is composed of reading passages or scenarios that include problems, statements, arguments, and interpretations of data similar to those encountered on a daily basis at work, in the classroom, and in newspaper or magazine articles. Each scenario is accompanied by a number of items to which the examinee responds. There are two types of item content: neutral and controversial. Neutral scenarios and items deal with subject matter that does not cause strong feelings or prejudices, such as the weather and scientific facts or experiments. Scenarios and items having controversial content refer to political, economic, and social issues that frequently provoke strong emotional responses.
As noted in the research literature about critical thinking, strong attitudes, opinions, and biases affect the ability of some people to think critically (Jaeger & Freijo, 1975; Jones & Cook, 1975; Mitchell & Byrne, 1973; Sherif, Sherif, & Nebergall, 1965).

Though the Watson-Glaser comprises five tests, it is the total score of these tests that yields a reliable measure of critical thinking ability. Individually, the tests are composed of relatively few items and lack sufficient reliability to measure specific aspects of critical thinking ability. Therefore, individual test scores should not be relied upon for most applications of the Watson-Glaser.

The Watson-Glaser Short Form

The Short Form was designed to offer a brief version of the Watson-Glaser without changing the essential nature of the constructs measured. The length of time required to administer Form A or Form B of the Watson-Glaser is approximately one hour, making both forms well suited for administration during a single classroom period in a school setting. However, such lengthy administration time increases the cost and decreases the practicality of using the Watson-Glaser in adult assessment, particularly in the employment selection context. The Short Form is composed of 16 scenarios and 40 items selected from the 80-item Form A. The Short Form takes about 30 minutes to complete in a paper-and-pencil or computer-based format. It takes an additional five to ten minutes to read the directions and sample questions. At one-half the length of Form A, the Short Form presents a more practical measure of critical thinking ability, yet retains an equivalent nature (see chapter 7, Equivalence of Forms). Organizations requiring an alternative to the Short Form for retesting or other purposes may use the full-length Form B. Like Form A and Form B, the Short Form is appropriate for use with persons who have at least the equivalent of a ninth-grade education (see chapter 6, Development of the Short Form, for more information on the reading level of the Watson-Glaser).

Chapter 3: Directions for Paper-and-Pencil Administration and Scoring

Preparing for Administration

The person responsible for administering the Watson-Glaser does not need special training, but must be able to carry out standard examination procedures. To ensure accurate and reliable results, the administrator must become thoroughly familiar with the administration instructions and the test materials before attempting to administer the test. It is recommended that test administrators take the Watson-Glaser prior to administration, being sure to comply with the directions and any time requirement.

Testing Conditions

Generally accepted conditions of good test administration should be observed: good lighting, comfortable seating, adequate desk or table space, and freedom from noise and other distractions. Examinees should have sufficient seating space to minimize cheating. Each examinee needs an adequate flat surface on which to work. Personal materials should be removed from the work surface.

Materials Needed to Administer the Test

• This manual or the Directions for Administration booklet
• 1 Test Booklet for each examinee
• 1 Answer Document for each examinee
• 2 No. 2 pencils with erasers for each examinee
• A clock or stopwatch if the test is timed
• 1 Hand Scoring Key (if the test will be hand-scored rather than machine-scored)

Intended as a test of critical thinking power rather than speed, the Watson-Glaser may be given in either timed or untimed administrations.
In timed administrations, the time limit is based on the amount of time required to finish the test by the majority of examinees in test tryouts. The administrator should have a regular watch with a second hand, a wall clock with a sweep-second hand, or any other accurate device to time the test administration. To facilitate accurate timing, the starting time and the finishing time should be written down immediately after the signal to begin has been given. In addition to testing time, allow 5–10 minutes to read the directions and answer questions.

Answering Questions

Examinees may ask questions about the test before the signal to begin is given. To maintain standard testing conditions, answer such questions by rereading the appropriate section of the directions. Do not volunteer new explanations or examples. It is the responsibility of the test administrator to ensure that examinees understand the correct way to indicate their answers on the Answer Document and what is required of them. The question period should never be rushed or omitted. If any examinees have routine questions after the testing has started, try to answer them without disturbing the other examinees. However, questions about the test directions should be handled by telling the examinee to do his or her best.

Administering the Test

All directions that the test administrator reads aloud to examinees are in bold type. Read the directions exactly as they are written, using a natural tone and manner. Do not shorten the directions or change them in any way. If you make a mistake in reading a direction, say, No, that is wrong. Listen again. Then read the direction again.

When all examinees are seated, give each examinee two pencils and an Answer Document. Say Please make sure that you do not fold, tear, or otherwise damage the Answer Documents in any way. Notice that your Answer Document has an example of how to properly blacken the circle. Point to the Correct Mark and Incorrect Marks samples on the Answer Document. Say Make sure that the circle is completely filled in as shown.

Note. You may want to point out how the test items are ordered on the front page of the Short Form Answer Document so that examinees do not skip anything or put the correct information in the wrong place.

Say In the upper left corner of the Answer Document, you will find box A labeled NAME. Neatly print your Last Name, First Name, and Middle Initial here. Fill in the appropriate circle under each letter of your name.

The Answer Document provides space for a nine-digit identification number. If you want the examinees to use this space for an employee identification number, provide them with specific instructions for completing the information at this time. For example, say, In box B labeled IDENTIFICATION NUMBER, enter your employee number in the last four spaces provided. Fill in the appropriate circle under each digit of the number. If no information is to be recorded in the space, tell examinees that they should not write anything in box B.

Say Find box C, labeled DATE. Write down today's Month, Day, and Year here. (Tell examinees today's date.) Blacken the appropriate circle under each digit of the date.

Box D, labeled OPTIONAL INFORMATION, provides space for additional information you would like to obtain from the examinees. Let examinees know what information, if any, they should provide in this box.
Note. If optional information is collected, the test administrator should explain to the examinees the purpose of collecting this information (i.e., how it will be used).

Say, Are there any questions? Answer any questions.

Say After you receive your Test Booklet, please keep it closed. You will do all your writing on the Answer Document only. Do not make any additional marks on the Answer Document until I tell you to do so.

Distribute the Test Booklets.

Say In this test, all the questions are in the Test Booklets. There are five separate tests in the booklet, and each one is preceded by its own directions. For each question, decide what you think is the best answer. Because your score will be the number of items you answered correctly, try to answer each question even if you are not sure that your answer is correct. Record your choice by making a black mark in the appropriate space on the Answer Document. Always be sure that the answer space has the same number as the question in the booklet and that your marks stay within the circles. Do not make any other marks on the Answer Document. If you change your mind about an answer, be sure to erase the first mark completely. Do not spend too much time on any one question. When you finish a page, go right on to the next one. If you finish all the tests before time is up, you may go back and check your answers.

Timed Administration

Say, You will have 30 minutes to work on this test. Now read the directions on the cover of your Test Booklet. After allowing time for the examinees to read the directions, say, Are there any questions? Answer any questions, preferably by rereading the appropriate section of the directions, then say, Ready? Please begin the test. Start timing immediately.

If any of the examinees finish before the end of the test period, either tell them to sit quietly until everyone has finished, or collect their materials and dismiss them. At the end of 30 minutes, say, Stop! Put your pencils down. This is the end of the test. Intervene if examinees continue to work on the test after the time signal is given.

Untimed Administration

Say You will have as much time as you need to work on this test. Now read the directions on the cover of your Test Booklet. After allowing time for the examinees to read the directions, say, Are there any questions? Answer any questions, preferably by rereading the appropriate section of the directions, then instruct examinees regarding what they are to do upon completing the test (e.g., remain seated until everyone has finished, bring Test Booklet and Answer Document to the test administrator). Say Ready? Please begin the test. Allow the group to work until everyone is finished.

Concluding Administration

At the end of the testing session, collect all Test Booklets, Answer Documents, and pencils. Place the completed Answer Documents in one pile and the Test Booklets in another. The Test Booklets may be reused, but they will need to be inspected for marks. Marked booklets should not be reused, unless the marks can be completely erased.

Scoring

The Watson-Glaser Answer Document may be hand scored with the Hand Scoring Key or machine scored.

Scoring With the Hand-Scoring Key

First, cross out multiple responses to the same item with a heavy red mark that will show through the key. (Note: Red marks are only suitable for hand-scored documents.) Check for any answer spaces that were only partially erased by the examinee in changing an answer; partial erasures should be completely erased.
Next, place the scoring key over the Answer Document so that the edges are neatly aligned and the two stars appear through the two holes that are closest to the bottom of the key. Count the number of correctly marked spaces (other than those through which a red line has been drawn) appearing through the holes in the stencil. Record the total in the "Score" box on the Answer Document. The maximum raw score for the Short Form is 40. The percentile score corresponding to the raw score may be recorded in the "Percentile" space on the Answer Document, and the norm group used to determine that percentile may be recorded in the space labeled "Norms Used."

Machine Scoring

First, completely erase multiple responses to the same item or configure the scanning program to treat multiple responses as incorrect answers. If you find any answer spaces that were only partially erased by the examinee, finish completely erasing them. The machine-scorable Answer Documents available for the Short Form may be processed by any reflective scanning device programmed to your specifications.

Test Security

Watson-Glaser scores are confidential and should be stored in a secure location accessible only to authorized individuals. It is unethical and poor test practice to allow test score access to individuals who do not have a legitimate need for the information. The security of testing materials and protection of copyright must also be maintained by authorized individuals. Storing test scores and materials in a locked cabinet (or password-protected file in the case of scores maintained electronically) that can only be accessed by designated test administrators is an effective means to ensure their security.

Accommodating Examinees with Disabilities

You will need to routinely provide reasonable accommodations that make it possible for candidates with particular needs to comfortably take the test, such as left-handed desks for some candidates and adequate and comfortable seating for all individuals. On occasion, a special administration may be required for an examinee with an impairment that affects his or her ability to take a test in the standard manner. Harcourt Assessment, Inc. recommends that reasonable accommodations for these examinees be made in accordance with the Americans with Disabilities Act (ADA) of 1990. The ADA has established basic legal rights for individuals with physical or mental disabilities that substantially limit one or more major life activities. Reasonable accommodations may include, but are not limited to, modifications to the testing environment (e.g., high desks), medium (e.g., having a reader read questions to the examinee), time limit, and/or content (Society for Industrial and Organizational Psychology, 2003). If an examinee's disability is not likely to impair job performance, but may hinder his or her performance on the Watson-Glaser, you may want to consider waiving administration of the Watson-Glaser or de-emphasizing the test score and relying more heavily on other application criteria.

Chapter 4: Directions for Computer-Based Administration

The computer-based Watson-Glaser is administered through eAssessTalent.com, the Internet-based testing system designed by Harcourt Assessment, Inc., for the administration, scoring, and reporting of professional assessments. Instructions for administrators on how to order and access the test online are provided at eAssessTalent.com.
Instructions for accessing the Watson-Glaser interpretive reports are provided on the website. After a candidate has taken the Watson-Glaser on eAssessTalent.com, you can review the candidate's results in an interpretive report, using the link that Harcourt Assessment provides.

Preparing for Administration

Being thoroughly prepared before the examinee's arrival will result in a more efficient online administration session. It is recommended that test administrators take the computer-based Watson-Glaser prior to administering the test, being sure to comply with the directions and any time requirement. Examinees will not need pencils or scratch paper for this computer-based test. In addition, examinees should not have access to any reference materials (e.g., dictionaries or calculators).

Testing Conditions

It is important to ensure that the test is administered in a quiet, well-lit room. The following conditions are necessary for accurate scores and for maintaining the cooperation of the examinee: good lighting, comfortable seating, adequate desk or table space, comfortable positioning of the computer screen, keyboard, and mouse, and freedom from noise and other distractions.

Answering Questions

Examinees may ask questions about the test before the signal to begin is given. To maintain standard testing conditions, answer such questions by rereading the appropriate section of these directions. Do not volunteer new explanations or examples. As the test administrator, it is your responsibility to ensure that examinees understand the correct way to indicate their answers and what is required of them. The question period should never be rushed or omitted. If any examinees have routine questions after the testing has started, try to answer them without disturbing the other examinees. However, questions about the test directions should be handled by telling the examinee to do his or her best.

Administering the Test

After the initial instruction screen for the Watson-Glaser has been accessed and the examinee is seated at the computer, say, The on-screen directions will take you through the entire process, which begins with some demographic questions. After you have completed these questions, the test will begin. You will have as much time as you need to complete the test items. The test ends with a few additional demographic questions. Do you have any questions before starting the test? Answer any questions and say, Please begin the test.

Once the examinee clicks the "Start Your Test" button, test administration begins with the first page of test questions. The examinee may review test items at the end of the test. Examinees have as much time as they need to complete the exam, but they typically finish within 30 minutes. If an examinee's computer develops technical problems during testing, move the examinee to another suitable computer location. If the technical problems cannot be solved by moving to another computer location, contact Harcourt Assessment, Inc. Technical Support for assistance. The contact information, including phone and fax numbers, can be found at the eAssessTalent.com website.

Scoring and Reporting

Scoring is automatic, and the report is available a few seconds after the test is completed. A link to the report will be available on eAssessTalent.com. Adobe® Acrobat Reader® is necessary to open the report. You may view, print, or save the candidate's report.
Test Security

Watson-Glaser scores are confidential and should be stored in a secure location accessible only to authorized individuals. It is unethical and poor test practice to allow test-score access to individuals who do not have a legitimate need for the information. Storing test scores in a locked cabinet or password-protected file that can only be accessed by designated test administrators will help ensure their security. The security of testing materials (e.g., access to online tests) and protection of copyright must also be maintained by authorized individuals. Avoid disclosure of test access information such as usernames or passwords, and only administer the Watson-Glaser in proctored environments. All the computer stations used in administering the Watson-Glaser must be in locations that can be easily supervised, with the same level of security as with the paper-and-pencil administration.

Accommodating Examinees with Disabilities

As noted in chapter 3 under the section dealing with examinees with disabilities, the test administrator should provide reasonable accommodations to enable candidates with special needs to comfortably take the test. Reasonable accommodations may include, but are not limited to, modifications to the test environment (e.g., high desks) and medium (e.g., having a reader read questions to the examinee, or increasing the font size of questions) (Society for Industrial and Organizational Psychology, 2003). In situations where an examinee's disability is not likely to impair his or her job performance, but may hinder the examinee's performance on the Watson-Glaser, the organization may want to consider waiving the test or de-emphasizing the score and relying more heavily on other application criteria. Interpretive data as to whether scores on the Watson-Glaser are comparable for examinees who are provided reasonable accommodations are not available at this time due to the small number of examinees who have requested such accommodations.

If, due to some particular impairment, a candidate cannot take the computer-administered test but can take the test on paper, the administrator could provide reasonable accommodation for the candidate to take the test on paper, and then have the candidate's certified responses and results entered into the computer system. The Americans with Disabilities Act (ADA) of 1990 requires an employer to reasonably accommodate the known disability of a qualified applicant, provided such accommodation would not cause an "undue hardship" to the operation of the employer's business.

Chapter 5: Norms

The raw score on the Watson-Glaser–Short Form is calculated by adding the total number of correct responses. The maximum raw score is 40. Raw scores may be used to rank examinees in order of performance, but little can be inferred from raw scores alone. It is important to relate the scores to specifically defined normative groups to make the test results meaningful. Norms provide a basis for evaluating an individual's score relative to the scores of other individuals who took the same test. Norms allow for the conversion of raw scores to more useful comparative scores, such as percentile ranks. Typically, norms are constructed from the scores of a large sample of individuals who took a test. This group of individuals is referred to as the normative group or standardization sample. The characteristics of the sample used for preparing norms are critical in determining the usefulness of those norms.
For some purposes, such as intelligence testing, norms that are representative of the general population are essential. For other purposes, such as selecting from among applicants to fill a particular job, normative information derived from a specific, relevant, well-defined group may be most useful. However, the composition of a sample of job applicants is influenced by a variety of situational factors, including job demands and local labor market conditions. Because such factors can vary across jobs, locations, and over time, the limitations on the usefulness of any set of published norms should be acknowledged. When a test is used to help make human resource decisions, the most appropriate norm group is one that is representative of those who will be taking the test in the local situation. It is best, whenever possible, to prepare local norms by accumulating the test scores of applicants, trainees, or employees.

One of the factors that must be considered in preparing norms is sample size. With large samples, all possible scores can be converted to percentile ranks. Data from smaller samples tend to be unstable, and the presentation of percentile ranks for all possible scores presents an unwarranted impression of precision. Until a sufficient and representative number of cases has been collected (preferably 100 or more), the norms presented in Appendix A should be used to guide the interpretation of test scores.

Using Norms Tables to Interpret Scores

The Short Form norms in Appendix A were derived from new data, collected June 2004 through March 2005, from 6,713 adults in a variety of employment settings. Please note that the distributions of occupational levels across industry samples vary. Therefore, it is not appropriate to compare the industry means presented in Appendix A to each other.

The tables in Appendix A show the total raw scores on the Watson-Glaser with their corresponding percentile ranks for identified norm groups. When using the norms tables in Appendix A, look for a group that is similar to the individual or group tested. For example, you would compare the test score of a person who applied for an engineer's position with norms derived from the scores of other engineers. If a person applied for a management position, you would compare the candidate's test score with norms for managers, or norms for managers in manufacturing. When using the norms tables in Appendix A to interpret candidates' scores, keep in mind that norms are affected by the composition of the groups that participated in the normative study. Therefore, it is important to examine the specific industry and occupational characteristics of a norm group.

By comparing an individual's raw score to the data in a norms table, it is possible to determine the percentile rank corresponding to that score. The percentile rank indicates an individual's relative position in the norm group. Percentiles should not be confused with percentage scores, which represent the percentage of correct items. Percentiles are derived scores that are expressed in terms of the percent of people in the norm group scoring equal to or below a given raw score.
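That definition can be made concrete with a small computational sketch. The Python example below is illustrative only: the norm-group scores are invented, and operational interpretation should always rely on the published norms tables in Appendix A. It simply shows how a percentile rank, defined as the percent of the norm group scoring equal to or below a given raw score, is calculated.

```python
# Illustrative only: percentile ranks computed from a hypothetical norm group.
# Operational scores should be interpreted with the published tables in Appendix A.

def percentile_ranks(norm_scores):
    """Map each raw score to the percent of the norm group scoring at or below it."""
    n = len(norm_scores)
    return {score: round(100 * sum(1 for s in norm_scores if s <= score) / n)
            for score in set(norm_scores)}

# Hypothetical norm group of 20 examinees (Short Form raw scores range from 0 to 40).
norm_group = [22, 24, 25, 27, 27, 28, 29, 30, 30, 31,
              31, 32, 32, 33, 34, 34, 35, 36, 37, 39]
print(percentile_ranks(norm_group)[35])  # 85 in this made-up group of 20
```

In practice a published norm group is far larger than this toy sample; as the chapter notes, small samples yield unstable percentile ranks.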
Although percentiles are useful for explaining an examinee's performance relative to others, they have limitations. Percentile ranks do not have equal intervals. In a normal distribution of scores, percentile ranks tend to cluster around the 50th percentile. This clustering affects scores in the average range the most, because a difference of one or two raw score points may change the percentile rank. Extreme scores are less affected; a change of one or two raw score points typically does not produce a large change in percentile ranks. These factors should be taken into consideration when interpreting percentiles.

Converting Raw Scores to Percentile Ranks

To find the percentile rank of a candidate's raw score, locate the raw score in either the extreme right- or left-hand column in Tables A.4–A.7. The corresponding percentile rank is read from the selected norm group column. For example, if a person applying for a job as an engineer had a score of 35 on the Watson-Glaser–Short Form, it is appropriate to use the Engineer norms in Appendix A (Table A.5) for comparison. In this case, the percentile rank corresponding to a raw score of 35 is 63. This percentile rank indicates that about 63% of the people in the norm group scored lower than or equal to a score of 35 on the Watson-Glaser–Short Form, and about 37% scored higher than a score of 35.

Each group's size (N), mean, and standard deviation (SD) are shown at the bottom of the norms tables. The group mean, or average, is calculated by summing the raw scores and dividing the sum by the total number of examinees. The standard deviation indicates the amount of variation in a group of scores. In a normal distribution, approximately two-thirds (68.26%) of the scores are within the range of –1 SD (below the mean) to +1 SD (above the mean). These statistics are often used in describing a study sample and setting cut scores. For example, a cut score may be set as one SD below the mean.

In accordance with the Civil Rights Act of 1991, Title I, Section 106, the norms provided in Appendix A combine data for males and females, and for white and minority examinees. The use of combined group norms can exacerbate adverse impact if there are expected differences in scores due to differences in group membership. Previous investigations conducted during the development of Watson-Glaser Form A and Form B found no consistent differences between the scores of male examinees and the scores of female examinees. Other studies of earlier Forms Ym and Zm also found no consistent differences based on the sex of the examinee in critical thinking ability as measured by the Watson-Glaser (e.g., Burns, 1974; Gurfein, 1977; Simon & Ward, 1974).

Chapter 6: Development of the Short Form

The Watson-Glaser–Short Form is a shortened version of Form A. Historical and test development information for Form A may be found in the Watson-Glaser, Forms A and B Manual, 1980 edition.

Test Assembly Data Set

Two overlapping sets of data were used in the development of the Short Form. The first data set consisted of item-level responses to Form A from 1,608 applicants and employees. These data were obtained from eight sources between 1989 and 1992. This data set was used to generate item statistics and to make decisions about item selection for inclusion in the Short Form. The average Form A score for this sample was 61.78 (SD = 9.30), and the internal consistency (i.e., KR-20) coefficient was .87.
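As a rough illustration of the internal consistency statistic quoted above, the Python sketch below computes a KR-20 coefficient from a matrix of dichotomously scored item responses. The response matrix is invented for the example; only the formula, the standard Kuder-Richardson formula 20, is assumed.

```python
# Illustrative only: KR-20 internal consistency for dichotomously scored (0/1) items.
# Rows are examinees, columns are items; the data below are invented.

def kr20(responses):
    n_items = len(responses[0])
    totals = [sum(person) for person in responses]
    mean_total = sum(totals) / len(totals)
    var_total = sum((t - mean_total) ** 2 for t in totals) / len(totals)
    pq_sum = 0.0
    for i in range(n_items):
        p = sum(person[i] for person in responses) / len(responses)  # proportion correct on item i
        pq_sum += p * (1 - p)
    return (n_items / (n_items - 1)) * (1 - pq_sum / var_total)

sample_responses = [
    [1, 1, 0, 1, 1],
    [1, 0, 0, 1, 0],
    [1, 1, 1, 1, 1],
    [0, 0, 0, 1, 0],
    [1, 1, 0, 0, 1],
]
print(round(kr20(sample_responses), 2))
```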
Table 6.1 presents a frequency distribution of Form A scores for the sample.

Table 6.1 Distribution of Item Development Sample Form A Scores (N = 1,608)

Form A Score   Frequency   Percent      Form A Score   Frequency   Percent
1              0           0.0          41             3           0.2
2              0           0.0          42             8           0.5
3              1           0.1          43             8           0.5
4              0           0.0          44             13          0.8
5              0           0.0          45             19          1.2
6              0           0.0          46             15          0.9
7              0           0.0          47             14          0.9
8              0           0.0          48             17          1.1
9              0           0.0          49             31          1.9
10             0           0.0          50             30          1.9
11             0           0.0          51             25          1.6
12             0           0.0          52             37          2.3
13             1           0.1          53             43          2.7
14             0           0.0          54             45          2.8
15             0           0.0          55             38          2.4
16             0           0.0          56             36          2.2
17             0           0.0          57             46          2.9
18             0           0.0          58             46          2.9
19             0           0.0          59             67          4.2
20             0           0.0          60             56          3.5
21             1           0.1          61             57          3.5
22             0           0.0          62             66          4.1
23             1           0.1          63             59          3.7
24             0           0.0          64             79          4.9
25             0           0.0          65             63          3.9
26             1           0.1          66             74          4.6
27             0           0.0          67             76          4.7
28             0           0.0          68             76          4.7
29             2           0.1          69             68          4.2
30             4           0.2          70             88          5.5
31             1           0.1          71             60          3.7
32             2           0.1          72             62          3.9
33             0           0.0          73             39          2.4
34             1           0.1          74             37          2.3
35             3           0.2          75             33          2.1
36             2           0.1          76             22          1.4
37             5           0.3          77             8           0.5
38             3           0.2          78             6           0.4
39             3           0.2          79             2           0.1
40             5           0.3          80             0           0.0

Note. The total percent equals 100.4 due to rounding.

A second set of data was created by combining the first data set with item-level data obtained in 1993 from three additional sources of Form A data (N = 2,119). The combined data set (N = 3,727) was used to evaluate the psychometric properties of the Short Form, including reliability, and to examine the equivalency between the Short Form and Form A.

Criteria for Item Selection

In assembling the Short Form, the primary goal was to significantly reduce the time limit required for Form A without changing the essential nature of the constructs measured. For additional information regarding the selection of items for the Short Form, please refer to the 1994 edition of the Watson-Glaser manual. The following criteria were used to select Short Form items:

• Maintenance of the Watson-Glaser five sub-test structure and the scenario-based format
• Selection of psychometrically sound scenarios and items
• Maintenance of test reliability
• Maintenance of reading level
• Update of test currency

Maintenance of Reading Level

The reading level of the shortened test form was assessed using EDL Core Vocabulary in Reading, Mathematics, Science, and Social Studies (Taylor et al., 1989) and Basic Reading Vocabularies (Harris & Jacobson, 1982). Approximately 98.2% of the words appearing in directions and exercises were at or below the ninth-grade reading level. A summary of word distribution by grade level is presented in Table 6.2.

Table 6.2 Grade Levels of Words on the Watson-Glaser–Short Form

Grade Level   Frequency   Percent
Preprimer     106         5.8
1             224         12.2
2             310         16.9
3             379         20.7
4             222         12.1
5             191         10.4
6             229         12.5
7             80          4.4
8             53          2.9
9             5           0.3
10            5           0.3
11            25          1.5
Total         1,829       100.0

Updates to the Test

During the test assembly process, attention was given to the currency of Form A scenarios and items. Some scenarios deal with dated subject matter, such as Russia prior to the dissolution of the USSR. The item selection process removed such dated scenarios, thereby making the composition of the Short Form more contemporary.

Test Administration Time

The optional 30-minute time limit for the Short Form was established during a study investigating the instrument's test-retest reliability. A sample of 42 employees (92.9% non-minority; 54.8% female) at a large publishing company completed the Short Form twice (two-week testing interval).
The participants worked in a variety of positions ranging from Secretary to Project Director. During the first testing session, participants were given as much time as they required to complete the test. A frequency distribution of time taken (see Table 6.3) revealed that approximately 90% of the respondents completed the Short Form in 30 minutes or less. Consistent with the method used to establish testing time limits for previous forms of the Watson-Glaser, these results were used to set the time limit for completing the Short Form at 30 minutes. The fact that the majority of respondents complete the Short Form within the allotted 30 minutes supports the point that the Watson-Glaser is a test of critical thinking power, rather than speed. Furthermore, normative data gathered in both timed and untimed administrations may be used to interpret Short Form results, as the variability in scores is derived from test items rather than testing time limits.

Table 6.3 Frequency Distribution of Testing Time in Test-Retest Sample (n = 42)

Completion Time      Frequency   Percent   Cumulative Percent
20 minutes or less   2           4.8       4.8
21 to 25 minutes     14          33.3      38.1
26 to 30 minutes     22          52.4      90.5
31 minutes or more   4           9.5       100.0

Chapter 7: Equivalence of Forms

Equivalence of Short Form to Form A

To support the equivalence of the Short Form and Form A, test item contents were not changed and the new form was assembled from Form A test items. As a result, the Short Form may be considered to measure the same abilities as Form A. Following assembly of the Short Form, correlation coefficients were computed between raw scores on the Short Form and those on Form A. Because the constituent test items of the Short Form are completely contained in the longer Form A, the coefficients are considered part-whole correlations (rpw). Part-whole correlations are known to overstate the relationship between independently measured variables and cannot be interpreted as alternate-form correlations. However, they can be used to support the equivalence of a short form to a longer one, because examinees are expected to respond the same way to the same items, regardless of the form.

The overall correlation coefficient was calculated by using data from a sample of 3,727 adults who were administered Form A. To compute the correlations, each Watson-Glaser was scored twice. First, the Form A raw score was computed; then the Short Form raw score was computed by ignoring responses to the Form A items not used in the Short Form. For the entire sample, the resulting coefficient was .96. Correlations between Form A and the Short Form scores were also computed separately for each of 21 sources providing data, some of which were not included in the Short Form developmental analysis.
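The double-scoring procedure described above can be sketched in a few lines of Python. Everything in the example is hypothetical (the 0/1 response matrix and the positions of the retained items); only the logic mirrors the procedure reported here: score the full form, rescore the embedded subset of items, and correlate the two totals.

```python
# Illustrative only: part-whole correlation between an embedded short form and the full form.
# 'responses' is a hypothetical 0/1 item-score matrix; 'short_items' is a hypothetical list
# of the item positions retained in the shorter form.
from statistics import correlation  # Pearson correlation, Python 3.10+

def part_whole_r(responses, short_items):
    full_scores = [sum(person) for person in responses]                            # score the full form
    short_scores = [sum(person[i] for i in short_items) for person in responses]   # rescore the subset only
    return correlation(short_scores, full_scores)

responses = [
    [1, 1, 0, 1, 1, 0, 1, 1],
    [1, 0, 0, 1, 0, 0, 1, 0],
    [1, 1, 1, 1, 1, 1, 1, 1],
    [0, 1, 0, 0, 1, 0, 0, 1],
    [1, 1, 0, 1, 1, 1, 0, 1],
    [0, 0, 1, 0, 0, 1, 1, 0],
]
short_items = [0, 2, 4, 6]
print(round(part_whole_r(responses, short_items), 2))
```

Because the short-form score is a component of the full-form score, such coefficients are inflated relative to alternate-form correlations, as the paragraph above cautions.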
The resulting coefficients are presented in Table 7.1. A description of the sample group is followed by the group size (N) and the part-whole correlation (rpw) between the Short Form and Form A. The coefficients presented in Table 7.1 indicate that raw scores on the Short Form correlated very highly with Form A raw scores in a variety of groups.

Table 7.1 Part-Whole Correlations (rpw) of the Short Form and Form A

Group                                                                          N     rpw
Lower-level management applicants                                              219   .93
Lower to upper-level management applicants                                     501   .94
Mid-level management applicants                                                211   .94
Upper-level management applicants at Board of County Commissioners             215   .94
Construction management applicants                                             322   .94
Executive management applicants                                                453   .93
Supervisory and managerial applicants in the corrugated container industry     149   .94
Sales applicants                                                               473   .94
Mid-level marketing applicants                                                 909   .94
Bank employees                                                                 95    .95
Bank management associates                                                     131   .94
Candidates for the ministry                                                    126   .95
Clergy                                                                         99    .91
Railroad dispatchers                                                           199   .92
Nurse managers and educators                                                   111   .95
Police officers                                                                225   .95
Administrative applicants in city government                                   23    .97
Security applicants                                                            42    .89
Candidates for police captain                                                  41    .89
Police department executives                                                   55    .94
Various occupations                                                            133   .97

Equivalent Raw Scores

Table 7.2 presents raw score equivalents for the Short Form and Form A. For every possible score on the Short Form, this table contains an equivalent raw score on Form A. To convert a Form A raw score, find that score in the Form A column; the equivalent Short Form raw score appears in the Short Form column to its left. To convert a Short Form raw score, simply reverse the procedure.

Table 7.2 was prepared with data obtained from the 3,727 adults comprising the item selection sample. To establish equivalent raw scores, raw-score-to-ability estimates for both the Short Form and Form A were generated using Rasch-model difficulty parameters. Then, using interpolation when necessary, the ability estimates were calibrated for all possible scores on each form. Form A and Short Form raw scores corresponding to the same ability estimate were considered equivalent (i.e., they represent the same ability level).

Organizations requiring an alternative to the Short Form for retesting or other purposes may use Form B. Form A and Form B are equivalent, alternate forms. Raw scores on one form (A or B) may be interpreted as having the same meaning as identical raw scores on the other form (A or B). Therefore, scores from either Form A or Form B may be equated to the Short Form using Table 7.2.

Table 7.2 Raw Score Equivalencies Between the Short Form and Form A

Short Form Raw Score   Form A Raw Score      Short Form Raw Score   Form A Raw Score
40                     78–80                 20                     40–41
39                     77                    19                     38–39
38                     75–76                 18                     36–37
37                     73–74                 17                     34–35
36                     71–72                 16                     33
35                     69–70                 15                     31–32
34                     67–68                 14                     29–30
33                     65–66                 13                     27–28
32                     63–64                 12                     25–26
31                     61–62                 11                     23–24
30                     59–60                 10                     21–22
29                     57–58                 9                      19–20
28                     55–56                 8                      17–18
27                     53–54                 7                      15–16
26                     51–52                 6                      13–14
25                     49–50                 5                      10–12
24                     48                    4                      8–9
23                     46–47                 3                      6–7
22                     44–45                 2                      4–5
21                     42–43                 1                      1–3
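The Rasch-based procedure described in this section (raw score to ability estimate, then ability estimate to an equivalent raw score on the other form) can be illustrated with the brief sketch below. The item difficulty values are randomly generated stand-ins, not the calibrated parameters behind Table 7.2, so the output will not reproduce the published equivalencies; the sketch only shows the general logic of true-score equating under a Rasch model.

```python
# Illustrative only: Rasch true-score equating between a 40-item short form and an
# 80-item full form. Difficulty parameters are invented; real equating would use the
# calibrated item parameters for the actual forms.
import numpy as np

def expected_raw_score(theta, difficulties):
    # Test characteristic curve: expected number-correct score at ability level theta.
    return float(np.sum(1.0 / (1.0 + np.exp(-(theta - np.asarray(difficulties))))))

def raw_score_to_theta(raw_score, difficulties):
    # Invert the test characteristic curve by interpolating over a grid of theta values.
    grid = np.linspace(-6.0, 6.0, 1201)
    curve = np.array([expected_raw_score(t, difficulties) for t in grid])
    return float(np.interp(raw_score, curve, grid))

def equivalent_full_form_score(short_raw, short_b, full_b):
    theta = raw_score_to_theta(short_raw, short_b)   # short-form raw score -> ability estimate
    return round(expected_raw_score(theta, full_b))  # ability estimate -> full-form raw score

rng = np.random.default_rng(0)
short_b = rng.normal(0.0, 1.0, 40)                              # hypothetical short-form difficulties
full_b = np.concatenate([short_b, rng.normal(0.0, 1.0, 40)])    # short-form items embedded in the full form
print(equivalent_full_form_score(30, short_b, full_b))
```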
Table 7.3 presents means, standard deviations, and correlations obtained from an analysis of the resulting data. As indicated in the table, neither mode of administration yielded consistently higher raw scores, and mean score differences between modes were less than one point (0.5 and 0.7). The variability of scores also was very similar, with standard deviations ranging from 5.5 to 5.7. The coefficients indicate that paper-and-pencil raw scores correlate very highly with online administration raw scores (.86 for the group tested on paper first and .88 for the group tested online first). The high correlations provide further support that the two modes of administration can be considered equivalent. Thus, raw scores on one form (paper or online) may be interpreted as having the same meaning as identical raw scores on the other form.

Table 7.3 Equivalency of Paper and Online Modes of Administration

                                         Paper            Online
Administration Order           N      Mean    SD       Mean    SD       r
Paper Followed by Online      118     30.1    5.7      30.6    5.5     .86
Online Followed by Paper      108     29.5    5.5      28.8    5.7     .88

8 Evidence of Reliability

The reliability of a measurement instrument refers to the accuracy and precision of test results and is a widely used indicator of the confidence that may be placed in those results. The reliability of a test is expressed as a correlation coefficient that represents the consistency of scores that would be obtained if a test could be given an infinite number of times. In actual practice, however, we do not have the luxury of administering a test an infinite number of times, so we can expect some measurement error. Reliability coefficients help us to estimate the amount of error associated with test scores.

Reliability coefficients can range from .00 to 1.00. The closer the reliability coefficient is to 1.00, the more reliable the test. A perfectly reliable test would have a reliability coefficient of 1.00 and no measurement error. A completely unreliable test would have a reliability coefficient of .00. The U.S. Department of Labor (1999) provides the following general guidelines for interpreting a reliability coefficient: above .89 is considered “excellent,” .80–.89 is “good,” .70–.79 is considered “adequate,” and below .70 “may have limited applicability.” The methods most commonly used to estimate test reliability are test-retest (the stability of test scores over time), alternate forms (the consistency of scores across alternate forms of a test), and internal consistency of the test items (e.g., Cronbach’s alpha coefficient; Cronbach, 1970).

Since repeated testing always results in some variation, no single test event ever measures an examinee’s actual ability with complete accuracy. We therefore need an estimate of the possible amount of error present in a test score, or the amount that scores would probably vary if an examinee were tested repeatedly with the same test. This error is known as the standard error of measurement (SEM). The SEM decreases as the reliability of a test increases; a large SEM denotes less reliable measurement and less reliable scores. The SEM is a quantity that is added to and subtracted from an examinee’s test score to create a confidence interval, or band of scores, around the obtained score. The confidence interval is a score range that, in all likelihood, includes the examinee’s hypothetical “true” score, which represents the examinee’s actual ability. A true score is a theoretical score entirely free of error.
Since the true score is a hypothetical value that can never be obtained because testing always involves some measurement error, the score obtained by an examinee on any test will vary somewhat from administration to administration. As a result, any obtained score is considered only an estimate of the examinee’s “true” score. Approximately 68% of the time, the observed score will lie within +1.0 and –1.0 SEM of the true score; 95% of the time, the observed score will lie within +1.96 and –1.96 SEM of the true score; and 99% of the time, the observed score will lie within +2.58 and –2.58 SEM of the true score.

Historical Reliability

Previous Studies of Internal Consistency Reliability

For the sample used in the initial 1994 development of the Watson-Glaser Short Form (N = 1,608), Cronbach’s alpha coefficient (r) was .81. Cronbach’s alpha and the SEM were also calculated for a number of groups separately, including some groups that were in the development sample and some that were not in the development sample (see Table 8.1).

Table 8.1 Means, Standard Deviations (SD), Standard Errors of Measurement (SEM), and Internal Consistency Reliability Coefficients (ralpha), Based on Previous Studies

Group                                                                          N     Mean    SD    SEM   ralpha
Lower-level management applicants                                             219   33.50   4.40   2.17   .76
Lower to upper-level management applicants                                    501   32.29   4.63   2.31   .75
Mid-level management applicants                                               211   33.99   4.20   2.12   .74
Upper-level management applicants at a Board of County Commissioners          215   31.80   5.20   2.35   .80
Executive management applicants                                               453   33.42   4.21   2.18   .73
Construction management applicants                                            322   32.05   4.87   2.32   .77
Supervisory and managerial applicants in the corrugated container industry    149   31.48   5.00   2.39   .77
Sales applicants                                                              473   30.88   4.98   2.43   .76
Mid-level marketing applicants                                                909   31.02   5.08   2.42   .77
Bank employees                                                                 95   32.75   4.58   2.25   .76
Bank management associates                                                    131   31.61   4.69   2.39   .74
Candidates for the ministry                                                   126   34.10   4.71   2.08   .80
Clergy                                                                         99   34.56   3.79   2.05   .71
Railroad dispatchers                                                          199   25.15   5.00   2.78   .69
Nurse managers and educators                                                  111   30.52   4.86   2.46   .74
Police officers                                                               225   28.00   6.03   2.64   .81
Various occupations                                                           133   30.68   6.65   2.40   .87
Administrative applicants in city government ¹                                 23   30.43   5.82   2.42   .83
Security applicants                                                            42   25.00   4.79   2.77   .67
Candidates for police captain ²                                                41   27.95   4.60   2.69   .66
Police department executives ³                                                 55   32.56   4.14   2.32   .69

¹ D.O.T. Code 169.167-010
² D.O.T. Code 375.167-034
³ Includes Commander (D.O.T. Codes 375.167-034 and 375.267-026), Chief (D.O.T. Code 375.117-010), Deputy Chief (D.O.T. Code 375.267-026), and Warden (D.O.T. Code 187.117-018)

Using the SEM means that scores are interpreted as bands or ranges of scores, rather than as precise points. Thinking in terms of score ranges serves as a check against overemphasizing small differences between scores. The SEM may be used to determine whether an individual’s score is significantly different from a cut score, or whether the scores of two individuals differ significantly. An example of one general rule of thumb is that the difference between two scores on the same test should not be interpreted as significant unless the difference is equal to at least twice the SEM of the test (Aiken, 1979; as reported in Cascio, 1982).
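The band-of-scores interpretation described above can be sketched in a few lines. This is an illustration rather than part of the manual; it uses the standard formula SEM = SD × √(1 − reliability), which closely reproduces the SEM values tabled in Table 8.1, and the example values come from the lower-level management applicant row of that table.

```python
import math

def sem(sd: float, reliability: float) -> float:
    """Standard error of measurement: SD multiplied by sqrt(1 - reliability)."""
    return sd * math.sqrt(1.0 - reliability)

def score_band(observed: float, sem_value: float, z: float = 1.0):
    """Confidence band of observed +/- z * SEM (z = 1.0, 1.96, or 2.58 for
    roughly 68%, 95%, or 99% bands, as described in the text)."""
    return observed - z * sem_value, observed + z * sem_value

def reliably_different(score_1: float, score_2: float, sem_value: float) -> bool:
    """Rule of thumb cited above: treat two scores on the same test as
    different only if they differ by at least twice the SEM."""
    return abs(score_1 - score_2) >= 2.0 * sem_value

# Values for the lower-level management applicant group in Table 8.1.
sd, r_alpha = 4.40, 0.76
s = sem(sd, r_alpha)          # about 2.16, close to the 2.17 tabled above
print(f"SEM = {s:.2f}")
print("68% band around a raw score of 33:", score_band(33, s))
print("95% band around a raw score of 33:", score_band(33, s, z=1.96))
print("Raw scores 33 and 36 reliably different?", reliably_different(33, 36, s))
```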
The internal consistency estimates calculated for the individual Short Form tests (i.e., the five subtest scores) were moderately low, consistent with research involving previous forms of the Watson-Glaser; for this reason, individual test scores should not be used.

Previous Studies of Test-Retest Reliability

In 1994, a study investigating the test-retest reliability and required completion time of the Watson-Glaser Short Form was conducted at a large publishing company. A sample of 42 employees (92.9% non-minority; 54.8% female) completed the Short Form on two occasions, two weeks apart. The participants worked in a variety of positions ranging from Secretary to Project Director. The mean score at the first testing was 30.5 (SD = 5.6) and at the second testing 31.4 (SD = 5.9), while the test-retest correlation was .81 (p < .001). Scores for female (Mean = 31.0, SD = 6.1) and male (Mean = 31.8, SD = 5.7) respondents were not significantly different (t = 0.11, df = 40).

Current Reliability Studies

Evidence of Internal Consistency Reliability

Cronbach’s alpha and the standard error of measurement (SEM) were calculated for the samples used for the current norm groups (see Table 8.2). Reliability estimates for these samples were similar to those found in previous studies and ranged from .76 to .85. Consistent with previous research, these values indicate that the total score possesses adequate reliability. The individual test scores obtained lower estimates of internal consistency reliability, suggesting that the test scores alone should not be used.

Table 8.2 Means, Standard Deviations (SD), Standard Errors of Measurement (SEM), and Internal Consistency Reliability Coefficients (ralpha) for the Current Short Form Norm Groups

Group                                                        N     Mean    SD    SEM   ralpha
Industry
  Advertising/Marketing/Public Relations                    101    28.7    6.1   2.52   .83
  Education                                                 119    30.2    5.4   2.47   .79
  Financial Services/Banking/Insurance                      228    31.2    5.7   2.42   .82
  Government/Public Service/Defense                         130    30.0    6.3   2.44   .85
  Health Care                                               195    28.3    6.5   2.60   .84
  Information Technology/Telecommunications                 295    31.2    5.5   2.40   .81
  Manufacturing/Production                                  561    32.0    5.3   2.31   .81
  Professional Business Services (e.g., Consulting, Legal)  153    31.9    5.6   2.31   .83
  Retail/Wholesale                                          307    30.8    5.3   2.43   .79
Occupation
  Accountant/Auditor/Bookkeeper                             118    30.2    5.8   2.46   .82
  Consultant                                                139    33.3    4.8   2.20   .79
  Engineer                                                  225    32.8    4.8   2.25   .78
  Human Resource Professional                               140    30.0    5.7   2.48   .81
  Information Technology Professional                       222    31.4    5.9   2.36   .84
  Sales Representative–Non-Retail                           353    29.8    5.1   2.50   .76
Position Type/Level
  Executive                                                 409    33.4    4.5   2.20   .76
  Director                                                  387    32.9    4.7   2.20   .78
  Manager                                                   973    30.7    5.4   2.41   .80
  Supervisor                                                202    28.8    6.2   2.63   .82
  Professional/Individual Contributor                       842    30.6    5.6   2.44   .81
  Hourly/Entry-Level                                        332    27.7    5.9   2.64   .80
Norms by Occupation Within Specific Industry
  Manager in Manufacturing/Production                       170    31.9    5.3   2.31   .81
  Engineer in Manufacturing/Production                      112    32.9    4.7   2.25   .77

Evidence of Test-Retest Reliability

Test-retest reliability was evaluated for the total score and for the individual test scores in a sample of job incumbents representing various organizational levels and industries (N = 57). The test-retest intervals ranged from 4 to 26 days, with a mean interval of 11 days. As the data in Table 8.3 indicate, the Watson-Glaser Total score demonstrates acceptable test-retest reliability (r12 = .89). The difference in mean scores between the first testing and the second testing is small (d′ = 0.17).
This difference (d′), proposed by Cohen (1988), is useful as an index to measure the magnitude of the actual difference between two means. The difference (d′) is calculated by dividing the difference between the two test means by the square root of the pooled variance, using Cohen’s (1996) Formula 10.4. Across the test-retest administrations, the Watson-Glaser Total score has a small difference index (d′ = 0.17), indicating that the magnitude of the difference in mean scores between the first testing and the retesting is small. In other words, the Watson-Glaser Total score is stable over the test-retest period. The test-retest reliability coefficients of the test scores are somewhat lower, suggesting that the Total score is more reliable than the test scores as a measure of critical thinking.

Table 8.3 Test-Retest Reliability of the Short Form

                                First Testing      Second Testing
Score                           Mean      SD       Mean      SD       r12    Difference (d′)
Total                           29.5      7.0      30.7      7.0      .89         .17
Inference                        4.3      1.9       4.6      1.9      .70         .16
Recognition of Assumptions       6.1      2.2       6.5      2.0      .83         .19
Deduction                        7.1      1.8       7.0      2.0      .55        –.05
Interpretation                   5.0      1.6       5.1      1.6      .78         .06
Evaluation of Arguments          7.2      1.6       7.5      1.3      .67         .21

9 Evidence of Validity

The validity of a test refers to the degree to which specific data, research, or theory support that the test measures what it is intended to measure. Validity is a unitary concept. It is the extent to which all the accumulated evidence supports the intended interpretation of test scores for the proposed purpose (AERA, APA, & NCME, 1999). “Validity is high if a test gives the information the decision maker needs” (Cronbach, 1970). Data from the Short Form sample were analyzed for evidence of validity based on content, test-criterion relationships, and evidence of convergent and discriminant validity.

Evidence of Validity Based on Content

Evidence based on the content of a test exists when a test includes a representative sample of tasks, behaviors, knowledge, skills, abilities, or other characteristics necessary to perform the job. Evidence of content validity is usually gathered through job analysis and is most appropriate for evaluating knowledge and skills tests. Evaluation of content-related evidence is usually a rational, judgmental process (Cascio & Aguinis, 2005). In employment settings, the principal concern is with making inferences about how well the test samples a job performance domain—a segment or aspect of the job performance universe which has been identified and about which inferences are to be made (Lawshe, 1975). Because most jobs have several performance domains, a standardized test generally applies only to one segment of the job performance universe (e.g., a typing test administered to a secretary applies to typing, one job performance domain in the job performance universe of a secretary). Thus, the judgment of whether content-related evidence exists depends upon an evaluation of whether the same capabilities are required in both the job performance domain and the test (Cascio & Aguinis, 2005). In an employment setting, evidence based on test content should be established by demonstrating that the jobs for which the test will be used require the critical thinking abilities measured by the Watson-Glaser. Content-related validity evidence of the Watson-Glaser in classroom and instructional settings may be examined by noting the extent to which the Watson-Glaser measures a sample of the specified objectives of such instructional programs.
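For reference, the d′ statistic reported in Table 8.3 above can be reproduced in a few lines. This is a sketch, not part of the manual; the pooled-variance form shown is the standard one and is assumed here to correspond to Cohen’s (1996) Formula 10.4.

```python
import math

def d_prime(mean_1: float, sd_1: float, n_1: int,
            mean_2: float, sd_2: float, n_2: int) -> float:
    """Standardized mean difference: the difference between the two means
    divided by the square root of the pooled variance."""
    pooled_var = ((n_1 - 1) * sd_1 ** 2 + (n_2 - 1) * sd_2 ** 2) / (n_1 + n_2 - 2)
    return (mean_2 - mean_1) / math.sqrt(pooled_var)

# Total-score values from Table 8.3 (N = 57 at each testing).
print(f"d' = {d_prime(29.5, 7.0, 57, 30.7, 7.0, 57):.2f}")   # 0.17, as reported
```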
Evidence of Criterion-Related Validity

One of the primary reasons tests are used is to provide an educated guess about an examinee’s potential for future success. For example, selection tests are used to hire or promote those individuals most likely to be productive employees. The rationale behind selection tests is this: the better an individual performs on a test, the better this individual will perform as an employee. Criterion-related validity evidence addresses the inference that individuals who score better on tests will be successful on some criterion of interest. Criterion-related validity evidence indicates the statistical relationship (e.g., for a given sample of job applicants or incumbents) between scores on the test and one or more criteria, or between scores on the tests and independently obtained measures of subsequent job performance. By collecting test scores and criterion scores (e.g., job performance ratings, grades in a training course, supervisor ratings), one can determine how much confidence may be placed in using test scores to predict job success. Typically, correlations between criterion measures and scores on the test serve as indices of criterion-related validity evidence. Provided that the conditions for a meaningful validity study have been met (sufficient sample size, adequate criteria, etc.), these correlation coefficients are important indices of the utility of the test.

Unfortunately, the conditions for evaluating criterion-related validity evidence are often difficult to fulfill in the ordinary employment setting. Studies of test-criterion relationships should involve a sufficiently large number of persons hired for the same job and evaluated for success using a uniform criterion measure. The criterion itself should be reliable and job-relevant, and should provide a wide range of scores. In order to evaluate the quality of studies of test-criterion relationships, it is essential to know at least the size of the sample and the nature of the criterion.

Assuming that the conditions for a meaningful evaluation of criterion-related validity evidence have been met, Cronbach (1970) characterized validity coefficients of .30 or better as having “definite practical value.” The U.S. Department of Labor (1999) provides the following general guidelines for interpreting validity coefficients: above .35 are considered “very beneficial,” .21–.35 are considered “likely to be useful,” .11–.20 “depends on the circumstances,” and below .11 “unlikely to be useful.” It is important to point out that even relatively low validities (e.g., .20) may justify the use of a test in a selection program (Anastasi & Urbina, 1997). This is because the practical value of the test depends not only on the validity, but also on other factors, such as the base rate for success on the job (i.e., the proportion of people who would be successful in the absence of any selection procedure). If the base rate for success on the job is low (i.e., few people would be successful on the job), tests of low validity can have considerable utility or value. When the base rate is high (i.e., most people selected at random would succeed on the job), even highly valid tests may not contribute significantly to the selection process.

In addition to the practical value of validity coefficients, the statistical significance of coefficients should be noted. Statistical significance refers to the odds that a non-zero correlation could have occurred by chance.
If the odds are 1 in 20 that a non-zero correlation could have occurred by chance, then the correlation is considered statistically significant. Some experts prefer even more stringent odds, such as 1 in 100, although the generally accepted odds are 1 in 20. In statistical analyses, these odds are designated by the lower case p (probability) to signify whether a non-zero correlation is statistically significant. When p is less than or equal to .05, the odds are presumed to be 1 in 20 (or less) that a non-zero correlation of that size could have occurred by chance. When p is less than or equal to .01, the odds are presumed to be 1 in 100 (or less) that a non-zero correlation of that size occurred by chance.

Previous Studies of Evidence of Criterion-Related Validity

Previous studies have shown evidence of the relationship between Watson-Glaser scores and various job and academic success criteria. Gaston (1993), in a study of law enforcement personnel, found a relationship between Watson-Glaser scores and organizational level. Among his findings were that executives scored in the 7–9 decile range more often than non-executives, and non-executives scored in the 1–3 decile range more often than executives. Holmgren and Covin (1984), in a study of students majoring in education, found Watson-Glaser scores correlated .50 with grade-point average (GPA) and .46 with English Proficiency Test scores. In studies of nursing students, Watson-Glaser scores correlated .50 with National Council Licensure Exam (NCLEX) scores (Bauwens & Gerhard, 1987) and .38 with state licensing exam scores (Gross, Takazawa, & Rose). Additional studies of the relationship between Watson-Glaser scores and various criteria are reported in the previous version of the manual (1994).

Current Studies of Evidence of Criterion-Related Validity

Studies continue to provide strong criterion-related validity evidence for the Watson-Glaser. Kudisch and Hoffman (2002) reported that, in a sample of 71 leadership assessment center participants, Watson-Glaser scores correlated .58 with ratings on Analysis and .43 with ratings on Judgment. Ratings on Analysis and Judgment were based on participants’ performance on assessment center exercises, including a coaching meeting, an in-basket exercise or simulation, and a leaderless group discussion. Spector, Schneider, Vance, and Hezlett (2000) evaluated the relationship between Watson-Glaser scores and assessment center exercise performance for managerial- and executive-level assessment center participants. They found that Watson-Glaser scores significantly correlated with overall scores on six of eight assessment center exercises, and related more strongly to exercises involving primarily cognitive problem-solving skills (e.g., r = .26, p < .05, with in-basket scores) than to exercises involving a greater level of interpersonal skills (e.g., r = .16, p < .05, with the in-basket coaching exercise). In a study we conducted for this revision of the manual in 2005, we examined the relationship between Watson-Glaser scores and on-the-job performance of 142 job incumbents in various industries. Job performance was defined as supervisory ratings on behaviors determined through research to be important to most professional, managerial, and executive jobs.
The study found that Watson-Glaser scores correlated .33 with supervisory ratings on a dimension made up of Analysis and Problem Solving behaviors, and .23 with supervisory ratings on a dimension made up of Judgment and Decision Making behaviors. Supervisory ratings from the sum of ratings on 19 job performance behaviors (“Total Performance”), as well as ratings on a single-item measure of “Overall Potential,” were also obtained. The Watson-Glaser scores correlated .28 with “Total Performance” and .24 with ratings of Overall Potential.

In an analysis of a sub-group of the 2005 study mentioned above, we examined the relationship between the Watson-Glaser scores and on-the-job performance of 64 analysts from a government agency. The results showed that Watson-Glaser scores correlated .40 with supervisory ratings on each of the two dimensions composed of (a) Analysis and Problem Solving behaviors and (b) Judgment and Decision Making behaviors, and correlated .37 with supervisory ratings on a dimension composed of behaviors dealing with Professional/Technical Knowledge and Expertise. In the sample of 64 analysts mentioned above, the Watson-Glaser scores correlated .39 with “Total Performance” and .25 with Overall Potential. Another part of the study we conducted in 2005 for this revision of the manual examined the relationship between Watson-Glaser scores and job success as indicated by organizational level achieved, for 2,303 job incumbents across 9 industry categories. Results indicated that Watson-Glaser scores correlated .33 with organizational level.

Other studies of job-relevant criteria have found significant correlations between Watson-Glaser scores and creativity (Gadzella & Penland, 1995), facilitator effectiveness (Offner, 2000), positive attitudes toward women (Loo & Thorpe, 2005), and openness to experience (Spector et al., 2000). In the educational domain, Behrens (1996) found that Watson-Glaser scores correlated .59, .53, and .51, respectively, with semester GPA for three freshman classes in a Pennsylvania nursing program. Similarly, Gadzella, Baloglu, and Stephens (2002) found Watson-Glaser subscale scores explained 17% of the total variance in GPA (equivalent to a multiple correlation of .41) for 114 Education students. Williams (2003), in a study of 428 educational psychology students, found Watson-Glaser total scores correlated .42 and .57 with mid-term and final exam scores, respectively. Gadzella, Ginther, and Bryant (1996), in a study of 98 college freshmen, found that Watson-Glaser scores were significantly higher for A students than for B and C students, and significantly higher for B students relative to C students. Studies have also shown significant relationships between Watson-Glaser scores and clinical decision-making effectiveness (Shin, 1998), educational experience and level (Duchesne, 1996; Shin, 1998; Yang & Lin, 2004), educational level of parents (Yang & Lin, 2004), academic performance during pre-clinical years of medical education (Scott & Markert, 1994), and preference for contingent, relativistic thinking versus “black-white, right-wrong” thinking (Taube, 1995).

Table 9.1 presents a summary of studies that evaluated criterion-related validity evidence for the Watson-Glaser since 1994, when the previous manual was published. Only studies that reported validity coefficients are shown. Additional studies are reported in this chapter as well as in the previous version of the manual (1994).
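As a point of reference (not part of the original manual), the statistical significance of a validity coefficient of the size reported above can be checked with the standard t test for a Pearson correlation. The sketch below uses the .28 correlation with “Total Performance” and the N of 142 from the 2005 study described earlier; the function name is illustrative.

```python
import math
from scipy import stats

def correlation_p_value(r: float, n: int) -> float:
    """Two-tailed p value for a Pearson correlation r based on n cases, using
    t = r * sqrt(n - 2) / sqrt(1 - r**2) with n - 2 degrees of freedom."""
    t = r * math.sqrt(n - 2) / math.sqrt(1.0 - r ** 2)
    return 2.0 * stats.t.sf(abs(t), n - 2)

# The .28 correlation between Watson-Glaser scores and "Total Performance"
# ratings in the 142-person 2005 study described above.
print(f"p = {correlation_p_value(0.28, 142):.4f}")   # well below .01
```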
In Table 9.1, the column entitled N details the number of cases in the sample. The criterion measures include job performance and grade point average, among others. Means and standard deviations, for studies in which they were available, are shown for both the test and criterion measures. The validity coefficient for the sample appears in the last column. Validity coefficients such as those reported in Table 9.1 apply to the specific samples listed. 36 Chapter 9 Evidence of Validity Table 9.1 Studies Showing Evidence of Criterion-Related Validity Watson-Glaser Group N Leadership assessment center participants from a national retail chain and a utility service (Kudisch & Hoffman, 2002) 71 Middle-management assessment center participants (Spector, Schneider, Vance, & Hezlett, 2000) 189–407 Form 80-item 80-item Mean – 66.5 Criterion SD – 7.3 Mean SD Assessor Ratings: Description r Analysis – – .58* Judgment Assessor Ratings: – – .43* In-basket 2.9 0.7 .26* In-basket Coaching 3.1 0.7 .16* Leaderless Group 3.0 0.6 .19* Project Presentation 3.0 0.7 .25* Project Discussion 2.9 0.6 .16* Team Presentation 3.1 0.6 .28* 41.8 6.4 .36* CPI Score: Job incumbents across multiple industries (Harcourt Assessment, Inc., 2005) Job applicants and incumbents across multiple industries (Harcourt Assessment, Inc., 2005) 142 2,303 Short Short Openness to Experience Supervisory Ratings: Analysis and Problem Solving .33** Judgment and Decision Making .23** Total Performance .28** Potential .24** .33* Org. Level (continued) 37 Watson-Glaser Short Form Manual Table 9.1 Studies Showing Evidence of Criterion-Related Validity (continued) Watson-Glaser Group N Form Incumbent analysts from a government agency (Harcourt Assessment, Inc., 2005) 64 Short Mean Criterion SD Description Mean SD Supervisory Ratings: Analysis and Problem Solving .40** Judgment and Decision Making .40** Professional /Technical Knowledge & Expertise .37** Total Performance .39** Potential Freshmen classes in a Pennsylvania nursing program (Behrens, 1996) 50.5 – Semester 1 GPA 2.5 – .25* .59** 31 52.1 – Semester 1 GPA 2.5 – .53** 37 114 80-item 52.1 51.4 – 9.8 Semester 1 GPA GPA 2.4 3.1 – .51 .51** .41** 158–164 Short – – Exam 1 Score – – .42** Exam 2 Score Education Level – – – – .57** .57** 8.1 GPA 2.8 .51 .30* – Checklist of Educational Views Course Grades 41 Education majors (Gadzella, Baloglu, & Stephens, 2002) Educational psychology students (Williams, 2003) Job applicants and incumbents across multiple industries (Harcourt Assessment, Inc., 2005) Education majors (Taube, 1995) Educational psychology students (Gadzella, Stephens, & Stacks, 2004) r 147–194 139 80-item 80-item 80-item 54.9 – GPA 50.1 7.6 .33* – – .42** – – .28** * p < .05. ** p < .01 Test users should not automatically assume that these data constitute sole and sufficient justification for use of the Watson-Glaser. Inferring validity for one group from data reported for another group is not appropriate unless the organizations and job categories being compared are demonstrably similar. Careful examination of Table 9.1 can help test users make an informed judgment about the appropriateness of the Watson-Glaser for their own organization. However, the data presented here are not intended to serve as a substitute for locally obtained data. Locally conducted validity studies, together with locally derived norms, provide a sound basis for determining the most appropriate use of the Watson-Glaser. 
Hence, whenever technically feasible, test users should study the validity of the Watson-Glaser, or any selection test, at their own location. Sometimes it is not possible for a test user to conduct a local validation study. There may be too few incumbents in a particular job, an unbiased and reliable measure of job performance may not be available, or there may not be a sufficient range in the ratings of job performance to justify the computation of validity coefficients. In such circumstances, evidence of a test’s validity reported elsewhere may be relevant, provided that the data refer to comparable jobs.

Evidence of Convergent and Discriminant Validity

Convergent evidence is provided when scores on a test relate to scores on other tests or variables that purport to measure similar traits or constructs. Evidence of relations with other variables can involve experimental (or quasi-experimental) as well as correlational evidence (American Educational Research Association, American Psychological Association, & National Council on Measurement in Education, 1999). Discriminant evidence is provided when scores on a test do not relate closely to scores on tests or variables that measure different traits or constructs.

Previous Studies of Evidence of Convergent and Discriminant Validity

Convergent validity evidence for the Watson-Glaser has been provided in a variety of instructional settings. Given that the Watson-Glaser is a measure of critical thinking ability, experience in programs aimed at developing this ability should be reflected in changes in performance on the test. Sorenson (1966) found that participants in laboratory-centered biology courses showed greater change in Watson-Glaser scores than did members of classes where the teaching method was predominantly a traditional lecture approach. Similarly, Agne and Blick (1972) found that Watson-Glaser performance was differentially affected by teaching earth science through a data-centered, experimental approach as compared with a traditional lecture approach. In both of these studies, critical thinking was cited as a key curriculum element for the nontraditional teaching methods. In addition, studies have reported that Watson-Glaser scores may be influenced by critical thinking and problem-solving courses (Arand & Harding, 1987; Herber, 1959; Pierce, Lemke, & Smith, 1988; Williams, 1971), debate training (Brembeck, 1949; Colbert, 1987; Follert & Colbert, 1983; Jackson, 1961), developmental advising (Frost, 1989, 1991), group problem-solving (Gadzella, Hartsoe, & Harper, 1989; Garris, 1974; Goldman & Goldman, 1981; Neimark, 1984), reading and speech courses (Brownell, 1953; Duckworth, 1968; Frank, 1969; Livingston, 1965; Ness, 1967), exposure to college curriculum (Berger, 1984; Burns, 1974; Fogg & Calia, 1967; Frederickson & Mayer, 1977; McMillan, 1987; Pardue, 1987), and computer-related courses (Jones, 1988; Wood, 1980; Wood & Stewart, 1987). Convergent validity evidence for the Watson-Glaser has also been provided in studies that examined its relationship with other tests.
For example, studies reported in Watson and Glaser (1994) showed significant relationships between Watson-Glaser scores and scores on the following tests: the Otis-Lennon Mental Ability Test (Forms J & K), the California Test of Mental Maturity, the Verbal IQ test of the Wechsler Adult Intelligence Scale (WAIS), the Miller Analogies Test, the Wesman Personnel Test, the Differential Aptitude Test (Abstract Reasoning), the College Entrance Examination Board Scholastic Aptitude Tests (SAT), the Stanford Achievement Tests, and the American College Testing Program (ACT).

Discriminant validity evidence for the Watson-Glaser has been shown in studies such as Robertson and Molloy (1982), which found non-significant correlations between Watson-Glaser scores and measures of dissimilar constructs such as Social Skills and Neuroticism. Similarly, Watson and Glaser (1994) found that Watson-Glaser scores correlated more strongly with a test measuring reasoning ability (i.e., the Wesman Personnel Test) than with a test measuring the conceptually less closely related construct of visual-spatial ability (i.e., Employee Aptitude Survey—Space Visualization).

Studies of the Relationship Between the Watson-Glaser and General Intelligence

An area concerning both convergent and discriminant validity evidence is the relationship of the Watson-Glaser to general intelligence. Although the Watson-Glaser has been found to correlate with general intelligence, its overlap as a construct is not complete. Factor analyses of the Watson-Glaser Short Form tests with other measures of intelligence generally indicate that the Watson-Glaser is measuring a dimension of ability that is distinct from overall intellectual ability. Landis (1976), for example, performed a factor analysis of the Watson-Glaser tests with measures drawn from the Guilford Structure of Intellect Model. The Watson-Glaser tests were found to reflect a dimension of intellectual functioning that was independent of that tapped by the measures of the structure of the intellect system. Follman, Miller, and Hernandez (1969) also found that the Watson-Glaser tests were represented by high loadings on a single factor when analyzed along with various achievement and ability tests. Ross (1977) reported that the Watson-Glaser Inference and Deduction tests loaded on a verbally based induction factor and a deduction factor, respectively.

Current Studies of Evidence of Convergent and Discriminant Validity

Studies of instructional programs aimed at developing critical thinking abilities continue to support the effectiveness of the Watson-Glaser as a measure of such abilities. For example, increases in critical thinking have been found as a result of scenario-based community health education (Sandor, Clark, Campbell, Rains, & Cascio, 1998), medical school curriculum (Scott, Markert, & Dunn, 1998), critical thinking instruction (Gadzella, Ginther, & Bryant, 1996), teaching methods designed to stimulate critical thinking (Williams, 2003), nursing program experience (Frye, Alfred, & Campbell, 1999; Pepa, Brown, & Alverson, 1997), and communication skills education such as public speaking courses (Allen, Berkowitz, Hunt, & Louden, 1999). Recent studies have also shown that the Watson-Glaser relates to tests of similar and dissimilar constructs in an expected manner.
In 2005, Harcourt Assessment conducted a study of 63 individuals employed in various roles and industries and found that Watson-Glaser total scores correlated .70 with scores on the Miller Analogies Test for Professional Selection. Rust (2002), in a study of 1,546 individuals from over 50 different occupations in the United Kingdom, reported a correlation of .63 between the Watson-Glaser and a test of critical thinking with numbers, the Rust Advanced Numerical Reasoning Appraisal. In a study of Education majors, Taube (1995) found that Watson-Glaser scores correlated .37 with scores on an essay test designed to measure critical thinking (Ennis-Weir Critical Thinking Essay Test), .43 with SAT-Verbal scores, and .39 with SAT-Math scores. 40 Chapter 9 Evidence of Validity Regarding discriminant validity evidence, non-significant correlations have been found between Watson-Glaser scores and tests measuring dissimilar constructs such as the Big Five construct Emotional Stability (Spector, et al., 2000) and the psychological type characteristic Thinking/Feeling (Yang & Lin, 2004). Table 9.2 presents correlations between the Watson-Glaser and other tests. Additional studies are reported in the previous version of manual (1994). In Table 9.2, a description of the study participants appears in the first column. The second column lists the total number of participants (N), followed by the Watson-Glaser form for which data were collected, the mean and standard deviation (SD) of Watson-Glaser scores, and the comparison test name. The mean and standard deviation of scores on the comparison test are reported next, followed by the correlation coefficient (r) indicating the relationship between scores on the Watson-Glaser and the comparison test. High correlation coefficients indicate overlap between Watson-Glaser and comparison tests. Low correlations, on the other hand, suggest that the tests measure different traits. Table 9.2 Watson-Glaser Convergent Evidence of Validity Watson-Glaser Group N Form Job incumbents across multiple industries (Harcourt Assessment, Inc., 2005) 63 Short Job incumbents from multiple occupations in UK (Rust, 2002) 1,546 CUK Education majors (Taube, 1995) 147–194 80-item Mean Other Test SD Description Mean SD r Miller Analogies Test for Professional Selection .70** .63** – – Rust Advanced Numerical Reasoning Appraisal 54.9 8.1 SAT-Verbal 431.5 75.3 .43** SAT-Math 495.5 91.5 .39* Ennis-Weir Critical Thinking Essay Test 14.6 6.1 .37* Baccalaureate Nursing Students (Adams, Stover, & Whitlow, 1999) 203 80-item 54.0 9.3 ACT Composite 21.0 – .53** Dispatchers at a Southern railroad company (Watson & Glaser, 1994) 180 Short 24.9 5.0 Industrial Reading Test 29.9 4.4 .53** 73.7 11.4 .50** Lower-level management applicants (Watson & Glaser, 1994) 219 Wesman, Verbal 27.5 6.0 .51** EAS, Verbal Comp. 20.7 3.1 .54** 16.7 4.6 .48** Test of Learning Ability 217 217 Short 33.5 4.4 EAS, Verbal Reasoning (continued) 41 Watson-Glaser Short Form Manual Table 9.2 Watson-Glaser Convergent Evidence of Validity (continued) Watson-Glaser Group N Form Mean SD Description Mean SD r Mid-level management applicants (Watson & Glaser, 1994) 209 Short 34.0 4.2 Wesman, Verbal 27.5 6.0 .66** EAS, Verbal Comp. 21.0 3.0 .50** 16.6 4.9 .51** 27.0 5.8 .54** 21.1 3.4 .42** 16.2 4.2 .47** Executive management applicants (Watson & Glaser, 1994) * p < .05. ** p < .01. 42 Other Test 209 208 440 437 436 Short 33.4 4.2 EAS, Verbal Reasoning Wesman, Verbal EAS, Verbal Comp. 
EAS, Verbal Reasoning

10 Using the Watson-Glaser as an Employment Selection Tool

The Watson-Glaser is used to predict success in certain occupations and instructional programs that require critical thinking ability. The test is also used to measure gains in critical thinking ability resulting from instructional and training programs, and to determine, for research purposes, the relationship between critical thinking ability and other abilities or traits.

Employment Selection

Many organizations use testing as a component of their employment selection process. Typical selection test programs make use of cognitive ability tests, aptitude tests, personality tests, basic skills tests, and work values tests, to name a few. Tests are used to screen out unqualified candidates, to categorize prospective employees according to their probability of success on the job, or to rank order a group of candidates according to merit. The Watson-Glaser is designed to assist in the selection of employees for jobs that require careful, analytical reasoning. Many executive, administrative, and technical professions require the type of critical thinking ability measured by the Watson-Glaser. The test has been used to assess applicants for a wide variety of jobs, including administrative and sales positions, and lower- to upper-level management jobs in construction, production, marketing, healthcare, financial, police, public sector organizations, teaching facilities, and religious institutions.

It should not be assumed that the type of critical thinking required in a particular job is identical to that measured by the Watson-Glaser. Job analysis and validation of the Watson-Glaser for selection purposes should follow accepted human resource research procedures, and conform to existing guidelines concerning fair employment practices. In addition, no single test score can possibly suggest all of the requisite knowledge and skills necessary for success in a job. It is the responsibility of the hiring authority to determine how it uses the Watson-Glaser scores. It is recommended that if the hiring authority establishes a cut score, examinees’ scores should be considered in the context of appropriate measurement data for the test, such as the standard error of measurement and data regarding the predictive validity of the test. In addition, it is recommended that selection decisions be based on multiple job-relevant measures rather than relying on any single measure (e.g., using only Watson-Glaser scores to make decisions).

Organizations using the Watson-Glaser are encouraged to examine the relationship between examinees’ scores and their subsequent performance on the job. This locally obtained information will provide the best assistance in score interpretation and will most effectively enable a user of the Watson-Glaser to differentiate examinees who are likely to be successful from those who are not. Harcourt Assessment, Inc. does not establish or recommend a passing score for the Watson-Glaser.

Fairness in Selection Testing

Fair employment regulations and their interpretation are continuously subject to changes in the legal, social, and political environments. It is therefore advised that a user of the Watson-Glaser consult with qualified legal advisors and human resources professionals as appropriate.

Legal Considerations

There are governmental and professional regulations that cover the use of all personnel selection procedures.
Relevant source documents that the user may wish to consult include the Standards for Educational and Psychological Testing (AERA et al., 1999); the Principles for the Validation and Use of Personnel Selection Procedures (Society for Industrial and Organizational Psychology, 2003); and the federal Uniform Guidelines on Employee Selection Procedures (Equal Employment Opportunity Commission, 1978). For an overview of the statutes and types of legal proceedings which influence an organization’s equal employment opportunity obligations, the user is referred to Cascio and Aguinis (2005) or the U.S. Department of Labor’s (2000) Testing and Assessment: An Employer’s Guide to Good Practices. Group Differences/Adverse Impact Local validation is particularly important when a selection test may have adverse impact. According to the Uniform Guidelines on Employee Selection Procedures (EEOC, 1978) adverse impact is normally indicated when the selection rate for one group is less than 80% (or 4 out of 5) that of another. Adverse impact is likely to occur with cognitive ability tests such as the Watson-Glaser. While it is not unlawful to use a test with adverse impact (EEOC, 1978), the testing organization must be prepared to demonstrate that the selection test is job-related and consistent with business necessity. A local validation study, in which scores on the Watson-Glaser are correlated with indicators of on-the-job performance, will help provide evidence to support the use of the test in a particular job context. In addition, an evaluation that demonstrates that the Watson-Glaser is equally predictive for protected subgroups, as outlined by the Equal Employment Opportunity Commission, will assist in the demonstration of fairness of the test. Monitoring the Selection System The abilities to evaluate selection strategies and to implement fair employment practices depend on an organization’s awareness of the demographic characteristics of applicants and incumbents. Monitoring these characteristics and accumulating test score data are clearly necessary for establishing legal defensibility of a selection system, including those systems that incorporate the Watson-Glaser. The most effective use of the Watson-Glaser will be achieved where a local norms database is established and continuously monitored for unfair consequences. 44 Chapter 10 Using the Watson-Glaser as an Employment Selection Tool Research The Watson-Glaser provides a reliable measure of critical thinking ability and has been included in a variety of research studies on critical thinking and related topics. Several studies are summarized in this manual. The citations are listed in the References section. Other research studies are listed in the Research Bibliography section. 45 Appendix A Description of the Normative Sample and Percentile Ranks Table A.1 Description of the Normative Sample by Industry Industry Norms and Sample Characteristics Advertising/Marketing/ Public Relations N = 101 Various occupations within advertising, marketing, and public relations industries. Mean = 28.7 SD = 6.1 Occupational Characteristics 12.9% Hourly/Entry-Level 4.3% Supervisor 34.4% Manager 16.1% Director 17.2% Executive 15.1% Professional/Individual Contributor Education N = 119 Various occupations within the education industry. 
Mean = 30.2 SD = 5.4 Occupational Characteristics 11.5% Hourly/Entry-Level 1.0% Supervisor 21.2% Manager 23.1% Director 26.0% Executive 17.3% Professional/Individual Contributor (continued) 47 Watson-Glaser Short Form Manual Table A.1 Description of the Normative Sample by Industry (continued) Industry Norms and Sample Characteristics Financial Services/ Banking/Insurance N = 228 Various occupations within financial services, banking, and insurance industries. Mean = 31.2 SD = 5.7 Occupational Characteristics 10.4% Hourly/Entry-Level 8.1% Supervisor 19.4% Manager 7.6% Director 23.2% Executive 31.3% Professional/Individual Contributor Government/Public Service/Defense Various occupations within government, public service, and defense agencies. N = 130 Mean = 30.0 SD = 6.3 Occupational Characteristics 19.8% Hourly/Entry-Level 10.3% Supervisor 20.7% Manager 7.8% Director 2.6% Executive 38.8% Professional/Individual Contributor Health Care N = 195 Various occupations within the health care industry. Mean = 28.3 SD = 6.5 Occupational Characteristics 19.1% Hourly/Entry-Level 9.5% Supervisor 18.5% Manager 15.5% Director 11.9% Executive 25.6% Professional/Individual Contributor (continued) 48 Appendix A Table A.1 Description of the Normative Sample by Industry (continued) Industry Norms and Sample Characteristics Information Technology/ Telecommunications N = 295 Various occupations within information technology and telecommunications industries. Mean = 31.2 SD = 5.5 Occupational Characteristics 8.1% Hourly/Entry-Level 4.0% Supervisor 22.8% Manager 8.1% Director 12.5% Executive 44.5% Professional/Individual Contributor Manufacturing/ Production N = 561 Various occupations within manufacturing and production industries. Mean = 32.0 SD = 5.3 Occupational Characteristics 7.6% Hourly/Entry-Level 10.2% Supervisor 36.8% Manager 15.9% Director 9.1% Executive 20.6% Professional/Individual Contributor Professional Business Services N = 153 Various occupations within the professional business services industry (e.g., consulting, legal) Mean = 31.9 SD = 5.6 Occupational Characteristics 8.9% Hourly/Entry-Level 1.5% Supervisor 23.7% Manager 12.6% Director 16.3% Executive 37.0% Professional/Individual Contributor (continued) 49 Watson-Glaser Short Form Manual Table A.1 Description of the Normative Sample by Industry (continued) Industry Norms and Sample Characteristics Retail/Wholesale N = 307 Various occupations within retail and wholesale industries. Mean = 30.8 SD = 5.3 Occupational Characteristics 8.1% Hourly/Entry-Level 3.9% Supervisor 45.3% Manager 10.5% Director 22.1% Executive 10.2% Professional/Individual Contributor 50 Appendix A Table A.2 Description of the Normative Sample by Occupation Occupation Norms and Sample Characteristics Accountant/Auditor/ Bookkeeper N = 118 Accountant, auditor, and bookkeeper positions within various industries. Mean = 30.2 SD = 5.8 Industry Characteristics 2.4% Advertising/Marketing/Public Relations 8.2% Education 24.7% Financial Services/Banking/Insurance 5.9% Government/Public Service/Defense 8.2% Health Care 8.2% Information Technology/High-Tech/Telecommunications 29.4% Manufacturing/Production 8.2% Professional Business Services 4.7% Retail/Wholesale Consultant N = 139 Consultant positions within various industries. 
Mean = 33.3 SD = 4.8 Industry Characteristics 6.3% Advertising/Marketing/Public Relations 2.7% Education 3.6% Financial Services/Banking/Insurance 1.8% Government/Public Service/Defense 0.9% Health Care 20.5% Information Technology/High-Tech/Telecommunications 6.3% Manufacturing/Production 50.9% Professional Business Services 7.1% Retail/Wholesale (continued) 51 Watson-Glaser Short Form Manual Table A.2 Description of the Normative Sample by Occupation (continued) Occupation Norms and Sample Characteristics Engineer N = 225 Engineer positions within various industries. Mean = 32.8 SD = 4.8 Industry Characteristics 0.0% Advertising/Marketing/Public Relations 1.3% Education 0.6% Financial Services/Banking/Insurance 9.0% Government/Public Service/Defense 0.6% Health Care 14.7% Information Technology/High-Tech/Telecommunications 71.8% Manufacturing/Production 1.9% Professional Business Services 0.0% Retail/Wholesale Human Resource Professional N = 140 Human resource professional positions within various industries. Mean = 30.0 SD = 5.7 Industry Characteristics 0.9% Advertising/Marketing/Public Relations 7.5% Education 10.3% Financial Services/Banking/Insurance 9.4% Government/Public Service/Defense 13.1% Health Care 4.7% Information Technology/High-Tech/Telecommunications 35.5% Manufacturing/Production 7.5% Professional Business Services 11.2% Retail/Wholesale (continued) 52 Appendix A Table A.2 Description of the Normative Sample by Occupation (continued) Occupation Norms and Sample Characteristics Information Technology Professional N = 222 Information technology positions within various industries. Mean = 31.4 SD = 5.9 Industry Characteristics 1.0% Advertising/Marketing/Public Relations 2.0% Education 9.4% Financial Services/Banking/Insurance 4.0% Government/Public Service/Defense 5.5% Health Care 62.9% Information Technology/High-Tech/Telecommunications 8.4% Manufacturing/Production 3.0% Professional Business Services 4.0% Retail/Wholesale Sales Representative— Non-Retail Sales representative positions (non-retail) within various industries. N = 353 Mean = 29.8 SD = 5.1 Industry Characteristics 11.6% Advertising/Marketing/Public Relations 4.2% Education 7.9% Financial Services/Banking/Insurance 1.1% Government/Public Service/Defense 11.1% Health Care 13.2% Information Technology/High-Tech/Telecommunications 26.5% Manufacturing/Production 8.5% Professional Business Services 15.9% Retail/Wholesale 53 Watson-Glaser Short Form Manual Table A.3 Description of the Normative Sample by Position Type/Level Position Type/Level Norms and Characteristics Executive N = 409 Executive-level positions (e.g., CEO, CFO, VP) within various industries. Mean = 33.4 SD = 4.5 Industry Characteristics 5.7% Advertising/Marketing/Public Relations 9.6% Education 17.4% Financial Services/Banking/Insurance 1.1% Government/Public Service/Defense 7.1% Health Care 12.1% Information Technology/High-Tech/Telecommunications 17.0% Manufacturing/Production 7.8% Professional Business Services 22.3% Retail/Wholesale Director N = 387 Director-level positions within various industries. 
Mean = 32.9 SD = 4.7 Industry Characteristics 6.2% Advertising/Marketing/Public Relations 9.9% Education 6.6% Financial Services/Banking/Insurance 3.7% Government/Public Service/Defense 10.7% Health Care 9.1% Information Technology/High-Tech/Telecommunications 34.6% Manufacturing/Production 7.0% Professional Business Services 12.4% Retail/Wholesale (continued) 54 Appendix A Table A.3 Description of the Normative Sample by Position Type/Level (continued) Position Type/Level Norms and Characteristics Manager N = 973 Manager-level positions within various industries. Mean = 30.7 SD = 5.4 Industry Characteristics 5.6% Advertising/Marketing/Public Relations 3.9% Education 7.2% Financial Services/Banking/Insurance 4.2% Government/Public Service/Defense 5.5% Health Care 10.9% Information Technology/High-Tech/Telecommunications 34.3% Manufacturing/Production 5.6% Professional Business Services 22.7% Retail/Wholesale Supervisor N = 202 Supervisor-level positions within various industries. Mean = 28.8 SD = 6.2 Industry Characteristics 3.1% Advertising/Marketing/Public Relations 0.8% Education 13.3% Financial Services/Banking/Insurance 9.4% Government/Public Service/Defense 12.5% Health Care 8.6% Information Technology/High-Tech/Telecommunications 42.2% Manufacturing/Production 1.6% Professional Business Services 8.6% Retail/Wholesale (continued) 55 Watson-Glaser Short Form Manual Table A.3 Description of the Normative Sample by Position Type/Level (continued) Position Type/Level Norms and Characteristics Professional/Individual Contributor N = 842 Professional-level positions and individual contributor positions within various industries. Mean = 30.6 SD = 5.6 Industry Characteristics 2.8% Advertising/Marketing/Public Relations 3.6% Education 13.3% Financial Services/Banking/Insurance 9.1% Government/Public Service/Defense 8.7% Health Care 24.4% Information Technology/High-Tech/Telecommunications 22.0% Manufacturing/Production 10.1% Professional Business Services 5.9% Retail/Wholesale Hourly/Entry-Level N = 332 Hourly and entry-level positions within various industries. Mean = 27.7 SD = 5.9 Industry Characteristics 6.1% Advertising/Marketing/Public Relations 6.1% Education 11.1% Financial Services/Banking/Insurance 11.6% Government/Public Service/Defense 16.2% Health Care 11.1% Information Technology/High-Tech/Telecommunications 20.2% Manufacturing/Production 6.1% Professional Business Services 11.6% Retail/Wholesale Position Type/Occupation Within Specific Industry Norms Manager in Manufacturing/ Production Managers in manufacturing and production industries. Engineer in Manufacturing/ Production Engineers in manufacturing and production industries. 
56 N = 170 Mean = 31.9 SD = 5.3 N = 112 Mean = 32.9 SD = 4.7 Appendix A Table A.4 Percentile Ranks of Total Raw Scores for Industry Groups Industry Raw Score Advertising/ Marketing/ Public Relations 40 39 38 37 36 35 34 33 32 31 30 29 28 27 26 25 24 23 22 21 20 19 18 17 16 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0 99 99 98 95 90 82 77 73 69 64 58 54 48 39 34 31 26 21 19 14 11 8 7 3 3 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 Raw Score Mean Raw Score SD N 28.7 6.1 101 Education Financial Services/ Banking/ Insurance Government/ Public Service/ Defense Health Care Raw Score 99 99 98 93 87 84 76 70 62 54 46 40 36 30 27 21 16 13 8 8 6 4 3 2 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 99 97 95 88 80 73 65 58 53 47 41 36 31 24 20 16 14 11 8 7 6 5 4 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 99 98 94 88 82 75 68 62 58 56 53 48 42 36 31 28 22 19 13 12 8 4 3 3 2 2 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 99 99 98 95 91 84 78 74 69 66 58 53 50 45 39 36 31 26 22 15 12 9 7 6 5 3 3 2 1 1 1 1 1 1 1 1 1 1 1 1 1 40 39 38 37 36 35 34 33 32 31 30 29 28 27 26 25 24 23 22 21 20 19 18 17 16 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0 30.2 5.4 119 31.2 5.7 228 30.0 6.3 130 28.3 6.5 195 Raw Score Mean Raw Score SD N (continued) 57 Watson-Glaser Short Form Manual Table A.4 Percentile Ranks of Total Raw Scores for Industry Groups (continued) Industry Raw Score Information Technology/ Telecommunications Manufacturing/ Production Professional/ Business Services Retail/ Wholesale Raw Score 40 39 38 37 36 35 34 33 32 31 30 29 28 27 26 25 24 23 22 21 20 19 18 17 16 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0 99 99 94 89 82 74 64 57 50 45 41 36 30 25 22 18 15 12 8 6 3 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 99 99 94 86 77 68 60 51 45 41 36 31 24 20 17 13 11 9 6 4 3 2 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 99 99 95 88 76 67 56 52 46 37 33 29 26 20 16 13 12 9 8 6 5 3 3 2 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 99 99 96 91 85 79 73 65 58 51 42 34 29 25 21 17 14 10 10 7 5 4 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 40 39 38 37 36 35 34 33 32 31 30 29 28 27 26 25 24 23 22 21 20 19 18 17 16 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0 31.2 5.5 295 32.0 5.3 561 31.9 5.6 153 30.8 5.3 307 Raw Score Mean Raw Score SD N 58 Raw Score Mean Raw Score SD N Appendix A Table A.5 Percentile Ranks of Total Raw Scores for Occupations Occupation Raw Score Accountant/ Auditor/ Bookkeeper Engineer Human Resource Professional Information Technology Professional Sales Representative Non-Retail Consultant Raw Score 40 39 38 37 36 35 34 33 32 31 30 29 28 27 26 25 24 23 22 21 20 19 18 17 16 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0 99 98 95 91 88 81 71 63 57 53 46 42 41 32 26 22 19 12 11 9 8 6 4 3 3 2 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 99 99 91 86 71 63 49 40 33 24 20 18 16 14 13 9 8 6 5 4 3 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 99 99 92 83 72 63 55 49 42 35 30 26 21 15 12 9 7 5 3 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 99 99 99 94 86 79 73 68 61 55 49 44 39 32 27 25 21 17 14 10 6 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 99 98 93 88 78 71 61 52 47 42 40 35 29 24 20 17 15 11 9 9 7 5 3 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 99 99 97 95 91 86 80 74 66 61 53 47 41 31 25 21 17 13 10 7 4 3 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 40 39 38 37 36 35 34 33 32 31 30 29 28 27 26 25 24 23 22 21 20 19 18 17 16 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0 Raw Score Mean Raw Score SD N 30.2 5.8 118 33.3 4.8 139 32.8 4.8 225 30.0 5.7 140 31.4 5.9 222 29.8 5.1 353 Raw Score Mean Raw Score SD N 59 Watson-Glaser Short Form Manual Table A.6 Percentile Ranks of Total Raw Scores for Position Type/Level Position Type/Level 60 Raw Score Executive Director 
Table A.6 Percentile Ranks of Total Raw Scores for Position Type/Level
For each position type/level, percentile ranks are listed in raw-score order from 40 down to 0.

Executive (N = 409, Raw Score Mean = 33.4, Raw Score SD = 4.5):
99 97 92 83 72 63 53 42 37 28 23 16 13 11 10 7 6 3 3 2 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1

Director (N = 387, Raw Score Mean = 32.9, Raw Score SD = 4.7):
99 98 94 87 76 63 53 43 38 32 27 23 19 15 11 8 8 4 3 2 2 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1

Manager (N = 973, Raw Score Mean = 30.7, Raw Score SD = 5.4):
99 99 97 92 85 78 72 65 57 50 44 37 32 27 22 18 14 11 8 7 5 3 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1

Supervisor (N = 202, Raw Score Mean = 28.8, Raw Score SD = 6.2):
99 99 97 92 88 82 76 72 68 63 59 55 48 41 38 33 28 23 19 15 12 6 3 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1

Professional/Individual Contributor (N = 842, Raw Score Mean = 30.6, Raw Score SD = 5.6):
99 99 95 90 84 78 70 64 57 51 45 40 35 29 25 20 16 13 10 7 5 3 2 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1

Hourly/Entry-Level (N = 332, Raw Score Mean = 27.7, Raw Score SD = 5.9):
99 99 97 96 94 91 88 80 76 72 66 59 52 45 41 36 32 29 23 18 13 8 6 5 2 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1

Table A.7 Percentile Ranks of Total Raw Scores for Position Type/Occupation Within Industry
For each group, percentile ranks are listed in raw-score order from 40 down to 0.

Manager in Manufacturing/Production (N = 170, Raw Score Mean = 31.9, Raw Score SD = 5.3):
99 99 95 89 81 72 65 54 45 39 35 26 22 18 14 13 11 8 5 5 5 4 4 3 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1

Engineer in Manufacturing/Production (N = 112, Raw Score Mean = 32.9, Raw Score SD = 4.7):
99 99 91 82 70 63 54 48 41 37 31 28 22 15 13 8 4 3 3 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1

Appendix B Final Item Statistics for the Watson-Glaser Short Form Three-Parameter IRT Model

Table B.1 Final Item Statistics for the Watson-Glaser Short Form Three-Parameter IRT Model (reprinted from Watson & Glaser, 1994)
Values are listed in Short Form item order (items 1 through 40).

Form A item: 1 2 3 5 11 12 14 20 21 22 26 27 28 31 32 33 34 36 37 38 39 40 41 42 52 53 54 55 57 62 64 68 69 71 72 74 75 76 77 80

Short Form item: 1 through 40

One-Parameter Rasch-Model Difficulty Index (b): 1.09 1.04 1.01 1.03 1.05 1.04 1.04 1.05 0.92 0.99 0.95 1.02 1.02 1.03 1.03 0.96 0.98 0.92 1.10 1.00 0.99 1.02 0.94 0.95 1.00 1.01 0.87 0.96 0.92 0.92 0.94 0.96 0.90 1.02 1.05 0.92 0.99 0.95 1.07 1.02

Discrimination (a): 0.70 0.76 0.63 0.87 0.61 0.56 0.84 0.57 1.22 0.66 1.02 0.43 0.53 0.54 0.55 1.51 0.76 1.06 0.56 0.71 0.60 0.51 0.92 0.72 1.05 0.60 1.34 0.75 0.92 0.87 0.77 0.58 0.98 0.47 0.46 0.78 1.13 0.96 0.40 0.42

Difficulty (b): 0.36 0.18 –0.39 0.33 –0.09 –0.36 0.55 –0.61 –1.19 –1.21 –0.04 –3.31 –2.16 –1.97 –0.48 –0.33 –1.09 0.21 0.62 –0.75 –1.56 –1.68 –0.31 –3.10 0.19 –0.89 –0.48 –0.68 –0.61 –1.93 –2.97 –2.90 –1.86 –1.43 –1.40 –1.87 0.01 –0.70 –1.37 –2.49

Guessing (c): 0.43 0.36 0.28 0.31 0.38 0.29 0.27 0.31 0.40 0.31 0.28 0.32 0.37 0.36 0.32 0.50 0.39 0.23 0.50 0.36 0.33 0.31 0.25 0.28 0.35 0.34 0.29 0.21 0.26 0.22 0.25 0.25 0.23 0.23 0.28 0.21 0.38 0.42 0.31 0.26

Single-Factor ML Solution Factor Loading: 0.21 0.27 0.31 0.31 0.25 0.29 0.29 0.29 0.40 0.32 0.39 0.19 0.24 0.24 0.27 0.35 0.33 0.39 0.20 0.33 0.29 0.27 0.42 0.28 0.34 0.29 0.49 0.41 0.43 0.41 0.32 0.28 0.43 0.30 0.26 0.40 0.37 0.37 0.23 0.26

Note. Though item statistics were generated using both a one-parameter Rasch model and a three-parameter IRT model, item selection decisions were based on the Rasch model for the following reasons:
1. the c parameter is held constant, as guessing is seldom truly random (i.e., even with true/false items, some partially incorrect knowledge is used by the examinee to select the response);
2. one-to-one correspondence with the raw score scale (i.e., "number correct" scoring) is possible; and
3. regardless of the discrimination metric (Classical Test Theory or Item Response Theory), the estimates were similar across items; thus, little was gained in using a more complicated two- or three-parameter estimate of discrimination.
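The note above refers to two scoring models. As a point of reference, the sketch below evaluates the standard logistic forms of the one-parameter (Rasch) and three-parameter item response functions. The function names are illustrative, the 1.7 scaling constant sometimes included in these formulas is omitted, and the example simply plugs in the parameter estimates reported for Short Form item 1 in Table B.1; it does not reproduce any estimation procedure used for the test.

```python
# Minimal sketch of the two item response models discussed in the note above,
# using the standard logistic forms (no 1.7 scaling constant).

import math

def p_rasch(theta: float, b: float) -> float:
    """One-parameter (Rasch) probability of a correct response."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))

def p_3pl(theta: float, a: float, b: float, c: float) -> float:
    """Three-parameter logistic probability: discrimination a, difficulty b, guessing c."""
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))

# For an examinee of average ability (theta = 0) on an item with
# a = 0.70, b = 0.36, c = 0.43 (the estimates listed for Short Form item 1):
print(round(p_rasch(0.0, 0.36), 2))            # about 0.41
print(round(p_3pl(0.0, 0.70, 0.36, 0.43), 2))  # about 0.68
```

Because the Rasch form fixes discrimination at a common value and the guessing parameter at zero, the total number of correct answers maps one-to-one onto the ability estimate, which is the correspondence with the raw score scale cited as reason 2 in the note.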
References

Adams, M. H., Stover, L. M., & Whitlow, J. F. (1999). A longitudinal evaluation of baccalaureate nursing students' critical thinking abilities. Journal of Nursing Education, 38, 139–141.
Agne, R. & Blick, D. (1972). A comparison of earth science classes taught by using original data in a research-approach technique versus classes taught by conventional approaches not using such data. Journal of Research in Science Teaching, 9, 83–89.
Aiken, L. R. (1979). Psychological testing and assessment (3rd ed.). Boston: Allyn & Bacon.
Allen, M., Berkowitz, S., Hunt, S., & Louden, A. (1999). A meta-analysis of the impact of forensics and communication education on critical thinking. Communication Education, 48, 18–30.
American Educational Research Association, American Psychological Association, & National Council on Measurement in Education (1999). Standards for educational and psychological testing. Washington, DC: Author.
Brembeck, W. L. (1949). The effects of a course in argumentation on critical thinking ability. Speech Monographs, 16, 177–189.
Brownell, J. A. (1953). The influence of training in reading in the social studies on the ability to think critically. California Journal of Educational Research, 4, 25–31.
Burns, R. L. (1974). The testing of a model of critical thinking ontogeny among Central Connecticut State College undergraduates. (Doctoral dissertation, University of Connecticut). Dissertation Abstracts International, 34, 5467A.
Cascio, W. F. (1982). Applied psychometrics in personnel management (2nd ed.). Reston, VA: Reston Publishing.
Cascio, W. F., & Aguinis, H. (2005). Applied psychology in human resource management (6th ed.). Upper Saddle River, NJ: Prentice Hall.
Civil Rights Act of 1991 (Pub. L. 102-166). United States Code, Volume 42, Sections 101-402.
Americans With Disabilities Act of 1990, Titles I & V (Pub. L. 101-336). United States Code, Volume 42, Sections 12101-12213.
Cohen, B. H. (1996). Explaining psychological statistics. Pacific Grove, CA: Brooks & Cole.
Anastasi, A. & Urbina, S. (1997). Psychological testing (7th ed.). Upper Saddle River, NJ: Prentice Hall.
Cohen, J. (1988). Statistical power analysis for the behavioral sciences (2nd ed.). Hillsdale, NJ: Lawrence Erlbaum Associates.
Arand, J. U. & Harding, C. G. (1987). An investigation into problem solving in education: A problem-solving curricular framework. Journal of Allied Health, 16, 7–17.
Colbert, K. R. (1987). The effects of CEDA and NDT debate training on critical thinking ability.
Journal of the American Forensic Association, 23, 194–201. Bauwens, E. E. & Gerhard, G. G. (1987). The use of the Watson-Glaser Critical Thinking Appraisal to predict success in a baccalaureate nursing program. Journal of Nursing Education, 26, 278–281. Behrens, P. J. (1996). The Watson-Glaser Critical Thinking Appraisal and academic performance of diploma school students. Journal of Nursing Education, 35, 34–36. Berger, M. C. (1984). Clinical thinking ability and nursing students. Journal of Nursing Education, 23, 306–308. Cronbach, L. J. (1970). Essentials of psychological testing, third edition, New York: Harper & Row Duchesne, R. E., Jr. (1996). Critical thinking, developmental learning, and adaptive flexibility in organizational leaders (Doctoral dissertation, University of Connecticut). Dissertation Abstracts International, 57, 2121. Duckworth, J. B. (1968). The effect of instruction in general semantics on the critical thinking of tenth and eleventh grade students. (Doctoral dissertations, Wayne State University). Dissertation Abstracts, 29, 4180A. 65 Watson-Glaser Short Form Manual Equal Employment Opportunity Commission. (1978). Uniform guidelines on employee selection procedures. Federal Register, 43(166), 38295–38309. Fogg, C. & Calia, V. (1967). The comparative influence of two testing techniques on achievement in science and critical thinking ability. The Journal of Experimental Education, 35, 1–14. Follert, V. F. & Colbert, K. R. (1983, November). An analysis of the research concerning debate Training and critical thinking. Paper presented at the 69th Annual Meeting of the Speech Communication Association, Washington, DC. Follman, J., Miller, W., & Hernandez, D. (1969). Factor analysis of achievement, scholastic aptitude, and critical thinking subtests. The Journal of Experimental Education, 38, 48–53. Frank, A. D. (1969). Teaching high school speech to improve critical thinking ability. Speech Teacher, 18, 296–302. Frederickson, K. & Mayer, G. G. (1977). Problem-solving skills: What effect does education have? American Journal of Nursing, 77, 1167–1169. Frost, S. (1989, October). Academic advising and cognitive development: Is there a link? Paper presented at the 13th Annual Conference of the National Academic Advising Association, Houston. Frost, S. H. (1991). Fostering critical thinking of college women through academic advising and faculty contact. Journal of College Student Development, 32, 359–366. Frye, B., Alfred, N., & Campbell, M. (1999). Use of the Watson-Glaser Critical Thinking Appraisal with BSN students. Nursing and Health Care Perspectives, 20(5), 253–255. Fulton, R. D. (1989). Critical thinking in adulthood. (Eric Document Reproduction Service No. ED 320015). Gadzella, B. M., Baloglu, M., & Stephens, R. (2002). Prediction of GPA with educational psychology grades and critical thinking scores. Education, 122(3), 618–623. Gadzella, B. M., Ginther, D. W., & Bryant, G. W. (1996, August). Teaching and learning critical thinking skills. Paper presented at the XXVI International Congress of Psychology, Montreal, Quebec. Gadzella, B. M., Hartsoe, K., & Harper, J. (1989). Critical thinking and mental ability groups. Psychological Reports, 65, 1019–1026. Gadzella, B. M., & Penland, E. (1995). Is creativity related to scores on critical thinking? Psychological Reports, 77, 817–818. 66 Garris, C. W. (1974). 
A study comparing the improvement of students’ critical thinking ability achieved through the teacher’s increased use of classroom questions resulting from individualized or group training programs. (Doctoral dissertation, Pennsylvania State University). Dissertation Abstracts International, 35, 7123A. Gaston, A. (1993). Recognizing potential law enforcement executives. (Reports No. NCJ 131646) Washington, D.C.: National Institute of Justice/NCJRS. Gibson, J. W., Kibler, R. J., & Barker, L. L. (1968). Some relationship between selected creativity and critical thinking measures. Psychological Reports, 23, 707–714. Glaser, E. M. (1937). An experiment in the development of critical thinking. Contributions to Education, No. 843. New York: Bureau of Publications, Teachers College, Columbia University. Goldman, F. W. & Goldman M. (1981). The effects of dyadic group experience in subsequent individual performance. Journal of Social Psychology, 115, 83–88. Gross, Y. T., Takazawa, E. S., & Rose, C. L. (1987). Critical thinking and nursing education. Journal of Nursing Education, 26, 317–323. Gurfein, H. (1977). Critical thinking in parents and their adolescents children. Dissertation Abstracts, 174A. Harris, A. J. & Jacobson, M. D. (1982) Basic reading vocabularies. New York: MacMillan. Herber, H. L. (1959). An inquiry into the effect of instruction in critical thinking upon students in grades ten, eleven, and twelve. (Doctoral dissertation, Boston University). Dissertation Abstracts, 20, 2174. Holmgren, B. & Covin, T. (1984). Selective characteristics of preservice professionals. Education, 104, 321–328. Jackson, T. R. (1961). The effects of intercollegiate debating on critical thinking ability. (Doctoral dissertation, University of Wisconsin). Dissertation Abstracts, 21, 3556. Jaeger, R. M. & Freijo, T. D. (1975). Race and sex as concomitants of composite halo in teachers’ evaluative rating of pupils. Journal Educational Psychology, 67, 226–237. Jones, P. K. (1988). The effect of computer programming instruction on the development of generalized problem solving skills in high school students. Unpublished doctoral dissertation Nova University. Jones, S. H. & Cook, S. W. (1975). The influence of attitude on judgments of the effectiveness of alternative social policies. Journal of Personality and Social Psychology, 32, 767–773. References Kudish, J. D. & Hoffman, B. J. (2002, October). Examining the relationship between assessment center final dimension ratings and external measures of cognitive ability and personality. Paper presented at the 30th International Congress on Assessment Center Methods, Pittsburgh, PA. Landis, R. E. (1976). The psychological dimensions of three measures of critical thinking and twenty-four structure-of-intellect tests for a sample of ninthgrade students. (Doctoral dissertation, University of Southern California.) Dissertation Abstracts International, 37, 5705A. Pardue, S. F. (1987). Decision-making skills and critical thinking ability among associate degree, diploma, baccalaureate, and master’s prepared nurses. Journal of Nursing Education, 26, 354–361. Pepa, C. A., Brown, J. M., & Alverson, E. M. (1997). A comparison of critical thinking abilities between accelerated and traditional baccalaureate nursing students. Journal of Nursing Education, 36, 46–48. Pierce, W., Lemke, E., & Smith, R. (1988). Critical thinking and moral development in secondary students. High School Journal, 71, 120–126. Lawshe, C. H. (1975). A quantitative approach to content validity. 
Personnel Psychology, 28, 563–575. Rasch, G. (1980). Probabilistic models for some intelligence and attainment tests. Chicago: University of Chicago Press. Livingston, H. (1965). An investigation of the effect of instruction in general semantics on critical reading ability. California Journal of Educational Research, 16, 93–96. Robertson, I. T. & Molloy, K. J. (1982). Cognitive complexity, neuroticism, and research ability. British Journal of Educational Psychology, 52, 113–118. Loo, R. & Thorpe, K. (1999). A psychometric investigation of scores on the Watson-Glaser Critical Thinking Appraisal New Form S. Educational and Psychological Measurement, 59(6), 995–1003. Loo, R. & Thorpe, K. (2005). Relationships between critical thinking and attitudes toward women’s roles in society. The Journal of Psychology, 139(1), 47–55. Ross, G. R. (1977, April). A factor-analytic study of inductive reasoning tests. Paper presented as the 61st Annual Meeting of the American Educational Research Association, New York. Rust, J. (2002). Rust advanced numerical reasoning appraisal manual. London: The Psychological Corporation. McMillan, J. H. (1987). Enhancing college students’ critical thinking: A review of studies. Research in Higher Education, 26, 3–19. Sandor, M. K., Clark, M., Campbell, D., Rains, A. P., & Cascio, R. (1998). Evaluating critical thinking skills in a scenario-based community health course. Journal of Community Health Nursing, 15(1), 21–29. Mead, A. D. & Drasgow, F. (1993). Equivalence of computerized and paper-and-pencil cognitive ability tests: A meta-analysis. Psychological Bulletin, 114(3), 449–458. Scott, J. N. & Markert, R. J. (1994). Relationship between critical thinking skills and success in preclinical courses. Academic Medicine, 69(11), 920–924. Mitchell, H. E. & Byrne, D. (1973). The defendant’s dilemma: Effects of juror’s attitudes and authoritarianism on judicial decision. Journal of Personality and Social Psychology, 25, 123–129. Morse, H. & McCune, G. (1957). Selected items for the testing of study skills and critical thinking (3rd Edition). Washington, DC: National Council for the Social Studies. Neimark, E. D. (1984, August). A cognitive style change approach to the modification of thinking in college students. Paper presented at the Conference on Thinking, Cambridge, MA. (ERIC Document Reproduction Service No. ED 261301). Ness, J. H. (1967). The effects of a beginning speech course on critical thinking ability. (Doctoral dissertation, University of Minnesota). Dissertation Abstracts, 28, 5171A. Offner, A. N. (2000). Tacit knowledge and group facilitator behavior (Doctoral dissertation, Saint Louis University, 2000). Dissertation Abstracts International, 60, 4283. Scott, J. N., Markert, R. J., & Dunn, M.M. (1998). Critical thinking: change during medical school and relationship to performance in clinical clerkships. Medical Education, 32(1), 14–18. Sherif, C., Sherif, M., & Nebergall, R. (1965). Attitude and attitude change. Philadelphia: W. B. Saunders. Shin, K. R. (1998). Critical thinking ability and clinical decision-making skills among senior nursing students in associate and baccalaureate programmes in Korea. Journal of Advanced Nursing, 27(2), 414–418. Simon, A. & Ward, L. (1974). The performance on the Watson-Glaser Critical Thinking Appraisal of university students classified according to sex, type of course pursued, and personality score category. Educational and Psychological Measurement, 34, 957–960. Society for Industrial and Organizational Psychology. (2003). 
Principles for the validation and use of personnel selection procedures (4th ed.). Bowling Green, OH: Author. 67 Watson-Glaser Short Form Manual Sorenson, L. (1966). Watson-Glaser Critical Thinking Appraisal: Changes in critical thinking associated with two methods of teaching high school biology. Test Data Report No. 51. New York: Harcourt Brace & World. Spector, P. A., Schneider, J. R., Vance, C. A., & Hezlett, S. A. (2000). The relation of cognitive ability and personality traits to assessment center performance. Journal of Applied Social Psychology, 30(7), 1474–1491. Taube, K. T. (1995, April). Critical thinking ability and disposition as factors of performance on a written critical thinking test. Paper presented at the Annual Meeting of the American Educational Research Association, San Francisco, CA. Taylor, S. E., Frankenpohl, H., White, C. E., Nieroroda, B. W., Browning, C. L., & Birsner, E. P. (1989). EDL core vocabularies in reading, mathematics, science, and social studies. Columbia, SC: EDL. Thouless, R. H. (1949). Review of Watson-Glaser Critical Thinking Appraisal. In O. K. Buros (Ed.), The third mental measurements yearbook. Lincoln: University of Nebraska Press. U.S. Department of Labor. (1999). Testing and assessment: An employer’s guide to good practices. Washington, DC: Author. Watson, G. B. (1925). The measurement of fairmindedness. Contributions to Education, No. 176. New York: Bureau of Publications, Teachers College, Columbia University. Watson, G., & Glaser, E. M. (1994). Watson-Glaser Critical Thinking Appraisal, Form S manual. San Antonio, TX: The Psychological Corporation. Williams, B. R. (1971). Critical thinking ability as affected by a unit on symbolic logic. (Doctoral dissertation, Arizona State University). Dissertation Abstracts International, 31, 6434A. Williams, R. L. (2003). Critical thinking as a predictor and outcome measure in a large undergraduate educational psychology course. (Report No. TM-035-016). Knoxville, TN: University of Tennessee. (ERIC Document Reproduction Service No. ED478075) Wood, L. E. (1980). An “intelligent” program to teach logical thinking skills. Behavior Research Methods and Instrumentation, 12, 256–258. Wood, L. E. & Stewart, P. W. (1987). Improvement of practical reasoning skills with a computer game. Journal of Computer-Based instruction, 14, 49-53. Yang, S. C. & Lin, W. C. (2004). The relationship among creative, critical thinking and thinking styles in Taiwan High School Students. Journal of Instructional Psychology, 31(1). 33–45. 68 Research Bibliography Alexakos, C. E. (1966). Predictive efficiency of two multivariate statistical techniques in comparison with clinical predictions. Journal of Educational Psychology, 57, 297–306. Alspaugh, C. A. (1970). A study of the relationships between student characteristics and proficiency in symbolic and algebraic computer programming. (Doctoral dissertation, University of Missouri). Dissertation Abstracts International, 31, 4627B. Bergman, L. M. E. (1960). A study of the relationship between selected language variables in extemporaneous speech and critical thinking ability. (Doctoral dissertation, University of Minnesota). Dissertation Abstracts, 21, 3552. Bessent, E. W. (1961). The predictability of selected elementary school principals’ administrative behavior. (Doctoral dissertation, The University of Texas at Austin). Dissertation Abstracts, 22, 3479. Alston, D. N. (1972). 
An investigation of the critical reading ability of classroom teachers in relation to selected background factors. Educational Leadership, 29, 341–343. Betres, J. J. A. (1971). A study in the development of the critical thinking skills of preservice elementary teachers. (Doctoral dissertation, Ohio University). Dissertation Abstracts International, 32, 2520A. Annis, L. F. & Annis, D. B. (1979). The impact of philosophy on students’ critical thinking ability. Contemporary Educational Psychology, 4, 219–226. Bishop, T. (1971). Predicting potential: Selection for science-based industries. Personnel Management, 3, 31–33. Armstrong, N. A. (1970). The effect of two instructional inquiry strategies on critical thinking and achievement in eighth grade social studies. (Doctoral dissertation, Indiana University). Dissertation Abstracts International, 31, 1611A. Bitner, B. (1991). Formal operational reasoning modes: Predictors of critical thinking abilities and grades assigned by teachers in science and mathematics for students in grades nine through twelve. Journal of Research in Science Teaching, 28, 265–274. Awomolo, A. A. (1973). Teacher discussion leadership behavior in a public issues curriculum and some cognitive and personality correlates. (Doctoral dissertation, University of Toronto). Dissertation Abstracts International, 35, 316A. Bitner, C. & Bitner, B. (1988, April). Logical and critical thinking abilities of sixth through twelfth grade students and formal reasoning modes as predictors of critical thinking abilities and academic achievement. Paper presented at the 61st Annual Meeting of the National Association for Research in Science Teaching, Lake of the Ozarks, MO. Bass, J. C. (1959). An analysis of critical thinking in a college general zoology class. (Doctoral dissertation, University of Oklahoma). Dissertation Abstracts, 20, 963. Beckman, V. E. (1956). An investigation and analysis of the contributions to critical thinking made by courses in argumentation and discussion in selected colleges. (Doctoral dissertation, University of Minnesota). Dissertation Abstracts, 16, 2551. Berger, A. (1985). Review of test: Watson-Glaser Critical Thinking Appraisal. In J.V. Mitchell (Ed.), The ninth mental measurements yearbook. Lincoln: University of Nebraska Press. Blai, B. (1989). Critical thinking: The flagship of thinking skills? (ERIC Document Reproduction Service No. ED 311752). Bledso, J. C. (1955). A comparative study of values and critical thinking skills of a group of educational workers. The Journal of Educational Psychology, 46, 408–417. Bostrom, E. A. (1969). The effect of class size on critical thinking skills. (Doctoral dissertation, Arizona State University). Dissertation Abstracts, 29, 2032A. Brabeck, M. M. (1981, August). The relationship between critical thinking skills and the development of reflective judgment among adolescent and adult women. Paper presented at the 89th Annual Convention of the American Psychological Association, Los Angeles. 69 Watson-Glaser Short Form Manual Brabeck, M. M. (1983). Critical thinking skills and reflective judgment development; Redefining the aims of higher education. Journal of Applied Developmental Psychology, 4, 23–24. Bradberry, R. D. (1968). Relationships among critical thinking ability, personality attributes, and attitudes of students in a teacher education program. (Doctoral dissertation, North Texas State University, Denton). Dissertation Abstracts, 29, 163A. Brakken, E. (1965). 
Intellectual factors in Pssc and conventional high school physics. Journal of Research in Science Teaching, 3, 19–25. Braun, J. R. (1969). Search for correlates of self-actualization. Perceptual and Motor Skills, 28, 557–558. Broadhurst, N. A. (1969). A measure of some learning outcomes in matriculation chemistry in South Australia. Australian Science Teaching Journal, 15, 67–70. Broadhurst, N.A. (1969). A study of selected teacher factors in learning outcomes in chemistry in secondary schools in South Australia. (Doctoral dissertation, Oregon State University). Dissertation Abstracts International, 30, 485A. Broadhurst, N. A. (1970). An item analysis of the Watson-Glaser Critical Thinking Appraisal (Form Ym). Science Education, 54, 127–132. Brouillette, O. J. (1968). An interdisciplinary comparison of critical thinking objective among science and non-science majors in higher education. (Doctoral dissertation, University of Southern Mississippi). Dissertation Abstracts, 29, 2877A. Brubaker, H. L. (1972). Selection of college major by the variables of intelligence, creativity, and critical thinking. (Doctoral dissertation, Temple University). Dissertation Abstracts International, 33, 1507A. Bunt, D. D. (1974). Prediction of academic achievement and critical thinking of eighth graders in suburban, urban, and private schools through specific personality, ability, and school variables. (Doctoral dissertation, Northern Illinois University). Dissertation Abstracts International, 35, 2042A. Burns, R. L. (1974). The testing of a model of critical thinking ontogeny among Central Connecticut State College undergraduates. (Doctoral dissertation, University of Connecticut). Dissertation Abstracts International, 24, 5467A. Bybee, J. R. (1972). Prediction in the college of education doctoral program at the Ohio State University. (Doctoral dissertation, Ohio State University). Dissertation Abstracts International, 33, 4111A. 70 Campbell, I. C. (1976). The effects of selected facets of critical reading instruction upon active duty servicemen and civilian evening college adults. (Doctoral dissertation, University of Georgia). Dissertation Abstracts International, 37, 2591A. Cannon, A. G. (1974). The development and testing of a policy-capturing model for the selection of school administrators in a large urban school district. (Doctoral dissertation, The University of Texas at Austin). Dissertation Abstracts International, 35, 2565A. Canter, R. R. Jr. (1951). A human relations training program. Journal of Applied Psychology, 35, 38–45. Carleton, F. O. (1970). Relationships between follow-up evaluations and information developed in a management assessment center. Proceedings of the 78th Annual Convention of the American Psychological Association, 5, 565–566. Carlson, D. A. (1975). Training in formal reasoning abilities provided by the inquiry role approach and achievement on the Piagetian formal operational level. (Doctoral dissertation, University of Northern California). Dissertation Abstracts International, 36, 7368A. Carnes, D. D. (1969). A study of the critical thinking ability of secondary summer school mathematics students. (Doctoral dissertation, University of Mississippi). Dissertation Abstracts International, 30, 2242A. Cass, M. & Evans, E. D. (1992). Special education teachers and critical thinking. Reading Improvement, 29, 228–230. Chang, E. C. G. (1969). Norms and correlates of the Watson-Glaser Critical Thinking Appraisal and selected variables for Negro college students. 
(Doctoral dissertation, University of Oklahoma). Dissertation Abstracts International, 30, 1860A. Coble, C.R. & Hounshell, P.B. (1972). Teacher self-actualization and student progress. Science Education, 26, 311–316. Combs, C. M. Jr. (1968). An experiment with independent study in science education. (Doctoral dissertation, University of Mississippi). Dissertation Abstracts, 29, 3489A. Cook, J. (1955). Validity Information Exchange, No. 813: D.O.T. Code 0-17-01, Electrical Engineer. Personnel Psychology, 8, 261–262. Cooke, M. M. (1976). A study of the interaction of student and program variables for the purpose of developing a model for predicting graduation from graduate programs in educational administration at the State University of New York at Buffalo. (Doctoral dissertation, State University of New York). Dissertation Abstracts International, 37, 827A. Research Bibliography Corell, J. H. (1968). Comparison of two methods of counseling with academically deteriorated university upperclassmen. (Doctoral dissertation, Indiana University). Dissertation Abstracts, 29, 1419A. Corlett, D. (1974). Library skills, study habits and attitudes, and sex as related to academic achievement. Educational and Psychological Measurement, 34, 967–969. Cornett, E. (1977). A study of the aptitude-treatment interactions among nursing students regarding programmed modules and personological variables. (Doctoral dissertation, The University of Texas at Austin). Dissertation Abstracts International, 38, 3119B. Cousins, J. E. (1962). The development of reflective thinking in an eighth grade social studies class. (Doctoral dissertation, Indiana University). Dissertation Abstracts, 24, 195. Coyle, F. A. Jr. & Bernard, J. L. (1965). Logical thinking and paranoid schizophrenia. Journal of Psychology, 60, 283–289. Crane, W. J. (1962). Screening devices for occupational therapy majors. American Journal of Occupational Therapy, 16, 131–132. Crawford, C. D. (1956). Critical thinking and personal values in the listening situation: An exploratory investigation into the relationships of three theoretical variables in human communication as indicated by the relation between measurements on the Allport-Vernon-Lindzey Study of Values and the Watson-Glaser Critical Thinking Appraisal, and similar measurements of responses to a recorded radio news commentary. (Doctoral dissertation, New York University). Dissertation Abstracts, 19, 1845. Crites, J. O. (1972). Review of test: Watson-Glaser Critical Thinking Appraisal. In O.K. Buros (Ed.), The seventh mental measurements yearbook. Lincoln: University of Nebraska Press. Crosson, R. F. (1968). An investigation into certain personality variables among capital trial jurors. Proceedings of the 76th Annual Convention of the American Psychological Association, 3, 371–372. Cruce-Mast, A. L. (1975). The interrelationship of critical thinking, empathy, and social interest with moral judgment. (Doctoral dissertation, Southern Illinois University). Dissertation Abstracts International, 36, 7945A. Cyphert, F. R. (1961). The value structures and critical thinking abilities of secondary-school principals. The Bulletin of the National Association of Secondary-School Principals, 45, 43–47. D’Aoust, T. (1963). Predictive validity of four psychometric tests in a selected school of nursing. Unpublished master’s thesis, Catholic University of America, Washington, DC. Davis, W. N. (1971). Authoritarianism and selected trait patterns of school administrators: Seventeen case studies. 
(Doctoral dissertation, North Texas State University, Denton). Dissertation Abstracts International, 32, 1777A. De Loach, S. S. (1976). Level of ego development, degree of psychopathology, and continuation or termination of outpatient psychotherapy involvement. (Doctoral dissertation, Georgia State University). Dissertation Abstracts International, 37, 5348B. De Martino, H. A. (1970). The relations between certain motivational variables and attitudes about mental illness in student psychiatric nurses. (Doctoral dissertation, St. John’s University). Dissertation Abstracts International, 31, 3036A. Denney, L. L. (1968). The relationships between teaching method, critical thinking, and other selected teacher traits. (Doctoral dissertation, University of Missouri). Dissertation Abstracts, 29, 2586A. Dirr, P. M. (1966). Intellectual variables in achievement in modern algebra. (Doctoral dissertation, Catholic University of America). Dissertation Abstracts, 27, 2873A. Dispenzieri, A., Giniger, S., Reichman, W., & Levy, M. (1971). College performance of disadvantaged students as a function of ability and personality. Journal of Counseling Psychology, 18, 298–305. Dowling, R. E. (1990, February). Reflective judgment in debate: Or, the end of “critical thinking” as the goal of educational debate. Paper presented at the Annual Meeting of the Western Speech Communication Association, Sacramento. Dressel, P. & Mayhew, L. (1954). General education: Exploration in evaluation. Final Report of the Cooperative Study of Evaluation in General Education. Washington D.C.: American Council of Education. Eisenstadt, J. W. (1986). Remembering Goodwin Watson. Journal of Social Issues, 42, 49–52. Embretson, S. E. & Reise, S.P. (2000). Item response theory for psychologists. Mahwah, NJ: Lawrence Erlbaum Associates. Ennis, R.H. (1958). An appraisal of the Watson-Glaser Critical Thinking Appraisal. Journal of Educational Research, 52, 155–158. Fisher, A. (2001). Critical thinking: An introduction. New York: Cambridge University Press. Flora, L. D. (1966). Predicting academic success at Lynchburg College from multiple correlational analysis of four selected predictor variables. (Doctoral dissertation, University of Virginia). Dissertation Abstracts, 27, 2276A. 71 Watson-Glaser Short Form Manual Follman, J. (1969). A factor-analysis of three critical thinking tests, one logical reasoning test, and one English test. Yearbook of the National Reading Conference, 18, 154–160. Follman, J., (1970). Correlational and factor analysis of critical thinking, logical reasoning, and English total test scores. Florida Journal of Educational Research, 12, 91–94. Flollman, J., Brown, L., & Burg, E. (1970). Factor analysis of critical thinking, logical reasoning, and English subtest. Journal of Experimental Education, 38, 11–6. Follman, J., Hernandez, D., & Miller, W. (1969). Canonical correlation of scholastic aptitude and critical thinking. Psychology, 6, 3–6. Follman, J., Miller, W., & Burg, E. (1971). Statistical analysis of three critical thinking tests. Psychological Measurement, 31, 519–520. Foster, P. J. (1981). Clinical discussion groups: Verbal participation and outcomes. Journal of Medical Education, 56, 831–838. Frank, A. D. (1967). An experimental study in improving the critical thinking ability of high school students enrolled in a beginning speech course. (Doctoral dissertation, University of Wisconsin). Dissertation Abstracts, 28, 5168A. Friend, C. M. & Zubek, J. P. (1958). The effects of age on critical thinking ability. 
Journal of Gerontology, 13, 407–413. Gable, R. K., Roberts, A. D., & Owens, S. V. (1977). Affective and cognitive correlates of classroom achievement. Educational and Psychological Measurement, 37, 977–986. Geckler, J. W. (1965). Critical thinking, dogmatism, social status, and religious affiliation of tenth-grade students. (Doctoral dissertation, University of Tennessee). Dissertation Abstracts, 26, 886. Geisinger, K. F. (1998). Review of Watson-Glaser Critical Thinking Appraisal. In Impara, J. C. & Plake, B. S. (Eds.), The thirteenth mental measurements yearbook. Lincoln, NE: Buros Institute of Mental Measurements. George, K.I. & Smith, M.C. (1990). An empirical comparison of self-assessment and organizational assessment in personnel selection. Public Personnel Management, 19, 175–190. George, K. D. (1968). The effect of critical thinking ability upon course grades in biology. Science Education, 52, 421–426. Glidden, G. W. (1964). Factors that influence achievement in senior high school American History. (Doctoral dissertation, University of Nebraska). Dissertation Abstracts, 25, 3429. 72 Grace, J. L., Jr. (1968). Critical thinking ability of students in Catholic and public high schools. National Catholic Education Association Briefs, 65, 49–57. Grasz, C. S. (1977). A study to determine the validity of test scores and other selected factors as predictors of success in a basic course in educational administration. (Doctoral dissertation, Rutgers - The State University of New Jersey). Dissertation Abstracts International, 37, 7436A. Gunning, C. S. (1981). Relationships among field independence, critical thinking ability, and clinical problem-solving ability of baccalaureate nursing students. (Doctoral dissertation, The University of Texas at Austin). Dissertation Abstracts International, 42, 2780. Guster, D. & Batt, R. (1989). Cognitive and affective variables and their relationships to performance in a Lotus 1-2-3 class. Collegiate Microcomputer, 7, 151–156. Haas, M. G. (1963). A comparative study of critical thinking, flexibility of thinking, and reading ability involving religious and lay college seniors. (Doctoral dissertation, Fordham University). Dissertation Abstracts, 24, 622. Hall, W. C., Jr. & Myers, C. B. (1977). The effect of a training program in the Taba teaching strategies on teaching methods and teacher perceptions of their teaching. Peabody Journal of Education, 54, 162–167. Hardesty, D. L. & Jones, W. S. (1968). Characteristics of judged high potential management personnel— The operations of an industrial assessment center. Personnel Psychology, 21, 85–98. Hatano, G. & Kuhara, K. (1980, April). The recognition of inferences from a story among high and low critical thinkers. Paper presented at the 64th Annual Meeting of the American Educational Research Association, Boston. Helm, C. R. (1967). Watson-Glaser-DAT graduate norms. Unpublished master’s thesis, University of Toledo, OH. Helmstadter, G. C. (1972). Review of Watson-Glaser Critical Thinking Appraisal. In O. K. Buros (Ed.), The seventh mental measurements yearbook. Lincoln: University of Nebraska Press. Henkel, E. T. (1968). Undergraduate physics instruction and critical thinking ability. Journal of Research in Science Teaching, 5, 89–94. Hicks, R. E. & Southey, G. N. (1990). The Watson-Glaser Critical Thinking Appraisal and the performance of business management students. Psychological Test Bulletin, 3, 74–81. Research Bibliography Hildebrant, S & Lucas, J. A. (1980). 
Follow-up students who majored and are majoring in legal technology. Research report, William Raney Harper College, Pallatine, IL. Hill, O. W., Pettus, W. C., & Hedin, B. A. (1990). Three studies of factors affecting the attitudes of Blacks and females toward the pursuit of science and sciencerelated careers. Journal of Research in Science Teaching, 27, 289–314. Hill, W. H. (1959). Review of Watson-Glaser Critical Thinking Appraisal. In O. K. Buros (Ed.), The fifth mental measurements yearbook. Lincoln: University of Nebraska Press. Hillis, S. R. (1975). The relationship of inquiry orientation in secondary physical science classrooms and student’s critical thinking skills, attitudes, and views of science. (Doctoral dissertation. The University of Texas at Austin). Dissertation Abstracts International, 36, 805A. Himaya, M. I. (1972). Identification of possible variables for predicting student changes in physical science courses designed for non-science majors. (Doctoral dissertation, University of Iowa). Dissertation Abstracts International, 34, 67A. Hudson, V. C., Jr. (1972). A study of the relationship between the social studies student teacher’s divergent thinking ability and his success in promoting divergent thinking in class discussion. (Doctoral dissertation, University of Arkansas). Dissertation Abstracts International, 33, 2219A. Hughes, T. M., et al. (1987, November). The prediction of teacher burnout through personality type, critical thinking, and self-concept. Paper presented at the Annual Meeting of the Mid-South Educational Research Association, Mobile, AL. Hunt, D. & Randhawa, B. S. (1973). Relationship between and among cognitive variables and achievement in computational science. Educational and Psychological Measurement, 33, 921–928. Hunt, E. J. (1967). The critical thinking ability of teachers and its relationship to the teacher’s classroom verbal behavior and perceptions of teaching purposes. (Doctoral dissertation, University of Maryland). Dissertation Abstracts, 28, 4511A. Hunter, N. W. (1968). A study of the factors which may affect a student’s success in quantitative analysis. (Doctoral dissertation, University of Toledo). Dissertation Abstracts, 29, 2437A. Hinojosa, T. R., Jr. (1974). The influence of idiographic variables on leadership style: A study of special education administrators (Plan A) in Texas. (Doctoral dissertation, The University of Texas at Austin). Dissertation Abstracts International, 35, 2082A. Hurov, J. T. (1987). A study of the relationship between reading, computational, and critical thinking skills and academic success in fundamentals of chemistry. (ERIC Document Reproduction Service No. ED 286569). Hjelmhaug, N. N. (1971). Context instruction and the ability of college students to transfer learning. (Doctoral dissertation, Indiana University). Dissertation Abstracts International, 32, 1356A. Ivens, S. H. (1998). Review of Watson-Glaser Critical Thinking Appraisal. In Impara, J. C. & Plake, B. S. (Eds.), The thirteenth mental measurements yearbook. Lincoln, NE: Buros Institute of Mental Measurements. Holdampf, B. A. (1983). Innovative associate degree nursing program—remote area: A comprehensive final report on exemplary and innovative proposal. Department of Occupational Education and Technology, Texas Education Agency, Austin, TX. Jabs, M. L. (1969). An experimental study of the comparative effects of initiating structure and consideration leadership on the educational growth of college students. 
(Doctoral dissertation, University of Connecticut). Dissertation Abstracts International, 30, 2762A. Hollenbach, J. W. & De Graff, C. (1957). Teaching for thinking. Journal of Higher Education, 28, 126–130. Hoogstraten, J. & Christiaans, H. H. C. M. (1975). The relationship of the Watson-Glaser Critical Thinking Appraisal to sex and four selected personality measures for a sample of Dutch first-year psychology students. Educational and Psychological Measurement, 35, 969–973. Houle, C. (1943). Evaluation in the eight-year study. Curriculum Journal, 14, 18–21. Hovland, C. I. (1959). Review of Watson-Glaser Critical Thinking Appraisal. In O. K. Buros (Ed.), The fifth mental measurements yearbook. Lincoln: University of Nebraska Press. James, R. J. (1971). Traits associated with the initial and persistent interest in the study of college science. (Doctoral dissertation, State University of New York). Dissertation Abstracts International, 32, 1296A. Jenkins, A. C. (1966). The relationship of certain measurable factors to academic success in freshman biology. (Doctoral dissertation, New York University). Dissertation Abstracts, 27, 2279A. Jurgenson, E. M. (1958). The relationship between success in teaching vocational agriculture and ability to make sound judgments as measured by selected instruments. (Doctoral dissertation, Pennsylvania State University). Dissertation Abstracts, 19, 96. 73 Watson-Glaser Short Form Manual Kenoyer, M. F. (1961). The influence of religious life on three levels of perceptual processes. (Doctoral dissertation, Fordham University). Dissertation Abstracts, 22, 909. Ketefian, S. (1981). Critical thinking, educational preparation, and development of moral judgment among selected groups of practicing nurses. Nursing Research, 30, 98–103. Kintgen-Andrews, J. (1988). Development of critical thinking: Career ladder P.N. and A.D. nursing students, pre-health science freshmen, generic baccalaureate sophomore nursing students. (ERIC Document Reproduction Service No. ED 297153). Kirtley, D. & Harkless, R. (1970). Student political activity in relation to personal and social adjustment. Journal of Psychology, 75, 253–256. Klassen, Peter T. (1984). Changes in personal orientation and critical thinking among adults returning to school through weekend college: An alternative evaluation. Innovative Higher Education, 8, 55–67. Kleg, M. (1987). General social studies knowledge and critical thinking among pre-service elementary teachers. International Journal of Social Education, 1, 50–63. Kooker, E. W. (1971). The relationship between performance in a graduate course in statistics and the Miller Analogies Tests and the Watson-Glaser Critical Thinking Appraisal. The Journal of Psychology, 77, 165–169. Krockover, G. H. (1965). The development of critical thinking through science instruction. Proceedings of the Iowa Academy of Sciences, 72, 402–404. La Forest, J. R. (1970). Relation of critical thinking to program planning. (Doctoral dissertation, North Carolina State University). Dissertation Abstracts International, 32, 1253A. Land, M. (1963). Psychological tests as predictors for scholastic achievement of dental students. Journal of Dental Education, 27, 25–30. Landis, R. F. & Michael, W. B. (1981). The factorial validity of three measures of critical thinking within the context of Guilford’s structure-of-intellect model for a sample of ninth grade students. Educational & Psychological Measurement, 41, 1147–1166. Larter, S. J. & Taylor, P. A. (1969). 
A study of aspects of critical thinking. Manitoba Journal of Education, 5, 35–53. Leadbeater, B. J. & Dionne, J. P. (1981). The adolescent’s use of formal operational thinking solving problems related to identity resolution. Adolescence, 16, 111–121. Lewis, D. R. & Dahl, T. (1971). The Test of Understanding in College Economics and its construct validity. Journal of Economics Education, 2, 155–166. 74 Little, T. L. (1972). The relationship of critical thinking ability to intelligence, personality factors, and academic achievement. (Doctoral dissertation, Memphis State University). Dissertation Abstracts International, 33, 5554A. Litwin, J. L. & Haas, P. F. (1983). Critical thinking: An intensive approach. Journal of Learning Skills, 2, 43–47. Lowe, A. J., Follman, J., Burley, W., & Follman, J. (1971). Psychometric analysis of critical reading and critical thinking test scores – twelfth grade. Yearbook of the National Reading Conference, 20, 142–174. Lucas, A. M. & Broadhurst, N. A. (1972). Changes in some content-free skills, knowledge, and attitudes during two terms of Grade 12 biology instruction in ten South Australian schools, Australian Science Teaching Journal, 18, 66–74. Luck, J. I. & Gruner, C. R. (1970). Note on authoritarianism and critical thinking ability. Psychological Reports, 27, 380. Luton, J. N. (1955). A study of the use of standardized tests in the selection of potential educational administrators. Unpublished doctoral dissertation, University of Tennessee, Memphis. Lysaught, J. P. (1963). An analysis of factors related to success in construction programmed learning sequences. Journal Programmed Instruction, 2, 35–42. Lysaught, J. P. (1964). An analysis of factors related to success in constructing programmed learning sequences. (Doctoral dissertation, University of Rochester, NY). Dissertation Abstracts, 25, 1749. Lysaught, J. P. (1964). Further analysis of success among auto-instructional programmers, Teaching Aid News, 4, 6–11. Lysaught, J. P. (1964). Selecting instructional programmers: new research into characteristics of successful programmers. Training Directors Journal, 18, 8–14. Lysaught, J. P. & Pierleoni, R. G. (1964). A comparison of predicted and actual success in auto-instructional programming. Journal of Programmed Instruction, 3, 14–23. Lysaught, J. P. &Pierleoni, R. G. (1970). Predicting individual success in programming self-instructional materials. Audio-Visual Community Research, 18, 5–24. Marrs, L. W. (1971). The relationship of critical thinking ability and dogmatism to changing regular class teachers’ attitudes toward exceptional children. (Doctoral dissertation, The University of Texas at Austin). Dissertational Abstracts International, 33, 638A. Mathias, R. O. (1973). Assessment of the development of critical thinking skills and instruction in grade eight social studies in Mt. Lebanon school district. (Doctoral dissertation, University of Pittsburgh). Dissertation Abstracts International, 34, 1064A. Research Bibliography McCammon, S. L., Golden, J., & Wuensch, K. L. (1988). Predicting course performance in freshman and sophomore physics courses: Women are more predictable than men. Journal of Research in Science Teaching, 25, 501–510. McCloudy, C. W. (1974). An experimental study of critical thinking skills as affected by intensity and types of sound. (Doctoral dissertation, East Texas State University, Commerce). Dissertation Abstracts International, 35, 4086A. McCutcheon, L. E., Apperson J. M., Hanson, E., & Wynn, V. (1992). 
Relationships among critical thinking skills, academic achievement, and misconceptions about psychology. Psychological Reports, 71, 635–639. McMurray, M. A., Beisenherz, P., & Thompson, B. (1991). Reliability and concurrent validity of a measure of critical thinking skills in biology. Journal of Research in Science Teaching, 28, 183–191. Miller, D. A., Sadler, J. Z. & Mohl, P. C. (1993). Critical thinking in preclinical course examinations, Academic Medicine, 68, 303–305. Miller, W., Follman, J., & Hernandez, D. E. (1970). Discriminate analysis of school children in integrated and non-integrated schools using tests of critical thinking. Florida Journal of Educational Research, 12, 63–68. Milton, O. (1960). Primitive thinking and reasoning among college students. Journal of Higher Education, 31, 218–220. Moskovis, L. M. (1967). An identification of certain similarities and differences between successful and unsuccessful college level beginning shorthand students and transcription student. (Doctoral dissertation, Michigan State University). Dissertation Abstracts, 28, 4826A. Moskovis, L. M. (1970). Similarities and differences of college-level successful and unsuccessful shorthand students. Delta Pi Epsilon Journal, 12, 12–16. Murphy, A. J. (1973). The relationship of leadership potential to selected admission criteria for the advanced programs in educational administration. (Doctoral dissertation, State University of New York). Dissertation Abstracts International, 34, 1545A. Nixon, J. T. (1973). The relationship of openness to academic performance, critical thinking, and school morale in two school settings, (Doctoral dissertation, George Peabody College for Teachers). Dissertation Abstracts International, 34, 3999A. Norris, C. A., Jackson, L., & Poirot, J. L. (1992). The effect of computer science instruction on critical thinking skills and mental alertness. Journal of Research on Computing in Education, 24, 329–337. Norris, S. P. (1986). Evaluating critical thinking ability. History and Social Science Teacher, 21, 135–146. Norris, S. P. (1988). Controlling for background beliefs when developing multiple-choice critical thinking tests. Educational Measurement, 7, 5–11. Modjeski, R. B. & Michael, W. B. (1983). An evaluation by a panel of psychologists of the reliability and validity of two tests of critical thinking. Educational and Psychological Measurement, 43, 1187–1197. Norton, S., et al (1985, November). The effects of an independent laboratory investigation on the critical thinking ability and scientific attitude of students in a general microbiology class. Paper presented at the 14th Annual Meeting of the Mid-South Research Association, Biloxi, MS. Moffett, C. R. (1954). Operational characteristics of beginning master’s students in educational administration and supervision. Unpublished doctoral dissertation, University of Tennessee, Memphis. Nunnery, M. Y. (1959). How useful are standardized psychological tests in the selection of school administrators? Educational Administration and Support, 45, 349–356. Molidor, J., Elstein, A., & King, L. (1978). Assessment of problem solving skills as a screen for medical school admissions. Michigan State University, Report No. TM 800383. East Lansing: National Fund for Medical Education. (ERIC Document Reproduction Service No. Ed. 190595). Obst, F. (1963). A study of abilities of women students entering the Colleges of Letters and Science and Applied Arts at the University of California, Los Angeles, Journal of Educational Research, 57, 54–86. 
Glossary of Measurement Terms

This glossary of terms is intended to aid in the interpretation of statistical information presented in the Watson-Glaser Short Form Manual, as well as other manuals published by Harcourt Assessment, Inc. The terms defined are fairly common and basic. In the definitions, certain technicalities have been sacrificed for the sake of succinctness and clarity.

achievement test—A test that measures the extent to which a person has "achieved" something, acquired certain information, or mastered certain skills—usually as a result of planned instruction or training.

alternate-form reliability—The closeness of correspondence, or correlation, between results on alternate forms of a test; thus, a measure of the extent to which the two forms are consistent or reliable in measuring whatever they do measure. The time interval between the two testings must be relatively short so that the examinees are unchanged in the ability being measured. See reliability, reliability coefficient.
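As an illustration of the alternate-form reliability coefficient defined above, the following minimal sketch (hypothetical scores; Python with numpy assumed available) correlates scores from two forms of a test. The resulting Pearson r is the alternate-form reliability estimate.

    import numpy as np

    # Hypothetical raw scores for the same ten examinees on two alternate forms
    form_a = np.array([28, 31, 25, 34, 22, 30, 27, 33, 26, 29])
    form_b = np.array([27, 32, 24, 35, 23, 28, 28, 34, 25, 30])

    # The Pearson product-moment correlation between the two forms
    # serves as the alternate-form reliability estimate.
    r_ab = np.corrcoef(form_a, form_b)[0, 1]
    print(round(r_ab, 2))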
aptitude—A combination of abilities and other characteristics, whether innate or acquired, that are indicative of an individual's ability to learn or to develop proficiency in some particular area if appropriate education or training is provided. Aptitude tests include those of general academic ability (commonly called mental ability or intelligence tests); those of special abilities, such as verbal, numerical, mechanical, or musical; tests assessing "readiness" for learning; and prognostic tests, which measure both ability and previous learning and are used to predict future performance, usually in a field requiring specific skills, such as speaking a foreign language, taking shorthand, or nursing.

arithmetic mean—A kind of average usually referred to as the "mean." It is obtained by dividing the sum of a set of scores by the number of scores. See central tendency.

average—A general term applied to the various measures of central tendency. The three most widely used averages are the arithmetic mean (mean), the median, and the mode. When "average" is used without designation as to type, the most likely assumption is that it is the arithmetic mean. See central tendency, arithmetic mean, median, mode.

battery—A group of several tests standardized on the same population so that results on the several tests are comparable. Sometimes applied to any group of tests administered together, even though not standardized on the same subjects.

ceiling—The upper limit of ability that can be measured by a test. When an individual earns a score at or near the highest possible score, the "ceiling" of the test is said to be too low for that individual, and the person should be given a higher-level test.

central tendency—A measure of central tendency provides a single most typical score as representative of a group of scores. The "trend" of a group of measures is indicated by some type of average, usually the mean or the median.

Classical Test Theory (also known as True Score Theory)—The earliest theory of psychological measurement, based on the idea that the observed score a person gets on a test is composed of the person's theoretical "true score" and an "error score" due to unreliability (or imperfection) in the test. In Classical Test Theory (CTT), item difficulty is indicated by the proportion (p) of examinees who answer a given item correctly. Note that in CTT, the more difficult an item is, the lower p is for that item.
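A minimal sketch (hypothetical item responses; Python with numpy assumed) of the CTT difficulty index described above: p is simply the proportion of examinees answering each item correctly, so harder items have lower p values.

    import numpy as np

    # Hypothetical scored responses: rows are examinees, columns are items (1 = correct, 0 = incorrect)
    responses = np.array([
        [1, 1, 0, 1],
        [1, 0, 0, 1],
        [1, 1, 1, 1],
        [0, 1, 0, 1],
        [1, 0, 0, 0],
    ])

    # Item difficulty p: proportion correct for each item (column means)
    p = responses.mean(axis=0)
    print(p)  # e.g., the first item was answered correctly by 80% of examinees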
composite score—A score that combines several scores, usually by addition; often different weights are applied to the contributing scores to increase or decrease their importance in the composite. Most commonly, such scores are used for predictive purposes, and the weights are derived through multiple regression procedures.

correlation—Relationship or "going-togetherness" between two sets of scores or measures; the tendency of one score to vary concomitantly with the other, as in the tendency of students of high IQ to be above average in reading ability. The existence of a strong relationship (i.e., a high correlation) between two variables does not necessarily indicate that one has any causal influence on the other. Correlations are usually denoted by a coefficient; the correlation coefficient most frequently used in test development and educational research is the Pearson or product-moment r. Unless otherwise specified, "correlation" usually refers to this coefficient. Correlation coefficients range from –1.00 to +1.00; a coefficient of 0.00 denotes a complete absence of relationship, and coefficients of –1.00 or +1.00 indicate perfect negative or positive relationships, respectively.

criterion—A standard by which a test may be judged or evaluated; a set of other test scores, job performance ratings, etc., that a test is designed to measure, predict, or correlate with. See validity.

cutoff score (cut score)—A specified point on a score scale at or above which applicants pass the test and below which applicants fail the test.

deviation—The amount by which a score differs from some reference value, such as the mean, the norm, or the score on some other test.

difficulty index (p or b)—The proportion of examinees correctly answering an item. The greater the proportion of correct responses, the easier the item.

discrimination index (d or a)—The difference between the proportion of high-scoring examinees who correctly answer an item and the proportion of low-scoring examinees who correctly answer the item. The greater the difference, the more information the item provides about the examinee's level of performance.

distribution (frequency distribution)—A tabulation of the scores (or other attributes) of a group of individuals to show the number (frequency) of each score, or of those within the range of each interval.

factor analysis—A term that represents a large number of different mathematical procedures for summarizing the interrelationships among a set of variables or items in terms of a reduced number of hypothetical variables, called factors. Factors are used to summarize scores on multiple variables in terms of a single score, and to select items that are homogeneous.

factor loading—An index, similar to the correlation coefficient in size and meaning, of the degree to which a variable is associated with a factor; in test construction, a number that represents the degree to which an item is related to a set of homogeneous items.

fit to the model—No model can be expected to represent complex human behavior or ability perfectly; as a reasonable approximation, however, a model can provide many practical benefits. Item-difficulty and person-ability values are initially estimated on the assumption that the model is correct. An examination of the data reveals whether or not the model satisfactorily predicts each person's actual pattern of item passes and failures. The model-fit statistic, based on discrepancies between predicted and observed item responses, identifies items that "fit the model" better. Such items are then retained in a shorter version of a long test.

internal consistency—Degree of relationship among the items of a test; consistency in content sampling.

item response theory (IRT)—Refers to a variety of techniques based on the assumption that performance on an item is related to the estimated amount of the "latent trait" that the examinee possesses. IRT techniques show the measurement efficiency of an item at different ability levels. In addition to yielding mathematically refined indices of item difficulty (b) and item discrimination (a), IRT models may contain additional parameters (e.g., guessing).
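As a sketch of the IRT entry above, the three-parameter logistic model is one common formulation in which the probability of a correct response depends on ability (theta), item difficulty (b), item discrimination (a), and a guessing parameter (c). The parameter values below are hypothetical.

    import math

    def p_correct(theta, a, b, c):
        """Three-parameter logistic (3PL) item response function."""
        return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))

    # Hypothetical item: moderate discrimination, average difficulty, some guessing
    a, b, c = 1.2, 0.0, 0.20
    for theta in (-2.0, 0.0, 2.0):
        # Probability of a correct response rises with ability
        print(theta, round(p_correct(theta, a, b, c), 2))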
mean (M)—See arithmetic mean, central tendency.

median (Md)—The middle score in a distribution or set of ranked scores; the point (score) that divides the group into two equal parts; the 50th percentile. Half of the scores are below the median and half above it, except when the median itself is one of the obtained scores. See central tendency.

mode—The score or value that occurs most frequently in a distribution.

multitrait-multimethod matrix—An experimental design to examine both convergent and discriminant validity, involving a matrix showing the correlations between the scores obtained (1) for the same trait by different methods, (2) for different traits by the same method, and (3) for different traits by different methods. Construct-valid measures show higher same-trait/different-method correlations than the correlations obtained for different traits measured by different methods or for different traits measured by the same method.

normal distribution—A distribution of scores or measures that in graphic form has a distinctive bell-shaped appearance. In a perfect normal distribution, scores or measures are distributed symmetrically around the mean, with as many cases at various distances above the mean as at equal distances below it. Cases are concentrated near the mean and decrease in frequency, according to a precise mathematical equation, the farther one departs from the mean. Mean, median, and mode are identical. The assumption that mental and psychological characteristics are distributed normally has been very useful in test development work.

normative data (norms)—Statistics that supply a frame of reference by which meaning may be given to obtained test scores. Norms are based upon the actual performance of individuals in the standardization sample(s) for the test. Since they represent average or typical performance, they should not be regarded as standards or as universally desirable levels of attainment. The most common types of norms are deviation IQ, percentile rank, grade equivalent, and stanine. Reference groups are usually those of specified occupations, age, grade, gender, or ethnicity.

part-whole correlation—A correlation between one variable and another variable representing a subset of the information contained in the first; in test construction, the correlation between a score based on a set of items and another score based on a subset of the same items.
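A minimal sketch of the part-whole correlation defined above (hypothetical data; numpy assumed): a score on a subset of items is correlated with the total score that contains those same items. The overlap between part and whole tends to make this correlation high.

    import numpy as np

    # Hypothetical item scores: rows are examinees, columns are items
    items = np.array([
        [1, 0, 1, 1, 0, 1],
        [1, 1, 1, 0, 1, 1],
        [0, 0, 1, 0, 0, 1],
        [1, 1, 0, 1, 1, 1],
        [0, 1, 0, 0, 1, 0],
    ])

    total = items.sum(axis=1)          # whole: score on all six items
    part = items[:, :3].sum(axis=1)    # part: score on the first three items only

    # Part-whole correlation: the subset score correlated with the total that contains it
    r_pw = np.corrcoef(part, total)[0, 1]
    print(round(r_pw, 2))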
percentile (P)—A point (score) in a distribution at or below which fall the percent of cases indicated by the percentile. Thus a score coinciding with the 35th percentile (P35) is regarded as equaling or surpassing 35% of the persons in the group, such that 65% of the performances exceed this score. "Percentile" does not mean the percent of correct answers on a test. Use of percentiles in interpreting scores offers a number of advantages: percentiles are easy to compute and understand, can be used with any type of examinee, and are suitable for any type of test. The primary drawback of a raw-score-to-percentile conversion is the resulting inequality of units, especially at the extremes of the distribution of scores. For example, in a normal distribution, scores cluster near the mean and decrease in frequency the farther one departs from the mean. In the transformation to percentiles, raw score differences near the center of the distribution are exaggerated—small raw score differences may lead to large percentile differences. This is especially the case when a large proportion of examinees receive the same or similar scores, causing a one- or two-point raw score difference to result in a 10- or 15-unit percentile difference. Short tests with a limited number of possible raw scores often result in a clustering of scores; the resulting effect on tables of selected percentiles is "gaps" in the table corresponding to points in the distribution where scores cluster most closely together.

percentile band—An interpretation of a test score which takes into account the measurement error that is involved. The range of such bands, most useful in portraying significant differences in battery profiles, is usually from one standard error of measurement below the obtained score to one standard error of measurement above the score.

percentile rank (PR)—The expression of an obtained test score in terms of its position within a group of 100 scores; the percentile rank of a score is the percent of scores equal to or lower than the given score in its own or some external reference group.

point-biserial correlation (rpbis)—A type of correlation coefficient calculated when one variable represents a dichotomy (e.g., 0 and 1) and the other represents a continuous or multi-step scale. In test construction, the dichotomous variable is typically the item score (i.e., correct or incorrect) and the other is typically the number correct for the entire test; good test items have moderate to high positive point-biserial correlations (i.e., more high-scoring examinees than low-scoring examinees answer the item correctly).

practice effect—The influence of previous experience with a test on a later administration of the same or similar test; usually an increased familiarity with the directions, kinds of questions, etc. Practice effect is greatest when the interval between testings is short, when the content of the two tests is identical or very similar, and when the initial test-taking represents a relatively novel experience for the subjects.

profile—A graphic representation of the results on several tests, for either an individual or a group, when the results have been expressed in some uniform or comparable terms (standard scores, percentile ranks, grade equivalents, etc.). The profile method of presentation permits identification of areas of strength or weakness.

r—See correlation.

range—For some specified group, the difference between the highest and the lowest obtained score on a test; thus a very rough measure of spread or variability, since it is based upon only two extreme scores. Range is also used in reference to the possible range of scores on a test, which in most instances is the number of items in the test.

Rasch model—A technique in Item Response Theory (IRT) using only the item difficulty parameter. This model assumes that both guessing and item differences in discrimination are negligible.

raw score—The first quantitative result obtained in scoring a test. Examples include the number of right answers, number right minus some fraction of number wrong, time required for performance, number of errors, or similar direct, unconverted, uninterpreted measures.

reliability—The extent to which a test is consistent in measuring whatever it does measure; dependability, stability, trustworthiness, relative freedom from errors of measurement. Reliability is usually expressed by some form of reliability coefficient or by the standard error of measurement derived from it.
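The point-biserial correlation defined above can be computed directly as a Pearson correlation between the dichotomous item score and the continuous total score. The sketch below does exactly that, with hypothetical data and numpy assumed.

    import numpy as np

    # Hypothetical data for eight examinees
    item_score = np.array([1, 0, 1, 1, 0, 1, 0, 1])           # correct/incorrect on one item
    total_score = np.array([32, 18, 29, 35, 21, 27, 19, 31])  # number correct on the whole test

    # The point-biserial is the Pearson correlation between a dichotomous
    # variable and a continuous one.
    r_pbis = np.corrcoef(item_score, total_score)[0, 1]
    print(round(r_pbis, 2))  # a moderate-to-high positive value indicates a good item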
reliability coefficient—The coefficient of correlation between two forms of a test, between scores on two administrations of the same test, or between halves of a test, properly corrected. The three measure somewhat different aspects of reliability, but all are properly spoken of as reliability coefficients. See alternate-form reliability, split-half reliability coefficient, test-retest reliability coefficient.

representative sample—A subset that corresponds to or matches the population of which it is a sample with respect to characteristics important for the purposes under investigation. In a norm sample for a clerical aptitude test, such significant aspects might be the level of clerical training and work experience of those in the sample, the type of job they hold, and the geographic location of the sample.

split-half reliability coefficient—A coefficient of reliability obtained by correlating scores on one half of a test with scores on the other half, and applying the Spearman-Brown formula to adjust for the double length of the total test. Generally, but not necessarily, the two halves consist of the odd-numbered and the even-numbered items. Split-half reliability coefficients are sometimes referred to as measures of the internal consistency of a test; they involve content sampling only, not stability over time.

standard deviation (SD)—A measure of the variability or dispersion of a distribution of scores. The more the scores cluster around the mean, the smaller the standard deviation. For a normal distribution, approximately two-thirds (about 68%) of the scores are within the range from one SD below the mean to one SD above the mean. Computation of the SD is based upon the square of the deviation of each score from the mean.

standard error (SE)—A statistic providing an estimate of the possible magnitude of "error" present in some obtained measure, whether (1) an individual score or (2) some group measure, such as a mean or a correlation coefficient.

(1) standard error of measurement (SEM)—As applied to a single obtained score, the amount by which the score may differ from the hypothetical true score due to errors of measurement. The larger the SEM, the less reliable the measurement and the less dependable the score. The SEM is an amount such that in about two-thirds of cases the obtained score would not differ from the true score by more than one SEM. (Theoretically, then, the chances are about 2:1 that the actual score lies within a band extending from the true score minus one SEM to the true score plus one SEM; but since the true score can never be known, actual practice must reverse the true-obtained relation for an interpretation.) Other probabilities are noted under (2) below. See true score.

(2) standard error—When applied to sample estimates (e.g., group averages, standard deviations, correlation coefficients), the SE provides an estimate of the "error" which may be involved. The sample or group size and the SD are the factors on which standard errors are based. The same probability interpretation is made for the SEs of group measures as is made for the SEM; that is, 2 out of 3 sample estimates will lie within 1.0 SE of the "true" value, 95 out of 100 within 1.96 SE, and 99 out of 100 within 2.6 SE.
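One common classical formula, not stated explicitly in the definitions above but consistent with them, estimates the SEM from the standard deviation and the reliability coefficient: SEM = SD x sqrt(1 - reliability). A minimal sketch with hypothetical values:

    import math

    sd = 5.2            # hypothetical standard deviation of test scores
    reliability = 0.85  # hypothetical reliability coefficient

    sem = sd * math.sqrt(1.0 - reliability)
    print(round(sem, 2))  # about 2.01

    # Roughly 2 out of 3 obtained scores are expected to fall within
    # one SEM of the corresponding true score.
    obtained = 30
    print(obtained - sem, obtained + sem)  # a one-SEM band around an obtained score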
standard score—A general term referring to any of a variety of "transformed" scores, in terms of which raw scores may be expressed for reasons of convenience, comparability, ease of interpretation, etc. The simplest type of standard score, known as a z score, expresses the deviation of a score from the mean score of the group in relation to the standard deviation of the scores of the group. Thus,

    standard score (z) = (raw score - mean) / standard deviation

Adjustments may be made in this ratio so that a system of standard scores having any desired mean and standard deviation may be set up. The use of such standard scores does not affect the relative standing of the individuals in the group or change the shape of the original distribution. Standard scores are useful in expressing the raw scores of two forms of a test in comparable terms in instances where tryouts have shown that the two forms are not identical in difficulty. Also, successive levels of a test may be linked to form a continuous standard-score scale, making across-battery comparisons possible.

standardized test (standard test)—A test designed to provide a systematic sample of individual performance, administered according to prescribed directions, scored in conformance with definite rules, and interpreted in reference to certain normative information. Some would further restrict usage of the term "standardized" to those tests for which the items have been chosen on the basis of experimental evaluation, and for which data on reliability and validity are provided.

statistical equivalence—Occurs when test forms measure the same construct and every level of the construct is measured with equal accuracy by the forms. Statistically equivalent test forms may be used interchangeably.

testlet—A single test scenario that has a number of test questions based directly on the scenario. A testlet score is generated by summing the responses for all items in the testlet.

test-retest reliability coefficient—A type of reliability coefficient obtained by administering the same test a second time, after a short interval, and correlating the two sets of scores. "Same test" was originally understood to mean identical content, i.e., the same form. Currently, however, the term "test-retest" is also used to describe the administration of different forms of the same test, in which case this reliability coefficient becomes the same as the alternate-form coefficient. In either type, the correlation may be affected by fluctuations over time, differences in testing situations, and practice. When the time interval between the two testings is considerable (i.e., several months), a test-retest reliability coefficient reflects not only the consistency of measurement provided by the test, but also the stability of the trait being measured.

true score—A score entirely free of error; hence, a hypothetical value that can never be obtained by psychological testing, because testing always involves some measurement error. A "true" score may be thought of as the average score from an infinite number of measurements on the same or exactly equivalent tests, assuming no practice effect or change in the examinee during the testings. The standard deviation of this infinite number of "samplings" is known as the standard error of measurement.
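A minimal sketch of the standard score transformation defined above (hypothetical raw scores; numpy assumed). The target mean of 100 and SD of 15 is simply one illustrative choice of "any desired mean and standard deviation."

    import numpy as np

    raw = np.array([22, 27, 30, 33, 38], dtype=float)  # hypothetical raw scores

    # z score: deviation from the group mean in standard-deviation units
    z = (raw - raw.mean()) / raw.std()

    # Rescaled standard scores with a chosen mean of 100 and SD of 15
    scaled = 100 + 15 * z
    print(np.round(z, 2))
    print(np.round(scaled, 1))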
validity—The extent to which a test does the job for which it is used. This definition is more satisfactory than the traditional "extent to which a test measures what it is supposed to measure," since the validity of a test is always specific to the purposes for which the test is used.

(1) content validity. For achievement tests, validity is the extent to which the content of the test represents a balanced and adequate sampling of the outcomes (knowledge, skills, etc.) of the job, course, or instructional program it is intended to cover. It is best evidenced by a comparison of the test content with job descriptions, courses of study, instructional materials, and statements of educational goals, and often by analysis of the processes required in making correct responses to the items. Face validity, referring to an observation of what a test appears to measure, is a non-technical type of evidence; apparent relevancy is, however, quite desirable.

(2) criterion-related validity. The extent to which scores on the test are in agreement with (concurrent validity) or predict (predictive validity) some given criterion measure. Predictive validity refers to the accuracy with which an aptitude, prognostic, or readiness test indicates future success in some area, as evidenced by correlations between scores on the test and future criterion measures of such success (e.g., the relation of a score on a clerical aptitude test administered at the application phase to job performance ratings obtained after a year of employment). In concurrent validity, no significant time interval elapses between administration of the test and collection of the criterion measure. Such validity might be evidenced by concurrent measures of academic ability and of achievement, by the relation of a new test to one generally accepted as or known to be valid, or by the correlation between scores on a test and criterion measures that are valid but less objective and more time-consuming to obtain than a test score.

(3) evidence based on internal structure. The extent to which a test measures some relatively abstract psychological trait or construct; applicable in evaluating the validity of tests that have been constructed on the basis of an analysis (often factor analysis) of the nature of the trait and its manifestations.

(4) convergent and discriminant validity. Tests of personality, verbal ability, mechanical aptitude, critical thinking, etc., are validated in terms of the relation of their scores to pertinent external data. Convergent evidence refers to the relationship between a test score and other measures that have been demonstrated to measure similar constructs. Discriminant evidence refers to the relationship between a test score and other measures demonstrated to measure dissimilar constructs.

variability—The spread or dispersion of test scores, best indicated by their standard deviation.

variance—For a distribution, the variance is the average of the squared deviations from the mean. Thus, the variance is the square of the standard deviation.
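A short sketch tying together the variance and standard deviation definitions above (hypothetical scores; numpy assumed): the variance is the mean squared deviation from the mean, and the SD is its square root.

    import numpy as np

    scores = np.array([12, 15, 17, 20, 26], dtype=float)  # hypothetical scores

    deviations = scores - scores.mean()
    variance = (deviations ** 2).mean()   # average squared deviation from the mean
    sd = variance ** 0.5                  # standard deviation is the square root of the variance

    print(variance, sd)
    # The same values via numpy's built-in population statistics:
    print(scores.var(), scores.std())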