The primary purpose of this study was to investigate the test usefulness of five leading standardized speaking tests. Task characteristics, scoring rubrics, and test methods from these tests were evaluated in order to further investigate the variances that influence speaking performances.
The following components of test usefulness were examined and measured: reliability, construct validity, authenticity, and interactiveness. Differences in task types, testing methods, and scoring methods were identified as sources of variance that influence speaking assessments. Therefore, the different test tasks and the contributions of the tasks relative to a test taker’s speaking ability were examined for test usefulness. Due to limited resources, actual one-on-one interviews are not always feasible in L2 testing conditions. Therefore, for this study, the Simulated Oral Proficiency Interview (SOPI), a commonly used substitute, was administered, and its results were compared with those of the Oral Proficiency Interview (OPI) method. As various tasks require different scoring rubrics, the nature of these scoring rubrics was also examined to determine their impact on test takers.
To address these issues, two tasks from the Test of English for International Communication (TOEIC) Speaking, the Test of English Proficiency developed by Seoul National University (TEPS) Speaking, the International English Language Testing System (IELTS) Speaking – General Training, the Test of English as a Foreign Language (TOEFL) iBT Speaking, and the American Council on the Teaching of Foreign Languages (ACTFL) Oral Proficiency Interview Computer Test (OPIc) were administered to seventy-four college students. These standardized tests were administered using the SOPI method, while the one-on-one interview was conducted following the OPI method. Examinees’ performances were rated three times: first according to a rubric based on communicative language ability (CLA), then according to the rubric that originally accompanied each task, and finally according to a holistic rubric. Task completion was added to the CLA rubric to further examine the task effects on the test takers. The data were then analyzed using several analytic methods, including a multi-faceted Rasch model and factor analysis.
The results indicated that the TOEFL and the IELTS had the highest overall usefulness, characterized by a good degree of authenticity and interactiveness. On the other hand, the TOEIC was the least useful test, with an ill-defined Target Language Use (TLU) task and TLU domain. Factor analysis revealed that, despite the high correlations found among all the tests and the interview in the preliminary reliability estimation and in previous research, no other test loaded on the same factor as the interview. Therefore, while the OPI may not be a replica of real communication, it was at least found to measure different constructs of speaking ability than the SOPI-based tests.
For the task evaluation, the results indicated that, overall, the integrated tasks from the TOEFL and the IELTS had the highest test usefulness. These two tasks were also the most difficult for the test takers, and both had a high degree of authenticity and interactiveness. On the other hand, a low degree of authenticity and interactiveness did not necessarily coincide with an easier test task. Therefore, the qualities of a test task should not be evaluated independently, nor should its usefulness be determined by a single quality of a given test. The findings also revealed latent factors that corresponded not to the operational constructs of speaking ability but to the tasks. Different scoring rubrics yielded different performance measures; however, the CLA rubric and holistic scoring consistently produced stable measures, regardless of the task.
Based on the findings on test usefulness and the variables that affect speaking ability, the first recommendation for improving the quality of speaking assessments in current English as a Foreign Language (EFL) settings is to develop a test that has a well-defined TLU domain. The correspondences between the TLU domain, the TLU task, and the test task were shown to be the most important features of test usefulness. Next, it was found that assessing speaking ability via the SOPI method is necessary but does not by itself provide sufficient evidence of the test taker’s speaking ability. Therefore, the SOPI method should be accompanied by the OPI method. When such assessments are not feasible, as in most cases for our L2 learning environments, task selection for a SOPI test should include at least one task that is similar to a one-on-one interview in terms of task characteristics and test usefulness.