Moray House School of Education
MEd TESOL 2008-2009
Learner Assessment in TESOL
Tutor: Gillies Haughton
Student: Shaaban Ahmed S0899927

Assessing Listening

Introduction

Assessing listening is one of the most significant areas of language testing. However, it is also the area that receives the least attention and development in assessment, perhaps because listening is a very complex process. Listening is not assessed in my context, and this has had serious drawbacks for the teaching and learning of the listening skill.
Therefore, the test specs and test instrument presented here constitute a proposal to incorporate the assessment of listening into English language testing in my context. The importance of assessing listening lies in its potential washback on classroom practices and priorities. This paper attempts to provide a rationale for, and an evaluation of, the test design, instrument, process and outcomes.

Rationale

Testing the ability to understand oral discourse in English as a foreign language in my context will considerably encourage students and teachers to enhance and reinforce the teaching and learning of listening in classrooms. “Backwash” refers to the effect of testing on teaching and learning (Hughes, 1989, Desforges, 1989, Heaton, 1975). This effect can influence classroom practices either negatively or positively. The positive backwash expected from assessing listening is that it may have an immediate effect on the syllabus, the selection of new coursebooks and methodology (Baker, 1989, Davies, 1990, Alderson and Wall, 1993, Cheng et al, 2004). Black and Wiliam (2006) indicate that assessment in education must, first and foremost, serve the purpose of supporting learning.
There are four main reasons behind my proposal to assess listening. Firstly, students and teachers will pay due attention to the listening activities in the coursebook, and consequently the four skills will be equally catered for. Secondly, learner autonomy will be fostered, as students will be encouraged by parents, teachers and peers to search for extra materials with which they can improve their listening abilities.
Thirdly, curriculum planners and coursebook designers will improve and develop listening tasks so as to meet the demands and goals formulated by the public examination authority or the Ministry of Education, especially for high-stakes examinations such as GSECE. Fourthly, assessing listening may affect the speaking skill positively because teachers, by teaching listening, will necessarily have to engage in some sort of speaking activity in class (Buck, 2001). In addition, a considerable amount of listening input is likely to result in considerable oral output from students (Krashen, 1985).
It is also worth mentioning that listening is ignored because it is not assessed: “Areas that are not tested are likely to become areas ignored in teaching and learning” (Hughes, 2003: 27). Gipps and Stobart (1993) also indicate that teachers concentrate on what assessment measures and do not teach untested skills.

Choices in the test and test specs

My rationale for the choices I have made in the test specs focuses on four aspects: the purpose of the test, text type, method, and marking (see appendix 1).
Concerning the purpose of the test, I have stated in the test specs that the test will measure the students’ ability to recognize spoken language at word, sentence and text levels. The rationale behind this purpose is to include the two comprehension processes, bottom-up and top-down. Bottom-up processing works from the acoustic input, whereas top-down processing draws on the listener’s existing linguistic knowledge. Buck (2001) argues that these two processes should not be underestimated when we consider listening comprehension.
The listening operations mentioned in the test specs are based on the general goals of teaching listening for third year secondary schools in the Hello! Series (IELP-II, 2003). My reference in formulating these operations is the taxonomy of communicative listening sub-skills by Weir (1993). I have chosen Weir’s taxonomy because he describes listening skills in communicative terms, and this is consistent with the communicative language teaching approach which the Hello! Series adopts as the methodology that guides classroom practices.
According to Weir (1993: 98-99), the following listening sub-skills should be included in listening tests: direct meaning comprehension, inferred meaning comprehension, contributory meaning comprehension, and listening and taking notes. Weir (1990) asserts that the main proviso for testing within a communicative framework is that the test tasks should as far as possible reflect realistic discourse processing. Identifying specific details and main ideas in oral discourse is characteristic of real-world listening, and this makes the purpose of the test more likely to encourage teachers and students to practise comprehending realistic spoken English.
In addition, assessing listening at word level will encourage teachers and learners to pay more attention to the phonological modifications covered in the coursebook. In this regard, Buck explains, “Any lack of such knowledge is likely to be reflected in reduced comprehension” (Buck, 2001: 33). Another choice I have made in the specs is the text type. After formulating the test construct, I faced a more important step in designing the test: selecting an appropriate text that is relevant to the students’ level of proficiency and contains a range of vocabulary and structures they have learnt in the coursebook.
The reason for choosing a speech as the listening text is that assessing listening aims to prepare students for extended oral discourse, especially lectures, which pose a problem for them when they start their academic study at university. A speech is a non-collaborative listening situation, in which “there is neither feedback nor interaction between language users” (Bachman and Palmer, 1996: 55). Moreover, it is relatively convenient and cheap to test non-collaborative listening in large-scale situations where test-takers can take the test at the same time (Buck, 2001).
Hughes points out that a test should be “economical in time and money” (Hughes, 1989: 8). It is also worth mentioning that setting up a collaborative or interactive listening test is expensive and time-consuming, and requires expertise in testing speaking as well. This may be impossible with GSECE, in which more than 600,000 students sit for exams. The listening text is delivered by a native speaker with an American accent. Buck (2001) states that the most pressing practical issue in the assessment of listening comprehension is providing real-world spoken texts.
The reason for using a native-speaker recording is to increase the positive washback on listening lessons in classrooms, to expose students to real spoken English, and to urge teachers to play the cassette rather than read the script aloud themselves. Although the disadvantages of the multiple-choice question (MCQ) as a test method may, to a considerable extent, outweigh its advantages, I have chosen this method (see appendix 2) for a number of reasons. On the one hand, high-stakes examinations, such as GSECE in Egypt, require good tests in terms of reliability and objectivity.
One of the key qualities to be considered in using MCQ tests is reliability: their scoring is highly reliable because it is free of scorer-related measurement error (Bachman, 1990, Weir, 1990). Baker (1989) and Schofield (1972) indicate that MCQ tests are easy to administer and score, and this is suitable for large-scale examinations. Moreover, marking does not constitute a source of measurement error, owing to the nature of the test items, which require choosing the right answer from four options. The scorers are bound to be consistent because there is only one easily recognized correct response (Hughes, 1989).
Besides, the test specs include an answer key for markers to avoid any errors. The mark scheme increases the reliability of the test because it removes a major source of unreliability: the test can be marked by different markers on different occasions with the same scores (Harrison, 1983). On the other hand, the seriousness and importance of English language testing in GSECE, in which a single mark can change the future of a student (see appendix 1), necessitate choosing tasks that offer high scorer reliability and objectivity.
MCQ activities use items which permit completely objective scoring (Hughes, 1989, Schofield, 1972, Heaton, 1975), and the students’ marks cannot be affected by the personal judgement or idiosyncrasies of the marker (Weir, 1990). Another advantage that can be attributed to MCQ is that the scoring criterion is made explicit to the test-takers: a mark is given for the correct choice. This reduces the ambiguity that may accompany other tasks such as gap filling or open-ended questions (Weir, 1990).
Such ambiguity may distract test-takers, who have to stop and consider how far they have covered the required information (Buck, 2001). Another choice related to MCQ is the number of options in every item. A multiple-choice item can have three, four or five options (Buck, 2001). I have used four options in every item, as found in international tests such as TOEIC and TOEFL, in order to reduce the chance of successful guessing that a three-option item allows (Desforges, 1989, Shipman, 1983, Heaton, 1975). Heaton (1975) explains that four options are recommended for tests.
I did not use five options, in order to keep the students focused on the listening operations and to limit the reading load, thereby avoiding the risk of construct-irrelevant problems (Weir, 2005). The four options are short and simple so as to avoid the intervention of the reading skill, which can also influence listening scores (Buck, 2001). The stem in each item is an incomplete statement; a stem can be constructed either as an incomplete statement or as a direct question (Fulcher and Davidson, 2007, Heaton, 1975). The reason behind my choice is that the target students are familiar with this kind of stem.
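The guessing argument behind the choice of four options can be illustrated with a short calculation. The following Python sketch is not part of the test materials; it simply assumes blind, independent guessing on every item, and the threshold of five correct answers is a hypothetical figure chosen only for illustration.

```python
from math import comb


def expected_chance_score(n_items: int, n_options: int) -> float:
    """Expected number of correct answers if every item is guessed blindly."""
    return n_items / n_options


def prob_at_least(n_items: int, n_options: int, threshold: int) -> float:
    """Binomial probability of guessing at least `threshold` items correctly."""
    p = 1 / n_options
    return sum(
        comb(n_items, k) * p**k * (1 - p) ** (n_items - k)
        for k in range(threshold, n_items + 1)
    )


N_ITEMS = 10    # number of items in the listening test
THRESHOLD = 5   # hypothetical score, used only for illustration

for n_options in (3, 4, 5):
    score = expected_chance_score(N_ITEMS, n_options)
    chance = prob_at_least(N_ITEMS, n_options, THRESHOLD)
    print(f"{n_options} options: expected chance score {score:.2f}, "
          f"P(at least {THRESHOLD} by guessing) = {chance:.3f}")
```

Under these assumptions the expected chance score falls as options are added, which is the sense in which four options reduce guessing compared with three. The sketch says nothing about the extra reading load of a fifth option, which is the reason given above for not going further.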
As for the items in general, they are organized according to the sequence of ideas in the listening text. The ten items are created to measure the three levels of listening operations in the specs. Operations on word level are catered for in items 6 and 10. Items 2, 4 and 5 are created to measure listening operations on sentence level. Items 3, 7, 8 and 9 measure students’ ability to infer information and identify main ideas on text level. Before playing the listening text, I have provided test-takers with the context of the talk: “President Obama is giving a speech after his victory in the presidential election”.
The rationale behind this is to create a known situation that can be used to test interpretations and inferences based on the knowledge provided in the listening text. In addition, it gives the test itself a context, as real-world listening always takes place within a context (Buck, 2001). This introduction will activate the test-takers’ background knowledge about the topic, and this helps listening comprehension. It may particularly help those who do not have background information about the topic.
However, it is worth mentioning that correct responses depend on knowledge that has been provided in the text.

Evaluation

Evaluation is the process of making a judgement about the value or worth of a test and its results. My evaluation relates to the test specs and the test instrument. In evaluating the test, I will discuss three technical criteria: reliability, validity and practicality. I will also evaluate the test items in terms of strengths and weaknesses according to my colleagues’ feedback and the results of piloting the test.
The reliability of the test has already been discussed in the rationale as one of the advantages of MCQ. As for the validity of the test, it refers to the extent to which the test measures what it is supposed to measure and nothing else (Heaton, 1975, Harrison, 1983, Desforges, 1989, Davies, 1990, Gipps and Stobart, 1993, Harmer, 2001). According to the purpose stated in the test specs, there is high content validity because the items cover the objectives of teaching listening in the coursebook and the listening operations in the specs. The content of achievement tests is based on the coursebook (Alderson et al, 1995, Harrison, 1983).
Despite this content validity, we cannot be sure that students who get high scores are good at these sub-skills, because a single task type cannot represent students’ achievement. It is a good idea to have a variety of task types in the listening test because test-takers’ performance can vary depending on the type of task (Buck, 2001). The test method can affect the test scores, so Weir recommends, “As a general rule it is best to assess by a variety of test formats” (Weir, 1990: 42). Satterly (1989) argues that validation of an instrument calls for an integration of many types of evidence.
Weir (1990) also points out that there is considerable doubt about the validity of MCQs as measures of language ability. Buck states, “The greater and more varied the sample, the more likely it is to represent the construct fairly” (Buck, 2001: 120). Brindley (1998) and Shipman (1983) explain that using a variety of different task types makes the test more likely to provide a balanced assessment. Baker (1989) also criticises multiple-choice tests because he thinks they are unnatural and may lead to excessive practice of these activities in classrooms.
Therefore, this test could be improved by including different task types, such as short-answer questions with a focus on meaning. As for the practicality of the test, it seems quite difficult to administer a listening test because the main questions of practicality are administrative (Harrison, 1983). “Practicality pertains to the ways in which the test will be implemented” (Bachman and Palmer, 1996: 39). The test must be well organized and prepared for with special arrangements. Exam rooms should be well equipped with cassette players or computers, and they should be physically isolated from noise.
These conditions may be difficult to meet, as most schools lie in busy and crowded streets and it seems impossible to guarantee suitable conditions. Turning to the test specs and the test instrument, I will carry out an internal review of the instrument, compare the instrument with the specs and review the assessment experience. It is worth mentioning that my evaluation in this part is based on my colleagues’ feedback in addition to the analysis of an evaluation checklist which I sent to the senior teachers who administered the test in Egypt (see appendix 3).
Test specs provide the rationale behind the various choices I make because they are explanatory documents for the creation of test tasks (Fulcher and Davidson, 2007). However, the specs still need amplification, especially with regard to the sequence of difficulty (Sumner, 1987) and the description of items related to text type. Specifications of text type should not describe the current test; instead they should provide specifications that other tests should meet. The purpose of the test is clearly stated in the specs, but it would be better to include a definition of listening, supported by references, in the test specs.
In addition, the specs should be refined and the test method should be accompanied by other task types for the sake of validity. The specs also lack a very important point: whether this listening test will be part of the English test, taken at the same sitting as the other tasks of the English test, or whether it will be administered and marked separately. The specs also lack a sample of the task items and “guiding language about the item” (Fulcher and Davidson, 2007: 54). The test design matches the test specs in terms of the number of questions, the task, the distribution of marks and the time allocated.
The test content complies with the content of the syllabus. The text is relevant, motivating and interesting for the students, and it is at the right level of difficulty in terms of vocabulary and structures. The length of the listening text matches the test specs. The task evaluation is inspired by the guidelines of Hughes (1989, chapter 5) in relation to reliability. As for the rubrics, the instructions are clear on what the students have to do. The rubrics are also written in simple, clear language, and the mark for the task is given to the students.
However, the total time allowed is not mentioned, and this will be addressed when improving the test. The marking scheme is clear and based on objective testing, which means no judgement is required on the part of the scorer. Items are unambiguous and the correct answer appears in a different position each time (Schofield, 1972). The answer key is correct and complete, and there is only one correct answer for each item.

Test Trialling

Creating a multiple-choice test is a complex skill that requires great effort and high proficiency in creating appropriate options and distracters.
It is possible that there may be problems in the items which test-developers cannot notice. The best way to avoid this is to give the test to a small number of potential test-takers or colleagues to complete the task and then solicit their feedback (Buck, 2001). This reveals many obvious problems, as trialling the items should ensure that the questions are unambiguous and sufficiently focused (Weir, 2005, Sumner, 1987). Buck asserts, “Test-developers are well advised to practise the timing of activities and trialling the whole procedures is advisable” (Buck, 2001: 121).
Therefore, this listening test has been trialled in the Egyptian context to evaluate the whole experience of assessing listening and to evaluate the test itself. In this part, I will report what actually happened and provide an analysis of the results. The test and test specs were sent to El-Atf Secondary School in Egypt, as Davies (1990) recommends that the “try-out” should be carried out on a sample of the same kind of people on whom the test is to be used. Two senior teachers conducted the test on a sample of ten students enrolled in the third year, three girls and seven boys.
The test was conducted on the 2nd of April 2009 in the Multi-media Lab, which is equipped with a computer to play the listening text, speakers, lighting and all the required physical facilities. The text was played only once and took the time allocated in the test specs. After the test, the teachers held interviews with the students to elicit their feedback about the test (Sumner, 1987). The teachers’ report provides very significant remarks. Firstly, some students complained about the fast rate of delivery.
However, this may be normal because these students are not used to such a listening situation, even though the speaker does not speak fast and the delivery is appropriate to their proficiency level. According to Buck (2001), such students lack processing automaticity; if they are trained and learn to process the language more automatically, speech may come to seem slower to them. Another significant point is that the students wanted the text to be played again so that they could check their answers. This point should be considered in the test specs. Berne (1995) recommends playing the text twice to make the task easier for listeners.
Moreover, playing the text twice was also suggested by colleagues to help test-takers check and revise. However, it is more realistic to play the text only once, as this is a feature of listening in the real world. Generally speaking, most students approved of the experience, and this may be a good reason to conduct in-depth research in this area of assessing listening in my context. It is appropriate to evaluate and analyse the test items with reference to the results obtained. The table below shows the results of the test and the number of correct responses for each item:

Student     Total
Ali         5/10
Khalid      5/10
Samah       8/10
Omar        3/10
Soliman     5/10
Eman        6/10
Yasmeen     5/10
Salamah     8/10
Mohamed     5/10
Ahmed       6/10

Item                 1     2     3     4      5     6     7     8     9     10
Correct responses    3/10  8/10  2/10  10/10  4/10  8/10  6/10  1/10  6/10  8/10

Analysing the items according to the data provided in the table, we find that some items need to be improved, refined or replaced. It is clear from the correct responses that the items are not arranged in rough order of increasing difficulty. Heaton (1975) argues that it is generally important to have one or two simple items to “lead in” the test-takers.
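The correct-response counts reported for the ten items can be turned into facility values (the proportion of test-takers answering an item correctly), which makes the ordering of difficulty easier to see. The following Python sketch is only an illustration: the counts and the grouping of items by level come from this paper, while the cut-off values used to flag items as hard or easy are hypothetical.

```python
N_STUDENTS = 10

# Correct responses per item, as reported in the trial results.
correct = {1: 3, 2: 8, 3: 2, 4: 10, 5: 4, 6: 8, 7: 6, 8: 1, 9: 6, 10: 8}

# Total score per student, as reported in the trial results.
student_totals = {
    "Ali": 5, "Khalid": 5, "Samah": 8, "Omar": 3, "Soliman": 5,
    "Eman": 6, "Yasmeen": 5, "Salamah": 8, "Mohamed": 5, "Ahmed": 6,
}

# Items grouped by the level of listening operation they target in the specs.
levels = {"word": [6, 10], "sentence": [2, 4, 5], "text": [3, 7, 8, 9]}

# Sanity check: both views of the results must add up to the same total.
assert sum(correct.values()) == sum(student_totals.values())


def facility(item: int) -> float:
    """Proportion of students who answered the item correctly."""
    return correct[item] / N_STUDENTS


def mean_facility(items: list[int]) -> float:
    """Average facility value over a group of items."""
    return sum(correct[i] for i in items) / (len(items) * N_STUDENTS)


# Hypothetical cut-offs, chosen only for illustration.
HARD_BELOW, EASY_ABOVE = 0.35, 0.9
hard_items = [i for i in correct if facility(i) < HARD_BELOW]
easy_items = [i for i in correct if facility(i) > EASY_ABOVE]

for item in correct:
    print(f"Item {item}: facility {facility(item):.1f}")
print("Possibly too hard:", hard_items)
print("Possibly too easy:", easy_items)
for level, items in levels.items():
    print(f"{level} level: mean facility {mean_facility(items):.2f}")
```

With only ten test-takers, these figures are indicative at best. A discrimination index would also require each student's response to each item, which the summary counts above do not provide.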
Item 1 does not seem to be a good opening for the test; it should be simplified or replaced with an easy item so that difficulty can then increase gradually. It is advisable to start the test with easy and simple items. Item 8 is too hard. This may be because “selfless” is not familiar to the students, or because the item is misleading. Therefore, it needs to be refined. It seems that item 4 is an invalid one and should be excluded because, having been answered correctly by every student, it does not discriminate between weaker and stronger students (Hughes, 2003, Sumner, 1987).
Perhaps this is because it can be answered without listening to the text. Therefore, this item should be replaced with another one that measures a listening operation on text level. Though item 5 is simple, it has only four correct responses. It is badly designed, and this may have caused test-takers to misunderstand it. It is notable that most students answered the items created for listening operations on word level correctly. Students vary dramatically in their responses to the items created for listening operations on text level.

Conclusion
Based on the discussion and analysis in the rationale, evaluation and trialling, it is clear that neither the test nor the specs are perfect. Both need amplification, and some parts of the specs need to be refined and documented. Generally speaking, the course has provided me with an overview of a variety of theories and practices in language testing, through which I can critically evaluate existing tests and design tests appropriately with reference to given specs.

Bibliography

- Alderson, J. C. & Wall, D. (1993) Does washback exist? Applied Linguistics 14/2, pp. 115-129.
- Alderson, J. C., Clapham, C. & Wall, D. (1995) Language Test Construction and Evaluation. Cambridge: Cambridge University Press.
- Bachman, L. F. (1990) Fundamental Considerations in Language Testing. Oxford University Press.
- Bachman, L. F. & Palmer, A. S. (1996) Language Testing in Practice. Oxford University Press.
- Baker, D. (1989) Language Testing: A Critical Survey and Practical Guide. Edward Arnold.
- Berne, J. E. (1995) How does varying pre-listening activities affect second language comprehension? Hispania 78/2, pp. 316-329 (http://www.jstor.org/stable/pdfplus/345428.pdf).
- Black, P. & Wiliam, D. (2006) Assessment for Learning in the classroom. In Gardner, J. (ed.) Assessment and Learning. Sage Publications Ltd.
- Brindley, G. (1998) Assessing Listening Abilities. Annual Review of Applied Linguistics 18, pp. 171-191.
- Buck, G. (2001) Assessing Listening. Cambridge University Press.
- Cheng, L., Watanabe, Y. & Curtis, A. (2004) Washback in Language Testing: Research Contexts and Methods. Lawrence Erlbaum Associates.
- Davies, A. (1990) Principles of Language Testing. Basil Blackwell.
- Desforges, C. (1989) Testing and Assessment. Cassell Educational Limited.
- Fulcher, G. & Davidson, F. (2007) Language Testing and Assessment: An Advanced Resource Book. Routledge.
- Gipps, C. & Stobart, G. (1993) Assessment: A Teachers’ Guide to the Issues (2nd ed). Hodder & Stoughton.
- Harrison, A. (1983) A Language Testing Handbook. London: Macmillan Press.
- Heaton, J. B. (1975) Writing English Language Tests. London: Longman.
- Hughes, A. (1989) Testing for Language Teachers. Cambridge: Cambridge University Press.
- IELP-II (2003) Student Achievement Test Development Manual. Academy for Educational Development.
- Krashen, S. (1985) The Input Hypothesis: Issues and Implications. Harlow: Longman.
- Satterly, D. (1989) Assessment in Schools (2nd ed). Basil Blackwell.
- Schofield, H. (1972) Assessment and Testing: An Introduction. George Allen & Unwin Ltd.
- Shipman, M. (1983) Assessment in Primary and Middle Schools. London & Canberra: Croom Helm.
- Sumner, R. (1987) The Role of Testing in Schools. NFER-Nelson.
- Weir, C. J. (1990) Communicative Language Testing. Prentice Hall.
- Weir, C. J. (1993) Understanding & Developing Language Tests. Prentice Hall.
- Weir, C. J. (2005) Language Testing and Validation. Palgrave Macmillan.