A Conversation Analysis–Informed Test of L2 Aural Pragmatic Comprehension

F. SCOTT WALTERS
Queens College, City University of New York
Flushing, New York, United States

Speech act theory–based second language pragmatics testing (SLPT) raises test-validation issues owing to a lack of correspondence with empirical conversational data. On the assumption that conversation analysis (CA) provides a more accurate account of language use, it is suggested that CA serve as a more empirically valid basis for SLPT development. The current study explores this notion by administering a pilot CA-informed test (CAIT) of listening comprehension to learners of English as a second language (ESL) and to a control group of native speakers of English. The listening CAIT protocol involved participants’ addressing multiple-choice items after listening to audiotaped conversational sequences derived from the CA literature. Statistical analyses of pilot-test responses, correlations of test score with participant demographic variables, and CA-informed, qualitative analyses of nonnative and native speaker responses with reference to operationalized pragmatic norms provided tentative evidence that the CAIT aural-comprehension measure possesses some utility in SLPT.

Second language pragmatics testing (SLPT) is a relatively new area of language testing (e.g., Hudson, Detmer, & Brown, 1995; Yamashita, 2001). Nevertheless, well-known standardized tests of English as a second or foreign language have components which address pragmatics. For example, the TOEFL Internet-based Test (or TOEFL iBT) contains a section focusing in part on “listening for pragmatic understanding,” defined as “to understand a speaker’s purpose, attitude, degree of certainty, etc.” (Educational Testing Service, 2007, p. 31). Similarly, part of Paper 4 of the Cambridge First Certificate in English (FCE) contains multiple-choice items, some of which target pragmatics, that is, “function, purpose, [or] attitude” of a speaker (University of
Cambridge Local Examinations Syndicate, 2007, p. 2). SLPT itself has a theoretical basis in speech act theory (Austin, 1962; Searle, 1975) and methodological roots in cross-cultural pragmatics (e.g., Blum-Kulka, House, & Kasper, 1989) and in interlanguage pragmatics (e.g., Blum-Kulka, 1982; Faerch & Kasper, 1989). However, current speech act–based SLPT practices reveal certain problems in validity. That is, given that validity (or validation) involves making inferences from test responses about a given skill (Bachman, 1990; Chapelle, 1999), this article argues that speech act–based SLPT elicits responses from which appropriate skill inferences cannot be made. This article then describes a pilot study in which conversation analysis (CA) is used as a possible alternative on which to base an SLPT measure.

TESOL QUARTERLY Vol. 43, No. 1, March 2009

SECOND LANGUAGE PRAGMATICS COMPREHENSION

As Kasper and Rose (2002) indicate, within the developmental second language (L2) pragmatics literature, studies into L2 pragmatic comprehension have been relatively rare. Early studies investigated L2 learners’ attribution of illocutionary force, that is, the process through which a hearer interprets the meaning of an utterance as being of a particular speech act, such as a request or a refusal. Carrell (1979), for example, found that advanced L2 learners have full access to such inferential skills and are able to infer indirect speech acts. Carrell (1981) also found evidence of a hierarchy of difficulty in one particular speech act—indirect requests—depending on how the act was syntactically constructed. When such requests—in this case, to paint a circle blue—were phrased as interrogatives (i.e., Must you make the circle blue?)
or negatives, they were more difficult for lower proficiency learners to interpret than were requests in conventional form (i.e., Please + imperative verb phrase). Gibbs (1984) found that L2 listeners have direct access to the meanings of indirect speech acts if the utterance forms and situations are conventionalized, but that utterances which lack familiar situational contexts or are nonconventionalized must be processed sequentially—literal meaning first, then nonliteral. Bouton (1988) found that groups of learners from differing cultural backgrounds (German, Portuguese-Spanish, and Taiwanese on the one hand and Korean, Japanese, and mainland Chinese on the other) have different perceptions of indirect answers, such as indirect criticism (e.g., p. 59: Speaker A: “What did you think of [Mark’s term paper]?” Speaker B: “I thought it was well typed”). Koike (1996) tested the ability of L2 students of Spanish to identify direct and indirect speech acts (requests and apologies) recited in videotaped monologues by a native speaker of Spanish, finding an association with length of L2 study. These results are somewhat uncertain, however, because a third of the participants were actually Spanish-English bilinguals.

In any event, one may note in these studies assumptions regarding the notions of direct and indirect speech acts, namely, that there is a regular association between form and function of so-called direct speech acts, and that those speech acts which are indirect are understood by hearers to be so with reference to the direct versions. These assumptions, as we shall see, do not stand up well under scrutiny.

SECOND LANGUAGE PRAGMATICS TESTING

Such assumptions have carried over into SLPT development. For example, in an early study, Hudson et al. (1995) devised a prototype battery of SLP test instruments employing a written discourse completion test (DCT). A DCT consists of a short situational description followed by a blank into which a respondent writes
what he or she feels is an appropriate response, as in the following item (p. 87):

Situation 2: You work at a small shop that repairs jewelry. A valued customer comes into the shop to pick up an antique watch that you know is to be a present. It is not ready yet, even though you promised it would be.
You: ________

The DCTs in Hudson et al. were intended to elicit requests, refusals, and apologies—three so-called speech acts. Prepilot rating of the situational prompts (items) by native speakers (NSs) of English showed a 92% agreement with the intended speech-act codings (e.g., as an apology); a high percentage of agreement can be interpreted to mean that the operational norm (i.e., the provisional standard embodied in the test item) assumed by the researcher is likely to be valid. Prompts on which native speakers had disagreed as to speech-act codings were later found to elicit multiple speech acts with nonnative-speaking participants.

Brown (2001) examined various SLP test methods adapted from the Hudson et al. (1995) study, focusing on requests, refusals, and apologies, and given to Japanese learners of English as a foreign language (EFL) and American learners of Japanese as a second language (JSL). Methods included a written discourse completion task (WDCT), the responses scored by raters on a five-point scale of appropriateness. A multiple-choice discourse completion task (MDCT) replaced a fill-in blank with a set of three response options on a test sheet. In an oral discourse completion task (ODCT), the situation descriptions were delivered via tape recording, and the participant’s response was orally produced and recorded. There were three additional tasks: The first was the discourse role-play task (DRPT), which involved a printed situation acted out by the test taker and a native speaker of the target language. The second, a role-play self-assessment task (RPSA), involved the test taker viewing an audiotaped record of his or her performance on the DRPT and rating it on a five-point
scale. A discourse self-assessment task (DSAT) was similar to the DRPT in that after reading a situation-prompt, the test taker rated his or her ability to respond on a five-point Likert scale. In comparing the results, Brown found high reliability coefficients for most measures across both EFL and JSL groups. Factor analysis suggested method effects, that is, a productive language factor with the WDCT, ODCT, and DRPTs; and a paper-and-pencil factor for the WDCT and the MDCT with the JSL group. That is, success at the WDCT, ODCT, and DRPT seemed to depend on productive language ability, and, for the JSL group, success at the WDCT and MDCT tasks seemed to depend on skills associated with use of paper-and-pencil tests.

Yamashita (2001) attempted to improve on written situational prompts by employing pictures to elicit apology strategies. A picture response test (PRT) was developed, with feedback from native speakers of both Japanese and English to eliminate cultural ambiguity. Six pictorial situations (e.g., Person A borrows Person B’s book and then accidentally drops it into a pond) involved the test taker adopting a Person A character and responding in writing to a fictitious Person B. Frequency counts were made of targeted phenomena, for example, expressions of dismay, explicit apology-devices (i.e., I’m sorry), and interjections.

PROBLEMS WITH MAINSTREAM SLPT

Although the preceding literature review is not exhaustive, it is an adequately representative sample to discuss the limits which speech act theory–based SLPT instruments have with regard to validation. Such methods evince problems with authenticity and hence with producing appropriate evidence of L2 pragmatic competence.

First, the SLPT methods reflect an underlying assumption regarding speech act form–function correspondence, that is, direct versus indirect speech act. However, as Levinson (1983) points out, the form–function correspondence implicit in speech act theory is
essentially meaningless because most speech acts are indirect. Second, we may observe, as do Levinson (1983), Richards and Schmidt (1983), and Rose (1992), that speech act theory tends to focus simplistically on the speaker at the expense of the hearer; however, conversational actions, such as promises, cannot be performed alone. Mey (2001) similarly refers to pragmatic acts occurring above the level of individual utterances. Given these criticisms of the theory underlying traditional SLPT, one can see that the very unit of analysis built into the test methods of Hudson et al. (1995), Yamashita (2001), and in those reviewed in Brown (2001) is suspect from the standpoint of authenticity; hence, inferring pragmatic competence from these measures is dubious.

The use of the DCT and related methods also evinces shortcomings. The format is clearly artificial, and there is a risk, despite careful wording and translation of the situation-prompts, that participants will elaborate on the context in ways not envisaged by the investigator (Yamashita, 2001, p. 48; Hudson et al., 1995, p. 52). Moreover, some DCT-related studies suggest a method effect: excessively long written responses by high-proficiency nonnative speakers (NNSs) (Blum-Kulka & Olshtain, 1986; Faerch & Kasper, 1989; see also Brown, 2001).

The implications of these shortcomings for practical L2 pragmatics assessment can be underscored by pointing out that if a teacher or student desires a valid assessment of that student’s L2 pragmatics ability, an SLPT instrument, such as the DCT, drawing on speech act theory for assessment criteria is unlikely to provide useful information. The assessment fails because the construct speech act is founded on learners’ and researchers’ intuitive understandings of what that act consists of, rather than on some objectively verifiable criterion of pragmatic behavior. That is, the assessment results will indicate what the learner believes is so, rather than demonstrating L2
pragmatic mastery per se.

Some researchers have attempted to address DCT shortcomings. For example, Rose (1992) added rejoinders (i.e., a third, hearer’s turn after the traditional second-turn blank into which the subject writes his or her response) intended to make the DCT exchanges more nearly authentic by embedding responses in a simulated stream of discourse. Unfortunately, the two conditions in Rose (1992)—items with and without rejoinder—did not elicit statistically significant differences. Again, such attempts do not appear to address the problem: If speech acts are in fact products of metapragmatic judgments that only loosely correspond to actual conversational practice, then inferences of L2 pragmatics ability made from such elicitations are of questionable validity.

CONVERSATION ANALYSIS

Such problems with speech act theory and the DCT and its variants are crucially illuminated by findings in conversation analysis (CA; e.g., Sacks, Schegloff, & Jefferson, 1974; Schegloff, 2007). CA is an approach to the study of language that avoids categorizations of use based on native-speaker intuitions. Data in CA consist of audio- and/or videotaped natural conversations, which are finely transcribed using special conventions (see appendix). A working principle of CA is that no aspect of talk can be assumed to be nonfunctional (Heritage, 1984), and a fundamental question is, Why that now? (Schegloff, 2007): Why is the interlocutor making a particular utterance at a particular point in a given conversational sequence?
Conclusions about language use can only arise by determining how interlocutors themselves orient to (i.e., demonstrate understanding of) a given utterance, as evidenced by explicit, recorded turns of talk—rather than from possibly erroneous researcher or respondent intuition. In explicating language use, it is not enough to rely on linguistic form and assign utterances to speech act–theoretical categories such as question, because such utterances may be oriented to by the speakers as something rather different, such as a complaint or request (Schegloff, 1984). Hence, from the CA perspective, as Schegloff (1988) points out, the so-called speech act, the “single act of utterance” (p. 56), is not a fundamental unit of talk in an absolute sense. Rather, the fundamental unit of analysis is the sequence (as in, sequence of turns of talk), of which the minimal example is the adjacency pair (e.g., Sacks, Schegloff, & Jefferson, 1974), that is, a two-part structure composed of a first pair part uttered by one speaker (e.g., It was just beautiful), which projects a second pair part which is “conditionally relevant” (Schegloff, 2007, p. 20), that is, potentially uttered by another speaker (e.g., Well thank you uh I thought it was quite nice) (Pomerantz, 1978). The qualification in “potentially uttered” is important because conversational meanings and structures are not predetermined but co-constructed by the speakers, turn by turn. Hence, the a priori, linguistic form–function pairing of utterances in speech act theory is analyzably absent from natural conversation (Schegloff, 1984, 1988).

CA criticisms extend to the DCT itself, which does not allow for the concept of sequence, integral to analyses of actual “talk-in-interaction” (Schegloff, 2007, p. xiii). Golato (2003), for example, shows that DCTs do not capture some aspects of actual pragmatic behavior and may elicit responses that do not appear in natural conversation. Golato transcribed
compliment responses gathered in a German L1 domestic setting on videotape and compared them with responses to compliments elicited from L1 German speakers via written DCTs. Comparison of frequency counts of various response categories revealed striking differences. For example, what may be termed the archetypal response to a compliment, danke (“thank you”), appeared in 12.4% of the DCT responses, but not once in the naturally occurring data. On the other hand, compliment responses that included assessments that agreed with the complimenter and that also contained a positive pursuit marker in a later turn (A: The meat is excellent/B: Super, right? … Yeah) appeared in 12% of the naturally occurring talk but in only 0.5% of the DCT-elicited responses. Here, a zero or near-zero percentage figure (i.e., frequency) can be interpreted to mean that the results do not reflect a valid picture of native-speaker behavior.

Given the above review of traditional SLPT in the light of CA findings, at least two conclusions may be considered. First, the speech act theory–based SLPT emphasis on the speaker at the expense of the hearer has resulted in a paucity of research into testing of SLPT aural comprehension; recall that the ODCT in Brown (2001) involved audio recordings of situational descriptions, not conversational sequences. Second, in order to enhance SLPT validity, speech act–theoretical test methods—whether written, aural, or pictorial—should be replaced by an approach more directly reflecting actual language use. Speech act theory, as Kasper (2005) points out, is rooted in a rationalistic approach to language study that assumes an ideal speaker endowed with reason. An L2 test developer, approaching language use from a rationalistic perspective, makes assumptions regarding the speaker’s intentions behind his or her language use. However, intentions are invisible to the test developer, whose assumptions regarding a subject’s intentions may lead one’s data-capturing
method astray, as shown by Golato (2003), whose DCT, for example, provided data which was empirical, but which concerned beliefs and intentions, not actual language use. In the end, when searching for a valid set of norms on which to base an SLPT measure, what is most useful is the empirical language-use data itself, open to public scrutiny. Given corpora of CA analyses that are available as resources for test development, SLPT may well abandon its rationalistic perspective and instead adopt a data-driven approach, employing empirical findings as an operational SLP test norm, thereby potentially enhancing validity, a fundamental concern in assessment (Bachman, 1990).

All of which leads to a question: If the use of rationalistic, speech act theory–derived test methods is seen to be of dubious value in light of CA, might then there be CA-informed methods that, by application of principles such as conditional relevance, can enhance SLPT validation in the testing of aural pragmatic competence? CA has been used in the analysis of existing educational (Marlaire & Maynard, 1990) and L2 oral language tests (Lazaraton, 1997; Johnson & Tyler, 1998; Kim & Suh, 1998; Lazaraton, 2002; Ross, 2007a, 2007b). To date, however, no attempts have been made to actually develop CA-informed tests of SLP aural ability.

RESEARCH QUESTIONS

Application of CA to SLPT initially seems to present a paradigmatic mismatch: Traditional language testing (LT) usually strives to produce a generalizable numerical score to represent a target skill level with reference to an objective criterion or norm. On the other hand, CA regards even a single interactional event, contextualized and irreproducible, as having a nonstatistical (indeed, nonnormative in the LT sense) significance and so quantification of behavior is irrelevant (Schegloff, 1993). In other words, CA research, it is argued, is nonstatistical in nature and thus inappropriate for quantitative measurement. However, in practical terms, these differing
approaches to talk-in-interaction need not preclude one approach informing the other. The strength of CA, and its potential benefit for SLPT, lies in its ability to uncover how speakers use various pragmatic actions (or practices) to co-construct sequences of conversation, by analyzing how they display their respective orientations to the emergent talk. Indeed, some LT studies have applied CA principles to the validation of existing L2 tests. For example, Ross (2007a) performed contrastive CA analyses on transcripts of oral proficiency interviews (OPIs) of an EFL examinee who had backslid to a lower numerical score since his initial interview. Analyses found differences in rater severity and examinee behavior that affected the score. In a follow-up statistical study, Ross (2007b) applied a form of Rasch modeling to a body of OPI test-score data, which determined a negligible effect of overall rater severity on OPI score, if rater differences were corrected through statistical equating. The essential point is that consideration of quantitative data prompted an application of CA, which provided data on L2-test interactions (uncovering a heretofore uninvestigated range of OPI practices), in turn motivating a statistical examination of related test data, finally resulting in a reconsideration of test-use validity. From such cross-paradigm (CA, qualitative; LT, quantitative) research (see also Lazaraton, 1997, 2002), it appears that CA can indeed inform SLP test validation and use.

Given this potential, it seems not unwarranted to hypothesize that CA could serve as a resource for SLP test-item construction as well. Moreover, given the theoretical and methodological problems with speech act–based SLPT, it can be reasoned that employing CA findings as an operational test norm in actual SLPT development might potentially enhance validity (Bachman, 1990). Accordingly, in an attempt to develop an alternative to traditional SLPT methods, a
pilot study was initiated into the development of a CA-informed test (CAIT) of English as a second language (ESL) pragmatic listening comprehension. The overall goal of the study was to determine the feasibility of a CAIT measure. In this connection, it should be noted that this overarching goal does not necessarily imply achieving a high degree of validity in the use of the measure at an early stage of development. Evidence of feasibility, then, can lead to test revisions, in turn leading to eventual validation of CAIT use. For this study, the overarching goal encompassed the following specific research questions:

1. What would be the statistical features of a listening CAIT when administered to advanced ESL speakers?
2. How would responses to a listening CAIT vary according to group differences?
3. Would the operationalized norm be confirmed or violated by native-speaker responses to listening CAIT questions?
4. Finally, can items testing L2 listening comprehension be practically and usefully derived from CA data?
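
The kinds of statistical features at issue in the first research question (for example, item facility, item discrimination, and split-half reliability) can be illustrated with a short sketch. The Python code below is not the analysis used in this study; it assumes a hypothetical matrix of 0/1 responses (rows are test takers, columns are items) and shows one conventional way to compute item facility (p), a point-biserial discrimination index, and a Spearman-Brown corrected split-half coefficient. The function names and the toy data are invented for illustration, and no further correction formula is implemented.

```python
from statistics import mean, pstdev


def pearson_r(x, y):
    """Pearson product-moment correlation of two equal-length lists."""
    sx, sy = pstdev(x), pstdev(y)
    if sx == 0 or sy == 0:
        return 0.0  # no variance, so the correlation is undefined; report 0
    mx, my = mean(x), mean(y)
    cov = mean((a - mx) * (b - my) for a, b in zip(x, y))
    return cov / (sx * sy)


def item_facility(responses):
    """Proportion of test takers who answered each item correctly (p)."""
    n_items = len(responses[0])
    return [mean(row[i] for row in responses) for i in range(n_items)]


def item_discrimination(responses):
    """Point-biserial index: correlation of each 0/1 item with total score."""
    totals = [sum(row) for row in responses]
    n_items = len(responses[0])
    return [pearson_r([row[i] for row in responses], totals)
            for i in range(n_items)]


def split_half_reliability(responses, half_a, half_b):
    """Correlate two half-test scores, then apply the Spearman-Brown step."""
    a = [sum(row[i] for i in half_a) for row in responses]
    b = [sum(row[i] for i in half_b) for row in responses]
    r = pearson_r(a, b)
    if r == -1:
        return float("-inf")  # formula is undefined at r = -1
    return 2 * r / (1 + r)


if __name__ == "__main__":
    # Hypothetical toy data: 4 test takers, 4 items (1 = correct, 0 = incorrect).
    toy = [
        [1, 1, 1, 1],
        [1, 1, 1, 0],
        [1, 0, 0, 0],
        [0, 0, 0, 0],
    ]
    print("facility:", item_facility(toy))
    print("discrimination:", item_discrimination(toy))
    print("split-half:", split_half_reliability(toy, [0, 2], [1, 3]))
```

In a sketch of this kind, the assignment of items to halves is passed in explicitly, so that halves can be matched on content rather than simply on odd and even item position.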
PARTICIPANTS

Participating were 70 adults—43 nonnative speakers (NNSs) of English, and a control group of 27 native speakers (NSs) of English. L1s spoken by the NNS group were Korean (10), Arabic (8), Chinese (6), Spanish (4), Japanese (4), Urdu (2), Albanian (2), and one each of Yoruba, Kikuyu, Turkish, French, Thai, Brazilian Portuguese, and Baule, a language of Côte d’Ivoire. Participants were graduate students and some of their spouses at a U.S. university. Demographic information was collected on all participants for age, sex, native language, second (or third) language, academic status, and number of years of formal English-language study. Information was collected on NNS persons only for age of arrival, length of stay in the United States, and most recent score on the Test of English as a Foreign Language (TOEFL), the score range for which was 550–670 on the paper-and-pencil scale. This score range suggests that these NNSs were at a relatively high level of English proficiency.

PRAGMATIC TARGETS

For this study, ESL aural pragmatic competence was operationalized as the ability to understand three types of pragmatic actions: assessment responses, compliment responses, and presequence responses. These actions do not constitute a representative sample of overall pragmatic competence; to infer such competence would be invalid. Rather, they were chosen because they are well documented in the CA literature (Pomerantz, 1978, 1984; Schegloff, 2007), and three targets seemed enough for a pilot intended to determine overall CAIT feasibility.

Assessment responses are actions in which a speaker displays evaluations of events. Among the various types, two are shown here (Pomerantz, 1984). Some are upgrades, which agree with the assessment in the first pair part (see Transcript 1, Speaker A), with a more emphatic word choice (Speaker B):

Transcript 1
A: T’s- tsuh beautiful day out isn’t it?
B: Yeh it’s just gorgeous ← (p. 61)

Other assessment responses are disagreements. The disagreement given by Speaker B in Transcript 2, for example, is a weakened agreement:

Transcript 2
A: I know but I, I-I still say thet the sewing machine’s quicker
B: Oh it c’n be quicker but it doesn’ do the job ← (p. 73)

There is also a range of compliment responses, as described by Pomerantz (1978). In addition to acceptance tokens such as “Thank you,” there are evaluative shifts, in which compliment recipients may either disagree with the compliment or offer a scaled-down version of it, as Speaker B does in Transcript 3 (Pomerantz, 1978), offering “quite nice” instead of Speaker A’s “beautiful”:

Transcript 3
A: It was just beautiful
B: Well thank you uh I thought it was quite nice, ←

Yet another type of compliment response consists of reference shifts, whereby the recipient of a compliment focuses the talk away from him- or herself and onto something else:

Transcript 4
A: You’re a good rower, Honey
B: These are very easy to row. Very light ← (p. 102)

The third target used in the pilot was the presequence response, as given by Speaker B in Transcript 5 (Schegloff, 2007):

Transcript 5
A: Hey I got sump’in thet’s wild
B: What ←
A: Ya know one a’ these great big red fire alarm boxes thet’r on the corner?
I got one (p. 39)

Speaker A in the first turn performs a presequence called a pretelling, a check to see if conditions are appropriate for delivering a news item. Speaker B in the second turn performs the actual presequence response, in this case a go-ahead, showing that Speaker B is willing to listen to the coming message. Speaker B, however, could have prevented A from delivering the item, by performing a blocking response, as shown in Transcript 6 (Schegloff, 2007):

Transcript 6
A: Didju hear about thee, pottery and lead poisoning
B: Yeah Ethie wz just telling us ← (p. 40)

TEST METHOD

Administration of the 10-item listening CAIT took approximately 20 minutes. Before administration of the pilot test, participants answered a short demographic questionnaire. The listening CAIT itself involved test takers’ listening to short tape-recorded dialogues between two native speakers of English. There were 10 listening items in all, each having a particular pragmatic action as the (provisional) target. The scripted dialogue prompts were derived directly from CA data examples, adopting …

In the conversation, what WAS the man doing?
(a) agreeing that there are many things
(b) rejecting the woman’s point of view
(c) understating his own achievement*
(d) emphasizing his own accomplishment

Here, the woman compliments the man on his having prepared a large meal, whereas the man delivers a disagreement proposing that the credit given is exaggerated. Inasmuch as this particular task is intended to test a learner’s ability to identify both the compliment and the most likely appropriate response, the answer choices are crafted in such a way as to make the identification slightly difficult; hence, the explicit term compliment is not inserted into any of the response options, and the answer coded correct is circuitously worded. The recorded part of each prompt was played twice so that the skill of aural pragmatic competence would not be confounded by the skills of overall comprehension or memory.

Prior to administering the pilot measure, audio scripts and multiple-choice items were reviewed by two CA scholars as content specialists, which resulted in changes to some items. For example, one piloted item (Item 1) was based on a recorded telephone conversation (Schegloff, 2007, p. 31). However, the original data excerpt itself contained little information to convey this fact: no lexis referring to the call, nor any telephone ring. In order to avoid ambiguity, the first turn—namely, “Hi Jane,”—was slightly modified to read, “Hi Jane, this is Dick,” the addition serving to identify the turn as part of a phone call. This addition arguably threatened the authenticity of this particular item; this issue will be taken up in the Discussion section.

RESULTS

Descriptive Statistics

Descriptive statistics of the listening CAIT—relating to the first research question—are given in Table 1. The NS score range was smaller than those of the combined and NNS groups, but the fact that there was a range suggests that the operationalized norm was disconfirmed. The overall variance (average score difference from the mean) for both groups
separately and together was narrow, less than two score points—1.87 for the NNS group, 1.69 for the NS group, and 1.79 for the combined group. The low NNS variance values seemed to indicate that most NNS participants responded near-normatively; that is, they got almost all the correct answers. All distributions were nearly normal.

TABLE 1
Pilot Listening CAIT Descriptive Statistics by Group

Statistic                  NS*     NNS**   Whole group
Mean                       7.33    7.12    7.20
Mode                       8.00    7.00    7.00
Median                     7.33    7.08    7.20
Range                      5.00    6.00    6.00
Variance                   1.69    1.87    1.79
Standard deviation         1.30    1.38    1.33
Skewness                   0.77    0.08    0.01
Kurtosis                   −0.91   −0.14   −0.19
Ave. item facility         0.73    0.71    0.72
Ave. item discrimination   0.30    0.28    0.29

* NS group n = 27. ** NNS group n = 43.

Item-level statistics were also calculated. The p statistic, the proportion of participants who passed the item, sometimes called an item mean (Davidson, 2000), is an index of item difficulty or facility. The average facility value for the whole group was p = 0.72, that for the NNS group p = 0.71, and that for the NS group p = 0.73, suggesting that the test overall was relatively easy. The item discrimination index (symbolized by d) was obtained with the point-biserial coefficient. If a single item correlated highly with the total test score of a group, then the item was considered to discriminate among high and low scorers (Haladyna, 1999). A value of d = 0.30 or higher is usually considered acceptable. On the listening CAIT, the d values were moderate, the average index for the whole group being d = 0.29; for the NNS group, d = 0.28; and for the NS group, d = 0.30. Of course, it may be somewhat misleading to interpret item discrimination with an NS control group.

Reliability—the ability of a test to measure a particular trait in a consistent manner across groups (Bachman, 1990)—was calculated with the Spearman-Brown split-half coefficient, using Horst’s (1953) correction formula. To obtain this coefficient, the test was divided into two
equivalent halves on the basis of target pragmatic skill (see Table 2). It may be argued the halves are not entirely equivalent. For example, one pair of items involves different actions: a disagreement with a compliment (“ … you’re doing a great job in all your classes”/“Well I guess she hasn’t seen my term paper for Astronomy class”) and a downgrade (“Wow, you made,—like a ton of stuff …”/“Oh::, just a few little things really”). However, these actions may be considered roughly equivalent because both are evaluative shifts (Pomerantz, 1978). Another pair of items involves differing actions, preinvitations and preoffers, yet Schegloff (2007) indicates that invitations often seem to be a subclass of offers. Thus, there are empirical grounds for considering the split halves of the listening CAIT to be content equivalent.

TABLE 2
Division of Listening CAIT Into Equivalent Halves

Split-half test A                        Split-half test B
Item   Pragmatic target(s)               Item   Pragmatic target(s)
…      Preoffers                         …      Preinvitations, hedges
…      Prerequests                       …      Prerequests, preoffers
…      Disagreeing with compliments      …      Downgrading compliments
…      Preinvitations, go-aheads         …      Preinvitations, go-aheads
…      Pretellings                       10     Pretellings

Statistically, the halves possessed equal means and variances, a requirement for using the Spearman-Brown (Bachman, 1990) split-half coefficient, as shown by nondirectional t tests run on the split-half means, and nondirectional F-tests run on the variances (see Table 3). The coefficients obtained were r = −0.137 for the whole group; r = −0.019 for the NNS group; and r = −0.369 for the NS group. The coefficients reveal pronounced attenuation, attributable to a narrow range of ESL pragmatics ability among the participants; recall that the NNS group average TOEFL score was 617.18, which was relatively high.

Demographic Factors

The second research question concerned how responses to a listening CAIT might vary according to group differences. Pearson’s product-moment correlation coefficient
(symbolized by r) was used to determine the correlation between listening-CAIT score and the demographic variables: age, sex, native language, second (or third) language, academic status, age at arrival and length of stay in the United States, recent TOEFL score, and number of years of formal English study, defined for both groups as formal instruction in English grammar and language arts (though these may differ for NS and NNS learners). The point-biserial correlation coefficient was used to determine the correlation between test score and gender. The correlation values are given in Table 4.

TABLE 3
Split-Half Statistics: Listening CAIT

Statistic           NS             NNS            Whole group
Mean, Test A        3.67           3.58           3.61
Mean, Test B        3.67           3.54           3.59
Variance, Test A    0.85           0.82           0.82
Variance, Test B    1.15           1.06           1.09
t test*             2.06 (n.s.)    0.83 (n.s.)    0.86 (n.s.)
F-test**            1.33 (n.s.)    1.30 (n.s.)    1.32 (n.s.)
df                  27             43             69

* t-critical: 2.056 (NS); 2.021 (NNS); 1.976 (whole group).
** F-critical: 1.93 (NS); 1.69 (NNS); 1.54 (whole group); alpha = 0.05.

Possibly because of the limited variance, as mentioned earlier (see Table 1), correlation strengths were from r = 0.28 and below, that is, from moderate to negligible; the low variances can be seen as the cause of depressed correlation values (Cziko, 1981; Kunnan, 1992). For example, because the test responses did not vary widely from the mean, variances were minimal; thus correlation values for age (e.g., whole group r = −0.02), length of U.S. residence (0.02), educational status (0.01), and so on, were artificially low. There was a slight correlation of test score overall with female sex (whole group r = 0.18). The strongest correlations, for both the NS and the NNS groups, were with the number of languages studied (NS r = 0.28; NNS r = 0.24) and years of formal English study (r = 0.26; r = 0.30); implications of this result are given in the Discussion section. No correlations with L1 appeared in the data, nor was there an association with NNS
proficiency as measured by the TOEFL.

TABLE 4
Correlations of Listening CAIT Score With Demographic Variables

Variables: Age; Sex (F)*; Educational status; No. of languages; Years of English; Length of residence; Age of arrival; TOEFL; NS versus NNS*
Groups: NS; NNS; Whole group
Coefficients: −0.16 0.11 0.02 0.28 0.26 −0.18 0.23 0.07 0.16 0.24 0.30 0.21 0.05 −0.01 −0.02 0.18 0.01 0.18 0.22 0.02 0.08

* Point-biserial coefficient used for this variable; all others Pearson’s r.

Listening CAIT Item-Content Analysis

Analysis of item-response patterns can help one determine whether inferences about learner ability (here, ability to understand compliments, assessments, and presequences, as well as responses) can be validly made given the adequacy of the content coverage of the test items and the responses to those items (Messick, 1988). Examination of these responses was the focus of the third research question. Table 5 presents a breakdown of operationally correct responses and alternate interpretations (in CA terms, orientations). A glance at the operationalized target (correct) columns shows that a sizable number of NSs did not adhere to the operationalized norm, as several distracters, the intended wrong answers, were chosen by the native speakers at varying rates of frequency. The table also shows that the NSs and NNSs produced similar response patterns. For example, pluralities of respondents in both groups selected options coded correct, with the exception of the preinvitation Item 1, in which a plurality of NS respondents selected the invitation distracter.

TABLE 5
Proportions of NS and NNS Orientations to Listening CAIT Items

Item  Operationalized target(s)   NS     NNS    Other orientations*        NS     NNS
1     Preinvitation, blocking     0.37   0.49   Invitation                 0.44   0.47
                                                Suggestion                 0.15   0.05
                                                Offer                      0.04   –
2     Prerequest, offers          0.59   0.81   States possession of car   0.37   0.16
                                                Giving directions          0.04   0.02
3     Compliment, disagreement    0.70   0.47   Expressing dislike         0.30   0.42
                                                Expressing pride           –      0.12
4     Preinvitation, go-ahead     0.89   0.93   Guessing                   0.07   0.05
                                                Suggestion                 0.04   0.09
5     Pretelling                  0.63   0.70   Telling                    0.37   0.19
                                                Guessing                   –      0.09
6     Prerequest, offer           0.85   0.84   Express sympathy           0.15   0.12
                                                Suggestion/reminder        –      0.05
7     Compliment downgrade        0.89   0.58   Disagreement               0.11   0.23
                                                Agreement                  –      0.19
8     Pretelling                  0.93   0.81   Story different topic      0.04   0.09
                                                Story different topic      0.04   –
                                                Offer                      –      0.09
9     Specific preoffer           0.85   0.65   Generic preoffer           0.15   0.09
                                                Offer                      –      0.26
10    Preinvitation, go-ahead     0.63   0.84   Blocking, “busy”           0.37   0.12
                                                Blocking, “go away”        –      0.02
                                                Expressing boredom         –      0.02

Note. Dashes indicate no action was observed for the given option. Some values inflated due to rounding.
* Chosen distracters.

It is interesting that this plurality (p = 0.47) of NS respondents appears to have perceived the sequential implicativeness of the preinvitation “uh what ’r you guys doing” (see Item 1), but not the significance of the blocking move in line 4, thereby addressing the prompt by selecting distracter (d). About half of the NNS and more than a third of the NS participants appear to have understood the meaning of the blocking move in line 4, but about half of all participants did not, selecting the invitation distracter (d). This trend suggests that native speakers of American English, and advanced ESL speakers who have been exposed to the pragmatic norms, in fact choose to disregard blocking maneuvers and proceed with the prefigured invitation. If so, an invitation distracter may make for an item from which ESL aural pragmatic competence vis-à-vis blocking cannot be reliably determined.

In several items, NNS participants chose distracters that NS participants did not select, suggesting at least a partial skill discrimination according to the operationalized norm. One example was Item 3, in which 12% of NNS respondents chose the expressing pride distracter (b). Another example was Item 7, where 19% of NNS respondents chose the agreement option (a).

Test Practicality

The fourth research question concerned
practicality of listening CAIT development and use. Applying Roid and Haladyna’s (1982) item-form technique was relatively straightforward. Moreover, the test administration was relatively brief, requiring about 20 minutes.

DISCUSSION

Before discussing the results, limitations of the current study should be noted. First, the present sample of ESL participants was not truly random; hence, any conclusions regarding the impact of demographic variables on ESL pragmatic competence cannot be final. Second, the domain of pragmatic competence selected (assessment responses, compliment responses, and presequence responses) was necessarily limited. Piloting items testing other skills, for example, turn-taking or self-initiated repair (Levinson, 1983), is one task of future CAIT development studies. Third, because of the low reliability coefficients, apparently due to low test variance (Kunnan, 1992), it is unclear whether quantitative SLPT using the present approach is workable, given a methodological conundrum. If one broadens the range of L2 proficiency to include intermediate- as well as high-proficiency test takers, pragmatic performance may be confounded with overall aural comprehension or grammatical competence; indeed, we may note Bardovi-Harlig and Dörnyei’s (1998) finding that low-proficiency EFL learners were more sensitive to target language (TL) grammatical issues than they were to pragmatic issues, though the reverse was true for ESL learners. Conversely, if one delimits the pool to high-proficiency learners, then one may reap low variances and thus poor reliabilities. If so, validation of test-score inferences regarding advanced L2 pragmatic competence may rest on accepting validity without statistical reliability (Moss, 1994). Such an approach is not unthinkable because “items may be indicated as undesirable on the basis of fit or discrimination, yet have content that is representative of the trait” (Hudson, 1991, p. 180). Further research with intermediate-proficiency
learners, perhaps with more controls for NS versus ESL/EFL instructional approaches, seems indicated to determine the feasibility of applying CAIT to SLPT development.

The first research question of this study concerned the statistical features of the pilot listening CAIT when administered to NSs and advanced NNSs. In addition to the low reliabilities, a statistical item of concern is that the NS-group p values evinced a wide range, suggesting that the NS norm derived from the CA literature was limited. On the positive side, the overall item discrimination was moderate and the frequency distribution of number correct was approximately normal, suggesting that the pilot CAIT method may have some usefulness in testing SLP aural comprehension.

The second research question concerned whether demographic factors affected listening CAIT responses (see Table 4). For the NNS group, test score correlated modestly with length of residence in the United States. Although correlation does not necessarily imply causality, this finding would seem to resonate with interlanguage pragmatics studies which show length of stay in the TL community as a factor in acquisition of TL pragmatics (Blum-Kulka & Olshtain, 1986), especially in U.S. academic contexts (Bardovi-Harlig & Hartford, 1993). Even without overt training in pragmatics (Bouton, 1992), L2 learners resident in the TL community are often more sensitive to pragmatic errors than to grammatical ones, unlike their nonresident counterparts (Bardovi-Harlig & Dörnyei, 1998). However, NNS total test score did not correlate with TOEFL score, a result that would appear to give support to Bachman’s (1990) model of language competence, which separates grammatical and pragmatic competencies.

The strongest positive correlations with NNS and NS total test scores were with total number of foreign languages learned and years of formal English study (see Table 4). Again, though correlation does not imply
causation, it is tempting to hypothesize that this factor, theoretically dealing with meta-aware, attention-allocating use of language, had significant bearing on the pragmatic performance by both NS and NNS participants in this study. Such a conclusion would seem to find theoretical and empirical resonance in the notions in second language acquisition of attention to input (e.g., Schmidt, 1983, 1990) as well as the paper-and-pencil, or formal-traditional test-taking, factor found in Brown (2001). One may reasonably assume that attention to ESL pragmatic input occurred in the process of the formal learning of the several L2s by the respective participants. Possible corroborative evidence for this assumption can be seen in the coefficients obtained for total test score with educational status for the NNS group (r = 0.16) in contrast with that obtained for the NS group (r = 0.02). That is, along with years of formal English study and length of residence in the United States, the number of years devoted to formal education seems to have had some relationship to NNS acquisition of ESL pragmatic skills, whereas length of formal education does not for NSs, for whom acquisition of English pragmatic competence will have largely been accomplished prior to and at least somewhat independently of formal schooling. However, it may be noted that for the NS group, years of formal English study may have correlated highly with CAIT score for much the same reasons that it did with the NNS group, in that such study would cultivate metalinguistic and metapragmatic skill. If true, then years of formal English study, somewhat counterintuitively, would not correlate highly with age or length of residence, because metalinguistic or metapragmatic training in school may well have been more variable across participants than mere physical residence in the Anglophone environment.

One may also speculate that relatively high correlations of CAIT score with years of English study and
with number of languages studied resulted from affective flexibility, that is, empathy, or the ability to adopt new language egos (Guiora, Brannon, & Dull, 1972). Such a conclusion should be taken with caution, partly because of the difficulty in operationalizing empathy cross-culturally (Brown, 1994), and partly because the narrow range of variance in participant responses has arguably (Cziko, 1981; Kunnan, 1992) caused the absolute values of all the coefficients to be low, thus possibly obscuring actual relationships between demographic variables and listening CAIT score.

The third research question concerned whether the operationalized norm would be confirmed by NS responses to listening CAIT questions. Again, one may take as a general principle that the higher the p values (or frequencies) among the NS participants, the more nearly valid the operational norm exemplified by the item, and conversely, the lower the percent values, the greater the disconfirmation of a norm. Here, one may note the wide range of NS p values, the correct response rate being 73.3% for the NS and 71.2% for the NNS participants, results which recall those of the Hudson et al. (1995) study. One possible explanation for the NS response patterns is that some of the NS-chosen distracter items (see Table 5) represent an NS norm that is somewhat broader than the operationalized norm derived from Pomerantz (1978, 1984) and Schegloff (2007). If so, such responses may call into question the validity of inferring ESL pragmatic ability among the NNS participants, at least with regard to the three pragmatic actions used in this study, and additional CA research may be needed to shed light on these results, to provide a firmer basis for an operational CAIT norm. One may note, however, that while the 73.3% agreement rate among NS participants is somewhat lower than the 92% NS agreement rate in Hudson et al. (1995), this overall rate reflects a majority of the pilot-test sample and is somewhat encouraging.

This violation of
the operational norm points to a fundamental consideration regarding the validity of CAIT results and thus usefulness of the pilot CAIT itself, namely, method effect (in CA terms, the sequential organization of the protocol). As Marlaire and Maynard (1990) point out, test results emerge through social collaborations between tester and test taker. The outcome of such collaborations with a listening CAIT will depend significantly on the application of CA data and principles to aural prompts, and on the careful crafting of the stem and response options.

A problematic example highlighted by the application of CA data and principles to aural prompts involves the telephone-call item mentioned earlier (Item 1). Textual additions, such as the one to compensate for the absence of a recorded ring, may threaten the authenticity of an item and, paradoxically, invoke the kind of rationalistic assumptions about NS behavior that application of CA to SLPT should avoid.

Examples of the latter issue, namely, careful crafting of stem and response options, are manifest in the numerous instances in which NS participants chose distracters. One approach of the CAIT developer in addressing the distracter issue would be to replace such distracters with ones that better distract, the revised items being repiloted with a wide sample of NSs to determine d and p values and to eliminate nonfunctioning response options. However, this raises several questions. One is how to word distracters precisely. In Item 7, for example, it is important to critically examine the item content, as well as the NS response patterns, to determine whether option (c) is in fact the right answer, or whether it should be worded in more technically correct CA terms as downgrading the compliment or responding to a compliment. In preparing the pilot, downgrading was considered too obscure for the ESL learners participating in the study, and understating seemed a reasonable, nontechnical synonym. NS response
data showed that a majority (89%) chose option (c), “understating his own achievement,” but 11% selected option (b), “rejecting the woman’s point of view,” analyzable as being synonymous with option (c).

Overall, the adequacy of CAIT distracter wording will depend on several issues. One concerns the purposes of testing and instruction. One may imagine an ESL course in which explicit mastery of CA metalanguage, for example, downgrading or presequence, was an instructional goal, making explicit inclusion in a correct answer choice relevant. In such cases, prudent test design would suggest the same word be used also in a distracter, but with an incorrect grammatical object, for example, downgrading the amount of food. A second issue is whether nontechnical wordings such as understating can serve as unambiguous substitutes for more technical CA terminology without misleading the L2 test taker. A third issue is whether a given nontechnical wording can adequately distract without seeming to overlap descriptively with the answer coded as correct; given the NS response pattern in Item 7, option (b) may need revision. Attending to such details, with reference to possibly emergent findings from future CA research as to the NS pragmatic norm, will hopefully affect the “conversation” between CAIT items and the test taker such that valid skill-inferences can be made.

Related to this third issue of CAIT item construction is the second question alluded to earlier, a fundamental concern involving application of CA principles. Recall that one shortcoming of speech act–based SLPT was the inability of the measures to adequately operationalize and assess pragmatic competence with regard to conversational sequences. As Schegloff (2007) notes, turns of talk are generally characterized by a relationship of adjacency or “nextness” (p. 15); that is, a next turn both displays a speaker’s understanding of another speaker’s just-prior turn and also constitutes an action that is
responsive to that just-prior turn. However, next turns are not generated deterministically; for example, just-prior turns are not always understood as their speaker had intended; hence, next turns may be various. Schegloff (1988), for instance, analyzes examples of conversational breakdowns in which the speakers over a four-turn sequence display variant understandings of the first speaker’s initial question. Thus, a conversational sequence can evolve in varying, often unpredictable directions, co-constructed by interactants in real time, evincing a range of possible sequential trajectories (Schegloff, 1993). Hence, to return to the pilot items under review, a given set of response options, as in Item 1, may well legitimately embody more than one possible adjacency pair, that is, more than one correct answer. Thus, in applying CA principles to pilot posttest analysis, we see that the hypothetical speaker of the blocked preinvitation in Item 1 could explain that he had intended to make an invitation (option c) or could ignore the blocking action and deliver the invitation (option d). From this perspective, the CAIT developer may consider that the problem may not be a limited NS norm but that the item may not operationalize conditional manifestations of that norm.

A solution to this problem of sequential indeterminacy for CAIT item construction might be to use a multiple-correct item format, with perhaps as many as three correct out of five options, to accommodate multiple conditional relevancies. In this connection, it seems appropriate to point out that the operational use of any candidate multiple-correct options, such as Item 1’s option (d) or Item 7’s option (b), should be supported by actual CA studies of talk-in-interaction and not by metapragmatic judgments of NSs who may be called on to review draft items. Otherwise, a CAIT approach to SLPT will have little validation benefit over speech act–based SLPT. It may thus be useful to apply CA principles (such as conditional relevance)
to the wordings of the item stems and prompts in light of NS and NNS pilot-test response patterns as a way not only to generate hypotheses regarding the TL norm (see Schegloff, 1993, pp. 114–115, for relevant discussion), but to effect refinements on the wording (and coding) of items. In such fine-grained considerations, one may see the potential of CA to inform the crafting of assessment tasks that may more adequately reflect TL behavior and thus support validity of CAIT item use.

The fourth research question concerned the overall feasibility of employing L2 listening-comprehension items derived from CA data. From an item-writer’s perspective, the suggested listening CAIT development procedure evinces practicality in that designing and revising prompts from CA transcriptions using Roid and Haladyna’s (1982) item-form technique was relatively straightforward. Nor was test administration time consuming, taking approximately 20 minutes.

Yet practicality of method, while necessary, is not sufficient; validity is essential. Recall that the basic objection to the use of speech act theory–based SLPT methods was lack of authenticity because of the arguably intuition-based, largely norm-irrelevant, metapragmatic nature of DCT responses and hence the questionable validity of skill inference. Further, recall the plausible conclusion, based on the moderate correlation between test score and formal L2 instruction, that the participants were using metapragmatic skill in addressing the listening CAIT items, plus the aforementioned violations of the operationalized norm (see Table 5). One may then legitimately ask whether the application of CA has actually enhanced validity over traditional SLPT methods. Three findings suggest a positive, though tentative, answer.

The first is the overall 73.3% agreement rate among NS participants. This finding also resonates with Golato’s (2003) finding of a degree of overlap between DCT-collected data and
data gathered by CA methodology. Such overlap of offline metapragmatic and online pragmatic behavior suggests a corresponding overlap in underlying trait. In this light, the overall 73% NS congruence with the operationalized CAIT norm, as well as the fact that several NS item responses had 85% or better normative congruence (see Table 5), can be seen as evidence for validity with regard to assessing the delimited, aural pragmatic competence in the L2 population tested. It would thus appear that the process of using CA research findings as a provisional norm, as well as CA practitioners as content specialists in the prepilot evaluation of that provisional norm, has led to test results, in the form of high NS p values, that suggest that an NS norm can be approached, despite the indeterminacy of natural conversation. (However, as suggested earlier, employment of a multiple-correct item format may be necessary to accommodate alternate sequential trajectories.) One may further note that, unlike with intuition-driven, speech act–based SLPT studies, this finding is made with reference to an objective, empirically supportable NS norm (Pomerantz, 1978, 1984; Schegloff, 2007).

A second finding tentatively supporting usefulness of the pilot CAIT method involves nonnormative NNS responses that NS participants did not make, apparently genuinely NNS behavior (see Table 5). Examples include the expressing pride distracter in Item 3 (testing compliments) and the guessing distracter in one of the items testing pretellings. These responses are encouraging from the standpoint of CAIT development in that, despite seeming limitations of the operationalized norm, it is demonstrated that alternative responses can be crafted that are usable as candidate distracters for future iterations of the listening CAIT. More fundamentally, the very observable existence of an area of NNS behavior outside of an operationalized NS norm further suggests that the norm itself is realizable, despite CA
methodological perspectives on the irrelevance of such norms (Schegloff, 1993; but see Atkinson & Heritage, 1984, p. 2), and can be a basis on which CAIT measures can be constructed and inferences of SLP skill validly made.

A third tentative finding involves postpilot application of CA principles to draft items, in light of NS and NNS response patterns, which revealed possible multiple conversational trajectories beyond the operational norm. Such may indicate the outlines of a workable method for CAIT validation, as a way to generate both hypotheses regarding NS pragmatic norms and potential refinements on the wording and coding of response options in keeping with CA principles such as conditional relevance. Such findings seem to suggest that CAIT development may hold some promise as an area of SLPT research and as an advance on traditional approaches to SLPT.

THE AUTHOR

F. Scott Walters is an assistant professor in the Department of Linguistics and Communication Disorders, Queens College, City University of New York, United States. His research interests include L2 testing, conversation analysis, and TESOL assessment-literacy training. He has also engaged in program evaluation for less commonly taught languages (LCTL) and for students with interrupted formal education (SIFE).

REFERENCES

Atkinson, J. M., & Heritage, J. (Eds.). (1984). Structures of social action: Studies in conversation analysis. New York: Cambridge University Press.
Austin, J. L. (1962). How to do things with words. Cambridge, MA: Harvard University Press.
Bachman, L. F. (1990). Fundamental considerations in language testing. Cambridge: Cambridge University Press.
Bardovi-Harlig, K., & Dörnyei, Z. (1998). Do language learners recognize pragmatic violations?
Pragmatic versus grammatical awareness in instructed L2 learning. TESOL Quarterly, 32, 233–262.
Bardovi-Harlig, K., & Hartford, B. S. (1993). Input in an institutional setting. Studies in Second Language Acquisition, 17, 171–188.
Blum-Kulka, S. (1982). Learning how to say what you mean in a second language. Applied Linguistics, 3, 29–59.
Blum-Kulka, S., House, J., & Kasper, G. (Eds.). (1989). Cross-cultural pragmatics: Requests and apologies. Norwood, NJ: Ablex.
Blum-Kulka, S., & Olshtain, E. (1986). Too many words: Length of utterance and pragmatic failure. Studies in Second Language Acquisition, 8, 47–61.
Bouton, L. F. (1988). A cross-cultural study of ability to interpret implicatures in English. World Englishes, 17, 183–196.
Bouton, L. F. (1992). The interpretation of implicature in English by NNS: Does it come automatically—without being explicitly taught? In L. F. Bouton & Y. Kachru (Eds.), Pragmatics and language learning: Vol. (pp. 53–65). Urbana-Champaign: University of Illinois Press.
Brown, H. D. (1994). Principles of language learning and teaching (3rd ed.).
Englewood Cliffs, NJ: Prentice-Hall.
Brown, J. D. (2001). Pragmatics tests: Different purposes, different tests. In K. R. Rose & G. Kasper (Eds.), Pragmatics in language teaching (pp. 301–325). Cambridge: Cambridge University Press.
Carrell, P. L. (1979). Indirect speech acts in ESL: Indirect answers. In C. A. Yorio, K. Perkins, & J. Schachter (Eds.), On TESOL ’79 (pp. 297–307). Washington, DC: TESOL.
Carrell, P. L. (1981). Relative difficulty of request forms in L1/L2 comprehension. In M. Hines & W. Rutherford (Eds.), On TESOL ’81 (pp. 141–152). Washington, DC: TESOL.
Chapelle, C. (1999). Validity in language assessment. Annual Review of Applied Linguistics, 19, 254–272.
Cziko, G. (1981). Psychometric and edumetric approaches to language testing. Applied Linguistics, 2, 27–43.
Davidson, F. (2000). The language tester’s statistical toolbox. System, 28, 605–617.
Educational Testing Service. (2007). TOEFL iBT tips: How to prepare for the TOEFL iBT. Retrieved February 12, 2008, from http://www.ets.org/Media/Tests/TOEFL/pdf/TOEFL_Tips.pdf
Faerch, C., & Kasper, G. (1989). Internal and external modification in interlanguage request realization. In S. Blum-Kulka, J. House, & G. Kasper (Eds.), Cross-cultural pragmatics: Requests and apologies (pp. 221–247). Norwood, NJ: Ablex.
Gibbs, R. W. (1984). Literal meaning and psychological theory. Cognitive Science, 8, 275–304.
Golato, A. (2003). Studying compliment responses: A comparison of DCTs and recordings of naturally occurring talk. Applied Linguistics, 1, 1–54.
Guiora, A. Z., Brannon, R. C., & Dull, C. Y. (1972). Empathy and second language learning. Language Learning, 22, 111–130.
Haladyna, T. M. (1999). Developing and validating multiple-choice test items (2nd ed.).
Mahwah, NJ: Erlbaum.
Heritage, J. (1984). Conversation analysis. In J. Heritage (Ed.), Garfinkel and ethnomethodology (pp. 233–292). Cambridge: Polity Press.
Horst, P. (1953). Correcting the Kuder-Richardson reliability for dispersion of item difficulties. Psychological Bulletin, 50, 371–374.
Hudson, T. D. (1991). Relationships among IRT item discrimination and item fit indices in criterion-referenced language testing. Language Testing, 8, 160–181.
Hudson, T., Detmer, E., & Brown, J. D. (1995). Developing prototypic measures of cross-cultural pragmatics. Honolulu: University of Hawai’i Press.
Johnson, M., & Tyler, A. (1998). Re-analyzing the OPI: How much does it look like natural conversation? In R. Young & A. W. He (Eds.), Talking and testing: Discourse approaches to the assessment of oral proficiency (pp. 27–51). Amsterdam: Benjamins.
Kasper, G. (2005, April). Speech acts in interaction: Towards discursive pragmatics. Plenary talk presented at the 16th International Conference on Pragmatics and Language Learning, Bloomington, IN.
Kasper, G., & Rose, K. R. (2002). Pragmatic development in a second language. Oxford: Blackwell.
Kim, K., & Suh, K. (1998). Confirmation sequences as interactional resources in Korean language proficiency interviews. In R. Young & A. W. He (Eds.), Studies in Bilingualism: Vol. 14. Talking and testing: Discourse approaches to the assessment of oral proficiency (pp. 297–332). Amsterdam: Benjamins.
Koike, D. A. (1996). Transfer of pragmatic competence and suggestions in Spanish foreign language learning. In S. M. Gass & J. Neu (Eds.), Speech acts across cultures: Challenges to communication in a second language (pp. 257–281). Berlin: Mouton de Gruyter.
Kunnan, A. J. (1992). An investigation of a criterion-referenced test using G-theory, and factor and cluster analyses. Language Testing, 9, 30–49.
Lazaraton, A. (1997). Preference organization in oral proficiency interviews: The case of language ability assessments. Research on Language and Social Interaction, 30, 53–72.
Lazaraton, A. (2002).
Studies in language testing: Vol. 14. A qualitative approach to the validation of oral language tests. Cambridge: Cambridge University Press.
Levinson, S. C. (1983). Pragmatics. Cambridge: Cambridge University Press.
Marlaire, C. L., & Maynard, D. W. (1990). Standardized testing as an interactional phenomenon. Sociology of Education, 63, 83–101.
Messick, S. (1988). The once and future issues of validity: Assessing the meaning and consequences of measurement. In H. Wainer & H. I. Braun (Eds.), Test validity (pp. 33–45). Hillsdale, NJ: Erlbaum.
Mey, J. L. (2001). Pragmatics: An introduction. Malden, MA: Blackwell.
Moss, P. A. (1994). Can there be validity without reliability? Educational Researcher, 23(2), 5–12.
Pomerantz, A. (1978). Compliment responses: Notes on the co-operation of multiple constraints. In J. Schenkein (Ed.), Studies in the organization of conversational interaction (pp. 57–101). New York: Academic Press.
Pomerantz, A. (1984). Agreeing and disagreeing with assessments: Some features of preferred/dispreferred turn shapes. In J. M. Atkinson & J. Heritage (Eds.), Structures of social action: Studies in conversation analysis (pp. 79–112). New York: Academic Press.
Richards, J. C., & Schmidt, R. W. (1983). Conversational analysis. In J. C. Richards & R. W. Schmidt (Eds.), Language and communication (pp. 117–154). London: Longman.
Roid, G., & Haladyna, T. M. (1982). Toward a technology of test-item writing. New York: Academic Press.
Rose, K. (1992). Speech acts and questionnaires: The effect of hearer response. Journal of Pragmatics, 17, 49–62.
Ross, S. J. (2007a). A comparative task-in-interaction analysis of OPI backsliding. Journal of Pragmatics, 39, 2017–2044.
Ross, S. J. (2007b, November). An event history approach to unbiased task assessment. Paper presented at the East Coast Organization of Language Testers conference, George Washington University, Washington, DC.
Sacks, H., Schegloff, E., & Jefferson, G. (1974). A simplest systematics for the organization of turn-taking for conversation. Language,
50, 696–735.
Schegloff, E. A. (1984). On questions and ambiguities in conversation. In J. M. Atkinson & J. Heritage (Eds.), Structures of social action: Studies in conversational analysis (pp. 28–52). New York: Cambridge University Press.
Schegloff, E. A. (1988). Presequences and indirection: Applying speech act theory to ordinary conversation. Journal of Pragmatics, 12, 55–62.
Schegloff, E. A. (1993). Reflections on quantification in the study of conversation. Research on Language and Social Interaction, 26, 99–128.
Schegloff, E. A. (2007). Sequence organization in interaction: A primer in conversation analysis: Vol. New York: Cambridge University Press.
Schmidt, R. (1983). Interaction, acculturation, and acquisition of communicative competence. In N. Wolfson & E. Judd (Eds.), Sociolinguistics and second language acquisition (pp. 137–174). Rowley, MA: Newbury House.
Schmidt, R. (1990). The role of consciousness in second language learning. Applied Linguistics, 11, 129–158.
Searle, J. R. (1975). Indirect speech acts. In P. Cole & J. Morgan (Eds.), Syntax and semantics: Vol. Speech acts (pp. 59–82). New York: Academic Press.
University of Cambridge Local Examinations Syndicate. (2007, November). Reviewing FCE and CAE (Bulletin 9). Retrieved February 12, 2008, from http://www.cambridgeesol.org/assets/pdf/fcecae_review9.pdf
Yamashita, S. (2001). Using pictures for research in pragmatics: Eliciting pragmatic strategies by picture response tests. In T. Hudson & J. D. Brown (Eds.), A focus on language test development (pp. 35–56). Honolulu: University of Hawai’i, Second Language Teaching and Curriculum Center.

APPENDIX

Select Conversation Analysis Transcription Conventions

. [period]: falling intonation contour
, [comma]: continuing intonation
? [question mark]: rising intonation, not necessarily a question
:: [colons]: prolongation of sound preceding the colon
- [hyphen]: after a word or word part, a self-interruption or cut-off
< [less-than sign]: succeeding talk starts with a rush
talk: underlines indicate emphasis or stress
((words)): double parentheses contain information on nonverbal events accompanying the interaction
