Research and Theory

Test Construction

Writing items requires decisions about the nature of the item or question to which we ask students to respond (that is, whether it is discrete or integrative), how we will score the item (for example, objectively or subjectively), the skill we purport to test, and so on. We also consider the characteristics of the test takers and the test-taking strategies respondents will need to use. What follows is a short description of these considerations for constructing items.

Test Items

A test item is a specific task test takers are asked to perform. Test items can assess one or more points or objectives, and the actual item itself may take on a different constellation depending on the context. For example, an item may test one point (understanding of a given vocabulary word) or several points (the ability to obtain facts from a passage and then make inferences based on the facts). Likewise, a given objective may be tested by a series of items. For example, there could be five items all testing one grammatical point (e.g., tag questions). Items of a similar kind may also be grouped together to form subtests within a given test.

Classifying Items

Discrete – A completely discrete-point item would test simply one point or objective such as testing for the meaning of a word in isolation. For example:

Choose the correct meaning of the word paralysis.

(A) inability to move
(B) state of unconsciousness
(C) state of shock
(D) being in pain


Integrative – An integrative item would test more than one point or objective at a time (e.g., comprehension of words and the ability to use them correctly in context). For example:

Demonstrate your comprehension of the following words by using them together in a written paragraph: “paralysis,” “accident,” and “skiing.”

Sometimes an integrative item is really more a procedure than an item, as in the case of a free composition, which could test a number of objectives, such as use of appropriate vocabulary, use of sentence-level discourse, organization, and statement of thesis and supporting evidence. For example:

Write a one-page essay describing three sports and the relative likelihood of being injured while playing them competitively.

Objective – A multiple-choice item, for example, is objective in that there is only one right answer.

Subjective – A free composition may be more subjective in nature if the scorer is not looking for any one right answer, but rather for a series of factors (creativity, style, cohesion and coherence, grammar, and mechanics).

The Skill Tested

The language skills that we test lie on a continuum from the more receptive skills (listening and reading) to the more productive skills (speaking and writing). There are, of course, other language skills that cross-cut these four, such as vocabulary. Assessment of vocabulary will most likely vary to some extent across the four skills, with assessment in listening and reading perhaps covering a broader range than assessment in speaking and writing. We can also assess nonverbal skills, such as gesturing, and this can be both receptive (interpreting someone else’s gestures) and productive (making one’s own gestures).

The Intellectual Operation Required

Items may require test takers to employ different levels of intellectual operation in order to produce a response (Valette, 1969, after Bloom et al., 1956). The following levels of intellectual operation have been identified:

  • knowledge (bringing to mind the appropriate material);
  • comprehension (understanding the basic meaning of the material);
  • application (applying the knowledge of the elements of language and comprehension to how they interrelate in the production of a correct oral or written message);
  • analysis (breaking down a message into its constituent parts in order to make explicit the relationships between ideas, including tasks such as recognizing the connotative meanings of words, correctly processing a dictation, and making inferences);
  • synthesis (arranging parts so as to produce a pattern not clearly there before, such as in effectively organizing ideas in a written composition); and
  • evaluation (making quantitative and qualitative judgments about material).

It has been popularly held that these levels demand increasingly greater cognitive control as one moves from knowledge to evaluation – that, for example, effective operation at more advanced levels, such as synthesis and evaluation, would call for more advanced control of the second language. Yet this has not necessarily been borne out by research (see Alderson & Lukmani, 1989). The truth is that what makes items difficult sometimes defies the intuitions of the test constructors.

The Tested Response Behavior

Items can also assess different types of response behavior. Respondents may be tested for accuracy in pronunciation or grammar. Likewise, they could be assessed for fluency, for example, without concern for grammatical correctness. Aside from accuracy and fluency, respondents could also be assessed for speed – namely, how quickly they can produce a response, to determine how effectively the respondent replies under time pressure.

In recent years, there has also been an increased concern for developing measures of performance – that is, measures of the ability to perform real-world tasks, with criteria for successful performance based on a needs analysis for the given task (Brown, 1998; Norris, Brown, Hudson, & Yoshioka, 1998).

Performance tasks might include “comparing credit card offers and arguing for the best choice” or “maximizing the benefits from a given dating service.” At the same time that there is a call for tasks that are more reflective of the real world, there is a commensurate concern for more authentic language assessment. At least one study, however, notes that the differences between authentic and pedagogic written and spoken texts may not be readily apparent, even to an audience specifically listening for differences (Lewkowicz, 1997). In addition, test takers may not necessarily concern themselves with task authenticity in a test situation. Test familiarity may be the overriding factor affecting performance.

Characteristics of Respondents

Items can be designed to be appropriate for groups of test takers with differing characteristics. Bachman and Palmer (1996, pp. 64-78) classify these characteristics into four categories:

  • the personal characteristics of the respondents – for example, their age, gender, and native language;
  • the knowledge of the topic that they bring to the language testing situation;
  • their affective schemata (that is, their prior likes and dislikes with regard to assessment); and
  • their language ability.

Research into the impact of these characteristics continues. For example, with regard to the age variable, researchers have suggested that educators revisit this issue and perhaps conceive of new ways to consider the impact of the age variable in assessing language ability (Marinova-Todd, Marshall, & Snow, 2000).

With regard to performance on language measures, it would appear that age interacts with other variables such as attitudes, motivation, the length of exposure to the target language, and the nature and quality of language instruction (see García Mayo & García Lecumberri, 2003).

With regard to language ability, both Bachman and Palmer (1996) and Alderson (2000) detail the many types of knowledge that respondents may need to draw on to perform well on a given item or task:

  • world knowledge and culturally-specific knowledge,
  • knowledge of how the specific grammar works,
  • knowledge of different oral and written text types,
  • knowledge of the subject matter or topic, and
  • knowledge of how to perform well on the given task.

Item-Elicitation Format

The format for item elicitation has to be determined for any given item. An item can have a spoken, written, or visual stimulus, as well as any combination of the three. Thus, while an item or task may ostensibly assess one modality, it may be testing another as well.

So, for example, a subtest referred to as “listening” which has respondents answer oral questions by means of written multiple-choice responses is testing reading as well as listening.

It would be possible to avoid introducing this reading element by having the multiple-choice alternatives presented orally as well. But then the tester would be introducing yet another factor, namely, short-term memory ability, since the respondents would have to remember all the alternatives long enough to make an informed choice.

Item-Response Format

The item-response format can be fixed, structured, or open-ended. Item responses with a fixed format include true/false, multiple-choice, and matching items.

Item responses that call for a structured format include ordering (where respondents are requested to arrange words to make a sentence, and several orders are possible), duplication – both written (such as dictation) and oral (for example, recitation, repetition, mimicry) – identification (explaining the part of speech of a form), and completion.

Those item responses calling for an open-ended format include composition – both written (for example, creative fiction, expository essays) and oral (such as a speech) – as well as other activities, such as free oral response in role-playing situations.
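For readers who keep an item bank electronically, the three-way classification above can be recorded as a simple lookup. The following is only a minimal sketch: the names `ResponseFormat`, `FORMAT_BY_ITEM_TYPE`, and `response_format` are hypothetical, and the mapping covers only the item types named in this section.

```python
from enum import Enum


class ResponseFormat(Enum):
    """The three item-response formats described in the text."""
    FIXED = "fixed"
    STRUCTURED = "structured"
    OPEN_ENDED = "open-ended"


# Illustrative mapping of the item types named in this section
# to their response format; not an exhaustive taxonomy.
FORMAT_BY_ITEM_TYPE = {
    "true/false": ResponseFormat.FIXED,
    "multiple-choice": ResponseFormat.FIXED,
    "matching": ResponseFormat.FIXED,
    "ordering": ResponseFormat.STRUCTURED,
    "duplication": ResponseFormat.STRUCTURED,
    "identification": ResponseFormat.STRUCTURED,
    "completion": ResponseFormat.STRUCTURED,
    "composition": ResponseFormat.OPEN_ENDED,
    "free oral response": ResponseFormat.OPEN_ENDED,
}


def response_format(item_type: str) -> ResponseFormat:
    """Look up the response format for an item type (case-insensitive).

    Raises KeyError for item types not listed in the mapping.
    """
    return FORMAT_BY_ITEM_TYPE[item_type.strip().lower()]
```

A call such as `response_format("Matching")` would return `ResponseFormat.FIXED`; an item type not discussed here raises a `KeyError` rather than being guessed at.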

Grammatical Competence

According to Canale and Swain (1980, p. 29), grammatical competence includes phonology, morphology, syntax, knowledge of lexical items, and semantics, as well as matters of mechanics (spelling, punctuation, capitalization, and handwriting). It would seem that this definition is perhaps too broad for practical purposes.

A truly perplexing issue is determining what constitutes a grammatical error, as well as determining the severity of this error. In other words, will the error stigmatize the speaker? Let us say that we are using a grammatical scale which deals with how acceptably words, phrases, and sentences are formed and pronounced in the respondents' utterances. Let us assume that the focus is on both of the following:

  • clear cases of errors in form, such as the use of the present perfect for an action completed in the past (e.g., "We have had a great time at your house last night."), and
  • matters of style, such as the use of a passive verb form in a context where a native speaker would use the active form (e.g., Question - "What happened to the CD I lent you, Jorge?" Reply - "The CD was lost." vs. "I lost your CD.").

Major grammatical errors might be considered those that either interfere with intelligibility or stigmatize the speaker. Minor errors would be those that neither get in the way of the listener's comprehension nor annoy the listener to any extent.

Thus, getting the tense wrong in the above example, "We have had a great time at your house last night," could be viewed as a minor error. In another case, producing "I don't have what to say" (a direct translation from Hebrew, intended to mean "I really have no excuse") could be considered a major error, since it is not only ungrammatical but could also stigmatize the speaker as rude and unconcerned rather than apologetic.
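The major/minor distinction described above amounts to a simple decision rule once a rater has made the two underlying judgments. The sketch below is illustrative only: the names `ErrorJudgment` and `severity` are hypothetical, and the true/false values for the two examples are assumptions about how a rater might judge them (the text does not say whether the second utterance interferes with intelligibility).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ErrorJudgment:
    """A rater's judgments about one erroneous utterance (hypothetical structure)."""
    utterance: str
    interferes_with_intelligibility: bool
    stigmatizes_speaker: bool


def severity(judgment: ErrorJudgment) -> str:
    """Return 'major' if the error interferes with intelligibility
    or stigmatizes the speaker; otherwise return 'minor'."""
    if judgment.interferes_with_intelligibility or judgment.stigmatizes_speaker:
        return "major"
    return "minor"


# The two examples from the text, with assumed rater judgments.
tense_error = ErrorJudgment(
    utterance="We have had a great time at your house last night.",
    interferes_with_intelligibility=False,
    stigmatizes_speaker=False,
)
transfer_error = ErrorJudgment(
    utterance="I don't have what to say",
    interferes_with_intelligibility=False,  # assumption; not stated in the text
    stigmatizes_speaker=True,
)

print(severity(tense_error))     # minor
print(severity(transfer_error))  # major
```

The rule itself is trivial; the difficult part, as the discussion above suggests, is the human judgment that supplies the two inputs.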

