High Stakes Testing—The Science

A while back I was challenged by reader Jack Shalom regarding student testing. In response to “Timeliness and Grading,” Jack wrote:

IMO, there are exactly two reasons to give a test:

1. To sort students.

2. To help students learn more.

I believe reason #1 is the main reason tests (in particular, standardized tests) are given. We know this because for most standardized tests, teachers and students get no feedback at all about what items have been missed and why. Certainly by the time results of any kind are received, the student has moved on to a new teacher.

If the purpose of a test is to learn more, then it needs to be designed as such, and teachers need to treat them as such. Why, then, would there then need to be a score? When was the last time your tennis coach gave you a precise grade on your backhand? Would that have helped you play tennis better?

The same can be said for end of term grades.

At the time I promised Jack that I would respond to his challenge.

The opportunity presented itself afresh in an op-ed in the Edmonton Journal by my colleague Dr. Jacqueline Leighton.

Jacqueline wrote:

The science behind high-stakes testing is based on giving all students the opportunity to show what they know under the same conditions by writing the same test — a test that will count significantly toward their final grade and has been developed to be reliable and valid in providing information about student learning and mastery. This is important because students working with different teachers, and completing different assignments and assessments during the year can end up with the same teacher-awarded grade at the end of the year — say, 85 per cent — but actually possess very different levels of preparedness, learning and mastery. Committees of content, technical and assessment specialists, composed of highly experienced educators and scientists, create high-stakes tests. These committees, using the latest educational, scientific and technical methods for test design and development, make sure that (a) the content material taught in classrooms is adequately covered in the high-stakes test, (b) new test items are reviewed and field-tested before they appear on the final operational test to make sure the wording is understandable, does not bias or offend students, and conforms to the technical standards of previous items, (c) test items are double and triple-checked using a variety of technical analyses to ensure that the results are consistent within a pattern or trend — for example, students who respond correctly to one item are also responding correctly to items measuring the same material; this is done to ensure that items are not underestimating or overestimating what students know, and (d) test results are constantly monitored so that the test continues to measure the appropriate content and skills in students who have learned the material well and achieved mastery. Advancements in the science of testing are continuously integrated into the design and development of high-stakes tests.

If we read this uncharitably, perhaps Jack is right. Leighton says that the test does, indeed, sort students, but, she adds, it does it well. I’ll grant this as trivially true. On the other hand, well-constructed exams provide a second level of assurance that the student has met some standard or other, that the student is competent. Further, they set the standard of competence in a publicly understood way. Yes, we are separating those who meet the standard from those who do not, but what alternative do we have?

I think that there is another important function of these examinations: they provide evidence of system-wide performance. That is, it is crucial that a publicly sponsored and funded system of education provide curriculum that is attainable by the students, that the resources appropriately support the curriculum, that teachers teach the curriculum to students and that students learn the curriculum. Any individual student can have an especially fortunate or unfortunate day. The test does not guarantee that the student has learned (or not learned) the material; but it shouldn’t be too gruesomely off. Because randomness and luck travel in all directions, a complete class of students should have a test average that is not too far off an accurate measurement of all their learning. This is important information for teachers, schools, and jurisdictions.
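The point about luck averaging out can be illustrated with a small simulation. This is only a sketch under simplifying assumptions that are not in the argument above: every student in the class has the same true score, and each observed score is that true score plus independent, zero-mean random noise. The function name, class size, and noise level are all made up for illustration.

```python
import random
from statistics import mean


def simulate(class_size=30, true_score=70.0, noise_sd=8.0, trials=2000, seed=1):
    """Compare the typical error of one student's score with that of a class average.

    Assumption (hypothetical): observed score = true score + Gaussian noise,
    with the noise independent from student to student.
    """
    rng = random.Random(seed)
    individual_errors = []
    class_errors = []
    for _ in range(trials):
        observed = [true_score + rng.gauss(0, noise_sd) for _ in range(class_size)]
        # Error if we judge learning from a single student's score.
        individual_errors.append(abs(observed[0] - true_score))
        # Error if we judge learning from the class average.
        class_errors.append(abs(mean(observed) - true_score))
    return mean(individual_errors), mean(class_errors)


individual_error, class_error = simulate()
print(f"typical error for one student: {individual_error:.2f} points")
print(f"typical error for class mean:  {class_error:.2f} points")
```

Under these assumptions the error of the class average shrinks roughly with the square root of the class size. Averaging only cancels random error, though; it does nothing about an error that pushes every student's score in the same direction.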

All the above is predicated on the sort of high-quality, curriculum-referenced test that Jacqueline Leighton is talking about. Test construction is a highly technical business, and it should not be taken lightly. No matter how high you make the stakes, a badly constructed test provides little to no valuable information about individuals or groups. Similarly, generic ability tests cannot measure the curricular achievement of students.
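To give a flavour of the technical side, here is a rough sketch of one common internal-consistency statistic, Cronbach's alpha, which speaks to the kind of check described in point (c): whether students who get one item right also tend to get related items right. The op-ed does not name this statistic, so its use here is an assumption, and the response data below are invented purely for illustration.

```python
from statistics import pvariance


def cronbach_alpha(responses):
    """Cronbach's alpha for a list of per-student item scores (1 = correct, 0 = incorrect).

    alpha = k / (k - 1) * (1 - sum of item variances / variance of total scores)
    """
    k = len(responses[0])  # number of items on the test
    if k < 2:
        raise ValueError("need at least two items")
    item_vars = [pvariance([student[i] for student in responses]) for i in range(k)]
    total_var = pvariance([sum(student) for student in responses])
    if total_var == 0:
        raise ValueError("all students have the same total score")
    return (k / (k - 1)) * (1 - sum(item_vars) / total_var)


# Hypothetical data: each row is a student, each column an item.
# Here the items agree perfectly with one another.
consistent = [[1, 1, 1, 1], [1, 1, 1, 1], [0, 0, 0, 0],
              [0, 0, 0, 0], [1, 1, 1, 1], [0, 0, 0, 0]]
# Here success on one item says little about success on the others.
mixed = [[1, 0, 1, 0], [0, 1, 0, 1], [1, 1, 0, 0],
         [0, 0, 1, 1], [1, 1, 1, 0], [0, 0, 0, 1]]

print("alpha, consistent items:", cronbach_alpha(consistent))
print("alpha, mixed items:     ", cronbach_alpha(mixed))
```

A value near 1 suggests the items are measuring the same thing; low or negative values suggest they are not. Real test development relies on much richer analyses than this single number.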

So if you’re going to have a high stakes examination, make sure it’s professionally constructed, validated and relevant to the curriculum taught and learned.

Leighton and Gierl, eds.

Oh, and if you want some fun reading in your spare time, here’s a book Jacquie co-edited and in which I co-authored Chapter 3.

5 thoughts on “High Stakes Testing—The Science”

  1. Sorry, I just saw this post now, John. Thanks for addressing some of my thoughts.

    I’m afraid, though, that Ms. Leighton’s comments are woefully inadequate and reside somewhere in the realm of fantasy rather than reality.

    First she says: “students working with different teachers, and completing different assignments and assessments during the year can end up with the same teacher-awarded grade at the end of the year — say, 85 per cent — but actually possess very different levels of preparedness, learning and mastery.”

    But this is exactly true of two students who receive an 85 on a standardized test. In fact we even know *less* about these two students than before–we know *nothing* about their preparation, their consistency, persistence, character, areas of high ability, obstacles faced and obstacles overcome. In short, the student has been erased in favor of some numerical ranking. A ranking that totally obscures precisely the fact that an 85 for one student could mean something very different for another. One student knows nothing about logarithms while the other knows nothing about quadratic equations. But they are both 85 students. Standardization means precisely that there will be loss of information about the individual data points we call students.

    But more incredible are her assumptions a, b, c, d.

    We have many years now (at least in the US) of reality to check against.

    a) In fact, tests are very often *not* aligned with classroom practice.

    b) In fact, tests are riddled with mistakes and sloppily worded questions. Testmakers are like any other industry and they seek to cut costs. In the US, Pearson is trying to make it a crime to release the questions to their tests, even after the tests are given, because they have *repeatedly* been embarrassed by the terrible quality of the questions. They have tried to force districts to buy computer equipment for the administration of their tests, and then the networks fail citywide and the tests have to be postponed.

    c) Technical analyses for internal reliability are silly in a timed, scored test as these are. These are not personality tests. A student may *not* necessarily answer two questions the same, even though they appear to test the same content, if they appear in different contexts within the test. Do different answers mean that the student has not learned the concept? Should the student get no credit, half credit or full credit for that concept?

    d) “Test results are constantly monitored so that the test continues to measure the appropriate content and skills in students who have learned the material well and achieved mastery.”
    No, in fact experience shows just the opposite–rather than the test being a reflection of classroom practice, the high stakes test *drives* the classroom practice, and forces desperate teachers and students to focus all their energies on adjusting to the educational misconceptions of the test makers. The curriculum becomes dry, classroom time is spent on sussing out the test, and anything that cannot be tested in a standardized way is thrown out the window.

    Instead of living only in theory, it’s important to test theory against what actual practice has been. Testmakers have put out a call in the US for temps at $12/hr to score the enormous numbers of standardized tests that are now being given. Yes, a student’s English essay is being scored by a $12/hr temp–and a bachelor’s degree is not even a necessary requirement.

    It’s all crap, and the proof is in the results. Leighton can write as many books as she likes from her ivory tower about quality tests, but we live in a real (capitalist) world, where private companies under a profit motive try to keep up with the (created) demand for their product. There is no quality, and there can be no quality. It’s all lip service. These tests provide zero information about a student that the student’s teacher could not tell you with far more accuracy; and they provide no information about a teacher that a teacher’s principal could not tell you with far more accuracy.


  2. Thanks for the reply, Jack. I’ll provide a bit of context, as the situation Dr. Leighton is discussing is rather different from the scenarios you discuss.

    The news issue was a change of weighting on Alberta High School Diploma exams. These exams had been worth 50% of a student’s final mark in Grade 12 English (or Français), Social Studies, Mathematics, Chemistry, Biology, Physics or (general) Science; it was recently reduced to a 30% weighting.

    In Alberta, the exams are made by teachers under secondment to the Ministry of Education. Working with psychometricians, these teachers create exam items, field test them in schools and then implement the items in the exams. Fresh exams are created twice each school year, with a small number of “anchor” items retained for cross-test comparison. All items except for the anchor items are released.

    Exams are partially machine-scored. All human-scored items (e.g. essays) are scored by certificated Alberta teachers who are currently teaching high school and who have taught the course in question for more than two years. These teachers are paid to mark exams for a fixed number of days during their scheduled vacation time.

    Our system is often criticized as expensive. It certainly seems preferable to the situation you describe.


  3. John, glad to hear that the exams are of the kind you describe. In New York State, we have a similar set of exams called the Regents given in all the major subjects. They are written and scored by teachers. A student must pass five of them in order to graduate high school. For a long time they were the only standardized tests students took. Unfortunately, in the last decade that situation has rapidly changed.

    We are now seeing the imposition of more and more senseless standardized tests created by Pearson and McGraw-Hill. Where the Regents are effectively exit exams, there are now many, many more additional standardized exams every year from first grade on. It’s really alarming. With the current Canadian government, I think you must be on the lookout for attempts at privatization at every step, including test-making. It has been the go-to answer here for US politicians seeking to improve school quality–throw another test at them to solve the problem.

