Using Data to Write Better Questions

Concurrent Session 6


Brief Abstract

This presentation will show how to use data, in the form of Item Discrimination (ID), to create fairer—and ultimately better—multiple-choice questions. The audience will come away not only knowing the importance of ID but also having a specific tool they can use when creating their own multiple-choice questions.



Chris Lele is the GRE and SAT Curriculum Manager (and vocabulary wizard) at Magoosh Online Test Prep. In his time at Magoosh, he has inspired countless students across the globe, turning what is otherwise a daunting experience into an opportunity for learning, growth, and fun. Some of his students have even gone on to get near-perfect scores. Chris is also very popular on the internet. His GRE channel on YouTube has over 10 million views.
After 20 years of teaching high school math and science, I started at Magoosh. My title is GMAT Curriculum Manager. I record videos, write blogs, and create questions and explanations for our students. I am also a long-term meditation practitioner, and I run a meditation center in Berkeley, where I lead a weekly sangha. My dharma talks are posted at

Extended Abstract

At Magoosh, we have spent years perfecting our in-house test prep questions. One might imagine us endlessly discussing the nuances of a question, tinkering and re-tinkering, and discussing yet some more, until we finally get it right. While we do discuss and tinker quite a bit, this overlooks perhaps the single biggest contributor to the question creation process: data.

When your content is delivered through books and in-person classes, hard data can be difficult to come by. But teaching online by its very nature generates data, and that data can be put to good use to improve the questions you ask students. At Magoosh, we’ve been able to use the data our growing student base generates to help us write good questions. By “good” I don’t mean that a multiple-choice question was cleverly devised (though hopefully a few were) but that the question is one that strong students answer correctly for the right reasons and that less strong students miss because the question is legitimately difficult. By contrast, a poor question is one that strong students miss because of potential misinterpretation and that weak students answer correctly for the wrong reasons. In other words, either the “right” answer was debatable or one of the “wrong” answers was debatably correct (or a mixture thereof).

To ensure that a multiple-choice question is valid, we use a metric called Item Discrimination (ID). Item discrimination is the correlation between students’ performance on the test as a whole and their performance on an individual question. In practice, we figure out what percent of “high scoring students” (the top third of students) answer a specific question correctly and what percent of “low scoring students” (the bottom third) answer this same question correctly. If 50% of “high scoring students” and 50% of “low scoring students” answer the question correctly, the question fails to discriminate between these two groups, hence the word “discrimination” (“item” is test prep speak for a question).

Ideally, a question should tell you who the top third of students are and who the bottom third are. For example, if 75% of “high scoring students” answer a question correctly and 25% of “low scoring students” answer it correctly, then you know this is a strong question. In fact, a question receives an exact score, the Item Discrimination (ID) score, which is found by subtracting the percent of “low scoring students” who answer the question correctly from the percent of “high scoring students” who answer it correctly. The ID score for this question, which is expressed in decimal form (to echo the idea of correlation), would therefore be .75 - .25 = .50.
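
The subtraction described above can be sketched in a few lines of Python. This is only an illustration, not the tool used at Magoosh: the function name `item_discrimination`, its parameters, and the 12-student data set are all hypothetical, and the sketch assumes the class splits cleanly into thirds by overall score.

```python
def item_discrimination(total_scores, item_correct):
    """Return the ID score (p_high - p_low) for one question.

    total_scores: each student's overall test score.
    item_correct: True/False per student, in the same order, for this question.
    """
    if len(total_scores) != len(item_correct):
        raise ValueError("inputs must have one entry per student")
    group_size = len(total_scores) // 3  # size of the top and bottom thirds
    if group_size == 0:
        raise ValueError("need at least three students")

    # Rank students from highest to lowest overall score.
    ranked = sorted(zip(total_scores, item_correct),
                    key=lambda pair: pair[0], reverse=True)
    high = [correct for _, correct in ranked[:group_size]]
    low = [correct for _, correct in ranked[-group_size:]]

    p_high = sum(high) / group_size  # share of high scorers who got it right
    p_low = sum(low) / group_size    # share of low scorers who got it right
    return p_high - p_low


# Hypothetical class of 12 students (made-up data, for illustration only).
totals = [95, 91, 88, 84, 79, 75, 72, 68, 60, 55, 51, 47]
answered = [True, True, True, False, True, False,
            True, False, False, True, False, False]
print(item_discrimination(totals, answered))  # .75 - .25, prints 0.5
```

Here three of the four top scorers and one of the four bottom scorers answer correctly, which reproduces the .75 - .25 = .50 example. Students with tied overall scores at a group boundary would need a tie-breaking rule that this sketch does not address.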

Historically, the creators of standardized tests (ETS, GMAC, etc.) have used .40 as a benchmark for a quality question. These corporations also have the advantage of enormous pools of data, which they collect in “experimental sections,” test sections in which the questions do not count towards a person’s score but are used to determine the ID of questions in the developmental phase. But to determine whether a question is valid, you do not need mountains of data; even a pool of a couple hundred students can give you a strong sense of how valid a question likely is.

Typically, a question that has an ID score between .20 and .40 is an “okay” question, one that might need to be modified. A few of these questions on a test will likely not compromise the overall quality of the test. It is the questions that fall below .20 that might be problematic, not merely because they are “bad” questions but because they can hurt a student’s performance—and ultimately impair student learning. That is, students might be rewarded for faulty logic while those who apply correct logic might be penalized. Of course, a valid question doesn’t only hinge on logic: an ambiguously worded question can lead to two possible answers. Whatever the case may be, when students are penalized for choosing a valid answer, it can make them highly discouraged, sometimes leading them to question thought processes that are actually sound.
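
As a rough illustration, the bands mentioned above (.40 as the benchmark, .20 to .40 as “okay,” below .20 as potentially problematic) could be encoded as follows. The function name `classify_id`, the label strings, and the treatment of scores that fall exactly on a boundary are assumptions made for this sketch, not part of any standard.

```python
def classify_id(id_score):
    """Map an ID score to one of the rough quality bands described in the text."""
    if id_score >= 0.40:
        return "meets benchmark"            # at or above the .40 benchmark
    if id_score >= 0.20:
        return "okay, may need modifying"   # between .20 and .40
    return "possibly problematic"           # below .20


for score in (0.50, 0.30, 0.05):
    print(score, classify_id(score))
```

A label from a function like this is only a prompt to look more closely at a question; it does not replace reading the question itself.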

Whether you are writing online questions that might be seen by thousands or questions for a single classroom, it is dangerous to trust your intuition that your questions are always valid. ID, of course, should not replace your judgment but should act in tandem with it, so that you can create high-quality questions that are fair and foster learning for whoever your audience may be.

Today I am going to show you how to evaluate a few questions. We will take a look at the ID of a question, try to figure out what is wrong with it, and then take a look at two possible rewrites. Next, we will look at the ID of these two rewrites and come up with several hypotheses for why the IDs differ. Then, we will apply the insights gained from these questions to an entirely new question and make a few tweaks.

Next, we will work with a spreadsheet that I’ve created (and which you can use with your classroom), using a mock classroom and a mock test to determine the ID of a few questions.
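
The spreadsheet itself is not reproduced here. For readers who prefer code, a mock classroom of the same kind might look like the Python sketch below. The response grid, the function name `discrimination_by_question`, and the choice to rank students by their total on this same mock test are all assumptions made for illustration.

```python
# Rows are students, columns are questions; 1 = correct, 0 = incorrect.
# Made-up data for a mock class of nine students and a five-question test.
responses = [
    [1, 1, 1, 1, 1],
    [1, 1, 1, 1, 0],
    [1, 1, 1, 1, 0],
    [1, 1, 1, 0, 0],
    [1, 1, 0, 1, 0],
    [1, 0, 1, 0, 1],
    [1, 0, 0, 0, 1],
    [1, 0, 0, 0, 0],
    [0, 0, 0, 0, 1],
]


def discrimination_by_question(matrix):
    """Return one ID score per question, rounded to two decimals."""
    group_size = len(matrix) // 3
    if group_size == 0:
        raise ValueError("need at least three students")
    # Rank students by total score on the mock test, highest first.
    ranked = sorted(matrix, key=sum, reverse=True)
    high, low = ranked[:group_size], ranked[-group_size:]

    scores = []
    for q in range(len(matrix[0])):
        p_high = sum(row[q] for row in high) / group_size
        p_low = sum(row[q] for row in low) / group_size
        scores.append(round(p_high - p_low, 2))
    return scores


print(discrimination_by_question(responses))
# prints [0.33, 1.0, 1.0, 1.0, -0.33]
```

In this mock data the last question has a negative ID: the bottom third answers it correctly more often than the top third, which is the kind of result that would prompt a closer look at the question's wording.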

Finally, I will also demonstrate the limitations of ID: how a question that the data suggests is sound can actually be flawed or, as is often the case with very difficult questions, how a question that has a mediocre ID can in fact be perfectly valid. The hope is to show that the creation of content is an interplay of data and our own expertise.