Item Analysis: Evaluating Multiple Choice Questions

CVM faculty receive information about the quality of their tests and quizzes in several ways:

They may look at student performance data on particular tasks, activities, quizzes, or tests in Carmen.
They may be notified of item analysis generated when they administer Scantron tests.
They may review a “Test and Question Report” from ExamSoft, a secure-testing application available to all faculty and currently used across first-year core courses.

The latter two are specifically designed to validate exam reliability, consistency, and quality.

These formal and informal processes allow us to create strong assignments and assessments, refine components of those assessments over time, and align the way we assess students with the learning outcomes identified.

CVM faculty have been urged to use item analysis to identify questions that may be “mis-keyed” (meaning the wrong answer was designated as correct for grading purposes); vague or poorly constructed; or not representative of the material covered in a given lecture, series of lectures, or course. They then decide whether to throw out a question or tag it for further revision.

The Office of Teaching & Learning believes the most significant value of these reports comes when faculty use them to identify especially strong questions and build a robust question bank over time. These reports also help to improve content and pedagogy.

According to an excellent primer authored by ExamSoft’s director of academic strategy, these data points quantify exam quality:

Item Difficulty Index (p-value): The proportion of exam takers who answered the item correctly. Values range from 0.00 to 1.00; a high value indicates the item was mastered or was less difficult, and a low value indicates a more difficult item. Look for extremes to identify questions you might want to review.
Upper Difficulty Index (Upper 27%): The proportion of the top-scoring 27% of exam takers who answered the item correctly. When the value approaches 1.00, your high performers scored well on the item. At 0.50 or below, half or more of your high performers missed the question, so you may want to review it.
Lower Difficulty Index (Lower 27%): The same proportion for the lowest-scoring 27% of exam takers. When the value approaches 1.00, the item may be less difficult.
Discrimination Index: This index compares the upper and lower 27% by subtracting the lower difficulty index from the upper difficulty index. A value of 0.30 or above indicates good discrimination. Values from 0.10 to 0.29 indicate fair discrimination, so you may want to look at those items. A negative value means low performers outscored high performers on the item, and the item is considered flawed.
Point Bi-serial Correlation Coefficient: Values range from -1.00 to 1.00. The closer the value is to 1.00, the more strongly exam takers who did well overall also did well on this particular question. You will want to review items with negative values.
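
As a rough illustration of how these values can be derived from raw results, the sketch below computes each index for a single item in Python. It assumes each student's result is available as a 0/1 item score plus a total exam score. The function name `item_statistics`, the way the 27% groups are sized (rounded, with at least one student), and the choice to correlate against the total score are assumptions made for illustration, not ExamSoft's or Scantron's documented method, so reported values may differ slightly.

```python
from statistics import mean, pstdev


def item_statistics(item_scores, total_scores, group_fraction=0.27):
    """Compute item-analysis indices for one question (illustrative sketch).

    item_scores:  1 if the student answered the item correctly, else 0.
    total_scores: each student's overall exam score, in the same order.
    """
    n = len(item_scores)
    if n == 0 or n != len(total_scores):
        raise ValueError("need one item score and one total score per student")

    # Item difficulty (p-value): proportion of all students who answered correctly.
    difficulty = mean(item_scores)

    # Rank students by total score, highest first, then take the top and bottom groups.
    ranked = sorted(zip(total_scores, item_scores), key=lambda pair: pair[0], reverse=True)
    group_size = max(1, round(n * group_fraction))  # assumption: rounded group size
    upper = mean([score for _, score in ranked[:group_size]])
    lower = mean([score for _, score in ranked[-group_size:]])

    # Discrimination index: upper-group proportion minus lower-group proportion.
    discrimination = upper - lower

    # Point-biserial: Pearson correlation between the 0/1 item score and the total score.
    sd_item, sd_total = pstdev(item_scores), pstdev(total_scores)
    if sd_item == 0 or sd_total == 0:
        point_biserial = None  # undefined when every student has the same score
    else:
        mean_item, mean_total = mean(item_scores), mean(total_scores)
        covariance = mean(
            [(i - mean_item) * (t - mean_total) for i, t in zip(item_scores, total_scores)]
        )
        point_biserial = covariance / (sd_item * sd_total)

    return {
        "difficulty": difficulty,
        "upper": upper,
        "lower": lower,
        "discrimination": discrimination,
        "point_biserial": point_biserial,
    }


# Hypothetical class of ten students (made-up data, for illustration only).
example = item_statistics(
    item_scores=[1, 1, 1, 0, 1, 0, 0, 1, 0, 0],
    total_scores=[95, 90, 88, 85, 80, 70, 65, 60, 55, 50],
)
```

With very small classes the upper and lower groups contain only a few students, so the group-based indices can swing widely from one exam to the next.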

It is critical that faculty understand that none of these data points alone provides enough information to make a decision about item quality. Moreover, while Teaching & Learning and Professional Program staff may consult on questions or exams, individual faculty and faculty teams should make final decisions based on what they know has happened in their classrooms.
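
One way to keep all of the indicators in view at once is to collect review prompts rather than verdicts. The sketch below is a hypothetical helper (`review_flags` is not part of ExamSoft or Scantron) that applies the thresholds listed above and returns the reasons an item might deserve a second look. The band for discrimination values between 0.00 and 0.10 is not named in the list above, so flagging it is an assumption.

```python
def review_flags(upper, discrimination, point_biserial):
    """Return a list of reasons to review an item; an empty list means no prompts.

    upper:          proportion of the top 27% who answered correctly.
    discrimination: upper-group proportion minus lower-group proportion.
    point_biserial: item/total correlation, or None if it could not be computed.
    """
    flags = []

    # High performers: at 0.50 or below, half or more of them missed the item.
    if upper <= 0.5:
        flags.append("half or more of the top 27% missed this item")

    # Discrimination bands from the list above.
    if discrimination < 0:
        flags.append("negative discrimination: item is considered flawed")
    elif discrimination < 0.10:
        # Assumption: this band is not classified in the source list.
        flags.append("discrimination below 0.10")
    elif discrimination < 0.30:
        flags.append("fair discrimination (0.10 to 0.29)")

    # Point-biserial: negative values call for review.
    if point_biserial is not None and point_biserial < 0:
        flags.append("negative point-biserial correlation")

    return flags


# Hypothetical item: top students struggled and the item discriminates backwards.
prompts = review_flags(upper=0.4, discrimination=-0.1, point_biserial=-0.2)
```

The output is a list of prompts for a human reviewer. It does not decide whether an item stays in the question bank.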

When you review questions, ask yourself whether the question is clear, whether it is aligned with the outcomes for a lecture or course, and whether it offers an appropriate number of strong distractors (incorrect answer choices). (Try to keep distractors to five or fewer.) Multiple choice items that force students to think critically before selecting a response may be harder and may better distinguish between high- and lower-performing students.

If a question has desirable difficulty and high performers answered it correctly, we recommend keeping it in the question bank. Obviously flawed questions can be thrown out, and mis-keyed questions can be corrected.
