The effect of formative assessment on summative assessment in an online testing environment for a blended course

Concurrent Session 4
Streamed Session


Brief Abstract

A large, blended, international program (N=25,000) is examined for the effects of online formative assessment (FA) on summative assessment (SA), along with other variables. Overall, FA scores only moderately predict SA scores, and the other variables show universally low effects. Findings inform the implementation of online testing as a formative assessment strategy.

Extended Abstract


Formative assessment can take many forms, yet its general purpose is to provide the student with direct feedback on his or her progress toward meeting learning objectives. Online self-assessment quizzes and tests can facilitate individualized learning and, through the testing effect, can increase scores on summative assessments. However, the testing effect can also negatively affect summative assessments and enduring understanding of content. The specific ways in which practice exams, and the testing effect in general, are harnessed can predict the effects of a formative assessment strategy.


The purpose of the study is to determine if the current program can be improved with respect to its use of formative assessment strategies. Findings can inform assessment strategies in other blended and online learning environments.

Overview of Program of Study

The program under study is a curriculum designed to teach computer networking basics in educational, residential, and commercial contexts. As an academic-vendor partnership, the program includes instructional resources for face-to-face delivery as well as an online practice and testing platform. This study focuses on the online testing components of the program. The program was developed over nearly twenty years. The result of that evolution was three variations of the curriculum, each with four courses.


While the program overall served 345,600 students, 25,000 of these students completed a four-course program and are thus used for this analysis. Approximately 75% of this subset of students were from K-12 schools (e.g., high school students in vocational and technology curricula) and 25% from adult and higher education institutions in academic, vocational, or job-retraining programs.

Blended Learning

This program represents a common application of blended learning. For this program, the instruction occurred in a classroom while the assessments, which included practice quizzes, chapter tests, and summative course tests, were delivered online. Blended learning, therefore, was an intentional pedagogical strategy to affect student performance. Findings from this study inform how educators might use practice quizzes and chapter tests as formative assessments in a blended learning environment.

Literature Review

Formative assessment is: “a planned process in which assessment-elicited evidence…is used by teachers to adjust their ongoing instructional procedures or by students to adjust their current learning tactics” (Popham, 2014, p. 290).


Within formative assessment are many tools that educators can use to prepare students for mastery of content and follow-up summative assessments. Quizzes and practice tests, when administered purposefully for formative effect, are two such tools. There are two different interpretations of “testing effect” in educational assessment. The first is a positive benefit to learning in which long-term remembering of content is increased through practice tests with feedback to the learner.


The second interpretation concerns the negative aspects of the testing effect, or rather the negative side of its positive uses, in which short-term remembering of content is artificially “spiked” by practice tests taken frequently or close in time to the formal, summative measure. High scores on practice tests taken “urgently” negatively affect long-term remembering of content; that is, the learner “crammed” for the test at the expense of enduring understandings (Rawson, Vaughn, & Carpenter, 2015).


The program in this study sought to harness the positive benefits of the testing effect as a component of a formative assessment strategy.

Research Questions

The primary research question is whether this method of formative assessment (FA) has any impact on student learning, as measured by the summative assessment (SA). Other independent variables include: geopolitical region of this international program; language of the student; language of the course; course completion (in a four-course sequence); level of the institution offering the program; and educational purpose of the institution. The data set included programs from the Americas, Europe, Africa, Asia, and the Pacific.


This is a quantitative study that examines archival data from two years of the online testing component of this blended program. Analyses include descriptive statistics and a simple linear regression model to see how performance on FA alone might explain performance on SA. A multiple regression (using ANCOVA) was conducted to examine the effects of five independent variables on the variation in SA scores, controlling for FA (the covariate).
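The two analyses can be sketched as follows. This is a minimal illustration on simulated data, not the study's actual analysis: the column names (FA, SA, region), effect sizes, and three-level region factor are all hypothetical, and the ANCOVA is expressed as an OLS model with dummy-coded group terms and FA as the covariate.

```python
import numpy as np

# --- Simulated, hypothetical data (NOT the study's archival data) ---
rng = np.random.default_rng(0)
n = 500
FA = rng.normal(85, 8, n)             # hypothetical formative (chapter-test) means
region = rng.integers(0, 3, n)        # hypothetical categorical factor, coded 0..2
SA = 0.5 * FA + rng.normal(35, 6, n)  # hypothetical summative scores

# Simple linear regression: how much of SA does FA alone explain?
slope, intercept = np.polyfit(FA, SA, 1)
pred = slope * FA + intercept
r2_simple = 1 - np.sum((SA - pred) ** 2) / np.sum((SA - SA.mean()) ** 2)

# ANCOVA-style model: group effects on SA, controlling for FA (the covariate).
dummies = np.eye(3)[region][:, 1:]    # dummy-code the factor, first level as reference
X = np.column_stack([np.ones(n), FA, dummies])
beta, *_ = np.linalg.lstsq(X, SA, rcond=None)
pred2 = X @ beta
r2_ancova = 1 - np.sum((SA - pred2) ** 2) / np.sum((SA - SA.mean()) ** 2)

print(round(r2_simple, 3), round(r2_ancova, 3))
```

Because the simple model is nested in the ANCOVA-style model, the latter's R² can only match or exceed the former's; the study's question is whether the added factors explain meaningfully more variance than FA alone.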


Because of the program’s use of practice and chapter tests as its formative strategy, the testing effect is the primary construct through which the findings can be interpreted. Chapter tests were generated from a pool of items aligned to the learning constructs and the summative assessments. Theoretically, results on progressive formative assessments should yield similar results on summative assessments if their administration is implemented in a manner that harnesses the positive aspects of the testing effect.


For the newest version of the curriculum, and only for program completers, findings indicate that FA scores moderately predict SA scores using a simple linear regression. Overall, however, the beneficial aspects of the testing effect, as a formative assessment strategy and measurable construct, were not evident in these data using the statistical means possible for the data set as provided. A multiple regression (using ANCOVA) of the independent variables indicated some significant effects, but with universally low-to-zero effect sizes and some violations of the homogeneity-of-regression-slopes assumption, no valid interpretations can be made.

Discussions, Conclusions, Interpretations

Limitations of the study include a lack of data about the frequency and outcomes of practice quizzes and chapter tests, as well as a lack of performance assessment data to corroborate the objective summative tests. There were also statistical violations of homogeneity of regression slopes for the ANCOVA, which indicate that the model may not be appropriate for the data. Furthermore, the significant findings are likely due to the very large sample size at the expense of effect size. While these are significant limitations, they can be remedied in follow-up studies and in the professional development of blended learning programs and of faculty who seek to use online testing as a formative assessment strategy.

Summary of Findings

Overall, these findings show that the positive benefits of the testing effect were not evident in the data provided and with the analyses possible (and with the significant limitations noted above). The highly (and negatively) skewed means for chapter tests hint at mastery of content, yet all three curricula show lower summative assessment means. Since the practice quiz and chapter test items were drawn from the same item pool designed, with appropriate and documented rigor, for the summative, end-of-course assessment, the data show that longer-term (until the end of the course) gains in retention of the material did not occur for most students. This does not mean that students did not learn the material, for their means were in the “A” and “B” letter grade ranges.


Simply put, if a testing effect were possible in this program, the SA means should have been roughly equal to the FA means. This was not the case. While the sources of the variance cannot be definitively explained because of limitations in the data provided, the differences were not likely the result of the measured variables’ attenuation of the testing effect, which is focused on gains in summative scores (or minimally gains between practice quizzes and chapter tests), not absolute scores (interaction notwithstanding).
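The comparison described above can be sketched as a simple check of the FA-SA mean gap with a standardized effect size. The numbers below are hypothetical placeholders (chapter-test means in the "A" range, summative means in the "B" range), not the study's values:

```python
import numpy as np

# Hypothetical score distributions, NOT the study's data
rng = np.random.default_rng(1)
n = 1000
FA = rng.normal(90, 5, n)   # hypothetical chapter-test (formative) scores
SA = rng.normal(82, 8, n)   # hypothetical end-of-course (summative) scores

# Mean gap and Cohen's d: a positive, non-trivial d indicates the
# formative gains did not carry through to the summative measure.
diff = FA.mean() - SA.mean()
pooled_sd = np.sqrt((FA.var(ddof=1) + SA.var(ddof=1)) / 2)
cohens_d = diff / pooled_sd

print(round(diff, 2), round(cohens_d, 2))
```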


One plausible (but not measurable with these data) explanation for an attenuation between chapter tests and summative tests is that the chapter test scores were likely “spiked” by the open, unlimited practice tests completed close to the scored testing event. This afforded short-term recall of content just in time for the chapter tests. Another explanation is that the items on the chapter tests did not sufficiently challenge the students, compared to the summative tests. Yet the two tests pull from a rigorously developed pool of equivalent items. One other (and untested) explanation is testing fatigue.

Recommendations for Online Assessment

1. Harness the testing effect for active recall. Provide defined lag times before the learner can take the chapter quiz. Provide direct pointers to the readings and activities to learn the material missed on the quizzes. This will structure the practice testing to purposefully require reviews before attempting the next chapter test or summative test.

2. Increase the stakes of the chapter tests relative to content scope and difficulty. The value of the chapter tests must be high enough to cause the learner to seek to make use of the formative feedback from the quizzes, yet not so high as to attenuate the formative effect and the learner’s motivation.

3. Limit last-minute practicing so as not to spike chapter test scores (and thus create a false sense of confidence).

Recommendations for Continued Research

Despite the limitations discussed above, the chosen methodology is well suited for future analyses of large-scale, online testing environments. This study in particular sets the stage for multiple follow-up studies using the same sample but with the data cut differently to address the methodological limitations noted above. A few examples of these modifications include: individual scores for each chapter test, enforced modifications of the practice methods described above, and segmentation of the data for each curriculum by institution type and level.