Measuring Learning Effectiveness: A New Look at No-Significant-Difference Findings

Researchers, instructional designers, and consumers of asynchronous learning networks (ALNs) must be cautious when interpreting the results of media comparison studies. Much of the literature purports to have found no significant difference in learning effectiveness between technology-based and conventional delivery media. This research, though, is largely flawed. In this paper, we first outline the philosophical positions of the opposing sides of an intense debate in the literature as to whether delivery media alone influence learning outcomes. We then select at random several representative media comparison studies to illustrate the inadequacy of their methodologies and conclusions. More important, we derive critical design considerations for those who evaluate or conduct media comparison research. ALN practitioners should not assume that students will learn better from technology delivery systems. Rather, they should adhere to time-tested instructional design strategies, regardless of the medium they choose. Learning effectiveness is a function of effective pedagogical practices. Accordingly, the question for ALN practitioners ought to be: "What combination of instructional strategies and delivery media will best produce the desired learning outcome for the intended audience?"


I. INTRODUCTION
Much of the literature in the field of instructional technology purports to have found no significant difference in learning effectiveness between technology-based and conventional delivery media. In the wake of these findings, and as technology has improved, there is a growing demand for instruction delivered through electronic media, especially through asynchronous learning networks (ALNs). Practitioners and consumers of ALNs need to be aware that much of the research in this field is seriously flawed, rendering many of the conclusions inaccurate or open to debate. Designers of ALNs should not assume that one delivery medium is as effective as another. Such a determination should be made through properly designed and executed research. Regrettably, such rigor has eluded researchers for years.
For much of the twentieth century, educators and behavioral scientists have compared the instructional effectiveness of different delivery media. At first glance, this seems a logical pursuit given the changes in instructional media over the years, particularly in computer-mediated instruction. Although there are many media comparisons in the literature, the most common approach has been to compare traditional (teacher-mediated) learning with technology-based devices as either a substitute for or supplement to the teacher. In the United States prior to World War II, film and radio were the focus of many comparison studies [1]. After the war, the ensuing race for space between the two superpowers led to extensive government funding for research and development. About this time, researchers began to assess not just the equipment, but its capabilities for improving instructional effectiveness [2]. The focus on specific attributes of media has grown with improvements in technology. In recent years, comparisons have shifted their emphasis from television and programmed instruction to personal computers and distance education.
Prior to 1980, there were more descriptive studies than experimental studies comparing computer-delivered instruction with traditional delivery modes. According to Maddux [3], this trend shifted in the 1980s as researchers and educational software developers became interested in establishing cause-and-effect relationships between computer and non-computer delivery modes. Many turned to scientific experimentation as the method of choice. As we will explain shortly, however, many of these studies suffer from problems of internal and external validity because of specification and design error. In the 1990s, Maddux observed a shift toward improvement in the internal design of experimental comparison studies of computer delivery, which he attributed to better-controlled laboratory research. Still, the major flaw in much of the research was its limited generalizability beyond the laboratory.
Throughout these developmental stages of media comparison research, the debate as to whether media alone influence learning outcomes has continued. Leading those who believe that media will never influence learning is Richard E. Clark [4], [5], [6]. He has argued (and has been supported independently by Gagne et al. [7]) that media per se do not influence learning. Rather, "learning is caused by the instructional methods embedded in the media presentation" [6, p. 26]. As cited by Clark [6, p. 23], Gavriel Salomon defined instructional method as "any way to shape information that activates, supplants, or compensates for the cognitive processes necessary for achievement or motivation." Gagne et al. [7] likewise specified the events necessary for learning as comprising internal cognitive information-processing events, such as temporary storage, encoding, and retrieval. They also specified external instructional events, such as gaining attention, drill and practice, and feedback, which are necessary to activate and support the internal process [7, p. 202]. Gagne et al. treat media not as an external instructional event, but rather as a "vehicle for the communications and stimulation that make up instruction" [7, p. 205].
Because Clark contends that media themselves are merely the conveyors of instructional methods and content, he goes on to conclude that they do not directly influence learning in any way. His challenge to the critics embodied his main argument: "We need to ask whether there are other media or another set of media attributes that would yield similar learning gains. If a treatment can be replaced by another treatment with similar results, the cause of the results is in some shared (and uncontrolled) properties of both treatments" [6, p. 22].
Opposing Clark is Robert Kozma of the Center for Technology and Learning. His main argument has been that media and methods are inextricably interconnected. According to Kozma, both media and methods are part of the instructional design. "Media must be designed to give us powerful new methods, and our methods must take appropriate advantage of media's capabilities" [8, p. 16]. Further, Kozma has maintained that learning from media can be thought of as a complementary process within which representations are constructed and procedures performed, sometimes by the learner and sometimes by the medium [8, p. 11]. By means of its technological capabilities, symbol system, and processing capabilities, Kozma argues that a particular medium "can be described in terms of its capability to present certain operations in interaction with learners who are similarly engaged" [8, p. 11].
In our view, Clark has made a stronger case than Kozma. First, Clark's argument that media are primarily vehicles for instructional methods coincides with widely accepted learning theory. Until that theory is revised to include media as an external instructional event, or as an instructional method, the two must be treated as separate yet complementary components of an instructional process. Even more compelling is Clark's argument that many different media attributes may accomplish the same learning goal [6].
Until a single medium becomes necessary for learning, which is not likely in most settings, media will continue to serve only as conveyors of the processes contributing to learning. This is not to say that all media are equal; some are vastly superior to others as conveyors of methods and content. Technology has made great strides and is rapidly outpacing other modes of delivery in terms of cost and efficiency, as we will show later. Nevertheless, as long as there is another medium capable of conveying similar methods of instruction in any given instructional setting, resulting in similar learning outcomes, Clark's argument that the cause of the learning must be something other than the media themselves is compelling indeed. Kozma's retort [9] is that, if two treatments yield a similar outcome, it does not mean that they resulted from the same cause. Again, Clark is not claiming that there is necessarily a single causal factor for learning to occur with different media. He is simply arguing that, as long as similar learning occurs with different media, there must be some cause other than the media themselves.
While this fundamental debate rages on, researchers continue to conduct media comparison studies. Most of these studies report no significant differences in learning effectiveness between electronic and conventional classroom delivery. One of the best known summaries of such research [10] lists several hundred studies claiming to have found such results. Because of the increasing demand for electronic delivery of instruction, and because of the controversial arguments just cited, a closer look at the issues of research design quality, results and interpretations is warranted.

II. MEDIA COMPARISON: FLAWS IN THE EXPERIMENTAL RESEARCH
In a meta-analysis of a randomly selected 30 percent of approximately 500 experimental studies conducted in the 1970s, Clark [5] found many examples of achievement gains for technology-delivered instruction. He found that the impact of technology-delivered instruction was "overestimated due to often uncontrolled for but robust instructional treatments embedded in computer-based instruction (CBI)" [5, p. 249]. Specifically, he noted that many of the instructional methods built into CBI, such as corrective feedback and individual pacing, were not used by teachers [5, p. 250]. In essence, Clark determined that in such confounded research, "causes and effects cannot be unambiguously identified. When causes are confounded, it is still possible to determine the size of the effect of the treatment on the outcome. However, it is not possible to determine which of many uncontrolled, yet correlated, parts of the treatment contributed to the differences" [5, p. 252]. Overall, Clark concluded that about 75 percent of the studies he examined had serious design flaws, 50 percent failed to control for time on task, and only 50 percent controlled for instructional method. There were significant differences in favor of CBI in only two of fifteen studies that controlled for instructional method [5, p. 259].
In a meta-analysis of 12 studies involving computer-aided instruction (CAI) in adult basic and secondary education, Rachal [11] found similar design flaws, including lack of specification of time on task and the size of control and experimental groups. Also, he found that the number of subjects was small, treatment periods were too short, and the studies never assessed the abilities of the individual instructors assigned to teach the control groups. He concluded that, while results in at least one well-controlled study found significant differences favoring CAI over non-CAI delivery, expectations of CAI becoming a miracle cure for adult education programs are unrealistic [11, p. 172].
In a paper dealing with problems of learner control research, Reeves [12] held that "much of the research in the field of CBI is pseudoscience because it fails to live up to the theoretical, definitional, methodological, and/or analytic demands of the paradigm upon which it was based" [12, p. 39]. He recommended that graduate students and researchers "develop an improved understanding of contemporary philosophy of science, a topic largely ignored in many research methodology and statistics courses" [12, p. 43]. In this way, researchers would be exposed to a wider range of options in conducting experimental and non-experimental designs.
In a report on effectiveness of technology in schools, Sivin-Kachala and Bialo [13] summarized the findings of 133 media comparison studies. Sponsored by the Software Publishers Association, the authors argued that "educational technology has demonstrated a significant positive effect on achievement" [13, p. 2]. As with several literature reviews, the authors did not report the adequacy of the research methods in these studies.
In a study of the effects of CAI on math achievement among African-American students seeking admission to teacher education programs, Reglin [14] found significant effectiveness differences between CAI and traditional instruction. However, he failed to control for prior knowledge, ability, learning style, instructor effects, learner familiarity with technology, and method of instruction. Given this lack of adequate controls, there simply is no way to verify that the differences between pre- and post-tests were the result of the media used in the experimental group.
Clements [15] studied enhancement of creativity in computer environments using LOGO software versus a non-LOGO CAI user group. While he, too, found a significant difference in pre- and post-test scores favoring LOGO users, he failed to adequately explain differences in instructional methods other than to say that one group used LOGO and the control group did not. Also, he failed to control for ability, learner style, and learner familiarity with the technology, and students in the control group had less time on task than the LOGO users.
Skinner [16] examined the effects of CBI on achievement by tracking 36 college students who rotated among three instructional delivery treatments: Text/CBI/Guided Tutorial, Text/CBI/Solo, and Text-only. Skinner also used a matched-pair procedure for assignment of subjects into two groups to control for possible confounding effects of different module difficulty levels. He controlled for ability by dividing students into levels of achievement based on previous course performance. Results indicated that test scores were significantly higher for students in CBI modules. Average performance was lower in Text-only modules. Although he controlled for individual effects across all three delivery modes, he failed to control for instructional methods because the Text-only modules lacked many of the methods employed by CBI, such as drill, practice, and feedback. Also, because learners were able to do as much as they chose, time on task varied widely, particularly in the Solo delivery mode.
Gardner, Simmons, and Simpson [17] conducted an experiment from which they determined that combining CAI with hands-on science activities significantly increased elementary students' learning outcomes. They administered the same pre- and post-test to both control and experimental groups. Although they controlled for time on task and instructor effects, they failed to control adequately for method of instruction, prior knowledge, and learner ability and familiarity with technology. As with many other media comparison studies, it was impossible to determine whether the improvement in achievement was the result of the delivery mode or some other uncontrolled factors.
Finally, among those studies reported by Sivin-Kachala and Bialo [13] as proving the positive effects of computers on learning achievement, Funkhouser [18] investigated the influence of problem-solving software on student attitudes about math. He determined that the use of problem-solving software resulted in more positive student attitudes about themselves as learners of math, and about math as a discipline [18, p. 339]. While he used a pre-test to control for prior knowledge and attitudes toward math, there were no other experimental controls, nor was there any control group (all 40 students were administered the treatment). To measure attitudinal changes, he compared scores on the same assessment administered both before and after treatment. With no control group and no random selection or assignment of subjects, this procedure was seriously flawed from the outset. It is impossible to determine whether the changes in student attitudes were caused by the reuse of the same test, by the instructional methods presented in the software, or by any other factor, such as familiarity with the software program.
In a meta-analysis of military training, Orlansky [19] assessed the cost and instructional effectiveness of major innovations to training, such as flight simulators, CBI and maintenance training simulators. Orlansky reviewed 30 studies conducted since 1968, which compared conventional and CBI delivery modes. He selected studies in which the control and experimental groups used nearly identical course content, allowed equivalent time on task between delivery modes, and used similar performance measures [19, p. 17]. According to Orlansky, the only difference was that one group was instructed conventionally and the other group used CBI. Conspicuously absent from his criteria were more explicit controls to avoid confounding in critical factors, such as instructional method, prior knowledge and ability. In 40 courses where CAI was compared with conventional instruction, he found student achievement the same in 24, superior in 15 and inferior in one. He concluded that there were no significant differences between delivery modes. For example, in 12 of 13 maintenance training comparisons, he found that "students trained with simulators achieved the same or better test scores than those trained with actual equipment" [19, p. 26]. This is not surprising in view of the limits that Orlansky placed on control mechanisms in the research he cited.
Based on all of the above, it seems clear that findings of no significant difference are often misinterpreted. For example, Orlansky [19] concluded that "flight simulators, CBI, and maintenance simulators are as effective for training as aircraft, conventional instruction, and actual equipment, respectively" [19, p. 41]. In poorly designed research, this may not be a valid conclusion.
If the researcher has not carefully controlled for the factors most likely to explain variance in student achievement, it becomes less likely that a significant difference will be found between experimental and control groups, because no systematic difference affects only the treatment group. Likewise, if a significant difference is found in poorly designed research, it may be the result of one or more uncontrolled variables, such as a specific method of instruction that was presented to the experimental group only. This common misinterpretation was also addressed in Clark's seminal article on learning from media [4], in which he argued that "no significant difference results simply suggest that the changes in outcome scores (e.g., learning) did not result from any systematic differences in the treatments compared" [4, p. 447].
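Clark's confounding argument can be made concrete with a small simulation (a hypothetical sketch of our own, not drawn from any of the studies cited): if a "CBI" treatment silently bundles an instructional method such as corrective feedback, a naive comparison of delivery media will show a large achievement difference even when the medium's true effect is zero. All parameter values below are illustrative assumptions.

```python
import random
import statistics

random.seed(0)  # reproducible illustration

def simulate_scores(n, medium_effect, method_effect):
    # Assumed "true" model for this sketch: achievement responds to the
    # instructional method, while the delivery medium contributes nothing.
    return [70 + medium_effect + method_effect + random.gauss(0, 10)
            for _ in range(n)]

# Control group: conventional classroom delivery, no corrective feedback.
control = simulate_scores(50, medium_effect=0, method_effect=0)

# Experimental group: "CBI" whose courseware silently bundles corrective
# feedback and individual pacing (an uncontrolled instructional method).
treatment = simulate_scores(50, medium_effect=0, method_effect=10)

def t_stat(a, b):
    # Welch's t statistic for two independent samples.
    va, vb = statistics.variance(a), statistics.variance(b)
    return (statistics.mean(b) - statistics.mean(a)) / (va / len(a) + vb / len(b)) ** 0.5

t = t_stat(control, treatment)
print(f"t = {t:.2f}")  # large, despite the medium's true effect being zero
```

The observed effect is real, but because the method is perfectly confounded with the medium, attributing it to the medium is exactly the misinterpretation Clark describes: the design cannot tell which of the correlated parts of the treatment produced the difference.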

III. CONCLUSION
The outlook for improving the design of media comparison studies is bleak. Even if a legitimate scientific model could be designed to properly control for each independent variable, its usefulness for predicting learning outcomes would, in all likelihood, be extremely limited, because the researcher would have to impose artificial controls to produce such a result. For those who choose to persist in comparing delivery media, the evaluation criteria in Table 1 provide a helpful list of variables to consider when evaluating or designing such research. Any media comparison study or design that fails to control for the variables listed in Table 1 is suspect and may produce inconclusive results. This paper has argued that ALN practitioners must be cautious when interpreting the results of media comparison studies. ALN practitioners should not assume that students will learn better from technology delivery systems. Rather, they should adhere to time-tested instructional design strategies, regardless of the medium they choose. It is widely accepted that learning effectiveness is a function of effective pedagogical practices. Accordingly, the question for researchers, instructional designers, and consumers of ALNs ought to be: "What combination of instructional strategies and delivery media will best produce the desired learning outcome for the intended audience?"