Adapting for Scalability : Automating the Video Assessment of Instructional Learning

Although scalable programs, such as online courses, have the potential to reach broad audiences, they may pose challenges to evaluating learners’ knowledge and skills. Automated scoring offers a possible solution. In the current paper, we describe the process of creating and testing an automated means of scoring a validated measure of teachers’ observational skills, known as the Video Assessment of Instructional Learning (VAIL). Findings show that automated VAIL scores were consistently correlated with scores assigned by the hand scoring system. In addition, the automated VAIL replicated intervention effects found in the hand scoring system. The automated scoring technique appears to offer an efficient and reliable assessment. This study may offer additional insight into how to utilize similar techniques in other large-scale programs and interventions.


Introduction
Implementing large-scale evidence-based programs offers a promising means of reaching broad audiences (Franks & Schroder, 2013).Massively Open Online Courses demonstrate oneway educational content is being disseminated widely (Vale & Littlejohn, 2014).The use of online courses has increased among various groups of professionals, including teachers, to advance learners' technical knowledge and skills (Gill, 2011; U.S. Department of Education, National Center for Education Statistics, 2016).Although recent shifts towards large-scale programs are promising, they pose challenges to assessment, particularly the assessment of learners' skills.Skill assessments are an integral part of most educational programs and are often used to understand individuals' growth trajectories (Biggs & Tang, 2011).Automated coding systems offer a possible means of assessing learners' skills in larger scale programs (Williamson, Xi, & Bryer, 2012).
Given the increased offering of online professional development to teachers (Gill, 2011;Means, Toyama, Murphy, Bakia, & Jones, 2009), the present study sought to adapt a validated measure of teachers' observational skills of effective teacher-child interactions, the Video Assessment of Instructional Learning, or VAIL (Jamil, Sabol, Hamre, & Pianta, 2015), to be automatically scored, rather than manually hand scored, and thus applicable for large-scale interventions.More specifically, the goals of the study were to determine whether the automated VAIL scoring system related to the previously validated hand scoring system and if it was sensitive to intervention effects in a previous professional development program.To achieve these aims, we first correlated automated scores with hand scores and then compared intervention and control group means to determine if the automated VAIL replicated results previously found using the hand scored VAIL.These results are presented along with a discussion of how automated measures may be useful in other large-scale programs.

Literature Review
As the availability of online coursework grows, course designers and instructors are faced with the challenge of determining how to accurately and efficiently assess students (Palloff & Pratt, 2008).Assessments provide valuable information regarding motivation and progress, and can be used to provide feedback to both learners and instructors.Assessing learners' knowledge and skills from the beginning to the end of a course can also determine the efficacy of a program and may suggest modifications that need to be made (Boston, 2002).In general, assessment questions may be open-ended (short-answer or essays) or closed-ended (true/false or multiple choice.)Open-ended questions provide more thorough information regarding learners' mastery because they require learners to generate responses, rather than simply identifying correct answers from a prescribed list of options (Foddy, 1993).Although, open-ended questions may provide more useful information, they are more arduous to score which may be difficult in courses with large numbers of enrollees (Landauer, Laham, & Foltz, 2003).
Assessing open-ended items typically relies on the knowledge and expertise of human raters, such as instructors or teaching assistants, to manually score and make personal judgments for each response.This technique is not always feasible in large-scale programs because it is time consuming, and thus, costly.Interventions that require humans to score large quantities of responses may take inordinate amounts of time and deplete resources (Landauer et al., 2003;Williamson et al., 2012).Additionally, placing such potentially burdensome demands on human raters may increase the likelihood of fatigue and error (Ramineni & Williamson, 2013).In response to the practical limitations associated with manual hand scoring, automated scoring techniques may offer a possible solution.

Automated Scoring Techniques
Automated scoring systems, in which responses are scored by machines, have been used to assess short-answer responses, essays, and spoken responses.For example, the Educational Testing Service (ETS) has utilized automated essay scoring (AES) for high-stakes assessments, such as the GMAT or GRE, for over 10 years (Williamson et al., 2012).Various AES systems exist, which all require large numbers of essay samples to base their scoring and feedback on.These systems tend to provide both holistic and specific feedback, although the exact content varies by system.Overall, AES systems have been found to be valid and reliable (Dikli, 2006).Automated scoring techniques are not without controversy, however.In particular, AES has been criticized for oversimplifying the assessment of writing to focus on rote elements, such as word count or complexity of word choice, rather than less easily quantifiable aspects, such as thoughtfulness of response or writing to a specific audience (Condon, 2013;Perelman, 2014).The application of automated scoring techniques for short answer responses is less highly contested, especially when such techniques are not used to holistically rate the quality of writing with high-stakes implications.However, it has been suggested that automated short answer scoring is more arduous to create than AES, because AES tends to focus on grammar and mechanics while automated short answer scoring tends to focus on content (Brew & Leacock, 2013).Brew and Leacock (2013) further suggested that automated short answer scoring systems are underutilized because "it is currently impossible to buy an off-the-shelf short answer scoring engine that will work for all items" (p.151).
Despite the aforementioned challenges, automated short answer systems are best suited to measure explicit concepts, facts, or skills (Brew & Leacock, 2013).Perhaps the most wellknown automated system for short answer responses is known as "C-rater" which was developed by ETS (Leacock & Chodorow, 2003).The validity of C-rater has been evaluated by comparing automated scores with hand scores.Leacock and Chodorow (2003) found that automated scores matched scores assigned by human raters 84% of the time.Similar findings have been replicated elsewhere (Burstein, Chodorow, & Leacock, 2003), suggesting that automated short answer systems can offer a valid means of assessment.Despite the fact that the technology for creating and using similar automated systems has existed for at least a decade, few have been developed and disseminated across fields.Building off previous work, this study focused on the VAIL assessment, described in more detail below.The VAIL relies on short answer responses, rather than essays, and appears to be particularly conducive to adaptation into an automated system.

Video Assessment of Instructional Learning (VAIL)
The VAIL is grounded in social learning theory, the notion that learning occurs largely through observation, as well as evidence suggesting that observational skills are valuable in developing expertise (Bandura, 1986;Jamil et al., 2015;Miller, 2011).The VAIL assesses observational skill by first asking teachers to watch a short video clip of an actual classroom and then identify and describe the effective teaching behaviors they observed.The process of "seeing" is an integral part of intentional teaching.One must be able to objectively identify and assess the effectiveness of specific practices in the classroom, and subsequently reflect on and modify personal teaching practices as necessary (Hamre, Downer, Jamil, & Pianta, 2012).
The VAIL has been validated (Jamil et al., 2015) and utilized in studies of both in-service (Hamre, Pianta, et al., 2012) and pre-service (Wiens, Hessberg, LoCasale-Crouch, & DeCoster, 2013) educators, offering a potentially valuable tool for teacher training and professional development.In a sample of pre-service teachers, demographic and programmatic characteristics did not consistently predict VAIL scores, suggesting that variation in VAIL scores were due largely to individual differences rather than group membership (Wiens et al., 2013).Furthermore, Jamil et al. (2015) found that the VAIL related to teachers' observed practices, suggesting that teachers who were more adept at identifying effective teacher-child interactions were also more likely to implement high quality teaching practices.
Furthermore, the National Center for Research on Early Childhood Education (NCRECE) Professional Development study, a randomized controlled evaluation of two forms of professional development, coursework and coaching, utilized the hand-scored VAIL as a measure of teachers' observational skills.At the end of the intervention, teachers enrolled in coursework were found to have significantly improved observational skills than teachers not enrolled in the course, and these improvements translated into meaningful changes in teachers' practice (Hamre, Pianta, et al., 2012;Downer et al., in press).Put differently, the VAIL was sensitive to intervention effects in the NCRECE course; this assessment tool seemed to detect important material teachers learned in the course that ultimately led to demonstrated improvements in practice.On the contrary, teachers' observational skills did not significantly change for teachers enrolled in the coaching intervention (Downer et al., in press).Although the reason for this is unclear, it may have been the result of the differences in content and delivery.For instance, it is possible that the VAIL may be more proximal to the content of the course, which focused on observing other teachers' practices, as opposed to the coaching intervention, which focused mostly on observing teachers' own personal practices.
In summary, although the VAIL measure is especially pertinent to current trends in education, the previously utilized hand scoring techniques may limit the scalability of the measure.As a result, the development of an automated means of scoring the VAIL was warranted.Automated scoring systems are indefatigable, systematic, and reliable, and offer a sustainable alternative to hand scoring.Nevertheless, it is necessary to ascertain that automated scoring systems are valid and useful, which were the aims of the present study.

Current Study
The present study explored the extent to which the VAIL, an assessment of teachers' observational skills of teacher-child interactions, could be adapted into an automated scoring system.Specifically, the goals were to determine whether the automated VAIL scoring system related to the previously validated hand scoring system; and was sensitive to intervention effects in the NCRECE professional development study.Building off of previous work utilizing handscores from the same study (Hamre, Pianta, et al., 2012;Downer et al., in press), we anticipated that the automated VAIL would be sensitive to intervention effects in the NCRECE course, but not the coaching intervention.

Method Study Overview and Participants
This study utilized data from the NCRECE randomized, controlled evaluation of two forms of professional development designed to improve prekindergarten teachers' interactions with children.The NCRECE study was designed to evaluate scalable approaches to early childhood professional development (Pianta, Hamre, & Hadden, 2012).Teachers were placed randomly into treatment or control groups for the first phase, a 14 week (one semester) in-person course, and their group placements were then re-randomized for the second phase, the year-long MyTeachingPartner web-mediated coaching intervention.
The present study utilizes data on all teachers across all conditions who completed the post-intervention survey (n = 175).Seventy-two (41.1%) teachers received the course in phase I and 88 (50.3%) received coaching in phase II.Most teachers (95.9%) were female, and 48.5% of all teachers were African American, 34.5% White, 10.5% Hispanic, 4.1% multi-racial, and 2.3% Asian.Roughly half (50.9%) of teachers taught in Head Start centers and 37.1% worked in public schools.On average, teachers were 41.54 years old (SD = 10.41) with 10.93 years of experience teaching pre-kindergarten (SD = 7.64).In terms of their degree attainment, 34.5% of teachers held a bachelor's degree, 33.3% had less than a bachelor's degree, and 32.2% held an advanced degree.
It is important to note that the current sample represents a fraction (43.5%) of the entire NCRECE sample (n = 402).This reduced rate of completion is likely due to the fact that the post-intervention survey was optional.A series of t-tests were conducted to test whether those who completed the survey differed from those who chose not to complete the survey.Teachers did not significantly differ in terms of gender, age, and study condition; however, teachers who completed the survey were more likely to hold at least a bachelor's degree (M = .65,SD = .47)than those who did not complete the survey (M = .58,SD = .49;t (378) = -1.5, p <.01.)Teachers who completed the survey were also less likely to be Latino (M = .10,SD = .30)than those who did not complete the survey (M = .19,SD = .39;t (378) = 2.31, p <.001.) The post intervention survey was sent via email and completed online.A portion of the survey included the Video Assessment of Interactions and Learning (VAIL) which is based on the Classroom Assessment Scoring System, or CLASS (Pianta, La Paro, & Hamre, 2008) a commonly used observational tool of teacher-child interactions.There are multiple forms of the VAIL focused on different elements of teaching; the present study focused on the VAIL designed to capture teachers' ability to detect aspects of Emotional Support, which includes creating a positive classroom climate, being sensitive to students' needs, and having regard for students' perspectives.To complete the VAIL, teachers viewed a two-and-a-half-minute video clip that depicted a teacher engaging a student in a conversation about her weekend.Teachers were instructed to watch the video as many times as they wished, but were encouraged not to spend more than 10 minutes on the video.Then teachers were asked to name up to five strategies the teacher used to support the student's social and emotional development.Finally, teachers were asked to, list a specific, behavioral example of each strategy from the clip.

Measures
Hand Scoring System.All VAIL responses were hand scored based on a previously established standardized rubric that aligns with the CLASS (Pianta, La Paro, et al., 2008).For each response, trained coders assessed four elements: (1) strategy, if the learner identified a behavioral indicator consistent with the CLASS; (2) example, if the learner provides a specific behavioral description from the video of the teacher demonstrating a strategy; (3) breadth, the specific CLASS indicator that the strategy is most consistent with; and (4) match, whether the strategy and example pair is representative of the same CLASS indicator.
In keeping with the behavioral indicators for the Emotional Support domain of CLASS (Pianta, La Paro, et al., 2008) Collectively, this information was then used to calculate, per teacher, the total number of: correct strategies, correct examples, unique breadth scores, and strategy-example matches.Because teachers could enter up to five possible responses, the range for each of the four aforementioned categories was 0-5.The four categories rendered a Cronbach's alpha coefficient of .79.An excerpt from the coding manual is shown in Figure 1 along with further description of how the manual was used to assess the four components (strategy, example, breadth, and match) that were described above.Automated Scoring System.The VAIL hand coding manual was used to build a "dictionary" for the automated system.The dictionary was organized in a format that could be utilized in automated scoring software, specifically the Linguistic Inquiry Word Count (LIWC) software (Pennebaker, Booth, & Francis, 2007).In an attempt to replicate the hand scoring system, key words and phrases for the automated dictionary were pulled directly from the hand coding manual.For instance, according to the VAIL manual the teacher demonstrates regard for the student's perspective by allowing choice, so the term "choic*" was included in the automated dictionary.As part of the LIWC software, any letters in the word that appeared after the asterisk were disregarded (therefore "choice" and "choices" would both be considered correct.)Additionally, notes were collected from trained VAIL coders regarding language used in teacher responses that varied from the coding manual, but was still considered correct (i.e., synonyms).Consistent with the previous example, words such as "choose", "choosing" and "option" were included as synonyms of "choice".

"Strategies" "Examples" Figure 1. Excerpt from the VAIL Coding Manual
The LIWC software program allows the user to upload a dictionary organized broadly by a designated number of categories with various words comprising each category.As a result, our dictionary contained a total of 24 categories that represented the 12 possible breadth categories for strategies and, similarly, the 12 possible categories for examples.Thus, further building on "Match" if strategy and example categories are the same.

"Breadth
Categories" (Circled) the example outlined in Figure 1, in the automated dictionary, the strategy category for the "Support for Leadership and Autonomy" would be comprised of words such as "choice" and "responsibility".The corresponding example category would be comprised of phrases, such as "choose the marker" or "say to mommy."As previously mentioned, each teacher had up to five VAIL responses; to maintain consistency with the hand scoring technique, each response (one response per text file) was independently run through the software.The process of actually running all responses through the software was done simultaneously (all text files were run at the same time), allowing all responses to be scored in a matter of minutes.
LIWC provides output indicating the proportion of words that fall into certain categories, which, in this case, was used to assess whether or not the words used fell into any of the 24 previously mentioned categories.This information was then used to create: (1) strategy scores, represented as "correct" if a number other than zero appeared in any of the 12 strategy categories (suggesting that a word in the prescribed dictionary was used) (2) example scores, similarly, were "correct" if a number other than zero appeared in any of the 12 example categories (3) breadth scores, the specific category for which the strategy was located (4) match scores, if the strategy category matched the example category.Similar to the hand scoring technique, results ranged from 0-5.The four components rendered a Cronbach's alpha coefficient of .81.In addition, a mean score was also created by averaging across the four components in order to maintain consistency with previous work (Jamil et al., 2015).

Results
First, descriptive statistics (Table 1) were examined to compare the various components (i.e., strategy, example, breadth, and match scores) of the automated scoring system and the hand scoring system.The means, ranges, and overall distributions proved to be very similar.The automated scoring system did not consistently yield higher or lower results than the hand scoring system.Next, the four components of the automated system were correlated with the four components of the hand scoring system (Table 2).The bivariate correlations were consistently high across the four components, ranging from .72 to .87,suggesting that the automated scoring system was consistently replicating the hand scoring.High correlations were also observed among the strategy, breadth, and match components for both the automated and hand-scoring systems, which is not surprising given that breadth and match scores are only assigned when a strategy is "correct", rendering high correlations among these elements.To further unpack these associations, Table 3 provides the percentage of cases in agreement and disagreement (by one, two and three or more points) among the automated and hand-scored systems.The automated and hand scoring systems agreed in roughly two thirds of all cases.Last, we explored the extent to which the automated system could be used to detect intervention effects.As previously mentioned, the first intervention phase was a course, and the second phase was a coaching intervention.As a result, we tested the extent to which the automated scoring system could be used to detect intervention effects by comparing the means for the treatment and control groups separately by phase (Table 4).For the course phase, three of the four components, as well as the mean score of the automated VAIL were found to be significantly different by intervention group.In particular, there were significant differences in scores for the treatment and control groups for strategy, breadth, match, and overall scores, such that the treatment group was significantly more likely to receive credit for these VAIL components than the control group.Associated effect sizes were moderately large, ranging from .54 to .60 (Table 4).There was no significant difference between groups for example scores.Consistent with findings from prior research using the hand-scoring method, there were no significant mean differences for any of the VAIL scores for the coaching phase.As shown in Table 4, the automated and hand-scored results yielded relatively similar effect sizes.In sum, these results suggest that the automated VAIL system replicated findings from the hand-scoring method.

Discussion
Automated scoring systems offer a systematic and sustainable alternative to hand scoring techniques, which may be particularly valuable in large-scale interventions (Shermis & Burstein, 2013).The purpose of this study was to explore the extent to which an automated VAIL scoring system was able to replicate scores produced by the previously validated hand scoring system.Findings show that the automated VAIL scores were strongly related to the hand coded VAIL scores, suggesting that the automated VAIL consistently replicated previous scores.Similarly, the automated VAIL system detected similar intervention effects as the hand-scoring method.Taken together, these findings suggest that the automated VAIL appears to be a valid and reliable means of measurement.
It is worth noting that the strength of the associations between automated and handscored systems varied by what was assessed.More specifically, strategy and example scores had the strongest associations while match and breadth scores had the weakest.This is not surprising given that the automated dictionary was specifically built around the language of strategies and examples.In the VAIL, breadth and match are only scored if the strategy is found to be "correct", as neither breadth nor match are relevant if the strategy is incorrect.This has implications in our study because exact matches of breadth and match among automated and hand coding systems required an exact match on strategy as well.This likely accounts for the slightly higher discrepancy among the two systems for breath and match scores, although the correlations are still high.
The automated system replicated roughly two thirds of hand scores, which is slightly lower than previous work (Leacock & Chodorow, 2003).Of course, exact replication of the hand scored system is irrelevant in building an automated system given that the hand scores contain some degree of human error.Instead, we sought to develop a better system for measurement that would address issues in scalability.
Utilizing the automated VAIL or similar measures may be particularly advantageous in large-scale interventions, such as online courses, which are pervasive in various fields (U.S.Department of Education, National Center for Education Statistics, 2016).Automated measures allow intervention designers and instructors to incorporate open-ended questions into learner assessments, rather than simply relying on the more restrictive, close-ended alternative.Previously, open-ended questions posed logistical challenges for scoring in large-scale programs; however, automated systems offer a feasible alternative to manual hand scoring (Shermis & Burstein, 2013).Aside from open-ended questions being a better metric of student learning, this solution is also cost-effective and efficient (Williamson et al., 2012).The time spent designing and using an automated system is front-loaded, meaning more time is spent in the construction of the system than in the actual process of running the system to obtain scores (Brew & Leacock, 2013).By comparison, the hand scoring systems require more time to be spent maintaining highly reliable coders.Although human coders are likely to become more efficient over time, the possibility of making a mistake is ever-present (Ramineni & Williamson, 2013).
Although the VAIL is most applicable to teacher education, the findings from this study may be relevant to other fields.For this reason, it is useful to consider why the VAIL was particularly conducive to adaptation and how this information can be used in other studies.In this study, correct responses were finite and language-specific.More specifically, there were a fixed number of correct strategies that teachers could identify, and thus, the building of the automated dictionary encompassed words taken directly from the pre-constructed manual, as well as a limited number of associated synonyms.This was due in large part to the fact that VAIL responses were prompted by the viewing of a short video clip followed several shortanswer questions.In other words, respondents were not asked to generate examples of Positive Climate from their potentially limitless repertoires of personal experience; instead, they were asked to identify concrete, specific examples from the video.Furthermore, there is less controversy surrounding the use of automated systems for scoring short answer responses, as compared to essays.AES has been criticized for oversimplifying the assessment to abstract concepts that may be difficult to quantify (Condon, 2013;Perelman, 2014).
Course designers in various fields could use relevant video, short writing excerpts, or images to elicit explicit concepts or facts conducive to automated scoring.Consistent with the use of videos to demonstrate how to do (or not do) something, videos could also be used to assess a learners' abilities to effectively see a skill or concept in action (Hamre, Downer, et al., 2012).For instance, kinesiology students could watch a video of a person exercising and analyze his gait or specific muscles that are being activated.Assuming that the video could be used to generate a finite number of correct responses, this information lends itself well to automated scoring, which would be particularly useful if a large number of students were enrolled in the course.In other fields, such as English or history, it may be most relevant to utilize writing excerpts or images to test learners' understanding of specific content.Ultimately, as online coursework continues to be scaled up, across fields, it is necessary to think creatively about how to effectively assess learning.Automated scoring systems offer one possible technique, especially for short-answer assessments that have a limited number of correct answers.

Limitations
In the present study, VAIL scores for the CLASS domain of interest were only collected at the conclusion of the intervention study.As a result, it was not possible to examine changes in VAIL scores from the beginning to the end of the intervention.In addition, the timing and delivery of the VAIL at the end of the post-intervention survey may have contributed to a lowered response rate.Those who self-selected to complete this survey proved to be different than those who chose not to complete the survey in terms of ethnicity and educational attainment.

Future Directions
Future work should continue to develop and test automated systems across fields.Given the availability of various automated software programs and the ecological validity of using automated systems, it is important for researchers to share their methods in developing and testing automated systems.The process of automating assessment scoring should be driven by the characteristics of the assessment questions and answers.Therefore, sharing techniques for automating various types of assessments, and establishing validity, can advance the use of automated scoring systems across fields.
In terms of the automated VAIL specifically, it would be beneficial to conduct similar inquiries using different samples of teachers enrolled in different interventions.Future studies should continue to utilize different samples to test the validity of the automated VAIL to verify that the current results consistently replicate.In light of recent findings that suggest automated scoring techniques differentially benefit native English speakers (Reilly et al., 2016), it would be advantageous to test the automated VAIL, and similar measures, in linguistically diverse samples to ensure validity.As previously mentioned the present study considered in-service teachers who were enrolled in coursework and/or coaching.Given the utility of the automated VAIL in largescale interventions, future inquiries should also examine the extent to which the automated VAIL can be used to detect intervention effects using other intervention designs, such as online coursework.Additionally, based on previous work suggesting the VAIL is valuable in assessing teacher skill in both in-service and pre-service teachers (Wiens et al., 2013), it would also be informative to utilize samples of pre-service, as well as in-service teachers.

Conclusion & Implications
The design and delivery of scalable interventions and programs is of central importance to many fields utilizing online coursework.In teacher education, a variety of professional development programs for teachers have focused on training teachers to engage in positive interactions with children (Bierman et al., 2008;Domitrovich et al., 2009;Pianta, Mashburn, et al., 2008).However, such large-scale interventions pose challenges to assessment, especially the assessment of more complex open-ended responses.As a result, the automated VAIL appears to offer a valid and practically useful means of assessing teachers' skills in observing teacher-child interactions with implications for pre-service training, higher education, and professional development.
, VAIL responses could fall into twelve possible breadth categories, including: (1) Relationships: being in close physical proximity, engaging in shared activity, matching the child's affect, and engaging in social conversation with the student; (2) Positive Affect, smiling, laughing, showing enthusiasm; (3) Positive Communication: demonstrating verbal affection (4) Respect: maintaining eye contact, having a warm and calm voice, cooperating and sharing; (5) Awareness: anticipating problems, noticing difficulties; (6) Responsiveness: acknowledging emotions, providing comfort and individualized support; (7) Addressing Problems: helping in a timely and effective manner and resolving problems; (8) Student Comfort: the student seeks support, freely participates, and takes risks; (9) Flexibility and Student Focus: showing flexibility, incorporating the student's ideas, following the student's lead; (10) Support for Autonomy and Leadership: allowing choice, allowing students to lead lessons, giving students responsibility; (11) Student Expression: encouraging student talk and eliciting ideas; and (12) Restriction of Movement: allowing movement, not being rigid.

Table 2
Correlations among Automated and Hand Scoring System

Table 4
Mean Differences and Effect Sizes by Treatment Condition& Intervention Phase