The following is taken from my 18-month upgrade report, which I hope provides an interesting overview of a subject very close to my current area of research.
Grading, ranking, classifying, and recording student activity are fundamental to formal education wherever institutions or teachers need to know how students are developing, and where students require feedback, either formative or summative, on their progress. Computational approaches to measuring and analysing this activity hold the promise of relieving human effort and dealing with large amounts of data at speed, but they remain controversial and demand a multidisciplinary perspective “involving not only psychometrics and statistics, but also linguistics, English composition, computer science, educational psychology, natural-language analysis, curriculum, and more” (Page, 1966, p. 88).
In the mid-1960s, as part of ‘Project Essay Grade’ (PEG), a number of experiments were undertaken to assess the reliability of machine-based essay grading, adopting a word and punctuation count method of “actuarial optimization” to “simulate the behaviour of qualified judges” (Page, 1966, p. 90). Using 30 features that approximated to values previously identified by human experts, PEG explored essays written by high school students, and found statistically significant associations with human criteria. The highest associations were with average word length (r = 0.51), use of words commonly found in literature (r = -0.48), word count (r = 0.32), and prepositions (r = 0.25) (p. 93). While costly in terms of the time taken to input, these initial experiments were highly successful, showing strong multiple correlation coefficients equivalent to those of human experts (r = 0.71). However, in the face of hostility and suspicion from progressive as well as established interests, and hampered by the rudimentary computing facilities available at the time, further development of the project waned (Wresch, 1993).
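The surface-feature approach described above can be illustrated with a minimal Python sketch. It is not PEG's actual implementation: the feature names, the tiny preposition list, and the helper functions `surface_features`, `pearson_r`, and `feature_correlations` are all hypothetical, chosen only to show how countable properties of an essay might be correlated with human grades.

```python
import math
import re

# Hypothetical, deliberately tiny preposition list; PEG's actual
# word lists are not given in the text above.
PREPOSITIONS = {"of", "in", "on", "at", "by", "for", "with", "to", "from"}


def surface_features(essay):
    """Return a few simple countable features of an essay."""
    words = re.findall(r"[A-Za-z']+", essay.lower())
    n = len(words)
    return {
        "word_count": float(n),
        "avg_word_length": sum(len(w) for w in words) / n if n else 0.0,
        "preposition_count": float(sum(w in PREPOSITIONS for w in words)),
        "comma_count": float(essay.count(",")),
    }


def pearson_r(xs, ys):
    """Pearson correlation coefficient; NaN if either series is constant."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return float("nan")
    return cov / math.sqrt(var_x * var_y)


def feature_correlations(essays, human_scores):
    """Correlate each surface feature with the scores given by human raters."""
    rows = [surface_features(e) for e in essays]
    return {
        name: pearson_r([row[name] for row in rows], human_scores)
        for name in rows[0]
    }
```

Calling `feature_correlations(essays, human_scores)` on a list of essay strings and a matching list of human grades returns one coefficient per feature, analogous to the single-feature associations reported above. Combining the features into one predictor, as the multiple correlation figure implies, would additionally require fitting a multiple regression, which this sketch omits.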
As computers became ubiquitous and as software improved in the decades that followed these initial experiments, PEG was revived and applied to large-scale datasets. These experiments resulted in algorithms that were shown to surpass the reliability of human expert rating (Page, 1994). In recent years the focus of developing automated essay scoring (AES) algorithms has shifted from faculty to the research and development departments of corporations. AES has been successfully marketed, and different systems are currently used to assess students’ writing in professional training, formal education, and Massive Open Online Courses, primarily in the United States (Williamson, 2003; National Council of Teachers of English, 2013; Whithaus, 2015; Balfour, 2013).
While the details of proprietary AES algorithm design are a matter of commercial confidentiality, systems continue to be based on word and punctuation counts and word lists, with the addition of Natural Language Processing techniques (Burstein et al., 1998), Latent Semantic Analysis (Landauer, Foltz and Laham, 1998), and Machine Learning methods (Turnitin, LLC, 2015; McCann Associates, 2016).
Criticism of AES has focused on the inability of machines to recognise or judge the variety of complex elements associated with good writing, the training of humans to mimic computer scoring, over-emphasis on word count and flamboyant language, and the ease with which students can be coached to ‘game the system’ (National Council of Teachers of English, 2013; Perelman, 2014).
However, many of these criticisms are levelled at the widespread application of computational methods to replace human rating, and were clearly addressed early in the development of AES. Page argued that computational approaches are based on established experimental methods that privilege “data concerning behaviour, rather than internal states, and the insistence upon operational definitions, rather than idealistic definitions” (Page, 1969, p. 3), and that machine grading simply replicated the behaviour of human experts. In response to arguments that machines were not capable of judging creativity, Wresch cites Slotnick’s support for the use of AES to indicate deviations from norms and highlight unusual writing, which could then be referred for further human assessment (Wresch, 1993). In recent work exploring the use of automated assessment in MOOCs, Balfour recognises the limitations of AES in assessing unique writing (e.g. individually selected topics, poetry, original research), and suggests the use of computational methods to correct mechanical writing problems, combined with a final, human, peer review (Balfour, 2013).
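The idea of using automated measures to indicate deviations from norms, rather than to assign a final grade, can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `flag_unusual`, the use of a single numeric feature per essay, and the threshold of two standard deviations are all hypothetical choices, not a method described by Slotnick or Wresch.

```python
import statistics


def flag_unusual(feature_values, threshold=2.0):
    """Return indices of essays whose feature value lies more than
    `threshold` population standard deviations from the group mean.

    Flagged essays are candidates for human review, not automatic grades.
    """
    mean = statistics.mean(feature_values)
    sd = statistics.pstdev(feature_values)
    if sd == 0:
        # All essays share the same value, so nothing stands out.
        return []
    return [
        i for i, value in enumerate(feature_values)
        if abs(value - mean) / sd > threshold
    ]
```

For example, given word counts for a class of essays, `flag_unusual(word_counts)` would return the positions of any essays that are unusually long or short, which a teacher could then read in full.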