Automatic Essay Scoring

Papa bloggt! 255/365/Dennis Skley ©2014/cc by-nd 2.0

The following is taken from my 18-month upgrade report; I hope it provides an interesting overview of a subject very close to my current area of research.

Grading, ranking, classifying, and recording student activity are fundamental activities in formal education wherever institutions or teachers need to know how students are developing, and where students require feedback, either formative or summative, on their progress. Computational approaches to measuring and analysing this activity hold the promise of relieving human effort and dealing with large amounts of data at speed, but automated assessment is a controversial topic that demands a multidisciplinary perspective “involving not only psychometrics and statistics, but also linguistics, English composition, computer science, educational psychology, natural-language analysis, curriculum, and more” (Page, 1966, p. 88).

In the mid-1960s, as part of ‘Project Essay Grade’ (PEG), a number of experiments to assess the reliability of machine-based essay grading were undertaken, adopting a word and punctuation count method of “actuarial optimization” to “simulate the behaviour of qualified judges” (Page, 1966, p. 90). Using 30 features that approximated values previously identified by human experts, PEG explored essays written by high school students and found statistically significant associations with human criteria. The highest associations were with average word length (r = 0.51), use of words commonly found in literature (r = -0.48), word count (r = 0.32), and prepositions (r = 0.25) (p. 93). While costly in terms of the time taken to input the essays, these initial experiments were highly successful, showing multiple correlation coefficients equivalent to those of human experts (r = 0.71). In the face of hostility and suspicion from progressive as well as established interests, and hampered by the rudimentary computing facilities available at the time, further development of the project waned (Wresch, 1993).
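The “actuarial” idea behind PEG can be sketched in a few lines: regress human grades on surface features by least squares, then use the fitted weights to score a new essay. This is only an illustration of the general approach, not PEG itself; the features, values, and grades below are all invented.

```python
import numpy as np

# Hypothetical training data: surface features extracted from essays
# (word count, average word length, preposition count) plus a human grade.
# All values are invented for illustration.
X = np.array([
    [250, 4.1, 20],
    [400, 4.8, 35],
    [120, 3.9, 10],
    [310, 4.5, 28],
    [500, 5.0, 45],
], dtype=float)
y = np.array([62, 78, 50, 70, 85], dtype=float)  # human grades

# Fit weights by ordinary least squares (with an intercept column),
# "optimizing" the proxy features against the human judgements.
A = np.hstack([X, np.ones((X.shape[0], 1))])
weights, *_ = np.linalg.lstsq(A, y, rcond=None)

def predict(features):
    """Predict a grade for a new essay from its surface features."""
    return float(np.dot(np.append(features, 1.0), weights))

print(round(predict([300, 4.4, 25]), 1))
```

The appeal, then as now, is that once the weights are fitted, scoring a new essay is just a feature count and a dot product.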

As computers became ubiquitous and as software improved in the decades that followed these initial experiments, PEG was revived and applied to large-scale datasets. These experiments resulted in algorithms that were shown to surpass the reliability of human expert rating (Page, 1994). In recent years the focus of developing automated essay scoring (AES) algorithms has shifted from university faculty to the research and development departments of corporations. AES has been successfully marketed, and different systems are currently used to assess students’ writing in professional training, formal education, and Massive Open Online Courses, primarily in the United States (Williamson, 2003; National Council of Teachers of English, 2013; Whithaus, 2015; Balfour, 2013).

While the details of proprietary AES algorithm design are a matter of commercial confidentiality, systems continue to be based on word and punctuation counts and word lists, with the addition of Natural Language Processing techniques (Burstein et al., 1998), Latent Semantic Analysis (Landauer, Foltz and Laham, 1998), and Machine Learning methods (Turnitin, LLC, 2015; McCann Associates, 2016).
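The core of Latent Semantic Analysis is small enough to sketch: build a term-document matrix, reduce it with a truncated SVD, and compare documents by cosine similarity in the latent space. The toy corpus and the choice of two latent dimensions below are my own invention, purely to show the mechanics.

```python
import numpy as np

# Toy corpus: on-topic reference texts, one off-topic text, and a new
# "essay" to compare (all sentences invented for illustration).
docs = [
    "the experiment shows the method works well",
    "results of the experiment support the method",
    "my dog likes long walks on the beach",
    "the method works and the results are good",  # new essay
]
vocab = sorted({w for d in docs for w in d.split()})
counts = np.array([[d.split().count(w) for w in vocab] for d in docs], dtype=float)

# LSA: truncated SVD of the term-document matrix, then cosine
# similarity between documents in the reduced space.
U, s, Vt = np.linalg.svd(counts, full_matrices=False)
k = 2  # number of latent dimensions (arbitrary here)
latent = U[:, :k] * s[:k]  # one k-dimensional vector per document

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

sim_on_topic = cosine(latent[3], latent[0])
sim_off_topic = cosine(latent[3], latent[2])
print(sim_on_topic > sim_off_topic)
```

In an AES setting the same machinery lets a system ask how semantically close a new essay is to a set of previously graded reference essays, without matching exact words.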

Controversy and criticism of AES have focused on the inability of machines to recognise or judge the variety of complex elements associated with good writing, the training of humans to mimic computer scoring, over-emphasis on word count and flamboyant language, and the ease with which students can be coached to ‘game the system’ (National Council of Teachers of English, 2013; Perelman, 2014).

However, many of these criticisms are levelled at the wide-spread application of computational methods to replace human rating, criticisms which were clearly addressed early in the development of AES. Page argued that computational approaches are based on established experimental methods that privilege “data concerning behaviour, rather than internal states, and the insistence upon operational definitions, rather than idealistic definitions” (Page, 1969, p. 3), and that machine grading simply replicated the behaviour of human experts. In response to arguments that machines were not capable of judging creativity, Wresch cites Slotnick’s support for the use of AES to indicate deviations from norms and highlight unusual writing, which could then be referred for further human assessment (Wresch, 1993). In recent work exploring the use of automated assessment in MOOCs, while recognising the limitations of AES in assessing unique writing (e.g. individually selected topics, poetry, original research), Balfour suggests the use of computational methods to correct mechanical writing problems, combined with a final, human, peer review (Balfour, 2013).


  • Balfour, S. P. (2013) ‘Assessing writing in MOOCS: Automated essay scoring and Calibrated Peer Review’. In Research & Practice in Assessment, 8, pp. 40–48.
  • Burstein, J., Braden-Harder, L., Chodorow, M., Hua, S., Kaplan, B., Kukich, K., Lu, C., Nolan, J., Rock, D. and Wolff, S. (1998) Computer analysis of essay content for automated score prediction. Report for the Educational Testing Service.
  • Landauer, T. K., Foltz, P. W. and Laham, D. (1998) ‘An introduction to latent semantic analysis’. In Discourse Processes, 25(2 & 3), pp. 259–284.
  • McCann Associates (2016) IntelliMetric®. [Online] Available at: (Accessed: 1 March 2016).
  • National Council of Teachers of English (2013). NCTE Position Statement on Machine Scoring. [Online] Available at: (Accessed: 1 March 2016).
  • Page, E. B. (1966) ‘Grading Essays by Computer: Progress Report’. In Invitational Conference on Testing Problems, 29 October, 1966. New York: Educational Testing Service, pp. 87–100.
  • Page, E. B. (1994) ‘Computer grading of student prose, using modern concepts and software’. In The Journal of Experimental Education. Taylor & Francis, 62(2), pp. 127–142.
  • Perelman, L. (2014) ‘When “the state of the art” is counting words’. In Assessing Writing. Elsevier Inc., 21, pp. 104–111.
  • Turnitin, LLC (2015) Turnitin Scoring Engine FAQ. [Online] Available at: (Accessed: 1 March 2016).
  • Whithaus, C. (2015) Algorithms at the seam: machines reading humans + / -, Media Commons. [Online] Available at: (Accessed: 1 March 2016).
  • Williamson, M. M. (2003) ‘Validity of automated scoring: Prologue for a continuing discussion of machine scoring student writing’. In Journal of Writing Assessment, 1(2), pp. 85–104.
  • Wresch, W. (1993) ‘The Imminence of Grading Essays by Computer – 25 Years later’. In Computers and Composition, 10(2), pp. 45–58.

Video production

While I spend most of my time working on my PhD research, because of my previous experience as a media producer I’m occasionally asked to produce videos that support the work of the Web Science Institute. Last month I produced a video showing the type of work we do here at Southampton (for display on a large screen in a public space in our School), as well as a short piece featuring Professor Dame Wendy Hall’s reflections on the 10th anniversary of Twitter.

PhD Research Update

Banna Beach at Sunset/Andrew Bennett ©2009/cc-by 2.0

It’s been quite a while since I posted, which I partly blame on writing up the second stage of my research for publication and for my 18-month upgrade, as well as taking on a part-time role as Web Science Trust project support officer.

I handed in my upgrade a few weeks ago and had a viva to defend my thesis last week. The 18-month viva is not as intense as the final grilling you get at the end of the PhD, but provides, as the University of Southampton website says, “a great opportunity to talk about your work in-depth with experts in your field, who have read, and paid great attention to, your work”. This is true, but I also found it quite unnerving, as it made me realise I still had a long way to go to have confidence in my thesis. Despite what I thought was a fairly lacklustre performance, I somehow managed to pass and am now in the final stretch, working towards my final PhD hand-in next year. My final piece of work includes a fairly complex and challenging Machine Learning experiment and a series of interviews with MOOC instructors. More on this later.

Going back to my last experiment: this involved a large-scale content analysis of MOOC discussion forum comments, which I wrote about in a previous post. Between last November and January this year I recruited and trained a group of 8 research assistants to rate comments in MOOC discussion forums according to two content analysis methods. Overall, 1,500 comments were rated, and correlations of various strengths were established between the analysis methods and with linguistic indicators of critical thinking. The outputs have provided a useful basis for the next stage – developing a method to automate comment rating that approximates human rating.
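The kind of correlation analysis described above can be sketched very simply: given ratings of the same comments under two methods, plus a linguistic indicator, compute pairwise Pearson correlations. The ratings and word counts below are invented; only the mechanics are the point.

```python
import numpy as np

# Hypothetical ratings: two content analysis methods applied to the
# same ten comments, plus one linguistic indicator (word count).
# All values are invented for illustration.
method_a = np.array([1, 3, 2, 5, 4, 2, 1, 4, 5, 3], dtype=float)
method_b = np.array([2, 3, 2, 5, 5, 1, 1, 4, 4, 3], dtype=float)
word_count = np.array([12, 45, 30, 120, 95, 20, 8, 80, 110, 40], dtype=float)

# Pearson correlation between the two rating methods, and between
# one method's ratings and the linguistic indicator.
r_methods = np.corrcoef(method_a, method_b)[0, 1]
r_wordcount = np.corrcoef(method_a, word_count)[0, 1]

print(f"method A vs B: r = {r_methods:.2f}")
print(f"method A vs word count: r = {r_wordcount:.2f}")
```

A strong correlation between the two methods suggests they capture much the same construct; a strong correlation with an indicator like word count is what makes cheap linguistic proxies interesting for automation.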

A paper on the initial stages of my research that I submitted to Research in Learning Technology has been peer reviewed and accepted, and I am awaiting the outcome of deliberations on the changes I’ve made prior to publication later this year. A paper I hoped to get into the Learning Analytics Special Edition of Transactions on Learning Technologies was rejected (2 to 1 against publication – can’t win ’em all!). But they’ve suggested I re-submit following changes to the text. I’ve just re-written the abstract, which goes like this:

Typically, learners’ progression within Computer-Supported Collaborative Learning (CSCL) environments is measured via analysis and interpretation of quantitative web interaction measures (e.g. counting the number of logins, mouse clicks, and accessed resources). However, the usefulness of these ‘proxies for learning’ is questioned as they only depict a narrow spectrum of behaviour and do not facilitate the qualitative evaluation of critical reflection and dialogue – an essential component of collaborative learning. Research indicates that pedagogical content analysis methods have value in measuring critical discourse in small scale, formal, online learning environments, but little research has been carried out on high volume, informal, Massive Open Online Course (MOOC) forums. The challenge in this setting is to develop valid and reliable indicators that operate successfully at scale. In this paper we test two established pedagogical content analysis methods in a large-scale review of comment data randomly selected from a number of MOOCs. Pedagogical Scores (PS) are derived from ratings applied to comments by a group of coders, and correlated with linguistic and interaction indicators. Results show that the content analysis methods are reliable, and are very strongly correlated with each other, suggesting that their specific format is not significant. In addition, the methods are strongly associated with some relevant linguistic indicators of higher levels of learning (e.g. word count and occurrence of first-person pronouns), and have weaker correlations with other linguistic and interaction metrics (e.g. sentiment, ‘likes’, words per sentence, long words). This suggests promise for further research in the development of content analysis methods better suited to informal MOOC forum settings, and the practical application of linguistic proxies for learning. 
Specifically, we propose using Machine Learning techniques to automatically approximate human coding and provide realistic feedback to instructors, learners and learning designers.

Just need to re-do the rest now…

I’ve also undertaken two online introductory courses in using the Weka machine learning workbench application and am currently waiting for the Advanced course to start. I’m also attending the Learning Analytics and Knowledge Conference (LAK16) in Edinburgh next week, where I’m very much looking forward to taking a workshop in data mining (using Weka), as well as attending loads of presentations and engaging in some serious networking.

Also, I’m very much looking forward to the summer (hence the photo at the top of the page).