Automatic Essay Scoring

Papa bloggt! 255/365/Dennis Skley ©2014/cc by-nd 2.0

The following is taken from my 18-month upgrade report, which I hope provides an interesting overview of a subject very close to my current area of research.

Grading, ranking, classifying, and recording student activity are fundamental activities in formal education wherever institutions or teachers need to know how students are developing, and where students require feedback, either formative or summative, on their progress. Computational approaches to measuring and analysing this activity hold the promise of relieving human effort and dealing with large amounts of data at speed, but they remain controversial and demand a multidisciplinary perspective “involving not only psychometrics and statistics, but also linguistics, English composition, computer science, educational psychology, natural-language analysis, curriculum, and more” (Page, 1966, p. 88).

In the mid-1960s, as part of ‘Project Essay Grade’ (PEG), a number of experiments were undertaken to assess the reliability of machine-based essay grading, adopting a word and punctuation count method of “actuarial optimization” to “simulate the behaviour of qualified judges” (Page, 1966, p. 90). Using 30 features that approximated values previously identified by human experts, PEG explored essays written by high school students and found statistically significant associations with human criteria. The highest associations were with average word length (r = 0.51), use of words commonly found in literature (r = -0.48), word count (r = 0.32), and prepositions (r = 0.25) (p. 93). While costly in terms of the time taken to input the essays, these initial experiments were highly successful, showing a multiple correlation coefficient (r = 0.71) equivalent to that of human experts. In the face of hostility and suspicion from progressive as well as established interests, and hampered by the rudimentary computing facilities available at the time, further development of the project waned (Wresch, 1993).
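
To make the ‘actuarial optimization’ idea concrete, here is a minimal sketch of the general approach – counting simple surface features and regressing them against human scores. This is my own illustration in Python with scikit-learn, not Page’s implementation, and the feature set, essays and grades are invented:

    # Sketch of a PEG-style grader: regress simple surface features
    # against human scores. Illustrative only – not Page's original
    # feature set, and the essays and grades below are invented.
    import re
    from sklearn.linear_model import LinearRegression

    def surface_features(essay):
        words = re.findall(r"[A-Za-z']+", essay)
        sentences = [s for s in re.split(r"[.!?]+", essay) if s.strip()]
        prepositions = {"in", "on", "at", "of", "to", "with", "by", "for"}
        return [
            len(words),                                       # word count
            sum(len(w) for w in words) / max(len(words), 1),  # average word length
            sum(w.lower() in prepositions for w in words),    # preposition count
            len(sentences),                                    # sentence count
        ]

    # Hypothetical training data: essays paired with human grades.
    essays = ["First sample essay text ...", "Second sample essay ...", "Third sample essay ..."]
    human_scores = [3.0, 4.5, 2.0]

    model = LinearRegression().fit([surface_features(e) for e in essays], human_scores)
    print(model.predict([surface_features("A new, unseen essay ...")]))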

As computers became ubiquitous and as software improved in the decades that followed these initial experiments, PEG was revived and applied to large-scale datasets. These experiments resulted in algorithms that were shown to surpass the reliability of human expert rating (Page, 1994). In recent years the focus of developing automated essay scoring (AES) algorithms has shifted from faculty to the research and development departments of corporations. AES has been successfully marketed, and different systems are currently used to assess students’ writing in professional training, formal education, and Massive Open Online Courses, primarily in the United States (Williamson, 2003; National Council of Teachers of English, 2013; Whithaus, 2015; Balfour, 2013).

While the details of proprietary AES algorithm design are a matter of commercial confidentiality, systems continue to be based on word and punctuation counts and word lists, with the addition of Natural Language Processing techniques (Burstein et al., 1998), Latent Semantic Analysis (Landauer, Foltz and Laham, 1998), and Machine Learning methods (Turnitin, LLC, 2015; McCann Associates, 2016).
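
As a rough illustration of how a Latent Semantic Analysis step might feed a scoring model, the sketch below maps essays into a low-dimensional ‘semantic’ space (TF-IDF followed by truncated SVD) and fits a regressor on human scores. It is a generic scikit-learn pipeline of my own devising – emphatically not a reconstruction of any vendor’s proprietary system – and the essays and scores are invented:

    # Generic LSA-plus-regression sketch (not any vendor's actual algorithm).
    # Essays are projected into a low-dimensional semantic space via
    # TF-IDF + truncated SVD, and a regressor predicts human scores.
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.decomposition import TruncatedSVD
    from sklearn.linear_model import Ridge
    from sklearn.pipeline import make_pipeline

    # Hypothetical scored essays, for illustration only.
    essays = [
        "The industrial revolution transformed urban life and work.",
        "Cities grew rapidly as factories drew workers from the countryside.",
        "My summer holiday was fun and I like dogs.",
    ]
    human_scores = [4.0, 4.5, 1.5]

    aes_model = make_pipeline(
        TfidfVectorizer(stop_words="english"),
        TruncatedSVD(n_components=2),   # the 'latent semantic' dimensions
        Ridge(alpha=1.0),
    )
    aes_model.fit(essays, human_scores)
    print(aes_model.predict(["Factories reshaped how people lived in cities."]))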

Controversy and criticism of AES have focused on the inability of machines to recognise or judge the variety of complex elements associated with good writing, the training of humans to mimic computer scoring, over-emphasis on word count and flamboyant language, and the ease with which students can be coached to ‘game the system’ (National Council of Teachers of English, 2013; Perelman, 2014).

However, many of these criticisms are levelled at the widespread application of computational methods to replace human rating, and were clearly addressed early in the development of AES. Page argued that computational approaches are based on established experimental methods that privilege “data concerning behaviour, rather than internal states, and the insistence upon operational definitions, rather than idealistic definitions” (Page, 1969, p. 3), and that machine grading simply replicated the behaviour of human experts. In response to arguments that machines were not capable of judging creativity, Wresch cites Slotnick’s support for the use of AES to indicate deviations from norms and highlight unusual writing, which could then be referred for further human assessment (Wresch, 1993). In recent work exploring the use of automated assessment in MOOCs, while recognising the limitations of AES in assessing unique writing (e.g. individually selected topics, poetry, original research), Balfour suggests the use of computational methods to correct mechanical writing problems, combined with a final, human, peer review (Balfour, 2013).

References

  • Balfour, S. P. (2013) ‘Assessing writing in MOOCS: Automated essay scoring and Calibrated Peer Review’. In Research & Practice in Assessment, 8, pp. 40–48.
  • Burstein, J., Braden-Harder, L., Chodorow, M., Hua, S., Kaplan, B., Kukich, K., Lu, C., Nolan, J., Rock, D. and Wolff, S. (1998) Computer analysis of essay content for automated score prediction. Report for the Educational Testing Service.
  • Landauer, T. K., Foltz, P. W. and Laham, D. (1998) ‘An introduction to latent semantic analysis’. In Discourse Processes, 25(2 & 3), pp. 259–284.
  • McCann Associates (2016) IntelliMetric®. [Online] Available at: http://www.mccanntesting.com/products-services/intellimetric/ (Accessed: 1 March 2016).
  • National Council of Teachers of English (2013). NCTE Position Statement on Machine Scoring. [Online] Available at: http://www.ncte.org/positions/statements/machine_scoring (Accessed: 1 March 2016).
  • Page, E. B. (1966) ‘Grading Essays by Computer: Progress Report’. In Invitational Conference on Testing Problems, 29 October, 1966. New York: Educational Testing Service, pp. 87–100.
  • Page, E. B. (1994) ‘Computer grading of student prose, using modern concepts and software’. In The Journal of Experimental Education. Taylor & Francis, 62(2), pp. 127–142.
  • Perelman, L. (2014) ‘When “the state of the art” is counting words’. In Assessing Writing. Elsevier Inc., 21, pp. 104–111.
  • Turnitin, LLC (2015) Turnitin Scoring Engine FAQ. [Online] Available at: https://guides.turnitin.com/Turnitin_Scoring_Engine/Turnitin_Scoring_Engine_FAQ (Accessed: 1 March 2016).
  • Whithaus, C. (2015) Algorithms at the seam: machines reading humans + / -, Media Commons. [Online] Available at: http://mediacommons.futureofthebook.org/question/what-opportunities-are-available-influence-way-algorithms-are-programmed-written-executed-6 (Accessed: 1 March 2016).
  • Williamson, M. M. (2003) ‘Validity of automated scoring: Prologue for a continuing discussion of machine scoring student writing’. In Journal of Writing Assessment, 1(2), pp. 85–104.
  • Wresch, W. (1993) ‘The Imminence of Grading Essays by Computer – 25 Years later’. In Computers and Composition, 10(2), pp. 45–58.

Video production

While I spend most of my time working on my PhD research, because of my previous experience as a media producer I’m occasionally asked to produce videos that support the work of the Web Science Institute. Last month I produced a video showing the type of work we do here at Southampton (for display on a large screen in a public space in our School), as well as a short piece featuring Professor Dame Wendy Hall’s reflections on the 10th anniversary of Twitter.

PhD Research Update

Banna Beach at Sunset/Andrew Bennett ©2009/cc-by 2.0

It’s been quite a while since I posted, which I partly blame on writing up the second stage of my research for publication and for my 18-month upgrade, plus taking on a part-time role as Web Science Trust project support officer.

I handed in my upgrade a few weeks ago and had a viva to defend my thesis last week. The 18-month viva is not as intense as the final grilling you get at the end of the PhD, but provides, as the University of Southampton website says, “a great opportunity to talk about your work in-depth with experts in your field, who have read, and paid great attention to, your work”. This is true, but I also found it quite unnerving, as it made me realise I still had a long way to go to have confidence in my thesis. Despite what I thought was a fairly lacklustre performance, I somehow managed to pass and am now in the final stretch, working towards my final PhD hand-in next year. My final piece of work includes a fairly complex and challenging Machine Learning experiment and a series of interviews with MOOC instructors. More on this later.

Going back to my last experiment: this involved a large-scale content analysis of MOOC discussion forum comments, which I wrote about in a previous post. Between last November and January this year I recruited and trained a group of 8 research assistants to rate comments in MOOC discussion forums according to two content analysis methods. Overall, 1,500 comments were rated, and correlations of various strengths were established between the analysis methods and with linguistic indicators of critical thinking. The outputs have provided a useful basis for the next stage – developing a method to automate comment rating that approximates human rating.
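
For anyone curious what ‘correlations with linguistic indicators’ looks like in practice, here is a toy example along those lines: Spearman’s rank correlation between human ratings of comments and a simple indicator such as word count. The comments, ratings and choice of statistic are mine, for illustration – not the study’s actual data or analysis:

    # Toy example: correlate human ratings of comments with a simple
    # linguistic indicator (word count). All data below is invented.
    from scipy.stats import spearmanr

    comments = [
        "I agree.",
        "This made me reconsider how I evaluate evidence in my own practice.",
        "Great course!",
        "Comparing the two frameworks, the second explains my own data better.",
    ]
    human_ratings = [1, 4, 1, 5]                      # hypothetical pedagogical scores
    word_counts = [len(c.split()) for c in comments]

    rho, p_value = spearmanr(human_ratings, word_counts)
    print(f"Spearman rho = {rho:.2f} (p = {p_value:.3f})")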

A paper on the initial stages of my research that I submitted to Research in Learning Technology has been peer reviewed and accepted, and I am awaiting the outcome of deliberations on the changes I’ve made prior to publication later this year. A paper I hoped to get into the Learning Analytics Special Edition of Transactions on Learning Technologies was rejected (2 to 1 against publication – can’t win ’em all!). But they’ve suggested I re-submit following changes to the text. I’ve just re-written the abstract, which goes like this:

Typically, learners’ progression within Computer-Supported Collaborative Learning (CSCL) environments is measured via analysis and interpretation of quantitative web interaction measures (e.g. counting the number of logins, mouse clicks, and accessed resources). However, the usefulness of these ‘proxies for learning’ is questioned as they only depict a narrow spectrum of behaviour and do not facilitate the qualitative evaluation of critical reflection and dialogue – an essential component of collaborative learning. Research indicates that pedagogical content analysis methods have value in measuring critical discourse in small scale, formal, online learning environments, but little work has been carried out on high volume, informal, Massive Open Online Course (MOOC) forums. The challenge in this setting is to develop valid and reliable indicators that operate successfully at scale. In this paper we test two established pedagogical content analysis methods in a large-scale review of comment data randomly selected from a number of MOOCs. Pedagogical Scores (PS) are derived from ratings applied to comments by a group of coders, and correlated with linguistic and interaction indicators. Results show that the content analysis methods are reliable, and are very strongly correlated with each other, suggesting that their specific format is not significant. In addition, the methods are strongly associated with some relevant linguistic indicators of higher levels of learning (e.g. word count and occurrence of first-person pronouns), and have weaker correlations with other linguistic and interaction metrics (e.g. sentiment, ‘likes’, words per sentence, long words). This suggests promise for further research in the development of content analysis methods better suited to informal MOOC forum settings, and the practical application of linguistic proxies for learning. Specifically, we propose using Machine Learning techniques to automatically approximate human coding and provide realistic feedback to instructors, learners and learning designers.

Just need to re-do the rest now…

I’ve also undertaken two online introductory courses in using the Weka machine learning workbench application and am currently waiting for the Advanced course to start. I’m also attending the Learning Analytics and Knowledge Conference (LAK16) in Edinburgh next week, where I’m very much looking forward to taking a workshop in data mining (using Weka), as well as attending loads of presentations and engaging in some serious networking.

Also, I’m very much looking forward to the summer (hence the photo at the top of the page).

 

Periscoping to support MOOC

Working with University of Southampton Digichamp Hannah Watts and FutureLearn Digital Marketing MOOC educator Dr Lisa Harris, I helped run probably the first video broadcast using Periscope to support an online course in the UK. The Digital Marketing MOOC (Massive Open Online Course) asks learners to try out new social tools and think about how they may (or may not) work in a learning or a business context. So, to demonstrate what this involves, Lisa decided to give this relatively new social video broadcasting app a try.

Periscope allows you to watch live videos from your mobile device and interact with the presenters in real time, either directly within the app or via Twitter. I’d had a go with the app a few times over the summer and found that it worked well in connecting with a reasonably large audience. We’d also seen Inger Mewburn on the edX Surviving your PhD MOOC, and the BBC Outside Source broadcasts, and decided it was time to take the plunge into a more planned approach to this new form of social broadcasting.

Using Periscope is very straightforward; you just download the app to your mobile device, sign into Twitter, log in to Periscope and then start broadcasting. But if you have an expectant audience and a message to deliver, you can’t leave much to chance. The plan was for Lisa to discuss questions from the MOOC and from the live Twitter feed with facilitator Chris Phethelan at a prearranged time (15:00 GMT, 5 November 2015). We’d had network problems with previous attempts, so we had an additional camera on stand-by to ensure we had something for our audience to watch if the broadcast failed.

Periscope holding screen

With a crew of two (me supervising the broadcast, and Hannah noting comments as they appeared – and passing on questions) we used an iPhone 5s as the broadcast camera, set up in horizontal mode (Periscope broadcasts in vertical video, but corrects this on playback). In order to let our audience know where to find the broadcast (and with the iPhone pointing at a ‘holding screen’), we hit the ‘start broadcasting’ button 15 minutes before the discussion was due to begin. This automatically created a tweet on the Digital Marketing MOOC Twitter account containing a link to the broadcast – which we copied and posted on the MOOC’s comment forum.

About 30 seconds before the start of the discussion I started recording on the standby camera, and used QuickTime to screen-record the Periscope browser window. At 3pm the holding screen was removed from in front of the camera to reveal Lisa and Chris ready to start. Within seconds the sound was turned on and the discussion could begin.

During the broadcast Lisa and Chris discussed comments from the previous week on the MOOC, and were also able to answer questions posted on Twitter as they came in. Altogether we had over 90 viewers watching and a high number of interactions during transmission – plus some very positive feedback.

You may wonder why we went to such great lengths to record the broadcast. Firstly, Periscope broadcasts only stay online for 24 hours, so we needed a copy to put on YouTube for those who missed it. Also, while the iPhone records the video, the quality is quite poor – and it doesn’t record the questions, comments and other feedback that are visible in the Periscope broadcast. So we needed to screen-record the browser window at high resolution (on a MacBook Pro with Retina screen) to ensure we had a copy that could be used later in the course – or possibly to support later iterations. Finally, apologies for the jerkiness of the video – although we were on a very high speed network, this seems to be how Periscope currently works.

My 9 month PhD Poster

My 9 month PhD Poster/Tim O’Riordan ©2014/Creative Commons by-nc-nd License.

A few months ago I reached a milestone in my PhD by passing my 9-month viva, and last week I was reminded (along with the rest of the lab) that my old poster was “looking as retro as a set of Alexis Carrington’s shoulder pads” (to quote Prof Les Carr). So I set to, downloaded a trial version of Adobe Photoshop, and got designing.

Essentially I’ve retained the style of my previous poster and added some new words, scatter plots and logos to reflect my progress over the past few months. My supervisors love it, and in less than a week it’s had an outing at the LACE SoLAR Flare 2015, and at JP Rangaswami’s Web Science Institute Distinguished Lecture.

What are my key findings?

Building on my earlier learning analytics work, which used a single approach to rate comments associated with learning objects on a Massive Open Online Course (MOOC) in an attempt to identify ‘attention to learning’, I undertook further content analysis. The main idea was to use 3 highly cited pedagogically-based methods (Bloom’s Taxonomy, SOLO Taxonomy, and Community of Inquiry (CoI)), in addition to the less well-known DiAL-e method (which I had used in an earlier study), to see if there was any correlation between them, to test intra-rater reliability, and to see how these methods squared up against typical measures of online learning engagement.

I discovered that my intra-rater reliability was high, as were correlations between methods. That is, all methods of rating learners’ comments produced very similar results – with Bloom and CoI producing the best results out of the 4 methods. Correlations with other measures (sentiment, words per sentence, and ‘likes’) confirmed my earlier work: language used in comments appears to provide a good indication of depth of learning, and people ‘like’ online comments for many reasons, not necessarily for the depth of learning demonstrated by the comment maker.
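
To illustrate the kind of checks involved, the snippet below computes an intra-rater agreement statistic (Cohen’s kappa between two rating passes over the same comments) and a rank correlation between two rating methods. The ratings are made up, and the statistics are common defaults rather than necessarily the ones I used:

    # Illustrative reliability checks: Cohen's kappa for intra-rater
    # agreement (same rater, two passes) and Spearman correlation between
    # two rating methods. All ratings below are invented.
    from sklearn.metrics import cohen_kappa_score
    from scipy.stats import spearmanr

    bloom_pass1 = [1, 3, 4, 2, 5, 3, 4, 1]   # my Bloom-based ratings, first pass
    bloom_pass2 = [1, 3, 4, 2, 4, 3, 4, 1]   # same comments, second pass
    coi_ratings = [1, 2, 4, 2, 5, 3, 5, 1]   # same comments rated with CoI

    print("Intra-rater kappa (Bloom):", cohen_kappa_score(bloom_pass1, bloom_pass2))
    rho, _ = spearmanr(bloom_pass1, coi_ratings)
    print("Bloom vs CoI correlation:", round(rho, 2))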

So, I’m about halfway through my PhD and still have a lot of work to do. The next stage involves employing some willing research assistants to rate many more comments, derived from many more MOOCs, than I am able to do myself. The aim is to collect enough data to train Machine Learning algorithms to rate comments automatically.
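
To give a flavour of what that next stage might involve, here is a minimal, hypothetical sketch of training a model on human-rated comments so that it can rate unseen comments automatically. In practice I expect to use tools such as Weka and far more data, so treat this scikit-learn snippet as a toy illustration rather than the planned pipeline:

    # Toy illustration of training a classifier on human-rated comments so
    # it can rate unseen comments automatically. The data, features and
    # model choice here are placeholders, not the actual research pipeline.
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import make_pipeline

    train_comments = [
        "I agree with the previous post.",
        "Reflecting on the evidence, I think the model underestimates engagement.",
        "Thanks, great week!",
        "Testing this idea against my own teaching, I noticed a similar pattern.",
    ]
    train_ratings = ["low", "high", "low", "high"]   # human pedagogical ratings

    rater = make_pipeline(TfidfVectorizer(), LogisticRegression())
    rater.fit(train_comments, train_ratings)
    print(rater.predict(["I compared both approaches and found the second more convincing."]))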

Why is this important?

Making education and training more widely available is vital for human development, and the Web has a significant part to play in delivering these opportunities. Running a successful online learning programme (e.g. a MOOC) should involve managing a great deal of learner interaction – answering questions, making suggestions, and generally guiding learners along their paths. But coping effectively with high levels of engagement is time intensive and involves the attention of highly qualified (and expensive) teachers and educational technologists. My hope is that through my research an automated means of showing how well, and to what extent, learners are attending to learning can be developed, making a useful contribution to managing online teaching and learning.