Evaluation and automatic analysis of MOOC forum comments

My doctoral thesis is available to download via the University of Southampton’s ePrints site.


Moderators of Massive Open Online Courses (MOOCs) undertake a dual role. Their work entails not just facilitating an effective learning environment, but also identifying excelling and struggling learners, and providing pedagogical encouragement and direction. Supporting learners is a critical part of moderators’ work, and identifying learners’ level of critical thinking is an important part of this process. As many thousands of learners may communicate 24 hours a day, 7 days a week using MOOC comment forums, providing support in this environment is a significant challenge for the small numbers of moderators typically engaged in this work. In order to address this challenge, I adopt established coding schemes used for pedagogical content analysis of online discussions to classify comments, and report on several studies I have undertaken which seek to ascertain the reliability of these approaches, establishing associations between these methods and linguistic and other indicators of critical thinking. I develop a simple algorithmic method of classification based on automatically sorting comments according to their linguistic composition, and evaluate an interview-based case study in which this algorithm is applied to an ongoing MOOC. The algorithmic method achieved good reliability when applied to a prepared test data set, and when applied to unlabelled comments in a live MOOC and evaluated by MOOC moderators, it was considered to have provided useful, actionable feedback. This thesis provides contributions that help to understand the usefulness of automatic analysis of levels of critical thinking in MOOC comment forums, and as such has implications for future learning analytics research and e-learning policy making.


Automatic Essay Scoring

Papa bloggt! 255/365/Dennis Skley ©2014/cc by-nd 2.0

The following is taken from my 18 month upgrade report, which I hope provides an interesting overview of a subject very close to my current area of research.

Grading, ranking, classifying, and recording student activity are fundamental activities in formal education wherever institutions or teachers need to know how students are developing, and where students require feedback, either formative or summative, on their progress. Computational approaches to measuring and analysing this activity hold the promise of relieving human effort and dealing with large amounts of data at speed, but they are a controversial topic that demands a multidisciplinary perspective “involving not only psychometrics and statistics, but also linguistics, English composition, computer science, educational psychology, natural-language analysis, curriculum, and more” (Page, 1966, p. 88).

In the mid-1960s, as part of ‘Project Essay Grade’ (PEG), a number of experiments to assess the reliability of machine-based essay grading were undertaken, adopting a word and punctuation count method of “actuarial optimization” to “simulate the behaviour of qualified judges” (Page, 1966, p. 90). Using 30 features that approximated values previously identified by human experts, PEG explored essays written by high school students, and found statistically significant associations with human criteria. The highest associations were with average word length (r = 0.51), use of words commonly found in literature (r = -0.48), word count (r = 0.32), and prepositions (r = 0.25) (p. 93). While costly in terms of the time taken to input the essays, these initial experiments were highly successful, showing a multiple correlation coefficient equivalent to that of human experts (r = 0.71). In the face of hostility and suspicion from progressive as well as established interests, and hampered by the rudimentary computing facilities available at the time, further development of the project waned (Wresch, 1993).
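Page’s full feature set is not reproduced here, but surface features of the kind he describes (word count, average word length, preposition frequency) are straightforward to compute. The following is a minimal sketch, assuming a tiny illustrative preposition list rather than Page’s actual word lists:

```python
import re

# Illustrative preposition list only -- not Page's actual word list.
PREPOSITIONS = {"in", "on", "at", "by", "for", "with", "of", "to", "from", "about"}

def surface_features(text):
    """Compute PEG-style surface features for a piece of writing."""
    words = re.findall(r"[a-z']+", text.lower())
    n = len(words)
    return {
        "word_count": n,
        "avg_word_length": sum(len(w) for w in words) / n if n else 0.0,
        "preposition_rate": sum(w in PREPOSITIONS for w in words) / n if n else 0.0,
    }

feats = surface_features("The essay was graded by a program in the lab.")
```

In PEG’s terms these are ‘proxes’: measurable approximations of the qualities (‘trins’) that human judges value, which are then weighted against human ratings.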

As computers became ubiquitous and as software improved in the decades that followed these initial experiments, PEG was revived and applied to large-scale datasets. These experiments resulted in algorithms that were shown to surpass the reliability of human expert rating (Page, 1994). In recent years the focus of developing automated essay scoring (AES) algorithms has shifted from faculty to the research and development departments of corporations. AES has been successfully marketed, and different systems are currently used to assess students’ writing in professional training, formal education, and Massive Open Online Courses, primarily in the United States (Williamson, 2003; National Council of Teachers of English, 2013; Whithaus, 2015; Balfour, 2013).

While the details of proprietary AES algorithm design are a matter of commercial confidentiality, systems continue to be based on word and punctuation counts and word lists, with the addition of Natural Language Processing techniques (Burstein et al., 1998), Latent Semantic Analysis (Landauer, Foltz and Laham, 1998), and Machine Learning methods (Turnitin, LLC, 2015; McCann Associates, 2016).

Controversy and criticism of AES has focused on the inability of machines to recognise or judge the variety of complex elements associated with good writing, the training of humans to mimic computer scoring, over-emphasis on word count and flamboyant language, and the ease with which students can be coached to ‘game the system’ (National Council of Teachers of English, 2013; Perelman, 2014).

However, many of these criticisms are levelled at the widespread application of computational methods to replace human rating, criticisms which were addressed early in the development of AES. Page argued that computational approaches are based on established experimental methods that privilege “data concerning behaviour, rather than internal states, and the insistence upon operational definitions, rather than idealistic definitions” (Page, 1969, p. 3), and that machine grading simply replicated the behaviour of human experts. In response to arguments that machines were not capable of judging creativity, Wresch cites Slotnick’s support for the use of AES to indicate deviations from norms and highlight unusual writing, which could then be referred for further human assessment (Wresch, 1993). In recent work exploring the use of automated assessment in MOOCs, while recognising the limitations of AES in assessing unique writing (e.g. individually selected topics, poetry, original research), Balfour suggests the use of computational methods to correct mechanical writing problems, combined with a final, human, peer review (Balfour, 2013).


  • Balfour, S. P. (2013) ‘Assessing writing in MOOCS: Automated essay scoring and Calibrated Peer Review’. In Research & Practice in Assessment, 8, pp. 40–48.
  • Burstein, J., Braden-Harder, L., Chodorow, M., Hua, S., Kaplan, B., Kukich, K., Lu, C., Nolan, J., Rock, D. and Wolff, S. (1998) Computer analysis of essay content for automated score prediction. Report for the Educational Testing Service.
  • Landauer, T. K., Foltz, P. W. and Laham, D. (1998) ‘An introduction to latent semantic analysis’. In Discourse Processes, 25(2 & 3), pp. 259–284.
  • McCann Associates (2016) IntelliMetric®. [Online] Available at: http://www.mccanntesting.com/products-services/intellimetric/ (Accessed: 1 March 2016).
  • National Council of Teachers of English (2013). NCTE Position Statement on Machine Scoring. [Online] Available at: http://www.ncte.org/positions/statements/machine_scoring (Accessed: 1 March 2016).
  • Page, E. B. (1966) ‘Grading Essays by Computer: Progress Report’. In Invitational Conference on Testing Problems, 29 October, 1966. New York: Educational Testing Service, pp. 87–100.
  • Page, E. B. (1994) ‘Computer grading of student prose, using modern concepts and software’. In The Journal of Experimental Education. Taylor & Francis, 62(2), pp. 127–142.
  • Perelman, L. (2014) ‘When “the state of the art” is counting words’. In Assessing Writing. Elsevier Inc., 21, pp. 104–111.
  • Turnitin, LLC (2015) Turnitin Scoring Engine FAQ. [Online] Available at: https://guides.turnitin.com/Turnitin_Scoring_Engine/Turnitin_Scoring_Engine_FAQ (Accessed: 1 March 2016).
  • Whithaus, C. (2015) Algorithms at the seam: machines reading humans + / -, Media Commons. [Online] Available at: http://mediacommons.futureofthebook.org/question/what-opportunities-are-available-influence-way-algorithms-are-programmed-written-executed-6 (Accessed: 1 March 2016).
  • Williamson, M. M. (2003) ‘Validity of automated scoring: Prologue for a continuing discussion of machine scoring student writing’. In Journal of Writing Assessment, 1(2), pp. 85–104.
  • Wresch, W. (1993) ‘The Imminence of Grading Essays by Computer – 25 Years later’. In Computers and Composition, 10(2), pp. 45–58.

PhD Research Update

Banna Beach at Sunset/Andrew Bennett ©2009/cc-by 2.0

It’s been quite a while since I posted, which I partly blame on writing up the second stage of my research, both for publication and for my 18-month upgrade, plus taking on a part-time role as Web Science Trust project support officer.

I handed in my upgrade a few weeks ago and had a viva to defend my thesis last week. The 18-month viva is not as intense as the final grilling you get at the end of the PhD, but provides, as the University of Southampton website says, “a great opportunity to talk about your work in-depth with experts in your field, who have read, and paid great attention to, your work”. This is true, but I also found it quite unnerving, as it made me realise I still had a long way to go to have confidence in my thesis. Despite what I thought was a fairly lacklustre performance, I somehow managed to pass and am now in the final stretch, working towards my final PhD hand-in next year. My final piece of work includes a fairly complex and challenging Machine Learning experiment and a series of interviews with MOOC instructors. More on this later.

Going back to my last experiment: this involved a large-scale content analysis of MOOC discussion forum comments, which I wrote about in a previous post. Between last November and January this year I recruited and trained a group of 8 research assistants to rate comments in MOOC discussion forums according to two content analysis methods. Overall, 1,500 comments were rated, and correlations of various strengths were established between the analysis methods and with linguistic indicators of critical thinking. The outputs have provided a useful basis for the next stage – developing a method to automate comment rating that approximates human rating.
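The correlation step can be illustrated with a minimal Pearson’s r computation. The scores and word counts below are invented for illustration and are not figures from the study:

```python
from statistics import mean

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

# Invented example: pedagogical ratings for five comments alongside
# one linguistic indicator (word count) for the same comments.
ratings = [1, 2, 2, 3, 4]
word_counts = [12, 30, 25, 48, 60]
r = pearson_r(ratings, word_counts)
```

In practice the same comparison is run across many indicators (sentiment, words per sentence, ‘likes’, and so on), and the strength of each correlation suggests which linguistic features might serve as proxies for human rating.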

A paper on the initial stages of my research that I submitted to Research in Learning Technology has been peer reviewed and accepted, and I am awaiting the outcome of deliberations on the changes I’ve made prior to publication later this year. A paper I hoped to get into the Learning Analytics Special Edition of Transactions on Learning Technologies was rejected (2 to 1 against publication – can’t win ’em all!). But they’ve suggested I re-submit following changes to the text. I’ve just re-written the abstract, which goes like this:

Typically, learners’ progression within Computer-Supported Collaborative Learning (CSCL) environments is measured via analysis and interpretation of quantitative web interaction measures (e.g. counting the number of logins, mouse clicks, and accessed resources). However, the usefulness of these ‘proxies for learning’ is questioned as they only depict a narrow spectrum of behaviour and do not facilitate the qualitative evaluation of critical reflection and dialogue – an essential component of collaborative learning. Research indicates that pedagogical content analysis methods have value in measuring critical discourse in small scale, formal, online learning environments, but little research has been carried out on high volume, informal, Massive Open Online Course (MOOC) forums. The challenge in this setting is to develop valid and reliable indicators that operate successfully at scale. In this paper we test two established pedagogical content analysis methods in a large-scale review of comment data randomly selected from a number of MOOCs. Pedagogical Scores (PS) are derived from ratings applied to comments by a group of coders, and correlated with linguistic and interaction indicators. Results show that the content analysis methods are reliable, and are very strongly correlated with each other, suggesting that their specific format is not significant. In addition, the methods are strongly associated with some relevant linguistic indicators of higher levels of learning (e.g. word count and occurrence of first-person pronouns), and have weaker correlations with other linguistic and interaction metrics (e.g. sentiment, ‘likes’, words per sentence, long words). This suggests promise for further research in the development of content analysis methods better suited to informal MOOC forum settings, and the practical application of linguistic proxies for learning. 
Specifically, this means using Machine Learning techniques to automatically approximate human coding, and to provide realistic feedback to instructors, learners and learning designers.

Just need to re-do the rest now…

I’ve also undertaken two online introductory courses in using the Weka machine learning workbench application and am currently waiting for the Advanced course to start. I’m also attending the Learning Analytics and Knowledge Conference (LAK16) in Edinburgh next week, where I’m very much looking forward to taking a workshop in data mining (using Weka), as well as attending loads of presentations and engaging in some serious networking.

Also, I’m very much looking forward to the summer (hence the photo at the top of the page).


My 9 month PhD Poster

My 9 month PhD Poster/Tim O’Riordan ©2014/Creative Commons by-nc-nd License.

A few months ago I reached a milestone in my PhD by passing my 9 month viva, and last week I was reminded (along with the rest of the lab) that my old poster was “looking as retro as a set of Alexis Carrington‘s shoulder pads” (to quote Prof Les Carr). So I set to, downloaded a trial version of Adobe Photoshop, and got designing.

Essentially I’ve retained the style of my previous poster and added some new words, scatter plots and logos to reflect my progress over the past few months. My supervisors love it, and in less than a week it’s had an outing at the LACE SoLAR Flare 2015, and at JP Rangaswami’s Web Science Institute Distinguished Lecture.

What are my key findings?

Building on my earlier learning analytics work that used a single approach to rate comments associated with learning objects on a Massive Open Online Course (MOOC) in an attempt to identify ‘attention to learning’, I undertook further content analysis. The main idea was to use 3 highly cited pedagogically-based methods (Bloom’s Taxonomy, SOLO Taxonomy, and Community of Inquiry (CoI)), in addition to the less well-known DiAL-e method (which I had used in an earlier study), to see if there was any correlation between them, to test intra-rater reliability, and to see how these methods squared up against typical measures of online learning engagement.

I discovered that my intra-rater reliability was high, as were correlations between methods. That is, all methods of rating learners’ comments produced very similar results, with Bloom and CoI producing the best results of the 4 methods. Correlations with other measures (sentiment, words per sentence, and ‘likes’) confirmed my earlier work: language used in comments appears to provide a good indication of depth of learning, and people ‘like’ online comments for many reasons, not necessarily for the depth of learning demonstrated by the comment maker.

So, I’m about halfway through my PhD and still have a lot of work to do. The next stage involves employing some willing research assistants to rate many more comments, derived from many more MOOCs, than I am able to do myself. The aim is to collect enough data to train Machine Learning algorithms to rate comments automatically.

Why is this important?

Making education and training more widely available is vital for human development, and the Web has a significant part to play in delivering these opportunities. Running a successful online learning programme (e.g. a MOOC) should involve managing a great deal of learner interaction – answering questions, making suggestions, and generally guiding learners along their paths. But coping effectively with high levels of engagement is time intensive and involves the attention of highly qualified (and expensive) teachers and educational technologists. My hope is that my research will help develop an automated means of showing how well, and to what extent, learners are attending to learning, and that this will make a useful contribution to managing online teaching and learning.

Early days

Poster Session at ILIaD Conference 2014/Sue White ©2014/CC BY 2.0

A recent highlight was the ILIaD Conference, held at the University of Southampton, where I had the opportunity to discuss my summer research project during the poster session with Digital Literacies expert, author of Rethinking Pedagogy for a Digital Age, and keynote speaker, Helen Beetham.

These early days in my life as a PhD research student seem to have been mainly taken up with training. I undertook an excellent Demonstrator training session at the end of September, which has led to some paid work evaluating first year student presentations throughout November. I have also undertaken the very useful Public Engagement and Presenting Your Research training as well as the Successful Business Engagement programme, Finding Information to Support Your Research and the Lifelong Learning PGR Training Package.

Public Engagement/Tim O’Riordan ©2014/CC BY 2.0

Induction into the Web and Internet Science research group provided an opportunity to find out what my lab colleagues are mainly up to and to do some lightweight research into Provenance (see: WAIS Provenance Project).

I attended a seminar on the Portus MOOC (the focus of my summer dissertation project) and met with MOOC team leader and dedicated online learning exponent, Graeme Earl. It was great to hear Graeme talking about how he approached designing the course, and using some of the key terms from the model I used for coding attention to learning in my project. He also raised issues about managing high numbers of learners’ comments. How do we scale MOOCs, engage larger numbers of learners, and still pay attention to scaffolding? My hope is that my research will provide some support in this area.

I joined the PEGaSUs e-learning research group at the end of October. This is an interdisciplinary group of research students from diverse backgrounds who have a shared interest in online learning. Topics for discussion included teaching in emerging economies, blended learning, measuring teacher performance, communities of practice, e-portfolios, and (of course) Massive Open Online Courses. Post-meeting I set up a (closed) Facebook group to help keep the discussion going between meetings. I think PEGaSUs stands for Postgraduate E-learning Group at Southampton University (looks about right).

Finally, the British Computer Society Introduction to Badges for Learning event, run by Sam Taylor and Julian Prior at Southampton Solent University at the end of October, provided an excellent overview of this emerging method for enhancing learner engagement.

My growing to-do list is mainly taken up with the tasks involved in reviewing, editing and generally preparing my summer project for publication. This involves:

  • Finding suitable journals/conferences that may be interested in publishing my work.
  • Looking for what’s missing in my dissertation.
  • Interrogating my analysis (Is it watertight? Does it hold up under scrutiny?).
  • Reading more papers on learning analytics and content analysis.
  • Applying my analysis to other data sets (possibly the Web Science MOOC comment data).
  • Getting a paper ready for submission by the end of February 2015.