Building a Reflective Writing Corpus for Analytics Research 

In this blog post, we introduce a new stream of work in CIC’s Academic Writing Analytics project, as we explore the potential of machine learning to augment our existing infrastructure.

Reflective writing

Reflective writing is widely used to trigger the process of self-reflection that is necessary for metacognitive regulation, and for the effective integration of academic knowledge with experiential professional knowledge. We’ve reviewed relevant scholarship on the challenges of teaching and learning reflective writing in our publications [1, 2]; see also Ullmann’s literature review [3].

Conventional assessment practice currently depends on human analysis to evaluate reflective writing [4]. Clearly, there’s nothing as valuable as detailed instruction on reflective writing followed by feedback from a good mentor. But this requires a scarce skillset and is very labour-intensive. The practical consequence is that many students do not understand what good reflective writing is, and do not receive good feedback on their work. For these reasons, there’s growing interest in the potential of automated techniques to relieve some of the load.

Automated feedback on reflective writing

The AcaWriter tool we’ve developed detects specific rhetorical moves associated with good reflective writing (try the demo).

We’ll shortly release an update that provides formative feedback to help the student recognise steps they can take to improve their text. A couple of screenshots are below from the current prototype.

The system currently implements the underlying concept-matching model using a rule-based grammar and human-curated lexicons, which brings both pros and cons. The rules are grounded in scholarly literature on the features of academic research writing, and have been tested on diverse texts by the team through close manual analysis. The lexicons can be edited to tune them to the language used in different disciplines and subjects. This relatively traditional AI approach provides familiar intellectual credentials when introducing the system to educators. When we’re testing it, the underlying behaviour is easier to explain, and errors can be diagnosed very precisely. However, it brings the limitations associated with any rule-based approach: given the richness of open-ended reflective writing, there is a never-ending set of exception cases, and improvements to the system’s performance require manual edits to the rules and lexicons.
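To make the trade-off concrete, here is a minimal sketch of lexicon-based move tagging. It is not AcaWriter’s actual grammar or lexicon: the cue phrases, the `LEXICON` dictionary and the function names are all invented for illustration, and a real rule-based parser does far more than match phrases.

```python
import re

# Hypothetical lexicon: a few cue phrases per rhetorical move.
# These entries are illustrative only, not AcaWriter's real lexicon.
LEXICON = {
    "Feelings": ["i felt", "i was worried", "frustrated"],
    "Difficulties": ["struggled", "challenge", "difficult"],
    "Learning": ["i learned", "i realised", "i now understand"],
}


def tag_sentence(sentence, lexicon=LEXICON):
    """Return the sorted list of moves whose cue phrases occur in the sentence."""
    text = sentence.lower()
    moves = []
    for move, cues in lexicon.items():
        # Match whole words only, so a cue does not fire inside a longer word.
        if any(re.search(r"\b" + re.escape(cue) + r"\b", text) for cue in cues):
            moves.append(move)
    return sorted(moves)


def tag_text(text, lexicon=LEXICON):
    """Split on sentence-ending punctuation, then tag each sentence."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    return [(s, tag_sentence(s, lexicon)) for s in sentences]


if __name__ == "__main__":
    for sentence, moves in tag_text("I struggled at first. I now understand why."):
        print(moves, sentence)
```

Even this toy version shows both sides: it is easy to see why a sentence was tagged, but a sentence such as “I had difficulties” gets no tag at all until someone adds “difficulties” to the lexicon by hand.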

We are now beginning work to investigate whether a machine learning approach can augment the current infrastructure [5]. In recent years, with the availability of “big data”, such as large question-answer banks (e.g. SQuAD [6]), and effective machine learning algorithms, e.g. deep neural networks [7], data-driven approaches based on new data processing architectures have attracted great attention in natural language processing tasks such as neural text summarization [8] and neural machine translation [9], mainly because these approaches do not require human-defined rules and generalize well. However, such data-driven approaches require a large amount of data, and some statistical learning models, such as deep neural networks, are not easy to interpret.

Therefore, we are interested in building a text corpus of reflective writing that can bring together researchers from computer science, education, the learning sciences, cognitive and other branches of psychology, as well as practitioners/trainers in leadership development and other applied domains who use reflective writing. To our knowledge, there is no substantial reflective writing corpus available for research. As demonstrated in many other fields, a common corpus could serve as a platform to advance research on multiple fronts:

  1. It provides evaluation benchmarks in writing analytics so that different analytic tools can be tested.
  2. It provides a training and test corpus to develop classifiers through machine learning (e.g. deep neural networks).
  3. It presents different examples for teachers to teach reflective writing.
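As an illustration of the first point, the sketch below scores a tool’s output against human annotations for one rhetorical move, using precision, recall and F1. The data layout (one set of move labels per sentence) and the toy labels are assumptions made for this example; they are not a WReD format.

```python
def precision_recall_f1(gold, predicted, move):
    """Score one rhetorical move over parallel lists of per-sentence label sets."""
    pairs = list(zip(gold, predicted))
    tp = sum(1 for g, p in pairs if move in g and move in p)      # correctly tagged
    fp = sum(1 for g, p in pairs if move not in g and move in p)  # tagged in error
    fn = sum(1 for g, p in pairs if move in g and move not in p)  # missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Toy human annotations and tool output for four sentences (invented data).
gold = [{"Feelings"}, {"Learning"}, set(), {"Feelings", "Difficulties"}]
predicted = [{"Feelings"}, set(), {"Feelings"}, {"Feelings"}]

if __name__ == "__main__":
    for move in ("Feelings", "Learning", "Difficulties"):
        p, r, f = precision_recall_f1(gold, predicted, move)
        print(f"{move}: precision={p:.2f} recall={r:.2f} f1={f:.2f}")
```

With a shared corpus, the same gold annotations could be used to compare a rule-based tool and a trained classifier on equal terms.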

Thomas Ullmann’s PhD is, to our knowledge, the only work to take an ML approach to reflective writing analytics [4,5], and we aim to build on his work and ultimately extend it by deploying the technology in a live system with students.

Similar to the Stanford Question Answering Dataset (SQuAD), the Written Reflection Dataset (WReD) we intend to build will be publicly available on GitHub (a web-based hosting service for version control, with social network-like functions such as feeds, followers and wikis) under the Creative Commons Attribution-ShareAlike 4.0 International Public License. We envision that WReD will contain about 10,000 reflective writing texts, with around 100,000 instances of rhetorical moves in reflective writing, such as Experience, Feelings, Beliefs, Difficulties, Perspectives, Intentions, Learning, and Reflection.

We invite a range of contributions, including:

  • Collections of reflective writing by students and professionals
  • Collections of reflective writing annotated by humans (e.g. researchers or crowdsourced input). The annotation might be grades, or qualitative coding
  • Additional metadata associated with the texts (e.g. educational context; author metadata)

However, the creation of a corpus for reflective writing raises ethical challenges, and we invite your thoughts on what these are, and how we might address them. Our current proposal for contributing to WReD goes something like this:

  1. obtain the necessary permissions / ethics approval to share the corpus (we have obtained ethics approval from the University of Technology Sydney (UTS) to analyse UTS student writing)
  2. deidentify the text by removing any sensitive names and other details
  3. submit it as a plain text file or, if the text comes with grades or other metadata, as a CSV file
  4. provide a supporting document with clear information about any metadata fields
  5. provide a background document explaining the context that gave rise to the reflective writing, and any relevant links or references.
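Steps 2 and 3 might look something like the sketch below, which masks e-mail addresses and a supplied list of names, then writes the result as CSV. The column names (`text_id`, `text`, `grade`), the placeholders and the function names are assumptions for illustration, not a WReD specification. Pattern matching like this will miss many identifying details, so it would complement manual review rather than replace it.

```python
import csv
import io
import re

# Simple e-mail pattern; deliberately conservative and not exhaustive.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")


def deidentify(text, names):
    """Replace e-mail addresses and each listed name with a placeholder."""
    text = EMAIL.sub("[EMAIL]", text)
    for name in names:
        # Whole-word, case-insensitive match on each name supplied by the contributor.
        text = re.sub(r"\b" + re.escape(name) + r"\b", "[NAME]", text, flags=re.IGNORECASE)
    return text


def to_csv(records, fieldnames=("text_id", "text", "grade")):
    """Serialise records (dicts) to a CSV string with a header row."""
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=list(fieldnames), lineterminator="\n")
    writer.writeheader()
    for record in records:
        writer.writerow(record)
    return out.getvalue()


if __name__ == "__main__":
    raw = "Dr Smith emailed me at jo.bloggs@example.edu."
    clean = deidentify(raw, names=["Smith"])
    print(to_csv([{"text_id": "t1", "text": clean, "grade": "B"}]))
```

The `csv` module quotes any field that contains a comma, so free text can sit in a single column alongside grades and other metadata.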

What do you think? Are there examples of other text corpora dealing with sensitive material from which we can learn?

To join the conversation, please subscribe to the WReD Google Group.


Get in touch with me: @MingLiuResearch


  1. Gibson, A., Aitken, A., Sándor, Á., Buckingham Shum, S., Tsingos-Lucas, C. and Knight, S. (2017). Reflective Writing Analytics for Actionable Feedback. Proceedings of LAK17: 7th International Conference on Learning Analytics & Knowledge, March 13-17, 2017, Vancouver, BC, Canada. (ACM Press).
  2. Buckingham Shum, S., Á. Sándor, R. Goldsmith, R. Bass and M. McWilliams (2017). Towards Reflective Writing Analytics: Rationale, Methodology and Preliminary Results. Journal of Learning Analytics, 4, (1), 58–84.
  3. Ullmann, T.D. (2017). Reflective Writing Analytics – Empirically Determined Keywords of Written Reflection. Proceedings of LAK17: 7th International Conference on Learning Analytics & Knowledge, March 13-17, 2017, Vancouver, BC, Canada. (ACM Press). 163-167.
  4. Klaus H. Krippendorff. 2003. Content Analysis: An Introduction to Its Methodology. Sage Publications, Thousand Oaks, CA.
  5. Ullmann, T.D. (2015). Automated detection of reflection in texts. A machine learning based approach. PhD Thesis, The Open University, UK.
  6. Rajpurkar, P., J. Zhang, K. Lopyrev, and P. Liang, “SQuAD: 100,000+ Questions for Machine Comprehension of Text,” in Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, 2016, pp. 2383–2392.
  7. LeCun, Y., Y. Bengio, and G. Hinton, “Deep Learning,” Nature, vol. 521, no. 7553, pp. 436–444, 2015.
  8. See, A., P. J. Liu, and C. D. Manning, “Get To The Point: Summarization with Pointer-Generator Networks,” in Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics, 2017, pp. 1073–1083.
  9. Bahdanau, D., K. Cho, and Y. Bengio, “Neural Machine Translation by Jointly Learning to Align and Translate,” in Proceedings of the 3rd International Conference on Learning Representations, 2014, pp. 1–15.