One of my data sources is a set of etherpads which were used (mostly in small groups) to write a report. The data includes the final output, with some basic formatting, in which we could look for URLs, cue phrases, etc. It also includes the whole revision history of each etherpad.
I’ve been looking around for some code to help me out, and to my surprise it appears that (a) not much learning research has been done on such data (specifically using trace data, rather than video, etc.) and (b) where research has been conducted, it used custom tools/code which have not been released.
From the academic side, a quick look on Google Scholar (first 10 pages of ~30, plus some citation following; note this probably excludes some Google Docs and wiki-related research and similar) turned up a few interesting things:
- This cool project – Collabode – which developed a collaborative real-time coding space using Etherpad+Eclipse alludes to such analysis in the paper about it, but I can’t see any expansion or available code.
- Stian Håklev has also talked about some cool ideas around etherpad idea-convergence and scripts to work with etherpads.
- Another paper (Hirsch, Benjamin, et al. “Collaborative learning in action.” Teaching, Assessment and Learning for Engineering (TALE), 2013 IEEE International Conference on. IEEE, 2013.) which used an etherpad describes analysis, but again without code: “Each keystroke entered into the CLEs collaborative editing pad is recorded, including edits, deletes, copy/pastes, etc., and is stored in a database. A statistics module was built in order to visually display information about the pad (see Fig. 2), including a breakdown of each students contribution (such as typed characters, final characters, copy/paste/delete actions) along with a means to see how students contributed to the pad over time. This tool can be accessed by faculty in order to assist in forming an understanding of each individual’s contribution to a collaborative assignment.”
- Similar paper: Vahakangas, Taneli, and Joel Pyykko. “VisciPad: Peeking into a Collaborative Creative Writing Project in Elementary School.” Creating, Connecting and Collaborating through Computing (C5), 2012 10th International Conference on. IEEE, 2012.
- Which cites this nice paper: Southavilay, Vilaythong, Kalina Yacef, and Rafael A. Calvo. “Process Mining to Support Students’ Collaborative Writing.” EDM. 2010. which used process mining (ProM) and an LSA tool to run analysis on contribution types on Google Docs contributions (again, I can’t see anything reproducible/downloadable). The tool is described in more detail in this paper Southavilay, Vilaythong, Kalina Yacef, and Rafael A. Calvo. “WriteProc: A framework for exploring collaborative writing processes.” ADCS 2009 (2009): 129.
- Liu, M., Calvo, R. A., & Pardo, A. (2013, July). Tracer: A Tool to Measure and Visualize Student Engagement in Writing Activities. In Advanced Learning Technologies (ICALT), 2013 IEEE 13th International Conference on (pp. 421-425). IEEE. – Tracer looks interesting, but I can’t see any code.
- Southavilay, V., Yacef, K., Reimann, P., & Calvo, R. A. (2013, April). Analysis of collaborative writing processes using revision maps and probabilistic topic models. In Proceedings of the Third International Conference on Learning Analytics and Knowledge (pp. 38-47). ACM. From the abstract, analysis includes: “(1) the revision map, which summarises the text edits made at the paragraph level, over the time of writing. (2) the topic evolution chart, which uses probabilistic topic models, especially Latent Dirichlet Allocation (LDA) and its extension, DiffLDA, to extract topics and follow their evolution during the writing process. (3) the topic-based collaboration network, which allows a deeper analysis of topics in relation to author contribution and collaboration, using our novel algorithm DiffATM in conjunction with a DiffLDA-related technique”
- Handayani, N. S. (2012). Examining the Writing Phases and Revision Patterns in Online Collaborative Writing: What Can We Learn from Them?. Malaysian Journal of Distance Education, 14(2), 39-62.
And I’ve seen a few discussions in forums and plugin spaces:
- This plugin exports pads along with authorship metadata for each contribution (but only at the line level)
- This plugin also exports authorship metadata, possibly at a finer grain?
- A developer was working on a stats plugin (commercial license) (here’s another (dead) thread on it) which (from contacting John) includes:
  - Characters
  - Word counts
  - Revision counts
  - Saved revisions
  - Authors
  - A set of author stats, including (a) n of words contributed, (b) n of lines contributed to, (c) n of lines as only contributor, (d) n of characters
So the third is probably worth a look, and the second might be useful if the ‘spans’ of authorship colour are fully exported in the HTML (much easier to work with than the etherpad data structure).
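If the exported HTML does keep the authorship spans, pulling them out could be fairly simple. Here is a minimal sketch using only Python's standard library. It assumes (and this is purely an assumption, I haven't checked what either plugin actually emits) that each authored run is a `<span>` whose class list contains a token starting with `author-`. The class prefix and the `author_runs` helper are made up for illustration.

```python
from html.parser import HTMLParser


class AuthorSpanParser(HTMLParser):
    """Collect (author, text) runs from HTML with author-tagged spans.

    ASSUMPTION: each authored run sits in a <span> whose class list has a
    token starting with "author-". The real export format may differ.
    """

    def __init__(self):
        super().__init__()
        self.stack = []  # author label (or None) for each currently open <span>
        self.runs = []   # collected (author, text) pairs, in document order

    def handle_starttag(self, tag, attrs):
        if tag != "span":
            return
        classes = (dict(attrs).get("class") or "").split()
        author = next((c for c in classes if c.startswith("author-")), None)
        self.stack.append(author)

    def handle_endtag(self, tag):
        if tag == "span" and self.stack:
            self.stack.pop()

    def handle_data(self, data):
        # The innermost enclosing author-tagged span wins; None if there is none.
        author = next((a for a in reversed(self.stack) if a), None)
        self.runs.append((author, data))


def author_runs(html):
    """Return the list of (author, text) runs found in an HTML string."""
    parser = AuthorSpanParser()
    parser.feed(html)
    parser.close()
    return parser.runs


sample = '<span class="author-a1">Hello </span><span class="author-b2 bold">world</span>!'
runs = author_runs(sample)
```

Text outside any author span comes back with `None` as the author, which at least makes unattributed text easy to spot.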
Given the above bits of research, and thinking about (a) what etherpad records and (b) what sort of things we’re interested in for learning contexts, it’s interesting to consider what we’d want from an etherpad analytic tool. E.g.
- The stats from the ep_stats plugin above (especially author contribution counts, and a proportion-based measure here).
- N of ‘touch points’ – e.g. if every other word were written by a different author, the N of touch points would be 1/2 the N of words; we’d probably want some way to express this as a number between 0 and 1
- N of uninterrupted blocks (similar to (c) above: ‘n of lines as only contributor’)
- Temporal analysis?
  - Perhaps including contribution over time versus contribution in the final pad (a crude ‘survival’ measure), or deletion over time (and authorial deletion), e.g. did one author appear to contribute less when in fact their edits were deleted?
  - Do groups engage in different processes, e.g. working on their own sections throughout or co-editing throughout, making notes and then refining or starting from ‘clean’ text? Do they engage in ‘linear’ editing (adding at the bottom) or other forms of insertion, etc.?
  - Possibly tied in with other trace data (e.g. when URL x was inserted, was it being talked about in the chat?)
- Topic-based analysis (possibly related to temporal), as per Southavilay et al. (2013) above: we might be interested in whether individuals contribute to just one topic or across several, and whether they tend to start topics, join them later, or a mixture, etc. (On topics, see also Stian’s stuff around topic tags in ‘2’ above.)
- Chat data (if using the etherpad chat), e.g. did those chatting more edit more, and do chats and edits co-occur?
- ???
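To make the crude ‘survival’ idea from the list above a bit more concrete, here is a rough sketch. It does not read Etherpad's own revision format; instead it assumes the history has already been converted into hypothetical simplified events, `("insert", author, position, text)` and `("delete", author, position, length)`. The event shape and the `replay` function are my own invention for illustration.

```python
from collections import Counter


def replay(events):
    """Replay simplified edit events and attribute every character to an author.

    ASSUMPTION: events are hypothetical tuples, not Etherpad's native format:
      ("insert", author, position, text)
      ("delete", author, position, length)
    """
    owners = []                 # owners[i] = author of the i-th character in the pad
    typed = Counter()           # characters each author ever inserted
    lost_to_others = Counter()  # an author's characters deleted by someone else

    for kind, author, pos, payload in events:
        if kind == "insert":
            owners[pos:pos] = [author] * len(payload)
            typed[author] += len(payload)
        elif kind == "delete":
            for owner in owners[pos:pos + payload]:
                if owner != author:
                    lost_to_others[owner] += 1
            del owners[pos:pos + payload]
        else:
            raise ValueError(f"unknown event kind: {kind}")

    final = Counter(owners)  # characters per author in the final pad
    survival = {a: final[a] / typed[a] for a in typed}
    return typed, final, lost_to_others, survival


events = [
    ("insert", "ann", 0, "hello world"),  # ann types 11 characters
    ("insert", "bob", 11, " again"),      # bob appends 6 characters
    ("delete", "bob", 0, 6),              # bob deletes "hello " (all ann's text)
]
typed, final, lost_to_others, survival = replay(events)
```

In this toy history ann typed more than bob but ends up with less text in the final pad, which is exactly the case where final-pad counts alone would be misleading.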
One thing I’m interested in is just a very simple operationalisation in which:
- Collaboration is taken to be the extent to which authors interact around the same text (i.e., their edits ‘touch’ more often, as per ‘2’ above)
- Co-operation is taken to be the extent to which authors edit in their own areas, so that their contributions are ‘stacked’ rather than interlinked (i.e. there are fewer edits touching, even though the overall pad size might be similar, as per ‘3’ above)
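A minimal sketch of how this might be computed, assuming the final pad's authorship has already been extracted as a list of `(author, text)` runs in document order (that input shape and the function names are my assumptions). Here a ‘touch point’ is counted wherever two adjacent words have different authors, and the count is divided by the number of adjacent word pairs to give a 0-1 index. Note this is one possible counting rule: under it, strictly alternating words score 1.0, and a word split between two authors is counted as two words.

```python
from itertools import groupby


def word_authors(runs):
    """Flatten (author, text) runs into one author label per word."""
    return [author for author, text in runs for _ in text.split()]


def interleaving(runs):
    """Return (touch points, touch points scaled to 0-1) for the final text."""
    labels = word_authors(runs)
    if len(labels) < 2:
        return 0, 0.0
    touches = sum(a != b for a, b in zip(labels, labels[1:]))
    return touches, touches / (len(labels) - 1)


def uninterrupted_blocks(runs):
    """Maximal stretches of consecutive words by one author, as (author, n_words)."""
    return [(author, len(list(group))) for author, group in groupby(word_authors(runs))]


# Two toy pads of similar size: one 'stacked', one 'interlinked'.
stacked = [("ann", "one two three "), ("bob", "four five six")]
interlinked = [("ann", "one "), ("bob", "two "), ("ann", "three "), ("bob", "four")]

stacked_score = interleaving(stacked)
interlinked_score = interleaving(interlinked)
stacked_blocks = uninterrupted_blocks(stacked)
```

On this reading, a pad with an index near 0 looks more like co-operation (stacked sections) and one nearer 1 looks more like collaboration, though a single-author pad also scores 0, so it would need reading alongside the author counts.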
What else might we want? And does anyone know any (ideally easy, or implemented) ways to do any interesting things with such data?