When analysing written texts, there are various reasons we might want to understand how similar multiple texts are. We might be interested in:
- Whether a text is plagiarised, either from a particular external source or from another student
- Whether a text has features indicating particular authorship
- Whether a text has features indicating a particular genre
- Whether a text apparently draws on particular sources (that is, it has features, such as topics, indicating that particular sources were used in its writing)
Sometimes similarity is a good thing: it can indicate that a useful source has been used productively, or that the writing adheres to particular stylistic conventions. Other times it is less welcome, as is the case in plagiarism.
Notes on plagiarism
It is important to note, though, that there is a range of reasons why texts overlap (some of which might look like plagiarism), and that in some cases (e.g. Creative Commons adaptation) the reuse and alteration of material is perfectly acceptable (see e.g. this post). Indeed, as a recent discussion on the ALT discussion list noted, software can detect similarity, but judgement is required to understand where that similarity comes from.
In the case of plagiarism, though, there are some great resources available (this list is in no way exhaustive):
- UTS have developed a really nice quiz to help students understand the requirements
- there’s also a nice paper from Rolfe on using Turnitin for formative feedback on plagiarism
- the University of Sydney has some great advice on ways to prevent students from plagiarising
- JISC/Oxford Brookes published a great guide on the same issue, including ‘designing out’ opportunities for plagiarism (e.g. by not using the same assignments every year!)
- and a nice resource on combating ‘contract cheating’ (from the HEA)
- In addition (and with more acknowledgement of appropriate adaptation of open knowledge material), Wikipedia has some rather good guidance on Plagiarism and Copyright, and the WikiEd Foundation produced a really nice leaflet.
Importantly, as McGowan discusses (abstract):
…there is a first step that is still being overlooked, the initial induction of students into the research-led, evidence-based culture of academic endeavour. By focussing on rules and strategies for avoiding plagiarism, but ignoring the basic reasons for these requirements, we have put the cart before the horse. This paper suggests that tertiary induction of new students needs to focus firstly on developing an appreciation of the culture of enquiry that characterises learning at the tertiary level and that success is more likely if the students’ goal is something positive: to achieve a new approach to learning, than if it is something negative: to avoid ‘committing’ plagiarism.
Text similarity methods
In any case, there are, of course, various methods for identifying text similarity (beyond Turnitin or Urkund).
In a paper I really like (Hastings et al., 2012), the researchers asked students to answer a writing prompt making use of three texts with relatively little semantic overlap, in order to investigate which sources were drawn on in the outputs. They used three methods for this: pattern matching (i.e., looking for common text strings); latent semantic analysis (LSA), to compare semantic content at a sentence level across student outputs and assigned texts; and machine learning (using support vector machines), to assign student sentences to topic classes assigned by human raters. Their analysis indicated that LSA was best at identifying explicit use of the assigned texts, with pattern matching superior at detecting intra- and inter-textual inferences.
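To make the pattern-matching idea concrete, here is a minimal sketch in Python. It is not the method Hastings et al. used; it is a simpler stand-in that counts shared word trigrams between a student sentence and each source. The function names, the source texts and the student sentence are all hypothetical, made up for illustration.

```python
import re

def word_ngrams(text, n=3):
    """Return the set of lowercase word n-grams in `text`."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def source_overlap(sentence, sources, n=3):
    """Map each source name to the fraction of the sentence's n-grams it contains."""
    grams = word_ngrams(sentence, n)
    if not grams:
        return {name: 0.0 for name in sources}
    return {
        name: len(grams & word_ngrams(text, n)) / len(grams)
        for name, text in sources.items()
    }

# Hypothetical source texts and student sentence (illustration only).
sources = {
    "text_a": "The river flooded the valley after heavy rain in the spring.",
    "text_b": "Factories along the coast released waste into the sea.",
}
sentence = "After heavy rain in the spring the town was cut off."

scores = source_overlap(sentence, sources)
best_source = max(scores, key=scores.get)
print(scores, best_source)
```

Exact n-gram matching like this only catches near-verbatim reuse; a paraphrase that changes the wording would score zero, which is where a semantic method such as LSA comes in.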
Some resources on these methods:
- Bodong Chen discusses using latent semantic analysis for document similarity in R
- Matching of keywords or strings – the idea here is to look for the use of similar expressions across documents. E.g. Tony Hirst had a nice post on n-gram matching recently. What n-gram matching misses, though, is the kind of information tf-idf measures capture: some expressions (or terms) are infrequent in a document set, so their occurrence in both a source document and an output document is probably notable (it implies the source document has indeed been drawn on), even if longer patterns do not match. This Stack Overflow question discusses tf-idf and cosine similarity, and there is a nice discussion here.
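As a rough illustration of the tf-idf and cosine similarity idea, here is a minimal pure-Python sketch. It assumes one common weighting, term count times log(N / document frequency); real libraries use various smoothed variants. The function names and the three example documents are hypothetical.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Split text into lowercase word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())

def tfidf_vectors(docs):
    """Return one {term: weight} dict per document, weight = tf * log(N / df)."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents each term appears in.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        vectors.append({t: c * math.log(n_docs / df[t]) for t, c in tf.items()})
    return vectors

def cosine(u, v):
    """Cosine similarity between two sparse {term: weight} vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

# Hypothetical documents: two sources and one student output.
docs = [
    "Glacial meltwater raises sea levels around the world.",
    "Urban traffic raises noise levels in the city.",
    "The essay argues that meltwater from glacial ice is a problem.",
]
vec_a, vec_b, vec_out = tfidf_vectors(docs)
sim_a = cosine(vec_out, vec_a)
sim_b = cosine(vec_out, vec_b)
print(sim_a, sim_b)
```

Here the output shares no three-word string with either source, but the relatively rare terms "glacial" and "meltwater" link it to the first one, while "the" (present in every document) gets a weight of zero and contributes nothing.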
Tools to explore document similarities
- Doccop – Identifies associations between documents, or a document and the web – https://doccop.com/index.html
- Chimpsky – Identifies duplicated text within documents, or between a document and the web http://chimpsky.uwaterloo.ca/
- There are various online tools to check similarity between two pasted-in texts, e.g. http://utext.rikuz.com/en/ or between two given URLs, e.g. http://tool.motoricerca.info/similarity-analyzer.phtml
- Then there are various offline tools to compare multiple documents, e.g. http://winmerge.org/about/ ; http://copycat.en.softonic.com/
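For a rough sense of what comparing two documents for duplicated passages involves, here is a minimal sketch using Python's standard difflib module. It is not how any of the tools listed above work internally (that is not documented here); the function name, the `min_words` threshold and the two example texts are assumptions for illustration.

```python
from difflib import SequenceMatcher

def shared_passages(text_a, text_b, min_words=4):
    """Return word sequences of at least `min_words` words found in both texts."""
    words_a, words_b = text_a.split(), text_b.split()
    # Compare word lists rather than characters; autojunk=False disables
    # difflib's heuristic that ignores very frequent items in long sequences.
    matcher = SequenceMatcher(None, words_a, words_b, autojunk=False)
    return [
        " ".join(words_a[m.a:m.a + m.size])
        for m in matcher.get_matching_blocks()
        if m.size >= min_words
    ]

# Hypothetical example texts.
doc_a = "Students should learn that the culture of enquiry matters most at university"
doc_b = "We argue that the culture of enquiry matters in every seminar"

passages = shared_passages(doc_a, doc_b)
print(passages)
```

Note that difflib only reports matching blocks that appear in the same order in both texts, so passages that have been moved around may be missed.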
Tools to detect keyphrases in a document that appear elsewhere on the web
- Duplichecker – Searches the web for phrases taken from a provided document and returns links to their sources (could be used to identify the original source of material, or use of that material across the web) http://www.duplichecker.com/
- Plagiarism-detect – does a similar job to the tools above, but in a much slicker way (identified text is highlighted and becomes a clickable hyperlink to its source on the web); it seems it’s not free, though http://plagiarismdetect.org/howitworks.html
- Plagiarism-checker – same as above; a nice feature is that it ignores text in quotation marks, i.e. it skips already cited material and looks only at the uncited. http://www.dustball.com/cs/plagiarism.checker/
- PlagScan – Same again, looks for keyphrases from a given text across the web, returns matches http://www.plagscan.com/seesources/search.php?
- ArticleChecker – same as above, seems to be targeted at content-producers finding reuse of their content across the web (and isn’t working for me currently) http://www.articlechecker.com/
- Plagiarism-checker – Same as above but a bit limited. It lets you run a few searches at once (but you have to select the lines of text to search on; it won’t do it for you), and is targeted at both content-producers and teachers. It has a “Google Alerts” function built in, which is pretty cool, to alert authors if their content appears elsewhere on the web http://www.plagiarismchecker.com/
- Copyscape – same as above: give it a URL and it’ll show whether the content is being used elsewhere on the web, again with a Google alert. Fairly limited. http://www.copyscape.com/