Cleaning essays in R (ish)

 

[Image: "Error messages" / "Should I worry about this?"]

I’m currently working on a corpus of student essays. We want to run various analyses on these essays, but for that to work we need them in a fairly uniform state, with some elements removed or modified for processing purposes.

As is entirely predictable, the essays vary a lot in format and structure, for example:

  1. Most are docx format, some are pdf
  2. Some include the student ID in the filename, many do not (astonishingly, some don’t include it anywhere in the text either!)
  3. Some are formatted in columns
  4. Some have included the essay prompt at the beginning of the text, most have not
  5. Student IDs and names are scattered across: Start/end sections, filename, headers/footers
  6. Many figures and tables are included, along with captions. We can’t do anything meaningful with these in our tools at the moment; the figures are lost in conversion to .txt anyway, but the other remnants are noise for analysis.

I started trying to do some cleaning of the above manually, and realised that would take me about 3 days for every batch – not viable. So, I moved on to try and automate. My thanks to Laura Allen at ASU for her incredibly useful advice on this (to be clear, my terrible code has nothing to do with her).

I went about trying to:

  1. Convert all documents to .txt
  2. Rename all documents to the student ID, and remove any references to their name or ID from the document text
  3. Remove any front or tail text
  4. Remove headings, captions, and other random short text elements
  5. Convert references and links to unigrams (otherwise e.g. individual names and dates become features, which we don’t want)
  6. Clean some punctuation (specifically dashes, hyphens, etc. – a massive hassle)
  7. Convert bullets and numbered lists to paragraph text.
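
In case it’s useful to see the shape of the thing before the real code, the text-munging steps (3, 4, 6 and 7) can be sketched roughly like this. To be clear, this is my own illustrative Python, not the implementation below; the function name and the 75-character cutoff are mine:

```python
import re

def clean_essay(text):
    """Illustrative sketch of steps 3, 4, 6 and 7 above (names/thresholds are mine)."""
    # step 3: drop front matter before "Abstract" and tail text after "References"
    parts = re.split(r"abstract", text, maxsplit=1, flags=re.IGNORECASE)
    text = parts[1] if len(parts) == 2 else text
    text = re.split(r"references|bibliography", text, maxsplit=1, flags=re.IGNORECASE)[0]
    # step 6: strip spaces either side of a dash, then unify the dash character
    text = re.sub(r"\s+(?=[-\u2013\u2014])|(?<=[-\u2013\u2014])\s+", "", text)
    text = re.sub(r"[\u2013\u2014]", "-", text)
    # step 7: fold bulleted/numbered lines back into paragraph text
    text = re.sub(r"\n\s*(?:[*\u2022]|\d+[.)])\s*", " ", text)
    # step 4: drop short, heading-like lines (75 chars is an arbitrary cutoff here)
    lines = [l for l in text.split("\n") if len(l) >= 75 or l == ""]
    return "\n".join(lines)
```

The real version has to cope with far messier input (Roman numerals, lettered lists, multiple dash types), but the order of operations is the same.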

And I wanted to do all of this in R… however, I realised that was going to be a challenge, so instead it’s all called from within R/RStudio, with some system calls and a bit of embedded Python.

I’ve copied the working code (which I will clean up at some point) below…as will be obvious, it’s not pretty.

So lots of it is from Stack Overflow, and I asked 3 questions specifically:

  1. Removing spaces surrounding dashes in R
  2. Regular expression for matching a variety of types of numbered lists in R
  3. Finding short strings in new lines in an R character vector
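
For the first of those, the approach I ended up with strips whitespace on either side of any Unicode dash (`\p{Pd}`) and then normalises the dash itself. A rough Python equivalent, for anyone who thinks in that language (Python’s `re` has no `\p{Pd}`, so an explicit dash character class stands in, and the function name is mine):

```python
import re

# Stand-in for \p{Pd}: hyphen-minus plus the common typographic dashes
# (hyphen, non-breaking hyphen, figure dash, en dash, em dash, horizontal bar).
DASHES = r"[-\u2010-\u2015]"

def normalise_dashes(s):
    # remove whitespace immediately before or after any dash...
    s = re.sub(rf"\s+(?={DASHES})|(?<={DASHES})\s+", "", s)
    # ...then collapse the typographic dash variants to a plain hyphen
    return re.sub(r"[\u2010-\u2015]", "-", s)
```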

Late on, I found the qdapRegex package, which has a load of canned regex expressions (including for URL and citation extraction), so I rewrote some of the code and reduced it a bit. I’ve left most of the old stuff in though, so you can see the drafting (and the utter mess…so many things to improve). The ‘tryCatch’ stuff was also added late…ahem…to identify where specific errors were coming from.
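
The citation-to-unigram idea is easier to see outside the loop, so here’s a toy Python rendering of the same sequence. A naive regex stands in for qdapRegex’s ex_citation (which handles far more citation shapes), and the helper name is mine:

```python
import re

def citation_unigrams(text):
    """Toy version of the citation-unigram steps: find parenthetical
    citations, strip punctuation, tag the year, squash whitespace."""
    # naive stand-in for qdapRegex's ex_citation: '(Name, 2009)' style only
    cites = set(re.findall(r"\(([A-Z][A-Za-z]+(?:,? (?:&|and) [A-Z][A-Za-z]+)*,? \d{4})\)", text))
    for c in cites:
        token = re.sub(r"[^\w\s]", "", c)                 # drop punctuation
        token = re.sub(r"(\d{4})", r"\1REFCITE", token)   # tag the year
        token = re.sub(r"\s+", "", token)                 # collapse whitespace
        text = text.replace(f"({c})", token)
    return text
```

The point is that each citation becomes a single opaque token, so author names and years stop showing up as separate features downstream.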

Challenge to the reader – improve it! Have I missed anything? (I’m aware headers/footers are currently imported, but I think they’re excluded by the short-line remover.)
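
That short-line remover boils down to two thresholds. A standalone Python rendering of the heuristic, using the same numbers as the code below (the function name is mine):

```python
def drop_short_lines(text, hard=75, soft=150):
    """Drop heading-like lines: anything under `hard` characters goes;
    lines under `soft` survive only if they contain at least three of
    '.', ':' or '?' (i.e. they read like prose, not a caption)."""
    kept = []
    for line in text.split("\n"):
        punct = line.count(".") + line.count(":") + line.count("?")
        if len(line) >= soft or (len(line) >= hard and punct >= 3):
            kept.append(line)
    return "\n".join(kept)
```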

library(tm)
source('aspellCheck.R')

# folder with 1000s of PDFs
dest <- "science/turnitin_zip_download"

# make a vector of PDF file names
myfiles <- list.files(path = dest, pattern = "pdf",  full.names = TRUE)

# convert each PDF file that is named in the vector into a text file 
# text file is created in the same directory as the PDFs
# note that my pdftotext.exe is in a different location to yours
lapply(myfiles, function(i) system(paste('"xpdfbin-win-3.04/xpdfbin-win-3.04/bin64/pdftotext.exe"', 
             paste0('"', i, '" -layout')), wait = FALSE) )

#this should work but doesn't seem to...
#pyExec('
#import glob
#import magic
#import io
#
#files = glob.glob("science/turnitin_zip_download_cleaned/*.txt")
#
#for x in files:
#  blob = open(x).read()
#  m = magic.Magic(mime_encoding=True)
#  enc = m.from_buffer(blob)
#  with io.open(x, "r", encoding=enc) as f:
#    text = f.read()
#  #process Unicode text
#  with io.open(x, "w", encoding="utf8") as f:
#    f.write(text)
#'
#)

#the below is stable and does the conversion (but doesn't seem to be working perfectly)
#pyExec('
#import glob
#import io
#files = glob.glob("science/turnitin_zip_download_cleaned/*.txt")
#for x in files:
#  with io.open(x, "r", encoding="utf8", errors = "ignore") as f:
#    text = f.read()
#  #process Unicode text
#  with io.open(x, "w", encoding="utf8") as f:
#    f.write(text)
#'
#)

#Use doc2txt (perl). Other options, e.g. http://poi.apache.org/, https://ask.libreoffice.org/en/question/2641/convert-to-command-line-parameter/, https://github.com/pzaich/doc_ripper, or extracting the data yourself: unzip -p document.docx word/document.xml | sed -e 's/<\/w:p>/\n/g; s/<[^>]\{1,\}>//g; s/[^[:print:]\n]\{1,\}//g'
#cmd /C ""C:\Program Files (x86)\OpenOffice 4\program\soffice.exe" -headless -conversionmode -convert-to txt:"Text" "C:\Users\5295\Teaching\student data\123 science\turnitin_zip_download_cleaned\826397.docx" -outputdir "C:\Users\5295\Teaching\student data\123 science\turnitin_zip_download_cleaned" "

#once any manual cleaning of the .docx is done, make a vector of the doc file names
dest <- "science/turnitin_zip_download_cleaned"
myfiles <- list.files(path = dest, pattern = "docx",  full.names = TRUE)


#in python (i use activestate IDE with pypm) pip install python-docx2txt
library(PythonInR)
autodetectPython()
pyConnect()

pyExec('
import docx2txt
import glob
files = glob.glob("science/turnitin_zip_download_cleaned/*.docx")

for x in files:
  text = docx2txt.process(x)
  text = text.encode("utf8")
  file = open(x + ".txt", "w")
  file.write(text)
  file.close()
'
)

#I think this now works, it's worth running whenever new files are imported, primarily I care about the pdf files but no harm in re-encoding the docx files imported above
pyExec('
import chardet
import glob
import io

files = glob.glob("science/turnitin_zip_download_cleaned/*.txt")

for x in files:
  rawdata = open(x, "r").read()
  result = chardet.detect(rawdata)
  charenc = result["encoding"]
  with io.open(x, "r", encoding=charenc, errors = "replace") as f:
    text = f.read()
  #process Unicode text
  with io.open(x, "w", encoding="utf8") as f:
    f.write(text)
'
)

The below chunk gets IDs for matching purposes:

#read the sheet with all the student IDs and names in them
students <- read.csv("science/students.csv", header = T, stringsAsFactors = F)
#list the .txt files
files <- "science/turnitin_zip_download_cleaned"
myfiles <- list.files(path = files, pattern = "*.txt",  full.names = TRUE, all.files = T, ignore.case = T)
rm(files)

#get the ID column
IDs <- students$STUDENT.NO
#get the student names, join these
students$join <- paste(students$FIRST.NAME, students$LAST.NAME, sep = " ")
students$joinrev <- paste(students$LAST.NAME, students$FIRST.NAME, sep = " ")

Names <- students$join
Names2 <- students$joinrev

library(readr)
library(stringr)
dir <- "science/turnitin_zip_download_cleaned/"

Then the actual cleaning and processing:

for(ppl in myfiles){
  tryCatch({

y <- read_file(ppl) #load the file into memory, using the readr package
####################################################
####################################################
####################################################
#FIND student IDs, and set the filename to them.
student_id <- str_extract_all(ppl,IDs, simplify = T) #check the filename for a student ID 
student_id <- unique(student_id[student_id != ""])
id_there <- length(unique(student_id[student_id != ""])) #check if any are returned

student_id_file <- str_extract_all(y,IDs, simplify = T) #check the file for a student ID 
student_id_file <- unique(student_id_file[student_id_file != ""])
id_there_file <- length(unique(student_id_file[student_id_file != ""])) #check if any are returned
#later these variables are used to rename the file

####################################################
####################################################
####################################################
#FIND names in the text, and delete them and the student ID
#find names
student_name <- str_extract_all(y, coll(Names, ignore_case = T), simplify = T)
student_name <- c(student_name,str_extract_all(y, coll(Names2, ignore_case = T), simplify = T))
student_name <- unique(student_name[student_name != ""])

#now I want to write a gsub, to go through that list and the student ID, and remove all instances from the text
str <- paste(str_trim(paste(student_name,"|", collapse = "|"), side = c("both")), str_trim(paste(student_id, collapse = "|"), side = c("both")), collapse = "|")
str <- gsub(" \\| ", "|", str) #escape the | character and replace " | " with "|", can't work out how to do this through the paste settings or str_trim
#test <- substr(y, 95, 110)
tryCatch({
y <- gsub("\\s+(?=\\p{Pd})|(?<=\\p{Pd})\\s+", "", y, perl=TRUE) # make sure any hyphenated names (and other stuff, to be addressed later) are actually connected per the name list

y <- gsub("\\p{Pd}", "-", y, perl=TRUE) # make sure all the hyphens are the same type
  }, error=function(e){cat("ERROR in the hyphen stripping bit:",conditionMessage(e),ppl, "\n")})

y <- gsub(str, "", y, ignore.case = T, perl=T) #remove names n IDS from the text
#substr(y, 95, 110)
       
####################################################
####################################################
####################################################
#now I want to delete a whole host of other things

y <- gsub("`", "'", y, ignore.case = T) # Check people are using ' not `
y <- gsub("-", " ", y, ignore.case = T) # remove -- or - (parsers don't like them)

# Remove everything before "abstract" (deletes the forematter including student ID/name, title, etc.)
tryCatch({
y <- ifelse(grepl("abstract", y, ignore.case = T), unlist(strsplit(y,"abstract|Abstract|ABSTRACT"))[2], y)
  }, error=function(e){cat("ERROR in the abstract removal bit:",conditionMessage(e),ppl, "\n")})

# Remove everything after "References" (deletes reference list, and some students put their student ID/name at the end)
tryCatch({
y <- ifelse(grepl("references|References|REFERENCES|Bibliography|bibliography|BIBLIOGRAPHY", y, ignore.case=T), unlist(strsplit(y,"references|References|REFERENCES|Bibliography|bibliography|BIBLIOGRAPHY"))[1], y)
  }, error=function(e){cat("ERROR in the references removal bit:",conditionMessage(e),ppl, "\n")})

# convert bullets and numbered lists to paragraph text, make sure each ends in a period first, then delete the line breaks
y <- gsub("(\r?\n|\r)\\*", " ", y) #replace any instances of a new line and then an asterisk, such that the new lines are collapsed into the originating paragraph and the asterisks removed
y <- gsub("(\r?\n|\r)[1-9]", " ", y) #replace any instances of a new line and then one of:
# i ii iii, a,b,c, 1,2,3 with any of these formats: (x); x); x.  # 1, 2, 3 # 1.1, 1.1.1.1  # A. B. C.
y <- gsub("\r(?m)\n?^[a-zA-Z0-9]+\\.+[a-zA-Z0-9]|\r(?m)\n?^[a-zA-Z0-9]+\\.|\r(?m)\n?^[a-zA-Z0-9]+\\)|\n(?m)\r?^[a-zA-Z0-9]+\\.|\n(?m)\r?^[a-zA-Z0-9]+\\)|\n(?m)\r?^\\([a-zA-Z0-9]+\\)|\r(?m)\n?^\\([a-zA-Z0-9]+\\)", "", y, perl=T)

# Strip multiple whitespace paragraphs
y <- gsub("\r", "\n", y, perl=T) #replace all carriage returns with newlines (should move this up really)
y <- gsub("\n+", "\n", y, perl=T) #replace multiple newlines with one

# Strip anything more than a double space
y <- gsub("  ", " ", y) # working on the basis that if it's more than a double space, and survived to this point, it probably has a purpose

# headings - ideally, remove these, unless only looking for global cohesion.  One way to do this is by looking for anything that's on a single line (sentences under, say, 20 words, with or without a period, starting and ending with \n or \r)
tryCatch({
holding <- unlist(strsplit(y,"\n"))
holding <- lapply(holding, function (bits) ifelse(nchar(bits) < 75, "", ifelse(nchar(bits) < 150, ifelse(sum(str_count(bits, "\\."),str_count(bits, ":"),str_count(bits, "\\?")) < 3, "", bits), bits)))
holding <- holding[holding != ""]; # without elements that are empty
#recombine that back into y
y <- paste(holding, collapse = "\n")
rm(holding)
  }, error=function(e){cat("ERROR in the single line removal bit:",conditionMessage(e),ppl,"\n")})

#gsub("/^\\n(.{50,}[.?:].*){2,}$/gmi", "\n", test, perl = T) #this regex works...
####################################################
####################################################
####################################################
# Convert all () parenthetical references and in-text citations to uni-grams ('this is referenced in 2009Knight')
#http://stackoverflow.com/questions/1138552/replace-string-in-parentheses-using-regex
# http://stackoverflow.com/questions/8613237/extract-info-inside-all-parenthesis-in-r
#library (devtools)
#install_github("trinker/qdapRegex") #this is pretty cool, canned regex - INCLUDING CITATION EXTRACTION!! And URL identification
mgsub <- function(pattern, replacement, x, ...) {
  if (length(pattern)!=length(replacement)) {
    stop("pattern and replacement do not have the same length.")
  }
  result <- x
  for (i in 1:length(pattern)) {
    result <- gsub(pattern[i], replacement[i], result, ...)
  }
  result
}

library(qdapRegex)

y <- gsub("~", "", y) #want to use this character momentarily
tryCatch({
bracks_original <- unique(unlist(ex_citation(y))) #this uses the qdapRegex function
#bracks <- unique(as.data.frame(regmatches(y, gregexpr("(?<=\\().*?(?=\\))", y, perl=T))[[1]])) #both of these give me all the bracketed contents
#gsub("\\(([^()]+)\\)", "\\1", str_extract_all(y, "\\(([^()]+)\\)")[[1]]) Ideally we'd also capture e.g. Name and Name (YYYY) refs.
#bracks$rep <- apply(bracks[1], 1, function (refs) unlist(str_split(gsub("([0-9]{4})","\\1~", refs), "~"))) # then using strsplit I can split these up by 4 digit date (in case there are brackets with multiple citations in them); this replaces the dates so there's a ~ after them, and then splits on that (the final "~"), which is then discarded. This gives us a list of 'Knight 2009' form refs
#bracks$rep <- lapply(bracks$rep, as.list) #for some reason it's created as a factor, so fix this (might be useful later?)
bracks <- unlist(lapply(bracks_original, function(refs) gsub("[[:punct:]]", "", refs, perl = T)))
#bracks$rep <- apply(bracks[2], 1, function (refs) gsub("[[:punct:]]", "", unlist(refs), perl = T)) #delete punctuation
#bracks$rep <- apply(bracks[2], 1, function (refs) gsub(" ", "", unlist(refs)))#then delete whites pace
#bracks$rep <- apply(bracks[2], 1, function (refs) unlist(refs, recursive = T))
#library(rlist)
#bracks$rep <- apply(bracks[2], 1, function (refs) list.clean(unlist(refs), fun = function(els) nchar(els) == 0|els == "", recursive = T)) #get rid of empty elements
bracks <- unlist(lapply(bracks, function(refs) gsub("([0-9]{4})", paste("\\1","REFCITE"), refs, perl = T)))
bracks <- unlist(lapply(bracks, function(refs) gsub("\\s+", "", refs, perl = T)))
#bracks$rep <- apply(bracks[2], 1, function (refs) lapply(refs, function(ref) gsub("([0-9]{4})",paste("\\1","cite"), ref)))#add 'cite' to all ref strings
#bracks$rep <- apply(bracks[2], 1, function (refs) ifelse(grepl("cite", unlist(refs)), gsub(" ", "", unlist(refs)), unlist(refs)))#then delete white  space in ref strings (but not others)
#bracks$count <- unlist(lapply(bracks$rep, function (refs) length(unlist(refs)))) #add a ref counter
  }, error=function(e){cat("ERROR in the ref getting bit:", conditionMessage(e), ppl, "\n")})
#then do a find and replace on all items in the lists
#library(gsubfn)
#apply(bracks[1,], 1, function (refs) y <<- gsub(paste0(refs[1]), ifelse(refs[3]>1, paste0("cite",unlist(refs[2])), paste0("cite", refs[2])), y)) #
tryCatch({
y <- mgsub(bracks_original, bracks, y)
#apply(bracks, 1, function (refs) y <<- gsub(paste0(refs[1]), paste(unlist(refs[2]), collapse = " "), y)) #paste in each element & rewrite to y...this is horribly inefficient (because it rewrites y on each pass)
  }, error=function(e){cat("ERROR in the ref compiling bit:",conditionMessage(e), ppl, "\n")})

#mapply(gsub, bracks[1], (ifelse(bracks[3]>1, paste0("cite", unlist(bracks[2])), paste0("cite", bracks[2]))), y)
#gsubfn(bracks[1], ifelse(bracks[3]>1, paste0("cite",unlist(bracks[2])), paste0("cite", bracks[2])), y)
rm(bracks, bracks_original)
#SUGGESTION - for both the URL and http matching, build up a list outside of the loop, and then apply the index from that list. That'll give you information re: which citations are used by whom.

# Convert all URLs to named-strings ('this is referenced in httpKnight')
tryCatch({
#url_pattern <- "http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\\(\\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+"
links <- unique(ex_url(y))
#links <- list(unique(unlist(str_extract_all(y, url_pattern))))
links_clean <- list(gsub("[[:punct:]]", "", unlist(links)))
links <- links[links != ""] # without elements that are empty
ifelse(length(unlist(links)) != 0, {
  links <- unlist(links, recursive = T) #level up
  links_clean <- unlist(links_clean, recursive = T)
  names(links) <- unlist(links)
  names(links_clean) <- unlist(links)
  y <- mgsub(links, links_clean, y)
  #for(i in seq_along(unlist(links))) y <- gsub(links[[1]][i], links_clean[[1]][i], y, fixed = TRUE) #another way of matching and replacing
},"")
  }, error=function(e){cat("ERROR in the url pattern:",conditionMessage(e), ppl, "\n")})

# check for footnotes/endnotes/headers/footers but I think these are ok in this corpus (the law corpus will be horrible for this) - I ran a test with a footnote and it looks like these don't get transcribed into R.
#Headers and footers are unfortunately imported by the python code, and I can't see how to turn this off. However, they should be removed by (1) removing the student name and ID, and (2) removing all the short lines with no punctuation...

#devtools::install_github("schaunwheeler/tmt")
#do some spelling correction per http://www.r-bloggers.com/automatic-cleaning-of-messy-text-data/
#note, I'm setting the PATH in the source  (it's the only mod from the original). This is pretty slow.
#aspellCheck(y, output = "fix", sep = FALSE, keep_caps = TRUE, pattern_flag = NULL, word_flag = NULL, split_missing = FALSE, mode = "ultra", dict = "en_GB")

#write it all back to the file
writeLines(y,ppl)

#rename the file, using the IDs from earlier
ifelse(id_there > 0, file.rename(from = paste0(ppl), to = paste0(dir,student_id,".txt")), "") #if any IDs are in the filename, rewrite filename to that ID, else nothing
ifelse(id_there == 0 & id_there_file > 0, file.rename(from = paste0(ppl), to = paste0(dir,student_id_file,".txt")), "") #otherwise, if IDs are in the file contents, rewrite filename to that ID

  }, error=function(e){cat("ERROR :",conditionMessage(e), ppl, "\n")})

}

#end the loop

Source: SK Blog
Link: Cleaning essays in R (ish)
