Extracting Body of Text from Research Articles; Several Attempted Methods - r

I need to extract the body text from the articles in my corpus for text mining, because at the moment my code includes the references, which bias my results. All coding is performed in R using RStudio. I have tried several techniques.
I have text mining code (only the first bit is included below), but I recently found out that simply text mining a corpus of research articles is insufficient: the reference section biases the results. The reference sections on their own might support a separate analysis, which would be a bonus.
EDIT: perhaps there is an R package that I am not aware of
My initial approach was to clean the text after converting from PDF to text, using regex commands within quanteda. As a reference I intended to follow https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1005962&rev=1 . Their method confuses me, not just in writing a parallel regex, but in how to recognize the last reference section so that text is not cut off wherever "reference" appears earlier in the article. I have been in contact with their team, but am waiting to learn more about their code, since they appear to use a more streamlined program now.
PubChunks and LAPDF-text were my next two options, the latter of which is referenced in the paper above. To use the PubChunks package I need to convert all of my PDF (now converted to text) files into XML. This should be straightforward, only the packages I found (fileToPDF, pdf2xml, trickypdf) did not appear to work; this seems to be a within-R issue. (Code relating to trickypdf is included below.)
For LAPDF-text, ...[see edit]... the code did not seem to run properly. There are also very limited resources for this package in terms of guides etc., and the developers have shifted their focus to a larger package, in a different language, that happens to include LAPDF-text.
EDIT: I installed Java 1.6 (SE 6) and Maven 2.0, then ran the LAPDF-text installer, which seemed to work. That said, I am still having issues with this process and with mvn commands recognizing folders, though I am continuing to work through it.
I am guessing someone else out there has done this before and got their hands dirty, as there are related research papers with similarly vague descriptions of the process. Any recommendations are greatly appreciated.
Cheers
library(quanteda)
library(pdftools)
library(tm)
library(methods)
library(stringi) # regex pattern
library(stringr) # simpler than stringi ; uses stringi on backend
setwd('C:\\Users\\Hunter S. Baggen\\Desktop\\ZS_TestSet_04_05')
files <- list.files(pattern = 'pdf$')
summary(files)
files
# Length 63
corpus_tm <- Corpus(URISource(files),
                    readerControl = list(reader = readPDF()))
corpus_tm
# documents 63
inspect(corpus_tm)
meta(corpus_tm[[1]])
# convert tm::Corpus to quanteda::corpus
corpus_q <- corpus(corpus_tm)
summary(corpus_q, n = 2)
# Add Doc-level Variables here *by folder and meta-variable year
corpus_q
head(docvars(corpus_q))
metacorpus(corpus_q)
#_________
# extract segments ~ later to remove segments
# corpus_segment(x, pattern, valuetype, extract_pattern = TRUE)
corpus_q_refA <- corpus_reshape(corpus_q, to = "paragraphs", showmeta = TRUE)
corpus_q_refA
# Based upon Westergaard et al (15 Million texts; removing references)
corpus_q_refB <- corpus_trim(corpus_q, what = c('sentences'),
                             exclude_pattern = '^\\[\\d+\\]\\s[A-Za-z]')
corpus_q_refB # the single-backslash regex originally used here threw an escape error
corpus_tm[1]
sum(str_detect(texts(corpus_q), '^Referen')) # str_detect needs a character vector, so use texts()
corpus_qB <- corpus_q
RemoveRef_B <- corpus_segment(corpus_q, pattern = 'Reference', valuetype = 'regex')
cbind(texts(RemoveRef_B), docvars(corpus_qB))
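One idea I have not fully tested (my own sketch, not Westergaard et al.'s code): use stringi to locate the LAST heading that looks like "References" and drop everything after it, so earlier mentions of the word are ignored.
# Sketch: cut each text at the last "References"/"Bibliography" heading.
# Assumes the heading sits on its own line; adjust the pattern to your PDFs.
strip_refs <- function(txt) {
  loc <- stri_locate_last_regex(txt, "(?m)^\\s*(References|REFERENCES|Bibliography)\\s*$")
  if (is.na(loc[1, "start"])) return(txt)   # no heading found, keep the full text
  stri_sub(txt, 1, loc[1, "start"] - 1)     # keep only the body before the heading
}
corpus_q_body <- corpus(vapply(texts(corpus_q), strip_refs, character(1)))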
# -------------------------
# Idea taken from guide (must reference guide)
setGeneric('removeCitations', function(object, ...) standardGeneric('removeCitations'))
setMethod('removeCitations', signature(object = 'PlainTextDocument'),
          function(object, ...) {
            txt <- content(object)
            # EG remove quoted text starting with '>':
            #   citations <- grep('^[[:blank:]]*>.*', txt); if (length(citations) > 0) txt <- txt[-citations]
            # EG remove signatures starting with '-- ':
            #   signatureStart <- grep('^-- $', txt); if (length(signatureStart) > 0) txt <- txt[-(signatureStart:length(txt))]
            # using the 15-million-article (Westergaard et al.) removal guideline
            citations <- grep('^\\[\\d+\\]\\s[A-Za-z]', txt)
            if (length(citations) > 0) txt <- txt[-citations]
            content(object) <- txt
            object
          })
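If the method above is completed along those lines, a hypothetical way to apply it over the tm corpus (object names reused from earlier, not from the original guide):
corpus_tm_nocite <- tm_map(corpus_tm, removeCitations)  # each document comes back as a PlainTextDocument
inspect(corpus_tm_nocite[[1]])                          # check a cleaned document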
# TRICKY PDF download from github
library(pubchunks)
library(polmineR)
library(githubinstall)
library(devtools)
library(tm)
githubinstall('trickypdf') # input Y then 1 if want all related packages
# library(trickypdf)
# This time suggested I install via 'PolMine/trickypdf'
# Second attempt issue with RPoppler
install_github('PolMine/trickypdf')
library(trickypdf) # Not working
# Failed to install: package 'Rpoppler' is not available (for R version 3.6.0)
Aside from the Rpoppler issue above, the initial description should be sufficient.
UPDATE: Having reached out to several research groups, the TALN-UPF researchers got back to me and provided a pdfx Java program that has allowed me to convert my PDFs easily into XML. Of course, I have now learned that PubChunks is built around its sister program, which retrieves XML from search engines, and is therefore of little use to me. That said, the TALN-UPF group will hopefully advise whether I can extract the body of each text via their other programs (Dr Inventor and Grobid). If that is possible then everything will be accomplished; if not, I will be back at regex.
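In case the XML route works, a rough sketch of pulling body text back into R with xml2; the node names below ('//body//p') are an assumption on my part and need to be checked against whatever schema pdfx or Grobid actually produce (Grobid outputs TEI, which uses different, namespaced elements).
library(xml2)
doc   <- read_xml('example_article.xml')          # placeholder file name
paras <- xml_find_all(doc, '//body//p')           # assumed element names - verify against the real schema
body_text <- paste(xml_text(paras), collapse = '\n')
cat(substr(body_text, 1, 500))                    # eyeball the extracted body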

Related

How do I download many PubMed records using the easyPubMed package?

Nothing I've tried seems to work. I am currently running version 4.1.3 of R and have tried re-installing the easyPubMed package, but nothing helps. Here's the code I have so far:
new_query <- "(APE1[TI] OR OGG1[TI]) AND (2012[PDAT]:2016[PDAT])"
out.A <- batch_pubmed_download(pubmed_query_string = new_query, dest_file_prefix = "easyPM_example", batch_size = 150, encoding = "UTF-8")
cat(readLines(out.A[1])[1:32], sep = "\n")
For some reason, it returns with 1 row of all the collected xml-style text, followed by 31 lines of NA.
I've also looked into using fetch_pubmed_data() to serve a similar purpose, but whenever I check the class of what I get back, it is character, when it should be XMLInternalDocument and XMLAbstractDocument. Here's my code:
my_query <- '"genetic therapy"[MeSH Terms]'
my_entrez_id <- get_pubmed_ids(my_query)
my_abstracts_xml <- fetch_pubmed_data(my_entrez_id)
class(my_abstracts_xml)
Any help would be greatly, greatly appreciated!
The issue is with your readLines call; both batch_pubmed_download and fetch_pubmed_data work as expected.
In your batch_pubmed_download example, the downloaded files are XML files with three text lines (you can confirm this with readLines or, in a terminal, with wc -l). So readLines(out.A[1])[1:32] makes no sense: only 3 lines exist, and all the other indices return NA. The XML content is in those first 3 lines.
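For illustration, a small sketch (using the object names from the question) that reads however many lines the file actually contains, or parses the XML directly with xml2:
raw_lines <- readLines(out.A[1])
length(raw_lines)          # only a few lines, so no NA padding
cat(raw_lines, sep = "\n")
# or parse the downloaded file as XML (assumes the xml2 package is installed)
library(xml2)
doc <- read_xml(out.A[1])
xml_name(doc)              # root element of the PubMed XML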

Create data tables using SPSS in R

Using the expss package, I am creating cross tabs from SPSS files read into R. This works perfectly, but loading the files takes a long time. I have a folder containing several SPSS files (usually only 3), and through an R script I fetch the most recently modified of them.
setwd('/file/path/for/this/file/SPSS')
library(expss)
expss_output_viewer()
#get all .sav files
all_sav <- list.files(pattern ='\\.sav$')
#use file.info to get the index of the file most recently modified
pass<-all_sav[with(file.info(all_sav), which.max(mtime))]
mydata = read_spss(pass,reencode = TRUE) # read SPSS file mydata
w <- data.frame(mydata)
args <- commandArgs(TRUE)
Everything works fine, but it takes too long to load large files (112 MB and 48 MB, for example), which isn't good.
Is there a way to make this more time-efficient so the tables are created faster? (The dropdowns are created using PHP.)
I have searched for this and found another package called haven, but I am not sure whether it will make a significant difference. Can anyone help me with this? I would really appreciate it. Thanks in advance.
As described in the expss vignette (https://cran.r-project.org/web/packages/expss/vignettes/labels-support.html), you can use haven in the following way:
# we need to load packages strictly in this order to avoid conflicts
library(haven)
library(expss)
spss_data = haven::read_spss("spss_file.sav")
# add missing 'labelled' class
spss_data = add_labelled_class(spss_data)
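Tying that back to the original workflow, a sketch of reading the most recently modified .sav with haven while keeping the expss tables (the system.time call is only there to check whether it is actually faster on your 112 MB file; the variable names in the cro() comment are placeholders):
library(haven)
library(expss)
all_sav <- list.files(pattern = '\\.sav$')
pass    <- all_sav[with(file.info(all_sav), which.max(mtime))]
system.time(mydata <- haven::read_spss(pass))   # compare against expss::read_spss on your files
mydata <- add_labelled_class(mydata)            # restore expss label support
expss_output_viewer()
# cro(mydata$var1, mydata$var2)                 # build cross tabs as before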

Reading Text file in SparkR 1.4.0

Does anyone know how to read a text file in SparkR version 1.4.0?
Are there any Spark packages available for that?
Spark 1.6+
You can use text input format to read text file as a DataFrame:
read.df(sqlContext=sqlContext, source="text", path="README.md")
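A quick check of what comes back (assuming Spark 1.6, where the text source exposes a single string column, usually named value):
df <- read.df(sqlContext = sqlContext, source = "text", path = "README.md")
printSchema(df)   # one string column
head(df, 3)       # the first lines of the file as rows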
Spark <= 1.5
The short answer is: you don't. SparkR 1.4 has been almost completely stripped of the low-level API, leaving only a limited subset of Data Frame operations.
As you can read on an old SparkR webpage:
As of April 2015, SparkR has been officially merged into Apache Spark and is shipping in an upcoming release (1.4). (...) Initial support for Spark in R will be focussed on high level operations instead of low level ETL.
Probably the closest thing is to load text files using spark-csv:
> df <- read.df(sqlContext, "README.md", source = "com.databricks.spark.csv")
> showDF(limit(df, 5))
+--------------------+
| C0|
+--------------------+
| # Apache Spark|
|Spark is a fast a...|
|high-level APIs i...|
|supports general ...|
|rich set of highe...|
+--------------------+
Since typical RDD operations like map, flatMap, reduce, or filter are gone as well, it is probably what you want anyway.
Now, the low-level API is still there underneath, so you can always do something like the one below, but I doubt it is a good idea. SparkR developers most likely had a good reason to make it private. To quote the ::: man page:
It is typically a design mistake to use ‘:::’ in your code since
the corresponding object has probably been kept internal for a
good reason. Consider contacting the package maintainer if you
feel the need to access the object for anything but mere
inspection.
Even if you're willing to ignore good coding practices, it is most likely not worth the time. The pre-1.4 low-level API is embarrassingly slow and clumsy, and without all the goodness of the Catalyst optimizer the same most likely holds for the internal 1.4 API.
> rdd <- SparkR:::textFile(sc, 'README.md')
> counts <- SparkR:::map(rdd, nchar)
> SparkR:::take(counts, 3)
[[1]]
[1] 14
[[2]]
[1] 0
[[3]]
[1] 78
Note that spark-csv, unlike textFile, ignores empty lines.
Please follow the link
http://ampcamp.berkeley.edu/5/exercises/sparkr.html
We can simply use:
textFile <- textFile(sc, "/home/cloudera/SparkR-pkg/README.md")
While checking the SparkR code, Context.R has a textFile method, so ideally a SparkContext should have a textFile API to create the RDD, but that is missing from the docs.
# Create an RDD from a text file.
#
# This function reads a text file from HDFS, a local file system (available on all
# nodes), or any Hadoop-supported file system URI, and creates an
# RDD of strings from it.
#
# @param sc SparkContext to use
# @param path Path of file to read. A vector of multiple paths is allowed.
# @param minPartitions Minimum number of partitions to be created. If NULL, the default
#   value is chosen based on available parallelism.
# @return RDD where each item is of type \code{character}
# @export
# @examples
#\dontrun{
#  sc <- sparkR.init()
#  lines <- textFile(sc, "myfile.txt")
#}
textFile <- function(sc, path, minPartitions = NULL) {
  # Allow the user to have a more flexible definition of the text file path
  path <- suppressWarnings(normalizePath(path))
  # Convert a string vector of paths to a string containing comma separated paths
  path <- paste(path, collapse = ",")
  jrdd <- callJMethod(sc, "textFile", path, getMinPartitions(sc, minPartitions))
  # jrdd is of type JavaRDD[String]
  RDD(jrdd, "string")
}
Follow the link
https://github.com/apache/spark/blob/master/R/pkg/R/context.R
For test case
https://github.com/apache/spark/blob/master/R/pkg/inst/tests/test_rdd.R
In fact, you could use the databricks/spark-csv package to handle TSV files too.
For example,
data <- read.df(sqlContext, "<path_to_tsv_file>", source = "com.databricks.spark.csv", delimiter = "\t")
A host of options is provided here -
databricks-spark-csv#features

R: Improving workflow and keeping track of output

I have what I think is a common enough issue, on optimising workflow in R. Specifically, how can I avoid the common issue of having a folder full of output (plots, RData files, csv, etc.) without, after some time, having a clue where they came from or how they were produced? In part, it surely involves trying to be intelligent about folder structure. I have been looking around, but I'm unsure of what the best strategy is.
So far I have tackled it in a rather unsophisticated (overkill) way: I created a function, MetaInfo (see below), that writes a text file with metadata, with a given file name. The idea is that if a plot is produced, this command is issued to produce a text file with exactly the same file name as the plot (except, of course, the extension), with information on the system, session, packages loaded, R version, the function and file the metadata function was called from, etc. The questions are:
(i) How do people approach this general problem? Are there obvious ways to avoid the issue I mentioned?
(ii) If not, does anyone have any tips on improving this function? At the moment it's perhaps clunky and not ideal. In particular, getting the name of the file from which the plot is produced doesn't necessarily work (the solution I use is one provided by @hadley in [1]). Any ideas would be welcome!
The function assumes git, so please ignore the probable warning produced. This is the main function, stored in a file metainfo.R:
MetaInfo <- function(message=NULL, filename)
{
  # message - character string - Any message to be written into the information
  #           file (e.g., data used).
  # filename - character string - the name of the txt file (including relative
  #            path). Should be the same as the output file it describes (RData,
  #            csv, pdf).
  #
  if (is.null(filename))
  {
    stop('Provide an output filename - parameter filename.')
  }
  filename <- paste(filename, '.txt', sep='')
  # Try to get as close as possible to getting the file name from which the
  # function is called.
  source.file <- lapply(sys.frames(), function(x) x$ofile)
  source.file <- Filter(Negate(is.null), source.file)
  t.sf <- try(source.file <- basename(source.file[[length(source.file)]]),
              silent=TRUE)
  if (class(t.sf) == 'try-error')
  {
    source.file <- NULL
  }
  func <- deparse(sys.call(-1))
  # MetaInfo isn't always called from within another function, so func could
  # return as NULL or as general environment.
  if (any(grepl('eval', func, ignore.case=TRUE)))
  {
    func <- NULL
  }
  time <- strftime(Sys.time(), "%Y/%m/%d %H:%M:%S")
  git.h <- system('git log --pretty=format:"%h" -n 1', intern=TRUE)
  meta <- list(Message=message,
               Source=paste(source.file, ' on ', time, sep=''),
               Functions=func,
               System=Sys.info(),
               Session=sessionInfo(),
               Git.hash=git.h)
  sink(file=filename)
  print(meta)
  sink(file=NULL)
}
which can then be called in another function, stored in another file, e.g.:
source('metainfo.R')
RandomPlot <- function(x, y)
{
  fn <- 'random_plot'
  pdf(file=paste(fn, '.pdf', sep=''))
  plot(x, y)
  MetaInfo(message=NULL, filename=fn)
  dev.off()
}
x <- 1:10
y <- runif(10)
RandomPlot(x, y)
This way, a text file with the same file name as the plot is produced, with information that could hopefully help figure out how and where the plot was produced.
In terms of general R organization: I like to have a single script that recreates all work done for a project. Any project should be reproducible with a single click, including all plots or papers associated with that project.
So, to stay organized: keep a different directory for each project, each project has its own functions.R script to store non-package functions associated with that project, and each project has a master script that starts like
## myproject
source("functions.R")
source("read-data.R")
source("clean-data.R")
etc... all the way through. This should help keep everything organized, and if you get new data you just go to early scripts to fix up headers or whatever and rerun the entire project with a single click.
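To tie this back to the metadata question, the master script can also record how each run was produced, so the output folder is always traceable to a session (the file names below are placeholders, not a prescription):
## run-all.R -- rerun the whole project and stamp the output
source("functions.R")
source("read-data.R")
source("clean-data.R")
source("make-plots.R")                       # placeholder for the remaining steps
dir.create("output", showWarnings = FALSE)
writeLines(c(format(Sys.time()), capture.output(sessionInfo())),
           "output/last-run-info.txt")       # when and with what the output was produced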
There is a package called ProjectTemplate that helps organize and automate the typical workflow with R scripts, data files, charts, etc. There are also a number of helpful documents, like this one: Workflow of statistical data analysis by Oliver Kirchkamp.
If you use Emacs and ESS for your analyses, learning Org-Mode is a must. I use it to organize all my work. Here is how it integrates with R: R Source Code Blocks in Org Mode.
There is also this new free tool called Drake which is advertised as "make for data".
I think my question belies a certain level of confusion. Having looked around, as well as explored the suggestions provided so far, I have reached the conclusion that it is probably not important to know where and how a file was produced. You should in fact be able to wipe out any output and reproduce it by rerunning the code. So while I might still use the above function for extra information, it really is a question of being ruthless and indeed cleaning up folders every now and then. These ideas are more eloquently explained here. This of course does not preclude the use of Make/Drake or ProjectTemplate, which I will try to pick up on. Thanks again for the suggestions @noah and @alex!
There is also now an R package called drake (Data Frames in R for Make), independent from Factual's Drake. The R package is also a Make-like build system that links code/dependencies with output.
install.packages("drake") # It is on CRAN.
library(drake)
load_basic_example()
plot_graph(my_plan)
make(my_plan)
Like its predecessor remake, it has the added bonus that you do not have to keep track of a cumbersome pile of files. Objects generated in R are cached during make() and can be reloaded easily.
readd(summ_regression1_small) # Read objects from the cache.
loadd(small, large) # Load objects into your R session.
print(small)
But you can still work with files as single-quoted targets. (See 'report.Rmd' and 'report.md' in my_plan from the basic example.)
There is a package developed by RStudio called pins that might address this problem.
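For reference, a minimal pins sketch using the newer board/pin_write API (check the pins documentation for your installed version, since the older pin()/pin_get() interface differs):
library(pins)
board <- board_folder("output/pins")                       # folder-backed local board
pin_write(board, mtcars, name = "mtcars_example",
          description = "example pin with metadata attached")
pin_meta(board, "mtcars_example")                          # who/when/what metadata for the pin
df <- pin_read(board, "mtcars_example")                    # read the pinned object back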

Read SPSS file into R

I am trying to learn R and want to bring in an SPSS file, which I can open in SPSS.
I have tried using read.spss from foreign and spss.get from Hmisc. Both error messages are the same.
Here is my code:
## install.packages("Hmisc")
library(foreign)
## change the working directory
getwd()
setwd('C:/Documents and Settings/BTIBERT/Desktop/')
## load in the file
## ?read.spss
asq <- read.spss('ASQ2010.sav', to.data.frame=T)
And the resulting error:
Error in read.spss("ASQ2010.sav", to.data.frame = T) : error
reading system-file header In addition: Warning message: In
read.spss("ASQ2010.sav", to.data.frame = T) : ASQ2010.sav: position
0: character `\000' (
Also, I tried saving out the SPSS file as a SPSS 7 .sav file (was previously using SPSS 18).
Warning messages: 1: In read.spss("ASQ2010_test.sav", to.data.frame =
T) : ASQ2010_test.sav: Unrecognized record type 7, subtype 14
encountered in system file 2: In read.spss("ASQ2010_test.sav",
to.data.frame = T) : ASQ2010_test.sav: Unrecognized record type 7,
subtype 18 encountered in system file
I had a similar issue and solved it following a hint in read.spss help.
Using package memisc instead, you can import a portable SPSS file like this:
data <- as.data.set(spss.portable.file("filename.por"))
Similarly, for .sav files:
data <- as.data.set(spss.system.file('filename.sav'))
although in this case I seem to miss some string values, while the portable import works seamlessly. The help page for spss.portable.file claims:
The importer mechanism is more flexible and extensible than read.spss and read.dta of package "foreign", as most of the parsing of the file headers is done in R. They are also adapted to load efficiently large data sets. Most importantly, importer objects support the labels, missing.values, and descriptions, provided by this package.
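For example, once imported you can inspect that extra metadata with memisc's own helpers (the file name is a placeholder):
library(memisc)
imp  <- spss.system.file("filename.sav")   # importer object; data not loaded yet
data <- as.data.set(imp)
description(data)                          # variable descriptions / labels
codebook(data)                             # value labels and missing values per variable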
read.spss seems to be a little outdated, so I used a package called memisc.
To get this to work do this:
install.packages("memisc")
data <- as.data.set(spss.system.file('yourfile.sav'))
You may also try this:
setwd("C:/Users/rest of your path")
library(haven)
data <- read_sav("data.sav")
and if you want to read all files from one folder:
temp <- list.files(pattern = "\\.sav$")  # regex, not a glob: match files ending in .sav
read.all <- sapply(temp, read_sav)
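Note that sapply() returns a list of data frames here (one per file). If the files share the same structure, a sketch for stacking them into a single frame, assuming dplyr is available:
read.all <- lapply(temp, read_sav)                     # one tibble per .sav file
names(read.all) <- temp
library(dplyr)
combined <- bind_rows(read.all, .id = "source_file")   # keep the file name as a column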
I know this post is old, but I also had problems loading a Qualtrics SPSS file into R. R's read.spss code came from PSPP a long time ago, and hasn't been updated in a while. (And Hmisc's code uses read.spss(), too, so no luck there.)
The good news is that PSPP 0.6.1 should read the files fine, as long as you specify a "String Width" of "Short - 255 (SPSS 12.0 and earlier)" on the "Download Data" page in Qualtrics. Read it into PSPP, save a new copy, and you should be in business. Awkward, but free.
You can read an SPSS file from R using the solutions above or the one you are currently using. Just make sure the command is fed a file that it can read properly. I had the same error, and the problem was that SPSS could not access the file. Make sure the file path is correct, the file is accessible, and it is in the correct format.
library(foreign)
asq <- read.spss('ASQ2010.sav', to.data.frame=TRUE)
As far as the warning message is concerned, it does not affect the data. Record type 7 is used to store features of newer SPSS software so that older SPSS software can still read the data; it does not affect the data itself. I have used this numerous times and no data is lost.
You can also read about this at http://r.789695.n4.nabble.com/read-spss-warning-message-Unrecognized-record-type-7-subtype-18-encountered-in-system-file-td3000775.html#a3007945
It looks like the R read.spss implementation is incomplete or broken. R 2.10.1 does better than R 2.8.1, however. It appears that R gets upset about custom attributes in a .sav file even with 2.10.1 (the latest I have). R also may not understand the character encoding field in the file, and in particular it probably does not work with SPSS Unicode files.
You might try opening the file in SPSS, deleting any custom attributes, and resaving the file.
You can see whether there are custom attributes with the SPSS command
display attributes.
If so, delete them (see VARIABLE ATTRIBUTE and DATAFILE ATTRIBUTE commands), and try again.
HTH,
Jon Peck
If you have access to SPSS, save the file as .csv and then import it with read.csv or read.table. I can't recall any problem with .sav file importing; so far it has worked like a charm both with read.spss and spss.get. I reckon that spss.get will not give different results, since it depends on foreign::read.spss.
Can you provide some info on your SPSS/R/Hmisc/foreign versions?
Another solution not mentioned here is to read SPSS data in R via ODBC. You need:
IBM SPSS Statistics Data File Driver. Standalone driver is enough.
Import SPSS data using RODBC package in R.
See the example here. However I have to admit that, there could be problems with very big data files.
For me it works well using memisc!
install.packages("memisc")
library(memisc)
Daten.Februar <-as.data.set(spss.system.file("NPS_Februar_15_Daten.sav"))
names(Daten.Februar)
I agree with @SDahm that the haven package would be the way to go. I myself have struggled a bit with string values when starting to use it, so I thought I'd share my approach on that here, too.
The "semantics" vignette has some useful information on this topic.
library(tidyverse)
library(haven)
# Some interesting information in here
vignette('semantics')
# Get data from spss file (path_to_file is a placeholder for your .sav file)
spss_file <- read_sav(path_to_file)
# get value labels
df <- map_df(.x = spss_file, .f = function(x) {
  if (class(x) == 'labelled') as_factor(x) else x
})
# get column names from the variable labels stored by haven
colnames(df) <- map(.x = spss_file, .f = function(x) {attr(x, 'label')})
There is no problem with the packages you are using. The only requirement for reading an SPSS file this way is to put it into portable format: SPSS files normally have the *.sav extension, and you need to transform your SPSS file into a portable file that uses the *.por extension.
There is more info in http://www.statmethods.net/input/importingdata.html
In my case this warning was combined with the appearance of a new variable before the first column of my data with values -100, 2, 2, 2, ..., a shift in the correspondence between labels and values, and the deletion of the last variable. A solution that worked was (using SPSS) to create a new dummy variable in the last column of the file, fill it with random values, and execute the following code:
(filename is the path to the .sav file; in my case the original SPSS file had 62 columns, thus 63 with the additional dummy variable)
library(memisc)
data <- as.data.set(spss.system.file(filename))
copyofdata <- data
for (i in 2:63) {
  names(data)[i] <- names(copyofdata)[i - 1]
}
data[[1]] <- NULL
newcopyofdata <- data
for (i in 2:62) {
  labels(data[[i]]) <- labels(newcopyofdata[[i - 1]])
}
labels(data[[1]]) <- NULL
Hope the above code will help someone else.
Turn Unicode off in SPSS:
Open SPSS without any data set open and run the code below in your syntax editor:
SET UNICODE OFF.
Open the data set and resave it to remove the Unicode encoding.
read.spss('yourdata.sav', to.data.frame=T) then works correctly.
I just came across an SPSS file that I couldn't open using haven, foreign, or memisc, but readspss::read.por did the trick for me:
download.file("http://www.tcd.ie/Political_Science/elections/IMSgeneral92.zip",
"IMSgeneral92.zip")
unzip("IMSgeneral92.zip", exdir = "IMSgeneral92")
# rio, haven, foreign, memisc pkgs don't work on this file! But readspss does:
if(!require(readspss)) remotes::install_git("https://github.com/JanMarvin/readspss.git")
ims92 <- readspss::read.por("IMSgeneral92/IMS_Nov7 92.por", convert.factors = FALSE)
Nice! Thanks, @JanMarvin!
1)
I've found the program Stat/Transfer useful for importing SPSS and Stata files into R.
It resolves the issue you mention by converting the SPSS file to an R dataset. It is also very useful for subsetting super-large datasets into smaller portions consumable by R. It is not free, but it is a very useful tool for working with datasets from different programs -- especially if you don't have access to those programs.
2)
The memisc package also has an SPSS import function worth trying.
