How to calculate readability in R with the tm package

Is there a pre-built function for this in the tm library, or one that plays nicely with it?
My current corpus is loaded into tm, something like the following:
s1 <- "This is a long, informative document with real words and sentence structure: introduction to teaching third-graders to read. Vocabulary is key, as is a good book. Excellent authors can be hard to find."
s2 <- "This is a short jibberish lorem ipsum document. Selling anything to strangers and get money! Woody equal ask saw sir weeks aware decay. Entrance prospect removing we packages strictly is no smallest he. For hopes may chief get hours day rooms. Oh no turned behind polite piqued enough at. "
stuff <- rbind(s1,s2)
d <- Corpus(VectorSource(stuff[,1]))
I tried using koRpus, but it seems silly to retokenize in a different package than the one I'm already using. I also had problems vectorizing its return object in a way that would allow me to reincorporate the results into tm. (Namely, due to errors, it would often return more or fewer readability scores than the number of documents in my collection.)
I understand I could do a naive calculation that parses vowels as syllables, but I want a more thorough package that already takes care of the edge cases (silent e's, etc.).
My readability scores of choice are Flesch-Kincaid or Fry.
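For reference, here is roughly what I mean by the naive approach (a sketch I would rather avoid; it counts vowel groups as syllables, so silent e's and similar edge cases are handled wrongly):
# A sketch of the naive approach: count vowel groups as syllables and plug the
# totals into the Flesch-Kincaid grade formula. Silent e's etc. are mishandled.
naive_syllables <- function(word) {
  max(1, length(gregexpr("[aeiouy]+", tolower(word))[[1]]))
}
naive_fk_grade <- function(text) {
  sentences <- unlist(strsplit(text, "[.!?]+\\s*"))
  words <- unlist(strsplit(tolower(text), "[^a-z']+"))
  words <- words[nchar(words) > 0]
  syllables <- sum(sapply(words, naive_syllables))
  0.39 * (length(words) / length(sentences)) + 11.8 * (syllables / length(words)) - 15.59
}
naive_fk_grade(s1)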
What I tried originally, where d is my corpus of 100 documents:
f <- function(x) tokenize(x, format="obj", lang='en')
g <- function(x) flesch.kincaid(x)
x <- foreach(i=1:length(d), .combine='c',.errorhandling='remove') %do% g(f(d[[i]]))
Unfortunately, x returns fewer than 100 documents, so I can't associate successes with the correct document. (This is partly my misunderstanding of foreach versus lapply in R, but I found the structure of a text object sufficiently difficult that I could not appropriately tokenize, apply flesch.kincaid, and successfully check errors in a reasonable sequence of apply statements.)
UPDATE
Two other things I've tried in order to apply the koRpus functions to the tm object:
Pass the arguments into tm_map, using the default tokenizer:
tm_map(d,flesch.kincaid,force.lang="en",tagger=tokenize)
Define a tokenizer and pass that in:
f <- function(x) tokenize(x, format="obj", lang='en')
tm_map(d,flesch.kincaid,force.lang="en",tagger=f)
Both of these returned:
Error: Specified file cannot be found:
It then lists the full text of d[1]. So it seems to have found the text? What should I do to pass the function correctly?
UPDATE 2
Here's the error I get when I try to map koRpus functions directly with lapply:
> lapply(d,tokenize,lang="en")
Error: Unable to locate
Introduction to teaching third-graders to read. Vocabulary is key, as is a good book. Excellent authors can be hard to find.
This looks like a strange error: I suspect it does not mean it can't locate the text, but rather that it can't locate some blank error code (such as 'tokenizer') before dumping the located text.
UPDATE 3
Another problem with retagging in koRpus was that it (versus the tm tagger) was extremely slow and printed its tokenization progress to stdout. Anyway, I've tried the following:
f <- function(x) capture.output(tokenize(x, format="obj", lang='en'),file=NULL)
g <- function(x) flesch.kincaid(x)
x <- foreach(i=1:length(d), .combine='c',.errorhandling='pass') %do% g(f(d[[i]]))
y <- unlist(sapply(x,slot,"Flesch.Kincaid")["age",])
My intention here is to bind the y object above back to my tm corpus d as metadata, via meta(d, "F-KScore") <- y.
Unfortunately, applied to my actual data set, I get the error message:
Error in FUN(X[[1L]], ...) :
cannot get a slot ("Flesch.Kincaid") from an object of type "character"
I think one element of my actual corpus must be an NA, or too long, or something else prohibitive, and due to the nested functions I am having trouble tracking down exactly which one it is.
So, currently, it looks like there is no pre-built function for readability scores that plays nicely with the tm library. Unless someone sees an easy error-catching solution I could sandwich into my function calls to deal with the inability to tokenize some apparently erroneous, malformed documents?
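One error-catching idea along those lines would be a tryCatch wrapper that returns NA for any document that fails, keeping the scores aligned with the corpus (a sketch only, untested on the full 100-document set; the slot access mirrors my earlier attempt):
# A sketch: wrap tokenize/flesch.kincaid in tryCatch so that every document
# yields exactly one value, with NA for documents that fail.
library(koRpus)
fk_age <- function(txt) {
  tryCatch({
    tagged <- tokenize(txt, format = "obj", lang = "en")
    slot(flesch.kincaid(tagged), "Flesch.Kincaid")[["age"]]
  }, error = function(e) NA)
}
scores <- sapply(seq_along(d), function(i) fk_age(as.character(d[[i]])))
meta(d, "F-KScore") <- scores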

You get an error because the koRpus functions can't deal with a tm corpus object. It is better to create a list of kRp.tagged objects and then apply the koRpus functions to it. Here I will show how I do this using the ovid data of the tm package.
I use list.files to get my list of source files. You just need to give the right path to your source text files.
ll.files <- list.files(path = system.file("texts", "txt", package = "tm"),
                       full.names = TRUE)
Then I construct a list of kRp.tagged objects using tokenize, which is the default tagger shipped with the koRpus package (it is recommended to use TreeTagger, but you need to install it):
ll.tagged <- lapply(ll.files, tokenize, lang="en") ## tm_map is just a wrapper of `lapply`
Once I have my list of "tagged" objects, I can apply the readability formulas to them. Since flesch.kincaid is a wrapper of readability, I will apply the latter directly:
ll.readability <- lapply(ll.tagged,readability) ## readability
ll.freqanalysis <- lapply(ll.tagged,kRp.freq.analysis) ## Conduct a frequency analysis
ll.hyphen <- lapply(ll.tagged,hyphen) ## word hyphenation
And so on. All of this produces lists of S4 objects. The desc slot gives easy access to their contents:
lapply(lapply(ll.readability ,slot,'desc'), ## I apply desc to get a list
'[',c('sentences','words','syllables'))[[1]] ## I subset to get some indexes
[[1]]
[[1]]$sentences
[1] 10
[[1]]$words
[1] 90
[[1]]$syllables
all s1 s2 s3 s4
196 25 32 25 8
You can, for example, use the hyphen slot to get a data frame with two columns, word (the hyphenated words) and syll (the number of syllables). Here, using lattice, I bind all the data frames together to plot a dotplot for each document:
library(lattice)
ll.words.syl <- lapply(ll.hyphen,slot,'hyphen') ## get the list of data.frame
ll.words.syl <- lapply(seq_along(ll.words.syl), ## add a column to distinguish docs
function(i)cbind(ll.words.syl[[i]],group=i))
dat.words.syl <- do.call(rbind,ll.words.syl)
dotplot(word~syll|group,dat.words.syl,
scales=list(y=list(relation ='free')))

I'm sorry the koRpus package doesn't interact with the tm package that smoothly yet. I've been thinking of ways to translate between the two object classes for months now but haven't yet come up with a really satisfying solution. If you have ideas for this, don't hesitate to contact me.
However, I'd like to refer you to the summary() method for readability objects produced by koRpus, which returns a condensed data.frame of relevant results. This is probably much easier to access than the alternative crawl through the rather complex S4 objects ;-) You might also try summary(x, flat=TRUE).
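Applied to the list of readability objects from the answer above, that would look roughly like this (a sketch; the exact columns returned depend on your koRpus version):
ll.summary <- lapply(ll.readability, summary, flat = TRUE)  # one condensed result per document
ll.summary[[1]]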
@agstudy: Nice graph :-) To save some time, you should run hyphen() before readability(), so you can re-use the results via the "hyphen" argument. Or you can simply access the "hyphen" slot of readability() results afterwards. It will hyphenate automatically if needed and keep the results. A manual call to hyphen() should only be necessary if you need to change its output before the next steps. I might add that 0.05-1 is much faster at this than its predecessors.
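Concretely, something like this (a sketch reusing agstudy's ll.tagged list from above):
ll.hyphen <- lapply(ll.tagged, hyphen)                      # hyphenate once
ll.readability <- lapply(seq_along(ll.tagged), function(i)  # re-use the hyphenation results
  readability(ll.tagged[[i]], hyphen = ll.hyphen[[i]]))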

As of qdap version 1.1.0, qdap has a number of functions to be more compatible with the tm package. Here is a way to approach your problem using the same Corpus you provide (note that Fry was originally a graphical measure and qdap preserves this; also, given your Corpus and the random sampling that Fry's approach uses, your sample Corpus is not large enough to calculate a Fry score on):
library(qdap)
with(tm_corpus2df(d), flesch_kincaid(text, docs))
## docs word.count sentence.count syllable.count FK_grd.lvl FK_read.ease
## 1 s1 33 1 54 16.6 34.904
## 2 s2 49 1 75 21.6 27.610
with(tm_corpus2df(d), fry(text, docs))
## To plot it
qheat(with(tm_corpus2df(d), flesch_kincaid(text, docs)), values=TRUE, high="red")

Related

Determining which version of a function is active when many packages are loaded

If I have multiple packages loaded that define functions of the same name, is there an easy way to determine which version of the function is currently the active one? Let's say I have base R, the tidyverse, and a bunch of time series packages loaded. I'd like a function which_package("intersect") that would tell me the package name of the active version of the intersect function. I know you can go back and look at all the warning messages you received when installing packages, but I think that sort of manual search is not only tedious but also error-prone.
There is a function here that does sort of what I want, except it produces a table for all conflicts rather than the value for one function. I would actually be quite happy with that, and would also accept a similar function as an answer, but I have had problems with the implementation of the given function. As applied to my examples, it inserts vast amounts of white space and many duplicates of the package names (e.g. the %>% function shows up with 132 packages listed), making the output hard to read and hard to use. It seems like it should be easy to remove the white space and duplicates, and I have spent considerable time on various approaches that I expected to work but which had no impact on the outcome.
So, for an example of many conflicts:
install.packages(c("tidyverse", "fpp3", "tsbox", "rugarch", "Quandl", "DREGAR", "dynlm", "zoo",
                   "GGally", "dyn", "ARDL", "bigtime", "BigVAR", "dLagM", "VARshrink"))
lapply(c("tidyverse", "fable", "tsbox", "rugarch", "Quandl", "DREGAR", "dynlm", "zoo",
         "GGally", "dyn", "ARDL", "bigtime", "BigVAR", "dLagM", "VARshrink"),
       library, character.only = TRUE)
You can pull this information with your own helper function.
which_package <- function(fun) {
  if (is.character(fun)) fun <- getFunction(fun)
  stopifnot(is.function(fun))
  x <- environmentName(environment(fun))
  if (!is.null(x)) return(x)
}
This will return R_GlobalEnv for functions that you define in the global environment. There is also the packageName function if you really want to restrict it to packages only.
For example
library(MASS)
library(dplyr)
which_package(select)
# [1] "dplyr"

Trouble accessing quanteda corpus quantities in version >= 2

I am having a problem when running the same script I wrote before. Back then, when I applied quanteda::corpus to a readtext object, it returned an object of class "corpus" and "list". But when I run the same script now, it returns objects of class "corpus" and "character". This affects the subsequent code. What could be the reason for this, and how can I solve this issue?
Here is the script:
library(readtext)
library(quanteda)
library(stringi)

txt <- readtext("C:/Users/aerol/Desktop/txt_sample")
corpus_txt <- corpus(txt) %>%
  corpus_reshape(to = "sentences")
docvars(corpus_txt, "Treaty") <- corpus_txt$documents$`_document`
docvars(corpus_txt, "Year") <- as.integer(stri_sub(corpus_txt$documents$`_document`, -9, -6))
The files are international treaties. All the filenames are in the same format: they contain the name of the treaty and the year it was signed, and I was extracting these.
Back then the class of corpus_txt was "corpus" "list":
> class(corpus_txt)
[1] "corpus" "list"
But now:
> class(corpus_txt)
[1] "corpus" "character"
> packageVersion("quanteda")
[1] ‘2.1.2’
And I cannot extract information from the corpus the way I did before. I have been working on this since last October, so I assumed I was using the same version all along.
Many thanks in advance.
We changed the corpus internal structure in v2, after two years of warning in the documentation that users should not access the corpus internals directly, or their code would likely not work under future major versions.
From https://github.com/quanteda/quanteda/blob/master/NEWS.md#quanteda-20:
quanteda 2.0 introduces some major changes, detailed here.
New corpus object structure.
The internals of the corpus object have been redesigned, and now are based around a character vector with meta- and system-data in attributes. These are all updated to work with the existing extractor and replacement functions. If you were using these before, then you should not even notice the change. Docvars are now handled separately from the texts, in the same way that docvars are handled for tokens objects.
From ?corpus:
For quanteda >= 2.0, this is a specially classed character vector. It has many additional attributes but you should not access these attributes directly, especially if you are another package author. Use the extractor and replacement functions instead, or else your code is not only going to be uglier, but also likely to break should the internal structure of a corpus object change. Using the accessor and replacement functions ensures that future code to manipulate corpus objects will continue to work.
Solution? Use docnames(corpus_txt).
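For example, the original script could be adapted like this (a sketch; the stri_sub() offsets simply mirror the original code, and the docvars are set before reshaping so they carry over to the sentence-level corpus):
library(readtext)
library(quanteda)
library(stringi)

txt <- readtext("C:/Users/aerol/Desktop/txt_sample")
corpus_txt <- corpus(txt)
docvars(corpus_txt, "Treaty") <- docnames(corpus_txt)
docvars(corpus_txt, "Year") <- as.integer(stri_sub(docnames(corpus_txt), -9, -6))
corpus_txt <- corpus_reshape(corpus_txt, to = "sentences")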

Named Entity Recognition for Tweets using R

My objective : I am trying to recognize the location in a tweet(if it exists).
I tried using the openNLP package and ran into an "out of memory" error several times, in spite of increasing the heap memory size.
The code terminates after identifying locations for only 6-8 tweets.
I am interested in only 100-150 locations(scope list), but matching each word in a tweet with the list for a collection of tweets is extremely inefficient.
I wanted to know if there are any suitable packages that enable NER for Twitter data in R, besides 'NLP'?
Also, what would be the most efficient way to perform this routine ?
I am not very familiar with python/Java hence would like to use R.
Thank You.
Okay, so I sorted out the out-of-memory issue: I was importing the models repeatedly for each tweet, so I now create the annotators once, outside the loop, and it works fine.
Here is my code:
# Sys.setenv(JAVA_HOME = 'C:\\Program Files\\Java\\jre7')        # for the 64-bit version
# Sys.setenv(JAVA_HOME = 'C:\\Program Files (x86)\\Java\\jre7')  # for the 32-bit version
library(rJava)
library(NLP)
library(openNLP)
# install.packages("openNLPmodels.en_1.5-1.tar.gz", repos = "http://datacube.wu.ac.at/", type = "source")
# download en-ner-location.bin from http://opennlp.sourceforge.net/models-1.5/ and save it in the location given below

# Create the annotators once, outside the loop -- re-loading the models for
# every tweet was what caused the out-of-memory errors.
sent_token_annotator <- Maxent_Sent_Token_Annotator()
word_token_annotator <- Maxent_Word_Token_Annotator()
entity_annotator <- Maxent_Entity_Annotator(kind = "location", language = "en",
  model = 'C:\\Program Files\\R\\R-3.1.1\\library\\openNLP\\en-ner-location.bin')

for (i in 1:nrow(quake1)) {
  s <- quake1$text[i]
  a2 <- annotate(s, list(sent_token_annotator, word_token_annotator))
  a3 <- entity_annotator(s, a2)
  location <- ""
  if (length(a3) > 0) {
    for (j in 1:length(a3))
      location <- paste(location, substring(s, a3$start[j], a3$end[j]), sep = ";")
  }
  quake1$location[i] <- location
}
New objective: I want to ensure locations like #SanJose are also identified, since most location hashtags are a single string.
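One idea for this (a sketch, untested on real tweets): strip the leading '#' and split CamelCase hashtags into separate words before running the annotators, so "#SanJose" becomes "San Jose":
split_hashtags <- function(x) {
  x <- gsub("#(\\w+)", "\\1", x)                   # drop the leading '#'
  gsub("(?<=[a-z])(?=[A-Z])", " ", x, perl = TRUE) # insert a space at CamelCase boundaries
}
split_hashtags("Tremors felt across #SanJose this morning")
# [1] "Tremors felt across San Jose this morning"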

Extract English words from a text in R

I have a text and I need to extract all English words from it. For instance I want to have a function which would analyse the vector
vector <- c("picture", "carpet", "lamp", "notaword", "anothernotaword")
And return only the English words from this vector, i.e. "picture", "carpet", "lamp".
I do understand that the definition of "English word" depends on the dictionary but I would be satisfied even with a basic dictionary.
You could use the package I maintain, qdapDictionaries (no need for the parent package qdap to be installed). If your data is more complex you may need tools like tolower etc. to make it work. The idea here is basically to see where a known word list (see ?GradyAugmented) intersects with your words. Here are two very similar approaches; the first is likely slightly faster, depending on the data:
vector <- c("picture", "carpet", "lamp", "notaword", "anothernotaword")
library(qdapDictionaries)
vector[vector %in% GradyAugmented]
## [1] "picture" "carpet" "lamp"
intersect(vector, GradyAugmented)
## [1] "picture" "carpet" "lamp"
The error you are receiving when installing qdap sounds like @Ben Bolker is correct. You will need a newer version (I'd suggest the latest) of data.table installed (use packageVersion("data.table") to check this). That is an oversight on my part in not requiring a minimum version of data.table; I thought setDT (a function in the data.table package) had always been around, but it appears not to be in your version. But to solve this particular problem you don't need to install the parent qdap package, just qdapDictionaries.
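For messier input, a first pass with tolower and some cleanup might look like this (a sketch; it assumes the GradyAugmented entries are lowercase):
vector2 <- c("Picture", "CARPET", "Lamp!", "notaword")
cleaned <- gsub("[^a-z']", "", tolower(vector2))  # lowercase and strip punctuation
cleaned[cleaned %in% GradyAugmented]
## [1] "picture" "carpet"  "lamp"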

Search R file by package used

Let's say I open an R file that I used way back when. At the top of the file I see a library loaded, but I don't remember what it does any more. So I think to myself: hm, I wonder where in this long R file this library is used?
Is there a way to list which functions from a given package were used in a particular file?
There are certainly other ways to do this, but if you can get a list of the functions in the package, you can combine readLines (to read the script into R as character strings), grepl (to detect matches), and sapply. The way I would grab the functions is using p_funs from the pacman package. (Full disclosure: I am one of the authors.)
Here is an example script that I have saved as "test.R"
library(ggplot2)
x <- rnorm(20)
y <- rnorm(20)
qplot(x, y)
summary(x)
and here is a session where I detect which functions are used
script <- readLines("test.R")
funs <- p_funs(ggplot2)
out <- sapply(funs, function(input){any(grepl(input, x = script))})
funs[out]
#[1] "ggplot" "qplot"
If you don't want to install pacman you can use any other method to get a list of the functions in the package. You could replace that with
funs <- objects("package:ggplot2")
and you would essentially get the same answer.
Note that you may get more matches than there actually are in the file: the ggplot function wasn't actually in my script, but the string "ggplot" appears in library(ggplot2). So you may still need to do a little additional digging after the initial sweep through the file.
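One possible refinement (a sketch): require the function name to be immediately followed by an opening parenthesis, so that "ggplot" inside library(ggplot2) no longer counts as a match. Note that this would also miss functions used without parentheses (e.g. passed to sapply).
out2 <- sapply(funs, function(f) any(grepl(paste0(f, "("), script, fixed = TRUE)))
funs[out2]
#[1] "qplot"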
