Trouble accessing quanteda corpus quantities in version >= 2

I am having a problem running a script I wrote some time ago. Back then, when I applied quanteda::corpus() to a readtext object, it returned an object of classes "corpus" and "list". When I run the same script now, it returns an object of classes "corpus" and "character", and this breaks the subsequent code. What could be the reason for this, and how can I solve it?
Here is the script:
library(readtext)
library(quanteda)
library(stringi)

txt <- readtext("C:/Users/aerol/Desktop/txt_sample")
corpus_txt <- corpus(txt) %>%
  corpus_reshape(to = "sentences")
docvars(corpus_txt, "Treaty") <- corpus_txt$documents$`_document`
docvars(corpus_txt, "Year") <- as.integer(stri_sub(corpus_txt$documents$`_document`, -9, -6))
The files are international treaties. All the filenames are in the same format, they contain the name of the treaty and the year it was signed. And I was extracting these.
Back then the class of corpus_txt was "corpus" "list":
> class(corpus_txt)
[1] "corpus" "list"
But now:
> class(corpus_txt)
[1] "corpus" "character"
> packageVersion("quanteda")
[1] ‘2.1.2’
And I cannot extract information from the corpus the way I did before. Since I have been working on this since last October, I assumed I was using the same version all along.
Many thanks in advance.

We changed the corpus internal structure in v2, after two years of warning in the documentation that users should not access the corpus internals directly, or their code would be likely to break under future major versions.
From https://github.com/quanteda/quanteda/blob/master/NEWS.md#quanteda-20:
quanteda 2.0 introduces some major changes, detailed here.
New corpus object structure.
The internals of the corpus object have been redesigned, and now are based around a character vector with meta- and system-data in attributes. These are all updated to work with the existing extractor and replacement functions. If you were using these before, then you should not even notice the change. Docvars are now handled separately from the texts, in the same way that docvars are handled for tokens objects.
From ?corpus:
For quanteda >= 2.0, this is a specially classed character vector. It has many additional attributes but you should not access these attributes directly, especially if you are another package author. Use the extractor and replacement functions instead, or else your code is not only going to be uglier, but also likely to break should the internal structure of a corpus object change. Using the accessor and replacement functions ensures that future code to manipulate corpus objects will continue to work.
Solution? Use docnames(corpus_txt).
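A minimal sketch of a v2-safe version of the original script, using only the extractor functions (it keeps the same stri_sub() offsets, which assume the year occupies those positions in your filenames):

library(readtext)
library(quanteda)
library(stringi)

txt <- readtext("C:/Users/aerol/Desktop/txt_sample")
corpus_txt <- corpus(txt)

## set the docvars from the document names *before* reshaping,
## so every sentence inherits them
docvars(corpus_txt, "Treaty") <- docnames(corpus_txt)
docvars(corpus_txt, "Year") <- as.integer(stri_sub(docnames(corpus_txt), -9, -6))

corpus_txt <- corpus_reshape(corpus_txt, to = "sentences")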

Related

R: How far does it go? (Plus venting)

I have an object called defaultPacks, containing the names of packages installed on all the computers I use. Much abbreviated:
defaultPacks <- c(
  "AER",
  "plyr",
  "dplyr"
)
I want to save this object to file in a shared directory all of them can reach. I am using Dropbox for this, with sync always paused when R is running.
save(defaultPacks,
     file = file.path("C:","Users","andrewH","Dropbox","R_PROJ","sharedSettings.rdata"))
Then I want to load the object and install the packages the names of which are in the object defaultPacks.
SyncPacks <- function(fileString){
  defaultPacks <- load(file = fileString)
  install.packages(defaultPacks, repos = "http://cran.us.r-project.org")
}
SyncPacks(file.path("C:","Users","andrewH","Dropbox","R_PROJ","sharedSettings.rdata"))
If I do this, I get a warning:
Warning in install.packages: package ‘defaultPacks’ is not available (for R version 3.2.1)
I look at what is in defaultPacks immediately after I load and assign it: the string "defaultPacks". So it seems to be loading just a string rather than the object.
So I go back to my save, and try
save(get(defaultPacks), file.path(etc.))
This gives me a different error:
Error in save(get("defaultPacks"), file = file.path("C:", "Users", "andrewH", :
object ‘get("defaultPacks")’ not found.
Then I tried dynGet() with the same result.
So where before it was treating a symbol as a string, now it is treating a function as a string.
So I try the list option for save:
save(list = defaultPacks, file = file.path(etc))
And get yet another error:
Error in save(list = defaultPacks, file = file.path("C:", "Users", "andrewH", :
objects ‘AER’, ‘plyr’, ‘dplyr’, (etc.) not found
So where before I couldn't get to my character vector, now I am shooting right past it: evaluating defaultPacks to find the strings, and then treating each string as a symbol and evaluating it to its (nonexistent) object.
So, I want to know how to make this work. But I am asking for something more than that. I have this problem, or an analogous problem, all the time. After several years of using R, I still have it a couple of times a week. I don't know how many steps of evaluation R is going to take on any given occasion. I hand a function an object name, and the function treats it as a string. I hand a function a string, and the R function converts it to a symbol and tries to evaluate it. Here, I don't understand why the save function does not save the object I gave it, and then give it back with load.
I've read the discussions on scoping in ten different R books, from Chambers "Software for Data Analysis" to Wickham's "Advanced R." Twice. Four times in some cases. I know about the four environments of a function, and the difference between the call stack and the chain of environmental parents. And yet, it is clear that I am missing something basic. It is not just that I don't know why save does not take a name in its ... argument and save it as an object (unless the problem is at the load end). I don't know how I can know. The function description says, of the ...s, "the names of the objects to be saved (as symbols or character strings)." So why is it saving a name as a string? Or why is load returning a string, if save saved an object? And how could I predict that?
Experienced R programmers, I know you can tell in advance how a given R function is going to treat one of its arguments. You know how far it will be evaluated. You can make it go as far as you want it to, and then STOP. You don't have to write str()'s into your functions every time you want to figure out what the heck it thinks its arguments mean. How do you do it?
Bloody "R Inferno". It's an understatement.
One way of seeing the problem is to note that the value of defaultPacks changes from before to after these operations.
> fname = tempfile()
> orig = defaultPacks = c("AER", "plyr", "dplyr")
> save(defaultPacks, file=fname)
> defaultPacks = load(fname)
> identical(orig, defaultPacks)
[1] FALSE
The problem starts with an understanding of what save() does. From ?save, the object that is saved is named defaultPacks and it has value c("AER", "plyr", "dplyr"). save() could save multiple objects, each with a name and associated value, so it somehow has to save the name of each object.
load() restores the objects that save() has written, and returns (from ?load) a "character vector of the names of objects created". In this case load() restores (creates in the global environment) the symbol defaultPacks, populates it with the character vector of default packages, and returns the name (i.e., character vector of length 1 "defaultPacks") of the object(s) it has restored. The return value then overwrites the restored value, and we have defaultPacks = "defaultPacks".
install.packages doesn't do anything fancy with its first argument, which from ?install.packages is a "character vector of the names of packages whose current versions should be downloaded". That character vector happens to be stored in the symbol defaultPacks, but the error comes from the value of the symbol, which is the character vector "defaultPacks".
save() and load() more or less have to work the way they do to support multiple objects. On the other hand saveRDS() and readRDS() (ok, why read instead of load?) have a contract to save a single object. The name of the saved object does not need to be stored to be able to recover the values associated with it. So saveRDS(defaultPacks, fname); defaultPacks = readRDS(fname) works, and in particular the value of defaultPacks before and after this series of operations remains unchanged.
> orig = defaultPacks = c("AER", "plyr", "dplyr")
> saveRDS(defaultPacks, fname)
> defaultPacks = readRDS(fname)
> identical(orig, defaultPacks)
[1] TRUE
Without meaning to be too much of a jerk, the answer to the question "Experienced R programmers... how do you do it?" is implied by the ? above: by carefully reading the manual. Also, there are not that many places in base R code where evaluation is non-standard -- formulas and library() are the main culprits -- so recognizing what the problem is not can help to focus on what is actually going on.
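Putting that together, a minimal corrected SyncPacks is sketched below, assuming the settings are written with saveRDS() rather than save() (the .rds extension is my own convention here):

## write the vector itself, with no name attached
saveRDS(defaultPacks,
        file.path("C:","Users","andrewH","Dropbox","R_PROJ","sharedSettings.rds"))

SyncPacks <- function(fileString){
  defaultPacks <- readRDS(fileString)  ## returns the vector, not a name
  install.packages(defaultPacks, repos = "http://cran.us.r-project.org")
}
SyncPacks(file.path("C:","Users","andrewH","Dropbox","R_PROJ","sharedSettings.rds"))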

Extract English words from a text in R

I have a text and I need to extract all English words from it. For instance I want to have a function which would analyse the vector
vector <- c("picture", "carpet", "lamp", "notaword", "anothernotaword")
And return only English words from this vector i.e. "picture", "carpet", "lamp"
I do understand that the definition of "English word" depends on the dictionary but I would be satisfied even with a basic dictionary.
You could use the package I maintain, qdapDictionaries (no need for the parent package qdap to be installed). If your data is more complex you may need tools like tolower etc. to make it work. The idea here is basically to see where a known word list, ?GradyAugmented, intersects with your words. Here are two very similar approaches; the first is likely slightly faster, depending on the data:
vector <- c("picture", "carpet", "lamp", "notaword", "anothernotaword")
library(qdapDictionaries)
vector[vector %in% GradyAugmented]
## [1] "picture" "carpet" "lamp"
intersect(vector, GradyAugmented)
## [1] "picture" "carpet" "lamp"
The error you are receiving when installing qdap suggests that @Ben Bolker is correct. You will need a newer version (I'd suggest the latest version) of data.table installed (use packageVersion("data.table") to check this). That is an oversight on my part in not requiring a minimum version of data.table; I thought setDT (a function in the data.table package) had always been around, but it appears not to be in your version. But to solve this particular problem you don't need to install the parent qdap package, just qdapDictionaries.

How to calculate readabilty in R with the tm package

Is there a pre-built function for this in the tm library, or one that plays nicely with it?
My current corpus is loaded into tm, something like as follows:
s1 <- "This is a long, informative document with real words and sentence structure: introduction to teaching third-graders to read. Vocabulary is key, as is a good book. Excellent authors can be hard to find."
s2 <- "This is a short jibberish lorem ipsum document. Selling anything to strangers and get money! Woody equal ask saw sir weeks aware decay. Entrance prospect removing we packages strictly is no smallest he. For hopes may chief get hours day rooms. Oh no turned behind polite piqued enough at. "
stuff <- rbind(s1,s2)
d <- Corpus(VectorSource(stuff[,1]))
I tried using koRpus, but it seems silly to retokenize in a different package than the one I'm already using. I also had problems vectorizing its return object in a way that would allow me to reincorporate the results into tm. (Namely, due to errors, it would often return more or fewer readability scores than the number of documents in my collection.)
I understand I could do a naive calculation parsing vowels as syllables, but want a more thorough package that takes care of the edge cases already (address silent e's, etc.).
My readability scores of choice are Flesch-Kincaid or Fry.
What I had tried originally where d is my corpus of 100 documents:
f <- function(x) tokenize(x, format="obj", lang='en')
g <- function(x) flesch.kincaid(x)
x <- foreach(i=1:length(d), .combine='c',.errorhandling='remove') %do% g(f(d[[i]]))
Unfortunately, x returns less than 100 documents, so I can't associate successes with the correct document. (This is partly my misunderstanding of 'foreach' versus 'lapply' in R, but I found the structure of a text object sufficiently difficult that I could not appropriately tokenize, apply flesch.kincaid, and successfully check errors in a reasonable sequence of apply statements.)
UPDATE
Two other things I've tried, trying to apply the koRpus functions to the tm object...
Pass arguments into the tm_map object, using the default tokenizer:
tm_map(d,flesch.kincaid,force.lang="en",tagger=tokenize)
Define a tokenizer, pass that in.
f <- function(x) tokenize(x, format="obj", lang='en')
tm_map(d,flesch.kincaid,force.lang="en",tagger=f)
Both of these returned:
Error: Specified file cannot be found:
Then lists the full text of d[1]. Seems to have found it? What should I do to pass the function correctly?
UPDATE 2
Here's the error I get when I try to map koRpus functions directly with lapply:
> lapply(d,tokenize,lang="en")
Error: Unable to locate
Introduction to teaching third-graders to read. Vocabulary is key, as is a good book. Excellent authors can be hard to find.
This looks like a strange error---I almost don't think it means it can't locate the text, but that it can't locate some blank error code (such as, 'tokenizer'), before dumping the located text.
UPDATE 3
Another problem with retagging using koRpus was that retagging (versus the tm tagger) was extremely slow and output its tokenization progress to stdout. Anyway, I've tried the following:
f <- function(x) capture.output(tokenize(x, format="obj", lang='en'),file=NULL)
g <- function(x) flesch.kincaid(x)
x <- foreach(i=1:length(d), .combine='c',.errorhandling='pass') %do% g(f(d[[i]]))
y <- unlist(sapply(x,slot,"Flesch.Kincaid")["age",])
My intention here would be to rebind the y object above back to my tm(d) corpus as metadata, meta(d, "F-KScore") <- y.
Unfortunately, applied to my actual data set, I get the error message:
Error in FUN(X[[1L]], ...) :
cannot get a slot ("Flesch.Kincaid") from an object of type "character"
I think one element of my actual corpus must be an NA, or too long, something else prohibitive---and due to the nested functionalizing, I am having trouble tracking down exactly which it is.
So, currently, it looks like there is no pre-built function for reading scores that play nicely with the tm library. Unless someone sees an easy error-catching solution I could sandwich into my function calls to deal with inability to tokenize some apparently erroneous, malformed documents?
You get an error because the koRpus functions can't deal with corpus objects. It is better to create a kRp.tagged object and then apply all the koRpus features to it. Here I will show how I do this using the ovid data from the tm package.
I use list.files to get my list of source files. You just need to give the right path to your source text files.
ll.files <- list.files(path = system.file("texts", "txt", package = "tm"),
                       full.names = TRUE)
Then I construct a list of kRp.tagged objects using tokenize, which is the default tagger shipped with the koRpus package (using TreeTagger is recommended, but you need to install it separately):
ll.tagged <- lapply(ll.files, tokenize, lang="en") ## tm_map is just a wrapper of `lapply`
Once I have my list of "tagged" objects I can apply the readability formulas to them. Since flesch.kincaid is a wrapper of readability, I will apply the latter directly:
ll.readability <- lapply(ll.tagged,readability) ## readability
ll.freqanalysis <- lapply(ll.tagged,kRp.freq.analysis) ## Conduct a frequency analysis
ll.hyphen <- lapply(ll.tagged,hyphen) ## word hyphenation
All of this produces a list of S4 objects. The desc slot gives easy access to this list:
lapply(lapply(ll.readability ,slot,'desc'), ## I apply desc to get a list
'[',c('sentences','words','syllables'))[[1]] ## I subset to get some indexes
$sentences
[1] 10

$words
[1] 90

$syllables
all  s1  s2  s3  s4
196  25  32  25   8
You can, for example, use the hyphen slot to get a data frame with two columns, word (the hyphenated words) and syll (the number of syllables). Here, using lattice, I bind all the data.frames to plot a dotplot for each document:
library(lattice)
ll.words.syl <- lapply(ll.hyphen,slot,'hyphen') ## get the list of data.frame
ll.words.syl <- lapply(seq_along(ll.words.syl), ## add a column to distinguish docs
function(i)cbind(ll.words.syl[[i]],group=i))
dat.words.syl <- do.call(rbind,ll.words.syl)
dotplot(word~syll|group,dat.words.syl,
scales=list(y=list(relation ='free')))
I'm sorry the koRpus package doesn't interact with the tm package that smoothly yet. I've been thinking of ways to translate between the two object classes for months now but haven't yet come up with a really satisfying solution. If you have ideas for this, don't hesitate to contact me.
However, I'd like to refer you to the summary() method for readability objects produced by koRpus, which returns a condensed data.frame of relevant results. This is probably much easier to access than crawling through the rather complex S4 objects ;-) You might also try summary(x, flat=TRUE).
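For instance, building on the ll.tagged list from the answer above, a quick sketch of that approach (stacking the flat summaries with rbind is my own addition and assumes each document yields the same set of measures):

ll.readability <- lapply(ll.tagged, readability)
ll.flat <- lapply(ll.readability, summary, flat = TRUE)  ## condensed results
do.call(rbind, ll.flat)                                  ## one row per document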
@agstudy: Nice graph :-) To save some time, you should run hyphen() before readability(), so you can re-use the results via the "hyphen" argument. Or you can simply access the "hyphen" slot of readability() results afterwards; it will hyphenate automatically if needed and keep the results. A manual call to hyphen() should only be necessary if you need to change its output before the next steps. I might add that 0.05-1 is much faster at this than its predecessors.
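A minimal sketch of that re-use pattern (the Map() pairing is my own; the "hyphen" argument is the one named above):

ll.hyphen <- lapply(ll.tagged, hyphen)                             ## hyphenate once
ll.readability <- Map(readability, ll.tagged, hyphen = ll.hyphen)  ## re-use results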
As of qdap version 1.1.0, qdap has a number of functions that are more compatible with the tm package. Here is a way to approach your problem using the same Corpus you provide. (Note that Fry's was originally a graphical measure and qdap preserves this; also, given your Corpus and the random sampling Fry's suggests, your sample Corpus is not large enough to calculate Fry's on.)
library(qdap)
with(tm_corpus2df(d), flesch_kincaid(text, docs))
## docs word.count sentence.count syllable.count FK_grd.lvl FK_read.ease
## 1 s1 33 1 54 16.6 34.904
## 2 s2 49 1 75 21.6 27.610
with(tm_corpus2df(d), fry(text, docs))
## To plot it
qheat(with(tm_corpus2df(d), flesch_kincaid(text, docs)), values=TRUE, high="red")

Kindly check the R command

I am doing the following with the Cooccur library in R.
> fb<-read.table("Fb6_peaks.bed")
> f1<-read.table("F16_peaks.bed")
Everything is OK with the first two commands, and I can also display the data:
> fb
> f1
But when I give the next command, as given below:
> explore_pairs(c("fb", "f1"))
I get an error message:
Error in sum(sapply(tf1_s, score_sample, tf2_hits = tf2_s, hit_list = hit_l)) :
invalid 'type' (list) of argument
Could anyone suggest something?
Despite promising, in the article they published over a year ago, to release a version to the Bioconductor repository, the authors have still not delivered. The gz file attached to the article is not of a form that my installation recognizes. You really should be corresponding with the authors about this question.
The nature of the error message suggests that the function is expecting a different data class. You should look at the specification of the arguments in the help(explore_pairs) file. If it is expecting two matrices, then wrapping data.matrix() around the arguments may solve the problem; but if it is expecting a class created by one of that package's functions, then you need to take the necessary steps to construct the right objects.
The help file for explore_pairs does exist (at least in the man directory) and says the first argument should be a character vector, with further provisos:
\arguments{
\item{factornames}{an vector of character strings, each naming a GFF-like
data frame containing the binding profile of a DNA-binding factor.
There is also a load utility, load_GFF, which I assume is designed for the creation of such files.
Try renaming your data frame:
names(fb) <- c("seq", "start", "end")
Check the example datasets. The column names are as above. I set the names and it worked.
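Putting it together, a minimal sketch (assuming, per the man page quoted above, that explore_pairs() expects GFF-like data frames with exactly these column names):

fb <- read.table("Fb6_peaks.bed")
f1 <- read.table("F16_peaks.bed")
names(fb) <- c("seq", "start", "end")  ## match the example datasets
names(f1) <- c("seq", "start", "end")
explore_pairs(c("fb", "f1"))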

Loading someone else's .rdata file, can't access the data

My professor has sent me an .rdata file and wants me to do some analysis on the contents. Although I'm decent with R, I've never saved my work in .rdata files, and consequently haven't ever worked with them.
When I try to load the file, it looks like it's working:
> load('/home/swansone/Desktop/anes.rdata')
> ls()
[1] "25383-0001-Data"
But I can't seem to get at the data:
> names("25383-0001-Data")
NULL
I know that there is data in the .rdata file (it's 13 MB, there's definitely a lot in there) Am I doing something wrong? I'm at a loss.
Edit:
I should note, I've also tried not using quotes:
> names(25383-0001-Data)
Error: object "Data" not found
And renaming:
> ls()[1] <- 'nes'
Error in ls()[1] <- "nes" : invalid (NULL) left side of assignment
You're going to run into a lot of issues with an object whose name doesn't begin with a letter, or a dot followed by a letter (as mentioned in An Introduction to R).
Use backticks to access this object (the "Names and Identifiers" section of help("`") explains why this works) and assign it to a new object with a syntactically valid name:
Data <- `25383-0001-Data`
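In context, that looks something like this (path taken from the question):

load('/home/swansone/Desktop/anes.rdata')
Data <- `25383-0001-Data`  ## copy to a syntactically valid name
names(Data)                ## the usual accessors now work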
Maybe it has to do with the unusual use of dashes in the name; backquotes work:
names(`25383-0001-Data`)
Edit:
More for reference (since Joshua already answered the main question perfectly), you can also retrieve an object named by ls() (what Wilduck tried in the question) using get(). This might be useful if the name of the object contains very weird characters:
foo <- 1:5
bar <- get(ls()[1])
bar
[1] 1 2 3 4 5
This of course requires the index of foo in ls() to be [1], but looking up the index of the required object is not too hard.
