Extract English words from a text in R

I have a text and I need to extract all the English words from it. For instance, I want a function that would analyse the vector
vector <- c("picture", "carpet", "lamp", "notaword", "anothernotaword")
and return only the English words from this vector, i.e. "picture", "carpet", "lamp".
I understand that the definition of an "English word" depends on the dictionary, but I would be satisfied even with a basic dictionary.

You could use the package I maintain, qdapDictionaries (there is no need for the parent package qdap to be installed). If your data is more complex you may need tools like tolower etc. to make it work. The idea here is basically to see where a known word list (see ?GradyAugmented) intersects with your words. Here are two very similar approaches; the first is likely slightly faster, depending on the data:
vector <- c("picture", "carpet", "lamp", "notaword", "anothernotaword")
library(qdapDictionaries)
vector[vector %in% GradyAugmented]
## [1] "picture" "carpet" "lamp"
intersect(vector, GradyAugmented)
## [1] "picture" "carpet" "lamp"
The error you are receiving when installing qdap suggests that @Ben Bolker is correct. You will need a newer version (I'd suggest the latest) of data.table installed (use packageVersion("data.table") to check this). That is an oversight on my part in not requiring a minimal version of data.table; I thought setDT (a function in the data.table package) had always been around, but it appears not to be in your version. But to solve this particular problem you don't need to install the parent qdap package, just qdapDictionaries.
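A minimal sketch of that fix (the exact data.table version you need will depend on your setup):
packageVersion("data.table")          # check what is currently installed
install.packages("data.table")        # update it if setDT is missing
install.packages("qdapDictionaries")  # the dictionaries alone are enough here
library(qdapDictionaries)
vector[vector %in% GradyAugmented]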

Related

str_detect producing vector related errors in R code (which previously worked) since update 1.5.0

I'm trying to do some simple str_detects as follows:
index1 <- str_detect(colnames(DataFrame), paste0("^", name_))
Here name_ is just a character string, so paste0("^", name_) is of length 1.
This yields the following error:
Error in stop_vctrs(): ! Input must be a vector, not an environment.
When I check rlang::last_error() I get:
Backtrace:
stringr::str_detect(colnames(DataFrame), paste0("^", name_))
vctrs:::stop_scalar_type(<fn>(<env>), "")
vctrs:::stop_vctrs(msg, "vctrs_error_scalar_type", actual = x)
I know that in this instance I could use the base R alternative:
grep(paste0("^", name_), colnames(DataFrame))
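A closer drop-in, since grep returns integer positions while str_detect returns a logical vector, would be grepl (sketched with the same objects):
# grepl keeps the logical output that downstream indexing code expects
index1 <- grepl(paste0("^", name_), colnames(DataFrame))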
but the issue is that I have many long scripts which feature str_detect many times...
I'd like to understand the ways around this new error so that I can best fix all these instances in my code, thank you.
I have read the update on stringr 1.5.0 written by Hadley about the stricter vector definitions implemented in the tidyverse, but I still pose my question.
EDIT: uninstalling and reinstalling R, RStudio and Rtools fixed the issue.

How to check whether a dataset exists in package?

Is there a more elegant (fail-safe/robust, shorter) way of checking whether a dataset (whose name is known as a character string) exists in a package than this?
rda.name <- "Animals" # name of the data set/.rda
rda.name %in% data(package = "MASS")[["results"]][,"Item"]
You can try this approach using exists:
exists(data("Animals", package = "MASS"))
# [1] TRUE
As mentioned in the comment, I cannot replicate Sven's answer (under any recent version of R). The following works, but the use of suppressWarnings() is rather ugly, and the dataset is also loaded when calling data() this way (instead of just checking its existence). As such, I don't think this is preferable to my original version, but perhaps it inspires someone to provide a fix.
exists(suppressWarnings(data(list = rda.name, package = "MASS")))
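For repeated use, the original data()-based check can be wrapped in a small helper (the function name dataset_exists is just illustrative):
# returns TRUE if the named dataset ships with the given package
dataset_exists <- function(name, pkg) {
  name %in% data(package = pkg)[["results"]][, "Item"]
}
dataset_exists("Animals", "MASS")   # TRUE
dataset_exists("NotAData", "MASS")  # FALSE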

Check if a function name is used in existing CRAN packages

I am creating an R package that I plan to submit to CRAN. How can I check if any of my function names conflict with function names in packages already on CRAN? Before my package goes public, it's still easy to change the names of functions, and I'd like to be a good citizen and avoid conflicts where possible.
For instance, the packages MASS and dplyr both have functions called "select". I'd like to avoid that sort of collision.
There are a lot of packages (9008 at the moment, Aug 2016), so it is almost certainly better to only look at a subset you want to avoid clashes with. Also, to re-emphasise some of the good advice in the comments (just for the record in case comments get deleted, or hidden):
sharing function names with other packages is not really a big problem, and not worth avoiding beyond perhaps avoiding clashes with common packages that are most likely to be loaded at the same time (thanks @Nicola and @Joran)
Unnecessarily avoiding re-use of names "leads to bad function names because the good ones are taken" (@Konrad Rudolph)
But, if you really want to check all the packages, perhaps to at least know which packages use the same names as yours, you can get a vector of the package names by
crans <- available.packages()[, "Package"]
head(crans)
# A3 abbyyR abc ABCanalysis abc.data abcdeFBA
# "A3" "abbyyR" "abc" "ABCanalysis" "abc.data" "abcdeFBA"
length(crans)
# [1] 9008
You can then install them in bulk using
N = 4 # only using the 1st 4 packages here -
# doing it for the whole lot will take a lot of time and disk space!!!
install.packages(crans[1:N])
Then you can get a list of the function names in these packages with
existing_functions = sapply(1:N, function(i) ls(getNamespace(crans[i])))
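With that list in hand, intersect shows any clashes with your own exported names (my_functions is a hypothetical stand-in for your package's exports):
my_functions <- c("select", "summarise_data", "plot_results")  # hypothetical exports
existing <- unique(unlist(existing_functions))                 # flatten the per-package lists
intersect(my_functions, existing)                              # names already used elsewhere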

Is there a way to stop Aspell from suggesting certain words (say, offensive) to incorrect spellings in R?

I am using the aspell function of the utils package in R to spell check my text. I am also trying to extract suggested corrections for the incorrect words detected by Aspell. But Aspell is suggesting offensive words for some incorrect words, and I do not want that. How do I stop Aspell from doing this? Is there a way to remove certain words from the Aspell dictionary using R only? This is how I am using Aspell:
spelling_mistakes <- aspell(file_location2, "Rd", control = c("--master=en_US"),
                            program = aspell_location)
incorrect_words_list <- spelling_mistakes[, 1]
correct_words_for_incorrect_words <- spelling_mistakes[, 5]
How about:
badWords <- scan(url("http://www.bannedwordlist.com/lists/swearWords.txt"),
what=character(0))
## note that the 'bad' words include "job", and "hit" ...
clean_words <- setdiff(spelling_mistakes[,5],badWords)
You haven't given a reproducible example, so I haven't tested this ...
Note that this will not give alternative suggestions. But it will get you partway there. The documentation for aspell does suggest that you can use alternative dictionaries, but you could read that yourself ... http://wordlist.aspell.net/other-dicts/
See also http://lists.gnu.org/archive/html/aspell-user/2007-07/msg00004.html
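If you want to keep the alternative suggestions and only drop the offensive ones, something along these lines should work, assuming (as in your code) that the fifth column of the aspell() result holds the suggestions as a list of character vectors:
## filter each word's suggestion vector against the bad-word list
clean_suggestions <- lapply(spelling_mistakes[, 5],
                            function(sugg) setdiff(sugg, badWords))
names(clean_suggestions) <- spelling_mistakes[, 1]  ## keyed by the misspelled word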

How to calculate readability in R with the tm package

Is there a pre-built function for this in the tm library, or one that plays nicely with it?
My current corpus is loaded into tm, something like as follows:
s1 <- "This is a long, informative document with real words and sentence structure: introduction to teaching third-graders to read. Vocabulary is key, as is a good book. Excellent authors can be hard to find."
s2 <- "This is a short jibberish lorem ipsum document. Selling anything to strangers and get money! Woody equal ask saw sir weeks aware decay. Entrance prospect removing we packages strictly is no smallest he. For hopes may chief get hours day rooms. Oh no turned behind polite piqued enough at. "
stuff <- rbind(s1,s2)
d <- Corpus(VectorSource(stuff[,1]))
I tried using koRpus, but it seems silly to retokenize in a different package than the one I'm already using. I also had problems vectorizing its return object in a way that would allow me to reincorporate the results into tm. (Namely, due to errors, it would often return more or fewer readability scores than the number of documents in my collection.)
I understand I could do a naive calculation parsing vowels as syllables, but I want a more thorough package that already takes care of the edge cases (addressing silent e's, etc.).
My readability scores of choice are Flesch-Kincaid or Fry.
What I had tried originally where d is my corpus of 100 documents:
f <- function(x) tokenize(x, format="obj", lang='en')
g <- function(x) flesch.kincaid(x)
x <- foreach(i=1:length(d), .combine='c',.errorhandling='remove') %do% g(f(d[[i]]))
Unfortunately, x returns fewer than 100 documents, so I can't associate successes with the correct document. (This is partly my misunderstanding of 'foreach' versus 'lapply' in R, but I found the structure of a text object sufficiently difficult that I could not appropriately tokenize, apply flesch.kincaid, and successfully check errors in a reasonable sequence of apply statements.)
UPDATE
Two other things I've tried, trying to apply the koRpus functions to the tm object...
Pass arguments into the tm_map object, using the default tokenizer:
tm_map(d,flesch.kincaid,force.lang="en",tagger=tokenize)
Define a tokenizer, pass that in.
f <- function(x) tokenize(x, format="obj", lang='en')
tm_map(d,flesch.kincaid,force.lang="en",tagger=f)
Both of these returned:
Error: Specified file cannot be found:
It then lists the full text of d[1], so it seems to have found it? What should I do to pass the function correctly?
UPDATE 2
Here's the error I get when I try to map koRpus functions directly with lapply:
> lapply(d,tokenize,lang="en")
Error: Unable to locate
Introduction to teaching third-graders to read. Vocabulary is key, as is a good book. Excellent authors can be hard to find.
This looks like a strange error: I don't think it means it can't locate the text, but rather that it can't locate something else (such as a tokenizer) before dumping the text it did locate.
UPDATE 3
Another problem with retagging using koRpus was that retagging (versus the tm tagger) was extremely slow and output its tokenization progress to stdout. Anyway, I've tried the following:
f <- function(x) capture.output(tokenize(x, format="obj", lang='en'),file=NULL)
g <- function(x) flesch.kincaid(x)
x <- foreach(i=1:length(d), .combine='c',.errorhandling='pass') %do% g(f(d[[i]]))
y <- unlist(sapply(x,slot,"Flesch.Kincaid")["age",])
My intention here would be to rebind the y object above back to my tm(d) corpus as metadata, meta(d, "F-KScore") <- y.
Unfortunately, applied to my actual data set, I get the error message:
Error in FUN(X[[1L]], ...) :
cannot get a slot ("Flesch.Kincaid") from an object of type "character"
I think one element of my actual corpus must be NA, or too long, or something else prohibitive, and due to the nested function calls I am having trouble tracking down exactly which one it is.
So, currently, it looks like there is no pre-built function for readability scores that plays nicely with the tm library. Unless someone sees an easy error-catching solution I could sandwich into my function calls to deal with the inability to tokenize some apparently erroneous, malformed documents?
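For instance, the kind of error-catching I have in mind would look roughly like this (a sketch only; it assumes as.character() recovers the raw text of each tm document):
# per-document koRpus result, or NA where tokenizing/scoring fails,
# so results stay aligned with the documents in d
fk <- lapply(seq_along(d), function(i) {
  tryCatch(
    flesch.kincaid(tokenize(as.character(d[[i]]), format = "obj", lang = "en")),
    error = function(e) NA
  )
})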
You get an error because the koRpus functions can't deal with a corpus object. It is better to create a kRp.tagged object and then apply all the koRpus features to it. Here I will show how to do this using the ovid data from the tm package.
I use list.files to get my list of source files. You just need to give the right path to your source text files.
ll.files <- list.files(path = system.file("texts", "txt", package = "tm"),
                       full.names = TRUE)
Then I construct a list of kRp.tagged objects using tokenize, which is the default tagger shipped with the koRpus package (it is recommended to use TreeTagger, but you need to install it):
ll.tagged <- lapply(ll.files, tokenize, lang="en") ## tm_map is just a wrapper of `lapply`
Once I have my list of "tagged" objects, I can apply the readability formulas to them. Since flesch.kincaid is a wrapper of readability, I will apply the latter directly:
ll.readability <- lapply(ll.tagged,readability) ## readability
ll.freqanalysis <- lapply(ll.tagged,kRp.freq.analysis) ## Conduct a frequency analysis
ll.hyphen <- lapply(ll.tagged,hyphen) ## word hyphenation
And so on. All of this produces a list of S4 objects. The desc slot gives easy access to the results:
lapply(lapply(ll.readability ,slot,'desc'), ## I apply desc to get a list
'[',c('sentences','words','syllables'))[[1]] ## I subset to get some indexes
[[1]]
[[1]]$sentences
[1] 10
[[1]]$words
[1] 90
[[1]]$syllables
all s1 s2 s3 s4
196 25 32 25 8
You can, for example, use the hyphen slot to get a data frame with two columns: word (the hyphenated words) and syll (the number of syllables). Here, using lattice, I bind all the data.frames to plot a dotplot for each document.
library(lattice)
ll.words.syl <- lapply(ll.hyphen,slot,'hyphen') ## get the list of data.frame
ll.words.syl <- lapply(seq_along(ll.words.syl), ## add a column to distinguish docs
function(i)cbind(ll.words.syl[[i]],group=i))
dat.words.syl <- do.call(rbind,ll.words.syl)
dotplot(word~syll|group,dat.words.syl,
scales=list(y=list(relation ='free')))
I'm sorry the koRpus package doesn't interact with the tm package that smoothly yet. I've been thinking of ways to translate between the two object classes for months now but haven't yet come up with a really satisfying solution. If you have ideas for this, don't hesitate to contact me.
However, I'd like to refer you to the summary() method for readability objects produced by koRpus, which returns a condensed data.frame of relevant results. This is probably much easier to access than the alternative crawl through the rather complex S4 objects ;-) You might also try summary(x, flat=TRUE).
@agstudy: Nice graph :-) To save some time, you should run hyphen() before readability(), so you can re-use the results via the "hyphen" argument. Or you can simply access the "hyphen" slot of readability() results afterwards. It will hyphenate automatically if needed, and keep the results. A manual call to hyphen() should only be necessary if you need to change its output before the next steps. I might add that version 0.05-1 is much faster at this than its predecessors.
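Roughly, both suggestions combined would look like this, reusing ll.tagged and ll.hyphen from the answer above:
## hyphenate once, then hand the results to readability() via its hyphen argument
ll.readability <- lapply(seq_along(ll.tagged), function(i)
  readability(ll.tagged[[i]], hyphen = ll.hyphen[[i]]))
## condensed data.frame of results instead of crawling the S4 slots
summary(ll.readability[[1]], flat = TRUE)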
As of qdap version 1.1.0, qdap has a number of functions that make it more compatible with the tm package. Here is a way to approach your problem using the same Corpus you provide (note that Fry was originally a graphical measure and qdap preserves this; also, because of the random sampling Fry's approach relies on, your sample Corpus is not large enough to calculate Fry's score on):
library(qdap)
with(tm_corpus2df(d), flesch_kincaid(text, docs))
## docs word.count sentence.count syllable.count FK_grd.lvl FK_read.ease
## 1 s1 33 1 54 16.6 34.904
## 2 s2 49 1 75 21.6 27.610
with(tm_corpus2df(d), fry(text, docs))
## To plot it
qheat(with(tm_corpus2df(d), flesch_kincaid(text, docs)), values=TRUE, high="red")
