Named Entity Recognition for Tweets using R

My objective: I am trying to recognize the location mentioned in a tweet (if one exists).
I tried using the openNLP package and ran into an "out of memory" error several times, in spite of increasing the heap size.
The code terminates after identifying locations for only 6-8 tweets.
I am only interested in 100-150 locations (a scope list), but matching each word of every tweet against that list is extremely inefficient for a collection of tweets.
I wanted to know whether there are any suitable packages besides 'NLP' that enable NER on Twitter data in R.
Also, what would be the most efficient way to perform this routine?
I am not very familiar with Python or Java, hence I would like to use R.
Thank you.

Okay, so I sorted out the out-of-memory issue.
I was importing the models repeatedly for each tweet, so I moved that setup out of the loop and it now works fine.
Here is my code:
# Run this setup once, outside the loop, so the models are not re-loaded for every tweet.
# Sys.setenv(JAVA_HOME='C:\\Program Files\\Java\\jre7')         # for 64-bit Java
# Sys.setenv(JAVA_HOME='C:\\Program Files (x86)\\Java\\jre7')   # for 32-bit Java
library(rJava)
library(NLP)
library(openNLP)
# install.packages("openNLPmodels.en_1.5-1.tar.gz", repos = "http://datacube.wu.ac.at/", type = "source")
# download en-ner-location.bin from http://opennlp.sourceforge.net/models-1.5/ and save it at the path below
sent_token_annotator <- Maxent_Sent_Token_Annotator()
word_token_annotator <- Maxent_Word_Token_Annotator()
entity_annotator <- Maxent_Entity_Annotator(kind = "location", language = "en",
                                            model = 'C:\\Program Files\\R\\R-3.1.1\\library\\openNLP\\en-ner-location.bin')

for (i in 1:nrow(quake1)) {
  s <- quake1$text[i]
  a2 <- annotate(s, list(sent_token_annotator, word_token_annotator))
  a3 <- entity_annotator(s, a2)
  location <- ""
  if (length(a3) > 0) {
    for (j in 1:length(a3))
      location <- paste(location, substring(s, a3$start[j], a3$end[j]), sep = ";")
  }
  quake1$location[i] <- location
}
New objective: I want to ensure that locations like #SanJose are also identified, since most location hashtags are a single string.
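One possible pre-processing step (my own sketch, not from the original post) is to expand CamelCase hashtags into separate words before annotation, so that "#SanJose" becomes "San Jose" and can be matched by the location model:
# Sketch: split CamelCase hashtags into words before running the annotators.
# Assumes location hashtags follow a CamelCase convention.
split_hashtags <- function(text) {
  text <- gsub("#", "", text, fixed = TRUE)          # drop the hash sign
  gsub("([a-z])([A-Z])", "\\1 \\2", text)            # insert a space at lower-to-upper transitions
}
split_hashtags("Earthquake felt in #SanJose this morning")
# [1] "Earthquake felt in San Jose this morning"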


Determining which version of a function is active when many packages are loaded

If I have multiple packages loaded that define functions of the same name, is there an easy way to determine which version of the function is currently the active one? For example, let's say I have base R, the tidyverse, and a bunch of time series packages loaded. I'd like a function which_package("intersect") that would tell me the package name of the active version of the intersect function. I know you can go back and look at all the masking warnings you received when attaching the packages, but I think that sort of manual search is not only tedious but also error-prone.
There is a function here that does sort of what I want, except that it produces a table for all conflicts rather than the value for one function. I would actually be quite happy with that, and would also accept a similar function as an answer, but I have had problems with the implementation of the given function. As applied to my examples, it inserts vast amounts of white space and many duplicates of the package names (e.g. the %>% function shows up with 132 packages listed), making the output hard to read and hard to use. It seems like it should be easy to remove the white space and duplicates, but I have spent considerable time on various approaches that I expected to work and that had no impact on the output.
So, for an example of many conflicts:
install.packages(pkgs = c("tidyverse", "fpp3", "tsbox", "rugarch", "Quandl", "DREGAR", "dynlm", "zoo",
                          "GGally", "dyn", "ARDL", "bigtime", "BigVAR", "dLagM", "VARshrink"))
lapply(c("tidyverse", "fable", "tsbox", "rugarch", "Quandl", "DREGAR", "dynlm", "zoo",
         "GGally", "dyn", "ARDL", "bigtime", "BigVAR", "dLagM", "VARshrink"),
       library, character.only = TRUE)
You can pull this information with a small helper function of your own.
which_package <- function(fun) {
  if (is.character(fun)) fun <- getFunction(fun)  # accept a function name or the function itself
  stopifnot(is.function(fun))
  x <- environmentName(environment(fun))          # name of the environment the function lives in
  if (!is.null(x)) return(x)
}
This will return R_GlobalEnv for functions that you define in the global environment. There is also the packageName function if you really want to restrict it to packages only.
For example
library(MASS)
library(dplyr)
which_package(select)
# [1] "dplyr"

ViSEAGO tutorial: visualising topGO object

Earlier, I had posted a question and was able to load in my data successfully and create a topGO object after some help. I'm trying to visualise GO terms that are significantly associated with the list of differentially expressed genes that I have from mouse RNA-seq data.
Now, I'd like to raise a concern about ViSEAGO's tutorial. The tutorial initially specifies loading two files, 'selection.txt' and 'background.txt', but the origin of these files is not clearly stated. After a lot of digging into topGO's documentation, I was able to find the data types for each of the files. But even after following these, I have a problem running the following code. Does anyone have any insights to share?
WORKING CODE:
mysampleGOdata <- new("topGOdata",
                      description = "my Simple session",
                      ontology = "BP",
                      allGenes = geneList_new,
                      nodeSize = 1,
                      annot = annFUN.org,
                      mapping = "org.Mm.eg.db",
                      ID = "SYMBOL")
resultFisher <- runTest(mysampleGOdata, algorithm = "classic", statistic = "fisher")
head(GenTable(mysampleGOdata, fisher = resultFisher), 20)
myNewBP <- GenTable(mysampleGOdata, fisher = resultFisher)
PROBLEMS:
> head(myNewBP,2)
GO.ID Term Annotated Significant Expected fisher
1 GO:0006006 glucose metabolic process 194 12 0.19 1.0e-19
2 GO:0019318 hexose metabolic process 223 12 0.22 5.7e-19
> ###################
> # merge results
> myBP_sResults<-ViSEAGO::merge_enrich_terms(
+ Input=list(
+ condition=c("mysampleGOdata","resultFisher")
+ )
+ )
Error in setnames(x, value) :
Can't assign 3 names to a 2 column data.table
> myNewBP<-GenTable(mysampleGOdata,fisher=resultFisher)
> ###################
> # display the merged table
> ViSEAGO::show_table(myNewBP)
Error in ViSEAGO::show_table(myNewBP) :
object must be enrich_GO_terms, GO_SS, or GO_clusters class objects
According to the tutorial, the printed table contains, for each enriched GO term, additional columns including the list of significant genes and their frequency (the ratio of the number of significant genes to the number of background genes) evaluated by comparison. I think I have that, but it's definitely not working.
Can someone see why? I'm not very clear on this.
Thanks!
I think you are trying to circumvent an error you made at the beginning. You receive the error because you did not use the wrapper function from the ViSEAGO package. As you stated in your last question, you had initial problems formatting your data.
Here are some tips:
The "selection" file is a character vector with your DEGs names or IDs. I recommend using EntrezID's.
The "Background" file is a character vector with known genes. I recommend using EntrezID's as well. You can easily generate this character vector with:
background=keys(org.Hs.eg.db, keytype ='ENTREZID').
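For the selection vector, a minimal sketch could look like the following (the deg_results table and its entrez/padj columns are hypothetical placeholders for your own DE results, not part of the original answer):
library(org.Hs.eg.db)
# hypothetical DE results table with an Entrez ID column and adjusted p-values
selection <- as.character(deg_results$entrez[deg_results$padj < 0.05])
background <- keys(org.Hs.eg.db, keytype = "ENTREZID")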
With these two files, you can easily proceed to the next steps of the package as described in the vignette.
# connect to EntrezGene
EntrezGene <- ViSEAGO::EntrezGene2GO()
# load GO annotations from EntrezGene,
# with the addition of GO annotations from ortholog genes (see above)
# id = "9606" = Homo sapiens
myGENE2GO <- ViSEAGO::annotate(id = "9606", EntrezGene)
BP <- ViSEAGO::create_topGOdata(
  geneSel = selection,    # your DEG vector
  allGenes = background,  # your created background vector
  gene2GO = myGENE2GO,
  ont = "BP",
  nodeSize = 5
)
classic <- topGO::runTest(
  BP,
  algorithm = "classic",
  statistic = "fisher"
)
# merge results
BP_sResults <- ViSEAGO::merge_enrich_terms(
  Input = list(
    condition = c("BP", "classic")
  )
)
You should get a merged list of your enriched GO terms with the corresponding statistical tests you prefer.
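The merged object is then what ViSEAGO::show_table expects; this is the step that failed in the question because a plain GenTable data frame was passed instead:
# display the merged table of enriched GO terms
ViSEAGO::show_table(BP_sResults)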
I have faced this problem recently, and it was very frustrating. In my case the whole issue seemed to be related to the package version I was using.
I used conda to install ViSEAGO. However, the R version in my conda environment was a bit old (3.6.1, to be specific), so installing ViSEAGO with conda gave me version 1.0.0 of the package. Please note that the most recent version of ViSEAGO is 1.4.0.
I therefore created a conda environment with R version 4.0.3 and repeated the procedure to install ViSEAGO with conda. This time ViSEAGO 1.4.0 was installed, and everything went fine.
I've tried to backtrack the error and only found one thing: in the older ViSEAGO version, the function Custom2GO loaded tables with 4 columns; in the most recent version it admits 5 columns (the new one being 'gene_symbol'). I think this disagreement might be part of the issue, as the source code of merge_enrich_terms seems to deal with the columns 'gene_id' and 'gene_symbol' at some point, but I'm not sure.
Hope you find my comment helpful!
Cheers,
Mauricio

Running as.Node from data.tree package in R

I'm trying to use the as.Node function from the data.tree library in R to visualize a set of media server log data as a tree. I've subset the original data frame by month and year so that I can run one month's worth of data at a time. My function for turning the data into a tree and then writing it out as a .csv is as follows:
library(data.tree)
library(dplyr)

treetrimmer2 <- function(x, y) {
  urimodel <- as.Node(x)                                       # build the tree from the data frame
  uridf <- ToDataFrameTree(urimodel, "level", "count")
  uridf <- filter(uridf, level <= y, count != 0)               # keep nodes up to depth y with non-zero counts
  filename <- paste(x$year[1], x$month[1], ".csv", sep = "")
  write.csv(uridf, file = filename, fileEncoding = "CP1252")
}
Some months finish without any issue. Other months, however, give me the following error (and traceback):
Error in (function () : unused argument (quote(<environment>))
7 (function ()
{
c(self$parent$path, self$name)
})(quote(<environment>))
6 self$AddChildNode(child)
5 mynode$AddChild(path)
4 FromDataFrameTable(x, pathName, pathDelimiter, colLevels, na.rm)
3 as.Node.data.frame(x)
2 as.Node(x) at media_visualizer.R#63
1 treetrimmer2(uricut$`2015.06`, 5)
Can anyone give me some guidance on what 'unused argument (quote(<environment>))' means? I've tried googling it and found that in some cases it means a function or term has already been defined in another context, but I'm still too novice to understand what that means here.
I'm running RStudio 0.99.896 and R 3.2.4 on Mac OS 10.11.5. I would share my data set, except that it is pretty massive, and I'm not sure which lines are causing the problem.
I can't claim credit for this; Christoph Glur (see the comments on the main post) figured it out. But it might be useful for others if I share the cause and my solution.
The problem is that a few of the log files contain one of the data.tree package's reserved words, in this case "path". The format of the lines was "/something/something/path/something/something.jpg", so data.tree read "path" as an independent word. Other instances of "path" as part of a larger word, e.g. "pathString" or "pathTo", didn't cause the bug.
Once he'd figured it out, my solution was to run the following command on all of the log files in Terminal:
sed -i '' 's/\/path\//\/spath\//' *.log
I'm still a novice, but as I understand it, that means: find and replace, in place, every instance of "/path/" with "/spath/" in all of the .log files. I don't actually care about that one word, path vs. spath (which is gibberish), so changing it didn't matter. And now the as.Node() function runs properly on the data set.
Thank you, Christoph!
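(As a side note, not from the original post: the same substitution could be done inside R before calling as.Node, assuming the tree paths live in a column such as pathString; adjust the column name to your own data.)
# rewrite the reserved segment "/path/" before building the tree
x$pathString <- gsub("/path/", "/spath/", x$pathString, fixed = TRUE)
urimodel <- as.Node(x)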

How to save Variant Call Format (VCF) file to disk in R using VariantAnnotation Package

I've searched the web for this without much luck. More or less you always end up at the example from the VariantAnnotation package, and since that example works fine on my computer, I have no idea why the VCF I created does not.
The problem: I want to determine the number and location of SNPs in selected genes. I have a large VCF file (over 5 GB) that has info on all SNPs on all chromosomes for several mouse strains. Obviously my computer freezes if I try to do anything at the whole-genome scale, so I first determined the genomic locations of my genes of interest on chromosome 1. I then used the VariantAnnotation package to get only the data relating to my genes of interest out of the VCF file:
library(VariantAnnotation)
param <- ScanVcfParam(
  info = c("AC1", "AF1", "DP", "DP4", "INDEL", "MDV", "MQ", "MSD", "PV0", "PV1", "PV2", "PV3", "PV4", "QD"),
  geno = c("DP", "GL", "GQ", "GT", "PL", "SP", "FI"),
  samples = strain,
  fixed = "FILTER",
  which = gnrng
)
The code above is taken out of a function I wrote which takes strain as an argument. gnrng refers to a GRanges object containing genomic locations of my genes of interest.
vcf <- readVcf(file, "mm10", param)
This works fine and I get my VCF (dim: 21783 1), but when I try to save it, it won't work:
file.vcf<-tempfile()
writeVcf(vcf, file.vcf)
Error in .pasteCollapse(ALT, ",") : 'x' must be a CharacterList
I even tried it side by side: running the example from the package first and then substituting in my own VCF file:
#This is the example:
out1.vcf<-tempfile()
in1<-readVcf(fl,"hg19")
writeVcf(in1,out1.vcf)
This works just fine, but if I substitute my vcf for in1, I get the same error.
I hope I made myself clear... And any help will be greatly appreciated!! Thanks in advance!
Thanks for reporting this bug. The problem is fixed in version 1.9.47 (devel branch). The fix will be available in the release branch after April 14.
The problem was that you selectively imported 'FILTER' from the 'fixed' field but not 'ALT'. writeVcf() was throwing an error because there was no ALT value to write out. If you don't have access to the version with the fix, a workaround is to import the ALT field as well:
ScanVcfParam(fixed = c("ALT", "FILTER"))
You can see which values were imported with the fixed() accessor:
fixed(vcf)
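Applied to the parameters from the question, the workaround amounts to adding "ALT" to the fixed selection (a sketch based on the param shown above; strain, gnrng, and file are the asker's own objects):
param <- ScanVcfParam(
  info = c("AC1", "AF1", "DP", "DP4", "INDEL", "MDV", "MQ", "MSD", "PV0", "PV1", "PV2", "PV3", "PV4", "QD"),
  geno = c("DP", "GL", "GQ", "GT", "PL", "SP", "FI"),
  samples = strain,
  fixed = c("ALT", "FILTER"),  # import ALT too, so writeVcf() has something to write out
  which = gnrng
)
vcf <- readVcf(file, "mm10", param)
writeVcf(vcf, tempfile())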
Please report any bugs or problems on the Bioconductor mailing list Martin referenced. More Bioc users will see the question there and you'll get help more quickly.
Valerie
Here's a reproducible example
library(VariantAnnotation)
fl <- system.file("extdata", "chr22.vcf.gz", package="VariantAnnotation")
param <- ScanVcfParam(fixed="FILTER")
writeVcf(readVcf(fl, "hg19", param=param), tempfile())
## Error in .pasteCollapse(ALT, ",") : 'x' must be a CharacterList
The problem seems to be that writeVcf expects the object to have an 'ALT' field, so
param <- ScanVcfParam(fixed="ALT")
writeVcf(readVcf(fl, "hg19", param=param), tempfile())
succeeds.

How to calculate readability in R with the tm package

Is there a pre-built function for this in the tm library, or one that plays nicely with it?
My current corpus is loaded into tm, something like the following:
s1 <- "This is a long, informative document with real words and sentence structure: introduction to teaching third-graders to read. Vocabulary is key, as is a good book. Excellent authors can be hard to find."
s2 <- "This is a short jibberish lorem ipsum document. Selling anything to strangers and get money! Woody equal ask saw sir weeks aware decay. Entrance prospect removing we packages strictly is no smallest he. For hopes may chief get hours day rooms. Oh no turned behind polite piqued enough at. "
stuff <- rbind(s1,s2)
d <- Corpus(VectorSource(stuff[,1]))
I tried using koRpus, but it seems silly to re-tokenize in a different package than the one I'm already using. I also had problems vectorizing its return object in a way that would let me incorporate the results back into tm. (Namely, due to errors, it would often return more or fewer readability scores than the number of documents in my collection.)
I understand I could do a naive calculation, parsing vowels as syllables, but I want a more thorough package that already handles the edge cases (silent e's, etc.).
My readability scores of choice are Flesch-Kincaid or Fry.
What I had tried originally, where d is my corpus of 100 documents:
f <- function(x) tokenize(x, format="obj", lang='en')
g <- function(x) flesch.kincaid(x)
x <- foreach(i=1:length(d), .combine='c',.errorhandling='remove') %do% g(f(d[[i]]))
Unfortunately, x returns fewer than 100 documents, so I can't associate successes with the correct document. (This is partly my misunderstanding of foreach versus lapply in R, but I found the structure of a text object sufficiently difficult that I could not appropriately tokenize, apply flesch.kincaid, and check errors in a reasonable sequence of apply statements.)
UPDATE
Two other things I've tried, attempting to apply the koRpus functions to the tm object:
Pass the arguments into tm_map, using the default tokenizer:
tm_map(d,flesch.kincaid,force.lang="en",tagger=tokenize)
Define a tokenizer and pass that in:
f <- function(x) tokenize(x, format="obj", lang='en')
tm_map(d,flesch.kincaid,force.lang="en",tagger=f)
Both of these returned:
Error: Specified file cannot be found:
It then lists the full text of d[1]. So it seems to have found it? What should I do to pass the function correctly?
UPDATE 2
Here's the error I get when I try to map koRpus functions directly with lapply:
> lapply(d,tokenize,lang="en")
Error: Unable to locate
Introduction to teaching third-graders to read. Vocabulary is key, as is a good book. Excellent authors can be hard to find.
This looks like a strange error: I almost don't think it means it can't locate the text, but rather that it can't locate some blank error code (such as 'tokenizer') before dumping the located text.
UPDATE 3
Another problem with re-tagging using koRpus was that re-tagging (versus the tm tagger) was extremely slow and wrote its tokenization progress to stdout. Anyway, I've tried the following:
f <- function(x) capture.output(tokenize(x, format="obj", lang='en'),file=NULL)
g <- function(x) flesch.kincaid(x)
x <- foreach(i=1:length(d), .combine='c',.errorhandling='pass') %do% g(f(d[[i]]))
y <- unlist(sapply(x,slot,"Flesch.Kincaid")["age",])
My intention here would be to rebind the y object above back to my tm(d) corpus as metadata, meta(d, "F-KScore") <- y.
Unfortunately, applied to my actual data set, I get the error message:
Error in FUN(X[[1L]], ...) :
cannot get a slot ("Flesch.Kincaid") from an object of type "character"
I think one element of my actual corpus must be an NA, or too long, or something else prohibitive, and because of the nested functions I am having trouble tracking down exactly which one it is.
So, currently, it looks like there is no pre-built function for readability scores that plays nicely with the tm library, unless someone sees an easy error-catching solution I could sandwich into my function calls to deal with the inability to tokenize some apparently erroneous, malformed documents.
You get an error because the koRpus functions can't deal with a corpus object. It is better to create a kRp.tagged object and then apply all the koRpus features to it. Here I will show how I do this using the ovid data from the tm package.
I use list.files to get my list of source files. You just need to give the right path to your source text files.
ll.files <- list.files(path = system.file("texts", "txt", package = "tm"),
                       full.names = TRUE)
Then I construct a list of kRp.tagged objects using tokenize, which is the default tagger shipped with the koRpus package (it is recommended to use TreeTagger, but you would need to install it):
ll.tagged <- lapply(ll.files, tokenize, lang="en") ## tm_map is just a wrapper of `lapply`
Once I have my list of "tagged" objects, I can apply a readability formula to them. Since flesch.kincaid is a wrapper around readability, I will apply the latter directly:
ll.readability <- lapply(ll.tagged,readability) ## readability
ll.freqanalysis <- lapply(ll.tagged,kRp.freq.analysis) ## Conduct a frequency analysis
ll.hyphen <- lapply(ll.tagged,hyphen) ## word hyphenation
All of this produces a list of S4 objects. The desc slot gives easy access to this list:
lapply(lapply(ll.readability, slot, 'desc'),           ## I apply desc to get a list
       '[', c('sentences', 'words', 'syllables'))[[1]] ## I subset to get some indexes
[[1]]
[[1]]$sentences
[1] 10
[[1]]$words
[1] 90
[[1]]$syllables
all s1 s2 s3 s4
196 25 32 25 8
You can, for example, use the hyphen slot to get a data frame with two columns, word (the hyphenated words) and syll (the number of syllables). Here, using lattice, I bind all the data frames to plot a dotplot for each document:
library(lattice)
ll.words.syl <- lapply(ll.hyphen,slot,'hyphen') ## get the list of data.frame
ll.words.syl <- lapply(seq_along(ll.words.syl), ## add a column to distinguish docs
function(i)cbind(ll.words.syl[[i]],group=i))
dat.words.syl <- do.call(rbind,ll.words.syl)
dotplot(word~syll|group,dat.words.syl,
scales=list(y=list(relation ='free')))
I'm sorry the koRpus package doesn't interact with the tm package that smoothly yet. I've been thinking of ways to translate between the two object classes for months now but haven't yet come up with a really satisfying solution. If you have ideas for this, don't hesitate to contact me.
However, I'd like to refer you to the summary() method for readability objects produced by koRpus, which returns a condensed data.frame of relevant results. This is probably much easier to digest than crawling through the rather complex S4 objects ;-) You might also try summary(x, flat=TRUE).
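For instance, building on the ll.readability list from the answer above (my own sketch, not part of the original comment), you could condense each result with summary() and stack the per-document summaries:
## condense each readability result and stack them, one entry per document
ll.flat <- lapply(ll.readability, summary, flat = TRUE)
flat.scores <- do.call(rbind, ll.flat)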
@agstudy: Nice graph :-) To save some time, you should run hyphen() before readability(), so you can re-use the results via the "hyphen" argument. Or you can simply access the "hyphen" slot of the readability() results afterwards. It will hyphenate automatically if needed and keep the results. A manual call to hyphen() should only be necessary if you need to change its output before the next steps. I might add that 0.05-1 is much faster at this than its predecessors.
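A minimal sketch of that re-use (following the hint above; ll.tagged is the list from the earlier answer):
ll.hyphen <- lapply(ll.tagged, hyphen)                              # hyphenate each tagged document once
ll.readability <- Map(readability, ll.tagged, hyphen = ll.hyphen)   # re-use the hyphenation results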
As of qdap version 1.1.0, qdap has a number of functions that make it more compatible with the tm package. Here is a way to approach your problem using the same Corpus you provide. (Note that Fry's is originally a graphical measure and qdap preserves this; also, given your Corpus and the random sampling Fry's approach relies on, your sample Corpus is not large enough to calculate Fry's on.)
library(qdap)
with(tm_corpus2df(d), flesch_kincaid(text, docs))
## docs word.count sentence.count syllable.count FK_grd.lvl FK_read.ease
## 1 s1 33 1 54 16.6 34.904
## 2 s2 49 1 75 21.6 27.610
with(tm_corpus2df(d), fry(text, docs))
## To plot it
qheat(with(tm_corpus2df(d), flesch_kincaid(text, docs)), values=TRUE, high="red")
