Slow bigram frequency function in R

I’m working with Twitter data and I’m currently trying to find frequencies of bigrams in which the first word is “the”. I’ve written a function which seems to do what I want but is extremely slow (originally I wanted to see frequencies of all bigrams, but I gave up because of the speed). Is there a faster way of solving this problem? I’ve heard about the RWeka package but have trouble installing it. I get this error: ERROR: dependencies ‘RWekajars’, ‘rJava’ are not available for package ‘RWeka’
required libraries: tau and tcltk
library(tau)    # textcnt()
library(tcltk)  # tkProgressBar(), setTkProgressBar()
bigramThe <- function(dataset, column) {
  bidata <- data.frame(x = character(0), y = numeric(0))
  pb <- tkProgressBar(title = "progress bar", min = 0, max = nrow(dataset), width = 300)
  for (i in 1:nrow(dataset)) {
    a <- column[i]
    # count the bigrams of a single tweet
    bi <- textcnt(a, n = 2, method = "string")
    tweetbi <- data.frame(V1 = as.vector(names(bi)), V2 = as.numeric(bi))
    # keep only the bigrams containing "the "
    tweetbi$grepl <- grepl("the ", tweetbi$V1)
    tweetbi <- tweetbi[which(tweetbi$grepl == TRUE), ]
    # growing a data frame row by row is the slow part
    bidata <- rbind(bidata, tweetbi)
    setTkProgressBar(pb, i, label = paste(round(i / nrow(dataset) * 100, 0), "% done"))
  }
  aggbi <- aggregate(bidata$V2, by = list(bidata$V1), FUN = sum)
  close(pb)
  return(aggbi)
}
I have almost 500,000 rows of tweets stored in a column that I pass to the function. An example dataset would look like this:
text              userid
tweet text 1      1
tweets text 2     2
the tweet text 3  3

To use RWeka, first run sudo apt-get install openjdk-6-jdk (or install/re-install your JDK in Windows GUI) then try re-installing the package.
Should that fail, use download.file to download the source .zip file and install from source, i.e. install.packages("RWeka.zip", type = "source", repos = NULL).
If you want to speed things up without using a different package, consider using multicore and re-writing the code to use an apply function which can take advantage of parallelism.

You can get rid of the evil loop structure by collapsing the text column into one long string:
a <- paste(dataset[[column]], collapse=" *** ")
bi<-textcnt(a, n = 2, method = "string")
I expected to also need
bi[!grepl("*", names(bi), fixed = TRUE)]
But it turns out that the textcnt method doesn't include bigrams with * in them, so you're good to go.
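If installing extra packages is a problem, the same idea (tokenise everything once, then count once) can be sketched in base R. This is only an illustration and not the tau implementation: the helper name count_bigrams_from, the tokenising regular expression and the sample tweets are all assumptions made for the example.

```r
# Hypothetical helper (not part of tau or RWeka): count the bigrams whose
# first word equals `first`, using base R only.
count_bigrams_from <- function(texts, first = "the") {
  # Lower-case each tweet and split it into word tokens (assumed tokeniser)
  tokens <- strsplit(tolower(texts), "[^a-z0-9']+")
  pairs <- lapply(tokens, function(w) {
    w <- w[nzchar(w)]                        # drop empty tokens
    if (length(w) < 2) return(character(0))  # no bigram possible
    keep <- which(w[-length(w)] == first)    # positions of `first` that have a successor
    paste(w[keep], w[keep + 1])
  })
  # A single table() call replaces the repeated rbind() and the final aggregate()
  sort(table(unlist(pairs)), decreasing = TRUE)
}

tweets <- c("tweet text 1", "the tweet text 3",
            "Bathe the cat, then the dog; the cat again")
counts <- count_bigrams_from(tweets)
```

Because bigrams are built per tweet, no pair ever spans two tweets, so no separator trick is needed. Matching the first token exactly also avoids counting words that merely end in "the", such as "bathe".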

Related

Trying to use Bert (R add-on for Excel) with auto.arima not working

I'm using Bert (R add on for Excel).
When I try to run the following:
sales <- sample(100:170, 4*10, replace = TRUE)
advertising <- sample(50:70, 4*10, replace = TRUE)
sales_ts <- ts(sales, frequency = 4, end = c(2017, 4))
fit <- forecast::auto.arima(sales_ts, xreg = advertising,d=1,ic=c("aic"))
The arima works.
But when I try to use auto.arima
fit.arima <- arima(sales_ts,xreg =advertising,order=c(3,1,1))
I get the following error:
Error in rbind(info, getNamespaceInfo(env, "S3methods")) : number of columns of matrices must match (see arg 2)
Please help!
I was able to run this in R itself, and faced no issues (it worked fine).
Based on your error message, it's possible there's a namespace/library clash causing this, or that one of your libraries/packages wasn't installed properly.
You can either:
1) Remove and re-install the package(s). Use find.package("insert package name here") to find its location, then remove.packages("insert package name here") to remove it. Re-install and proceed.
2) Add the package name itself when referencing the function, like you did with forecast, e.g.
# adding "stats" in front, if that's the package being used
fit.arima <- stats::arima(sales_ts,xreg =advertising,order=c(3,1,1))
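To check which package a name resolves to before qualifying it, something along these lines may help. It is a small sketch run in plain R rather than in Bert, and the built-in lh series is used purely as stand-in data.

```r
# Which attached packages define an object called "arima"?
hits <- find("arima")

# Which namespace does the explicitly qualified function live in?
ns <- environmentName(environment(stats::arima))

# Explicitly qualified call; `lh` is a built-in example series, not the sales data
fit <- stats::arima(lh, order = c(1, 0, 0))
```

If find() lists more than one package, the unqualified call uses whichever comes first on the search path, which is one way a clash like this can arise.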

R code optimization: For loop and writing to a database

I am trying to optimize a simple R code I wrote on two aspects:
1) For loops
2) Writing data into my PostgreSQL database
For 1) I know for loops should be avoided at all cost and it's recommended to use lapply but I am not clear on how to translate my code below using lapply.
For 2) what I do below is working but I am not sure this is the most efficient way (for example doing this way versus rbinding all data into an R dataframe and then load the whole dataframe into my PostgreSQL database.)
EDIT: I updated my code with a reproducible example below.
library(rvest)  # read_html(), html_nodes(), html_text(), %>%
library(DBI)    # dbWriteTable(); con is an already open connection
for (i in 1:100) {
  search <- paste0("https://github.com/search?o=desc&p=", i, "&q=R&type=Repositories")
  download.file(search, destfile = 'scrape.html', quiet = TRUE)
  url <- read_html('scrape.html')
  github_title <- url %>% html_nodes(xpath = "//div[@class='mt-n1']") %>% html_text()
  github_link <- url %>% html_nodes(xpath = "//div[@class='mt-n1']//@href") %>% html_text()
  df <- data.frame(github_title, github_link)
  colnames(df) <- c("title", "link")
  # one database write per page
  dbWriteTable(con, "my_database", df, append = TRUE, row.names = FALSE)
  cat(i)
}
Thanks a lot for all your inputs!
First of all, it is a myth that lapply is inherently faster than equivalent code using a for loop. That has not been true for years: a properly written for loop, one that writes into a preallocated object instead of growing it, is generally at least as fast as the equivalent lapply.
I will illustrate with a for loop, as you seem to find this more intuitive. Do note, however, that I work mostly in T-SQL, so some conversion might be necessary.
n <- 10000
## Preallocate a list with one slot per iteration
outputDat <- vector('list', n)
for (i in seq_len(n)) {
  id <- element_a[i]
  location <- element_b[i]
  language <- element_c[i]
  date_creation <- element_d[i]
  df <- data.frame(id, location, language, date_creation)
  colnames(df) <- c("id", "location", "language", "date_creation")
  outputDat[[i]] <- df
}
## Combine data.frames
outputDat <- do.call('rbind', outputDat)
#Write the combined data.frame into the database (once, not once per iteration).
##dbBegin(con) #<= might speed up might not.
dbWriteTable(con, "my_database", outputDat, append = TRUE, row.names = FALSE)
##dbCommit(con) #<= might speed up might not.
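The preallocate-then-combine pattern can be tried without a database. In this sketch make_row is a hypothetical stand-in for whatever builds one page of results, and n is arbitrary.

```r
n <- 1000

# Hypothetical stand-in for the per-iteration work (e.g. scraping one page)
make_row <- function(i) {
  data.frame(id = i, language = "R", stringsAsFactors = FALSE)
}

# for loop writing into a preallocated list
out_for <- vector("list", n)
for (i in seq_len(n)) {
  out_for[[i]] <- make_row(i)
}

# Equivalent lapply call
out_lapply <- lapply(seq_len(n), make_row)

# Combine once at the end instead of calling rbind() inside the loop
combined <- do.call(rbind, out_for)
same <- identical(combined, do.call(rbind, out_lapply))
```

Both versions build the same list. The cost to avoid is growing a data frame inside the loop, whichever construct is used.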
Using Transact-SQL you could alternatively combine all the rows into a single INSERT INTO statement. Here I'll deviate and use apply to iterate over the rows, as it is much more readable in this case. A for loop is once again just as fast if done properly.
#Create the statements here.
statement <- paste0("('", apply(outputDat, 1, paste0, collapse = "','"), "')", collapse = ",\n") #\n can be removed, but makes printing nicer.
##Optional: Print a bit of the statement
# cat(substr(statement, 1, 2000))
##dbBegin(con) #<= might speed up might not.
#The SQL text is double-quoted so the single quotes inside the SQL comment do not end the R string.
dbExecute(con, statement <- paste0(
"
/*
SET NOCOUNT ON seems to be necessary in the DBI API.
It seems to react to 'n rows affected' messages.
Note only affects this method, not the one using dbWriteTable
*/
--SET NOCOUNT ON
INSERT INTO [my table] values ", statement))
##dbCommit(con) #<= might speed up might not.
Note that, as I mention in the comment, this might simply fail to upload the table properly, as the DBI package sometimes seems to fail this kind of transaction if it results in one or more messages about n rows affected.
Last but not least, once the statements are made, they can be copied and pasted from R into any GUI that accesses the database directly, using for example writeLines(statement, 'clipboard') or by writing them to a text file (a file is more stable if your data contains a lot of rows). In rare outlier cases this last resort can be faster, if for whatever reason DBI or alternative R packages run overly slowly. As this seems to be somewhat of a personal project, this might be sufficient for your use.
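One caveat with pasting values straight into a statement is that a single quote inside the data ends the SQL string early. Below is a minimal base R sketch of building the VALUES list with quotes doubled. The helper quote_sql and the sample rows are made up for illustration, and where a connection is available the DBI quoting helpers are probably the safer route.

```r
# Made-up sample rows; the second title contains a single quote
rows <- data.frame(
  title = c("repo one", "it's a repo"),
  link  = c("/a/one", "/b/two"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: wrap a value in single quotes, doubling embedded quotes
quote_sql <- function(x) {
  paste0("'", gsub("'", "''", x, fixed = TRUE), "'")
}

# One "(...)" group per row, joined into a single VALUES list
values <- paste0(
  "(",
  apply(rows, 1, function(r) paste(quote_sql(r), collapse = ",")),
  ")",
  collapse = ",\n"
)

# The table name is the placeholder from the answer, not a real table
statement <- paste0("INSERT INTO [my table] values ", values)
```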

Defining large matrix with "big memory" package in R

I am using the bigmemory package and need to define a large matrix (20000 * 20000).
A <- big.matrix (20000 , 20000 , type ="double", init = 0)
Resulting in:
Error: memory could not be allocated for instance of type big.matrix
My questions:
(1.) Does the package support a matrix of that size in general?
(2.) If not, are there any other options to create such a matrix in R?
Many thanks for your help
This answer expands on Imo's explanation of specifying file-backing.
Unfortunately, the current CRAN version of the package (4.5.36) doesn't contain a vignette anymore, but thankfully it's possible to download older versions that contain it. For example, the vignette for version 4.5.28 contains the following piece of code:
x <- read.big.matrix("airline.csv", type="integer", header=TRUE,
backingfile="airline.bin",
descriptorfile="airline.desc",
extraCols="Age")
If you wish to keep your working directory clean, you can use the tempfile() and tempdir() functions. Here's one example:
library(bigmemory)
# a unique file name without any directory part
temp_file <- gsub("/", "", tempfile(tmpdir = ""))
A <- big.matrix(
  20000, 20000, type = "double", init = 0,
  backingpath = tempdir(),
  backingfile = paste0(temp_file, ".bak"),
  descriptorfile = paste0(temp_file, ".desc")
)
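For a sense of scale, the size of such a matrix can be worked out directly, assuming 8 bytes per double and ignoring any overhead:

```r
n <- 20000
bytes_per_double <- 8

# Total size of an n x n matrix of doubles
bytes <- n * n * bytes_per_double
gib <- bytes / 1024^3

cat(sprintf("%.0f bytes, about %.2f GiB\n", bytes, gib))
```

That is roughly 3 GiB in one contiguous block, which may exceed what the machine can provide in RAM. A file-backed matrix keeps the data on disk instead.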

read, manipulate and export multiple .dta Files using a for Loop in R

I have multiple time series (each in a separate file), which I need to adjust seasonally using the seasonal package in R, and then store each adjusted series in a separate file again in a different directory.
The Code works for a single county.
So I tried to use a for Loop but R is unable to use the read.dta with a wildcard.
I'm new to R and usually use Stata, so the question is maybe quite stupid and my code quite messy.
Sorry and Thanks in advance
Nathan
for(i in 1:402)
{
alo[i] <- read.dta("/Users/nathanrhauke/Desktop/MA_NH/Data/ALO/SEASONAL_ADJUSTMENT/SINGLE_SERIES/County[i]")
alo_ts[i] <-ts(alo[i], freq = 12, start = 2007)
m[i] <- seas(alo_ts[i])
original[i]<-as.data.frame(original(m[i]))
adjusted[i]<-as.data.frame(final(m[i]))
trend[i]<-as.data.frame(trend(m[i]))
irregular[i]<-as.data.frame(irregular(m[i]))
County[i] <- data.frame(cbind(adjusted[i],original[i],trend[i],irregular[i], deparse.level =1))
write.dta(County[i], "/Users/nathanrhauke/Desktop/MA_NH/Data/ALO/SEASONAL_ADJUSTMENT/ADJUSTED_SERIES/County[i].dta")
}
This is a good place to use a function and the *apply family. As noted in a comment, your main problem is likely to be that you're using Stata-like character string construction that will not work in R. You need to use paste (or paste0, as here) rather than just passing the indexing variable directly in the string like in Stata. Here's some code:
library(foreign)   # read.dta(), write.dta()
library(seasonal)  # seas(), final(), trend(), irregular(), original()
f <- function(i) {
  # build the file name from the index i
  d <- read.dta(paste0("/Users/nathanrhauke/Desktop/MA_NH/Data/ALO/SEASONAL_ADJUSTMENT/SINGLE_SERIES/County",i,".dta"))
  alo_ts <- ts(d, freq = 12, start = 2007)
  m <- seas(alo_ts)
  original <- as.data.frame(original(m))
  adjusted <- as.data.frame(final(m))
  trend <- as.data.frame(trend(m))
  irregular <- as.data.frame(irregular(m))
  County <- cbind(adjusted,original,trend,irregular, deparse.level = 1)
  write.dta(County, paste0("/Users/nathanrhauke/Desktop/MA_NH/Data/ALO/SEASONAL_ADJUSTMENT/ADJUSTED_SERIES/County",i,".dta"))
  invisible(County)
}
# return a list of all of the resulting datasets
lapply(1:402, f)
It would probably also be a good idea to take advantage of relative directories by first setting your working directory:
setwd("/Users/nathanrhauke/Desktop/MA_NH/Data/ALO/SEASONAL_ADJUSTMENT/")
Then you can simplify the above paths to:
d <- read.dta(paste0("./SINGLE_SERIES/County",i,".dta"))
and
write.dta(County, paste0("./ADJUSTED_SERIES/County",i,".dta"))
which will make your code more readable and reproducible should, for example, someone ever run it on another computer.
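The difference between the Stata-style string and a constructed one is easy to check at the console. This small sketch uses relative paths only and touches no files.

```r
i <- 7

# Inside quotes, [i] is just literal text; R does not substitute the index
literal <- "./SINGLE_SERIES/County[i]"

# paste0() builds the name from the current value of i
built <- paste0("./SINGLE_SERIES/County", i, ".dta")

# Several names at once: sprintf() is vectorised, file.path() adds the separator
files <- file.path(".", "SINGLE_SERIES", sprintf("County%d.dta", 1:3))
```

Here built is "./SINGLE_SERIES/County7.dta", while literal still contains the characters [i].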

Including a "Hash Table" in a package

I am in the process of putting together a package I've been working on for almost a year now. I have what I call a hash table that a syllable look-up function requires. The hash table is really just an environment (I think; I'm no computer whiz) that serves as a look-up table. You can see the function I create it with below. I have a data set DICTIONARY (about 20,000 words) that will load when the package is loaded. I also want this DICTIONARY to be passed to the hash function to create a new environment when the package is loaded; something like env <- hash(DICTIONARY), as this is how I load the environment now. How do I make a function run on start-up when the package is loaded, so that this new environment is created for those using my package?
hash <- function(x, type = "character") {
e <- new.env(hash = TRUE, size = nrow(x), parent = emptyenv())
char <- function(col) assign(col[1], as.character(col[2]), envir = e)
num <- function(col) assign(col[1], as.numeric(col[2]), envir = e)
FUN <- if(type=="character") char else num
apply(x, 1, FUN)
return(e)
}
#currently how I load the environment with the DICTIONARY lookup table
env <- hash(DICTIONARY)
Here's the head of DICTIONARY if it's helpful:
word syllables
1 hm 1
2 hmm 1
3 hmmm 1
4 hmph 1
5 mmhmm 2
6 mmhm 2
7 mm 1
8 mmm 1
9 mmmm 1
10 pff 1
Many of you may be thinking "It is up to the user to decide whether they want the environment loaded". Valid point, but the intended audience of this package is people in the literacy field. Not many in that field are R users, so I have to make this thing as easy as possible to use. I just wanted to explain why I want to do this so that it doesn't become a point of contention.
Thank you in advance. (PS I've looked at this manual (LINK) but can't seem to locate any info about this topic)
EDIT:
Per Andrei's suggestion, I think it will be something like this? But I'm not sure. Does this load after all the other functions and data sets in the package load? This stuff is a little confusing to me.
.onLoad <- function(){
env <- hash(DICTIONARY)
}
If the hash is going to change infrequently (this seems like the case, from your problem description), then save the hash into your package source tree as
save(env, file="<my_pkg>/R/sysdata.rda")
After installing the package, env will be available inside the name space, my_pkg:::env. See section 1.1.3 of "Writing R Extensions". You might have a script, say in "/inst/scripts/make_env.R" that creates env, and that you as the developer use on those rare occasions when env needs to be updated.
Another possibility is that the hash changes, but only on package installation. Then the solution is to write code that is evaluated at package installation. So in a file /R/env.R write something along the lines of
env <- local({
  localenv <- new.env(parent=emptyenv())
  ## fill up localenv, then return it
  localenv[["foo"]] = "bar"
  localenv
})
The possibility solved by .onLoad is that the data changes each time the package is loaded, e.g., because it is retrieving an update from some on-line source.
env <- new.env(parent=emptyenv())
.onLoad <- function(libname, pkgname)
{
  ## fill up env
  env[["foo"]] = "bar"
}
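Whichever of the three options is used, filling and querying the environment can be sketched as follows. The three-row DICTIONARY here is a toy subset of the head shown in the question, and list2env() is one possible alternative to the apply()-based hash(), not the author's method.

```r
# Toy subset of the DICTIONARY head from the question
DICTIONARY <- data.frame(
  word      = c("hm", "mmhmm", "pff"),
  syllables = c(1, 2, 1),
  stringsAsFactors = FALSE
)

# Build the look-up environment in one call: names are the words, values the counts
env <- list2env(
  setNames(as.list(DICTIONARY$syllables), DICTIONARY$word),
  envir = new.env(hash = TRUE, parent = emptyenv())
)

# Look up a known word, and test for an unknown one without erroring
n_syl <- get("mmhmm", envir = env)
known <- exists("zzz", envir = env, inherits = FALSE)
```

Using parent = emptyenv() together with inherits = FALSE keeps a look-up from falling through to unrelated objects that happen to share a word's name.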

Resources