How to improve processing time of large rolling window computation - r

I have a large dataset, including about 100,000 entries. I am using the tibbletime package to create a rolling version of the DL.test function from the vrtest package.
I am using a rolling window (size=1000), leading to about 99,000 computations. The code looks like this:
#installing packages
install.packages("tibbletime")
install.packages("vrtest")
#importing libraries
library(vrtest)
library(dplyr)
library(tibbletime)
library(tibble)
#generating demo data
data <- data.frame(log_return = sample(0:1, 1010, replace = TRUE))
#running DL.test once
DL.test(data, 300, 1)
#creating a rolling window version of DL.test
test <- rollify(DL.test, window=1000, unlist=FALSE)
#applying function and saving results
results <- dplyr::mutate(data, test = test(log_return))
The issue is that running DL.test even once takes a little less than 5 minutes on my current setup. Having to repeat this step nearly 100,000 times severely limits its practicality.
What options do I have to speed this process up?
My current idea would be to create many smaller versions of my original dataset (e.g., entries 1 - 1500 for the first 500 computations, 501 - 2000 for the second batch...) and somehow employ parallel processing.
Any hints are highly appreciated!
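The batching-plus-parallel idea from the question could be sketched roughly as follows, using only the base parallel package. This is a sketch under assumptions, not a tested replacement for rollify(): slow_stat() is a hypothetical stand-in for DL.test(), and roll_parallel() and n_workers are names made up for illustration.

```r
library(parallel)

# Hypothetical stand-in for an expensive statistic such as DL.test()
slow_stat <- function(x) var(x)

# Apply `fun` to every window of length `window`, spreading windows over workers
roll_parallel <- function(x, window, fun, n_workers = 2) {
  starts <- seq_len(length(x) - window + 1)
  cl <- makeCluster(n_workers)
  on.exit(stopCluster(cl))
  res <- parLapply(cl, starts, function(s, x, window, fun) {
    fun(x[s:(s + window - 1)])
  }, x = x, window = window, fun = fun)
  # Pad with NA so each result lines up with the last row of its window
  c(rep(NA, window - 1), unlist(res))
}

set.seed(1)
x <- rnorm(50)
out <- roll_parallel(x, window = 10, fun = slow_stat)
```

If the function returns more than a single number (the question uses unlist=FALSE, which suggests DL.test does), keep the list returned by parLapply instead of calling unlist(). The speedup is bounded by the number of workers, so the per-call cost still dominates.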

Related

How to run Dirichlet Regression with a big data set in R?

I would like to run a Dirichlet regression on a large data set using the DirichletReg package in R. I currently have a data.frame with 37 columns and ~13,000,000 rows.
However, running this model on all of my data instantly crashes R. I am using a Linux machine with 16 cores and 128 GB of memory. Even just cutting down my data to only 1000 points still causes R to almost immediately crash and restart.
Am I doing something wrong? Is there any way I can parallelize this operation to get this model to run?
I am running a model with the following syntax:
data.2 <- data
data.2$y_variable <- DR_data(data[,c(33:35)])
model <- DirichReg(y_variable ~ x_variable, data.2)
I have to create the y_variable in a separate data.2 data.frame, because running data$y_variable <- DR_data(data[,c(33:35)]) will crash R. I have no idea why this is.
Bit of a guess why it's 'crashing' R, but if it's due to RAM issues then you can update the table by reference, rather than making a shallow copy of the entire data:
library(data.table)
setDT(data)
data[, y_variable := DR_data(data[,c(33:35)])]

Calculate sentiment of each row in a big dataset using R

I am having trouble calculating the average sentiment of each row in a relatively big dataset (N=36140).
My dataset contains review data from an app on Google Play Store (each row represents one review) and I would like to calculate the sentiment of each review using the sentiment_by() function.
The problem is that this function takes a lot of time to calculate it.
Here is the link to my dataset in .csv format:
https://drive.google.com/drive/folders/1JdMOGeN3AtfiEgXEu0rAP3XIe3Kc369O?usp=sharing
I have tried using this code:
library(sentimentr)
e_data = read.csv("15_06_2016-15_06_2020__Sygic.csv", stringsAsFactors = FALSE)
sentiment=sentiment_by(e_data$review)
Then I get the following warning message (after I cancel the process when 10+ minutes have passed):
Warning message:
Each time `sentiment_by` is run it has to do sentence boundary disambiguation when a
raw `character` vector is passed to `text.var`. This may be costly of time and
memory. It is highly recommended that the user first runs the raw `character`
vector through the `get_sentences` function.
I have also tried to use the get_sentences() function with the following code, but the sentiment_by() function still needs a lot of time to execute the calculations.
e_sentences = e_data$review %>%
get_sentences()
e_sentiment = sentiment_by(e_sentences)
I have other datasets of Google Play Store review data, and I have used the sentiment_by() function on them for the past month; it calculated the sentiment very quickly. The calculations only started taking this long yesterday.
Is there a way to quickly calculate sentiment for each row of a big dataset?
The algorithm used in sentiment appears to be O(N^2) once you get above 500 or so individual reviews, which is why it's suddenly taking a lot longer when you upped the size of the dataset significantly. Presumably it's comparing every pair of reviews in some way?
I glanced through the help file (?sentiment) and it doesn't seem to do anything which depends on pairs of reviews so that's a bit odd.
library(data.table)
reviews <- iconv(e_data$review, "") # I had a problem with UTF-8, you may not need this
x1 <- rbindlist(lapply(reviews[1:10],sentiment_by))
x1[,element_id:=.I]
x2 <- sentiment_by(reviews[1:10])
produce effectively the same output, which means that the sentimentr package has a bug in it causing it to be unnecessarily slow.
One solution is just to batch the reviews. This will break the 'by' functionality in sentiment_by, but I think you should be able to group them yourself before you send them in (or after, as it doesn't seem to matter).
batch_sentiment_by <- function(reviews, batch_size = 200, ...) {
review_batches <- split(reviews, ceiling(seq_along(reviews)/batch_size))
x <- rbindlist(lapply(review_batches, sentiment_by, ...))
x[, element_id := .I]
x[]
}
batch_sentiment_by(reviews)
Takes about 45 seconds on my machine (and should be O(N) for bigger datasets).
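The batching step itself does not depend on sentimentr, so it can be checked in isolation. The sketch below uses a hypothetical stand-in scoring function (score(), which just counts characters) to show that splitting into fixed-size batches and recombining preserves both the order and the per-item results; batch_apply() is an illustrative name, not part of any package.

```r
# Split `items` into batches of `batch_size`, apply `fun` per batch, recombine
batch_apply <- function(items, batch_size, fun) {
  batch_id <- ceiling(seq_along(items) / batch_size)
  batches <- split(items, batch_id)
  unlist(lapply(batches, fun), use.names = FALSE)
}

# Hypothetical stand-in for a per-review scorer such as sentiment_by()
score <- function(txt) nchar(txt)

reviews <- c("good app", "bad", "works fine", "crashes often", "ok")
batched <- batch_apply(reviews, batch_size = 2, fun = score)
direct <- score(reviews)
identical(batched, direct)
```

This only holds when the score of one item does not depend on the other items in the same call, which is what the comparison of x1 and x2 above indicates for sentiment_by.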

use ape to phase a fasta file and create a DNAbin file as output, then test tajima's D using pegas

I'm trying to complete the very simple task of reading in an unphased fasta file and phasing it using ape, and then calculating Tajima's D using pegas, but my data doesn't seem to be reading in correctly. Input and output are as follows:
library("ape")
library("adegenet")
library("ade4")
library("pegas")
DNAbin8c18 <- read.dna(file="fasta8c18.fa", format="f")
I shouldn't need to attach any data since I've just generated the file, but since the data() command was in the manual, I executed
data(DNAbin8c18)
and got
Warning message: In data(DNAbin8c18) : data set ‘DNAbin8c18’ not found
I know that data() only works in certain contexts, so maybe this isn't a big deal. I looked at what had been loaded
DNAbin8c18
817452 DNA sequences in binary format stored in a matrix.
All sequences of same length: 96
Labels:
CLocus_12706_Sample_1_Locus_34105_Allele_0 [BayOfIslands_s08...
CLocus_12706_Sample_2_Locus_31118_Allele_0 [BayOfIslands_s08...
CLocus_12706_Sample_3_Locus_30313_Allele_0 [BayOfIslands_s09...
CLocus_12706_Sample_5_Locus_33345_Allele_0 [BayOfIslands_s09...
CLocus_12706_Sample_7_Locus_37388_Allele_0 [BayOfIslands_s09...
CLocus_12706_Sample_8_Locus_29451_Allele_0 [BayOfIslands_s09... ...
More than 10 million nucleotides: not printing base composition
so it looks like the data should be fine. Because of this, I tried what I want to do
tajima.test(DNAbin8c18)
and got
Error: cannot allocate vector of size 2489.3 Gb
Many people have completed this same test using as many or more SNPs than I have, and also using FASTA files, but is it possible that mine is too big, or can you see another issue?
The data file can be downloaded at the following link
https://drive.google.com/open?id=0B6qb8IlaQGFZLVRYeXMwRnpMTUU
I have also sent an earlier version of this question, with the data, to the r-sig-genetics mailing list, but I have not heard back.
Any thoughts would be much appreciated.
Ella
Thank you for the comment. Indeed, you are correct. The developer just emailed me with the following very helpful comments.
The problem is that your data are too big (too many sequences) and tajima.test() needs to compute the matrix of all pairwise distances. You could check this by trying:
dist.dna(DNAbin8c18, "N")
One possibility for you is to sample randomly some observations, and repeat this many times, eg:
tajima.test(DNAbin8c18[sample(nrow(DNAbin8c18), size = 1000), ])
This could be:
N <- 1000 # number of repeats
n <- nrow(DNAbin8c18) # number of sequences
RES <- matrix(NA, 3, N) # one column per repeat: D, Pval.normal, Pval.beta
for (i in 1:N)
  RES[, i] <- unlist(tajima.test(DNAbin8c18[sample(n, size = 10000), ]))
You may adjust N and 'size =' to have something not too long to run. Then you may look at the distribution of each row of RES across the repeats.
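The size in the error message can be reproduced with a little arithmetic, which also shows why subsampling helps. This is a rough check that assumes each pairwise distance is stored as an 8-byte double; the variable names are illustrative.

```r
# Number of sequences reported when printing DNAbin8c18
n_seq <- 817452

# A full set of pairwise distances has n * (n - 1) / 2 entries
n_pairs <- n_seq * (n_seq - 1) / 2

# Assuming 8 bytes per distance, convert to the units R uses in the message
gib <- n_pairs * 8 / 1024^3
round(gib, 1)  # 2489.3, matching "cannot allocate vector of size 2489.3 Gb"

# The same arithmetic for a subsample of 1000 sequences, in megabytes
mib_sub <- 1000 * 999 / 2 * 8 / 1024^2
round(mib_sub, 1)  # 3.8
```

Because the cost grows with the square of the number of sequences, a subsample of 1000 needs a few megabytes rather than terabytes.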

as.h2o() in R to upload files to h2o environment takes a long time

I am using h2o to carry out some modelling. Having tuned the model, I would now like to use it to carry out a lot of predictions: approximately 6 billion predictions/rows, with each prediction row needing 80 columns of data.
I have already broken the input dataset down into about 500 chunks of 12 million rows, each with the relevant 80 columns of data.
However, uploading a data.table that is 12 million rows by 80 columns to h2o takes quite a long time, and doing it 500 times is taking a prohibitively long time. I think it's because it is parsing the object first before it is uploaded.
The prediction part is relatively quick in comparison....
Are there any suggestions to speed this part up? Would changing the number of cores help?
Below is a reproducible example of the issue...
# Load libraries
library(h2o)
library(data.table)
# start up h2o using all cores...
localH2O = h2o.init(nthreads=-1,max_mem_size="16g")
# create a test input dataset
temp <- CJ(v1=seq(20),
v2=seq(7),
v3=seq(24),
v4=seq(60),
v5=seq(60))
temp <- do.call(cbind,lapply(seq(16),function(y){temp}))
colnames(temp) <- paste0('v',seq(80))
# this is the part that takes a long time!!
system.time(tmp.obj <- as.h2o(localH2O,temp,key='test_input'))
#|======================================================================| 100%
# user system elapsed
#357.355 6.751 391.048
Since you are running H2O locally, you want to save that data as a file and then use:
h2o.importFile(localH2O, file_path, key='test_input')
This will have each thread read their parts of the file in parallel. If you run H2O on a separate server, then you would need to copy the data to a location that the server can read from (most people don't set the servers to read from the file system on their laptops).
as.h2o() serially uploads the file to H2O. With h2o.importFile(), the H2O server finds the file and reads it in parallel.
It looks like you are using version 2 of H2O. The same commands will work in H2Ov3, but some of the parameter names have changed a little. The new parameter names are here: http://cran.r-project.org/web/packages/h2o/h2o.pdf
Having also struggled with this problem, I did some tests and found that for objects in R memory (i.e. you don't have the luxury of already having them available in .csv or .txt form), by far the quickest way to load them (~21 x) is to use the fwrite function in data.table to write a csv to disk and read it using h2o.importFile.
The four approaches I tried:
1. Direct use of as.h2o()
2. Writing to disk using write.csv(), then loading using h2o.importFile()
3. Splitting the data in half, running as.h2o() on each half, then combining using h2o.rbind()
4. Writing to disk using fwrite() from data.table, then loading using h2o.importFile()
I performed the tests on a data.frame of varying size, and the results seem pretty clear.
The code, if anyone is interested in reproducing, is below.
library(h2o)
library(data.table)
h2o.init()
testdf <-as.data.frame(matrix(nrow=4000000,ncol=100))
testdf[1:1000000,] <-1000 # R won't let me assign the whole thing at once
testdf[1000001:2000000,] <-1000
testdf[2000001:3000000,] <-1000
testdf[3000001:4000000,] <-1000
resultsdf <-as.data.frame(matrix(nrow=20,ncol=5))
names(resultsdf) <-c("subset","method 1 time","method 2 time","method 3 time","method 4 time")
for(i in 1:20){
subdf <- testdf[1:(200000*i),]
resultsdf[i,1] <-200000*i # number of rows in subdf
# 1: use as.h2o()
start <-Sys.time()
as.h2o(subdf)
stop <-Sys.time()
resultsdf[i,2] <-as.numeric(stop)-as.numeric(start)
# 2: use write.csv then h2o.importFile()
start <-Sys.time()
write.csv(subdf,"hundredsandthousands.csv",row.names=FALSE)
h2o.importFile("hundredsandthousands.csv")
stop <-Sys.time()
resultsdf[i,3] <-as.numeric(stop)-as.numeric(start)
# 3: Split dataset in half, load both halves, then merge
start <-Sys.time()
length_subdf <-dim(subdf)[1]
h2o1 <-as.h2o(subdf[1:(length_subdf/2),])
h2o2 <-as.h2o(subdf[(1+length_subdf/2):length_subdf,])
h2o.rbind(h2o1,h2o2)
stop <-Sys.time()
resultsdf[i,4] <- as.numeric(stop)-as.numeric(start)
# 4: use fwrite then h2o.importFile()
start <-Sys.time()
fwrite(subdf,file="hundredsandthousands.csv",row.names=FALSE)
h2o.importFile("hundredsandthousands.csv")
stop <-Sys.time()
resultsdf[i,5] <-as.numeric(stop)-as.numeric(start)
plot(resultsdf[,1],resultsdf[,2],xlim=c(0,4000000),ylim=c(0,900),xlab="rows",ylab="time/s",main="Scaling of different methods of h2o frame loading")
for (i in 1:3){
points(resultsdf[,1],resultsdf[,(i+2)],col=i+1)
}
legendtext <-c("as.h2o","write.csv then h2o.importFile","Split in half, as.h2o and rbind","fwrite then h2o.importFile")
legend("topleft",legend=legendtext,col=c(1,2,3,4),pch=1)
print(resultsdf)
flush.console()
}
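The fastest approach in the benchmark above (fwrite, then h2o.importFile) could be wrapped in a small helper along these lines. This is a sketch under assumptions: df_to_h2o_via_disk() and frame_id are names made up here, the argument destination_frame is the H2O v3 name, and the helper only works when the H2O server can read the path that R writes to (for example a local H2O started with h2o.init()).

```r
# Hypothetical helper: write an in-memory table to CSV, then let H2O parse it.
# Assumes the data.table and h2o packages are installed and that h2o.init()
# has already been called on the same machine.
df_to_h2o_via_disk <- function(df, frame_id,
                               path = tempfile(fileext = ".csv")) {
  # fwrite() is the fast CSV writer from data.table
  data.table::fwrite(df, file = path)
  # H2O reads and parses the file itself, in parallel
  h2o::h2o.importFile(path, destination_frame = frame_id)
}
```

A call would look like df_to_h2o_via_disk(temp, "test_input"). The packages are referenced with :: so that defining the helper does not require them to be attached.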

How to work with a large multi type data frame in Snow R?

I have a large data.frame of 20M lines. This data frame is not only numeric; it contains characters as well. Using a split-and-conquer concept, I want to split this data frame so that it can be processed in parallel using the snow package (the parLapply function, specifically). The problem is that the nodes run out of memory because the data frame parts are held in RAM. I looked for a package to help me with this problem and I found just one (considering the multi-type data.frame): the ff package. Another problem comes from the use of this package. The result of splitting an ffdf is not equal to the result of splitting a common data.frame. Thus, it is not possible to run the parLapply function.
Do you know other packages for this goal? Bigmemory only supports matrix.
I've benchmarked some ways of splitting the data frame and parallelizing to see how effective they are with large data frames. This may help you deal with the 20M line data frame and not require another package.
The results are here. The description is below.
This suggests that for large data frames the best option is (not quite the fastest, but has a progress bar):
library(doSNOW)
library(itertools)
library(tictoc) # assumed source of tic()/toc(); the original does not say
# if size on cores exceeds available memory, increase the chunk factor
chunk.factor <- 1
chunk.num <- kNoCores * chunk.factor
tic()
# init the cluster
cl <- makePSOCKcluster(kNoCores)
registerDoSNOW(cl)
# init the progress bar
pb <- txtProgressBar(max = 100, style = 3)
progress <- function(n) setTxtProgressBar(pb, n)
opts <- list(progress = progress)
# conduct the parallelisation
travel.queries <- foreach(m=isplitRows(coord.table, chunks=chunk.num),
.combine='cbind',
.packages=c('httr','data.table'),
.export=c("QueryOSRM_dopar", "GetSingleTravelInfo"),
.options.snow = opts) %dopar% {
QueryOSRM_dopar(m,osrm.url,int.results.file)
}
# close progress bar
close(pb)
# stop cluster
stopCluster(cl)
toc()
Note that:
- coord.table is the data frame/table
- kNoCores (= 25 in this case) is the number of cores
The options I compared were:
1. Distributed memory. Sends coord.table to all nodes.
2. Shared memory. Shares coord.table with nodes.
3. Shared memory with cuts. Shares subset of coord.table with nodes.
4. Do par with cuts. Sends subset of coord.table to nodes.
5. SNOW with cuts and progress bar. Sends subset of coord.table to nodes.
6. Option 5 without progress bar.
More information about the other options I compared can be found here.
Some of these answers might suit you, although they don't relate to distributed parLapply, and I've included some of them in my benchmarking options.
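The "send only a subset to each node" idea can also be sketched with parLapply from the base parallel package, which accepts an ordinary mixed-type data.frame. The data, the chunk count, and the per-chunk function below are made up for illustration; they are not the benchmark code.

```r
library(parallel)

# Small mixed-type data frame standing in for the 20M-row table
df <- data.frame(id = 1:20,
                 grp = rep(c("a", "b"), 10),
                 val = (1:20) / 2,
                 stringsAsFactors = FALSE)

# Cut the rows into consecutive chunks; more chunks means less memory per task
n_chunks <- 4
chunk_size <- ceiling(nrow(df) / n_chunks)
chunk_id <- ceiling(seq_len(nrow(df)) / chunk_size)
chunks <- split(df, chunk_id)

# Each task receives only its own chunk, not the whole data frame
cl <- makeCluster(2)
partial <- parLapply(cl, chunks, function(part) {
  data.frame(id = part$id,
             label = paste0(part$grp, "_", part$val),
             stringsAsFactors = FALSE)
})
stopCluster(cl)

# Recombine the per-chunk results in their original order
result <- do.call(rbind, partial)
rownames(result) <- NULL
```

The list of chunks still has to fit in the master process, so this reduces memory use on the workers only.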
