Speed up string decomposition in R

I am relatively new to R, so my repertoire of commands is limited.
I am trying to write a script that will decompose a series of Markovian sequences, contained in a text string and delimited with a '>' sign, into a contingency "from - to" table.
The code below, with dummy data, is as far as I have been able to get. On the small seven-case example included, it runs relatively quickly. In reality, however, I have millions of cases to parse, and my code just isn't efficient enough to process them in a timely fashion (a run took well over an hour, which isn't feasible).
I'm convinced there is a more efficient way of structuring this code so that it executes quickly, as I have seen this operation performed by other Markov packages within a few minutes. I need my own scripted version, though, to allow flexibility in processing, hence I have not turned to these.
What I would like to request are improvements to the script to increase processing efficiency.
Seq <- c('A>B>C>D', 'A>B>C', 'A', 'A', 'B', 'B>D>C', 'D') # 7 cases
Lives <- c(0, 0, 0, 0, 1, 1, 0)
Seqdata <- data.frame(Seq, Lives)
Seqdata$Seq <- gsub("\\s", "", Seqdata$Seq)
fromstep <- list()
tostep <- list()
## ORDER 1 ##
for (x in 1:nrow(Seqdata)) {
  steps <- unlist(strsplit(Seqdata$Seq[x], ">"))
  for (i in 1:length(steps)) {
    if (i == 1) {
      fromstep <- c(fromstep, "Start")
      tostep <- c(tostep, steps[i])
    }
    fromstep <- c(fromstep, steps[i])
    if (i < length(steps)) {
      tostep <- c(tostep, steps[i + 1])
    } else if (Seqdata$Lives[x] == 1) {
      tostep <- c(tostep, 'Lives')
    } else {
      tostep <- c(tostep, 'Dies')
    }
  }
}
transition.freq <- table(unlist(fromstep), unlist(tostep))
transition.freq

I'm not familiar with Markovian sequences, but this produces the same output:
xx <- strsplit(Seqdata$Seq, '>', fixed = TRUE)
table(From = unlist(lapply(xx, append, 'Start', 0L)),
      To = unlist(mapply(c, xx, ifelse(Seqdata$Lives == 0L, 'Dies', 'Lives'))))
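Two things drive the speedup: strsplit is called once on the whole column, and the repeated c(fromstep, ...) calls disappear; those copy the growing list on every iteration, making the original loop quadratic in the total number of steps. A quick sanity check (a sketch, reusing xx and transition.freq from above):
vec.freq <- table(From = unlist(lapply(xx, append, 'Start', 0L)),
                  To = unlist(mapply(c, xx, ifelse(Seqdata$Lives == 0L, 'Dies', 'Lives'))))
all(vec.freq == transition.freq)  # expected TRUE: same pairs, same counts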

Related

How would you write this using apply family of functions in R? Should you?

Here is my R Script that works just fine:
perc.rank <- function(x) trunc(rank(x)) / length(x) * 100.0
library(dplyr)
setwd("~/R/xyz")
datFm <- read.csv("yellow_point_02.csv")
datFm <- filter(datFm, HRA_ClassHRA_Final != -9999)
quant_cols <- c("CL_GammaRay_Despiked_Spline_MLR", "CT_Density_Despiked_Spline_FinalMerged",
                "HRA_PC_1HRA_Final", "HRA_PC_2HRA_Final", "HRA_PC_3HRA_Final",
                "SRES_IMGCAL_SHIFT2VL_Slab_SHIFT2CL_DT", "Ultrasonic_DT_Despiked_Spline_MLR")
# add an extra column to datFm to store the quantile value
for (column_name in quant_cols) {
  datFm[paste(column_name, "quantile", sep = "_")] <- NA
}
# initialize an empty dataframe with the new column names appended
newDatFm <- datFm[0, ]
# get the unique values for the hra classes
hraClassNumV <- sort(unique(datFm$HRA_ClassHRA_Final))
# loop through the vector and create currDatFm and append it to newDatFm
for (i in hraClassNumV) {
  currDatFm <- filter(datFm, HRA_ClassHRA_Final == i)
  for (column_name in quant_cols) {
    currDatFm <- within(currDatFm, {
      CL_GammaRay_Despiked_Spline_MLR_quantile <- perc.rank(currDatFm$CL_GammaRay_Despiked_Spline_MLR)
      CT_Density_Despiked_Spline_FinalMerged_quantile <- perc.rank(currDatFm$CT_Density_Despiked_Spline_FinalMerged)
      HRA_PC_1HRA_Final_quantile <- perc.rank(currDatFm$HRA_PC_1HRA_Final)
      HRA_PC_2HRA_Final_quantile <- perc.rank(currDatFm$HRA_PC_2HRA_Final)
      HRA_PC_3HRA_Final_quantile <- perc.rank(currDatFm$HRA_PC_3HRA_Final)
      SRES_IMGCAL_SHIFT2VL_Slab_SHIFT2CL_DT_quantile <- perc.rank(currDatFm$SRES_IMGCAL_SHIFT2VL_Slab_SHIFT2CL_DT)
      Ultrasonic_DT_Despiked_Spline_MLR_quantile <- perc.rank(currDatFm$Ultrasonic_DT_Despiked_Spline_MLR)
    })
  }
  newDatFm <- rbind(newDatFm, currDatFm)
}
newDatFm <- newDatFm[order(newDatFm$Core_Depth),]
# head(newDatFm, 10)
write.csv(newDatFm, file = "Ricardo_quantiles.csv")
I have a few questions though. Every R book or video that I have read or watched recommends using the 'apply' family of language constructs over the classic 'for' loop, stating that apply is much faster.
So the first question is: how would you write this using apply (or tapply, or some other member of the apply family)?
Second, is it really true that apply is much faster than for? The csv file 'yellow_point_02.csv' has approx. 2500 rows, and this script runs almost instantly on my MacBook Pro, which has 16 GB of memory.
Third, see the 'quant_cols' vector? I created it so that I could write a generic loop (for (column_name in quant_cols)), but I could not make it work, so I hard-coded the column names suffixed with '_quantile' and called 'perc.rank' many times. Is there a way this could be made dynamic? I tried the 'paste' approach that is still in my script, but that did not work (a sketch of what I was after appears below).
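A minimal sketch of that dynamic version (not from the original script; it reuses perc.rank, quant_cols, hraClassNumV, and dplyr::filter from above, and indexes columns by their pasted names with [[ ]] instead of $):
for (i in hraClassNumV) {
  currDatFm <- filter(datFm, HRA_ClassHRA_Final == i)
  for (column_name in quant_cols) {
    # build the target column name dynamically and assign by name
    currDatFm[[paste(column_name, "quantile", sep = "_")]] <-
      perc.rank(currDatFm[[column_name]])
  }
  newDatFm <- rbind(newDatFm, currDatFm)
}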
On the positive side though, R seems awesome in its ability to cut through the 'Data Wrangling' tasks with very few statements.
Thanks for your time.

R data.table performance - proceed line by line

I have a performance issue in R using a data.table.
I have a data.table with mixed data and I need to do some simple computations proceeding line by line, so I need a for-loop over all rows. Usually I avoid for-loops, but since the computation for one line depends on the previous one, I can't avoid it here.
I've built a basic example which reproduces the issue:
library(data.table)
# create some sample data
dt <- data.table(x = 1:300000, y = rnorm(100), z = c("a", "b", "c", "d"))
dt$new <- NA_integer_
# init some demo variables
xxx <- 612341
yyy <- 1
for (i in (1:nrow(dt))) {
  dt[i, new := xxx]                    # write something
  yyy <- dt[i, y]                      # read something
  if ((i %% 20000) == 0) { print(i) }  # see progress
}
Using system.time, this runs in 230 seconds. Pretty long for very simple computations on 300,000 rows...?
For performance optimization I already allocate memory for the new column before the loop, and I use the := operator to avoid copying.
Is there any way to speed this up?
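An aside (not part of the original post): most of the time here goes into the overhead of the [.data.table call that is re-evaluated on every iteration. data.table::set() performs the same per-row assignment without that overhead; a minimal sketch on the demo data above:
library(data.table)
dt <- data.table(x = 1:300000, y = rnorm(100), z = c("a", "b", "c", "d"))
dt[, new := NA_integer_]
xxx <- 612341L
for (i in seq_len(nrow(dt))) {
  set(dt, i, "new", xxx)  # set() skips the [.data.table dispatch on each call
  yyy <- dt$y[i]          # plain vector indexing is likewise much cheaper than dt[i, y]
}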
Here is the operation I am actually trying to compute: I want to carry over the y value of the previous line whenever the current line belongs to the same class.
dt <- data.table(x = 1:30000, y = rnorm(100),
                 z = c("a", "a", "a", "a", "b", "b", "c", "c", "c", "d"))
dt$new <- NA_real_
remember <- NA_real_
currentclass <- ""
for (i in (1:nrow(dt))) {
  if (dt[i, z] == currentclass) {
    dt[i, new := remember]
    remember <- dt[i, y]
  } else {
    currentclass <- dt[i, z]
    remember <- dt[i, y]
    dt[i, new := NA]
  }
}
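An aside: for this particular operation, the loop can be replaced entirely by a grouped shift. A minimal sketch (assuming, as the loop does, that "same class" means consecutive rows with the same z, hence rleid() rather than by = z):
library(data.table)
dt <- data.table(x = 1:30000, y = rnorm(100),
                 z = c("a", "a", "a", "a", "b", "b", "c", "c", "c", "d"))
# previous row's y within each consecutive run of z; the first row of a run stays NA
dt[, new := shift(y), by = rleid(z)]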

R Function Slow, Looking to Increase Speed/Performance

I've built a prediction function in R, but when I run it it's very slow, and I'm only using a sample of 1% of the data I'll be using in production. The function is intended to predict the next word given a series of ngrams (two-word, three-word, or four-word combinations created from my corpus).
I pass the words to the function, for example "i can", along with the matrix of three-word combinations. The output, ranked in decreasing order of count, would be "i can read" with a count of 4.
The two-word ngram is passed in as a matrix; here are its dimensions and example data from position 100.
dim(bigram_index)
[1] 46201 3
bigram_index[,1][100]
[1] "abandon"
bigram_index[,2][100]
[1] "contemporary"
bigram_index[,3][100]
[1] "1"
Here is the prediction function:
predict.next.word <- function(word, ng_matrix){
  ngram_df <- data.frame(predicted = character(), count = numeric(), stringsAsFactors = FALSE)
  nrow_ng_matrix <- nrow(ng_matrix)  # was nrow(bigram_index); use the matrix actually passed in
  if (ncol(ng_matrix) == 3) {
    for (i in 1:nrow_ng_matrix) {
      first_word <- ng_matrix[, 1][i]
      second_word <- ng_matrix[, 2][i]
      count_word <- ng_matrix[, 3][i]
      if (word[1] == first_word && !is.na(first_word)) {
        matched_factor <- structure(c(second_word, count_word), .Names = c("predicted", "count"))
        ngram_df[i, ] <- as.list(matched_factor)
      }
    }
  } else if (ncol(ng_matrix) == 4) {
    for (i in 1:nrow_ng_matrix) {
      first_word <- ng_matrix[, 1][i]
      second_word <- ng_matrix[, 2][i]
      third_word <- ng_matrix[, 3][i]
      count_word <- ng_matrix[, 4][i]
      if (word[1] == first_word && !is.na(first_word) && word[2] == second_word && !is.na(second_word)) {
        matched_factor <- structure(c(third_word, count_word), .Names = c("predicted", "count"))
        ngram_df[i, ] <- as.list(matched_factor)
      }
    }
  } else if (ncol(ng_matrix) == 5) {
    for (i in 1:nrow_ng_matrix) {
      first_word <- ng_matrix[, 1][i]
      second_word <- ng_matrix[, 2][i]
      third_word <- ng_matrix[, 3][i]
      fourth_word <- ng_matrix[, 4][i]
      count_word <- ng_matrix[, 5][i]
      if (word[1] == first_word && !is.na(first_word) && word[2] == second_word
          && !is.na(second_word) && word[3] == third_word && !is.na(third_word)) {
        matched_factor <- structure(c(fourth_word, count_word), .Names = c("predicted", "count"))  # was missing
        ngram_df[i, ] <- as.list(matched_factor)
      }
    }
  }
  ngram_df <- transform(ngram_df, count = as.numeric(count))
  return(ngram_df[order(ngram_df$count, decreasing = TRUE), ])
}
Using the smallest ngram matrix (two-word only), here are the timing results:
system.time(predict.next.word(c("abandon"), bigram_index))
user system elapsed
92.125 59.395 152.149
Again, the matrix passed here is only 1% of the production data, and when I get into three- and four-word ngrams it takes much longer. Please share your insights on how to improve this function's speed.
Instead of looping through the rows yourself, I would write a function that performs the key actions of the for() loop and use apply() (with MARGIN=2 for columns, 1 for rows; I think you'll be using the latter) to apply that function to each row (FUN= argument set equal to your function). Depending on the output format, apply might not be suitable; at that point you could look into the plyr package, dplyr, or, my favorite (but somewhat of a learning curve, as is dplyr), the data.table package.
In general, take a look at Hadley's book chapter on the topic: http://adv-r.had.co.nz/Performance.html
Currently, your code doesn't take advantage of the fact that so-called "vectorized" R code performs loops in C, making them much faster (forgive me if this description is technically imprecise; it gets the idea across).
For a more specific example, it might be helpful to see the input (use dput(data)) and desired output; then I'd have an easier time digesting what you want your function to accomplish.
Some general points that could help, at least a little:
You call ncol(ng_matrix) several times; instead, do nc.ngm <- ncol(ng_matrix) once at the start. The savings will be minimal, but the idea is still useful.
Instead of defining first_word, second_word, etc. one at a time, just do something like words <- ng_matrix[i, ]. Then use that object to get count_word <- words[nc.ngm], and get the other words as numbered_words <- words[-nc.ngm]. To compare the elements of word to the elements of numbered_words, you could even make use of mapply. Again, this is all a little hard to follow without an example, but in general, do things "in bulk" (vectorize), as in the sketch below.
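To make the vectorization point concrete, here is a minimal sketch (mine, not tested against the real data) of the same lookup done with one vectorized comparison per context column instead of a row-by-row loop; it assumes the layout shown above, context words first and the count in the last column:
predict.next.word <- function(word, ng_matrix) {
  nc <- ncol(ng_matrix)
  # logical vector marking rows whose context columns all match the query words
  hits <- ng_matrix[, 1] == word[1]
  if (nc >= 4) hits <- hits & ng_matrix[, 2] == word[2]
  if (nc == 5) hits <- hits & ng_matrix[, 3] == word[3]
  hits[is.na(hits)] <- FALSE  # treat NA comparisons as non-matches
  ngram_df <- data.frame(predicted = ng_matrix[hits, nc - 1],
                         count = as.numeric(ng_matrix[hits, nc]),
                         stringsAsFactors = FALSE)
  ngram_df[order(ngram_df$count, decreasing = TRUE), ]
}
# system.time(predict.next.word(c("abandon"), bigram_index)) should be far faster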

Looping through numerous functions with a vector of characters

I thought this would be easier than it is, but I am learning to code in R, so looping is certainly not my strong point.
What I am attempting to do here is take a series of statements that all share a common theme of MISO. As you can see in the first batch of code below, MISO is the common element, and I would like to swap MISO for i, as one would in a for loop, and then loop through a few different names in a character vector. Let's call that character vector ID, so that ID <- c("MISO","PJM","SERC").
At this point in the code, all_Cities_MISO is already a data frame in my environment. I just want to break it up and perform some calculations.
meanAvgHighMISO <- mean(all_Cities_MISO$Col21)
meanAvgLowMISO <- mean(all_Cities_MISO$Col20)
meanAvgMISO <- mean(cbind(meanAvgHighMISO,meanAvgLowMISO))
names(meanAvgMISO) <- ifelse(meanAvgMISO<65,"HDD","CDD")
MISO_Avg_DD <- ifelse(meanAvgMISO<65,(65-meanAvgMISO),(meanAvgMISO-65)) #average degree days for each period
MISO_op_mean <- apply(all_Cities_MISO[,1:19],2, mean)
So I attempted this a few different times, but keep getting errors like Error: unexpected string constant in:
" meanAvgHigh"i""
I feel like it should be simple to replace MISO with PJM throughout the code above using a simple for loop, but no luck. It must be something about R not accepting MISO as a character.
Here is my attempt at the for loop:
ID <- c("MISO","PJM","SERC")
for (i in ID) {
  meanAvgHigh"i" <- mean(all_Cities_"i"$Col21)
  meanAvgLow"i" <- mean(all_Cities_"i"$Col20)
  meanAvg"i" <- mean(cbind(meanAvgHigh"i", meanAvgLow"i"))
  names(meanAvg"i") <- ifelse(meanAvg"i" < 65, "HDD", "CDD")
  "i"_Avg_DD <- ifelse(meanAvg"i" < 65, (65 - meanAvg"i"), (meanAvg"i" - 65)) # average degree days for each period
  "i"_op_mean <- apply(all_Cities_"i"[, 1:19], 2, mean)
}
I attempted using [i] instead of "i", but that didn't work either. I understand that MISO itself in the first code I displayed is not a character, but I'm not sure how R would recognize it in the loop otherwise. I just would like to do a simple swap of names in a loop: MISO for PJM, or for SERC, and so on.
Any help is greatly appreciated, thank you.
While you can do that using the environment() and assign() functions, I would advise against it. You should instead use a nested list to save those values (a sketch of that approach follows after the code below).
However, for completeness, this is how I think one would do it (untested):
env <- environment()
ID <- c("MISO","PJM","SERC")
for (i in ID) {
  assign(paste0("meanAvgHigh", i), mean(env[[ paste0("all_Cities_", i) ]]$Col21))
  assign(paste0("meanAvgLow", i), mean(env[[ paste0("all_Cities_", i) ]]$Col20))
  assign(paste0("meanAvg", i), mean(cbind(env[[ paste0("meanAvgHigh", i) ]], env[[ paste0("meanAvgLow", i) ]])))
  names(env[[ paste0("meanAvg", i) ]]) <- ifelse(env[[ paste0("meanAvg", i) ]] < 65, "HDD", "CDD")
  ##### Note: the ifelse can probably be replaced by an abs
  assign(paste0(i, "_Avg_DD"), ifelse(env[[ paste0("meanAvg", i) ]] < 65,
                                      (65 - env[[ paste0("meanAvg", i) ]]),
                                      (env[[ paste0("meanAvg", i) ]] - 65))) # average degree days for each period
  assign(paste0(i, "_op_mean"), apply(env[[ paste0("all_Cities_", i) ]][, 1:19], 2, mean))
}
The basic idea is to use assign() to set the values in the current environment, and then use the environment object env to get them back via [[ indexing.
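For completeness, a minimal sketch of the nested-list approach recommended above (untested; it assumes the all_Cities_* data frames exist and borrows the abs() simplification noted in the comment):
all_cities <- list(MISO = all_Cities_MISO, PJM = all_Cities_PJM, SERC = all_Cities_SERC)
results <- lapply(all_cities, function(df) {
  meanAvg <- mean(c(mean(df$Col21), mean(df$Col20)))
  list(meanAvg = meanAvg,
       dd_type = ifelse(meanAvg < 65, "HDD", "CDD"),
       Avg_DD  = abs(meanAvg - 65),          # average degree days for each period
       op_mean = apply(df[, 1:19], 2, mean))
})
# access results by name, e.g. results$PJM$Avg_DD or results$MISO$op_mean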

memory allocation error while using mclapply

This is the first time I have used mclapply to run a parallel script on multiple processes. I tried the script on my laptop and it worked very well, filling the data frame correctly, but now when I run it on my office PC, when the printing ends and it's time to collect the data, the script stops with this error:
Error: cannot allocate vector of size 80 Kb
library(parallel)

fun <- function(testdf) {
  l <- 12000
  errordf <- data.frame()
  errordf <- mclapply(1:15000, function(i) {  # was 1:nrow(15000), which cannot run; 15000 iterations assumed
    for (ind in 1:nrow(testdf)) {
      if (i >= l/2) {
        testdf[ind, ]$X <- testdf[ind, ]$pos * 2
      } else {
        testdf[ind, ]$X <- testdf[ind, ]$pos / l
      }
    }
    permdf <- testdf
    lapply(1:100, function(j) {
      permdf$X <- sample(permdf$X, nrow(permdf), replace = FALSE)
      fit <- lm(X ~ gx, permdf)        # linear regression calculation
      regerror <- sum(residuals(fit)^2)
      data.frame(pc = i, error = regerror)
    })
  }, mc.cores = 3)
  res <- NULL
  tmp <- lapply(errordf, function(ii) {
    tmp <- lapply(ii, function(ij) {   # rbind the data and return the dataframe
      res <<- rbind(res, ij)
    })
  })
  return(res)
}
testdf example:
structure(list(ax = c(-0.0242214, 0.19770304, 0.01587302, -0.0374415,
0.05079826, 0.12209738), gx = c(-0.3913043, -0.0242214, -0.4259067,
-0.725, -0.0374415, 0.01587302), pos = c(11222, 13564, 16532,
12543, 12534, 14354)), .Names = c("ax", "gx", "pos"), row.names = c(NA,
-6L), class = "data.frame")
I'm sure that the code works (that's why I did not include the full code), because I tried it multiple times on my laptop, but when I try it on my office PC it throws this error.
Any help would be appreciated.
Right now you don't use lapply as intended in your last doubly nested loop; you might as well use a for loop instead of lapply combined with a global variable. In addition, you continuously grow res, which is rather memory- and time-intensive. Normally an lapply loop would not suffer from this problem, but your use of a global variable totally negates that advantage. You seem to have a doubly nested list you want to rbind. I would definitely not loop over the data structure; I would just use something along the lines of do.call("rbind", data_structure), although it is hard to provide concrete advice without a reproducible example. That solution does not suffer from the continuous-growing problem you are experiencing.
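Concretely, a sketch of that flattening step (assuming errordf has the shape the function above produces: an outer list from mclapply whose elements are inner lists of one-row data frames):
# flatten each inner list, then the outer one, without growing res row by row
res <- do.call(rbind, lapply(errordf, function(inner) do.call(rbind, inner)))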
