I've built a prediction function in R, but when I run it, it's very slow, and I'm only using a 1% sample of the data I'll be using in production. The function is intended to predict the next word given a set of ngrams (two-word, three-word, or four-word combinations created from my corpus).
I pass the words to the function, for example "i can", along with the matrix of three-word combinations. The output, ranked in decreasing order of count, would be, for example, "i can read" with a count of 4.
Here is the two-word ngram, which is passed as a matrix, with its dimensions and example data from position 100:
dim(bigram_index)
[1] 46201 3
bigram_index[,1][100]
[1] "abandon"
bigram_index[,2][100]
[1] "contemporary"
bigram_index[,3][100]
[1] "1"
Here is the prediction function:
predict.next.word <- function(word, ng_matrix){
ngram_df <- data.frame(predicted=character(), count = numeric(), stringsAsFactors=FALSE)
col_ng_matrix <- nrow(ng_matrix)
if(ncol(ng_matrix)==3){
for (i in 1:col_ng_matrix){
first_word <- ng_matrix[,1][i]
second_word <- ng_matrix[,2][i]
count_word <- ng_matrix[,3][i]
if (word[1] == first_word && !is.na(first_word)){
matched_factor <- structure(c(second_word, count_word), .Names = c("predicted", "count"))
ngram_df[i,] <- as.list(matched_factor)
}
}
} else if(ncol(ng_matrix)==4){
for (i in 1:col_ng_matrix){
first_word <- ng_matrix[,1][i]
second_word <- ng_matrix[,2][i]
third_word <- ng_matrix[,3][i]
count_word <- ng_matrix[,4][i]
if (word[1] == first_word && !is.na(first_word) && word[2] == second_word && !is.na(second_word)){
matched_factor <- structure(c(third_word, count_word), .Names = c("predicted", "count"))
ngram_df[i,] <- as.list(matched_factor)
}
}
} else if(ncol(ng_matrix)==5){
for (i in 1:col_ng_matrix){
first_word <- ng_matrix[,1][i]
second_word <- ng_matrix[,2][i]
third_word <- ng_matrix[,3][i]
fourth_word <- ng_matrix[,4][i]
count_word <- ng_matrix[,5][i]
if (word[1] == first_word && !is.na(first_word) && word[2] == second_word
&& !is.na(second_word) && word[3] == third_word && !is.na(third_word)){
matched_factor <- structure(c(fourth_word, count_word), .Names = c("predicted", "count"))
ngram_df[i,] <- as.list(matched_factor)
}
}
}
ngram_df <- transform(ngram_df, count = as.numeric(count))
return (ngram_df[order(ngram_df$count, decreasing = TRUE),])
}
Using the smallest ngram (only two-word), here are the timing results:
system.time(predict.next.word(c("abandon"), bigram_index))
user system elapsed
92.125 59.395 152.149
Again, the ngram passed is only 1% of the production data, and when I get into three-word and four-word ngrams, it takes much longer. Please provide your insight on how to improve this function's speed.
Instead of looping through the matrix, I would write a function that performs the key actions of the for() loop, and use apply() (with MARGIN=2 for columns, 1 for rows; I think you'll be using the latter) to apply that function to each row or column (FUN= argument set equal to your function). Depending on the output format, apply might not be suitable. At that point you could look into the plyr package, dplyr, or, my favorite (but with somewhat of a learning curve, as with dplyr), the data.table package.
In general, take a look at Hadley's book chapter on the topic: http://adv-r.had.co.nz/Performance.html
Currently, your code doesn't take advantage of the fact that so-called "vectorized" R code performs its loops in C, making them much faster (forgive me if this description is technically incorrect; I'm just getting the idea across).
For a more specific example, it might be helpful to see input (use dput(data)) and desired output. Then I'd have an easier time digesting what you want your function to accomplish.
Some general points that could help, at least a little:
You do ncol(ng_matrix) several times; instead, do nc.ngm <- ncol(ng_matrix) once at the start. Savings will be minimal, but the idea is still useful.
Instead of defining first_word, second_word, etc., just do something like words <- ng_matrix[i,]. Then use the previously mentioned object to get the count_word by doing count_word <- words[nc.ngm] and get the other words as numbered_words <- words[-nc.ngm]. To compare the word object elements to the words elements, you could even make use of mapply to get your logic. Again, this is all a little hard to follow without an example. But in general, do things "in bulk" (vectorize).
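As a rough sketch of what doing it "in bulk" could look like: build one logical vector over all rows and subset once, so the only loop left runs over the few context columns rather than over every row. The toy trigram_index and the name predict_next_word_vec are made up for illustration, and I'm assuming the word columns come first with the count in the last column, as in the question.

```r
# Hypothetical toy n-gram matrix: word columns first, count in the last column.
trigram_index <- matrix(
  c("i",   "can",  "read", "4",
    "i",   "can",  "see",  "2",
    "i",   "will", "go",   "7",
    "you", "can",  "read", "1"),
  ncol = 4, byrow = TRUE
)

predict_next_word_vec <- function(word, ng_matrix) {
  nc <- ncol(ng_matrix)
  n_ctx <- nc - 2L                      # number of context words to match
  keep <- rep(TRUE, nrow(ng_matrix))
  # Loop over the (few) context columns; each comparison covers all rows at once.
  for (j in seq_len(n_ctx)) {
    keep <- keep & !is.na(ng_matrix[, j]) & ng_matrix[, j] == word[j]
  }
  out <- data.frame(
    predicted = ng_matrix[keep, nc - 1L],
    count = as.numeric(ng_matrix[keep, nc]),
    stringsAsFactors = FALSE
  )
  out <- out[order(out$count, decreasing = TRUE), ]
  rownames(out) <- NULL
  out
}

res <- predict_next_word_vec(c("i", "can"), trigram_index)
print(res)
```

I haven't timed this against your data, so treat any speed-up as something to measure with system.time() rather than a promise.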
I want to adjust my function so that my if and else if statements recognize the name of the dataframe used and execute the correct plotting function. These are some mock data structured the same as mine:
df1<-data.frame(A=c(1,2,2,3,4,5,1,1,2,3),
B=c(4,4,2,3,4,2,1,5,2,2),
C=c(3,3,3,3,4,2,5,1,2,3),
D=c(1,2,5,5,5,4,5,5,2,3),
E=c(1,4,2,3,4,2,5,1,2,3),
dummy1=c("yes","yes","no","no","no","no","yes","no","yes","yes"),
dummy2=c("high","low","low","low","high","high","high","low","low","high"))
df1[colnames(df1)] <- lapply(df1[colnames(df1)], factor)
vals <- colnames(df1)[1:5]
dummies <- colnames(df1)[-(1:5)]
step1 <- lapply(dummies, function(x) df1[, c(vals, x)])
step2 <- lapply(step1, function(x) split(x, x[, 6]))
names(step2) <- dummies
tbls <- unlist(step2, recursive=FALSE)
tbls<-lapply(tbls, function(x) x[(names(x) %in% names(df1[c(1:5)]))])
A<-lapply(tbls,"[", c(1,2))
B<-lapply(tbls,"[", c(3,4))
C<-lapply(tbls,"[", c(3,4))
list<-list(A,B,C)
names(list)<-c("A","B","C")
And this is my function:
plot_1<-function (section, subsample) {
data<-list[grep(section, names(list))]
data<-data[[1]]
name=as.character(names(data))
if(section=="A" && subsample=="None"){plot_likert_general_section(df1[c(1:2)],"A")}
else if (section==name && subsample=="dummy1"){plot_likert(data$dummy1.yes, title=paste("How do the",name,"topics rank?"));plot_likert(data$Ldummy1.no, title = paste("How do the",name,"topics rank?"))}
}
Basically, I want it to plot a certain graph when I specify the section and subsample I'm interested in. If, for example, I want to plot section C and subsample dummy1, I just write:
plot_1(section="C", subsample="dummy1")
I want to avoid writing this:
else if (section=="A" && subsample=="dummy1"){plot_likert(data$dummy1.yes, title=paste("How do the A topics rank?"));plot_likert(data$Ldummy1.no, title = paste("How do the A topics rank?"))}
else if (section=="B" && subsample=="dummy1"){plot_likert(data$dummy1.yes, title=paste("How do the B topics rank?"));plot_likert(data$Ldummy1.no, title = paste("How do the B topics rank?"))}
else if (section=="C" && subsample=="dummy1"){plot_likert(data$dummy1.yes, title=paste("How do the C topics rank?"));plot_likert(data$Ldummy1.no, title = paste("How do the C topics rank?"))}
else if (section=="C" && subsample=="dummy2")...
.
.
}
So I tried to extract the dataframe used from the list so that it matches the string of the section typed in the function (data<-list[grep(section, names(list))]) and store its name as a character (name=as.character(names(data))), because I thought that in this way the function would have recognized the string "A", "B" or "C" by itself, without the need for me to specify each condition.
However, if I run it, I get this error: Warning message: In section == name && subsample == "dummy1" : 'length(x) = 4 > 1' in coercion to 'logical(1)', that, from what I understand, is due to the presence of a vector in the statement. But I have no idea how to correct for this (I'm still quite new to R).
How can I fix the function so that it does what I want? Thanks in advance!
Well, I can't really test your code without the plot_likert_general_section function or the plot_likert function, but I've done a bit of simplifying and best practices--passing list in as an argument, consistent spaces and assignment operators, etc.--and this is my best guess as to what you want:
plot_1 = function(list, section, subsample) { ## added `list` as an argument
data = list[[grep(section, names(list))]] # use [[ to extract a single item
name = as.character(names(data))
if(subsample == "None"){
plot_likert_general_section(df1[c(1:2)], section)
} else {
yesno = paste(subsample, c("yes", "no"), sep = ".")
plot_likert(data[[yesno[1]]], title = paste("How do the", name, "topics rank?"))
plot_likert(data[[yesno[2]]], title = paste("How do the", name, "topics rank?"))
}
}
plot_1(list, section = "C", subsample = "dummy1")
I'm not sure if your plot_likert functions use base or grid graphics, but either way you'll need to handle the multiple plots. With base, probably use par(mfrow = ...) to display both of them; with grid, I'd suggest putting them in a list to return them both, and then maybe using gridExtra::grid.arrange() (or similar) to plot both of them.
You're right that the error is due to passing a vector where a single value is expected. Try inserting print statements before the equality test to diagnose why this is.
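To make that diagnosis concrete, here is a small sketch with a stand-in for one section of the nested list (the values are placeholders, not your data). It shows that names(data) has four elements, so section == name is a length-four logical vector, which is what && complains about.

```r
# Stand-in for one section of the nested list; element values are placeholders.
data <- list(dummy1.no = 1, dummy1.yes = 2, dummy2.high = 3, dummy2.low = 4)
section <- "A"
name <- names(data)

print(name)                 # four element names, not the section label
cmp <- section == name      # element-wise comparison: one result per name
print(length(cmp))

# Building the element names from `subsample` avoids the comparison entirely.
subsample <- "dummy1"
yesno <- paste(subsample, c("yes", "no"), sep = ".")
selected <- data[yesno]
print(names(selected))
```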
Also, be careful with choosing variable names like name which are base R functions (e.g. ?name). I'd also recommend following the tidyverse style guide here: https://style.tidyverse.org/.
Here is my R Script that works just fine:
perc.rank <- function(x) trunc(rank(x)) / length(x) * 100.0
library(dplyr)
setwd("~/R/xyz")
datFm <- read.csv("yellow_point_02.csv")
datFm <- filter(datFm, HRA_ClassHRA_Final != -9999)
quant_cols <- c("CL_GammaRay_Despiked_Spline_MLR", "CT_Density_Despiked_Spline_FinalMerged",
"HRA_PC_1HRA_Final", "HRA_PC_2HRA_Final","HRA_PC_3HRA_Final",
"SRES_IMGCAL_SHIFT2VL_Slab_SHIFT2CL_DT", "Ultrasonic_DT_Despiked_Spline_MLR")
# add an extra column to datFm to store the quantile value
for (column_name in quant_cols) {
datFm[paste(column_name, "quantile", sep = "_")] <- NA
}
# initialize an empty dataframe with the new column names appended
newDatFm <- datFm[0,]
# get the unique values for the hra classes
hraClassNumV <- sort(unique(datFm$HRA_ClassHRA_Final))
# loop through the vector and create currDatFm and append it to newDatFm
for (i in hraClassNumV) {
currDatFm <- filter(datFm, HRA_ClassHRA_Final == i)
for (column_name in quant_cols) {
currDatFm <- within(currDatFm,
{
CL_GammaRay_Despiked_Spline_MLR_quantile <- perc.rank(currDatFm$CL_GammaRay_Despiked_Spline_MLR)
CT_Density_Despiked_Spline_FinalMerged_quantile <- perc.rank(currDatFm$CT_Density_Despiked_Spline_FinalMerged)
HRA_PC_1HRA_Final_quantile <- perc.rank(currDatFm$HRA_PC_1HRA_Final)
HRA_PC_2HRA_Final_quantile <- perc.rank(currDatFm$HRA_PC_2HRA_Final)
HRA_PC_3HRA_Final_quantile <- perc.rank(currDatFm$HRA_PC_3HRA_Final)
SRES_IMGCAL_SHIFT2VL_Slab_SHIFT2CL_DT_quantile <- perc.rank(currDatFm$SRES_IMGCAL_SHIFT2VL_Slab_SHIFT2CL_DT)
Ultrasonic_DT_Despiked_Spline_MLR_quantile <- perc.rank(currDatFm$Ultrasonic_DT_Despiked_Spline_MLR)
}
)
}
newDatFm <- rbind(newDatFm, currDatFm)
}
newDatFm <- newDatFm[order(newDatFm$Core_Depth),]
# head(newDatFm, 10)
write.csv(newDatFm, file = "Ricardo_quantiles.csv")
I have a few questions though. Every R book or video that I have read or watched recommends using the 'apply' family of language constructs over the classic 'for' loop, stating that apply is much faster.
So the first question is: how would you write it using apply (or tapply or some other apply)?
Second, is this really true though that apply is much faster than for? The csv file 'yellow_point_02.csv' has approx. 2500 rows. This script runs almost instantly on my Macbook Pro which has 16 Gig of memory.
Third, see the 'quant_cols' vector? I created it so that I could write a generic loop (for column_name in quant_cols) ... but I could not make it work. So I hard-coded the column names suffixed with '_quantile' and called 'perc.rank' many times. Is there a way this could be made dynamic? I tried the 'paste' stuff that I have in my script, but that did not work.
On the positive side though, R seems awesome in its ability to cut through the 'Data Wrangling' tasks with very few statements.
Thanks for your time.
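One way the hard-coded assignments might be made dynamic, sketched with base R only: build each new column name with paste() and assign through [[ ]], letting ave() apply perc.rank within each class. The small data frame and the column names col_a and col_b are invented stand-ins for the real CSV; only perc.rank and HRA_ClassHRA_Final come from the script above.

```r
perc.rank <- function(x) trunc(rank(x)) / length(x) * 100.0

# Hypothetical stand-in for the CSV contents.
datFm <- data.frame(
  HRA_ClassHRA_Final = c(1, 1, 1, 2, 2),
  col_a = c(10, 30, 20, 5, 1),
  col_b = c(3, 2, 1, 8, 9)
)
quant_cols <- c("col_a", "col_b")

for (column_name in quant_cols) {
  new_name <- paste(column_name, "quantile", sep = "_")
  # ave() applies FUN within each class and returns values in the original row order.
  datFm[[new_name]] <- ave(datFm[[column_name]],
                           datFm$HRA_ClassHRA_Final,
                           FUN = perc.rank)
}
print(datFm)
```

Because ave() keeps the original row order, this version would not need the per-class filter, the rbind, or the final re-sort. On roughly 2500 rows the speed difference between a for loop and an apply-style call is unlikely to be noticeable either way.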
I am relatively new to R, so my repertoire of commands is limited.
I am trying to write a script that will decompose a series of Markovian sequences, contained in a text string and delimited with a '>' sign, into a contingency "from - to" table.
The attached code, with dummy data, is where I have been able to get the code. On the small 7 case example included this will run relatively quickly. However the reality is that I have millions of cases to parse and my code just isn't efficient enough to process in a timely fashion (it had taken well over an hour and this time frame isn't feasible).
I'm convinced there is a more efficient way of structuring this code so that it executes quickly as I have seen this operation performed in other Markov packages within a few minutes. I need my own scripted version though to allow flexibility in processing hence I have not turned to these.
What I would like to request are improvements to the script to increase processing efficiency please.
Seq <- c('A>B>C>D', 'A>B>C', 'A', 'A', 'B', 'B>D>C', 'D') #7 cases
Lives <- c(0,0,0,0,1,1,0)
Seqdata <- data.frame(Seq, Lives)
Seqdata$Seq <- gsub("\\s", "", Seqdata$Seq)
fromstep <- list()
tostep <- list()
##ORDER 1##
for (x in 1:nrow(Seqdata)) {
steps <- unlist(strsplit(Seqdata$Seq[x], ">"))
for (i in 1:length(steps)) {
if (i==1) {fromstep <- c(fromstep, "Start")
tostep <- c(tostep, steps[i])
}
fromstep <- c(fromstep, steps[i])
if (i<length(steps)) {
tostep <- c(tostep, steps[i+1])
} else if (Seqdata$Lives[x] == 1) {
tostep <- c(tostep, 'Lives')
} else
tostep <- c(tostep, 'Dies')
}
}
transition.freq <- table(unlist(fromstep), unlist(tostep))
transition.freq
I'm not familiar with Markovian sequences, but this produces the same output:
xx <- strsplit(Seqdata$Seq, '>', fixed=TRUE)
table(From=unlist(lapply(xx, append, 'Start', 0L)),
To=unlist(mapply(c, xx, ifelse(Seqdata$Lives == 0L, 'Dies', 'Lives'))))
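For anyone who wants to check this on the dummy data, here is the same idea wrapped in a function, as a sketch. The name transition_table is mine, and I've used Map() instead of mapply() so the result is always a list regardless of sequence lengths.

```r
Seq <- c('A>B>C>D', 'A>B>C', 'A', 'A', 'B', 'B>D>C', 'D')  # 7 cases
Lives <- c(0, 0, 0, 0, 1, 1, 0)

transition_table <- function(seq, lives) {
  steps <- strsplit(seq, ">", fixed = TRUE)
  end_state <- ifelse(lives == 1, "Lives", "Dies")
  # Every sequence contributes "Start" plus its steps as origins ...
  from <- unlist(lapply(steps, function(s) c("Start", s)))
  # ... and its steps plus the terminal state as destinations.
  to <- unlist(Map(c, steps, end_state))
  table(From = from, To = to)
}

tf <- transition_table(Seq, Lives)
print(tf)
```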
Please excuse me if there are already answers to this, but I can't quite figure it out from the archives.
I have generated a list of very similar functions via a for-loop in R:
adoptint.fun=list()
for(i in 1:40) {
#function name for each column
func.name <- paste('adoptint',i,sep='')
#function
func = paste('function(yearenter, adoptyear, yearleave) {ifelse(is.na(yearenter) | yearenter >', i+1905, ' | is.na(adoptyear) | yearleave > ', i+1905, ', NA, ifelse(yearenter <= ', i+1905, ' & adoptyear <= ', i+1905, ', 1, 0))}', sep='')
adoptint.fun[[func.name]] = eval(parse(text=func))
}
I am now interested in applying this function to generate values for variables that have yet to be created in the dataframe. I want to do this using a loop or similar since the process is identical, though the specific values change, over the 40 iterations. The code would look something like:
#generate variables that will be inserted into dataframe, dfanal.reshape
var_names <- paste("dfanal.reshape$adopt", 1:40, sep="")
#run function i to obtain values for variable i, which should be appended to dataframe
for(i in 1:40){
var_names[i] <- eval(parse(paste("adoptint.fun[[" ,i, "]](dfanal.reshape$intoobsyear,dfanal.reshape$adoptyear,dfanal.reshape$yearleave)", sep="")))
}
I have played around with mget for the var_names segment, but that doesn't seem to work, and the eval segment is also not working (i.e., not assigning the values determined by the function (which works fine) to the appropriate dataframe column).
Again, apologies if this has already been answered and thanks in advance for your help.
How about adding an extra argument to your function?
func <- function(yearenter, adoptyear, yearleave,i) {
ifelse(is.na(yearenter) | yearenter > i+1905 | is.na(adoptyear) | yearleave > i+1905 , NA,
ifelse(yearenter <= i+1905 & adoptyear <= i+1905, 1, 0))
}
This would allow you to do the replacement much more easily, using the fact that a dataframe is a special kind of list. That was your original problem, I believe:
for(i in 1:40){
varname <- paste('adopt',i,sep='')
dfanal.reshape[[varname]] <-
with(dfanal.reshape,
func(intoobsyear,adoptyear,yearleave,i)
)
}
Check also the help pages ?which and ?Extract
Now, without a reproducible example (see How to make a great R reproducible example?), it's hard to guess what you want to do and how to do it more economically. You're still using a lot of calculation time. The following function might do what you want:
func <- function(df,j){
out <- matrix(0,nrow=nrow(df),ncol=j)
attach(df)
id1 <- sapply(1:j,function(i)
yearenter <= i+1905 & adoptyear <= i+1905
)
out[id1] <- 1
# set NA last so it takes precedence over 1, as in the nested ifelse()
idna <- sapply(1:j,function(i)
is.na(yearenter) | yearenter > i+1905 | is.na(adoptyear) | yearleave > i+1905
)
out[idna] <- NA
detach(df)
colnames(out)<- paste('adopt',1:j,sep='')
cbind(df,out)
}
which allows you to simply do
dfanal.reshape <- func(dfanal.reshape,40)
to get the desired result. This is given that the names of your variables are yearenter, adoptyear and yearleave. As far as I can see, you have to change yearenter to intoobsyear in the function, but that's a detail.
Learning to use indices will save you a lot of frustration. And please, never ever make 40 identical functions again if adding one argument will do.
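To show the single-function idea end to end, here is a sketch on a tiny invented data frame (the four rows are made up; the column names follow the question). The function name adopt_indicator is hypothetical, and the logic is the nested ifelse() from above with the cutoff pulled into a variable.

```r
adopt_indicator <- function(yearenter, adoptyear, yearleave, i) {
  cutoff <- i + 1905
  ifelse(is.na(yearenter) | yearenter > cutoff | is.na(adoptyear) | yearleave > cutoff,
         NA,
         ifelse(yearenter <= cutoff & adoptyear <= cutoff, 1, 0))
}

# Invented rows, just to exercise each branch.
df <- data.frame(
  intoobsyear = c(1906, 1907, NA,   1906),
  adoptyear   = c(1906, 1908, 1906, 1910),
  yearleave   = c(1906, 1907, 1906, 1907)
)

# One column per value of i, computed by the same function.
new_cols <- sapply(1:3, function(i)
  with(df, adopt_indicator(intoobsyear, adoptyear, yearleave, i)))
colnames(new_cols) <- paste("adopt", 1:3, sep = "")
df <- cbind(df, new_cols)
print(df)
```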
The following function does work, but the last as.Date part was more or less a result of trial and error that I do not fully understand.
### This function creates a real date column out of year / period values that are saved
### in separate columns, plus it handles a 13th period in case of overlapping period
### terminology. Turns quarters into months.
realDate <- function (table,year="year_col",period="period_col"){
if (is.character(table) == TRUE)
{
dframe <- get(table)
}
else{
dframe <- table
}
x <- expression({resDate <- with(dframe,
as.Date(paste(get(year),"-",
ifelse(get(period) > 9, get(period),
paste("0", get(period), sep = "")),
"-01", sep = "")))
})
y <- expression({resDate <- with(dframe,as.Date(paste(get(year) + 1,"-","01","-01",sep="")))})
#### I do not get this? Why do I have to do this?
a <- ifelse(get(period) == 13,eval(y),eval(x))
a <-as.Date(a, origin="1970-01-01")
return(a)
}
Instead, I tried to do it like this (because it was more intuitive to me):
{ ....
ifelse(get(period) == 13,eval(y),eval(x))
return(resDate)
}
This returned the correct values whenever the condition was FALSE (no) but returned NA if the condition was TRUE (yes). Why is that? And if I use the function above, why do I have to define the origin again? Why do I even have to call as.Date again?
EDIT:
a <- rep(2002:2010,2)
b <- rep(1:13,2)
d<-cbind(a,b[1:length(a)])
names(d) <- c("year_col","period_col")
P.S.:
I found this thread on vectorized ifelse.
Your construct is "interesting" at least. To start with, neither x nor y gives output. I wonder why you use an assignment in your eval(). This gives you a resDate vector that is exactly what the last call produced. That does not depend on the condition; it is simply the last one written (eval(x) in your case). Both get executed before the ifelse clause is executed.
Plus, the output you get is the numeric representation of your dates, not the Date object. That is in resDate. I guess that ifelse cannot determine the class of the output vector as you use the eval() inside. I'm surprised you get output at all; in fact, you're effectively using something that could be called a "bug" in R (Microsoft would call it a feature :-) ).
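The loss of the Date class can be reproduced without any eval(), which may help explain why the second as.Date call with an origin was needed. A minimal sketch with two made-up dates:

```r
d <- as.Date(c("2006-01-01", "2007-01-01"))

# ifelse() builds its result from the test vector, so the Date class is dropped.
res <- ifelse(c(TRUE, FALSE), d, d + 1)
print(class(res))

# The numbers are days since 1970-01-01, so the dates can be restored explicitly.
restored <- as.Date(res, origin = "1970-01-01")
print(restored)
```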
Your mistake is in your ifelse: get(period) doesn't exist. It should be get(period, dframe). Then it works. The only reason it works on your computer is presumably that you have a period object in your workspace. Classic problem when debugging.
In any case, I'd make it:
realDate <- function (table,year="year_col",period="period_col"){
if (is.character(table)){ # is.character(table) returns a boolean already.
dframe <- get(table)
} else {
dframe <- table
}
year <- get(year,dframe)
period <- get(period,dframe)
year[period==13] <- year[period==13]+1
period[period==13] <- 1
as.Date(paste(year,"-",period,"-01",sep=""))
}
This is quite a bit faster than your own, has fewer pitfalls and conversions, and is more the R way of doing it. You could replace the year[...] and period[...] assignments with ifelse constructs, but using indices is generally faster.
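For comparison, the ifelse variant mentioned above might look roughly like this. The name real_date_ifelse is made up, it only accepts a data frame (not a table name), and it applies ifelse() to plain numbers before any Date is created, so no class is lost.

```r
real_date_ifelse <- function(dframe, year = "year_col", period = "period_col") {
  yr <- dframe[[year]]
  pd <- dframe[[period]]
  wrap <- pd == 13                       # 13th period rolls over to January of the next year
  yr <- ifelse(wrap, yr + 1, yr)
  pd <- ifelse(wrap, 1, pd)
  as.Date(sprintf("%d-%02d-01", as.integer(yr), as.integer(pd)))
}

dframe <- data.frame(
  year_col = rep(2006:2007, each = 13),
  period_col = rep(1:13, 2)
)
res <- real_date_ifelse(dframe)
print(res)
```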
EDIT :
This is easier for the data generation:
dframe <- data.frame(
year_col= rep(2006:2007,each=13),
period_col = rep(1:13,2)
)
realDate(dframe)
[1] "2006-01-01" "2006-02-01" "2006-03-01" "2006-04-01" "2006-05-01" "2006-06-01" "2006-07-01" "2006-08-01" "2006-09-01"
[10] "2006-10-01" "2006-11-01" "2006-12-01" "2007-01-01" "2007-01-01" "2007-02-01" "2007-03-01" "2007-04-01" "2007-05-01"
[19] "2007-06-01" "2007-07-01" "2007-08-01" "2007-09-01" "2007-10-01" "2007-11-01" "2007-12-01" "2008-01-01"