I am currently working on a large dataset (~1.5M entries) using R, a language I am not yet completely familiar with.
Basically, what I am trying to do is the following:
I want to check what happens during a time interval after "Start".
"Start" represents a few temporal values within every "Trial", and "Trial" represents all of the trials recorded for one "Reference".
So for each Reference, I want to check all Trials and see what happens after "Start" during that Trial.
It's not so important if what I'm trying to do is still obscure; the point is that I want to check every observation in my data frame.
My instinctive (read: R-noob-ish) way of programming this led me to a piece of code which I know is far from optimized and takes a LOT of time to run.
My_Function <- function(DataFrame){
  counts <- data.frame()
  for (reference in DataFrame$Ref){   # note: this iterates once per row, not once per unique Ref
    ref_tested <- subset(DataFrame, Ref == reference)
    ref_count <- data.frame()
    for (trial in ref_tested$Trial){
      trial_tested <- subset(ref_tested, Trial == trial)
      for (timing in trial_tested$Start){
        # all rows of the same Trial whose Start falls within (timing, timing + some_time]
        interesting <- subset(DataFrame, Start > timing & Start <= timing + some_time & Trial == trial)
        ref_count <- rbind(ref_count, as.data.frame(table(interesting$ele)))
      }
    }
    temp <- aggregate(Freq ~ Var1, data = ref_count, FUN = sum)
    counts <- rbind(counts, temp)
  }
  return(counts)
}
Here, as.data.frame(table(interesting$ele)) can have different lengths, and thus so can ref_count.
I failed to find a way to grow my data frame without using rbind, but I also know that, given the size of my output, it is not time-efficient at all.
Also, I have programmed in other languages such as Python and C++ (a long time ago), and I know that three nested for loops usually mean you're doing it wrong. But then again, I did not find a way to avoid them in this particular case.
So, do you have any advice on how to use R, or one of its packages, to avoid such a situation?
Thank you in advance,
K.
EDIT:
Thank you for your first round of advice.
I tried the 'plyr' package and was able to reduce the size of my code chunk; it does what I expected and is more understandable. Plus, I was able to produce some example data for reproducibility. See:
#Example Input
library(plyr)

DF <- data.frame(sample(1:400, 500000, replace = TRUE),
                 sample(1:25, 500000, replace = TRUE),
                 rnorm(n = 500000, mean = 1, sd = 1))
colnames(DF) <- c("Trial", "Ref", "Start")
DF$rn <- rownames(DF)

tempDF <- DF[sample(nrow(DF), 100), ]   # for testing purposes

Test <- ddply(.data = tempDF, "rn", function(x){
  interesting <- subset(DF,
                        Trial == x$Trial &
                        Start > x$Start &
                        Start < x$Start + some_time)
  interesting$Elec <- x$Ref
  return(interesting)
})
This is nice, but I still feel like it is not the way to go; in this example we only browse 100 observations, which takes ~4 s (measured with system.time()), but if I want to scan the 500,000 observations of DF, it would take more than 5 hours.
I have checked out data.table, but I am still trying to understand how to use it.
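In case it helps while exploring data.table: the window lookup above can be written as a single non-equi self-join, which avoids the row-by-row subsetting entirely. Below is a minimal sketch using the example DF; some_time is a placeholder value here, and on the real data you would also group the final count by the event column you actually need (ele).
library(data.table)

some_time <- 0.5                      # placeholder window width; substitute the real value
DT  <- as.data.table(DF)
win <- DT[, .(Trial, win.Ref = Ref, win.start = Start, win.end = Start + some_time)]

# Non-equi self-join: for each window row, pull every row of DT from the same
# Trial whose Start lies strictly inside (win.start, win.end)
hits <- DT[win,
           on = .(Trial, Start > win.start, Start < win.end),
           nomatch = 0L,
           allow.cartesian = TRUE]

# e.g. number of matching observations per reference that defined the window
hits[, .N, by = win.Ref]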
Related
I am working on a piece of code, but one step of it is just super slow. Basically, I just need to compare two columns and, if their values are the same in a given row, write a 1 in a third column. Like the code below:
#FLAG_REPETIDOS
df1$FLAG_REPETIDOS <- ""
for (j in 1:nrow(df1)) {
  # write 1 when the two columns match in this row, otherwise keep the current value
  df1$FLAG_REPETIDOS[[j]] <- ifelse(df1$DATO[[j]] == df1$DATO_ANT[[j]], 1, df1$FLAG_REPETIDOS[[j]])
  # a comparison against NA yields NA; turn that back into an empty string
  df1$FLAG_REPETIDOS[[j]] <- ifelse(is.na(df1$FLAG_REPETIDOS[[j]]), "", df1$FLAG_REPETIDOS[[j]])
  # progress message every 100 rows
  if (j %% 100 == 0) {
    print(paste(j, "/", nrow(df1)))
  }
}
print(paste("Check 11:", Sys.time(), sep = " "))
Some more information: I am using a data.table, not a data.frame. My computer is not the best one (only 8 GB of RAM), and the data I am using has roughly 1M rows. According to my estimate, this step alone would take around 72 h to finish, which is unreasonable.
Is my code doing something that could be done more easily and faster? Is there any way to optimize it? I am new to R, so I don't know a lot about optimization.
Thanks in advance
I have already changed from data.frame to data.table; from what I researched about optimization, that was one of the things I could try.
The way to make R code go fast is to vectorize your code.
Assuming df is a dataframe, you could probably replace all your included code with something like:
library(dplyr)

df %>%
  mutate(
    FLAG_REPETIDOS = case_when(
      is.na(DATO) | is.na(DATO_ANT) ~ "",
      DATO == DATO_ANT              ~ "1",  # all branches must return the same type
      TRUE                          ~ ""
    )
  )
However, I'm not able to check since you did not include any data with your question.
Your loop is equivalent to this much simpler and faster code.
df1$FLAG_REPETIDOS <- ""
df1$FLAG_REPETIDOS[which(df1$DATO == df1$DATO_ANT)] <- "1"
Note that which() drops the NAs produced by the comparison, so the index used on the second line cannot contain NA.
Hard to know without sample data, but this should work using data.table
library(data.table)
dt <- data.table(x=c(1,3,5,7,9), y=c(1,2,5,6,7)) # example
dt[, z:='']
dt[x==y, z:='1']
Problem:
Firstly, I am just starting out. While I was proud of my code, coming back to it and using it on a different variable made me realise how inefficient and non-replicable it is. In particular, step 3 has a manual component when excluding columns (downpour, precipitation, rainwater), which is not very replicable. Could anyone advise? (It looked worse before, if you can believe it.)
Code:
library(dplyr)

# 1) filter for dictionaries containing 1,000 noun counts or more
f1_raincount <- raincount %>% filter(total_ncount >= 1000)

# 2) filter for dictionaries which contain 3 or more tokens from our set of rain-related tokens
f2_raincount <- f1_raincount

# compute rain-set count: number of rain tokens with a non-zero count in each row
f2_raincount$set_count <- f2_raincount %>%
  select(cloud:thunderstorm) %>%
  apply(1, function(x) sum(x != 0, na.rm = TRUE))
f2_raincount <- f2_raincount %>% filter(set_count >= 3)

# 3) Select for rain-related noun tokens with frequencies greater than 10 across dictionaries
# First, compute dictionary counts per token column
f3_raincount <- f2_raincount
f3_dict_long <- f3_raincount %>%
  select(cloud:thunderstorm) %>%
  apply(2, function(x) sum(x != 0))

# Second, exclude those under 10: downpour, precipitation, rainwater
f3_raincount <- f3_raincount %>% select(-c(downpour, precipitation, rainwater))

# 4) given the exclusion in 3), compute the rain-set count and filter again
f4_raincount <- f3_raincount
f4_raincount$set_count2 <- f4_raincount %>%
  select(cloud:thunderstorm) %>%
  apply(1, function(x) sum(x != 0))
f4_raincount <- f4_raincount %>%
  filter(set_count2 >= 3) %>%
  select(id:dictsize)   # select final rain set
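As a side note on the step 3 concern: since f3_dict_long already holds the per-column totals, the columns to exclude can be derived from it rather than typed by hand. A minimal sketch, assuming those totals are what the under-10 rule applies to:
# columns whose count across dictionaries is below 10
low_freq <- names(f3_dict_long)[f3_dict_long < 10]
f3_raincount <- f3_raincount %>% select(-all_of(low_freq))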
What I normally do is put all the ETL code inside an ETL function, even if I only plan to run it once in the entire script.
Why?
I find it easy to debug if errors arise, using debug()
While on the topic of debugging, it is also easier because the environment will only contain the variables actually used, and not everything else
Auxiliary variables are automatically deleted once the function call is over
It is easier to document that chunk of code with a title
It is more reproducible
Because of this, my scripts tend to be 20% parameters and libraries, 60% functions, and 20% code that runs those functions.
Your final code would then look like this:
f4_raincount <- funcName(raincount)
with all the other messy code naturally living inside funcName.
As for the actual code, I would need an actual example (data, table structure and libraries), since it looks to me that you are just adding count columns, which could be done with the mutate function from dplyr. If that is indeed the case, then you have a lot of optimization ahead of you :P But without knowing what cloud:thunderstorm is, it's hard to give you more feedback.
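To illustrate the mutate point: a minimal sketch, assuming cloud:thunderstorm are numeric token-count columns and dplyr (>= 1.0) is loaded, of how the row-wise set count from step 2 could be written without apply():
f2_raincount <- f1_raincount %>%
  mutate(set_count = rowSums(across(cloud:thunderstorm, ~ .x != 0), na.rm = TRUE)) %>%  # per-row count of non-zero rain tokens
  filter(set_count >= 3)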
EDIT:
ETL (Extract Transform Load) might not have been the best term for me to mention, since here we are only transforming data, not extracting or loading it.
Either way, I believe it's best if I demonstrate on a chunk of code.
Imagine we have a data frame df.MyData and want to calculate the ratio between two variables, times a certain ratio (just because).
Here's how one might approach this simple problem:
library(dplyr)

df.MyData <- data.frame( # hard-coding data like this is of course a bad idea, but a real-world example would make the code unreproducible
  Group = c("A", "A", "B", "B", "B"),
  Value = c(1, 3, 1, 4, 5)
)
n.Ratio <- 2

n.Value_A <- sum(filter(df.MyData, Group == "A")$Value)
n.Value_B <- sum(filter(df.MyData, Group == "B")$Value)
n.Result  <- n.Value_A / n.Value_B * n.Ratio
Here's how I would do it:
# LIBRARY ####
library(dplyr)

# PARAMETERS ####
df.MyData <- data.frame( # hard-coding data like this is of course a bad idea, but a real-world example would make the code unreproducible
  Group = c("A", "A", "B", "B", "B"),
  Value = c(1, 3, 1, 4, 5)
)

# FUNCTIONS ####
fn.CalculateRatio <- function(pf.df.MyData = df.MyData, pf.n.Ratio = 2) {
  n.Value_A <- sum(filter(pf.df.MyData, Group == "A")$Value)
  n.Value_B <- sum(filter(pf.df.MyData, Group == "B")$Value)
  n.Result  <- n.Value_A / n.Value_B * pf.n.Ratio
  return(n.Result)
}

# PROCESS ####
fn.CalculateRatio()
My approach clearly has more code, so it very well might be disregarded by many, but I prefer it nonetheless, as I tend to find it more organized in bigger pieces of code.
Your example would look like this:
fn.MyFunc <- function(pf.raincount = raincount){
  # 1) filter for dictionaries containing 1,000 noun counts or more
  f1_raincount <- pf.raincount %>% filter(total_ncount >= 1000)
  .......[your code (excluding the first 2 rows) goes here]
  return(f4_raincount)
}
fn.MyFunc()
You could naturally go the extra mile and replace the seemingly arbitrary numbers 1000 and 3 with variables, placing them on the function itself as parameters. That way, should you want to change them, you simply state the value you want when calling the function:
fn.MyFunc(pf.raincount = NEWraincount)
or something else if you define other variables.
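Sticking with the naming convention above (the parameter names here are hypothetical), the signature could become:
fn.MyFunc <- function(pf.raincount = raincount,
                      pf.n.MinNounCount = 1000,  # was the hard-coded 1000
                      pf.n.MinSetCount = 3) {    # was the hard-coded 3
  f1_raincount <- pf.raincount %>% filter(total_ncount >= pf.n.MinNounCount)
  .......[the rest of the pipeline, using pf.n.MinSetCount in place of 3, goes here]
  return(f4_raincount)
}
fn.MyFunc(pf.raincount = NEWraincount, pf.n.MinNounCount = 500)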
I'm using prefixes on all my variables to identify what they are: fn for functions, df for data frames, pf for function parameters, n for length-one numeric vectors... The list is quite extensive, and I go as far as having a rulebook with all the rules I use to stay consistent across projects, but that's a story for another day.
Finally, I find # XXXXXX #### very useful, as I can hide chunks of code when I'm not working on them.
Again, this is how I keep hundreds or thousands of lines of code organized. Each of us has to figure out their own style; the only thing I believe we all agree on is that consistency is key.
I'm straying a bit off topic though. The key idea of using this wrapper function is that the other tables you define stay inside the function environment and get deleted afterwards. It's better to actually edit the code so that they are not created in the first place, but at the very least you can use this method as a band-aid, since it takes close to no time or skill to clean up those variables (I didn't have to understand your code to write this post).
I am simulating data and filling a matrix using a for loop in R. Currently the loop is running slower than I would like. I've done some work to vectorize some of the variables to improve the loop's speed, but it is still taking some time. I believe the
mat[j,year] <- sum(vec==1)/x
part of the loop is slowing things down. I've looked into filling matrices more efficiently but could not find anything that helps with my current problem. Eventually this will be used as part of a Shiny app, so all of the variables I assign will need to be easily given different values.
Any advice to speed up the loop, or to write it more efficiently, would be greatly appreciated.
Here is the loop:
#These variables are all specified because they need to change with different simulations
num.sims <- 20
time <- 50
mat <- matrix(nrow = num.sims, ncol = time)
x <- 1000
init <- 0.5 * x
vec <- vector(length = x)
ratio <- 1
freq <- -0.4
freq.vec <- numeric(nrow(mat))

## start a loop
for (j in 1:num.sims) {
  vec[1:init] <- 1; vec[(init + 1):x] <- 2
  year <- 2
  freq.vec[j] <- sum(vec == 1) / x
  for (i in 1:(x * (time - 1))) {
    freq.1 <- sum(vec == 1) / x; freq.2 <- 1 - freq.1
    fit.ratio <- exp(freq * (freq.1 - 0.5) + log(ratio))
    Pr.1 <- fit.ratio * freq.1 / (fit.ratio * freq.1 + freq.2)
    vec[ceiling(x * runif(1))] <- sample(c(1, 2), 1, prob = c(Pr.1, 1 - Pr.1))
    ## record data
    if (i %% x == 0) {
      mat[j, year] <- sum(vec == 1) / x
      year <- year + 1
    }
  }
}
The inner loop is what is slowing you down: you're doing x iterations to update each cell in the matrix. Since each modification of vec depends on the previous iteration, this is difficult to vectorize. @Andrew Feierman is probably correct that this would benefit from being moved to C++, at least the four lines before the if statement.
Alternatively, this only takes 10-20 seconds to run. Unless you're going to scale it up or run it many times, it might not be worth the trouble to speed it up. If you do keep it as is, you could add a progress bar in Shiny to let the user know things are still working.
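If the loop does stay in R, one change that preserves the logic exactly (a sketch reusing the setup variables from the question) is to keep a running count of 1s instead of recomputing sum(vec == 1) on every iteration:
for (j in 1:num.sims) {
  vec[1:init] <- 1; vec[(init + 1):x] <- 2
  n1 <- init                               # running count of 1s in vec
  year <- 2
  freq.vec[j] <- n1 / x
  for (i in 1:(x * (time - 1))) {
    freq.1 <- n1 / x
    freq.2 <- 1 - freq.1
    fit.ratio <- exp(freq * (freq.1 - 0.5) + log(ratio))
    Pr.1 <- fit.ratio * freq.1 / (fit.ratio * freq.1 + freq.2)
    idx <- ceiling(x * runif(1))
    new.val <- sample(c(1, 2), 1, prob = c(Pr.1, 1 - Pr.1))
    n1 <- n1 - (vec[idx] == 1) + (new.val == 1)   # O(1) update instead of a full scan
    vec[idx] <- new.val
    if (i %% x == 0) {
      mat[j, year] <- n1 / x
      year <- year + 1
    }
  }
}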
Depending on how often you will need to call this loop, it could be worth rewriting it in C++. R itself is implemented in C, and compiled C++ called from R will often run many, many times faster than even efficient R code.
Rcpp's sourceCpp() is a good place to start: https://www.rdocumentation.org/packages/Rcpp/versions/0.12.11/topics/sourceCpp
Let's assume that I have a dataset with the following structure:
I have N products
I'm operating in N countries
I have N payment partners
My dataset covers N days
I have N different prices that customers can choose from
For example:
library(data.table)

customer.id <- c(1,2,3,4,5,6,7,8)
product <- c("product1","product2","product1","product2","product1","product2","product1","product2")
country <- c("country1","country2","country1","country2","country1","country2","country1","country2")
payment.partner <- c("pp1","pp2","pp1","pp2","pp1","pp2","pp1","pp2")
day <- c("day1","day2","day1","day2","day1","day2","day1","day2")
price <- c("price1","price2","price1","price2","price1","price2","price1","price2")

customer.data <- data.frame(customer.id, product, country, payment.partner, day, price)
customer.data <- data.table(customer.data)
Suppose I want to generate an aggregate out of it that, for instance, performs a forecasting algorithm for each combination. In order to do so, I identify the unique items for each condition and iterate it as follows:
unique.products <- droplevels(unique(customer.data[, product]))
unique.countries <- droplevels(unique(customer.data[, country]))
unique.payment.partners <- droplevels(unique(customer.data[, payment.partner]))
unique.days <- droplevels(unique(customer.data[, day]))
unique.prices <- droplevels(unique(customer.data[, price]))

for (i in seq_along(unique.products)) {
  temp.data1 <- customer.data[product == unique.products[[i]]]
  for (j in seq_along(unique.countries)) {
    temp.data2 <- temp.data1[country == unique.countries[[j]]]
    for (k in seq_along(unique.payment.partners)) {
      temp.data3 <- temp.data2[payment.partner == unique.payment.partners[[k]]]
      for (l in seq_along(unique.days)) {
        temp.data4 <- temp.data3[day == unique.days[[l]]]
        for (m in seq_along(unique.prices)) {
          temp.data5 <- temp.data4[price == unique.prices[[m]]]
          if (nrow(temp.data5) != 0) {
            # do your calculations here
            print(temp.data5)
          }
        }
      }
    }
  }
}
In general, this code structure works fine, but it gets really annoying when applied to real data with 5 million rows. I guess R is not the best language in terms of speed and performance. Of course, I have used multicore processing in the past, or tried to get such an aggregate straight out of Hive or a MySQL data warehouse. Using another language like C++ or Python is also always an option.
However, sometimes none of these options are possible, which always leads me back to this exact same processing structure. So I have been wondering for quite a while whether there is a better, faster solution from an architectural point of view, since it is well known (and also becomes VERY clear when benchmarking) that for loops and frequent data subsetting are very, very slow.
Grateful for all comments, hints and possible solutions!
You should read the documentation of packages you are using. Package data.table offers some excellent introductory tutorials.
customer.data <- data.frame(customer.id, product, country, payment.partner, day, price)

library(data.table)
setDT(customer.data)

customer.data[,
              print(customer.data[.I]),   # don't do this, just refer to the columns you want to work on
              by = .(product, country, payment.partner, day, price)]
Of course, generally, you wouldn't print the data.table subset here, but work directly on specific columns.
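For example (a sketch with .N as a stand-in, since the actual forecasting step isn't shown), a grouped computation returning one row per combination could look like this:
# replace .N (the group size) with the real per-group calculation;
# .SD gives you the group's rows if that calculation needs them
agg <- customer.data[,
                     .(n.obs = .N),
                     by = .(product, country, payment.partner, day, price)]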
From your description (but not your code, whose purpose I found incomprehensible), I am thinking you may want to use the interaction function:
customer.data$grp <- droplevels(with(customer.data,
  interaction(product, country, payment.partner, day, price)))
table(customer.data$grp)
#-----------------------
product1.country1.pp1.day1.price1 product2.country2.pp2.day2.price2
                                4                                 4
You could then use lapply(split(dat, dat$grp), analytic_function) to create separate analyses within subsets. I didn't have data.table loaded, so I showed the method for data frames, but there's no reason interaction shouldn't succeed in the data.table world:
customer.data[, grp2 := droplevels(interaction(
  product, country, payment.partner, day, price))]
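A concrete instance of that split/lapply pattern, with a trivial per-group summary standing in for the real analytic function:
# nrow() is just a placeholder for whatever analysis each subset needs
results <- lapply(split(customer.data, customer.data$grp), nrow)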
I deal with a great deal of survey data and the like in my work, and I often have to make various scoring programs that process data on a row-by-row level. For instance, I am dealing with a table right now that contains 12 columns with subscale scores from a psychometric instrument. These will be converted to normalized scores using tables provided by the instrument's creator. Seems straightforward so far.
However, there are four tables - the instrument is scored differently depending on gender and age range. So, for instance, a 14-year-old female and a 10-year-old male get different normalization tables. All of the normalization data is stored in an R data frame.
What I would like to do is write a function which can be applied over rows, which returns a vector looked up from the normalization data. So, something vaguely like this:
converter <- function(rawscores, gender, age) {
  if (gender == "Male") {
    if (8 <= age & age <= 11) {convertvec <- c(1:12)}
    if (12 <= age & age <= 14) {convertvec <- c(13:24)}
  }
  else if (gender == "Female") {
    if (8 <= age & age <= 11) {convertvec <- c(25:36)}
    if (12 <= age & age <= 14) {convertvec <- c(37:48)}
  }
  converted_scores <- rep(0, 12)
  for (z in 1:12) {
    converted_scores[z] <- conversion_table[(unlist(rawscores) + 1)[z],
                                            convertvec[z]]
  }
  rm(z)
  return(converted_scores)
}
EDITED: I updated this with the code I actually got to work yesterday. This version returns a simple vector with the scores. Here's how I then implemented it.
mydata[, 21:32] <- 0
for (x in 1:dim(mydata)[1]) {
  tscc_scores[x, 21:32] <- converter(mydata[x, 7:18],
                                     mydata[x, "gender"],
                                     mydata[x, "age"])
}
This works, but like I said, I'm given to understand that it is bad practice?
Side note: the reason rawscores+1 is there is that the data frame has a score of zero in the first index.
Fundamentally, the function doesn't seem very complicated, and I know I could just implement it using a loop where I would do for(x in 1:number_of_records), but my understanding is that doing so is poor practice. I had hoped to simply use apply() to do this, as follows:
apply(X=mydata[,1:12],MARGIN=1,
FUN=converter,gender=mydata[,"gender"],age=mydata[,"age"])
Unfortunately, R doesn't seem to approve of this approach: it does not iterate through the vectors passed to the subsequent arguments, but rather takes each of them as a whole. The solution would appear to be mapply(), but I can't figure out whether there's a way to use mapply() over rows instead of columns.
So, I guess my questions are threefold. One, is there a way to use mapply() over rows? Two, is there a way to make apply() iterate over arguments? And three, is there a better option out there? I've seen and heard a lot about the plyr package, but I didn't want to jump to that before I fully investigated the options present in Base R.
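For reference, one way to drive mapply() over rows is to pass a row index along with the gender and age vectors; a sketch reusing the converter and the column layout from the loop above:
# mapply() walks its vector arguments in parallel: row index, gender, age
scores.list <- mapply(
  function(i, g, a) converter(mydata[i, 7:18], g, a),
  seq_len(nrow(mydata)),
  mydata$gender,
  mydata$age,
  SIMPLIFY = FALSE
)
# one 12-element vector per row; bind them back into the score columns
tscc_scores[, 21:32] <- do.call(rbind, scores.list)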
You could rewrite converter so that it takes vectors of gender and age plus a row index, which you then use to do lookups and assignments to converted_scores, using a conversion array and a data array that is just the numeric score columns. There is an additional problem with using apply: it will convert all of its x arguments to "character" class, because the gender column is "character". It also wasn't clear whether your code normdf[rawscores+1, convertvec] was supposed to be an array extraction or a function call.
Untested in the absence of a working example (with normdf, mydata):
converted_scores <- matrix(NA, nrow = NROW(rawscores), ncol = 12)

converter <- function(idx, gender, age) {
  gidx <- match(gender, c("Male", "Female"))
  aidx <- findInterval(age, c(8, 12, 15))
  ag.idx <- gidx + 2 * aidx - 1
  # the aidx factor needs to match the number of valid age categories
  cvt <- cvt.arr[ag.idx, ]
  converted_scores[idx] <- normdf[rawscores + 1, convertvec]
  return(converted_scores)
}

cvt.arr <- matrix(1:48, nrow = 4, byrow = TRUE)[c(1, 3, 2, 4), ]  # the genders alternate
cvt.scores <- mapply(converter, 1:NROW(mydata), mydata$gender, mydata$age)
I'd advise against applying this stuff by row, but would rather apply this by column. The reason is that there are only 12 columns, but there might be many rows.
The following piece of code works for me. There might be better ways, but it might be interesting for you nevertheless.
offset <- with(mydata, 24 * (gender == "Female") + 12 * (age >= 12))
idxs <- expand.grid(row = 1:nrow(mydata), col = 1:12)
idxs$off <- idxs$col + offset
idxs$val <- as.numeric(mydata[as.matrix(idxs[c("row", "col")])]) + 1
idxs$norm <- normdf[as.matrix(idxs[c("val", "off")])]
converted <- mydata
converted[, 1:12] <- matrix(idxs$norm, ncol = 12)
The tricky part here is the idxs data frame, which ties everything together. It has the following columns:
row and col: position in the original data
off: column in normdf, based on gender and age
val: row in normdf, based on the original value + 1
norm: the corresponding normalized value
I'll post this here as a first thought and see whether I can come up with a better answer, either based on joran's comment or using a three- or four-dimensional array for normdf. Not sure yet.