Is there a way to make a for-loop faster? - r

I am working on some code, but it has a step that is just super slow. Basically I need to check two columns and, if their values are the same in a given row, mark a 1 in a third column, like the code below:
#FLAG_REPETIDOS
df1$FLAG_REPETIDOS <- ""
for (j in 1:nrow(df1)) {
  df1$FLAG_REPETIDOS[[j]] <- ifelse(df1$DATO[[j]] == df1$DATO_ANT[[j]], 1, df1$FLAG_REPETIDOS[[j]])
  df1$FLAG_REPETIDOS[[j]] <- ifelse(is.na(df1$FLAG_REPETIDOS[[j]]), "", df1$FLAG_REPETIDOS[[j]])
  x <- j/100
  if (x == round(x)) {
    print(paste(j, "/", nrow(df1)))
  }
}
print(paste("Check 11:", Sys.time(), sep = " "))
Some more information: I am using a data.table, not a data.frame. My computer is not the best one, with only 8 GB of RAM, and the data I am using has roughly 1 million rows. According to my estimate it would take around 72 hours to finish just this step of the code, which is unreasonable.
Is my code doing something that could be done more easily and faster? Is there any way to optimize it? I am new to R, so I don't know much about optimization.
Thanks in advance
I already changed from a data.frame to a data.table; I researched optimization on Google, and that was one of the things I could try.

The way to make R code go fast is to vectorize your code.
Assuming df is a dataframe, you could probably replace all your included code with something like:
library(dplyr)
df <- df %>%
  mutate(
    FLAG_REPETIDOS = case_when(
      is.na(DATO) | is.na(DATO_ANT) ~ "",
      DATO == DATO_ANT ~ "1",
      TRUE ~ ""
    )
  )
However, I'm not able to check since you did not include any data with your question.

Your loop is equivalent to this much simpler and faster code:
df1$FLAG_REPETIDOS <- ""
df1$FLAG_REPETIDOS[which(df1$DATO == df1$DATO_ANT)] <- "1"
Note that which() avoids the danger of NA values ending up in the index on the second line.
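A small illustration of the NA hazard that which() sidesteps, using toy vectors (made up purely for demonstration):

x <- c(1, NA, 3)
flags <- c("a", "b", "c")

flags[x == 1]        # "a" NA  -- the NA from the comparison leaks into the result
flags[which(x == 1)] # "a"     -- which() drops the NA positions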

Hard to know without sample data, but this should work using data.table:
library(data.table)
dt <- data.table(x=c(1,3,5,7,9), y=c(1,2,5,6,7)) # example
dt[, z:='']
dt[x==y, z:='1']
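To get a rough sense of the gain, you could time the vectorized update on simulated data of the same scale as the question (a sketch; the DATO values here are invented):

library(data.table)
set.seed(42)
n <- 1e6 # about the size mentioned in the question
dt <- data.table(DATO = sample(1:100, n, replace = TRUE),
                 DATO_ANT = sample(1:100, n, replace = TRUE))
system.time({
  dt[, FLAG_REPETIDOS := '']
  dt[DATO == DATO_ANT, FLAG_REPETIDOS := '1']
})
# typically well under a second, versus hours for the row-by-row loop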

Related

Subtract each col in a df from every other col

I would like to try out a normalisation method a friend recommended, in which each column of a df should be subtracted, first from the first column and then from every other column of that df, e.g.:
df <- data.frame(replicate(9,1:4))
x_df_1 <- df[,1] - df[2:ncol(df)]
x_df_2 <- df[,2] - df[c(1, 3:ncol(df))]
x_df_3 <- df[,3] - df[c(1:2, 4:ncol(df))]
...
x_df_90 <- df[,ncol(df)] - df[1:(ncol(df)-1)]
As the df has 90 cols, doing this by hand would be terrible (and very bad coding). I am sure there must be an elegant way to solve this and receive at the end a list containing all the dfs, but I am totally stuck on how to get there. I would appreciate a dplyr method (for familiarity), but any working solution would be fine.
Thanks a lot for your help!
Sebastian
I may have found a solution, which I am sharing here.
Please correct me if I'm wrong.
This is a permutation-without-replacement task.
The original df has 90 cols.
Let's first check how many combinations are possible:
(from: https://davetang.org/muse/2013/09/09/combinations-and-permutations-in-r/)
comb_with_replacement <- function(n, r){
  return( factorial(n + r - 1) / (factorial(r) * factorial(n - 1)) )
}
comb_with_replacement(90, 2) # 4095 combinations
Now, using a modified answer from here: https://stackoverflow.com/a/16921442/10342689
(df has 90 cols; I don't know how to create a proper example df here.)
cc_90 <- combn(colnames(df), 2) # all pairs of the 90 column names
result <- apply(cc_90, 2, function(x) df[[x[1]]] - df[[x[2]]])
dim(result) # one column per pair
That should work.
In R one can index using negative indices to represent "all except this index".
So we can re-write the first of your normalization rows:
x_df_1 <- df[,1] - df[2:ncol(df)]
# rewrite as:
x_df_1 <- df[,1] - df[,-1]
From this, it's a pretty easy next step to write a loop to generate the 90 new dataframes that you generated 'by hand':
list_of_dfs <- lapply(seq_len(ncol(df)), function(x) df[, x] - df[, -x])
This seems to be somewhat different to what you're proposing in your own answer to your question, though...
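As a usage note, naming the resulting list by the original column names makes the individual results easy to pull out (df here is the example data frame from the question):

df <- data.frame(replicate(9, 1:4))
list_of_dfs <- lapply(seq_len(ncol(df)), function(x) df[, x] - df[, -x])
names(list_of_dfs) <- colnames(df)
list_of_dfs$X1 # the same result as the hand-written x_df_1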

R: Looping in R and write result columnwise in other data frame

since I am fairly new to R, I have been struggling for days to come to the right solution. All my internet and Stack Overflow searching could not bring me ahead so far.
All attempts with rbind, cbind, lapply, and sapply did not work. So here is the problem:
I have a data frame with a time series in column "value X".
I want to calculate simple and exponential moving averages (SMA and EMA) on this column.
Since you can change the window-size parameter "n" in the SMA/EMA calculation, I want to vary that parameter in a loop from 5 to 150 in steps of 5, and then write the results into a data frame.
So the data frame should look like:
SMA_5 | SMA_10 | SMA_15 .... EMA_5 | EMA_10 | EMA_15 ...
Ideally the column names are also created in this loop.
Can you help me out?
Thank you in advance
As far as I know, loops are seen as a non-optimal solution in R and should be avoided if possible. It seems to me that the built-in R functions sapply and colnames may provide quite a simple solution to your problem:
library("TTR")
# example of data
test <- data.frame(moments = 101:600, values = 1:500)
seq_of_windows_size <- seq(from = 5, to = 150, by = 5)
col_names_of_sma <- paste("SMA", seq_of_windows_size, sep = "_")
SMA_columns <- sapply(FUN = function(i) SMA(x = test$values, n = i),
                      X = seq_of_windows_size)
colnames(SMA_columns) <- col_names_of_sma
Then you'll just have to add the SMA_columns to your original dataframe. The steps for EMA are much the same.
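For instance, a sketch of the EMA step under the same assumptions (TTR's EMA() takes the window size through the same n argument as SMA()):

col_names_of_ema <- paste("EMA", seq_of_windows_size, sep = "_")
EMA_columns <- sapply(FUN = function(i) EMA(x = test$values, n = i),
                      X = seq_of_windows_size)
colnames(EMA_columns) <- col_names_of_ema
test <- cbind(test, SMA_columns, EMA_columns) # bind both sets onto the original data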
Hope it helps :)

How to optimize for loops and rbinds with large datasets

I am currently working on a large dataset (~1.5M entries) using R, a language I am not yet completely familiar with.
Basically, what I am trying to do is the following:
I want to check what happens during a time interval after "Start".
"Start" represents a few temporal values within every "Trial", and "Trial" represents all of the trials recorded for one "Reference".
So for each Reference, I want to check all Trials and see what happens after "Start" during that Trial.
It's not so important if what I'm trying to do is still obscure; the point is that I want to check every row in my dataframe.
My instinctive (understand: R-noob-ish) way of programming this function led me to a piece of code which I know is far from optimized and takes a LOT of time to run.
My_Function <- function(DataFrame){
  counts <- data.frame()
  for (reference in DataFrame$Ref){
    ref_tested <- subset(DataFrame, Ref == reference)
    ref_count <- data.frame()
    for (trial in ref_tested$Trial){
      trial_tested <- subset(ref_tested, Trial == trial)
      for (timing in trial_tested$Start){
        interesting <- subset(DataFrame, Start > timing & Start <= timing + some_time & Trial == trial)
        ref_count <- rbind(ref_count, as.data.frame(table(interesting$ele)))
      }
    }
    temp <- aggregate(Freq ~ Var1, data = ref_count, FUN = sum)
    counts <- rbind(counts, temp)
  }
  return(counts)
}
Here, as.data.frame(table(interesting$ele)) can have different lengths, and thus so can ref_count.
I failed to find a way to grow my dataframe without using rbind, but I also know that, given the size of my output, it is not time-efficient at all.
Also, I have programmed in other languages such as Python and C++ (a long time ago), and I know that three nested for loops usually mean that you're doing it wrong. But then again, I did not find a way to avoid them in this particular case.
So, do you have any advice on how to use R, or one of its packages, to avoid such a situation?
Thank you in advance,
K.
EDIT:
Thank you for your first advice.
I tried the 'plyr' package and was able to reduce the size of my code chunk; it does as expected and is more understandable. Plus, I was able to produce some example data for reproducibility. See:
#Example Input
DF <- data.frame(sample(1:400, 500000, replace = TRUE),
                 sample(1:25, 500000, replace = TRUE),
                 rnorm(n = 500000, mean = 1, sd = 1))
colnames(DF) <- c("Trial", "Ref", "Start")
DF$rn <- rownames(DF)
tempDF <- DF[sample(nrow(DF), 100), ] # for testing purposes
library(plyr)
some_time <- 0.5 # placeholder window length, just so the example runs

Test <- ddply(.data = tempDF, "rn", function(x){
  interesting <- subset(DF,
                        Trial == x$Trial &
                        Start > x$Start &
                        Start < x$Start + some_time)
  interesting$Elec <- x$Ref
  return(interesting)
})
This is nice, but I still feel like it is not the way to go; in this example we only browse 100 observations, which takes ~4 seconds (I used system.time()), but if I want to scan the 500,000 observations of DF, it would take more than 5 hours.
I have checked data.table, but I am still trying to understand how to use it for now.
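From what I have read so far, a data.table non-equi self-join might replace the ddply call entirely. Here is a minimal, untested sketch, assuming the DF built above and a numeric some_time (win_lo and win_hi are helper columns made up for the join):

library(data.table)
DT <- as.data.table(DF)
DT[, `:=`(win_lo = Start, win_hi = Start + some_time)]
# one self-join finds, for each row, every row of the same Trial whose
# Start falls in (Start, Start + some_time] -- no explicit loop needed
hits <- DT[DT, on = .(Trial, Start > win_lo, Start <= win_hi),
           nomatch = 0L, allow.cartesian = TRUE]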

speeding up applying a function to unique values in R

I was hoping somebody could help. I'm trying to speed up an apply function, and I've tried a few tricks, but it is still quite slow, and I was wondering if anybody had any more suggestions.
I have data as follows:
myData <- data.frame(ident = c(3,3,4,4,4,4,4,4,4,4,4,7,7,7,7,7,7,7),
                     group = c(7,7,7,7,7,7,7,7,7,7,7,8,8,8,8,8,8,8),
                     significant = c(1,1,0,0,0,0,0,0,0,0,0,1,1,0,1,0,0,0),
                     year = c(2003,2002,2001,2008,2010,2007,2007,2008,2006,2012,2008,
                              2012,2006,2001,2014,2012,2004,2007),
                     month = c(1,1,9,12,3,2,4,3,9,5,12,8,11,3,1,6,3,1),
                     subReport = c(0,0,1,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0),
                     prevReport = c(1,1,0,1,1,1,0,1,1,1,0,1,1,1,1,1,1,1))
and I want to end up with a dataframe like this:
results <- data.frame(ident = c(3,4,7),
                      significant = c(1,0,1),
                      prevReports = c(2,6,7),
                      subReport = c(0,1,0),
                      group = c(7,7,8))
To do this I wrote the code below, and to do it quickly I tried converting to data tables and using rbindlist instead of rbind, which I found suggested in a few threads. I've also tried parLapply. I still find the process quite slow, however (I'm trying to do this on about 250,000 data points).
dt <- data.table(myData)
results <- NULL
ApplyModel <- function(id, data) {
  dtTemp <- dt[dt$ident == id, ]
  if (nrow(dtTemp) >= 1) {
    prevReport = if (sum(dtTemp$prevReport) >= 1) sum(dtTemp$prevReport) else 0
    subsequentReport = if (sum(dtTemp$subReport) >= 1) 1 else 0
    significant = as.numeric(head(dtTemp$sig, 1))
    group = head(dtTemp$group, 1)
    id = as.numeric(head(dtTemp$id, 1))
    output <- cbind(id, significant, prevReport, subsequentReport, group)
    output <- output[!duplicated(output[, 1]), ]
    print(output)
    results <- rbindlist(list(as.list(output)))
  }
}
results <- lapply(unique(dt$ident), ApplyModel)
results <- as.data.frame(do.call(rbind, results))
Any suggestions on how this might be sped up would be most welcome! I think it may be to do with the subsetting: I want to apply the function to a subset based on a unique value, but I think lapply is really more for applying a function to each value, so subsetting is defeating the purpose somewhat...
Here, your code produces an error:
results<-lapply(unique(dt$ident), ApplyModel)
Error in dt$ident : object of type 'closure' is not subsettable
It appears to me that you are looking for tapply instead of lapply. Using tapply you could express the above much more concisely:
results2 <- data.frame(significant = tapply(myData$significant, myData$ident, function(x) x[1]),
                       prevreports = tapply(myData$prevReport, myData$ident, sum),
                       subReports = tapply(myData$subReport, myData$ident, function(x) as.numeric(any(x == 1))),
                       group = tapply(myData$group, myData$ident, function(x) x[1]))
This should do about the same job but be much more readable. It should also be fast for anything but huge datasets; in most cases it is faster to wait for R to complete the job than to spend more time programming. One way to make this even faster would be to use the power of the data.table package, but just loading it doesn't do the trick: you'll need to learn its very particular syntax. Please check first whether the code written this way really is too slow.
If it really is too slow, check this:
library(data.table)
first <- function(x) x[1]
myAny <- function(x) as.numeric(any(x == 1))
myData <- data.table(myData)
myData[, .(significant = first(significant),
           prevReports = sum(prevReport),
           subReports = myAny(subReport),
           group = first(group)), ident]
You could use dplyr:
require(dplyr)
new <- myData %>%
  group_by(ident) %>%
  summarise(significant = first(significant),
            prevReports = sum(prevReport),
            subReports = as.numeric(any(subReport == 1)),
            group = first(group)) %>%
  data.frame()

My for loop won't run in r

I can't get this for loop to run.
loopLength <- length(vector_X)
i <- 1
for (x in 1:loopLength)
vector_Y <- Frame_X$column_a == vector_X[i]
Frame_Y <- Frame_X[Vector_Y,]
Frame_A <- Frame_Y$column_b == vector_X[i]
Frame_Z <- Frame_Y[Frame_A,]
Vector_T <- Frame_Y$column_c == Frame_Z[1,2]
Frame_Z2 <- Frame_Y[Vector_T,]
returnSum1[i] <- sum(Frame_Z2$column_d)
Frame_Z3 <- Frame_Y[!(Frame_Z1),]
returnSum2[i] <- sum(Frame_X3$column_d)
I can run the standalone code block by replacing the i with an integer (it only runs from 1 to 20) and cross-check the results against the data, and they are correct. However, I can't seem to iterate it.
I think I'm missing something glaring about writing a loop, but I've looked and can't seem to find it.
It doesn't work when I try to run it as for (i in 1:20) either.
Neither including nor excluding braces around the code block works.
The variable you defined in your for loop is named x, not i. If that isn't it, then the error might come from the fact that if Frame_Z happens to have 0 rows, then Frame_Z[1,2] doesn't exist! I think that step in particular is not very clear. I could help more if you posted an example data.frame and said what you want to do. Also, it would make your code easier to read if you used fewer steps and didn't name index vectors like frames (as in Frame_A and Frame_Z1). Also, I think using dplyr would be easier. Something like:
library(dplyr)
loopLength <- length(vector_X)
for (i in 1:loopLength) {
  xval <- vector_X[i]
  Frame_Z <- Frame_X %>%
    filter(column_a == xval, column_b == xval)
  ...
}
I can't post more because I don't quite get what you are trying to do though.
