Making my code more efficient in R

I am trying to execute some code that takes way too much time (>6 days). Maybe there is a way of making it more efficient. Any ideas?
library(haven)
library(plyr)
AFILIAD1 <- read_sav("XXXX")
#this sav has around 6 million rows.
AFILIAD1$F_ALTA<- as.character(AFILIAD1$F_ALTA)
AFILIAD1$F_BAJA<- as.character(AFILIAD1$F_BAJA)
AFILIAD1$F_ALTA <- as.Date(AFILIAD1$F_ALTA, "%Y%m%d")
AFILIAD1$F_BAJA <- as.Date(AFILIAD1$F_BAJA, "%Y%m%d")
#starting and ending date
meses <- seq(as.Date("1900-01-01"), as.Date("2014-12-31"), by = "month")
#this is the function that needs to be more efficient
ocupados <- function(pruebas){
  previo <- c()
  total <- c()
  for(i in 1:length(meses)){
    for(j in 1:nrow(pruebas)){
      ifelse(pruebas$F_ALTA[j] <= meses[i] & pruebas$F_BAJA[j] >= meses[i],
             previo[j] <- pruebas$IPF[j], previo[j] <- NA)
    }
    total[i] <- (length(unique(previo)) - 1)
  }
  names(total) <- meses
  return(total)
}
#this takes >6 days to execute
afiliado1 <- ocupados(AFILIAD1)

There is a lot you can do to speed this up. Here's one example:
library(tidyverse) # adds pipes
ocupados <- function(pruebas) {
  total <- map_int(meses, function(x) {
    with(pruebas, {
      IPF[F_ALTA <= x & F_BAJA >= x] %>%
        n_distinct() # I'm assuming you subtract 1 to remove the NA effect - no longer needed
    })
  })
  names(total) <- meses
  return(total)
}
There are two big speed-ups here. First, the inner loop is now handled by vectorised comparisons that run in compiled code (so you don't see it here), which will be a huge saving for you.
Second, we never define empty vectors and grow them. A vector that grows has to be copied EVERY time you increase its length, which is very expensive. Instead, all I'm saving is the final result. The apply family of functions (and purrr's map functions) behave like loops, but run the body as a function and allocate the result for you.
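If you prefer to stay in base R, the same idea can be written with vapply(), which also allocates the result vector once. This is a minimal, untested sketch (ocupados_base is just an illustrative name, and it mirrors the n_distinct() logic above):
ocupados_base <- function(pruebas) {
  # vapply allocates the result vector once, so nothing grows inside the loop
  total <- vapply(seq_along(meses), function(i) {
    # which() drops the NAs that missing dates produce in the comparison
    idx <- which(pruebas$F_ALTA <= meses[i] & pruebas$F_BAJA >= meses[i])
    length(unique(pruebas$IPF[idx]))
  }, integer(1))
  names(total) <- as.character(meses)
  total
}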
If you're not familiar with the pipe operator (%>%), all it does is call the function on its right with the result of the expression on its left as its first argument. So
length(unique(x))
is the same as
x %>%
unique() %>%
length()
The advantage is readability: with the pipe it's easier to see that unique() is applied first, then length().
One more comment - without a reproducible example, I cannot test this code. If you have trouble, you need to include a small reproducible data set so we can actually test what the code is doing.

Related

What is the most efficient way to find the most common value in a vector?

I'm trying to create a function to solve this puzzle:
An Arithmetic Progression is defined as one in which there is a constant difference between the consecutive terms of a given series of numbers. You are provided with consecutive elements of an Arithmetic Progression. There is however one hitch: exactly one term from the original series is missing from the set of numbers which have been given to you. The rest of the given series is the same as the original AP. Find the missing term.
You have to write the function findMissing(list), list will always be at least 3 numbers. The missing term will never be the first or last one.
The next section of code shows my attempt at this function. The site I'm on runs tests against the function, all of which passed, meaning they output the correct missing integer.
The problem I'm facing is that it gives me a timeout error, because it takes too long to run all the tests. There are 102 tests and it says it takes over 12 seconds to complete them. Taking more than 12 seconds means the function isn't efficient enough.
After running my own timing tests in RStudio, it seems the function takes considerably less than 12 seconds to run, but regardless I need to make it more efficient to be able to complete the puzzle.
I asked on the site forum and someone said "Sorting is expensive, think of another way of doing it without it." I took this to mean I shouldn't be using the sort() function. Is this what they meant?
I've since found a few different ways of getting my_diff, which is calculated using the sort() function. All of these ways are even less efficient than the original way of doing it.
Can anyone give me a more efficient way of doing the sort to find my_diff, or maybe make other parts of the code more efficient? It's the sort() part which is apparently the inefficient part of the code, though.
find_missing <- function(sequence){
  len <- length(sequence)
  if(len > 3){
    my_diff <- as.integer(names(sort(table(diff(sequence)), decreasing = TRUE))[1])
    complete_seq <- seq(sequence[1], sequence[len], my_diff)
  }else{
    differences <- diff(sequence)
    complete_seq_1 <- seq(sequence[1], sequence[len], differences[1])
    complete_seq_2 <- seq(sequence[1], sequence[len], differences[2])
    if(length(complete_seq_1) == 4){
      complete_seq <- complete_seq_1
    }else{
      complete_seq <- complete_seq_2
    }
  }
  complete_seq[!complete_seq %in% sequence]
}
Here are a couple of sample sequences to check the code works:
find_missing(c(1,3,5,9,11))
find_missing(c(1,5,7))
Here are some of the other things I tried instead of sort:
1:
library(pracma)
Mode(diff(sequence))
2:
library(dplyr)
(data.frame(diff_1 = diff(sequence)) %>%
   group_by(diff_1) %>%
   summarise(count = n()) %>%
   ungroup() %>%
   filter(count == max(count)))[1]
3:
MaxTable <- function(sequence, mult = FALSE) {
  differences <- diff(sequence)
  if (!is.factor(differences)) differences <- factor(differences)
  A <- tabulate(differences)
  if (isTRUE(mult)) {
    as.integer(levels(differences)[A == max(A)])
  } else {
    as.integer(levels(differences)[which.max(A)])
  }
}
Here is one way to do this using seq(). We can create a sequence from the minimum to the maximum value of the input with length length(x) + 1, since exactly one term is missing from the sequence.
find_missing <- function(x) {
  setdiff(seq(min(x), max(x), length.out = length(x) + 1), x)
}
find_missing(c(1,3,5,9,11))
#[1] 7
find_missing(c(1,5,7))
#[1] 3
This approach takes the diff() of the vector: because exactly one term of the progression is missing, there will always be one difference larger than the others (this assumes the vector is sorted in increasing order, as in the examples).
find_missing <- function(x) {
  diffs <- diff(x)
  x[which.max(diffs)] + min(diffs)
}
find_missing(c(1,3,5,9,11))
[1] 7
find_missing(c(1,5,7))
[1] 3
There is actually a simple formula for this, which will work even if your vector is not sorted...
find_missing <- function(x) {
  (length(x) + 1) * (min(x) + max(x)) / 2 - sum(x)
}
find_missing(c(1,5,7))
[1] 3
find_missing(c(1,3,5,9,11,13,15))
[1] 7
find_missing(c(2,8,6))
[1] 4
It is based on the fact that the sum of the full series equals the average of the smallest and largest terms, (min(x) + max(x))/2, times the number of terms, length(x) + 1; subtracting the sum of the given terms leaves the missing one.
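As a quick sanity check of that identity, take the last example above (nothing here depends on the order of the input):
x <- c(2, 8, 6)
n_full   <- length(x) + 1                   # the complete progression has one more term
full_sum <- n_full * (min(x) + max(x)) / 2  # sum of the complete progression
full_sum - sum(x)                           # recovers the missing term
# [1] 4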

How to make a more elegant and shorter for loop with multiple internal functions and results

These are the steps I am following:
subset two matrices by a range of proportions (e.g. 80-85, 85-90)
run two separate distance measure functions for each subset of data
run a mantel using the distance matrix produced by each subset of data
produce a list of each test result, each with a unique name
produce a data frame of all the mantel-r results and their corresponding p-values
I have written code that will complete this process, but I feel there is a more elegant and better way to do so. What I have works, but I would like to improve my R-skills, so any advice/ideas would be welcomed. I am not new to R, but I am far from being where I would like to be.
Also, my code produces unnecessary objects (i.e. SS, HB, sp.dis, epa.dis, and nam in the code below). They are not a big deal, but it would be nice to have code that doesn’t produce this side effect. A reproducible example (modeled after how my data is formatted) and the packages I’m using are below:
library(tidyverse)
library(betapart)
library(vegan)
set.seed(2)
spe2<-data.frame(replicate(10,sample(0:100,100,replace=T)))
spe2$Ag<-round(runif(100, min=0.4, max=1),2)
epa2<-data.frame(replicate(3,sample(1:20,100,replace=T)))
epa2$Ag<-spe2$Ag
Mantel.List<-list()
List.names <- list()
for(i in seq(from = 0.85, to = 0.95, by = 0.05)){
  SS <- spe2 %>%
    filter(Ag >= i & Ag < i + 0.05)
  HB <- epa2 %>%
    filter(Ag >= i & Ag < i + 0.05)
  sp.dis <- beta.pair(decostand(SS[, 1:ncol(SS) - 1], 'pa'))
  epa.dis <- vegdist(HB[, 1:ncol(HB) - 1], method = 'euclidean')
  mnt <- mantel(sp.dis$beta.sor, epa.dis)
  Mantel.List[[length(Mantel.List) + 1]] <- mnt
  nam <- paste('M.tt', i*100, sep = '')
  List.names[[length(List.names) + 1]] <- nam
}
names(Mantel.List)<-List.names
Mantel.Results<-cbind(sapply(Mantel.List, function(x) x$statistic),sapply(Mantel.List, function(x) x$signif))
colnames(Mantel.Results)<-c('Mantel-r', 'p-value')
Mantel.Results
Thank you!
I've done two things to try to make this code a little better. First, I eliminated all the unnecessary objects by using the data.table package, which is usually the most efficient way to handle data.frames because it doesn't make copies of itself when subsetting.
Secondly, instead of using a for loop, I'm using an apply function. Note the <<- operator inside doit(), which assigns to the object outside the function.
Here's my suggestion:
library(data.table)
library(vegan)     # for decostand(), vegdist() and mantel()
library(betapart)  # for beta.pair()
set.seed(2)
spe2<-as.data.table(data.frame(replicate(10,sample(0:100,100,replace=T))))
spe2$Ag<-round(runif(100, min=0.4, max=1),2)
epa2<-as.data.table(data.frame(replicate(3,sample(1:20,100,replace=T))))
epa2$Ag<-spe2$Ag
doitAll = function(dt1, dt2){
  Mantel.List <- list()
  List.names <- list()
  doit = function(x, dt1, dt2){
    mnt <- mantel(beta.pair(decostand(dt1[Ag >= x & Ag < x + 0.05, 1:(ncol(dt1) - 1), with = F], 'pa'))$beta.sor,
                  vegdist(dt2[Ag >= x & Ag < x + 0.05, 1:(ncol(dt2) - 1), with = F], method = 'euclidean'))
    Mantel.List[[length(Mantel.List) + 1]] <<- mnt
    nam <- paste('M.tt', x*100, sep = '')
    List.names[[length(List.names) + 1]] <<- nam
  }
  sapply(seq(from = 0.85, to = 0.95, by = 0.05), doit, dt1 = dt1, dt2 = dt2)
  names(Mantel.List) <- List.names
  Mantel.Results <- cbind(sapply(Mantel.List, function(x) x$statistic),
                          sapply(Mantel.List, function(x) x$signif))
  colnames(Mantel.Results) <- c('Mantel-r', 'p-value')
  return(Mantel.Results)
}
doitAll(dt1=spe2,dt2=epa2)
It might be a little hard to read, but it's surely more efficient.
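If you'd rather avoid the <<- assignments altogether, each iteration can simply return what it computes and the results can be assembled afterwards. A rough, untested sketch of that shape (doitAll2 is just an illustrative name; it assumes the same packages as above are loaded):
doitAll2 <- function(dt1, dt2) {
  breaks <- seq(from = 0.85, to = 0.95, by = 0.05)
  # each iteration returns its mantel object; nothing is assigned outside the function
  mantel_list <- lapply(breaks, function(x) {
    mantel(beta.pair(decostand(dt1[Ag >= x & Ag < x + 0.05, 1:(ncol(dt1) - 1), with = FALSE], 'pa'))$beta.sor,
           vegdist(dt2[Ag >= x & Ag < x + 0.05, 1:(ncol(dt2) - 1), with = FALSE], method = 'euclidean'))
  })
  names(mantel_list) <- paste0('M.tt', breaks * 100)
  cbind(`Mantel-r` = sapply(mantel_list, `[[`, 'statistic'),
        `p-value`  = sapply(mantel_list, `[[`, 'signif'))
}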

How to optimize for loops and rbinds with large datasets

I am currently working on a large dataset (~1.5M entries) using R, a language I am not yet completely familiar with.
Basically, what I try to do is the following :
I want to check what happens during a time interval after "Start".
"Start" represents a few temporal values within every "Trial", and "Trial" represents all of the trials recorded for one "Reference".
So for each Reference, I want to check all Trials and see what happens after "Start" during this Trial.
It's not so important if what I'm trying to do is still obscure; the point is that I want to check every entry in my data frame.
My instinctive (read: R-noob-ish) way of programming this led me to a piece of code which I know is far from optimized and takes a LOT of time to run.
My_Function <- function(DataFrame){
  counts <- data.frame()
  for (reference in DataFrame$Ref){
    ref_tested <- subset(DataFrame, Ref == reference)
    ref_count <- data.frame()
    for (trial in ref_tested$Trial){
      trial_tested <- subset(ref_tested, Trial == trial)
      for (timing in trial_tested$Start){
        interesting <- subset(DataFrame, Start > timing & Start <= timing + some_time & Trial == trial)
        ref_count <- rbind(ref_count, as.data.frame(table(interesting$ele)))
      }
    }
    temp <- aggregate(Freq ~ Var1, data = ref_count, FUN = sum)
    counts <- rbind(counts, temp)
  }
  return(counts)
}
Here, as.data.frame(table(interesting$ele)) can have different lengths, and thus so can ref_count.
I failed to find a way to grow my data frame without using rbind, but I also know that, given the size of my output, it is not time-efficient at all.
Also, I have already programmed in other languages such as Python and C++ (a long time ago) and know that having 3 nested for loops usually means that you're doing it wrong. But then again, I did not find a way to avoid doing that in this particular case.
So, do you have any advice on how to use R, or one of its packages, to avoid such a situation?
Thank you in advance,
K.
EDIT:
Thank you for your first pieces of advice.
I tried the 'plyr' package and was able to reduce the size of my code chunk; it does what I expect and is more understandable. Plus, I was able to produce some example data for reproducibility. See:
#Example Input
DF <- data.frame(c(sample(1:400,500000, replace = TRUE)),c(sample(1:25,500000, replace = TRUE)), rnorm(n=500000, m=1, sd=1) )
colnames(DF)<-c("Trial","Ref","Start")
DF$rn<-rownames(DF)
tempDF <- DF[sample(nrow(DF), 100), ] #For testing purposes
Test <- ddply(.data = tempDF, "rn", function(x){
  interesting <- subset(DF,
                        Trial == x$Trial &
                        Start > x$Start &
                        Start < x$Start + some_time)
  interesting$Elec <- x$Ref
  return(interesting)
})
This is nice, but I still feel like it is not the way to go; in this example we only browse 100 observations, which takes ~4 seconds (I used system.time()), but if I want to scan the 500,000 observations of DF, it would take more than 5 hours.
I have looked at data.table but I am still trying to understand how to use it.
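For anyone going down the data.table route mentioned above, the ddply() step maps fairly directly onto a non-equi join, which avoids looping over rows entirely. A rough, untested sketch under the same example data (qry, lo, hi and Test_dt are illustrative names, and some_time is assumed to be defined as in the original code):
library(data.table)
DT  <- as.data.table(DF)
qry <- as.data.table(tempDF)
qry[, `:=`(lo = Start, hi = Start + some_time)]
# for every row of qry, pull the DF rows in the same Trial whose Start lies
# strictly between lo and hi; this replaces the per-row subset() inside ddply()
Test_dt <- DT[qry, on = .(Trial, Start > lo, Start < hi),
              .(rn = i.rn, Elec = i.Ref, Start = x.Start),
              nomatch = 0L]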

Better way to write this R data-cleaning function?

I am writing an R function that takes a data frame column (probably preferably of type factor) and clumps together all the entries below a user-defined frequency as "Other". This is done for data cleaning.
Here is what I have written:
zcut <- function(column, threshold){
  dft <- data.frame(table(column))
  dft_ind <- sapply(dft$Freq, function(x) x < threshold)
  dft_list <- dft[[1]][dft_ind]
  levels(column)[levels(column) %in% dft_list] <- "Other"
  return(column)
}
I think this is pretty straightforward, but are there ways to make my code more concise or exact?
I would have asked this on the Code Review Stack Exchange, but it's not clear to me that many R experts lurk there.
You don't need sapply here. Try:
dft_ind <- dft$Freq < threshold
This should speed up the function in the case of large data.frames.
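Putting that together, the whole function might look like this (the same logic as the original, just with the vectorised comparison; an untested sketch):
zcut <- function(column, threshold){
  dft <- data.frame(table(column))
  # the comparison is already vectorised, so no sapply is needed
  dft_list <- dft[[1]][dft$Freq < threshold]
  levels(column)[levels(column) %in% dft_list] <- "Other"
  return(column)
}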

R: rewrite loop with apply

I have the following type of data set:
id;2011_01;2011_02;2011_03; ... ;2001_12
id01;NA;NA;123; ... ;NA
id02;188;NA;NA; ... ;NA
That is, each row is a unique customer and each column depicts a trait for this customer from the past 10 years (each month has its own column). The thing is that I want to condense this 120-column data frame into a 10-column data frame, because I know that almost all rows have (although the month itself can vary) 1 or 0 observations from each year.
I've already done this, one year at a time, using a loop with a nested if-clause:
for(i in 1:nrow(input_data)) {
  temp_row <- input_data[i, c("2011_01","2011_02","2011_03","2011_04","2011_05","2011_06","2011_07","2011_08","2011_09","2011_10","2011_11","2011_12")]
  loc2011 <- which(!is.na(temp_row))
  if(length(loc2011) > 0) {
    temp_row_2011[i,] <- temp_row[loc2011[1]] # pick the first observation if there are several
  } else {
    temp_row_2011[i,] <- NA
  }
}
Since my data set is quite big, and I need to perform the above loop 10 times (once for each year), this is taking way too much time. I know one is much better off using apply commands in R, so I would greatly appreciate help on this task. How could I write the whole thing (including the different years) better?
Are you after something like this?
temp_row_2011 <- apply(input_data, 1, function(x){
  temp_row <- x[c("2011_01","2011_02","2011_03","2011_04","2011_05","2011_06","2011_07","2011_08","2011_09","2011_10","2011_11","2011_12")]
  temp_row[!is.na(temp_row)][1]
})
If this gives you the right output, and if it runs faster than your loop, then it's not necessarily only because it uses apply(), but also because it assigns fewer intermediate objects and avoids an if {} else {}. You might be able to make it go even faster by compiling the anonymous function:
reduceyear <- function(x){
  temp_row <- x[c("2011_01","2011_02","2011_03","2011_04","2011_05","2011_06","2011_07","2011_08","2011_09","2011_10","2011_11","2011_12")]
  temp_row[!is.na(temp_row)][1]
}
# compile, just in case it runs faster:
reduceyear_c <- compiler::cmpfun(reduceyear)
# this ought to do the same as the above.
temp_row_2011 <- apply(input_data, 1, reduceyear_c)
You didn't say whether input_data is a data.frame or a matrix, but a matrix would be faster than the former (but only valid if input_data is all the same class of data).
[EDIT: full example, motivated by DWin]
input_data <- matrix(ncol=24,nrow=10)
# years and months:
colnames(input_data) <- c(paste(2010,1:12,sep="_"),paste(2011,1:12,sep="_"))
# some ids
rownames(input_data) <- 1:10
# put in some values:
input_data[sample(1:length(input_data),200,replace=FALSE)] <- round(runif(200,100,200))
# make an all-NA case:
input_data[2,1:12] <- NA
# and here's the full deal:
sapply(2010:2011, function(x, input_data){
  input_data_yr <- input_data[, grep(x, colnames(input_data))]
  apply(input_data_yr, 1, function(id){
    id[!is.na(id)][1]
  })
}, input_data)
The all-NA case works. The grep() column-selection idea is lifted from DWin. As in the above example, you could define the anonymous interior function and compile it to potentially make things run faster.
I built a tiny test case (for which timriffe's suggestion fails). You might attract more interest by putting up code that creates a more complete test case, such as 4 quarters for 2 years, including pathological cases such as all NA's in one row of one year. I would think that instead of requiring you to write out all the year columns by name, you ought to cycle through them with a grep() strategy:
# funyear <- function to work on one year's data and return a single vector
# my efforts keep failing on the all(NA) row by year combos
sapply(seq("2011", "2001"), function (pat) funyear(input_data[grep(pat, names(input_data) )] )
