speeding up applying a function to unique values in R

I was hoping somebody could help: I'm trying to speed up an apply function, and although I've tried a few tricks it is still quite slow, so I was wondering if anybody had any more suggestions.
I have data as follows:
myData = data.frame(ident       = c(3,3,4,4,4,4,4,4,4,4,4,7,7,7,7,7,7,7),
                    group       = c(7,7,7,7,7,7,7,7,7,7,7,8,8,8,8,8,8,8),
                    significant = c(1,1,0,0,0,0,0,0,0,0,0,1,1,0,1,0,0,0),
                    year        = c(2003,2002,2001,2008,2010,2007,2007,2008,2006,2012,2008,
                                    2012,2006,2001,2014,2012,2004,2007),
                    month       = c(1,1,9,12,3,2,4,3,9,5,12,8,11,3,1,6,3,1),
                    subReport   = c(0,0,1,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0),
                    prevReport  = c(1,1,0,1,1,1,0,1,1,1,0,1,1,1,1,1,1,1))
and I want to end up with a dataframe like this:
results = data.frame(ident       = c(3,4,7),
                     significant = c(1,0,1),
                     prevReports = c(2,6,7),
                     subReport   = c(0,1,0),
                     group       = c(7,7,8))
To do this I wrote the code below. To speed it up I've tried converting to data tables and using rbindlist instead of rbind, which I've seen suggested in a few threads, and I've also tried parLapply. I still find the process to be quite slow, however (I'm trying to do this on about 250,000 data points).
dt <- data.table(myData)
results <- NULL
ApplyModel <- function(id, data) {
  dtTemp <- dt[dt$ident == id, ]
  if (nrow(dtTemp) >= 1) {
    prevReport = if (sum(dtTemp$prevReport) >= 1) sum(dtTemp$prevReport) else 0
    subsequentReport = if (sum(dtTemp$subReport) >= 1) 1 else 0
    significant = as.numeric(head(dtTemp$sig, 1))
    group = head(dtTemp$group, 1)
    id = as.numeric(head(dtTemp$id, 1))
    output <- cbind(id, significant, prevReport, subsequentReport, group)
    output <- output[!duplicated(output[, 1]), ]
    print(output)
    results <- rbindlist(list(as.list(output)))
  }
}
results <- lapply(unique(dt$ident), ApplyModel)
results <- as.data.frame(do.call(rbind, results))
Any suggestions on how this might be sped up would be most welcome! I think it may be to do with the subsetting: I want to apply the function to a subset based on each unique value, but lapply is really meant for applying a function to each value, so the subsetting is rather defeating the object...

Here, your code produces an error:
results<-lapply(unique(dt$ident), ApplyModel)
Error in dt$ident : object of type 'closure' is not subsettable
It appears to me that you are looking for tapply rather than lapply. Using tapply you could express roughly the above much more concisely:
results2 <- data.frame(significant = tapply(myData$significant, myData$ident, function(x) return(x[1])),
                       prevreports = tapply(myData$prevReport, myData$ident, sum),
                       subReports  = tapply(myData$subReport, myData$ident, function(x) as.numeric(any(x == 1))),
                       group       = tapply(myData$group, myData$ident, function(x) return(x[1])))
This should do about the same job but be much more readable. It should also be reasonably fast except for huge datasets; in most cases it is quicker to wait for R to complete the job than to spend more time programming. One way to make this even faster is to use the power of the data.table package, but just loading it doesn't do the trick: you'll need to learn its very particular syntax. Before doing so, check whether the code given this way really is too slow.
If it is really too slow, check this:
library(data.table)
first <- function(x) x[1]
myAny <- function(x) as.numeric(any(x == 1))
myData <- data.table(myData)
myData[, .(significant = first(significant),
           prevReports = sum(prevReport),
           subReports  = myAny(subReport),
           group       = first(group)), ident]

You could use dplyr:
require(dplyr)
new <- myData %>%
  group_by(ident) %>%
  summarise(first(significant), sum(prevReport), (n_distinct(subReport) - 1), first(group)) %>%
  data.frame()
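A small variation on the same pipeline, in case you want the output columns named so they line up with the desired results data frame (just a sketch, with dplyr still loaded; the any() call here stands in for the n_distinct trick above):
new <- myData %>%
  group_by(ident) %>%
  summarise(significant = first(significant),
            prevReports = sum(prevReport),
            subReport   = as.numeric(any(subReport == 1)),
            group       = first(group)) %>%
  as.data.frame()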

Related

Recall different data names inside loop

Here is how I created a number of data sets with names data_1, data_2, data_3, ... and so on.
The initial data is a 500-row by 17-column data frame:
for (i in 1:length(unique(data$cluster))) {
  assign(paste("data", i, sep = "_"), subset(data[data$cluster == i, ]))
}
Up to this point everything is fine.
Now I am trying to use these inside another loop, one by one, like this:
for (i in 1:5) {
  data <- paste(data, i, sep = "_")
}
However, this is not giving me the data in the required format.
Any help will be really appreciated.
Thank you in advance.
Let me give you a tip here: don't just assign everything in the global environment, but use lists for this. That way you avoid all the things that can go wrong when meddling with the global environment. The code you have in your question will overwrite the original dataset data, so you'll be in trouble if you want to rerun that code when something goes wrong: you'll have to reconstruct the original data frame.
Second: If you need to split a data frame based on a factor and carry out some code on each part, you should take a look at split, by and tapply, or at the plyr and dplyr packages.
Using Base R
With base R, it depends on what you want to do. In the most general case you can use a combination of split() and lapply or even a for loop:
mylist <- split(data, f = data$cluster)
for (mydata in mylist) {
  head(mydata)
  ...
}
Or
mylist <- split(data, f = data$cluster)
result <- lapply(mylist, function(mydata) {
  doSomething(mydata)
})
Which one you use depends largely on what the result should be. If you need some kind of summary for every subset, using lapply will give you a list with the results per subset. If you need this for a simulation or plotting or so, you're better off using the for loop.
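For example, if each subset should yield a small summary, the list that lapply returns can be stacked back into one data frame (a sketch; the per-cluster summary below is just a stand-in for your own doSomething()):
mylist <- split(data, f = data$cluster)
result <- lapply(mylist, function(mydata) {
  # stand-in summary: one row per cluster
  data.frame(cluster = mydata$cluster[1], n = nrow(mydata))
})
result <- do.call(rbind, result)  # combine the per-cluster rows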
If you want to add some variables based on other variables, then the plyr or dplyr packages come in handy.
Using plyr and dplyr
These packages come in especially handy if the result of your code is going to be an array or data frame of some kind. This would be similar to using split and lapply, but then in a way Hadley approves of :-)
For example:
library(plyr)
result <- ddply(data, .(cluster),
                function(mydata) {
                  doSomething(mydata)
                })
Use dlply if the result should be a list.
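In dplyr terms, the rough equivalent is to group and then apply the function per group; a sketch, assuming doSomething() returns a data frame:
library(dplyr)
result <- data %>%
  group_by(cluster) %>%
  do(doSomething(.)) %>%
  ungroup()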

Expand dataframe in R with rbind (union)

I need to scale up a set of files for a proof of concept in my company. Essentially I have several 1,000-row files with around 200 columns each, and I want to rbind them until I reach the desired scale. This might be 1 million rows or more.
The output will essentially be a repetition of the data (which sounds a bit silly, and I'm aware of that), but I just need to prove something.
I used a while loop in R similar to this:
while(nrow(df) < 1000000) {df <- rbind(df,df);}
This seems to work, but it looks a bit computationally heavy. It might take around 10-15 minutes.
I thought of creating a function (below) and using an "apply" family function on the df, but couldn't get it to work:
scaleup_function <- function(x) {
  while (nrow(df) < 1000) {
    x <- rbind(df, df)
  }
}
Is there a quicker and more efficient way of doing it (it doesn't need to be with rbind) ?
Many thanks,
Joao
This should do the trick:
df <- matrix(0, nrow = 1000, ncol = 200)
reps_needed <- ceiling(1000000 / nrow(df))
df_scaled <- df[rep(1:nrow(df), reps_needed), ]
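The same indexing trick works directly on a data frame, so you can inflate one of your real 1,000-row files without any rbind calls (a sketch, assuming df holds one of your files):
reps_needed <- ceiling(1000000 / nrow(df))
df_scaled <- df[rep(seq_len(nrow(df)), reps_needed), ]
rownames(df_scaled) <- NULL  # reset the duplicated row names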

Better way to write this R data-cleaning function?

I am writing an R function that takes a data frame column (probably preferably of type factor) and clumps together all the entries below a user-defined frequency as "Other". This is done for data cleaning.
Here is what I have written:
zcut <- function(column, threshold) {
  dft <- data.frame(table(column))
  dft_ind <- sapply(dft$Freq, function(x) x < threshold)
  dft_list <- dft[[1]][dft_ind]
  levels(column)[levels(column) %in% dft_list] <- "Other"
  return(column)
}
I think this is pretty straightforward, but are there ways to make my code more concise or exact?
I would have asked this on the Code Review stack exchange, although it's not clear to me many R experts lurk there.
You don't need sapply here. Try:
dft_ind <- dft$Freq < threshold
This should speed up the function in the case of large data.frames.
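Putting that together, the whole function could look like this (same behaviour as your version, just without the sapply):
zcut <- function(column, threshold) {
  dft <- data.frame(table(column))
  dft_list <- dft[[1]][dft$Freq < threshold]
  levels(column)[levels(column) %in% dft_list] <- "Other"
  return(column)
}
# e.g. zcut(factor(c("a", "a", "b", "c")), threshold = 2) collapses "b" and "c" into "Other"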

alternative to subsetting in R

I have a df, YearHT, of 6.5M rows x 55 columns. There is specific information I want to extract and add, but only based on aggregate values. I am using a for loop to subset the large df and then performing the computations.
I have heard that for loops should be avoided, and I wonder if there is a way to avoid the for loop that I have used, as when I run this query it takes ~3 hours.
Here is my code:
srt = NULL
for (i in doubletCounts$Var1) {
  s = subset(YearHT, YearHT$berthlet == i)
  e = unlist(c(strsplit(i, '\\|'), median(s$berthtime)))
  srt = rbind(srt, e)
}
srt = data.frame(srt)
s2 = data.frame(srt$X2, srt$X1, srt$X3)
colnames(s2) = colnames(srt)
s = rbind(srt, s2)
doubletCounts is a 700 x 3 df, and each of its values is found within the large df.
I would be glad to hear any ideas to optimize/speed up this process.
Here is a fast solution using data.table, although it is not completely clear from your question what output you want to get.
# load library
library(data.table)
# convert your dataset into a data.table by reference
setDT(YearHT)
# subset YearHT, keeping values that are present in doubletCounts$Var1
YearHT_df <- YearHT[berthlet %in% doubletCounts$Var1]
# aggregate values per berthlet
output <- YearHT_df[, .(median = median(berthtime)), by = berthlet]
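If you also need the pieces of berthlet that your strsplit(i, '\\|') call produced, data.table's tstrsplit can add them as columns of the aggregated table. A sketch, where the part1/part2 names are placeholders and the count must match how many pieces each berthlet value splits into:
# split the berthlet key into its components (column names here are placeholders)
output[, c("part1", "part2") := tstrsplit(berthlet, "|", fixed = TRUE)]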
for loops aren't necessarily something to avoid, but there are certain ways of using for loops that should be avoided. You've committed the classic for loop blunder here.
srt = NULL
for (i in index) {
  [stuff]
  srt = rbind(srt, [stuff])
}
is bound to be slower than you would like because each time you hit srt = rbind(...), you're asking R to do all sorts of things to figure out what kind of object srt needs to be and how much memory to allocate to it. When you know what the length of your output needs to be up front, it's better to do
srt <- vector("list", length = length(doubletCounts$Var1))
names(srt) <- doubletCounts$Var1
for (i in doubletCounts$Var1) {
  s = subset(YearHT, YearHT$berthlet == i)
  srt[[i]] = unlist(c(strsplit(i, '\\|'), median(s$berthtime)))
}
srt = data.frame(srt)
Or the apply alternative of
srt = lapply(doubletCounts$Var1,
             function(i) {
               s = subset(YearHT, YearHT$berthlet == i)
               unlist(c(strsplit(i, '\\|'), median(s$berthtime)))
             })
Both of those should run at about the same speed.
(Note: both are untested, for lack of data, so they might be a little buggy)
Something else you can try, which might have a smaller effect, would be dropping the subset call and using indexing. The content of your for loop could be boiled down to:
unlist(c(strsplit(i, '\\|'),
median(YearHT[YearHT$berthlet == i, "berthtime"])))
But I'm not sure how much time that would save.
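Combining the preallocation advice with plain indexing gives something like the sketch below (also untested, same caveat as above):
ids <- as.character(doubletCounts$Var1)
srt <- vector("list", length(ids))
for (k in seq_along(ids)) {
  i <- ids[k]
  srt[[k]] <- unlist(c(strsplit(i, '\\|'),
                       median(YearHT$berthtime[YearHT$berthlet == i])))
}
srt <- data.frame(do.call(rbind, srt))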

Apply function to every value in an R dataframe

I have a 58-column data frame, and I need to apply the transformation $\log(x_{i,j}+1)$ to all values in the first 56 columns. What method could I use to go about this most efficiently? I'm assuming there is something that would allow me to do this rather than just using some for loops to run through the entire data frame.
alexwhan's answer is right for log (and should probably be selected as the correct answer). However, it works so cleanly because log is vectorized. I have experienced the special pain of non-vectorized functions too frequently. When I started with R, and didn't understand the apply family well, I resorted to ugly loops very often. So, for the purposes of those who might stumble onto this question who do not have vectorized functions I provide the following proof of concept.
# Creating sample data
df <- as.data.frame(matrix(runif(56 * 56), 56, 56))

# Writing an ugly non-vectorized function
logplusone <- function(x) {log(x[1] + 1)}

# Example code that achieves the desired result, despite the lack of a vectorized function
df[, 1:56] <- as.data.frame(lapply(df[, 1:56], FUN = function(x) {sapply(x, FUN = logplusone)}))

# Proof that the results are the same using both methods...
# Note: I used all.equal rather than all so that the values are tested using machine tolerance
# for mathematical equivalence. This is probably a non-issue for the current example, but might
# be relevant with some other testing functions.

# Should evaluate to TRUE
all.equal(log(df[, 1:56] + 1),
          as.data.frame(lapply(df[, 1:56], FUN = function(x) {sapply(x, FUN = logplusone)})))
You should be able to just refer to the columns you want, and do the operation, ie:
df[,1:56] <- log(df[,1:56]+1)
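If you prefer to work column by column (for instance if some of the remaining columns are non-numeric), an lapply-based equivalent of the same idea would be:
df[1:56] <- lapply(df[1:56], function(x) log(x + 1))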
