Applying a function to increasingly larger subsets of a data frame - r

I want to apply a statistical function to increasingly larger subsets of a data frame, starting at row 1 and incrementing by, say, 10 rows each time. So the first subset is rows 1-10, the second rows 1-20, and the final subset is rows 1-nrows. Can this be done without a for loop? And if so, how?

Here is one solution:
# some sample data
df <- data.frame(x = sample(1:105, 105))
# endpoints of the subsets you wanted: 10, 20, ..., plus the final row count
# (unique() guards against a duplicated final endpoint)
row_seq <- unique(c(seq(10, nrow(df), 10), nrow(df)))
# build the subsets, slicing df from row 1 to each endpoint
# (a single-column data frame drops to a vector here, which mean() accepts)
data.subsets <- lapply(row_seq, function(x) df[1:x, ])
# apply the mean function to each subset
# just replace mean with whatever function you want to use
lapply(data.subsets, mean)
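If the statistic is as simple as a mean, a hedged alternative (assuming the same df and row_seq as above) is to skip building the subsets entirely and use cumsum():
# running means of rows 1..k for each endpoint k, without materializing subsets
cumsum(df$x)[row_seq] / row_seq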

Related

R: How to apply customized function rowwise to dataframe (avoiding for loop)

I need help rewriting my function (randomdraws(), see below), which currently operates through a repeat loop and a for loop. This takes a lot of time (especially the for loop) for my bigger datasets. Additionally, I need to repeat this function x times and want to store the results in a list object.
Here is what I am trying to achieve: I have two dataframes (here df_1 and df_2) which I need as input for my function randomdraws(). The interesting part of this function begins in the repeat{} section; here I need to draw a number of values from the extreme value distribution (evd) equal to the length of df_1 (one value per column). Afterwards I need to add these values (called evd_draw) to the values of df_1 and check whether the altered values fulfill a certain condition (i.e. varX==varY). If this is not the case (the condition is not met with the random draw added), I want to repeat this part until the condition is met. If the condition is met, I need to store the evd_draw with which the condition was met. I then want to iterate this over each row of my dataframe df_1. In the end I get a new dataframe with the "stored" random draws of the evd, one per row of df_1, that fulfilled the condition varX==varY. In my example below, with only 10 observations, my code runs just fine.
But: if the number of rows and columns of df_1 (and df_2) expands, the function randomdraws() gets very slow. I therefore need another solution that performs the calculation of the repeat loop for each row of dataframe df_1. I think I need to parallelize my computations instead of iterating over each row one after another, but I seem to fail at (i) rewriting the repeat part of my function for this and (ii) using it with functions like apply()/map()/...
QUESTION: Is there a way to achieve my result (i.e. a dataframe/list of the random draws that fulfilled the condition on dataframes df_1 and df_2) that avoids the for loop and is quick for large datasets/dataframes?
Example data:
df_1 <- as.data.frame(rbind(c(0.23040, 0.10153, 0.28394, 0.17105, 0.00125),
                            c(0.11839, 0.16768, 0.26914, 0.19163, 0.00126),
                            c(0.11703, 0.18245, 0.16571, 0.16223, 0.00423),
                            c(0.39406, 0.08535, 0.21181, 0.12780, 0.00039),
                            c(0.16097, 0.16369, 0.23839, 0.17830, 0.00158),
                            c(0.39812, 0.04525, 0.17583, 0.09064, 0.00167),
                            c(0.30547, 0.10900, 0.18930, 0.12665, 0.00197),
                            c(0.19357, 0.17854, 0.18003, 0.19576, 0.00189),
                            c(0.19466, 0.17339, 0.21267, 0.18410, 0.00069),
                            c(0.07884, 0.21299, 0.18480, 0.17908, 0.00178)))
colnames(df_1) <- c("xf0m40","xf30m40","xf10m40","xf20m40","xf40m0")
rownames(df_1) <- c(2,7,21,33,50,77,80,96,102,110)
df_2 <- cbind.data.frame(varX=c("xf0m40","xf30m40","xf10m40","xf0m40","xf20m40",
                                "xf0m40","xf0m40","xf40m0","xf10m40","xf30m40"),
                         id=c(2,7,21,33,50,77,80,96,102,110))
Function (that runs smoothly but is too slow):
randomdraws <- function(df_1, df_2) {
  require(tidyverse)
  require(EnvStats)
  dfx <- df_1 # here df_1 is actually retrieved from fitted values of regression
              # output, simplified here for the sake of clarity
  df <- df_2  # select two variables from separate dataframe df_2 (already
              # simplified here), where varX is a character var, id is numeric
  # matrix containing only 0; to be filled by rowwise iteration
  df_evd <- matrix(0, nrow = nrow(dfx), ncol = ncol(dfx), byrow = TRUE)
  colnames(df_evd) <- colnames(dfx)
  rownames(df_evd) <- rownames(dfx)
  for (i in 1:nrow(dfx)) {
    repeat {
      # draw from the evd, one value per column (i.e. the length of one row)
      evd_draw <- revd(length(dfx), scale = .5)
      # t sums row i of dfx and evd_draw, then adds varX and id from df_2 and
      # computes varY (the column holding the row maximum) to check varX == varY
      t <- as.data.frame(dfx[i, ] + evd_draw) %>%
        bind_cols(df[i, ]) %>%
        mutate(varY = as.character(pmap(across(1:ncol(dfx)),
                                        ~ names(c(...)[which.max(c(...))]))),
               overlap = ifelse(varX == varY, 1, 0))
      df_evd[i, ] <- evd_draw
      # repeat this section until the condition varX == varY holds (in other
      # words, until overlap == 1, i.e. t[, ncol(t)] == 1)
      if (t[, ncol(t)] == 1) break
    }
  }
  return(df_evd)
}
Apply function on data:
system.time(exampledf <- randomdraws(df_1, df_2))
# replicate this function 3 times (this takes even longer!)
ls_example <- replicate(3, list(as.data.frame(randomdraws(df_1, df_2))), simplify=TRUE)
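One way to speed this up (a sketch under assumptions, not the asker's code) is to keep the rejection idea but redraw all unmet rows at once per pass, so the per-row repeat loop becomes whole-matrix operations. This assumes the rows of df_2 line up with the rows of df_1 and that the check is "the column holding the row maximum of df_1 + draw must equal varX"; note that max.col() breaks ties at random where which.max() takes the first match.
library(EnvStats)
randomdraws_vec <- function(df_1, df_2, scale = 0.5) {
  m <- as.matrix(df_1)
  target <- match(df_2$varX, colnames(m))  # column index each row must hit
  draws <- matrix(NA_real_, nrow(m), ncol(m), dimnames = dimnames(m))
  pending <- seq_len(nrow(m))              # rows whose condition is still unmet
  while (length(pending) > 0) {
    # one evd draw per cell of the still-pending rows
    cand <- matrix(revd(length(pending) * ncol(m), scale = scale),
                   nrow = length(pending))
    # a row is done once the max column after adding the draw hits the target
    hit <- max.col(m[pending, , drop = FALSE] + cand) == target[pending]
    draws[pending[hit], ] <- cand[hit, , drop = FALSE]
    pending <- pending[!hit]
  }
  as.data.frame(draws)
}
# usage: exampledf_vec <- randomdraws_vec(df_1, df_2)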

Bootstrapping data frame columns independently in R

I have a data.frame where each column represents a different individual and each row represents different food items eaten.
My goal is to resample each column via bootstrapping and then calculate a metric score and C.I.s for each individual (data column) using a defined function.
I have done this successfully on a single vector but cannot figure out how to apply the bootstrapping and metric function to individual columns in a data frame. Below is the code I have to apply it to a single vector:
data.1 <- c(10, 50, 200, 54, 6) ## example vector
## create function
metric.function <- function(x){
  p <- x / sum(x)
  dap <- 1 / sum(p^2)
  return(dap)
}
vect <- c() ## empty vector for bootstrap data
for (i in 1:1000){
  data.2 <- sample(data.1, replace = TRUE) ## bootstrap sample
  vect[i] <- metric.function(data.2)       ## apply metric.function
}
summary(vect) ## summary
quantile(vect, probs = c(0.025, 0.975)) ## C.I.
This works fine for a single vector, but I want to apply it independently to multiple columns in a data frame. For example, in example.df below I want to apply it to x1:x10 independently, resulting in 10 metric scores and 10 C.I.s:
example.df<-data.frame(replicate(10,sample(0:50,10,rep=TRUE)))
I have tried changing the vector item to a data.frame and messing around with apply and dplyr but cannot figure it out; can anyone suggest how to do it or point me in the direction of a useful guide/website etc.?
This is a perfect chance to use replicate and sapply.
replicate(1000, sapply(example.df, function(x)
  metric.function(sample(x, replace = TRUE))))
sapply will operate column-wise (given that a data.frame is in a sense a list of columns); once we've isolated a column within sapply, we need only resample it & apply our metric.
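The result is a 10 x 1000 matrix (one row per column of example.df, one column per bootstrap replicate), so the 10 summaries and C.I.s fall out with apply() over rows. A short sketch, with boot.mat as an assumed name:
boot.mat <- replicate(1000, sapply(example.df, function(x)
  metric.function(sample(x, replace = TRUE))))
# one 95% C.I. per original column (a 2 x 10 matrix)
apply(boot.mat, 1, quantile, probs = c(0.025, 0.975))
# one summary per original column
apply(boot.mat, 1, summary)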

Comparing two dataframes in ddply function

I have two dataframes, Data and Quantiles. Data has dimension 23011 x 2 and consists of the columns "year" and "data", where year covers the sequence of days from 1951:2013. The Quantiles df has dimension 63 x 2 and consists of the columns "year" and "quantiles", with one row per year, i.e. 1951:2013.
I need to compare the Quantiles df against the Data df and compute, for each year, the sum of the data values exceeding that year's quantile value. For that, I'm using ddply in this manner:
ddply(data, .(year), function(y) sum(y[which(y[,2] > quantile[,2]),2]) )
However, the code compares only against the first row of the quantiles df and does not iterate over each year against the data df.
I want to iterate over each year in quantile df and calculate the sum of data exceeding the quantile df in each year.
Any help shall be greatly appreciated.
The example problem: the quantile df is derived from the data df; it is the 90th percentile of the data values exceeding 1:
quantile = quantile(data[-c(which(prcp2[,2] < 1)), x], 0.9)
In addition to the Heroka answer above: if you have 10,000 columns and need to iterate over each of them, you can use matrix notation in this form -
lapply(x, function(y) {
  ddply(data, .(year), function(x){
    return(sum(x[x[,y] > quantile(x[x[,y] > 1, y], 0.9), y]))
  })
})
where x is the vector of column indices, i.e. 1:1000, and data is the df which contains the data.
quantile(x[x[,y] > 1, y], 0.9) gives the 90th percentile of the data values exceeding 1.
x[x[,y] > quantile(x[x[,y] > 1, y], 0.9), y] returns the rows which satisfy the condition for the yth column, and sum() is used to add them up.
Why not do this in one go? Creating the quantiles-dataframe first and then referring back to it makes things more complicated than they need to be. You can do this with ddply too.
library(plyr)
set.seed(1)
data <- data.frame(
  year = sample(1951:2013, 23011, replace = TRUE),
  data = rnorm(23011)
)
res <- ddply(data, .(year), function(x){
  return(sum(x$data[x$data > quantile(x$data, .9)]))
})
And - as plyr seems to be replaced by dplyr -:
library(dplyr)
res2 <- data %>%
  group_by(year) %>%
  summarise(test = sum(data[data > quantile(data, .9)]))
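As a quick sanity check (assuming both snippets were run on the same simulated data), the two routes should agree; ddply names its unnamed result column V1:
all.equal(res$V1, res2$test) # TRUE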

Obtain a data.frame (or list) of X times the original source, after applying some function

I've been having problems with this one for a while.
What I would like is to apply a function to a data.frame that is divided by factors. This data frame has n > 2 columns of values that I need to use for this function.
For the sake of this example, the dataset has a column of 5 factors (a,b,c,d,e) and 2 columns of values (values1, values2). I would like to apply a number of functions that take into account each column of values (auto.arima first and forecast.Arima, in this case). A dataset to play with follows:
library(forecast)
set.seed(2)
dat <- data.frame(factors = letters[1:5], values1 = rnorm(50), values2 = rnorm(50))
For the sake of the exercise, I would like to apply auto.arima to values1 and values2, per factor. My expected output would be something that, per factor, takes into account both columns of values and forecasts both (each as its own univariate time series). So if the dataset has 5 factors and 2 columns of values, I would need 10 lists/data.frames.
Some options that did not work: Splitting the data.frame per factor via:
split(dat, dat$factors)
And then using rapply:
rapply(dat, function(x) forecast.Arima(auto.arima(x)), dat$factors)
Or lapply:
lapply(split(dat,dat$factors), function(x) forecast.Arima(auto.arima(x)))
And some other combinations, all to no avail.
I thought that the easiest solution would involve a function in the apply family, but any solution would be valid.
Is this what you're looking for?
library(reshape2)
m <- melt(dat, id.vars = "factors")
l <- split(m, paste(m$factors, m$variable))
lapply(l, function(x) forecast.Arima(auto.arima(x$value)))
i.e. splitting the data into 10 different frames, then applying the forecast on the values?
The problem with your apply solutions is that you were passing the whole dataframe to the auto.arima function, which takes a vector, so you'd need something like this:
lapply(split(dat, dat$factors), function(df) {
  apply(df[, -1], 2, function(col) forecast.Arima(auto.arima(col)))
})
This splits the dataframe on the factors as before and then applies, over each column (ignoring the first, which is the factor), the auto.arima function wrapped in forecast.Arima. This returns a list of lists (5 factors by 2 values), so it lets you keep values1 and values2 separate.
You can use unlist(x, recursive=FALSE) to flatten this to a list of 10.
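For instance, assuming the list from the snippet above was stored in res:
flat <- unlist(res, recursive = FALSE)
names(flat) # 10 entries, named like "a.values1", "a.values2", ..., "e.values2"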

Efficient way of splitting, applying a function and returning a data.frame with variable vector length

I am currently trying to use plyr + reshape2 to process my data, but it is taking a lot of time.
I have a dataframe (df) with 3 columns: network, user_id and date.
My goal is:
Split df on 2 levels (network and user_id);
apply a function (get_interval) to each split;
bind the results into another dataframe (df2).
get_interval returns a vector of the same length as the number of rows of the input.
Thus, df2 has the same size as df, but with the results computed by get_interval.
The problem is that I cannot use ddply directly, since it only handles vectors of equal length, and the results of the function vary in length.
I came up with this solution:
library(plyr); library(reshape2)
aux <- melt(dlply(df, .(network, user_id), get_interval))
df2 <- cbind(interval = aux$value,
             colsplit(aux$L1, "\\.", names = c("network", "user_id")))
But it is very inefficient, and since df is quite big I waste hours every time I have to run it.
Is there a way of doing this more efficiently?
EDIT
The basic operation of get_interval is as follows:
get_interval <- function(df){
  if (nrow(df) < 2)
    return(NA)
  x <- c(NA, df$date[-1] - df$date[-nrow(df)])
  return(x) ## ceiling won't work because some intervals are 0.
}
It is possible to generate this data artificially with:
n <- 1000000
ref_time <- as.POSIXct("2013-12-17 00:00:00")
interval_range <- 86400 * 10 # 10 days
df <- data.frame(user_id = floor(runif(n, 1, n/10)),
                 network = gl(2, n, labels = c("anet", "unet")),
                 date = as.POSIXct(ref_time - runif(n, 0, interval_range)))
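One route that is usually much faster (a sketch, not from the original post) is to swap plyr/reshape2 for data.table, which computes the grouped intervals in place and skips the melt step entirely. Like get_interval, this assumes rows are already in date order within each group:
library(data.table)
dt <- as.data.table(df)
# c(NA, diff(...)) reproduces get_interval's output: a leading NA, then
# successive differences; length-1 groups simply get the single NA
dt[, interval := c(NA, diff(as.numeric(date))), by = .(network, user_id)]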
