Comparing two dataframes in ddply function


I have two data frames, data and quantile. data has dimensions 23011 x 2 with columns "year" and "data", with one row per day from 1951 to 2013. The quantile df has dimensions 63 x 2 with columns "year" and "quantiles", with one row per year from 1951 to 2013.
I need to compare the quantile df against the data df and, for each year, sum the data values that exceed that year's quantile value. For that, I'm using ddply in this manner:
ddply(data, .(year), function(y) sum(y[which(y[,2] > quantile[,2]),2]) )
However, the code compares only against the first row of quantile and does not match each year's quantile to that year's data.
I want to iterate over each year in the quantile df and calculate the sum of the data values exceeding that year's quantile.
Any help shall be greatly appreciated.
The example problem -
quantile df is here
and Data is pasted here
The quantile df is derived from the data: it holds the 90th percentile of the data values exceeding 1.
quantile = quantile(data[-c(which(prcp2[,2] < 1)),x],0.9)})

In addition to Heroka's answer: if you have 10,000 columns and need to iterate over each of them, you can index the columns by position, like this:
lapply(x, function(y) {ddply(data,.(year), function(x){ return(sum(x[x[,y] > quantile(x[x[,y]>1,y],0.9),y]))})})
where x is the vector of column indices, e.g. 1:1000, and data is the df which contains the data. Note that inside the inner function, x refers to the per-year subset of data, not to the column indices.
quantile(x[x[,y]>1,y],0.9) gives the 90th percentile of the values in column y that exceed 1.
x[x[,y] > quantile(x[x[,y]>1,y],0.9),y] returns the values in column y that exceed that percentile, and sum adds them up.
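
As a possible alternative for many columns, base R's aggregate() applies a function to every non-grouping column per group, so no explicit loop over column indices is needed. This is only a sketch: the data frame wide, its columns s1 and s2, and the helper sum_above_q90 are hypothetical names made up for illustration.

```r
# Toy data (hypothetical): a year column plus two value columns
set.seed(2)
wide <- data.frame(
  year = rep(2001:2002, each = 50),
  s1 = rexp(100, rate = 0.5),
  s2 = rexp(100, rate = 0.5)
)

# Sum of the values above the 90th percentile of the values exceeding 1.
# Returns NA if a group has no value above 1.
sum_above_q90 <- function(v) {
  thr <- quantile(v[v > 1], 0.9)
  sum(v[v > thr])
}

# aggregate() applies the function to each value column, separately per year
res <- aggregate(wide[, -1], by = list(year = wide$year), FUN = sum_above_q90)
print(res)
```

The result has one row per year and one column per original value column, which may be easier to work with than a list of per-column data frames.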

Why not do this in one go? Creating the quantiles-dataframe first and then referring back to it makes things more complicated than they need to be. You can do this with ddply too.
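library(plyr)
set.seed(1)
# simulated data: one value per "day", with a random year attached
data <- data.frame(
year=sample(1951:2013,23011,replace=TRUE),
data=rnorm(23011)
)
# per year: sum of the values above that year's 90th percentile
res <- ddply(data,.(year), function(x){
return(sum(x$data[x$data>quantile(x$data,.9)]))
})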
And, as plyr seems to have been superseded by dplyr:
library(dplyr)
res2 <- data %>% group_by(year) %>% summarise(
test=sum(data[data>quantile(data,.9)])
)
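
If the separate quantile data frame has to be kept, one way to line each year's quantile up with that year's data is to merge the two frames on year first. The sketch below uses base R only and made-up toy data; the names obs, quant, merged and exceed are hypothetical, and the quantile definition (90th percentile of values exceeding 1) is taken from the question.

```r
# Toy stand-ins (hypothetical) for the question's two data frames
set.seed(1)
obs <- data.frame(year = rep(1951:1953, each = 100),
                  data = rnorm(300, mean = 1))

# Per-year 90th percentile of the values exceeding 1
quant <- aggregate(data ~ year, data = obs[obs$data > 1, ],
                   FUN = quantile, probs = 0.9)
names(quant)[2] <- "quantiles"

# Attach each year's quantile to every row of that year
merged <- merge(obs, quant, by = "year")

# Keep the rows above their own year's quantile, then sum per year
exceed <- merged[merged$data > merged$quantiles, ]
res <- aggregate(data ~ year, data = exceed, FUN = sum)
print(res)
```

Comparing against quantile[,2] directly, as in the question, compares each year's data with the whole quantile column; the merge makes the comparison row by row within a year.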

Related

Extract a new vector of multiple mean values from a data frame

I have a large data frame with multiple columns.
Two of my columns look like this:
day_of_year <- c(123,312,23,123,322,1,23,321,124,192, ...)
group <- c(1,1,1,1,3,3,3,2,2,2, ...)
I want to create a new vector with mean values of "day_of_year" for each group separated. Meaning my output vector should contain as many (mean) values as different groups in "group". Please note that some Groups have more values than others!
I hope you can help me with this one!

That's a case for tapply
day_of_year <- c(123,312,23,123,322,1,23,321,124,192)
group <- c(1,1,1,1,3,3,3,2,2,2)
tapply(day_of_year, group, mean)
# 1 2 3
#145.2500 212.3333 115.3333
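
tapply returns a named array. If a plain vector or a data frame is wanted instead, something along these lines may help; the names means and means_df are only illustrative.

```r
day_of_year <- c(123, 312, 23, 123, 322, 1, 23, 321, 124, 192)
group <- c(1, 1, 1, 1, 3, 3, 3, 2, 2, 2)

# Plain numeric vector, ordered by sorted group label (1, 2, 3)
means <- as.vector(tapply(day_of_year, group, mean))

# The same means as a two-column data frame (columns: group, x)
means_df <- aggregate(day_of_year, by = list(group = group), FUN = mean)
print(means_df)
```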

It would be really helpful if you could post the end result you are looking for. As I understand it, if you want a mean value per group, the following would work (it requires the dplyr package):
install.packages('dplyr')
library('dplyr')
New.Data.Frame <- Your.Data.Frame %>%
group_by(group) %>%
summarise(Mean_Day_of_Year = mean(day_of_year))

Applying a function to increasingly larger subsets of a data frame

I want to apply a statistical function to increasingly larger subsets of a data frame, starting at row 1 and incrementing by, say, 10 rows each time. So the first subset is rows 1-10, the second rows 1-20, and the final subset is rows 1-nrows. Can this be done without a for loop? And if so, how?

Here is one solution:
# some sample data
df <- data.frame(x = sample(1:105, 105))
# endpoints of the subsets: 10, 20, ..., plus the last row
# (unique() avoids a duplicate when nrow(df) is a multiple of 10)
row_seq <- unique(c(seq(10, nrow(df), 10), nrow(df)))
# subsets of df from row 1 to each endpoint
# (df has a single column, so each subset is a plain vector)
data.subsets <- lapply(row_seq, function(x) df[1:x, ])
# applying the mean function to each subset
# just replace mean with whatever function you want to use
lapply(data.subsets, mean)
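
If only the statistic is needed and not the subsets themselves, a slightly more compact variant is to compute it directly for each endpoint with sapply. This is a sketch on the same kind of sample data; ends and stats are illustrative names.

```r
set.seed(1)
df <- data.frame(x = sample(1:105, 105))

# Endpoints 10, 20, ..., 100 plus the final row (105)
ends <- unique(c(seq(10, nrow(df), by = 10), nrow(df)))

# Statistic over rows 1..n for each endpoint n; swap mean for any function
stats <- sapply(ends, function(n) mean(df$x[seq_len(n)]))
names(stats) <- ends
print(stats)
```

The last element covers all rows, so it equals mean(df$x).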

R count days of exceedance per year

My aim is to count days of exceedance per year for each column of a dataframe. I want to do this with one fixed value for the whole dataframe, as well as with different values for each column. For one fixed value for the whole dataframe, I found a solution using count with aggregate and another solution using the package plyr with ddply and colwise. But I couldn't figure out how to do this with different values for each column.
Approach for one fixed value:
# create example data
date <- seq(as.Date("1961/1/1"), as.Date("1963/12/31"), "days") # create dates
date <- date[(format.Date(as.Date(date), "%m %d") !="02 29")] # delete leap days
TempX <- rep(airquality$Temp, length.out=length(date))
TempY <- rep(rev(airquality$Temp), length.out=length(date))
df <- data.frame(date, TempX, TempY)
# This approach works fine for one specific value using aggregate.
library(plyr)
dyear <- as.numeric(format(df$date, "%Y")) # year vector
fa80 <- function (fT) {cft <- count(fT>=80); return(cft[2,2])}; # function for counting days of exceedance
aggregate(df[,-1], list(year=dyear), fa80) # use aggregate to apply function to dataframe
# Another approach using ddply with colwise, which works fine for one specific value.
fd80 <- function (fT) {cft <- count(fT>=80); cft[2,2]}; # function to count days of exceedance
ddply(cbind(df[,-1], dyear), .(dyear), colwise(fd80)) # use ddply to apply function colwise to dataframe
In order to use a specific value for each column separately, I tried passing a second argument to the function, but this didn't work.
# pass second argument to function
Oc <- c(80,85) # values
fo80 <- function (fT,fR) {cft <- count(fT>=fR); return(cft[2,2])}; # function for counting days of exceedance
aggregate(df[,-1], list(year=dyear), fo80, fR=Oc) # use aggregate to apply function to dataframe
I tried using apply.yearly, but it didn't work with count. I want to avoid using a loop, as it is slow and I have a lot of dataframes with > 100 columns and long time series to process.
Furthermore the approach has to work for subsets of the dataframe as well.
# subset of dataframe
dfmay <- df[(format.Date(as.Date(df$date),"%m")=="05"),] # subset dataframe - only may
dyearmay <- as.numeric(format(dfmay$date, "%Y")) # year vector
aggregate(dfmay[,-1],list(year=dyearmay),fa80) # use aggregate to apply function to dataframe
I am out of ideas on how to solve this problem. Any help will be appreciated.

You could try something like this:
#set the target temperature for each column
targets<-c(80,80)
dyear <- as.numeric(format(df$date, "%Y"))
#for each row of the data, check if the temp is above the target limit
#this will return a matrix of TRUE/FALSE
exceedance<-t(apply(df[,-1],1,function(x){x>=targets}))
#aggregate by year and sum
aggregate(exceedance,list(year=dyear),sum)
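
A column-wise variant of the same idea, which avoids applying a function to every row, is to compare each column with its own target using sweep() and then add up the results per year with rowsum(). This is a sketch on a tiny made-up data set; vals, yr and the target values are hypothetical.

```r
# Tiny toy data (hypothetical): two temperature columns and a year vector
vals <- data.frame(TempX = c(79, 80, 81, 90),
                   TempY = c(84, 85, 86, 70))
yr <- c(1961, 1961, 1962, 1962)

# One threshold per column, in column order
targets <- c(TempX = 80, TempY = 85)

# Logical matrix: TRUE where a value is >= the target of its column
exceed <- sweep(as.matrix(vals), 2, targets, FUN = ">=")

# Count the TRUE values per year and column
counts <- rowsum(exceed + 0L, group = yr)
print(counts)
```

The same two calls work on a subset (for example only the May rows), as long as the year vector is subset in the same way.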

How can I cumulatively apply a custom function to a vector in R? In an efficient and idiomatic way?

I know the function cumsum in R, which computes a cumulative sum of its vector argument.
I need to "cumulatively apply" not the sum function but a generic function; in my specific case, the quantile function.
My current solution is based on a loop:
set.seed(42)
df<-data.frame(measurement=rnorm(1000),upper=0,lower=0)
for ( r in seq(1,nrow(df))){
df$upper[r]<-quantile(df[seq(1,r),"measurement"],c(.99))
df$lower[r]<-quantile(df[seq(1,r),"measurement"],c(.01))
}
x=seq(1,nrow(df))
plot(df$measurement,type="l",col="grey")
lines(x,df$upper,col="red")
lines(x,df$lower,col="blue")
It works but it is not efficient and I feel there should be a more idiomatic way of doing it in R.

You can use this approach:
set.seed(42)
df <- data.frame(measurement = rnorm(1000))
res <- sapply(seq(nrow(df)), function(x)
quantile(df[seq(x), "measurement"], c(.01, .99)))
It creates a matrix with nrow(df) columns and 2 rows: one row for the 1st percentile and one row for the 99th percentile.
You can add this information to your data frame df (as two columns):
df <- setNames(cbind(df, t(res)), c(names(df), "lower", "upper"))
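
A closely related sketch uses vapply, which checks that every call returns exactly two numbers, and builds the data frame in one step. It still recomputes each quantile from scratch, so it is tidier than the loop rather than asymptotically faster. The vector x and the data frame out are illustrative names, and the series is shortened to 200 values.

```r
set.seed(42)
x <- rnorm(200)

# Column i holds the 1% and 99% quantiles of x[1..i]
res <- vapply(seq_along(x),
              function(i) quantile(x[seq_len(i)], c(0.01, 0.99)),
              numeric(2))

out <- data.frame(measurement = x, lower = res[1, ], upper = res[2, ])
head(out)
```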

Creating multiple subsets all in one data.frame (possibly with ddply)

I have a large data.frame, and I'd like to be able to reduce it by using a quantile subset by one of the variables. For example:
x <- c(1:10,1:10,1:10,1:10,1:10,1:10,1:10,1:10,1:10,1:10)
df <- data.frame(x,rnorm(100))
df2 <- subset(df, df$x == 1)
df3 <- subset(df2, df2[2] > quantile(df2$rnorm.100.,0.8))
What I would like to end up with is a data.frame that contains all quantiles for x=1,2,3...10.
Is there a way to do this with ddply?

You could try:
ddply(df, .(x), subset, rnorm.100. > quantile(rnorm.100., 0.8))
And off topic: you could use df <- data.frame(x,y=rnorm(100)) to name a column on-the-fly.

Here's a different approach with the little used ave() command. (very fast to calculate this way)
Make a new column that contains the quantile calculation across each level of x
df$quantByX <- ave(df$rnorm.100., df$x, FUN = function (x) quantile(x,0.8))
Select the items of the new column and the x column.
df2 <- unique(df[,c(1,3)])
The result is one data frame with the unique items in the x column and the calculated quantile for each level of x.
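
The ave() idea can also be used to get the subset asked for in the question (the rows above their group's 80th percentile) without plyr. This is a sketch with a named column y instead of rnorm.100.; thr and top are illustrative names.

```r
set.seed(1)
df <- data.frame(x = rep(1:10, times = 10), y = rnorm(100))

# Per-row threshold: the 80th percentile of y within that row's level of x
thr <- ave(df$y, df$x, FUN = function(v) quantile(v, 0.8))

# Rows whose y exceeds the threshold of their own group
top <- df[df$y > thr, ]
table(top$x)
```

With 10 distinct values per group and the default quantile type, the threshold falls between the 8th and 9th ordered values, so two rows per group are kept.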
