R: count days of exceedance per year

My aim is to count days of exceedance per year for each column of a dataframe. I want to do this with one fixed value for the whole dataframe, as well as with different values for each column. For one fixed value for the whole dataframe, I found a solution using count with aggregate and another solution using the package plyr with ddply and colwise. But I couldn't figure out how to do this with different values for each column.
Approach for one fixed value:
# create example data
date <- seq(as.Date("1961/1/1"), as.Date("1963/12/31"), "days") # create dates
date <- date[(format.Date(as.Date(date), "%m %d") !="02 29")] # delete leap days
TempX <- rep(airquality$Temp, length.out=length(date))
TempY <- rep(rev(airquality$Temp), length.out=length(date))
df <- data.frame(date, TempX, TempY)
# This approach works fine for one specific value using aggregate.
library(plyr)
dyear <- as.numeric(format(df$date, "%Y")) # year vector
fa80 <- function (fT) {cft <- count(fT>=80); return(cft[2,2])}; # function for counting days of exceedance
aggregate(df[,-1], list(year=dyear), fa80) # use aggregate to apply function to dataframe
# Another approach using ddply with colwise, which works fine for one specific value.
fd80 <- function (fT) {cft <- count(fT>=80); cft[2,2]}; # function to count days of exceedance
ddply(cbind(df[,-1], dyear), .(dyear), colwise(fd80)) # use ddply to apply function colwise to dataframe
In order to use specific values for each column separately, I tried passing a second argument to the function, but this didn't work.
# pass second argument to function
Oc <- c(80,85) # values
fo80 <- function (fT,fR) {cft <- count(fT>=fR); return(cft[2,2])}; # function for counting days of exceedance
aggregate(df[,-1], list(year=dyear), fo80, fR=Oc) # use aggregate to apply function to dataframe
I tried using apply.yearly, but it didn't work with count. I want to avoid using a loop, as it is slow and I have a lot of dataframes with > 100 columns and long time series to process.
Furthermore, the approach has to work for subsets of the dataframe as well.
# subset of dataframe
dfmay <- df[(format.Date(as.Date(df$date),"%m")=="05"),] # subset dataframe - only may
dyearmay <- as.numeric(format(dfmay$date, "%Y")) # year vector
aggregate(dfmay[,-1],list(year=dyearmay),fa80) # use aggregate to apply function to dataframe
I am out of ideas on how to solve this problem. Any help will be appreciated.

You could try something like this:
#set the target temperature for each column
targets<-c(80,80)
dyear <- as.numeric(format(df$date, "%Y"))
#for each row of the data, check if the temp is above the target limit
#this will return a matrix of TRUE/FALSE
exceedance<-t(apply(df[,-1],1,function(x){x>=targets}))
#aggregate by year and sum
aggregate(exceedance,list(year=dyear),sum)
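If the data frame is wide, a sketch of an alternative that avoids the row-wise apply() might look like this (my addition, assuming the same targets and dyear as above); sweep() compares each column against its own threshold in one vectorised step:
# compare each column against its own target, column-wise
exceedance2 <- sweep(df[, -1], 2, targets, FUN = ">=")
aggregate(exceedance2, list(year = dyear), sum)
The same two lines should also work on a subset such as dfmay with dyearmay.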

Related

R: How to Count Rows with Subsetted Date in Date Formatted Column

I have about 30,000 rows of data with a Date column in date format. I would like to be able to count the number of rows by month/year and year, but when I aggregate with the below code, I get a vector within the data table for my results instead of a number.
Using the hyperlinked csv file, I have tried the aggregate function.
https://www.dropbox.com/s/a26t1gvbqaznjy0/myfiles.csv?dl=0
short.date <- strftime(myfiles$Date, "%Y/%m")
aggr.stat <- aggregate(myfiles$Date ~ short.date, FUN = count)
Below is a view of the aggr.stat data frame. There are two columns and the second one beginning with "c(" is the one where I'd like to see a count value.
1 1969/01 c(-365, -358, -351, -347, -346)
2 1969/02 c(-323, -320)
3 1969/03 c(-306, -292, -290)
4 1969/04 c(-275, -272, -271, -269, -261, -255)
5 1969/05 c(-245, -240, -231)
6 1969/06 c(-214, -211, -210, -205, -204, -201, -200, -194, -190, -186)
I'm not much into downloading any unknown file from the internet, so you'll have to adapt my proposed solution to your needs.
You can solve the problem with the help of data.table and lubridate.
Imagine your data has at least one column called dates containing actual dates (that is, class(df$dates) returns Date or something similar, such as POSIXct).
# load libraries
library(data.table)
library(lubridate)
# convert df to a data.table
setDT(df)
# count rows per month
df[, .N, by = .(monthDate = floor_date(dates, "month"))]
.N counts the number of rows, by = groups the data. See ?data.table for further details.
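If a plain count per year is also needed (the question mentions both month/year and year), the same idiom should work with lubridate's year(), for example:
# count rows per year instead of per month
df[, .N, by = .(year = year(dates))]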
Consider running everything from the data frame. Specifically, add the needed month/year column to the data frame and then run aggregate with its data argument (instead of passing separate vectors). Finally, there is no count() function in base R; use length instead:
# NEW COLUMN
myfiles$short.date <- strftime(myfiles$Date, "%Y/%m")
# AGGREGATE WITH SPECIFIED DATA
aggr.stat <- aggregate(Date ~ short.date, data = myfiles, FUN = length)

R year quarter for in loop

I am looking to loop over my R data frame, which is at the year-quarter level, and run a rolling regression across every year-quarter. I then use the coefficients from this model to fit values one quarter ahead. Is there a quarterly date format in R that I could use for this?
I had a similar issue with a Stata question (Stata year-quarter for loop), but am revisiting it in R. Does R have a notion of year-quarters that can easily be used in a loop? For example, one possibly roundabout way is
months.list <- c("03","06","09","12")
years.list <- c(1992:2007)
## Loop over the month and years
for (yidx in years.list)
{
  for (midx in months.list)
  {
  }
}
I see the zoo package has some functions, but I am not sure which one I can use for my specific case. Something along the following lines would be ideal:
for (yqidx in 1992Q1:2007Q4){
z <- lm(y ~ x, data = mydata <= yqidx )
}
When I do the look-ahead, I need to handle it so that the predicted value is computed for the next quarter, that is yqidx + 1, so 2000Q4 moves to 2001Q1.
If all you need help with is how to generate quarters:
require(data.table)
require(zoo)
months.list <- c("03","06","09","12")
years.list <- c(1992:2007)
#The next line of code generates all the month-year combinations.
df<-expand.grid(year=years.list,month=months.list)
#Then, we paste together the year and month with a day so that we get dates like "2007-03-01". Pass that to as.Date, and pass the result to as.yearqtr.
df$Date=as.yearqtr(as.Date(paste0(df$year,"-",df$month,"-01")))
df<-df[order(df$Date),]
Then you can use loops if you'd like. I'd personally consider using data.table like so:
require(data.table)
require(zoo)
DT<-data.table(expand.grid(year=years.list,month=months.list))
DT<-DT[order(year,month)]
DT[,Date:=as.yearqtr(as.Date(paste0(year,"-",month,"-01")))]
#Generate fake x values.
DT[,X:=rnorm(64)]
#Generate time index.
DT[,t:=1:64]
#Fake time index.
DT[,Y:=X+rnorm(64)+t]
#Get rid of the year and month columns -unneeded.
DT[,c("year","month"):=NULL]
#Create a second data.table to hold all your models.
Models<-data.table(Date=DT$Date,Index=1:64)
#Generate your (rolling) models. I am assuming you want to use all past observations in each model.
Models[,Model:=list(list(lm(data=DT[1:Index],Y~X+t))),by=Index]
#You can access an individual model thusly:
Models[5,Model]
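The question also asks about fitting one quarter ahead; purely as a sketch (my addition, assuming the DT and Models objects above), each stored model could be used to predict the following quarter:
#Predict next quarter's Y from the model fitted through each quarter.
#The last quarter has no successor, so it gets NA; note that the earliest
#quarters have too few observations for a stable fit.
Models[, PredNextQ := if (Index < nrow(DT))
  predict(Model[[1]], newdata = DT[Index + 1]) else NA_real_,
  by = Index]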

R - Improve speed of do.call / by function

I've gotten fairly good with the *apply family of functions, and I've recently learned to use the do.call("rbind", by(... as a wrapper for tapply. I'm working with a large data set (Compustat) and I have a function (see below) that generates a new column of lagged variables which I later attach to the main data frame df.
My problem is that it is extremely slow. I create about two dozen lagged variables, and the processing in this function takes approximately 1.5 hours because there are 350,000+ firm-year observations in the data set.
Can anyone help improve the speed of this function without losing the aspects that I find desirable:
#' lag vector of unknown size (for do.call-rbind-by: using datadate to track)
lag.vec <- function(x){
  x <- x[order(x$datadate), ] # sort data into ascending order by date
  var <- x[,2] # the specific variable name in data.frame x hereby ignored
  var.name <- paste(names(x)[2], "lag", sep = '.') # keep variable name
  if(length(var)>1){ # no lagging if single observation
    lagged <- c(NA, var[1:(length(var)-1)])
    datelag <- c(x$datadate[1], x$datadate[1:(length(x$datadate) - 1)])
    datediff <- x$datadate - datelag
    y <- data.frame(x$datadate, datediff, lagged) # join lagged variable and difference in YYYYMMDD data
    y$lagged[y$datediff >= 20000 & !is.na(y$datediff)] <- NA # 2 or more full years difference
    y <- y[, c('x.datadate', 'lagged')]
    names(y) <- c("datadate", var.name)
  } else {
    y <- c(x$datadate[1], NA)
    names(y) <- c("datadate", var.name)
  }
  return(y)
}
I then call this function in a command separately for each variable that I want to generate a lagged series for (here I use the ni variable as an example):
ni_lag <- do.call('rbind', by(df[ , c('datadate', 'ni')], df$gvkey, lag.vec))
where gvkey is the ID number for the particular firm and datadate is an 8-digit integer of the form YYYYMMDD.
The approach was much faster when I used a simpler function:
lag.vec.seq <- function(x){ #' lag vector when all data points are present, in order
  if(length(x)>1){
    y <- c(NA, x[1:(length(x)-1)])
  } else {
    y <- NA
  }
  return(y)
}
along with the tapply command in something like
ni_lag <- as.vector(unlist(tapply(df$ni, df$gvkey, lag.vec.seq)))
As you can see the main difference is that the tapply approach doesn't include any datadate information and so the function assumes that all data are sequential (i.e., there are no missing years in the dataframe). Since I know there are missing years, I built the do.call-by function to account for that.
Some notes:
1) The first order command in the function is probably unnecessary since my data is ordered by gvkey and datadate in advance (e.g. df <- df[order(df$gvkey, df$datadate), ]). However, I'm always a bit afraid that R messes up my row ordering when I use functional programming like this. Is that an unfounded fear?
2) Identifying what is slowing down the processing would be very helpful. Is it the renaming of variables? The creation of a new data frame in the function? Or is the do.call with by just typically (much) slower than tapply?
Thank you!
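For what it's worth, here is a minimal sketch (my illustration, not code from the question) of how the same lag-with-gap-check might be expressed per group with data.table's shift(), using ni as the example column:
library(data.table)
dt <- as.data.table(df)
setorder(dt, gvkey, datadate)                            # sort by firm, then date
dt[, ni.lag := shift(ni), by = gvkey]                    # previous ni within each firm
dt[, datediff := datadate - shift(datadate), by = gvkey] # gap to previous observation
dt[datediff >= 20000, ni.lag := NA]                      # blank lags across gaps of 2+ years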

Comparing two dataframes in ddply function

I have two dataframes, Data and Quantiles. Data has dimension 23011 x 2 and consists of the columns "year" and "data", where year covers the sequence of days from 1951 to 2013. The Quantiles df has dimension 63 x 2 and consists of the columns "year" and "quantiles", where year has 63 rows, i.e. 1951:2013.
I need to compare the Quantiles df against the Data df and compute, for each year, the sum of data values exceeding the quantile value. For that, I'm using ddply in this manner:
ddply(data, .(year), function(y) sum(y[which(y[,2] > quantile[,2]),2]) )
However, the code compares only against the first row of Quantiles and does not iterate over each year of the Data df.
I want to iterate over each year in the Quantiles df and calculate the sum of data exceeding the quantile value in each year.
Any help shall be greatly appreciated.
The example problem -
quantile df is here
and Data is pasted here
The Quantiles df is derived from the data: it is the 90th percentile of the data df values exceeding 1.
quantile = quantile(data[-c(which(prcp2[,2] < 1)),x],0.9)
In addition to the Heroka answer below: if you have 10,000 columns and need to iterate over each of the columns, you can use matrix notation in this form:
lapply(x, function(y) {ddply(data,.(year), function(x){ return(sum(x[x[,y] > quantile(x[x[,y]>1,y],0.9),y]))})})
where x is the vector of column indices, i.e. 1:1000, and data is the df which contains the data.
quantile(x[x[,y]>1,y],0.9) gives the 90th percentile of the data values exceeding 1.
x[x[,y] > quantile(x[x[,y]>1,y],0.9),y] returns the rows which satisfy the condition for the yth column, and the sum function is used to calculate the sum.
Why not do this in one go? Creating the quantiles-dataframe first and then referring back to it makes things more complicated than they need to be. You can do this with ddply too.
set.seed(1)
data <- data.frame(
  year = sample(1951:2013, 23011, replace = T),
  data = rnorm(23011)
)
res <- ddply(data, .(year), function(x){
  return(sum(x$data[x$data > quantile(x$data, .9)]))
})
And, as plyr seems to be superseded by dplyr:
library(dplyr)
res2 <- data %>% group_by(year) %>% summarise(
  test = sum(data[data > quantile(data, .9)])
)
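If, as in the question, the 90th percentile should be taken only over values exceeding 1, a variant of the same ddply call might look like this (my adaptation, not part of the original answer):
res3 <- ddply(data, .(year), function(x){
  q90 <- quantile(x$data[x$data > 1], .9)  # threshold from values > 1 only
  sum(x$data[x$data > q90])                # yearly sum of exceedances
})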

Efficient way of splitting, applying a function and returning a data.frame with variable vector length

I am currently trying to use plyr + reshape2 to process my data, but it is taking a lot of time.
I have a dataframe (df) with 3 columns: network, user_id and date.
My goal is:
To split df on 2 levels (network and user_id);
apply a function (get_interval) to each split;
bind the results into another dataframe (df2).
get_interval returns a vector of the same length as the number of rows of the input.
Thus, df2 has the same size of df, but with the results computed by get_interval.
The problem is that I cannot use ddply directly, since it only handles vectors of equal length and the results of the function have varying lengths.
I came up with this solution:
aux <- melt(dlply(df,.(network,user_id), get_interval))
df2 <- cbind(interval=aux$value,colsplit(aux$L1,"\\.",names=c("network","user_id")))
But it is very inefficient, and since df is quite big I waste hours every time I have to run it.
Is there a way of doing this more efficiently?
EDIT
The basic operation of get_interval is as follows:
get_interval <- function(df){
  if(nrow(df) < 2)
    return(NA)
  x <- c(NA, df$date[-1] - df$date[-nrow(df)])
  return(x) ## ceiling won't work because some intervals are 0.
}
It is possible to generate this data artificially with:
n <- 1000000
ref_time <- as.POSIXct("2013-12-17 00:00:00")
interval_range <- 86400*10 # 10 days
df <- data.frame(user_id = floor(runif(n, 1, n/10)),
                 network = gl(2, n/2, labels = c("anet","unet")), # length n, to match the other columns
                 date = as.POSIXct(ref_time - runif(n, 0, interval_range)))
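As a sketch of one way to avoid the melt/dlply round trip (my illustration, assuming the simulated df above with a date column and data.table; not a tested benchmark), the intervals can be computed per group in place:
library(data.table)
dt <- as.data.table(df)
setorder(dt, network, user_id, date)
# interval (in seconds) to the previous event within each (network, user_id); NA for the first row
dt[, interval := c(NA, diff(as.numeric(date))), by = .(network, user_id)]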
