Year-quarter for loop in R

I want to loop over an R data frame that is indexed by year-quarter and run a rolling regression at every year-quarter. I then use the coefficients from each model to fit values one quarter ahead. Is there a quarterly date format in R that I can use for this?
I had a similar issue in Stata (see my earlier question, "Stata year-quarter for loop") and am now revisiting it in R. Does R have a notion of year-quarters that can easily be used in a loop? For example, one possibly roundabout way is:
months.list <- c("03","06","09","12")
years.list <- c(1992:2007)
## Loop over the month and years
for(yidx in years.list)
{
for(midx in months.list)
{
}
}
I see that the zoo package has some functions, but I am not sure which one fits my case. Something along the following lines would be ideal:
for (yqidx in 1992Q1:2007Q4){
z <- lm(y ~ x, data = mydata <= yqidx )
}
When I do the look-ahead, I need to handle it so that the predicted value is computed for the next quarter, that is yqidx + 1, so that 2000Q4 rolls over to 2001Q1.

If all you need help on is how to generate quarters,
require(data.table)
require(zoo)
months.list <- c("03","06","09","12")
years.list <- c(1992:2007)
#The next line of code generates all the month-year combinations.
df<-expand.grid(year=years.list,month=months.list)
#Then, we paste together the year and month with a day so that we get dates like "2007-03-01". Pass that to as.Date, and pass the result to as.yearqtr.
df$Date=as.yearqtr(as.Date(paste0(df$year,"-",df$month,"-01")))
df<-df[order(df$Date),]
Then you can use loops if you'd like. I'd personally consider using data.table like so:
require(data.table)
require(zoo)
DT<-data.table(expand.grid(year=years.list,month=months.list))
DT<-DT[order(year,month)]
DT[,Date:=as.yearqtr(as.Date(paste0(year,"-",month,"-01")))]
#Generate fake x values.
DT[,X:=rnorm(64)]
#Generate time index.
DT[,t:=1:64]
#Generate fake Y values that depend on X and the time index.
DT[,Y:=X+rnorm(64)+t]
#Get rid of the year and month columns -unneeded.
DT[,c("year","month"):=NULL]
#Create a second data.table to hold all your models.
Models<-data.table(Date=DT$Date,Index=1:64)
#Generate your (rolling) models. I am assuming you want to use all past observations in each model.
Models[,Model:=list(list(lm(data=DT[1:Index],Y~X+t))),by=Index]
#You can access an individual model thusly:
Models[5,Model]
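For the look-ahead part of the question, a minimal sketch follows. It assumes a hypothetical data frame mydata with a yearqtr column yq and made-up x and y values, and it relies on the fact that a zoo yearqtr value is stored as year plus a fraction, so adding 0.25 moves one quarter forward (2000 Q4 becomes 2001 Q1). The names mydata, quarters, and pred are illustrative only, and the 1993 Q1 start is an arbitrary choice so that the first model has a few observations.

```r
library(zoo)

set.seed(1)
# Hypothetical quarterly data, 1992 Q1 to 2007 Q4 (64 quarters).
mydata <- data.frame(x = rnorm(64))
mydata$y <- 2 * mydata$x + rnorm(64)
mydata$yq <- as.yearqtr(seq(1992, 2007.75, by = 0.25))

# Quarters to estimate on: start at 1993 Q1, stop one quarter before the end
# so that a "next quarter" always exists in the data.
quarters <- mydata$yq[mydata$yq >= as.yearqtr("1993 Q1") & mydata$yq < max(mydata$yq)]

pred <- data.frame(fitted = rep(NA_real_, length(quarters)))
pred$yq <- quarters + 0.25          # the quarter each prediction refers to

# Loop by position: for (q in quarters) would drop the yearqtr class.
for (i in seq_along(quarters)) {
  yqidx <- quarters[i]
  fit <- lm(y ~ x, data = mydata[mydata$yq <= yqidx, ])   # all data up to yqidx
  nxt <- mydata[mydata$yq == yqidx + 0.25, ]              # the following quarter
  pred$fitted[i] <- predict(fit, newdata = nxt)
}
```

Each row of pred then holds the out-of-sample fitted value for the quarter after the one the model was estimated through.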

Related

Creating time series in R

I have a CSV file containing data as follows-
date, group, integer_value
The date starts from 01-January-2013 to 31-October-2015 for the 20 groups contained in the data.
I want to create a time series for the 20 different groups. But the dates are not continuous and have sporadic gaps in it, hence-
group4series <- ts(group4, frequency = 365.25, start = c(2013,1,1))
works from a programming point of view but is not correct because of the gaps in the data.
How can I use the 'date' column of the data to create the time series instead of the usual 'frequency' parameter of 'ts()' function?
Thanks!
You could use zoo::zoo instead of ts.
Since you don't provide sample data, let's generate daily data, and remove some days to introduce "gaps".
set.seed(2018)
dates <- seq(as.Date("2015/12/01"), as.Date("2016/07/01"), by = "1 day")
dates <- dates[sample(length(dates), 100)]
We construct a sample data.frame
df <- data.frame(
dates = dates,
val = cumsum(runif(length(dates))))
To turn df into a zoo timeseries, you can do the following
library(zoo)
ts <- with(df, zoo(val, dates))
Let's plot the timeseries
plot.zoo(ts)
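Since the question involves 20 groups, the same idea can be applied per group. The sketch below uses a made-up stand-in for the CSV (the column names date, group, and integer_value come from the question; the values and the two group labels are invented) and builds one zoo series per group. The last step, merging with an empty zoo series that carries a full daily index, is a common zoo idiom for making the gaps explicit as NA.

```r
library(zoo)

# Hypothetical stand-in for the CSV: 40 irregular dates, two groups.
set.seed(2018)
all_days <- seq(as.Date("2013-01-01"), as.Date("2013-03-31"), by = "day")
dat <- data.frame(
  date  = rep(sort(sample(all_days, 40)), times = 2),
  group = rep(c("group1", "group2"), each = 40)
)
dat$integer_value <- rpois(nrow(dat), lambda = 10)

# One zoo series per group, indexed by the dates actually observed.
series <- lapply(split(dat, dat$group),
                 function(g) zoo(g$integer_value, g$date))

# Optional: place one group on a regular daily grid, NA where a day is missing.
full <- merge(series$group1, zoo(, all_days))
```

Whether to keep the irregular index or fill the grid depends on what the downstream method expects.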

R - More elegant way of writing line of code

OK, so let's take the code below, which calculates a simple moving average over a rolling 2-day period:
# Use TTR package to create rolling SMA n day moving average
new.df$close.sma.n2 <- SMA(new.df[,"Close"], 2)
Let's say I want to calculate the SMA for every n-day period from 2 to 30.
The inputs here are:
close.sma.n**
and also the numerical value for the SMA calculation.
So my question is:
How can I write one line of code to perform the above calculation for different SMA periods, while also creating a new column (close.sma.n2, close.sma.n3, close.sma.n4, and so on) for each period in the data frame?
I understand I can do:
n.sma <- 2:30
and put that variable in:
new.df$close.sma.n2 <- SMA(new.df[,"Close"], n.sma)
Perhaps I can:
name <- "n2:30"
and place that inside:
new.df$close.sma.name <- SMA(new.df[,"Close"], n.sma)
You didn't provide sample data or SMA, so I made up dummy functions to test my code.
df <- data.frame(Close=c(1, 2, 3, 4))
SMA <- function(x, numdays) {numdays}
Then, I wrote a function that takes in the number of days to average, and returns a function that takes a data.frame and takes the SMA over that many days.
getSMA <- function(numdays) {
function(new.df) {
SMA(new.df[,"Close"], numdays)
}
}
Then, create a matrix to put the SMAs in
smas <- matrix(nrow=nrow(df), ncol=0)
and fill it.
for (i in 2:30) {
smas <- cbind(smas, getSMA(i)(df))
}
Then, set the column names to what you want them to be
colnames(smas) <- sapply(2:30, function(n)paste("close.sma.n", n, sep=""))
and bind it with the starting data frame.
df <- cbind(df, smas)
Now you have your original data.frame, plus "close.sma.n2" through ".n30"
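A more compact variant of the same idea uses sapply to build all columns at once. This is only a sketch: the SMA defined here is a base-R stand-in (a trailing mean via stats::filter) so the example runs without TTR, and the Close values are made up. With TTR loaded, the stand-in definition would simply be dropped.

```r
# Stand-in for TTR::SMA: trailing mean over n values, NA until n are available.
SMA <- function(x, n) as.numeric(stats::filter(x, rep(1 / n, n), sides = 1))

new.df  <- data.frame(Close = c(10, 11, 12, 13, 14, 15))  # made-up prices
periods <- 2:4                                            # use 2:30 on real data

# One column per period, named close.sma.n2, close.sma.n3, ...
smas <- sapply(periods, function(n) SMA(new.df[, "Close"], n))
colnames(smas) <- paste0("close.sma.n", periods)
new.df <- cbind(new.df, smas)
```

The period vector drives both the calculation and the column names, so changing periods is the only edit needed.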

Monthly average from netCDF files in R

I have one netCDF file (.nc) with 16 years(1998 - 2014) worth of daily precipitation (5844 layers). The 3 dimensions are time (size 5844), latitude (size 19) and longitude (size 20)
Is there a straightforward approach in R to compute for each rastercell:
Monthly & yearly average
A cumulative comparison (e.g. jan-mar compared to the average of all jan-mar)
So far I have:
library(ncdf4)
library(raster)
Rname <- 'F:/extracted_rain.nc'
rainfall <- nc_open(Rname)
readRainfall <- ncvar_get(rainfall, "rain") #"rain" is float name
raster_rainfall <- raster(Rname, varname = "rain") # also tried brick()
asdatadates <- as.Date(rainfall$dim$time$vals/24, origin='1998-01-01') #The time interval is per 24 hours
My first challenge will be the computation of monthly averages for each raster cell. I'm not sure how best to proceed while keeping the ultimate goal (cumulative comparison) in mind. How can I easily access only days from a certain month?
raster(readRainfall[,,500]) # doesn't seem like a straightforward approach
Hopefully I made my question clear, a first push in the right direction would be much appreciated.
Sample data here
The question asked for a solution in R, but in case anyone wants a simple command-line alternative, this kind of statistic is the bread and butter of CDO.
Monthly averages:
cdo monmean in.nc monmean.nc
Annual averages:
cdo yearmean in.nc yearmean.nc
Make the average of all the Jan, Feb etc:
cdo ymonmean in.nc ymonmean.nc
The monthly anomaly relative to the long term annual cycle:
cdo sub monmean.nc ymonmean.nc monanom.nc
If you then want a specific month, just select it with selmon or seldate.
You can call these commands from R using the system command.
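As a rough sketch of calling CDO from R, the snippet below uses base R's system2. It assumes a cdo binary is on the PATH and that in.nc (the placeholder file name used above) exists in the working directory; if either is missing it does nothing.

```r
# Arguments for: cdo monmean in.nc monmean.nc
cdo_args <- c("monmean", "in.nc", "monmean.nc")

# Only run when cdo is installed and the input file is present.
if (nzchar(Sys.which("cdo")) && file.exists("in.nc")) {
  status <- system2("cdo", args = cdo_args)
  if (status != 0) warning("cdo exited with status ", status)
}
```

The resulting monmean.nc can then be read back into R in the same way as the original file.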
Here is one approach using the zoo-package:
### first read the data
library(ncdf4)
library(raster)
library(zoo)
### use stack() instead of raster
stack_rainfall <- stack(Rname, varname = "rain")
### i renamed your "asdatadates" object for simplicity
dates <- as.Date(rainfall$dim$time$vals/24, origin='1998-01-01')
In your example dataset you only have 18 layers, all coming from January 1998. However, the following should also work with more layers (months).
First, we will build a function that operates on one vector of values (i.e. a pixel time series), converts the input to a zoo object using dates, and then calculates the monthly mean using aggregate. The function returns a vector with a length equal to the number of months in dates.
monthly_mean_stack <- function(x) {
require(zoo)
pixel.ts <- zoo(x, dates)
out <- as.numeric(aggregate(pixel.ts, as.yearmon, mean, na.rm=TRUE))
out[is.nan(out)] <- NA
return(out)
}
Then, depending on whether you want the output to be a vector / matrix / data frame or want to stay in the raster format, you can either apply the function over the cell values after retrieving them with getValues, or use the calc function from the raster package to create a raster output (this will be a raster stack with as many layers as there are months in your data).
v <- getValues(stack_rainfall) # every row displays one pixel (-time series)
# this should give you a matrix with ncol = number of months and nrow = number of pixel
means_matrix <- t(apply(v, 1, monthly_mean_stack))
means_stack <- calc(stack_rainfall, monthly_mean_stack)
When you're working with large raster datasets you can also apply your functions in parallel using the clusterR function. See ?clusterR
I think it is easiest to convert to a raster brick and then into a data.frame.
You can then pull statistics quite easily with general code such as DF$weeklymean <- rowMeans(DF[, ])
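To make the data.frame route concrete, here is a small base-R sketch. It assumes a hypothetical DF with one row per raster cell and one column per day (roughly what you would get from the cell values of a brick), plus a matching dates vector; the values are random. Columns are grouped by month and rowMeans is applied within each group.

```r
# Hypothetical cell-by-day table: 4 cells, daily values for Jan to Mar 1998.
set.seed(42)
dates <- seq(as.Date("1998-01-01"), as.Date("1998-03-31"), by = "day")
DF <- as.data.frame(matrix(runif(4 * length(dates)), nrow = 4))

# Group the column positions by calendar month, then average within each group.
month_id <- format(dates, "%Y-%m")
monthly <- sapply(split(seq_along(dates), month_id),
                  function(cols) rowMeans(DF[, cols, drop = FALSE], na.rm = TRUE))
```

monthly has one row per cell and one column per month. Using format(dates, "%m") as the grouping key instead would pool the same calendar month across years, and format(dates, "%Y") would give yearly means.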

R count days of exceedance per year

My aim is to count days of exceedance per year for each column of a dataframe. I want to do this with one fixed value for the whole dataframe, as well as with different values for each column. For one fixed value for the whole dataframe, I found a solution using count with aggregate and another solution using the package plyr with ddply and colwise. But I couldn't figure out how to do this with different values for each column.
Approach for one fixed value:
# create example data
date <- seq(as.Date("1961/1/1"), as.Date("1963/12/31"), "days") # create dates
date <- date[(format.Date(as.Date(date), "%m %d") !="02 29")] # delete leap days
TempX <- rep(airquality$Temp, length.out=length(date))
TempY <- rep(rev(airquality$Temp), length.out=length(date))
df <- data.frame(date, TempX, TempY)
# This approach works fine for specific values using aggregate.
library(plyr)
dyear <- as.numeric(format(df$date, "%Y")) # year vector
fa80 <- function (fT) {cft <- count(fT>=80); return(cft[2,2])}; # function for counting days of exceedance
aggregate(df[,-1], list(year=dyear), fa80) # use aggregate to apply function to dataframe
# Another approach using ddply with colwise, which works fine for one specific value.
fd80 <- function (fT) {cft <- count(fT>=80); cft[2,2]}; # function to count days of exceedance
ddply(cbind(df[,-1], dyear), .(dyear), colwise(fd80)) # use ddply to apply function colwise to dataframe
In order to use a specific value for each column separately, I tried passing a second argument to the function, but this didn't work.
# pass second argument to function
Oc <- c(80,85) # values
fo80 <- function (fT,fR) {cft <- count(fT>=fR); return(cft[2,2])}; # function for counting days of exceedance
aggregate(df[,-1], list(year=dyear), fo80, fR=Oc) # use aggregate to apply function to dataframe
I tried using apply.yearly, but it didn't work with count. I want to avoid using a loop, as it is slow and I have a lot of dataframes with > 100 columns and long timeseries to process.
Furthermore the approach has to work for subsets of the dataframe as well.
# subset of dataframe
dfmay <- df[(format.Date(as.Date(df$date),"%m")=="05"),] # subset dataframe - only may
dyearmay <- as.numeric(format(dfmay$date, "%Y")) # year vector
aggregate(dfmay[,-1],list(year=dyearmay),fa80) # use aggregate to apply function to dataframe
I am out of ideas on how to solve this problem. Any help will be appreciated.
You could try something like this:
#set the target temperature for each column
targets<-c(80,80)
dyear <- as.numeric(format(df$date, "%Y"))
#for each row of the data, check if the temp is above the target limit
#this will return a matrix of TRUE/FALSE
exceedance<-t(apply(df[,-1],1,function(x){x>=targets}))
#aggregate by year and sum
aggregate(exceedance,list(year=dyear),sum)
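For thresholds that really differ by column (80 and 85, as in the question), a vectorised sketch with base R's sweep avoids the row-wise apply. The data are rebuilt exactly as in the question; the named targets vector is an illustrative choice so that each threshold is matched to its column by name.

```r
# Example data as in the question.
date <- seq(as.Date("1961-01-01"), as.Date("1963-12-31"), by = "day")
date <- date[format(date, "%m %d") != "02 29"]
df <- data.frame(date,
                 TempX = rep(airquality$Temp, length.out = length(date)),
                 TempY = rep(rev(airquality$Temp), length.out = length(date)))

targets <- c(TempX = 80, TempY = 85)          # one threshold per column
dyear   <- as.numeric(format(df$date, "%Y"))  # year vector

# Compare every column with its own threshold (logical matrix),
# then count the TRUE values per year.
exceedance <- sweep(as.matrix(df[, names(targets)]), 2, targets, FUN = ">=")
counts <- aggregate(as.data.frame(exceedance), list(year = dyear), sum)
```

The same two lines work on a subset such as dfmay, as long as dyear is recomputed from the subset's dates.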

Extract data from a by-timeseries object

Let's start from the end: the R output will be read in Tableau to create a dashboard, and therefore I need the R output to look a certain way. With that in mind, I'm starting with a data frame in R with n groups of time series. I want to run auto.arima (or another forecasting method from package forecast) on each by group. I'm using the by function to do that, but I'm not attached to that approach; it's just what seemed to do the job for an R beginner like me.
The output I need would append a (say) 1 period forecast to the original data frame, filling in the date (variable t) and by variable (variable class).
If possible I'd like the approach to generalize to multiple by variables (i.e class_1,...class_n,).
#generate fake data
t<-seq(as.Date("2012/1/1"), by = "month", length.out = 36)
class<-rep(c("A","B"),each=18)
set.seed(1234)
metric<-as.numeric(arima.sim(model=list(order=c(2,1,1),ar=c(0.5,.3),ma=0.3),n=35))
df <- data.frame(t,class,metric)
df$type<-"ORIGINAL"
#sort of what I'd like to do
library(forecast)
ts<-ts(df$metric)
ts<-by(df$metric,df$class,auto.arima)
#extract forecast and relevant other pieces of data
#???
#what I'd like to look like
t<-as.Date(c("2013/7/1","2015/1/1"))
class<-rep(c("A","B"),each=1)
metric<-c(1.111,2.222)
dfn <- data.frame(t,class,metric)
dfn$type<-"FORECAST"
dfinal<-rbind(df,dfn)
I'm not attached to the how-to, as long as it starts with a data frame that looks like what I described, and outputs a data frame like the output I described.
Your description is a little vague, but something along these lines should work:
library(data.table)
dt = data.table(df)
dt[, {result = auto.arima(metric);
rbind(.SD,
list(seq(t[.N], length.out = 2, by = '1 month')[2], result$sigma2, "FORECAST"))},
by = class]
I arbitrarily chose to fill in the sigma^2, since it wasn't clear which variable(s) you want there.
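If the point forecast is what should go into metric, one possible sketch without data.table is shown below. It assumes the forecast package is installed and uses forecast(fit, h = 1)$mean as the one-period-ahead value; the helper name one_step is made up for illustration, and the fake data are generated as in the question.

```r
library(forecast)

# Fake data as in the question: two classes, 18 monthly observations each.
set.seed(1234)
df <- data.frame(
  t      = seq(as.Date("2012-01-01"), by = "month", length.out = 36),
  class  = rep(c("A", "B"), each = 18),
  metric = as.numeric(arima.sim(model = list(order = c(2, 1, 1),
                                             ar = c(0.5, 0.3), ma = 0.3), n = 35)),
  type   = "ORIGINAL"
)

# Hypothetical helper: fit one group and return a single forecast row.
one_step <- function(g) {
  g   <- g[order(g$t), ]
  fit <- auto.arima(g$metric)
  data.frame(
    t      = seq(max(g$t), by = "month", length.out = 2)[2],  # next month
    class  = g$class[1],
    metric = as.numeric(forecast(fit, h = 1)$mean),           # point forecast
    type   = "FORECAST"
  )
}

fc     <- do.call(rbind, lapply(split(df, df$class), one_step))
dfinal <- rbind(df, fc)
```

For several by variables, splitting on a list of columns, for example split(df, list(df$class_1, df$class_2), drop = TRUE), should generalise the same pattern, provided the helper copies each grouping column into its output row.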

Resources