R - More elegant way of writing line of code - r

Ok so lets take this code below which calculates a rolling simple moving average over 2 day period:
# Use TTR package to create rolling SMA n day moving average
new.df$close.sma.n2 <- SMA(new.df[,"Close"], 2)
Lets say I want to calculate the n day period of 2:30
The inputs here is:
close.sma.n**
and also the numerical value for the SMA calculation.
So my question is:
How can I write one line of code to perform the above calculation on different SMA periods and also making a new column with corresponding close.sma.n2,3,4,5,6 etc value in a data frame?
I understand I can do:
n.sma <- 2:30
and put that variable in:
new.df$close.sma.n2 <- SMA(new.df[,"Close"], n.sma)
Perhaps I can:
name <- "n2:30"
and place that inside:
new.df$close.sma.name <- SMA(new.df[,"Close"], n.sma)

You didn't provide sample data or SMA, so I made up dummy functions to test my code.
df <- data.frame(Close=c(1, 2, 3, 4))
SMA <- function(x, numdays) {numdays}
Then, I wrote a function that takes in the number of days to average, and returns a function that takes a data.frame and takes the SMA over that many days.
getSMA <- function(numdays) {
function(new.df) {
SMA(new.df[,"Close"], numdays)
}
}
Then, create a matrix to put the SMAs in
smas <- matrix(nrow=nrow(df), ncol=0)
and fill it.
for (i in 2:30) {
smas <- cbind(smas, getSMA(i)(df))
}
Then, set the column names to what you want them to be
colnames(smas) <- sapply(2:30, function(n)paste("close.sma.n", n, sep=""))
and bind it with the starting data frame.
df <- cbind(df, smas)
Now you have your original data.frame, plus "close.sma.n2" through ".n30"

Related

I have multiple time series and need to put them in one data.frame

I have around thirty separate time series in R. I would like to put them all inside one large data set but can not seem to do this.
I have used the following code but it doesn't work. All my time series are names ts1,ts2 etc. if i was to do df <- data.frame(ts1,ts2) this works individually but not if I input it this way
for(i in 2:nrow(deal))
{
temp <- paste("ts",i,sep="")
mystring <- paste(mystring,temp,sep=",")
}
df <- data.frame(mystring)
Given that df <- data.frame(ts1,ts2) works, the following should work:
N <- nrow(deal) # or whatever number of time series you have
df <- data.frame(sapply(1:N, function(i) eval(parse(text=paste("ts",i,sep="")))))
Notes:
sapply loops over sequence from 1 to N and applies a function. The results of the function are gathered as columns to a matrix that is then coerced into a data frame.
The function that is applied constructs the string for the name of the i-th time series and uses this SO answer to evaluate the expression from the string. This returns the time series.
Hope this helps.

R year quarter for in loop

I am looking to loop over my R data frame that is in year-quarter and run a rolling regression across every year quarter. I then use the coefficients from this model to fit values that are 1 quarter ahead. But would like to use quarterly date format in R?
I had similar issue with
[Stata question] (Stata year-quarter for loop), but revisiting it in R. Does R have the notion of year quarters that can be easily used in a loop? For e.g., one possibly round about way is
months.list <- c("03","06","09","12")
years.list <- c(1992:2007)
## Loop over the month and years
for(yidx in years.list)
{
for(midx in months.list)
{
}
}
I see zoo:: package has some functions, but not sure which one can I use that is specific to my case. Some thing along the following lines would be ideal:
for (yqidx in 1992Q1:2007Q4){
z <- lm(y ~ x, data = mydata <= yqidx )
}
When I do the look ahead, I need to hand it so that the predicated value is run on the the next quarter that is yqidx + 1, and so 2000Q4 moves to 2001Q1.
If all you need help on is how to generate quarters,
require(data.table)
require(zoo)
months.list <- c("03","06","09","12")
years.list <- c(1992:2007)
#The next line of code generates all the month-year combinations.
df<-expand.grid(year=years.list,month=months.list)
#Then, we paste together the year and month with a day so that we get dates like "2007-03-01". Pass that to as.Date, and pass the result to as.yearqtr.
df$Date=as.yearqtr(as.Date(paste0(df$year,"-",df$month,"-01")))
df<-df[order(df$Date),]
Then you can use loops if you'd like. I'd personally consider using data.table like so:
require(data.table)
require(zoo)
DT<-data.table(expand.grid(year=years.list,month=months.list))
DT<-DT[order(year,month)]
DT[,Date:=as.yearqtr(as.Date(paste0(year,"-",month,"-01")))]
#Generate fake x values.
DT[,X:=rnorm(64)]
#Generate time index.
DT[,t:=1:64]
#Fake time index.
DT[,Y:=X+rnorm(64)+t]
#Get rid of the year and month columns -unneeded.
DT[,c("year","month"):=NULL]
#Create a second data.table to hold all your models.
Models<-data.table(Date=DT$Date,Index=1:64)
#Generate your (rolling) models. I am assuming you want to use all past observations in each model.
Models[,Model:=list(list(lm(data=DT[1:Index],Y~X+t))),by=Index]
#You can access an individual model thusly:
Models[5,Model]

R - Improve speed of do.call / by function

I've gotten fairly good with the *apply family of functions, and I've recently learned to use the do.call("rbind", by(... as a wrapper for tapply. I'm working with a large data set (Compustat) and I have a function (see below) that generates a new column of lagged variables which I later attach to the main data frame df.
My problem is that it is extremely slow. I create about two dozen lagged variables, and the processing in this function takes approximately 1.5 hours because there are 350,000+ firm-year observations in the data set.
Can anyone help improve the speed of this function without losing the aspects that I find desirable:
#' lag vector of unknown size (for do.call-rbind-by: using datadate to track)
lag.vec <- function(x){
x <- x[order(x$datadate), ] # sort data into ascending by date
var <- x[,2] # the specific variable name in data.frame x hereby ignored
var.name <- paste(names(x)[2], "lag", sep = '.') # keep variable name
if(length(var)>1){ # no lagging if single observation
lagged <- c(NA, var[1:(length(var)-1)])
datelag <- c(x$datadate[1], x$datadate[1:(length(x$datadate) - 1)])
datediff <- x$datadate - datelag
y <- data.frame(x$datadate, datediff, lagged) # join lagged variable and difference in YYYYMMDD data
y$lagged[y$datediff >= 20000 & !is.na(y$datediff)] <- NA # 2 or more full years difference
y <- y[, c('x.datadate', 'lagged')]
names(y) <- c("datadate", var.name)
} else { y <- c(x$datadate[1], NA); names(y) <- c("datadate", var.name) }
return(y)
}
I then call this function in a command separately for each variable that I want to generate a lagged series for (here I use the ni variable as an example):
ni_lag <- do.call('rbind', by(df[ , c('datadate', 'ni')], df$gvkey, lag.vec))
where gvkey is the ID number for the particular firm and datadate is an 8-digit integer of the form YYYYMMDD.
The approach was much faster when I used a simpler function:
lag.vec.seq <- function(x){#' lag vector when all data points are present, in order
if(length(x)>1){
y <- c(NA, x[1:(length(x)-1)])
} else {y <- NA}
return(y)
}
along with the tapply command in something like
ni_lag <- as.vector(unlist(tapply(df$ni, df$gvkey, lag.vec.seq)))
As you can see the main difference is that the tapply approach doesn't include any datadate information and so the function assumes that all data are sequential (i.e., there are no missing years in the dataframe). Since I know there are missing years, I built the do.call-by function to account for that.
Some notes:
1) The first order command in the function is probably unnecessary since my data is ordered by gvkey and datadate in advance (e.g. df <- df[order(df$gvkey, df$datadate), ]). However, I'm always a bit afraid that R messies up my row ordering when I use functional programming like this. Is that an unfounded fear?
2) Identifying what is slowing down the processing would be very helpful. Is it the renaming of variables? The creation of a new data frame in the function? Or is the do.call with by just typically (much) slower than tapply?
Thank you!

Calling a double condition on data in R?

I am trying to create two vectors of the 20th and 80th percentiles of monthly return data for companies from 1927 to 2013. The issue I have encountered is that in my nested four loop I don't know how to reference both the month and the year (i.e. the returns across all companies in April 1945). Right now the code looks like this:
qunatile<-function(r){
vec20<-c(rep(0,1038))
vec80<-c(rep(0,1038))
for(i in 1927:2013){
for(j in 1:12){
vec20[j+12(i-1927)]<-quantile(r$(i, j),20)
vec80[j+12(i-1927)]<-quantile(r$(i, j),80)
}
}
data1decilest<-rbind(ps1NYSE,vec20,vec80)
}
But I know that that r$(i, j) notation is not correct. I was wondering if anyone knew how to do what I am attempting with that clearly incorrect code (i.e. reference all returns from a given month in a given year.
Thank you!
One option that would eliminate nesting loops is to create a column in your dataframe that contains a month/year combo (e.g. "Jan1955", "Apr1999", etc.) and then split your dataframe on that variable, and apply quantile functions. It's hard to say if this is solving your problem since there is not a reproducible example. I assume here your data is called df and contains a date and a value column.
library(lubridate)
library(plyr)
df$newtime <- paste0(month(df$date, label = T, abbr = T), year(df$date))
q20 <- function(df){ quantile(df$value, 20) }
q80 <- function(df){ quantile(df$value, 80) }
vec20 <- ddply(df, .(newtime), FUN=q20)
vec80 <- ddply(df, .(newtime), FUN=q80

R count days of exceedance per year

My aim is to count days of exceedance per year for each column of a dataframe. I want to do this with one fixed value for the whole dataframe, as well as with different values for each column. For one fixed value for the whole dataframe, I found a solution using count with aggregate and another solution using the package plyr with ddply and colwise. But I couldn't figure out how to do this with different values for each column.
Approach for one fixed value:
# create example data
date <- seq(as.Date("1961/1/1"), as.Date("1963/12/31"), "days") # create dates
date <- date[(format.Date(as.Date(date), "%m %d") !="02 29")] # delete leap days
TempX <- rep(airquality$Temp, length.out=length(date))
TempY <- rep(rev(airquality$Temp), length.out=length(date))
df <- data.frame(date, TempX, TempY)
# This approachs works fine for specific values using aggregate.
library(plyr)
dyear <- as.numeric(format(df$date, "%Y")) # year vector
fa80 <- function (fT) {cft <- count(fT>=80); return(cft[2,2])}; # function for counting days of exceedance
aggregate(df[,-1], list(year=dyear), fa80) # use aggregate to apply function to dataframe
# Another approach using ddply with colwise, which works fine for one specific value.
fd80 <- function (fT) {cft <- count(fT>=80); cft[2,2]}; # function to count days of exceedance
ddply(cbind(df[,-1], dyear), .(dyear), colwise(fd80)) # use ddply to apply function colwise to dataframe
In order to use specific values for each column separatly, I tried passing a second argument to the function, but this didn't work.
# pass second argument to function
Oc <- c(80,85) # values
fo80 <- function (fT,fR) {cft <- count(fT>=fR); return(cft[2,2])}; # function for counting days of exceedance
aggregate(df[,-1], list(year=dyear), fo80, fR=Oc) # use aggregate to apply function to dataframe
I tried using apply.yearly, but it didn't work with count. I want to avoid using a loop, as it is slowly and I have a lot of dataframes with > 100 columns and long timeseries to process.
Furthermore the approach has to work for subsets of the dataframe as well.
# subset of dataframe
dfmay <- df[(format.Date(as.Date(df$date),"%m")=="05"),] # subset dataframe - only may
dyearmay <- as.numeric(format(dfmay$date, "%Y")) # year vector
aggregate(dfmay[,-1],list(year=dyearmay),fa80) # use aggregate to apply function to dataframe
I am out of ideas, how to solve this problem. Any help will be appreciated.
You could try something like this:
#set the target temperature for each column
targets<-c(80,80)
dyear <- as.numeric(format(df$date, "%Y"))
#for each row of the data, check if the temp is above the target limit
#this will return a matrix of TRUE/FALSE
exceedance<-t(apply(df[,-1],1,function(x){x>=targets}))
#aggregate by year and sum
aggregate(exceedance,list(year=dyear),sum)

Resources