Let's start from the end: the R output will be read into Tableau to build a dashboard, so I need the R output to look a certain way. With that in mind, I'm starting with a data frame in R containing n groups of time series. I want to run auto.arima (or another forecasting method from the forecast package) on each group. I'm using the by function to do that, but I'm not attached to that approach; it's just what seemed to do the job for an R beginner like me.
The output I need would append a (say) 1-period forecast to the original data frame, filling in the date (variable t) and the by variable (variable class).
If possible, I'd like the approach to generalize to multiple by variables (i.e. class_1, ..., class_n).
#generate fake data
t<-seq(as.Date("2012/1/1"), by = "month", length.out = 36)
class<-rep(c("A","B"),each=18)
set.seed(1234)
metric<-as.numeric(arima.sim(model=list(order=c(2,1,1),ar=c(0.5,.3),ma=0.3),n=35))
df <- data.frame(t,class,metric)
df$type<-"ORIGINAL"
#sort of what I'd like to do
library(forecast)
ts<-ts(df$metric)
ts<-by(df$metric,df$class,auto.arima)
#extract forecast and relevant other pieces of data
#???
#what I'd like to look like
t<-as.Date(c("2013/7/1","2015/1/1"))
class<-rep(c("A","B"),each=1)
metric<-c(1.111,2.222)
dfn <- data.frame(t,class,metric)
dfn$type<-"FORECAST"
dfinal<-rbind(df,dfn)
I'm not attached to the how-to, as long as it starts with a data frame that looks like what I described, and outputs a data frame like the output I described.
Your description is a little vague, but something along these lines should work:
library(data.table)
dt = data.table(df)
dt[, {result = auto.arima(metric);
      rbind(.SD,
            list(seq(t[.N], length.out = 2, by = '1 month')[2], result$sigma2, "FORECAST"))},
   by = class]
I arbitrarily chose to fill in the sigma^2, since it wasn't clear which variable(s) you want there.
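If the goal is the one-step point forecast rather than sigma^2, the same split-apply-combine idea can be sketched in base R. This is only a sketch under assumptions: the helper name forecast_one is hypothetical, the fake metric is a plain random walk, and stats::arima with a fixed order (0,1,0) stands in for auto.arima so that the example runs without extra packages.

```r
# Fake data with the same layout as the question (metric values are made up)
set.seed(1234)
df <- data.frame(
  t = seq(as.Date("2012-01-01"), by = "month", length.out = 36),
  class = rep(c("A", "B"), each = 18),
  metric = cumsum(rnorm(36)),
  type = "ORIGINAL",
  stringsAsFactors = FALSE
)

# Hypothetical helper: build the one-step-ahead forecast row for one group
forecast_one <- function(d) {
  d <- d[order(d$t), ]
  # Assumed model: random walk. Swap in forecast::auto.arima(d$metric) if desired.
  fit <- arima(d$metric, order = c(0, 1, 0))
  data.frame(
    t = seq(d$t[nrow(d)], by = "month", length.out = 2)[2],  # next month
    class = d$class[1],
    metric = as.numeric(predict(fit, n.ahead = 1)$pred),
    type = "FORECAST",
    stringsAsFactors = FALSE
  )
}

# Split by group, forecast each piece, and append to the original rows
fc <- do.call(rbind, lapply(split(df, df$class), forecast_one))
dfinal <- rbind(df, fc)
```

For several by variables, passing a list to split (for example split(df, list(df$class_1, df$class_2), drop = TRUE)) should generalize the same pattern, provided forecast_one copies each grouping column into its output row.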
I'm working with a large dataset that has multiple locations measured monthly, but each site has different number of measurement and NAs, creating a broken time series. To get around this, I've created a for loop, looped at each site, to fill in the gaps using an interpolation technique. From this, I get an interpolated output and would ideally like to add this back into the original dataset. For example:
library(imputeTS)
Sites = c(rep("A", 5), rep("B", 4), rep("C", 10))
Meas = c(25,20,NA,21,NA,23,21,22,26,27,15,20,NA,25,NA,28,28,27,NA)
df= data.frame(Sites, Meas)
for(i in unique(df$Sites)) {
d = subset(df, Sites == i)
d$fit = na.interpolation(d$Meas)
}
What I would like is to take d$fit and match it back into a new column, df$fit, such that the number of measurements and each site is matched properly. Any suggestions, or complete overhauls to my approach? Thanks in advance!
It's not often that you actually need for loops. You can do this particular task with the ave() function:
df$fit <- ave(df$Meas, df$Sites, FUN=na.interpolation)
In this case the function applies the na.interpolation function to each of the Meas values for each of the different values of Sites and then puts everything back in the right order.
Another strategy you could use for something more complex is split/unsplit. Something like:
ss <- split(df$Meas, df$Sites)
df$fit <- unsplit(lapply(ss, na.interpolation), df$Sites)
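To see that the two approaches line up row for row, here is a small self-contained check. It is a sketch only: fill_linear is a hypothetical base-R stand-in for na.interpolation (linear interpolation via approx, with the end values carried outward), used so the example runs without imputeTS.

```r
Sites <- c(rep("A", 5), rep("B", 4), rep("C", 10))
Meas <- c(25, 20, NA, 21, NA, 23, 21, 22, 26, 27, 15, 20, NA, 25, NA, 28, 28, 27, NA)
df <- data.frame(Sites, Meas)

# Hypothetical stand-in for na.interpolation: linear interpolation within a
# series; leading/trailing NAs take the nearest observed value (rule = 2)
fill_linear <- function(x) {
  idx <- seq_along(x)
  ok <- !is.na(x)
  approx(idx[ok], x[ok], xout = idx, rule = 2)$y
}

# ave() applies the function per site and returns values in the original row order
df$fit_ave <- ave(df$Meas, df$Sites, FUN = fill_linear)

# split/unsplit does the same thing in two explicit steps
ss <- split(df$Meas, df$Sites)
df$fit_split <- unsplit(lapply(ss, fill_linear), df$Sites)
```

Because each site is interpolated on its own, a gap at the end of site A is never filled using the first value of site B.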
I am looking to loop over my R data frame, which is in year-quarter format, and run a rolling regression for every year-quarter. I then use the coefficients from each model to fit values one quarter ahead. Is there a quarterly date format in R that I can use for this?
I had a similar issue in Stata (see the question "Stata year-quarter for loop") and am revisiting it in R. Does R have a notion of year-quarters that can easily be used in a loop? For example, one possibly roundabout way is:
months.list <- c("03","06","09","12")
years.list <- c(1992:2007)
## Loop over the month and years
for(yidx in years.list)
{
for(midx in months.list)
{
}
}
I see the zoo package has some functions, but I'm not sure which one fits my case. Something along the following lines would be ideal:
for (yqidx in 1992Q1:2007Q4){
z <- lm(y ~ x, data = mydata <= yqidx )
}
When I do the look-ahead, I need to handle it so that the predicted value is computed for the next quarter, yqidx + 1, so that 2000Q4 moves to 2001Q1.
If all you need help with is how to generate quarters:
require(data.table)
require(zoo)
months.list <- c("03","06","09","12")
years.list <- c(1992:2007)
#The next line of code generates all the month-year combinations.
df<-expand.grid(year=years.list,month=months.list)
#Then, we paste together the year and month with a day so that we get dates like "2007-03-01". Pass that to as.Date, and pass the result to as.yearqtr.
df$Date=as.yearqtr(as.Date(paste0(df$year,"-",df$month,"-01")))
df<-df[order(df$Date),]
Then you can use loops if you'd like. I'd personally consider using data.table like so:
require(data.table)
require(zoo)
DT<-data.table(expand.grid(year=years.list,month=months.list))
DT<-DT[order(year,month)]
DT[,Date:=as.yearqtr(as.Date(paste0(year,"-",month,"-01")))]
#Generate fake x values.
DT[,X:=rnorm(64)]
#Generate time index.
DT[,t:=1:64]
#Generate fake Y values.
DT[,Y:=X+rnorm(64)+t]
#Get rid of the year and month columns -unneeded.
DT[,c("year","month"):=NULL]
#Create a second data.table to hold all your models.
Models<-data.table(Date=DT$Date,Index=1:64)
#Generate your (rolling) models. I am assuming you want to use all past observations in each model.
Models[,Model:=list(list(lm(data=DT[1:Index],Y~X+t))),by=Index]
#You can access an individual model thusly:
Models[5,Model]
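For the one-quarter-ahead step the question asks about, a plain loop over row positions may be enough once the data are sorted by quarter: row i + 1 is always the next quarter, so 2000 Q4 rolls over to 2001 Q1 without any special date arithmetic. The sketch below uses only base R; the column names, the fake data, and the minimum window min_obs are assumptions made for illustration.

```r
set.seed(1)

# 64 quarters from 1992 Q1 to 2007 Q4, labelled with base R's quarters()
qdates <- seq(as.Date("1992-01-01"), by = "quarter", length.out = 64)
mydata <- data.frame(
  yq = paste(format(qdates, "%Y"), quarters(qdates)),
  x = rnorm(64)
)
mydata$y <- 2 * mydata$x + rnorm(64)   # fake response
mydata$pred <- NA_real_

min_obs <- 8  # assumed minimum number of quarters before the first fit

# Expanding window: fit on quarters 1..i, predict quarter i + 1
for (i in min_obs:(nrow(mydata) - 1)) {
  fit <- lm(y ~ x, data = mydata[1:i, ])
  mydata$pred[i + 1] <- predict(fit, newdata = mydata[i + 1, ])
}
```

If a true quarterly class is preferred for the labels, zoo's as.yearqtr (used in the answer above) could replace the pasted strings; the loop itself would not need to change.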
I am using R to analyze multiple large data sets. I am trying to add a few of them together and average them to make a plot. They need to be combined on matching dates, but the data sets are not all the same length and did not start or end at the same time. How would I go about combining them while accounting for the differences in dates? My first thought is to use an if statement that checks whether date = date, but I'm not sure of the correct way to call all the files in the folder for comparison.
I have a script that plots one data set at a time and am simply trying to amend it to accomplish this new analysis:
library(openair)
filedir <-"C:/Users/dfmcg/Documents/Thesisfiles/NE"
myfiles <-c(list.files(path = filedir))
paste(filedir,myfiles,sep = '/')
npsfiles<-c(paste(filedir,myfiles,sep = '/'))
print(npsfiles)
for (i in npsfiles[1:3]){
x <- substr(i,54,61)
y<-paste(paste('C:/Users/dfmcg/Documents/Thesisfiles/NEavg',x,sep='/'), 'png', sep='')
png(filename = y)
timeozone<-import(i,date="DATE",date.format = "%m/%d/%Y %H",header=TRUE,na.strings="-999")
ozoneavg <- timeAverage(timeozone, pollutant = c("O3"), avg.time = "month")
timePlot(ozoneavg,pollutant=c("O3"), main = x)
dev.off()
}
Here is some of the data:
ABBR,DATE,O3,SWS,VWS,SWD,VWD,SDWD,TMP,RH,RNF,SOL
SHEN-BM,05/01/1983 00,-999,-999,-999,,-999,-999,-999,-999,-999,-999
SHEN-BM,05/01/1983 01,-999,-999,-999,,-999,-999,-999,-999,-999,-999
SHEN-BM,05/01/1983 02,-999,-999,-999,,-999,-999,-999,-999,-999,-999
Your question is not very clear. Since it isn't obvious exactly how you would like to add the data frames together and what you want to average, here is a generic attempt at an answer.
To read multiple files in and merge them into one large data frame:
#read 3 files
basefilename<-"oa_test"
#the sample data codes missing values as -999, so read those as NA
npsfiles<-lapply(1:3, function(i) {read.csv(paste0(basefilename,i,".csv"), na.strings="-999")})
#merge files into one dataframe
df<-do.call(rbind, npsfiles)
#fix date column
df$DATE<-as.POSIXct(df$DATE, format="%m/%d/%Y %H")
You could also use the import function from the openair package here.
Now, once you have all the data in one data frame, the dplyr package makes it easy to group the data by the various variables and compute descriptive statistics on the groups:
library(dplyr)
#group by DATE and average (na.rm = TRUE skips missing readings)
ozoneavedate<-summarize(group_by(df, DATE), mean(O3, na.rm = TRUE))
#group by ABBR and sum
ozonesumabbr<-summarize(group_by(df, ABBR), sum(O3, na.rm = TRUE))
#group by ABBR and DATE and average
ozoneavedateabbr<-summarize(group_by(df, ABBR, DATE), mean(O3, na.rm = TRUE))
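To illustrate why stacking the files and grouping by DATE copes with data sets that start and end at different times, here is a small base-R sketch. The two sites and their O3 values are made up for the example; only the column layout and the -999 missing code follow the sample data in the question.

```r
# Two made-up sites whose records overlap only partly
site1 <- data.frame(ABBR = "S1",
                    DATE = c("05/01/1983 00", "05/01/1983 01", "05/01/1983 02"),
                    O3 = c(30, -999, 34))
site2 <- data.frame(ABBR = "S2",
                    DATE = c("05/01/1983 01", "05/01/1983 02", "05/01/1983 03"),
                    O3 = c(40, 36, 38))

# Stack the data sets, then clean up the missing code and the date column
df <- rbind(site1, site2)
df$O3[df$O3 == -999] <- NA
df$DATE <- as.POSIXct(df$DATE, format = "%m/%d/%Y %H", tz = "UTC")

# Average across sites for each timestamp; rows with NA are dropped by the
# formula interface, so each mean uses whichever sites reported at that time
avg <- aggregate(O3 ~ DATE, data = df, FUN = mean)
```

Timestamps covered by only one site simply get that site's value, so no explicit "if date equals date" comparison is needed.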
Hope this helps.
In the future, providing some sample data and a description of what you hope to achieve goes a long way toward soliciting help.
I have a data.frame of 373127 obs. of 193 variables. Some variables are factors which I want to use dummyVars() to separate each factor into its own column. I then want to merge the separate dummy variable columns back into my original data.frame, so I thought I could do the whole thing with apply, but something is not working and I can't figure out what it is.
Sample:
dat_final <- apply(dummies.var1, 1, function(x) {
dummies.var1 <- dummyVars(~ dat1$factor.var1 -1, data = dat1)
})
Thanks!
You can do the following that will create a new df, trsf, but you could always reassign back to the original df:
library(caret)
customers <- data.frame(
id=c(10,20,30,40,50),
gender=c('male','female','female','male','female'),
mood=c('happy','sad','happy','sad','happy'),
outcome=c(1,1,0,0,0))
# dummify the data
dmy <- dummyVars(" ~ .", data = customers)
trsf <- data.frame(predict(dmy, newdata = customers))
print(trsf)
The real answer is .... Don't do that. It's almost never necessary.
You could do something like this:
# Example data
df = data.frame(x = rep(LETTERS, each = 3), y = rnorm(78))
df = cbind(df, model.matrix(~df$x - 1))
However, as pointed out by @user30257, it is hard to see why you want to do it. In general, modeling tools in R don't need dummy variables; they deal with factors directly.
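A compact way to see both points, using base R only: model.matrix can produce the dummy columns to bind back onto the data frame, while a modeling function such as lm expands the factor on its own. The data below is a trimmed, made-up variant of the customers example and is only a sketch.

```r
customers <- data.frame(
  id = c(10, 20, 30, 40, 50),
  gender = factor(c("male", "female", "female", "male", "female")),
  outcome = c(1, 1, 0, 0, 0)
)

# One indicator column per level of gender (-1 drops the intercept),
# bound back onto the original data frame
dummies <- model.matrix(~ gender - 1, data = customers)
customers_wide <- cbind(customers, dummies)

# Modeling functions handle the factor directly; no manual dummies needed
fit <- lm(outcome ~ gender, data = customers)
```

Here customers_wide gains the columns genderfemale and gendermale, while fit gets a single gendermale coefficient because lm applies treatment contrasts to the factor itself.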
Creating dummy variables can be very important in feature selection, which it sounds like the original poster was doing.
For instance, suppose you have a feature that contains duplicated information (i.e., one of its levels corresponds to something measured elsewhere). You can determine this is the case very simply by comparing the dummy variables for these features using a variety of dissimilarity measures.
My preference is to use sparse.model.matrix and cBind.
The standard bit of code below from the VARS package forecasts values for several variables.
What I want to do is to take those values and turn them into a data frame so I can produce time series graphs.
predict(var4, n.ahead=12, ci=0.95)
This question is highly vague. I suppose you're looking for:
x <- predict(var4, n.ahead=12, ci=0.95)
# the variable names live on x$fcst, not on x itself
data.frame(n = rep(names(x$fcst), each = nrow(x$fcst[[1]])), do.call(rbind, x$fcst))
By the way: The package's name is vars, not VARS.
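The reshaping step can be tried without fitting a model. The sketch below builds a mock object with the shape that predict is understood to return for a vars model (a list whose fcst element holds one matrix per variable with columns fcst, lower, upper and CI); the variable names, the numbers, and the helper make_fcst are made up, and the extra step column is an addition for plotting.

```r
# Hypothetical helper producing a 3-step forecast matrix for one variable
make_fcst <- function(start) {
  cbind(fcst = start + 1:3,
        lower = start + 1:3 - 1,
        upper = start + 1:3 + 1,
        CI = 1)
}

# Mock stand-in for predict(var4, n.ahead = 3, ci = 0.95)
x <- list(fcst = list(e = make_fcst(0), prod = make_fcst(10)))

# One row per variable and forecast step, ready for time series plots
out <- data.frame(
  variable = rep(names(x$fcst), each = nrow(x$fcst[[1]])),
  step = rep(seq_len(nrow(x$fcst[[1]])), times = length(x$fcst)),
  do.call(rbind, x$fcst)
)
```

With a real model, x would come from predict and the rest should carry over unchanged, assuming every variable has the same number of forecast steps.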