I have a time series dataset with two columns: x is hourly continuous temperature data and y is periodically sampled response data (samples taken at 5am, 2pm, and 8pm every day) over a couple of weeks.
I would like to use two lag approaches to analyse the data:
1) plot all my y data (constant) vs increasingly lagged x data (shift the x data by 0-24 hours in 1-hour steps), i.e. x at 6pm vs y at 6pm; x at 5pm vs y at 6pm; ...; x at 5pm on the previous day vs y at 6pm.
2) The same as 1) but with cumulative shifts, i.e. a "backward in time" cumulative lag window of 0:24 with a step of 1 for the x data, tested against the y data: x at 6pm vs y at 6pm; the average of x at 5pm and 6pm vs y at 6pm; ...; the average of x from 6pm back to 5pm on the previous day vs y at 6pm.
I want to fit a linear model (lm) of y vs shifted x for each lag scenario (0-24) and build a table with columns for the number of lags, the p-value of the lm, and the adjusted R^2 of the lm, so I can see which lag and which cumulative average lag in x best explain the y data.
Essentially it is the same as the "cummean" or "rollapply" functions, but working in a backward direction, and I could not find anything in R that does this. Flipping the x data does not work, as the order of the data needs to be maintained: I need the lag in x for several y's.
I would guess it requires a 'for' loop that runs through all the data at each lag, with "i" being the lag.
A single run with 0 lag would look like this:
# Creating dummy data
library(zoo)
x <- zoo(c(10,10.5,10.5,11,11.5,12,12.5,12,12,12.5,13,12.5,12,12,11.5,10.5), as.Date(1:16))
y <- zoo(c(rep(NA,3),40,rep(NA,3),45,rep(NA,3),50,rep(NA,3),40), as.Date(1:16))
z <- merge(x, y, all = FALSE)
z
reslt <- lm(z$y ~ z$x)
a <- summary(reslt)$coefficients[2, 4]  # p-value of the slope
b <- summary(reslt)$adj.r.squared       # adjusted R^2
ResltTable <- matrix(c(a, b), nrow = 1)
colnames(ResltTable) <- c("p-value", "Adj. R^2")
Thanks!
This will regress y against the value of x from i periods ago, iterating over i. Note that the question refers to hourly data but the dummy data are daily, so we show daily lags. dyn$lm runs lm with automatic alignment of the zoo series. (A new version of dyn was released to CRAN yesterday that addresses changes in the development version of R.) We have run this for lags 0, 1, 2, ..., 10, but if you have more data you could run it up to higher values. If you want to lag in the reverse direction, replace -i with i in lag. If you want to use all lags from 0 to i, use lag(x, 0:-i) and adjust the cbind statement appropriately; a sketch of the cumulative-average variant from part 2) follows the output below.
library(dyn) # also loads zoo
x <- zoo(c(10,10.5,10.5,11,11.5,12,12.5,12,12,12.5,13,12.5,12,12,11.5,10.5), as.Date(1:16))
y <- zoo(c(rep(NA,3),40,rep(NA,3),45,rep(NA,3),50,rep(NA,3),40), as.Date(1:16))
z <- merge(x, y, all = FALSE)
z
k <- 10 # highest lag to consider
tab <- t(sapply(0:k, function(i) {
fm <- dyn$lm(y ~ lag(x, -i), z)
s <- summary(fm)
cbind(i, coef(fm)[1], coef(fm)[2], coef(s)[2, 4], s$adj.r.squared)
}))
colnames(tab) <- c("Lag", "Intercept", "Slope", "P Value", "Adj R Sq")
tab
giving:
> tab
Lag Intercept Slope P Value Adj R Sq
[1,] 0 -13.750000 5.0000000 0.04653741 0.8636364
[2,] 1 -2.542373 3.8983051 0.09717103 0.7226502
[3,] 2 -1.944444 3.8888889 0.29647353 0.2424242
[4,] 3 14.651163 2.5581395 0.49421946 -0.1162791
[5,] 4 70.357143 -2.1428571 0.78770438 -0.7857143
[6,] 5 53.571429 -0.7142857 0.87896228 -0.9285714
[7,] 6 58.461538 -1.1538462 0.84557904 -0.8846154
[8,] 7 57.884615 -1.1538462 0.84557904 -0.8846154
[9,] 8 160.000000 -10.0000000 NaN NaN
[10,] 9 102.500000 -5.0000000 NaN NaN
[11,] 10 120.000000 -6.6666667 NaN NaN
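For the cumulative windows in part 2), note that zoo's right-aligned rollapplyr is exactly the "backward" rolling mean asked about: it averages each value with the previous width - 1 values. Here is a sketch (not part of the original answer) reusing the data above; with only 16 dummy points, only small windows leave enough non-NA pairs:
tab2 <- t(sapply(0:3, function(i) {
  xc <- rollapplyr(x, i + 1, mean)   # right-aligned: mean of x at times t, t-1, ..., t-i
  z2 <- merge(xc, y, all = FALSE)    # align the averaged series with y
  fm <- dyn$lm(y ~ xc, z2)
  s <- summary(fm)
  c(Lag = i, `P Value` = coef(s)[2, 4], `Adj R Sq` = s$adj.r.squared)
}))
tab2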
I use the updated greybox package in R to forecast the next two values (horizon = 2) with a moving-average scheme (see the first block of code below), where the window size is equal to 3.
For example, the overall goal is to take the average (1+2+3)/3 = 2 as the forecast at horizon 1 (h=1) and then reuse that predicted value for h=2, where (2+3+2)/3 ≈ 2.3333.
The following forecast origin will make use of the window (2+3+4), where 4 is an actual value, to predict the next h1 and h2, which equal 3 and (3+4+3)/3 ≈ 3.3333 respectively.
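In plain R, the recursion I have in mind looks like this (ma_forecast is just an illustrative helper, not a greybox function):
ma_forecast <- function(x, h, n = 3) {
  out <- numeric(h)
  for (j in seq_len(h)) {
    out[j] <- mean(tail(x, n))  # average of the last n known or forecast values
    x <- c(x, out[j])           # feed the forecast back in for the next step
  }
  out
}
ma_forecast(c(1, 2, 3), h = 2)  # 2.000000 2.333333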
Yet the prediction result I want, ValuesMA[[3]], only emits one row, i.e. values for the first horizon, but it should have as many rows as the predefined horizon, which is two.
I have code for an AR(1) process which works perfectly (second block of code). At the end I add an MAE test statistic to evaluate the model.
Can anyone help?
Thank you!
This is the underlying code I use:
#data
z <- c(1,2,3,4,5,6,7)
ourCall <- "mean(x=data,n.ahead=h)"
ourValue <- c("pred")
# Return a list for a forecasting horizon h
ValuesMA <- ro(z, h=2, origins=3, call=ourCall, ci=TRUE, co=TRUE)
ValuesMA[[3]]
Which yields:
origin3 origin4 origin5
[1,] 2 3 4
But I want:
origin3 origin4 origin5
[1,] 2 3 4
[2,] 2.3333 3.3333 4.3333
#data
z <- c(1,2,3,4,5,6,7)
# ci defines constant in-sample window size, co defines whether the holdout sample window size should be constant
ourCall <- "predict(arima(x=data,order=c(1,0,0)),n.ahead=h)"
# Ask for predicted values and the standard error
ourValue <- c("pred","se")
# Return a list for a forecasting horizon h with a rolling holdout size equal to origin
ValuesAR1 <- ro(z, h=2, origins=3, call=ourCall, value=ourValue, ci=TRUE, co=TRUE)
# calculate MAE
MAE_AR1 <- apply(abs(ValuesAR1$holdout - ValuesAR1$pred),1,mean,na.rm=TRUE) / mean(ValuesAR1$actuals)
ValuesAR1[[3]]
Which yields:
> ValuesAR1[[3]]
origin3 origin4 origin5
h1 2 3 4
h2 2 3 4
For further reading see: https://cran.r-project.org/web/packages/greybox/vignettes/ro.html
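Update: a possible fix based on the vignette (unverified): ro() keeps as many rows per origin as the evaluated call returns, and mean() simply ignores n.ahead and returns one value, hence the single row. Wrapping the recursion above so it returns an h-element "pred" component should give both horizons:
ma_fc <- function(data, h) list(pred = ma_forecast(data, h))  # reuses ma_forecast from above
ourCall <- "ma_fc(data = data, h = h)"
ValuesMA <- ro(z, h = 2, origins = 3, call = ourCall, value = "pred", ci = TRUE, co = TRUE)
ValuesMA$pred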
Hello, I'm currently working on a regression analysis with the following code:
r2.out <- numeric(ncol(Ret1))   # preallocate the results vector
for (i in 1:ncol(Ret1)) {
  r2.out[i] <- summary(lm(Ret1[, 1] ~ Ret1[, i]))$r.squared
}
r2.out
This code runs a simple OLS regression of each column in the data frame against the first column and returns the R^2 of these regressions. At the moment each regression uses all data points of a column. What I need now is for the code to use a rolling window of data points instead, so that it calculates the R^2 for a rolling window of 30 days over the entire time frame. The output should be a matrix with the R^2 per rolling window for each (1, i) pair.
This code does the rolling regression part but does not run the regression for each (1, i) pair:
dolm <- function(x) summary(lm(Ret1[,1]~Ret1[,i]))$r.squared
rollapplyr(Ret1, 30, dolm, by.column = FALSE)
I really appreciate any help you can provide.
Using the built-in anscombe data frame we regress the y1 column against x1 and then x2, etc. We use a width of 3 here for purposes of illustration.
xnames should be set to the names of the x variables. In the anscombe data set the column names that begin with x are the x variables. As another example, if all the columns are x variables except the first then xnames <- names(DF)[-1] could be used.
We define an R-squared function, rsq, which takes the indexes to use, ix, and the x variable name, xname. We then sapply over the xnames and for each one rollapply over the indices 1:n.
library(zoo)
xnames <- grep("x", names(anscombe), value = TRUE)
n <- nrow(anscombe)
w <- 3
rsq <- function(ix, xname) summary(lm(y1 ~., anscombe[c("y1", xname)], subset = ix))$r.sq
sapply(xnames, function(xname) rollapply(1:n, w, rsq, xname = xname))
giving the following result of dimensions n - w + 1 by length(xnames):
x1 x2 x3 x4
[1,] 2.285384e-01 2.285384e-01 2.285384e-01 0.0000000
[2,] 3.591782e-05 3.591782e-05 3.591782e-05 0.0000000
[3,] 9.841920e-01 9.841920e-01 9.841920e-01 0.0000000
[4,] 5.857410e-01 5.857410e-01 5.857410e-01 0.0000000
[5,] 9.351609e-01 9.351609e-01 9.351609e-01 0.0000000
[6,] 8.760332e-01 8.760332e-01 8.760332e-01 0.7724447
[7,] 9.494869e-01 9.494869e-01 9.494869e-01 0.7015512
[8,] 9.107256e-01 9.107256e-01 9.107256e-01 0.3192194
[9,] 8.385510e-01 8.385510e-01 8.385510e-01 0.0000000
Variations
1) It would also be possible to reverse the order of the rollapply and sapply, replacing the last line of code with:
rollapply(1:n, w, function(ix) sapply(xnames, rsq, ix = ix))
2) Another variation is to replace the definition of rsq and the sapply/rollapply line with the following single statement. It may be a bit harder to read, so you may prefer the first solution, but it does entail one simplification -- namely, xname need no longer be an explicit argument of the inner anonymous function (which takes the place of rsq above):
sapply(xnames, function(xname) rollapply(1:n, w, function(ix)
    summary(lm(y1 ~., anscombe[c("y1", xname)], subset = ix))$r.sq))
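Mapping this back onto the Ret1 setup in the question (a sketch, assuming Ret1 is a data frame whose first column is the response, with the 30-row window asked for):
xnames <- names(Ret1)[-1]   # every column except the first is an x variable
n <- nrow(Ret1)
w <- 30                     # rolling window of 30 days
rsq <- function(ix, xname) summary(lm(Ret1[ix, 1] ~ Ret1[ix, xname]))$r.squared
sapply(xnames, function(xname) rollapply(1:n, w, rsq, xname = xname))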
I am using a recurrent neural network (RNN) for forecasting, but for some weird reason it always outputs 1. Here I explain this with a toy example:
Example
Consider a matrix M of dimensions (360, 5) and a vector Y which contains the row sums of M. Now, using an RNN, I want to predict Y from M. Using the rnn R package, I trained the model as:
library(rnn)
M <- matrix(c(1:1800), ncol = 5, byrow = TRUE)  # matrix (say, features)
Y <- apply(M, 1, sum)                           # output equals the row sums of M
mt <- array(c(M), dim = c(NROW(M), 1, NCOL(M))) # format as [samples, timesteps, features]
yt <- array(c(Y), dim = c(NROW(M), 1, NCOL(Y))) # same formatting for the response
model <- trainr(X = mt, Y = yt, learningrate = 0.5, hidden_dim = 10, numepochs = 1000)  # training
One strange thing I observed while training is that the epoch error is always 4501. Ideally, the epoch error should decrease as the number of epochs increases.
Next, I created a test dataset with the same structure as the one above:
M2 <- matrix(c(1:15),nrow=3,byrow = TRUE)
mt2 <- array(c(M2),dim=c(NROW(M2),1,NCOL(M2)))
predictr(model,mt2)
With prediction, I always get an output of 1.
What can be the reason for the constant epoch error and the same output?
UPDATE # 1
The answer provided by @Barker does not work on my problem. To keep the question open, here I share minimal data via Dropbox links as traindata and testdata, and my R code below.
Data details: the column 'power' is the response variable; it is a function of the temperature, humidity, and power consumed on previous days, from day 1 to day 14.
normalize_data <- function(x){
normalized = (x-min(x))/(max(x)-min(x))
return(normalized)
}
#read test and train data
traindat <- read.csv(file = "train.csv")
testdat <- read.csv(file = "test.csv")
# column "power" is response variable and remaining are predictors
# predictors in traindata
trainX <- traindat[, 1:(ncol(traindat) - 1)]
# response of train data
trainY <- traindat$power
# arrange data acc. to RNN as [samples,time steps, features]
tx <- array(as.matrix(trainX), dim=c(NROW(trainX), 1, NCOL(trainX)))
tx <- normalize_data(tx) # normalize data in range of [0,1]
ty <- array(trainY, dim=c(NROW(trainY), 1, NCOL(trainY))) # arrange response acc. to predictors
# train model
model <- trainr(X = tx, Y = ty, learningrate = 0.08, hidden_dim = 6, numepochs = 400)
# predictors in test data
testX <- testdat[, 1:(ncol(testdat) - 1)]
testX <- normalize_data(testX) # normalize data in range of [0,1]
#testY <- testdat$power
# arrange data acc. to RNN as [samples,time steps, features]
tx2 <- array(as.matrix(testX), dim=c(NROW(testX), 1, NCOL(testX))) # predict
pred <- predictr(model,tx2)
pred
I varied the learning rate, hidden_dim, and numepochs parameters, but it still results in either 0.9 or 1.
Most RNNs don't like data that don't have a constant mean. One strategy for dealing with this is differencing the data. To see how this works, let's work with the base R time series co2. This is a time series with nice smooth seasonality and a trend, so we should be able to forecast it.
For our model, the input matrix is going to be the "seasonality" and "trend" components of the co2 time series, created using the stl decomposition. So let's make our training and testing data as you did before and train the model (note I reduced numepochs for runtime). I will use all the data up to the last year and a half for training, and then use the last year and a half for testing:
#Create the STL decomposition
sdcomp <- stl(co2, s.window = 7)$time.series[,1:2]
Y <- window(co2, end = c(1996, 6))
M <- window(sdcomp, end = c(1996, 6))
#Taken from OP's code
mt <- array(c(M),dim=c(NROW(M),1,NCOL(M)))
yt <- array(c(Y),dim=c(NROW(M),1,NCOL(Y)))
model <- trainr(X=mt,Y=yt,learningrate=0.5,hidden_dim=10,numepochs=100)
Now we can create our predictions on the last year of testing data:
M2 <- window(sdcomp, start = c(1996,7))
mt2 <- array(c(M2),dim=c(NROW(M2),1,NCOL(M2)))
predictr(model,mt2)
output:
[,1]
[1,] 1
[2,] 1
[3,] 1
[4,] 1
[5,] 1
[6,] 1
[7,] 1
[8,] 1
[9,] 1
[10,] 1
[11,] 1
[12,] 1
[13,] 1
[14,] 1
[15,] 1
[16,] 1
[17,] 1
[18,] 1
Ew, it is all ones again, just like in your example. Now let's try again, but this time we will difference the data. Since we are trying to make our predictions a year and a half out, we will use 18 as our differencing lag, as those are the values we would know 18 months ahead of time.
dco2 <- diff(co2, 18)
sdcomp <- stl(dco2, s.window = "periodic")$time.series[,1:2]
plot(dco2)
Great, the trend is now gone, so our neural net should be able to find the pattern better. Let's try again with the new data.
Y <- window(dco2, end = c(1996, 6))
M <- window(sdcomp, end = c(1996, 6))
mt <- array(c(M),dim=c(NROW(M),1,NCOL(M)))
yt <- array(c(Y),dim=c(NROW(M),1,NCOL(Y)))
model <- trainr(X=mt,Y=yt,learningrate=0.5,hidden_dim=10,numepochs=100)
M2 <- window(sdcomp, start = c(1996,7))
mt2 <- array(c(M2),dim=c(NROW(M2),1,NCOL(M2)))
(preds <- predictr(model,mt2))
output:
[,1]
[1,] 9.999408e-01
[2,] 9.478496e-01
[3,] 6.101828e-08
[4,] 2.615463e-08
[5,] 3.144719e-08
[6,] 1.668084e-06
[7,] 9.972314e-01
[8,] 9.999901e-01
[9,] 9.999916e-01
[10,] 9.999916e-01
[11,] 9.999916e-01
[12,] 9.999915e-01
[13,] 9.999646e-01
[14,] 1.299846e-02
[15,] 3.114577e-08
[16,] 2.432247e-08
[17,] 2.586075e-08
[18,] 1.101596e-07
OK, now there is something there! Let's see how it compares to what we were trying to forecast, dco2.
Not ideal, but it is finding the general up-down pattern of the data. Now all you have to do is tinker with your learning rates and start optimizing all those lovely hyperparameters that make working with neural nets such a joy. When it is working how you want, you can just take your final output and add back in the last 18 months of your training data.
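A sketch of that last "add back" step (not from the original answer, and ignoring the separate issue of rescaling the network's [0, 1] outputs back onto the differenced scale): since diff(co2, 18) is co2[t] - co2[t - 18], each forecast is undone by adding the observation from 18 months earlier.
hist18 <- window(co2, start = c(1995, 1), end = c(1996, 6))  # the 18 obs lying 18 months before Jul 1996 - Dec 1997
co2_fc <- as.numeric(preds) + as.numeric(hist18)             # forecasts back on the original CO2 scale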
From my review of the examples in the package (see ?trainr), the inputs to the training function have to be binary. There are the functions int2bin and bin2int in the package for this.
I have not been able to get them to work correctly, but it appears conversion to binary is needed.
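For reference, the intended usage appears to be along these lines (a hedged sketch; see ?int2bin for the exact signature in your version of rnn):
library(rnn)
ints <- c(3, 7, 12)
bits <- int2bin(ints, length = 8)  # 0/1 binary representation of each integer
bin2int(bits)                      # should recover 3 7 12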
I am trying to extend the answer to the question "R: filtering data and calculating correlation".
To obtain the correlation of temperature and humidity for each month of the year (1 = January), we would have to run the same line for each month (12 times):
cor(airquality[airquality$Month == 1, c("Temp", "Humidity")])
Is there any way to do each month automatically?
In my case I have more than 30 groups (not months but species) to which I would like to test for correlations, I just wanted to know if there is a faster way than doing it one by one.
Thank you!
cor(airquality[airquality$Month == 1, c("Temp", "Humidity")])
gives you a 2 x 2 correlation matrix rather than a single number. I bet you want one number for each Month, so use
## cor(Temp, Humidity | Month)
with(airquality, mapply(cor, split(Temp, Month), split(Humidity, Month)) )
and you will obtain a vector.
Have a read around ?split and ?mapply; they are very useful for "by group" operations, although they are not the only option. Also read around ?cor, and compare the difference between
a <- rnorm(10)
b <- rnorm(10)
cor(a, b)
cor(cbind(a, b))
The answer you linked in your question is doing something similar to cor(cbind(a, b)).
Reproducible example
The airquality dataset in R does not have a Humidity column, so I will use Wind for testing:
## cor(Temp, Wind | Month)
x <- with(airquality, mapply(cor, split(Temp, Month), split(Wind, Month)) )
# 5 6 7 8 9
#-0.3732760 -0.1210353 -0.3052355 -0.5076146 -0.5704701
We get a named vector, where names(x) gives Month, and unname(x) gives correlation.
Thank you very much! It worked just perfectly! I was trying to figure out how to obtain a vector with the R^2 for each correlation too, but I can't... Any ideas?
cor(x, y) is like fitting a standardised linear regression model:
coef(lm(scale(y) ~ scale(x) - 1)) ## remember to drop intercept
The R-squared of this simple linear regression is just the square of the slope. Since x above stores the correlation per group, the R-squared per group is just x ^ 2.
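Concretely, continuing the Temp/Wind example (these values are just the squares of the correlations shown above):
x ^ 2
#         5         6         7         8         9
# 0.1393350 0.0146495 0.0931687 0.2576726 0.3254361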
How do I perform a linear regression using different intervals for data in different groups in a data.table?
I am currently doing this using plyr but with large data sets it gets very slow. Any help to speed up the process is greatly appreciated.
I have a data table which contains 10 counts of CO2 measurements over 10 days, for 10 plots and 3 fences. Different days fall into different time periods, as described below.
I would like to perform a linear regression to determine the rate of change of CO2 for each fence, plot, and day combination, using a different interval of counts during each period: period 1 should regress CO2 over counts 1-5, period 2 over counts 1-7, and period 3 over counts 1-9 (the code below uses strict inequalities, so in practice counts 2-4, 2-6, and 2-8 enter the regressions).
CO2 <- rep((runif(10, 350,359)), 300) # 10 days, 10 plots, 3 fences
count <- rep((1:10), 300) # 10 days, 10 plots, 3 fences
DOY <-rep(rep(152:161, each=10),30) # 10 measurements/day, 10 plots, 3 fences
fence <- rep(1:3, each=1000) # 10 days, 10 measurements, 10 plots
plot <- rep(rep(1:10, each=100),3) # 10 days, 10 measurements, 3 fences
flux <- as.data.frame(cbind(CO2, count, DOY, fence, plot))
flux$period <- ifelse(flux$DOY <= 155, 1, ifelse(flux$DOY > 155 & flux$DOY < 158, 2, 3))
flux <- as.data.table(flux)
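(Equivalently, period could be built with cut(), which may read more clearly than the nested ifelse:)
# same result as above: DOY <= 155 -> 1, DOY in 156-157 -> 2, DOY >= 158 -> 3
flux$period <- cut(flux$DOY, breaks = c(-Inf, 155, 157, Inf), labels = FALSE)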
I expect output which gives me the R^2 and slope of the fitted line for each plot, fence, and DOY.
The data I have provided is a small subsample; my real data has 10^6 rows. The following works, but is slow:
model <- function(df) {
  lm(CO2 ~ count,
     data = subset(df, ifelse(period == 1, count > 1 & count < 5,
                       ifelse(period == 2, count > 1 & count < 7,
                                           count > 1 & count < 9))))
}
model_flux <- dlply(flux, .(fence, plot, DOY), model)
rsq <- function(x) summary(x)$r.squared
coefs_flux <- ldply(model_flux, function(x) c(coef(x), rsquare = rsq(x)))
names(coefs_flux)[1:5] <- c("fence", "plot", "DOY", "intercept", "slope")
Here is a "data.table" way to do this:
library(data.table)
flux <- as.data.table(flux)
setkey(flux,count)
flux[, include := (period == 1 & count %in% 2:4) |
                  (period == 2 & count %in% 2:6) |
                  (period == 3 & count %in% 2:8)]
flux.subset <- flux[(include),]
setkey(flux.subset,fence,plot,DOY)
model <- function(df) {
fit <- lm(CO2 ~ count, data = df)
return(list(intercept=coef(fit)[1],
slope=coef(fit)[2],
rsquare=summary(fit)$r.squared))
}
coefs_flux <- flux.subset[,model(.SD),by="fence,plot,DOY"]
Unless I'm missing something, the subsetting you do in each call to model(...) is unnecessary. You can segment the counts by period in one step at the beginning. This code yields the same results as yours, except that your plyr version returns a data frame while this code produces a data.table. It isn't much faster on this test dataset.
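A small variation on building the include flag (a sketch, keeping the same 2-to-upper windows): look the per-period upper bound up from a vector instead of chaining conditions:
upper <- c(4, 6, 8)  # highest count kept in periods 1, 2 and 3
flux[, include := count >= 2 & count <= upper[period]]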