How to make linear regression for time intervals? - r

I have two to three hours of data measured in seconds. I want to split this into 11 intervals and fit a linear regression on each interval.
The first time interval could be 7-17 minutes and the next 18-27 minutes. My data has a column of seconds and a column for the measurement in the chamber.
I have started by making a plot:
library(readr)
# read the tab-delimited measurement file
s24kul05p <- read.delim("C:/Data/24skulp05.txt", quote="")
View(s24kul05p)
s24kul05p
head(s24kul05p)
tail(s24kul05p)
# plot the chamber measurement against time and add a single overall regression line
plot(Ch1~Min, data=s24kul05p, ylim=c(170,250), xlim=c(1, 151), col="red")
abline(lm(Ch1~Min, data=s24kul05p))
After this I get a plot with a single fitted line. Would it be possible to fit 11 separate linear models, one per interval?

Drop the measurements into a matrix with 11 columns, then turn it back into a data.frame. You'll then have 11 variables to run regressions on.
Y <- runif(231)
M <- matrix(Y, ncol = 11)
M <- as.data.frame(M)
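If you would rather get the 11 fits straight from the original data frame, a minimal sketch along the same lines (assuming the Min and Ch1 columns from the question) could be:
# split the time axis into 11 equal-width bins and fit one lm() per bin
# (adjust breaks to your own cut points, e.g. breaks = c(7, 17, 27, ...))
s24kul05p$interval <- cut(s24kul05p$Min, breaks = 11)
fits <- lapply(split(s24kul05p, s24kul05p$interval),
               function(d) lm(Ch1 ~ Min, data = d))
lapply(fits, coef)  # intercept and slope for each of the 11 intervals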

Related

How to run a regression row by row

I just started using R for statistical purposes and I appreciate any kind of help.
As a first step, I ran a time series regression over my columns: the Y values are the dependent variables and X is the explanatory variable.
# example
Y1 <- runif(100, 5.0, 17.5)
Y2 <- runif(100, 4.0, 27.5)
Y3 <- runif(100, 3.0, 14.5)
Y4 <- runif(100, 2.0, 12.5)
Y5 <- runif(100, 5.0, 17.5)
X <- runif(100, 5.0, 7.5)
df1 <- data.frame(X, Y1, Y2, Y3, Y4, Y5)
# calculating log returns to provide data for the first regression
n <- nrow(df1)
X_logret <- log(X[2:n])-log(X[1:(n-1)])
Y1_logret <- log(Y1[2:n])-log(Y1[1:(n-1)])
Y2_logret <- log(Y2[2:n])-log(Y2[1:(n-1)])
Y3_logret <- log(Y3[2:n])-log(Y3[1:(n-1)])
Y4_logret <- log(Y4[2:n])-log(Y4[1:(n-1)])
Y5_logret <- log(Y5[2:n])-log(Y5[1:(n-1)])
# bringing the calculated log returns together in one data frame
df2 <- data.frame(X_logret, Y1_logret, Y2_logret, Y3_logret, Y4_logret, Y5_logret)
# running the time series regression
Regression <- lm(as.matrix(df2[c('Y1_logret', 'Y2_logret', 'Y3_logret', 'Y4_logret', 'Y5_logret')]) ~ df2$X_logret)
# extracting the coefficients for further calculation
Regression$coefficients[2,(1:5)]
As a second step I want to run a regression row by row, i.e. day by day, since the data contains daily observations. I also have a column "DATE", but I didn't know how to include it in the example; its format is POSIXct, so maybe someone has an idea how to refer to a certain period in it over which the regression should be run.
In the row-by-row regression I would like to use the 5 coefficients from the first regression as the explanatory variable, and the 5 Y_logret values as the dependent variable:
Y_logret(1 to 5) = Beta * Regression$coefficients[2,(1:5)] + error. The intercept is not needed, so I would set it to zero by adding + 0 in the lm() formula.
My goal is to run this regression over a period of time, for example over 20 days. Day by day, this would give a total of 20 Beta estimates (one regression per day), but I also need all the errors for further calculation, so I have to extract 5 errors per day, i.e. 20*5 error values in total.
This is just an example; in the original dataset I have 20 Y variables and over 4000 rows, and I would like to run the regression over intervals of 900-1000 days. Since I am completely new to R, I have no idea how to proceed, especially how to code this in a few lines.
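For illustration only, a rough sketch of what that day-by-day step might look like, reusing the example objects above (not a working solution, just to make the structure concrete):
# second-stage regression, one model per day/row, no intercept
betas <- Regression$coefficients[2, 1:5]   # first-stage slope coefficients
days <- 1:20                               # e.g. the first 20 days/rows of df2
second_stage <- lapply(days, function(i) {
  y <- unlist(df2[i, c('Y1_logret','Y2_logret','Y3_logret','Y4_logret','Y5_logret')])
  lm(y ~ betas + 0)                        # + 0 drops the intercept, as described
})
beta_per_day <- sapply(second_stage, coef)   # 20 Beta estimates
errors <- sapply(second_stage, resid)        # 5 residuals per day (5 x 20 matrix)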
I really appreciate any kind of help.

How to create a sliding window in R to divide data into test and train samples to test accuracy of forecasts?

We are using the forecast package in R to read 3 weeks worth of hourly data (3*7*24 data points) and make predictions for the next 24 hours. It's a time-series with multiple seasonality.
We have the forecast model running just fine and it seems to be doing well. Now, we wish to quantify the accuracy of our approach / forecasting algorithm for our data. We wish to use the accuracy function in the forecast package for this purpose. We understand that the accuracy function works such that if f is the forecast and x is the actual observation vector, then accuracy(f, x) gives us several accuracy measures for this forecast.
We have data from the past several months and we wish to write a sliding window algorithm that picks (3*7*24) hourly values, predicts the next 24 hours, compares these predictions against the actual data for the next day / 24 hours, displays the accuracy, then slides the window forward by 24 points / hours (the next day) and repeats.
The sample data is generated as follows:
library("forecast")
time <- 1:(12*168)
set.seed(1)
ds <- msts(sin(2*pi*time/24)+c(1,1,1.2,0.8,1,0,0)[((time-1)%/%24)%%7+1]+ time/400+rnorm(length(time),0,0.2),seasonal.periods=c(24,168))
plot(ds)
head(ds)
tail(ds)
length(ds)
length(time)
Forecasting procedure is as follows:
model <- tbats(ds[1:504])
fcst <- forecast(model,h=24,level=90)
accuracy(fcst,ds[505:528]) ##Test accuracy of forecast against next/actual 24 values
Now, we wish to slide the "window" by 24 and repeat the same procedure, that is, the next set of values used to build the model will be ds[25:528] and their accuracy will be tested against ds[529:552] ... and so on. How can we implement this?
Also, is there a better way to test overall accuracy of this forecasting algorithm for our scenario?
I would do this by creating a vector of times representing the front edge of the sliding windows, then using lapply to iterate the forecasting and scoring process over the windows those edges imply. Like...
# set a couple of parameters we'll use to slice the series into chunks:
# window width (w) and the time step at which you want to end the first
# training set
w = 24 ; start = 504
# now use those parameters to make a vector of the time steps at which each
# window will end
steps <- seq(start + w, length(ds), by = w)
# using lapply, iterate the forecasting-and-scoring process over the
# windows those end points define
cv_list <- lapply(steps, function(x) {
  train <- ds[1:(x - w)]
  test <- ds[(x - w + 1):x]
  model <- tbats(train)
  fcst <- forecast(model, h = w, level = 90)
  accuracy(fcst, test)
})
Example output for the first window:
> cv_list[[1]]
ME RMSE MAE MPE MAPE MASE
Training set 0.0001587681 0.3442898 0.2689754 34.3957362 84.30841 0.9560206
Test set 0.2619029897 0.8961109 0.7868256 -0.6832273 36.64301 2.7966186
ACF1
Training set 0.02588145
Test set NA
If you want summaries of the scores for the whole list, you can do something like...
rmse <- mean(unlist(lapply(cv_list, '[[', "Test set","RMSE")))
...which produces this:
[1] 1.011177
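Note that train <- ds[1:(x - w)] grows the training set on every pass (an expanding window). If you want the fixed-width window described in the question (ds[25:528] next, tested against ds[529:552], and so on), a sketch of one possible variation:
# strictly sliding window: fixed 504-point training set, stepping forward 24 hours
width <- 504; h <- 24
starts <- seq(1, length(ds) - width - h + 1, by = h)
cv_slide <- lapply(starts, function(s) {
  train <- ds[s:(s + width - 1)]
  test <- ds[(s + width):(s + width + h - 1)]
  accuracy(forecast(tbats(train), h = h, level = 90), test)
})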

calculate lag from phase arrows with biwavelet in r

I'm trying to understand the cross wavelet function in R, but can't figure out how to convert the phase lag arrows to a time lag with the biwavelet package. For example:
require(gamair)
data(cairo)
data_1 <- within(cairo, Date <- as.Date(paste(year, month, day.of.month, sep = "-")))
data_1 <- data_1[,c('Date','temp')]
data_2 <- data_1
# add a lag
n <- nrow(data_1)
nn <- n - 49
data_1 <- data_1[1:nn,]
data_2 <- data_2[50:nrow(data_2),]
data_2[,1] <- data_1[,1]
require(biwavelet)
d1 <- data_1[,c('Date','temp')]
d2 <- data_2[,c('Date','temp')]
xt1 <- xwt(d1,d2)
plot(xt1, plot.phase = TRUE)
These are my two time series. Both are identical, but one lags the other. The arrows suggest a phase angle of 45 degrees; apparently arrows pointing straight up or down indicate a 90 degree lead or lag, so my interpretation is that I'm looking at a lag of 45 degrees.
How would I now convert this to a time lag, i.e. how would I calculate the time lag between these signals?
I've read online that this can only be done for a specific wavelength (which I presume means for a certain period?). So, given that we're interested in a period of 365, and the time step between the signals is one day, how would one calculate the time lag?
So I believe you're asking how you can determine the lag time given two time series (in this case you artificially added a lag of 49 days).
I'm not aware of any packages that make this a one-step process, but since we are essentially dealing with sine waves, one option would be to "zero out" the waves and then find the zero crossing points. You could then calculate the average distance between the zero crossing points of wave 1 and wave 2. If you know the time step between measurements, you can easily calculate the lag time (in this case the time between measurement steps is one day).
Here is the code I used to accomplish this:
#smooth the data to get rid of the noise that would introduce excess zero crossings
#subtracted 70 from the temp to introduce a "zero" approximately in the middle of the wave
spline1 <- smooth.spline(data_1$Date, y = (data_1$temp - 70), df = 30)
plot(spline1)
#add the smoothed y back into the original data just in case you need it
data_1$temp_smoothed <- spline1$y
#do the same for wave 2
spline2 <- smooth.spline(data_2$Date, y = (data_2$temp - 70), df = 30)
plot(spline2)
data_2$temp_smoothed <- spline2$y
#function for finding zero crossing points, adapted from the msProcess package
#(the original relies on helper functions from that package; this version uses base R only)
zeroCross <- function(x, slope = "positive")
{
  slope <- match.arg(slope, c("positive", "negative"))
  # indices just after the series crosses zero in the requested direction
  ipost <- if (slope == "negative") {
    sort(which(c(x, 0) < 0 & c(0, x) > 0))
  } else {
    sort(which(c(x, 0) > 0 & c(0, x) < 0))
  }
  # of the two samples straddling each crossing, keep the one closer to zero
  offset <- apply(matrix(abs(x[c(ipost - 1, ipost)]), nrow = 2, byrow = TRUE),
                  MARGIN = 2, order)[1, ] - 2
  ipost + offset
}
#find zero crossing points for the two waves
zcross1 <- zeroCross(data_1$temp_smoothed, slope = 'positive')
length(zcross1)
[1] 10
zcross2 <- zeroCross(data_2$temp_smoothed, slope = 'positive')
length(zcross2)
[1] 11
#join the two vectors as a data.frame (using only the first 10 crossing points for wave2 to avoid any issues of mismatched lengths)
zcrossings <- as.data.frame(cbind(zcross1, zcross2[1:10]))
#calculate the mean of the crossing point differences
mean(zcrossings$zcross1 - zcrossings$V2)
[1] 49
I'm sure there are more elegant ways of going about this, but it should get you the information that you need.
In my case, for the semidiurnal tide, 90 degrees corresponds to about 3 hours (90 * 12.5 hours / 360 = 3.125 hours), where 12.5 hours is the semidiurnal period. So 45 degrees corresponds to 45 * 12.5 / 360 = 1.56 hours.
Thus in your case (period of 365 days, daily time step):
90 degrees -> 90 * 365 / 360 = 91.25 days.
45 degrees -> 45 * 365 / 360 = 45.625 days.
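As a quick check in R, the same conversion (phase angle to lag, in whatever units the period is in) can be wrapped in a small helper (the function name here is just illustrative):
# phase angle in degrees -> lag, in the units of the period
phase_to_lag <- function(phase_deg, period) phase_deg / 360 * period
phase_to_lag(45, 365)   # 45.625 days for the 365-day period above
phase_to_lag(90, 12.5)  # 3.125 hours for the 12.5-hour semidiurnal period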
My understanding is as follows:
For there to be a simple cause-and-effect relationship between the phenomena recorded in the time series, we would expect the oscillations to be phase-locked (Grinsted 2004); so the period where you find the "in phase" arrow (--->) indicates the lag between the signals.
See the simulated examples with different distances between the cause-and-effect phenomena; observe that the greater the distance, the greater the period at which the "in phase" arrow occurs in the cross wavelet transform.
Grinsted, A., Moore, J. C., and Jevrejeva, S. (2004). Nonlinear Processes in Geophysics 11: 561–566. SRef-ID: 1607-7946/npg/2004-11-561.
See the example here

ARIMA forecasts with R - how to update data

I've been trying to develop an ARIMA model to forecast wind speed values. I have a four-year data series (from January 2008 until December 2011). The series contains 10-minute data, which means 144 observations per day. I'm using the first three years (observations 1 to 157157) to build the model and the last year to validate it.
The thing is, I want to update the forecast. In other words, when one forecast finishes, more data is added to the dataset and another forecast is performed. But the result looks as if I had just lagged the original series. Here's the code:
#1 - Load data:
z=read.csv('D:/Faculdade/Mestrado/Dissertação/velocidade/tudo_10m.csv', header=T, dec=".")
vel=ts(z, start=c(2008,1), frequency=52000)
# 5 - ARIMA Forecasts:
library(forecast)
n=157157
while(n<=157200){
amostra <- vel[1:n] # Only data until 2010
pred <- auto.arima(amostra, seasonal=TRUE,
ic="aicc", stepwise=FALSE, trace=TRUE,
approximation=TRUE, xreg=NULL,
test="adf",
allowdrift=TRUE, lambda=NULL, parallel=TRUE, num.cores=4)
velpred <- arima(pred) # Is this step really necessary?
velpred
predvel<- forecast(pred, h=12) # h means the forecast steps ahead
predvel
plot(amostra, xlim=c(157158, n), ylim=c(0,20), col="blue", main="Previsões e Observações", type="l", lty=1)
lines(fitted(predvel), xlim=c(157158, n), ylim=c(0,20), col="red", lty=2)
n=n+12
}
But when I plot the results (I couldn't post the picture here), the forecast series looks just like the observed series, only lagged by one step.
Can anyone help me examine my code and/or give tips on how to get the best out of my model? Thanks! (I hope my English is understandable...)
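For reference, a rough sketch of the same update loop that stores each pass's out-of-sample forecasts (predvel$mean in the code above) and then plots them against the held-out observations; this is only an illustration of the rolling setup, not a verified fix:
# sketch: collect the 12-step-ahead forecasts from each pass, then compare
# them to the actual observations (uses the vel series defined above)
library(forecast)
n <- 157157
fc_all <- numeric(0)
while (n <= 157200) {
  fit <- auto.arima(vel[1:n], seasonal = TRUE, ic = "aicc")
  fc <- forecast(fit, h = 12)
  fc_all <- c(fc_all, as.numeric(fc$mean))  # out-of-sample forecasts, not fitted()
  n <- n + 12
}
obs <- vel[157158:(157157 + length(fc_all))]
plot(obs, type = "l", col = "blue", ylab = "wind speed")
lines(fc_all, col = "red", lty = 2)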

Linear interpolation in loops

I have two files. One file has five observations of “Flux” across a three-treatment experiment (treatments = A, B, C). In these three treatments, temperatures have been manipulated. The observations of Flux are taken at five points in a 24-hour period. The second file (Temp) contains the temperatures for the three treatments across the 24-hour period.
I would like to use linear interpolation to predict what the Flux values will be at every hour during the 24 hour period. Note that the interpolation equations will be slightly different between the three treatments.
Can this be done in a loop so that the values of flux are estimated for each hour in the Temp.csv file? Then have the values integrated (summed) across the 24 hour period?
The files are available on dropbox here: Temp Data
This shows the different slopes of the best-fit linear relationships between flux and temperature across the three treatments:
#subset data in flux by treatment
fluxA<-flux[which(flux$Treatment=='A'),]
fluxB<-flux[which(flux$Treatment=='B'),]
fluxC<-flux[which(flux$Treatment=='C'),]
#Regression of Flux~Temperature
modelA<-lm (Flux~Temperature, data=fluxA)
summary (modelA)
modelB<-lm (Flux~Temperature, data=fluxB)
summary (modelB)
modelC<-lm (Flux~Temperature, data=fluxC)
summary (modelC)
#plot the regressions
plot (Flux~Temperature, data=fluxA,pch=16, xlim=c(0,28), ylim=c(0,20))
abline(modelA)
points(Flux~Temperature, data=fluxB,pch=16, col="orange")
abline(modelB, col="orange")
points(Flux~Temperature, data=fluxC,pch=16, col="red")
abline(modelC, col="red")
caldat <- read.csv(text="Treatment,Temperature,Flux
A,18.64,7.75
A,16.02,8.49
A,17.41,9.24
A,21.06,4.42
A,22.8,5.61
B,19.73,5.7
B,17.45,8.37
B,19.2,5.27
B,20.97,3.37
B,27.6,2.26
C,23.79,9.91
C,15.89,15.8
C,21.93,10.28
C,24.79,6.33
C,26.64,6.64
")
plot(Flux~Temperature, data=caldat, col=factor(Treatment))
mod <- lm(Flux~Temperature*Treatment, data=caldat)
summary(mod)
points(rep(seq(16,28, length.out=1e3),3),
predict(mod, newdata=data.frame(Temperature=rep(seq(16,28, length.out=1e3),3),
Treatment=rep(c("A", "B", "C"), each=1e3))),
pch=".", col=rep(1:3, each=1e3))
You'll need to consider carefully if this is an appropriate and "good" model. Use standard regression diagnostics.
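For example, the standard base R diagnostics for the fitted model:
# residuals vs fitted, normal Q-Q, scale-location and leverage plots
par(mfrow = c(2, 2))
plot(mod)
par(mfrow = c(1, 1))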
preddata <- read.csv(text="Time,A,B,C
100,17.8,21.64,23.04
200,17.5,21.3,22.7
300,17.23,21,22.39
400,16.92,20.67,22.08
500,16.47,20.3,21.74
600,15.78,19.75,21.24
700,15.19,19.14,20.63
800,14.58,18.47,20
900,14.22,17.99,19.49
1000,13.77,17.55,19.08
1100,13.39,17.02,18.62
1200,13.34,16.76,18.26
1300,13.17,16.62,18.05
1400,13.24,16.58,17.91
1500,13.31,16.63,17.86
1600,13.26,16.61,17.81
1700,13.12,16.57,17.75
1800,12.9,16.45,17.65
1900,12.74,16.32,17.54
2000,12.57,16.2,17.42
2100,12.36,16.04,17.28
2200,12.1,15.83,17.1
2300,11.79,15.57,16.88
2400,11.53,15.3,16.64
")
library(reshape2)
preddata <- melt(preddata, id="Time",
variable.name="Treatment", value.name="Temperature")
preddata$Flux <- predict(mod, newdata=preddata)
plot(Flux~Time, data=preddata, col=Treatment)
Sum the fluxes:
aggregate(Flux ~ Treatment, data=preddata, FUN=sum)
# Treatment Flux
#1 A 247.5572
#2 B 159.6803
#3 C 309.6186
