Regression with lagged time time series data in R - r

I have two-time series data A and B. I want to execute the following linear regression in R
A ~ Lags(A, 1:2) + Lags(B, 1:2)
Can you please help me with the R code ?

Using dyn and the built in BOD data frame (which contains two columns, Time and demand) we can specify the indicated lags.
Note that the dplyr package clobbers lag so just in case it is loaded we restore base lag. Note the sign needed with lag.
Using dyn$lm and zoo(BOD) will result in automatic alignment.
If BOD did not include only numeric data it would be necessary to eliminate any columns not used first; however, here BOD is entirely numeric.
library(dyn) # also loads zoo
lag <- stats::lag
fm <- dyn$lm(demand ~ lag(demand, -(1:2)) + lag(Time, -(1:2)), zoo(BOD))
fm
giving:
Call:
lm(formula = dyn(demand ~ lag(demand, -(1:2)) + lag(Time, -(1:2))),
data = zoo(BOD))
Coefficients:
(Intercept) lag(demand, -(1:2))1 lag(demand, -(1:2))2
23.5410 -0.5126 -0.5071
lag(Time, -(1:2))1 lag(Time, -(1:2))2
2.4737 NA
This shows BOD and the model.frame and model.matrix used.
> BOD
Time demand
1 1 8.3
2 2 10.3
3 3 19.0
4 4 16.0
5 5 15.6
6 7 19.8
> model.frame(fm)
demand lag(demand, -(1:2)).1 lag(demand, -(1:2)).2 lag(Time, -(1:2)).1 lag(Time, -(1:2)).2
3 19.0 10.3 8.3 2 1
4 16.0 19.0 10.3 3 2
5 15.6 16.0 19.0 4 3
6 19.8 15.6 16.0 5 4
> model.matrix(fm)
(Intercept) lag(demand, -(1:2))1 lag(demand, -(1:2))2 lag(Time, -(1:2))1 lag(Time, -(1:2))2
3 1 10.3 8.3 2 1
4 1 19.0 10.3 3 2
5 1 16.0 19.0 4 3
6 1 15.6 16.0 5 4
attr(,"assign")
[1] 0 1 1 2 2

Related

Make a loop function by including gender into the algorithm

I have the following data set:
Age<-c(2,2.1,2.2,3.4,3.5,4.2,4.7,4.8,5,5.6,NA, 5.9, NA)
R<-c(2,2.1,2.2,3.4,3.5,4.2,4.7,4.8,5,5.6,NA, 5.9, NA)
sex<-c(1,0,1,1,1,1,1,0,0,0,NA, 0,1)
df1<-data.frame(Age,R,sex)
# Second dataset:
Age2<-seq(2,20,0.25)
Mspline<-rnorm(73)
df2.F<-data.frame(Age2, Mspline)
# Third data
Age2<-seq(2,20,0.25)
Mspline<-rnorm(73)
df2.M<-data.frame(Age2, Mspline)
I was wondering how I can include gender into the calculation and combine these two algorithm to make a loop function. What I need is:
If sex=1 then use the following function to calculate Time
last = dim(df2.F)[1]
fM.F<-approxfun(df2.F$Age2, df2.F$Mspline, yleft = df2.F$Mspline[1] , yright = df2.F$Mspline[last])
df1$Time<-fM.F(df1$Age)
and If sex=0 then use this function to calculate Time
last = dim(df2.M)[1]
fM.M<-approxfun(df2.M$Age2, df2.M$Mspline, yleft = df2.M$Mspline[1] , yright = df2.M$Mspline[last])
df1$Time<-fM.M(df1$Age)
I mean: Read the first record in df1 if it is Female (with age=4.1) the time=fM.F(its age=4.1) but if the gender is Male then to calculate Time apply fM.M on its age so time=fM.M(4.1)
You can create a function that takes the Age vector, the sex value, and the male and female specific dataframes, and selects the frame to use based on the sex value.
f <- function(age, s, m,f) {
if(is.na(s)) return(NA)
if(s==0) df = m
else df = f
last = dim(df)[1]
fM<-approxfun(df$Age2, df$Mspline, yleft = df$Mspline[1] , yright = df$Mspline[last])
fM(age)
}
Now, just apply the function by group, using pull(cur_group(),sex) to get the sex value for the current group.
library(dplyr)
df1 %>%
group_by(sex) %>%
mutate(time = f(Age, pull(cur_group(),sex), df2.M, df2.F))
Output:
Age R sex time
<dbl> <dbl> <dbl> <dbl>
1 2 2 1 -0.186
2 2.1 2.1 0 1.02
3 2.2 2.2 1 -1.55
4 3.4 3.4 1 -0.461
5 3.5 3.5 1 0.342
6 4.2 4.2 1 -0.560
7 4.7 4.7 1 -0.114
8 4.8 4.8 0 0.247
9 5 5 0 -0.510
10 5.6 5.6 0 -0.982
11 NA NA NA NA
12 5.9 5.9 0 -0.231
13 NA NA 1 NA

R describeby function subscript out of bounds error

I'm fairly new to R and I'm trying to get descriptive statistics grouped by multiple variables using the describeby function from the psych package.
Here's what I'm trying to run:
JL <- describeBy(df$JL, group=list(df$Time, df$Cohort, df$Gender), digits=3, skew=FALSE, mat=TRUE)
And I get the error message Error in `[<-`(`*tmp*`, var, group + 1, value = dim.names[[group]][[groupi]]) :
subscript out of bounds
I only get this error message with my Gender variable (which is dichotomous in this datset). I'm able to run the code when I take out the mat=TRUE argument, and I see that it's generating groupings with NULL for Gender. I saw in other answers that this has something to do with the array being out of bounds but I'm not sure how to troubleshoot. Any advice is appreciated.
Thanks so much.
You could use dplyr, with some custom functions added.
library(dplyr)
se <- function(x) sd(x, na.rm=TRUE)/sqrt(length(na.omit(x)))
rnge <- function(x) diff(range(x, na.rm=TRUE))
group_by(df, Time, Cohort, Gender) %>%
summarise_at(vars(JL), .funs=list(n=length, mean=mean, sd=sd, min=min, max=max, range=rnge, se=se)) %>%
as.data.frame()
Using the mtcars dataset:
group_by(mtcars, vs, am, cyl) %>%
summarise_at(vars(mpg), .funs=list(n=length, mean=mean, sd=sd, min=min, max=max, range=rnge, se=se)) %>% as.data.frame()
vs am cyl n mean sd min max range se
1 0 0 8 12 15.1 2.774 10.4 19.2 8.8 0.801
2 0 1 4 1 26.0 NA 26.0 26.0 0.0 NA
3 0 1 6 3 20.6 0.751 19.7 21.0 1.3 0.433
4 0 1 8 2 15.4 0.566 15.0 15.8 0.8 0.400
5 1 0 4 3 22.9 1.453 21.5 24.4 2.9 0.839
6 1 0 6 4 19.1 1.632 17.8 21.4 3.6 0.816
7 1 1 4 7 28.4 4.758 21.4 33.9 12.5 1.798
Using the describBy function from the psych package returns your error:
library(psych)
describeBy(mtcars$mpg, group=list(mtcars$vs, mtcars$am, mtcars$cyl), digits=3, skew=FALSE, mat=TRUE)
Error in [<-(*tmp*, var, group + 1, value =
dim.names[[group]][[groupi]]) : subscript out of bounds
Because not all combinations of the three groups exist in the data.
with(mtcars,
ftable(table(vs,am,cyl)))
# cyl 4 6 8
#vs am
#0 0 0 0 12
# 1 1 3 2
#1 0 3 4 0
# 1 7 0 0

Sum total distance by groups

I have a df tracking movement of points each hour. I want to find the total distance traveled by that group/trial by adding the distance between the hourly coordinates, but I'm confusing myself with apply functions.
I want to say "in each group/trial, sum [distance(hour1-hou2), distance(hour2=hour3), distance(hour3-hour4)....] until current hour so on each line, I have a cumulative distance travelled value.
I've created a fake df below.
paths <- data.frame(matrix(nrow=80,ncol=5))
colnames(paths) <- c("trt","trial","hour","X","Y")
paths$trt <- rep(c("A","B","C","D"),each=20)
paths$trial <- rep(c(rep(1,times=10),rep(2,times=10)),times=4)
paths$hour <- rep(1:10,times=8)
paths[,4:5] <- runif(160,0,50)
#this shows the paths that I want to measure.
ggplot(data=paths,aes(x=X,y=Y,group=interaction(trt,trial),color=trt))+
geom_path()
I probably want to add a column paths$dist.traveled to keep track each hour.
I think I could use apply or maybe even aggregate but I've been using PointDistance to find the distances, so I'm a bit confused. I also would rather not do a loop inside a loop, because the real dataset is large.
Here's an answer that uses {dplyr}:
library(dplyr)
paths %>%
arrange(trt, trial, hour) %>%
group_by(trt, trial) %>%
mutate(dist_travelled = sqrt((X - lag(X))^2 + (Y - lag(Y))^2)) %>%
mutate(total_dist = sum(dist_travelled, na.rm = TRUE)) %>%
ungroup()
If you wanted the total distance but grouped only by trt and not trial you would just remove that from the call to group_by().
Is this what you are trying to achieve?:
paths %>%
mutate(dist.traveled = sqrt((X-lag(X))^2 + (Y-lag(Y))^2))
trt trial hour X Y dist.traveled
<chr> <dbl> <int> <dbl> <dbl> <dbl>
1 A 1 1 11.2 26.9 NA
2 A 1 2 20.1 1.48 27.0
3 A 1 3 30.4 0.601 10.4
4 A 1 4 31.1 26.6 26.0
5 A 1 5 38.1 30.4 7.88
6 A 1 6 27.9 47.9 20.2
7 A 1 7 16.5 35.3 16.9
8 A 1 8 0.328 13.0 27.6
9 A 1 9 14.0 41.7 31.8
10 A 1 10 29.7 7.27 37.8
# ... with 70 more rows
paths$dist.travelled[which(paths$hour==1)] <- NA
paths %>%
group_by(trt)%>%
summarise(total_distance = sum(dist.traveled, na.rm = TRUE))
trt total_distance
<chr> <dbl>
1 A 492.
2 B 508.
3 C 479.
4 D 462.
I am adding the new column to calculate distances for each group, and them sum them up.

How to fit a function for different groups in a data set using R

Please, how can I fit a function for different groups in a data set (Soil) using R. the first column is the group i.e. Plot and the second column is the observed variable i.e. Depth
Plot Depth
1 12.5
1 14.5
1 15.8
1 16.1
1 18.9
1 21.2
1 23.4
1 25.7
2 13.1
2 15.0
2 15.8
2 16.3
2 17.4
2 18.6
2 22.6
2 24.1
2 25.6
3 11.5
3 12.2
3 13.9
3 14.7
3 18.9
3 20.5
3 21.6
3 22.6
3 24.1
3 25.8
4 10.2
4 21.5
4 15.1
4 12.3
4 10.0
4 13.5
4 16.5
4 19.2
4 17.6
4 14.1
4 19.7
I used the 'for' statement but only saw output for Plot 1.
This was how I applied the 'for' statement:
After importing my data in R, I saved it as: SNq,
for (i in 1:SNq$Plot[i]) {
dp <- SNq$Depth[SNq$Plot==SNq$Plot[i]]
fit1 = fitdist(dp, "gamma") ## this is the function I'm fitting. The function is not the issue. My challenge is the 'for' statement.
fit1
}
I think this should work. Just make one change in your code:
Why would it work ?
Because: unique function will return unique values (1,2,3) which are nothing but the groups in Plot column. With unique value, we can subset the data using SNq$Depth[SNq$Plot==i] and get depth value for that group.
for (i in unique(SNq$Plot)) { # <- here
dp <- SNq$Depth[SNq$Plot==i]
fit1 = fitdist(dp, "gamma") ## this is the function I'm fitting. The function is not the issue. My challenge is the 'for' statement.
plot(fit1)
}
A tidyverse suggestion:
library("tidyverse")
library("fitdistrplus")
fits <- SNq %>%
group_by(Plot) %>%
nest() %>%
mutate(fits = map(data, ~ fitdist(data = .$Depth, distr = "gamma")),
summaries = map(fit, summary))
You could continue with print(fits$fits) and print(fits$summaries) to access the different fits and their summary. Alternatively you can use a syntax like fits$fits[[1]] and fits$summaries[[1]] to access them.
Try:
for (i in 1:nrow(SNq)) {
dp <- SNq$Depth[SNq$Plot==SNq$Plot[i]]
fit1 = fitdist(dp, "gamma")
fit1
}

Data cleaning in R, aggregation of multiple columns [duplicate]

This is best illustrated with an example
str(mtcars)
mtcars$gear <- factor(mtcars$gear, labels=c("three","four","five"))
mtcars$cyl <- factor(mtcars$cyl, labels=c("four","six","eight"))
mtcars$am <- factor(mtcars$am, labels=c("manual","auto")
str(mtcars)
tapply(mtcars$mpg, mtcars$gear, sum)
That gives me the summed mpg per gear. But say I wanted a 3x3 table with gear across the top and cyl down the side, and 9 cells with the bivariate sums in, how would I get that 'smartly'.
I could go.
tapply(mtcars$mpg[mtcars$cyl=="four"], mtcars$gear[mtcars$cyl=="four"], sum)
tapply(mtcars$mpg[mtcars$cyl=="six"], mtcars$gear[mtcars$cyl=="six"], sum)
tapply(mtcars$mpg[mtcars$cyl=="eight"], mtcars$gear[mtcars$cyl=="eight"], sum)
This seems cumbersome.
Then how would I bring a 3rd variable in the mix?
This is somewhat in the space I'm thinking about.
Summary statistics using ddply
update This gets me there, but it's not pretty.
aggregate(mpg ~ am+cyl+gear, mtcars,sum)
Cheers
How about this, still using tapply()? It's more versatile than you knew!
with(mtcars, tapply(mpg, list(cyl, gear), sum))
# three four five
# four 21.5 215.4 56.4
# six 39.5 79.0 19.7
# eight 180.6 NA 30.8
Or, if you'd like the printed output to be a bit more interpretable:
with(mtcars, tapply(mpg, list("Cylinder#"=cyl, "Gear#"=gear), sum))
If you want to use more than two cross-classifying variables, the idea's exactly the same. The results will then be returned in a 3-or-more-dimensional array:
A <- with(mtcars, tapply(mpg, list(cyl, gear, carb), sum))
dim(A)
# [1] 3 3 6
lapply(1:6, function(i) A[,,i]) # To convert results to a list of matrices
# But eventually, the curse of dimensionality will begin to kick in...
table(is.na(A))
# FALSE TRUE
# 12 42
I think the answers already on this question are fantastic options, but I wanted to share an additional option based on the dplyr package (this came up for me because I'm teaching a class right now where we use dplyr for data manipulation, so I wanted to avoid introducing students to specialized base R functions like tapply or aggregate).
You can group on as many variables as you want using the group_by function and then summarize information from these groups with summarize. I think this code is more readable to an R newcomer than the formula-based interface of aggregate, yielding identical results:
library(dplyr)
mtcars %>%
group_by(am, cyl, gear) %>%
summarize(mpg=sum(mpg))
# am cyl gear mpg
# (dbl) (dbl) (dbl) (dbl)
# 1 0 4 3 21.5
# 2 0 4 4 47.2
# 3 0 6 3 39.5
# 4 0 6 4 37.0
# 5 0 8 3 180.6
# 6 1 4 4 168.2
# 7 1 4 5 56.4
# 8 1 6 4 42.0
# 9 1 6 5 19.7
# 10 1 8 5 30.8
With two variables, you can summarize with one variable on the rows and the other on the columns by adding a call to the spread function from the tidyr package:
library(dplyr)
library(tidyr)
mtcars %>%
group_by(cyl, gear) %>%
summarize(mpg=sum(mpg)) %>%
spread(gear, mpg)
# cyl 3 4 5
# (dbl) (dbl) (dbl) (dbl)
# 1 4 21.5 215.4 56.4
# 2 6 39.5 79.0 19.7
# 3 8 180.6 NA 30.8
I like Josh's answer for this, but reshape2 can also provide a nice framework for these type of problems:
library(reshape2)
#use subset to only grab the variables of interest...
mtcars.m <- melt(subset(mtcars, select = c("mpg", "gear", "cyl")), measure.vars="mpg")
#cast into appropriate format
dcast(mtcars.m, cyl ~ gear, fun.aggregate=sum, value.var="value")
cyl three four five
1 four 21.5 215.4 56.4
2 six 39.5 79.0 19.7
3 eight 180.6 0.0 30.8
The answer contains same output using tapply and aggregate function.
I would like to add some information to Josh O'Brien's answer. User can either use aggregate function or tapply depending on output. In order to use more than one factor variable in tapply one can use the method Josh has shown.
Loading dataset
data("mtcars")
Using tapply
with(mtcars, tapply(mpg, list("Cylinder#"=cyl, "Gear#"=gear), sum))
The output of above code is
Gear#
Cylinder# 3 4 5
4 21.5 215.4 56.4
6 39.5 79.0 19.7
8 180.6 NA 30.8
Using aggregate function
with(mtcars, aggregate(mpg, list(Cylinder = cyl, Gear = gear), sum))
Output of aggregate function
Cylinder Gear x
1 4 3 21.5
2 6 3 39.5
3 8 3 180.6
4 4 4 215.4
5 6 4 79.0
6 4 5 56.4
7 6 5 19.7
8 8 5 30.8
Now if the user wants same output as aggregate function but using tapply.
as.data.frame(as.table(with(mtcars, tapply(mpg, list("Cylinder#"=cyl, "Gear#"=gear),
sum))))
Output of tapply function
Cylinder. Gear. Freq
1 4 3 21.5
2 6 3 39.5
3 8 3 180.6
4 4 4 215.4
5 6 4 79.0
6 8 4 NA
7 4 5 56.4
8 6 5 19.7
9 8 5 30.8
NA's can be kept or removed as per business requirements.

Resources