Multivariate Time-Series Forecasting in R

I have a data set that consists of a combination of daily and non-daily data that are location specific. The location is specified by longitude and latitude (columns V1 and V2). Column V3 represents a location-specific value, a heart-disease prevalence indicator. Columns V4 and V5 represent daily, location-specific wind-speed recordings for days 1-3 and days 2-4 respectively. Column V6 holds the daily ER death counts for days 1-3; V7 the deaths for days 2-4; V8 the deaths for days 3-5.
I would like to build a multiple linear regression model that can predict the new deaths for the third day in every location, given the values in V1 to V7.
Here is an example of what the data set looks like and what I am trying to do:
d <- as.data.frame(matrix(c(
  1,1,1,-2,-2,-2,14,14,14,90,90,90,103,103,103,-6,-6,-6,
  50,50,50,70,70,70,112,112,112,11,11,11,8,8,8,26,26,26,
  1.2,1.2,1.2,0.8,0.8,0.8,1.3,1.3,1.3,0.7,0.7,0.7,1.7,1.7,1.7,2,2,2,
  10,20,17,20,25,26,60,70,70,10,12,13,109,117,120,61,67,63,
  20,17,18,25,26,24,70,70,90,12,13,11,117,120,110,67,63,64,
  0,4,5,1,4,6,5,7,9,12,23,4,7,6,5,8,9,12,
  4,5,6,4,6,9,7,9,13,23,4,12,6,5,25,9,12,40,
  5,6,16,6,9,30,9,13,32,4,12,23,5,25,32,12,40,61
), nrow = 18, ncol = 8))
dtrain <- d[1:12,]
dtest <- d[13:18,][1:7]
l <- lm(V8 ~ V1 +V2 + V3 + V4 + V5 + V6 + V7 , data = dtrain)
p <- predict(l,dtest)
I would like to know whether having non-daily and daily data mixed together will affect the accuracy of a regression model in such a case. I am new to machine learning and am unsure about the shape of my dataset and how to tackle multivariate regression (in this case, a time series too).
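For reference, a minimal sketch of how the held-out predictions could be checked against the observed V8 values:
# compare the held-out predictions with the observed third-day deaths (V8)
obs <- d[13:18, "V8"]
rmse <- sqrt(mean((p - obs)^2))
rmse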

Related

How to compute differences in intercept and slope in linear mixed effect regression?

I am trying to perform a simple linear mixed effect regression for the day of green-up in the Arctic.
I would like to find the effect that weather has on the green-up day within each region (16 different regions pan-Arctic), and if the green-up day differs significantly (intercept) between the different regions (ANOVA), and which region then actually differs (posthoc test).
My data, after centering and scaling, look as follows:
library(readr)
library(blme)
data <- read_csv("data.csv")
data.sc <- data.frame(scale(data))
head(data.sc)
regions year greenup V1 V2 V3 V4 V5 V6 V7 V8 V9 V10
1 1 -1.687957 0.9336382 0.85187965 0.71761508 3.16360255 1.6670476 -0.68640856 -0.655334953 -1.799065 1.5577978 3.0125643 -0.5009276
2 1 -1.601395 0.2370278 -0.30385233 0.59926804 0.04203437 1.3281044 -0.62391808 -0.592095987 -1.590912 0.6735797 1.2016789 -0.5955858
3 2 -1.514833 0.8030237 -0.05341494 0.05770376 -0.03230812 0.4116001 -0.34697632 -0.311837094 -1.428521 0.2459790 0.7906408 -0.7097304
4 2 -1.428272 -0.1548155 -0.52504749 0.21499928 -0.53588809 1.3991222 -0.70552873 -0.674684140 -1.491065 -0.5336937 0.1615725 -0.6405863
5 3 -1.341710 0.4111804 -0.55526813 0.41495915 -0.28220429 1.0430792 -0.69965688 -0.668741960 -1.440828 -0.1372223 0.3667727 -0.8735882
6 3 -1.255148 1.7717476 0.37503107 0.49582150 3.25563398 1.1052932 -0.06417199 -0.005461234 -1.859388 1.6666640 3.6155112 -0.6626455
I therefore fit a linear mixed model using the Bayesian package blme in R to avoid the boundary (singular) fit warning. Here I make sure I get the global intercept and slope, but also the deviation from each of these for every region. Hence, my model looks as follows:
reg <- blmer(greenup ~ V1+V2+V3+V4+V5+V6+V7+V8 + (1+V1+V2+V3+V4+V5+V6+V7+V8 | regions), data = data.sc, REML = FALSE)
ranef(reg)
(Intercept) V1 V2 V3 V4 V5 V6 V7 V8
1 -1.5088397 -0.678780774 -0.28176494 0.03297057 0.14762373 0.135628050 0.03870133 -0.16881483 0.01049209
2 1.4277155 0.002705333 0.63878199 0.04527960 -0.29932021 -0.092542944 -0.05871342 0.12192872 0.16659908
3 -1.1971171 -0.643673842 -0.36725321 -0.00437876 0.09999908 -0.195088268 0.02960333 -0.07320157 0.21971584
4 -1.8725315 -0.785756089 -0.38010681 -0.02575994 0.28474568 0.840094435 -0.13247112 -0.05102267 -0.77963082
5 1.0132713 -0.429509626 -0.31425681 0.29430628 -0.19690372 0.282548814 0.39303350 -0.12090616 0.69670371
6 -2.4881956 -1.692458471 -0.09449177 0.01172166 0.06284188 -0.844244673 0.79118303 -0.34123995 1.04264827
7 3.1058386 0.826832058 1.11178582 0.02043532 -0.47408031 -0.338428604 -0.33133031 0.31261663 0.10124955
8 1.9322901 3.317059011 -0.30085324 -0.22707802 0.28574031 -0.569026041 -0.42553220 0.17865665 -0.47420030
9 -1.2204600 -2.438705399 1.87732120 -0.07449666 -0.38793703 0.188712028 -0.38356492 0.11440041 -0.52707496
10 2.0227413 1.395666163 0.67593119 -0.14830203 -0.16997007 -0.385349601 -0.34510500 0.26727338 -0.24816029
11 -1.9223249 -0.767426078 -0.23096931 -0.10032745 0.14049704 0.003901228 0.01478041 -0.07175487 -0.12908571
12 0.0149983 -0.492942263 -0.40391390 0.16250771 -0.06522357 0.566016764 0.35380727 -0.11917404 0.19111680
13 -0.7919515 -0.731003237 -0.34630359 0.17115953 0.10735851 0.940255206 -0.07730108 -0.11185924 -0.37309296
14 4.7216459 4.388771043 -1.08519982 0.01923765 0.10673146 -0.205195431 -0.03391026 0.21661697 0.12766651
15 0.1163438 -0.623755530 -0.09248300 0.15112399 -0.08205915 0.193081476 0.03723803 -0.03449855 0.24479648
16 -3.3534245 -0.647022299 -0.40622380 -0.32839946 0.43995636 -0.520362438 0.12958139 -0.11902088 -0.26974330
How do I proceed to investigate how the intercept (greenup day) differs between the random effect (regions)?
How do I also look into how the variables may have a different influence on the green-up day in each region?
This may end up being better for CrossValidated.
Probably the easiest way to extract the random effects values with confidence intervals (see caveats below) is
library(broom.mixed)
tt <- (tidy(reg, effects = "ran_vals", conf.int = TRUE)
|> dplyr::filter(term == "(Intercept)")
|> dplyr::select(level, estimate, conf.low, conf.high)
)
This is (obviously) just looking at the intercept term.
You can look at the estimated variances of the different random-effect components and compare their magnitudes.
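For example, a minimal sketch using lme4's VarCorr (which should also work on the blmer fit reg):
vc <- as.data.frame(VarCorr(reg))
vc <- vc[is.na(vc$var2), c("grp", "var1", "sdcor")]  # drop the correlation rows
vc[order(-vc$sdcor), ]                               # random-effect (and residual) SDs, largest first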
If you call lattice::dotplot(ranef(reg)) you will see all of the random-effects conditional modes, with their conditional standard deviations, with regions ordered by their intercept. You might want lattice::dotplot(ranef(reg), scales = "free") instead.
If you know ggplot you can use the output of tidy above to plot more flexibly.
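For instance, a minimal ggplot2 sketch using the tt data frame created above:
library(ggplot2)
ggplot(tt, aes(x = estimate, y = reorder(level, estimate))) +
  geom_pointrange(aes(xmin = conf.low, xmax = conf.high)) +
  labs(x = "Conditional mode of the intercept", y = "Region")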
Formal testing is more difficult.
You could fit a model without the intercept variation:
reduced <- update(reg, . ~
V1+V2+V3+V4+V5+V6+V7+V8+(0 + V1+V2+V3+V4+V5+V6+V7+V8|regions))
anova(reduced, reg)
This won't work sensibly if any of your covariates is a factor (removing the intercept will just reparameterize the model, not actually drop the intercept variation).
I would not be surprised if the variation in all of the other effects could compensate for the restriction that the intercept isn't allowed to vary.
You could do this for any of the random-effects terms.
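For example, a sketch dropping the random slope for V1 instead (the same caveats about factors and compensation apply):
reduced_V1 <- update(reg, . ~ V1+V2+V3+V4+V5+V6+V7+V8 +
                       (1 + V2+V3+V4+V5+V6+V7+V8 | regions))
anova(reduced_V1, reg)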
The idea of post-hoc testing to figure out which of the regions differs significantly in its intercept is right out the window, for theoretical reasons: when you estimate variation as a random effect, the predicted values for individual levels (called "best linear unbiased predictors" [BLUPs] or "conditional modes") are no longer parameter estimates in the formal sense, with sampling distributions, p-values, etc. In the frequentist world, this is one of the things you sacrifice by using a random effect.

Cross-correlation of autocorrelated variables with annual time step

I am trying to analyze cross-correlations between pairs from a set of many variables with an annual time step. Currently, I have difficulties dealing with auto-correlated variables...
For each variable, I first test whether it is non-stationary (with both the Augmented Dickey–Fuller test and the Kwiatkowski–Phillips–Schmidt–Shin test), then I test whether its first-order difference is auto-correlated (by estimating the autocorrelation function).
When variables are auto-correlated, I try to prewhiten them, using an ARIMA model and then run the cross-correlation analysis.
Here is an example with 3 variables with an annual time step (31 years of data):
Auto-correlated:
V1 <- c(524.800, 764.449, 557.564, 615.880, 1055.453, 1290.122, 1709.981, 2113.651, 2958.380, 2541.366, 2640.039, 2787.651, 3580.520, 3907.812, 4369.220, 5003.527, 3672.161, 4394.090, 4214.062, 3671.846, 3351.458, 3036.077, 2704.647, 1854.365, 2543.996, 2106.863, 1960.762, 2036.323, 1996.526, 1940.905, 2120.696)
Not auto-correlated:
V2 <- c(21939,20081,23702,12908,16825,23794,17748,20251,24024,19822,15272, 12258,22539,30856,24090,23512,12588,13367,11501,9099,11411,11832,11341,10899,10049,10114,10667,9902,10639,10702,10066)
Auto-correlated:
V3 <- c(3361213, 3493958, 5924759, 5283699, 9477393, 13528459, 17153250, 14509681, 27052581, 27653986, 25618494, 24642924, 31252566, 40446538, 33833710, 55843471, 31572981, 51685769, 48218752, 46947448, 52243131, 68991783, 66494239, 44108474, 67089912, 61832249, 65841550, 73992509, 62093282, 55447568, 71151920)
In short, what I want to see is: ccf(diff(V1),diff(V2)) and ccf(diff(V1),diff(V3))
So, here is what I did:
library(tseries)
adf.test(V1)  # p-value > 0.05: non-stationary  #OK
kpss.test(V1) # p-value < 0.05: non-stationary  #OK
the same for V2 and V3
autocorrelation function:
acf(diff(V1))
Basically, if the ACF drops off sharply after lag 0, V1 is not auto-correlated. I have also checked the exact output values, just to be safe.
the same for V2 and V3
So now comes my problem:
To get ccf(diff(V1),diff(V2)), knowing that diff(V1) is auto-correlated, I tried:
V1ts <- ts(V1,frequency=1) #convert to a time series
I suppose frequency = 1 because we only have one value per year.
V2ts <- ts(V2,frequency=1) #convert to a time series
library(fpp) #load forecasting package
mod1 <- auto.arima(V1ts)
mod2 <- auto.arima(V1ts,D=0)
mod3 <- auto.arima(V1ts,D=1)
I also tried forcing D to 0 or 1.
library(TSA)
print(prewhiten(V1ts, V2ts, mod1))
with mod2 or mod3 I get the same results...
When comparing the standard ccf without dealing with autocorrelation...
ccf1 <- ccf(diff(V1ts),diff(V2ts))
ccf1
I got exactly the same results...
In conclusion, I am surely doing something wrong because I get exactly the same result with or without prewhitening...
In a nutshell, could someone help me analyze the cross-correlation of an auto-correlated variable (V1) with one that is not auto-correlated (V2)? And when both variables are auto-correlated, as with V1 and V3, do I have to fit two ARIMA models and prewhiten both series simultaneously? Thank you a lot in advance, I am a bit confused here...

Multiple Correspondence Analysis on longitudinal data

I would like to explore the profile of two modalities of a categorical variable over time with respect to a given set of other categorical variables. I paste a reproducible example of such a dataset below.
set.seed(90114)
V1<-sample(rep(c("a", "A"), 100))
V2<-sample(rep(c("a", "A", "b", "B"), 50))
V3<-sample(rep(c("F", "M", "I"), 67), 200)
V4<-sample(rep(c("C", "R"), 100))
V5<-sample(rep(c(1970, 1980, 1990, 2000, 2010), 40))
data<-data.frame(V1, V2, V3, V4, V5)
To explore the behavior of such modalities, I decided to use Multiple Correspondence Analysis (package FactoMineR). To account for variation over time, one possibility is to split the dataset into 5 subsamples which represent the different levels of V5 and then run MCA on each subset. The rest of the analysis consists of comparing the positions of the modalities across the different biplots. However, such a practice is not without problems if the original dataset is too small. In that case, the dimensions could be flipped or, worse, the locations of the active variables are likely to change from one plot to the next.
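For illustration, a minimal sketch of that split-per-year approach (the very approach that raises the problems just mentioned):
library(FactoMineR)
mca.by.year <- lapply(split(data, data$V5),
                      function(d) MCA(d[, -5], quali.sup = 1, graph = FALSE))
mca.by.year[["1970"]]$quali.sup$coord # coordinates of "a"/"A" in the 1970 subset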
To avoid the problem, one solution could be to stabilize the position of the active variables across all the subsets and predict the coordinates of the supplementary variable afterwards, allowing the latter to move over time. I read somewhere that the coordinates of a modality can be obtained by computing the weighted mean of the coordinates of the individuals in which this modality is found. So finding the coordinates of a modality for the year 1970 would boil down to computing the weighted mean of the coordinates of the individuals in the 1970 subset for that modality. However, I don't know whether this is common practice and, if so, how to implement such calculations. I paste the rest of the code in order for you to visualize the problem.
data.mca<-MCA(data[, -5], quali.sup=1, graph=F)
# Retrieve the coordinates of the first and second dimension
DIM1<-data.mca$ind$coord[, 1]
DIM2<-data.mca$ind$coord[, 2]
# Append the coordinates to the original dataframe
data1<-data.frame(data, DIM1, DIM2)
# Split the data into 5 clusters according to V5 ("year")
data1.split<-split(data1, data1$V5)
data1.split<-lapply(data1.split, function(x) x=x[, -5]) # to remove the fifth column with the years, no longer needed
seventies<-as.data.frame(data1.split[1])
eightties<-as.data.frame(data1.split[2])
# ...
a.1970<-seventies[seventies$X1970.V1=="a",]
A.1970<-seventies[seventies$X1970.V1=="A",]
# The idea, then, is to find the coordinates of the modalities "a" and "A" by computing the weighted mean of their respective individuals for each subset. The arithmetic mean would yield
# a.1970.DIM1<-mean(a.1970$X1970.DIM1) # 0.0818
# a.1970.DIM2<-mean(a.1970$X1970.DIM2) # 0.1104
# and so on for the other levels of V5.
I thank you in advance for your help!
I found a solution to my problem. We can simply weight the mean of the coordinates by the values returned by row.w in FactoMineR. To account for the dilatation of the MCA, the resulting barycentre coordinates should be divided by the square root of the eigenvalue of the corresponding dimension.
DIM1<-data.mca$ind$coord[, 1]
DIM2<-data.mca$ind$coord[, 2]
WEIGHT<-data.mca$call$row.w
data1<-data.frame(data, WEIGHT, DIM1, DIM2)
# Splitting the dataset according to values of V1
v1_a<-data1[data1$V1=="a",]
v1_A<-data1[data1$V1=="A",]
# Computing the weighted average of the coordinates of Dim1 and Dim2 for the first category of V1
V1_a_Dim1<-sum(v1_a$WEIGHT*v1_a$DIM1)/100 # -0.0248
v1_a_Dim2<-sum(v1_a$WEIGHT*v1_a$DIM2)/100 # -0.0382
# Account for the dilatation of the dimensions...
V1_a_Dim1/sqrt(data.mca$eig[1,1])
[1] -0.03923839
v1_a_Dim2/sqrt(data.mca$eig[2,1])
[1] -0.06338353
# ... which is the same as the following:
categories<-data.mca$quali.sup$coord[, 1:2]
categories
# Dim 1 Dim 2
# V1_a -0.03923839 -0.06338353
# V1_A 0.03923839 0.06338353
This can be applied to different partitions of the data according to V5 or any other categorical variable.
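For example, a sketch of the same computation restricted to the 1970 subset (using the data1 frame built above):
sub <- data1[data1$V5 == 1970 & data1$V1 == "a", ]
a.1970.Dim1 <- sum(sub$WEIGHT * sub$DIM1) / sum(sub$WEIGHT)
a.1970.Dim2 <- sum(sub$WEIGHT * sub$DIM2) / sum(sub$WEIGHT)
# correct for the dilatation of the axes, as above
c(a.1970.Dim1 / sqrt(data.mca$eig[1, 1]), a.1970.Dim2 / sqrt(data.mca$eig[2, 1]))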

Gompertz Aging analysis in R

I have survival data from an experiment in flies which examines rates of aging in various genotypes. The data are available to me in several layouts, so use whichever suits the answer best.
One dataframe (wide.df) looks like this, where each genotype (Exp, of which there are ~640) has a row, and the days run in sequence horizontally from day 4 to day 98 with counts of new deaths every two days.
Exp Day4 Day6 Day8 Day10 Day12 Day14 ...
A 0 0 0 2 3 1 ...
I make the example using this:
wide.df2<-data.frame("A",0,0,0,2,3,1,3,4,5,3,4,7,8,2,10,1,2)
colnames(wide.df2)<-c("Exp","Day4","Day6","Day8","Day10","Day12","Day14","Day16","Day18","Day20","Day22","Day24","Day26","Day28","Day30","Day32","Day34","Day36")
Another version is like this, where each day has a row for each 'Exp' and the number of deaths on that day are recorded.
Exp Deaths Day
A 0 4
A 0 6
A 0 8
A 2 10
A 3 12
.. .. ..
To make this example:
df2<-data.frame(c("A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A"),c(0,0,0,2,3,1,3,4,5,3,4,7,8,2,10,1,2),c(4,6,8,10,12,14,16,18,20,22,24,26,28,30,32,34,36))
colnames(df2)<-c("Exp","Deaths","Day")
What I would like to do is perform a Gompertz analysis (see the second paragraph of "the life table" here). The equation is:
μx = α·e^{βx}
where μx is the probability of death at a given time, α is the initial mortality rate, and β is the rate of aging.
I would like to be able to get a dataframe which has α and β estimates for each of my ~640 genotypes for further analysis later.
I need help going from the above dataframes to an output of these values for each of my genotypes in R.
I have looked through the package flexsurv which may house the answer but I have failed in attempts to find and implement it.
This should get you started...
Firstly, for the flexsurvreg function to work, you need to specify your input data as a Surv object (from package:survival). This means one row per observation.
The first thing is to re-create the 'raw' data from the summary tables you provide.
(I know rbind is not efficient, but you can always switch to data.table for large sets).
### get rows with >1 death
df3 <- df2[df2$Deaths>1, 2:3]
### expand to give one row per death per time
df3 <- sapply(df3, FUN=function(x) rep(df3[, 2], df3[, 1]))
### each death is 1 (occurs once)
df3[, 1] <- 1
### add this to the rows with <=1 death
df3 <- rbind(df3, df2[!df2$Deaths>1, 2:3])
### convert to Surv object
library(survival)
s1 <- with(df3, Surv(Day, Deaths))
### get parameters for Gompertz distribution
library(flexsurv)
f1 <- flexsurvreg(s1 ~ 1, dist="gompertz")
giving
> f1$res
est L95% U95%
shape 0.165351912 0.1281016481 0.202602176
rate 0.001767956 0.0006902161 0.004528537
Note that this is an intercept-only model as all your genotypes are A.
You can loop this over multiple survival objects once you have re-created the per-observation data as above.
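For example, a rough sketch of such a loop, assuming a long-format data frame df_all with columns Exp, Deaths and Day (this simplified version treats every recorded death as an uncensored observation and drops zero-death rows, unlike the rbind trick above):
library(survival)
library(flexsurv)
fit_one <- function(d) {
  expanded <- d[rep(seq_len(nrow(d)), d$Deaths), ] # one row per recorded death
  expanded$status <- 1                             # every expanded row is an event
  f <- flexsurvreg(Surv(Day, status) ~ 1, data = expanded, dist = "gompertz")
  data.frame(Exp   = d$Exp[1],
             shape = f$res["shape", "est"],  # 'a' in the docs, i.e. your beta
             rate  = f$res["rate", "est"])   # 'b' in the docs, i.e. your alpha
}
gompertz.params <- do.call(rbind, lapply(split(df_all, df_all$Exp), fit_one))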
From the flexsurv docs:
The Gompertz distribution with shape parameter a and rate parameter b has hazard function
h(x | a, b) = b·e^{ax}
So it appears your alpha is b, the rate, and beta is a, the shape.

Looping to extract coefficients from multiply imputed mer objects

I am having a hard time wrapping my head around this problem. I have a list, results4 which contains 5 elements, all of which are mer objects from the zelig package. The mer objects are the result of ls.mixed regressions on each of five imputed datasets. I am trying to combine the results using Rubin's Rules for Multiple Imputation.
I can extract the coefficients and standard errors using summary(results4[[1]])@coefs, which returns a 16x3 matrix (16 variables, each with a point estimate, standard error, and t-statistic).
I am trying to loop over the five sets of results and automate the process of combining the point estimates and standard errors, but unfortunately I seem to be staring at it with no solution arising. Any suggestions?
The code that produces the mer objects follows (variable names changed):
for (i in 1:5) {
  results4[[i]] <- zelig(DV ~ V1 + V2 + V3 + V4 + V5 + V6 + V7 + V8 +
                           V9 + V10 + V11 + V12 + V13 + V14 + V15 + tag(1 | L2),
                         data = as.data.frame(w4[,,i]), model = "ls.mixed", REML = FALSE)
}
I'm not going to take the time to code up the multiple-imputation rules (someone who wants the credit can take what I show here and build on it), but I think you should be able to do what you want by building a 16x3x5 array containing the results:
resultsList <- lapply(results4, function(x) summary(x)@coefs)
library(abind)
resultsArr <- abind(resultsList,along=3)
and then using apply appropriately across the margins.
There's probably a plyr-based solution as well.
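For completeness, a bare-bones sketch of Rubin's rules applied to such an array with apply (assuming point estimates are in column 1 and standard errors in column 2 of the summary output):
est <- resultsArr[, 1, ]          # 16 x 5 matrix of point estimates
se  <- resultsArr[, 2, ]          # 16 x 5 matrix of standard errors
m   <- dim(resultsArr)[3]         # number of imputations
qbar <- apply(est, 1, mean)       # pooled point estimates
W <- apply(se^2, 1, mean)         # average within-imputation variance
B <- apply(est, 1, var)           # between-imputation variance
pooled <- data.frame(estimate = qbar, se = sqrt(W + (1 + 1/m) * B))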
You could also do this less fancily by just defining the array up front and filling it in as you go:
sumresults <- array(dim=c(16,3,5))
for (...) {
...
sumresults[,,i] <- summary(results4[[i]])@coefs
}
