Multivariate detrending under a common trend of time series data in R

I am looking for a way to detrend multivariate time series data under a common trend in R.
Time series data sample:
> head(d)
T x1 x2 x3 x4
1 1 2 4 3 1
2 2 3 5 4 4
3 3 6 6 6 6
4 4 8 9 10 7
5 5 10 13 20 9
I would like to detrend the above multivariate time series dataset d under a common trend. I hope I am clear in explaining the problem that I am facing.
Thanks!

You can use multivariate regression and solve for the coefficients. Because the betas are the same for every series (each column of Y gets the same intercept and slope in Y = X %*% beta), you need to account for that constraint. The simplest way is to string all the Y columns together into one long vector and fit a single regression:
d <- d[, -1]                      # drop the time column T so only the series x1..x4 are stacked
dvec <- as.numeric(as.matrix(d))  # stack all series column-wise into one long vector
n <- dim(d)[1]                    # number of time points
ncol <- dim(d)[2]                 # number of series
x <- rep(1:n, ncol)               # repeated time index, one copy per series
model <- lm(dvec ~ x)             # one common intercept and slope for all series
Then you can do
d_detrended <- matrix(model$residuals, nrow = n)  # one column per detrended series
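As a quick sanity check (a sketch reusing x and the detrended matrix from above): refitting the common trend on the stacked residuals should give an intercept and slope of essentially zero.
coef(lm(as.numeric(d_detrended) ~ x))  # both coefficients should be ~0 up to rounding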

Related

sandwich + mlogit: `Error in ef/X : non-conformable arrays` when using `vcovHC()` to compute robust/clustered standard errors

I am trying to compute robust/clustered standard errors after using mlogit() to fit a Multinomial Logit (MNL) model for a Discrete Choice problem. Unfortunately, I suspect I am having problems because I am using data in long format (this is a must in my case), and I get the error Error in ef/X : non-conformable arrays after calling sandwich::vcovHC( , "HC0").
The Data
For illustration, please consider the following data. It represents data from 5 individuals (id_ind) that choose among 3 alternatives (altern). Each of the five individuals chose three times; hence we have 15 choice situations (id_choice). Each alternative is represented by two generic attributes (x1 and x2), and the choices are registered in y (1 if selected, 0 otherwise).
df <- read.table(header = TRUE, text = "
id_ind id_choice altern x1 x2 y
1 1 1 1 1.586788801 0.11887832 1
2 1 1 2 -0.937965347 1.15742493 0
3 1 1 3 -0.511504401 -1.90667519 0
4 1 2 1 1.079365680 -0.37267925 0
5 1 2 2 -0.009203032 1.65150370 1
6 1 2 3 0.870474033 -0.82558651 0
7 1 3 1 -0.638604013 -0.09459502 0
8 1 3 2 -0.071679538 1.56879334 0
9 1 3 3 0.398263302 1.45735788 1
10 2 4 1 0.291413453 -0.09107974 0
11 2 4 2 1.632831160 0.92925495 0
12 2 4 3 -1.193272276 0.77092623 1
13 2 5 1 1.967624379 -0.16373709 1
14 2 5 2 -0.479859282 -0.67042130 0
15 2 5 3 1.109780885 0.60348187 0
16 2 6 1 -0.025834772 -0.44004183 0
17 2 6 2 -1.255129594 1.10928280 0
18 2 6 3 1.309493274 1.84247199 1
19 3 7 1 1.593558740 -0.08952151 0
20 3 7 2 1.778701074 1.44483791 1
21 3 7 3 0.643191170 -0.24761157 0
22 3 8 1 1.738820924 -0.96793288 0
23 3 8 2 -1.151429915 -0.08581901 0
24 3 8 3 0.606695064 1.06524268 1
25 3 9 1 0.673866953 -0.26136206 0
26 3 9 2 1.176959443 0.85005871 1
27 3 9 3 -1.568225496 -0.40002252 0
28 4 10 1 0.516456176 -1.02081089 1
29 4 10 2 -1.752854918 -1.71728381 0
30 4 10 3 -1.176101700 -1.60213536 0
31 4 11 1 -1.497779616 -1.66301234 0
32 4 11 2 -0.931117325 1.50128532 1
33 4 11 3 -0.455543630 -0.64370825 0
34 4 12 1 0.894843784 -0.69859139 0
35 4 12 2 -0.354902281 1.02834859 0
36 4 12 3 1.283785176 -1.18923098 1
37 5 13 1 -1.293772990 -0.73491317 0
38 5 13 2 0.748091387 0.07453705 1
39 5 13 3 -0.463585127 0.64802031 0
40 5 14 1 -1.946438667 1.35776140 0
41 5 14 2 -0.470448172 -0.61326604 1
42 5 14 3 1.478763383 -0.66490028 0
43 5 15 1 0.588240775 0.84448489 1
44 5 15 2 1.131731049 -1.51323232 0
45 5 15 3 0.212145247 -1.01804594 0
")
The problem
Consequently, we can fit the MNL using mlogit() and try to extract its robust variance-covariance matrix as follows:
library(mlogit)
library(sandwich)
mo <- mlogit(formula = y ~ x1 + x2 | 0,
             method = "nr",
             data = df,
             idx = c("id_choice", "altern"))
sandwich::vcovHC(mo, "HC0")
#Error in ef/X : non-conformable arrays
As we can see, there is an error produced by sandwich::vcovHC, which says that ef/X is non-conformable, where X <- model.matrix(x) and ef <- estfun(x, ...). After looking through the source code on the GitHub mirror I spotted the problem: given that the data is in long format, ef has dimensions 15 x 2 while X has 45 x 2.
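A quick way to see the mismatch interactively (reusing the model mo fitted above; estfun() dispatches to the mlogit method, as in the workaround below):
dim(estfun(mo))   # 15 x 2: one row per choice situation
nrow(df)          # 45: long format, one row per alternative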
My workaround
Given that the show must go on, I am computing the robust and clustered standard errors manually, using some functions that I borrowed from sandwich and adjusted to match Stata's output.
Robust Standard Errors
These lines are inspired by the sandwich::meat() function.
psi<- estfun(mo)
k <- NCOL(psi)
n <- NROW(psi)
rval <- (n/(n-1))* crossprod(as.matrix(psi))
vcov(mo) %*% rval %*% vcov(mo)
# x1 x2
# x1 0.23050261 0.09840356
# x2 0.09840356 0.12765662
Stata Equivalent
qui clogit y x1 x2 ,group(id_choice) r
mat li e(V)
symmetric e(V)[2,2]
y: y:
x1 x2
y:x1 .23050262
y:x2 .09840356 .12765662
Clustered Standard Errors
Here, given that each individual answers 3 questions, it is highly likely that there is some degree of correlation within individuals; hence cluster corrections should be preferred in such situations. Below I compute the cluster correction for this case and show the equivalence with the Stata output of clogit, cluster().
id_ind_collapsed <- df$id_ind[!duplicated(mo$model$idx$id_choice)]
psi_2 <- rowsum(psi, group = id_ind_collapsed )
k_cluster <- NCOL(psi_2)
n_cluster <- NROW(psi_2)
rval_cluster <- (n_cluster/(n_cluster-1))* crossprod(as.matrix(psi_2))
vcov(mo) %*% rval_cluster %*% vcov(mo)
# x1 x2
# x1 0.1766707 0.1007703
# x2 0.1007703 0.1180004
Stata equivalent
qui clogit y x1 x2 ,group(id_choice) cluster(id_ind)
symmetric e(V)[2,2]
y: y:
x1 x2
y:x1 .17667075
y:x2 .1007703 .11800038
The Question:
I would like to accommodate my computations within the sandwich ecosystem, meaning not computing the matrices manually but actually using the sandwich functions. Is it possible to make it work with models in long format like the one described here? For example, providing the meat and bread objects directly to perform the computations? Thanks in advance.
PS: I noted that there is a dedicated bread function in sandwich for mlogit, but I could not spot something like meat for mlogit, but anyways I am probably missing something here...
Why vcovHC does not work for mlogit
The class of HC covariance estimators can only be applied to models with a single linear predictor where the score function, aka estimating function, is the product of so-called "working residuals" and a regressor matrix. This is explained in some detail in the Zeileis (2006) paper (see Equation 7), provided as vignette("sandwich-OOP", package = "sandwich") in the package. The ?vcovHC documentation also pointed to this but did not explain it very well. I have now improved this in the documentation at http://sandwich.R-Forge.R-project.org/reference/vcovHC.html:
The function meatHC is the real work horse for estimating the meat of HC sandwich estimators - the default vcovHC method is a wrapper calling sandwich and bread. See Zeileis (2006) for more implementation details. The theoretical background, exemplified for the linear regression model, is described below and in Zeileis (2004). Analogous formulas are employed for other types of models, provided that they depend on a single linear predictor and the estimating functions can be represented as a product of “working residual” and regressor vector (Zeileis 2006, Equation 7).
This means that vcovHC() is not applicable to multinomial logit models as they generally use separate linear predictors for the separate response categories. Similarly, two-part or hurdle models etc. are not supported.
Basic "robust" sandwich covariance
Generally, for computing the basic Eicker-Huber-White sandwich covariance matrix estimator, the best strategy is to use the sandwich() function and not the vcovHC() function. The former works for any model with estfun() and bread() methods.
For linear models sandwich(..., adjust = FALSE) (default) and sandwich(..., adjust = TRUE) correspond to HC0 and HC1, respectively. In a model with n observations and k regression coefficients the former standardizes with 1/n and the latter with 1/(n-k).
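As a minimal check of this correspondence (a sketch on the built-in cars data with a plain linear model, not the mlogit model above):
fit <- lm(dist ~ speed, data = cars)
all.equal(sandwich(fit), vcovHC(fit, type = "HC0"))                 # 1/n standardization
all.equal(sandwich(fit, adjust = TRUE), vcovHC(fit, type = "HC1"))  # 1/(n-k) standardization
# both comparisons should return TRUE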
Stata, however, standardizes with 1/(n-1) in logit models; see:
Different Robust Standard Errors of Logit Regression in Stata and R. To the best of my knowledge there is no clear theoretical reason for preferring one adjustment over the other, and in moderately large samples it makes no difference anyway.
Remark: The adjustment with 1/(n-1) is not directly available in sandwich() as an option. However, coincidentally, it is the default in vcovCL() without specifying a cluster variable (i.e., treating each observation as a separate cluster). So this is a convenient "trick" if you want to get exactly the same results as Stata.
Clustered covariance
This can be computed "as usual" via vcovCL(..., cluster = ...). For mlogit models you only have to take into account that the cluster variable needs to be provided once per choice situation (as opposed to being stacked several times as in the long-format data).
Replicating Stata results
With the data and model from your post:
vcovCL(mo)
## x1 x2
## x1 0.23050261 0.09840356
## x2 0.09840356 0.12765662
vcovCL(mo, cluster = df$id_choice[1:15])
## x1 x2
## x1 0.1766707 0.1007703
## x2 0.1007703 0.1180004

Subset columns based on row value

This may be a simple question, but I haven't been able to find any answer. Suppose you have a data frame with n columns of molecular features. The last row of each column holds a coefficient of variation (CV).
Example data set:
a <- data.frame(matrix(runif(30),ncol=3))
b <- c(50.23,45.23,21)
a<-rbind(a,b)
X1 X2 X3
1 0.1097075 0.78584027 0.20925033
2 0.6081752 0.39669748 0.65559913
3 0.9912855 0.68462073 0.54741795
4 0.8543848 0.53776889 0.43789447
5 0.2579654 0.92188090 0.61292895
6 0.6203840 0.73152279 0.82866311
7 0.6643195 0.84953926 0.62192976
8 0.5760624 0.30949900 0.11032929
9 0.8888167 0.04530598 0.08089825
10 0.8926815 0.61736284 0.19834310
11 50.2300000 45.23000000 21.00000000
How do I subset so I only get the columns with CV>50 in the last row? So my new data.frame would be:
X1
1 0.1097075
2 0.6081752
3 0.9912855
4 0.8543848
5 0.2579654
6 0.6203840
7 0.6643195
8 0.5760624
9 0.8888167
10 0.8926815
11 50.230000
We can do
a[,a[nrow(a),]>50,drop=FALSE]
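For readability, the same selection can be written in two steps (a small sketch using the a built above, with the CVs in the last row):
cv <- unlist(a[nrow(a), ])    # pull the CVs out of the last row
a[, cv > 50, drop = FALSE]    # keep only the columns whose CV exceeds 50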

Plot empty groups in boxplot

I want to plot a lot of boxplots in one particular style to compare them.
But when a group is empty, that group isn't plotted.
Let's say I have a dataframe:
a b
1 1 5
2 1 4
3 1 6
4 1 4
5 2 9
6 2 8
7 2 9
8 3 NaN
9 3 NaN
10 3 NaN
11 4 2
12 4 8
and I use boxplot to plot it:
boxplot(b ~ a , df)
then I get the plot without group 3
(which I can't show because I did not have "10 reputation")
I found some solutions via Google for removing empty groups, but my problem is the other way around.
I also found a solution using at=c(1,2,4), but since I generate the R script with Python and different groups may be empty, I would prefer that the groups aren't dropped at all.
Oh I don't think I have the time to grapple with additional packages.
Therefore I would be thankful for solutions without them.
You can keep the empty group on the x-axis with
boxplot(b ~ a , df, na.action=na.pass)
Or
boxplot(b~factor(a), df)
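For completeness, a self-contained sketch that rebuilds the example data frame from the question and keeps the empty slot for group 3:
df <- data.frame(a = c(1, 1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4),
                 b = c(5, 4, 6, 4, 9, 8, 9, NaN, NaN, NaN, 2, 8))
boxplot(b ~ a, df, na.action = na.pass)   # group 3 shows up as an empty position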

Competing risk analysis of interval data

I study competing risks and use R.
I would like to use the model in Fine and Gray (1999), A proportional hazards model for the subdistribution of a competing risk, JASA, 94:496-509.
I found the cmprsk package.
However, I have an "interval data" configuration with a starting time t0 and an ending time t1 for each interval, t1 being the exit time or the right-censoring time when it is the last interval for a given entity. Here is an extract of the dataset:
entity t0 t1 cov
1 0 3 12
1 3 7 4
1 7 9 1
2 2 3 2
2 3 10 9
3 0 10 11
4 0 1 0
4 1 6 21
4 6 7 12
...
I cannot find how to implement this with cmprsk, whereas it is supported, for example, in the survival package (Surv(time, time2, ...)).
Is it possible to do it with cmprsk or should I go to another package?
I know that there is a Stata package (stcrreg) doing it but I prefer working with R.
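For reference, the counting-process (start/stop) syntax mentioned above looks like the following in survival. This is only a sketch: the event column is hypothetical (an exit-type code that the extract does not show), the dataset name d is assumed, and the call fits a cause-specific Cox model rather than the Fine and Gray subdistribution model.
library(survival)
# Sketch only: 'event' is a hypothetical exit-type code (0 = censored,
# 1 = event of interest, 2 = competing event); 'd' is the assumed dataset name
fit_cs <- coxph(Surv(t0, t1, event == 1) ~ cov, data = d)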

R: Interpolation between raster layers of different dates

Let's say I have 4 raster layers with the same extent containing data from 4 different years: 2006, 2008, 2010 and 2012:
library(raster)
r2006<-raster(ncol=3, nrow=3)
values(r2006)<-1:9
r2008<-raster(ncol=3, nrow=3)
values(r2008)<-3:11
r2010<-raster(ncol=3, nrow=3)
values(r2010)<-5:13
r2012<-raster(ncol=3, nrow=3)
values(r2012)<-7:15
Now I want to create raster layers for every year between 2006 and 2013 (or even longer) by inter-/extrapolating (a linear method should be a good start) the values of the 4 raster layers. The result should look like this:
r2006<-raster(ncol=3, nrow=3)
values(r2006)<-1:9
r2007<-raster(ncol=3, nrow=3)
values(r2007)<-2:10
r2008<-raster(ncol=3, nrow=3)
values(r2008)<-3:11
r2009<-raster(ncol=3, nrow=3)
values(r2009)<-4:12
r2010<-raster(ncol=3, nrow=3)
values(r2010)<-5:13
r2011<-raster(ncol=3, nrow=3)
values(r2011)<-6:14
r2012<-raster(ncol=3, nrow=3)
values(r2012)<-7:15
r2013<-raster(ncol=3, nrow=3)
values(r2013)<-8:16
Using lm() or approxExtrap doesn't seem to help much.
One way to do this is to separate your problem into two parts: first, perform the numerical interpolation on the raster values; second, apply the interpolated values to the appropriate intermediate raster layers.
Idea: Build a data frame of the values() of the raster layers, add a time index to that data frame, and then apply linear interpolation to those numbers. For the linear interpolation I use approxTime from the simecol package.
For your example above,
library(raster)
library(simecol)
df <- data.frame("2006" = 1:9, "2008" = 3:11, "2010" = 5:13, "2012"=7:15)
#transpose since we want time to be the first col, and the values to be columns
new <- data.frame(t(df))
times <- seq(2006, 2012, by=2)
new <- cbind(times, new)
# Now, apply Linear Interpolate for each layer of the raster
approxTime(new, 2006:2012, rule = 2)
This gives:
# times X1 X2 X3 X4 X5 X6 X7 X8 X9
#1 2006 1 2 3 4 5 6 7 8 9
#2 2007 2 3 4 5 6 7 8 9 10
#3 2008 3 4 5 6 7 8 9 10 11
#4 2009 4 5 6 7 8 9 10 11 12
#5 2010 5 6 7 8 9 10 11 12 13
#6 2011 6 7 8 9 10 11 12 13 14
#7 2012 7 8 9 10 11 12 13 14 15
You can then store this result, take each row, and assign it to the values of that year's raster object.
Note: approxTime does not do linear extrapolation. It simply takes the closest value, so you need to account for that.
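A sketch of that last step, assuming the approxTime result is stored in out (and keeping in mind the remark above: years beyond 2012 simply repeat the 2012 values rather than being extrapolated):
out <- approxTime(new, 2006:2013, rule = 2)  # rule = 2 carries the nearest value outside the range
# build one raster per year from the corresponding row of 'out'
rasters <- lapply(seq_len(nrow(out)), function(i) {
  r <- raster(ncol = 3, nrow = 3)
  values(r) <- unlist(out[i, -1])            # drop the 'times' column; 9 cell values
  r
})
names(rasters) <- paste0("r", out$times)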
