VAR Model in R, Error in solve.default(Sigma)

I'm trying to fit a VAR model with 6 variables from an xts time series object with over 800 observations. The code I'm running is:
estim <- VAR(MinuteSeries, p = AIC , type = "both")
summary(estim)
The value AIC is the lag order selected by the AIC criterion, retrieved from the lag-select function. When I run the summary statement I get the error:
Error in solve.default(Sigma) :
system is computationally singular: reciprocal condition number = 5.61898e-17
I have read online that this can be caused by having more coefficients in the model than observations in the data. However, I have over 800 observations and still get this error with just 6 variables. Is size still the issue for my model, or am I missing something more important?

I had the very same issue with seemingly unproblematic data (a 4-variable time series with 60 observations). I found the following advice online:
"It isn't just the high correlation of your variables, but also their scaling with respect to the response and/or the spatial coefficient. Using a different method= (say "LU") and using a power trace vector trs= may get you there too, but re-scaling the variable will also re-scale its square. The same problem affects the STSLS - re-scale the variable. If these are say in Euro, use thousand, million or whatever Euro instead, for example."
Rescaling GDP from dollars to billions of dollars fixed it for me.
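The effect of units can be seen in a small sketch with made-up numbers (the series names and magnitudes below are illustrative assumptions, not the asker's data). solve() refuses to invert a matrix whose reciprocal condition number falls below machine precision, and mixing a variable in raw dollars with one expressed as a rate can produce exactly that:

```r
# Illustrative data only: a GDP-like series in raw dollars and a rate-like series
set.seed(1)
n    <- 60
gdp  <- rnorm(n, mean = 1.5e13, sd = 1e12)  # dollars
rate <- rnorm(n, mean = 0.03,   sd = 0.01)  # a proportion

# Covariance matrix of the badly scaled variables
Sigma_raw <- cov(cbind(gdp, rate))
rc_raw    <- rcond(Sigma_raw)   # reciprocal condition number, far below machine precision

# Same data with GDP expressed in billions of dollars
Sigma_scaled <- cov(cbind(gdp_bn = gdp / 1e9, rate))
rc_scaled    <- rcond(Sigma_scaled)

print(c(raw = rc_raw, scaled = rc_scaled, eps = .Machine$double.eps))

# Inversion succeeds once the variables are on comparable scales
Sigma_inv <- solve(Sigma_scaled)
```

Rescaling changes only the units of the coefficients, not the fit, so it is a cheap first thing to try before looking for collinear or constant columns.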

How to estimate less conservative standard errors when using post-stratified weights without full information in the survey package?

I'm encountering very large standard errors in my analysis of proportions with post-stratified data when using the survey package.
I'm working with a data set including (normalized) weights calculated via raking by another party. I don't know exactly how the strata have been defined (e.g. "ageXgender" has been used, but it's unclear which categorization has been used). Let's assume a simple random sample with a considerable amount of non-response.
Is there any way to estimate the reduction in standard errors due to post-stratification in survey without exact information about the procedure? I could recalibrate the weights with rake() if I could define the strata exactly, but I don't have enough information for this.
I have tried to infer the strata by grouping all equal weights together, thinking that this would at least give an upper bound on the reduction in standard errors. However, using these inferred strata only marginally reduced the standard errors and sometimes even increased them:
# An example with the api datasets, pretending that pw are post-stratification weights of unknown origin
library(survey)
data(api)
apistrat$pw <-apistrat$pw/mean(apistrat$pw) #normalized weights
# Include some more extreme weights to simulate my data
mins <- which(apistrat$pw == min(apistrat$pw))
maxs <- which(apistrat$pw == max(apistrat$pw))
apistrat[mins[1:5], "pw"] <- 0.1
apistrat[maxs[1:5], "pw"] <- 10
apistrat[mins[6:10], "pw"] <- 0.2
apistrat[maxs[6:10], "pw"] <- 5
dclus1<-svydesign(id=~1, weights=~pw, data=apistrat)
# "Estimate" strata from the weights
apistrat$ps_est <- as.factor(apistrat$pw)
dclus_ps_est <-svydesign(id=~1, strata=~ps_est, weights=~pw, data=apistrat)
svymean(~api00, dclus1)
svymean(~api00, dclus_ps_est)
#this actually increases the se instead of reducing it
My real weights are also much more complex with 700 unique values in 1000 cases.
Is it possible to somehow approximate the reduction of standard errors due to post-stratification without knowing the real variables and categories and -especially- population values for rake? Could I use rake with some assumptions about the variables and categories used in the strata definitions but without the population totals in some way?
If your data are already raked, then you know the population totals exactly: raking makes the estimated population totals equal the true population totals for the raking variables. So, if you know the raking variables, you can estimate the population totals from the weighted data and then rake to them. The raking won't change the weights (because ex hypothesi these were already raked) but it will change the standard error estimates.
(The next version of the survey package will have an option in svydesign to do exactly this.)
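A minimal sketch of that idea on the api data, assuming for illustration that stype is the (single) raking variable; with real data you would list whichever variables were actually used. The population table is estimated from the weighted sample with svytable() and then passed back to rake():

```r
library(survey)
data(api)

# Pretend pw are normalized post-stratification weights of unknown origin
apistrat$pw_norm <- apistrat$pw / mean(apistrat$pw)
des <- svydesign(id = ~1, weights = ~pw_norm, data = apistrat)

# Estimated "population" totals for the assumed raking variable;
# as.data.frame() yields the stype/Freq layout that rake() expects
pop_stype <- as.data.frame(svytable(~stype, des))

# Rake to those totals: the weights should stay the same,
# but the design now knows the totals are fixed
des_raked <- rake(des,
                  sample.margins     = list(~stype),
                  population.margins = list(pop_stype))

svymean(~api00, des)        # standard error ignoring the raking
svymean(~api00, des_raked)  # standard error accounting for it
```

With several raking variables, each one gets its own formula in sample.margins and its own estimated table in population.margins.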

Generalized linear model vs Generalized additive model

I'm trying to follow this paper: Using a data science approach to predict cocaine use frequency from depressive symptoms, where they use glm and gam with the Beck Depression Inventory. I found a similar dataset to test those models on, but I'm having a hard time with both. For example, I have two variables, d64a and d64b, coded 1, 2, 3, 4, meaning that they're ordinal. Also, in the paper y2 only takes the value 1, but I also have an additional variable (the proportion of consumption) that could serve as the dependent variable.
For the GAM model I have:
b<-gam(y2~s(d64a)+s(d64b),data=DATOS2)
but I have the following error:
Error in smooth.construct.tp.smooth.spec(object, dk$data, dk$knots) :
A term has fewer unique covariate combinations than specified maximum degrees of freedom
Meanwhile for the glm, I have the following:
d<-glm(y2~d64a+d64b,data=DATOS2)
Since d64a and d64b are ordinal, I don't know whether I have to wrap them in factor().
The error message tells you that one or both of d64a and d64b do not have 9 (nine) unique values.
By default s(...) will create a basis with nine functions. You get this error if there are fewer than nine unique values in the covariate.
Check which covariates are affected using:
length(unique(d64a))
length(unique(d64b))
and see what the number of unique values is for each of the covariates you wish to include. Then set the k argument to the number returned above if it is less than nine. For example, assume that the above checks returned 5 and 7 unique values; then you would indicate this by setting k as follows:
b <- gam(y2 ~ s(d64a, k = 5) + s(d64b, k = 7), data = DATOS2)
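The check and the fix can be tried end to end on simulated data. This is only a sketch: the toy data frame, the binary response, and family = binomial are assumptions made for illustration, not the asker's DATOS2. The last call shows one common way to enter ordinal codes in a glm, namely as categorical factors:

```r
library(mgcv)

# Simulated stand-in for DATOS2: two ordinal predictors coded 1-4, binary response
set.seed(123)
n   <- 300
toy <- data.frame(d64a = sample(1:4, n, replace = TRUE),
                  d64b = sample(1:4, n, replace = TRUE))
toy$y2 <- rbinom(n, 1, plogis(-1 + 0.4 * toy$d64a))

# Default basis size is too large for 4 unique values, so this call errors
default_fit <- try(gam(y2 ~ s(d64a) + s(d64b), data = toy, family = binomial),
                   silent = TRUE)

# Set k to the number of unique values of each covariate
k_a <- length(unique(toy$d64a))
k_b <- length(unique(toy$d64b))
b <- gam(y2 ~ s(d64a, k = k_a) + s(d64b, k = k_b),
         data = toy, family = binomial)

# glm alternative: treat the ordinal codes as categorical
d <- glm(y2 ~ factor(d64a) + factor(d64b), data = toy, family = binomial)
```

Leaving the codes numeric in the glm instead would impose a single linear slope across the four levels, which is a stronger assumption than the factor coding.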

Count-process datasets for Non-proportional Hazard (Cox) models with interaction variables

I am trying to run a non-proportional Cox regression model featuring an interaction-with-time variable, as described in Chapter 15 (section 15.3) of Applied Longitudinal Data Analysis by Singer and Willett. However, I cannot seem to get answers that agree with the book.
The data used in this book and source code are supplied at this fantastic website. Unfortunately, no R code is supplied for the final chapter, and the R dataset supplied for the example discussed in the text is incomplete and gives incorrect answers for the simplest model (which I do know how to run). Instead, to obtain the complete dataset for this example, one must click the 'Download' link in the 'SAS' column (which has the correct dataset) and then, after installing the haven package (which allows one to read in foreign data formats), read in the dataset in question via:
haven::read_sas("alda/lengthofstay.sas7bdat")
This dataset indicates participants' (variable ID) length of stay (variable DAYS) in inpatient treatment in a hospital. The censoring variable is CENSOR. The researchers hypothesised that two different types of treatment (binary variable TREAT) would predict different hazards of checking out of treatment. In addition, they anticipated that the between-group difference in hazard would not be constant over time, which requires an interaction term. I can get the simple main-effect model to work, returning the same hazard coefficients reported in the book (which is how I eventually found out the .csv file supplied with the R code was incomplete).
summary(modA <- coxph(Surv(DAYS,1-CENSOR) ~ TREAT, data = los))
coef exp(coef) se(coef) z Pr(>|z|)
TREAT 0.1457 1.1568 0.1541 0.945 0.345
I tried to follow the procedure laid out here, and here, and the sources listed therein (e.g. Therneau vignette on time-varying covariates in the survival package), and, of course, when I am copy-pasting someone else's code and running that it all works fine. But I am trying to do this for myself from scratch with a dataset whose results I can compare against mine. And I just can't make it work.
First I created an EVENT variable:
los$EVENT <- 1 - los$CENSOR
There is a duplicate ID number in the dataset that causes issues, so it has to be changed to a new ID number:
los$ID[which(duplicated(los$ID))] <- 842
Now, based on what I read here and here the dataframe needs to be split so that, for every participant, there is one row indicating the EVENT status at every point prior to their event (or censorship) time when any other participant experienced an event. Therefore we need to create a vector of all the unique event times, then split the dataset on those event times
cutPoints <- sort(unique(los$DAYS[los$EVENT == 1]))
# now split the dataset
longLOS <- survSplit(Surv(DAYS,EVENT)~ ., data = los, cut = cutPoints)
# and (just because I'm anal) rename the interval upper bound column (formerly "DAYS")
names(longLOS)[5] <- "tstop"
When I looked at this dataset it appeared to be what I was after, with (1) as many rows for each participant as there are intervals prior to their event time when anyone else in the dataset experienced an event, (2) two columns indicating the lower and upper bounds of each interval, and (3) an event column with a 0 for all rows in which the respondent did not experience the event, and a 1 in the final row if they experienced the event there (it stays 0 if they were censored).
Next I created the interaction-with-time variable, subtracting 1 from the 'interval upper bound' column so that main effect of TREAT represents the treatment effect on the first day of hospitalisation.
longLOS$TREATINT <- longLOS$EVENT*(longLOS$tstop - 1)
And ran the model
summary(modB <- coxph(Surv(tstart, tstop, EVENT) ~ TREAT + TREATINT, data = longLOS))
But it doesn't work! I got the (fairly unhelpful) error message
Error in fitter(X, Y, strats, offset, init, control, weights = weights, :
routine failed due to numeric overflow.This should never happen. Please contact the author.
What am I doing wrong? I have been slowly working through Singer and Willett for almost three years (I started while still a grad student), and now the final chapter is proving to be by far my greatest challenge. I have thirty pages to go; any help would be incredibly appreciated.
I figured out what I was doing wrong: a silly error when I created the interaction variable TREATINT. Instead of
longLOS$TREATINT <- longLOS$EVENT*(longLOS$tstop - 1)
it should have been
longLOS$TREATINT <- longLOS$TREAT*(longLOS$tstop - 1)
Now when you run the model
summary(modB <- coxph(Surv(tstart, tstop, EVENT) ~ TREAT + TREATINT, data = longLOS))
Not only does it work, it yields coefficients that match those reported in the Singer and Willett book.
coef exp(coef) se(coef) z Pr(>|z|)
TREAT 0.706411 2.026705 0.292404 2.416 0.0157
TREATINT -0.020833 0.979383 0.009207 -2.263 0.0237
Given how dumb my mistake was I was tempted to just delete this whole post but I think I'll leave it up for others like me who want to know how to do interaction with time Cox models in R.
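For anyone without access to the book's data, the whole workflow can be sketched on simulated data. Everything in the toy data frame below is made up for illustration, so the coefficients mean nothing; only the steps mirror the ones above. A plausible reason the original version overflowed is that EVENT*(tstop - 1) is nonzero only in rows where the event occurs, so it predicts the event almost perfectly.

```r
library(survival)

# Simulated stand-in for the length-of-stay data (illustrative values only)
set.seed(42)
n   <- 200
toy <- data.frame(ID     = seq_len(n),
                  TREAT  = rbinom(n, 1, 0.5),
                  DAYS   = ceiling(rexp(n, rate = 1 / 30)),
                  CENSOR = rbinom(n, 1, 0.2))
toy$EVENT <- 1 - toy$CENSOR

# Split every participant's record at each observed event time
cutPoints <- sort(unique(toy$DAYS[toy$EVENT == 1]))
longToy   <- survSplit(Surv(DAYS, EVENT) ~ ., data = toy, cut = cutPoints)

# Rename the interval upper bound by name rather than by column position
names(longToy)[names(longToy) == "DAYS"] <- "tstop"

# Interaction of TREAT (not EVENT) with time, centred on day 1
longToy$TREATINT <- longToy$TREAT * (longToy$tstop - 1)

fit <- coxph(Surv(tstart, tstop, EVENT) ~ TREAT + TREATINT, data = longToy)
summary(fit)
```

Renaming by name instead of names(longLOS)[5] keeps the code working if the column order of the input data changes.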

High error in neuralnet

I am trying to train a neural network using neuralnet in R, but am getting very high error terms (in the region of 1850). The input variables are responses to a set of 6 Likert scales (all on a 1-7 scale) and the output is whether there was a top-box response to another variable (coded 0/1). The input variables have been scaled to a 0-1 range (I've also tried normalizing to a mean of 0). I've tried a range of hidden nodes (from 1 to 10) and the network converges to a threshold of 0.1 fairly reliably in 200,000-400,000 iterations, but with a consistent error term around 1800-1900. There are 30,000 cases in total, about 22,000 of them in the training set. I appreciate that this type of problem doesn't necessarily need a neural network; this is a proof of concept (going excellently...) on a familiar dataset before application to other questions. Any suggestions on how to reduce the error or improve the training of the net would be appreciated.
As I said, I have tried both normalising and scaling, and now also the PCA preprocessing provided in caret. It must be something in the data, but I'm at a bit of a loss...
Code:
maxs <- apply(final, 2, max)
mins <- apply(final, 2, min)
scaled <- as.data.frame(scale(final, center = mins, scale = maxs - mins))
index <- sample(1:nrow(scaled), round(0.75 * nrow(scaled)))
train_ <- scaled[index, ]
test_ <- scaled[-index, ]
nn <- neuralnet(Q11 ~ A1 + B1 + C1 + D1 + E1 + Q12, data = train_, hidden = 5, rep = 1, threshold = 0.01, stepmax = 6e+05, linear.output = FALSE, lifesign = 'full')
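One thing worth checking before changing the network is how the reported error scales with the number of cases. With the default err.fct = "sse", neuralnet reports half the sum of squared errors over all training rows, so the figure grows with the size of the training set. The arithmetic below is a rough sketch using the numbers quoted above; the training size of 22,500 is an assumption derived from the 75% split of 30,000 cases.

```r
# Figures taken from the question; n_train assumes a 75% split of 30,000 cases
reported_error <- 1850
n_train        <- round(0.75 * 30000)

# Default neuralnet error is 0.5 * sum((y - yhat)^2), so undo the 0.5 and average
mse  <- 2 * reported_error / n_train
rmse <- sqrt(mse)
print(c(mse = mse, rmse = rmse))

# Reference point for a 0/1 target: always predicting the base rate p gives
# a mean squared error of p * (1 - p); p here is a hypothetical value
p            <- 0.25
baseline_mse <- p * (1 - p)
print(baseline_mse)
```

Comparing the per-case error against the base-rate figure for the actual top-box proportion gives a fairer picture than the raw total of 1850.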

residuals() function error: replacement has x rows and data has y rows

I have a data set that has reading time for each word that numerous individuals read.
I am trying to calculate reading time residuals for each individual in my data. Word lengths and the order of presentation (of a particular word) are factors in calculating a regression for each individual.
The reading time was log-transformed (logRT) and word lengths were calculated by nchar(). The order of presentation is also log-transformed.
model1<-lmer(logRT~wlen+log(order)+(1|subject), data=mydata)
Then, I try to get a residual column for every data point by doing the following,
mydata$logResid<-residuals(model1)
Then, I get this error.
Error in `$<-.data.frame`(`*tmp*`, "LogResid", value = c(0.145113408056189, :
replacement has 30509 rows, data has 30800
Does anyone have any advice? I am totally confused, all the more so because this is an analysis I've been doing every day with no such error until now.
I would say you should try
model1 <- lmer(logRT~wlen+log(order)+(1|subject), data=mydata,
na.action=na.exclude)
and see if that helps; it should fill in NA values in the appropriate places.
From ?na.exclude:
... when ‘na.exclude’ is used the residuals and
predictions are padded to the correct length by inserting ‘NA’s
for cases omitted by ‘na.exclude’.
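The padding behaviour is easy to see on a toy example. The sketch below uses lm from base R instead of lmer so that it runs without extra packages; the data are made up, and the na.action argument is used in the same way as in the lmer call above.

```r
# Toy data with two missing responses
set.seed(1)
d <- data.frame(x = rnorm(10), y = rnorm(10))
d$y[c(3, 7)] <- NA

# Default na.action (na.omit): residuals are shorter than the data
fit_omit <- lm(y ~ x, data = d)
n_omit   <- length(residuals(fit_omit))

# na.exclude: residuals are padded with NA to match the rows of the data
fit_excl <- lm(y ~ x, data = d, na.action = na.exclude)
n_excl   <- length(residuals(fit_excl))

print(c(rows = nrow(d), omit = n_omit, exclude = n_excl))

# The assignment that failed in the question now works
d$resid <- residuals(fit_excl)
```

The mismatch of 291 rows in the question suggests that many rows have a missing or non-finite value in logRT, wlen, log(order), or subject; something like sum(!complete.cases(mydata)) can help locate them.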
