residuals() function error: replacement has x rows and data has y rows - r

I have a data set containing the reading time for each word read by a number of individuals.
I am trying to calculate reading-time residuals for each individual in my data. Word length and order of presentation (of a particular word) are factors in the regression for each individual.
Reading time was log-transformed (logRT), word lengths were calculated with nchar(), and the order of presentation was also log-transformed.
library(lme4)
model1 <- lmer(logRT ~ wlen + log(order) + (1 | subject), data = mydata)
Then I try to get a residuals column for every data point by doing the following:
mydata$LogResid <- residuals(model1)
Then I get this error:
Error in `$<-.data.frame`(`*tmp*`, "LogResid", value = c(0.145113408056189, :
replacement has 30509 rows, data has 30800
Does anyone have any advice? I am totally confused, since this is an analysis I have been doing every day with no such error so far, which makes it even more confusing.

I would say you should try
model1 <- lmer(logRT ~ wlen + log(order) + (1 | subject), data = mydata,
               na.action = na.exclude)
and see if that helps; it should fill in NA values in the appropriate places.
From ?na.exclude:
... when ‘na.exclude’ is used the residuals and
predictions are padded to the correct length by inserting ‘NA’s
for cases omitted by ‘na.exclude’.
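For example, with na.action = na.exclude set as above, the residual vector is padded back to the full length of the data, so the assignment from the question works (a quick sketch using the names from the question):
resid_vec <- residuals(model1)
length(resid_vec) == nrow(mydata)   # TRUE: NAs inserted for omitted cases
mydata$LogResid <- resid_vec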

Related

VAR Model in R, Error in solve.default(Sigma)

I'm currently trying to fit a VAR model with 6 variables from an xts time series set. I have over 800 observations as well. The code I'm trying to run is:
estim <- VAR(MinuteSeries, p = AIC, type = "both")
summary(estim)
The value AIC is the lag order chosen by AIC, retrieved from the lag-selection function. When I run the summary statement I am given the error:
Error in solve.default(Sigma) :
system is computationally singular: reciprocal condition number = 5.61898e-17
I have read online that this can be due to having more coefficients in the model than observations in the data, but I have over 800 observations and am still getting this issue with just 6 variables. Is size still the issue for my model, or am I missing something more important?
I had the very same issue with seemingly non-problematic data (60 observations of a 4-variable time series). Then I read the following advice online:
"It isn't just the high correlation of your variables, but also their scaling with respect to the response and/or the spatial coefficient. Using a different method= (say "LU") and using a power trace vector trs= may get you there too, but re-scaling the variable will also re-scale its square. The same problem affects the STSLS - re-scale the variable. If these are say in Euro, use thousand, million or whatever Euro instead, for example."
It helped when I rescaled GDP from $ to billions of $.
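As a minimal sketch of that idea (the column name big_col and lag.max are placeholders, not from the question): rescale the large-magnitude series before fitting, then redo lag selection and fit the VAR.
library(vars)
MinuteSeries$big_col <- MinuteSeries$big_col / 1e6    # e.g. units -> millions
sel   <- VARselect(MinuteSeries, lag.max = 10, type = "both")
p_aic <- sel$selection["AIC(n)"]                      # lag order chosen by AIC
estim <- VAR(MinuteSeries, p = p_aic, type = "both")
summary(estim)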

Panel Data including Subgroups or Pooled OLS

I am analyzing a dataset which is separated by country, but also into age groups and gender cohorts; five annual periods are included. An intervention took place in between the years.
As the data is sparse, I want to compare the effects within each subgroup, so that I may reach statistical significance after all (e.g. females between 10 and 20 years in both countries; this variable I will call ID, with only one number per country).
I have tried panel analysis with the plm package, indexing by country, year, and ID, but this does not work as the index is not unique.
Is it even possible to include country effects but have subgroups within each country? (see code below)
I have also tried difference-in-differences by using lmList and saving the coefficients, once for each subgroup separated through the IDs. (see code below)
This has worked, but with the limited number of periods no statistical significance is reached, even though the coefficients all point in the same direction. So I wonder if there is a possibility of combining those models again, and thereby reaching reliable results?
1. fixed <- plm(FE ~ x, data = df, index = c("ID", "country", "year"), model = "within")
2. list <- coef(lmList(y ~ treated + time + did | ID, data = df))
Error from 1.
duplicate couples (id-time)
In addition: Warning messages:
1: In pdata.frame(data, index) :
duplicate couples (id-time) in resulting pdata.frame
to find out which, use e.g. table(index(your_pdataframe), useNA = "ifany")
2: In is.pbalanced.default(index[[1]], index[[2]]) :
duplicate couples (id-time)
For 2.
I do get a data frame which contains all the coefficients, but any ideas how I could properly summarize or display them? Just taking the mean of a coefficient seems a bit low-skilled.
Any help is highly appreciated.
I address the first issue (it's just coding). plm requires a panel structure of index = c("individual", "time"), but you can define a new ID for whatever your unit of observation is. Here you can combine the identifier variables into a new group ID, for example:
library(plm)
df <- transform(df, GID = paste0(ID, country))
summary(plm(y ~ x, index = c("GID", "year"), data = df,
            model = "within"))
In general, you can define any other kind of observational group this way. Is your "ID" numeric or a string? You should add a more detailed data description or give some example data.
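If the combined index still triggers the "duplicate couples (id-time)" error, a quick diagnostic along the lines of the warning message (a sketch, using the objects defined above) is:
pdf <- pdata.frame(df, index = c("GID", "year"))
# any GID-year combination counted more than once is a duplicate couple
table(index(pdf), useNA = "ifany")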

Imputing missing observation

I am analysing a dataset with over 450k rows; about 100k rows in one of the columns I am looking at (pa1min_) have NA values, due to non-responses and other random factors. This column deals with workout times in minutes.
I don't think it makes sense to fill the NA values with the mean or median, given that it's nearly a quarter of the data and the bias that could potentially create. I would like to impute the missing observations with a linear regression. However, I receive an error message:
Error: vector memory exhausted (limit reached?)
In addition: There were 50 or more warnings (use warnings() to see the first 50)
This is my code:
# imputing using multiple imputation deterministic regression
imp_model <- mice(brfss2013, method="norm.predict", m=1)
# store data
data_imp <- complete(imp_model)
# multiple imputation
imp_model <- mice(brfss2013, m=5)
# building predictive model
fit <- with(data=imp_model, lm(y ~ x + z))
# combining results
combined <- pool(fit)
Here is a link to the data (compressed): Data
Note: I really just want to fill impute for one column...the other columns in the dataframe are a mixture of characters, integers and factors, some with more than 2 levels.
Similar to what MrFlick mentioned, you are somewhat short on RAM.
Try running the algorithm on 1% of your data, and if that succeeds, check out the bigmemory package for doing on-disk computations.
I also encourage you to check whether the model you fit on your data is actually good without Bayesian imputation, because striving for perfect data may not be much more beneficial than simply imputing the mean/median/first/last values.
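For concreteness, a minimal sketch of the subset-first idea, imputing only the one column of interest (brfss2013 and pa1min_ are the names from the question; the 1% fraction and the norm.predict method are assumptions):
library(mice)
set.seed(1)
# work on a 1% random subset first to confirm memory is sufficient
sub <- brfss2013[sample(nrow(brfss2013), ceiling(0.01 * nrow(brfss2013))), ]
# blank out the imputation method for every column except pa1min_
meth <- make.method(sub)
meth[setdiff(names(meth), "pa1min_")] <- ""
meth["pa1min_"] <- "norm.predict"   # deterministic regression imputation
imp <- mice(sub, method = meth, m = 1)
sub_imp <- complete(imp)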
Hope this helps.

Count-process datasets for Non-proportional Hazard (Cox) models with interaction variables

I am trying to run a non-proportional Cox regression model featuring an interaction-with-time variable, as described in Chapter 15 (section 15.3) of Applied Longitudinal Data Analysis by Singer and Willett. However, I cannot seem to get answers that agree with the book.
The data and source code used in this book are supplied at this fantastic website. Unfortunately, no R code is supplied for the final chapter, and the dataset supplied for R for the example discussed in the text is incomplete and gives incorrect answers for the simplest model (which I do know how to run). Instead, to obtain the complete dataset for this example, one must click the 'Download' link in the 'SAS' column (which has the correct dataset) and then, after installing the haven package (which allows one to read in foreign data formats), read in the dataset in question via:
los <- haven::read_sas("alda/lengthofstay.sas7bdat")
This dataset records participants' (variable ID) length of stay (variable DAYS) in inpatient hospital treatment. The censoring variable is CENSOR. The researchers hypothesised that two different types of treatment (binary variable TREAT) would predict different values of the hazard of checking out of treatment. In addition, they anticipated that the between-group difference in hazard would not be constant over time, therefore requiring the creation of an interaction term. I can get the simple main-effect model to work, returning the same hazard coefficients reported in the book (which is how I eventually found out that the .csv file supplied with the R code was incomplete).
library(survival)
summary(modA <- coxph(Surv(DAYS, 1 - CENSOR) ~ TREAT, data = los))
coef exp(coef) se(coef) z Pr(>|z|)
TREAT 0.1457 1.1568 0.1541 0.945 0.345
I tried to follow the procedure laid out here and here, and the sources listed therein (e.g. Therneau's vignette on time-varying covariates in the survival package), and, of course, when I copy-paste someone else's code and run it, it all works fine. But I am trying to do this for myself from scratch with a dataset whose published results I can compare against mine, and I just can't make it work.
First I created an EVENT variable:
los$EVENT <- 1 - los$CENSOR
There is a duplicate ID number in the dataset that causes issues, so we have to change it to a new ID number:
los$ID[which(duplicated(los$ID))] <- 842
Now, based on what I read here and here, the data frame needs to be split so that, for every participant, there is one row indicating their EVENT status at every point prior to their event (or censoring) time at which any other participant experienced an event. Therefore we need to create a vector of all the unique event times and then split the dataset on those event times:
cutPoints <- sort(unique(los$DAYS[los$EVENT == 1]))
# now split the dataset
longLOS <- survSplit(Surv(DAYS, EVENT) ~ ., data = los, cut = cutPoints)
# and (just because I'm anal) rename the interval upper-bound column (formerly "DAYS")
names(longLOS)[5] <- "tstop"
When I looked at this dataset it appeared to be what I was after, with (1) as many rows for each participant as there are intervals prior to their event time during which anyone else in the dataset experienced an event, (2) two columns indicating the lower and upper bounds of each interval, and (3) an EVENT column with a 0 in every row where the respondent did not experience the event, and a 1 in the final row when they either did experience the event or were censored.
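A quick way to check this structure (a sketch, just inspecting the first participant's rows):
head(longLOS[longLOS$ID == longLOS$ID[1], c("ID", "tstart", "tstop", "EVENT", "TREAT")])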
Next, I created the interaction-with-time variable, subtracting 1 from the 'interval upper bound' column so that the main effect of TREAT represents the treatment effect on the first day of hospitalisation.
longLOS$TREATINT <- longLOS$EVENT*(longLOS$tstop - 1)
And ran the model:
summary(modB <- coxph(Surv(tstart, tstop, EVENT) ~ TREAT + TREATINT, data = longLOS))
But it doesn't work! I got this (fairly unhelpful) error message:
Error in fitter(X, Y, strats, offset, init, control, weights = weights, :
routine failed due to numeric overflow.This should never happen. Please contact the author.
What am I doing wrong? I have been slowly working through Singer and Willett for almost three years (I started while still a grad student), and now the final chapter is proving to be by far my greatest challenge. I have thirty pages to go; any help would be incredibly appreciated.
I figured out what I was doing wrong: a stupid error when I created the interaction variable TREATINT. Instead of
longLOS$TREATINT <- longLOS$EVENT*(longLOS$tstop - 1)
it should have been
longLOS$TREATINT <- longLOS$TREAT*(longLOS$tstop - 1)
Now when you run the model:
summary(modB <- coxph(Surv(tstart, tstop, EVENT) ~ TREAT + TREATINT, data = longLOS))
Not only does it work, it yields coefficients that match those reported in the Singer and Willett book.
coef exp(coef) se(coef) z Pr(>|z|)
TREAT 0.706411 2.026705 0.292404 2.416 0.0157
TREATINT -0.020833 0.979383 0.009207 -2.263 0.0237
Given how dumb my mistake was, I was tempted to just delete this whole post, but I think I'll leave it up for others like me who want to know how to fit interaction-with-time Cox models in R.
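As an aside (this is not from the original post, just a related sketch): the same kind of time-varying treatment effect can also be expressed through coxph's tt argument on the unsplit data, which skips survSplit entirely and should give essentially the same estimates as the split-data approach above.
modC <- coxph(Surv(DAYS, 1 - CENSOR) ~ TREAT + tt(TREAT), data = los,
              tt = function(x, t, ...) x * (t - 1))   # TREAT x (day - 1)
summary(modC)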

Odd behavior with step()

step() and stepAIC() produce a "remove missing values" error when run on data with missing values.
Error in step(mod1, direction = "backward") :
number of rows in use has changed: remove missing values?
According to ?step:
The model fitting must apply the models to the same dataset. This may be
a problem if there are missing values and R's default of na.action = na.omit
is used. We suggest you remove the missing values first.
I have a data frame with one variable which has four NA values. However, when I run step() on the lm object, I don't get the "missing values" error even though the data has missing values. Can anyone tell me what could be going on?
> d1$Impressions
[1] NA NA NA 6924180 9313226 27888455
18213812 54557205 13495553
...
This does not produce an error message:
library(MASS)  # for stepAIC()
mod1 <- lm(Leads ~ G + Con + GOO + DAY + Res + SD + ED +
             ME + Impressions + Inc + Sea, data = d1)
step(mod1, direction = "backward")
stepAIC(mod1)
Even with a variable that has missing values, no error message is generated. Any ideas what's going on?
One reason for the stated behaviour is this: step() fits the full model and hence drops 3 (as stated) observations due to the presence of NAs. As long as the variables that contain the NAs remain in the model, lm() will remove those same observations at each step. If stepping stops before it removes a variable whose elimination would bring one of the previously dropped observations back into the fitted data, then no error is raised, because the number of rows in the model matrix never changes.
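A minimal sketch of the fix suggested in ?step (remove the missing values first), using the variable names from the question:
# keep only rows that are complete for the modelled variables, so the data
# cannot change as variables are dropped during stepwise selection
model_vars <- c("Leads", "G", "Con", "GOO", "DAY", "Res", "SD", "ED",
                "ME", "Impressions", "Inc", "Sea")
d1_cc <- na.omit(d1[, model_vars])
mod1 <- lm(Leads ~ ., data = d1_cc)
step(mod1, direction = "backward")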
As an aside, stepwise selection like this is considered to be of somewhat dubious validity. Not least, in using it you are making a fairly bold statement that the effects of the eliminated variables are exactly zero. It also biases the estimated coefficients of the variables retained in the model towards larger (absolute) values.
Alternatives to stepwise selection include shrinkage methods such as the lasso and the elastic net.
