Imputing missing observations - r

I am analysing a dataset with over 450k rows. About 100k rows in one of the columns I am looking at (pa1min_) have NA values, due to non-responses and other random factors. This column records workout times in minutes.
I don't think it makes sense to fill the NA values with the mean or median, given that nearly a quarter of the column is missing and the bias that could create. I would like to impute the missing observations with a linear regression instead. However, I receive an error message:
Error: vector memory exhausted (limit reached?)
In addition: There were 50 or more warnings (use warnings() to see the first 50)
This is my code:
# load the imputation package
library(mice)

# deterministic regression imputation (single imputation, m = 1)
imp_model <- mice(brfss2013, method = "norm.predict", m = 1)
# store the completed data
data_imp <- complete(imp_model)

# multiple imputation (m = 5 imputed datasets)
imp_model <- mice(brfss2013, m = 5)
# build the predictive model on each imputed dataset
fit <- with(data = imp_model, lm(y ~ x + z))
# combine (pool) the results
combined <- pool(fit)
Here is a link to the data (compressed)
Data
Note: I really just want to impute one column (pa1min_); the other columns in the data frame are a mixture of characters, integers and factors, some with more than two levels.

Similar to what MrFlick mentioned, you are somewhat short on RAM.
Try running the algorithm on 1% of your data; if that succeeds, look into the bigmemory package for doing the computations on disk.
I also encourage you to check whether the model you fit on your data is actually good without Bayesian imputation: chasing perfect data may not end up much more beneficial than simply imputing the mean/median/first/last values.
Hope this helps.
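For what it's worth, here is a rough sketch of that 1% suggestion. It reuses the norm.predict call from the question and assumes brfss2013 is already loaded; the sampling fraction, seed, and the names idx/small_set/imp_small are my own choices for illustration.

library(mice)

set.seed(42)                                  # arbitrary seed, for reproducibility
idx       <- sample(nrow(brfss2013), size = ceiling(0.01 * nrow(brfss2013)))
small_set <- brfss2013[idx, ]                 # roughly 1% of the rows

# same deterministic regression imputation as in the question, on the sample
imp_small <- mice(small_set, method = "norm.predict", m = 1)
summary(complete(imp_small)$pa1min_)          # inspect the imputed column

If this runs cleanly but the full dataset still exhausts memory, that points to RAM rather than the model specification.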

Related

Variance inflation factors in R

I am trying to compute the VIFs from a regression model that has a lot of independent variables (> 100). I am using vif from the car package to do that.
I always get the error: cannot allocate vector of size 13.8 GB. I realize this is a memory issue, but my PC already has a lot of memory. So the question is: can this function be optimized in any way so that it doesn't require so much memory? I am unsure whether this is more of a stats or a computational question. Since my dataset is quite large, I am also unsure how to represent this case with a MWE; basically what is needed is a lot of independent variables (e.g. 200+), one arbitrary dependent variable, and around 440 observations per variable. Thanks for any hints.
I just ran a simulated version of what you did, and it worked fine: it took less than a second to run. That is 250 explanatory variables, one response, and 500 observations.
For entertainment I pasted together a formula for it, but that isn't really necessary. The VIFs were computed easily.
In general, since vif(j) = 1/(1 - R^2_j), where R^2_j is the R-squared from regressing the jth explanatory variable on all the other explanatory variables, the computation should take at most the time of 250 linear regressions, each with 500 observations and 250 explanatory variables, which is very fast and not at all memory intensive.
You might need to post your code so we can see what went wrong.
library(car)   # for vif()

# simulate 250 explanatory variables and one response, 500 observations each
resp <- rnorm(500)
X <- data.frame(matrix(rnorm(500 * 250), nrow = 500, ncol = 250))
colnames(X) <- col_names <- paste("x", 1:250, sep = "")

# paste together a formula (not strictly necessary)
fml <- paste("resp~", paste(col_names, collapse = "+"))

hold <- lm(fml, data = cbind(resp, X))
summary(vif(hold))
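As a sanity check on the identity above, here is a hedged sketch that computes each VIF directly from the auxiliary regressions. It reuses X, col_names and hold from the block above; manual_vif is my own name. It takes a few seconds at most and needs little memory beyond a single lm() fit.

# one auxiliary regression per explanatory variable: VIF_j = 1 / (1 - R^2_j)
manual_vif <- sapply(col_names, function(v) {
  others <- setdiff(col_names, v)
  aux    <- lm(reformulate(others, response = v), data = X)
  1 / (1 - summary(aux)$r.squared)
})
summary(manual_vif)   # should essentially match summary(vif(hold))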

Do imputation in R when mice returns error that "system is computationally singular"

I am trying to do imputation on a medium-sized data frame (~100,000 rows) where 5 columns out of 30 have NAs (a large proportion, around 60%).
I tried mice with the following code:
library(mice)
data_3 = complete(mice(data_2))
After the first iteration I got the following exception:
iter imp variable
1 1 Existing_EMI Loan_Amount Loan_Period
Error in solve.default(xtx + diag(pen)): system is computationally singular: reciprocal condition number = 1.08007e-16
Is there some other package that is more robust to this kind of situation? How can I deal with this problem?
Your 5 columns might contain a number of unbalanced factors. When these are turned into dummy variables, there is a high probability that one column will be a linear combination of another. The default imputation methods of mice involve linear regression, so this produces an X matrix that cannot be inverted and leads to your error.
Change the method to something else, such as cart: mice(data_2, method = "cart"). Also set the seed you use before/during imputation so the results are reproducible.
My advice is to go through the seven vignettes of mice. They show how to change the imputation method for individual columns instead of for the whole dataset, as sketched below.
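A minimal sketch of that advice, assuming data_2 is the data frame from the question; the three column names come from the error trace above, and the maxit = 0 dry run is just a convenient way to grab the default per-column methods.

library(mice)

ini  <- mice(data_2, maxit = 0)        # dry run: no imputation, just the defaults
meth <- ini$method                     # named vector, one method per column

# switch the columns named in the error to CART, keep the defaults elsewhere
meth[c("Existing_EMI", "Loan_Amount", "Loan_Period")] <- "cart"

imp    <- mice(data_2, method = meth, seed = 123)   # seed for reproducibility
data_3 <- complete(imp)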

Speeding up the felm command in R (lfe library)

I am using felm from the lfe library, and am running into serious speed issues with a large data set. By large I mean 100 million rows. My data consist of one dependent variable and five categorical variables (factors). I am running regressions with no covariates, only factors.
The felm algorithm does not converge. I also tried some of the tricks from this short article, but they did not improve things. My code is as follows:
library(lfe)
my_data=read.csv("path_to//data.csv")
attach(data.frame(my_data))
lev1 = unique(my_data$fac1)
my_data$fac1 <- factor(my_data$fac1, levels = lev1)
lev2 = unique(my_data$fac2)
my_data$fac2 <- factor(my_data$fac2, levels = lev2)
lev3 = unique(my_data$fac3)
my_data$fac3 <- factor(my_data$fac3, levels = lev3)
and now I run the regression, without covariates (because I'm only interested in the residuals), and with interactions as follows:
est <- felm(y ~ 0|fac1:fac2+fac1:fac3, my_data)
This line takes forever and does not converge. The dimensions of the factors are as follows:
fac1 has about 6000 unique values
fac2 has about 100 unique values
fac3 has about 10 unique values
(and remember there are 100 million rows). I suspect there must be something wrong with how I am using the factors, because I imagine R should be able to handle data of this size (Stata's reghdfe command handles it without problems). Any suggestions are highly appreciated.

residuals() function error: replacement has x rows and data has y rows

I have a data set with reading times for each word that numerous individuals read.
I am trying to calculate reading-time residuals for each individual in my data. Word length and the order of presentation (of a particular word) are the predictors in the regression for each individual.
The reading times were log-transformed (logRT) and word lengths were calculated with nchar(). The order of presentation is also log-transformed.
library(lme4)
model1 <- lmer(logRT ~ wlen + log(order) + (1|subject), data = mydata)
Then, I try to get a residual column for every data point by doing the following,
mydata$logResid<-residuals(model1)
Then, I get this error.
Error in `$<-.data.frame`(`*tmp*`, "LogResid", value = c(0.145113408056189, :
replacement has 30509 rows, data has 30800
Does anyone have any advice? I am totally confused, since this is an analysis I've been doing every day with no such error so far, which makes it even more confusing.
I would say you should try
model1 <- lmer(logRT ~ wlen + log(order) + (1|subject), data = mydata,
               na.action = na.exclude)
and see if that helps; it should fill in NA values in the appropriate places.
From ?na.exclude:
... when ‘na.exclude’ is used the residuals and
predictions are padded to the correct length by inserting ‘NA’s
for cases omitted by ‘na.exclude’.
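A tiny made-up example (none of these numbers come from the question) showing the padding behaviour with plain lm(); once na.action = na.exclude is set, residuals() from the mixed model should be padded the same way.

# toy data with two missing responses
d <- data.frame(x = 1:10, y = c(2, 4, NA, 8, 10, 12, NA, 16, 18, 20))

fit_omit    <- lm(y ~ x, data = d)                          # default: rows dropped
fit_exclude <- lm(y ~ x, data = d, na.action = na.exclude)  # residuals padded with NA

length(residuals(fit_omit))     # 8  -- shorter than nrow(d), assignment would fail
length(residuals(fit_exclude))  # 10 -- matches nrow(d), with NA in rows 3 and 7

d$resid <- residuals(fit_exclude)  # now the column assignment works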

One-way repeated measures ANOVA with unbalanced data

I'm new to R, and I've read these forums (for help with R) for a while now, but this is my first time posting. After googling each error here, I still can't figure out how to fix my mistakes.
I am trying to run a one-way repeated measures ANOVA with unequal sample sizes. Here is a toy version of my data and the code that I'm using. (If it matters, my real data have 12 bins with 14 to 20 values in each bin.)
## the data: average probability for a subject, given reaction time bin
bin1=c(0.37,0.00,0.00,0.16,0.00,0.00,0.08,0.06)
bin2=c(0.33,0.21,0.000,1.00,0.00,0.00,0.00,0.00,0.09,0.10,0.04)
bin3=c(0.07,0.41,0.07,0.00,0.10,0.00,0.30,0.25,0.08,0.15,0.32,0.18)
## creating the data frame
# dependent variable column
probability=c(bin1,bin2,bin3)
# condition column
bin=c(rep("bin1",8),rep("bin2",11),rep("bin3",12))
# subject column (in the order that will match them up with their respective
# values in the dependent variable column)
subject=c("S2","S3","S5","S7","S8","S9","S11","S12","S1","S2","S3","S4","S7",
"S9","S10","S11","S12","S13","S14","S1","S2","S3","S5","S7","S8","S9","S10",
"S11","S12","S13","S14")
# putting together the data frame
dataFrame=data.frame(cbind(probability,bin,subject))
## one-way repeated measures anova
test=aov(probability~bin+Error(subject/bin),data=dataFrame)
These are the errors I get:
Error in qr.qty(qr.e, resp) :
invalid to change the storage mode of a factor
In addition: Warning messages:
1: In model.response(mf, "numeric") :
using type = "numeric" with a factor response will be ignored
2: In Ops.factor(y, z$residuals) : - not meaningful for factors
3: In aov(probability ~ bin + Error(subject/bin), data = dataFrame) :
Error() model is singular
Sorry for the complexity (assuming it is complex; it is to me). Thank you for your time.
For an unbalanced repeated-measures design, it might be easiest to
use lme (from the nlme package):
## this should be the same as the data you constructed above, just
## a slightly more compact way to do it.
datList <- list(
  bin1 = c(0.37, 0.00, 0.00, 0.16, 0.00, 0.00, 0.08, 0.06),
  bin2 = c(0.33, 0.21, 0.000, 1.00, 0.00, 0.00, 0.00, 0.00, 0.09, 0.10, 0.04),
  bin3 = c(0.07, 0.41, 0.07, 0.00, 0.10, 0.00, 0.30, 0.25, 0.08, 0.15, 0.32, 0.18))
subject <- c("S2","S3","S5","S7","S8","S9","S11","S12",
             "S1","S2","S3","S4","S7","S9","S10","S11","S12","S13","S14",
             "S1","S2","S3","S5","S7","S8","S9","S10","S11","S12","S13","S14")
d <- data.frame(probability = do.call(c, datList),
                bin = paste0("bin", rep(1:3, sapply(datList, length))),
                subject)
library(nlme)
m1 <- lme(probability ~ bin, random = ~1 | subject/bin, data = d)
summary(m1)
The only real problem is that some aspects of the interpretation etc.
are pretty far from the classical sum-of-squares-decomposition approach
(e.g. it's fairly tricky to do significance tests of variance components).
Pinheiro and Bates (Springer, 2000) is highly recommended reading if you're
going to head in this direction.
It might be a good idea to simulate/make up some balanced data and do the
analysis with both aov() and lme(), look at the output, and make sure
you can see where the correspondences are/know what's going on.
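A rough sketch of that exercise with made-up balanced data (one observation per subject-by-bin cell; every number below is simulated, not from the question):

set.seed(1)
nsubj    <- 10
balanced <- expand.grid(subject = factor(paste0("S", 1:nsubj)),
                        bin     = factor(paste0("bin", 1:3)))
## small per-subject shift plus noise, so the variance components are not all zero
balanced$probability <- 0.2 +
  rnorm(nsubj, sd = 0.05)[balanced$subject] +
  rnorm(nrow(balanced), sd = 0.10)

## classical repeated-measures ANOVA
summary(aov(probability ~ bin + Error(subject/bin), data = balanced))

## mixed-model version, as in the answer above; with one observation per cell
## the bin-within-subject term is confounded with the residual, so if lme()
## complains here, random = ~1 | subject gives the identifiable equivalent
library(nlme)
m2 <- lme(probability ~ bin, random = ~1 | subject/bin, data = balanced)
anova(m2)   # compare the F test for bin with the aov() table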
