I am trying to compute the VIFs from a regression model that has a lot of independent variables (> 100). I am using vif from the car package to do that.
I always get the error: cannot allocate vector of size 13.8 GB. I realize this is a memory issue, but my PC already has a lot of memory. So the question is, can this function be optimized in any way so that it doesn't require so much memory? I am unsure if this is more of a stats or a computational question. And as my dataset is quite large, I am unsure how to represent this case with an MWE. Basically what is needed is a lot of independent variables (e.g. 200+) and one arbitrary dependent variable, with each variable having around 440 observations. Thanks for any hints.
I just ran a simulated version of what you describe, and it worked fine: it took less than a second to run. This is with 250 explanatory variables, one response, and 500 observations.
For entertainment I pasted together a formula for it, but that isn't really necessary. The vif() values were computed easily.
In general, since VIF_j = 1/(1 - R^2_j), where R^2_j is the R-squared value from regressing the jth explanatory variable on all the other explanatory variables, the computation should take, at most, the time of 250 linear regressions, each with 500 observations and the remaining 249 explanatory variables, which is very fast and not at all memory intensive.
You might need to post your code so we can see what went wrong.
library(car)    # for vif()

set.seed(1)
resp <- rnorm(500)
X <- data.frame(matrix(rnorm(500 * 250), nrow = 500, ncol = 250))
colnames(X) <- col_names <- paste("x", 1:250, sep = "")

# paste together the formula resp ~ x1 + x2 + ... + x250
fml <- paste("resp ~", paste(col_names, collapse = " + "))

hold <- lm(as.formula(fml), data = cbind(resp, X))
summary(vif(hold))
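To connect that output back to the formula above, here is a minimal sketch (reusing the simulated X and the fitted hold object) that computes the VIF of x1 by hand from its auxiliary regression and compares it with what car::vif() reports:

# R^2 from regressing x1 on all the other explanatory variables
r2_1 <- summary(lm(x1 ~ ., data = X))$r.squared
1 / (1 - r2_1)     # manual VIF for x1
vif(hold)["x1"]    # should agree with the manual value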
I am trying to impute a medium-sized data frame (~100,000 rows) in which 5 of the 30 columns have NAs (a large proportion, around 60%).
I tried mice with the following code:
library(mice)
data_3 = complete(mice(data_2))
After the first iteration I got the following exception:
iter imp variable
1 1 Existing_EMI Loan_Amount Loan_Period
Error in solve.default(xtx + diag(pen)): system is computationally singular: reciprocal condition number = 1.08007e-16
Is there some other package that is more robust to this kind of situation? How can I deal with this problem?
Your 5 columns might contain a number of unbalanced factors. When these are turned into dummy variables, there is a high probability that one column will be a linear combination of another. The default imputation methods of mice involve linear regression; this produces an X matrix that cannot be inverted, which leads to your error.
Change the method being used to something else, such as cart: mice(data_2, method = "cart"). Also set a seed before/during the imputation so your results are reproducible.
My advice is to go through the seven mice vignettes; they show, among other things, how to change the imputation method for individual columns instead of for the whole dataset.
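A minimal sketch of both suggestions, assuming your data_2 and the three column names from the error output (the seed value is arbitrary):

library(mice)

# use cart for every column, with a fixed seed for reproducibility
imp <- mice(data_2, method = "cart", seed = 123)
data_3 <- complete(imp)

# or switch the method only for the columns that cause trouble
meth <- make.method(data_2)      # default method for each column
meth[c("Existing_EMI", "Loan_Amount", "Loan_Period")] <- "cart"
imp2 <- mice(data_2, method = meth, seed = 123)
data_3 <- complete(imp2)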
I am using felm from the lfe library, and I am running into serious speed issues when using a large data set. By large I mean 100 million rows. My data consist of one dependent variable and five categorical variables (factors). I am running regressions with no covariates, only factors.
The felm algorithm does not converge. I also tried some of the tricks used in this short article, but they did not help. My code is as follows:
library(lfe)
my_data=read.csv("path_to//data.csv")
attach(data.frame(my_data))
lev1 = unique(my_data$fac1)
my_data$fac1 <- factor(my_data$fac1, levels = lev1)
lev2 = unique(my_data$fac2)
my_data$fac2 <- factor(my_data$fac2, levels = lev2)
lev3 = unique(my_data$fac3)
my_data$fac3 <- factor(my_data$fac3, levels = lev3)
Now I run the regression, without covariates (because I'm only interested in the residuals) and with interactions, as follows:
est <- felm(y ~ 0|fac1:fac2+fac1:fac3, my_data)
This line takes forever and does not converge. Note that the dimensions of the factors are as follows:
fac1 has about 6000 unique values
fac2 has about 100 unique values
fac3 has about 10 unique values
(and remember there are 100 million rows). I suspect there must be something wrong with how I use the factors, because I imagine R should be able to handle data of this size (Stata's reghdfe command handles it without problems). Any suggestions are highly appreciated.
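For concreteness, here is a sketch of the same specification with the interaction factors precomputed via base R's interaction() (f12 and f13 are placeholder names I made up); it also shows how many combined levels each fixed effect has:

# precompute the interaction factors (placeholder names f12, f13)
my_data$f12 <- interaction(my_data$fac1, my_data$fac2, drop = TRUE)
my_data$f13 <- interaction(my_data$fac1, my_data$fac3, drop = TRUE)
nlevels(my_data$f12)   # at most 6000 * 100 = 600,000 combined levels
nlevels(my_data$f13)   # at most 6000 * 10  =  60,000 combined levels

est <- felm(y ~ 0 | f12 + f13, data = my_data)   # should be an equivalent specification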
I have a data set with the reading time for each word that numerous individuals read.
I am trying to calculate reading-time residuals for each individual in my data. Word length and the order of presentation (of a particular word) are the predictors in the regression for each individual.
The reading time was log-transformed (logRT) and word lengths were calculated with nchar(). The order of presentation is also log-transformed.
model1<-lmer(logRT~wlen+log(order)+(1|subject), data=mydata)
Then I try to get a residual column for every data point by doing the following:
mydata$logResid<-residuals(model1)
Then, I get this error.
Error in `$<-.data.frame`(`*tmp*`, "LogResid", value = c(0.145113408056189, :
replacement has 30509 rows, data has 30800
Does anyone have any advice? I am totally confused, since this is an analysis I've been doing every day with no such error until now, which makes it even more confusing.
I would say you should try
model1 <- lmer(logRT ~ wlen + log(order) + (1 | subject),
               data = mydata, na.action = na.exclude)
and see if that helps; it should fill in NA values in the appropriate places.
From ?na.exclude:
... when ‘na.exclude’ is used the residuals and
predictions are padded to the correct length by inserting ‘NA’s
for cases omitted by ‘na.exclude’.
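A quick sketch of the follow-up, assuming mydata and the refitted model1 above: with na.action = na.exclude the residual vector is padded with NA for the dropped rows, so it matches the number of rows in the data and the assignment works.

res <- residuals(model1)
length(res) == nrow(mydata)   # should now be TRUE (30800 in your case)
mydata$logResid <- res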
I'm new to R, and I've read these forums (for help with R) for a while now, but this is my first time posting. After googling each error here, I still can't figure out how to fix my mistakes.
I am trying to run a one-way repeated-measures ANOVA with unequal sample sizes. Here is a toy version of my data and the code that I'm using. (If it matters, my real data have 12 bins with 14 to 20 values in each bin.)
## the data: average probability for a subject, given reaction time bin
bin1=c(0.37,0.00,0.00,0.16,0.00,0.00,0.08,0.06)
bin2=c(0.33,0.21,0.000,1.00,0.00,0.00,0.00,0.00,0.09,0.10,0.04)
bin3=c(0.07,0.41,0.07,0.00,0.10,0.00,0.30,0.25,0.08,0.15,0.32,0.18)
## creating the data frame
# dependent variable column
probability=c(bin1,bin2,bin3)
# condition column
bin=c(rep("bin1",8),rep("bin2",11),rep("bin3",12))
# subject column (in the order that will match them up with their respective
# values in the dependent variable column)
subject=c("S2","S3","S5","S7","S8","S9","S11","S12","S1","S2","S3","S4","S7",
"S9","S10","S11","S12","S13","S14","S1","S2","S3","S5","S7","S8","S9","S10",
"S11","S12","S13","S14")
# putting together the data frame
dataFrame=data.frame(cbind(probability,bin,subject))
## one-way repeated measures anova
test=aov(probability~bin+Error(subject/bin),data=dataFrame)
These are the errors I get:
Error in qr.qty(qr.e, resp) :
invalid to change the storage mode of a factor
In addition: Warning messages:
1: In model.response(mf, "numeric") :
using type = "numeric" with a factor response will be ignored
2: In Ops.factor(y, z$residuals) : - not meaningful for factors
3: In aov(probability ~ bin + Error(subject/bin), data = dataFrame) :
Error() model is singular
Sorry for the complexity (assuming it is complex; it is to me). Thank you for your time.
For an unbalanced repeated-measures design, it might be easiest to
use lme (from the nlme package):
## this should be the same as the data you constructed above, just
## a slightly more compact way to do it.
datList <- list(
    bin1 = c(0.37, 0.00, 0.00, 0.16, 0.00, 0.00, 0.08, 0.06),
    bin2 = c(0.33, 0.21, 0.00, 1.00, 0.00, 0.00, 0.00, 0.00, 0.09, 0.10, 0.04),
    bin3 = c(0.07, 0.41, 0.07, 0.00, 0.10, 0.00, 0.30, 0.25, 0.08, 0.15, 0.32, 0.18))
subject <- c("S2","S3","S5","S7","S8","S9","S11","S12",
             "S1","S2","S3","S4","S7","S9","S10","S11","S12","S13","S14",
             "S1","S2","S3","S5","S7","S8","S9","S10","S11","S12","S13","S14")
d <- data.frame(probability = do.call(c, datList),
                bin = paste0("bin", rep(1:3, sapply(datList, length))),
                subject)
library(nlme)
m1 <- lme(probability ~ bin, random = ~1 | subject/bin, data = d)
summary(m1)
The only real problem is that some aspects of the interpretation etc.
are pretty far from the classical sum-of-squares-decomposition approach
(e.g. it's fairly tricky to do significance tests of variance components).
Pinheiro and Bates (Springer, 2000) is highly recommended reading if you're
going to head in this direction.
It might be a good idea to simulate/make up some balanced data and do the
analysis with both aov() and lme(), look at the output, and make sure
you can see where the correspondences are/know what's going on.
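Here is a minimal sketch of that balanced comparison with made-up data (the subject count and probability values are arbitrary; nlme is already loaded above):

## balanced made-up data: every subject sees every bin
set.seed(101)
dbal <- expand.grid(subject = paste0("S", 1:10), bin = paste0("bin", 1:3))
dbal$probability <- runif(nrow(dbal), 0, 0.4)

## classical repeated-measures ANOVA
summary(aov(probability ~ bin + Error(subject/bin), data = dbal))

## mixed-model version; compare its F test for bin with the aov() table
mbal <- lme(probability ~ bin, random = ~1 | subject/bin, data = dbal)
anova(mbal)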