Variance inflation factors in R

I am trying to compute the VIFs from a regression model that has a lot of independent variables (> 100). I am using vif from the car package to do that.
I always get the error: cannot allocate vector of size 13.8 GB. I realize this is a memory issue, but my PC already has a lot of memory. So the question is, can this function be optimized in any way so that it doesn't require so much memory? I am unsure whether this is more of a stats question or a computational one. And as my dataset is quite large, I am unsure how to represent this case with a MWE. Basically what is needed is a lot of independent variables (e.g. 200+) and one arbitrary dependent variable, with around 440 observations per variable. Thanks for any hints.

I just ran a simulated version of what you did, and it worked fine: it took less than a second to run. This is 250 explanatory variables and one response, with 500 observations.
For entertainment I pasted together a formula for it, but that isn't really necessary. The vif() values were computed easily.
In general, since vif(j) = 1/(1 - R^2_j), where R^2_j is the R-squared value from regressing the jth explanatory variable on all the other explanatory variables, the computation should take, at most, the time of 250 linear regressions with 500 observations and 250 explanatory variables, which is very, very fast and not at all memory intensive.
You might need to post your code so we can see what went wrong.
> library(car)   # provides vif()
> resp <- rnorm(500)
> X <- matrix(rnorm(500 * 250), nrow = 500, ncol = 250)
> X <- data.frame(X)
> colnames(X) <- col_names <- paste("x", 1:250, sep = "")
> formula <- paste(col_names, collapse = "+")
> formula <- paste("resp~", formula)
> hold <- lm(formula, data = cbind(resp, X))
> summary(vif(hold))
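For illustration, here is a minimal by-hand check of that definition for x1, using the simulated data above: the auxiliary regression regresses x1 on the other 249 predictors, and the VIF follows directly.
> r2_x1 <- summary(lm(x1 ~ . - resp, data = cbind(resp, X)))$r.squared
> 1/(1 - r2_x1)   # should match vif(hold)["x1"]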

Related

Using bootstrapping to compare full and sample datasets

This is a fairly complicated situation, so I'll try to succinctly explain but feel free to ask for clarification.
I have several datasets of biological data that vary significantly in sample size (e.g., 253-1221 observations/dataset). I need to estimate individual breeding parameters and compare them (for a different analysis), but because of the large sample size differences, I took a sub-set of data from each dataset so the sample sizes were equal for each comparison. For example, the smallest dataset had 253 observations, so for all the others I used the following code
AT_EABL_subset <- Atlantic_EABL[sample(1:nrow(Atlantic_EABL), 253,replace=FALSE),]
to take a subset of 253 observations from the full dataset (in this case Atlantic_EABL originally had 1,221 observations).
It's now suggested that I use bootstrapping to check if the parameter estimates from my subsets are similar to the full dataset estimates. I'm looking for code that will run, say, 200 iterations of the above subset data and calculate the average of the coefficients so I can compare them to the coefficients from my model with the full dataset. I found a site that uses the sample function to achieve this (https://towardsdatascience.com/bootstrap-regression-in-r-98bfe4ff5007), but when I get to this portion of the code
sample_coef_intercept <-
  c(sample_coef_intercept, model_bootstrap$coefficients[1])
sample_coef_x1 <-
  c(sample_coef_x1, model_bootstrap$coefficients[2])
}
I get
Error: $ operator not defined for this S4 class
Below is the code I'm using. I don't know if I'm getting the above error because of the type of model I'm running (glmer vs. lm used in the link), or if there's a different function that will give me the data I need. Any advice is greatly appreciated.
sample_coef_intercept <- NULL
sample_coef_x1 <- NULL
for (i in 1:2) {
  boot.sample = AT_EABL_subset[sample(1:nrow(AT_EABL_subset), nrow(AT_EABL_subset), replace = FALSE), ]
  model_bootstrap <- glmer(cbind(YOUNG_HOST_TOTAL_ATLEAST, CLUTCH_SIZE_HOST_ATLEAST - YOUNG_HOST_TOTAL_ATLEAST) ~ as.factor(YEAR) + (1 | LatLong),
                           binomial, data = boot.sample)
}
sample_coef_intercept <-
  c(sample_coef_intercept, model_bootstrap$coefficients[1])
sample_coef_x1 <-
  c(sample_coef_x1, model_bootstrap$coefficients[2])
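For what it's worth, glmer() returns an S4 merMod object, so its fixed effects are extracted with fixef() rather than with $coefficients, which is where the "$ operator not defined for this S4 class" error comes from. Below is a minimal sketch of how such a loop could be written around that, assuming the same data and model as in the question; the 200 iterations and the replace = TRUE resampling (a standard nonparametric bootstrap) are illustrative choices, not taken verbatim from the question.
library(lme4)

n_boot <- 200                         # illustrative number of bootstrap iterations
sample_coef_intercept <- numeric(0)
sample_coef_x1 <- numeric(0)

for (i in 1:n_boot) {
  # resample rows with replacement to form a bootstrap sample
  idx <- sample(nrow(AT_EABL_subset), nrow(AT_EABL_subset), replace = TRUE)
  boot.sample <- AT_EABL_subset[idx, ]
  model_bootstrap <- glmer(cbind(YOUNG_HOST_TOTAL_ATLEAST,
                                 CLUTCH_SIZE_HOST_ATLEAST - YOUNG_HOST_TOTAL_ATLEAST) ~
                             as.factor(YEAR) + (1 | LatLong),
                           family = binomial, data = boot.sample)
  # merMod is an S4 class: use fixef() to get the fixed-effect coefficients
  fe <- fixef(model_bootstrap)
  sample_coef_intercept <- c(sample_coef_intercept, fe[1])
  sample_coef_x1 <- c(sample_coef_x1, fe[2])
}

mean(sample_coef_intercept)
mean(sample_coef_x1)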

Imputing missing observation

I am analysing a dataset with over 450k rows; about 100k of the rows in one of the columns I am looking at (pa1min_) have NA values, due to non-responses and other random factors. This column records workout times in minutes.
I don't think it makes sense to fill the NA values with the mean or median, given that it's nearly a quarter of the data and the bias that could potentially create. I would like to impute the missing observations with a linear regression. However, I receive an error message:
Error: vector memory exhausted (limit reached?)
In addition: There were 50 or more warnings (use warnings() to see the first 50)
This is my code:
# imputing using multiple imputation deterministic regression
imp_model <- mice(brfss2013, method="norm.predict", m=1)
# store data
data_imp <- complete(imp_model)
# multiple imputation
imp_model <- mice(brfss2013, m=5)
# building predictive model
fit <- with(data=imp_model, lm(y ~ x + z))
# combining results
combined <- pool(fit)
Here is a link to the data (compressed): Data
Note: I really just want to impute one column...the other columns in the dataframe are a mixture of characters, integers and factors, some with more than 2 levels.
Similar to what MrFlick mentioned, you are somewhat short on RAM.
Try running the algorithm on 1% of your data, and if you succeed, you should try checking out the bigmemory package for doing in-disk computations.
I also encourage you to check whether the model you fit on your data is actually good without Bayesian imputation, because chasing "perfect" data may not be much more beneficial than simply imputing the mean/median/first/last values in your data.
Hope this helps.
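To make the "try it on 1% of the data first" suggestion concrete, here is a minimal sketch with mice that imputes only the pa1min_ column; the 1% fraction, the seed, and the norm.predict method are illustrative choices, while brfss2013 and pa1min_ are the names from the question.
library(mice)

# illustrative: work on a small random subsample first (about 1% of the rows)
set.seed(1)
brfss_small <- brfss2013[sample(nrow(brfss2013), round(0.01 * nrow(brfss2013))), ]

# impute only pa1min_: blank out the imputation method for every other column,
# so mice leaves them untouched (they are still used as predictors)
meth <- make.method(brfss_small)
meth[names(meth) != "pa1min_"] <- ""
meth["pa1min_"] <- "norm.predict"   # deterministic regression imputation

imp <- mice(brfss_small, method = meth, m = 1)
brfss_completed <- complete(imp)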

Speeding up the felm command in R (lfe library)

I am using felm from the lfe library, and I am running into serious speed issues when using a large data set. By large I mean 100 million rows. My data consist of one dependent variable and five categorical variables (factors). I am running regressions with no covariates, only factors.
The felm algorithm does not converge. I also tried some of the tricks used in this short article, but they did not help. My code is as follows:
library(lfe)
my_data=read.csv("path_to//data.csv")
attach(data.frame(my_data))
lev1 = unique(my_data$fac1)
my_data$fac1 <- factor(my_data$fac1, levels = lev1)
lev2 = unique(my_data$fac2)
my_data$fac2 <- factor(my_data$fac2, levels = lev2)
lev3 = unique(my_data$fac3)
my_data$fac3 <- factor(my_data$fac3, levels = lev3)
and now I run the regression, without covariates (because I'm only interested in the residuals), and with interactions as follows:
est <- felm(y ~ 0|fac1:fac2+fac1:fac3, my_data)
This line takes forever and does not converge. Note that the dimensions of the factors are as follows:
fac1 has about 6000 unique values
fac2 has about 100 unique values
fac3 has about 10 unique values
(and remember there are 100 million rows). I suspect there must be something wrong with how I use the factors, because I imagine R should be able to handle data of this size (Stata's reghdfe command handles it without problems). Any suggestions are highly appreciated.
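For reference, a hedged sketch of an alternative way to set up the same estimation with lfe: let it use several threads and build the interacted fixed-effect factors explicitly before the call. The thread count and the fe12/fe13 names are illustrative, and this is not guaranteed to converge faster on 100 million rows.
library(lfe)

# illustrative: allow lfe to use multiple threads
options(lfe.threads = 4)

# build the interacted fixed-effect factors once, up front
my_data$fe12 <- interaction(my_data$fac1, my_data$fac2, drop = TRUE)
my_data$fe13 <- interaction(my_data$fac1, my_data$fac3, drop = TRUE)

est <- felm(y ~ 0 | fe12 + fe13, data = my_data)
res <- residuals(est)   # the residuals the question is after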

Random Forest with caret package: Error: cannot allocate vector of size 153.1 Gb

I was trying to build a random forest model for a dataset on Kaggle. I always do my machine learning with the caret package. The dataset has 1.5 million+ rows and 46 variables with no missing values (about 150 MB in size); 40+ of the variables are categorical, and the outcome is the binary response I am trying to predict. After some pre-processing with dplyr, I started building the model with the caret package, but I got this error message when I tried to run the "train" function: "Error: cannot allocate vector of size 153.1 Gb". Here is my code:
## load packages
require(tidyr)
require(dplyr)
require(readr)
require(ggplot2)
require(ggthemes)
require(caret)
require(parallel)
require(doParallel)
## prepare for parallel processing
n_Cores <- detectCores()
n_Cluster <- makeCluster(n_Cores)
registerDoParallel(n_Cluster)
## import original datasets
people_Dt <- read_csv("people.csv",col_names = TRUE)
activity_Train <- read_csv("act_train.csv",col_names = TRUE)
### join two sets together and remove variables not to be used
first_Try <- people_Dt %>%
  left_join(activity_Train, by = "people_id") %>%
  select(-ends_with("y")) %>%
  filter(!is.na(outcome))
## try with random forest
in_Tr <- createDataPartition(first_Try$outcome,p=0.75,list=FALSE)
rf_Train <- first_Try[in_Tr, ]
rf_Test <- first_Try[-in_Tr, ]
## set model cross validation parameters
model_Control <- trainControl(method = "repeatedcv",repeats=2,number=2,allowParallel = TRUE)
rf_RedHat <- train(outcome~.,
data=rf_Train,
method="rf",
tuneLength=10,
importance=TRUE,
trControl=model_Control)
My computer is a fairly powerful machine with an E3 processor and 32 GB of RAM. I have two questions:
1. Where did I get a vector as large as 150 GB? Is it because of some code I wrote?
2. I cannot get a machine with that much RAM. Are there any workarounds for this issue so I can move on with my model building?
the dataset has 1.5 million+ rows and 46 variables with no missing values (about 150 MB in size)
To be clear here, you most likely don't need 1.5 million rows to build a model. Instead, you should take a smaller subset which doesn't cause the memory problems. If you are concerned about reducing the size of your sample data, you can do some descriptive stats on the 40 predictors on a smaller set and check that the behavior appears to be the same.
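A minimal sketch of drawing such a subsample in a stratified way with caret (the 10% fraction and the seed are arbitrary illustrations; first_Try and outcome are the names from the question):
# illustrative: take a stratified 10% subsample, preserving the outcome ratio
set.seed(123)
sub_idx <- createDataPartition(first_Try$outcome, p = 0.10, list = FALSE)
first_Small <- first_Try[sub_idx, ]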
The problem is probably related to the one-hot encoding caret applies to your categorical variables. Since you have a lot of categorical variables, this seems to be a real problem, because it blows up the size of your dataset: one-hot encoding creates a new column for every level of each categorical variable you have.
Maybe you could try something like the h2o package, which handles categorical variables differently, so the dataset does not explode when the model is run.
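It may also help that the dummy expansion comes from caret's formula interface; the non-formula (x/y) interface hands factors to the underlying random forest as factors instead of building a model matrix. A hedged sketch using the variable names from the question (note that randomForest itself caps factors at 53 levels, so very high-cardinality predictors may still need separate handling):
# illustrative: the x/y interface avoids the model.matrix() dummy expansion
x_cols <- setdiff(names(rf_Train), "outcome")
rf_RedHat <- train(x = rf_Train[, x_cols],
                   y = as.factor(rf_Train$outcome),
                   method = "rf",
                   trControl = model_Control)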

computing multiple fixed effects on large dataset

I'm trying to perform a fixed effects regression with two factor variables on a CSV dataset containing over 4,000,000 rows. These variables can respectively take on about 140,000 and 50,000 distinct integer values.
I initially attempted to perform the regression using the biglm and ff packages for R, as follows, on a Linux machine with 8 GB of memory; however, it seems this requires too much memory, because R complains about having to allocate a vector of a size greater than the maximum available on my machine.
library(biglm)
library(ff)
d <- read.csv.ffdf(file='data.csv', header=TRUE)
model = y~factor(a)+factor(b)-1
out <- biglm(model, data=d)
Some research online revealed that since factors are loaded into memory by ff, the latter will not significantly improve memory usage if many factor values are present.
Is anyone aware of some other way to perform the aforementioned regression on a dataset of the magnitude I described without having to resort to a machine with significantly more memory?
You should try the package lfe, it has been designed for exactly this purpose:
library(lfe)
...
out <- felm(y ~ 0|a+b, data=d)
fe <- getfe(out)
A proof of the method can be found here: http://www.sciencedirect.com/science/article/pii/S0167947313001266
Here's an R-journal article about it: http://journal.r-project.org/archive/2013-2/gaure.pdf
You can get the same fixed-effects estimates, mathematically, if you demean the variables within each category. So, instead of estimating a constant per dummy, you demean, and demeaning is very fast because it can be vectorized.
Edit 1:
See Greene (2012), pp. 400-401, for the mathematical proof.
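As a toy illustration of the demeaning idea, with made-up names and data: for a single factor, the fixed-effect estimates are just the group means, and subtracting them gives the within-transformed variable. With two factors, as in the question, a single demeaning pass is not exact in general; lfe alternates these projections until convergence, which is the method described in the articles linked above.
# toy illustration: one factor, made-up data
d <- data.frame(y = rnorm(1000),
                a = factor(sample(1:50, 1000, replace = TRUE)))
fe_a <- tapply(d$y, d$a, mean)        # one "fixed effect" per level of a
d$y_within <- d$y - ave(d$y, d$a)     # demeaned (within-transformed) y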

Resources