I am trying to impute a medium-sized data frame (~100,000 rows) in which 5 of the 30 columns have NAs (a large proportion, around 60%).
I tried mice with the following code:
data_3 = complete(mice(data_2))
During the first iteration I got the following error:
iter imp variable
1 1 Existing_EMI Loan_Amount Loan_Period
Error in solve.default(xtx + diag(pen)): system is computationally singular: reciprocal condition number = 1.08007e-16
Is there some other package that is more robust in this kind of situation? How can I deal with this problem?

Your 5 columns might contain a number of unbalanced factors. When these are turned into dummy variables, there is a high probability that one column will be a linear combination of others. The default imputation methods of mice involve linear regression, so this produces an X matrix that cannot be inverted, which causes your error.
Change the method to something else, such as cart: mice(data_2, method = "cart"). Also set a seed before imputing (for example through the seed argument of mice) so that the results are reproducible.
My advice is to go through the 7 vignettes of mice. There you can find out how to change the imputation method for individual columns instead of for the whole dataset.
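To see why collinear dummy columns trigger this error, here is a minimal base-R sketch. The data frame toy and its columns are made up for illustration, and the code does not call mice itself; it only reproduces the kind of matrix inversion that fails.

```r
# Hypothetical toy data: f2 is an exact copy of f1, so their dummy columns coincide.
set.seed(1)
f1  <- factor(sample(c("A", "B"), 50, replace = TRUE))
toy <- data.frame(f1 = f1, f2 = f1, x = rnorm(50))

# Design matrix with an intercept, one dummy per factor, and x.
X <- model.matrix(~ f1 + f2 + x, data = toy)

rank_X <- qr(X)$rank  # numerical rank of the design matrix
n_col  <- ncol(X)     # number of columns a regression would need to estimate

# Inverting X'X is what a linear-regression step needs; here it fails.
res <- tryCatch(solve(crossprod(X)), error = function(e) e)

rank_X < n_col          # TRUE: the design matrix is rank-deficient
inherits(res, "error")  # TRUE: solve() refuses to invert X'X
```

Comparing qr(X)$rank with ncol(X) on your own predictors is a quick way to check whether duplicated or perfectly collinear columns are present before imputing.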


VAR Model in R, Error in Solve.default(sigma)

I'm currently trying to fit a VAR model with 6 variables from an xts time series object. I have over 800 observations. The code I'm trying to run is
estim <- VAR(MinuteSeries, p = AIC , type = "both")
The value AIC is the AIC value retrieved from the lag-selection function. When I call summary on the result, I get the error:
Error in solve.default(Sigma) :
system is computationally singular: reciprocal condition number = 5.61898e-17
I have read online that this can be due to having more coefficients in the model than observations in the data. However, I have over 800 observations and still get this error with just 6 variables. Is size still the issue for my model, or am I missing something more important?
I had the very same issue with seemingly unproblematic data (60 observations of a 4-variable time series). I found the following advice online:
"It isn't just the high correlation of your variables, but also their scaling with respect to the response and/or the spatial coefficient. Using a different method= (say "LU") and using a power trace vector trs= may get you there too, but re-scaling the variable will also re-scale its square. The same problem affects the STSLS - re-scale the variable. If these are say in Euro, use thousand, million or whatever Euro instead, for example."
It helped when I rescaled GDP from dollars to billions of dollars.
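The effect of rescaling can be checked with a small base-R sketch. The two series below are simulated and their names are hypothetical; the point is only that a covariance matrix mixing a variable measured in dollars with one measured in percent can be too badly conditioned for solve(), while the same data in billions is not.

```r
set.seed(42)
n <- 60
gdp_dollars <- 1.5e13 + 1e12 * rnorm(n)  # hypothetical GDP in dollars
rate        <- 2 + rnorm(n)              # hypothetical rate in percent

# Covariance matrix on the raw scale: variances differ by many orders of magnitude.
S_raw     <- cov(cbind(gdp_dollars, rate))
rcond_raw <- rcond(S_raw)

# Same data with GDP expressed in billions of dollars.
gdp_billions <- gdp_dollars / 1e9
S_scaled     <- cov(cbind(gdp_billions, rate))
rcond_scaled <- rcond(S_scaled)

# solve() fails on the raw scale and succeeds after rescaling.
raw_inverse    <- tryCatch(solve(S_raw), error = function(e) e)
scaled_inverse <- solve(S_scaled)
```

Running rcond() on the covariance matrix of your own series before fitting gives a rough indication of whether scaling, rather than sample size, is the problem.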

Intuitive problem with multiple imputation

I've read a lot about multiple imputation and I think the explanations of the method on the internet are not very comprehensive. I have some doubts about it and would be grateful for your help.
Let's take the following code:
library(missForest)  # provides prodNA()
library(mice)
#taking iris data
data <- iris
#Randomly replace 10% of the values with NA
iris.mis <- prodNA(iris, noNA = 0.1)
#Running multiple imputation
imputed_Data <- mice(iris.mis, m=5, maxit = 50, method = 'pmm', seed = 500)
where :
m - the number of imputations made per missing observation (5 is normal–generates 5 data sets with imputed/original values)
maxit - the number of iterations
method - We use probable means
seed - Values to randomly generate from
As far as I understand, method = 'pmm' averages the results,
but I'm not able to understand what exactly happens when that function runs. Can you please explain the algorithm we are dealing with? What exactly are m and maxit responsible for?
As jay.sf pointed out, PMM stands for predictive mean matching. For each missing entry, the algorithm uses a regression on the other columns to compute a predicted value, finds the few observed cases whose predicted values are closest to it, and copies the observed value of one of those donors, chosen at random. Unlike simpler methods that plug in one constant for every gap (e.g. mean imputation), this gives a different imputed value per case, always one that actually occurs in the data, and it carries some inherent randomness.
Because of that randomness, the imputation isn't done just once but several times, so that the variation between the imputed data sets reflects the uncertainty about the missing values.
m is the number of imputed data sets that are produced. E.g. with m = 5 you get five completed data sets with slight variations in their imputed values.
maxit is the number of iterations of the algorithm. In each iteration every incomplete column is imputed again using the current values of the other columns, loosely similar to the epochs/iterations of an ML model.
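The matching idea can be sketched in a few lines of base R. This is a deliberately simplified illustration, not the code mice runs: the helper pmm_once is hypothetical, it handles a single numeric column, and it skips the step in which mice also draws the regression coefficients at random and cycles over all incomplete columns for maxit iterations.

```r
# Simplified predictive mean matching for one numeric column (illustration only).
pmm_once <- function(data, target, predictors, donors = 5) {
  miss <- is.na(data[[target]])
  # Regression of the target on the predictors, fitted on the observed rows.
  fit <- lm(reformulate(predictors, response = target), data = data[!miss, ])
  pred_obs  <- predict(fit, newdata = data[!miss, ])
  pred_miss <- predict(fit, newdata = data[miss, ])
  y_obs <- data[[target]][!miss]
  # For each missing case: find the donors with the closest predictions,
  # pick one at random, and copy its observed value.
  data[[target]][miss] <- vapply(pred_miss, function(p) {
    nearest <- order(abs(pred_obs - p))[seq_len(donors)]
    y_obs[nearest[sample.int(donors, 1)]]
  }, numeric(1))
  data
}

set.seed(500)
iris_mis <- iris
iris_mis$Sepal.Length[sample(nrow(iris), 15)] <- NA

# m = 5: repeat the random donor draw five times to get five completed data sets.
imputed_sets <- lapply(1:5, function(i) {
  pmm_once(iris_mis, "Sepal.Length", c("Sepal.Width", "Petal.Length"))
})
```

Every imputed Sepal.Length in imputed_sets is a value that was actually observed somewhere in the column, which is the defining property of PMM; the five data sets differ only in which donor was drawn.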

Generalized linear model vs Generalized additive model

I'm trying to follow this paper: Using a data science approach to predict cocaine use frequency from depressive symptoms, where they use glm and gam with the Beck Depression Inventory. I found a similar dataset to test those models, but I'm having a hard time with both of them. For example, I have two variables, d64a and d64b, which are coded 1, 2, 3, 4, meaning that they're ordinal. Also, in the paper y2 only takes the value 1, but I also have a variable extra (the proportion of consumption) that could serve as the dependent variable.
For the GAM model I have:
but I have the following error:
Error in, dk$data, dk$knots) :
A term has fewer unique covariate combinations than specified maximum degrees of freedom
Meanwhile for the glm, I have the following:
Since d64a and d64b are ordinal, I don't know whether I have to use factor().
The error message tells you that one or both of d64a and d64b have fewer unique values than the basis dimension of the smooth requires.
By default s(...) uses k = 10, which gives a basis of nine functions once the identifiability constraint is applied. You get this error if the covariate has fewer unique values than k.
Check which covariates are affected using:
and see what the number of unique values is for each of the covariates you wish to include. Then set the k argument to the number returned above if it is less than the default. For example, assume that the above checks returned 5 and 7 unique values, then you would indicate this by setting k as follows:
b <- gam(y2 ~ s(d64a, k = 5) + s(d64b, k = 7), data = DATOS2)
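As a sketch of the whole check, the following uses simulated data in place of DATOS2, so the data frame toy and its values are assumptions; only the column names are taken from the question. It counts the unique values per covariate, shows that the default basis fails, and then refits with k set to those counts.

```r
library(mgcv)  # recommended package that ships with standard R installations

set.seed(1)
n <- 200
# Hypothetical stand-in for DATOS2: two ordinal covariates coded 1 to 4.
toy <- data.frame(
  d64a = sample(1:4, n, replace = TRUE),
  d64b = sample(1:4, n, replace = TRUE)
)
toy$y2 <- 0.5 * toy$d64a + rnorm(n)

# Number of unique values in each covariate.
n_unique <- sapply(toy[c("d64a", "d64b")], function(x) length(unique(x)))

# The default basis dimension is too large for four unique values.
default_fit <- tryCatch(gam(y2 ~ s(d64a) + s(d64b), data = toy),
                        error = function(e) e)

# Setting k to the number of unique values lets the model fit.
b <- gam(y2 ~ s(d64a, k = n_unique[["d64a"]]) + s(d64b, k = n_unique[["d64b"]]),
         data = toy)
```

With only four distinct values a smooth has very little room to bend, so entering such covariates as (ordered) factors, as the question suggests, is a reasonable alternative to consider.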

Imputing missing observation

I am analysing a dataset with over 450k rows. About 100k rows in one of the columns I am looking at (pa1min_) have NA values, due to non-responses and other random factors. This column records workout times in minutes.
I don't think it makes sense to fill the NA values with the mean or median, given that they make up nearly a quarter of the data and given the bias that could create. I would like to impute the missing observations with a linear regression. However, I receive an error message:
Error: vector memory exhausted (limit reached?)
In addition: There were 50 or more warnings (use warnings() to see the first 50)
This is my code:
library(mice)
# single imputation with deterministic regression (m = 1)
imp_model <- mice(brfss2013, method="norm.predict", m=1)
# store data
data_imp <- complete(imp_model)
# multiple imputation
imp_model <- mice(brfss2013, m=5)
# building predictive model
fit <- with(data=imp_model, lm(y ~ x + z))
# combining results
combined <- pool(fit)
Here is a link to the data (compressed)
Note: I really just want to impute one column. The other columns in the data frame are a mixture of characters, integers and factors, some with more than 2 levels.
Similar to what MrFlick mentioned, you are running short of RAM.
Try running the algorithm on 1% of your data, and if that succeeds, check out the bigmemory package for on-disk computations.
I also encourage you to check whether the model you fit is already good enough without model-based imputation, because chasing perfect data may not be much more beneficial than simply imputing mean/median/first/last values.
Hope this helps.
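Since only one column needs imputing, one low-memory option is to fit the regression on a small random sample of complete rows and predict the gaps from a handful of predictors. The sketch below is plain base R, not mice, and the data frame toy with its columns age and sex is simulated for illustration; only the column name pa1min_ comes from the question. Like norm.predict, it is deterministic, so it understates the variability of the imputed values.

```r
set.seed(2013)
n <- 20000
# Hypothetical stand-in for the survey data.
toy <- data.frame(
  age = sample(18:80, n, replace = TRUE),
  sex = factor(sample(c("F", "M"), n, replace = TRUE))
)
toy$pa1min_ <- 200 - 1.5 * toy$age + rnorm(n, sd = 30)
toy$pa1min_[sample(n, n / 4)] <- NA  # about a quarter missing
original <- toy$pa1min_

# Fit on a 1% random sample of the complete rows to keep memory use small.
complete_rows <- which(!is.na(toy$pa1min_))
train_rows    <- sample(complete_rows, size = ceiling(0.01 * n))
fit <- lm(pa1min_ ~ age + sex, data = toy[train_rows, ])

# Fill only the missing entries of the one column with the predictions.
missing_rows <- which(is.na(toy$pa1min_))
toy$pa1min_[missing_rows] <- predict(fit, newdata = toy[missing_rows, ])
```

Restricting the model to a few relevant predictors, instead of handing all columns of the data frame to the imputation, is what keeps the design matrix small here.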

Model runs with glm but not bigglm

I was trying to run a logistic regression on 320,000 rows of data (6 variables). Stepwise model selection on a sample of the data (10000) gives a rather complex model with 5 interaction terms: Y~X1+ X2*X3+ X2*X4+ X2*X5+ X3*X6+ X4*X5. The glm() function could fit this model with 10000 rows of data, but not with the whole dataset (320,000).
Using bigglm to read data chunk by chunk from a SQL server resulted in an error, and I couldn't make sense of the results from traceback():
fit <- bigglm(Y~X1+ X2*X3+ X2*X4+ X2*X5+ X3*X6+ X4*X5,
chunksize=1000, maxit=10)
Error in coef.bigqr(object$qr) :
NA/NaN/Inf in foreign function call (arg 3)
> traceback()
11: .Fortran("regcf", as.integer(p), as.integer(p * p/2), bigQR$D,
bigQR$rbar, bigQR$thetab, bigQR$tol, beta = numeric(p), nreq = as.integer(nvar),
ier = integer(1), DUP = FALSE)
10: coef.bigqr(object$qr)
9: coef(object$qr)
8: coef.biglm(iwlm)
7: coef(iwlm)
6: bigglm.function(formula = formula, data = datafun, ...)
5: bigglm(formula = formula, data = datafun, ...)
4: bigglm(formula = formula, data = datafun, ...)
bigglm was able to fit a smaller model with fewer interaction terms, but it was not able to fit the same model even on a small dataset (10000 rows).
Has anyone run into this problem before? Any other approach to run a complex logistic model with big data?
I've run into this problem many times, and it was always caused by the fact that the chunks processed by bigglm did not contain all the levels of a categorical (factor) variable.
bigglm crunches data in chunks, and the default chunk size is 5000. If you have, say, 5 levels in your categorical variable, e.g. (a,b,c,d,e), and your first chunk (rows 1:5000) contains only (a,b,c,d) but no "e", you will get this error.
What you can do is increase the "chunksize" argument and/or cleverly reorder your data frame so that each chunk contains ALL the levels.
Hope this helps (at least somebody).
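Whether every chunk sees every level can be checked before fitting. The helper chunk_missing_levels below is a hypothetical base-R sketch, not part of biglm, and the factor x is simulated; it lists, per chunk, the levels that do not occur in it, and shows that shuffling the rows is one simple way to reorder the data.

```r
# For each chunk of `chunksize` rows, list the factor levels that never occur in it.
chunk_missing_levels <- function(f, chunksize = 5000) {
  f <- factor(f)  # drop unused levels
  chunk_id <- ceiling(seq_along(f) / chunksize)
  lapply(split(f, chunk_id), function(chunk) {
    setdiff(levels(f), unique(as.character(chunk)))
  })
}

set.seed(7)
# Hypothetical factor: level "e" only appears in the second half of the data.
x <- factor(c(sample(c("a", "b", "c", "d"), 5000, replace = TRUE),
              sample(letters[1:5], 5000, replace = TRUE)))

missing_before <- chunk_missing_levels(x)         # first chunk lacks "e"
shuffled       <- x[sample(length(x))]            # random reordering of the rows
missing_after  <- chunk_missing_levels(shuffled)  # every chunk has all levels
```

For a data frame, the same random permutation of row indices would be applied to all columns at once, and the check repeated for each factor in the model.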
OK, so we were able to find the cause of this problem:
For one combination of categories in one of the interaction terms, there are no observations. The glm function was able to run and returned NA as the estimated coefficient, but bigglm doesn't like it. bigglm was able to fit the model once I dropped this interaction term.
I'll do more research on how to deal with this kind of situation.
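An empty cell of this kind can be found with a cross-tabulation before fitting. The sketch below uses simulated data, so the data frame toy and its levels are assumptions; only the variable names follow the question. It shows the zero count in the table and the NA coefficient that glm reports for it.

```r
set.seed(3)
n <- 400
# Hypothetical data: the combination X2 = "c", X3 = "v" is never observed.
toy <- data.frame(
  X2 = factor(sample(c("a", "b", "c"), n, replace = TRUE)),
  X3 = factor(sample(c("u", "v"), n, replace = TRUE))
)
toy <- toy[!(toy$X2 == "c" & toy$X3 == "v"), ]
toy$Y <- rbinom(nrow(toy), 1, 0.5)

# Cross-tabulate the two factors; a zero marks an empty interaction cell.
cell_counts <- table(toy$X2, toy$X3)
empty_cells <- which(cell_counts == 0, arr.ind = TRUE)

# glm still runs but cannot estimate the coefficient for the empty cell.
fit      <- glm(Y ~ X2 * X3, family = binomial, data = toy)
na_coefs <- names(coef(fit))[is.na(coef(fit))]
```

Any term listed in na_coefs is a candidate for being dropped or for having its sparse levels merged before the model is handed to bigglm.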
I have met this error before, though it came from randomForest instead of biglm. The reason could be that the function cannot handle character variables, so you need to convert characters to factors. Hope this can help you.
