How can I debug a BLAS/LAPACK error in the mgcv::gam function - r

In R I'm running mgcv::gam() with 12000 observations and 130 parameters (mostly factors and 3 splines). I get the following error message:
Error in magic(G$y, G$X, msp, G$S, G$off, L = G$L, lsp0 = G$lsp0, G$rank, :
BLAS/LAPACK routine 'DLASCLMOIEDDLASRTDSP' gave error code -4
I run this function 100 times with everything from 8000 to 100000 observations and it runs fine, except in one instance. There are no NAs.
I've tried to extract the model matrix from the formula, but for GAMs this seems to require a model object, which I can't get because of the error.
I'm thinking this might be a collinearity issue, but I don't know how to check without a model matrix.
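One possible way to get at the design matrix without a fitted model, sketched on made-up data: gam() accepts fit = FALSE, in which case it only sets the model up and returns the pre-fit object, whose X component is the model matrix. The data frame dat, its column names, and the deliberately duplicated factor f2 are assumptions for illustration, not the data from the question.

```r
library(mgcv)

# Hypothetical data: f2 is an exact copy of f1, so the two factors are collinear.
set.seed(1)
n <- 200
dat <- data.frame(
  y  = rnorm(n),
  x1 = runif(n),
  f1 = factor(sample(c("a", "b", "c"), n, replace = TRUE))
)
dat$f2 <- dat$f1

# fit = FALSE sets the model up without estimating it and returns the
# pre-fit list; its X component is the model matrix.
G <- gam(y ~ s(x1) + f1 + f2, data = dat, fit = FALSE)
X <- G$X

# Compare the numerical rank with the number of columns.
qr_X <- qr(X)
n_deficient <- ncol(X) - qr_X$rank

# Columns that the pivoted QR moved to the end are the ones it flagged as
# linearly dependent on the others.
dependent_cols <- if (n_deficient > 0) {
  qr_X$pivot[(qr_X$rank + 1):ncol(X)]
} else {
  integer(0)
}

cat("columns:", ncol(X), " rank:", qr_X$rank, "\n")
print(dependent_cols)
```

A rank below the column count would point towards collinearity; it does not by itself explain the LAPACK error code, so treat it as a first diagnostic only.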

Related

"non-finite value supplied by optim" when using fitCopula

When I try to do an AIC test on different copulas, R keeps giving me this error message.
Error in optim(start, logL, lower = lower, upper = upper, method = optim.method, :
non-finite value supplied by optim
But I don't call optim anywhere in my code. Some copulas give this warning instead:
Warning in fitCopula.ml(copula, u = data, method = method, start = start, : possible convergence problem: optim() gave code=52
When the error occurs the result is NA, whereas when only the warning occurs I get a number that seems to be on the right track.
Here is my code.
library(copula)  # provides pobs() and fitCopula()

AIC.result <- function(EC, copulafunction) {
  n <- ncol(EC)
  # Named aic rather than AIC so the matrix does not shadow stats::AIC()
  aic <- matrix(nrow = n, ncol = n)
  for (i in seq_len(n)) {
    for (j in seq_len(n)) {
      if (i == j) {
        aic[i, j] <- 1
      } else {
        # Pseudo-observations for the pair of series (i, j)
        u <- pobs(as.matrix(EC[, i]))
        v <- pobs(as.matrix(EC[, j]))
        fit <- fitCopula(copulafunction, cbind(u, v), method = "ml")
        aic[i, j] <- AIC(fit)
      }
    }
  }
  mean((aic - n) / 2)
}
EC contains the returns of different countries, and copulafunction is one of several types of copula. The Clayton copula and the rotated Clayton copula give the error message, while the rest give the warning. The weirdest thing is that with EC, which contains 7 countries, everything worked smoothly. When I applied the same function to DC, which has 6 countries, the errors and warnings appeared. Does anyone know why?
First of all, if you only want the AIC for your model, I think the fitCopula function returns it by default. If not, the easy and direct way is to use the BiCopEst function from the R package VineCopula, which returns the AIC and BIC. The error may be due to fitting the wrong copula to your data, which sometimes prevents the optimizer from converging, hence the error or warning. So you should try another copula family. The best way to select the most appropriate copula for your data is the BiCopSelect() function in the VineCopula package. It selects the best bivariate copula from a wide range of families based on AIC, so it fits your case. You can also choose a different selection criterion through the function's arguments.
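A minimal sketch of that suggestion, assuming the VineCopula package is installed. The simulated pairs stand in for two return series and are not the data from the question; with real returns you would first convert each series to pseudo-observations, for example with pobs().

```r
library(VineCopula)

set.seed(42)
# Hypothetical stand-in for two return series: simulate 500 pairs from a
# Clayton copula (family code 3) with parameter 2. The output is already
# on the unit square, so no pobs() step is needed here.
sim <- BiCopSim(500, family = 3, par = 2)
u <- sim[, 1]
v <- sim[, 2]

# Let BiCopSelect() compare the candidate families by AIC.
# familyset = NA means "consider all implemented families".
sel <- BiCopSelect(u, v, familyset = NA, selectioncrit = "AIC")

print(sel$familyname)  # name of the selected family
print(sel$AIC)         # AIC of the selected fit
print(sel$BIC)         # BIC of the selected fit
```

Passing selectioncrit = "BIC" instead switches the criterion.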

Detecting errors with auto.arima beforehand (xreg is rank deficient / system is exactly singular)

I'm running auto.arima with external regressors over K=3000 different time series. For some time series, I either get the error "Error in auto.arima(Train[, K], xreg = newxreg) : xreg is rank deficient" or "Lapack routine dgesv: system is exactly singular: U[1,1] = 0". As I can't investigate all those time series manually, I'd like to add a check to my code so that auto.arima is run without external regressors if one of those two issues occurs, instead of my for-loop stopping each time.
This is my simplified code:
for (K in seq(1, 3000)) {
  Model <- auto.arima(Train[, K], xreg = newxreg)
}
Below is one example where I get the error "Lapack routine...". In the first column, you see the values of Train[,K] and in columns 2-6 the values of the regressors, which are stored in newxreg. So what kind of 'if' check can I add here so that I don't use auto.arima with xreg for this particular example?
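A possible shape for such a check, as a sketch rather than a tested recipe for auto.arima: test the rank of the regressors together with an intercept column up front, and additionally wrap the fit in tryCatch() so that any remaining error falls back to a fit without regressors. The helper names xreg_is_full_rank and fit_with_fallback are made up for this example, and stats::arima is used as a stand-in fitting function so the snippet runs without the forecast package.

```r
# TRUE when xreg, together with an intercept column, has full column rank.
xreg_is_full_rank <- function(xreg) {
  X <- cbind(1, as.matrix(xreg))
  qr(X)$rank == ncol(X)
}

# Try fit_fun with the regressors; fall back to fit_fun without them when the
# rank check fails or the fit itself raises an error.
fit_with_fallback <- function(y, xreg, fit_fun) {
  if (xreg_is_full_rank(xreg)) {
    fit <- tryCatch(fit_fun(y, xreg = xreg), error = function(e) NULL)
    if (!is.null(fit)) {
      return(list(fit = fit, used_xreg = TRUE))
    }
  }
  list(fit = fit_fun(y), used_xreg = FALSE)
}

# Stand-in fitting function (assumption): an AR(1) model from stats::arima.
ar1_fit <- function(y, xreg = NULL) {
  stats::arima(y, order = c(1, 0, 0), xreg = xreg)
}

set.seed(1)
y <- rnorm(60)
good_xreg <- cbind(a = rnorm(60), b = rnorm(60))
bad_xreg  <- cbind(a = rnorm(60), b = 0)  # constant column, collinear with the intercept

res_good <- fit_with_fallback(y, good_xreg, ar1_fit)
res_bad  <- fit_with_fallback(y, bad_xreg, ar1_fit)

cat("good xreg used:", res_good$used_xreg, "\n")
cat("bad xreg used: ", res_bad$used_xreg, "\n")
```

In the loop from the question the call would become something like fit_with_fallback(Train[, K], newxreg, auto.arima)$fit, since auto.arima also takes the series first and an xreg argument.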

R clValid function Error for huge dataset

I'm trying to evaluate my clustering results using the clValid package.
I ran the following, but it gives me an error:
intern <- clValid(test_clvalid, 3:25, maxitems = 260000, clMethods="kmeans", validation="internal")
Error in hclust(Dist, method) : size cannot be NA nor exceed 65536
test_clvalid is my data set, it has 256342 observations with 5 numeric variables.
When I ran the same call with fewer observations, it seemed to run fine. I'm not sure why hclust() is called and raises an error when I specify k-means evaluation.
Unfortunately, that package uses hclust to initialize the input to kmeans, as you can see here. That also means that, before that, the cross-distance matrix was calculated, which has 256,342 x 256,342 dimensions for your whole dataset. The hclust function is hard-coded to deal with matrices that are 65536 x 65536 at the most, so you won't be able to use that package to evaluate k-means on your data.
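To put numbers on that, here is a rough back-of-the-envelope calculation, assuming the distances are stored the way dist() stores them (lower triangle only, 8 bytes per value). Even before hclust's size limit applies, the distance object for the full dataset would need hundreds of gigabytes.

```r
n <- 256342  # number of observations from the question

# dist() keeps only the lower triangle: n * (n - 1) / 2 doubles of 8 bytes each.
n_pairs <- n * (n - 1) / 2
gib <- n_pairs * 8 / 1024^3

# hclust() rejects inputs with more than 65536 observations.
hclust_limit <- 65536

cat(sprintf("pairwise distances: %.0f\n", n_pairs))
cat(sprintf("memory for the dist object: about %.0f GiB\n", gib))
cat("within hclust limit:", n <= hclust_limit, "\n")
```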

Model runs with glm but not bigglm

I was trying to run a logistic regression on 320,000 rows of data (6 variables). Stepwise model selection on a sample of the data (10000) gives a rather complex model with 5 interaction terms: Y~X1+ X2*X3+ X2*X4+ X2*X5+ X3*X6+ X4*X5. The glm() function could fit this model with 10000 rows of data, but not with the whole dataset (320,000).
Using bigglm to read data chunk by chunk from a SQL server resulted in an error, and I couldn't make sense of the results from traceback():
fit <- bigglm(Y~X1+ X2*X3+ X2*X4+ X2*X5+ X3*X6+ X4*X5,
data=sqlQuery(myconn,train_dat),family=binomial(link="logit"),
chunksize=1000, maxit=10)
Error in coef.bigqr(object$qr) :
NA/NaN/Inf in foreign function call (arg 3)
> traceback()
11: .Fortran("regcf", as.integer(p), as.integer(p * p/2), bigQR$D,
bigQR$rbar, bigQR$thetab, bigQR$tol, beta = numeric(p), nreq = as.integer(nvar),
ier = integer(1), DUP = FALSE)
10: coef.bigqr(object$qr)
9: coef(object$qr)
8: coef.biglm(iwlm)
7: coef(iwlm)
6: bigglm.function(formula = formula, data = datafun, ...)
5: bigglm(formula = formula, data = datafun, ...)
4: bigglm(formula = formula, data = datafun, ...)
bigglm was able to fit a smaller model with fewer interaction terms, but it could not fit the complex model above even with a small dataset (10000 rows).
Has anyone run into this problem before? Any other approach to run a complex logistic model with big data?
I've run into this problem many times, and it was always caused by the fact that the chunks processed by bigglm did not contain all the levels of a categorical (factor) variable.
bigglm crunches data in chunks, and the default chunk size is 5000. If you have, say, 5 levels in your categorical variable, e.g. (a,b,c,d,e), and your first chunk (rows 1:5000) contains only (a,b,c,d) but no "e", you will get this error.
What you can do is increase the "chunksize" argument and/or cleverly reorder your data frame so that each chunk contains ALL the levels.
Hope this helps (at least somebody).
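A quick way to test for this before calling bigglm, sketched in base R. The helper name missing_levels_by_chunk and the toy factor are made up for illustration; the check assumes the chunks are consecutive blocks of rows, as described above.

```r
# For each consecutive chunk of `chunksize` rows, list the factor levels that
# do not occur in that chunk.
missing_levels_by_chunk <- function(f, chunksize) {
  f <- droplevels(as.factor(f))
  chunk_id <- ceiling(seq_along(f) / chunksize)
  lapply(split(f, chunk_id), function(chunk) {
    setdiff(levels(f), unique(as.character(chunk)))
  })
}

# Hypothetical factor sorted by level, so no chunk of 50 rows sees every level.
f_sorted <- factor(c(rep(c("a", "b", "c", "d"), each = 30), rep("e", 30)))
missing_sorted <- missing_levels_by_chunk(f_sorted, chunksize = 50)
print(missing_sorted)

# Shuffling the rows makes it much more likely that every chunk sees every level.
set.seed(123)
f_shuffled <- sample(f_sorted)
missing_shuffled <- missing_levels_by_chunk(f_shuffled, chunksize = 50)
cat("chunks with missing levels after shuffling:",
    sum(lengths(missing_shuffled) > 0), "\n")
```

If any chunk still reports missing levels, a larger chunksize or a different row order is needed for that variable.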
OK, so we were able to find the cause of this problem:
for one category in one of the interaction terms, there are no observations. The "glm" function was able to run and returned "NA" as the estimated coefficient, but "bigglm" doesn't like it. "bigglm" was able to fit the model once I dropped this interaction term.
I'll do more research on how to deal with this kind of situation.
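Empty cells like this can be spotted before fitting by cross-tabulating the two factors of an interaction. A small sketch on made-up data, where the column names X2 and X3 are borrowed from the formula in the question but the values are invented:

```r
# Hypothetical data: the combination X2 = "b", X3 = "z" never occurs.
dat <- data.frame(
  X2 = factor(c("a", "a", "a", "b", "b", "a")),
  X3 = factor(c("x", "y", "z", "x", "y", "x"))
)

# Cross-tabulate the two factors; zero cells are combinations without data.
counts <- table(dat$X2, dat$X3)
empty <- which(counts == 0, arr.ind = TRUE)

# Report the empty combinations by level name.
empty_cells <- data.frame(
  X2 = rownames(counts)[empty[, 1]],
  X3 = colnames(counts)[empty[, 2]]
)
print(empty_cells)
```

Repeating this for each interaction term in the formula shows which terms would produce an inestimable coefficient.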
I've met this error before, though it came from randomForest rather than biglm. The reason could be that the function cannot handle character variables, so you need to convert characters to factors. Hope this helps.

How to get stepwise logistic regression to run faster

I'm using the standard glm function with step function on 100k rows and 107 variables. When I did a regular glm I got the calculation done within a minute or two but when I added step(glm(...)) it runs for hours.
I tried to run it as a matrix, but it has still been running for about half an hour and I'm not sure it will ever finish. When I ran it on 9 variables it gave me the answers in a few seconds, but with 9 warnings, all of them "Warning messages: 1: glm.fit: fitted probabilities numerically 0 or 1 occurred".
I used the line of code below: is it wrong? What should I do in order to gain better running time?
logit1back <- step(glm(IsChurn ~ var1 + var2+ var3+ var4+
var5+ var6+ var7+ var8+ var9, data=tdata , family='binomial'))
