read file from memory for regression (R) - r

When trying to use the shglm function of the speedglm package I have a problem. As the file is too large to read into memory, I wanted to use a data-reading function (the datafun argument), as outlined in the help pages for the package. The function is
make.data<-function(filename, chunksize,...){
  conn<-NULL
  function(reset=FALSE){
    if(reset){
      if(!is.null(conn)) close(conn)
      conn<<-file(filename,open="r")
    } else{
      rval<-read.table(conn, nrows=chunksize,...)
      if (nrow(rval)==0) {
        close(conn)
        conn<<-NULL
        rval<-NULL
      }
      return(rval)
    }
  }
}
load("ti.RData")
I then take my data frame (called ti) and write it to a table
write.table(ti,"data1.txt",row.names=FALSE,col.names=FALSE)
as in the example here: http://www.inside-r.org/packages/cran/speedglm/docs/shglm. Afterwards I run
da<-make.data("data1.txt",chunksize=10000,col.names=colnames(ti))
rm(ti)
b1<-shglm(T2D~factor(SIBCO)+factor(POCOD),datafun=da,family=binomial())
But I get an error
Error in dev.resids(y, mu, weights) :
argument mu must be a numeric vector of length 1 or length 802
I am happy to upload my data set, but can somebody maybe roughly tell me where to start debugging? I think that when data1.txt is read back in through the data function (with read.table), some factors in the original data frame are converted to integers. This is the reason I put factor() around the variables. Any suggestion would be very helpful.

The short answer is that there is probably something wrong with your input data. Without the input data it is hard to say, but based on my experience running shglm on a binomial GLM with factors, this is where I would start.
As a general debugging strategy you can try something like the following:
add the lines debug(shglm) and options(error=recover) to your script
turn on the trace=T option for shglm
start R and load your script as source("myscript.R")
step through the debugger and use ls() to see the variables currently present and inspect them with dim() colnames() etc.
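Alongside the debugger, it can help to call the data function by hand and look at what each chunk contains. The following is a minimal sketch, assuming a small made-up data frame in place of ti and a temporary file in place of data1.txt; make.data is the closure from the question.

```r
# Same closure as in the question: reset=TRUE (re)opens the file,
# any other call returns the next chunk (or NULL at end of file).
make.data <- function(filename, chunksize, ...) {
  conn <- NULL
  function(reset = FALSE) {
    if (reset) {
      if (!is.null(conn)) close(conn)
      conn <<- file(filename, open = "r")
    } else {
      rval <- read.table(conn, nrows = chunksize, ...)
      if (nrow(rval) == 0) {
        close(conn)
        conn <<- NULL
        rval <- NULL
      }
      return(rval)
    }
  }
}

# Hypothetical toy data standing in for 'ti'
ti <- data.frame(T2D = c(0, 1, 0, 1, 1, 0), SIBCO = c(1, 1, 2, 2, 3, 3))
path <- tempfile(fileext = ".txt")
write.table(ti, path, row.names = FALSE, col.names = FALSE)

da <- make.data(path, chunksize = 3, col.names = colnames(ti))
da(reset = TRUE)   # open the connection
chunk1 <- da()     # first 3 rows
chunk2 <- da()     # next 3 rows
chunk3 <- da()     # end of file: the closure returns NULL and closes the connection

str(chunk1)                    # column types as read.table sees them
levels(factor(chunk1$SIBCO))   # levels present in the first chunk
levels(factor(chunk2$SIBCO))   # levels present in the second chunk
```

Comparing str() and the factor levels across chunks shows quickly whether a column changes type or whether some levels are missing from a chunk.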
Now in my experience shglm returns rather cryptic error messages that may change depending on the size of your input chunks (as this changes the data and the factors the model knows about). Below I list a couple of things to check in your data and some common errors that I encountered while getting it to work which may help you to get your own model running.
Regarding the data, make sure that:
The dependent variable is 0/1 or a proportion 0 <= y <= 1. If you have successes and failures, you can use the weights parameter to give the total number of tries and calculate the proportion in the formula, i.e., success/(success + failures). A common error is:
Error in if (any(y < 0 | y > 1)) stop("y values must be 0 <= y <= 1") :
missing value where TRUE/FALSE needed
Calls: shglm -> eval -> eval
Specify all the levels of the factors (don't forget default values) and make sure that they are sorted, i.e., factor(age, levels = c("24andbelow", "25to49", "50to74", "75andover")), otherwise you will get errors like:
Error in crossprod(weights, y) : non-conformable arguments
Calls: shglm -> crossprod -> crossprod
Error in XTX[rownames(Ax), colnames(Ax)] : subscript out of bounds
Calls: shglm
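To see why the explicit levels matter when data arrives in chunks, here is a small base R sketch. The two chunks and the age values are made up for illustration, and model.matrix stands in for whatever design matrix shglm builds internally (an assumption, not a description of its code).

```r
# Hypothetical chunks: each one happens to contain only some of the age groups
age_levels <- c("24andbelow", "25to49", "50to74", "75andover")
chunk1 <- data.frame(age = c("24andbelow", "25to49", "25to49"))
chunk2 <- data.frame(age = c("50to74", "75andover", "24andbelow"))

# Without explicit levels, each chunk yields different design-matrix columns
cols_implicit_1 <- colnames(model.matrix(~ factor(age), chunk1))
cols_implicit_2 <- colnames(model.matrix(~ factor(age), chunk2))
identical(cols_implicit_1, cols_implicit_2)

# With explicit levels, every chunk yields the same columns in the same order
cols_explicit_1 <- colnames(model.matrix(~ factor(age, levels = age_levels), chunk1))
cols_explicit_2 <- colnames(model.matrix(~ factor(age, levels = age_levels), chunk2))
identical(cols_explicit_1, cols_explicit_2)
```

A chunked fit has to accumulate cross-products over chunks, so columns that appear or disappear between chunks are a plausible source of the non-conformable and subscript errors listed above.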
Now I did not get your specific error, but something close enough that I thought I should mention it. I got it when I tried to supply a formula with two columns (for successes and failures, as you can in regular glm), i.e., cbind(success, failures)~factor(var1) + factor(var2):
Error in dev.resids(y, mu, weights) :
argument wt must be a numeric vector of length 1 or length 10
Calls: shglm -> dev.resids
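If you have such a two-column response, the proportion-plus-weights form gives the same fit in ordinary glm. A minimal sketch with made-up counts, using base glm rather than shglm (how shglm handles weights should be checked against its help page):

```r
# Made-up counts for illustration only
d <- data.frame(
  x        = c(0, 0, 1, 1, 2, 2),
  success  = c(2, 3, 5, 4, 8, 7),
  failures = c(8, 7, 5, 6, 2, 3)
)
d$total <- d$success + d$failures

# Two-column response, as accepted by regular glm
fit_cbind <- glm(cbind(success, failures) ~ x, family = binomial(), data = d)

# Proportion as response, number of tries as weights
fit_prop <- glm(success / total ~ x, weights = total, family = binomial(), data = d)

# The coefficients should agree up to numerical tolerance
all.equal(coef(fit_cbind), coef(fit_prop), tolerance = 1e-6)
```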
I guess the main takeaway is to check your input data.

Related

R: invalid 'times' argument calculating CRPS

I'm trying to calculate CRPS using the verification package in R. The data appears to read in OK, but I get an error when trying to compute the CRPS itself: "invalid 'times' argument". However, all values are real, there are no negative values, and I'm testing for NaN/NA values and ignoring those. Having searched around, I can't find any solution which explains why I'm getting this error. I'm reading the data in from netCDF files into larger arrays, and then computing CRPS for each grid cell in those arrays.
Any help would be greatly appreciated!
The relevant snippet from the code I'm using is:
## for each grid cell, get obs (wbarray) and 25 ensemble members of forecast eps (fcstarray)
for(x in 1:3600){
  for(y in 1:1500){
    obs=wbarray[x,y]
    eps=fcstarray[x,y,1:25]
    if(!is.na(obs)){
      print(obs)
      print(eps)
      print("calculating CRPS - real value found")
      crpsfcst=(crpsDecomposition(obs,eps)$CRPS)
      CRPSfcst[x,y,w]=crpsfcst
    }
  }
}
(w is specified in an earlier loop)
And the output I get:
obs: 0.3850737
eps: 0.3382506 0.3466184 0.3508921 0.3428135 0.3416993 0.3423528 0.3307764
0.3372431 0.3394377 0.3398165 0.3414395 0.3531360 0.3319155 0.3453161
0.3362813 0.3449474 0.3340050 0.3278898 0.3380596 0.3379150 0.3429202
0.3467927 0.3419354 0.3472489 0.3550797
"calculating CRPS - real value found"
Error in rep(0, nObs * (nMember +1)) : invalid 'times' argument
Calls: crpsDecomposition
Execution halted
If you type crpsDecomposition on your R command prompt you'll get the source code for the function. The first few lines show:
function (obs, eps)
{
nMember = dim(eps)[2]
nObs <- length(obs)
Since your eps data object appears to be (from your output) a one-dimensional vector, dim(eps) is NULL, so dim(eps)[2] is NULL as well, which sets nMember to NULL. Thus nObs*(nMember + 1) evaluates to a zero-length vector, which rep() rejects as an invalid 'times' argument. I imagine you simply need to re-examine what form eps should take, because it would appear that it needs to be a matrix where each column corresponds to a different "member" (whatever that means in this context).
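The mechanism can be reproduced in base R without the verification package. This sketch uses only the first three ensemble values from the output above; reshaping to a one-row matrix is my assumption about what the function expects (one row per observation, one column per member), so check it against the package documentation.

```r
obs <- 0.3850737
eps <- c(0.3382506, 0.3466184, 0.3508921)   # first three values of the printed ensemble

# What the first lines of crpsDecomposition compute for a plain vector
n_member <- dim(eps)[2]            # NULL, because a vector has no dim attribute
n_obs    <- length(obs)
times    <- n_obs * (n_member + 1) # zero-length numeric, not 0

# rep() refuses a zero-length 'times'; capture the message instead of stopping
err <- tryCatch(rep(0, times), error = function(e) conditionMessage(e))
err

# Assumed fix: one row per observation, one column per ensemble member
eps_mat <- matrix(eps, nrow = 1)
dim(eps_mat)[2]                    # now a proper member count
```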

Error in family$linkinv(eta) : Argument eta must be a nonempty numeric vector

The title of the question is the error I am getting because I simply do not know how to interpret it, no matter how much I research. Whenever I run a logistic regression with bigglm() (from the biglm package, designed to run regressions over large amounts of data), I get:
Error in family$linkinv(eta) : Argument eta must be a nonempty numeric vector
This is what my bigglm() call looks like:
fit <- bigglm(f, data = df, family=binomial(link="logit"), chunksize=100, maxit=10)
Where f is the formula and df is the data frame (of a little over a million rows and about 210 variables).
So far I have tried changing my dependent variable to a numeric class but that didn't work. My dependent variable has no missing values.
Judging from the error message, I wonder if this might have anything to do with the family argument in the bigglm() function. I have found numerous other websites with people asking about the same error, and most of them are either unanswered or about a completely different case.
The error "Argument eta must be a nonempty numeric vector" suggests to me that your data has either empty values or NAs, so please check your data. Whatever advice we provide here cannot be tested until we see your code or the steps that produce the error.
Try this:
any(is.na(df)) # if TRUE, the data contains NAs
df[is.na(df)] <- 0 # replaces NAs with 0; not sure what effect this will have on your model
or, for whatever line of the code is generating the NAs, pass the na.rm=TRUE argument.
Again, we can only speculate. Hope it helps.
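To narrow down where the NAs are before deciding what to do with them, a per-column count is usually more informative than a single TRUE/FALSE. A minimal sketch on a made-up data frame (the column names are hypothetical):

```r
# Hypothetical toy data frame standing in for df
df <- data.frame(
  y  = c(0, 1, 1, NA),
  x1 = c(1.5, NA, 3.2, 4.1),
  x2 = c("a", "b", "b", "a")
)

# Number of NAs in each column, then only the columns that have any
na_per_column <- colSums(is.na(df))
na_per_column[na_per_column > 0]

# Alternative to overwriting NAs with 0: keep only the complete rows
df_complete <- df[complete.cases(df), ]
nrow(df_complete)
```

Dropping incomplete rows changes the sample, and filling with 0 changes the values, so which one is appropriate depends on the data.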

What does "argument to 'which' is not logical" mean in FactoMineR MCA?

I'm trying to run an MCA on a data.table using FactoMineR. It contains only 0/1 numerical columns, and its size is 200,000 x 20.
require(FactoMineR)
result <- MCA(data[, colnames, with=F], ncp = 3)
I get the following error :
Error in which(unlist(lapply(listModa, is.numeric))) :
argument to 'which' is not logical
I didn't really know what to do with this error. Then I tried to turn every column to character, and everything worked. I thought it could be useful to someone else, and that maybe someone would be able to explain the error to me ;)
Cheers
Are the classes of your variables character or factor? I was having this problem. My solution was to change all variables to factor.
#my data.frame was "aux.da"
i=0
while(i < ncol(aux.da)){
  i=i+1
  aux.da[,i] = as.factor(aux.da[,i])
}
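The same conversion can be written without an explicit loop. A small sketch, assuming aux.da is a plain data.frame (the toy columns are made up); a data.table would need its own syntax.

```r
# Hypothetical 0/1 data standing in for aux.da
aux.da <- data.frame(a = c(0, 1, 1, 0), b = c(1, 1, 0, 0))

# aux.da[] keeps the data.frame structure while replacing every column
aux.da[] <- lapply(aux.da, as.factor)

sapply(aux.da, class)
```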
It's difficult to tell without further input, but what you can do is:
Find the function where the error occurred (via traceback()),
Set a breakpoint and debug it:
trace(tab.disjonctif, browser)
I did the following (offline) to find the name of tab.disjonctif:
Found the package on the CRAN mirror on GitHub
Searched for the particular expression that gives the error
I just started to learn R yesterday, but the error comes from the fact that MCA is for categorical data, so that's why your data cannot be numeric. To be more precise, before the MCA a "tableau disjonctif" (sorry, I don't know the exact English term: complete disjunctive matrix) is created.
So FactomineR is using this function :
https://github.com/cran/FactoMineR/blob/master/R/tab.disjonctif.R
Where I think it's looking for categorical values that can be matched to a numerical value (like Y = 1, N = 0).
For others, be careful: for R, categorical data means the factor type, so even if you have characters you could get this error.
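To make the idea of a complete disjunctive table concrete, here is a base R illustration with a made-up Y/N variable. It uses model.matrix and is not FactoMineR's tab.disjonctif implementation, only the same kind of 0/1 indicator layout.

```r
# Hypothetical categorical answers
answer <- factor(c("Y", "N", "Y", "Y"))

# One indicator column per category; "- 1" drops the intercept so no level is left out
disjunctive <- model.matrix(~ answer - 1)
disjunctive

# Each row has exactly one 1, because each observation belongs to one category
rowSums(disjunctive)
```

A numeric column has no levels to expand in this way, which is why the variables need to be factors first.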
To build off @marques, @Khaled, and @Pierre Gourseaud:
Yes, changing the format of your variables to factor should address the error message, but you shouldn't change the format of numerical data to factor if it's supposed to be continuous numerical data. Rather, if you have both continuous and categorical variables, try running a Factor Analysis for Mixed Data (FAMD) in the same FactoMineR package.
If you go the FAMD route, you can change the format of just your categorical variable columns to factor with this:
data[,c(3:5,10)] <- lapply(data[,c(3:5,10)] , factor)
(assuming column numbers 3,4,5 and 10 need to be changed).
This will not work for only numeric variables. If you only have numeric variables, use PCA. Otherwise, add a factor variable to your data frame. It seems like for your case you need to change your variables to binary factors.
I had the same problem as well, and changing to factor did not solve it either, because I had put every variable as supplementary.
What I did first was transform all my numeric data to factor :
Xfac = factor(X[,1], ordered = TRUE)
for (i in 2:29){
  tfac = factor(X[,i], ordered = TRUE)
  Xfac = data.frame(Xfac, tfac)
}
colnames(Xfac)=labels(X[1,])
Still, it would not work. My second problem was that I had included EVERY factor as a supplementary variable!
So these :
MCA(Xfac, quanti.sup = c(1:29), graph=TRUE)
MCA(Xfac, quali.sup = c(1:29), graph=TRUE)
Would generate the same error, but this one works :
MCA(Xfac, graph=TRUE)
Not transforming the data to factors also generated the problem.
I posted the same answer to a related topic : https://stackoverflow.com/a/40737335/7193352

Error in huge R package when criterion "stars"

I am trying to do an association network using some expression data I have, the data is really huge: 300 samples and ~30,000 genes. I would like to apply a Gaussian graphical model to my data using the huge R package.
Here is the code I am using
dim(data)
#[1] 317 32291
huge.out <- huge.npn(data)
huge.stars <- huge.select(huge.out, criterion="stars")
However in this last step I got an error:
Error in cor(x) : ling....in progress:10%
Missing values present in input variable 'x'. Consider using use = 'pairwise.complete.obs'
Any help would be very appreciated.
You posted this exact question on Rhelp today. Both SO and Rhelp deprecate cross-posting but if you do choose to switch venues it is at the very least courteous to inform the readership.
You responded to the suggestion here on SO that there were missing data in your data-object named 'data' by claiming there were no missing data. So what does this code return:
lapply(data , function(x) sum(is.na(x)))
That would be a first-level check, but there could also be an error caused by a later step that encountered a missing value in the matrix of correlation coefficients computed from 'huge.out'. That could happen if a) there were infinities in the calculations or b) one of the columns were constant:
> cor(c(1:10,Inf), 1:11)
[1] NaN
> cor(rep(2,7), rep(2,7))
[1] NA
Warning message:
In cor(rep(2, 7), rep(2, 7)) : the standard deviation is zero
So the next check is:
sum( is.na(huge.out) )
That will at least give you some basis for defending your claim of no missings and will also give you a plausible theory as to the source of the error. To locate a column that is entirely constant you might do something like this (assuming it were a dataframe):
which(sapply(lapply(data, unique), length) == 1)
If it's a matrix, you need to use apply.
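For the matrix case, a minimal sketch on a made-up matrix with one constant column; the same calls should carry over to the real data, although on 32,291 columns cor() itself will be slow.

```r
# Hypothetical 4 x 3 matrix; column "b" is constant
m <- cbind(a = c(1, 2, 3, 4), b = rep(2, 4), c = c(4, 3, 2, 1))

# Missing values per column (colSums works directly on a matrix)
colSums(is.na(m))

# Columns with a single unique value, i.e. zero standard deviation
constant_cols <- which(apply(m, 2, function(col) length(unique(col)) == 1))
constant_cols

# A constant column produces NA correlations even though the input has no NA
r <- suppressWarnings(cor(m))
anyNA(r)
```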

"Error in 1:ncol(x) : argument of length 0" when using Amelia in R

I am working with panel data. I have well over 6,000 country-year observations, and have specified my Amelia imputation as follows:
(CountDependentVariable, m=5, ts="year", cs="cowcode",
sqrts=c("OtherCountVariable2", "OtherCount3", "OtherCount4"),
ords=c("OrdinalVar1", "Ordinal Variable 2"),
lgstc=c("ProportionVariale"),
noms=c("NominalVar1"),p2s = 0, idvars = c("country"))
When I run those lines of code, I continue to receive the following error:
Error in 1:ncol(x) : argument of length 0
I've seen people get a similar error, but in different contexts. Importantly, there are several continuous independent variables I left out of the Amelia code, because I am under the impression that they get imputed WITHOUT having to do so. Does anyone know:
1) What this error means?
2) How to correct this error?
Update #1: Provided more context, in terms of the types of variables in my count panel data, in the above sample code.
Update #2: I did some research, and ran into an R file containing a function that diagnoses possible errors for Amelia code. After running the code, I got the following error message first (and many more thereafter):
AMn<-nrow(x)
Error in nrow(x) : object 'x' not found
AMp<-ncol(x)
Error in ncol(x) : object 'x' not found
subbedout<-c(idvars,cs,ts)
Error: object 'idvars' not found
Error Code: 4
if (any(colSums(!is.na(x)) <= 1)) {
all.miss <- colnames(x)[colSums(!is.na(x)) <= 1]
if (is.null(all.miss)) {
all.miss <- which(colSums(!is.na(x)) <= 1)
}
all.miss <- paste(all.miss, collapse = ", ")
error.code<-4
error.mess<-paste("The data has a column that is completely missing or only has one,observation. Remove these columns:", all.miss)
return(list(code=error.code,mess=error.mess))
}
Error in is.data.frame(x) : object 'x' not found
Error codes: 5-6
Errors in one of the list variables
idout<-listcheck(idvars,"One of the 'idvars'")
Error in identical(vars, NULL) : object 'idvars' not found
Currently, there are no missing values for the country variable I place in the idvars argument. However, the very first "chunk" of errors wants me to believe that this is so.
Am I not properly specifying the Amelia code I have above?
I had forgotten to specify the data frame in the original Amelia code (slaps hand on forehead). So now, after resolving the wacky issue above, I am getting the following error from Amelia:
Amelia Error Code: 44
One of the variable names in the options list does not match a variable name in the data.
I've checked the variable names, and they match, verbatim, to what I named them in the dataframe.
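One way to double-check the names programmatically is to compare the names passed in the options with names() of the data frame. The sketch below uses a made-up data frame; it also shows one possible cause that I cannot confirm from the post: data.frame() and read.csv() rewrite names containing spaces by default, so a name like "Ordinal Variable 2" may not exist in the data under that spelling.

```r
# Hypothetical data frame; the backticked name contains spaces
dat <- data.frame(
  year = 2000:2002,
  cowcode = c(2, 2, 2),
  `Ordinal Variable 2` = c(1, 2, 3)
)

# With the default check.names = TRUE the spaces have become dots
names(dat)

# Names used in the Amelia options (ts, cs, ords, ...), collected by hand
used <- c("year", "cowcode", "Ordinal Variable 2")

# Anything listed here is not a column of the data
setdiff(used, names(dat))
```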
