Error in sparse.model.matrix creation - r

I'm trying to create a sparse.model.matrix (from the Matrix package) with a formula where there is an interaction between two factors. This is fine if my input data has multiple rows but as soon as I have just one row I get the error:
Error in model.spmatrix(t, data, transpose = transpose, drop.unused.levels = drop.unused.levels, : cannot get a slot ("Dim") from an object of type "double"
For example, this doesn't work:
f<-(mpg~as.factor(cyl)*as.factor(hp))
y<-mtcars
y$cyl<-as.factor(y$cyl)
y$hp<-as.factor(y$hp)
x<-y[1,]
myMatrix<-Matrix::sparse.model.matrix(f,x)
However duplicating x across two rows causes the error to disappear:
x<-rbind(x,x)
myMatrix<-Matrix::sparse.model.matrix(f,x)
I've traced the error to Matrix:::Csparse_vertcat / Matrix:::Csparse_horzcat within Matrix:::model.spmatrix, but I am unable to work out what these functions, which are written in C, are trying to do. Any ideas? Please note that, as far as I can determine, this error only occurs when the matrix is built for an interaction between two factors.
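A possible workaround, offered as a sketch rather than a fix for the underlying bug: build the dense model matrix with model.matrix() and convert it with Matrix::Matrix(..., sparse = TRUE). This assumes the dense matrix fits in memory, which is trivially true for a single row. Whether sparse.model.matrix itself still fails may depend on the installed Matrix version.

```r
library(Matrix)

f <- mpg ~ as.factor(cyl) * as.factor(hp)
y <- mtcars
y$cyl <- as.factor(y$cyl)
y$hp  <- as.factor(y$hp)
x <- y[1, ]                      # single row; the factor levels are kept

# Build the dense design matrix first, then convert it to a sparse one
dense    <- model.matrix(f, x)
myMatrix <- Matrix(dense, sparse = TRUE)
dim(myMatrix)
```

The result has one row and the same columns as the dense matrix, so it can be used wherever the sparse.model.matrix output was expected, provided the downstream code does not rely on attributes that only sparse.model.matrix sets.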

Related

read file from memory for regression (R)

When trying to use the shglm function of the speedglm package, I have a problem. As the file is too large to read into memory, I wanted to use a link function as outlined in the help pages for the package. The link function is
make.data<-function(filename, chunksize,...){
conn<-NULL
function(reset=FALSE){
if(reset){
if(!is.null(conn)) close(conn)
conn<<-file(filename,open="r")
} else{
rval<-read.table(conn, nrows=chunksize,...)
if ((nrow(rval)==0)) {
close(conn)
conn<<-NULL
rval<-NULL
}
return(rval)
}
}
}
load("ti.RData")
I then take my data frame (called ti) and write it to a table
write.table(ti,"data1.txt",row.names=FALSE,col.names=FALSE)
as in the example here http://www.inside-r.org/packages/cran/speedglm/docs/shglm. Afterwards
da<-make.data("data1.txt",chunksize=10000,col.names=colnames(ti))
rm(ti)
b1<-shglm(T2D~factor(SIBCO)+factor(POCOD),datafun=da,family=binomial())
But I get an error
Error in dev.resids(y, mu, weights) :
argument mu must be a numeric vector of length 1 or length 802
I am happy to upload my data set, but can somebody roughly tell me where to start debugging? I think that when data1.txt is read in through the link function (with read.table), some factors in the original data frame are converted to integers. This is the reason I put factor around the variables. Any suggestion would be very helpful.
The short answer is that there is probably something wrong with your input data. Without the input data it is hard to say, but based on my experience running shglm for a binomial glm with factors, this is where I would start.
As a general debugging strategy you can try something like the following:
add the lines debug(shglm) and options(error=recover) to your script
turn on the trace=T option for shglm
start R and load your script as source("myscript.R")
step through the debugger and use ls() to see the variables currently present and inspect them with dim() colnames() etc.
Now in my experience shglm returns rather cryptic error messages that may change depending on the size of your input chunks (as this changes the data and the factors the model knows about). Below I list a couple of things to check in your data and some common errors that I encountered while getting it to work which may help you to get your own model running.
Regarding the data, make sure that:
The dependent variable is 0/1 or a proportion 0 <= y <= 1. If you have successes and failures, you can use the weights parameter to give the total number of tries and calculate the proportion in the formula, i.e., success/(success + failures). A common error is:
Error in if (any(y < 0 | y > 1)) stop("y values must be 0 <= y <= 1") :
missing value where TRUE/FALSE needed
Calls: shglm -> eval -> eval
Specify all the levels of the factors (don't forget default values) and make sure that they are sorted, e.g., factor(age, levels = c("24andbelow", "25to49", "50to74", "75andover")); otherwise you will get errors like:
Error in crossprod(weights, y) : non-conformable arguments
Calls: shglm -> crossprod -> crossprod
Error in XTX[rownames(Ax), colnames(Ax)] : subscript out of bounds
Calls: shglm
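To illustrate why the explicit levels matter when data arrives in chunks, here is a minimal sketch. The age values and the two-chunk split are made up for illustration; the point is only that factor() infers levels from whatever values a chunk happens to contain.

```r
# Hypothetical age categories, split into two chunks of three rows each
age    <- c("25to49", "24andbelow", "75andover",
            "50to74", "25to49", "24andbelow")
chunks <- split(age, rep(1:2, each = 3))

# Levels inferred per chunk: each chunk only knows the values it has seen
inferred <- lapply(chunks, function(ch) levels(factor(ch)))

# Levels given explicitly: every chunk yields the same set of levels
all_levels <- c("24andbelow", "25to49", "50to74", "75andover")
fixed <- lapply(chunks, function(ch) levels(factor(ch, levels = all_levels)))

identical(inferred[[1]], inferred[[2]])  # the inferred level sets differ
identical(fixed[[1]], fixed[[2]])        # the explicit level sets agree
```

With inferred levels the design matrix can have a different number of columns from one chunk to the next, which is consistent with the non-conformable and subscript errors listed above.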
Now I did not get your specific error, but I got something close enough that I thought I should mention it. Here I tried to supply a formula with two columns (for successes and failures, as you can in regular glm), i.e., cbind(success, failures)~factor(var1) + factor(var2)
Error in dev.resids(y, mu, weights) :
argument wt must be a numeric vector of length 1 or length 10
Calls: shglm -> dev.resids
I guess the main takeaway is to check your input data.
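One way to check the input data is to call the data function by hand and look at the chunks it returns before passing it to shglm. The sketch below reuses make.data from the question. The toy data frame and the temporary file are stand-ins for ti and data1.txt, not the real data.

```r
# make.data() as given in the question
make.data <- function(filename, chunksize, ...) {
  conn <- NULL
  function(reset = FALSE) {
    if (reset) {
      if (!is.null(conn)) close(conn)
      conn <<- file(filename, open = "r")
    } else {
      rval <- read.table(conn, nrows = chunksize, ...)
      if (nrow(rval) == 0) {
        close(conn)
        conn <<- NULL
        rval <- NULL
      }
      return(rval)
    }
  }
}

# Hypothetical toy data standing in for `ti`
set.seed(1)
toy <- data.frame(T2D   = rbinom(25, 1, 0.5),
                  SIBCO = sample(1:3, 25, replace = TRUE),
                  POCOD = sample(c("a", "b"), 25, replace = TRUE))
tmp <- tempfile(fileext = ".txt")
write.table(toy, tmp, row.names = FALSE, col.names = FALSE)

da <- make.data(tmp, chunksize = 10, col.names = colnames(toy))
da(reset = TRUE)      # open the connection
chunk1 <- da()        # rows 1 to 10
chunk2 <- da()        # rows 11 to 20

str(chunk1)           # column types as shglm will see them
sapply(chunk1, class)

close(environment(da)$conn)  # tidy up the connection opened by da()
```

In this toy run SIBCO comes back as an integer column, which matches the suspicion in the question that factors are not preserved by the write.table/read.table round trip.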

Error in R's "ada" package : "Currently this procedure can not directly handle > 2 class response"

I'm running a basic ada model, but I'm getting a weird error.
Model:
boost1 <- ada(response ~ . ,
data = my_data_set,
subset = as.logical(tmp_train$train),
iter = 50
)
And the error I'm getting is:
Error in ada.default(x, y, ..., na.action = na.action) :
Currently this procedure can not directly handle > 2 class response
I would assume this means that my "response" column has more than two levels, but it does not:
> length(levels(my_data_set$response))
[1] 2
Is there anything else that might cause this?
It turned out that the problem was in the subset. The subset contained no TRUE values, so it had fewer than 2 classes of response. To check that your subset is correct, you could try:
table(as.logical(tmp_train$train))
I believe a similar problem might have occurred if the training subset had only one response class. This makes sense - as an example, imagine you are trying to classify two different types of iris, and you only trained the model using one type. In this example, it would be impossible for the model to have any idea of what characteristics separate the two types, so it would throw an error.
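A slightly fuller check, sketched with made-up data, counts the response classes that survive the subset. The names my_data_set and train_flag below are toy stand-ins, not the original objects.

```r
# Hypothetical data: a two-class response and a training flag with no TRUEs
my_data_set <- data.frame(response = factor(c("a", "b", "a", "b")),
                          x = 1:4)
train_flag <- c(0, 0, 0, 0)

sel <- as.logical(train_flag)
table(sel)                                   # how many rows are selected

# Classes actually present in the selected rows
kept      <- droplevels(my_data_set$response[sel])
n_classes <- nlevels(kept)
n_classes                                    # must be 2 for a two-class model
```

If n_classes is anything other than 2, the subset rather than the response column is the thing to fix.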

Error in wilcox.test : object 'x' not found

I'm trying to run a Mann-Whitney-Wilcoxon test in R to compare brood sizes between 2 years. My data read in successfully in 2 columns, labeled x and y for each year, ranked, with unequal sample sizes. I'm getting the following error and I'm not sure what the problem is.
setwd('c:/OSPR NEST 2011 & 2012')
penob1112<-read.csv('compare_penob_11_12.csv',header=TRUE)
wilcox.test(x, y, data=penob1112)
Error in wilcox.test(x, y, data = penob1112) : object 'x' not found
Thanks for any insights!
The data argument is only taken when the first argument is of class formula. You need to explicitly call each object instead:
wilcox.test(penob1112$x, penob1112$y)
Look at ?wilcox.test - it has two methods (default and formula)
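To make the difference between the two methods concrete, here is a small sketch with made-up brood sizes (the values are not from the original CSV). The formula method needs the data in long format, with one column of values and one grouping column, which stack() can produce.

```r
# Hypothetical data: two years in two columns, padded with NA for the shorter one
penob <- data.frame(x = c(1, 2, 3, 2, 3),
                    y = c(2, 3, 3, 1, NA))

# Default method: pass the two vectors explicitly
res_default <- wilcox.test(penob$x, penob$y, exact = FALSE)

# Formula method: reshape to long format, then `data` is honoured
long <- stack(penob)                 # columns: values, ind
res_formula <- wilcox.test(values ~ ind, data = long, exact = FALSE)

res_default$p.value
res_formula$p.value
```

Both calls compare the same two groups, so they give the same p-value. exact = FALSE is set here only to avoid the warning about ties in the toy data.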

"Error in 1:ncol(x) : argument of length 0" when using Amelia in R

I am working with panel data. I have well over 6,000 country-year observations, and have specified my Amelia imputation as follows:
amelia(CountDependentVariable, m=5, ts="year", cs="cowcode",
sqrts=c("OtherCountVariable2", "OtherCount3", "OtherCount4"),
ords=c("OrdinalVar1", "Ordinal Variable 2"),
lgstc=c("ProportionVariale"),
noms=c("NominalVar1"),p2s = 0, idvars = c("country"))
When I run those lines of code, I continue to receive the following error:
Error in 1:ncol(x) : argument of length 0
I've seen people get a similar error, but in different contexts. Importantly, there are several continuous independent variables I left out of the Amelia code, because I am under the impression that they get imputed without having to be listed. Does anyone know:
1) What this error means?
2) How to correct this error?
Update #1: Provided more context, in terms of the types of variables in my count panel data, in the above sample code.
Update #2: I did some research, and ran into an R file containing a function that diagnoses possible errors for Amelia code. After running the code, I got the following error message first (and many more thereafter):
AMn<-nrow(x)
Error in nrow(x) : object 'x' not found
AMp<-ncol(x)
Error in ncol(x) : object 'x' not found
subbedout<-c(idvars,cs,ts)
Error: object 'idvars' not found
Error Code: 4
if (any(colSums(!is.na(x)) <= 1)) {
all.miss <- colnames(x)[colSums(!is.na(x)) <= 1]
if (is.null(all.miss)) {
all.miss <- which(colSums(!is.na(x)) <= 1)
}
all.miss <- paste(all.miss, collapse = ", ")
error.code<-4
error.mess<-paste("The data has a column that is completely missing or only has one,observation. Remove these columns:", all.miss)
return(list(code=error.code,mess=error.mess))
}
Error in is.data.frame(x) : object 'x' not found
Error codes: 5-6
Errors in one of the list variables
idout<-listcheck(idvars,"One of the 'idvars'")
Error in identical(vars, NULL) : object 'idvars' not found
Currently, there are no missing values for the country variable I place in the idvars argument. However, the very first "chunk" of errors wants me to believe that this is so.
Am I not properly specifying the Amelia code I have above?
I had forgotten to specify the dataframe in the original Amelia code (slaps hand on forehead). So now, after resolving the wacky issue above, I am getting the following error from Amelia:
Amelia Error Code: 44
One of the variable names in the options list does not match a variable name in the data.
I've checked the variable names, and they match, verbatim, to what I named them in the dataframe.
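One way to double-check the names programmatically is to compare every name passed in the options against names() of the data frame. This is only a sketch: the data frame below is a made-up stand-in, and the option names are copied from the call above. Note that data.frame() and read.csv() rewrite names containing spaces by default, so a name such as "Ordinal Variable 2" may not exist in the data as typed.

```r
# Hypothetical stand-in for the panel data; check.names = TRUE is the default
panel <- data.frame(year = 2000, cowcode = 2, country = "A",
                    OtherCountVariable2 = 1, OtherCount3 = 1, OtherCount4 = 1,
                    OrdinalVar1 = 1, "Ordinal Variable 2" = 1,
                    ProportionVariale = 0.5, NominalVar1 = 1)

# Every name used in the ts, cs, sqrts, ords, lgstc, noms and idvars options
opts <- c("year", "cowcode", "country",
          "OtherCountVariable2", "OtherCount3", "OtherCount4",
          "OrdinalVar1", "Ordinal Variable 2",
          "ProportionVariale", "NominalVar1")

not_found <- setdiff(opts, names(panel))
not_found        # names in the options that are absent from the data
```

An empty result means all names match. Anything listed is a candidate for the mismatch that error code 44 reports.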

R using cell values from a data frame as arguments in an already defined custom function

I am relatively new to R and programming in general, so my question might be due to a lack of experience and cryptic error messages. I have done a fair amount of investigation and experimenting with different versions of apply and functions in the plyr package. The root of my question is how to have the value from a cell in a data frame be supplied as an argument in my function? I'll do my best to provide example data.
I am working with survey data in R, so I have a data frame with many columns and rows. I created a custom function to process some of the data. I run the script for the function first, so that it is loaded in the workspace in RStudio. The function has two arguments:
myfunction <- function(id, info){
# various data processing
}
myfunction does not return anything. When using real data, it outputs some .CSVs for me, so I don't need to get anything back from it - just need it to run using the values from every row.
For the sake of this example, let's say my data frame (called mydata) only has two columns (and in fact, I can subset it down to just these two columns in the overall process if needed for the solution).
ID Gender
1 M
2 F
3 F
4 M
What I would like to happen, is have R go through each row and provide the values of the cells as the two arguments in myfunction:
# So for the first row, it should do
myfunction("1", "M")
# And the second:
myfunction("2", "F")
The closest I've gotten is this:
a_ply(mydata, c(1,2), print)
ID
1 1
2 2
3 3
4 4
Gender
1 M
2 F
3 F
4 M
Which seems like it is in the right direction, but whenever I put myfunction in the a_ply I can't get it to work the way I want. I either get this error message:
Error in eval(expr, envir, enclos) : object 'X' not found
## Which I believe is actually an error from myfunction, which would mean the
## ID value is not passing through to it correctly
Or when playing around with different versions of that a_ply command, I get this error:
Error in file(file, "rt") : invalid 'description' argument
Thanks in advance for any help, so far I've been able to make it this far reading documentation and lots of other posts here, but I can't seem to find anything explaining this.
(For completeness, and to close the question):
apply(mydata,1, function(x) myfunction(x[1],x[2]))
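An alternative sketch uses Map(), which walks the two columns in parallel and passes each pair of values to the function. One reason to consider it: apply() first converts the data frame to a character matrix, and numeric columns can pick up padding spaces in that conversion. The myfunction below is a toy stand-in that only records its arguments. The real one writes CSVs.

```r
mydata <- data.frame(ID = 1:4,
                     Gender = c("M", "F", "F", "M"),
                     stringsAsFactors = FALSE)

# Toy stand-in for the real function: record each call instead of writing a CSV
calls <- character(0)
myfunction <- function(id, info) {
  calls <<- c(calls, paste(id, info))
  invisible(NULL)
}

# Call myfunction once per row with that row's ID and Gender
invisible(Map(myfunction, mydata$ID, mydata$Gender))

calls
```

Here each ID arrives as an integer rather than a string. Wrap the column in as.character() if the function expects text, as in myfunction("1", "M").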
