Uncommon error message converting Matrix to Sparse in R - r

I'm trying to run a LASSO on our dataset, and to do so, I need to convert non-numeric variables to numeric, ideally via a sparse matrix. However, when I try to use the Matrix command, I get the same error:
Error in asMethod(object) : invalid class 'NA' to dup_mMatrix_as_geMatrix
I thought this was due to NA's in my data, so I did an na.omit and got the same error. I tried again with a mini subset of my code and got the same error again:
> sparsecombined <- Matrix(combined1[1:10,],sparse=TRUE)
Error in asMethod(object) : invalid class 'NA' to dup_mMatrix_as_geMatrix
This is the data set I tried to convert with that last line of code:
Is there anything that jumps out that might prevent sparse conversion?

The easiest way to incorporate categorical variables into a LASSO is to use my glmnetUtils package, which provides a formula/data frame interface to glmnet.
glmnet(ArrDelay ~ ArrTime + uniqueCarrier + TailNum + Origin + Dest,
data=combined1, sparse=TRUE)
This automatically handles categorical vars via one-hot encoding (also known as dummy variables). It can also use sparse matrices if so desired.

I think the error is due to the fact that you have non-numeric data types in your matrix.
Perhaps first convert your non-numeric columns like UniqueCarrier to binary vectors using one-hot encoding, and only then convert the matrix to sparse.
Here is my code that I used for that conversion:
# Convert Genre into binary variables
# Convert genreVector into a corpus in order to parse each text string into a binary vector with 1s representing the presence of a genre and 0s the absence
library(tm)
library(slam)
convertToBinary <- function(category) {
  genreVector = category
  genreVector = strsplit(genreVector, "(\\s)?,(\\s)?") # separate out commas
  genreVector = gsub(" ", "_", genreVector) # combine DirectorNames with whitespaces
  genreCorpus = Corpus(VectorSource(genreVector))
  #dtm = DocumentTermMatrix(genreCorpus, list(dictionary=genreNames))
  dtm = DocumentTermMatrix(genreCorpus)
  binaryGenreVector = inspect(dtm)
  return(binaryGenreVector)
  #return(data.frame(binaryGenreVector)) # convert binaryGenreVector to dataframe
}
directorBinary = convertToBinary(x$Director)
directorBinaryDF = as.data.frame(directorBinary)
See nograpes' answer in
recommenderlab, Error in asMethod(object) : invalid class 'NA' to dup_mMatrix_as_geMatrix
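The comma-splitting idea in the function above can also be sketched in base R, without tm. This is only an illustration under assumptions: the `genres` vector and its values are made up, and the result is a plain 0/1 matrix rather than a document-term matrix. Note that strsplit() returns a list, so the whitespace replacement is applied per element with lapply():

```r
# Hypothetical input: one comma-separated string per row
genres <- c("Action, Comedy", "Drama", "Comedy, Drama, Sci Fi")

# Split on commas (with optional surrounding whitespace); the result is a list
tokens <- strsplit(genres, "\\s*,\\s*")

# Replace remaining spaces inside a name with underscores, element by element
tokens <- lapply(tokens, function(x) gsub(" ", "_", x))

# One column per distinct name, 1 if the row contains it and 0 otherwise
all_names <- sort(unique(unlist(tokens)))
binary <- t(sapply(tokens, function(x) as.integer(all_names %in% x)))
colnames(binary) <- all_names

print(binary)
```

The resulting matrix is numeric, so it can be passed on to Matrix(binary, sparse=TRUE) if a sparse representation is needed.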

I got this error due to passing a data frame where a matrix was expected, and it looks like that's the same reason you are getting it. The solution is simple: convert your data to a matrix before passing it to the Matrix function:
sparsecombined <- Matrix(as.matrix(combined1[1:10,]),sparse=TRUE)
In your case, this code will probably complain because you have some non-numeric data stored in there (e.g. the TailNum column). So you would need to downselect to just the numeric columns.
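A minimal sketch of both routes, using a made-up data frame (`flights` and its values are hypothetical stand-ins for combined1, not the asker's data). The second route uses sparse.model.matrix() from the Matrix package, which is not mentioned above, to dummy-code the factor column instead of dropping it:

```r
library(Matrix)

# Hypothetical stand-in for a few rows of combined1
flights <- data.frame(
  ArrDelay      = c(5, -3, 12, 0),
  ArrTime       = c(930, 1215, 1840, 705),
  UniqueCarrier = factor(c("AA", "UA", "AA", "DL"))
)

# Route 1: keep only the numeric columns, then convert matrix -> sparse
numeric_cols <- sapply(flights, is.numeric)
sparse_numeric <- Matrix(as.matrix(flights[, numeric_cols]), sparse = TRUE)

# Route 2: build a sparse design matrix; the factor becomes dummy columns
sparse_full <- sparse.model.matrix(ArrDelay ~ ArrTime + UniqueCarrier,
                                   data = flights)

print(sparse_numeric)
print(sparse_full)
```

Route 2 keeps the categorical information, which is usually what you want before fitting a LASSO.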

Related

Error in cor(mydata) : 'x' must be numeric

In R, I have been having trouble trying to create a correlation matrix for my data. I keep running into this problem: "Error in cor(mydata) : 'x' must be numeric" and I don't know how to fix it.
> mydata <- Combo[, c(1,2,3,4,5,6,7)]
> head(mydata, 13)
> #computing matrix
> corrmax = cor(mydata)
Error in cor(mydata) : 'x' must be numeric
I believe not all the data in mydata are numeric. You can test this by running: str(mydata) or sapply(mydata, is.numeric).
If there are variables in mydata that are chr or other non-numeric formats or return FALSE in the case of sapply, you will need to convert them to numeric before running the command or be more selective about the set of variables for which you calculate a correlation. I see strings and percent signs in what you posted. The strings will need to be removed and the formatted percents (%) converted to a numeric representation (decimals).
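As a rough sketch of those steps, assuming a made-up data frame (the column names `score`, `rate` and `label` are invented here, since the original data was not shown): check which columns are numeric, turn the percent strings into decimals, and correlate only the numeric columns.

```r
# Hypothetical data with one numeric, one percent-formatted and one text column
mydata <- data.frame(
  score = c(1.2, 3.4, 2.2, 5.1),
  rate  = c("10%", "25%", "15%", "40%"),
  label = c("a", "b", "c", "d"),
  stringsAsFactors = FALSE
)

# Which columns are numeric right now?
is_num <- sapply(mydata, is.numeric)
print(is_num)

# Strip the percent sign and convert to a decimal fraction
mydata$rate <- as.numeric(sub("%", "", mydata$rate, fixed = TRUE)) / 100

# Keep only the numeric columns and compute the correlation matrix
numeric_only <- mydata[, sapply(mydata, is.numeric), drop = FALSE]
corrmat <- cor(numeric_only)
print(corrmat)
```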

Implementing one-hot encoding using r

The dataset I am working on has a lot of character variables that I want to one-hot encode in order to build some predictive models. In my code I am excluding two variables because it does not make sense to encode them: the item identifier and the establishment year of the store. Here is the code I am using:
one_hot_encoding = dummyVars("~.", data = train[,-
c("Item_Identifier", "Outlet_Establishment_Year")], fullRank = T)
ohe_df = data.table(predict(one_hot_encoding, train[,-
c("Item_Identifier", "Outlet_Establishment_Year")]))
train = cbind(train[,"Item_Identifier"], ohe_df)
When executing the first line it gives this error:
Error in -c("Item_Identifier", "Outlet_Establishment_Year") :
invalid argument to unary operator.
Why? and one question regarding the dummyVars function: does it by default exclude the numeric variables of the input dataset?
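Yes, by default numeric variables are left out of the encoding: they are not turned into dummy variables, but they are still carried through unchanged in the output of predict().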
Concerning your error, there are some workarounds:
With the dplyr-package
select(train, -Item_Identifier, -Outlet_Establishment_Year)
And with base-R
train[, -which(names(train) %in% c("Item_Identifier", "Outlet_Establishment_Year"))]
OR just use the number of the column like
train[, -c(1,6)]
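To see why the error appears, here is a small base-R sketch on a made-up plain data frame (the values and the `Item_Type` and `Item_MRP` columns are hypothetical, and the asker's `train` may be a data.table, which indexes differently). Unary minus is only defined for numbers, so negating a character vector fails; dropping columns by name with setdiff() avoids that. The encoding step uses model.matrix() here rather than dummyVars, purely to keep the example free of extra packages:

```r
# Hypothetical miniature version of the training data
train <- data.frame(
  Item_Identifier           = c("A1", "B2", "C3"),
  Item_Type                 = c("Dairy", "Soft Drinks", "Dairy"),
  Outlet_Establishment_Year = c(1999, 2009, 1987),
  Item_MRP                  = c(249.8, 48.3, 141.6),
  stringsAsFactors = FALSE
)

drop_cols <- c("Item_Identifier", "Outlet_Establishment_Year")

# Negating a character vector is what raises
# "invalid argument to unary operator"
failed <- inherits(try(-drop_cols, silent = TRUE), "try-error")
print(failed)

# Drop columns by name instead
kept <- train[, setdiff(names(train), drop_cols), drop = FALSE]

# One-hot encode the remaining character column; numeric columns pass through
encoded <- model.matrix(~ . - 1, data = kept)
print(encoded)
```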

What does "argument to 'which' is not logical" mean in FactoMineR MCA?

I'm trying to run an MCA on a datatable using FactoMineR. It contains only 0/1 numerical columns, and its size is 200.000 * 20.
require(FactoMineR)
result <- MCA(data[, colnames, with=F], ncp = 3)
I get the following error :
Error in which(unlist(lapply(listModa, is.numeric))) :
argument to 'which' is not logical
I didn't really know what to do with this error. Then I tried to turn every column to character, and everything worked. I thought it could be useful to someone else, and that maybe someone would be able to explain the error to me ;)
Cheers
Are the classes of your variables character or factor? I was having this problem. My solution was to change all variables to factor.
#my data.frame was "aux.da"
i=0
while(i < ncol(aux.da)){
  i=i+1
  aux.da[,i] = as.factor(aux.da[,i])
}
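The same conversion can be written without an explicit loop. A minimal sketch, assuming a small made-up 0/1 data frame (`binary_data` is hypothetical); the MCA call is left commented out so the snippet runs without FactoMineR installed:

```r
# Hypothetical 0/1 data, similar in spirit to the question's table
binary_data <- data.frame(
  q1 = c(0, 1, 1, 0),
  q2 = c(1, 1, 0, 0),
  q3 = c(0, 0, 1, 1)
)

# Convert every column to factor; the [] keeps the data frame structure
binary_data[] <- lapply(binary_data, factor)

str(binary_data)

# With FactoMineR installed, the factors can then be passed on, e.g.:
# result <- FactoMineR::MCA(binary_data, ncp = 3, graph = FALSE)
```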
It's difficult to tell without further input, but what you can do is:
Find the function where the error occurred (via traceback()),
Set a breakpoint and debug it:
trace(tab.disjonctif, browser)
I did the following (offline) to find the name of tab.disjonctif:
Found the package on the CRAN mirror on GitHub
Searched for the particular expression that gives the error
I just started to learn R yesterday, but the error comes from the fact that MCA is for categorical data, which is why your data cannot be numeric. To be more precise, before the MCA is run, a "tableau disjonctif" (in English, a complete disjunctive table) is created.
So FactomineR is using this function :
https://github.com/cran/FactoMineR/blob/master/R/tab.disjonctif.R
There, I think it's looking for categorical values that can be matched to a numerical value (like Y = 1, N = 0).
For others, be careful: in R, categorical data means the factor type, so even if you have characters you could get this error.
To build off @marques, @Khaled, and @Pierre Gourseaud:
Yes, changing the format of your variables to factor should address the error message, but you shouldn't change the format of numerical data to factor if it's supposed to be continuous numerical data. Rather, if you have both continuous and categorical variables, try running a Factor Analysis for Mixed Data (FAMD) in the same FactoMineR package.
If you go the FAMD route, you can change the format of just your categorical variable columns to factor with this:
data[,c(3:5,10)] <- lapply(data[,c(3:5,10)] , factor)
(assuming column numbers 3,4,5 and 10 need to be changed).
This will not work for only numeric variables. If you only have numeric use PCA. Otherwise, add a factor variable to your data frame. It seems like for your case you need to change your variables to binary factors.
I had the same problem as well, and changing to factor did not solve it either, because I had put every variable as supplementary.
What I did first was transform all my numeric data to factor :
Xfac = factor(X[,1], ordered = TRUE)
for (i in 2:29){
tfac = factor(X[,i], ordered = TRUE)
Xfac = data.frame(Xfac, tfac)
}
colnames(Xfac)=labels(X[1,])
Still, it would not work. But my 2nd problem was that I included EVERY factor as supplementary variable !
So these :
MCA(Xfac, quanti.sup = c(1:29), graph=TRUE)
MCA(Xfac, quali.sup = c(1:29), graph=TRUE)
Would generate the same error, but this one works :
MCA(Xfac, graph=TRUE)
Not transforming the data to factors also generated the problem.
I posted the same answer to a related topic : https://stackoverflow.com/a/40737335/7193352

Error with knn function

I try to run this line :
knn(mydades.training[,-7],mydades.test[,-7],mydades.training[,7],k=5)
but i always get this error :
Error in knn(mydades.training[, -7], mydades.test[, -7], mydades.training[, :
NA/NaN/Inf in foreign function call (arg 6)
In addition: Warning messages:
1: In knn(mydades.training[, -7], mydades.test[, -7], mydades.training[, :
NAs introduced by coercion
2: In knn(mydades.training[, -7], mydades.test[, -7], mydades.training[, :
NAs introduced by coercion
Any idea please ?
PS: mydades.training and mydades.test are defined as follows:
N <- nrow(mydades)
permut <- sample(c(1:N),N,replace=FALSE)
ord <- order(permut)
mydades.shuffled <- mydades[ord,]
prop.train <- 1/3
NOMBRE <- round(prop.train*N)
mydades.training <- mydades.shuffled[1:NOMBRE,]
mydades.test <- mydades.shuffled[(NOMBRE+1):N,]
I suspect that your issue lies in having non-numeric data fields in 'mydades'. The error line:
NA/NaN/Inf in foreign function call (arg 6)
makes me suspect that the knn-function call to the C language implementation fails. Many functions in R actually call underlying, more efficient C implementations, instead of having an algorithm implemented in just R. If you type just 'knn' in your R console, you can inspect the R implementation of 'knn'. There exists the following line:
Z <- .C(VR_knn, as.integer(k), as.integer(l), as.integer(ntr),
as.integer(nte), as.integer(p), as.double(train), as.integer(unclass(clf)),
as.double(test), res = integer(nte), pr = double(nte),
integer(nc + 1), as.integer(nc), as.integer(FALSE), as.integer(use.all))
where .C means that we're calling a C function named 'VR_knn' with the provided function arguments. Since you have two of the warnings
NAs introduced by coercion
I think two of the as.double/as.integer calls fail, and introduce NA values. If we start counting the parameters, the 6th argument is:
as.double(train)
that may fail in cases such as:
# as.double can not translate text fields to doubles, they are coerced to NA-values:
> as.double("sometext")
[1] NA
Warning message:
NAs introduced by coercion
# while the following text is cast to double without an error:
> as.double("1.23")
[1] 1.23
You get two of the coercion warnings, which are probably given by 'as.double(train)' and 'as.double(test)'. Since you did not provide exact details of what 'mydades' looks like, here are some of my best guesses (using artificial multivariate normal data):
library(MASS)
mydades <- mvrnorm(100, mu=c(1:6), Sigma=matrix(1:36, ncol=6))
mydades <- cbind(mydades, sample(LETTERS[1:5], 100, replace=TRUE))
# This breaks knn
mydades[3,4] <- Inf
# This breaks knn
mydades[4,3] <- -Inf
# The Inf values above break knn, but they do not trigger the "NAs introduced by coercion" warning
# Raw text, by contrast, breaks knn and gives the same error as in the question
mydades[1,2] <- mydades[50,1] <- "foo"
mydades[100,3] <- "bar"
# ... or perhaps wrongly formatted exponential numbers?
mydades[1,1] <- "2.34EXP-05"
# ... or wrong decimal symbol?
mydades[3,3] <- "1,23"
# should be 1.23, as R uses '.' as decimal symbol and not ','
# ... or most likely a whole column is non-numeric, since the error is given twice (as.double problem both in training AND test set)
mydades[,1] <- sample(letters[1:5],100,replace=TRUE)
I would not keep both the numeric data and class labels in a single matrix, perhaps you could split the data as:
mydadesnumeric <- mydades[,1:6] # 6 first columns
mydadesclasses <- mydades[,7]
Using calls
str(mydades); summary(mydades)
may also help you/us in locating the problematic data entries and correct them to numeric entries or omitting non-numeric fields.
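Beyond str() and summary(), one way to pinpoint the offending cells is to test which entries fail to convert. This is only a sketch on invented data (the data frame below and its column names `x1`, `x2`, `class` are hypothetical, reusing the kinds of bad values listed above):

```r
# Hypothetical data read in as text, with a few malformed entries
mydades <- data.frame(
  x1    = c("1.5", "2.0", "foo", "1,23"),
  x2    = c("3", "4", "5", "2.34EXP-05"),
  class = c("A", "B", "A", "B"),
  stringsAsFactors = FALSE
)

feature_cols <- c("x1", "x2")

# TRUE where a non-missing entry turns into NA when coerced to numeric
bad <- sapply(mydades[feature_cols], function(col) {
  is.na(suppressWarnings(as.numeric(col))) & !is.na(col)
})

# Row/column positions of the problematic entries
print(which(bad, arr.ind = TRUE))
```

Each reported position is a value that would become NA inside as.double(train) or as.double(test).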
The rest of the run code (after breaking the data), as provided by you:
N <- nrow(mydades)
permut <- sample(c(1:N),N,replace=FALSE)
ord <- order(permut)
mydades.shuffled <- mydades[ord,]
prop.train <- 1/3
NOMBRE <- round(prop.train*N)
mydades.training <- mydades.shuffled[1:NOMBRE,]
mydades.test <- mydades.shuffled[(NOMBRE+1):N,]
# 7th column seems to be the class labels
knn(train=mydades.training[,-7],test=mydades.test[,-7],mydades.training[,7],k=5)
Great answer by @Teemu.
As this is a well-read question, I will give the same answer from an analytics perspective.
The KNN function classifies data points by calculating the Euclidean distance between the points. That's a mathematical calculation requiring numbers. All variables in KNN must therefore be coerce-able to numerics.
The data preparation for KNN often involves three tasks:
(1) Fix all NA or "" values
(2) Convert all factors into a set of booleans, one for each level in the factor
(3) Normalize the values of each variable to the range 0:1 so that no variable's range has an unduly large impact on the distance measurement.
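The three tasks above might look roughly like this in base R. Everything here is an illustrative assumption: the data frame, its column names, the choice to drop incomplete rows (imputation is another option), and the helper `rescale01`, which is not from any package.

```r
# Hypothetical data: one numeric feature, one factor feature, one class label
df <- data.frame(
  age    = c(25, NA, 40, 33),
  colour = factor(c("red", "blue", "red", "green")),
  class  = factor(c("yes", "no", "yes", "no"))
)

# (1) Handle missing values; here incomplete rows are simply dropped
df <- droplevels(df[complete.cases(df), ])

# (2) Expand the factor into one 0/1 indicator column per remaining level
features <- model.matrix(~ . - 1, data = df[, c("age", "colour")])

# (3) Rescale each column to the range 0..1 (constant columns become 0)
rescale01 <- function(x) {
  rng <- range(x)
  if (diff(rng) == 0) return(rep(0, length(x)))
  (x - rng[1]) / diff(rng)
}
features <- apply(features, 2, rescale01)

print(features)
```

The resulting numeric matrix, together with df$class as the label vector, is in the shape knn() expects for its train and cl arguments.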
I would also point out that the function seems to fail when using integers. I needed to convert everything into "num" type prior to calling the knn function. This includes the target feature, for which most methods in R use the factor type. Thus, as.numeric(my_frame$target_feature) is required.

input 'data' is not double type?

While programming in R, I'm continuously facing the following error:
Error in data.validity(data, "data") : Bad usage: input 'data' is
not double type.
Can anyone please explain why this error is happening, i.e. the reasons in the dataset which cause the error to arise?
Here is the code I'm running. The packages I have loaded are cluster, psych and clv.
data1 <- read.table(file='dataset.csv', sep=',', header=T, row.names=1)
data1.p <- as.matrix(data1)
hello.data <- data1.p[,1:15]
agnes.mod <- agnes(hello.data)
v.pred <- as.integer(cutree(agnes.mod,3)) # "cut" the tree
scatt <- clv.Scatt(hello.data, v.pred)
Error in data.validity(data, "data") :
Bad usage: input 'data' is not double type.
The key part of data.validity() raising the error is:
data = as.matrix(data)
if( !is.double(data) )
stop(paste("Bad usage: input '", name, "' is not double type.", sep=""))
data is converted to a matrix and then checked if it is a numeric matrix via is.double(). If it isn't numeric the clause is true and the error raised. So why isn't your data (hello.data) numeric when converted to a matrix? Either you have character variables in your data or there are factors. Do you have factors? Try
str(hello.data)
Are there any non-numeric variables in there? If you have character data then get rid of it. If you have factors, then data.validity() could coerce via data.matrix() but as it doesn't, try
hello.data <- data.matrix(hello.data)
after the line creating hello.data then run the rest of your code.
Whether this makes sense (treating a nominal or ordinal variable as a simple numeric) is unclear as you haven't provided a reproducible example or explained what your data are etc.
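A small sketch of how the is.double() check behaves, on a made-up data frame (the names and values are hypothetical). It also shows one further case not mentioned above: a matrix built from integer columns only is numeric but still not of double type. As far as I know, data.matrix() only does its conversion on data frames, so it may be safer to apply it to data1 before subsetting rather than to a matrix that as.matrix() has already produced.

```r
# Hypothetical data with an integer column, a factor and a double column
data1 <- data.frame(
  count  = c(1L, 4L, 2L),
  group  = factor(c("a", "b", "a")),
  weight = c(1.5, 2.5, 3.5)
)

# as.matrix() on mixed columns falls back to a character matrix
m_char <- as.matrix(data1)
print(is.double(m_char))   # FALSE

# data.matrix() replaces the factor by its integer codes
m_num <- data.matrix(data1)
print(is.double(m_num))    # TRUE

# An all-integer matrix is numeric but not double
m_int <- as.matrix(data1["count"])
print(is.double(m_int))    # FALSE

# Changing the storage mode makes it pass the check
storage.mode(m_int) <- "double"
print(is.double(m_int))    # TRUE
```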

Resources