Error in huge R package when using criterion = "stars"

I am trying to build an association network from some expression data I have. The data is really large: about 300 samples and ~30,000 genes. I would like to apply a Gaussian graphical model to the data using the huge R package.
Here is the code I am using:
dim(data)
#[1] 317 32291
huge.out <- huge.npn(data)
huge.stars <- huge.select(huge.out, criterion="stars")
However, in this last step I get an error:
Error in cor(x) : ling....in progress:10%
Missing values present in input variable 'x'. Consider using use = 'pairwise.complete.obs'
Any help would be much appreciated.

You posted this exact question on R-help today. Both SO and R-help discourage cross-posting, but if you do choose to switch venues it is at the very least courteous to inform the readership.
In response to the suggestion here on SO that there were missing data in your data object named 'data', you claimed there were none. So what does this code return:
lapply(data, function(x) sum(is.na(x)))
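Note that lapply only iterates over columns if 'data' is a data frame; if it is a matrix, as expression data often is, an equivalent quick check would be:
sum(is.na(data))      # total number of NAs in the matrix
colSums(is.na(data))  # NAs per column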
That would be a first-level check, but the error could also be caused by a later step that encounters a missing value in the matrix of correlation coefficients computed from 'huge.out'. That could happen if there were: a) infinities in the calculations, or b) a column that is constant:
> cor(c(1:10,Inf), 1:11)
[1] NaN
> cor(rep(2,7), rep(2,7))
[1] NA
Warning message:
In cor(rep(2, 7), rep(2, 7)) : the standard deviation is zero
So the next check is:
sum(is.na(huge.out))
That will at least give you some basis for defending your claim of no missing values, and it will also give you a plausible theory as to the source of the error. To locate a column that is entirely constant you might do something like this (assuming it is a data frame):
which(sapply(data, function(col) length(unique(col))) == 1)
If it's a matrix, you need to use apply.
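For a matrix, a sketch of the same check with apply (assuming 'data' is your numeric matrix) would be:
# columns whose values are all identical have a single unique value
which(apply(data, 2, function(col) length(unique(col))) == 1)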

Related

K-means clustering in R: error NA/NaN/Inf in foreign function call

I have a dataset that I have created in R. It is structured as follows:
I am trying to cluster the observations using k-means. However, I get the following error message:
> cl <- kmeans(sample, 3)
Error in do_one(nmeth) : NA/NaN/Inf in foreign function call (arg 1)
In addition: Warning message:
In storage.mode(x) <- "double" : NAs introduced by coercion
What does this mean? Am I preprocessing the data incorrectly? What can I do to fix it?
In the documentation of kmeans (run ?kmeans in the console to see it), it is stipulated that the argument x has to be:
numeric matrix of data, or an object that can be coerced to such a matrix (such as a numeric vector or a data frame with all numeric columns).
Here, your first row is what is preventing the data from being used by kmeans. Basically, I believe your first row is supposed to be your column names.
Moreover, you can't cluster on your second column, genre, as it is a character column, and I believe the first column should not be used either, am I right?
So, if your dataset is called samples, try:
colnames(samples) <- samples[1,]
samples_cluster <- samples[-1,3:ncol(samples)]
cl <- kmeans(samples_cluster,3)
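One additional caveat, purely as an assumption on my part: if the header row forced the remaining columns to be stored as character, you may also need to coerce them back to numeric before calling kmeans, for instance:
# hypothetical follow-up: convert character columns back to numeric
samples_cluster <- as.data.frame(lapply(samples_cluster, as.numeric))
cl <- kmeans(samples_cluster, 3)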
Does that answer your question?
If not, can you provide a reproducible example of your dataset so that we can check the data frame for kmeans clustering? To do this, please see: How to make a great R reproducible example

Why does 'out of bounds' indexing differ between a matrix and a data.frame?

I'm sure this is kind of basic, but I'd just like to really understand the logic of R data structures here.
If I subset a matrix by index out of bounds, I get exactly that error:
m <- matrix(data = c("foo", "bar"), nrow = 1)
m[2,]
# Error in m[2, ] : subscript out of bounds
If I do the same to a data frame, however, I get an all-NA row:
df <- data.frame(foo = "foo", bar = "bar")
df[2,]
#     foo  bar
# NA <NA> <NA>
If I subset a non-existent data frame column, I get the familiar error:
df[, 3]
# Error in `[.data.frame`(df, , 3) : undefined columns selected
I know (roughly) that data frame rows are weird and to be treated carefully, but I don't quite see the connection to the above behavior.
Can someone explain why R behaves in this way for non-existent df rows?
Update
To be sure, returning NA on out-of-bounds subsets is normal R behavior for 1D vectors:
vec <- c("foo", "bar")
vec[3]
# [1] NA
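Incidentally, single-bracket indexing of the matrix above with one linear index follows the same vector rule; a quick check:
m[3]
# [1] NA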
So in a way, the weird one out here is matrix subsetting, not data frame subsetting, depending on where you're starting out.
Still, the different 2D subsetting behavior (m[2, ] vs df[2, ]) might strike a dense user (as I am right now) as inconsistent.
Can someone explain why R behaves in this way[?]
Short answer: No, probably not.
Longer answer:
Once upon a time I was thinking about something similar and read this thread on R-devel: Definition of [[. Basically it boils down to:
The semantics of [ and [[ don't seem to be fully specified in the Reference manual. [...] I assume that these are features, not bugs, but I can't find documentation for them
Duncan Murdoch, a former member of the R core team, gives a very nice reply:
There is more documentation in the man page for Extract, but I think it is incomplete. The most complete documentation is of course the source code*, but it may not answer the question of what's intentional and what's accidental
As mentioned in the R-devel thread, the only description in the manual is in section 3.4.1, Indexing by vectors:
If i is positive and exceeds length(x) then the corresponding selection is NA
But this applies to "indexing of simple vectors". Similar out-of-bounds indexing of "non-simple" vectors does not seem to be described. Duncan Murdoch again:
So what is a simple vector? That is not explicitly defined, and it probably should be.
Thus, it may seem that no one knows the answer to your "why" question.
See also "8.2.13 nonexistent value in subscript" in the excellent R Inferno by Patrick Burns, and the section "Missing/out of bounds indices" in Hadley's book.
*Source code for the [ subset operator. A search for R_MSG_subs_o_b (which corresponds to error message "subscript out of bounds") provides no obvious clue why OOB [ indexing of matrices and when using [[ give an error, whereas OOB [ indexing of "simple vectors" results in NA.
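For completeness, the [[ case mentioned in that footnote is easy to check; out-of-bounds [[ indexing errors even for a simple vector (using vec from the question above):
vec[[3]]
# Error in vec[[3]] : subscript out of bounds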

Error in family$linkinv(eta) : Argument eta must be a nonempty numeric vector

The reason the title of this question is the error I am getting is that I simply do not know how to interpret it, no matter how much I research it. Whenever I run a logistic regression with bigglm() (from the biglm package, designed to run regressions over large amounts of data), I get:
Error in family$linkinv(eta) : Argument eta must be a nonempty numeric vector
This is what my bigglm() call looks like:
fit <- bigglm(f, data = df, family=binomial(link="logit"), chunksize=100, maxit=10)
where f is the formula and df is the data frame (a little over a million rows and about 210 variables).
So far I have tried changing my dependent variable to a numeric class, but that didn't work. My dependent variable has no missing values.
Judging from the error message, I wonder if this might have anything to do with the family argument of the bigglm() function. I have found numerous other websites where people ask about the same error, and most of them are either unanswered or concern a completely different case.
The error "Argument eta must be a nonempty numeric vector" suggests to me that your data has either empty values or NAs, so please check your data. Whatever advice we provide here cannot be tested until we see your code or the steps that lead to the error.
Try this:
is.na(df) # check for missing values; TRUE marks an NA
df[is.na(df)] <- 0 # not sure what effect replacing NA with 0 will have on your model
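A less destructive first step, assuming df is your data frame, is to see which columns actually contain NAs before overwriting anything:
colSums(is.na(df))              # number of NAs per column
which(colSums(is.na(df)) > 0)   # columns with at least one NA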
Alternatively, pass the na.rm = TRUE argument to whatever line of code is generating the NAs.
Again, we can only speculate. Hope it helps.

R skmeans package - where does this error come from: "missing value where TRUE/FALSE needed"

I tried to cluster my data following the examples on the skmeans package's manual page.
I started by installing all required packages.
I then imported my data table, and made a matrix out of it with:
x <- as.matrix(x)
# See dimensions
dim(x)
[1] 184 4000
When I try to hard-partition my data into 5 clusters - as is done in the manual's first example - like so:
hparty <- skmeans(x, 5, control = list(verbose = TRUE))
I receive the following error message:
Error in if (!all(row_norms(x) > 0)) stop("Zero rows are not allowed.") :
missing value where TRUE/FALSE needed
And when I just type:
test <- skmeans(x, 5)
I get:
Error in skmeans(x, 5) : Zero rows are not allowed.
I'm trying to figure out where this error is coming from, and why the function can't get a TRUE/FALSE value. Has anyone ever experienced this problem?
Thank you in advance!
Spherical k-means is k-means where every vector is normalized to length 1.
If you have an all-zero vector, this normalization is not possible, and you cannot use spherical k-means (or cosine similarity).
!all(row_norms(x) > 0)
is the test that you do not have a row with norm 0. If x contains NA values, the row norms are NA, the condition itself evaluates to NA, and if() then complains about a "missing value where TRUE/FALSE needed".
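A sketch of how to locate the rows that trip this check, assuming x is your numeric matrix:
norms <- sqrt(rowSums(x^2))       # Euclidean norm of every row
which(norms == 0 | is.na(norms))  # all-zero rows, or rows containing NA
# these rows have to be fixed or removed before calling skmeans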

Error with knn function

I try to run this line:
knn(mydades.training[,-7],mydades.test[,-7],mydades.training[,7],k=5)
but I always get this error:
Error in knn(mydades.training[, -7], mydades.test[, -7], mydades.training[, :
NA/NaN/Inf in foreign function call (arg 6)
In addition: Warning messages:
1: In knn(mydades.training[, -7], mydades.test[, -7], mydades.training[, :
NAs introduced by coercion
2: In knn(mydades.training[, -7], mydades.test[, -7], mydades.training[, :
NAs introduced by coercion
Any ideas, please?
PS: mydades.training and mydades.test are defined as follows:
N <- nrow(mydades)
permut <- sample(c(1:N),N,replace=FALSE)
ord <- order(permut)
mydades.shuffled <- mydades[ord,]
prop.train <- 1/3
NOMBRE <- round(prop.train*N)
mydades.training <- mydades.shuffled[1:NOMBRE,]
mydades.test <- mydades.shuffled[(NOMBRE+1):N,]
I suspect that your issue lies in having non-numeric data fields in 'mydades'. The error line:
NA/NaN/Inf in foreign function call (arg 6)
makes me suspect that the knn function's call to the underlying C implementation fails. Many functions in R actually call underlying, more efficient C implementations instead of having the algorithm implemented purely in R. If you type just 'knn' in your R console, you can inspect the R implementation of 'knn'. It contains the following line:
Z <- .C(VR_knn, as.integer(k), as.integer(l), as.integer(ntr),
as.integer(nte), as.integer(p), as.double(train), as.integer(unclass(clf)),
as.double(test), res = integer(nte), pr = double(nte),
integer(nc + 1), as.integer(nc), as.integer(FALSE), as.integer(use.all))
where .C means that we're calling a C function named 'VR_knn' with the provided function arguments. Since you have two of the errors
NAs introduced by coercion
I think two of the as.double/as.integer calls fail, and introduce NA values. If we start counting the parameters, the 6th argument is:
as.double(train)
that may fail in cases such as:
# as.double can not translate text fields to doubles, they are coerced to NA-values:
> as.double("sometext")
[1] NA
Warning message:
NAs introduced by coercion
# while the following text is cast to double without an error:
> as.double("1.23")
[1] 1.23
You get two of the coercion errors, which are probably given by 'as.double(train)' and 'as.double(test)'. Since you did not provide us with exact details of what 'mydades' looks like, here are some of my best guesses (using artificially generated multivariate normal data):
library(MASS)
mydades <- mvrnorm(100, mu=c(1:6), Sigma=matrix(1:36, ncol=6))
mydades <- cbind(mydades, sample(LETTERS[1:5], 100, replace=TRUE))
# This breaks knn
mydades[3,4] <- Inf
# This breaks knn
mydades[4,3] <- -Inf
# These, however, do not introduce the coercion for NA-values error message
# This breaks knn and gives the same error; just some raw text
mydades[1,2] <- mydades[50,1] <- "foo"
mydades[100,3] <- "bar"
# ... or perhaps wrongly formatted exponential numbers?
mydades[1,1] <- "2.34EXP-05"
# ... or wrong decimal symbol?
mydades[3,3] <- "1,23"
# should be 1.23, as R uses '.' as decimal symbol and not ','
# ... or most likely a whole column is non-numeric, since the error is given twice (as.double problem both in training AND test set)
mydades[,1] <- sample(letters[1:5],100,replace=TRUE)
I would not keep both the numeric data and the class labels in a single matrix; perhaps you could split the data as:
mydadesnumeric <- mydades[,1:6] # 6 first columns
mydadesclasses <- mydades[,7]
Using calls
str(mydades); summary(mydades)
may also help you/us locate the problematic data entries, so that you can correct them to numeric entries or omit the non-numeric fields.
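A programmatic way to flag the offending columns, as a sketch (dades.df is just a hypothetical copy of your data as a data frame):
dades.df <- as.data.frame(mydades, stringsAsFactors = FALSE)
# TRUE for columns where coercing to numeric would introduce new NAs,
# i.e. columns that contain non-numeric text
sapply(dades.df, function(col) any(is.na(suppressWarnings(as.numeric(col))) & !is.na(col)))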
The rest of the run code (after breaking the data), as provided by you:
N <- nrow(mydades)
permut <- sample(c(1:N),N,replace=FALSE)
ord <- order(permut)
mydades.shuffled <- mydades[ord,]
prop.train <- 1/3
NOMBRE <- round(prop.train*N)
mydades.training <- mydades.shuffled[1:NOMBRE,]
mydades.test <- mydades.shuffled[(NOMBRE+1):N,]
# 7th column seems to be the class labels
knn(train=mydades.training[,-7],test=mydades.test[,-7],mydades.training[,7],k=5)
Great answer by @Teemu.
As this is a well-read question, I will give the same answer from an analytics perspective.
The KNN function classifies data points by calculating the Euclidean distance between the points. That is a mathematical calculation requiring numbers, so all variables used in KNN must be coercible to numerics.
The data preparation for KNN often involves three tasks (sketched in code after the list):
(1) Fix all NA or "" values
(2) Convert all factors into a set of booleans, one for each level in the factor
(3) Normalize the values of each variable to the range 0:1 so that no variable's range has an unduly large impact on the distance measurement.
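A minimal sketch of those three steps, assuming a data frame df of predictor columns with a mix of numerics and factors (the object names here are made up for illustration):
# (1) fix missing values, e.g. impute numeric NAs with the column median
num_cols <- sapply(df, is.numeric)
df[num_cols] <- lapply(df[num_cols], function(col) {
  col[is.na(col)] <- median(col, na.rm = TRUE)
  col
})
# (2) expand factors into 0/1 indicator columns
df_num <- as.data.frame(model.matrix(~ . - 1, data = df))
# (3) rescale every column to the range [0, 1]
df_scaled <- as.data.frame(lapply(df_num, function(col) {
  (col - min(col)) / (max(col) - min(col))
}))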
I would also point out that the function seems to fail when using integers; I needed to convert everything to the "num" type prior to calling the knn function. This includes the target feature, for which most methods in R use the factor type. Thus, as.numeric(my_frame$target_feature) is required.
