Error with knn function - r

I try to run this line :
knn(mydades.training[,-7],mydades.test[,-7],mydades.training[,7],k=5)
but i always get this error :
Error in knn(mydades.training[, -7], mydades.test[, -7], mydades.training[, :
NA/NaN/Inf in foreign function call (arg 6)
In addition: Warning messages:
1: In knn(mydades.training[, -7], mydades.test[, -7], mydades.training[, :
NAs introduced by coercion
2: In knn(mydades.training[, -7], mydades.test[, -7], mydades.training[, :
NAs introduced by coercion
Any idea please ?
PS : mydades.training and mydades.test are defined as follow :
N <- nrow(mydades)
permut <- sample(c(1:N),N,replace=FALSE)
ord <- order(permut)
mydades.shuffled <- mydades[ord,]
prop.train <- 1/3
NOMBRE <- round(prop.train*N)
mydades.training <- mydades.shuffled[1:NOMBRE,]
mydades.test <- mydades.shuffled[(NOMBRE+1):N,]

I suspect that your issue lies in having non-numeric data fields in 'mydades'. The error line:
NA/NaN/Inf in foreign function call (arg 6)
makes me suspect that the knn-function call to the C language implementation fails. Many functions in R actually call underlying, more efficient C implementations, instead of having an algorithm implemented in just R. If you type just 'knn' in your R console, you can inspect the R implementation of 'knn'. There exists the following line:
Z <- .C(VR_knn, as.integer(k), as.integer(l), as.integer(ntr),
as.integer(nte), as.integer(p), as.double(train), as.integer(unclass(clf)),
as.double(test), res = integer(nte), pr = double(nte),
integer(nc + 1), as.integer(nc), as.integer(FALSE), as.integer(use.all))
where .C means that we're calling a C function named 'VR_knn' with the provided function arguments. Since you have two of the errors
NAs introduced by coercion
I think two of the as.double/as.integer calls fail, and introduce NA values. If we start counting the parameters, the 6th argument is:
as.double(train)
that may fail in cases such as:
# as.double can not translate text fields to doubles, they are coerced to NA-values:
> as.double("sometext")
[1] NA
Warning message:
NAs introduced by coercion
# while the following text is cast to double without an error:
> as.double("1.23")
[1] 1.23
You get two of the coercion errors, which are probably given by 'as.double(train)' and 'as.double(test)'. Since you did not provide us with exact details of how 'mydades' is, here are some of my best guesses (and an artificial multivariate normal distribution data):
library(MASS)
mydades <- mvrnorm(100, mu=c(1:6), Sigma=matrix(1:36, ncol=6))
mydades <- cbind(mydades, sample(LETTERS[1:5], 100, replace=TRUE))
# This breaks knn
mydades[3,4] <- Inf
# This breaks knn
mydades[4,3] <- -Inf
# These, however, do not introduce the coercion for NA-values error message
# This breaks knn and gives the same error; just some raw text
mydades[1,2] <- mydades[50,1] <- "foo"
mydades[100,3] <- "bar"
# ... or perhaps wrongly formatted exponential numbers?
mydades[1,1] <- "2.34EXP-05"
# ... or wrong decimal symbol?
mydades[3,3] <- "1,23"
# should be 1.23, as R uses '.' as decimal symbol and not ','
# ... or most likely a whole column is non-numeric, since the error is given twice (as.double problem both in training AND test set)
mydades[,1] <- sample(letters[1:5],100,replace=TRUE)
I would not keep both the numeric data and class labels in a single matrix, perhaps you could split the data as:
mydadesnumeric <- mydades[,1:6] # 6 first columns
mydadesclasses <- mydades[,7]
Using calls
str(mydades); summary(mydades)
may also help you/us in locating the problematic data entries and correct them to numeric entries or omitting non-numeric fields.
The rest of the run code (after breaking the data), as provided by you:
N <- nrow(mydades)
permut <- sample(c(1:N),N,replace=FALSE)
ord <- order(permut)
mydades.shuffled <- mydades[ord,]
prop.train <- 1/3
NOMBRE <- round(prop.train*N)
mydades.training <- mydades.shuffled[1:NOMBRE,]
mydades.test <- mydades.shuffled[(NOMBRE+1):N,]
# 7th column seems to be the class labels
knn(train=mydades.training[,-7],test=mydades.test[,-7],mydades.training[,7],k=5)

Great answer by#Teemu.
As this is a well-read question, I will give the same answer from an analytics perspective.
The KNN function classifies data points by calculating the Euclidean distance between the points. That's a mathematical calculation requiring numbers. All variables in KNN must therefore be coerce-able to numerics.
The data preparation for KNN often involves three tasks:
(1) Fix all NA or "" values
(2) Convert all factors into a set of booleans, one for each level in the factor
(3) Normalize the values of each variable to the range 0:1 so that no variable's range has an unduly large impact on the distance measurement.

I would also point out that the function seems to fail when using integers. I needed to convert everything into "num" type prior to calling the knn function. This includes the target feature, which most methods in R use the factor type. Thus, as.numeric(my_frame$target_feature) is required.

Related

K-Means clustering in R error NA/NaN/Inf in foreign function call

I have a dataset that I have created in R. It is structured as follows:
I am trying to cluster the observations using k-means. However, I get the following error message:
> cl <- kmeans(sample, 3)
Error in do_one(nmeth) : NA/NaN/Inf in foreign function call (arg 1)
In addition: Warning message:
In storage.mode(x) <- "double" : NAs introduced by coercion
What does this mean? Am I prepocessing the data incorrectly? What can I do to fix it?
In the documentation of kmeans (pass ?kmeans in the console to see it), it is stipulated that the argument x has to be:
numeric matrix of data, or an object that can be coerced to such a matrix (such as a numeric vector or a data frame with all numeric columns).
Here, you have the first row that is preventing to be used for kmeans. Basically, I believed that your first row is supposed to be your colnames.
Moreover, you can't make clustering with your second columns genre as it is character and I believed that the first column does not have to be used also, am I right ?
So, if your dataset is called samples, try to do:
colnames(samples) <- samples[1,]
samples_cluster <- samples[-1,3:ncol(samples)]
cl <- kmeans(samples_cluster,3)
Does it answer your question ?
If not, can you provide a reproducible example of your dataset in order we can verify the dataframe for kmeans clustering. To do this, please see: How to make a great R reproducible example

How to create data frame for super large vectors? ​

I have 7 verylarge vectors, c1 to c7. My task is to simply create a data frame. However when I use data.frame(), error message returns.
> newdaily <- data.frame(c1,c2,c3,c4,c5,c6,c7)
Error in if (mirn && nrows[i] > 0L) { :
missing value where TRUE/FALSE needed
Calls: data.frame
In addition: Warning message:
In attributes(.Data) <- c(attributes(.Data), attrib) :
NAs introduced by coercion to integer range
Execution halted
They all have the same length (2,626,067,374 elements), and I’ve checked there’s no NA.
I tried subsetting 1/5 of each vector and data.frame() function works fine. So I guess it has something to do with the length/size of the data? Any ideas how to fix this problem? Many thanks!!
Update
both data.frame and data.table allow vectors shorter than 2^31-1. Stil can't find the solution to create one super large data.frame, so I subset my data instead... hope larger vectors will be allowed in the future.
R's data.frames don't support such long vectors yet.
Your vectors are longer than 2^31 - 1 = 2147483647, which is the largest integer value that can be represented. Since the data.frame function/class assumes that the number of rows can be represented by an integer, you get an error:
x <- rep(1, 2626067374)
DF <- data.frame(x)
#Error in if (mirn && nrows[i] > 0L) { :
# missing value where TRUE/FALSE needed
#In addition: Warning message:
#In attributes(.Data) <- c(attributes(.Data), attrib) :
# NAs introduced by coercion to integer range
Basically, something like this happens internally:
as.integer(length(x))
#[1] NA
#Warning message:
# NAs introduced by coercion to integer range
As a result the if condition becomes NA and you get the error.
Possibly, you could use the data.table package instead. Unfortunately, I don't have sufficient RAM to test:
library(data.table)
DT <- data.table(x = rep(1, 2626067374))
#Error: cannot allocate vector of size 19.6 Gb
For that kind of data size, you must to optmize your memory, but how?
You need to write these values in a file.
output_name = "output.csv"
lines = paste(c1,c2,c3,c4,c5,c6,c7, collapse = ";")
cat(lines, file = output_name , sep = "\n")
But probably you'll need to analyse them too, and (as it was said before) it requires a lot of memory.
So you have to read the file by their lines (like, 20k lines) by iteration to opmize your RAM memory, analyse these values, save their results and repeat..
con = file(output_name )
while(your_conditional) {
lines_in_this_round = readLines(con, n = 20000)
# create data.frame
# analyse data
# save result
# update your_conditional
}
I hope this helps you.

Surprising warning in integrate (R), "longer object length not a multiple of shorter"

I reproduce a warning I get in integrate in R.
r <- list()
r$t <- 1:4
xsq <- function(x,r_obj){
r_obj$t == x
return(x^2)
}
integrate(f=xsq,0,1, r_obj = r)
The fourth line has no function in this toy example (that I constructed from a more complex function in which the line is needed). The problem is that the code above gives a warning related to this line:
0.3333333 with absolute error < 3.7e-15
Warning message:
In r_obj$t == x :
longer object length is not a multiple of shorter object length
Note that evaluating xsq at any value is not problematic, e.g. xsq(1, r_obj = r) does not give any warning. Also when r$t has less than 4 elements, e.g. r$t <- 1:3, the problem disappears. However, from ?integrate f is
An R function taking a numeric first argument and returning a numeric vector of the same length.
In my understanding xsq does just that. Where do I go wrong or why do I get a warning?

Error in huge R package when criterion "stars"

I am trying to do an association network using some expression data I have, the data is really huge: 300 samples and ~30,000 genes. I would like to apply a Gaussian graphical model to my data using the huge R package.
Here is the code I am using
dim(data)
#[1] 317 32291
huge.out <- huge.npn(data)
huge.stars <- huge.select(huge.out, criterion="stars")
However in this last step I got an error:
Error in cor(x) : ling....in progress:10%
Missing values present in input variable 'x'. Consider using use = 'pairwise.complete.obs'
Any help would be very appreciated.
You posted this exact question on Rhelp today. Both SO and Rhelp deprecate cross-posting but if you do choose to switch venues it is at the very least courteous to inform the readership.
You responded to the suggestion here on SO that there were missing data in your data-object named 'data' by claiming there were no missing data. So what does this code return:
lapply(data , function(x) sum(is.na(x)))
That would be a first level check, but there could also be an error caused by a later step that encountered a missing value in the matrix of correlation coefficients in the matrix 'huge.out". That could happen if there were: a) infinities in the calculations or b) if one of the columns were constant:
> cor(c(1:10,Inf), 1:11)
[1] NaN
> cor(rep(2,7), rep(2,7))
[1] NA
Warning message:
In cor(rep(2, 7), rep(2, 7)) : the standard deviation is zero
So the next check is:
sum( is.na(huge.out) )
That will at least give you some basis for defending your claim of no missings and will also give you a plausible theory as to the source of the error. To locate a column that is entirely constant you might do something like this (assuming it were a dataframe):
which(sapply(sapply(data, unique), length) > 1)
If it's a matrix, you need to use apply.

Unable to run ComBat script from R's sva library

I am trying to run ComBat script on a dataset with 2 batches, but I am getting errors and I do not know how to inspect code since I am an R newbie.
I am running ComBat method in this way:
# Load sva
library(sva)
# Read expression values
dat = read.table('dataset.xls', header=TRUE, sep='\t')
# Read sample information file about batches
sif = read.delim('sif.tsv', header=TRUE, sep='\t')
# Call ComBat
ComBat(dat=dat,batch=sif$Batch, mod=NULL)
Anyway my output is:
Found 2 batches
Found 0 categorical covariate(s)
Found 54675 Missing Data Values
Standardizing Data across genes
Error in solve(t(des) %*% des) %*% t(des) %*% y1 :
requires numeric/complex matrix/vector arguments
Data format for dat is:
probe set <sample1> ... <sampleN>
<gene_name> <value1> ... <valueN>
...
Data format for sif is:
Array name Sample name Batch
<Array1> <Sample1> <Batch1>
...
Any hint is appreciated. I'll provide more info if needed.
Thanks
It looks like when you're reading in the information it's not storing it in a format you expect.
From the error, ComBat is expecting a matrix with numeric or complex values. read.table from memory will give you a data.frame.
So try running:
dat <- as.matrix(dat)
ComBat(dat=dat,batch=sif$Batch, mod=NULL)
I have figured out how to do it, adding:
# Remove NA from end of lines
l_dat = length(dat)
dat[l_dat] <- NULL
# Remove probe set from beginning of lines
dat[1] <- NULL
just before the ComBat call. This because last column contains NA values (next warning goes away:
Found 54675 Missing Data Values
), and first column contains probe set (non numeric values) that raise next error:
Error in solve(t(des) %*% des) %*% t(des) %*% y1 :
requires numeric/complex matrix/vector arguments

Resources