Error: In storage.mode(x) <- "double" : NAs introduced by coercion - r

I'm new to R, but I'm trying to estimate a missing value in a large microarray dataset using impute.knn() from library(impute) using 6 nearest neighbors.
Here's an example:
seq1 <- seq(1:12)
mat1 <- matrix(seq1, 3)
mat1[2,2] <- "NA"
impute.knn(mat1, k=6)
I get the following error:
Error in knnimp.internal(x, k, imiss, irmiss, p, n, maxp = maxp) :
NA/NaN/Inf in foreign function call (arg 1)
In addition: Warning message:
In storage.mode(x) <- "double" : NAs introduced by coercion
I've also tried the following:
impute.knn(mat1[2,2], k=6)
and I get the following error:
Error in rep(1, p) : invalid 'times' argument
My google-fu has been off today. Any suggestions as to why I might be getting this error?
edit: I've tried
mat1[2,2] <- NA
as James suggested, but I get a segmentation fault. Using
replace(mat1, mat1[2,2], NA)
does not help either. Any other suggestions?

I'm not sure why impute.knn is set up the way it is, but the example in ?impute.knn uses khanmiss, which is a data.frame of factors and therefore becomes a character matrix when coerced to a matrix.
You are getting a segmentation fault because you are trying to impute with K > ncol(mat1) nearest neighbours. It might be worth reporting a bug to the package authors, as this could easily be checked in R and return an error, rather than a C-level error that kills R.
mat1 <- matrix(as.character(1:12), 3)
mat1[2,2] <- NA # must not be quoted for it to be a NA value
# mat1 is a 4 column matrix so
impute.knn(mat1, 1)
impute.knn(mat1, 2)
impute.knn(mat1, 3)
impute.knn(mat1, 4)
# Will all work
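To see why the quotes matter, here is a small base-R check. It does not call impute.knn at all, and the names mat_quoted and mat_real are only used for this illustration:

```r
# Quoted "NA" is an ordinary string, so the whole matrix becomes character
mat_quoted <- matrix(1:12, 3)
mat_quoted[2, 2] <- "NA"
typeof(mat_quoted)   # "character"
anyNA(mat_quoted)    # FALSE: there is no real missing value yet

# Converting back to numbers is what raises "NAs introduced by coercion"
n_coerced_na <- sum(is.na(suppressWarnings(as.numeric(mat_quoted))))
n_coerced_na         # 1

# Unquoted NA is a real missing value and keeps the matrix numeric
mat_real <- matrix(1:12, 3)
mat_real[2, 2] <- NA
typeof(mat_real)     # "integer"
anyNA(mat_real)      # TRUE
```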
Note
Despite the strange example, impute.knn will also work when mat1 is integer or double:
mat1 <- matrix(1:12,3)
mat1[2,2] <- NA
impute.knn(mat1,2)
mat1 <- matrix(seq(0, 1, length.out = 12), 3) # 12 doubles from 0 to 1; seq(0,1,12) would return only 0
mat1[2,2] <- NA
impute.knn(mat1,2)
Take-home message
Don't try to impute using more information than you have.
Perhaps the package authors should take heed of
fortune(15) # from the fortunes package
It really is hard to anticipate just how silly users can be. -- Brian D. Ripley, R-devel (October 2003)
and build in some error checking so a simple error does not cause a segfault.
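Until such checks exist, one option is to validate the arguments yourself before calling impute.knn. The sketch below is only an illustration: check_impute_input is a hypothetical helper, not part of the impute package, and the k > ncol(x) limit is taken from the diagnosis in this answer rather than from the package documentation.

```r
# Hypothetical pre-flight check; not part of the impute package.
check_impute_input <- function(x, k) {
  if (!is.matrix(x)) stop("x must be a matrix")
  # A quoted "NA" is a string, not a missing value
  if (is.character(x) && any(x == "NA", na.rm = TRUE)) {
    stop('x contains the string "NA"; assign NA without quotes')
  }
  # Limit assumed from the answer above: k must not exceed ncol(x)
  if (k > ncol(x)) {
    stop("k = ", k, " exceeds ncol(x) = ", ncol(x))
  }
  invisible(TRUE)
}

mat1 <- matrix(1:12, 3)
mat1[2, 2] <- NA
ok <- check_impute_input(mat1, 4)   # passes silently
msg <- tryCatch(check_impute_input(mat1, 6),
                error = function(e) conditionMessage(e))
msg   # "k = 6 exceeds ncol(x) = 4"
```

Calling the helper first turns the failure into an ordinary R error that can be caught, instead of a crash of the session.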

Related

How to create data frame for super large vectors?

I have 7 very large vectors, c1 to c7. My task is simply to create a data frame. However, when I use data.frame(), an error is returned.
> newdaily <- data.frame(c1,c2,c3,c4,c5,c6,c7)
Error in if (mirn && nrows[i] > 0L) { :
missing value where TRUE/FALSE needed
Calls: data.frame
In addition: Warning message:
In attributes(.Data) <- c(attributes(.Data), attrib) :
NAs introduced by coercion to integer range
Execution halted
They all have the same length (2,626,067,374 elements), and I’ve checked there’s no NA.
I tried subsetting 1/5 of each vector and data.frame() function works fine. So I guess it has something to do with the length/size of the data? Any ideas how to fix this problem? Many thanks!!
Update
Both data.frame and data.table only allow vectors shorter than 2^31-1. I still can't find a way to create one super large data.frame, so I subset my data instead... I hope larger vectors will be allowed in the future.
R's data.frames don't support such long vectors yet.
Your vectors are longer than 2^31 - 1 = 2147483647, which is the largest value R's integer type can represent. Since the data.frame function/class assumes that the number of rows can be represented by an integer, you get an error:
x <- rep(1, 2626067374)
DF <- data.frame(x)
#Error in if (mirn && nrows[i] > 0L) { :
# missing value where TRUE/FALSE needed
#In addition: Warning message:
#In attributes(.Data) <- c(attributes(.Data), attrib) :
# NAs introduced by coercion to integer range
Basically, something like this happens internally:
as.integer(length(x))
#[1] NA
#Warning message:
# NAs introduced by coercion to integer range
As a result the if condition becomes NA and you get the error.
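The same chain of events can be reproduced without allocating a huge vector. This is a minimal sketch that only uses the length from the question as a plain number; the variable names are illustrative:

```r
n <- 2626067374                      # vector length from the question, stored as a double
int_max <- .Machine$integer.max      # 2147483647
n > int_max                          # TRUE: too large for an integer

# Coercing to integer yields NA (with the "integer range" warning)
n_int <- suppressWarnings(as.integer(n))
is.na(n_int)                         # TRUE

# An NA condition inside if() produces the error seen in data.frame()
msg <- tryCatch(if (n_int > 0L) "ok",
                error = function(e) conditionMessage(e))
msg                                  # "missing value where TRUE/FALSE needed"
```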
Possibly, you could use the data.table package instead. Unfortunately, I don't have sufficient RAM to test:
library(data.table)
DT <- data.table(x = rep(1, 2626067374))
#Error: cannot allocate vector of size 19.6 Gb
For data of that size you must optimize your memory use, but how?
You need to write these values to a file.
output_name = "output.csv"
lines = paste(c1,c2,c3,c4,c5,c6,c7, sep = ";") # sep (not collapse) gives one "a;b;..." string per row
cat(lines, file = output_name , sep = "\n")
But you will probably need to analyse them too, and (as was said before) that requires a lot of memory.
So you have to read the file a block of lines at a time (say, 20k lines per iteration) to keep your RAM usage low, analyse these values, save the results and repeat.
con = file(output_name, open = "r") # open once so each readLines() call continues where the last one stopped
while(your_conditional) {
lines_in_this_round = readLines(con, n = 20000)
# create data.frame
# analyse data
# save result
# update your_conditional
}
close(con)
I hope this helps you.
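As a scaled-down illustration of that write-then-read-in-chunks idea, the sketch below uses two tiny stand-in vectors, a temporary file and a chunk size of 4 rows. All names and sizes are assumptions chosen so the example runs anywhere; with real data the chunk size would be far larger.

```r
c1 <- 1:10
c2 <- (1:10) * 2
out <- tempfile(fileext = ".csv")
chunk <- 4                                   # rows per round (tiny, for illustration)

# Write the rows chunk by chunk so no giant character vector is built
con_out <- file(out, open = "w")
for (start in seq(1, length(c1), by = chunk)) {
  idx <- start:min(start + chunk - 1, length(c1))
  writeLines(paste(c1[idx], c2[idx], sep = ";"), con_out)
}
close(con_out)

# Read the file back chunk by chunk and keep only a running result
con_in <- file(out, open = "r")
total <- 0
repeat {
  lines <- readLines(con_in, n = chunk)
  if (length(lines) == 0) break              # end of file reached
  df <- read.table(text = lines, sep = ";", col.names = c("c1", "c2"))
  total <- total + sum(df$c2)                # the "analysis" for this chunk
}
close(con_in)
total                                        # 110
```

Only one chunk is held as a data.frame at any time, which is the point of the approach.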

Error in FUN(X[[i]], ...) : Stan does not support NA (in t) in data

I was trying to use Facebook's new prophet API in R when I ran into this issue. The code is below:
library(prophet)
library(dplyr)
df <- read.csv('.../Peyton_Manning.csv') %>%
mutate(y = log(y))
m <- prophet(df)
At this line I am getting the error below:
Error in FUN(X[[i]], ...) : Stan does not support NA (in t) in data
failed to preprocess the data; optimization not done
Error in matrix(m$params[[name]], nrow = n.iteration) : 'data' must be of a vector type, was 'NULL'
I am not sure how to proceed from here. Please help!
I had the same problem although my dataset doesn't contain any NA values.
The problem was that one of the variables was defined as an integer when it should have been real, since it is a continuous variable. I changed it to real and the problem was solved.
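A quick way to look for this kind of type problem before calling prophet() is to inspect and fix the column types. The sketch below uses a made-up three-row data frame in place of the CSV from the question; the ds and y column names are the ones prophet expects, while the data and the decision to drop bad rows are assumptions for illustration.

```r
# Made-up stand-in for the CSV; the third date is deliberately broken
df <- data.frame(ds = c("2007-12-10", "2007-12-11", "not a date"),
                 y  = c(9L, 8L, 7L),
                 stringsAsFactors = FALSE)

df$ds <- as.Date(df$ds, format = "%Y-%m-%d")  # unparseable dates become NA
df$y  <- as.numeric(df$y)                     # integer -> real (double)

# Rows that would hand a missing value to the model
bad <- is.na(df$ds) | is.na(df$y)
clean <- df[!bad, ]
str(clean)
```

Whether dropping or repairing the offending rows is appropriate depends on the data; the check itself only shows where the NA values come from.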

All possible combinations for large numbers in R

Imagine I have a sequence like this:
seq <- rep(0:9, 10)
I want to know all possible combinations of this sequence. Of course, the combn command doesn't work:
> comb <- combn(seq, 10)
Error in matrix(r, nrow = len.r, ncol = count) :
invalid 'ncol' value (too large or NA)
In addition: Warning message:
In combn(seq, 10) : NAs introduced by coercion to integer range
Can you give me a hint how to make my own function for all possible combinations?
Based on your reply to the comment, here is one thing you can do. You need the combinat package installed for this to work.
library(combinat)
seq <- c(1,2,3,4,5,6,7,8,9,0)
permn(seq)
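It can help to count the results before generating them. This base-R sketch shows why the original combn call overflows and how many permutations the 10-element vector has; the variable names are illustrative.

```r
# Number of ways to choose 10 items out of the 100-element sequence
n_comb <- choose(100, 10)
n_comb > .Machine$integer.max     # TRUE: too many columns for a matrix

# Number of orderings of the 10 distinct digits
n_perm <- factorial(10)
n_perm                            # 3628800

# On a small input combn works fine: choose(4, 2) = 6 columns
small <- combn(4, 2)
ncol(small)                       # 6
```

Even the roughly 3.6 million permutations take a noticeable amount of memory as a list, so the count is worth checking first.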

R skmeans package - where does this error come from: "missing value where TRUE/FALSE needed"

I tried to cluster my data following the skmeans package's manual page.
I started by installing all required packages.
I then imported my data table, and made a matrix out of it with:
x <- as.matrix(x)
# See dimensions
dim(x)
[1] 184 4000
When I try to hard partition my data into 5 clusters - as it is done in the manual's first example - like so:
hparty <- skmeans(x, 5, control = list(verbose = TRUE))
I receive the following error message:
Error in if (!all(row_norms(x) > 0)) stop("Zero rows are not allowed.") :
missing value where TRUE/FALSE needed
And when I just type:
test <- skmeans(x, 5)
I get:
Error in skmeans(x, 5) : Zero rows are not allowed.
I'm trying to figure out where this error is coming from, and why the function can't get a TRUE/FALSE value. Has anyone ever experienced this problem?
Thank you in advance!
Spherical k-means is k-means where every vector is normalized to length 1.
If you have a constant 0 vector, this is not possible, and you cannot use spherical k-means (or cosine similarity).
!all(row_norms(x) > 0)
is the test that you do not have a row whose norm is 0.
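The two different error messages can be reproduced in base R. In this sketch the row norms are computed as plain Euclidean norms with sqrt(rowSums(x^2)), which is an assumption standing in for the package's internal row_norms helper, and the small matrix is made up. A zero row makes the test FALSE, while an NA in the data makes the test itself NA, and if() cannot handle an NA condition.

```r
x <- rbind(c(1, 2),    # ordinary row
           c(0, 0),    # zero row: norm is 0
           c(3, NA))   # row with a missing value: norm is NA

norms <- sqrt(rowSums(x^2))
zero_test <- all(norms > 0)
zero_test                           # FALSE -> "Zero rows are not allowed."

# Without the zero row, the NA row makes the whole test NA
na_test <- all(norms[-2] > 0)
na_test                             # NA -> "missing value where TRUE/FALSE needed"

# Keep only rows with a positive, non-missing norm
keep <- !is.na(norms) & norms > 0
x_clean <- x[keep, , drop = FALSE]
nrow(x_clean)                       # 1
```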

Error with knn function

I am trying to run this line:
knn(mydades.training[,-7],mydades.test[,-7],mydades.training[,7],k=5)
but I always get this error:
Error in knn(mydades.training[, -7], mydades.test[, -7], mydades.training[, :
NA/NaN/Inf in foreign function call (arg 6)
In addition: Warning messages:
1: In knn(mydades.training[, -7], mydades.test[, -7], mydades.training[, :
NAs introduced by coercion
2: In knn(mydades.training[, -7], mydades.test[, -7], mydades.training[, :
NAs introduced by coercion
Any ideas, please?
PS: mydades.training and mydades.test are defined as follows:
N <- nrow(mydades)
permut <- sample(c(1:N),N,replace=FALSE)
ord <- order(permut)
mydades.shuffled <- mydades[ord,]
prop.train <- 1/3
NOMBRE <- round(prop.train*N)
mydades.training <- mydades.shuffled[1:NOMBRE,]
mydades.test <- mydades.shuffled[(NOMBRE+1):N,]
I suspect that your issue lies in having non-numeric data fields in 'mydades'. The error line:
NA/NaN/Inf in foreign function call (arg 6)
makes me suspect that the knn-function call to the C language implementation fails. Many functions in R actually call underlying, more efficient C implementations, instead of having an algorithm implemented in just R. If you type just 'knn' in your R console, you can inspect the R implementation of 'knn'. There exists the following line:
Z <- .C(VR_knn, as.integer(k), as.integer(l), as.integer(ntr),
as.integer(nte), as.integer(p), as.double(train), as.integer(unclass(clf)),
as.double(test), res = integer(nte), pr = double(nte),
integer(nc + 1), as.integer(nc), as.integer(FALSE), as.integer(use.all))
where .C means that we're calling a C function named 'VR_knn' with the provided function arguments. Since you have two of the warnings
NAs introduced by coercion
I think two of the as.double/as.integer calls fail, and introduce NA values. If we start counting the parameters, the 6th argument is:
as.double(train)
that may fail in cases such as:
# as.double can not translate text fields to doubles, they are coerced to NA-values:
> as.double("sometext")
[1] NA
Warning message:
NAs introduced by coercion
# while the following text is cast to double without an error:
> as.double("1.23")
[1] 1.23
You get two of the coercion warnings, which are probably given by 'as.double(train)' and 'as.double(test)'. Since you did not provide us with exact details of what 'mydades' looks like, here are some of my best guesses (using artificial data from a multivariate normal distribution):
library(MASS)
mydades <- mvrnorm(100, mu=c(1:6), Sigma=matrix(1:36, ncol=6))
mydades <- cbind(mydades, sample(LETTERS[1:5], 100, replace=TRUE))
# This breaks knn
mydades[3,4] <- Inf
# This breaks knn
mydades[4,3] <- -Inf
# The Inf entries above break knn, but they do not trigger the 'NAs introduced by coercion' warning
# This breaks knn and does give that warning; just some raw text
mydades[1,2] <- mydades[50,1] <- "foo"
mydades[100,3] <- "bar"
# ... or perhaps wrongly formatted exponential numbers?
mydades[1,1] <- "2.34EXP-05"
# ... or wrong decimal symbol?
mydades[3,3] <- "1,23"
# should be 1.23, as R uses '.' as decimal symbol and not ','
# ... or most likely a whole column is non-numeric, since the error is given twice (as.double problem both in training AND test set)
mydades[,1] <- sample(letters[1:5],100,replace=TRUE)
I would not keep both the numeric data and class labels in a single matrix, perhaps you could split the data as:
mydadesnumeric <- mydades[,1:6] # 6 first columns
mydadesclasses <- mydades[,7]
Using calls
str(mydades); summary(mydades)
may also help you/us in locating the problematic data entries, so that you can correct them to numeric entries or omit the non-numeric fields.
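Beyond str() and summary(), the offending entries can be listed directly by checking which values fail to convert. The data frame below is a made-up miniature of 'mydades' (two feature columns and a class column), so its contents are assumptions for illustration only.

```r
mydades <- data.frame(a   = c("1.5", "foo", "1,23"),
                      b   = c("2", "3", "4"),
                      cls = c("A", "B", "A"),
                      stringsAsFactors = FALSE)

# For each feature column, return the entries that as.numeric() cannot convert
bad_entries <- lapply(mydades[, 1:2], function(col) {
  converted <- suppressWarnings(as.numeric(col))
  col[is.na(converted) & !is.na(col)]
})
bad_entries$a   # "foo"  "1,23"
bad_entries$b   # character(0)
```

Columns that come back empty are safe to pass to knn as numeric features; the others need cleaning first.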
The rest of the run code (after breaking the data), as provided by you:
N <- nrow(mydades)
permut <- sample(c(1:N),N,replace=FALSE)
ord <- order(permut)
mydades.shuffled <- mydades[ord,]
prop.train <- 1/3
NOMBRE <- round(prop.train*N)
mydades.training <- mydades.shuffled[1:NOMBRE,]
mydades.test <- mydades.shuffled[(NOMBRE+1):N,]
# 7th column seems to be the class labels
knn(train=mydades.training[,-7],test=mydades.test[,-7],mydades.training[,7],k=5)
Great answer by @Teemu.
As this is a well-read question, I will give the same answer from an analytics perspective.
The KNN function classifies data points by calculating the Euclidean distance between the points. That's a mathematical calculation requiring numbers. All variables in KNN must therefore be coerce-able to numerics.
The data preparation for KNN often involves three tasks:
(1) Fix all NA or "" values
(2) Convert all factors into a set of booleans, one for each level in the factor
(3) Normalize the values of each variable to the range [0, 1] so that no variable's range has an unduly large impact on the distance measurement.
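The three steps above can be sketched in base R as follows. The data frame, the choice of mean imputation for step (1) and the rescale01 helper are all assumptions made for this illustration; other ways of handling missing values may suit your data better.

```r
df <- data.frame(height = c(150, 160, NA, 180),
                 colour = factor(c("red", "blue", "red", "green")))

# (1) Fix NA values: here replaced by the column mean (one simple option)
df$height[is.na(df$height)] <- mean(df$height, na.rm = TRUE)

# (2) Turn the factor into one 0/1 indicator column per level
dummies <- model.matrix(~ colour - 1, data = df)

# (3) Min-max normalization to [0, 1]; hypothetical helper,
#     which would divide by zero for a constant column
rescale01 <- function(v) (v - min(v)) / (max(v) - min(v))

prepared <- cbind(height = rescale01(df$height), dummies)
dim(prepared)                 # 4 4
range(prepared[, "height"])   # 0 1
```

The resulting matrix is entirely numeric, so it can be passed as the train or test argument of knn, with the class labels kept in a separate vector.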
I would also point out that the function seems to fail when using integers. I needed to convert everything into "num" type prior to calling the knn function. This includes the target feature, for which most methods in R use the factor type. Thus, as.numeric(my_frame$target_feature) is required.
