I am trying to run ComBat on a dataset with 2 batches, but I am getting errors and I do not know how to inspect the code, since I am an R newbie.
I am running ComBat method in this way:
# Load sva
library(sva)
# Read expression values
dat = read.table('dataset.xls', header=TRUE, sep='\t')
# Read sample information file about batches
sif = read.delim('sif.tsv', header=TRUE, sep='\t')
# Call ComBat
ComBat(dat=dat,batch=sif$Batch, mod=NULL)
Anyway my output is:
Found 2 batches
Found 0 categorical covariate(s)
Found 54675 Missing Data Values
Standardizing Data across genes
Error in solve(t(des) %*% des) %*% t(des) %*% y1 :
requires numeric/complex matrix/vector arguments
Data format for dat is:
probe set <sample1> ... <sampleN>
<gene_name> <value1> ... <valueN>
...
Data format for sif is:
Array name Sample name Batch
<Array1> <Sample1> <Batch1>
...
Any hint is appreciated. I'll provide more info if needed.
Thanks
It looks like when you're reading in the information it's not storing it in a format you expect.
From the error, ComBat is expecting a matrix with numeric or complex values. If I remember correctly, read.table gives you a data.frame.
So try running:
dat <- as.matrix(dat)
ComBat(dat=dat,batch=sif$Batch, mod=NULL)
I have figured out how to do it, adding:
# Remove NA from end of lines
l_dat = length(dat)
dat[l_dat] <- NULL
# Remove probe set from beginning of lines
dat[1] <- NULL
just before the ComBat call. This is because the last column contains only NA values (removing it makes the following message go away:
Found 54675 Missing Data Values
), and the first column contains the probe set names (non-numeric values), which raise the following error:
Error in solve(t(des) %*% des) %*% t(des) %*% y1 :
requires numeric/complex matrix/vector arguments
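For reference, the same cleanup can be written without relying on column positions. This is only a sketch on a made-up toy table: the column names probe, sample1, sample2 and X are assumptions for illustration, not the real file layout.

```r
# Toy stand-in for the expression table: a probe ID column, two sample
# columns, and an all-NA column such as a trailing tab would produce.
dat <- data.frame(
  probe   = c("p1", "p2", "p3"),
  sample1 = c(1.5, 2.0, 3.1),
  sample2 = c(0.7, 1.9, 2.2),
  X       = NA,
  stringsAsFactors = FALSE
)

# Keep the probe IDs as row names, then drop the non-numeric column.
rownames(dat) <- dat$probe
dat$probe <- NULL

# Drop every column that contains nothing but NA.
all_na <- vapply(dat, function(col) all(is.na(col)), logical(1))
dat <- dat[, !all_na, drop = FALSE]

# Convert to a matrix and confirm it is numeric before passing it on.
expr <- as.matrix(dat)
stopifnot(is.numeric(expr))
```

The resulting expr matrix keeps the probe IDs as row names, so they are still available after the batch correction.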
I have 7 very large vectors, c1 to c7. My task is simply to create a data frame. However, when I use data.frame(), I get an error message.
> newdaily <- data.frame(c1,c2,c3,c4,c5,c6,c7)
Error in if (mirn && nrows[i] > 0L) { :
missing value where TRUE/FALSE needed
Calls: data.frame
In addition: Warning message:
In attributes(.Data) <- c(attributes(.Data), attrib) :
NAs introduced by coercion to integer range
Execution halted
They all have the same length (2,626,067,374 elements), and I’ve checked there’s no NA.
I tried subsetting 1/5 of each vector and data.frame() function works fine. So I guess it has something to do with the length/size of the data? Any ideas how to fix this problem? Many thanks!!
Update
Both data.frame and data.table only accept vectors with at most 2^31-1 elements. I still can't find a way to create one super large data.frame, so I subset my data instead... I hope longer vectors will be allowed in the future.
R's data.frames don't support such long vectors yet.
Your vectors are longer than 2^31 - 1 = 2147483647, which is the largest value that R's integer type can represent. Since the data.frame function/class assumes that the number of rows can be represented by an integer, you get an error:
x <- rep(1, 2626067374)
DF <- data.frame(x)
#Error in if (mirn && nrows[i] > 0L) { :
# missing value where TRUE/FALSE needed
#In addition: Warning message:
#In attributes(.Data) <- c(attributes(.Data), attrib) :
# NAs introduced by coercion to integer range
Basically, something like this happens internally:
as.integer(length(x))
#[1] NA
#Warning message:
# NAs introduced by coercion to integer range
As a result the if condition becomes NA and you get the error.
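The same failure can be reproduced without allocating the vector at all, which may help if you want to check a length up front. This is a small sketch; n and n_int are just illustrative names.

```r
# The length from the question, stored as a double (doubles can hold it).
n <- 2626067374

# Largest value an R integer can hold.
.Machine$integer.max            # 2147483647
n > .Machine$integer.max        # TRUE

# Coercing to integer overflows and yields NA (with a warning).
n_int <- suppressWarnings(as.integer(n))
is.na(n_int)                    # TRUE

# An if() on an NA condition is an error, which is what data.frame() hits.
res <- tryCatch(
  if (n_int > 0L) "ok",
  error = function(e) conditionMessage(e)
)
```

Comparing length(x) against .Machine$integer.max before calling data.frame() gives a clearer failure than the internal error.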
Possibly, you could use the data.table package instead. Unfortunately, I don't have sufficient RAM to test:
library(data.table)
DT <- data.table(x = rep(1, 2626067374))
#Error: cannot allocate vector of size 19.6 Gb
For data of that size you have to optimize your memory usage, but how?
You need to write these values to a file.
output_name = "output.csv"
# sep (not collapse) keeps one line per row, with the fields separated by ";"
lines = paste(c1,c2,c3,c4,c5,c6,c7, sep = ";")
cat(lines, file = output_name , sep = "\n")
But you'll probably need to analyse them too, and (as was said before) that requires a lot of memory.
So you have to read the file in chunks of lines (say, 20k lines at a time) to limit RAM usage: analyse these values, save the results, and repeat.
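# Open the connection once, so each readLines() call continues where the previous one stopped
con = file(output_name, open = "r")
repeat {
  lines_in_this_round = readLines(con, n = 20000)
  # Stop once the file is exhausted
  if (length(lines_in_this_round) == 0) break
  # create data.frame
  # analyse data
  # save result
}
close(con)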
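To make the chunked approach concrete, here is a tiny self-contained sketch. It writes a 10-line toy file to a temporary path and reads it back 4 lines at a time, keeping only a running sum in memory. The column names a and b, the chunk size and the "analysis" (a sum) are all placeholders.

```r
# Write a small ";"-separated toy file to a temporary location.
path <- tempfile(fileext = ".csv")
writeLines(paste(1:10, (1:10) * 2, sep = ";"), path)

# Open the connection explicitly so that reads continue across iterations.
con <- file(path, open = "r")
total  <- 0
n_rows <- 0
repeat {
  chunk_lines <- readLines(con, n = 4)
  if (length(chunk_lines) == 0) break   # end of file reached

  # Parse only the current chunk into a data.frame.
  chunk <- read.table(text = chunk_lines, sep = ";", col.names = c("a", "b"))

  # "Analyse" the chunk and keep only the aggregated result.
  total  <- total + sum(chunk$b)
  n_rows <- n_rows + nrow(chunk)
}
close(con)
unlink(path)
```

Only one chunk is held in memory at a time, so the peak memory use depends on the chunk size rather than on the file size.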
I hope this helps you.
I am trying to do an association network using some expression data I have, the data is really huge: 300 samples and ~30,000 genes. I would like to apply a Gaussian graphical model to my data using the huge R package.
Here is the code I am using
dim(data)
#[1] 317 32291
huge.out <- huge.npn(data)
huge.stars <- huge.select(huge.out, criterion="stars")
However in this last step I got an error:
Error in cor(x) : ling....in progress:10%
Missing values present in input variable 'x'. Consider using use = 'pairwise.complete.obs'
Any help would be very appreciated.
You posted this exact question on Rhelp today. Both SO and Rhelp deprecate cross-posting but if you do choose to switch venues it is at the very least courteous to inform the readership.
You responded to the suggestion here on SO that there were missing data in your data-object named 'data' by claiming there were no missing data. So what does this code return:
lapply(data , function(x) sum(is.na(x)))
That would be a first-level check, but there could also be an error caused by a later step that encountered a missing value in the matrix of correlation coefficients in the matrix 'huge.out'. That could happen if there were: a) infinities in the calculations or b) if one of the columns were constant:
> cor(c(1:10,Inf), 1:11)
[1] NaN
> cor(rep(2,7), rep(2,7))
[1] NA
Warning message:
In cor(rep(2, 7), rep(2, 7)) : the standard deviation is zero
So the next check is:
sum( is.na(huge.out) )
That will at least give you some basis for defending your claim of no missings and will also give you a plausible theory as to the source of the error. To locate a column that is entirely constant you might do something like this (assuming it were a dataframe):
which(sapply(data, function(x) length(unique(x))) == 1)
If it's a matrix, you need to use apply.
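For the matrix case, a sketch along these lines should work. The toy matrix m stands in for your data; the variable names are made up.

```r
# Toy matrix: column b is constant, column c contains one NA.
m <- cbind(a = c(1, 2, 3, 4),
           b = c(5, 5, 5, 5),
           c = c(2, NA, 1, 0))

# Number of missing values per column.
na_per_col <- colSums(is.na(m))

# Columns with at most one distinct non-missing value, i.e. constant columns.
constant_cols <- which(apply(m, 2, function(col) {
  length(unique(col[!is.na(col)])) <= 1
}))

# A constant column has zero standard deviation, so its correlation is NA.
r <- suppressWarnings(cor(m[, "a"], m[, "b"]))
```

Here constant_cols points at column b and na_per_col shows the single NA in column c, which are the two situations described above.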
I am trying to run this line:
knn(mydades.training[,-7],mydades.test[,-7],mydades.training[,7],k=5)
but I always get this error:
Error in knn(mydades.training[, -7], mydades.test[, -7], mydades.training[, :
NA/NaN/Inf in foreign function call (arg 6)
In addition: Warning messages:
1: In knn(mydades.training[, -7], mydades.test[, -7], mydades.training[, :
NAs introduced by coercion
2: In knn(mydades.training[, -7], mydades.test[, -7], mydades.training[, :
NAs introduced by coercion
Any ideas, please?
PS: mydades.training and mydades.test are defined as follows:
N <- nrow(mydades)
permut <- sample(c(1:N),N,replace=FALSE)
ord <- order(permut)
mydades.shuffled <- mydades[ord,]
prop.train <- 1/3
NOMBRE <- round(prop.train*N)
mydades.training <- mydades.shuffled[1:NOMBRE,]
mydades.test <- mydades.shuffled[(NOMBRE+1):N,]
I suspect that your issue lies in having non-numeric data fields in 'mydades'. The error line:
NA/NaN/Inf in foreign function call (arg 6)
makes me suspect that the knn-function call to the C language implementation fails. Many functions in R actually call underlying, more efficient C implementations, instead of having an algorithm implemented in just R. If you type just 'knn' in your R console, you can inspect the R implementation of 'knn'. There exists the following line:
Z <- .C(VR_knn, as.integer(k), as.integer(l), as.integer(ntr),
as.integer(nte), as.integer(p), as.double(train), as.integer(unclass(clf)),
as.double(test), res = integer(nte), pr = double(nte),
integer(nc + 1), as.integer(nc), as.integer(FALSE), as.integer(use.all))
where .C means that we're calling a C function named 'VR_knn' with the provided function arguments. Since you have two of the warnings
NAs introduced by coercion
I think two of the as.double/as.integer calls fail, and introduce NA values. If we start counting the parameters, the 6th argument is:
as.double(train)
that may fail in cases such as:
# as.double can not translate text fields to doubles, they are coerced to NA-values:
> as.double("sometext")
[1] NA
Warning message:
NAs introduced by coercion
# while the following text is cast to double without an error:
> as.double("1.23")
[1] 1.23
You get two of the coercion warnings, which are probably given by 'as.double(train)' and 'as.double(test)'. Since you did not provide us with exact details of what 'mydades' looks like, here are some of my best guesses (using artificial data from a multivariate normal distribution):
library(MASS)
mydades <- mvrnorm(100, mu=c(1:6), Sigma=matrix(1:36, ncol=6))
mydades <- cbind(mydades, sample(LETTERS[1:5], 100, replace=TRUE))
# This breaks knn
mydades[3,4] <- Inf
# This breaks knn
mydades[4,3] <- -Inf
# The Inf values above, however, do not trigger the "NAs introduced by coercion" warning
# This breaks knn and does trigger that warning; just some raw text
mydades[1,2] <- mydades[50,1] <- "foo"
mydades[100,3] <- "bar"
# ... or perhaps wrongly formatted exponential numbers?
mydades[1,1] <- "2.34EXP-05"
# ... or wrong decimal symbol?
mydades[3,3] <- "1,23"
# should be 1.23, as R uses '.' as decimal symbol and not ','
# ... or most likely a whole column is non-numeric, since the error is given twice (as.double problem both in training AND test set)
mydades[,1] <- sample(letters[1:5],100,replace=TRUE)
I would not keep both the numeric data and class labels in a single matrix, perhaps you could split the data as:
mydadesnumeric <- mydades[,1:6] # 6 first columns
mydadesclasses <- mydades[,7]
Using calls
str(mydades); summary(mydades)
may also help you/us in locating the problematic data entries and correct them to numeric entries or omitting non-numeric fields.
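To pinpoint the offending entries directly, something like the following sketch may help. The vector x is a made-up example column holding the kinds of values guessed at above.

```r
# A made-up character column with a mix of valid and invalid entries.
x <- c("1.23", "foo", "2.34EXP-05", "1,23", "Inf", "4")

# Try the conversion; entries that cannot be parsed become NA.
parsed <- suppressWarnings(as.numeric(x))

# Entries that trigger "NAs introduced by coercion".
unparseable <- x[is.na(parsed)]

# Entries that knn cannot use at all (unparseable or infinite).
unusable <- x[!is.finite(parsed)]
```

Running the same check column by column, for example with lapply over a data.frame, shows which fields need to be cleaned or dropped.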
The rest of the run code (after breaking the data), as provided by you:
N <- nrow(mydades)
permut <- sample(c(1:N),N,replace=FALSE)
ord <- order(permut)
mydades.shuffled <- mydades[ord,]
prop.train <- 1/3
NOMBRE <- round(prop.train*N)
mydades.training <- mydades.shuffled[1:NOMBRE,]
mydades.test <- mydades.shuffled[(NOMBRE+1):N,]
# 7th column seems to be the class labels
knn(train=mydades.training[,-7],test=mydades.test[,-7],mydades.training[,7],k=5)
Great answer by @Teemu.
As this is a well-read question, I will give the same answer from an analytics perspective.
The KNN function classifies data points by calculating the Euclidean distance between the points. That's a mathematical calculation requiring numbers. All variables in KNN must therefore be coerce-able to numerics.
The data preparation for KNN often involves three tasks:
(1) Fix all NA or "" values
(2) Convert all factors into a set of booleans, one for each level in the factor
(3) Normalize the values of each variable to the range [0, 1] so that no variable's range has an unduly large impact on the distance measurement.
I would also point out that the function seems to fail when using integers. I needed to convert everything into "num" type prior to calling the knn function. This includes the target feature, for which most methods in R use the factor type. Thus, as.numeric(my_frame$target_feature) is required.
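The three preparation tasks listed above could look roughly like this. It is a minimal sketch on invented data: the columns height and colour, the mean imputation and the helper rescale01 are illustrative choices, not the only way to do it.

```r
# Invented example data: one numeric column with an NA, one factor.
df <- data.frame(
  height = c(150, 160, NA, 180),
  colour = factor(c("red", "blue", "red", "green"))
)

# (1) Fix NA values; here by imputing the column mean (one simple option).
df$height[is.na(df$height)] <- mean(df$height, na.rm = TRUE)

# (2) One 0/1 indicator column per factor level ("- 1" drops the intercept).
dummies <- model.matrix(~ colour - 1, data = df)

# (3) Min-max normalisation to [0, 1]; rescale01 is a hypothetical helper.
rescale01 <- function(v) (v - min(v)) / (max(v) - min(v))

# Combine everything into one purely numeric feature matrix.
features <- cbind(height = rescale01(df$height), dummies)
```

The resulting features matrix contains only doubles, so it can be split into training and test rows and passed to knn without coercion problems.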
i am struggling with an assignment and i would like your input.
note: this is a homework but when i tried to add the tag it said not to add it..
i don't want the resulting code, just suggestions on how to get this working :)
so, i have a t.test function as such:
my.t.test <- function(x,s1,s2){
x1 <- x[s1]
x2 <- x[s2]
x1 <- as.numeric(x1)
x2 <- as.numeric(x2)
t.out <- t.test(x1,x2,alternative="two.sided",var.equal=T)
out <- as.numeric(t.out$p.value)
return(out)
}
a matrix of 30 cols x 12k rows called data, and an annotation file called dataAnn containing the col names and data on the columns
the first column of dataAnn contains a list of M (male) or F (female) corresponding to the samples (or cols) in data (which follow the same order as in dataAnn). i have to run a t.test comparing the two groups of samples and get the p values out
when i call
raw.pValue <- apply(data,1,my.t.test,s1=dataAnn[,1]=="M",s2=dataAnn[,1]=="F")
i get the error
Error in t.test(x1, x2, alternative = "two.sided", var.equal = T) :
unused argument(s) (alternative = "two.sided", var.equal = T)
i even tried to use
raw.pValue <- apply(data,1,my.t.test,s1=unlist(data[,1:18]),s2=unlist(data[,19:30]))
to divide the cols i want to compare but in this case i get the error
Error in x[s1] : invalid subscript type 'list'
i have been looking online, i understand that the second error is caused by an index being a list...but this didn't really clarify it for me...
any input would be appreciated!!
You have overwritten the t.test function. Try calling it something like my.t.test, or when you want to call the original one use stats::t.test (this calls the one from the stats namespace). Remember that when you have overwritten a function you need to rm it from your workspace before you can use the original one without specifying the namespace.
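Here is a small sketch of the masking problem and both ways around it. The one-argument t.test defined here is a deliberately broken stand-in, only there to reproduce the situation.

```r
# Accidentally mask stats::t.test with a function that takes one argument.
t.test <- function(x) "masked"

# Calling it with the usual arguments now fails with an
# "unused argument" error; capture the message instead of stopping.
res <- tryCatch(
  t.test(1:5, 2:6, var.equal = TRUE),
  error = function(e) conditionMessage(e)
)

# The namespace-qualified call still reaches the original function.
p <- stats::t.test(c(1, 2, 3, 4, 5), c(2, 4, 6, 8, 10),
                   var.equal = TRUE)$p.value

# Remove the masking definition; plain t.test is the stats version again.
rm(t.test)
restored <- identical(t.test, stats::t.test)
```

After rm(t.test), restored is TRUE, so code that calls plain t.test works again without the stats:: prefix.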
While programming in R, I'm continuously facing the following error:
Error in data.validity(data, "data") : Bad usage: input 'data' is
not double type.
Can anyone please explain why this error is happening, i.e. the reasons in the dataset which cause the error to arise?
Here is the code I'm running. The packages I have loaded are cluster, psych and clv.
data1 <- read.table(file='dataset.csv', sep=',', header=T, row.names=1)
data1.p <- as.matrix(data1)
hello.data <- data1.p[,1:15]
agnes.mod <- agnes(hello.data)
v.pred <- as.integer(cutree(agnes.mod,3)) # "cut" the tree
scatt <- clv.Scatt(hello.data, v.pred)
Error in data.validity(data, "data") :
Bad usage: input 'data' is not double type.
The key part of data.validity() raising the error is:
data = as.matrix(data)
if( !is.double(data) )
stop(paste("Bad usage: input '", name, "' is not double type.", sep=""))
data is converted to a matrix and then checked to see whether it is a numeric matrix via is.double(). If it isn't, the clause is true and the error is raised. So why isn't your data (hello.data) numeric when converted to a matrix? Either you have character variables in your data or there are factors. Do you have factors? Try
str(hello.data)
Are there any non-numeric variables in there? If you have character data then get rid of it. If you have factors, then data.validity() could coerce via data.matrix() but as it doesn't, try
hello.data <- data.matrix(hello.data)
after the line creating hello.data then run the rest of your code.
Whether this makes sense (treating a nominal or ordinal variable as a simple numeric) is unclear as you haven't provided a reproducible example or explained what your data are etc.
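One more thing that may be worth checking, as an aside to the above: is.double() is also FALSE for a matrix of whole numbers stored as integers, even though such a matrix is numeric. A small sketch with a toy matrix m, under the assumption that your columns might have been read in as integers:

```r
# An integer matrix is numeric but not of type double.
m <- matrix(1:6, nrow = 3)
was_numeric <- is.numeric(m)   # TRUE
was_double  <- is.double(m)    # FALSE

# Changing the storage mode converts the values and keeps dim/dimnames.
storage.mode(m) <- "double"
now_double <- is.double(m)     # TRUE
```

If str(hello.data) reports int rather than num or chr, applying storage.mode(hello.data) <- "double" before calling clv.Scatt is a cheap thing to try.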