I'm trying to calculate the CRPS using the verification package in R. The data appears to read in OK, but I get an error when computing the CRPS itself: "invalid 'times' argument". All values are real and non-negative, and I test for NaN/NA values and skip those. Having searched around, I can't find anything that explains this error. I'm reading the data from netCDF files into larger arrays, and then computing the CRPS for each grid cell in those arrays.
Any help would be greatly appreciated!
The relevant snippet from the code I'm using is:
## for each grid cell, get obs (wbarray) and 25 ensemble members of forecast eps (fcstarray)
for(x in 1:3600){
  for(y in 1:1500){
    obs=wbarray[x,y]
    eps=fcstarray[x,y,1:25]
    if(!is.na(obs)){
      print(obs)
      print(eps)
      print("calculating CRPS - real value found")
      crpsfcst=(crpsDecomposition(obs,eps)$CRPS)
      CRPSfcst[x,y,w]=crpsfcst
    }
  }
}
(w is specified in an earlier loop)
And the output I get:
obs: 0.3850737
eps: 0.3382506 0.3466184 0.3508921 0.3428135 0.3416993 0.3423528 0.3307764
0.3372431 0.3394377 0.3398165 0.3414395 0.3531360 0.3319155 0.3453161
0.3362813 0.3449474 0.3340050 0.3278898 0.3380596 0.3379150 0.3429202
0.3467927 0.3419354 0.3472489 0.3550797
"calculating CRPS - real value found"
Error in rep(0, nObs * (nMember +1)) : invalid 'times' argument
Calls: crpsDecomposition
Execution halted
If you type crpsDecomposition on your R command prompt you'll get the source code for the function. The first few lines show:
function (obs, eps)
{
nMember = dim(eps)[2]
nObs <- length(obs)
Since your eps data object appears to be (from your output) a one-dimensional vector, dim(eps) is NULL, so dim(eps)[2] is NULL as well, which sets nMember to NULL. Thus nObs*(nMember + 1) evaluates to a zero-length numeric vector, which rep() rejects as its 'times' argument. I imagine you simply need to re-examine what form eps should take, because it would appear that it needs to be a matrix where each column corresponds to a different "member" (whatever that means in this context).
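The failure can be reproduced in base R without the verification package. The sketch below uses a shortened, made-up ensemble vector, and it assumes (from the dim(eps)[2] line in the source) that crpsDecomposition expects one row per observation and one column per member; check ?crpsDecomposition before relying on that layout.

```r
# Three made-up ensemble values standing in for the 25 members.
eps <- c(0.3382506, 0.3466184, 0.3508921)
obs <- 0.3850737

# A plain vector has no dim attribute, so dim(eps)[2] is NULL.
nMember <- dim(eps)[2]
nObs <- length(obs)

# NULL + 1 is numeric(0), so the value passed to rep() as 'times'
# is a zero-length vector rather than a single number.
times_arg <- nObs * (nMember + 1)
length(times_arg)

# Reshape to a 1 x nMember matrix (assumed layout: rows = observations,
# columns = ensemble members).
eps_mat <- matrix(eps, nrow = 1)
n_member_fixed <- dim(eps_mat)[2]
```

Under that assumption, the call inside the loop would become something like crpsDecomposition(obs, matrix(eps, nrow = 1))$CRPS.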
I am trying to do an association network using some expression data I have, the data is really huge: 300 samples and ~30,000 genes. I would like to apply a Gaussian graphical model to my data using the huge R package.
Here is the code I am using
dim(data)
#[1] 317 32291
huge.out <- huge.npn(data)
huge.stars <- huge.select(huge.out, criterion="stars")
However in this last step I got an error:
Error in cor(x) : ling....in progress:10%
Missing values present in input variable 'x'. Consider using use = 'pairwise.complete.obs'
Any help would be very appreciated.
You posted this exact question on Rhelp today. Both SO and Rhelp deprecate cross-posting but if you do choose to switch venues it is at the very least courteous to inform the readership.
You responded to the suggestion here on SO that there were missing data in your data-object named 'data' by claiming there were no missing data. So what does this code return:
lapply(data , function(x) sum(is.na(x)))
That would be a first-level check, but the error could also come from a later step that encounters a missing value in the matrix of correlation coefficients computed from 'huge.out'. That could happen if a) there were infinities in the calculations or b) one of the columns were constant:
> cor(c(1:10,Inf), 1:11)
[1] NaN
> cor(rep(2,7), rep(2,7))
[1] NA
Warning message:
In cor(rep(2, 7), rep(2, 7)) : the standard deviation is zero
So the next check is:
sum( is.na(huge.out) )
That will at least give you some basis for defending your claim of no missings and will also give you a plausible theory as to the source of the error. To locate a column that is entirely constant you might do something like this (assuming it were a dataframe):
which(sapply(data, function(x) length(unique(x))) == 1)
If it's a matrix, you need to use apply.
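For the matrix case, a minimal sketch of the same checks with apply() might look like this. The matrix m and its column names are made up for illustration; in the question it would be the matrix returned by huge.npn().

```r
# Toy matrix: column "b" is constant, column "c" contains an infinity.
m <- cbind(a = c(1, 2, 3, 4), b = rep(2, 4), c = c(1, Inf, 3, 4))

# Columns with a single unique value (zero standard deviation, so cor() gives NA).
constant_cols <- which(apply(m, 2, function(col) length(unique(col)) == 1))

# Columns containing any non-finite value (NA, NaN, Inf or -Inf).
nonfinite_cols <- which(apply(m, 2, function(col) any(!is.finite(col))))
```

Both results are named index vectors, so they can be used directly to drop or inspect the offending columns.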
There might be some threads on while loops already, but I am struggling with them. It would be great if someone could help an R beginner out.
So I am trying to do 10000 simulations from an out-of-sample regression forecast using the forecast parameters: mean, sd. Thankfully, my data is normal.
This is what I have
N<-10000
i<-1:N
k<-vector(,N)
while(i<N+1){k(,i)=vector(,rnorm(N,mean=.004546,sd=.00464163))}
...and I get this error
Error in vector(, rnorm(5000, mean = 0.004546, sd = 0.00464163)) :
invalid 'length' argument
In addition: Warning message:
In while (i < N + 1) { : the condition has length > 1 and only the first element will be used
I can't seem to get my head around it.
No reason to create a loop here. If you want to put 10000 samples, normally distributed with mean = 0.004546 and sd = 0.00464163, into vector k, just do:
k <- rnorm(10000,mean = 0.004546, sd = 0.00464163)
Try this:
N<-10
i<-1
k<-matrix(0,1,N)
while(i<N+1){k[i]=rnorm(1,mean=.004546,sd=.00464163)
i=i+1
}
print(k)
To solve your problem, use @Esben Friis' answer. You are taking a hard approach to an easy problem.
To address the questions you had about the error messages you got, however:
Error in vector(, rnorm(5000, mean = 0.004546, sd = 0.00464163)) :
invalid 'length' argument
This is the wrong way to go as vector() will produce a vector of a set length instead of a set of values. You are thinking about the as.vector() function:
as.vector(rnorm(5000, mean = 0.004546, sd = 0.00464163))
This is however not needed as this will only create a new vector of your values, which are already in a vector structure of the type double. Using this function will therefore not change anything.
It is best to simply use:
rnorm(5000, mean=0.004546, sd=0.00464163)
Further:
In addition: Warning message:
In while(i<N+1){: the condition has length>1 and only the first element will be used
This warning stems from i being the vector 1:N, which has a length greater than 1. The warning states that only the first element of i is used in the condition on every pass of the loop, which is the same as writing i[1]. (Since R 4.2.0, a condition of length greater than one is an error rather than a warning.)
while(i<N+1){ }
#is the same as
while(i[1]<N+1){ }
Instead you want a scalar counter that is incremented until it reaches N. Furthermore you can use the <= (less than or equal to) operator instead of writing <N+1.
while(newVal<=N){ }
This method will bring up new problems which could be solved by using a for() loop instead, but that is however out of the scope of the question and really not the right approach to your problem, as stated in the beginning. Hope you learned something and good luck!
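For completeness, here is a small sketch of both loop forms with a scalar counter. It uses N = 10 to keep the output short, and the names k_while and k_for are made up for illustration. The vectorised rnorm() call remains the better solution.

```r
N <- 10
k_while <- numeric(N)      # preallocate the result vector

i <- 1                     # scalar counter, not the vector 1:N
while (i <= N) {
  k_while[i] <- rnorm(1, mean = 0.004546, sd = 0.00464163)
  i <- i + 1               # without this increment the loop never ends
}

# The same thing as a for() loop, which manages the counter itself.
k_for <- numeric(N)
for (j in seq_len(N)) {
  k_for[j] <- rnorm(1, mean = 0.004546, sd = 0.00464163)
}
```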
When trying to use the shglm function of the speedglm package, I run into a problem. As the file is too large to read into memory, I wanted to use a link function as outlined in the help pages for the package. The link function is
make.data<-function(filename, chunksize,...){
  conn<-NULL
  function(reset=FALSE){
    if(reset){
      if(!is.null(conn)) close(conn)
      conn<<-file(filename,open="r")
    } else{
      rval<-read.table(conn, nrows=chunksize,...)
      if ((nrow(rval)==0)) {
        close(conn)
        conn<<-NULL
        rval<-NULL
      }
      return(rval)
    }
  }
}
load("ti.RData")
I then take my data frame (called ti) and write it to a table
write.table(ti,"data1.txt",row.names=FALSE,col.names=FALSE)
as in the example here http://www.inside-r.org/packages/cran/speedglm/docs/shglm. Afterwards
da<-make.data("data1.txt",chunksize=10000,col.names=colnames(ti))
rm(ti)
b1<-shglm(T2D~factor(SIBCO)+factor(POCOD),datafun=da,family=binomial())
But I get an error
Error in dev.resids(y, mu, weights) :
argument mu must be a numeric vector of length 1 or length 802
I am happy to upload my data set, but can somebody maybe roughly tell me where to start debugging? I think that when data1.txt is read in through the link function (with read.table), some factors in the original data frame are converted to integers. This is the reason I put factor() around the variables. Any suggestion would be very helpful.
The short answer is that there is probably something wrong with your input data. Without the input data it is hard to say, but based on my experience running shglm for a binomial glm with factors, this is where I would start.
As a general debugging strategy you can try something like the following:
add the lines debug(shglm) and options(error=recover) to your script
turn on the trace=T option for shglm
start R and load your script as source("myscript.R")
step through the debugger and use ls() to see the variables currently present and inspect them with dim() colnames() etc.
Now in my experience shglm returns rather cryptic error messages that may change depending on the size of your input chunks (as this changes the data and the factors the model knows about). Below I list a couple of things to check in your data and some common errors that I encountered while getting it to work which may help you to get your own model running.
Regarding the data, make sure that:
The dependent variable is 0/1 or a proportion 0 <= y <= 1. (If you have successes and failures, you can use the weights parameter to give the total number of tries and calculate the proportion in the formula, i.e., success/(success + failures).) Common errors are:
Error in if (any(y < 0 | y > 1)) stop("y values must be 0 <= y <= 1") :
missing value where TRUE/FALSE needed
Calls: shglm -> eval -> eval
Specify all the levels of the factors (don't forget default values) and make sure that they are sorted, i.e., factor(age, levels = c("24andbelow", "25to49", "50to74", "75andover")), otherwise you will get errors like:
Error in crossprod(weights, y) : non-conformable arguments
Calls: shglm -> crossprod -> crossprod
Error in XTX[rownames(Ax), colnames(Ax)] : subscript out of bounds
Calls: shglm
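To illustrate why explicit levels matter when data arrive in chunks, the following sketch compares the design matrices built from two made-up chunks. The data frames chunk1 and chunk2 and the vector age_levels are hypothetical; the point is only that factor() without levels derives them from whatever values happen to be in the current chunk.

```r
age_levels <- c("24andbelow", "25to49", "50to74", "75andover")

# Two made-up chunks, each missing some of the age groups.
chunk1 <- data.frame(age = c("24andbelow", "25to49", "75andover"))
chunk2 <- data.frame(age = c("25to49", "50to74", "50to74"))
chunks <- list(chunk1, chunk2)

# Levels inferred per chunk: the number of design-matrix columns differs.
cols_implicit <- sapply(chunks, function(d) {
  ncol(model.matrix(~ factor(age), data = d))
})

# Levels fixed up front: every chunk yields the same columns.
cols_explicit <- sapply(chunks, function(d) {
  ncol(model.matrix(~ factor(age, levels = age_levels), data = d))
})
```

Here cols_implicit differs between the two chunks while cols_explicit is the same for both, which is the property a chunk-wise fit needs.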
Now I did not get your specific error but something close enough that I thought I should mention. Here I tried to supply a formula with two columns (for successes and failures as you can in regular glm), i.e., cbind(success, failures)~factor(var1) + factor(var2)
Error in dev.resids(y, mu, weights) :
argument wt must be a numeric vector of length 1 or length 10
Calls: shglm -> dev.resids
I guess the main takeaway is to check your input data.
I tried to cluster my data following the skmeans package's manual page.
I started by installing all required packages.
I then imported my data table, and made a matrix out of it with:
x <- as.matrix(x)
# See dimensions
dim(x)
[1] 184 4000
When I try to hard partition my data into 5 clusters - as it is done in the manual's first example - like so:
hparty <- skmeans(x, 5, control = list(verbose = TRUE))
I receive the following error message:
Error in if (!all(row_norms(x) > 0)) stop("Zero rows are not allowed.") :
missing value where TRUE/FALSE needed
And when I just type:
test <- skmeans(x, 5)
I get:
Error in skmeans(x, 5) : Zero rows are not allowed.
I'm trying to figure out where this error is coming from, and why the function can't get a TRUE/FALSE value. Has anyone ever experienced this problem?
Thank you in advance!
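Spherical k-means is k-means where every vector is normalized to length 1.
If you have a constant 0 vector, this is not possible, and you cannot use spherical k-means (or cosine similarity).
!all(row_norms(x) > 0)
is the check that fails when some row has norm 0. The "missing value where TRUE/FALSE needed" variant of the error means the comparison itself evaluated to NA, which suggests that x also contains missing values.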
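A base R sketch for locating the offending rows is shown below. It computes Euclidean row norms directly instead of calling the package's internal row_norms(), and the small matrix x is made up for illustration.

```r
# Toy matrix: row 2 is all zeros, row 3 contains a missing value.
x <- rbind(c(1, 2, 0),
           c(0, 0, 0),
           c(3, NA, 1),
           c(0, 4, 5))

# Euclidean norm of every row; any NA in a row makes its norm NA.
norms <- sqrt(rowSums(x^2))

zero_rows <- which(norms == 0)     # rows that cannot be normalised
na_rows   <- which(is.na(norms))   # rows with missing values

# Keep only rows with a finite, strictly positive norm.
x_clean <- x[!is.na(norms) & norms > 0, , drop = FALSE]
```

Whether dropping those rows is appropriate depends on the data; imputing the missing values is another option.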
I am trying to run this line:
knn(mydades.training[,-7],mydades.test[,-7],mydades.training[,7],k=5)
but I always get this error:
Error in knn(mydades.training[, -7], mydades.test[, -7], mydades.training[, :
NA/NaN/Inf in foreign function call (arg 6)
In addition: Warning messages:
1: In knn(mydades.training[, -7], mydades.test[, -7], mydades.training[, :
NAs introduced by coercion
2: In knn(mydades.training[, -7], mydades.test[, -7], mydades.training[, :
NAs introduced by coercion
Any ideas, please?
PS: mydades.training and mydades.test are defined as follows:
N <- nrow(mydades)
permut <- sample(c(1:N),N,replace=FALSE)
ord <- order(permut)
mydades.shuffled <- mydades[ord,]
prop.train <- 1/3
NOMBRE <- round(prop.train*N)
mydades.training <- mydades.shuffled[1:NOMBRE,]
mydades.test <- mydades.shuffled[(NOMBRE+1):N,]
I suspect that your issue lies in having non-numeric data fields in 'mydades'. The error line:
NA/NaN/Inf in foreign function call (arg 6)
makes me suspect that the knn-function call to the C language implementation fails. Many functions in R actually call underlying, more efficient C implementations, instead of having an algorithm implemented in just R. If you type just 'knn' in your R console, you can inspect the R implementation of 'knn'. There exists the following line:
Z <- .C(VR_knn, as.integer(k), as.integer(l), as.integer(ntr),
as.integer(nte), as.integer(p), as.double(train), as.integer(unclass(clf)),
as.double(test), res = integer(nte), pr = double(nte),
integer(nc + 1), as.integer(nc), as.integer(FALSE), as.integer(use.all))
where .C means that we're calling a C function named 'VR_knn' with the provided function arguments. Since you get two of the warnings
NAs introduced by coercion
I think two of the as.double/as.integer calls fail, and introduce NA values. If we start counting the parameters, the 6th argument is:
as.double(train)
that may fail in cases such as:
# as.double can not translate text fields to doubles, they are coerced to NA-values:
> as.double("sometext")
[1] NA
Warning message:
NAs introduced by coercion
# while the following text is cast to double without an error:
> as.double("1.23")
[1] 1.23
You get two of the coercion warnings, which are probably raised by 'as.double(train)' and 'as.double(test)'. Since you did not provide exact details of what 'mydades' contains, here are some of my best guesses (using artificial multivariate normal data):
library(MASS)
mydades <- mvrnorm(100, mu=c(1:6), Sigma=matrix(1:36, ncol=6))
mydades <- cbind(mydades, sample(LETTERS[1:5], 100, replace=TRUE))
# This breaks knn
mydades[3,4] <- Inf
# This breaks knn
mydades[4,3] <- -Inf
# These, however, do not introduce the coercion for NA-values error message
# This breaks knn and gives the same error; just some raw text
mydades[1,2] <- mydades[50,1] <- "foo"
mydades[100,3] <- "bar"
# ... or perhaps wrongly formatted exponential numbers?
mydades[1,1] <- "2.34EXP-05"
# ... or wrong decimal symbol?
mydades[3,3] <- "1,23"
# should be 1.23, as R uses '.' as decimal symbol and not ','
# ... or most likely a whole column is non-numeric, since the error is given twice (as.double problem both in training AND test set)
mydades[,1] <- sample(letters[1:5],100,replace=TRUE)
I would not keep both the numeric data and class labels in a single matrix, perhaps you could split the data as:
mydadesnumeric <- mydades[,1:6] # 6 first columns
mydadesclasses <- mydades[,7]
Using calls
str(mydades); summary(mydades)
may also help you/us in locating the problematic data entries and correct them to numeric entries or omitting non-numeric fields.
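As a complement to str() and summary(), the entries that would become NA, NaN or Inf can be located directly. This is only a sketch on a made-up character matrix m; with the real data you would apply it to the feature columns of 'mydades'.

```r
# Made-up character matrix with three problematic entries.
m <- cbind(c("1.5", "foo", "2.0"),
           c("1,23", "4", "Inf"))

# Coerce to numeric (warnings suppressed on purpose) and restore the shape,
# because as.numeric() drops the dim attribute.
as_num <- suppressWarnings(as.numeric(m))
dim(as_num) <- dim(m)

# Row/column positions of everything that is not a finite number.
bad <- which(!is.finite(as_num), arr.ind = TRUE)

# The original text of the offending cells.
bad_values <- m[bad]
```

The bad matrix gives one row per problematic cell, so the entries can be corrected or the affected rows dropped before calling knn.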
The rest of the run code (after breaking the data), as provided by you:
N <- nrow(mydades)
permut <- sample(c(1:N),N,replace=FALSE)
ord <- order(permut)
mydades.shuffled <- mydades[ord,]
prop.train <- 1/3
NOMBRE <- round(prop.train*N)
mydades.training <- mydades.shuffled[1:NOMBRE,]
mydades.test <- mydades.shuffled[(NOMBRE+1):N,]
# 7th column seems to be the class labels
knn(train=mydades.training[,-7],test=mydades.test[,-7],mydades.training[,7],k=5)
Great answer by @Teemu.
As this is a well-read question, I will give the same answer from an analytics perspective.
The KNN function classifies data points by calculating the Euclidean distance between the points. That's a mathematical calculation requiring numbers. All variables in KNN must therefore be coerce-able to numerics.
The data preparation for KNN often involves three tasks:
(1) Fix all NA or "" values
(2) Convert all factors into a set of booleans, one for each level in the factor
(3) Normalize the values of each variable to the range [0, 1] so that no variable's range has an unduly large impact on the distance measurement.
I would also point out that the function seems to fail when using integers. I needed to convert everything to the "num" type before calling the knn function. This includes the target feature, for which most methods in R use the factor type. Thus, as.numeric(my_frame$target_feature) is required.
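The three steps could be sketched in base R roughly as follows. The data frame df, its columns and the helper rescale01 are all made up for illustration, and dropping incomplete rows is just one possible way of handling step (1).

```r
df <- data.frame(
  height = c(150, 160, NA, 180),
  colour = factor(c("red", "blue", "red", "green")),
  target = factor(c("a", "b", "a", "b"))
)

# (1) Handle missing values; here rows with any NA are simply dropped.
df <- df[complete.cases(df), ]

# (2) One 0/1 indicator column per factor level (no intercept column).
dummies <- model.matrix(~ colour - 1, data = df)

# (3) Min-max scale numeric predictors to [0, 1].
# Hypothetical helper; it divides by zero if a column is constant.
rescale01 <- function(v) (v - min(v)) / (max(v) - min(v))

features <- cbind(height = rescale01(df$height), dummies)
labels <- df$target
```

The resulting features matrix is fully numeric and could then be split into training and test rows and passed to knn together with the matching labels.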