dplyr data frame tbl: incorrect reported column length

I came across a problem with dplyr, which caused an error message when I used it in a survival analysis. The root cause turned out to be that when a variable in a grouped data frame (or any object with class tbl_df) is referred to using [ , ] notation, it always reports a length of 1, even when the real length is greater. Using the $x notation reports the correct length.
With a data frame, the following return the expected length of 32:
length(mtcars$mpg)
length(mtcars[ , "mpg"])
With a grouped data frame, the $ notation returns 32, while all the rest using [] notation return a length of 1:
foo <- mtcars %>% group_by(cyl)
length(foo$mpg)
length(foo[ , "mpg"])
length(foo[ , 1])
VarName <- "mpg"
length(foo[ , VarName])
It is just the reported length that is incorrect; the data itself is all there, i.e.:
head(foo[ , "mpg"])
The incorrect reported length leads to an error message in functions such as Surv(), which presumably include a length() check. This is obviously a very simplified example to illustrate; in the failing program I was using [ , VarName] notation inside a function to refer to a variable column. The workaround is simply to convert the data from the offending tbl_df format to an ordinary data frame within the function. Can anyone shed any light on why this happens? It might save others wasting as much time as I have!
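For what it's worth, a likely explanation, sketched below; this reflects tibble's documented subsetting behaviour rather than anything stated in the thread: single-bracket [ subsetting of a tbl_df never drops down to a vector, so foo[ , "mpg"] is itself a one-column tibble, and length() on a data frame counts columns, not rows.

```r
library(dplyr)

foo <- mtcars %>% group_by(cyl)

# Single-bracket subsetting keeps the tibble class, so length() sees columns:
class(foo[, "mpg"])   # a tbl_df / data frame, not a numeric vector
length(foo[, "mpg"])  # 1 (one column)

# $ and [[ extract the underlying vector, so length() sees rows:
length(foo$mpg)       # 32
length(foo[["mpg"]])  # 32
```

So inside a function that receives the column name as a string, foo[[VarName]] (or converting to a plain data frame, as in the workaround above) avoids the problem.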

sum(births) : invalid 'type' (character) of argument

Hi everyone,
I am using a sample data in RStudio. I used the code below:
njnew <- nj %>%
  group_by(NAME_2) %>%
  summarise(Num.totalbirths = sum(births),
            Num.totalvulnerable = sum(vulnerable)) %>%
  mutate(percent.potentailcase = potentialcase/Num.totalpotentialcase,
         percent.vulerablecase = vulnerable/Num.vulnerablecase)
I get after running:
Error in sum(births) : invalid 'type' (character) of argument
My dataset is a csv, but I manually added/filled in 2 additional columns (births, vulnerable).
Could you kindly let me know how this error may have happened?
Judging from the error message, it looks like births is of type character. However, you can only compute the sum of numeric, complex or logical vectors. This likely happened when you manually added the column after reading in the csv.
You can double-check the type of the variable with class(nj$births), which probably returns character. Try converting your variable(s) with as.numeric(). You may need to repeat that process for other variables (such as vulnerable) which you manually added, e.g.:
nj <- nj %>%
  mutate(births = as.numeric(births),
         vulnerable = as.numeric(vulnerable))
Then your code should work fine.
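As a minimal illustration with invented data (the column names follow the question; the numbers are made up):

```r
library(dplyr)

# births and vulnerable were typed in by hand and read as character:
nj <- data.frame(NAME_2 = c("A", "A", "B"),
                 births = c("10", "12", "8"),
                 vulnerable = c("1", "2", "1"),
                 stringsAsFactors = FALSE)

class(nj$births)   # "character"
# sum(nj$births)   # Error in sum(births) : invalid 'type' (character) of argument

nj <- nj %>%
  mutate(births = as.numeric(births),
         vulnerable = as.numeric(vulnerable))

sum(nj$births)     # 30
```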

Error in huge R package when criterion "stars"

I am trying to do an association network using some expression data I have, the data is really huge: 300 samples and ~30,000 genes. I would like to apply a Gaussian graphical model to my data using the huge R package.
Here is the code I am using
dim(data)
#[1] 317 32291
huge.out <- huge.npn(data)
huge.stars <- huge.select(huge.out, criterion="stars")
However in this last step I got an error:
Conducting Subsampling....in progress:10%
Error in cor(x) :
  Missing values present in input variable 'x'. Consider using use = 'pairwise.complete.obs'
Any help would be very appreciated.
You posted this exact question on R-help today. Both SO and R-help discourage cross-posting, but if you do choose to switch venues it is at the very least courteous to inform the readership.
You responded to the suggestion here on SO that there were missing data in your data-object named 'data' by claiming there were no missing data. So what does this code return:
lapply(data , function(x) sum(is.na(x)))
That would be a first-level check, but the error could also be caused by a later step that encountered a missing value in the matrix of correlation coefficients computed from 'huge.out'. That could happen if there were a) infinities in the calculations, or b) a column that is constant:
> cor(c(1:10,Inf), 1:11)
[1] NaN
> cor(rep(2,7), rep(2,7))
[1] NA
Warning message:
In cor(rep(2, 7), rep(2, 7)) : the standard deviation is zero
So the next check is:
sum( is.na(huge.out) )
That will at least give you some basis for defending your claim of no missings and will also give you a plausible theory as to the source of the error. To locate a column that is entirely constant you might do something like this (assuming it is a data frame):
which(sapply(data, function(x) length(unique(x))) == 1)
If it's a matrix, use apply over the columns instead:
which(apply(data, 2, function(x) length(unique(x))) == 1)
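Putting those checks together on a toy matrix (the setup is invented; data here stands in for the object passed to huge.npn):

```r
set.seed(1)
data <- matrix(rnorm(50), nrow = 10, ncol = 5)
data[, 3] <- 2   # a constant column, to provoke the NA in cor()

sum(is.na(data))       # 0: no missing values in the raw data...
cc <- suppressWarnings(cor(data))
sum(is.na(cc)) > 0     # TRUE: ...yet the correlation matrix has NAs

# Locate the constant column(s):
which(apply(data, 2, function(x) length(unique(x))) == 1)  # 3
```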

What is the difference between these data frame assignments?

I have a data frame that looks like so:
pid tid pname
2 NA proc/boot/procnto-smp-instr
Now if I do this, I expect nothing to happen:
y[c(FALSE), "pid"] <- 10
And nothing happens (y did not change). However, if I do this:
y[c(FALSE), ]$pid <- 10
I get:
Error in `$<-.data.frame`(`*tmp*`, "pid", value = 10) :
  replacement has 1 rows, data has 0
So my question is, what's the difference between [, "col"]<- and $col<-? Why does one throw an exception? And bonus: where in the docs can I read more about this?
The error comes from the code of $<-.data.frame which checks if the original data.frame is at least as many rows as the length of the replacement vector:
nrows <- .row_names_info(x, 2L)
if (!is.null(value)) {
    N <- NROW(value)
    if (N > nrows)
        stop(sprintf(ngettext(N, "replacement has %d row, data has %d",
                              "replacement has %d rows, data has %d"), N, nrows),
             domain = NA)
}
`[<-` is a different function, which does not perform this check. It is a primitive function, which you can read more about in the R Internals manual.
First of all, these operations are performed by two very different functions:
y[FALSE, 'pid'] <- 10 is a call to the `[<-.data.frame` function, while
y[FALSE, ]$pid <- 10 is a call to the `$<-.data.frame` function; the error message gives you this clue. Just how different they are you can see by typing their names (with backquotes, as above). In this particular case, though, they are intended to behave the same way, and they normally do: try y[1, 'pid'] <- 1:3 vs y[1, ]$pid <- 1:3. Your case is "special" in that y[FALSE, ] returns a "strange" object: a data frame with 0 rows and three columns. IMHO, throwing the exception is correct behaviour, and the silent assignment in `[<-.data.frame` is a minor bug, but the language developers' opinion on this subject is more important. If you want to see for yourself where the difference is, type debug(`[<-.data.frame`) and run your example.
The answer to your "bonus" question is to type ?`[<-.data.frame` and read, though it is very, very dry :(. Best.
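To see the two paths side by side, a sketch using the data frame from the question:

```r
y <- data.frame(pid = 2, tid = NA,
                pname = "proc/boot/procnto-smp-instr",
                stringsAsFactors = FALSE)

# `[<-.data.frame`: assigning to zero selected rows is a silent no-op.
y[FALSE, "pid"] <- 10
y$pid                      # still 2

# y[FALSE, ] is a 0-row, 3-column data frame:
nrow(y[FALSE, ])           # 0

# `$<-.data.frame`: a length-1 replacement against 0 rows fails its check.
res <- try(y[FALSE, ]$pid <- 10, silent = TRUE)
inherits(res, "try-error") # TRUE
```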

input 'data' is not double type?

While programming in R, I'm continuously facing the following error:
Error in data.validity(data, "data") :
  Bad usage: input 'data' is not double type.
Can anyone please explain why this error is happening, i.e. the reasons in the dataset which cause the error to arise?
Here is the code I'm running. The packages I have loaded are cluster, psych and clv.
data1 <- read.table(file='dataset.csv', sep=',', header=T, row.names=1)
data1.p <- as.matrix(data1)
hello.data <- data1.p[,1:15]
agnes.mod <- agnes(hello.data)
v.pred <- as.integer(cutree(agnes.mod,3)) # "cut" the tree
scatt <- clv.Scatt(hello.data, v.pred)
Error in data.validity(data, "data") :
Bad usage: input 'data' is not double type.
The key part of data.validity() raising the error is:
data = as.matrix(data)
if( !is.double(data) )
stop(paste("Bad usage: input '", name, "' is not double type.", sep=""))
data is converted to a matrix and then checked via is.double() to see whether it is a numeric matrix. If it isn't numeric, the clause is true and the error is raised. So why isn't your data (hello.data) numeric when converted to a matrix? Either you have character variables in your data or there are factors. Do you have factors? Try
str(hello.data)
Are there any non-numeric variables in there? If you have character data, get rid of it. If you have factors, then data.validity() could have coerced them via data.matrix(), but as it doesn't, try
hello.data <- data.matrix(hello.data)
after the line creating hello.data then run the rest of your code.
Whether this makes sense (treating a nominal or ordinal variable as a simple numeric) is unclear as you haven't provided a reproducible example or explained what your data are etc.
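A small sketch of the failure mode with invented data: a factor column makes as.matrix() fall back to character, while data.matrix() uses the underlying numeric codes instead.

```r
df <- data.frame(x = c(1.5, 2.5, 3.5),
                 g = factor(c("a", "b", "a")))

m <- as.matrix(df)
typeof(m)         # "character": the factor forced the whole matrix to character
is.double(m)      # FALSE, which is exactly what data.validity() rejects

m2 <- data.matrix(df)
is.double(m2)     # TRUE: the factor levels became their integer codes
```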

comparing two integers in R: "longer object length not multiple of shorter object length" ddply

I'm getting a "longer object length is not a multiple of shorter object length" warning in R when comparing two integers to subset a data frame in the midst of a user-defined function.
The user defined function just returns the median of a subset of integers taken from a dataframe:
medianfuncEDB <- function(s) {
  return(median(subset(EDB, as.integer(validSession) == as.integer(s))$absStudentDeviation))
}
(I did not originally have the as.integer coercions in there; I put them in to debug and test, and I'm still getting the warning.)
The specific error I'm getting is:
In as.integer(validSession) == as.integer(s) :
longer object length is not a multiple of shorter object length
I get this warning over 50 times when calling:
mediandf <- ddply(mediandf, .(validSession),
                  transform,
                  grossMed2 = medianfuncEDB(as.integer(validSession)))
The goal is to attach to mediandf a column holding, for each validSession, the median absStudentDeviation of the matching rows in the large data frame EDB.
I have actually double-checked that all values for validSession in both the mediandf data frame and the EDB data frame are integers by testing them with is.integer(validSession).
Furthermore, the command appears to do what I intend (I get a new column in my data frame, with values I have not yet verified), but I want to understand the warning. If medianfuncEDB is being called with an integer as its input, why am I getting a "longer object length is not a multiple of shorter object length" warning when validSession == s is evaluated?
Note that simple function calls, like medianfuncEDB(5) work without any problems, so why do I get warnings when using ddply?
EDIT: I found the problem with the help of Joran's comment. I did not know that transform fed entire vectors into the function. Using validSession[1] instead gave no warnings.
The ddply function already subsets your data frame by validSession. Hence transform is only fed a data frame with all the rows corresponding to a particular validSession.
That is, transform is already being fed subset(mediandf,validSession==s) for each s in unique(mediandf$validSession).
Since you don't have to do any subsetting (ddply takes care of that), all you need to do is:
ddply(mediandf,.(validSession),transform,grossMed2=median(absStudentDeviation))
And then you'll get mediandf back out with a new column grossMed2 with the value you want (so it will be the same value within each unique validSession).
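A toy run of that one-liner (the column names follow the question; the numbers are invented, and plyr is assumed to be installed):

```r
library(plyr)

mediandf <- data.frame(validSession = c(1L, 1L, 2L, 2L, 2L),
                       absStudentDeviation = c(4, 6, 1, 3, 5))

out <- ddply(mediandf, .(validSession), transform,
             grossMed2 = median(absStudentDeviation))

# Each row gets its group's median: 5 for session 1, 3 for session 2.
out$grossMed2   # 5 5 3 3 3
```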
