large rowSums() results in Inf ? large number problem in R

I have a large data matrix and I want to calculate the sum of each row. Using rowSums() returns Inf for the sums because (presumably) the numbers are too large.
So I tried using Brobdingnagian numbers (from the Brobdingnag package, function as.brob) to deal with the large numbers, but that is not working. Here is an example of what I did with the mtcars example dataset:
library(dplyr)
library(Brobdingnag)
mtcars <- data.matrix(mtcars)
mtcars.rowsum <- mtcars %>% as.brob(.) %>% rowSums(.)
Error in h(simpleError(msg, call)) :
Error argument 'x' during method selection for function 'rowSums':
invalid class “brob” object: invalid object for slot "positive" in class "brob":
got class "matrix", should be or extend class "logical"
Passing TRUE or FALSE via brob(., positive = ) results in an "unused argument" error.
How can I handle very large numbers with rowSums() in R? How do I use as.brob on a data matrix?
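For illustration, here is a minimal base R sketch of one possible workaround that does not rely on as.brob: computing the row sums in log space with the log-sum-exp identity. It assumes every entry of the matrix is strictly positive, and the matrix m and the name log_row_sums are hypothetical, not taken from the question.

```r
# Hypothetical matrix: the first row sums to more than .Machine$double.xmax
m <- matrix(c(1e308, 1e308,
              1e10,  2e10), nrow = 2, byrow = TRUE)

plain_sums <- rowSums(m)  # the first element overflows to Inf

# log-sum-exp per row: log(sum(x)) = mx + log(sum(exp(log(x) - mx))),
# where mx is the largest log value in the row (assumes all x > 0)
log_row_sums <- apply(log(m), 1, function(lx) {
  mx <- max(lx)
  mx + log(sum(exp(lx - mx)))
})
```

The result is the natural log of each row sum, so it stays finite even when the sum itself cannot be represented as a double. Whether a log-scale result is usable depends on what the sums are needed for.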

Related

Convert numerical variable to categorical variable

I have a list of columns that contain 0 and 1 as values. Right now they are treated as numerical variables but I want them to be treated as categorical.
I tried
as.factor(df[,"diseasesA":"diseaseM"], exclude = NULL)
but received the following error message:
Error in as.factor(df[,"diseasesA":"diseaseM"], :
unused argument (exclude = NULL)
Omitting "exclude = NULL" gave me the following error message:
Error in "diseasesA":"diseaseM" : NA/NaN argument
In addition: Warning messages:
1: In eval(jsub, setattr(as.list(seq_along(x)), "names", names_x), :
NAs introduced by coercion
2: In eval(jsub, setattr(as.list(seq_along(x)), "names", names_x), :
NAs introduced by coercion

factor() or as.factor() works on a single column, not a data frame. So you need to apply that function to the columns you want to convert. Here are a few equivalent methods:
cols = paste0("disease", LETTERS[1:13]) # assuming your naming pattern is consistent
## base R with lapply
df[cols] = lapply(df[cols], factor)
## base R with for loop (index by column name, not by position)
for (col in cols) {
  df[[col]] = factor(df[[col]])
}
## dplyr
library(dplyr)
df = df %>%
mutate(across(diseaseA:diseaseM, factor))
I will note that your question is inconsistent in its column naming pattern: diseases vs. disease. In the base R methods I assumed that's a typo and further assumed you wanted to convert columns diseaseA, diseaseB, diseaseC, ..., diseaseM. In dplyr we can use across() with X:Z to operate on all columns from X through Z, but there are many other ways to select which columns to work on, e.g., starts_with("disease").
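As a quick check of the lapply approach, here is a small self-contained sketch. The data frame below is made up for illustration (only three disease columns plus an id column), so the column names and values are assumptions rather than the asker's data.

```r
# Hypothetical data frame with three 0/1 indicator columns
df <- data.frame(
  id       = 1:4,
  diseaseA = c(0, 1, 0, 1),
  diseaseB = c(1, 1, 0, 0),
  diseaseC = c(0, 0, 0, 1)
)

cols <- paste0("disease", LETTERS[1:3])

# Convert only the selected columns; other columns are left untouched
df[cols] <- lapply(df[cols], factor)

# Inspect the resulting class of every column
col_classes <- sapply(df, class)
```

Inspecting col_classes (or running str(df)) shows which columns were converted.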

Why is any() only defined for a numeric and not logical data.frame?

This appears to be quite surprising:
df1 <- data.frame(A=TRUE, B=FALSE)
df2 <- data.frame(A=1, B=2)
> any(df1)
Error in FUN(X[[i]], ...) :
only defined on a data frame with all numeric variables
> any(df2)
[1] TRUE
This doesn't seem to be a bug because the error correctly states that any() will only work in the case where all variables within a data.frame are numeric.
But what is the reason for any() to work on all numeric variables and not when values are all logical?

any() works when it is given a vector, as the documentation says:
Given a set of logical vectors, is at least one of the values true?
In the OP's post, neither example is a vector. The first is a data.frame with logical columns. To satisfy the documentation, i.e. to create a logical vector, either convert it to a matrix (a matrix is just a vector with a dim attribute)
any(as.matrix(df1))
#[1] TRUE
Or change it to a vector by unlisting the list (a data.frame is a list of vectors aka columns of same length)
any(unlist(df1))
In the second case, there is a warning and it is doing some coercing
any(df2)
#[1] TRUE
#Warning message:
#In any(c(1, 2), na.rm = FALSE) : coercing argument of type 'double' to logical
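If the goal is a per-column answer rather than one overall value, a small variation on the unlist approach is to apply any() to each column. This is a sketch using the df1 from the question; the names any_overall and any_by_col are made up for illustration.

```r
df1 <- data.frame(A = TRUE, B = FALSE)

# One overall answer: flatten the data frame into a single logical vector
any_overall <- any(unlist(df1))

# One answer per column: a data frame is a list of columns,
# so vapply() can call any() on each column in turn
any_by_col <- vapply(df1, any, logical(1))
```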

How to calculate an overall mean from more than two columns in a data frame?

I would like to get a single mean value from selected columns of a data frame, but it doesn't work with two or more columns. I tried this:
testDF <- data.frame(v1 = c(1,3,15,7,18,3,5,NA,4,5,7,9),
v2 = c(11,33,55,7,88,33,55,NA,44,5,67,99),
v3 = c(NA,33,5,77,88,3,55,NA,4,55,87,14))
mean(testDF[,2:3], na.rm=T)
and I get this warning message:
mean(testDF[,2:3], na.rm=T)
[1] NA
Warning message:
In mean.default(testDF[, 2:3], na.rm = T) :
argument is not numeric or logical: returning NA
If I use the sum() function it works perfectly, but I don't understand why it doesn't work with the mean() function. After some steps I managed it with the melt() function from the reshape2 package, but I'm looking for a shorter, simpler way because I have a lot of variables and data.
Regards

The help for mean says:
Currently there are methods for numeric/logical vectors and date, date-time and time interval objects.
which makes me think that mean does not work on data frames.
Indeed, you will see that mean(testDF) produces the same warning, but mean(testDF[,1]) works.
The easiest solution is to do:
mean(as.matrix(testDF[,2:3]), na.rm=TRUE)
Also, you can use colMeans to get the mean of each column.
Indeed, if you look at the source for colMeans, the first lines are:
if (is.data.frame(x))
x <- as.matrix(x)
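The options above can be put side by side in a short sketch using the testDF from the question. The variable names overall_matrix, overall_unlist and per_column are made up for illustration.

```r
testDF <- data.frame(
  v1 = c(1, 3, 15, 7, 18, 3, 5, NA, 4, 5, 7, 9),
  v2 = c(11, 33, 55, 7, 88, 33, 55, NA, 44, 5, 67, 99),
  v3 = c(NA, 33, 5, 77, 88, 3, 55, NA, 4, 55, 87, 14)
)

# Single overall mean across columns 2 and 3, ignoring NAs
overall_matrix <- mean(as.matrix(testDF[, 2:3]), na.rm = TRUE)

# Equivalent: flatten the selected columns into one numeric vector first
overall_unlist <- mean(unlist(testDF[, 2:3]), na.rm = TRUE)

# One mean per column instead of a single overall value
per_column <- colMeans(testDF[, 2:3], na.rm = TRUE)
```

Note that the overall mean is not, in general, the mean of the per-column means when the columns have different numbers of missing values, as v2 and v3 do here.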

R function applied on data frame grouped by multiple factors

I have a data frame called subdata with dimensions 10299 x 81. Column 1 is called "Subject" and column 2 is called "Activity". I want to calculate the average of each column grouped by "Subject" and "Activity".
Here are the functions I tried; none of them worked at first. Finally I used the colwise(mean) function, which seems to work. I am new to R and have just learned the sapply, lapply and tapply functions, and it seems the mean function works on columns.
Can anyone help me understand what these error and warning messages mean, and whether there is a way to make these functions work?
Use lapply function:
newdata<- subdata[, lapply(.SD, mean), by = c("Subject","Activity")]
The error message:
Error in `[.data.frame`(subdata, , lapply(.SD, mean), by = c("Subject", :
unused argument (by = c("Subject", "Activity"))
Use by function:
newdata<-by(subdata, list(subdata$Subject, subdata$Activity), mean)
I got warning message:
Warning messages:
1: In mean.default(data[x, , drop = FALSE], ...) :
argument is not numeric or logical: returning NA
Then I tried ddply in plyr package
ddply(subdata, .(Subject, Activity), mean)
I got the same warning message:
Warning messages:
1: In mean.default(piece, ...) : argument is not numeric or logical: returning NA
Finally I used the colwise(mean) function, which seems to work:
newdata<-ddply(subdata, .(Subject, Activity), colwise(mean))

It is somewhat difficult to be certain without a representative sample of your dataset. Let's create some data to work with.
# Create some random demo data
subdata <- data.frame(Subject = rep(seq(5), each=4),
Activity = rep(LETTERS[1:2], 10), v1=rnorm(20), v2=rnorm(20))
Your first attempt uses data.table syntax (.SD and by = inside [ ]), which only works when subdata is a data.table. On a plain data.frame, [ has no by argument, hence the "unused argument" error. Unless you convert subdata to a data.table, you should abandon this attempt.
Your by statement gives a warning about non-numeric data. This is because by passes each group to the function as a whole data frame, and mean does not accept a data frame. You need to provide only the columns to be analyzed, then the indices (i.e. your factor columns), and use a column-wise function such as colMeans.
by(subdata[,-c(1,2)], list(subdata$Subject, subdata$Activity), function(x) colMeans(x))
You would probably want to rbind this output and reassign row names to correspond to the groups, though. For this purpose it may be best to just use something like aggregate to avoid such extra work.
aggregate(subdata[,-c(1,2)], list(subdata$Subject, subdata$Activity), mean)
Your ddply statements are close, but you should use numcolwise to summarize over your numeric columns only.
library(plyr)
# summarize over all numeric columns
ddply(subdata, .(Subject, Activity), numcolwise(mean))
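For a version that runs without plyr, here is a self-contained base R sketch using the formula interface of aggregate on the same kind of demo data. The seed value and the name group_means are arbitrary choices made for illustration.

```r
set.seed(1)  # arbitrary seed so the demo data is reproducible

# Demo data: 5 subjects x 2 activities, two numeric measurement columns
subdata <- data.frame(
  Subject  = rep(seq(5), each = 4),
  Activity = rep(LETTERS[1:2], 10),
  v1 = rnorm(20),
  v2 = rnorm(20)
)

# Mean of v1 and v2 within every Subject/Activity combination
group_means <- aggregate(cbind(v1, v2) ~ Subject + Activity,
                         data = subdata, FUN = mean)
```

The result is an ordinary data frame with one row per Subject/Activity group, so no rbind or row name cleanup is needed. With many measurement columns, `. ~ Subject + Activity` can stand in for the explicit cbind().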

Basic frequency value, split by group

I'm trying to get a length(x) value for a variable split by another variable.
aggregate(ssfia$Correlation_Abs,ssfia$Disorder,length,na.rm=TRUE)
However I get an error message:
>Error in FUN(X[[1L]], ...) :
2 arguments passed to 'length' which requires 1
It appears that length(x) can't be used there like "mean" or "sd" can. Is there a function that can count rows nested in an aggregate?
Thanks in advance!

The error arises because aggregate passes na.rm=TRUE on to length(), which accepts only one argument. What about the following?
Split the first vector into subclasses determined by the second one:
cls <- split(ssfia$Correlation_Abs, ssfia$Disorder)
Count how many non-NA observations fall into each subclass:
sapply(cls, function(dat) sum(!is.na(dat)))
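The two steps above can be tried on a small made-up data set. The ssfia values below are hypothetical, and the tapply line is an alternative one-step sketch rather than part of the original answer.

```r
# Hypothetical stand-in for the asker's data
ssfia <- data.frame(
  Disorder        = c("A", "A", "B", "B", "B"),
  Correlation_Abs = c(0.1, NA, 0.3, 0.4, NA)
)

# Split, then count the non-NA values in each group
cls <- split(ssfia$Correlation_Abs, ssfia$Disorder)
counts <- sapply(cls, function(dat) sum(!is.na(dat)))

# Alternative: sum the "is not NA" flags per group in a single call
counts_tapply <- tapply(!is.na(ssfia$Correlation_Abs), ssfia$Disorder, sum)
```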
