R function applied on data frame grouped by multiple factors

I have a data frame called subdata with dimensions 10299 x 81. Column 1 is called "Subject" and column 2 is called "Activity". I want to calculate the average of each column, grouped by "Subject" and "Activity".
Here are the functions I tried; none of them worked at first. Finally I used the colwise(mean) function, which seems to work. I am new to R and have just learned the sapply, lapply, and tapply functions, and it seems the mean function works on columns.
Can anyone help explain what these error and warning messages mean, and whether there is a way to make these functions work?
Use lapply function:
newdata<- subdata[, lapply(.SD, mean), by = c("Subject","Activity")]
The error message:
Error in `[.data.frame`(subdata, , lapply(.SD, mean), by = c("Subject", :
unused argument (by = c("Subject", "Activity"))
Use by function:
newdata<-by(subdata, list(subdata$Subject, subdata$Activity), mean)
I got warning message:
Warning messages:
1: In mean.default(data[x, , drop = FALSE], ...) :
argument is not numeric or logical: returning NA
Then I tried ddply from the plyr package:
ddply(subdata, .(Subject, Activity), mean)
I got the same warning message:
Warning messages:
1: In mean.default(piece, ...) : argument is not numeric or logical: returning NA 0
Finally I used the colwise(mean) function, which seems to work:
newdata<-ddply(subdata, .(Subject, Activity), colwise(mean))

It is somewhat difficult to be certain without a representative sample of your dataset. Let's create some data to work with.
# Create some random demo data
subdata <- data.frame(Subject = rep(seq(5), each=4),
Activity = rep(LETTERS[1:2], 10), v1=rnorm(20), v2=rnorm(20))
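Your first attempt uses data.table syntax (.SD and by inside the square brackets), but subdata is a plain data.frame, whose [ method has no by argument. That is what the "unused argument" error is telling you. The call would only work after converting the object, e.g. with data.table::as.data.table(subdata); otherwise abandon this attempt and use one of the approaches below.
Your by statement gives a warning about non-numeric data. This is because by passes each group's rows to the function as a whole data frame, and mean does not accept a data frame. You need to provide only the columns to be analyzed, then the indices (i.e. your factor columns), and use a function that works column by column, such as colMeans.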
by(subdata[,-c(1,2)], list(subdata$Subject, subdata$Activity), function(x) colMeans(x))
However, you would probably want to rbind this output and reassign row names to correspond to the groups. For this purpose it may be best to just use something like aggregate to avoid that extra work.
aggregate(subdata[,-c(1,2)], list(subdata$Subject, subdata$Activity), mean)
Your ddply statements are close, but for the same reason as above you should use numcolwise so that the mean is computed only over your numeric columns.
library(plyr)
# summarize over all numeric columns
ddply(subdata, .(Subject, Activity), numcolwise(mean))
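As a further base R option, aggregate also accepts a formula, which keeps the grouping columns named in the result. This is only a minimal sketch on the demo data from above; set.seed is added here purely to make the example reproducible, and group_means is an illustrative name.

```r
# Recreate the demo data (set.seed added for reproducibility)
set.seed(42)
subdata <- data.frame(Subject = rep(seq(5), each = 4),
                      Activity = rep(LETTERS[1:2], 10),
                      v1 = rnorm(20), v2 = rnorm(20))

# "." means "all remaining columns"; the right-hand side lists the grouping columns
group_means <- aggregate(. ~ Subject + Activity, data = subdata, FUN = mean)
group_means
```

Note that the formula interface drops rows containing any NA by default (na.action = na.omit), so on data with missing values the result can differ from the non-formula call shown earlier.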

Related

Convert numerical variable to categorical variable

I have a list of columns that contain 0 and 1 as values. Right now they are treated as numerical variables but I want them to be treated as categorical.
I tried
as.factor(df[,"diseasesA":"diseaseM"], exclude = NULL)
but received the following error message:
Error in as.factor(df[,"diseasesA":"diseaseM"], :
unused argument (exclude = NULL)
not using "exclude = NULL" gave me the following error message:
Error in "diseasesA":"diseaseM" : NA/NaN argument
In addition: Warning messages:
1: In eval(jsub, setattr(as.list(seq_along(x)), "names", names_x), :
NAs introduced by coercion
2: In eval(jsub, setattr(as.list(seq_along(x)), "names", names_x), :
NAs introduced by coercion
factor() or as.factor() works on a single column, not a data frame. So you need to apply that function to the columns you want to convert. Here are a few equivalent methods:
cols = paste0("disease", LETTERS[1:13]) # assuming your naming pattern is consistent
## base R with lapply
df[cols] = lapply(df[cols], factor)
## base R with for loop (index by column name, not by position)
for (col in cols) {
  df[[col]] = factor(df[[col]])
}
## dplyr
library(dplyr)
df = df %>%
mutate(across(diseaseA:diseaseM, factor))
I will note that your question is inconsistent in its column naming pattern, disease vs diseases. In the base R methods I assumed that's a typo and further assumed you wanted to convert columns diseaseA, diseaseB, diseaseC, ..., diseaseM. In dplyr we can use X:Z inside across() to operate on all columns from X through Z, but there are many other ways to select which columns to work on, e.g., starts_with("disease").
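To see the base R approach end to end, here is a small sketch on made-up data. The data frame and its columns (id, diseaseA to diseaseC) are hypothetical stand-ins for the real data. It also shows one way to select a range of columns by name in base R, since the : operator only works on numbers ("diseasesA":"diseaseM" tries to coerce the strings to numeric, which is where the NA/NaN error comes from).

```r
# Hypothetical demo data: an id column plus three 0/1 indicator columns
df <- data.frame(id = 1:4,
                 diseaseA = c(0, 1, 0, 1),
                 diseaseB = c(1, 1, 0, 0),
                 diseaseC = c(0, 0, 0, 1))

# Turn the name range into a position range, then back into names
first <- which(names(df) == "diseaseA")
last  <- which(names(df) == "diseaseC")
cols  <- names(df)[first:last]

# Convert each selected column to a factor; other columns are untouched
df[cols] <- lapply(df[cols], factor)
sapply(df, class)
```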

Function to impute missing values using mean in R

My tibble:
Data in Excel:
impute <- read_excel(choose.files())
imp <- function(df) {
for(i in 1:ncol(df)){
df[is.na(df[,i]),i] <- mean(df[,i],na.rm = T)
}
}
imp(impute)
Warning messages:
1: In mean.default(df[, i], na.rm = T) :
argument is not numeric or logical: returning NA
2: In mean.default(df[, i], na.rm = T) :
argument is not numeric or logical: returning NA
The above code works fine if impute is a data.frame, but doesn't work if it's a tibble. Could someone please let me know how to change the code so that it works with a tibble?
One of the differences between a data.frame and a tibble is that data frames drop dimensions when possible by default and tibbles don't.
That is, if x is a data frame then x[, i] may or may not be a data frame, depending on i. If i is one value, then x[, i] will just be a vector. If i is a vector with multiple values then x[, i] will be a data frame. This can cause bugs when i is a variable that may or may not have multiple values, because the class may be different (with the fix being to use x[, i, drop = FALSE] to guarantee a data.frame return).
Tibbles seek to address this issue by switching the default drop = TRUE to drop = FALSE, so x[, i] is a tibble, regardless of whether i has length 1 or more.
When calculating the mean, you want df[,i] to be treated as a numeric vector, not a tibble with 1 column, so you need to specify it:
df[[i]] # This is the preferred way to extract a single column
df[, i, drop = TRUE] # this will work too (since tibble version 1.4.1)
This is explained in greater detail in the "Tibbles vs data.frames" section of the Tibbles vignette.
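Putting this together, a possible rewrite of the imputation function could look like the sketch below. It uses [[ ]] throughout, which extracts a plain vector from both data frames and tibbles. Two additions are assumptions on my part rather than something from the question: the function skips non-numeric columns, and it returns the modified object (the original loop did not return anything). The demo uses a plain data.frame so that it runs without extra packages.

```r
# Replace NAs in every numeric column by that column's mean
imp <- function(df) {
  for (i in seq_along(df)) {
    col <- df[[i]]                    # always a vector, for data.frame and tibble alike
    if (is.numeric(col)) {
      col[is.na(col)] <- mean(col, na.rm = TRUE)
      df[[i]] <- col
    }
  }
  df                                  # return the imputed object
}

# Hypothetical demo data
demo <- data.frame(a = c(1, NA, 3), b = c("x", NA, "z"))
imputed <- imp(demo)
imputed
```

Because R copies on modification, the result has to be assigned (for example impute <- imp(impute)); calling imp(impute) alone leaves the original untouched.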

How to calculate an overall mean from more than two columns in a data frame?

I would like to get a single mean value from selected columns in a data frame, but it doesn't work with two or more columns. I tried this:
testDF <- data.frame(v1 = c(1,3,15,7,18,3,5,NA,4,5,7,9),
v2 = c(11,33,55,7,88,33,55,NA,44,5,67,99),
v3 = c(NA,33,5,77,88,3,55,NA,4,55,87,14))
mean(testDF[,2:3], na.rm=T)
and I get this Warning message:
mean(testDF[,2:3], na.rm=T)
[1] NA
Warning message:
In mean.default(testDF[, 2:3], na.rm = T) :
argument is not numeric or logical: returning NA
If I use the sum() function it works perfectly, but I don't understand why it doesn't work with the mean() function. After some steps I managed it with the melt() function from the reshape2 package, but I'm looking for a shorter, simpler way because I have a lot of variables and data.
Regards
The help for mean says:
Currently there are methods for numeric/logical vectors and date, date-time and time interval objects.
which makes me think that mean does not work on data frames.
Indeed you will see that doing mean(testDF) results in the same warning, but mean(testDF[,1]) works.
The easiest solution is to do:
mean(as.matrix(testDF[,2:3]), na.rm=T)
Also, you can use colMeans to get the mean of each column.
Indeed, if you look at the source for colMeans, the first lines are:
if (is.data.frame(x))
x <- as.matrix(x)
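One thing worth checking when choosing between the two is that they answer different questions. The sketch below reuses testDF from the question; the variable names overall, col_means and mean_of_means are just for illustration. The pooled mean treats every non-missing cell equally, while averaging the column means weights each column equally, so the two differ when the columns have different numbers of NAs.

```r
testDF <- data.frame(v1 = c(1,3,15,7,18,3,5,NA,4,5,7,9),
                     v2 = c(11,33,55,7,88,33,55,NA,44,5,67,99),
                     v3 = c(NA,33,5,77,88,3,55,NA,4,55,87,14))

# Pooled mean over all non-missing cells of v2 and v3
overall <- mean(as.matrix(testDF[, 2:3]), na.rm = TRUE)

# Per-column means, and the (different) mean of those means
col_means <- colMeans(testDF[, 2:3], na.rm = TRUE)
mean_of_means <- mean(col_means)

overall
col_means
mean_of_means
```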

Mean function in R for data in csv file

I tried searching for the error I am getting while using the "mean" function in R 3.1.2.
Purpose: Calculate Mean of datasets
Used Functions: sapply, summary to calculate mean as shown below:
sapply(data,mean,na.rm=TRUE)
summary(data)
Problem Faced: Now, I am trying to use "mean" function to calculate mean from complete dataset. I used the function like this:
> testingnew <-data[complete.cases(data),]
> mean(testingnew)
Popped Warning :
[1] NA
Warning message:
In mean.default(testingnew) :
argument is not numeric or logical: returning NA
Question: Can someone please tell me why this warning appears? I tried to remove NAs (missing values) using complete.cases.
# To eliminate rows with missing values (same result as data[complete.cases(data), ]):
testingnew <- na.omit(data)
#Choose a column to calculate the mean:
#Make sure it is numeric or integer
class(testingnew$Col1)
mean(testingnew$Col1, na.rm=TRUE)
Maybe you can try to reproduce this workflow with your own dataset... It seems the only thing missing is referring to individual columns with the mean function, or using sapply as you did before.
Create a dataframe using random values
my.df <- data.frame(x1 = rnorm(n = 200), x2 = rnorm(n=200))
Spread NA's randomly into the df
is.na(my.df) <- matrix(sample(c(TRUE,FALSE), replace= TRUE, size = 400,
prob=c(0.10, 0.90)),
ncol = 2)
For getting means without using complete cases:
mean(my.df$x1, na.rm=TRUE) # mean(my.df[,1], na.rm=TRUE) is equivalent
mean(my.df$x2, na.rm=TRUE) # mean(my.df[,2], na.rm=TRUE) is equivalent
Complete-case approach (if this is what you really need):
my.df.complete <- my.df[complete.cases(my.df),]
Get means for both columns
sapply(X = my.df.complete, FUN = mean)
Get mean from individual columns
mean(my.df.complete$x1)
mean(my.df.complete$x2)
Creating a subset helped:
data3 <-subset(data, !is.na(Ozone))
mean(data3$Ozone)
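The common thread in these answers is that mean() needs a vector, not a whole data frame. If the data frame also contains non-numeric columns, one option is to select the numeric ones first and then compute the column means. This is only a sketch on made-up data; dat and its columns (Ozone, Wind, Site) are hypothetical and stand in for whatever the CSV file contains.

```r
# Hypothetical data with missing values and one non-numeric column
dat <- data.frame(Ozone = c(41, NA, 12, 18),
                  Wind  = c(7.4, 8.0, NA, 11.5),
                  Site  = c("a", "b", "c", "d"))

# Logical vector flagging the numeric columns
num_cols <- vapply(dat, is.numeric, logical(1))

# Mean of each numeric column, ignoring NAs column by column
col_means <- vapply(dat[num_cols], mean, numeric(1), na.rm = TRUE)
col_means
```

Using na.rm = TRUE per column keeps more data than complete.cases, which drops a whole row as soon as any one column is missing.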

Using tapply on data with NAs

I have a data column (Percent.Plant.Parasites) that has some NAs. I want to take the mean of this data sorted by factor "Stage" (ie stage1 Mean=x, stage2 Mean=y, etc). I tried doing this using
tapply(rawdata$Percent.Plant.Parasites, rawdata$Stage, mean)
However, I get NAs because there are NAs in the data. I don't believe there is an na.rm option for tapply (is there?), so I tried to calculate the mean of each individual stage factor using:
mean(subset(rawdata,subset=Stage=="stage1")$Percent.Plant.Parasites, na.rm=TRUE)
to no avail. Instead I got the error:
In mean.default(subset(rawdata, subset = Stage == "Kax")$Percent.Plant.Parasites, :
argument is not numeric or logical: returning NA
However, when I do:
typeof(subset(rawdata,subset=Stage=="Kax")$Percent.Plant.Parasites)
I get integer
Any ideas where I'm going wrong?
Thanks.
Why not just create a new function, call it mean_NA, that simply removes the NAs before calculating the mean and then use that function in tapply? Something like:
mean_NA <- function(v) {
  avg <- mean(v, na.rm = TRUE)
  return(avg)
}
As was commented, make sure that the data you're taking the mean of is numeric/integer and the INDEX is factor(groups). You would use the newly created function like this:
tapply(X = rawdata$Percent.Plant.Parasites, INDEX = rawdata$Stage, mean_NA)
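For what it's worth, a wrapper is not strictly needed: tapply forwards any extra arguments to the function it applies, so na.rm = TRUE can be passed directly. A minimal sketch on made-up data follows; the values in rawdata are hypothetical, and only the column names come from the question.

```r
# Hypothetical data using the column names from the question
rawdata <- data.frame(Stage = c("stage1", "stage1", "stage2", "stage2"),
                      Percent.Plant.Parasites = c(10L, NA, 20L, 40L))

# Extra arguments after FUN are passed on to mean()
stage_means <- tapply(rawdata$Percent.Plant.Parasites, rawdata$Stage,
                      mean, na.rm = TRUE)
stage_means
```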
