Finding the mean of a column - r

Using R, I am trying to find the mean of a column but I can't seem to get it to work. This is my code:
mean(data_frame$column, na.rm = TRUE)
When I run it, it just gives me an error message: argument is not numeric or logical: returning NA. I've also tried using colMeans, but it just gives another error message: 'x' must be an array of at least two dimensions. What am I doing wrong?

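I think you should clean your data first, then calculate the mean. That message usually means the column is stored as character or factor rather than numeric, so convert it as well:
# remove NA and empty-string values
data_frame <- data_frame[!(is.na(data_frame$column) | data_frame$column == ""), ]
# convert to numeric and calculate the mean
# (the result is not named `mean`, to avoid masking the function)
column_mean <- mean(as.numeric(as.character(data_frame[["column"]])))
Let me know if this works for you.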
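A minimal sketch of the likely cause, assuming the column was read in as text (the values below are made up for illustration). mean() warns when it is handed a character vector, and colMeans() fails when it is handed a single vector, because `$` extracts a one-dimensional vector from the data frame.

```r
# Hypothetical data: numbers stored as text, with a blank and an NA
data_frame <- data.frame(column = c("1.5", "2.5", "", NA, "4"),
                         stringsAsFactors = FALSE)

class(data_frame$column)            # check the type before averaging

# Convert to numeric; values that cannot be parsed become NA
values <- suppressWarnings(as.numeric(data_frame$column))
column_mean <- mean(values, na.rm = TRUE)

# colMeans() needs something two-dimensional, so pass a data frame, not a vector
column_mean_2d <- colMeans(data.frame(column = values), na.rm = TRUE)
```

If class() reports "factor" instead of "character", convert with as.numeric(as.character(x)) so that the labels rather than the internal codes are used.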

Related

How to fix 'argument is not numeric or logical: returning NA'

I've read in my Excel data, found the max value for each column, and calculated the mean of those values.
library(readxl)
exp4 <- read_excel("exp4.xlsx")
View(exp4)
# this gives you the highest value in each column of the dataset
maxpeak <- apply(exp4, MARGIN = 2, function(x) max(x, na.rm=TRUE))
maxpeak
#the mean of all the max peaks in this experiment
mean16052019exp4 <- mean(maxpeak)
mean16052019exp4
I've then taken the original max values and subtracted the baseline values, read in from another Excel spreadsheet, but when I now want the mean of these new values:
realmaxpeak <- (maxpeak - exp4baseline)
realmaxpeak
#trying to calculate the mean of the baseline adjusted values
View(realmaxpeak)
mean(realmaxpeak)
I get:
Warning message:
In mean.default(realmaxpeak[0.1]) : argument is not numeric or logical: returning NA
Why can I not calculate the mean from the vector (realmaxpeak) I created?
TIA
Could you post the summary of the data in realmaxpeak? R may not be recognizing it as numeric. If that is the case, you would use as.numeric() to convert it.
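One possible explanation, sketched with made-up numbers: if exp4baseline came from read_excel, it is a data frame rather than a plain vector, and mean() does not accept data frames. The baseline values and the name baseline_df below are assumptions, not the asker's data.

```r
maxpeak <- c(a = 5, b = 7, c = 9)                # per-column maxima (made up)
baseline_df <- data.frame(a = 1, b = 2, c = 3)   # hypothetical one-row baseline

is.numeric(baseline_df)                          # FALSE: a data frame, not a vector

# Flatten the baseline to a plain numeric vector before subtracting
baseline <- as.numeric(unlist(baseline_df))
realmaxpeak <- maxpeak - baseline
mean(realmaxpeak)
```

Running str(realmaxpeak) or class(realmaxpeak) on the real object should show quickly whether this is what happened.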

How to calculate an overall mean from more than two columns in a data frame?

I would like to have a single mean value from my selected columns in a data frame, but it doesn't work with two columns. I tried this:
testDF <- data.frame(v1 = c(1,3,15,7,18,3,5,NA,4,5,7,9),
v2 = c(11,33,55,7,88,33,55,NA,44,5,67,99),
v3 = c(NA,33,5,77,88,3,55,NA,4,55,87,14))
mean(testDF[,2:3], na.rm=T)
and I get this Warning message:
mean(testDF[,2:3], na.rm=T)
[1] NA
Warning message:
In mean.default(testDF[, 2:3], na.rm = T) :
argument is not numeric or logical: returning NA
If I use the sum() function it works perfectly, but I don't understand why it doesn't work with the mean() function. After some steps I did it with the melt() function from the reshape2 package, but I'm looking for a shorter, simpler way to do it because I have a lot of variables and data.
Regards
The help for mean says:
Currently there are methods for numeric/logical vectors and date, date-time and time interval objects.
which makes me think that mean does not work on data frames.
Indeed you will see that doing mean(testDF) results in the same error, but mean(testDF[,1]) works.
The easiest solution is to do:
mean(as.matrix(testDF[,2:3]), na.rm = TRUE)
Also, you can use colMeans to get the mean of each column.
Indeed, if you look at the source for colMeans, the first lines are:
if (is.data.frame(x))
x <- as.matrix(x)
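To make the difference concrete, here is a short sketch using the testDF from the question. It also shows unlist() as an alternative to as.matrix(), and illustrates that the overall mean is not the same as the mean of the column means when the columns have different numbers of NAs.

```r
testDF <- data.frame(v1 = c(1,3,15,7,18,3,5,NA,4,5,7,9),
                     v2 = c(11,33,55,7,88,33,55,NA,44,5,67,99),
                     v3 = c(NA,33,5,77,88,3,55,NA,4,55,87,14))

# One overall mean across both columns: flatten to a matrix or a vector first
overall_matrix <- mean(as.matrix(testDF[, 2:3]), na.rm = TRUE)
overall_vector <- mean(unlist(testDF[, 2:3]), na.rm = TRUE)

# One mean per column
per_column <- colMeans(testDF[, 2:3], na.rm = TRUE)

# Averaging the column means weights each column equally, so it can differ
# from the overall mean when the columns have different numbers of NAs
mean_of_means <- mean(per_column)
```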

Mean function in R for data in csv file

I tried searching for the error I am getting while using the "mean" function in R 3.1.2.
Purpose: Calculate Mean of datasets
Used Functions: sapply, summary to calculate mean as shown below:
sapply(data,mean,na.rm=TRUE)
summary(data)
Problem Faced: Now, I am trying to use "mean" function to calculate mean from complete dataset. I used the function like this:
> testingnew <-data[complete.cases(data),]
> mean(testingnew)
Popped Warning :
[1] NA
Warning message:
In mean.default(testingnew) :
argument is not numeric or logical: returning NA
Question: Can someone please tell me why this warning appears? I tried to remove NA (missing values) using complete.cases.
# Eliminate rows with missing values
testingnew <- na.omit(data)
# Choose a column to calculate the mean;
# make sure it is numeric or integer
class(testingnew$Col1)
mean(testingnew$Col1, na.rm=TRUE)
Maybe you can try to reproduce this workflow with your own dataset... It seems the only thing missing is referring to individual columns with the mean function, or using sapply as you did before.
Create a dataframe using random values
my.df <- data.frame(x1 = rnorm(n = 200), x2 = rnorm(n=200))
Spread NA's randomly into the df
is.na(my.df) <- matrix(sample(c(TRUE,FALSE), replace= TRUE, size = 400,
prob=c(0.10, 0.90)),
ncol = 2)
For getting means without using complete cases:
mean(my.df$x1, na.rm=TRUE) # mean(my.df[,1], na.rm=TRUE) is equivalent
mean(my.df$x2, na.rm=TRUE) # mean(my.df[,2], na.rm=TRUE) is equivalent
Complete-case approach (if this is what you really need):
my.df.complete <- my.df[complete.cases(my.df),]
Get means for both columns
sapply(X = my.df.complete, FUN = mean)
Get mean from individual columns
mean(my.df.complete$x1)
mean(my.df.complete$x2)
Creating a subset helped:
data3 <-subset(data, !is.na(Ozone))
mean(data3$Ozone)
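The same idea as a self-contained sketch. It assumes the data resemble R's built-in airquality dataset (the Ozone column in the last snippet suggests so, but that is a guess). mean() on a whole data frame returns NA with the warning, while per-column calls work.

```r
data <- airquality    # built-in dataset, standing in for the asker's CSV

# mean() on the whole data frame: NA plus the warning from the question
whole_frame <- suppressWarnings(mean(data))

# Keep only numeric columns, then take each column's mean, ignoring NAs
numeric_cols <- data[sapply(data, is.numeric)]
col_means <- sapply(numeric_cols, mean, na.rm = TRUE)

# Single column, equivalent to subsetting out the NAs first
ozone_mean <- mean(data$Ozone, na.rm = TRUE)
```

Filtering on is.numeric first matters for real CSV files, where a stray text column would otherwise trigger the same warning inside sapply.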

Using tapply on data with NAs

I have a data column (Percent.Plant.Parasites) that has some NAs. I want to take the mean of this data sorted by factor "Stage" (ie stage1 Mean=x, stage2 Mean=y, etc). I tried doing this using
tapply(rawdata$Percent.Plant.Parasites, rawdata$Stage, mean)
However, I get NAs because there are NAs in the data. I don't believe there is an na.rm option for tapply (is there?), so I tried to calculate the mean of each individual stage factor using:
mean(subset(rawdata,subset=Stage=="stage1")$Percent.Plant.Parasites, na.rm=TRUE)
to no avail. Instead I got the error:
In mean.default(subset(rawdata, subset = Stage == "Kax")$Percent.Plant.Parasites, :
argument is not numeric or logical: returning NA
However, when I do:
typeof(subset(rawdata,subset=Stage=="Kax")$Percent.Plant.Parasites)
I get integer
Any ideas where I'm going wrong?
Thanks.
Why not just create a new function, call it mean_NA, that simply removes the NAs before calculating the mean and then use that function in tapply? Something like:
mean_NA <- function(v) {
  # mean of v, ignoring missing values
  mean(v, na.rm = TRUE)
}
As was commented, make sure that the data you're taking the mean of is numeric/integer and the INDEX is factor(groups). You would use the newly created function like this:
tapply(X = rawdata$Percent.Plant.Parasites, INDEX = rawdata$Stage, mean_NA)
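Two further points, offered as a sketch with made-up values. First, tapply() forwards extra arguments to the function, so na.rm = TRUE can be passed directly without a wrapper. Second, typeof() reports "integer" for a factor too, which would explain the warning in the question; class() is the more telling check.

```r
rawdata <- data.frame(
  Stage = factor(c("stage1", "stage1", "stage2", "stage2", "stage2")),
  Percent.Plant.Parasites = c(10, NA, 20, 40, NA)
)

# Extra arguments after FUN are passed on to mean()
stage_means <- tapply(rawdata$Percent.Plant.Parasites, rawdata$Stage,
                      mean, na.rm = TRUE)

# A factor is stored as integer codes, so typeof() is misleading here
pct_factor <- factor(c("10", "20", "40"))
typeof(pct_factor)    # "integer"
class(pct_factor)     # "factor"

# Convert via character so the labels, not the codes, become numbers
pct_numeric <- as.numeric(as.character(pct_factor))
mean(pct_numeric)
```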

R function applied on data frame grouped by multiple factors

I have a data frame called subdata, with a dimension of 10299 x 81. Column 1 called "Subject" and column 2 called "Activity". I want to calculate the average of each column grouped by "Subject" and "Activity".
Here are the functions I tried; none of them seems to work so far. Finally I used the colwise(mean) function, and it seems to work. I am new to R and have just learned the sapply, lapply and tapply functions, and it seems the mean function works on columns.
Can anyone help explain what these error or warning messages mean, and whether there is a way to make these functions work?
Use lapply function:
newdata<- subdata[, lapply(.SD, mean), by = c("Subject","Activity")]
The error message:
Error in `[.data.frame`(subdata, , lapply(.SD, mean), by = c("Subject", :
unused argument (by = c("Subject", "Activity"))
Use by function:
newdata<-by(subdata, list(subdata$Subject, subdata$Activity), mean)
I got warning message:
Warning messages:
1: In mean.default(data[x, , drop = FALSE], ...) :
argument is not numeric or logical: returning NA
Then I tried ddply in plyr package
ddply(subdata, .(Subject, Activity), mean)
I got the same warning message:
Warning messages:
1: In mean.default(piece, ...) : argument is not numeric or logical: returning NA 0
Finally I used the colwise(mean) function, and it seems to work:
newdata<-ddply(subdata, .(Subject, Activity), colwise(mean))
It is somewhat difficult to be certain without a representative sample of your dataset. Let's create some data to work with.
# Create some random demo data
subdata <- data.frame(Subject = rep(seq(5), each=4),
Activity = rep(LETTERS[1:2], 10), v1=rnorm(20), v2=rnorm(20))
Your first attempt uses data.table syntax (.SD and by inside [ ]). A plain data frame does not understand it, which is why you get the unused argument error; it would only work if subdata were first converted to a data.table.
Your by statement gives a warning about non-numeric data. This is because by passes each group's rows to the function as a data frame, grouping columns included, and mean does not work on a data frame. You need to provide only the columns to be analyzed, then the indices (i.e. your factor columns), and use a function that accepts a data frame, such as colMeans.
by(subdata[,-c(1,2)], list(subdata$Subject, subdata$Activity), function(x) colMeans(x))
You will probably want to rbind this output and reassign row names so they correspond to the groups. For this purpose, though, it may be simpler to use something like aggregate and avoid the extra work.
aggregate(subdata[,-c(1,2)], list(subdata$Subject, subdata$Activity), mean)
Your ddply statements are close, but as with by you should summarize only the numeric columns, which numcolwise does for you.
library(plyr)
# summarize over all numeric columns
ddply(subdata, .(Subject, Activity), numcolwise(mean))
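For comparison, a base R sketch using aggregate's formula interface, which needs no extra package. The demo values are deterministic stand-ins for the random data above so the result is easy to check by hand.

```r
subdata <- data.frame(Subject  = rep(seq(5), each = 4),
                      Activity = rep(LETTERS[1:2], 10),
                      v1 = 1:20,
                      v2 = seq(2, 40, by = 2))

# "." means every column not named on the right-hand side
group_means <- aggregate(. ~ Subject + Activity, data = subdata, FUN = mean)
head(group_means)
```

Note that the formula interface drops rows containing an NA in any column by default (na.action = na.omit), which may or may not be what you want with real data.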

Resources