Applying a function to all columns in R [duplicate]

I have a dataset which I created by column binding with the cbindX function from the gdata package. This function lets me bind columns with different numbers of rows, so NAs are introduced wherever a column has no value. Now I want to calculate the standard deviation of each column. I tried using
sapply(dataset,sd)
This returns the standard deviation for the column in which every row has a value, and NA for the columns with fewer rows. I tried using the na.rm argument with the sd function:
sapply(dataset,sd(na.rm=T))
and got the error message
Error in is.data.frame(x) : argument "x" is missing, with no default
For example:
firstcol <- matrix(c(1:150),ncol=1)
secondcol <- matrix(c(1:300),ncol=1)
thirdcol <- matrix(c(1:450),ncol=1)
fourthcol <- matrix(c(1:600),ncol=1)
fifthcol <- matrix(c(1:30),ncol=1)
sixthcol <- matrix(c(1:30),ncol=1)
seventhcol <- matrix(c(1:30),ncol=1)
library(gdata)
allcolscomb <- data.frame(cbindX (firstcol,secondcol,thirdcol,fourthcol,fifthcol,sixthcol,seventhcol))
names(allcolscomb) <- c("1stcol","2ndcol","3rdcol","4thcol","5thcol","6thcol","7thcol")
sapply(allcolscomb,sd)
sapply(allcolscomb,sd(na.rm=T))
How can I compute standard deviation using the sapply function?

You should read the ?sapply manual page. Below are examples of sapply with extra arguments:
sapply(allcolscomb, sd, na.rm=TRUE)
sapply(allcolscomb, function(x) sd(x, na.rm=TRUE))

Try this.
sapply(allcolscomb,sd, na.rm = TRUE)
In the apply family of functions, the call has the form (data, fun, ...). The three dots are the "ellipsis": any extra arguments placed there are passed on to the function given as fun.
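As a minimal sketch of how the ellipsis forwards na.rm, here is a small made-up data frame (the column names and values are illustrative, not the cbindX data from the question). Note that writing sd(na.rm=T) calls sd immediately with no x, which is where the "argument "x" is missing" error comes from.

```r
# Hypothetical data: the shorter column is padded with NA,
# much as cbindX does when binding columns of unequal length.
df <- data.frame(a = c(1, 2, 3, 4),
                 b = c(10, 20, NA, NA))

# Without na.rm, any column containing NA yields NA.
sd_default <- sapply(df, sd)

# Arguments given after FUN are forwarded to sd() for every column.
sd_narm <- sapply(df, sd, na.rm = TRUE)

# Equivalent form with an anonymous function.
sd_anon <- sapply(df, function(x) sd(x, na.rm = TRUE))

print(sd_default)
print(sd_narm)
```

Both forms give the same result; the anonymous-function form is handy when the column is not the first argument of the function being applied.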

Related

using the mean() function in R to find the means of two columns in a data frame

I am finding it difficult to use the mean() function to find the mean of two columns together. When I try mean(dat[, 2:3], na.rm = TRUE), where column 2 is, say, sulfate and column 3 is, say, nitrate, R throws an error message saying that the argument is not numeric or logical. I have also tried other variants, mean(dat[, c(2,3)], na.rm = TRUE) and mean(dat[, c("sulfate","nitrate")], na.rm = TRUE), but the same error message crops up.
The mean() function only works for me when used separately for each pollutant as in -
m1 <- mean(dat[,"sulfate"], na.rm=TRUE)
m2 <- mean(dat[,"nitrate"], na.rm=TRUE)
Is there a way to find the means of both columns in one command without the error? I think the apply() family of functions might do it, and colMeans(datasubset[,c('sulfate','nitrate')], na.rm = TRUE) also works for me. But I am not sure how the mean() function itself could be used for this purpose. Any leads would be appreciated! Thanks
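One possible way to do this, sketched on a made-up stand-in for dat (only the column names sulfate and nitrate come from the question; the values and the id column are assumptions): since dat[, 2:3] is still a data frame, mean() can be applied to each column through sapply.

```r
# Hypothetical stand-in for `dat`; values are invented for illustration.
dat <- data.frame(id      = 1:4,
                  sulfate = c(1.0, NA, 3.0, 5.0),
                  nitrate = c(0.5, 1.5, NA, 2.5))

cols <- c("sulfate", "nitrate")

# dat[, cols] is a data frame, so apply mean() to each column in turn;
# na.rm = TRUE is forwarded to mean() for every column.
col_means <- sapply(dat[, cols], mean, na.rm = TRUE)

# colMeans() gives the same per-column result.
col_means2 <- colMeans(dat[, cols], na.rm = TRUE)

print(col_means)
```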

Mean imputation issue with data.table

I am trying to impute missing values in all numeric columns using this loop:
for(i in 1:ncol(df)){
  if (is.numeric(df[,i])){
    df[is.na(df[,i]), i] <- mean(df[,i], na.rm = TRUE)
  }
}
When the data.table package is not attached, the code above works as it should. Once I attach data.table, the behaviour changes and I get this error:
Error in `[.data.table`(df, , i) :
j (the 2nd argument inside [...]) is a single symbol but column name 'i'
is not found. Perhaps you intended DT[,..i] or DT[,i,with=FALSE]. This
difference to data.frame is deliberate and explained in FAQ 1.1.
I tried '..i' and 'with=FALSE' everywhere, but with no success. In fact, it does not even get past the first is.numeric condition.
The data.table syntax is a little different in such a case. You can do it as follows:
num_cols <- names(df)[sapply(df, is.numeric)]
for(col in num_cols) {
set(df, i = which(is.na(df[[col]])), j = col, value = mean(df[[col]], na.rm=TRUE))
}
Or, if you want to keep using your existing loop, you can just turn the data back to data.frame using
setDF(df)
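To try the set() approach on something concrete, here is a sketch using a small made-up table (the table dt and its columns are assumptions for illustration). It relies on the data.table package being installed.

```r
library(data.table)

# Hypothetical table: two numeric columns with gaps and one character column.
dt <- data.table(a = c(1, NA, 3),
                 b = c(NA, 4, 6),
                 g = c("x", "y", "z"))

# Names of the numeric columns only.
num_cols <- names(dt)[sapply(dt, is.numeric)]

for (col in num_cols) {
  # dt[[col]] extracts the column as a plain vector.
  missing_rows <- which(is.na(dt[[col]]))
  # set() updates the table by reference, so no reassignment is needed.
  set(dt, i = missing_rows, j = col, value = mean(dt[[col]], na.rm = TRUE))
}

print(dt)
```

The columns here are doubles on purpose: filling an integer column with a non-integer mean would require converting the column type first.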
Here is an alternative answer that I came up with while working on a similar problem at a larger scale. One might want to avoid for loops by using the [.data.table method.
DF[i, j, by, on, ...]
First we'll create a function that can perform the imputation
impute_na <- function(x, val = mean, ...){
  # leave non-numeric columns untouched
  if(!is.numeric(x)) return(x)
  na <- is.na(x)
  # if 'val' is a function, compute the fill value from the observed entries
  if(is.function(val))
    val <- val(x[!na])
  if(!is.numeric(val) || length(val) > 1)
    stop("'val' needs to be either a function or a single numeric value!")
  x[na] <- val
  x
}
To perform the imputation on the data frame, one could create and evaluate an expression in the data.table environment, but for simplicity of example here we'll overwrite using <-
DF <- DF[, lapply(.SD, impute_na)]
This will impute the mean across all numeric columns and keep any non-numeric columns as they are. If we wished to impute another value (like... 42 or whatever), or if we have a grouping variable and want the mean computed within each group, this can be done with
DF <- DF[, lapply(.SD, impute_na, val = 42)]
DF <- DF[, lapply(.SD, impute_na), by = group]
Which would impute 42, and the within-group mean respectively.
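For comparison, the same idea can be sketched in base R without data.table. This is an analogue under stated assumptions, not the method above: the data frame df, the helper impute_mean, and the use of ave() for grouping are all introduced here for illustration.

```r
# Hypothetical data frame with a grouping column.
df <- data.frame(group = c("a", "a", "b", "b"),
                 x     = c(1, NA, 10, NA),
                 label = c("p", "q", "r", "s"),
                 stringsAsFactors = FALSE)

# Replace NA in a numeric vector with the mean of its observed values;
# non-numeric vectors are returned unchanged.
impute_mean <- function(v) {
  if (!is.numeric(v)) return(v)
  v[is.na(v)] <- mean(v, na.rm = TRUE)
  v
}

# Overall mean: assigning into overall[] keeps the data frame structure.
overall <- df
overall[] <- lapply(overall, impute_mean)

# Within-group mean: ave() applies the function separately per group.
by_group <- df
by_group$x <- ave(df$x, df$group, FUN = impute_mean)

print(overall)
print(by_group)
```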

How to calculate an overall mean from more than two columns in a data frame?

I would like to get a single mean value from selected columns in a data frame, but it doesn't work with two columns. I tried this:
testDF <- data.frame(v1 = c(1,3,15,7,18,3,5,NA,4,5,7,9),
v2 = c(11,33,55,7,88,33,55,NA,44,5,67,99),
v3 = c(NA,33,5,77,88,3,55,NA,4,55,87,14))
mean(testDF[,2:3], na.rm=T)
and I get this Warning message:
mean(testDF[,2:3], na.rm=T)
[1] NA
Warning message:
In mean.default(testDF[, 2:3], na.rm = T) :
argument is not numeric or logical: returning NA
If I use the sum() function it works perfectly, but I don't understand why it doesn't work with the mean() function. After some steps I did it with the melt() function from the reshape2 package, but I'm looking for a shorter, simpler way because I have a lot of variables and data.
Regards
The help for mean says:
Currently there are methods for numeric/logical vectors and date, date-time and time interval objects.
which makes me think that mean does not work on data frames.
Indeed you will see that doing mean(testDF) results in the same error, but mean(testDF[,1]) works.
The easiest solution is to do:
mean(as.matrix(testDF[,2:3]), na.rm=T)
Also, you can use colMeans to get the mean of each column.
Indeed, if you look at the source for colMeans, the first lines are:
if (is.data.frame(x))
x <- as.matrix(x)
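The difference between the overall mean and the per-column means can be checked on the testDF from the question. This is a sketch; the unlist() variant is an addition here, not part of the answer above.

```r
testDF <- data.frame(v1 = c(1, 3, 15, 7, 18, 3, 5, NA, 4, 5, 7, 9),
                     v2 = c(11, 33, 55, 7, 88, 33, 55, NA, 44, 5, 67, 99),
                     v3 = c(NA, 33, 5, 77, 88, 3, 55, NA, 4, 55, 87, 14))

# One overall mean across both columns, via a numeric matrix.
overall_matrix <- mean(as.matrix(testDF[, 2:3]), na.rm = TRUE)

# Same result by flattening the two columns into one numeric vector.
overall_unlist <- mean(unlist(testDF[, 2:3]), na.rm = TRUE)

# Per-column means, for comparison.
per_column <- colMeans(testDF[, 2:3], na.rm = TRUE)

print(overall_matrix)
print(per_column)
```

Because v2 and v3 have different numbers of missing values, the overall mean is not simply the average of the two column means.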

Mean function in R for data in csv file

I tried searching for the error I get when using the "mean" function in R 3.1.2.
Purpose: Calculate Mean of datasets
Used Functions: sapply, summary to calculate mean as shown below:
sapply(data,mean,na.rm=TRUE)
summary(data)
Problem faced: Now I am trying to use the "mean" function to calculate the mean of the complete dataset. I used the function like this:
> testingnew <-data[complete.cases(data),]
> mean(testingnew)
Resulting warning:
[1] NA
Warning message:
In mean.default(testingnew) :
argument is not numeric or logical: returning NA
Question: Can someone please tell me why this warning appears? I tried to remove NAs (missing values) using complete.cases.
#To eliminate rows with missing values:
testingnew <- na.omit(data)
#Choose a column to calculate the mean:
#Make sure it is numeric or integer
class(testingnew$Col1)
mean(testingnew$Col1, na.rm=TRUE)
Maybe you can try to reproduce this workflow with your own dataset... It seems the only thing missing is referring to individual columns with the mean function, or using sapply as you did before.
Create a dataframe using random values
my.df <- data.frame(x1 = rnorm(n = 200), x2 = rnorm(n=200))
Spread NA's randomly into the df
is.na(my.df) <- matrix(sample(c(TRUE,FALSE), replace= TRUE, size = 400,
prob=c(0.10, 0.90)),
ncol = 2)
For getting means without using complete cases:
mean(my.df$x1, na.rm=TRUE) # mean(my.df[,1], na.rm=TRUE) is equivalent
mean(my.df$x2, na.rm=TRUE) # mean(my.df[,2], na.rm=TRUE) is equivalent
Complete-case approach (if this is what you really need):
my.df.complete <- my.df[complete.cases(my.df),]
Get means for both columns
sapply(X = my.df.complete, FUN = mean)
Get mean from individual columns
mean(my.df.complete$x1)
mean(my.df.complete$x2)
Creating a subset helped:
data3 <-subset(data, !is.na(Ozone))
mean(data3$Ozone)
