Ignoring NA values in function - r

I am writing my own function to calculate the mean of a column in a data set and then applying it using apply() but it only returns the first column's mean. Below is my code:
mymean <- function(cleaned_us){
column_total = sum(cleaned_us)
column_length = length(cleaned_us)
return (column_total/column_length)
}
Average_2 <- apply(numeric_clean_usnews,2,mymean,na.rm=T)

We need to use na.rm = TRUE inside sum(); passing it through apply() is not going to work because mymean doesn't have that argument. The divisor also has to count only the non-missing values:
mymean <- function(cleaned_us){
column_total = sum(cleaned_us, na.rm = TRUE) #change
column_length = sum(!is.na(cleaned_us)) #change
return(column_total/column_length)
}
Note that colMeans can be used for getting the mean for each column.
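As a minimal sketch of that last note, assuming a small made-up numeric matrix m (not the asker's data), colMeans() with na.rm = TRUE should give the same result as applying the corrected mymean column by column:

```r
# Hypothetical example data: a 3 x 2 numeric matrix with one NA per column
m <- matrix(c(1, NA, 3,
              4, 5, NA), nrow = 3)

# Corrected function from the answer: divide by the number of non-missing values
mymean <- function(cleaned_us) {
  column_total <- sum(cleaned_us, na.rm = TRUE)
  column_length <- sum(!is.na(cleaned_us))
  column_total / column_length
}

by_apply <- apply(m, 2, mymean)           # custom function, column by column
by_colmeans <- colMeans(m, na.rm = TRUE)  # built-in equivalent

print(by_apply)
print(by_colmeans)
```

For plain column means, colMeans() is the shorter option; the custom function is mainly useful when the calculation needs to differ from a standard mean.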

In order to pass an na.rm parameter to the function you defined, you need to make it a parameter of the function. The sum() function has an na.rm param, but length() doesn't. So to write the function you are trying to write, you could say:
# include `na.rm` as a parameter of the function
mymean <- function(cleaned_us, na.rm){
# pass it to `sum()`
column_total = sum(cleaned_us, na.rm=na.rm)
# if `na.rm` is set to `TRUE`, then don't count `NA`s
if (na.rm==TRUE){
column_length = length(cleaned_us[!is.na(cleaned_us)])
# but if it's `FALSE`, just use the full length
} else {
column_length = length(cleaned_us)
}
return (column_total/column_length)
}
Then your call should work:
Average_2 <- apply(numeric_clean_usnews, 2, mymean, na.rm=TRUE)
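One optional refinement, offered as a sketch rather than as part of the answer above: giving na.rm a default of FALSE lets the function be called with or without the argument, which mirrors how base mean() behaves. The vector x is made-up example data.

```r
# Variant with a default value for na.rm (an assumption, not in the original answer)
mymean <- function(cleaned_us, na.rm = FALSE) {
  # drop missing values first when requested
  if (na.rm) cleaned_us <- cleaned_us[!is.na(cleaned_us)]
  sum(cleaned_us) / length(cleaned_us)
}

x <- c(2, 4, NA)           # hypothetical data
without_flag <- mymean(x)  # NA propagates, as with mean(x)
with_flag <- mymean(x, na.rm = TRUE)

print(without_flag)
print(with_flag)
```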

Use na.omit()
set.seed(1)
m <- matrix(sample(c(1:9, NA), 100, replace=TRUE), 10)
mymean <- function(cleaned_us, na.rm){
if (na.rm) cleaned_us <- na.omit(cleaned_us)
column_total = sum(cleaned_us)
column_length = length(cleaned_us)
column_total/column_length
}
apply(m, 2, mymean, na.rm=TRUE)
# [1] 5.000 5.444 4.111 5.700 6.500 4.600 5.000 6.222 4.700 6.200

Specify function arguments from string

I'm trying to set up details for which function to run and which arguments to include at the start of my script, to then later call the function. I'm having trouble specifying arguments to be input into the function.
I have a fixed object
v <- c(1,2,3,5,6,7,8,9,NA)
I want to specify which measurement function I will use as well as any relevant arguments.
Example 1:
chosenFunction <- mean
chosenArguments <- "trim = 0.1, na.rm = T"
Example 2:
chosenFunction <- median
chosenArguments <- "na.rm = F"
Then I want to be able to run this specified function
chosenFunction(v, chosenArguments)
Unfortunately, I can't just put in the string chosenArguments and expect the function to run. Is there any alternative way to specify the arguments to my function?
Updated answer based on OP's clarifications
chosenFunction <- mean
get_summary <- function(x, fun, ...) fun(x, ...)
v <- 1:100
get_summary(v, chosenFunction, na.rm = TRUE)
# [1] 50.5
Later on if you want to change the function
chosenFunction <- median
get_summary(v, chosenFunction, na.rm = TRUE)
# [1] 50.5
Original answer
get_summary <- function(x, chosenFunction, ...) chosenFunction(x, ...)
v <- 1:100
get_summary(v, mean, na.rm = TRUE, trim = 1)
# [1] 50.5
get_summary(v, median, na.rm = TRUE)
# [1] 50.5
Because of ..., you don't have to specify any of the optional arguments
get_summary(v, mean)
# [1] 50.5
If we want to calculate mean, we do it by
mean(v, na.rm = TRUE, trim = 0.1)
#[1] 5.125
Another way is by using do.call
do.call(mean, list(v, na.rm = TRUE, trim = 0.1))
#[1] 5.125
We can leverage this fact and create a named list for chosenArguments and use it in do.call
chosenFunction <- mean
chosenArguments <- list(na.rm = TRUE, trim = 0.1)
do.call(chosenFunction, c(list(v), chosenArguments))
#[1] 5.125
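To connect this back to the question, one way to keep both example configurations at the top of a script is a named list of settings. This is only a sketch; the names configs and run_config are hypothetical and not part of any answer above.

```r
v <- c(1, 2, 3, 5, 6, 7, 8, 9, NA)

# Hypothetical settings: each entry stores a function and a named list of arguments
configs <- list(
  trimmed_mean = list(fun = mean,   args = list(trim = 0.1, na.rm = TRUE)),
  plain_median = list(fun = median, args = list(na.rm = FALSE))
)

# Hypothetical helper: put the data in front of the stored arguments and call the function
run_config <- function(x, config) {
  do.call(config$fun, c(list(x), config$args))
}

res_mean <- run_config(v, configs$trimmed_mean)
res_median <- run_config(v, configs$plain_median)  # NA, because na.rm = FALSE

print(res_mean)
print(res_median)
```

Storing the arguments as a list rather than a string avoids parsing text into code, and the same helper works for any function whose first argument is the data.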

Functions in R: How to Return Mean, Median, Standard Deviation Within Same Function

How can I return the mean, median, and standard deviation within the same function in R? All I can get back is the last value computed in the function, the standard deviation. I thought that by assigning summarystat(Tail_wags) to b and then printing b, I would have all three values. Below the function I have added the three values computed separately, to show what the results are supposed to be.
Dog_biscuits <- c(0,1,2,3,4,5,6,7,8,9,10)
Tail_wags <- c(0,0,1,3,8,13,14,12,15,16,14)
dog_wags<-cbind(Dog_biscuits,Tail_wags)
dog_wags
summarystat<- function(x) {
z1 <- mean(x)
z2<-median(x)
z3<-sd(x)
}
b<-summarystat(Tail_wags)
b
> b
[1] 6.497552
> mean(Tail_wags)
[1] 8.727273
> median(Tail_wags)
[1] 12
> sd(Tail_wags)
[1] 6.497552
You can only return one object from a function. The trick to achieve what you want is to return a list:
summarystat<- function(x) {
z1 <- mean(x)
z2 <- median(x)
z3 <- sd(x)
return(list(mean=z1, median=z2, sd=z3))
}
You can combine and return the variables using the generic c() function.
summarystat<- function(x) {
z1 <- mean(x, na.rm = TRUE)
z2<-median(x, na.rm = TRUE)
z3<-sd(x,na.rm = TRUE)
return(c(mean=z1,median=z2,standard_dev=z3))
}
Tail_wags <- c(0,0,1,3,8,13,14,12,15,16,14)
summarystat(Tail_wags)
# mean median standard_dev
# 8.727273 12.000000 6.497552
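A short sketch of how the named-vector version can be used, assuming the two vectors from the question are combined into a data frame: single statistics are picked out by name, and sapply() gives one column of statistics per variable. The names wag_stats and stats_by_column are made up for this example.

```r
Dog_biscuits <- c(0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
Tail_wags <- c(0, 0, 1, 3, 8, 13, 14, 12, 15, 16, 14)
dog_wags <- data.frame(Dog_biscuits, Tail_wags)

summarystat <- function(x) {
  c(mean = mean(x, na.rm = TRUE),
    median = median(x, na.rm = TRUE),
    standard_dev = sd(x, na.rm = TRUE))
}

# A named numeric vector: elements are selected by name
wag_stats <- summarystat(Tail_wags)
print(wag_stats[["median"]])

# Applied to every column, the vectors are bound into a 3 x 2 matrix
stats_by_column <- sapply(dog_wags, summarystat)
print(stats_by_column)
```

A named vector only works when every statistic is a single number of the same type; for mixed results, the list version is the safer choice.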
You are looking for something like:
summarystat <- function(x) {
my_list <- list("mean" = mean(x), "median" = median(x), "sd" = sd(x))
return(my_list)
}
Usage:
vals <- summarystat(Tail_wags)
vals$mean
vals$sd
vals$median
Function and application:
do.call("rbind", lapply(dog_wags, function(x){
list(mean_val = mean(x),
median_val = median(x),
sd_val = sd(x))
}
)
)
Data:
Dog_biscuits <- c(0,1,2,3,4,5,6,7,8,9,10)
Tail_wags <- c(0,0,1,3,8,13,14,12,15,16,14)
dog_wags <- data.frame(cbind(Dog_biscuits,Tail_wags))
Alternatively, you can get rid of the function completely and use something like pastecs::stat.desc and then extract the values you want
Dog_biscuits <- c(0,1,2,3,4,5,6,7,8,9,10)
Tail_wags <- c(0,0,1,3,8,13,14,12,15,16,14)
dog_wags<-cbind(Dog_biscuits,Tail_wags)
pastecs::stat.desc(Tail_wags)[["mean"]]
# 8.727273
Check out this article for more summary functions.
A somewhat different approach that lets one choose the functions to return.
Code:
fooapply <- function(x, functions = c("mean", "median", "sd"), na.rm = T){
func <- functions
vec <- c()
for(i in 1:length(func)){
if(na.rm == T){
eval(parse(text = paste0("vec[", i,"]", "<-", func[i], "(x, na.rm = T)")))
}
else{
eval(parse(text = paste0("vec[", i,"]", "<-", func[i], "(x)")))
}
}
names(vec) <- functions
return(vec)
}
Result
To obtain your desired result you can just pass your vector into the function. By default, the function will omit NAs and calculate the mean, median and sd.
fooapply(Tail_wags)
mean median sd
8.727273 12.000000 6.497552
Additionally, one can also add or remove functions or swap them out:
fooapply(Tail_wags, c("mean", "median", "IQR"))
Note that some of the included functions will throw an error when NAs are present and na.rm = T is not specified, while others will just return NA.
The mean() function, for example, returns NA for a vector that includes NA. In contrast, IQR() throws an error when the vector contains NAs, so it needs na.rm = T (the default in fooapply()) for fooapply() to work.
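The same idea can be sketched without eval(parse()) by looking each function up by name with match.fun(). This is an alternative under the same assumptions as fooapply() (every function accepts na.rm and returns a single number); the name fooapply2 is hypothetical.

```r
# Hypothetical variant of fooapply() that avoids building code from strings
fooapply2 <- function(x, functions = c("mean", "median", "sd"), na.rm = TRUE) {
  vapply(functions, function(fname) {
    f <- match.fun(fname)  # look the function up by its name
    if (na.rm) f(x, na.rm = TRUE) else f(x)
  }, numeric(1))           # each function must return one number
}

Tail_wags <- c(0, 0, 1, 3, 8, 13, 14, 12, 15, 16, 14)
res <- fooapply2(Tail_wags)
print(res)
```

Because vapply() is given a character vector, the result is named after the functions automatically, so no separate names() assignment is needed.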

Apply function or Loop in R: Not numerical, returning NA

I am working with a resampling procedure in R (just like a bootstrap). I have a matrix of response/explanatory variables and would like to draw 999 samples from this matrix and, for each statistic I am working with, calculate its mean, sd and confidence interval. So, I wrote a function to calculate the statistics and return a list:
mydata <- data.frame(a=rnorm(20, 1, 1), b = rnorm(20,1,1))
myfun <- function(data, n){
sample <- data[sample(n, replace = T),]
model1 <- lm(sample[,1]~sample[,2])
return(list(model1[[1]][[1]], model1[[1]][[2]]))
}
result <- as.numeric()
result <- replicate(99, myfun(mydata, 10))
Then, I have a matrix as my output in which the rows are the statistics and the columns are the samplings (nrow = 2 and ncol = 99). I need the mean and sd for each row, but when I try to use the apply function or even a loop the following message shows up:
In mean.default(newX[, i], ...) :
argument is not numeric or logical: returning NA
Moreover:
is.numeric(result)
[1] FALSE
I found it strange, because I never had such problem with similar procedures.
Any thoughts?
Use the following:
myfun <- function(dat, n){
dat1 <- dat[sample(n, replace = T),]
model1 <- lm(dat1[,1] ~ dat1[,2])
return(coef(model1))
}
replicate(99, myfun(mydata, 10))
The reason is that 'result' is a list of 198 elements with a dim attribute, not a numeric matrix. We need to unlist 'result' and restore the dimensions
result1 <- `dim<-`(unlist(result), dim(result))
and then use apply()
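The steps above can be sketched end to end as follows. The seed is arbitrary and only there to make the run reproducible; the function body follows the question's version, with the local variable renamed to resampled so it does not shadow sample().

```r
set.seed(42)  # arbitrary seed, an assumption for reproducibility
mydata <- data.frame(a = rnorm(20, 1, 1), b = rnorm(20, 1, 1))

# As in the question: returning a list makes replicate() build a list-matrix
myfun <- function(data, n) {
  resampled <- data[sample(n, replace = TRUE), ]
  model1 <- lm(resampled[, 1] ~ resampled[, 2])
  list(model1[[1]][[1]], model1[[1]][[2]])
}
result <- replicate(99, myfun(mydata, 10))

print(is.numeric(result))  # FALSE: a list with a dim attribute

# Flatten to a numeric vector and put the 2 x 99 shape back
result1 <- `dim<-`(unlist(result), dim(result))

row_means <- apply(result1, 1, mean)
row_sds <- apply(result1, 1, sd)
print(row_means)
print(row_sds)
```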
Just replace list() by c() in your myfun() function
mydata <- data.frame(a=rnorm(20, 1, 1), b = rnorm(20,1,1))
myfun <- function(data, n){
sample <- data[sample(n, replace = T),]
model1 <- lm(sample[,1]~sample[,2])
return(c(model1[[1]][[1]], model1[[1]][[2]]))
}
result <- as.numeric()
result <- replicate(99, myfun(mydata, 10))
apply(result, FUN=mean, 1)
apply(result, FUN=sd, 1)
This worked for me:
mydata <- data.frame(a=rnorm(20, 1, 1), b = rnorm(20,1,1))
myfun <- function(data, n){
sample <- data[sample(n, replace = T),]
model1 <- lm(sample[,1]~sample[,2])
return(data.frame(v1 = model1[[1]][[1]], v2 = model1[[1]][[2]]))
}
result <- do.call("rbind",(replicate(99, myfun(mydata, 10), simplify = FALSE)))

R speed up the for loop using apply() or lapply() or etc

I wrote a special "impute" function that replaces missing (NA) values in a column with either the mean() or the mode, depending on the column name.
The input data frame has 400,000+ rows and the function is very slow. How can I speed up the imputation part using lapply() or apply()?
Here is the function; the section I want optimized is marked with START OPTIMIZE and END OPTIMIZE:
specialImpute <- function(inputDF)
{
discoveredDf <- data.frame(STUDYID_SUBJID=character(), stringsAsFactors=FALSE)
dfList <- list()
counter = 1;
Whilecounter = nrow(inputDF)
#for testing just do 10 iterations,i = 10;
while (Whilecounter >0)
{
studyid_subjid=inputDF[Whilecounter,"STUDYID_SUBJID"]
vect = which(discoveredDf$STUDYID_SUBJID == studyid_subjid)
#was discovered and subset before
if (!is.null(vect))
{
#not subset before
if (length(vect)<1)
{
#subset the dataframe base on regex inputDF$STUDYID_SUBJID
df <- subset(inputDF, regexpr(studyid_subjid, inputDF$STUDYID_SUBJID) > 0)
#START OPTIMIZE
for (i in nrow(df))
{
#impute , add column mean & add to list
#apply(df[,c("y1","y2","y3","etc..")],2,function(x){x[is.na(x)] =mean(x, na.rm=TRUE)})
if (is.na(df[i,"y1"])) {df[i,"y1"] = mean(df[,"y1"], na.rm = TRUE)}
if (is.na(df[i,"y2"])) {df[i,"y2"] =mean(df[,"y2"], na.rm = TRUE)}
if (is.na(df[i,"y3"])) {df[i,"y3"] =mean(df[,"y3"], na.rm = TRUE)}
#impute using mean for CONTINUOUS variables
if (is.na(df[i,"COVAR_CONTINUOUS_2"])) {df[i,"COVAR_CONTINUOUS_2"] =mean(df[,"COVAR_CONTINUOUS_2"], na.rm = TRUE)}
if (is.na(df[i,"COVAR_CONTINUOUS_3"])) {df[i,"COVAR_CONTINUOUS_3"] =mean(df[,"COVAR_CONTINUOUS_3"], na.rm = TRUE)}
if (is.na(df[i,"COVAR_CONTINUOUS_4"])) {df[i,"COVAR_CONTINUOUS_4"] =mean(df[,"COVAR_CONTINUOUS_4"], na.rm = TRUE)}
if (is.na(df[i,"COVAR_CONTINUOUS_5"])) {df[i,"COVAR_CONTINUOUS_5"] =mean(df[,"COVAR_CONTINUOUS_5"], na.rm = TRUE)}
if (is.na(df[i,"COVAR_CONTINUOUS_6"])) {df[i,"COVAR_CONTINUOUS_6"] =mean(df[,"COVAR_CONTINUOUS_6"], na.rm = TRUE)}
if (is.na(df[i,"COVAR_CONTINUOUS_7"])) {df[i,"COVAR_CONTINUOUS_7"] =mean(df[,"COVAR_CONTINUOUS_7"], na.rm = TRUE)}
if (is.na(df[i,"COVAR_CONTINUOUS_10"])) {df[i,"COVAR_CONTINUOUS_10"] =mean(df[,"COVAR_CONTINUOUS_10"], na.rm = TRUE)}
if (is.na(df[i,"COVAR_CONTINUOUS_14"])) {df[i,"COVAR_CONTINUOUS_14"] =mean(df[,"COVAR_CONTINUOUS_14"], na.rm = TRUE)}
if (is.na(df[i,"COVAR_CONTINUOUS_30"])) {df[i,"COVAR_CONTINUOUS_30"] =mean(df[,"COVAR_CONTINUOUS_30"], na.rm = TRUE)}
#impute using mode ordinal & nominal values
if (is.na(df[i,"COVAR_ORDINAL_1"])) {df[i,"COVAR_ORDINAL_1"] =Mode(df[,"COVAR_ORDINAL_1"])}
if (is.na(df[i,"COVAR_ORDINAL_2"])) {df[i,"COVAR_ORDINAL_2"] =Mode(df[,"COVAR_ORDINAL_2"])}
if (is.na(df[i,"COVAR_ORDINAL_3"])) {df[i,"COVAR_ORDINAL_3"] =Mode(df[,"COVAR_ORDINAL_3"])}
if (is.na(df[i,"COVAR_ORDINAL_4"])) {df[i,"COVAR_ORDINAL_4"] =Mode(df[,"COVAR_ORDINAL_4"])}
#nominal
if (is.na(df[i,"COVAR_NOMINAL_1"])) {df[i,"COVAR_NOMINAL_1"] =Mode(df[,"COVAR_NOMINAL_1"])}
if (is.na(df[i,"COVAR_NOMINAL_2"])) {df[i,"COVAR_NOMINAL_2"] =Mode(df[,"COVAR_NOMINAL_2"])}
if (is.na(df[i,"COVAR_NOMINAL_3"])) {df[i,"COVAR_NOMINAL_3"] =Mode(df[,"COVAR_NOMINAL_3"])}
if (is.na(df[i,"COVAR_NOMINAL_4"])) {df[i,"COVAR_NOMINAL_4"] =Mode(df[,"COVAR_NOMINAL_4"])}
if (is.na(df[i,"COVAR_NOMINAL_5"])) {df[i,"COVAR_NOMINAL_5"] =Mode(df[,"COVAR_NOMINAL_5"])}
if (is.na(df[i,"COVAR_NOMINAL_6"])) {df[i,"COVAR_NOMINAL_6"] =Mode(df[,"COVAR_NOMINAL_6"])}
if (is.na(df[i,"COVAR_NOMINAL_7"])) {df[i,"COVAR_NOMINAL_7"] =Mode(df[,"COVAR_NOMINAL_7"])}
if (is.na(df[i,"COVAR_NOMINAL_8"])) {df[i,"COVAR_NOMINAL_8"] =Mode(df[,"COVAR_NOMINAL_8"])}
}#for
#END OPTIMIZE
dfList[[counter]] <- df
#add to discoveredDf since already substed
discoveredDf[nrow(discoveredDf)+1,]<- c(studyid_subjid)
counter = counter +1;
#for debugging to check progress
if (counter %% 100 == 0)
{
print(counter)
}
}
}
Whilecounter = Whilecounter -1;
}#end while
return (dfList)
}
Thanks
It's likely that performance can be improved in many ways, so long as you use a vectorized function on each column. Currently, you're iterating through each row, and then handling each column separately, which really slows you down. Another improvement is to generalize the code so you don't have to keep typing a new line for each variable. In the examples I'll give below, this is handled because continuous variables are numeric, and categorical are factors.
To get straight to an answer, you can replace your code to be optimized with the following (though fixing variable names) provided that your numeric variables are numeric and ordinal/categorical are not (e.g., factors):
impute <- function(x) {
if (is.numeric(x)) { # If numeric, impute with mean
x[is.na(x)] <- mean(x, na.rm = TRUE)
} else { # mode otherwise
x[is.na(x)] <- names(which.max(table(x)))
}
x
}
# Correct cols_to_impute with names of your variables to be imputed
# e.g., c("COVAR_CONTINUOUS_2", "COVAR_NOMINAL_3", ...)
cols_to_impute <- names(df) %in% c("names", "of", "columns")
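library(purrr) # dmap() was later moved out of purrr into the purrrlyr package; with a newer purrr, load purrrlyr instead
df[, cols_to_impute] <- dmap(df[, cols_to_impute], impute)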
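For readers without dmap() available, a base R sketch of the same column-wise replacement follows. The toy data frame and its column names are made up for illustration; impute() is the function defined above, repeated so the block runs on its own.

```r
impute <- function(x) {
  if (is.numeric(x)) {  # numeric: impute with the mean
    x[is.na(x)] <- mean(x, na.rm = TRUE)
  } else {              # otherwise: impute with the most frequent value
    x[is.na(x)] <- names(which.max(table(x)))
  }
  x
}

# Hypothetical data: two columns to impute and one to leave alone
df <- data.frame(
  y1  = c(1, NA, 3),
  grp = c("a", "a", NA),
  id  = c(1, 2, NA),
  stringsAsFactors = FALSE
)

cols_to_impute <- c("y1", "grp")

# lapply() returns a list of columns, which is assigned back in place
df[cols_to_impute] <- lapply(df[cols_to_impute], impute)
print(df)
```

Selecting the columns by name, rather than with a logical vector, keeps the selection readable when only a few columns are involved.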
Below is a detailed comparison of five approaches:
Your original approach using for to iterate on rows; each column then handled separately.
Using a for loop.
Using lapply().
Using sapply().
Using dmap() from the purrr package.
The new approaches all iterate on the data frame by column and make use of a vectorized function called impute, which imputes missing values in a vector with the mean (if numeric) or the mode (otherwise). Otherwise, their differences are relatively minor (except sapply() as you'll see), but interesting to check.
Here are the utility functions we'll use:
# Function to simulate a data frame of numeric and factor variables with
# missing values and `n` rows
create_dat <- function(n) {
set.seed(13)
data.frame(
con_1 = sample(c(10:20, NA), n, replace = TRUE), # continuous w/ missing
con_2 = sample(c(20:30, NA), n, replace = TRUE), # continuous w/ missing
ord_1 = sample(c(letters, NA), n, replace = TRUE), # ordinal w/ missing
ord_2 = sample(c(letters, NA), n, replace = TRUE) # ordinal w/ missing
)
}
# Function that imputes missing values in a vector with mean (if numeric) or
# mode (otherwise)
impute <- function(x) {
if (is.numeric(x)) { # If numeric, impute with mean
x[is.na(x)] <- mean(x, na.rm = TRUE)
} else { # mode otherwise
x[is.na(x)] <- names(which.max(table(x)))
}
x
}
Now, wrapper functions for each approach:
# Original approach
func0 <- function(d) {
for (i in 1:nrow(d)) {
if (is.na(d[i, "con_1"])) d[i,"con_1"] <- mean(d[,"con_1"], na.rm = TRUE)
if (is.na(d[i, "con_2"])) d[i,"con_2"] <- mean(d[,"con_2"], na.rm = TRUE)
if (is.na(d[i,"ord_1"])) d[i,"ord_1"] <- names(which.max(table(d[,"ord_1"])))
if (is.na(d[i,"ord_2"])) d[i,"ord_2"] <- names(which.max(table(d[,"ord_2"])))
}
return(d)
}
# for loop operates directly on d
func1 <- function(d) {
for(i in seq_along(d)) {
d[[i]] <- impute(d[[i]])
}
return(d)
}
# Use lapply()
func2 <- function(d) {
lapply(d, function(col) {
impute(col)
})
}
# Use sapply()
func3 <- function(d) {
sapply(d, function(col) {
impute(col)
})
}
# Use purrr::dmap()
func4 <- function(d) {
purrr::dmap(d, impute)
}
Now, we'll compare the performance of these approaches with n ranging from 10 to 100 (VERY small):
library(microbenchmark)
ns <- seq(10, 100, by = 10)
times <- sapply(ns, function(n) {
dat <- create_dat(n)
op <- microbenchmark(
ORIGINAL = func0(dat),
FOR_LOOP = func1(dat),
LAPPLY = func2(dat),
SAPPLY = func3(dat),
DMAP = func4(dat)
)
by(op$time, op$expr, function(t) mean(t) / 1000)
})
times <- t(times)
times <- as.data.frame(cbind(times, n = ns))
# Plot the results
library(tidyr)
library(ggplot2)
times <- gather(times, -n, key = "fun", value = "time")
pd <- position_dodge(width = 0.2)
ggplot(times, aes(x = n, y = time, group = fun, color = fun)) +
geom_point(position = pd) +
geom_line(position = pd) +
theme_bw()
It's pretty clear that the original approach is much slower than the new approaches that use the vectorized function impute on each column. What about differences between the new ones? Let's bump up our sample size to check:
ns <- seq(5000, 50000, by = 5000)
times <- sapply(ns, function(n) {
dat <- create_dat(n)
op <- microbenchmark(
FOR_LOOP = func1(dat),
LAPPLY = func2(dat),
SAPPLY = func3(dat),
DMAP = func4(dat)
)
by(op$time, op$expr, function(t) mean(t) / 1000)
})
times <- t(times)
times <- as.data.frame(cbind(times, n = ns))
times <- gather(times, -n, key = "fun", value = "time")
pd <- position_dodge(width = 0.2)
ggplot(times, aes(x = n, y = time, group = fun, color = fun)) +
geom_point(position = pd) +
geom_line(position = pd) +
theme_bw()
Looks like sapply() is not great (as @Martin pointed out). This is because sapply() is doing extra work to get our data into a matrix shape (which we don't need). If you run this yourself without sapply(), you'll see that the remaining approaches are all pretty comparable.
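A small sketch of what that extra work looks like in practice, using a made-up two-column data frame: lapply() keeps each column's type, while sapply() simplifies the result to a single matrix, which here also forces everything to character.

```r
impute <- function(x) {
  if (is.numeric(x)) {
    x[is.na(x)] <- mean(x, na.rm = TRUE)
  } else {
    x[is.na(x)] <- names(which.max(table(x)))
  }
  x
}

# Hypothetical data with one numeric and one character column
d <- data.frame(
  con_1 = c(10, NA, 12),
  ord_1 = c("a", "a", NA),
  stringsAsFactors = FALSE
)

as_list <- lapply(d, impute)    # list of columns, types preserved
as_matrix <- sapply(d, impute)  # simplified to one matrix

print(class(as_list$con_1))
print(is.matrix(as_matrix))
print(typeof(as_matrix))
```

So besides the cost of the conversion, the sapply() result would need to be converted back before it could replace the original columns.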
So the major performance improvement is to use a vectorized function on each column. I suggested using dmap at the beginning because I'm a fan of the functional style and the purrr package generally, but you can comfortably substitute whichever approach you prefer.
As an aside, many thanks to @Martin for the very useful comment that got me to improve this answer!
If you are going to be working with what looks like a matrix, then use a matrix instead of a dataframe, since indexing into a dataframe, like it was a matrix, is very costly. You might want to extract the numerical values to a matrix for part of your calculations. This can provide a significant increase in speed.
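A rough sketch of that suggestion for the numeric columns only, under the assumption that they can be pulled out into a single numeric matrix. The data and column names are invented for the example, and no timing is claimed here.

```r
# Hypothetical data: two numeric columns with NAs and one label column
dat <- data.frame(
  y1 = c(1, NA, 3, 4),
  y2 = c(NA, 10, 20, 30),
  label = c("a", "b", "c", "d"),
  stringsAsFactors = FALSE
)
num_cols <- c("y1", "y2")

# Work on a plain numeric matrix instead of indexing the data frame cell by cell
mat <- as.matrix(dat[num_cols])
col_means <- colMeans(mat, na.rm = TRUE)

# Positions of the missing cells as (row, col) pairs
na_pos <- which(is.na(mat), arr.ind = TRUE)

# Fill each missing cell with the mean of its own column
mat[na_pos] <- col_means[na_pos[, "col"]]

# Write the imputed values back into the data frame
dat[num_cols] <- as.data.frame(mat)
print(dat)
```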
Here is a really simple and fast solution using data.table.
library(data.table)
# name of columns
cols <- c("a", "c")
# impute data
setDT(dt)[, (cols) := lapply(.SD, function(x) ifelse( is.na(x) & is.numeric(x), mean(x, na.rm = T),
ifelse( is.na(x) & is.character(x), names(which.max(table(x))), x))) , .SDcols = cols ]
I haven't compared the performance of this solution to the one provided by @Simon Jackson, but this should be pretty fast.
data from reproducible example
set.seed(25)
dt <- data.table(a=c(1:5,NA,NA,1,1),
b=sample(1:15, 9, replace=TRUE),
c=LETTERS[c(1:6,NA,NA,1)])

error in a Function that tries to remove NAs to calculate a value

I have the following function:
Fisher.test <- function(p) {
Xsq <- -2*sum(log(p), na.rm=TRUE)
p.val <- 1-pchisq(Xsq, df = 2*length(p))
return(p.val)
}
I assumed that na.rm=TRUE would take care of the NAs in my data. However, when I test the function with simple values, the behaviour is not what I expected. For example:
Fisher.test(c(0.1,0.4,0.1,NA))
[1] 0.199279
Fisher.test(c(0.1,0.4,0.1))
[1] 0.08705891
Why don't I get the same result in the first case as in the second? Shouldn't na.rm=TRUE remove the NA?
Many thanks
Because the lengths of those two vectors are different. If you just wanted to filter out NAs you could use sum(!is.na(p)) instead of length(p), but since log can produce a NaN for negative values, which will also get filtered out by your sum, I'd use sum(p >= 0, na.rm = T) instead (or just sum(!is.na(log(p))) to let R figure out the details itself):
Fisher.test <- function(p) {
Xsq <- -2*sum(log(p), na.rm=TRUE)
p.val <- 1-pchisq(Xsq, df = 2*sum(p >= 0, na.rm = T))
return(p.val)
}
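As a quick check of the corrected function, the two calls from the question can be compared directly. The intermediate name n_used is introduced here only for readability.

```r
Fisher.test <- function(p) {
  Xsq <- -2 * sum(log(p), na.rm = TRUE)
  n_used <- sum(p >= 0, na.rm = TRUE)  # p-values that actually enter the sum
  1 - pchisq(Xsq, df = 2 * n_used)
}

with_na <- Fisher.test(c(0.1, 0.4, 0.1, NA))
without_na <- Fisher.test(c(0.1, 0.4, 0.1))

print(with_na)
print(without_na)
```

Both calls now use 6 degrees of freedom, so the NA no longer changes the result.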
