Mean excluding zero and na for all columns with dplyr - r

I want to compute the mean of every column of my dataframe with the dplyr package.
n = c(NA, 3, 5)
s = c("aa", "bb", "cc")
b = c(3, 0, 5)
df = data.frame(n, s, b)
Here I want my function to return mean = 4 for both the n and b columns.
I tried mean(df$n[df$n>0]), but it's not practical for a large dataframe.
I want something like df %>% summarise_each(funs(mean)) ...
Thanks

If you don't want the 0s, you are probably treating them as missing values, so let's make that explicit by turning them into NA and then summarise the numeric columns with na.rm = TRUE:
library(dplyr)
df[df==0] <- NA
summarize_if(df, is.numeric, mean, na.rm = TRUE)
# n b
# 1 4 4
As a one liner:
summarize_if(`[<-`(df, df==0, value= NA), is.numeric, mean, na.rm = TRUE)
and in base R (result as a named numeric vector)
sapply(`[<-`(df, df==0, value= NA)[sapply(df, is.numeric)], mean, na.rm=TRUE)

See also David's elegant answer:
df %>% summarise_each(funs(mean(.[!is.na(.) & . != 0])), -s)
Or
df %>% summarise_each(funs(mean(.[. != 0], na.rm = TRUE)), -s)
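summarise_each() and funs() have since been deprecated in dplyr. A minimal sketch of the same idea with across(), assuming dplyr 1.0.0 or later is installed:

```r
library(dplyr)

# Same example data as in the question
df <- data.frame(
  n = c(NA, 3, 5),
  s = c("aa", "bb", "cc"),
  b = c(3, 0, 5)
)

# For every numeric column: drop zeros by subsetting, drop NAs via na.rm
res <- df %>%
  summarise(across(where(is.numeric), ~ mean(.x[.x != 0], na.rm = TRUE)))

res
#   n b
# 1 4 4
```

The character column s is skipped by where(is.numeric), so it does not need to be excluded by name.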

Related

Find the count of zero values per row in a dataframe

I am trying to count the zeros in each row and then subtract that count from 5.
For example, in Excel: =3-COUNTIF(SM1:SM3,0)
Is there a solution for this?
df <- data.frame("T_1_1"= c(68,NA,0,105,NA,0,135,NA,24),
"T_1_2"=c(26,NA,0,73,NA,97,46,NA,0),
"T_1_3"=c(93,32,NA,103,NA,0,147,NA,139),
"S_2_1"=c(69,67,94,0,NA,136,NA,92,73),
"S_2_2"=c(87,67,NA,120,NA,122,0,NA,79),
"S_2_3"= c(150,0,NA,121,NA,78,109,NA,0),
"T_1_0"= c(79,0,0,NA,98,NA,15,NA,2)
)
df <- df %>% mutate(ltc = (5-rowSums(select(., matches('T_1[1-9]')) == 0,na.rm = TRUE)))
I believe you forgot an underscore in matches().
df %>%
mutate(ltc = 5 - rowSums(select(., matches('T_1_[1-9]')) == 0, na.rm = T))
Here is a base R option using rowSums
df$ltc = 5- rowSums(df == 0, na.rm = TRUE)
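Note that df == 0 in the base R line above looks at every column, not only the ones matched by T_1_[1-9]. If only those columns should count, one possible sketch (the name cols is just illustrative) is:

```r
df <- data.frame(
  T_1_1 = c(68, NA, 0, 105, NA, 0, 135, NA, 24),
  T_1_2 = c(26, NA, 0, 73, NA, 97, 46, NA, 0),
  T_1_3 = c(93, 32, NA, 103, NA, 0, 147, NA, 139),
  S_2_1 = c(69, 67, 94, 0, NA, 136, NA, 92, 73),
  S_2_2 = c(87, 67, NA, 120, NA, 122, 0, NA, 79),
  S_2_3 = c(150, 0, NA, 121, NA, 78, 109, NA, 0),
  T_1_0 = c(79, 0, 0, NA, 98, NA, 15, NA, 2)
)

# Logical vector marking T_1_1, T_1_2 and T_1_3 (T_1_0 does not match [1-9])
cols <- grepl("T_1_[1-9]", names(df))

# Count zeros per row in those columns only; NAs are ignored by na.rm
df$ltc <- 5 - rowSums(df[cols] == 0, na.rm = TRUE)

df$ltc
# 5 5 3 5 5 3 5 5 4
```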

Replace NA values from different columns

R replace problem
I can't replace the NA values in several columns of the dataset with the median of the column each NA is in.
titanic.new is the dataset.
I have tried:
fun3<-function(x)
{
column.numeric<-x[,sapply(x,is.numeric)]
column.numeric[which(is.na(column.numeric))]<-median(column.numeric,na.rm = TRUE)
return(column.numeric)
}
fun3(titanic.new)
I'm getting an error:
Error in median.default(column.numeric, na.rm = TRUE) :
need numeric data
What am I doing wrong?
The error occurs because column.numeric is a data frame here, and median() needs a single numeric vector. We can modify the function: loop through the columns of the dataset and check whether each is numeric, which returns a logical vector ('i1'). Subset the data with that vector, loop through the selected columns with lapply, and replace the NAs in each column with the median of that column.
fun3<-function(x){
i1 <- sapply(x,is.numeric)
x[i1] <- lapply(x[i1], function(y) replace(y, is.na(y), median(y, na.rm = TRUE)))
x
}
fun3(titanic.new)
Or it can be done with tidyverse
library(tidyverse)
titanic.new %>%
mutate_if(is.numeric, list(~ replace(., is.na(.), median(., na.rm = TRUE))))
which can be wrapped in a function as well
fun4 <- function(x) {
x %>%
mutate_if(is.numeric,
list(~ replace(., is.na(.), median(., na.rm = TRUE))))
}
Also, this can be done more compactly with na.aggregate
library(zoo)
i1 <- sapply(titanic.new, is.numeric)
titanic.new[i1] <- na.aggregate(titanic.new[i1], FUN = median)
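Since titanic.new is not shown, here is a small self-contained check of the base R version of fun3. The data frame toy and its column names are hypothetical stand-ins, not the real dataset:

```r
# Hypothetical stand-in for titanic.new: two numeric columns and one character column
toy <- data.frame(
  age  = c(22, NA, 30),
  fare = c(7.25, 8, NA),
  name = c("a", "b", "c")
)

fun3 <- function(x) {
  i1 <- sapply(x, is.numeric)  # TRUE for numeric columns
  # Replace the NAs in each numeric column with that column's own median
  x[i1] <- lapply(x[i1], function(y) replace(y, is.na(y), median(y, na.rm = TRUE)))
  x
}

filled <- fun3(toy)
filled$age   # 22 26 30
filled$fare  # 7.250 8.000 7.625
```

The character column is returned untouched, and each numeric column is filled from its own median rather than from a single overall median.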

impute median plus jitter

I would like to efficiently impute missing values with a slightly different value in each cell.
for example:
df <- data_frame(x = rnorm(100), y = rnorm(100))
df[1:5,1] <- NA
df[1:5, 2] <- NA
df %<>% mutate_all(funs(ifelse(is.na(.), jitter(median(., na.rm = TRUE)), .)))
However, this imputes with the same number in all cells.
How can I add a different noise to each cell?
Of course, I could do this with a loop, but my data frame is huge and I would like to do this efficiently.
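jitter() on a single median returns a single value, which is then recycled into every missing cell. We can use rep with n() to repeat the median once per row so that jitter draws separate noise for each element: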
library(dplyr)
library(magrittr)
df %<>%
mutate_all(list(~ case_when(is.na(.) ~ jitter(rep(median(., na.rm = TRUE), n())),
TRUE ~ .)))
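The same idea can be sketched in base R, drawing noise only for the cells that are actually missing. The helper name impute_jitter is made up for illustration:

```r
set.seed(42)
df <- data.frame(x = rnorm(100), y = rnorm(100))
df[1:5, "x"] <- NA
df[1:5, "y"] <- NA

# Hypothetical helper: fill each NA with the column median plus its own jitter
impute_jitter <- function(col) {
  miss <- is.na(col)
  med <- median(col, na.rm = TRUE)
  # rep() gives jitter() one copy of the median per missing cell,
  # so every cell receives a separate random offset
  col[miss] <- jitter(rep(med, sum(miss)))
  col
}

# Apply column by column while keeping the data frame structure
df[] <- lapply(df, impute_jitter)

head(df)
```

This avoids an explicit loop over cells: the work is vectorised within each column, which should scale reasonably to a large data frame.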

R - maximum value of variables when compared between levels of variable1 grouped by variable2

Consider the following data
set.seed(123)
example.df <- data.frame(
gene = sample(c("A", "B", "C", "D"), 100, replace = TRUE),
treated = sample(c("Yes", "No"), 100, replace = TRUE),
resp=rnorm(100, 10,5), effect = rnorm (100, 25, 5))
I am trying to get the maximum value for all variables when they are compared by the levels of gene and grouped by treated. I can create the gene combinations like so,
combn(sort(unique(example.df$gene)), 2, simplify = T)
#     [,1] [,2] [,3] [,4] [,5] [,6]
#[1,] A    A    A    B    B    C
#[2,] B    C    D    C    D    D
#Levels: A B C D
Edit: The output I am looking for is a dataframe like this
comparison  group  max.resp  max.effect
A-B         no     value1    value2
....
C-D         no     valueX    valueY
A-B         yes    value3    value4
....
C-D         yes    valueXX   valueYY
While I am able to get the max values for each individual gene level grouped by treated...
max.df <- example.df %>%
group_by(treated, gene) %>%
nest() %>%
mutate(mod = map(data, ~summarise_if(.x, is.numeric, max, na.rm = TRUE))) %>%
select(treated, gene, mod) %>%
unnest(mod) %>%
arrange(treated, gene)
Despite trying to tackle the issue for more than a day, I cannot figure out how to get the max for each numeric variable for each 2 level gene comparison (A vs B, A vs C, A vs D, B vs C, B vs D, and C vs D) grouped by treated.
Any help is appreciated. Thanks.
I found a solution. It might be a little messy and I may tidy it up later, but it runs almost instantly.
library(tidyverse)
First I generate a dataframe with two columns, Gen1 and Gen2, for all possible comparisons. This is very similar to your use of combn, but it creates a data.frame:
GeneComp <- expand.grid(Gen1 = unique(example.df$gene), Gen2 = unique(example.df$gene)) %>% filter(Gen1 != Gen2) %>% arrange(Gen1)
Then I loop through it, filtering to each pair of genes and summarising by treated:
Comps <- list()
for(i in 1:nrow(GeneComp)){
  Comps[[i]] <- example.df %>%
    filter(gene == GeneComp[i,]$Gen1 | gene == GeneComp[i,]$Gen2) %>% # keep only rows whose gene is in the ith pair
    group_by(treated) %>% # then group by treated
    summarise_if(is.numeric, max) %>% # then take the max of each numeric column
    mutate(Comparison = paste(GeneComp[i,]$Gen1, GeneComp[i,]$Gen2, sep = "-")) # and label the comparison
}
Comps <- bind_rows(Comps) # and finally combine into one data frame
Let me know if it does everything you want.
Addition, to get each comparison only once:
It is important here that your genes are strings and not factors, so you might have to do this:
options(stringsAsFactors = FALSE)
example.df <- data.frame(
gene = c(sample(c("A", "B", "C", "D"), 100, replace = TRUE)),
treated = sample(c("Yes", "No"), 100, replace = TRUE),
resp=rnorm(100, 10,5), effect = rnorm (100, 25, 5))
Then again, in expand.grid, add the stringsAsFactors = F argument:
GeneComp <- expand.grid(Gen1 = unique(example.df$gene), Gen2 = unique(example.df$gene), stringsAsFactors = F) %>% filter(Gen1 != Gen2) %>% arrange(Gen1)
Having strings lets you sort both gene names before pasting them into the Comparison variable inside the loop. Each pair still appears twice (A-B and B-A both become A-B), but the duplicated rows are identical, so calling distinct() at the end leaves one row per comparison and treatment:
Comps <- list()
for(i in 1:nrow(GeneComp)){
  Comps[[i]] <- example.df %>%
    filter(gene == GeneComp[i,]$Gen1 | gene == GeneComp[i,]$Gen2) %>% # keep only rows whose gene is in the ith pair
    group_by(treated) %>% # then group by treated
    summarise_if(is.numeric, max) %>% # then take the max of each numeric column
    mutate(Comparison = paste(sort(c(GeneComp[i,]$Gen1, GeneComp[i,]$Gen2)), collapse = "-")) # label with the sorted pair
}
Comps <- bind_rows(Comps) %>% distinct() # combine into one data frame and drop the duplicated rows
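As an alternative sketch, the pairs from combn can be used directly, which yields each comparison once and so needs no distinct() step. This version uses only base R; the output column names follow the layout requested in the question, and the helper variable names are illustrative:

```r
set.seed(123)
example.df <- data.frame(
  gene = sample(c("A", "B", "C", "D"), 100, replace = TRUE),
  treated = sample(c("Yes", "No"), 100, replace = TRUE),
  resp = rnorm(100, 10, 5),
  effect = rnorm(100, 25, 5),
  stringsAsFactors = FALSE
)

# One character vector of length 2 per unordered gene pair (A-B, A-C, ...)
pairs <- combn(sort(unique(example.df$gene)), 2, simplify = FALSE)

Comps <- do.call(rbind, lapply(pairs, function(p) {
  sub <- example.df[example.df$gene %in% p, ]  # rows for the two genes in this pair
  # Max of each numeric variable within each treatment group
  out <- aggregate(cbind(resp, effect) ~ treated, data = sub, FUN = max)
  names(out) <- c("group", "max.resp", "max.effect")
  cbind(comparison = paste(p, collapse = "-"), out)
}))

Comps
```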

Add a column with count of NAs and Mean

I have a data frame and I need to add another column to it which shows the count of NAs in all the other columns for that row and also the mean of the non-NA values.
I think it can be done in dplyr.
> df1 <- data.frame(a = 1:5, b = c(1,2,NA,4,NA), c = c(NA,2,3,NA,NA))
> df1
  a  b  c
1 1  1 NA
2 2  2  2
3 3 NA  3
4 4  4 NA
5 5 NA NA
I want to mutate another column which counts the number of NAs in that row and another column which shows the mean of all the NON-NA values in that row.
library(dplyr)
count_na <- function(x) sum(is.na(x))
df1 %>%
mutate(means = rowMeans(., na.rm = T),
count_na = apply(., 1, count_na))
#### ANSWER FOR RADEK ####
selected_cols <- c('b', 'c')
df1 %>%
  mutate(means = rowMeans(.[selected_cols], na.rm = T),
         count_na = apply(.[selected_cols], 1, count_na))
As mentioned here https://stackoverflow.com/a/37732069/2292993
df1 <- data.frame(a = 1:5, b = c(1,2,NA,4,NA), c = c(NA,2,3,NA,NA))
df1 %>%
mutate(means = rowMeans(., na.rm = T),
count_na = rowSums(is.na(.)))
To work on selected columns (here the NA count uses only col a and col c):
df1 %>%
mutate(means = rowMeans(., na.rm = T),
count_na = rowSums(is.na(select(.,one_of(c('a','c'))))))
You can try this:
# Remember the original columns so the new columns do not feed into each other
cols <- names(df1)
# Find the row mean and add it to a new column in the dataframe
df1$Mean <- rowMeans(df1[cols], na.rm = TRUE)
# Find the count of NA and add it to a new column in the dataframe
df1$CountNa <- rowSums(is.na(df1[cols]))
I recently faced a variation on this question where I needed to compute the percent of complete values, but for specific variables (not all variables). Here is an approach that worked for me.
df1 %>%
# create dummy variables representing if the observation is missing ----
# can modify here for specific variables ----
mutate_all(list(dummy = is.na)) %>%
# compute a row wise sum of missing ----
rowwise() %>%
mutate(
# number of missing observations ----
n_miss = sum(c_across(matches("_dummy"))),
# percent of observations that are complete (non-missing) ----
pct_complete = 1 - mean(c_across(matches("_dummy")))
) %>%
# remove grouping from rowwise ----
ungroup() %>%
# remove dummy variables ----
dplyr::select(-matches("dummy"))
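For comparison, a base R sketch of the same per-row missingness summary, restricted to specific variables. The choice of b and c in vars is only an example:

```r
df1 <- data.frame(a = 1:5, b = c(1, 2, NA, 4, NA), c = c(NA, 2, 3, NA, NA))

vars <- c("b", "c")        # hypothetical choice of variables to inspect
miss <- is.na(df1[vars])   # logical matrix: TRUE where a value is missing

df1$n_miss <- rowSums(miss)              # number of missing values per row
df1$pct_complete <- 1 - rowMeans(miss)   # proportion of non-missing values per row

df1$n_miss        # 1 0 1 1 2
df1$pct_complete  # 0.5 1.0 0.5 0.5 0.0
```

Because rowSums and rowMeans work on the whole logical matrix at once, no rowwise() grouping or temporary dummy columns are needed.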
