Summing NAs across all columns by group in dplyr [duplicate]

I have a grouped data frame with some NA values in all columns.
id <- rep(c("a", "b", "c"), 3)
x1 <- c(1, NA, NA, 2, 2, NA, 0, NA, 0)
x2 <- c(1, 2, 3, NA, 12, NA, NA, 4, NA)
df <- cbind.data.frame(id, x1, x2)
I want to group by ID and then summarize the number of NAs across all numeric columns. The resulting data frame should have 3 rows (1 for each ID) and 2 columns (x1 and x2) and should contain the sums of NAs in both columns by ID.

library(dplyr)
df %>%
  group_by(id) %>%
  summarise(across(c(x1, x2), ~ sum(is.na(.x))))
or, with aggregate:
aggregate(list(x1 = df$x1, x2 = df$x2), by = list(id = df$id), function(x) sum(is.na(x)))
output
id x1 x2
<chr> <int> <int>
1 a 0 2
2 b 2 0
3 c 2 2

Using rowsum in base R
rowsum(+(is.na(df[-1])), df$id, na.rm = TRUE)
x1 x2
a 0 2
b 2 0
c 2 2
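
The rowsum call above relies on two coercions that are easy to miss: is.na() on a data frame returns a logical matrix, and the unary + turns it into a 0/1 integer matrix that rowsum() can add up by group. Below is a minimal base R sketch that spells out those steps and picks the numeric columns programmatically instead of using df[-1]. The intermediate names num_cols, na_flags and na_counts are illustrative and do not come from the answers.

```r
id <- rep(c("a", "b", "c"), 3)
x1 <- c(1, NA, NA, 2, 2, NA, 0, NA, 0)
x2 <- c(1, 2, 3, NA, 12, NA, NA, 4, NA)
df <- data.frame(id, x1, x2)

# Pick the numeric columns instead of hard-coding df[-1]
# (num_cols is an illustrative name, not from the answers)
num_cols <- vapply(df, is.numeric, logical(1))

# Logical matrix: TRUE where a value is missing
na_flags <- is.na(df[num_cols])

# Unary plus coerces TRUE/FALSE to 1/0, then rowsum adds them up per id
na_counts <- rowsum(+na_flags, group = df$id)
na_counts
```

The result is a matrix with one row per id and one column per numeric variable, which can be wrapped in as.data.frame() if a data frame is needed.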

Related

Sum While melting columns in R

Is there a way to melt 2 columns and take their sums as the value? For example:
df <- data.frame(A = c("x", "y", "z"), B = c(1, 2, 3), Cat1 = c(1, 4, 3), New2 = c(4, 4, 4))
Expected output
New_Col Sum
Cat1 8
New2 12
Or, in base R, apply colSums to the columns of interest and then convert the resulting named vector to a data.frame with stack
stack(colSums(df[c("Cat1", "New2")]))[2:1]
ind values
1 Cat1 8
2 New2 12
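
The stack output uses the default column names ind and values. If the names from the expected output are wanted, one possible follow-up is to rename the columns with setNames. This is a sketch, assuming the column names New_Col and Sum from the question; col_sums and result are illustrative names.

```r
df <- data.frame(A = c("x", "y", "z"), B = c(1, 2, 3),
                 Cat1 = c(1, 4, 3), New2 = c(4, 4, 4))

# Named numeric vector of column sums
col_sums <- colSums(df[c("Cat1", "New2")])

# stack() gives columns values/ind; reorder them and rename to match the question
result <- setNames(stack(col_sums)[2:1], c("New_Col", "Sum"))
result
```

Note that stack() returns the column names as a factor; wrap New_Col in as.character() if a character column is preferred.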
With dplyr and tidyr, sum the columns and then pivot to long format
library(dplyr)
library(tidyr)
df %>%
  summarise(across(c(Cat1, New2), sum)) %>%
  pivot_longer(everything(), names_to = 'New_Col', values_to = 'Sum')
# A tibble: 2 × 2
New_Col Sum
<chr> <dbl>
1 Cat1 8
2 New2 12

How to sum across rows with all NAs to be 0/NA

I have a dataframe:
dat <- data.frame(X1 = c(0, NA, NA),
X2 = c(1, NA, NA),
X3 = c(1, NA, NA),
Y1 = c(1, NA, NA),
Y2 = c(NA, NA, NA),
Y3 = c(0, NA, NA))
I want to create a composite score for X and Y variables. This is what I have so far:
clean_dat <- dat %>%
  rowwise() %>%
  mutate(X = sum(c(X1, X2, X3), na.rm = TRUE),
         Y = sum(c(Y1, Y2, Y3), na.rm = TRUE))
However, I want the composite score for the rows with all NAs (i.e. rows 2 and 3) to be 0 in the column X and Y. Does anyone know how to do this?
Edit: I'd like to know how I can make X and Y in rows 2 and 3 NA too.
Thanks so much!
With na.rm = TRUE, sum and rowSums return 0 when every element is NA, which already gives the 0 asked for originally. To get NA instead, use an if/else or case_when approach: check with if_any whether the row has any non-NA element, and take the rowSums of the relevant columns only in that case. Rows that match no condition in case_when default to NA.
library(dplyr)
dat %>%
  mutate(X = case_when(if_any(starts_with('X'), complete.cases) ~
                         rowSums(across(starts_with('X')), na.rm = TRUE)),
         Y = case_when(if_any(starts_with('Y'), complete.cases) ~
                         rowSums(across(starts_with('Y')), na.rm = TRUE)))
-output
X1 X2 X3 Y1 Y2 Y3 X Y
1 0 1 1 1 NA 0 2 1
2 NA NA NA NA NA NA NA NA
3 NA NA NA NA NA NA NA NA
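
The behaviour described in the answer can also be checked without dplyr. The following is a rough base R sketch for the X columns only, assuming the same data as in the question; x_cols and all_na are illustrative names.

```r
# With na.rm = TRUE, the sum of an all-NA vector is 0
sum(c(NA, NA, NA), na.rm = TRUE)

dat <- data.frame(X1 = c(0, NA, NA),
                  X2 = c(1, NA, NA),
                  X3 = c(1, NA, NA))
x_cols <- c("X1", "X2", "X3")

# Rows in which every X column is missing
all_na <- rowSums(!is.na(dat[x_cols])) == 0

# Keep the row sum where at least one value exists, otherwise NA
dat$X <- ifelse(all_na, NA, rowSums(dat[x_cols], na.rm = TRUE))
dat
```

Dropping the ifelse() and keeping only rowSums(..., na.rm = TRUE) gives the 0 variant from the original question.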

How can I select columns based on two conditions? [duplicate]

I have a data frame with lots of columns. For example:
sample treatment col5 col6 col7
1 a 3 0 5
2 a 1 0 3
3 a 0 0 2
4 b 0 1 1
I want to select the sample and treatment columns plus all columns that meet the following 2 conditions:
Their value on the row in which treatment == 'b' is 0
Their value from at least one row where treatment == 'a' is not 0.
The expected result should look like this:
sample treatment col5
1 a 3
2 a 1
3 a 0
4 b 0
Example dataframe:
df <- structure(list(sample = 1:4, treatment = structure(c(1L, 1L,
1L, 2L), .Label = c("a", "b"), class = "factor"), col5 = c(3,
1, 0, 0), col6 = c(0, 0, 0, 1), col7 = c(5, 3, 2, 1)), class = "data.frame", row.names = c(NA,
-4L))
Here's a way in base R -
cs_a <- colSums(df[df$treatment == "a",-c(1:2)]) > 0
cs_b <- colSums(df[df$treatment == "b",-c(1:2)]) == 0
df[, c(TRUE, TRUE, cs_a & cs_b)]
sample treatment col5
1 1 a 3
2 2 a 1
3 3 a 0
4 4 b 0
With dplyr -
df %>%
select_at(which(c(TRUE, TRUE, cs_a & cs_b)))
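
The colSums approach assumes the values are non-negative, since positive and negative entries could cancel out. As a hedged alternative, the two conditions from the question can be written out literally per column with vapply. This is a sketch in base R; value_cols and keep_col are illustrative names, and the data frame is rebuilt here so the example is self-contained.

```r
df <- data.frame(sample = 1:4,
                 treatment = factor(c("a", "a", "a", "b")),
                 col5 = c(3, 1, 0, 0),
                 col6 = c(0, 0, 0, 1),
                 col7 = c(5, 3, 2, 1))

# Candidate columns: everything except the two identifier columns
value_cols <- setdiff(names(df), c("sample", "treatment"))

# For each column: all values in treatment "b" are 0
# and at least one value in treatment "a" is not 0
keep_col <- vapply(df[value_cols], function(x) {
  all(x[df$treatment == "b"] == 0) && any(x[df$treatment == "a"] != 0)
}, logical(1))

result <- df[c("sample", "treatment", value_cols[keep_col])]
result
```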
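Here is a much more verbose tidyverse approach that does not require a manual colSums for each level of treatment. Note that it keeps the columns that contain at least one zero in every treatment level, which gives the expected result for this example data but is not identical to the two conditions in the question: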
library(dplyr)
library(purrr)
library(tidyr)
sample <- 1:4
treatment <- c("a", "a", "a", "b")
col5 <- c(3,1,0,0)
col6 <- c(0,0,0,1)
col7 <- c(5,3,2,1)
dd <- data.frame(sample, treatment, col5, col6, col7)
# first create new columns that report whether the entries are zero
dd2 <- mutate_if(
.tbl = dd,
.predicate = is.numeric,
.funs = function(x)
x == 0
)
# then find the sum per column and per treatment group
# in R TRUE = 1 and FALSE = 0
number_of_zeros <- dd2 %>%
group_by(treatment) %>%
summarise_at(.vars = vars(col5:col7), .funs = "sum")
# then find the names of the columns you want to keep
keeper_columns <-
number_of_zeros %>%
select(-treatment) %>% # remove the treatment grouping variable
map_dfr( # check whether all entries per column (now one per treatment level) are greater than zero
.x = .,
.f = function(x)
all(x > 0)
) %>%
gather(column, keeper) %>% # reformat
filter(keeper == TRUE) %>% # to grab the keepers
select(column) %>% # then select the column with column names
unlist %>% # and convert to character vector
unname
# subset the original dataset for the wanted columns
wanted_columns <- dd %>% select(1:2, all_of(keeper_columns))

Remove columns from a dataframe based on number of rows with valid values

I have a dataframe:
df = data.frame(gene = c("a", "b", "c", "d", "e"),
value1 = c(NA, NA, NA, 2, 1),
value2 = c(NA, 1, 2, 3, 4),
value3 = c(NA, NA, NA, NA, 1))
I would like to keep the first column (gene) plus every column that has at least 2 valid (i.e., non-NA) values. How do I do this?
I am thinking something like this ...
df1 = df %>% select_if(function(.) ...)
Thanks
We can count the non-NA elements in each column and use that as a logical condition to select the columns of interest
library(dplyr)
df1 <- df %>%
  select_if(~ sum(!is.na(.)) >= 2)
df1
# gene value1 value2
#1 a NA NA
#2 b NA 1
#3 c NA 2
#4 d 2 3
#5 e 1 4
Or another option is keep from purrr
library(purrr)
keep(df, ~ sum(!is.na(.x)) >= 2)
Or express the condition as a proportion of the rows (2 of 5 rows is 0.4)
df %>%
  select_if(~ mean(!is.na(.)) >= 0.4)
Or use Filter from base R
Filter(function(x) sum(!is.na(x)) >= 2, df)
We can use colSums in base R to count the non-NA values per column
df[colSums(!is.na(df)) >= 2]
# gene value1 value2
#1 a NA NA
#2 b NA 1
#3 c NA 2
#4 d 2 3
#5 e 1 4
Or using apply
df[apply(!is.na(df), 2, sum) >= 2]
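
If the threshold needs to vary, the column count can be wrapped in a small function. The sketch below follows the question's wording of "at least 2" valid values; keep_valid_columns and its min_valid argument are hypothetical names introduced here, not part of any package.

```r
df <- data.frame(gene = c("a", "b", "c", "d", "e"),
                 value1 = c(NA, NA, NA, 2, 1),
                 value2 = c(NA, 1, 2, 3, 4),
                 value3 = c(NA, NA, NA, NA, 1))

# Hypothetical helper: keep columns with at least `min_valid` non-NA values
keep_valid_columns <- function(data, min_valid = 2) {
  n_valid <- colSums(!is.na(data))
  data[n_valid >= min_valid]
}

df1 <- keep_valid_columns(df, min_valid = 2)
names(df1)
```

The gene column is kept automatically here because it has no missing values; a column of identifiers that may contain NA would need to be handled separately.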

How to convert diagonal rows into single row in R? [duplicate]

I have a dataset1 which is as follows:
dataset1 <- data.frame(
id1 = c(1, 1, 1, 2, 2, 2),
id2 = c(122, 122, 122, 133, 133, 133),
num1 = c(1, NA, NA, 50,NA, NA),
num2 = c(NA, 2, NA, NA, 45, NA),
num3 = c(NA, NA, 3, NA, NA, 4)
)
How to convert multiple rows into a single row?
The desired output is:
id1, id2, num1, num2, num3
1 122 1 2 3
2 133 50 45 4
library(dplyr)
dataset1 %>%
  group_by(id1, id2) %>%
  summarise(across(everything(), ~ .x[!is.na(.x)])) %>%
  as.data.frame()
# id1 id2 num1 num2 num3
# 1 1 122 1 2 3
# 2 2 133 50 45 4
Note: This assumes there is exactly 1 non-NA item per group in each column.
Using data.table
library(data.table)
data.table(dataset1)[, lapply(.SD, sum, na.rm = TRUE), by = c("id1", "id2")]
# id1 id2 num1 num2 num3
#1: 1 122 1 2 3
#2: 2 133 50 45 4
You can use dplyr to achieve that:
library(dplyr)
dataset1 %>%
  group_by(id1, id2) %>%
  mutate(
    num1 = sum(num1, na.rm = TRUE),
    num2 = sum(num2, na.rm = TRUE),
    num3 = sum(num3, na.rm = TRUE)
  ) %>%
  distinct()
Note that this sums repeated values: if id1 = 1 had two non-NA values for num1, the result would be their sum. If every id has at most one non-NA value for each of num1 to num3, this makes no difference.
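
When summing repeated values is not wanted, one possible base R alternative is to take the first non-NA value per group with aggregate. This is a sketch rather than one of the original answers; first_non_na and collapsed are illustrative names.

```r
dataset1 <- data.frame(
  id1 = c(1, 1, 1, 2, 2, 2),
  id2 = c(122, 122, 122, 133, 133, 133),
  num1 = c(1, NA, NA, 50, NA, NA),
  num2 = c(NA, 2, NA, NA, 45, NA),
  num3 = c(NA, NA, 3, NA, NA, 4)
)

# Hypothetical helper: first non-NA value, or NA if there is none
first_non_na <- function(x) x[!is.na(x)][1]

# One row per (id1, id2) pair, applying the helper to each num column
collapsed <- aggregate(dataset1[c("num1", "num2", "num3")],
                       by = dataset1[c("id1", "id2")],
                       FUN = first_non_na)
collapsed
```

Unlike the sum-based versions, a group whose column is entirely NA yields NA here instead of 0.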
