Conditional aggregating by column pairs in R

UPDATE: I've updated the example because it wasn't clear enough.
I am trying to aggregate columns of a data frame in R based on a condition.
My dataframe looks like this:
df <- data.frame(year = rep(2005, 8),
id = 1:8,
crash_x = c(0, 2, 0, 0, 4, 0,1,2),
crash_y = c(1, 0, 0, 0, 0, 1,0,0),
crash_z = c(0, 0, 3, 1, 0, 0,0,0),
injured_x = c(0, 1, 0, 0, 3, 0,0,0),
injured_y = c(0, 0, 2, 1, 0, 0,1,2),
injured_z = c(3, 0, 0, 0, 0, 2, 0,0))
year id crash_x crash_y crash_z injured_x injured_y injured_z
2005 1 0 1 0 0 0 3
2005 2 2 0 0 1 0 0
2005 3 0 0 3 0 2 0
2005 4 0 0 1 0 1 0
2005 5 4 0 0 3 0 0
2005 6 0 1 0 0 0 2
2005 7 1 0 0 0 1 0
2005 8 2 0 0 0 2 0
I would like to sum the rows in which the same crash_ and injured_ columns hold numbers greater than 0, i.e. rows that share the same pattern of zero and non-zero values: rows 1 and 6, rows 3 and 4, rows 2 and 5, and rows 7 and 8.
The output should look like:
year crash_x crash_y crash_z injured_x injured_y injured_z
2005 0 2 0 0 0 5
2005 6 0 0 4 0 0
2005 0 0 4 0 3 0
2005 3 0 0 0 3 0
Is this possible? Thanks!

This solution first creates a new column with the "pattern" of 0 and non-0 values:
df <- data.frame(year = rep(2005, 8),
id = 1:8,
crash_x = c(0, 2, 0, 0, 4, 0,1,2),
crash_y = c(1, 0, 0, 0, 0, 1,0,0),
crash_z = c(0, 0, 3, 1, 0, 0,0,0),
injured_x = c(0, 1, 0, 0, 3, 0,0,0),
injured_y = c(0, 0, 2, 1, 0, 0,1,2),
injured_z = c(3, 0, 0, 0, 0, 2, 0,0))
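library(dplyr)
library(tidyr)
library(magrittr) # provides the %<>% assignment pipe
df %<>% unite("pattern", c(crash_x, crash_y, crash_z, injured_x, injured_y, injured_z), remove = FALSE) %>%
# collapse every non-zero count (including multi-digit ones) to "1"
mutate(pattern = gsub("[1-9][0-9]*", "1", pattern))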
Then summarizes each column according to pattern group with dplyr:
df %>% group_by(pattern, year) %>%
summarise_at(vars(crash_x, crash_y, crash_z, injured_x, injured_y, injured_z), sum)
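The same grouping idea can also be sketched in base R without any packages. This is only an illustration under the assumption that every column starting with crash_ or injured_ should take part in the pattern; the names value_cols, pattern and result are made up for this example.

```r
# Same data as in the question.
df <- data.frame(year = rep(2005, 8),
                 id = 1:8,
                 crash_x = c(0, 2, 0, 0, 4, 0, 1, 2),
                 crash_y = c(1, 0, 0, 0, 0, 1, 0, 0),
                 crash_z = c(0, 0, 3, 1, 0, 0, 0, 0),
                 injured_x = c(0, 1, 0, 0, 3, 0, 0, 0),
                 injured_y = c(0, 0, 2, 1, 0, 0, 1, 2),
                 injured_z = c(3, 0, 0, 0, 0, 2, 0, 0))

# Columns that take part in the zero / non-zero pattern (assumed naming scheme).
value_cols <- grep("^(crash|injured)_", names(df), value = TRUE)

# One key per row: "1" where the column is non-zero, "0" otherwise.
pattern <- apply(df[value_cols] > 0, 1,
                 function(r) paste(as.integer(r), collapse = ""))

# Sum every value column within each (pattern, year) group.
result <- aggregate(df[value_cols],
                    by = list(pattern = pattern, year = df$year),
                    FUN = sum)
result
```

Rows 2 and 5 share the key "100100", so their crash_x and injured_x values are added together; the other three pairs are handled the same way.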

The easiest way is to reshape the data (this variant uses reshape2 together with base R's aggregate):
library(reshape2)
d <- read.table(text = "year id crash_x crash_y crash_z injured_x injured_y injured_z
2005 1 0 1 0 0 0 3
2005 2 2 0 0 1 0 0
2005 3 0 0 3 0 2 0
2005 4 0 0 1 0 1 0
2005 5 4 0 0 3 0 0
2005 6 0 1 0 0 0 2", header = TRUE, stringsAsFactors = FALSE)
want <- melt(subset(d, select = -id), id.vars = "year", variable.name = "crash", value.name = "val")
want$postfix <- gsub("(^crash_)|(^injured_)", "", want$crash)
want <- aggregate(val ~ crash + year + postfix, want, sum)
dcast(want, year + postfix ~ crash, value.var = "val", fill = 0)
# year postfix crash_x crash_y crash_z injured_x injured_y injured_z
#1 2005 x 6 0 0 4 0 0
#2 2005 y 0 2 0 0 3 0
#3 2005 z 0 0 4 0 0 5

Related

Counts of factor levels for multiple variables grouped by row

I want to count how many times a specific factor level occurs across multiple factor variables, per row.
Simplified, I want to know how many times each factor level is chosen across specific variables per row (MemID).
Example data:
results=data.frame(MemID=c('A','B','C','D','E','F','G','H'),
value_a = c(1,2,1,4,5,1,4,0),
value_b = c(1,5,2,3,4,1,0,3),
value_c = c(3,5,2,1,1,1,2,1)
)
In this example, I want to know the frequency of each factor level for value_a and value_b for each MemID. How many times does A respond 1? How many times does A respond 2? And so on for each level and each MemID, but only for value_a and value_b.
I would like the output to look something like this:
counts_by_level = data.frame(MemID=c('A','B','C','D','E','F','G','H'),
count_1 = c(2, 0, 1, 0, 0, 2, 0, 0),
count_2 = c(0, 1, 1, 0, 0, 0, 0, 0),
count_3 = c(0, 0, 0, 1, 0, 0, 0, 1),
count_4 = c(0, 0, 0, 1, 1, 0, 1, 0),
count_5 = c(0, 1, 0, 0, 1, 0, 0, 0))
I have been trying to use add_count or add_tally as well as table and searching other ways to answer this question. However, I am struggling to identify specific factor levels across multiple variables and then output new columns for the counts of those levels for each row.
You could do something like this. Note that you didn't include a zero count, but there are some zero selections.
library(tidyverse)
results |>
select(-value_c) |>
pivot_longer(cols = starts_with("value"),
names_pattern = "(value)") |>
mutate(count = 1) |>
select(-name) |>
pivot_wider(names_from = value,
values_from = count,
names_prefix = "count_",
values_fill = 0,
values_fn = sum)
#> # A tibble: 8 x 7
#> MemID count_1 count_2 count_5 count_4 count_3 count_0
#> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 A 2 0 0 0 0 0
#> 2 B 0 1 1 0 0 0
#> 3 C 1 1 0 0 0 0
#> 4 D 0 0 0 1 1 0
#> 5 E 0 0 1 1 0 0
#> 6 F 2 0 0 0 0 0
#> 7 G 0 0 0 1 0 1
#> 8 H 0 0 0 0 1 1
Another solution:
results %>%
group_by(MemID, value_a, value_b) %>%
summarise(n=n()) %>%
pivot_longer(c(value_a,value_b)) %>%
group_by(MemID, value) %>%
summarise(n=sum(n)) %>%
pivot_wider(id_cols = MemID,
names_from = value, names_sort = TRUE, names_prefix = "count_",
values_from = n, values_fn = sum, values_fill = 0)
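For comparison, here is a minimal base R sketch of the same per-row count. It assumes the levels of interest are 1 to 5, as in the desired output from the question (so zero selections are not counted); lvls and counts are names chosen only for this illustration.

```r
results <- data.frame(MemID = c("A", "B", "C", "D", "E", "F", "G", "H"),
                      value_a = c(1, 2, 1, 4, 5, 1, 4, 0),
                      value_b = c(1, 5, 2, 3, 4, 1, 0, 3),
                      value_c = c(3, 5, 2, 1, 1, 1, 2, 1))

lvls <- 1:5  # levels to count (assumption: 0 is deliberately left out)

# For each row, tabulate value_a and value_b against the fixed set of levels.
counts <- t(apply(results[c("value_a", "value_b")], 1,
                  function(r) table(factor(r, levels = lvls))))
colnames(counts) <- paste0("count_", lvls)

counts_by_level <- data.frame(MemID = results$MemID, counts)
counts_by_level
```

Fixing the levels up front with factor() guarantees one column per level in a stable order, even for levels that never occur.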

How do you make a new factor column based on other columns in r?

I have a data set that looks like this
ID Group 1 Group 2 Group 3 Group 4
1 1 0 1 0
2 0 1 1 1
3 1 1 0 0
.
.
.
100 0 1 0 1
I want to make another column, let's say Group 5, such that if Group 1 is 1 then Group 5 is 1; if Group 2 = 1, then Group 5 = 2; if Group 3 = 1, then Group 5 = 3; and if Group 4 = 1, then Group 5 = 4. How do I do this?
I tried these lines of code, but I seem to be missing something.
Group5 <- data.frame(Group1, Group2, Group3, Group4, stringsAsFactors=FALSE)
df$Group5 <- with(finalmerge, ifelse(Group1 %in% c("1", "0"),
"1", ""))
Any advice would be helpful, thanks in advance.
You could use which.max(), and apply this to each row.
df["Group_5"] <- apply(df[, -1], 1, which.max)
Output:
ID Group_1 Group_2 Group_3 Group_4 Group_5
1 1 0 0 0 1 4
2 2 0 1 0 0 2
3 3 0 0 1 0 3
4 4 1 0 0 0 1
Input:
df = structure(list(ID = c(1, 2, 3, 4), Group_1 = c(0, 0, 0, 1), Group_2 = c(0,
1, 0, 0), Group_3 = c(0, 0, 1, 0), Group_4 = c(1, 0, 0, 0)), class = "data.frame", row.names = c(NA,
-4L))
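The rows in the question can contain several 1s, and a row could in principle contain none. The sketch below shows one way to make both cases explicit with base R's max.col(). It assumes that the first matching group should win and that an all-zero row should get NA; both rules are assumptions, not something the question states. The fourth row is invented to show the all-zero case.

```r
df <- data.frame(ID = 1:4,
                 Group_1 = c(1, 0, 1, 0),
                 Group_2 = c(0, 1, 1, 0),
                 Group_3 = c(1, 1, 0, 0),
                 Group_4 = c(0, 1, 0, 0))

group_cols <- grep("^Group_", names(df), value = TRUE)
m <- as.matrix(df[group_cols])

# Index of the first column holding the row maximum, i.e. the first 1.
df$Group_5 <- max.col(m, ties.method = "first")

# A row without any 1 has no meaningful group, so mark it as missing.
df$Group_5[rowSums(m) == 0] <- NA
df
```

Note that max.col() defaults to ties.method = "random", so the method has to be set explicitly to get a reproducible result.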

how to add condition to mutate(across

I have a df and I would like to calculate a percentage (.x/.x[1] * 100) for rows where row_number() > 2, but only in columns whose first row is not 0. How can I do this with mutate(across(...))? Where and how can I add the condition .x[1] != 0?
mutate(across(.fns = ~ifelse(row_number() > 2 ... sprintf("%1.0f (%.2f%%)", .x, .x/.x[1] * 100), .x)))
df<-structure(list(Total = c(4, 2, 1, 1, 0, 0), `ELA` = c(0,
0, 0, 0, 0, 0), `Math` = c(4, 2, 1, 1, 0,
0), `PE` = c(0, 0, 0, 0, 0, 0)), row.names = c(NA,
-6L), class = c("tbl_df", "tbl", "data.frame"))
library(dplyr)
df %>%
mutate(across(
where(~.x[1] > 0),
~ifelse(
row_number() > 2,
sprintf("%1.0f (%.2f%%)", .x, .x/.x[1] * 100),
.x
)))
# # A tibble: 6 × 4
# Total ELA Math PE
# <chr> <dbl> <chr> <dbl>
# 1 4 0 4 0
# 2 2 0 2 0
# 3 1 (25.00%) 0 1 (25.00%) 0
# 4 1 (25.00%) 0 1 (25.00%) 0
# 5 0 (0.00%) 0 0 (0.00%) 0
# 6 0 (0.00%) 0 0 (0.00%) 0
Have a look at the ?across help page for more examples.
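To see what the per-column logic does in isolation, the same rule can be sketched as a plain function and applied with lapply(). fmt_pct is a hypothetical helper name used only here, and the sketch tests for a first value of exactly 0, as asked in the question.

```r
df <- data.frame(Total = c(4, 2, 1, 1, 0, 0),
                 ELA   = c(0, 0, 0, 0, 0, 0),
                 Math  = c(4, 2, 1, 1, 0, 0),
                 PE    = c(0, 0, 0, 0, 0, 0))

# Hypothetical helper: format rows 3 onward as "value (percent of first row)",
# but leave the column untouched when its first value is 0.
fmt_pct <- function(x) {
  if (x[1] == 0) return(x)
  ifelse(seq_along(x) > 2,
         sprintf("%1.0f (%.2f%%)", x, x / x[1] * 100),
         x)
}

out <- df
out[] <- lapply(df, fmt_pct)  # apply column by column, keep the data frame shape
out
```

Columns that pass the check become character because ifelse() mixes formatted strings with the original numbers; the skipped columns stay numeric.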

Is there an R function for combining two replicate site columns in a table to show presence absence of species?

I have the following DF (example data, my actual data set is 96 columns):
class X1A X1B X2A X2B X3A X3B X4A X4B X5A X5B X6A X6B
1 A 0 1 0 0 0 0 0 1 1 1 1 1
2 B 1 1 1 1 0 0 0 1 1 1 0 1
3 C 0 0 0 1 1 0 0 0 1 1 0 0
4 D 0 0 0 1 1 0 1 0 1 0 0 0
5 A 0 1 1 1 0 0 0 1 1 1 1 1
6 B 0 0 1 1 0 0 0 1 1 1 0 1
7 C 0 0 0 1 1 0 0 0 1 1 0 0
8 D 0 0 0 1 1 0 1 0 1 0 0 0
9 A 0 1 1 1 0 0 0 1 1 1 1 1
10 B 1 1 1 1 0 0 0 1 1 1 0 1
11 C 0 0 0 1 1 0 0 0 1 1 0 0
12 D 0 1 0 1 1 0 1 0 1 0 0 0
Class denotes the phylogenetic class of the organism (each repeat of a letter is a different species belonging to the same class). 1A and 1B are samples from the same site. I want to combine the presence/absence data (1/0 respectively) from the two samples at each site and add up the number of "presences" per class at that site, so that my df looks something like this:
Sample Class Number of Species Present
1 A 3
1 B 2
1 C 0
1 D 1
2 A 2
2 B 3
2 C 3
2 D 3
For example, in the original df you see that Class C species are not present in sample 2A at all, but each species of Class C is present in sample 2B. So the output df records Class C as present 3 times in sample 2. Furthermore, Class B has 3 different species occurring in both 2A and 2B, but because these are replicates the output df records sample 2 as having 3 Class B species present.
Any help would be appreciated as I'm stumped!
Cheers!!
You just need to format your initial df a bit (since your colnames actually contain more information than just being a "name").
library(tidyverse)
d <- data %>% pivot_longer(-class, names_to = 'site', values_to = 'presence') %>%
mutate(sample=substr(site,1,1)) %>%
mutate(site = substr(site, 2,2))
d %>% group_by(class,sample) %>%
summarise(presence = sum(presence)) %>% arrange(sample)
which results in:
# A tibble: 24 x 3
# Groups: class [4]
class sample presence
<chr> <chr> <dbl>
1 A 1 3
2 B 1 4
3 C 1 0
4 D 1 1
5 A 2 4
6 B 2 6
7 C 2 3
8 D 2 3
9 A 3 0
10 B 3 0
Here is the data with dput():
structure(list(class = c("A", "B", "C", "D", "A", "B", "C", "D",
"A", "B", "C", "D"), `1A` = c(0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0,
0), `1B` = c(1, 1, 0, 0, 1, 0, 0, 0, 1, 1, 0, 1), `2A` = c(0,
1, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0), `2B` = c(0, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1), `3A` = c(0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 1,
1), `3B` = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0), `4A` = c(0,
0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1), `4B` = c(1, 1, 0, 0, 1, 1,
0, 0, 1, 1, 0, 0), `5A` = c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1), `5B` = c(1, 1, 1, 0, 1, 1, 1, 0, 1, 1, 1, 0), `6A` = c(1,
0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0), `6B` = c(1, 1, 0, 0, 1, 1,
0, 0, 1, 1, 0, 0)), class = c("spec_tbl_df", "tbl_df", "tbl",
"data.frame"), row.names = c(NA, -12L), spec = structure(list(
cols = list(class = structure(list(), class = c("collector_character",
"collector")), `1A` = structure(list(), class = c("collector_double",
"collector")), `1B` = structure(list(), class = c("collector_double",
"collector")), `2A` = structure(list(), class = c("collector_double",
"collector")), `2B` = structure(list(), class = c("collector_double",
"collector")), `3A` = structure(list(), class = c("collector_double",
"collector")), `3B` = structure(list(), class = c("collector_double",
"collector")), `4A` = structure(list(), class = c("collector_double",
"collector")), `4B` = structure(list(), class = c("collector_double",
"collector")), `5A` = structure(list(), class = c("collector_double",
"collector")), `5B` = structure(list(), class = c("collector_double",
"collector")), `6A` = structure(list(), class = c("collector_double",
"collector")), `6B` = structure(list(), class = c("collector_double",
"collector"))), default = structure(list(), class = c("collector_guess",
"collector")), skip = 1L), class = "col_spec"))
You can try this:
Code
df %>%
# long format: one column for the sample (site number), one for the replicate letter (named "species" here)
pivot_longer(-class,
names_pattern = "(\\d*)([A-Z]*)",
names_to = c("sample", "species")) %>%
# one column per replicate (A and B)
pivot_wider(id_cols = c(class, sample),
names_from = species,
values_from = value,
values_fn = list) %>%
unnest(c(A, B)) %>%
# presence column: 1 when the species is present in either replicate (column A or B)
mutate(presence = ifelse(A == 1 | B == 1, 1, 0)) %>%
# sum presence by sample and class
group_by(sample, class) %>%
summarise(Number = sum(presence))
Output
# A tibble: 24 x 3
# Groups: sample [6]
sample class Number
<chr> <chr> <dbl>
1 1 A 3
2 1 B 2
3 1 C 0
4 1 D 1
5 2 A 2
6 2 B 3
7 2 C 3
8 2 D 3
9 3 A 0
10 3 B 0
# ... with 14 more rows
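The "present in either replicate" rule can also be written with pmax() in base R. The sketch below uses only the first two sites from the dput() data and assumes the X1A/X1B naming shown in the question; sites, present and res are illustrative names.

```r
d <- data.frame(class = rep(c("A", "B", "C", "D"), 3),
                X1A = c(0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0),
                X1B = c(1, 1, 0, 0, 1, 0, 0, 0, 1, 1, 0, 1),
                X2A = c(0, 1, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0),
                X2B = c(0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1))

# Site identifiers: column names with the replicate letter removed ("X1", "X2").
sites <- unique(sub("[AB]$", "", names(d)[-1]))

# A species counts as present at a site if it appears in replicate A or B.
present <- sapply(sites, function(s) {
  pmax(d[[paste0(s, "A")]], d[[paste0(s, "B")]])
})

# Number of species present per class, one column per site.
res <- aggregate(as.data.frame(present), by = list(class = d$class), FUN = sum)
res
```

The result is in wide form (one column per site), which can be reshaped to the long layout from the question if needed.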

How to pass a vector of column names in case_when

I am using case_when to summarise a data frame using rowwise in dplyr. I have a sample data frame as shown below
structure(list(A = c(NA, 1, 0, 0, 0, 0, 0), B = c(NA, 0, 0, 1,
0, 0, 0), C = c(NA, 1, 0, 0, 0, 0, 0), D = c(NA, 1, 0, 1, 0,
0, 1), E = c(NA, 1, 0, 1, 0, 0, 1)), row.names = c(NA, -7L), class = "data.frame")
The code works when I mention all the names
df %>%
rowwise() %>%
mutate(New = case_when(any(c(A,B,C,D,E) == 1) ~ 1,
all(c(A,B,C,D,E) == 0 ) ~ 0
))
Can I pass the names in a vector, e.g. cols <- colnames(df), and then use that in case_when?
To answer your question, you can use cur_data() in dplyr 1.0.0 (it was later deprecated in dplyr 1.1.0 in favour of pick()) or c_across():
library(dplyr)
df %>%
rowwise() %>%
mutate(New = case_when(any(cur_data() == 1) ~ 1,
all(cur_data() == 0 ) ~ 0))
# A B C D E New
# <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#1 NA NA NA NA NA NA
#2 1 0 1 1 1 1
#3 0 0 0 0 0 0
#4 0 1 0 1 1 1
#5 0 0 0 0 0 0
#6 0 0 0 0 0 0
#7 0 0 0 1 1 1
With c_across() :
df %>%
rowwise() %>%
mutate(New = case_when(any(c_across(everything()) == 1) ~ 1,
all(c_across(everything()) == 0) ~ 0))
But you can also solve this using rowSums :
df %>%
mutate(New = case_when(rowSums(. == 1, na.rm = TRUE) > 0 ~ 1,
rowSums(. == 0, na.rm = TRUE) == ncol(.) ~ 0))
If you only have 0's and 1's in your dataset you could use this
df$New <- ifelse(rowSums(df) > 0, 1, 0)
If the row sum is > 0, at least one 1 is present.
Output:
A B C D E New
1 NA NA NA NA NA NA
2 1 0 1 1 1 1
3 0 0 0 0 0 0
4 0 1 0 1 1 1
5 0 0 0 0 0 0
6 0 0 0 0 0 0
7 0 0 0 1 1 1
In base R, we can do this with
df$New <- +( rowSums(df) > 0)
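None of the snippets above actually passes a character vector of names, which is what the question asks for. A minimal sketch of that, assuming dplyr 1.0.0 or later, wraps the vector in all_of() inside c_across(); cols is the vector suggested in the question and out is a name chosen for this example.

```r
library(dplyr)

df <- data.frame(A = c(NA, 1, 0, 0, 0, 0, 0),
                 B = c(NA, 0, 0, 1, 0, 0, 0),
                 C = c(NA, 1, 0, 0, 0, 0, 0),
                 D = c(NA, 1, 0, 1, 0, 0, 1),
                 E = c(NA, 1, 0, 1, 0, 0, 1))

cols <- colnames(df)  # could be any subset of the column names

out <- df %>%
  rowwise() %>%
  # all_of() turns the character vector into a column selection
  mutate(New = case_when(any(c_across(all_of(cols)) == 1) ~ 1,
                         all(c_across(all_of(cols)) == 0) ~ 0)) %>%
  ungroup()
out
```

Because cols is fixed before mutate() runs, the new column New is never part of the selection, and cols can just as well hold only some of the columns.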
