How to replace specific rows with their column sums in R?

I feel like there should be a simpler way of doing this. Here is my sample data.
df <-
tibble(
group1 = c(1,1,2,2,3,3,3,3),
group2 = c("A", "B", "A", "B", "A", "B", "A", "B"),
vals = c(13,56,15,50,5,22,9,59)
)
df
# A tibble: 8 x 3
group1 group2 vals
<dbl> <chr> <dbl>
1 1 A 13
2 1 B 56
3 2 A 15
4 2 B 50
5 3 A 5
6 3 B 22
7 3 A 9
8 3 B 59
I want to sum the vals where group1 is 3 (within each group2) and replace the old rows with the summed ones. Can anyone come up with a cleaner/tidier solution than this?
df %>%
group_by(group1, group2) %>%
bind_rows(
summarize(
.[.$group1 == 3,],
across(vals, sum),
summed = "x"
)
) %>%
ungroup() %>%
filter(!(group1 == 3 & is.na(summed))) %>%
select(-summed)
Here is what the result should be:
# A tibble: 6 x 3
group1 group2 vals
<dbl> <chr> <dbl>
1 1 A 13
2 1 B 56
3 2 A 15
4 2 B 50
5 3 A 14
6 3 B 81

This isn't very efficient, but it gives you your intended output.
df %>%
mutate(tmp = if_else(group1 == 3, 0L, row_number())) %>%
group_by(tmp, group1, group2) %>%
summarize(vals = sum(vals)) %>%
ungroup() %>%
select(-tmp)
# # A tibble: 6 x 3
# group1 group2 vals
# <dbl> <chr> <dbl>
# 1 3 A 14
# 2 3 B 81
# 3 1 A 13
# 4 1 B 56
# 5 2 A 15
# 6 2 B 50
Another technique would be to split your data into "3" and "not 3", process the "3" data, then recombine them.
df3 <- filter(df, group1 == 3)
dfnot3 <- filter(df, group1 != 3)
df3 %>%
group_by(group1, group2) %>%
summarize(vals = sum(vals)) %>%
ungroup() %>%
bind_rows(dfnot3)
# # A tibble: 6 x 3
# group1 group2 vals
# <dbl> <chr> <dbl>
# 1 3 A 14
# 2 3 B 81
# 3 1 A 13
# 4 1 B 56
# 5 2 A 15
# 6 2 B 50
(This second one is really only meaningful/efficient if you have lots of non-3 rows.)
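If the row order of the expected output in the question matters, a variant of the first approach can sort the result afterwards. This is only a sketch: the helper column name key is made up for illustration, and it assumes that sorting by group1 and group2 is the order you want.

```r
library(dplyr)

df <- tibble(
  group1 = c(1, 1, 2, 2, 3, 3, 3, 3),
  group2 = c("A", "B", "A", "B", "A", "B", "A", "B"),
  vals   = c(13, 56, 15, 50, 5, 22, 9, 59)
)

result <- df %>%
  # rows with group1 == 3 share key 0; every other row keeps a unique key
  mutate(key = if_else(group1 == 3, 0L, row_number())) %>%
  group_by(key, group1, group2) %>%
  summarise(vals = sum(vals), .groups = "drop") %>%
  # restore a predictable order (assumption: sorted by the grouping columns)
  arrange(group1, group2) %>%
  select(-key)

result
```

Because the non-3 rows keep a unique key, they pass through unchanged even if group1/group2 combinations repeat among them.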

Related

Element-wise median across lists of data frames in R

I have several lists, each one containing many data frames. I would like to calculate the element-wise median across the elements of each data frame inside each list, i.e. the median between element [[1]][1,1] of list1, element [[1]][1,1] of list2, and element [[1]][1,1] of list3, and so on for all elements. The data frames have many columns each, but here is some sample data with only two columns:
set.seed(1)
list1 <- list(data.frame(a = sample.int(10, 4), b = sample.int(10, 4)),
data.frame(a = sample.int(10, 4), b = sample.int(10, 4)))
list2 <- list(data.frame(a = sample.int(10, 4), b = sample.int(10, 4)),
data.frame(a = sample.int(10, 4), b = sample.int(10, 4)))
list3 <- list(data.frame(a = sample.int(10, 4), b = sample.int(10, 4)),
data.frame(a = sample.int(10, 4), b = sample.int(10, 4)))
This is the expected result:
[[1]]
a b
7 4
9 9
7 3
4 6
[[2]]
a b
5 7
8 6
2 6
5 2
Any tips?
Using purrr:
library(purrr)
lsts <- list(list1,list2,list3)
map(transpose(lsts),~map_dfc(transpose(.), ~apply(list2DF(.x),1,median)))
[[1]]
# A tibble: 4 × 2
a b
<int> <int>
1 7 4
2 9 9
3 7 3
4 4 6
[[2]]
# A tibble: 4 × 2
a b
<int> <int>
1 5 7
2 8 6
3 2 6
4 5 2
In Base R, assuming they all have the same structure:
dims <- c(dim(list1[[1]]), length(list1), length(lsts))
d <- apply(array(unlist(lsts), dims), head(seq(dims),-1), median)
asplit(aperm(d, c(1,3,2)), 3)
[[1]]
[,1] [,2]
[1,] 7 4
[2,] 9 9
[3,] 7 3
[4,] 4 6
[[2]]
[,1] [,2]
[1,] 5 7
[2,] 8 6
[3,] 2 6
[4,] 5 2
Here's another (base R) option:
Map(function(...) {
dots1 <- list(...)
out <- do.call(mapply, c(list(FUN=function(...) {
dots2 <- list(...)
apply(do.call(cbind, dots2), 1, median)
}), dots1))
data.frame(out)
}, list1, list2, list3)
# [[1]]
# a b
# 1 7 4
# 2 9 9
# 3 7 3
# 4 4 6
# [[2]]
# a b
# 1 5 7
# 2 8 6
# 3 2 6
# 4 5 2
Certainly not beautiful, but functional.
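A more compact base R sketch of the same idea, assuming every data frame has the same dimensions and only numeric columns: stack the matching data frames into a 3-D array and take the median over the third dimension. The helper name elementwise_median is hypothetical, not from any package.

```r
set.seed(1)
list1 <- list(data.frame(a = sample.int(10, 4), b = sample.int(10, 4)),
              data.frame(a = sample.int(10, 4), b = sample.int(10, 4)))
list2 <- list(data.frame(a = sample.int(10, 4), b = sample.int(10, 4)),
              data.frame(a = sample.int(10, 4), b = sample.int(10, 4)))
list3 <- list(data.frame(a = sample.int(10, 4), b = sample.int(10, 4)),
              data.frame(a = sample.int(10, 4), b = sample.int(10, 4)))
lsts <- list(list1, list2, list3)

# hypothetical helper: element-wise median across same-shaped data frames
elementwise_median <- function(dfs) {
  # rows x columns x number-of-data-frames
  arr <- simplify2array(lapply(dfs, as.matrix))
  out <- as.data.frame(apply(arr, c(1, 2), median))
  names(out) <- names(dfs[[1]])
  out
}

# for each position i, collect the i-th data frame of every list
result <- lapply(seq_along(list1), function(i) {
  elementwise_median(lapply(lsts, `[[`, i))
})

result
```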
A related dplyr option:
library(dplyr)
list(list1, list2, list3) |>
lapply(bind_rows, .id = "id1") |>
bind_rows(.id = "id2") |>
group_by(id1, id2) |>
mutate(rn = row_number()) |>
group_by(id1, rn) |>
summarize(across(c(a, b), ~ median(.))) |>
ungroup() |>
select(-rn) |>
group_nest(id1) |>
pull(data)
# [[1]]
# # A tibble: 4 × 2
# a b
# <int> <int>
# 1 7 4
# 2 9 9
# 3 7 3
# 4 4 6
# [[2]]
# # A tibble: 4 × 2
# a b
# <int> <int>
# 1 5 7
# 2 8 6
# 3 2 6
# 4 5 2
Certainly not the most efficient solution, but one option with tidyverse might be:
map_dfr(mget(ls(pattern = "list")),
function(list_of_lists) imap(list_of_lists,
function(lists, lists_id)
lists %>%
mutate(rowid = row_number(),
lists_id = lists_id))) %>%
group_by(rowid, lists_id) %>%
summarise(across(c(a, b), median))
rowid lists_id a b
<int> <int> <int> <int>
1 1 1 10 3
2 1 2 8 1
3 2 1 5 4
4 2 2 9 8
5 3 1 6 6
6 3 2 6 3
7 4 1 3 2
8 4 2 4 6
If the goal is to return a list:
map_dfr(mget(ls(pattern = "list")),
function(list_of_lists) imap(list_of_lists,
function(lists, lists_id)
lists %>%
mutate(rowid = row_number(),
lists_id = lists_id))) %>%
group_by(rowid, lists_id) %>%
summarise(across(c(a, b), median)) %>%
ungroup() %>%
group_split(lists_id)
[[1]]
# A tibble: 4 × 4
rowid lists_id a b
<int> <int> <int> <int>
1 1 1 10 3
2 2 1 5 4
3 3 1 6 6
4 4 1 3 2
[[2]]
# A tibble: 4 × 4
rowid lists_id a b
<int> <int> <int> <int>
1 1 2 8 1
2 2 2 9 8
3 3 2 6 3
4 4 2 4 6
Here is a tidyverse solution first draft (I am sure that it could be improved):
library(tidyverse)
bind_rows(list1, list2, list3) %>%
mutate(x =rep(1:3, each=8, length.out = n())) %>%
group_by(x) %>%
pivot_wider(names_from = x,
values_from = c(a,b),
values_fn = list) %>%
unnest() %>%
rowwise() %>%
transmute(a = median(c(a_1, a_2, a_3)),
b = median(c(b_1, b_2, b_3))
) %>%
ungroup() %>%
group_by(x = as.integer(gl(n(),4,n()))) %>%
group_split() %>%
map(.,~(.x %>%select(-x)))
[[1]]
# A tibble: 4 × 2
a b
<int> <int>
1 7 4
2 9 9
3 7 3
4 4 6
[[2]]
# A tibble: 4 × 2
a b
<int> <int>
1 5 7
2 8 6
3 2 6
4 5 2

Get mean column values every n rows grouped by value in other column

I have a dataframe df that looks like this
time object
1 1 A
2 2 A
3 3 A
4 4 A
5 5 A
6 6 A
7 7 B
8 8 B
9 9 B
10 10 B
11 11 B
12 12 C
13 13 C
14 14 C
15 15 C
16 16 C
17 17 C
18 18 C
I would like to get the mean of the time column every 3 rows, within each value of the object column:
df_mean
time object
1 2 A
2 5 A
3 8 B
4 13 C
5 16 C
I thought about using dplyr
df%>%
mutate(grp = 1+ (row_number()-1) %/% 3) %>%
group_by(grp) %>%
summarise(across(c("time"), mean, na.rm = TRUE)) %>%
select(-grp)
but I do not know how to integrate the control for the object.
Another option would be to use aggregate
aggregate(.~object, data=df, mean)
but in this case I do not know how to get the mean every 3 rows.
Your dplyr attempt is on the right track. With a few modifications it will work.
library(dplyr)
df <- tibble(time = 1:18, object = rep(c('A', 'B', 'C'), each = 6))
df %>%
group_by(object, grp = (row_number()-1) %/% 3) %>%
summarise(across(time, ~ mean(.x, na.rm = TRUE)), .groups = 'drop') %>%
select(-grp)
#> # A tibble: 6 × 2
#> object time
#> <chr> <dbl>
#> 1 A 2
#> 2 A 5
#> 3 B 8
#> 4 B 11
#> 5 C 14
#> 6 C 17
Here is an alternative dplyr way:
library(dplyr)
n = 3
df %>%
group_by(object, Col2 = rep(row_number(), each=n, length.out = n())) %>%
summarise(time = mean(time, na.rm = TRUE)) %>%
select(-Col2)
object time
<chr> <dbl>
1 A 2
2 A 5
3 B 8
4 B 11
5 C 14
6 C 17
You can do a slight modification, creating your grp variable within object group first, and then filtering where the size of the joint grouping is >=3, and then summarize:
df%>%
group_by(object) %>%
mutate(grp = 1+ (row_number()-1) %/% 3) %>%
group_by(object,grp) %>%
filter(n()>=3) %>%
summarize(time=mean(time), .groups="drop") %>%
select(-grp)
Output:
object time
<chr> <dbl>
1 A 2
2 A 5
3 B 8
4 C 13
5 C 16
data.table option:
library(data.table)
setDT(df)
df[, n3:=gl(.N, 3, length=.N), by=object]
df[, .(time=mean(time)), by=.(object, n3)][, !"n3"]
Output:
object time
1: A 2
2: A 5
3: B 8
4: B 11
5: C 14
6: C 17
In base R:
aggregate(time~., cbind(df, gr=gl(nrow(df),3, nrow(df))), mean)
object gr time
1 A 1 2
2 A 2 5
3 B 3 8
4 B 4 11
5 C 5 14
6 C 6 17
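For the unequal group sizes shown in the question (6 rows of A, 5 of B, 7 of C), a base R sketch that numbers rows within each object and drops incomplete chunks could look like this. The column names chunk and size are made up for illustration, and the data frame is rebuilt here on the assumption that it matches the printout in the question.

```r
df <- data.frame(time = 1:18,
                 object = rep(c("A", "B", "C"), times = c(6, 5, 7)))

# position of each row within its object
idx <- ave(seq_along(df$object), df$object, FUN = seq_along)
# chunk of 3 consecutive rows within each object
df$chunk <- (idx - 1) %/% 3
# number of rows in each object/chunk combination
df$size <- ave(df$time, df$object, df$chunk, FUN = length)

# keep only complete chunks, then average time per object and chunk
res <- aggregate(time ~ object + chunk, data = df[df$size == 3, ], FUN = mean)
res <- res[order(res$object, res$chunk), c("time", "object")]
rownames(res) <- NULL

res
```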

Fill NA with a series of characters in R dplyr

I have a large data frame that looks like this. Each player is assigned to a group.
library(tidyverse)
df <- tibble(player=c(1,2,3,4,5),groups=c("group1","group2","group2",NA,NA))
df
#> # A tibble: 5 × 2
#> player groups
#> <dbl> <chr>
#> 1 1 group1
#> 2 2 group2
#> 3 3 group2
#> 4 4 <NA>
#> 5 5 <NA>
Created on 2022-04-12 by the reprex package (v2.0.1)
Some players are not assigned to a group, and I want to fill those in serially, like this:
#> # A tibble: 5 × 2
#> player groups
#> <dbl> <chr>
#> 1 1 group1
#> 2 2 group2
#> 3 3 group2
#> 4 4 group3
#> 5 5 group4
dplyr
library(dplyr)
df %>%
mutate(
maxgrp = max(as.integer(gsub("[^0-9]", "", groups)), na.rm = TRUE),
groups = if_else(is.na(groups), paste0("group", maxgrp + cumsum(is.na(groups))), groups)
) %>%
select(-maxgrp)
# # A tibble: 5 x 2
# player groups
# <dbl> <chr>
# 1 1 group1
# 2 2 group2
# 3 3 group2
# 4 4 group3
# 5 5 group4
data.table
library(data.table)
DT <- as.data.table(df)
DT[, groups := fifelse(
is.na(groups),
paste0("group", cumsum(is.na(groups)) + max(as.integer(gsub("[^0-9]", "", groups)), na.rm = TRUE)),
groups) ]
This was tricky, but I think we could do it this way. Note that the + 1 offsets hard-code the highest existing group number (2 in this sample):
library(dplyr)
df %>%
mutate(x = cumsum(groups %in% NA)+1) %>%
mutate(groups = ifelse(is.na(groups), paste0("group", x+1), groups), .keep="unused")
player groups
<dbl> <chr>
1 1 group1
2 2 group2
3 3 group2
4 4 group3
5 5 group4
You could do:
df |>
mutate(new_group = max(parse_number(groups), na.rm = TRUE) + cumsum(is.na(groups)),
groups = if_else(is.na(groups), paste0("group", new_group), groups)) |>
select(-new_group)
Using a slightly different data example where after the missings another group is mentioned, this would give you:
Input:
library(tidyverse)
df <- tibble(player=c(1,2,3,4,5,6),groups=c("group1","group2","group2",NA,NA, "group3"))
# A tibble: 6 x 2
player groups
<dbl> <chr>
1 1 group1
2 2 group2
3 3 group2
4 4 NA
5 5 NA
6 6 group3
Output:
# A tibble: 6 x 2
player groups
<dbl> <chr>
1 1 group1
2 2 group2
3 3 group2
4 4 group4
5 5 group5
6 6 group3
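The same logic (highest existing group number plus a running count of the missing entries) can also be sketched in base R. This assumes every non-missing label has the form "group" followed by an integer; the variable names missing and max_id are only illustrative.

```r
df <- data.frame(player = 1:5,
                 groups = c("group1", "group2", "group2", NA, NA))

missing <- is.na(df$groups)
# highest group number already in use
max_id <- max(as.integer(sub("^group", "", df$groups[!missing])))
# number the missing entries consecutively after max_id
df$groups[missing] <- paste0("group", max_id + seq_len(sum(missing)))

df
```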

grouping to aggregate values, but tripping up on NA's

I have long data, and I am trying to make a new variable (consistent) that is the value of a given column (VALUE), for each person (ID), at TIME = 2. I used the code below to do this, but I am getting tripped up on NAs. If the VALUE at TIME = 2 is NA, then I want it to grab the VALUE at TIME = 1 instead. That part I'm not sure how to do. So, in the example below, the new variable (consistent) should be 10 instead of NA for ID B.
library(dplyr)
ID = c("A", "A", "B", "B", "C", "C", "D", "D")
TIME = c(1, 2, 1, 2, 1, 2, 1, 2)
VALUE = c(8, 9, 10, NA, 12, 13, 14, 9)
df = data.frame(ID, TIME, VALUE)
df <- df %>%
group_by(ID) %>%
mutate(consistent = VALUE[TIME == 2]) %>% ungroup
df
If we want to use the same code, then coalesce with the 'VALUE' where 'TIME' is 1 (assuming there is a single observation of 'TIME' for each 'ID')
library(dplyr)
df %>%
group_by(ID) %>%
mutate(consistent = coalesce(VALUE[TIME == 2], VALUE[TIME == 1])) %>%
ungroup
-output
# A tibble: 8 × 4
ID TIME VALUE consistent
<chr> <dbl> <dbl> <dbl>
1 A 1 8 9
2 A 2 9 9
3 B 1 10 10
4 B 2 NA 10
5 C 1 12 13
6 C 2 13 13
7 D 1 14 9
8 D 2 9 9
Or another option is to arrange before doing the group_by and get the first element of 'VALUE' (assuming no duplicated 'TIME' values within an 'ID')
df %>%
arrange(ID, is.na(VALUE), desc(TIME)) %>%
group_by(ID) %>%
mutate(consistent = first(VALUE)) %>%
ungroup
-output
# A tibble: 8 × 4
ID TIME VALUE consistent
<chr> <dbl> <dbl> <dbl>
1 A 2 9 9
2 A 1 8 9
3 B 1 10 10
4 B 2 NA 10
5 C 2 13 13
6 C 1 12 13
7 D 2 9 9
8 D 1 14 9
Another possible solution, using tidyr::fill:
library(tidyverse)
df %>%
group_by(ID) %>%
mutate(consistent = VALUE) %>% fill(consistent) %>% ungroup
#> # A tibble: 8 × 4
#> ID TIME VALUE consistent
#> <chr> <dbl> <dbl> <dbl>
#> 1 A 1 8 8
#> 2 A 2 9 9
#> 3 B 1 10 10
#> 4 B 2 NA 10
#> 5 C 1 12 12
#> 6 C 2 13 13
#> 7 D 1 14 14
#> 8 D 2 9 9
You can also use ifelse with your condition. If each group has exactly two members, one with TIME 1 and one with TIME 2, the lagged row is guaranteed to be the TIME 1 row.
df %>%
group_by(ID) %>%
arrange(TIME, .by_group=T) %>%
mutate(consistent=ifelse(is.na(VALUE)&TIME==2, lag(VALUE), VALUE)) %>%
ungroup()
# A tibble: 8 × 4
ID TIME VALUE consistent
<chr> <dbl> <dbl> <dbl>
1 A 1 8 8
2 A 2 9 9
3 B 1 10 10
4 B 2 NA 10
5 C 1 12 12
6 C 2 13 13
7 D 1 14 14
8 D 2 9 9
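If there could be more than two time points per ID, one possible generalization is to take the most recent non-missing VALUE in each group. This is a sketch under that assumption, not something the question asked for; it uses dplyr::last() with its order_by argument.

```r
library(dplyr)

df <- data.frame(
  ID    = c("A", "A", "B", "B", "C", "C", "D", "D"),
  TIME  = c(1, 2, 1, 2, 1, 2, 1, 2),
  VALUE = c(8, 9, 10, NA, 12, 13, 14, 9)
)

res <- df %>%
  group_by(ID) %>%
  # latest non-missing VALUE per ID, ordered by TIME
  mutate(consistent = last(VALUE[!is.na(VALUE)],
                           order_by = TIME[!is.na(VALUE)])) %>%
  ungroup()

res
```

For the two-row groups in the sample data this gives the same consistent column as the coalesce answer.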

How to call columns implicitly in certain R functions

There are some functions in R where you have to call columns explicitly by name, such as pmin. My question is how to get around this, preferably using tidyverse.
Here's some sample data.
library(tidyverse)
df <- tibble(a = c(1:5),
b = c(6:10),
d = c(11:15),
e = c(16:20))
# A tibble: 5 x 4
a b d e
<int> <int> <int> <int>
1 1 6 11 16
2 2 7 12 17
3 3 8 13 18
4 4 9 14 19
5 5 10 15 20
Now I'd like to find the minimum of all the columns except for "e". I can do this:
df %>%
mutate(min = pmin(a, b, d))
# A tibble: 5 x 5
a b d e min
<int> <int> <int> <int> <int>
1 1 6 11 16 1
2 2 7 12 17 2
3 3 8 13 18 3
4 4 9 14 19 4
5 5 10 15 20 5
But what if I have many columns and would like to call every column except "e" without having to type out each column's name? I've made several attempts but none successful. I used the column index in my examples but I'd prefer excluding "e" by name. See below.
df %>%
mutate(min = pmin(-e))
df %>%
mutate(min = pmin(names(. %>% select(.))[-4]))
df %>%
mutate(min = pmin(names(.)[-4]))
df %>%
mutate(min = pmin(noquote(paste0(names(.)[-4], collapse = ","))))
df %>%
mutate(min = pmin(!!ensyms(names(.)[-4])))
None of these worked and I'm a bit at a loss.
One option using dplyr and purrr could be:
df %>%
mutate(min = exec(pmin, !!!select(., -e)))
a b d e min
<int> <int> <int> <int> <int>
1 1 6 11 16 1
2 2 7 12 17 2
3 3 8 13 18 3
4 4 9 14 19 4
5 5 10 15 20 5
For those not reading the comments, a nice option proposed by @IceCreamToucan and involving only dplyr could be:
df %>%
mutate(min = do.call(pmin, select(., -e)))
We can also use reduce with pmin
library(dplyr)
library(purrr)
df %>%
mutate(min = select(., -e) %>%
reduce(pmin))
# A tibble: 5 x 5
# a b d e min
# <int> <int> <int> <int> <int>
#1 1 6 11 16 1
#2 2 7 12 17 2
#3 3 8 13 18 3
#4 4 9 14 19 4
#5 5 10 15 20 5
Or with syms and !!!. Note that the en- prefixed variants (such as ensyms) are for use inside a function, where they capture the function's arguments:
df %>%
mutate(min = pmin(!!! syms(names(.)[-4])))
# A tibble: 5 x 5
# a b d e min
# <int> <int> <int> <int> <int>
#1 1 6 11 16 1
#2 2 7 12 17 2
#3 3 8 13 18 3
#4 4 9 14 19 4
#5 5 10 15 20 5
I would do this by reshaping long, calculating the min by group, and then reshaping back to wide:
df %>%
rowid_to_column() %>%
pivot_longer(cols = -c(e, rowid)) %>%
group_by(rowid) %>%
mutate(min = min(value)) %>%
ungroup() %>%
pivot_wider() %>%
select(-rowid, -min, min)
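For comparison, the do.call idea also works without any packages. A minimal base R sketch that excludes "e" by name (the variable name keep is only illustrative):

```r
df <- data.frame(a = 1:5, b = 6:10, d = 11:15, e = 16:20)

# every column except "e", selected by name rather than by position
keep <- setdiff(names(df), "e")
# a data frame is a list of columns, so each column becomes one pmin() argument
df$min <- do.call(pmin, df[keep])

df
```

One caveat: the columns are passed as named arguments, so a column called na.rm would be taken as pmin's own na.rm argument.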
