dplyr: create new variable based upon grouping

Given this dataframe:
library(dplyr)
df.ex <- tibble(id = c(rep(1, 4), rep(2, 4), rep(3, 4)),
                var1 = c('a','a','b','b','a','a','a','a','b','b','b','b'))
I would like to create a new variable var2 based upon the presence of b in var1, grouped by the id column. Thus each id can then contain only one value in the output column. This is the hoped-for outcome:
df.ex.outcome <- tibble(id = c(rep(1, 4), rep(2, 4), rep(3, 4)),
                        var1 = c('a','a','b','b','a','a','a','a','b','b','b','b'),
                        var2 = c(rep('foo', 4), rep('bar', 4), rep('foo', 4)))
I thought that using group_by would solve this; however, it doesn't appear to work:
df.ex <- df.ex %>% group_by(id) %>% mutate(var2 = if_else(var1 %in% 'b', 'foo','bar'))
Does anyone have any ideas on how to do this?

We can wrap the condition with any:
df.ex %>%
  group_by(id) %>%
  mutate(var2 = case_when(any(var1 == "b") ~ "foo", TRUE ~ "bar"))
# A tibble: 12 x 3
# Groups: id [3]
# id var1 var2
# <dbl> <chr> <chr>
# 1 1 a foo
# 2 1 a foo
# 3 1 b foo
# 4 1 b foo
# 5 2 a bar
# 6 2 a bar
# 7 2 a bar
# 8 2 a bar
# 9 3 b foo
#10 3 b foo
#11 3 b foo
#12 3 b foo
Or reverse the arguments for %in%:
df.ex %>%
  group_by(id) %>%
  mutate(var2 = case_when("b" %in% var1 ~ "foo", TRUE ~ "bar"))
Or using if_else:
df.ex %>%
  group_by(id) %>%
  mutate(var2 = if_else('b' %in% var1, 'foo', 'bar'))
so that there will be a single TRUE/FALSE output from %in%, which we can also use with if/else:
df.ex %>%
  group_by(id) %>%
  mutate(var2 = if ("b" %in% var1) "foo" else "bar")
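Any of these can be checked against the desired outcome; a minimal sketch:
df.ex %>%
  group_by(id) %>%
  mutate(var2 = if_else("b" %in% var1, "foo", "bar")) %>%
  ungroup() %>%
  identical(df.ex.outcome)
# should be TRUE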

mixing unnest_longer and unnest_wider

I'm (once more) stuck with flattening nested lists.
I have this tibble with some list-columns (originating from a JSON format).
library(tidyr)
library(dplyr)
df = tibble(id = c(1, 2, 3),
            branch = list(NULL, list(colA = 'abc', colB = 'mno'),
                          list(list(colA = 'def', colB = 'uvw'),
                               list(colA = 'ghi', colB = 'xyz'))))
I want to unnest_wider column 'branch'. That works with rows 1 and 2:
df %>%
  slice(1:2) %>%
  unnest_wider(branch)
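(This yields a two-row tibble, with NA in colA and colB for the NULL branch of id 1.)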
However, row 3 consists of a list of lists which I have to unnest_longer first:
bind_rows(
  df %>% slice(1, 2),
  df %>% slice(3) %>% unnest_longer(branch)) %>%
  unnest_wider(branch)
The above code gives the desired output, but I'm looking for a generic solution along the lines of: if an element of column 'branch' is an unnamed list (indicating that there is a list of lists), then unnest_longer; afterwards apply unnest_wider to the whole column 'branch'.
Any help appreciated!
First convert the leaves to data frames and then unnest:
library(dplyr)
library(tidyr)
leaf2df <- function(x) {
  if (length(names(x))) as.data.frame(x)
  else if (is.list(x)) lapply(x, leaf2df)
}
df %>%
  rowwise %>%
  mutate(branch = list(bind_rows(leaf2df(branch)))) %>%
  ungroup %>%
  unnest(branch, keep_empty = TRUE)
giving:
# A tibble: 4 × 3
id colA colB
<dbl> <chr> <chr>
1 1 <NA> <NA>
2 2 abc mno
3 3 def uvw
4 3 ghi xyz
Because leaf2df is recursive, it should continue to work as long as all leaves in any row have the same parent. For example, below we have made the list in the last row one level deeper and it still works.
df <- tibble(id = c(1, 2, 3),
             branch = list(NULL, list(colA = 'abc', colB = 'mno'),
                           list(list(list(colA = 'def', colB = 'uvw'),
                                     list(colA = 'ghi', colB = 'xyz')))))
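Rerunning the same pipeline on this deeper df gives the same four-row result:
df %>%
  rowwise %>%
  mutate(branch = list(bind_rows(leaf2df(branch)))) %>%
  ungroup %>%
  unnest(branch, keep_empty = TRUE)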
A little bit convoluted, but here's a possible solution:
1. Iterate through the rows of your df
2. Determine if each element is a named list by checking names(df$branch[[index]])
3. If unnamed --> slice + unnest_longer; if named --> slice
4. Finally, unnest_wider()
library(tidyr)
library(dplyr)
library(purrr)
map_df(1:nrow(df), function(x) {
  if (is.null(names(df$branch[[x]]))) {
    df %>% slice(x) %>% unnest_longer(branch)
  } else {
    df %>% slice(x)
  }
}) %>%
  unnest_wider(branch)
Which returns:
# A tibble: 4 × 3
id colA colB
<dbl> <chr> <chr>
1 1 NA NA
2 2 abc mno
3 3 def uvw
4 3 ghi xyz
library(tidyverse)
df <- tibble(
  id = c(1, 2, 3),
  branch = list(
    NULL, list(colA = "abc", colB = "mno"),
    list(
      list(colA = "def", colB = "uvw"),
      list(colA = "ghi", colB = "xyz")
    )
  )
)
unnester <- function(x, grp) {
  if (grp) {
    x <- x |> unnest_longer(branch)
  }
  unnest_wider(x, branch)
}
df |>
  rowwise() |>
  mutate(grp = length(names(unlist(branch))) > 2) |>
  ungroup() |>
  split(~grp) |>
  imap_dfr(~ unnester(.x, .y)) |>
  select(-grp)
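This returns the same four-row tibble as the other approaches:
#> # A tibble: 4 × 3
#>      id colA  colB
#>   <dbl> <chr> <chr>
#> 1     1 NA    NA
#> 2     2 abc   mno
#> 3     3 def   uvw
#> 4     3 ghi   xyz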
The following approach first modifies the list so that all leaves are located at the same list level, after which we can unnest all rows as needed:
library(tidyr)
library(purrr)
library(dplyr)
mutate(df, branch = map(
  .x = branch,
  .f = ~ if (is.list(.x[[1]])) .x else list(.x)
)) |>
  unnest_longer(branch) |>
  unnest_wider(branch)
#> # A tibble: 4 × 3
#> id colA colB
#> <dbl> <chr> <chr>
#> 1 1 <NA> <NA>
#> 2 2 abc mno
#> 3 3 def uvw
#> 4 3 ghi xyz
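This handles the NULL row too: .x[[1]] on NULL is simply NULL, and is.list(NULL) is FALSE, so the empty branch is wrapped in list(.x) and unnests to a row of NAs.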

How to use regular expression to match specific digits and mutate conditionally? (R)

I am trying to build a mutate call with a regular expression to identify specific digit matches, and then assign a new value to a new column var2 according to a specific priority ranking.
library(dplyr)
df <- tibble(var1 = c("1", "10", "1-10", "1-2-4-5"))
If any value contains 1, assign 1 to var2
If any value contains 2, assign 2 to var2
If any value contains 3 to 10, assign 3 to var2
df_desire <- tibble(var1 = c("1", "10", "1-10", "1-2-4-5"), var2 = c(1, 3, 1, 1))
I'm expecting to use mutate, case_when, and str_detect.
df_output <- df %>%
  mutate(
    var2 = case_when(
      str_detect(var1, "[1]") ~ 1,
      str_detect(var1, "[2]") ~ 2,
      str_detect(var1, "[3|4|5|6|7|8|9|10]") ~ 3,
      TRUE ~ 0))
That code returns 1 when var1 = 1, but also when var1 = 10, whereas I want it to be 3. How should I adjust my code to get my desired output?
There will be partial matches if we don't use a word boundary. One option is to use separate_rows and then create the column:
library(dplyr)
library(tidyr)
df %>%
  mutate(rn = row_number()) %>%
  separate_rows(var1, convert = TRUE) %>%
  group_by(rn) %>%
  summarise(var2 = case_when(1 %in% var1 ~ 1, 2 %in% var1 ~ 2,
                             any(3:10 %in% var1) ~ 3, TRUE ~ 0)) %>%
  select(-rn) %>%
  bind_cols(df, .)
-output
# A tibble: 4 × 2
var1 var2
<chr> <dbl>
1 1 1
2 10 3
3 1-10 1
4 1-2-4-5 1
Or, if we want to use str_detect, write the expressions in the same priority order in case_when:
library(stringr)
df %>%
  mutate(var2 = case_when(str_detect(var1, "\\b1\\b") ~ 1,
                          str_detect(var1, "\\b2\\b") ~ 2,
                          str_detect(var1, "\\b[3-9]\\b|\\b10\\b") ~ 3,
                          TRUE ~ 0))
-output
# A tibble: 4 × 2
var1 var2
<chr> <dbl>
1 1 1
2 10 3
3 1-10 1
4 1-2-4-5 1
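The word boundaries are what prevent "1" from also matching inside "10"; for example:
library(stringr)
str_detect("10", "1")          # TRUE  - partial match without boundaries
str_detect("10", "\\b1\\b")    # FALSE - no standalone 1 here
str_detect("1-10", "\\b1\\b")  # TRUE  - the standalone 1 matches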
Or using base R, mapping 3-10 to 3 and picking the highest-priority value present:
sapply(strsplit(df$var1, '-'), \(x)
  as.numeric(intersect(1:3, replace(x, x %in% 3:10, 3))[1]))
[1] 1 3 1 1

Using the value in one column to specify from which row to retrieve a value for a new column

I'm looking for an automated way of converting this:
dat = tribble(
  ~a, ~b, ~c
  , 'x', 1, 'y'
  , 'y', 2, NA
  , 'q', 4, NA
  , 'z', 3, 'q'
)
to:
tribble(
  ~a, ~b, ~d
  , 'x', 1, 2
  , 'z', 3, 4
)
So, the column c in dat encodes which row in dat to look at to grab a value for a new column d, and if c is NA, toss that row from the output. Any tips?
We can join dat with itself using the c and a columns:
library(dplyr)
dat %>%
  inner_join(dat %>% select(-c) %>% rename(d = 'b'),
             by = c('c' = 'a'))
# A tibble: 2 x 4
# a b c d
# <chr> <dbl> <chr> <dbl>
#1 x 1 y 2
#2 z 3 q 4
In base R, we can do this with merge:
merge(dat, dat[-3], by.x = 'c', by.y = 'a')
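Note that merge puts the join key first and suffixes the duplicated b column, so a little cleanup is needed to match the target exactly; a sketch:
out <- merge(dat, dat[-3], by.x = 'c', by.y = 'a', suffixes = c('', '.d'))
data.frame(a = out$a, b = out$b, d = out$b.d)
#   a b d
# 1 z 3 4
# 2 x 1 2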
We create 'd' with the lead of 'b', filter out the rows where 'c' is NA, and remove the c column with select:
library(dplyr)
dat %>%
  mutate(d = lead(b)) %>%
  filter(!is.na(c)) %>%
  select(-c)
# A tibble: 2 x 3
# a b d
# <chr> <dbl> <dbl>
#1 x 1 2
#2 z 3 4
Or more compactly
dat %>%
  mutate(d = replace(lead(b), is.na(c), NA), c = NULL) %>%
  na.omit
Or with fill
library(tidyr)
dat %>%
  mutate(c1 = c) %>%
  fill(c1) %>%
  group_by(c1) %>%
  mutate(d = lead(b)) %>%
  ungroup %>%
  filter(!is.na(c)) %>%
  select(-c, -c1)
Or in data.table
library(data.table)
setDT(dat)[, d := shift(b, type = 'lead')][!is.na(c)][, c := NULL][]
# a b d
#1: x 1 2
#2: z 3 4
NOTE: These solutions are simple and don't require any joins (they rely on the looked-up row immediately following the current one, as in the example), and they give the expected output shown in the OP's post.
Or using match from base R
cbind(na.omit(dat), d = with(dat, b[match(c, a, nomatch = 0)]))[, -3]
# a b d
#1 x 1 2
#2 z 3 4

Remove duplicates, keeping most frequent row

I would like to deduplicate my data, keeping the row that has the most frequent appearances. If there is a tie in rows, I don't care which gets returned—the first in alphabetical or numeric order is fine. I would like to do this by group of id and var.
MRE:
df <- data.frame(
id = rep("a", 8),
var = c(rep("b", 4), rep("c", 4)),
val = c("d", "d", "d", "e", "f", "f", "g", "g")
)
> df
id var val
1 a b d
2 a b d
3 a b d
4 a b e
5 a c f
6 a c f
7 a c g
8 a c g
Should be:
id var val
1 a b d
2 a c f
I'm working with large datasets and tidyverse pipe chains, so a dplyr solution would be preferable.
Use table and which.max to extract the mode:
df %>%
  group_by(id, var) %>%
  summarise(val = {t <- table(val); names(t)[which.max(t)]})
# A tibble: 2 x 3
# Groups: id [?]
# id var val
# <fct> <fct> <chr>
#1 a b d
#2 a c f
Another way to do this in base R: create a three-way contingency table directly, and then find the most frequent val along the third axis:
apply(table(df), c(1, 2), function(v) names(v)[which.max(v)])
# var
#id b c
# a "d" "f"
Convert this to a data frame:
as.data.frame.table(
  apply(table(df), c(1, 2), function(v) names(v)[which.max(v)])
)
# id var Freq
#1 a b d
#2 a c f
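The value column comes back with the default name Freq; as.data.frame.table's responseName argument tidies that up:
as.data.frame.table(
  apply(table(df), c(1, 2), function(v) names(v)[which.max(v)]),
  responseName = "val"
)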
Using dplyr:
library(dplyr)
df %>%
  group_by(id, var, val) %>%
  summarise(n = n()) %>%
  group_by(id, var) %>%
  arrange(-n) %>%
  slice(1) %>%
  ungroup() %>%
  select(-n)
# # A tibble: 2 x 3
# id var val
# <fct> <fct> <fct>
# 1 a b d
# 2 a c f
One option could be using table and max as:
library(dplyr)
df %>%
  group_by(id, var) %>%
  filter(table(val) == max(table(val))) %>%
  slice(1)
# # A tibble: 2 x 3
# # Groups: id, var [2]
# id var val
# <fctr> <fctr> <fctr>
# 1 a b d
# 2 a c g
NOTE: a c g is a tie case. Per the OP, any record can be returned in case of a tie.
I doubt this is any faster, but another option is
df %>%
  group_by(id, var) %>%
  filter(row_number() == rle(as.character(val))$lengths %>%
           {sum(.[1:which.max(.)])})
A dplyr solution using count:
library(dplyr)
df %>%
  count(id, var, val, sort = TRUE) %>%
  group_by(id, var) %>%
  summarize_at("val", head, 1)
# # A tibble: 2 x 3
# id var val
# <fctr> <fctr> <fctr>
# 1 a b d
# 2 a c f
or maybe more idiomatic but longer:
df %>%
  count(id, var, val, sort = TRUE) %>%
  group_by(id, var) %>%
  slice(1) %>%
  select(-n) %>%
  ungroup
Or with tally for same output with slightly different syntax:
df %>%
  group_by(id, var, val) %>%
  tally(sort = TRUE) %>%
  slice(1) %>%
  select(-n) %>%
  ungroup
and a base solution:
df2 <- aggregate(x ~ ., cbind(df, x = 1), sum)
aggregate(val ~ id + var, df2[order(-df2$x), ], head, 1)
# id var val
# 1 a b d
# 2 a c f
Here is my try:
library(dplyr)
df %>%
  group_by(id, var, val) %>%
  mutate(n = n()) %>%
  arrange(desc(n)) %>%
  group_by(id, var) %>%
  filter(row_number() == 1) %>%
  select(-n)
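With more recent dplyr (>= 1.0.0), slice_max offers a compact equivalent; a sketch, where with_ties = FALSE keeps exactly one row per group on ties:
library(dplyr)
df %>%
  count(id, var, val) %>%                      # frequency of each combination
  group_by(id, var) %>%
  slice_max(n, n = 1, with_ties = FALSE) %>%   # keep the most frequent val
  ungroup() %>%
  select(-n)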

Summarize all group values and a conditional subset in the same call

I'll illustrate my question with an example.
Sample data:
df <- data.frame(ID = c(1, 1, 2, 2, 3, 5),
                 A = c("foo", "bar", "foo", "foo", "bar", "bar"),
                 B = c(1, 5, 7, 23, 54, 202))
df
ID A B
1 1 foo 1
2 1 bar 5
3 2 foo 7
4 2 foo 23
5 3 bar 54
6 5 bar 202
What I want to do is to summarize, by ID, the sum of B and the sum of B when A is "foo". I can do this in a couple steps like:
require(magrittr)
require(dplyr)
df1 <- df %>%
  group_by(ID) %>%
  summarize(sumB = sum(B))
df2 <- df %>%
  filter(A == "foo") %>%
  group_by(ID) %>%
  summarize(sumBfoo = sum(B))
left_join(df1, df2)
ID sumB sumBfoo
1 1 6 1
2 2 30 30
3 3 54 NA
4 5 202 NA
However, I'm looking for a more elegant/faster way, as I'm dealing with 10gb+ of out-of-memory data in sqlite.
require(sqldf)
my_db <- src_sqlite("my_db.sqlite3", create = T)
df_sqlite <- copy_to(my_db, df)
I thought of using mutate to define a new Bfoo column:
df_sqlite %>%
  mutate(Bfoo = ifelse(A == "foo", B, 0))
Unfortunately, this doesn't work on the database end of things.
Error in sqliteExecStatement(conn, statement, ...) :
RS-DBI driver: (error in statement: no such function: IFELSE)
You can do both sums in a single dplyr statement:
df1 <- df %>%
  group_by(ID) %>%
  summarize(sumB = sum(B),
            sumBfoo = sum(B[A == "foo"]))
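which gives, with 0 (rather than the NA from the two-step join) for groups that have no "foo" rows:
df1
#   ID sumB sumBfoo
# 1  1    6       1
# 2  2   30      30
# 3  3   54       0
# 4  5  202       0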
And here is a data.table version:
library(data.table)
dt = setDT(df)
dt1 = dt[, .(sumB = sum(B),
             sumBfoo = sum(B[A == "foo"])),
         by = ID]
dt1
ID sumB sumBfoo
1: 1 6 1
2: 2 30 30
3: 3 54 0
4: 5 202 0
Writing up @hadley's comment as an answer:
df_sqlite %>%
  group_by(ID) %>%
  mutate(Bfoo = if (A == "foo") B else 0) %>%
  summarize(sumB = sum(B),
            sumBfoo = sum(Bfoo)) %>%
  collect
If you want to do counting instead of summarizing, then the answer is somewhat different. The change in code is small, especially in the conditional counting part.
df1 <- df %>%
  group_by(ID) %>%
  summarize(countB = n(),
            countBfoo = sum(A == "foo"))
df1
Source: local data frame [4 x 3]
ID countB countBfoo
1 1 2 1
2 2 2 2
3 3 1 0
4 5 1 0
If you wanted to count the rows instead of summing them, can you pass a variable to the function?
df1 <- df %>%
  group_by(ID) %>%
  summarize(RowCountB = n(),
            RowCountBfoo = n(A == "foo"))
I get an error with both n() and nrow().
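That is expected: n() takes no arguments (and nrow() expects a data frame), so the conditional count has to be written as a sum over a logical vector, as in the answer above:
df %>%
  group_by(ID) %>%
  summarize(RowCountB = n(),
            RowCountBfoo = sum(A == "foo"))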
