Consider the following dataset:
library(tidyverse)
tbl <- tibble(
x = letters[1:3],
y = c("a", "a", "b"),
z = letters[4:6]
)
I'd like to create a function to create new columns containing unique ids for each value of a given set of columns, with a single mutate call.
I tried the following but I'm getting an error:
add_ids <- function(.data, ...) {
mutate(.data, map_chr(c(...), ~ {
"{.x}_id" := vctrs::vec_group_id(.data[[.x]])
}))
}
add_ids(tbl, x, y)
Expected output:
# A tibble: 3 × 5
x y z x_id y_id
<chr> <chr> <chr> <int> <int>
1 a a d 1 1
2 b a e 2 1
3 c b f 3 2
You could use across instead, passing the columns as a single tidy-select argument:
add_ids <- function(.data, var) {
  mutate(.data,
         across({{ var }}, vctrs::vec_group_id, .names = "{col}_id"))
}
add_ids(tbl, c(x, y))
## A tibble: 3 × 5
# x y z x_id y_id
# <chr> <chr> <chr> <int> <int>
#1 a a d 1 1
#2 b a e 2 1
#3 c b f 3 2
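If you would rather keep the original calling convention, add_ids(tbl, x, y), one option is to forward the dots into c() inside across. This is only a sketch: it assumes dplyr 1.0.0 or later (for across) and that vctrs is installed. The original attempt most likely fails because := is only understood when it names an argument of mutate itself, not when it is evaluated inside the lambda passed to map_chr.

```r
library(dplyr)

tbl <- tibble(
  x = letters[1:3],
  y = c("a", "a", "b"),
  z = letters[4:6]
)

# Sketch: c(...) turns the bare column names into one tidy-select expression,
# so across() can create one "<col>_id" column per selected column.
add_ids <- function(.data, ...) {
  mutate(
    .data,
    across(c(...), vctrs::vec_group_id, .names = "{.col}_id")
  )
}

res <- add_ids(tbl, x, y)
res
```

The call add_ids(tbl, x, y) then adds x_id and y_id in a single mutate call, without changing how the function is invoked.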
Related
I have a dataset:
library(tidyverse)
fac = factor(c("a","b","c"))
x = c(1,2,3)
d = tibble(fac,x);d
that looks like this:
# A tibble: 3 × 2
fac x
<fct> <dbl>
1 a 1
2 b 2
3 c 3
I want to replace the value 2 in column x, the one that corresponds to factor level b, with 3.14.
How can I do it in a dplyr pipeline?
One alternative uses ifelse:
library(dplyr)
d %>%
mutate(x = ifelse(fac == "b", 3.14, x))
fac x
<fct> <dbl>
1 a 1
2 b 3.14
3 c 3
We may use replace, here with magrittr's %<>% operator so that d itself is updated:
library(dplyr)
library(magrittr)
d %<>%
mutate(x = replace(x, fac == "b", 3.14))
-output
d
# A tibble: 3 × 2
fac x
<fct> <dbl>
1 a 1
2 b 3.14
3 c 3
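A third variant, sketched here with dplyr::if_else, behaves like the ifelse answer but is stricter about types: both branches must be compatible, which can catch accidental type changes. The data below simply rebuilds the d tibble from the question.

```r
library(dplyr)

d <- tibble(
  fac = factor(c("a", "b", "c")),
  x   = c(1, 2, 3)
)

# if_else() keeps x unchanged wherever the condition is FALSE
res <- d %>%
  mutate(x = if_else(fac == "b", 3.14, x))

res
```

Unlike the %<>% version, this leaves d untouched and stores the modified copy in res.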
I have a dataset that looks like this:
test_df <- tibble(
category = c('a', 'a', 'b', 'b', 'b', 'c'),
group = c("X", "Y", "Z", "X", "Y", "Z"),
category_data_1 = c(rep("dataA", 2), rep("dataB", 3), rep("dataC", 1)),
category_data_2 = c(rep("data2A", 2), rep("data2B", 3), rep("data2C", 1))
)
# A tibble: 6 x 4
category group category_data_1 category_data_2
<chr> <chr> <chr> <chr>
1 a X dataA data2A
2 a Y dataA data2A
3 b Z dataB data2B
4 b X dataB data2B
5 b Y dataB data2B
6 c Z dataC data2C
I want two things to happen to this dataset:
Expand it by category and group (this is the easy part, e.g. tidyr::expand()), but leave the category_data variables in the dataset -- they are always tied to the category variable. So, category == "a" will have category_data_1 == "dataA" and category_data_2 == "data2A" across the dataset.
I want to create a new binary variable that checks if the combination of category and group existed (1) or not (0).
So, in the end I would like something that looks like this:
# A tibble: 9 x 5
category group category_data_1 category_data_2 combination_existed
<chr> <chr> <chr> <chr> <dbl>
1 a X dataA data2A 1
2 a Y dataA data2A 1
3 a Z dataA data2A 0
4 b X dataB data2B 1
5 b Y dataB data2B 1
6 b Z dataB data2B 1
7 c X dataC data2C 0
8 c Y dataC data2C 0
9 c Z dataC data2C 1
I think I can achieve this by Frankensteining several temporary datasets together, but was wondering maybe there's an easier path? Perhaps with tidyverse?
Here is a simple solution relying on tidyr::expand and tidyr::nesting.
tidyr::nesting restricts the expansion to combinations of its variables that already occur in the data, so no new combinations of category and the category_data columns are created.
test_df %>%
  expand(nesting(category, category_data_1, category_data_2), group) %>%
  left_join(test_df %>% mutate(x = 1), by = colnames(test_df)) %>%
  replace_na(list(x = 0))
# A tibble: 9 x 5
category category_data_1 category_data_2 group x
<chr> <chr> <chr> <chr> <dbl>
1 a dataA data2A X 1
2 a dataA data2A Y 1
3 a dataA data2A Z 0
4 b dataB data2B X 1
5 b dataB data2B Y 1
6 b dataB data2B Z 1
7 c dataC data2C X 0
8 c dataC data2C Y 0
9 c dataC data2C Z 1
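To see what nesting buys you, the following sketch compares a plain expand over all four columns with the nested version. It rebuilds test_df from the question; the object names crossed and nested are only illustrative.

```r
library(dplyr)
library(tidyr)

test_df <- tibble(
  category = c("a", "a", "b", "b", "b", "c"),
  group = c("X", "Y", "Z", "X", "Y", "Z"),
  category_data_1 = c(rep("dataA", 2), rep("dataB", 3), rep("dataC", 1)),
  category_data_2 = c(rep("data2A", 2), rep("data2B", 3), rep("data2C", 1))
)

# Without nesting(): every value of every column is crossed with every other
crossed <- test_df %>%
  expand(category, category_data_1, category_data_2, group)

# With nesting(): only the category/category_data combinations found in the
# data are kept, and those are then crossed with group
nested <- test_df %>%
  expand(nesting(category, category_data_1, category_data_2), group)

nrow(crossed)  # 3 * 3 * 3 * 3 rows
nrow(nested)   # 3 nested combinations * 3 groups
```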
tidyr::complete, together with its fill argument and the nesting() helper, does this neatly. You have to create the new indicator column before calling complete, though. The full syntax could be:
library(tibble)
test_df <- tibble(
category = c('a', 'a', 'b', 'b', 'b', 'c'),
group = c("X", "Y", "Z", "X", "Y", "Z"),
category_data_1 = c(rep("dataA", 2), rep("dataB", 3), rep("dataC", 1)),
category_data_2 = c(rep("data2A", 2), rep("data2B", 3), rep("data2C", 1))
)
library(tidyverse)
test_df %>% mutate(combination_existed = 1) %>%
  complete(group = unique(test_df$group), nesting(category, category_data_1, category_data_2),
           fill = list(combination_existed = 0))
#> # A tibble: 9 x 5
#> group category category_data_1 category_data_2 combination_existed
#> <chr> <chr> <chr> <chr> <dbl>
#> 1 X a dataA data2A 1
#> 2 X b dataB data2B 1
#> 3 X c dataC data2C 0
#> 4 Y a dataA data2A 1
#> 5 Y b dataB data2B 1
#> 6 Y c dataC data2C 0
#> 7 Z a dataA data2A 0
#> 8 Z b dataB data2B 1
#> 9 Z c dataC data2C 1
Created on 2021-05-26 by the reprex package (v2.0.0)
Alternatively, to get the column order of the expected output, take category out of nesting and group by it instead.
Both versions produce the same rows, but group_by(category) places category ahead of the other columns, as in the expected output. Note that the result stays grouped by category.
test_df %>% mutate(combination_existed = 1) %>%
  group_by(category) %>%
  complete(group = unique(test_df$group), nesting(category_data_1, category_data_2),
           fill = list(combination_existed = 0))
# A tibble: 9 x 5
# Groups: category [3]
category group category_data_1 category_data_2 combination_existed
<chr> <chr> <chr> <chr> <dbl>
1 a X dataA data2A 1
2 a Y dataA data2A 1
3 a Z dataA data2A 0
4 b X dataB data2B 1
5 b Y dataB data2B 1
6 b Z dataB data2B 1
7 c X dataC data2C 0
8 c Y dataC data2C 0
9 c Z dataC data2C 1
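If the grouped result is not wanted, a possible variation is to keep category inside nesting and fix the column order afterwards. This is a sketch, assuming dplyr 1.0.0 or later for relocate; the explicit arrange is there only to make the row order independent of how complete sorts its output.

```r
library(dplyr)
library(tidyr)

test_df <- tibble(
  category = c("a", "a", "b", "b", "b", "c"),
  group = c("X", "Y", "Z", "X", "Y", "Z"),
  category_data_1 = c(rep("dataA", 2), rep("dataB", 3), rep("dataC", 1)),
  category_data_2 = c(rep("data2A", 2), rep("data2B", 3), rep("data2C", 1))
)

res <- test_df %>%
  mutate(combination_existed = 1) %>%
  # complete every category (with its tied data columns) against every group
  complete(
    nesting(category, category_data_1, category_data_2), group,
    fill = list(combination_existed = 0)
  ) %>%
  # move group next to category and sort as in the expected output
  relocate(group, .after = category) %>%
  arrange(category, group)

res
```

The result is an ungrouped tibble, so no ungroup() call is needed before further processing.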
I would like to make this table:
Look like this:
Using dplyr:
df <- tibble(id = c(1,1,3),
b = c("foo", "bar", "foo"),
c = c("x", "y", "z"))
df
# A tibble: 3 x 3
id b c
<dbl> <chr> <chr>
1 1 foo x
2 1 bar y
3 3 foo z
df %>% group_by(id) %>%
  summarize(new = paste(b, collapse = ","),
            new2 = paste(c, collapse = ","))
which results in:
# A tibble: 2 x 3
id new new2
<dbl> <chr> <chr>
1 1 foo,bar x,y
2 3 foo z
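When there are many columns to collapse, writing one paste call per column gets tedious. A sketch using across (assuming dplyr 1.0.0 or later) applies the same collapse to each selected column; note that it keeps the original column names b and c rather than creating new and new2.

```r
library(dplyr)

df <- tibble(
  id = c(1, 1, 3),
  b  = c("foo", "bar", "foo"),
  c  = c("x", "y", "z")
)

# Collapse every selected column into one comma-separated string per id
res <- df %>%
  group_by(id) %>%
  summarise(across(c(b, c), ~ paste(.x, collapse = ",")))

res
```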
I would like to create a function that produces a table of counts based on one or more grouping variables. I found the post "Using dplyr group_by in a function", and its approach works if I pass the function a single variable name:
library(dplyr)
l <- c("a", "b", "c", "e", "f", "g")
animal <- c("dog", "cat", "dog", "dog", "cat", "fish")
sex <- c("m", "f", "f", "m", "f", "unknown")
n <- rep(1, length(animal))
theTibble <- tibble(l, animal, sex, n)
countString <- function(things) {
theTibble %>% group_by(!! enquo(things)) %>% count()
}
countString(animal)
countString(sex)
That works nicely but I don't know how to pass the function two variables.
This sort of works:
countString(paste(animal, sex))
It gives me the correct counts but the returned table collapses the animal and sex variables into one variable.
# A tibble: 4 x 2
# Groups: paste(animal, sex) [4]
`paste(animal, sex)` nn
<chr> <int>
1 cat f 2
2 dog f 1
3 dog m 2
4 fish unknown 1
What is the syntax for passing a function two words separated by commas? I want to get this result:
# A tibble: 4 x 3
# Groups: animal, sex [4]
animal sex nn
<chr> <chr> <int>
1 cat f 2
2 dog f 1
3 dog m 2
4 fish unknown 1
You can use group_by_at with the column indices, passing the column names as strings:
countString <- function(things) {
  index <- which(colnames(theTibble) %in% things)
  theTibble %>%
    group_by_at(index) %>%
    count()
}
countString(c("animal", "sex"))
## A tibble: 4 x 3
## Groups: animal, sex [4]
# animal sex nn
# <chr> <chr> <int>
#1 cat f 2
#2 dog f 1
#3 dog m 2
#4 fish unknown 1
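In current dplyr the scoped verbs such as group_by_at are superseded. A sketch of the same string-based interface with across and all_of follows, assuming dplyr 1.0.0 or later. The name = "count" argument is a choice made here so that the output column does not depend on how a given dplyr version handles the existing n column.

```r
library(dplyr)

theTibble <- tibble(
  l      = c("a", "b", "c", "e", "f", "g"),
  animal = c("dog", "cat", "dog", "dog", "cat", "fish"),
  sex    = c("m", "f", "f", "m", "f", "unknown"),
  n      = 1
)

# things is a character vector of column names; all_of() errors on unknown names
countString <- function(things) {
  theTibble %>%
    count(across(all_of(things)), name = "count")
}

res <- countString(c("animal", "sex"))
res
```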
Here things is replaced with ... to accept multiple arguments; the arguments are captured with enquos and spliced in with !!!. The group_by is removed because count does the grouping itself:
countString <- function(...) {
  grps <- enquos(...)
  theTibble %>%
    count(!!! grps)
}
countString(sex)
# A tibble: 3 x 2
# sex nn
# <chr> <int>
#1 f 3
#2 m 2
#3 unknown 1
countString(animal)
# A tibble: 3 x 2
# animal nn
# <chr> <int>
#1 cat 2
#2 dog 3
#3 fish 1
countString(animal, sex)
# A tibble: 4 x 3
# animal sex nn
# <chr> <chr> <int>
#1 cat f 2
#2 dog f 1
#3 dog m 2
#4 fish unknown 1
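Because count itself accepts ..., the dots can also be forwarded directly, without enquos and !!!. This is a minimal sketch of that shortcut; as an assumption of this example, the count column is named explicitly via name = "count" so the result does not depend on the dplyr version.

```r
library(dplyr)

theTibble <- tibble(
  l      = c("a", "b", "c", "e", "f", "g"),
  animal = c("dog", "cat", "dog", "dog", "cat", "fish"),
  sex    = c("m", "f", "f", "m", "f", "unknown"),
  n      = 1
)

# Bare column names passed to countString() are handed straight to count()
countString <- function(...) {
  theTibble %>%
    count(..., name = "count")
}

by_sex  <- countString(sex)
by_both <- countString(animal, sex)
by_both
```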
I want to use purrr::pmap_df on a data.frame I created to get back another data.frame. However, I want the original data.frame kept and column-bound (as with cbind) to the new one in a single pipe. Example:
f <- function(a, b, c) {
return(list(d = 1, e = 2, f = 3))
}
tibble(a = 1:2, b = 3:4, c = 5:6) %>%
pmap_df(f)
This would give me:
# A tibble: 2 × 3
d e f
<dbl> <dbl> <dbl>
1 1 2 3
2 1 2 3
But I would like to keep that tibble:
# A tibble: 2 × 6
a b c d e f
<int> <int> <int> <dbl> <dbl> <dbl>
1 1 3 5 1 2 3
2 2 4 6 1 2 3
(Silly example but you get what I mean). Any elegant way of doing this in a single pipe?
If you don't want to redefine the function, the simplest way is to just use bind_cols on the results, using . to place the data.frame where you need:
library(tidyverse)
f <- function(a, b, c) {
return(list(d = 1, e = 2, f = 3))
}
tibble(a = 1:2, b = 3:4, c = 5:6) %>%
bind_cols(pmap_df(., f))
#> # A tibble: 2 x 6
#> a b c d e f
#> <int> <int> <int> <dbl> <dbl> <dbl>
#> 1 1 3 5 1 2 3
#> 2 2 4 6 1 2 3
You can also use ... to represent the inputs into pmap, which lets you do
tibble(a = 1:2, b = 3:4, c = 5:6) %>% pmap_df(~c(..., f(...)))
which returns the same thing.
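As of purrr 1.0.0, pmap_df and the other _df/_dfr variants are marked as superseded. A sketch of the same bind_cols idea that avoids them is to call pmap and bind the resulting list of rows with dplyr::bind_rows. The names df and res are only illustrative.

```r
library(dplyr)
library(purrr)

f <- function(a, b, c) {
  list(d = 1, e = 2, f = 3)
}

df <- tibble(a = 1:2, b = 3:4, c = 5:6)

# pmap() returns one named list per row; bind_rows() stacks them into a tibble,
# and bind_cols() attaches that tibble to the original columns
res <- df %>%
  bind_cols(bind_rows(pmap(., f)))

res
```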