I would like to create a function that produces a table of counts based on one or more grouping variables. I found the post "Using dplyr group_by in a function", and its approach works if I pass the function a single variable name:
library(dplyr)
l <- c("a", "b", "c", "e", "f", "g")
animal <- c("dog", "cat", "dog", "dog", "cat", "fish")
sex <- c("m", "f", "f", "m", "f", "unknown")
n <- rep(1, length(animal))
theTibble <- tibble(l, animal, sex, n)
countString <- function(things) {
theTibble %>% group_by(!! enquo(things)) %>% count()
}
countString(animal)
countString(sex)
That works nicely but I don't know how to pass the function two variables.
This sort of works:
countString(paste(animal, sex))
It gives me the correct counts but the returned table collapses the animal and sex variables into one variable.
# A tibble: 4 x 2
# Groups: paste(animal, sex) [4]
`paste(animal, sex)` nn
<chr> <int>
1 cat f 2
2 dog f 1
3 dog m 2
4 fish unknown 1
What is the syntax for passing a function two words separated by commas? I want to get this result:
# A tibble: 4 x 3
# Groups: animal, sex [4]
animal sex nn
<chr> <chr> <int>
1 cat f 2
2 dog f 1
3 dog m 2
4 fish unknown 1
You can use group_by_at with column indices, passing the column names to the function as a character vector:
countString <- function(things) {
index <- which(colnames(theTibble) %in% things)
theTibble %>%
group_by_at(index) %>%
count()
}
countString(c("animal", "sex"))
## A tibble: 4 x 3
## Groups: animal, sex [4]
# animal sex nn
# <chr> <chr> <int>
#1 cat f 2
#2 dog f 1
#3 dog m 2
#4 fish unknown 1
We replace 'things' with ... so the function accepts multiple arguments, capture them with enquos, and splice them in with !!!. The group_by step is dropped because count accepts the grouping variables directly.
countString <- function(...) {
grps <- enquos(...)
theTibble %>%
count(!!! grps)
}
countString(sex)
# A tibble: 3 x 2
# sex nn
# <chr> <int>
#1 f 3
#2 m 2
#3 unknown 1
countString(animal)
# A tibble: 3 x 2
# animal nn
# <chr> <int>
#1 cat 2
#2 dog 3
#3 fish 1
countString(animal, sex)
# A tibble: 4 x 3
# animal sex nn
# <chr> <chr> <int>
#1 cat f 2
#2 dog f 1
#3 dog m 2
#4 fish unknown 1
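If the grouping columns arrive as a character vector rather than as bare names, a minimal sketch along the same lines uses across() with all_of() inside group_by(), which current dplyr documents as the replacement for the superseded group_by_at. The helper name count_by and the output column name count are illustrative choices, not taken from the answers above; the data is the tibble from the question.

```r
library(dplyr)

theTibble <- tibble(
  l = c("a", "b", "c", "e", "f", "g"),
  animal = c("dog", "cat", "dog", "dog", "cat", "fish"),
  sex = c("m", "f", "f", "m", "f", "unknown"),
  n = rep(1, 6)
)

# count_by is a hypothetical helper; `cols` is a character vector of column names
count_by <- function(data, cols) {
  data %>%
    group_by(across(all_of(cols))) %>%
    # dplyr::n() is spelled out because the tibble also has a column called n
    summarise(count = dplyr::n(), .groups = "drop")
}

by_both <- count_by(theTibble, c("animal", "sex"))
by_animal <- count_by(theTibble, "animal")
```

Writing the result to a column called count sidesteps the nn renaming that appears above when the data already contains a column named n.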
Consider the following dataset:
library(tidyverse)
tbl <- tibble(
x = letters[1:3],
y = c("a", "a", "b"),
z = letters[4:6]
)
I'd like to create a function to create new columns containing unique ids for each value of a given set of columns, with a single mutate call.
I tried the following but I'm getting an error:
add_ids <- function(.data, ...) {
mutate(.data, map_chr(c(...), ~ {
"{.x}_id" := vctrs::vec_group_id(.data[[.x]])
}))
}
add_ids(tbl, x, y)
Expected output:
# A tibble: 3 × 5
x y z x_id y_id
<chr> <chr> <chr> <int> <int>
1 a a d 1 1
2 b a e 2 1
3 c b f 3 2
You could use across instead in this case:
add_ids <- function(.data, var) {
mutate(.data,
across({{ var }}, vctrs::vec_group_id, .names = "{col}_id"))
}
add_ids(tbl, c(x, y))
## A tibble: 3 × 5
# x y z x_id y_id
# <chr> <chr> <chr> <int> <int>
#1 a a d 1 1
#2 b a e 2 1
#3 c b f 3 2
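The question asked for a function that takes the columns through ..., so here is a hedged variant of the across approach that keeps that call style. It assumes dplyr 1.0 or later and that forwarding the dots into c(...) inside across() is acceptable for tidy selection; the function is otherwise the same as the one above.

```r
library(dplyr)

tbl <- tibble(
  x = letters[1:3],
  y = c("a", "a", "b"),
  z = letters[4:6]
)

# Variant of add_ids that accepts bare column names through the dots
add_ids <- function(.data, ...) {
  mutate(
    .data,
    # c(...) forwards the selected columns; each gets a "<name>_id" companion
    across(c(...), vctrs::vec_group_id, .names = "{.col}_id")
  )
}

with_ids <- add_ids(tbl, x, y)
```

With this signature the call is add_ids(tbl, x, y) rather than add_ids(tbl, c(x, y)).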
I want to remove the rows that were combined by a group_by from the original data frame and put the combined result into another data frame.
library(dplyr)
Name <- c("Jon", "Jon", "Jon", "Jon", "Jon", "Jon")
school <- c("a", "a", "b", "c", "x", "y")
Age <- c(10, 15, 20, 25, 30, 60)
df <- data.frame(Name, school, Age )
#case_1
dfAvg <- df %>%
group_by(Name, school) %>%
summarise(across(Age, mean))
After this, I want to remove the rows that were combined by the group_by, so that my original df keeps 4 rows and a new df holds only the 1 row produced by the grouping.
expected output:
2 data frames:
the first one:
jon b 20
jon c 25
jon x 30
jon y 60
and the second one contains the row that resulted after group by:
jon a 12.5
Here is one way that stores your 2 data frames in a list:
library(dplyr)
df %>%
group_by(Name, school) %>%
summarise(n = n() > 1, Age = mean(Age)) %>%
split(., .$n)
`summarise()` has grouped output by 'Name'. You can override using the
$`FALSE`
# A tibble: 4 x 4
# Groups: Name [1]
Name school n Age
<chr> <chr> <lgl> <dbl>
1 Jon b FALSE 20
2 Jon c FALSE 25
3 Jon x FALSE 30
4 Jon y FALSE 60
$`TRUE`
# A tibble: 1 x 4
# Groups: Name [1]
Name school n Age
<chr> <chr> <lgl> <dbl>
1 Jon a TRUE 12.5
Or, because the averaged group happens to be the first row of dfAvg, you can use slice from the dplyr package to get the two outputs:
#First data frame :
dfAvg %>% dplyr::slice(-1)
#or
dfAvg %>% dplyr::slice(2:5)
# A tibble: 4 x 3
# Groups: Name [1]
Name school Age
<fct> <fct> <dbl>
1 Jon b 20
2 Jon c 25
3 Jon x 30
4 Jon y 60
#Second data frame :
dfAvg %>% dplyr::slice(1)
# A tibble: 1 x 3
# Groups: Name [1]
Name school Age
<fct> <fct> <dbl>
1 Jon a 12.5
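Since slice depends on where the combined row happens to sit, a position-independent sketch is to record each group's size on the original rows first and split on that. The names group_size, untouched and averaged are illustrative and do not come from the answers above.

```r
library(dplyr)

df <- data.frame(
  Name = c("Jon", "Jon", "Jon", "Jon", "Jon", "Jon"),
  school = c("a", "a", "b", "c", "x", "y"),
  Age = c(10, 15, 20, 25, 30, 60)
)

# Attach the number of rows sharing each Name/school pair to every row
sized <- df %>% add_count(Name, school, name = "group_size")

# Rows that were alone in their group stay as they are
untouched <- sized %>%
  filter(group_size == 1) %>%
  select(-group_size)

# Groups with more than one row are collapsed to their mean Age
averaged <- sized %>%
  filter(group_size > 1) %>%
  group_by(Name, school) %>%
  summarise(Age = mean(Age), .groups = "drop")
```

This keeps the two results as separate objects and does not rely on the row order of the summary.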
I have a data frame containing a varying number of data points in the same column:
library(tidyverse)
df <- tribble(~id, ~data,
"A", "a;b;c",
"B", "e;f")
I want to obtain one row per data point, separating the content of column data and distributing it on rows. This code gives the expected result, but is clumsy:
df %>%
separate(data,
into = paste0("dat_",1:5),
sep = ";",
fill = "right") %>%
pivot_longer(starts_with("dat_"),
names_to = "data_number",
names_pattern = "dat_(\\d+)") %>%
filter(!is.na(value))
#> # A tibble: 5 x 3
#> id data_number value
#> <chr> <chr> <chr>
#> 1 A 1 a
#> 2 A 2 b
#> 3 A 3 c
#> 4 B 1 e
#> 5 B 2 f
Tidyverse solutions preferred.
Here is one way, using rowid from data.table to number the values within each id:
library(dplyr)
library(tidyr)
library(data.table)
df %>%
separate_rows(data) %>%
mutate(data_number = rowid(id), .before = 2)
-output
# A tibble: 5 x 3
id data_number data
<chr> <int> <chr>
1 A 1 a
2 A 2 b
3 A 3 c
4 B 1 e
5 B 2 f
library(dplyr)
library(tidyr)
df %>%
separate_rows(data)
output:
# A tibble: 5 x 2
id data
<chr> <chr>
1 A a
2 A b
3 A c
4 B e
5 B f
Using str_split and unnest -
library(tidyverse)
df %>%
mutate(data = str_split(data, ';'),
data_number = map(data, seq_along)) %>%
unnest(c(data, data_number))
# id data data_number
# <chr> <chr> <int>
#1 A a 1
#2 A b 2
#3 A c 3
#4 B e 1
#5 B f 2
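For a version that stays within dplyr and tidyr, the numbering can also be sketched with row_number() inside a grouped mutate instead of data.table's rowid. The explicit sep = ";" is an assumption about the intended delimiter, chosen so that other punctuation in the data would not trigger a split.

```r
library(dplyr)
library(tidyr)

df <- tribble(
  ~id, ~data,
  "A", "a;b;c",
  "B", "e;f"
)

long <- df %>%
  # One row per semicolon-separated value
  separate_rows(data, sep = ";") %>%
  group_by(id) %>%
  # Position of each value within its id, placed right after the id column
  mutate(data_number = row_number(), .after = id) %>%
  ungroup()
```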
Here is an example of the problem I'm having when applying tidyverse code in a loop. I want to repeat the same count for different variable names, but I'm not sure how to 'unquote' them.
Example data:
df <- data.frame(grp=c(1,2,1,2,1), one=c(rep('a', 3), rep('b', 2)), two=c(rep('a', 1), rep('d', 4)))
cn <- colnames(df)[2:ncol(df)]
for(i in cn){
i <- enquo(i)
print(df %>% group_by(grp) %>% count(!!i))
}
# A tibble: 2 x 3
# Groups: grp [2]
grp `"one"` n
<dbl> <chr> <int>
1 1 one 3
2 2 one 2
# A tibble: 2 x 3
# Groups: grp [2]
grp `"two"` n
<dbl> <chr> <int>
1 1 two 3
2 2 two 2
Doing it for a single variable named one; this is the correct output.
df %>% group_by(grp) %>% count(one)
# A tibble: 4 x 3
# Groups: grp [2]
grp one n
<dbl> <fct> <int>
1 1 a 2
2 1 b 1
3 2 a 1
4 2 b 1
You can use map with the .data pronoun, and you can avoid the group_by by including grp in count:
library(dplyr)
library(purrr)
map(cn, ~df %>% count(grp, .data[[.x]]))
#[[1]]
# grp one n
#1 1 a 2
#2 1 b 1
#3 2 a 1
#4 2 b 1
#[[2]]
# grp two n
#1 1 a 1
#2 1 d 2
#3 2 d 2
You can also convert each name to a symbol with sym and unquote it with !!:
map(cn, ~df %>% count(grp, !!sym(.x)))
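The loop in the question most likely fails because i holds a string, so enquo(i) followed by !!i inserts the constant "one" rather than the column one, which matches the `"one"` column in the output shown there. A minimal sketch of the same loop written with the .data pronoun, using base lapply instead of purrr (the list name counts is illustrative):

```r
library(dplyr)

df <- data.frame(
  grp = c(1, 2, 1, 2, 1),
  one = c(rep("a", 3), rep("b", 2)),
  two = c(rep("a", 1), rep("d", 4))
)
cn <- colnames(df)[2:ncol(df)]

# .data[[col]] looks the column up by its name held in the string `col`
counts <- lapply(cn, function(col) {
  df %>% count(grp, .data[[col]])
})
names(counts) <- cn

for (col in cn) print(counts[[col]])
```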
I have two data frames like this:
library(tidyverse)
a <- tibble(x=c("mother","father","brother","brother"),y=c("a","b","c","d"))
b <- tibble(x=c("mother","father","brother","brother"),z=c("e","f","g","h"))
I want to join these data frames so that each "brother" occurs only once.
I have tried full_join:
ab <- full_join(a,b,by="x")
and obtained this:
# A tibble: 6 x 3
x y z
<chr> <chr> <chr>
1 mother a e
2 father b f
3 brother c g
4 brother c h
5 brother d g
6 brother d h
What I need is this:
ab <- tibble(x=c("mother","father","brother1","brother2"),y=c("a","b","c","d"),z=c("e","f","g","h"))
# A tibble: 4 x 3
x y z
<chr> <chr> <chr>
1 mother a e
2 father b f
3 brother1 c g
4 brother2 d h
Using dplyr you could do something like the following, which adds an extra variable person to identify each person within each group in x, and then joins by x and person:
library(dplyr)
a %>%
group_by(x) %>%
mutate(person = 1:n()) %>%
full_join(b %>%
group_by(x) %>%
mutate(person = 1:n()),
by = c("x", "person")
) %>%
select(x, person, y, z)
Which returns:
# A tibble: 4 x 4
# Groups: x [3]
x person y z
<chr> <int> <chr> <chr>
1 mother 1 a e
2 father 1 b f
3 brother 1 c g
4 brother 2 d h
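To get from this result to the brother1/brother2 labels requested in the question, one possible sketch appends the within-group number to x only where a value occurs more than once. The helper number_within and the column times are illustrative names, and pairing rows by their order of appearance is an assumption about how the two tables line up.

```r
library(dplyr)

a <- tibble(x = c("mother", "father", "brother", "brother"), y = c("a", "b", "c", "d"))
b <- tibble(x = c("mother", "father", "brother", "brother"), z = c("e", "f", "g", "h"))

# Hypothetical helper: number the rows within each value of x
number_within <- function(data) {
  data %>%
    group_by(x) %>%
    mutate(person = row_number()) %>%
    ungroup()
}

ab <- full_join(number_within(a), number_within(b), by = c("x", "person")) %>%
  # How many times each x occurs after the join
  add_count(x, name = "times") %>%
  # Only repeated values get a numeric suffix
  mutate(x = if_else(times > 1, paste0(x, person), x)) %>%
  select(x, y, z)
```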
Unfortunately, the first and second brother are indistinguishable from each other! How would R know that you want to join them that way, and not the reverse?
I would try to "remove duplicates" in the original data.frames by adding the "1" and "2" identifiers there.
I don't know tidyverse syntax, but if you never get more than two repetitions, you may want to try
a <- c("A", "B", "C", "C")
a[duplicated(a)] <- paste0(a[duplicated(a)], 2)
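If a value can repeat more than twice, or the first occurrence should be numbered too, a base R sketch along the same lines can use ave to count occurrences within each value. The variable names occurrence and repeated are illustrative.

```r
a <- c("A", "B", "C", "C")

# Running occurrence number of each element within its own value
occurrence <- ave(seq_along(a), a, FUN = seq_along)

# TRUE for every element whose value appears more than once
repeated <- a %in% a[duplicated(a)]

# Suffix only the repeated values, including their first occurrence
a[repeated] <- paste0(a[repeated], occurrence[repeated])
```

Here the two "C" entries become "C1" and "C2", while "A" and "B" are left unchanged.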