Function over tidyverse code results in issue with quotes - r

Here is an example of a problem I'm having when applying tidyverse code inside a loop. I want to repeat the same operation for different variable names, but I'm not sure how to 'unquote' them.
Example data:
df <- data.frame(grp=c(1,2,1,2,1), one=c(rep('a', 3), rep('b', 2)), two=c(rep('a', 1), rep('d', 4)))
cn <- colnames(df)[2:ncol(df)]
for(i in cn){
i <- enquo(i)
print(df %>% group_by(grp) %>% count(!!i))
}
# A tibble: 2 x 3
# Groups: grp [2]
grp `"one"` n
<dbl> <chr> <int>
1 1 one 3
2 2 one 2
# A tibble: 2 x 3
# Groups: grp [2]
grp `"two"` n
<dbl> <chr> <int>
1 1 two 3
2 2 two 2
Doing it for a single variable, one, gives the correct output:
df %>% group_by(grp) %>% count(one)
# A tibble: 4 x 3
# Groups: grp [2]
grp one n
<dbl> <fct> <int>
1 1 a 2
2 1 b 1
3 2 a 1
4 2 b 1

You can use map, and you can avoid group_by by including grp in count:
library(dplyr)
library(purrr)
map(cn, ~df %>% count(grp, .data[[.x]]))
#[[1]]
# grp one n
#1 1 a 2
#2 1 b 1
#3 2 a 1
#4 2 b 1
#[[2]]
# grp two n
#1 1 a 1
#2 1 d 2
#3 2 d 2
You can also use non-standard evaluation (NSE): convert the string to a symbol with sym and unquote it with !!
map(cn, ~df %>% count(grp, !!sym(.x)))
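If the original for loop is preferred over map, a minimal sketch along the same lines might look like the following. It assumes the df and cn from the question; `results` is a name introduced here only for illustration. Because i holds a column name as a string, it is looked up through the .data pronoun rather than captured with enquo.

```r
library(dplyr)

# Example data from the question
df <- data.frame(grp = c(1, 2, 1, 2, 1),
                 one = c(rep("a", 3), rep("b", 2)),
                 two = c(rep("a", 1), rep("d", 4)))
cn <- colnames(df)[2:ncol(df)]

# `results` is a hypothetical container for the per-column counts
results <- list()
for (i in cn) {
  # i is a string such as "one", so index the data with .data[[i]]
  results[[i]] <- df %>% count(grp, .data[[i]])
}

print(results)
```

Storing each table in a named list keeps the results available after the loop instead of only printing them.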

Related

Filter groups using a lagged column

I'm working on creating some error reports, and one of the things I'm trying to address is potential errors within the ID column id_1. I've made an alternative ID column, id_2, from various identifying features within the rows. To help, I've also created a date_lag column on date to catch items that were entered within a specific period after the initial entry. The main problem is returning the entire group that meets the criteria, including the first entry, which has an NA in date_lag. If I allow the NA values through, I get more than just the items I'm looking for (id_1 1 and 2 below).
Example:
#id_1 where potential errors lie
#id_2 alternative id col I'm using to test
df <- data.table(id_1 = c(1:4, 1:4),
id_2 = c(rep(c("b", "a"), c(2, 2))),
date = c(rep(1,4),rep(20,2), rep(10,2)))
df %>%
group_by(id_2) %>%
mutate(date_lag = date - lag(date)) %>%
filter(between(date_lag, 0, 10) | is.na(date_lag))
# A tibble: 6 x 4
# Groups: id_1 [4]
id_1 id_2 date date_lag
<int> <chr> <dbl> <dbl>
1 b 1 NA
2 b 1 0
3 a 1 NA
4 a 1 0
2 b 20 0
3 a 10 9
4 a 10 0
Expected:
# A tibble: 4 x 4
# Groups: id_2 [4]
id_1 id_2 date date_lag
<int> <chr> <dbl> <dbl>
3 a 1 NA
4 a 1 NA
3 a 10 9
4 a 10 9
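Perhaps we can use diff. Grouped by id_1, each group here has two rows, so diff returns a single value that decides whether the whole group is kept: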
library(dplyr)
df %>%
group_by(id_1) %>%
filter(between(diff(date), 0, 10))
Output:
# A tibble: 4 x 3
# Groups: id_1 [2]
# id_1 id_2 date
# <int> <chr> <dbl>
#1 3 a 1
#2 4 a 1
#3 3 a 10
#4 4 a 10
If grouping by id_2 instead, prepend NA to the result, because diff returns a vector one element shorter than its input:
df %>%
group_by(id_2) %>%
filter(between(c(NA, diff(date)), 0, 10))
# A tibble: 5 x 3
# Groups: id_2 [2]
# id_1 id_2 date
# <int> <chr> <dbl>
#1 2 b 1
#2 4 a 1
#3 2 b 20
#4 3 a 10
#5 4 a 10
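The first diff call above depends on every id_1 group having exactly two rows. As a hedged sketch for groups of any size, the whole group can be kept whenever any lagged difference falls inside the window. The example below builds the data as a tibble rather than a data.table (an assumption made to keep it self-contained), and `res` is a name introduced here for illustration.

```r
library(dplyr)

# Same values as the question's example, built as a tibble
df <- tibble(id_1 = c(1:4, 1:4),
             id_2 = rep(rep(c("b", "a"), c(2, 2)), 2),
             date = c(rep(1, 4), rep(20, 2), rep(10, 2)))

res <- df %>%
  group_by(id_1) %>%
  # keep every row of a group if at least one lagged difference is in [0, 10];
  # na.rm = TRUE ignores the NA produced for the first row of each group
  filter(any(between(date - lag(date), 0, 10), na.rm = TRUE)) %>%
  ungroup()

print(res)
```

Wrapping the condition in any() yields one TRUE or FALSE per group, so the first entry with its NA lag is returned together with the rest of its group.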

Add original values for columns after group by

For the data frame below, I want to add the original values of VAR_X after a group_by on ID and event and a max() on quest, but I cannot get my code right. Any suggestions? By the way, in my original data frame more than one column needs to be added.
df <- data.frame(ID = c(1,1,1,1,1,1,2,2,2,3,3,3),
quest = c(1,1,2,2,3,3,1,2,3,1,2,3),
event = c("A","B","A","B","A",NA,"C","D","C","D","D",NA),
VAR_X = c(2,4,3,6,3,NA,6,4,5,7,5,NA))
Code:
df %>%
group_by(ID,event) %>%
summarise(quest = max(quest))
Desired output:
ID quest event VAR_X
1 1 2 B 6
2 1 3 A 3
3 2 2 D 4
4 2 3 C 5
5 3 2 D 5
Start by omitting the NA values, and at the end do an inner_join with the original data set.
df %>%
na.omit() %>%
group_by(ID, event) %>%
summarise(quest = max(quest)) %>%
inner_join(df, by = c("ID", "event", "quest"))
## A tibble: 5 x 4
## Groups: ID [3]
# ID event quest VAR_X
# <dbl> <fct> <dbl> <dbl>
#1 1 A 3 3
#2 1 B 2 6
#3 2 C 3 5
#4 2 D 2 4
#5 3 D 2 5
Alternatively, filter each group down to the rows where quest equals its maximum, which keeps all the other columns:
df %>%
drop_na() %>% # remove if necessary ..
group_by(ID, event) %>%
filter(quest == max(quest)) %>%
ungroup()
# A tibble: 5 x 4
# ID quest event VAR_X
#<dbl> <dbl> <chr> <dbl>
# 1 1 2 B 6
# 2 1 3 A 3
# 3 2 2 D 4
# 4 2 3 C 5
# 5 3 2 D 5
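Another possible variant, assuming dplyr 1.0.0 or later is available, uses slice_max to pick the row with the largest quest in each group. This is a sketch rather than part of the original answers; `res` is a name introduced here, and the NA rows are dropped with a plain filter on event so that only dplyr is needed.

```r
library(dplyr)

# Example data from the question
df <- data.frame(ID    = c(1, 1, 1, 1, 1, 1, 2, 2, 2, 3, 3, 3),
                 quest = c(1, 1, 2, 2, 3, 3, 1, 2, 3, 1, 2, 3),
                 event = c("A", "B", "A", "B", "A", NA, "C", "D", "C", "D", "D", NA),
                 VAR_X = c(2, 4, 3, 6, 3, NA, 6, 4, 5, 7, 5, NA))

res <- df %>%
  filter(!is.na(event)) %>%       # drop rows without an event
  group_by(ID, event) %>%
  slice_max(quest, n = 1) %>%     # row(s) with the highest quest per group
  ungroup()

print(res)
```

Like the filter approach, this keeps every original column, so it extends to data with more than one extra column.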

Removing mirrored combinations of variables in a data frame

I'm looking to get each unique combination of two variables:
library(purrr)
cross_df(list(id1 = seq_len(3), id2 = seq_len(3)), .filter = `==`)
# A tibble: 6 x 2
id1 id2
<int> <int>
1 2 1
2 3 1
3 1 2
4 3 2
5 1 3
6 2 3
How do I remove the mirrored combinations? That is, I want only one of rows 1 and 3 in the data frame above, only one of rows 2 and 5, and only one of rows 4 and 6. My desired output would be something like:
# A tibble: 3 x 2
id1 id2
<int> <int>
1 2 1
2 3 1
3 3 2
I don't care whether a particular id value is in id1 or id2, so the output below is just as acceptable:
# A tibble: 3 x 2
id1 id2
<int> <int>
1 1 2
2 1 3
3 2 3
A tidyverse version of Dan's answer:
cross_df(list(id1 = seq_len(3), id2 = seq_len(3)), .filter = `==`) %>%
mutate(min = pmap_int(., min), max = pmap_int(., max)) %>% # Find the min and max in each row
unite(check, c(min, max), remove = FALSE) %>% # Combine them in a "check" variable
distinct(check, .keep_all = TRUE) %>% # Remove duplicates of the "check" variable
select(id1, id2)
# A tibble: 3 x 2
id1 id2
<int> <int>
1 2 1
2 3 1
3 3 2
A Base R approach:
# df is assumed to hold the cross_df() result from the question
# create a key from the sorted elements of each row; the separator keeps multi-digit ids unambiguous
df$temp <- apply(df, 1, function(x) paste(sort(x), collapse = "_"))
# then keep only the first row for each key
df[!duplicated(df$temp), 1:2]
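The idea of sorting each row before checking for duplicates can also be sketched without building a string key, by comparing the row-wise minimum and maximum directly. The example below is base R only and rebuilds the pairs with expand.grid instead of cross_df (an assumption made to keep it self-contained); `pairs`, `lo`, `hi` and `unique_pairs` are names introduced here.

```r
# All ordered pairs of 1:3, excluding pairs where both ids are equal
pairs <- expand.grid(id1 = 1:3, id2 = 1:3)
pairs <- pairs[pairs$id1 != pairs$id2, ]

# Row-wise minimum and maximum give an order-independent key per row
lo <- pmin(pairs$id1, pairs$id2)
hi <- pmax(pairs$id1, pairs$id2)

# duplicated() on a matrix compares whole rows, so mirrored pairs collapse
unique_pairs <- pairs[!duplicated(cbind(lo, hi)), ]
rownames(unique_pairs) <- NULL

print(unique_pairs)
```

Because the key is numeric, there is no need to choose a separator for multi-digit ids.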

How to append a sequential count of a column into a new column from a grouped column using dplyr

I have the following data frame:
library(tidyverse)
dat <- data.frame(foo=c(1, 1, 2, 3, 3, 3), bar=c('a', 'a', 'b', 'b', 'c', 'd'))
dat
#> foo bar
#> 1 1 a
#> 2 1 a
#> 3 2 b
#> 4 3 b
#> 5 3 c
#> 6 3 d
What I want to do is create a new column in which each bar value is tagged with a sequential count of its occurrences, resulting in:
foo bar new_column
1 a a.sample.1
1 a a.sample.2
2 b b.sample.1
3 b b.sample.2
3 c c.sample.1
3 d d.sample.1
I'm stuck with this code:
dat %>% group_by(bar) %>% summarise(n = n())
# A tibble: 4 x 2
bar n
<fctr> <int>
1 a 2
2 b 2
3 c 1
4 d 1
You can use group_by %>% mutate:
dat %>% group_by(bar) %>% mutate(new_column = paste(bar, 'sample', 1:n(), sep = "."))
# A tibble: 6 x 3
# Groups: bar [4]
# foo bar new_column
# <dbl> <fctr> <chr>
#1 1 a a.sample.1
#2 1 a a.sample.2
#3 2 b b.sample.1
#4 3 b b.sample.2
#5 3 c c.sample.1
#6 3 d d.sample.1
Alternatively, use paste0 with row_number():
dat %>% group_by(bar) %>% mutate(new_column = paste0(bar, '.sample.', row_number()))
# A tibble: 6 x 3
# Groups: bar [4]
foo bar new_column
<dbl> <fctr> <chr>
1 1 a a.sample.1
2 1 a a.sample.2
3 2 b b.sample.1
4 3 b b.sample.2
5 3 c c.sample.1
6 3 d d.sample.1
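For comparison, the same within-group counter can be sketched in base R with ave, which applies a function to each group and returns the results in the original row order. This is an illustrative alternative rather than one of the answers above; `idx` is a name introduced here.

```r
# Example data from the question
dat <- data.frame(foo = c(1, 1, 2, 3, 3, 3),
                  bar = c('a', 'a', 'b', 'b', 'c', 'd'))

# ave() runs seq_along within each level of bar, giving 1, 2, ... per group
idx <- ave(seq_along(dat$bar), dat$bar, FUN = seq_along)

# Build the label from the group value and its running index
dat$new_column <- paste(dat$bar, "sample", idx, sep = ".")

print(dat)
```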

summarise and group_by using two different columns consecutively

I have a dataframe df with three columns a,b,c.
df <- data.frame(a = c('a','b','c','d','e','f','g','e','f','g'),
b = c('X','Y','Z','X','Y','Z','X','X','Y','Z'),
c = c('cat','dog','cat','dog','cat','cat','dog','cat','cat','dog'))
df
# output
a b c
1 a X cat
2 b Y dog
3 c Z cat
4 d X dog
5 e Y cat
6 f Z cat
7 g X dog
8 e X cat
9 f Y cat
10 g Z dog
I need to group by column b and then summarise column c by counting how often each of its values occurs.
df %>% group_by(b) %>%
summarise(nCat = sum(c == 'cat'),
nDog = sum(c == 'dog'))
#output
# A tibble: 3 × 3
b nCat nDog
<fctr> <int> <int>
1 X 2 2
2 Y 2 1
3 Z 2 1
However, before doing that, I need to remove the rows whose value in a is associated with more than one value in b.
df %>% group_by(a) %>% summarise(count = n())
#output
# A tibble: 7 × 2
a count
<fctr> <int>
1 a 1
2 b 1
3 c 1
4 d 1
5 e 2
6 f 2
7 g 2
For example, in this data frame, all rows with the value e (b values: Y, X), f (b values: Z, Y) or g (b values: X, Z) in column a should be removed.
# Expected output
# A tibble: 3 × 3
b nCat nDog
<fctr> <int> <int>
1 X 1 1
2 Y 0 1
3 Z 1 0
We can group by 'a' and use filter with n_distinct to keep only the groups that have a single unique value of 'b'. Then we group by 'b' and do the summarise:
df %>%
group_by(a) %>%
filter(n_distinct(b)==1) %>%
group_by(b) %>%
summarise(nCat =sum(c=='cat'), nDog = sum(c=='dog'), Total = n())
# A tibble: 3 × 4
# b nCat nDog Total
# <fctr> <int> <int> <int>
#1 X 1 1 2
#2 Y 0 1 1
#3 Z 1 0 1
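To see which values of a the n_distinct filter removes, the step can be split in two. The sketch below assumes the df from the question; `b_per_a`, `keep` and `res` are names introduced here for illustration.

```r
library(dplyr)

# Example data from the question
df <- data.frame(a = c('a','b','c','d','e','f','g','e','f','g'),
                 b = c('X','Y','Z','X','Y','Z','X','X','Y','Z'),
                 c = c('cat','dog','cat','dog','cat','cat','dog','cat','cat','dog'))

# Step 1: number of distinct b values for each a
b_per_a <- df %>%
  group_by(a) %>%
  summarise(n_b = n_distinct(b))

# Values of a that map to exactly one b
keep <- as.character(b_per_a$a[b_per_a$n_b == 1])

# Step 2: summarise only the rows whose a value was kept
res <- df %>%
  filter(a %in% keep) %>%
  group_by(b) %>%
  summarise(nCat = sum(c == 'cat'), nDog = sum(c == 'dog'))

print(b_per_a)
print(res)
```

Inspecting b_per_a first makes it easy to confirm that e, f and g are the only values dropped before the final summary.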
