dplyr: access current group variable

After using data.table for quite some time, I thought it was time to try dplyr. It's fun, but I wasn't able to figure out how to access the current grouping variable or how to return multiple values per group.
The following example works fine with data.table. How would you write this with dplyr?
library(data.table)
foo <- matrix(c(1, 2, 3, 4), ncol = 2)
dt <- data.table(a = c(1, 1, 2), b = c(4, 5, 6))
# data.table (expected)
dt[, .(c = foo[, a]), by = a]
a c
1: 1 1
2: 1 2
3: 2 3
4: 2 4
# dplyr (?)
library(dplyr)
dt %>%
group_by(a) %>%
summarize(c = foo[a])

We can use do from dplyr (no other packages needed). do is very handy for expanding rows; we only need to wrap the result in data.frame.
dt %>%
group_by(a) %>%
do(data.frame(c = foo[, unique(.$a)]))
# a c
# <dbl> <dbl>
#1 1 1
#2 1 2
#3 2 3
#4 2 4
Or instead of unique we can subset with the first observation:
dt %>%
group_by(a) %>%
do(data.frame(c = foo[, .$a[1]]))
# a c
# <dbl> <dbl>
#1 1 1
#2 1 2
#3 2 3
#4 2 4
Or, with dplyr >= 1.1.0 (EDIT: based on @Todd West's comments):
dt %>%
reframe(c = foo[, cur_group()$a], .by = 'a')
a c
1 1 1
2 1 2
3 2 3
4 2 4
This can also be done without using any packages:
stack(lapply(split(dt$a, dt$a), function(x) foo[,unique(x)]))[2:1]
# ind values
#1 1 1
#2 1 2
#3 2 3
#4 2 4

You can still access the grouping variable, but it behaves like a normal vector with one unique value per group, so wrapping it in unique works. At the same time, dplyr does not automatically expand rows the way data.table does, so you need unnest from the tidyr package:
library(dplyr); library(tidyr)
dt %>%
group_by(a) %>%
summarize(c = list(foo[,unique(a)])) %>%
unnest()
# Source: local data frame [4 x 2]
# a c
# <dbl> <dbl>
# 1 1 1
# 2 1 2
# 3 2 3
# 4 2 4
Or we can use first to speed things up, since we already know the group variable vector is the same within every group:
dt %>%
group_by(a) %>%
summarize(c = list(foo[,first(a)])) %>%
unnest()
# Source: local data frame [4 x 2]
# a c
# <dbl> <dbl>
# 1 1 1
# 2 1 2
# 3 2 3
# 4 2 4
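As a side note, tidyr >= 1.0.0 expects the columns to unnest to be spelled out explicitly; a sketch of the last pipe in current tidyr:
library(dplyr); library(tidyr)
dt %>%
  group_by(a) %>%
  summarize(c = list(foo[, first(a)])) %>%
  unnest(cols = c)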

To access a grouping variable in a grouped operation (group_map, group_walk, group_modify), we can refer to .y, which is exposed automatically within the evaluation context.
Example
> iris %>% group_by(Species) %>% group_walk(~{ print(.y) })
# A tibble: 1 x 1
Species
<fct>
1 setosa
# A tibble: 1 x 1
Species
<fct>
1 versicolor
# A tibble: 1 x 1
Species
<fct>
1 virginica
This is also documented in more detail at https://dplyr.tidyverse.org/reference/group_map.html:
The key, a tibble with exactly one row and columns for each grouping variable, exposed as .y.
Regarding the other proposed solutions: afaik do is not recommended any longer, and the solution with unique is imho clumsy (as it requires another reference to the data frame in question).
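Applied to the question's data, a minimal sketch with group_modify (same dt and foo as above), where each group's function may return multiple rows:
library(dplyr)
dt %>%
  group_by(a) %>%
  group_modify(~ data.frame(c = foo[, .y$a]))  # .y is the one-row group key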

Related

count distinct levels of a data frame for groups based on a condition

I have the following DF
x = data.frame('grp' = c(1,1,1,2,2,2),'a' = c(1,2,1,1,2,1), 'b'= c(6,5,6,6,2,6), 'c' = c(0.1,0.2,0.4,-1, 0.9,0.7))
grp a b c
1 1 1 6 0.1
2 1 2 5 0.2
3 1 1 6 0.4
4 2 1 6 -1.0
5 2 2 2 0.9
6 2 1 6 0.7
I want to count the distinct levels of (a, b) for each group where c >= 0.1.
I have tried using dplyr's group_by and summarise, but I am not getting the desired result:
x %>% group_by(grp) %>% summarise(count = n_distinct(c(a,b)[c >= 0.1]))
For the above case I would expect the following result
grp count
<dbl> <int>
1 1 2
2 2 2
However using the above query I am getting the following result
grp count
<dbl> <int>
1 1 4
2 2 3
Logically, the above output seems to count all unique values of the concatenated vector c(a, b), rather than the distinct (a, b) pairs I need.
Any pointers? I'd really appreciate any help.
Here's another way using dplyr. It sounds like you want to filter based on c, so we do that first. Instead of using c(a, b) in n_distinct, we can write it as n_distinct(a, b), which counts distinct combinations.
x %>%
filter(c >= 0.1) %>%
group_by(grp) %>%
summarise(cnt_d = n_distinct(a, b))
# grp cnt_d
# <dbl> <int>
# 1 1 2
# 2 2 2
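To see why the original attempt over-counts, compare the two forms on group 1 (a small illustrative check, using the x defined above):
library(dplyr)
g1 <- x %>% filter(grp == 1, c >= 0.1)
n_distinct(c(g1$a, g1$b))  # 4: unique values of the flattened vector c(1, 2, 1, 6, 5, 6)
n_distinct(g1$a, g1$b)     # 2: distinct (a, b) pairs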
We can paste the a and b columns together and count the distinct values in each group.
library(dplyr)
x %>%
mutate(col = paste(a, b, sep = "_")) %>%
group_by(grp) %>%
summarise(count = n_distinct(col[c >= 0.1]))
# grp count
# <dbl> <int>
#1 1 2
#2 2 2
An option using data.table
library(data.table)
setDT(x)[c >= 0.1, .(cnt_d = uniqueN(paste(a, b))), .(grp)]
# grp cnt_d
#1: 1 2
#2: 2 2

R Show duplicates in dataframe

I am trying to "highlight" duplicates in my data frame. I found various tutorials on dropping duplicates or creating a new dataset containing only duplicates. But since I suspect something went wrong in earlier stages of my data work, I would (for now) just like to see which observations appear to be duplicates, in order to understand what went wrong. I would like R to create column c:
a <- c("C","A","A","B","A","C","C")
b <- c(1,1,2,1,2,1,2)
c <- c(2,1,2,1,2,2,1)
df <-data.frame(a,b,c)
Starting from the data without the c column:
a <- c("C","A","A","B","A","C","C")
b <- c(1,1,2,1,2,1,2)
df <- data.frame(a, b)
library(dplyr)
df %>%
group_by(a,b) %>% # for each combination of a and b
mutate(c = n()) %>% # count times they appear
ungroup()
# # A tibble: 7 x 3
# a b c
# <fct> <dbl> <int>
# 1 C 1 2
# 2 A 1 1
# 3 A 2 2
# 4 B 1 1
# 5 A 2 2
# 6 C 1 2
# 7 C 2 1
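If a logical flag is preferred over a count, the same grouping idea works (a small variant, not part of the original answer):
library(dplyr)
df %>%
  group_by(a, b) %>%
  mutate(dup = n() > 1) %>%  # TRUE when the (a, b) combination appears more than once
  ungroup()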

Create subsets by group using dplyr

Here I have a dataset with three fields: Dealer, Product, Freq.
My aim is to create a dataset containing the top 2 sales for each dealer.
I have done it using data.table as below:
library(data.table)
library(dplyr)
dt <- data.table(Dealer = c("A","B","A","A","B","A"),
                 Product = c("a","b","b","c","d","d"),
                 Freq = c(10,12,23,24,23,12))
dt[,.SD[order(Freq, decreasing = T)][seq_along(Freq) < 3], by = Dealer]
How can I do the same thing using the dplyr package?
Here, I group by Dealer and then find the top 2 values of Freq in each group.
dt %>% group_by(Dealer) %>% top_n(2, Freq) %>% ungroup
# # A tibble: 4 x 3
# Dealer Product Freq
# <fct> <fct> <dbl>
# 1 B b 12
# 2 A b 23
# 3 A c 24
# 4 B d 23
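In current dplyr (>= 1.0.0), top_n() is superseded by slice_max(); a sketch of the equivalent call:
library(dplyr)
dt %>%
  group_by(Dealer) %>%
  slice_max(Freq, n = 2) %>%  # with_ties = TRUE by default, matching top_n()'s tie behavior
  ungroup()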
We can use slice or filter after doing the group_by and arrange (same methodology as in the OP's post)
library(dplyr)
dt %>%
group_by(Dealer) %>%
arrange(Dealer, desc(Freq)) %>%
slice(1:2)
# or with
# filter(row_number() < 3)
# A tibble: 4 x 3
# Groups: Dealer [2]
# Dealer Product Freq
# <chr> <chr> <dbl>
#1 A c 24
#2 A b 23
#3 B d 23
#4 B b 12
NOTE: In case of ties, this will return exactly the number of rows specified in the slice or filter call, unlike top_n, which keeps all tied rows.

recoding categorical with no mapping values

I've got a data frame with a lot of variables (82), many of which are used for further calculations. So I've tried to convert them to numeric, but it is a huge amount of work to guess the distinct values of every variable and then assign numbers to them.
I wonder if there's a more automated way of doing it, since I don't care which number is assigned to each value as long as it is not repeated.
My approach so far (dummy data, for the sake of clarity):
df <- data.frame(original.var1 = c("display","memory","software","display","disk","memory"),
                 original.var2 = c("skeptic","believer","believer","believer","skeptic","believer"),
                 original.var3 = c("round","square","triangle","cube","sphere","hexagon"),
                 original.var4 = c(10,20,30,40,50,60))
Taking into account that this worked fine:
library(dplyr)
library(magrittr)
df$NEW1 <- as.numeric(interaction(df$original.var1, drop=TRUE))
I've tried to adapt it to dplyr and pipes this way:
df %<>% mutate(VAR1= as.numeric(interaction(original.var1, drop=TRUE))) %>%
mutate(VAR2= as.numeric(interaction(original.var2, drop=TRUE))) %>%
mutate(VAR3= as.numeric(interaction(original.var2, drop=TRUE)))
but the results went wrong from the third VAR onward (note that the third mutate accidentally reuses original.var2 instead of original.var3, which explains the wrong values below):
df %>% dplyr::group_by(original.var1,VAR1) %>% tally()
# A tibble: 4 x 3
# Groups: original.var1 [?]
original.var1 VAR1 n
<fctr> <dbl> <int>
1 disk 1 1
2 display 2 2
3 memory 3 2
4 software 4 1
> df %>% dplyr::group_by(original.var2,VAR2) %>% tally()
# A tibble: 2 x 3
# Groups: original.var2 [?]
original.var2 VAR2 n
<fctr> <dbl> <int>
1 believer 1 4
2 skeptic 2 2
> df %>% dplyr::group_by(original.var3,VAR3) %>% tally()
# A tibble: 6 x 3
# Groups: original.var3 [?]
original.var3 VAR3 n
<fctr> <dbl> <int>
1 cube 1 1
2 hexagon 1 1
3 round 2 1
4 sphere 2 1
5 square 1 1
6 triangle 1 1
Is there an approach or package to recode without having to declare the mapping beforehand?
You can use mutate_if,
library(dplyr)
mutate_if(df, is.factor, funs(as.numeric(interaction(., drop = TRUE))))
which gives,
original.var1 original.var2 original.var3 original.var4
1 2 2 3 10
2 3 1 5 20
3 4 1 6 30
4 2 1 1 40
5 1 2 4 50
6 3 1 2 60
Alternatively, you can read your data frame with stringsAsFactors = FALSE and use is.character, but it's the same thing.
To address your comment: if you want to also keep your original columns, then
mutate_if(df, is.factor, funs(new = as.numeric(interaction(., drop = TRUE))))
Using purrr: keep the factor columns only and operate on them, then merge with the numeric columns at the end.
df %>% purrr::keep(is.factor) %>% mutate_all(funs(as.numeric(interaction(., drop = TRUE))))
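In current dplyr, funs() is deprecated and mutate_if() is superseded; a sketch of the same recoding with across() and where():
library(dplyr)
df %>%
  mutate(across(where(is.factor),
                ~ as.numeric(interaction(.x, drop = TRUE))))  # use where(is.character) if the columns were read as character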

Find difference between rows by id, but place difference on first row in R

I have read a few different posts about finding the difference between two rows in R using dplyr. However, the posts I have seen do not give me quite what I want. I would like to find the difference between the times and place that difference in a new variable on the same row as n, i.e. the duration between n and n+1. All other posts place the elapsed time on the row of n+1.
Here is some sample data:
df <- read.table(text = c("
id time
1 1
1 4
1 7
2 5
2 10"), header = T)
My desired output:
# id time duration
# 1 1 3
# 1 4 3
# 1 7 NA
# 2 5 5
# 2 10 NA
I have the following code at the moment:
df %>% arrange(id, time) %>% group_by(id) %>% mutate(duration = time - lag(time))
Please let me know how I should change this around. Thanks!
You can use diff(), appending an NA for each group. Just change your mutate() call to
mutate(duration = c(diff(time), NA))
Edit: To clarify, the code above is only the mutate() call at the end of the pipe shown in the question. The entire operation, based on the code shown in the question, is
df %>%
arrange(id, time) %>%
group_by(id) %>%
mutate(duration = c(diff(time), NA))
# Source: local data frame [5 x 3]
# Groups: id [2]
#
# id time duration
# <dbl> <dbl> <dbl>
# 1 1 1 3
# 2 1 4 3
# 3 1 7 NA
# 4 2 5 5
# 5 2 10 NA
We can swap lag with lead
df %>%
group_by(id) %>%
mutate(duration = lead(time)- time)
# id time duration
# <int> <int> <int>
#1 1 1 3
#2 1 4 3
#3 1 7 NA
#4 2 5 5
#5 2 10 NA
A corresponding option in data.table would be shift with type = "lead"
library(data.table)
setDT(df)[, duration := shift(time, type = "lead") - time, by = id]
NOTE: In the example, 'id' and 'time' were already in order. If they are not, add an ordering step as the OP showed in the post; see the sketch below.
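Putting the data.table version together with an explicit sort (a sketch using the same df as above):
library(data.table)
setDT(df)                # convert to data.table by reference
setorder(df, id, time)   # sort, in case rows are not already ordered
df[, duration := shift(time, type = "lead") - time, by = id]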
