conditional matching between variables in dplyr - r

I am trying to find observations within a column that have certain or all the possible values within another column. In this tibble
parties <- tibble(class = c("R","R","R","R","R","K","K","K","K","K","K",
"L","L","L","L"),
name = c("Party1", "Party2","Party3","Party4","Party5",
"Party2", "Party4", "Party6","Party7","Party8","Party9",
"Party2","Party3","Party4","Party10"))
I want to find all the "parties" that are in all three classes "R", "K" and "L". Or generally parties that are in class "X" or "Y". I managed to find a solution, using group_split(class), then extracting each table from the list and then lastly performing two semi_joins. That is for the case when I want parties that are in all three classes:
parties_split <- parties %>%
group_split(class)
parties_K <- parties_split[[1]]
parties_L <- parties_split[[2]]
parties_R <- parties_split[[3]]
semi_join(parties_K,parties_L, by = "name") %>%
semi_join(parties_R, by = "name") %>%
select(-class)
name
<chr>
Party2
Party4
This would work in this case but would not be efficient especially if the number of classes (or observations) that need to match are much larger than three. I am looking in particular for solutions in tidyverse. Any ideas? Thanks

Try that:
parties %>%
group_by(name) %>%
filter("K" %in% class,
"R" %in% class,
"L" %in% class) %>%
summarise()
# A tibble: 2 x 1
name
<chr>
1 Party2
2 Party4
EDIT: If you want to work with more than 3 parties you can also use:
mask = c("K", "R", "L")
parties %>%
group_by(name) %>%
filter(all(mask %in% class)) %>%
summarise()

To make this work for many groups you can use purrr::reduce :
library(dplyr)
parties %>%
group_split(class) %>%
purrr::reduce(semi_join, by = "name") %>%
select(name)
# name
# <chr>
#1 Party2
#2 Party4

Does this work:
library(dplyr)
parties %>% group_by(name) %>% mutate(cnt = n()) %>%
group_by(class) %>% mutate(grpno = group_indices()) %>% ungroup() %>%
filter(cnt >= max(grpno)) %>% select(name) %>% distinct()
# A tibble: 2 x 1
name
<chr>
1 Party2
2 Party4

Another solution
library(tidyverse)
parties %>%
group_by(class) %>%
distinct() %>%
mutate(id = 1) %>%
pivot_wider(name, names_from = class, values_from = id) %>%
rowwise() %>%
filter(!is.na(sum(c_across(where(is.numeric))))) %>%
select(name) %>%
ungroup()
#> # A tibble: 2 x 1
#> name
#> <chr>
#> 1 Party2
#> 2 Party4
Created on 2020-12-09 by the reprex package (v0.3.0)

Related

What is the tidyverse way to apply a function designed to take data.frames as input across a grouped tibble in R?

I've written a function that takes multiple columns as its input that I'd like to apply to a grouped tibble, and I think that something with purrr::map might be the right approach, but I don't understand what the appropriate input is for the various map functions. Here's a dummy example:
myFun <- function(DF){
DF %>% mutate(MyOut = (A * B)) %>% pull(MyOut) %>% sum()
}
MyDF <- data.frame(A = 1:5, B = 6:10)
myFun(MyDF)
This works fine. But what if I want to add some grouping?
MyDF <- data.frame(A = 1:100, B = 1:100, Fruit = rep(c("Apple", "Mango"), each = 50))
MyDF %>% group_by(Fruit) %>% summarize(MyVal = myFun(.))
This doesn't work. I get the same value for every group in my data.frame or tibble. I then tried using something with purrr:
MyDF %>% group_by(Fruit) %>% map(.f = myFun)
Apparently, that's expecting character data as input, so that's not it.
This next variation is basically what I need, but the output is a list of lists rather than a tibble with one row for each value of Fruit:
MyDF %>% group_by(Fruit) %>% group_map(~ myFun(.))
We can use the OP's function in group_modify
library(dplyr)
MyDF %>%
group_by(Fruit) %>%
group_modify(~ .x %>%
summarise(MyVal = myFun(.x))) %>%
ungroup
-output
# A tibble: 2 × 2
Fruit MyVal
<chr> <int>
1 Apple 42925
2 Mango 295425
Or in group_map where the .y is the grouping column
MyDF %>%
group_by(Fruit) %>%
group_map(~ bind_cols(.y, MyVal = myFun(.))) %>%
bind_rows
# A tibble: 2 × 2
Fruit MyVal
<chr> <int>
1 Apple 42925
2 Mango 295425

Creating a funnel using a pivot table in R considering NA column

I have the following dataset:
library(tidyverse)
dataset <- data.frame(id = c(121,122,123,124,125),
segment = c("A","B","B","A",NA),
Web = c(1,1,1,1,1),
Tryout = c(1,1,1,0,1),
Purchase = c(1,0,1,0,0),
stringsAsFactors = FALSE)
This table as you see converts to a funnel, from web visits (the quantity of rows), to tryout to a purchase. So a useful view of this funnel should be:
Step Total A B NA
Web 5 2 2 1
Tryout 4 1 2 1
Purchase 2 1 1 0
So I tried row by row doing this. The web views code is:
dataset %>% mutate(segment = ifelse(is.na(segment), "NA", segment)) %>%
group_by(segment) %>% summarise(Total = n()) %>%
ungroup() %>% spread(segment, Total) %>% mutate(Total = `A` + `B` + `NA`) %>%
select(Total,A,B,`NA`)
And worked fine, except that I have to put manually the row name. But for the other steps like tryout and purchase, is there a way to do it in just one simpler code, avoiding binding? Consider that this is an example and I have many columns so any help will be greatly appreciated.
Here is one option where we convert the data to 'long' format after removing the 'id' column, grouped by 'name' get the sum of 'value', then grouped by 'segment', 'Total' as well and do the second sum, get the distinct rows and pivot back to 'wide' format
library(dplyr)
library(tidyr)
dataset %>%
select(-id) %>%
pivot_longer(cols = -segment) %>%
group_by(name) %>%
mutate(Total = sum(value)) %>%
group_by(name, segment, Total) %>%
mutate(n = sum(value)) %>%
ungroup %>%
select(-value) %>%
distinct %>%
pivot_wider(names_from = segment, values_from = n)
# A tibble: 3 x 5
# name Total A B `NA`
# <chr> <dbl> <dbl> <dbl> <dbl>
#1 Web 5 2 2 1
#2 Tryout 4 1 2 1
#3 Purchase 2 1 1 0
dataset %>%
select(-id) %>%
group_by(segment) %>%
summarise_all(sum) %>%
gather(Step, val, -segment) %>%
spread(segment, val) %>%
mutate(Total = rowSums(.[,-1]))

How can I find the longest names (by letters) in my data set?

I have a problem set that wants me to find out the "two longest names given to at least 1000 US babies" in the 'babynames' data set.
The code that I've tried in the past is this:
babynames %>%
mutate(long.name = str_count(babynames$name,
"[:alpha:]")) %>%
filter(n >= 1000) %>%
arrange(-long.name) %>%
head(2) %>%
select(name, long.name)
But it gave me this:
name long.name
<chr> <int>
1 Christopher 11
2 Christopher 11
By group_by name, I'm hoping to eliminate the issue above.
This is where I'm currently at:
babynames %>%
filter(n >= 1000) %>%
group_by(name) %>%
mutate(long.name = str_count(babynames$name,
"[:alpha:]")) %>%
arrange(-long.name) %>%
head(2)
I'm expecting to get something like:
name long.name
<chr> <int>
1 Christopher 11
2 (some name) 10
But I get this:
Error: Column `long.name` must be length 1 (the group size), not 1924665
What am I doing wrong?
We can group_by name and sum all the occurrence of each name, keep only those names which have occurred more than 1000 times, calculate the length using nchar and select top 2 values.
library(babynames)
library(dplyr)
babynames %>%
group_by(name) %>%
summarise(n = sum(n)) %>%
filter(n > 1000) %>%
mutate(name_length = nchar(name)) %>%
#Can also do
#mutate(name_length = stringr::str_count(name, "[:alpha:]")) %>%
top_n(2, name_length)
# name n name_length
# <chr> <int> <int>
#1 Maryelizabeth 1969 13
#2 Michaelangelo 1236 13

Spread in SparklyR / pivot in Spark

I am trying to refactor my R code (shown below) into Sparklyr R code to work on a spark dataset to get to the final result as shown in Table 1:
Using help from stack overflow post Gather in sparklyr and SparklyR separate one Spark Data Frame column into two columns I was able to reach all the way except last step dealing with Spread.
Need Help:
Implement Spread via SparklyR
Optimize code in any way
Table 1: Final output needed:
var n nmiss
1 Sepal.Length 150 0
2 Sepal.Width 150 0
R code to achieve it:
library(dplyr)
library(tidyr)
library(tibble)
data <- iris
data_tbl <- as_tibble(data)
profile <- data_tbl %>%
select(Sepal.Length,Sepal.Width) %>%
summarize_all(funs(
n = n(), #Count
nmiss=sum(as.numeric(is.na(.))) # MissingCount
)) %>%
gather(variable, value) %>%
separate(variable, c("var", "stat"), sep = "_(?=[^_]*$)") %>%
spread(stat, value)
Spark Code:
sdf_gather <- function(tbl){
all_cols <- colnames(tbl)
lapply(all_cols, function(col_nm){
tbl %>%
select(col_nm) %>%
mutate(key = col_nm) %>%
rename(value = col_nm)
}) %>%
sdf_bind_rows() %>%
select(c('key', 'value'))
}
profile <- data_tbl %>%
select(Sepal.Length,Sepal.Width ) %>%
summarize_all(funs(
n = n(),
nmiss=sum(as.numeric(is.na(.)))
)) %>%
sdf_gather(.) %>%
ft_regex_tokenizer(input_col="key", output_col="KeySplit", pattern="_(?=[^_]*$)") %>%
sdf_separate_column("KeySplit", into=c("var", "stat")) %>%
select(var,stat,value) %>%
sdf_register('profile')
In this specific case (in general where all columns have the same type, although if you're interested only in missing data statistics, this can be further relaxed) you can use much simpler structure than this.
With data defined like this:
df <- copy_to(sc, iris, overwrite = TRUE)
gather the columns (below I assume a function as defined in my answer to Gather in sparklyr)
long <- df %>%
select(Sepal_Length, Sepal_Width) %>%
sdf_gather("key", "value", "Sepal_Length", "Sepal_Width")
and then group and aggregate:
long %>%
group_by(key) %>%
summarise(n = n(), nmiss = sum(as.numeric(is.na(value)), na.rm=TRUE))
with result as:
# Source: spark<?> [?? x 3]
key n nmiss
<chr> <dbl> <dbl>
1 Sepal_Length 150 0
2 Sepal_Width 150 0
Given reduced size of the output it is also fine to collect the result after aggregation
agg <- df %>%
select(Sepal_Length,Sepal_Width) %>%
summarize_all(funs(
n = n(),
nmiss=sum(as.numeric(is.na(.))) # MissingCount
)) %>% collect()
and apply your gather - spread logic on the result:
agg %>%
tidyr::gather(variable, value) %>%
tidyr::separate(variable, c("var", "stat"), sep = "_(?=[^_]*$)") %>%
tidyr::spread(stat, value)
# A tibble: 2 x 3
var n nmiss
<chr> <dbl> <dbl>
1 Sepal_Length 150 0
2 Sepal_Width 150 0
In fact the latter approach should be superior performance-wise in this particular case.

Select rows by ID with most matches

I have a data frame like this:
df <- data.frame(id = c(1,1,1,2,2,3,3,3,3,4,4,4),
torre = c("a","a","b","d","a","q","t","q","g","a","b","c"))
and I would like my code to select for each id the torre that repeats more, or the last torre for the id if there isnt one that repeats more than the other, so ill get a new data frame like this:
df2 <- data.frame(id = c(1,2,3,4), torre = c("a","a","q","c"))
You can use aggregate:
aggregate(torre ~ id, data=df,
FUN=function(x) names(tail(sort(table(factor(x, levels=unique(x)))),1))
)
The full explanation for this function is a bit involved, but most of the job is done by the FUN= parameter. In this case we are making a function that get's the frequency counts for each torre, sorts them in increasing order, then get's the last one with tail(, 1) and takes the name of it. aggregate() function then applies this function separately for each id.
You could do this using the dplyr package: group by id and torre to calculate the number of occurrences of each torre/id combination, then group by id only and select the last occurrence of torre that has the highest in-group frequency.
library(dplyr)
df %>%
group_by(id,torre) %>%
mutate(n=n()) %>%
group_by(id) %>%
filter(n==max(n)) %>%
slice(n()) %>%
select(-n)
id torre
<dbl> <chr>
1 1 a
2 2 a
3 3 q
4 4 c
An approach with the data.table package:
library(data.table)
setDT(df)[, .N, by = .(id, torre)][order(N), .(torre = torre[.N]), by = id]
which gives:
id torre
1: 1 a
2: 2 a
3: 3 q
4: 4 c
And two possible dplyr alternatives:
library(dplyr)
# option 1
df %>%
group_by(id, torre) %>%
mutate(n = n()) %>%
group_by(id) %>%
mutate(f = rank(n, ties.method = "first")) %>%
filter(f == max(f)) %>%
select(-n, -f)
# option 2
df %>%
group_by(id, torre) %>%
mutate(n = n()) %>%
distinct() %>%
arrange(n) %>%
group_by(id) %>%
slice(n()) %>%
select(-n)
Yet another dplyr solution, this time using add_count() instead of mutate():
df %>%
add_count(id, torre) %>%
group_by(id) %>%
filter(n == max(n)) %>%
slice(n()) %>%
select(-n)
# A tibble: 4 x 2
# Groups: id [4]
id torre
<dbl> <fct>
1 1. a
2 2. a
3 3. q
4 4. c

Resources