how to average rows based on two duplicated rows? - r

I have 2000 rows with some duplicates, and I would like to average the rows based on the duplicates.
Site Location Line Band1
Cal BC04 BC04A 130
Cal BC04 BC04B 131
Cal BC04 BC04C 129
I have tried:
bind_cols(
FC %>% distinct(site) %>% .[,-Band1], # pull out columns we aren't aggregating
FC[,c(1, Band1)] %>% group_by(Band1) %>%
summarise_each(funs(mean)) %>% .[,-1] # aggregate other columns
)
So ideally, I would like to result in:
Site Location Line Band1
Cal BC04 BC04A 130

With dplyr, you can do:
df %>%
group_by(Site) %>%
filter(n() > 1) %>%
mutate(Band1 = mean(Band1)) %>%
slice(1) %>%
ungroup()
Site Location Line Band1
<chr> <chr> <chr> <dbl>
1 Cal BC04 BC04A 130
Here it keeps the "Site" values that are duplicated, calculates the mean of "Band1" and selects the first row per "Site".
Maybe you also want to bind the duplicated and non-duplicated rows:
df %>%
group_by(Site) %>%
filter(n() > 1) %>%
mutate(Band1 = mean(Band1)) %>%
slice(1) %>%
ungroup() %>%
bind_rows(df %>%
group_by(Site) %>%
filter(n() == 1) %>%
ungroup())
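Since the mean of a single value is that value itself, the split into duplicated and non-duplicated rows can arguably be skipped. The following is a minimal sketch, assuming a hypothetical extra site "Nev" with a single row (not part of the original data):

```r
library(dplyr)

# Toy data: the question's three "Cal" rows plus a hypothetical
# single-row site "Nev" (an assumption added for illustration).
df <- tibble(
  Site     = c("Cal", "Cal", "Cal", "Nev"),
  Location = c("BC04", "BC04", "BC04", "NV01"),
  Line     = c("BC04A", "BC04B", "BC04C", "NV01A"),
  Band1    = c(130, 131, 129, 200)
)

averaged <- df %>%
  group_by(Site) %>%
  mutate(Band1 = mean(Band1)) %>%  # group mean; unchanged for single-row groups
  slice(1) %>%                     # keep the first row of each Site
  ungroup()

averaged
```

This should give one row per "Site", with the averaged "Band1" for "Cal" and the untouched value for the single-row site.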
Or if you want to calculate it just from the duplicated values per "Site":
df %>%
group_by(Site, dup = duplicated(Site)) %>%
filter(dup) %>%
mutate(Band1 = mean(Band1)) %>%
slice(1) %>%
ungroup() %>%
select(-dup)
Site Location Line Band1
<chr> <chr> <chr> <dbl>
1 Cal BC04 BC04B 130

I like data.table for this
x <-data.frame(
Site = c( "Cal","Cal","Cal"),
Location = c( "BC04","BC04","BC04"),
Line = c( "BC04A","BC04B","BC04C"),
Band1= c(130,131, 129))
library( data.table)
x<- data.table( x )
x[ , .(Band1=mean( Band1 )) , by = c("Site","Location")]

Related

Ontime percentage calculations

I need to calculate the overall ontime percentage of each airline with this sample dataset.
library(tidyverse)
library(dplyr)
df_chi <- tribble(
~airline, ~ontime, ~qty,~dest,
'delta',TRUE,527,'CHI',
'delta',FALSE,92,'CHI',
'american',TRUE,4229,'CHI',
'american',FALSE,825,'CHI'
)
df_nyc <- tribble(
~airline, ~ontime, ~qty,~dest,
'delta',TRUE,1817,'NYC',
'delta',FALSE,567,'NYC',
'american',TRUE,1651,'NYC',
'american',FALSE,625,'NYC'
)
I have a solution although it is verbose and I want to avoid the numbered index ie [2,2]. Is there a more elegant way using more of the tidyverse?
df_all <- bind_rows(df_chi,df_nyc)
delta_ot <- df_all %>%
filter(airline == "delta") %>%
group_by(ontime) %>%
summarize(total = sum(qty))
delta_ot <- delta_ot[2,2] / sum(delta_ot$total)
american_ot <- df_all %>%
filter(airline == "american") %>%
group_by(ontime) %>%
summarize(total = sum(qty))
american_ot <- american_ot[2,2] / sum(american_ot$total)
Because the ontime column is logical, use it to subset instead of [2, 2]. Also, instead of filtering once per airline, add 'airline' as a grouping column so both airlines are handled in one pipeline.
library(dplyr)
bind_rows(df_chi, df_nyc) %>%
group_by(airline, ontime) %>%
summarise(total = sum(qty), .groups = 'drop_last') %>%
summarise(total = total[ontime]/sum(total))
-output
# A tibble: 2 × 2
airline total
<chr> <dbl>
1 american 0.802
2 delta 0.781
Subsetting by a logical vector returns the elements at the positions where it is TRUE:
> c(1, 3, 5)[c(FALSE, TRUE, FALSE)]
[1] 3
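The same logical subsetting can also be applied to qty directly, which avoids the intermediate per-ontime totals. This is a sketch rather than the answer's own code; the column name pct_ontime is an assumption chosen for illustration:

```r
library(dplyr)
library(tibble)

# The question's two data sets, combined into one table.
df_all <- tribble(
  ~airline,   ~ontime, ~qty, ~dest,
  "delta",    TRUE,     527, "CHI",
  "delta",    FALSE,     92, "CHI",
  "american", TRUE,    4229, "CHI",
  "american", FALSE,    825, "CHI",
  "delta",    TRUE,    1817, "NYC",
  "delta",    FALSE,    567, "NYC",
  "american", TRUE,    1651, "NYC",
  "american", FALSE,    625, "NYC"
)

ontime_pct <- df_all %>%
  group_by(airline) %>%
  # qty[ontime] keeps only the on-time counts within each airline
  summarise(pct_ontime = sum(qty[ontime]) / sum(qty))

ontime_pct
```

Here qty[ontime] picks the on-time rows within each airline, so a single summarise() is enough.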

conditional matching between variables in dplyr

I am trying to find observations within a column that have certain or all the possible values within another column. In this tibble
parties <- tibble(class = c("R","R","R","R","R","K","K","K","K","K","K",
"L","L","L","L"),
name = c("Party1", "Party2","Party3","Party4","Party5",
"Party2", "Party4", "Party6","Party7","Party8","Party9",
"Party2","Party3","Party4","Party10"))
I want to find all the "parties" that are in all three classes "R", "K" and "L". Or generally parties that are in class "X" or "Y". I managed to find a solution, using group_split(class), then extracting each table from the list and then lastly performing two semi_joins. That is for the case when I want parties that are in all three classes:
parties_split <- parties %>%
group_split(class)
parties_K <- parties_split[[1]]
parties_L <- parties_split[[2]]
parties_R <- parties_split[[3]]
semi_join(parties_K,parties_L, by = "name") %>%
semi_join(parties_R, by = "name") %>%
select(-class)
name
<chr>
Party2
Party4
This would work in this case but would not be efficient especially if the number of classes (or observations) that need to match are much larger than three. I am looking in particular for solutions in tidyverse. Any ideas? Thanks
Try this:
parties %>%
group_by(name) %>%
filter("K" %in% class,
"R" %in% class,
"L" %in% class) %>%
summarise()
# A tibble: 2 x 1
name
<chr>
1 Party2
2 Party4
EDIT: If you want to work with more than 3 classes you can also use:
mask = c("K", "R", "L")
parties %>%
group_by(name) %>%
filter(all(mask %in% class)) %>%
summarise()
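The question also asks about parties that are in class "X" or "Y". A minimal sketch of that case, assuming the same grouped-filter idea with any() in place of all(), and "K" and "L" picked arbitrarily as the two classes:

```r
library(dplyr)

parties <- tibble(
  class = c("R", "R", "R", "R", "R", "K", "K", "K", "K", "K", "K",
            "L", "L", "L", "L"),
  name  = c("Party1", "Party2", "Party3", "Party4", "Party5",
            "Party2", "Party4", "Party6", "Party7", "Party8", "Party9",
            "Party2", "Party3", "Party4", "Party10")
)

mask <- c("K", "L")  # classes of interest (arbitrary choice for illustration)

in_any <- parties %>%
  group_by(name) %>%
  filter(any(mask %in% class)) %>%  # keep names seen in at least one class of mask
  ungroup() %>%
  distinct(name)

in_any
```

Swapping any() back to all() gives the "in every listed class" behaviour.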
To make this work for many groups you can use purrr::reduce :
library(dplyr)
parties %>%
group_split(class) %>%
purrr::reduce(semi_join, by = "name") %>%
select(name)
# name
# <chr>
#1 Party2
#2 Party4
Does this work:
library(dplyr)
parties %>% group_by(name) %>% mutate(cnt = n()) %>%
group_by(class) %>% mutate(grpno = cur_group_id()) %>% ungroup() %>%
filter(cnt >= max(grpno)) %>% select(name) %>% distinct()
# A tibble: 2 x 1
name
<chr>
1 Party2
2 Party4
Another solution
library(tidyverse)
parties %>%
group_by(class) %>%
distinct() %>%
mutate(id = 1) %>%
pivot_wider(id_cols = name, names_from = class, values_from = id) %>%
rowwise() %>%
filter(!is.na(sum(c_across(where(is.numeric))))) %>%
select(name) %>%
ungroup()
#> # A tibble: 2 x 1
#> name
#> <chr>
#> 1 Party2
#> 2 Party4
Created on 2020-12-09 by the reprex package (v0.3.0)

Creating a funnel using a pivot table in R considering NA column

I have the following dataset:
library(tidyverse)
dataset <- data.frame(id = c(121,122,123,124,125),
segment = c("A","B","B","A",NA),
Web = c(1,1,1,1,1),
Tryout = c(1,1,1,0,1),
Purchase = c(1,0,1,0,0),
stringsAsFactors = FALSE)
This table as you see converts to a funnel, from web visits (the quantity of rows), to tryout to a purchase. So a useful view of this funnel should be:
Step Total A B NA
Web 5 2 2 1
Tryout 4 1 2 1
Purchase 2 1 1 0
So I tried row by row doing this. The web views code is:
dataset %>% mutate(segment = ifelse(is.na(segment), "NA", segment)) %>%
group_by(segment) %>% summarise(Total = n()) %>%
ungroup() %>% spread(segment, Total) %>% mutate(Total = `A` + `B` + `NA`) %>%
select(Total,A,B,`NA`)
And worked fine, except that I have to put manually the row name. But for the other steps like tryout and purchase, is there a way to do it in just one simpler code, avoiding binding? Consider that this is an example and I have many columns so any help will be greatly appreciated.
Here is one option where we convert the data to 'long' format after removing the 'id' column, grouped by 'name' get the sum of 'value', then grouped by 'segment', 'Total' as well and do the second sum, get the distinct rows and pivot back to 'wide' format
library(dplyr)
library(tidyr)
dataset %>%
select(-id) %>%
pivot_longer(cols = -segment) %>%
group_by(name) %>%
mutate(Total = sum(value)) %>%
group_by(name, segment, Total) %>%
mutate(n = sum(value)) %>%
ungroup %>%
select(-value) %>%
distinct %>%
pivot_wider(names_from = segment, values_from = n)
# A tibble: 3 x 5
# name Total A B `NA`
# <chr> <dbl> <dbl> <dbl> <dbl>
#1 Web 5 2 2 1
#2 Tryout 4 1 2 1
#3 Purchase 2 1 1 0
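Another option sums every step per segment with summarise_all(), then reshapes with gather() and spread():
dataset %>%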
select(-id) %>%
group_by(segment) %>%
summarise_all(sum) %>%
gather(Step, val, -segment) %>%
spread(segment, val) %>%
mutate(Total = rowSums(.[,-1]))
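For reference, the same funnel can be sketched with pivot_longer() and pivot_wider() only, computing Total at the end. This is an illustrative variant, not either answer's own code; the names funnel and Step are assumptions:

```r
library(dplyr)
library(tidyr)

dataset <- data.frame(id = c(121, 122, 123, 124, 125),
                      segment = c("A", "B", "B", "A", NA),
                      Web = c(1, 1, 1, 1, 1),
                      Tryout = c(1, 1, 1, 0, 1),
                      Purchase = c(1, 0, 1, 0, 0),
                      stringsAsFactors = FALSE)

funnel <- dataset %>%
  select(-id) %>%
  pivot_longer(cols = -segment, names_to = "Step") %>%  # one row per id/step
  group_by(Step, segment) %>%
  summarise(n = sum(value), .groups = "drop") %>%       # count per step and segment
  pivot_wider(names_from = segment, values_from = n) %>%
  mutate(Total = rowSums(across(-Step)))                # sum across segment columns

funnel
```

Note that the rows come out sorted alphabetically by Step rather than in funnel order; converting Step to a factor with the desired levels before grouping is one way to control that.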

How can I find the longest names (by letters) in my data set?

I have a problem set that wants me to find out the "two longest names given to at least 1000 US babies" in the 'babynames' data set.
The code that I've tried in the past is this:
babynames %>%
mutate(long.name = str_count(babynames$name,
"[:alpha:]")) %>%
filter(n >= 1000) %>%
arrange(-long.name) %>%
head(2) %>%
select(name, long.name)
But it gave me this:
name long.name
<chr> <int>
1 Christopher 11
2 Christopher 11
By group_by name, I'm hoping to eliminate the issue above.
This is where I'm currently at:
babynames %>%
filter(n >= 1000) %>%
group_by(name) %>%
mutate(long.name = str_count(babynames$name,
"[:alpha:]")) %>%
arrange(-long.name) %>%
head(2)
I'm expecting to get something like:
name long.name
<chr> <int>
1 Christopher 11
2 (some name) 10
But I get this:
Error: Column `long.name` must be length 1 (the group size), not 1924665
What am I doing wrong?
We can group_by name and sum the occurrences of each name, keep only the names that occurred more than 1000 times in total, calculate the length with nchar, and select the top 2 values.
library(babynames)
library(dplyr)
babynames %>%
group_by(name) %>%
summarise(n = sum(n)) %>%
filter(n > 1000) %>%
mutate(name_length = nchar(name)) %>%
#Can also do
#mutate(name_length = stringr::str_count(name, "[:alpha:]")) %>%
top_n(2, name_length)
# name n name_length
# <chr> <int> <int>
#1 Maryelizabeth 1969 13
#2 Michaelangelo 1236 13
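As for the error in the question: inside a grouped mutate(), babynames$name refers to the whole column of the original data frame (1,924,665 values), while the bare name refers only to the current group's rows, so the lengths do not match. A minimal sketch on a hypothetical toy table (toy and its values are invented for illustration), keeping the question's per-row reading of n >= 1000:

```r
library(dplyr)

# Hypothetical toy data standing in for babynames
toy <- tibble(
  name = c("Christopher", "Christopher", "Alexander", "Bo"),
  n    = c(1500, 1200, 1100, 2000)
)

longest <- toy %>%
  filter(n >= 1000) %>%
  distinct(name) %>%                 # one row per name, removes the duplicates
  mutate(long.name = nchar(name)) %>% # bare `name`, not toy$name
  slice_max(long.name, n = 2)

longest
```

Using distinct(name) removes the repeated "Christopher" rows that appeared in the first attempt.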

Select rows by ID with most matches

I have a data frame like this:
df <- data.frame(id = c(1,1,1,2,2,3,3,3,3,4,4,4),
torre = c("a","a","b","d","a","q","t","q","g","a","b","c"))
and I would like my code to select, for each id, the torre that repeats the most, or the last torre for the id if no torre repeats more than the others, so I'll get a new data frame like this:
df2 <- data.frame(id = c(1,2,3,4), torre = c("a","a","q","c"))
You can use aggregate:
aggregate(torre ~ id, data=df,
FUN=function(x) names(tail(sort(table(factor(x, levels=unique(x)))),1))
)
The full explanation for this function is a bit involved, but most of the job is done by the FUN= parameter. In this case we are making a function that gets the frequency counts for each torre, sorts them in increasing order, then gets the last one with tail(, 1) and takes its name. The aggregate() function then applies this function separately for each id.
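The steps inside FUN can be sketched on their own in base R. The helper name most_common_last is hypothetical and only wraps the same expression for illustration:

```r
# Hypothetical helper wrapping the FUN= expression step by step
most_common_last <- function(x) {
  # Count each value, keeping levels in order of first appearance
  counts <- table(factor(x, levels = unique(x)))
  # Sort counts in increasing order; the answer relies on tied
  # counts keeping their relative order here
  sorted <- sort(counts)
  # The last entry is the most frequent (or the last-seen among ties)
  names(tail(sorted, 1))
}

most_common_last(c("a", "a", "b"))       # values for id 1
most_common_last(c("a", "b", "c"))       # values for id 4, all tied
```

Setting the factor levels to unique(x) is what makes ties resolve toward the value that appears last, instead of toward alphabetical order.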
You could do this using the dplyr package: group by id and torre to calculate the number of occurrences of each torre/id combination, then group by id only and select the last occurrence of torre that has the highest in-group frequency.
library(dplyr)
df %>%
group_by(id,torre) %>%
mutate(n=n()) %>%
group_by(id) %>%
filter(n==max(n)) %>%
slice(n()) %>%
select(-n)
id torre
<dbl> <chr>
1 1 a
2 2 a
3 3 q
4 4 c
An approach with the data.table package:
library(data.table)
setDT(df)[, .N, by = .(id, torre)][order(N), .(torre = torre[.N]), by = id]
which gives:
id torre
1: 1 a
2: 2 a
3: 3 q
4: 4 c
And two possible dplyr alternatives:
library(dplyr)
# option 1
df %>%
group_by(id, torre) %>%
mutate(n = n()) %>%
group_by(id) %>%
mutate(f = rank(n, ties.method = "first")) %>%
filter(f == max(f)) %>%
select(-n, -f)
# option 2
df %>%
group_by(id, torre) %>%
mutate(n = n()) %>%
distinct() %>%
arrange(n) %>%
group_by(id) %>%
slice(n()) %>%
select(-n)
Yet another dplyr solution, this time using add_count() instead of mutate():
df %>%
add_count(id, torre) %>%
group_by(id) %>%
filter(n == max(n)) %>%
slice(n()) %>%
select(-n)
# A tibble: 4 x 2
# Groups: id [4]
id torre
<dbl> <fct>
1 1. a
2 2. a
3 3. q
4 4. c
