Select rows by ID with most matches - r

I have a data frame like this:
df <- data.frame(id = c(1,1,1,2,2,3,3,3,3,4,4,4),
torre = c("a","a","b","d","a","q","t","q","g","a","b","c"))
and I would like my code to select, for each id, the torre that repeats most often, or the last torre for that id if no torre repeats more than the others, so that I get a new data frame like this:
df2 <- data.frame(id = c(1,2,3,4), torre = c("a","a","q","c"))

You can use aggregate:
aggregate(torre ~ id, data=df,
FUN=function(x) names(tail(sort(table(factor(x, levels=unique(x)))),1))
)
The full explanation of this call is a bit involved, but most of the work is done by the FUN= argument. Here we define a function that gets the frequency counts for each torre, sorts them in increasing order, takes the last one with tail(, 1), and returns its name. aggregate() then applies this function separately to each id.
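To see what the anonymous function does, it can help to run its steps on the torre values of a single id. Here is a minimal base R sketch using the values for id 3 from the question; the intermediate names counts, sorted and result are only for illustration:

```r
# torre values for id 3 in the question's data
x <- c("q", "t", "q", "g")

# Count each value; levels = unique(x) keeps the order of first appearance
counts <- table(factor(x, levels = unique(x)))

# Sort the counts in increasing order, so the most frequent value ends up last
sorted <- sort(counts)

# Take the last entry and return its name
result <- names(tail(sorted, 1))
result
```

For this id the most frequent value is unambiguous, so the last element after sorting is "q".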

You could do this using the dplyr package: group by id and torre to calculate the number of occurrences of each torre/id combination, then group by id only and select the last occurrence of torre that has the highest in-group frequency.
library(dplyr)
df %>%
group_by(id,torre) %>%
mutate(n=n()) %>%
group_by(id) %>%
filter(n==max(n)) %>%
slice(n()) %>%
select(-n)
id torre
<dbl> <chr>
1 1 a
2 2 a
3 3 q
4 4 c

An approach with the data.table package:
library(data.table)
setDT(df)[, .N, by = .(id, torre)][order(N), .(torre = torre[.N]), by = id]
which gives:
id torre
1: 1 a
2: 2 a
3: 3 q
4: 4 c
And two possible dplyr alternatives:
library(dplyr)
# option 1
df %>%
group_by(id, torre) %>%
mutate(n = n()) %>%
group_by(id) %>%
mutate(f = rank(n, ties.method = "first")) %>%
filter(f == max(f)) %>%
select(-n, -f)
# option 2
df %>%
group_by(id, torre) %>%
mutate(n = n()) %>%
distinct() %>%
arrange(n) %>%
group_by(id) %>%
slice(n()) %>%
select(-n)

Yet another dplyr solution, this time using add_count() instead of mutate():
df %>%
add_count(id, torre) %>%
group_by(id) %>%
filter(n == max(n)) %>%
slice(n()) %>%
select(-n)
# A tibble: 4 x 2
# Groups: id [4]
id torre
<dbl> <fct>
1 1. a
2 2. a
3 3. q
4 4. c

Related

Calculating average rle$lengths over grouped data

I would like to calculate duration of state using rle() on grouped data. Here is test data frame:
DF <- read.table(text="Time,x,y,sugar,state,ID
0,31,21,0.2,0,L0
1,31,21,0.65,0,L0
2,31,21,1.0,0,L0
3,31,21,1.5,1,L0
4,31,21,1.91,1,L0
5,31,21,2.3,1,L0
6,31,21,2.75,0,L0
7,31,21,3.14,0,L0
8,31,22,3.0,2,L0
9,31,22,3.47,1,L0
10,31,22,3.930,0,L0
0,37,1,0.2,0,L1
1,37,1,0.65,0,L1
2,37,1,1.089,0,L1
3,37,1,1.5198,0,L1
4,36,1,1.4197,2,L1
5,36,1,1.869,0,L1
6,36,1,2.3096,0,L1
7,36,1,2.738,0,L1
8,36,1,3.16,0,L1
9,36,1,3.5703,0,L1
10,36,1,3.970,0,L1
", header = TRUE, sep =",")
I want to know the average length for state == 1, grouped by ID. I have created a function inspired by: https://www.reddit.com/r/rstats/comments/brpzo9/tidyverse_groupby_and_rle/
to calculate the rle average portion:
rle_mean_lengths = function(x, value) {
r = rle(x)
cond = r$values == value
data.frame(count = sum(cond), avg_length = mean(r$lengths[cond]))
}
And then I add in the grouping aspect:
DF %>% group_by(ID) %>% do(rle_mean_lengths(DF$state,1))
However, the values that are generated are incorrect:
ID count avg_length
1 L0 2 2
2 L1 2 2
L0 is correct, L1 has no instances of state == 1 so the average should be zero or NA.
I isolated the problem in terms of breaking it down into just summarize:
DF %>% group_by(ID) %>% summarize_at(vars(state),list(name=mean)) # This works but if I use summarize it gives me weird values again.
How do I do the equivalent summarize_at() for do()? Or is there another fix? Thanks
As it is a data.frame column, we may need to unnest afterwards
library(dplyr)
library(tidyr)
DF %>%
group_by(ID) %>%
summarise(new = list(rle_mean_lengths(state, 1)), .groups = "drop") %>%
unnest(new)
Or remove the list and unpack
DF %>%
group_by(ID) %>%
summarise(new = rle_mean_lengths(state, 1), .groups = "drop") %>%
unpack(new)
# A tibble: 2 × 3
ID count avg_length
<chr> <int> <dbl>
1 L0 2 2
2 L1 0 NaN
In the OP's do() code, the column should be extracted not from the whole data frame (DF$state) but from the group data passed in from the left-hand side, i.e. the . pronoun (.$state). Note that do() is superseded in recent dplyr versions, so it may be better to use summarise() with unnest()/unpack().
DF %>%
group_by(ID) %>%
do(rle_mean_lengths(.$state,1))
# A tibble: 2 × 3
# Groups: ID [2]
ID count avg_length
<chr> <int> <dbl>
1 L0 2 2
2 L1 0 NaN
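The difference between DF$state and the per-group state can also be checked without dplyr. Below is a minimal base R sketch, assuming only the state and ID columns from the question's data; the names whole and per_group are illustrative:

```r
# Function from the question: count and mean length of runs equal to `value`
rle_mean_lengths <- function(x, value) {
  r <- rle(x)
  cond <- r$values == value
  data.frame(count = sum(cond), avg_length = mean(r$lengths[cond]))
}

# state and ID columns copied from the question's DF
state <- c(0, 0, 0, 1, 1, 1, 0, 0, 2, 1, 0,
           0, 0, 0, 0, 2, 0, 0, 0, 0, 0, 0)
ID <- rep(c("L0", "L1"), each = 11)

# Passing the whole column ignores the grouping (what DF$state did inside do())
whole <- rle_mean_lengths(state, 1)

# Splitting by ID first gives one result per group
per_group <- lapply(split(state, ID), rle_mean_lengths, value = 1)
per_group
```

The whole-column call returns the same numbers for every group, which is why the original do() output repeated count 2 and avg_length 2 for both IDs.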

conditional matching between variables in dplyr

I am trying to find observations within a column that have certain or all the possible values within another column. In this tibble
parties <- tibble(class = c("R","R","R","R","R","K","K","K","K","K","K",
"L","L","L","L"),
name = c("Party1", "Party2","Party3","Party4","Party5",
"Party2", "Party4", "Party6","Party7","Party8","Party9",
"Party2","Party3","Party4","Party10"))
I want to find all the "parties" that are in all three classes "R", "K" and "L". Or generally parties that are in class "X" or "Y". I managed to find a solution, using group_split(class), then extracting each table from the list and then lastly performing two semi_joins. That is for the case when I want parties that are in all three classes:
parties_split <- parties %>%
group_split(class)
parties_K <- parties_split[[1]]
parties_L <- parties_split[[2]]
parties_R <- parties_split[[3]]
semi_join(parties_K,parties_L, by = "name") %>%
semi_join(parties_R, by = "name") %>%
select(-class)
name
<chr>
Party2
Party4
This would work in this case but would not be efficient especially if the number of classes (or observations) that need to match are much larger than three. I am looking in particular for solutions in tidyverse. Any ideas? Thanks
Try this:
parties %>%
group_by(name) %>%
filter("K" %in% class,
"R" %in% class,
"L" %in% class) %>%
summarise()
# A tibble: 2 x 1
name
<chr>
1 Party2
2 Party4
EDIT: If you want to check against more than 3 classes, you can also use:
mask = c("K", "R", "L")
parties %>%
group_by(name) %>%
filter(all(mask %in% class)) %>%
summarise()
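The question also asks about parties that are in class "X" or "Y". One way to sketch that case is to swap all() for any(). The example below uses a small subset of the question's rows, and the names either and in_either are only for illustration:

```r
library(dplyr)

# A subset of the rows from the question's parties data
parties_small <- data.frame(
  class = c("R", "R", "K", "K", "L"),
  name  = c("Party1", "Party2", "Party2", "Party6", "Party3")
)

# Classes of interest: keep a party if it appears in at least one of them
either <- c("K", "L")

in_either <- parties_small %>%
  group_by(name) %>%
  filter(any(class %in% either)) %>%
  ungroup() %>%
  distinct(name)

in_either
```

Here Party1 is dropped because it only appears in class "R", while the other three parties are kept.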
To make this work for many groups you can use purrr::reduce :
library(dplyr)
parties %>%
group_split(class) %>%
purrr::reduce(semi_join, by = "name") %>%
select(name)
# name
# <chr>
#1 Party2
#2 Party4
Does this work:
library(dplyr)
# cur_group_id() replaces the deprecated no-argument group_indices()
parties %>% group_by(name) %>% mutate(cnt = n()) %>%
group_by(class) %>% mutate(grpno = cur_group_id()) %>% ungroup() %>%
filter(cnt >= max(grpno)) %>% select(name) %>% distinct()
# A tibble: 2 x 1
name
<chr>
1 Party2
2 Party4
Another solution
library(tidyverse)
parties %>%
group_by(class) %>%
distinct() %>%
mutate(id = 1) %>%
pivot_wider(id_cols = name, names_from = class, values_from = id) %>%
rowwise() %>%
filter(!is.na(sum(c_across(where(is.numeric))))) %>%
select(name) %>%
ungroup()
#> # A tibble: 2 x 1
#> name
#> <chr>
#> 1 Party2
#> 2 Party4
Created on 2020-12-09 by the reprex package (v0.3.0)

R count number of rows with duplicate values

Let's say we have this data frame:
column_a <- c("a","a","b","c","c","c")
column_b <- c("xx","zz","nn","mm","vv","yy")
df <- data.frame (column_a, column_b)
I'm looking to count the number of rows with the same unique values in column_a so that I get something like this:
df2 <- data.frame(unique = c("a","b","c"), n = c("2","1","3"))
So far I tried this but it's not exactly what I'm looking for:
df %>% group_by(column_a) %>% mutate(replicate=seq(n()))
You can try this
library(dplyr)
df %>%
select(column_a, column_b) %>%
unique() %>%
group_by(column_a) %>%
summarize(n = n())
This gives the result:
# A tibble: 3 x 2
column_a n
<fct> <int>
1 a 2
2 b 1
3 c 3
You can convert it to a data.frame if required.
I believe you're looking for tally() or maybe count
df %>% group_by(column_a) %>% tally()
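For comparison, here is a short sketch of the count() variant mentioned above, using the question's data. count(column_a) should be equivalent to group_by(column_a) followed by tally(); the name counts is only for illustration:

```r
library(dplyr)

# Data from the question
column_a <- c("a", "a", "b", "c", "c", "c")
column_b <- c("xx", "zz", "nn", "mm", "vv", "yy")
df <- data.frame(column_a, column_b)

# One row per distinct value of column_a, with the row count in column n
counts <- df %>% count(column_a)
counts
```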

Creating a funnel using a pivot table in R considering NA column

I have the following dataset:
library(tidyverse)
dataset <- data.frame(id = c(121,122,123,124,125),
segment = c("A","B","B","A",NA),
Web = c(1,1,1,1,1),
Tryout = c(1,1,1,0,1),
Purchase = c(1,0,1,0,0),
stringsAsFactors = FALSE)
This table as you see converts to a funnel, from web visits (the quantity of rows), to tryout to a purchase. So a useful view of this funnel should be:
Step Total A B NA
Web 5 2 2 1
Tryout 4 1 2 1
Purchase 2 1 1 0
So I tried row by row doing this. The web views code is:
dataset %>% mutate(segment = ifelse(is.na(segment), "NA", segment)) %>%
group_by(segment) %>% summarise(Total = n()) %>%
ungroup() %>% spread(segment, Total) %>% mutate(Total = `A` + `B` + `NA`) %>%
select(Total,A,B,`NA`)
And worked fine, except that I have to put manually the row name. But for the other steps like tryout and purchase, is there a way to do it in just one simpler code, avoiding binding? Consider that this is an example and I have many columns so any help will be greatly appreciated.
Here is one option: drop the 'id' column and convert the data to 'long' format; grouped by 'name', get the sum of 'value' as 'Total'; then, grouped by 'name', 'segment' and 'Total', take a second sum; finally keep the distinct rows and pivot back to 'wide' format.
library(dplyr)
library(tidyr)
dataset %>%
select(-id) %>%
pivot_longer(cols = -segment) %>%
group_by(name) %>%
mutate(Total = sum(value)) %>%
group_by(name, segment, Total) %>%
mutate(n = sum(value)) %>%
ungroup %>%
select(-value) %>%
distinct %>%
pivot_wider(names_from = segment, values_from = n)
# A tibble: 3 x 5
# name Total A B `NA`
# <chr> <dbl> <dbl> <dbl> <dbl>
#1 Web 5 2 2 1
#2 Tryout 4 1 2 1
#3 Purchase 2 1 1 0
Another option, using gather() and spread():
dataset %>%
select(-id) %>%
group_by(segment) %>%
summarise_all(sum) %>%
gather(Step, val, -segment) %>%
spread(segment, val) %>%
mutate(Total = rowSums(.[,-1]))
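If the rows should stay in funnel order (Web, Tryout, Purchase) rather than alphabetical order, one possible sketch is to turn the step name into a factor before summarising. This is a variation on the answers above, not a replacement for them; the names steps and funnel are illustrative:

```r
library(dplyr)
library(tidyr)

# Data from the question
dataset <- data.frame(id = c(121, 122, 123, 124, 125),
                      segment = c("A", "B", "B", "A", NA),
                      Web = c(1, 1, 1, 1, 1),
                      Tryout = c(1, 1, 1, 0, 1),
                      Purchase = c(1, 0, 1, 0, 0),
                      stringsAsFactors = FALSE)

# Funnel steps in the order they should appear as rows
steps <- c("Web", "Tryout", "Purchase")

funnel <- dataset %>%
  pivot_longer(all_of(steps), names_to = "Step") %>%
  mutate(Step = factor(Step, levels = steps)) %>%   # fix the row order
  group_by(Step, segment) %>%
  summarise(n = sum(value), .groups = "drop") %>%   # count per step and segment
  pivot_wider(names_from = segment, values_from = n) %>%
  mutate(Total = rowSums(across(-Step)))            # sum across segment columns

funnel
```

With many step columns, only the steps vector would need to change.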

how to average rows based on two duplicated rows?

I have 2000 rows with some duplicates, I would like to average the rows based on duplicates.
Site Location Line Band1
Cal BC04 BC04A 130
Cal BC04 BC04B 131
Cal BC04 BC04C 129
I have tried:
bind_cols(
FC %>% distinct(site) %>% .[,-Band1], # pull out columns we aren't aggregating
FC[,c(1, Band1)] %>% group_by(Band1) %>%
summarise_each(funs(mean)) %>% .[,-1] # aggregate other columns
)
So ideally, I would like to result in:
Site Location Line Band1
Cal BC04 BC04A 130
With dplyr, you can do:
df %>%
group_by(Site) %>%
filter(n() > 1) %>%
mutate(Band1 = mean(Band1)) %>%
slice(1) %>%
ungroup()
Site Location Line Band1
<chr> <chr> <chr> <dbl>
1 Cal BC04 BC04A 130
Here it keeps the "Site" values that are duplicated, calculates the mean of "Band1" and selects the first row per "Site".
Maybe you also want to bind the duplicated and non-duplicated rows:
df %>%
group_by(Site) %>%
filter(n() > 1) %>%
mutate(Band1 = mean(Band1)) %>%
slice(1) %>%
ungroup() %>%
bind_rows(df %>%
group_by(Site) %>%
filter(n() == 1) %>%
ungroup())
Or if you want to calculate it just from the duplicated values per "Site":
df %>%
group_by(Site, dup = duplicated(Site)) %>%
filter(dup) %>%
mutate(Band1 = mean(Band1)) %>%
slice(1) %>%
ungroup() %>%
select(-dup)
Site Location Line Band1
<chr> <chr> <chr> <dbl>
1 Cal BC04 BC04B 130
I like data.table for this
x <- data.frame(
Site = c("Cal", "Cal", "Cal"),
Location = c("BC04", "BC04", "BC04"),
Line = c("BC04A", "BC04B", "BC04C"),
Band1 = c(130, 131, 129))
library(data.table)
x <- data.table(x)
x[, .(Band1 = mean(Band1)), by = c("Site", "Location")]
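A dplyr sketch of the same grouping idea, averaging Band1 per Site and Location and keeping the first Line of each group as in the desired output. The data frame is rebuilt from the three rows shown in the question, and the name averaged is illustrative:

```r
library(dplyr)

# The three example rows from the question
df <- data.frame(
  Site = c("Cal", "Cal", "Cal"),
  Location = c("BC04", "BC04", "BC04"),
  Line = c("BC04A", "BC04B", "BC04C"),
  Band1 = c(130, 131, 129)
)

# One row per Site/Location: first Line value and the mean of Band1
averaged <- df %>%
  group_by(Site, Location) %>%
  summarise(Line = first(Line), Band1 = mean(Band1), .groups = "drop")

averaged
```

Unlike the filter(n() > 1) approach, this keeps groups with a single row as well, so no separate bind_rows() step is needed.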
