I have a large dataset with multiple instances of 'Player', and I want to collapse their tournament scores into a single row, keeping the row with the highest earnings. The dataset is too large to paste in full, but a small example table looks like this:
Player    Earned  T1.  T2   T3
John Doe  2100    5    N/A  N/A
John Doe  1900    N/A  12   N/A
John Doe  500     N/A  N/A  16
I'd like to eliminate the N/A values and the duplicate rows by combining Tournament 1, 2, 3 into a single row, and also keeping the highest earnings value (2100), so that it looks more like this:
Player    Earned  T1.  T2  T3
John Doe  2100    5    12  16
So far I've used top_n(1, Earned) to keep the row with the highest earnings, but that only keeps the score of Tournament 1, and I need to fill in the other columns with their scores.
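Roughly, my attempt looks like this:
library(dplyr)
# keeps only the single highest-earnings row per player,
# so T2 and T3 are still N/A afterwards
dat %>%
  group_by(Player) %>%
  top_n(1, Earned)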
From your reference to top_n, I'm inferring dplyr and related packages.
If your columns are strings (since "N/A" is not NA), then
# numeric columns: take the max; character columns: take the first
# value that is not one of the NA-strings ("NA"/"N/A")
func <- function(x, na.rm = TRUE, na = c("NA", "N/A")) {
  if (is.numeric(x)) max(x, na.rm = na.rm) else head(na.omit(setdiff(x, na)), 1)
}

library(dplyr)
dat %>%
  group_by(Player) %>%
  summarize(across(everything(), func))
# # A tibble: 1 x 5
#   Player   Earned T1.   T2    T3
#   <chr>     <int> <chr> <chr> <chr>
# 1 John Doe   2100 5     12    16
If your columns are numeric, though, then we can simplify that to
dat %>%
  # an interim line to change your strings to numbers
  mutate(across(-Player, ~ suppressWarnings(as.numeric(.)))) %>%
  # pick up from here
  group_by(Player) %>%
  summarize(across(everything(), ~ max(., na.rm = TRUE)))
# # A tibble: 1 x 5
#   Player   Earned   T1.    T2    T3
#   <chr>     <dbl> <dbl> <dbl> <dbl>
# 1 John Doe   2100     5    12    16
Data
dat <- structure(list(
  Player = c("John Doe", "John Doe", "John Doe"),
  Earned = c(2100L, 1900L, 500L),
  T1. = c("5", "N/A", "N/A"),
  T2 = c("N/A", "12", "N/A"),
  T3 = c("N/A", "N/A", "16")
), class = "data.frame", row.names = c(NA, -3L))
I could not think of any other way to get rid of those N/A values, and since you were trying to collapse scores into a single row, I guessed they might be numeric values. Hence I changed them to numeric type:
library(dplyr)
dat %>%
  group_by(Player) %>%
  mutate(Earned = cummax(Earned),
         # convert the score columns, silencing the coercion warnings for "N/A"
         across(T1:T3, ~ suppressWarnings(as.numeric(.x)))) %>%
  group_by(Player, Earned) %>%
  summarise(across(T1:T3, ~ na.omit(.x)))
# A tibble: 1 x 5
  Player   Earned    T1    T2    T3
  <chr>     <int> <dbl> <dbl> <dbl>
1 John Doe   2100     5    12    16
I used the reproducible data shared by @r2evans, so I would like to thank him for that. (P.S. I changed T1. to T1 before using it.)
Another alternative to the top solution uses lead and slice:
dat %>%
  arrange(desc(Earned)) %>%
  mutate(T2 = lead(T2),
         T3 = lead(T3, 2)) %>%
  slice(which.max(Earned))
Output:
    Player Earned T1. T2 T3
1 John Doe   2100   5 12 16
Here is a data.table option (note it assumes the score columns are numeric; with the character "N/A" columns from the shared data, convert them first, as sketched below):
> setDT(dat)[, lapply(.SD, function(x) max(x, na.rm = TRUE)), Player]
     Player Earned T1. T2 T3
1: John Doe   2100   5 12 16
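If the score columns are still character (literal "N/A" strings, as in the shared data), a sketch of the conversion first:
library(data.table)
setDT(dat)
cols <- c("T1.", "T2", "T3")
# turn the "N/A" strings into real NA, then into numbers
dat[, (cols) := lapply(.SD, function(x) as.numeric(replace(x, x == "N/A", NA))), .SDcols = cols]
dat[, lapply(.SD, max, na.rm = TRUE), by = Player]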
I was working on the following problem. I've got monthly data from a survey; let's call it df1:
df1 = tibble(ID = c('1','2'), reported_value = c(1200, 31000), anchor_month = c(3,5))
ID reported_value anchor_month
1 1200 3
2 31000 5
So, the first row was reported in March, but there's no way to know if it's reporting March or February values and also can be an approximation to the real value. I've also got a table with actual values for each ID, let's call it df2:
df2 = tibble(ID = c('1', '2') %>% rep(4) %>% sort,
             real_value = c(1200, 1230, 11000, 10, 25000, 3100, 100, 31030),
             month = c(1, 2, 3, 4, 2, 3, 4, 5))
ID real_value month
1 1200 1
1 1230 2
1 11000 3
1 10 4
2 25000 2
2 3100 3
2 100 4
2 31030 5
So there are two challenges: first, I only care about the anchor month OR the month before the anchor month for each ID, and then I want to match to the closest value (sounds like a fuzzy join). So, my first challenge was to filter my second table so it only has the anchor month or the previous one, which I did as follows:
filter_aux = df1 %>%
  bind_rows(df1 %>% mutate(anchor_month = if_else(anchor_month == 1, 12, anchor_month - 1)))

df2 = df2 %>%
  semi_join(filter_aux, by = c('ID', 'month' = 'anchor_month'))
Reducing df2 to:
ID real_value month
1 1230 2
1 11000 3
2 100 4
2 31030 5
Now I tried to do a difference_inner_join by ID and reported_value = real_value (df1 %>% difference_inner_join(df2, by = c('ID', 'reported_value' = 'real_value'))), but it throws a "non-numeric argument to binary operator" error, I'm guessing because ID is a string in my actual data. What gives? I'm no expert in fuzzy joins, so I guess I'm missing something.
My final dataframe would look like this:
ID reported_value anchor_month closest_value month
1 1200 3 1230 2
2 31000 5 31030 5
Thanks!
It was easier without fuzzy_join:
df3 = df1 %>%
  left_join(df2, by = 'ID') %>%
  mutate(dif = abs(real_value - reported_value)) %>%
  group_by(ID) %>%
  filter(dif == min(dif))
Output:
  ID    reported_value anchor_month real_value month   dif
  <chr>          <dbl>        <dbl>      <dbl> <dbl> <dbl>
1 1               1200            3       1230     2    30
2 2              31000            5      31030     5    30
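For what it's worth, the error in the question comes from difference_inner_join trying to compute abs(x - y) for every column listed in by, including the character ID. If you wanted to stay with fuzzyjoin, one option (a sketch) is to fuzzy-match only the numeric column and enforce the exact ID match afterwards:
library(fuzzyjoin)
library(dplyr)
df1 %>%
  difference_inner_join(df2, by = c('reported_value' = 'real_value'),
                        max_dist = Inf, distance_col = 'dif') %>%
  filter(ID.x == ID.y) %>%   # exact match on the character key
  group_by(ID.x) %>%
  slice_min(dif, n = 1)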
Background
I've got an R dataframe, d:
d <- data.frame(ID = c("a","a","b","b","c","c","c"),
                birthdate = as.Date(c("1980-01-01","1980-01-01","2000-12-23","2000-12-23","1949-03-14","1949-03-14","1949-03-14")),
                event_date = as.Date(c("2011-01-01","2012-08-21","2011-12-23","2011-12-31","2013-03-14","2013-04-07","2014-07-14")),
                stringsAsFactors = FALSE)
It consists of an ID code and two dates: a birthdate and an event_date. Everyone's got a consistent birthdate, but people have multiple events each, all of these occurring on different dates.
The Problem
I'm trying to calculate the average age of the people (IDs) in d at their first event. In other words, I'd like R to calculate an "age at first event" by subtracting each ID's birthdate from their first event_date, and then summing the ages and dividing by n (3, in this case).
The answer (if my arithmetic isn't too far off this late at night) should be ~35.3 years old.
What I've tried
I'm not too familiar with date work in R, so I've only gotten as far as mutating a new column that calculates the difference between event_date and birthdate for each row:
d <- d %>%
  mutate(date_difference = (event_date - birthdate) / 365)
But I'm still a ways away from my summary calculation. I'm mainly hung up on how to tell R to find the first date difference for each ID. (Not to mention that dividing by 365 gives me correct years but they're still labeled "days" in the resulting df.)
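(One small step I did find: wrapping the difference in as.numeric() drops the "days" label, e.g.
d %>%
  mutate(date_difference = as.numeric(event_date - birthdate) / 365)
but that still leaves the "first event per ID" part.)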
As an alternative to dividing by 365, you can use the lubridate::time_length function. It computes the length of a period in different time units (seconds, minutes, days, years).
library("tidyverse")
dat <- data.frame(
  ID = c("a", "a", "b", "b", "c", "c", "c"),
  birthdate = as.Date(c("1980-01-01", "1980-01-01", "2000-12-23", "2000-12-23", "1949-03-14", "1949-03-14", "1949-03-14")),
  event_date = as.Date(c("2011-01-01", "2012-08-21", "2011-12-23", "2011-12-31", "2013-03-14", "2013-04-07", "2014-07-14")),
  stringsAsFactors = FALSE
)
dat_with_age <- dat %>%
  group_by(ID) %>%
  slice_min(event_date, n = 1) %>%
  ungroup() %>%
  mutate(
    age_at_first_event = lubridate::time_length(event_date - birthdate, unit = "year")
  )
dat_with_age
#> # A tibble: 3 × 4
#>   ID    birthdate  event_date age_at_first_event
#>   <chr> <date>     <date>                  <dbl>
#> 1 a     1980-01-01 2011-01-01               31.0
#> 2 b     2000-12-23 2011-12-23               11.0
#> 3 c     1949-03-14 2013-03-14               64
dat_with_age %>%
  summarise(
    mean(age_at_first_event)
  )
#> # A tibble: 1 × 1
#>   `mean(age_at_first_event)`
#>                       <dbl>
#> 1                      35.3
Created on 2022-03-11 by the reprex package (v2.0.1)
You can use this code:
d <- d %>%
  group_by(ID) %>%
  arrange(event_date) %>%
  slice(1) %>%
  mutate(date_difference = as.numeric((event_date - birthdate) / 365)) %>%
  ungroup() %>%
  mutate(average_age = mean(date_difference))
Output:
# A tibble: 3 × 5
  ID    birthdate  event_date date_difference average_age
  <chr> <date>     <date>               <dbl>       <dbl>
1 a     1980-01-01 2011-01-01            31.0        35.4
2 b     2000-12-23 2011-12-23            11.0        35.4
3 c     1949-03-14 2013-03-14            64.0        35.4
I have a large data frame and I want to export a new data frame that contains summary statistics of the first based on the id column.
library(tidyverse)
set.seed(123)
id = rep(c(letters[1:5]), 2)
species = c("dog","dog","cat","cat","bird","bird","cat","cat","bee","bee")
study = rep("UK",10)
freq = rpois(10, lambda=12)
df1 <- data.frame(id,species, freq,study)
df1$id<-sort(df1$id)
df1
df2 <- df1 %>%
  group_by(id) %>%
  summarise(meanFreq = mean(freq), minFreq = min(freq))
df2
I want to keep the species name in the new data frame with the summary statistics. But if I merge by id I get redundant rows. I should only have one row per id but with the species name appended.
df3 <- merge(df2, df1, by = "id")
This is what it should look like but my real data is messier than this neat set up here:
df4 = df3[seq(1, nrow(df3), 2), ]
df4
From the summarised output ('df2'), we can join with the distinct rows of the selected columns of the original data:
library(dplyr)
df2 %>%
  left_join(df1 %>%
              distinct(id, species, study), by = 'id')
# A tibble: 5 x 5
#  id    meanFreq minFreq species study
#  <fct>    <dbl>   <dbl> <fct>   <fct>
#1 a         10.5      10 dog     UK
#2 b         14.5      12 cat     UK
#3 c         14.5      12 bird    UK
#4 d         10         7 cat     UK
#5 e         11         6 bee     UK
Or use the same logic with base R:
merge(df2, unique(df1[c(1:2, 4)]), by = "id", all.x = TRUE)
Time for mutate followed by distinct:
df1 %>%
  group_by(id) %>%
  mutate(meanFreq = mean(freq), minFreq = min(freq)) %>%
  distinct(id, .keep_all = T)
Now actually there are two possibilities: either id and species are essentially the same in your df, one is just a label for the other, or the same id can have several species.
If the latter is the case, you will need to replace the last line with distinct(id, species, .keep_all = T).
This would get you:
# A tibble: 5 x 6
# Groups:   id [5]
  id    species  freq study meanFreq minFreq
  <fct> <fct>   <int> <fct>    <dbl>   <dbl>
1 a     dog        10 UK        10.5      10
2 b     cat        17 UK        14.5      12
3 c     bird       12 UK        14.5      12
4 d     cat        13 UK        10         7
5 e     bee         6 UK        11         6
If your only goal is to keep the species and they are indeed the same as id, you could also just include it in the group_by:
df1 %>%
  group_by(id, species) %>%
  summarise(meanFreq = mean(freq), minFreq = min(freq))
This would then remove study and freq. If you need to keep them, you can again replace summarise with mutate and then use distinct with the .keep_all = T argument, as sketched below.
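A sketch of that variant, keeping study and freq:
df1 %>%
  group_by(id, species) %>%
  mutate(meanFreq = mean(freq), minFreq = min(freq)) %>%
  distinct(id, species, .keep_all = TRUE)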
I have a data frame (df) that shares a key column ($Name) with a list of data frames:
head(df)
# A tibble: 6 x 3    ## truncated to show the first 2 rows only
  Name      var1  var2
  <chr>     <chr> <chr>
1 Tom Marks LAX   ORD
2 Bob Sells MIA   CHI
I have a list of data frames that contains historical data for each person contained in df$Name.
head(employees$'Tom Marks')
Name date var3
Tom Marks 2017-01-01 250
Tom Marks 2017-01-02 457
head(employees$'Bob Sells')
Name date var3
Bob Sells 2017-01-01 385
Bob Sells 2017-01-02 273
I would like to append the value in var3 from the employees list to df, taking it from the most recent date (which is always the last row of an employees list element). For example, the output after matching Tom Marks from df$Name to employees$'Tom Marks' would look like this:
head(df)
Name var1 var2 var3
<chr> <chr> <chr> <num>
1 Tom Marks LAX ORD 457
2 Bob Sells MIA CHI 273
I have spent a decent amount of time researching filtering joins, mutating joins, bind_rows, and reduce() functions, but have been unsuccessful in accomplishing what is probably an easy task for a decent programmer. I'm hoping someone out there can put me out of my misery and provide some better direction, or better yet, an answer!
Thank you!
If you're always after the last row, you can use tail to get it:
library(tidyverse)
left_join(
  df,
  map_df(employees, ~ tail(.x, 1))
)
This solution relies on your data being arranged as you said it is, but you can easily arrange each list element by date first if not, as sketched below.
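For example (a sketch, assuming the date column parses cleanly with as.Date):
left_join(
  df,
  map_df(employees, ~ .x %>% arrange(as.Date(date)) %>% tail(1))
)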
library(tidyverse)
df %>% left_join(
  df_list$employees %>%
    bind_rows() %>%
    group_by(Name) %>%
    summarise_at(vars(var3), last))
# Name var1 var2 var3
# 1 Tom Marks LAX ORD 457
# 2 Bob Sells MIA CHI 273
Data
df <- data.frame(Name = c("Tom Marks", "Bob Sells"),
                 var1 = c("LAX", "MIA"),
                 var2 = c("ORD", "CHI"))

df_list <- list(employees = list(
  `Tom Marks` = data.frame(Name = "Tom Marks",
                           date = c("2017-01-01", "2017-01-02"),
                           var3 = c(250, 457)),
  `Bob Sells` = data.frame(Name = "Bob Sells",
                           date = c("2017-01-01", "2017-01-02"),
                           var3 = c(385, 273))
))
Suppose that I applied a treatment to some column values of a data frame like this:
id animal weight height ...
1  dog    23.0
2  cat    NA
3  duck   1.2
4  fairy  0.2
5  snake  BAD
df <- data.frame(id = seq(1:5),
                 animal = c("dog", "cat", "duck", "fairy", "snake"),
                 weight = c("23", NA, "1.2", "0.2", "BAD"))
Suppose that the treatment had to be carried out in a separate table, and produced as its result the following data frame, which is a subset of the original:
id animal weight
2  cat    2.2
5  snake  1.3
sub_df <- data.frame(id = c(2, 5),
                     animal = c("cat", "snake"),
                     weight = c("2.2", "1.3"))
Now I want to put it all back together again, so I use an operation like this:
df %>%
  anti_join(sub_df, by = c("id", "animal")) %>%
  bind_rows(sub_df)

id animal weight
 4 fairy     0.2
 1 dog      23.0
 3 duck      1.2
 2 cat       2.2
 5 snake     1.3
Does there exist some way to do this directly with join operations?
In the case where the subset contains just the key columns and the treated variable (id, animal, weight), rather than all the variables of the original data frame (id, animal, weight, height), how could I assemble the subset back into the original set?
What you describe is a join operation in which you update some values in the original dataset. This is very easy to do with great performance using data.table because of its fast joins and update-by-reference concept (:=).
Here's an example for your toy data:
library(data.table)
setDT(df) # convert to data.table without copy
setDT(sub_df) # convert to data.table without copy
# join and update "df" by reference, i.e. without copy
df[sub_df, on = c("id", "animal"), weight := i.weight]
The data is now updated:
# id animal weight
#1: 1 dog 23.0
#2: 2 cat 2.2
#3: 3 duck 1.2
#4: 4 fairy 0.2
#5: 5 snake 1.3
You can use setDF to switch back to an ordinary data.frame.
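For instance:
setDF(df)   # back to a plain data.frame, again by reference (no copy)
class(df)
# [1] "data.frame"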
Drop the rows that are being replaced first, then simply stack the tibbles:
bind_rows(filter(df, !id %in% sub_df$id), sub_df)
Isn't dplyr::rows_update exactly what we need here? The following code should work:
df %>% dplyr::rows_update(sub_df, by = "id")
This should work as long as there is a unique identifier (one or multiple variables) for your datasets.
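If a single column is not unique on its own, a compound key works too, e.g.:
df %>% dplyr::rows_update(sub_df, by = c("id", "animal"))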
For anyone looking for a solution to use in a tidyverse pipeline:
I run into this problem a lot, and have written a short function that uses mostly tidyverse verbs to get around it. It accounts for the case when there are additional columns in the original df.
For example, if the OP's df had an additional 'height' column:
library(dplyr)
df <- tibble(id = seq(1:5),
             animal = c("dog", "cat", "duck", "fairy", "snake"),
             weight = c("23", NA, "1.2", "0.2", "BAD"),
             height = c("54", "45", "21", "50", "42"))
And the subset of data we wanted to join in was the same:
sub_df <- tibble(id = c(2, 5),
                 animal = c("cat", "snake"),
                 weight = c("2.2", "1.3"))
If we used the OP's method alone (anti_join %>% bind_rows), this won't work because of the additional 'height' column in df. An extra step or two is needed.
In this case we could use the following function:
replace_subset <- function(df, df_subset, id_col_names = c()) {
  # work out which of the columns contain "new" data
  new_data_col_names <- colnames(df_subset)[which(!colnames(df_subset) %in% id_col_names)]
  # complete the df_subset with the extra columns from df
  df_sub_to_join <- df_subset %>%
    left_join(select(df, -all_of(new_data_col_names)), by = c(id_col_names))
  # join and bind rows
  df_out <- df %>%
    anti_join(df_sub_to_join, by = c(id_col_names)) %>%
    bind_rows(df_sub_to_join)
  return(df_out)
}
Now for the results:
replace_subset(df = df , df_subset = sub_df, id_col_names = c("id"))
## A tibble: 5 x 4
#     id animal weight height
#  <dbl> <chr>  <chr>  <chr>
#1     1 dog    23     54
#2     3 duck   1.2    21
#3     4 fairy  0.2    50
#4     2 cat    2.2    45
#5     5 snake  1.3    42
And here's an example using the function in a pipeline:
df %>%
  replace_subset(df_subset = sub_df, id_col_names = c("id")) %>%
  mutate_at(.vars = vars(c('weight', 'height')), .funs = ~ as.numeric(.)) %>%
  mutate(bmi = weight / (height^2))
## A tibble: 5 x 5
# id animal weight height bmi
# <dbl> <chr> <dbl> <dbl> <dbl>
#1 1 dog 23 54 0.00789
#2 3 duck 1.2 21 0.00272
#3 4 fairy 0.2 50 0.00008
#4 2 cat 2.2 45 0.00109
#5 5 snake 1.3 42 0.000737
hope this is helpful :)