Sort values across multiple columns in R with dplyr

Apologies for the not-particularly-clear title - hoping my example below helps. I am working with some sports data, attempting to compute "lineup statistics" for certain groupings of players in the data. Below is an example of the type of data I'm working with (playerInfo), as well as the type of analysis I am attempting to do (groupedInfo):
library(dplyr)
playerInfo = data.frame(
lineup = c(1,2,3,4,5,6),
player1 = c("Bil", "Tom", "Tom", "Nik", "Nik", "Joe"),
player1id = c("e91", "a27", "a27", "b17", "b17", "3b3"),
player2 = c("Nik", "Bil", "Nik", "Joe", "Tom", "Tom"),
player2id = c("b17", "e91", "b17", "3b3", "a27", "a27"),
player3 = c("Joe", "Joe", "Joe", "Tom", "Joe", "Nik"),
player3id = c("3b3", "3b3", "3b3", "a27", "3b3", "b17"),
points = c(6, 8, 3, 12, 36, 2),
stringsAsFactors = FALSE
)
groupedInfo <- playerInfo %>%
dplyr::group_by(player1, player2, player3) %>%
dplyr::summarise(
lineup_ct = n(),
total_pts = sum(points)
)
> groupedInfo
# A tibble: 6 x 5
# Groups: player1, player2 [?]
player1 player2 player3 lineup_ct total_pts
<chr> <chr> <chr> <int> <dbl>
1 Bil Nik Joe 1 6
2 Joe Tom Nik 1 2
3 Nik Joe Tom 1 12
4 Nik Tom Joe 1 36
5 Tom Bil Joe 1 8
6 Tom Nik Joe 1 3
The goal here is to group_by the 3 players in each row, and then compute some summary statistics (in this simple example, count and sum-of-points) for the different groups. Unfortunately, dplyr::group_by does not recognise that two rows contain the same group of players when it's the same 3 players simply in different columns.
For example, in the dataframe above, rows 3,4,5,6 all have the same 3 players (Nik, Tom, Joe), however because sometimes Nik is player1, and sometimes Nik is player2, etc., the group_by groups them separately.
For clarity, below is an example of the type of results I am seeking to get:
correctPlayerInfo = data.frame(
lineup = c(1,2,3,4,5,6),
player1 = c("Bil", "Bil", "Joe", "Joe", "Joe", "Joe"),
player1id = c("e91", "e91", "3b3", "3b3", "3b3", "3b3"),
player2 = c("Joe", "Joe", "Nik", "Nik", "Nik", "Nik"),
player2id = c("3b3", "3b3", "b17", "b17", "b17", "b17"),
player3 = c("Nik", "Tom", "Tom", "Tom", "Tom", "Tom"),
player3id = c("b17", "a27", "a27", "a27", "a27", "a27"),
points = c(6, 8, 3, 12, 36, 2),
stringsAsFactors = FALSE
)
correctGroupedInfo <- correctPlayerInfo %>%
dplyr::group_by(player1, player2, player3) %>%
dplyr::summarise(
lineup_ct = n(),
total_pts = sum(points)
)
> correctGroupedInfo
# A tibble: 3 x 5
# Groups: player1, player2 [?]
player1 player2 player3 lineup_ct total_pts
<chr> <chr> <chr> <int> <dbl>
1 Bil Joe Nik 1 6
2 Bil Joe Tom 1 8
3 Joe Nik Tom 4 53
In this second example, I have manually sorted the data alphabetically such that player1 < player2 < player3. As a result, when I do the group_by, it accurately groups rows 3-6 into a single grouping.
How can I achieve this programmatically? I'm not sure which is best: (a) restructuring playerInfo into the column-sorted correctPlayerInfo (as I've done above), or (b) some other approach where group_by automatically identifies that these are the same groups.
I am actively working on this, and will post updates if I come up with my own solution. Until then, any help with this is greatly appreciated!
Edit: Thus far I've tried something along these lines:
newPlayerInfo <- playerInfo %>%
dplyr::mutate(newPlayer1 = min(player1, player2, player3)) %>%
dplyr::mutate(newPlayer3 = max(player1, player2, player3))
... to no avail.
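The attempt above most likely fails because min() and max() are not vectorised across rows: inside mutate() they reduce all three columns to a single value, which is then recycled into every row. A minimal sketch of the row-wise equivalent, assuming base R's pmin() and pmax() (which also accept character vectors) are acceptable here. It only yields the alphabetically first and last player, so the middle player would still have to be derived separately:

```r
library(dplyr)

# Trimmed copy of the question's data (name columns only)
playerInfo <- data.frame(
  player1 = c("Bil", "Tom", "Tom", "Nik", "Nik", "Joe"),
  player2 = c("Nik", "Bil", "Nik", "Joe", "Tom", "Tom"),
  player3 = c("Joe", "Joe", "Joe", "Tom", "Joe", "Nik"),
  stringsAsFactors = FALSE
)

# pmin()/pmax() compare the columns element by element (row by row),
# unlike min()/max(), which return one value for all the input combined
newPlayerInfo <- playerInfo %>%
  mutate(
    newPlayer1 = pmin(player1, player2, player3),
    newPlayer3 = pmax(player1, player2, player3)
  )

print(newPlayerInfo)
```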

You could create group IDs that are sorted composites of the players' names (or IDs). For example:
playerInfo %>%
mutate(
group_id = purrr::pmap_chr(
.l = list(p1 = player1, p2 = player2, p3 = player3),
.f = function(p1, p2, p3) paste(sort(c(p1, p2, p3)), collapse = "_")
)
) %>%
group_by(group_id) %>%
summarise(
lineup_ct = n(),
total_pts = sum(points)
)
# A tibble: 3 x 3
group_id lineup_ct total_pts
<chr> <int> <dbl>
1 Bil_Joe_Nik 1 6
2 Bil_Joe_Tom 1 8
3 Joe_Nik_Tom 4 53
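For option (a) from the question, restructuring the columns themselves, a minimal sketch could sort the three names within each row and write them back before grouping. The helper names (player_cols, sorted_names, sortedPlayerInfo) are made up for illustration, and the player id columns are left out because this sketch does not reorder them alongside the names:

```r
library(dplyr)

playerInfo <- data.frame(
  lineup  = 1:6,
  player1 = c("Bil", "Tom", "Tom", "Nik", "Nik", "Joe"),
  player2 = c("Nik", "Bil", "Nik", "Joe", "Tom", "Tom"),
  player3 = c("Joe", "Joe", "Joe", "Tom", "Joe", "Nik"),
  points  = c(6, 8, 3, 12, 36, 2),
  stringsAsFactors = FALSE
)

player_cols <- c("player1", "player2", "player3")

# Sort the names within each row; apply() returns one column per input row,
# so transpose to get back to one row per lineup
sorted_names <- t(apply(as.matrix(playerInfo[player_cols]), 1,
                        function(x) sort(unname(x))))

sortedPlayerInfo <- playerInfo
sortedPlayerInfo$player1 <- sorted_names[, 1]
sortedPlayerInfo$player2 <- sorted_names[, 2]
sortedPlayerInfo$player3 <- sorted_names[, 3]

groupedSorted <- sortedPlayerInfo %>%
  group_by(player1, player2, player3) %>%
  summarise(lineup_ct = n(), total_pts = sum(points), .groups = "drop")

print(groupedSorted)
```

Compared with the composite group_id, this keeps the players in separate columns, at the cost of a separate step if the ids also need to follow the names.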

Related

How to add values from other column if conditional join does not execute?

I have two tables. This one holds the old names:
Last Name|First Name|ID
Clay|Cassius|1
Alcindor|Lou|2
Artest|Ron|3
Jordan|Michael|4
Scottie|Pippen|5
Kanter|Enes|6
and this one holds the new names:
Last Name|First Name|ID
Ali|Muhammad|1
Abdul Jabbar|Kareem|2
World Peace|Metta|3
Jordan|Michael|4
Pippen|Scottie|5
Freedom|Enes Kanter|6
Basically, I want to join onto the first table (old names) so that it shows the new last name if the name has changed, and a blank otherwise:
Last Name|First Name|ID|Discrepancies
Clay|Cassius|1|Ali
Alcindor|Lou|2|Abdul Jabbar
Artest|Ron|3|World Peace
Jordan|Michael|4|
Pippen|Scottie|5|
Kanter|Enes|6|Freedom
Note that Michael's and Scottie's names did not change, so Discrepancies is blank for them.
You could use
library(dplyr)
df1 %>%
left_join(df2, by = "ID", suffix = c("", ".y")) %>%
mutate(Discrepancies = ifelse(Last_Name.y == Last_Name, "", Last_Name.y)) %>%
select(-ends_with(".y"))
to get
# A tibble: 6 x 4
Last_Name First_Name ID Discrepancies
<chr> <chr> <dbl> <chr>
1 Clay Cassius 1 "Ali"
2 Alcindor Lou 2 "Abdul Jabbar"
3 Artest Ron 3 "World Peace"
4 Jordan Michael 4 ""
5 Scottie Pippen 5 "Pippen"
6 Kanter Enes 6 "Freedom"
Note:
I named the columns Last_Name and First_Name.
The first data frame contains Scottie Pippen instead of Pippen Scottie.
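The snippet above assumes df1 and df2 already exist with the renamed columns. A minimal reproducible sketch under that assumption, with the two data frames typed in from the question's tables (including the swapped Scottie/Pippen row in the old names):

```r
library(dplyr)

# Old names (row 5 is "Scottie Pippen", exactly as in the question)
df1 <- data.frame(
  Last_Name  = c("Clay", "Alcindor", "Artest", "Jordan", "Scottie", "Kanter"),
  First_Name = c("Cassius", "Lou", "Ron", "Michael", "Pippen", "Enes"),
  ID         = 1:6,
  stringsAsFactors = FALSE
)

# New names
df2 <- data.frame(
  Last_Name  = c("Ali", "Abdul Jabbar", "World Peace", "Jordan", "Pippen", "Freedom"),
  First_Name = c("Muhammad", "Kareem", "Metta", "Michael", "Scottie", "Enes Kanter"),
  ID         = 1:6,
  stringsAsFactors = FALSE
)

result <- df1 %>%
  # columns from df2 that clash with df1 get the ".y" suffix
  left_join(df2, by = "ID", suffix = c("", ".y")) %>%
  # blank when the last name is unchanged, otherwise the new last name
  mutate(Discrepancies = ifelse(Last_Name.y == Last_Name, "", Last_Name.y)) %>%
  select(-ends_with(".y"))

print(result)
```

If an ID in df1 had no match in df2, Last_Name.y would be NA and so would Discrepancies; that case does not occur in this data.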
Another possible solution:
library(tidyverse)
old <- data.frame(
stringsAsFactors = FALSE,
check.names = FALSE,
Last = c("Clay",
"Alcindor","Artest","Jordan","Scottie","Kanter"),
`First` = c("Cassius","Lou",
"Ron","Michael","Pippen","Enes"),
`ID` = c(1L, 2L, 3L, 4L, 5L, 6L)
)
new <- data.frame(
stringsAsFactors = FALSE,
check.names = FALSE,
`Last` = c("Ali",
"Abdul Jabbar","World Peace","Jordan","Pippen","Freedom"),
`First` = c("Muhammad",
"Kareem","Metta","Michael","Scottie","Enes Kanter"),
ID = c(1L, 2L, 3L, 4L, 5L, 6L)
)
old %>%
bind_rows(new) %>%
group_by(ID) %>%
summarise(
discrepancies = if_else(n_distinct(Last) > 1, last(Last), NA_character_),
Last = first(Last), First = first(First), .groups = "drop" )
#> # A tibble: 6 × 4
#> ID discrepancies Last First
#> <int> <chr> <chr> <chr>
#> 1 1 Ali Clay Cassius
#> 2 2 Abdul Jabbar Alcindor Lou
#> 3 3 World Peace Artest Ron
#> 4 4 <NA> Jordan Michael
#> 5 5 Pippen Scottie Pippen
#> 6 6 Freedom Kanter Enes
You can simply merge your data, and then filter duplicate occurrences.
dfinal <- setNames( merge( dat1, dat2, "ID", suffixes=c(1,2) )[
,c("Last.Name1","First.Name1","ID","Last.Name2")], c(colnames(dat1),"Discrepancies") )
dfinal$Discrepancies[ dfinal$Last.Name == dfinal$Discrepancies ] <- ""
dfinal
Last.Name First.Name ID Discrepancies
1 Clay Cassius 1 Ali
2 Alcindor Lou 2 Abdul Jabbar
3 Artest Ron 3 World Peace
4 Jordan Michael 4
5 Scottie Pippen 5 Pippen
6 Kanter Enes 6 Freedom
Data
dat1 <- structure(list(Last.Name = c("Clay", "Alcindor", "Artest", "Jordan",
"Scottie", "Kanter"), First.Name = c("Cassius", "Lou", "Ron",
"Michael", "Pippen", "Enes"), ID = 1:6), class = "data.frame", row.names = c(NA,
-6L))
dat2 <- structure(list(Last.Name = c("Ali", "Abdul Jabbar", "World Peace",
"Jordan", "Pippen", "Freedom"), First.Name = c("Muhammad", "Kareem",
"Metta", "Michael", "Scottie", "Enes Kanter"), ID = 1:6), class = "data.frame", row.names = c(NA,
-6L))

Group_by multiple columns and summarise unique column

I have a dataset below:
family|type|inc|name
AA|success|30000|Bill
AA|ERROR|15000|Bess
CC|Pending|22000|Art
CC|Pending|18000|Amy
AA|Serve not respnding d|25000|Paul
ZZ|Success|50000|Pat
ZZ|Processing|50000|Pat
I want to group by multiple columns. Here is my code:
df<-df1%>%
group_by(family , type)%>%
summarise(Transaction_count = n(), Face_value = sum(inc))%>%
mutate(Pct = Transaction_count/sum(Transaction_count))
What I want is that wherever the same family value repeats, it is shown only once, like the result in the picture below.
Thank you
You can use duplicated to replace the repeated values with a blank value.
library(dplyr)
df %>%
group_by(family , type)%>%
summarise(Transaction_count = n(), Face_value = sum(inc))%>%
mutate(Pct = Transaction_count/sum(Transaction_count),
family = replace(family, duplicated(family), '')) %>%
ungroup
# family type Transaction_count Face_value Pct
# <chr> <chr> <int> <int> <dbl>
#1 "AA" ERROR 1 15000 0.333
#2 "" Serve not respnding d 1 25000 0.333
#3 "" success 1 30000 0.333
#4 "CC" Pending 2 40000 1
#5 "ZZ" Processing 1 50000 0.5
#6 "" Success 1 50000 0.5
If you only need this for display purposes, you may want to look into packages such as formattable or knitr (kable).
data
It is easier to help if you provide data in a reproducible format
df <- structure(list(family = c("AA", "AA", "CC", "CC", "AA", "ZZ",
"ZZ"), type = c("success", "ERROR", "Pending", "Pending", "Serve not respnding d",
"Success", "Processing"), inc = c(30000L, 15000L, 22000L, 18000L,
25000L, 50000L, 50000L), name = c("Bill", "Bess", "Art", "Amy",
"Paul", "Pat", "Pat")), row.names = c(NA, -7L), class = "data.frame")

How to Visualize The frequency of a categorical variable in R

I have 2 variables in my dataframe that I am trying to use ggplot to graph. On the x-axis I want the date which has a daily frequency. On the y-axis I want the count of unique names that show up on that given day.
The variables look something like this in the dataframe.
Date Name
1 2016-03-01 Joe
2 2016-03-01 Joe
3 2016-03-01 Joe
4 2016-03-01 Mark
5 2016-03-01 Sue
6 2016-03-02 Mark
7 2016-03-02 Joe
8 2016-03-03 Joe
9 2016-03-03 Joe
10 2016-03-03 Bill
So the frequency on the y-axis on the first day would show 3, 2 on the second, and 2 on the third.
My question is how do I produce that graph.
Count the number of unique Name values for each Date, then plot the result with geom_col (or geom_bar with stat = "identity").
library(dplyr)
library(ggplot2)
df %>%
group_by(Date) %>%
summarise(n = n_distinct(Name)) %>%
ggplot() + geom_col(aes(Date, n))
#ggplot() + geom_bar(aes(Date, n), stat = "identity")
data
df <- structure(list(Date = c("2016-03-01", "2016-03-01", "2016-03-01",
"2016-03-01", "2016-03-01", "2016-03-02", "2016-03-02", "2016-03-03",
"2016-03-03", "2016-03-03"), Name = c("Joe", "Joe", "Joe", "Mark",
"Sue", "Mark", "Joe", "Joe", "Joe", "Bill")), class = "data.frame",
row.names = c(NA, -10L))
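Note that Date is stored as character in the data above. A minimal sketch, assuming real dates on the x-axis are wanted, converts the column with as.Date() and checks the daily counts before plotting; daily_counts is a name made up for this example:

```r
library(dplyr)

df <- data.frame(
  Date = c("2016-03-01", "2016-03-01", "2016-03-01", "2016-03-01", "2016-03-01",
           "2016-03-02", "2016-03-02", "2016-03-03", "2016-03-03", "2016-03-03"),
  Name = c("Joe", "Joe", "Joe", "Mark", "Sue", "Mark", "Joe", "Joe", "Joe", "Bill"),
  stringsAsFactors = FALSE
)

daily_counts <- df %>%
  mutate(Date = as.Date(Date)) %>%                     # character -> Date
  group_by(Date) %>%
  summarise(n = n_distinct(Name), .groups = "drop")    # unique names per day

print(daily_counts)
```

The resulting table can then be passed to ggplot() + geom_col(aes(Date, n)) as in the answer above.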

How can I make conditional selections using dplyr in R?

I have the following situation. Given the table
df <- data.frame(ID = c(1, 2, 2, 3, 3, 4),
type = c("MC", "MC", "MK", "MC", "MK", "MC"),
value1 = c(512, 261, 4523, 1004, 1221, 2556),
value2 = c(726, 4000, 280, 998, 113, 6789))
I am trying to find a way to implement the following logic: If for an ID, both types (MC and MK) occur, use value1 from MK and value2 from MC. Otherwise (only the type MC occurs), use MC.
Hence, the final result is supposed to be:
data.frame(ID = c(1, 2, 3, 4),
type = c("MC", "MC", "MC", "MC"),
value1 = c(512, 4523, 1221, 2556),
value2 = c(726, 4000, 998, 6789))
Assuming the type MK is dropped after extracting the value1.
Another version with dplyr
library(dplyr)
df %>%
group_by(ID) %>%
mutate(value1 = ifelse(any(type == "MK"), value1[type=="MK"],value1[type=="MC"]),
value2 = value2[type == "MC"]) %>%
filter(type == "MC")
# ID type value1 value2
# <dbl> <fct> <dbl> <dbl>
#1 1 MC 512 726
#2 2 MC 4523 4000
#3 3 MC 1221 998
#4 4 MC 2556 6789
Here, value1 takes the "MK" value when an "MK" row is present for the ID and falls back to the "MC" value otherwise, while value2 always takes the "MC" value; finally, only rows with type "MC" are kept. This assumes every group (ID) has an "MC" row.
For efficiency I would definitely prefer @Andre Elrico's answer, but here is a dplyr option. Try:
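The same logic can also be sketched with summarise(), which collapses each ID to a single row directly instead of mutating and then filtering. This is only an illustrative variant under the same assumption that every ID has an "MC" row; the [1] guards against an ID having several rows of the same type:

```r
library(dplyr)

df <- data.frame(ID = c(1, 2, 2, 3, 3, 4),
                 type = c("MC", "MC", "MK", "MC", "MK", "MC"),
                 value1 = c(512, 261, 4523, 1004, 1221, 2556),
                 value2 = c(726, 4000, 280, 998, 113, 6789),
                 stringsAsFactors = FALSE)

result <- df %>%
  group_by(ID) %>%
  summarise(
    # value1 from the MK row if the ID has one, otherwise from the MC row
    value1 = if (any(type == "MK")) value1[type == "MK"][1] else value1[type == "MC"][1],
    # value2 always from the MC row
    value2 = value2[type == "MC"][1],
    .groups = "drop"
  ) %>%
  mutate(type = "MC") %>%
  select(ID, type, value1, value2)

print(result)
```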
df <- data.frame(ID = c(1, 2, 2, 3, 3, 4),
type = c("MC", "MC", "MK", "MC", "MK", "MC"),
value1 = c(512, 261, 4523, 1004, 1221, 2556),
value2 = c(726, 4000, 280, 998, 113, 6789))
library(dplyr)
df %>%
reshape(., idvar = "ID", timevar = "type", direction = "wide") %>%
group_by(ID) %>%
mutate(value1 = ifelse(is.na(value1.MK), value1.MC, value1.MK),
value2 = ifelse(is.na(value2.MC), value2.MK, value2.MC),
type = "MC") %>%
select(ID, type, value1, value2)
# output
# A tibble: 4 x 4
# Groups: ID [4]
ID type value1 value2
<dbl> <chr> <dbl> <dbl>
1 1 MC 512 726
2 2 MC 4523 4000
3 3 MC 1221 998
4 4 MC 2556 6789
data.table solution
library(data.table)
setDT(df1)[,{x=.SD;if(all(c("MC","MK") %in% type)){x$value1[] = last(value1)};first(x)},by=ID]
result:
# ID type value1 value2
#1 1 MC 512 726
#2 2 MC 4523 4000
#3 3 MC 1221 998
#4 4 MC 2556 6789
dplyr:
df1 %>% group_by(ID) %>% do(.,(function(x){if(all(c("MC","MK") %in% x$type)){x$value1[] = x$value1[x$type=="MK"]};x[1,]})(.))
# A tibble: 4 x 4
# Groups: ID [4]
# ID type value1 value2
# <dbl> <fct> <dbl> <dbl>
#1 1 MC 512 726
#2 2 MC 4523 4000
#3 3 MC 1221 998
#4 4 MC 2556 6789

R Conditionally transform data frame from long to wide based on multiple unique variables

I know this is a very common post on SO but I have been spending a little too much time researching a way to transform a data frame from long form to wide form and haven't quite found a post(s) to guide me through the entire process. I have a data frame similar in structure to the reprex below but with 100+ rows. Basically, the same structure is repeated every 9 rows but with different variables. However, in order to keep this post as readable as possible, I'm providing the first 9 rows of my data frame. Please note that each Id is related to a Name and Pos.
library("reshape2")
test <- data.frame(
Id = c("9644", "14513", "9874",
"12363", "9673", "9538",
"9585", "23447", "40396"),
Pos = c("SG", "SF", "PF", "C", "PG", "SF",
"SG", "PF", "PG"),
Name = c("John", "James", "Bob", "Sam",
"Mark", "Andrew", "Bobby", "Elaine", "Jerry"),
Score = c(55.66, 43.82, 37.35, 40.59,
35.15, 27.45, 28.82, 28.95,
34.98),
Sal = c(60000, 60000, 60000, 60000,
60000, 60000, 60000, 60000,
60000),
Total = c(332.77, 332.77, 332.77, 332.77,
332.77, 332.77, 332.77, 332.77,
332.77),
TmNumber = c(1, 1, 1, 1, 1, 1, 1, 1, 1))
I would like to transform my columns and variables into this format:
desiredDF <- data.frame(
TmNum = "1",
Id1 = "9644", Id2 = "14513", Id3 = "9874", Id4 = "12363",
Id5 = "9673", Id6 = "9538", Id7 = "9585", Id8 = "23447",
Id9 = "40396",
PG = "Mark", PG = "Jerry", SG = "John", SG = "Bobby",
SF = "James", SF = "Andrew", PF = "Bob", PF = "Elaine",
C = "Sam",
Score1 = "55.66", Score2 = "43.82", Score3 = "37.35", Score4 = "40.59",
Score5 = "35.15", Score6 = "27.45", Score7 = "28.82", Score8 = "28.95",
Score9 = "34.98",
Sal = "60000",
Total = "332.77"
)
I have tried the following code (and a few more failed attempts):
test2 <- dcast(test, TmNum ~ Pos, value.var = "Name")
> test2
TmNum C PF PG SF SG
1 1 1 2 2 2 2
Thank you!
Try merging several dcast's:
library(reshape2)
Ave <- function(lab, x, g, FUN = seq_along) paste0(lab, ave(format(x), g, FUN = FUN))
L <- list(
dcast(data = transform(test, ID = Ave("Id", Id, TmNumber)),
TmNumber ~ ID, value.var = "Id"),
dcast(data = transform(test, Pos = Ave("", Pos, TmNumber, make.unique)),
TmNumber ~ Pos, value.var = "Name"),
dcast(data = transform(test, SCORE = Ave("Score", Score, TmNumber)),
TmNumber + Sal + Total ~ SCORE, value.var = "Score"))
Reduce(function(x, y) merge(x, y, by = 1), L)
giving:
TmNumber Id1 Id2 Id3 Id4 Id5 Id6 Id7 Id8 Id9 C PF PF.1 PG
1 1 9644 14513 9874 12363 9673 9538 9585 23447 40396 Sam Bob Elaine Mark
PG.1 SF SF.1 SG SG.1 Sal Total Score1 Score2 Score3 Score4 Score5
1 Jerry James Andrew John Bobby 60000 332.77 55.66 43.82 37.35 40.59 35.15
Score6 Score7 Score8 Score9
1 27.45 28.82 28.95 34.98
