How can I group the same value across multiple columns and sum subsequent values?

I have a table of information that looks like the following:
  rusher_full_name receiver_full_name rushing_fpts receiving_fpts
  <chr>            <chr>                     <dbl>          <dbl>
1 Aaron Jones      NA                          5                0
2 NA               Aaron Jones                 0                5
3 Mike Davis       NA                          0.5              0
4 NA               Allen Robinson              0                3
5 Mike Davis       NA                          0.7              0
What I'm trying to do is sum the rushing_fpts and receiving_fpts values per player, where the player's name may appear in either rusher_full_name or receiver_full_name. For example, for every instance of "Aaron Jones" (whether it's in rusher_full_name or receiver_full_name), sum up the values of rushing_fpts and receiving_fpts.
In the end, this is what I'd like it to look like:
  player_full_name total_fpts
  <chr>                 <dbl>
1 Aaron Jones            10
2 Mike Davis              1.2
3 Allen Robinson          3
I'm pretty new to using R and have Googled a number of things but can't find any solution. Any suggestions on how to accomplish this?

library(tidyverse)

df %>%
  # collapse the two name columns into one, taking whichever is not NA
  mutate(player_full_name = coalesce(rusher_full_name, receiver_full_name)) %>%
  group_by(player_full_name) %>%
  summarise(total_fpts = sum(rushing_fpts + receiving_fpts))
Output
# A tibble: 3 x 2
  player_full_name total_fpts
  <chr>                 <dbl>
1 Aaron Jones            10
2 Allen Robinson          3
3 Mike Davis              1.2
Data
df <- data.frame(
  rusher_full_name   = c("Aaron Jones", NA, "Mike Davis", NA, "Mike Davis"),
  receiver_full_name = c(NA, "Aaron Jones", NA, "Allen Robinson", NA),
  rushing_fpts       = c(5, 0, 0.5, 0, 0.7),
  receiving_fpts     = c(0, 5, 0, 3, 0),
  stringsAsFactors   = FALSE
)
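One caveat worth flagging, since the answer uses sum(rushing_fpts + receiving_fpts): if your real data contains NA point values rather than zeros, any NA in a group makes that player's total NA. A minimal variant that drops NAs (assuming that's the behaviour you'd want):

df %>%
  mutate(player_full_name = coalesce(rusher_full_name, receiver_full_name)) %>%
  group_by(player_full_name) %>%
  # passing the two columns as separate arguments lets na.rm drop missing values
  summarise(total_fpts = sum(rushing_fpts, receiving_fpts, na.rm = TRUE))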

Related

Add multiple columns with the same group and sum

I've got this dataframe, and I want to add the last two columns to another dataframe by summing them, grouped by "Full.Name":
# A tibble: 6 x 5
# Groups: authority_dic, Full.Name [6]
  authority_dic Full.Name          Entity                      `2019` `2020`
  <chr>         <chr>              <chr>                        <int>  <int>
1 accomplished  Derek J. Leathers  WERNER ENTERPRISES INC           1      0
2 accomplished  Dirk Van de Put    MONDELEZ INTERNATIONAL INC       0      1
3 accomplished  Eileen P. Drake    AEROJET ROCKETDYNE HOLDINGS      1      0
4 accomplished  G. Michael Sievert T-MOBILE US INC                  0      3
5 accomplished  Gary C. Kelly      SOUTHWEST AIRLINES               0      1
6 accomplished  James C. Fish, Jr. WASTE MANAGEMENT INC             1      0
This is the dataframe I want to add the two columns to. As you can see, the "Full.Name" column acts as the grouping column.
# A tibble: 6 x 3
# Groups: Full.Name [6]
  Full.Name            `2019` `2020`
  <chr>                 <int>  <int>
1 A. Patrick Beharelle   5541   3269
2 Aaron P. Graft          165    200
3 Aaron P. Jagdfeld         4      5
4 Adam H. Schechter       147    421
5 Adam P. Symson         1031    752
6 Adena T. Friedman      1400   1655
I can add one column using the following piece of code, but when I try to do the same for the second column, it overwrites the existing one, and I'm left with only one added column instead of two.

narc_auth_total <- narc_auth %>%
  group_by(Full.Name) %>%
  summarise(`2019_words` = sum(`2019`)) %>%
  left_join(totaltweetsyear, ., by = "Full.Name")
The output for this command looks like this:
# A tibble: 6 x 4
# Groups: Full.Name [6]
  Full.Name            `2019` `2020` `2019_words`
  <chr>                 <int>  <int>        <int>
1 A. Patrick Beharelle   5541   3269           88
2 Aaron P. Graft          165    200            2
3 Aaron P. Jagdfeld         4      5            0
4 Adam H. Schechter       147    421            2
5 Adam P. Symson         1031    752           15
6 Adena T. Friedman      1400   1655           21
I want to do the same thing and add a 2020_words column to the same dataframe, summarised just like the 2019_words column, but I can't get it to work. When I add "2020" to my command, it says object "2020" not found.
Thanks in advance.
If I have understood you correctly, this will solve your problem:

narc_auth_total <- narc_auth %>%
  group_by(Full.Name) %>%
  summarise(
    `2019_words` = sum(`2019`),
    `2020_words` = sum(`2020`)
  ) %>%
  left_join(totaltweetsyear, ., by = "Full.Name")
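A likely cause of the object "2020" not found error: 2020 is not a syntactically valid R name (it starts with a digit), so the column must be referenced with backticks; without them R sees a numeric literal rather than a column. For example:

summarise(narc_auth, words = sum(`2020`))  # looks up the column named "2020"
summarise(narc_auth, words = sum(2020))    # just sums the number 2020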

Match records with a combination of regex and lookup

I want to match personal records between two tables using the following logic:
Regex match on last name up to minor variations - summarized by the following regex for a given last name: grepl("LNAME( .r|-| [ivx]|.*)", last_name, ignore.case = TRUE).
The function fuzzyjoin::regex_*_join was suggested, but I'm not sure how to use it if the name isn't static...?
Match on first name based on the nicknames list. So basically matching all names in nicknames[[fname]], or just fname if that entry is empty. This should not be case-sensitive either.
Exact match on city, not case-sensitive.
Right now I'm just iterating through df1 and implementing this logic by hand, but my data set is large and it's taking way too long. The manual implementation also doesn't lend itself to parallelization, which is a concern since I will want to optimize this in the future. There has to be a smarter way of doing this.
Example data:
df1 <- tibble("lname1" = c("SMITH", "BLACK", "MILLER"),
              "fname1" = c("JOHN", "THOMAS", "JAMES"),
              "city"   = c("NEW YORK", "LOS ANGELES", "SEATTLE"),
              "id1"    = c("aaaa", "bbbb", "cccc"),
              "misc1"  = c("bla", "ble", "bla"))

df2 <- tibble("lname2" = c("Smith Jr.", "Black III", "Miller-Muller", "Smith"),
              "fname2" = c("Jon", "Tom", "Jamie", "John"),
              "city"   = c("New York", "Los Angeles", "Seattle", "New York"),
              "id2"    = c("1111", "2222", "3333", "4444"),
              "misc2"  = c("bonk", "bzdonk", "boom", "bam"))

nicknames <- list("john"   = c("john", "jon", "johnny"),
                  "thomas" = c("thomas", "tom", "tommy"),
                  "james"  = c("james", "jamie", "jim"))
Expected output:
expected_output <- tibble("id1"    = c("aaaa", "aaaa", "bbbb", "cccc"),
                          "id2"    = c("1111", "4444", "2222", "3333"),
                          "lname1" = c("SMITH", "SMITH", "BLACK", "MILLER"),
                          "fname1" = c("JOHN", "JOHN", "THOMAS", "JAMES"),
                          "lname2" = c("Smith Jr.", "Smith", "Black III", "Miller-Muller"),
                          "fname2" = c("Jon", "John", "Tom", "Jamie"),
                          "city"   = c("New York", "New York", "Los Angeles", "Seattle"),
                          "misc1"  = c("bla", "bla", "ble", "bla"),
                          "misc2"  = c("bonk", "bam", "bzdonk", "boom"))
# A tibble: 4 x 9
  id1   id2   lname1 fname1 lname2        fname2 city        misc1 misc2
  <chr> <chr> <chr>  <chr>  <chr>         <chr>  <chr>       <chr> <chr>
1 aaaa  1111  SMITH  JOHN   Smith Jr.     Jon    New York    bla   bonk
2 aaaa  4444  SMITH  JOHN   Smith         John   New York    bla   bam
3 bbbb  2222  BLACK  THOMAS Black III     Tom    Los Angeles ble   bzdonk
4 cccc  3333  MILLER JAMES  Miller-Muller Jamie  Seattle     bla   boom
EDIT:
This is as far as I got. I've spent the past few hours trying to get the last step done, but I can't. I have this:
df <- tibble("fname1" = c("JOHN", "JOHN", "JOHN"),
             "lname1" = c("SMITH", "SMITH", "SMITH"),
             "fname2" = c("FRANK", "JOHN", "BILL"),
             "lname2" = c("SMITH", "SMITH", "SMITH"),
             "city"   = c("NEW YORK", "NEW YORK", "NEW YORK"))

nicknames_df <- tibble(fname = names(nicknames),
                       nick  = paste0("^(", sapply(nicknames, paste, collapse = "|"), ")$"))
> df
# A tibble: 3 x 5
  fname1 lname1 fname2 lname2 city
  <chr>  <chr>  <chr>  <chr>  <chr>
1 JOHN   SMITH  FRANK  SMITH  NEW YORK
2 JOHN   SMITH  JOHN   SMITH  NEW YORK
3 JOHN   SMITH  BILL   SMITH  NEW YORK

> nicknames_df
# A tibble: 3 x 2
  fname  nick
  <chr>  <chr>
1 john   ^(john|jon|johnny)$
2 thomas ^(thomas|tom|tommy)$
3 james  ^(james|jamie|jim)$
Expected output:
> out
# A tibble: 1 x 5
  fname1 lname1 fname2 lname2 city
  <chr>  <chr>  <chr>  <chr>  <chr>
1 JOHN   SMITH  JOHN   SMITH  NEW YORK
How do I join it with nicknames_df to get just the 2nd row?!
out <- df %>% fuzzyjoin::regex_left_join(nicknames_df, ???)
fuzzyjoin::regex_right_join(
  df2, df1,
  by = c(lname2 = "lname1"),
  ignore_case = TRUE
)
# # A tibble: 4 x 10
#   lname2        fname2 city.x      id2   misc2  lname1 fname1 city.y      id1   misc1
#   <chr>         <chr>  <chr>       <chr> <chr>  <chr>  <chr>  <chr>       <chr> <chr>
# 1 Smith Jr.     Jon    New York    1111  bonk   SMITH  JOHN   NEW YORK    aaaa  bla
# 2 Smith         John   New York    4444  bam    SMITH  JOHN   NEW YORK    aaaa  bla
# 3 Black III     Tom    Los Angeles 2222  bzdonk BLACK  THOMAS LOS ANGELES bbbb  ble
# 4 Miller-Muller Jamie  Seattle     3333  boom   MILLER JAMES  SEATTLE     cccc  bla
I didn't want to assume any resolution for city.x vs city.y; while it's visually clear that they agree, I'll let you work through that.
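As for the nickname step from the EDIT, here is one way to finish it (a sketch, not the only approach): regex-join both first-name columns against nicknames_df to recover a canonical name for each, then keep rows where the two canonical names agree. In fuzzyjoin's regex joins the right-hand column holds the patterns; canon1 and canon2 are helper names introduced here for illustration.

library(dplyr)
library(fuzzyjoin)

out <- df %>%
  # resolve fname1 to its canonical name (NA if no nickname pattern matches)
  regex_left_join(nicknames_df, by = c(fname1 = "nick"), ignore_case = TRUE) %>%
  rename(canon1 = fname) %>%
  select(-nick) %>%
  # resolve fname2 the same way
  regex_left_join(nicknames_df, by = c(fname2 = "nick"), ignore_case = TRUE) %>%
  rename(canon2 = fname) %>%
  select(-nick) %>%
  # keep only rows where both first names map to the same canonical name
  filter(canon1 == canon2) %>%
  select(fname1, lname1, fname2, lname2, city)

On the three-row df above this keeps only the JOHN/JOHN row. Names absent from the nicknames list resolve to NA and are dropped by the filter; to get the "just fname if that is empty" behaviour, coalesce each canonical name with the lowercased raw name before filtering.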

Is there a function that will allow me to get the difference between rows of the same type? [duplicate]

This question already has answers here:
Calculate difference between values in consecutive rows by group
(4 answers)
Closed 1 year ago.
I want to find the difference in the values of the same type.
Please refer to the sample dataframe below:
df <- data.frame(
  x = c("Jimmy Page", "Jimmy Page", "Jimmy Page", "Jimmy Page",
        "John Smith", "John Smith", "John Smith",
        "Joe Root", "Joe Root", "Joe Root", "Joe Root", "Joe Root"),
  y = c(1, 2, 3, 4, 5, 7, 89, 12, 34, 67, 95, 9674)
)
I would like to get the difference between consecutive values for each name: e.g. for Jimmy Page = 1 and Jimmy Page = 2, the difference is 1. And present NA for the difference between dissimilar names.
You can use diff inside ave.

# within each name group, take the difference to the next row; pad the last row with NA
df$diff <- ave(df$y, df$x, FUN = function(z) c(diff(z), NA))
df
#            x    y diff
#1  Jimmy Page    1    1
#2  Jimmy Page    2    1
#3  Jimmy Page    3    1
#4  Jimmy Page    4   NA
#5  John Smith    5    2
#6  John Smith    7   82
#7  John Smith   89   NA
#8    Joe Root   12   22
#9    Joe Root   34   33
#10   Joe Root   67   28
#11   Joe Root   95 9579
#12   Joe Root 9674   NA
Note that this puts each difference on the current row, with NA on the last row of each group; the dplyr answer below pads with NA at the top of each group instead.
library(tidyverse)

df <- data.frame(
  x = c("Jimmy Page", "Jimmy Page", "Jimmy Page", "Jimmy Page",
        "John Smith", "John Smith", "John Smith",
        "Joe Root", "Joe Root", "Joe Root", "Joe Root", "Joe Root"),
  y = c(1, 2, 3, 4, 5, 7, 89, 12, 34, 67, 95, 9674)
)

df %>%
  group_by(x) %>%
  mutate(res = c(NA, diff(y))) %>%
  ungroup()
#> # A tibble: 12 x 3
#>    x              y   res
#>    <chr>      <dbl> <dbl>
#>  1 Jimmy Page     1    NA
#>  2 Jimmy Page     2     1
#>  3 Jimmy Page     3     1
#>  4 Jimmy Page     4     1
#>  5 John Smith     5    NA
#>  6 John Smith     7     2
#>  7 John Smith    89    82
#>  8 Joe Root      12    NA
#>  9 Joe Root      34    22
#> 10 Joe Root      67    33
#> 11 Joe Root      95    28
#> 12 Joe Root    9674  9579
Created on 2021-09-14 by the reprex package (v2.0.1)
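For completeness, a data.table version of the same idea (a sketch; padding with NA at the front matches the dplyr output above, padding at the back matches the ave answer):

library(data.table)
# per-group lagged difference, NA on the first row of each group
setDT(df)[, res := c(NA, diff(y)), by = x][]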

More efficient methods than nested for loops in R -- matching

I'm trying to match people when they have identical first names, last names, and dates of birth, and keep the smallest numerical value for their IDs.
I've created a test database below (much smaller than my actual dataset) and written a nested for-loop that looks like it's doing what it's supposed to.
But it's slow as hell on bigger datasets.
I'm relatively new to the apply functions, but they seem more intuitive for applying functions than for this kind of data wrangling.
What's a more efficient alternative for what I'm doing here? I'm sure there's a simple solution that will have me shaking my head for asking here, but I'm not coming to it.
dta.test <- NULL
dta.test$Person_id <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11)
dta.test$FirstName <- c("John", "James", "John", "Alex", "Alexander", "Jonathan",
                        "John", "Alex", "James", "John", "John")
dta.test$LastName  <- c("Smith", "Jones", "Jones", "Jones", "Jones", "Smith",
                        "Jones", "Smith", "Johnson", "Smith", "Smith")
dta.test$DOB       <- c("2001-01-01", "2002-01-01", "2003-01-01", "2004-01-01",
                        "2004-01-01", "2001-01-01", "2003-01-01", "2006-01-01",
                        "2006-01-01", "2001-01-01", "2009-01-01")
dta.test$Actual_ID <- c(1, 2, 3, 4, 5, 6, 3, 8, 9, 1, 11)
dta.test <- as.data.frame(dta.test)
for (i in unique(dta.test$FirstName))
  for (j in unique(dta.test$LastName))
    for (k in unique(dta.test$DOB)) {
      dta.test$Person_id[dta.test$FirstName == i & dta.test$LastName == j & dta.test$DOB == k] <-
        min(dta.test$Person_id[dta.test$FirstName == i & dta.test$LastName == j & dta.test$DOB == k],
            na.rm = TRUE)
    }
Here's a dplyr solution:

library(dplyr)

dta.test %>%
  group_by(FirstName, LastName, DOB) %>%
  mutate(Person_id = min(Person_id))
# A tibble: 11 x 5
# Groups: FirstName, LastName, DOB [9]
#    Person_id FirstName LastName DOB        Actual_ID
#        <dbl> <fct>     <fct>    <fct>          <dbl>
#  1        1. John      Smith    2001-01-01        1.
#  2        2. James     Jones    2002-01-01        2.
#  3        3. John      Jones    2003-01-01        3.
#  4        4. Alex      Jones    2004-01-01        4.
#  5        5. Alexander Jones    2004-01-01        5.
#  6        6. Jonathan  Smith    2001-01-01        6.
#  7        3. John      Jones    2003-01-01        3.
#  8        8. Alex      Smith    2006-01-01        8.
#  9        9. James     Johnson  2006-01-01        9.
# 10        1. John      Smith    2001-01-01        1.
# 11       11. John      Smith    2009-01-01       11.
EDIT - Added Performance comparison
for_loop_approach <- function() {
  for (i in unique(dta.test$FirstName))
    for (j in unique(dta.test$LastName))
      for (k in unique(dta.test$DOB)) {
        dta.test$Person_id[dta.test$FirstName == i & dta.test$LastName == j & dta.test$DOB == k] <-
          min(dta.test$Person_id[dta.test$FirstName == i & dta.test$LastName == j & dta.test$DOB == k],
              na.rm = TRUE)
      }
}
dplyr_approach <- function() {
  require(dplyr)
  dta.test %>%
    group_by(FirstName, LastName, DOB) %>%
    mutate(Person_id = min(Person_id))
}
library(microbenchmark)
microbenchmark(for_loop_approach(), dplyr_approach(), unit = "relative", times = 100L)

Unit: relative
                expr      min      lq    mean   median       uq      max neval
 for_loop_approach() 20.97948 20.6478 18.8189 17.81437 17.91815 11.76743   100
    dplyr_approach()  1.00000  1.0000  1.0000  1.00000  1.00000  1.00000   100
There were 50 or more warnings (use warnings() to see the first 50)
I've implemented a base R approach rather than dplyr, and according to microbenchmark it comes out 7.46 times faster than the dplyr approach of CPak, and 139.4 times faster than the for loop approach. It just uses the match and paste0 functions, and it automatically retains the smallest matching id:

dta.test[, "Actual_id"] <- match(
  paste0(dta.test$FirstName, dta.test$LastName, dta.test$DOB),
  paste0(dta.test$FirstName, dta.test$LastName, dta.test$DOB)
)
This approach also outputs straight to a data frame, rather than a tibble (from which you would need to extract the new column and add it back to your data frame):
   Person_id FirstName LastName        DOB Actual_id
1          1      John    Smith 2001-01-01         1
2          2     James    Jones 2002-01-01         2
3          3      John    Jones 2003-01-01         3
4          4      Alex    Jones 2004-01-01         4
5          5 Alexander    Jones 2004-01-01         5
6          6  Jonathan    Smith 2001-01-01         6
7          7      John    Jones 2003-01-01         3
8          8      Alex    Smith 2006-01-01         8
9          9     James  Johnson 2006-01-01         9
10        10      John    Smith 2001-01-01         1
11        11      John    Smith 2009-01-01        11
In your real data I expect the person id is not so simple (not just an integer) and doesn't run in numerical order, e.g.
dta.test$Person_id <- paste0(LETTERS[1:11],1:11)
You just need a small tweak to make this still work: extract the value from the Person_id column rather than using the match index directly:

dta.test[, "Actual_id"] <- dta.test[
  match(paste0(dta.test$FirstName, dta.test$LastName, dta.test$DOB),
        paste0(dta.test$FirstName, dta.test$LastName, dta.test$DOB)),
  "Person_id"
]
Giving:
   Person_id FirstName LastName        DOB Actual_id
1         A1      John    Smith 2001-01-01        A1
2         B2     James    Jones 2002-01-01        B2
3         C3      John    Jones 2003-01-01        C3
4         D4      Alex    Jones 2004-01-01        D4
5         E5 Alexander    Jones 2004-01-01        E5
6         F6  Jonathan    Smith 2001-01-01        F6
7         G7      John    Jones 2003-01-01        C3
8         H8      Alex    Smith 2006-01-01        H8
9         I9     James  Johnson 2006-01-01        I9
10       J10      John    Smith 2001-01-01        A1
11       K11      John    Smith 2009-01-01       K11
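One small caveat with the paste0 key: concatenating fields with no separator can, in principle, let different combinations collide (e.g. "JOHN" + "SSMITH" and "JOHNS" + "SMITH" produce the same string). Using paste with a separator that can't occur in the data avoids this; key below is just a helper name introduced for readability:

key <- paste(dta.test$FirstName, dta.test$LastName, dta.test$DOB, sep = "|")
dta.test[, "Actual_id"] <- dta.test[match(key, key), "Person_id"]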
A data.table solution will probably be quickest on large data with lots of groups:

library(data.table)

# convert to data.table in place and key by the grouping columns
setDT(dta.test, key = c("FirstName", "LastName", "DOB"))
# take the smallest Person_id within each FirstName/LastName/DOB group
dta.test[, Actual_ID := min(Person_id, na.rm = TRUE), by = .(FirstName, LastName, DOB)]

Merge two datasets

I create a node list as follows:
name <- c("Joe", "Frank", "Peter")
city <- c("New York", "Detroit", "Maimi")
age  <- c(24, 55, 65)
node_list <- data.frame(name, age, city)
node_list
   name age     city
1   Joe  24 New York
2 Frank  55  Detroit
3 Peter  65    Maimi
Then I create an edge list as follows:
from    <- c("Joe", "Frank", "Peter", "Albert")
to      <- c("Frank", "Albert", "James", "Tony")
to_city <- c("Detroit", "St. Louis", "New York", "Carson City")
edge_list <- data.frame(from, to, to_city)
edge_list
    from     to     to_city
1    Joe  Frank     Detroit
2  Frank Albert   St. Louis
3  Peter  James    New York
4 Albert   Tony Carson City
Notice that the names in the node list and edge list do not overlap 100%. I want to create a master node list of all the names, capturing city information as well. This is my dplyr attempt to do this:
new_node <- edge_list %>%
  gather("from_to", "name", from, to) %>%
  distinct(name) %>%
  full_join(node_list)

new_node
    name age     city
1    Joe  24 New York
2  Frank  55  Detroit
3  Peter  65    Maimi
4 Albert  NA     <NA>
5  James  NA     <NA>
6   Tony  NA     <NA>
I need to figure out how to add to_city information. What do I need to add to my dplyr code to make this happen? Thanks.
Join twice, once on to and once on from, with the irrelevant columns subsetted out:
library(dplyr)

node_list <- data_frame(name = c("Joe", "Frank", "Peter"),
                        city = c("New York", "Detroit", "Maimi"),
                        age  = c(24, 55, 65))

edge_list <- data_frame(from    = c("Joe", "Frank", "Peter", "Albert"),
                        to      = c("Frank", "Albert", "James", "Tony"),
                        to_city = c("Detroit", "St. Louis", "New York", "Carson City"))

node_list %>%
  full_join(select(edge_list, name = to, city = to_city)) %>%
  full_join(select(edge_list, name = from))
#> Joining, by = c("name", "city")
#> Joining, by = "name"
#> # A tibble: 6 x 3
#>   name   city          age
#>   <chr>  <chr>       <dbl>
#> 1 Joe    New York      24.
#> 2 Frank  Detroit       55.
#> 3 Peter  Maimi         65.
#> 4 Albert St. Louis      NA
#> 5 James  New York       NA
#> 6 Tony   Carson City    NA
In this case the second join doesn't do anything because everybody is already included, but it would insert anyone who only existed in the from column.
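A side note for newer tidyverse versions: data_frame() has since been deprecated in favour of tibble(), and gather() is superseded by pivot_longer(), so the reshaping step from the question could be written as follows (a sketch of the same logic with the newer verbs):

library(dplyr)
library(tidyr)

new_node <- edge_list %>%
  # stack the from and to columns into a single name column
  pivot_longer(c(from, to), names_to = "from_to", values_to = "name") %>%
  distinct(name) %>%
  full_join(node_list, by = "name")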
