I have the following dataset in R:
address = c( "44 Ocean Road Atlanta Georgia", "882 4N Road River NY, NY 12345", "882 - River Road NY, ZIP 12345", "123 Fake Road Boston Drive Boston", "123 Fake - Rd Boston 56789", "3665 Apt 5 Moon Crs", "3665 Unit Moon Crescent", "NO ADDRESS PROVIDED", "31 Silver Way Road", "1800 Orleans St, Baltimore, MD 21287, United States",
"1799 Orlans Street, Maryland , USA")
name = c("Pancake House of America" ,"ABC Center Building", "Cent. Bldg ABC", "BD Home 25 New", "Boarding Direct 25", "Pine Recreational Center", "Pine Rec. cntR", "Boston Swimming Complex", "boston gym center", "mas hospital" , "Massachusetts Hospital" )
blocking_var = c(1, 1,1,1, 1, 2,2,2,2,3,3)
my_data = data.frame(address, name, blocking_var)
The data looks something like this:
> my_data
address name blocking_var
1 44 Ocean Road Atlanta Georgia Pancake House of America 1
2 882 4N Road River NY, NY 12345 ABC Center Building 1
3 882 - River Road NY, ZIP 12345 Cent. Bldg ABC 1
4 123 Fake Road Boston Drive Boston BD Home 25 New 1
5 123 Fake - Rd Boston 56789 Boarding Direct 25 1
6 3665 Apt 5 Moon Crs Pine Recreational Center 2
7 3665 Unit Moon Crescent Pine Rec. cntR 2
8 NO ADDRESS PROVIDED Boston Swimming Complex 2
9 31 Silver Way Road boston gym center 2
10 1800 Orleans St, Baltimore, MD 21287, United States mas hospital 3
11 1799 Orlans Street, Maryland , USA Massachusetts Hospital 3
I am trying to follow this R tutorial (https://cran.r-project.org/web/packages/RecordLinkage/vignettes/WeightBased.pdf) and learn how to remove duplicates based on fuzzy conditions. The goal (within each "block") is to keep all unique records - and for fuzzy duplicates, only keep one occurrence of the duplicate.
I tried the following code:
library(RecordLinkage)
pairs=compare.dedup(my_data, blockfld=3)
But when I inspect the results, all the comparison values are NA. I think I am doing something wrong, and there is no point in continuing until this error is resolved.
Can someone please show me how I can resolve this problem and continue on with the tutorial?
In the end, I am looking for something like this:
address name blocking_var
1 44 Ocean Road Atlanta Georgia Pancake House of America 1
2 882 4N Road River NY, NY 12345 ABC Center Building 1
4 123 Fake Road Boston Drive Boston BD Home 25 New 1
6 3665 Apt 5 Moon Crs Pine Recreational Center 2
9 31 Silver Way Road boston gym center 2
10 1800 Orleans St, Baltimore, MD 21287, United States mas hospital 3
Thank you!
You forgot to enable string comparison on the columns (the strcmp parameter):
address = c(
"44 Ocean Road Atlanta Georgia", "882 4N Road River NY, NY 12345", "882 - River Road NY, ZIP 12345", "123 Fake Road Boston Drive Boston", "123 Fake - Rd Boston 56789", "3665 Apt 5 Moon Crs", "3665 Unit Moon Crescent", "NO ADDRESS PROVIDED", "31 Silver Way Road", "1800 Orleans St, Baltimore, MD 21287, United States",
"1799 Orlans Street, Maryland , USA")
name = c("Pancake House of America" ,"ABC Center Building", "Cent. Bldg ABC", "BD Home 25 New", "Boarding Direct 25", "Pine Recreational Center", "Pine Rec. cntR", "Boston Swimming Complex", "boston gym center", "mas hospital" , "Massachusetts Hospital" )
blocking_var = c(1, 1,1,1, 1, 2,2,2,2,3,3)
my_data = data.frame(address, name, blocking_var)
library(RecordLinkage)
pairs <- compare.dedup(my_data, blockfld=3, strcmp = c("address", "name"))
pairs
#> $data
#> address name
#> 1 44 Ocean Road Atlanta Georgia Pancake House of America
#> 2 882 4N Road River NY, NY 12345 ABC Center Building
#> 3 882 - River Road NY, ZIP 12345 Cent. Bldg ABC
#> 4 123 Fake Road Boston Drive Boston BD Home 25 New
#> 5 123 Fake - Rd Boston 56789 Boarding Direct 25
#> 6 3665 Apt 5 Moon Crs Pine Recreational Center
#> 7 3665 Unit Moon Crescent Pine Rec. cntR
#> 8 NO ADDRESS PROVIDED Boston Swimming Complex
#> 9 31 Silver Way Road boston gym center
#> 10 1800 Orleans St, Baltimore, MD 21287, United States mas hospital
#> 11 1799 Orlans Street, Maryland , USA Massachusetts Hospital
#> blocking_var
#> 1 1
#> 2 1
#> 3 1
#> 4 1
#> 5 1
#> 6 2
#> 7 2
#> 8 2
#> 9 2
#> 10 3
#> 11 3
#>
#> $pairs
#> id1 id2 address name blocking_var is_match
#> 1 1 2 0.4657088 0.5014620 1 NA
#> 2 1 3 0.4256705 0.4551587 1 NA
#> 3 1 4 0.5924184 0.4543651 1 NA
#> 4 1 5 0.5139994 0.4768519 1 NA
#> 5 2 3 0.9082051 0.5802005 1 NA
#> 6 2 4 0.5112554 0.4734336 1 NA
#> 7 2 5 0.5094017 0.5467836 1 NA
#> 8 3 4 0.4767677 0.4404762 1 NA
#> 9 3 5 0.5418803 0.4761905 1 NA
#> 10 4 5 0.8550583 0.6672619 1 NA
#> 11 6 7 0.8749962 0.8306277 1 NA
#> 12 6 8 0.4385965 0.5243193 1 NA
#> 13 6 9 0.5622807 0.5502822 1 NA
#> 14 7 8 0.3974066 0.5075914 1 NA
#> 15 7 9 0.5626812 0.5896359 1 NA
#> 16 8 9 0.3942495 0.6478338 1 NA
#> 17 10 11 0.6939076 0.6843434 1 NA
#>
#> $frequencies
#> address name blocking_var
#> 0.09090909 0.09090909 0.33333333
#>
#> $type
#> [1] "deduplication"
#>
#> attr(,"class")
#> [1] "RecLinkData"
It then goes like this, using e.g. the EpiLink algorithm:
# Compute EpiLink weights
pairs_w <- epiWeights(pairs)
# Explore the pairs and their weight to find a good cutoff
getPairs(pairs_w, min.weight=0.6, max.weight=0.8)
#> id address
#> 1 2 882 4N Road River NY, NY 12345
#> 2 3 882 - River Road NY, ZIP 12345
#> 3
#> 4 10 1800 Orleans St, Baltimore, MD 21287, United States
#> 5 11 1799 Orlans Street, Maryland , USA
#> 6
#> 7 7 3665 Unit Moon Crescent
#> 8 9 31 Silver Way Road
#> 9
#> 10 6 3665 Apt 5 Moon Crs
#> 11 9 31 Silver Way Road
#> 12
#> 13 2 882 4N Road River NY, NY 12345
#> 14 5 123 Fake - Rd Boston 56789
#> 15
#> 16 1 44 Ocean Road Atlanta Georgia
#> 17 4 123 Fake Road Boston Drive Boston
#> 18
#> 19 8 NO ADDRESS PROVIDED
#> 20 9 31 Silver Way Road
#> 21
#> 22 3 882 - River Road NY, ZIP 12345
#> 23 5 123 Fake - Rd Boston 56789
#> 24
#> name blocking_var Weight
#> 1 ABC Center Building 1
#> 2 Cent. Bldg ABC 1 0.7916856
#> 3
#> 4 mas hospital 3
#> 5 Massachusetts Hospital 3 0.7468321
#> 6
#> 7 Pine Rec. cntR 2
#> 8 boston gym center 2 0.6548348
#> 9
#> 10 Pine Recreational Center 2
#> 11 boston gym center 2 0.6386475
#> 12
#> 13 ABC Center Building 1
#> 14 Boarding Direct 25 1 0.6156913
#> 15
#> 16 Pancake House of America 1
#> 17 BD Home 25 New 1 0.6118630
#> 18
#> 19 Boston Swimming Complex 2
#> 20 boston gym center 2 0.6099491
#> 21
#> 22 Cent. Bldg ABC 1
#> 23 Boarding Direct 25 1 0.6001716
#> 24
I chose > 0.7 to classify a pair as a link and < 0.6 as a non-link.
Pairs in between are labelled as "possible".
pairs_class <- epiClassify(pairs_w, threshold.upper = 0.7, threshold.lower = 0.6)
summary(pairs_class)
#>
#> Deduplication Data Set
#>
#> 11 records
#> 17 record pairs
#>
#> 0 matches
#> 0 non-matches
#> 17 pairs with unknown status
#>
#>
#> Weight distribution:
#>
#> [0.5,0.55] (0.55,0.6] (0.6,0.65] (0.65,0.7] (0.7,0.75] (0.75,0.8] (0.8,0.85]
#> 1 6 5 1 1 1 1
#> (0.85,0.9]
#> 1
#>
#> 4 links detected
#> 6 possible links detected
#> 7 non-links detected
#>
#> Classification table:
#>
#> classification
#> true status N P L
#> <NA> 7 6 4
And the results:
# detected links, possible matches, non-links
getPairs(pairs_class, show = "links")
#> id address
#> 1 6 3665 Apt 5 Moon Crs
#> 2 7 3665 Unit Moon Crescent
#> 3
#> 4 4 123 Fake Road Boston Drive Boston
#> 5 5 123 Fake - Rd Boston 56789
#> 6
#> 7 2 882 4N Road River NY, NY 12345
#> 8 3 882 - River Road NY, ZIP 12345
#> 9
#> 10 10 1800 Orleans St, Baltimore, MD 21287, United States
#> 11 11 1799 Orlans Street, Maryland , USA
#> 12
#> name blocking_var Weight
#> 1 Pine Recreational Center 2
#> 2 Pine Rec. cntR 2 0.8801340
#> 3
#> 4 BD Home 25 New 1
#> 5 Boarding Direct 25 1 0.8054952
#> 6
#> 7 ABC Center Building 1
#> 8 Cent. Bldg ABC 1 0.7916856
#> 9
#> 10 mas hospital 3
#> 11 Massachusetts Hospital 3 0.7468321
#> 12
getPairs(pairs_class, show = "possible")
#> id address name blocking_var
#> 1 7 3665 Unit Moon Crescent Pine Rec. cntR 2
#> 2 9 31 Silver Way Road boston gym center 2
#> 3
#> 4 6 3665 Apt 5 Moon Crs Pine Recreational Center 2
#> 5 9 31 Silver Way Road boston gym center 2
#> 6
#> 7 2 882 4N Road River NY, NY 12345 ABC Center Building 1
#> 8 5 123 Fake - Rd Boston 56789 Boarding Direct 25 1
#> 9
#> 10 1 44 Ocean Road Atlanta Georgia Pancake House of America 1
#> 11 4 123 Fake Road Boston Drive Boston BD Home 25 New 1
#> 12
#> 13 8 NO ADDRESS PROVIDED Boston Swimming Complex 2
#> 14 9 31 Silver Way Road boston gym center 2
#> 15
#> 16 3 882 - River Road NY, ZIP 12345 Cent. Bldg ABC 1
#> 17 5 123 Fake - Rd Boston 56789 Boarding Direct 25 1
#> 18
#> Weight
#> 1
#> 2 0.6548348
#> 3
#> 4
#> 5 0.6386475
#> 6
#> 7
#> 8 0.6156913
#> 9
#> 10
#> 11 0.6118630
#> 12
#> 13
#> 14 0.6099491
#> 15
#> 16
#> 17 0.6001716
#> 18
getPairs(pairs_class, show = "nonlinks")
#> id address name blocking_var
#> 1 1 44 Ocean Road Atlanta Georgia Pancake House of America 1
#> 2 5 123 Fake - Rd Boston 56789 Boarding Direct 25 1
#> 3
#> 4 2 882 4N Road River NY, NY 12345 ABC Center Building 1
#> 5 4 123 Fake Road Boston Drive Boston BD Home 25 New 1
#> 6
#> 7 1 44 Ocean Road Atlanta Georgia Pancake House of America 1
#> 8 2 882 4N Road River NY, NY 12345 ABC Center Building 1
#> 9
#> 10 6 3665 Apt 5 Moon Crs Pine Recreational Center 2
#> 11 8 NO ADDRESS PROVIDED Boston Swimming Complex 2
#> 12
#> 13 3 882 - River Road NY, ZIP 12345 Cent. Bldg ABC 1
#> 14 4 123 Fake Road Boston Drive Boston BD Home 25 New 1
#> 15
#> 16 7 3665 Unit Moon Crescent Pine Rec. cntR 2
#> 17 8 NO ADDRESS PROVIDED Boston Swimming Complex 2
#> 18
#> 19 1 44 Ocean Road Atlanta Georgia Pancake House of America 1
#> 20 3 882 - River Road NY, ZIP 12345 Cent. Bldg ABC 1
#> 21
#> Weight
#> 1
#> 2 0.5890881
#> 3
#> 4
#> 5 0.5865789
#> 6
#> 7
#> 8 0.5794458
#> 9
#> 10
#> 11 0.5777132
#> 12
#> 13
#> 14 0.5591162
#> 15
#> 16
#> 17 0.5541298
#> 18
#> 19
#> 20 0.5442886
#> 21
Created on 2022-11-17 with reprex v2.0.2
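To finish with the deduplicated data frame you asked for, one option (a sketch, not part of the RecordLinkage API) is to drop one member of each detected link. The link table below is read off the getPairs(pairs_class, show = "links") output above; with a real run you would extract the ids from pairs_class$pairs for the rows where pairs_class$prediction == "L":

```r
# Link ids taken from the "links" output above (7~6, 5~4, 3~2, 11~10)
links <- data.frame(id1 = c(6, 4, 2, 10), id2 = c(7, 5, 3, 11))

# Row numbers of the 11-row my_data to keep: drop the second member
# of each linked pair
keep <- setdiff(1:11, unique(links$id2))
keep
#> [1]  1  2  4  6  8  9 10

# my_data[keep, ] is then the deduplicated data frame
```

Note that row 8 ("NO ADDRESS PROVIDED") survives here because it was only a "possible" link; whether to drop it too depends on how you resolve the possible matches.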
Ok, so here is my scenario: I have a dataset with a column composed of lists of words (keyword tags for YT videos, where each row is video data).
What I want to do is do a complete count of all unique object instances within these lists, for the entire column. So basically what I want in the end is a table with two fields: keyword, count.
If I just do a simple dplyr query, then it counts the list itself as a unique object. While this is also interesting, this is not what I want.
So this is the dplyr query I want to build on, but I am not sure how to count the unique instances inside the lists:
vid_tag_freq = df %>%
count(tags)
To further clarify:
With a dataset like:
Tags
1 ['Dog', 'Cat', 'Mouse', 'Fish']
2 ['Cat', 'Fish']
3 ['Cat', 'Fish']
I am now getting:
Tags Count
1 ['Dog', 'Cat', 'Mouse', 'Fish'] 1
2 ['Cat', 'Fish'] 2
What I actually want:
Tags Count
1 'Cat' 3
2 'Fish' 3
3 'Dog' 1
4 'Mouse' 1
I hope that explains it lol
EDIT: This is what my data looks like; I guess most are lists of lists? Maybe I should clean up the [0]s as NULL?
[1] "[['Flood (Disaster Type)', 'Burlington (City/Town/Village)', 'Ontario (City/Town/Village)']]"
[2] "[0]"
[3] "[0]"
[4] "[['Rocket (Product Category)', 'Interview (TV Genre)', 'Canadian Broadcasting Corporation (TV Network)', 'Israel (Country)', 'Gaza War (Military Conflict)']]"
[5] "[0]"
[6] "[['Iraq (Country)', 'Military (Film Genre)', 'United States Of America (Country)']]"
[7] "[['Ebola (Disease Or Medical Condition)', 'Chair', 'Margaret Chan (Physician)', 'WHO']]"
[8] "[['CBC Television (TV Network)', 'CBC News (Website Owner)', 'Canadian Broadcasting Corporation (TV Network)']]"
[9] "[['Rob Ford (Politician)', 'the fifth estate', 'CBC Television (TV Network)', 'Bill Blair', 'Gillian Findlay', 'Documentary (TV Genre)']]"
[10] "[['B.C.', 'Dog Walking (Profession)', 'dogs', 'dog walker', 'death', 'dead']]"
[11] "[['Suicide Of Amanda Todd (Event)', 'Amanda Todd', 'cyberbullying', 'CBC Television (TV Network)', 'the fifth estate', 'Mark Kelley', 'cappers', 'Documentary (TV Genre)']]"
[12] "[['National Hockey League (Sports Association)', 'Climate Change (Website Category)', 'Hockey (Sport)', 'greenhouse gas', 'emissions']]"
[13] "[['Rob Ford (Politician)', 'bomb threat', 'Toronto (City/Town/Village)', 'City Hall (Building)']]"
[14] "[['Blue Jays', 'Ashes', 'friends']]"
[15] "[['Robin Williams (Celebrity)', 'Peter Gzowski']]"
It would help if you could dput() some of the data for a working example. Going off the idea that you have a list column, here are a couple of general solutions you may be able to work with:
df <- tibble::tibble(
x = replicate(10, sample(state.name, sample(5:10, 1), TRUE), simplify = FALSE)
)
df
#> # A tibble: 10 × 1
#> x
#> <list>
#> 1 <chr [7]>
#> 2 <chr [7]>
#> 3 <chr [8]>
#> 4 <chr [6]>
#> 5 <chr [8]>
#> 6 <chr [8]>
#> 7 <chr [8]>
#> 8 <chr [6]>
#> 9 <chr [5]>
#> 10 <chr [10]>
# dplyr in a dataframe
df |>
tidyr::unnest(x) |>
dplyr::count(x)
#> # A tibble: 36 × 2
#> x n
#> <chr> <int>
#> 1 Alabama 1
#> 2 Alaska 1
#> 3 Arkansas 4
#> 4 California 3
#> 5 Colorado 5
#> 6 Connecticut 1
#> 7 Delaware 3
#> 8 Florida 1
#> 9 Georgia 3
#> 10 Hawaii 2
#> # … with 26 more rows
# vctrs
vctrs::vec_count(unlist(df$x))
#> key count
#> 1 Colorado 5
#> 2 Louisiana 5
#> 3 North Dakota 4
#> 4 Mississippi 4
#> 5 Arkansas 4
#> 6 Delaware 3
#> 7 Vermont 3
#> 8 Minnesota 3
#> 9 Utah 3
#> 10 California 3
#> 11 Georgia 3
#> 12 Indiana 2
#> 13 Missouri 2
#> 14 New Hampshire 2
#> 15 Maryland 2
#> 16 Nebraska 2
#> 17 Hawaii 2
#> 18 New Jersey 2
#> 19 Oklahoma 2
#> 20 Massachusetts 1
#> 21 Illinois 1
#> 22 Texas 1
#> 23 Connecticut 1
#> 24 Rhode Island 1
#> 25 Michigan 1
#> 26 New York 1
#> 27 Ohio 1
#> 28 Nevada 1
#> 29 Florida 1
#> 30 Montana 1
#> 31 Wisconsin 1
#> 32 Alabama 1
#> 33 Alaska 1
#> 34 North Carolina 1
#> 35 Washington 1
#> 36 Kansas 1
Created on 2022-10-07 with reprex v2.0.2
Edit
If your list is actually a character vector, you'll need to do some string parsing.
# "list" but are actually strings
x <- c(
"[['Flood (Disaster Type)', 'Burlington (City/Town/Village)', 'Ontario (City/Town/Village)']]",
"[0]",
"[0]",
"[['Rocket (Product Category)', 'Interview (TV Genre)', 'Canadian Broadcasting Corporation (TV Network)', 'Israel (Country)', 'Gaza War (Military Conflict)']]",
"[0]",
"[['Iraq (Country)', 'Military (Film Genre)', 'United States Of America (Country)']]",
"[['Ebola (Disease Or Medical Condition)', 'Chair', 'Margaret Chan (Physician)', 'WHO']]",
"[['CBC Television (TV Network)', 'CBC News (Website Owner)', 'Canadian Broadcasting Corporation (TV Network)']]",
"[['Rob Ford (Politician)', 'the fifth estate', 'CBC Television (TV Network)', 'Bill Blair', 'Gillian Findlay', 'Documentary (TV Genre)']]",
"[['B.C.', 'Dog Walking (Profession)', 'dogs', 'dog walker', 'death', 'dead']]",
"[['Suicide Of Amanda Todd (Event)', 'Amanda Todd', 'cyberbullying', 'CBC Television (TV Network)', 'the fifth estate', 'Mark Kelley', 'cappers', 'Documentary (TV Genre)']]",
"[['National Hockey League (Sports Association)', 'Climate Change (Website Category)', 'Hockey (Sport)', 'greenhouse gas', 'emissions']]",
"[['Rob Ford (Politician)', 'bomb threat', 'Toronto (City/Town/Village)', 'City Hall (Building)']]",
"[['Blue Jays', 'Ashes', 'friends']]",
"[['Robin Williams (Celebrity)', 'Peter Gzowski']]"
)
# assign to a data.frame
df <- data.frame(x = x)
df |>
dplyr::mutate(
# remove square brackets at beginning or end
x = gsub("^\\[{1,2}|\\]{1,2}$", "", x),
# separate the strings into an actual list
x = strsplit(x, "',\\s|,\\s'")
) |>
# unnest the list column so the elements appear as individual rows
tidyr::unnest(x) |>
# some extra cleaning to strip out the surrounding '
dplyr::mutate(x = gsub("^'|'$", "", x)) |>
# count the individual elements
dplyr::count(x, sort = TRUE)
#> # A tibble: 47 × 2
#> x n
#> <chr> <int>
#> 1 0 3
#> 2 CBC Television (TV Network) 3
#> 3 Canadian Broadcasting Corporation (TV Network) 2
#> 4 Documentary (TV Genre) 2
#> 5 Rob Ford (Politician) 2
#> 6 the fifth estate 2
#> 7 Amanda Todd 1
#> 8 Ashes 1
#> 9 B.C. 1
#> 10 Bill Blair 1
#> # … with 37 more rows
# same result just working with the vector
x |>
gsub("^\\[{1,2}|\\]{1,2}$", "", x = _) |>
strsplit("',\\s|,\\s'") |>
unlist() |>
gsub("^'|'$", "", x = _) |>
vctrs::vec_count() # or table()
#> key count
#> 1 CBC Television (TV Network) 3
#> 2 0 3
#> 3 Rob Ford (Politician) 2
#> 4 the fifth estate 2
#> 5 Documentary (TV Genre) 2
#> 6 Canadian Broadcasting Corporation (TV Network) 2
#> 7 City Hall (Building) 1
#> 8 United States Of America (Country) 1
#> 9 Mark Kelley 1
#> 10 Israel (Country) 1
#> 11 Bill Blair 1
#> 12 Interview (TV Genre) 1
#> 13 Blue Jays 1
#> 14 Hockey (Sport) 1
#> 15 friends 1
#> 16 Peter Gzowski 1
#> 17 Suicide Of Amanda Todd (Event) 1
#> 18 greenhouse gas 1
#> 19 Dog Walking (Profession) 1
#> 20 Flood (Disaster Type) 1
#> 21 National Hockey League (Sports Association) 1
#> 22 Amanda Todd 1
#> 23 Chair 1
#> 24 dog walker 1
#> 25 bomb threat 1
#> 26 dogs 1
#> 27 Climate Change (Website Category) 1
#> 28 Robin Williams (Celebrity) 1
#> 29 Margaret Chan (Physician) 1
#> 30 cyberbullying 1
#> 31 Ashes 1
#> 32 Ontario (City/Town/Village) 1
#> 33 Iraq (Country) 1
#> 34 WHO 1
#> 35 cappers 1
#> 36 Gillian Findlay 1
#> 37 Military (Film Genre) 1
#> 38 CBC News (Website Owner) 1
#> 39 B.C. 1
#> 40 Ebola (Disease Or Medical Condition) 1
#> 41 Toronto (City/Town/Village) 1
#> 42 death 1
#> 43 emissions 1
#> 44 Rocket (Product Category) 1
#> 45 Gaza War (Military Conflict) 1
#> 46 dead 1
#> 47 Burlington (City/Town/Village) 1
Created on 2022-10-08 with reprex v2.0.2
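Regarding the "[0]" entries from the question's edit: they look like placeholders for videos with no tags, so you may want to drop them before parsing rather than letting 0 show up as a key (as it does with count 3 above). A minimal base-R sketch with a toy stand-in vector:

```r
# toy stand-in for the tag strings shown in the question
x <- c("[['Dog', 'Cat']]", "[0]", "[['Cat', 'Fish']]", "[0]")

# drop the "[0]" placeholder rows before any parsing
x <- x[x != "[0]"]

# strip the [[' ... ']] wrapper, then split on the ', ' separators
tags <- unlist(strsplit(gsub("^\\[\\['|'\\]\\]$", "", x), "',\\s*'"))
table(tags)
#> tags
#>  Cat  Dog Fish
#>    2    1    1
```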
It looks like you need unnest_longer():
library(dplyr)
library(tidyr)
df <- tibble(
Tags = list(
list('Dog', 'Cat', 'Mouse', 'Fish'),
list('Cat', 'Fish'),
list('Cat', 'Fish')
)
)
df %>%
tidyr::unnest_longer(Tags) %>%
count(Tags) %>%
arrange(desc(n))
#> # A tibble: 4 × 2
#> Tags n
#> <chr> <int>
#> 1 Cat 3
#> 2 Fish 3
#> 3 Dog 1
#> 4 Mouse 1
I would like to subtract Bay County from Florida in this data frame and create a new row with the name "Florida (-Bay County)".
Maybe group_modify and add_row (dplyr) would be a possibility?
year <- c(2005,2006,2007,2005,2006,2007,2005,2006,2007,2005,2006,2007)
county <- c("Alachua County","Alachua County","Alachua County","Baker County","Baker County","Baker County","Bay County","Bay County","Bay County","Florida","Florida","Florida")
pop <- c(3,6,8,9,8,4,5,8,10,17,22,22)
gdp <- c(3,6,8,9,8,4,5,8,10,17,22,22)
area <- c(3,6,8,9,8,4,5,8,10,17,22,22)
density<-c(3,6,8,9,8,4,5,8,10,17,22,22)
df <- data.frame(year, county,pop,gdp,area,density, stringsAsFactors = FALSE)
   year county                 pop gdp area density
1  2005 Alachua County           3   3    3       3
2  2005 Baker County             9   9    9       9
3  2005 Bay County               5   5    5       5
4  2005 Florida                 17  17   17      17
5  2005 Florida (-Bay County)   12  12   12      12
6  2006 Alachua County           6   6    6       6
7  2006 Baker County             8   8    8       8
8  2006 Bay County               8   8    8       8
9  2006 Florida                 22  22   22      22
10 2006 Florida (-Bay County)   14  14   14      14
11 2007 Alachua County           8   8    8       8
12 2007 Baker County             4   4    4       4
13 2007 Bay County              10  10   10      10
14 2007 Florida                 22  22   22      22
15 2007 Florida (-Bay County)   12  12   12      12
If you wanted to try something with group_modify and add_row, you could consider something like this. Here, within add_row, map() sums up the data in the group, excluding the "Florida" and "Bay County" rows.
library(tidyverse)
df %>%
group_by(year) %>%
group_modify(
~ .x %>%
add_row(
county = "Florida (-Bay County)",
!!! map(.x %>%
filter(!county %in% c("Florida", "Bay County")) %>%
select(-county),
sum)
)
)
Output
year county pop gdp area density
<dbl> <chr> <dbl> <dbl> <dbl> <dbl>
1 2005 Alachua County 3 3 3 3
2 2005 Baker County 9 9 9 9
3 2005 Bay County 5 5 5 5
4 2005 Florida 17 17 17 17
5 2005 Florida (-Bay County) 12 12 12 12
6 2006 Alachua County 6 6 6 6
7 2006 Baker County 8 8 8 8
8 2006 Bay County 8 8 8 8
9 2006 Florida 22 22 22 22
10 2006 Florida (-Bay County) 14 14 14 14
11 2007 Alachua County 8 8 8 8
12 2007 Baker County 4 4 4 4
13 2007 Bay County 10 10 10 10
14 2007 Florida 22 22 22 22
15 2007 Florida (-Bay County) 12 12 12 12
You could do:
df %>%
filter(county != 'Florida' & county != 'Bay County') %>%
group_by(year) %>%
bind_rows(summarise(., county = 'Florida (-Bay County)',
across(where(is.numeric), sum))) %>%
arrange(year)
#> # A tibble: 9 x 6
#> # Groups: year [3]
#> year county pop gdp area density
#> <dbl> <chr> <dbl> <dbl> <dbl> <dbl>
#> 1 2005 Alachua County 3 3 3 3
#> 2 2005 Baker County 9 9 9 9
#> 3 2005 Florida (-Bay County) 12 12 12 12
#> 4 2006 Alachua County 6 6 6 6
#> 5 2006 Baker County 8 8 8 8
#> 6 2006 Florida (-Bay County) 14 14 14 14
#> 7 2007 Alachua County 8 8 8 8
#> 8 2007 Baker County 4 4 4 4
#> 9 2007 Florida (-Bay County) 12 12 12 12
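If you want to keep the Florida and Bay County rows in the output (as in the question's desired result) rather than filtering them away, here is a base-R sketch of the same idea, rebuilding a single-measure version of the question's df for brevity:

```r
# Compact version of the question's data (pop column only)
df <- data.frame(
  year   = rep(2005:2007, times = 4),
  county = rep(c("Alachua County", "Baker County", "Bay County", "Florida"),
               each = 3),
  pop    = c(3, 6, 8, 9, 8, 4, 5, 8, 10, 17, 22, 22)
)

# sum the counties that belong in "Florida (-Bay County)" per year
sub   <- subset(df, !county %in% c("Florida", "Bay County"))
extra <- aggregate(pop ~ year, data = sub, FUN = sum)
extra$county <- "Florida (-Bay County)"

# append the new rows and re-sort so each sits with its year
out <- rbind(df, extra[, names(df)])
out <- out[order(out$year, out$county), ]
```

With the question's numbers this gives 12, 14, and 12 for 2005-2007, matching the desired output; the same aggregate() call works unchanged if you list more measure columns on the left-hand side of the formula (e.g. cbind(pop, gdp, area) ~ year).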
Hello! I am in the beginning stages of learning how to build prediction models for sports, specifically using NHL statistics.
I have all the NHL game outcomes since 1990, and I want to use the number of goals to predict outcomes in future games (just goals, for now).
Below is an excerpt of my data set, but the full data set can be found in this Git link:
https://github.com/papelr/nhldatar/blob/master/nhldatar/data/NHL_outcomes.rda
Date Visitor GVisitor Home GHome Att.
1 1990-10-04 Philadelphia Flyers 1 Boston Bruins 4 <NA>
2 1990-10-04 Montreal Canadiens 3 Buffalo Sabres 3 <NA>
3 1990-10-04 Vancouver Canucks 2 Calgary Flames 3 <NA>
4 1990-10-04 New York Rangers 3 Chicago Blackhawks 4 <NA>
5 1990-10-04 Quebec Nordiques 3 Hartford Whalers 3 <NA>
6 1990-10-04 New York Islanders 1 Los Angeles Kings 4 <NA>
7 1990-10-04 St. Louis Blues 3 Minnesota North Stars 2 <NA>
8 1990-10-04 Detroit Red Wings 3 New Jersey Devils 3 <NA>
9 1990-10-04 Toronto Maple Leafs 1 Winnipeg Jets 7 <NA>
10 1990-10-05 Pittsburgh Penguins 7 Washington Capitals 4 <NA>
11 1990-10-06 Quebec Nordiques 1 Boston Bruins 7 <NA>
12 1990-10-06 Toronto Maple Leafs 1 Calgary Flames 4 <NA>
13 1990-10-06 Winnipeg Jets 3 Edmonton Oilers 3 <NA>
14 1990-10-06 New York Rangers 4 Hartford Whalers 5 <NA>
15 1990-10-06 Vancouver Canucks 6 Los Angeles Kings 3 <NA>
16 1990-10-06 New York Islanders 2 Minnesota North Stars 4 <NA>
17 1990-10-06 Buffalo Sabres 5 Montreal Canadiens 6 <NA>
18 1990-10-06 Philadelphia Flyers 1 New Jersey Devils 3 <NA>
19 1990-10-06 Chicago Blackhawks 5 St. Louis Blues 2 <NA>
20 1990-10-06 Detroit Red Wings 4 Washington Capitals 6 <NA>
21 1990-10-07 New York Islanders 4 Chicago Blackhawks 2 <NA>
22 1990-10-07 Toronto Maple Leafs 2 Edmonton Oilers 3 <NA>
23 1990-10-07 Detroit Red Wings 2 Philadelphia Flyers 7 <NA>
24 1990-10-07 New Jersey Devils 4 Pittsburgh Penguins 7 <NA>
25 1990-10-07 Boston Bruins 5 Quebec Nordiques 2 <NA>
26 1990-10-08 Hartford Whalers 3 Montreal Canadiens 5 <NA>
27 1990-10-08 Minnesota North Stars 3 New York Rangers 6 <NA>
28 1990-10-08 Calgary Flames 4 Winnipeg Jets 3 <NA>
29 1990-10-09 Minnesota North Stars 2 New Jersey Devils 5 <NA>
30 1990-10-09 Pittsburgh Penguins 3 St. Louis Blues 4 <NA>
31 1990-10-09 Los Angeles Kings 6 Vancouver Canucks 2 <NA>
32 1990-10-10 Calgary Flames 5 Detroit Red Wings 6 <NA>
33 1990-10-10 Buffalo Sabres 3 Hartford Whalers 4 <NA>
34 1990-10-10 Washington Capitals 2 New York Rangers 4 <NA>
35 1990-10-10 Quebec Nordiques 8 Toronto Maple Leafs 5 <NA>
36 1990-10-10 Boston Bruins 4 Winnipeg Jets 2 <NA>
37 1990-10-11 Pittsburgh Penguins 1 Chicago Blackhawks 4 <NA>
38 1990-10-11 Edmonton Oilers 5 Los Angeles Kings 5 <NA>
39 1990-10-11 Boston Bruins 3 Minnesota North Stars 3 <NA>
40 1990-10-11 New Jersey Devils 4 Philadelphia Flyers 7 <NA>
This is the prediction model I have come up with so far, but I have failed to get the score matrix that my simulate_game() call below should produce. Any help would be great.
# Using number of goals for prediction model
model_one <-
rbind(
data.frame(goals = outcomes$GHome,
team = outcomes$Home,
opponent = outcomes$Visitor,
home = 1),
data.frame(goals = outcomes$GVisitor,
team = outcomes$Visitor,
opponent = outcomes$Home,
home = 0)) %>%
glm(goals ~ home + team + opponent,
family = poisson (link = log), data = .)
summary(model_one)
# Probability function / matrix
simulate_game <- function(stat_model, homeTeam, awayTeam, max_goals = 10) {
home_goals <- predict(model_one,
data.frame(home = 1,
team = homeTeam,
opponent = awayTeam),
type ="response")
away_goals <- predict(model_one,
data.frame(home = 0,
team = awayTeam,
opponent = homeTeam),
type ="response")
dpois(0: max_goals, home_goals) %>%
dpois(0: max_goals, away_goals)
}
simulate_game(model_one, "Nashville Predators", "Chicago Blackhawks",
max_goals = 10)
I totally understand that a Poisson model isn't the best for sports predictions, but I am rebuilding a model I found for the EPL for learning/practice reasons, and adapting it to the NHL (from David Sheehan's model, https://dashee87.github.io/data%20science/football/r/predicting-football-results-with-statistical-modelling/).
Any tips would be great, because currently, this model returns a bunch of warnings:
There were 11 warnings (use warnings() to see them)
> warnings()
Warning messages:
1: In dpois(., 0:max_goals, away_goals_avg) : non-integer x = 0.062689
2: In dpois(., 0:max_goals, away_goals_avg) : non-integer x = 0.173621
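The warnings point at the last two lines of simulate_game(): piping one dpois() result into another passes the first probability vector as the x argument of the second call, which is why dpois complains about non-integer x. In the Sheehan model being adapted, the two vectors should instead be combined with an outer product to form the score matrix. A sketch with hypothetical expected-goal values standing in for the two predict() results:

```r
max_goals <- 10
home_exp <- 3.1  # hypothetical expected home goals (would come from predict())
away_exp <- 2.4  # hypothetical expected away goals (would come from predict())

# (max_goals+1) x (max_goals+1) matrix of P(home scores i, away scores j),
# assuming independent Poisson goal counts
score_matrix <- outer(dpois(0:max_goals, home_exp),
                      dpois(0:max_goals, away_exp))

# e.g. the home-win probability is the mass below the diagonal
p_home_win <- sum(score_matrix[lower.tri(score_matrix)])
```

So inside simulate_game(), replacing the final pipe with `dpois(0:max_goals, home_goals) %o% dpois(0:max_goals, away_goals)` should give the intended matrix (note the function should also use its stat_model argument rather than the global model_one).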
I'm struggling to understand exactly how to compute a deflation factor for wages in a panel based on inflation.
I've made the R example below to help illustrate the issue.
In Wooldridge (2009:452) Introductory Econometrics, 5th ed., he creates a deflation factor by dividing 107.6 by 65.2, i.e. 107.6/65.2 ≈ 1.65, but I can't figure out how to apply this to my own panel data. Wooldridge only mentions the deflation factor in passing.
Say I have a mini panel with two people, Jane and Tom, starting in 2009 and 2006 respectively and running until 2015, with their yearly wage,
# install.packages(c("dplyr"), dependencies = TRUE)
library(dplyr)
set.seed(2)
tbl <- tibble(id = rep(c('Jane', 'Tom'), c(7, 10)),
yr = c(2009:2015, 2006:2015),
wg = c(rnorm(7, mean=5.1*10^4, sd=9), rnorm(10, 4*10^4, 12))
); tbl
#> A tibble: 17 x 3
#> id yr wg
#> <chr> <int> <dbl>
#> 1 Jane 2009 50991.93
#> 2 Jane 2010 51001.66
#> 3 Jane 2011 51014.29
#> 4 Jane 2012 50989.83
#> 5 Jane 2013 50999.28
#> 6 Jane 2014 51001.19
#> 7 Jane 2015 51006.37
#> 8 Tom 2006 39997.12
#> 9 Tom 2007 40023.81
#> 10 Tom 2008 39998.33
#> 11 Tom 2009 40005.01
#> 12 Tom 2010 40011.78
#> 13 Tom 2011 39995.29
#> 14 Tom 2012 39987.52
#> 15 Tom 2013 40021.39
#> 16 Tom 2014 39972.27
#> 17 Tom 2015 40010.54
I now get the consumer price index (CPI) (using this answer)
# install.packages(c("Quandl"), dependencies = TRUE)
CPI00to16 <- Quandl::Quandl("FRED/CPIAUCSL", collapse="annual",
start_date="2000-01-01", end_date="2016-01-01")
as_tibble(CPI00to16)
#> # A tibble: 17 x 2
#> Date Value
#> <date> <dbl>
#> 1 2016-12-31 238.106
#> 2 2015-12-31 237.846
#> 3 2014-12-31 236.290
#> 4 2013-12-31 234.723
#> 5 2012-12-31 231.221
#> 6 2011-12-31 227.223
#> 7 2010-12-31 220.472
#> 8 2009-12-31 217.347
#> 9 2008-12-31 211.398
#> 10 2007-12-31 211.445
#> 11 2006-12-31 203.100
#> 12 2005-12-31 198.100
#> 13 2004-12-31 191.700
#> 14 2003-12-31 185.500
#> 15 2002-12-31 181.800
#> 16 2001-12-31 177.400
#> 17 2000-12-31 174.600
My question is: how do I deflate Jane and Tom's wages, cf. Wooldridge (2009), selecting 2015 as the base year?
Update, following MrSmithGoesToWashington's comment below:
CPI00to16$yr <- as.numeric(format(CPI00to16$Date,'%Y'))
CPI00to16 <- mutate(CPI00to16, deflation_factor = CPI00to16[2,2]/Value)
df <- tbl %>% inner_join(as_tibble(CPI00to16[,3:4]), by = "yr")
df <- mutate(df, wg_defl = deflation_factor*wg, wg_diff = wg_defl-wg)
df
#> # A tibble: 17 x 6
#> id yr wg deflation_factor wg_defl wg_diff
#> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 Jane 2009 50991.93 1.094315 55801.21 4809.2844
#> 2 Jane 2010 51001.66 1.078804 55020.78 4019.1176
#> 3 Jane 2011 51014.29 1.046751 53399.28 2384.9910
#> 4 Jane 2012 50989.83 1.028652 52450.80 1460.9728
#> 5 Jane 2013 50999.28 1.013305 51677.83 678.5477
#> 6 Jane 2014 51001.19 1.006585 51337.04 335.8494
#> 7 Jane 2015 51006.37 1.000000 51006.37 0.0000
#> 8 Tom 2006 39997.12 1.171078 46839.76 6842.6394
#> 9 Tom 2007 40023.81 1.124860 45021.18 4997.3691
#> 10 Tom 2008 39998.33 1.125110 45002.53 5004.1909
#> 11 Tom 2009 40005.01 1.094315 43778.07 3773.0575
#> 12 Tom 2010 40011.78 1.078804 43164.86 3153.0747
#> 13 Tom 2011 39995.29 1.046751 41865.12 1869.8369
#> 14 Tom 2012 39987.52 1.028652 41133.26 1145.7322
#> 15 Tom 2013 40021.39 1.013305 40553.87 532.4863
#> 16 Tom 2014 39972.27 1.006585 40235.49 263.2225
#> 17 Tom 2015 40010.54 1.000000 40010.54 0.0000
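As a sanity check on the table above: the deflation factor for any year is just CPI(base year) / CPI(year), which you can verify by hand for 2014 using the CPI values from the Quandl output:

```r
cpi_2015 <- 237.846  # CPI values from the Quandl output above
cpi_2014 <- 236.290

deflator_2014 <- cpi_2015 / cpi_2014
round(deflator_2014, 6)
#> [1] 1.006585

# a 2014 wage expressed in 2015 dollars
51001.19 * deflator_2014
```

This reproduces the deflation_factor (1.006585) and wg_defl (about 51337.04) shown for Jane's 2014 row.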
I have a data set with 57 locations ("homes"). For each of those locations I have 10 other locations ("weather stations") ranked by nearness in miles. Then I have a column with the quadrant (1-4) each weather station falls in around the home.
I am trying to write code in R that takes the closest weather station and its quadrant, then the next closest station in a different quadrant, and the third closest in yet another quadrant, so that I have a triangle based on the closest stations.
loc station nearness quadrant
1 Abilene-KS SALINA MUNICIPAL AIRPORT 1 2
2 Abilene-KS MARSHALL ARMY AIRFIELD 2 1
3 Abilene-KS MULTI PURPOSE RANGE 3 1
4 Abilene-KS MANHATTAN REGIONAL AIRPORT 4 3
5 Abilene-KS MANHATTAN 6 SSW 5 1
6 Abilene-KS BLOSSER MUNICIPAL AIRPORT 6 4
7 Abilene-KS NEWTON-CITY-COUNTY AIRPORT 7 1
8 Abilene-KS EMPORIA MUNICIPAL AIRPORT 8 2
9 Abilene-KS HUTCHINSON MUNICIPAL ARPT 9 4
10 Abilene-KS COLONEL JAMES JABARA ARPT 10 3
11 Archbold-OH SALINA MUNICIPAL AIRPORT 1 2
12 Archbold-OH MARSHALL ARMY AIRFIELD 2 1
13 Archbold-OH MULTI PURPOSE RANGE 3 3
14 Archbold-OH MANHATTAN REGIONAL AIRPORT 4 1
15 Archbold-OH MANHATTAN 6 SSW 5 4
16 Archbold-OH BLOSSER MUNICIPAL AIRPORT 6 4
17 Archbold-OH NEWTON-CITY-COUNTY AIRPORT 7 2
18 Archbold-OH EMPORIA MUNICIPAL AIRPORT 8 1
19 Archbold-OH HUTCHINSON MUNICIPAL ARPT 9 3
20 Archbold-OH COLONEL JAMES JABARA ARPT 10 2
I wish I could say that I have some code to show what I have tried, but everything has gotten me nowhere, so I'm lost. Any ideas?
For these two locations, I would like a new data frame with:
loc station nearness quadrant
1 Abilene-KS SALINA MUNICIPAL AIRPORT 1 2
2 Abilene-KS MARSHALL ARMY AIRFIELD 2 1
3 Abilene-KS MANHATTAN REGIONAL AIRPORT 4 3
4 Archbold-OH SALINA MUNICIPAL AIRPORT 1 2
5 Archbold-OH MARSHALL ARMY AIRFIELD 2 1
6 Archbold-OH MULTI PURPOSE RANGE 3 3
Using dplyr, you could do:
library(dplyr)
df %>%
distinct(loc,quadrant,.keep_all=T) %>%
group_by(loc) %>%
top_n(-3,nearness)
This returns:
loc station nearness quadrant
<chr> <chr> <int> <int>
1 Abilene-KS SALINA MUNICIPAL AIRPORT 1 2
2 Abilene-KS MARSHALL ARMY AIRFIELD 2 1
3 Abilene-KS MANHATTAN REGIONAL AIRPORT 4 3
4 Archbold-OH SALINA MUNICIPAL AIRPORT 1 2
5 Archbold-OH MARSHALL ARMY AIRFIELD 2 1
6 Archbold-OH MULTI PURPOSE RANGE 3 3
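The same logic also works in base R if you'd rather avoid the dplyr dependency. Like distinct() above, the !duplicated() step assumes rows are already sorted by nearness within each loc (they are in the question's data). Rebuilding the two example locations:

```r
stations <- c("SALINA MUNICIPAL AIRPORT", "MARSHALL ARMY AIRFIELD",
              "MULTI PURPOSE RANGE", "MANHATTAN REGIONAL AIRPORT",
              "MANHATTAN 6 SSW", "BLOSSER MUNICIPAL AIRPORT",
              "NEWTON-CITY-COUNTY AIRPORT", "EMPORIA MUNICIPAL AIRPORT",
              "HUTCHINSON MUNICIPAL ARPT", "COLONEL JAMES JABARA ARPT")
df <- data.frame(
  loc      = rep(c("Abilene-KS", "Archbold-OH"), each = 10),
  station  = rep(stations, 2),
  nearness = rep(1:10, 2),
  quadrant = c(2, 1, 1, 3, 1, 4, 1, 2, 4, 3,   # Abilene-KS
               2, 1, 3, 1, 4, 4, 2, 1, 3, 2)   # Archbold-OH
)

# first (nearest) station in each quadrant per location...
first_q <- df[!duplicated(df[, c("loc", "quadrant")]), ]
# ...then the 3 nearest of those within each location
result  <- do.call(rbind, lapply(split(first_q, first_q$loc),
                                 function(d) head(d[order(d$nearness), ], 3)))
```

This returns the same six rows as the dplyr version. (As a side note, in current dplyr top_n() is superseded; slice_min(nearness, n = 3) expresses the same step.)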