I have the following dataset in R:
address = c( "44 Ocean Road Atlanta Georgia", "882 4N Road River NY, NY 12345", "882 - River Road NY, ZIP 12345", "123 Fake Road Boston Drive Boston", "123 Fake - Rd Boston 56789", "3665 Apt 5 Moon Crs", "3665 Unit Moon Crescent", "NO ADDRESS PROVIDED", "31 Silver Way Road", "1800 Orleans St, Baltimore, MD 21287, United States",
"1799 Orlans Street, Maryland , USA")
name = c("Pancake House of America" ,"ABC Center Building", "Cent. Bldg ABC", "BD Home 25 New", "Boarding Direct 25", "Pine Recreational Center", "Pine Rec. cntR", "Boston Swimming Complex", "boston gym center", "mas hospital" , "Massachusetts Hospital" )
blocking_var = c(1, 1,1,1, 1, 2,2,2,2,3,3)
my_data = data.frame(address, name, blocking_var)
The data looks something like this:
> my_data
address name blocking_var
1 44 Ocean Road Atlanta Georgia Pancake House of America 1
2 882 4N Road River NY, NY 12345 ABC Center Building 1
3 882 - River Road NY, ZIP 12345 Cent. Bldg ABC 1
4 123 Fake Road Boston Drive Boston BD Home 25 New 1
5 123 Fake - Rd Boston 56789 Boarding Direct 25 1
6 3665 Apt 5 Moon Crs Pine Recreational Center 2
7 3665 Unit Moon Crescent Pine Rec. cntR 2
8 NO ADDRESS PROVIDED Boston Swimming Complex 2
9 31 Silver Way Road boston gym center 2
10 1800 Orleans St, Baltimore, MD 21287, United States mas hospital 3
11 1799 Orlans Street, Maryland , USA Massachusetts Hospital 3
I am trying to follow this R tutorial (https://cran.r-project.org/web/packages/RecordLinkage/vignettes/WeightBased.pdf) and learn how to remove duplicates based on fuzzy conditions. The goal (within each "block") is to keep all unique records - and for fuzzy duplicates, only keep one occurrence of the duplicate.
I tried the following code:
library(RecordLinkage)
pairs=compare.dedup(my_data, blockfld=3)
But when I inspect the results, everything is NA. Given these results, I think I am doing something wrong, and there does not seem to be any point in continuing until this error is resolved.
Can someone please show me how I can resolve this problem and continue on with the tutorial?
In the end, I am looking for something like this:
address name blocking_var
1 44 Ocean Road Atlanta Georgia Pancake House of America 1
2 882 4N Road River NY, NY 12345 ABC Center Building 1
4 123 Fake Road Boston Drive Boston BD Home 25 New 1
6 3665 Apt 5 Moon Crs Pine Recreational Center 2
9 31 Silver Way Road boston gym center 2
10 1800 Orleans St, Baltimore, MD 21287, United States mas hospital 3
Thank you!
You forgot to enable string comparison on the columns (the strcmp parameter):
address = c(
"44 Ocean Road Atlanta Georgia", "882 4N Road River NY, NY 12345", "882 - River Road NY, ZIP 12345", "123 Fake Road Boston Drive Boston", "123 Fake - Rd Boston 56789", "3665 Apt 5 Moon Crs", "3665 Unit Moon Crescent", "NO ADDRESS PROVIDED", "31 Silver Way Road", "1800 Orleans St, Baltimore, MD 21287, United States",
"1799 Orlans Street, Maryland , USA")
name = c("Pancake House of America" ,"ABC Center Building", "Cent. Bldg ABC", "BD Home 25 New", "Boarding Direct 25", "Pine Recreational Center", "Pine Rec. cntR", "Boston Swimming Complex", "boston gym center", "mas hospital" , "Massachusetts Hospital" )
blocking_var = c(1, 1,1,1, 1, 2,2,2,2,3,3)
my_data = data.frame(address, name, blocking_var)
library(RecordLinkage)
pairs <- compare.dedup(my_data, blockfld=3, strcmp = c("address", "name"))
pairs
#> $data
#> address name
#> 1 44 Ocean Road Atlanta Georgia Pancake House of America
#> 2 882 4N Road River NY, NY 12345 ABC Center Building
#> 3 882 - River Road NY, ZIP 12345 Cent. Bldg ABC
#> 4 123 Fake Road Boston Drive Boston BD Home 25 New
#> 5 123 Fake - Rd Boston 56789 Boarding Direct 25
#> 6 3665 Apt 5 Moon Crs Pine Recreational Center
#> 7 3665 Unit Moon Crescent Pine Rec. cntR
#> 8 NO ADDRESS PROVIDED Boston Swimming Complex
#> 9 31 Silver Way Road boston gym center
#> 10 1800 Orleans St, Baltimore, MD 21287, United States mas hospital
#> 11 1799 Orlans Street, Maryland , USA Massachusetts Hospital
#> blocking_var
#> 1 1
#> 2 1
#> 3 1
#> 4 1
#> 5 1
#> 6 2
#> 7 2
#> 8 2
#> 9 2
#> 10 3
#> 11 3
#>
#> $pairs
#> id1 id2 address name blocking_var is_match
#> 1 1 2 0.4657088 0.5014620 1 NA
#> 2 1 3 0.4256705 0.4551587 1 NA
#> 3 1 4 0.5924184 0.4543651 1 NA
#> 4 1 5 0.5139994 0.4768519 1 NA
#> 5 2 3 0.9082051 0.5802005 1 NA
#> 6 2 4 0.5112554 0.4734336 1 NA
#> 7 2 5 0.5094017 0.5467836 1 NA
#> 8 3 4 0.4767677 0.4404762 1 NA
#> 9 3 5 0.5418803 0.4761905 1 NA
#> 10 4 5 0.8550583 0.6672619 1 NA
#> 11 6 7 0.8749962 0.8306277 1 NA
#> 12 6 8 0.4385965 0.5243193 1 NA
#> 13 6 9 0.5622807 0.5502822 1 NA
#> 14 7 8 0.3974066 0.5075914 1 NA
#> 15 7 9 0.5626812 0.5896359 1 NA
#> 16 8 9 0.3942495 0.6478338 1 NA
#> 17 10 11 0.6939076 0.6843434 1 NA
#>
#> $frequencies
#> address name blocking_var
#> 0.09090909 0.09090909 0.33333333
#>
#> $type
#> [1] "deduplication"
#>
#> attr(,"class")
#> [1] "RecLinkData"
From there it continues like this, using e.g. the EpiLink algorithm:
# Compute EpiLink weights
pairs_w <- epiWeights(pairs)
# Explore the pairs and their weight to find a good cutoff
getPairs(pairs_w, min.weight=0.6, max.weight=0.8)
#> id address
#> 1 2 882 4N Road River NY, NY 12345
#> 2 3 882 - River Road NY, ZIP 12345
#> 3
#> 4 10 1800 Orleans St, Baltimore, MD 21287, United States
#> 5 11 1799 Orlans Street, Maryland , USA
#> 6
#> 7 7 3665 Unit Moon Crescent
#> 8 9 31 Silver Way Road
#> 9
#> 10 6 3665 Apt 5 Moon Crs
#> 11 9 31 Silver Way Road
#> 12
#> 13 2 882 4N Road River NY, NY 12345
#> 14 5 123 Fake - Rd Boston 56789
#> 15
#> 16 1 44 Ocean Road Atlanta Georgia
#> 17 4 123 Fake Road Boston Drive Boston
#> 18
#> 19 8 NO ADDRESS PROVIDED
#> 20 9 31 Silver Way Road
#> 21
#> 22 3 882 - River Road NY, ZIP 12345
#> 23 5 123 Fake - Rd Boston 56789
#> 24
#> name blocking_var Weight
#> 1 ABC Center Building 1
#> 2 Cent. Bldg ABC 1 0.7916856
#> 3
#> 4 mas hospital 3
#> 5 Massachusetts Hospital 3 0.7468321
#> 6
#> 7 Pine Rec. cntR 2
#> 8 boston gym center 2 0.6548348
#> 9
#> 10 Pine Recreational Center 2
#> 11 boston gym center 2 0.6386475
#> 12
#> 13 ABC Center Building 1
#> 14 Boarding Direct 25 1 0.6156913
#> 15
#> 16 Pancake House of America 1
#> 17 BD Home 25 New 1 0.6118630
#> 18
#> 19 Boston Swimming Complex 2
#> 20 boston gym center 2 0.6099491
#> 21
#> 22 Cent. Bldg ABC 1
#> 23 Boarding Direct 25 1 0.6001716
#> 24
I chose > 0.7 to classify a pair as a link and < 0.6 to classify it as a non-link.
Pairs in between are labelled as "possible".
pairs_class <- epiClassify(pairs_w, threshold.upper = 0.7, threshold.lower = 0.6)
summary(pairs_class)
#>
#> Deduplication Data Set
#>
#> 11 records
#> 17 record pairs
#>
#> 0 matches
#> 0 non-matches
#> 17 pairs with unknown status
#>
#>
#> Weight distribution:
#>
#> [0.5,0.55] (0.55,0.6] (0.6,0.65] (0.65,0.7] (0.7,0.75] (0.75,0.8] (0.8,0.85]
#> 1 6 5 1 1 1 1
#> (0.85,0.9]
#> 1
#>
#> 4 links detected
#> 6 possible links detected
#> 7 non-links detected
#>
#> Classification table:
#>
#> classification
#> true status N P L
#> <NA> 7 6 4
And the results:
# detected links, possible matches, non-links
getPairs(pairs_class, show = "links")
#> id address
#> 1 6 3665 Apt 5 Moon Crs
#> 2 7 3665 Unit Moon Crescent
#> 3
#> 4 4 123 Fake Road Boston Drive Boston
#> 5 5 123 Fake - Rd Boston 56789
#> 6
#> 7 2 882 4N Road River NY, NY 12345
#> 8 3 882 - River Road NY, ZIP 12345
#> 9
#> 10 10 1800 Orleans St, Baltimore, MD 21287, United States
#> 11 11 1799 Orlans Street, Maryland , USA
#> 12
#> name blocking_var Weight
#> 1 Pine Recreational Center 2
#> 2 Pine Rec. cntR 2 0.8801340
#> 3
#> 4 BD Home 25 New 1
#> 5 Boarding Direct 25 1 0.8054952
#> 6
#> 7 ABC Center Building 1
#> 8 Cent. Bldg ABC 1 0.7916856
#> 9
#> 10 mas hospital 3
#> 11 Massachusetts Hospital 3 0.7468321
#> 12
getPairs(pairs_class, show = "possible")
#> id address name blocking_var
#> 1 7 3665 Unit Moon Crescent Pine Rec. cntR 2
#> 2 9 31 Silver Way Road boston gym center 2
#> 3
#> 4 6 3665 Apt 5 Moon Crs Pine Recreational Center 2
#> 5 9 31 Silver Way Road boston gym center 2
#> 6
#> 7 2 882 4N Road River NY, NY 12345 ABC Center Building 1
#> 8 5 123 Fake - Rd Boston 56789 Boarding Direct 25 1
#> 9
#> 10 1 44 Ocean Road Atlanta Georgia Pancake House of America 1
#> 11 4 123 Fake Road Boston Drive Boston BD Home 25 New 1
#> 12
#> 13 8 NO ADDRESS PROVIDED Boston Swimming Complex 2
#> 14 9 31 Silver Way Road boston gym center 2
#> 15
#> 16 3 882 - River Road NY, ZIP 12345 Cent. Bldg ABC 1
#> 17 5 123 Fake - Rd Boston 56789 Boarding Direct 25 1
#> 18
#> Weight
#> 1
#> 2 0.6548348
#> 3
#> 4
#> 5 0.6386475
#> 6
#> 7
#> 8 0.6156913
#> 9
#> 10
#> 11 0.6118630
#> 12
#> 13
#> 14 0.6099491
#> 15
#> 16
#> 17 0.6001716
#> 18
getPairs(pairs_class, show = "nonlinks")
#> id address name blocking_var
#> 1 1 44 Ocean Road Atlanta Georgia Pancake House of America 1
#> 2 5 123 Fake - Rd Boston 56789 Boarding Direct 25 1
#> 3
#> 4 2 882 4N Road River NY, NY 12345 ABC Center Building 1
#> 5 4 123 Fake Road Boston Drive Boston BD Home 25 New 1
#> 6
#> 7 1 44 Ocean Road Atlanta Georgia Pancake House of America 1
#> 8 2 882 4N Road River NY, NY 12345 ABC Center Building 1
#> 9
#> 10 6 3665 Apt 5 Moon Crs Pine Recreational Center 2
#> 11 8 NO ADDRESS PROVIDED Boston Swimming Complex 2
#> 12
#> 13 3 882 - River Road NY, ZIP 12345 Cent. Bldg ABC 1
#> 14 4 123 Fake Road Boston Drive Boston BD Home 25 New 1
#> 15
#> 16 7 3665 Unit Moon Crescent Pine Rec. cntR 2
#> 17 8 NO ADDRESS PROVIDED Boston Swimming Complex 2
#> 18
#> 19 1 44 Ocean Road Atlanta Georgia Pancake House of America 1
#> 20 3 882 - River Road NY, ZIP 12345 Cent. Bldg ABC 1
#> 21
#> Weight
#> 1
#> 2 0.5890881
#> 3
#> 4
#> 5 0.5865789
#> 6
#> 7
#> 8 0.5794458
#> 9
#> 10
#> 11 0.5777132
#> 12
#> 13
#> 14 0.5591162
#> 15
#> 16
#> 17 0.5541298
#> 18
#> 19
#> 20 0.5442886
#> 21
Created on 2022-11-17 with reprex v2.0.2
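Finally, if you want a deduplicated data frame like the one shown in the question, one minimal sketch is to drop, for every pair classified as a link, the record with the higher id (this assumes you always want to keep the lower-numbered record of each linked pair):
# the classification is stored in pairs_class$prediction (levels "N", "P", "L")
links <- pairs_class$pairs[pairs_class$prediction == "L", c("id1", "id2")]
my_data_dedup <- my_data[-unique(links$id2), ]
my_data_dedup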
I'm working with a Twitter dataset I got with rtweet. I created a state variable based on the coordinates (when available).
My output so far is this:
> summary(rt1$state)
alabama arizona arkansas california colorado connecticut
3 6 2 104 5 1
delaware district of columbia florida georgia idaho illinois
1 0 17 7 0 12
indiana iowa kansas kentucky louisiana maine
4 1 2 3 2 1
maryland massachusetts michigan minnesota mississippi missouri
1 2 9 6 0 2
montana nebraska nevada new hampshire new jersey new mexico
0 3 5 1 4 7
new york north carolina north dakota ohio oklahoma oregon
25 8 1 3 2 4
pennsylvania rhode island south carolina south dakota tennessee texas
22 0 2 1 3 35
utah vermont virginia washington west virginia wisconsin
2 1 3 5 0 2
wyoming NA's
1 17669
Can you please advise on how I can create an independent dataset from the output above, so that I have two columns (state and n)?
Thanks!
We can wrap the OP's code with stack to create a two-column data.frame:
out <- stack(summary(rt1$state))[2:1]
names(out) <- c("state", "n")
Or another option in base R is
as.data.frame(table(rt1$state))
A reproducible example
data(iris)
out <- stack(summary(iris$Species))[2:1]
Or with table
as.data.frame(table(iris$Species))
Or enframe from tibble
library(tibble)
library(tidyr)
enframe(summary(rt1$state)) %>%
unnest(c(value))
Or maybe you can work directly on your rt1 dataframe:
dplyr::count(rt1, state)
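Note that dplyr::count() already returns the two columns you asked for (state and n). If you go with the base R table() route instead, the columns come back as Var1 and Freq, so you may want to rename them, for example:
out <- as.data.frame(table(rt1$state))
names(out) <- c("state", "n")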
I want to do an extraction from raster data and add the information to the polygons, but I get this warning and the extraction contains only null values:
Warning message:
In .local(x, y, ...) :
cannot return a sp object because the data length varies between polygons
What could be the problem? I did the same extraction last week with the same information and it was working fine. The call is:
expop <- extract(rasterdata, floods1985, small=TRUE, fun=sum, na.rm=TRUE, df=FALSE, nl=1, sp=TRUE)
The floods1985 data looks like this:
head(floods1985)
ID AREA CENTRIODX CENTRIODY DFONUMBER GLIDE__ LINKS OTHER NATIONS
0 1 92620 5.230 35.814 1 <NA> Algeria <NA> <NA>
1 2 678500 -45.349 -18.711 2 <NA> Brazil <NA> <NA>
2 3 12850 122.974 10.021 3 <NA> Philippines <NA> <NA>
3 4 16540 124.606 1.015 4 <NA> Indonesia <NA> <NA>
4 5 20080 32.349 -25.869 5 <NA> Mozambique <NA> <NA>
5 6 1040 43.360 -11.652 6 <NA> Comoros islas <NA> <NA>
X_AFFECTED
0 <NA>
1 <NA>
2 <NA>
3 <NA>
4 <NA>
5 <NA>
AND_RIVERS
0 Northeastern
1 States: Rio de Janeiro, Minas Gerais a Espirito Santo
2 Towns: Tanjay a Pamplona
3 Region: Northern Sulawesi; Towns: Gorontalo Regency
4 Provinces: Natal, Maputo; Rivers: Nkomati, Omati, Maputo, Umbeluzi, Incomati, Limpopo, Pungue, Buzi a Zambezi; Town: Ressano Garcia
5 Isla of Anjouan; Villages: Hassimpao, Marahare, Vouani
RIVERS BEGAN ENDED DAYS DEAD DISPLACED X_USD_ MAIN_CAUSE
0 <NA> 1985/01/01 1985/01/05 5 26 3000 <NA> Heavy rain
1 <NA> 1985/01/15 1985/02/02 19 229 80000 2000000000 Heavy rain
2 <NA> 1985/01/20 1985/01/21 2 43 444 <NA> Brief torrential rain
3 <NA> 1985/02/04 1985/02/18 15 21 300 <NA> Brief torrential rain
4 <NA> 1985/02/09 1985/02/11 3 19 <NA> 3000000 Heavy rain
5 <NA> 1985/02/16 1985/02/28 13 2 35000 5600000 Tropical cyclone
SEVERITY__ SQ_KM X_M___
0 1.0 92620 5.665675
1 1.5 678500 7.286395
2 1.0 12850 4.409933
3 1.0 16540 5.394627
4 1.5 20080 4.955976
5 1.0 1040 4.130977
I have a function that scrapes a table from a URL, which I then apply over a list of URLs:
getscore <- function(www0) {
  require(rvest)
  require(dplyr)
  www <- html(www0)
  boxscore <- www %>% html_table(fill = TRUE) %>% .[[1]]
  names(boxscore)[3] <- "VG"
  names(boxscore)[5] <- "HG"
  names(boxscore)[6] <- "Type"
  return(boxscore)
}
Working example data:
www_list <- c("http://www.hockey-reference.com/boxscores/2014/12/20/",
"http://www.hockey-reference.com/boxscores/2014/12/21/",
"http://www.hockey-reference.com/boxscores/2014/12/22/")
nhl14_15 <- bind_rows(lapply(www_list, getscore))
However, URLs without games played will break my function:
www_list <- c("http://www.hockey-reference.com/boxscores/2014/12/22/",
"http://www.hockey-reference.com/boxscores/2014/12/23/",
"http://www.hockey-reference.com/boxscores/2014/12/24/",
"http://www.hockey-reference.com/boxscores/2014/12/25/")
nhl14_15 <- bind_rows(lapply(www_list, getscore))
How might I build error/exception handling into my function to skip the URLs that break? The code above should be reproducible.
The table you obtain when there are no games has an entirely different structure. You could check whether colnames(boxscore) are as expected. As an example, here is an adaptation of your function that checks whether the Visitor column is present.
getscore <- function(www0) {
  require(rvest)
  require(dplyr)
  www <- html(www0)
  boxscore <- www %>% html_table(fill = TRUE) %>% .[[1]]
  if ("Visitor" %in% colnames(boxscore)) {
    names(boxscore)[3] <- "VG"
    names(boxscore)[5] <- "HG"
    names(boxscore)[6] <- "Type"
    return(boxscore)
  }
}
With this function, your example does not break:
www_list <- c("http://www.hockey-reference.com/boxscores/2014/12/22/",
"http://www.hockey-reference.com/boxscores/2014/12/23/",
"http://www.hockey-reference.com/boxscores/2014/12/24/",
"http://www.hockey-reference.com/boxscores/2014/12/25/")
nhl14_15 <- bind_rows(lapply(www_list, getscore))
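If you would rather build explicit error/exception handling into the function, as the question asks, a minimal sketch with tryCatch is shown below; it assumes that a failing page surfaces as an error inside getscore, and it relies on bind_rows() dropping NULL list elements:
getscore_safe <- function(www0) {
  tryCatch(
    getscore(www0),            # the original scraper
    error = function(e) NULL   # on error, return NULL so bind_rows() skips this URL
  )
}
nhl14_15 <- bind_rows(lapply(www_list, getscore_safe))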
A nice approach here is to use rbindlist from the data.table package (which lets you pass fill = TRUE), so that you can bind everything, even the pages for which bind_rows does not work. You can then filter out rows with a missing Date (these come from the pages that break bind_rows) and keep the six columns that, I assume, you are looking for in the valid data.
library(data.table) # development version 1.9.5
www_list <- c("http://www.hockey-reference.com/boxscores/2014/12/20/",
"http://www.hockey-reference.com/boxscores/2014/12/21/",
"http://www.hockey-reference.com/boxscores/2014/12/22/",
"http://www.hockey-reference.com/boxscores/2014/12/24/") # not working
resdt <- rbindlist(
  lapply(www_list, function(www0) {
    message("web is ", www0)  # comment this out if you don't want the message to appear
    getscore(www0)
  }),
  fill = TRUE)
resdt[!is.na(Date), 1:6, with = FALSE]  # the first 6 columns are the valid data
Date Visitor VG Home HG Type
1: 2014-12-20 Colorado Avalanche 5 Buffalo Sabres 1
2: 2014-12-20 New York Rangers 3 Carolina Hurricanes 2 SO
3: 2014-12-20 Chicago Blackhawks 2 Columbus Blue Jackets 3 SO
4: 2014-12-20 Arizona Coyotes 2 Los Angeles Kings 4
5: 2014-12-20 Nashville Predators 6 Minnesota Wild 5 OT
6: 2014-12-20 Ottawa Senators 1 Montreal Canadiens 4
7: 2014-12-20 Washington Capitals 4 New Jersey Devils 0
8: 2014-12-20 Tampa Bay Lightning 1 New York Islanders 3
9: 2014-12-20 Florida Panthers 1 Pittsburgh Penguins 3
10: 2014-12-20 St. Louis Blues 2 San Jose Sharks 3 OT
11: 2014-12-20 Philadelphia Flyers 7 Toronto Maple Leafs 4
12: 2014-12-20 Calgary Flames 2 Vancouver Canucks 3 OT
13: 2014-12-21 Buffalo Sabres 3 Boston Bruins 4 OT
14: 2014-12-21 Toronto Maple Leafs 0 Chicago Blackhawks 4
15: 2014-12-21 Colorado Avalanche 2 Detroit Red Wings 1 SO
16: 2014-12-21 Dallas Stars 6 Edmonton Oilers 5 SO
17: 2014-12-21 Carolina Hurricanes 0 New York Rangers 1
18: 2014-12-21 Philadelphia Flyers 4 Winnipeg Jets 3 OT
19: 2014-12-22 San Jose Sharks 2 Anaheim Ducks 3 OT
20: 2014-12-22 Nashville Predators 5 Columbus Blue Jackets 1
21: 2014-12-22 Pittsburgh Penguins 3 Florida Panthers 4 SO
22: 2014-12-22 Calgary Flames 4 Los Angeles Kings 3 OT
23: 2014-12-22 Arizona Coyotes 1 Vancouver Canucks 7
24: 2014-12-22 Ottawa Senators 1 Washington Capitals 2
Date Visitor VG Home HG Type
If you are not familiar with data.table, you can just use it for rbindlist, convert the data.table back to a data.frame, and carry on with the usual data.frame operations. But you should really learn data.table, because it is very fast and efficient on big data.
resdf <- as.data.frame(resdt)
with(resdf, resdf[!is.na(Date), 1:6])
Date Visitor VG Home HG Type
1 2014-12-20 Colorado Avalanche 5 Buffalo Sabres 1
2 2014-12-20 New York Rangers 3 Carolina Hurricanes 2 SO
3 2014-12-20 Chicago Blackhawks 2 Columbus Blue Jackets 3 SO
4 2014-12-20 Arizona Coyotes 2 Los Angeles Kings 4
5 2014-12-20 Nashville Predators 6 Minnesota Wild 5 OT
6 2014-12-20 Ottawa Senators 1 Montreal Canadiens 4
7 2014-12-20 Washington Capitals 4 New Jersey Devils 0
8 2014-12-20 Tampa Bay Lightning 1 New York Islanders 3
9 2014-12-20 Florida Panthers 1 Pittsburgh Penguins 3
10 2014-12-20 St. Louis Blues 2 San Jose Sharks 3 OT
11 2014-12-20 Philadelphia Flyers 7 Toronto Maple Leafs 4
12 2014-12-20 Calgary Flames 2 Vancouver Canucks 3 OT
13 2014-12-21 Buffalo Sabres 3 Boston Bruins 4 OT
14 2014-12-21 Toronto Maple Leafs 0 Chicago Blackhawks 4
15 2014-12-21 Colorado Avalanche 2 Detroit Red Wings 1 SO
16 2014-12-21 Dallas Stars 6 Edmonton Oilers 5 SO
17 2014-12-21 Carolina Hurricanes 0 New York Rangers 1
18 2014-12-21 Philadelphia Flyers 4 Winnipeg Jets 3 OT
19 2014-12-22 San Jose Sharks 2 Anaheim Ducks 3 OT
20 2014-12-22 Nashville Predators 5 Columbus Blue Jackets 1
21 2014-12-22 Pittsburgh Penguins 3 Florida Panthers 4 SO
22 2014-12-22 Calgary Flames 4 Los Angeles Kings 3 OT
23 2014-12-22 Arizona Coyotes 1 Vancouver Canucks 7
24 2014-12-22 Ottawa Senators 1 Washington Capitals 2
Say that I have two dataframes. One lists the names of soccer players, the teams they have played for, and the number of goals they have scored on each team. The other contains the soccer players' names and their ages. How do I add a "names_age" column to the goals dataframe that holds the ages of the players in the first column, "names", not those in "teammates_names"? And how do I add an additional column with the teammates' ages? In short, I'd like two age columns: one for the first set of players and one for the second set.
> AGE_DF
names age
1 Sam 20
2 Jon 21
3 Adam 22
4 Jason 23
5 Jones 24
6 Jermaine 25
> GOALS_DF
names goals team teammates_names teammates_goals teammates_team
1 Sam 1 USA Jason 1 HOLLAND
2 Sam 2 ENGLAND Jason 2 PORTUGAL
3 Sam 3 BRAZIL Jason 3 GHANA
4 Sam 4 GERMANY Jason 4 COLOMBIA
5 Sam 5 ARGENTINA Jason 5 CANADA
6 Jon 1 USA Jones 1 HOLLAND
7 Jon 2 ENGLAND Jones 2 PORTUGAL
8 Jon 3 BRAZIL Jones 3 GHANA
9 Jon 4 GERMANY Jones 4 COLOMBIA
10 Jon 5 ARGENTINA Jones 5 CANADA
11 Adam 1 USA Jermaine 1 HOLLAND
12 Adam 1 ENGLAND Jermaine 1 PORTUGAL
13 Adam 4 BRAZIL Jermaine 4 GHANA
14 Adam 3 GERMANY Jermaine 3 COLOMBIA
15 Adam 2 ARGENTINA Jermaine 2 CANADA
What I have tried: I've successfully got this to work using a for loop. The actual data that I am working with have thousands of rows, and this takes a long time. I would like a vectorized approach but I'm having trouble coming up with a way to do that.
Try merge or match.
Here's merge (which is likely to screw up your row ordering and can sometimes be slow):
merge(AGE_DF, GOALS_DF, all = TRUE)
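To get both of the age columns the question asks for with merge, a rough sketch (the new column names names_age and teammates_age are just illustrative) is to merge twice and rename:
tmp <- merge(GOALS_DF, AGE_DF, by = "names", all.x = TRUE)
names(tmp)[names(tmp) == "age"] <- "names_age"
out <- merge(tmp, AGE_DF, by.x = "teammates_names", by.y = "names", all.x = TRUE)
names(out)[names(out) == "age"] <- "teammates_age"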
Here's match, which makes use of basic indexing and subsetting. Assign the result to a new column, of course.
AGE_DF$age[match(GOALS_DF$names, AGE_DF$names)]
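For example, to add both age columns with match (again, the new column names are just illustrative):
GOALS_DF$names_age <- AGE_DF$age[match(GOALS_DF$names, AGE_DF$names)]
GOALS_DF$teammates_age <- AGE_DF$age[match(GOALS_DF$teammates_names, AGE_DF$names)]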
Here's another option to consider: Convert your dataset into a long format first, and then do the merge. Here, I've done it with melt and "data.table":
library(reshape2)
library(data.table)
setkey(melt(as.data.table(GOALS_DF, keep.rownames = TRUE),
            measure.vars = c("names", "teammates_names"),
            value.name = "names"), names)[as.data.table(AGE_DF)]
# rn goals team teammates_goals teammates_team variable names age
# 1: 1 1 USA 1 HOLLAND names Sam 20
# 2: 2 2 ENGLAND 2 PORTUGAL names Sam 20
# 3: 3 3 BRAZIL 3 GHANA names Sam 20
# 4: 4 4 GERMANY 4 COLOMBIA names Sam 20
# 5: 5 5 ARGENTINA 5 CANADA names Sam 20
# 6: 6 1 USA 1 HOLLAND names Jon 21
## <<SNIP>>
# 28: 13 4 BRAZIL 4 GHANA teammates_names Jermaine 25
# 29: 14 3 GERMANY 3 COLOMBIA teammates_names Jermaine 25
# 30: 15 2 ARGENTINA 2 CANADA teammates_names Jermaine 25
# rn goals team teammates_goals teammates_team variable names age
I've added the rownames so that you can use dcast to get back to the wide format and retain the row ordering if it's important.
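For instance, assuming the joined long-format result above is stored in a data.table called long_dt (an assumption; the pipeline above prints it without assigning it), a rough sketch of casting back to wide would be:
wide <- data.table::dcast(long_dt, rn + goals + team + teammates_goals + teammates_team ~ variable,
                          value.var = c("names", "age"))
wide[order(as.numeric(rn))]  # restore the original row order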