Error/exception handling with bind_rows() and lapply() functions - r

I have a function that scrapes a table from a list of urls:
getscore <- function(www0) {
require(rvest)
require(dplyr)
www <- html(www0)
boxscore <- www %>% html_table(fill = TRUE) %>% .[[1]]
names(boxscore)[3] <- "VG"
names(boxscore)[5] <- "HG"
names(boxscore)[6] <- "Type"
return(boxscore)
}
Working example data:
www_list <- c("http://www.hockey-reference.com/boxscores/2014/12/20/",
"http://www.hockey-reference.com/boxscores/2014/12/21/",
"http://www.hockey-reference.com/boxscores/2014/12/22/")
nhl14_15 <- bind_rows(lapply(www_list, getscore))
However, urls without games played will break my function:
www_list <- c("http://www.hockey-reference.com/boxscores/2014/12/22/",
"http://www.hockey-reference.com/boxscores/2014/12/23/",
"http://www.hockey-reference.com/boxscores/2014/12/24/",
"http://www.hockey-reference.com/boxscores/2014/12/25/")
nhl14_15 <- bind_rows(lapply(www_list, getscore))
How might I build error/exception handling into my function to skip the urls that break?
Code should be reproducible...

The table you get when there are no games has an entirely different structure. You could check whether colnames(boxscore) are as expected. As an example, here is an adaptation of your function that checks whether the column Visitor is present; when it is not, the function returns NULL, which bind_rows() skips.
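getscore <- function(www0) {
  require(rvest)
  require(dplyr)
  # read_html() is the current name for html(), which rvest deprecated
  www <- read_html(www0)
  boxscore <- www %>% html_table(fill = TRUE) %>% .[[1]]
  # Only rename and return the table when it has the expected layout
  if ("Visitor" %in% colnames(boxscore)) {
    names(boxscore)[3] <- "VG"
    names(boxscore)[5] <- "HG"
    names(boxscore)[6] <- "Type"
    return(boxscore)
  }
  # Pages without games fall through to here; bind_rows() ignores NULL entries
  NULL
}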
With this function, your example does not break:
www_list <- c("http://www.hockey-reference.com/boxscores/2014/12/22/",
"http://www.hockey-reference.com/boxscores/2014/12/23/",
"http://www.hockey-reference.com/boxscores/2014/12/24/",
"http://www.hockey-reference.com/boxscores/2014/12/25/")
nhl14_15 <- bind_rows(lapply(www_list, getscore))
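If you would rather catch genuine errors (for example a page that fails to download) than test the column names, tryCatch() is the usual base R tool. Below is a minimal sketch of that idea. Note that fake_getscore() is a hypothetical stand-in for getscore(), so the example runs without network access, and the day numbers and goal values are made up.

```r
# Hypothetical stand-in for getscore(): fails for some inputs, as a scraper might
fake_getscore <- function(day) {
  if (day %in% c(24, 25)) stop("no games on day ", day)
  data.frame(day = day, VG = 1L, HG = 2L)
}

# Wrapper that turns an error into NULL (and reports which input was skipped)
safe_getscore <- function(day) {
  tryCatch(
    fake_getscore(day),
    error = function(e) {
      message("skipping ", day, ": ", conditionMessage(e))
      NULL
    }
  )
}

results <- lapply(22:25, safe_getscore)

# Drop the NULL entries and stack the remaining data frames
combined <- do.call(rbind, Filter(Negate(is.null), results))
print(combined)
```

With the real function you would wrap getscore(www0) in the same way and pass the resulting list to bind_rows(), which skips the NULL entries itself.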

A nice approach here is to use rbindlist from the data.table package, which accepts fill=TRUE and can therefore bind every result, including the ones on which bind_rows fails. You can then keep only the rows with a non-NA Date (the NA rows come from the pages on which bind_rows fails) and restrict the result to the first 6 columns, which I guess are the ones you want in valid data.
library(data.table) # development version 1.9.5
www_list <- c("http://www.hockey-reference.com/boxscores/2014/12/20/",
"http://www.hockey-reference.com/boxscores/2014/12/21/",
"http://www.hockey-reference.com/boxscores/2014/12/22/",
"http://www.hockey-reference.com/boxscores/2014/12/24/") # not working
resdt<-rbindlist(
lapply(
www_list, function(www0){
message ("web is ", www0) # comment out this if you don't want message to appear
getscore(www0)}),fill=TRUE)
resdt[!is.na(Date),1:6,with=FALSE] # 6 column is valid data
Date Visitor VG Home HG Type
1: 2014-12-20 Colorado Avalanche 5 Buffalo Sabres 1
2: 2014-12-20 New York Rangers 3 Carolina Hurricanes 2 SO
3: 2014-12-20 Chicago Blackhawks 2 Columbus Blue Jackets 3 SO
4: 2014-12-20 Arizona Coyotes 2 Los Angeles Kings 4
5: 2014-12-20 Nashville Predators 6 Minnesota Wild 5 OT
6: 2014-12-20 Ottawa Senators 1 Montreal Canadiens 4
7: 2014-12-20 Washington Capitals 4 New Jersey Devils 0
8: 2014-12-20 Tampa Bay Lightning 1 New York Islanders 3
9: 2014-12-20 Florida Panthers 1 Pittsburgh Penguins 3
10: 2014-12-20 St. Louis Blues 2 San Jose Sharks 3 OT
11: 2014-12-20 Philadelphia Flyers 7 Toronto Maple Leafs 4
12: 2014-12-20 Calgary Flames 2 Vancouver Canucks 3 OT
13: 2014-12-21 Buffalo Sabres 3 Boston Bruins 4 OT
14: 2014-12-21 Toronto Maple Leafs 0 Chicago Blackhawks 4
15: 2014-12-21 Colorado Avalanche 2 Detroit Red Wings 1 SO
16: 2014-12-21 Dallas Stars 6 Edmonton Oilers 5 SO
17: 2014-12-21 Carolina Hurricanes 0 New York Rangers 1
18: 2014-12-21 Philadelphia Flyers 4 Winnipeg Jets 3 OT
19: 2014-12-22 San Jose Sharks 2 Anaheim Ducks 3 OT
20: 2014-12-22 Nashville Predators 5 Columbus Blue Jackets 1
21: 2014-12-22 Pittsburgh Penguins 3 Florida Panthers 4 SO
22: 2014-12-22 Calgary Flames 4 Los Angeles Kings 3 OT
23: 2014-12-22 Arizona Coyotes 1 Vancouver Canucks 7
24: 2014-12-22 Ottawa Senators 1 Washington Capitals 2
Date Visitor VG Home HG Type
If you are not familiar with data.table, you can use it just for rbindlist, then convert the data.table back to a data.frame and carry on with the usual data.frame operations. That said, data.table is worth learning because it is very fast and efficient on big data.
resdf<-as.data.frame(resdt)
with(resdf,resdf[!is.na(Date),1:6])
Date Visitor VG Home HG Type
1 2014-12-20 Colorado Avalanche 5 Buffalo Sabres 1
2 2014-12-20 New York Rangers 3 Carolina Hurricanes 2 SO
3 2014-12-20 Chicago Blackhawks 2 Columbus Blue Jackets 3 SO
4 2014-12-20 Arizona Coyotes 2 Los Angeles Kings 4
5 2014-12-20 Nashville Predators 6 Minnesota Wild 5 OT
6 2014-12-20 Ottawa Senators 1 Montreal Canadiens 4
7 2014-12-20 Washington Capitals 4 New Jersey Devils 0
8 2014-12-20 Tampa Bay Lightning 1 New York Islanders 3
9 2014-12-20 Florida Panthers 1 Pittsburgh Penguins 3
10 2014-12-20 St. Louis Blues 2 San Jose Sharks 3 OT
11 2014-12-20 Philadelphia Flyers 7 Toronto Maple Leafs 4
12 2014-12-20 Calgary Flames 2 Vancouver Canucks 3 OT
13 2014-12-21 Buffalo Sabres 3 Boston Bruins 4 OT
14 2014-12-21 Toronto Maple Leafs 0 Chicago Blackhawks 4
15 2014-12-21 Colorado Avalanche 2 Detroit Red Wings 1 SO
16 2014-12-21 Dallas Stars 6 Edmonton Oilers 5 SO
17 2014-12-21 Carolina Hurricanes 0 New York Rangers 1
18 2014-12-21 Philadelphia Flyers 4 Winnipeg Jets 3 OT
19 2014-12-22 San Jose Sharks 2 Anaheim Ducks 3 OT
20 2014-12-22 Nashville Predators 5 Columbus Blue Jackets 1
21 2014-12-22 Pittsburgh Penguins 3 Florida Panthers 4 SO
22 2014-12-22 Calgary Flames 4 Los Angeles Kings 3 OT
23 2014-12-22 Arizona Coyotes 1 Vancouver Canucks 7
24 2014-12-22 Ottawa Senators 1 Washington Capitals 2

Related

Edit string value based on value in another column using r

I have data on women who married and sometimes changed surnames over the period 1990-1999. However, I do not always know the exact year the name change took place, only that the surname changed sometime between year x and year y. In the original data, the old surname has simply been crossed out and the new surname written next to it, which is indicated in the column "crossed_over". For example, Sarah Smith changed her name to Sarah Draper sometime in the period 1994-1999.
What I would like is for each woman to have a single surname for each year, like Liza Moore who changed her name to Liza Neville, preferably taking the midpoint of the period when assigning a surname, using the column "crossed_over". For example, Sarah Smith would become Sarah Draper in 1997 and Mary King would become Mary Fisher in 1997 or 1998.
Does anyone have a suggestion for how I can achieve this using the example below?
library(tidyverse)
id <- rep(1:4, each = 10)
year <- rep(1990:1999, 4)
first_name <- c(rep("molly", 10), rep("sarah", 10), rep("mary", 10), rep("liza", 10))
last_name <- c(rep("johnson", 10), rep("smith", 4), rep("smith draper", 6), rep("king", 5), rep("king fisher", 5),
rep("moore", 7), rep("neville", 3))
crossed_over <- c(rep(NA, 10), rep(NA, 4), rep("smith", 6), rep(NA, 5), rep("king", 5), rep(NA, 10))
df <- tibble(id, year, first_name, last_name, crossed_over)
Here is one approach. For those rows with crossed_over names, set the new_last_name to the crossed_over name for the first half of rows, and to the difference between crossed_over and last_name for the second half of rows.
library(tidyverse)
library(stringr)
df %>%
filter(!is.na(crossed_over)) %>%
group_by(across(c(-year))) %>%
mutate(new_last_name = ifelse(row_number() <= n()/2,
crossed_over,
str_trim(str_remove(last_name, crossed_over)))) %>%
ungroup() %>%
right_join(df) %>%
mutate(new_last_name = coalesce(new_last_name, last_name)) %>%
arrange(id, year)
Output
id year first_name last_name crossed_over new_last_name
<int> <int> <chr> <chr> <chr> <chr>
1 1 1990 molly johnson NA johnson
2 1 1991 molly johnson NA johnson
3 1 1992 molly johnson NA johnson
4 1 1993 molly johnson NA johnson
5 1 1994 molly johnson NA johnson
6 1 1995 molly johnson NA johnson
7 1 1996 molly johnson NA johnson
8 1 1997 molly johnson NA johnson
9 1 1998 molly johnson NA johnson
10 1 1999 molly johnson NA johnson
11 2 1990 sarah smith NA smith
12 2 1991 sarah smith NA smith
13 2 1992 sarah smith NA smith
14 2 1993 sarah smith NA smith
15 2 1994 sarah smith draper smith smith
16 2 1995 sarah smith draper smith smith
17 2 1996 sarah smith draper smith smith
18 2 1997 sarah smith draper smith draper
19 2 1998 sarah smith draper smith draper
20 2 1999 sarah smith draper smith draper
21 3 1990 mary king NA king
22 3 1991 mary king NA king
23 3 1992 mary king NA king
24 3 1993 mary king NA king
25 3 1994 mary king NA king
26 3 1995 mary king fisher king king
27 3 1996 mary king fisher king king
28 3 1997 mary king fisher king fisher
29 3 1998 mary king fisher king fisher
30 3 1999 mary king fisher king fisher
31 4 1990 liza moore NA moore
32 4 1991 liza moore NA moore
33 4 1992 liza moore NA moore
34 4 1993 liza moore NA moore
35 4 1994 liza moore NA moore
36 4 1995 liza moore NA moore
37 4 1996 liza moore NA moore
38 4 1997 liza neville NA neville
39 4 1998 liza neville NA neville
40 4 1999 liza neville NA neville

Summary output to independent dataset

I'm working with a Twitter dataset I got with rtweet. I created a state variable based on the coordinates (when available).
My output so far is this:
> summary(rt1$state)
alabama arizona arkansas california colorado connecticut
3 6 2 104 5 1
delaware district of columbia florida georgia idaho illinois
1 0 17 7 0 12
indiana iowa kansas kentucky louisiana maine
4 1 2 3 2 1
maryland massachusetts michigan minnesota mississippi missouri
1 2 9 6 0 2
montana nebraska nevada new hampshire new jersey new mexico
0 3 5 1 4 7
new york north carolina north dakota ohio oklahoma oregon
25 8 1 3 2 4
pennsylvania rhode island south carolina south dakota tennessee texas
22 0 2 1 3 35
utah vermont virginia washington west virginia wisconsin
2 1 3 5 0 2
wyoming NA's
1 17669
Can you please advise how I can create an independent dataset from the output above, so that I have 2 columns (state and n)?
Thanks
We can wrap with stack to create a two column data.frame from the OP's code
out <- stack(summary(rt1$state))[2:1]
names(out) <- c("state", "n")
Or another option in base R is
as.data.frame(table(rt1$state))
A reproducible example
data(iris)
out <- stack(summary(iris$Species))[2:1]
Or with table
as.data.frame(table(iris$Species))
Or enframe from tibble
library(tibble)
library(tidyr)
enframe(summary(rt1$state)) %>%
unnest(c(value))
Or maybe you can work directly on your rt1 dataframe:
dplyr::count(rt1, state)
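To get the column names state and n in one step with base R, as.data.frame() on a table accepts a responseName argument. Here is a small sketch on made-up data; the state vector below is a hypothetical stand-in for rt1$state.

```r
# Made-up factor standing in for rt1$state (NA = tweet without coordinates)
state <- factor(c("texas", "ohio", "texas", NA, "utah", "texas"))

# table() names the dimension after the variable, so the first column is
# called "state"; responseName sets the name of the count column
out <- as.data.frame(table(state), responseName = "n")
print(out)
```

Unlike summary(), table() leaves the NA's out by default, so the result contains only real states.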

How to manually enter a cell in a dataframe? [duplicate]

This is my dataframe:
county state cases deaths FIPS
Abbeville South Carolina 4 0 45001
Acadia Louisiana 9 1 22001
Accomack Virginia 3 0 51001
New York C New York 2 0 NA
Ada Idaho 113 2 16001
Adair Iowa 1 0 19001
I would like to manually put "55555" into the NA cell. My actual df is thousands of lines long, and the row where the NA appears changes from day to day, so I would like to fill it in based on the county. Is there a way to say df[df$county == "New York C",] <- df$FIPS = "55555" or something like that? I don't want to insert based on the column or row number because they change.
This will put 55555 into the NA cells within column FIPS where county is New York C
df$FIPS[is.na(df$FIPS) & df$county == "New York C"] <- 55555
Output
df
# county state cases deaths FIPS
# 1 Abbeville South Carolina 4 0 45001
# 2 Acadia Louisiana 9 1 22001
# 3 Accomack Virginia 3 0 51001
# 4 New York C New York 2 0 55555
# 5 Ada Idaho 113 2 16001
# 6 Adair Iowa 1 0 19001
# 7 New York C New York 1 0 18000
Data
df
# county state cases deaths FIPS
# 1 Abbeville South Carolina 4 0 45001
# 2 Acadia Louisiana 9 1 22001
# 3 Accomack Virginia 3 0 51001
# 4 New York C New York 2 0 NA
# 5 Ada Idaho 113 2 16001
# 6 Adair Iowa 1 0 19001
# 7 New York C New York 1 0 18000
You could use & (and) to replace the df$FIPS entries that meet the two desired conditions.
df$FIPS[is.na(df$FIPS) & df$state=="New York"]<-55555
If you want to change values based on multiple conditions, I'd go with dplyr::mutate().
library(dplyr)
df <- df %>%
mutate(FIPS = ifelse(is.na(FIPS) & county == "New York C", 55555, FIPS))

Building a prediction model with the dpois function in R

Hello! I am in the beginning stages of building (and learning!) how to build prediction models for sports, specifically using NHL statistics.
I have all the game outcomes of the NHL since 1990, and I want to use the number of goals to predict outcomes in future games (just based on goals, for now).
Below is an excerpt of my data set, but the full data set can be found in this Git link:
https://github.com/papelr/nhldatar/blob/master/nhldatar/data/NHL_outcomes.rda
Date Visitor GVisitor Home GHome Att.
1 1990-10-04 Philadelphia Flyers 1 Boston Bruins 4 <NA>
2 1990-10-04 Montreal Canadiens 3 Buffalo Sabres 3 <NA>
3 1990-10-04 Vancouver Canucks 2 Calgary Flames 3 <NA>
4 1990-10-04 New York Rangers 3 Chicago Blackhawks 4 <NA>
5 1990-10-04 Quebec Nordiques 3 Hartford Whalers 3 <NA>
6 1990-10-04 New York Islanders 1 Los Angeles Kings 4 <NA>
7 1990-10-04 St. Louis Blues 3 Minnesota North Stars 2 <NA>
8 1990-10-04 Detroit Red Wings 3 New Jersey Devils 3 <NA>
9 1990-10-04 Toronto Maple Leafs 1 Winnipeg Jets 7 <NA>
10 1990-10-05 Pittsburgh Penguins 7 Washington Capitals 4 <NA>
11 1990-10-06 Quebec Nordiques 1 Boston Bruins 7 <NA>
12 1990-10-06 Toronto Maple Leafs 1 Calgary Flames 4 <NA>
13 1990-10-06 Winnipeg Jets 3 Edmonton Oilers 3 <NA>
14 1990-10-06 New York Rangers 4 Hartford Whalers 5 <NA>
15 1990-10-06 Vancouver Canucks 6 Los Angeles Kings 3 <NA>
16 1990-10-06 New York Islanders 2 Minnesota North Stars 4 <NA>
17 1990-10-06 Buffalo Sabres 5 Montreal Canadiens 6 <NA>
18 1990-10-06 Philadelphia Flyers 1 New Jersey Devils 3 <NA>
19 1990-10-06 Chicago Blackhawks 5 St. Louis Blues 2 <NA>
20 1990-10-06 Detroit Red Wings 4 Washington Capitals 6 <NA>
21 1990-10-07 New York Islanders 4 Chicago Blackhawks 2 <NA>
22 1990-10-07 Toronto Maple Leafs 2 Edmonton Oilers 3 <NA>
23 1990-10-07 Detroit Red Wings 2 Philadelphia Flyers 7 <NA>
24 1990-10-07 New Jersey Devils 4 Pittsburgh Penguins 7 <NA>
25 1990-10-07 Boston Bruins 5 Quebec Nordiques 2 <NA>
26 1990-10-08 Hartford Whalers 3 Montreal Canadiens 5 <NA>
27 1990-10-08 Minnesota North Stars 3 New York Rangers 6 <NA>
28 1990-10-08 Calgary Flames 4 Winnipeg Jets 3 <NA>
29 1990-10-09 Minnesota North Stars 2 New Jersey Devils 5 <NA>
30 1990-10-09 Pittsburgh Penguins 3 St. Louis Blues 4 <NA>
31 1990-10-09 Los Angeles Kings 6 Vancouver Canucks 2 <NA>
32 1990-10-10 Calgary Flames 5 Detroit Red Wings 6 <NA>
33 1990-10-10 Buffalo Sabres 3 Hartford Whalers 4 <NA>
34 1990-10-10 Washington Capitals 2 New York Rangers 4 <NA>
35 1990-10-10 Quebec Nordiques 8 Toronto Maple Leafs 5 <NA>
36 1990-10-10 Boston Bruins 4 Winnipeg Jets 2 <NA>
37 1990-10-11 Pittsburgh Penguins 1 Chicago Blackhawks 4 <NA>
38 1990-10-11 Edmonton Oilers 5 Los Angeles Kings 5 <NA>
39 1990-10-11 Boston Bruins 3 Minnesota North Stars 3 <NA>
40 1990-10-11 New Jersey Devils 4 Philadelphia Flyers 7 <NA>
This is the prediction model that I have come up with so far, and I have failed to get the matrix that should come with my simulate match line below. Any help would be great.
# Using number of goals for prediction model
model_one <-
rbind(
data.frame(goals = outcomes$GHome,
team = outcomes$Home,
opponent = outcomes$Visitor,
home = 1),
data.frame(goals = outcomes$GVisitor,
team = outcomes$Visitor,
opponent = outcomes$Home,
home = 0)) %>%
glm(goals ~ home + team + opponent,
family = poisson (link = log), data = .)
summary(model_one)
# Probability function / matrix
simulate_game <- function(stat_model, homeTeam, awayTeam, max_goals =
10) {
home_goals <- predict(model_one,
data.frame(home = 1,
team = homeTeam,
opponent = awayTeam),
type ="response")
away_goals <- predict(model_one,
data.frame(home = 0,
team = awayTeam,
opponent = homeTeam),
type ="response")
dpois(0: max_goals, home_goals) %>%
dpois(0: max_goals, away_goals)
}
simulate_game(model_one, "Nashville Predators", "Chicago Blackhawks",
max_goals = 10)
I totally understand that a Poisson model isn't the best for sports predictions, but I am rebuilding a model I found for the EPL for learning/practice reasons, and adapting it to the NHL (from David Sheehan's model, https://dashee87.github.io/data%20science/football/r/predicting-football-results-with-statistical-modelling/).
Any tips would be great, because currently, this model returns a bunch of warnings:
There were 11 warnings (use warnings() to see them)
> warnings()
Warning messages:
1: In dpois(., 0:max_goals, away_goals_avg) : non-integer x = 0.062689
2: In dpois(., 0:max_goals, away_goals_avg) : non-integer x = 0.173621
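Judging from the code as posted, a likely cause of these warnings is the last line of simulate_game(): the %>% pipe passes the first dpois() result in as the x argument of the second dpois() call, so the home-team probabilities are treated as goal counts (hence "non-integer x"). The score matrix is the outer product of the two probability vectors, which base R writes as %o%. The sketch below uses made-up toy data in place of the outcomes data frame (team names A, B, C and the Poisson means are assumptions), and it also uses the stat_model argument instead of the global model_one.

```r
# Toy data standing in for `outcomes` (all values are made up)
set.seed(1)
teams <- c("A", "B", "C")
games <- expand.grid(Home = teams, Visitor = teams, stringsAsFactors = FALSE)
games <- games[games$Home != games$Visitor, ]
games <- games[rep(seq_len(nrow(games)), 20), ]
games$GHome <- rpois(nrow(games), 3)
games$GVisitor <- rpois(nrow(games), 2.5)

# Same long layout as in the question: one row per team per game
long <- rbind(
  data.frame(goals = games$GHome, team = games$Home,
             opponent = games$Visitor, home = 1),
  data.frame(goals = games$GVisitor, team = games$Visitor,
             opponent = games$Home, home = 0)
)
fit <- glm(goals ~ home + team + opponent,
           family = poisson(link = "log"), data = long)

simulate_game <- function(stat_model, homeTeam, awayTeam, max_goals = 10) {
  home_goals <- predict(stat_model,
                        data.frame(home = 1, team = homeTeam,
                                   opponent = awayTeam),
                        type = "response")
  away_goals <- predict(stat_model,
                        data.frame(home = 0, team = awayTeam,
                                   opponent = homeTeam),
                        type = "response")
  # Outer product: rows are home goals 0..max_goals, columns are away goals
  dpois(0:max_goals, home_goals) %o% dpois(0:max_goals, away_goals)
}

m <- simulate_game(fit, "A", "B", max_goals = 10)
dim(m)
```

Each cell m[i, j] is then the modelled probability of the home team scoring i - 1 goals and the away team j - 1, assuming the two scores are independent.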

Add new column to long dataframe from another dataframe?

Say that I have two dataframes. One lists the names of soccer players, the teams that they have played for, and the number of goals that they have scored on each team. The other contains the soccer players' names and ages. How do I add a "names_age" column to the goals dataframe that holds the ages of the players in the first column "names", not those in "teammates_names"? How do I add an additional column that holds the teammates' ages? In short, I'd like two age columns: one for the first set of players and one for the second set.
> AGE_DF
names age
1 Sam 20
2 Jon 21
3 Adam 22
4 Jason 23
5 Jones 24
6 Jermaine 25
> GOALS_DF
names goals team teammates_names teammates_goals teammates_team
1 Sam 1 USA Jason 1 HOLLAND
2 Sam 2 ENGLAND Jason 2 PORTUGAL
3 Sam 3 BRAZIL Jason 3 GHANA
4 Sam 4 GERMANY Jason 4 COLOMBIA
5 Sam 5 ARGENTINA Jason 5 CANADA
6 Jon 1 USA Jones 1 HOLLAND
7 Jon 2 ENGLAND Jones 2 PORTUGAL
8 Jon 3 BRAZIL Jones 3 GHANA
9 Jon 4 GERMANY Jones 4 COLOMBIA
10 Jon 5 ARGENTINA Jones 5 CANADA
11 Adam 1 USA Jermaine 1 HOLLAND
12 Adam 1 ENGLAND Jermaine 1 PORTUGAL
13 Adam 4 BRAZIL Jermaine 4 GHANA
14 Adam 3 GERMANY Jermaine 3 COLOMBIA
15 Adam 2 ARGENTINA Jermaine 2 CANADA
What I have tried: I've successfully got this to work using a for loop. The actual data that I am working with have thousands of rows, and this takes a long time. I would like a vectorized approach but I'm having trouble coming up with a way to do that.
Try merge or match.
Here's merge (which is likely to screw up your row ordering and can sometimes be slow):
merge(AGE_DF, GOALS_DF, all = TRUE)
Here's match, which makes use of basic indexing and subsetting. Assign the result to a new column, of course.
AGE_DF$age[match(GOALS_DF$names, AGE_DF$names)]
Here's another option to consider: Convert your dataset into a long format first, and then do the merge. Here, I've done it with melt and "data.table":
library(reshape2)
library(data.table)
setkey(melt(as.data.table(GOALS_DF, keep.rownames = TRUE),
measure.vars = c("names", "teammates_names"),
value.name = "names"), names)[as.data.table(AGE_DF)]
# rn goals team teammates_goals teammates_team variable names age
# 1: 1 1 USA 1 HOLLAND names Sam 20
# 2: 2 2 ENGLAND 2 PORTUGAL names Sam 20
# 3: 3 3 BRAZIL 3 GHANA names Sam 20
# 4: 4 4 GERMANY 4 COLOMBIA names Sam 20
# 5: 5 5 ARGENTINA 5 CANADA names Sam 20
# 6: 6 1 USA 1 HOLLAND names Jon 21
## <<SNIP>>
# 28: 13 4 BRAZIL 4 GHANA teammates_names Jermaine 25
# 29: 14 3 GERMANY 3 COLOMBIA teammates_names Jermaine 25
# 30: 15 2 ARGENTINA 2 CANADA teammates_names Jermaine 25
# rn goals team teammates_goals teammates_team variable names age
I've added the rownames so you can use dcast to get back to the wide format and retain the row ordering if it's important.
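For the two age columns asked about in the question, the match approach can be applied once per name column. Here is a minimal sketch on a cut-down version of the example data (one row per player pair; the column names names_age and teammates_age follow the question's wording).

```r
AGE_DF <- data.frame(
  names = c("Sam", "Jon", "Adam", "Jason", "Jones", "Jermaine"),
  age   = 20:25
)

# Cut-down version of GOALS_DF with just the two name columns
GOALS_DF <- data.frame(
  names           = c("Sam", "Jon", "Adam"),
  teammates_names = c("Jason", "Jones", "Jermaine")
)

# match() returns, for each name, its row position in AGE_DF;
# indexing AGE_DF$age with those positions looks up the ages
GOALS_DF$names_age     <- AGE_DF$age[match(GOALS_DF$names, AGE_DF$names)]
GOALS_DF$teammates_age <- AGE_DF$age[match(GOALS_DF$teammates_names, AGE_DF$names)]
print(GOALS_DF)
```

This keeps the row order of GOALS_DF, and a name that is missing from AGE_DF simply gets NA as its age.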
