Scraping data from Australian Open stats - r

I'd like to scrape the stats from the official Australian Open website, specifically the data in the tables, using the rvest library. However, when I use
read_html("https://ausopen.com/event-stats") %>% html_nodes("table")
it returns {xml_nodeset (0)}. How can I fix this? The website is a bit confusing because the data for every statistic is on a single webpage.

The stats tables on that page are rendered with JavaScript, so the raw HTML that read_html fetches contains no <table> nodes. However, there is a ton of information at https://prod-scores-api.ausopen.com/year/2021/stats, which you can read with jsonlite::fromJSON. The difficult task is finding the relevant data that you need.
For example, to get aces and player names you can do:
library(dplyr)
dat <- jsonlite::fromJSON('https://prod-scores-api.ausopen.com/year/2021/stats')
aces <- bind_rows(dat$statistics$rankings[[1]]$players)
dat$players %>%
  inner_join(aces, by = c('uuid' = 'player_id')) %>%
  select(full_name, value) %>%
  arrange(-value)
# full_name value
#1 Novak Djokovic 103
#2 Alexander Zverev 86
#3 Milos Raonic 82
#4 Daniil Medvedev 80
#5 Nick Kyrgios 69
#6 Alexander Bublik 66
#7 Reilly Opelka 61
#8 Jiri Vesely 58
#9 Andrey Rublev 57
#10 Lloyd Harris 55
#11 Aslan Karatsev 54
#12 Taylor Fritz 53
#...
#...


Match strings by distance between non-equal length ones

Say we have the following datasets:
Dataset A:
name age
Sally 22
Peter 35
Joe 57
Samantha 33
Kyle 30
Kieran 41
Molly 28
Dataset B:
name company
Samanta A
Peter B
Joey C
Samantha A
My aim is to match both datasets, ordering the matches by distance and keeping only the relevant ones. In other words, the output should look as follows:
name_a name_b age company distance
Peter Peter 35 B 0.00
Samantha Samantha 33 A 0.00
Samantha Samanta 33 A 0.04166667
Joe Joey 57 C 0.08333333
In this example I'm calculating the distance using method = "jw" in stringdist, but I'm happy with any other method that might work. So far I've been experimenting with packages such as stringr and stringdist.
You can use stringdist_inner_join to join the two dataframes and use levenshteinSim to get the similarity between the two names.
library(fuzzyjoin)
library(dplyr)
stringdist_inner_join(A, B, by = 'name') %>%
  mutate(distance = 1 - RecordLinkage::levenshteinSim(name.x, name.y)) %>%
  arrange(distance)
# name.x age name.y company distance
#1 Peter 35 Peter B 0.000
#2 Samantha 33 Samantha A 0.000
#3 Samantha 33 Samanta A 0.125
#4 Joe 57 Joey C 0.250
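Since the question specifically mentions method = "jw", you can also compute the Jaro-Winkler distance directly with stringdist::stringdist inside the join. A self-contained sketch on the sample data (the join itself still uses fuzzyjoin's default matching; only the reported distance changes):

```r
library(fuzzyjoin)
library(dplyr)

A <- data.frame(name = c("Sally", "Peter", "Joe", "Samantha", "Kyle", "Kieran", "Molly"),
                age  = c(22, 35, 57, 33, 30, 41, 28))
B <- data.frame(name    = c("Samanta", "Peter", "Joey", "Samantha"),
                company = c("A", "B", "C", "A"))

res <- stringdist_inner_join(A, B, by = "name") %>%
  # Jaro-Winkler distance, matching the values in the desired output
  mutate(distance = stringdist::stringdist(name.x, name.y, method = "jw")) %>%
  arrange(distance)
res
```

This reproduces the distances from the question (0 for the exact matches, 0.04166667 for Samantha/Samanta, 0.08333333 for Joe/Joey).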

R Extract names from text

I'm trying to extract a list of rugby players' names from a string. The string contains all of the information from a table, including the headers (team names) and the name of the player in each position for each team. It also has the player rankings, but I don't care about those.
Important: a lot of player rankings are missing. I found a solution to this, however it doesn't handle missing rankings (for example, below, Rabah Slimani is the first player without a recorded ranking).
Note, the 1-15 numbers indicate positions, and there's always two names following each position (home player and away player).
Here's the sample string:
" Team Sheets # FRA France RPI IRE Ireland RPI 1 Jefferson Poirot 72 Cian Healy 82 2 Guilhem Guirado 78 Rory Best 85 3 Rabah Slimani Tadhg Furlong 85 4 Arthur Iturria 82 Iain Henderson 84 5 Sebastien Vahaamahina 84 James Ryan 92 6 Wenceslas Lauret 82 Peter O'Mahony 93 7 Yacouba Camara 70 Josh van der Flier 64 8 Kevin Gourdon CJ Stander 91 9 Maxime Machenaud Conor Murray 87 10 Matthieu Jalibert Johnny Sexton 90 11 Virimi Vakatawa Jacob Stockdale 89 12 Henry Chavancy Bundee Aki 83 13 RĂ©mi Lamerat Robbie Henshaw 78 14 Teddy Thomas Keith Earls 89 15 Geoffrey Palis Rob Kearney 80 Substitutes # FRA France RPI IRE Ireland RPI 16 Adrien Pelissie Sean Cronin 84 17 Dany Priso 70 Jack McGrath 70 18 Cedate Gomes Sa 71 John Ryan 86 19 Paul Gabrillagues 77 Devin Toner 90 20 Marco Tauleigne Dan Leavy 80 21 Antoine Dupont 92 Luke McGrath 22 Anthony Belleau 65 Joey Carbery 86 23 Benjamin Fall Fergus McFadden "
Note - it comes from here: https://www.rugbypass.com/live/six-nations/france-vs-ireland-at-stade-de-france-on-03022018/2018/info/
So basically what I want is just the list of names with the team names as the headers e.g.
France Ireland
Jefferson Poirot Cian Healy
Guilhem Guirado Rory Best
... ...
Any help would be much appreciated!
I tried this in an advanced notepad editor: find occurrences of two consecutive numbers and replace them with a new line. The regex is
\d+\s+\d+
Once you are done replacing, you will be left with two names on each line, separated by a number. Then use the regex below to replace that number with a single tab:
\s+\d+\s+
Hope that helps.
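The same find/replace approach can be sketched in base R. This is only a sketch on a small fragment where rankings are present; as noted in the question, rows with a missing ranking will still come out with the two names unseparated:

```r
# Sample fragment of the team-sheet string (rankings present)
s <- "1 Jefferson Poirot 72 Cian Healy 82 2 Guilhem Guirado 78 Rory Best 85"

# Step 1: "ranking followed by the next position number" marks a row break
rows <- strsplit(gsub("\\d+\\s+\\d+", "\n", s), "\n")[[1]]
# Step 2: drop a leading position number, then the remaining ranking
# number inside each row separates the home and away names
rows <- gsub("^\\s*\\d+\\s+", "", rows)
pairs <- strsplit(trimws(gsub("\\s+\\d+\\s*", "\t", rows)), "\t")
do.call(rbind, pairs)
# a 2 x 2 character matrix: Jefferson Poirot / Cian Healy,
#                           Guilhem Guirado / Rory Best
```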

How to create a Markdown table with different column lengths based on a dataframe in long format in R?

I'm working on an R Markdown file that I would like to submit as a manuscript to an academic journal. I would like to create a table that shows which three words (item2) co-occur most frequently with some keywords (item1). Note that some keywords have more than three co-occurring words. The data I am currently working with:
item1 <- c("water","water","water","water","water","sun","sun","sun","sun","moon","moon","moon")
item2 <- c("tree","dog","cat","fish","eagle","bird","table","bed","flower","house","desk","tiger")
n <- c("200","83","34","34","34","300","250","77","77","122","46","46")
df <- data.frame(item1,item2,n)
Which gives this dataframe:
item1 item2 n
1 water tree 200
2 water dog 83
3 water cat 34
4 water fish 34
5 water eagle 34
6 sun bird 300
7 sun table 250
8 sun bed 77
9 sun flower 77
10 moon house 122
11 moon desk 46
12 moon tiger 46
Ultimately, I would like to pass the data to the function papaja::apa_table, which requires a data.frame (or a matrix / list). I therefore need to reshape the data.
My question:
How can I reshape the data (preferably with dplyr) to get the following structure?
water_item2 water_n sun_item2 sun_n moon_item2 moon_n
1 tree 200 bird 300 house 122
2 dog 83 table 250 desk 46
3 cat 34 bed 77 tiger 46
4 fish 34 flower 77 <NA> <NA>
5 eagle 34 <NA> <NA> <NA> <NA>
We can borrow an approach from an old answer of mine to a different question, and modify a classic gather(), unite(), spread() strategy by creating unique identifiers by group to avoid duplicate identifiers, then dropping that variable:
library(dplyr)
library(tidyr)
item1 <- c("water","water","water","water","water","sun","sun","sun","sun","moon","moon","moon")
item2 <- c("tree","dog","cat","fish","eagle","bird","table","bed","flower","house","desk","tiger")
n <- c("200","83","34","34","34","300","250","77","77","122","46","46")
# Owing to Richard Telford's excellent comment,
# I use data_frame() (or equivalently for our purposes,
# data.frame(..., stringsAsFactors = FALSE))
# to avoid turning the strings into factors
df <- data_frame(item1,item2,n)
df %>%
  group_by(item1) %>%
  mutate(id = 1:n()) %>%
  ungroup() %>%
  gather(temp, val, item2, n) %>%
  unite(temp2, item1, temp, sep = '_') %>%
  spread(temp2, val) %>%
  select(-id)
# A tibble: 5 x 6
moon_item2 moon_n sun_item2 sun_n water_item2 water_n
<chr> <chr> <chr> <chr> <chr> <chr>
1 house 122 bird 300 tree 200
2 desk 46 table 250 dog 83
3 tiger 46 bed 77 cat 34
4 NA NA flower 77 fish 34
5 NA NA NA NA eagle 34
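With tidyr 1.0 or later, the gather()/unite()/spread() chain can be collapsed into a single pivot_wider() call. A self-contained sketch of the same idea (names_glue controls the column naming; the column order may differ from the spread() output above):

```r
library(dplyr)
library(tidyr)

item1 <- c("water","water","water","water","water","sun","sun","sun","sun","moon","moon","moon")
item2 <- c("tree","dog","cat","fish","eagle","bird","table","bed","flower","house","desk","tiger")
n <- c("200","83","34","34","34","300","250","77","77","122","46","46")
df <- tibble(item1, item2, n)

res <- df %>%
  group_by(item1) %>%
  mutate(id = row_number()) %>%   # unique id within each keyword
  ungroup() %>%
  pivot_wider(names_from = item1,
              values_from = c(item2, n),
              names_glue = "{item1}_{.value}") %>%
  select(-id)
res
```

Groups shorter than the longest one are padded with NA, just as in the spread() version.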

How to get the link inside html_table using rvest?

library("rvest")
url <- "myurl.com"
tables <- url %>%
  read_html() %>%
  html_nodes(xpath = '//*[@id="pageContainer"]/table[1]') %>%
  html_table(fill = TRUE)
tables[[1]]
The html content of the cell is like this:
<td><a href="...">Click Here</a></td>
But in the scraped table I only get:
Click Here
You could handle this by editing rvest::html_table with trace.
Example of existing behaviour:
library(rvest)
x <- "https://en.wikipedia.org/wiki/Academy_Award_for_Best_Picture" %>%
  read_html() %>%
  html_nodes("#mw-content-text > table:nth-child(55)")
html_table(x)
#[[1]]
# Film Production company(s) Producer(s)
#1 The Great Ziegfeld Metro-Goldwyn-Mayer Hunt Stromberg
#2 Anthony Adverse Warner Bros. Henry Blanke
#3 Dodsworth Goldwyn, United Artists Samuel Goldwyn and Merritt Hulbert
#4 Libeled Lady Metro-Goldwyn-Mayer Lawrence Weingarten
#5 Mr. Deeds Goes to Town Columbia Frank Capra
#6 Romeo and Juliet Metro-Goldwyn-Mayer Irving Thalberg
#7 San Francisco Metro-Goldwyn-Mayer John Emerson and Bernard H. Hyman
#8 The Story of Louis Pasteur Warner Bros. Henry Blanke
#9 A Tale of Two Cities Metro-Goldwyn-Mayer David O. Selznick
#10 Three Smart Girls Universal Joe Pasternak and Charles R. Rogers
html_table essentially extracts the cells of the html table and runs html_text on them. All we need to do is replace that by extracting the <a> tag from each cell and running html_attr(., "href") instead.
trace(rvest:::html_table.xml_node, quote({
  values <- lapply(lapply(cells, html_node, "a"), html_attr, name = "href")
  values[[1]] <- html_text(cells[[1]])
}), at = 14)
New behaviour:
html_table(x)
#Tracing html_table.xml_node(X[[i]], ...) step 14
#[[1]]
# Film Production company(s) Producer(s)
#1 /wiki/The_Great_Ziegfeld NA /wiki/Hunt_Stromberg
#2 /wiki/Anthony_Adverse NA /wiki/Henry_Blanke
#3 /wiki/Dodsworth_(film) NA /wiki/Samuel_Goldwyn
#4 /wiki/Libeled_Lady NA /wiki/Lawrence_Weingarten
#5 /wiki/Mr._Deeds_Goes_to_Town NA /wiki/Frank_Capra
#6 /wiki/Romeo_and_Juliet_(1936_film) NA /wiki/Irving_Thalberg
#7 /wiki/San_Francisco_(1936_film) NA /wiki/John_Emerson_(filmmaker)
#8 /wiki/The_Story_of_Louis_Pasteur NA /wiki/Henry_Blanke
#9 /wiki/A_Tale_of_Two_Cities_(1935_film) NA /wiki/David_O._Selznick
#10 /wiki/Three_Smart_Girls NA /wiki/Joe_Pasternak
If you want to get the value of the "href" attribute then please use:
//*[@id="pageContainer"]/table[1]//@href
I tested this on http://xpather.com/RtnrY9fh (xpath online).
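If you'd rather not patch html_table() with trace(), another option is to read the cell text and the href attributes separately and combine them yourself. A self-contained sketch on an inline table (the film names are borrowed from the Wikipedia example above; the table here is made up for illustration):

```r
library(rvest)

doc <- read_html('<table>
  <tr><th>Film</th><th>Year</th></tr>
  <tr><td><a href="/wiki/The_Great_Ziegfeld">The Great Ziegfeld</a></td><td>1936</td></tr>
  <tr><td><a href="/wiki/Anthony_Adverse">Anthony Adverse</a></td><td>1936</td></tr>
</table>')

tab <- html_table(html_node(doc, "table"))   # plain text, as usual
# pull the hrefs from the same cells and bind them on as a new column
hrefs <- doc %>% html_nodes("td a") %>% html_attr("href")
tab$link <- hrefs
tab
```

This keeps the visible text and the links side by side, and works without touching rvest internals.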

Manipulating R Data Frames

I've currently got two separate data frames, excerpts as per below:
mydata
Player TG% Pts Team Opp Yr Rd Grnd
John 56 42 A 1 2015 1 Grnd1
James 94 64 B 2 2015 1 Grnd2
Jerry 85 78 C 3 2015 1 Grnd3
Daniel 97 51 D 4 2015 1 Grnd4
John 89 61 A 1 2015 1 Grnd2
James 65 26 B 4 2015 1 Grnd3
Jerry 73 34 C 3 2015 1 Grnd2
Daniel 73 40 D 2 2015 1 Grnd2
John 89 26 A 1 2015 1 Grnd3
James 92 42 B 3 2015 1 Grnd1
Jerry 89 25 C 2 2015 1 Grnd2
Daniel 80 41 D 4 2015 1 Grnd2
John 73 82 A 3 2015 1 Grnd3
James 73 41 B 4 2015 1 Grnd3
Jerry 89 76 C 2 2015 1 Grnd1
Daniel 91 77 D 1 2015 1 Grnd2
round
Team Opp Grnd
A 1 Grnd1
B 3 Grnd4
C 4 Grnd2
D 2 Grnd3
What I want to be able to do is manipulate this so that I generate a second data frame as per below
Player Gms Avg.Pts Avg.Last3 Avg.v.Opp Avg.#.Grnd
John
James
Jerry
Daniel
I know how to do this in Excel; however, I'm struggling in R:
Gms - total number of games for each individual player (excel would be countif)
Avg.Pts - this is the average of Pts for each Player name (excel would be averageif)
Avg.Last3 - this is the average of Pts for each Player in their last 3 games, note that the data frame is in order with most recent games at the end of the data frame.
Avg.v.Opp - this is the average of Pts for each player against the next opponent as defined in data frame round. For example John plays for team A and his next opponent is Opp 1. (excel would be averageifs)
Avg.#.Grnd - this is the average of Pts for each player at the next ground as defined in data frame round. For example, John plays for team A and his next game is held at Grnd1. (excel would be averageifs)
I've tried using dplyr and a number of other options but haven't seemed to successfully put together something that works at this stage. Note that mydata data frame runs to over 10,000+ rows.
I think this will work. If you share your sample data with dput(), I'll be happy to copy/paste it and check (and debug if necessary).
First I'll do the easy ones, the ones that don't depend on round:
library(dplyr)
group_by(mydata, Player) %>%
  summarize(Gms = n(),
            Avg.Pts = mean(Pts),
            Avg.Last3 = mean(tail(Pts, 3)))
I wanted to do that one separately to emphasize how clean dplyr can be for simple cases. All the "ifs" in your Excel commands are taken care of by the single group_by at the beginning. n() is the count, and mean() is the average. tail() is a handy base function that returns the end of a data frame or vector.
To add in the round data, we'll want to join the data frames together based on the Team column. We'll still want to be able to tell which of the other columns come from mydata and which from round, so I'll rename the round columns:
round = rename(round, next_opp = Opp, next_grnd = Grnd)
Then we'll start with the join and proceed as before. This time we do need some ifs at the end, which I'll do with a simple subset inside the mean calls:
left_join(mydata, round) %>%
  # convert ground columns to character as discussed in comments
  mutate(next_grnd = as.character(next_grnd),
         Grnd = as.character(Grnd)) %>%
  group_by(Player) %>%
  summarize(Gms = n(),
            Avg.Pts = mean(Pts),
            Avg.Last3 = mean(tail(Pts, 3)),
            Avg.v.Opp = mean(Pts[Opp == next_opp]),
            Avg.at.Grnd = mean(Pts[Grnd == next_grnd]))
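To see the pipeline run end to end, here is a small check on made-up mini versions of mydata and round (hypothetical data, with the rename to next_opp/next_grnd already applied). Note that Avg.v.Opp and Avg.at.Grnd come out NaN when a player has no previous games against the next opponent or at the next ground:

```r
library(dplyr)

mydata <- data.frame(
  Player = c("John", "James", "John", "James"),
  Pts    = c(42, 64, 61, 26),
  Team   = c("A", "B", "A", "B"),
  Opp    = c(1, 2, 1, 4),
  Grnd   = c("Grnd1", "Grnd2", "Grnd2", "Grnd3"),
  stringsAsFactors = FALSE
)
round <- data.frame(Team = c("A", "B"), next_opp = c(1, 3),
                    next_grnd = c("Grnd1", "Grnd4"),
                    stringsAsFactors = FALSE)

res <- left_join(mydata, round, by = "Team") %>%
  group_by(Player) %>%
  summarize(Gms = n(),
            Avg.Pts = mean(Pts),
            Avg.Last3 = mean(tail(Pts, 3)),
            Avg.v.Opp = mean(Pts[Opp == next_opp]),
            Avg.at.Grnd = mean(Pts[Grnd == next_grnd]))
res
```

John has played his next opponent (Opp 1) twice, so Avg.v.Opp is 51.5; James has never met his next opponent (Opp 3), so his Avg.v.Opp is NaN.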
