How to get the link inside html_table using rvest? - r

library("rvest")
url <- "myurl.com"
tables <- url %>%
  read_html() %>%
  html_nodes(xpath = '//*[@id="pageContainer"]/table[1]') %>%
  html_table(fill = TRUE)
tables[[1]]
The HTML content of the cell looks like this:
<td><a href="...">Click Here</a></td>
But in the scraped table I only get:
Click Here

You could handle this by editing the behaviour of rvest:::html_table.xml_node with trace().
Example of existing behaviour:
library(rvest)
x <- "https://en.wikipedia.org/wiki/Academy_Award_for_Best_Picture" %>%
  read_html() %>%
  html_nodes("#mw-content-text > table:nth-child(55)")
html_table(x)
#[[1]]
# Film Production company(s) Producer(s)
#1 The Great Ziegfeld Metro-Goldwyn-Mayer Hunt Stromberg
#2 Anthony Adverse Warner Bros. Henry Blanke
#3 Dodsworth Goldwyn, United Artists Samuel Goldwyn and Merritt Hulbert
#4 Libeled Lady Metro-Goldwyn-Mayer Lawrence Weingarten
#5 Mr. Deeds Goes to Town Columbia Frank Capra
#6 Romeo and Juliet Metro-Goldwyn-Mayer Irving Thalberg
#7 San Francisco Metro-Goldwyn-Mayer John Emerson and Bernard H. Hyman
#8 The Story of Louis Pasteur Warner Bros. Henry Blanke
#9 A Tale of Two Cities Metro-Goldwyn-Mayer David O. Selznick
#10 Three Smart Girls Universal Joe Pasternak and Charles R. Rogers
html_table essentially extracts the cells of the html table and runs html_text on them. All we need to do is replace that by extracting the <a> tag from each cell and running html_attr(., "href") instead.
# Insert extra code at step 14 of rvest's internal html_table.xml_node
# (the step number depends on the installed rvest version):
trace(rvest:::html_table.xml_node, quote({
  # for every row, replace each cell's text with the href of its first <a> tag
  values <- lapply(lapply(cells, html_node, "a"), html_attr, name = "href")
  # keep the header row as plain text
  values[[1]] <- html_text(cells[[1]])
}), at = 14)
New behaviour:
html_table(x)
#Tracing html_table.xml_node(X[[i]], ...) step 14
#[[1]]
# Film Production company(s) Producer(s)
#1 /wiki/The_Great_Ziegfeld NA /wiki/Hunt_Stromberg
#2 /wiki/Anthony_Adverse NA /wiki/Henry_Blanke
#3 /wiki/Dodsworth_(film) NA /wiki/Samuel_Goldwyn
#4 /wiki/Libeled_Lady NA /wiki/Lawrence_Weingarten
#5 /wiki/Mr._Deeds_Goes_to_Town NA /wiki/Frank_Capra
#6 /wiki/Romeo_and_Juliet_(1936_film) NA /wiki/Irving_Thalberg
#7 /wiki/San_Francisco_(1936_film) NA /wiki/John_Emerson_(filmmaker)
#8 /wiki/The_Story_of_Louis_Pasteur NA /wiki/Henry_Blanke
#9 /wiki/A_Tale_of_Two_Cities_(1935_film) NA /wiki/David_O._Selznick
#10 /wiki/Three_Smart_Girls NA /wiki/Joe_Pasternak
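If you would rather not trace rvest internals, a trace-free sketch is to parse the table as usual and pull the hrefs of the first column's links out separately, then attach them. This assumes the same Wikipedia table as above, with one <tr> per data row and the film link in the first <td>; adjust the selector for your own page.
library(rvest)

page <- read_html("https://en.wikipedia.org/wiki/Academy_Award_for_Best_Picture")
tbl_node <- html_node(page, "#mw-content-text > table:nth-child(55)")

# parsed text version of the table
tbl <- html_table(tbl_node)

# href of the first link in each row's first cell; html_node() is vectorised
# over the rows and returns a missing node where a cell has no link, so
# html_attr() yields NA there
film_links <- tbl_node %>%
  html_nodes("tr") %>%
  html_node("td:first-child a") %>%
  html_attr("href")

# drop the header row's NA and attach the links to the parsed table
# (this assumes the number of <tr> rows is nrow(tbl) + 1)
tbl$film_href <- film_links[-1]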

If you want to get the value of the href attribute, then use the following XPath:
//*[@id="pageContainer"]/table[1]//@href
I tested this on http://xpather.com/RtnrY9fh (an online XPath tester).
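rvest works most naturally on element nodes, so an equivalent sketch within R is to select the <a> elements under the table and read their href attributes with html_attr() (keeping the question's placeholder URL and pageContainer id):
library(rvest)

hrefs <- read_html("myurl.com") %>%
  html_nodes(xpath = '//*[@id="pageContainer"]/table[1]//a') %>%
  html_attr("href")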

Related

Is it possible to convert lines from a text file into columns to get a dataframe?

I have a text file containing information on book title, author name, and country of birth, which appear on separate lines as shown below:
Oscar Wilde
De Profundis
Ireland
Nathaniel Hawthorn
Birthmark
USA
James Joyce
Ulysses
Ireland
Walt Whitman
Leaves of Grass
USA
Is there any way to convert the text to a dataframe with these three items appearing as different columns:
ID Author Book Country
1 "Oscar Wilde" "De Profundis" "Ireland"
2 "Nathaniel Hawthorn" "Birthmark" "USA"
There are built-in functions for dealing with this kind of data:
# xx is the raw text as a single string (see the test data further down)
data.frame(scan(text = xx, multi.line = TRUE,
                what = list(Author = "", Book = "", Country = ""), sep = "\n"))
# Author Book Country
#1 Oscar Wilde De Profundis Ireland
#2 Nathaniel Hawthorn Birthmark USA
#3 James Joyce Ulysses Ireland
#4 Walt Whitman Leaves of Grass USA
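If the text lives in a file rather than in a string, the same call works with a file path instead of text =; a sketch, assuming the file is the "test.txt" written in the answer further down, with an ID column added afterwards:
books <- data.frame(scan("test.txt", multi.line = TRUE,
                         what = list(Author = "", Book = "", Country = ""),
                         sep = "\n"))
books <- cbind(ID = seq_len(nrow(books)), books)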
You can create a 3-column matrix from one column of data.
# sep = ',' keeps each whole line as a single field (the data contain no commas)
dat <- read.table('data.txt', sep = ',')
result <- matrix(dat$V1, ncol = 3, byrow = TRUE) |>
  data.frame() |>
  setNames(c('Author', 'Book', 'Country'))
result <- cbind(ID = 1:nrow(result), result)
result
# ID Author Book Country
#1 1 Oscar Wilde De Profundis Ireland
#2 2 Nathaniel Hawthorn Birthmark USA
#3 3 James Joyce Ulysses Ireland
#4 4 Walt Whitman Leaves of Grass USA
There aren't any built-in functions that handle data like this directly, but you can reshape your data after importing.
#Test data
xx <- "Oscar Wilde
De Profundis
Ireland
Nathaniel Hawthorn
Birthmark
USA
James Joyce
Ulysses
Ireland
Walt Whitman
Leaves of Grass
USA"
writeLines(xx, "test.txt")
And then the code
library(dplyr)
library(tidyr)
lines <- read.csv("test.txt", header=FALSE)
lines %>%
  mutate(
    # rid: position within each group of three lines, pid: record number
    rid = ((row_number() - 1) %% 3) + 1,
    pid = (row_number() - 1) %/% 3 + 1) %>%
  mutate(col = case_when(rid == 1 ~ "Author", rid == 2 ~ "Book", rid == 3 ~ "Country")) %>%
  select(-rid) %>%
  pivot_wider(names_from = col, values_from = V1)
Which returns
# A tibble: 4 x 4
pid Author Book Country
<dbl> <chr> <chr> <chr>
1 1 Oscar Wilde De Profundis Ireland
2 2 Nathaniel Hawthorn Birthmark USA
3 3 James Joyce Ulysses Ireland
4 4 Walt Whitman Leaves of Grass USA

Scraping data from Australian Open stats

I'd like to scrape the stats from the official Australian Open website, specifically the data from the table, using the rvest library. However, when I use
read_html("https://ausopen.com/event-stats") %>% html_nodes("table")
it returns {xml_nodeset (0)}. How would I fix this? The website is a bit confusing because all the data for every statistic is on one webpage.
The stats table is rendered client-side, which is why html_nodes("table") finds nothing in the downloaded HTML. Instead, there is a ton of information at https://prod-scores-api.ausopen.com/year/2021/stats which you can read with jsonlite::fromJSON. The difficult task is to find the relevant data that you need.
For example, to get aces and player name you can do :
library(dplyr)
dat <- jsonlite::fromJSON('https://prod-scores-api.ausopen.com/year/2021/stats')
# the first rankings entry holds the aces leaderboard
aces <- bind_rows(dat$statistics$rankings[[1]]$players)
dat$players %>%
  inner_join(aces, by = c('uuid' = 'player_id')) %>%
  select(full_name, value) %>%
  arrange(-value)
# full_name value
#1 Novak Djokovic 103
#2 Alexander Zverev 86
#3 Milos Raonic 82
#4 Daniil Medvedev 80
#5 Nick Kyrgios 69
#6 Alexander Bublik 66
#7 Reilly Opelka 61
#8 Jiri Vesely 58
#9 Andrey Rublev 57
#10 Lloyd Harris 55
#11 Aslan Karatsev 54
#12 Taylor Fritz 53
#...
#...
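Since the response is one deeply nested list, a quick way to locate the piece you need is to inspect its structure before drilling in; a sketch using the dat object created above (component names beyond those already used are not guaranteed):
str(dat, max.level = 1)             # top-level components of the response
str(dat$statistics, max.level = 1)  # where the rankings used above live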

How to properly filter date in R

Hi, I'm using library(carData) - MplsStops, and I want to filter all events that took place in July 2017 in the neighborhoods Cedar Riverside, St. Anthony East, and Downtown West, and then arrange them by lat and long.
The date format looks like this: 2017-01-01 00:00:42
I'm using dplyr.
For now I'm trying to make this code work:
MplsStops %>%
  filter(neighborhood == "Cedar Riverside" | neighborhood == "St. Anthony East" | neighborhood == "Downtown West") %>%
  filter(date == 2017-07) %>%
  arrange(lat, long)
I think there is some problem with date ==. Could anyone give me tips on how to make it work?
One option is to change the == to %in% for the neighborhoods and to format() the 'date' before doing the ==.
library(dplyr)
library(carData)
MplsStops %>%
  filter(neighborhood %in% c("Cedar Riverside", "St. Anthony East",
                             "Downtown West")) %>%
  filter(format(as.Date(date), "%Y-%m") == "2017-07") %>%
  arrange(lat, long)
# idNum date problem MDC citationIssued personSearch vehicleSearch preRace race gender
#1 17-264432 2017-07-14 22:07:37 traffic MDC NO NO NO Unknown Black Male
#2 17-274061 2017-07-21 13:17:05 suspicious MDC YES NO NO Native American Native American Female
#3 17-252658 2017-07-06 23:29:22 traffic MDC NO NO NO Unknown East African Male
#4 17-250572 2017-07-05 18:31:16 suspicious MDC <NA> NO NO Black Black Male
#5 17-269530 2017-07-18 17:03:38 traffic other <NA> <NA> <NA> <NA> <NA> <NA>
#6 17-277463 2017-07-23 22:05:45 traffic MDC NO NO NO Black Black Male
#...
# lat long policePrecinct neighborhood
#1 44.96437 -93.24308 1 Cedar Riverside
#2 44.96440 -93.23357 1 Cedar Riverside
#3 44.96466 -93.23616 1 Cedar Riverside
#4 44.96497 -93.23492 1 Cedar Riverside
#5 44.96497 -93.23492 1 Cedar Riverside
#6 44.96497 -93.23492 1 Cedar Riverside
#...
Another option might be to use the lubridate package, which has many useful functions for working with dates.
library(dplyr)
library(carData)
MplsStops %>%
  filter(neighborhood %in% c("Cedar Riverside", "St. Anthony East",
                             "Downtown West")) %>%
  # first keep all dates in 2017, then all stops in month 7 (July)
  filter(lubridate::year(date) == 2017 & lubridate::month(date) == 7) %>%
  arrange(lat, long)
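A third option (a sketch, not from the original answers) is to compare against an explicit date range, which avoids formatting strings and keeps the filter vectorised on the POSIXct date column:
library(dplyr)
library(carData)
MplsStops %>%
  filter(neighborhood %in% c("Cedar Riverside", "St. Anthony East",
                             "Downtown West")) %>%
  # the bounds below are parsed in your session's timezone; match it to
  # attr(MplsStops$date, "tzone") if exact midnight boundaries matter
  filter(date >= as.POSIXct("2017-07-01"),
         date < as.POSIXct("2017-08-01")) %>%
  arrange(lat, long)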

Splitting a string few characters after the delimiter

I have a large data set of names and states that I need to split. After splitting, I want to create new rows with each name and state. My data strings span multiple lines and look like this:
"Peter Johnson, IN Chet Charles, TX Ed Walsh, AZ"
"Ralph Hogan, TX, Michael Johnson, FL"
I need the data to look like this
attr name state
1 Peter Johnson IN
2 Chet Charles TX
3 Ed Walsh AZ
4 Ralph Hogan TX
5 Michael Johnson FL
I can't figure out how to do this, perhaps split it somehow a few characters after the comma? Any help would be greatly appreciated.
If it is multi-line strings, then we can create a delimiter with gsub (a ";" after each two-letter state code), split the strings with strsplit, create a data.frame from the components of each split in the output list, and rbind them together.
d1 <- do.call(rbind, lapply(strsplit(gsub("([A-Z]{2})(\\s+|,)", "\\1;", lines), "[,;]"),
  function(x) {
    x1 <- trimws(x)
    data.frame(name = x1[c(TRUE, FALSE)], state = x1[c(FALSE, TRUE)])
  }))
cbind(attr = seq_len(nrow(d1)), d1)
# attr name state
#1 1 Peter Johnson IN
#2 2 Chet Charles TX
#3 3 Ed Walsh AZ
#4 4 Ralph Hogan TX
#5 5 Michael Johnson FL
Or this can be done in a compact way with data.table::fread:
library(data.table)
fread(paste(gsub("([A-Z]{2})(\\s+|,)", "\\1\n", lines), collapse = "\n"),
      col.names = c("names", "state"), header = FALSE)[, attr := 1:.N][]
# names state attr
#1: Peter Johnson IN 1
#2: Chet Charles TX 2
#3: Ed Walsh AZ 3
#4: Ralph Hogan TX 4
#5: Michael Johnson FL 5
data
lines <- readLines(textConnection("Peter Johnson, IN Chet Charles, TX Ed Walsh, AZ
Ralph Hogan, TX, Michael Johnson, FL"))
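With the same lines input, a tidyverse sketch of the same idea (an alternative I'm adding, not from the original answers) is to insert the separator with gsub and then split into rows and columns with tidyr:
library(dplyr)
library(tidyr)

tibble(line = lines) %>%
  # same trick as above: put a ";" after every two-letter state code
  mutate(line = gsub("([A-Z]{2})(\\s+|,)", "\\1;", line)) %>%
  separate_rows(line, sep = ";") %>%
  separate(line, into = c("name", "state"), sep = ",") %>%
  mutate(across(everything(), trimws)) %>%
  mutate(attr = row_number()) %>%
  select(attr, name, state)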

Turn names into numbers in a dataframe based on the row index of the name in another dataframe

I have two dataframes. One is just the names of my Facebook friends, and the other one is the links, with source and target columns. I want to turn the names in the links dataframe into numbers based on the row index of that name in the friends dataframe.
friends
name
1 Andrewt Thomas
2 Robbie McCord
3 Mohammad Mojadidi
4 Andrew John
5 Professor Owk
6 Joseph Charles
links
source target
1 Andrewt Thomas Andrew John
2 Andrewt Thomas James Zou
3 Robbie McCord Bz Benz
4 Robbie McCord Yousef AL-alawi
5 Robbie McCord Sherhan Asimov
6 Robbie McCord Aigerim Aig
Seems trivial, but I cannot figure it out. Thanks for the help.
Just use a simple match
links$source <- match(links$source, friends$name)
links
# source target
# 1 1 Andrew John
# 2 1 James Zou
# 3 2 Bz Benz
# 4 2 Yousef AL-alawi
# 5 2 Sherhan Asimov
# 6 2 Aigerim Aig
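If both the source and the target columns should become indices, the same match can be applied to every column; a sketch (names absent from friends$name, such as "James Zou" above, become NA):
links[] <- lapply(links, match, table = friends$name)
links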
Something like this?
links$source <- vapply(links$source, function(x) which(friends$name == x), integer(1))
Full example (with a small made-up friends table, so the index lookup is visible):
friends <- data.frame(name = c("Jimmy", "Alice", "John"))  # assumed for illustration
links <- data.frame(source = c("John", "John", "Alice"), target = c("Jimmy", "Al", "Chris"))
links$source <- vapply(links$source, function(x) which(friends$name == x), integer(1))
links$source
[1] 3 3 2
