Retrieve link from html table with rvest - r

I would like to scrape a website storing German cycling results, but I'm struggling to get the URLs pointing to the race results.
Website with result table
This is what I have so far. The HTML table also seems quite oddly formatted to me, but that could be due to my lack of HTML knowledge:
library(tidyverse)
library(magrittr)
library(rvest)

# read the html
result_url <- "https://www.rad-net.de/rad-net-ergebnisse.htm?name=Ausschreibung&view=ascr_erg&rnswp_disziplin=1"
results <- read_html(result_url)

# extract date and race name
results %>%
  html_table(header = TRUE, fill = TRUE) %>%
  extract2(8) %>%
  tibble()
#> # A tibble: 40 x 2
#> Datum Veranstaltungstitel
#> <chr> <chr>
#> 1 So, 19.07.20… "5. Rosenheimer Jugend - Kriterium"
#> 2 So, 12.07.20… "Swiss O Par Preis"
#> 3 So, 12.07.20… "Deutsche Meisterschaft Einzelzeitfahren U19m/w"
#> 4 So, 12.07.20… "Jugendrenntag der RV Offenbach"
#> 5 Sa, 04.07.20… "CoronaChronoNRW"
#> 6 Sa, 20.06.20… "GCA Klassiker PROS / KIDS / ALL powered by «Müller \u0096 Die…
#> 7 Sa, 13.06.20… "GCA Klassiker PROS / KIDS / ALL powered by «Müller \u0096 Die…
#> 8 Sa, 06.06.20… "GCA Klassiker PROS / KIDS / ALL powered by «Müller \u0096 Die…
#> 9 So, 31.05.20… "Westsachsenklassiker - 72. Sachsenringradrennen"
#> 10 So, 08.03.20… "8. Herforder Frühjahrspreis"
#> # … with 30 more rows
Created on 2020-07-25 by the reprex package (v0.3.0)

You're looking for a bit more information than the html_table function normally provides: it extracts only the cell text and drops the href attributes (and there are actually several nested HTML tables on the page anyway). I think this is what you are looking for:
library(tidyverse)
library(magrittr)
library(rvest)

results <- paste0("https://www.rad-net.de/rad-net-ergebnisse.htm",
                  "?name=Ausschreibung&view=ascr_erg&rnswp_disziplin=1") %>%
  read_html()

# all anchors inside tables
link_nodes <- results %>% html_nodes(xpath = "//table//a")
link_text <- link_nodes %>% html_text()

# keep only the anchors between the "hier" link and the "Nächste" pager link
index <- (which(link_text == "hier") + 1):(which(link_text == "N\u00e4chste") - 1)
link_nodes <- link_nodes[index]

# the date sits in a <font> tag in a preceding cell of each row
dates <- link_nodes %>%
  html_nodes(xpath = "//table//a/parent::td/preceding-sibling::td/font") %>%
  html_text()

df <- tibble(Datum = dates[-1],
             Veranstaltungstitel = link_nodes %>% html_text(),
             link = link_nodes %>% html_attr("href"))
df
#> # A tibble: 40 x 3
#> Datum Veranstaltungstitel link
#> <chr> <chr> <chr>
#> 1 So, 19.0~ "5. Rosenheimer Jugend - Kriterium" /rad-net-portal/rad-net-erge~
#> 2 So, 12.0~ "Swiss O Par Preis" /rad-net-portal/rad-net-erge~
#> 3 So, 12.0~ "Deutsche Meisterschaft Einzelzeitfa~ /rad-net-portal/rad-net-erge~
#> 4 So, 12.0~ "Jugendrenntag der RV Offenbach" /rad-net-portal/rad-net-erge~
#> 5 Sa, 04.0~ "CoronaChronoNRW" /rad-net-portal/rad-net-erge~
#> 6 Sa, 20.0~ "GCA Klassiker PROS / KIDS / ALL pow~ /rad-net-portal/rad-net-erge~
#> 7 Sa, 13.0~ "GCA Klassiker PROS / KIDS / ALL pow~ /rad-net-portal/rad-net-erge~
#> 8 Sa, 06.0~ "GCA Klassiker PROS / KIDS / ALL pow~ /rad-net-portal/rad-net-erge~
#> 9 So, 31.0~ "Westsachsenklassiker - 72. Sachsenr~ /rad-net-portal/rad-net-erge~
#> 10 So, 08.0~ "8. Herforder Frühjahrspreis" /rad-net-portal/rad-net-erge~
#> # ... with 30 more rows
Created on 2020-07-25 by the reprex package (v0.3.0)
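Note that the scraped hrefs are relative paths, so to read each result page you still need absolute URLs. A minimal follow-up sketch, assuming the links resolve against the site root:

# assumption: the relative hrefs resolve against https://www.rad-net.de
df <- df %>%
  mutate(link = xml2::url_absolute(link, "https://www.rad-net.de"))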

Related

In R/rvest, how to get href information (the link that goes with the clicked text)

With the code below I can run html_text() fine, but when I try to get the link that goes with each text entry, web %>% html_node("div.p13n-desktop-grid") %>% html_attr(name='href') fails. Can anyone help? Thanks!
library(rvest)
url <- "https://www.amazon.com/Best-Sellers-Industrial-Scientific-3D-Printers/zgbs/industrial/6066127011/ref=zg_bs_pg_1?_encoding=UTF8&pg=1"
web <- rvest::read_html(url)
web %>% html_node("div.p13n-desktop-grid") %>% html_text() %>% strsplit("#") # ok
web %>% html_node("div.p13n-desktop-grid") %>% html_attr(name='href') # want the link that goes with the clicked text, but this fails
For (shortened) product links and link texts:
library(rvest)
library(dplyr)

url <- "https://www.amazon.com/Best-Sellers-Industrial-Scientific-3D-Printers/zgbs/industrial/6066127011/ref=zg_bs_pg_1?_encoding=UTF8&pg=1"
web <- rvest::read_html(url)

# "div.p13n-desktop-grid a[tabindex] + a":
# text links are adjacent siblings of image links & image links have a tabindex attribute
prod_links <- web %>% html_elements("div.p13n-desktop-grid a[tabindex] + a")

tibble(
  # shorten links, keep only the /dp/item_id/ part
  href = prod_links %>% html_attr(name = 'href') %>% sub('.*(/dp/\\w*/).*', 'www.amazon.com\\1', .),
  descr = prod_links %>% html_text2()
)
#> # A tibble: 30 × 2
#> href descr
#> <chr> <chr>
#> 1 www.amazon.com/dp/B07BR3F9N6/ Official Creality Ender 3 3D Printer Fully Ope…
#> 2 www.amazon.com/dp/B07FFTHMMN/ Official Creality Ender 3 V2 3D Printer Upgrad…
#> 3 www.amazon.com/dp/B09QGTTQKG/ ANYCUBIC Kobra 3D Printer Auto Leveling, FDM 3…
#> 4 www.amazon.com/dp/B07GYRQVYV/ Official Creality Ender 3 Pro 3D Printer with …
#> 5 www.amazon.com/dp/B083GTS8XJ/ ANYCUBIC Wash and Cure Station, Newest Upgrade…
#> 6 www.amazon.com/dp/B09FXYSFBV/ ANYCUBIC Photon Mono 4K 3D Printer, 6.23'' Mon…
#> 7 www.amazon.com/dp/B07J9QGP7S/ ANYCUBIC Mega-S New Upgrade 3D Printer with Hi…
#> 8 www.amazon.com/dp/B07Z9C9T42/ ELEGOO 5PCs FEP Release Film Mars LCD 3D Print…
#> 9 www.amazon.com/dp/B08SPXYND4/ Voxelab Aquila 3D Printer with Full Alloy Fram…
#> 10 www.amazon.com/dp/B07DYL9B2S/ Official Creality Ender 3 S1 3D Printer with D…
#> # … with 20 more rows
Created on 2022-06-16 by the reprex package (v2.0.1)
There are 50 products per page but only the first 30 are included in the grid; the rest would be loaded in small chunks as you scroll down. Unless the descriptions are needed, it's a bit easier to just collect all IDs from the data-client-recs-list attribute and build links from those:
library(rvest)
library(dplyr)
library(jsonlite)

url <- "https://www.amazon.com/Best-Sellers-Industrial-Scientific-3D-Printers/zgbs/industrial/6066127011/ref=zg_bs_pg_1?_encoding=UTF8&pg=1"
web <- rvest::read_html(url)

client_recs_list <- web %>%
  html_element("div.p13n-desktop-grid") %>%
  html_attr(name = 'data-client-recs-list') %>%
  fromJSON(flatten = TRUE) %>%
  tibble()

client_recs_list %>%
  select(1, 2) %>%
  mutate(prod_link = paste0('www.amazon.com/dp/', id, '/'), .after = id)
#> # A tibble: 50 × 3
#> id prod_link metadataMap.render.zg.rank
#> <chr> <chr> <chr>
#> 1 B07BR3F9N6 www.amazon.com/dp/B07BR3F9N6/ 1
#> 2 B07FFTHMMN www.amazon.com/dp/B07FFTHMMN/ 2
#> 3 B09Y54CWXY www.amazon.com/dp/B09Y54CWXY/ 3
#> 4 B07GYRQVYV www.amazon.com/dp/B07GYRQVYV/ 4
#> 5 B09L81S4L7 www.amazon.com/dp/B09L81S4L7/ 5
#> 6 B09JNMRS7R www.amazon.com/dp/B09JNMRS7R/ 6
#> 7 B09WHW8YXS www.amazon.com/dp/B09WHW8YXS/ 7
#> 8 B09W5CSFZQ www.amazon.com/dp/B09W5CSFZQ/ 8
#> 9 B09KXNYJLH www.amazon.com/dp/B09KXNYJLH/ 9
#> 10 B09R4QDVY5 www.amazon.com/dp/B09R4QDVY5/ 10
#> # … with 40 more rows
Created on 2022-06-17 by the reprex package (v2.0.1)
The href attribute is an attribute of the a tags. It's not clear which one you want; there are 119 hrefs found:
web %>%
  html_node("div.p13n-desktop-grid") %>%
  html_elements("a") %>%
  html_attr(name = 'href')
# [1] "/Comgrow-Creality-Ender-Aluminum-220x220x250mm/dp/B07BR3F9N6/ref=zg_bs_6066127011_1/132-1194669-0063960?pd_rd_i=B07BR3F9N6&psc=1"
# [2] "/Comgrow-Creality-Ender-Aluminum-220x220x250mm/dp/B07BR3F9N6/ref=zg_bs_6066127011_1/132-1194669-0063960?pd_rd_i=B07BR3F9N6&psc=1"
# [3] "/product-reviews/B07BR3F9N6/ref=zg_bs_6066127011_cr_1/132-1194669-0063960?pd_rd_i=B07BR3F9N6"
# [4] ......
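Many of these are duplicates, since a product's image link and its text link share the same href (compare entries [1] and [2] above); adding unique() at the end of the pipe collapses them:

web %>%
  html_node("div.p13n-desktop-grid") %>%
  html_elements("a") %>%
  html_attr(name = 'href') %>%
  unique()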

how to scrape text from an HTML body

I've never scraped before. Would it be straightforward to scrape only the text in the main, big gray box from the link below (starting with the header SRUS43 KMSR 271039 and ending with .END)? My end goal is basically three tidy columns of data from all that text: the five-digit codes, the values in inches, and the basin elevation descriptions, so any pointers on processing the text format are welcome too.
https://www.nohrsc.noaa.gov/shef_archive/index.html?rfc=cnrfc&product=swe&year=2019&month=3&day=27&hour=6
Thank you for any help.
Reading in the text is fairly easy (see @DiceBoyT's answer). Cleaning up the format into three columns is a bit more involved. The code below could use some clean-up (especially the regex), but it gets the job done:
library(tidyverse)
library(rvest)

text <- read_html("https://www.nohrsc.noaa.gov/shef_archive/index.html?rfc=cnrfc&product=swe&year=2019&month=3&day=27&hour=6") %>%
  html_node(".notes") %>%
  html_text()

df <- tibble(txt = read_lines(text))

df %>%
  mutate(
    row = row_number(),
    # lines starting with a 5-character station code followed by a value
    with_code = str_extract(txt, "^[A-Za-z0-9]{5}\\s+\\d+(\\.)?\\d"),
    # continuation lines carrying a value but no station code
    wo_code = str_extract(txt, "^:?\\s+\\d+(\\.)?\\d") %>% str_extract("[:digit:]+\\.?[:digit:]"),
    # the basin description is on the line above each coded line
    basin_desc = if_else(!is.na(with_code), lag(txt, 1), NA_character_) %>% str_sub(start = 2)
  ) %>%
  separate(with_code, c("code", "val"), sep = "\\s+") %>%
  mutate(
    combined_val = case_when(
      !is.na(val) ~ val,
      !is.na(wo_code) ~ wo_code,
      TRUE ~ NA_character_
    ) %>% as.numeric
  ) %>%
  filter(!is.na(combined_val)) %>%
  mutate(
    # forward-fill the code and description down to the value rows
    code = zoo::na.locf(code),
    basin_desc = zoo::na.locf(basin_desc)
  ) %>%
  select(code, combined_val, basin_desc)
#> # A tibble: 643 x 3
#> code combined_val basin_desc
#> <chr> <dbl> <chr>
#> 1 ACSC1 0 San Antonio Ck - Sunol
#> 2 ADLC1 0 Arroyo De La Laguna
#> 3 ADOC1 0 Santa Ana R - Prado Dam
#> 4 AHOC1 0 Arroyo Honda nr San Jose
#> 5 AKYC1 41 SF American nr Kyburz
#> 6 AKYC1 3.2 SF American nr Kyburz
#> 7 AKYC1 42.2 SF American nr Kyburz
#> 8 ALQC1 0 Alamo Canal nr Pleasanton
#> 9 ALRC1 0 Alamitos Ck - Almaden Res
#> 10 ANDC1 0 Coyote Ck - Anderson Res
#> # ... with 633 more rows
Created on 2019-03-27 by the reprex package (v0.2.1)
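As a side note, the two zoo::na.locf() calls can be replaced by tidyr's fill(), which is already attached via the tidyverse; a toy demonstration of the equivalence:

# tidyr::fill() forward-fills NAs just like zoo::na.locf()
tibble(code = c("AKYC1", NA, NA, "ALQC1"), val = 1:4) %>%
  fill(code)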
This is pretty straightforward to scrape with rvest:
library(rvest)
text <- read_html("https://www.nohrsc.noaa.gov/shef_archive/index.html?rfc=cnrfc&product=swe&year=2019&month=3&day=27&hour=6") %>%
  html_node(".notes") %>%
  html_text()
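To check that the right block was captured, you can peek at the first few lines of the result:

# the first line should be the "SRUS43 KMSR 271039" header
writeLines(head(strsplit(text, "\n")[[1]], 5))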

Is rvest the best tool to collect information from this table?

I have used the rvest package to extract a list of companies and the a.href elements for each company, which I need to proceed with the data collection process. This is the link to the website: http://www.bursamalaysia.com/market/listed-companies/list-of-companies/main-market.
I have used the following code to extract the table, but nothing comes out. I tried other approaches, such as those posted in "Scraping table of NBA stats with rvest" and similar links, but I cannot obtain what I want. Any help would be greatly appreciated.
My code:
link.main <- "http://www.bursamalaysia.com/market/listed-companies/list-of-companies/main-market/"

web <- read_html(link.main) %>%
  html_nodes("table#bm_equities_prices_table")
# it does not work even when I write html_nodes("table"),
# ".table", or "#bm_equities_prices_table"

web <- read_html(link.main) %>%
  html_nodes(".bm_center.bm_dataTable")
# not working

web <- link.main %>% read_html() %>% html_table()
# to inspect the position of the table on this website
The page generates the table using JavaScript, so you need to run a browser session that executes the JavaScript, for example with RSelenium (a plain HTML parser such as rvest or Python's Beautiful Soup won't run it).
Another alternative is to use the awesome decapitated package by @hrbrmstr, which basically runs a headless Chrome browser session in the background.
# devtools::install_github("hrbrmstr/decapitated")
library(decapitated)
library(rvest)
library(tibble)  # for as_tibble()

res <- chrome_read_html(link.main)

main_df <- res %>%
  rvest::html_table() %>%
  .[[1]] %>%
  as_tibble()
This outputs the content of the table all right. If you want to get to the elements underlying the table (the href attributes behind the table text), you will need a bit more list gymnastics. Some of the cells in the table have no links at all, which made extracting by CSS difficult.
library(dplyr)
library(purrr)

href_lst <- res %>%
  html_nodes("table td") %>%
  as_list() %>%
  map("a") %>%
  map(~attr(.x, "href"))

# we need every third element, starting from the second element
idx <- seq.int(from = 2, by = 3, length.out = nrow(main_df))

href_df <- tibble(
  market_href = as.character(href_lst[idx]),
  company_href = as.character(href_lst[idx + 1])
)

bind_cols(main_df, href_df)
#> # A tibble: 800 x 5
#> No `Company Name` `Company Website` market_href company_href
#> <int> <chr> <chr> <chr> <chr>
#> 1 1 7-ELEVEN MALAYS~ http://www.7elev~ /market/list~ http://www.~
#> 2 2 A-RANK BERHAD [~ http://www.arank~ /market/list~ http://www.~
#> 3 3 ABLEGROUP BERHA~ http://www.gefun~ /market/list~ http://www.~
#> 4 4 ABM FUJIYA BERH~ http://www.abmfu~ /market/list~ http://www.~
#> 5 5 ACME HOLDINGS B~ http://www.suppo~ /market/list~ http://www.~
#> 6 6 ACOUSTECH BERHA~ http://www.acous~ /market/list~ http://www.~
#> 7 7 ADVANCE SYNERGY~ http://www.asb.c~ /market/list~ http://www.~
#> 8 8 ADVANCECON HOLD~ http://www.advan~ /market/list~ http://www.~
#> 9 9 ADVANCED PACKAG~ http://www.advan~ /market/list~ http://www.~
#> 10 10 ADVENTA BERHAD ~ http://www.adven~ /market/list~ http://www.~
#> # ... with 790 more rows
Another option, without a browser: the backend returns the table HTML wrapped in a JSONP callback, so you can fetch it directly and strip the wrapper:
library(httr)
library(jsonlite)
library(XML)

r <- httr::GET(paste0(
  "http://ws.bursamalaysia.com/market/listed-companies/list-of-companies/list_of_companies_f.html",
  "?_=1532479072277",
  "&callback=jQuery16206432131784246533_1532479071878",
  "&alphabet=",
  "&market=main_market",
  "&_=1532479072277"))

# strip the JSONP wrapper: drop the trailing ")" and the callback name
l <- rawToChar(r$content)
m <- gsub("jQuery16206432131784246533_1532479071878(", "", substring(l, 1, nchar(l) - 1), fixed = TRUE)

tbl <- XML::readHTMLTable(jsonlite::fromJSON(m)$html)$bm_equities_prices_table
output:
> head(tbl)
# No Company Name Company Website
#1 1 7-ELEVEN MALAYSIA HOLDINGS BERHAD http://www.7eleven.com.my
#2 2 A-RANK BERHAD [S] http://www.arank.com.my
#3 3 ABLEGROUP BERHAD [S] http://www.gefung.com.my
#4 4 ABM FUJIYA BERHAD [S] http://www.abmfujiya.com.my
#5 5 ACME HOLDINGS BERHAD [S] http://www.supportivetech.com/
#6 6 ACOUSTECH BERHAD [S] http://www.acoustech.com.my/
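Since the question also asked for the a.href elements, they can be pulled from the same HTML payload; a sketch using XML's XPath helpers, assuming the links sit inside the bm_equities_prices_table cells:

# parse the embedded HTML and collect every href in the table
doc <- XML::htmlParse(jsonlite::fromJSON(m)$html, asText = TRUE)
hrefs <- XML::xpathSApply(doc, "//table[@id='bm_equities_prices_table']//a",
                          XML::xmlGetAttr, "href")
head(hrefs)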

Replacing missing value when web scraping (rvest)

I'm trying to write a script that goes through a list of players provided by the website Transfermarkt and gathers some information about them. For that I've created the script below, but ran into a problem with 1 of the 29 players on the list. Because one page is arranged differently from the others, the code outputs a list of only 28 players, since it can't find the information on that page.
I understand why the code I've written doesn't find any information on the given page, giving me a list of 28, but I don't know how to rewrite the code to achieve what I want:
for the script to simply replace the entry with a "-" if it does not find anything (in this case a nationality) for the node on a particular page, and to return a full list of 29 players with all the other info in it.
The player page in question is this, and while the other pages have the nationality node used in the code, on this page it's ".dataValue span".
I'm still quite new to R and it might be an easy fix, but at the moment I can't figure it out. Any help or advice is appreciated.
URL <- "http://www.transfermarkt.de/fc-bayern-munchen/leistungsdaten/verein/27/reldata/%262016/plus/1"
WS <- read_html(URL)

Team <- WS %>% html_nodes(".spielprofil_tooltip") %>% html_attr("href") %>% as.character()
Team <- paste0("http://www.transfermarkt.de", Team)

Catcher <- data.frame(Name = character(), Nat = character(), Vertrag = character())

for (i in Team) {
  WS1 <- read_html(i)
  Name <- WS1 %>% html_nodes("h1") %>% html_text() %>% as.character()
  Nat <- WS1 %>% html_nodes(".hide-for-small+ p .dataValue span") %>% html_text() %>% as.character()
  Vertrag <- WS1 %>% html_nodes(".dataValue:nth-child(9)") %>% html_text() %>% as.character()
  if (length(Nat) > 0) {
    temp <- data.frame(Name, Nat, Vertrag)
    Catcher <- rbind(Catcher, temp)
  }
  cat("*")
}

num_Rows <- nrow(Catcher)
odd_indexes <- seq(1, num_Rows, 2)
Catcher <- data.frame(Catcher[odd_indexes, ])
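For reference, the literal behaviour asked for (a "-" placeholder whenever the node is missing) only needs a length check inside the loop before the data frame is built; a minimal sketch of that one change:

# inside the loop: pad empty results with "-" so every player yields one row
if (length(Nat) == 0) Nat <- "-"
if (length(Vertrag) == 0) Vertrag <- "-"
temp <- data.frame(Name, Nat, Vertrag)
Catcher <- rbind(Catcher, temp)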
It's honestly easier to scrape the whole table, just in case things move around. I find purrr a helpful complement to rvest, as it lets you iterate over URLs and node lists and easily coerce the results to data.frames:
library(rvest)
library(purrr)

# build dynamically if you like
urls <- c(boateng = 'http://www.transfermarkt.de/jerome-boateng/profil/spieler/26485',
          friedl = 'http://www.transfermarkt.de/marco-friedl/profil/spieler/156990')

# scrape once, parse iteratively
html <- urls %>% map(read_html)

df <- html %>%
  map(html_nodes, '.dataDaten p') %>%
  map_df(map_df,
         ~list(
           variable = .x %>% html_node('.dataItem') %>% html_text(trim = TRUE),
           value = .x %>% html_node('.dataValue') %>% html_text(trim = TRUE) %>% gsub('\\s+', ' ', .)
         ),
         .id = 'player')
df
#> # A tibble: 17 × 3
#> player variable value
#> <chr> <chr> <chr>
#> 1 boateng Geb./Alter: 03.09.1988 (28)
#> 2 boateng Geburtsort: Berlin
#> 3 boateng Nationalität: Deutschland
#> 4 boateng Größe: 1,92 m
#> 5 boateng Position: Innenverteidiger
#> 6 boateng Vertrag bis: 30.06.2021
#> 7 boateng Berater: SAM SPORTS
#> 8 boateng Nationalspieler: Deutschland
#> 9 boateng Länderspiele/Tore: 67/1
#> 10 friedl Geb./Alter: 16.03.1998 (19)
#> 11 friedl Nationalität: Österreich
#> 12 friedl Größe: 1,87 m
#> 13 friedl Position: Linker Verteidiger
#> 14 friedl Vertrag bis: 30.06.2021
#> 15 friedl Berater: acta7
#> 16 friedl Akt. Nationalspieler: Österreich U19
#> 17 friedl Länderspiele/Tore: 6/0
Alternatively, that particular piece of data is in three places on those pages, so if one is inconsistent there's a chance the others are better. Or grab it from the table for the whole team: the countries are not printed there, but they're in the title attribute of the flag images, which can be grabbed with html_attr:
html <- read_html('http://www.transfermarkt.de/fc-bayern-munchen/leistungsdaten/verein/27/reldata/%262016/plus/1')

team <- html %>%
  html_nodes('tr.odd, tr.even') %>%
  map_df(~list(player = .x %>% html_node('a.spielprofil_tooltip') %>% html_text(),
               nationality = .x %>% html_nodes('img.flaggenrahmen') %>% html_attr('title') %>% toString()))
team
team
#> # A tibble: 29 × 2
#> player nationality
#> <chr> <chr>
#> 1 Manuel Neuer Deutschland
#> 2 Sven Ulreich Deutschland
#> 3 Tom Starke Deutschland
#> 4 Jérôme Boateng Deutschland
#> 5 David Alaba Österreich
#> 6 Mats Hummels Deutschland
#> 7 Javi Martínez Spanien
#> 8 Juan Bernat Spanien
#> 9 Philipp Lahm Deutschland
#> 10 Rafinha Brasilien, Deutschland
#> # ... with 19 more rows

r rvest webscraping hltv

Yes, this is just another "how-to-scrape" question. Sorry for that, but I've read the previous answers and the rvest manual as well.
I'm web-scraping for my homework (so I don't plan to use the data for any commercial purpose). The idea is to show that the average skill of a team affects individual skill. I'm trying to use CS:GO data from HLTV.org for it.
The information is available at http://www.hltv.org/?pageid=173&playerid=9216
I need two tables: Key stats (data only) and Teammates (data and URLs). I tried using CSS selectors generated by SelectorGadget and I also tried to analyze the source code of the webpage. I failed. I'm doing the following:
library(rvest)
library(dplyr)
url <- 'http://www.hltv.org/?pageid=173&playerid=9216'
info <- html_session(url) %>% read_html()
info %>% html_node('.covSmallHeadline') %>% html_text()
Can you please tell me what the right CSS selector is?
If you look at the source, those tables aren't HTML tables, just piles of divs with inconsistent nesting and inline CSS for alignment. It's therefore easiest to grab all the text and fix up the strings afterwards, since each line of data is either entirely numeric or not numeric at all.
library(rvest)
library(tidyverse)

h <- 'http://www.hltv.org/?pageid=173&playerid=9216' %>% read_html()

h %>% html_nodes('.covGroupBoxContent') %>% .[-1] %>%
  html_text(trim = TRUE) %>%
  strsplit('\\s*\\n\\s*') %>%
  setNames(map_chr(., ~.x[1])) %>% map(~.x[-1]) %>%
  map(~tibble(variable = gsub('[.0-9]+', '', .x),
              value = parse_number(.x)))
#> $`Key stats`
#> # A tibble: 9 × 2
#> variable value
#> <chr> <dbl>
#> 1 Total kills 9199.00
#> 2 Headshot %% 46.00
#> 3 Total deaths 6910.00
#> 4 K/D Ratio 1.33
#> 5 Maps played 438.00
#> 6 Rounds played 11242.00
#> 7 Average kills per round 0.82
#> 8 Average deaths per round 0.61
#> 9 Rating (?) 1.21
#>
#> $TeammatesRating
#> # A tibble: 4 × 2
#> variable value
#> <chr> <dbl>
#> 1 Gabriel 'FalleN' Toledo 1.11
#> 2 Fernando 'fer' Alvarenga 1.11
#> 3 Joao 'felps' Vasconcellos 1.09
#> 4 Epitacio 'TACO' de Melo 0.98
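The Teammates table was also wanted with its URLs, which the text-only approach above drops. A hedged sketch, assuming the teammate names are rendered as <a> tags inside the same .covGroupBoxContent boxes (verify the selector against the live page):

# collect the teammate link nodes and pair each name with its href
mate_nodes <- h %>% html_nodes('.covGroupBoxContent a')
tibble(name = mate_nodes %>% html_text(trim = TRUE),
       href = mate_nodes %>% html_attr('href'))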
