Xpath not "seeing" table in rvest

I am attempting to scrape a table on this website. I have done it on several other websites but can't make it work on this one.
Does it have anything to do with the fact that the table is generated and not static?
Here is my code:
library(tidyverse)
library(rvest)
link <- "https://portal.vietcombank.com.vn/Personal/TG/Pages/ty-gia.aspx?devicechannel=default"
webpage <- read_html(link)
webpage %>%
  html_nodes(xpath = "//*[@class = 'tbl-01 rateTable']") %>%
  html_table()
Thank you all very much.

The data is actually loaded from another URL, which you can find by looking under the Network tab in the browser's developer tools.
library(rvest)
url <- 'https://portal.vietcombank.com.vn/UserControls/TVPortal.TyGia/pListTyGia.aspx?txttungay=5/10/2021&BacrhID=1&isEn=False'
data <- url %>% read_html() %>% html_table() %>% .[[1]]
data
# `Ngoại tệ` `Ngoại tệ` Mua Mua Bán
# <chr> <chr> <chr> <chr> <chr>
# 1 Tên ngoại tệ Mã NT Tiền mặt Chuyển khoản Bán
# 2 AUSTRALIAN DOLLAR AUD 16,146.00 16,309.09 16,820.59
# 3 CANADIAN DOLLAR CAD 17,613.63 17,791.55 18,349.54
# 4 SWISS FRANC CHF 23,991.41 24,233.75 24,993.78
# 5 YUAN RENMINBI CNY 3,459.86 3,494.81 3,604.96
# 6 DANISH KRONE DKK - 3,489.44 3,620.54
# 7 EURO EUR 25,758.61 26,018.80 27,099.20
# 8 POUND STERLING GBP 30,155.61 30,460.21 31,415.53
# 9 HONGKONG DOLLAR HKD 2,849.58 2,878.36 2,968.63
#10 INDIAN RUPEE INR - 304.65 316.60
# … with 11 more rows
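The parsed table repeats the header in its first row, and the numeric columns come back as comma-formatted strings. A minimal cleanup sketch (the column names here are my own choice, not from the site; parse_number() turns the "-" entries into NA with a warning):
library(dplyr)
library(readr)

rates <- data %>%
  slice(-1) %>%                                  # drop the repeated header row
  setNames(c("currency", "code", "buy_cash", "buy_transfer", "sell")) %>%
  mutate(across(buy_cash:sell, parse_number))    # "16,146.00" -> 16146
rates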

Related

How to Scrape multi page website using R language

I want to scrape the contents of a multi-page website using R. Currently I'm able to scrape the first page; how do I scrape all pages and store them in a csv?
Here's my code so far:
library(rvest)
library(tibble)
library(tidyr)
library(dplyr)
df = 'https://www.taneps.go.tz/epps/viewAllAwardedContracts.do?d-3998960-p=1&selectedItem=viewAllAwardedContracts.do&T01_ps=100' %>%
read_html() %>% html_table()
df
write.csv(df,"Contracts_test_taneps.csv")
Scrape multiple pages, changing 1:2 to 1:N, where N is the number of pages you need:
library(tidyverse)
library(rvest)
get_taneps <- function(page) {
  str_c("https://www.taneps.go.tz/epps/viewAllAwardedContracts.do?d-3998960-p=",
        page, "&selectedItem=viewAllAwardedContracts.do&T01_ps=100") %>%
    read_html() %>%
    html_table() %>%
    getElement(1) %>%
    janitor::clean_names()
}
df <- map_dfr(1:2, get_taneps)
df
# A tibble: 200 x 7
tender_no procuring_entity suppl~1 award~2 award~3 lot_n~4 notic~5
<chr> <chr> <chr> <chr> <chr> <chr> <lgl>
1 AE/005/2022-2023/MOROGORO/FA/G/01 Morogoro Municipal Council SHIBAM~ 08/11/~ "66200~ N/A NA
2 AE/005/2022-2023/DODOMA/FA/NC/02 Ministry of Livestock and Fish~ NINO G~ 04/11/~ "46511~ N/A NA
3 LGA/014/2022/2023/G/01 UTAWALA Bagamoyo District Council VILANG~ 02/11/~ "90000~ N/A NA
4 LGA/014/014/2022/2023/G/01 FEDHA 3EPICAR Bagamoyo District Council VILANG~ 02/11/~ "88100~ N/A NA
5 LGA/014/2022/2023/G/01/ARDHI Bagamoyo District Council VILANG~ 31/10/~ "16088~ N/A NA
6 LGA/014/2022/2023/G/11 VIFAA VYA USAFI SOKO LA SAMAKI Bagamoyo District Council MBUTUL~ 31/10/~ "10000~ N/A NA
7 DCD - 000899- 400E - ANIMAL FEEDS Kibaha Education Centre ALOYCE~ 29/10/~ "82400~ N/A NA
8 AE/005/2022-2023/MOROGORO/FA/G/01 Morogoro Regional Referral Hos~ JIGABH~ 02/11/~ "17950~ N/A NA
9 IE/023/2022-23/HQ/G/13 Commission for Mediation and A~ AKO GR~ 27/10/~ "42500~ N/A NA
10 AE/005/2022-2023/MOROGORO/FA/G/05 Morogoro Municipal Council THE GR~ 01/11/~ "17247~ N/A NA
# ... with 190 more rows, and abbreviated variable names 1: supplier_name, 2: award_date, 3: award_amount, 4: lot_name,
# 5: notice_pdf
# ℹ Use `print(n = ...)` to see more rows
Write as .csv
write_csv(df, "Contracts_test_taneps.csv")
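If any page times out or lacks a table, map_dfr() will abort partway through. A small defensive variant, assuming the get_taneps() function above (the 1-second pause is just a politeness choice, not something the site requires):
library(tidyverse)

safe_get <- possibly(get_taneps, otherwise = NULL)  # failed pages become NULL and are dropped
df <- map_dfr(1:2, ~ {
  Sys.sleep(1)  # pause between requests to go easy on the server
  safe_get(.x)
})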

Problem scraping a webpage with R and Rvest

I used the code below to extract a table from a webpage:
library(rvest)
library(dplyr)
#Link to site and then getting html code.
link <- "https://www.stats.gov.sa/en/915"
page <- read_html(link)
#extract table from html
files <- page %>%
  html_nodes("table") %>%
  .[[1]] %>%
  html_table()
However, I get results that differ from those on the webpage. The results are shown below:
A tibble: 1 × 4
Name Report Period Periodicity Download
1 please wait ... please wait ... please wait ... please wait ...
I was wondering: is there a way to get the table in the form I view it in the web browser, without using RSelenium? RSelenium doesn't seem to work with RStudio online.
RSelenium could be the solution.
Below is a simple example:
library(RSelenium)
library(rvest)
library(dplyr)
#Your URL
URL <- "https://www.stats.gov.sa/en/915"
#Open the browser by RSelenium
rD <- RSelenium::rsDriver(browser = "firefox", port = 4544L, verbose = F)
remDr <- rD[["client"]]
#Open the page into browser
remDr$navigate(URL)
#Get the table that you see
remDr$getPageSource()[[1]] %>%
read_html() %>%
html_table()
[[1]]
# A tibble: 13 x 4
Name `Report Period` Periodicity Download
<chr> <int> <chr> <lgl>
1 Ar-Riyad Region 2017 Annual NA
2 Makkah Al-Mokarramah Region 2017 Annual NA
3 Al-Madinah Al-Monawarah Region 2017 Annual NA
4 Al-Qaseem Region 2017 Annual NA
5 Eastern Region 2017 Annual NA
6 Aseer Region 2017 Annual NA
7 Tabouk Region 2017 Annual NA
8 Hail Region 2017 Annual NA
9 Northern Borders Region 2017 Annual NA
10 Jazan Region 2017 Annual NA
11 Najran Region 2017 Annual NA
12 Al-Baha Region 2017 Annual NA
13 Al-Jouf Region 2017 Annual NA
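If setting up a Selenium server is a hurdle, newer rvest versions (1.0.4 and later, with the chromote package installed) offer read_html_live(), which renders the page in a headless Chrome session. A sketch, assuming those versions are available in your environment:
library(rvest)  # >= 1.0.4, plus chromote

page <- read_html_live("https://www.stats.gov.sa/en/915")  # a real browser runs the JavaScript
page %>%
  html_element("table") %>%
  html_table()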

URL substitution using vectors

I'm trying to substitute the years 2000-2020 into the URL by using vectors. The error I get is "Can't combine 1$Rnk and 4$Rnk". How can I fix this?
library(tidyverse)
library(rvest)

TDF_wtables <- function(url){
  url %>%
    read_html() %>%
    # extract the table part of the html code
    html_node("table") %>%
    # create an R dataset from the webpage contents
    html_table() %>%
    # keep only the columns of interest
    select(c("Rnk", "Rider", "Team", "Time")) %>%
    as_tibble()
}
Year <- 2000:2020
TDFurls <- str_c("https://www.procyclingstats.com/race/tour-de-france/",Year,"/gc")
Maps <- map_dfr(TDFurls, TDF_wtables, .id = "Year")
Maps
You get that error because some of the yearly tables contain rows whose Rnk value is not a plain number (riders without a numeric finishing rank, for example). For any table with such rows, html_table() cannot tell whether the Rnk column should be of a character type or an integer type, so the per-year tibbles refuse to combine. The simplest solution is to convert everything to character:
TDF_wtables <- function(url){
  url %>%
    read_html() %>%
    # extract the table part of the html code
    html_node("table") %>%
    # create an R dataset from the webpage contents
    html_table() %>%
    # keep only the columns of interest
    select(c("Rnk", "Rider", "Team", "Time")) %>%
    as_tibble() %>%
    # all columns as character
    mutate(across(.fns = as.character))
}
Output
> Maps <- map_dfr(TDFurls, TDF_wtables)
> Maps
# A tibble: 3,225 x 4
Rnk Rider Team Time
<chr> <chr> <chr> <chr>
1 1 Zanini StefanoMapei - Quickstep Mapei - Quickstep 3:12:363:12:36
2 2 Zabel ErikTeam Telekom Team Telekom ,,0:00
3 3 Vainšteins RomānsVini Caldirola - Sidermec Vini Caldirola - Sidermec ,,0:00
4 4 Rodriguez FredMapei - Quickstep Mapei - Quickstep ,,0:00
5 5 van Heeswijk MaxMapei - Quickstep Mapei - Quickstep ,,0:00
6 6 Magnien EmmanuelLa Française des Jeux La Française des Jeux ,,0:00
7 7 Simon FrançoisBonjour - Toupargel Bonjour - Toupargel ,,0:00
8 8 McEwen RobbieFarm Frites Farm Frites ,,0:00
9 9 Commesso SalvatoreSaeco Saeco ,,0:00
10 10 Piziks ArvisMemoryCard - Jack & Jones MemoryCard - Jack & Jones ,,0:00
# ... with 3,215 more rows
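An alternative to coercing the columns after the fact: html_table() accepts a convert argument, and convert = FALSE leaves every column as character from the start (assuming rvest 1.0 or later):
url %>%
  read_html() %>%
  html_node("table") %>%
  html_table(convert = FALSE) %>%  # no type guessing, everything stays character
  select(c("Rnk", "Rider", "Team", "Time"))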
Update
That Time column has two spans nested in each row. They have the same text, so you have to trim off one of them to avoid repetition. Also, I just realised that your code does not get you the table you want. You only want the table on the "GC" tab, right? Consider the following function:
TDF_wtables <- function(url){
  gc_table <-
    url %>%
    read_html() %>%
    html_node("div[class$='resultCont '] > table")
  timeff <-
    gc_table %>%
    html_nodes("tbody > tr > td > span.timeff") %>%
    html_text()
  gc_table %>%
    html_table() %>%
    select(c("Rnk", "Rider", "Team", "Time")) %>%
    as_tibble() %>%
    mutate(Time = timeff, across(.fns = as.character))
}
Output
# A tibble: 3,222 x 4
Rnk Rider Team Time
<chr> <chr> <chr> <chr>
1 NA Armstrong LanceUS Postal Service US Postal Service " 92:33:08"
2 2 Ullrich JanTeam Telekom Team Telekom "6:02"
3 3 Beloki JosebaFestina - Lotus Festina - Lotus "10:04"
4 4 Moreau ChristopheFestina - Lotus Festina - Lotus "10:34"
5 5 Heras RobertoKelme - Costa Blanca Kelme - Costa Blanca "11:50"
6 6 Virenque RichardPolti Polti "13:26"
7 7 Botero SantiagoKelme - Costa Blanca Kelme - Costa Blanca "14:18"
8 8 Escartín FernandoKelme - Costa Blanca Kelme - Costa Blanca "17:21"
9 9 Mancebo FranciscoBanesto Banesto "18:09"
10 10 Nardello DanieleMapei - Quickstep Mapei - Quickstep "18:25"
# ... with 3,212 more rows

Is rvest the best tool to collect information from this table?

I have used the rvest package to extract a list of companies and the a href elements for each company, which I need for the next step of the data collection process. This is the link of the website: http://www.bursamalaysia.com/market/listed-companies/list-of-companies/main-market.
I have used the following code to extract the table, but nothing comes out. I tried other approaches, such as those posted in "Scraping table of NBA stats with rvest" and similar links, but I cannot obtain what I want. Any help would be greatly appreciated.
My code:
link.main <-
  "http://www.bursamalaysia.com/market/listed-companies/list-of-companies/main-market/"
web <- read_html(link.main) %>%
  html_nodes("table#bm_equities_prices_table")
# it does not work even when I write html_nodes("table"),
# ".table" or "#bm_equities_prices_table"
web <- read_html(link.main) %>%
  html_nodes(".bm_center.bm_dataTable")
# not working
web <- link.main %>% read_html() %>% html_table()
# to inspect the position of the table in this website
The page generates the table using JavaScript, so you need something that simulates the browser session and allows the JavaScript to run, such as RSelenium.
Another alternative is the awesome package by @hrbrmstr called decapitated, which basically runs a headless Chrome browser session in the background.
# devtools::install_github("hrbrmstr/decapitated")
library(decapitated)
library(rvest)
library(tibble)  # for as_tibble()

res <- chrome_read_html(link.main)
main_df <- res %>%
  rvest::html_table() %>%
  .[[1]] %>%
  as_tibble()
This outputs the content of the table all right. If you want to get at the elements underlying the table (the href attributes behind the table text), you will need to do a bit more list gymnastics. Some of the cells in the table are actually missing links, so extracting by CSS proved difficult.
library(dplyr)
library(purrr)
href_lst <- res %>%
  html_nodes("table td") %>%
  as_list() %>%
  map("a") %>%
  map(~attr(.x, "href"))
# we need every third element, starting from the second element
idx <- seq.int(from = 2, by = 3, length.out = nrow(main_df))
href_df <- tibble(
  market_href = as.character(href_lst[idx]),
  company_href = as.character(href_lst[idx + 1])
)
bind_cols(main_df, href_df)
#> # A tibble: 800 x 5
#> No `Company Name` `Company Website` market_href company_href
#> <int> <chr> <chr> <chr> <chr>
#> 1 1 7-ELEVEN MALAYS~ http://www.7elev~ /market/list~ http://www.~
#> 2 2 A-RANK BERHAD [~ http://www.arank~ /market/list~ http://www.~
#> 3 3 ABLEGROUP BERHA~ http://www.gefun~ /market/list~ http://www.~
#> 4 4 ABM FUJIYA BERH~ http://www.abmfu~ /market/list~ http://www.~
#> 5 5 ACME HOLDINGS B~ http://www.suppo~ /market/list~ http://www.~
#> 6 6 ACOUSTECH BERHA~ http://www.acous~ /market/list~ http://www.~
#> 7 7 ADVANCE SYNERGY~ http://www.asb.c~ /market/list~ http://www.~
#> 8 8 ADVANCECON HOLD~ http://www.advan~ /market/list~ http://www.~
#> 9 9 ADVANCED PACKAG~ http://www.advan~ /market/list~ http://www.~
#> 10 10 ADVENTA BERHAD ~ http://www.adven~ /market/list~ http://www.~
#> # ... with 790 more rows
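The market_href values are relative to the site root. If you need absolute URLs, xml2::url_absolute() resolves them against a base (a small follow-up sketch; the base URL is taken from the page being scraped):
library(xml2)

href_df <- href_df %>%
  mutate(market_href = url_absolute(market_href, "http://www.bursamalaysia.com"))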
Another option, without using a browser:
library(httr)
library(jsonlite)
library(XML)
r <- httr::GET(paste0(
  "http://ws.bursamalaysia.com/market/listed-companies/list-of-companies/list_of_companies_f.html",
  "?_=1532479072277",
  "&callback=jQuery16206432131784246533_1532479071878",
  "&alphabet=",
  "&market=main_market",
  "&_=1532479072277"))
l <- rawToChar(r$content)
m <- gsub("jQuery16206432131784246533_1532479071878(", "", substring(l, 1, nchar(l) - 1), fixed = TRUE)
tbl <- XML::readHTMLTable(jsonlite::fromJSON(m)$html)$bm_equities_prices_table
output:
> head(tbl)
# No Company Name Company Website
#1 1 7-ELEVEN MALAYSIA HOLDINGS BERHAD http://www.7eleven.com.my
#2 2 A-RANK BERHAD [S] http://www.arank.com.my
#3 3 ABLEGROUP BERHAD [S] http://www.gefung.com.my
#4 4 ABM FUJIYA BERHAD [S] http://www.abmfujiya.com.my
#5 5 ACME HOLDINGS BERHAD [S] http://www.supportivetech.com/
#6 6 ACOUSTECH BERHAD [S] http://www.acoustech.com.my/
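The long jQuery callback name is session-specific, so hard-coding it is brittle. A more generic way to unwrap the JSONP response (a sketch that assumes the payload is a single callback(...) wrapper):
m <- sub("^[^(]*\\(", "", l)           # drop everything up to and including the first "("
m <- sub("\\);?[[:space:]]*$", "", m)  # drop the trailing ")" and any ";"
tbl <- XML::readHTMLTable(jsonlite::fromJSON(m)$html)$bm_equities_prices_table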

Replacing missing value when web scraping (rvest)

I'm trying to write a script that will go through a list of players provided by the website Transfermarkt and gather some information about them. For that, I've created the script below, but I ran into a problem with 1 of the 29 players in the list. Because one page is arranged differently than the others, the code outputs a list of only 28 players, since it can't find the information on that page.
I understand why the code I've written doesn't find any information on the given page and thus gives me a list of 28, but I don't know how to rewrite the code to achieve what I want:
for the script to simply insert a "-" if it does not find anything (in this case a nationality) for the node on a particular page, and to return a full list of 29 players with all the other info in it.
The player page in question is this one, and while the other pages have the node used in the code for nationality, here it's ".dataValue span".
I'm still quite new to R and it might be an easy fix, but at the moment I can't figure it out. Any help or advice is appreciated.
library(rvest)

URL <- "http://www.transfermarkt.de/fc-bayern-munchen/leistungsdaten/verein/27/reldata/%262016/plus/1"
WS <- read_html(URL)
Team <- WS %>% html_nodes(".spielprofil_tooltip") %>% html_attr("href") %>% as.character()
Team <- paste0("http://www.transfermarkt.de", Team)
Catcher <- data.frame(Name = character(), Nat = character(), Vertrag = character())
for (i in Team) {
  WS1 <- read_html(i)
  Name <- WS1 %>% html_nodes("h1") %>% html_text() %>% as.character()
  Nat <- WS1 %>% html_nodes(".hide-for-small+ p .dataValue span") %>% html_text() %>% as.character()
  Vertrag <- WS1 %>% html_nodes(".dataValue:nth-child(9)") %>% html_text() %>% as.character()
  if (length(Nat) > 0) {
    temp <- data.frame(Name, Nat, Vertrag)
    Catcher <- rbind(Catcher, temp)
  }
  cat("*")
}
num_Rows <- nrow(Catcher)
odd_indexes <- seq(1, num_Rows, 2)
Catcher <- data.frame(Catcher[odd_indexes, ])
It's honestly easier to scrape the whole table, just in case things move around. I find purrr is a helpful complement for rvest, allowing you to iterate over URLs and node lists and easily coerce results to data.frames:
library(rvest)
library(purrr)
# build dynamically if you like
urls <- c(boateng = 'http://www.transfermarkt.de/jerome-boateng/profil/spieler/26485',
          friedl = 'http://www.transfermarkt.de/marco-friedl/profil/spieler/156990')
# scrape once, parse iteratively
html <- urls %>% map(read_html)
df <- html %>%
  map(html_nodes, '.dataDaten p') %>%
  map_df(map_df,
         ~list(
           variable = .x %>% html_node('.dataItem') %>% html_text(trim = TRUE),
           value = .x %>% html_node('.dataValue') %>% html_text(trim = TRUE) %>% gsub('\\s+', ' ', .)
         ),
         .id = 'player')
df
#> # A tibble: 17 × 3
#> player variable value
#> <chr> <chr> <chr>
#> 1 boateng Geb./Alter: 03.09.1988 (28)
#> 2 boateng Geburtsort: Berlin
#> 3 boateng Nationalität: Deutschland
#> 4 boateng Größe: 1,92 m
#> 5 boateng Position: Innenverteidiger
#> 6 boateng Vertrag bis: 30.06.2021
#> 7 boateng Berater: SAM SPORTS
#> 8 boateng Nationalspieler: Deutschland
#> 9 boateng Länderspiele/Tore: 67/1
#> 10 friedl Geb./Alter: 16.03.1998 (19)
#> 11 friedl Nationalität: Österreich
#> 12 friedl Größe: 1,87 m
#> 13 friedl Position: Linker Verteidiger
#> 14 friedl Vertrag bis: 30.06.2021
#> 15 friedl Berater: acta7
#> 16 friedl Akt. Nationalspieler: Österreich U19
#> 17 friedl Länderspiele/Tore: 6/0
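The result is in long format, one row per attribute. If you want one row per player instead, tidyr can reshape it (a sketch assuming a current tidyr with pivot_wider()):
library(tidyr)

df_wide <- df %>%
  pivot_wider(names_from = variable, values_from = value)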
Alternatively, that particular piece of data appears in three places on those pages, so if one is inconsistent there's a chance the others are better. Or grab it from the table for the whole team: countries are not printed, but they're in the title attribute of the flag images, which can be grabbed with html_attr:
html <- read_html('http://www.transfermarkt.de/fc-bayern-munchen/leistungsdaten/verein/27/reldata/%262016/plus/1')
team <- html %>%
  html_nodes('tr.odd, tr.even') %>%
  map_df(~list(player = .x %>% html_node('a.spielprofil_tooltip') %>% html_text(),
               nationality = .x %>% html_nodes('img.flaggenrahmen') %>% html_attr('title') %>% toString()))
team
#> # A tibble: 29 × 2
#> player nationality
#> <chr> <chr>
#> 1 Manuel Neuer Deutschland
#> 2 Sven Ulreich Deutschland
#> 3 Tom Starke Deutschland
#> 4 Jérôme Boateng Deutschland
#> 5 David Alaba Österreich
#> 6 Mats Hummels Deutschland
#> 7 Javi Martínez Spanien
#> 8 Juan Bernat Spanien
#> 9 Philipp Lahm Deutschland
#> 10 Rafinha Brasilien, Deutschland
#> # ... with 19 more rows
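If a row had no flag image at all, html_nodes() would return an empty set and toString() an empty string. To get the "-" placeholder the question asked for in that case, one final substitution (assuming dplyr is loaded):
library(dplyr)

team <- team %>%
  mutate(nationality = ifelse(nationality == "", "-", nationality))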
