Problem scraping a webpage with R and rvest

I used the code below to extract a table from a webpage:
library(rvest)
library(dplyr)

# Link to site, then get the HTML
link <- "https://www.stats.gov.sa/en/915"
page <- read_html(link)

# Extract the table from the HTML
files <- page %>%
  html_nodes("table") %>%
  .[[1]] %>%
  html_table()
However, I get results that differ from those on the webpage. The results are shown below:
A tibble: 1 × 4
Name Report Period Periodicity Download
1 please wait ... please wait ... please wait ... please wait ...
Is there a way to get the table as I see it in the web browser without using RSelenium? I ask because RSelenium doesn't seem to work in RStudio online.

The solution could be RSelenium. Below is a simple example:
library(RSelenium)
library(rvest)
library(dplyr)
#Your URL
URL <- "https://www.stats.gov.sa/en/915"
#Open the browser by RSelenium
rD <- RSelenium::rsDriver(browser = "firefox", port = 4544L, verbose = F)
remDr <- rD[["client"]]
#Open the page into browser
remDr$navigate(URL)
#Get the table that you see
remDr$getPageSource()[[1]] %>%
  read_html() %>%
  html_table()
[[1]]
# A tibble: 13 x 4
Name `Report Period` Periodicity Download
<chr> <int> <chr> <lgl>
1 Ar-Riyad Region 2017 Annual NA
2 Makkah Al-Mokarramah Region 2017 Annual NA
3 Al-Madinah Al-Monawarah Region 2017 Annual NA
4 Al-Qaseem Region 2017 Annual NA
5 Eastern Region 2017 Annual NA
6 Aseer Region 2017 Annual NA
7 Tabouk Region 2017 Annual NA
8 Hail Region 2017 Annual NA
9 Northern Borders Region 2017 Annual NA
10 Jazan Region 2017 Annual NA
11 Najran Region 2017 Annual NA
12 Al-Baha Region 2017 Annual NA
13 Al-Jouf Region 2017 Annual NA
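One caveat with dynamically rendered pages: immediately after navigate() the table may still show the "please wait ..." placeholder. A small polling helper can wait for the real content before parsing; this is a sketch (is_loaded, wait_for_source, and the 10-second cap are my own additions, not part of the original answer), and the driver should be released when you're done:

```r
# Returns TRUE once the page source no longer shows the loading placeholder.
is_loaded <- function(page_source) {
  !grepl("please wait", page_source, ignore.case = TRUE)
}

# Poll the browser until the placeholder disappears or `timeout` seconds pass.
wait_for_source <- function(remDr, timeout = 10) {
  start <- Sys.time()
  repeat {
    src <- remDr$getPageSource()[[1]]
    if (is_loaded(src)) return(src)
    if (as.numeric(difftime(Sys.time(), start, units = "secs")) > timeout) {
      stop("Timed out waiting for the table to load")
    }
    Sys.sleep(0.5)
  }
}

# Usage (after remDr$navigate(URL)):
# wait_for_source(remDr) %>% read_html() %>% html_table()

# Always release the browser and stop the server afterwards:
# remDr$close()
# rD$server$stop()
```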

Related

How to Scrape multi page website using R language

I want to scrape the contents of a multi-page website using R. Currently I'm able to scrape the first page; how do I scrape all pages and store them in a CSV?
Here's my code so far:
library(rvest)
library(tibble)
library(tidyr)
library(dplyr)
df <- 'https://www.taneps.go.tz/epps/viewAllAwardedContracts.do?d-3998960-p=1&selectedItem=viewAllAwardedContracts.do&T01_ps=100' %>%
  read_html() %>%
  html_table()
df
write.csv(df, "Contracts_test_taneps.csv")
Scrape multiple pages. Change 1:2 to 1:n, where n is the number of pages you need:
library(tidyverse)
library(rvest)
get_taneps <- function(page) {
  str_c("https://www.taneps.go.tz/epps/viewAllAwardedContracts.do?d-3998960-p=",
        page, "&selectedItem=viewAllAwardedContracts.do&T01_ps=100") %>%
    read_html() %>%
    html_table() %>%
    getElement(1) %>%
    janitor::clean_names()
}
map_dfr(1:2, get_taneps)
# A tibble: 200 x 7
tender_no procuring_entity suppl~1 award~2 award~3 lot_n~4 notic~5
<chr> <chr> <chr> <chr> <chr> <chr> <lgl>
1 AE/005/2022-2023/MOROGORO/FA/G/01 Morogoro Municipal Council SHIBAM~ 08/11/~ "66200~ N/A NA
2 AE/005/2022-2023/DODOMA/FA/NC/02 Ministry of Livestock and Fish~ NINO G~ 04/11/~ "46511~ N/A NA
3 LGA/014/2022/2023/G/01 UTAWALA Bagamoyo District Council VILANG~ 02/11/~ "90000~ N/A NA
4 LGA/014/014/2022/2023/G/01 FEDHA 3EPICAR Bagamoyo District Council VILANG~ 02/11/~ "88100~ N/A NA
5 LGA/014/2022/2023/G/01/ARDHI Bagamoyo District Council VILANG~ 31/10/~ "16088~ N/A NA
6 LGA/014/2022/2023/G/11 VIFAA VYA USAFI SOKO LA SAMAKI Bagamoyo District Council MBUTUL~ 31/10/~ "10000~ N/A NA
7 DCD - 000899- 400E - ANIMAL FEEDS Kibaha Education Centre ALOYCE~ 29/10/~ "82400~ N/A NA
8 AE/005/2022-2023/MOROGORO/FA/G/01 Morogoro Regional Referral Hos~ JIGABH~ 02/11/~ "17950~ N/A NA
9 IE/023/2022-23/HQ/G/13 Commission for Mediation and A~ AKO GR~ 27/10/~ "42500~ N/A NA
10 AE/005/2022-2023/MOROGORO/FA/G/05 Morogoro Municipal Council THE GR~ 01/11/~ "17247~ N/A NA
# ... with 190 more rows, and abbreviated variable names 1: supplier_name, 2: award_date, 3: award_amount, 4: lot_name,
# 5: notice_pdf
# i Use `print(n = ...)` to see more rows
Write as .csv (assign the combined result first):
df <- map_dfr(1:2, get_taneps)
write_csv(df, "Contracts_test_taneps.csv")
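If you don't know the page count in advance, you can keep fetching until a page comes back empty. A sketch under the assumption that an out-of-range page returns a zero-row table (scrape_all is my own helper, not from the original answer; fetch is any function that maps a page number to a data frame, e.g. get_taneps above):

```r
# Fetch page 1, 2, ... until one returns no rows, then combine the results.
scrape_all <- function(fetch, max_pages = 1000) {
  out <- list()
  for (page in seq_len(max_pages)) {
    tbl <- fetch(page)
    if (nrow(tbl) == 0) break
    out[[length(out) + 1]] <- tbl
  }
  do.call(rbind, out)
}

# Usage (assumed): df <- scrape_all(get_taneps)
```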

Xpath not "seeing" table in Rvest

I am attempting to scrape a table on this website. I have done it on several websites but can't scrape this one.
Does it have anything to do with the fact that the table is generated dynamically rather than being static?
Here is my code:
library(tidyverse)
library(rvest)
link <- "https://portal.vietcombank.com.vn/Personal/TG/Pages/ty-gia.aspx?devicechannel=default"
webpage <- read_html(link)
webpage %>%
  html_nodes(xpath = "//*[@class = 'tbl-01 rateTable']") %>%
  html_table()
Thank you all very much.
The data is actually served from another link, which can be found by looking under the Network tab in the browser's developer tools.
library(rvest)
url <- 'https://portal.vietcombank.com.vn/UserControls/TVPortal.TyGia/pListTyGia.aspx?txttungay=5/10/2021&BacrhID=1&isEn=False'
data <- url %>% read_html() %>% html_table() %>% .[[1]]
data
# `Ngoại tệ` `Ngoại tệ` Mua Mua Bán
# <chr> <chr> <chr> <chr> <chr>
# 1 Tên ngoại tệ Mã NT Tiền mặt Chuyển khoản Bán
# 2 AUSTRALIAN DOLLAR AUD 16,146.00 16,309.09 16,820.59
# 3 CANADIAN DOLLAR CAD 17,613.63 17,791.55 18,349.54
# 4 SWISS FRANC CHF 23,991.41 24,233.75 24,993.78
# 5 YUAN RENMINBI CNY 3,459.86 3,494.81 3,604.96
# 6 DANISH KRONE DKK - 3,489.44 3,620.54
# 7 EURO EUR 25,758.61 26,018.80 27,099.20
# 8 POUND STERLING GBP 30,155.61 30,460.21 31,415.53
# 9 HONGKONG DOLLAR HKD 2,849.58 2,878.36 2,968.63
#10 INDIAN RUPEE INR - 304.65 316.60
# … with 11 more rows
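The txttungay query parameter in that URL is a date (day/month/year), so the same endpoint can be parameterized to pull rates for other days. A sketch; vcb_rates_url is my own helper name, the unpadded day/month format is inferred from the URL above, and "BacrhID" is kept exactly as it appears there:

```r
# Build the exchange-rate endpoint URL for a given date.
vcb_rates_url <- function(date) {
  d <- as.Date(date)
  # The endpoint uses unpadded day and month, e.g. 5/10/2021.
  dmy <- paste(as.integer(format(d, "%d")),
               as.integer(format(d, "%m")),
               format(d, "%Y"), sep = "/")
  paste0("https://portal.vietcombank.com.vn/UserControls/TVPortal.TyGia/pListTyGia.aspx",
         "?txttungay=", dmy, "&BacrhID=1&isEn=False")
}

# Usage (assumed):
# vcb_rates_url("2021-10-05") %>% read_html() %>% html_table() %>% .[[1]]
```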

Change data type of all columns in list of data frames before using `bind_rows()`

I have a list of data frames, e.g. from the following code:
"https://en.wikipedia.org/wiki/List_of_accidents_and_disasters_by_death_toll" %>%
  rvest::read_html() %>%
  html_nodes(css = 'table[class="wikitable sortable"]') %>%
  html_table(fill = TRUE)
I would now like to combine the data frames into one, e.g. with dplyr::bind_rows(), but I get the error: Can't combine ..1$Deaths <integer> and ..5$Deaths <character>. (The answer suggested here doesn't do the trick.)
So I need to convert the column types before row-binding. I would like to do this inside a pipe (a tidyverse solution would be ideal) and not loop through the data frames, due to the structure of the rest of the project. Instead I'd like something vectorized, e.g. lapply(., function(x) {lapply(x %>% mutate_all, as.character)}) (which doesn't work), to convert all values to character.
Can someone help me with this?
You can change all the columns to character and bind the tables together with map_df:
library(tidyverse)
library(rvest)
"https://en.wikipedia.org/wiki/List_of_accidents_and_disasters_by_death_toll" %>%
  rvest::read_html() %>%
  html_nodes(css = 'table[class="wikitable sortable"]') %>%
  html_table(fill = TRUE) %>%
  map_df(~ .x %>% mutate(across(everything(), as.character)))
# Deaths Date Attraction `Amusement park` Location Incident Injuries
# <chr> <chr> <chr> <chr> <chr> <chr> <chr>
#1 28 14 Feb… Transvaal Park (entire … Transvaal Park Yasenevo, Mosc… NA NA
#2 15 27 Jun… Formosa Fun Coast music… Formosa Fun Coast Bali, New Taip… NA NA
#3 8 11 May… Haunted Castle; a fire … Six Flags Great … Jackson Townsh… NA NA
#4 7 9 June… Ghost Train; a fire at … Luna Park Sydney Sydney, Austra… NA NA
#5 7 14 Aug… Skylab; a crane collide… Hamburger Dom Hamburg, (Germ… NA NA
# 6 6 13 Aug… Virginia Reel; a fire a… Palisades Amusem… Cliffside Park… NA NA
# 7 6 29 Jun… Eco-Adventure Valley Sp… OCT East Yantian Distri… NA NA
# 8 5 30 May… Big Dipper; the roller … Battersea Park Battersea, Lon… NA NA
# 9 5 23 Jun… Kuzuluk Aquapark swimmi… Kuzuluk Aquapark Akyazi, Turkey… NA NA
#10 4 24 Jul… Big Dipper; a bolt came… Krug Park Omaha, Nebrask… NA NA
# … with 1,895 more rows
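After binding everything as character, the numeric columns (such as Deaths) can be re-guessed with readr::type_convert(). A self-contained sketch of the pattern; the two toy data frames stand in for the scraped Wikipedia tables:

```r
library(dplyr)
library(readr)

# Two tables whose Deaths columns have incompatible types (integer vs. character).
a <- data.frame(Deaths = c(28L, 15L), Location = c("Moscow", "Taipei"))
b <- data.frame(Deaths = c("8", "7"), Location = c("Jackson", "Sydney"))

combined <- list(a, b) %>%
  lapply(function(df) mutate(df, across(everything(), as.character))) %>%
  bind_rows() %>%
  type_convert()   # re-guesses column types from the character data

# Deaths comes back numeric, Location stays character.
```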

web scraping to extract only table values from the webpage

I wanted to extract only the table values from the following link:
url <- "https://www.ds-norden.com/drycargo/fleetlist/"
I tried the code below, but I am not getting my desired output:
library(rvest)
url <- "https://www.scorpiobulkers.com/our-fleet/"
webpage <- read_html(url)
rank_data_html <- html_node(webpage, ".col-main")
rank_data <- html_text(rank_data_html)
head(rank_data)
This code returns the whole text of the webpage. I just want the fleet list, which is in a table on the webpage, stored as a data frame in R.
library(rvest)
url <- "https://www.scorpiobulkers.com/our-fleet/"
webpage <- read_html(url)

rank_data <- webpage %>%
  html_node("table") %>%
  html_table()
head(rank_data)
#> Vessel Name Year Built (1) Yard (2) Vessel Type
#> 1 NA SBI Bravo 2015 Nacks Ultramax
#> 2 NA SBI Athena 2015 Chengxi Ultramax
#> 3 NA SBI Antares 2015 Nacks Ultramax
#> 4 NA SBI Cronos 2015 Imabari Ultramax
#> 5 NA SBI Leo 2015 Dacks Ultramax
#> 6 NA SBI Echo 2015 Imabari Ultramax
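The scraped table keeps a stray all-NA first column (an empty image/checkbox column in the HTML). A sketch for dropping any column that is entirely NA; the toy data frame stands in for rank_data:

```r
library(dplyr)

rank_data <- data.frame(
  X = c(NA, NA),
  `Vessel Name` = c("SBI Bravo", "SBI Athena"),
  `Year Built` = c(2015, 2015),
  check.names = FALSE
)

# Keep only columns with at least one non-missing value.
clean <- rank_data %>% select(where(~ !all(is.na(.x))))
```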

Find html table name and scrape in R

I'm trying to scrape a table from a web page that has multiple tables. I'd like to get the "FIPS Codes for the States and the District of Columbia" table from https://www.census.gov/geo/reference/ansi_statetables.html . I think XML::readHTMLTable() is the right way to go, but when I try the following I get an error:
url = "https://www.census.gov/geo/reference/ansi_statetables.html"
readHTMLTable(url, header = T, stringsAsFactors = F)
named list()
Warning message:
XML content does not seem to be XML: 'https://www.census.gov/geo/reference/ansi_statetables.html'
This is not surprising, of course, because I'm not giving the function any indication of which table I'd like to read. I've dug around in "Inspect" for quite a while, but I'm not connecting the dots on how to be more precise. There doesn't seem to be a name or class for the table analogous to other examples I've found in documentation or on SO. Thoughts?
Consider using readLines() to retrieve the HTML page content, then pass the result to readHTMLTable():
url = "https://www.census.gov/geo/reference/ansi_statetables.html"
webpage <- readLines(url)
readHTMLTable(webpage, header = T, stringsAsFactors = F) # LIST OF 3 TABLES
# $`NULL`
# Name FIPS State Numeric Code Official USPS Code
# 1 Alabama 01 AL
# 2 Alaska 02 AK
# 3 Arizona 04 AZ
# 4 Arkansas 05 AR
# 5 California 06 CA
# 6 Colorado 08 CO
# 7 Connecticut 09 CT
# 8 Delaware 10 DE
# 9 District of Columbia 11 DC
# 10 Florida 12 FL
# 11 Georgia 13 GA
# 12 Hawaii 15 HI
# 13 Idaho 16 ID
# 14 Illinois 17 IL
# ...
For specific dataframe return:
fipsdf <- readHTMLTable(webpage, header = T, stringsAsFactors = F)[[1]]
Another solution using rvest instead of XML is:
require(rvest)
read_html("https://www.census.gov/geo/reference/ansi_statetables.html") %>%
  html_table() %>%
  .[[1]]
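When a page has several unnamed tables, relying on position (.[[1]]) is fragile; selecting by header text is more robust. A sketch in base R; pick_table is my own helper, not part of either answer, and the toy list stands in for the scraped tables:

```r
# Return the first data frame in `tables` with a column name matching `pattern`.
pick_table <- function(tables, pattern) {
  hits <- Filter(function(df) any(grepl(pattern, names(df), ignore.case = TRUE)),
                 tables)
  if (length(hits) == 0) stop("No table has a header matching: ", pattern)
  hits[[1]]
}

# Usage against the scraped page (assumed):
# read_html(url) %>% html_table() %>% pick_table("FIPS")
```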
