Web scraping data for use in R-Studio - r

I am wanting to pull the data out of this server site and into R-Studio. I am new to R so not at all sure what is possible. Any help with coding to achieve this would be appreciated.
http://hbrcdata.hbrc.govt.nz/hydrotel/cgi-bin/hydwebserver.cgi/points/details?point=679&samples=true

install.packages("rvest")
library('rvest')
install.packages('XML')
library('XML')
library("httr")
#Specifying the url for desired website to be scrapped
url <- 'http://hbrcdata.hbrc.govt.nz/hydrotel/cgi-
bin/hydwebserver.cgi/points/samples?point=679'
webpage <- read_html(url)
tbls <- html_nodes(webpage, "table")
head(tbls)
tbls_ls <- webpage %>%
html_nodes("table") %>%
html_table(fill = TRUE)
tbl <- as.data.frame(tbls_ls)
View(tbl)
I have tried to fetch few other tables from the given website which is working fine.
for example:
rainfall depth:
http://hbrcdata.hbrc.govt.nz/hydrotel/cgi-bin/hydwebserver.cgi/points/details?point=63
small modification in the url as follows will fetch you actual table. rest all code reamins same (details?point=63 as samples?point=63)
url <- 'http://hbrcdata.hbrc.govt.nz/hydrotel/cgi-bin/HydWebServer.cgi/points/samples?point=63'
for more help you can refer the website:
http://bradleyboehmke.github.io/2015/12/scraping-html-tables.html

Related

How can I do web scraping USDA FoodData central?

I am trying to scrap a table in this url with using R. I tried this code, but it showed that xml_missing. How can I retrieve a nutrition table in this url?
library(rvest)
library(tidyverse)
url <- "https://fdc.nal.usda.gov/fdc-app.html#/food-details/2237774/nutrients"
read_html(url) %>% html_element(xpath = '// [#id="nutrients-table"]')

Read HTML table using R directly from a website

I want to read covid data directly from government website: https://pikobar.jabarprov.go.id/distribution-case#
I did that using rvest library
url <- "https://pikobar.jabarprov.go.id/distribution-case#"
df <- url %>%
read_html() %>%
html_nodes("table") %>%
html_table(fill = T)
I saw someone using lapply to make it into a tidy table, but when I tried it looked like a mess because I'm new to this.
Can anybody help me? I really frustated
You can't scrape the data in the table by rvest because it's requested to this link:
https://dashboard-pikobar-api.digitalservice.id/v2/sebaran/pertumbuhan?wilayah=kota&=32 with the api-key attached.
pg <- httr::GET(
"https://dashboard-pikobar-api.digitalservice.id/v2/sebaran/pertumbuhan?wilayah=kota&=32",
config = httr::add_headers(`api-key` = "480d0aeb78bd0064d45ef6b2254be9b3")
)
data <- httr::content(pg)$data
I don't know if the api-key works in the future but it works for now as I see.

Using R to scrape play-by-play data

I am currently trying to scrape the play-by-play entries from the following link:
https://www.basket.fi/basketball-finland/competitions/game/?game_id=4677793&season_id=110531&league_id=4
I used the SelectorGadget to determine CSS selectors and ended up with '//td'. However when I attempt to scrape the data using this, html_nodes() returns an empty list and thus the following code returns an error.
library("rvest")
url <- "https://www.basket.fi/basketball-finland/competitions/game/?game_id=4677793&season_id=110531&league_id=4"
play_by_play <- url %>%
read_html %>%
html_node(xpath='//td') %>%
html_table()
play_by_play
Does anybody know how to resolve this issue?
Thank you in advance!
I think you cannot get the table simply because there are no table in the website(see the source).
It there are any tables, you can get it with following code.
library("rvest")
url <- "https://www.basket.fi/basketball-finland/competitions/game/?game_id=4677793&season_id=110531&league_id=4"
play_by_play <- url %>%
read_html %>%
html_table()
play_by_play
The data in the page you are loading is loaded with Javascript, so when you used read_html, you are not seeing what you want. If you check "view the source", you will not see table or td in the source page.
What you can do is using other options like Rselenium to get the page source, and if you want to use rvest later you can scrape from the source you get.
library(rvest)
library(Rselenium)
url <- "https://www.basket.fi/basketball-finland/competitions/game/?game_id=4677793&season_id=110531&league_id=4"
rD<- rsDriver()
remDr <- rD$client
remDr$navigate(url)
remDr$getPageSource()[[1]]
play_by_play <-read_html(unlist(remDr$getPageSource()),encoding="UTF-8") %>%
html_nodes("td")
remDr$close()
rm(remDr, rD)
gc()

Scraping data in R

Is there any way to scrape data in R for:
General Information/Launch Date
from this Website: https://www.euronext.com/en/products/etfs/LU1437018838-XAMS/market-information
So far, I have used this code, but the generated XML file does not contain Information that I Need:
library(rvest)
library(XML)
url <- paste("https://www.euronext.com/en/products/etfs/LU1437018838-XAMS/market-information",sep="")
download.file(url, destfile = "scrapedpage.html", quiet=TRUE)
content <- read_html("scrapedpage.html")
content1 <- htmlTreeParse(content, error=function(...){}, useInternalNodes = TRUE)
What you are trying to scrap is in an AJAX object called factsheet (I dont know javascript so I cant tell you more).
Here is a solution to get what you want :
Get the URL of the data used by javascript using the network analysis from your browser (XHR thing). See here.
library(rvest)
url <- read_html("https://www.euronext.com/en/factsheet-ajax?instrument_id=LU1437018838-XAMS&instrument_type=etfs")
launch_date <- url %>% html_nodes(xpath = "/html/body/div[2]/div[1]/div[3]/div[4]/strong")%>%
html_text()

Getting email address in web scraping through rvest

Hi I am trying to get little information about this webpage through web scraping in R language using the package rvest. I am getting name and everything but I am unable to get email id i.e. info#brewhemia.co.uk. If I see in the read_html as text, I don't see email id in html parsed text. Can anybody please help? I am new to web scraping. But I know R Language.
link <- 'https://food.list.co.uk/place/22191-brewhemia-edinburgh/'
page <- read_html(link)
name_html <- html_nodes(page,'.placeHeading')
business_adr <- html_text(adr_html)
tel_html <- html_nodes(page,'.value')
business_tel <- html_text(tel_html)
The email id is in 'a' html tag but I am not able to extract it.
You need a javascript engine here to process the js code. Luckily, R has got V8.
Modify your code after installing V8 package:
library(rvest)
library(V8)
link <- 'https://food.list.co.uk/place/22191-brewhemia-edinburgh/'
page <- read_html(link)
name_html <- html_nodes(page,'.placeHeading')
business_adr <- html_text(adr_html)
tel_html <- html_nodes(page,'.value')
business_tel <- html_text(tel_html)
emailjs <- page %>% html_nodes('li') %>% html_nodes('script') %>% html_text()
ct <- v8()
read_html(ct$eval(gsub('document.write','',emailjs))) %>% html_text()
Output:
> read_html(ct$eval(gsub('document.write','',emailjs))) %>% html_text()
[1] "info#brewhemia.co.uk"

Resources