Scraping data in R

Is there any way to scrape data in R for:
General Information / Launch Date
from this website: https://www.euronext.com/en/products/etfs/LU1437018838-XAMS/market-information
So far I have used the code below, but the parsed page does not contain the information I need:
library(rvest)
library(XML)
url <- paste("https://www.euronext.com/en/products/etfs/LU1437018838-XAMS/market-information",sep="")
download.file(url, destfile = "scrapedpage.html", quiet=TRUE)
content <- read_html("scrapedpage.html")
content1 <- htmlTreeParse(content, error=function(...){}, useInternalNodes = TRUE)

What you are trying to scrape is loaded by an AJAX request for a "factsheet" fragment (I don't know JavaScript, so I can't tell you more).
Here is a solution to get what you want:
Find the URL of the data used by the JavaScript with your browser's network analysis tools (look for the XHR requests), then read that URL directly:
library(rvest)

# Read the AJAX fragment the page requests for the factsheet
url <- read_html("https://www.euronext.com/en/factsheet-ajax?instrument_id=LU1437018838-XAMS&instrument_type=etfs")
launch_date <- url %>%
  html_nodes(xpath = "/html/body/div[2]/div[1]/div[3]/div[4]/strong") %>%
  html_text()
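The position-based XPath above will break if Euronext changes the page layout. A more defensive variant, assuming the fragment labels the field as "Launch Date" (inspect the AJAX response to confirm the exact wording), anchors on the label text instead:
# Sketch: find the node containing the "Launch Date" label and take the
# first <strong> that follows it. The label text is an assumption.
launch_date <- url %>%
  html_nodes(xpath = "//*[contains(text(), 'Launch Date')]/following::strong[1]") %>%
  html_text()
launch_date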

Related

Read HTML table using R directly from a website

I want to read covid data directly from the government website: https://pikobar.jabarprov.go.id/distribution-case#
I did that using the rvest library:
url <- "https://pikobar.jabarprov.go.id/distribution-case#"
df <- url %>%
  read_html() %>%
  html_nodes("table") %>%
  html_table(fill = TRUE)
I saw someone using lapply to turn the result into a tidy table, but when I tried it the output was a mess, because I'm new to this.
Can anybody help me? I'm really frustrated.
You can't scrape the data in that table with rvest because it is requested from this API endpoint:
https://dashboard-pikobar-api.digitalservice.id/v2/sebaran/pertumbuhan?wilayah=kota&=32, with an api-key header attached.
# Request the API endpoint directly, passing the api-key header
pg <- httr::GET(
  "https://dashboard-pikobar-api.digitalservice.id/v2/sebaran/pertumbuhan?wilayah=kota&=32",
  config = httr::add_headers(`api-key` = "480d0aeb78bd0064d45ef6b2254be9b3")
)
data <- httr::content(pg)$data
I don't know whether the api-key will keep working in the future, but it works for now.
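If you then want that nested list as a data frame, a minimal sketch (assuming each element of data is a flat named list; the actual field names depend on what the API returns) is:
library(dplyr)

# Stack the list of records returned by the API into one data frame.
# Inspect names(data[[1]]) first to see which fields are actually present.
df <- bind_rows(data)
head(df)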

Web scraping data for use in R-Studio

I want to pull the data out of this server site and into RStudio. I am new to R, so I am not at all sure what is possible. Any help with coding to achieve this would be appreciated.
http://hbrcdata.hbrc.govt.nz/hydrotel/cgi-bin/hydwebserver.cgi/points/details?point=679&samples=true
install.packages("rvest")
library('rvest')
install.packages('XML')
library('XML')
library("httr")
# Specifying the url of the website to be scraped
url <- 'http://hbrcdata.hbrc.govt.nz/hydrotel/cgi-bin/hydwebserver.cgi/points/samples?point=679'
webpage <- read_html(url)
tbls <- html_nodes(webpage, "table")
head(tbls)
tbls_ls <- webpage %>%
  html_nodes("table") %>%
  html_table(fill = TRUE)
tbl <- as.data.frame(tbls_ls)
View(tbl)
I have tried fetching a few other tables from the given website, and that works fine. For example, rainfall depth:
http://hbrcdata.hbrc.govt.nz/hydrotel/cgi-bin/hydwebserver.cgi/points/details?point=63
A small modification to the URL, as follows, will fetch the actual table; the rest of the code remains the same (change details?point=63 to samples?point=63):
url <- 'http://hbrcdata.hbrc.govt.nz/hydrotel/cgi-bin/HydWebServer.cgi/points/samples?point=63'
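For completeness, a minimal sketch (same approach as the question's code) that reads that samples URL and returns the first table as a data frame:
library(rvest)

# Pull every <table> on the samples page into a list of data frames
tbls_ls <- read_html(url) %>%
  html_nodes("table") %>%
  html_table(fill = TRUE)

# The samples usually sit in the first table; check length(tbls_ls) to confirm
samples <- tbls_ls[[1]]
head(samples)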
For more help you can refer to this tutorial:
http://bradleyboehmke.github.io/2015/12/scraping-html-tables.html

RSelenium XPath not able to save response

I'm trying to get the stock levels from https://www.vinmonopolet.no/,
for example for this wine: https://www.vinmonopolet.no/vmp/Land/Chile/Gato-Negro-Cabernet-Sauvignon-2017/p/295301
using RSelenium:
library('RSelenium')
rD <- rsDriver()
remDr <- rD[["client"]]
remDr$navigate("https://www.vinmonopolet.no/vmp/Land/Chile/Gato-Negro-Cabernet-Sauvignon-2017/p/295301")
webElement <- remDr$findElement('xpath', '//*[@id="product_2953010"]/span[2]')
webElement$clickElement()
It will render a response, but how do I store it?
Maybe rvest is what you are looking for?
library(rvest)
library(dplyr)  # needed for mutate() below

url <- "https://www.vinmonopolet.no/vmp/Land/Chile/Gato-Negro-Cabernet-Sauvignon-2017/p/295301"
page <- read_html(url)

# Grab the stock-status text for this product
stock <- page %>%
  html_nodes(".product-stock-status div") %>%
  html_text()
stock.df <- data.frame(url, stock)
To extract the number use
stock.df <- stock.df %>%
  mutate(stock = as.numeric(gsub(".*?([0-9]+).*", "\\1", stock)))
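If you would rather stay with RSelenium, the element's text can be captured with getElementText() instead of clicking it; whether that particular span actually holds the stock figure is something to verify on the page:
# Continuing the RSelenium session from the question: read the element's
# text and keep it in a normal R object instead of just clicking it.
webElement <- remDr$findElement('xpath', '//*[@id="product_2953010"]/span[2]')
stock_text <- webElement$getElementText()[[1]]  # getElementText() returns a list
stock_text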
Got it to work by just sending the right plain request, no need for R:
https://www.vinmonopolet.no/vmp/store-pickup/1101/pointOfServices?locationQuery=0661&cartPage=false&entryNumber=0&CSRFToken=718228c1-1dc1-41cd-a35e-23197bed7b0c
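If you prefer to stay in R, a minimal sketch of the same request with httr (the CSRFToken and the 1101 store id are session-specific values copied from the URL above, so substitute your own) could look like this:
library(httr)
library(jsonlite)

# Rebuild the store-pickup request; the token and ids come from the poster's
# session and will not work verbatim for anyone else.
url <- paste0(
  "https://www.vinmonopolet.no/vmp/store-pickup/1101/pointOfServices",
  "?locationQuery=0661&cartPage=false&entryNumber=0",
  "&CSRFToken=718228c1-1dc1-41cd-a35e-23197bed7b0c"
)

res <- GET(url)
stop_for_status(res)

# The response is expected to be JSON; if it turns out to be an HTML
# fragment, parse it with read_html() instead.
stock_info <- fromJSON(content(res, as = "text", encoding = "UTF-8"))
str(stock_info, max.level = 1)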

Web Scraping NBA Fantasy Projections - R

There are a number of NBA fantasy projections that I would like to scrape in a more streamlined way. Currently I use a combination of the importhtml function in Google Sheets and simple, archaic cut-and-paste.
I use R regularly to scrape other data from the internet; however, I can't manage to get these tables to scrape. The tables I am having trouble with are located at three separate addresses (one table per page):
1) http://www.sportsline.com/nba/player-projections/player-stats/all-players/
2) https://swishanalytics.com/optimus/nba/daily-fantasy-projections
3) http://www.sportingcharts.com/nba/dfs-projections/
For all my other scraping activities I use the rvest and XML packages. Following the same process, I've tried both methods listed below, which produce the outputs shown. I'm sure this has something to do with how the tables are built on these websites, but I haven't been able to find anything that helps.
Method 1
library(XML)
projections1 <- readHTMLTable("http://www.sportsline.com/nba/player-projections/player-stats/all-players/")
projections2 <- readHTMLTable("https://swishanalytics.com/optimus/nba/daily-fantasy-projections")
projections3 <- readHTMLTable("http://www.sportingcharts.com/nba/dfs-projections/")
Output
projections1
named list()
projections2
named list()
Warning message:
XML content does not seem to be XML: 'https://swishanalytics.com/optimus/nba/daily-fantasy-projections'
projections3 - I get the headers of the table but not the content of the table.
Method 2
library(rvest)

URL <- "http://www.sportsline.com/nba/player-projections/player-stats/all-players/"
projections1 <- URL %>%
  read_html() %>%
  html_nodes("table") %>%
  html_table(trim = TRUE, fill = TRUE)

URL <- "https://swishanalytics.com/optimus/nba/daily-fantasy-projections"
projections2 <- URL %>%
  read_html() %>%
  html_nodes("table") %>%
  html_table(trim = TRUE, fill = TRUE)

URL <- "http://www.sportingcharts.com/nba/dfs-projections/"
projections3 <- URL %>%
  read_html() %>%
  html_nodes("table") %>%
  html_table(trim = TRUE, fill = TRUE)
Output
projections1
list()
projections2 - I get the headers of the table but not the content of the table.
projections3 - I get the headers of the table but not the content of the table.
If anybody could point me in the right direction it would be greatly appreciated.
The content of the tables is generated by JavaScript, so readHTMLTable and read_html find nothing. You can get at the underlying data as shown below.
projections1: the data comes from this JSON endpoint (found via the browser's network tools):
import requests
url = 'http://www.sportsline.com/sportsline-web/service/v1/playerProjections?league=nba&position=all-players&sourceType=FD&game=&page=PS&offset=0&max=25&orderField=&optimal=false&release2Ver=true&auth=3'
r = requests.get(url)
print(r.json())
projections2: view-source:https://swishanalytics.com/optimus/nba/daily-fantasy-projections Line 1181
import requests
url = 'https://swishanalytics.com/optimus/nba/daily-fantasy-projections'
r = requests.get(url)
text = r.text
print(eval(text.split('this.players = ')[1].split(';')[0]))
projections3: view-source Line 918
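Since the question is about R, here is a rough R equivalent of the first request, assuming the sportsline endpoint and its query parameters (including auth=3) still behave as shown in the Python snippet above:
library(httr)
library(jsonlite)

# JSON endpoint behind the sportsline projections page (parameters copied from above)
url <- paste0(
  "http://www.sportsline.com/sportsline-web/service/v1/playerProjections",
  "?league=nba&position=all-players&sourceType=FD&game=&page=PS",
  "&offset=0&max=25&orderField=&optimal=false&release2Ver=true&auth=3"
)

res <- GET(url)
stop_for_status(res)

# Parse the JSON; which element holds the projections is an assumption,
# so inspect the structure first.
proj <- fromJSON(content(res, as = "text", encoding = "UTF-8"))
str(proj, max.level = 1)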

download csv file in R

I'm trying to download historical stock trading data for my country's market with R. I tried the download.file() function. A file does get downloaded, but it is an empty spreadsheet. If I use this url in my browser, the file I download is in fact the one I want.
I would love to do it with quantmod, but that package only covers the larger markets.
url<-"https://www.ccbolsa.cl/apps/script/detalleaccion/Transaccion.asp?Nemo=AFPCAPITAL&Menu=H"
destfile <- "/home/hector/TxHistoricas.xls"
download.file(url, destfile)
Thanks in advance.
You can jury-rig something like this if you don't want to use selenium:
library(rvest)
library(httr)
library(stringr)
URL <- "https://www.ccbolsa.cl/apps/script/detalleaccion/Transaccion.asp?Nemo=AFPCAPITAL&Menu=H"
Get initial URL:
res <- html_session(URL, timeout(30))
It embeds a form that JavaScript submits to get the actual document:
inputs <- html_nodes(res, "input")
The page uses the last JavaScript entry to do a redirect on load, so we need its target location:
scripts <- html_nodes(res, "script")
action <- html_text(scripts[[length(scripts)]])
This is the new URL to submit to:
base_url <- "https://www.ccbolsa.cl/apps/script/detalleaccion"
loc <- str_match(action, '\\.action *= *"(.*)"')[,2]
doc_url <- sprintf("%s/%s", base_url, loc)
Gather up all the query params:
query <- lapply(inputs, xml_attr, "value")
names(query) <- sapply(inputs, xml_attr, "name")
Now we have to make a new POST request with the query encoded as a form, providing the original page as the Referer (the timeout was necessary for me). This writes the "xls" content to a file:
ret <- POST(doc_url,
            body = query,
            encode = "form",
            add_headers(Referer = URL),
            write_disk("fil.xls", overwrite = TRUE),
            timeout(30))
It says it's an XLS file:
ret$headers$`content-type`
## [1] "application/vnd.ms-excel"
but it's really an HTML table, so you can just do:
ret <- POST(doc_url,
            body = query,
            encode = "form",
            add_headers(Referer = URL),
            timeout(30))
doc <- read_html(content(ret, as = "text"))
dat <- html_table(html_nodes(doc, "table"), fill = TRUE)
to get what you're looking for (there are two ugly tables in the dat list, and you may want to pass header=TRUE as an additional argument to html_table).
I am not sure how "dynamic" this solution is, but that is testable/verifiable.
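For example, a minimal sketch with headers (which of the two tables in dat holds the transactions is an assumption, so check both):
# Re-parse the tables with header = TRUE and inspect both list elements
dat <- html_table(html_nodes(doc, "table"), fill = TRUE, header = TRUE)
str(dat, max.level = 1)
transactions <- dat[[2]]  # assumption: the second table is the transaction history
head(transactions)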
