I want to automatically download all the whitepapers from this website: https://icobench.com/ico. When you enter each ICO's webpage, there is a whitepaper tab to click, which takes you to a PDF preview screen. I want to retrieve the PDF URL from the inspected markup using rvest, but nothing comes back after trying multiple inputs for the nodes.
An example of one ICO's embed element, as seen in the browser inspector:
<embed id="plugin" type="application/x-google-chrome-pdf"
src="https://www.ideafex.com/docs/IdeaFeX_twp_v1.1.pdf"
stream-url="chrome-extension://mhjfbmdgcfjbbpaeojofohoefgiehjai/9ca6571a-509f-4924-83ef-5ac83e431a37"
headers="content-length: 2629762
content-type: application/pdf">
I've tried something like the following:
library(rvest)
library(stringr)  # needed for str_c
url <- "https://icobench.com/ico"
url <- str_c(url, '/hygh')
webpage <- read_html(url)
Item_html <- html_nodes(webpage, "content embed#plugin")
Item <- html_attr(Item_html, "src")
# or
Item <- html_text(Item_html)
Item
But nothing comes back. Can anybody help?
From the above example, I'm expecting to retrieve the embedded URL pointing to the ICO's official website for the PDF whitepaper, e.g. https://www.ideafex.com/docs/IdeaFeX_twp_v1.1.pdf
But as the embed element belongs to Chrome's PDF plugin, it isn't in the HTML retrieved by the rvest package. Any ideas?
A possible solution:
Using your example, I would change the selector to combine, via the descendant combinator, an id selector with an attribute = value selector. This targets the whitepaper tab by id and the child link by its href attribute value, using the $ (ends with) operator to match the pdf.
library(rvest)
library(magrittr)
url <- "https://icobench.com/ico/hygh"
pdf_link <- read_html(url) %>%
  html_node("#whitepaper [href$=pdf]") %>%
  html_attr("href")
Faster option?
You could also target the object tag and its data attribute:
pdf_link <- read_html(url) %>%
  html_node("#whitepaper object") %>%
  html_attr("data")
Explore which is fit for purpose across pages.
The latter is likely faster and seems to be used across the few sites I checked.
Solution for all icos:
You could put this in a function that receives a URL as input (the URL of each ICO); the function would return the pdf URL, or some other specified value if no URL is found or the CSS selector fails to match (you'd need to add handling for that scenario). Then call that function over a loop of all the ICO URLs.
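A minimal sketch of such a function, assuming the #whitepaper markup above holds across ICO pages (get_pdf_link and the commented-out URL vector are illustrative names, not from the original answer):

```r
library(rvest)

# Return the whitepaper pdf URL for one ICO page, or NA when the
# request fails or the selector matches nothing (hypothetical helper)
get_pdf_link <- function(ico_url) {
  tryCatch(
    read_html(ico_url) %>%
      html_node("#whitepaper [href$=pdf]") %>%
      html_attr("href"),
    error = function(e) NA_character_
  )
}

# ico_urls  <- c("https://icobench.com/ico/hygh")  # gathered elsewhere
# pdf_links <- vapply(ico_urls, get_pdf_link, character(1))
```

html_attr() already returns NA for a missing node, so the tryCatch only needs to cover request failures.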
Related
I'm trying to use R to extract player photos from the PGA website. The following is my attempt at getting the image URLs, but it does not return the images, or the image comes back blank.
if (!require(pacman)) install.packages("pacman")
pacman::p_load('rvest', 'stringi', 'dplyr', 'tidyr', 'measurements', 'reshape2', 'foreach', 'doParallel', 'curl', 'httr', 'Iso', 'janitor')
PGA_url <- "https://www.pgatour.com"
pga_web <- read_html(paste0(PGA_url, '/players.html'))
plyers_photo <- pga_web %>%
  html_nodes("[class='player-card']") %>%
  html_nodes('div.player-image-wrapper') %>%
  html_nodes('img') %>%
  html_attr('src')
Could someone kindly tell me what I'm doing wrong?
If you examine the page source, you will see that you are retrieving exactly what the page source contains, i.e. a default img value. Scan across and you may notice an adjacent data-src attribute which holds an alternate ending for the png, of a form matching the regex headshots_\d{5}\.png.
When JavaScript runs in the browser, which doesn't happen with an xmlhttp request made through rvest, those urls are dynamically updated, with the default png endings replaced by those in the data-src attributes.
Either replace the endings you are getting with that attribute's value, for the set-size small image, or instead use the part up to and including upload as a base and combine that with the extracted data-src values to give the large images.
There is also no need for all those chained html_nodes() calls; a single call with an appropriate css selector list will do. Also, prefer the maintained html_elements() method over the old html_nodes():
library(rvest)
library(magrittr)
PGA_url <- "https://www.pgatour.com"
pga_web <- read_html(paste0(PGA_url, "/players.html"))
placeholder_link <- 'https://pga-tour-res.cloudinary.com/image/upload/'
plyers_photo <- pga_web %>%
html_elements(".player-card .player-image-wrapper img") %>%
html_attr("data-src") %>% paste0(placeholder_link, .)
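The code above implements the large-image option. The other option, swapping just the file name on the original src for the small images, might be sketched with a small helper (swap_ending is my own name; the attribute layout is as described above):

```r
# Replace the default file name at the end of src with the
# data-src value, keeping the rest of the path (hypothetical helper)
swap_ending <- function(src, data_src) {
  paste0(sub("[^/]+$", "", src), data_src)
}

# imgs  <- pga_web %>% html_elements(".player-card .player-image-wrapper img")
# small <- swap_ending(html_attr(imgs, "src"), html_attr(imgs, "data-src"))
```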
I am looking to extract all the links for each episode on this webpage; however, I appear to be having difficulty using html_nodes() where I haven't experienced such difficulty before. I am trying to pipe the code using "." so that all the attributes for the page matching that CSS are obtained. This code is meant to output all the attributes, but instead I get {xml_nodeset (0)}. I know what to do once I have the attributes in order to pull the links out of them, but this step is proving a stumbling block on this website.
Here is the code I have begun in R:
episode_list_page_1 <- "https://jrelibrary.com/episode-list/"
episode_list_page_1 %>%
  read_html() %>%
  html_node("body") %>%
  html_nodes(".type-text svelte-fugjkr first-mobile first-desktop") %>%
  html_attrs()
rvest does not work here because this page uses JavaScript to insert another webpage into an iframe on this page to display the information.
If you search the embedded script you will find a reference to this page: "https://datawrapper.dwcdn.net/eoqPA/66/", which redirects you to "https://datawrapper.dwcdn.net/eoqPA/67/". This second page contains the data you are looking for, as embedded JSON generated via JavaScript.
The links to the shows are extractable, and searching this page also turns up a link to a Google doc that is the full episode index:
library(rvest)
library(dplyr)
library(stringr)
page2 <- read_html("https://datawrapper.dwcdn.net/eoqPA/67/")

# find all of the links on the page:
str_extract_all(html_text(page2), 'https:.*?\\"')

# isolate the Google docs links
print(str_extract_all(html_text(page2), 'https://docs.*?\\"'))
#[[1]]
#[1] "https://docs.google.com/spreadsheets/d/12iTobpwHViCIANFSX3Pc_dGMdfod-0w3I5P5QJL45L8/edit?usp=sharing"
#[2] "https://docs.google.com/spreadsheets/d/12iTobpwHViCIANFSX3Pc_dGMdfod-0w3I5P5QJL45L8/export?format=csv&id=12iTobpwHViCIANFSX3Pc_dGMdfod-0w3I5P5QJL45L8"
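If you start from the share link, the CSV export link can be derived with a bit of string work (the URL pattern here is inferred from the two links above; sheet_csv_url is my own helper name):

```r
# Build a Google Sheets CSV export URL from a share link
sheet_csv_url <- function(share_url) {
  id <- sub(".*/d/([^/]+).*", "\\1", share_url)
  paste0("https://docs.google.com/spreadsheets/d/", id,
         "/export?format=csv&id=", id)
}

# episodes <- read.csv(sheet_csv_url(share_link))  # needs network access
```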
I'm trying to get the product link from a customer's profile page using R's rvest package.
I've referenced various questions on Stack Overflow, including here (could not read webpage with read_html using rvest package from r), but each time I try something, I'm not able to return the correct result.
For example on this profile page:
https://www.amazon.com/gp/profile/amzn1.account.AETT6GZORFV55BFNOAVFDIJ75QYQ/ref=cm_cr_dp_d_gw_tr?ie=UTF8
I'd like to be able to return this link, with the end goal to extract the product id: B01A51S9Y2
https://www.amazon.com/Amagabeli-Stainless-Chainmail-Scrubber-Pre-Seasoned/dp/B01A51S9Y2?ref=pf_vv_at_pdctrvw_dp
library(dplyr)
library(rvest)
library(stringr)
library(httr)

# get url
url <- 'https://www.amazon.com/gp/profile/amzn1.account.AETT6GZORFV55BFNOAVFDIJ75QYQ/ref=cm_cr_dp_d_gw_tr?ie=UTF8'
x <- GET(url, add_headers('user-agent' = 'test'))
page <- read_html(x)
page %>%
  html_nodes("[class='a-link-normal profile-at-product-box-link a-text-normal']") %>%
  html_text()

# I did a test to see if I could even find the href, with no luck
test <- page %>%
  html_nodes("#a-page") %>%
  html_text()
grepl("B01A51S9Y2", test)
Thanks for the tip @Qharr on RSelenium. That is helpful, but I'm still unsure how to extract the link or ASIN.
library(RSelenium)
driver <- rsDriver(browser = c("chrome"), port = 4574L, chromever = "77.0.3865.40")
rd <- driver[["client"]]
rd$open()
rd$navigate("https://www.amazon.com/gp/profile/amzn1.account.AETT6GZORFV55BFNOAVFDIJ75QYQ/ref=cm_cr_arp_d_gw_btm?ie=UTF8")
prod <- rd$findElement(using = "css", '.profile-at-product-box-link')
prod$getElementText
This doesn't really return anything.
Adding getElementAttribute('href') got the link:
prod <- rd$findElements(using = "css selector", '.profile-at-product-box-link')
for (link in 1:length(prod)) {
  print(prod[[link]]$getElementAttribute('href'))
}
That info is pulled in dynamically from a POST request the page makes, which your initial rvest request doesn't capture. This subsequent request returns, in JSON format, the content governing the ASINs, product links, etc.
You can find it in the Network tab of dev tools (F12). Press F5 to refresh the page, then examine the network traffic.
It is not a simple POST request to mimic, so I would just go with RSelenium: let the page render, then use the CSS selector
.profile-at-product-box-link
to gather a webElements collection you can loop over, extracting the href attribute from each.
GOAL: I'm trying to scrape win-loss records for NBA teams from basketball-reference.com.
More broadly, I'm trying to better understand how to correctly use CSS selector gadget to scrape specified elements from a website, but would appreciate a solution for this problem.
The url I'm using (https://www.basketball-reference.com/leagues/NBA_2018_standings.html) has multiple tables on it, so I'm trying to use the CSS selector gadget to specify the element I want, which is the "Expanded Standings" table - about 1/3 of the way down the page.
I have read various tutorials about web scraping that involve the rvest and dplyr packages, as well as the SelectorGadget browser add-in (which I have installed in Chrome, my browser of choice).
Here is my code so far:
url <- "https://www.basketball-reference.com/leagues/NBA_2018_standings.html"
css <- "#expanded_standings"
url %>%
read_html() %>%
html_nodes(css) %>%
html_table()
The result of this code is an error:
Error: html_name(x) == "table" is not TRUE
When I delete the last line of code, I get:
url %>%
read_html() %>%
html_nodes(css)
{xml_nodeset (0)}
It seems like there's an issue with the way I'm defining the CSS object/how I'm using the CSS selector tool. What I've been doing is clicking at the very right edge of the desired table, so that the table has a rectangle around it.
I've also tried clicking a specific "cell" in the table (i.e., "65-17", the value in the "Overall" column for the Houston Rockets row), but that seems to highlight some, but not all, of the table, plus random parts of other tables on the page.
Can anyone provide a solution? Bonus points if you can help me understand where/why what I'm doing is incorrect.
Thanks in advance!
library(rvest)
library(dplyr)
library(stringr)
library(magrittr)
url <- "https://www.basketball-reference.com/leagues/NBA_2018_standings.html"
css <- "#all_expanded_standings"  # wrapper div; the table itself sits inside an HTML comment
webpage <- read_html(url)
mynode <- html_nodes(webpage, css)
mystr <- toString(mynode)
# strip the comment markers so the table markup becomes parseable
mystr <- gsub("<!--", "", mystr)
mystr <- gsub("-->", "", mystr)
newdiv <- read_html(mystr)
newtable <- html_nodes(newdiv, "#expanded_standings")
newframe <- html_table(newtable)
print(newframe)
library(rvest)
url <- "https://www.basketball-reference.com/leagues/NBA_2018_standings.html"
css <- "#all_expanded_standings"
webpage <- read_html(url)
mynode <- html_nodes(webpage, css)
# print the node to the console - cat() interprets the escaped slashes
cat(toString(mynode))
I tried downloading the bare URL's HTML (before any JavaScript renders). Strangely, the table data is inside a comment block within this div, and that is where the 'Expanded Standings' table lives. I originally used Python and BeautifulSoup to extract the element, remove the comment markers, re-parse the string section, and then split the string into td bits. Oddly, the rank is in a th element.
Okay, so I am stuck on what seems like it would be a simple web scrape. My goal is to scrape Morningstar.com to retrieve a fund name based on the entered URL. Here is an example of my code:
library(rvest)
url <- html("http://www.morningstar.com/funds/xnas/fbalx/quote.html")
url %>%
read_html() %>%
html_node('r_title')
I would expect it to return the name Fidelity Balanced Fund, but instead I get the following error: {xml_missing}
Suggestions?
Aaron
edit:
I also tried scraping via XHR request, but I think my issue is not knowing what css selector or xpath to select to find the appropriate data.
XHR code:
get.morningstar.Table1 <- function(Symbol.i, htmlnode) {
  try(res <- GET(url = "http://quotes.morningstar.com/fundq/c-header",
                 query = list(
                   t = Symbol.i,
                   region = "usa",
                   culture = "en-US",
                   version = "RET",
                   test = "QuoteiFrame"
                 )))
  tryCatch(x <- content(res) %>%
             html_nodes(htmlnode) %>%
             html_text() %>%
             trimws(),
           error = function(e) x <- NA)
  return(x)
}
# The HTML node in this case is a vkey
Still, the same question remains: am I using the correct CSS/XPath to look it up? The XHR code works great for requests that have a clear CSS selector.
OK, so it looks like the page dynamically loads the section you are targeting, so it doesn't actually get pulled in by read_html(). Interestingly, this part of the page also doesn't load using an RSelenium headless browser.
I was able to get this to work by scraping the page title (which is actually hidden on the page) and doing some regex to get rid of the junk:
library(rvest)
url <- 'http://www.morningstar.com/funds/xnas/fbalx/quote.html'
page <- read_html(url)
title <- page %>%
html_node('title') %>%
html_text()
symbol <- 'FBALX'
regex <- paste0(symbol, " (.*) ", symbol, ".*")
cleanTitle <- gsub(regex, '\\1', title)
As a side note, and for your future use, your first call to html_node() should include a "." before the class name you are targeting:
mypage %>%
html_node('.myClass')
Again, this doesn't help in this specific case, since the page is failing to load the section we are trying to scrape.
A final note: other sites contain the same info and are easier to scrape (Yahoo Finance, for example).