Scraping News Articles using rvest - r

I'm trying to scrape news articles from Fox News using rvest, but I can't find the right node to get the headline and URL for each result. Could it be that Fox News is blocking me from scraping their site?
html_fox <- read_html("https://www.foxnews.com/search-results/search?q=trump")
html_fox %>%
  html_nodes(".article") %>%
  html_text()
If I run this, the return is {xml_nodeset (0)}.
Can anybody help? I've been trying to figure this out for days now and I can't find an answer.
Thanks!

One possible solution could be the RSelenium library.
Below is a simple example:
library(RSelenium)
# Start a selenium server and browser
driver <- rsDriver(browser = "firefox", port = 4567L)
# Define the client part
remote_driver <- driver[["client"]]
# Send the website address to Firefox
remote_driver$navigate("https://www.foxnews.com/search-results/search?q=trump")
# To take the first article, you could do this (note the @ in the xpath):
all_articles <- remote_driver$findElement(using = 'xpath', value = '//*[@id="wrapper"]/div[2]/div[2]/div')$getElementText()
print(all_articles)
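To pull the headline text and link for every result rather than a single node, a sketch along the same lines (the '.article .title a' selector is an assumption about the rendered markup, not something verified against the live page):
# Grab all rendered result links and pull the title text plus href
# (the css selector below is an assumed one; confirm it in dev tools)
articles <- remote_driver$findElements(using = 'css selector', value = '.article .title a')
titles <- sapply(articles, function(x) x$getElementText()[[1]])
urls <- sapply(articles, function(x) x$getElementAttribute('href')[[1]])
head(data.frame(title = titles, url = urls))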

Related

How to pull a product link from customer profile page on Amazon

I'm trying to get the product link from a customer's profile page using R's rvest package.
I've referenced various questions on Stack Overflow, including here (could not read webpage with read_html using rvest package from r), but each time I try something, I'm not able to return the correct result.
For example on this profile page:
https://www.amazon.com/gp/profile/amzn1.account.AETT6GZORFV55BFNOAVFDIJ75QYQ/ref=cm_cr_dp_d_gw_tr?ie=UTF8
I'd like to be able to return this link, with the end goal to extract the product id: B01A51S9Y2
https://www.amazon.com/Amagabeli-Stainless-Chainmail-Scrubber-Pre-Seasoned/dp/B01A51S9Y2?ref=pf_vv_at_pdctrvw_dp
library(dplyr)
library(rvest)
library(stringr)
library(httr)
# get url
url='https://www.amazon.com/gp/profile/amzn1.account.AETT6GZORFV55BFNOAVFDIJ75QYQ/ref=cm_cr_dp_d_gw_tr?ie=UTF8'
x <- GET(url, add_headers('user-agent' = 'test'))
page <- read_html(x)
page %>%
  html_nodes("[class='a-link-normal profile-at-product-box-link a-text-normal']") %>%
  html_text()
# I did a test to see if I could even find the href, with no luck
test <- page %>%
  html_nodes("#a-page") %>%
  html_text()
grepl("B01A51S9Y2",test)
Thanks for the tip @Qharr on RSelenium. That is helpful, but I'm still unsure how to extract the link or ASIN.
library(RSelenium)
driver <- rsDriver(browser=c("chrome"), port = 4574L, chromever = "77.0.3865.40")
rd <- driver[["client"]]
rd$open()
rd$navigate("https://www.amazon.com/gp/profile/amzn1.account.AETT6GZORFV55BFNOAVFDIJ75QYQ/ref=cm_cr_arp_d_gw_btm?ie=UTF8")
prod <- rd$findElement(using = "css", '.profile-at-product-box-link')
prod$getElementText()
This doesn't really return anything
I added getElementAttribute('href') and was able to get the link:
prod <- rd$findElements(using = "css selector", '.profile-at-product-box-link')
for (link in seq_along(prod)) {
  print(prod[[link]]$getElementAttribute('href'))
}
That info is pulled in dynamically from a POST request the page makes, which your initial rvest request doesn't capture. This subsequent request returns, in JSON format, the content governing the ASINs, product links and so on.
You can find it in the network tab of dev tools (F12). Press F5 to refresh the page, then examine the network traffic.
It is not a simple POST request to mimic, so I would just go with RSelenium to let the page render, and then use the css selector
.profile-at-product-box-link
to gather a collection of webElements you can loop over, extracting the href attribute from each.
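Once you have the hrefs, the ASIN can be pulled out with a regular expression. A sketch, assuming the id always follows a /dp/ segment as in the example link above:
# Collect the hrefs, then keep only the 10-character id after /dp/
hrefs <- sapply(prod, function(x) x$getElementAttribute('href')[[1]])
asins <- sub('.*/dp/([A-Z0-9]{10}).*', '\\1', hrefs)
asins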

Nodes from a website are not scraping the content

I have tried to scrape the content of a news website ('titles', 'content', etc) but the nodes I am using do not return the content.
I have tried different nodes/tags, but none of them seem to be working. I have also used the SelectorGadget without any result. I have used the same strategy for scraping other websites and it has worked with no issues.
Here is an example of trying to get the 'content'
library(rvest)
url_test <- read_html('https://lasillavacia.com/silla-llena/red-de-la-paz/historia/las-disidencias-son-fruto-de-anos-de-division-interna-de-las-farc')
content_test <- html_text(html_nodes(url_test, ".article-body-mt-5"))
I have also tried using the xpath instead of the css class with no results.
Here is an example of trying to get the 'date'
content_test <- html_text(html_nodes(url_test, ".article-date"))
Even if I try to scrape all the <h> tags from the page, for example, I also get character(0).
What can be the problem? Thanks for any help!
Since the content is loaded into the page by javascript, I used RSelenium to scrape the data and it worked:
library(RSelenium)
# Setting up the remote browser
remDr <- RSelenium::remoteDriver(remoteServerAddr = "192.168.99.100",
                                 port = 4444L,
                                 browserName = "chrome")
remDr$open()
url_test <- 'https://lasillavacia.com/silla-llena/red-de-la-paz/historia/las-disidencias-son-fruto-de-anos-de-division-interna-de-las-farc'
remDr$navigate(url_test)
#Checking if the website page is loaded
remDr$screenshot(display = TRUE)
#Getting the content
content_test <- remDr$findElements(using = "css selector", value = '.article-date')
content_test <- sapply(content_test, function(x){x$getElementText()})
> content_test
[[1]]
[1] "22 de Septiembre de 2018"
Two things.
First, your css selector is wrong. It should have been:
".article-body.mt-5"
Second, the data is dynamically loaded and returned as JSON. You can find the endpoint in the network tab, so there is no need for the overhead of selenium.
library(jsonlite)
data <- jsonlite::read_json('https://lasillavacia.com/silla_llena_api/get?path=/contenido-nodo/68077&_format=hal_json')
The body field is html, so you can use an html parser. The following is a simple text dump; you would refine it with node selection, as sketched after the code.
library(rvest)
read_html(data[[1]]$body) %>% html_text()
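For example, a refinement that keeps only the paragraph text (this assumes the body field contains ordinary <p> tags; adjust to the actual markup):
# Keep only the paragraph text instead of a raw dump
read_html(data[[1]]$body) %>%
  html_nodes("p") %>%
  html_text(trim = TRUE)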

How to extract id names from search result urls using rvest? (CSS selector isn't working)

I'm attempting to extract a list of product item names from a search result page (link here).
library(rvest)
results <- read_html('https://www.fishersci.com/us/en/catalog/search/products?keyword=sodium+hydroxide&nav=')
results %>%
  html_nodes(".result_title a") %>%
  html_text()
which returns
character(0)
I've also attempted to make use of:
html_attr('href')
with no luck. Can I even use css to pull the titles of these links? I'm trying to make a list of the 30 product results (e.g. "Sodium Hydroxide (Pellets/Certified ACS), Fisher Chemical"). Are these links generated by javascript?
Thanks for any help, this is my first scraping project and my knowledge of web design is much simpler than this particular page.
The result is indeed generated with javascript. rvest doesn't handle javascript at the moment, but alternatives exist.
For example, you can use selenium and phantomjs to get to what you want:
library(RSelenium) # wrapper around Selenium
library(wdman)     # helper to download and configure phantomjs
library(rvest)
pjs <- phantomjs(port = 4444L)
remote_driver <- remoteDriver(browserName = "phantomjs", port = 4444L)
remote_driver$open(silent = TRUE)
remote_driver$navigate("https://www.fishersci.com/us/en/catalog/search/products?keyword=sodium+hydroxide&nav=")
page_source <- remote_driver$getPageSource()[[1]]
page_source %>%
  read_html() %>%
  html_nodes(css = '.result_title') %>%
  html_text()
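Since you also tried html_attr('href'), the links come out the same way once the page has rendered. A sketch, assuming the titles are anchors inside the .result_title containers:
# Pull the product links once the javascript has run
page_source %>%
  read_html() %>%
  html_nodes(".result_title a") %>%
  html_attr("href")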

Rvest not seeing xpath in website

I am attempting to scrape this website using the rvest package in R. I have done it successfully with several other websites, but this one doesn't seem to work and I am not sure why.
I copied the xpath from inside Chrome's inspector tool, but when I specify it in the rvest script it shows that it doesn't exist. Does it have anything to do with the fact that the table is generated and not static?
I appreciate the help!
library(rvest)
library(tidyverse)
library(stringr)
library(readr)
a<-read_html("http://www.diversitydatakids.org/data/profile/217/benton-county#ind=10,12,15,17,13,20,19,21,24,2,22,4,34,35,116,117,123,99,100,127,128,129,199,201")
a <- html_node(a, xpath = "//*[@id='indicator10']")
a<-html_table(a)
a
Regarding your question: yes, you are unable to get it because it is being generated dynamically. In these cases, it's better to use the RSelenium library:
#Loading libraries
library(rvest) # to read the html
library(magrittr) # for the '%>%' pipe symbols
library(RSelenium) # to get the loaded html of the website
# starting local RSelenium (this is the only way to start RSelenium that works for me at the moment)
selCommand <- wdman::selenium(jvmargs = c("-Dwebdriver.chrome.verboseLogging=true"), retcommand = TRUE)
shell(selCommand, wait = FALSE, minimized = TRUE)
remDr <- remoteDriver(port = 4567L, browserName = "chrome")
remDr$open()
# Specifying the url of the website to be scraped
url <- "http://www.diversitydatakids.org/data/profile/217/benton-county#ind=10,12,15,17,13,20,19,21,24,2,22,4,34,35,116,117,123,99,100,127,128,129,199,201"
# go to website
remDr$navigate(url)
# get page source and save it as an html object with rvest
html_obj <- remDr$getPageSource(header = TRUE)[[1]] %>% read_html()
# get the element you are looking for
a <- html_node(html_obj, xpath = "//*[@id='indicator10']")
I guess that you are trying to get the first table. In that case, maybe it's better to just get the table with html_table():
# get the table with the indicator10 id
indicator10_table <-html_node(html_obj, "#indicator10 table") %>% html_table()
I'm using the CSS selector this time instead of the XPath.
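And since the URL fragment lists many indicator ids, a sketch for collecting every indicator table in one go (this assumes each table sits inside an element whose id begins with "indicator"):
# Grab every table whose container id begins with "indicator"
all_tables <- html_obj %>%
  html_nodes("[id^='indicator'] table") %>%
  html_table(fill = TRUE)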
Hope it helps! Happy scraping!

Scrape forum using R

I would like to seek help in scraping information from HardwareZone.
This is the link: http://www.hardwarezone.com.sg/search/forum/camera
I would like to get all the information on cameras from the forum.
library(RSelenium)
library(magrittr)
base_url = "http://www.hardwarezone.com.sg/search/forum/camera"
checkForServer()
startServer()
remDrv <- remoteDriver()
remDrv$open()
I tried using the above code for the first part, and I got an error on the last line (I'm a Mac user).
The error I got is an undefined RCurl call. I have looked at many possible solutions, but I still cannot solve it.
library(rvest)
url <- "http://www.hardwarezone.com.sg/search/forum/camera"
result <- url %>%
  read_html() %>%
  html_nodes(xpath = '//*[@id="cse"]/table[1]') %>%
  html_table()
result
I tried another method (the code above), but it still didn't work.
Can anyone guide me through this?
Thank you.
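A sketch that follows the pattern of the other answers here: checkForServer() and startServer() were dropped from current versions of RSelenium in favour of rsDriver(), so this assumes a recent install, and the table xpath is the question's with the missing @ restored. Since the search results are rendered by javascript, the page has to load in a real browser before rvest can see the table.
library(RSelenium)
library(rvest)
library(magrittr)
# rsDriver() replaces the removed checkForServer()/startServer() helpers
driver <- rsDriver(browser = "firefox", port = 4568L)
remDrv <- driver[["client"]]
remDrv$navigate("http://www.hardwarezone.com.sg/search/forum/camera")
# Parse the rendered page with rvest and pull the results table
remDrv$getPageSource()[[1]] %>%
  read_html() %>%
  html_nodes(xpath = '//*[@id="cse"]/table[1]') %>%
  html_table()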
