Nodes from a website are not scraping the content - r

I have tried to scrape the content of a news website ('titles', 'content', etc) but the nodes I am using do not return the content.
I have tried different nodes/tags, but none of them seem to be working. I have also used the SelectorGadget without any result. I have used the same strategy for scraping other websites and it has worked with no issues.
Here is an example of trying to get the 'content'
library(rvest)
url_test <- read_html('https://lasillavacia.com/silla-llena/red-de-la-paz/historia/las-disidencias-son-fruto-de-anos-de-division-interna-de-las-farc')
content_test <- html_text(html_nodes(url_test, ".article-body-mt-5"))
I have also tried using the xpath instead of the css class with no results.
Here is an example of trying to get the 'date'
content_test <- html_text(html_nodes(url_test, ".article-date"))
Even if I try to scrape all the <h>from the website page, for example, I do also get character(0)
What can be the problem? Thanks for any help!

Since the content is loaded by javascript to the page, I used RSelenium to scrape the data and it worked
library(RSelenium)
#Setting the remote browser
remDr <- RSelenium::remoteDriver(remoteServerAddr = "192.168.99.100",
port = 4444L,
browserName = "chrome")
remDr$open()
url_test <- 'https://lasillavacia.com/silla-llena/red-de-la-paz/historia/las-disidencias-son-fruto-de-anos-de-division-interna-de-las-farc'
remDr$navigate(url_test)
#Checking if the website page is loaded
remDr$screenshot(display = TRUE)
#Getting the content
content_test <- remDr$findElements(using = "css selector", value = '.article-date')
content_test <- sapply(content_test, function(x){x$getElementText()})
> content_test
[[1]]
[1] "22 de Septiembre de 2018"

Two things.
Your css selector is wrong. It should have been:
".article-body.mt-5"
The data is dynamically loaded and returned as json. You can find the endpoint in the network tab. No need for overhead of using selenium.
library(jsonlite)
data <- jsonlite::read_json('https://lasillavacia.com/silla_llena_api/get?path=/contenido-nodo/68077&_format=hal_json')
body is html so you could use html parser. The following is a simple text dump. You would refine with node selection.
library(rvest)
read_html(data[[1]]$body) %>% html_text()

Related

Scraping News Articles using rvest

I'm trying to scrape news articles from FoxNews using Rvest. However I can't find the right Node to get the header and URL for scraping. Could it be that FoxNews is blocking me from scraping their site?
html_fox <- read_html("https://www.foxnews.com/search-results/search?q=trump")
html_fox %>%
html_nodes(".article") %>%
html_text()
If I enter this the return is {xml_nodeset (0)}
Can anybody help? I've been trying to figure this out for days now and I can't find an answer.
Thanks!
One possible solution could be RSelenium library
Below simple example
library(RSelenium)
#Start a selenium server and browser
driver <- rsDriver(browser=c("firefox"), port = 4567L)
#Defines the client part.
remote_driver <- driver[["client"]]
#Sent the web site address to the firefox
remote_driver$navigate("https://www.foxnews.com/search-results/search?q=trump")
#To take the first article, you could do this:
all_articles<-remote_driver$findElement(using = 'xpath', value = '//*[#id="wrapper"]/div[2]/div[2]/div')$getElementText()
print(all_articles)

Scraping with RSelenium and understanding html tags

I've been learning how to scrape the web using RSelenium by trying to gather sports data. I have a difficult time understanding how to get certain elements based on tags. In the following example, I get the player names that I want, but I only get the top 28. I don't understand why, since when I inspect lower elements, they have similar xpaths. Example:
library(rvest)
library(RCurl)
library(RSelenium)
library(XML)
library(dplyr)
# URL
rotoURL = as.character("https://rotogrinders.com/lineuphq/nba?site=draftkings")
# Start remote driver
remDrall <- rsDriver(browser = "chrome", verbose = F)
remDr <- remDrall$client
remDr$open(silent = TRUE)
Sys.sleep(1)
# Go to URL
remDr$navigate(rotoURL)
Sys.sleep(3)
# Get player names and clean
plyrNms <- remDr$findElement(using = "xpath", "//*[#id='primary-pane']/div/div[3]/div/div/div/div")
plyrNmsText <- plyrNms$getElementAttribute("outerHTML")[[1]]
plyrNmsClean <- htmlTreeParse(plyrNmsText, useInternalNodes=T)
plyrNmsCleaner <- trimws(unlist(xpathApply(plyrNmsClean, '//a', xmlValue)))
plyrNmsCleaner <- plyrNmsCleaner[!plyrNmsCleaner=='']
If you run this, you'll see that the list stops at Ben McLemore, even though there are 50+ names below. I tried this code yesterday as well and it still limited me to 28 names, which tells me the 28 isn't arbitrary.
What part of my code is preventing me from grabbing all names? I'm assuming it has to do with findElement, but I've tried a 100 different xpaths with the InspectorGadget html selector tool, and nothing seems to work. Any help would be much appreciated!

Get webpage links using rvest

I tried using rvest to extract links of "VAI ALLA SCHEDA PRODOTTO" form this website:
https://www.asusworld.it/series.asp?m=Notebook#db_p=2
My R code:
library(rvest)
page.source <- read_html("https://www.asusworld.it/series.asp?m=Notebook#db_p=2")
version.block <- html_nodes(page.source, "a") %>% html_attr("href")
However, I can't get any links look like "/model.asp?p=2340487". How can I do?
element looks like this
You may utilize RSelenium to request the intended information from the website.
Load the relevant packages. (Please ensure that the R package 'wdman' is up-to-date.)
library("RSelenium")
library("wdman")
Initialize the R Selenium server (I use Firefox - recommended).
rD <- rsDriver(browser = "firefox", port = 4850L)
rd <- rD$client
Navigate to the URL (and set an appropriate waiting time).
rd$navigate("https://www.asusworld.it/series.asp?m=Notebook#db_p=2")
Sys.sleep(5)
Request the intended information (you may refer to, for example, the 'xpath' of the element).
element <- rd$findElement(using = 'xpath', "//*[#id='series']/div[2]/div[2]/div/div/div[2]/table/tbody/tr/td/div/a/div[2]")
Display the requested element (i.e., information).
element$getElementText()
[[1]]
[1] "VAI ALLA SCHEDA PRODOTTO"
A detailed tutorial is provided here (for OS, see this tutorial). Hopefully, this helps.

Rvest not seeing xpath in website

I am attempting to scrape this website using the rvest package in R. I have done it successfully with several other website but this one doesn't seem to work and I am not sure why.
I copied the xpath from inside chrome's inspector tool, but when i specify it in the rvest script it shows that it doesn't exist. Does it have anything to do with the fact that the table is generated and not static?
appreciate the help!
library(rvest)
library (tidyverse)
library(stringr)
library(readr)
a<-read_html("http://www.diversitydatakids.org/data/profile/217/benton-county#ind=10,12,15,17,13,20,19,21,24,2,22,4,34,35,116,117,123,99,100,127,128,129,199,201")
a<-html_node(a, xpath="//*[#id='indicator10']")
a<-html_table(a)
a
Regarding your question, yes, you are unable to get it because is being generated dynamically. In these cases, it's better to use the RSelenium library:
#Loading libraries
library(rvest) # to read the html
library(magrittr) # for the '%>%' pipe symbols
library(RSelenium) # to get the loaded html of the website
# starting local RSelenium (this is the only way to start RSelenium that is working for me atm)
selCommand <- wdman::selenium(jvmargs = c("-Dwebdriver.chrome.verboseLogging=true"), retcommand = TRUE)
shell(selCommand, wait = FALSE, minimized = TRUE)
remDr <- remoteDriver(port = 4567L, browserName = "chrome")
remDr$open()
#Specifying the url for desired website to be scrapped
url <- "http://www.diversitydatakids.org/data/profile/217/benton-county#ind=10,12,15,17,13,20,19,21,24,2,22,4,34,35,116,117,123,99,100,127,128,129,199,201"
# go to website
remDr$navigate(url)
# get page source and save it as an html object with rvest
html_obj <- remDr$getPageSource(header = TRUE)[[1]] %>% read_html()
# get the element you are looking for
a <-html_node(html_obj, xpath="//*[#id='indicator10']")
I guess that you are trying to get the first table. In that case, maybe it's better to just get the table with read_table:
# get the table with the indicator10 id
indicator10_table <-html_node(html_obj, "#indicator10 table") %>% html_table()
I'm using the CSS selector this time instead of the XPath.
Hope it helps! Happy scraping!

blank value captures while scraping using Rselenium

I am trying to scrape a textbox value from the URL in the code. I picked the css using slector gadget. It is not able to capture the content in the text box. Tested several other CSS toobut the textbox value is not captured.
Text box is : construction year
Please help . Below is the code for reference.
url = "https://www.ncspo.com/FIS/dbBldgAsset_public.aspx?BldgAssetID=8848"
values = list()
remDr$navigate(url)
page_source<-remDr$getPageSource()
a = read_html(page_source[[1]])
= html_nodes(a,"#ctl00_mainContentPlaceholder_txtConstructionYear_iu")
values = html_text(html_main_node)
values
Thanks in advance
Why RSelenium? It scrapes fine with rvest (though it is a horrible SharePoint site which may cause problems down the end with maintaining the proper view state cookies).
library(rvest)
pg <- html_session("https://www.ncspo.com/FIS/dbBldgAsset_public.aspx?BldgAssetID=8848")
html_attr(html_nodes(pg, "input#ctl00_mainContentPlaceholder_txtConstructionYear_iu"), "value")
## [1] 1987
You should be grabbing the value attribute vs the node text. This should work in the your selenium code, too.
The above answer also works. But if you are only trying to use RSelenium. Here is the code
library(RSelenium)
checkForServer()
startServer()
Sys.sleep(5)
re<-remoteDriver()
re$open()
re$navigate("https://www.ncspo.com/FIS/dbBldgAsset_public.aspx?BldgAssetID=8848")
re$findElement(using = "css selector", "#ctl00_mainContentPlaceholder_txtConstructionYear_iu")$clickElement()
text<-unlist(re$findElement(using = "css selector", "#ctl00_mainContentPlaceholder_txtConstructionYear_iu")$getElementAttribute("value"))
This works

Resources