I am attempting to scrape this website using the rvest package in R. I have done it successfully with several other websites, but this one doesn't seem to work, and I am not sure why.
I copied the XPath from Chrome's inspector tool, but when I specify it in the rvest script, it reports that the node doesn't exist. Does it have anything to do with the fact that the table is generated dynamically rather than being static?
Appreciate the help!
library(rvest)
library(tidyverse)
library(stringr)
library(readr)
a <- read_html("http://www.diversitydatakids.org/data/profile/217/benton-county#ind=10,12,15,17,13,20,19,21,24,2,22,4,34,35,116,117,123,99,100,127,128,129,199,201")
a <- html_node(a, xpath = "//*[@id='indicator10']")
a <- html_table(a)
a
Regarding your question: yes, you are unable to get it because the table is being generated dynamically. In these cases, it's better to use the RSelenium library:
#Loading libraries
library(rvest) # to read the html
library(magrittr) # for the '%>%' pipe symbols
library(RSelenium) # to get the loaded html of the website
# starting local RSelenium (this is the only way to start RSelenium that is working for me atm)
selCommand <- wdman::selenium(jvmargs = c("-Dwebdriver.chrome.verboseLogging=true"), retcommand = TRUE)
shell(selCommand, wait = FALSE, minimized = TRUE)
remDr <- remoteDriver(port = 4567L, browserName = "chrome")
remDr$open()
#Specifying the url for the desired website to be scraped
url <- "http://www.diversitydatakids.org/data/profile/217/benton-county#ind=10,12,15,17,13,20,19,21,24,2,22,4,34,35,116,117,123,99,100,127,128,129,199,201"
# go to website
remDr$navigate(url)
# get page source and save it as an html object with rvest
html_obj <- remDr$getPageSource(header = TRUE)[[1]] %>% read_html()
# get the element you are looking for
a <- html_node(html_obj, xpath = "//*[@id='indicator10']")
I guess that you are trying to get the first table. In that case, maybe it's better to just get the table with html_table():
# get the table with the indicator10 id
indicator10_table <- html_node(html_obj, "#indicator10 table") %>% html_table()
I'm using the CSS selector this time instead of the XPath.
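For completeness, the same table can also be reached with an XPath version of that selector (a small untested sketch):
# equivalent node selection with XPath instead of CSS
indicator10_table <- html_node(html_obj, xpath = "//*[@id='indicator10']//table") %>% html_table()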
Hope it helps! Happy scraping!
I have started using the rvest package and have encountered some consistent problems, namely how exactly to refer to the HTML code.
For example, the code below returns a null character (I ultimately want 0.74). Basically, the only thing I can get to return anything is using "div" as the node, which just returns all the text. "tr.total-return", "total-return", and "div.sal-trailing-return__middle" all returned null too.
a = read_html("https://www.morningstar.com/funds/xnas/hcyix/performance")
b = html_nodes(a, "td")
That page loads dynamically. You thus need to use RSelenium, and not just rvest.
This code works for me to obtain the data point of 0.74.
library(rvest)
library(tidyverse)
library(RSelenium)
url <- "https://www.morningstar.com/funds/xnas/hcyix/performance"
# RSelenium with Firefox
rD <- RSelenium::rsDriver(browser="firefox", port=4546L, verbose=F)
remDr <- rD[["client"]]
remDr$navigate(url)
Sys.sleep(4)
# get the page source
web <- remDr$getPageSource()
web <- xml2::read_html(web[[1]])
b <- html_node(web, ".total-return > td:nth-child(1)") %>%
html_text() %>%
trimws()
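# optional extra step (not in the original answer): the pipeline above yields
# the text "0.74", so convert it to a number if you need one
b <- as.numeric(b)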
# close RSelenium
remDr$close()
gc()
rD$server$stop()
system("taskkill /im java.exe /f", intern=FALSE, ignore.stdout=FALSE)
I have tried to scrape the content of a news website ('titles', 'content', etc.), but the nodes I am using do not return the content.
I have tried different nodes/tags, but none of them seem to work. I have also used SelectorGadget, without any result. I have used the same strategy for scraping other websites, and it has worked with no issues.
Here is an example of trying to get the 'content'
library(rvest)
url_test <- read_html('https://lasillavacia.com/silla-llena/red-de-la-paz/historia/las-disidencias-son-fruto-de-anos-de-division-interna-de-las-farc')
content_test <- html_text(html_nodes(url_test, ".article-body-mt-5"))
I have also tried using the xpath instead of the css class with no results.
Here is an example of trying to get the 'date'
content_test <- html_text(html_nodes(url_test, ".article-date"))
Even if I try to scrape all the <h> tags from the page, for example, I also get character(0).
What can be the problem? Thanks for any help!
Since the content is loaded into the page by JavaScript, I used RSelenium to scrape the data, and it worked:
library(RSelenium)
#Setting the remote browser (192.168.99.100 is the address of the Selenium server here, e.g. a Docker machine)
remDr <- RSelenium::remoteDriver(remoteServerAddr = "192.168.99.100",
                                 port = 4444L,
                                 browserName = "chrome")
remDr$open()
url_test <- 'https://lasillavacia.com/silla-llena/red-de-la-paz/historia/las-disidencias-son-fruto-de-anos-de-division-interna-de-las-farc'
remDr$navigate(url_test)
#Checking if the website page is loaded
remDr$screenshot(display = TRUE)
#Getting the content
content_test <- remDr$findElements(using = "css selector", value = '.article-date')
content_test <- sapply(content_test, function(x){x$getElementText()})
> content_test
[[1]]
[1] "22 de Septiembre de 2018"
Two things.
Your CSS selector is wrong. It should have been:
".article-body.mt-5"
The data is dynamically loaded and returned as JSON. You can find the endpoint in the network tab. No need for the overhead of Selenium.
library(jsonlite)
data <- jsonlite::read_json('https://lasillavacia.com/silla_llena_api/get?path=/contenido-nodo/68077&_format=hal_json')
The body is HTML, so you can use an HTML parser. The following is a simple text dump; you would refine it with node selection.
library(rvest)
read_html(data[[1]]$body) %>% html_text()
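For example, to keep only paragraph text rather than one big dump (assuming the body markup uses ordinary <p> tags):
read_html(data[[1]]$body) %>% html_nodes("p") %>% html_text()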
New to programming and trying to scrape data from the site below. When I run the code below, it returns an empty dataset or table. Any help or alternatives will be greatly appreciated.
url <- "https://fasttrack.grv.org.au/Dog/Form?id=2003010003"
tab <- url %>% read_html %>%
html_node("dogruns_wrapper") %>%
html_text()
View(tab)
I have tried with XPath with the same result, and using html_table() instead of html_text() returns the error "no applicable method for 'html_table' applied to an object of class "xml_missing"".
As Mislav stated, the table is generated with JavaScript, so your best option is RSelenium.
In addition, if you want to get the table, you can get it with less code if you use html_table().
My try:
# Load packages
library(rvest) #Loading the rvest package
library(magrittr) # for the '%>%' pipe symbols
library(RSelenium) # to get the loaded html of the webpage
# starting local RSelenium (this is the only way to start RSelenium that is working for me atm)
selCommand <- wdman::selenium(jvmargs = c("-Dwebdriver.chrome.verboseLogging=true"), retcommand = TRUE)
shell(selCommand, wait = FALSE, minimized = TRUE)
remDr <- remoteDriver(port = 4567L, browserName = "chrome")
remDr$open()
# define url
url <- "https://fasttrack.grv.org.au/Dog/Form?id=2003010003"
# go to website
remDr$navigate(url)
# as it's being loaded with JavaScript and it has a slow load, add a sleep here
Sys.sleep(10) # increase as needed
# get the html object of the webpage
html_obj <- remDr$getPageSource(header = TRUE)[[1]] %>% read_html()
# read the table in the html_obj
tab <- html_obj %>% html_table() %>% .[[1]]
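If the page holds more than one table, you can also target the wrapper from the question instead of taking the first match; a sketch, assuming dogruns_wrapper is the id of the element wrapping the table:
tab <- html_obj %>% html_node("#dogruns_wrapper table") %>% html_table()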
Hope it helps! However, always check if webpages allow scraping before doing it!
Check Terms and conditions:
Except for the direct purpose of viewing, printing, accessing or interacting with the Web Site for your own personal use or as otherwise indicated on the Web Site or these Terms and Conditions, you must not copy, reproduce, modify, communicate to the public, adapt, transfer, distribute, download or store any of the contents of the Web Site (including Race Information as described below), or incorporate any part of the Web Site into another web site without GRV’s written consent.
I've been learning how to scrape the web using RSelenium by trying to gather sports data. I have a difficult time understanding how to get certain elements based on tags. In the following example, I get the player names that I want, but I only get the top 28. I don't understand why, since when I inspect lower elements, they have similar xpaths. Example:
library(rvest)
library(RCurl)
library(RSelenium)
library(XML)
library(dplyr)
# URL
rotoURL = "https://rotogrinders.com/lineuphq/nba?site=draftkings"
# Start remote driver
remDrall <- rsDriver(browser = "chrome", verbose = F)
remDr <- remDrall$client
remDr$open(silent = TRUE)
Sys.sleep(1)
# Go to URL
remDr$navigate(rotoURL)
Sys.sleep(3)
# Get player names and clean
plyrNms <- remDr$findElement(using = "xpath", "//*[@id='primary-pane']/div/div[3]/div/div/div/div")
plyrNmsText <- plyrNms$getElementAttribute("outerHTML")[[1]]
plyrNmsClean <- htmlTreeParse(plyrNmsText, useInternalNodes=T)
plyrNmsCleaner <- trimws(unlist(xpathApply(plyrNmsClean, '//a', xmlValue)))
plyrNmsCleaner <- plyrNmsCleaner[!plyrNmsCleaner=='']
If you run this, you'll see that the list stops at Ben McLemore, even though there are 50+ names below. I tried this code yesterday as well and it still limited me to 28 names, which tells me the 28 isn't arbitrary.
What part of my code is preventing me from grabbing all the names? I'm assuming it has to do with findElement, but I've tried a hundred different XPaths with the InspectorGadget HTML selector tool, and nothing seems to work. Any help would be much appreciated!
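I also wondered whether the page lazy-loads rows as you scroll. Here's what I'd try next (an untested sketch on my part, assuming scrolling is what triggers the extra rows):
# scroll to the bottom to trigger any lazy loading, wait, then re-run findElement
remDr$executeScript("window.scrollTo(0, document.body.scrollHeight);")
Sys.sleep(2)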
I'm attempting to extract a list of product item names from a search result page (link here).
library(rvest)
results <- read_html('https://www.fishersci.com/us/en/catalog/search/products?keyword=sodium+hydroxide&nav=')
results %>%
html_nodes(".result_title a") %>%
html_text()
which returns
character(0)
I've also attempted to make use of:
html_attr('href')
with no luck. Can I even use CSS to pull the titles of these links? I'm trying to make a list of the 30 product results (e.g. "Sodium Hydroxide (Pellets/Certified ACS), Fisher Chemical"). Are the ids for these links generated by JavaScript?
Thanks for any help, this is my first scraping project and my knowledge of web design is much simpler than this particular page.
The result is indeed generated with JavaScript. rvest doesn't handle JavaScript at the moment, but other alternatives exist.
For example, you can use Selenium and PhantomJS to get what you want:
library(RSelenium) # Wrapper around Selenium
library(wdman) # helper to download and configure phantomjs
library(rvest)
phantomjs <- wdman::phantomjs(port = 4444L)
remote_driver <- remoteDriver(browserName = "phantomjs", port = 4444L)
remote_driver$open(silent = TRUE)
remote_driver$navigate("https://www.fishersci.com/us/en/catalog/search/products?keyword=sodium+hydroxide&nav=")
page_source <- remote_driver$getPageSource()[[1]]
page_source %>%
read_html() %>%
html_nodes(css = '.result_title') %>%
html_text()