I am trying to scrape synonyms from the National Cancer Institute Thesaurus database, but I am having trouble finding the right HTML to point to for this. Below is my code and the data frame I am using. When I run my script to pull the synonyms I get: Error in open.connection(x, "rb") : HTTP error 404. I can't figure out what the right URL should be or how to find it.
library(xml2)
library(rvest)
library(dplyr)
library(tidyverse)
synonyms<-read_csv("terms.csv")
##list of acronyms
words <- c(synonyms$Keyword)
##Designate html link and the values to search
htmls <- paste0("https://ncit.nci.nih.gov/ncitbrowser/pages/concept_details.jsf/", words)
Data<-data.frame(Pages=c(htmls))
results <- sapply(Data$Pages, function(url){
  try(
    url %>%
      as.character() %>%
      read_html() %>%
      html_nodes('p') %>%
      html_text()
  )
})
I suspect there's a problem with this line of code:
##Designate html link and the values to search
htmls <- paste0("https://ncit.nci.nih.gov/ncitbrowser/pages/concept_details.jsf/", words)
Because paste0() just concatenates text together, this will give you URLs like
https://ncit.nci.nih.gov/ncitbrowser/pages/concept_details.jsf/Ketamine
https://ncit.nci.nih.gov/ncitbrowser/pages/concept_details.jsf/Azacitidine
https://ncit.nci.nih.gov/ncitbrowser/pages/concept_details.jsf/Axicabtagene+Ciloleucel
While I do not have particular experience with rvest, the 404 error you see almost certainly means those URLs do not point to real pages; they would fail in a web browser as well. I recommend logging or printing out htmls and pasting a few of them into a browser so you can confirm whether they actually work.
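For example, here is a minimal sketch of that check using httr (the httr calls and the ok_urls name are my additions, not part of the original script):

library(httr)

# Print a few of the generated URLs, then check their HTTP status before scraping;
# a 404 here means the URL itself is wrong, not the scraping code.
print(head(htmls))

status <- sapply(htmls, function(u) status_code(HEAD(u)))
table(status)

# Only attempt read_html() on URLs that actually resolve.
ok_urls <- htmls[status == 200]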
I will point out that in this particular case the website offers a downloadable database; you might find it easier to download and query that offline than to do this web scraping.
I've come up with a partial solution to my question, but still need help getting to the end.
My issue is that I can no longer get JSON from a website using R, but I can still access it via my browser:
library(rvest)
library(httr)
library(jsonlite)
library(dplyr)
website <- 'http://api.draftkings.com/sites/US-DK/sports/v1/sports?format=json'
fromJSON(website)
Now gives me:
Error in open.connection(con, "rb") : SSL certificate problem: certificate has expired
but I'm still able to visit the site on Chrome.
Ideally I'd like to find a way to get this working using fromJSON()
I don't entirely understand what's causing this error, so I tried a bunch of different solutions. I found that I could at least read the html using this:
website <- 'http://api.draftkings.com/sites/US-DK/sports/v1/sports?format=json'
doc <- website %>%
  httr::GET(config = httr::config(ssl_verifypeer = FALSE)) %>%
  read_html()
However, from here I'm stuck: I'm struggling to parse doc into the JSON I require. I've tried things like doc %>% xmlParse() %>% xmlToList() %>% toJSON() %>% fromJSON(), but the result comes out as gibberish.
So my question comes down to: 1) Is there a way to get around the SSL certificate problem so that I can use fromJSON() directly again? And 2) if not, how can I convert the html document into a usable JSON format?
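For what it's worth, one possible workaround, sketched under the assumption that only the certificate check is failing: skip read_html() entirely, pull the body as text with httr, and hand that straight to fromJSON().

library(httr)
library(jsonlite)

website <- 'http://api.draftkings.com/sites/US-DK/sports/v1/sports?format=json'

# Fetch the raw body with certificate verification disabled (a workaround,
# not a fix for the expired certificate), then parse it as JSON directly.
resp <- GET(website, config = config(ssl_verifypeer = FALSE))
txt <- content(resp, as = "text", encoding = "UTF-8")
sports <- fromJSON(txt)

str(sports, max.level = 1)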
I'm trying to scrape this website using rvest: https://www.camara.cl/legislacion/sesiones_sala/sesiones_sala.aspx
Notice that the site loads quickly, but the data takes some time to appear. I realized that, while the content appears as html text in the web browser's Inspector, the nodes appear empty when scraped using rvest.
library(dplyr)
library(rvest)
camara <- "https://www.camara.cl/legislacion/sesiones_sala/sesiones_sala.aspx" %>%
session()
camara %>%
html_elements("h2")
camara %>%
html_elements(".box-proyecto")
camara %>%
html_elements("#trabajo-en-sala") %>%
html_elements("#info-tabs") %>%
html_elements("#ajax-container") %>%
html_elements("pnlTablaOrdinaria")
All of these should return at least some text content, but they appear empty.
I tried using V8 to interpret javascript according to these instructions, but the site appears to use JS only for interface elements, not for data retrieval.
I also tried to run it through PhantomJS following these instructions, but couldn’t run the script due to permission issues.
It seems that I need to perform a GET request for the data, but the URL I found in the site's code returns nothing: https://www.camara.cl/legislacion/sesiones_sala/tabla.aspx?_=1628291424652
I can’t use RSelenium as I’m working remotely through a headless server.
You need to pick up a session cookie (ASP.NET_SessionId) from the initial URL. You could use session() for this, for example:
library(rvest)
library(magrittr)
# Visiting the main page first picks up the ASP.NET_SessionId cookie;
# session_jump_to() then requests tabla.aspx within the same session.
r <- session('https://www.camara.cl/legislacion/sesiones_sala/sesiones_sala.aspx') %>%
  session_jump_to('https://www.camara.cl/legislacion/sesiones_sala/tabla.aspx')

tables <- r %>% read_html() %>% html_table()
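If you prefer httr, a rough equivalent of the same idea (untested against this site, so treat it as a sketch): requests made through one shared httr handle reuse cookies, so the ASP.NET_SessionId set by the first request is sent with the second.

library(httr)
library(rvest)

# One shared handle keeps cookies (including ASP.NET_SessionId) across requests.
h <- handle("https://www.camara.cl")

GET("https://www.camara.cl/legislacion/sesiones_sala/sesiones_sala.aspx", handle = h)
resp <- GET("https://www.camara.cl/legislacion/sesiones_sala/tabla.aspx", handle = h)

tables <- html_table(read_html(content(resp, as = "text", encoding = "UTF-8")))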
I have a script below that works for simple html scraping, but nothing is returned for this particular site. I'm new to using html with R and SelectorGadget, and I have other sites that work, so I am wondering why this one does not see the element. The picture below shows the path in the highlighted red box, and I am curious whether the # before the fancy-box is what makes it hidden. Any tips and language corrections would be helpful as I am still learning how to scrape html.
library(rvest)
library(dplyr)
library(tm)
library(stringi)
library(readr)
url <- read_html('https://www.draftkings.com/draft/contest/84207356')
rot <- url %>%
  html_nodes('.prize-payouts td+ td') %>%
  html_text()
roster <- data.frame(ROT = rot)
The website is using javascript to render the page. One solution is to download the data as JSON instead. If you examine the requests listed under the Network tab of your browser's developer tools, you can find the file that supplies the contest data.
This file should provide the information you are looking for:
library(jsonlite)
fromJSON("https://api.draftkings.com/contests/v1/contests/84207356?format=json")
Be sure to comply with the terms of service of this website.
I am scraping data from this website and, for some reason, I'm unable to get the name of the seller, even though I use the exact node returned by SelectorGadget. I have, however, managed to get all the other data with rvest.
I managed to scrape the seller's name with RSelenium but that takes too much time. Anyway, here's the link of the page I'm scraping:
https://www.kijiji.ca/v-fitness-personal-trainer/bedford/swimming-lessons/1421292946
Here's the code I've used
SellerName <-
  read_html("https://kijiji.ca/v-fitness-personal-trainer/bedford/swimming-lessons/1421292946") %>%
  html_nodes(".link-4200870613") %>%
  html_text()
You can easily regex the seller name out of the returned HTML, as it is contained in a script tag (presumably it is read from there when the browser is able to run javascript, which rvest does not).
library(rvest)
library(magrittr)
library(stringr)
p <- read_html('https://www.kijiji.ca/v-fitness-personal-trainer/bedford/swimming-lessons/1421292946') %>% html_text()
seller_name <- str_match_all(p,'"sellerName":"(.*?)"')[[1]][,2][1]
print(seller_name)
Regex: the pattern "sellerName":"(.*?)" matches the sellerName key, and the first capture group holds the seller's name.
I'm trying to scrape data from a webpage, but I get a 404 error for the URLs below. There is, however, data behind those 404 links that I can reach from within the browser. Here's the example:
library(tidyverse)
library(rvest)
url <- "http://www.uscho.com/scoreboard/division-i-men/20172018/composite-schedule/"
link_list <- url %>%
  read_html() %>%
  html_nodes("td:nth-child(13) a") %>%
  html_attr("href") %>%
  {paste0("http://www.uscho.com", .)}
Now, for example, open the 200th link (http://www.uscho.com/recaplink.php?gid=1_970_20172018) in your web browser. You'll get a 404 error page.
I don't actually want the 404 error, but in the address bar there is a URL that, after some manipulation, I can use to get to the actual webpage I want ("https://www.uscho.com/recaps/?p=171810970").
This URL, however, doesn't show up in R. Running read_html(link_list[200]), I only get a 404 error.
Any idea how I can get the URL from the browser within R?
To get the URL from the browser within R using rvest, you can search the page's metadata:
library(rvest)
library(tidyverse)
url <- "https://stackoverflow.com/questions/50555460/scrape-data-in-url-from-404-error-scrape"
url %>%
  read_html() %>%
  html_nodes(xpath = '//meta[@property="og:url"]') %>%
  html_attr('content')
#[1] "https://stackoverflow.com/questions/50555460/scrape-data-in-url-from-404-error-scrape"
However, this will not suffice for your case. I think it would be better for you to use RSelenium to scrape the data dynamically. It might be slower, but it is most certainly a solution to your problem. You can check out this tutorial on how to do so.
EDIT:
Not really experienced with splashr, but I do know that RSelenium differs from rvest in that Selenium drives a real browser, whereas rvest just makes plain HTTP requests. rvest errors out when a 404 is received, while Selenium can ignore it and wait for the redirect (for example with setImplicitWaitTimeout()). You can then get the URL the browser landed on using remoteDriver$getCurrentUrl().
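A minimal sketch of that RSelenium approach, assuming a local driver can be started with rsDriver(); the browser choice and the timeout value are placeholders:

library(RSelenium)

# Start a local Selenium driver; adjust the browser/port to your setup.
rd <- rsDriver(browser = "firefox", verbose = FALSE)
remDr <- rd$client

# Allow up to 10 seconds for the redirect to complete.
remDr$setImplicitWaitTimeout(milliseconds = 10000)
remDr$navigate("http://www.uscho.com/recaplink.php?gid=1_970_20172018")

# The address the browser ends up on, e.g. the /recaps/?p=... URL.
final_url <- remDr$getCurrentUrl()[[1]]
final_url

remDr$close()
rd$server$stop()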