I'm trying to scrape data from a webpage, but I get a 404 error for the URLs below. Even so, the browser shows information for the 404 link that I need. Here's the example:
library(tidyverse)
library(rvest)
url <- "http://www.uscho.com/scoreboard/division-i-men/20172018/composite-schedule/"
link_list <- url %>%
read_html() %>%
html_nodes("td:nth-child(13) a") %>%
html_attr("href") %>%
{paste0("http://www.uscho.com", .)}
Now, for example, open the 200th link (http://www.uscho.com/recaplink.php?gid=1_970_20172018) in your web browser. You'll get this:
I don't actually want the 404 error page itself, but the address bar shows a URL that, after some manipulation, I can use to get the webpage I actually want ("https://www.uscho.com/recaps/?p=171810970").
This URL, however, doesn't show up in R. Running read_html(link_list[200]), I only get a 404 error.
Any idea how I can get the URL from the browser within R?
To get a page's URL within R using rvest, you can read it from the page's metadata:
library(rvest)
library(tidyverse)
url <- "https://stackoverflow.com/questions/50555460/scrape-data-in-url-from-404-error-scrape"
url %>%
read_html() %>%
html_nodes(xpath = '//meta[@property="og:url"]') %>%
html_attr('content')
#[1] "https://stackoverflow.com/questions/50555460/scrape-data-in-url-from-404-error-scrape"
However, this will not suffice for your case. I think it would be better for you to use RSelenium to scrape the data dynamically. It might be slower, but it is most certainly a solution to your problem. You can check out this tutorial on how to do so.
EDIT:
I'm not really experienced with splashr, but I do know that RSelenium differs from rvest: Selenium drives a real browser, whereas rvest only issues plain HTTP requests and does not run JavaScript. read_html() throws an error when it receives a 404, whereas Selenium simply loads whatever the server returns; you can set a wait with setImplicitWaitTimeout() so that the page has time to redirect. You can then capture the resulting URL with remoteDriver$getCurrentUrl().
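A minimal sketch of that RSelenium approach, assuming a Selenium server is already listening on localhost:4444 with Firefox available. The server address, the port, the browser, the helper name get_redirected_url and the two-second pause are all assumptions to adjust for your setup:

```r
# Hypothetical helper: load a URL in a real browser and return the address
# the browser ends up on. Assumes a Selenium server at localhost:4444.
get_redirected_url <- function(url, wait_secs = 2) {
  remDr <- RSelenium::remoteDriver(
    remoteServerAddr = "localhost",
    port = 4444L,
    browserName = "firefox"
  )
  remDr$open(silent = TRUE)
  on.exit(remDr$close(), add = TRUE)   # always release the browser session

  remDr$setImplicitWaitTimeout(milliseconds = 10000)
  remDr$navigate(url)
  Sys.sleep(wait_secs)                 # crude pause so a redirect can finish

  remDr$getCurrentUrl()[[1]]           # getCurrentUrl() returns a list
}

# Usage (needs a running Selenium server, so it is not executed here):
# get_redirected_url(link_list[200])
```

The explicit Sys.sleep() is there because the implicit wait mainly affects element lookups, so it may not be enough on its own to cover a redirect.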
Related
I’m trying to scrape this website using rvest: https://www.camara.cl/legislacion/sesiones_sala/sesiones_sala.aspx
Notice that the site loads quickly, but the data takes some time to appear. I realized that, while the content appears as html text in a web browser Inspector, the nodes appear empty when scraped using rvest.
library(dplyr)
library(rvest)
camara <- "https://www.camara.cl/legislacion/sesiones_sala/sesiones_sala.aspx" %>%
session()
camara %>%
html_elements("h2")
camara %>%
html_elements(".box-proyecto")
camara %>%
html_elements("#trabajo-en-sala") %>%
html_elements("#info-tabs") %>%
html_elements("#ajax-container") %>%
html_elements("pnlTablaOrdinaria")
All of these should return at least some text content, but they appear empty.
I tried using V8 to interpret javascript according to these instructions, but the site appears to use JS only for interface elements, not for data retrieval.
I also tried to run it through PhantomJS following these instructions, but couldn’t run the script due to permission issues.
It seems that I need to perform a GET request for the data, but the URL I found on the site’s code returns nothing: https://www.camara.cl/legislacion/sesiones_sala/tabla.aspx?_=1628291424652
I can’t use RSelenium as I’m working remotely through a headless server.
You need to pick up a session cookie (ASP.NET_SessionId) from the initial url. You could use session for this, for example:
library(rvest)
library(magrittr)
r <- session('https://www.camara.cl/legislacion/sesiones_sala/sesiones_sala.aspx') %>%
session_jump_to('https://www.camara.cl/legislacion/sesiones_sala/tabla.aspx')
tables <- r %>% read_html() %>% html_table()
I am trying to scrape synonyms from the National Cancer Institute Thesaurus database, but I am having trouble finding the right URL to point to. Below are my code and the data frame I am using. When I run my script to pull the synonyms, I get an Error in open.connection(x, "rb") : HTTP error 404. I can't figure out what the right link should be or how to find it.
library(xml2)
library(rvest)
library(dplyr)
library(tidyverse)
synonyms<-read_csv("terms.csv")
##list of acronyms
words <- c(synonyms$Keyword)
##Designate html link and the values to search
htmls <- paste0("https://ncit.nci.nih.gov/ncitbrowser/pages/concept_details.jsf/", words)
Data<-data.frame(Pages=c(htmls))
results<-sapply(Data$Pages, function(url){
try(
url %>%
as.character() %>%
read_html() %>%
html_nodes('p') %>%
html_text()
)
})
I suspect there's a problem with this line of code:
##Designate html link and the values to search
htmls <- paste0("https://ncit.nci.nih.gov/ncitbrowser/pages/concept_details.jsf/", words)
Because paste0() just concatenates text together, this will give you URLs like
https://ncit.nci.nih.gov/ncitbrowser/pages/concept_details.jsf/Ketamine
https://ncit.nci.nih.gov/ncitbrowser/pages/concept_details.jsf/Azacitidine
https://ncit.nci.nih.gov/ncitbrowser/pages/concept_details.jsf/Axicabtagene+Ciloleucel
While I do not have particular experience with rvest, the 404 error you see almost certainly means that those URLs do not exist on the server, in which case a web browser could not load them either. I recommend logging or printing out htmls so you can confirm whether they work in a web browser.
I will point out that in this particular case the website offers a downloadable database; you might find it easier to download and query that offline than to do this web scraping.
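A small sketch of that check, assuming three example terms in place of synonyms$Keyword (the term with a space is my assumption). It percent-encodes each term before pasting and prints the result for testing in a browser. Encoding only makes the URLs well formed; it does not make this path scheme valid if the site does not serve it:

```r
# Example terms; in the question these come from synonyms$Keyword.
words <- c("Ketamine", "Azacitidine", "Axicabtagene Ciloleucel")

base_url <- "https://ncit.nci.nih.gov/ncitbrowser/pages/concept_details.jsf/"

# Percent-encode each term so spaces and other reserved characters are URL-safe.
encoded <- vapply(words, utils::URLencode, character(1),
                  reserved = TRUE, USE.NAMES = FALSE)
htmls <- paste0(base_url, encoded)

# Print the URLs so each one can be pasted into a browser and checked by hand.
print(htmls)
```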
I want to scrape the location data of a vessel using RStudio:
Here is the link:
https://www.marinetraffic.com/en/ais/details/ships/shipid:199293
My code:
"https://www.marinetraffic.com/en/ais/details/ships/shipid:199293" %>%
read_html() %>%
html_nodes("div#MuiTypography-displayInline")
Any suggestions on how this is possible?
If I go to the page you linked in my browser, open inspector tools, and head to the tab 'network', I see a lot of requests that are sent out when you visit the page.
Among those requests are:
https://www.marinetraffic.com/vesselDetails/voyageInfo/shipid:199293
https://www.marinetraffic.com/vesselDetails/latestPosition/shipid:199293
https://www.marinetraffic.com/en/vesselDetails/vesselInfo/shipid:199293
Those return nicely formatted JSON files that you should be able to parse with the jsonlite package:
library(jsonlite)
"https://www.marinetraffic.com/en/vesselDetails/vesselInfo/shipid:199293" %>%
read_json()
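Once the JSON is parsed, it is an ordinary R list. A minimal offline sketch, using an invented payload (the field names shipId, lat, lon and speed are assumptions, not the site's actual schema):

```r
library(jsonlite)

# Invented payload standing in for the server's response; the real field
# names and values will differ.
txt <- '{"shipId": 199293, "lat": 43.1, "lon": -70.5, "speed": 12.4}'

pos <- fromJSON(txt)   # parses a JSON object into a named list

# Keep only the coordinates in a one-row data frame.
location <- data.frame(lat = pos$lat, lon = pos$lon)
print(location)
```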
I am scraping data from this website and, for some reason, I'm unable to get the seller's name, even though I use the exact node returned by SelectorGadget. I have, however, managed to get all the other data with rvest.
I managed to scrape the seller's name with RSelenium but that takes too much time. Anyway, here's the link of the page I'm scraping:
https://www.kijiji.ca/v-fitness-personal-trainer/bedford/swimming-lessons/1421292946
Here's the code I've used
SellerName <-
read_html("https://kijiji.ca/v-fitness-personal-trainer/bedford/swimming-lessons/1421292946") %>%
html_nodes(".link-4200870613") %>%
html_text()
You can easily regex the seller name out of the response, as it is contained in a script tag (presumably loaded from here when the browser is able to run JavaScript, which rvest does not do).
library(rvest)
library(magrittr)
library(stringr)
p <- read_html('https://www.kijiji.ca/v-fitness-personal-trainer/bedford/swimming-lessons/1421292946') %>% html_text()
seller_name <- str_match_all(p,'"sellerName":"(.*?)"')[[1]][,2][1]
print(seller_name)
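Regex: the pattern matches the literal text "sellerName":" and then lazily captures everything up to the next double quote, which is the seller's name.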
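To see how the pattern behaves without hitting the site, here is a small sketch on an invented snippet of page text (the surrounding keys and values are made up). It also shows why the lazy quantifier matters:

```r
library(stringr)

# Invented snippet standing in for the page text returned by html_text().
p <- 'window.__data={"adId":"1","sellerName":"Jane Doe","price":"20"};'

# (.*?) is lazy, so the capture stops at the first closing quote.
seller_name <- str_match(p, '"sellerName":"(.*?)"')[, 2]
print(seller_name)

# A greedy (.*) would instead run on to the last quote in the text.
greedy <- str_match(p, '"sellerName":"(.*)"')[, 2]
print(greedy)
```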
Okay, so I am stuck on what seems like it should be a simple web scrape. My goal is to scrape Morningstar.com to retrieve a fund name based on the entered URL. Here is an example of my code:
library(rvest)
url <- html("http://www.morningstar.com/funds/xnas/fbalx/quote.html")
url %>%
read_html() %>%
html_node('r_title')
I would expect it to return the name Fidelity Balanced Fund, but instead I get the following error: {xml_missing}
Suggestions?
Aaron
edit:
I also tried scraping via XHR request, but I think my issue is not knowing what css selector or xpath to select to find the appropriate data.
XHR code:
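library(httr)
library(rvest)
get.morningstar.Table1 <- function(Symbol.i, htmlnode) {
  # Request the header fragment; a failed request leaves a try-error in res
  res <- try(GET(url = "http://quotes.morningstar.com/fundq/c-header",
    query = list(
      t = Symbol.i,
      region = "usa",
      culture = "en-US",
      version = "RET",
      test = "QuoteiFrame"
    )
  ), silent = TRUE)
  # Assign the result of tryCatch() so x is always defined (NA on any error)
  x <- tryCatch(content(res) %>%
      html_nodes(htmlnode) %>%
      html_text() %>%
      trimws(),
    error = function(e) NA)
  return(x)
} # htmlnode in this case is a vkey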
Still, the question remains: am I using the correct CSS selector or XPath for the lookup? The XHR code works great for requests that have a clear CSS selector.
OK, so it looks like the page dynamically loads the section you are targeting, so it doesn't actually get pulled in by read_html(). Interestingly, this part of the page also doesn't load using an RSelenium headless browser.
I was able to get this to work by scraping the page title (which is actually hidden on the page) and doing some regex to get rid of the junk:
library(rvest)
url <- 'http://www.morningstar.com/funds/xnas/fbalx/quote.html'
page <- read_html(url)
title <- page %>%
html_node('title') %>%
html_text()
symbol <- 'FBALX'
regex <- paste0(symbol, " (.*) ", symbol, ".*")
cleanTitle <- gsub(regex, '\\1', title)
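The regex step can be checked offline. This sketch assumes a title laid out as symbol, fund name, symbol, then trailing text; the sample string is invented, so the real title may need a different pattern:

```r
# Invented title text; the real <title> content may be laid out differently.
title <- "FBALX Fidelity Balanced Fund FBALX Quote Price News"

symbol <- "FBALX"
# Capture whatever sits between the two occurrences of the symbol.
regex <- paste0(symbol, " (.*) ", symbol, ".*")
cleanTitle <- gsub(regex, "\\1", title)
print(cleanTitle)

# If the pattern does not match, gsub() returns the input unchanged,
# which is worth checking for before trusting the result.
unmatched <- gsub(regex, "\\1", "Some other page title")
print(unmatched)
```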
As a side note, and for your future use, your first call to html_node() should include a "." before the class name you are targeting:
mypage %>%
html_node('.myClass')
Again, this doesn't help in this specific case, since the page is failing to load the section we are trying to scrape.
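To illustrate the selector point on its own, here is a sketch against a small inline document. The markup is invented; only the class name r_title comes from the question:

```r
library(rvest)

# Invented markup; the class name r_title is taken from the question.
page <- read_html(
  '<html><body><div class="r_title"><h1>Fidelity Balanced Fund</h1></div></body></html>'
)

# Without the dot, "r_title" is read as a tag name, so nothing matches.
no_dot <- page %>% html_node("r_title") %>% html_text()

# With the dot, ".r_title" selects elements whose class is r_title.
with_dot <- page %>% html_node(".r_title") %>% html_text()

print(no_dot)
print(with_dot)
```

The first lookup comes back missing, which is the same {xml_missing} result reported in the question.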
A final note: other sites contain the same info and are easier to scrape (like Yahoo Finance).