I want to scrape the location data of a vessel using RStudio:
Here is the link:
https://www.marinetraffic.com/en/ais/details/ships/shipid:199293
My code:
"https://www.marinetraffic.com/en/ais/details/ships/shipid:199293" %>%
read_html() %>%
html_nodes("div#MuiTypography-displayInline")
Any suggestions on how this is possible?
If I open the page you linked in my browser, open the inspector tools, and go to the 'Network' tab, I see a lot of requests that are sent out when you visit the page.
Among those requests are:
https://www.marinetraffic.com/vesselDetails/voyageInfo/shipid:199293
https://www.marinetraffic.com/vesselDetails/latestPosition/shipid:199293
https://www.marinetraffic.com/en/vesselDetails/vesselInfo/shipid:199293
Those return nicely formatted JSON files that you should be able to parse with the jsonlite package:
library(jsonlite)
library(magrittr)  # provides the %>% pipe used below
"https://www.marinetraffic.com/en/vesselDetails/vesselInfo/shipid:199293" %>%
read_json()
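As a minimal sketch of what parsing such a response can look like, the snippet below feeds a literal JSON string to jsonlite::fromJSON(), which accepts a JSON string as well as a URL. The field names (lat, lon, speed, shipname) and their values are assumptions made up for illustration; the real endpoints may return different fields.

```r
library(jsonlite)

# Hypothetical response body; the real endpoint's field names may differ.
raw_json <- '{"lat": 51.95, "lon": 4.05, "speed": 12.3, "shipname": "EXAMPLE VESSEL"}'

# fromJSON() turns a JSON object into a named R list.
position <- fromJSON(raw_json)

# Keep only the coordinates, as a one-row data frame.
coords <- data.frame(lat = position$lat, lon = position$lon)
print(coords)
```

Once you know the actual field names, the same extraction should work on the list returned by read_json().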
I'm new to web-scraping so I may not be doing all the proper checks here. I'm attempting to scrape information from a url, however I'm not able to extract the nodes I need. See sample code below. In this example, I want to get the product name (Madone SLR 9 eTap Gen) which appears to be stored in the buying-zone__title class.
library(tidyverse)
library(rvest)
url <- "https://www.trekbikes.com//us/en_US/bikes/road-bikes/performance-road-bikes/madone/madone-slr/madone-slr-9-etap-gen-7/p/37420"
read_html(url) %>%
html_nodes(".buying-zone__title") %>%
html_text()
When I run the code above, I get {xml_nodeset (0)}. How can I fix this? I would also like to scrape the year, price, available colors and specs from that page. Any help will be appreciated.
That page has a lot of dynamic content, which you can review by disabling JavaScript in the browser or by comparing the rendered page against the page source.
You can view the page source with Ctrl+U, then use Ctrl+F to search for where the product name appears in the non-rendered content.
The title you want is present in many places, and there are numerous ways to obtain it. I will offer a "does what it says on the tin" option, since the selectors in the code give a clear indication of what is being selected.
I've updated the syntax and reduced the volume of imported external dependencies.
library(magrittr)
library(rvest)
url <- "https://www.trekbikes.com//us/en_US/bikes/road-bikes/performance-road-bikes/madone/madone-slr/madone-slr-9-etap-gen-7/p/37420"
page <- read_html(url)
name <- page %>% html_element("[product-name]") %>% html_attr("product-name")
specs <- page %>% html_elements('[class="sprocket__table spec"]') %>% html_table()
price <- page %>% html_element('#gtm-product-display-price') %>% html_attr('value') %>% as.numeric()
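To see why attribute selectors succeed where a class selector returns an empty node set, here is a small offline sketch. The markup is a hypothetical stand-in built with rvest::minimal_html(), not the real page source, and the product name and price are invented.

```r
library(rvest)

# Hypothetical markup: the values live in attributes, and no element
# carries the class that only exists after JavaScript has run.
page <- minimal_html('
  <div product-name="Example Bike"></div>
  <input id="gtm-product-display-price" value="1299.99">
')

# Select by attribute presence, then read the attribute value.
name <- page %>% html_element("[product-name]") %>% html_attr("product-name")

# Select by id and convert the attribute value to a number.
price <- page %>%
  html_element("#gtm-product-display-price") %>%
  html_attr("value") %>%
  as.numeric()

# The class from the question is absent, so this node set is empty.
missing <- page %>% html_elements(".buying-zone__title")
length(missing)
```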
I’m trying to scrape this website using rvest: https://www.camara.cl/legislacion/sesiones_sala/sesiones_sala.aspx
Notice that the site loads quickly, but the data takes some time to appear. I realized that, while the content appears as html text in a web browser Inspector, the nodes appear empty when scraped using rvest.
library(dplyr)
library(rvest)
camara <- "https://www.camara.cl/legislacion/sesiones_sala/sesiones_sala.aspx" %>%
session()
camara %>%
html_elements("h2")
camara %>%
html_elements(".box-proyecto")
camara %>%
html_elements("#trabajo-en-sala") %>%
html_elements("#info-tabs") %>%
html_elements("#ajax-container") %>%
html_elements("pnlTablaOrdinaria")
All of these should return at least some text content, but they appear empty.
I tried using V8 to interpret javascript according to these instructions, but the site appears to use JS only for interface elements, not for data retrieval.
I also tried to run it through PhantomJS following these instructions, but couldn’t run the script due to permission issues.
It seems that I need to perform a GET request for the data, but the URL I found on the site’s code returns nothing: https://www.camara.cl/legislacion/sesiones_sala/tabla.aspx?_=1628291424652
I can’t use RSelenium as I’m working remotely through a headless server.
You need to pick up a session cookie (ASP.NET_SessionId) from the initial url. You could use session for this, for example:
library(rvest)
library(magrittr)
r <- session('https://www.camara.cl/legislacion/sesiones_sala/sesiones_sala.aspx') %>%
session_jump_to('https://www.camara.cl/legislacion/sesiones_sala/tabla.aspx')
tables <- r %>% read_html() %>% html_table()
I am trying to scrape synonyms from the National Cancer Institute Thesaurus database, but I am having trouble finding the right URL to point to. Below is my code and the data frame I am using. When I run my script to pull the synonyms, I get Error in open.connection(x, "rb") : HTTP error 404. I can't figure out what the right link should be or how to find it.
library(xml2)
library(rvest)
library(dplyr)
library(tidyverse)
synonyms<-read_csv("terms.csv")
##list of acronyms
words <- c(synonyms$Keyword)
##Designate html link and the values to search
htmls <- paste0("https://ncit.nci.nih.gov/ncitbrowser/pages/concept_details.jsf/", words)
Data<-data.frame(Pages=c(htmls))
results<-sapply(Data$Pages, function(url){
try(
url %>%
as.character() %>%
read_html() %>%
html_nodes('p') %>%
html_text()
)
})
I suspect there's a problem with this line of code:
##Designate html link and the values to search
htmls <- paste0("https://ncit.nci.nih.gov/ncitbrowser/pages/concept_details.jsf/", words)
Because paste0() just concatenates text together, this will give you URLs like
https://ncit.nci.nih.gov/ncitbrowser/pages/concept_details.jsf/Ketamine
https://ncit.nci.nih.gov/ncitbrowser/pages/concept_details.jsf/Azacitidine
https://ncit.nci.nih.gov/ncitbrowser/pages/concept_details.jsf/Axicabtagene+Ciloleucel
While I do not have particular experience with rvest, the 404 error you see almost certainly means that these URLs do not exist on the server, so a web browser would fail to load them as well. I recommend logging or printing out htmls so you can confirm whether they work in a web browser.
I will point out that in this particular case the website offers a downloadable database; you might find it easier to download and query that offline than to do this web scraping.
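A minimal sketch of the "print and check" step follows. The two search terms are hypothetical stand-ins for synonyms$Keyword, and URLencode() is used only so that spaces and other unsafe characters are visible in encoded form. Encoding alone will not make a path valid if the site does not serve pages at that address, so the printed URLs still need to be tried in a browser.

```r
# Hypothetical search terms standing in for synonyms$Keyword.
words <- c("Ketamine", "Axicabtagene Ciloleucel")

base_url <- "https://ncit.nci.nih.gov/ncitbrowser/pages/concept_details.jsf/"

# Percent-encode each term so characters such as spaces are URL-safe.
encoded <- vapply(words, URLencode, character(1),
                  reserved = TRUE, USE.NAMES = FALSE)

# paste0() only glues strings together; it does not validate the result.
htmls <- paste0(base_url, encoded)

# Print the URLs so each one can be pasted into a browser and checked by hand.
print(htmls)
```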
I'm trying to scrape data from a webpage, but I get a 404 error for the URLs below. The data I need is nonetheless reachable from the 404 link in the browser. Here's the example:
library(tidyverse)
library(rvest)
url <- "http://www.uscho.com/scoreboard/division-i-men/20172018/composite-schedule/"
link_list <- url %>%
read_html() %>%
html_nodes("td:nth-child(13) a") %>%
html_attr("href") %>%
{paste0("http://www.uscho.com", .)}
Now, for example, open the 200th link (http://www.uscho.com/recaplink.php?gid=1_970_20172018) in your web browser. You'll get a 404 error page.
I don't actually want the 404 Error, but in the address bar, there's a URL that -- after some manipulation -- I can use to get the actual webpage that I want ("https://www.uscho.com/recaps/?p=171810970")
This URL, however, doesn't show up in R. Running read_html(link_list[200]), I only get a 404 error.
Any idea how I can get the URL from the browser within R?
To get the URL from the browser within R using rvest, you can search the page's metadata:
library(rvest)
library(tidyverse)
url <- "https://stackoverflow.com/questions/50555460/scrape-data-in-url-from-404-error-scrape"
url %>%
read_html() %>%
html_nodes(xpath = '//meta[@property="og:url"]') %>%
html_attr('content')
#[1] "https://stackoverflow.com/questions/50555460/scrape-data-in-url-from-404-error-scrape"
However, this will not suffice for your case. I think it would be better for you to use RSelenium to scrape the data dynamically. It might be slower, but it is most certainly a solution to your problem. You can check out this tutorial on how to do so.
EDIT:
I'm not really experienced with splashr, but I do know that RSelenium differs from rvest: Selenium drives a real browser, whereas rvest only sends plain HTTP requests and does not run JavaScript. rvest stops with an error when it receives a 404, whereas Selenium can carry on and wait, using setImplicitWaitTimeout(), for the page to redirect. You can then capture the resulting URL with remoteDriver$getCurrentUrl().
I am trying to scrape the text of U.N. Security Council (UNSC) resolutions into R. The U.N. maintains an online archive of all UNSC resolutions in PDF format (here). So, in theory, this should be do-able.
If I click on the hyperlink for a specific year and then click on the link for a specific document (e.g., this one), I can see the PDF in my browser. When I try to download that PDF by pointing download.file at the link in the URL bar, it seems to work. When I try to read the contents of that file into R using the pdf_text function from the pdftools package, however, I get a stack of error messages.
Here's what I'm trying that's failing. If you run it, you'll see the error messages I'm talking about.
library(pdftools)
pdflink <- "http://www.un.org/en/ga/search/view_doc.asp?symbol=S/RES/2341(2017)"
tmp <- tempfile()
download.file(pdflink, tmp, mode = "wb")
doc <- pdf_text(tmp)
What am I missing? I think it has to do with the link addresses to the downloadable versions of these files differing from the link addresses for the in-browser display, but I can't figure out how to get the path to the former. I tried right-clicking on the download icon; using the "Inspect" option in Chrome to see the URL identified as 'src' there (this link); and pointing the rest of my process at it. Again, the download.file part executes, but I get the same error messages when I run pdf_text. I also tried a) varying the mode part of the call to download.file and b) tacking ".pdf" onto the end of the path to tmp, but neither of those helped.
The PDF you are trying to download sits in an iframe in the main page, so the link you are downloading only contains HTML.
You need to follow the link in the iframe to get the actual link to the PDF. You also need to jump through several pages to collect cookies and temporary URLs before you reach the direct link for downloading the PDF.
Here's an example for the link you posted:
rm(list=ls())
library(rvest)
library(pdftools)
s <- html_session("http://www.un.org/en/ga/search/view_doc.asp?symbol=S/RES/2341(2017)")
#get the link in the mainFrame iframe holding the pdf
frame_link <- s %>% read_html() %>% html_nodes(xpath="//frame[@name='mainFrame']") %>%
html_attr("src")
#go to that link
s <- s %>% jump_to(url=frame_link)
#there is a meta refresh with a link to another page, get it and go there
temp_url <- s %>% read_html() %>%
html_nodes("meta") %>%
html_attr("content") %>% {gsub(".*URL=","",.)}
s <- s %>% jump_to(url=temp_url)
#get the LtpaToken cookie then come back
s %>% jump_to(url="https://documents-dds-ny.un.org/prod/ods_mother.nsf?Login&Username=freeods2&Password=1234") %>%
back()
#get the pdf link and download it
pdf_link <- s %>% read_html() %>%
html_nodes(xpath="//meta[@http-equiv='refresh']") %>%
html_attr("content") %>% {gsub(".*URL=","",.)}
s <- s %>% jump_to(pdf_link)
tmp <- tempfile()
writeBin(s$response$content,tmp)
doc <- pdf_text(tmp)
doc
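Two of the ideas in this answer can be tried offline: checking whether a downloaded file is really a PDF, and pulling the target URL out of a meta refresh tag. The sketch below is an illustration under assumptions. is_pdf() is a hypothetical helper, not part of pdftools, and the HTML written to the temporary file is an invented stand-in for the wrapper page, including its /tmp/example.pdf target.

```r
library(rvest)

# Hypothetical helper: TRUE when the file starts with the PDF signature "%PDF-".
is_pdf <- function(path) {
  header <- readBin(path, what = "raw", n = 5)
  identical(header, charToRaw("%PDF-"))
}

# Invented stand-in for what download.file() saved: an HTML wrapper, not a PDF.
tmp <- tempfile()
writeLines(
  '<html><head><meta http-equiv="refresh" content="0; URL=/tmp/example.pdf"></head></html>',
  tmp
)

# FALSE here would explain why pdf_text() fails on such a file.
is_pdf(tmp)

# Extract the redirect target from the meta refresh tag.
target <- read_html(tmp) %>%
  html_elements(xpath = "//meta[@http-equiv='refresh']") %>%
  html_attr("content") %>%
  sub(".*URL=", "", .)
target
```

Running a check like is_pdf() right after the download makes it clearer whether the problem is in the download step or in the PDF parsing step.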