I am trying to scrape the text of U.N. Security Council (UNSC) resolutions into R. The U.N. maintains an online archive of all UNSC resolutions in PDF format (here). So, in theory, this should be do-able.
If I click on the hyperlink for a specific year and then click on the link for a specific document (e.g., this one), I can see the PDF in my browser. When I try to download that PDF by pointing download.file at the link in the URL bar, it seems to work. When I try to read the contents of that file into R using the pdf_text function from the pdftools package, however, I get a stack of error messages.
Here's what I'm trying that's failing. If you run it, you'll see the error messages I'm talking about.
library(pdftools)
pdflink <- "http://www.un.org/en/ga/search/view_doc.asp?symbol=S/RES/2341(2017)"
tmp <- tempfile()
download.file(pdflink, tmp, mode = "wb")
doc <- pdf_text(tmp)
What am I missing? I think it has to do with the link addresses to the downloadable versions of these files differing from the link addresses for the in-browser display, but I can't figure out how to get the path to the former. I tried right-clicking on the download icon; using the "Inspect" option in Chrome to see the URL identified as 'src' there (this link); and pointing the rest of my process at it. Again, the download.file part executes, but I get the same error messages when I run pdf_text. I also tried a) varying the mode part of the call to download.file and b) tacking ".pdf" onto the end of the path to tmp, but neither of those helped.
The pdf you are looking to download is in an iframe in the main page, so the link you are downloading only contains html.
You need to follow the link in the iframe to get the actual link to the pdf. Before you reach the direct download link, you have to jump through several intermediate pages to pick up cookies and temporary urls.
Here's an example for the link you posted:
rm(list=ls())
library(rvest)
library(pdftools)
s <- html_session("http://www.un.org/en/ga/search/view_doc.asp?symbol=S/RES/2341(2017)")
#get the link in the mainFrame iframe holding the pdf
frame_link <- s %>% read_html() %>% html_nodes(xpath="//frame[@name='mainFrame']") %>%
html_attr("src")
#go to that link
s <- s %>% jump_to(url=frame_link)
#there is a meta refresh with a link to another page, get it and go there
temp_url <- s %>% read_html() %>%
html_nodes("meta") %>%
html_attr("content") %>% {gsub(".*URL=","",.)}
s <- s %>% jump_to(url=temp_url)
#get the LtpaToken cookie then come back
s %>% jump_to(url="https://documents-dds-ny.un.org/prod/ods_mother.nsf?Login&Username=freeods2&Password=1234") %>%
back()
#get the pdf link and download it
pdf_link <- s %>% read_html() %>%
html_nodes(xpath="//meta[@http-equiv='refresh']") %>%
html_attr("content") %>% {gsub(".*URL=","",.)}
s <- s %>% jump_to(pdf_link)
tmp <- tempfile()
writeBin(s$response$content,tmp)
doc <- pdf_text(tmp)
doc
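To confirm the diagnosis above (that the file saved by download.file is HTML rather than a PDF), one option is to look at the first bytes of the file before calling pdf_text, since PDF files begin with the signature %PDF-. The sketch below uses only base R; the helper name is_pdf is hypothetical and not part of pdftools, and the two small files are written locally just for illustration.

```r
# Hypothetical helper (not part of pdftools): TRUE if the file starts with "%PDF-".
is_pdf <- function(path) {
  header <- readBin(path, what = "raw", n = 5L)
  identical(header, charToRaw("%PDF-"))
}

# Demonstration with two small local files instead of a real download.
html_file <- tempfile(fileext = ".pdf")
writeLines("<html><body>not a pdf</body></html>", html_file)

pdf_file <- tempfile(fileext = ".pdf")
writeBin(charToRaw("%PDF-1.4\n"), pdf_file)

is_pdf(html_file)  # the HTML page saved under a .pdf name
is_pdf(pdf_file)   # a file that carries the PDF signature
```

Running is_pdf(tmp) right after download.file would show whether the server returned the document itself or a wrapper page.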
I would like to download the latest archive (meteorological data) that has been added to this website, using RStudio:
https://opendata.dwd.de/climate_environment/CDC/grids_germany/hourly/radolan/recent/asc/
Do you know how to do it? I am able to download one particular file, but only by writing out its exact name, which would have to be changed manually every time. I do not want that; I want the latest file to be detected automatically.
Thanks.
The function download_CDC() downloads the files for you. Input number 1 will download the latest one, saved under the name provided by the website.
library(tidyverse)
library(rvest)
base_url <- "https://opendata.dwd.de/climate_environment/CDC/grids_germany/hourly/radolan/recent/asc/"
files <- base_url %>%
read_html() %>%
html_elements("a+ a") %>%
html_attr("href")
download_CDC <- function(item_number) {
base_url <- "https://opendata.dwd.de/climate_environment/CDC/grids_germany/hourly/radolan/recent/asc/"
download.file(paste0(base_url, files[item_number]),
destfile = files[item_number],
mode = "wb")
}
download_CDC(1)
It's a bit naïve (no error checking, blindly takes the last link from the file list page), but it works with that particular listing.
Most web scraping in R happens through rvest. html_element("a:last-of-type") extracts the last element of type <a> through a CSS selector, which is your last archive, and html_attr('href') extracts the href attribute from that last <a> element, which is the actual link to the file.
library(rvest)
last_link <- function(url) {
last_href <- read_html(url) |>
html_element("a:last-of-type") |>
html_attr('href')
paste0(url,last_href)
}
url <- "https://opendata.dwd.de/climate_environment/CDC/grids_germany/hourly/radolan/recent/asc/"
last_link(url)
#> [1] "https://opendata.dwd.de/climate_environment/CDC/grids_germany/hourly/radolan/recent/asc/RW-20220720.tar.gz"
Created on 2022-07-21 by the reprex package (v2.0.1)
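Both answers above rely on the position of a link in the listing. A slightly more defensive variant, sketched below in base R, filters the scraped hrefs down to archive files and picks the newest by name. It assumes that all archive names follow the RW-YYYYMMDD.tar.gz pattern seen in the output above, so that sorting the names also sorts them by date; the helper name latest_archive and the sample hrefs vector are made up for illustration.

```r
# Hypothetical helper: pick the newest archive from a vector of hrefs,
# assuming names like "RW-YYYYMMDD.tar.gz" that sort chronologically.
latest_archive <- function(hrefs) {
  archives <- grep("\\.tar\\.gz$", hrefs, value = TRUE)
  if (length(archives) == 0L) {
    stop("no .tar.gz links found in the listing")
  }
  sort(archives, decreasing = TRUE)[1]
}

# Made-up listing for illustration; "../" is the parent-directory link.
hrefs <- c("../", "RW-20220718.tar.gz", "RW-20220720.tar.gz", "RW-20220719.tar.gz")
latest_archive(hrefs)
```

In practice, hrefs would be the vector returned by html_attr("href"), and the result can be pasted onto the base URL and passed to download.file.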
I am using the R programming language for NLP (natural language process) analysis - for this, I need to "webscrape" publicly available information on the internet.
Recently, I learned how to "webscrape" a single pdf file from the website I am using :
library(pdftools)
library(tidytext)
library(textrank)
library(dplyr)
library(tibble)
#this is an example of a single pdf
url <- "https://www.canlii.org/en/ns/nswcat/doc/2013/2013canlii47876/2013canlii47876.pdf"
article <- pdf_text(url)
article_sentences <- tibble(text = article) %>%
unnest_tokens(sentence, text, token = "sentences") %>%
mutate(sentence_id = row_number()) %>%
select(sentence_id, sentence)
article_words <- article_sentences %>%
unnest_tokens(word, sentence)
article_words <- article_words %>%
anti_join(stop_words, by = "word")
#this final command can take some time to run
article_summary <- textrank_sentences(data = article_sentences, terminology = article_words)
#Sources: https://stackoverflow.com/questions/66979242/r-error-in-textrank-sentencesdata-article-sentences-terminology-article-w , https://www.hvitfeldt.me/blog/tidy-text-summarization-using-textrank/
The above code works fine if you want to manually access a single website and then "webscrape" this website. Now, I want to try and automatically download 10 such articles at the same time, without manually visiting each page. For instance, suppose I want to download the first 10 pdf's from this website: https://www.canlii.org/en/#search/type=decision&text=dog%20toronto
I think I found the following website which discusses how to do something similar (I adapted the code for my example): https://towardsdatascience.com/scraping-downloading-and-storing-pdfs-in-r-367a0a6d9199
library(tidyverse)
library(rvest)
library(stringr)
page <- read_html("https://www.canlii.org/en/#search/type=decision&text=dog%20toronto ")
raw_list <- page %>%
html_nodes("a") %>%
html_attr("href") %>%
str_subset("\\.pdf") %>%
str_c("https://www.canlii.org/en/#search/type=decision&text=dog", .)
map(read_html) %>%
map(html_node, "#raw-url") %>%
map(html_attr, "href") %>%
str_c("https://www.canlii.org/en/#search/type=decision&text=dog", .) %>%
walk2(., basename(.), download.file, mode = "wb")
But this produces the following error:
Error in .f(.x[[1L]], .y[[1L]], ...) : scheme not supported in URL 'NA'
Can someone please show me what I am doing wrong? Is it possible to download the first 10 pdf files that appear on this website and save them individually in R as "pdf1", "pdf2", ... "pdf9", "pdf10"?
Thanks
I see some people suggesting that you use RSelenium, which is a way to
simulate browser actions, so that the web server renders the page as
if a human was visiting the site. From my experience it is almost never
necessary to go down that route. The javascript part of the website is
interacting with an API and we can utilize that to circumvent the Javascript
part and get the raw json data directly. In Firefox (and Chrome is similar in that regard I
assume) you can right-click on the website and select “Inspect Element (Q)”,
go to the “Network” tab and click on reload. You’ll see that each request
the browser makes to the webserver is being listed after a few seconds or less.
We are interested in the ones that have the “Type” json.
When you right click on an entry you can select “Open in New Tab”. One of the
requests that returns json has the following URL attached to it https://www.canlii.org/en/search/ajaxSearch.do?type=decision&text=dogs%20toronto&page=1
Opening that URL in Firefox gets you to a GUI that lets you explore the
json data structure and you’ll see that there is a “results” entry which
contains the data for the 25 first results of your search. Each one has a
“path” entry, that leads to the page that will display the embedded PDF.
It turns out that if you replace the “.html” part with “.pdf” that path
leads directly to the PDF file. The code below utilizes all this information.
library(tidyverse) # tidyverse for the pipe and for `purrr::map*()` functions.
library(httr) # this should already be installed on your machine as `rvest` builds on it
library(pdftools)
#> Using poppler version 20.09.0
library(tidytext)
library(textrank)
base_url <- "https://www.canlii.org"
json_url_search_p1 <-
"https://www.canlii.org/en/search/ajaxSearch.do?type=decision&text=dogs%20toronto&page=1"
This downloads the json for page 1 / results 1 to 25
results_p1 <-
GET(json_url_search_p1, encode = "json") %>%
content()
For each result we extract the path only.
result_html_paths_p1 <-
map_chr(results_p1$results,
~ .$path)
We replace “.html” with “.pdf”, combine the base URL with the path to
generate the full URLs pointing to the PDFs. Last we pipe it into purrr::map()
and pdftools::pdf_text in order to extract the text from all 25 PDFs.
pdf_texts_p1 <-
gsub("\\.html$", ".pdf", result_html_paths_p1) %>%
paste0(base_url, .) %>%
map(pdf_text)
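The path-to-URL step can be checked on its own, without any network access. The sketch below is a base R illustration of that transformation, plus one possible way to produce the pdf1, pdf2, ... file names asked for in the question. The function names are hypothetical, and the example path is simply the question's PDF URL with ".pdf" swapped back to ".html", under the assumption that result paths look like that.

```r
base_url <- "https://www.canlii.org"

# Turn result paths ending in ".html" into full PDF URLs.
html_path_to_pdf_url <- function(paths, base = base_url) {
  paste0(base, sub("\\.html$", ".pdf", paths))
}

# Hypothetical naming scheme: pdf1.pdf, pdf2.pdf, ...
pdf_destfiles <- function(n) {
  sprintf("pdf%d.pdf", seq_len(n))
}

# Example path, derived from the PDF URL in the question.
paths <- "/en/ns/nswcat/doc/2013/2013canlii47876/2013canlii47876.html"
html_path_to_pdf_url(paths)
pdf_destfiles(3)
```

The two vectors could then be passed pairwise to download.file(url, destfile, mode = "wb") if the files are to be saved rather than read directly.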
If you want to do this for more than just the first page you might want to
wrap the above code in a function that lets you switch out the “&page=”
parameter. You could also make the “&text=” parameter an argument of the
function in order to automatically scrape results for other searches.
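A minimal sketch of such a function is shown here. It only builds the URL and makes no request; the name build_search_url is hypothetical, and it assumes the endpoint accepts exactly the "type", "text" and "page" parameters seen in the URL above.

```r
# Hypothetical helper: build the JSON search URL for a given query and page.
# Only "text" and "page" vary; everything else is copied from the URL above.
build_search_url <- function(text, page = 1) {
  paste0(
    "https://www.canlii.org/en/search/ajaxSearch.do?type=decision&text=",
    utils::URLencode(text, reserved = TRUE),  # e.g. a space becomes %20
    "&page=", page
  )
}

build_search_url("dogs toronto", page = 1)
build_search_url("dogs toronto", page = 2)
```

The first call reproduces the page 1 URL used above, so the function can stand in for the hard-coded json_url_search_p1.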
For the remaining part of the task we can build on the code you already have.
We make it a function that can be applied to any article and apply that function
to each PDF text again using purrr::map().
extract_article_summary <-
function(article) {
article_sentences <- tibble(text = article) %>%
unnest_tokens(sentence, text, token = "sentences") %>%
mutate(sentence_id = row_number()) %>%
select(sentence_id, sentence)
article_words <- article_sentences %>%
unnest_tokens(word, sentence)
article_words <- article_words %>%
anti_join(stop_words, by = "word")
textrank_sentences(data = article_sentences, terminology = article_words)
}
This will take a really long time!
article_summaries_p1 <-
map(pdf_texts_p1, extract_article_summary)
Alternatively you could use furrr::future_map() instead to utilize all the CPU
cores in your machine and speed up the process.
library(furrr) # make sure the package is installed first
plan(multisession)
article_summaries_p1 <-
future_map(pdf_texts_p1, extract_article_summary)
Disclaimer
The code in the answer above is for educational purposes only. As many websites do, this service restricts automated access to its contents. The robots.txt explicitly disallows the /search path from being accessed by bots. It is therefore recommended to get in contact with the site owner before downloading big amounts of data. canlii offers API access on an individual request basis, see documentation here. This would be the correct and safest way to access their data.
I am trying to download a pdf file from a website using R. When I tried to use the function browseURL, it only worked with the argument encodeIfNeeded = T. If I pass the same url to the function download.file, however, it returns "cannot open destfile 'downloaded/teste.pdf', reason 'No such file or directory'", i.e., it can't find the correct url.
How do I correct the encoding so that I can download the file programmatically?
I need to automate this, because there are more than a thousand files to download.
Here's a minimum reproducible code:
library(tidyverse)
library(rvest)
url <- "http://www.ouvidoriageral.sp.gov.br/decisoesLAI.html"
webpage <- read_html(url)
# scraping hyperlinks
links_decisoes <- html_nodes(webpage,".borderTD a") %>%
html_attr("href")
# creating full/correct url
full_links <- paste("http://www.ouvidoriageral.sp.gov.br/", links_decisoes, sep="" )
# browseURL only works with encodeIfNeeded = T
browseURL(full_links[1], encodeIfNeeded = T,
browser = "C://Program Files//Mozilla Firefox//firefox.exe")
# returns an error
download.file(full_links[1], "downloaded/teste.pdf")
There are a couple of problems here. Firstly, the links to some of the files are not properly formatted as urls - they contain spaces and other special characters. In order to convert them you must use url_escape(), which should be available to you as loading rvest also loads xml2, which contains url_escape().
Secondly, the path you are saving to is relative to your R home directory, but you are not telling R this. You either need the full path like this: "C://Users/Manoel/Documents/downloaded/testes.pdf", or a relative path like this: path.expand("~/downloaded/testes.pdf").
This code should do what you need:
library(tidyverse)
library(rvest)
# scraping hyperlinks
full_links <- "http://www.ouvidoriageral.sp.gov.br/decisoesLAI.html" %>%
read_html() %>%
html_nodes(".borderTD a") %>%
html_attr("href") %>%
url_escape() %>%
{paste0("http://www.ouvidoriageral.sp.gov.br/", .)}
# Looks at page in firefox
browseURL(full_links[1], encodeIfNeeded = T, browser = "firefox.exe")
# Saves pdf to "downloaded" folder if it exists
download.file(full_links[1], path.expand("~/downloaded/teste.pdf"))
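The two fixes can also be tried in isolation. The sketch below uses base R's utils::URLencode rather than the url_escape call in the answer, and it creates the destination folder explicitly, since download.file does not create missing directories. The helper names encode_link and dest_path and the example link are hypothetical, and the folder is created under tempdir() only to keep the illustration self-contained.

```r
# Hypothetical helper: percent-encode spaces and other unsafe characters in a
# relative link (keeping "/" intact) and prepend the site root.
encode_link <- function(href, root = "http://www.ouvidoriageral.sp.gov.br/") {
  paste0(root, utils::URLencode(href))
}

# Hypothetical helper: create the destination folder if it is missing and
# return the full path for the file inside it.
dest_path <- function(dir, file) {
  dir.create(dir, recursive = TRUE, showWarnings = FALSE)
  file.path(dir, file)
}

# Made-up link containing a space, for illustration only.
encode_link("arquivos/decisao 1.pdf")

out <- dest_path(file.path(tempdir(), "downloaded"), "teste.pdf")
dir.exists(dirname(out))
```

With both pieces in place, a loop over full_links can call download.file(encoded_url, out, mode = "wb") without failing on a missing folder.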
I am using the rvest R package to scrape a PDF file from this webpage, but the final link (a bitstream url, whatever that is) is only exposed after I click on the link named AC1-96-21-01-2011.pdf. The final pdf file is tucked away here, hidden from access. This blocks all attempts with the rvest function read_html(), as the final pdf file opens only after clicking on the previous link (on href). Below is the xml node that is not letting me reach the pdf file.
Arbitration Case - AC
The final file is on this url which is not exposed in the href node.
http://judgmenthck.kar.nic.in/judgments/bitstream/123456789/563560/2/AC1-96-21-01-2011.pdf
So, in summary: how do I use rvest to get the link to the pdf file when it is not found in the href attribute, as explained above?
I tried to search for bitstream, but it takes me to something else.
You're looking at the wrong node I think:
library(rvest)
"http://judgmenthck.kar.nic.in/judgments/handle/123456789/563560" %>%
read_html() %>%
html_nodes(xpath = "//td/a[@target='_blank']") %>%
html_attr("href") %>%
unique() %>%
{grep("[.]pdf", ., value = T)} %>%
paste0("http://judgmenthck.kar.nic.in", .) ->
pdf_url
print(pdf_url)
# [1] "http://judgmenthck.kar.nic.in/judgments/bitstream/123456789/563560/2/AC1-96-21-01-2011.pdf"
I'm new to programming and trying to scrape data from the site below. When I run the code below, it returns an empty dataset or table. Any help or alternatives would be greatly appreciated.
url <- "https://fasttrack.grv.org.au/Dog/Form?id=2003010003"
tab <- url %>% read_html %>%
html_node("dogruns_wrapper") %>%
html_text()
View(tab)
I have tried with xpath and got the same result, and using html_table() instead of html_text() returns the error: no applicable method for 'html_table' applied to an object of class "xml_missing".
As Mislav stated, the table is generated with JavaScript, so your best option is RSelenium.
In addition, if you want to get the table, you can get it with less code if you use html_table().
My try:
# Load packages
library(rvest) #Loading the rvest package
library(magrittr) # for the '%>%' pipe symbols
library(RSelenium) # to get the loaded html of the webpage
# starting local RSelenium (this is the only way to start RSelenium that is working for me atm)
selCommand <- wdman::selenium(jvmargs = c("-Dwebdriver.chrome.verboseLogging=true"), retcommand = TRUE)
shell(selCommand, wait = FALSE, minimized = TRUE)
remDr <- remoteDriver(port = 4567L, browserName = "chrome")
remDr$open()
# define url
url <- "https://fasttrack.grv.org.au/Dog/Form?id=2003010003"
# go to website
remDr$navigate(url)
# as it's being loaded with JavaScript and it has a slow load, add a sleep here
Sys.sleep(10) # increase as needed
# get the html object of the webpage
html_obj <- remDr$getPageSource(header = TRUE)[[1]] %>% read_html()
# read the table in the html_obj
tab <- html_obj %>% html_table() %>% .[[1]]
Hope it helps! However, always check if webpages allow scraping before doing it!
Check Terms and conditions:
Except for the direct purpose of viewing, printing, accessing or
interacting with the Web Site for your own personal use or as
otherwise indicated on the Web Site or these Terms and Conditions, you
must not copy, reproduce, modify, communicate to the public, adapt,
transfer, distribute, download or store any of the contents of the Web
Site (including Race Information as described below), or incorporate
any part of the Web Site into another web site without GRV’s written
consent.