How can I scrape the pdf documents from HTML?
I am using R, and so far I can only extract the text from HTML pages. Here is an example of the kind of website I am trying to scrape:
https://www.bot.or.th/English/MonetaryPolicy/Northern/EconomicReport/Pages/Releass_Economic_north.aspx
When you say you want to scrape the PDF files from HTML pages, I think the first problem you face is to actually identify the location of those PDF files.
library(XML)
library(RCurl)

url <- "https://www.bot.or.th/English/MonetaryPolicy/Northern/EconomicReport/Pages/Releass_Economic_north.aspx"

# Download and parse the page, then pull the href attribute of every <a> tag
page <- getURL(url)
parsed <- htmlParse(page)
links <- xpathSApply(parsed, path = "//a", xmlGetAttr, "href")

# Keep only the links that end in .pdf
inds <- grep("\\.pdf$", links)
links <- links[inds]
links now contains all the URLs of the PDF files you are trying to download.
Beware: many websites don't like it when you scrape their documents automatically, and you may get blocked.
With the links in place, you can loop through them, download each file, and save it in your working directory under the name stored in destination. I decided to derive reasonable document names for your PDFs from the links, by extracting the final piece after the last / in each URL:
regex_match <- regexpr("[^/]+$", links, perl=TRUE)
destination <- regmatches(links, regex_match)
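As an aside, base R's basename() also keeps everything after the last / in a path, so it should give the same file names without the regex:
destination <- basename(links)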
To avoid overloading the website's servers, it is considered polite to pause your scraping every once in a while, so I use Sys.sleep() to pause for a random time between 1 and 5 seconds:
for (i in seq_along(links)) {
  # mode = "wb" keeps the PDFs intact on Windows (they are binary files)
  download.file(links[i], destfile = destination[i], mode = "wb")
  Sys.sleep(runif(1, 1, 5))  # pause between 1 and 5 seconds
}
Related
I am pretty new to web scraping, and I need to scrape the content of newspaper articles from a list of URLs pointing to articles on different websites. I would like to obtain the actual textual content of each document; however, I cannot find a way to automate the scraping procedure across links that point to different websites.
In my case, the data are stored in a dataframe called "dublin", with the source website in a "page" column and the article links in a "url" column (the second column).
So far, I have managed to scrape articles from the same website, relying on the same CSS paths I find with SelectorGadget to retrieve the texts. Here is the code I am using to scrape content from documents on the same webpage, in this case those posted by The Irish Times:
library(xml2)
library(rvest)
library(dplyr)

# Keep only the Irish Times articles and pull out their URLs
dublin <- dublin %>%
  filter(page == "The Irish Times")
link <- pull(dublin, 2)

# Loop over the links and collect the text of every body paragraph
articles <- list()
for (i in link) {
  page <- read_html(i)
  text <- page %>%
    html_elements(".body-paragraph") %>%
    html_text()
  articles[[i]] <- text
}
articles
It actually works. However, since the webpages vary from site to site, I was wondering whether there is any way to automate this procedure across all the elements of the "url" variable.
Here is an example of the links I scraped:
https://www.thesun.ie/news/10035498/dublin-docklands-history-augmented-reality-app/
https://lovindublin.com/lifestyle/dublins-history-comes-to-life-with-new-ar-app-that-lets-you-experience-it-first-hand
https://www.irishtimes.com/ireland/dublin/2023/01/11/phone-app-offering-augmented-reality-walking-tour-of-dublins-docklands-launched/
https://www.dublinlive.ie/whats-on/family-kids-news/new-augmented-reality-app-bring-25949045
https://lovindublin.com/news/campaigners-say-we-need-to-be-ambitious-about-potential-lido-for-georges-dock
Thank you in advance! Hope the material I provided is enough.
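One way to generalise this, sketched below, is to keep one CSS selector per site in a lookup table keyed by domain and pick the right selector for each URL. This is only a sketch: apart from The Irish Times' .body-paragraph, which comes from your code above, the selectors are placeholders you would need to find yourself with SelectorGadget, and it assumes every domain in the url column appears in the table:
library(rvest)
library(dplyr)

# Hypothetical lookup table: one CSS selector per site
selectors <- c(
  "www.irishtimes.com" = ".body-paragraph",
  "www.thesun.ie"      = ".article__content p",  # placeholder
  "lovindublin.com"    = ".post-content p",      # placeholder
  "www.dublinlive.ie"  = ".article-body p"       # placeholder
)

urls <- pull(dublin, 2)
articles <- list()

for (u in urls) {
  host <- sub("^https?://([^/]+).*$", "\\1", u)  # extract the domain from the URL
  css  <- selectors[[host]]                      # look up the selector for that site
  articles[[u]] <- read_html(u) %>%
    html_elements(css) %>%
    html_text()
}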
I'm trying to practice text analysis with the Fed FOMC minutes.
I was able to obtain all links to the appropriate pdf files from the link below.
https://www.federalreserve.gov/monetarypolicy/fomccalendars.htm
I tried download.file("https://www.federalreserve.gov/monetarypolicy/files/fomcminutes20160316.pdf", "1.pdf").
The download was successful; however, when I click on the downloaded file, it outputs "There was an error opening this document. The file is damaged and could not be repaired."
What are some ways to fix this? Is this a way of preventing web scraping on Fed's side?
I have 44 links (PDF files) to download and read in R. Is there a way to do this without physically downloading the files?
library(stringr)
library(rvest)
library(pdftools)
# Scrape the website with rvest for all href links
p <- rvest::read_html("https://www.federalreserve.gov/monetarypolicy/fomccalendars.htm")
pdfs <- p %>% rvest::html_elements("a") %>% html_attr("href")
# Filter selected fomcminute paths and reconstruct html links
pdfs <- pdfs[stringr::str_detect(pdfs, "fomcminutes.*pdf")]
pdfs <- pdfs[!is.na(pdfs)]
paths <- paste0("https://www.federalreserve.gov/", pdfs)
# Scrape minutes as list of text files
pdf_data <- lapply(paths, pdftools::pdf_text)
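pdf_data is then a list with one element per PDF, and each element is a character vector holding one page of text per entry. As an aside, the "file is damaged" error in the question is the usual symptom of downloading a binary file in text mode on Windows; if you do want the PDFs on disk, passing mode = "wb" to download.file() should avoid it, e.g.:
download.file("https://www.federalreserve.gov/monetarypolicy/files/fomcminutes20160316.pdf",
              destfile = "1.pdf", mode = "wb")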
I am trying to scrape some URLs from multiple websites I have collected. I saved the already collected websites in a dataframe called meetings2017_2018. The problem is that the URLs don't look very similar to one another, except for the first part: https://amsterdam.raadsinformatie.nl. The second parts of the URLs are saved in the dataframe. Here are some examples:
/vergadering/458873/raadscommissie%20Algemene%20Zaken
/vergadering/458888/raadscommissie%20Wonen
/vergadering/458866/raadscommissie%20Jeugd%20en%20Cultuur
/vergadering/346691/raadscommissie%20Algemene%20Zaken
So the whole URL would be https://amsterdam.raadsinformatie.nl/vergadering/458873/raadscommissie%20Algemene%20Zaken
I managed to create a very simple function from which I can pull out the URLs from a single website (see below).
library(glue)
library(rvest)

web_scrape <- function(meeting) {
  # Build the full URL and return all href attributes on the page
  url <- glue("https://amsterdam.raadsinformatie.nl{meeting}")
  read_html(url) %>%
    html_nodes("a") %>%
    html_attr("href")
}
With this function I still need to insert every single URL from the dataframe I want to scrape. Since there are more than 140 URLs in the dataframe, this might take a while. As you can guess, I want to scrape all the URLs at once, using the URL list in the dataframe. Does anybody know how I can do that?
You could map/iterate over the URLs saved in your meetings2017_2018 data frame.
Assuming the URLs are saved in a url column of meetings2017_2018, a starting point would be:
library(dplyr)
library(purrr)

# create a vector of the URLs
urls <- pull(meetings2017_2018, url)

# map over the URLs and execute whatever code you want for every URL
map(urls, function(url) {
  your_code
})
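For instance, reusing the web_scrape() function defined above (just a sketch, assuming every value in the url column is a relative path like the examples in the question):
all_links <- map(urls, web_scrape)  # one character vector of hrefs per meeting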
I am learning Python (using 3.5). I realize I will probably take a bit of heat for posting my question. Here goes: I have literally reviewed several hundred posts, help docs, etc., all in an attempt to construct the code I need. No luck thus far. I hope someone can help me. I have a set of URLs, say 18 or more. Only two are illustrated here:
[1] "http://www.senate.mo.gov/media/15info/Chappelle-Nadal/releases/111915.html"
[2] "http://www.senate.mo.gov/media/15info/Chappelle-Nadal/releases/092215.htm"
I need to scrape all the data (text) behind each URL and write it out to individual text files (one for each URL) for future topic model analysis. Right now, I pull in the URLs through R using rvest. I then take each URL (one at a time, in the code) into Python and do the following:
from urllib.request import urlopen
from bs4 import BeautifulSoup

soup = BeautifulSoup(urlopen('http://www.senate.mo.gov/media/14info/chappelle-nadal/Columns/012314-Condensed.html').read(), 'html.parser')
txt = soup.find('div', {'class': 'body'})
print(soup.get_text())
# print(soup.prettify()) not much help

# store the info in an object, then write out the object
test = soup.get_text()

# below does write a file
open_file = open('23Jan2014cplNadal1.txt', 'w')
open_file.write(test)
open_file.close()
The above gets me partially to my target. It leaves me just a little clean up regarding the text, but that's okay. The problem is that it is labor intensive.
Is there a way to:
1. Write a clean text file (without invisibles, etc.) out from R with all the listed URLs?
2. For Python 3.5: take all the URLs, once they are in a clean single file (one URL per line), and have some iterative process retrieve the text behind each URL and write a text file for each URL's data (text) to a location on my hard drive?
I have to do this process for approximately 1000 state-level senators. Any help or direction is greatly appreciated.
Edit to original: Thank you so much all. To N. Velasquez: I tried the following:
urls<-c("http://www.senate.mo.gov/media/14info/Chappelle-Nadal/releases/120114.html",
"http://www.senate.mo.gov/media/14info/Chappelle-Nadal/releases/110614.htm"
)
for (url in urls) {
download.file(url, destfile = basename(url), method="curl", mode ="w", extra="-k")
}
HTML files are then written out to my working directory. However, is there a way to write out text files instead of HTML files? I've read the download.file documentation and can't seem to figure out a way to produce individual text files. Regarding the suggestion of a for loop: is what I illustrate above what you mean for me to attempt? Thank you!
The answer for 1 is: Sure!
The following code will loop through the list of URLs and export atomic TXT files, as per your request.
Note that through rvest and html_node() you could get a much more structured dataset, with recurring parts of the HTML stored separately (header, office info, main body, URL, etc.).
library(rvest)

urls <- c("http://www.senate.mo.gov/media/15info/Chappelle-Nadal/releases/111915.html",
          "http://www.senate.mo.gov/media/15info/Chappelle-Nadal/releases/092215.htm")

ht <- list()
for (i in 1:length(urls)) {
  # Grab the main content node of each page and strip line breaks
  ht[[i]] <- html_text(html_node(read_html(urls[i]), xpath = '//*[@id="mainContent"]'), trim = TRUE)
  ht[[i]] <- gsub("[\r\n]", "", ht[[i]])
  writeLines(ht[[i]], paste("DOC_", i, ".txt", sep = ""))
}
Look for the DOC_1.txt and DOC_2.txt in your working directory.
I am trying to scrape data from a website where I have to first get a list of links from the main page, and then go into each link to scrape the data. The only way I can think of to do this is with a loop.
For example:
library(rvest)

content <- rep(NA_character_, 10)
for (i in 1:10) {
  # read_html() replaces the old rvest::html()
  link <- read_html(links[i])
  content[i] <- html_text(html_nodes(link, "tr:nth-child(1) td"))
}
Here, assume that links is a character vector of URLs.
This works, but it is very slow. Is there a way to speed it up?
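Most of the time in that loop is spent waiting on the network, so one way to speed it up is to fetch the pages in parallel. Below is a minimal sketch using the furrr and future packages (my own suggestion, not something from the question), with links being the same character vector of URLs:
library(rvest)
library(future)
library(furrr)

plan(multisession)  # run the requests in several background R sessions

content <- future_map(links[1:10], function(url) {
  page <- read_html(url)
  html_text(html_nodes(page, "tr:nth-child(1) td"))
})
Keep in mind that parallel requests hit the server harder, so the usual politeness caveats (rate limiting, Sys.sleep()) still apply.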