I'm trying to loop through different pages of this website https://burnsville.civicweb.net/filepro/documents/25657/ and download all the PDFs to a folder. Because of the way the website is set up, my usual download.file solution won't work. Any other suggestions?
You have probably found a solution by now, but here is my suggestion with rvest and purrr's map-style loop. This should work across the Burnsville database; just replace the page variable.
library(tidyverse)
library(rvest)

page <-
  "https://burnsville.civicweb.net/filepro/documents/25657/" %>%
  read_html()

df <- tibble(
  names = page %>%
    html_nodes(".document-link") %>%
    html_text2() %>%
    str_remove_all("\r") %>%
    str_squish(),
  links = page %>%
    html_nodes(".document-link") %>%
    html_attr("href") %>%
    paste0("https://burnsville.civicweb.net/", .)
)
# A tibble: 6 × 2
names links
<chr> <chr>
1 Parks & Natural Resources Commission - 06 Dec 2021 Work Session - M… http…
2 Parks & Natural Resources Commission - 15 Nov 2021 - Minutes - Pdf http…
3 Parks & Natural Resources Commission - 04 Oct 2021 - Minutes - Pdf http…
4 Parks & Natural Resources Commission - 07 Jun 2021 - Minutes - Pdf http…
5 Parks & Natural Resources Commission - 19 Apr 2021 - Minutes - Pdf http…
6 Parks & Natural Resources Commission - 04 Jan 2021 - Minutes - Pdf http…
# walk2() pairs each link with its name; download.file() is called for its side effect
walk2(df$links, df$names,
      ~ download.file(.x, destfile = paste0(.y, ".pdf"), mode = "wb"))
This worked for me:
download.file("https://burnsville.civicweb.net/filepro/documents/36906", "a1.pdf", mode="wb")
I'm attempting to build a simple scraper, iterating through a website to pull two pieces of information and build myself a little reference list.
This is what the url looks like: "https://www.mtgstocks.com/prints/[[n]]"
The two pieces of information are the card name (Forbidden Alchemy) and card set (Innistrad).
Pretty straightforward, yeah? I thought so.
I attempted to pass any relevant anchors (CSS or XPath) to try to isolate the two variables, but was met with "{xml_nodeset (0)}".
Here's the code that I ran:
# return page info
page_html <- read_html(httr::GET("https://www.mtgstocks.com/prints/1"))
# extract item name
page_html %>%
  html_nodes("h3") %>%
  html_nodes("a") %>%
  html_text()
# character(0)
I've scraped enough webpages to know that this information is being hidden, but I'm not exactly sure how. Would love help!
They are gathering the information from their API, which you can see in the Network tab of the browser's developer tools. There is a ton of card info you can gather from that API - take a look.
library(tidyverse)
library(httr2)

get_card <- function(card_num) {
  cat("Scraping card num:", card_num, "\n")
  data <- str_c("https://api.mtgstocks.com/prints/", card_num) %>%
    request() %>%
    req_perform() %>%
    resp_body_json(simplifyVector = TRUE)
  tibble(
    name = data %>%
      pluck("name"),
    set = data %>%
      pluck("card_set") %>%
      pluck("name")
  )
}
get_card(1)
# A tibble: 1 × 2
name set
<chr> <chr>
1 Forbidden Alchemy Innistrad
Or fetch a whole range of numbers. If a card does not exist, the code returns NA.
map_dfr(1:20, possibly(get_card, otherwise = tibble(
  name = NA_character_,
  set = NA_character_
)))
# A tibble: 20 × 2
name set
<chr> <chr>
1 Forbidden Alchemy Innistrad
2 NA NA
3 Fortress Crab Innistrad
4 Frightful Delusion Innistrad
5 Full Moon's Rise Innistrad
6 Furor of the Bitten Innistrad
7 Gallows Warden Innistrad
8 Galvanic Juggernaut Innistrad
9 Garruk Relentless Innistrad
10 Gatstaf Shepherd Innistrad
11 Gavony Township Innistrad
12 Geist of Saint Traft Innistrad
13 Geist-Honored Monk Innistrad
14 Geistcatcher's Rig Innistrad
15 Geistflame Innistrad
16 Ghost Quarter Innistrad
17 Ghostly Possession Innistrad
18 Ghoulcaller's Bell Innistrad
19 Ghoulcaller's Chant Innistrad
20 Ghoulraiser Innistrad
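If you plan to walk a large range of card numbers, it may be polite to throttle the requests. A sketch of the same function using httr2's built-in rate limiting (the one-request-per-second rate is an arbitrary choice, not something the API documents):
get_card_throttled <- function(card_num) {
  data <- str_c("https://api.mtgstocks.com/prints/", card_num) %>%
    request() %>%
    req_throttle(rate = 1) %>%  # at most one request per second (assumed rate)
    req_perform() %>%
    resp_body_json(simplifyVector = TRUE)
  tibble(
    name = data %>% pluck("name"),
    set = data %>% pluck("card_set", "name")
  )
}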
I am web-scraping a website (https://pubmed.ncbi.nlm.nih.gov/) to build a dataset from it.
> str(result)
$ title   : chr [1:4007]
$ authors : chr [1:4007]
$ cite    : chr [1:4007]
$ PMID    : chr [1:4007]
$ synopsis: chr [1:4007]
$ link    : chr [1:4007]
$ abstract: chr [1:4007]
title authors cite PMID synop…¹ link abstr…
1 Food insecurity in baccalaureate nursing stude… Cocker… J Pr… 3386… METHOD… http… "Backg…
2 Household food insecurity and educational outc… Masa R… Publ… 3271… We mea… http… "Objec…
3 Food Insecurity and Food Label Comprehension a… Mansou… Nutr… 3437… Multiv… http… "Food …
4 The Household Food Security and Feeding Patter… Omachi… Nutr… 3623… Childr… http… "Child…
5 Food insecurity: Its prevalence and relationsh… Turnbu… J Hu… 3373… BACKGR… http… "Backg…
6 Cross-sectional Analysis of Food Insecurity an… Estrel… West… 3535… METHOD… http… "Intro…
Among the various variables I am creating, there is one that is giving me trouble (result$cite): it bundles several pieces of information that I have to split into different columns in order to get a clear overview of the data I need. Here is an example of some rows to show the difficulty I am facing. I have searched for similar issues but can't find anything fitting this.
1. Public Health. 2021 Sep;198:332-339. doi: 10.1016/j.puhe.2021.07.032. Epub 2021 Sep 9. PMID: 34509858
2. Public Health Nutr. 2021 Apr;24(6):1469-1477. doi: 10.1017/S1368980021000550. Epub 2021 Feb 9. PMID: 33557975
3. Clin Nutr ESPEN. 2022 Dec;52:229-239. doi: 10.1016/j.clnesp.2022.11.005. Epub 2022 Nov 10. PMID: 36513458
Given this, I would like to split result$cite into multiple columns in order to attain what follows:
Review               Date of publication  Page              doi      Reference             Epub          PMID
Public Health.       2021 Sep;            198:332-339.      10.1016  j.puhe.2021.07.032.   2021 Sep 9.   34509858
Public Health Nutr.  2021 Apr;            24(6):1469-1477.  10.1017  S1368980021000550.    2021 Feb 9.   33557975
Clin Nutr ESPEN.     2022 Dec;            52:229-239.       10.1016  j.clnesp.2022.11.005. 2022 Nov 10.  36513458
The main problem for me is that the strings are not regular, hence I can't find a pattern to split the cells into different columns. Any idea?
Does this work for you? (Since the OP has not provided data in a reproducible format, I've created a toy column id beside the cite column.)
library(tidyverse)
result %>%
  extract(cite,
          into = c("Review", "Date of publication", "Page", "doi", "Reference", "Epub", "PMID"),
          regex = "\\d+\\. ([^.]+)\\. ([^;]+);([^.]+)\\. doi:([^/]+)/(\\S+)\\. Epub ([^.]+)\\. PMID: (\\d+)")
id Review Date of publication Page doi Reference Epub
1 1 Public Health 2021 Sep 198:332-339 10.1016 j.puhe.2021.07.032 2021 Sep 9
2 2 Public Health Nutr 2021 Apr 24(6):1469-1477 10.1017 S1368980021000550 2021 Feb 9
3 3 Clin Nutr ESPEN 2022 Dec 52:229-239 10.1016 j.clnesp.2022.11.005 2022 Nov 10
PMID
1 34509858
2 33557975
3 36513458
(NB: if there aren't leading numerics in cite, just remove the \\d+\\. part, including the trailing whitespace, from the regex!)
The way extract works may look difficult to parse but is actually quite simple. Essentially, you do two things: (i) you look at the strings and try to figure out how they are patterned, i.e., what rules they follow; (ii) in the regex argument you put everything you want to extract into distinct capture groups ((...)), and everything else remains outside the capture groups.
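If it helps, here is a toy illustration of the same idea on a made-up string (both the string and the groups are invented for demonstration):
library(tidyverse)

toy <- data.frame(x = "page 12: introduction")
toy %>%
  extract(x,
          into = c("page", "title"),
          regex = "page (\\d+): (\\w+)")
#>   page        title
#> 1   12 introduction
The two capture groups become the two columns; the literal "page " and ": " are matched but discarded.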
Data:
result <- data.frame(id = 1:3,
cite = c("1. Public Health. 2021 Sep;198:332-339. doi: 10.1016/j.puhe.2021.07.032. Epub 2021 Sep 9. PMID: 34509858",
"2. Public Health Nutr. 2021 Apr;24(6):1469-1477. doi: 10.1017/S1368980021000550. Epub 2021 Feb 9. PMID: 33557975",
"3. Clin Nutr ESPEN. 2022 Dec;52:229-239. doi: 10.1016/j.clnesp.2022.11.005. Epub 2022 Nov 10. PMID: 36513458")
)
This is a follow up question for this one:
How to retrieve titles from google search using rvest
This time I am trying to get the text shown below each title in a Google search (circled in red in a screenshot not reproduced here):
Due to my lack of knowledge in web design I do not know how to formulate the xpath to extract the text below titles.
The answer by @AllanCameron is very useful, but I do not know how to modify it:
library(rvest)
library(tidyverse)
#Code
#url
url <- 'https://www.google.com/search?q=Mario+Torres+Mexico'
#Get data
first_page <- read_html(url)
titles <- html_nodes(first_page, xpath = "//div/div/div/a/h3") %>%
  html_text()
Many thanks for your help!
This can all be done without Selenium, using rvest. Unfortunately, Google works differently in different locales; for example, in my locale there is a consent page that has to be navigated before I can even send a request to Google.
It seems this is not required in the OP's locale, but for those of you in the UK, you might need to run the following code first for the rest to work:
library(rvest)
library(tidyverse)
url <- 'https://www.google.com/search?q=Mario+Torres+Mexico'
google_handle <- httr::handle('https://www.google.com')
httr::GET('https://www.google.com', handle = google_handle)
httr::POST(paste0('https://consent.google.com/save?continue=',
                  'https://www.google.com/',
                  '&gl=GB&m=0&pc=shp&x=5&src=2',
                  '&hl=en&bl=gws_20220801-0_RC1&uxe=eomtse&',
                  'set_eom=false&set_aps=true&set_sc=true'),
           handle = google_handle)

url <- httr::GET(url, handle = google_handle)
For the OP and those without a Google consent page, the set up is simply:
library(rvest)
library(tidyverse)
url <- 'https://www.google.com/search?q=Mario+Torres+Mexico'
Next we define the xpaths we are going to use to extract the title (as in the previous Q&A) and the text below the title (pertinent to this question):
title <- "//div/div/div/a/h3"
text <- paste0(title, "/parent::a/parent::div/following-sibling::div")
Now we can just apply these xpaths to get the correct nodes and extract the text from them:
first_page <- read_html(url)
tibble(title = first_page %>% html_nodes(xpath = title) %>% html_text(),
       text = first_page %>% html_nodes(xpath = text) %>% html_text())
#> # A tibble: 9 x 2
#> title text
#> <chr> <chr>
#> 1 "Mario García Torres - Wikipedia" "Mario García Torres (born 1975 in Monc~
#> 2 "Mario Torres (#mario_torres25) • I~ "Mario Torres. Oaxaca, México. Luz y co~
#> 3 "Mario Lopez Torres - A Furniture A~ "The Mario Lopez Torres boutiques are a~
#> 4 "Mario Torres - Player profile | Tr~ "Mario Torres. Unknown since: -. Mario ~
#> 5 "Mario García Torres | The Guggenhe~ "Mario García Torres was born in 1975 i~
#> 6 "Mario Torres - Founder - InfOhana ~ "Ve el perfil de Mario Torres en Linked~
#> 7 "3500+ \"Mario Torres\" profiles - ~ "View the profiles of professionals nam~
#> 8 "Mario Torres Lopez - 33 For Sale o~ "H 69 in. Dm 20.5 in. 1970s Tropical Vi~
#> 9 "Mario Lopez Torres's Woven River G~ "28 Jun 2022 · From grass harvesting to~
The subtext you refer to appears to be rendered in JavaScript, which makes it difficult to access with conventional read_html() methods.
Here is a script using RSelenium which gets the results necessary. You can also click the next page element if you want to get more results etc.
library(wdman)
library(RSelenium)
library(tidyverse)
selServ <- selenium(
  port = 4446L,
  version = 'latest',
  chromever = '103.0.5060.134'  # set to a version available on your machine
)

remDr <- remoteDriver(
  remoteServerAddr = 'localhost',
  port = 4446L,
  browserName = 'chrome'
)
remDr$open()
remDr$navigate("insert URL here")
text_elements <- remDr$findElements("xpath", "//div/div/div/div[2]/div/span")
sapply(text_elements, function(x) x$getElementText()) %>%
  unlist() %>%
  as_tibble_col("results") %>%
  filter(str_length(results) > 15)
# # A tibble: 6 × 1
# results
# <chr>
# 1 "The meaning of HI is —used especially as a greeting. How to use hi in a sentence."
# 2 "Hi definition, (used as an exclamation of greeting); hello! See more."
# 3 "A friendly, informal, casual greeting said upon someone's arrival. quotations ▽synonyms △. Synonyms: hello, greetings, howdy.…
# 4 "Hi is defined as a standard greeting and is short for \"hello.\" An example of \"hi\" is what you say when you see someone. i…
# 5 "hi synonyms include: hola, hello, howdy, greetings, cheerio, whats crack-a-lackin, yo, how do you do, good morrow, guten tag,…
# 6 "Meaning of hi in English ... used as an informal greeting, usually to people who you know: Hi, there! Hi, how are you doing? …
I want to download the data from this website.
http://asphaltoilmarket.com/index.php/state-index-tracker/
But the request keeps timing out. I have tried the following methods already, and all of them time out.
library(rvest)
IndexData <- read_html("http://asphaltoilmarket.com/index.php/state-index-tracker/")
library(RCurl)
IndexData <- getURL("http://asphaltoilmarket.com/index.php/state-index-tracker/")
library(httr)
library(XML)
url <- "http://asphaltoilmarket.com/index.php/state-index-tracker/"
IndexData <- htmlParse(GET(url))
This website opens in the browser without any problem, and I am able to download the data using Excel and Alteryx.
If by "get the data", you mean "scrape the table on that page", then you just need to go a little further.
First, you'll want to check the site's robots.txt to see if scraping is allowed. In this case, there is no mention against scraping.
You've got the html for the site; now you need the css selector for the element you want. You can use developer tools or something like SelectorGadget to find the table and get its css selector.
After that, take the html, extract the node you're interested in with html_node(), then extract the table with html_table().
library(magrittr)
library(rvest)
html <- read_html("http://asphaltoilmarket.com/index.php/state-index-tracker/")

html %>%
  html_node("#tablepress-5") %>%
  html_table()
#> State Jan Feb Mar Apr May Jun Jul
#> 1 Alabama $496.27 $486.86 $482.16 $498.62 $517.44 $529.20 $536.26
#> 2 Alaska $513.33 $513.33 $513.33 $513.33 $513.33 $525.84 $535.00
#> 3 Arizona $476.00 $469.00 $466.00 $463.00 $470.00 $478.00 $480.00
#> 4 Arkansas $503.50 $500.50 $494.00 $503.00 $516.50 $521.20 $525.00
#> 5 California $305.80 $321.00 $346.20 $365.50 $390.10 $380.50 $345.50
#> 6 Colorado $228.10 $301.45 $320.58 $354.12 $348.70 $277.55 $297.23
#> 7 Connecticut $495.00 $495.00 $495.00 $495.00 $502.50 $502.50 $500.56
#> 8 Delaware $493.33 $458.33 $481.67 $496.67 $513.33 $510.00 $498.33
#> 9 Florida $507.30 $484.32 $487.12 $503.38 $518.52 $517.68 $514.03
#> 10 Georgia $515.00 $503.00 $503.00 $517.00 $534.00 $545.00 $550.00
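If read_html() itself keeps timing out for you, one thing worth trying is sending a browser-like User-Agent and a longer timeout via httr, since some servers respond slowly or not at all to R's default client. A sketch (the UA string is just an example):
library(httr)
library(rvest)

resp <- GET("http://asphaltoilmarket.com/index.php/state-index-tracker/",
            user_agent("Mozilla/5.0"),  # example browser-like UA string
            timeout(60))                # wait up to 60 seconds
html <- read_html(resp)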
I am reading text from a file in which the text is separated by | (pipes).
The text table looks like this (tweet id|date and time|tweet):
545253503963516928|Wed Dec 17 16:25:40 +0000 2014|Massachusetts Pharmacy Owners Arrested in Meningitis Deaths http://xxxxxxxxx
545235402156937217|Wed Dec 17 15:13:44 +0000 2014|For First Time, Treatment Helps Patients With Worst Kind of Stroke, Study Says http://xxxxxxxxx
I am reading this information using the following code:
nyt <- read.table(file = ".../nytimeshealth.txt",
                  sep = "|",
                  header = F,
                  quote = "",
                  fill = T,
                  stringsAsFactors = F,
                  numerals = "no.loss",
                  encoding = "UTF-8",
                  na.strings = "NA")
Now, while most of the rows in the original file have 3 columns, each separated by a '|', a few of the rows have an additional '|' separator. That is to say, they have four columns, because some of the tweets themselves contain a | symbol.
545074589374881792|Wed Dec 17 04:34:43 +0000 2014|National Briefing | New England: Massachusetts: Sex-Change Surgery Denied to Inmate http://xxxxxxxxx
I know that using the fill=T option in the read.table call above allows me to read rows of unequal length (blank fields are implicitly added in the empty cells).
So, the row above becomes
71 545074589374881792 Wed Dec 17 04:34:43 +0000 2014 National Briefing
72 New England: Massachusetts: Sex-Change Surgery Denied to Inmate http://xxxxxxxxx
However, now column 3 of row 71 has incomplete information, while columns 2 and 3 of row 72 are empty and column 1 contains not a tweet ID but part of the tweet. Is there any way I can avoid this? I would like to handle the extra | separator wherever it appears, so that I do not lose any information.
Is this possible while reading the text file into R? Or is it something I will have to take care of before I start loading the text. What would be my best course of action?
I created a text file called text.txt with the 3 lines you provided as an example of your data (the 2 easy lines without any | in the tweet, as well as the one which has a | inside the tweet).
Here is the content of this file:
545253503963516928|Wed Dec 17 16:25:40 +0000 2014|Massachusetts Pharmacy Owners Arrested in Meningitis Deaths http://xxxxxxxxx
545235402156937217|Wed Dec 17 15:13:44 +0000 2014|For First Time, Treatment Helps Patients With Worst Kind of Stroke, Study Says http://xxxxxxxxx
545074589374881792|Wed Dec 17 04:34:43 +0000 2014|National Briefing | New England: Massachusetts: Sex-Change Surgery Denied to Inmate http://xxxxxxxxx
Code
library(tidyverse)
readLines("text.txt", encoding = "UTF-8") %>%
map(., str_split_fixed, "\\|", 3) %>%
map_df(., as_tibble)
Result
# A tibble: 3 x 3
V1 V2
<chr> <chr>
1 545253503963516928 Wed Dec 17 16:25:40 +0000 2014
2 545235402156937217 Wed Dec 17 15:13:44 +0000 2014
3 545074589374881792 Wed Dec 17 04:34:43 +0000 2014
V3
<chr>
1 Massachusetts Pharmacy Owners Arrested in Meningitis Deaths http://xxxxxxxxx
2 For First Time, Treatment Helps Patients With Worst Kind of Stroke, Study Say…
3 National Briefing | New England: Massachusetts: Sex-Change Surgery Denied to …
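The key design choice here is the n = 3 in str_split_fixed(): it splits on the first two pipes only and leaves everything after them, including any further |, intact in the third piece. A quick demonstration:
library(stringr)

str_split_fixed("a|b|c|d", "\\|", 3)
#>      [,1] [,2] [,3]
#> [1,] "a"  "b"  "c|d"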
Here is another solution, coming back to your comment, that reuses your initial code. It will only work if tweets contain at most one | (tweets with none are fine, as long as at least one tweet has a |, so that a V4 column exists). If no tweet contains a |, or if some tweets contain more than one, it will break and you will have to edit it. So the other answer, which works regardless of the structure of your tweets, is better IMO.
I am still using my text.txt file:
df <- read.table(file = "text.txt",
                 sep = "|",
                 header = F,
                 quote = "",
                 fill = T,
                 stringsAsFactors = F,
                 numerals = "no.loss",
                 encoding = "UTF-8",
                 na.strings = "NA")

df %>%
  mutate(V3 = paste0(V3, V4)) %>%  # note: the | inside the tweet itself is lost in the split
  select(-V4)
Result
V1 V2
1 545253503963516928 Wed Dec 17 16:25:40 +0000 2014
2 545235402156937217 Wed Dec 17 15:13:44 +0000 2014
3 545074589374881792 Wed Dec 17 04:34:43 +0000 2014
V3
1 Massachusetts Pharmacy Owners Arrested in Meningitis Deaths http://xxxxxxxxx
2 For First Time, Treatment Helps Patients With Worst Kind of Stroke, Study Says http://xxxxxxxxx
3 National Briefing New England: Massachusetts: Sex-Change Surgery Denied to Inmate http://xxxxxxxxx
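If some tweets could contain more than one |, a sketch that generalizes this answer is to paste all overflow columns back together (this assumes the overflow columns run from V4 to the last column, and that read.table filled the missing cells with empty strings, which it does for character columns; note the | itself has already been consumed by the split):
df %>%
  unite("V3", V3:last_col(), sep = " ") %>%  # rejoin all overflow columns
  mutate(V3 = str_squish(V3))                # trim the padding this leaves behind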