Parse CDATA with R - r

I'm scraping and analyzing data from a car auction website. My goal is to develop date-time and sentiment analysis skills, and I like old cars. The website is Bring A Trailer-- they do not offer API access (I asked), but robots.txt is OK.
SO user '42' pointed out that this is not permitted by BAT's terms, so I have removed their base url. I will likely remove the question. After thinking about it, I can do what I want by saving a couple of webpages from my browser and analyzing that data. I don't need ALL the auctions, I just followed a tutorial that did and here I am reading TOS instead of doing what I wanted in the first place...
Some of the data is easily accessed, but the best parts are hard, and I'm stuck with that. I'm really looking for advice on my approach.
My first steps work: I can find and locally cache the webpages:
library(tidyverse)
library(rvest)
data_dir <- "bat_data-html/"
# Step 1: Create list of links to listings ----------------------------
base_url <- "https://"
pages <- read_html(file.path(base_url,"/auctions/")) %>%
html_nodes(".auctions-item-title a") %>%
html_attr("href") %>%
file.path
pages <- head(pages, 3) # use a subset for testing code
# Step 2 : Save auction pages locally ---------------------------------
dir.create(data_dir, showWarnings = FALSE)
p <- progress_estimated(length(pages))
# Download each auction page
walk(pages, function(url){
download.file(url, destfile = file.path(data_dir, basename(url)), quiet = TRUE)
p$tick()$print()
})
I can also process metadata about the auction from these cached pages, identifying the css selectors with SelectorGadget and specifying them to rvest:
# Step 3: Process each auction info into df ----------------------------
files <- dir(data_dir, pattern = "*", full.names = TRUE)
# Function: get_auction_details, to be applied to each auction page
get_auction_details <- function(file) {
pagename <- basename(file) # the filename of the page (trailing index for multiples)
page <- read_html(file) # read the html into R ( consider , options = "NOCDATA")
# Grab the title of the auction stored in the ".listing-post-title" tag on the page
title <- page %>% html_nodes(".listing-post-title") %>% html_text()
# Grab the "BAT essentials" of the auction stored in the ".listing-essentials-item" tag on the page
essence <- page %>% html_nodes(".listing-essentials-item") %>% html_text()
# Assemble into a data frame
info_tbl0 <- as_tibble(essence)
info_tbl <- add_row(info_tbl0, value = title, .before = 1)
names(info_tbl) [1] <- pagename
return(info_tbl)
}
# Apply the get_auction_details function to each element of files
bat0 <- map_df(files, get_auction_details) # run function
bat <- gather(bat0) %>% subset(value != "NA") # serialize results
# Save as csv
write_csv(bat, path = "data-csv/bat04.csv") # this table contains the expected metadata:
key,value
1931-ford-model-a-12,Modified 1931 Ford Model A Pickup
1931-ford-model-a-12,Lot #8576
1931-ford-model-a-12,Seller: TargaEng
But the auction data (bids, comments) is inside of a CDATA section:
<script type='text/javascript'>
/* <![CDATA[ */
var BAT_VMS = { ...bids, comments, results
/* ]]> */
</script>
I've tried elements within this section using the path that I find using SelectorGadget, but they are not found-- this gives an empty list:
tmp <- page %>% html_nodes(".comments-list") %>% html_text()
Looking at the text within this CDATA section, I see some xml tags but it is not structured in the cached file like it is when I inspect the auction section of the live webpage.
To extract this information, should I try to parse the information "as-is" from within this CDATA section, or can I transform it so that it can be parsed like XML? Or am I barking up the wrong tree?
I appreciate any advice!

It's buried in the XML2 documentation, but you can use this option to keep the CDATA intact.
# Instead of rvest::read_html()
xml2::read_xml(options = "NOCDATA")
After reading the feed in this way, you'll be able to access the comments list the way you wanted.
tmp <- page %>% html_nodes(".comments-list") %>% html_text()

Related

How do I scrape / automatically download PDF files from a document search web interface in R?

I am using the R programming language for NLP (natural language process) analysis - for this, I need to "webscrape" publicly available information on the internet.
Recently, I learned how to "webscrape" a single pdf file from the website I am using :
library(pdftools)
library(tidytext)
library(textrank)
library(dplyr)
library(tibble)
#this is an example of a single pdf
url <- "https://www.canlii.org/en/ns/nswcat/doc/2013/2013canlii47876/2013canlii47876.pdf"
article <- pdf_text(url)
article_sentences <- tibble(text = article) %>%
unnest_tokens(sentence, text, token = "sentences") %>%
mutate(sentence_id = row_number()) %>%
select(sentence_id, sentence)
article_words <- article_sentences %>%
unnest_tokens(word, sentence)
article_words <- article_words %>%
anti_join(stop_words, by = "word")
#this final command can take some time to run
article_summary <- textrank_sentences(data = article_sentences, terminology = article_words)
#Sources: https://stackoverflow.com/questions/66979242/r-error-in-textrank-sentencesdata-article-sentences-terminology-article-w , https://www.hvitfeldt.me/blog/tidy-text-summarization-using-textrank/
The above code works fine if you want to manually access a single website and then "webscrape" this website. Now, I want to try and automatically download 10 such articles at the same time, without manually visiting each page. For instance, suppose I want to download the first 10 pdf's from this website: https://www.canlii.org/en/#search/type=decision&text=dog%20toronto
I think I found the following website which discusses how to do something similar (I adapted the code for my example): https://towardsdatascience.com/scraping-downloading-and-storing-pdfs-in-r-367a0a6d9199
library(tidyverse)
library(rvest)
library(stringr)
page <- read_html("https://www.canlii.org/en/#search/type=decision&text=dog%20toronto ")
raw_list <- page %>%
html_nodes("a") %>%
html_attr("href") %>%
str_subset("\\.pdf") %>%
str_c("https://www.canlii.org/en/#search/type=decision&text=dog", .)
map(read_html) %>%
map(html_node, "#raw-url") %>%
map(html_attr, "href") %>%
str_c("https://www.canlii.org/en/#search/type=decision&text=dog", .) %>%
walk2(., basename(.), download.file, mode = "wb")
But this produces the following error:
Error in .f(.x[[1L]], .y[[1L]], ...) : scheme not supported in URL 'NA'
Can someone please show me what I am doing wrong? Is it possible to download the first 10 pdf files that appear on this website and save them individually in R as "pdf1", "pdf2", ... "pdf9", "pdf10"?
Thanks
I see some people suggesting that you use rselenium, which is a way to
simulate browser actions, so that the web server renders the page as
if a human was visiting the site. From my experience it is almost never
necessary to go down that route. The javascript part of the website is
interacting with an API and we can utilize that to circumvent the Javascript
part and get the raw json data directly. In Firefox (and Chrome is similar in that regard I
assume) you can right-click on the website and select “Inspect Element (Q)”,
go to the “Network” tab and click on reload. You’ll see that each request
the browser makes to the webserver is being listed after a few seconds or less.
We are interested in the ones that have the “Type” json.
When you right click on an entry you can select “Open in New Tab”. One of the
requests that returns json has the following URL attached to it https://www.canlii.org/en/search/ajaxSearch.do?type=decision&text=dogs%20toronto&page=1
Opening that URL in Firefox gets you to a GUI that lets you explore the
json data structure and you’ll see that there is a “results” entry which
contains the data for the 25 first results of your search. Each one has a
“path” entry, that leads to the page that will display the embedded PDF.
It turns out that if you replace the “.html” part with “.pdf” that path
leads directly to the PDF file. The code below utilizes all this information.
library(tidyverse) # tidyverse for the pipe and for `purrr::map*()` functions.
library(httr) # this should already be installed on your machine as `rvest` builds on it
library(pdftools)
#> Using poppler version 20.09.0
library(tidytext)
library(textrank)
base_url <- "https://www.canlii.org"
json_url_search_p1 <-
"https://www.canlii.org/en/search/ajaxSearch.do?type=decision&text=dogs%20toronto&page=1"
This downloads the json for page 1 / results 1 to 25
results_p1 <-
GET(json_url_search_p1, encode = "json") %>%
content()
For each result we extract the path only.
result_html_paths_p1 <-
map_chr(results_p1$results,
~ .$path)
We replace “.html” with “.pdf”, combine the base URL with the path to
generate the full URLs pointing to the PDFs. Last we pipe it into purrr::map()
and pdftools::pdf_text in order to extract the text from all 25 PDFs.
pdf_texts_p1 <-
gsub(".html$", ".pdf", result_html_paths_p1) %>%
paste0(base_url, .) %>%
map(pdf_text)
If you want to do this for more than just the first page you might want to
wrap the above code in a function that lets you switch out the “&page=”
parameter. You could also make the “&text=” parameter an argument of the
function in order to automatically scrape results for other searches.
For the remaining part of the task we can build on the code you already have.
We make it a function that can be applied to any article and apply that function
to each PDF text again using purrr::map().
extract_article_summary <-
function(article) {
article_sentences <- tibble(text = article) %>%
unnest_tokens(sentence, text, token = "sentences") %>%
mutate(sentence_id = row_number()) %>%
select(sentence_id, sentence)
article_words <- article_sentences %>%
unnest_tokens(word, sentence)
article_words <- article_words %>%
anti_join(stop_words, by = "word")
textrank_sentences(data = article_sentences, terminology = article_words)
}
This now will take a real long time!
article_summaries_p1 <-
map(pdf_texts_p1, extract_article_summary)
Alternatively you could use furrr::future_map() instead to utilize all the CPU
cores in your machine and speed up the process.
library(furrr) # make sure the package is installed first
plan(multisession)
article_summaries_p1 <-
future_map(pdf_texts_p1, extract_article_summary)
Disclaimer
The code in the answer above is for educational purposes only. As many websites do, this service restricts automated access to its contents. The robots.txt explicitly disallows the /search path from being accessed by bots. It is therefore recommended to get in contact with the site owner before downloading big amounts of data. canlii offers API access on an individual request basis, see documentation here. This would be the correct and safest way to access their data.

How to webscrape a file which adress changes in R

I am interested in this excel file, which structure does not change : https://rigcount.bakerhughes.com/static-files/cc0aed5c-b4fc-440d-9522-18680fb2ef6a
Which i can get from this page : https://rigcount.bakerhughes.com/na-rig-count
The last url does not change over time, whereas the first one does.
But I guess the url of the file is located somewhere in the elements of the fixed webpage, even if it is changed, and the the generation of the filename follows a repetitive procedure.
Therefore, is there a way, in R, to get the file (which is updated every week or so) in an automated manner, without dowloading it manually each time ?
You skipped the part of the question where you talk about what you had done. Or searching the web for tutorials. But it was easy to do so here goes. You'll have to look up an rvest tutorial for more explanation.
library(rvest) # to allow easy scraping
library(magrittr) # to allow %>% pipe commands
page <- read_html("https://rigcount.bakerhughes.com/na-rig-count")
# Find links that match excel type files as defined by the page
links <- page %>%
html_nodes("span.file--mime-application-vnd-ms-excel-sheet-binary-macroEnabled-12") %>%
html_nodes("a")
links_df <- data.frame(
title = links %>% html_attr("title"),
link = links %>% html_attr("href")
)
links_df
title
# 1 north_america_rotary_rig_count_jan_2000_-_current.xlsb
# 2 north_american_rotary_rig_count_pivot_table_feb_2011_-_current.xlsb
# link
# 1 https://rigcount.bakerhughes.com/static-files/cc0aed5c-b4fc-440d-9522-18680fb2ef6a
# 2 https://rigcount.bakerhughes.com/static-files/c7852ea5-5bf5-4c47-b52c-f025597cdddf

how to download pdf file with R from web (encode issue)

I am trying to download a pdf file from a website using R. When I tried to to use the function browserURL, it only worked with the argument encodeIfNeeded = T. As a result, if I pass the same url to the function download.file, it returns "cannot open destfile 'downloaded/teste.pdf', reason 'No such file or directory", i.e., it cant find the correct url.
How do I correct the encode, in order for me to be able to download the file programatically?
I need to automate this, because there are more than a thousand files to download.
Here's a minimum reproducible code:
library(tidyverse)
library(rvest)
url <- "http://www.ouvidoriageral.sp.gov.br/decisoesLAI.html"
webpage <- read_html(url)
# scrapping hyperlinks
links_decisoes <- html_nodes(webpage,".borderTD a") %>%
html_attr("href")
# creating full/correct url
full_links <- paste("http://www.ouvidoriageral.sp.gov.br/", links_decisoes, sep="" )
# browseURL only works with encodeIfNeeded = T
browseURL(full_links[1], encodeIfNeeded = T,
browser = "C://Program Files//Mozilla Firefox//firefox.exe")
# returns an error
download.file(full_links[1], "downloaded/teste.pdf")
There are a couple of problems here. Firstly, the links to some of the files are not properly formatted as urls - they contain spaces and other special characters. In order to convert them you must use url_escape(), which should be available to you as loading rvest also loads xml2, which contains url_escape().
Secondly, the path you are saving to is relative to your R home directory, but you are not telling R this. You either need the full path like this: "C://Users/Manoel/Documents/downloaded/testes.pdf", or a relative path like this: path.expand("~/downloaded/testes.pdf").
This code should do what you need:
library(tidyverse)
library(rvest)
# scraping hyperlinks
full_links <- "http://www.ouvidoriageral.sp.gov.br/decisoesLAI.html" %>%
read_html() %>%
html_nodes(".borderTD a") %>%
html_attr("href") %>%
url_escape() %>%
{paste0("http://www.ouvidoriageral.sp.gov.br/", .)}
# Looks at page in firefox
browseURL(full_links[1], encodeIfNeeded = T, browser = "firefox.exe")
# Saves pdf to "downloaded" folder if it exists
download.file(full_links[1], path.expand("~/downloaded/teste.pdf"))

Extract data from an aspx web page with R

I want to extract data from an 'aspx' page (I'm not a specialist of web pages formats) :
http://www.ffvoile.fr/ffv/web/pratique/habitable/OSIRIS/table.aspx
More precisely, I want to extract the information for each boat, that we access clicking the 'information' button on the left of the row.
My problem is that the URL is always the same in the case of the 'aspx' page so I don't understand how I can access the information for each boat.
I know how to extract data from a 'standard' web page so how I need to modify the following code (these pages display similar but more limited information on boats that the 'aspx' page) ?
library(rvest)
Url <- "http://www.ffvoile.fr/ffv/public/Application1/Habitable/HN_Detail.asp?Matricule=1"
Page <- read_html(Url)
Data <- Page %>%
html_nodes(".Valeur") %>% # I use SelectorGadget to highlights the relevant elements
html_text()
print(Data)
Assuming that it is not illegal to scrape data from the website, you might consider using the following.
As mentioned in the comment, you can leverage on Fiddler to figure out what are the http requests being made and duplicate those actions.
library(httr)
library(xml2)
website <- "http://www.ffvoile.fr/ffv/web/pratique/habitable/OSIRIS/table.aspx"
#get cookies and and view states
req <- GET(paste0(website, "/js"))
req_html <- read_html(rawToChar(req$content))
fields <- c("__VIEWSTATE","__VIEWSTATEGENERATOR","__VIEWSTATEENCRYPTED",
"__PREVIOUSPAGE", "__EVENTVALIDATION")
viewheaders <- lapply(fields, function(x) {
xml_attr(xml_find_first(req_html, paste0(".//input[#id='",x,"']")), "value")
})
names(viewheaders) <- fields
#post data request with index, i starting from 0. You can loop through each row using i
i <- 0
params <- c(viewheaders,
list(
"__EVENTTARGET"="ctl00$mainContentPlaceHolder$GridView_TH",
"__EVENTARGUMENT"=paste0("Select$", i),
"ctl00$mainContentPlaceHolder$DropDownList_classes"="TOUT",
"ctl00$mainContentPlaceHolder$TextBox_Bateau"="",
"ctl00$mainContentPlaceHolder$DropDownList_GR"="TOUT",
"hiddenInputToUpdateATBuffer_CommonToolkitScripts"=1))
resp <- POST(website, body=params, encode="form",
set_cookies(structure(cookies(req)$value, names=cookies(req)$name)))
if(resp$status_code == 200) {
writeLines(rawToChar(resp$content), "ffvoile.html")
shell("ffvoile.html")
}

Scrape contents of dynamic pop-up window using R

I'm stuck on this one after much searching....
I started with scraping the contents of a table from:
http://www.skatepress.com/skates-top-10000/artworks/
Which is easy:
data <- data.frame()
for (i in 1:100){
print(paste("page", i, "of 100"))
url <- paste("http://www.skatepress.com/skates-top-10000/artworks/", i, "/", sep = "")
temp <- readHTMLTable(stringsAsFactors = FALSE, url, which = 1, encoding = "UTF-8")
data <- rbind(data, temp)
} # end of scraping loop
However, I need to additionally scrape the detail that is contained in a pop-up box when you click on each name (and on the artwork title) in the list on the site.
I can't for the life of me figure out how to pass the breadcrumb (or artist-id or painting-id) through in order to make this happen. Since straight up using rvest to access the contents of the nodes doesn't work, I've tried the following:
I tried passing the painting id through in the url like this:
url <- ("http://www.skatepress.com/skates-top-10000/artworks/?painting_id=576")
site <- html(url)
But it still gives an empty result when scraping:
node1 <- "bread-crumb > ul > li.activebc"
site %>% html_nodes(node1) %>% html_text(trim = TRUE)
character(0)
I'm (clearly) not a scraping expert so any and all assistance would be greatly appreciated! I need a way to capture this additional information for each of the 10,000 items on the list...hence why I'm not interested in doing this manually!
Hoping this is an easy one and I'm just overlooking something simple.
This will be a more efficient base scraper and you can get progress bars for free with the pbapply package:
library(xml2)
library(httr)
library(rvest)
library(dplyr)
library(pbapply)
library(jsonlite)
base_url <- "http://www.skatepress.com/skates-top-10000/artworks/%d/"
n <- 100
bind_rows(pblapply(1:n, function(i) {
mutate(html_table(html_nodes(read_html(sprintf(base_url, i)), "table"))[[1]],
`Sale Date`=as.Date(`Sale Date`, format="%m.%d.%Y"),
`Premium Price USD`=as.numeric(gsub(",", "", `Premium Price USD`)))
})) -> skatepress
I added trivial date & numeric conversions.
I belive your main issue is that the site requires a login to get the additional data. You should give that (i.e. logging in) a shot using httr and grab the wordpress_logged_inXXXXXXX… cookie from that endeavour. I just grabbed it from inspecting the session with Developer Tools in Chrome and that will also work for you (but it's worth the time to learn how to do it via httr).
You'll need to scrape two additional <a … tags from each table row. The one for "artist" looks like:
Pablo Picasso
You can scrape the contents with:
POST("http://www.skatepress.com/wp-content/themes/skatepress/scripts/query_artist.php",
set_cookies(wordpress_logged_in_XXX="userid%XXXXXreallylongvalueXXXXX…"),
encode="form",
body=list(id="pab_pica_1881"),
verbose()) -> artist_response
fromJSON(content(artist_response, as="text"))
(The return value is too large to post here)
The one for "artwork" looks like:
Les femmes d′Alger (Version ′O′)
and you can get that in similar fashion:
POST("http://www.skatepress.com/wp-content/themes/skatepress/scripts/query_artwork.php",
set_cookies(wordpress_logged_in_XXX="userid%XXXXXreallylongvalueXXXXX…"),
encode="form",
body=list(id=576),
verbose()) -> artwork_response
fromJSON(content(artwork_response, as="text"))
That's not huge but I won't clutter the response with it.
NOTE that you can also use rvest's html_session to do the login (which will get you cookies for free) and then continue to use that session in the scraping (vs read_html) which will mean you don't have to do the httr GET/PUT.
You'll have to figure out how you want to incorporate that data into the data frame or associate it with it via various id's in the data frame (or some other strategy).
You can see it call those two php scripts via Developer Tools and it also shows the data it passes in. I'm also really shocked that site doesn't have any anti-scraping clauses in their ToS but they don't.

Resources