Extract data from an aspx web page with R

I want to extract data from an 'aspx' page (I'm not a specialist in web page formats):
http://www.ffvoile.fr/ffv/web/pratique/habitable/OSIRIS/table.aspx
More precisely, I want to extract the information for each boat, which is accessed by clicking the 'information' button on the left of each row.
My problem is that the URL stays the same on the 'aspx' page, so I don't understand how I can access the information for each boat.
I know how to extract data from a 'standard' web page, so how do I need to modify the following code? (These pages display similar but more limited information on boats than the 'aspx' page.)
library(rvest)
Url <- "http://www.ffvoile.fr/ffv/public/Application1/Habitable/HN_Detail.asp?Matricule=1"
Page <- read_html(Url)
Data <- Page %>%
  html_nodes(".Valeur") %>% # I used SelectorGadget to highlight the relevant elements
  html_text()
print(Data)

Assuming that it is not illegal to scrape data from the website, you might consider using the following.
As mentioned in the comment, you can use Fiddler to figure out which HTTP requests are being made and duplicate those actions.
library(httr)
library(xml2)
website <- "http://www.ffvoile.fr/ffv/web/pratique/habitable/OSIRIS/table.aspx"
# get cookies and view states
req <- GET(paste0(website, "/js"))
req_html <- read_html(rawToChar(req$content))
fields <- c("__VIEWSTATE", "__VIEWSTATEGENERATOR", "__VIEWSTATEENCRYPTED",
            "__PREVIOUSPAGE", "__EVENTVALIDATION")
viewheaders <- lapply(fields, function(x) {
  xml_attr(xml_find_first(req_html, paste0(".//input[@id='", x, "']")), "value")
})
names(viewheaders) <- fields
# post the data request with row index i, starting from 0; you can loop through each row using i
i <- 0
params <- c(viewheaders,
            list("__EVENTTARGET" = "ctl00$mainContentPlaceHolder$GridView_TH",
                 "__EVENTARGUMENT" = paste0("Select$", i),
                 "ctl00$mainContentPlaceHolder$DropDownList_classes" = "TOUT",
                 "ctl00$mainContentPlaceHolder$TextBox_Bateau" = "",
                 "ctl00$mainContentPlaceHolder$DropDownList_GR" = "TOUT",
                 "hiddenInputToUpdateATBuffer_CommonToolkitScripts" = 1))
resp <- POST(website, body = params, encode = "form",
             set_cookies(structure(cookies(req)$value, names = cookies(req)$name)))
if(resp$status_code == 200) {
  writeLines(rawToChar(resp$content), "ffvoile.html")
  shell("ffvoile.html")
}
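As the comment above notes, you can loop i over the grid rows. Here is a minimal, untested sketch of that loop which parses each response with rvest instead of writing it to disk; the ".Valeur" selector is only a placeholder carried over from the question's HN_Detail.asp code, so inspect the returned HTML to find the right one.
library(rvest)
# Hedged sketch: re-post the form for each row index and scrape the boat details
get_boat <- function(i) {
  params[["__EVENTARGUMENT"]] <- paste0("Select$", i)
  resp <- POST(website, body = params, encode = "form",
               set_cookies(structure(cookies(req)$value, names = cookies(req)$name)))
  if (resp$status_code != 200) return(NULL)
  read_html(rawToChar(resp$content)) %>%
    html_nodes(".Valeur") %>%   # placeholder selector; adjust after inspecting the page
    html_text()
}
boats <- lapply(0:9, get_boat)  # first ten rows as a test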

Extract table from web page by page

I have written code for scraping a table from a webpage. This code extracts the table from page one (in the url, /page=0):
url <- "https://ss0.corp.com/auth/page=0"
login <- "john.johnson" (fake)
password <- "67HJL54GR" (fake)
res <- GET(url, authenticate(login, password))
content <- content(res, "text")
table <- fromJSON(content) %>%
as.data.farme()
I want to write code that extracts rows from the table page by page and then binds them together. I do that because the table is too large and I can't extract everything at once (it would break the system). I don't know how many pages there can be, and it changes, so the loop must stop once the last page is collected. How could I do that?
I cannot test to guarantee this will work because the question is not reproducible, but you mainly need three steps:
Set up the url and credentials
url <- "http://someurl/auth/page="
login <- ""
password <- ""
Iterate over all (I'm assuming there are N) pages and store the result in a list. Note that we modify the url properly for each page.
tables <- lapply(1:N, function(page) {
  # Create the proper url and make the request
  this_url <- paste0(url, page)
  res <- GET(this_url, authenticate(login, password))
  # Extract the content just like you would in a single page
  content <- content(res, "text")
  table <- fromJSON(content) %>%
    as.data.frame()
  return(table)
})
Aggregate all the tables in the list into a single complete table using rbind
complete <- do.call(rbind, tables)
I hope this at least gives you a direction.
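Since the question notes that the number of pages is unknown in advance, here is a hedged variation of step 2 that keeps requesting until a page comes back empty. It assumes the service returns an empty result once you are past the last page, which you would need to confirm against your API.
library(httr)
library(jsonlite)

tables <- list()
page <- 0          # the question's first page was /auth/page=0
repeat {
  res <- GET(paste0(url, page), authenticate(login, password))
  table <- as.data.frame(fromJSON(content(res, "text")))
  if (nrow(table) == 0) break          # no rows returned: past the last page
  tables[[length(tables) + 1]] <- table
  page <- page + 1
}
complete <- do.call(rbind, tables)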

How do I find html_node on search form?

I have a list of names (first name, last name, and date-of-birth) that I need to search the Fulton County Georgia (USA) Jail website to determine if a person is in or released from jail.
The website is http://justice.fultoncountyga.gov/PAJailManager/JailingSearch.aspx?ID=400
The site requires you enter a last name and first name, then it gives you a list of results.
I have found some Stack Overflow posts that have given me some direction, but I'm still struggling to figure this out. I'm using this post as an example to follow. I am using SelectorGadget to help figure out the CSS tags.
Here is the code I have so far. Right now I can't figure out what html_node to use.
library(rvest)
# Specify URL
fc.url <- "http://justice.fultoncountyga.gov/PAJailManager/JailingSearch.aspx?ID=400"
# start session
jail <- html_session(fc.url)
# Grab initial form
form.unfilled <- jail %>% html_node("form")
form.unfilled
The result I get from form.unfilled is {xml_missing} <NA> which I know isn't right.
I think if I can figure out the html_node value, I can proceed to using set_values and submit_form.
Thanks.
It appears that on the initial call the website redirects to "http://justice.fultoncountyga.gov/PAJailManager/default.aspx". Once the session is started, you should be able to jump to the search page:
library(rvest)
# Specify URL
fc.url <- "http://justice.fultoncountyga.gov/PAJailManager/JailingSearch.aspx?ID=400"
# start session
jail <- html_session("http://justice.fultoncountyga.gov/PAJailManager/default.aspx")
#jump to search page
jail2 <- jail %>% jump_to("http://justice.fultoncountyga.gov/PAJailManager/JailingSearch.aspx?ID=400")
#list the form's fields
html_form(jail2)[[1]]
# Grab initial form
form.unfilled <- jail2 %>% html_node("form")
Note: Verify that your actions are within the terms of service for the website. Many sites do have policy against scraping.
The website relies heavily on JavaScript to render itself. When opening the link you provided in a fresh browser instance, you get redirected to http://justice.fultoncountyga.gov/PAJailManager/default.aspx, where you have to click the "Jail Records" link. This executes a bit of JavaScript to send you to the page with the form.
rvest is unable to execute arbitrary JavaScript. You might have to look at RSelenium. Selenium basically remote-controls a browser (for example Firefox or Chrome), which executes the JavaScript as intended.
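For completeness, a rough RSelenium sketch (untested against this site): drive a real browser so the JavaScript runs, then hand the rendered page back to rvest. The element names ("LastName", "FirstName", "SearchSubmit") are taken from the form fields mentioned later in this thread and should be confirmed in your browser's developer tools.
library(RSelenium)
library(rvest)

drv <- rsDriver(browser = "firefox", verbose = FALSE)
rd <- drv$client
rd$navigate("http://justice.fultoncountyga.gov/PAJailManager/JailingSearch.aspx?ID=400")

# Fill and submit the search form (element names are assumptions)
rd$findElement(using = "name", value = "LastName")$sendKeysToElement(list("DOE"))
rd$findElement(using = "name", value = "FirstName")$sendKeysToElement(list("JOHN"))
rd$findElement(using = "name", value = "SearchSubmit")$clickElement()

# Parse the rendered results with rvest
results <- read_html(rd$getPageSource()[[1]]) %>% html_nodes("table")

rd$close()
drv$server$stop()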
Thanks to Dave2e.
Here is the code that works. This question is answered (but I'll post another one because I'm not getting a table of data as a result).
Note: I cannot find any Terms of Service on this site that I'm querying
library(rvest)
# start session
jail <- html_session("http://justice.fultoncountyga.gov/PAJailManager/default.aspx")
#jump to search page
jail2 <- jail %>% jump_to("http://justice.fultoncountyga.gov/PAJailManager/JailingSearch.aspx?ID=400")
#list the form's fields
html_form(jail2)[[1]]
# Grab initial form
form.unfilled <- jail2 %>% html_node("form") %>% html_form()
form.unfilled
#name values
lname <- "DOE"
fname <- "JOHN"
# Fill the form with name values
form.filled <- form.unfilled %>%
  set_values("LastName" = lname,
             "FirstName" = fname)
#Submit form
r <- submit_form(jail2, form.filled,
                 submit = "SearchSubmit")
#grab tables from submitted form
table <- r %>% html_nodes("table")
#grab a table with some data
table[[5]] %>% html_table()
# resulting text in this table:
# " An error occurred while processing your request.Please contact your system administrator."

Using R to mimic “clicking” a download file button on a webpage

There are 2 parts to my question, as I explored 2 methods in this exercise; however, I succeeded in neither. It would be greatly appreciated if someone could help me out.
[PART 1:]
I am attempting to scrape data from a webpage on the Singapore Stock Exchange, https://www2.sgx.com/derivatives/negotiated-large-trade, containing data stored in a table. I have some basic knowledge of scraping data using rvest. However, using the Inspector in Chrome, the HTML hierarchy is much more complex than I expected. I'm able to see that the data I want is hidden under <div class="table-container">, and here's what I've tried:
library(rvest)
library(httr)
library(XML)
SGXurl <- "https://www2.sgx.com/derivatives/negotiated-large-trade"
SGXdata <- read_html(SGXurl)
html_nodes(SGXdata,".table-container")
However, nothing is picked up by the code, and I doubt that I'm using it correctly.
[PART 2:]
I realized that there's a small "download" button on the page which downloads exactly the data file I want in .csv format. So I was thinking of writing some code to mimic the download button, and I found this question: Using R to "click" a download file button on a webpage. However, I'm unable to get it to work even with some modifications to that code.
There are a few filters on the webpage; mostly I am interested in downloading data for a particular business day while leaving the other filters blank, so I've tried writing the following function:
library(httr)
library(rvest)
library(purrr)
library(dplyr)
crawlSGXdata <- function(date){
  POST("https://www2.sgx.com/derivatives/negotiated-large-trade",
       body = NULL,
       encode = "form",
       write_disk("SGXdata.csv")) -> resfile
  res <- read.csv("SGXdata.csv")
  return(res)
}
I intended to pass the function input "date" into the "body" argument; however, I was unable to figure out how to do that, so I started off with "body = NULL", assuming it doesn't do any filtering. However, the result is still unsatisfactory. The downloaded file is basically empty apart from the following error:
Request Rejected
The requested URL was rejected. Please consult with your administrator.
Your support ID is: 16783946804070790400
The content is loaded dynamically from an API call returning JSON. You can find this in the network tab via dev tools.
The following returns that content. I find the total number of pages of results and loop, combining the data frame returned from each call into one final data frame containing all results.
library(jsonlite)
url <- 'https://api.sgx.com/negotiatedlargetrades/v1.0?order=asc&orderby=contractcode&category=futures&businessdatestart=20190708&businessdateend=20190708&pagestart=0&pageSize=250'
r <- jsonlite::fromJSON(url)
num_pages <- r$meta$totalPages
df <- r$data
url2 <- 'https://api.sgx.com/negotiatedlargetrades/v1.0?order=asc&orderby=contractcode&category=futures&businessdatestart=20190708&businessdateend=20190708&pagestart=placeholder&pageSize=250'
if(num_pages > 1){
  # pagestart is 0-indexed, so the last page should be totalPages - 1
  for(i in seq(1, num_pages - 1)){
    newUrl <- gsub("placeholder", i, url2)
    newdf <- jsonlite::fromJSON(newUrl)$data
    df <- rbind(df, newdf)
  }
}
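To tie this back to the crawlSGXdata() function the question asked for, here is a hedged sketch that substitutes the business date (YYYYMMDD) into the same API URL and pages through the results. Beyond the URL pattern shown above, it is untested.
library(jsonlite)

crawlSGXdata <- function(date) {
  base <- paste0("https://api.sgx.com/negotiatedlargetrades/v1.0?",
                 "order=asc&orderby=contractcode&category=futures",
                 "&businessdatestart=", date, "&businessdateend=", date,
                 "&pagestart=%d&pageSize=250")
  first <- fromJSON(sprintf(base, 0))
  df <- first$data
  num_pages <- first$meta$totalPages
  if (num_pages > 1) {
    for (i in seq(1, num_pages - 1)) {      # pages are 0-indexed
      df <- rbind(df, fromJSON(sprintf(base, i))$data)
    }
  }
  df
}

res <- crawlSGXdata("20190708")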

Parse CDATA with R

I'm scraping and analyzing data from a car auction website. My goal is to develop date-time and sentiment analysis skills, and I like old cars. The website is Bring A Trailer-- they do not offer API access (I asked), but robots.txt is OK.
SO user '42' pointed out that this is not permitted by BAT's terms, so I have removed their base url. I will likely remove the question. After thinking about it, I can do what I want by saving a couple of webpages from my browser and analyzing that data. I don't need ALL the auctions; I just followed a tutorial that did, and here I am reading the TOS instead of doing what I wanted in the first place...
Some of the data is easily accessed, but the best parts are hard, and I'm stuck with that. I'm really looking for advice on my approach.
My first steps work: I can find and locally cache the webpages:
library(tidyverse)
library(rvest)
data_dir <- "bat_data-html/"
# Step 1: Create list of links to listings ----------------------------
base_url <- "https://"
pages <- read_html(file.path(base_url,"/auctions/")) %>%
html_nodes(".auctions-item-title a") %>%
html_attr("href") %>%
file.path
pages <- head(pages, 3) # use a subset for testing code
# Step 2 : Save auction pages locally ---------------------------------
dir.create(data_dir, showWarnings = FALSE)
p <- progress_estimated(length(pages))
# Download each auction page
walk(pages, function(url){
  download.file(url, destfile = file.path(data_dir, basename(url)), quiet = TRUE)
  p$tick()$print()
})
I can also process metadata about the auction from these cached pages, identifying the css selectors with SelectorGadget and specifying them to rvest:
# Step 3: Process each auction info into df ----------------------------
files <- dir(data_dir, pattern = "*", full.names = TRUE)
# Function: get_auction_details, to be applied to each auction page
get_auction_details <- function(file) {
  pagename <- basename(file) # the filename of the page (trailing index for multiples)
  page <- read_html(file)    # read the html into R (consider options = "NOCDATA")
  # Grab the title of the auction stored in the ".listing-post-title" tag on the page
  title <- page %>% html_nodes(".listing-post-title") %>% html_text()
  # Grab the "BAT essentials" of the auction stored in the ".listing-essentials-item" tag on the page
  essence <- page %>% html_nodes(".listing-essentials-item") %>% html_text()
  # Assemble into a data frame
  info_tbl0 <- as_tibble(essence)
  info_tbl <- add_row(info_tbl0, value = title, .before = 1)
  names(info_tbl)[1] <- pagename
  return(info_tbl)
}
# Apply the get_auction_details function to each element of files
bat0 <- map_df(files, get_auction_details) # run function
bat <- gather(bat0) %>% subset(value != "NA") # serialize results
# Save as csv
write_csv(bat, path = "data-csv/bat04.csv") # this table contains the expected metadata:
key,value
1931-ford-model-a-12,Modified 1931 Ford Model A Pickup
1931-ford-model-a-12,Lot #8576
1931-ford-model-a-12,Seller: TargaEng
But the auction data (bids, comments) is inside of a CDATA section:
<script type='text/javascript'>
/* <![CDATA[ */
var BAT_VMS = { ...bids, comments, results
/* ]]> */
</script>
I've tried selecting elements within this section using the path that I find with SelectorGadget, but they are not found -- this gives an empty list:
tmp <- page %>% html_nodes(".comments-list") %>% html_text()
Looking at the text within this CDATA section, I see some xml tags but it is not structured in the cached file like it is when I inspect the auction section of the live webpage.
To extract this information, should I try to parse the information "as-is" from within this CDATA section, or can I transform it so that it can be parsed like XML? Or am I barking up the wrong tree?
I appreciate any advice!
It's buried in the xml2 documentation, but you can use the NOCDATA parsing option so that CDATA sections are merged into the document as text and become accessible.
# Instead of rvest::read_html(file)
page <- xml2::read_xml(file, options = "NOCDATA")
After reading the feed in this way, you'll be able to access the comments list the way you wanted.
tmp <- page %>% html_nodes(".comments-list") %>% html_text()
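Putting that together with the cached files from the question, an untested sketch (files and the selector come from the question's own code; if strict XML parsing chokes on the saved HTML, you may need to relax the parser options):
library(xml2)
library(rvest)

page <- read_xml(files[1], options = "NOCDATA")   # re-read one cached auction page
tmp <- page %>% html_nodes(".comments-list") %>% html_text()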

Scrape contents of dynamic pop-up window using R

I'm stuck on this one after much searching....
I started with scraping the contents of a table from:
http://www.skatepress.com/skates-top-10000/artworks/
Which is easy:
library(XML)

data <- data.frame()
for (i in 1:100){
  print(paste("page", i, "of 100"))
  url <- paste("http://www.skatepress.com/skates-top-10000/artworks/", i, "/", sep = "")
  temp <- readHTMLTable(url, which = 1, stringsAsFactors = FALSE, encoding = "UTF-8")
  data <- rbind(data, temp)
} # end of scraping loop
However, I need to additionally scrape the detail that is contained in a pop-up box when you click on each name (and on the artwork title) in the list on the site.
I can't for the life of me figure out how to pass the breadcrumb (or artist-id or painting-id) through in order to make this happen. Since straight up using rvest to access the contents of the nodes doesn't work, I've tried the following:
I tried passing the painting id through in the url like this:
url <- ("http://www.skatepress.com/skates-top-10000/artworks/?painting_id=576")
site <- html(url)
But it still gives an empty result when scraping:
node1 <- "bread-crumb > ul > li.activebc"
site %>% html_nodes(node1) %>% html_text(trim = TRUE)
character(0)
I'm (clearly) not a scraping expert so any and all assistance would be greatly appreciated! I need a way to capture this additional information for each of the 10,000 items on the list...hence why I'm not interested in doing this manually!
Hoping this is an easy one and I'm just overlooking something simple.
This will be a more efficient base scraper and you can get progress bars for free with the pbapply package:
library(xml2)
library(httr)
library(rvest)
library(dplyr)
library(pbapply)
library(jsonlite)
base_url <- "http://www.skatepress.com/skates-top-10000/artworks/%d/"
n <- 100
bind_rows(pblapply(1:n, function(i) {
  mutate(html_table(html_nodes(read_html(sprintf(base_url, i)), "table"))[[1]],
         `Sale Date` = as.Date(`Sale Date`, format = "%m.%d.%Y"),
         `Premium Price USD` = as.numeric(gsub(",", "", `Premium Price USD`)))
})) -> skatepress
I added trivial date & numeric conversions.
I believe your main issue is that the site requires a login to get the additional data. You should give that (i.e. logging in) a shot using httr and grab the wordpress_logged_inXXXXXXX… cookie from that endeavour. I just grabbed it from inspecting the session with Developer Tools in Chrome and that will also work for you (but it's worth the time to learn how to do it via httr).
You'll need to scrape two additional <a … tags from each table row. The one for "artist" looks like:
Pablo Picasso
You can scrape the contents with:
POST("http://www.skatepress.com/wp-content/themes/skatepress/scripts/query_artist.php",
set_cookies(wordpress_logged_in_XXX="userid%XXXXXreallylongvalueXXXXX…"),
encode="form",
body=list(id="pab_pica_1881"),
verbose()) -> artist_response
fromJSON(content(artist_response, as="text"))
(The return value is too large to post here)
The one for "artwork" looks like:
Les femmes d′Alger (Version ′O′)
and you can get that in similar fashion:
POST("http://www.skatepress.com/wp-content/themes/skatepress/scripts/query_artwork.php",
set_cookies(wordpress_logged_in_XXX="userid%XXXXXreallylongvalueXXXXX…"),
encode="form",
body=list(id=576),
verbose()) -> artwork_response
fromJSON(content(artwork_response, as="text"))
That's not huge but I won't clutter the response with it.
NOTE that you can also use rvest's html_session to do the login (which will get you cookies for free) and then continue to use that session in the scraping (vs read_html) which will mean you don't have to do the httr GET/PUT.
You'll have to figure out how you want to incorporate that data into the data frame or associate it with it via various id's in the data frame (or some other strategy).
You can see it call those two PHP scripts via Developer Tools, which also shows the data it passes in. I'm also really surprised that the site doesn't have any anti-scraping clauses in its ToS, but it doesn't.
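As a hedged sketch of pulling those id's out of each table row so the POST results can be joined back onto the scraped table: the "a.artist"/"a.artwork" selectors and the "data-id" attribute are assumptions, so confirm the real markup in Developer Tools first.
library(rvest)

pg <- read_html(sprintf(base_url, 1))
rows <- html_nodes(pg, "table tr")

ids <- data.frame(
  artist_id  = html_attr(html_node(rows, "a.artist"), "data-id"),   # hypothetical selector/attribute
  artwork_id = html_attr(html_node(rows, "a.artwork"), "data-id"),  # hypothetical selector/attribute
  stringsAsFactors = FALSE
)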
