The page at the URL in the webshot command below shows ovals that indicate air quality at particular locations:
webshot("https://www.purpleair.com/map?&zoom=12&lat=39.09864026298141&lng=-108.56749455168722&clustersize=27&orderby=L&latr=0.22700642752714373&lngr=0.4785919189453125", "paMap.png")
The PNG that webshot produces doesn't include these ovals. I suspect they are drawn with JavaScript and webshot is not picking them up, but I don't know how to tell it to do so, or even whether it is possible.
Although this issue is not directly related to webshot versions, you should consider trying webshot2 (https://github.com/rstudio/webshot2) instead of webshot. I have prepared a blog post covering webshot2 in detail; see also my detailed answer about the issues with webshot compared to webshot2.
I replicated your scenario with webshot2 and the delay parameter, and the issue is resolved, as the screenshot below shows. The main issue is the delay: the URL needs a longer delay for all of its assets to render.
The code
library(webshot2)
temp_url = "https://www.purpleair.com/map?&zoom=12&lat=39.09864026298141&lng=-108.56749455168722&clustersize=27&orderby=L&latr=0.22700642752714373&lngr=0.4785919189453125"
webshot(url = temp_url, file = "paMap.png", delay = 4)
The output file
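If the ovals still do not appear with a four-second delay, it may simply need longer, and a larger viewport sometimes helps too. vwidth and vheight set the browser viewport in both webshot and webshot2; the values below are only examples:
webshot(url = temp_url, file = "paMap.png", delay = 10, vwidth = 1600, vheight = 1200)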
library(RSelenium)
remDr <- remoteDriver(remoteServerAddr = "192.168.99.100", port = 4445L)
remDr$open()
remDr$navigate("https://www.purpleair.com/map?&zoom=12&lat=39.09864026298141&lng=-108.56749455168722&clustersize=27&orderby=L&latr=0.22700642752714373&lngr=0.4785919189453125")
remDr$screenshot(file = "paMag.png")
The result:
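If the RSelenium screenshot also comes back without the overlays, the same timing issue likely applies: pause before taking the screenshot so the JavaScript map layers have time to draw. A minimal, hedged tweak to the code above (the 10-second pause is just a guess):
remDr$navigate("https://www.purpleair.com/map?&zoom=12&lat=39.09864026298141&lng=-108.56749455168722&clustersize=27&orderby=L&latr=0.22700642752714373&lngr=0.4785919189453125")
Sys.sleep(10)  # give the JavaScript map layers time to render
remDr$screenshot(file = "paMag.png")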
I'm new to web scraping. I can do the very basic stuff of scraping pages with R using URLs and CSS selector tools. Now I have run into problems.
For hobby purposes I would like to be able to scrape the following URL:
https://matchpadel.halbooking.dk/newlook/proc_baner.asp (a time slot booking system for sports)
However, the URL does not change when I navigate to different dates or addresses ('Område').
I have read a couple of similar problems suggesting to inspect the webpage, look under ’Network’ and then ‘XHR’ or ‘JS’ to find the data source of the table and get information from there. I am able to do this, but to be honest, I have no idea what to do from there.
I would like to retrieve data on which time slots are available across dates and addresses (the 'Område' drop-down on the webpage).
If anyone is willing to help me and my understanding, it would be greatly appreciated.
Have a nice day!
The website you have linked appears to be rendered dynamically with JavaScript. You need to extract the information you want with the RSelenium library, which drives a real browser; from there you select the dropdown and pull the data.
The sample code below fires up Firefox and navigates to your website. From there you can write code to select the different 'Område' dropdown options, grab the rendered page with remdr$getPageSource(), and then use rvest functions to extract the data; a hedged sketch of that step follows the code below.
# load libraries
library(RSelenium)
# open browser
selCommand <- wdman::selenium(jvmargs = c("-Dwebdriver.chrome.verboseLogging=true"), retcommand = TRUE)
Sys.sleep(2)
shell(selCommand, wait = FALSE, minimized = TRUE)
Sys.sleep(2)
remdr <- remoteDriver(port = 4567L, browserName = "firefox")
Sys.sleep(10)
remdr$open()
remdr$navigate(url = 'https://matchpadel.halbooking.dk/newlook/proc_baner.asp')
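Once the page has loaded, you can pull the rendered HTML and hand it to rvest. This is only a sketch; the "table" selector is a placeholder, so inspect the page to find the element that actually holds the booking grid:
library(rvest)
Sys.sleep(5)                              # give the JavaScript time to render
page_src <- remdr$getPageSource()[[1]]
page <- read_html(page_src)
# "table" is a placeholder selector; adjust it after inspecting the page
slots <- html_table(html_nodes(page, "table"))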
I am trying to scrape data from the Peruvian Electronic System for Government Procurement and Contracting (SEACE) using RSelenium, and I have succeeded until trying to capture the URL of the captcha image. The problem is that the link for the captcha has the extension "dynamiccontent.properties.xhtml" (see the screenshot below), rather than a "JPEG", "JPG" or "PNG" extension.
I would like to get the URL from the captcha image in one of these extensions (JPEG, JPG or PNG) using R, any suggestions? Thanks!
You CAN get the captcha image using RSelenium, but you need some image processing along with it. Since the captcha is generated dynamically, you have to take a screenshot of the page and then crop the image so that only the captcha is left (play with the geometry argument of the cropping function to get it just right).
While doing so, make the window size large so that the resolution is good.
Cropping the image will take some trial and error. You can use the imager or magick packages for the cropping bit.
library(RSelenium)
library(magick)
library(dplyr)

url <- "https://prodapp2.seace.gob.pe/seacebus-uiwd-pub/buscadorPublico/buscadorPublico.xhtml"

#### Selenium server
## For Firefox
rd <- rsDriver(browser = "firefox", port = 4581L)
remDr <- rd$client
remDr$open()

## Open url
remDr$navigate(url)
remDr$setWindowSize(2000, 1600)   # large window so the screenshot has good resolution
remDr$screenshot(display = FALSE, file = "captcha.png")

## Crop the full-page screenshot down to the captcha
## (geometry is "width x height + x-offset + y-offset"; adjust by trial and error)
final_cap <- image_read("captcha.png") %>%
  image_crop("200x100+500+500")
plot(final_cap)
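If you also need the cropped captcha on disk as a PNG (for example, to feed it to an OCR tool), magick can write it back out; the filename here is only illustrative:
# write the cropped captcha to disk as a PNG (filename is an example)
image_write(final_cap, path = "captcha_cropped.png", format = "png")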
Please do note that I don't support any illegal captcha breaking - they were designed to keep robots out!
The BCOGC keeps a database of applications for drilling wells in northeast British Columbia. By default, some filters are active to only highlight approved applications within the last month, even though the application database holds 30K+ records:
When the filter is deactivated:
To download the entire data set manually, you remove or deactivate any filters and then click Actions > Download > CSV. I want to download the entire data set (30K+ records) automatically using R.
When I use
library(tidyverse)
df <- read_csv(
file = 'https://reports.bcogc.ca/ogc/f?p=200:21::CSV::::'
)
it only downloads whatever the default query specifies, so around 150 records, not 30K+.
How can I use R to download the entire data set automatically? Is this a task for httr or RSelenium?
OK, I'm going to go with Selenium then, since it doesn't necessarily require Docker (though the example I'm using is with Docker :-). Pretty sure I could get Splash/splashr to do this as well, but it involves a file download and I think there are issues with that and the Splash back-end. As the splashr author, I also avoid having to deal with GitHub issues if I use Selenium for this example ;-)
Anyway, you should install RSelenium. I can't really provide support for that but it's well documented and the rOpenSci folks are super helpful. I'd highly suggest getting Docker to run on your system or getting your department to setup a Selenium server you all can use.
There are a couple gotchas for this use-case:
Some element names we need to instrument are dynamically generated so we have to work around that
This involves downloading a CSV file so we need to map a filesystem path in Docker so it downloads properly
This is a super slow site so you need to figure out wait times after each interaction (I'm not going to do that since you may be on a slower or faster network and network speed does play a part here, tho not that much)
I'd suggest working through the vignettes for RSelenium before trying the below to get a feel for how it works. You're essentially coding up human page interactions.
You will need to start Docker with a mapped directory. See "download file with RSelenium & docker toolbox" for all the info, but here's how I did it on my macOS box:
docker run -d -v /Users/hrbrmstr/Downloads://home/seluser/Downloads -p 4445:4444 selenium/standalone-firefox:2.53.1
That makes Selenium accessible on port 4445, uses Firefox (b/c Chrome is evil) and maps my local downloads directory to the Firefox default dir for the selenium user in the Docker container. That means well_authorizations_issued.csv is going to go there (eventually).
Now, we need to crank up R and connect it to this Selenium instance. We need to create a custom Firefox profile since we're saving stuff to disk and we don't want the browser to prompt us for anything:
library(RSelenium)
makeFirefoxProfile(
list(
browser.download.dir = "/home/seluser/Downloads",
browser.download.folderList = 2L,
browser.download.manager.showWhenStarting = FALSE,
browser.helperApps.neverAsk.saveToDisk = "text/csv"
)
) -> ffox_prof
remoteDriver(
browserName = "firefox", port = 4445L,
extraCapabilities = ffox_prof
) -> remDr
invisible(remDr$open())
remDr$navigate("https://reports.bcogc.ca/ogc/f?p=AMS_REPORTS:WA_ISSUED")
# Sys.sleep(###)
magick::image_read(openssl::base64_decode(remDr$screenshot()[[1]]))
You will need to uncomment the Sys.sleep() calls and experiment with various wait-time values between calls. Some will be short (1-2s); others will be larger (20s, 30s, or higher).
I'm not displaying the output of the screenshots here, but they are one way to figure out timings (i.e. keep generating screenshots after an element interaction until the gray spinner boxes are gone, and keep a mental note of how many seconds that took).
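If you find yourself repeating that sleep-then-screenshot pattern, a tiny helper keeps the timing experiments tidy. This is just a convenience sketch, not part of the original workflow (the name peek is made up):
# hypothetical helper: sleep for `secs`, then grab and display a screenshot
# so you can see whether the page has finished rendering
peek <- function(remDr, secs = 5) {
  Sys.sleep(secs)
  magick::image_read(openssl::base64_decode(remDr$screenshot()[[1]]))
}
# e.g. peek(remDr, 10) after each clickElement()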
Now, the one tricky bit noted above is figuring out where the checkbox is to turn off the filter, since it has a dynamic id. However, we aren't actually going to click on the checkbox b/c the daft fools who created that app have the click event trapped on the span element that surrounds it, so we have to find the li element that contains the checkbox label text, then go to the span element and click on it.
box <- remDr$findElement(using = "xpath", value = "//li[contains(., 'Approval Date is in the last')]/span")
box$clickElement()
# Sys.sleep(###)
magick::image_read(openssl::base64_decode(remDr$screenshot()[[1]]))
^^ definitely needs a delay (you likely saw it spin a while in-person when clicking yourself so you can count that and add in some buffer seconds).
Then, we click on the drop-down "menu" (it's really a button):
btn1 <- remDr$findElement(using = "css", "button#WA_ISSUED_actions_button")
btn1$clickElement()
# Sys.sleep(###)
magick::image_read(openssl::base64_decode(remDr$screenshot()[[1]]))
Then the download "menu" item (it's really a button):
btn2 <- remDr$findElement(using = "css", "button#WA_ISSUED_actions_menu_14i")
btn2$clickElement()
# Sys.sleep(###)
magick::image_read(openssl::base64_decode(remDr$screenshot()[[1]]))
^^ also really needs a delay, as the Download "dialog" takes a few seconds to come up (it did for me, at least).
Now, find the CSV box which is really an a tag:
lnk <- remDr$findElement(using = "css", "a#WA_ISSUED_download_CSV")
lnk$clickElement()
### WAIT A WHILE
magick::image_read(openssl::base64_decode(remDr$screenshot()[[1]]))
That last bit is something you'll have to experiment with. It takes a while to process the request and then transfer the ~9MB file. The call to remDr$screenshot() actually waits for the download to complete, so you can remove the display and decoding code, assign the output to a variable, and use that as an automatic "wait"-er.
I tried this 3x on 2 different macOS systems and it worked fine. YMMV.
I'm guessing you'll want to automate this eventually so you could have a system() call towards the top of the script that starts the Selenium Docker container, then does the rest of the bits and then issues another system() call to shut down the Docker container.
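For example, the wrapper calls could be as simple as the following; the container name and host path are illustrative, and the wait times from above still apply:
# start the Selenium container (same flags as the docker command above)
system(paste(
  "docker run -d --name selenium-ff",
  "-v /Users/hrbrmstr/Downloads:/home/seluser/Downloads",
  "-p 4445:4444 selenium/standalone-firefox:2.53.1"
))

# ... the scraping steps from above ...

# shut the container down once the download has finished
system("docker stop selenium-ff")
system("docker rm selenium-ff")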
Alternately, https://github.com/richfitz/stevedore is now on CRAN so it is a pure R interface to starting/stopping Docker containers (amongst many other things) so you could use that instead of system() calls.
If you can't use Docker, you need to install a "webdriver" executable for Firefox on your Windows box and also get the Selenium Java archive, ensure you have Java installed and then do the various manual incantations to get that going (which is beyond the scope of this answer).
Here's a shortened, contiguous version of the above:
library(RSelenium)
# start Selenium before doing this
makeFirefoxProfile(
list(
browser.download.dir = "/home/seluser/Downloads",
browser.download.folderList = 2L,
browser.download.manager.showWhenStarting = FALSE,
browser.helperApps.neverAsk.saveToDisk = "text/csv"
)
) -> ffox_prof
remoteDriver(
browserName = "firefox", port = 4445L,
extraCapabilities = ffox_prof
) -> remDr
invisible(remDr$open())
remDr$navigate("https://reports.bcogc.ca/ogc/f?p=AMS_REPORTS:WA_ISSUED")
# Sys.sleep(###)
box <- remDr$findElement(using = "xpath", value = "//li[contains(., 'Approval Date is in the last')]/span")
box$clickElement()
# Sys.sleep(###)
btn1 <- remDr$findElement(using = "css", "button#WA_ISSUED_actions_button")
btn1$clickElement()
# Sys.sleep(###)
btn2 <- remDr$findElement(using = "css", "button#WA_ISSUED_actions_menu_14i")
btn2$clickElement()
# Sys.sleep(###)
lnk <- remDr$findElement(using = "css", "a#WA_ISSUED_download_CSV")
lnk$clickElement()
### WAIT A WHILE
done <- remDr$screenshot()
# stop Selenium
This question is based on another that I saw closed, which piqued my curiosity because I learned something new about using Google Chrome's Inspect Element to create the HTML parsing path for XML::getNodeSet. While that question was closed (I think it may have been too broad), I'll ask a smaller, more focused question that may get at the root of the problem.
I tried to help the poster by writing code I typically use for scraping, but ran into a wall immediately, because the poster wanted elements from Google Chrome's Inspect Element. This is not the same as the HTML from htmlTreeParse, as demonstrated here:
library(XML)
url <- "http://collegecost.ed.gov/scorecard/UniversityProfile.aspx?org=s&id=198969"
doc <- htmlTreeParse(url, useInternalNodes = TRUE)
m <- capture.output(doc)
any(grepl("258.12", m))
## FALSE
But here in Google Chrome's Inspect Element we can see that this information is provided (in yellow):
How can we get the information from Google Chrome's Inspect Element into R? The poster could obviously copy and paste the code into a text editor and parse it that way, but they are looking to scrape, and that workflow does not scale. Once the poster can get this info into R they can then use typical HTML parsing techniques (XML and RCurl-fu).
You should be able to scrape the page using something like the following RSelenium code. You need to have Java installed and available on your path for the startServer() line to work (and thus for you to be able to do anything).
library("RSelenium")
checkForServer()
startServer()
remDr <- remoteDriver(
  remoteServerAddr = "localhost",
  port = 4444,
  browserName = "firefox"
)
url <- "http://collegecost.ed.gov/scorecard/UniversityProfile.aspx?org=s&id=198969"
remDr$open()
remDr$navigate(url)
source <- remDr$getPageSource()[[1]]
Check to make sure it worked according to your test:
> grepl("258.12", source)
[1] TRUE
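From there, the usual parsing techniques apply, since the rendered source is now plain HTML in a character string. For example (a sketch; the XPath is only a placeholder for whatever nodes the poster actually wants):
library(XML)
doc <- htmlParse(source, asText = TRUE)
# the parsed document now contains the JavaScript-rendered values (e.g. "258.12")
nodes <- getNodeSet(doc, "//td")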
I think this can be done but I do not know if the functionality exists. I have searched the internet and Stack Overflow high and low and cannot find anything. I'd like to save www.espn.com as an image to a certain folder on my computer at a certain time of day. Is this possible? Any help would be very much appreciated.
Selenium allows you to do this. See http://johndharrison.github.io/RSelenium/. DISCLAIMER: I am the author of the RSelenium package. The image can be exported as a base64-encoded PNG. As an example:
# RSelenium::startServer() # start a selenium server if required
require(RSelenium)
remDr <- remoteDriver()
remDr$open()
remDr$navigate("http://espn.go.com/")
# remDr$screenshot(display = TRUE) # to display image
tmp <- paste0(tempdir(), "/tmpScreenShot.png")
base64png <- remDr$screenshot()
# base64Decode() comes from the RCurl package (attached by older RSelenium versions);
# call library(RCurl) first if it is not already on the search path
writeBin(base64Decode(base64png, "raw"), tmp)
The png will be saved to the file given at tmp.
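To write the image into a specific folder with a timestamped name instead of a temp file, and to run this at a certain time of day, you would schedule the script with cron, Windows Task Scheduler, or a package such as taskscheduleR or cronR. A hedged sketch, where the folder path is just an example:
# build a dated filename in a folder of your choosing (path is illustrative)
out_dir  <- "~/screenshots"
out_file <- file.path(out_dir, paste0("espn_", format(Sys.time(), "%Y%m%d_%H%M"), ".png"))
writeBin(base64Decode(base64png, "raw"), out_file)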
A basic vignette on operation can be viewed at RSelenium basics and
RSelenium: Testing Shiny apps