I am trying to scrape data from the Peruvian Electronic System for Government Procurement and Contracting (SEACE) using RSelenium, and I have succeeded up to the point of capturing the URL of the captcha image. The problem I encounter is that the captcha link has the extension "dynamiccontent.properties.xhtml" (see the screenshot below), not a "JPEG", "JPG" or "PNG" extension.
I would like to get the URL of the captcha image in one of these extensions (JPEG, JPG or PNG) using R. Any suggestions? Thanks!
You CAN get the captcha image using RSelenium, but you need some image processing along with it. Since the captcha is generated dynamically, take a screenshot of the page, then crop the image so that only the captcha is left (play with the geometry argument of the cropping function to get it just right).
While doing so, make the window size large so that the resolution is good.
Cropping will take some trial and error. You can use the imager or magick package for the cropping bit.
library(RSelenium)
library(magick)
library(dplyr)

url <- "https://prodapp2.seace.gob.pe/seacebus-uiwd-pub/buscadorPublico/buscadorPublico.xhtml"

#### Selenium server
## For Firefox
rd <- rsDriver(browser = "firefox", port = 4581L)
remDr <- rd$client   # the client returned by rsDriver is already open

## Open the url and enlarge the window for better resolution
remDr$navigate(url)
remDr$setWindowSize(2000, 1600)
remDr$screenshot(display = FALSE, file = "captcha.png")

## Crop down to the captcha; note that magick geometry uses a lowercase "x",
## i.e. "widthxheight+x_offset+y_offset"
final_cap <- image_read("captcha.png") %>%
  image_crop("200x100+500+500")
plot(final_cap)
Please do note that I don't support any illegal captcha breaking - they were designed to keep robots out!
Related
I am trying to web-scrape information from this website: https://www.nea.gov.sg/weather/rain-areas and download the 240 km radar scans between 2022-07-31 01:00:00 (am) and 2022-07-31 03:00:00 (am) at five-minute intervals, inclusive of end points, then save the images to a zip file.
Edit: Is there a way to do it with just rvest, avoiding the use of for loops?
I've found out that the image address can be acquired by right-clicking the image and selecting "Copy image address". An example: https://www.nea.gov.sg/docs/default-source/rain-area-240km/dpsri_240km_2022091920000000dBR.dpsri.png
I've noted that the string of numbers represents the date and time, so the one I'd need would be 20220731xxxxxxx, where x is the time. However, how would I then use this to web-scrape?
Could someone provide some guidance? I can't even seem to find the radar scans for that day. Thank you.
You can consider the following code to save the screenshots of the webpage :
library(RSelenium)
url <- "https://www.nea.gov.sg/weather/rain-areas"
shell('docker run -d -p 4445:4444 selenium/standalone-firefox')
remDr <- remoteDriver(remoteServerAddr = "localhost", port = 4445L, browserName = "firefox")
remDr$open()
remDr$navigate(url)
web_Elem <- remDr$findElement("xpath", '//*[@id="rain-area-slider"]/div/button')
web_Elem$clickElement()

for (i in 1:10) {
  print(i)
  Sys.sleep(1)
  path_To_File <- paste0("C:/file", i, ".png")
  remDr$screenshot(display = FALSE, useViewer = TRUE, file = path_To_File)
}
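Given the timestamp pattern pointed out in the question, the image URLs can also be built directly and fetched without a browser at all. A minimal sketch, assuming the `dpsri_240km_<YYYYMMDDHHMM>0000dBR.dpsri.png` naming convention from the example address holds for that date:

```r
# Build the 25 five-minute timestamps between 01:00 and 03:00 on 2022-07-31
times <- seq(
  from = as.POSIXct("2022-07-31 01:00", tz = "Asia/Singapore"),
  to   = as.POSIXct("2022-07-31 03:00", tz = "Asia/Singapore"),
  by   = "5 min"
)

# Assemble the URLs following the pattern from the example image address
base <- "https://www.nea.gov.sg/docs/default-source/rain-area-240km/"
img_urls <- paste0(base, "dpsri_240km_", format(times, "%Y%m%d%H%M"),
                   "0000dBR.dpsri.png")

# Download and zip (uncomment to run; older scans may no longer be hosted,
# in which case download.file() will error)
# dir.create("radar", showWarnings = FALSE)
# dests <- file.path("radar", basename(img_urls))
# Map(download.file, img_urls, dests, mode = "wb")
# zip("radar_scans.zip", files = dests)
```

Because the URLs are fully determined by the timestamps, no clicking (and no explicit for loop) is needed; whether the server still retains scans from that date is another matter.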
Scraping the images from the website requires you to interact with the website (e.g. clicks), so we will use the RSelenium package for the task. You will also need to have Firefox installed on your system to be able to follow this solution.
1. Load packages
We will begin by loading our packages of interest:
# Load packages ----
pacman::p_load(
httr,
png,
purrr,
RSelenium,
rvest,
servr
)
2. Setup
Now we need to start the Selenium server with Firefox. The following code will start a Firefox instance; run it and wait for Firefox to launch:
# Start Selenium driver with firefox ----
rsd <- rsDriver(browser = "firefox", port = random_port())
Now that the firefox browser (aka the client) is up, we want to be able to manipulate it with our code. So, let's create a variable (cl for client) that will represent it. We will use the variable to perform all the actions we need:
cl <- rsd$client
The first action we want to perform is to navigate to the website. Once you run the code, notice how Firefox goes to the website as a response to you running your R code:
# Navigate to the webpage ----
cl$navigate(url = "https://www.nea.gov.sg/weather/rain-areas")
Let's get scraping
Now we're going to begin the actual scraping! @EmmanuelHamel took the clever approach of simply clicking on the "play" button to launch the automatic "slideshow", then taking a screenshot of the webpage every second to capture the changes in the image. The approach I use is somewhat different.
In the code below, I identify the 13 steps of the slideshow (along the horizontal green bar) and I click on each "step" one after the other. After clicking on a step, I get the URL of the image, then I click on the other step... all the way to the 13th step.
Here I get the HTML element for each step:
# Get the selector for each of the 13 steps
rail_steps <- cl$findElements(using = "css", value = "div.vue-slider-mark")[1:13]
Then, I click on each element and get the image URL at each step. After you run this code, check how your code manipulates the webpage on the firefox instance, isn't that cool?
img_urls <- map_chr(rail_steps, function(step){
  cl$mouseMoveToLocation(webElement = step)
  cl$click()
  img_el <- cl$findElement(using = "css", value = "#rain_overlay")
  Sys.sleep(1)
  img_el$getElementAttribute(attrName = "src")[[1]]
})
Finally, I create an image folder img where I download and save the images:
# Create an image folder then download all images in it ----
dir.create("img")
walk(img_urls, function(img_url){
GET(url = img_url) |>
content() |>
writePNG(target = paste0("img/", basename(img_url)))
})
Important
The downloaded images do not contain the background map on the webpage... only the points! You can download the background map then lay the points on top of it (using an image processing software for example). Here is how to download the background map:
# Download the background map----
GET(url = "https://www.nea.gov.sg/assets/images/map/base-853.png") |>
content() |>
writePNG(target = "base_image.png")
If you want to combine the images programmatically, you may want to look into the magick package in R.
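As a sketch of that compositing step with magick, reusing the `img/` folder and `base_image.png` produced by the code above (so treat the paths as assumptions about your working directory):

```r
library(magick)
library(purrr)

# Read the base map downloaded earlier
base_map <- image_read("base_image.png")

# Lay each radar overlay on top of the base map and save the result
dir.create("img_combined", showWarnings = FALSE)
walk(list.files("img", full.names = TRUE), function(overlay_path){
  image_composite(base_map, image_read(overlay_path)) |>
    image_write(path = file.path("img_combined", basename(overlay_path)))
})
```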
I’m new to web scraping. I can do the very basic stuff of scraping pages using URLs and css selector tools with R. Now I have run into problems.
For hobby purposes I would like to be able to scrape the following URL:
https://matchpadel.halbooking.dk/newlook/proc_baner.asp (a time slot booking system for sports)
However, the URL does not change when I navigate to different dates or addresses (‘Område’).
I have read a couple of similar problems suggesting to inspect the webpage, look under ’Network’ and then ‘XHR’ or ‘JS’ to find the data source of the table and get information from there. I am able to do this, but to be honest, I have no idea what to do from there.
I would like to retrieve data on which time slots are available across dates and addresses (the ‘Område’ drop-down on the webpage).
If anyone is willing to help me and my understanding, it would be greatly appreciated.
Have a nice day!
The website you have linked is rendered dynamically with JavaScript, so you need the RSelenium library, which opens a real browser in which you can select the dropdown and extract your desired information.
Find sample code below that fires up Firefox and navigates to your website. From there you can write code to select the different ‘Område’ dropdown options, grab the rendered page with remdr$getPageSource(), and then extract the data with rvest functions.
# load libraries
library(RSelenium)
# open browser
selCommand <- wdman::selenium(jvmargs = c("-Dwebdriver.chrome.verboseLogging=true"), retcommand = TRUE)
Sys.sleep(2)
shell(selCommand, wait = FALSE, minimized = TRUE)
Sys.sleep(2)
remdr <- remoteDriver(port = 4567L, browserName = "firefox")
Sys.sleep(10)
remdr$open()
remdr$navigate(url = 'https://matchpadel.halbooking.dk/newlook/proc_baner.asp')
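To illustrate the "then rvest" step mentioned above, here is a sketch of the extraction run against a small inline HTML fragment, since the booking table's actual structure isn't shown here; with a live session you would instead parse `remdr$getPageSource()[[1]]`:

```r
library(rvest)

# Stand-in for the rendered page source; the real call would be
# page <- read_html(remdr$getPageSource()[[1]])
html <- '<table>
  <tr><th>Time</th><th>Court</th><th>Status</th></tr>
  <tr><td>08:00</td><td>Bane 1</td><td>Free</td></tr>
  <tr><td>08:00</td><td>Bane 2</td><td>Booked</td></tr>
</table>'

slots <- read_html(html) |>
  html_element("table") |>
  html_table()
print(slots)
```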
The page at the link in the webshot command below includes ovals that indicate air quality in particular locations:
webshot("https://www.purpleair.com/map?&zoom=12&lat=39.09864026298141&lng=-108.56749455168722&clustersize=27&orderby=L&latr=0.22700642752714373&lngr=0.4785919189453125", "paMap.png")
The png that webshot produces doesn't include these ovals. I suspect they are created with JavaScript and webshot is not picking them up, but I don't know how to tell it to do so, or even if that is possible.
Although this issue is not directly related to the webshot version, you should consider trying webshot2 (https://github.com/rstudio/webshot2) instead of webshot. I have prepared a blog post covering webshot2 in detail; see also my detailed answer about the issues with webshot compared to webshot2.
I have replicated your scenario with webshot2 and the delay parameter, and the issue is resolved, as the screenshot below shows. The main issue is the delay: the URL needs a longer delay for all of its assets to display.
The code
library(webshot2)
temp_url = "https://www.purpleair.com/map?&zoom=12&lat=39.09864026298141&lng=-108.56749455168722&clustersize=27&orderby=L&latr=0.22700642752714373&lngr=0.4785919189453125"
webshot(url = temp_url, file = "paMap.png", delay = 4)
The output file
library(RSelenium)
# Assumes a Selenium server is already running (e.g. in Docker) at this address
remDr <- remoteDriver(remoteServerAddr = "192.168.99.100", port = 4445L)
remDr$open()
remDr$navigate("https://www.purpleair.com/map?&zoom=12&lat=39.09864026298141&lng=-108.56749455168722&clustersize=27&orderby=L&latr=0.22700642752714373&lngr=0.4785919189453125")
remDr$screenshot(file = "paMag.png")
The result:
I'm trying to automate downloading company profile images from Crunchbase's OpenDataMap using R. I've tried download.file(), GET() (in the httr package) and getURLContent() (in RCurl), but they all return a 416 error. I know that I must be forgetting a parameter or user agent, but I can't figure out what.
Here's an example URL for testing:
http://www.crunchbase.com/organization/google-ventures/primary-image/raw
Thanks for any help that you can provide.
I think I came up with a fairly clever, albeit slow-ish, solution that works in R.
Essentially, I drive a headless browser that navigates from page to page, downloading the Crunchbase images I need. This gets me past the redirect and JavaScript that stop me from reaching the images via a simple cURL request.
This may work for other scraping projects.
library(RSelenium)

RSelenium::checkForServer()  # note: deprecated in recent RSelenium versions
startServer()
remDr <- remoteDriver$new()
remDr$open()

# profile_image_url is a list of image urls from crunchbase's open data map.
for (row in 1:length(profile_image_url)) {
  print(row)  # keep track of where I am
  # if already downloaded, don't do it again
  if (file.exists(paste0("profileimages/", row, ".png")) |
      file.exists(paste0("profileimages/", row, ".jpg")) |
      file.exists(paste0("profileimages/", row, ".gif"))) {
    next
  }
  # navigate to the page; the browser follows the redirect to the real image
  remDr$navigate(paste0(profile_image_url[row], "?w=500&h=500"))
  imageurl <- remDr$getCurrentUrl()[[1]]
  # get the file extension (to handle pngs, jpgs and gifs)
  file.ext <- gsub('[^\\]*\\.(\\w+)$', "\\1", imageurl)
  # download the image file from the resolved url
  download.file(imageurl, paste0("profileimages/", row, ".", file.ext),
                method = "curl")
  # wait ten seconds to avoid rate-limiting
  Sys.sleep(10)
}
remDr$close()
I think this can be done, but I do not know if the functionality exists. I have searched the internet and Stack Overflow high and low and cannot find anything. I'd like to save www.espn.com as an image to a certain folder on my computer at a certain time of day. Is this possible? Any help would be very much appreciated.
Selenium allows you to do this; see http://johndharrison.github.io/RSelenium/. DISCLAIMER: I am the author of the RSelenium package. The image can be exported as a base64-encoded png. As an example:
# RSelenium::startServer() # start a selenium server if required
require(RSelenium)
require(RCurl)  # for base64Decode()

remDr <- remoteDriver()
remDr$open()
remDr$navigate("http://espn.go.com/")
# remDr$screenshot(display = TRUE) # to display the image in the viewer
tmp <- paste0(tempdir(), "/tmpScreenShot.png")
base64png <- remDr$screenshot()
writeBin(base64Decode(base64png, "raw"), tmp)
The png will be saved to the file given at tmp.
Basic vignettes on operation can be viewed at RSelenium basics and RSelenium: Testing Shiny apps.
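The question also asked about running this at a certain time of day. Scheduling is outside RSelenium itself, and on a server a system scheduler such as cron (or Windows Task Scheduler) is the robust choice, but a minimal base-R sketch looks like this (the helper function name is mine):

```r
# Compute the next occurrence of a given clock time (today, or tomorrow if
# that time has already passed)
next_occurrence <- function(target = "18:00:00") {
  target_time <- as.POSIXct(paste(Sys.Date(), target))
  if (target_time <= Sys.time()) target_time <- target_time + 86400
  target_time
}

# Sleep until then, then take the screenshot as in the answer above:
# Sys.sleep(as.numeric(difftime(next_occurrence("18:00:00"), Sys.time(),
#                               units = "secs")))
# remDr$screenshot(file = "espn.png")
```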