Scrape webpage that does not change URL - r

I’m new to web scraping. I can do the very basic stuff of scraping pages using URLs and css selector tools with R. Now I have run into problems.
For hobby purposes I would like to be able to scrape the following URL:
 https://matchpadel.halbooking.dk/newlook/proc_baner.asp (a time slot booking system for sports)
However, the URL does not change when I navigate to different dates or addresses (‘Område’).
I have read a couple of similar questions that suggest inspecting the webpage and looking under ‘Network’ and then ‘XHR’ or ‘JS’ to find the data source of the table and get the information from there. I am able to do this, but to be honest, I have no idea what to do from there.
I would like to retrieve data on which time slots are available across dates and addresses (the ‘Område’ drop-down on the webpage).
If anyone is willing to help me and my understanding, it would be greatly appreciated.
Have a nice day!

The website you have linked appears to build its content dynamically with JavaScript. You need to extract your desired information using the RSelenium library, which opens a browser; you then choose an option in the dropdown and read the resulting data.
The sample code below fires up Firefox and opens your website. From there you can write code to select the different ‘Område’ dropdown options and get the resulting table using remdr$getPageSource(), and then use rvest functions to extract the data.
# load libraries
library(RSelenium)
# open browser
selCommand <- wdman::selenium(jvmargs = c("-Dwebdriver.chrome.verboseLogging=true"), retcommand = TRUE)
Sys.sleep(2)
shell(selCommand, wait = FALSE, minimized = TRUE)
Sys.sleep(2)
remdr <- remoteDriver(port = 4567L, browserName = "firefox")
Sys.sleep(10)
remdr$open()
remdr$navigate(url = 'https://matchpadel.halbooking.dk/newlook/proc_baner.asp')
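From the page source onward, the remaining steps could look roughly like the sketch below. It assumes the time slots are rendered as an HTML table; the helper names parse_tables and scrape_option and the CSS selector passed to scrape_option are hypothetical and need to be adapted after inspecting the real page.

```r
library(rvest)

# Parse every <table> in an HTML string into a list of data frames.
parse_tables <- function(page_source) {
  html_table(html_elements(read_html(page_source), "table"))
}

# Hypothetical helper: click one dropdown option, wait, then parse the page.
# `option_css` is an assumed CSS selector; find the real one in the inspector.
scrape_option <- function(remdr, option_css, wait = 2) {
  remdr$findElement(using = "css selector", value = option_css)$clickElement()
  Sys.sleep(wait)  # give the JavaScript time to redraw the table
  parse_tables(remdr$getPageSource()[[1]])
}

# Offline check on a toy page (not the real booking markup)
toy <- "<table><tr><th>Time</th><th>Court</th></tr>
        <tr><td>08:00</td><td>Free</td></tr></table>"
slots <- parse_tables(toy)[[1]]
```

Calling scrape_option once per ‘Område’ option (for example with lapply over a vector of selectors) would give one list of tables per address.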

Related

Webscraping images in r and saving them into a zip file

I am trying to webscrape information from this website: https://www.nea.gov.sg/weather/rain-areas and download the 240km radar scans between 2022-07-31 01:00:00 (am) and 2022-07-31 03:00:00 (am) at five-minute intervals, inclusive of end points. Save the images to a zip file.
Edit: Is there a way to do it with just rvest and avoiding the usage of for loops?
I've found out that the image address can be acquired by clicking on the image and selecting copy image address. An example: https://www.nea.gov.sg/docs/default-source/rain-area-240km/dpsri_240km_2022091920000000dBR.dpsri.png
I've noted that the string of numbers would represent the date and time. So the one I'd need would be 20220731xxxxxxx where x would be the time. However, how would I then use this to webscrape?
Could someone provide some guidance? I can't even seem to find the radar scans for that day. Thank you.
You can consider the following code to save the screenshots of the webpage :
library(RSelenium)
url <- "https://www.nea.gov.sg/weather/rain-areas"
shell('docker run -d -p 4445:4444 selenium/standalone-firefox')
remDr <- remoteDriver(remoteServerAddr = "localhost", port = 4445L, browserName = "firefox")
remDr$open()
remDr$navigate(url)
web_Elem <- remDr$findElement("xpath", '//*[@id="rain-area-slider"]/div/button')
web_Elem$clickElement()
for (i in 1:10) {
  print(i)
  Sys.sleep(1)
  path_To_File <- paste0("C:/file", i, ".png")
  remDr$screenshot(display = FALSE, useViewer = TRUE, file = path_To_File)
}
Scraping the images from the website requires you to interact with the website (e.g. clicks), so we will use the RSelenium package for the task. You will also need to have Firefox installed on your system to be able to follow this solution.
1. Load packages
We will begin by loading our packages of interest:
# Load packages ----
pacman::p_load(
  httr,
  png,
  purrr,
  RSelenium,
  rvest,
  servr
)
2. Setup
Now, we need to start the Selenium server with firefox. The following code will start a firefox instance. Run it and wait for firefox to launch:
# Start Selenium driver with firefox ----
rsd <- rsDriver(browser = "firefox", port = random_port())
Now that the firefox browser (aka the client) is up, we want to be able to manipulate it with our code. So, let's create a variable (cl for client) that will represent it. We will use the variable to perform all the actions we need:
cl <- rsd$client
The first action we want to perform is to navigate to the website. Once you run the code, notice how Firefox goes to the website as a response to you running your R code:
# Navigate to the webpage ----
cl$navigate(url = "https://www.nea.gov.sg/weather/rain-areas")
Let's get scraping
Now we're going to begin the actual scraping! @EmmanuelHamel took the clever approach of simply clicking on the "play" button in order to launch the automatic "slideshow". He then took a screenshot of the webpage every second in order to capture the changes in the image. The approach I use is somewhat different.
In the code below, I identify the 13 steps of the slideshow (along the horizontal green bar) and I click on each "step" one after the other. After clicking on a step, I get the URL of the image, then I click on the other step... all the way to the 13th step.
Here I get the HTML element for each step:
# Get the selector for each of the 13 steps
rail_steps <- cl$findElements(using = "css", value = "div.vue-slider-mark")[1:13]
Then, I click on each element and get the image URL at each step. After you run this code, check how your code manipulates the webpage on the firefox instance, isn't that cool?
img_urls <- map_chr(rail_steps, function(step) {
  # Move to the step on the slider and click it
  cl$mouseMoveToLocation(webElement = step)
  cl$click()
  img_el <- cl$findElement(using = "css", value = "#rain_overlay")
  Sys.sleep(1)
  # Return the image URL for this step
  img_el$getElementAttribute(attrName = "src")[[1]]
})
Finally, I create an image folder img where I download and save the images:
# Create an image folder then download all images in it ----
dir.create("img")
walk(img_urls, function(img_url) {
  GET(url = img_url) |>
    content() |>
    writePNG(target = paste0("img/", basename(img_url)))
})
Important
The downloaded images do not contain the background map on the webpage... only the points! You can download the background map then lay the points on top of it (using an image processing software for example). Here is how to download the background map:
# Download the background map----
GET(url = "https://www.nea.gov.sg/assets/images/map/base-853.png") |>
  content() |>
  writePNG(target = "base_image.png")
If you want to combine the images programmatically, you may want to look into the magick package in R.
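The timestamp pattern noted in the question can also be turned into a vector of candidate URLs without a for loop. This is only a sketch: it assumes the 16-digit stamp is YYYYMMDDHHMM followed by "0000", which is inferred from the single example URL, and the server may no longer keep scans for past dates. The function names radar_urls and download_and_zip are hypothetical.

```r
# Build candidate radar-image URLs from the timestamp pattern in the question.
# Assumption: the stamp is YYYYMMDDHHMM plus a fixed "0000" suffix.
radar_urls <- function(from, to, by = "5 min") {
  # UTC is used only as a fixed clock, so no daylight-saving shifts occur
  times  <- seq(as.POSIXct(from, tz = "UTC"), as.POSIXct(to, tz = "UTC"), by = by)
  stamps <- paste0(format(times, "%Y%m%d%H%M"), "0000")
  paste0(
    "https://www.nea.gov.sg/docs/default-source/rain-area-240km/dpsri_240km_",
    stamps, "dBR.dpsri.png"
  )
}

# Download whatever exists and zip it; missing images are skipped.
# utils::zip() needs a zip program available on the system.
download_and_zip <- function(urls, dir = "img", zipfile = "radar.zip") {
  dir.create(dir, showWarnings = FALSE)
  dest <- file.path(dir, basename(urls))
  ok <- mapply(function(u, d) {
    tryCatch(download.file(u, d, mode = "wb", quiet = TRUE) == 0,
             error = function(e) FALSE)
  }, urls, dest)
  utils::zip(zipfile, files = dest[ok])
  invisible(dest[ok])
}

urls <- radar_urls("2022-07-31 01:00:00", "2022-07-31 03:00:00")
```

Here urls holds one address per five-minute step, end points included; download_and_zip(urls) would then attempt the downloads.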

Unable to extract specific text located in a <span> object from LinkedIn using rvest and Selenium

First of all, I'm only a beginner in R, so my apologies if this sounds like a dumb question.
Basically, I want to scrape the experience section in LinkedIn and extract the name of the position. As an example, I picked the profile of Hadley Wickham. As you can see in this screenshot, the data I need ("Chief Scientist") is located in a Span object, with the Span object itself located within several Div objects.
As a first attempt, I figured that I'd just try to extract the text directly from the Span objects using this code. However, and unsurprisingly, it returned all the text that was in the other Span objects too.
role <- signals %>%
  html_nodes("span") %>%
  html_nodes(".visually-hidden") %>%
  html_text()
I can isolate the text I need by subsetting the object with "[ ]", but I'm going to apply this code to several LinkedIn profiles and the position of the title will change depending on the page. So I thought, "OK, maybe I need to tell R that I want to target the Span object that is located in the experience section and not the whole page", and that I'd just need to mention "#experience" in the code so that it only picks the Span object I need. But it only returned an empty object.
role <- signals %>%
  html_nodes("#experience") %>%
  html_nodes("span") %>%
  html_nodes(".visually-hidden") %>%
  html_text()
I'm pretty sure I'm missing some steps here but I can't figure out what. Maybe I need to specify each object that sits between "#experience" and "span" in order for this code to work, but I feel there must be a better and easier way. Hope this makes sense. I spent a lot of time trying to debug this and I'm not skilled enough in scraping to find a solution on my own.
As is, this requires RSelenium, since the data is rendered after the page loads rather than being present in the static HTML. Refer here on how to launch a browser (Chrome, Firefox, IE, etc.) with the driver object named remdr.
The below snippet opens Firefox but there are other ways to launch any other browser
selCommand <- wdman::selenium(jvmargs = c("-Dwebdriver.chrome.verboseLogging=true"), retcommand = TRUE)
shell(selCommand, wait = FALSE, minimized = TRUE)
remdr <- remoteDriver(port = 4567L, browserName = "firefox")
Sys.sleep(round(runif(1, 10, 15)))  # random 10 to 15 second pause before opening the browser
remdr$open()
You might have to login to LinkedIn since it won't allow viewing profiles without signing up. You will need to use the RSelenium clickElement and sendKeys functions to operate the webpage.
remdr$navigate(url = 'https://www.linkedin.com/') # navigate to the link
username <- remdr$findElement(using = 'id', value = 'session_key') # finding the username field
username$sendKeysToElement(list('your_USERNAME'))
password <- remdr$findElement(using = 'id', value = 'session_password') # finding the password field
password$sendKeysToElement(list('your_PASSWORD'))
remdr$findElement(using = 'xpath', value = "//button[@class='sign-in-form__submit-button']")$clickElement() # find and click Signin button
Once the page is loaded, you can get the page source and use rvest functions to read between the HTML tags. You can use this extension to easily get xpath selectors for the text you want to scrape.
library(rvest)
pgSrc <- remdr$getPageSource()
pgData <- read_html(pgSrc[[1]])
experience <- pgData %>%
  html_nodes(xpath = "//div[@class='text-body-medium break-words']") %>%
  html_text(trim = TRUE)
Output of experience:
> experience
[1] "Chief Scientist at RStudio, Inc."
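The two attempts in the question can be reproduced on a toy page, which may help show why one returns too much and the other returns nothing. This is a sketch under assumptions: the markup below is invented, the "card" and "title" class names are hypothetical, and it assumes (as the empty result suggests) that the element with id "experience" is an empty anchor next to the content rather than a wrapper around it. The real LinkedIn HTML differs and changes often.

```r
library(rvest)

# Toy markup, not the real LinkedIn page
toy <- '
<div class="card">
  <div id="experience"></div>
  <span class="title"><span class="visually-hidden">Chief Scientist</span></span>
</div>
<div class="card">
  <div id="education"></div>
  <span class="title"><span class="visually-hidden">Example University</span></span>
</div>'
page <- read_html(toy)

# Chained calls search descendants, so this matches hidden spans page-wide
all_hidden <- page %>%
  html_nodes("span") %>%
  html_nodes(".visually-hidden")

# The anchor has no children in this toy page, so nothing is found inside it
inside_anchor <- page %>% html_nodes("#experience span.visually-hidden")

# Scope by the card that contains the anchor instead
roles <- page %>%
  html_nodes(xpath = "//div[div[@id='experience']]//span[@class='visually-hidden']") %>%
  html_text(trim = TRUE)
```

In this toy page all_hidden has two nodes, inside_anchor is empty, and roles holds only the experience entry.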

Scraping a Webpage with RSelenium

I am trying to scrape this link here: https://sportsbook.draftkings.com/leagues/hockey/88670853?category=player-props&subcategory=goalscorer -- and return the player props on the page in some sort of workable table within R where I can clean it to a final result.
I am working with the RSelenium package in combination with the tidyverse and rvest in order to scrape this info into R. I have had success on other pages on this website in the past, but can't seem to crack this one.
I've gotten as far as Inspecting the webpage down to the most granular <div> that contains the entire list of players on the page, and copied the corresponding xpath from that line of the inspection.
My code looks as such:
# Run this code to scrape the player props for goals from draftkings
library(tidyverse)
library(RSelenium)
library(rvest)
# start up local selenium server
rD <- rsDriver(browser = "chrome", port=6511L, chromever = "96.0.4664.45")
remote_driver <- rD$client
# Open chrome
remote_driver$open()
# Navigate to URL
url <- "https://sportsbook.draftkings.com/leagues/hockey/88670853?category=player-props&subcategory=goalscorer"
remote_driver$navigate(url)
# Find the table via the XML path
table_xml <- remote_driver$findElement(using = "xpath", value = "//*[@id='root']/section/section[2]/section/div[3]/div/div[3]/div/div/div[2]/div")
# Locates the table, turns it into a list, and binds into a single dataframe
player_prop_table <- table_xml$getElementAttribute("innerHTML")
That last line, instead of returning a workable list, tibble, or dataframe like I'm used to, returns a large list that contains the same values I see in the Chrome inspect tool.
What am I missing here in terms of successfully scraping this page?
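One possible next step, sketched here under assumptions: getElementAttribute("innerHTML") returns a list holding a single HTML string, and rvest can parse that string so the pieces can be pulled out with selectors. The markup and the class names "row", "name" and "odds" below are invented for illustration and will not match the real page.

```r
library(rvest)

# Stand-in for table_xml$getElementAttribute("innerHTML"):
# a list with one HTML string (invented markup)
player_prop_table <- list(
  '<div class="row"><span class="name">Player A</span><span class="odds">+150</span></div>
   <div class="row"><span class="name">Player B</span><span class="odds">+220</span></div>'
)

# Parse the string, then take one name and one odds value per row
fragment <- read_html(player_prop_table[[1]])
rows <- html_elements(fragment, "div.row")
props <- data.frame(
  player = html_text2(html_element(rows, "span.name")),
  odds   = html_text2(html_element(rows, "span.odds"))
)
```

Using html_element (singular) on the set of rows keeps one value per row, so the two columns stay aligned even if a row lacks one of the fields.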

What is the difference between rvest::html_text and RSelenium::getPageSource?

I'm scraping a number of webpages, where I noticed the different results that rvest (read_html, then html_text) provides, and the one that RSelenium (getPageSource()) provides.
More specifically, when dropdown menus are involved, using html_text only gives you the names of the choices, while using RSelenium you can get the url of the page that you will be directed to once you choose one.
My question here would be : (1) why the difference, and what exactly is the nature of the difference? and (2) is there a way to get the same source text extraction as RSelenium one, but using a faster way such as rvest package?
I have tried using webdriver, a PhantomJS implementation, per the suggestion in "rvest vs RSelenium results for text extracting", and its getSource function does provide the same results as RSelenium. However, while this is faster than RSelenium, it is still much slower than rvest.
library(rvest)
library(RSelenium)
library(webdriver)
library(tictoc)
library(robotstxt)
test_url <- "https://www.bea.gov"
robotstxt::paths_allowed(test_url)
# rvest
tictoc::tic()
resultA <- html_text(read_html(test_url))
tictoc::toc()
# RSelenium
tictoc::tic()
remDr <- remoteDriver(port = 4445L, browserName = "firefox")
remDr$open()
remDr$navigate(test_url)
resultB <- remDr$getPageSource()
tictoc::toc()
# webdriver
tictoc::tic()
pjs <- run_phantomjs()
ses <- Session$new(port = pjs$port)
ses$go(test_url)
resultC <- ses$getSource()
tictoc::toc()
You can see that resultA is different from resultB and resultC. More specifically, my focus is on the part from the word "Tools" onwards, which is the dropdown menu for choosing among the different "Tools" tabs that this website provides.
Showing just a small chunk, choosing "BEARFACTS" in rvest is:
BEARFACTS\n \n \n
while in RSelenium it is something like the following :
<li class=\"expanded dropdown\">\n BEARFACTS\n
The difference between RSelenium and rvest is:
RSelenium runs a real web browser, so it will load any javascript contained in the webpage (javascript is often used to load additional html elements or data after the initial html has loaded).
rvest does not run javascript, and therefore retrieves the page html faster, but will miss any elements loaded with javascript after the initial page load.
Some useful tips:
When scraping a page that doesn't rely on javascript to load its content, use rvest.
When you must use RSelenium, try using a headless option to improve speed (it will load the page in a browser just like normal, but it won't display any of the graphical elements, so it will be faster).
Example of using RSelenium headless
eCaps <- list(chromeOptions = list(
  args = c('--headless', '--disable-gpu', '--window-size=1280,800')
))
rD <- rsDriver(browser = "chrome", verbose = TRUE, chromever = "78.0.3904.105", port = 4447L, extraCapabilities = eCaps)
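Part of the difference shown in the question may also come from comparing stripped text with raw markup, independent of JavaScript: html_text drops all tags, while a page source keeps them. The sketch below illustrates this on a toy dropdown; the markup and the "/tools/bearfacts" link are placeholders, not taken from bea.gov.

```r
library(rvest)

# Toy dropdown entry (placeholder markup and URL)
toy <- '<ul><li class="expanded dropdown"><a href="/tools/bearfacts">BEARFACTS</a></li></ul>'
page <- read_html(toy)

text_only <- html_text(page)     # tags stripped, only the visible text remains
markup    <- as.character(page)  # serialized HTML, tags and attributes kept
links     <- html_attr(html_elements(page, "li.dropdown a"), "href")
```

If the links are already present in the static HTML, html_attr retrieves them with rvest alone; a browser is only needed when JavaScript adds them after the page loads.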

Is it possible to Autosave a webpage as an image inside of R?

I think this can be done but I do not know if the functionality exists. I have searched the internet and Stack Overflow high and low and cannot find anything. I'd like to save www.espn.com as an image to a certain folder on my computer at a certain time of day. Is this possible? Any help would be very much appreciated.
Selenium allows you to do this. See http://johndharrison.github.io/RSelenium/ . Disclaimer: I am the author of the RSelenium package. The image can be exported as a base64-encoded png. As an example:
# RSelenium::startServer() # start a selenium server if required
require(RSelenium)
remDr <- remoteDriver()
remDr$open()
remDr$navigate("http://espn.go.com/")
# remDr$screenshot(display = TRUE) # to display image
tmp <- paste0(tempdir(), "/tmpScreenShot.png")
base64png <- remDr$screenshot()
# base64Decode() comes from the RCurl package
writeBin(RCurl::base64Decode(base64png[[1]], "raw"), tmp)
The png will be saved to the file given at tmp.
A basic vignette on operation can be viewed at RSelenium basics and RSelenium: Testing Shiny apps.
