I need to scrape a large number of pages. Because there are so many, I first save the HTML files to my external hard drive and then process them.
The problem is that whenever I fetch a URL I always download the same page, because the website takes about half a second before it displays the new page.
The code is as follows:
library(RCurl)
library(stringr)
library(XML)
for (i in 1:numberPages) {
  # Generate the unique URL for each page
  url <- str_c('https://www..../#{"page":"', i, '"}')

  # Download the page; if the request errors, retry after 1 second
  bill.result <- try(getURL(url))
  while (inherits(bill.result, "try-error")) {
    cat("Page unresponsive - trying again\n")
    Sys.sleep(1)
    bill.result <- try(getURL(url))
  }

  # Write the page to the local hard drive; if that errors, retry after 2 minutes
  result <- try(write(bill.result, str_c(new_folder, "/page", i, ".html")))
  while (inherits(result, "try-error")) {
    cat("I will sleep if network is down\n")
    Sys.sleep(120)
    result <- try(write(bill.result, str_c(new_folder, "/page", i, ".html")))
  }

  # Print progress of the download
  cat(i, "\n")
}
I have searched and haven't found an option that would wait, say, one second before fetching the URL. Without this wait time I always download the same page, no matter where in the loop I am. I know this because when I change the URL in my browser the page stays the same for about a second, and only then does it change to the page I actually wanted.
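To be explicit, what I had in mind is simply pausing before each request with Sys.sleep() (which I already use for the retries), along these lines, though I do not know whether that would actually fix the duplicate-page problem:
for (i in 1:numberPages) {
  url <- str_c('https://www..../#{"page":"', i, '"}')
  Sys.sleep(1)                     # pause one second before requesting the page
  bill.result <- try(getURL(url))
  # ... same retry and write logic as above ...
}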
Related
My Goal: Using R, scrape all light bulb model #s and prices from homedepot.com.
My Problem: I cannot find the URLs for ALL the light bulb pages. I can scrape one page, but I need a way to get the URLs so I can scrape them all.
Ideally I would like these pages
https://www.homedepot.com/p/TOGGLED-48-in-T8-16-Watt-Cool-White-Linear-LED-Tube-Light-Bulb-A416-40210/205935901
but even getting the list pages like these would be ok
https://www.homedepot.com/b/Lighting-Light-Bulbs/N-5yc1vZbmbu
I tried crawlr -> does not work on homedepot (maybe because of https?). I tried to get specific pages.
I tried rvest -> I tried using html_form and set_values to put "light bulb" in the search box, but the form comes back as
[[1]]
<form> 'headerSearchForm' (GET )
<input hidden> '': 21
<input text> '':
<button > '<unnamed>
and set_values will not work because the input name is '', so the error comes back:
error: attempt to use zero-length variable name.
I also tried using paste0() and lapply():
tmp <- lapply(0:696, function(page) {
  url <- paste0("https://www.homedepot.com/b/Lighting-Light-Bulbs/N-5yc1vZbmbu?Nao=", page, "4&Ns=None")
  page <- read_html(url)
  html_table(html_nodes(page, "table"))[[1]]
})
I got the error: Error in html_table(html_nodes(page, "table"))[[1]] : subscript out of bounds.
I am seriously at a loss and any advice or tips would be so fantastic.
You can do it through rvest and tidyverse.
You can find a listing of all bulbs starting from this page, paginated at 24 bulbs per page across 30 pages:
https://www.homedepot.com/b/Lighting-Light-Bulbs-LED-Bulbs/N-5yc1vZbm79
Take a look at the pagination grid at the bottom of the initial page.
You could extract the link to each page listing 24 bulbs by following/extracting the links in that pagination grid.
Yet, just by comparing the URLs it becomes evident that all pages follow a pattern, with "https://www.homedepot.com/b/Lighting-Light-Bulbs-LED-Bulbs/N-5yc1vZbm79" as the root and a tail whose final digits give the index of the first light bulb displayed on that page, e.g. "?Nao=24".
So you could simply infer the structure of each URL pointing to a page of bulbs. The following command creates such a list in R:
library(rvest)
library(tidyverse)
index_list <- as.list(seq(0,(24*30), 24)) %>% paste0("https://www.homedepot.com/b/Lighting-Light-Bulbs-LED-Bulbs/N-5yc1vZbm79?Nao=", . )
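A quick sanity check of the first few entries, just to show what the paste0() call above produces:
head(index_list, 3)
[1] "https://www.homedepot.com/b/Lighting-Light-Bulbs-LED-Bulbs/N-5yc1vZbm79?Nao=0"
[2] "https://www.homedepot.com/b/Lighting-Light-Bulbs-LED-Bulbs/N-5yc1vZbm79?Nao=24"
[3] "https://www.homedepot.com/b/Lighting-Light-Bulbs-LED-Bulbs/N-5yc1vZbm79?Nao=48"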
Now, to extract the URL of each light bulb page, a combination of a helper function and purrr's map() comes in handy.
To extract the individual bulb URLs from the index pages, we can define this:
scrap_bulbs <- function(url){
  object <- read_html(as.character(url))
  object <- html_nodes(x = object, xpath = "//a[@data-pod-type='pr']")
  object <- html_attr(x = object, 'href')
  Sys.sleep(10) ## Courtesy pause of 10 seconds, prevents the website from possibly blocking your IP
  paste0('https://www.homedepot.com', object)
}
Now we store the results in a list created by map():
bulbs_list <- map(.x = index_list, .f = scrap_bulbs)
unlist(bulbs_list)
Done!
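If you then want the model numbers and prices from each product page, you could map a second function over these product URLs. The following is a sketch only: the selectors ".modelNo" and ".price" are placeholders that you would need to replace after inspecting an actual product page.
## Sketch only: ".modelNo" and ".price" are placeholder selectors --
## inspect a real product page and adjust them before use.
scrape_bulb_details <- function(url){
  page <- read_html(url)
  Sys.sleep(10)  ## same courtesy pause as above
  tibble(
    url   = url,
    model = html_text(html_node(page, ".modelNo")),  # placeholder selector
    price = html_text(html_node(page, ".price"))     # placeholder selector
  )
}

bulb_details <- map_df(unlist(bulbs_list), scrape_bulb_details)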
I am following up on my question from yesterday - harvesting data via drop down list in R.
First, I need to obtain the detail strings of all ~50,000 doctors from this page: http://www.lkcr.cz/seznam-lekaru-426.html#seznam
I know how to obtain them from a single page:
library(httr)

oborID <- "48"
okresID <- "3702"
web <- "http://www.lkcr.cz/seznam-lekaru-426.html"

extractHTML <- function(oborID, okresID){
  query <- list('filterObor' = "107", 'filterOkresId' = "3201", 'do[findLekar]' = 1)
  query$filterObor <- oborID
  query$filterOkresId <- okresID
  html <- POST(url = web, body = query)
  html <- content(html, "text")
  html
}
IDfromHTML <- function(html){
  starting <- unlist(gregexpr("filterId", html))
  ending <- unlist(gregexpr("DETAIL", html))
  starting <- starting[seq(2, length(starting), 2)]
  if (starting[1] != -1 && ending[1] != -1){
    strings <- c()
    for (i in 1:length(starting)) {
      strings[i] <- substr(html, starting[i] + 9, ending[i] - 18)
    }
    strings <- list(strings)
    strings
  }
}
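For completeness, this is how I call the two functions together for one specialty/district combination:
# Fetch the result page for one combination and pull out the doctor detail strings
html <- extractHTML(oborID, okresID)
details <- IDfromHTML(html)
details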
Still, I am aware that downloading a whole page for only a few lines of text is quite inefficient (but it works!). Could you give me a tip on how to make this process more efficient?
I have also encountered some pages with more than 20 doctors listed (e.g. the combination of "Brno-město" and "chirurgie"). Such data are listed and accessed via a hyperlink list at the end of the form. I need to access each of those pages and apply the code I presented here, but I guess I have to pass some cookies along (see the sketch below).
Other than that, the combination of "Praha" and "chirurgie" is problematic as well: there are more than 200 records, so the page applies some script and I need to click the "další" ("next") button and then use the same method as in the previous paragraph.
Can you help me please?
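I was wondering whether reusing a single httr handle (so that cookies from the first POST get sent with later requests) might be the way to go. Something like the sketch below, where the 'paging' field name is only a guess on my part (the real parameter would have to be taken from the page links at the bottom of the result list):
library(httr)

## Sketch only: reuse one handle so any cookies set by the first request
## are sent back automatically with the follow-up requests.
h <- handle("http://www.lkcr.cz")
query <- list('filterObor' = "48", 'filterOkresId' = "3702", 'do[findLekar]' = 1)

page1 <- POST("http://www.lkcr.cz/seznam-lekaru-426.html", body = query, handle = h)

## 'paging' is a hypothetical field name -- inspect the hyperlinks at the
## bottom of the result list to find the real paging parameter.
page2 <- POST("http://www.lkcr.cz/seznam-lekaru-426.html",
              body = c(query, list(paging = 2)), handle = h)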
So I'm working on a webscraping script in R and because the particular website I'm scraping doesn't take too kindly to people who scrape their data in large volumes, I have broken down my loop to handle only 10 links at a time. I still want to go through all the links, however, just in a random and slow manner.
productLink # A list of all the links that I'll be scraping
x <- length(productLink)
randomNum <- sample(1:x, 10)

library(rvest)
for (i in 1:10) {
  url <- productLink[randomNum[i]]
  specs <- url %>%
    html() %>%
    html_nodes("h5") %>%
    html_text()
  specs
  message <- "\n Temporarily unavailable\n "
  if (specs == message) {
    print("Item unavailable")
  } else {
    print("Item available")
  }
}
Now the next time I run this for-loop I want to exclude all the random numbered indices that have already been tried in the previous running of the loop. That way this for loop runs through 10 new links each time until all the links have been used. There is another aspect to this that I'd like some input on. Since I can raise alarm flags by brute force scraping the particular company's website, is there any way I can slow down this loop so that it only runs every couple of minutes? I'm thinking of a timeout function or such where the code runs the for-loop once, waits a few minutes then runs it again (with new links each time as mentioned above). Any ideas?
Use something like this. Loop over all the product indices in random order.
for (i in sample(1:x)) {
  <Your code here>
  # Sleep for 120 seconds
  Sys.sleep(120)
}
And if you want to do 10 at a time, sleep for 120 seconds every 10 executions:
n <- 1
for (i in sample(1:x)) {
  # Sleep for 120 seconds every 10 runs
  if (n == 10) { Sys.sleep(120); n <- 0 }
  n <- n + 1
  <Your code here>
}
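Putting the two ideas together with the rvest code from your question (a sketch; productLink and the availability message are taken from your code as-is):
library(rvest)

n <- 1
for (i in sample(seq_along(productLink))) {
  specs <- productLink[[i]] %>%
    read_html() %>%
    html_nodes("h5") %>%
    html_text()

  if (any(specs == "\n Temporarily unavailable\n ")) {
    print("Item unavailable")
  } else {
    print("Item available")
  }

  # Pause for 120 seconds after every 10 links
  if (n %% 10 == 0) Sys.sleep(120)
  n <- n + 1
}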
I am trying to scrape data from a website which is unfortunately located on a very unreliable server with very volatile reaction times. The first idea is of course to loop over the list of (thousands of) URLs and save the downloaded results by populating a list.
The problem, however, is that the server randomly responds very slowly, which results in a timeout error. This alone would not be a problem, as I can use the tryCatch() function and jump to the next iteration. Doing so, however, I miss some files in each run. I know that each of the URLs in the list exists, and I need all of the data.
My idea thus would have been to use tryCatch() to evaluate whether the getURL() request yields an error. If so, the loop jumps to the next iteration and the erroneous URL is appended at the end of the URL list over which the loop runs. My intuitive solution would look something like this:
dwl <- list()
for (i in seq_along(urs)) {
  temp <- tryCatch(getURL(url = urs[[i]]), error = function(e) e)
  if (inherits(temp, "OPERATION_TIMEDOUT")) {  # check for a timeout error
    urs[[length(urs) + 1]] <- urs[[i]]  # if there is one, append the erroneous URL at the end of the sequence
    next
  } else {
    dwl[[i]] <- temp  # if there is no error, save the data in the list
  }
}
If it "would" work, I would eventually be able to download all the URLs in the list. It doesn't work, however, because, as the help page for next states: "seq in a for loop is evaluated at the start of the loop; changing it subsequently does not affect the loop". Is there a workaround for this, or a trick with which I could achieve my envisaged goal? I am grateful for every comment!
I would do something like this (explanation in the comments):
## RES is a global list that will contain the final result
## Always try to pre-allocate your results
RES <- vector("list", length(urs))

## Safe getURL: returns NA on error; the NA is useful to filter results
get_url <- function(x) tryCatch(getURL(x), error = function(e) NA)

## the parser!
parse_doc <- function(x){
  ## some code to parse the doc
}

## indices of the urls that still need to be scraped
todo <- seq_along(urs)

## loop while we still have some not-yet-scraped urls
while (length(todo) > 0) {
  ## get the doc for all remaining urls
  l_doc <- lapply(urs[todo], get_url)
  ok <- !is.na(l_doc)
  ## parse each successfully downloaded doc and put the result in RES
  RES[todo[ok]] <- lapply(l_doc[ok], parse_doc)
  ## keep only the indices that still need a retry
  todo <- todo[!ok]
}
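As a purely hypothetical illustration, parse_doc could just pull out the page title:
library(XML)

## Hypothetical parser: extract the <title> of each downloaded page.
## A real parse_doc would pull out whatever data you actually need.
parse_doc <- function(x) {
  xpathSApply(htmlParse(x, asText = TRUE), "//title", xmlValue)
}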
For one of my dissertation's data collection modules, I have implemented a simple polling mechanism. This is needed because I make each data collection request (one of many) as an SQL query, submitted via a web form, which I simulate with RCurl code. The server processes each request and generates a text file with the results at a specific URL (RESULTS_URL in the code below). Regardless of the request, the URL and file name are always the same (I cannot change that). Since processing time differs between requests and some requests may take a significant amount of time, my R code needs to "know" when the results are ready (i.e. when the file has been re-generated), so that it can retrieve them. The following is my solution to this problem.
POLL_TIME <- 5 # polling timeout in seconds
In function srdaRequestData(), before making data request:
# check and save 'last modified' date and time of the results file
# before submitting data request, to compare with the same after one
# for simple polling of results file in srdaGetData() function
beforeDate <- url.exists(RESULTS_URL, .header=TRUE)["Last-Modified"]
beforeDate <<- strptime(beforeDate, "%a, %d %b %Y %X", tz="GMT")
<making data request is here>
In function srdaGetData(), called after srdaRequestData()
# simple polling of the results file
repeat {
  if (DEBUG) message("Waiting for results ...", appendLF = FALSE)
  afterDate <- url.exists(RESULTS_URL, .header = TRUE)["Last-Modified"]
  afterDate <- strptime(afterDate, "%a, %d %b %Y %X", tz = "GMT")
  delta <- difftime(afterDate, beforeDate, units = "secs")
  if (as.numeric(delta) != 0) { # file modified, results are ready
    if (DEBUG) message(" Ready!")
    break
  } else { # no results yet, wait the timeout and check again
    if (DEBUG) message(".", appendLF = FALSE)
    Sys.sleep(POLL_TIME)
  }
}
<retrieving request's results is here>
The module's main flow/sequence of events is linear, as follows:
Read/update configuration file
Authenticate with the system
Loop through the data requests specified in the configuration file (via lapply()), performing the following for each request:
{
...
Make request: srdaRequestData()
...
Retrieve results: srdaGetData()
...
}
The issue is that the code above doesn't seem to work as expected: upon making a data request, it should print "Waiting for results ..." and then, while periodically checking whether the results file has been modified (re-generated), print progress dots until the results are ready, at which point it prints the confirmation. However, the actual behavior is that the code waits a long time (I intentionally made one request long-running) without printing anything, and then apparently retrieves the results and prints both "Waiting for results ..." and " Ready!" at the same time.
It seems to me that it's some kind of synchronization issue, but I can't figure out what exactly. Or, maybe it's something else and I'm somehow missing it. Your advice and help will be much appreciated!
In a comment to the question, I believe MrFlick solved the issue: the polling logic appears to be functional, but the problem is that the progress messages are out of synch with current events on the system.
By default, the R console output is buffered. This is by design: it speeds things up and avoids the distracting flicker that may be associated with frequent messages etc. We tend to forget this fact, particularly after we've been using R in a very interactive fashion, running various ad-hoc statements at the console (the console buffer is automatically flushed just before returning the > prompt).
It is however possible to get message() output, and console output more generally, in "real time", either by explicitly flushing the console after each critical output statement with the flush.console() function, or by disabling buffering at the level of the R GUI (right-click on the console and see the "Buffered output" Ctrl+W item; this is also available in the Misc menu).
Here's a toy example of the explicit use of flush.console(). Note the use of cat() rather than message(), as the former doesn't automatically add a CR/LF to the output. The latter is useful, however, because its messages can be suppressed with suppressMessages() and the like. Also, as shown in the comment, you can cat the "\b" (backspace) character to make the numbers overwrite one another.
CountDown <- function() {
  for (i in 9:1){
    cat(i)
    # alternatively to cat(i) use: message(i)
    flush.console() # <<<<<<< immediate output to console.
    Sys.sleep(1)
    cat(" ") # also try cat("\b") instead ;-)
  }
  cat("... Blast-off\n")
}
The output is the following. What is of course not evident in this print-out is that it took nine seconds overall, with one number printed every second, before the final "Blast-off"; remove the flush.console() statement and the output will come all at once, when the function terminates (unless the console is unbuffered at the level of the GUI).
CountDown()
9 8 7 6 5 4 3 2 1 ... Blast-off
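Applied to the polling loop in the question, that simply means flushing the console right after each progress message. A sketch of the modified repeat block (everything else unchanged):
repeat {
  if (DEBUG) { message("Waiting for results ...", appendLF = FALSE); flush.console() }
  afterDate <- url.exists(RESULTS_URL, .header = TRUE)["Last-Modified"]
  afterDate <- strptime(afterDate, "%a, %d %b %Y %X", tz = "GMT")
  delta <- difftime(afterDate, beforeDate, units = "secs")
  if (as.numeric(delta) != 0) { # file modified, results are ready
    if (DEBUG) message(" Ready!")
    break
  } else {                      # no results yet, wait and check again
    if (DEBUG) { message(".", appendLF = FALSE); flush.console() }
    Sys.sleep(POLL_TIME)
  }
}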