Web Scraping: Issues With set_values and crawlr - R

My Goal: Using R, scrape all light bulb model #s and prices from homedepot.com.
My Problem: I cannot find the URLs for ALL the light bulb pages. I can scrape one page, but I need a way to get the URLs so I can scrape them all.
Ideally I would like these pages
https://www.homedepot.com/p/TOGGLED-48-in-T8-16-Watt-Cool-White-Linear-LED-Tube-Light-Bulb-A416-40210/205935901
but even getting the list pages like these would be ok
https://www.homedepot.com/b/Lighting-Light-Bulbs/N-5yc1vZbmbu
I tried crawlr -> does not work on homedepot.com (maybe because of HTTPS?).
I tried to get specific pages.
I tried rvest -> I tried using html_form() and set_values() to put "light bulb" in the search box, but the form comes back as:
[[1]]
<form> 'headerSearchForm' (GET )
<input hidden> '': 21
<input text> '':
<button > '<unnamed>
and set_values() will not work because the input's name is '', so the error comes back:
Error: attempt to use zero-length variable name.
I also tried using the paste function and lapply
tmp <- lapply(0:696, function(page) {
  url <- paste0("https://www.homedepot.com/b/Lighting-Light-Bulbs/N-5yc1vZbmbu?Nao=", page, "4&Ns=None")
  doc <- read_html(url)
  html_table(html_nodes(doc, "table"))[[1]]
})
I got the error: Error in html_table(html_nodes(page, "table"))[[1]] : subscript out of bounds.
I am seriously at a loss and any advice or tips would be so fantastic.

You can do it through rvest and tidyverse.
You can find a listing of all bulbs starting in this page, with a pagination of 24 bulbs per page across 30 pages:
https://www.homedepot.com/b/Lighting-Light-Bulbs-LED-Bulbs/N-5yc1vZbm79
Take a look at the pagination grid at the bottom of the initial page.
You could extract the link to each page listing 24 bulbs by following/extracting the links in that pagination grid.
Yet, just by comparing the URLs it becomes evident that all pages follow a pattern, with "https://www.homedepot.com/b/Lighting-Light-Bulbs-LED-Bulbs/N-5yc1vZbm79" as the root and a trailing query parameter whose value is the index of the first light bulb displayed on that page, e.g. "?Nao=24".
So you can simply infer the structure of each URL pointing to a page of bulbs. The following command creates such a list in R:
library(rvest)
library(tidyverse)
index_list <- as.list(seq(0,(24*30), 24)) %>% paste0("https://www.homedepot.com/b/Lighting-Light-Bulbs-LED-Bulbs/N-5yc1vZbm79?Nao=", . )
Now, to extract the URL for each light bulb page, a combination of a function and purrr's map() comes in handy.
To extract the individual bulb URLs from the index pages, we can call this:
scrap_bulbs <- function(url){
  object <- read_html(as.character(url))
  object <- html_nodes(x = object, xpath = "//a[@data-pod-type='pr']")
  object <- html_attr(x = object, 'href')
  Sys.sleep(10)  ## Courtesy pause of 10 seconds, prevents the website from possibly blocking your IP
  paste0('https://www.homedepot.com', object)
}
Now we store the results in a list created by map().
bulbs_list <- map(.x = index_list, .f = scrap_bulbs)
unlist(bulbs_list)
Done!
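With the product URLs collected, the remaining step towards the original goal (model numbers and prices) would be to visit each product page and pull those fields out. This is only a sketch: the CSS selectors [itemprop='model'] and [itemprop='price'] are placeholders, since Home Depot's actual markup may differ, so inspect a product page and substitute the real selectors.
## hypothetical selectors -- inspect a real product page and adjust them
scrape_bulb <- function(url){
  page <- read_html(url)
  tibble(
    url   = url,
    model = page %>% html_node("[itemprop='model']") %>% html_text(trim = TRUE),
    price = page %>% html_node("[itemprop='price']") %>% html_text(trim = TRUE)
  )
}

## one row per bulb, with the same courtesy pause between requests as above
bulbs <- map_df(unlist(bulbs_list), ~{ Sys.sleep(10); scrape_bulb(.x) })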

Related

Spotify API - "raw" data class arbitrarily returned for some requests

I am compiling data about a set of artists on Spotify - data for each song on each album of each artist. I use a for loop to automate this API request on about 80 different artists in the data frame albums, then assign a bit of info on each album in albums to its list object from the API.
The problem: my API call doesn't always return a list object. Sometimes it returns an object where class() = raw.
#REQUEST DATA
#------------
library(plyr)
library(httr)
library(lubridate)

collablist <- as.list(NULL)

for(i in 1:nrow(albums)){
  tracks_in_one_album <- as.list(NULL)
  URI = paste0('https://api.spotify.com/v1/albums/', albums$album_uri[i], '/tracks')
  response = GET(url = URI, add_headers(Authorization = HeaderValue))
  tracks_in_one_album = content(response)
  tracks_in_one_album[["album"]] = albums$album_name[i]
  tracks_in_one_album[["album_artist"]] = albums$artists[i]
  collablist[[i]] <- tracks_in_one_album
  print(albums$artist_name[i])
}
The loop runs for somewhere between 50 and 300 albums before I inevitably get the following message:
Error in tracks_in_one_album[["album"]] <- albums$album_name[i] :
incompatible types (from character to raw) in subassignment type fix
When I attempt to assign the character object albums$album_name[i] to the API-requested object tracks_in_one_album and that object is a list, I have no issue. But occasionally the object is of class raw. Changing it to a list by wrapping the content() call in as.list() prevents the error from occurring, but it doesn't really fix the issue, because for the requests where the data come in as raw instead of as a list by default, they're sort of mangled (just a raw vector).
The craziest part? This doesn't happen consistently. It could happen for the 4th album of Cat Stevens one time; if I rerun, that Cat Stevens album will be fine and get pulled into R as a list but perhaps the second album for Migos will come in raw instead.
My Question - why are the data not always coming in as a list when I make a request? How is it possible that this could be happening in such a non-reproducible way?
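One thing that might help narrow this down (a sketch, not a confirmed fix): skip content()'s automatic type detection and parse the body explicitly, and check the HTTP status first, since error or rate-limit responses are a plausible reason the body does not come back as a list. The helper below is hypothetical; it reuses HeaderValue from the question and assumes the endpoint returns JSON.
library(httr)
library(jsonlite)

## hypothetical helper: fetch one album's tracks and always return a parsed list (or NULL)
get_tracks <- function(album_uri, HeaderValue){
  URI <- paste0('https://api.spotify.com/v1/albums/', album_uri, '/tracks')
  response <- GET(url = URI, add_headers(Authorization = HeaderValue))
  ## a non-200 status (e.g. 429 when throttled) would explain an unparseable body
  if (http_error(response)) {
    warning("Request failed with status ", status_code(response))
    return(NULL)
  }
  ## parse explicitly instead of relying on content()'s type detection
  fromJSON(content(response, as = "text", encoding = "UTF-8"), simplifyVector = FALSE)
}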

Accessing Spotify API with Rspotify to obtain genre information for multiple artists

I am using RStudio 3.4.4 on a windows 10 machine.
I have got a vector of artist names and I am trying to get genre information for them all on spotify. I have successfully set up the API and the RSpotify package is working as expected.
I am trying to build up to creating a function, but I am failing pretty early on.
So far I have the following, but it is returning unexpected results:
len <- nrow(Artist_Nam)
artist_info <- character(len)  # pre-allocate one slot per artist

for(i in 1:len){
  ifelse(nrow(searchArtist(Artist_Nam$ArtistName[i], token = keys)) >= 1,
         artist_info[i] <- searchArtist(Artist_Nam$ArtistName[i], token = keys)$genres[1],
         artist_info[i] <- "")
}

artist_info
I was expecting this to return a list of genres, with an empty entry "" for artists where there is no match on Spotify.
Instead, what is returned starts off fine: entries are populated with genres, on inspection these genres are correct, and there are "" entries where there is no match. However, something odd happens from [73] onwards (I have over 3,000 artists): from that point the list only returns "",
despite there being matches when I actually look these artists up manually with searchArtist().
I wonder if anyone has any suggestions or has experienced anything like this before?
There may be a rate limit on the number of requests you can make per minute, and you may just be hitting that limit. Add a small delay with Sys.sleep() inside your loop so you don't hit their API hard enough to be throttled.
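A rough sketch of that suggestion, reusing the objects from the question (the one-second pause is an arbitrary guess, not a documented Spotify limit):
len <- nrow(Artist_Nam)
artist_info <- character(len)

for(i in 1:len){
  res <- searchArtist(Artist_Nam$ArtistName[i], token = keys)
  artist_info[i] <- if (nrow(res) >= 1) res$genres[1] else ""
  Sys.sleep(1)  ## courtesy pause; lengthen it if entries still come back empty
}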

readLines all webpages after a URL

In order to do some textual analysis in R, I would like to download several webpages that have a very similar design. I have tried it with several pages and this code indeed only keeps the lines I am interested in.
thepage= readLines("http://example/xwfw_665399/s2510_665401/t1480900.shtml")
thepage2 = readLines("http://example/xwfw_665399/s2510_665401/2535_665405/t851768.shtml")
mypattern1 = '<P style=\\"FONT.*\\">'
datalines1 = grep(mypattern1, thepage, value=TRUE)
datalines2 = grep(mypattern1, thepage2, value=TRUE)
mypattern2 = '<STRONG>'
mypattern3 = '</STRONG>'
mypattern4 = '</P>'
page1=gsub(mypattern1,"",datalines1)
page1=gsub(mypattern2,"", page1)
page1=gsub(mypattern3,"",page1)
page1=gsub(mypattern4,"",page1)
page2=gsub(mypattern1,"",datalines2)
page2=gsub(mypattern2,"", page2)
page2=gsub(mypattern3,"",page2)
page2=gsub(mypattern4,"",page2)
As you might see, the URLs are very similar, sharing the prefix that ends with s2510_665401/
Now, I wonder, is there a way to automatically retrieve all possible files after s2510_665401/ and have my code run over them? Despite some googling, I haven't been able to find anything. Would it require writing a function? If so, would someone please point me in the right direction?
Thanks!
This is not a final working answer, and I very rarely do web scraping so I am not sure how well this generalizes to other webpages, but it may help you in the right direction. Consider the page used in the code below (https://stat.ethz.ch/R-manual/R-devel/library/utils/). We can write a function that extracts all the href references on a page, which we can then crawl again with the same function to get their references.
So in the code below, unique(refs_1) contains all .html pages that are one level deep, and unique(refs_2) contains all .html pages that are two levels deep.
You would still need a wrapper to either stop after a certain number of iterations, or maybe prevent recrawling of already visited pages (setdiff(refs_2, refs_1)?), etc.
When you have all the URLs to scrape (in this case unique(c(refs_1, refs_2))), you should wrap your own read script in a function and call lapply(x, f), where x is the list/array of URLs and f is your function; see the sketch after the example output below.
Anyway, hope this helps!
library(qdapRegex)

get_refs_on_page <- function(page){
  lines <- tryCatch(readLines(page),
                    error   = function(cond) return(NA),
                    warning = function(cond) return(NA))
  ## extract everything between href=" and " on each line
  refs <- lapply(lines, function(x) rm_between(x, "href=\"", "\"", extract = TRUE)[[1]])
  refs <- unlist(refs)
  refs <- refs[!is.na(refs)]
  return(refs)
}
thepage = 'https://stat.ethz.ch/R-manual/R-devel/library/utils/'
refs_1 = get_refs_on_page(thepage)
refs_2 = unlist(lapply(paste0(thepage,refs_1),get_refs_on_page))
Example output:
> unique(c(refs_1,refs_2))
[1] "?C=N;O=D" "?C=M;O=A" "?C=S;O=A"
[4] "?C=D;O=A" "/R-manual/R-devel/library/" "DESCRIPTION"
[7] "doc/" "html/" "?C=N;O=A"
[10] "?C=M;O=D" "?C=S;O=D" "?C=D;O=D"
[13] "/doc/html/R.css" "/doc/html/index.html" "../../../library/utils/doc/Sweave.pdf"
[16] "../../../library/utils/doc/Sweave.Rnw" "../../../library/utils/doc/Sweave.R" "Sweave.Rnw.~r55105~"

Harvesting data from webpage in R - accessing multiple pages

I am following up on my question from yesterday - harvesting data via drop down list in R.
First, I need to obtain all 50k strings of details of all doctors from this page: http://www.lkcr.cz/seznam-lekaru-426.html#seznam
I know how to obtain them from a single page:
oborID<-"48"
okresID<-"3702"
web<- "http://www.lkcr.cz/seznam-lekaru-426.html"
extractHTML<-function(oborID,okresID){
query<-list('filterObor'="107",'filterOkresId'="3201",'do[findLekar]'=1)
query$filterObor<-oborID
query$filterOkresId<-okresID
html<- POST(url=web,body=query)
html<- content(html, "text")
html
}
IDfromHTML<-function(html){
starting<- unlist(gregexpr("filterId", html))
ending<- unlist(gregexpr("DETAIL", html))
starting<- starting[seq(2,length(starting),2)]
if (starting != -1 && ending != -1){
strings<-c()
for (i in 1:length(starting)) {
strings[i]<-substr(html,starting[i]+9,ending[i]-18)
}
strings<-list(strings)
strings
}
}
Still, I am aware that downloading the whole page for only a few lines of text is quite inefficient (but it works!). Could you give me a tip on how to make this process more efficient?
I have also encountered some pages with more than 20 doctors listed (e.g. the combination of "Brno-město" and "chirurgie"). Such data are listed and accessed via a hyperlink list at the end of the form. I need to access each of these pages and use the code I presented here on them. But I guess I have to pass some cookies there.
Other than that, the combination of "Praha" and "chirurgie" is problematic as well, because there are more than 200 records, so the page applies some script and I then need to click the "další" button and use the same method as in the previous paragraph.
Can you help me please?
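No answer is recorded here, but for the cookie part of the question, one hedged approach would be to reuse a single httr handle across requests, so that cookies set by the first response are sent with later ones. The paging parameter below is purely hypothetical; inspect the request your browser sends when you click "další" and use the real parameter names.
library(httr)

## one shared handle: cookies received from the site are kept and re-sent automatically
h <- handle("http://www.lkcr.cz")

## first request: submit the search form, as in the question
query <- list('filterObor' = "48", 'filterOkresId' = "3702", 'do[findLekar]' = 1)
first_page <- POST(url = web, handle = h, body = query)

## a follow-up request on the same handle carries the session cookies;
## "paging" is a HYPOTHETICAL parameter name, check the real one in your browser's dev tools
next_page <- POST(url = web, handle = h, body = c(query, list(paging = 2)))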

Dynamically changing the sequence of a loop

I am trying to scrape data from a website which is unfortunately located on a very unreliable server with very volatile reaction times. The first idea is of course to loop over the list of (thousands of) URLs and save the downloaded results by populating a list.
The problem, however, is that the server randomly responds very slowly, which results in a timeout error. This alone would not be a problem, as I can use the tryCatch() function and jump to the next iteration. Doing so, however, I am missing some files in each run. I know that each of the URLs in the list exists, and I need all of the data.
My idea thus would have been to use tryCatch() to evaluate whether the getURL() request yields an error. If so, the loop jumps to the next iteration and the erroneous URL is appended at the end of the URL list over which the loop runs. My intuitive solution would look something like this:
dwl = list()
for (i in seq_along(urs)) {
temp = tryCatch(getURL(url=urs[[i]]),error=function(e){e})
if(inherits(temp,"OPERATION_TIMEDOUT")){ #check for timeout error
urs[[length(urs)+1]] = urs[[i]] #if there is one the erroneous url is appended at the end of the sequence
next} else {
dwl[[i]] = temp #if there is no error the data is saved in the list
}
}
If it "would" work I would eventually be able to download all the URLs in the list. It however doesn't work, because as the help page for the next function states: "seq in a for loop is evaluated at the start of the loop; changing it subsequently does not affect the loop". Is there a workaround for this or a trick with which I could achieve my envisaged goal? I am grateful for every comment!
I would do something like this (explanation within comments):
library(RCurl)

## RES is a list that will contain the final results, one slot per url
## Always try to pre-allocate your results
RES <- vector("list", length(urs))
## Safe getURL: returns NA if there is an error; the NA is useful to filter results
get_url <- function(x) tryCatch(getURL(x), error = function(e) NA)
## the parser!
parse_doc <- function(x){
  ## some code to parse the doc
}
## indices of the urls that still have to be scraped
pending <- seq_along(urs)
## loop while we still have some not-scraped urls
while(length(pending) > 0){
  ## get the doc for all pending urls
  l_doc <- lapply(urs[pending], get_url)
  ok <- !is.na(l_doc)
  ## parse each successfully downloaded document and put it in its original slot of RES
  RES[pending[ok]] <- lapply(l_doc[ok], parse_doc)
  ## keep only the urls that failed for the next round
  pending <- pending[!ok]
}
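A quick usage sketch with hypothetical URLs and a toy parse_doc() that just pulls out the page title:
library(RCurl)

## hypothetical input: in the real case this is the list of thousands of URLs
urs <- list("https://www.r-project.org/", "https://cran.r-project.org/")

## toy parser: grab the <title> tag from the raw HTML
parse_doc <- function(x) regmatches(x, regexpr("<title>[^<]*</title>", x))

## ... now run the pre-allocation and the while loop above ...

RES  ## one parsed result per original url, in the original order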
