Google Sheets/Excel Web Scraping - Variable page count into Link URL

If there's an easier way to do this please let me know, but this is what I have so far:
=IMPORTHTML("https://example/101", "list", 4)
There are 100 items per page, so if this formula is in A1, I would like something like this:
=IMPORTHTML("https://example/201", "list", 4)
to go into A101 (or into B1 and then be reshaped into a single column later)
... and so on, so that the last part of the URL keeps getting incremented by 100 each time.

Try:
={IMPORTHTML("https://example/101", "list", 4);
  IMPORTHTML("https://example/201", "list", 4)}
The semicolon inside the array literal stacks the second table directly below the first, so each additional page is just another IMPORTHTML call with the URL incremented by 100.

Related

How To Prevent Web Scraping Script From Being Blocked By Google (HTTP 429)

I have this script to take each domain name in a DataFrame and perform an "inurl:domain automation testing" Google search for each of them. I then scrape the first search result and add it to my DataFrame.
import random
import time
from googlesearch import search  # search() with tld/num/stop/pause comes from the googlesearch package

# Convert the Domain column in the DataFrame into a list
working_backlink = backlink_df.iloc[23:len(backlink_df['Domain']), 1:22]
working_domain = working_backlink["Domain"]
domain_list = working_domain.values.tolist()

# Iterate through the list and perform a query search for each domain
for x in range(23, len(domain_list)):
    sleeptime = random.randint(1, 10)
    time.sleep(sleeptime)
    for i in domain_list:
        query = "inurl:{} automation testing".format(i)
        delay = random.randint(10, 30)
        for j in search(query, tld="com", num=1, stop=1, pause=delay):
            working_backlink.iat[x, 5] = j

# Show the DataFrame
working_backlink.head(n=40)
I tried using sleeptime and a random delay to prevent the HTTP 429 error, but it still doesn't work. Could you suggest a solution to this? Thanks a lot!

Get the number of redirects from a URL in R

I have to extract a feature, the number of redirects, from the URLs in my data frame. Is there a way to find this number in R, like there is in Python:
import requests

r = requests.get(url)
i = 0
for h in r.history:
    i = i + 1
print(i)
The return value from httr::GET isn't obviously documented, but the headers from each step of a redirect appear in the $all_headers element:
> url = "http://github.com"
> g = httr::GET(url)
> length(g$all_headers)
[1] 2
because http redirects to https. If you go straight to https you don't see a redirect:
> url = "https://github.com"
> g = httr::GET(url)
> length(g$all_headers)
[1] 1
The return value of httr::GET is an httr::response object, which is documented at ?httr::response. You can examine the whole object with str() to see the parts that aren't covered there.
Since what you want is a count of redirects, you probably care about actual redirect responses rather than a naive count of all the response headers, e.g.
res <- httr::GET("http://1.usa.gov/1J6GNoW")
sum(sapply(res$all_headers, `[[`, "status") %/% 100 == 3)
That's 3 (and may not be exactly what you want either), whereas
length(res$all_headers)
is 4 because it counts every response in the chain, including the final one (and a naive count would also pick up any 4xx responses, which you probably shouldn't treat as redirects). You could be clearer in your question about whether you want just the number of 3xx responses or the total length of the HTTP chain.
You might also want to consider:
cat(rawToChar(curl::curl_fetch_memory("http://1.usa.gov/1J6GNoW")$headers))
and count the actual redirects from that output (depending on what the actual "mission" is).
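If the end goal is to add this as a column for every URL in your data frame, something like the sketch below should work (the data frame name `df` and the column name `url` are just placeholders here, not taken from your question):
library(httr)

## Count the 3xx responses in the redirect chain for a single URL;
## returns NA if the request fails entirely.
count_redirects <- function(u) {
  res <- tryCatch(GET(u), error = function(e) NULL)
  if (is.null(res)) return(NA_integer_)
  statuses <- sapply(res$all_headers, `[[`, "status")
  sum(statuses >= 300 & statuses < 400)
}

df$n_redirects <- sapply(df$url, count_redirects)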

Web Scraping: Issues With Set_values and crawlr

My Goal: Using R, scrape all light bulb model numbers and prices from Home Depot.
My Problem: I cannot find the URLs for ALL the light bulb pages. I can scrape one page, but I need a way to get the URLs so I can scrape them all.
Ideally I would like these pages
https://www.homedepot.com/p/TOGGLED-48-in-T8-16-Watt-Cool-White-Linear-LED-Tube-Light-Bulb-A416-40210/205935901
but even getting the list pages like these would be ok
https://www.homedepot.com/b/Lighting-Light-Bulbs/N-5yc1vZbmbu
I tried crawlr -> it does not work on Home Depot (maybe because of https?). I also tried to get specific pages.
I tried rvest -> I tried using html_form and set_values to put "light bulb" in the search box, but the form comes back as
[[1]]
<form> 'headerSearchForm' (GET )
<input hidden> '': 21
<input text> '':
<button > '<unnamed>
and set_values will not work because the input name is '', so this error comes back:
Error: attempt to use zero-length variable name.
I also tried using the paste function and lapply
tmp <- lapply(0:696, function(page) {
  url <- paste0("https://www.homedepot.com/b/Lighting-Light-Bulbs/N-5yc1vZbmbu?Nao=", page, "4&Ns=None")
  page <- read_html(url)
  html_table(html_nodes(page, "table"))[[1]]
})
I got the error: Error in html_table(html_nodes(page, "table"))[[1]] : subscript out of bounds.
I am seriously at a loss and any advice or tips would be so fantastic.
You can do it through rvest and tidyverse.
You can find a listing of all the bulbs starting on this page, paginated at 24 bulbs per page across 30 pages:
https://www.homedepot.com/b/Lighting-Light-Bulbs-LED-Bulbs/N-5yc1vZbm79
Take a look at the pagination grid at the bottom of the initial page. You could extract the link to each page of 24 bulbs by following the links in that grid.
Yet, just by comparing the URLs it becomes evident that all pages follow a pattern, with "https://www.homedepot.com/b/Lighting-Light-Bulbs-LED-Bulbs/N-5yc1vZbm79" as the root and a tail whose final digits give the index of the first light bulb displayed, e.g. "?Nao=24".
So you can simply infer the structure of each URL pointing to a page of bulbs. The following command creates such a list in R:
library(rvest)
library(tidyverse)
index_list <- seq(0, 24 * 29, 24) %>%  ## offsets 0, 24, ..., 696 cover the 30 pages
  paste0("https://www.homedepot.com/b/Lighting-Light-Bulbs-LED-Bulbs/N-5yc1vZbm79?Nao=", .)
Now, to extract the URL for each light bulb page, a combination of a helper function and purrr's map() comes in handy.
To extract the individual bulb URLs from the index pages, we can call this:
scrap_bulbs <- function(url){
  object <- read_html(as.character(url))
  object <- html_nodes(x = object, xpath = "//a[@data-pod-type='pr']")
  object <- html_attr(x = object, 'href')
  Sys.sleep(10) ## Courtesy pause of 10 seconds, prevents the website from possibly blocking your IP
  paste0('https://www.homedepot.com', object)
}
Now we store the results in a list created by map().
bulbs_list <- map(.x = index_list, .f = scrap_bulbs)
unlist(bulbs_list)
Done!
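Since the original goal was model numbers and prices, you could then map a second function over those bulb URLs. Treat the sketch below as a starting point only: the ".product-title" and ".price" CSS selectors are placeholders that you would need to check against the actual Home Depot page markup.
scrape_bulb <- function(url){
  page <- read_html(url)
  Sys.sleep(10) ## same courtesy pause as above
  tibble(
    model = html_text(html_node(page, ".product-title"), trim = TRUE), ## placeholder selector
    price = html_text(html_node(page, ".price"), trim = TRUE),         ## placeholder selector
    url   = url
  )
}
bulbs_df <- map_dfr(unlist(bulbs_list), scrape_bulb)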

Accessing Spotify API with Rspotify to obtain genre information for multiple artists

I am using RStudio with R 3.4.4 on a Windows 10 machine.
I have a vector of artist names and I am trying to get genre information for them all on Spotify. I have successfully set up the API, and the Rspotify package is working as expected.
I am trying to build up to a function, but I am failing pretty early on.
So far I have the following, but it is returning unexpected results:
len <- nrow(Artist_Nam)
artist_info <- character(len)

for(i in 1:len){
  ifelse(nrow(searchArtist(Artist_Nam$ArtistName[i], token = keys)) >= 1,
         artist_info[i] <- searchArtist(Artist_Nam$ArtistName[i], token = keys)$genres[1],
         artist_info[i] <- "")
}
artist_info
I was expecting this to return a list of genres, with an empty entry "" for artists that have no match on Spotify.
Instead, what is returned is populated with genres, and on inspection these genres are correct, with "" where there is no match. However, something odd happens from [73] onwards (I have over 3,000 artists): from that point the result only contains "",
despite there being matches when I run searchArtist() manually for those artists.
I wonder if anyone has any suggestions or has experienced anything like this before?
There may be a rate limit on the number of requests you can make per minute, and you may just be hitting that limit. Add a small delay with Sys.sleep() inside your loop so that you don't hit their API hard enough to be throttled.
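As a rough sketch, reusing the objects from your question (the 0.5-second pause is just a guess, not a documented Spotify limit):
len <- nrow(Artist_Nam)
artist_info <- character(len)

for(i in 1:len){
  res <- searchArtist(Artist_Nam$ArtistName[i], token = keys)
  artist_info[i] <- if(isTRUE(nrow(res) >= 1)) res$genres[1] else ""
  Sys.sleep(0.5) ## short pause between requests so the API isn't hammered
}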

R: Open websites from a URL string in a data frame

I have an R data frame with a list of 500ish URLs. It looks a bit like this:
websites <- data.frame(rbind("www.nytimes.com", "www.google.com", "www.facebook.com"))
I want to go through these URLs and open them (maybe 10 at a time) in Google Chrome.
How would I go about this automatically with R?
I used this to get all 3 of them to open.
websites <- data.frame(rbind("www.nytimes.com", "www.google.com", "www.facebook.com"))
websites <- as.data.frame(t(websites))
websites[] <- lapply(websites, as.character)
webVec <- unname(unlist(websites[1, ]))

for(i in 1:length(webVec)){
  shell.exec(webVec[i])
}
This opens all of them at once, however, and I'm not sure how to open only a certain number at a time. I took a stab at it, though:
setTen <- 1
for(i in ((setTen - 1) * 10 + 1):(10 * setTen)){
  shell.exec(webVec[i])
}
The setTen variable picks whether you want the first ten websites, the second ten, etc.
I couldn't test it properly, though, since there are only 3 sites in this data frame.
If it doesn't work, let me know and I'll try to figure out a different method.
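If it helps, here is a small sketch of one way to do the batching (also untested against a long list). It uses browseURL(), which works outside Windows as well, and clips the last batch so it doesn't run past the end of the vector:
## Open one batch of URLs in the default browser; `batch` picks which group of `size`.
open_batch <- function(urls, batch, size = 10){
  start <- (batch - 1) * size + 1
  if(start > length(urls)) return(invisible(NULL)) ## nothing left in this batch
  idx <- start:min(batch * size, length(urls))
  invisible(lapply(urls[idx], browseURL))
}

open_batch(webVec, 1) ## first ten
open_batch(webVec, 2) ## next ten, and so on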
