How To Prevent Web Scraping Script From Being Blocked By Google (HTTP 429)

I have this script to take each domain name in a DataFrame and perform an "inurl:domain automation testing" Google search for it. I will scrape the 1st search result and add it to my DataFrame.
from googlesearch import search
import random
import time

# Convert the Domain column in the DataFrame into a list
working_backlink = backlink_df.iloc[23:len(backlink_df['Domain']), 1:22]
working_domain = working_backlink["Domain"]
domain_list = working_domain.values.tolist()

# Iterate through the list and perform a query search for each domain
for x in range(23, len(domain_list)):
    sleeptime = random.randint(1, 10)
    time.sleep(sleeptime)
    for i in domain_list:
        query = "inurl:{} automation testing".format(i)
        delay = random.randint(10, 30)
        for j in search(query, tld="com", num=1, stop=1, pause=delay):
            working_backlink.iat[x, 5] = j

# Show the DataFrame
working_backlink.head(n=40)
I tried using sleeptime and a random delay to prevent the HTTP 429 error, but it still doesn't work. Could you suggest any solution to this? Thanks a lot!
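One mitigation that often helps, beyond a fixed random delay, is to back off much more aggressively once a block actually occurs. The sketch below is only illustrative: the helper name, retry counts, and delays are my own choices, and the exact exception googlesearch raises on a 429 depends on the package version, so adjust the except clause to what you observe.

import random
import time

def search_with_backoff(do_search, max_retries=5, base_delay=60):
    # Retry a zero-argument search callable, doubling the wait after each
    # failure and adding jitter, so repeated blocks lead to longer pauses.
    for attempt in range(max_retries):
        try:
            return do_search()
        except Exception as err:  # e.g. an HTTP 429 error raised by the library
            wait = base_delay * (2 ** attempt) + random.uniform(0, 15)
            print("Blocked ({}); sleeping {:.0f}s before retrying".format(err, wait))
            time.sleep(wait)
    raise RuntimeError("Still blocked after {} retries".format(max_retries))

# Hypothetical usage: one well-spaced query per domain
# result = search_with_backoff(
#     lambda: list(search(query, tld="com", num=1, stop=1, pause=30)))

Even with backoff, Google blocks automated result scraping aggressively, so keeping the overall request rate very low usually matters more than the exact retry schedule.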

Related

Scraping string from a large number of URLs with Julia

Happy New Year!
I have just started to learn Julia, and the first mini challenge I have set myself is to scrape data from a large list of URLs.
I have ca. 50k URLs (which I successfully parsed from a JSON with Julia using regex) in a CSV file. I want to scrape each one and return a matched string ("/page/12345/view", where 12345 is any integer).
I managed to do so using HTTP and Queryverse (I had started with CSV and CSVFiles, but I am trying out packages for learning purposes), but the script seems to stop after just under 2k URLs. I can't see an error such as a timeout.
May I ask if anyone can advise what I'm doing wrong or how I can approach it differently? Explanations/links to learning resources would also be great!
using HTTP, Queryverse

URLs = load("urls.csv") |> DataFrame
patternid = r"\/page\/[0-9]+\/view"

touch("ids.txt")
f = open("ids.txt", "a")

for row in eachrow(URLs)
    urlResponse = HTTP.get(row[:url])
    if Int(urlResponse.status) == 404
        continue
    end
    urlHTML = String(urlResponse.body)
    urlIDmatch = match(patternid, urlHTML)
    write(f, urlIDmatch.match, "\n")
end

close(f)
There can always be a server that detects your scraper and intentionally takes a very long time to respond.
Basically, since scraping is an IO-intensive operation, you should do it using a large number of asynchronous tasks. Moreover, this should be combined with the readtimeout parameter of the get function. Hence your code will look more or less like this:
asyncmap(1:nrow(URLs); ntasks=50) do n
    row = URLs[n, :]
    urlResponse = HTTP.get(row[:url], readtimeout=10)
    # the rest of your code comes here
end
Even when some servers are delaying their responses, many other connections will still be making progress.
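For comparison only, the same pattern (a bounded pool of concurrent workers plus a hard per-request timeout) can be sketched in Python; the URL list, worker count, and timeout below are placeholders mirroring the Julia snippet, not values from the original question.

import concurrent.futures
import requests

def fetch(url, timeout=10):
    # One request with a hard timeout, so a single slow server cannot stall the run.
    try:
        return requests.get(url, timeout=timeout).text
    except requests.RequestException:
        return None  # skip unreachable or deliberately slow URLs

urls = ["https://example.com/page/1/view"]  # placeholder list of URLs

# Up to 50 requests in flight at once, mirroring ntasks=50 above.
with concurrent.futures.ThreadPoolExecutor(max_workers=50) as pool:
    bodies = list(pool.map(fetch, urls))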

Spotify API - "raw" data class arbitrarily returned for some requests

I am compiling data about a set of artists on Spotify: data for each song on each album of each artist. I use a for loop to automate this API request for about 80 different artists in the data frame albums, then assign a bit of info about each album in albums to the list object returned from the API.
The problem: my API call doesn't always return a list object. Sometimes it returns an object where class() = raw.
# REQUEST DATA
# ------------
library(plyr)
library(httr)
library(lubridate)

collablist <- as.list(NULL)

for (i in 1:nrow(albums)){
  tracks_in_one_album <- as.list(NULL)
  URI = paste0('https://api.spotify.com/v1/albums/', albums$album_uri[i], '/tracks')
  response = GET(url = URI, add_headers(Authorization = HeaderValue))
  tracks_in_one_album = content(response)
  tracks_in_one_album[["album"]] = albums$album_name[i]
  tracks_in_one_album[["album_artist"]] = albums$artists[i]
  collablist[[i]] <- tracks_in_one_album
  print(albums$artist_name[i])
}
The loop runs for somewhere between 50 and 300 albums before I inevitably get the following message:
Error in tracks_in_one_album[["album"]] <- albums$album_name[i] :
incompatible types (from character to raw) in subassignment type fix
When I assign the character object albums$album_name[i] to the API-returned object tracks_in_one_album while it is a list, I have no issue. But occasionally the object is of class raw. Changing it to a list by wrapping the content() call in as.list() prevents the error from occurring, but it doesn't really fix the issue, because for the requests where the data come in as raw instead of as a list by default, they're sort of mangled (just a raw vector rather than a parsed list).
The craziest part? This doesn't happen consistently. It could happen for the 4th album of Cat Stevens one time; if I rerun, that Cat Stevens album will be fine and get pulled into R as a list but perhaps the second album for Migos will come in raw instead.
My Question - why are the data not always coming in as a list when I make a request? How is it possible that this could be happening in such a non-reproducible way?
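A general defensive pattern when an API sometimes hands back unparsed content is to inspect the response's status code and content type before touching the body (in httr, status_code() and http_type() expose the same information). A rough sketch of the idea, shown here in Python with the requests library; the function name and token variable are placeholders, and the endpoint is the one from the question:

import requests

def get_album_tracks(album_uri, token):
    # Placeholder request mirroring the question's endpoint.
    resp = requests.get(
        "https://api.spotify.com/v1/albums/{}/tracks".format(album_uri),
        headers={"Authorization": "Bearer {}".format(token)},
    )
    # Only parse when the server returned JSON with a 200; anything else
    # (rate-limit responses, error pages, ...) is surfaced instead of being
    # silently coerced into an unexpected object.
    content_type = resp.headers.get("Content-Type", "")
    if resp.status_code != 200 or "application/json" not in content_type:
        print("Unexpected response:", resp.status_code, content_type)
        return None
    return resp.json()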

How do I cache vectorized calls that take user input in R?

I am trying to calculate a field for all rows of a large dataset. The function to calculate it is from the package taxize, and uses an HTTP request to query an external site for the right ID number. It is searching by scientific name, and often there are multiple results, in which case this function asks for user input. I would like the function to cache my selection and return that ID number every time the same call is made from then on. I have tried with my own caching function and with memoizedCall() from the package R.cache but every time it hits the second entry of the same scientific name it still prompts me for user input. I feel like I am misunderstanding something basic about how vectorization works. Sorry for my ignorance but any advice is appreciated.
Here is the code I used as a custom caching function.
check_tsn <- function(data, tsn_list){
  print(data)
  print(tsn_list)
  if (is.null(tsn_list$data)){
    tsn_list$data = taxize::get_tsn(data)
    print('added to tsn_list')
  }
  return(tsn_list$data)
}

tsn_list <- vector(mode = "list", nrow(wanglang))

Genus.Species <- c('Tamiops swinhoei', 'Bos taurus', 'Tamiops swinhoei')
IUCN.ID <- c('21382', '', '21382')
species <- data.frame(Genus.Species, IUCN.ID)

species$TSN.ID = check_tsn(species$Genus.Species, tsn_list)
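If the goal is for the second 'Tamiops swinhoei' to reuse the stored answer, the cache needs to be keyed by each individual scientific name rather than attached to the whole vector. A minimal sketch of that per-name pattern, written in Python for brevity; lookup_tsn is a placeholder standing in for the interactive taxize::get_tsn call:

tsn_cache = {}

def cached_tsn(name, lookup_tsn):
    # Each distinct name triggers at most one (possibly interactive) lookup;
    # repeats of the same name reuse the stored answer.
    if name not in tsn_cache:
        tsn_cache[name] = lookup_tsn(name)
    return tsn_cache[name]

# Hypothetical usage over a vector of names with duplicates:
# tsn_ids = [cached_tsn(n, lookup_tsn) for n in genus_species]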

Accessing Spotify API with RSpotify to obtain genre information for multiple artists

I am using RStudio 3.4.4 on a Windows 10 machine.
I have a vector of artist names and I am trying to get genre information for them all on Spotify. I have successfully set up the API, and the RSpotify package is working as expected.
I am trying to build up to creating a function, but I am failing pretty early on.
So far I have the following, but it is returning unexpected results:
len <- nrow(Artist_Nam)
artist_info <- character(len)  # preallocate one entry per artist

for (i in 1:len){
  ifelse(nrow(searchArtist(Artist_Nam$ArtistName[i], token = keys)) >= 1,
         artist_info[i] <- searchArtist(Artist_Nam$ArtistName[i], token = keys)$genres[1],
         artist_info[i] <- "")
}

artist_info
artist_info
I was expecting this to return a list of genres, with an empty entry "" for artists where there is no match on Spotify.
Instead, what is returned is a list whose entries are populated with genres (on inspection these genres are correct), with "" where there is no match. However, something odd happens from [73] onwards (I have over 3,000 artists): the list now only returns "".
This is despite the fact that when I actually look these artists up manually using searchArtist(), there are matches.
I wonder if anyone has any suggestions or has experienced anything like this before?
There may be a rate limit on the number of requests you can make per minute, and you may just be hitting that limit. Try adding a small delay with Sys.sleep() within your loop so that you don't hit their API hard enough to be throttled.

How to handle Twitter "Rate limit encountered ..." in R?

I'm totally new to this. I'm using the package "twitteR" in RStudio to access the Twitter REST APIs and pull data from Twitter.
username <- "netflix"
user <- getUser(username)
followers <- user$getFollowers()
followers <- twListToDF(followers)
Then I got the following notification
Rate limit encountered & retry limit reached - returning partial results
It's been over 15 minutes and I haven't gotten any results yet, so I modified the code to
followers <- user$getFollowers(150)
And got a result, but it's a data frame full of numbers. What does this mean?
794705342974328832
39308631
807216507939880960
808263559599845376
2507888091
174977598
338081716
807803521810698240
2775999428
2570734208
I was wondering if something was wrong with my API keys, so I changed the code to the following
searchNF <- searchTwitter("#netflix bad OR suck OR terrible OR disaster OR :(", n=1500, since=as.character(Sys.Date()-3))
negativeTweets <- length(searchNF)
negativeSentiment <- negativeTweets/1500
And got the following notice
1500 tweets were requested but the API can only return 52
> negativeTweets
Is this normal?
