requests.get(url) is hanging after 5 iterations - web scraping

I am attempting to run a web-scraping script on Indeed using BeautifulSoup and loop through the different result pages. However, after 2-6 iterations, requests.get(url) hangs and stops finding the next page. I have read that this might have something to do with the server blocking me, but that would have blocked the original request as well, and what I have found online says that Indeed allows web scraping. I have also heard that I should set a header, but I am unsure how to do that. I am running the latest version of Safari on macOS 12.4.
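For reference, setting a header with requests just means passing a headers dictionary to requests.get. A minimal sketch, assuming an illustrative browser-style User-Agent string and a placeholder search URL (neither is taken from the question):

import requests

# Illustrative values: a browser-like User-Agent so the request does not
# advertise itself as the default python-requests client, plus a timeout.
headers = {
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 12_4) "
                  "AppleWebKit/605.1.15 (KHTML, like Gecko) Version/15.5 Safari/605.1.15"
}
url = "https://www.indeed.com/jobs?q=software+intern"  # placeholder URL
response = requests.get(url, headers=headers, timeout=10)
print(response.status_code)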

A solution I came up with, though it does not answer the question specifically, is to use a try/except statement and set a timeout value on the request. Once the timeout value is reached, the except branch runs, sets a boolean flag, and the loop continues and tries again. Code is inserted below.
import requests

# get_url(...) builds the search URL and is defined elsewhere in the script.
i = 0
while i < 10:
    url = get_url('software intern', '', i)
    print("Parsing Page Number:" + str(i + 1))
    error = False
    try:
        response = requests.get(url, timeout=10)
    except requests.exceptions.Timeout:
        error = True
    if error:
        # Timed out: retry the same page instead of moving on.
        print("Trying to connect to webpage again")
        continue
    i += 1
I am leaving the question unanswered for now, however, as I still don't know the root cause of this issue and this solution is just a workaround.

Related

Selenium stopping mid-loop without any error or exception

I was working on a project to scrape data for multiple cities from a website. There are about 1000 cities to scrape, and partway through the for-loop the webdriver window simply stops navigating to the requested site. The program is still running and no error or exception is thrown; the webdriver window just stays on the same webpage.
There is a common feature of the pages it stops at: they are always the ones that require button clicks. The button-click parts of the code should be correct, as they worked as intended for hundreds of cities before the stall.
However, on each run the program stops at different cities, and I find no correlation between them, which leaves me very confused and unable to identify the root of the problem.
data_lookup = 'https://www.numbeo.com/{}/in/{}'
tabs = ['cost-of-living', 'property-investment', 'quality-of-life']
cities_stat = []
browser = webdriver.Chrome(executable_path=driver_path, chrome_options=ChromeOptions)
for x in range(len(cities_split)):
    city_stat = []
    for tab in tabs:
        browser.get(data_lookup.format(tab, str(cities_split[x][0]) + '-' + str(cities_split[x][1])))
        try:
            city_stat.append(read_data(tab))
        except NoSuchElementException:
            try:
                city_button = browser.find_element(By.LINK_TEXT, cities[x])
                city_button.click()
                city_stat.append(read_data(tab))
            except NoSuchElementException:
                try:
                    if len(cities_split[x]) == 2:
                        browser.get(data_lookup.format(tab, str(cities_split[x][0])))
                        try:
                            city_stat.append(read_data(tab))
                        except NoSuchElementException:
                            city_button = browser.find_element(By.LINK_TEXT, cities[x])
                            city_button.click()
                            city_stat.append(read_data(tab))
                    elif len(cities_split[x]) == 3:
                        city_stat.append(initialize_data(tab))
                except:
                    city_stat.append(initialize_data(tab))
    cities_stat.append(city_stat)
I have tried using WebDriverWait, to no avail. All it does is make the NoSuchElementException become a TimeoutException.
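For context, this is roughly what a WebDriverWait attempt looks like with Selenium's Python bindings; the 15-second timeout and the fallback to initialize_data are illustrative assumptions, not taken from the original code:

from selenium.common.exceptions import TimeoutException
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

# Illustrative sketch: wait up to 15 seconds for the city link to become
# clickable before clicking it; browser, cities, x, tab and city_stat refer
# to the names used in the snippet above.
try:
    city_button = WebDriverWait(browser, 15).until(
        EC.element_to_be_clickable((By.LINK_TEXT, cities[x]))
    )
    city_button.click()
    city_stat.append(read_data(tab))
except TimeoutException:
    # As described above, the NoSuchElementException simply becomes a TimeoutException here.
    city_stat.append(initialize_data(tab))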

Scraping string from a large number of URLs with Julia

Happy New Year!
I have just started to learn Julia, and the first mini challenge I have set myself is to scrape data from a large list of URLs.
I have roughly 50k URLs (which I successfully parsed from a JSON file with Julia using a regex) in a CSV file. I want to scrape each one and return a matched string ("/page/12345/view", where 12345 is any integer).
I managed to do so using HTTP and Queryverse (although I had started with CSV and CSVFiles; I'm trying out packages for learning purposes), but the script seems to stop after just under 2k URLs. I can't see an error such as a timeout.
May I ask if anyone can advise what I'm doing wrong or how I can approach it differently? Explanations/links to learning resources would also be great!
using HTTP, Queryverse
URLs = load("urls.csv") |> DataFrame
patternid = r"\/page\/[0-9]+\/view"
touch("ids.txt")
f = open("ids.txt", "a")
for row in eachrow(URLs)
    urlResponse = HTTP.get(row[:url])
    if Int(urlResponse.status) == 404
        continue
    end
    urlHTML = String(urlResponse.body)
    urlIDmatch = match(patternid, urlHTML)
    write(f, urlIDmatch.match, "\n")
end
close(f)
There can always be a server that detects your scraper and intentionally takes a very long time to respond.
Basically, since scraping is an I/O-intensive operation, you should do it with a large number of asynchronous tasks. Moreover, this should be combined with the readtimeout parameter of the get function. Hence your code will look more or less like this:
asyncmap(1:nrow(URLs); ntasks=50) do n
    row = URLs[n, :]
    urlResponse = HTTP.get(row[:url], readtimeout=10)
    # the rest of your code comes here
end
Even if some servers delay their responses, many other connections will still be making progress.

withTimeout not working inside functions?

I am having some issues with R.utils::withTimeout(). It doesn't seem to take the timeout option into account at all, or only sometimes. Below is the function I want to use:
scrape_player <- function(url, time){
  raw_html <- tryCatch({
    R.utils::withTimeout({
      RCurl::getURL(url)
    },
    timeout = time, onTimeout = "warning")}
  )
  html_page <- xml2::read_html(raw_html)
}
Now when I use it:
scrape_player("http://nhlnumbers.com/player_stats/1", 1)
it either works fine and I get the HTML page I want, or I get an error message telling me that the elapsed time limit was reached, or (and this is my problem) it takes a very long time, way more than 1 second, to finally return an HTML page with a 500 error.
Shouldn't RCurl::getURL() try for only 1 second (in the example) to get the HTML page and, if not, simply return a warning? What am I missing?
OK, what I did as a workaround: instead of returning the page, I write it to disk. This doesn't solve the issue that withTimeout doesn't seem to work, but at least I can see that I'm getting pages written to disk, slowly but surely.

Invalid procedure call or argument: 'left'

I am facing the following error:
Microsoft VBScript runtime error '800a0005'
Invalid procedure call or argument: 'left'
/scheduler/App.asp, line 16
The line is:
point1 = left(point0,i-1)
This code works perfectly on another server, but on this server it shows the error above. I can only guess it has something to do with system or IIS settings, or maybe something else, but it is not the code itself (as it works fine on the other server).
If i is equal to zero, then this will call Left() with -1 as the length parameter, which results in an Invalid procedure call or argument error. Verify that i >= 1 so that i - 1 is never negative.
Just experienced this problem myself - a script that had been running seamlessly for many months suddenly collapsed with this error. It seems that the scripting engine falls over for whatever reason and string functions cease being able to handle in-function calculations.
I appreciate it's been quite a while since this question was asked, but in case anyone encounters this in the future...
Replace
point1 = left(point0, i-1)
with
j = i-1
point1 = left(point0, j)
... and it will work.
Alternatively, simply re-boot the server (unfortunately, simply re-starting the WWW service won't fix it).

Vendors black box function can only be called successfully once

(first question here, sorry if I am breaking a piece of etiquette)
My site is running on an eCommerce back end provider that I subscribe to. They have everything in classic ASP. They have a black box function called import_products that I use to import a given text file into my site's database.
The problem is that if I call the function more than once, something breaks. Here is my example code:
for blah = 1 to 20
    thisfilename = "fullcatalog_" & blah & ".csv"
    Response.Write thisfilename & "<br>"
    Response.Flush
    Call Import_Products(3,thisfilename,1)
Next
Response.End
The first execution of the Import_Products function works fine. The second time I get:
Microsoft VBScript runtime error '800a0009'
Subscript out of range: 'i'
The filenames all exist. That part is fine. There are no bugs in my calling code. I have tried checking the value of "i" before each execution. The first time the value is blank, and before the second execution the value is "2". So I tried setting it to null during each loop iteration, but that didn't change the results at all.
I assume that the function is setting a variable or opening a connection during its execution, but not cleaning it up, and then not expecting it to already be set the second time. Is there any way to find out what this would be? Or somehow reset the condition back to nothing so that the function will be 'fresh'?
The function is in an unreadable include file so I can't see the code. Obviously a better solution would be to go with the company support, and I have a ticket it in with them, but it is like pulling teeth to get them to even acknowledge that there is a problem. Let alone solve it.
Thanks!
EDIT: Here is a further simplified example of calling the function. The first call works. The second call fails with the same error as above.
thisfilename = "fullcatalog_testfile.csv"
Call Import_Products(3,thisfilename,1)
Call Import_Products(3,thisfilename,1)
Response.End
The likely cause of the error is the two numeric parameters of the Import_Products subroutine.
Import_Products(???, FileName, ???)
The values are 3 and 1 in your example but you never explain what they do or what they are documented to do.
EDIT: Since correcting the vendor subroutine is impossible, but it always works the first time it is called, let's use an HTTP redirect instead of a FOR loop so that it technically gets called only once per page execution.
www.mysite.tld/import.asp?current=1&end=20

curr = CInt(Request.QueryString("current"))
lastPage = CInt(Request.QueryString("end"))   ' renamed from "end", which is a reserved word in VBScript
If curr <= lastPage Then
    thisfilename = "fullcatalog_" & curr & ".csv"
    Call Import_Products(3,thisfilename,1)
    Response.Redirect("www.mysite.tld/import.asp?current=" & (curr + 1) & "&end=" & lastPage)
End If
Note: the above was written in my browser and is untested, so syntax errors may exist.
