Proper settings to avoid blocking while scraping - web-scraping

For scraping the website I use scraproxy to create a pool of 15 proxies across 2 locations.
The website auto-redirects (302) to a reCAPTCHA page when a request looks suspicious.
I use the following settings in Scrapy. I was able to scrape only 741 pages, at a relatively low speed (5 pages/min).
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 30.0
AUTOTHROTTLE_MAX_DELAY = 260.0
AUTOTHROTTLE_DEBUG = True
DOWNLOAD_DELAY = 10
BLACKLIST_HTTP_STATUS_CODES = [302]
Any tips on how I can avoid being blacklisted? It seems that increasing the number of proxies could solve this problem, but maybe there is room for improvement in the settings as well.

If you can afford it, Crawlera is probably the best way to go.
Depending on the type of protection, however, using Splash might be enough.
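If Crawlera or Splash are not an option, there is also some room in the settings themselves. Below is a hedged sketch using only stock Scrapy options; the values are illustrative, not tuned for this particular site. The idea is to look less machine-like: jittered delays, one request at a time per proxy IP, no session cookies, and treating the captcha redirect as a retryable failure so the request is re-issued (through another proxy, assuming the proxy middleware picks a proxy per request) rather than dropped.

AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 10.0
AUTOTHROTTLE_MAX_DELAY = 120.0
AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0  # aim for roughly one request in flight per remote

DOWNLOAD_DELAY = 10
RANDOMIZE_DOWNLOAD_DELAY = True        # vary the delay between 0.5x and 1.5x of DOWNLOAD_DELAY

CONCURRENT_REQUESTS_PER_IP = 1         # spread requests evenly over the proxy pool
COOKIES_ENABLED = False                # do not carry a trackable session around

REDIRECT_ENABLED = False               # let the 302 reach the retry middleware instead of being followed
RETRY_ENABLED = True
RETRY_HTTP_CODES = [302]               # treat the captcha redirect as a transient failure
RETRY_TIMES = 5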

Related

RCrawler: a way to limit the number of pages that RCrawler collects? (not crawl depth)

I'm using RCrawler to crawl ~300 websites. The size of the websites is quite diverse: some are small (a dozen or so pages) and others are large (thousands of pages per domain). Crawling the latter is very time-consuming, and, for my research purposes, the added value of more pages decreases once I already have a few hundred.
So: is there a way to stop the crawl once x pages have been collected?
I know I can limit the crawl with MaxDepth, but even at MaxDepth=2 this is still an issue. MaxDepth=1 is not desirable for my research. Also, I'd prefer to keep MaxDepth high so that the smaller websites do get crawled completely.
Thanks a lot!
How about implementing a custom function for the FUNPageFilter parameter of the Rcrawler function? The custom function checks the number of files in DIR and returns FALSE if there are too many files.

How can we make our scraping look like a real person browsing?

So, I am scraping a website, but every now and then I get temp-banned for a few minutes. I am using headers in my scraping code, but I was wondering whether there is anything more we can do to look like a real person rather than just a bot.
I researched a bit and found out that we can also slow the scraping down a little to bypass detection.
I'd like to hear your thoughts and suggestions.
from fake_useragent import UserAgent  # assuming UserAgent comes from the fake_useragent package

ua = UserAgent()
hdr = {'User-Agent': ua.random,  # pick a random browser User-Agent for each run
       'Accept-Charset': 'ISO-8859-1,utf-8;q=0.7,*;q=0.3',
       'Accept-Encoding': 'none',
       'Accept-Language': 'en-US,en;q=0.8',
       'Connection': 'keep-alive'}
One of the things you can do is to make your time.sleep random. Bots keep a steady pace while humans are erratic.
You need to import the random library, then change your time.sleep to something like this:
import random
import time

time.sleep(random.randint(3, 15))  # pause a random 3-15 seconds between requests
One way to avoid being banned is to not blast the site with so much force, because then they will definitely take note: a human using a browser cannot hit the site at that speed, so it must be a bot. Going a little slower on the number of requests you send per second helps. Another way to get around this is by using proxies. If you get banned for some time, it means they have taken note of your IP address and blocked it. If you are using proxies, then when they block one IP you can switch over to another and continue on your merry scraping way. This is one of the major components of more complex bots and spiders, and it is not too difficult to do.
import requests
from bs4 import BeautifulSoup as bs

def crawler(url, headers, proxies):
    # headers: normal request headers (e.g. a rotating User-Agent)
    # proxies: e.g. {"http": "http://10.10.1.10:3128", "https": "http://10.10.1.10:3128"} (placeholder address)
    response = requests.get(url, headers=headers, proxies=proxies)
    return bs(response.text, "html.parser")
With this your IP address is hidden. Not all proxy addresses work in all locations, so what I tend to do when working with them is keep a bunch of them in a file somewhere. I read the file and loop through the proxies until I find one that works in my present location, and then scraping can begin without the fear of my IP getting blocked; a sketch of that loop follows below. Check out this post if you are still in doubt about how proxies work with the requests library and Beautiful Soup.
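A minimal sketch of that read-a-file-and-try-each-proxy loop, assuming one proxy URL per line in a plain text file; the file name and test URL below are placeholders, not part of the original answer.

import requests

def load_proxies(path="proxies.txt"):
    # One proxy per line, e.g. "http://10.10.1.10:3128" (placeholder address)
    with open(path) as f:
        return [line.strip() for line in f if line.strip()]

def find_working_proxy(proxies, test_url="https://httpbin.org/ip", timeout=5):
    for proxy in proxies:
        try:
            resp = requests.get(test_url,
                                proxies={"http": proxy, "https": proxy},
                                timeout=timeout)
            if resp.ok:
                return proxy      # first proxy that responds from this location
        except requests.RequestException:
            continue              # dead or blocked proxy, try the next one
    return None

working = find_working_proxy(load_proxies())
if working:
    print("Scraping can start through", working)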

How to track internet speed data using Google Tag Manager?

I want to track the internet speed (2G, 3G, 4G, WiFi, etc.) of users with Google Tag Manager. Let me know the steps for this.
It is not possible to do that with Google Tag Manager by default.
It is definitely a bad practice for your website visitors, but if you want to know their speed in MB/s (it is not very precise), you can do the following:
1) Create an API endpoint on your website that responds with exactly 1 MB of any data (a minimal server-side sketch follows after these steps).
2) Write JS code that measures how long it takes to download that 1 MB of data:
var start_time = new Date().getTime();
jQuery.get('your-url' + '?timestamp=' + new Date().getTime().toString(),
  function(data, status, xhr) {
    var request_time = new Date().getTime() - start_time;
    var mbs = 1 / (request_time / 1000.0); // this variable will hold the speed in MB/s
    dataLayer.push({'speed': mbs});
  }
);
3) Read this dataLayer value in GTM.
Again, it is a bad practice and it is not very precise.
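For step 1, a minimal sketch of such an endpoint, assuming a Python backend is acceptable (any server-side stack works); the port and payload below are illustrative.

from http.server import BaseHTTPRequestHandler, HTTPServer

PAYLOAD = b"0" * 1024 * 1024  # exactly 1 MB of filler bytes

class SpeedTestHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        self.send_response(200)
        self.send_header("Content-Type", "application/octet-stream")
        self.send_header("Content-Length", str(len(PAYLOAD)))
        self.send_header("Cache-Control", "no-store")  # force a real network download each time
        self.end_headers()
        self.wfile.write(PAYLOAD)

if __name__ == "__main__":
    HTTPServer(("", 8000), SpeedTestHandler).serve_forever()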
Victor's answer, as imprecise as the results may be, is probably the easiest way (and for most use cases having a few buckets for the users should be okay). It will also work in any halfway modern browser. An implementation that does not rely on jQuery can be found in this very old answer (you'd need to adapt it to push the value to the dataLayer).
For completeness' sake:
- There are also dedicated services to measure connection speed (I'm not affiliated, I just googled it).
- If you are feeling adventurous, you might want to explore the Network Information API, which is currently mostly useless due to almost non-existent browser support, but might be implemented at some point in the future (the downlinkMax property should give you the maximum possible download speed).

Google Search Appliance Reports

We use GSA 7.2 and have more than 500k docs in the index from a large number of subdomains. I am looking for the page where the search was performed. The GSA is integrated with Google Analytics already. When I look in Search Terms, I see the terms searched on, but I cannot tell which site from the collection the user was on, since GA includes only the URI, i.e. /search?q=... I tried looking in Referral too, but no success. Any answers?
Thanks.
I see this question is old, but going to answer anyway.
The Google Search Appliance does not track the Referrer (the web page that sent the GET request).
This leaves you with two options to collect that data:
1) Insert a web proxy between your site(s) and the GSA(s). This can have a performance impact of 250-500 ms, so don't use this option if blazing speed is a priority. You would have this proxy log the Referrer and the GET URL, so that you can match them against the reports from the GSA; a minimal sketch of such a proxy follows after the list below.
2) Rearrange your Collections to reflect the sites that could be sending requests. You can have a maximum of 200 Collections without impacting performance, so this should work for you unless you already have a complicated arrangement of Collections.
a) An arrangement by Site only could look like this:
- fromIntranetSite
- fromMarketingSite
- fromTokyoSite
- fromHQIntranet
...
b) An arrangement by contents and by Site could look like this:
- FAQsfromIntranetSite
- ProductsFromMarketingSite
- ResourcesFromHQIntranet
...
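For option 1, a minimal sketch of such a logging pass-through proxy, assuming a Python process in front of the appliance is acceptable; the appliance address, port, and log file below are placeholders. It logs the Referer header together with the search URL so the two can later be matched against the GSA's search reports.

import logging
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

GSA_BASE = "http://gsa.example.com"  # placeholder appliance address

logging.basicConfig(filename="gsa_referrers.log", level=logging.INFO)

class LoggingProxy(BaseHTTPRequestHandler):
    def do_GET(self):
        # Record which page sent the search together with the query URL.
        referrer = self.headers.get("Referer", "-")
        logging.info("%s %s", referrer, self.path)
        # Forward the request to the GSA and relay the response to the client.
        with urllib.request.urlopen(GSA_BASE + self.path) as upstream:
            body = upstream.read()
            content_type = upstream.headers.get("Content-Type", "text/html")
        self.send_response(200)
        self.send_header("Content-Type", content_type)
        self.end_headers()
        self.wfile.write(body)

if __name__ == "__main__":
    HTTPServer(("", 8080), LoggingProxy).serve_forever()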

Efficiently web-scraping a website without an API?

Considering that most languages have web-scraping functionality either built in or provided by third-party libraries, this is more of a general web-scraping question.
I have a site from which I would like to pull information across about 6 different pages. This normally would not be that bad; unfortunately, though, the information on these pages changes roughly every ten seconds, which could mean over 2000 queries an hour (which is simply not okay). There is no API for the website I have in mind either. Is there any efficient way to get the amount of information I need without flooding them with requests, or am I out of luck?
At best, the site might return an HTTP 304 Not Modified status when you make a conditional request, indicating that you do not need to download the page because nothing has changed. If the site is set up to do so, this might help decrease bandwidth, but it would still require the same number of requests (a sketch of such a conditional request follows at the end of this answer).
If there's a consistent update schedule, then at least you know when to make the requests - but you'll still have to ask (i.e.: make a request) to find out what information has changed.
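A minimal sketch of such a conditional request with the requests library: resend the ETag/Last-Modified values from the previous response and only re-parse the page when the server reports a change. The URL below is a placeholder.

import requests

url = "https://example.com/page"  # placeholder target
etag = None
last_modified = None

# Build conditional headers from the previous response, if we have one.
headers = {}
if etag:
    headers["If-None-Match"] = etag
if last_modified:
    headers["If-Modified-Since"] = last_modified

response = requests.get(url, headers=headers)
if response.status_code == 304:
    print("Not modified - skip re-parsing, nothing has changed.")
else:
    etag = response.headers.get("ETag")
    last_modified = response.headers.get("Last-Modified")
    print("Page changed - parse the new content here.")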
