CasperJS hangs when accessing Google Keyword Planner - web-scraping

My goal is to grab keywords from Google Keyword Planner, because the API only supports getting search volume for the last 12 months, not the last 24 months.
I mainly use SimpleBrowser, so I am new to CasperJS. I googled some scripts, read the documentation and then combined the following script.
I can log in to Google, even to the AdWords dashboard, but when I try to access the Keyword Planner, CasperJS freezes. Any idea?
JS script
/**
 * Basic vars
 * @type Module utils|Module utils
 */
var utils = require('utils');
var casper = require('casper').create({
    verbose: true,
    logLevel: "debug",
    waitTimeout: 20000
});
/**
 * Start Casper, set a large viewport and the browser signature
 */
casper.start();
casper.viewport(1280, 1024);
casper.userAgent('Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2623.110 Safari/537.36');
/**
 * Log the User-Agent header of each request (to verify the userAgent override is applied)
 *
 * @param {Object} resource
 */
casper.on('resource.requested', function(resource) {
    for (var obj in resource.headers) {
        var name = resource.headers[obj].name;
        var value = resource.headers[obj].value;
        if (name == "User-Agent") {
            this.echo(value);
        }
    }
});
/**
 * Control navigation locking: lock navigation for certain URLs, keep it unlocked otherwise
 */
casper.on('navigation.requested', function (url, navigationType, navigationLocked, isMainFrame) {
    if (url.indexOf('adwords.google.com/ko/KeywordPlanner/Home?__') != -1 || url.indexOf('facebook.com') != -1) {
        this.page.navigationLocked = true;
    } else {
        this.page.navigationLocked = false;
    }
});
/**
 * Main script
 */
casper.open("https://accounts.google.com/ServiceLogin?service=adwords&continue=https://adwords.google.com/um/identity?#identifier").then(function () {
    console.log("page loaded...");
    this.fillSelectors('form#gaia_loginform', {
        'input[name="Email"]': 'myemail@gmail.com',
    }); // Fill the email box with the email address
    console.log("email filled...");
    this.click("#next"); // Click the Next button
    console.log("next pressed...");
    this.wait(500, function () { // Wait for the next page to load
        console.log("Inside WAIT...");
        this.waitForSelector("#Passwd", // Wait for the password box
            function success() {
                console.log("SUCCESS...");
                this.fillSelectors('form#gaia_loginform', {
                    'input[name="Passwd"]': 'myPassword',
                }); // Fill the password box with the password
                this.click("#signIn"); // Click the sign-in button
                console.log("Clicked to sign in...");
                this.wait(5000, function () {
                    console.log("Logged in...");
                    this.capture('1.png');
                    /**
                     * Open the keyword planner
                     */
                    casper.open('https://adwords.google.com/ko/KeywordPlanner/Home?').then(function() {
                        this.wait(10000, function() {
                            console.log("KP opened...");
                            this.capture('2.png');
                        });
                    });
                });
            },
            function fail() {
                console.log("FAIL...");
                this.capture('exit.png');
            }
        );
    });
});
casper.run();
Last console output when Casper freezes
[debug] [phantom] url changed to "https://adwords.google.com/ko/KeywordPlanner/Home?"
[debug] [phantom] Navigation requested: url=https://adwords.google.com/um/identity?dst=/ko/KeywordPlanner/Home?, type=Other, willNavigate=true, isMainFrame=true
Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2623.110 Safari/537.36
[debug] [phantom] Successfully injected Casper client-side utilities
[debug] [phantom] Navigation requested: url=https://adwords.google.com/ko/KeywordPlanner/Home?__u=6961347906&__c=8635630266&authuser=0, type=Other, willNavigate=true, isMainFrame=true
Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2623.110 Safari/537.36
[debug] [phantom] url changed to "https://adwords.google.com/ko/KeywordPlanner/Home?__u=6961347906&__c=8635630266&authuser=0"
Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2623.110 Safari/537.36
Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2623.110 Safari/537.36
Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2623.110 Safari/537.36
Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2623.110 Safari/537.36
Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2623.110 Safari/537.36
Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2623.110 Safari/537.36
Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2623.110 Safari/537.36
[debug] [phantom] Navigation requested: url=about:blank, type=Other, willNavigate=false, isMainFrame=false
Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2623.110 Safari/537.36
Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2623.110 Safari/537.36
Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2623.110 Safari/537.36

Related

R httr POST function returns "Send failure: Connection was reset"

I am trying to scrape the table under the Market Segment tab, as in the image below. The code logic below used to work for similar tasks; however, it is not working here and returns
Error Send failure: Connection was reset
link <- 'https://www.egx.com.eg/en/prices.aspx'
headers.id <- c('User-Agent' = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.114 Safari/537.36',
                'Referer' = 'https://www.egx.com.eg/en/prices.aspx',
                'Content-Type' = 'application/x-www-form-urlencoded',
                'Origin' = 'https://www.egx.com.eg',
                'Host' = 'www.egx.com.eg',
                'Content-Type' = 'application/x-www-form-urlencoded',
                'sec-ch-ua' = '" Not;A Brand";v="99", "Google Chrome";v="91", "Chromium";v="91"'
)
pgsession <- session(link, httr::add_headers(.headers = headers.id), verbose())
pgform <- html_form(pgsession)[[1]]
page <- POST(link, body = list(
    '__EVENTTARGET' = pgform$fields$`__EVENTTARGET`$value,
    '__EVENTARGUMENT' = pgform$fields$`__EVENTARGUMENT`$value,
    '__VIEWSTATE' = pgform$fields$`__VIEWSTATE`$value,
    'ctl00$H$txtSearchAll' = pgform$fields$`ctl00$H$txtSearchAll`$value,
    'ctl00$H$rblSearchType' = pgform$fields$`ctl00$H$rblSearchType`$value,
    'ctl00$H$rblSearchType' = pgform$fields$`ctl00$H$rblSearchType`$value,
    'ctl00$H$imgBtnSearch' = pgform$fields$`ctl00$H$imgBtnSearch`$value,
    'ctl00$C$S$TextBox1' = pgform$fields$`ctl00$C$S$TextBox1`$value
    ),
    encode = "form", verbose()
)
I kept searching until I found a working solution using rvest, as follows:
link <- 'https://www.egx.com.eg/en/prices.aspx'
headers.id <- c('Accept' = '*/*',
                'Accept-Encoding' = 'gzip, deflate, br',
                'Accept-Language' = 'en-US,en;q=0.9',
                'Cache-Control' = 'no-cache',
                'Connection' = 'keep-alive',
                'Content-Type' = 'application/x-www-form-urlencoded',
                'Host' = 'www.egx.com.eg',
                'Origin' = 'https://www.egx.com.eg',
                'Referer' = 'https://www.egx.com.eg/en/prices.aspx',
                'sec-ch-ua' = '" Not;A Brand";v="99", "Microsoft Edge";v="91", "Chromium";v="91"',
                'sec-ch-ua-mobile' = '?0',
                'Sec-Fetch-Dest' = 'empty',
                'Sec-Fetch-Mode' = 'cors',
                'Sec-Fetch-Site' = 'same-origin',
                'User-Agent' = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.114 Safari/537.36 Edg/91.0.864.59',
                'X-MicrosoftAjax' = 'Delta=true'
)
pgsession <- session(link, httr::add_headers(.headers = headers.id))
pgform <- html_form(pgsession)[[1]]
filled_form <- html_form_set(pgform,
    '__EVENTTARGET' = 'ctl00$C$S$lkMarket',
    '__EVENTARGUMENT' = pgform$fields$`__EVENTARGUMENT`$value,
    '__VIEWSTATE' = pgform$fields$`__VIEWSTATE`$value,
    'ctl00$H$txtSearchAll' = pgform$fields$`ctl00$H$txtSearchAll`$value,
    'ctl00$H$rblSearchType' = pgform$fields$`ctl00$H$rblSearchType`$value,
    'ctl00$H$rblSearchType' = "1",
    'ctl00$H$imgBtnSearch' = pgform$fields$`ctl00$H$imgBtnSearch`$value,
    'ctl00$C$S$TextBox1' = pgform$fields$`ctl00$C$S$TextBox1`$value
)
page <- session_submit(pgsession, filled_form)
# in the above example, change __EVENTTARGET (e.g. to "ctl00$ContentPlaceHolder1$DataList2$ctl02$lnk_blok") to get a different table
page.html <- read_html(page$response) %>% html_table() %>% .[[7]]
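For comparison, the same ASP.NET postback pattern (carry over the hidden __VIEWSTATE fields, then post back with the desired __EVENTTARGET) can be sketched in Python. This is only an illustrative, untested sketch that assumes the requests and beautifulsoup4 packages and reuses the field names shown above:
# Hypothetical sketch of the ASP.NET postback pattern above, in Python.
import requests
from bs4 import BeautifulSoup

URL = "https://www.egx.com.eg/en/prices.aspx"

with requests.Session() as s:
    s.headers.update({"User-Agent": "Mozilla/5.0", "Referer": URL})
    soup = BeautifulSoup(s.get(URL).text, "html.parser")

    # Carry over every hidden ASP.NET state field from the initial page.
    payload = {inp["name"]: inp.get("value", "")
               for inp in soup.select("input[type=hidden]") if inp.get("name")}
    # The postback target decides which table the server renders.
    payload["__EVENTTARGET"] = "ctl00$C$S$lkMarket"
    payload["__EVENTARGUMENT"] = ""

    resp = s.post(URL, data=payload)
    print(resp.status_code, len(resp.text))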

How do I use the requests library with this website? It keeps giving a 403 response

The link is: https://angel.co/medical-marijuana-dispensaries-1
Every time I use requests.get(url), it gives me a 403 response, so I cannot parse the page.
I tried changing the headers (User-Agent and Referer) but it did not work:
import requests
page=requests.get('https://angel.co/medical-marijuana-dispensaries-1')
page
<Response [403]>
session = requests.Session()
session.headers.update({'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.100 Safari/537.36'})
session.headers
{'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.100 Safari/537.36', 'Accept-Encoding': 'gzip, deflate', 'Accept': '*/*', 'Connection': 'keep-alive'}
page=requests.get('https://angel.co/medical-marijuana-dispensaries-1')
page
<Response [403]>
page=session.get('https://angel.co/medical-marijuana-dispensaries-1')
page
<Response [403]>
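For what it's worth, in the transcript above the second requests.get call does not go through the session, so the updated session headers are not applied to it; headers can also be passed per request. A minimal sketch of that mechanic follows (the header values are illustrative, and a site behind bot protection may still answer 403, in which case a real browser driven by Selenium or similar is the usual fallback):
import requests

url = "https://angel.co/medical-marijuana-dispensaries-1"
# Illustrative browser-like headers; not guaranteed to satisfy the site's bot protection.
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
                  "(KHTML, like Gecko) Chrome/76.0.3809.100 Safari/537.36",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "en-US,en;q=0.9",
    "Referer": "https://angel.co/",
}

with requests.Session() as s:
    resp = s.get(url, headers=headers, timeout=30)  # headers applied to this request
    print(resp.status_code)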

Target Edge Canary in CSS

To fix an image resizing bug, I need to target Edge Canary in CSS.
Specifically, I am using Microsoft Edge
Version 76.0.151.0 (Official build) Canary (64-bit)
on macOS Mojave 10.14.6
I tried @supports (-ms-ime-align: auto) { } but that doesn't work. Is there a new hack available?
Microsoft Edge Canary uses the Chromium engine, so @supports (-ms-ime-align: auto) can't detect it as an Edge browser.
As an alternative workaround, I suggest using window.navigator.userAgent (a JavaScript property) to check whether the browser is Microsoft Edge (Chromium).
Code as below:
<script>
    var browser = (function (agent) {
        switch (true) {
            case agent.indexOf("edge") > -1: return "edge";
            case agent.indexOf("edg") > -1: return "chromium based edge (dev or canary)";
            case agent.indexOf("opr") > -1 && !!window.opr: return "opera";
            case agent.indexOf("chrome") > -1 && !!window.chrome: return "chrome";
            case agent.indexOf("trident") > -1: return "ie";
            case agent.indexOf("firefox") > -1: return "firefox";
            case agent.indexOf("safari") > -1: return "safari";
            default: return "other";
        }
    })(window.navigator.userAgent.toLowerCase());
    document.body.innerHTML = window.navigator.userAgent.toLowerCase() + "<br>" + browser;
</script>
The browser userAgent strings are as below:
The Edge browser userAgent:
mozilla/5.0 (windows nt 10.0; win64; x64) applewebkit/537.36 (khtml, like gecko) chrome/64.0.3282.140 safari/537.36 edge/18.17763
The Microsoft Chromium Edge Dev userAgent:
mozilla/5.0 (windows nt 10.0; win64; x64) applewebkit/537.36 (khtml, like gecko) chrome/76.0.3800.0 safari/537.36 edg/76.0.167.1
The Microsoft Chromium Edge Canary userAgent:
mozilla/5.0 (windows nt 10.0; win64; x64) applewebkit/537.36 (khtml, like gecko) chrome/76.0.3800.0 safari/537.36 edg/76.0.167.1
The IE browser userAgent:
mozilla/5.0 (windows nt 10.0; wow64; trident/7.0; .net4.0c; .net4.0e; rv:11.0) like gecko
The Chrome browser userAgent:
mozilla/5.0 (windows nt 10.0; win64; x64) applewebkit/537.36 (khtml, like gecko) chrome/74.0.3729.169 safari/537.36
[Note] If your site is being targeted for UA string overrides by the browser, you may not be able to use userAgent to detect the browser correctly depending on what those overrides show.

When Tor is set to obtain a rotating IP, the webpage “login” is interrupted even if the IP has not changed yet

Introduction
First of all, I tried the solutions from this topic and they do not work in my case. I mean the IP rotates, but I got an "empty socket content" message; besides, the website was scraped, but not the way I want, because there is information I can only scrape if I am logged in.
So I set the torrc file to obtain a rotating IP with MaxCircuitDirtiness 20, and it rotates without the socks problem this time, but I quickly end up logged out and then do not obtain the information I am interested in.
This is the kind of item I scrape:
{'_id': 'Bidule',
 'field1': ['A','C','D','E'],  # requires being logged in to the page
 'field2': 'truc de bidule',
 'field3': [0,1,2,3],          # requires being logged in to the page
 'field4': 'le champ quatre'}
It works for the first items, but at some point it goes bad, like this:
{'_id': 'Machine',
 'field1': [],  # empty because not logged in
 'field2': 'truc de machine',
 'field3': [],  # empty because not logged in
 'field4': 'le champ quatre'}
When does it go bad?
Below is an illustration of what happens during one of my scrapes, based on the log file and terminal output.
IP: 178.239.176.73 #first item scraped as expected
IP: 178.239.176.73 #second item scraped as expected
IP: 178.239.176.73 #third item scraped as expected
IP: 178.239.176.73 #fourth item scraped as expected
IP: 178.239.176.73 #fifth item scraped as expected
IP: 178.239.176.73 #sixth item scraped as expected
IP: 178.239.176.73 #seventh item scraped as expected
IP: 178.239.176.73 #eighth item scraped as expected
IP: 178.239.176.73 #ninth item NOT scraped as expected
IP: 178.239.176.73 #items NOT scraped as expected from here until the end
IP: 178.239.176.73
IP: 178.239.176.73
IP: 162.247.74.27
IP: 162.247.74.27
IP: 162.247.74.27
IP: 162.247.74.27
IP: 162.247.74.27
IP: 178.175.132.227
IP: 178.175.132.227
IP: 178.175.132.227
IP: 178.175.132.227
IP: 178.175.132.227
IP: 178.175.132.227
IP: 178.175.132.227
IP: 178.175.132.227
IP: 178.175.132.227
IP: 178.175.132.227
IP: 178.175.132.227
IP: 178.175.132.227
IP: 178.175.132.227
The big problem is that I get logged out even when the IP has not changed yet, and then my items are no longer useful. I do not understand why this happens.
Note that when Tor is not set to rotate the IP, it works well with no issue, but I want a rotating IP.
I tried to include the setting COOKIES_ENABLED = True because I wondered whether dropping cookies was making me lose my login to the webpage, but apparently that is not the reason, so I am still wondering what the cause is.
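A minimal settings sketch that can help verify this, using Scrapy's standard COOKIES_DEBUG setting to log the Cookie / Set-Cookie headers of every request and response:
# settings.py (debugging sketch): keep cookies enabled and log them, to see
# whether the session cookie is still sent after the Tor circuit changes.
COOKIES_ENABLED = True
COOKIES_DEBUG = True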
Code, if you want to test it, reproduce the results and help:
scrapy startproject project
the project directory's organization:
project                      # directory
|_ scrapy.cfg                # file
|__ project                  # directory
    |_ __init__.py           # file (empty)
    |_ items.py              # file (unnecessary to test)
    |_ middlewares.py        # file
    |_ pipelines.py          # file (unnecessary to test)
    |_ settings.py           # file
    |__ spiders              # directory
        |_ spiders.py        # file
middlewares.py:
from scrapy import signals
import random
from scrapy.conf import settings

class RandomUserAgentMiddleware(object):
    def process_request(self, request, spider):
        ua = random.choice(settings.get('USER_AGENT_LIST'))
        if ua:
            request.headers.setdefault('User-Agent', ua)

class ProxyMiddleware(object):
    def process_request(self, request, spider):
        request.meta['proxy'] = settings.get('HTTP_PROXY')
        spider.log('Proxy : %s' % request.meta['proxy'])

class ProjectSpiderMiddleware(object):
    @classmethod
    def from_crawler(cls, crawler):
        s = cls()
        crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
        return s

    def process_spider_input(self, response, spider):
        return None

    def process_spider_output(self, response, result, spider):
        for i in result:
            yield i

    def process_spider_exception(self, response, exception, spider):
        pass

    def process_start_requests(self, start_requests, spider):
        for r in start_requests:
            yield r

    def spider_opened(self, spider):
        spider.logger.info('Spider opened: %s' % spider.name)

class ProjectDownloaderMiddleware(object):
    @classmethod
    def from_crawler(cls, crawler):
        s = cls()
        crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
        return s

    def process_request(self, request, spider):
        return None

    def process_response(self, request, response, spider):
        return response

    def process_exception(self, request, exception, spider):
        pass

    def spider_opened(self, spider):
        spider.logger.info('Spider opened: %s' % spider.name)
settings.py:
USER_AGENT_LIST = [
'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.113 Safari/537.36',
'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.90 Safari/537.36',
'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/44.0.2403.157 Safari/537.36',
'Mozilla/5.0 (Windows NT 5.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.90 Safari/537.36',
'Mozilla/5.0 (Windows NT 6.2; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.90 Safari/537.36',
'Mozilla/5.0 (Windows NT 6.3; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.113 Safari/537.36',
'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/46.0.2490.80 Safari/537.36',
'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/48.0.2564.109 Safari/537.36',
'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.89 Safari/537.36',
'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/66.0.3359.117 Safari/537.36',
'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.99 Safari/537.36',
'Mozilla/4.0 (compatible; MSIE 9.0; Windows NT 6.1)',
'Mozilla/5.0 (Windows NT 6.1; Trident/7.0; rv:11.0) like Gecko',
'Mozilla/5.0 (compatible; MSIE 10.0; Windows NT 6.1; WOW64; Trident/6.0)',
'Mozilla/5.0 (Windows NT 5.1; rv:7.0.1) Gecko/20100101 Firefox/7.0.1',
'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:54.0) Gecko/20100101 Firefox/54.0',
'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:40.0) Gecko/20100101 Firefox/40.1',
'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:46.0) Gecko/20100101 Firefox/46.0',
'Mozilla/5.0 (Windows NT 10.0; WOW64; rv:59.0) Gecko/20100101 Firefox/59.0',
'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2924.87 Safari/537.36 OPR/43.0.2442.991',
'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.87 Safari/537.36 OPR/42.0.2393.94',
'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36 OPR/48.0.2685.52'
]
#proxy for polipo
HTTP_PROXY = 'http://127.0.0.1:8123'
#retry if needed
RETRY_ENABLED = True
RETRY_TIMES = 5 # initial response + 5 retries = 6 requests
RETRY_HTTP_CODES = [401, 403, 404, 408, 500, 502, 503, 504]
# Obey robots.txt rules
ROBOTSTXT_OBEY = True
# Cookies (enabled by default)
COOKIES_ENABLED = True # commented or not, this did not change anything for me
DOWNLOADER_MIDDLEWARES = {
'project.middlewares.RandomUserAgentMiddleware': 400,
'project.middlewares.ProxyMiddleware': 410,
'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware': 110,
'scrapy.contrib.downloadermiddleware.useragent.UserAgentMiddleware': None,
}
# Enable and configure the AutoThrottle extension (disabled by default)
# See https://doc.scrapy.org/en/latest/topics/autothrottle.html
AUTOTHROTTLE_ENABLED = True
# The initial download delay
AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
AUTOTHROTTLE_MAX_DELAY = 60
spiders.py in spiders directory:
import scrapy
import logging  # used in checkip() below
from re import search

class ChevalSpider(scrapy.Spider):
    name = "fiche_cheval"
    start_urls = ['https://www.paris-turf.com/compte/login']

    def __init__(self, username=None, mdp=None):
        self.username = username  # create a fake account by yourself
        self.mdp = mdp

    def parse(self, response):
        token = response.css('[name="_csrf_token"]::attr(value)').get()
        data_log = {
            '_csrf_token': token,
            '_username': self.username,
            '_password': self.mdp
        }
        yield scrapy.FormRequest.from_response(response, formdata=data_log, callback=self.after_login)

    def after_login(self, response):
        liste_ch = ['alexandros-751044','annette-girl-735523','citoyenne-743132','everest-748084','goudurix-687456','lady-zorreghuietta-752292','petit-dandy-671825','ritvilik-708712','scarface-686119','siamese-713651','tic-tac-toe-685508',
                    'velada-745272','wind-breaker-755107','zodev-715463','ballerian-813033','houpala-riquette-784415','jemykos-751551','madoudal-736164','margerie-778614','marquise-collonges-794335','mene-thou-du-plaid-780155']  # only a sample of thousands of ids
        url = ['https://www.paris-turf.com/fiche-cheval/' + ch for ch in liste_ch]
        for link, cheval in zip(url, liste_ch):
            yield scrapy.Request(
                url=link,
                callback=self.page_cheval,
                meta={'nom': cheval}
            )

    def page_cheval(self, response):
        def lister_valeur(x_path, x_path2):
            """Custom helper that appends None even when the tag does not
            exist in the page, so that all list fields have the same length.
            Important for my applications."""
            liste_valeur = []
            for valeur in response.xpath(x_path):
                val = valeur.xpath(x_path2).extract_first()
                if val is None or val == "" or val == "." or val == "-" or val == " ":
                    liste_valeur.append(None)
                else:
                    liste_valeur.append(val)
            return liste_valeur

        cat_course1, cat_course2 = "//html//td[@class='italiques']", "text()"
        cat_course = lister_valeur(cat_course1, cat_course2)  # 'Course A', 'Course B', ...
        gains1, gains2 = "//html//td[@class='rapport']", "a/text()"
        gains = lister_valeur(gains1, gains2)  # '17 100', '14 850', '0', ...
        gains = [
            int(search(r'(\d{1,10})', gain.replace('\n', '').replace(' ', '').replace('.', '')).group(1))
            if search(r'(\d{1,10})', gain.replace('\n', '').replace(' ', '').replace('.', '')) is not None else 0
            for gain in gains
        ]
        _id_course = response.xpath("//html//td[1]/@data-id").extract()
        item = {
            '_id': response.request.meta['nom'],
            'cat_de_course': cat_course,
            'gains': gains,
            'id_de_course': _id_course
        }
        yield scrapy.Request('http://checkip.dyndns.org/', callback=self.checkip, dont_filter=True)
        yield item

    def checkip(self, response):
        ip = response.xpath('//body/text()').re(r'\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}')[0]
        print('IP: {}'.format(ip))
        logging.warning('IP: {}'.format(ip))
To launch the spider:
scrapy crawl fiche_cheval -a username=yourfakeemailaccount -a mdp=password -o items.json -s LOG_FILE=Project.log
Some notes:
In the last crawl I did a few minutes ago, while the IP was changing, the items were all okay except the last one; so if you run it once and every item looks okay, that is certainly not a reproducible result when you launch it again.
Tor and Polipo configurations:
/etc/tor/torrc file:
MaxCircuitDirtiness 20
SOCKSPort 9050
ControlPort 9051
CookieAuthentication 1
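As a side note, with ControlPort 9051 and CookieAuthentication 1 set as above, a new circuit can also be requested on demand from Python via the stem library (a minimal sketch, assuming stem is installed and the control auth cookie is readable by the user running it):
# Minimal sketch: ask Tor for a fresh circuit (NEWNYM) through the ControlPort above.
from stem import Signal
from stem.control import Controller

with Controller.from_port(port=9051) as controller:
    controller.authenticate()          # uses the cookie enabled by CookieAuthentication 1
    controller.signal(Signal.NEWNYM)   # request a new circuit / exit IP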
/etc/tor/torsocks.conf file:
TorAddress 127.0.0.1
TorPort 9050
/etc/polipo/config file:
logSyslog = true
logFile = /var/log/polipo/polipo.log
socksParentProxy = localhost:9050
diskCacheRoot=""
disableLocalInterface=true
OS: Ubuntu 18.04.2 LTS. Tor: 0.3.2.10 (git-0edaa32732ec8930) running on Linux with Libevent 2.1.8-stable, OpenSSL 1.1.0g, Zlib 1.2.11, Liblzma 5.2.2, and Libzstd 1.3.3. I could not find a way to check the Polipo version.
UPDATE
I added a loop: when an item is not scraped as expected, the spider goes back to parse() to log in again. I checked inside the loop that data_log was still correct when it was reused, and it was.
So even when I log in again I do not obtain the items as expected. This is very strange.

Scrapy using pool of random proxies to avoid being banned

I am quite new to Scrapy (and my background is not in computer science). There is a website that I can't visit with my local IP since I am banned, but I can visit it using a VPN service in the browser. So that my spider can crawl it, I set up a pool of proxies that I found here: http://proxylist.hidemyass.com/. With that, my spider is able to crawl and scrape items, but my doubt is whether I have to change the proxy pool list every day. Sorry if my question is a dumb one...
Here is my settings.py:
BOT_NAME = 'reviews'
SPIDER_MODULES = ['reviews.spiders']
NEWSPIDER_MODULE = 'reviews.spiders'
DOWNLOAD_DELAY = 1
RANDOMIZE_DOWNLOAD_DELAY = True
DOWNLOADER_MIDDLEWARES = {
    'scrapy.contrib.downloadermiddleware.httpcompression.HttpCompressionMiddleware': None,  # to avoid "exceptions.IOError: Not a gzipped file" being raised
    'scrapy.contrib.downloadermiddleware.useragent.UserAgentMiddleware': None,
    'reviews.rotate_useragent.RotateUserAgentMiddleware': 400,
    'scrapy.contrib.downloadermiddleware.httpproxy.HttpProxyMiddleware': 110,
    'reviews.middlewares.ProxyMiddleware': 100,
}
PROXIES = [
    {'ip_port': '168.63.249.35:80', 'user_pass': ''},
    {'ip_port': '162.17.98.242:8888', 'user_pass': ''},
    {'ip_port': '70.168.108.216:80', 'user_pass': ''},
    {'ip_port': '45.64.136.154:8080', 'user_pass': ''},
    {'ip_port': '149.5.36.153:8080', 'user_pass': ''},
    {'ip_port': '185.12.7.74:8080', 'user_pass': ''},
    {'ip_port': '150.129.130.180:8080', 'user_pass': ''},
    {'ip_port': '185.22.9.145:8080', 'user_pass': ''},
    {'ip_port': '200.20.168.135:80', 'user_pass': ''},
    {'ip_port': '177.55.64.38:8080', 'user_pass': ''},
]
# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'reviews (+http://www.yourdomain.com)'
Here is my middlewares.py:
import base64
import random
from settings import PROXIES

class ProxyMiddleware(object):
    def process_request(self, request, spider):
        proxy = random.choice(PROXIES)
        if proxy['user_pass'] is not None:
            request.meta['proxy'] = "http://%s" % proxy['ip_port']
            encoded_user_pass = base64.encodestring(proxy['user_pass'])
            request.headers['Proxy-Authorization'] = 'Basic ' + encoded_user_pass
        else:
            request.meta['proxy'] = "http://%s" % proxy['ip_port']
Another question: if the website is HTTPS, should I have a proxy pool list for HTTPS only, and then another class HTTPSProxyMiddleware(object) that receives a list HTTPS_PROXIES?
My rotate_useragent.py:
import random
from scrapy.contrib.downloadermiddleware.useragent import UserAgentMiddleware

class RotateUserAgentMiddleware(UserAgentMiddleware):
    def __init__(self, user_agent=''):
        self.user_agent = user_agent

    def process_request(self, request, spider):
        ua = random.choice(self.user_agent_list)
        if ua:
            request.headers.setdefault('User-Agent', ua)

    # the default user_agent_list covers Chrome, IE, Firefox, Mozilla, Opera and Netscape
    # for more user agent strings, see http://www.useragentstring.com/pages/useragentstring.php
    user_agent_list = [
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/22.0.1207.1 Safari/537.1",
        "Mozilla/5.0 (X11; CrOS i686 2268.111.0) AppleWebKit/536.11 (KHTML, like Gecko) Chrome/20.0.1132.57 Safari/536.11",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.6 (KHTML, like Gecko) Chrome/20.0.1092.0 Safari/536.6",
        "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.6 (KHTML, like Gecko) Chrome/20.0.1090.0 Safari/536.6",
        "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/19.77.34.5 Safari/537.1",
        "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.9 Safari/536.5",
        "Mozilla/5.0 (Windows NT 6.0) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.36 Safari/536.5",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
        "Mozilla/5.0 (Windows NT 5.1) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_0) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.0 Safari/536.3",
        "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/535.24 (KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24",
        "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/535.24 (KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24"
    ]
One last question (sorry if it is again a stupid one): in settings.py there is a commented default part:
# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'reviews (+http://www.yourdomain.com)'
Should I uncomment it and put my personal information there, or just leave it like that? I want to crawl efficiently while following good policies and habits to avoid possible ban issues...
I am asking all this because, with this setup, my spiders started to throw errors like
twisted.internet.error.TimeoutError: User timeout caused connection failure: Getting http://www.example.com/browse/?start=884 took longer than 180.0 seconds.
and
Error downloading <GET http://www.example.com/article/2883892/x-review.html>: [<twisted.python.failure.Failure <class 'twisted.internet.error.ConnectionLost'>>]
and
Error downloading <GET http://www.example.com/browse/?start=6747>: TCP connection timed out: 110: Connection timed out.
Thanks so much for your help and time.
There is already a library to do this: https://github.com/aivarsk/scrapy-proxies
Please download it from there. It is not on pypi.org yet, so you can't install it easily using pip or easy_install.
There is no single correct answer to whether you must refresh the pool every day. Some proxies are not always available, so you have to check them now and then. Also, if you use the same proxy every time, the server you are scraping may block its IP as well, but that depends on the security mechanisms that server has.
As for HTTPS: yes, because you don't know whether all the proxies in your pool support HTTPS. Alternatively, you could keep just one pool and add a field to each proxy that indicates its HTTPS support.
In your settings you are disabling the user agent middleware: 'scrapy.contrib.downloadermiddleware.useragent.UserAgentMiddleware': None.
So the USER_AGENT setting won't have any effect anyway.
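To make the pool maintenance and HTTPS points concrete, here is a minimal, illustrative sketch of a liveness check that could be run periodically over the PROXIES list (the check_proxy helper and the httpbin.org test URLs are assumptions for the sketch, not part of Scrapy):
import requests

def check_proxy(ip_port, timeout=10):
    # Try the proxy for both plain HTTP and HTTPS requests.
    proxies = {"http": "http://%s" % ip_port, "https": "http://%s" % ip_port}
    result = {"ip_port": ip_port, "http": False, "https": False}
    for scheme in ("http", "https"):
        try:
            r = requests.get("%s://httpbin.org/ip" % scheme, proxies=proxies, timeout=timeout)
            result[scheme] = r.ok
        except requests.RequestException:
            pass
    return result

if __name__ == "__main__":
    sample_pool = ["168.63.249.35:80", "162.17.98.242:8888"]  # entries from PROXIES above
    for entry in sample_pool:
        print(check_proxy(entry))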
