Replicating JavaScript search in Scrapy - web-scraping

I'm not having any success scraping this website because it does not contain any forms.
My crawler always returns nothing when I dump the response data to a file:
import scrapy

class LoginSpider(scrapy.Spider):
    name = 'mamega.org'
    start_urls = ['https://www.mamega.org/search/']

    def parse(self, response):
        return scrapy.Request('https://www.mamega.org/_searchm.php',
                              method="POST",
                              meta={'section': 'ebooks', 'datafill': 'musso'},
                              headers={'Content-Type': 'application/json; charset=UTF-8'},
                              callback=self.after_login)

    def after_login(self, response):
        print("__________________________________________after_login______________________________________________________")
        page = response.url.split("/")[-2]
        filename = 'quotes-%s.html' % page
        with open(filename, 'wb') as f:
            f.write(response.body)
        self.log('Saved file %s' % filename)
        for title in response.xpath('//table[@style="width:93%;"]//tbody//tr//td/following-sibling::a[2]/@href'):
            yield {'roman': title.css('a ::text').extract_first(), 'url': title.css('a::attr(href)').extract_first()}

Your first POST request doesn't contain a body.
If you look at the request the site itself sends, you can see it includes three things you need to replicate to get a proper response from their server:
the Content-Type and X-Requested-With headers, and a form-encoded body.
You can replicate this in your crawler:
headers = {
    'content-type': 'application/x-www-form-urlencoded; charset=UTF-8',
    'x-requested-with': 'XMLHttpRequest'
}
Request(
    'https://www.mamega.org/_searchm.php',
    method='POST',
    body='section=ebooks&datafill=musso',
    headers=headers
)
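Alternatively, you can let Scrapy build the URL-encoded body for you with FormRequest. Below is a minimal sketch of the whole spider under that approach; the after_search callback name is just an illustration, the endpoint and field names are the ones shown above.

import scrapy

class SearchSpider(scrapy.Spider):
    name = 'mamega.org'
    start_urls = ['https://www.mamega.org/search/']

    def parse(self, response):
        # FormRequest URL-encodes the dict and sets the Content-Type header itself
        yield scrapy.FormRequest(
            'https://www.mamega.org/_searchm.php',
            formdata={'section': 'ebooks', 'datafill': 'musso'},
            headers={'x-requested-with': 'XMLHttpRequest'},
            callback=self.after_search,
        )

    def after_search(self, response):
        # dump the raw response first to see what the endpoint actually returns
        self.logger.info(response.text[:500])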

return scrapy.Request('https://www.mamega.org/_searchm.php',
                      method="POST",
                      meta={'section': 'ebooks', 'datafill': 'musso'},
                      headers={'Content-Type': 'application/json; charset=UTF-8'},
                      callback=self.after_login)
The data you are passing as meta is actually the form data of the POST request.
Make your request a FormRequest and pass that dict as formdata instead (drop the JSON Content-Type header: FormRequest sends an URL-encoded body and sets the matching Content-Type for you):
return scrapy.FormRequest('https://www.mamega.org/_searchm.php',
                          method="POST",
                          formdata={'section': 'ebooks', 'datafill': 'musso'},
                          callback=self.after_login)

Related

How do I set up automatic rotation of the GitHub token during parsing?

GitHub allows you to send no more than 2500 requests per hour. If I have several accounts/tokens, how can I set up automatic token rotation in Scrapy, either when a certain number of requests is reached (for example 2500) or when the response is a 403?
class GithubSpider(scrapy.Spider):
    name = 'github.com'
    start_urls = ['https://github.com']
    tokens = ['token1', 'token2', 'token3', 'token4']
    headers = {
        'Accept': 'application/vnd.github.v3+json',
        'Authorization': 'token ' + tokens[1],
    }

    def start_requests(self, **cb_kwargs):
        for lang in languages:
            cb_kwargs['lang'] = lang
            url = f'https://api.github.com/search/users?q=language:{lang}%20location:{country}&per_page=100'
            yield Request(url=url, headers=self.headers, callback=self.parse, cb_kwargs=cb_kwargs)
You could use the cycle function from the itertools module to create a generator over your list of tokens, then pull the next token for each request you send. That way you use all the tokens equally, reducing the chance of hitting the limit on any one of them.
If you start receiving 403 responses, you will know that all the tokens have reached their limit. See the sample code below; a sketch of reacting to 403s directly follows it.
from itertools import cycle

class GithubSpider(scrapy.Spider):
    name = 'github.com'
    start_urls = ['https://github.com']
    tokens = cycle(['token1', 'token2', 'token3', 'token4'])

    def start_requests(self, **cb_kwargs):
        for lang in languages:
            headers = {
                'Accept': 'application/vnd.github.v3+json',
                'Authorization': 'token ' + next(self.tokens)
            }
            cb_kwargs['lang'] = lang
            url = f'https://api.github.com/search/users?q=language:{lang}%20location:{country}&per_page=100'
            yield Request(url=url, headers=headers, callback=self.parse, cb_kwargs=cb_kwargs)
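If you also want to switch tokens the moment a 403 comes back, one option is a small downloader middleware that swaps in the next token and re-schedules the request. This is only a sketch (the class name is made up, and it retries indefinitely, so you may want to cap the number of swaps):

from itertools import cycle

class TokenRotationMiddleware:
    # swap to the next token and retry whenever the API answers 403
    tokens = cycle(['token1', 'token2', 'token3', 'token4'])

    def process_response(self, request, response, spider):
        if response.status == 403:
            request.headers['Authorization'] = 'token ' + next(self.tokens)
            # dont_filter so the duplicate filter lets the retry through
            return request.replace(dont_filter=True)
        return response

It would be enabled in settings.py with something like DOWNLOADER_MIDDLEWARES = {'myproject.middlewares.TokenRotationMiddleware': 543}, where the module path is whatever your project uses.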

Postmates invalid parameters

I'm getting some problems with this POST request for creating a delivery:
{'dropoff_name': 'stephen',
'pickup_address': '1234 Bancroft Way, Emeryville, CA',
'pickup_phone_number': '1231231234',
'dropoff_phone_number': '1231231234',
'dropoff_address': '200 Powell Street, Emeryville, CA',
'pickup_name': 'ShareTea',
'manifest': 'boba'
}
Here's my code:
def post_data(self):
    post_data = {}
    post_data["manifest"] = self.manifest
    # post_data['manifest_items'] = self.manifest_items
    post_data.update(self.pickup.post_data("pickup"))
    post_data.update(self.dropoff.post_data("dropoff"))
    if self.quote:
        post_data["quote_id"] = self.quote.quote_id
    return post_data

def _make_request(self, url, data=None, type='get'):
    if type == 'post':
        print(data)
        headers = {'Content-type': 'application/x-www-form-urlencoded'}
        response = requests.post(url, data=data, auth=(self.api_key, ''), headers=headers)

# and the call site looks like:
params = delivery.post_data()
return self._make_request(url, data=params, type='post')
I'm getting a 400 Exception that says The parameters of your request were invalid.
Does it identify which parameters are invalid?
If it's just the phone numbers, I had success by formatting the phone numbers in my request as "123-123-1234".
I believe the manifest field should be an array.
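Putting those two guesses together, the payload would look something like the sketch below. Whether the array belongs in manifest itself or in a separate manifest_items field (the commented-out line in the question) is something to verify against the Postmates API docs; the manifest_items shape here is an assumption.

post_data = {
    'pickup_name': 'ShareTea',
    'pickup_address': '1234 Bancroft Way, Emeryville, CA',
    'pickup_phone_number': '123-123-1234',   # dashed format
    'dropoff_name': 'stephen',
    'dropoff_address': '200 Powell Street, Emeryville, CA',
    'dropoff_phone_number': '123-123-1234',  # dashed format
    'manifest': 'boba',
    'manifest_items': [{'name': 'boba', 'quantity': 1}],  # assumed shape
}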

Scrapy: how to parse an AJAX response from an ASP.NET page on POST

I want to look through the companies at https://www.greg.gg/webCompSearch.aspx.
I know that the ASP.NET form needs certain parameters, which can be extracted from the page. When I send the POST as a FormRequest in Scrapy I do get a response with the additional data, but my problem is that it is only partially HTML, see:
1|#||4|1890|updatePanel|ctl00_updPanel|
<br />
<div id="login">
<div id="ctl00_pnlLogin" onkeypress="javascript:return WebForm_FireDefaultButton(event, 'ctl00_btnLogin')">
So the question is how I can parse this HTML properly.
Here is the minimal Scrapy spider for reference:
# -*- coding: utf-8 -*-
import scrapy

class GgTestSpider(scrapy.Spider):
    name = 'gg_test'
    allowed_domains = ['www.greg.gg']
    base_url = 'https://www.greg.gg/webCompSearch.aspx'
    start_urls = [base_url]
    custom_settings = {
        'DUPEFILTER_CLASS': 'scrapy.dupefilters.BaseDupeFilter',
    }

    def parse(self, response):
        # grep ASP.NET elements out of response
        EVENTVALIDATION = response.xpath(
            '//*[@id="__EVENTVALIDATION"]/@value').extract_first()
        VIEWSTATE = response.xpath(
            '//*[@id="__VIEWSTATE"]/@value').extract_first()
        PREVIOUSPAGE = response.xpath(
            '//*[@id="__PREVIOUSPAGE"]/@value').extract_first()
        response.meta['fdat'] = {
            '__EVENTTARGET': '',
            '__EVENTARGUMENT': '',
            '__VIEWSTATE': VIEWSTATE,
            '__PREVIOUSPAGE': PREVIOUSPAGE,
            '__EVENTVALIDATION': EVENTVALIDATION,
            '__ASYNCPOST': "true",
            'ctl00$ScriptManager2': "ctl00$cntPortal$updPanel|ctl00$cntPortal$btnSearch",
            'ctl00$cntPortal$radSearchType': "radStartsWith",
            'ctl00$cntPortal$chkPrevNames': "on",
            'ctl00$cntPortal$ddlRegister': "0",
            'ctl00$cntPortal$btnSearch': "Search"
        }
        # id to search
        response.meta['fdat']['ctl00$cntPortal$txtCompRegNum'] = "1"
        return scrapy.FormRequest.from_response(
            response,
            headers={
                'Referer': self.base_url,
                'X-MicrosoftAjax': 'Delta=true',
            },
            formdata=response.meta['fdat'],
            meta={'fdat': response.meta['fdat']},
            callback=self._parse_items,
        )

    def _parse_items(self, response):
        company_item = response.xpath(
            '//input[contains(@id, "ctl00$cntPortal$grdSearchResults$ctl")]/@value').extract()
        print "no data:", response.request.headers, response.meta['fdat'], company_item, response.xpath('/')
        response.meta['fdat']['__EVENTVALIDATION'] = response.xpath(
            '//*[@id="__EVENTVALIDATION"]/@value').extract()
        response.meta['fdat']['__VIEWSTATE'] = response.xpath('//*[@id="__VIEWSTATE"]/@value').extract()
        response.meta['fdat']['__PREVIOUSPAGE'] = response.xpath(
            '//*[@id="__PREVIOUSPAGE"]/@value').extract()
        # give as input to form (POST) to get redirected
        for i in company_item:
            response.meta['fdat']['ctl00$ScriptManager2'] = 'ctl00$cntPortal$updPanel|{0}'.format(i)
            yield scrapy.FormRequest(
                url=self.base_url,
                formdata=response.meta['fdat'],
                meta={'company_extra_id': response.meta['company_extra_id']},
                callback=self._parse_company,
            )

    def _parse_company(self, response):
        pass
Thanks in advance!
EDIT: I changed the title of the question from "how to get the full HTML as displayed in the browser" to how to actually parse the partial HTML that is returned by the POST.
Using selectors:
response_data = scrapy.Selector(text=response.text)
# this will give you a Selector object
# you should be able to use .xpath() and .css() on response_data
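Because the body is an ASP.NET AJAX "delta" rather than a full page, it can also help to cut out just the updatePanel block before building the Selector. Below is a rough sketch based on the observed length|type|id|content| layout of these responses; iter_delta_blocks is a made-up helper, not a Scrapy API.

import scrapy

def iter_delta_blocks(body):
    # Yield (type, id, content) tuples from an ASP.NET AJAX delta body.
    # Each block is encoded as length|type|id|content| where length is the
    # character count of content, so pipes inside the HTML are handled.
    pos = 0
    while pos < len(body):
        parts = body[pos:].split('|', 3)
        if len(parts) < 4:
            break
        length_s, kind, ident, rest = parts
        if not length_s.isdigit():
            break
        length = int(length_s)
        content = rest[:length]
        yield kind, ident, content
        # advance past "length|kind|ident|content|"
        pos += len(length_s) + len(kind) + len(ident) + 3 + length + 1

# inside the spider's callback (e.g. _parse_items):
def _parse_items(self, response):
    for kind, ident, content in iter_delta_blocks(response.text):
        if kind == 'updatePanel':
            panel = scrapy.Selector(text=content)
            # normal .xpath() / .css() queries now work on the panel HTML
            for value in panel.xpath('//input/@value').getall():
                yield {'value': value}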

Freebase Mqlread InvalidURLError: Invalid request URL -- too long -- POST possible?

Is it possible to submit a Freebase mqlread request via POST in Python? I have tried to search for documentation but everything refers to GET. Thanks.
It is possible.
You will need to issue a POST and add a specific header: X-HTTP-Method-Override: GET (it basically tells the server to emulate a GET with the POST's content). Specifically, I used Content-Type: application/x-www-form-urlencoded for the body.
Here's the relevant part of my code (CoffeeScript) if it helps:
mqlread = (query, queryEnvelope, cb) ->
  ## build URL
  url = urlparser.format
    protocol: 'https'
    host: 'www.googleapis.com'
    pathname: 'freebase/v1/mqlread'

  ## build POST body
  queryEnvelope ?= {}
  queryEnvelope.key = config.GOOGLE_API_SERVER_KEY
  queryEnvelope.query = JSON.stringify query

  options =
    url: url
    method: 'POST'
    headers:
      'X-HTTP-Method-Override': 'GET'
      'User-Agent': config.wikipediaScraperUserAgent
    timeout: 3000
    form: queryEnvelope

  ## invoke API
  request options, (err, response, body) ->
    if err then return cb err
    if response.statusCode != 200
      try
        json = JSON.parse(body)
        errmsg = json?.error?.message or "(unknown JSON)"
      catch e
        errmsg = body?[..50]
      return cb "#{response.statusCode} #{errmsg}"
    r = JSON.parse response.body
    decodeStringsInResponse r
    cb null, r
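Since the question asked about Python, a rough equivalent of the same idea with the requests library would be the sketch below; the API key and query are placeholders, and it simply mirrors the CoffeeScript above.

import json
import requests

def mqlread(query, api_key):
    # POST the MQL query, but ask the server to treat it as a GET
    resp = requests.post(
        'https://www.googleapis.com/freebase/v1/mqlread',
        headers={'X-HTTP-Method-Override': 'GET'},
        data={'key': api_key, 'query': json.dumps(query)},  # form-encoded body
        timeout=3,
    )
    resp.raise_for_status()
    return resp.json()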
I don't think POST is supported for MQLread, but you could use the HTTP Batch facility.
Here's an example in Python:
https://github.com/tfmorris/freebase-python-samples/blob/master/client-library/mqlread-batch.py

Scrapy: sending HTTP requests and parsing the response

I have looked at the Scrapy docs, but can Scrapy send an HTTP form (e.g. username, password, ...) and parse the result of sending this form?
There's an example on the same page: http://scrapy.readthedocs.org/en/latest/topics/request-response.html#passing-additional-data-to-callback-functions
def parse_page1(self, response):
    item = MyItem()
    item['main_url'] = response.url
    request = Request("http://www.example.com/some_page.html",
                      callback=self.parse_page2)
    request.meta['item'] = item
    return request

def parse_page2(self, response):
    item = response.meta['item']
    item['other_url'] = response.url
    return item
You just have to pass a callback function to the request and then parse the result in parse_page2 ;)
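For the form-submission half of the question, Scrapy's FormRequest.from_response is the usual tool. A minimal sketch follows; the URL and form field names are placeholders.

import scrapy

class LoginSpider(scrapy.Spider):
    name = 'login_example'
    start_urls = ['http://www.example.com/login']  # placeholder URL

    def parse(self, response):
        # fill in the form found on the page and POST it
        yield scrapy.FormRequest.from_response(
            response,
            formdata={'username': 'john', 'password': 'secret'},  # placeholder fields
            callback=self.after_login,
        )

    def after_login(self, response):
        # parse the page returned after the form was submitted
        if b'authentication failed' in response.body:
            self.logger.error('Login failed')
            return
        yield {'title': response.css('title::text').get()}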
