I want to look through the companies at: https://www.greg.gg/webCompSearch.aspx
I know that the ASP.NET form needs certain parameters, which can be extracted from the page. When I send a POST with Scrapy's FormRequest I do get a response with the additional data, but my problem is that it is only partially HTML, see:
1|#||4|1890|updatePanel|ctl00_updPanel|
<br />
<div id="login">
<div id="ctl00_pnlLogin" onkeypress="javascript:return WebForm_FireDefaultButton(event, 'ctl00_btnLogin')">
So the question is how I can properly parse this partial HTML.
Here is the minimal Scrapy spider for reference:
# -*- coding: utf-8 -*-
import scrapy


class GgTestSpider(scrapy.Spider):
    name = 'gg_test'
    allowed_domains = ['www.greg.gg']
    base_url = 'https://www.greg.gg/webCompSearch.aspx'
    start_urls = [base_url]
    custom_settings = {
        'DUPEFILTER_CLASS': 'scrapy.dupefilters.BaseDupeFilter',
    }

    def parse(self, response):
        # pull the ASP.NET state fields out of the response
        EVENTVALIDATION = response.xpath(
            '//*[@id="__EVENTVALIDATION"]/@value').extract_first()
        VIEWSTATE = response.xpath(
            '//*[@id="__VIEWSTATE"]/@value').extract_first()
        PREVIOUSPAGE = response.xpath(
            '//*[@id="__PREVIOUSPAGE"]/@value').extract_first()
        response.meta['fdat'] = {
            '__EVENTTARGET': '',
            '__EVENTARGUMENT': '',
            '__VIEWSTATE': VIEWSTATE,
            '__PREVIOUSPAGE': PREVIOUSPAGE,
            '__EVENTVALIDATION': EVENTVALIDATION,
            '__ASYNCPOST': "true",
            'ctl00$ScriptManager2': "ctl00$cntPortal$updPanel|ctl00$cntPortal$btnSearch",
            'ctl00$cntPortal$radSearchType': "radStartsWith",
            'ctl00$cntPortal$chkPrevNames': "on",
            'ctl00$cntPortal$ddlRegister': "0",
            'ctl00$cntPortal$btnSearch': "Search"
        }
        # id to search
        response.meta['fdat']['ctl00$cntPortal$txtCompRegNum'] = "1"
        return scrapy.FormRequest.from_response(
            response,
            headers={
                'Referer': self.base_url,
                'X-MicrosoftAjax': 'Delta=true',
            },
            formdata=response.meta['fdat'],
            meta={'fdat': response.meta['fdat']},
            callback=self._parse_items,
        )

    def _parse_items(self, response):
        company_item = response.xpath(
            '//input[contains(@id, "ctl00$cntPortal$grdSearchResults$ctl")]/@value').extract()
        print "no data:", response.request.headers, response.meta['fdat'], company_item, response.xpath('/')
        response.meta['fdat']['__EVENTVALIDATION'] = response.xpath(
            '//*[@id="__EVENTVALIDATION"]/@value').extract()
        response.meta['fdat']['__VIEWSTATE'] = response.xpath(
            '//*[@id="__VIEWSTATE"]/@value').extract()
        response.meta['fdat']['__PREVIOUSPAGE'] = response.xpath(
            '//*[@id="__PREVIOUSPAGE"]/@value').extract()
        # give as input to the form (POST) to get redirected
        for i in company_item:
            response.meta['fdat']['ctl00$ScriptManager2'] = 'ctl00$cntPortal$updPanel|{0}'.format(i)
            yield scrapy.FormRequest(
                url=self.base_url,
                formdata=response.meta['fdat'],
                # note: 'company_extra_id' is never set upstream, so use .get()
                meta={'company_extra_id': response.meta.get('company_extra_id')},
                callback=self._parse_company,
            )

    def _parse_company(self, response):
        pass
Thanks in advance!
EDIT: I changed the title of the question from how to get the full HTML as displayed in the browser to how to actually parse the partial HTML that is returned by the POST.
Using selectors
response_data = scrapy.Selector(text=response.text)
# this will give you a Selector object;
# you should be able to use .xpath() and .css() on response_data
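Alternatively, you can split the ASP.NET AJAX payload itself before building the Selector. As far as I can tell, the partial-postback body is a flat sequence of length|type|id|content| records, where length is the character count of content. A rough sketch (parse_delta is just a name I made up, and the panel id ctl00_updPanel is taken from the sample above):

import scrapy

def parse_delta(text):
    # Walk the length|type|id|content| records by their declared
    # content lengths rather than naively splitting on '|',
    # because the HTML content can itself contain pipes.
    text = text.strip()
    panels = {}
    pos = 0
    while pos < len(text):
        length, dtype, did, rest = text[pos:].split('|', 3)
        content = rest[:int(length)]
        if dtype == 'updatePanel':
            panels[did] = content
        # consumed: three fields + three pipes + content + trailing pipe
        pos += len(length) + len(dtype) + len(did) + 3 + int(length) + 1
    return panels

# usage in the callback:
# sel = scrapy.Selector(text=parse_delta(response.text).get('ctl00_updPanel', ''))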
Related
I want to send data from app.post() to app.get() using RedirectResponse.
@app.get('/', response_class=HTMLResponse, name='homepage')
async def get_main_data(request: Request,
                        msg: Optional[str] = None,
                        result: Optional[str] = None):
    if msg:
        response = templates.TemplateResponse('home.html', {'request': request, 'msg': msg})
    elif result:
        response = templates.TemplateResponse('home.html', {'request': request, 'result': result})
    else:
        response = templates.TemplateResponse('home.html', {'request': request})
    return response

@app.post('/', response_model=FormData, name='homepage_post')
async def post_main_data(request: Request,
                         file: FormData = Depends(FormData.as_form)):
    if condition:
        ......
        ......
        return RedirectResponse(request.url_for('homepage', **{'result': str(trans)}), status_code=status.HTTP_302_FOUND)
    return RedirectResponse(request.url_for('homepage', **{'msg': str(err)}), status_code=status.HTTP_302_FOUND)
How do I send result or msg via RedirectResponse, url_for() to app.get()?
Is there a way to hide the data in the URL either as path parameter or query parameter? How do I achieve this?
When trying it this way, I am getting the error starlette.routing.NoMatchFound: No route exists for name "homepage" and params "result".
Update:
I tried the below:
return RedirectResponse(app.url_path_for(name='homepage')
                        + '?result=' + str(trans),
                        status_code=status.HTTP_303_SEE_OTHER)
The above works, but only by sending the param as a query parameter, i.e., the URL looks like localhost:8000/?result=hello. Is there any way to do the same thing without showing it in the URL?
For redirecting from a POST to a GET method, please have a look at this and this answer on how to do that and the reason for using status_code=status.HTTP_303_SEE_OTHER (example is given below).
As for the reason for getting starlette.routing.NoMatchFound error, this is because request.url_for() receives path parameters, not query parameters. Your msg and result parameters are query ones; hence, the error.
A solution would be to use a CustomURLProcessor, as suggested in this and this answer, allowing you to pass both path (if need to) and query parameters to the url_for() function and obtain the URL. As for hiding the path and/or query parameters from the URL, you can use a similar approach to this answer that uses history.pushState() (or history.replaceState()) to replace the URL in the browser's address bar.
A complete working example can be found below (you can use your own TemplateResponse in place of HTMLResponse).
from fastapi import FastAPI, Request, status
from fastapi.responses import RedirectResponse, HTMLResponse
from typing import Optional
import urllib.parse

app = FastAPI()

class CustomURLProcessor:
    def __init__(self):
        self.path = ""
        self.request = None

    def url_for(self, request: Request, name: str, **params: str):
        # str() in case url_for returns a starlette URL object rather than a plain string
        self.path = str(request.url_for(name, **params))
        self.request = request
        return self

    def include_query_params(self, **params: str):
        parsed = list(urllib.parse.urlparse(self.path))
        parsed[4] = urllib.parse.urlencode(params)
        return urllib.parse.urlunparse(parsed)
@app.get('/', response_class=HTMLResponse)
def event_msg(request: Request, msg: Optional[str] = None):
    if msg:
        html_content = """
            <html>
                <head>
                    <script>
                        window.history.pushState('', '', "/");
                    </script>
                </head>
                <body>
                    <h1>""" + msg + """</h1>
                </body>
            </html>
        """
        return HTMLResponse(content=html_content, status_code=200)
    else:
        html_content = """
            <html>
                <body>
                    <h1>Create an event</h1>
                    <form method="POST" action="/">
                        <input type="submit" value="Create Event">
                    </form>
                </body>
            </html>
        """
        return HTMLResponse(content=html_content, status_code=200)

@app.post('/')
def event_create(request: Request):
    redirect_url = CustomURLProcessor().url_for(request, 'event_msg').include_query_params(msg="Successfully created!")
    return RedirectResponse(redirect_url, status_code=status.HTTP_303_SEE_OTHER)
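To trace the flow in the example above: submitting the form sends a POST to /, the 303 response makes the browser issue a GET to /?msg=Successfully+created!, and the history.pushState('', '', "/") script in the returned page then rewrites the address bar back to /, so the query string is not left visible to the user.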
Update
Regarding adding query params to url_for(), another solution would be to use Starlette's starlette.datastructures.URL, which now provides an include_query_params method. Example:
from starlette.datastructures import URL

redirect_url = URL(request.url_for('event_msg')).include_query_params(msg="Successfully created!")
return RedirectResponse(redirect_url, status_code=status.HTTP_303_SEE_OTHER)
I am using VS Code + Git Bash to scrape this data into JSON, but I am not getting any data at all: the JSON file is empty.
import scrapy

class ContactsSpider(scrapy.Spider):
    name = 'contacts'
    start_urls = [
        'https://app.cartinsight.io/sellers/all/amazon/'
    ]

    def parse(self, response):
        for contacts in response.xpath("//td[@title='Show Contact']"):
            yield {
                'show_contacts_td': contacts.xpath(".//td[@id='show_contacts_td']").extract_first()
            }
        next_page = response.xpath("//li[@class='stores-desc hidden-xs']").extract_first()
        if next_page is not None:
            next_page_link = response.urljoin(next_page)
            yield scrapy.Request(url=next_page_link, callback=self.parse)
The URL https://app.cartinsight.io/sellers/all/amazon/ you want to scrape redirects to https://app.cartinsight.io/. The second URL doesn't contain the XPath "//td[@title='Show Contact']", so the for loop in the parse method never runs, and that is why you are not getting your desired results.
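If you want to confirm this yourself, one option is to disable redirect handling for the start request so the 3xx response reaches the callback; a minimal sketch (spider name and callback are just placeholders):

import scrapy

class RedirectCheckSpider(scrapy.Spider):
    name = 'redirect_check'

    def start_requests(self):
        # dont_redirect stops RedirectMiddleware from following the 3xx,
        # and handle_httpstatus_list lets the callback receive it
        yield scrapy.Request(
            'https://app.cartinsight.io/sellers/all/amazon/',
            meta={'dont_redirect': True, 'handle_httpstatus_list': [301, 302]},
            callback=self.parse,
        )

    def parse(self, response):
        self.logger.info('status: %s, Location: %s',
                         response.status, response.headers.get('Location'))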
I'm not having any success scraping this website because it does not contain any forms. My crawler always returns nothing when I dump the response data to a file:
import scrapy

class LoginSpider(scrapy.Spider):
    name = 'mamega.org'
    start_urls = ['https://www.mamega.org/search/']

    def parse(self, response):
        return scrapy.Request('https://www.mamega.org/_searchm.php',
                              method="POST",
                              meta={'section': 'ebooks', 'datafill': 'musso'},
                              headers={'Content-Type': 'application/json; charset=UTF-8'},
                              callback=self.after_login
                              )

    def after_login(self, response):
        print("__________________________________________after_login______________________________________________________")
        page = response.url.split("/")[-2]
        filename = 'quotes-%s.html' % page
        with open(filename, 'wb') as f:
            f.write(response.body)
        self.log('Saved file %s' % filename)
        for title in response.xpath('//table[@style="width:93%;"]//tbody//tr//td/following-sibling::a[2]/@href'):
            yield {'roman': title.css('a ::text').extract_first(), 'url': title.css('a::attr(href)').extract_first()}
Your first POST request doesn't contain any body.
If you take a look at the request the website makes, you can see it includes three things you need to reproduce to get a proper response from their server: the content-type and x-requested-with headers, and a form-encoded body.
You can replicate this in your crawler:
headers = {
    'content-type': 'application/x-www-form-urlencoded; charset=UTF-8',
    'x-requested-with': 'XMLHttpRequest'
}
Request(
    'https://www.mamega.org/_searchm.php',
    method='POST',
    body='section=ebooks&datafill=musso',
    headers=headers
)
In your current code:
return scrapy.Request('https://www.mamega.org/_searchm.php',
                      method="POST",
                      meta={'section': 'ebooks', 'datafill': 'musso'},
                      headers={'Content-Type': 'application/json; charset=UTF-8'},
                      callback=self.after_login
                      )
the data you are passing as meta is actually the form data of the POST request.
Make your request as:
return scrapy.FormRequest('https://www.mamega.org/_searchm.php',
                          method="POST",
                          formdata={'section': 'ebooks', 'datafill': 'musso'},
                          callback=self.after_login
                          )
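A note on that change, based on how FormRequest generally behaves rather than anything specific to this site: FormRequest url-encodes the formdata dict into the request body and sets the Content-Type header to application/x-www-form-urlencoded by itself, which is why the JSON Content-Type header from the original request is dropped above.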
I've been trying to scrape some lists from http://www.golf.org.au, an ASP.NET-based website. I did some research and it appears that I must pass some values in a POST request to make the website fetch the data into the tables. I did that, but I'm still failing. Any idea what I'm missing?
Here is my code:
# -*- coding: utf-8 -*-
import scrapy

class GolfscraperSpider(scrapy.Spider):
    name = "golfscraper"
    allowed_domains = ["golf.org.au", "www.golf.org.au"]
    ids = ['3012801330', '3012801331', '3012801332', '3012801333']
    start_urls = []
    for id in ids:
        start_urls.append('http://www.golf.org.au/handicap/%s' % id)

    def parse(self, response):
        scrapy.FormRequest('http://www.golf.org.au/default.aspx?s=handicap',
                           formdata={
                               '__VIEWSTATE': response.css('input#__VIEWSTATE::attr(value)').extract_first(),
                               'ctl11$ddlHistoryInMonths': '48',
                               '__EVENTTARGET': 'ctl11$ddlHistoryInMonths',
                               '__EVENTVALIDATION': response.css('input#__EVENTVALIDATION::attr(value)').extract_first(),
                               'gaHandicap': '6.5',
                               'golflink_No': '2012003003',
                               '__VIEWSTATEGENERATOR': 'CA0B0334',
                           },
                           callback=self.parse_details)

    def parse_details(self, response):
        for name in response.css('div.rnd-course::text').extract():
            yield {'name': name}
Yes, ASP pages are tricky to scrape. Most probably some little parameter is missing.
Solutions for this:
Instead of creating the request through scrapy.FormRequest(...), use the scrapy.FormRequest.from_response() method (see the code example below). This captures most or even all of the hidden form data and uses it to prepopulate the FormRequest's data.
It seems you forgot to return the request; that may well be another problem.
As far as I recall, the __VIEWSTATEGENERATOR also changes each time and has to be extracted from the page.
If this doesn't work, fire up your Firefox browser with the Firebug plugin or Chrome's developer tools, do the request in the browser and then compare the full request headers and body against the same data in your request. There will be some difference.
Example code with all my suggestions:
def parse(self, response):
    req = scrapy.FormRequest.from_response(
        response,
        formdata={
            '__VIEWSTATE': response.css('input#__VIEWSTATE::attr(value)').extract_first(),
            'ctl11$ddlHistoryInMonths': '48',
            '__EVENTTARGET': 'ctl11$ddlHistoryInMonths',
            '__EVENTVALIDATION': response.css('input#__EVENTVALIDATION::attr(value)').extract_first(),
            'gaHandicap': '6.5',
            'golflink_No': '2012003003',
            # extract this from the page as well instead of hard-coding it
            '__VIEWSTATEGENERATOR': response.css('input#__VIEWSTATEGENERATOR::attr(value)').extract_first(),
        },
        callback=self.parse_details)
    self.logger.info(req.headers)
    self.logger.info(req.body)
    return req
I have looked at the Scrapy docs, but can Scrapy send an HTTP form (e.g. user name, password, ...) and parse the result of sending this form?
There's an example on the same page: http://scrapy.readthedocs.org/en/latest/topics/request-response.html#passing-additional-data-to-callback-functions
def parse_page1(self, response):
    item = MyItem()
    item['main_url'] = response.url
    request = Request("http://www.example.com/some_page.html",
                      callback=self.parse_page2)
    request.meta['item'] = item
    return request

def parse_page2(self, response):
    item = response.meta['item']
    item['other_url'] = response.url
    return item
You just have to pass a callback function to the request and then parse the result in parse_page2 ;)
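And for the form half of the question: yes, Scrapy can submit HTTP forms. Below is a minimal login sketch, similar to the login example in the Scrapy docs; the URL and the username/password field names are placeholders you would take from the real login page:

import scrapy

class LoginSpider(scrapy.Spider):
    name = 'login_example'
    start_urls = ['http://www.example.com/users/login.php']

    def parse(self, response):
        # from_response picks up the page's hidden form fields and
        # merges in the visible ones we set before POSTing the form
        return scrapy.FormRequest.from_response(
            response,
            formdata={'username': 'john', 'password': 'secret'},
            callback=self.after_login,
        )

    def after_login(self, response):
        # check for a login failure marker before scraping further
        if b'authentication failed' in response.body:
            self.logger.error('Login failed')
            return
        # logged in: continue scraping, e.g. yield further requests here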