I need to scrape query results from an .aspx web page.
http://legistar.council.nyc.gov/Legislation.aspx
The url is static, so how do I submit a query to this page and get the results? Assume we need to select "all years" and "all types" from the respective dropdown menus.
Somebody out there must know how to do this.
As an overview, you will need to perform four main tasks:
to submit request(s) to the web site,
to retrieve the response(s) from the site
to parse these responses
to have some logic to iterate through the tasks above, with the parameters needed to navigate to the "next" pages in the results list
The HTTP request and response handling is done with methods and classes from Python's standard-library urllib and urllib2 modules. The parsing of the HTML pages can be done with the standard library's HTMLParser or with third-party modules such as Beautiful Soup.
The following snippet demonstrates requesting and receiving a search at the site indicated in the question. The site is ASP.NET-driven, so we need to send several form fields, some of them with 'horrible' values, because these are used by the ASP.NET logic to maintain state and, to some extent, to authenticate the request. The requests have to be sent with the HTTP POST method, as this is what this ASP.NET application expects. The main difficulty is identifying the form fields and associated values that the application expects (getting pages with Python is the easy part).
This code is functional, or more precisely, was functional, until I removed most of the VSTATE value, and possibly introduced a typo or two by adding comments.
import urllib
import urllib2
uri = 'http://legistar.council.nyc.gov/Legislation.aspx'
# the http headers are useful to simulate a particular browser (some sites
# deny access to non-browsers: bots, etc.)
# we also need to pass the content type for the POST body.
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.0.13) Gecko/2009073022 Firefox/3.0.13',
    'Accept': 'text/html,application/xhtml+xml,application/xml; q=0.9,*/*; q=0.8',
    'Content-Type': 'application/x-www-form-urlencoded'
}
# we group the form fields and their values in a list (any
# iterable, actually) of name-value tuples. This helps with
# clarity and also makes it easier to encode them later.
formFields = (
    # the viewstate is actually 800+ characters in length! I truncated it
    # for this sample code. It can be lifted from the first page
    # obtained from the site. It may be ok to hardcode this value, or
    # it may have to be refreshed each time / each day, by essentially
    # running an extra page request and parse, for this specific value.
    (r'__VSTATE', r'7TzretNIlrZiKb7EOB3AQE ... ...2qd6g5xD8CGXm5EftXtNPt+H8B'),
    # following are more of these ASP form fields
    (r'__VIEWSTATE', r''),
    (r'__EVENTVALIDATION', r'/wEWDwL+raDpAgKnpt8nAs3q+pQOAs3q/pQOAs3qgpUOAs3qhpUOAoPE36ANAve684YCAoOs79EIAoOs89EIAoOs99EIAoOs39EIAoOs49EIAoOs09EIAoSs99EI6IQ74SEV9n4XbtWm1rEbB6Ic3/M='),
    (r'ctl00_RadScriptManager1_HiddenField', ''),
    (r'ctl00_tabTop_ClientState', ''),
    (r'ctl00_ContentPlaceHolder1_menuMain_ClientState', ''),
    (r'ctl00_ContentPlaceHolder1_gridMain_ClientState', ''),
    # but then we come to the fields of interest: the search
    # criteria, the collections to search in, etc.
    # Check boxes
    (r'ctl00$ContentPlaceHolder1$chkOptions$0', 'on'),  # file number
    (r'ctl00$ContentPlaceHolder1$chkOptions$1', 'on'),  # Legislative text
    (r'ctl00$ContentPlaceHolder1$chkOptions$2', 'on'),  # attachment
    # etc. (not all listed)
    (r'ctl00$ContentPlaceHolder1$txtSearch', 'york'),  # Search text
    (r'ctl00$ContentPlaceHolder1$lstYears', 'All Years'),  # Years to include
    (r'ctl00$ContentPlaceHolder1$lstTypeBasic', 'All Types'),  # types to include
    (r'ctl00$ContentPlaceHolder1$btnSearch', 'Search Legislation')  # Search button itself
)
# these have to be URL-encoded before being sent as the POST body
encodedFields = urllib.urlencode(formFields)
req = urllib2.Request(uri, encodedFields, headers)
f = urllib2.urlopen(req)  # that's the actual call to the http site
# *** here would normally be the in-memory parsing of f's contents,
# but instead I store it to a file. This is useful during design,
# as it provides a sample of what is to be parsed, viewable in a
# text editor for analysis.
try:
    fout = open('tmp.htm', 'w')
except IOError:
    print('Could not open output file\n')
else:
    fout.writelines(f.readlines())
    fout.close()
That's about it for getting the initial page. As said above, one would then need to parse the page, i.e. find the parts of interest, gather them as appropriate, and store them to a file/database/wherever. This job can be done in many ways: using HTML parsers, using XSLT-type technologies (after converting the HTML to XML), or even, for crude jobs, simple regular expressions. Also, one of the items one typically extracts is the "next" info, i.e. a link of sorts that can be used in a new request to the server to get subsequent pages.
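For instance, a minimal parsing sketch with Beautiful Soup could look like the following; the table id used here is only a guess based on the gridMain field above and would need to be confirmed by inspecting tmp.htm.
from bs4 import BeautifulSoup  # the third-party module mentioned above

html = open('tmp.htm').read()
soup = BeautifulSoup(html)

# hypothetical: locate the results grid and print the text of each row's cells
table = soup.find('table', id=lambda i: i and 'gridMain' in i)
if table is not None:
    for row in table.findAll('tr'):
        cells = [cell.get_text(strip=True) for cell in row.findAll('td')]
        if cells:
            print cells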
This should give you a rough flavor of what "long hand" HTML scraping is about. There are many other approaches, such as dedicated utilities, scripts in Mozilla Firefox's Greasemonkey plug-in, XSLT...
Most ASP.NET sites (the one you referenced included) will actually post their queries back to themselves using the HTTP POST verb, not the GET verb. That is why the URL is not changing as you noted.
What you will need to do is look at the generated HTML and capture all of the form values. Be sure to capture every one of them, as some are used for page validation and without them your POST request will be denied.
Other than the validation, an ASPX page is, as far as scraping and posting are concerned, no different from any other web technology.
Selenium is a great tool to use for this kind of task. You can specify the form values that you want to enter and retrieve the HTML of the response page as a string in a couple of lines of Python code.
Using Selenium you might not have to do the manual work of simulating a valid post request and all of its hidden variables, as I found out after much trial and error.
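For example, a minimal Selenium sketch for the site in the question might look like this; the element names are taken from the form fields listed in the earlier answer and are assumptions that should be verified against the live page.
from selenium import webdriver
from selenium.webdriver.support.ui import Select

driver = webdriver.Firefox()
driver.get('http://legistar.council.nyc.gov/Legislation.aspx')

# fill the search box and the two dropdowns, then submit (names are assumed)
driver.find_element_by_name('ctl00$ContentPlaceHolder1$txtSearch').send_keys('york')
Select(driver.find_element_by_name('ctl00$ContentPlaceHolder1$lstYears')).select_by_visible_text('All Years')
Select(driver.find_element_by_name('ctl00$ContentPlaceHolder1$lstTypeBasic')).select_by_visible_text('All Types')
driver.find_element_by_name('ctl00$ContentPlaceHolder1$btnSearch').click()

html = driver.page_source  # the response page as a string, ready for parsing
driver.quit()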
The code in the other answers was useful; I never would have been able to write my crawler without it.
One problem I did come across was cookies. The site I was crawling used cookies to track the session id and other security-related state, so I had to add some code to get my crawler to work:
Add this import:
import cookielib
Init the cookie stuff:
COOKIEFILE = 'cookies.lwp' # the path and filename that you want to use to save your cookies in
cj = cookielib.LWPCookieJar() # This is a subclass of FileCookieJar that has useful load and save methods
Install CookieJar so that it is used as the default CookieProcessor in the default opener handler:
try:
    cj.load(COOKIEFILE)  # on the very first run there is no cookie file yet
except IOError:
    pass
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
urllib2.install_opener(opener)
To see what cookies the site is using:
print 'These are the cookies we have received so far:'
for index, cookie in enumerate(cj):
    print index, ' : ', cookie
This saves the cookies:
cj.save(COOKIEFILE) # save the cookies
"Assume we need to select "all years" and "all types" from the respective dropdown menus."
What do these options do to the URL that is ultimately submitted?
After all, it amounts to an HTTP request sent via urllib2.
To know how to do '"all years" and "all types" from the respective dropdown menus', you do the following.
Select '"all years" and "all types" from the respective dropdown menus'
Note the URL which is actually submitted.
Use this URL in urllib2.
Related
I came across an unusual URL structure on a site. It looked like this:
https://www.agilealliance.org/glossary/xp/#q=~(infinite~false~filters~(postType~(~'post~'aa_book~'aa_event_session~'aa_experience_report)~tags~(~'xp))~searchTerm~'~sort~false~sortDirection~'asc~page~1)
It seems that a widget on the page injects its category, pagination, and sort options into these values and reads them back from the URL. Does this format for storing data in the URL have a name, or is it an esoteric format someone made up?
What's the purpose of doing this over using regular GET params, or at least using a more conventional format after the fragment?
If you inspect the URL carefully, you'll see that the parameters you describe are placed after the fragment marker (#), meaning they're not sent to the server but are used by the client instead.
In this case, the client (JavaScript) builds them into something like an Elasticsearch query that's then POSTed to the server, in order to update the listing you see on your screen.
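A small illustration of that split, using Python's urlparse (the URL is a shortened version of the one in the question):
from urlparse import urlparse  # on Python 3: from urllib.parse import urlparse

url = ("https://www.agilealliance.org/glossary/xp/"
       "#q=~(infinite~false~filters~(postType~(~'post))~searchTerm~'~page~1)")
parts = urlparse(url)
print parts.scheme + '://' + parts.netloc + parts.path  # the only part the server is asked for
print parts.fragment  # the ~(...)-encoded state, read only by the page's JavaScript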
I need to submit one form multiple times in parallel. The server accepts the parameter _ASYNCPOST.
I can explain in an abstract way how the page works:
Login
Submit form search (POST)
POST same form with new data (all these need to be done in parallel)
In the last step, I yield all the requests with every parameter I could find (including __VIEWSTATE, __EVENTTARGET, etc.).
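Roughly, that step looks like this (a sketch with placeholder field names, not my actual spider):
import scrapy

class FormSpider(scrapy.Spider):
    name = 'form_parallel'  # placeholder

    def parse_search_page(self, response):
        # from_response copies the hidden fields (__VIEWSTATE, __EVENTVALIDATION, ...)
        # from the page it was built from, then we yield one POST per data set
        for term in getattr(self, 'search_terms', ['a', 'b', 'c']):
            yield scrapy.FormRequest.from_response(
                response,
                formdata={
                    'ctl00$txtSearch': term,  # hypothetical search field
                    '_ASYNCPOST': 'true',     # the parameter the server accepts
                },
                callback=self.parse_results,
                dont_filter=True,             # all posts share the same URL
            )

    def parse_results(self, response):
        self.logger.info('got %d bytes', len(response.body))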
The problem is that the first POST works, but the rest return an error saying "The server data does not match the browser data, hit refresh".
Is what I'm trying to achieve possible?
I followed this doc https://blog.scrapinghub.com/2016/04/20/scrapy-tips-from-the-pros-april-2016-edition/
I need to reuse a value that is generated by my previous request.
For example, in the first request I make a POST to the URL /api/products/{UUID} and get an HTTP response with code 201 (Created) and an empty body.
In the second request I want to fetch that product with GET /api/products/{UUID}, where the UUID should come from the first request.
So the question is: how do I store that UUID between requests and reuse it?
You can use the Request Sent dynamic values (https://paw.cloud/extensions?extension_type=dynamic_value&q=request+send); these will give you the value used the last time you sent a given request.
In your case you will want to combine the URLSentValue with the RegExMatch extension (https://paw.cloud/extensions/RegExMatch) to first get the URL as it was last sent for the request and then extract the UUID from that URL.
The problem is in your first request's answer. Just don't return "[...] an empty body."
If you are talking about a REST design, you would return the UUID in the first request's response and the client would use it in its second call: GET /api/products/{UUID}
The basic idea behind REST is that the server doesn't store any information about previous requests and is "stateless".
I would also adjust your first call. In general the server should generate the UUID and return it (maybe you have reasons to deviate from that, in which case please excuse me). The server has (at least sometimes) a better random generator, and you can avoid conflicts. So you would usually design it like this:
CLIENT: POST /api/products/ -> Server returns: 201 {product_id: UUID(1234...)}
Client: GET /api/products/{UUID} -> Server returns: 200 {product_detail1: ..., product_detail2: ...}
If your client "loses" the information and you want it to be able to retrieve its products later, you would usually implement an API endpoint like this:
Client: GET /api/products/ -> Server returns: 200 [{id: UUID(1234...), title: ...}, {id: UUID(5678...), title: ...}]
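A quick client-side sketch of that flow (using the requests library; the host and field names are just the illustrative ones from above):
import requests

# the server generates the UUID and returns it in the 201 response body
resp = requests.post('https://example.com/api/products/', json={'title': 'Foo'})
resp.raise_for_status()
product_id = resp.json()['product_id']

# the client reuses the returned id for the follow-up call
detail = requests.get('https://example.com/api/products/%s' % product_id)
print detail.json()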
Given something like this, presuming the {UUID} is your replacement "variable":
It is probably so simple it escaped you. All you need to do is create a text file, say UUID.txt:
(with sample data say "12345678U910" as text in the file)
Then all you need to do is replace the {UUID} in the URL with a dynamic token for a file. Delete the {UUID} portion, then right-click in the URL line where it was and select
Add Dynamic Value -> File -> File Content:
You will get a drag-n-drop reception widget:
Either press the "Choose File..." or drop the file into the receiver widget:
Don't worry that the dynamic variable token (blue thing in URL) doesn't change yet... Then click elsewhere to let the drop receiver go away and you will have exactly what you want, a variable you can use across URLs or anywhere else for that matter (header fields, form fields, body, etc):
Paw is a great tool that goes asymptotic to awesome when you explore the dynamic value capability. The most powerful one I have found yet is the regular expression parsing, which can parse raw reply HTML and capture anything you want for the next request... For example, if your UUID came from some user input, was ingested by the server, and then returned in an HTML reply, you could capture it from the reply HTML and re-inject it into the URL, or into any field, or even add it to the cookies using the Dynamic Value capabilities of Paw.
#chickahoona's answer touches on the more usual way of doing it, with the first request posting to an endpoint without a UUID and the server returning it. With that in place, you can use the RegExMatch extension to extract the value from the server's response and use it in subsequent requests.
Alternatively, if you must generate the UUID on the client side, then again the RegExMatch extension can help: simply choose the create request's URL as the source and provide a regexp that will strip the UUID off the end of it, such as /([^/]+)$.
A third option I'll throw out to you: put the UUID in an environment variable and have all of your requests reference it from there.
I'm a newbie, but I think Paw can do what I need:
I need to extract a session id behind a login page.
I go to https://admin.booking.com, fill in the form (login and password), and the landing page behind it includes a session id:
https://admin.booking.com/pc/index.html?ses=xxxxyyyyyzzzzz11112222233333
I'd like to:
1) push my credentials with Paw as part of my request,
2) get the above (ses) item from the response, so I can use the PHP script extension provided by Paw and then call this script "on demand".
Is this possible? If so, what should I do?
Thanks for your help
UPDATE: we've added a documentation article describing the process in a little more detail: Login via a web form in Paw. It also details how to deal with CSRF tokens.
Paw isn't quite ready yet for handling web/HTML forms. There is, though, a way to do it properly: if you inspect the form with the Chrome dev tools, you'll find the names of the inputs in the DOM/HTML:
In your case, you have the inputs: loginname, password, lang.
Also, find the <form…> tag to see what the action attribute is. If there's no action attribute (as in your example), it means the target URL for your form is the current page's URL (https://admin.booking.com/ in your case). Also make sure method="POST" is present in the <form…> tag, otherwise this approach won't work.
Then jump into Paw and set:
URL (in your case https://admin.booking.com/)
method to POST
go to the Body tab, use "Form URL-Encoded", and fill in the fields from your form
If all works, you'll see Paw show a redirection request, and if you go to the right-hand side panel under "Response" > "Headers", you should see a Location header with a value similar to the URL you initially mentioned (https://admin.booking.com/pc/index.html?ses=xxxxyyyyyzzzzz11112222233333). Hurray! You got your value into Paw!
Now that you have that, you can create a new request (click the + button at the bottom of the left-hand list). Wherever you want to use this session token/ID, you can insert a dynamic value to retrieve that URL value. There's more info in our docs, but I'll describe the steps here:
On whichever field you want to insert the token, right-click and pick Responses > Response Header.
Make sure you pick the first request in the "Request" dropdown menu, and enter Location in the "Header" field:
You should see the value of the Location header of the previous response appear here.
Now what you want to do is extract only the part you need (i.e. the value of the ses param in your case). For that you'll need the RegExMatch extension for Paw, so please install it now: https://luckymarmot.com/paw/extensions/RegExMatch
Copy the dynamic value you have just inserted (the blue token), and right-click on that field to insert a new dynamic value, and pick Extensions > RegExp match:
In the Input field, paste the dynamic value you just copied. Then use the RegExp field to write a regular expression that will extract the part of the URL you want (in your case, ses=(.*) should work).
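For clarity, this is what that RegExMatch step boils down to, expressed in plain Python (the Location value is the placeholder one from the question):
import re

location = 'https://admin.booking.com/pc/index.html?ses=xxxxyyyyyzzzzz11112222233333'
match = re.search(r'ses=(.*)', location)
if match:
    print match.group(1)  # -> xxxxyyyyyzzzzz11112222233333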
Now you're set up: you should be able to use this little new blue token wherever you like, and it will automagically extract the value from the previous form. And whenever you send the initial request again and get a new token, everything else will update too! :)
This was a somewhat long guide, but I hope it helps you, and hopefully others too.
I'm doing some heavy web scraping using Python. In some cases, POST data is sent not through a form submit but through some JavaScript, which I cannot interact with via this approach. In order to circumvent this, I've been appending the names and values for the POST requests to the URL and then visiting that URL.
This method was working fine until I came across a site that used this kind of structure: [sitename].com/?[pagename].do/. I admit total ignorance about this .do extension, though some light searching tells me that it has to do with Struts and a Java-based backend. In this case it seems to be a way of dynamically generating a table; I'm trying to filter the results of that table. What I want to enter is something like [sitename].com/?[pagename].do?[name]=[value]&[name]=[value], but this doesn't work, nor does it even seem like it should work. I attempted it using several variations in syntax. It seems like something I don't quite understand is going on here.
I wish I could direct you to the actual site, but unfortunately I cannot due to the sensitive nature of the project. Let me know, though, if there's any additional information that would be helpful in providing an answer. Thanks in advance.
Edit: This is not really a "my code isn't working" question, as it's the underlying functionality that I would like to emulate in my code which is troubling me, but I'll do my best to get grittier. I'm contractually bound not to share the names of the sites we're studying, but I will try to model the problem. I am hoping that someone with some familiarity with the back-end activity that sends this .do page to the browser will be able to shed some light.
import urllib
import urllib2
#
## case 1: a site that i have success in scraping
url = 'http://[sitename]/[pagename]'
values = {'s' : '40', 'pg' : '1'}
data = urllib.urlencode(values)
req = urllib2.Request(url, data)
response = urllib2.urlopen(req)
the_page = response.read()
print the_page #i get the filtered data that i am looking for
#
## case 2: the site that poses a problem for the encoding of post parameters
url = 'http://[sitename]/?[pagename].do/' # this site uses a .do file to generate
# the content i want to filter. note that the page name is preceded by ?.
values = {'s' : '40', 'pg' : '1'}
data = urllib.urlencode(values)
req = urllib2.Request(url, data)
response = urllib2.urlopen(req)
the_page = response.read()
print the_page # i am taken back to the root of the site,
# the same result i would get if i entered nonsense
# post parameters that did not correspond with actual control names.
Here, also, is an example of some JavaScript on the page that accomplishes what I'd like to do with my scraper:
function page_next (id) {
    $("#loading").fadeIn("normal");
    $.post("/?dumps.do/", {s: id, pg: 2},
        function( data ) {
            var content = $( data ).find( '#dumps' );
        }
    )
}
I don't know what site you are parsing, but this: [sitename].com/?[pagename].do/ is not something I would call default Struts behaviour, assuming it's indeed a Struts application.
Having a .do extension was indeed something Struts used to use for request mapping, but the URL in that case should be [sitename].com/[pagename].do, not [sitename].com/?[pagename].do/.
In the second form, the action is in fact a parameter in a query string. This is why this syntax is broken: [sitename].com/?[pagename].do?[name]=[value]&[name]=[value]. You want to send a query string to the action but the action itself is a parameter in the query string.
But that's not the issue. The issue is that the site is doing something with that parameter and expects to receive its data in a certain way, a way you were not able to reverse engineer.
Assuming again that this is a Struts application, Struts uses a front controller to intercept all action.do URLs and then uses the action to invoke a particular class in the application, a class that is mapped to that particular action. The format for this should be [sitename].com/[pagename].do. That would be similar to having, say, [sitename].com/[pagename].php.
But having the action as a parameter makes me think that the site has a different front controller (not that of Struts) that is taking the parameter from the query string and passing it downstream to the Struts framework.
There could be a lot of reasons for having this funky way of handling requests, including making it harder for others to scrape the site, although this one seems fairly straightforward:
$.post("/?dumps.do/", {s: id, pg: 2}, ...
Have you tried doing a POST to the root of the application with the action in the query string?
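For what it's worth, here is a sketch of that suggestion in the same urllib2 style as your snippet; the host is your placeholder, and the path and fields mirror the page's own $.post("/?dumps.do/", {s: id, pg: 2}) call:
import urllib
import urllib2

url = 'http://[sitename].com/?dumps.do/'  # action kept in the query string, as the JS does it
data = urllib.urlencode({'s': '40', 'pg': '2'})
req = urllib2.Request(url, data)  # supplying data makes urllib2 issue a POST
response = urllib2.urlopen(req)
print response.read()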