I have the following string:
soqDi22c2_A-eY4ahWKJV6GAYgmuJBZ3poNNEixha1lOhXxxoucRuuzmcyDD_9ZYp_ECXRPbrBf6issNn23CUDJrh_A5L3Y5dHhB0o_U5Oq_j4rDCXOJ4Q==
It's a query parameter generated by a form on a page (this is done server-side in ASP.NET). We are able to submit this form programmatically, get the string we need (it just leads to a detail page of an object [a real-world parcel/building, publicly accessible]), and redirect our user to it. However, I would like to know if there is a way to decrypt/deobfuscate this string to find out what it contains, and whether we could generate these ourselves without going through the form (it's a multi-step form).
The string also has some sort of expiration, so unfortunately I cannot provide a link to the result page; it would stop working after about 10 minutes.
It feels a bit like it's base64, but after trying to run it through base64 -d, it says it's invalid.
It's likely base64 with + and / replaced with - and _ to make it more browser-friendly.
Though even if it's base64-encoded, it may just be a completely random key. You won't necessarily be able to decode it to something readable.
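For what it's worth, you can check that quickly by decoding it as URL-safe base64; a minimal Python sketch (the decoded bytes will almost certainly be opaque ciphertext or random data rather than readable text):

import base64

s = "soqDi22c2_A-eY4ahWKJV6GAYgmuJBZ3poNNEixha1lOhXxxoucRuuzmcyDD_9ZYp_ECXRPbrBf6issNn23CUDJrh_A5L3Y5dHhB0o_U5Oq_j4rDCXOJ4Q=="
raw = base64.urlsafe_b64decode(s)  # handles the - and _ alphabet; equivalent to base64 -d after tr '_-' '/+'
print(len(raw))   # 88 bytes for this string
print(repr(raw))  # opaque binary, as expected for an encrypted or random token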
I'm working on some interesting APIs that have a "?" in their path, like so:
/api/?other/stuff/here
However, if I type "?" in the request URL in Paw, it automatically moves my cursor into the query parameter fields. Is there a way to disable this? I'm forced to use Postman at the moment to work around this, which is less than ideal.
Never mind, using %3F instead fixed the issue.
As mentioned before, using %3F should work nicely!
Another, more generic way is to use the URL-Encode dynamic value:
Right-click on the field where you want to insert the special character and pick Encoding > URL-Encoding > Encode
A popup opens and you can type your special character (here ?) in the Input field. You should see the preview of the encoded value at the bottom of the popup.
Continue typing the rest of the URL after this dynamic value, and you should be good to go!
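If you ever need to produce the percent-encoded form outside of Paw, it's a one-liner; a quick Python check of the example path from the question:

try:
    from urllib.parse import quote  # Python 3
except ImportError:
    from urllib import quote        # Python 2
print(quote("?other/stuff/here", safe="/"))  # -> %3Fother/stuff/here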
I'm a newbie but I think Paw can do what I need:
I need to extract a session ID from behind a login page.
I go to https://admin.booking.com, fill in the form (login and password), and the landing page includes a session ID:
https://admin.booking.com/pc/index.html?ses=xxxxyyyyyzzzzz11112222233333
I'd like to:
1) Push credentials with Paw as part of my request,
2) get the ses value from the response, so I can use the PHP script extension provided by Paw and then call this script "on demand".
Is this possible? If so, what should I do?
Thanks for your help
UPDATE: we've added a documentation article to describe the process a little more: Login via a web form in Paw. We've also detailed how to deal with CSRF tokens.
Paw isn't quite ready yet for handling web/HTML forms. Still, there's a way to do it properly: if you inspect the form with the Chrome dev tools, you'll find the names of the inputs in the DOM/HTML:
In your case, you have the inputs: loginname, password, lang.
Also, find the <form…> tag to see what the action attribute is. If there's no action attribute (as in your example), it means the target URL for your form is the current page's URL (https://admin.booking.com/ in your case). Make sure method="POST" is also present in the <form…> tag, otherwise this method won't work.
Then jump into Paw and set:
URL (in your case https://admin.booking.com/)
method to POST
go to the Body tab, use "Form URL-Encoded", and fill in the fields from your form
If all works, you'll see Paw show a redirection, and if you go to the right-hand side panel under "Response" > "Headers", you should see a Location header with a value similar to the URL you initially mentioned (https://admin.booking.com/pc/index.html?ses=xxxxyyyyyzzzzz11112222233333). Hurray! You got your value into Paw!
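(For reference, the same login POST outside of Paw looks roughly like this in Python; the credential values are placeholders and the field names are the ones inspected above, so treat this only as a sketch. urllib2 follows the redirect for you, so the final URL is the same value Paw shows in the Location header.)

import urllib
import urllib2

data = urllib.urlencode({
    'loginname': 'you@example.com',  # placeholder
    'password': 'your-password',     # placeholder
    'lang': 'en',                    # field present in the form; the value is a guess
})
resp = urllib2.urlopen('https://admin.booking.com/', data)  # passing data makes this a POST
print(resp.geturl())  # e.g. https://admin.booking.com/pc/index.html?ses=...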
Now that you have that, you can create a new request (click the + button at the bottom of the left-hand list). Wherever you want to use this session token/ID, you can insert a dynamic value that retrieves that URL value. There's more info in our docs, but I'll describe the steps here:
On whichever field you want to insert the token, right-click and pick Responses > Response Header.
Make sure you pick the first request in the "Request" dropdown menu, and enter Location in the "Header" field:
You should see the value of the Location header of the previous response appear here.
Now what you want to do is extract only the part you need (i.e. the value of the ses param in your case). For that you'll need this extension for Paw, so please install it now: https://luckymarmot.com/paw/extensions/RegExMatch
Copy the dynamic value you just inserted (the blue token), then right-click on that field to insert a new dynamic value and pick Extensions > RegExp match:
In the Input field, paste the dynamic value you copied, and use the RegExp field to write a regular expression that extracts the part of the URL you want (ses=(.*) should work in your case).
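(This is essentially what the RegExMatch extension does; a quick Python illustration of the same pattern, using the example URL from the question:)

import re

location = 'https://admin.booking.com/pc/index.html?ses=xxxxyyyyyzzzzz11112222233333'
match = re.search(r'ses=(.*)', location)  # the same pattern as in the RegExp field
if match:
    print(match.group(1))  # -> xxxxyyyyyzzzzz11112222233333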
Now that you're set up, you should be able to use this new little blue token wherever you like and automagically extract the value from the previous form. And whenever you send the initial request again and get a new token, everything else will update too! :)
This was a rather long guide, but I hope it will help you, and hopefully others too.
We have taken over a .NET project recently and upon looking at the db we have the following in some columns:
1) Some columns have values such as
&quot; &amp; etc etc
2) Some have <script> tags and other non-HTML-encoded tags
This data is displayed all over the site. When trying out HtmlEncode on point number 1 we get the following: &quot; -> &amp;quot;
Obviously we want to HTML-encode when displaying, since point 2 contains JavaScript which we don't want executed.
Is there a way to use HtmlEncode on values that might or might not be already encoded?
Is there a way to use HtmlEncode on values that might or might not be already encoded?
No, there isn't.
What I would suggest is that you write a quick script that goes through the database and decodes the already-encoded data. Then use something like the Microsoft AntiXSS library (tutorial here) to encode all output before it gets written to the web page. Remember that it is fine to store the data unencoded [1]; the danger is when you echo it back out to the end user.
Some controls already encode output using the encoding functionality built into the .NET framework, which is not bulletproof against XSS; you just have to either avoid using those controls or not encode the data they display yourself. There is a FAQ entry about the MS controls that encode at the bottom of the page behind the first link, which you should read. Also, some third-party control vendors encode the output of their controls; you would do yourself a favor by testing them to make sure they are not still susceptible to XSS.
[1] Don't forget to take steps to prevent SQL injection, though!
Before applying HtmlEncode("myText"), apply the HtmlDecode method to the input text.
That way you will decode your string from:
&quot; &amp; etc etc <script>
to
" & etc etc <script>
and afterwards apply the encoding "from scratch".
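To illustrate the round trip (in .NET the calls are HttpUtility.HtmlDecode and HtmlEncode; this is just a Python sketch of the same decode-then-encode idea):

import html  # Python 3

def safe_encode(value):
    decoded = html.unescape(value)           # already-encoded input becomes plain text; plain text is left alone
    return html.escape(decoded, quote=True)  # then encode once, from scratch

print(safe_encode('&quot; &amp; etc etc'))          # -> &quot; &amp; etc etc   (no double encoding)
print(safe_encode('<script>alert("x")</script>'))   # -> &lt;script&gt;alert(&quot;x&quot;)&lt;/script&gt;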
I need to scrape query results from an .aspx web page.
http://legistar.council.nyc.gov/Legislation.aspx
The URL is static, so how do I submit a query to this page and get the results? Assume we need to select "all years" and "all types" from the respective dropdown menus.
Somebody out there must know how to do this.
As an overview, you will need to perform four main tasks:
to submit request(s) to the web site,
to retrieve the response(s) from the site
to parse these responses
to have some logic to iterate over the tasks above, with parameters associated with navigation (to the "next" pages in the results list)
The HTTP request and response handling is done with methods and classes from Python's standard library modules urllib and urllib2. The parsing of the HTML pages can be done with the standard library's HTMLParser or with other modules such as Beautiful Soup.
The following snippet demonstrates requesting and receiving the results of a search at the site indicated in the question. This site is ASP-driven, and as a result we need to send several form fields, some of them with 'horrible' values, since these are used by the ASP logic to maintain state and to authenticate the request to some extent. The requests have to be sent with the HTTP POST method, as this is what this ASP application expects. The main difficulty is identifying the form fields and associated values which ASP expects (getting pages with Python is the easy part).
This code is functional, or more precisely, was functional until I removed most of the VSTATE value and possibly introduced a typo or two by adding comments.
import urllib
import urllib2
uri = 'http://legistar.council.nyc.gov/Legislation.aspx'
# the http headers are useful to simulate a particular browser (some sites deny
# access to non-browsers, i.e. bots, etc.)
# the content type also needs to be passed
headers = {
'User-Agent': 'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.0.13) Gecko/2009073022 Firefox/3.0.13',
'Accept': 'text/html,application/xhtml+xml,application/xml; q=0.9,*/*; q=0.8',
'Content-Type': 'application/x-www-form-urlencoded'
}
# we group the form fields and their values in a list (any
# iterable, actually) of name-value tuples. This helps
# with clarity and also makes it easier to encode them later.
formFields = (
# the viewstate is actually 800+ characters in length! I truncated it
# for this sample code. It can be lifted from the first page
# obtained from the site. It may be ok to hardcode this value, or
# it may have to be refreshed each time / each day, by essentially
# running an extra page request and parse, for this specific value.
(r'__VSTATE', r'7TzretNIlrZiKb7EOB3AQE ... ...2qd6g5xD8CGXm5EftXtNPt+H8B'),
# following are more of these ASP form fields
(r'__VIEWSTATE', r''),
(r'__EVENTVALIDATION', r'/wEWDwL+raDpAgKnpt8nAs3q+pQOAs3q/pQOAs3qgpUOAs3qhpUOAoPE36ANAve684YCAoOs79EIAoOs89EIAoOs99EIAoOs39EIAoOs49EIAoOs09EIAoSs99EI6IQ74SEV9n4XbtWm1rEbB6Ic3/M='),
(r'ctl00_RadScriptManager1_HiddenField', ''),
(r'ctl00_tabTop_ClientState', ''),
(r'ctl00_ContentPlaceHolder1_menuMain_ClientState', ''),
(r'ctl00_ContentPlaceHolder1_gridMain_ClientState', ''),
# but then we come to the fields of interest: the search
# criteria, the collections to search from, etc.
# Check boxes
(r'ctl00$ContentPlaceHolder1$chkOptions$0', 'on'), # file number
(r'ctl00$ContentPlaceHolder1$chkOptions$1', 'on'), # Legislative text
(r'ctl00$ContentPlaceHolder1$chkOptions$2', 'on'), # attachment
# etc. (not all listed)
(r'ctl00$ContentPlaceHolder1$txtSearch', 'york'), # Search text
(r'ctl00$ContentPlaceHolder1$lstYears', 'All Years'), # Years to include
(r'ctl00$ContentPlaceHolder1$lstTypeBasic', 'All Types'), #types to include
(r'ctl00$ContentPlaceHolder1$btnSearch', 'Search Legislation') # Search button itself
)
# these have to be encoded
encodedFields = urllib.urlencode(formFields)
req = urllib2.Request(uri, encodedFields, headers)
f= urllib2.urlopen(req) #that's the actual call to the http site.
# *** here would normally be the in-memory parsing of f
# contents, but instead I store this to file
# this is useful during design, allowing to have a
# sample of what is to be parsed in a text editor, for analysis.
try:
    fout = open('tmp.htm', 'w')
except IOError:
    print('Could not open output file\n')
else:
    fout.writelines(f.readlines())
    fout.close()
That's about it for getting the initial page. As said above, one would then need to parse the page, i.e. find the parts of interest, gather them as appropriate, and store them to file/database/wherever. This job can be done in many ways: using HTML parsers, XSLT-type technologies (after parsing the HTML to XML), or even, for crude jobs, simple regular expressions. Also, one of the items one typically extracts is the "next" info, i.e. a link of sorts that can be used in a new request to the server to get subsequent pages.
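For example, a rough parsing sketch of the saved page with Beautiful Soup; the 'gridMain' id fragment is only a guess based on the form field names above, so inspect tmp.htm for the real markup:

import re
from bs4 import BeautifulSoup  # pip install beautifulsoup4

soup = BeautifulSoup(open('tmp.htm').read(), 'html.parser')
table = soup.find('table', id=re.compile('gridMain'))  # assumed id fragment -- verify in the saved page
if table is not None:
    for row in table.find_all('tr'):
        cells = [td.get_text(strip=True) for td in row.find_all('td')]
        if cells:
            print(cells)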
This should give you a rough flavor of what "long hand" HTML scraping is about. There are many other approaches to this, such as dedicated utilities, scripts in Mozilla's (Firefox) GreaseMonkey plug-in, XSLT, and so on.
Most ASP.NET sites (including the one you referenced) will actually post their queries back to themselves using the HTTP POST verb, not the GET verb. That is why the URL is not changing, as you noted.
What you will need to do is look at the generated HTML and capture all of the form values. Be sure to capture all of them, as some are used for page validation, and without them your POST request will be denied.
Other than the validation, an ASPX page is, as far as scraping and posting are concerned, no different from other web technologies.
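In practice that means doing a plain GET of the page first and lifting the hidden fields out of it instead of hardcoding them; a short sketch along the lines of the code above (Beautiful Soup is an assumption here, any HTML parser will do):

import urllib2
from bs4 import BeautifulSoup

page = urllib2.urlopen('http://legistar.council.nyc.gov/Legislation.aspx').read()
soup = BeautifulSoup(page, 'html.parser')
hidden = dict((i.get('name'), i.get('value', ''))
              for i in soup.find_all('input', type='hidden')
              if i.get('name'))
print(hidden.keys())  # typically includes __VIEWSTATE, __EVENTVALIDATION, etc.
# merge these into the search form fields before POSTing the query back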
Selenium is a great tool for this kind of task. You can specify the form values you want to enter and retrieve the HTML of the response page as a string in a couple of lines of Python code.
Using Selenium, you might not have to do the manual work of simulating a valid POST request and all of its hidden variables, as I found out after much trial and error.
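A minimal sketch of that approach; the element names are taken from the form fields listed in the earlier answer, and the visible option labels ("All Years", "All Types") are assumptions, so verify both against the live page (Telerik-style combo boxes may need to be clicked through instead of driven via Select):

import time
from selenium import webdriver
from selenium.webdriver.support.ui import Select

driver = webdriver.Firefox()
driver.get('http://legistar.council.nyc.gov/Legislation.aspx')
Select(driver.find_element_by_name('ctl00$ContentPlaceHolder1$lstYears')).select_by_visible_text('All Years')
Select(driver.find_element_by_name('ctl00$ContentPlaceHolder1$lstTypeBasic')).select_by_visible_text('All Types')
driver.find_element_by_name('ctl00$ContentPlaceHolder1$btnSearch').click()
time.sleep(5)                # crude wait for the postback; use WebDriverWait in real code
html = driver.page_source    # the response page as a string, ready for parsing
driver.quit()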
The code in the other answers was useful; I never would have been able to write my crawler without it.
One problem I did come across was cookies. The site I was crawling was using cookies for session ID/security stuff, so I had to add code to get my crawler to work:
Add this import:
import cookielib
Init the cookie stuff:
COOKIEFILE = 'cookies.lwp' # the path and filename that you want to use to save your cookies in
cj = cookielib.LWPCookieJar() # This is a subclass of FileCookieJar that has useful load and save methods
Install CookieJar so that it is used as the default CookieProcessor in the default opener handler:
try:
    cj.load(COOKIEFILE)  # reuse saved cookies if the file already exists
except IOError:
    pass                 # first run: no cookie file yet
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
urllib2.install_opener(opener)
To see what cookies the site is using:
print 'These are the cookies we have received so far:'
for index, cookie in enumerate(cj):
    print index, ' : ', cookie
This saves the cookies:
cj.save(COOKIEFILE) # save the cookies
"Assume we need to select "all years" and "all types" from the respective dropdown menus."
What do these options do to the URL that is ultimately submitted?
After all, it amounts to an HTTP request sent via urllib2.
To find out how to select "all years" and "all types" from the respective dropdown menus, do the following.
Select "all years" and "all types" from the respective dropdown menus.
Note the URL which is actually submitted.
Use this URL in urllib2.