For a data collection/analysis project, I am trying to download entries from an aspx web form at http://www.lasuperiorcourt.org/civilcasesummarynet/ui/?CT=AP&casetype=appellate, but I'm having little success so far.
The idea is to download the relevant information from the web page through wget and write the results to a single HTML file. From that output I would then compile statistics on the relevant cases (e.g. case nos. BV024000 to BV028933).
However, I'm having trouble getting wget to retrieve data from the form. I've been using:
wget --post-data "frmsearch=BV024000" http://www.lasuperiorcourt.org/civilcasesummarynet/ui/?CT=AP^&casetype=appellate -O output.html
But I just get the original page back, not the form output. What am I doing wrong?
There are two problems:
You have typos in your command: you should wrap the URL in quotes, and the address is ui/index.aspx?CT=AP, without the ^.
When you post the form, you have to post all of the form's input fields, otherwise your POST request is not validated.
Here is how I made the request:
wget --post-data "__VIEWSTATE=%2FwEPDwUJMzM0NzAxOTczD2QWBgIBD2QWCmYPDxYCHgdWaXNpYmxlZ2RkAgIPDxYCHwBoZGQCBA8PFgIfAGhkZAIGDw8WAh8AaGRkAggPDxYCHwBoZGQCAw9kFgpmDw8WAh8AZ2RkAgIPDxYCHwBoZGQCBA8PFgIfAGhkZAIGDw8WAh8AaGRkAggPDxYCHwBoZGQCCQ9kFgICAw8PFgIfAGhkFgICAQ8QZA8WIGYCAQICAgMCBAIFAgYCBwIIAgkCCgILAgwCDQIOAg8CEAIRAhICEwIUAhUCFgIXAhgCGQIaAhsCHAIdAh4CHxYgEAUGU2VsZWN0BQZTZWxlY3RnEAUTQWxoYW1icmEgQ291cnRob3VzZQUDQUxIZxAFFUJlbGxmbG93ZXIgQ291cnRob3VzZQUDTEMgZxAFGEJldmVybHkgSGlsbHMgQ291cnRob3VzZQUDQkggZxAFEkJ1cmJhbmsgQ291cnRob3VzZQUDQlVSZxAFFUNoYXRzd29ydGggQ291cnRob3VzZQUDQ0hBZxAFEkNvbXB0b24gQ291cnRob3VzZQUDQ09NZxAFFkN1bHZlciBDaXR5IENvdXJ0aG91c2UFA0NDIGcQBRFEb3duZXkgQ291cnRob3VzZQUDRE9XZxAFG0Vhc3QgTG9zIEFuZ2VsZXMgQ291cnRob3VzZQUDRUxBZxAFE0VsIE1vbnRlIENvdXJ0aG91c2UFA0VMTWcQBRNHbGVuZGFsZSBDb3VydGhvdXNlBQNHTE5nEAUaSHVudGluZ3RvbiBQYXJrIENvdXJ0aG91c2UFA0hQIGcQBRRJbmdsZXdvb2QgQ291cnRob3VzZQUDSU5HZxAFFUxvbmcgQmVhY2ggQ291cnRob3VzZQUDTEIgZxAFEU1hbGlidSBDb3VydGhvdXNlBQNNQUxnEAUtTWljaGFlbCBBbnRvbm92aWNoIEFudGVsb3BlIFZhbGxleSBDb3VydGhvdXNlBQNBVFBnEAUTTW9ucm92aWEgQ291cnRob3VzZQUDU05JZxAFE1Bhc2FkZW5hIENvdXJ0aG91c2UFA1BBU2cQBRdQb21vbmEgQ291cnRob3VzZSBOb3J0aAUDUE9NZxAFGFJlZG9uZG8gQmVhY2ggQ291cnRob3VzZQUDU0JCZxAFF1NhbiBGZXJuYW5kbyBDb3VydGhvdXNlBQNMQVNnEAUUU2FuIFBlZHJvIENvdXJ0aG91c2UFA0xBUGcQBRhTYW50YSBDbGFyaXRhIENvdXJ0aG91c2UFA05FV2cQBRdTYW50YSBNb25pY2EgQ291cnRob3VzZQUDU00gZxAFFVNvdXRoIEdhdGUgQ291cnRob3VzZQUDU0cgZxAFF1N0YW5sZXkgTW9zayBDb3VydGhvdXNlBQNMQU1nEAUTVG9ycmFuY2UgQ291cnRob3VzZQUDU0JBZxAFGFZhbiBOdXlzIENvdXJ0aG91c2UgV2VzdAUDTEFWZxAFFldlc3QgQ292aW5hIENvdXJ0aG91c2UFA0NJVGcQBRtXZXN0IExvcyBBbmdlbGVzIENvdXJ0aG91c2UFA0xBV2cQBRNXaGl0dGllciBDb3VydGhvdXNlBQNXSCBnFgFmZGQk7ioHoNWuWLyRkeV2Jf7vbNorIw%3D%3D&CaseNumber=BV024000&submit1=Search&casetype=appellate" "http://www.lasuperiorcourt.org/civilcasesummarynet/ui/index.aspx?CT=AP&casetype=appellate" -O output.html
--2012-08-12 19:25:32-- http://www.lasuperiorcourt.org/civilcasesummarynet/ui/index.aspx?CT=AP&casetype=appellate
Resolving www.lasuperiorcourt.org... 153.43.255.56
Connecting to www.lasuperiorcourt.org|153.43.255.56|:80... connected.
HTTP request sent, awaiting response... 302 Found
Location: /civilcasesummarynet/ui/casesummary.aspx?CT=AP&casetype=appellate [following]
--2012-08-12 19:25:33-- http://www.lasuperiorcourt.org/civilcasesummarynet/ui/casesummary.aspx?CT=AP&casetype=appellate
and it worked; see the screenshot: http://i47.tinypic.com/35db8k3.png
You will probably need to fetch a fresh __VIEWSTATE value for every request.
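A rough sketch of how that could be scripted in a unix-like shell (the hidden input's exact attribute layout and the temporary file name are assumptions, not taken from the site):
# grab the form page and pull the current __VIEWSTATE out of the hidden input field
wget -q -O form.html "http://www.lasuperiorcourt.org/civilcasesummarynet/ui/index.aspx?CT=AP&casetype=appellate"
VIEWSTATE=$(grep -o 'id="__VIEWSTATE" value="[^"]*"' form.html | sed 's/.*value="//;s/"$//')
# the value still has to be URL-encoded (/, +, = and so on) before it goes into --post-data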
In what environment are you executing this command? In most unix shells, "&" is a special character that terminates the command string and sends the command, when executed, to the background, yet you aren't quoting that URL in any way.
Edit: OK, never mind... my answer wasn't that useful, except that I did not know "^" is an escape character in cmd.exe, and now I do: http://www.microsoft.com/resources/documentation/windows/xp/all/proddocs/en-us/ntcmds_shelloverview.mspx?mfr=true
I'm troubleshooting an issue that I think may be related to request filtering. Specifically, it seems every connection to the site made with a blank user agent string is shown a 403 error. I can generate other 403 errors on the server by doing things like trying to browse a directory with no default document while directory browsing is turned off. I can also generate a 403 by using the Modify Headers extension for Google Chrome to set my user agent string to the Baidu spider string, which I know has been blocked.
What I can't seem to do is generate a request with a BLANK user agent string to try that. The extensions I've looked at require something in that field. Is there a tool or method I can use to make a GET or POST request to a website with a blank user agent string?
I recommend trying a CLI tool like cURL or a UI tool like Postman. Either lets you carefully craft each header, parameter and value in your HTTP request and fully trace the end-to-end request/response exchange.
This example, straight from the cURL docs on user agents, shows how you can set the user agent from the command line:
curl --user-agent "Mozilla/4.73 [en] (X11; U; Linux 2.2.15 i686)" [URL]
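For the blank user agent specifically, a reasonably recent curl can blank out or drop the header via -H (the URL below is just a placeholder for your site):
# send the User-Agent header with an empty value (the trailing semicolon means "no value")
curl -H "User-Agent;" "http://yoursite.example/"
# or omit the User-Agent header from the request entirely (the trailing colon removes it)
curl -H "User-Agent:" "http://yoursite.example/"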
In Postman it's just as easy: tinker with the headers and params as needed. You can also click the "Code" link on the right-hand side and view the request as raw HTTP when you want to see exactly what will be sent.
You can also use a heap of other HTTP tools such as Paw and Insomnia, all of which are well suited to the task at hand.
One last tip: in the Chrome developer tools, you can right-click a specific request on the Network tab and copy it as cURL. You can then paste the cURL command and modify it as needed. In Postman you can import a request, paste the raw text, and Postman will interpret the cURL command for you, which is particularly handy.
The Problem
Hi. I'm trying to use Google Script's "UrlFetchApp" function to pull data from an email marketing tool (Klaviyo). Klaviyo's documentation, linked here, provides an API endpoint via an HTTP request.
I'm trying to use Google Script's "UrlFetchApp" to pull data from Klaviyo's API. Unfortunately, Klaviyo requires a key parameter in order to access the data, but Google doesn't document if (or where) I can add a custom parameter (it should look something like "api_key=xxxxxxxxxxxxxxxx"). It's quite easy for me to pull the data into my terminal using the api_key parameter, but ideally I'd have it pulled via Google Scripts and added to a Google Sheet appropriately. If I can get the JSON into Google Scripts, I can work with the data to output it how I want.
KLAVIYO'S EXAMPLE REQUEST FOR TERMINAL
curl https://a.klaviyo.com/api/v1/metrics -G \
-d api_key=XX_XXXXXXXXXXXXXXX
THIS OUTPUTS CORRECT DATA IN JSON
Note: my ultimate goal is to pipe the data into Google Data Studio on a recurring basis for reporting. I thought I'd get the data into a CSV for download/upload into Data Studio. If I'm thinking about this the wrong way, let me know.
Regarding the -G flag, from the curl man pages (emphasis mine):
When used, this option will make all data specified with -d, --data, --data-binary or --data-urlencode to be used in an HTTP GET request instead of the POST request that otherwise would be used. The data will be appended to the URL with a '?' separator.
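In other words, Klaviyo's -G example is equivalent to appending the key to the URL as a query string yourself:
# the same request as the -G / -d example above, with the parameter placed in the URL by hand
curl "https://a.klaviyo.com/api/v1/metrics?api_key=XX_XXXXXXXXXXXXXXX"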
Given that the default HTTP method for UrlFetchApp.fetch() is "GET", your request will be very simple:
UrlFetchApp.fetch("https://a.klaviyo.com/api/v1/metrics?api_key=XX_XXXXXXXXXXXXXXX");
Ok, this is weird, and I can fix it by getting rid of spaces, but what is going on?
I have two files on my website, AAAy H.mp3 and AAAy L.mp3. I can browse to them just fine.
When I do:
curl "http://mikehelland.com/omg/AAAy L.mp3"
I get the mp3 file.
When I do:
curl "http://mikehelland.com/omg/AAAy H.mp3"
I get 400, bad request. Also doing:
curl "http://mikehelland.com/omg/AAAY H.mp3"
yields a 400.
Changing the H to an L or A or M or anything else seems to work fine. What's going on?
This is because of how the server interprets the space in the file name. Try replacing it with %20 (which represents a space in a URL), like this:
curl "http://mikehelland.com/omg/AAAy%20H.mp3"
If you access the file with a browser and open the developer console, you will find that the browser inserts this %20 into the GET request. That is why you can access the file from the browser but not from the terminal.
Also, try adding the --verbose option to the curl command. I noticed that when you request some nonexistent file without a space in the name, the response contains the header 'Server: Apache/2', but when you add a space it becomes 'Server: nginx'.
So maybe there is a special case where the server stops handling the request because it can't figure out what to do with the first line of the request:
GET /omg/AAAy H.mp3 HTTP/1.1
because it expects HTTP/1.1 after /omg/AAAy, not the unexpected H.mp3. Maybe the server looks at the first character of "H.mp3" while parsing for the HTTP version, and that breaks things. So I think the reason "/omg/AAAy H.mp3" fails while "/omg/AAAy L.mp3" works comes down to the server's parsing mechanism. Of course, without %20 all of these variants violate the standard anyway.
I use LuaForWindows (latest version) and I have read this and this answer and everything I could find on the lua-users.org mailing list. Whatever I try, (most) sites only respond with either 301 or 302. I have created an example batch script which downloads (some) of the OpenGL 2.1 Reference from their man pages.
@ECHO OFF
FOR /F "SKIP=5" %%# IN ( %~fs0 ) DO lua -l socket.http -e "print(socket.http.request('https://www.opengl.org/sdk/docs/man2/xhtml/%%#.xml'))"
GOTO:EOF
glAccum
glActiveTexture
glAlphaFunc
glAreTexturesResident
glArrayElement
glAttachShader
glBegin
glBeginQuery
glBindAttribLocation
glBindBuffer
The most important part is this:
print(require('socket.http').request('https://www.opengl.org/sdk/docs/man2/xhtml/glAccum.xml')) -- added glAccum so you can run it
This ALWAYS returns a 301. The same thing happens when downloading from other random pages. (I don't note them down, so I can't give a list, but I happened to find out that some of them use Cloudflare.)
If I write an equivalent downloader in Java using URL and openConnection(), it isn't redirected.
I have already tried following the redirect manually (setting the referrer and so on) and using the 'generic' way, as most of the tips in other answers suggest.
You are using socket.http but are trying to access an https URL. LuaSocket doesn't handle the HTTPS protocol, so it sends the request to the default port 80 instead and gets a redirect to the HTTPS link (the same link); this repeats several times (as the URL doesn't really change), and in the end LuaSocket gives up and returns the 301 message.
The solution is to install LuaSec and use its ssl.https module to make the request.
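Assuming LuaRocks is available, something along these lines should work (a sketch in the same one-liner style as the batch script above, not tested under LuaForWindows specifically):
REM install HTTPS support for LuaSocket
luarocks install luasec
REM the same request as before, but going through ssl.https instead of socket.http
lua -e "print(require('ssl.https').request('https://www.opengl.org/sdk/docs/man2/xhtml/glAccum.xml'))"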
I tried to fetch the data from https://m.jetstar.com/Ink.API/api/flightAvailability?LocaleKey=en_AU&ChildPaxCount=0&DepartureDate=2016-03-21T00%3A00%3A00&ModeSaleCode=&Destination=NGO&CurrencyCode=TWD&AdultPaxCount=1&ReturnDate=&InfantPaxCount=0&Origin=TPE
It can't be fetched with curl -vv https://m.jetstar.com/Ink.API/api/flightAvailability?LocaleKey=en_AU&ChildPaxCount=0&DepartureDate=2016-03-21T00%3A00%3A00&ModeSaleCode=&Destination=NGO&CurrencyCode=TWD&AdultPaxCount=1&ReturnDate=&InfantPaxCount=0&Origin=TPE; that returns nothing.
However, the browser can fetch the whole response.
What's wrong with that?
It seems to me that "m.jetstar.com" is filtering out requests that don't include the headers a browser would send. Your curl statement needs to emulate a browser more fully to get the data. One way to see what I mean: open the developer tools in Google Chrome, select the Network tab, load the URL in the browser, then go to the row for that call, right-click it, and copy the request as a curl statement. Paste it into a notepad and you'll see all the additional headers you need. That curl statement should also work as-is.
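As a rough sketch (the header values below are placeholders; the real set is whatever your browser sends, which copy-as-cURL will show you), note also that the URL must be quoted so the shell doesn't treat the & characters as command separators:
# quote the URL and send a couple of browser-style headers (placeholder values)
curl -v "https://m.jetstar.com/Ink.API/api/flightAvailability?LocaleKey=en_AU&ChildPaxCount=0&DepartureDate=2016-03-21T00%3A00%3A00&ModeSaleCode=&Destination=NGO&CurrencyCode=TWD&AdultPaxCount=1&ReturnDate=&InfantPaxCount=0&Origin=TPE" \
  -H "User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64)" \
  -H "Accept: application/json"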
Check whether you have set any environment variable for proxy settings (such as HTTP_PROXY or HTTPS_PROXY). Verify by calling the curl command in verbose mode: curl -v.
I had set up such a variable earlier, and when I checked the curl output in verbose mode it showed the request going to the proxy address. Once I deleted the proxy variable from Advanced System Settings, it started working. Hope it helps.
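On a unix-like shell you can check and clear these for the current session as sketched below (on Windows, curl -v will likewise show whether a proxy is being used):
# list any proxy-related environment variables currently set
env | grep -i proxy
# clear them for this session, then retry in verbose mode
unset http_proxy https_proxy HTTP_PROXY HTTPS_PROXY
curl -v "https://m.jetstar.com/Ink.API/api/flightAvailability?LocaleKey=en_AU&ChildPaxCount=0&DepartureDate=2016-03-21T00%3A00%3A00&ModeSaleCode=&Destination=NGO&CurrencyCode=TWD&AdultPaxCount=1&ReturnDate=&InfantPaxCount=0&Origin=TPE"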