How can I find the URL that downloads a file? - http

I am developing a web scraper and I need to download a .pdf file from a page. I can get the file name from the html tag, but can't find the complete url (or request body) that downloads the file.
I have tried to sniff the traffic with the chrome and firefox network traffic tool and with wireshark, with no success. I can see it make a post request to the exact same url as the page itself, and so I can't understand why this happens. My guess is that the filename is being sent inside the POST request body, but I also can't find that information in those tools. If I could see the variable name in the body, I could create a copy of the request and then get the file.
How can I get that information?
Here is the website I am talking about: http://www2.trt8.jus.br/consultaprocesso/formulario/ProcessoConjulgado.aspx?sDsTelaOrigem=ListarProcessos.aspx&iNrInstancia=1&sFlTipo=T&iNrProcessoVaraUnica=126&iNrProcessoUnica=1267&iNrProcessoAnoUnica=2010&iNrRegiaoUnica=8&iNrJusticaUnica=5&iNrDigitoUnica=24&iNrProcesso=1267&iNrProcessoAno=2010&iNrProcesso2a=0&iNrProcessoAno2a=0
EDIT: for those seeking to do something similar, take a look at this website: http://curl.trillworks.com/
It converts a cURL to a python requests code. Very useful

The POST data used for the request is encoded content generated by ASP.NET. It contains various state/session information of the page that the link is on. This makes it difficult to directly scrape for the URL.
You can examine the HAR by exporting it from the Network tab in Chrome DevTools:
The __EVENTVALIDATION data is used to ensure events raised on the client originate from the controls rendered on the page from the server.
You might be able to achieve what you want by requesting the page the link is on first, then extract the required POST data from the response (containing the page state and embedded request for file), and then make a new request with this information. This assumes the server doesn't expire any sessions in the meantime.

Related

How to use scripts on relay response url?

I'm trying to style my relay response URL for A.Net, but the URL is still showing "https://test.authorize.net/gateway/transact.dll" so my CSS files are not being included. Right now it just looks like an ugly plain HTML page.
Do I have to pass along the response code and other POST information to another script on my server in order to include the styling?
Relay Response loads your page as a pass-through (Authnet reads the contents of your page and sends them to the user). This means if you must either use the full URL (e.g. including the protocol and domain name) in the path to your CSS file or put the CSS directly in the web page.

How to manipulate a .NET ASPX form programmatically?

I'm trying to manipulate a .net ASP form on a site that's using AJAX Control Toolkit. The site is only accessible to valid logins, and I do have a valid account. It consists of a search page with a form. Each time a submit button is clicked on the form, the server is updated using the values of some text fields on the form, and then the VIEWSTATE and EVENTVALIDATION tokens will be updated based on the response from the server, ready for the next request.
I'm using HttpClient in Java to do this. I suspect there's something I'm not doing correctly with regard to interacting with ASPX forms in general.
When I hit the main search page for the first time (cookies are validating my login with the server), I get the HTML for the search page back. I extract the VIEWSTATE and EVENTVALIDATION tokens for the next request. I've examined the exact form fields and their values that need to be sent to the server in a POST by looking at the Chrome debugger utility after making a request on the site manually. I've replicated them exactly as they should be, inserting the VIEWSTATE and EVENTVALIDATION appropriately.
But the response I get back from the server is not what it should be. What I get back is just the same HTML for the main search page that I get the first time I hit the webpage. The form data I'm using looks like this:
ctl00$ScriptManager1:ctl00$ContentPlaceHolder1$UpdatePanel1|ctl00$ContentPlaceHolder1$TabContainer1$TabPanel1$acceptButton
ctl00_ContentPlaceHolder1_TabContainer1_ClientState:{"ActiveTabIndex":0,"TabState":[true,true]}
__EVENTTARGET:
__EVENTARGUMENT:
__LASTFOCUS:
__VIEWSTATE:<token extracted from first page hit>
__VIEWSTATEENCRYPTED:
__EVENTVALIDATION:<token extracted from first page hit>
ctl00$ContentPlaceHolder1$LabelFee:0
ctl00$ContentPlaceHolder1$TabContainer1$TabPanel1$RadioButtonList1:Person
ctl00$ContentPlaceHolder1$TabContainer1$TabPanel1$snameText:aSurname
ctl00$ContentPlaceHolder1$TabContainer1$TabPanel1$HiddenField1:
ctl00$ContentPlaceHolder1$TabContainer1$TabPanel1$fnameText:aFirstname
ctl00$ContentPlaceHolder1$TabContainer1$TabPanel1$dayFromTextBox:01
ctl00$ContentPlaceHolder1$TabContainer1$TabPanel1$monthFromTextBox:January
ctl00$ContentPlaceHolder1$TabContainer1$TabPanel1$yearFromTextBox:2001
ctl00$ContentPlaceHolder1$TabContainer1$TabPanel1$dayToTextBox:01
ctl00$ContentPlaceHolder1$TabContainer1$TabPanel1$monthToTextBox:January
ctl00$ContentPlaceHolder1$TabContainer1$TabPanel1$yearToTextBox:2008
ctl00$ContentPlaceHolder1$TabContainer1$TabPanel1$DropDownList1:aCity
ctl00$ContentPlaceHolder1$TabContainer1$TabPanel1$PropText:
ctl00$ContentPlaceHolder1$TabContainer1$TabPanel2$RefText:
__ASYNCPOST:true
ctl00$ContentPlaceHolder1$TabContainer1$TabPanel1$acceptButton:Accept
I've also tried replicating the headers that the Chrome debugger shows, so my request is including the same Content-Type, Host, Origin, Referer, User-Agent (for my browser) and every other header, including this header X-MicrosoftAjax: Delta=true.
I know there's a lot of moving parts here, but I intentionally haven't mentioned how I'm actually making the POST request with the HttpClient lib because I'd don't want to complicate the question anymore or alienate anyone who doesn't know Java but knows ASP. I'd like to know if there's an ASP issue I'm not addressing, but I can post the Java code is necessary.
Edit:
I've checked the debugging info that HttpClient is outputting just before sending the request, and the form data is being added properly as multi-part form data. The headers are all there too.
This answer is a long shot, but I've seen weirder things.
You mention this header:
X-MicrosoftAjax: Delta=true
I did some deep googling and found that this is often shown as all lower case in dumps of Ajax and UpdatePanel POST requests:
x-microsoftajax: Delta=true
See here and here.
Could it be as simple as not casing the header correctly?
I eventually got this working. The problem was not specific to ASP in general, it was actually a problem with how Java (specifically HttpClient) was sending the request. I was using HttpClient to compile the request using multi-part form, but after using Fiddler to analyse and compare the requests (see the edited part of this question for more details on that) sent from both my application and the actual webpage, my app request was structured very differently.
The real website request had the form options embedded in the request body in what looked like a URL encoded query string. My request was a series of entries in the request body where each option was wrapped in the Content-Type and Content-Disposition headers. The requests succeeded after changing the POST to add the parameters like:
request.setEntity(new UrlEncodedFormEntity(paramList));

Logging into a webpage via HTTP Request

So I have a webpage, ("http://data.terapeak.com/verify/") and I don't see any & tags in the URL so I am unaware how to post data to this. I need to do this via HTTPRequest rather than browser control. I am creating a double threaded batch searching program. I have already successfully made this using a single browser control but that wont allow for multi-threading, atleast with my current knowledge due to the fact that even when creating a new frmBrw that already exists it needs for me to set the threat apartment to single. If i set it to single, I am unable to have it send the data the the excel sheet I need both threads to access. I hope this is clear... The basic question is how can I log into this form via HTTP request.
This isn't going to be easy to answer without further details however I suspect you'll need to provide the variables via a HTTP POST request.
Can you successfully login to this page in your browser? If so, run a proxy tool such as fiddler and check the HTTP headers it makes to the server. You should see the form variables being passed over. You then need to mimic this in code.
How to: Send Data Using the WebRequest Class
Hope this gets you started

Where can I find sample Web Server Requests/Response Data

Is there any sort of data dump or data set with information from Web Server logs?
The information that I am mainly looking for are:
a) what type of request is it (POST or GET or HTTP or something else)
b) What type of data is being transferred (image, audio, video or text)
c) what is the size of the data that is being transferred
Information such as IP address, URL can be anonymous.
Are you using Firefox? If so, you can use the included Web Console tool to view all the HTTP request body being sent from your browser to the server and the response bodies, along with things like the method (GET, POST, etc.). This would be the same thing that a web server would be logging (except the IP address of the client is always you, obviously). You should be able to copy all the data and paste it to a file if you want a data dump.
To use the web console, click the orange Firefox button and then Web Developer > Web Console. Or if you're using an older version or have the Firefox button disabled, it's under the tools menu.
Edit: To get the most out of it, you'll want to right click on the console and select Log Request and Response Bodies. This will get you more information than just the headers.

How to find HTTP POST Data sent to a CGI Page?

I searched google for a good number of hours. Maybe I searched for the wrong keywords.
Here is what I want to do.
I'm posting data to a website which then makes a HTTP POST request and returns a .CGI webpage. I want to know the parameters the web page uses to send that HTTP POST request so that I can directly link a page from my Webpage to the final .CGI webpage by making the user enter the data on my own webpage.
How do I achieve it?
Usually the POST body is piped into STDIN, just read it as a normal file

Resources