I'd like to gain a deep understanding of the caching mechanisms for website assets and even API requests. I have read some articles and also searched Stack Overflow. Some examples show that if you set, for example, Cache-Control: max-age=20, the browser will cache the RESTful API response for 20 seconds...
But here is the important question... if the browser is able to cache the API response, why do we have to use libraries like react-query, or PWA techniques, to implement caching scenarios for web applications?
As far as I understand, if we use the browser cache with max-age, the browser still sends a revalidation request to the server, and the server returns an empty 304 Not Modified response that makes the browser load from its cache... but the issue remains: there is still a request to the server just to get that empty response.
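For illustration, here is a minimal sketch of that revalidation round trip using Python's requests library (the endpoint is hypothetical, and it assumes the server supports ETags):

import requests

URL = 'https://api.example.com/items'  # hypothetical endpoint with ETag support

# First request: the server returns the full body plus caching headers.
first = requests.get(URL)
etag = first.headers.get('ETag')

# Revalidation: send the validator back. If the resource is unchanged,
# the server answers 304 Not Modified with an empty body, so only headers
# cross the wire, but the round trip to the server still happens.
second = requests.get(URL, headers={'If-None-Match': etag} if etag else {})
print(second.status_code)  # 304 if unchanged, 200 with a fresh body otherwise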
But if we use something like react-query, it doesn't even send that revalidation request to check the cache with the server; we can handle caching with zero requests, and that would be a great trick to decrease server load, right?
So, am I right? Or is this scenario wrong and I have misunderstood it?
Thank you
I am trying to send a requests.get() to this website:
requests.get('https://rent.591.com.tw')
and I always get
<Response [404]>
I know this is a common problem; I have tried different approaches but still fail.
All other websites work fine.
Any suggestions?
Webservers are black boxes. They are permitted to return any valid HTTP response, based on your request, the time of day, the phase of the moon, or any other criteria they pick. If another HTTP client consistently gets a different response, try to figure out what the differences are between the request that Python sends and the request the other client sends.
That means you need to:
Record all aspects of the working request
Record all aspects of the failing request
Try out what changes you can make to make the failing request more like the working request, and minimise those changes.
I usually point my requests at an http://httpbin.org endpoint, have it record the request, and then experiment.
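For example, a quick way to see exactly what requests sends (httpbin.org simply echoes the request back to you):

import requests

# httpbin.org/get echoes the received request as JSON, so you can
# compare these headers against what the browser sends.
echo = requests.get('https://httpbin.org/get', headers={'User-Agent': 'Custom'})
print(echo.json()['headers'])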
For requests, there are several headers that are set automatically, and many of these you would not normally expect to have to change:
Host: this must be set to the hostname you are contacting, so that the server can properly multi-host different sites. requests sets this one for you.
Content-Length and Content-Type, for POST requests, are usually set from the arguments you pass to requests. If these don't match, alter the arguments you pass in to requests (but watch out with multipart/* requests, which use a generated boundary recorded in the Content-Type header; leave generating that to requests).
Connection: leave this to the client to manage
Cookies: these are often set on an initial GET request, or after first logging in to the site. Make sure you capture cookies with a requests.Session() object and that you are logged in (supply credentials the same way the browser did).
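A minimal sketch of that, assuming a hypothetical login form (the URL and field names are made up; copy the real ones from the browser's network tab):

import requests

session = requests.Session()  # persists cookies across requests
# Log in first so the session cookie is captured.
session.post('https://example.com/login',
             data={'username': 'me', 'password': 'secret'})
# Subsequent requests on the same session carry the cookies along.
page = session.get('https://example.com/members-only')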
Everything else is fair game, but if requests has set a default value, then more often than not those defaults are not the issue. That said, I usually start with the User-Agent header and work my way up from there.
In this case, the site is filtering on the user agent; it looks like they are blacklisting Python. Setting it to almost any other value already works:
>>> requests.get('https://rent.591.com.tw', headers={'User-Agent': 'Custom'})
<Response [200]>
Next, you need to take into account that requests is not a browser. requests is only an HTTP client; a browser does much, much more. A browser parses HTML for additional resources such as images, fonts, styling and scripts, loads those additional resources too, and executes scripts. Scripts can then alter what the browser displays and load additional resources. If your requests results don't match what you see in the browser, but the initial request the browser makes matches, then you'll need to figure out what other resources the browser has loaded and make additional requests with requests as needed. If all else fails, use a project like requests-html, which lets you run a URL through an actual, headless Chromium browser.
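A short sketch of the requests-html route (the first render() call downloads a Chromium build):

from requests_html import HTMLSession  # pip install requests-html

session = HTMLSession()
r = session.get('https://rent.591.com.tw', headers={'User-Agent': 'Custom'})
# render() executes the page's JavaScript in headless Chromium,
# so script-injected content becomes part of r.html.
r.html.render()
print(r.html.find('title', first=True).text)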
The site you are trying to contact makes an additional AJAX request to https://rent.591.com.tw/home/search/rsList?is_new_list=1&type=1&kind=0&searchtype=1&region=1; take that into account if you are trying to scrape data from this site.
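Something along these lines, with the query parameters copied from the URL above (the response format is an assumption; inspect it yourself):

import requests

api = 'https://rent.591.com.tw/home/search/rsList'
params = {'is_new_list': 1, 'type': 1, 'kind': 0, 'searchtype': 1, 'region': 1}
# Reuse the User-Agent workaround from above; the endpoint presumably
# returns JSON, since the page consumes it via AJAX.
resp = requests.get(api, params=params, headers={'User-Agent': 'Custom'})
print(resp.status_code)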
Next, well-built sites will use security best practices such as CSRF tokens, which require you to make requests in the right order (e.g. a GET request to retrieve a form before a POST to the handler) and to handle cookies, or otherwise extract the extra information a server expects to be passed from one request to another.
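As a sketch, assuming a hypothetical form whose token sits in a hidden input named csrf_token:

import requests
from bs4 import BeautifulSoup  # pip install beautifulsoup4

session = requests.Session()
# GET the form first: the server sets cookies and embeds a fresh token.
form_page = session.get('https://example.com/login')  # hypothetical URL
token = BeautifulSoup(form_page.text, 'html.parser').find(
    'input', {'name': 'csrf_token'})['value']  # hypothetical field name
# POST back on the same session so the cookie and token match up.
session.post('https://example.com/login',
             data={'user': 'me', 'password': 'secret', 'csrf_token': token})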
Last but not least, if a site is blocking scripts from making requests, they are probably either trying to enforce terms of service that prohibit scraping, or they have an API they would rather have you use. Check for either, and take into consideration that you might be blocked more effectively if you continue to scrape the site anyway.
One thing to note: I was using requests.get() to do some web scraping on links I was reading from a file. What I didn't realise was that each link had a trailing newline character (\n) when I read it from the file.
If you're getting multiple links from a file rather than from a Python string, make sure to strip any \r or \n characters before you call requests.get("your link"). In my case, I used:
with open("filepath", 'w') as file:
links = file.read().splitlines()
for link in links:
response = requests.get(link)
In my case this was due to the fact that the website address had recently changed, and I was given the old address. At least this changed the status code from 404 to 500, which, I think, is progress :)
I need to know which requests a webpage sends. Basically, the site I call in turn calls another service/API/URL, receives the data (probably via JavaScript), and shows it to me. Can I see all the calls it makes?
Edit: concrete example:
From this site (http://www.flickriver.com/lenses/nikon/) you can choose a lens; at that moment, the page sends a request to Flickr and gets all the data. But in Chrome developer tools I could not see this request.
Here is a screenshot of the GET requests. I have looked through them but could not see any request to Flickr.
The first is the request to the page, and the sixth is already the picture request, where it fetches the picture by its ID. So one of the four requests in between should be a request to the external source which returns the picture ID, or am I missing something?
And what if the backend makes this request? Would I still see it in developer tools?
No, of course you cannot see the calls made by some server to another server. Why would you expect to be able to do that? Those calls have nothing to do with the browser; the browser knows nothing about them and knows only about requests that it itself initiated. Devtools can only report on requests made by the browser. If in fact there were some way to spy on the requests made by one server to another, it would be a gaping security hole.
I need to know when a request comes from a browser and when it comes from a server.
I have created an API and a listener for the onKernelRequest event; I need to know what kind of request I received in order to execute one function or another.
How can I do this in Symfony 2.7?
A "server" is an HTTP client just as a browser is; they only handle your website's response differently. So there's no way to be sure who you are talking to. You can only check for a number of indicators.
You can examine the HTTP headers in the Request object. Your best bet would probably be the User-Agent header. But a non-browser could just as well fake the user-agent header of an actual browser, so you'd only detect them if they want you to. And you'd have to prepare a list of user agents that you'd consider "servers".
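The equivalent check, sketched here in Python for illustration (in Symfony you would read the same header off the Request object inside your onKernelRequest listener):

# Illustrative only: the header is trivially forgeable, so treat the
# result as a hint, not as authentication.
BROWSER_MARKERS = ('Mozilla', 'Chrome', 'Safari', 'Firefox', 'Edge')

def looks_like_browser(user_agent: str) -> bool:
    return any(marker in user_agent for marker in BROWSER_MARKERS)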
I want to change the first line of the HTTP header of my request, modifying the method and/or URL.
The (excellent) Tamperdata Firefox plugin allows a developer to modify the headers of a request, but not the URL itself. The latter is what I want to be able to do.
So something like...
GET http://foo.com/?foo=foo HTTP/1.1
... could become ...
GET http://bar.com/?bar=bar HTTP/1.1
For context, I need to tamper with (i.e. correct) an erroneous request from Flash, to see if the error can be fixed by correcting the URL.
Any ideas? It sounds like something that may need to be done at the proxy level. In which case, any suggestions?
Check out Charles Proxy (multiplatform) and/or Fiddler2 (Windows only) for more client-side solutions - both of these run as a proxy and can modify requests before they get sent out to the server.
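If you prefer a scriptable proxy, mitmproxy (not mentioned above, but built around the same idea) lets you rewrite the request line with a few lines of Python; a sketch:

# Save as rewrite.py and run: mitmproxy -s rewrite.py
from mitmproxy import http

def request(flow: http.HTTPFlow) -> None:
    # Rewrite requests bound for foo.com so they go to bar.com instead.
    if flow.request.pretty_host == 'foo.com':
        flow.request.host = 'bar.com'
        flow.request.path = '/?bar=bar'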
If you have access to the webserver and it's running Apache, you can set up some rewrite rules that will modify the URL before it gets processed by the main HTTP engine.
For those coming to this page from a search engine, I would also recommend the Burp Proxy suite: http://www.portswigger.net/burp/proxy.html
Although more specifically targeted towards security testing, it's still an invaluable tool.
If you're trying to intercept the HTTP packets and modify them on the way out, then Tamperdata may be the route you want to take.
However, if you want minute control over these things, you'd be much better off simulating the entire browser session using a utility such as curl.
Curl: http://curl.haxx.se/