I was working on my server and needed to make use of request.headers.referer. While testing and reading headers to work out how to write the parsing functions, I couldn't find a way to differentiate between requests triggered by a link from outside the server, requests from outside the directory, and requests for local resources made by a given HTML response. For instance,
Going from localhost/dir1 to localhost/dir2 using <a href="http://localhost/dir2"> will yield a request with:
referer:"http://localhost/dir1" url:"/dir2"
while the HTML file served from localhost/dir2 asking for a resource with the local URI style.css will yield:
referer:"http://localhost/dir2" url:"/style.css"
and the same situation involving an image could end up as:
referer:"http://localhost/dir2" url:"/_images/image.png"
How would I prevent the combination of url and referer from being incorrectly resolved as http://localhost/dir1/dir2 or http://localhost/_images/image.png and so on? Is there a way to tell how the browser is referring to the URI, and how can either the browser or the server identify when http://localhost/dir2/../dir1 is the intended destination?
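For what it's worth, the browser resolves a relative reference against the page's own URL before the request is ever sent, so the path the server sees in the request line is already absolute and never needs to be combined with the Referer. A quick sketch of those RFC 3986 resolution rules using Python's urllib.parse (chosen here purely for illustration):

from urllib.parse import urljoin

# An absolute-path reference ("/style.css") replaces the whole path of the base URL,
# so it cannot accidentally become http://localhost/dir1/dir2 or similar.
print(urljoin("http://localhost/dir1", "/dir2"))        # http://localhost/dir2
print(urljoin("http://localhost/dir2", "/style.css"))   # http://localhost/style.css

# A relative reference without a leading slash is resolved against the base path,
# and ".." segments are collapsed during resolution.
print(urljoin("http://localhost/dir2/", "style.css"))   # http://localhost/dir2/style.css
print(urljoin("http://localhost/dir2/", "../dir1"))     # http://localhost/dir1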
I would like to send requests.get to this website:
requests.get('https://rent.591.com.tw')
and I always get
<Response [404]>
I know this is a common problem; I tried different approaches but still failed.
All other websites are fine, though.
Any suggestions?
Webservers are black boxes. They are permitted to return any valid HTTP response, based on your request, the time of day, the phase of the moon, or any other criteria they pick. If another HTTP client gets a different response, consistently, try to figure out what the differences are in the request that Python sends and the request the other client sends.
That means you need to:
Record all aspects of the working request
Record all aspects of the failing request
Try out what changes you can make to make the failing request more like the working request, and minimise those changes.
I usually point my requests at an endpoint on http://httpbin.org, have it record the request, and then experiment.
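For example, a minimal sketch of that workflow (httpbin.org/headers simply echoes back the headers it receives; the browser-like values below are placeholders you would copy from your browser's developer tools):

import requests

# See exactly what requests sends by default.
response = requests.get('https://httpbin.org/headers')
print(response.json()['headers'])

# Repeat with headers copied from the working (browser) request,
# then trim them down until you find the one that actually matters.
browser_like = {
    'User-Agent': 'Mozilla/5.0',                  # placeholder; copy the real value
    'Accept': 'text/html,application/xhtml+xml',  # placeholder
}
response = requests.get('https://httpbin.org/headers', headers=browser_like)
print(response.json()['headers'])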
For requests, there are several headers that are set automatically, and many of these you would not normally expect to have to change:
Host: this must be set to the hostname you are contacting, so that the server can properly serve multiple sites from one address. requests sets this one for you.
Content-Length and Content-Type, for POST requests, are usually set from the arguments you pass to requests. If these don't match, alter the arguments you pass in to requests (but watch out with multipart/* requests, which use a generated boundary recorded in the Content-Type header; leave generating that to requests).
Connection: leave this to the client to manage
Cookies: these are often set on an initial GET request, or after first logging into the site. Make sure you capture cookies with a requests.Session() object and that you are logged in (supplying credentials the same way the browser did), as sketched below.
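A minimal sketch of that pattern (the URLs and form field names here are placeholders; substitute whatever the site actually uses):

import requests

session = requests.Session()

# The Session object stores any cookies set by earlier responses and
# sends them back automatically on later requests, like a browser does.
session.get('https://example.com/')    # placeholder URL; picks up initial cookies

login = {'username': 'me', 'password': 'secret'}    # placeholder field names
session.post('https://example.com/login', data=login)

# This request now carries the session cookies collected above.
response = session.get('https://example.com/protected')
print(response.status_code)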
Everything else is fair game but if requests has set a default value, then more often than not those defaults are not the issue. That said, I usually start with the User-Agent header and work my way up from there.
In this case, the site is filtering on the user agent; it looks like they are blacklisting Python. Setting it to almost any other value already works:
>>> requests.get('https://rent.591.com.tw', headers={'User-Agent': 'Custom'})
<Response [200]>
Next, you need to take into account that requests is not a browser. requests is only an HTTP client; a browser does much, much more. A browser parses HTML for additional resources such as images, fonts, styling and scripts, loads those additional resources too, and executes scripts. Scripts can then alter what the browser displays and load additional resources. If your requests results don't match what you see in the browser, but the initial request the browser makes matches, then you'll need to figure out what other resources the browser has loaded and make additional requests with requests as needed. If all else fails, use a project like requests-html, which lets you run a URL through an actual, headless Chromium browser.
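A rough sketch of that last option with requests-html (note that the first call to render() downloads a headless Chromium build):

from requests_html import HTMLSession

session = HTMLSession()
response = session.get('https://rent.591.com.tw', headers={'User-Agent': 'Custom'})

# render() executes the page's JavaScript in headless Chromium, so content
# injected by scripts becomes visible to your code afterwards.
response.html.render()
print(response.html.find('title', first=True).text)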
The site you are trying to contact makes an additional AJAX request to https://rent.591.com.tw/home/search/rsList?is_new_list=1&type=1&kind=0&searchtype=1&region=1; take that into account if you are trying to scrape data from this site.
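If that JSON endpoint is what you are actually after, something along these lines may be enough; whether the site needs more than the User-Agent and the cookies from the initial page load (for example a CSRF token, see below) is an assumption you will have to verify:

import requests

session = requests.Session()
headers = {'User-Agent': 'Custom'}

# Load the HTML page first so any cookies the AJAX endpoint relies on are set.
session.get('https://rent.591.com.tw', headers=headers)

# Call the same endpoint the page's own JavaScript calls. X-Requested-With
# mimics an XMLHttpRequest; not all sites check for it.
params = {'is_new_list': 1, 'type': 1, 'kind': 0, 'searchtype': 1, 'region': 1}
response = session.get('https://rent.591.com.tw/home/search/rsList',
                       params=params,
                       headers={**headers, 'X-Requested-With': 'XMLHttpRequest'})
print(response.status_code)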
Next, well-built sites will use security best-practices such as CSRF tokens, which require you to make requests in the right order (e.g. a GET request to retrieve a form before a POST to the handler) and handle cookies or otherwise extract the extra information a server expects to be passed from one request to another.
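As a sketch of that ordering (the form URL and the name of the hidden token field are hypothetical; a real site will differ):

import re
import requests

session = requests.Session()

# Step 1: GET the page containing the form, receiving both the session
# cookie and the CSRF token embedded in the HTML.
page = session.get('https://example.com/login')                      # hypothetical URL
match = re.search(r'name="csrf_token" value="([^"]+)"', page.text)   # hypothetical field name
token = match.group(1) if match else ''

# Step 2: POST the form back through the same session, so the server sees
# the cookie and the matching token together.
session.post('https://example.com/login',
             data={'username': 'me', 'password': 'secret', 'csrf_token': token})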
Last but not least, if a site blocks scripts from making requests, it is probably either because they are trying to enforce terms of service that prohibit scraping, or because they have an API they would rather have you use. Check for either, and take into consideration that you might be blocked more effectively if you continue to scrape the site anyway.
One thing to note: I was using requests.get() to do some web scraping on links I was reading from a file. What I didn't realise was that each link had a trailing newline character (\n) when I read the lines from the file.
If you're getting multiple links from a file instead of a Python data type like a string, make sure to strip any \r or \n characters before you call requests.get("your link"). In my case, I used
with open("filepath", 'w') as file:
links = file.read().splitlines()
for link in links:
response = requests.get(link)
In my case this was due to the fact that the website address had recently changed, and I had been given the old website address. At least this changed the status code from 404 to 500, which, I think, is progress :)
How can I detect if a given URL is a file to be downloaded?
I came across the Content-Disposition header; however, it seems that this isn't part of HTTP/1.1 directly.
Is there a more standard way to detect whether the response to a GET request for a given URL is actually a file that can or should be downloaded?
That is, a response that is not HTML or JSON or anything similar, but something like an image, an MP3, a PDF file, etc.
HTTP is a transfer protocol, which is a very different thing from a hard-drive storage layout. The concept of a "file" simply does not exist in HTTP, any more than your computer's hard drive contains the actual paper-and-cardboard files you would see in an office filing system.
Whatever you may think the HTTP message or URL is saying, the response content does not have to come from any computer file, and does not have to be stored in one by the recipient.
The response to any GET message in HTTP can always be "downloaded" by sending another GET request with that same URL (and maybe other headers in the case of HTTP/1.1 variants). That is built into the definition of what a GET message is and has nothing to do with files.
I ended up using the Content-Type header to decide whether it's an HTML page or some other type of file on the other end of a given URL.
I'm using the Content-Disposition header to detect the original file name when it exists, since that header isn't available everywhere.
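Roughly what that looks like with Python's requests (a sketch; which content types count as a "page" rather than a "file" is your own decision, and a streamed GET is a fallback for servers that reject HEAD):

import requests

def inspect_url(url):
    # HEAD fetches only the headers; follow redirects so we inspect the final target.
    response = requests.head(url, allow_redirects=True)

    content_type = response.headers.get('Content-Type', '').split(';')[0].strip().lower()
    disposition = response.headers.get('Content-Disposition', '')

    # Anything that is not an HTML/JSON-style document is treated as a downloadable file.
    is_page = content_type in ('text/html', 'application/xhtml+xml', 'application/json')

    # If Content-Disposition carries a filename, use it as the original file name.
    filename = None
    if 'filename=' in disposition:
        filename = disposition.split('filename=')[-1].strip('"; ')

    return (not is_page), filename

print(inspect_url('https://example.com/report.pdf'))    # hypothetical URL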
Could checking for a file extension be a possibility? Sorry, I can't expand on that much without knowing more, but I guess you could consider using PHP to implement this if HTML doesn't have enough functionality.
I may be barking up completely the wrong tree here, but what I would like to do is protect my .js pages by having them return a 403 Forbidden HTTP error status if someone tries to access them directly via HTTP. I use them to support my index.html page but would like them to remain hidden.
The helpdesk guys at my ISP basically say they don't know if it's possible, but that it may be something I could do with a web.config file (which is not something I have used before).
Any help at all would be gratefully received; I am a bit out of my comfort zone with this one.
I would like to […] protect my .js pages by having them return a 403 Forbidden HTTP error status if someone tries to access them directly via HTTP.
Please note that when you include some resource, for example a script via the <script> tag in HTML or an image via the <img> tag, the browser does nothing other than run another HTTP request to get that resource. The whole communication already happens over HTTP.
While a browser may include additional details in its HTTP request when requesting additional resources, like the Referer-header, it definitely is not required to do so. So if you look out for the Referer-header, be advised that you may lock out other valid clients which do not send the Referer-header in their requests.
Also note that this will not give you any protection whatsoever. One can simply construct the HTTP headers when requesting things, so "faking" requests your server would allow (because it thinks they are correct) is not a problem at all. And even without that: every resource you tell the client to use to make your website work will be downloaded by the client, and after that the client can do whatever it wants with it. It can cache it on the hard disk, or allow the user to quickly look at it again without having to run another request.
So if you want to do this to protect your code, just forget about it and make it easier for everyone by not adding a protection that doesn't actually protect anything. Code you put on the web can be made difficult to read, but if you want the user to see the end result, you also hand out your code in the same step.
In PHP you can do this with:
header("HTTP/1.0 403 Forbidden");
I want to change the first line of my HTTP request (the request line), modifying the method and/or URL.
The (excellent) Tamperdata Firefox plugin allows a developer to modify the headers of a request, but not the URL itself. This latter part is what I want to be able to do.
So something like...
GET http://foo.com/?foo=foo HTTP/1.1
... could become ...
GET http://bar.com/?bar=bar HTTP/1.1
For context, I need to tamper with (i.e. correct) an erroneous request coming from Flash, to see if the error can be fixed by correcting the URL.
Any ideas? It sounds like something that may need to be done at the proxy level. In which case, any suggestions?
Check out Charles Proxy (multiplatform) and/or Fiddler2 (Windows only) for more client-side solutions - both of these run as a proxy and can modify requests before they get sent out to the server.
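If you would rather script the rewrite than click through a GUI, a programmable proxy such as mitmproxy (not mentioned above, so treat this as one possible approach) can change the method or URL in a small addon:

# rewrite.py -- run with: mitmproxy -s rewrite.py
from mitmproxy import http

def request(flow: http.HTTPFlow) -> None:
    # Rewrite requests aimed at foo.com into the corrected bar.com URL
    # before they leave the proxy.
    if flow.request.pretty_host == "foo.com":
        flow.request.host = "bar.com"
        flow.request.path = "/?bar=bar"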
If you have access to the webserver and it's running Apache, you can set up some rewrite rules that will modify the URL before it gets processed by the main HTTP engine.
For those coming to this page from a search engine, I would also recommend the Burp Proxy suite: http://www.portswigger.net/burp/proxy.html
Although more specifically targeted towards security testing, it's still an invaluable tool.
If you're trying to intercept the HTTP packets and modify them on the way out, then Tamperdata may be the route you want to take.
However, if you want minute control over these things, you'd be much better off simulating the entire browser session using a utility such as curl.
Curl: http://curl.haxx.se/
Use case: upload a simple image file to a server, which clients could later retrieve.
1) Designate an FTP server for the job.
2) HTTP PUT: it can directly upload files to a server without the need for a server-side component to handle the bytestream.
3) HTTP POST: handle the bytestream with a server-side component.
I think safely using PUT on a public website requires even more effort than using POST (and is less commonly done), due to potential security issues. See http://bitworking.org/news/PUT_SaferOrDangerous.
OTOH, I think there are plenty of resources for safely uploading files with POST and checking them in the server-side script, and that this is the more common practice.
PUT is only appropriate when you know the URL you are putting to.
You could also do:
4) POST to obtain a URL to which you then PUT the file.
edit: how are you going to get the HTTP server to decide whether it is OK to accept a particular PUT request?
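From the client side the two options look like this (a sketch with Python's requests; the URLs and field name are hypothetical, and the server must still be configured to accept either kind of request):

import requests

# PUT: the client chooses the target URL and sends the raw bytes as the request body.
with open('photo.png', 'rb') as f:
    requests.put('https://example.com/uploads/photo.png', data=f)

# POST: the bytes travel inside a multipart/form-data body to a fixed endpoint,
# and a server-side handler decides where (and whether) to store them.
with open('photo.png', 'rb') as f:
    requests.post('https://example.com/upload', files={'image': f})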
What I usually do (via PHP) is an HTTP POST,
and then employ PHP's move_uploaded_file() to get the upload to whatever destination I want.