get https response from scrapy shell - web-scraping

I have a spider that is getting cookies from a site in the first few steps. I would like to get the cookies, start the scrape, and if the HTTP status of the current request == 302, I want to loop back to the cookies part to refresh them. How can I log the HTTP status as a variable in scrapy shell, to add in an "if http_status ==302, break and go back to step 1"? Thank you!

I'm an idiot. If anyone comes across this, all you have to do is set your variable (in my case http_response) to response.status. So http_response = response.status gives you 200 (as an integer) or whatever the status of the current request is. lol solved.
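For anyone following along, here is a minimal sketch of the check described above. It uses a stand-in object with a status attribute instead of a real Scrapy response so that it runs on its own; needs_cookie_refresh and fake_response are hypothetical names, not part of Scrapy. One caveat: Scrapy's redirect middleware normally follows a 302 before your callback sees it, so the spider would probably need handle_httpstatus_list = [302] (or dont_redirect in the request meta) for this check to ever fire.

```python
from types import SimpleNamespace

# Statuses that should send us back to the cookie step (assumption: only 302).
REFRESH_STATUSES = {302}


def needs_cookie_refresh(response):
    """Return True when the response status means the cookies went stale."""
    http_status = response.status  # an int in Scrapy, e.g. 200 or 302
    return http_status in REFRESH_STATUSES


# Stand-in for a Scrapy response; in scrapy shell you would pass `response`.
fake_response = SimpleNamespace(status=302)
if needs_cookie_refresh(fake_response):
    print("302 received: go back to the cookie step")
```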

Related

HTTP status code for resource that is not available yet

I have a DB table with a report_url column. As soon as the backend is done filling and storing a report, it fills that column with an S3 link. If the report has not been stored yet, the column value is NULL by default. I also have a Pyramid API with an endpoint that returns a Response whose body is the report content. So, whenever the user makes a request, the corresponding controller is fired to get the report link, download the file, and return it to the user. However, if the report is not done yet (report_url is NULL), I need to inform the user somehow. In this case the front-end should receive HTTP status 400, but I have not figured out whether this fits best. Or maybe 503 fits better here?
Have a look at available http status codes.
What you probably want is 404, specifically because of this line:
In an API, this can also mean that the endpoint is valid but the
resource itself does not exist.
Full description:
404 Not Found
The server cannot find the requested resource. In the browser, this
means the URL is not recognized. In an API, this can also mean that
the endpoint is valid but the resource itself does not exist. Servers
may also send this response instead of 403 Forbidden to hide the
existence of a resource from an unauthorized client. This response
code is probably the most well known due to its frequent occurrence on
the web.
If the server is working on getting the report, 102 gets an honorable mention:
102 Processing (WebDAV)
This code indicates that the server has received and is processing the request, but no response is available yet.
Note that 102 is not part of the core HTTP standard; it comes from the WebDAV extension.
4xx status codes are used to tell the client that something it did is not working. 5xx status codes are used when something is going wrong on the server. That's how I understand it anyway.
In that way, if this is a "normal" execution of the API/program, perhaps a 200 status code would do just fine. E.g. just define the endpoint to return {"report_url": null} if it isn't ready, otherwise {"report_url": "an actual url"} and then give 200 in each case. And the receiving party handles it depending on if it is null or not. The pro of this method is, now the user can know that it is definitely a proper endpoint (and not an url typo, which would also give 404). However, you could make your own 404 page saying "report is not ready" or "report does not exist" for example. The con of this 200 method is some speed penalty since you have to send an unnecessary response body.
Disclaimer: I am not a web/http expert at all.
The correct HTTP status code is 202 - Accepted. The documentation says:
The 202 (Accepted) status code indicates that the request has been accepted for processing, but the processing has not been completed.
..
The representation sent with this response ought to describe the request's current status and point to (or embed) a status monitor that can provide the user with an estimate of when the request will be fulfilled.
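To make the 202 suggestion concrete, here is a rough sketch of the decision the endpoint would make. It is framework-agnostic on purpose (no Pyramid calls), and report_response along with the body keys "status" and "report_url" are illustrative assumptions rather than anything from the question's code.

```python
from http import HTTPStatus


def report_response(report_url):
    """Pick a status code and JSON-like body from the report_url column value."""
    if report_url is None:
        # Report is still being generated: accepted, but not completed yet.
        return HTTPStatus.ACCEPTED, {"status": "processing", "report_url": None}
    # Report is stored: the caller can now fetch and return the file.
    return HTTPStatus.OK, {"status": "ready", "report_url": report_url}


status, body = report_response(None)
print(int(status), body)
```

The front-end can then poll the same endpoint until it stops receiving 202.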

HTTP HEAD alternative

I wanted to do an existence check before I actually GET an item, and I was planning to use a HEAD request. But my server is having problems with HEAD requests.
It returns an error 403 for new items. I have to make a GET request before making a HEAD request for new items, or my HEAD request consistently returns a 403.
I cannot change anything about my server. What alternatives do I have? I really don't want to download the items to do an existence check (the items are images).
HTTP ranges could be an option, for example, using curl to get the first 200 bytes:
curl -r 0-199 http://example.com
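The same idea can be sketched in Python with only the standard library. This asks for a single byte instead of 200; build_probe and resource_exists are hypothetical helper names, and treating only 404 as "does not exist" is an assumption you may want to adjust for your server.

```python
import urllib.error
import urllib.request


def build_probe(url):
    """Build a GET request that asks for the first byte only."""
    return urllib.request.Request(url, headers={"Range": "bytes=0-0"})


def resource_exists(url, timeout=10):
    """Return True if the server answers the ranged GET with 200 or 206."""
    try:
        with urllib.request.urlopen(build_probe(url), timeout=timeout) as resp:
            # 206 means the range was honoured; 200 means it was ignored,
            # but the resource is there either way. The body is not read.
            return resp.status in (200, 206)
    except urllib.error.HTTPError as err:
        if err.code == 404:
            return False
        raise  # anything else (403, 5xx, ...) is left for the caller
```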

SoundCloud API: GET succeeds, HEAD fails

I use the SoundCloud API to retrieve the stream URL for a streamable track.
I follow the redirect and I end up with an URL that looks like:
http://ec-media.soundcloud.com/eodihgiuh.128.mp3?<a string>
AWSAccessKeyId=<access key>
&Expires=<timestamp>
&Signature=<signature>
or
http://ak-media.soundcloud.com/euieuieie.128.mp3?
AWSAccessKeyId=<access key>
&Expires=<timestamp>
&Signature=<signature>
&__gda__=<a string>
Then I start streaming the MP3 data at this URL.
First I send a HEAD request to read the Content-Length header, so that I know how many GET requests I will have to send in order to play the whole song.
Then I send several partial GET requests, each one with a different Range header.
The problem is that sometimes the HEAD request returns a 403 status code, even though a GET request to the exact same URL returns with a 200 status code. It seems that this happens if and only if the host is ak-media.soundcloud.com.
Is this supposed to happen? I expected the HEAD request to return exactly the same headers as the GET request, only without the response body.
Cheers,
PB
P.S: I should probably mention that my code is not running on a computer, but on an audio device with a tiny 8-bit processor which has extremely limited resources.
Unfortunately, currently we only offer guaranteed proper response for GET requests.
As a hack, you could try to do requests with very short ranges.
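To illustrate the short-range hack: when a server honours a request such as Range: bytes=0-0, the 206 response usually carries a Content-Range header like "bytes 0-0/12345", and the number after the slash is the total size that Content-Length from a HEAD would have given. The sketch below is in Python for readability, not for an 8-bit device; total_length and chunk_ranges are hypothetical helpers.

```python
def total_length(content_range):
    """Extract the total size from a Content-Range value, or None if unknown."""
    unit, _, rest = content_range.strip().partition(" ")
    if unit != "bytes" or "/" not in rest:
        return None
    size = rest.rsplit("/", 1)[1]
    # The server may send '*' when it does not know the full length.
    return int(size) if size.isdigit() else None


def chunk_ranges(total, chunk_size):
    """List the Range header values needed to fetch `total` bytes in chunks."""
    return [
        f"bytes={start}-{min(start + chunk_size, total) - 1}"
        for start in range(0, total, chunk_size)
    ]


print(total_length("bytes 0-0/12345"))
print(chunk_ranges(10, 4))
```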

Sending info with a HTTP redirect that the browser should send to the redirected location?

Is this possible.. for example, imagine I respond to a request with a 302 (or 303), and I inform the browser to do a request to a given location.. is there a header that I can send with the HTTP 302, so that the subsequent request from the browser would include that header?
I know I could do this with the location header, as in redirect and specify the information in the url as a query string.. but I'm wondering if there is a better way.. it seems that it should be a legit scenario..
'Content has moved, go here .. oh and you'll want to take this with you to give to the redirect location'
I'm guessing a big fat no!
Thanks in advance.
Edit
The reason for this is in respect to PRG patterns, where you have a GET url and POST url, given that you post data and it isn't acceptable, the server redirects you to the GET, and does some 'magic' in order to 'send data' to that GET, using most often session state to store a variable.
However, this can break down in scenarios where many of these PRG requests are happening. Granted, this isn't a common scenario and generally nobody need worry about it, but if you do, you'll need a way to identify the requests. This can be done with query string parameters sent in the 302, so that a specific entry can be put in session state according to that request.
The question was regarding trying to remove the 'request key' from the url, and making it more implicit.. cookies 'appear' to work, but they only make the window for screw ups smaller.
It would be great to be able to say: when you go to the 'location' I've specified, send these parameters.
Edit
Just to note, I'm not trying to get the browser to send arbitrary headers to the location; I'm asking whether there are ANY headers designed to hint at the context of the request (like the query string parameters could).
A redirect response itself doesn't contain any data. You can redirect using a URL with query parameters, but the new "location" will need to know how to consume those parameters.
No, that’s not possible. You cannot force the client to do something. You can only say “this is not the right location, but try that location instead”. It’s not guaranteed that the client will send the same request, or any request at all, to that new location. And telling the client to add a specific header field to the subsequent request to the new location is also not possible.
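Since the query string is the only channel a redirect reliably offers, here is a small sketch of attaching a per-request key to the Location value. The parameter name request_key and the helper with_request_key are made up for illustration.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def with_request_key(location, request_key):
    """Return `location` with a request_key query parameter appended."""
    parts = urlsplit(location)
    query = parse_qsl(parts.query, keep_blank_values=True)
    query.append(("request_key", request_key))
    return urlunsplit(parts._replace(query=urlencode(query)))


# The result is what the server would put in the Location header of the 302/303.
print(with_request_key("/items/new?lang=en", "abc123"))
```

The server would store the rejected form data in session state under that key and look it up again when the GET arrives.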

HTTP response with redirect, but without roundtrip?

I want the browser to reflect some other URL than the one used to create the request, but without roundtripping to the server.
I would maybe do this:
POST /form HTTP/1.1
...
...and then return:
HTTP/1.1 200 OK
Location: /hello
But that would cause a redirect, the browser will again, request URL /hello.
I would like to just tell the browser that, while the request you just sent was POST /some_url, the actual resource that I'm now returning is called GET /hello/1, but without performing a roundtrip. i.e. Location: ...
Is there any way to do this with JavaScript or the base="" attribute? That will tell the browser to request /hello/1 when I hit F5 (refresh) instead of that, post submission warning?
HTTP/1.1 200 OK
Location: /hello
Actually that probably wouldn't work; it should be a 30x status rather than 200 (“303 See Other” is best for the response to a POST), and ‘Location’ should be a complete absolute URL (RFC 2616 required this; the later RFC 7231 also permits relative references).
(If your script just says ‘Location: /relativeurl’ without the 30x status, CGI servers will usually do an internal redirect by fetching the new URL and returning it without telling the browser anything funny happened. This may sound like what you want but it isn't really because from the browser's point of view it's no different from the original script returning a 200 and direct page.)
But that would cause a redirect, the browser will again, request URL /hello.
In practice that's probably not as bad as you think, thanks to HTTP/1.1 keep-alives. The client should be able to respond to the redirect straight away (in the next packet) as long as it's on the same server.
Is there any way [...] That will tell the browser to request /hello/1 when I hit F5 (refresh) instead of that, post submission warning?
Nope. Stick with the POST-Redirect-GET model for solving this.
No. HTTP is stateless, and every request has one answer. When you post, you need to redirect to a GET page immediately to prevent a double post; you don't want the browser to sit on that POST URL. The redirect is what tells the browser that it is on a new page. That's just the way it works.
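A bare-bones sketch of the POST-Redirect-GET model the answers recommend, written as a plain WSGI callable so it needs no framework. The paths /form and /hello/1 come from the question; everything else (the app name, the response text) is assumed for illustration. The relative Location value relies on RFC 7231 allowing relative references.

```python
def app(environ, start_response):
    """Tiny WSGI app: POST /form answers 303, GET /hello/1 renders the result."""
    method = environ["REQUEST_METHOD"]
    path = environ.get("PATH_INFO", "")

    if method == "POST" and path == "/form":
        # Handle the submission here, then redirect so that a refresh (F5)
        # repeats the GET below instead of re-sending the POST.
        start_response("303 See Other", [("Location", "/hello/1")])
        return [b""]

    if method == "GET" and path == "/hello/1":
        body = b"Hello, your submission was saved."
        start_response("200 OK", [
            ("Content-Type", "text/plain"),
            ("Content-Length", str(len(body))),
        ])
        return [body]

    start_response("404 Not Found", [("Content-Type", "text/plain")])
    return [b"Not found"]
```

To try it locally, wsgiref.simple_server.make_server("", 8000, app).serve_forever() from the standard library should be enough.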
