Unable to crawl a website using Scrapy, but the same website can be requested via the Scrapy shell using the same settings - web-scraping

I am trying to crawl the website https://www.rightmove.co.uk/properties/105717104#/?channel=RES_NEW
but I get a 410 error:
INFO: Ignoring response <410 https://www.rightmove.co.uk/properties/105717104>: HTTP status code is not handled or not allowed
I am just trying to find the properties that have been sold, using the notification on the page: "This property has been removed by the agent."
I know the website has not blocked me, because I am able to use the Scrapy shell to get the data, and view(response) works fine too. I can go directly to the same URL in a web browser, so the 410 doesn't make sense. I can also crawl pages from the same domain,
i.e. the pages without the notification "This property has been removed by the agent."
Any help would be much appreciated.

Seems that when a listing has been marked as removed by an agent on Rightmove, the website returns status code 410 Gone (which is quite weird). To solve this, simply do something like this in your request:
def start_requests(self):
    yield scrapy.Request(
        url='https://www.rightmove.co.uk/properties/105717104#/?channel=RES_NEW',
        meta={
            'handle_httpstatus_list': [410],
        },
    )
EDIT
Explanation: By default, Scrapy only handles a response if its status code is in the range 200-299, since 2XX means the request was successful. In your case you got a 4XX status code, which means some error happened. By passing handle_httpstatus_list = [410] we tell Scrapy that we also want it to handle 410 responses, not only 200-299.
Here are the docs: https://docs.scrapy.org/en/latest/topics/spider-middleware.html#std-reqmeta-handle_httpstatus_list
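Put together, a minimal self-contained spider along these lines might look like the sketch below (the spider name and the yielded items are illustrative assumptions, not part of the original question):

import scrapy

class RightmoveSpider(scrapy.Spider):
    name = 'rightmove'  # hypothetical spider name

    def start_requests(self):
        yield scrapy.Request(
            url='https://www.rightmove.co.uk/properties/105717104#/?channel=RES_NEW',
            # Let 410 responses reach the callback instead of being
            # dropped by HttpErrorMiddleware.
            meta={'handle_httpstatus_list': [410]},
            callback=self.parse,
        )

    def parse(self, response):
        # A 410 is exactly the signal we are after here: the listing
        # was removed by the agent.
        yield {'url': response.url, 'removed': response.status == 410}

Alternatively, setting HTTPERROR_ALLOWED_CODES = [410] in settings.py allows 410 responses project-wide instead of per request.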

Related

HTTP Request fails when using the same parameters and the same environment

I'm trying to fetch data from a website (https://gesetze.berlin.de/bsbe/search). Using Firefox, I've taken a look at the network analysis. Usually I'm just messing around with the parameters of the POST request to see how I might influence the server's response. But when I simply re-send the request (making no changes at all), I get HTTP response 500. The server's answer states as its message: security_notAuthenticated.
Can anyone explain that behaviour? The request is made from the same PC and the same browser in the same session, and there is no login function on that website. Pictures are shown below.
Picture 1 - Code 200
Picture 2 - Code 500
The response security_notAuthenticated indicates that your way of repeating the request omits some authentication-related information.
When I repeat the request using Firefox's "Resend" or "Edit and resend" function, the Cookie header is not sent with the request. Although it appears in the editable header list when using "Edit and resend", it's missing from the request that is actually sent. I'm not sure whether this is a feature or a bug.
When using Firefox's "Use as Fetch in Console" function, the header will automatically be included and you still have the ability to change the headers and the body. The fetch API is a web standard and some introductory material about fetch can be found on MDN.
If you want to do custom requests, in the browser, fetch is a good option.
In other environments and languages you usually use some HTTP client (just search the web for "...your language... http request" or similar; you will find something).
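For instance, replaying such a request in Python with the requests library might look like the sketch below; the cookie and payload values are placeholders that you would copy from the captured request in the browser's network panel, and whether the body is JSON or form-encoded depends on what that panel shows:

import requests

# Placeholder values copied from the browser's network panel.
cookies = {'JSESSIONID': '<value from your browser session>'}  # assumed cookie name
payload = {'query': 'example'}  # hypothetical search parameters

resp = requests.post(
    'https://gesetze.berlin.de/bsbe/search',
    cookies=cookies,
    json=payload,  # use data=payload instead for a form-encoded body
)
print(resp.status_code)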

API Status Page Response Codes

(This is sort of an abstract philosophical question. But I believe it has objective concrete answers.)
I'm writing an API, my API has a "status" page (like, https://status.github.com/).
If whatever logic I have in place to determine the status says everything is good my plan would be to return 200 OK, and a JSON response with more information about each service tested by my status page.
But what if my logic says the API is down? Say the database isn't responding or something.
I think I want to return 500 INTERNAL SERVER ERROR (or 503 SERVICE UNAVAILABLE) along with a JSON response with more details.
However, is that breaking the HTTP Status Code spec? Would that confuse end users? My status page itself is working just fine in that case. So maybe it should return 200? But that would mean anyone using it would have to dig into the body looking for a specific parameter to determine the API's status vs. just checking the HTTP Status Code. (Also if my status page itself was broken, I'm fine with the end user taking that to mean the API is down since that's a pretty bad sign...)
Thoughts? Is there official protocol on how a status page should work?
https://www.w3.org/Protocols/rfc2616/rfc2616-sec10.html
In my view the page should return 200 unless the status page itself has problems. It is true that checking a response's status code is easier than parsing the body, but using HTTP status codes to encode application information breaks what people (and spiders) expect. If a spider visits your page and sees a 500 or 503, it will think your site has a broken page, not that the page is fine and is signaling that the site is down.
Also, as you noticed, it won't be possible to distinguish between "the service is down" and "the status page is down", and only the latter should send a 500. And what if you report more than one service, like the Twitter status page does? Use 200.
Related: https://stackoverflow.com/a/943021/1536382 https://stackoverflow.com/a/34324179/1536382
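A minimal sketch of that convention, assuming a Flask app and made-up service checks:

from flask import Flask, jsonify

app = Flask(__name__)

def check_services():
    # Hypothetical probes; replace with real ones (database ping,
    # queue depth, downstream API call, ...).
    return {'database': 'up', 'api': 'up'}

@app.route('/status')
def status():
    # Always 200 when the status page itself works; the JSON body
    # carries the per-service detail. A 5xx is reserved for the case
    # where the status endpoint itself fails.
    return jsonify(services=check_services()), 200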

I'm accessing a secure site with httr but I get a server error whenever it isn't the first request

As the title says, the site in question is secure and I can't share my credentials, but here's the outline of events.
The way the site security works is that you send a POST to one URL with user/pass and it sends back a token. All requests then need to carry that token in their headers to work. I can get that to work once: on the first request after the login step I get the results I want. All subsequent requests result in an HTTP 500 error, "Internal Server Error". Of course, in a perfect world, I could go to the server and get logs to see more verbosely what is going on. However, they aren't so accommodating on my planet, so I'm left scratching my head.
Just to clarify: I can send the exact same request a second time and I get the aforementioned error. So far my workaround is to detach httr and then relibrary(httr) to start over. This doesn't seem like the best approach to this problem.
I'm guessing that the problem has to do with how httr reuses the same handle but I don't know what info is changing between the two requests.
In pseudo-code, let's say I do:
resp<-POST('https://my.site.com/login', add_headers(.headers=c('user'='me', 'pass'='blah')))
mytoken<-content(resp)$token
qry<-POST('https://my.site.com/soap/qry', add_headers(.headers=c('token'=mytoken)),body=myxmlstring)
#qry will have status 200 and the content I expect
#If I run the same POST command again
qry2<-POST('https://my.site.com/soap/qry', add_headers(.headers=c('token'=mytoken)),body=myxmlstring)
#qry2 will be status code 500
#if I do
detach("package:httr", unload=TRUE)
library(httr)
#and then run the commands again from the top, it will work again.
Ideally, there'd be a parameter I can add to POST which makes each POST completely independent of the last. Short of that, I'd be happy with something that makes more sense than detaching and reattaching the package itself.
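(For comparison only, not an httr fix: the same token flow in Python's requests behaves the way the asker wants by default, since each top-level requests.post() call is independent and shares no handle. The URLs and header names below are taken from the pseudo-code above; the JSON token field and the XML body are assumptions.)

import requests

myxmlstring = '<query>...</query>'  # placeholder XML body

# Log in once and pull the token out of the response body
# (assuming the token comes back as JSON).
resp = requests.post('https://my.site.com/login',
                     headers={'user': 'me', 'pass': 'blah'})
token = resp.json()['token']

# Each call below is fully independent: no shared handle or state
# is carried over between requests.
for _ in range(2):
    qry = requests.post('https://my.site.com/soap/qry',
                        headers={'token': token},
                        data=myxmlstring)
    print(qry.status_code)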

GET request on website vs standalone

I'm a bit confused, or maybe I don't fully understand HTTP requests.
There is a website on which the search results are fetched through a GET request. I can see the whole parameter list in Firebug, and if I click "search" the results are displayed as you would expect. What I don't understand is that if I take this request URL (with the same parameters) and copy it into a new browser tab, it doesn't return results anymore. Instead I see a 500 - Internal Server Error.
Can someone explain why is this happening or what can I do to see the results when accessing the URL?
As robert_b_clark suggested, the solution is to send the referrer (Referer) header when making the request.
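In Python's requests, for example, that just means setting the header yourself; the URL, parameters, and referrer value below are placeholders:

import requests

# Placeholder URL and parameters; the point is the Referer header,
# which some servers check before serving results.
resp = requests.get(
    'https://example.com/search',
    params={'q': 'term'},
    headers={'Referer': 'https://example.com/'},
)
print(resp.status_code)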

Response Redirect URL returns HTTP Error 400 - Bad Request

I'm a noob when it comes to ASP.NET. I know a few basic commands, such as Response.Redirect("URL"), to redirect my application's web page to a different location.
However, I receive HTTP Error 400 - Bad Request whenever I try to use the code shown below:
Response.Redirect(Server.UrlEncode(this.Downloadlink));
where this.Downloadlink is a user-defined property which returns something like this:
http://mdn.vatsag.net/fp;files/DOWNLOAD/VTSetup.exe
If I paste this link into the browser, the .exe download pops up (meaning the link is good).
However, this error comes up when I use the ASP.NET code.
Any response on this issue/reason is deeply appreciated.
See here: http://www.kirit.com/Response.Redirect%20and%20encoded%20URIs
In short: if you quickly want to fix the issue, remove the part of your code that is URL-encoding the URL. Encoding the entire URL also encodes the :// separator, so the redirect target is no longer a valid absolute URL.
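To see what goes wrong, here is the same transformation reproduced in Python; urllib's quote with safe='' behaves roughly like Server.UrlEncode on a full URL:

from urllib.parse import quote

url = 'http://mdn.vatsag.net/fp;files/DOWNLOAD/VTSetup.exe'
# Encoding the entire URL also encodes the scheme and path
# separators, so the result is no longer a valid absolute URL.
print(quote(url, safe=''))
# -> http%3A%2F%2Fmdn.vatsag.net%2Ffp%3Bfiles%2FDOWNLOAD%2FVTSetup.exe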
