Scrapy Web Scraping return 405 - web-scraping

I made a simple spider in Scrapy with Python to get the title from some websites. I get this 405 error, which can be seen in the photo, from one website, while the other one is fine and returns 200. Do you know what the problem may be?
https://postimg.cc/gallery/2pbx9j7wy/
I searched a lot for this question but I couldn't find an answer. If you can give me a full answer or just some links I would really appreciate it.
Thank you!
It is different from the question linked here because I encounter a captcha...

So HTTP 405 is Method Not Allowed. What does this mean?
There is the simple GET request, which happens when you type a URL into the browser. There is also POST, which is usually used when submitting a form. What this error most probably means in your case is that the URL expects something other than GET, and given that this is some kind of captcha, it most probably expects POST. Check the FormRequest class in the Scrapy documentation for how to make a POST request.
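As a minimal stdlib sketch of that difference (the URL and form field here are hypothetical placeholders, not taken from the question), note that simply attaching a body is what flips a request's method from GET to POST; `scrapy.FormRequest` does the same thing for you:

```python
from urllib.parse import urlencode
from urllib.request import Request

# Hypothetical captcha endpoint -- substitute the real URL from your spider.
url = "https://example.com/captcha-check"
form_data = urlencode({"captcha": "abc123"}).encode()

# Supplying a body makes urllib default to POST, which mirrors what
# scrapy.FormRequest(url, formdata={"captcha": "abc123"}) does in a spider.
req = Request(url, data=form_data)
print(req.get_method())  # POST
```

The key point: a server returning 405 to your GET is telling you to send the request with a different method, typically POST with the form fields it expects.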

Related

dealing with 300 http status code

So I am trying to access 'set-cookie' from the response I get from post(). However, I am getting status code 300 and, consequently, a KeyError: 'set-cookie'.
When I read about code 300, it said it means "multiple choices" and that I should get "a list of representation metadata and URI reference(s) from which [I] can choose the one most preferred." I wanted to read about how and where to do that but couldn't find any sources. Where is that list, and how can I choose from it? How can I redirect?
Note: I have never dealt with HTTP requests before.
You've "never dealt with HTTP requests before" and at your first attempt you get this. Unlucky.
The first point of order is that you've provided no details of what is implementing the HTTP. Is it PHP, ASP, Ruby, server-side JavaScript, Java...? Before you ask another question (or ask this one again, since it's likely to be closed), use the search here to find similar questions. You're not just looking for answers but for the level of information which questioners provide and how well received it is.
Fortunately I have a crystal ball and studied hard during my formative years at Hogwarts so I know you are using python.
The 300 status code is a dinosaur which has been ignored by browser makers since... well, forever, really. I'm guessing it's being thrown up here because your Python binding wants to flag an error. Really it should be returning a 5xx code, as the server seems rather broken. Anyway, if you had looked you might have found this.
This might be related to the KeyError, or that might be a further problem with your server-side code/config.
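Whatever the server is doing, the KeyError itself is avoidable: rather than indexing the headers directly, look the header up defensively. A minimal sketch (the function name is mine, not from any library):

```python
def get_set_cookie(headers):
    """Return the Set-Cookie value, or None if the header is absent.

    `headers` is any mapping of response header names to values; using
    .get() instead of [] avoids the KeyError when a response (such as
    this broken 300) carries no Set-Cookie header at all.
    """
    # Header names are case-insensitive in HTTP, so check both spellings.
    return headers.get("Set-Cookie") or headers.get("set-cookie")

print(get_set_cookie({"Content-Type": "text/html"}))  # None
print(get_set_cookie({"set-cookie": "sid=abc"}))      # sid=abc
```

Getting None back tells you the server never sent the cookie, which is the real problem to chase, rather than crashing on the lookup.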

I get HTTP 500 and HTML content, what's wrong?

When I go to this page (the same happens with lots of articles on this website): http://thereasonmag.com/9231-2/
I get an HTTP 500 error (visible in the Chrome Dev Tools) AND the article.
Well, I'm a bit lost with this. Do you know why it is designed like this?
That's a problem for my crawler, which is designed to skip HTTP 5xx error responses.
I would say that this can hardly be called "designed"; it happens when somebody has an error in the backend code/logic. Actually, this is the first time I have seen anything like this, but I can think of a workaround for you in this case.
Because this response has the 500 error AND a correct HTTP body with HTML, you can change your code so it only skips 5xx errors WITHOUT a correct HTML body. How do you determine whether the HTML is correct? It is pretty risky to guess. You can inspect their HTML and find some global variables or some comment tags/classes in the HTML which won't be returned when a real error page is returned.
Important: I understand (and I'm sure you do too) that my suggestion is an absolutely crazy workaround just to make your code work. What I would do in your place is write to those guys and ask them to fix their backend. It seems like the bottom of the page is the only place with an email address.
Try writing to them; otherwise you will eventually hit a case where you fail to meet the criteria of if (res.errorCode === 500 && res.body.anyPossiblePredicateYouMayThinkToCheckRightHTMLBody) { // show the post on your page }.
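In crawler-side Python, that workaround could be sketched like this. The marker string is purely hypothetical: you would need to inspect the site's real markup and pick something that its genuine error pages never contain.

```python
def looks_like_real_article(status_code, body):
    """Heuristic: accept a 5xx response anyway if the body carries a
    marker that only appears on successfully rendered article pages.

    The marker below is an assumed placeholder, not taken from the
    actual site -- inspect its HTML and choose a tag, class, or comment
    that real error pages never include.
    """
    ARTICLE_MARKER = '<article class="post"'
    if status_code < 500:
        return True  # not an error response; process normally
    return ARTICLE_MARKER in body

print(looks_like_real_article(500, '<article class="post">text</article>'))  # True
print(looks_like_real_article(500, "<h1>Internal Server Error</h1>"))        # False
```

As the answer says, this is fragile by construction: any redesign of the site's templates can silently break the predicate, so emailing the site owners remains the better fix.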
1) It looks like this is expected behavior since PHP version 5.2.4.
2) The above URL reports X-Powered-By: PHP/5.4.45 (a WordPress app).
3) The root cause could be that one of the WordPress plugins on the above site contains a string that PHP's eval() could not parse.
4) For more info, see a) the linked WordPress discussion
5) and b) the PHP forum thread.
Finally, I don't think you can do anything here.

Determine if requester is an Ajax call and/or is expecting JSON (or another content type)

I have solved a problem with a solution I found here on SO, but I am curious about if another idea I had is as bad as I think it might be.
I am debugging a custom security Attribute we have on/in several of our controllers. The Attribute currently redirects unauthorized users using a RedirectResult. This works fine except when calling the methods with Ajax. In those cases, the error returned to our JS consists of a text string of all the HTML of our error page (the one we redirect to) as well as the HTTP code and text 200/OK. I have solved this issue using the "IsAjaxRequest" method described in the answer to this question. Now I am perfectly able to respond differently to Ajax calls.
Out of curiosity, however, I would like to know what pitfalls might exist if I were to instead have solved the issue by doing the following. To me it seems like a bad idea, but I can't quite figure out why...
The ActionExecutingContext ("filterContext") has an HttpContext, which has a Request, which in turn has an AcceptTypes string collection. I notice that on my Ajax calls, which expect JSON, the value of filterContext.HttpContext.Request.AcceptTypes[0] is "application/json." I am wondering what might go wrong if I were to check this string against one or more expected content types and respond to them accordingly. Would this work, or is it asking for disaster?
I would say it works perfectly, and I have been using that approach for years.
The whole point of request headers is to let the client tell the server what it accepts and expects.
I suggest you read more here about Web API and how it uses exactly that technique.
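Language aside (the question is about ASP.NET MVC), the content-negotiation check itself is framework-neutral. A simplified Python sketch of inspecting an Accept header for a JSON preference; a full implementation would also honour q-values as described in RFC 9110:

```python
def prefers_json(accept_header):
    """Return True if application/json appears in the Accept header
    before any HTML media type.

    Simplified sketch: real content negotiation also weighs q-values
    (e.g. "text/html;q=0.8"), which this deliberately ignores.
    """
    for media_type in accept_header.split(","):
        # Strip any parameters such as ";q=0.9" and normalise case.
        media_type = media_type.split(";")[0].strip().lower()
        if media_type == "application/json":
            return True
        if media_type in ("text/html", "application/xhtml+xml"):
            return False
    return False

print(prefers_json("application/json, text/plain, */*"))       # True
print(prefers_json("text/html,application/xhtml+xml,*/*;q=0.8"))  # False
```

The main pitfall with relying on AcceptTypes[0] alone is exactly this: clients may send several types with weights, so checking only the first element can misclassify a request that lists HTML first but still accepts JSON.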

Getting requests containing [PLM=0][N]

I recently noticed that I've been getting some strange looking requests which after decoding look like
target_url?id=17 [PLM=0][N] GET target_url?id=17 [0,14770,13801] -> [N] POST target_url?id=17 [R=302][8880,0,522]
I know there is an older question on this subject, but it has no actual answer, so I posted my own, in case some newer member knows what's going on.
The requests I mentioned do not seem to have any effect, as they just cause the error page to be displayed. I am, however, curious to know what they might have been capable of.
target_url only refers to pages where someone posts to the forum. The website uses ASP.NET. The numbers contained in brackets (0,14770,13801 etc.) seem to be the same in every request made so far.
Any ideas?
I have seen more or less similar things on several websites, and I think it is a code for bypassing the captcha in the form you have on the page id=17. My guess would be that:
GET target_url?id=17 [0,14770,13801] = get the captcha at the position [0,14770,13801] on the page, where the captcha image or computation or whatever has been detected;
POST target_url?id=17 [R=302][8880,0,522] = still on the same page, put it back in the field at the position [8880,0,522]. [R=302] is possibly error-redirect handling in case it is wrong.

Is it true that POST can be used instead of GET in all scenarios?

I've read lots of articles about the differences between GET and POST. Lots of them are available here at StackOverflow.
A summary of the important differences is:
POST can send its information in the body, while GET should not (though in practice it can be done).
Some browsers cache GET results and rely on the idempotent behavior of GET requests.
Using GET is much easier than using POST for most developers.
Concluding this summary: using GET in POST situations is bad and dangerous.
But is it true that, easiness aside, POST can be used as a replacement for GET requests, since it seems to cover the GET requirements completely?
To clarify that I'm not crazy: I'm not going to use POST instead of GET. This question is just to check whether I understand the GET and POST difference correctly.
No, POST is not a replacement of GET requests. There are two important things that a POST request cannot do that a GET request can.
You cannot generate a POST request simply by typing a URL into the address bar of the browser; that always generates a GET request.
You cannot generate a POST request using an ordinary link in HTML. This has far-reaching consequences: you cannot find a page that is only accessible via POST with any search engine, and you cannot link to it unless the link is made via an HTML form or JavaScript.
It's good practice to classify your transactions. These methods are very important, especially when you are developing an API in a service-oriented architecture, or even single-page applications.
GET - used to retrieve a dataset (it also has a limitation on URL length; parameters are exposed and URL-encoded).
POST - saving/adding (parameters travel in the request body, so they are not exposed in the URL).
EX:
GET /items - means you are getting the list of items.
POST /items - means you are saving/adding item(s).
Later you may need to learn PUT and DELETE too.
But for now, always use POST in your form or Ajax request when saving/adding data, and GET when retrieving data.
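The "parameters in the URL vs. parameters in the body" distinction can be shown concretely with the Python stdlib (the URL here is a placeholder; nothing is actually sent over the network):

```python
from urllib.parse import urlencode
from urllib.request import Request

params = {"q": "scrapy", "page": "2"}

# GET: parameters ride in the URL's query string, visible in the
# address bar, browser history, and server access logs.
get_req = Request("https://example.com/items?" + urlencode(params))
print(get_req.get_method())  # GET
print(get_req.full_url)

# POST: the same parameters travel in the request body instead,
# so they never appear in the URL.
post_req = Request("https://example.com/items", data=urlencode(params).encode())
print(post_req.get_method())  # POST
print(post_req.data)
```

This is also why a GET URL is bookmarkable and linkable while a POST submission is not: everything needed to repeat a GET lives in the URL itself.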
