Scrapy and Splash time out for a specific site - web-scraping

I have an issue with Scrapy, Crawlera and Splash when trying to fetch responses from this site.
I tried the following without luck:
pure Scrapy shell - times out
Scrapy + Crawlera - times out
Scrapinghub Splash instance (small) - times out
However, I can scrape the site with Selenium's Firefox webdriver, but I want to move away from that and use Splash instead.
Is there a workaround to avoid these timeouts?
NOTE:
If I use local Splash instances set up by aquarium, the site loads, though it still takes 20+ seconds compared to the Firefox webdriver's 10 seconds.

Try to increase the timeouts for Splash. If you run Splash using Docker, set the parameter --max-timeout to some bigger value, e.g. 3600 (for more info, look into the documentation).
Next, in your Splash requests, also increase the timeout. If you use the scrapy-splash library, set the SplashRequest argument timeout to some higher value, e.g. 3600. Like this:
yield scrapy_splash.SplashRequest(
    url, self.parse, endpoint='execute',
    args={'lua_source': script, 'timeout': 3600})
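Note that Scrapy also enforces its own downloader timeout (the DOWNLOAD_TIMEOUT setting, 180 seconds by default), so you may need to raise that as well. A minimal sketch, with a hypothetical spider name:
import scrapy

class SlowSiteSpider(scrapy.Spider):
    name = 'slow_site'  # hypothetical
    custom_settings = {
        # raise Scrapy's own downloader timeout to match the Splash one
        'DOWNLOAD_TIMEOUT': 3600,
    }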

You could retry the request in the Scrapy shell, setting the user agent in the headers. For me, this method worked in a few seconds. Using the default user agent caused the connection to be dropped by the site. The default user agent declares that you're using Scrapy, so it makes sense that the site would choose to drop the connection.
Replace 'custom user agent' with your own browser's user agent string (or any preferred one), and replace the URL with your own. You can try the following steps, and then view the response in your browser:
scrapy shell
url = "https://www.yoururl.com"
request = scrapy.Request(url, headers={'User-Agent': 'custom user agent'})
fetch(request)
view(response)
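If that works, you can apply the same fix to the whole spider by overriding Scrapy's USER_AGENT setting (a sketch; the value shown is just an example browser string):
# in settings.py
USER_AGENT = 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36'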

Related

Unable to crawl a website using Scrapy, but the same website can be requested in the Scrapy shell using the same settings

I am trying to crawl the website https://www.rightmove.co.uk/properties/105717104#/?channel=RES_NEW
but I get a 410 error:
INFO: Ignoring response <410 https://www.rightmove.co.uk/properties/105717104>: HTTP status code is not handled or not allowed
I am just trying to find the properties that have been sold using the notification on the page "This property has been removed by the agent."
I know the website has not blocked me, because I am able to use the Scrapy shell to get the data, and view(response) works fine too. I can also go directly to the same URL in a web browser, so the 410 doesn't make sense. And I can crawl pages from the same domain,
(i.e.) the pages without the notification "This property has been removed by the agent."
Any help would be much appreciated.
Seems that when a listing has been marked as removed by an agent on Rightmove, the website returns status code 410 Gone (which is quite weird). But to solve this, simply do something like this in your request:
def start_requests(self):
    yield scrapy.Request(
        url='https://www.rightmove.co.uk/properties/105717104#/?channel=RES_NEW',
        meta={
            'handle_httpstatus_list': [410],
        },
    )
EDIT
Explanation: Basically, Scrapy only handles responses whose status code is in the range 200-299, since 2XX means the response was successful. In your case, you got a 4XX status code, which means that some error happened. By passing handle_httpstatus_list = [410] we tell Scrapy that we want it to also handle 410 responses, not only 200-299.
Here is the docs: https://docs.scrapy.org/en/latest/topics/spider-middleware.html#std-reqmeta-handle_httpstatus_list
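As an alternative to setting the meta key on every request, the same docs describe a handle_httpstatus_list spider attribute that applies to all of the spider's requests. A minimal sketch, with a hypothetical spider name:
import scrapy

class RightmoveSpider(scrapy.Spider):
    name = 'rightmove'  # hypothetical
    handle_httpstatus_list = [410]  # let 410 responses reach the callbacks

    def parse(self, response):
        if response.status == 410:
            self.logger.info('Listing removed: %s', response.url)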

Multiple sub-responses for Wordpress site, related with CSS/JS/images, cause delayed response times in Jmeter

I've been trying to load test a WordPress site and I'm seeing many sub-responses under the main sampler response in the 'View Results Tree' listener. This probably results in more load time displayed in JMeter as well. I've tried enabling/disabling the 'Retrieve All Embedded Resources' advanced setting of the sampler and it has not made a difference.
I want to see only those samplers which are part of my script in 'View Results Tree'. How can I get rid of the sub-responses appearing under those samplers?
If you are recording, you have the option in JMeter to skip files with a given extension, so you can skip *.png files and they won't show up in the recorded script.
In the HTTP(S) Test Script Recorder there is a tab called Request Filtering.
So when you run the JMeter script these requests will not show up in the listener.
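For example, a typical expression for the 'URL Patterns to Exclude' field under Request Filtering might look like this (a suggestion; adjust the extensions to whatever you want to drop):
.*\.(png|jpe?g|gif|css|js|woff2?|ico)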
It might be the case you have embedded resources retrieval enabled in the HTTP Request Defaults, if this is the case - it impacts all the HTTP Request samplers, no matter what you set there.
The question is why do you want to disable it? It makes sense only to disable requests to external domains (like Google, Facebook, etc.) so you would focus only on your application.
Downloading images, scripts, fonts, styles, etc. is what real browsers do, so your script should be doing this as well. Just make sure to add an HTTP Cache Manager to ensure that the resources are downloaded only once, or according to Cache-Control headers.
More information: Web Testing with JMeter: How To Properly Handle Embedded Resources in HTML Responses

head request returns different content-type [duplicate]

I am trying to send requests.get to this website:
requests.get('https://rent.591.com.tw')
and I always get
<Response [404]>
I know this is a common problem and have tried different ways, but still failed.
All other websites are OK.
Any suggestions?
Webservers are black boxes. They are permitted to return any valid HTTP response, based on your request, the time of day, the phase of the moon, or any other criteria they pick. If another HTTP client gets a different response, consistently, try to figure out what the differences are in the request that Python sends and the request the other client sends.
That means you need to:
Record all aspects of the working request
Record all aspects of the failing request
Try out what changes you can make to make the failing request more like the working request, and minimise those changes.
I usually point my requests to a http://httpbin.org endpoint, have it record the request, and then experiment.
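For example, a quick way to see exactly which headers requests sends by default (httpbin.org echoes your request headers back as JSON):
import requests

r = requests.get('https://httpbin.org/headers')
print(r.json())  # the headers that requests actually sent, as seen by the server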
For requests, there are several headers that are set automatically, and many of these you would not normally expect to have to change:
Host; this must be set to the hostname you are contacting, so that it can properly multi-host different sites. requests sets this one.
Content-Length and Content-Type, for POST requests, are usually set from the arguments you pass to requests. If these don't match, alter the arguments you pass in to requests (but watch out with multipart/* requests, which use a generated boundary recorded in the Content-Type header; leave generating that to requests).
Connection: leave this to the client to manage
Cookies: these are often set on an initial GET request, or after first logging into the site. Make sure you capture cookies with a requests.Session() object and that you are logged in (supplying credentials the same way the browser did); see the sketch after this list.
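Here is a minimal sketch of that; the login URL and form field names are hypothetical placeholders:
import requests

# one Session keeps cookies across requests, the way a browser does
with requests.Session() as s:
    s.get('https://example.com/')  # pick up any cookies set on the first visit
    s.post('https://example.com/login',  # hypothetical login endpoint
           data={'user': 'me', 'password': 'secret'})
    r = s.get('https://example.com/account')  # sent with the session cookies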
Everything else is fair game but if requests has set a default value, then more often than not those defaults are not the issue. That said, I usually start with the User-Agent header and work my way up from there.
In this case, the site is filtering on the user agent. It looks like they are blacklisting Python; setting the User-Agent to almost any other value already works:
>>> requests.get('https://rent.591.com.tw', headers={'User-Agent': 'Custom'})
<Response [200]>
Next, you need to take into account that requests is not a browser. requests is only an HTTP client; a browser does much, much more. A browser parses HTML for additional resources such as images, fonts, styling and scripts, loads those additional resources too, and executes scripts. Scripts can then alter what the browser displays and load additional resources. If your requests results don't match what you see in the browser, but the initial request the browser makes matches, then you'll need to figure out what other resources the browser has loaded and make additional requests with requests as needed. If all else fails, use a project like requests-html, which lets you run a URL through an actual, headless Chromium browser.
The site you are trying to contact makes an additional AJAX request to https://rent.591.com.tw/home/search/rsList?is_new_list=1&type=1&kind=0&searchtype=1&region=1, take that into account if you are trying to scrape data from this site.
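A sketch of what that might look like (not guaranteed: the endpoint may also expect cookies or extra headers from the initial page, which is why a Session primed by a first GET is used here):
import requests

with requests.Session() as s:
    s.headers['User-Agent'] = 'Custom'  # avoid the blacklisted Python default
    s.get('https://rent.591.com.tw')    # prime cookies from the landing page
    r = s.get('https://rent.591.com.tw/home/search/rsList',
              params={'is_new_list': 1, 'type': 1, 'kind': 0,
                      'searchtype': 1, 'region': 1})
    print(r.status_code)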
Next, well-built sites will use security best-practices such as CSRF tokens, which require you to make requests in the right order (e.g. a GET request to retrieve a form before a POST to the handler) and handle cookies or otherwise extract the extra information a server expects to be passed from one request to another.
Last but not least, if a site is blocking scripts from making requests, it is probably either trying to enforce terms of service that prohibit scraping, or it has an API it would rather have you use. Check for either, and take into consideration that you might be blocked more effectively if you continue to scrape the site anyway.
One thing to note: I was using requests.get() to do some webscraping off of links I was reading from a file. What I didn't realise was that the links had a newline character (\n) when I read each line from the file.
If you're getting multiple links from a file instead of a Python data type like a string, make sure to strip any \r or \n characters before you call requests.get("your link"). In my case, I used:
with open("filepath", 'r') as file:
    links = file.read().splitlines()
for link in links:
    response = requests.get(link)
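Worth knowing: file.read().splitlines() already removes the \n and \r line endings, whereas file.readlines() keeps the trailing newline on each line, so with readlines() you would need to strip explicitly:
links = [line.strip() for line in file.readlines()]  # strip() drops \n and \r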
In my case this was due to the fact that the website address had recently changed, and I had been given the old address. At least this changed the status code from 404 to 500, which, I think, is progress :)

JMeter NTLM/Windows Authentication Load Testing

What is to be done?
We have an application deployed on the Sharepoint (corporate) Server which uses the windows credentials to log into the application.
App URL format: http://testmachine:1000/sites/test/
Windows Credentials Format: user_id#domain.co.in
The objective is to perform the load/performance testing on the application (especially the log in functionality) for such n number of users.
Normally when I hit the app URL in the Firefox/IE, it pops up a window asking for credentials. I enter the credentials, browse the app and then log out. I intend to capture this in JMeter and simulate this for large number of users.
Where I’m stuck?
Now I start the JMeter proxy server and try the same steps as above. But when the pop-up window appears, JMeter simply doesn't record it, nor does it record anything else after the login.
What I’ve tried?
If I try the same steps after enabling "Automatically detect intranet network" in IE, it simply auto-detects my Windows credentials (no credentials pop-up), logs me into the app (this is not recorded in JMeter either) and takes me to the home page. Any page I hit thereafter gets recorded in JMeter.
I’ve also tried to use the HTTP Authorization Manager with the following parameters:
BaseURL : http://testmachine:1000/sites/test/
Username: DOMAIN\USER_ID
Password: i_wont_tell_you
Domain: \
Realm:
It didn't help. I am quite confused about how to use the above element, and not even sure whether it's the right approach to solve my problem.
Any help/suggestions?
P.S. I know about a tool called Badboy, but I'll only go for it as a last resort. I'm also not sure whether it records the pop-up windows.
And sorry if the post is verbose.
UPDATE:
I have also tried -
Username: USER_ID and Domain: my_company_domain
But this is not the actual problem. The problem is that the pages I recorded previously return a success response when I replay them, even when I haven't used the HTTP Authorization Manager. I'm not sure what I'm missing.
OK. Finally I got what was missing.
First, I had to change the implementation of every request to HttpClient3.1.
Second, it was really frustrating to see that the JMeter documentation was misleading.
It says that the config file httpclient.parameters should be edited as follows:
http.authentication.preemptive$Boolean=false
But it didn't work. Changing it to true worked like a charm.
Hope this helps other people.
JMeter works at the HTTP layer, so the proxy will only capture requests made over this protocol layer. It sounds to me like you have already found the right approach for recording by enabling "Automatically detect intranet network" in IE; you can use this method to capture most requests, and you will have to figure out authentication manually. How you do this depends on how your application communicates with your server to authenticate a user.

What does "pending" mean for request in Chrome Developer Window?

What does "Pending" mean under the status column in the "Network" tab of Google Chrome Developer window?
This happens when my page script issues a GET request whose response contains content-headers for downloading a CSV file:
Content-type: text/csv;
Content-Disposition: attachment; filename=myfile.csv
This works fine in FF and IE7, downloading a CSV file as expected and opening a file picker to save the file, but Chrome does nothing. I confirmed that the server responds to the request, so it appears that Chrome will not process the response.
Curiously, all works as expected if I type the URL into Chrome's address bar and hit <enter>.
FYI: Chrome 10.0.648.204 on Windows XP
In my case, I found that the "pending" status was caused by the AdBlock extension. The image that I couldn't get to load had the word "ad" in the URL, so AdBlock kept it from loading.
Disabling AdBlock fixes this issue.
Renaming the file so that it doesn't contain "ad" in the URL also fixes it, and is obviously a better solution. Unless it's an advertisement, in which case you should leave it like that.
I also get this when using the HTTPS Everywhere plugin.
This plugin has a list of sites that are also available over https instead of http, so I assume the request is somehow cancelled before it is actually made.
So for example, when I go to http://stackexchange.com, in the Developer tools I first see a request with status (terminated). This request has only a few headers (GET, User-Agent, and Accept) and no response.
Then there is request to https://stackexchange.com with full headers etc.
So I assume it is used for requests that aren't sent.
I had some problems with pending requests for mp3 files.
I had a list of mp3 files and one player to play them. If I picked a file that had already been downloaded, Chrome would block the request and show "pending request" in the network tab of the developer tools.
All versions of Chrome seem to be affected.
Here is a solution I found:
player[0].setAttribute('src','video.webm?dummy=' + Date.now());
You just add a dummy query string to the end of each URL. This forces Chrome to download the file again.
Another example with the Popcorn player (using jQuery):
url = $(this).find('.url_song').attr('url');
pop = Popcorn.smart( "#player_", url + '?i=' + Date.now());
This works for me. In fact, the resource is not stored in the cache system. This should also work in the same way for .csv files.
I had the same issue on OSX Mavericks; it turned out that Sophos anti-virus was blocking certain requests. Once I uninstalled it, the issue went away.
If you think it might be caused by an extension, one easy way to test this is to open Chrome with the --disable-extensions flag and see if that fixes the problem. If it doesn't, consider looking beyond the browser to see if any other application might be causing the problem, specifically security apps, which can affect requests.
I had a similar issue with application/json AJAX calls. In FF/IE they were fine. In Chrome, in the Developer Network window, the status was always (pending) because a different status code was being returned.
In my case I changed my JSON response to send an HttpStatusCode of 200; then Chrome was fine and the status text changed to 200 OK.
For example using ASP.NET Web Api
return new HttpResponseMessage(HttpStatusCode.OK) {
    Content = request.Content
};
A pending entry in the Network tab means your request is still in progress; as soon as the response arrives, the time column is updated with the total elapsed time.
The fix, for me, was to add the following to the top of the PHP file that was being requested:
header("Cache-Control: no-cache,no-store");
Same problem with Chrome: I had the following code in my HTML page:
<body>
...
<script src="http://myserver/lib/load.js"></script>
...
</body>
But load.js was always in status pending when looking at the Network panel.
I found a workaround using asynchronous loading of load.js:
<body>
...
<script>
setTimeout(function(){
    var head, script;
    head = document.getElementsByTagName("head")[0];
    script = document.createElement("script");
    script.src = "http://myserver/lib/load.js";
    head.appendChild(script);
}, 1);
</script>
...
</body>
Now it's working fine.
I encountered a similar issue recently.
My app is in Angular 11 and we have a form with some validators that use regexes to validate the data. One data element contained a special character that the regex wasn't handling, and it hung the entire browser. Even though all network calls succeeded with 200 OK, Chrome was not showing any response returned by the backend and was showing the requests in a pending state; there were no console errors or anything. Fixing the regex fixed the issue.
After I found the issue, I googled more about it. Here is more explanation of it:
https://javascript.info/regexp-catastrophic-backtracking
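For the curious, here is a minimal Python sketch of the failure mode that article describes, using the article's own example pattern; the nested quantifiers give the engine exponentially many ways to split a non-matching input:
import re

# each word can be carved up between \w+ and the group repetition in many
# ways, so a failing match triggers catastrophic backtracking
pattern = re.compile(r'^(\w+\s?)*$')

# the trailing '!' makes the match fail; uncommenting this line will hang
# the process for a very long time:
# pattern.match('An input string that takes a long time or even hangs!')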
I came across this issue when I was debugging a local web application. The issue turned out to be AVG Antivirus and Firewall restrictions. I had to allow an exception through the firewall to get rid of the "Pending" status.
In my case, a simple restart of my browser (Chrome) fixed it straight away, like magic!
A little bit of context: I happened to refresh my frontend web page and straight away made a change to my API, which caused it to restart. During that moment, the frontend was making calls to the API, which ended up "pending" because the API was reloading, and the browser cached that pending state. To get out of it I could either set no-cache (which I didn't want to) or restart the browser; I chose the restart.
A little background
I encountered such an issue when requesting a URL in my Django project. The server is set up using the Apache HTTP web server with basic auth for user authentication.
The URL I was accessing required no authentication, i.e. in my Apache config I had set Require all granted on the URL using the LocationMatch directive.
The issue
The URL I was trying to access returned a 200 status (in the Network tab in Chrome), but the static assets used for styling the requested webpage (CSS, JavaScript, font files, etc.) were not loading and returned a pending status.
Meanwhile, the page loaded partially and still kept on loading. All this was happening with the basic-auth dialog present in the browser, even though my URL was granted full access.
What worked for me
Interestingly, as soon as I entered my credentials and logged in, the requested page loaded all the static assets. This made it very clear to me that the static assets directory did NOT have the necessary access permissions.
So I granted access to the static assets directory by updating my Apache config, and then the requested URL and webpage loaded fine (200 status), without any basic-auth dialog or pending status.
In my case, there was a pending Chrome update that stopped pages loading until I restarted the browser. Cheers.
I encountered the same problem when requesting certain images from a page. I use JavaScript to set the src attribute of an img object, and if the network is poor, "pending" is displayed in the network panel of the Chrome developer window. I think it's due to the poor network.
