How do I make the Mixnode Crawler crawl slower? - web-scraping

We topped 20 million pages/hour and I truly appreciate the speed; however, I'm afraid I may be putting too much pressure on target sites. Is there any way we can decrease the speed at which websites are crawled?

Not sure why you'd want to decrease the speed, as the documentation clearly states:
There is a minimum delay of 10 seconds between requests sent to the same website. If robots.txt directives of a website require a longer delay, Mixnode will follow the delay duration specified by the robots.txt directives.
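For anyone building their own crawler rather than using Mixnode, a minimal sketch of the same politeness policy (a fixed per-host floor delay, overridden by a longer robots.txt Crawl-delay) could look like the following; the 10-second floor mirrors the documented behaviour, while the user-agent string and helper names are assumptions:

    import time
    import urllib.robotparser
    from urllib.parse import urlparse

    MIN_DELAY = 10.0           # assumed per-site floor, matching the documented 10-second minimum
    USER_AGENT = "my-crawler"  # hypothetical user-agent string

    _robots_cache = {}  # host -> RobotFileParser
    _last_fetch = {}    # host -> timestamp of the previous request to that host

    def delay_for(url):
        """Return the delay (in seconds) to respect for this URL's host."""
        host = urlparse(url).netloc
        if host not in _robots_cache:
            rp = urllib.robotparser.RobotFileParser(f"https://{host}/robots.txt")
            rp.read()
            _robots_cache[host] = rp
        crawl_delay = _robots_cache[host].crawl_delay(USER_AGENT)
        # Honour the longer of the fixed floor and the robots.txt Crawl-delay directive.
        return max(MIN_DELAY, crawl_delay or 0)

    def polite_wait(url):
        """Sleep until enough time has passed since the last request to this host."""
        host = urlparse(url).netloc
        remaining = delay_for(url) - (time.monotonic() - _last_fetch.get(host, 0.0))
        if remaining > 0:
            time.sleep(remaining)
        _last_fetch[host] = time.monotonic()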

Related

Why does facebook's conversion pixel load multiple JavaScript files?

If I visit a website with the facebook conversion pixel installed (such as https://www.walmart.com/), I notice that several different JavaScript files are loaded by the pixel.
The first one is https://connect.facebook.net/en_US/fbevents.js.
The second one is
https://connect.facebook.net/signals/config/168539446845503?v=2.9.2&r=stable. This one seems to have some user specific configuration data baked into the file.
The third one is https://connect.facebook.net/signals/plugins/inferredEvents.js?v=2.9.2
What I don't understand is: why doesn't Facebook consolidate all of these into one request, like https://connect.facebook.net/en_US/168539446845503/fbevents.js?v=2.9.2&r=stable, and simply return one file with everything in it? That would do everything the conversion pixel does now, but with 1 request instead of 3.
Since the page already makes more than a hundred requests while loading, fetching 1 JavaScript file instead of 3 would not be a significant improvement.
Facebook probably split the code into 3 files for better design:
1 generic library: fbevents.js
1 more specific library: inferredEvents.js, which uses the first one
1 file that contains generated code, probably specific to the merchant 168539446845503 (Walmart?)
This fragmentation makes code maintenance easier (testing, reusability, bug fixes).
And finally, the generic files fbevents.js and inferredEvents.js can be cached by the browser and reused on other websites. This is a kind of optimization, possibly better than the one you suggest.
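To make the caching point concrete: the browser cache is keyed by URL, so a shared fbevents.js fetched on one site is a cache hit on every other site that embeds the pixel, whereas a single per-merchant bundle (with the merchant ID in its URL) would never be shared across sites. A toy sketch of that effect, where the second merchant ID and the file sizes are made up for illustration:

    # Toy model of a URL-keyed browser cache; the second merchant ID and the
    # sizes (in KB) are invented for illustration.
    cache = set()
    network_kb = 0

    def load(url, size_kb):
        """Fetch a resource, paying the network cost only on a cache miss."""
        global network_kb
        if url not in cache:
            network_kb += size_kb
            cache.add(url)

    SHARED = [
        ("https://connect.facebook.net/en_US/fbevents.js", 90),
        ("https://connect.facebook.net/signals/plugins/inferredEvents.js?v=2.9.2", 10),
    ]

    # Visiting two different merchants: only the per-merchant config file is re-downloaded.
    for merchant_config in ("168539446845503", "999999999999999"):
        for url, size in SHARED:
            load(url, size)
        load(f"https://connect.facebook.net/signals/config/{merchant_config}", 15)

    print(network_kb)  # 130 KB, versus 230 KB if each merchant shipped one monolithic 115 KB bundle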
Having multiple resource requests to the same origin is FAR, FAR less of an issue than it was a few years ago:
Internet speeds are much faster.
Latency is lower (most notably on 5G phones).
The HTTP/3 protocol has many improvements that help when multiplexing files simultaneously from the same server.
Browsers don't limit the number of active connections to a site as aggressively as they used to (and that limit doesn't matter with HTTP/3 anyway).
Facebook already serves these files over HTTP/3.

Google Page Speed Drop, says I can save more loading time, even though nothing changed

I'm testing my page speed several times every day. My page often scores between 94 and 98, with the main issues being:
Eliminate render-blocking resources - Save 0.33 s
Defer unused CSS - Save 0.15 s
And in Lab Data, all values are green.
Since yesterday, the score has suddenly dropped to the 80-91 range,
with the problems being:
Eliminate render-blocking resources - Save ~0.33 s
Defer unused CSS - Save ~0.60 s
It is also saying my First CPU Idle is slow, at ~4.5 s,
and so is Time to Interactive, at ~4.7 s,
and sometimes Speed Index is slow as well.
It also started showing the Minimize main-thread work advice, which didn't appear earlier.
The thing is, I did not change anything on the page: still the same HTML, CSS and JS. This is also not a server issue; I don't have a CPU overuse problem.
On GTmetrix I'm still getting the same 100% PageSpeed score and the same 87% YSlow score, with the page fully loaded in somewhere between 1.1 s and 1.7 s, making 22 HTTP requests with a total size of 259 KB, just like before.
On Pingdom I also get the same grade of 91 as before, with page load time around 622 ms to 750 ms.
Therefore, I can't understand this sudden change in the way Google analyzes my page.
I'm worried, of course, that it will affect my rankings.
Any idea what is causing this?
It seems that this is a problem with PageSpeed Insights itself, as is now being reported in the pagespeed-insights-discuss Google Group:
https://groups.google.com/forum/#!topic/pagespeed-insights-discuss/luQUtDOnoik
The point is that if you test your performance directly with another Lighthouse-based web test, for example:
https://www.webpagetest.org/lighthouse
you will see your previous scores.
In our case, this site has always scored 90+ on mobile, but the Google PageSpeed score has now dropped to 65+:
https://developers.google.com/speed/pagespeed/insights/?url=https%3A%2F%2Fwww.test-english.com%2F&tab=mobile
but it still remains at 90+ on webpagetest.org: https://www.webpagetest.org/result/190204_2G_db325d7f8c9cddede3262d5b3624e069/
This bug was acknowledged by Google and has now been fixed. Refer to https://groups.google.com/forum/#!topic/pagespeed-insights-discuss/by9-TbqdlBM
From Feb 1 to Feb 4 2019, PSI had a bug that led to lower performance
scores. This bug is now resolved.
The headless Chrome infrastructure used by PageSpeed Insights had a
bug that reported uncompressed (post-gzip) sizes as if they were the
compressed transfer sizes. This led to incorrect calculations of the
performance metrics and ultimately a lower score. The mailing list
thread titled [BUG] [compression] Avoid enormous network payloads /
Defer unused CSS doesn't consider compression covered this issue in
greater detail. Thanks to Raul and David for their help.
As of Monday Feb 4, 5pm PST, the bug is completely addressed via a new
production rollout.
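To see why reporting uncompressed sizes as transfer sizes hurts the score: text assets such as CSS and JavaScript usually compress several-fold with gzip, so the payload-related audits were seeing numbers several times larger than what actually crossed the wire. A quick illustration (the sample text is arbitrary and compresses far better than real CSS, but it shows the gap the bug introduced):

    import gzip

    # Arbitrary repetitive CSS-like text; it compresses far better than real CSS,
    # but the point is the same: uncompressed size != transfer size.
    css = (".btn { color: #333; padding: 8px 12px; border-radius: 4px; }\n" * 500).encode()

    uncompressed = len(css)
    transferred = len(gzip.compress(css))

    print(f"uncompressed:  {uncompressed} bytes")
    print(f"gzip transfer: {transferred} bytes")
    print(f"ratio:         {uncompressed / transferred:.1f}x")
    # An audit that treats the uncompressed size as the transfer size sees a page
    # that looks several times heavier than it really is, dragging the score down.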

Why is load time for images so slow and variable?

My website loads most elements within 1-2 secs.
However, some of the images take 6 seconds (or even up to 20 seconds) to load. There seems to be no pattern to which images take a long time: most load in under 1 second, but 2 or 3 will "wait" for 6+ seconds. This is obviously hurting the page load time.
I serve the images from a CDN (I know they don't show that here), but that makes no difference, and sometimes the images that take the longest to load are under 1 KB in size.
The website is hosted on an AWS EC2 t2.micro instance. I am using W3TC and Cloudfront CDN. The images have been optimised.
I have included my CPU Credit Balance, which is also low. Might this be a problem?
Any ideas as to why random images will take a long time to be served?
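One way to narrow this down is to time the suspect images directly and separate time-to-first-byte from download time: if the wait dominates even for sub-kilobyte files, the delay is on the origin/CDN side rather than in the transfer. A rough sketch, where the image URLs are placeholders to replace with the slow ones from the waterfall:

    import time
    import urllib.request

    # Placeholder URLs; substitute the images that show long "wait" times in the waterfall.
    IMAGE_URLS = [
        "https://example-cdn.cloudfront.net/img/photo1.jpg",
        "https://example-cdn.cloudfront.net/img/icon.png",
    ]

    for url in IMAGE_URLS:
        start = time.monotonic()
        with urllib.request.urlopen(url) as resp:
            first_byte = time.monotonic()  # headers received ~ time to first byte
            body = resp.read()
            done = time.monotonic()
        print(f"{url}: ttfb={first_byte - start:.2f}s "
              f"download={done - first_byte:.2f}s size={len(body)}B")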

Asp.net guaranteed response time

Does anybody have any hints as to how to approach writing an ASP.net app that needs to have a guaranteed response time?
When under high load that would normally cause us to exceed our desired response time, we want to throw out an appropriate number of requests so that the rest of the requests can return before the maximum response time. Throwing out requests based on exceeding a fixed requests/second threshold is not viable, as there are other external factors controlling response time that cause the maximum RPS we can safely support to drift and fluctuate fairly drastically over time.
It's OK if a few requests take a little too long, but we'd like the great majority of them to meet the required response-time window. We want to "throw out" the minimal, or near-minimal, number of requests so that we can process the rest within the allotted response time.
It should account for ASP.Net queuing time, ideally the network request time but that is less important.
We'd also love to do adaptive work, like make a db call if we have plenty of time, but do some computations if we're shorter on time.
Thanks!
SLAs with a guaranteed response time require a bit of work.
First off, you need to spend a lot of time profiling your application. You want to understand exactly how it behaves under various load scenarios: light, medium, heavy, crushing. When doing this profiling, it is critical that it's done on the exact same hardware/software configuration that production uses. Results from one set of hardware have no bearing on results from an even slightly different set of hardware. This isn't just about the servers, either; I'm talking routers, switches, cable lengths, hard drives (make/model), everything, even BIOS revisions on the machines, RAID controllers and any other device in the loop.
While profiling make sure the types of work loads represent an actual slice of what you are going to see. Obviously there are certain load mixes which will execute faster than others.
I'm not entirely sure what you mean by "throw out an appropriate number of requests". That sounds like you want to drop those requests... which sounds wrong on a number of levels. Doing this usually kills an SLA as being an "outage".
Next, you are going to have to actively monitor your servers for load. If load levels get within a certain percentage of your max then you need to add more hardware to increase capacity.
Another thing, monitoring result times internally is only part of it. You'll need to monitor them from various external locations as well depending on where your clients are.
And that's just about your application. There are other forces at work such as your connection to the Internet. You will need multiple providers with active failover in case one goes down... Or, if possible, go with a solid cloud provider.
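On the "throw out an appropriate number of requests" and adaptive-work ideas from the question: one common way to implement them, independent of a fixed requests/second limit, is to shed load based on how long a request has already waited relative to the response-time budget. The following is a minimal, language-agnostic sketch in Python; the budget, thresholds, and handler functions are assumptions, not ASP.NET specifics:

    import time

    BUDGET_S = 1.0        # assumed end-to-end response-time target
    DEGRADE_AFTER = 0.25  # fraction of budget spent waiting: switch to the cheap path
    SHED_AFTER = 0.5      # fraction of budget spent waiting: not worth starting at all

    def handle(request, enqueued_at):
        """Decide per request whether to shed it, degrade it, or do the full work."""
        waited = time.monotonic() - enqueued_at

        if waited > SHED_AFTER * BUDGET_S:
            # Too late to finish inside the budget: fail fast so the remaining
            # requests can still meet the target.
            return 503, "server busy, retry later"

        if waited > DEGRADE_AFTER * BUDGET_S:
            # Adaptive work: skip the expensive path (e.g. the DB call) when time is short.
            return 200, cheap_response(request)

        return 200, full_response(request)

    def cheap_response(request):  # hypothetical cheap fallback
        return "cached/approximate result"

    def full_response(request):   # hypothetical full handler that hits the database
        return "fresh result from the database"

The same check can sit early in an ASP.NET pipeline (for example in a module or handler), using the request's arrival timestamp as enqueued_at.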
Yes: at the last mvcConf, one of the speakers compared the performance of various view engines for ASP.NET MVC. I think it was Steven Smith's presentation that did the comparison, but I'm not 100% sure.
Keep in mind, however, that ASP.NET itself will play only a very minor role in the performance of your app; the DB is likely to be your biggest bottleneck.
Hope the video helps.

What is the most accurate method of estimating peak bandwidth requirement for a web application?

I am working on a client proposal and they will need to upgrade their network infrastructure to support hosting an ASP.NET application. Essentially, I need to estimate peak usage for a system with a known quantity of users (currently 250). A simple answer like "you'll need a dedicated T1 line" would probably suffice, but I'd like to have data to back it up.
Another question referenced NetLimiter, which looks pretty slick for getting a sense of what's being used.
My general thought is that I'll fire up the web app and use the system the way I anticipate it will be used at the customer's site, at a leisurely pace, over a certain time span, and then multiply the bandwidth usage by the number of users and divide by the time.
This doesn't seem very scientific. It may be good enough for a proposal, but I'd like to see if there's a better way.
I know there are load tools available for testing web application performance, but it seems like these would not accurately simulate peak user load for bandwidth testing purposes (too much at once).
The platform is Windows/ASP.NET and the application is hosted within SharePoint (MOSS 2007).
In lieu of a good reporting tool for bandwidth usage, you can always do a rough guesstimate.
N = Number of page views in busiest hour
P = Average Page size
(N * P) / 3600 = Average traffic per second.
The server itself will have a lot more internal traffic, probably to a DB server/NAS/etc., but outward facing that should give you a very rough idea of utilization. Obviously you will need to provision well beyond this value, as you never want to be at 100% utilization and you need to allow for other traffic.
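As a worked example of that formula (the inputs below are made-up assumptions, not measurements):

    # Worked example of the (N * P) / 3600 guesstimate above; inputs are assumptions.
    N = 250 * 20    # 250 users averaging ~20 page views each in the busiest hour
    P = 150 * 1024  # average page size of 150 KB, in bytes

    avg_bytes_per_sec = (N * P) / 3600
    avg_kbit_per_sec = avg_bytes_per_sec * 8 / 1000

    print(f"{avg_bytes_per_sec:,.0f} B/s  ~ {avg_kbit_per_sec:,.0f} kbit/s average over the peak hour")
    # Roughly 213,333 B/s, about 1.7 Mbit/s, which already exceeds a single T1 (1.544 Mbit/s)
    # before adding headroom for bursts, internal DB/NAS traffic, or growth.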
I would also not suggest using an arbitrary number like 250 users. Use your heaviest production day/hour as a reference, and double or triple it if you like. If you have good log files/user auditing, that will give you the expected distribution of user behavior and make your guesstimate more accurate.
As another commenter pointed out, a data center is a good idea when redundancy and bandwidth availability are a concern. Your needs may vary, but do not dismiss the suggestion lightly.
There are several additional questions that need to be asked here.
Is it 250 total users, or 250 concurrent users? If concurrent, is that 250 at peak, or 250 typically? If it's 250 total users, are they all expected to use it at the same time (eg, an intranet site, where people must use it as part of their job), or is it more of a community site where they may or may not use it? I assume from the way you've worded this that it is 250 total users, but that still doesn't tell enough about the site to make an estimate.
If it's a community or "normal" internet site, it will also depend on the usage - eg, are people really going to be using this intensely, or is it something that some users will simply log into once, and then forget? This can be a tough question from your perspective, since you will want to assume the former, but if you spend a lot of money on network infrastructure and no one ends up using it, it can be a very bad thing.
What is the site doing? At the low end of the spectrum, there is a "typical" web application, where you have reasonable size (say, 1-2k) pages and a handful of images. A bit more intense is a site that has a lot of media - eg, flickr style image browsing. At the upper end is a site with a lot of downloads - streaming movies, or just large files or datasets being downloaded.
This is getting a bit outside the threshold of your question, but another thing to look at is the future of the site: is the usage going to possibly double in the next year, or month? Be wary of locking into a long term contract with something like a T1 or fiber connection, without having some way to upgrade.
Another question is reliability - do you need redundancy in connections? It can cost a lot up front, but there are ways to do multi-homed connections where you can balance access across a couple of links, and then just use one (albeit with reduced capacity) in the event of failure.
Another option to consider, which effectively lets you completely avoid this entire question, is to just host the application in a datacenter. You pay a relatively low monthly fee (low compared to the cost of a dedicated high-quality connection), and you get as much bandwidth as you need (eg, most hosting plans will give you something like 500GB transfer a month, to start with - and some will just give you unlimited). The datacenter is also going to be more reliable than anything you can build (short of your own 6+ figure datacenter) because they have redundant internet, power backup, redundant cooling, fire protection, physical security.. and they have people that manage all of this for you, so you never have to deal with it.
