I read the article "Optimize Parallel Downloads to Minimize Object Overhead" and wrote a test demo.
But the result is not what I expected: looking at the waterfall figure, the images spread across multiple domains do download in parallel, but the total time is no less.
Can anyone tell me why? Thanks.
multiple domain image download
single domain image download
I think the problem with your test is that the network latency is so low (assuming it's a local server) that network performance isn't playing a big role here. If you look at the time difference between those images, it doesn't even register in HttpWatch. So the browser may be spending more time parsing, processing, and rendering the downloads than it does actually downloading them (just a guess).
I would hit this test site with something that will show off latency more. If these sites are reachable on the internet, you can just use http://www.webpagetest.org/ and hit them from around the world...
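To make that concrete, here is a rough sketch in TypeScript (Node 18+) that times the same set of image downloads sequentially and in parallel; the sharded hostnames are made up for illustration. Against a local, near-zero-latency server the two totals come out almost identical, which is what your waterfall is showing; against a distant server the parallel run should pull clearly ahead.

```typescript
// Sketch only: the sharded hostnames below are placeholders - point them at
// real, reachable image URLs before running.
const urls: string[] = Array.from(
  { length: 8 },
  (_, i) => `http://img${(i % 4) + 1}.example.com/photo${i}.jpg`,
);

async function download(url: string): Promise<void> {
  const res = await fetch(url);
  await res.arrayBuffer(); // drain the body so the timing includes the transfer
}

async function timeIt(label: string, fn: () => Promise<void>): Promise<void> {
  const start = Date.now();
  await fn();
  console.log(`${label}: ${Date.now() - start} ms`);
}

async function main(): Promise<void> {
  // One request at a time: total time is roughly the sum of all round trips.
  await timeIt("sequential", async () => {
    for (const url of urls) await download(url);
  });
  // All at once: only the latency-bound part shrinks; any serial work remains.
  await timeIt("parallel", async () => {
    await Promise.all(urls.map(download));
  });
}

main().catch(console.error);
```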
~Sniff
P.S. Check out Amdahl's law, it may have a little bearing here ;-).
I am using WordPress and WP Rocket is installed on my website.
Now my issue is that I am getting 98 for Performance, but my Core Web Vitals Assessment still shows "Failed".
Any idea how to fix the Core Web Vitals? LCP is showing 2.9 s. Do I need to work on this?
Core Web Vitals are measured over field data, not the fixed, repeatable conditions that lab-based tools like Lighthouse use to analyse your website. See this article for a good discussion of the two.
Oftentimes Lighthouse is set too strictly, and people complain it shows worse performance than the site's real users see, but it is just as easy to have the opposite, as you see here. PageSpeed Insights (PSI) tries to use settings that are broadly applicable to all sites to give you "insights" into how to improve your performance, but the results should be calibrated against the real-user data that you see at the top of the audit.
In your case, I can see from your screenshots that you are seeing a high Time to First Byte (TTFB) of 1.9 seconds in your real-user data. This makes passing the LCP limit of 2.5 seconds quite tough, as it only leaves 0.6 seconds for everything else.
The question is why you are seeing that long TTFB in the field when you don't see the same in your lab-based results, where you see a 1.1-second LCP time - including TTFB. There could be a number of reasons, and several potential options to resolve it:
Your users are further away from your data centre, whereas PSI is close by. Are you using a CDN?
Your users are predominantly on poorer network conditions than Lighthouse uses. Do you just need to serve less to them in these cases? For example, hold back images for those on slower connections using the Effective Connection Type API and only load them on demand, so the LCP element is text by default (see the sketch after this list)? Or don't use web fonts for these users. Or other forms of progressive enhancement.
Your page visits often jump through several redirect steps, all of which add to TTFB, but for PSI you put the final URL in directly, so you miss this in the analysis. This can often be out of your control if the referrer uses a link shortener (e.g. Twitter does).
Your page visits are often to uncached pages that take time to generate, but when using PSI you run the test a few times, so you benefit from the page being cached and served quickly. Can you optimise your back-end server code, or improve your caching?
Your pages are not eligible for the super-fast in-memory bfcache for repeat visits when going back and forth through the site, which can be seen as a free web-performance win!
Your pages often suffer from contention when lots of people visit at once, and that wasn't apparent in the PSI tests.
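On the Effective Connection Type suggestion in the list above, here is a rough sketch of the idea in TypeScript. Note that navigator.connection is a non-standard, Chromium-only API, and the data-src placeholder markup is just an assumed convention for illustration.

```typescript
// Hold back non-critical images on slow connections so the LCP element stays text.
type ConnectionInfo = { effectiveType?: string };

function isSlowConnection(): boolean {
  // navigator.connection is non-standard, so feature-detect it.
  const conn = (navigator as Navigator & { connection?: ConnectionInfo }).connection;
  return conn?.effectiveType === "2g" || conn?.effectiveType === "slow-2g";
}

// Assumes images are marked up as <img data-src="..."> placeholders.
function loadDeferredImages(): void {
  document.querySelectorAll<HTMLImageElement>("img[data-src]").forEach((img) => {
    img.src = img.dataset.src!;
  });
}

if (isSlowConnection()) {
  // On slow connections, wait until the page has finished loading
  // (in practice you might load on scroll or interaction instead).
  window.addEventListener("load", loadDeferredImages);
} else {
  document.addEventListener("DOMContentLoaded", loadDeferredImages);
}
```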
Those are some of the more common reasons for a slow TTFB, but you know your site, your infrastructure, and your users better, so you are best placed to identify the main reason. Once you solve that, you should see your LCP times reduce and hopefully pass CWV.
I want to prevent web scrapers from aggressively scraping 1,000,000 pages on my website. I'd like to do this by returning a "503 Service Unavailable" HTTP error code to bots that access an abnormal number of pages per minute. I'm not having trouble with form spammers, just with scrapers.
I don't want search engine spiders to ever receive the error. My inclination is to set a robots.txt crawl-delay that will keep spiders' page accesses per minute under my 503 threshold.
Is this an acceptable solution? Do all major search engines support the crawl-delay directive? Could it negatively affect SEO? Are there any other solutions or recommendations?
I have built a few scrapers, and the part that takes the longest is always figuring out the site layout: what to scrape and what not to. What I can tell you is that changing divs and internal layout will be devastating for all scrapers, as ConfusedMind already pointed out.
So here's a little text for you:
Rate limiting
To rate limit an IP means that you only allow the IP a certain number of searches in a fixed timeframe before blocking it. This may seem like a sure way to prevent the worst offenders, but in reality it's not. The problem is that a large proportion of your users are likely to come through proxy servers or large corporate gateways, which they often share with thousands of other users. If you rate limit a proxy's IP, that limit will easily trigger when different users of the proxy use your site. Benevolent bots may also run at higher rates than normal, triggering your limits.
One solution is of course to use a whitelist, but the problem with that is that you continually need to compile and maintain these lists manually, since IP addresses change over time. Needless to say, the data scrapers will only lower their rates or distribute the searches over more IPs once they realise that you are rate limiting certain addresses.
For rate limiting to be effective and not prohibitive for big users of the site, we usually recommend investigating everyone who exceeds the rate limit before blocking them.
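To tie this back to the original question's 503 idea, here is a minimal sketch of per-IP rate limiting, assuming a Node/Express front end; the framework, the 120-pages-per-minute threshold, and the window size are all assumptions, not anything from the question. A real deployment would also exempt verified search-engine crawlers and account for the shared-proxy problem described above.

```typescript
import express, { Request, Response, NextFunction } from "express";

const WINDOW_MS = 60_000;  // one-minute window
const MAX_REQUESTS = 120;  // hypothetical threshold: pages per minute per IP

// In-memory counters; a real setup would evict stale entries and likely use
// a shared store so the limit holds across multiple web servers.
const hits = new Map<string, { count: number; windowStart: number }>();

function rateLimit(req: Request, res: Response, next: NextFunction): void {
  const ip = req.ip ?? "unknown";
  const now = Date.now();
  const entry = hits.get(ip);

  if (!entry || now - entry.windowStart > WINDOW_MS) {
    hits.set(ip, { count: 1, windowStart: now });
    return next();
  }

  entry.count += 1;
  if (entry.count > MAX_REQUESTS) {
    // Tell well-behaved clients when to come back, then fail softly.
    // (Verified search-engine crawlers should be exempted before this point.)
    res.set("Retry-After", "60");
    res.status(503).send("Service Unavailable");
    return;
  }
  next();
}

const app = express();
app.use(rateLimit);
app.use((_req, res) => {
  res.send("page content"); // stand-in for the real site
});
app.listen(3000);
```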
Captcha tests
Captcha tests are a common way of trying to block scraping on web sites. The idea is to display a picture with some text and numbers that a machine can't read but humans can. This method has two obvious drawbacks. Firstly, the captcha tests may be annoying for users if they have to fill out more than one. Secondly, a scraper operator can easily solve the test manually and then let their script run. Apart from this, a couple of big users of captcha tests have had their implementations compromised.
Obfuscating source code
Some solutions try to obfuscate the HTML source code to make it harder for machines to read. The problem with this method is that if a web browser can understand the obfuscated code, so can any other program. Obfuscating source code may also interfere with how search engines see and treat your website. If you decide to implement this, you should do it with great care.
Blacklists
Blacklists of IPs known to scrape the site are not really a method in themselves, since you still need to detect a scraper first in order to blacklist it. Even so, it is still a blunt weapon, since IPs tend to change over time. In the end you will end up blocking legitimate users with this method. If you still decide to implement blacklists, you should have a procedure to review them at least monthly.
I had a meeting with a local newspaper company's owner. They are planning a newly designed website. Their current website is static and doesn't have any kind of database, but their weekly pageview figure is around 317k, and this figure will surely increase in the future.
The question is: if I create a WordPress site for them, will it run smoothly with the new functionality (news, maybe galleries)? It is not necessary to use lots of plugins. Can their current server support the WordPress package without any upgrade?
Or should I consider building the website in plain PHP instead?
Yes - so long as the machinery for it is adequate, and you configure it properly.
If the company uses a CDN (like Akamai), ask them if this thing can piggyback on their account, then make them do it anyway when they throw up a political barrier. Then stop sweating it, turn keepalive on and ignore anything below this line. Otherwise:
If this is on a VPS, make sure it has guaranteed memory and I/O resources - otherwise host it on a hardware machine. If you're paranoid, something with a 10k RPM drive and 2-3 GB of RAM will do (memory so Apache and MySQL have breathing room, and the hard drive for unexpected swap-file compensation).
Make sure the 317k/w figure is accurate:
If it comes from GA/Omniture/another vendor tracking suite, increase the figures by about 33-50% to account for robots that they can't track.
If the number comes from house stats/httpd logs, assume it's 10-20% less (since robots don't typically hit you up for stylesheets and images.)
If it comes from combined reports by an analyst whose job it is to report on their own traffic performance, scratch your head and flip a coin.
Apache: News sites in America have lunchtime and workday wind-down traffic bursts at around 11 am and 4 pm, so you may want to turn KeepAlive off (having it on will improve things during slow traffic periods, but during burst times the machine can spin into an unrecoverable state.)
PHP: Make sure some kind of opcode caching is enabled on the hosting machine (either APC or eAccelerator). With opcode caching, the memory footprint drops significantly and the machine doesn't have to borrow as much from the swap file on the hard drive.
WP: Make sure you use WP 3.4, as ticket http://core.trac.wordpress.org/ticket/10964 was closed in favor of this ticket's fix: http://core.trac.wordpress.org/ticket/18536. Both longstanding issues address query performance on large-volume sites, but the overall improvements/fixes help everywhere else too.
Secondly, make sure to use something like the WP Super Cache caching plugin and configure it appropriately. If the volume of content on this site is going to stay small, you shouldn't have to take any special precautions - otherwise you may want to alter the plugin/rules so as to permanently archive older content into static files. There is no reason why two-year-old content should be constantly re-spidered at full resource cost.
Robots.txt: prepare and properly register a dynamic sitemap with Google/Bing/etc. If you expect posts to be unnecessarily peppered with a bunch of tags and categories by people who don't understand what they actually do, you may want to Disallow /page/*, /category/* and /tag/*. Otherwise, when spider robots swarm the site, every post will be hit an extra time for each tag/category it has. And then some.
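A hypothetical robots.txt along those lines (placeholder domain and sitemap path; the * wildcard in Disallow is understood by Google and Bing):

```
# Hypothetical example - adjust paths and domain to the real site
User-agent: *
Disallow: /page/*
Disallow: /category/*
Disallow: /tag/*

Sitemap: http://example.com/sitemap.xml
```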
For several years The Baltimore Sun hosted their reader-reward, sports and editorial database projects directly off a single colocated machine. The combined traffic volume was several times larger than what you mention, but adequately handled.
Here's a video of httpd status w/keepalive on during a slow hour, at about 30 req./sec: http://www.youtube.com/watch?v=NAHz4GRY0WM#t=09
I would not exclude WordPress for this project based only on a weekly pageview count of under a million. I have hosted WordPress sites that receive much, much more traffic and were still very functional. Whether or not WordPress is the best solution for this type of project, based on the other criteria you have, is completely up to you.
Best of luck and happy coding!
WP is capable of handling huge traffic. See this list of people who are using WP VIP services:
Time, Dow Jones, NBC Sports, CNN and many more.
Visit the WordPress VIP site: http://vip.wordpress.com/clients/
A client of mine has a website and they need to determine how 'scalable' the site currently is. What I mean by this is the number of users browsing around the site concurrently.
It's a custom e-commerce app in .net, not written by myself, and the code is... well, let's just say, a bit dubious.
A much bigger company is looking to buy them / throw funding their way but they need some form of metrics to show how much load it can take before it falls apart. This big company has the ability to 'turn on the taps' to a huge user base - and obviously doesn't want to do that if the site is going to fall over with a sneeze of traffic.
What is a good metric to provide here? And how can I obtain it?
Edit: Question revised
I always use Apache's "ab" tool (ApacheBench, which ships with the Apache HTTP Server).
Run it from a different machine, preferably a BSD or Linux machine with no firewall rules that will limit the performance of the tool; otherwise the results might not be reliable. If you use a Windows machine, make sure you're using one that isn't limiting the number of active TCP connections.
When using "ab", the number you're looking for is "Requests per second". Experiment with the concurrency switch (-c) to see how many concurrent users you can handle before you start getting a lot of errors, or the requests per second drop rapidly.
When you notice the web server is having serious issues, restart it and let it rest for a while before continuing the test.
You'd be better off with a hosted load test, as it might give you more insight into real-world scenarios (something like http://www.scl.com/software-quality/hosted-load-test, though I have no experience with them).
Furthermore: scalability is, as far as I know, not how many concurrent users can be served, but how easy it is to serve more as the site grows bigger (by adding extra servers, etc.) - how easily the website scales up, whether the codebase allows an unlimited number of servers, and so on.
Well, I suppose it'll depend on what the client cares about.
Do they care about how many users can access the site at once? Report on that by running simultaneous requests from another server until it dies, then get the number.
Do they care about something else?
For me, when someone says they want it to 'scale', it really means they have no idea what they want. So try and talk to them, and get specific details of what, exactly, they want to see 'scaling', and then, once you find the areas to analyse, you can do so trivially, and attempt to improve them.
In Joel's article for Inc. entitled How Hard Could It Be?: The Unproven Path, he wrote:
...it turns out that Jeff and his programmers were so good that they built a site that could serve 80,000 visitors a day (roughly 755,000 page views)
How would I go about figuring out the maximum load my server(s) can handle?
Benchmarking your software is often a lot harder than it seems. Sure, it's easy to produce some numbers that say something about the performance of your software, but unless it was calculated using a very accurate representation of the actual usage patterns of your end users, it might be completely different from the actual results you will get in the wild. Websites are notoriously hard to benchmark correctly. Sure, you can run a script that measures the time it takes to generate a page but it will be a very different number from what you will see under real world usage.
In order to create a solid benchmark of what your servers can handle, you first need to figure out what the usage patterns of your users are. If your site is already running, you can easily collect this data from your logs. Next, you need to create a simulation that will emulate exactly the same patterns your real users exhibit... that is: view the front page, log in, view the status page, and so forth. Different pages create different loads on the servers, so you need to fetch the correct mix of pages when simulating load. Finally, you need to figure out which resources are cached by your users; you can do this, again, by looking through your access log or by using a tool such as Firebug.
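As an illustration of replaying a measured page mix rather than hammering one URL, here is a rough sketch in TypeScript (Node 18+). The base URL, paths, weights, and number of virtual users are invented placeholders; in practice they would come from your access logs.

```typescript
const BASE = "http://www.example.com";  // hypothetical site under test
const pageMix: Array<{ path: string; weight: number }> = [
  { path: "/", weight: 50 },        // front page
  { path: "/login", weight: 10 },
  { path: "/status", weight: 40 },
];

// Pick a page at random, weighted by how often real users request it.
function pickPage(): string {
  const total = pageMix.reduce((sum, p) => sum + p.weight, 0);
  let r = Math.random() * total;
  for (const p of pageMix) {
    r -= p.weight;
    if (r <= 0) return p.path;
  }
  return pageMix[0].path;
}

// One simulated user making a series of requests, recording each latency.
async function virtualUser(requests: number, latencies: number[]): Promise<void> {
  for (let i = 0; i < requests; i++) {
    const start = Date.now();
    const res = await fetch(BASE + pickPage());
    await res.arrayBuffer();
    latencies.push(Date.now() - start);
  }
}

async function main(): Promise<void> {
  const users = 25; // concurrent virtual users (made up)
  const latencies: number[] = [];
  await Promise.all(Array.from({ length: users }, () => virtualUser(20, latencies)));
  latencies.sort((a, b) => a - b);
  console.log(`p95 latency: ${latencies[Math.floor(latencies.length * 0.95)]} ms`);
}

main().catch(console.error);
```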
JMeter, ab, or httperf
You can create several "stress tests" and run them as the other posters suggest.
Apache has a tool called JMeter in which you can create these tests and run them several times.
http://jmeter.apache.org/
Greetings.
Jason, have you looked at the load testing built into Visual Studio 2008 Team System? Check out this video to see a demo.
Edit: Here's another video that has better resolution.
Apache has a tool called ab that you can use to benchmark a server. It can simulate loads of requests and concurrency situations for you.
Basically you need to mimic the behavior of a user and keep ramping up the number of users being mimicked until the server response is no longer acceptable.
There are a variety of tools that can do this, but essentially you want to record the activity of a few sessions on your site and then play those sessions back (adding some randomisation to reflect real user behaviour) lots of times.
You will want to log the performance of each session and keep increasing the load until the performance becomes unacceptable.
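A compact sketch of that ramp-up loop, in TypeScript (Node 18+): double the number of simulated users each step and stop once the average response time crosses whatever you decide is acceptable. The target URL, step sizes and 2-second threshold are all assumptions.

```typescript
const TARGET = "http://www.example.com/";  // hypothetical page under test
const LIMIT_MS = 2000;                     // made-up "acceptable" threshold

// Run `concurrency` simulated users, each making `perUser` requests,
// and return the average response time in milliseconds.
async function measure(concurrency: number, perUser: number): Promise<number> {
  const times: number[] = [];
  const user = async () => {
    for (let i = 0; i < perUser; i++) {
      const start = Date.now();
      await (await fetch(TARGET)).arrayBuffer();
      times.push(Date.now() - start);
    }
  };
  await Promise.all(Array.from({ length: concurrency }, user));
  return times.reduce((a, b) => a + b, 0) / times.length;
}

async function rampUp(): Promise<void> {
  for (let users = 10; users <= 640; users *= 2) {
    const avg = await measure(users, 10);
    console.log(`${users} users -> avg ${avg.toFixed(0)} ms`);
    if (avg > LIMIT_MS) {
      console.log(`Performance became unacceptable at ${users} concurrent users.`);
      break;
    }
  }
}

rampUp().catch(console.error);
```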