Druid: how does it use its cache and the OS page cache?

I'm observing that Druid query performance can benefit from the previous queries. Thus, I'm trying to understand the reasons.
I know that Druid uses a cache (I'm using the cache on the Broker), but this cache just stores the results of queries per segment (right?). However, I have noticed that performance improves when subsequent queries use the same segments.
Example:
Select sum(metric), dimteste2, dimteste3 from table x where dimteste='x' group by dimteste2, dimteste3 -> 2 seconds
Select sum(metric), dimteste2, dimteste3 from table x where dimteste3='y' group by dimteste2, dimteste3 -> 0.5 seconds
I searched and found that this behavior can be explained by the OS page cache. Based on my research, I think that Druid, during the first query to the datasource, loads the necessary segments into memory (the OS page cache), so the segments can be read faster by the next queries.
Am I right?
I looked in the Druid documentation and I was unable to find anything helpful.
Can you please give me some help explaining this awesome behavior?
Best regards,
José Correia

Druid does use caching to improve performance at several levels: at the segment level on Historicals and at the query level on the Broker. Generally, the more memory you give it, the faster it works.
Below is the documentation on caching:
Query Caching
Druid supports query result caching through an LRU cache. Results are stored on a per segment basis, along with the parameters of a given query. This allows Druid to return final results based partially on segment results in the cache and partially on segment results from scanning historical/real-time segments.
Segment results can be stored in a local heap cache or in an external distributed key/value store. Segment query caches can be enabled at either the Historical or the Broker level (it is not recommended to enable caching on both).
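The per-segment behavior described above can be illustrated with a minimal sketch. This is plain Python, not Druid code: the class `SegmentResultCache`, the function `run_query`, and the segment names in the usage note are hypothetical, and the "query" is just a sum over a list of numbers. It only shows how a result can be assembled partly from cached segment results and partly from fresh scans.

```python
from collections import OrderedDict


class SegmentResultCache:
    """Tiny LRU cache keyed by (segment_id, query_key). Hypothetical, not Druid's API."""

    def __init__(self, capacity):
        self.capacity = capacity
        self._entries = OrderedDict()

    def get(self, segment_id, query_key):
        key = (segment_id, query_key)
        if key not in self._entries:
            return None
        self._entries.move_to_end(key)  # mark as most recently used
        return self._entries[key]

    def put(self, segment_id, query_key, result):
        key = (segment_id, query_key)
        self._entries[key] = result
        self._entries.move_to_end(key)
        if len(self._entries) > self.capacity:
            self._entries.popitem(last=False)  # evict the least recently used entry


def run_query(segments, query_key, scan, cache):
    """Sum per-segment results, scanning only segments missing from the cache.

    Returns (total, list of segment ids that had to be scanned).
    """
    total = 0
    scanned = []
    for segment_id, rows in segments.items():
        partial = cache.get(segment_id, query_key)
        if partial is None:
            partial = scan(rows)
            cache.put(segment_id, query_key, partial)
            scanned.append(segment_id)
        total += partial
    return total, scanned
```

Running the same query twice over `{"seg-1": [1, 2, 3], "seg-2": [10, 20]}` scans both segments the first time and none the second. A query with different parameters uses a different key and would miss this cache entirely.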
Query caching on Brokers
Enabling caching on the broker can yield faster results than if query caches were enabled on Historicals for small clusters. This is the recommended setup for smaller production clusters (< 20 servers). Take note that when caching is enabled on the Broker, results from Historicals are returned on a per segment basis, and Historicals will not be able to do any local result merging.
Query caching on Historicals
Larger production clusters should enable caching only on the Historicals to avoid having to use Brokers to merge all query results. Enabling caching on the Historicals instead of the Brokers enables the Historicals to do their own local result merging and puts less strain on the Brokers.
The Druid Broker doesn't do anything directly with the OS page cache; if the OS has virtual memory available, then the heaps are allocated based on memory requirements.

Related

Multiple wordpress site installation

Guys, I have a question.
Let's say I want to upload hundreds of thousands of posts/products to WordPress, which will slow down the website, and the database will also keep getting bigger.
What if I split the WordPress site into several installations in different subdirectories, based on product or post category? Say each website contains only 25-30k posts/products, but there are about 10 of those installations, so each database is a lot smaller.
Do you think that would perform better than putting everything in a single website?
My server has around 16 GB of RAM and 8 CPU cores.
I don't think it will make any difference, given that you will run it on the same hardware. With multiple machines and one ingress node/load balancer, you could route requests to different backend servers based on the product requested. But if you have only one server hosting everything (web server, database, etc.), you will hit the limits of CPU/RAM/etc. much sooner than the limits of the database table size (given it's properly designed, has indices and so on).
However, you can measure the performance in both cases with a load testing tool and see what the response time, resource usage and database slow query log look like in each deployment scenario.
Data size doesn't have to slow the site. It becomes a matter of how fast you can get the data from the DB. A few things to consider:
Place the database on a dedicated host. If it is locally hosted, dedicate a crossover cable from the web tier to the DB tier, with a second IP for admin on the database host. You might also consider a managed instance of your database with a cloud provider.
Indexes are your friend. Larger datasets result in longer indexes, but you can deliberately make shorter ones. Choose a database that supports partitioned indexes. Combine partitioned indexes with the higher I/O operations per second of SSDs for your index partitions, and ensure that all lookups go through an index, so that performance doesn't suffer on large data sets. How does a partitioned index increase access speed? Instead of having to traverse an index from A to S for an index-supported query with an S-based where clause, in a partitioned index you might have 26 indexes, one for A, then B, then C, and so on. You jump straight to the S partition for the lookup.
Shape your pool size on the PHP/Web tier. You have already increased the pool size by pulling the database onto its own host. The next thing to do is to effectively manage your cache of fixed assets, the items that do not change across user sessions. Commonly these items are style sheets, images, fonts, JavaScript files, ... At a minimum, look at a cache node in front of your WordPress site. Take a look at Varnish or Nginx for this. I am partial to Varnish, but either should do the trick. If you pair this with a CDN for a multigenerational cache, all the better. If you are in the cloud, you have built-in CDN options with each cloud provider. You can also widen your bandwidth by placing these fixed assets on a dedicated host and then caching that one host, but this would require a lot of base modification of your WordPress image.
There is no reason why you cannot have multiple web fronts with a common database back end. You would need a load balancer to distribute the load and your first generation cache would sit in front of the load balancer. Realistically, if all of your queries are index supported and your cache is effectively managed, then you can easily scale to hundreds of concurrent users on moderate hardware. Your most taxing item is going to be your PHP execution to pull dynamic data for user sessions. Make the queries respond as fast as possible then you have a small lock window on PHP for each session.
Watch your locks per session! You may be at the mercy of a template and how it is managing your finite resource pool, but in general (a) unless 50%+1 use something, do not allocate it early, (b) be merciless in cutting sessions to release the session based locks on memory, (c) pinch your assets until they bleed - No 45 MB images on the front page when a color optimized 120K compressed image will do the job, (d) Watch the repeat access problem - This applies to subqueries in the database as well as building a web page with hundreds of assets to resolve a page.
Have you considered other options, such as Drupal? The setup is a bit more complex, but I can validate running a dozen distinct websites out of a single Drupal instance with no degradation in performance with the above dedicated database and cache nodes with hundreds of concurrent users on fairly moderate hardware (mini-itx atom based PCs)
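To make the partitioned-index idea from this answer concrete, here is a minimal in-memory sketch. It is an illustration only, not how any particular database implements partitioned indexes: the functions `build_partitioned_index` and `lookup` are hypothetical, keys are assumed to be non-empty strings, and the partition key is simply the first letter.

```python
from bisect import bisect_left
from collections import defaultdict


def build_partitioned_index(keys):
    """Group keys into one sorted list per first letter (the partition key)."""
    partitions = defaultdict(list)
    for key in keys:
        partitions[key[0].upper()].append(key)
    for bucket in partitions.values():
        bucket.sort()
    return dict(partitions)


def lookup(partitions, key):
    """Jump to the key's partition, then binary-search only that partition."""
    bucket = partitions.get(key[0].upper(), [])
    pos = bisect_left(bucket, key)
    return pos < len(bucket) and bucket[pos] == key
```

A lookup for a key starting with S only ever touches the S bucket, so the work per lookup depends on the size of one partition rather than on the size of the whole index.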

Firebase: queries on large datasets

I'm using Firebase to store user profiles. I tried to put the minimum amount of data in each user profile (following the good practices advised in the documentation about structuring data), but as I have more than 220K user profiles, they still add up to 150MB when all of them are downloaded as JSON.
And of course, it will grow bigger and bigger as I intend to have a lot more users :)
I can't do queries on those user profiles anymore because each time I do that, I reach 100% Database I/O capacity and thus some other requests, performed by users currently using the app, end up with errors.
I understand that when using queries, Firebase needs to consider all data in the list and thus read it all from disk. And 150MB of data seems to be too much.
So is there an actual limit before reaching 100% Database I/O capacity? And what exactly is the usefulness of Firebase queries in that case?
If I simply have small amounts of data, I don't really need queries, I could easily download all data. But now that I have a lot of data, I can't use queries anymore, when I need them the most...
The core problem here isn't the query or the size of the data, it's simply the time required to warm the data into memory (i.e. load it from disk) when it's not being frequently queried. It's likely to be only a development issue, as in production this query would likely be a more frequently used asset.
But if the goal is to improve performance on initial load, the only reasonable answer here is to query on less data. 150MB is significant. Try copying a 150MB file between computers over a wireless network and you'll have some idea what it's like to send it over the internet, or to load it into memory from a file server.
A lot here depends on the use case, which you haven't included.
Assuming you have fairly standard search criteria (e.g. you search on email addresses), you can use indices to store email addresses separately to reduce the data set for your query.
/search_by_email/$user_id/<email address>
Now, rather than 50k per record, you have only the bytes needed to store the email address per record, which is a much smaller payload to warm into memory.
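As a rough illustration of why the separate index helps, the sketch below builds such an index from a small dictionary of profiles and compares the serialized sizes. It uses plain Python dictionaries as a stand-in for the database, not the Firebase SDK; the helper names, the `bio` field, and the sample data are all made up for the example.

```python
import json


def build_email_index(users):
    """Map each user id to just the email address, mirroring /search_by_email/$user_id."""
    return {user_id: profile["email"] for user_id, profile in users.items()}


def find_user_ids_by_email(index, email):
    """Scan the small index instead of the full profiles."""
    return [user_id for user_id, value in index.items() if value == email]


# Hypothetical sample data: each profile carries a large field besides the email.
users = {
    "u1": {"email": "ann@example.com", "bio": "x" * 500},
    "u2": {"email": "bob@example.com", "bio": "y" * 500},
}
index = build_email_index(users)

full_size = len(json.dumps(users))    # bytes to read when querying full profiles
index_size = len(json.dumps(index))   # bytes to read when querying the index
```

The query still has to scan every entry, but each entry is now only an id and an email address, so there is far less data to load.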
Assuming you're looking for robust search capabilities, the best answer is to use a real search engine. For example, enable private backups and export to BigQuery, or go with ElasticSearch (see Flashlight for an example).

Meteor server-side memory usage for thousands of concurrent users

Based on this answer, it looks like the meteor server keeps an in-memory copy of the cache for each connected client. My understanding is that it gets used in order to avoid sending multiple copies of data when dealing with overlapping subscriptions on a client.
The relevant part of the linked answer (emphasis is mine):
The merge box: The job of the merge box is to combine the results (added, changed and removed calls) of all of a client's active publish functions into a single data stream. There is one merge box for each connected client. It holds a complete copy of the client's minimongo cache.
Assuming that answer is still accurate in the current version of meteor, couldn't that create a huge waste of memory on the server as the number of users increases?
As an off-the-cuff calculation, if an app had about a 100kB cache per client, then 10,000 concurrent users would use up 1GB of memory on the server, and 100,000 users a whopping 10GB! This would be true even if each client was looking at almost identical data. It seems plausible for an app to use much more data than that per client, which would further exacerbate the problem.
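The estimate above is a simple multiplication, sketched here so other cache sizes can be tried. The function name `merge_box_memory_gb` is made up for the example, and it assumes decimal units (1 kB = 1,000 bytes, 1 GB = 1,000,000,000 bytes) and one full copy of the cache per connected client, with no sharing between clients.

```python
BYTES_PER_KB = 1_000
BYTES_PER_GB = 1_000_000_000


def merge_box_memory_gb(cache_kb_per_client, concurrent_clients):
    """Server memory in GB if every client has its own full copy of the cache."""
    total_bytes = cache_kb_per_client * BYTES_PER_KB * concurrent_clients
    return total_bytes / BYTES_PER_GB


ten_thousand_users = merge_box_memory_gb(100, 10_000)       # 1.0 GB
hundred_thousand_users = merge_box_memory_gb(100, 100_000)  # 10.0 GB
```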
Does this problem exist in the current version of Meteor? If so, what techniques can be used to limit the amount of memory the server needs to use to manage all the client subscriptions?
Take a look at this post by Arunoda at his meteorhacks.com blog:
http://meteorhacks.com/making-meteor-500-faster-with-smart-collections.html
which talks about his Smart Collections page:
http://meteorhacks.com/introducing-smart-collections.html
He created an alternative Collection stack which has succeeded in its goals for speed, efficiency (memory & CPU) and scalability (you can see a graphed comparison in the post). Admittedly, in his tests RAM usage was negligible with both Collection types, although the way he's implemented things there should be a very obvious difference with the type of use case you mentioned.
Also, you can see in this post on meteor-core:
https://groups.google.com/d/msg/meteor-core/jG1KLObX1bM/39aP4kxqWZUJ
that the Meteor developers are aware of his work and are cooperating in implementing some of the improvements into Meteor itself (but until then his smart package works great).
Important note! Smart collections rely on access to the Mongo Oplog. This is easy if you're running on your own machine or hosted infrastructure. If you're using a cloud-based database, this option might not be available, or if it is, it will cost a lot more than the smaller packages.

NoSQL and AppFabric with Azure

I have an ASP.net application that I'm moving to Azure. In the application, there's a query that joins 9 tables to produce a user record. Each record is then serialized in json and sent back and forth with the client. To increase query performance, the first time the 9 queries run and the record is serialized in json, the resulting string is saved to a table called JsonUserCache. The table only has 2 columns: JsonUserRecordID (that's unique) and JsonRecord. Each time a user record is requested from the client, the JsonUserCache table is queried first to avoid having to do the query with the 9 joins. When the user logs off, the records he created in the JsonUserCache are deleted.
The table JsonUserCache is SQL Server. I could simply leave everything as is but I'm wondering if there's a better way. I'm thinking about creating a simple dictionary that'll store the key/values and put that dictionary in AppFabric. I'm also considering using a NoSQL provider and if there's an option for Azure or if I should just stick to a dictionary in AppFabric. Or, is there another alternative?
Thanks for your suggestions.
"There are only two hard problems in Computer Science: cache invalidation and naming things."
Phil Karlton
You are clearly talking about a cache and as a general principle, you should not persist any cached data (in SQL or anywhere else) as you have the problem of expiring the cache and having to do the deletes (as you currently are). If you insist on storing your result somewhere and don't mind the clearing up afterwards, then look at putting it in an Azure blob - this is easily accessible from the browser and doesn't require that the request be handled by your own application.
To implement it as a traditional cache, look at these options.
Use out of the box ASP.NET caching, where you cache in memory on the web role. This means that your join will be re-run on every instance that the user goes to, but depending on the number of instances and the duration of the average session may be the simplest to implement.
Use AppFabric Cache. This is an extra API to learn and has additional costs which may get quite high if you have lots of unique visitors.
Use a specialised distributed cache such as Memcached. This has the added cost/hassle of having to run it all yourself, but gives you lots of flexibility in the long run.
Edit: All are RAM based. Using ASP.NET caching is simpler to implement and is faster to retrieve the data from cache because it is on the same machine - BUT requires the cache to be populated for each instance of the web role (i.e. it is not distributed). AppFabric caching is distributed but is also a bit slower (network latency) and, depending what you mean by scalable, AppFabric caching currently behaves a bit erratically at scale - so make sure you run tests. If you want scalable, feature rich distributed caching, and it is a big part of your application, go and put in Memcached.
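Whichever option is chosen, the access pattern from the question stays the same: check the cache first, run the expensive query only on a miss, and drop the user's entries at logoff. Below is a minimal in-memory sketch of that pattern. It is not ASP.NET, AppFabric or Memcached code: the class `UserRecordCache` and the `load_record` callable (standing in for the 9-join query plus JSON serialization) are hypothetical.

```python
class UserRecordCache:
    """In-memory cache-aside store for serialized user records (illustrative only)."""

    def __init__(self, load_record):
        # load_record is a hypothetical callable standing in for the expensive
        # 9-join query that returns the JSON string for a record id.
        self._load_record = load_record
        self._records = {}

    def get(self, record_id):
        """Return the cached JSON, loading and caching it on a miss."""
        if record_id not in self._records:
            self._records[record_id] = self._load_record(record_id)
        return self._records[record_id]

    def evict(self, record_ids):
        """Drop the given records, e.g. the ones a user created, at logoff."""
        for record_id in record_ids:
            self._records.pop(record_id, None)
```

With a distributed cache, the dictionary would be replaced by calls to the cache service, but the get-or-load and evict-on-logoff logic would look much the same.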

Caching large amounts of data

I have been reading that lots of people use Redis or another key-value store/NoSQL solution as a distributed cache for their website.
Maybe I'm not understanding completely, but it seems a solution like this only works for shared data. For example, if I have a website that requires a user to log-in and the queries they generate return data specific to only that user (in my case, banking/asset information) that can't be cached for all users, this type of solution doesn't work.
Unfortunately, the database is shared across all our applications, and when it gets bogged down, the website gets bogged down as well. Since each user has gigabytes of information, I obviously can't cache all of that, and each web page queries completely different information.
Is there some caching strategy that I can employ for this type of scenario?
A distributed cache like Velocity doesn't require that the data it stores be limited to "shared" data. But you do have to read the data from your DB and store it in the cache, which takes time.
A few alternatives:
Partition your data, so it's spread out among several DB servers
Add as much RAM as you can to each DB server, to allow SQL Server to cache what it can
There are many variations to the partitioning theme....
Is your web app load balanced? There are caching options at the web tier as well -- the ASP.NET object cache is a good place to start.
It's possible that your web clients are requesting the same data more than once (for a given user). So caching could give a benefit in that case.
But before you go implementing a huge caching solution, you really need to look at the queries that are particularly slow or executed a huge number of times and see if you can optimize them in any way.
Then look at upgrading your DB machine.
I read a nice article about the performance issues that MySpace had when they had a huge growth.
You can find the article here.
One quote from the article that stands out:
The addition of the cache servers is "something we should have done from the beginning, but we were growing too fast and didn't have time to sit down and do it," Benedetto adds.
If the problem is in your database server, think about partitioning your data and making use of a database farm to spread the load. Also think about SSDs! They can really speed up your database access.
Depending on how dynamic your data is, you could consider using Fragment Caching. This caches the HTML of the page rather than the data, so if the volume of data is prohibitive to cache, this might work for you.
