How to cache complex calculated temporary Data - symfony

I have an Application that allows people to bet on the result of soccer games.
The score of each single bet (=entity) is calculated by comparing the betted scores of the bet with the actual result in the game(=entity). Bet's are betted within Betrounds. Betrounds are organisations where groups bet on gamegroups (groups of games e.g. single matchdays). Single Usergroups can have several betrounds.
To summarize the relational model:
UserGroup 1:N BetRounds 1:N Bets N:1 Game
Within each betround I create a resulttable where I show every user with their result points and position.
In order to calculate the position of one user I need to calculate the points of every user within a betround.
These points from the single betrounds are aggregated into groups and within the group there is again a resulttable.
Example
A Usergroup with: 20 users
One Season has 34 matchdays
One matchday has 9 games
In order to calculate the the points for this usergroup I would need to calculate the points from 20*34*9=6120 bets.
Since this is a lot to calculate I don't want to do it everytime I show the resulttable.
I currently see two options in order to save some calculation time:
Cache
Save interim results (e.g. on the bet entity) in the database
Maybe a mix of both.
Cache
If caching is the correct way I am not sure on which level and how to invalidate.
There are several options what to cache:
- pointresult of single bets
- pointresults of single users within a betround
- whole result table of a betround (points & position)
- pointresult of single user within usergroup
- whole resulttable of usergroup
I am unsure how to cache those data:
- just the integer values for positions and points
- whole entities (e.g. bets)
- temporary not persistent entities (e.g. to represent the the resulttables)
- the html output of the table
Then dependent on format how to cache it:
- html views could be cached via reverse proxies
- values / entities probably via redis / memcache etc.
In the future we might change to a single page app that data is only served via restapi, then caching of html outputs is not an option.
Dependent on the caching strategy the question arises how to invalidate cache and optionally warm it, so that the result is never calculated within the application but only recalculated when the cache is invalidated and immediately replaced by the new result.
I have read very often that cache invalidation is evil. I am not sure if this applies to my use case since all points/results/tables etc. only change when my interface updates the result of the games. This is the only time when points change.
2.Save interim results (e.g. on the bet entity) in the database
I am not sure if this scenario is applicable on all levels. I first thought about saving the actual result on a bet instead of always comparing the bet scores with the actual scores. This would then make my data model a little bit redundent and i have increased complexity if I wrong result is fetched by my interface and later the correct comes in and my points are not recalculated.
On all other levels I would need to create new interim entities to store table results persistently.
3.Mix of both
I am not sure how mixing both would look like and if it makes sense at all, but I thought it might be an option.
Any advice, Input or experience would be highly appreciated.

I only mildly understand betting, so hopefully this helps.
It sounds like you are asking two questions:
When do I calculate results?
How much caching should I use?
To me it sounds like there are very clear events that happen, after which you can successfully calculate your results. Your design should take advantage of this and be evented in nature. You should have background processes that can detect when a game is complete. The results of the game should be written, and additional background jobs should be triggered to calculate the results of any bets that depend on that game.
This would also be the point at which any caches that involve that game, results from that game, or results from any bets on that game, should be invalidated and/or refreshed.
How much you should cache should be based on how much you need to cache. Caching should be considered separately from computing results. That is not caching. That is computing results and storing them. You should definitely not be calculating results during a page view request, and should be done ahead of time when the corresponding event (game ends) has triggered the calculation.
Your database should pretty much always represent the latest information you have on everything. You should avoid doing any calculations on-the-fly if possible.
I would get all the events and background stuff working first, then see what kind of performance you get. At that point your app should be doing little more than taking the results and sticking them into a view for each page view. If that part is going too slow, then you should start looking at caching your views/templates/html. As mentioned before, these caches could be invalidated by your background workers when they encounter new results.

Related

Firebase Database data consumption optimization (observing node only partly)

On my database, I have Post and User Models
User Model has a lot of information, but when I load the posts, I only need 3 out of like 20 parameters.
What I am currently doing is just loading the entire node. This is obviously not really efficient.
My question: Is it more efficient if I observe all 3 values (making 3 connections) individually or just observe the entire node once (making only a single connection).
I don't know exactly what would be more expensive (higher consumption as making 3 connections is probably not better than 1)
Kind regards
Edit
Firebase always loads complete nodes. While it is possible to get a subset of nodes with queries, that doesn't apply here.
So you will either have to load all nodes and do the subselection client-side, or you'll have to create another higher level node that only contains the three properties that you're interested in.
Which one to choose is highly dependent, and (honestly) largely subjective. The main options:
You can reduce bandwidth a bit by only loading the three properties, but if you store them as a duplicate you'll then end up paying for the storage of duplicate information.
You can also store the three properties separately, but not duplicate them. But that means that if you need all properties, you'll need to execute two read operations that then add some overhead and complicate code.

How to retrieve a row's position within a DynamoDB global secondary index and the total?

I'm implementing a leaderboard which is backed up by DynamoDB, and their Global Secondary Index, as described in their developer guide, http://docs.aws.amazon.com/amazondynamodb/latest/developerguide/GSI.html
But, two of the things that are very necessary for a leaderboard system is your position within it, and the total in a leaderboard, so you can show #1 of 2000, or similar.
Using the index, the rows are sorted the correct way, and I'd assume these calls would be cheap enough to make, but I haven't been able to find a way, as of yet, how to do it via their docs. I really hope I don't have to get the entire table every single time to know where a person is positioned in it, or the count of the entire table (although if that's not available, that could be delayed, calculated and stored outside of the table at scheduled periods).
I know DescribeTable gives you information about the entire table, but I would be applying filters to the range key, so that wouldn't suit this purpose.
I am not aware of any efficient way to get the ranking of a player. The dumb way is to do a query starting from the player with the highest point, move downward, keep incrementing your counter until you reach the target player. So for the user with lowest point, you might end up scanning the whole range.
That being said, you can still get the top 100 player with no problem (Leaders). Just do a query starting from the player with the highest point, and set the query limit to 100.
Also, for a given player, you can get 100 players around him with similar points. You just need do two queries like:
query with hashkey="" and rangekey <= his point, limit 50
query with hashkey="" and rangekey >= his point, limit 50
This was the exact same problem we were facing when we were developing our app. Following are two solutions we had come with to deal with this problem:
Query your index with scanIndex->false that will give you all top players (assuming your score/points key in range) with limit 1000. Then applying this mathematical formula y = mx+b where you can take 2 iteration, mostly 1 and last value to find out m and b, x-points, and y-rank. Based on this you will get the rank if you have user's points (this will not be exact rank value it would be approximate, google does the same if we search some thing in our mail it show
and not exact value in first call.
Get all the records and store it in cache until the next update. This is by far the best and less expensive thing we are using.
The beauty of DynamoDB is that it is highly optimized for very specific (and common) use cases. The cost of this optimization is that many other use cases cannot be achieved as easily as with other databases. Unfortunately yours is one of them. That being said, there are perfectly valid and good ways to do this with DynamoDB. I happen to have built an application that has the same requirement as yours.
What you can do is enable DynamoDB Streams on your table and process item update events with a Lambda function. Every time the number of points for a user changes you re-compute their rank and update your item. Even if you use the same scan operation to re-compute the rank, this is still much better, because it moves the bulk of the cost from your read operation to your write operation, which is kind of the point of NoSQL in the first place. This approach also keeps your point updates fast and eventually consistent (the rank will not update immediately, but is guaranteed to update properly unless there's an issue with your Lambda function).
I recommend to go with this approach and once you reach scale optimize by caching your users by rank in something like Redis, unless you have prior experience with it and can set this up quickly. Pick whatever is simplest first. If you are concerned about your leaderboard changing too often, you can reduce the cost by only re-computing the ranks of first, say, 100 users and schedule another Lambda function to run every several minutes, scan all users and update their ranks all at the same time.

Access 2010 Calculated Field - Table Requires More Space Than Static Field

I've started using Access 2010 recently and started testing some of the new features, namely the Calculated Field datatype.
I had hoped that this was something that based on a formula (expression builder) would remove an amount of data and shrink an ACCDB file because Access only has the formula not actual data.
However, my new version of the file seems to be larger than the original which IMHO makes the feature a bit useless.
I've searched the interweb regarding the feature and can only really find people who show how to create one rather than any pros and cons about the feature.
As it stands I'm going to go back to the old method of calculations in a query but before I do I thought I'd ask on StackOverflow just in case anybody has used it.
Access stores the results of calculated fields for each record, so yes, that will increase the size of the database. However your claim that this "makes the feature a bit useless" misses the point:
The primary advantage of using calculated fields is that the calculation (expression) is defined once, at the table level. Once the calculated field has been defined it can simply be used much like any other field in queries, reports, etc..
Sure, you can "go back to the old method of calculations in a query" if that suits your purposes, but it also means that
You will have to repeat the (same) calculation logic in all of your queries.
If the calculation logic ever changes then you'll have to go back and edit all of those queries.
Every time you run one of those queries it will have to re-do the calculation for every record, instead of simply retrieving the calculated field from the table.

Strategy for handling variable time queries?

I have a typical scenario that I'm struggling with from a performance standpoint. The user selects a value from a dropdown and clicks a button. A stored procedure takes that value as an input parameter, executes, and returns the results to a grid. For just one of the values ('All'), the query runs for roughly 2.5 minutes. For the rest of the values the query runs less than 1ms.
Obviously, having the user wait for 2.5 minutes just isn't going to fly. So, what are some typical strategies to handle this?
Some of my own thoughts:
New table that stores the information for the 'All' value and is generated nightly
Cache the data on the caching server
Any help is appreciated.
Thanks!
Update
A little bit more info:
sp returns two result sets. The first is a group by rollup summary and the second is the first result set, disaggregated (roughly 80,000 rows).
I would first look at if your have the proper indexes in place. Using the Query Analyzer and the Database Tuning Assistant is a simple and often effective way of seeing what indexes might help.
If you still have performance problems after creating the appropriate indexes you might then look at adding tables/views to speed things up. If your query does a lot of joins you might consider creating an indexed view that allows you to do a select with no joins on the denormalized data. Since indexed views are persisted you can see big gains from their use.
You can read up on indexed views here:
http://msdn.microsoft.com/en-us/library/dd171921%28v=sql.100%29.aspx
and read about the database tuning adviser here:
http://msdn.microsoft.com/en-us/library/ms166575.aspx
Also, how many records does "All" return? I have seen people get hung up on the "All" scenario before, but if it returns 1 million records or something then the data is not usable to a person anyways...
Caching data is a good thing, but.... if the SP is inherently flawed, then you might want to actually fix it instead of trying to bandage it with caching.
You might also want to (since you didn't mention here) look at the number of rows "All" returns compared to the other selections and think about your indexes.
Also in your SP does the "All" cause it to run a different sets of tsql as in maybe a case or an if... or is it running the same code just with a different "WHERE"?
It might simply be that "ALL" just returns A LOT of records. You may want to implement paging and partial dataset return using ajax... (kinda like return the first 1000 records early so that it can be displayed and also show a throbber on the screen while the rest of the dataset is returned)
These are all options... if the number of records really isnt that different between ALL and the others... then it probably has something to do with the query/index/program flow.

How to handle large amounts of data for a web statistics module

I'm developing a statistics module for my website that will help me measure conversion rates, and other interesting data.
The mechanism I use is - to store a database entry in a statistics table - each time a user enters a specific zone in my DB (I avoid duplicate records with the help of cookies).
For example, I have the following zones:
Website - a general zone used to count unique users as I stopped trusting Google Analytics lately.
Category - self descriptive.
Minisite - self descriptive.
Product Image - whenever user sees a product and the lead submission form.
Problem is after a month, my statistics table is packed with a lot of rows, and the ASP.NET pages I wrote to parse the data load really slow.
I thought maybe writing a service that will somehow parse the data, but I can't see any way to do that without losing flexibility.
My questions:
How large scale data parsing applications - like Google Analytics load the data so fast?
What is the best way for me to do it?
Maybe my DB design is wrong and I should store the data in only one table?
Thanks for anyone that helps,
Eytan.
The basic approach you're looking for is called aggregation.
You are interested in certain function calculated over your data and instead of calculating the data "online" when starting up the displaying website, you calculate them offline, either via a batch process in the night or incrementally when the log record is written.
A simple enhancement would be to store counts per user/session, instead of storing every hit and counting them. That would reduce your analytic processing requirements by a factor in the order of the hits per session. Of course it would increase processing costs when inserting log entries.
Another kind of aggregation is called online analytical processing, which only aggregates along some dimensions of your data and lets users aggregate the other dimensions in a browsing mode. This trades off performance, storage and flexibility.
It seems like you could do well by using two databases. One is for transactional data and it handles all of the INSERT statements. The other is for reporting and handles all of your query requests.
You can index the snot out of the reporting database, and/or denormalize the data so fewer joins are used in the queries. Periodically export data from the transaction database to the reporting database. This act will improve the reporting response time along with the aggregation ideas mentioned earlier.
Another trick to know is partitioning. Look up how that's done in the database of your choice - but basically the idea is that you tell your database to keep a table partitioned into several subtables, each with an identical definition, based on some value.
In your case, what is very useful is "range partitioning" -- choosing the partition based on a range into which a value falls into. If you partition by date range, you can create separate sub-tables for each week (or each day, or each month -- depends on how you use your data and how much of it there is).
This means that if you specify a date range when you issue a query, the data that is outside that range will not even be considered; that can lead to very significant time savings, even better than an index (an index has to consider every row, so it will grow with your data; a partition is one per day).
This makes both online queries (ones issued when you hit your ASP page), and the aggregation queries you use to pre-calculate necessary statistics, much faster.

Resources