Counting page views with Aerospike

Over the last few days I haven't been able to find a good schema for counting page views as a time series in Aerospike.
My objective is to have a graph that I can filter by date with hour granularity, giving me the page counts.
From my investigation I believe a correct approach would be to use a Large Stack. But how? Would each stack entry be an hour, or a page hit?
Any suggestions?

You can use an llist, which supports a filter; you can filter on the desired date range (stored as an integer), with each entry in the llist representing one hour. However, updating the llist on every page hit is probably not the best approach. I suggest maintaining a non-LDT bin (a simple integer) in the same record to use as the counter for the current hour. Once the hour is complete, move that value into the llist and reset the counter. This keeps the per-hit page counting fast, and you can do your graphing from the llist.
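As a rough sketch of that pattern using the Aerospike Python client (namespace, set and bin names here are just placeholders, not from the question):

```python
import aerospike

# Connect to a local node; adjust hosts for your cluster.
client = aerospike.client({'hosts': [('127.0.0.1', 3000)]}).connect()

key = ('test', 'pageviews', 'home')  # one record per page

def on_page_hit():
    # Atomic server-side increment of a plain integer bin -- cheap enough per hit.
    client.increment(key, 'cur_hour', 1)

def roll_up_hour(hour_start_epoch):
    # Run once per hour (cron/scheduler): read the counter, store it under the
    # finished hour, and reset it. The hourly value is what you would append to
    # the llist (or a plain map bin keyed by hour) for graphing. Note this
    # read/write pair is not race-free; a UDF or generation check would make it atomic.
    _, _, bins = client.get(key)
    count = bins.get('cur_hour', 0)
    client.put(key, {'hour_%d' % hour_start_epoch: count, 'cur_hour': 0})
```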

The Python client supports an increment function for maintaining counters; check whether your client does as well. You can use an LSTACK for the time series, and you can easily get the count by asking your client for the stack size.

Related

How to ingest historical data with proper creation time?

When ingesting historical data, we would like it to become consistent with streamed data with respect to caching and retention, hence we need to set proper creation time on the data extents.
The options I found:
creationTime ingestion property,
with(creationTime='...') query ingestion property,
creationTimePattern parameter of Lightingest.
All options seem to have very limited usability as they require manual work or scripting to populate creationTime with some granularity based on the ingested data.
If the "virtual" ingestion time can be extracted from the data in the form of a datetime column, or otherwise inherited (e.g. based on an integer ID), is it possible to instruct the engine to set the creation time from an expression over the data row?
If such a feature is missing, what could be other handy alternatives?
creationTime is a tag on an extent/shard.
The idea is to be able to effectively identify and drop / cool data at the end of the retention time.
In this context, your suggested capability raises some serious issues.
If all records have the same date, no problem, we can use this date as our tag.
If we have different dates, but they span a short period, we might decide to take the min / avg / max date.
However -
What is the behavior you would expect for a file that contains dates spanning a long period?
Fail the ingestion?
Use the current time as the creationTime?
Use the min / avg / max date, although they clearly don't fit the data well?
Park the records in a temp store until (if ever) we get enough records with similar dates to create the batches?
Scripting seems the most reasonable way to go here.
If your files are indeed homogeneous by their records' dates, then you don't need to scan all records; just read the first record and use its date.
If the dates are heterogeneous, then we are back at the scenario described in the "However" part.
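A minimal sketch of that scripting approach, assuming CSV files that are homogeneous by date (the date column, format, table name and blob URI are placeholders); the resulting control command uses Kusto's creationTime ingestion property and would be run with your usual Kusto client:

```python
import csv
from datetime import datetime

def creation_time_for(path, date_column=0, date_format="%Y-%m-%dT%H:%M:%S"):
    # Files are assumed homogeneous by date, so the first record is representative.
    with open(path, newline="") as f:
        first = next(csv.reader(f))
    return datetime.strptime(first[date_column], date_format)

def build_ingest_command(table, blob_uri, local_path):
    ct = creation_time_for(local_path).strftime("%Y-%m-%dT%H:%M:%SZ")
    return (f".ingest into table {table} ('{blob_uri}') "
            f"with (creationTime='{ct}')")

print(build_ingest_command("MyTable",
                           "https://acct.blob.core.windows.net/c/file.csv",
                           "file.csv"))
```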

How to retrieve a row's position within a DynamoDB global secondary index and the total?

I'm implementing a leaderboard backed by DynamoDB and its Global Secondary Indexes, as described in the developer guide: http://docs.aws.amazon.com/amazondynamodb/latest/developerguide/GSI.html
But two things that are essential for a leaderboard are your position within it and the total number of entries, so you can show #1 of 2000, or similar.
With the index, the rows are sorted the right way, and I'd assume these calls would be cheap enough to make, but I haven't yet been able to find out how to do this from the docs. I really hope I don't have to fetch the entire table every single time to know where a person is positioned in it, or the count of the entire table (although if that's not available, it could be delayed, calculated and stored outside of the table at scheduled intervals).
I know DescribeTable gives you information about the entire table, but I would be applying filters to the range key, so that wouldn't suit this purpose.
I am not aware of any efficient way to get the ranking of a player. The dumb way is to do a query starting from the player with the highest points, move downward, and keep incrementing a counter until you reach the target player. So for the user with the lowest points, you might end up scanning the whole range.
That being said, you can still get the top 100 players with no problem (the leaders). Just do a query starting from the player with the highest points and set the query limit to 100.
Also, for a given player, you can get the 100 players around him with similar points. You just need to do two queries:
query with hashkey="" and rangekey <= his point, limit 50
query with hashkey="" and rangekey >= his point, limit 50
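A hedged boto3 sketch of those queries (table, index, key and attribute names are placeholders):

```python
import boto3
from boto3.dynamodb.conditions import Key

table = boto3.resource('dynamodb').Table('leaderboard')

def top_players(game_id, n=100):
    # Leaders: walk the GSI from the highest score downward.
    return table.query(
        IndexName='score-index',
        KeyConditionExpression=Key('game').eq(game_id),
        ScanIndexForward=False,   # descending by range key (score)
        Limit=n,
    )['Items']

def players_around(game_id, points, window=50):
    # Two queries around a given score; the player may appear in both halves.
    below = table.query(
        IndexName='score-index',
        KeyConditionExpression=Key('game').eq(game_id) & Key('score').lte(points),
        ScanIndexForward=False,
        Limit=window,
    )['Items']
    above = table.query(
        IndexName='score-index',
        KeyConditionExpression=Key('game').eq(game_id) & Key('score').gte(points),
        ScanIndexForward=True,
        Limit=window,
    )['Items']
    return above[::-1] + below    # highest score first
```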
This was the exact same problem we were facing when we were developing our app. Following are two solutions we had come with to deal with this problem:
Query your index with ScanIndexForward=false, which will give you the top players (assuming your score/points attribute is the range key), with a limit of 1000. Then apply the formula y = mx + b, taking two samples (typically the first and last values) to find m and b, with x as points and y as rank. From that you can estimate a user's rank from their points. It will not be an exact rank, only an approximation; Google does the same thing when you search your mail, showing an approximate count rather than an exact value on the first call (see the sketch below).
Get all the records and store them in a cache until the next update. This is by far the best and least expensive approach we are using.
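For illustration, the y = mx + b idea above might look like this (a sketch; the sample points are made up):

```python
def fit_rank_line(points_a, rank_a, points_b, rank_b):
    # Fit rank = m * points + b through two sampled (points, rank) pairs,
    # e.g. the first and last rows of the top-1000 query.
    m = (rank_b - rank_a) / (points_b - points_a)
    b = rank_a - m * points_a
    return m, b

def approximate_rank(points, m, b):
    return max(1, round(m * points + b))

# Example with made-up samples: rank 1 at 9800 points, rank 1000 at 150 points.
m, b = fit_rank_line(9800, 1, 150, 1000)
print(approximate_rank(5000, m, b))  # approximate rank for a 5000-point player
```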
The beauty of DynamoDB is that it is highly optimized for very specific (and common) use cases. The cost of this optimization is that many other use cases cannot be achieved as easily as with other databases. Unfortunately yours is one of them. That being said, there are perfectly valid and good ways to do this with DynamoDB. I happen to have built an application that has the same requirement as yours.
What you can do is enable DynamoDB Streams on your table and process item update events with a Lambda function. Every time the number of points for a user changes you re-compute their rank and update your item. Even if you use the same scan operation to re-compute the rank, this is still much better, because it moves the bulk of the cost from your read operation to your write operation, which is kind of the point of NoSQL in the first place. This approach also keeps your point updates fast and eventually consistent (the rank will not update immediately, but is guaranteed to update properly unless there's an issue with your Lambda function).
I recommend going with this approach, and once you reach scale, optimizing by caching your users by rank in something like Redis (unless you already have experience with it and can set that up quickly). Pick whatever is simplest first. If you are concerned about your leaderboard changing too often, you can reduce the cost by only re-computing the ranks of, say, the first 100 users, and scheduling another Lambda function to run every few minutes that scans all users and updates their ranks in one pass.
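A rough sketch of such a Streams-triggered Lambda (assuming the stream includes old and new images; table, index, key and attribute names are placeholders, and for a large leaderboard the COUNT query would need to follow LastEvaluatedKey pagination):

```python
import boto3
from boto3.dynamodb.conditions import Key

table = boto3.resource('dynamodb').Table('leaderboard')

def handler(event, context):
    for record in event['Records']:
        if record['eventName'] not in ('INSERT', 'MODIFY'):
            continue
        new_image = record['dynamodb']['NewImage']   # DynamoDB JSON format
        old_image = record['dynamodb'].get('OldImage', {})
        if old_image.get('score', {}).get('N') == new_image['score']['N']:
            continue  # points unchanged (e.g. our own rank write-back); skip

        user_id = new_image['user_id']['S']
        points = int(new_image['score']['N'])

        # Count players strictly ahead; Select='COUNT' avoids fetching items.
        ahead = table.query(
            IndexName='score-index',
            KeyConditionExpression=Key('game').eq('default') & Key('score').gt(points),
            Select='COUNT',
        )['Count']

        # Store the recomputed rank back on the user's item.
        table.update_item(
            Key={'user_id': user_id},
            UpdateExpression='SET #r = :r',
            ExpressionAttributeNames={'#r': 'rank'},
            ExpressionAttributeValues={':r': ahead + 1},
        )
```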

Where and how often to calculate ranking for posts in ASP.NET Website

I have an algorithm to calculate the ranking for posts on a website (it uses votes, views, comments and lifespan). I am going to be using a shared hosting provider, so I am thinking that I would start a thread in the Application_Start method of Global.asax.
Is this the only/best way to do it?
How often should I calculate the ranking? (the result will be stored in the same db table as the post.)
Would you use Thread.Sleep(T) to make the calculation happen every T?
Why do you need to calculate lifespan? It can be computed on the fly from the start or end date in the UI.
For the rest you can use triggers in your DB: every new post, view and so on can update the corresponding counter.
Another way would be to use Application_BeginRequest in Global.asax and do the calculation every XX minutes :).
This of course all depends on how your ranking data is stored and how much data there is. For a smaller site it could be enough to calculate the rating on the fly when someone opens a page with ranked content, and cache it for the next user.
In short, it all depends on the data you have to process; the more information you post, the more we can help.
At the moment we can mostly just guess.

Solr search for a time within a time range

I'm aware that Solr provides a date field which can store a time instance and then range queries can be performed to match all documents which have that field within a particular range.
My problem is the inverse of this. I need to associate multiple time ranges with documents and then search for all documents which have the searched time within one of those ranges.
For example, I'm indexing outlets and have 3-4 ranges during which each outlet is open. I need to search for all outlets which are open at a particular time instance.
One way of doing this is to index the start time and end time of each duration as separate date fields and compare them at search time, like:
(time1_1 <= t AND time1_2 >= t) OR (time2_1 <= t AND time2_2 >= t) OR (time3_1 <= t AND time3_2 >= t)
Is there a better/faster/cleaner way to do this?
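Concretely, the comparison above could be built as a Solr filter query from OR'ed range clauses (field names like open_1/close_1 are just placeholders):

```python
def open_now_fq(t="NOW", slots=3):
    # One (start <= t AND end >= t) clause per opening slot, OR'ed together.
    clauses = [
        "(open_{i}:[* TO {t}] AND close_{i}:[{t} TO *])".format(i=i, t=t)
        for i in range(1, slots + 1)
    ]
    return " OR ".join(clauses)

print(open_now_fq())
# (open_1:[* TO NOW] AND close_1:[NOW TO *]) OR (open_2:[* TO NOW] AND close_2:[NOW TO *]) OR ...
```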
From your example it looks like the entities of your index are the outlet stores, and you store their opening and closing times in separate (probably dynamic) fields.
If you're asking for a different approach, you have to consider restructuring the existing schema, or even creating an additional one that uses another entity.
It may seem unusual at first, but if this query is the most essential one for your app, then you should consider making the entity of your new index the thing you actually want to query: the particular time instance. I take it a time instance is either a whole day, or maybe half or a quarter of a day.
The schema would include fields like the ID, the start date of the day or half-day or whatever you choose, its end, and a multivalued list of ids that point to the outlets (which stay in your current index; use a multi-core setup).
Even if you choose quarter days to handle morning, afternoon and night hours separately, and even if you create them several years in advance, the data should not explode.
This different schema setup allows you to:
do the most important computation during import so that it is easily accessible when querying,
run a simple query that returns what you seek in one hit.
You could even forgo Date fields by using a custom way to identify the ranges. I am thinking of creating the identifier from the date and a string that indicates whether it is morning or afternoon etc. This would be used as the unique ID in SOLR. If you can create such an ID from any "time instance" that is queried you'd end up with a simple ID lookup.
e.g.
What is open on 2013/03/03 in the morning?
/solr/openhours/select?q=id:2013_03_03_am
returns:
Array of outlet ids.
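A small sketch of that lookup (the 'openhours' core is from the example above; the am/pm bucketing and the 'outlet_ids' field name are assumptions):

```python
from datetime import datetime
import requests

def bucket_id(dt):
    # Build the unique ID for the half-day bucket containing dt.
    half = 'am' if dt.hour < 12 else 'pm'
    return dt.strftime('%Y_%m_%d_') + half

def open_outlets(dt):
    resp = requests.get(
        'http://localhost:8983/solr/openhours/select',
        params={'q': 'id:' + bucket_id(dt), 'wt': 'json'},
    )
    docs = resp.json()['response']['docs']
    # Each bucket document carries a multivalued field of outlet ids.
    return docs[0].get('outlet_ids', []) if docs else []

print(open_outlets(datetime(2013, 3, 3, 9, 0)))  # morning of 2013/03/03
```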

How to handle large amounts of data for a web statistics module

I'm developing a statistics module for my website that will help me measure conversion rates, and other interesting data.
The mechanism I use is to store an entry in a statistics table each time a user enters a specific zone in my DB (I avoid duplicate records with the help of cookies).
For example, I have the following zones:
Website - a general zone used to count unique users as I stopped trusting Google Analytics lately.
Category - self descriptive.
Minisite - self descriptive.
Product Image - whenever a user sees a product and the lead submission form.
The problem is that after a month my statistics table is packed with a lot of rows, and the ASP.NET pages I wrote to parse the data load really slowly.
I thought about writing a service that would somehow parse the data, but I can't see any way to do that without losing flexibility.
My questions:
How do large-scale data-parsing applications like Google Analytics load the data so fast?
What is the best way for me to do it?
Maybe my DB design is wrong and I should store the data in only one table?
Thanks to anyone who helps,
Eytan.
The basic approach you're looking for is called aggregation.
You are interested in certain functions calculated over your data, and instead of computing them "online" when the reporting page loads, you compute them offline, either via a nightly batch process or incrementally when each log record is written.
A simple enhancement would be to store counts per user/session, instead of storing every hit and counting them. That would reduce your analytic processing requirements by a factor in the order of the hits per session. Of course it would increase processing costs when inserting log entries.
Another kind of aggregation is called online analytical processing, which only aggregates along some dimensions of your data and lets users aggregate the other dimensions in a browsing mode. This trades off performance, storage and flexibility.
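As a small illustration of that kind of roll-up (shown in Python with sqlite3 for brevity, since your exact stack is ASP.NET plus whatever database you use; table and column names are assumptions):

```python
import sqlite3

def roll_up(conn, since):
    # Aggregate raw hits newer than `since` (e.g. the last run time) into
    # hourly buckets per zone; the reporting pages then read hits_hourly
    # instead of scanning every raw row.
    conn.execute("""
        INSERT INTO hits_hourly (zone, hour, hit_count)
        SELECT zone, strftime('%Y-%m-%d %H:00', created_at), COUNT(*)
        FROM hits_raw
        WHERE created_at >= ?
        GROUP BY zone, strftime('%Y-%m-%d %H:00', created_at)
    """, (since,))
    conn.commit()
```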
It seems like you could do well by using two databases. One is for transactional data and it handles all of the INSERT statements. The other is for reporting and handles all of your query requests.
You can index the snot out of the reporting database, and/or denormalize the data so fewer joins are used in the queries. Periodically export data from the transaction database to the reporting database. This act will improve the reporting response time along with the aggregation ideas mentioned earlier.
Another trick to know is partitioning. Look up how that's done in the database of your choice - but basically the idea is that you tell your database to keep a table partitioned into several subtables, each with an identical definition, based on some value.
In your case, what is very useful is "range partitioning" -- choosing the partition based on the range into which a value falls. If you partition by date range, you can create separate sub-tables for each week (or each day, or each month -- it depends on how you use your data and how much of it there is).
This means that if you specify a date range when you issue a query, the data outside that range will not even be considered; that can lead to very significant time savings, even better than an index (an index has to cover every row, so it grows with your data, while a partition covers just one day).
This makes both online queries (ones issued when you hit your ASP page), and the aggregation queries you use to pre-calculate necessary statistics, much faster.
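As one hedged example, here is what monthly range partitioning could look like in MySQL (syntax differs per database, and the table layout is an assumption); the DDL is kept in a Python string so you can run it with whatever driver you use:

```python
# MySQL requires the partitioning column to be part of every unique key,
# hence the composite primary key below.
MONTHLY_PARTITIONED_HITS = """
CREATE TABLE hits_raw (
    id         BIGINT NOT NULL AUTO_INCREMENT,
    zone       VARCHAR(32) NOT NULL,
    created_at DATETIME NOT NULL,
    PRIMARY KEY (id, created_at)
)
PARTITION BY RANGE (TO_DAYS(created_at)) (
    PARTITION p2013_01 VALUES LESS THAN (TO_DAYS('2013-02-01')),
    PARTITION p2013_02 VALUES LESS THAN (TO_DAYS('2013-03-01')),
    PARTITION pmax     VALUES LESS THAN MAXVALUE
)
"""
# A query filtered on created_at (e.g. WHERE created_at >= '2013-02-01')
# then only touches the matching partition(s) instead of the whole table.
```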
