Aggregate Many Firebase Datasets in BigQuery into a Single Dataset

For starters, I am relatively new to Firebase and BigQuery...so a bit naive...
Background:
I am working on a project where we have many distinct mobile apps - one for each country, for each OS (iOS, Android) - with all of the raw event data getting pushed into BigQuery datasets that live in the same project. The rationale behind having distinct apps by country relates to regulatory requirements.
For dashboarding purposes, I had hoped to be able to combine all datasets into a single aggregate dataset (with tables by year) that is partitioned by date via a "partition_date" column that I added. My goal has been to take advantage of the ability to use BigQuery nested structures, and to maintain the original nested fields from the raw Firebase tables.
I've been searching around over the past month or so, but I haven't come across any other use cases where people are dealing with many datasets.
Questions:
Does it make sense to take this approach (aggregate all datasets into a table partitioned by date, and a nested field based on app name)?
As an alternative - I tried denormalizing the tables, with tables sharded by date...though the resulting tables (not surprisingly) are much larger due to flattening the nested structures, which makes me think this approach is less than ideal. Should I actually consider this approach?
Should I consider just making a relatively simple aggregate table of KPIs? I've avoided this approach because it limits my ability to use date range filters that dynamically calculate unique users on the fly (using COUNT_DISTINCT(user_dim.app_info.app_instance_id) in Data Studio)
The end goal is to empower any user (e.g. someone who doesn't know SQL) to be able to answer simple questions across all apps (e.g. how many users opened the app yesterday), and to let end users make use of date range filters in the dashboard interface.
I've been able to write ad hoc queries to get at the answers by querying across all datasets, but I have not found a good solution that will make it easier for non-technical users within the dashboard.
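For example, an ad hoc query along these lines works for me (the dataset names below are made up, and this assumes the older user_dim export schema; I run it through the BigQuery Python client):

```python
# A minimal sketch of the kind of ad hoc cross-dataset query I mean. The project and
# dataset names (app_us_ios, app_de_android) and the app_events_* / user_dim schema
# are placeholders for the per-country Firebase exports.
from google.cloud import bigquery

client = bigquery.Client(project="my-analytics-project")  # hypothetical project id

sql = """
SELECT app, COUNT(DISTINCT app_instance_id) AS daily_users
FROM (
  SELECT 'us_ios' AS app, user_dim.app_info.app_instance_id AS app_instance_id
  FROM `my-analytics-project.app_us_ios.app_events_*`
  WHERE _TABLE_SUFFIX = FORMAT_DATE('%Y%m%d', DATE_SUB(CURRENT_DATE(), INTERVAL 1 DAY))
  UNION ALL
  SELECT 'de_android', user_dim.app_info.app_instance_id
  FROM `my-analytics-project.app_de_android.app_events_*`
  WHERE _TABLE_SUFFIX = FORMAT_DATE('%Y%m%d', DATE_SUB(CURRENT_DATE(), INTERVAL 1 DAY))
)
GROUP BY app
"""

for row in client.query(sql).result():
    print(row.app, row.daily_users)
```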
Also, this is my first Stack Overflow question...please let me know if I am being too vague, including too many questions, or otherwise abusing the platform.
Thanks in advance for any thoughts.

If your end goal is to empower users to answer simple questions, aggregating KPI data makes sense to me. I would probably approach this by setting up a scheduled daily job that selects the relevant data from all the datasets and loads it into a new dataset, which can then be used in Data Studio. The new tables could carry the default Firebase date table suffix to support date range filters.
I am also relatively new to BigQuery and Firebase though, so maybe there is a better way.
You can find more information about scheduling in BigQuery here: Schedule query in BigQuery
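A rough sketch of what that daily job could look like with the BigQuery Python client. The project, dataset, and table names are placeholders, it assumes the per-app export tables share the same schema, and it assumes a recent google-cloud-bigquery client that accepts a table ID string as the query destination:

```python
# Hypothetical daily aggregation job: select yesterday's table from each per-country
# dataset and write the combined result into a date-suffixed table in one "all_apps"
# dataset, so Data Studio date-range filters keep working as with the raw exports.
import datetime
from google.cloud import bigquery

client = bigquery.Client(project="my-analytics-project")

yesterday = datetime.date.today() - datetime.timedelta(days=1)
suffix = yesterday.strftime("%Y%m%d")

sql = f"""
SELECT 'us_ios' AS app, user_dim, event_dim
FROM `my-analytics-project.app_us_ios.app_events_{suffix}`
UNION ALL
SELECT 'de_android' AS app, user_dim, event_dim
FROM `my-analytics-project.app_de_android.app_events_{suffix}`
"""

job_config = bigquery.QueryJobConfig(
    destination=f"my-analytics-project.all_apps.app_events_{suffix}",
    write_disposition=bigquery.WriteDisposition.WRITE_TRUNCATE,  # reruns stay idempotent
)
client.query(sql, job_config=job_config).result()  # wait for completion
```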

Related

DynamoDB usable for largeish event table?

I'm thinking of re-architecting an RDS model to a DynamoDB one, and it mostly appears to be working using a single-table design. We have, however, a log table that can contain 5-10 million rows that are queried on many attributes.
Is there any pattern that might be applicable in migrating to DynamoDB or is this a case where full scans would be required and we would just be better off keeping the log stuff as a relational table?
Thanks in advance,
Nik
Those keywords and phrases "log" and "queried on many attributes" sound to me like DynamoDB is not the best solution for your log data. If the number of distinct queries is fairly limited and well-known in advance, you might be able to design your keys to fit your access patterns.
For example, if you commonly query on Color and Quantity attributes, you could design a key like COLOR#Red#QTY#25. And you could similarly use local or global secondary indexes for queries involving other attributes.
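A quick boto3 sketch of that composite-key idea (the table name, the pk/sk attribute names, and the values are all hypothetical):

```python
# Hypothetical sketch of encoding the most-queried attributes into the sort key,
# then querying by prefix. Nothing here is an existing schema.
import boto3
from boto3.dynamodb.conditions import Key

dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("app_logs")  # placeholder table name

# Write an item whose sort key encodes the attributes you query on most.
table.put_item(Item={
    "pk": "TENANT#acme",
    "sk": "COLOR#Red#QTY#25",
    "created_at": "2021-06-01T12:00:00Z",
    "payload": "...",
})

# Fetch every red item for the tenant, regardless of quantity, with a prefix query.
resp = table.query(
    KeyConditionExpression=Key("pk").eq("TENANT#acme") & Key("sk").begins_with("COLOR#Red#")
)
for item in resp["Items"]:
    print(item["sk"])
```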
But it is not a great solution if you have many attributes that you need to query arbitrarily.
Alternative Solution: Another serverless option to consider is storing your log data in S3 and using Athena to query it using SQL.
You will likely be trading away a bit of latency and speed by taking this approach compared to RDS and DynamoDB. But queries against log data often don't need millisecond response times, so it can cover a lot of use cases.
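A rough sketch of that route with boto3. The database, table, and S3 locations are placeholders, and it assumes your log files are already laid out in S3 and registered as an Athena table (e.g. via a Glue crawler):

```python
# Illustrative Athena-over-S3 query for log data. logs_db.app_logs and the S3 paths
# are made up; polling like this is fine for ad hoc use.
import time
import boto3

athena = boto3.client("athena")

query = """
SELECT color, COUNT(*) AS n
FROM logs_db.app_logs
WHERE event_date >= date '2021-06-01'
GROUP BY color
ORDER BY n DESC
"""

execution = athena.start_query_execution(
    QueryString=query,
    ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},  # placeholder bucket
)
qid = execution["QueryExecutionId"]

# Poll until the query finishes.
while True:
    state = athena.get_query_execution(QueryExecutionId=qid)["QueryExecution"]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(1)

if state == "SUCCEEDED":
    rows = athena.get_query_results(QueryExecutionId=qid)["ResultSet"]["Rows"]
    print(rows)
```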
Data modelling for DynamoDB
Write down all of your access patterns, in order of priority/most used
Research models which are similar to your use-case
Download NoSQL Workbench and create test models where you can visualize your ideas
Run commands against DynamoDB Local and test that your access patterns are fulfilled.
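For that last step, pointing boto3 at DynamoDB Local is enough to try out a candidate key schema; the endpoint, region, and dummy credentials below are the usual local-testing placeholders:

```python
# Exercise access patterns against DynamoDB Local before touching AWS.
import boto3

dynamodb = boto3.resource(
    "dynamodb",
    endpoint_url="http://localhost:8000",  # DynamoDB Local's default port
    region_name="eu-west-1",
    aws_access_key_id="dummy",
    aws_secret_access_key="dummy",
)

table = dynamodb.create_table(
    TableName="access_pattern_test",
    KeySchema=[
        {"AttributeName": "pk", "KeyType": "HASH"},
        {"AttributeName": "sk", "KeyType": "RANGE"},
    ],
    AttributeDefinitions=[
        {"AttributeName": "pk", "AttributeType": "S"},
        {"AttributeName": "sk", "AttributeType": "S"},
    ],
    BillingMode="PAY_PER_REQUEST",
)
table.wait_until_exists()
print(table.item_count)  # 0 on a fresh table; now run your planned queries against it
```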
Access Patterns
Your access patterns will ultimately decide if DynamoDB will suit your needs. If you need to query based on multiple fields you can have up to 20 Global Secondary Indexes which will give you some flexibility, but usually if you exceed 8-10 indexes then DynamoDB may not be a good choice or the schema is badly designed.
Use smart designs with sort-key and index-key overloading; this will allow you to group the data better and make your access patterns more efficient.
Log Data Use-case
Storing log data is a pretty common use-case for DynamoDB, and many AWS customers use it for that sole purpose. But I can't overemphasize the importance of understanding your access patterns and working backwards from them to create your model.
Alternatives
If you require flexible query capability or free-text search, then you could use DynamoDB's integration with OpenSearch (via Lambda/EventBridge, for example), with OpenSearch providing the flexibility for your queries.
Doesn't seem like a good use case. I have done it and wasn't at all happy with the result; now I load log-like data into Elasticsearch and am much happier with the outcome.
In my case, I insert the data into DynamoDB to archive it, but I also feed the data into Elasticsearch; once in a while, if I kill my Elasticsearch cluster, I can reload all or some of the data from DynamoDB.

How to find which kinds are not being used in Google Datastore

Is there any way to list the kinds that are not being used in Google's Datastore by our App Engine app, without having to look into our code and/or logic? :)
I'm not talking about indexes, which I can list by issuing
gcloud datastore indexes list
and then compare with the datastore-indexes.xml or index.yaml.
I tried to check datastore kinds statistics and other metadata but I could not find anything useful to help me on this matter.
Should I give up on finding ways for Datastore to provide useful stats, and instead write something that keeps collecting Datastore statistics (like data size) over a long period, to get at least a clue of which kinds are not being used, and then, only after this research, look into our app code to see whether the kind's model was removed?
Example:
select bytes from __Stat_Kind__
Store it somewhere and keep updating it for a period. If the kind's byte size does not change, then probably the kind is not being used anymore.
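Something like this, with the Python client (the property names kind_name, bytes, and count follow the documented statistics entities; where to store the snapshots is the open question):

```python
# Rough sketch of the "keep collecting statistics" idea: read the built-in
# __Stat_Kind__ entities periodically and record kind sizes over time, so kinds whose
# size never changes stand out.
import datetime
from google.cloud import datastore

client = datastore.Client()

snapshot_time = datetime.datetime.utcnow().isoformat()
for stat in client.query(kind="__Stat_Kind__").fetch():
    print(snapshot_time, stat["kind_name"], stat["bytes"], stat["count"])
    # e.g. append to a CSV / BigQuery table / another kind, then diff runs over weeks
```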
The idea is to do some cleaning in datastore.
I would like to find which kinds are not being used anymore, maybe for a long time, or were created manually to be used once... You know, like a table in Oracle that no one knows what it is used for, and if we look into that table's statistics we see that it was only used once, 5 years ago. I'm trying to achieve the same in Datastore: I want to know which kinds are not being used anymore, or were last used a while ago, then ask around and back them up/delete them if no owner is found.
It's an interesting question.
I think you would be best placed to audit your code and instill an organizational practice that requires this documentation to be maintained in future as a business/technical pre-production requirement.
IIRC, Datastore doesn't automatically timestamp Entities and keys (rightly) aren't incremental. So there appears no intrinsic mechanism to track changes short of taking a snapshot (expensive) and comparing your in-flight and backup copies for changes (also expensive and inconclusive).
One challenge with identifying a Kind that appears to be non-changing is that it could be referenced (rarely) by another Kind and so, while it does not change, it is required.
Auditing your code and documenting it for posterity should not only provide you with a definitive answer (and identify owners) but also pay off a significant technical debt that has been incurred, and avoid this problem and probably others (e.g. GDPR-like requirements) that will arise in the future.
Assuming you are referring to records being created/updated, then I can think of the following options
Via the Cloud Console (Datastore > Dashboard) - This lists all your Kinds and the number of records in each Kind. Theoretically, you can take a screenshot and compare the counts later, so that you know which ones have increased and which have not.
Use of Created/LastModified Date columns - I usually add these 2 columns to most of my Datastore tables. If you have them, then you can write a function that queries them. For example, you query each of your Kinds sorted in descending order of creation (or last-modified) date and pull only the first record from each one. This tells you the last time a record was created or modified in that Kind.
I would write this as a function in my app, put it behind a page that requires admin privileges (only the app creator can run it), and then just clicking a link in my app would give me the information.
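A rough sketch of option 2 with the Python Datastore client; the kind names and the last_modified property are placeholders for whatever your models actually use:

```python
# For each kind that carries a last-modified property, fetch just the newest entity.
from google.cloud import datastore

client = datastore.Client()

KINDS_WITH_TIMESTAMPS = ["Order", "Customer", "LegacyImport"]  # your kinds here

for kind in KINDS_WITH_TIMESTAMPS:
    query = client.query(kind=kind)
    query.order = ["-last_modified"]          # needs an index on last_modified
    newest = list(query.fetch(limit=1))
    if newest:
        print(kind, "last touched", newest[0]["last_modified"])
    else:
        print(kind, "has no entities at all")
```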

Large Product catalog with statistics - alternatives to Sql Server?

I am building UI for a large product catalog (millions of products).
I am using Sql Server, FreeText search and ASP.NET MVC.
Tables are normalized and indexed. Most queries take less than a second to return.
The issue is this. Let's say user does the search by keyword. On search results page I need to display/query for:
Display 20 matching products on the first page (paged, sorted)
Total count of matching products for paging
List of stores only of all matching products
List of brands only of all matching products
List of colors only of all matching products
Each query takes about 0.5 to 1 second. Altogether it is like 5 seconds.
I would like to get the whole page to load under 1 second.
There are several approaches:
Optimize queries even more. I already spent a lot of time on this one, so not sure it can be pushed further.
Load products first, then load the rest of the information using AJAX. More like a workaround. Will need to revise UI.
Re-organize data to be more Report friendly. Already aggregated a lot of fields.
I checked out several similar sites, for example zappos.com. Not only do they display the same information I would like in under 1 second, but they also include statistics (the number of results in each category).
The following is the search for keyword "white"
http://www.zappos.com/white
How do sites like zappos, amazon make their results, filters and stats appear almost instantly?
So you asked specifically "how does Zappos.com do this". Here is the answer from our Search team.
An alternative idea for your issue would be using a search index such as Solr. Basically, the way these work is that you load your data set into the system and it does a huge amount of indexing. My projects include product catalogs with 200+ data points for each of the 140k products, and the average return time is less than 20ms.
The search indexing system I would recommend is Solr, which is based on Lucene. Both of these projects are open source and free to use.
Solr fits your described use case perfectly in that it can actually do all of those things in one query. You can use facets (essentially GROUP BY in SQL) to return the list of distinct values for each field across all matching results. In the case of keywords, it also allows you to search across multiple fields in one query without performance degradation.
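As a rough illustration of that single-query point, here's a sketch with pysolr; the core URL and the name/brand/store/color/popularity field names are assumptions about your schema, not anything Solr gives you out of the box:

```python
# One faceted Solr query returns the first page of products, the total match count,
# and per-field value counts over ALL matches in a single round trip.
import pysolr

solr = pysolr.Solr("http://localhost:8983/solr/products", timeout=10)  # placeholder core

results = solr.search(
    "white",                      # keyword search across your indexed text fields
    **{
        "rows": 20,               # first page of products
        "sort": "popularity desc",
        "facet": "true",
        "facet.field": ["brand", "store", "color"],  # counts per value, over all matches
    },
)

print("total matches:", results.hits)          # drives paging
for doc in results:                            # the 20 products for page one
    print(doc.get("name"))
print(results.facets["facet_fields"]["brand"]) # e.g. ['Nike', 812, 'Adidas', 640, ...]
```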
You could try replacing your aggregate queries with materialized indexed views of those aggregates. This will pre-compute all the aggregates and will be as fast as selecting any regular row data.
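For what it's worth, a minimal sketch of one such indexed view (products per brand), executed from Python via pyodbc; the table, view, and column names are made up, and SQL Server requires WITH SCHEMABINDING plus COUNT_BIG(*) before the clustered index can be created:

```python
# Create an indexed (materialized) view of one aggregate. Each execute() is its own
# batch, which matters because CREATE VIEW must be the first statement in a batch.
import pyodbc

conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};SERVER=localhost;DATABASE=Catalog;Trusted_Connection=yes;",
    autocommit=True,
)
cur = conn.cursor()

cur.execute("""
CREATE VIEW dbo.vBrandCounts WITH SCHEMABINDING AS
SELECT BrandId, COUNT_BIG(*) AS ProductCount
FROM dbo.Products
GROUP BY BrandId;
""")

# The unique clustered index is what actually materializes the aggregate.
cur.execute("""
CREATE UNIQUE CLUSTERED INDEX IX_vBrandCounts ON dbo.vBrandCounts (BrandId);
""")
```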
0.5 sec is too long on appropriate hardware. I agree with Aaronaught: the first thing to do is to convert it into a single SQL statement, or possibly a stored procedure, to ensure it's compiled only once.
Analyze your queries to see if you can create even better indexes (consider covering indexes), fine tune existing indexes, employ partitioning.
Make sure you have an appropriate hardware configuration - data, log, temp, and even index files should be located on independent spindles. Make sure you have enough RAM and CPUs. I hope you are running a 64-bit platform.
After all this, if you still need more - analyze most used keywords and create aggregate result tables for top 10 keywords.
About Amazon - they most likely use superior hardware and also take advantage of CDNs. They also have thousands of servers serving up the content, and there are no single performance bottlenecks - data is duplicated multiple times across several data centers.
As a completely separate approach, you may want to look into "in-memory" databases such as CACHE - this is the fastest you can get on the DB side.

De-normalize live data for the sake of reports - Good or Bad?

What are the pros/cons of de-normalizing an enterprise application database because it will make writing reports easier?
Pro - designing reports in SSRS will probably be "easier" since no joins will be necessary.
Con - developing/maintaining the app to handle de-normalized data will become more difficult due to duplication of data and synchronization.
Others?
Denormalization for the sake of reports is Bad, m'kay.
Creating views, or a denormalized data warehouse is good.
Views have solved most of my reporting related needs. Data warehouses are great when users will be generating reports almost constantly or when your views start to slow down.
This is why you want to normalize your database
To free the collection of relations from undesirable insertion, update and deletion dependencies;
To reduce the need for restructuring the collection of relations as new types of data are introduced, and thus increase the life span of application programs;
To make the relational model more informative to users;
To make the collection of relations neutral to the query statistics, where these statistics are liable to change as time goes by.
—E.F. Codd, "Further Normalization of the Data Base Relational Model" via wikipedia
The only time you should consider denormalization is when the time it takes a report to generate is not acceptable. Denormalization will cause consistency issues that are sometimes impossible to track down, especially in large datasets.
Don't denormalize just to get rid of complexity in reporting; it can cause huge problems in the rest of the application. Either you don't enforce the rules, resulting in bad data, or if you do, then inserts, deletes, and updates can be seriously slowed for everyone, not just the two or three people who run reports.
If the reports truly can't run well, then create a data warehouse that is denormalized and populate it in a nightly or weekly feed. The kind of reports that typically need this do not generally care if the data is up-to-the minute as they are usually monthly, quarterly, or annual reports that process (and especially aggregate) large amounts of data after the fact.
You can do both: keep the normalized database for applications.
Then create a denormalized database for reports, and create an application which regularly copies data from one database to the other.
After all, reports don't always need the latest data; most of the time you can easily run an update on the reporting database every hour, or even only once a day.
Beyond the data warehouse and view solutions provided in other answers, which are good in some ways: if you are willing to sacrifice some performance to get up-to-the-last-second data but still want a normalized database, you could use a materialized view with fast refresh on commit on Oracle, or an indexed view (a view with a unique clustered index) on SQL Server.
Another con is that the data is likely not to be real-time, as it takes some time to move the data from the normalized form to the denormalized one. If someone wants the report to be up to the very second it was requested, that can be tough to do in this situation.
If this duplicates the synchronization point in the original post, sorry - I didn't quite see it that way.

How to handle large amounts of data for a web statistics module

I'm developing a statistics module for my website that will help me measure conversion rates, and other interesting data.
The mechanism I use is to store a database entry in a statistics table each time a user enters a specific zone (I avoid duplicate records with the help of cookies).
For example, I have the following zones:
Website - a general zone used to count unique users as I stopped trusting Google Analytics lately.
Category - self descriptive.
Minisite - self descriptive.
Product Image - whenever a user sees a product and the lead submission form.
The problem is that after a month, my statistics table is packed with a lot of rows, and the ASP.NET pages I wrote to parse the data load really slowly.
I thought about writing a service that would somehow parse the data, but I can't see any way to do that without losing flexibility.
My questions:
How do large-scale data-parsing applications like Google Analytics load the data so fast?
What is the best way for me to do it?
Maybe my DB design is wrong and I should store the data in only one table?
Thanks for anyone that helps,
Eytan.
The basic approach you're looking for is called aggregation.
You are interested in certain functions calculated over your data, and instead of calculating them "online" when the page that displays them is requested, you calculate them offline, either via a nightly batch process or incrementally whenever a log record is written.
A simple enhancement would be to store counts per user/session, instead of storing every hit and counting them. That would reduce your analytic processing requirements by a factor in the order of the hits per session. Of course it would increase processing costs when inserting log entries.
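A sketch of that per-session counting (the table and column names are invented, and the update-then-insert upsert is the simplest possible form; a production version would use MERGE or a unique constraint plus retry):

```python
# Instead of inserting one row per hit and counting later, bump a per-session,
# per-zone, per-day counter at write time.
import datetime
import pyodbc

conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};SERVER=localhost;DATABASE=Stats;Trusted_Connection=yes;"
)

def record_hit(session_id, zone, day=None):
    day = day or datetime.date.today()
    cur = conn.cursor()
    cur.execute(
        "UPDATE SessionZoneCounts SET Hits = Hits + 1 "
        "WHERE SessionId = ? AND Zone = ? AND Day = ?",
        session_id, zone, day,
    )
    if cur.rowcount == 0:  # first hit for this session/zone today
        cur.execute(
            "INSERT INTO SessionZoneCounts (SessionId, Zone, Day, Hits) VALUES (?, ?, ?, 1)",
            session_id, zone, day,
        )
    conn.commit()

record_hit("abc123-cookie-id", "ProductImage")
```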
Another kind of aggregation is called online analytical processing, which only aggregates along some dimensions of your data and lets users aggregate the other dimensions in a browsing mode. This trades off performance, storage and flexibility.
It seems like you could do well by using two databases. One is for transactional data and it handles all of the INSERT statements. The other is for reporting and handles all of your query requests.
You can index the snot out of the reporting database, and/or denormalize the data so fewer joins are used in the queries. Periodically export data from the transaction database to the reporting database. This act will improve the reporting response time along with the aggregation ideas mentioned earlier.
Another trick to know is partitioning. Look up how that's done in the database of your choice - but basically the idea is that you tell your database to keep a table partitioned into several subtables, each with an identical definition, based on some value.
In your case, what is very useful is "range partitioning" -- choosing the partition based on a range into which a value falls into. If you partition by date range, you can create separate sub-tables for each week (or each day, or each month -- depends on how you use your data and how much of it there is).
This means that if you specify a date range when you issue a query, the data that is outside that range will not even be considered; that can lead to very significant time savings, even better than an index (an index has to consider every row, so it will grow with your data; a partition is one per day).
This makes both online queries (ones issued when you hit your ASP page), and the aggregation queries you use to pre-calculate necessary statistics, much faster.
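For illustration, here is what monthly range partitioning could look like on SQL Server (since the question mentions ASP.NET), run via pyodbc. The function/scheme/table names, the filegroup, and the boundary dates are all placeholders, and other databases have their own PARTITION BY RANGE equivalents:

```python
# Sketch of monthly range partitioning for a statistics table. A query with a
# VisitDate range then only touches the relevant partitions.
import pyodbc

conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};SERVER=localhost;DATABASE=Stats;Trusted_Connection=yes;",
    autocommit=True,
)
cur = conn.cursor()

# One partition per month; rows land in the partition their VisitDate falls into.
cur.execute("""
CREATE PARTITION FUNCTION pfStatsByMonth (date)
AS RANGE RIGHT FOR VALUES ('2011-01-01', '2011-02-01', '2011-03-01');
""")
cur.execute("""
CREATE PARTITION SCHEME psStatsByMonth
AS PARTITION pfStatsByMonth ALL TO ([PRIMARY]);
""")
cur.execute("""
CREATE TABLE dbo.StatisticsHits (
    Zone      varchar(50) NOT NULL,
    VisitDate date        NOT NULL,
    UserKey   varchar(64) NOT NULL
) ON psStatsByMonth (VisitDate);
""")
```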
