Design help on an offline/reporting database - OLAP

We are considering moving reporting off the transactional database and onto an offline/reporting database for our Java web-based project, with some ETL tool (like Kettle) loading full/delta updates into the offline database.
The reasons are the obvious ones: to reduce load on the transactional DB and to improve reporting performance.
Our outstanding questions are about designing the offline database, as I don't have much knowledge of OLAP. The requirement is to run reports through a reporting engine like Jasper/Pentaho and to build analytics and dashboards.
What is the best way to design the offline/reporting database?
1) One big flat table? I am sure this is a very bad idea.
2) Multiple flat tables, i.e. multiple denormalized tables. The idea is to denormalize the related tables and link the denormalized tables together to get detailed reports.
Also, any idea how we can handle summaries?
3) Star schema, with facts and dimensions.
One dumb question here: can we have all the other detail columns in the fact table along with the additive measures (summaries or aggregated data)? A rough sketch of what I mean by this option is just below.
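This is what I understand option 3 to look like (the table and column names are generic examples, not our actual schema) - the conventional layout keeps the additive measures in a narrow fact table and the descriptive detail in the dimension tables:

CREATE TABLE dim_date (
    date_key     INT PRIMARY KEY,   -- surrogate key, e.g. 20240131
    full_date    DATE,
    month_name   VARCHAR(20),
    year_number  INT
);

CREATE TABLE dim_customer (
    customer_key  INT PRIMARY KEY,  -- surrogate key
    customer_name VARCHAR(100),
    region        VARCHAR(50)
);

-- Narrow fact table: dimension keys plus additive measures only,
-- which is what keeps the aggregations fast.
CREATE TABLE fact_sales (
    date_key      INT REFERENCES dim_date (date_key),
    customer_key  INT REFERENCES dim_customer (customer_key),
    order_count   INT,
    sales_amount  DECIMAL(18, 2)
);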
Is there a tool to denormalize data from a set of normalized tables?
Thanks in advance.
Pradeep

Related

Plain JSON vs in-app DB for React Native?

I have a 50-100 MB dataset that users need to have access to. It's static, so it doesn't make sense to host a server for it. There are two kinds of operations I'll perform on the data:
Reading objects by unique ObjectId. Each object is ~3 KB.
Full-text search through ~300,000 strings. Each string is 4-60 characters.
I'm considering storing the data as JSON files. The 300k strings will be stored separately, and I'll use https://github.com/nextapps-de/flexsearch or something similar to search over them. I've done something similar before with a ~10 MB dataset back in 2016; I used plain regex search and it worked flawlessly.
Are there reasons to use RealmDB, SQLite, PouchDB or something else instead of just JSON?
I wish I had seen this question a year ago...
At the office where I currently work, we tried creating an app using PouchDB and React Native. We basically saw PouchDB as an advantage because it wouldn't require our API to send all the data over and over again on every refresh triggered by the user; it would only send the data that had changed, based on the client's checkpoint. As the data on the server was quite heavy (around 6k entries with more than 200 attributes each), we tried at all costs to go easy on the client's data plan.
Months after this implementation was in place, we added a search feature with many different options for sorting and filtering, and not only did we have to throw away our entire PouchDB implementation, we had to start from scratch, replacing all of its logic with indexed JSON values. PouchDB's performance was extremely slow; it was taking 5 seconds or more to retrieve results, and we just couldn't afford that kind of delay in our project.
In the end we managed to get very fast search by running flexsearch over our indexed JSON. Don't make the same mistake we did; PouchDB cost us too much budget and precious time. It was a terrible choice.
Unfortunately I cannot offer proof or more details from a reputable source; I can only share the terrible personal experience I had when I thought we were reaching the end of a project and we had to start from scratch. It was a mess.
Oh boy, a bountied, opinion-based question!
I have about 5 years' experience with PouchDB specifically, and a little with SQLite. I have only cursory experience with RealmDB - I tried it out and decided it was not a good fit for my hybrid/mobile needs.
PouchDB excels in one area hands down - synchronization/replication, just like its big brother CouchDB. Providing an offline database that synchronizes with a remote database is huge for many mobile apps. PouchDB is schemaless, leveraging JSON documents. With PouchDB one may choose among several data stores via adapters; as there can be quota headaches1 at your data size, the right choice is likely the SQLite adapter. PouchDB does not support full-text search.
SQLite is what its name implies - a relational database, requiring a schema. Advantages of SQLite are its platform support and that the database size is not subject to quota headaches the way web storage (e.g. IndexedDB) is. SQLite supports full-text search, and apps can deploy with a canned database.
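As a rough sketch of what that full-text search could look like (this assumes an SQLite build with the FTS5 module enabled; the table and column names are made up for illustration):

-- Virtual table holding the ~300k strings; object_id is stored but not indexed.
CREATE VIRTUAL TABLE entry_fts USING fts5(object_id UNINDEXED, body);

INSERT INTO entry_fts (object_id, body) VALUES ('obj-0001', 'some searchable text');

-- Full-text match, best matches first (FTS5 bm25 scores are lower-is-better).
SELECT object_id, body
FROM entry_fts
WHERE entry_fts MATCH 'searchable'
ORDER BY bm25(entry_fts)
LIMIT 20;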
Between PouchDB and SQLite lies RealmDB - a schema-based object database that supports synchronization/replication. Like PouchDB, it does not support full-text search.
Now, your requirements:
Looking up objects by id
300k static strings
full-text search
I read 'static' to mean immutable.
Since your data does not change and full-text search is required, PouchDB and RealmDB would not be good choices. If there were a requirement to enhance, remove or add to the data, either would make sense, as changes to the data on a single server would replicate to the local database practically seamlessly.
SQLite might be a reasonable choice since it supports search and it is possible to deploy a canned database with the app. However, SQLite can be slow in hybrid apps.
So,
PouchDB and RealmDB would be massive overkill and not a good fit.
SQLite would add a fair bit of complexity.
For your specific requirements I'd stay on your path, though with one caveat: it appears flexsearch loads its index into memory. If that turns out to carry a performance penalty, then SQLite, with its ability to deploy a canned database and its built-in search facility, may prove a reasonable trade-off versus the complexity.
Good luck!
1 Quota Headaches
I would say it really just depends on whether you want and NEED to leverage the power of relational queries. Because your data never changes, I would use JSON unless you are trying to perform complex comparisons across your data. In your case it sounds like you are just going to be searching for a particular ObjectId, so JSON is your best bet, especially because you say you won't need to change the data later.
If you organize your JSON so that the ObjectIds are in sorted order, you will easily be able to search quickly (e.g. with a binary search).

Big picture on creating a Database for Social App

I have a question regarding the most suitable way to organise data when your app/product becomes used by more people.
So far I've coded an Instagram-like application for iOS which uses Firebase to store data.
In particular, I used the Firebase Realtime Database with the JSON data format to store all the data.
My question is: if I want to build an app which is potentially used by a lot of people, can I still use the same Realtime Database way of storing data, or would it be better to use something else?
In particular, I'm thinking about query speed and the sustainability of the Realtime Database with a larger amount of data.
I'm a novice in this field and I don't know much about Firebase, so I'm sorry if my technical descriptions are rough.
Both databases will scale very well. If you have only simple querying needs, then RTDB is fine. If your querying requirements are more sophisticated, then use Firestore. The other major factor is how usage of the two databases is charged. You need to research that and then work out how the two cost models will play out for your use case.

Should CosmosDB be modeled like a document database or a graph database?

I see that Cosmos DB can support graph queries as well as more traditional SQL-like queries - however, I'm a bit confused about what kind of underlying schema is best at the collection level. If I were to model something in MongoDB, SQL Server, or Neo4j, I would have very different schemas. Also, it seems like I can query using more traditional SQL-like syntax, which makes it confusing what's right or efficient underneath. Just because something is easy to query does not mean the query is efficient.
Is Cosmos DB at its heart a document database that I should model accordingly - or is it a very different beast?
Example use case
Here's an example - let's say I have:
a user profile
multiple post types (photo, blog, question)
users can like photos
users can comment on photos, blogs, questions
With a SQL database I would have tables:
profiles
photos
blogs
questions
and join tables with referential integrity to support the actions:
photoLikes
blogComments
photoComments
questionComments
With a graph database
I would just have the same core tables
profiles
photos
blogs
questions
and just create graph relationship types for like and comment - relying on business logic in the code to enforce rules such as "you can't like blogs", etc.
With a document DB like MongoDB
Again, I might have the same core tables
profiles
photos
blogs
questions
Comments would be subcollections under each - and there would be a question of whether we want to keep the likes as an embedded collection under each profile or under each photo, and we would have to increment and sync a like count in the other collection (depending on the use case we might create a likes collection as well). Comments would be tucked under each photo, blog or question as an embedded collection and not have their own top-level collection.
So my question is this:
How do we model this schema in Cosmos DB? Should we model it like a traditional document database such as MongoDB, or does having access to graph queries allow us additional freedoms, like not having to denormalize fields for actions such as "like"?
The Azure Cosmos DB database engine is designed to be fully schema-agnostic.
A container (which can be a graph, a collection of documents, or a table) is a schema-agnostic container of arbitrary user-generated content which gets automatically indexed upon ingestion. To better understand the details, I suggest reading "Schema-Agnostic Indexing with Azure DocumentDB" - http://www.vldb.org/pvldb/vol8/p1668-shukla.pdf - which describes the same indexing used in Cosmos DB.
How do we model this schema in Cosmos DB? Should we model it like a traditional document database such as MongoDB, or does having access to graph queries allow us additional freedoms, like not having to denormalize fields for actions such as "like"?
When you start modeling data in Azure Cosmos DB, you need to consider: 1. Is your application read-heavy or write-heavy? 2. How is your application going to query and update data? And so on. Normally, denormalized data models give better read performance, while normalized models give better write performance.
This article explains with examples how to model document data for NoSQL databases, and covers scenarios for embedded, normalized and hybrid data models, which should be helpful.
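For example (the document shape and names here are my own illustration, not from your model), say each photo document embeds its comments and a like count: { "id": "photo1", "ownerId": "user42", "likeCount": 12, "comments": [ { "userId": "u7", "text": "nice shot" } ] }. The SQL API can then unnest the embedded comments array in a single query, without a separate comments collection:

SELECT p.id, p.likeCount, c.userId, c.text
FROM photos p
JOIN c IN p.comments
WHERE p.ownerId = "user42"

Whether you also keep likes as separate documents (and sync a counter) comes back to the read-heavy versus write-heavy trade-off above.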

Simulate records in database without entering any

I've nearly finished the development of a project and would like to test its performance, especially the database query calls. I'm using Linq to SQL to search via usernames, but I've only got around 10 'users' in my database, so I can't really get a decent speed reading. How can I simulate thousands/millions of users in the database without actually creating new records? I've read about Selenium, but it seems that is good for repeat actions (simulating concurrent users?). Are there any other tools I should look into, or are there any options in VS 2008 (Professional Edition)?
Thanks
You can "trick" SQL Server into thinking there are more records than there actually are in a table using the approach outlined in this article. See the section on False SQL Server Statistics
e.g.
UPDATE STATISTICS TableName WITH ROWCOUNT=100000
will create statistics for the table as if it had 100,000 rows in it. You can then see what effect this has on the execution plan. But note this is undocumented functionality, so it may give quirky behaviour.
You could just populate your table with sample data. There are various tools available to help with that, like Red Gate's SQL Data Generator. I prefer actually having large data volumes, as I think that will give more accurate results.
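If you'd rather script it than use a tool, a rough T-SQL sketch along these lines works (the table and column names are placeholders for whatever your schema actually has):

-- Generate 1,000,000 synthetic users from a row-number sequence
-- built by cross-joining a couple of system views.
;WITH n AS (
    SELECT TOP (1000000)
           ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) AS i
    FROM sys.all_objects a
    CROSS JOIN sys.all_objects b
)
INSERT INTO dbo.Users (Username)
SELECT 'user_' + CAST(i AS VARCHAR(10))
FROM n;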

De-normalize live data for the sake of reports - Good or Bad?

What are the pros/cons of de-normalizing an enterprise application database because it will make writing reports easier?
Pro - designing reports in SSRS will probably be "easier" since no joins will be necessary.
Con - developing/maintaining the app to handle de-normalized data will become more difficult due to duplication of data and synchronization.
Others?
Denormalization for the sake of reports is Bad, m'kay.
Creating views, or a denormalized data warehouse, is good.
Views have solved most of my reporting-related needs. Data warehouses are great when users will be generating reports almost constantly, or when your views start to slow down.
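To make that concrete, here is roughly the kind of view I mean (the schema is invented for the example) - the report designer only ever sees one flat, join-free object:

CREATE VIEW dbo.vw_OrderReport
AS
SELECT o.OrderId,
       o.OrderDate,
       c.CustomerName,
       c.Region,
       ol.ProductName,
       ol.Quantity * ol.UnitPrice AS LineTotal   -- reports aggregate this as needed
FROM dbo.Orders o
JOIN dbo.Customers c ON c.CustomerId = o.CustomerId
JOIN dbo.OrderLines ol ON ol.OrderId = o.OrderId;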
This is why you want to normalize your database:
To free the collection of relations from undesirable insertion, update and deletion dependencies;
To reduce the need for restructuring the collection of relations as new types of data are introduced, and thus increase the life span of application programs;
To make the relational model more informative to users;
To make the collection of relations neutral to the query statistics, where these statistics are liable to change as time goes by.
—E.F. Codd, "Further Normalization of the Data Base Relational Model" via wikipedia
The only time you should consider denormalization is when the time it takes a report to generate is not acceptable. Denormalization will cause consistency issues that are sometimes impossible to detect, especially in large datasets.
Don't denormalize just to get rid of complexity in reporting; it can cause huge problems in the rest of the application. Either you don't enforce the rules, resulting in bad data, or you do enforce them, and then inserts, deletes and updates can be seriously slowed for everyone, not just the two or three people who run reports.
If the reports truly can't run well, then create a denormalized data warehouse and populate it with a nightly or weekly feed. The kinds of reports that typically need this generally don't care whether the data is up-to-the-minute, as they are usually monthly, quarterly, or annual reports that process (and especially aggregate) large amounts of data after the fact.
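A nightly feed along those lines can be as simple as one MERGE per warehouse table. The names below are invented, and in practice you would swap the full rescan shown here for a real change-detection scheme (modified dates, rowversion, CDC):

MERGE reporting.FactOrders AS tgt
USING (
    SELECT o.OrderId,
           o.OrderDate,
           c.CustomerName,
           SUM(ol.Quantity * ol.UnitPrice) AS OrderTotal
    FROM dbo.Orders o
    JOIN dbo.Customers c ON c.CustomerId = o.CustomerId
    JOIN dbo.OrderLines ol ON ol.OrderId = o.OrderId
    GROUP BY o.OrderId, o.OrderDate, c.CustomerName
) AS src
    ON tgt.OrderId = src.OrderId
WHEN MATCHED THEN
    UPDATE SET OrderTotal = src.OrderTotal
WHEN NOT MATCHED THEN
    INSERT (OrderId, OrderDate, CustomerName, OrderTotal)
    VALUES (src.OrderId, src.OrderDate, src.CustomerName, src.OrderTotal);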
You can do both: keep the normalized database for the applications.
Then create a denormalized database for reports, and an application which regularly copies data from one database to the other.
After all, reports don't always need the latest data; most of the time you can easily run an update on the reporting database every hour, or even only once a day.
Beyond the data warehouse and view solutions provided in other answers, which are good in some ways: if you are willing to sacrifice some performance to get up-to-the-second data but still want a normalized database, you could use a materialized view with fast refresh on commit in Oracle, or an indexed view (a view with a clustered index) in SQL Server.
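For the SQL Server side, a rough sketch of such an indexed view (illustrative names only; the usual indexed-view restrictions apply - SCHEMABINDING, two-part table names, COUNT_BIG with GROUP BY, and no nullable columns inside the SUM):

CREATE VIEW dbo.vw_SalesByDay
WITH SCHEMABINDING
AS
SELECT o.OrderDate,
       COUNT_BIG(*)                    AS OrderCount,
       SUM(ol.Quantity * ol.UnitPrice) AS SalesTotal
FROM dbo.Orders o
JOIN dbo.OrderLines ol ON ol.OrderId = o.OrderId
GROUP BY o.OrderDate;
GO

-- The unique clustered index is what materializes the view's result set
-- and keeps it maintained on every insert/update/delete.
CREATE UNIQUE CLUSTERED INDEX IX_vw_SalesByDay
    ON dbo.vw_SalesByDay (OrderDate);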
Another con is that the data is likely not to be real-time, since it takes some time to move the data from the normalized form to the denormalized one. If someone wants the report to be accurate up to the very second it was requested, that can be tough to do in this situation.
If this is a duplicate of the synchronization point in the original post, sorry - I didn't quite read it that way.
