Design data provisioning strategy for big data system? - encryption

I'm designing the data provisioning module in a big data system. In Data Lake Development with Big Data, data provisioning is described as:
The process of providing the data from the Data Lake to downstream systems is referred to as Data Provisioning; it provides data consumers with secure access to the data assets in the Data Lake and allows them to source this data. Data delivery, access, and egress are all synonyms of Data Provisioning and can be used in this context.
I'm looking for standards for designing this module, including how to secure the data, how to identify whether a given piece of data originated from the system, and so on. I have searched on Google, but there are not many results for that keyword. Can you give me some advice or share your own experience with this problem? Every answer is appreciated.
Thank you!

Data provisioning is mainly done by creating different data marts for your downstream consumers. For example, if you have a big data system with data from various sources aggregated into one data lake, you can create different data marts, such as 'Purchase', 'Sales', and 'Inventory', and let the downstream systems consume these. A downstream consumer that needs only 'Inventory' data then consumes only the 'Inventory' data mart.
Your best bet is to search for 'Data Marts'. For example, ref: https://panoply.io/data-warehouse-guide/data-mart-vs-data-warehouse/
Now you can fine-tune security and access control per data mart. For example:
'Sales' data is accessible only to sales reporting systems, users, groups, etc.
Tokenize sensitive fields in the 'Purchase' mart, etc. It all depends on the business requirements.
Another way is to export the aggregated data via data export mechanisms, for example using Apache Sqoop to offload data to an RDBMS. This approach is advisable when the volume of data to export is small enough for the downstream consumer.
Another way is to create separate 'Consumer Zones' in the same Data Lake, for example a dedicated Hadoop directory or a Hive database.
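As a rough illustration of the data mart / consumer zone idea, here is a minimal PySpark sketch. The database, table, path, and column names are all made up, and the masking and access-control details would depend on your security stack (Ranger/Sentry grants, S3 bucket policies, etc.):

```python
# Hypothetical sketch (PySpark): carving an 'Inventory' data mart / consumer zone
# out of a data lake. All names and paths below are placeholders for illustration.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = (SparkSession.builder
         .appName("inventory-mart-build")
         .enableHiveSupport()
         .getOrCreate())

# Read the curated/aggregated data from the lake (path is hypothetical).
lake_df = spark.read.parquet("s3a://acme-data-lake/curated/transactions/")

# Keep only the columns and rows the 'Inventory' consumers are allowed to see,
# and tokenize/hash anything sensitive before it leaves the lake zone.
inventory_df = (lake_df
                .filter(F.col("domain") == "inventory")
                .select("item_id", "warehouse_id", "quantity", "snapshot_date")
                .withColumn("item_id", F.sha2(F.col("item_id").cast("string"), 256)))

# Publish into a dedicated Hive database that acts as the consumer zone;
# access control is then applied at the database / directory level.
spark.sql("CREATE DATABASE IF NOT EXISTS inventory_mart")
(inventory_df.write
 .mode("overwrite")
 .partitionBy("snapshot_date")
 .saveAsTable("inventory_mart.inventory_snapshot"))
```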

Related

CosmosDB API selection: does it dictate how the data is stored, or only how we communicate with the instance?

When creating a CosmosDB instance, we can choose the API that we will use to communicate with the instance (e.g. SQL, MongoDB, Cassandra, etc.)
What is not clear to me is: does this selection dictate how the data is stored, or only the way we communicate with the instance? For example, if we choose MongoDB, does that mean Cosmos DB will store the data in a MongoDB fashion?
The choice of API does not change how the data is stored. Cosmos DB always stores data using something called atom-record-sequence (ARS) which is essentially a set of primitive types, structs and arrays. The database engine translates the native ARS format into the data structures used by the various APIs (i.e. json documents, table rows, etc.)
So the answer to your question is that the choice of API only impacts how you communicate with the databases for that Cosmos DB account.
As David Makogon points out in his comment on another answer, while the way the data is stored is the same regardless of the API used, the content of the data will differ, because each API requires its own metadata so that the underlying data can be projected into the format expected by that API.
Here is a good technical overview of how Cosmos works under the hood.
https://azure.microsoft.com/en-us/blog/a-technical-overview-of-azure-cosmos-db/
Data is always stored in the same fashion (as a bunch of JSON documents); only the way you interact with the data changes.
https://learn.microsoft.com/en-us/azure/cosmos-db/introduction#develop-applications-on-cosmos-db-using-popular-open-source-software-oss-apis
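For illustration, here is a minimal sketch of talking to a SQL (Core) API account with the azure-cosmos Python SDK; the endpoint, key, database, and container names are placeholders. Whichever API the account was created with, the engine keeps the data in its internal ARS representation; only the client library and wire protocol change:

```python
# Hypothetical sketch: querying a Cosmos DB account created with the SQL (Core) API
# using the azure-cosmos Python SDK. Endpoint, key, and names are placeholders.
from azure.cosmos import CosmosClient

client = CosmosClient("https://my-account.documents.azure.com:443/", credential="<primary-key>")
container = client.get_database_client("sales-db").get_container_client("orders")

orders = container.query_items(
    query="SELECT c.id, c.total FROM c WHERE c.country = @country",
    parameters=[{"name": "@country", "value": "DE"}],
    enable_cross_partition_query=True,
)
for order in orders:
    print(order)

# Had the account been created with the MongoDB API instead, you would reach the
# same ARS-backed storage through a MongoDB driver (e.g. pymongo) and its wire protocol.
```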

What is a Data Warehouse and can it be applied to complex data?

I want to define data warehouse with the necessary literature reference.
I found the following on Wikipedia:
DWs are central repositories of integrated data from one or more disparate sources. They store current and historical data in one single place that are used for creating analytical reports for workers throughout the enterprise.
Does that imply that there is always a relational database underneath a data warehouse, or can it be any kind of repository?
In An Architecture Framework for Complex Data Warehouses, the term data warehouse is also used for complex data, meaning video, images, etc., but the term data warehouse itself remains undefined in that paper.
A "Data warehouse" is mostly an information systems concept that describes a centralized and trusted source of (e.g. company/business) data.
From Wikipedia: "DWs are central repositories of integrated data from one or more disparate sources. They store current and historical data in one single place that are used for creating analytical reports for workers throughout the enterprise."
I consider the Kimball Group one of the most authoritative sources on the topic, as they have been developing their framework and methodologies for over two decades, and they've also been applying that framework to different business and technical areas and sharing the outcomes and results of this.
Kimball's The Data Warehouse Toolkit is one of the reference books on the topic, and it defines a data warehouse as "a copy of transaction data specifically structured for query and analysis".
Bill Inmon is also considered one of the pioneers of data warehousing, and he defines a data warehouse as "a subject-oriented, integrated, time-variant and non-volatile collection of data in support of management's decision making process".
A data warehouse does not have to be implemented on a relational database system, although it is very common to implement Kimball's dimensional models in RDBMS or different database systems that support the concepts of "joinable" tables (e.g. Redshift, Presto, Hive).
A recent addition to data architectures, which perfectly accommodates complex data types, is the concept of a data lake: usually a data store that can handle virtually any kind of data type (e.g. S3, HDFS) and whose contents can either be analyzed directly (e.g. MapReduce over XML files on S3) or processed into different formats or data models (like a dimensional model).
Edit following your comment:
A Data Warehouse and a Data Lake are independent systems that serve different purposes, can/should be complementary, and both are part of a larger data architecture. A data lake, as a concept, can be just another data source for dimensional models on a data warehouse (although the technological implementation of data lakes enables direct querying over the raw data).
You can think of a Data Lake as a "landing zone" where several systems dump data in a "complex/raw format", e.g. MP3 files from customer support calls, gzipped logs from web servers. It's meant to sit there for historical purposes and for further processing into a format that can be easily analyzed/reported over, e.g. text extraction from MP3 files.
A Data Warehouse also aggregates data from different systems, but the data is modeled into a format appropriate for reporting (like a dimensional model), its model reflects the business/domain's processes and transactions, and is usually highly curated.
Imagine this case: if you log visits to your online store using web server logs, you could keep the gzipped logs (the "transaction data") in a data lake and then process them into a dimensional model (like this), which would be the "copy of transaction data specifically structured for query and analysis", so business users can easily explore it in Excel or some other reporting tool.
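A minimal sketch of that log-to-dimensional-model step, assuming PySpark; the paths, log layout, and table names are hypothetical:

```python
# Illustrative sketch only: turning raw gzipped web-server logs sitting in a data
# lake into a small "page visit" fact table. Paths and names are made up.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("visits-fact-build").enableHiveSupport().getOrCreate()

# Raw zone of the lake: Spark reads .gz text files transparently.
raw_logs = spark.read.text("s3a://acme-lake/raw/webserver-logs/*.gz")

# Parse a (simplified, assumed) combined log format into columns.
visits = raw_logs.select(
    F.regexp_extract("value", r"^(\S+)", 1).alias("client_ip"),
    F.to_timestamp(
        F.regexp_extract("value", r"\[([^\]]+)\]", 1), "dd/MMM/yyyy:HH:mm:ss Z"
    ).alias("visit_ts"),
    F.regexp_extract("value", r'"GET (\S+)', 1).alias("page_url"),
)

# Warehouse side: a date-partitioned fact table business users can query directly.
(visits
 .withColumn("visit_date", F.to_date("visit_ts"))
 .write.mode("append")
 .partitionBy("visit_date")
 .saveAsTable("web_dw.fact_page_visit"))
```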

Filtering data while reading from S3 to Spark

We are moving to AWS EMR/S3 and using R for analysis (the sparklyr library). We have 500 GB of sales data in S3 containing records for multiple products. We want to analyze the data for a couple of products and read only a subset of the files into EMR.
So far, my understanding is that spark_read_csv will pull in all the data. Is there a way in R/Python/Hive to read data only for the products we are interested in?
In short, your current choice of format (CSV) sits at the inefficient end of the spectrum for this.
Using data that is either:
partitioned by the column of interest (the partitionBy option of DataFrameWriter, or the correct directory structure), or
clustered by the column of interest (the bucketBy option of DataFrameWriter together with a persistent metastore),
can help narrow the search down to particular partitions in some cases. But if filter(product == p1) is highly selective, you're likely looking at the wrong tool.
Depending on the requirements, a proper database or a data warehouse on Hadoop might be a better choice.
You should also consider choosing a better storage format (like Parquet).
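For example, here is a rough PySpark sketch (the sparklyr route with spark_read_parquet and a filter is analogous) of converting the CSV once into product-partitioned Parquet so that later reads of a single product only touch that product's partition; the paths and column names are assumptions:

```python
# Rough sketch: one-off CSV -> partitioned Parquet conversion, then a pruned read.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("sales-partitioning").getOrCreate()

# One-off conversion: CSV -> Parquet, partitioned by the column you filter on.
sales_csv = spark.read.csv("s3a://acme-sales/raw/sales.csv", header=True, inferSchema=True)
(sales_csv.write
 .mode("overwrite")
 .partitionBy("product")
 .parquet("s3a://acme-sales/parquet/sales/"))

# Later analyses: the filter on the partition column is pushed down, so Spark
# lists and reads only .../sales/product=p1/ instead of the full 500 GB.
p1_sales = (spark.read.parquet("s3a://acme-sales/parquet/sales/")
            .filter(F.col("product") == "p1"))
p1_sales.groupBy("store_id").agg(F.sum("amount").alias("total_amount")).show()
```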

Silo data for users into particular physical location

We have a system we're designing which has to hold data for people globally, including countries with very strict data protection policies, specifically where data about their citizens must physically reside in that country.
Now, we could engineer a mechanism for siloing/querying the data so that it is pulled from a particular location, but as the system will be Azure-based, we were hoping that Cosmos DB's partitioning feature might be an option.
Based on the information available to date about partitioning, it seems possible to assign a location-specific partition for some data, but it's not very clear. Any search for partitioning in general turns up material about high availability and low latency (good things, but not what I'm looking for).
To this end, can location-specific data be assigned in Cosmos DB as part of its feature set, or is this something that has to be engineered on top?
For data sovereignty, you must engineer a data access layer across multiple Cosmos DB accounts. By default, Cosmos DB will replicate your data across all regions associated with your account (which is not what you need).
While not specifically for this scenario, you can see a description of how to build such a layer here: https://learn.microsoft.com/en-us/azure/cosmos-db/multi-region-writers
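As a sketch of what such a layer could look like, assuming one Cosmos DB account per jurisdiction, each associated only with regions inside that jurisdiction; the endpoints, keys, and the country-to-account mapping below are hypothetical:

```python
# Hedged sketch of a per-jurisdiction routing layer over multiple Cosmos DB accounts.
from azure.cosmos import CosmosClient

ACCOUNTS = {
    "DE": CosmosClient("https://users-germany.documents.azure.com:443/", credential="<key-de>"),
    "AU": CosmosClient("https://users-australia.documents.azure.com:443/", credential="<key-au>"),
}

def container_for(country_code: str):
    # Each account only has regions inside the required jurisdiction associated
    # with it, so Cosmos DB never replicates the data elsewhere.
    client = ACCOUNTS[country_code]
    return client.get_database_client("users").get_container_client("profiles")

def save_profile(profile: dict):
    container_for(profile["country"]).upsert_item(profile)

save_profile({"id": "42", "country": "DE", "name": "Max"})
```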

Require advice on Time Series Database solution that fits with Apache Spark

For example, let's say I wish to analyze a month's worth of company data for trends. I plan on doing regression analysis and classification using an MLP.
A month's worth of data has ~10 billion data points (rows).
There are 30 dimensions to the data.
12 features are numeric (integer or float; continuous).
The rest are categorical (integer or string).
Currently the data is stored in flat files (CSV) and is processed and delivered in batches. Data analysis is carried out in R.
I want to:
change this to stream processing (rather than batch processing).
offload the computation to a Spark cluster.
house the data in a time-series database to facilitate easy read/write and query. In addition, I want the cluster to be able to query data from the database when loading the data into memory.
I have an Apache Kafka system that can publish the feed for the processed input data. I can write a Go module to interface this with the database (via cURL, or a Go API if one exists).
There is already a development Spark cluster available to work with (assume that it can be scaled as necessary, if and when required).
But I'm stuck on the choice of database. There are many solutions (here is a non-exhaustive list) but I'm looking at OpenTSDB, Druid and Axibase Time Series Database.
Other time-series databases which I have looked at briefly (InfluxDB, RiakTS, and Prometheus) seem to be optimised more for handling metric data.
"Spark runs on Hadoop, Mesos, standalone, or in the cloud. It can access diverse data sources including HDFS, Cassandra, HBase, and S3." - Apache Spark website
In addition, the time-series database should store that data in a fashion that exposes it directly to Spark (as this is time-series data, it should be immutable and therefore satisfies the requirements of an RDD - therefore, it can be loaded natively by Spark into the cluster).
Once the data (or the data with dimensionality reduced by dropping the categorical features) is loaded, regression and classification models can be developed and experimented with using sparklyr (an R interface for Spark) and Spark's machine learning library (MLlib; this cheatsheet provides a quick overview of the functionality).
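To make this concrete, here is roughly the modelling step I have in mind, sketched in PySpark for brevity (I would use the sparklyr equivalent); the table and column names are placeholders, and it assumes the time-series store can expose the data to Spark as a DataFrame (e.g. via a Hive table or a connector):

```python
# Illustrative only: load one month of data and fit a simple regression in Spark.
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.regression import LinearRegression

spark = SparkSession.builder.appName("ts-regression").enableHiveSupport().getOrCreate()

# Keep only the numeric features for a first pass (12 of them in reality).
numeric_features = ["f1", "f2", "f3"]
df = (spark.table("tsdb.company_metrics")
      .filter("ts >= '2017-06-01' AND ts < '2017-07-01'")
      .select("target", *numeric_features))

assembler = VectorAssembler(inputCols=numeric_features, outputCol="features")
train, test = assembler.transform(df).randomSplit([0.8, 0.2], seed=42)

model = LinearRegression(featuresCol="features", labelCol="target").fit(train)
print(model.evaluate(test).rootMeanSquaredError)
```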
So, my questions:
Does this seem like a reasonable approach to working with big data?
Is my choice of database solutions correct? (I am set on working with a columnar store and a time-series database; please do not recommend a SQL/relational DBMS.)
If you have prior experience working with data analysis on clusters, from both an analytics and systems point of view (as I am doing both), do you have any advice/tips/tricks?
Any help would be greatly appreciated.
