What's the difference between Data Mesh and multiple Data Warehouses from a technical perspective? - bigdata

I've recently come across the new concept "Data Mesh".
After reading some blogs and watching introduction videos about Data Mesh, it's still not clear to me what the difference is between a Data Mesh and multiple Data Warehouses in an organisation from a technical perspective.
If anyone is familiar with this concept, could you please share with me:
Apart from the "domain oriented" principle, what is the difference between a Data Mesh and multiple Data Warehouses for different domains?
How does Data Mesh solve the problem of integrating data from different departments (mesh domains)?
Thanks :)!
Here are some links for Data Mesh introduction:
How to Move Beyond a Monolithic Data Lake to a Distributed Data Mesh
Introduction to Data Mesh

There are many differences, but one is the standardization of the API used to access the data and metadata. By nature, the data quantum, which is the "atomic" element of the data mesh, is agnostic of its data store (or data stores). So when you are thinking about observability, the dictionary, and access control for your data quanta, you want uniformity.
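The "uniform API over any data store" idea can be sketched in a few lines. This is a hypothetical interface, not a standard from the Data Mesh literature; the class and method names are assumptions chosen for illustration:

```python
# Hedged sketch: every data product (data quantum) in the mesh exposes the
# same interface for data, metadata, and observability, regardless of the
# store behind it. Names below are illustrative, not a real standard.
from abc import ABC, abstractmethod

class DataProduct(ABC):
    @abstractmethod
    def read(self, query: str) -> list[dict]: ...   # data access
    @abstractmethod
    def metadata(self) -> dict: ...                 # dictionary / schema info
    @abstractmethod
    def health(self) -> dict: ...                   # observability hooks

class OrdersProduct(DataProduct):
    """Backed by, say, a warehouse table; callers never see that detail."""
    def read(self, query):
        return [{"order_id": 1, "amount": 42.0}]    # stand-in for a real query
    def metadata(self):
        return {"owner": "orders-domain", "schema": ["order_id", "amount"]}
    def health(self):
        return {"status": "ok", "freshness_minutes": 5}

product = OrdersProduct()
print(product.metadata()["owner"])
```

A consumer only ever sees `DataProduct`, which is what gives you uniform observability and access control across quanta backed by very different stores.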

Related

Storing data on edges of GraphDB

It's being proposed that we store data about a relationship between two vertices on the edge between them. The idea is that the two vertices are related, and there are user-level pieces of information that need to be stored in the graph. The best example I can think of is a Book and a Reader, where the Reader can store cliff notes on the edges for retrieval later on.
Is this common practice? It seems to me that we should minimize the amount of data living in edges, and that the vast majority of GraphDB data should be derived data, rather than using the graph as an actual data store. Given that it's in memory, what happens when it goes down? (We're using Neptune, so there are technically backups.)
Sorry if the question is a bit vague, but I'm not sure how else to ask. I've googled around looking for best practices, and it's all pretty generic material on the concepts and theory of graph databases.
An additional question: is it common practice to expose the Gremlin API directly to users, or should there always be a GraphQL (or other) API in front of it?
Without more detail it is hard to provide exact modeling advice, but in general one of the advantages of using a graph database is that edges are first-class citizens and can carry properties. A common use case would be something like PERSON - purchases -> PRODUCT, where a purchase_date property on the purchases edge records the date of the purchase, since someone might buy the same thing multiple times.
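The "properties on edges" idea can be illustrated with a minimal in-memory sketch (purely illustrative; a real graph database like Neptune manages all of this for you, and the class names here are made up):

```python
# Minimal sketch of a property graph where edges carry their own properties,
# mirroring PERSON - purchases -> PRODUCT with a purchase_date per purchase.
from dataclasses import dataclass, field

@dataclass
class Edge:
    label: str
    src: str
    dst: str
    properties: dict = field(default_factory=dict)

@dataclass
class Graph:
    edges: list = field(default_factory=list)

    def add_edge(self, label, src, dst, **props):
        self.edges.append(Edge(label, src, dst, props))

    def out_edges(self, src, label):
        return [e for e in self.edges if e.src == src and e.label == label]

g = Graph()
# The same person buys the same product twice; each purchase is its own edge
# with its own purchase_date, so no information is lost.
g.add_edge("purchases", "alice", "book-123", purchase_date="2021-01-05")
g.add_edge("purchases", "alice", "book-123", purchase_date="2021-03-18")

dates = [e.properties["purchase_date"] for e in g.out_edges("alice", "purchases")]
print(dates)
```

The Book/Reader cliff-notes case has the same shape: a `Reader - annotated -> Book` edge with a `note` property per annotation.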
I am not sure what exactly you mean by "a vast majority of GraphDB data be derived data": you can use graphs to derive and infer data/relationships from the connections, but they also fully support storing data in them.
Given that it's in memory, what happens when it goes down? - Amazon Neptune (like most other DBs) uses a buffer cache to hold some data in memory, but that data is also persisted to disk, so if the instance goes down, there is no problem recovering it from the durable storage.
An additional question: is it common practice to expose the Gremlin API directly to users, or should there always be a GraphQL (or other) API in front of it? - Just as with any database, I would not recommend exposing the Gremlin API directly to consumers, as doing so comes with a whole host of potential security risks. Generally, the underlying data store of any application should be invisible to the users. They should be interacting with an interface like REST/GraphQL that is designed to answer business-related questions, and should not know or care that there is a graph database backing those requests.
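A sketch of that facade idea: consumers call a business-level function, and only the service layer speaks Gremlin. The function and traversal below are assumptions for illustration, not the Neptune or TinkerPop client API:

```python
# Hypothetical thin service layer in front of a graph store. Consumers ask a
# business question; the layer builds a parameterized traversal internally.
def get_cliff_notes(client, reader_id: str, book_id: str) -> list[str]:
    # Bindings keep user input out of the query string (names are made up).
    query = (
        "g.V(readerId).outE('annotated')"
        ".where(inV().hasId(bookId)).values('note')"
    )
    return client.submit(query, {"readerId": reader_id, "bookId": book_id})

class StubClient:
    """Stand-in for a real Gremlin connection, so the sketch runs anywhere."""
    def submit(self, query, bindings):
        # Pretend the traversal found one note for this reader/book pair.
        return [f"note for {bindings['readerId']}->{bindings['bookId']}"]

print(get_cliff_notes(StubClient(), "reader-1", "book-9"))
```

The point of the design is that `get_cliff_notes` is the contract; you could swap the graph database for something else without consumers noticing.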

Design data provisioning strategy for big data system?

I'm designing the data provisioning module in a big data system. Data provisioning is described as
The process of providing the data from the Data Lake to downstream systems is referred to as Data Provisioning; it provides data consumers with secure access to the data assets in the Data Lake and allows them to source this data. Data delivery, access, and egress are all synonyms of Data Provisioning and can be used in this context.
in Data Lake Development with Big Data. I'm looking for standards for designing this module, including how to secure the data, how to identify which system a given piece of data came from, etc. I have searched on Google, but there are not many results for that keyword. Can you give me some advice, or share your own experience with this problem? Every answer is appreciated.
Thank you!
Data provisioning is mainly done by creating different data marts for your downstream consumers. For example, if you have a big data system with data coming from various sources aggregated into one data lake, you can create different data marts, such as 'Purchase', 'Sales', and 'Inventory', and let the downstream systems consume those. A downstream consumer that needs only 'Inventory' data then consumes only the 'Inventory' data mart.
Your best bet is to search for 'data marts'. For example, see: https://panoply.io/data-warehouse-guide/data-mart-vs-data-warehouse/
You can then fine-tune security and access control per data mart. For example:
'Sales' data is accessible only to sales reporting systems, users, groups, etc.
Tokenize sensitive fields in the 'Purchase' mart, etc. It all depends on the business requirements.
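The per-mart access control described above can be sketched as a simple mapping; in a real deployment this policy would live in the platform (warehouse grants, Apache Ranger, AWS Lake Formation, etc.), and the group names here are invented:

```python
# Hedged sketch: mart-level access control as a mapping from data mart to
# the groups allowed to read it. Illustrative only.
MART_ACCESS = {
    "sales":     {"sales-reporting", "finance"},
    "purchase":  {"procurement"},
    "inventory": {"ops", "finance"},
}

def can_read(group: str, mart: str) -> bool:
    # Unknown marts deny by default.
    return group in MART_ACCESS.get(mart, set())

print(can_read("finance", "sales"))             # allowed
print(can_read("sales-reporting", "purchase"))  # denied
```

The useful property is that the unit of authorization is the mart, not individual tables, which keeps the policy aligned with how consumers actually request data.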
Another way is to export the aggregated data via data export mechanisms, for example using Apache Sqoop to offload data to an RDBMS. This approach is advisable when the data to export is small enough to be delivered to the downstream consumer this way.
Another way is to create separate 'consumer zones' in the same data lake, for example a dedicated Hadoop directory or Hive database.

Comparison between Big Data and Data Lakes: differences and similarities [duplicate]

This question already has answers here:
Is Data Lake and Big Data the same?
(4 answers)
Closed 3 years ago.
Can someone tell me the similarities and differences between Big Data and Data Lakes?
I can't find a satisfactory answer anywhere.
Big Data is a term used in very different ways; one might even call it a buzzword. Oftentimes it is used as a collective term for digital technologies, digitization, Industry 4.0, and many topics connected with the digital transformation.
In the less general interpretation, big data simply refers to a complex, large dataset. The term "big" then refers to three dimensions (see Wikipedia on Big Data):
volume, i.e. size of the data set
velocity at which the data volumes are generated
variety of data types and sources
A data lake is one approach to storing big data. Other ways of storing data are a traditional database, also called a relational database management system (RDBMS), on the one hand, and a data warehouse on the other; see for instance Data Lake vs. Data Warehouse vs. Database: What’s The Difference?
Big data and data lake are two different things.
A data lake is a concept where all your data is stored and easily accessible through different mechanisms. A data lake can be maintained on S3, Redshift, or any other storage platform.
Big data is a term used for processing large volumes of data. It is mostly associated with big data solutions such as Hadoop and Spark.
I think we can't sharply compare and differentiate the two terms, because a data lake is closely tied to big data: Data lake = enterprise data + unstructured data + semi-structured data.
Put another way, a data lake is a data repository where you can store any kind of data for analysis purposes, most commonly in the Hadoop Distributed File System (HDFS), whereas "big data" covers not just the storage but also the processing technology involved.

What is a Data Warehouse and can it be applied to complex data?

I want to define the term data warehouse with the necessary literature reference.
I found the following on Wikipedia (wiki):
DWs are central repositories of integrated data from one or more disparate sources. They store current and historical data in one single place that are used for creating analytical reports for workers throughout the enterprise.
Does that imply that there is always a relational database underneath a data warehouse, or can it be any kind of repository?
In An Architecture Framework for Complex Data Warehouses, the term data warehouse is also used for complex data, meaning video, images, etc., but the term itself remains undefined in that paper.
A "Data warehouse" is mostly an information systems concept that describes a centralized and trusted source of (e.g. company/business) data.
From Wikipedia: "DWs are central repositories of integrated data from one or more disparate sources. They store current and historical data in one single place that are used for creating analytical reports for workers throughout the enterprise."
I consider the Kimball Group one of the most authoritative sources on the topic, as they have been developing their framework and methodologies for over two decades, applying the framework across different business and technical areas, and sharing the outcomes and results.
Kimball's The Data Warehouse Toolkit is one of the reference books on the topic, and it defines a data warehouse as "a copy of transaction data specifically structured for query and analysis".
Bill Inmon is also considered one of the pioneers of data warehousing, and he defines a data warehouse as "a subject-oriented, integrated, time-variant and non-volatile collection of data in support of management's decision making process".
A data warehouse does not have to be implemented on a relational database system, although it is very common to implement Kimball's dimensional models in RDBMS or different database systems that support the concepts of "joinable" tables (e.g. Redshift, Presto, Hive).
A recent addition to data architectures, which perfectly accommodates complex data types, is the concept of a data lake: usually a data store that can handle virtually any kind of data (e.g. S3, HDFS), which can either be analyzed directly (e.g. MapReduce over XML files on S3) or processed into different formats or data models (such as a dimensional model).
Edit following your comment:
A Data Warehouse and a Data Lake are independent systems that serve different purposes, can/should be complementary, and both are part of a larger data architecture. A data lake, as a concept, can be just another data source for dimensional models on a data warehouse (although the technological implementation of data lakes enables direct querying over the raw data).
You can think of a Data Lake as a "landing zone" where several systems dump data in a "complex/raw format", e.g. MP3 files from customer support calls, gzipped logs from web servers. It's meant to sit there for historical purposes and for further processing into a format that can be easily analyzed/reported over, e.g. text extraction from MP3 files.
A Data Warehouse also aggregates data from different systems, but the data is modeled into a format appropriate for reporting (like a dimensional model), its model reflects the business/domain's processes and transactions, and is usually highly curated.
Imagine this case: if you log visits to your online store using web server logs, you could keep the gzipped logs (the "transaction data") in a data lake and then process them into a dimensional model (like this), which would be the "copy of transaction data specifically structured for query and analysis", so business users can easily explore it in Excel or another reporting tool.
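That "raw logs into a dimensional model" step can be sketched with the standard library. The log format, surrogate keys, and table shapes below are simplified assumptions, not any particular tool's schema:

```python
# Illustrative sketch: raw web-server log lines become rows in a visits fact
# table plus a small date dimension, the shape reporting tools can slice.
import re
from collections import defaultdict

raw_logs = [
    '203.0.113.7 - - [05/Mar/2021] "GET /product/42 HTTP/1.1" 200',
    '203.0.113.9 - - [05/Mar/2021] "GET /cart HTTP/1.1" 200',
    '203.0.113.7 - - [06/Mar/2021] "GET /checkout HTTP/1.1" 200',
]

LOG_RE = re.compile(r'^(\S+) - - \[([^\]]+)\] "GET (\S+) HTTP/1.1" (\d+)$')

date_dim = {}        # date string -> surrogate key
fact_visits = []     # (date_key, ip, path, status)

for line in raw_logs:
    ip, date, path, status = LOG_RE.match(line).groups()
    date_key = date_dim.setdefault(date, len(date_dim) + 1)
    fact_visits.append((date_key, ip, path, int(status)))

# A typical reporting query over the fact table: visits per day.
visits_per_day = defaultdict(int)
for date_key, *_ in fact_visits:
    visits_per_day[date_key] += 1
print(dict(visits_per_day))  # {1: 2, 2: 1}
```

The raw lines would stay in the lake untouched; only the fact and dimension tables land in the warehouse for business users.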

How is BI related to data mining?

I'm a little confused on how to connect BI with data mining. Can BI be termed as some kind of a manifestation of data mining?
How different is a BI tool like Microsoft Analysis Services from a data mining tool like Weka?
I guess BI involves more reporting and analysis of data, wherein the data undergoes some kind of aggregation and is represented in the form of cubes; but data mining also involves different algorithms to perform clustering, no?
Any pointers?
cheers
BI small is generating a detail report (a list of today's sales). Very little math is involved, maybe counting rows and summing sales. This is where you see reporting tools called "BI".
BI medium is generating a metric (profit margin for the quarter). It's just simple algebra, but producing it on a frequent basis is a challenge on account of the sheer amount of data. This is the world of cubes and OLAP.
BI large is doing mathematical modeling. This may be anything from linear regression to statistical models, you name it. The key here is that the models use large quantities of data. Real statisticians use the phrase "data mining" in a derogatory sense, because people untrained in the use of statistics are likely to mine the data until they find a spurious correlation. The bigger your data set, the more likely you are to find relationships due to chance rather than a relationship that really exists in reality.
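That multiple-comparisons pitfall is easy to demonstrate: generate columns of pure noise, and scanning enough of them will still turn up one that correlates "strongly" with a noise target. A small self-contained sketch (all data here is synthetic):

```python
# Sketch of why mining many variables yields spurious correlations:
# every column below is independent Gaussian noise, yet the best of 200
# columns still shows a visible correlation with the (also noise) target.
import random

random.seed(0)
n_rows, n_cols = 50, 200
data = [[random.gauss(0, 1) for _ in range(n_rows)] for _ in range(n_cols)]
target = [random.gauss(0, 1) for _ in range(n_rows)]

def corr(xs, ys):
    """Pearson correlation coefficient."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

best = max(abs(corr(col, target)) for col in data)
print(f"strongest spurious correlation: {best:.2f}")
```

With only ~50 rows and 200 candidate variables, a "relationship" well above zero is essentially guaranteed by chance alone, which is exactly the trap the statisticians complain about.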
Because the customers for BI are line-of-business managers, not PhD grad students, vendors like Microsoft et al. have dumbed it down by providing us with black-box "Data Mining" tools, many of which are the same as what you'd find in SAS and the like.
The only thing I see connecting all of these applications of the phrase BI is that they all are using large quantities of data to make a business decision.
To answer your general question "Is Business Intelligence a manifestation of data mining", it's actually the other way around.
BI is, in a general definition, using your firm's data to understand your market conditions and make decisions. So, as MatthewMartin said, it can be as simple as an SSRS report or as complex as a real-time decision support/AI system.
Data mining is an aspect of BI, in that data mining can be used on massive amounts of data for knowledge discovery and prediction, using tools that implement algorithms such as clustering, neural networks, etc.
