For example, let's say I wish to analyze a month's worth of company data for trends. I plan on doing regression analysis and classification using an MLP.
A month's worth of data has ~10 billion data points (rows).
There are 30 dimensions to the data.
12 features are numeric (integer or float; continuous).
The rest are categorical (integer or string).
Currently the data is stored in flat files (CSV) and is processed and delivered in batches. Data analysis is carried out in R.
I want to:
change this to stream processing (rather than batch processing).
offload the computation to a Spark cluster
house the data in a time-series database to facilitate easy read/write and query. In addition, I want the cluster to be able to query data from the database when loading the data into memory.
I have an Apache Kafka system that can publish the feed for the processed input data. I can write a Go module to interface this with the database (via cURL/HTTP, or a Go client library if one exists).
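As a sketch of that interface layer (the metric and tag names below are invented, and OpenTSDB's HTTP /api/put endpoint is assumed as the target), the translation step can be prototyped in Python before committing to the Go module:

```python
import json

def to_tsdb_point(row):
    # Map one processed Kafka record to an OpenTSDB-style /api/put
    # JSON datapoint. Metric and tag names here are invented.
    return {
        "metric": "company.data.value",
        "timestamp": row["ts"],          # Unix epoch seconds
        "value": row["value"],           # one numeric feature
        "tags": {k: str(v) for k, v in row["tags"].items()},
    }

# Example record as it might arrive from the Kafka feed:
record = {"ts": 1500000000, "value": 42.5, "tags": {"region": "eu", "product": "p1"}}
payload = json.dumps([to_tsdb_point(record)])  # /api/put accepts a JSON array
```

The actual Go module would POST this payload to the database's HTTP endpoint; the sketch only shows the record-to-datapoint mapping.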
There is already a development Spark cluster available to work with (assume that it can be scaled as necessary, if and when required).
But I'm stuck on the choice of database. There are many solutions (here is a non-exhaustive list), but I'm looking at OpenTSDB, Druid and Axibase Time Series Database.
Other time-series databases which I have looked at briefly (InfluxDB, Riak TS and Prometheus) seem to be optimised more for handling metrics data.
"Spark runs on Hadoop, Mesos, standalone, or in the cloud. It can access diverse data sources including HDFS, Cassandra, HBase, and S3." - Apache Spark website
In addition, the time-series database should store the data in a fashion that exposes it directly to Spark (as this is time-series data, it should be immutable and therefore satisfies the requirements of an RDD, so it can be loaded natively by Spark into the cluster).
Once the data (or the data with dimensionality reduced by dropping categorical elements) is loaded, regression and classification models can be developed and experimented with using sparklyr (an R interface for Spark) and Spark's machine learning library (MLlib; this cheatsheet provides a quick overview of the functionality).
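As a toy illustration of the dimensionality-reduction step mentioned above (column names are invented; the real pipeline would do this in Spark before fitting MLlib models), dropping the categorical columns leaves only the numeric features:

```python
# Hypothetical schema: 30 columns, 12 numeric, the rest categorical.
NUMERIC_COLS = {"price", "quantity", "discount"}  # invented names

def drop_categoricals(row):
    # Keep only the numeric features for the regression experiments.
    return {k: v for k, v in row.items() if k in NUMERIC_COLS}

row = {"price": 9.99, "quantity": 3, "discount": 0.1, "region": "eu", "sku": "A-42"}
reduced = drop_categoricals(row)
```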
So, my questions:
Does this seem like a reasonable approach to working with big data?
Is my choice of database solutions correct? (I am set on working with a columnar store and time-series database; please do not recommend a SQL/relational DBMS.)
If you have prior experience working with data analysis on clusters, from both an analytics and systems point of view (as I am doing both), do you have any advice/tips/tricks?
Any help would be greatly appreciated.
Related
I'm currently thinking about a little "BigData" project where I want to record some utilization figures every 10 minutes and write them to a DB over several months or years.
I then want to analyze the data e.g. in these ways:
Which time of the day is best (in terms of a low utilization)?
What are the differences in utilization between normal weekdays and days on the weekend?
At what time does the higher part of the utilization begin on a normal Monday?
For this I obviously need the possibility to build averaged graphs for, e.g., all Mondays that were recorded so far.
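The averaging itself is straightforward once each sample carries a timestamp; a minimal sketch in Python (sample values invented) groups by weekday and hour so that, e.g., all Mondays at 09:00 average together:

```python
from collections import defaultdict
from datetime import datetime
from statistics import mean

# (timestamp, utilization) samples; one every 10 minutes in the real
# setup, but these few invented points suffice for illustration.
samples = [
    (datetime(2018, 1, 1, 9, 0), 0.30),   # a Monday
    (datetime(2018, 1, 8, 9, 0), 0.50),   # the next Monday
    (datetime(2018, 1, 6, 9, 0), 0.10),   # a Saturday
]

by_key = defaultdict(list)
for ts, util in samples:
    # Group by (weekday, hour): Monday is 0, Saturday is 5.
    by_key[(ts.weekday(), ts.hour)].append(util)

averages = {key: mean(vals) for key, vals in by_key.items()}
```

In a SQL database the same grouping would be a GROUP BY over extracted weekday/hour columns, which is exactly the kind of query InfluxDB's language makes awkward.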
For the first proof of concept I set up an InfluxDB instance and Grafana, which works quite well for watching the data being written to the DB, but the more I research online the more I see that InfluxDB is not made for what I want to do (or cannot do it yet).
So which Database would be best to record and analyze data like that? Or is it more like a question about which tool to use to analyze the data? Which tool could that be?
InfluxDB's query language is not flexible enough for your kind of questions.
SQL databases supported by Grafana (MySQL, Postgres, TimescaleDB, ClickHouse) seem to fit better. The choice depends on your preferences and the amount of data. For smaller datasets plain MySQL or Postgres may be enough. For higher loads consider TimescaleDB. For billions of data points ClickHouse is probably the better fit.
If you want a lightweight but scalable NoSQL time-series solution, have a look at VictoriaMetrics.
I'm designing the data provisioning module in a big data system. Data provisioning is described as
The process of providing the data from the Data Lake to downstream systems is referred to as Data Provisioning; it provides data consumers with secure access to the data assets in the Data Lake and allows them to source this data. Data delivery, access, and egress are all synonyms of Data Provisioning and can be used in this context.
in Data Lake Development with Big Data. I'm looking for some standards for designing this module, including how to secure the data, how to identify which system a given piece of data came from, etc. I have searched on Google but there are not many results for that keyword. Can you provide me with some advice or your own experience with this problem? Every answer is appreciated.
Thank you!
Data provisioning is mainly done by creating different data marts for your downstream consumers. For example, if you have a big data system with data coming from various sources aggregated into one data lake, you can create different data marts, like 'Purchase', 'Sales', 'Inventory' etc., and let downstream systems consume these. So a downstream consumer who needs only 'Inventory' data consumes only the 'Inventory' data mart.
Your best bet is to search for 'Data Marts'. For example, ref: https://panoply.io/data-warehouse-guide/data-mart-vs-data-warehouse/
Now you can fine-tune security and access control per data mart. For example:
'Sales' data is accessible only to sales reporting systems, users, groups etc.
Tokenize sensitive fields in the 'Purchase' mart, etc. All up to the business requirements.
Another way is to export the aggregated data via data export mechanisms, for example using Apache Sqoop to offload data to an RDBMS. This approach is advisable when the data to export is small enough for the downstream consumer.
Another way is to create separate 'consumer zones' in the same data lake, for example a different Hadoop directory or Hive DB.
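A minimal sketch of the data-mart idea (domain names invented): records from the shared lake are routed into per-domain collections, so a consumer granted access to one mart sees only that domain's records:

```python
# Records from the shared data lake, each tagged with a business domain.
records = [
    {"domain": "sales", "amount": 120},
    {"domain": "inventory", "sku": "A-42", "stock": 7},
    {"domain": "sales", "amount": 80},
]

marts = {}
for rec in records:
    # Access control is then applied per mart, not per record.
    marts.setdefault(rec["domain"], []).append(rec)
```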
I want to define data warehouse with the necessary literature reference.
I found the following definition on Wikipedia:
"DWs are central repositories of integrated data from one or more disparate sources. They store current and historical data in one single place that are used for creating analytical reports for workers throughout the enterprise."
Does that imply that there is always a relational database underneath a data warehouse, or can it be any kind of repository?
In An Architecture Framework for Complex Data Warehouses the term data warehouse is also used for complex data which means video, images etc. but the term data warehouse remains undefined in that paper.
A "Data warehouse" is mostly an information systems concept that describes a centralized and trusted source of (e.g. company/business) data.
From Wikipedia: "DWs are central repositories of integrated data from one or more disparate sources. They store current and historical data in one single place that are used for creating analytical reports for workers throughout the enterprise."
I consider the Kimball Group one of the most authoritative sources on the topic, as they have been developing their framework and methodologies for over two decades, and they've also been applying that framework to different business and technical areas and sharing the outcomes and results of this.
Kimball's The Data Warehouse Toolkit is one of the reference books on the topic, and it defines a data warehouse as "a copy of transaction data specifically structured for query and analysis".
Bill Inmon is also considered one of the pioneers of data warehousing, and defines a data warehouse as "a subject-oriented, integrated, time-variant and non-volatile collection of data in support of management's decision making process".
A data warehouse does not have to be implemented on a relational database system, although it is very common to implement Kimball's dimensional models in RDBMS or different database systems that support the concepts of "joinable" tables (e.g. Redshift, Presto, Hive).
A recent addition to data architectures, which perfectly accommodates complex data types, is the concept of a data lake: usually a data store that can handle virtually any kind of data type (e.g. S3, HDFS), which can either be analyzed directly (e.g. MapReduce over XML files on S3) or processed into different formats or data models (like a dimensional model).
Edit following your comment:
A Data Warehouse and a Data Lake are independent systems that serve different purposes, can/should be complementary, and both are part of a larger data architecture. A data lake, as a concept, can be just another data source for dimensional models on a data warehouse (although the technological implementation of data lakes enables direct querying over the raw data).
You can think of a Data Lake as a "landing zone" where several systems dump data in a "complex/raw format", e.g. MP3 files from customer support calls, gzipped logs from web servers. It's meant to sit there for historical purposes and for further processing into a format that can be easily analyzed/reported over, e.g. text extraction from MP3 files.
A Data Warehouse also aggregates data from different systems, but the data is modeled into a format appropriate for reporting (like a dimensional model), its model reflects the business/domain's processes and transactions, and is usually highly curated.
Imagine this case: if you log visits to your online store using web server logs, you could keep the gzipped logs (the "transaction data") in a data lake and then process the data into a dimensional model (like this) which will be the "copy of transaction data specifically structured for query and analysis", so business users can easily explore it in Excel or some other reporting tool.
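A tiny sketch of that processing step (the log line and field choices are invented): one raw web-server log entry from the lake is parsed into a row suitable for a dimensional model:

```python
import re
from datetime import datetime

# One invented web-server log line as it might sit in the data lake.
line = '203.0.113.5 - - [10/Oct/2017:13:55:36 +0000] "GET /product/42 HTTP/1.1" 200 512'

# Extract only the fields the dimensional model cares about.
m = re.match(r'(\S+) \S+ \S+ \[([^\]]+)\] "(\S+) (\S+)[^"]*" (\d+) (\d+)', line)
ip, ts_raw, method, path, status, size = m.groups()

fact_row = {
    "visit_time": datetime.strptime(ts_raw, "%d/%b/%Y:%H:%M:%S %z"),
    "page": path,
    "status": int(status),
    "bytes_sent": int(size),
}
```

The raw gzipped logs stay in the lake; only rows like fact_row land in the warehouse's fact table.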
I'm using RStudio to run analysis on large datasets stored in BigQuery. The dataset is private and from a large retailer that shared the dataset with me via BigQuery to run the required analyses.
I used the bigrquery library to connect R to BigQuery, but couldn't find answers to the following two questions:
1) When I use R to run the analyses (e.g. first using SELECT to get the data and storing it in a data frame in R), is the data then somehow stored locally on my laptop? The company is concerned about confidentiality and probably doesn't want me to store the data locally but to leave it in the cloud. But is it even possible to use R then?
2) My BigQuery free tier includes 1 TB of query processing per month. If I use SELECT in R to get the data, it for instance tells me "18.1 gigabytes processed"; do I also use up my 1 TB if I run analyses in R instead of running queries on BigQuery? If it doesn't incur cost, then I'm wondering what the advantage is of running queries on BigQuery instead of in R, if the former might cost me money in the end?
Best
Jennifer
As far as I know, Google's BigQuery is an entirely cloud-based database. This means that when you run a query or report on BigQuery, it happens in the cloud, not locally (i.e. not in R). That is not to say your source data can't be local; in fact, as you have seen, you may upload a local data set from R. But the query executes in the cloud and then returns the result set to R.
With regard to your other question, the source data in the BigQuery tables would remain in the cloud, and the only exposure to the data you would have locally would be the results of any query you might execute from R. Obviously, if you ran SELECT * on every table, you could see all the data in a particular database. So I'm not sure how much of a separation of concerns there would really be in your setup.
As for pricing, from the BigQuery documentation on pricing:
Query pricing refers to the cost of running your SQL commands and user-defined functions. BigQuery charges for queries by using one metric: the number of bytes processed. You are charged for the number of bytes processed whether the data is stored in BigQuery or in an external data source such as Google Cloud Storage, Google Drive, or Google Cloud Bigtable.
So you get 1 TB of free query processing per month, after which you start getting billed.
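A quick back-of-the-envelope check, using the figures from the question (and treating 1 TB as 1024 GB), shows how many queries of that size fit inside the free tier each month:

```python
# BigQuery bills by bytes processed; the free tier covers 1 TB per month.
FREE_TIER_GB = 1024.0        # 1 TB expressed in GB
bytes_per_query_gb = 18.1    # "18.1 gigabytes processed" from the question

# Whole queries of that size that fit in the free allowance.
queries_per_month = int(FREE_TIER_GB / bytes_per_query_gb)
```

Note that the bytes are counted when the query runs in BigQuery, regardless of whether you issue it from the console or from R via bigrquery.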
Unless you explicitly save to a file, R stores the data in memory. Because of the way sessions work, however, RStudio will keep a copy of the session unless you tell it not to, which is why it asks whether you want to save your session when you exit or switch projects. To be sure you are not storing anything, when you are done for the day (or whatever) use the broom icon in the Environment tab to delete everything in the environment. Or you can delete a data frame or other object individually with rm(obj), or go to the Environment pane, change "List" to "Grid", and select individual objects to remove. See "How do I clear only a few specific objects from the workspace?", which addresses this part of my answer (but this is not a duplicate question).
We are moving to AWS EMR/S3 and using R for analysis (the sparklyr library). We have 500 GB of sales data in S3 containing records for multiple products. We want to analyze data for a couple of products and read only a subset of the files into EMR.
So far my understanding is that spark_read_csv will pull in all the data. Is there a way in R/Python/Hive to read data only for the products we are interested in?
In short, CSV sits at the wrong end of the efficiency spectrum for this kind of selective read.
Using data
partitioned (the partitionBy option of the DataFrameWriter, or a correct directory structure) by the column of interest, or
clustered (the bucketBy option of the DataFrameWriter plus a persistent metastore) on the column of interest,
can help narrow the search to particular partitions in some cases, but if filter(product == p1) is highly selective, then you're likely looking at the wrong tool.
Depending on the requirements:
A proper database.
Data warehouse on Hadoop.
might be a better choice.
You should also consider choosing a better storage format (like Parquet).
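A minimal sketch of why a partitioned layout helps (paths and values invented; Spark's partitionBy writes the same Hive-style product=<value>/ directory structure): a reader can skip whole directories instead of scanning the full dataset:

```python
import csv
import os
import tempfile

# Build a tiny Hive-style partitioned layout: product=<id>/part.csv
root = tempfile.mkdtemp()
for product, rows in {"p1": [["2018-01-01", "10"]], "p2": [["2018-01-01", "99"]]}.items():
    d = os.path.join(root, f"product={product}")
    os.makedirs(d)
    with open(os.path.join(d, "part.csv"), "w", newline="") as f:
        csv.writer(f).writerows(rows)

def read_products(root, wanted):
    out = []
    for name in os.listdir(root):
        value = name.split("=", 1)[1]
        if value not in wanted:
            continue  # partition pruning: these files are never opened
        with open(os.path.join(root, name, "part.csv"), newline="") as f:
            out.extend(csv.reader(f))
    return out

rows = read_products(root, {"p1"})  # only the product=p1 directory is read
```

Spark does this pruning automatically when the partition column appears in a filter; with a single flat CSV there is no structure to prune, so everything gets scanned.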