How should I store EEG data in FHIR as an electronic health record (EHR)? - dstu2-fhir

I have EEG data sets in either MATLAB or EDF format. I want to store them as FHIR records.
Also, are there any sample mental health records available in FHIR format?

MATLAB or EDF data would be stored as a Binary resource. If you want to capture metadata about it (e.g. which patient it's for, when it was created, etc.), you can use Media.
I'd suggest raising the question about mental health records on http://chat.fhir.org. I'm not aware of any as part of the specification.
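As a rough illustration of that Binary + Media split, here is a minimal sketch in R using httr and jsonlite. The server base URL, file name, and patient id are hypothetical, and the exact Media elements (in particular the required type code) should be checked against the DSTU2 Media definition before use.

```r
library(httr)
library(jsonlite)
library(base64enc)

base <- "http://example.org/fhir"   # hypothetical DSTU2 endpoint

# 1. Upload the raw EDF file as a Binary resource (DSTU2 Binary carries the
#    payload in the base64-encoded 'content' element).
binary <- list(
  resourceType = "Binary",
  contentType  = "application/octet-stream",
  content      = base64encode("recording.edf")   # hypothetical local file
)
resp <- POST(paste0(base, "/Binary"),
             body = toJSON(binary, auto_unbox = TRUE),
             content_type("application/json+fhir"))
binary_url <- headers(resp)$location   # server-assigned location of the Binary

# 2. Describe the recording with a Media resource that points at the Binary
#    and carries the metadata (patient, creation time, ...).
media <- list(
  resourceType = "Media",
  type    = "photo",   # placeholder: choose an appropriate code from the DSTU2 value set
  subject = list(reference = "Patient/example"),   # hypothetical patient id
  content = list(contentType = "application/octet-stream",
                 url      = binary_url,
                 creation = "2016-01-01T10:00:00Z")
)
POST(paste0(base, "/Media"),
     body = toJSON(media, auto_unbox = TRUE),
     content_type("application/json+fhir"))
```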

Related

Design data provisioning strategy for big data system?

I'm designing a data provisioning module in a big data system. Data provisioning is described as
The process of providing the data from the Data Lake to downstream systems is referred to as Data Provisioning; it provides data consumers with secure access to the data assets in the Data Lake and allows them to source this data. Data delivery, access, and egress are all synonyms of Data Provisioning and can be used in this context.
in Data Lake Development with Big Data. I'm looking for standards for designing this module, including how to secure the data, how to identify whether a given piece of data originates from the system, etc. I have searched on Google but there are not many results related to those keywords. Can you provide me with some advice or your own experience related to this problem? Every answer is appreciated.
Thank you!
Data Provisioning is mainly done by creating different Data Marts for your downstream consumers. For example, if you have a big data system with data coming from various sources aggregated into one Data Lake, you can create different Data Marts, like 'Purchase', 'Sales', and 'Inventory', and let downstream systems consume these. A downstream consumer that needs only 'Inventory' data then consumes only the 'Inventory' Data Mart.
Your best bet is to search for 'Data Marts'. For example, ref: https://panoply.io/data-warehouse-guide/data-mart-vs-data-warehouse/
Now you can fine-tune security and access control per data mart. For example:
'Sales' data is accessible only to sales reporting systems, users, groups, etc.
Tokenize sensitive fields in the 'Purchase' data mart, etc. It all comes down to the business requirements.
Another way is to export the aggregated data via data export mechanisms, for example using 'Apache Sqoop' to offload data to an RDBMS. This approach is advisable when the data to export is small enough to be offloaded for the downstream consumer.
Another way is to create separate 'Consumer Zones' in the same Data Lake, for example a different Hadoop directory or a Hive database.
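To make the data-mart / consumer-zone idea concrete, here is a minimal sketch using sparklyr (the paths, table name, and columns are made up for the example): it reads the raw lake, filters and projects the 'Inventory' subset, and writes it to a separate directory on which access controls can then be applied for the inventory consumers alone.

```r
library(sparklyr)
library(dplyr)

sc <- spark_connect(master = "yarn")   # or "local" for testing

# Raw, aggregated data in the lake (hypothetical path and schema)
lake <- spark_read_parquet(sc, name = "lake_events",
                           path = "hdfs:///datalake/raw/events")

# Carve out the 'Inventory' data mart: only the rows and columns
# this downstream consumer is allowed to see.
inventory_mart <- lake %>%
  filter(domain == "inventory") %>%
  select(event_time, warehouse_id, sku, quantity)

# Write it to a separate consumer zone; directory-level permissions
# or policies can then be scoped to this path alone.
spark_write_parquet(inventory_mart,
                    path = "hdfs:///datalake/marts/inventory",
                    mode = "overwrite")

spark_disconnect(sc)
```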

Comparison between Big Data and Data Lakes, differences and similarities [duplicate]

This question already has answers here:
Is Data Lake and Big Data the same?
(4 answers)
Closed 3 years ago.
Can someone tell me the similarities and differences between Big Data and Data Lakes?
I can't find a satisfactory answer anywhere.
Big Data is a term used in very different ways; one might even call it a buzzword. Often it is used as a collective term for digital technologies, digitization, Industry 4.0, and many topics connected with the digital transformation.
In the less general interpretation, big data simply refers to a complex, large dataset. The term "big" then refers to three dimensions (see Wikipedia on Big Data):
volume, i.e. size of the data set
velocity at which the data volumes are generated
variety of data types and sources
A Data Lake refers to one approach to storing big data. Other possibilities for storing data are a traditional database, also called a relational database management system (RDBMS), on the one hand, and a data warehouse on the other; see for instance Data Lake vs. Data Warehouse vs. Database: What’s The Difference?
Big data and data lake are two different things.
A data lake is a concept where all your data is stored and easily accessible using different mechanisms. A data lake can be maintained on S3, Redshift, or any other storage platform.
Big data is a term used for processing large volumes of data. It is mostly associated with big data solutions like Hadoop and Spark.
I think we can't really compare and differentiate the two terms, because 'data lake' is often treated as a synonym for big data. Data lake = enterprise data + unstructured data + semi-structured data.
On the other hand, a data lake is a data repository: you can store any kind of data in it and use it for analysis. Mostly the data will be stored in the Hadoop Distributed File System (HDFS), whereas under "big data" there is storage plus some other processing technology involved.

What is a Data Warehouse and can it be applied to complex data?

I want to define "data warehouse" with the necessary literature references.
I found the following on Wikipedia (wiki):
DWs are central repositories of integrated data from one or more disparate sources. They store current and historical data in one single place that are used for creating analytical reports for workers throughout the enterprise.
Does that imply that there is always a relational database underneath a data warehouse, or can it be any kind of repository?
In An Architecture Framework for Complex Data Warehouses the term data warehouse is also used for complex data, meaning video, images, etc., but the term data warehouse remains undefined in that paper.
A "Data warehouse" is mostly an information systems concept that describes a centralized and trusted source of (e.g. company/business) data.
From Wikipedia: "DWs are central repositories of integrated data from one or more disparate sources. They store current and historical data in one single place that are used for creating analytical reports for workers throughout the enterprise."
I consider the Kimball Group one of the most authoritative sources on the topic, as they have been developing their framework and methodologies for over two decades, and they've also been applying that framework to different business and technical areas and sharing the outcomes and results of this.
Kimball's The Data Warehouse Toolkit is one of the reference books on the topic, and it defines a data warehouse as "a copy of transaction data specifically structured for query and analysis".
Bill Inmon is also considered one of the pioneers of data warehousing, and he defines a data warehouse as "a subject-oriented, integrated, time-variant and non-volatile collection of data in support of management's decision making process".
A data warehouse does not have to be implemented on a relational database system, although it is very common to implement Kimball's dimensional models in RDBMS or different database systems that support the concepts of "joinable" tables (e.g. Redshift, Presto, Hive).
A recent addition to data architectures, which perfectly accommodates complex data types, is the concept of a data lake: usually a data store that can handle virtually any kind of data type (e.g. S3, HDFS) and that can either be analyzed directly (e.g. MapReduce over XML files on S3) or processed into different formats or data models (like a dimensional model).
Edit following your comment:
A Data Warehouse and a Data Lake are independent systems that serve different purposes, can/should be complementary, and both are part of a larger data architecture. A data lake, as a concept, can be just another data source for dimensional models on a data warehouse (although the technological implementation of data lakes enables direct querying over the raw data).
You can think of a Data Lake as a "landing zone" where several systems dump data in a "complex/raw format", e.g. MP3 files from customer support calls, gzipped logs from web servers. It's meant to sit there for historical purposes and for further processing into a format that can be easily analyzed/reported over, e.g. text extraction from MP3 files.
A Data Warehouse also aggregates data from different systems, but the data is modeled into a format appropriate for reporting (like a dimensional model), its model reflects the business/domain's processes and transactions, and is usually highly curated.
Imagine this case: if you log visits to your online store using web server logs, you could keep the gzipped logs (the "transaction data") in a data lake and then process the data into a dimensional model (like this), which will be the "copy of transaction data specifically structured for query and analysis", so business users can easily explore it in Excel or some other reporting tool.
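As a toy illustration of that last step, here is a sketch in R (the column names and the pre-parsed log extract are made up for the example) that turns raw web-server log rows into a tiny dimensional model: a page-view fact table plus a date dimension.

```r
library(dplyr)
library(lubridate)

# Hypothetical raw log extract: one row per request, already parsed from the gzipped logs
raw_logs <- tibble(
  ts      = ymd_hms(c("2021-03-01 10:15:00", "2021-03-01 10:17:30", "2021-03-02 09:00:00")),
  user_id = c("u1", "u2", "u1"),
  url     = c("/product/42", "/cart", "/product/7")
)

# Date dimension: one row per calendar day, with attributes useful for reporting
dim_date <- raw_logs %>%
  transmute(date = as_date(ts)) %>%
  distinct() %>%
  mutate(date_key    = as.integer(format(date, "%Y%m%d")),
         year        = year(date),
         month       = month(date),
         day_of_week = wday(date, label = TRUE))

# Fact table: one row per page view, keyed to the date dimension
fact_page_view <- raw_logs %>%
  mutate(date_key = as.integer(format(as_date(ts), "%Y%m%d"))) %>%
  select(date_key, user_id, url, ts)
```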

Require advice on Time Series Database solution that fits with Apache Spark

For example, let's say I wish to analyze a month's worth of company data for trends. I plan on doing regression analysis and classification using an MLP.
A month's worth of data has ~10 billion data points (rows).
There are 30 dimensions to the data.
12 features are numeric (integer or float; continuous).
The rest are categorical (integer or string).
Currently the data is stored in flat files (CSV) and is processed and delivered in batches. Data analysis is carried out in R.
I want to:
change this to stream processing (rather than batch processing).
offload the computation to a Spark cluster
house the data in a time-series database to facilitate easy read/write and query. In addition, I want the cluster to be able to query data from the database when loading the data into memory.
I have an Apache Kafka system that can publish the feed for the processed input data. I can write a Go module to interface this into the database (via CURL, or a Go API if it exists).
There is already a development Spark cluster available to work with (assume that it can be scaled as necessary, if and when required).
But I'm stuck on the choice of database. There are many solutions (here is a non-exhaustive list) but I'm looking at OpenTSDB, Druid and Axibase Time Series Database.
Other time-series databases which I have looked at briefly, seem more as if they were optimised for handling metric data. (I have looked at InfluxDB, RiakTS and Prometheus)
"Spark runs on Hadoop, Mesos, standalone, or in the cloud. It can access diverse data sources including HDFS, Cassandra, HBase, and S3." - Apache Spark Website
In addition, the time-series database should store the data in a fashion that exposes it directly to Spark (as this is time-series data it should be immutable, so it satisfies the requirements of an RDD and can be loaded natively by Spark into the cluster).
Once the data (or the data with dimensionality reduced by dropping categorical elements) is loaded, regression and classification models can be developed and experimented with using sparklyr (an R interface for Spark) and Spark's machine learning library (MLlib; this cheatsheet provides a quick overview of the functionality).
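For what it's worth, the sparklyr end of that pipeline might look roughly like the sketch below; the table name, column names, response variable, and layer sizes are placeholders, and it assumes the time-series store already exposes a table to Spark.

```r
library(sparklyr)
library(dplyr)

sc <- spark_connect(master = "yarn")

# Hypothetical time-series table already exposed to Spark by the TSDB/connector
events <- tbl(sc, "events")

# Keep only the numeric features for the regression
numeric_events <- events %>%
  select(target, feature_1, feature_2, feature_3)

# Regression on the numeric features
reg_fit <- ml_linear_regression(numeric_events,
                                target ~ feature_1 + feature_2 + feature_3)
summary(reg_fit)

# MLP classifier via MLlib (label must be indexed 0..k-1; layer sizes are placeholders)
mlp_fit <- ml_multilayer_perceptron_classifier(
  events,
  label ~ feature_1 + feature_2 + feature_3,
  layers = c(3, 16, 2)
)

spark_disconnect(sc)
```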
So, my questions:
Does this seem like a reasonable approach to working with big data?
Is my choice of database solutions correct? (I am set on working with columnar store and time-series database, please do not recommend SQL/Relational DBMS)
If you have prior experience working with data analysis on clusters, from both an analytics and systems point of view (as I am doing both), do you have any advice/tips/tricks?
Any help would be greatly appreciated.

Categorical Clustering of Users Reading Habits

I have a data set with a set of users and a history of documents they have read; all the documents have metadata attributes (think topic, country, author) associated with them.
I want to cluster the users based on their reading history per one of the metadata attributes associated with the documents they have clicked on. This attribute has 7 possible categorical values and I want to prove a hypothesis that there is a pattern to the users' reading habits and they can be divided into seven clusters. In other words, that users will often read documents based on one of the 7 possible values in the particular metadata category.
Anyone have any advice on how to do this especially in R, like specific packages? I realize that the standard k-means algorithm won't work well in this case since the data is categorical and not numeric.
Cluster analysis cannot be used to prove anything.
The results are highly sensitive to normalization, feature selection, and the choice of distance metric, so no single result is trustworthy, and many of the results you get are outright useless; at best it is as reliable as a proof by example.
It should only be used for exploratory analysis, i.e. to find patterns that you then need to study with other methods.
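If you do want to explore the data along those lines (strictly as exploration, for the reasons above), two common options in R for categorical data are k-modes from the klaR package and PAM over a Gower dissimilarity matrix from the cluster package. A rough sketch, with a made-up data frame standing in for the per-user reading summaries:

```r
library(klaR)
library(cluster)

# Hypothetical per-user summary: the dominant value of each metadata attribute
# (all columns categorical), one row per user
reading <- data.frame(
  topic   = factor(c("politics", "sports", "politics", "tech")),
  country = factor(c("US", "UK", "US", "DE")),
  author  = factor(c("a1", "a2", "a1", "a3"))
)

# Option 1: k-modes clustering directly on the categorical attributes
km <- kmodes(reading, modes = 2)   # use modes = 7 for the 7 hypothesised clusters
km$cluster

# Option 2: Gower dissimilarity + PAM (handles mixed categorical/numeric data)
d  <- daisy(reading, metric = "gower")
pm <- pam(d, k = 2)
pm$clustering
```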
