I am contemplating using BigQuery for unstructured data analysis.
I am aware that the ability to run ad-hoc queries over terabytes of data forms one of the biggest strengths of BigQuery.
How do I use this potential to handle unstructured data?
As I understand it, BigQuery operates over data represented in the form of relations, and that is the structure to follow when you feed data into BigQuery.
Is there any way BigQuery can be made to operate over unstructured data, for example data contained in documents? (Without, of course, processing the documents first and then feeding the output to BigQuery.)
BigQuery works with SQL (Structured Query Language) over tables stored in columnar format - so everything is pretty structured.
Still, you could import documents into BigQuery as a table with a single string column, where each row can store up to 2 MB - then you could apply the power of BigQuery to that text, as long as you can express your analysis using SQL.
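For example, a minimal sketch of that approach, assuming a hypothetical table my_dataset.documents with a single content STRING column holding one document per row:
-- Hypothetical table: my_dataset.documents(content STRING), one document per row.
-- Count documents mentioning a term and compute their average length.
SELECT
  COUNT(*) AS matching_docs,
  AVG(LENGTH(content)) AS avg_doc_length
FROM my_dataset.documents
WHERE REGEXP_CONTAINS(content, r'(?i)invoice');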
Coming soon: the ability to write JavaScript inside your SQL queries.
Context: We store historical data in Azure Data Lake as versioned Parquet files written by our existing Databricks pipeline, which writes to different Delta tables. One particular log source is about 18 GB a day in Parquet. I have read through the documentation and executed some queries using Kusto.Explorer against the external table I have defined for that log source. In the query summary window of Kusto.Explorer I see that the entire folder is downloaded when I query it, even when using the project operator. The only exception to that seems to be when I use the take operator.
Question: Is it possible to prune columns to reduce the amount of data being fetched from external storage, either during external table creation or via an operator at query time?
Background: The reason I ask is that in Databricks it is possible to use a SELECT statement to fetch only the columns I'm interested in. This reduces the query time significantly.
As David wrote above, the optimization does happen on the Kusto side, but there's a bug with the "Downloaded Size" metric - it presents the total data size regardless of the selected columns. We'll fix it. Thanks for reporting.
I exported hierarchically structured data with Firebase hash-keys to BigQuery. However, since the data is not structured in tables, I don't know how to use SQL queries to get the desired information. Is this possible in principle, or do I need to convert / flatten the data into tables first? Google seems to advise visualizing data in Data Studio using BigQuery as the source (not Firebase/Firestore directly). Yet, I cannot find any useful information / sample queries for this case. Thank you very much in advance.
I'm not familiar with the "hierarchically structured data with Firebase hash-keys", but the general guideline for querying such 'blob/text' data (from a BigQuery perspective) is:
As you said, use a separate pipeline to load / save the data into a BQ table structure, or
Define a function to access your data. Since the function body can be JavaScript, you have full flexibility to parse / read your text/blob data:
CREATE FUNCTION yourDataset.getValue(input STRING, key STRING)
RETURNS STRING
LANGUAGE js
AS """
  // Example body (assumes the blob is JSON); replace with whatever parsing your data needs.
  try {
    var v = JSON.parse(input)[key];
    return v === undefined ? null : String(v);
  } catch (e) {
    return null;
  }
""";
I would like to set up event logging for my application: simple information such as date (YYYYMMDD), activity, and appVersion. Later I would like to query this to answer simple questions such as how many times a certain activity occurred each month.
From what I see there are a few different database APIs in Cosmos DB, such as NoSQL and Cassandra.
Which would be the most suitable to meet my simple needs?
You can use the Cosmos DB SQL API for storing this data. It has rich querying capabilities and also has great support for aggregate functions.
One thing you would need to keep in mind is your data partitioning strategy: design your container's partition key accordingly. Considering you're going to aggregate data on a monthly basis, I would recommend a partition key based on year and month so that the data for a given month (and year) stays in a single logical partition. However, please note that a logical partition can only contain 10 GB of data (including indexes), so you may have to rethink your partitioning strategy if you expect the data to grow beyond that.
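A rough sketch of both points, assuming hypothetical event documents shaped like { "yearMonth": "2023-04", "date": "20230415", "activity": "login", "appVersion": "1.2.0" }, with yearMonth as the partition key. A monthly count per activity, scoped to a single logical partition, could then look like:
SELECT c.activity, COUNT(1) AS occurrences
FROM c
WHERE c.yearMonth = "2023-04"
GROUP BY c.activity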
A cheaper alternative would be Azure Table Storage; however, it has neither such rich querying capabilities nor built-in aggregation. With some code (running in Azure Functions, for example), you can aggregate the data yourself.
I currently have a 100+ GB table in BigQuery that I would like to retrieve into R. I am using the list_tabledata() function from the bigrquery package, but it takes a huge amount of time.
Does anyone have recommendations for handling this amount of data in R and for boosting performance, such as packages or tools?
tabledata.list is not a great way to consume a large amount of table data from BigQuery - as you note, it's not very performant. I'm not sure if bigrquery has support for table exports, but the best way to retrieve data from a large BigQuery table is using an export job. This will dump the data to a file on Google Cloud Storage that you can then download to your desktop. You can find more info on exporting tables in our documentation.
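If your BigQuery setup supports it, one way to express such an export is the EXPORT DATA statement; a sketch, with a made-up bucket and table name:
EXPORT DATA OPTIONS(
  uri = 'gs://my-bucket/big_table_export/*.csv',
  format = 'CSV',
  overwrite = true,
  header = true
) AS
SELECT *
FROM my_dataset.big_table;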
Another option would be: instead of bringing that large volume of data to the code, try to bring your code to the data. This can be challenging in terms of implementing the logic in BigQuery SQL. A JS UDF might help. It depends.
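For example, a sketch of pushing an aggregation into BigQuery so that only a small result set is downloaded into R (table and column names are hypothetical):
SELECT
  user_id,
  COUNT(*) AS n_rows,
  AVG(value) AS avg_value
FROM my_dataset.events
GROUP BY user_id;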
If this is not doable, I would recommend either using sampled data or revisiting your model.
The first assertion is that document-style NoSQL databases such as MarkLogic and MongoDB should store each piece of information in a nested/complex object.
Consider the following model:
<patient>
  <patientid>1000</patientid>
  <firstname>Johnny</firstname>
  <claim>
    <claimid>1</claimid>
    <claimdate>2015-01-02</claimdate>
    <charge><amount>100</amount><code>374.3</code></charge>
    <charge><amount>200</amount><code>784.3</code></charge>
  </claim>
  <claim>
    <claimid>2</claimid>
    <claimdate>2015-02-02</claimdate>
    <charge><amount>300</amount><code>372.2</code></charge>
    <charge><amount>400</amount><code>783.1</code></charge>
  </claim>
</patient>
In the relational world this would be modeled as a patient table, claim table, and claim charge table.
Our primary desire is to feed downstream applications with this data while also performing analytics on it. Since we don't want to write a complex program for every measure, we should be able to put a tool on top of this. For example, Tableau claims to have a native connection with MarkLogic, which works through ODBC.
When we create views using range indexes on our document model, SQL queries against them in MarkLogic return excessive repeating rows. The charge amounts are also double-counted by sum functions. It does not work.
The thought is that through these index, view, and possibly fragment techniques of MarkLogic, we can define a semantic layer that resembles a relational structure.
The documentation hints that you should create one object per table, but this seems to go against the preferred document-DB structure.
What is the data modeling and application pattern to store large amounts of document data and then provide a turnkey analytics tool on top of it?
If the ODBC connection is always going to return bad data and be unaware of relationships, then the claims of ODBC support against NoSQL made by all of these tools are effectively untrue.
References
https://docs.marklogic.com/guide/sql/setup
https://docs.marklogic.com/guide/sql/tableau
http://www.marklogic.com/press-releases/marklogic-and-tableau-build-connection/
https://developer.marklogic.com/learn/arch/data-model
For your question: "What is the data modeling and application pattern to store large amounts of document data and then provide a turnkey analytics tool on top of it?"
The rule of thumb I use is that when I want to count "objects", I model them as separate documents. So if you want to run queries that count patients, claims, and charges, you would put them in separate documents.
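For instance, a sketch of how the sample patient above could be split into separate documents (the patientid / claimid references are just one illustrative way to link them):
<patient>
  <patientid>1000</patientid>
  <firstname>Johnny</firstname>
</patient>

<claim>
  <claimid>1</claimid>
  <patientid>1000</patientid>
  <claimdate>2015-01-02</claimdate>
</claim>

<charge>
  <claimid>1</claimid>
  <amount>100</amount>
  <code>374.3</code>
</charge>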
That doesn't mean we're constraining MarkLogic to only relational patterns. In UML terms, a one-to-many relationship can be a composition or an aggregation. In a relational model, I have no choice but to model those as separate tables. But in a document model, I can do separate documents per object or roll them all together - the choice is usually based on how I want to query the data.
So your first assertion is partially true - in a document store, you have the option of nesting all your related data, but you don't have to. Also note that because MarkLogic is schema-agnostic, it's straightforward to transform your data as your requirements evolve (corb is a good option for this). Certain requirements may require denormalization to help searches run efficiently.
Brief example - a person can have many names (aliases, maiden name) and many addresses (different homes, work address). In a relational model, I'd need a persons table, a names table, and an addresses table. But I'd consider the names to be a composite relationship - the lifecycle of a name equals that of the person - and so I'd rather nest those names into a person document. An address OTOH has a lifecycle independent of the person, so I'd make that an address document and toss an element onto the person document for each related address. From an analytics perspective, I can now ask lots of interesting questions about persons and their names, and persons and addresses - I just can't get counts of names efficiently, because names aren't in separate documents.
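A sketch of that shape, with made-up element names:
<person>
  <personid>42</personid>
  <name type="current">Jane Smith</name>
  <name type="maiden">Jane Doe</name>
  <addressref>addr-100</addressref>
  <addressref>addr-200</addressref>
</person>

<address>
  <addressid>addr-100</addressid>
  <street>1 Main St</street>
  <city>Springfield</city>
</address>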
I guess MarkLogic is a little atypical compared to other document stores. It works best when you don't store an entire table as one document, but rather one record per document. MarkLogic indexing is optimized for this approach, and handles searching across millions of documents easily that way. You will see that as soon as you store records as documents, results in Tableau will improve greatly.
Splitting documents into such small fragments also allows higher performance and a lower footprint. MarkLogic doesn't hold the data as persisted DOM trees that allow random access. Instead, it streams the data in a very efficient way and relies on index resolution to pull relevant fragments quickly.
HTH!