Google BigQuery Optimization Strategies - google-analytics

I am querying data from Google Analytics Premium using Google BigQuery. At the moment, I have one single query which I use to calculate some metrics (like total visits or conversion rate). This query contains several nested JOIN clauses and nested SELECTs. While querying just one table I am getting the error:
Error: Resources exceeded during query execution.
Using GROUP EACH BY and JOIN EACH does not seem to solve this issue.
One solution to be adopted in the future involves extracting only the relevant data needed for this query and exporting it into a separate table (which will then be queried). This strategy works in principle, I have already a working prototype for it.
However, I would like to explore additional optimization strategies for this query that work on the original table.
In this presentation You might be paying too much for BigQuery some of them are suggested, namely:
Narrowing the scan (already doing it)
Using query cache (does not apply)
The book "Google BigQuery Analytics" mentions also adjusting query features, namely:
GROUP BY clauses generating large number of distinct groups (already
did this)
Aggregation functions requiring memory proportional to the number of input values (probably does not apply)
Join operations generating a greater number of outputs than inputs (does not seem to apply)
Another alternative is just splitting this query into its composing sub-queries, but at this moment I cannot opt for this strategy.
What else can I do to optimize this query?

Why does BigQuery have errors?
BigQuery is a shared and distributed resource and as such it is expected for jobs to fail at some point in time. This is why the only solution is to retry the job with exponential backoff. As a golden rule, jobs should be retried a minimum of 5 times and as long as a job is not unable to complete for more than 15 minutes the service is within the SLA [1].
What can be the causes?
I can think off two causes for this that can be affecting your queries:
Data skewing [2]
Unoptimized queries
Data Skewing
Regarding the first situation, this happens when data is not evenly distributed. Because the inner mechanic of BigQuery uses a version of MapReduce this means if you have for example a music or video file with millions of hits, the workers doing that data aggregation will have their resources exhausted while the other workers won’t be doing much at all because the aggregations for the videos or musics they are processing have little to no hits.
If this is the case, the recommendation is to uniformly distribute your data.
Unoptimized queries
If you don’t have access to modifying the data, the only solution is to optimize the queries. Optimized queries follow these general rules:
When using a SELECT, make sure you only select strictly the columns you need as this diminishes the cardinality of the requests (avoid using SELECT * for example)
Avoid using ORDER BY clauses on large sets of data
Avoid using GROUP BY clauses as they create a barrier to parallelism
Avoid using JOINS as these are extremely heavy on the worker's memory, and may cause resource starvation and resource errors (as in not enough memory).
Avoid using Analytical functions [3]
If possible, do your queries on Partitioned tables [4].
Following any of these strategies should help your queries have less errors and improve their overall running time.
Additional
You can't really understand BigQuery unless you understand MapReduce first. For this reason I strongly recommend you have a look on Hadoop tutorials, like the one in tutorialspoint:
https://www.tutorialspoint.com/hadoop/hadoop_mapreduce.htm
For a similar version of BigQuery, but that is Open Source (and less optimized in every single way) you can also check Apache Hive [4]. If you understand why Apache Hive fails, you understand why BigQuery fails.
[1] https://cloud.google.com/bigquery/sla
[2] https://www.mathsisfun.com/data/skewness.html
[3] https://cloud.google.com/bigquery/docs/reference/standard-sql/functions-and-operators#analytic-functions
[4] https://cloud.google.com/bigquery/docs/partitioned-tables
[5] https://en.wikipedia.org/wiki/Apache_Hive

Google's BigQuery has a lot of quirks because it is not ANSI compatible. These quirks are also its advantages. That said, you will waste too much time writing queries against BigQuery directly. You should either use an API/SDK or a tool such as Looker that will generate SQL for you: https://looker.com/blog/big-query-launch-blog at execution time, giving you resource estimate before spending your money.

Related

DynamoDB usable for largeish event table?

I'm thinking of re-architecting an RDS model to a DynamoDB one and it appears mostly to be working using a single-table design. We have, however a log table that can contain 5-10 million rows that are queried on many attributes.
Is there any pattern that might be applicable in migrating to DynamoDB or is this a case where full scans would be required and we would just be better off keeping the log stuff as a relational table?
Thanks in advance,
Nik
Those keywords and phrases "log" and "queried on many attributes" sound to me like DynamoDB is not the best solution for your log data. If the number of distinct queries is fairly limited and well-known in advance, you might be able to design your keys to fit your access patterns.
For example, if you commonly query on Color and Quantity attributes, you could design a key like COLOR#Red#QTY#25. And you could use secondary or global secondary indexes for queries involving other attributes similarly.
But it is not a great solution if you have many attributes that you need to query arbitrarily.
Alternative Solution: Another serverless option to consider is storing your log data in S3 and using Athena to query it using SQL.
You will likely be trading away a bit of latency and speed by taking this approach compared to RDS and DynamoDB. But queries against log data often don't need millisecond response times, so it can cover a lot of use cases.
Data modelling for DynamoDB
Write down all of your access patterns, in order of priority/most used
Research models which are similar to your use-case
Download NoSQL Workbench and create test models where you can visualize your ideas
Run commands against DynamoDB Local and test your access patterns are fulfilled.
Access Parterns
Your access patterns will ultimately decide if DynamoDB will suit your needs. If you need to query based on multiple fields you can have up to 20 Global Secondary Indexes which will give you some flexibility, but usually if you exceed 8-10 indexes then DynamoDB may not be a good choice or the schema is badly designed.
Use smart designs with sort-key and index-key overloading, it will allow you to group the data better and make your access patterns more efficient.
Log Data Use-case
Storing log data is a pretty common use-case for DynamoDB and many many AWS customers use it for that sole purpose. But I can't over emphasize the importance of understanding your access patterns and working backwards from those to create your model.
Alternatives
If you require query capability or free text search ability, then you could use DynamoDB integrations with OpenSearch (via Lambda/EventBridge) for example, with OpenSearch providing you the flexibility for your queries.
Doesn't seem like a good use case - I have done it and wasn't at all happy with the result - now I load 'log like' data into elasticsearch and much happier with the result.
In my case, I insert the data to dynamodb - to archive it - but also feed data in ES, but once in a while if I kill my ES cluster, I can reload all or some of the data from ddb.

Gremlin: OLAP vs dividing query

I have a query (link below) I must execute once per day or once per week in my application to find groups of connected users. In the query I check all possible groups for each user of the application (not all users are evaluated but could be a lot). For the moment I'm only making performance tests in localhost using Gremlin Server, since my application is not live yet.
The problem is that when testing this query simulating many users the query reaches the time limit a request can take that is configured in Gremlin Server by default, another problem is that the query does not take full CPU usage since it seems a single query is designed to use a single thread or a reduced amount of CPU processing in some way.
So I have 2 solutions in mind, divide the query in one chunk per user or use OLAP:
Solution 1:
Send a query to get the users first and then send one query per user, then remove duplicates in the server code, this should work in my case and since I can send all the queries at the same time I can use all resources available and bypass the time limits.
Solution 2:
Use OLAP. I guess OLAP does not have a time limit. The problem: My idea is to use Amazon Neptune and OLAP is not supported there as far as I know.
In this question about it:
Gremlin OLAP queries on AWS Neptune
David says:
Update: Since GA (June 2018), Neptune supports multiple queries in a single request/transaction
What does it mean "multiple queries in a single request"?
How my solution 1 compares with OLAP?
Should I look for another database service that supports OLAP instead of Neptune? Which one could be? I don't want an option that implies learning to setup my own "Neptune like" server, I have limited time.
My query in case you want to take a look:
https://gremlify.com/69cb606uzaj
This is a bit of a complicated question.
The problem is that when testing this query simulating many users the query reaches the time limit a request can take that is configured in Gremlin Server by default,
I'll assume there is a reason you can't change the default value, but for those who might be reading this answer the timeout is configurable both at the server (with evaluationTimeout in the server yaml) and per request both for scripts and bytecode based requests.
another problem is that the query does not take full CPU usage since it seems a single query is designed to use a single thread or a reduced amount of CPU processing in some way.
If you're testing with TinkerGraph in Gremlin Server then know that TinkerGraph is really simple. It doesn't do anything internally to run any aspect of a traversal in parallel (without TinkerGraphComputer which is OLAP related).
So I have 2 solutions in mind, divide the query in one chunk per user or use OLAP:
Either approach has the potential to work. In the first solution you suggest a form of poor man's OLAP where you must devise your own methods for doing this parallel processing (i.e. manage thread pools, synchronize state, etc). I think that this approach is a common first step that folks take to deal with this sort of problem. I'd wonder if you need to be as fine grained as one user per request. I would think that sending several at a time would be acceptable but only testing in your actual environment would yield the answer to that. The nice thing about this solution is that it will typically work on any graph system, including Neptune.
Using your second solution with OLAP is trickier. You have the obvious problem that Neptune does not directly support it, but going to a different provider that does will not instantly solve your problem. While OLAP rids you of having to worry about how to optimally parallelize your workload, it doesn't mean that you can instantly take that Gremlin query you want to run, throw it into Spark and get an instant win. For example, and I take this from the TinkerPop Reference Documentation:
In OLAP, where the atomic unit of computing is the vertex and its local
"star graph," it is important that the anonymous traversal does not leave the
confines of the vertex’s star graph. In other words, it can not traverse to an
adjacent vertex’s properties or edges.
In your query, there are already a places where you "leave the star graph" so you would immediately find problems there to solve. Usually that limitation can be worked around for OLAP purposes but it's not as simple as adding withComputer() to your traversal and getting a win in this case.
Going further down this path of using OLAP with a graph other than Neptune, you would probably want to at least consider if this complex traversal could be better written as a custom VertexProgram which might better bind your use case to the the capabilities of BSP than what the more generic TraversalVertexProgram does when processing arbitrary Gremlin. For that matter, a mix of Gremlin OLAP, a custom VertexProgram and some standard map/reduce style processing might ultimately lead to the most elegant and efficient answer.
An idea I've been considering for graphs that don't support OLAP has been to subgraph() (with Java) the portion of the graph that is relevant to your algorithm and then execute it locally in TinkerGraph! I think that might make sense in some use cases where the algorithm has some limits that can be defined ahead of time to form the subgraph, where those limits can be easily filtered and where the resulting subgraph is not so large that it takes an obscene amount of time to construct. It would be even better if the subgraph had some use beyond a single algorithm - almost behaving like a cache graph. I have no idea if that is useful to you but it's a thought. Here's a recent blog post I wrote that talks about writing VertexPrograms. Perhaps you will find it interesting.
All that said about OLAP, I think that your first solution seems fine to start with. You don't have a multi-billion edge graph yet and can probably afford to take this approach for now.
What does it mean "multiple queries in a single request"?
I believe that this just means that you can send a script like:
g.addV().iterate()
g.addV().iterate()
g.V()
where multiple Gremlin commands can be executed within the scope of a single transaction where each command must be "separated by newline ('\n'), spaces (' '), semicolon ('; '), or nothing (for example: g.addV(‘person’).next()g.V() is valid)". I think that only the last command returns a value. It doesn't seem like that particular feature would be helpful in your case. I would look more to batch users within a particular request where possible.
If you a looking for a native OLAP graph engine, perhaps take look at AnzoGraphDB which scales and performs much better for that style of more complex querying than anything else we know of. It's an MPP engine, so every core works on the query in parallel. Depending on how much data you need it to act on, the free version (single node only, RAM limited) may well be all you need and can be used commercially. You can find it in the AWS Marketplace or on Docker Hub.
Disclaimer: I work for Cambridge Semantics Inc.

Saiku,Mondrian performance degrades with large amount of data

We are using mondrian olap schema with saiku to analyse our records.We are using star schema model.We have one fact table which contains around 3000000 records. We have four dimension tables timestamp,rank,path and domain. Timestamp is almost unique for each entry . Now after deploying schema in saiku when we are performing analysis saiku takes a lot of time to return results. It takes 10 minutes to fetch 3000 records and if number of records are more than 50000 saiku dies.Please suggest me on what should I do in order to boost performance of saiku and mondrian.
You can easily figure out if this is database issue or saiku/mondrian problem:
Enable sql logging facility in saiku-server/tomcat/webapps/saiku/WEB-INF/classes/log4j.xml (uncomment section below Special Log File specifically for Mondrian SQL Statements text
Restart server
Do couple of typical analysis in Saiku
Get used queries from log
Analyze performance of queries directly in database (e.g. for PostgreSQL there is explain analyze command)
If performance of queries is as slow as in Saiku then you have identified your problem.
Btw. If you really have dimension for timestamp (by second?) than you should consider splitting it into two dimensions with days and seconds.
It's hard to understand what is your particular problem.
Two things helped us when we struggle with saiku performance problems:
indices on all fields and sometimes their combinations that may be
used as dimensions - helps like everywhere in DB
we avoided joines with other tables denormalizing our data

DynamoDB: Conditional writes vs. the CAP theorem

Using DynamoDB, two independent clients trying to write to the same item at the same time, using conditional writes, and trying to change the value that the condition is referencing. Obviously, one of these writes is doomed to fail with the condition check; that's ok.
Suppose during the write operation, something bad happens, and some of the various DynamoDB nodes fail or lose connectivity to each other. What happens to my write operations?
Will they both block or fail (sacrifice of "A" in the CAP theorem)? Will they both appear to succeed and only later it turns out that one of them actually was ignored (sacrifice of "C")? Or will they somehow both work correctly due to some magic (consistent hashing?) going on in the DynamoDB system?
It just seems like a really hard problem, but I can't find anything discussing the possibility of availability issues with conditional writes (unlike with, for instance, consistent reads, where the possibility of availability reduction is explicit).
There is a lack of clear information in this area but we can make some pretty strong inferences. Many people assume that DynamoDB implements all of the ideas from its predecessor "Dynamo", but that doesn't seem to be the case and it is important to keep the two separated in your mind. The original Dynamo system was carefully described by Amazon in the Dynamo Paper. In thinking about these, it is also helpful if you are familiar with the distributed databases based on the Dynamo ideas, like Riak and Cassandra. In particular, Apache Cassandra which provides a full range of trade-offs with respect to CAP.
By comparing DynamoDB which is clearly distributed to the options available in Cassandra I think we can see where it is placed in the CAP space. According to Amazon "DynamoDB maintains multiple copies of each item to ensure durability. When you receive an 'operation successful' response to your write request, DynamoDB ensures that the write is durable on multiple servers. However, it takes time for the update to propagate to all copies." (Data Read and Consistency Considerations). Also, DynamoDB does not require the application to do conflict resolution the way Dynamo does. Assuming they want to provide as much availability as possible, since they say they are writing to multiple servers, writes in DyanmoDB are equivalent to Cassandra QUORUM level. Also, it would seem DynamoDB does not support hinted handoff, because that can lead to situations requiring conflict resolution. For maximum availability, an inconsistent read would only have to be at the equivalent of Cassandras's ONE level. However, to get a consistent read given the quorum writes would require a QUORUM level read (following the R + W > N for consistency). For more information on levels in Cassandra see About Data Consistency in Cassandra.
In summary, I conclude that:
Writes are "Quorum", so a majority of the nodes the row is replicated to must be available for the write to succeed
Inconsistent Reads are "One", so only a single node with the row need be available, but the data returned may be out of date
Consistent Reads are "Quorum", so a majority of the nodes the row is replicated to must be available for the read to succeed
So writes have the same availability as a consistent read.
To specifically address your question about two simultaneous conditional writes, one or both will fail depending on how many nodes are down. However, there will never be an inconsistency. The availability of the writes really has nothing to do with whether they are conditional or not I think.

Riak and time-sorted records

I'd like to sort some records, stored in riak, by a function of the each record's score and "age" (current time - creation date). What is the best way do do a "time-sensitive" query in riak? Thus far, the options I'm aware of are:
Realtime mapreduce - Do the entire calculation in a mapreduce job, at query-time
ETL job - Periodically do the query in a background job, and store the result back into riak
Punt it to the app layer - Don't sort at all using riak, and instead use an application-level layer to sort and cache the records.
Mapreduce seems the best on paper, however, I've read mixed-reports about the real-world latency of riak mapreduce.
MapReduce is a quite expensive operation and not recommended as a real-time querying tool. It works best when run over a limited set of data in batch mode where the number of concurrent mapreduce jobs can be controlled, and I would therefore not recommend the first option.
Having a process periodically process/aggregate data for a specific time slice as described in the second option could work and allow efficient access to the prepared data through direct key access. The aggregation process could, if you are using leveldb, be based around a secondary index holding a timestamp. One downside could however be that newly inserted records may not show up in the results immediately, which may or may not be a problem in your scenario.
If you need the computed records to be accurate and will perform a significant number of these queries, you may be better off updating the computed summary records as part of the writing and updating process.
In general it is a good idea to make sure that you can get the data you need as efficiently as possibly, preferably through direct key access, and then perform filtering of data that is not required as well as sorting and aggregation on the application side.

Resources