I am trying to set a row order policy (ref: https://learn.microsoft.com/en-us/azure/data-explorer/kusto/management/roworder-policy). However, I am trying to apply it only on the time logging column, and it is not working. The timestamp is not in UNIX format, though. Do I have to combine it with any other attribute?
The row order policy is intended to improve performance by ordering the rows in Kusto's internal storage format; it has no semantic impact. See this sentence from the docs:
Since row order policy consumes computing resources, it is advised to avoid applying it if there is no performance need. If there is such a need, using a partitioning policy has a much higher performance impact and will most likely work much better.
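If you do decide you need it, applying the policy is a one-off management command. Below is a minimal sketch using the azure-kusto-data Python client; the cluster URL, database, table, and column names are hypothetical, and the exact command syntax should be checked against the documentation linked in the question.

    # Hedged sketch: applying a row order policy via the azure-kusto-data client.
    # All names below are placeholders; verify the management-command syntax
    # against the roworder-policy documentation.
    from azure.kusto.data import KustoClient, KustoConnectionStringBuilder

    kcsb = KustoConnectionStringBuilder.with_az_cli_authentication(
        "https://mycluster.westeurope.kusto.windows.net"
    )
    client = KustoClient(kcsb)

    # Hint that extents should keep rows ordered by the time-logging column.
    command = ".alter table MyLogs policy roworder (Timestamp desc)"
    client.execute_mgmt("MyDatabase", command)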
Related
If each item in my database has only two possible states (pending, appended), is it efficient to designate these two states as partition keys? Or is it effective to index this state value?
It would be more effective to use a sparse index. In your case, you might add an attribute called isPending. You can add this attribute to items that are pending, and remove it once they are appended. If you create a GSI with tid as the hash key and isPending as the sort key, then only items that are pending will be in the GSI.
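A minimal sketch of that sparse-index pattern with boto3; the table name, index name, key attributes, and item shape below are hypothetical:

    import boto3
    from boto3.dynamodb.conditions import Key

    dynamodb = boto3.resource("dynamodb")
    table = dynamodb.Table("Jobs")  # hypothetical table name

    # Query the sparse GSI: only items that still carry the isPending attribute
    # appear in the index, so this returns pending items only.
    response = table.query(
        IndexName="PendingIndex",  # hypothetical GSI: hash key = tid, sort key = isPending
        KeyConditionExpression=Key("tid").eq("tenant-123"),
    )
    pending_items = response["Items"]

    # When an item is appended, remove isPending so it drops out of the sparse index.
    table.update_item(
        Key={"tid": "tenant-123", "id": "item-1"},  # assumes this primary-key shape
        UpdateExpression="SET #s = :appended REMOVE isPending",
        ExpressionAttributeNames={"#s": "state"},  # state is a DynamoDB reserved word
        ExpressionAttributeValues={":appended": "appended"},
    )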
It will depend on how you will search for these records!
For example, if you will always search by record ID, it doesn't matter much. But if you will frequently search for the set of pending or appended records, you should think about using partitions.
You could also consult this best practices guide from AWS: https://docs.aws.amazon.com/en_us/amazondynamodb/latest/developerguide/best-practices.html
Update:
This section of the best practices guide recommends the following:
Keep related data together. Research on routing-table optimization 20 years ago found that "locality of reference" was the single most important factor in speeding up response time: keeping related data together in one place. This is equally true in NoSQL systems today, where keeping related data in close proximity has a major impact on cost and performance. Instead of distributing related data items across multiple tables, you should keep related items in your NoSQL system as close together as possible.

As a general rule, you should maintain as few tables as possible in a DynamoDB application. As emphasized earlier, most well designed applications require only one table, unless there is a specific reason for using multiple tables.

Exceptions are cases where high-volume time series data are involved, or datasets that have very different access patterns—but these are exceptions. A single table with inverted indexes can usually enable simple queries to create and retrieve the complex hierarchical data structures required by your application.

Use sort order. Related items can be grouped together and queried efficiently if their key design causes them to sort together. This is an important NoSQL design strategy.

Distribute queries. It is also important that a high volume of queries not be focused on one part of the database, where they can exceed I/O capacity. Instead, you should design data keys to distribute traffic evenly across partitions as much as possible, avoiding "hot spots."

Use global secondary indexes. By creating specific global secondary indexes, you can enable different queries than your main table can support, and that are still fast and relatively inexpensive.
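To make the "keep related data together" and "use sort order" points above concrete, here is a hedged sketch (boto3, with hypothetical table and key names) of a single-table design where a composite sort key groups a user's items so one query returns them already sorted:

    import boto3
    from boto3.dynamodb.conditions import Key

    dynamodb = boto3.resource("dynamodb")
    table = dynamodb.Table("AppTable")  # hypothetical single-table design

    # Related items share a partition key (PK = "USER#42") and the sort key
    # encodes type + date (SK = "ORDER#2019-07-01#<id>"), so they sort together.
    response = table.query(
        KeyConditionExpression=Key("PK").eq("USER#42") & Key("SK").begins_with("ORDER#"),
        ScanIndexForward=False,  # newest orders first
    )
    orders = response["Items"]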
I hope I could help you!
Suppose I have a Datastore kind with two properties listed below and an extremely high insert rate overall (but low insertion rate for individual values of random_key):
random_key - a uniformly distributed large number
time - a monotonically increasing timestamp indicating the insertion time of an entity
I'm primarily concerned with queries on the composite index (random_key ASC, time DESC) and I don't care about queries on just the time field.
Problem: according to the Datastore documentation, creating this composite index requires that I not exclude the random_key and time fields from auto-indexing. According to the best practices, indexing on time alone will lead to the hotspotting issue, since it is monotonically increasing.
Other questions such as Google datastore - index a date created field without having a hotspot recommend prepending a random value to the timestamp to shard the data. But I'd like to have a cleaner approach that uses the more meaningful value already present in the separate random_key property.
Question:
What are my options for maintaining the composite index on both fields without having any of the issues related to the auto-index on time alone?
Excluding/ignoring the hot-spotting issue of auto-indexing on time alone doesn't really change or improve things for the composite index: you're still left with the problem of updating an index (a composite one, but that doesn't really make a difference) with a monotonically increasing property value, which is still subject to the hot-spotting issue.
That's because the underlying root cause of the hot-spotting issue, graphically illustrated in App Engine datastore tip: monotonically increasing values are bad, is the number of worker threads that the index-update workload can be distributed to:
with monotonically changing property values, consecutive index update requests tend to keep hitting the same worker thread, which can only perform them in a serialized manner - the hotspot
with random/uniformly distributed property values, consecutive index update requests can be statistically distributed to multiple workers and executed in parallel. This is really what sharding does for monotonically changing properties as well.
The answer to the question you referenced applies equally well in the composite index case: you can use sharding for time if you have an update rate above the mentioned tipping point of 500 writes/sec.
But sharding complicates your app: you'd need multiple queries and client-side merging of the results. If your random_key is indeed more meaningful, you might find it more attractive to instead:
keep time unindexed (thus avoiding hot-spotting altogether)
query only by random_key (which doesn't require a composite index) and handle the time filtering via client-side processing (which might be less work than merging results from sharded queries); see the sketch below
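A minimal sketch of this second approach with the google-cloud-datastore Python client; the kind name, property values, and thresholds are hypothetical:

    from google.cloud import datastore

    client = datastore.Client()

    # Write: keep `time` out of the built-in indexes so its monotonically
    # increasing values never hit the index-update hotspot.
    key = client.key("MyKind")  # hypothetical kind name
    entity = datastore.Entity(key=key, exclude_from_indexes=("time",))
    entity.update({"random_key": 123456789, "time": 1700000000})
    client.put(entity)

    # Read: query only by the uniformly distributed random_key (single-property
    # index, no composite index needed) and filter/sort by time client side.
    query = client.query(kind="MyKind")
    query.add_filter("random_key", "=", 123456789)
    rows = [e for e in query.fetch() if e["time"] >= 1690000000]
    rows.sort(key=lambda e: e["time"], reverse=True)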
I am querying data from Google Analytics Premium using Google BigQuery. At the moment, I have one single query which I use to calculate some metrics (like total visits or conversion rate). This query contains several nested JOIN clauses and nested SELECTs. While querying just one table I am getting the error:
Error: Resources exceeded during query execution.
Using GROUP EACH BY and JOIN EACH does not seem to solve this issue.
One solution to be adopted in the future involves extracting only the relevant data needed for this query and exporting it into a separate table (which will then be queried). This strategy works in principle, I have already a working prototype for it.
However, I would like to explore additional optimization strategies for this query that work on the original table.
Some of them are suggested in the presentation You might be paying too much for BigQuery, namely:
Narrowing the scan (already doing it)
Using query cache (does not apply)
The book "Google BigQuery Analytics" mentions also adjusting query features, namely:
GROUP BY clauses generating a large number of distinct groups (already did this)
Aggregation functions requiring memory proportional to the number of input values (probably does not apply)
Join operations generating a greater number of outputs than inputs (does not seem to apply)
Another alternative is just splitting this query into its component sub-queries, but at this moment I cannot opt for this strategy.
What else can I do to optimize this query?
Why does BigQuery have errors?
BigQuery is a shared, distributed resource, and as such it is expected that jobs will fail at some point in time. This is why the only solution is to retry the job with exponential backoff. As a golden rule, jobs should be retried at least 5 times, and as long as a job does not remain unable to complete for more than 15 minutes, the service is within the SLA [1].
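A minimal sketch of such a retry loop using the google-cloud-bigquery Python client; the query, table name, and backoff parameters are only illustrative:

    import random
    import time

    from google.api_core.exceptions import GoogleAPIError
    from google.cloud import bigquery

    client = bigquery.Client()

    def run_with_backoff(sql, max_attempts=5):
        """Retry a query with exponential backoff, as suggested above."""
        for attempt in range(max_attempts):
            try:
                return client.query(sql).result()  # waits for the job to finish
            except GoogleAPIError:
                if attempt == max_attempts - 1:
                    raise
                # Exponential backoff with jitter: ~1s, 2s, 4s, 8s ... plus noise.
                time.sleep(2 ** attempt + random.random())

    # Hypothetical table name, for illustration only.
    rows = run_with_backoff("SELECT visitId FROM `project.dataset.ga_sessions_20170101`")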
What can be the causes?
I can think of two causes that could be affecting your queries:
Data skewing [2]
Unoptimized queries
Data Skewing
Regarding the first situation: this happens when data is not evenly distributed. Because BigQuery's inner mechanics use a version of MapReduce, if you have, for example, a music or video file with millions of hits, the workers aggregating that data will exhaust their resources, while the other workers won't be doing much at all because the videos or songs they are processing have little to no hits.
If this is the case, the recommendation is to uniformly distribute your data.
Unoptimized queries
If you don't have access to modify the data, the only solution is to optimize the queries. Optimized queries follow these general rules:
When using a SELECT, make sure you select only the columns you strictly need, as this diminishes the cardinality of the request (avoid using SELECT *, for example)
Avoid using ORDER BY clauses on large sets of data
Avoid using GROUP BY clauses as they create a barrier to parallelism
Avoid using JOINs, as these are extremely heavy on the workers' memory and may cause resource starvation and resource errors (as in not enough memory).
Avoid using Analytical functions [3]
If possible, run your queries against partitioned tables [4]; see the sketch after this list.
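As an illustration of the last two points, here is a hedged sketch using standard SQL and the google-cloud-bigquery Python client: it selects only the needed columns and restricts the scan to a narrow partition range. The project, dataset, table, and column names are hypothetical.

    from google.cloud import bigquery

    client = bigquery.Client()

    # Hypothetical partitioned table and columns: select only what you need and
    # restrict the scan to a few partitions.
    sql = """
        SELECT fullVisitorId, totals.visits AS visits
        FROM `my-project.analytics.sessions`
        WHERE _PARTITIONTIME BETWEEN TIMESTAMP('2016-01-01') AND TIMESTAMP('2016-01-07')
    """
    job_config = bigquery.QueryJobConfig(maximum_bytes_billed=10 * 1024 ** 3)  # optional scan/cost guard
    for row in client.query(sql, job_config=job_config).result():
        print(row.fullVisitorId, row.visits)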
Following any of these strategies should help your queries produce fewer errors and improve their overall running time.
Additional
You can't really understand BigQuery unless you understand MapReduce first. For this reason I strongly recommend you have a look at Hadoop tutorials, like the one on tutorialspoint:
https://www.tutorialspoint.com/hadoop/hadoop_mapreduce.htm
For a similar version of BigQuery that is open source (and less optimized in every single way), you can also check Apache Hive [5]. If you understand why Apache Hive fails, you understand why BigQuery fails.
[1] https://cloud.google.com/bigquery/sla
[2] https://www.mathsisfun.com/data/skewness.html
[3] https://cloud.google.com/bigquery/docs/reference/standard-sql/functions-and-operators#analytic-functions
[4] https://cloud.google.com/bigquery/docs/partitioned-tables
[5] https://en.wikipedia.org/wiki/Apache_Hive
Google's BigQuery has a lot of quirks because it is not ANSI compatible. These quirks are also its advantages. That said, you will waste too much time writing queries against BigQuery directly. You should either use an API/SDK or a tool such as Looker that generates the SQL for you at execution time (https://looker.com/blog/big-query-launch-blog), giving you a resource estimate before you spend your money.
My ASP.NET application uses some sequences to generate table primary keys. DB administrators have set the cache size to 20. Now the application is under test and a few records are added daily (say 4 for each user test session).
I've found that new test session records always use new cache portions, as if the previous day's cached numbers had expired, losing tens of keys every day. I'd like to understand whether it's due to some mistake I might have made in my application (disposing of table adapters or whatever) or whether it's the usual behaviour. Are there programming best practices to take into account when handling Oracle sequences?
Since the application will not have to bear a heavy workload (say 20-40 new records a day), I was thinking it might be the case to set a smaller cache size, or none at all.
Does sequence cache resizing imply a reset of the current index?
Thank you in advance for any hint.
The answer from Justin Cave in this thread might be interesting for you:
http://forums.oracle.com/forums/thread.jspa?threadID=640623
In a nutshell: if the sequence is not accessed frequently enough but you have a lot of "traffic" in the library cache, then the sequence might be aged out and removed from the cache. In that case the pre-allocated values are lost.
If that happens very frequently to you, it seems that your sequence is not used very often.
I guess that reducing the cache size (or completely disabling it) will not have a noticeable impact on performance in your case (especially when taking your figure of 20-40 new records a day into account).
Oracle sequences are not gap-free. Reducing the cache size will reduce the gaps... but you will still have gaps.
The sequence is not associated with the table by the database, but by your code (via the NEXTVAL call in the insert, whether through a trigger, plain SQL, or a package API). On that note, you may use the same sequence across multiple tables (it is not like SQL Server's identity, which is tied to a column/table).
Thus changing the sequence will have no impact on the indexes.
You would just need to make sure that, if you drop the sequence and recreate it, you 'reseed' it to one past the current value (e.g. create sequence seqer start with 125 nocache;). But:
If your application requires a gap-free set of numbers, then you cannot use Oracle sequences. You must serialize activities in the database using your own developed code.
Be forewarned, though: you may increase disk I/O and possibly introduce transaction locking if you choose not to use sequences.
The sequence generator is useful in multiuser environments for generating unique numbers without the overhead of disk I/O or transaction locking.
To reiterate a_horse_with_no_name's comment: what is the issue with gaps in the id?
Edit
Also have a look at the guidance on sequence caching here:
http://download.oracle.com/docs/cd/E11882_01/server.112/e17120/views002.htm#i1007824
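A hedged sketch of that reseeding step using the python-oracledb driver (cx_Oracle works the same way); the connection details, table, column, and sequence names are hypothetical:

    import oracledb  # assumes the python-oracledb driver is installed

    conn = oracledb.connect(user="app", password="secret", dsn="dbhost/orclpdb")  # hypothetical credentials
    cur = conn.cursor()

    # Find the current maximum key, then recreate the sequence just above it.
    cur.execute("SELECT NVL(MAX(id), 0) + 1 FROM my_table")  # hypothetical table/column
    next_start, = cur.fetchone()

    cur.execute("DROP SEQUENCE my_table_seq")
    cur.execute(f"CREATE SEQUENCE my_table_seq START WITH {next_start} NOCACHE")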
If you are using the sequence for PKs and not to enforce some application logic then you shouldn't worry about gaps. However, if there is some application logic tied to sequential sequence values, you will have holes if you use sequence caching and do not have a busy system. Sequence cache values can be aged out of the library cache.
You say that your system is not very busy; in that case, alter your sequence to NOCACHE. You are in a position of taking a negligible performance hit to fix a logic issue, so you might as well.
As people have mentioned: gaps shouldn't be a problem, so if you require no gaps you are doing something wrong. (But I don't think this is what you want.)
Reducing the cache size should reduce the number of gaps, but it will decrease the performance of the sequence, especially with concurrent access to it (which shouldn't be a problem in your use case).
Changing the sequence using the ALTER SEQUENCE statement (http://download.oracle.com/docs/cd/B19306_01/server.102/b14200/statements_2011.htm) should not reset the current/next value of the sequence.
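As a small illustration, here is a hedged python-oracledb sketch (hypothetical names) of switching the sequence to NOCACHE and observing that NEXTVAL simply continues; any cached-but-unused values just become a gap:

    import oracledb  # same hypothetical connection details as above

    conn = oracledb.connect(user="app", password="secret", dsn="dbhost/orclpdb")
    cur = conn.cursor()

    cur.execute("SELECT my_table_seq.NEXTVAL FROM dual")
    print("before:", cur.fetchone()[0])

    # Shrinking or disabling the cache does not reset the sequence; it only
    # changes how many values are pre-allocated per library-cache load.
    cur.execute("ALTER SEQUENCE my_table_seq NOCACHE")

    cur.execute("SELECT my_table_seq.NEXTVAL FROM dual")
    print("after:", cur.fetchone()[0])  # continues; cached-but-unused values may be skipped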
I have a general inquiry related to processing rows from a query. In general, I always try to format/process my rows in SQL itself, using numerous CASE WHEN statements to pre-format my db result, limiting rows and filling columns based on other columns.
However, you can also opt to just select all your rows and do the post-processing in code (asp.NET in my case). What do you guys think is the best approach in terms of performance?
Thanks in advance,
Stijn
I would recommend doing the processing in the code, unless you have network bandwidth considerations. The simple reason for this is that it is generally easier to make code changes than database changes. Furthermore, performance is more often related to the actual database query and disk access than to the amount of data returned.
However, I'm assuming that you are referring to "minor" formatting changes to the result. Standard WHERE clauses should naturally be done in the database.
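For illustration only, here is a tiny self-contained sketch of the two options; it uses Python and SQLite as stand-ins (the question is about ASP.NET against another database) and all table and column names are hypothetical:

    import sqlite3  # in-memory stand-in database so the sketch is self-contained

    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE orders (id INTEGER, status INTEGER)")
    conn.execute("INSERT INTO orders VALUES (1, 0), (2, 1)")

    # Option 1: format in SQL with CASE WHEN.
    rows_sql = conn.execute(
        "SELECT id, CASE WHEN status = 1 THEN 'shipped' ELSE 'pending' END AS label FROM orders"
    ).fetchall()

    # Option 2: return the raw column and format in application code,
    # which is usually easier to change later.
    labels = {0: "pending", 1: "shipped"}
    rows_code = [(oid, labels[status]) for oid, status in conn.execute("SELECT id, status FROM orders")]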