Predicate Push Down vs Bloom Filters - bigdata

While looking into query optimizations on big data, especially ORC files, I came across two possibilities: predicate push down and Bloom filters.
Predicate push down helps us avoid reading unnecessary stripes, which reduces IO, but it appears to me that Bloom filters serve the same purpose, except for the difference below:
For predicate push down we do not need to explicitly create any artifacts while writing an ORC file, whereas for Bloom filters we need to configure the columns while writing the ORC file.
I would appreciate suggestions to improve my understanding.
Thanks
Santosh

Bloom filters are used by predicate push down. Predicate push down primarily uses column statistics to skip row groups and minimize the number of rows read. If Bloom filters are present, predicate push down can reduce the number of rows read even further.

Because ORC files are type-aware, the writer chooses the most appropriate encoding for the type and builds an internal index as the file is written.
Predicate push down uses those indexes to determine which stripes in a file need to be read for a particular query, and the row indexes can narrow the search to a particular set of 10,000 rows.
See also: https://orc.apache.org/docs/index.html
Predicate push down needs to be implemented by the query engine, e.g. Apache Spark.
A good definition of predicate push down can be found here:
https://medium.com/microsoftazure/data-at-scale-learn-how-predicate-pushdown-will-save-you-money-7063b80878d7#:~:text=What%20is%20Predicate%20Pushdown%3F,are%20referred%20to%20as%20predicates.&text=It%20can%20improve%20query%20performance,%2FO)%20from%20Storage%20files
ORC provides three levels of indexes within each file:
file level - statistics about the values in each column across the entire file
stripe level - statistics about the values in each column for each stripe
row level - statistics about the values in each column for each set of 10,000 rows within a stripe
Column statistics always contain the count of values and whether there are null values present. Most other primitive types include the minimum and maximum values and for numeric types the sum. As of Hive 1.2, the indexes can include bloom filters, which provide a much more selective filter.
https://orc.apache.org/docs/indexes.html
ORC predicate push-down is enabled by default in Spark SQL.
Bloom filters are only useful for equality predicates, not for less-than or greater-than comparisons.
"A Bloom filter is a space-efficient probabilistic data structure, conceived by Burton Howard Bloom in 1970, that is used to test whether an element is a member of a set. False positive matches are possible, but false negatives are not – in other words, a query returns either "possibly in set" or "definitely not in set"."
See also:
https://llimllib.github.io/bloomfilter-tutorial/
https://en.wikipedia.org/wiki/Bloom_filter
https://docs.cloudera.com/runtime/7.2.8/developing-spark-applications/topics/spark-predicate-push-down-optimization.html
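As a concrete sketch of the two pieces working together, the following PySpark snippet writes an ORC file with a Bloom filter on one column and reads it back with an equality filter that can be pushed down. The path, column name, and data are made up for illustration; `orc.bloom.filter.columns` / `orc.bloom.filter.fpp` are ORC writer options, and `spark.sql.orc.filterPushdown` is the Spark setting mentioned above (already true by default in recent Spark versions):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("orc-ppd-demo").getOrCreate()

# Hypothetical dataset and output path, just for illustration.
df = spark.range(0, 1_000_000).withColumnRenamed("id", "user_id")

# Ask the ORC writer to build Bloom filters for user_id in addition to
# the min/max statistics it always records.
(df.write
   .mode("overwrite")
   .option("orc.bloom.filter.columns", "user_id")
   .option("orc.bloom.filter.fpp", "0.05")   # target false-positive rate
   .orc("/tmp/orc_ppd_demo"))

# Predicate push down is on by default; set here only to make it explicit.
spark.conf.set("spark.sql.orc.filterPushdown", "true")

# The equality predicate can be pushed to the ORC reader, which uses the
# row-group statistics and the Bloom filter to skip data that cannot match.
result = spark.read.orc("/tmp/orc_ppd_demo").filter("user_id = 123456")
result.explain()   # the physical plan lists the PushedFilters
print(result.count())
```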

Related

Read all data from a table of 10k rows in a single request

Could there be problems in reading all the data of a 10k rows table in a single request?
It would be a read only request.
I would like to do it because I want to perform some queries on the array, and from the documentation I can’t find a way to do it directly with Pact.
No, there shouldn't be. Read-only queries are "free" at the moment.
You can do it in two ways:
Do a select query whose where clause always evaluates to true.
Get all the keys (i.e. unique ids in the table) via (keys your-table-name) and then have a separate method which returns data for a list of ids.
But do consider using select statements to help filter your data during the query, as this could be easier than doing the filtering yourself.
Pact will check arrays like any other property, but you should ask yourself the question - do you need to test all 10k records or just a representative sample of them (the answer should in most cases be the latter).
You should also consider:
Do you need an exact match? (If so, the consumer and provider must have exactly the same data - not recommended.)
Can you use matchers to check the shape of the items in the array?

Cloud DLP inspection scan to look for multiple infoTypes in same row

I have to run an inspection scan on a BigQuery table. My goal is to highlight/find a row only if it contains, say, the first_name, last_name, phone_number, and age infoTypes (all in the same row).
I'm new to Cloud DLP and have created a job trigger (with all the infoTypes I'm interested in) to scan data from a BQ table. I'm not really sure if inspection rulesets can help here.
Just in case my point is not clear: https://help.symantec.com/cs/DLP15.0/DLP/v54111221_v120691346/Coincidencia-con-3-columnas-en-una-condici?locale=EN_US
Cloud DLP provides a set of built-in infoType detectors, each of which can be specified by name and is listed in the InfoType detector reference. These detectors use a variety of techniques to discover and classify each type. For example, some types will require a pattern match, some may have mathematical checksums, some have special digit restrictions, and others may have specific prefixes or context around the findings.
However, there are additional options, shown here, which you can use to further explore your use case.
"Important: Built-in infoType detectors are not a 100% accurate detection method. For example, they can't guarantee compliance with regulatory requirements. You must decide what data is sensitive and how to best protect it. Google recommends that you test your settings to make sure your configuration meets your requirements."
You can find some additional information here.
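Cloud DLP itself reports findings per cell, not per row, so the "all infoTypes in the same row" condition has to be checked on your side. A hedged Python sketch follows, using the google-cloud-dlp client; the project ID and the small inline table are placeholders, and a real BigQuery scan would go through a job or trigger rather than `inspect_content`. The idea is simply to group findings by row index:

```python
from collections import defaultdict
from google.cloud import dlp_v2

# Placeholder project resource name and required infoTypes for illustration.
PROJECT = "projects/my-project"
REQUIRED = {"FIRST_NAME", "LAST_NAME", "PHONE_NUMBER", "AGE"}

client = dlp_v2.DlpServiceClient()

# A tiny inline table standing in for the BigQuery rows.
item = {
    "table": {
        "headers": [{"name": "first_name"}, {"name": "last_name"},
                    {"name": "phone"}, {"name": "age"}],
        "rows": [
            {"values": [{"string_value": "Jane"}, {"string_value": "Doe"},
                        {"string_value": "(555) 253-0000"}, {"string_value": "34"}]},
            {"values": [{"string_value": "John"}, {"string_value": ""},
                        {"string_value": ""}, {"string_value": "40"}]},
        ],
    }
}

inspect_config = {
    "info_types": [{"name": name} for name in REQUIRED],
    "include_quote": False,
}

response = client.inspect_content(
    request={"parent": PROJECT, "inspect_config": inspect_config, "item": item}
)

# Group the detected infoTypes by the row they were found in.
row_infotypes = defaultdict(set)
for finding in response.result.findings:
    for loc in finding.location.content_locations:
        row = loc.record_location.table_location.row_index
        row_infotypes[row].add(finding.info_type.name)

for row, found in sorted(row_infotypes.items()):
    if REQUIRED.issubset(found):
        print(f"row {row} contains all required infoTypes: {sorted(found)}")
```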

Partition By & Clustered & Distributed By in U-SQL - Need to know their meaning and when to use them

I can see that while creating a table in U-SQL we can use the Partition By, Clustered, and Distributed By clauses.
As per my understanding, partitioning will store data with the same key (the one we partition on) together or close by (maybe in the same structured stream in the background), so that our queries will be faster when we use that key in joins and filters.
Clustering, I guess, stores the data of those columns together or close by inside each partition.
And distribution is some method like hash or round robin - the way data is stored inside each partition. If you have an integer column and you frequently query within some range, use range; otherwise use hash. If your data is not distributed equally, you may face a data-skew issue, in which case use round robin.
Question 1: Please let me know whether my understanding is correct or not.
Question 2: There is an INTO clause - I want to know how we should pick the value for this INTO clause for the DISTRIBUTION.
Question 3: I also want to know which one is vertical partitioning and which one is horizontal.
Question 4: I don't see any good online documentation to learn these concepts with examples. If you know of any, please send me links.
Peter and Bob have given you links to documentation.
To quickly answer your questions here:
Partitions and distributions both partition the data based on the partitioning scheme and both provide data scale out and partition elimination.
Partitions are optional and individually manageable for data life-cycle management (besides giving you the ability to get partition elimination) and currently only support value-based partitioning on the same column values.
Each partition then gets further partitioned based on the distribution scheme. Here you have different schemes (HASH, RANGE, etc.). The system decides on the number of distribution buckets based on some heuristic. In the case of HASH distributions, you can also specify the number of buckets with the INTO clause.
The clustering will then specify the order of the rows within a distribution bucket and allows you to further improve query performance (you can do a range scan instead of a full scan, for example).
Vertical and horizontal partitioning are terms sometimes used to separate these two levels of partitioning. I try to avoid it, since it can be confusing to remember which one is which.

DynamoDB: Using filtered expression vs creating separate table with selected data for efficiency

I am writing an API, which has a data model with a status field which is boolean.
And 90% of the calls to the API will require a filter on status = "active".
Context:
Currently, I have it as a DynamoDB Boolean field and use a filter expression over it, but I am weighing that against creating a separate table where the relevant identifier acts as the hash key for the query and which stores the item information corresponding to the "active" status, since there can be only one item with "active" status for a particular hash key.
Now my questions are:
Data integrity is a big question here, since I will be updating two tables depending upon the request.
Is using separate tables a good practice in DynamoDB for this use case, or am I using the wrong DB?
Is query execution with a filter expression efficient enough that I can keep the current setup?
Scale of the API usage is medium right now but it is expected to increase.
A filter expression is going to be inefficient because filter expressions are applied to results after the scan or query is processed. They could save on network bandwidth in some cases but otherwise you could just as well apply the filter in your own code with pretty much the same results and efficiency.
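To see why, consider a boto3 sketch like the following (the table and attribute names are hypothetical): the filter is applied after the items have been read, so the scan still consumes read capacity for every item it touches.

```python
import boto3
from boto3.dynamodb.conditions import Attr

# Hypothetical table name for illustration.
table = boto3.resource("dynamodb").Table("orders")

# The scan reads every item in the table; FilterExpression only trims
# what is returned to the client, not what is read (and billed).
response = table.scan(FilterExpression=Attr("status").eq(True))
items = response["Items"]

# ScannedCount is the number of items read; Count is what survived the filter.
print(response["ScannedCount"], response["Count"])
```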
Your other option would be to create a Global Secondary Index (GSI) with a partition key on the boolean field, which might be better if you have significantly fewer "active" records than "inactive". In that case a useful pattern is to create a surrogate field, say "status_active", which you set to TRUE only for active records, and to NULL for others. Then, if you create a GSI with a partition key on the "status_active" field, it will contain only the active records (NULL values do not get indexed).
The index on a surrogate field is probably the best option as long as you expect that the set of active records is sparse in the table (i.e. there are fewer actives than inactives).
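Here is a hedged boto3 sketch of the surrogate-key approach; the table name, key names, and the GSI name "status_active-index" are made up for the example. Only items that carry the surrogate attribute appear in the sparse index, so the query touches active records only.

```python
import boto3
from boto3.dynamodb.conditions import Key

table = boto3.resource("dynamodb").Table("orders")

# When writing, set the surrogate attribute only on active items and remove
# it when the item becomes inactive, so the GSI stays sparse.
table.update_item(
    Key={"order_id": "order-123"},
    UpdateExpression="SET status_active = :v",
    ExpressionAttributeValues={":v": "TRUE"},
)

# Query the GSI (assumed to have status_active as its partition key);
# only active items exist in the index, so no filtering is needed.
response = table.query(
    IndexName="status_active-index",
    KeyConditionExpression=Key("status_active").eq("TRUE"),
)
print(response["Items"])
```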
If you expect that about 50% of records would be active and 50% would be inactive then having two tables and dealing with transaction integrity on your own might be a better choice. This is especially attractive if records are only infrequently expected to transition between states. DynamoDB provides very powerful atomic counters and conditional checks that you can use to craft a solution that ensures state transitions are consistent.
If you expect that many records would be active and only a few inactive, then using a filter might actually be the best option, but keep in mind that filtered records still count towards your provisioned throughput, so again, you could simply filter them out in the application with much the same result.
In summary, the answer depends on the distribution of values in the status attribute.

When to include an index (automated heuristic)

I have a piece of software which takes in a database and uses it to produce graphs based on what the user wants (primarily queries of the form SELECT AVG(<input1>) AS x, AVG(<input2>) AS y FROM <input3> WHERE <key> IN (<vals..>) AND ...). This works nicely.
I have a simple script that is passed an (often large) number of files, each describing a row:
name=foo
x=12
y=23.4
....... etc.......
The script goes through each file, saving the variable names, and an INSERT query for each. It then loads the variable names, sort | uniq's them, and makes a CREATE TABLE statement out of them (sqlite, amusingly enough, is ok with having all columns be NUMERIC, even if they actually end up containing text data). Once this is done, it then executes the INSERTS (in a single transaction, otherwise it would take ages).
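A minimal Python/sqlite3 sketch of that load pattern, under the assumption that every file is a flat list of key=value lines (the table name "results" is made up for the example):

```python
import sqlite3
from pathlib import Path

def load_files(db_path, file_paths):
    """Collect key=value pairs from each file, then create one table and
    insert one row per file inside a single transaction."""
    rows = []
    columns = set()
    for path in file_paths:
        row = {}
        for line in Path(path).read_text().splitlines():
            if "=" in line:
                key, value = line.split("=", 1)
                row[key.strip()] = value.strip()
        columns.update(row)
        rows.append(row)

    ordered = sorted(columns)            # the sort | uniq step
    conn = sqlite3.connect(db_path)
    col_defs = ", ".join(f'"{c}" NUMERIC' for c in ordered)
    conn.execute(f"CREATE TABLE IF NOT EXISTS results ({col_defs})")

    placeholders = ", ".join("?" for _ in ordered)
    col_names = ", ".join(f'"{c}"' for c in ordered)
    with conn:                           # one transaction for all INSERTs
        conn.executemany(
            f"INSERT INTO results ({col_names}) VALUES ({placeholders})",
            [tuple(row.get(c) for c in ordered) for row in rows],
        )
    conn.close()
```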
To improve performance, I added a basic index on each column. However, this increases database size somewhat significantly, and only provides a moderate improvement.
Data comes in three basic types:
single value, indicating things like program version, etc.
a few values (<10), indicating things like input parameters used
many values (>1000), primarily output data.
The first type obviously shouldn't need an index, since it will never be sorted upon.
The second type should have an index, because it will commonly be filtered by.
The third type probably shouldn't need an index, because it will be used in output.
It would be annoying to determine which type a particular value is before it is put in the database, but it is possible.
My question is twofold:
Is there some hidden cost to extraneous indexes, beyond the size increase that I have seen?
Is there a better way to index for filtration queries of the form WHERE foo IN (5) AND bar IN (12,14,15)? Note that I don't know which columns the user will pick, beyond the fact that it will be a type 2 column.
Read the relevant documentation:
Query Planning;
Query Optimizer Overview;
EXPLAIN QUERY PLAN.
The most important thing for optimizing queries is avoiding I/O. Tables with fewer than ten rows should not be indexed, because all the data fits into a single page anyway, and an index would just force SQLite to read another page for the index.
Indexes are important when you are looking up records in a big table.
Extraneous indexes make table updates slower, because each index needs to be updated as well.
SQLite can use at most one index per table in a query.
This particular query could be optimized best by having a single index on the two columns foo and bar.
However, creating such indexes for all possible combinations of lookup columns is most likely not worth the effort.
If the queries are generated dynamically, the best idea probably is to create one index for each column that has good selectivity, and rely on SQLite to pick the best one.
And don't forget to run ANALYZE.
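Putting the last few points together, here is a small sqlite3 sketch (the "results" table and the foo/bar columns are just the names from the example query) showing a composite index, ANALYZE, and EXPLAIN QUERY PLAN:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE results (foo NUMERIC, bar NUMERIC, x NUMERIC, y NUMERIC)")
conn.executemany(
    "INSERT INTO results VALUES (?, ?, ?, ?)",
    [(i % 7, i % 13, i, i * 1.5) for i in range(10_000)],
)

# A single composite index covering both filter columns...
conn.execute("CREATE INDEX idx_results_foo_bar ON results(foo, bar)")
# ...or, for dynamically generated queries, one index per selective column
# (e.g. CREATE INDEX idx_results_foo ON results(foo)) and let SQLite pick.

conn.execute("ANALYZE")  # gather statistics so the planner can choose well

plan = conn.execute(
    "EXPLAIN QUERY PLAN "
    "SELECT AVG(x) AS x, AVG(y) AS y FROM results "
    "WHERE foo IN (5) AND bar IN (12, 14, 15)"
).fetchall()
for row in plan:
    print(row)   # shows e.g. a SEARCH using idx_results_foo_bar
```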
