Table Encoding Policy difference between Kusto/ADX clusters - azure-data-explorer

I have exactly the same table in 2 different ADX/Kusto clusters -- the data and schema are identical, but if I calculate ExtentSize for 1 day of data the difference between the two is enormous. While the table on one cluster has 10 TB, the table on the other has 15 TB. That's a big difference. When I checked the encoding policies on both tables, there is a slight difference. The table on the first cluster has the following encoding policy:
"ColumnIndexRangeGranularity": 0,
"ShardFieldCompressionCodec": "DEFAULT",
Whereas the table on the other cluster has the following:
"ColumnIndexRangeGranularity": 32,
"ShardFieldCompressionCodec": "LZ4",
My goal is to bring the size of the second table down to that of the first, so I can get by with a smaller cache policy. So ideally I would like to change these two parameters, but when I run the following command it has no effect on the table's encoding policy:
.alter table MyTable policy encoding @'{ "ShardFieldCompressionCodec": "Default" }'
There is no error either.
So I have two questions:
1. Does ADX simply ignore attempts to change the encoding policy of a table? We did not explicitly set these encoding policies; they were assigned by the clusters by default when the tables were created.
2. Does this mean that if we have the same table (same schema, same data) in 2 different clusters, their extent sizes will always vary, because the underlying table compression is different?

An encoding policy can be set on different entities:
column - only affects data ingested into the column after the change in policy.
table - only affects columns that are added to the table after the change in policy.
database - only affects tables that are created in the database after the change in policy.
It appears that what you're interested in doing is altering the column-level policy, and not the table-level policy. Note, though, that this will not change the encoding of data that has already been ingested.
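A minimal sketch of the column-level change, assuming a column named MyColumn and using the 'identifier' encoding profile purely as an example; as noted above, it only affects data ingested after the change:
.alter column MyTable.MyColumn policy encoding type='identifier'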

Related

Best way to model high score data in DynamoDB

I believe this would be easier with PostgreSQL or MongoDB, both of which I'm familiar with, but I'm using DynamoDB with my project for the sake of learning how to use it and getting comfortable with it. I've never used it before.
I want to use DynamoDB to store high scores for my typing test project. There are 4 data attributes to be stored:
name (doesn't need to be unique)
WPM
number of errors
test type (because I have 2 different kinds of typing tests)
At first, my partition key was testType, and my sort key was WPM. Then I realized that if anyone got the same WPM as a previous user, it would overwrite the previous user's data, because testType and WPM, the two key components, were identical. So ties did not work.
So, now, name is my partition key, and WPM is my sort key. In order to filter by testType, I just use JS array filter methods. This still doesn't seem optimal though for multiple reasons. For my small typing test project, I think it's ok, but I can see that it's possible for 2 people to input the same name and get the same WPM and overwrite each other.
What would be a better way to set this up with DynamoDB?
Assuming you want the top X WPM results for a given test type:
Set the partition key to be the test type. Set the sort key as <WPM>#<username>. Make sure to zero-pad the WPM so it's always 3 digits even if the score is below 100; that keeps the lexicographic sort order aligned with the numeric order.
With this key structure you have a sorted list (in the sort key) of all the scores for a given test type. You can Query against the test type and use ScanIndexForward=false to get descending high scores.
Notice how multiple identical scores by different usernames won’t overwrite each other. The username can be pulled from the returned sort key or from an attribute on the item, along with other metadata about the high score event.
If you have multiple users with the same username, well, that’s kinda weird. Presumably you have an internal identifier. You can use that as the suffix in the sort key instead of the username.
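A minimal sketch of that layout with boto3, assuming a hypothetical table named HighScores with partition key testType and sort key scoreKey (none of these names come from the original post):
import boto3
from boto3.dynamodb.conditions import Key

dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("HighScores")  # hypothetical table name

def put_score(test_type: str, username: str, wpm: int, errors: int) -> None:
    # Zero-pad the WPM so lexicographic sort order matches numeric order,
    # and suffix the username so ties don't overwrite each other.
    score_key = f"{wpm:03d}#{username}"
    table.put_item(Item={
        "testType": test_type,   # partition key
        "scoreKey": score_key,   # sort key: <WPM>#<username>
        "name": username,
        "wpm": wpm,
        "errors": errors,
    })

def top_scores(test_type: str, limit: int = 10):
    # Descending sort-key order returns the highest WPM first.
    resp = table.query(
        KeyConditionExpression=Key("testType").eq(test_type),
        ScanIndexForward=False,
        Limit=limit,
    )
    return resp["Items"]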

MariaDB - Inserting historical data into a system versioned (temporal) table

I have some tables in MariaDB whose changes I have been tracking by using a separate "changelog" table that is updated every time a record is updated. However, I have recently learned about temporal data tables in MariaDB and I would like to switch to that method, as it is a much more elegant way of tracking changes. I'm wondering, however, if there is a way to transfer my "changelog" table over to the new system-versioned tables.
So I was hoping I could somehow insert new rows with the specified values for the table, also specify the row_end and row_start columns, and have that not trigger the table to create another historical row... is this possible? I tried just doing an "insert into (id, row_start, row_end, etc) values(x, y, z)" but that results in an unknown column "row_start" error.
Old question, but starting with 10.11, MariaDB allows direct insertion of historical data using a command-line option or setting (a short sketch follows the excerpt below).
https://mariadb.com/kb/en/system-versioned-tables/#system_versioning_insert_history
system_versioning_insert_history
Description: Allows direct inserts into ROW_START and ROW_END columns if secure_timestamp allows changing timestamp.
Commandline: --system-versioning-insert-history[={0|1}]
Scope: Global, Session
Dynamic: Yes
Type: Boolean
Default Value: OFF
Introduced: MariaDB 10.11.0
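A sketch of what that looks like in practice, assuming a system-versioned table named my_table with the default row_start/row_end column names, and that secure_timestamp permits changing timestamps:
-- Allow explicit values for the versioning columns in this session (sketch).
SET SESSION system_versioning_insert_history = ON;
-- Insert a historical row copied from the old changelog; values and timestamps are illustrative.
INSERT INTO my_table (id, name, row_start, row_end)
VALUES (1, 'old value', '2020-01-01 00:00:00', '2020-06-01 00:00:00');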

Tablesize is different although Table DDL and data is exactly same(Teradata)

I encountered a very strange issue in Teradata today. I created one table from another table using the following syntax:
create table a.xyz as a.abc with data;
so obviously xyz will be created exactly the same as abc (including column attributes), and the data will be the same as well. However, if I run a query to get the size of that table using AllSpace or TableSize, the new table takes up more space than the original table. May I know why that should be the case? Out of curiosity I checked the skew factor as well, and surprisingly the skew factor of the new table was lower (ideally it should be the same, because it has the same PI and the same data). abc has a UPI and one column has column-level compression as well, but of course both of these attributes were copied to the new table too.
May I know what is happening here?
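For reference, the sizes of the two tables can be compared along these lines (a sketch; DBC.TableSizeV is the usual source, and the database name 'a' comes from the example above):
-- Total CurrentPerm per table, summed across AMPs (sketch).
SELECT DatabaseName, TableName, SUM(CurrentPerm) AS CurrentPermBytes
FROM DBC.TableSizeV
WHERE DatabaseName = 'a'
  AND TableName IN ('abc', 'xyz')
GROUP BY DatabaseName, TableName;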

Creation of Flyway "schema_version" fails for dashDB

I'm using Flyway to manage db migrations on IBM dashDB. By default, this database organizes table content 'by column', which in particular makes the creation of the "schema_version" table fail.
To get it to work, the table creation SQL statement only needs to include the "ORGANIZE BY ROW" directive:
CREATE TABLE (...)
(...)
) ORGANIZE BY ROW
What would be the best approach to handle this issue? I'm looking for a solution that does not impact the default table organization.
Thanks for helping,
Cheers.
dashDB will perform best when all tables are column-organized. When you start to mix row- and column-organized tables, many operations are then performed in "compensation", which basically means they won't take full advantage of the columnar engine.
There are currently some compatibility reasons why a columnar table cannot be created and a row-organized table must be used instead, but neither the original DDL nor the error is stated, so I can't tell what applies in this case. If you can provide the full CREATE TABLE statement and the resulting error (if you have it), I can possibly provide an alternative solution that would allow you to still use all column-organized tables.
If you only want to change a particular table from column-organized to row-organized, then an "ORGANIZE BY ROW" clause on the table definition would be the recommended way to approach this. (This seems to be what you're doing; a sketch follows at the end of this answer.)
Changing the default table organization will change how tables are created when you don't put an "ORGANIZE BY" clause in your table DDL.
If you have admin privileges on your dashDB instance, you can change the default table organization via 'Run SQL' in the dashDB console or using a dashDB client (for example: clp/clpplus).
Set default table organization to ROW:
call ADMIN_CMD('UPDATE DB CFG USING DFT_TABLE_ORG ROW');
Set default table organization to COLUMN: (default dashDB configuration)
call ADMIN_CMD('UPDATE DB CFG USING DFT_TABLE_ORG COLUMN');
Analytics will perform much better with column-organized tables, so it's recommended to keep the majority of your tables column-organized.
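A sketch of the per-table approach: keep DFT_TABLE_ORG at COLUMN and row-organize only the Flyway metadata table (the table and column names below are illustrative, not Flyway's actual definition):
-- Only this table is row-organized; the database default stays COLUMN.
CREATE TABLE flyway_history_example (
    installed_rank INTEGER NOT NULL,
    version        VARCHAR(50),
    description    VARCHAR(200) NOT NULL,
    success        SMALLINT NOT NULL
) ORGANIZE BY ROW;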

How to make values unique in cassandra

I want to make a unique constraint in Cassandra, as I want all the values in my columns to be unique within my column family.
ex:
name-rahul
phone-123
address-abc
Now I want that no row with values equal to rahul, 123, or abc can be inserted again. Searching on DataStax, I found that I can achieve this for the partition key by using IF NOT EXISTS, but I am not finding a solution for keeping all 3 values unique.
This means that if
name-jacob
phone-123
address-qwe
this should also not be inserted into my database, as the phone column has the same value as in the row with name-rahul shown above.
The short answer is that constraints of any type are not supported in Cassandra. They are simply too expensive, as they must involve multiple nodes, thus defeating the purpose of having eventual consistency in the first place. If you needed to make a single column unique, then there could be a solution, but not for multiple unique columns. For the same reason there is no isolation and no consistency (the C and I from ACID). If you really need to use Cassandra with this type of enforcement, then you will need to create some kind of synchronization application layer which intercepts all requests to the database and makes sure that the values are unique and all constraints are enforced. But this won't have anything to do with Cassandra.
I know this is an old question and the existing answer is correct (you can't do constraints in Cassandra), but you can address the problem using batched creates. Create one or more additional tables, each with the constrained column as the primary key, and then batch the creates, which is an atomic operation. If any of those column values already exist, the entire batch will fail. For example, if the table is named Foo, also create Foo_by_Name (primary key Name), Foo_by_Phone (primary key Phone), and Foo_by_Address (primary key Address) tables. Then when you want to add a row, create a batch covering all 4 tables. You can either duplicate all of the columns in each table (handy if you want to fetch by Name, Phone, or Address), or keep just the single Name, Phone, or Address column.
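A sketch of the extra lookup tables described above, following the Foo example (the column types are assumed; here each lookup table carries the other two values as well):
-- Each lookup table is keyed by the column that must stay unique (sketch).
CREATE TABLE foo_by_name    (name text PRIMARY KEY, phone text, address text);
CREATE TABLE foo_by_phone   (phone text PRIMARY KEY, name text, address text);
CREATE TABLE foo_by_address (address text PRIMARY KEY, name text, phone text);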
