Best way to generate identifiers? - u-sql

I will incrementally insert rows into a table. This table stores sales facts and has some columns that will be used to define an identifier: business id (int), product name (string), product price (float). E.g. <1, heineken, 1.0>, <1, heineken, 22.99>.
These values will certainly be used in joins. Thinking the SQL way, I would create a hashed column from those columns; this way, I would be able to optimize some queries.
How about Data Lake and U-SQL? Should I calculate the hash on insert? Should I leave it as is? Should I simply concatenate the values and create a big string?
Thanks in advance.

While U-SQL supports clustering and distribution schemes on multiple columns, you probably could gain some additional performance in your joins if you find an efficient value to do your equi-join comparison. So you could calculate a hash or concatenate.
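For example, a minimal U-SQL sketch (rowset, table and column names here are invented) that materializes either a concatenated key or a hashed key at load time could look like this:

@withKey =
    SELECT business_id,
           product_name,
           product_price,
           // concatenated natural key
           business_id.ToString() + "|" + product_name + "|" + product_price.ToString() AS natural_key,
           // or a hashed variant; GetHashCode() is only illustrative - a stable hash
           // (e.g. MD5 via C# code-behind) may be preferable for persisted keys
           (business_id.ToString() + "|" + product_name + "|" + product_price.ToString()).GetHashCode() AS hash_key
    FROM @salesInput;

INSERT INTO dbo.SalesFacts
SELECT business_id, product_name, product_price, hash_key
FROM @withKey;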
However, I think finding the right distribution scheme and clustering is the better "bang for your buck".
And, more importantly, please do not incrementally insert small numbers of rows; use bulk insertion of many rows at a time (e.g., daily or weekly). And regularly rebuild the table or table partition to avoid table fragmentation, which would have a much bigger impact on your query performance.

Related

What are the maximum limits of MariaDB ColumnStore?

I want to create a wide table with thousands of columns in MariaDB ColumnStore. I didn't find any documentation on the maximum number of columns allowed by the storage engine. I would also like to know how ColumnStore will perform with 1000 integer columns.
(Caveat: This 'Answer' is based on my understanding of the design, not on any 'facts'.)
The disk footprint of a table should be proportional to the number of columns.
As with most things in MariaDB, there is probably a hard limit on the number of columns, but I can think of no reason for it to be under 1000. Perhaps, instead, some larger power of 2.
When referencing only a small number of columns, it should not matter how many columns there are in the table. The way the data is structured should allow fetching each column with a relatively fixed amount of effort.
For filtering, I would expect the effort taken to depend on the number of columns used for filtering, and their distribution. If your WHERE clause references a lot of columns, I would not expect good performance.
With any engine, having lots of columns is not necessarily a wise design. In general, when you have lots of columns that are not used for filtering or sorting (WHERE, ORDER BY), you may as well toss them into a JSON string (or other structure) and store it as a single TEXT or BLOB column. Then let the application parse the string to get the individual columns.
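A minimal sketch of that idea (table and column names are invented):

CREATE TABLE wide_metrics (
    id          BIGINT NOT NULL,
    recorded_at DATETIME NOT NULL,
    region      VARCHAR(32),   -- kept as a real column because it is filtered on
    other_cols  TEXT           -- JSON string holding the rarely used values
);
-- The application serializes/deserializes other_cols itself,
-- e.g. '{"col_17": 42, "col_18": 0, ...}'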
Columnstore shines for 'filtering'. It also is very good in compressing data, and my JSON suggestion would defeat this. But now you are into speed-vs-space tradeoffs that are very data-specific.
Would you care to describe your proposed dataset?
There is no limitation on the number of columns for ColumnStore, but data ingestion performance is not the best at the moment. We are decreasing the timings significantly in the near future.
When I tried to create a ColumnStore table which has 2310 columns, it returned "Error Code: 1117. Table definition is too large".
I decreased the number of columns and tried again.
It looks like the maximum number of columns for MariaDB ColumnStore is 2201.

Partition By & Clustered & Distributed By in U-SQL - Need to know their meaning and when to use them

I can see that while creating a table in U-SQL we can use the Partition By, Clustered and Distributed By clauses.
As per my understanding, partitioning stores data with the same key (the one we partition on) together or close by (maybe in the same structured stream in the background), so that queries are faster when we use that key in joins and filters.
Clustering, I guess, stores the data of those columns together or close by inside each partition.
And distribution is some method like hash or round robin - the way data is stored inside each partition. If you have an integer column and you frequently query within some range, use range, else use hash. If your data is not distributed evenly you may face a data-skew issue, in which case use round robin.
Question 2: Please let me know whether my understanding is correct or not.
Question 1: There is an INTO clause - I want to know how we should choose the value for this INTO clause for the distribution.
Question 3: I also want to know which one is vertical partitioning and which one is horizontal.
Question 4: I don't see any good online documentation for learning these concepts with examples. If you know of any, please send me links.
Peter and Bob have given you links to documentation.
To quickly answer your questions here:
Partitions and distributions both partition the data based on the partitioning scheme and both provide data scale out and partition elimination.
Partitions are optional and individually manageable for data life cycle management (besides giving you the ability to get partition elimination) and currently only support a value-based partition based on the same column values.
Each Partition then gets further partitioned based on the distribution scheme. Here you have different schemes (HASH, RANGE etc). The system decides on the number of distribution buckets based on some heuristic. In the case of HASH partitions, you can also specify the number of buckets with the INTO clause.
The clustering will then specify the order of the rows within a distribution bucket and allows you to further improve query performance (you can do a range scan instead of a full scan, for example).
Vertical and horizontal partitioning are terms sometimes used to separate these two levels of partitioning. I try to avoid it, since it can be confusing to remember which one is which.
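To make the three levels concrete, here is a rough U-SQL sketch (table, column and index names are invented, and clause placement should be checked against the current CREATE TABLE documentation):

CREATE TABLE dbo.SalesByDay
(
    sale_date DateTime,
    customer_id int,
    store_id int,
    amount double,
    INDEX cl_idx
        CLUSTERED (customer_id ASC)               // row order inside each distribution bucket
        PARTITIONED BY (sale_date)                // value-based, individually manageable partitions
        DISTRIBUTED BY HASH (customer_id) INTO 16 // 16 hash buckets per partition
);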

MariaDB partitioning on the last 3 months

First, I explain my problem:
I have a table that will contain approximately 5,000,000 records per year, and these records will be kept for at least 10 years (the exact retention is not yet defined). The records are events from a production machine. I generate a report plus a dashboard displaying fairly complex information (average number of events per 10 minutes over a month, graphs, ...) and I also want to see the records themselves. The data displayed will, for the large majority, come from the last 2 months; viewing the rest of the data must always be possible, but a lower access speed is acceptable.
I work on MariaDB v10.1.12.
The idea was to partition on the last 3 months. I realize now that this is not so easy; I have not found any solution for this kind of partition. In fact, it is impossible to define a partition based on NOW(), CURRENT_DATE(), etc., either directly or indirectly via a computed column.
Do you have any ideas for me? Perhaps another solution than a partition.
Thank you in advance.
I recommend PARTITION BY RANGE(TO_DAYS(...)). If you are only now breaking the table into partitions, I would recommend annual partitions for data before this year, then quarterly or monthly partitions henceforth. Yes, that, in theory, leads to an ever-growing number of partitions, but I predict that you will revamp the data structure within a few years.
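A minimal sketch of that layout, with invented table and column names (note that the column in the partition expression must be part of the primary key):

CREATE TABLE machine_events (
    event_id   BIGINT UNSIGNED NOT NULL,
    event_ts   DATETIME NOT NULL,
    event_type VARCHAR(32),
    PRIMARY KEY (event_id, event_ts)   -- event_ts included so partitioning is allowed
) ENGINE=InnoDB
PARTITION BY RANGE (TO_DAYS(event_ts)) (
    PARTITION p_2015    VALUES LESS THAN (TO_DAYS('2016-01-01')),  -- older data: one partition per year
    PARTITION p_2016_01 VALUES LESS THAN (TO_DAYS('2016-02-01')),  -- then one per month (or quarter)
    PARTITION p_2016_02 VALUES LESS THAN (TO_DAYS('2016-03-01'))
);

New monthly partitions are added ahead of time with ALTER TABLE ... ADD PARTITION, and old ones can be removed cheaply with ALTER TABLE ... DROP PARTITION.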
20-50 partitions is a good number. More than that leads to inefficiencies due to the multitude of partitions; less than that leads to asking "why bother".
Use InnoDB. Design the PRIMARY KEY carefully, since it may be useful as the primary index into the data.
Usually it is best to put the date/timestamp column last in any indexes. Putting it first would be redundant since partition pruning comes first.
More on partitioning.
It sounds like a main purpose for the table is to summarize the data for graphing, etc. In that case, it may be very beneficial to build and maintain "Summary table(s)" of counts and subtotals over selected time intervals. 100 rows get added up for a 10-minute interval? If so, then the summary table based on 10-minute intervals would have 1/100th as many rows, and the queries would be much faster. Plus, you could 'denormalize' the summary tables to make them even simpler.
More on Summary tables.
It might be worth it to gather data for 10 minutes into a staging table, then summarize it into the summary table. And also throw the raw data into the big table.
Or, if the summary tables have everything you need, you could abandon the big table. Or, as a compromise, keep 12 month's worth of data (partitioned by month), and DROP PARTITION for older data. Meanwhile, the summary tables can continue to grow (although they will be much smaller).
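As an illustration of the staging/summary-table idea (table names, column names and the 10-minute bucket are assumptions, not your actual schema):

CREATE TABLE event_summary_10min (
    interval_start DATETIME NOT NULL,
    event_type     VARCHAR(32) NOT NULL,
    event_count    INT UNSIGNED NOT NULL,
    PRIMARY KEY (interval_start, event_type)
) ENGINE=InnoDB;

-- Assuming the last interval's raw rows were gathered into a staging table,
-- roll them up into 10-minute buckets before copying them to the big table.
INSERT INTO event_summary_10min (interval_start, event_type, event_count)
SELECT FROM_UNIXTIME(FLOOR(UNIX_TIMESTAMP(event_ts) / 600) * 600),
       event_type,
       COUNT(*)
FROM staging_events
GROUP BY 1, 2;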
Table partitioning is an advanced feature; it is not indexing, but a rearrangement of the table's data. So it is not a "duplicate": new data will be stored according to the predefined partitioning ranges.
You must also specify the month range criteria as usual, and you MUST create an index if those columns are not used as the partition range. When you run a SELECT, the algorithm associated with the partitioned table will handle any merging (if required) in the background, so you can treat a partitioned table exactly like a typical table.
For more details, please check the MariaDB partitioning overview.

SET table vs MULTISET table performance

I have to prepare a table where I will keep weekly results for some aggregated data. The table will have 30 fields (10 CHARACTERs, 20 DECIMALs), and I expect about 250k rows weekly.
In my head I can see two scenarios:
A SET table, relying on Teradata to prevent duplicate rows - it should silently skip duplicate entries while inserting new data.
A MULTISET table with a UPI - it will raise an error upon inserting a duplicate row.
The INSERT statement is going to be executed through VBA in Excel, where handling possible Teradata errors is not a problem.
Which scenario will be faster to run in a year's time, when there will be circa 14 million rows?
Is there any other way to have it done?
Regards
At a high level, since your table will hold a comparatively large number of rows, it is advisable not to use a SET table; go with a MULTISET table instead. A SET table has to check each inserted row against the existing rows with the same primary index value to reject duplicates, and that check gets more expensive as the table grows.
For more info you can refer to this link
http://www.dwhpro.com/teradata-multiset-tables/
Why do you care about Duplicate Rows? When you store weekly aggregates there should be no duplicates at all. And Duplicate Rows are not the same as duplicate Primary Key values.
Simply choose a PI which best fits your join/access patterns (and maybe partition by date). To avoid any potential duplicates you might simply use MERGE instead of INSERT.
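A rough sketch of the MERGE variant (table and column names are invented; on Teradata the ON clause must cover the target's primary index):

-- Insert only the rows whose key is not already present,
-- instead of letting duplicates raise an error.
MERGE INTO weekly_results AS tgt
USING staging_weekly_results AS src
    ON tgt.week_start = src.week_start
   AND tgt.account_id = src.account_id
WHEN NOT MATCHED THEN
    INSERT (week_start, account_id, metric_1, metric_2)
    VALUES (src.week_start, src.account_id, src.metric_1, src.metric_2);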

Large Table vs Multiple Tables - Normalized Data

I am currently working on a project that collects customers' demographics weekly and stores the delta (from previous weeks) as a new record. This process will encompass 160 variables and a couple hundred million people (my management and a consulting firm require this, although ~100 of the variables are seemingly useless). These variables will be collected from 9 different tables in our Teradata warehouse.
I am planning to split this into 2 tables.
Table with often used demographics (~60 variables sourced from 3 tables)
Normalized (1 customer id and add date for each demographic variable)
Table with rarely or unused demographics (~100 variables sourced from 6 tables)
Normalized (1 customer id and add date for each demographic variable)
MVC (multi-value compression) is utilized to save as much space as possible, as the database it will live on is limited in size due to backup limitations. (Note that the customer id currently consumes 30% (3.5 GB) of table 1's size, so additional tables would add that storage cost.)
The table(s) will be accessed by finding the most recent record in relation to the date the Analyst has selected:
SELECT cus_id, demo
FROM db1.demo_test
WHERE (cus_id, add_dt) IN (
    SELECT cus_id, MAX(add_dt)
    FROM db1.dt_test
    WHERE add_dt <= '2013-03-01' -- Analyst selected Point-in-Time Date
    GROUP BY 1)
GROUP BY 1,2
This data will be used for modeling purposes, so a reasonable SELECT speed is acceptable.
Does this approach seem sound for storage and querying?
Is any individual table too large?
Is there a better suggested approach?
My concern with splitting further is
Space due to uncompressible fields such as dates and customer ids
Speed with joining 2-3 tables (I suspect an inner join may use very little resources.)
Please excuse my ignorance in this matter. I usually work with large tables that do not persist for long (I am a Data Analyst by profession) or the tables I build for long term data collection only contain a handful of columns.
In addition to Rob's remarks:
What is your current PI/partitioning?
Is the current performance unsatisfactory?
How do the analysts access beside the point-in-time, any other common conditions?
Depending on your needs, a (prev_dt, add_dt) pair might be better than a single add_dt. It is more overhead to load, but querying might be as simple as date ... BETWEEN prev_dt AND end_dt.
A Join Index on (cus_id), (add_dt) might be helpful, too.
You might replace the MAX() subquery with a RANK (MAX is usually slower; only when cus_id is the PI might RANK be worse):
SELECT *
FROM db1.demo_test
WHERE add_dt <= '2013-03-01' -- Analyst selected Point-in-Time Date
QUALIFY
   RANK() OVER (PARTITION BY cus_id ORDER BY add_dt DESC) = 1
In TD14 you might split your single table into two row containers of a column-partitioned table.
...
The width of the table at 160 columns, sparsely populated, is not necessarily an incorrect physical implementation (normalized in 3NF or slightly de-normalized). I have also seen situations where attributes that are not regularly accessed are moved to a documentation table. If you elect to implement the latter in your physical design, it would be in your best interest that each table share the same primary index. This allows the joining of these two tables (60 attributes and 100 attributes) to be AMP-local on Teradata.
If access to the table(s) will also include the add_dt column, you may wish to create a partitioned primary index on this column. This will allow the optimizer to eliminate the other partitions from being scanned when the add_dt column is included in the WHERE clause of a query. Another option would be to test the behavior of a value-ordered secondary index on the add_dt column.
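A rough sketch of what such DDL could look like (the column types, date range and interval are assumptions; only the table name, cus_id and add_dt come from the query above):

CREATE MULTISET TABLE db1.demo_test (
    cus_id INTEGER NOT NULL,
    add_dt DATE NOT NULL,
    demo   VARCHAR(100)
)
PRIMARY INDEX (cus_id)
PARTITION BY RANGE_N(add_dt BETWEEN DATE '2010-01-01'
                            AND     DATE '2020-12-31'
                            EACH INTERVAL '1' MONTH);

With such partitioning, a WHERE add_dt <= '2013-03-01' predicate lets the optimizer skip the partitions that cannot qualify.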
