MariaDB partitioning on the last 3 months - mariadb

First, let me explain my problem:
This table will contain approximately 5,000,000 records per year, and these records will be kept for at least 10 years (the retention period is not yet defined). They describe events from production machines. I generate a report plus a dashboard displaying various relatively complex information (average number of events per 10 minutes over a month, graphs, ...), and I also want to see the records themselves. The data displayed will mostly come from the last 2 months; viewing the rest of the data must always be possible, but a slower access speed is acceptable for it.
I work on MariaDB v10.1.12.
The idea was to partition on the last 3 months. I realize now that this is not so easy: I have found no solution for such a partition. In fact, it is impossible to define a partition based on NOW(), CURRENT_DATE(), etc., either directly or indirectly via another computed column.
Do you have any ideas for me? Perhaps there is another solution than partitioning.
Thank you in advance.

I recommend PARTITION BY RANGE(TO_DAYS(...)). If you are only now breaking the table into partitions, I would recommend annual partitions for data before this year, then quarterly or monthly partitions henceforth. Yes, in theory that leads to an ever-growing number of partitions, but I predict that you will revamp the data structure within a few years.
20-50 partitions is a good number. More than that leads to inefficiencies due to the multitude of partitions; less than that leads to asking "why bother".
Use InnoDB. Design the PRIMARY KEY carefully, since it may be useful as the primary index into the data.
Usually it is best to put the date/timestamp column last in any indexes. Putting it first would be redundant since partition pruning comes first.
More on partitioning.
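As a minimal sketch of that layout (the table and column names here are invented for illustration, and the partition boundaries are arbitrary):

CREATE TABLE machine_event (
    event_id   BIGINT UNSIGNED NOT NULL AUTO_INCREMENT,
    machine_id INT UNSIGNED    NOT NULL,
    event_ts   DATETIME        NOT NULL,
    event_type VARCHAR(40)     NOT NULL,
    PRIMARY KEY (event_id, event_ts)   -- the partition column must be part of every unique key
) ENGINE=InnoDB
PARTITION BY RANGE (TO_DAYS(event_ts)) (
    PARTITION p_before_2016 VALUES LESS THAN (TO_DAYS('2016-01-01')),  -- annual for older data
    PARTITION p_2016_q1     VALUES LESS THAN (TO_DAYS('2016-04-01')),  -- quarterly from here on
    PARTITION p_2016_q2     VALUES LESS THAN (TO_DAYS('2016-07-01')),
    PARTITION p_2016_q3     VALUES LESS THAN (TO_DAYS('2016-10-01')),
    PARTITION p_future      VALUES LESS THAN MAXVALUE  -- catch-all; REORGANIZE into new partitions over time
);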
It sounds like a main purpose of the table is to summarize the data for graphing, etc. In that case, it may be very beneficial to build and maintain "Summary table(s)" of counts and subtotals over selected time intervals. Do roughly 100 rows get rolled up per 10-minute interval? If so, a summary table based on 10-minute intervals would have 1/100th as many rows, and the queries would be much faster. Plus, you could 'denormalize' the summary tables to make them even simpler.
More on Summary tables.
It might be worth it to gather data for 10 minutes into a staging table, then summarize it into the summary table. And also throw the raw data into the big table.
Or, if the summary tables have everything you need, you could abandon the big table. Or, as a compromise, keep 12 months' worth of data (partitioned by month) and DROP PARTITION for older data. Meanwhile, the summary tables can continue to grow (although they will be much smaller).
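A sketch of such a summary table and its maintenance, reusing the hypothetical machine_event table above and summarizing straight from the big table for brevity (a staging table would slot in the same way; all names and the exact rollup window are illustrative):

CREATE TABLE event_summary_10min (
    machine_id   INT UNSIGNED NOT NULL,
    bucket_start DATETIME     NOT NULL,   -- start of the 10-minute interval
    event_count  INT UNSIGNED NOT NULL,
    PRIMARY KEY (machine_id, bucket_start)
) ENGINE=InnoDB;

-- Roll up the most recent 10-minute interval (run from cron or an EVENT).
INSERT INTO event_summary_10min (machine_id, bucket_start, event_count)
SELECT machine_id,
       event_ts - INTERVAL (MINUTE(event_ts) % 10) MINUTE
                - INTERVAL SECOND(event_ts) SECOND AS bucket_start,
       COUNT(*)
FROM machine_event
WHERE event_ts >= NOW() - INTERVAL 10 MINUTE        -- window is illustrative only
GROUP BY machine_id, bucket_start
ON DUPLICATE KEY UPDATE event_count = event_count + VALUES(event_count);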

Table partitioning is an advanced feature. It is not indexing but a rearrangement of the table's data, so it does not "duplicate" anything: new data is simply stored according to the predefined partitioning ranges.
You must still specify the month-range criteria in your queries as usual, and you MUST create an index on any columns you filter on that are not part of the partitioning range. When you run a SELECT, the partitioning layer handles any merging across partitions in the background, so you can treat a partitioned table exactly like a typical table.
For more details, please check the MariaDB partitioning overview.
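For example (reusing the hypothetical machine_event table from the first answer), a secondary index covers the non-partition filter column, and EXPLAIN PARTITIONS shows whether pruning happened:

-- Index on a column that is not part of the partitioning expression.
ALTER TABLE machine_event ADD INDEX idx_machine_ts (machine_id, event_ts);

-- Only the partitions covering March 2016 should appear in the
-- 'partitions' column of the EXPLAIN output.
EXPLAIN PARTITIONS
SELECT COUNT(*)
FROM machine_event
WHERE event_ts >= '2016-03-01'
  AND event_ts <  '2016-04-01'
  AND machine_id = 42;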

Related

Azure Analysis Service - partition to refresh modified rows only?

I have an AS tabular model that contains a fact table with 20 million rows. I have partitioned it so that only the new rows get added each day; however, occasionally a historical row (from years ago) will be modified. I can identify this modified row in SQL (using the last-modified timestamp), but would it be possible to refresh just that row in SSAS to reflect the change without having to refresh my entire data model? How would I achieve this?
First, 20 million rows is not a lot. I’m expecting that will only take 5-10 minutes to process unless your SQL queries are very inefficient or very wide. So why bother to optimize something which may be fast enough already?
If you do need to optimize it, you will first want to partition the large fact table by some date element. Since you only have 20 million rows I would suggest partitioning by year. Optimal compression will be achieved with around 8 million rows per partition. Over-partitioning (such as creating thousands of daily partitions) is counter-productive.
When a new row is added you could perform a ProcessAdd to insert just the new records to the partitions in question. However I would recommend just doing a ProcessFull on any year partitions which have any inserts, updates or deletes in SQL.
SSAS doesn’t support updating a specific row. Thus you have to follow the ProcessFull advice above.
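A hedged sketch of the SQL side of that advice, assuming a hypothetical dbo.FactSales source table with an OrderDate key and a LastModified timestamp (all names and the watermark value are invented); the years returned are the yearly partitions to ProcessFull:

-- Which calendar years (i.e. which yearly SSAS partitions) contain rows
-- changed since the last processing run?
DECLARE @LastProcessed datetime = '2016-01-01';   -- e.g. read from a control table

SELECT DISTINCT YEAR(OrderDate) AS PartitionYear
FROM dbo.FactSales
WHERE LastModified > @LastProcessed;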
There are several code examples including this one which may help you.
Again this may be overkill if you only have 20 million rows.

What are maximum limits of Mariadb columnstore?

I want to create a wide table with thousands of columns in MariaDB ColumnStore. I didn't find any documentation on the maximum number of columns allowed by the storage engine. I would also like to know how ColumnStore will perform with 1000 integer columns.
(Caveat: This 'Answer' is based on my understanding of the design, not on any 'facts'.)
The disk footprint of a table should be proportional to the number of columns.
As with most things in MariaDB, there is probably a hard limit on the number of columns, but I can think of no reason for it to be under 1000. Perhaps, instead, some larger power of 2.
When referencing only a small number of columns, it should not matter how many columns there are in the table. The way the data is structured should allow fetching each column with a relatively fixed amount of effort.
For filtering, I would expect the effort taken to depend on the number of columns used for filtering, and their distribution. If your WHERE clause references a lot of columns, I would not expect good performance.
With any Engine, having lots of columns is not necessarily a wise design. In general, when you have lots of columns that are not used for filtering or sorting (WHERE, ORDER BY), you may as well toss them into a JSON string (or other structure) and store it as a single TEXT or BLOB column. Then let the application parse the string to get the individual columns.
Columnstore shines for 'filtering'. It also is very good in compressing data, and my JSON suggestion would defeat this. But now you are into speed-vs-space tradeoffs that are very data-specific.
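For what it's worth, a sketch of that JSON idea (invented names; TEXT/BLOB support depends on the ColumnStore version, and whether the trade-off is worth it depends on your data):

-- Keep the columns you filter or sort on as real columns;
-- pack everything else into one string the application parses.
CREATE TABLE sensor_reading (
    reading_id BIGINT   NOT NULL,
    machine_id INT      NOT NULL,
    reading_ts DATETIME NOT NULL,
    extras     TEXT                -- e.g. '{"temp1": 17.2, "temp2": 18.9, ...}'
) ENGINE=ColumnStore;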
Would you care to describe your proposed dataset?
There is no limit on the number of columns for ColumnStore, but data ingestion performance is not the best at the moment. We are working to decrease those timings significantly in the near future.
When I tried to create a ColumnStore table which has 2310 columns, it returned "Error Code: 1117. Table definition is too large".
I decreased the number of columns and tried again.
It looks like the maximum number of columns for MariaDB ColumnStore is 2201.

Teradata partitioning on a daily basis while data comes in for only one day

I have a big table on which partitioning is defined. The partitions are daily, but data arrives only for the last day of each month.
This is happening for 10-12 tables.
I want to know what the downsides of this could be.
Will it occupy more space? And how will it affect retrieval of records?
Thanks,
Sumit
There's no overhead for empty partitions, and if you collect stats regularly there should be no downside. Of course, you should write matching WHERE conditions: if you do something like BETWEEN DATE '2016-01-01' AND DATE '2016-08-01', the optimizer still needs to consider all partitions (even if the stats tell it most are empty) and might choose a different join type.
But IMHO you might be better off with monthly partitions instead, to avoid all those unused partitions and keep the partition count low. Then it doesn't matter how you write your condition, either.
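A minimal sketch of such a monthly scheme (the table, column names, and date range are made up):

CREATE TABLE sales_fact (
    sale_id   BIGINT NOT NULL,
    sale_date DATE   NOT NULL,
    amount    DECIMAL(12,2)
)
PRIMARY INDEX (sale_id)
PARTITION BY RANGE_N(
    sale_date BETWEEN DATE '2016-01-01' AND DATE '2025-12-31'
              EACH INTERVAL '1' MONTH,
    NO RANGE   -- catch-all for dates outside the declared range
);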

Storing Weighted Graph Time Series in Cassandra

I am new to Cassandra, and I want to brainstorm storing time series of weighted graphs in Cassandra, where each edge weight is incremented on every tick but also decays as a function of the elapsed time. For example,
w_ij(t+1) = w_ij(t)*exp(-dt/tau) + 1
My first shot involves two CQL v3 tables:
First, I create a partition key by concatenating the id of the graph and the two nodes incident to the particular edge, e.g. G-V1-V2. I do this in order to be able to use the "ORDER BY" directive on the second component of the composite key described below, which is of type timestamp. Call this string the EID, for "edge id".
TABLE 1
- a time series of edge updates
- PRIMARY KEY: EID, time, weight
TABLE 2
- values of "last update time" and "last weight"
- PRIMARY KEY: EID
- COLUMNS: time, weight
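In CQL, that description might look roughly like this (a sketch only; edge_id is the concatenated EID string and all names are placeholders):

-- TABLE 1: time series of edge updates, clustered by time within each edge.
CREATE TABLE edge_updates (
    edge_id text,        -- the EID, e.g. 'G-V1-V2'
    time    timestamp,
    weight  double,
    PRIMARY KEY (edge_id, time, weight)
);

-- TABLE 2: last update time and last weight per edge.
CREATE TABLE edge_latest (
    edge_id text PRIMARY KEY,
    time    timestamp,
    weight  double
);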
Upon each tick, I fetch and update the time and weight values stored in TABLE 2. I use these values to compute the time delta and new weight. I then insert these values in TABLE 1.
Are there any terrible inefficiencies in this strategy? How should it be done? I already know that the update procedure for TABLE 2 is not idempotent and could result in inconsistencies, but I can accept that for the time being.
EDIT: One thing I might do is merge the two tables into a single time series table.
You should avoid any kind of read-before-write when it comes to Cassandra (and any other database where you can't do a compare-and-swap operation for the write).
First of all: Which queries and query-patterns does your application have?
Furthermore, I would be interested in how often a new weight will be calculated and stored for each edge. Every second, hour, day?
Would it be possible to hold the last weight of each edge in memory? That way you could avoid the read before the write. Possibly some sort of lazy-loading mechanism for this value would be feasible.
If your queries will allow this data model, I would try to build a solution with a single column family.
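For instance, a single-table variant might look like this (a sketch under the caveats above; names are placeholders, and the read before the write is still needed):

-- One row per edge per tick; the newest row doubles as the 'last state'.
CREATE TABLE edge_weights (
    edge_id text,
    time    timestamp,
    weight  double,
    PRIMARY KEY (edge_id, time)
) WITH CLUSTERING ORDER BY (time DESC);

-- The latest state of an edge is simply the first row of its partition.
SELECT time, weight FROM edge_weights WHERE edge_id = 'G-V1-V2' LIMIT 1;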
I would avoid reading before writing in Cassandra, as it really isn't a great fit. Reads are expensive, considerably more so than writes, and to sustain performance you'll need a large number of nodes for a relatively small number of queries. What you're suggesting doesn't lend itself well to Cassandra, since there doesn't appear to be any way to avoid reading before you write: even if you use a single table, you will still need to fetch the last update entries to perform your write. While it certainly could be done, I think there are better tools for the job.
Having said that, this would be perfectly feasible if you could keep all of the data in table 2 in memory, potentially utilising the row cache. As long as table 2 is small enough that the majority of its rows fit in memory, your reads will be significantly faster, which may make up for the need to perform a read with every write. This would be quite a challenge, however, and you would need to ensure only the "last update time" for each row is kept in memory and that disk rarely needs to be touched.
Anyway, another design you may want to look at is one where you not only use Cassandra but also put a cache in front of it to store the last update times. This could run alongside Cassandra or on a separate node, but it would be an in-memory store of the last update times only; when you need to update a row, you query the cache and write your full row to Cassandra (you could even write the last update time there too if you wished). You could use something like Redis for this, and that way you wouldn't need to worry about tombstones or forcing everything to be stored in memory, and so on.

Oracle materialized views or aggregated tables in datawarehouse

Are Oracle (11g) materialized views good practice for aggregated tables in data warehousing?
We have DW processes that replace 2 months of data each day. Sometimes that means a few GB for each month (~100K rows).
On top of them are materialized views that get refreshed after the nightly cycle of data transfer.
My question is: would it be better to create aggregated tables instead of the MVs?
I think that one case where aggregated tables might be beneficial is where the aggregation can be effectively combined with the atomic-level data load, best illustrated with an example.
Let's say that you load a large volume of data into a fact table every day via a partition exchange. A materialized view refresh using partition change tracking is going to be triggered during or after the partition exchange and it's going to scan the modified partitions and apply the changes to the MV's.
It is possible that, as part of populating the table(s) you are going to exchange with the fact table partitions, you could also compute aggregates at various levels using CUBE/ROLLUP, and use a multitable insert to load tables that you can then partition-exchange into one or more aggregation tables. Not only might this be inherently more efficient by avoiding rescanning the atomic-level data, but your aggregates are computed prior to the fact table partition exchange, so if anything goes wrong you can suspend the modification of the fact table itself.
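A rough sketch of that flow, assuming a hypothetical date-partitioned SALES fact, aggregate tables SALES_DAY_PRODUCT_AGG and SALES_DAY_AGG, and pre-created staging tables with matching structures (every object name here is invented):

-- Fan the ROLLUP output into per-level aggregate staging tables
-- before the fact table is touched.
INSERT ALL
    WHEN prod_lvl = 0 THEN                      -- date + product level
        INTO sales_day_product_agg_stg (sale_date, product_id, total_amount, txn_count)
        VALUES (sale_date, product_id, total_amount, txn_count)
    WHEN prod_lvl = 1 AND date_lvl = 0 THEN     -- date-level subtotal
        INTO sales_day_agg_stg (sale_date, total_amount, txn_count)
        VALUES (sale_date, total_amount, txn_count)
SELECT sale_date,
       product_id,
       SUM(amount)          AS total_amount,
       COUNT(*)             AS txn_count,
       GROUPING(product_id) AS prod_lvl,
       GROUPING(sale_date)  AS date_lvl
FROM   sales_stg
GROUP  BY ROLLUP (sale_date, product_id);

-- Then exchange the staging tables into the aggregate tables and,
-- last of all, the atomic staging table into the fact table.
ALTER TABLE sales_day_product_agg EXCHANGE PARTITION p_20160801
      WITH TABLE sales_day_product_agg_stg WITHOUT VALIDATION;
ALTER TABLE sales_day_agg EXCHANGE PARTITION p_20160801
      WITH TABLE sales_day_agg_stg WITHOUT VALIDATION;
ALTER TABLE sales EXCHANGE PARTITION p_20160801
      WITH TABLE sales_stg WITHOUT VALIDATION;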
Other thoughts might occur later ... I'll open the answer up as a community wiki if others have ideas.
