I have a table whose primary index is on one column and which is partitioned by a date column. This is a sample of the DDL:
CREATE MULTISET TABLE DB.TABLE_NAME,
NO FALLBACK ,
NO BEFORE JOURNAL,
NO AFTER JOURNAL,
CHECKSUM = DEFAULT,
DEFAULT MERGEBLOCKRATIO
( FIRST_KEY DECIMAL(20,0) NOT NULL,
SECOND_KEY DECIMAL(20,0) ,
THIRD_COLUMN VARCHAR(5),
DAY_DT DATE FORMAT 'YYYY-MM-DD')
PRIMARY INDEX TABLE_NAME_IDX_PR (FIRST_KEY)
PARTITION BY RANGE_N(DAY_DT BETWEEN DATE '2007-01-06'
AND DATE '2016-01-02' EACH INTERVAL '1' DAY );
COLLECT STATS ON DB.TABLE_NAME COLUMN(FIRST_KEY);
The incoming data can be around 30 million rows each day, and I have loaded the data for 2012-04-11. Now I have to collect stats for only the '2012-04-11' partition instead of the whole table.
Is there any way to collect stats on the partition for a particular day?
You can simply collect stats on the system column PARTITION and it should update the histograms relating to the partitioned column.
COLLECT STATS ON {databasename}.{tablename} COLUMN (PARTITION);
This can be collected on both partitioned and non-partitioned tables. It helps provide the optimizer with the cardinality of the table and of the partitions (if they exist), and it updates the statistics for all the partitions of the table. Collecting stats on the PARTITION column is a low-CPU, short wall-clock process. It is significantly less expensive than collecting stats on a physical column or on the entire table, even for tables with tens of millions of records or more.
If you want to determine whether the optimizer recognizes the refreshed statistics, there is no direct way as of TD 13.10 (not sure about TD 14.x). However, if you run an EXPLAIN on your query you can tell whether the optimizer has high confidence on the step in which the criteria against the partitioning column are applied. If you specify a single date, such as DATE '2012-04-11', you should see in the EXPLAIN that partition elimination has taken place and only a single partition is accessed.
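As a rough sketch using the table and date from the question (the exact EXPLAIN wording varies by release), you could refresh the PARTITION stats and then check the plan for a single-partition retrieve step with high confidence:

COLLECT STATS ON DB.TABLE_NAME COLUMN (PARTITION);

EXPLAIN
SELECT FIRST_KEY, SECOND_KEY
FROM DB.TABLE_NAME
WHERE DAY_DT = DATE '2012-04-11';
-- expect wording along the lines of "from a single partition of DB.TABLE_NAME ... with high confidence"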
If you need help with digesting the EXPLAIN, edit your original question with the EXPLAIN plan for the query and I will help you digest it.
I have a couple of big tables (from 60M to 2B rows) on which I need to create partitions. Since they are used in the core of our platform, we are trying to find out whether the database will take any kind of lock on those tables when we start creating the partitions in the production environment.
We are on MariaDB 10.0.24 and 10.1.34.
ALTER TABLE data_history PARTITION BY RANGE ( period ) (
PARTITION past VALUES LESS THAN (201801),
PARTITION p2018 VALUES LESS THAN (201901),
PARTITION p2019 VALUES LESS THAN (202001),
PARTITION p2020 VALUES LESS THAN (202101),
PARTITION future VALUES LESS THAN MAXVALUE
);
The period column is an integer field in year+month (YYYYMM) format.
When adding PARTITIONing to a table, the entire table is copied over. Hence the lock.
When adding a PARTITION to an already-partitioned table, it depends on the actual command issued. All (?) cases will LOCK, but the length of the lock varies. Please provide the actual command.
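For example, if the goal were simply to carve a new year out of the future partition, a hedged sketch against the table above might be (partition names assumed from the ALTER shown in the question):

ALTER TABLE data_history
    REORGANIZE PARTITION future INTO (
        PARTITION p2021   VALUES LESS THAN (202201),
        PARTITION future  VALUES LESS THAN MAXVALUE
    );
-- REORGANIZE rebuilds only the named partitions, so only the rows currently in
-- "future" are copied, which should keep the lock much shorter than a full table rebuild.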
Why partition? Partitioning does not intrinsically lead to any performance improvements. See this for discussion of the few cases where partitioning is beneficial: http://mysql.rjweb.org/doc.php/partitionmaint
I am developing an application that allows users to read books. I am using DynamoDB for storing details of the books that user reads and I plan to use the data stored in DynamoDB for calculating statistics, such as trending books, authors, etc.
My current schema looks like this:
user_id | timestamp | book_id | author_id
user_id is the partition key, and timestamp is the sort key.
The problem I am having is that with this schema I am only able to query the details of the books that a single user (partition key) has read. That is one of my requirements.
The other requirement is to query all the records that have been created in a certain date range, e.g. records created in the past 7 days. With this schema, I am unable to run this query.
I have looked into so many other options, and haven't figured out a way to create a schema that would allow me to run both queries.
Retrieve the records of the books read by a single user (Can be done).
Retrieve the records of books read by all the users in last x days (Unable to do it).
I do not want to run a scan, since it will be expensive. I looked into the option of using a GSI on the timestamp, but a GSI still requires a hash key, and therefore I cannot query all the records created between two dates.
One naive solution would be to create a GSI with a constant hash key across all books and the timestamp as the range key. This would allow you to perform the type of query you need.
The problem with this approach is that it is likely to become a scaling bottleneck, since the same hash key means the same partition. One workaround is sharding: create a set of hash keys (e.g. 1 to 10) and assign a random key from this set to every item. When you query, you then issue 10 queries (one per shard key) and merge the results. You can even make the set size dynamic, so that it scales with your data.
I would also suggest looking into other tools (not DynamoDB) for this use case, as DDB is not the best tool for data analysis. You might, for example, feed DynamoDB data into CloudSearch or ElasticSearch and do your analysis there.
One solution could be to use a GSI with two more columns: whenever you ingest a record, also store the date as the GSI partition key (e.g. 2017-07-02) and the timestamp as the range key (e.g. 04:22:33:000).
Maintain a checkpoint table that contains the process name and the last-processed timestamp. Every time you read from the main table, update the checkpoint table so that you can fetch incremental data. If you want the last 7 days of data, move the checkpoint timestamp back 7 days and fetch the data between that point and the current time.
You can do this with a query spec by passing the date as the partition key and using a BETWEEN condition on the timestamp, which is the range key.
Calculate the date difference between the checkpoint and the current date, and fetch the data day by day.
I have a big table on which partitioning has been defined. The partitions are created on a daily basis, but data arrives only for the last day of each month.
This is happening for 10-12 tables.
I want to know what the downsides of this can be.
Will it occupy more space? And how will it affect retrieval of records?
Thanks,
Sumit
There's no overhead for empty partitions, and if you collect stats regularly there should be no downside. Of course you should write matching WHERE conditions: if you do something like BETWEEN DATE '2016-01-01' AND DATE '2016-08-01', the optimizer still needs to consider all partitions in that range (even if the stats say most are empty) and might choose a different join type.
But IMHO you should rather consider monthly partitions, to avoid all those unused partitions and keep the partition count low. Then it also doesn't matter how you write your condition.
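For illustration only (the table and column names below are placeholders, not your actual DDL), a monthly RANGE_N definition could look like this:

CREATE MULTISET TABLE DB.EVENTS_MONTHLY      -- hypothetical table
( EVENT_ID DECIMAL(20,0) NOT NULL,
  DAY_DT   DATE FORMAT 'YYYY-MM-DD' NOT NULL)
PRIMARY INDEX (EVENT_ID)
PARTITION BY RANGE_N(DAY_DT BETWEEN DATE '2016-01-01'
                     AND DATE '2025-12-31'
                     EACH INTERVAL '1' MONTH);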
First, let me explain my problem:
This is a table that will contain approximately 5,000,000 records per year, and these records will be kept for at least 10 years (the retention is not yet defined). We are talking about events from production machines. I generate a report plus a dashboard displaying various relatively complex information (average number of events per 10 minutes over a month, graphs, ...), and users also want to see the records themselves. The data displayed will be, in the large majority, from the last 2 months; viewing the rest of the data must always be possible, but a slower access speed is acceptable.
I work on MariaDB v10.1.12.
The idea was to create a partition covering the last 3 months. I realize now that this is not so easy. I have not found any solution for such a partition; in fact, it is impossible to define a partition based on NOW(), CURRENT_DATE(), etc., either directly or indirectly via another computed column.
Do you have any ideas for me? Perhaps a solution other than partitioning.
Thank you in advance.
I recommend PARTITION BY RANGE(TO_DAYS(...)). If you are only now breaking the table into partitions, I would recommend annual partitions for data before this year, then quarterly or monthly partitions henceforth. Yes, that, in theory, leads to an infinite number of partitions, but I predict that you will revamp the data structure within a few years.
20-50 partitions is a good number. More than that leads to inefficiencies due to the multitude of partitions; fewer than that leads to asking "why bother".
Use InnoDB. Design the PRIMARY KEY carefully, since it may be useful as the primary index into the data.
Usually it is best to put the date/timestamp column last in any indexes. Putting it first would be redundant since partition pruning comes first.
More on partitioning.
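As a rough, hypothetical sketch (table and column names are mine, not from the question), monthly partitions by TO_DAYS might look like this:

CREATE TABLE machine_events (                -- hypothetical raw events table
  id         BIGINT UNSIGNED NOT NULL AUTO_INCREMENT,
  event_ts   DATETIME NOT NULL,
  machine_id INT NOT NULL,
  PRIMARY KEY (id, event_ts)                 -- the partition column must be part of every unique key
) ENGINE=InnoDB
PARTITION BY RANGE (TO_DAYS(event_ts)) (
  PARTITION p_hist   VALUES LESS THAN (TO_DAYS('2016-01-01')),
  PARTITION p201601  VALUES LESS THAN (TO_DAYS('2016-02-01')),
  PARTITION p201602  VALUES LESS THAN (TO_DAYS('2016-03-01')),
  PARTITION p_future VALUES LESS THAN MAXVALUE
);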
It sounds like a main purpose for the table is to summarize the data for graphing, etc. In that case, it may be very beneficial to build and maintain "Summary table(s)" of counts and subtotals over selected time intervals. 100 rows get added up for a 10-minute interval? If so, then the summary table based on 10-minute intervals would have 1/100th as many rows, and the queries would be much faster. Plus, you could 'denormalize' the summary tables to make them even simpler.
More on Summary tables.
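A minimal sketch of such a summary table and its rollup, reusing the hypothetical machine_events table above (the names and the 10-minute bucketing are assumptions, not the asker's schema):

-- hypothetical summary table: one row per machine per 10-minute bucket
CREATE TABLE event_summary_10min (
  machine_id   INT NOT NULL,
  bucket_start DATETIME NOT NULL,
  event_count  INT NOT NULL,
  PRIMARY KEY (machine_id, bucket_start)
) ENGINE=InnoDB;

-- roll up recent raw rows into 10-minute buckets (run periodically)
INSERT INTO event_summary_10min (machine_id, bucket_start, event_count)
SELECT machine_id,
       FROM_UNIXTIME(FLOOR(UNIX_TIMESTAMP(event_ts) / 600) * 600),
       COUNT(*)
FROM machine_events
WHERE event_ts >= NOW() - INTERVAL 10 MINUTE
GROUP BY 1, 2
ON DUPLICATE KEY UPDATE event_count = event_count + VALUES(event_count);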
It might be worth it to gather data for 10 minutes into a staging table, then summarize it into the summary table. And also throw the raw data into the big table.
Or, if the summary tables have everything you need, you could abandon the big table. Or, as a compromise, keep 12 months' worth of data (partitioned by month), and DROP PARTITION for older data. Meanwhile, the summary tables can continue to grow (although they will be much smaller).
Table partitioning is an advanced feature; it is not indexing, but a rearrangement of the table's data. So it is not a "duplicate": new data will be stored according to the predefined partitioning ranges.
You must also specify the month range criteria as usual, and you MUST create an index if those columns are not used as the partitioning range. When you run a SELECT, the partitioning logic handles any merging (if required) in the background, so you can treat a partitioned table exactly like a typical table.
For more details, please check the MariaDB partitioning overview.
I am currently working on a project that collects a customer's demographics weekly and stores the delta (from previous weeks) as a new record. This process will encompass 160 variables and a couple hundred million people (my management and a consulting firm require this, although ~100 of the variables are seemingly useless). These variables will be collected from 9 different tables in our Teradata warehouse.
I am planning to split this into 2 tables.
Table with often used demographics (~60 variables sourced from 3 tables)
Normalized (1 customer id and add date for each demographic variable)
Table with rarely or unused demographics (~100 variables sourced from 6 tables)
Normalized (1 customer id and add date for each demographic variable)
MVC (multi-value compression) is utilized to save as much space as possible, as the database it will live on is limited in size due to backup limitations. (Note that the customer id currently consumes 30% (3.5 GB) of table 1's size, so additional tables would add that storage cost.)
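For reference, the MVC declarations look roughly like this; the column names and compressed value lists here are illustrative only, not the real DDL:

CREATE MULTISET TABLE db1.demo_test            -- simplified, hypothetical fragment
( cus_id   DECIMAL(18,0) NOT NULL,
  add_dt   DATE FORMAT 'YYYY-MM-DD' NOT NULL,
  demo_cd  CHAR(1)     COMPRESS ('M','F','U'),          -- most frequent codes compressed away
  demo_val VARCHAR(20) COMPRESS ('N/A','UNKNOWN')
)
PRIMARY INDEX (cus_id);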
The table(s) will be accessed by finding the most recent record in relation to the date the Analyst has selected:
SELECT cus_id,demo
FROM db1.demo_test
WHERE (cus_id,add_dt) IN (
SELECT cus_id, MAX(add_dt)
FROM db1.dt_test
WHERE add_dt <= '2013-03-01' -- Analyst selected Point-in-Time Date
GROUP BY 1)
GROUP BY 1,2
This data will be used for modeling purposes, so a reasonable SELECT speed is acceptable.
Does this approach seem sound for storage and querying?
Is any individual table too large?
Is there a better suggested approach?
My concern with splitting further is
Space due to uncompressible fields such as dates and customer ids
Speed with joining 2-3 tables (I suspect an inner join may use very little resources.)
Please excuse my ignorance in this matter. I usually work with large tables that do not persist for long (I am a Data Analyst by profession) or the tables I build for long term data collection only contain a handful of columns.
In addition to Rob's remarks:
What is your current PI/partitioning?
Is the current performance unsatisfactory?
Besides the point-in-time date, how do the analysts access the data? Are there any other common conditions?
Depending on your needs, a (prev_dt, add_dt) pair might be better than a single add_dt. It is more overhead to load, but querying might be as simple as WHERE DATE '...' BETWEEN prev_dt AND add_dt.
A Join Index on (cus_id), (add_dt) might be helpful, too.
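One possible reading of that suggestion, as a hedged sketch (the index name, column list, and primary index choice are assumptions to adapt to your table):

CREATE JOIN INDEX db1.demo_test_ji AS
SELECT cus_id, add_dt, demo
FROM db1.demo_test
PRIMARY INDEX (cus_id);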
You might replace the MAX subquery with a RANK (MAX is usually slower; only when cus_id is the PI might RANK be worse):
SELECT *
FROM db1.demo_test
WHERE add_dt <= DATE '2013-03-01' -- Analyst selected Point-in-Time Date
QUALIFY
   RANK() OVER (PARTITION BY cus_id ORDER BY add_dt DESC) = 1
In TD 14 you might split your single table into two row containers of a column-partitioned table.
...
The width of the table at 160 columns, sparsely populated, is not necessarily an incorrect physical implementation (normalized in 3NF or slightly de-normalized). I have also seen situations where attributes that are not regularly accessed are moved to a documentation table. If you elect to implement the latter in your physical implementation, it would be in your best interest for each table to share the same primary index. This allows the join of these two tables (60 attributes and 100 attributes) to be AMP-local on Teradata.
If access to the table(s) will also include the add_dt column, you may wish to create a partitioned primary index on this column. This will allow the optimizer to eliminate the other partitions from being scanned when the add_dt column is included in the WHERE clause of a query. Another option would be to test the behavior of a value-ordered secondary index on the add_dt column.
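A rough sketch of such a partitioned primary index (the column list, date range, and monthly interval are assumptions to adapt to your DDL); both tables would share PRIMARY INDEX (cus_id) so the join stays AMP-local:

CREATE MULTISET TABLE db1.demo_core            -- hypothetical "often used" demographics table
( cus_id DECIMAL(18,0) NOT NULL,
  add_dt DATE FORMAT 'YYYY-MM-DD' NOT NULL,
  demo   VARCHAR(50)
)
PRIMARY INDEX (cus_id)
PARTITION BY RANGE_N(add_dt BETWEEN DATE '2010-01-01'
                     AND DATE '2020-12-31'
                     EACH INTERVAL '1' MONTH);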