I want to add a new column from a file to an existing table, in the way cbind does in R.
The file has 1 column, 23710 lines, all numbers:
me@my_server:/var/www/html/my_website$ head my_sample.txt
61
66
0
330
76
9
10
16
6
0
Using the code:
ALTER TABLE my_table ADD COLUMN IF NOT EXISTS sample69 INT(10) DEFAULT NULL;
LOAD DATA LOCAL INFILE '/var/www/html/my_website/my_sample.txt' INTO TABLE my_table LINES TERMINATED BY '\n' (sample69);
Before:
MariaDB [my_database]> select * from my_table limit 10;
+------------+-----------+
| geneSymbol | sample000 |
+------------+-----------+
| A1BG | 61 |
| A1BG-AS1 | 66 |
| A1CF | 0 |
| A2M | 330 |
| A2M-AS1 | 76 |
| A2ML1 | 9 |
| A2MP1 | 10 |
| A4GALT | 16 |
| A4GNT | 6 |
| AA06 | 0 |
+------------+-----------+
MariaDB [my_database]> select count(*) from my_table;
+----------+
| count(*) |
+----------+
| 23710 |
+----------+
After:
MariaDB [my_database]> select * from my_table limit 10;
+------------+-----------+-----------+
| geneSymbol | sample000 | sample69 |
+------------+-----------+-----------+
| A1BG | 61 | NULL |
| A1BG-AS1 | 66 | NULL |
| A1CF | 0 | NULL |
| A2M | 330 | NULL |
| A2M-AS1 | 76 | NULL |
| A2ML1 | 9 | NULL |
| A2MP1 | 10 | NULL |
| A4GALT | 16 | NULL |
| A4GNT | 6 | NULL |
| AA06 | 0 | NULL |
+------------+-----------+-----------+
MariaDB [my_database]> select count(*) from my_table;
+----------+
| count(*) |
+----------+
| 47420 |
+----------+
It apparently appends the data to the end of the column. Instead I want the new column to be the same length of 23710, filled with the new data from the file.
What am I doing wrong?
LOAD DATA only inserts whole new rows; it cannot fill in a column of rows that already exist. That is why the row count doubled.
Even if it could load just one column, how would it know which row each number goes with?
You must rebuild the data file with two columns (geneSymbol and sample69), load that into a temporary table, then use a multi-table UPDATE (UPDATE ... JOIN) to copy the values into the main table.
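The staging-table approach can be sketched roughly as follows. This is only an illustration, run against an in-memory SQLite database through Python's sqlite3 module rather than MariaDB. The staging table name, the three-row miniature of my_table, and the sample values are made up, and it assumes the file lists its values in the same order as the gene symbols.

```python
import sqlite3

# Hypothetical miniature of my_table; the real one has 23710 rows.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE my_table (geneSymbol TEXT PRIMARY KEY, sample000 INTEGER)")
conn.executemany(
    "INSERT INTO my_table VALUES (?, ?)",
    [("A1BG", 61), ("A1BG-AS1", 66), ("A1CF", 0)],
)

# Values as they would be read from the one-column file (made-up numbers).
# Assumption: the file is in the same order as this gene list.
gene_order = ["A1BG", "A1BG-AS1", "A1CF"]
new_values = [12, 7, 3]

# Step 1: add the empty column.
conn.execute("ALTER TABLE my_table ADD COLUMN sample69 INTEGER DEFAULT NULL")

# Step 2: stage (geneSymbol, value) pairs in a temporary table.
conn.execute("CREATE TEMP TABLE staging (geneSymbol TEXT PRIMARY KEY, sample69 INTEGER)")
conn.executemany("INSERT INTO staging VALUES (?, ?)", zip(gene_order, new_values))

# Step 3: copy the staged values into the existing rows, matched by key.
# In MariaDB this step would be a multi-table UPDATE, e.g.
#   UPDATE my_table JOIN staging USING (geneSymbol)
#   SET my_table.sample69 = staging.sample69;
conn.execute(
    """
    UPDATE my_table
    SET sample69 = (SELECT s.sample69 FROM staging AS s
                    WHERE s.geneSymbol = my_table.geneSymbol)
    """
)

rows = conn.execute(
    "SELECT geneSymbol, sample000, sample69 FROM my_table ORDER BY geneSymbol"
).fetchall()
row_count = conn.execute("SELECT count(*) FROM my_table").fetchone()[0]
```

The row count stays the same because the new values are matched to existing rows by geneSymbol instead of being inserted as new rows.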
Addenda
If you have 69 columns of samples, that is the wrong way to design the schema. At some point, you will hit a limit.
Plan A: Lots of rows, not lots of columns:
CREATE TABLE x (
geneSymbol VARCHAR(..) ...,
num SMALLINT UNSIGNED NOT NULL,
value SMALLINT UNSIGNED NOT NULL,
PRIMARY KEY(geneSymbol, num)
) ENGINE=InnoDB
Plan B (This will require code to add each new sample):
CREATE TABLE x (
geneSymbol VARCHAR(..) ...,
samples TEXT NOT NULL, -- JSON encoded list of samples for that gene (column name is illustrative)
PRIMARY KEY(geneSymbol)
) ENGINE=InnoDB
Plan C (aimed at reading one sample):
CREATE TABLE x (
num SMALLINT UNSIGNED NOT NULL,
gene_values TEXT NOT NULL, -- JSON encoded list of values for that sample (column name is illustrative)
PRIMARY KEY(num)
) ENGINE=InnoDB
What will your queries be like? I suspect you will be reading all the data rather than filtering with WHERE clauses on geneSymbol or num.
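To make Plan A more concrete, here is a minimal sketch using SQLite through Python's sqlite3 module as a stand-in for MariaDB. TEXT and INTEGER replace the VARCHAR and SMALLINT UNSIGNED types, the add_sample helper is hypothetical, and the values for sample 69 are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Plan A layout: one row per (gene, sample) pair.
conn.execute(
    """
    CREATE TABLE x (
        geneSymbol TEXT NOT NULL,
        num        INTEGER NOT NULL,   -- sample number
        value      INTEGER NOT NULL,
        PRIMARY KEY (geneSymbol, num)
    )
    """
)

def add_sample(num, pairs):
    """Insert one sample as rows; pairs is an iterable of (geneSymbol, value)."""
    conn.executemany(
        "INSERT INTO x (geneSymbol, num, value) VALUES (?, ?, ?)",
        [(gene, num, value) for gene, value in pairs],
    )

add_sample(0, [("A1BG", 61), ("A1CF", 0)])
add_sample(69, [("A1BG", 12), ("A1CF", 3)])   # made-up values

# All values of one sample, and all samples of one gene.
sample_69 = conn.execute(
    "SELECT geneSymbol, value FROM x WHERE num = ? ORDER BY geneSymbol", (69,)
).fetchall()
gene_a1bg = conn.execute(
    "SELECT num, value FROM x WHERE geneSymbol = ? ORDER BY num", ("A1BG",)
).fetchall()
```

Adding a new sample then becomes a plain INSERT of rows, with no ALTER TABLE involved.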
Related
I'm trying to build a view which allows me to track the difference between paid values at two consecutive month_ids. When a figure is missing however, that would be because it's the first entry and therefore has a paid amount of 0. At present, I'm using the below to represent the previous figure since the [,default] argument has not been implemented in MariaDB.
CASE WHEN (
NOT(policy_agent_month.policy_agent_month_id IS NOT NULL
AND LAG(days_paid, 1) OVER (PARTITION BY claim_id ORDER BY month_id ) IS NULL)) THEN
LAG(days_paid, 1) OVER ( PARTITION BY claim_id ORDER BY month_id)
ELSE
0
END
The problem I have with this is that I have about 30 variables which this function needs to be applied over and it makes my code unreadable and very clunky. Is there a better solution?
Why use WITH?
SELECT province, tot_pop,
tot_pop - COALESCE(
(LAG(tot_pop) OVER (ORDER BY tot_pop ASC)),
0) AS delta
FROM provinces
ORDER BY tot_pop asc;
+---------------------------+----------+---------+
| province | tot_pop | delta |
+---------------------------+----------+---------+
| Nunavut | 14585 | 14585 |
| Yukon | 21304 | 6719 |
| Northwest Territories | 24571 | 3267 |
| Prince Edward Island | 63071 | 38500 |
| Newfoundland and Labrador | 100761 | 37690 |
| New Brunswick | 332715 | 231954 |
| Nova Scotia | 471284 | 138569 |
| Saskatchewan | 622467 | 151183 |
| Manitoba | 772672 | 150205 |
| Alberta | 2481213 | 1708541 |
| British Columbia | 3287519 | 806306 |
| Quebec | 5321098 | 2033579 |
| Ontario | 10071458 | 4750360 |
+---------------------------+----------+---------+
13 rows in set (0.00 sec)
However, it is not cheap (at least in MySQL 8.0).
The table has only 13 rows, yet the handler counters come out much higher:
FLUSH STATUS;
SELECT ...
SHOW SESSION STATUS LIKE 'Handler%';
MySQL 8.0:
+----------------------------+-------+
| Variable_name | Value |
+----------------------------+-------+
| Handler_read_rnd | 89 |
| Handler_read_rnd_next | 52 |
| Handler_write | 26 |
(and others)
MariaDB 10.3:
| Handler_read_rnd | 77 |
| Handler_read_rnd_next | 42 |
| Handler_tmp_write | 13 |
| Handler_update | 13 |
You can use a CTE (Common Table Expression) in MariaDB 10.2+ to pre-compute frequently used expressions and name them for later use:
with
x as ( -- first we compute the CTE that we name "x"
select
*,
coalesce(
LAG(days_paid, 1) OVER (PARTITION BY claim_id ORDER BY month_id),
123456
) as prev_month -- this expression gets the name "prev_month"
from my_table -- or a simple/complex join here
)
select -- now the main query
prev_month
from x
... -- rest of your query here where "prev_month" is computed.
In the main query prev_month has the lag value, or the default value 123456 when it's null.
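With about 30 columns, one possible way to keep the query readable is to name the window once and generate the repeated COALESCE(LAG(...)) expressions. The sketch below runs in SQLite through Python's sqlite3 module (a SQLite build with window-function support is assumed); the payments table, the amount_paid column, and the data are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE payments "
    "(claim_id INTEGER, month_id INTEGER, days_paid INTEGER, amount_paid INTEGER)"
)
conn.executemany(
    "INSERT INTO payments VALUES (?, ?, ?, ?)",
    [(1, 1, 5, 100), (1, 2, 8, 150), (1, 3, 2, 40), (2, 1, 4, 90)],
)

# Columns that need "current minus previous month" (about 30 in the question).
columns = ["days_paid", "amount_paid"]

# One expression per column; a missing previous row counts as 0.
select_list = ",\n       ".join(
    f"{col} - COALESCE(LAG({col}) OVER w, 0) AS {col}_delta" for col in columns
)

# The window is defined once in the WINDOW clause and reused by name.
query = f"""
SELECT claim_id, month_id,
       {select_list}
FROM payments
WINDOW w AS (PARTITION BY claim_id ORDER BY month_id)
ORDER BY claim_id, month_id
"""
deltas = conn.execute(query).fetchall()
```

The first month of each claim has no previous row, so its delta equals the paid value itself.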
I'm trying to compose an SQLite query and I've found a problem that's beyond my skillset. I'm trying to output columns that are based on the rows of another referenced table.
Food_List:
| food_id | name |
|---------|-----------|
| 1 | Apple |
| 2 | Orange |
| 3 | Pear |
Nutrient_Definition:
| nutrient_id | name |
|-------------|-----------|
| 21 | Carbs |
| 22 | Protein |
| 23 | Fat |
Nutrient_Data:
| food_id | nutrient_id | value |
|---------|-------------|-------|
| 1 | 21 | 50 |
| 1 | 22 | 24 |
| 1 | 23 | 63 |
| 2 | 22 | 12 |
| 2 | 23 | 95 |
| 3 | 21 | 66 |
| 3 | 22 | 87 |
| 3 | 23 | 38 |
Output:
| food_id | name | Carbs | Protein | Fat |
|---------|-----------|-------|---------|-----|
| 1 | Apple | 50 | 24 | 63 |
| 2 | Orange | | 12 | 95 |
| 3 | Pear | 66 | 87 | 38 |
(Note that Orange does not have a "Carbs" entry in the Nutrient_Data table)
I believe the following will do what you want :-
DROP TABLE IF EXISTS food_list;
CREATE TABLE IF NOT EXISTS food_list(food_id INTEGER PRIMARY KEY, name TEXT);
DROP TABLE IF EXISTS nutrient_definition;
CREATE TABLE IF NOT EXISTS nutrient_definition(nutrient_id INTEGER PRIMARY KEY, name TEXT);
DROP TABLE IF EXISTS nutrient_data;
CREATE TABLE IF NOT EXISTS nutrient_data(food_id INTEGER, nutrient_id INTEGER, value INTEGER);
INSERT INTO food_list (name) VALUES
('apple'),('orange'),('pear')
;
INSERT INTO nutrient_definition (name) VALUES
('carbs'),('protein'),('fat')
;
INSERT INTO nutrient_data VALUES
(1,1,50),(1,2,24),(1,3,63),
(2,2,12),(2,3,95),
(3,1,66),(3,2,87),(3,3,38)
;
SELECT food_list.food_id,food_list.name,
(
SELECT value
FROM nutrient_data
WHERE nutrient_data.food_id = food_list.food_id AND
nutrient_data.nutrient_id = (SELECT nutrient_definition.nutrient_id FROM nutrient_definition WHERE nutrient_definition.name = 'carbs')
) AS Carbs,
(
SELECT value
FROM nutrient_data
WHERE nutrient_data.food_id = food_list.food_id AND
nutrient_data.nutrient_id = (SELECT nutrient_definition.nutrient_id FROM nutrient_definition WHERE nutrient_definition.name = 'protein')
) AS Protein,
(
SELECT value
FROM nutrient_data
WHERE nutrient_data.food_id = food_list.food_id AND
nutrient_data.nutrient_id = (SELECT nutrient_definition.nutrient_id FROM nutrient_definition WHERE nutrient_definition.name = 'fat')
) AS Fat
FROM food_list
;
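An alternative to one subquery per nutrient is conditional aggregation: join the three tables once and pick each nutrient with a CASE expression. The following is a sketch only, run through Python's sqlite3 module with the sample data from the question; the column list still has to be written out by hand for each nutrient.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE food_list (food_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE nutrient_definition (nutrient_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE nutrient_data (food_id INTEGER, nutrient_id INTEGER, value INTEGER);
INSERT INTO food_list VALUES (1,'Apple'),(2,'Orange'),(3,'Pear');
INSERT INTO nutrient_definition VALUES (21,'Carbs'),(22,'Protein'),(23,'Fat');
INSERT INTO nutrient_data VALUES
  (1,21,50),(1,22,24),(1,23,63),
  (2,22,12),(2,23,95),
  (3,21,66),(3,22,87),(3,23,38);
""")

# One output row per food; each CASE keeps only the value of its nutrient,
# and MAX collapses the group to that single value (NULL if there is none).
pivot = conn.execute("""
SELECT f.food_id, f.name,
       MAX(CASE WHEN n.name = 'Carbs'   THEN d.value END) AS Carbs,
       MAX(CASE WHEN n.name = 'Protein' THEN d.value END) AS Protein,
       MAX(CASE WHEN n.name = 'Fat'     THEN d.value END) AS Fat
FROM food_list AS f
LEFT JOIN nutrient_data AS d ON d.food_id = f.food_id
LEFT JOIN nutrient_definition AS n ON n.nutrient_id = d.nutrient_id
GROUP BY f.food_id, f.name
ORDER BY f.food_id
""").fetchall()
```

Orange has no Carbs entry, so that cell comes back as NULL (None in Python).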
I have the following table in an sqlite database
+----+-------------+-------+
| ID | Week Number | Count |
+----+-------------+-------+
| 1 | 1 | 31 |
| 2 | 2 | 16 |
| 3 | 3 | 73 |
| 4 | 4 | 59 |
| 5 | 5 | 44 |
| 6 | 6 | 73 |
+----+-------------+-------+
I want to get the following table out, with this week's sales in one column and last week's sales in the next column.
+-------------+-----------+-----------+
| Week Number | This_Week | Last_Week |
+-------------+-----------+-----------+
| 1 | 31 | null |
| 2 | 16 | 31 |
| 3 | 73 | 16 |
| 4 | 59 | 73 |
| 5 | 44 | 59 |
| 6 | 73 | 44 |
+-------------+-----------+-----------+
This is the select statement I was going to use:
select
id, week_number, count,
(select count from tempTable
where week_number = (week_number-1))
from
tempTable;
You are comparing values in two different rows. When you are just writing week_number, the database does not know which one you mean.
To refer to a column in a specific table, you have to prefix it with the table name: tempTable.week_number.
And if both tables have the same name, you have to rename at least one of them:
SELECT id,
week_number,
count AS This_Week,
(SELECT count
FROM tempTable AS T2
WHERE T2.week_number = tempTable.week_number - 1
) AS Last_Week
FROM tempTable;
If you want to query the same table twice in one statement, you have to give an alias to each occurrence so that they can be told apart:
select a.week_number,a.count this_week,
(select b.count from tempTable b
where b.week_number=(a.week_number-1)) last_week
from tempTable a;
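The same lookup can also be written as a LEFT JOIN of the table to itself, which some readers may find easier to follow than a correlated subquery. Below is a minimal sketch using Python's sqlite3 module and the data from the question; the aliases cur and prev are arbitrary.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE tempTable (id INTEGER PRIMARY KEY, week_number INTEGER, count INTEGER)"
)
conn.executemany(
    "INSERT INTO tempTable VALUES (?, ?, ?)",
    [(1, 1, 31), (2, 2, 16), (3, 3, 73), (4, 4, 59), (5, 5, 44), (6, 6, 73)],
)

# "cur" is the current week, "prev" the row for the week before it.
# LEFT JOIN keeps week 1 even though it has no previous week.
weeks = conn.execute("""
SELECT cur.week_number, cur.count AS This_Week, prev.count AS Last_Week
FROM tempTable AS cur
LEFT JOIN tempTable AS prev ON prev.week_number = cur.week_number - 1
ORDER BY cur.week_number
""").fetchall()
```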
I have the query:
SELECT count(*)
FROM
(
SELECT
TBELENCO.DATA_PROC, TBELENCO.POD, TBELENCO.DESCRIZIONE, TBELENCO.ERROR, TBELENCO.STATO,
TBELENCO.SEZIONE, TBELENCO.NOME_FILE, TBELENCO.ID_CARICAMENTO, TBELENCO.ESITO_OPERAZIONE,
TBELENCO.DES_TIPO_MISURA,
--TBELENCO.RAGIONE_SOCIALE,
--ROW_NUMBER() OVER (ORDER BY TBELENCO.DATA_PROC DESC) R
ROWNUM R
FROM(
SELECT
LOG.DATA_PROC, LOG.POD, LOG.DESCRIZIONE, LOG.ERROR, LOG.STATO,
LOG.SEZIONE, LOG.NOME_FILE, LOG.ID_CARICAMENTO, LOG.ESITO_OPERAZIONE, TM.DES_TIPO_MISURA
--,C.RAGIONE_SOCIALE
--ROW_NUMBER() OVER (ORDER BY LOG.DATA_PROC DESC) R
FROM
MS042_LOADING_LOGS LOG JOIN MS116_MEASURE_TYPES TM ON
TM.ID_TIPO_MISURA=LOG.SEZIONE
-- LEFT JOIN(
-- SELECT CUST.RAGIONE_SOCIALE,STR.POD,RSC.DATA_DA, RSC.DATA_A
-- FROM
-- MS038_METERS STR JOIN MS036_REL_SITES_CUSTOMERS RSC ON
-- STR.ID_SITO=RSC.ID_SITO
-- JOIN MS030_CUSTOMERS CUST ON
-- CUST.ID_CLIENTE=RSC.ID_CLIENTE
-- ) C ON
-- C.POD=LOG.POD
--AND LOG.DATA_PROC BETWEEN C.DATA_DA AND C.DATA_A
WHERE
1=1
--AND LOG.DATA_PROC>=TRUNC(SYSDATE)
AND LOG.DATA_PROC>=TRUNC(SYSDATE)-3
--TO_DATE('01/11/2014', 'DD/MM/YYYY')
) TBELENCO
)
WHERE
R BETWEEN 1 AND 200;
If I execute the query with AND LOG.DATA_PROC>=TRUNC(SYSDATE)-3, Oracle uses the index on the data_proc field of the MS042_LOADING_LOGS (LOG) table. If I instead use AND LOG.DATA_PROC>=TRUNC(SYSDATE)-4 (or -5, -6, etc.), it does a full table scan (TABLE ACCESS FULL). Why does it behave this way?
I also ran:
ALTER INDEX MS042_DATA_PROC_IDX REBUILD;
but nothing changed.
Thanks,
Igor
--***********************************************************
SELECT count(*)
FROM
(
SELECT
TBELENCO.DATA_PROC, TBELENCO.POD, TBELENCO.DESCRIZIONE, TBELENCO.ERROR, TBELENCO.STATO,
TBELENCO.SEZIONE, TBELENCO.NOME_FILE, TBELENCO.ID_CARICAMENTO, TBELENCO.ESITO_OPERAZIONE,
TBELENCO.DES_TIPO_MISURA,
ROWNUM R
FROM(
SELECT
LOG.DATA_PROC, LOG.POD, LOG.DESCRIZIONE, LOG.ERROR, LOG.STATO,
LOG.SEZIONE, LOG.NOME_FILE, LOG.ID_CARICAMENTO, LOG.ESITO_OPERAZIONE, TM.DES_TIPO_MISURA
FROM
MS042_LOADING_LOGS LOG JOIN MS116_MEASURE_TYPES TM ON
TM.ID_TIPO_MISURA=LOG.SEZIONE
WHERE
1=1
AND LOG.DATA_PROC>=TRUNC(SYSDATE)-1
) TBELENCO
)
WHERE
R BETWEEN 1 AND 200;
Plan hash value: 2191058229
-------------------------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
-------------------------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 1 | 13 | 30866 (2)| 00:06:11 |
| 1 | SORT AGGREGATE | | 1 | 13 | | |
|* 2 | VIEW | | 94236 | 1196K| 30866 (2)| 00:06:11 |
| 3 | COUNT | | | | | |
|* 4 | HASH JOIN | | 94236 | 1104K| 30866 (2)| 00:06:11 |
| 5 | INDEX FULL SCAN | P087_TIPI_MISURE_PK | 15 | 30 | 1 (0)| 00:00:01 |
| 6 | TABLE ACCESS BY INDEX ROWID| MS042_LOADING_LOGS | 94236 | 920K| 30864 (2)| 00:06:11 |
|* 7 | INDEX RANGE SCAN | MS042_DATA_PROC_IDX | 94236 | | 25742 (2)| 00:05:09 |
-------------------------------------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
2 - filter("R"<=200 AND "R">=1)
4 - access("TM"."ID_TIPO_MISURA"="LOG"."SEZIONE")
7 - access(SYS_OP_DESCEND("DATA_PROC")<=SYS_OP_DESCEND(TRUNC(SYSDATE@!)-1))
filter(SYS_OP_UNDESCEND(SYS_OP_DESCEND("DATA_PROC"))>=TRUNC(SYSDATE@!)-1)
Plan hash value: 69930686
---------------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
---------------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 1 | 13 | 95921 (1)| 00:19:12 |
| 1 | SORT AGGREGATE | | 1 | 13 | | |
|* 2 | VIEW | | 1467K| 18M| 95921 (1)| 00:19:12 |
| 3 | COUNT | | | | | |
|* 4 | HASH JOIN | | 1467K| 16M| 95921 (1)| 00:19:12 |
| 5 | INDEX FULL SCAN | P087_TIPI_MISURE_PK | 15 | 30 | 1 (0)| 00:00:01 |
|* 6 | TABLE ACCESS FULL| MS042_LOADING_LOGS | 1467K| 13M| 95912 (1)| 00:19:11 |
---------------------------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
2 - filter("R"<=200 AND "R">=1)
4 - access("TM"."ID_TIPO_MISURA"="LOG"."SEZIONE")
6 - filter("LOG"."DATA_PROC">=TRUNC(SYSDATE@!)-4)
The larger the fraction of rows that will be returned, the more efficient a table scan is and the less efficient it is to use an index. Apparently, Oracle expects that inflection point to come when the query returns more than 3 days of data. If that is inaccurate, I would expect that the statistics on your table or indexes are inaccurate.
I use an index-organized table (IOT) for a table with 550 million rows. The primary key is composed of two columns (id1 and id2), which are also foreign keys to two other tables (id1 references table1, id2 references table2).
When using an IOT, according to the Oracle documentation (http://docs.oracle.com/cd/B28359_01/server.111/b28310/tables012.htm#i1007389), sorting by a prefix of the primary key should be faster than sorting by a suffix of the primary key, because rows are stored in primary key order, following the order of the columns that compose it.
However, here are the explain plans I get when I join the IOT with the two other tables and order by either id1 or id2. I would have expected a lower cost when sorting by id1, but that is not the case.
The value used for id1 corresponds to 58000 rows among 489000 rows in total in table1. The value used for id2 corresponds to 760 rows among 248900 rows in total in table2.
The SQL query :
SELECT a.id1, a.id2, a.some_column
FROM iot_table a
INNER JOIN table1 t1 ON t1.id = a.id1
INNER JOIN table2 t2 ON t2.id = a.id2
WHERE t1.col_x = x AND t2.col_y = y
ORDER BY {a.id1|a.id2};
Explain plan order by id1 :
--------------------------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
--------------------------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | | | 11243 (100)| |
| 1 | NESTED LOOPS | | 1311K| 42M| 11243 (1)| 00:01:44 |
| 2 | MERGE JOIN CARTESIAN | | 46M| 842M| 11173 (1)| 00:01:43 |
|* 3 | TABLE ACCESS BY INDEX ROWID | TABLE1 | 58152 | 511K| 4745 (1)| 00:00:44 |
| 4 | INDEX FULL SCAN | TABLE1_ID1_IDX | 488K| | 15 (0)| 00:00:01 |
| 5 | BUFFER SORT | | 799 | 7990 | 6429 (1)| 00:01:00 |
| 6 | TABLE ACCESS BY INDEX ROWID| TABLE2 | 799 | 7990 | 1 (0)| 00:00:01 |
|* 7 | INDEX RANGE SCAN | TABLE2_COL_Y_IDX | 799 | | 1 (0)| 00:00:01 |
|* 8 | INDEX UNIQUE SCAN | IOT_TABLE_PK | 1 | 15 | 1 (0)| 00:00:01 |
--------------------------------------------------------------------------------------------------------
Explain plan order by id2 :
--------------------------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
--------------------------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | | | 5159 (100)| |
| 1 | NESTED LOOPS | | 1311K| 42M| 5159 (2)| 00:00:48 |
| 2 | MERGE JOIN CARTESIAN | | 46M| 842M| 5089 (1)| 00:00:47 |
|* 3 | TABLE ACCESS BY INDEX ROWID | TABLE2 | 799 | 7990 | 2512 (1)| 00:00:24 |
| 4 | INDEX FULL SCAN | TABLE2_ID2_IDX | 248K| | 28 (0)| 00:00:01 |
| 5 | BUFFER SORT | | 58152 | 511K| 2577 (2)| 00:00:24 |
| 6 | TABLE ACCESS BY INDEX ROWID| TABLE1 | 58152 | 511K| 3 (0)| 00:00:01 |
|* 7 | INDEX RANGE SCAN | TABLE1_COL_X_IDX | 58152 | | 1 (0)| 00:00:01 |
|* 8 | INDEX UNIQUE SCAN | IOT_TABLE_PK | 1 | 15 | 1 (0)| 00:00:01 |
--------------------------------------------------------------------------------------------------------
I am wondering why the cost is worse when sorting by id1, since the rows should already be sorted primarily by this column and Oracle would only have to walk the IOT B*-tree as it is.
Thank you for your help.
I use Oracle 11g Release 2 (11.2) and the stats are up to date.