Reversing the order of a MariaDB recursive CTE

I have a MariaDB 10.2.8 database which I am using to store the results of a crawl of all files beneath a particular root directory. So a file (in the file table) has a parent directory (in the directory table). This parent directory may have its own parents and so on up to the original point at which the directory crawl began.
So if I did a crawl from /home, the file /home/tim/projects/foo/bar.py would have a parent directory foo, which would have a parent directory projects and so on. /home (the root of the crawl) would have a null parent.
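For reference, a minimal sketch of the schema implied above (column names are taken from the recursive query below; the file name column and the exact types are assumptions):
create table directory (
    id     int primary key,
    name   varchar(255) not null,
    parent int null,                 -- null for the root of the crawl
    foreign key (parent) references directory(id)
);
create table file (
    id     int primary key,
    name   varchar(255) not null,    -- assumed; only id and parent are used below
    parent int not null,             -- id of the containing directory
    foreign key (parent) references directory(id)
);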
I've got the following recursive CTE:
with recursive tree as (
    select id, name, parent from directory where id =
        (select parent from file where id = #FileID)
    union
    select d.id, d.name, d.parent from directory d, tree t
    where t.parent = d.id
) select name from tree;
which (as expected) returns the results in order from the file's own directory up to the root of the crawl, where #FileID is the primary key of the file. For example:
Welcome to the MariaDB monitor. Commands end with ; or \g.
Your MariaDB connection id is 17
Server version: 10.2.8-MariaDB-10.2.8+maria~jessie-log mariadb.org binary distribution
Copyright (c) 2000, 2017, Oracle, MariaDB Corporation Ab and others.
Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.
MariaDB [(none)]> use inventory;
Reading table information for completion of table and column names
You can turn off this feature to get a quicker startup with -A
Database changed
MariaDB [inventory]> with recursive tree as (
-> select id, name, parent from directory where id =
-> (select parent from file where id = 3790)
-> union
-> select d.id, d.name, d.parent from directory d, tree t
-> where t.parent = d.id
-> ) select name from tree;
+----------+
| name     |
+----------+
| b8       |
| objects  |
| .git     |
| fresnel  |
| Projects |
| metatron |
+----------+
6 rows in set (0.00 sec)
MariaDB [inventory]> Bye
tim@merlin:~$
So in this case, file ID 3790 corresponds to a file in the directory /metatron/Projects/fresnel/.git/objects/b8 (/metatron is, of course, the root of the crawl).
Is there a reliable way of reversing the order of the output, as I want to concatenate it together to produce the full path? I can order by id, but this doesn't feel reliable: even though I know that, in this case, children will have a higher ID than their parents, I can't guarantee this will always hold in every circumstance where I want to use a CTE.

Wrap your current query and add an ORDER BY:
(
your-current-query
) ORDER BY ...;
(If you have trouble with that, then stick SELECT ... in front, too.)
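A more self-contained option (a sketch, assuming the directory/file schema from the question) is to carry an explicit depth counter in the CTE, so the ordering no longer depends on how the ids happen to have been assigned, and the names can be concatenated into a path in the same statement:
with recursive tree as (
    select id, name, parent, 0 as depth
    from directory
    where id = (select parent from file where id = 3790)
    union all   -- union all is safe here: a directory tree has no cycles
    select d.id, d.name, d.parent, t.depth + 1
    from directory d, tree t
    where t.parent = d.id
)
select group_concat(name order by depth desc separator '/') as full_path
from tree;
Ordering by depth desc puts the root of the crawl first, so for the example above this returns metatron/Projects/fresnel/.git/objects/b8 (prepend the leading '/' yourself).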

Related

Kusto - How to put results into another table in another cluster?

Suppose I have a table and I run a query like the one below:
let data = orders | where zip == "11413" | project timestamp, name, amount ;
inject data into newOrdersInfoTable in another cluster // ==> How can I achieve this?
There are many ways to do it. If it is a manual task involving not too much data, you can simply do something like this in the target cluster:
.set-or-append TargetTable <| cluster("put here the source cluster url").database("put here the source database").orders
| where zip == "11413" | project timestamp, name, amount
Note that if the dataset is larger you can use the "async" flavor of this command. If the data size is bigger still, you should consider exporting the data and then importing it into the other cluster.

Failing to access files in ADLS Gen 2 with ADX external table "Virtual columns"

I have a simple folder tree in Azure Data Lake Gen 2 that is partitioned by date with the following standard folder structure: {yyyy}/{MM}/{dd}. e.g. /Container/folder1/sub_folder/2020/11/01
In each leaf folder, I have some CSV files with few columns but without a timestamp (as the date is already embedded in the folder name).
I am trying to create an ADX external table that will include a virtual column of the date, and then query the data in ADX by date (this is a well-known pattern in Hive and Big data in general).
.create-or-alter external table TableName (col1:double, col2:double, col3:double, col4:double)
kind=adl
partition by (Date:datetime)
pathformat = ("/date=" datetime_pattern("year={yyyy}/month={MM}/day={dd}", Date))
dataformat=csv
(
h@'abfss://container@datalake_name.dfs.core.windows.net/folder1/subfolder/;{key}'
)
with (includeHeaders = 'All')
Unfortunately, querying the table fails, and .show artifacts returns an empty list.
external_table("Table Name")
| take 10
.show external table Walmart_2141_OEE artifacts
with the following exception:
Query execution has resulted in error (0x80070057): Partial query failure: The parameter is incorrect. (message: 'path2
Parameter name: Argument 'path2' failed to satisfy condition 'Can't append a full path': at Concat in C:\source\Src\Common\Kusto.Cloud.Platform\Utils\UriPath.cs: line 25:
I tried to follow many types of pathformats and datetime_pattern as described in the documentation but nothing worked.
Any ideas?
According to your description, the following definition should work:
.create-or-alter external table TableName (col1:double, col2:double, col3:double, col4:double)
kind=adl
partition by (Date:datetime)
pathformat = (datetime_pattern("yyyy/MM/dd", Date))
dataformat=csv
(
h@'abfss://container@datalake_name.dfs.core.windows.net/folder1/subfolder;{key}'
)
with (includeHeaders = 'All')

Problems with a DECLARE block that contains a commented dynamic input variable

I am new to PL/SQL.
I wrote a block to calculate the circumference and area of a circle, like below:
SET SERVEROUTPUT ON;
CREATE OR REPLACE PROCEDURE cal_circle AS
-- DECLARE
pi CONSTANT NUMBER := 3.1415926;
radius NUMBER := 3;
-- to make it more dynamic I can set
-- radius NUMBER := &enter_value;
circumference DECIMAL(4,2) := radius * pi * 2;
area DECIMAL(4,2) := pi * radius ** 2;
BEGIN
-- DBMS_OUTPUT.PUT_LINE('Enter a value of radius: ' || radius);
dbms_output.put_line('For a circle with radius '
|| radius
|| ',the circumference is '
|| circumference
|| ' and the area is '
|| area
|| '.');
END;
/
Here you can see that I've commented out the declaration radius NUMBER := &enter_value;. However, when I run my script in SQL*Plus or SQL Developer, I always get a prompt like "please enter the value for enter_value".
Conversely, if I delete this comment from the declaration section and just put it outside of it, there is no prompt anymore.
SET SERVEROUTPUT ON;
/* to make it more dynamic, I can set
radius NUMBER := &enter_value;
*/
CREATE OR REPLACE PROCEDURE cal_circle AS
-- DECLARE
pi CONSTANT NUMBER := 3.1415926;
radius NUMBER := 3;
circumference DECIMAL(4,2) := radius * pi * 2;
area DECIMAL(4,2) := pi * radius ** 2;
BEGIN
......
So what I want to clarify is: does this mean the DECLARE section cannot accept a comment when I try to comment out a dynamic variable?
Thanks.
You can use a parameter instead of a substitution variable to allow different users to call the procedure with different values of pi.
I'd recommend using a FUNCTION instead of a PROCEDURE for this, but here's an example (also using a parameter for the radius):
CREATE OR REPLACE PROCEDURE CAL_CIRCLE(P_RADIUS IN NUMBER, P_PI IN NUMBER) AS
CIRCUMFERENCE DECIMAL(4, 2) := P_RADIUS * P_PI * 2;
AREA DECIMAL(4, 2) := P_PI * P_RADIUS ** 2;
BEGIN
DBMS_OUTPUT.put_line('For a circle with radius '
|| P_RADIUS
|| ',the circumference is '
|| CIRCUMFERENCE
|| ' and the area is '
|| AREA
|| '.' || 'Calculated with Pi=: ' || P_PI);
END;
/
Then try it out:
BEGIN
CAL_CIRCLE(3, 3.14);
END;
/
For a circle with radius 3,the circumference is 18.84 and the area is
28.26.Calculated with Pi=: 3.14
BEGIN
CAL_CIRCLE(3, 3.14159);
END;
/
For a circle with radius 3,the circumference is 18.85 and the area is
28.27.Calculated with Pi=: 3.14159
If you really need to COMPILE the procedure with a substituted value of pi baked into its constants (not recommended), you can set the substitution variable first with DEFINE, like DEFINE pi_value = 3.1415;, and then use &pi_value later.
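For example (a small sketch; the procedure name CAL_CIRCLE_FIXED_PI is made up for illustration):
DEFINE pi_value = 3.1415
CREATE OR REPLACE PROCEDURE CAL_CIRCLE_FIXED_PI(P_RADIUS IN NUMBER) AS
    PI CONSTANT NUMBER := &pi_value;   -- SQL*Plus substitutes the value before sending the DDL
    CIRCUMFERENCE DECIMAL(4, 2) := P_RADIUS * PI * 2;
BEGIN
    DBMS_OUTPUT.put_line('Circumference: ' || CIRCUMFERENCE);
END;
/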
Update: Why do SQLPlus and SQL Developer detect the Substitution Variable and request a value for it, even when it is in a comment?
TLDR: SQL Clients must deliver comments to the server. Preprocessing substitutions in comments gives greater flexibility and keeps the SQL Clients simpler. The clients have good support for controlling substitution behavior. There is not much of a reason to have orphan substitution variables in finalized code.
Longer-Version:
These tools are database clients. They have lots of features, but first and foremost their job is to gather input SQL, deliver it to the database server, and handle fetched data.
Comments need to be delivered to the database server with their accompanying SQL statements. There are reasons for this -- so users can save comments on their compiled SQL code in the database, of course, but also for compiler hints.
Substitution Variables are not delivered with the SQL to the server the way comments are. Instead, they are evaluated first, and the resulting SQL text is sent to the server. (You can see that the SQL that reaches the server has its Substitution Variables replaced with real values; see V$SQLTEXT.)
Since the server "makes use" of the comments, it is both more flexible and simpler for SQL*Plus to replace the Substitution Variables even in comments. (If needed, this can be overridden; I'll show that below.) SQL*Plus, SQL Developer, etc. could have been designed to ignore Substitution Variables in comments, but that would make them less flexible and perhaps require more code, since they would need to recognize comments and change their behavior accordingly, line by line. I'll show some examples of this flexibility further below.
There is not much of a drawback to the tools working this way.
Suppose one just wants to ignore a chunk of code for a minute during development and quickly run everything. It would be annoying to have to DEFINE everything even though it wasn't used, or to delete all the commented code just so it could run. So these tools instead allow you to SET DEFINE OFF; and ignore the variables.
For example, this runs fine:
SET DEFINE OFF;
--SELECT '&MY_FIRST_IGNORED_VAR' FROM DUAL;
-- SELECT '&MY_SECOND_IGNORED_VAR' FROM DUAL;
SELECT 1919 FROM DUAL;
If one needs to use '&' itself in a query, SQL*Plus lets you choose another character as the substitution marker. There are lots of options to control things.
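For example (a small sketch; the replacement character is arbitrary):
SET DEFINE ^
-- '&' no longer triggers substitution, so this runs without prompting:
SELECT 'Fish & Chips' AS menu_item FROM DUAL;
-- restore '&' as the substitution marker:
SET DEFINE ON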
Once one has finished developing the final query or procedure, leftover "orphan" comments with undefined substitutions aren't a valid situation. When development is complete, orphan substitutions should all be removed, and anything remaining should reference valid DEFINEd variables.
Here's an example that makes use of processing substitutions in comments.
Suppose you wanted to tune some poor-performing SQL. You could use a substitution variable in the HINTs (in a comment) to allow for quickly changing which index is used, or execution mode, etc. without needing to actually change the query script.
CREATE TABLE TEST_TABLE(TEST_KEY NUMBER PRIMARY KEY,
                        TEST_VALUE VARCHAR2(128) NOT NULL,
                        CONSTRAINT TEST_VALUE_UNQ UNIQUE (TEST_VALUE));
INSERT INTO TEST_TABLE
SELECT LEVEL, 'VALUE-'||LEVEL
FROM DUAL CONNECT BY LEVEL <= 5000;
Normally, a query predicating against TEST_VALUE here would use its UNIQUE INDEX when fetching the data.
SELECT TEST_VALUE FROM TEST_TABLE WHERE TEST_VALUE = 'VALUE-1919';
X-Plan:
------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 1 | 66 | 1 (0)| 00:00:01 |
|* 1 | INDEX UNIQUE SCAN| TEST_VALUE_UNQ | 1 | 66 | 1 (0)| 00:00:01 |
------------------------------------------------------------------------------------
But one can force a full-scan via a hint. By using a substitution variable in the hint (in a comment), one can allow the values of the Substitution Variable to direct query-execution:
DEFINE V_WHICH_FULL_SCAN = 'TEST_TABLE';
SELECT /*+ FULL(&V_WHICH_FULL_SCAN) */ TEST_VALUE FROM TEST_TABLE WHERE TEST_VALUE = 'VALUE-1919';
Here the Substitution Variable (in its comment) has changed the query-execution.
X-Plan:
--------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
--------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 23 | 1518 | 9 (0)| 00:00:01 |
|* 1 | TABLE ACCESS FULL| TEST_TABLE | 23 | 1518 | 9 (0)| 00:00:01 |
--------------------------------------------------------------------------------
If there were a bunch of tables here instead of one, a person could DEFINE different targets to full-scan and quickly evaluate each one's impact on the query.

Retrieving data with dynamic attributes in Cassandra

I'm working on a solution for Cassandra that's proving impossible.
We have a table that will return a set of candidates given some search criteria. The row with the highest score is returned back to the user. We can do this quite easily with SQL, but there's a need to migrate to Cassandra. Here are the tables involved:
Value
ID | VALUE | COUNTRY | STATE | CITY | COUNTY
--------+---------+----------+----------+-----------+-----------
1 | 50 | US | | |
--------+---------+----------+----------+-----------+-----------
2 | 25 | | TX | |
--------+---------+----------+----------+-----------+-----------
3 | 15 | | | MEMPHIS |
--------+---------+----------+----------+-----------+-----------
4 | 5 | | | | BROWARD
--------+---------+----------+----------+-----------+-----------
5 | 30 | | NY | NYC |
--------+---------+----------+----------+-----------+-----------
6 | 20 | US | | NASHVILLE |
--------+---------+----------+----------+-----------+-----------
Scoring
ATTRIBUTE | SCORE
-------------+-------------
COUNTRY | 1
STATE | 2
CITY | 4
COUNTY | 8
A query is sent that can have any of those four attributes populated or not. We search through our values table, calculate the scores, and return the highest one. If a column in the values table is null, it means it's applicable for all.
ID 1 is applicable for all states, cities, and counties within the US.
ID 2 is applicable for all countries, cities, and counties where the state is TX.
Example:
Query: {Country: US, State: TX}
Matches Value IDs: [1, 2, 3, 4, 6]
Scores: [1, 2, 4, 8, 5(1+4)]
Result: {id: 4} (8 was the highest score so Broward returns)
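(For reference, a rough generic-SQL sketch of the relational logic described above; table and column names are assumed from the sample data, :country, :state, :city, :county stand in for the query's optional attributes, and exact syntax varies by RDBMS:)
SELECT id,
       CASE WHEN country IS NOT NULL THEN 1 ELSE 0 END
     + CASE WHEN state   IS NOT NULL THEN 2 ELSE 0 END
     + CASE WHEN city    IS NOT NULL THEN 4 ELSE 0 END
     + CASE WHEN county  IS NOT NULL THEN 8 ELSE 0 END AS score
FROM value
WHERE (:country IS NULL OR country IS NULL OR country = :country)
  AND (:state   IS NULL OR state   IS NULL OR state   = :state)
  AND (:city    IS NULL OR city    IS NULL OR city    = :city)
  AND (:county  IS NULL OR county  IS NULL OR county  = :county)
ORDER BY score DESC
LIMIT 1;
For the example query {Country: US, State: TX} this keeps ids 1, 2, 3, 4 and 6 and returns id 4 (score 8), matching the result above.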
How would you model something like this in Cassandra 2.1?
I found that the best way to achieve this was using Solr with Cassandra.
Some things to note about using Solr, though, since the resources I needed were scattered across the internet:
You must first start Cassandra with Solr. There's a command in the dse tool for starting Cassandra with Solr enabled.
$CASSANDRA_HOME/bin/dse cassandra -s
You must create your keyspace with the network topology strategy and Solr enabled.
CREATE KEYSPACE ... WITH REPLICATION = {'class': 'NetworkTopologyStrategy', 'Solr': 1}
Once you create your table within your solr enabled keyspace, create a core using the dsetool.
$CASSANDRA_HOME/bin/dsetool create_core keyspace.table_name generateResources=true reindex=true
This will allow Solr to index your data and generate a number of secondary indexes against your Cassandra table.
Performing the queries needed for columns whose values may or may not exist requires a somewhat complex query:
SELECT * FROM keyspace.table_name WHERE solr_query = '{"q": "(-column:[* TO *] AND *:*) OR column:value"}';
Finally, you may notice when searching for text, your solr query column:"Hello" may pick up other unwanted values like HelloWorld or HelloThere. This is due to the datatype used in your schema.xml for Solr. Here's how to modify this behavior:
Head to your Solr Admin UI. (Normally http://hostname:8983/solr/)
Choose your core from the drop-down list in the left pane; it should be named keyspace.table_name.
Look for Config or Schema, both should take you to the schema.xml.
Copy and paste that file to some text editor. Optionally, you could try using wget or curl to download the file, but you need the real link which is provided in the text field box to the top right.
There's a tag <fieldtype>, with the name TextField. Replace org.apache.solr.schema.TextField with org.apache.solr.schema.StrField. You must also remove the analyzers, StrField does not support those.
That's it, hopefully I've saved people from all the headaches I encountered.

Postgresql partitions: Abnormally high seq scan cost on master table

I have a little database of a few hundred million rows for storing call detail records. I set up partitioning as per:
http://www.postgresql.org/docs/9.1/static/ddl-partitioning.html
and it seemed to work pretty well until now. I have a master table "acmecdr" which has rules for inserting into the correct partition, and check constraints to make sure the correct table is used when selecting data. Here is an example of one of the partitions:
cdrs=> \d acmecdr_20130811
Table "public.acmecdr_20130811"
Column | Type | Modifiers
-------------------------------+---------+------------------------------------------------------
acmecdr_id | bigint | not null default
...snip...
h323setuptime | bigint |
acmesipstatus | integer |
acctuniquesessionid | text |
customers_id | integer |
Indexes:
"acmecdr_20130811_acmesessionegressrealm_idx" btree (acmesessionegressrealm)
"acmecdr_20130811_acmesessioningressrealm_idx" btree (acmesessioningressrealm)
"acmecdr_20130811_calledstationid_idx" btree (calledstationid)
"acmecdr_20130811_callingstationid_idx" btree (callingstationid)
"acmecdr_20130811_h323setuptime_idx" btree (h323setuptime)
Check constraints:
"acmecdr_20130811_h323setuptime_check" CHECK (h323setuptime >= 1376179200 AND h323setuptime < 1376265600)
Inherits: acmecdr
Now, as one would expect with SET constraint_exclusion = on, the correct partition should automatically be preferred, and since there is an index on it, there should only be one index scan.
However:
cdrs=> explain analyze select * from acmecdr where h323setuptime > 1376179210 and h323setuptime < 1376179400;
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Result (cost=0.00..1435884.93 rows=94 width=1130) (actual time=138857.660..138858.778 rows=112 loops=1)
-> Append (cost=0.00..1435884.93 rows=94 width=1130) (actual time=138857.628..138858.189 rows=112 loops=1)
-> Seq Scan on acmecdr (cost=0.00..1435863.60 rows=1 width=1137) (actual time=138857.584..138857.584 rows=0 loops=1)
Filter: ((h323setuptime > 1376179210) AND (h323setuptime < 1376179400))
-> Index Scan using acmecdr_20130811_h323setuptime_idx on acmecdr_20130811 acmecdr (cost=0.00..21.33 rows=93 width=1130) (actual time=0.037..0.283 rows=112 loops=1)
Index Cond: ((h323setuptime > 1376179210) AND (h323setuptime < 1376179400))
Total runtime: 138859.240 ms
(7 rows)
So I can see it's not scanning all the partitions, only the relevant one (which is an index scan and pretty quick) and also the master table (which seems to be normal from the examples I've seen). But the high cost of the seq scan on the master table seems abnormal. I would love for that to come down, and I see no reason for it, especially since the master table does not have any records in it:
cdrs=> select count(*) from only acmecdr;
count
-------
0
(1 row)
Unless I'm missing something obvious, this query should be quick. But it's not: it takes about 2 minutes, which does not seem normal at all (even for a slow server).
I'm out of ideas of what to try next, so if anyone has any suggestions or pointers in the right direction, it would be very much appreciated.
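(For what it's worth, a quick way to see what size the planner currently assumes for the supposedly empty master table is its pg_class entry, since the seq scan cost estimate is driven by relpages; a sketch, using the table name from above:)
select relname, relpages, reltuples
from pg_class
where relname = 'acmecdr';
-- If relpages is large even though count(*) is 0, the parent table is bloated
-- (rows were once inserted there and later deleted), and vacuuming it should
-- bring the seq scan estimate back down:
vacuum analyze acmecdr;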
