What is the point of Snowflake's Unique constraint? - constraints

Snowflake offers a Unique constraint but doesn't actually enforce it. I have an example below showing that with a test table.
What is the point, what value does the constraint add?
What workarounds do people use to avoid duplicates? I can perform a query before every insert but it seems like unnecessary usage.
CREATE OR REPLACE TABLE dbo.Test
(
"A" INT NOT NULL UNIQUE,
"B" STRING NOT NULL
);
INSERT INTO dbo.Test
VALUES (0, 'ABC');
INSERT INTO dbo.Test
VALUES (0, 'DEF');
SELECT *
FROM dbo.Test;
1. A, B
2. 0, ABC
3. 0, DEF

for one, Snowflake is not alone in this world. So data gets imported and exported, and while Snowflake does not enforce the constraints, some other systems might and this way they won't get lost while travelling through Snowflake
for other, it's also informational for the data analytical tools like already mentioned in the link Kirby provided
please remember, that execution is consecutive, so running a check before every query will still get you duplicates at high concurrency. To avoid duplicates fully you need to either run merges (which is admittedly going to be slower) or manually delete the "excessive" data after it's been loaded

Related

How to put a part of a code as a string in table to use it in a procedure?

I'm trying to resolve below issue:
I need to prepare table that consists 3 columns:
user_id,
month
value.
Each from over 200 users has got different values of parameters that determine expected value which are: LOB, CHANNEL, SUBSIDIARY. So I decided to store it in table ASYSTENT_GOALS_SET. But I wanted to avoid multiplying rows and thought it would be nice to put all conditions as a part of the code that I would use in "where" clause further in procedure.
So, as an example - instead of multiple rows:
I created such entry:
So far I created testing table ASYSTENT_TEST (where I collect month and value for certain user). I wrote a piece of procedure where I used BULK COLLECT.
declare
type test_row is record
(
month NUMBER,
value NUMBER
);
type test_tab is table of test_row;
BULK_COLLECTOR test_tab;
p_lob varchar2(10) :='GOSP';
p_sub varchar2(14);
p_ch varchar2(10) :='BR';
begin
select subsidiary into p_sub from ASYSTENT_GOALS_SET where user_id='40001001';
execute immediate 'select mc, sum(ppln_wartosc) plan from prod_nonlife.mis_report_plans
where report_id = (select to_number(value) from prod_nonlife.view_parameters where view_name=''MIS'' and parameter_name=''MAX_REPORT_ID'')
and year=2017
and month between 7 and 9
and ppln_jsta_symbol in (:subsidiary)
and dcs_group in (:lob)
and kanal in (:channel)
group by month order by month' bulk collect into BULK_COLLECTOR
using p_sub,p_lob,p_ch;
forall x in BULK_COLLECTOR.first..BULK_COLLECTOR.last insert into ASYSTENT_TEST values BULK_COLLECTOR(x);
end;
So now when in table ASYSTENT_GOALS_SET column SUBSIDIARY (varchar) consists string 12_00_00 (which is code of one of subsidiary) everything works fine. But the problem is when user works in two subsidiaries, let say 12_00_00 and 13_00_00. I have no clue how to write it down. Should SUBSIDIARY column consist:
'12_00_00','13_00_00'
or
"12_00_00","13_00_00"
or maybe
12_00_00','13_00_00
I have tried a lot of options after digging on topics like "Deling with single/escaping/double qoutes".
Maybe I should change something in execute immediate as well?
Or maybe my approach to that issue is completely wrong from the very beginning (hopefully not :) ).
I would be grateful for support.
I didn't create the table function described here but that article inspired me to go back to try regexp_substr function again.
I changed: ppln_jsta_symbol in (:subsidiary) to
ppln_jsta_symbol in (select regexp_substr((select subsidiary from ASYSTENT_GOALS_SET where user_id=''fake_num''),''[^,]+'', 1, level) from dual
connect by regexp_substr((select subsidiary from ASYSTENT_GOALS_SET where user_id=''fake_num''), ''[^,]+'', 1, level) is not null) Now it works like a charm! Thank you #Dessma very much for your time and suggestion!
"I wanted to avoid multiplying rows and thought it would be nice to put all conditions as a part of the code that I would use in 'where' clause further in procedure"
This seems a misguided requirement. You shouldn't worry about number of rows: databases are optimized for storing and retrieving rows.
What they are not good at is dealing with "multi-value" columns. As your own solution proves, it is not nice, it is very far from nice, in fact it is a total pain in the neck. From now on, every time anybody needs to work with subsidiary they will have to invoke a function. Adding, changing or removing a user's subsidiary is much harder than it ought to be. Also there is no chance of enforcing data integrity i.e. validating that a subsidiary is valid against a reference table.
Maybe none of this matters to you. But there are very good reasons why Codd mandated "no repeating groups" as a criterion of First Normal Form, the foundation step of building a sound data model.
The correct solution, industry best practice for almost forty years, would be to recognise that SUBSIDIARY exists at a different granularity to CHANNEL and so should be stored in a separate table.

How to insert large number of nodes into Neo4J

I need to insert about 1 million of nodes in Neo4j. I need to specify that each node is unique, so every time I insert a node it has to be checked that there's not the same node yet. Also the relationships must be unique.
I'm using Python and Cypher:
uq = 'CREATE CONSTRAINT ON (a:ipNode8) ASSERT a.ip IS UNIQUE'
...
queryProbe = 'MERGE (a:ipNode8 {ip:"' + prev + '"})'
...
queryUpdateRelationship= 'MATCH (a:ipNode8 {ip:"' + prev + '"}),(b:ipNode8 {ip:"' + next + '"}) MERGE (a)-[:precede]->(b)'
The problem is that after putting 40-50K nodes into Neo4j , the insertion speed slows down quickly and I can not to put anything else.
Your question is quite open ended. In addition to #InverseFalcon's recommendations, here are some other things you can investigate to speed things up.
Read the Performance Tuning documentation, and follow the recommendations. In particular, you might be running into memory-related issues, so the Memory Tuning section may be very helpful.
Your Cypher query(ies) can probably be sped up. For instance, if it makes sense, you can try something like the following. The data parameter is expected to be a list of objects having the format {a: 123, b: 234}. You can make the list as long as appropriate (e.g., 20K) to avoid running out of memory on the server while it processes the list within a single transaction. (This query assumes that you also want to create b if it does not exist.)
UNWIND {data} AS d
MERGE (a:ipNode8 {ip: d.a})
MERGE (b:ipNode8 {ip: d.b})
MERGE (a)-[:precede]->(b)
There are also periodic execution APOC procedures that you might be able to use.
For mass inserts like this, it's best to use LOAD CSV with periodic commit or the import tool.
I believe it's also best practice to use a parameterized query instead of appending values into a string.
Also, you created a unique property constraint on :ipNode8, but not :ipNode, which is the first one you MERGE. Seems like you'll need a unique constraint for that one too.

How to make values unique in cassandra

I want to make unique constraint in cassandra .
As i want to all the value in my column be unique in my column family
ex:
name-rahul
phone-123
address-abc
now i want that i this row no values equal to rahul ,123 and abc get inserted again on seraching on datastax i found that i can achieve it by doing query on partition key as IF NOT EXIST ,but not getting the solution for getting all the 3 values uniques
means if
name- jacob
phone-123
address-qwe
this should also be not inserted into my database as my phone column has the same value as i have shown with name-rahul.
The short answer is that constraints of any type are not supported in Cassandra. They are simply too expensive as they must involve multiple nodes, thus defeating the purpose of having eventual consistency in first place. If you needed to make a single column unique, then there could be a solution, but not for more unique columns. For the same reason - there is no isolation, no consistency (C and I from the ACID). If you really need to use Cassandra with this type of enforcement, then you will need to create some kind of synchronization application layer which will intercept all requests to the database and make sure that the values are unique, and all constraints are enforced. But this won't have anything to do with Cassandra.
I know this is an old question and the existing answer is correct (you can't do constraints in C*), but you can solve the problem using batched creates. Create one or more additional tables, each with the constrained column as the primary key and then batch the creates, which is an atomic operation. If any of those column values already exist the entire batch will fail. For example if the table is named Foo, also create Foo_by_Name (primary key Name), Foo_by_Phone (primary key Phone), and Foo_by_Address (primary key Address) tables. Then when you want to add a row, create a batch with all 4 tables. You can either duplicate all of the columns in each table (handy if you want to fetch by Name, Phone, or Address), or you can have a single column of just the Name, Phone, or Address.

Dynamic query and caching

I have two problem sets. What I am preferably looking for is a solution which combines both.
Problem 1: I have a table of lets say 20 rows. I am reading 150,000 rows from other table (say table 2). For each row read from table 2, I have to match it with a specific row of table 1 (not matching whole row, few columns. like if table2.col1 = table1.col && table2.col2 = table1.col2) etc. Is there a way that i can cache table 1 so that i don't have to query it again and again ?
Problem 2: I want to generate query string dynamically i.e., if parameter 2 is null then don't put it in where clause. Now the only option left is to use immidiate execute which will be very slow.
Now what i am asking that how can i have dynamic query to compare it with table 1 ? any ideas ?
For problem 1, as mentioned in the comments, let the database handle it. That's what it does really well. If it is something being hit often, then the blocks for the table should remain in the database buffer cache if the buffer cache is sized appropriately. Part of DBA tuning would be to identify appropriate sizing, pinning tables into the "keep" pool, etc. But probably not something that needs worrying over.
If the desire is just to simplify writing the queries rather than performance, then views or stored procs can simplify the repetitive use of the join.
For problem 2, a query in a format like this might work for you:
SELECT id, val
FROM myTable
WHERE filter = COALESCE(v_filter, filter)
If the input parameter v_filter is null, then just automatically match the existing column. This assumes the existing filter column itself is never null (since you can't use = for null comparisons). Also, it assumes that there are other indexed portions in the WHERE clause since a function like COALESCE isn't going to be able to take advantage of an index.
For problem 1 you just join the tables. If there is an equijoin and one table is quite small and the other large then you're likely to get a hash join. This is effectively a caching mechanism, and the total cost of reading the tables and performing the join is only very slightly higher than that of reading the tables (as long as the hash table fits in memory).
It does not make a difference if the query is constructed and run through execute immediate -- the RDBMS hash join will still act as an effective cache.

tSQLt AssertEqualsTable does not check ordering

I have two tables defined for actual and expected with exactly the same schema. I insert two rows into the expected table with say Ids of 2, 1.
I run
INSERT INTO actual EXEC tSQLt.ResultSetFilter 1, '{statement}'
to populate the actual then
EXEC tSQLt.AssertEqualsTable #expected = 'expected' , #actual = 'actual'
to compare the results.
Even though the data is in a different order (Ids are 1, 2 in the actual), the test passes.
I confirmed that the data was different by adding SELECT * FROM actual and SELECT * FROM expected in the test and running the test on its own with tSQLt.Run '{test name}'.
Does anyone know if this is a known bug? Apparently it is supposed to check per row so the ordering should be checked. All the other columns are NULL that are returned it is just the ID column that contains a value.
Unless an order by clause is specified in the select statement, the order isn't guaranteed by SQL server (see the top bullet point at this MSDN page) - although in practice it is often ordered as you might expect.
Because of this, I believe that tSQLt looking for non-identical and identical rows makes sense - but checking the order doesn't - otherwise the answer could change at the whim of SQL server and the test would be meaningless (and worse - intermittently failing!). The tSQLt user guide on AssertEqualsTable states that it checks the content of the table, but not that it checks the ordering therein. What leads you to conclude that the order should be being checked as well? I couldn't find mention of it.
If you need the order to be checked, you could insert both expected and actual results into a temporary table with an identity column (or use ROW_NUMBER) and check the resultant table - if the order is different then the identity cols would be different.
There is a similar method documented here on Greg M Lucas' blog.
Relying on the order returned from the table without an order by clause is not recommended (MSDN link) - so I'd suggest including one in your application's call to the statement, or if an SP within it if the order of returned rows is important.

Resources