tSQLt AssertEqualsTable does not check ordering

I have two tables defined for actual and expected with exactly the same schema. I insert two rows into the expected table with say Ids of 2, 1.
I run
INSERT INTO actual EXEC tSQLt.ResultSetFilter 1, '{statement}'
to populate the actual then
EXEC tSQLt.AssertEqualsTable @Expected = 'expected', @Actual = 'actual'
to compare the results.
Even though the data is in a different order (Ids are 1, 2 in the actual), the test passes.
I confirmed that the data was different by adding SELECT * FROM actual and SELECT * FROM expected in the test and running the test on its own with tSQLt.Run '{test name}'.
Does anyone know if this is a known bug? My understanding was that the comparison is done per row, so the ordering should be checked. All the other columns returned are NULL; only the Id column contains a value.

Unless an ORDER BY clause is specified in the SELECT statement, the order isn't guaranteed by SQL Server (see the top bullet point on this MSDN page), although in practice it is often ordered as you might expect.
Because of this, I believe it makes sense for tSQLt to look for identical and non-identical rows, but not to check the order; otherwise the answer could change at the whim of SQL Server and the test would be meaningless (and worse, intermittently failing!). The tSQLt user guide for AssertEqualsTable states that it checks the content of the table, but not that it checks the ordering. What leads you to conclude that the order should be checked as well? I couldn't find any mention of it.
If you need the order to be checked, you could insert both the expected and actual results into temporary tables with an identity column (or use ROW_NUMBER) and compare those instead: if the order differs, the identity values will differ.
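A rough sketch of the identity-column variant, assuming for brevity that the result set contains only the Id column and that your tSQLt version accepts temp table names in AssertEqualsTable (the temp table names here are purely illustrative):
CREATE TABLE #expectedOrdered (RowNum INT IDENTITY(1,1), Id INT);
CREATE TABLE #actualOrdered   (RowNum INT IDENTITY(1,1), Id INT);

-- Capture the expected order explicitly: 2 first, then 1
INSERT INTO #expectedOrdered (Id) VALUES (2), (1);

-- The identity column records the order in which rows arrive
INSERT INTO #actualOrdered (Id) EXEC tSQLt.ResultSetFilter 1, '{statement}';

-- Rows arriving in a different order now get different RowNum values,
-- so the assertion fails when the order differs
EXEC tSQLt.AssertEqualsTable @Expected = '#expectedOrdered', @Actual = '#actualOrdered';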
There is a similar method documented here on Greg M Lucas' blog.
Relying on the order returned from a table without an ORDER BY clause is not recommended (MSDN link), so I'd suggest including one in your application's call to the statement, or in the SP itself, if the order of returned rows is important.

Related

What is the point of Snowflake's Unique constraint?

Snowflake offers a Unique constraint but doesn't actually enforce it. I have an example below showing that with a test table.
What is the point, what value does the constraint add?
What workarounds do people use to avoid duplicates? I can perform a query before every insert but it seems like unnecessary usage.
CREATE OR REPLACE TABLE dbo.Test
(
"A" INT NOT NULL UNIQUE,
"B" STRING NOT NULL
);
INSERT INTO dbo.Test
VALUES (0, 'ABC');
INSERT INTO dbo.Test
VALUES (0, 'DEF');
SELECT *
FROM dbo.Test;
A | B
0 | ABC
0 | DEF
For one, Snowflake is not alone in this world. Data gets imported and exported, and while Snowflake does not enforce the constraints, other systems might, so this way they don't get lost while travelling through Snowflake.
For another, the constraint is also informational for data-analysis tools, as already mentioned in the link Kirby provided.
Please remember that the check and the insert execute as separate, consecutive statements, so running a check before every insert can still give you duplicates under high concurrency. To avoid duplicates fully you need to either run MERGEs (which is admittedly going to be slower) or manually delete the excess data after it has been loaded.
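As a rough sketch of the MERGE approach against the dbo.Test table from the question (the staged values are just for illustration):
-- Insert the row only if no row with the same "A" already exists
MERGE INTO dbo.Test t
USING (SELECT 0 AS A, 'DEF' AS B) s
  ON t."A" = s.A
WHEN NOT MATCHED THEN
  INSERT ("A", "B") VALUES (s.A, s.B);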

BigQuery error: Cannot query the cross product of repeated fields

I am running the following query on Google BigQuery web interface, for data provided by Google Analytics:
SELECT *
FROM [dataset.table]
WHERE
  hits.page.pagePath CONTAINS "my-fun-path"
I would like to save the results into a new table, however I am obtaining the following error message when using Flatten Results = False:
Error: Cannot query the cross product of repeated fields
customDimensions.value and hits.page.pagePath.
This answer implies that this should be possible: Is there a way to select nested records into a table?
Is there a workaround for the issue found?
Depending on what kind of filtering is acceptable to you, you may be able to work around this by switching to OMIT IF from WHERE. It will give different results, but, again, perhaps such different results are acceptable.
The following will remove the entire hit record if some page inside it meets the criteria. Note two things here:
it uses OMIT hits IF, instead of the more commonly used OMIT RECORD IF.
the condition is inverted, because OMIT IF is the opposite of WHERE.
The query is:
SELECT *
FROM [dataset.table]
OMIT hits IF EVERY(NOT hits.page.pagePath CONTAINS "my-fun-path")
Update: see the related thread; I am afraid this is no longer possible.
It would be possible to use the NEST function and group by a field, but that's a long shot.
Using a FLATTEN call in the query:
SELECT *
FROM flatten([google.com:analytics-bigquery:LondonCycleHelmet.ga_sessions_20130910],customDimensions)
WHERE
  hits.page.pagePath CONTAINS "m"
Then, in the web UI:
setting a destination table
allowing large results
and NO flatten results
does the job correctly and the produced table matches the original schema.
I know this is an old question, but it can now be achieved by just using the standard SQL dialect instead of Legacy SQL:
#standardSQL
SELECT t.*
FROM `dataset.table` t, UNNEST(t.hits) AS hit
WHERE
  hit.page.pagePath LIKE '%my-fun-path%'
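If you want to keep one row per session (preserving the nested schema instead of producing one output row per matching hit), a filter with EXISTS over the unnested hits should also work; this is a sketch against the same placeholder table and path:
#standardSQL
SELECT t.*
FROM `dataset.table` t
WHERE EXISTS (
  SELECT 1
  FROM UNNEST(t.hits) AS hit
  WHERE hit.page.pagePath LIKE '%my-fun-path%'
)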

Is this normal behavior for a unique index in Sqlite?

I'm working with SQLite in Flash.
I have this unique index:
CREATE UNIQUE INDEX songsIndex ON songs ( DiscID, Artist, Title )
I have a parameterised recursive function set up to insert any new rows (single or multiple).
It works fine if I try to insert a row with the same DiscID, Artist and Title as an existing row - ie it ignores inserting the existing row, and tells me that 0 out of 1 records were updated - GOOD.
However, if, for example the DiscId is blank, but the artist and title are not, a new record is created when there is already one with a blank DiscId and the same artist and title - BAD.
I traced out the disc id prior to the insert, and Flash is telling me it's undefined. So I've coded it to set anything undefined to "" (an empty string) to make sure it's truly an empty string being inserted - but subsequent inserts still ignore the unique index and add a brand new row even though the same row exists.
What am I misunderstanding?
Thanks for your time and help.
SQLite allows NULLable fields to participate in UNIQUE indexes. If you have such an index, and if you add records such that two of the three columns have identical values and the other column is NULL in both records, SQLite will allow that, matching the behavior you're seeing.
Therefore the most likely explanation is that despite your effort to INSERT zero-length strings, you're actually still INSERTing NULLs.
Also, unless you've explicitly included OR IGNORE in your INSERT statements, the expected behavior of SQLite is to throw an error when you attempt to insert a duplicate INDEX value into a UNIQUE INDEX. Since you're not seeing that behavior, I'm guessing that Flash provides some kind of wrapper around SQLite that's hiding the true behavior from you (and could also be translating empty strings to NULL).
Larry's answer is great. To anyone having the same problem here's the SQLite docs citation explaining that in this case all NULLs are treated as different values:
For the purposes of unique indices, all NULL values are considered different from all other NULL values and are thus unique. This is one of the two possible interpretations of the SQL-92 standard (the language in the standard is ambiguous). The interpretation used by SQLite is the same and is the interpretation followed by PostgreSQL, MySQL, Firebird, and Oracle. Informix and Microsoft SQL Server follow the other interpretation of the standard, which is that all NULL values are equal to one another.
See here: https://www.sqlite.org/lang_createindex.html
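A quick way to see both behaviours in the sqlite3 shell; the songs table definition here is a guess based on the index in the question, and the values are made up:
CREATE TABLE songs (DiscID TEXT, Artist TEXT, Title TEXT);
CREATE UNIQUE INDEX songsIndex ON songs ( DiscID, Artist, Title );

-- Two NULL DiscIDs are considered distinct, so both inserts succeed:
INSERT INTO songs VALUES (NULL, 'Some Artist', 'Some Title');
INSERT INTO songs VALUES (NULL, 'Some Artist', 'Some Title');

-- With empty strings the duplicate is a real conflict: the second statement
-- raises a constraint error, or is silently skipped when OR IGNORE is used:
INSERT INTO songs VALUES ('', 'Other Artist', 'Other Title');
INSERT OR IGNORE INTO songs VALUES ('', 'Other Artist', 'Other Title');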

Dynamic query and caching

I have two problem sets. What I am preferably looking for is a solution which combines both.
Problem 1: I have a table of, let's say, 20 rows. I am reading 150,000 rows from another table (say table2). For each row read from table2, I have to match it with a specific row of table1 (not matching the whole row, just a few columns, e.g. table2.col1 = table1.col1 AND table2.col2 = table1.col2). Is there a way I can cache table1 so that I don't have to query it again and again?
Problem 2: I want to generate the query string dynamically, i.e. if parameter 2 is null then don't put it in the WHERE clause. The only option left then is EXECUTE IMMEDIATE, which will be very slow.
So how can I have a dynamic query and still compare against table1? Any ideas?
For problem 1, as mentioned in the comments, let the database handle it. That's what it does really well. If it is something being hit often, then the blocks for the table should remain in the database buffer cache if the buffer cache is sized appropriately. Part of DBA tuning would be to identify appropriate sizing, pinning tables into the "keep" pool, etc. But probably not something that needs worrying over.
If the desire is just to simplify writing the queries rather than performance, then views or stored procs can simplify the repetitive use of the join.
For problem 2, a query in a format like this might work for you:
SELECT id, val
FROM myTable
WHERE filter = COALESCE(v_filter, filter)
If the input parameter v_filter is null, then the column is simply matched against itself. This assumes the filter column itself is never null (since you can't use = for null comparisons). It also assumes there are other indexed predicates in the WHERE clause, since a function like COALESCE isn't going to be able to take advantage of an index.
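As an illustration, the same pattern can be used from a static cursor in PL/SQL, avoiding EXECUTE IMMEDIATE entirely; myTable, filter, id and val are the names from the query above, and the block is only a sketch:
DECLARE
  v_filter myTable.filter%TYPE := NULL;  -- NULL means "do not filter on this column"
BEGIN
  FOR r IN (SELECT id, val
            FROM   myTable
            WHERE  filter = COALESCE(v_filter, filter))
  LOOP
    NULL;  -- process r.id and r.val here
  END LOOP;
END;
/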
For problem 1 you just join the tables. If there is an equijoin and one table is quite small and the other large, then you're likely to get a hash join. This is effectively a caching mechanism, and the total cost of reading the tables and performing the join is only very slightly higher than that of reading the tables alone (as long as the hash table fits in memory).
It does not make a difference if the query is constructed and run through EXECUTE IMMEDIATE; the RDBMS hash join will still act as an effective cache.
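For illustration, the join for problem 1 might look like this sketch; the table and column names follow the question's description and are otherwise hypothetical:
SELECT t2.*
FROM   table2 t2
JOIN   table1 t1
  ON   t1.col1 = t2.col1
 AND   t1.col2 = t2.col2;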

What is Better for Mimicking PL/SQL Returning SQL in Interactive Reports: Collection or Pipelined-Function

The worst aspect of the Interactive Report (IR) is that you cannot create it from a PL/SQL block that returns a SQL statement. I have gotten around this using two methods:
1) Use APEX_COLLECTION.CREATE_COLLECTION_FROM_QUERY in a Before Header process, passing it a SQL statement constructed in PL/SQL within the process, and have the IR's source be select c001 alias1, c002 alias2 ... from apex_collections a where collection_name = '...'
2) Make a badass pipelined function with as long a parameter list as you need, and then have the IR's source be select * from table(package_name.pipelined_function_name(:P1_parameter1, :P1_Parameter2))
Is there a performance difference? I originally used the first method, but then ran into a case where it gave me a bug, so I tried the pipelined function, found I just liked it better, and have tended to use it ever since unless it was inappropriate to do so (namely when there is a large number of items to pass as parameters).
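For reference, a minimal, hypothetical sketch of what the pipelined function behind method 2 might look like; only package_name and pipelined_function_name come from the question, while the object types, table and columns are made up:
CREATE OR REPLACE TYPE t_report_row AS OBJECT (id NUMBER, data VARCHAR2(4000));
/
CREATE OR REPLACE TYPE t_report_tab AS TABLE OF t_report_row;
/
CREATE OR REPLACE PACKAGE package_name AS
  FUNCTION pipelined_function_name (p_parameter1 IN VARCHAR2,
                                    p_parameter2 IN VARCHAR2)
    RETURN t_report_tab PIPELINED;
END package_name;
/
CREATE OR REPLACE PACKAGE BODY package_name AS
  FUNCTION pipelined_function_name (p_parameter1 IN VARCHAR2,
                                    p_parameter2 IN VARCHAR2)
    RETURN t_report_tab PIPELINED
  IS
  BEGIN
    -- This cursor stands in for whatever logic builds the result set
    FOR r IN (SELECT id, data
              FROM   some_table
              WHERE  col1 = p_parameter1
              AND    col2 = p_parameter2)
    LOOP
      PIPE ROW (t_report_row(r.id, r.data));
    END LOOP;
    RETURN;
  END pipelined_function_name;
END package_name;
/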
The first method gives you the opportunity to cache data by re-creating the collection only when you need to. Using the n00X and d00X columns will give you some additional performance and the right column types for the report definition. You can also create a view based on that collection, with type casts and column aliases, to add more convenience:
create or replace view apx_my_report
as
select n001 id, c001 data, d001 some_date
from apex_collections
where collection_name = 'MY_REPORT'
/
In that case your report source will look like this:
select id, data, some_date from apx_my_report
/
On the other hand, when you need to execute an ad-hoc query every time the page is rendered, it leads to unavoidable re-creation of the collection, so performance goes down because of the extra transaction overhead: undo, redo, etc.
So, it depends.
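A minimal sketch of the "only when you need it" part, as a Before Header process; MY_REPORT matches the view above, while the query itself is a placeholder:
BEGIN
  -- Re-create the collection only when it does not already exist,
  -- so repeated page renders reuse the cached rows
  IF NOT APEX_COLLECTION.COLLECTION_EXISTS(p_collection_name => 'MY_REPORT') THEN
    -- CREATE_COLLECTION_FROM_QUERY stores the selected columns
    -- in the c001..c050 attributes of the collection
    APEX_COLLECTION.CREATE_COLLECTION_FROM_QUERY(
      p_collection_name => 'MY_REPORT',
      p_query           => 'select id, data, some_date from some_table');
  END IF;
END;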
