U-Sql use rowset variable for decision making - u-sql

I want to use rowset variable as scaler variable.
#cnt = Select count(*) from #tab1;
If (#cnt > 0) then
#cnt1= select * from #tab2;
End;
Is it possible?
======================================
I want to block the complex u-sql code based on some condition, lets say based on some control table. In my original code, I wrote 10-15 u-sql statements and I want to bound them within the If statement. I don't want to do cross join because it again start trying to join the table. If I use cross join, there is no significant save in execution time. Use of IF statement is, If the condition does not met, complete piece of code should not execute. Is it possible?

To add to wBob's and Alex's answers:
U-SQL does not provide data driven control flow within a script. The current IF statement requires the expression to be evaluated at compile time.
Consider a U-SQL script as just a single declarative query. So you have the following options:
Express your problem with relational expressions. This means that you will have to write a (cross) join to guard the execution. If you feel that the query optimizer does a bad job at optimizing such guards (e.g., it evaluates the expensive side of the join before the cheap guard), please report an issue and we will take a look.
Split your script into several scripts and look at the result of each script before doing your next step. This is a form of orchestration that you can do with ADF or writing your own orchestration with Powershell or any of the SDKs. The caveat here is that you will have to write intermediate results into files and download the files into your orchestration layer.
Having said this, it theoretically is possible to extent the language algebra with a "don't execute the remaining part of this operator tree if a condition is not satisfied" operator. However that is a major work item and can lead to very large query plans during compilation that may be going beyond the current limits. If you feel that neither 1 nor 2 above are sufficient to help with your scenario, please add your vote to https://feedback.azure.com/forums/327234-data-lake/suggestions/17635906-please-add-dynamic-if-evaluation-to-u-sql.

#cnt1 =
SELECT #tab2.*
FROM #tab2
CROSS JOIN (SELECT COUNT(*) AS cnt FROM #tab1) AS c
WHERE c.cnt > 0;
(Adding explanation) CROSS JOIN returns a cartesian product of all rows from #tab2 and the single row generated by the COUNT query. There WHERE condition then ensures the result of the query is all rows from #tab2 if COUNT(*)>0, no rows otherwise.

Related

How to put a part of a code as a string in table to use it in a procedure?

I'm trying to resolve below issue:
I need to prepare table that consists 3 columns:
user_id,
month
value.
Each from over 200 users has got different values of parameters that determine expected value which are: LOB, CHANNEL, SUBSIDIARY. So I decided to store it in table ASYSTENT_GOALS_SET. But I wanted to avoid multiplying rows and thought it would be nice to put all conditions as a part of the code that I would use in "where" clause further in procedure.
So, as an example - instead of multiple rows:
I created such entry:
So far I created testing table ASYSTENT_TEST (where I collect month and value for certain user). I wrote a piece of procedure where I used BULK COLLECT.
declare
type test_row is record
(
month NUMBER,
value NUMBER
);
type test_tab is table of test_row;
BULK_COLLECTOR test_tab;
p_lob varchar2(10) :='GOSP';
p_sub varchar2(14);
p_ch varchar2(10) :='BR';
begin
select subsidiary into p_sub from ASYSTENT_GOALS_SET where user_id='40001001';
execute immediate 'select mc, sum(ppln_wartosc) plan from prod_nonlife.mis_report_plans
where report_id = (select to_number(value) from prod_nonlife.view_parameters where view_name=''MIS'' and parameter_name=''MAX_REPORT_ID'')
and year=2017
and month between 7 and 9
and ppln_jsta_symbol in (:subsidiary)
and dcs_group in (:lob)
and kanal in (:channel)
group by month order by month' bulk collect into BULK_COLLECTOR
using p_sub,p_lob,p_ch;
forall x in BULK_COLLECTOR.first..BULK_COLLECTOR.last insert into ASYSTENT_TEST values BULK_COLLECTOR(x);
end;
So now when in table ASYSTENT_GOALS_SET column SUBSIDIARY (varchar) consists string 12_00_00 (which is code of one of subsidiary) everything works fine. But the problem is when user works in two subsidiaries, let say 12_00_00 and 13_00_00. I have no clue how to write it down. Should SUBSIDIARY column consist:
'12_00_00','13_00_00'
or
"12_00_00","13_00_00"
or maybe
12_00_00','13_00_00
I have tried a lot of options after digging on topics like "Deling with single/escaping/double qoutes".
Maybe I should change something in execute immediate as well?
Or maybe my approach to that issue is completely wrong from the very beginning (hopefully not :) ).
I would be grateful for support.
I didn't create the table function described here but that article inspired me to go back to try regexp_substr function again.
I changed: ppln_jsta_symbol in (:subsidiary) to
ppln_jsta_symbol in (select regexp_substr((select subsidiary from ASYSTENT_GOALS_SET where user_id=''fake_num''),''[^,]+'', 1, level) from dual
connect by regexp_substr((select subsidiary from ASYSTENT_GOALS_SET where user_id=''fake_num''), ''[^,]+'', 1, level) is not null) Now it works like a charm! Thank you #Dessma very much for your time and suggestion!
"I wanted to avoid multiplying rows and thought it would be nice to put all conditions as a part of the code that I would use in 'where' clause further in procedure"
This seems a misguided requirement. You shouldn't worry about number of rows: databases are optimized for storing and retrieving rows.
What they are not good at is dealing with "multi-value" columns. As your own solution proves, it is not nice, it is very far from nice, in fact it is a total pain in the neck. From now on, every time anybody needs to work with subsidiary they will have to invoke a function. Adding, changing or removing a user's subsidiary is much harder than it ought to be. Also there is no chance of enforcing data integrity i.e. validating that a subsidiary is valid against a reference table.
Maybe none of this matters to you. But there are very good reasons why Codd mandated "no repeating groups" as a criterion of First Normal Form, the foundation step of building a sound data model.
The correct solution, industry best practice for almost forty years, would be to recognise that SUBSIDIARY exists at a different granularity to CHANNEL and so should be stored in a separate table.

How to insert large number of nodes into Neo4J

I need to insert about 1 million of nodes in Neo4j. I need to specify that each node is unique, so every time I insert a node it has to be checked that there's not the same node yet. Also the relationships must be unique.
I'm using Python and Cypher:
uq = 'CREATE CONSTRAINT ON (a:ipNode8) ASSERT a.ip IS UNIQUE'
...
queryProbe = 'MERGE (a:ipNode8 {ip:"' + prev + '"})'
...
queryUpdateRelationship= 'MATCH (a:ipNode8 {ip:"' + prev + '"}),(b:ipNode8 {ip:"' + next + '"}) MERGE (a)-[:precede]->(b)'
The problem is that after putting 40-50K nodes into Neo4j , the insertion speed slows down quickly and I can not to put anything else.
Your question is quite open ended. In addition to #InverseFalcon's recommendations, here are some other things you can investigate to speed things up.
Read the Performance Tuning documentation, and follow the recommendations. In particular, you might be running into memory-related issues, so the Memory Tuning section may be very helpful.
Your Cypher query(ies) can probably be sped up. For instance, if it makes sense, you can try something like the following. The data parameter is expected to be a list of objects having the format {a: 123, b: 234}. You can make the list as long as appropriate (e.g., 20K) to avoid running out of memory on the server while it processes the list within a single transaction. (This query assumes that you also want to create b if it does not exist.)
UNWIND {data} AS d
MERGE (a:ipNode8 {ip: d.a})
MERGE (b:ipNode8 {ip: d.b})
MERGE (a)-[:precede]->(b)
There are also periodic execution APOC procedures that you might be able to use.
For mass inserts like this, it's best to use LOAD CSV with periodic commit or the import tool.
I believe it's also best practice to use a parameterized query instead of appending values into a string.
Also, you created a unique property constraint on :ipNode8, but not :ipNode, which is the first one you MERGE. Seems like you'll need a unique constraint for that one too.

BigQuery error: Cannot query the cross product of repeated fields

I am running the following query on Google BigQuery web interface, for data provided by Google Analytics:
SELECT *
FROM [dataset.table]
WHERE
  hits.page.pagePath CONTAINS "my-fun-path"
I would like to save the results into a new table, however I am obtaining the following error message when using Flatten Results = False:
Error: Cannot query the cross product of repeated fields
customDimensions.value and hits.page.pagePath.
This answer implies that this should be possible: Is there a way to select nested records into a table?
Is there a workaround for the issue found?
Depending on what kind of filtering is acceptable to you, you may be able to work around this by switching to OMIT IF from WHERE. It will give different results, but, again, perhaps such different results are acceptable.
The following will remove entire hit record if (some) page inside of it meets criteria. Note two things here:
it uses OMIT hits IF, instead of more commonly used OMIT RECORD IF).
The condition is inverted, because OMIT IF is opposite of WHERE
The query is:
SELECT *
FROM [dataset.table]
OMIT hits IF EVERY(NOT hits.page.pagePath CONTAINS "my-fun-path")
Update: see the related thread, I am afraid this is no longer possible.
It would be possible to use NEST function and grouping by a field, but that's a long shot.
Using flatten call on the query:
SELECT *
FROM flatten([google.com:analytics-bigquery:LondonCycleHelmet.ga_sessions_20130910],customDimensions)
WHERE
  hits.page.pagePath CONTAINS "m"
Thus in the web ui:
setting a destination table
allowing large results
and NO flatten results
does the job correctly and the produced table matches the original schema.
I know - it is old ask.
But now it can be achieved by just using standard SQL dialect instead of Legacy
#standardSQL
SELECT t.*
FROM `dataset.table` t, UNNEST(hits.page) as page
WHERE
  page.pagePath CONTAINS "my-fun-path"

What is Better for Mimicking PL/SQL Returning SQL in Interactive Reports: Collection or Pipelined-Function

The worst aspect of the Interactive Report (IR) is that you cannot create it using a PL/SQL returning SQL statement. I have gotten around this using two methods:
1) APEX_COLLECTION.CREATE_COLLECTION in the Before Header Process, which takes a SQL statement (that is constructed in PL/SQL in the process), and have the IR's source be select c001 alias1, c002 alias2 ... from apex_collections a where collection_name = '...'
2) Make a badass pipeline function with a parameter list as long as you need and then have the IR's source be select * from table(package_name.pipelined_function_name(:P1_parameter1, :P1_Parameter2))
Is there a performance difference? I originally used the first method but then ran into an occurrence where it was giving me a bug so I tried the pipelined function and found I just liked it better and have tended to use them ever since unless it was inappropriate to do so (namely when there is a large number of items to be passed to the parameter).
First method gives you opportunity to cache data by re-creating the collection only when you need it. Using n00X and d00X columns will give you some additional performance and right column types for the report definition. You can also create a view based on that collection with type casting and column aliases to add more convenience:
create or replace view apx_my_report
as
select n001 id, c001 data, d001 some_date
from apex_collections
where collection_name = 'MY_REPORT'
/
In that case you report source will be like that:
select id, data, some_date from apx_my_report
/
On the other hand, when you need to execute an ad-hoc query every time when page is rendered, it leads to the unavoidable re-creation of a such collection, therefore the performance goes down because of unwanted transaction maintaining: undo, redo etc.
So, it depends.

Why does this query timeout? V2

This question is a followup to This Question
The solution, clearing the execution plan cache seemed to work at the time, but i've been running into the same problem over and over again, and clearing the cache no longer seems to help. There must be a deeper problem here.
I've discovered that if I remove the .Distinct() from the query, it returns rows (with duplicates) in about 2 seconds. However, with the .Distinct() it takes upwards of 4 minutes to complete. There are a lot of rows in the tables, and some of the where clause fields do not have indexes. However, the number of records returned is fairly small (a few dozen at most).
The confusing part about it is that if I get the SQL generated by the Linq query, via Linqpad, then execute that code as SQL or in SQL Management Studio (including the DISTINCT) it executes in about 3 seconds.
What is the difference between the Linq query and the executed SQL?
I have a short term workaround, and that's to return the set without .Distinct() as a List, then using .Distinct on the list, this takes about 2 seconds. However, I don't like doing SQL Server work on the web server.
I want to understand WHY the Distinct is 2 orders of magnitude slower in Linq, but not SQL.
UPDATE:
When executing the code via Linq, the sql profiler shows this code, which is basically identical query.
sp_executesql N'SELECT DISTINCT [t5].[AccountGroupID], [t5].[AccountGroup]
AS [AccountGroup1]
FROM [dbo].[TransmittalDetail] AS [t0]
INNER JOIN [dbo].[TransmittalHeader] AS [t1] ON [t1].[TransmittalHeaderID] =
[t0].[TransmittalHeaderID]
INNER JOIN [dbo].[LineItem] AS [t2] ON [t2].[LineItemID] = [t0].[LineItemID]
LEFT OUTER JOIN [dbo].[AccountType] AS [t3] ON [t3].[AccountTypeID] =
[t2].[AccountTypeID]
LEFT OUTER JOIN [dbo].[AccountCategory] AS [t4] ON [t4].[AccountCategoryID] =
[t3].[AccountCategoryID]
LEFT OUTER JOIN [dbo].[AccountGroup] AS [t5] ON [t5].[AccountGroupID] =
[t4].[AccountGroupID]
LEFT OUTER JOIN [dbo].[AccountSummary] AS [t6] ON [t6].[AccountSummaryID] =
[t5].[AccountSummaryID]
WHERE ([t1].[TransmittalEntityID] = #p0) AND ([t1].[DateRangeBeginTimeID] = #p1) AND
([t1].[ScenarioID] = #p2) AND ([t6].[AccountSummaryID] = #p3)',N'#p0 int,#p1 int,
#p2 int,#p3 int',#p0=196,#p1=20100101,#p2=2,#p3=0
UPDATE:
The only difference between the queries is that Linq executes it with sp_executesql and SSMS does not, otherwise the query is identical.
UPDATE:
I have tried various Transaction Isolation levels to no avail. I've also set ARITHABORT to try to force a recompile when it executes, and no difference.
The bad plan is most likely the result of parameter sniffing: http://blogs.msdn.com/b/queryoptteam/archive/2006/03/31/565991.aspx
Unfortunately there is not really any good universal way (that I know of) to avoid that with L2S. context.ExecuteCommand("sp_recompile ...") would be an ugly but possible workaround if the query is not executed very frequently.
Changing the query around slightly to force a recompile might be another one.
Moving parts (or all) of the query into a view*, function*, or stored procedure* DB-side would be yet another workaround.
 * = where you can use local params (func/proc) or optimizer hints (all three) to force a 'good' plan
Btw, have you tried to update statistics for the tables involved? SQL Server's auto update statistics doesn't always do the job, so unless you have a scheduled job to do that it might be worth considering scripting and scheduling update statistics... ...tweaking up and down the sample size as needed can also help.
There may be ways to solve the issue by adding* (or dropping*) the right indexes on the tables involved, but without knowing the underlying db schema, table size, data distribution etc that is a bit difficult to give any more specific advice on...
 * = Missing and/or overlapping/redundant indexes can both lead to bad execution plans.
The SQL that Linqpad gives you may not be exactly what is being sent to the DB.
Here's what I would suggest:
Run SQL Profiler against the DB while you execute the query. Find the statement which corresponds to your query
Paste the whole statment into SSMS, and enable the "Show Actual Execution Plan" option.
Post the resulting plan here for people to dissect.
Key things to look for:
Table Scans, which usually imply that an index is missing
Wide arrows in the graphical plan, indicating lots of intermediary rows being processed.
If you're using SQL 2008, viewing the plan will often tell you if there are any indexes missing which should be added to speed up the query.
Also, are you executing against a DB which is under load from other users?
At first glance there's a lot of joins, but I can only see one thing to reduce the number right away w/out having the schema in front of me...it doesn't look like you need AccountSummary.
[t6].[AccountSummaryID] = #p3
could be
[t5].[AccountSummaryID] = #p3
Return values are from the [t5] table. [t6] is only used filter on that one parameter which looks like it is the Foreign Key from t5 to t6, so it is present in [t5]. Therefore, you can remove the join to [t6] altogether. Or am I missing something?
Are you sure you want to use LEFT OUTER JOIN here? This query looks like it should probably be using INNER JOINs, especially because you are taking the columns that are potentially NULL and then doing a distinct on it.
Check that you have the same Transaction Isolation level between your SSMS session and your application. That's the biggest culprit I've seen for large performance discrepancies between identical queries.
Also, there are different connection properties in use when you work through SSMS than when executing the query from your application or from LinqPad. Do some checks into the Connection properties of your SSMS connection and the connection from your application and you should see the differences. All other things being equal, that could be the difference. Keep in mind that you are executing the query through two different applications that can have two different configurations and could even be using two different database drivers. If the queries are the same then that would be only differences I can see.
On a side note if you are hand-crafting the SQL, you may try moving the conditions from the WHERE clause into the appropriate JOIN clauses. This actually changes how SQL Server executes the query and can produce a more efficient execution plan. I've seen cases where moving the filters from the WHERE clause into the JOINs caused SQL Server to filter the table earlier in the execution plan and significantly changed the execution time.

Resources