Getting error with basic trim() function in U-SQL script - u-sql

I want to apply the .trim() function to a column, but I am getting an error.
Sample data:
Product_ID,Product_Name
1, Office Supplies
2,Personal Care
I have to do some data manipulation but can't get the basic trim() function right.
@productlog =
EXTRACT Product_ID string,
Prduct_Name string
FROM "/Staging/Products.csv"
USING Extractors.Csv();
@output = Select Product_ID, Product_Name.trim() from @productlog;
OUTPUT @output
TO "/Output/Products.csv"
USING Outputters.Csv();
Error:
Activity U-SQL1 failed: Error Id: E_CSC_USER_SYNTAXERROR, Error Message: syntax error. Expected one of: '.' ALL ANTISEMIJOIN ANY AS BEGIN BROADCASTLEFT BROADCASTRIGHT CROSS DISTINCT EXCEPT FULL FULLCROSS GROUP HASH HAVING INDEXLOOKUP INNER INTERSECT JOIN LEFT LOOP MERGE ON OPTION ORDER OUTER OUTER UNION PAIR PIVOT PRESORT PRODUCE READONLY REQUIRED RIGHT SAMPLE SEMIJOIN SERIAL TO UNIFORM UNION UNIVERSE UNPIVOT USING WHERE WITH ';' '(' ')' ',' .

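Try the following. As far as I know, you need to alias computed columns such as trimmed fields. Note also that U-SQL keywords must be upper case, that the C# method is Trim() rather than trim(), and that the column name in the SELECT has to match the one declared in the EXTRACT.
@productlog =
EXTRACT Product_ID string,
Product_Name string
FROM "/Staging/Products.csv"
USING Extractors.Csv();
@output = SELECT Product_ID, Product_Name.Trim() AS Trimmed_Product_Name FROM @productlog;
OUTPUT @output
TO "/Output/Products.csv"
USING Outputters.Csv();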

Got it right finally; posting it in case someone else faces the same issue. U-SQL expressions are C#, so it can be a bit tricky for people like me who come from a pure SQL background.
Code:
@productlog =
EXTRACT Product_ID string,
Prduct_Name string
FROM "/Staging/Products.csv"
USING Extractors.Csv();
@output =
SELECT
T.Product_ID,
T.Prduct_Name.ToUpper().Trim() AS Prduct_Name
FROM @productlog AS T;
OUTPUT @output
TO "/Output/Products.csv"
USING Outputters.Csv();
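For anyone who wants to check the expected result without submitting a job, the effect of ToUpper().Trim() on the sample data can be sketched locally. This is a rough Python illustration, not U-SQL: the function name clean_products and the inline SAMPLE string are made up for the example, and Python's upper() and strip() are assumed to be close enough to the C# methods for this data.

```python
import csv
import io

# Sample rows from the question (header omitted); note the leading space
# before "Office Supplies".
SAMPLE = "1, Office Supplies\n2,Personal Care\n"


def clean_products(text):
    """Return (product_id, product_name) pairs with the name upper-cased and trimmed."""
    rows = []
    for product_id, product_name in csv.reader(io.StringIO(text)):
        # Mirrors T.Prduct_Name.ToUpper().Trim() from the U-SQL script.
        rows.append((product_id, product_name.upper().strip()))
    return rows


if __name__ == "__main__":
    for row in clean_products(SAMPLE):
        print(row)
```

The first row comes back without its leading space, which is the behaviour the U-SQL script is after.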

Related

Explode an Array in Athena

I have a simple table in Athena that contains an array of events. I want to write a simple SELECT statement so that each event in the array becomes a row.
I tried explode and transform, but had no luck. I have done this successfully in Spark and Hive, but Athena is tricking me. Please advise.
DROP TABLE bi_data_lake.royalty_v4;
CREATE external TABLE bi_data_lake.royalty_v4 (
KAFKA_ID string,
KAFKA_TS string,
deviceUser struct< deviceName:string, devicePlatform:string >,
consumeReportingEvents array<
struct<
consumeEvent: string,
consumeEventAction: string,
entryDateTime: string
>
>
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
LOCATION 's3://XXXXXXXXXXX';
Query that does not work:
select kafka_id, kafka_ts,deviceuser,
transform( consumereportingevents, consumereportingevent -> consumereportingevent.consumeevent) as cre
from bi_data_lake.royalty_v4
where kafka_id = 'events-consumption-0-490565';
Not supported:
lateral view explode(consumereportingevents) as consumereportingevent
Found the answer to my question: use UNNEST.
WITH samples AS (
select kafka_id, kafka_ts,deviceuser, consumereportingevent, consumereportingeventPos
from bi_data_lake.royalty_v4
cross join unnest(consumereportingevents) WITH ORDINALITY AS T (consumereportingevent, consumereportingeventPos)
where kafka_id = 'events-consumption-0-490565' or kafka_id = 'events-consumption-0-490566'
)
SELECT * FROM samples
Flatten ('explode') nested arrays in AWS Athena with UNNEST.
WITH dataset AS (
SELECT
'engineering' as department,
ARRAY['Sharon', 'John', 'Bob', 'Sally'] as users
)
SELECT department, names FROM dataset
CROSS JOIN UNNEST(users) as t(names)
Reference: Flattening Nested Arrays
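To make the row semantics concrete, here is a small Python sketch of what CROSS JOIN UNNEST does to each input row. It is only a model of the behaviour, not Athena code: the helper cross_join_unnest and the output keys "element" and "position" are invented for the illustration, and the 1-based position is meant to stand in for WITH ORDINALITY.

```python
def cross_join_unnest(rows, array_column, with_ordinality=False):
    """Yield one output row per array element, copying the other columns."""
    for row in rows:
        others = {k: v for k, v in row.items() if k != array_column}
        # A row whose array is empty produces no output rows at all,
        # as with a cross join against an empty set.
        for position, element in enumerate(row[array_column], start=1):
            out = dict(others, element=element)
            if with_ordinality:
                out["position"] = position
            yield out


# Same data as the 'engineering' example above.
dataset = [
    {"department": "engineering", "users": ["Sharon", "John", "Bob", "Sally"]},
]

flattened = list(cross_join_unnest(dataset, "users", with_ordinality=True))

if __name__ == "__main__":
    for r in flattened:
        print(r)
```

Each of the four users ends up on its own row next to the department value, which is the shape the question asks for.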

Splitting Columns in USQL

I am new to U-SQL and I am having a hard time splitting a column off from the rest of my file. In my EXTRACT statement I declared 4 columns because each row of my file has 4 pipe-delimited fields. However, I want to remove one of the declared columns from the output. How do I do this?
The Json column is the one I want to split off, producing a new rowset that does not include it. Basically, I want Date, Status and PriceNotification to go into @result. This is what I have so far:
@input =
EXTRACT
Date string,
Condition string,
Price string,
Json string
FROM @in
USING Extractor.Cvs;
@result =
SELECT Json
FROM @input
OUTPUT @input
TO @out
USING Outputters.Cvs();
Maybe I have misunderstood your question, but you can simply list the columns you want in the SELECT statement, e.g.
@input =
EXTRACT
Date string,
Status string,
PriceNotification string,
Json string
FROM @in
USING Extractors.Text(delimiter : '|');
@result =
SELECT Date, Status, PriceNotification
FROM @input;
OUTPUT @result
TO @out
USING Outputters.Csv();
NB I have switched the variable in your OUTPUT statement to @result. If this does not answer your question, please post some sample data and expected results.

How to get a unique result set in a PL/SQL cursor?

I want to use this procedure to display the username and mobile phone number. This is the result set when I run the SELECT on its own:
[image: result set of the SELECT]
When the procedure runs, I get this:
[image: output of the procedure]
ORA-01722: invalid number
ORA-06512: at "ABPROD.SHAREPOOL", line 24
When I use UNIQUE or DISTINCT in the cursor, nothing is displayed.
The source code:
create or replace procedure sharepool (assignment in varchar2,myorgname in varchar2) is
rightid T_CLM_AP30_RIGHT.RIGHT_ID%type;
orgid t_clm_ap30_org.org_id%type;
begin
select t.right_id into rightid from T_CLM_AP30_RIGHT t where t.rightdesc=trim(assignment);
dbms_output.put_line(rightid||trim(myorgname)||assignment);
select t.org_id into orgid from t_clm_ap30_org t where t.orgname=trim(myorgname);
dbms_output.put_line(orgid);
declare
cursor namelist is select distinct a.username,a.mobile from t_clm_ap30_user a, T_CLM_AP30_RIGHT_AUTH t where a.user_id=t.user_id and t.right_id=rightid and t.poolorgrange=orgid ;
begin
for c in namelist
loop
dbms_output.put_line(c.username||' '||c.mobile);
end loop;
end;
end sharepool;
INVALID_NUMBER errors indicate a failed casting of a string to a number. That means one of your join conditions is comparing a string column with a number column, and you have values in the string column which cannot be cast to a number.
ORA-06512: at "ABPROD.SHAREPOOL", line 24
Line 24 doesn't align with the code you've posted, presumably lost in translation from your actual source. Also you haven't posted table descriptions so we cannot tell which columns to look at.
So here is a guess.
One (or more) of these joins has an implicit numeric conversion:
where a.user_id = t.user_id
and t.right_id = rightid
and t.poolorgrange = orgid
That is, either t_clm_ap30_user.user_id is numeric and T_CLM_AP30_RIGHT_AUTH.user_id is not, or vice versa. Or T_CLM_AP30_RIGHT_AUTH.right_id is numeric and T_CLM_AP30_RIGHT.right_id is not, or vice versa. Or T_CLM_AP30_RIGHT_AUTH.poolorgrange is numeric and t_clm_ap30_org.org_id is not, or vice versa.
Only you can figure this out, because only you can see your schema. Once you have identified the join where you have a string column being compared to a numeric column you need to query that column to find the data which cannot be converted to a number.
Let's say that T_CLM_AP30_RIGHT_AUTH.poolorgrange is the rogue string. You can see which are the troublesome rows with this query:
select * from T_CLM_AP30_RIGHT_AUTH
where translate (poolorgrange, 'x1234567890', 'x') is not null;
The translate() function strips out digits. So anything which is left can't be converted to a number.
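The trick can be reproduced outside the database to see why it works. The Python sketch below is only an emulation under stated assumptions: the function names and the sample values are hypothetical, and the one Oracle-specific rule it relies on is that an empty string is treated as NULL. In the Oracle call, 'x' maps to itself and the ten digits have no counterpart in the target string, so they are deleted; the 'x' is there only because the target string cannot be empty.

```python
def oracle_translate_digits(value):
    """Emulate translate(value, 'x1234567890', 'x'): keep 'x', delete digits."""
    if value is None:
        return None
    # Delete every digit; all other characters are left unchanged.
    stripped = value.translate(str.maketrans("", "", "1234567890"))
    # Oracle treats an empty string as NULL, which is what makes the
    # "is not null" filter pick out only the non-numeric values.
    return stripped or None


def non_numeric_values(values):
    """Return the values that still contain something after digits are removed."""
    return [v for v in values if oracle_translate_digits(v) is not None]


if __name__ == "__main__":
    # Hypothetical poolorgrange values, invented for the example.
    print(non_numeric_values(["1001", "20B7", "ALL", "42"]))
```

The check is deliberately strict: values with a sign, a decimal point or surrounding spaces would also be reported, even though some of them might convert to a number.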
"T_CLM_AP30_RIGHT_AUTH.poolorgrange is varchar2,and t_clm_ap30_org.org_id is numeric ."
You can avoid the error by explicitly casting the t_clm_ap30_org.org_id to a string:
select distinct a.username, a.mobile
from t_clm_ap30_user a,
T_CLM_AP30_RIGHT_AUTH t
where a.user_id = t.user_id
and t.right_id = rightid
and t.poolorgrange = to_char(orgid) ;
Obviously you're not going to get matches on those alphanumeric values but the query will run.

Blank rows not coming

After running the code below, I get only the rows where the remaining columns are present, not the rows where just the ID is present.
Is there a problem with the way the file is extracted?
DROP VIEW IF EXISTS dbo.Consolidated;
CREATE VIEW IF NOT EXISTS dbo.Consolidated
AS
EXTRACT Statement
FROM "adl:///2016/08/12/File_name.csv"
USING Extractors.Csv(silent : true, quoting : true, nullEscape : "/N");
@Temp =
SELECT *
FROM Consolidated;
OUTPUT @Temp
TO "adl://arbit/new_cont_check.csv"
USING Outputters.Csv();
So does the number of columns vary? If so, try a user-defined extractor. An example that handles a varying number of columns can be seen at Using User-Defined Extractor - FlexExtractor.
You have added the silent : true argument to your USING clause, which means the EXTRACT will not fail or even complain if rows do not match your schema definition; rows with the wrong number of columns are simply skipped. This is the intended behaviour and probably what you want for this example. To pick up the other rows, you can use OUTER UNION, like in this recent example:
Another working example similar to yours:
@input =
EXTRACT ControllerID int?,
ParameterID int?,
MeasureDate DateTime,
Value float
FROM "/input/input56.csv"
USING Extractors.Csv(silent : true, quoting : true, nullEscape : "/N", skipFirstNRows : 1)
OUTER UNION ALL BY NAME ON ( ControllerID )
EXTRACT ControllerID int?
FROM "/input/input56.csv"
USING Extractors.Csv(silent : true, quoting : true, nullEscape : "/N", skipFirstNRows : 1);
OUTPUT @input
TO "/output/output.txt"
USING Outputters.Tsv(quoting : false);
I used this sample file and got these results:
NB I have changed the nullability of your columns, as otherwise the OUTER UNION adds default values for the .NET types, as per the Sep 2016 release notes.

Writing SQL to U-SQL Query

Can anybody please guide me in writing the SQL below in U-SQL, the language used in Azure Data Lake?
select tt.userId, count(tt.userId) from (SELECT userId,count(userId) as cou
FROM [dbo].[users]
where createdTime> DATEADD(wk,-1,GETDATE())
group by userId,DATEPART(minute,createdTime)/5) tt group by tt.userId
I can't find the DATEPART function in U-SQL. The Azure Data Lake Analytics job is giving me an error.
U-SQL does not provide T-SQL intrinsic functions except for a few (like LIKE). See https://msdn.microsoft.com/en-us/library/azure/mt621343.aspx for a list.
So how do you do DateTime operations? You just use the C# functions and methods!
So DATEADD(wk, -1, GETDATE()) is something like DateTime.Now.AddDays(-7),
and
DATEPART(minute,createdTime)/5 (there is an extra ) in your line) is something like createdTime.Minute/5 (you may need to cast it to a double if you want a non-integer value).
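To see what the two expressions compute, here is a small Python sketch of the same arithmetic. It is an illustration only: the helper names, the fixed "now" and the sample timestamps are all made up, and Python's // on non-negative integers is assumed to match C# integer division.

```python
from datetime import datetime, timedelta


def one_week_ago(now):
    # Counterpart of DateTime.Now.AddDays(-7); "now" is passed in so the
    # example gives the same answer every time it runs.
    return now - timedelta(days=7)


def five_minute_bucket(created_time):
    # Integer division, like createdTime.Minute / 5 on C# ints: values 0..11.
    return created_time.minute // 5


# Hypothetical reference time and createdTime values.
now = datetime(2016, 9, 15, 10, 0, 0)
stamps = [
    datetime(2016, 9, 14, 9, 3),
    datetime(2016, 9, 14, 9, 4),
    datetime(2016, 9, 14, 9, 59),
    datetime(2016, 9, 1, 9, 7),  # older than one week, filtered out
]

recent = [t for t in stamps if t > one_week_ago(now)]
buckets = [five_minute_bucket(t) for t in recent]

if __name__ == "__main__":
    print(buckets)
```

Minutes 3 and 4 land in the same bucket because the division truncates. The bucket depends only on the minute, so 09:03 and 10:03 also share a bucket, exactly as in the original T-SQL.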
For anybody looking for the implementation mentioned by Michael, it looks like this:
@records =
EXTRACT userId string,
createdTime DateTime
FROM "/datalake/input/data.tsv"
USING Extractors.Tsv();
@result =
SELECT
userId,
COUNT(createdTime) AS userCount
FROM @records
WHERE createdTime > DateTime.Now.AddDays(-30)
GROUP BY userId,createdTime.Minute/5;
@result2= SELECT userId,COUNT(userId) AS TotalCount
FROM @result
GROUP BY userId;
OUTPUT @result2
TO "/datalake/output/data.csv"
USING Outputters.Csv();
