This is my first post/question about U-SQL.
My process extracts data from huge files in Azure Data Lake Storage.
My problem is that one of the files has the wrong structure (one field fewer than expected) and it crashes my process.
I would like to handle the exception, keep the process running, and report which file was rejected or contained errors.
I know about the parameter (silent : true), but I am looking for a robust solution for a production environment... if I just skip one file I am losing millions of transactions.
Below is my extract code:
@Source =
    EXTRACT [RouteVariant] string,
            [StageNumber] string,
            [StopNumber] string,
            [TransactionTime] string,
            [TicketClass] string,
            [TransactionDate] int,
            [FareValue] double
    FROM @"/Files/Transactions/{*}.csv"
    USING Extractors.Text(delimiter : ';');
If you know that some files may come without a field, you can mark that column as nullable when you extract.
For example, let's say TicketClass is the column that may be missing:
EXTRACT [RouteVariant] string,
        [StageNumber] string,
        [StopNumber] string,
        [TransactionTime] string,
        [TicketClass] string?,
        [TransactionDate] int,
        [FareValue] double
FROM @"/Files/Transactions/{*}.csv"
USING Extractors.Text(delimiter : ';');
With the question mark you allow that column to be null if the value doesn't appear during the extraction.
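If you also need to report which rows (and therefore which files) were affected, one option is to split the extracted rowset on the nullable column and write the suspect rows to a rejection file. A minimal sketch building on the extract above; the rejection path is hypothetical:

@rejected =
    SELECT *
    FROM @Source
    WHERE TicketClass == null;

// Persist the rejected rows so they can be inspected later.
OUTPUT @rejected
TO "/Files/Rejected/missing_ticketclass.csv"
USING Outputters.Csv();

The WHERE clause uses C# comparison semantics, so == null matches the rows where the optional field was absent.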
I am trying to import a .csv file to match the records in the database. However, the database records have leading zeros. This is a character field. The amount of data is a bit on the higher side.
Here the length of the field in the database is x(15).
The problem I am facing is that the .csv file contains data like "AB123456789", whereas the database field has "00000AB123456789".
I am importing the .csv into a character variable.
Could someone please let me know what I should do to get the prefix zeros using a Progress query?
Thank you.
You need to FILL() the input string with "0" in order to pad it to a specific length. You can do that with code similar to this:
define variable inputText as character no-undo format "x(15)".
define variable n as integer no-undo.

input from "input.csv".
repeat:
    import inputText.
    /* pad with leading zeros up to a length of 15 */
    n = 15 - length( inputText ).
    if n > 0 then
        inputText = fill( "0", n ) + inputText.
    display inputText.
end.
input close.
Substitute your actual field name for inputText and use whatever mechanism you are actually using for importing the CSV data.
FYI - the "length of the field in the database" is NOT "x(15)". That is a display formatting string. The data dictionary has a default format string that was created when the schema was defined, but it has absolutely no impact on what is actually stored in the database. ALL Progress data is stored as variable length. It is not padded to fit the display format and, in fact, it can be "overstuffed"; it is very, very common for applications to do so. This is a source of great frustration to SQL reporting tools that assume the display format is some sort of length limit. It is not.
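Incidentally, the "overstuffing" is easy to demonstrate; a hypothetical sketch, unrelated to the poster's data:

/* the display format does not limit what can be stored */
define variable s as character no-undo format "x(5)".

s = "0123456789".           /* ten characters in an x(5) field  */
display length(s).          /* 10 - storage is variable length  */
display s format "x(10)".   /* widen the format to see it all   */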
I am trying to use Microsoft's Cognitive Services with Data Lake and have run into a problem while trying to get key phrases and sentiment from the text in a column of a CSV file.
I have checked to make sure that the file is formatted correctly and is being read correctly (I have done a few basics, like copying, to make sure it is workable).
I have also made sure that the column I am interested in the CSV file (Description) contains just text(string) when it is extracted by itself.
The input file and output folder are in my Azure data lake and I am running the script from my data lake analytics on Azure. I have not tried to run this locally in Visual Studio.
I used Key Phrases Extraction (U-SQL) and Sentiment Analysis (U-SQL) as my reference and followed the directions there, including getting the plugins.
In each case when I submit the job I get an error that I cannot seem to find a way round. Below I have shown the code that I have used for each and the error that I get when running it.
Key Phrase Code
REFERENCE ASSEMBLY [TextSentiment];
REFERENCE ASSEMBLY [TextKeyPhrase];

@myinput =
    EXTRACT
        Modified_On string,
        _Name string,
        Description string,
        Customer string,
        Category string,
        Target_Market string,
        Person_Responsible string,
        Status string,
        _Region string,
        Modified_On_2 string,
        Created_On string,
        _Site string,
        _Team string
    FROM "/userData/fromSharepoint/Game_Plans"
    USING Extractors.Csv(skipFirstNRows:1);

@keyphrase =
    PROCESS @myinput
    PRODUCE
        Description,
        KeyPhrase string
    READONLY
        Description
    USING new Cognition.Text.KeyPhraseExtractor();

OUTPUT @keyphrase
TO "/userData/testingCognitive/tesing1.csv"
USING Outputters.Csv();
Key Phrase Error Message
Sentiment Code
REFERENCE ASSEMBLY [TextSentiment];
REFERENCE ASSEMBLY [TextKeyPhrase];

@myinput =
    EXTRACT
        Modified_On string,
        _Name string,
        Description string,
        Customer string,
        Category string,
        Target_Market string,
        Person_Responsible string,
        Status string,
        _Region string,
        Modified_On_2 string,
        Created_On string,
        _Site string,
        _Team string
    FROM "/userData/fromSharepoint/Game_Plans"
    USING Extractors.Csv(skipFirstNRows:1);

@sentiment =
    PROCESS @myinput
    PRODUCE
        Description,
        sentiment string,
        conf double
    READONLY
        Description
    USING new Cognition.Text.SentimentAnalyzer(true);

OUTPUT @sentiment
TO "/userData/testingCognitive/tesing1.csv"
USING Outputters.Csv();
Sentiment Error Message
Any assistance on how to solve this would be much appreciated.
Alternatively if anyone has got these functions working and can provide some scripts to test with and links to input files to download that would be awesome.
I can't reproduce your exact error (can you post some simple sample data?), but I can get these libraries to work. I think the KeyPhraseExtractor by default expects columns called Text and KeyPhrase, so if you are going to change them then you have to pass your column names in as arguments, e.g.
@keyphrase =
    PROCESS @myinput
    PRODUCE Description,
            KeyPhrase string
    READONLY Description
    USING new Cognition.Text.KeyPhraseExtractor("Description", "KeyPhrase");
UPDATE: There are some invalid characters in your sample file, just after the word "Bass". This is a non-breaking space (U+00A0) and I don't think you'll be able to import them - happy to be corrected. I removed these manually and was able to import the file. You could pre-process them in some manner.
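If you would rather handle that in the script than edit files by hand, one option (a sketch, assuming the extraction itself succeeds and the stray characters only appear inside Description) is to replace the non-breaking spaces before running the extractor class:

// strip non-breaking spaces (U+00A0) from the text column
@cleaned =
    SELECT Description.Replace('\u00A0', ' ') AS Description
    FROM @myinput;

@keyphrase could then be produced from @cleaned instead of @myinput.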
I'm having a problem that I thought would be rather common, but trying to look it up in "Oracle Database 10g2 Utilities_b14215.pdf" didn't help. After that I surfed through numerous threads, but no luck so far.
I have a tab-delimited file (x'09') with e.g. name, userid, persnr. The values for the userids begin with either P, R or T, e.g. P2198, P2199, R7288, T1229.
I want to load only the records with userids beginning with P.
Isolating a single record with a controlfile like this works splendidly:
OPTIONS (SKIP=1)
LOAD DATA
INFILE UserlistLoader.dat
APPEND
INTO TABLE Z_USERLIST
WHEN USERID = 'P2198'
FIELDS TERMINATED BY x'09'
TRAILING NULLCOLS
(name, userid, persnr)
But every attempt at using SUBSTR in the WHEN clause fails.
This:
OPTIONS (SKIP=1)
LOAD DATA
INFILE UserlistLoader.dat
APPEND
INTO TABLE Z_USERLIST
WHEN SUBSTR(USERID, 1, 1) = 'P'
FIELDS TERMINATED BY x'09'
TRAILING NULLCOLS
(name, userid, persnr)
ends in an SQL*Loader-350 syntax error.
This
OPTIONS (SKIP=1)
LOAD DATA
INFILE UserlistLoader.dat
APPEND
INTO TABLE Z_USERLIST
WHEN "SUBSTR(:USERID, 1, 1)" = 'P'
FIELDS TERMINATED BY x'09'
TRAILING NULLCOLS
(name, userid, persnr)
ends in an SQL*Loader-403: Referenced column USERID not present in table Z_USERLIST.
But IT IS PRESENT - as the first example proves. I've found that the column should be preceded by : but that obviously isn't the issue.
What am I doing wrong?
From the SQL*Loader docs, the left-hand side of a WHEN condition can only be a full field name, e.g. USERID, or a position spec, e.g. (3:5).
The docs aren't very clear, though, on what is allowed - e.g. can LIKE be used as the operator?
USERID LIKE 'P%'
I strongly suspect it can't, though.
I would load the entire file into a staging table that matches the file layout, then run a procedure that inserts the rows you want from there into the production table. That is a more common way to handle loads with criteria like this without having to edit source data.
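A sketch of that approach (the staging table name is hypothetical; the target table and columns are taken from the question):

-- after loading the whole file into the staging table with a plain
-- control file, copy only the 'P' userids into the real table
INSERT INTO z_userlist (name, userid, persnr)
SELECT name, userid, persnr
  FROM z_userlist_stage
 WHERE userid LIKE 'P%';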
If you can preprocess the source file, move the userid to the first field, or copy the first letter of the userid to its own field, and construct the WHEN clause like this so sqlldr looks at the first position (this will cause sqlldr to return non-zero, though, as not all rows meet the WHEN clause criteria):
WHEN (1) = 'P'
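Putting that together, a hypothetical control file for a preprocessed file in which the userid has been moved to the first field might look like:

OPTIONS (SKIP=1)
LOAD DATA
INFILE UserlistLoader.dat
APPEND
INTO TABLE Z_USERLIST
WHEN (1) = 'P'
FIELDS TERMINATED BY x'09'
TRAILING NULLCOLS
(userid, name, persnr)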
I would like to query an SQLite table that contains directory paths to find all the paths under some hierarchy. Here's an example of the contents of the column:
/alpha/papa/
/alpha/papa/tango/
/alpha/quebec/
/bravo/papa/
/bravo/papa/uniform/
/charlie/quebec/tango/
If I search for everything under /bravo/papa/, I would like to get:
/bravo/papa/
/bravo/papa/uniform/
I am currently trying to do this like so (see below for the long story of why I can't use simpler methods):
SELECT * FROM Files WHERE Path >= '/bravo/papa/' AND Path < '/bravo/papa0';
This works. It looks a bit weird, but it works for this example. '0' is the Unicode code point one greater than '/'. When ordered lexicographically, all the paths starting with '/bravo/papa/' compare greater than it and less than '/bravo/papa0'. However, in my tests, I find that this breaks down when we try this:
SELECT * FROM Files WHERE Path >= '/' AND Path < '0';
This returns no results, but it should return every row. As far as I can tell, the problem is that SQLite is treating '0' as a number, not a string. If I use '0Z' instead of '0', for example, I do get results, but I introduce a risk of getting false positives. (For example, if there actually was an entry '0'.)
The simple version of my question is: is there some way to get SQLite to treat '0' in such a query as the length-1 string containing the unicode character '0' (which should sort after strings such as '!', '*' and '/', but before '1', '=' and 'A') instead of the integer 0 (which SQLite sorts before all strings)?
I think in this case I can actually get away with special-casing a search for everything under '/', since all my entries will always start with '/', but I'd really like to know how to avoid this sort of thing in general, as it's unpleasantly surprising in all the same ways as JavaScript's "==" operator.
First approach
A more natural approach would be to use the LIKE or GLOB operator. For example:
SELECT * FROM Files WHERE Path LIKE @prefix || '%';
But I want to support all valid path characters, so I would need to use ESCAPE for the '_' and '%' symbols. Apparently this prevents SQLite from using an index on Path (see http://www.sqlite.org/optoverview.html#like_opt). I really want to benefit from an index here, and it sounds like that's impossible with either LIKE or GLOB unless I can guarantee that none of their special characters occur in the directory names - and POSIX allows anything other than NUL and '/', including GLOB's '*' and '?' characters.
I'm providing this for context. I'm interested in other approaches to solve the underlying problem, but I'd prefer to accept an answer that directly addresses the ambiguity of strings-that-look-like-numbers in SQLite.
Similar questions
How do I prevent sqlite from evaluating a string as a math expression?
In that question, the values weren't quoted. I get these results even when the values are quoted or passed in as parameters.
EDIT - See my answer below. The column was created with the invalid type "STRING", which SQLite treated as NUMERIC.
*Groan*. The column had NUMERIC affinity because it had accidentally been specified as "STRING" instead of "TEXT". Since SQLite didn't recognize the type name, it gave the column NUMERIC affinity, and because SQLite doesn't enforce column types, everything else worked as expected - except that any time a number-like string is inserted into that column it is converted into a numeric type.
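The affinity difference is easy to demonstrate; a minimal sketch with hypothetical table names:

-- "STRING" is not a recognized type name, so the column gets NUMERIC affinity;
-- "TEXT" gives TEXT affinity
CREATE TABLE BadFiles (Path STRING);
CREATE TABLE GoodFiles (Path TEXT);

INSERT INTO BadFiles VALUES ('0');
INSERT INTO GoodFiles VALUES ('0');

SELECT typeof(Path) FROM BadFiles;   -- integer
SELECT typeof(Path) FROM GoodFiles;  -- text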
After much wrestling with the idea of ranking records, I finally settled on numeric scores for my documents, which I emit to have them sorted by these scores.
Now these numbers have meaning: the first two digits represent a specific type of document.
Therefore, to get documents of type 22 sorted by their scores, I simply query the view with a start key of 220000 and an end key of 229999.
This is all great and works, my problems occur when I try to use url rewrites.
I'm basically trying to reroute:
/_rewrite/rankings/{doctype}
to
/_list/rankings?startkey=xx0000&endkey=xx9999
where xx is the {doctype}
my issue is with specifying the rewrite rule:
[
    { "from": "rankings/:doctype",
      "to": "_list/rankings",
      "query": ??? // what will this be?
    }
]
How can I construct the start and end keys by appending 0000 and 9999 respectively?
How can I specify a numeric value? Using the placeholder ":doctype" will result in a string type rather than a numeric type, causing the query to fail even if I were to modify my pretty URL to take both start and end keys.
I worked around the issue by filtering the results in my list function (ignoring docs I'm not interested in from getRow()). My concern here: should I worry about the efficiency of the list function now?
Feel free to comment also on my sorting strategy... I would be interested to know how others solved their sorting and slicing problems with CouchDB.
Solution
First, you should emit the type and the score separately in an array instead of concatenating them:
emit([doc.type, doc.score], doc);
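In context, the map function would look something like this (a sketch; the doc.type and doc.score field names are assumed from your scoring scheme):

function (doc) {
    // assumes each ranked document carries a type and a numeric score
    if (doc.type !== undefined && doc.score !== undefined) {
        emit([doc.type, doc.score], doc);
    }
}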
Then you can rewrite like this:
[
    {
        "from": "rankings/:doctype",
        "to": "_list/rankings/rankings",
        "query": {
            "startkey": [":doctype", 0],
            "endkey": [":doctype", 9999]
        },
        "formats": {
            "doctype": "int"
        }
    }
]
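With that rule in place, a request such as GET /db/_design/app/_rewrite/rankings/22 (the design document path here is hypothetical) is rewritten to the _list query with startkey=[22,0] and endkey=[22,9999], and the "int" format ensures 22 is passed as a number rather than a string.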
I tested it on CouchDB 1.1.1 and it works.
Reference
The relevant documentation is buried in this issue on JIRA: COUCHDB-1074
As you can see, the issue was resolved in April 2011, so it should work in CouchDB 1.0.3 and above.