How to get all files from a folder and its subfolders? (U-SQL)

Imagine these file paths:
/root/subfolder/filename.csv
/root/subfolder/subfolder2/filename2.csv
Can I extract data from "filename.csv" and "filename2.csv" without explicitly writing their paths?
I want to do something like:
@var = EXTRACT column FROM "/root/{*}.csv" USING Extractors.Csv(skipFirstNRows:1);
Is it possible?

Unfortunately, this feature (a Kleene-star-like recursive folder traversal) is not yet available in filesets, but it is on our long-term roadmap. Please file/upvote this feature request at http://aka.ms/adlfeedback.
The current work-around is to write one wildcard EXTRACT for each directory level you expect to encounter and then UNION ALL them together. E.g.,
@d1 = EXTRACT ... FROM "/fixpath/{*}" USING ... ;
@d2 = EXTRACT ... FROM "/fixpath/{*}/{*}" USING ...;
@d3 = EXTRACT ... FROM "/fixpath/{*}/{*}/{*}" USING ...;
....
@data =
SELECT * FROM @d1 UNION ALL SELECT * FROM @d2 UNION ALL SELECT * FROM @d3 UNION ALL ...;
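Since the per-level statements are completely mechanical, one option is to generate the U-SQL script from code. A minimal sketch in plain Python (the helper name and the `column string` schema are illustrative, not part of any API):

```python
# Hypothetical helper: generate the per-depth EXTRACT + UNION ALL
# work-around script for a given maximum folder depth.
def build_workaround_script(root, max_depth, schema="column string"):
    extracts = []
    for d in range(1, max_depth + 1):
        # one "{*}" per path component; the last one matches the file name
        wildcards = "/".join("{*}" for _ in range(d))
        extracts.append(
            f'@d{d} = EXTRACT {schema} FROM "{root}/{wildcards}.csv" '
            'USING Extractors.Csv(skipFirstNRows:1);'
        )
    union = " UNION ALL ".join(f"SELECT * FROM @d{d}" for d in range(1, max_depth + 1))
    return "\n".join(extracts) + f"\n@data = {union};"

print(build_workaround_script("/root", 3))
```

Pick `max_depth` generously; levels that match no files simply contribute empty rowsets.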

Yes you can! The formatting is almost exactly what you thought it would be. It's done through filesets in U-SQL, and they allow you to search entire directories of folders and also extract information from the path. You define wildcard characters of your choice anywhere in the folder path, and then save that character as a virtual column in your extract statement.
DECLARE @file_set_path string = "/Samples/Data/AmbulanceData/vehicle{vid}_{date:MM}{date:dd}{date:yyyy}.csv";
@data =
    EXTRACT vehicle_id int,
            entry_id long,
            event_date DateTime,
            latitude float,
            longitude float,
            speed int,
            direction string,
            trip_id int?,
            vid int, // virtual file set column
            date DateTime // virtual file set column
    FROM @file_set_path
    USING Extractors.Csv();
Notice the wildcard token {vid} in the path, and how it is exposed in the EXTRACT statement as a new column with the same name (which you can then use to filter your query). The date virtual column is a special feature of filesets that automatically bundles the date parts in file names into a single DateTime object.
Filesets also work in directory paths the same way - you could have a set of subfolders divided by date and version, use "/Samples/{date:yyyy}/{date:MM}/{date:dd}/{type}/RCV_{vid}.csv", and capture the virtual columns exactly as above.
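To make the mapping concrete, here is an illustrative sketch in plain Python (not U-SQL) of how a fileset pattern like the one above conceptually turns pieces of each file name into virtual-column values:

```python
import re
from datetime import datetime

# Sketch only: mimic how "vehicle{vid}_{date:MM}{date:dd}{date:yyyy}.csv"
# maps file-name fragments to virtual columns.
pattern = re.compile(r"vehicle(?P<vid>\d+)_(?P<MM>\d{2})(?P<dd>\d{2})(?P<yyyy>\d{4})\.csv$")

def virtual_columns(path):
    m = pattern.search(path)
    if m is None:
        return None  # file name does not match the fileset pattern
    return {
        "vid": int(m.group("vid")),
        # the three {date:...} fragments are bundled into one DateTime
        "date": datetime(int(m.group("yyyy")), int(m.group("MM")), int(m.group("dd"))),
    }

print(virtual_columns("/Samples/Data/AmbulanceData/vehicle1_09152014.csv"))
```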
Let me know if you have any more questions!

Related

U-SQL How can I get the current filename being processed to add to my extract output?

I need to add metadata about the row being processed. I need the filename to be added as a column. I looked at the ambulance demos in the Git repo, but can't figure out how to implement this.
You use a feature of U-SQL called 'file sets' and 'virtual columns'. In my simple example, I have two files in my input directory; I use file sets and refer to the virtual columns in the EXTRACT statement, e.g.:
// File set with virtual columns
@q =
    EXTRACT rowId int,
            filename string,
            extension string
    FROM "/input/filesets example/{filename}.{extension}"
    USING Extractors.Tsv();
@output =
    SELECT filename,
           extension,
           COUNT( * ) AS records
    FROM @q
    GROUP BY filename,
             extension;
OUTPUT @output TO "/output/output.csv"
USING Outputters.Csv();
Read more about both features here:
https://msdn.microsoft.com/en-us/library/azure/mt621320.aspx

Extract only required files in U-SQL

Is it possible to extract files for only 3 days, without extracting all the files?
DROP VIEW IF EXISTS dbo.Read;
CREATE VIEW IF NOT EXISTS dbo.Read AS
EXTRACT
Statements
FROM
"adl://Test/{date:yyyy}/{date:M}/{date:d}/Testfile.csv"
USING Extractors.Csv(silent : true, quoting : true, nullEscape : "/N");
@res =
SELECT * FROM dbo.Read
WHERE date BETWEEN DateTime.Parse("2015/07/01") AND DateTime.Parse("2015/07/03");
OUTPUT @res
TO "adl://test/Testing/loop.csv"
USING Outputters.Csv();
Partition elimination already ensures for your query that only the files matching the predicate will actually be read (you can confirm that in the job graph).
See also my previous answer for How to implement Loops in U-SQL.
If you have remaining concerns about performance, the job graph can also help you nail down where they originate.
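To see what partition elimination buys you, here is a plain-Python sketch (illustrative only) of the effect: with the pattern "adl://Test/{date:yyyy}/{date:M}/{date:d}/Testfile.csv" and the BETWEEN predicate above, only the paths whose date parts satisfy the predicate are read at all:

```python
from datetime import date, timedelta

# Sketch: enumerate only the file paths whose {date:...} parts fall
# inside the predicate's range ({date:M} and {date:d} are unpadded).
def matching_paths(start, end):
    d, paths = start, []
    while d <= end:
        paths.append(f"adl://Test/{d.year}/{d.month}/{d.day}/Testfile.csv")
        d += timedelta(days=1)
    return paths

print(matching_paths(date(2015, 7, 1), date(2015, 7, 3)))
```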
You can use the pattern identifiers in the fileset specification in parts of the path or even parts of the name (see https://msdn.microsoft.com/en-us/library/azure/mt771650.aspx). You can also list several files, so if you only have one file in each directory you can do:
EXTRACT ...
FROM "adl://Test/2015/07/1/Testfile.csv"
, "adl://Test/2015/07/2/Testfile.csv"
USING ...;
If there is more than one file in each directory you can do individual extracts for each day and then union the result. Something like:
@a = EXTRACT ....
FROM "adl://Test/2015/07/1/{*}.csv"
USING ...;
@b = EXTRACT ....
FROM "adl://Test/2015/07/2/{*}.csv"
USING ...;
@fullset = SELECT * FROM @a UNION SELECT * FROM @b;
Unfortunately, I believe fileset patterns cannot currently be listed, which would allow the above case to be handled in a single EXTRACT statement.

Use string text from file in U-SQL query WHERE clause

I need to load some text from one file and choose specific records from another file, where one of the second file's columns is equal to the text loaded from the first file.
I'm trying something like this, but it doesn't work:
@countryName =
    EXTRACT City string
    FROM "/TestCatalog/test.txt"
    USING Extractors.Text();
@result =
    SELECT CityName,
           Temperature,
           MeasurmentDate
    FROM @readEmployee
    WHERE CityName IN(@countryName);
What is the best way to pass parameters (read from another file in Azure Data Lake) into a WHERE expression?
Variables in U-SQL that are assigned with EXTRACT or SELECT are rowsets, rather than scalar variables. Therefore, use a SEMIJOIN to do this, for example:
@output =
    SELECT re.CityName,
           re.Temperature,
           re.MeasurmentDate
    FROM @readEmployee AS re
    SEMIJOIN @countryName AS c ON re.CityName == c.City;
Alternatively, you could EXTRACT the other file into another rowset and JOIN the two rowsets together.
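The semijoin semantics can be sketched in plain Python (the rowset contents below are made up for illustration): keep each left row that has at least one matching row on the right, without pulling any right-side columns into the output.

```python
# Illustrative data, mirroring the @countryName and @readEmployee rowsets.
country_name = [{"City": "Berlin"}, {"City": "Oslo"}]
read_employee = [
    {"CityName": "Berlin", "Temperature": 21, "MeasurmentDate": "2016-05-01"},
    {"CityName": "Paris",  "Temperature": 19, "MeasurmentDate": "2016-05-01"},
    {"CityName": "Oslo",   "Temperature": 12, "MeasurmentDate": "2016-05-02"},
]

# Semijoin: a left row survives iff its CityName appears on the right.
cities = {row["City"] for row in country_name}
output = [row for row in read_employee if row["CityName"] in cities]
print(output)
```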

SQLITE: select rows where a certain column is contained in a given string

I have a table which has a column named "directory" which contains strings like:
c:\mydir1\mysubdir1\
c:\mydir2
j:\myotherdir
...
I would like to do something like
SELECT FROM mytable WHERE directory is contained within 'c:\mydir2\something\'
This query should give me as a result:
c:\mydir2
OK, I've just found that SQLite has an instr function that seems to work for my purpose.
Not sure about the performance, though.
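A quick check of the instr() approach using Python's stdlib sqlite3 module: instr(haystack, needle) returns the 1-based position of needle in haystack (0 if absent), so instr(?, directory) = 1 keeps exactly the rows whose directory is a prefix of the given path.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE mytable(directory TEXT)")
conn.executemany(
    "INSERT INTO mytable VALUES (?)",
    [("c:\\mydir1\\mysubdir1\\",), ("c:\\mydir2",), ("j:\\myotherdir",)],
)
# Keep rows whose directory is a prefix of the given path.
rows = conn.execute(
    "SELECT directory FROM mytable WHERE instr(?, directory) = 1",
    ("c:\\mydir2\\something\\",),
).fetchall()
print(rows)
```

Note that instr() cannot use an index, so this is a full-table scan; it is fine for small tables but worth measuring on large ones.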

Can the LIKE statement be optimized to not do full table scans?

I want to get a subtree from a table by tree path.
The path column stores strings like:
foo/
foo/bar/
foo/bar/baz/
If I try to select all records that start with a certain path:
EXPLAIN QUERY PLAN SELECT * FROM f WHERE path LIKE "foo/%"
it tells me that the table is scanned, even though the path column is indexed :(
Is there any way I could make LIKE use the index and not scan the table?
I found a way to achieve what I want with closure table, but it's harder to maintain and writes are extremely slow...
To be able to use an index for LIKE in SQLite,
the table column must have TEXT affinity, i.e., have a type of TEXT or VARCHAR or something like that; and
the index must be declared as COLLATE NOCASE (either directly, or because the column has been declared as COLLATE NOCASE):
> CREATE TABLE f(path TEXT);
> CREATE INDEX fi ON f(path COLLATE NOCASE);
> EXPLAIN QUERY PLAN SELECT * FROM f WHERE path LIKE 'foo/%';
0|0|0|SEARCH TABLE f USING COVERING INDEX fi (path>? AND path<?)
The second restriction could be removed with the case_sensitive_like PRAGMA, but this would change the behaviour of LIKE.
Alternatively, one could use a case-sensitive comparison, by replacing LIKE 'foo/%' with GLOB 'foo/*'.
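Both requirements are easy to verify from Python's stdlib sqlite3 module: with a COLLATE NOCASE index on a TEXT column, EXPLAIN QUERY PLAN reports an index SEARCH for the prefix LIKE rather than a full-table SCAN.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE f(path TEXT)")
conn.execute("CREATE INDEX fi ON f(path COLLATE NOCASE)")
# The last column of each plan row is the human-readable detail string.
plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM f WHERE path LIKE 'foo/%'"
).fetchall()
detail = " ".join(row[-1] for row in plan)
print(detail)
```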
LIKE has strict requirements to be optimizable with an index (ref).
If you can relax your requirements a little, you can use lexicographic ordering to get indexed lookups, e.g.
SELECT * FROM f WHERE path >= 'foo/' AND path < 'foo0'
where '0' is the lexicographically next character after '/'.
This is essentially the same optimization the optimizer would do for LIKEs if the requirements for optimization are met.
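A small demonstration of the range form with Python's stdlib sqlite3 module; note that 'foobar/' is correctly excluded, because 'b' sorts after '0':

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE f(path TEXT)")
conn.execute("CREATE INDEX fi ON f(path)")
conn.executemany(
    "INSERT INTO f VALUES (?)",
    [("foo/",), ("foo/bar/",), ("foo/bar/baz/",), ("foobar/",), ("other/",)],
)
# Every string starting with 'foo/' satisfies path >= 'foo/' AND path < 'foo0',
# since '0' (0x30) is the character immediately after '/' (0x2F).
rows = conn.execute(
    "SELECT path FROM f WHERE path >= 'foo/' AND path < 'foo0' ORDER BY path"
).fetchall()
print(rows)
```

This range query uses a plain index directly, with no NOCASE requirement.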
