Does U-SQL support extracting files based on date of creation in ADLS - u-sql

We know U-SQL supports directory and filename pattern matching when extracting files. What I want to know is whether it also supports pattern matching based on a file's creation date in ADLS (without implementing custom extractors).
Say a folder contains files created across several months (the filenames don't include the date): is there a way to pull only the files from a particular month?

The U-SQL EXTRACT operator is not aware of any metadata (such as create date) about a file - only the filename.

You could probably build a solution using the .NET SDK. For something simpler, you could use PowerShell to create a file listing all the files that meet your date criteria, then consume its contents as desired.
# Log in to your Azure account
Login-AzureRmAccount
# Modify variables as required
$DataLakeStoreAccount = "<yourDataLakeStoreAccountNameHere>";
$DataLakeAnalyticsAccount = "<yourDataLakeAnalyticsAccountNameHere>";
$DataLakeStorePath = "/Samples/Data/AmbulanceData/"; #modify as desired
$outputFile = "Samples/Outputs/ReferenceGuide/filteredFiles.csv"; #modify as desired
$filterDate = "2016-11-22";
$jobName = "GetFiles";
# Query directory and build main body of script. Note, there is a csv filter.
[string]$body =
"@initial =
SELECT * FROM
(VALUES
" +
(Get-AzureRmDataLakeStoreChildItem -Account $DataLakeStoreAccount -Path $DataLakeStorePath |
Where {$_.Name -like "*.csv" -and $_.Type -eq "FILE"} | foreach {
"(""" + $DataLakeStorePath + $_.Name + """, (DateTime)FILE.CREATED(""" + $DataLakeStorePath + $_.Name + """)), `r`n" });
# formatting: trim the trailing separator and add column names
$body =
$body.Substring(0,$body.Length-4) + "
) AS T(fileName, createDate);";
# U-SQL query and OUTPUT statement
[string]$output =
"
// filter results based on desired time frame
@filtered =
SELECT fileName
FROM @initial
WHERE createDate.ToString(""yyyy-MM-dd"") == ""$filterDate"";
OUTPUT @filtered
TO ""$outputFile""
USING Outputters.Csv();";
# bring it all together
$script = $body + $output;
#Execute job
$jobInfo = Submit-AzureRmDataLakeAnalyticsJob -Account $DataLakeAnalyticsAccount -Name $jobName -Script $script -DegreeOfParallelism 1
#check job progress
Get-AzureRmDataLakeAnalyticsJob -Account $DataLakeAnalyticsAccount -JobId $jobInfo.JobId -ErrorAction SilentlyContinue;
Write-Host "You now have a list of desired files to check @ " $outputFile

Currently there is no way to access or use file metadata properties. Please add your vote and use case to the following feedback item: https://feedback.azure.com/forums/327234-data-lake/suggestions/10948392-support-functionality-to-handle-file-properties-fr

It's been a while since this question was asked, and I'm not sure whether this is what you were originally looking for, but you can now use the FILE.MODIFIED U-SQL function:
DECLARE @watermark string = "2018-08-16T18:12:03";
SET @@FeaturePreviews="InputFileGrouping:on";
DECLARE @file_set_path string = "adl://adls.azuredatalakestore.net/stage/InputSample.tsv";
@input =
EXTRACT [columnA] int?,
[columnB] string
FROM @file_set_path
USING Extractors.Tsv(skipFirstNRows : 1, silent : true);
@result =
SELECT *, FILE.MODIFIED(@file_set_path) AS FileModifiedDate
FROM @input
WHERE FILE.MODIFIED(@file_set_path) > DateTime.ParseExact(@watermark, "yyyy-MM-ddTHH:mm:ss", NULL);
OUTPUT @result TO "adl://ADLS.azuredatalakestore.net/stage/OutputSample.tsv" USING Outputters.Tsv(outputHeader:true);
The U-SQL built-in function is documented here:
https://msdn.microsoft.com/en-us/azure/data-lake-analytics/u-sql/file-modified-u-sql

Related

Is there a way to retrieve physiological data from acqknowledge and preprocess them in R?

I'm trying to retrieve raw physiological data (hand dynamometer) from AcqKnowledge so that I can preprocess and analyze them in R (or Matlab) in an automated way. Is there a way to do that? I would like to avoid copying and pasting the data manually from AcqKnowledge into Excel in order to read them in R.
I would then like to apply a filter to the data and retrieve the squeezes of interest in R. Is there a way to do that?
Any advice is very welcome, thank you in advance!
I had a similar problem. The key is the load_acq.m function, which you can find here: https://github.com/munoztd0/fusy-octave-memories/blob/main/load_acq.m You can then loop through the files and save them as .csv or any other format that you can load and work with in R.
I have put together a little routine that does this:
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% Create physiological files from AcqKnowledge
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% SETTING THE PATH
path = 'your path here'; %your path here
physiodir = fullfile(path, '/SOURCE/physio');
outdir = fullfile(path, '/DERIVATIVES/physio');
session = 'first'; %your session
base = [path 'sub-**_ses-' session '*'];
subj = dir(base); %check all matching subjects for this session
addpath([path '/functions']) % load AcqKnowledge functions
%check here https://github.com/munoztd0/fusy-octave-memories/blob/main/load_acq.m
for i=1:length(subj)
cd (physiodir) %go to the folder
subjO = char(subj(i).name);
subjX = subjO(1:end-4); %remove .mat extension
filename = [subjX '.acq']; %add .acq extension
disp (['****** PARTICIPANT: ' subjX ' **** session ' session ' ****' ]);
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% OPEN FILE
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
physio = load_acq(filename); %load and transform the AcqKnowledge file
data = physio.data; %or whatever the name is
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% CREATE AND SAVE THE CHANNEL
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
cd (outdir) %go to the folder
% save the data as a text file in the participant directory
fid = fopen([subjX '.txt'],'wt');
for ii = 1:length(data)
fprintf(fid,'%g\t',data(ii));
fprintf(fid,'\n');
end
fclose(fid);
end
Hope it helps !

ConvertFrom-StringData values stored in variable

I'm new to PowerShell and I'm putting together a script that populates all its variables from data stored in an Excel file, basically to create numerous VMs.
This works fine except where I have a variable with multiple name/value pairs, which PowerShell needs as a hashtable.
As each VM will need multiple tags applied, I have a column in Excel called Tags.
The data in the field looks something like: "Top = Red `n Bottom = Blue".
However, I'm struggling to use ConvertFrom-StringData to create the hashtable for these tags.
If I run:
ConvertFrom-StringData $exceldata.Tags
I end up with something like:
Name Value
---- -----
Top Red `n bottom = blue
I need help formatting the Excel field so that ConvertFrom-StringData creates the hashtable properly, or a better way of achieving this.
Thanks.
Sorted it. I formatted the Excel field as: Dept=it;env=prod;owner=Me
Then I ran the following commands; no ConvertFrom-StringData required.
$artifacts = Import-Excel -Path "C:\temp\Artifacts.xlsx" -WorkSheetname "VM"
foreach ($artifact in $artifacts) {
$inputkeyvalues = $artifact.Tags
# Create hashtable
$tags = @{}
# Split input string into pairs
$inputkeyvalues.Split(';') |ForEach-Object {
# Split each pair into key and value
$key,$value = $_.Split('=')
# Populate $tags
$tags[$key] = $value
}
}
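The same splitting logic can be sketched outside PowerShell as well. The Python version below is only an illustration: the function name `parse_tags` and its separator parameters are made up for this example. It splits each pair on the first `=` only, so a value that itself contains `=` is kept intact, and it skips empty pairs such as the one left by a trailing `;`.

```python
def parse_tags(raw: str, pair_sep: str = ";", kv_sep: str = "=") -> dict[str, str]:
    """Turn 'Dept=it;env=prod' into {'Dept': 'it', 'env': 'prod'}."""
    tags = {}
    for pair in raw.split(pair_sep):
        pair = pair.strip()
        if not pair:
            continue  # tolerate a trailing separator or an empty cell
        key, _, value = pair.partition(kv_sep)  # split on the first '=' only
        tags[key.strip()] = value.strip()
    return tags


print(parse_tags("Dept=it;env=prod;owner=Me"))
```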

How to get the variable's name from a file using source command in UNIX?

I have a file named param1.txt which contains certain variables, and another file, source1.txt, which contains placeholders. I want to replace the placeholders with the values of the variables that I get from the parameter file.
So far I have hard-coded the script, with the variable names in the parameter file known beforehand. I'd like a dynamic solution where the variable names are not known beforehand. In other words, is there any way to find out the variable names in a file using the source command in UNIX?
Here is my script and the files.
Script:
#!/bin/bash
source /root/parameters/param1.txt
sed "s/{DB_NAME}/$DB_NAME/gI;
s/{PLANT_NAME}/$PLANT_NAME/gI" \
/root/sources/source1.txt > /root/parameters/Output.txt
param1.txt:
PLANT_NAME=abc
DB_NAME=gef
source1.txt:
kdashkdhkasdkj {PLANT_NAME}
jhdbjhasdjdhas kashdkahdk asdkhakdshk
hfkahfkajdfk ljsadjalsdj {PLANT_NAME}
{DB_NAME}
I cannot comment since I don't have enough points.
But is it correct that this is what you're looking for:
How to reference a file for variables using Bash?
Your problem statement isn't very clear to me. Perhaps you can simplify your problem and desired state.
I don't understand why you are trying to source param1.txt.
You can try this awk script instead:
awk '
NR == FNR {
a[$1] = $2
next
}
{
for ( i = 1 ; i <= NF ; i++ ) {
b = $i
gsub ( "^{|}$" , "" , b )
if ( b in a )
sub ( "{" b "}" , a[b] , $i )
}
} 1' FS='=' param1.txt FS=" " source1.txt
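For comparison, here is a minimal Python sketch of the same dynamic approach: read whatever NAME=value pairs the parameter file contains, then replace every `{NAME}` placeholder whose name was found. The function names `load_params` and `fill_placeholders` are hypothetical, the sketch assumes placeholder names consist of letters, digits and underscores, and unlike the `I` flag in the sed command from the question it matches names case-sensitively.

```python
import re


def load_params(text: str) -> dict[str, str]:
    """Parse NAME=value lines; blank lines and '#' comments are skipped."""
    params = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        name, _, value = line.partition("=")
        params[name.strip()] = value.strip()
    return params


def fill_placeholders(template: str, params: dict[str, str]) -> str:
    """Replace each {NAME} that is a known parameter; leave the rest untouched."""
    def substitute(match: re.Match) -> str:
        return params.get(match.group(1), match.group(0))
    return re.sub(r"\{(\w+)\}", substitute, template)


params = load_params("PLANT_NAME=abc\nDB_NAME=gef\n")
print(fill_placeholders("kdashkdhkasdkj {PLANT_NAME}\n{DB_NAME}\n", params))
```

In practice the two string literals would be replaced by the contents of param1.txt and source1.txt.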

SQLite: isolating the file extension from a path

I need to isolate the file extension from a path in SQLite. I've read the post here (SQLite: How to select part of string?), which gets 99% there.
However, the solution:
select distinct replace(column_name, rtrim(column_name, replace(column_name, '.', '' ) ), '') from table_name;
fails if a file has no extension (i.e. no '.' in the filename), for which it should return an empty string. Is there any way to trap this please?
Note that the filename in this context is the part after the final '\'; the query shouldn't search for '.' characters in the full path, as it currently does.
I think it should be possible to do this using further nested rtrims and replaces.
Thanks.
Yes, you can do it like this:
1) create a scalar function called "extension" in QtScript in SQLiteStudio
2) The code is as follows:
if ( arguments[0].substring(arguments[0].lastIndexOf('\u005C')).lastIndexOf('.') == -1 )
{
return ("");
}
else
{
return arguments[0].substring(arguments[0].lastIndexOf('.'));
}
3) Then, in the SQL query editor you can use
select distinct extension(PATH) from DATA
... to itemise the distinct file extensions from the column called PATH in the table called DATA.
Note that the PATH field must contain a backslash ('\') in this implementation - i.e. it must be a full path.
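For readers not using SQLiteStudio, the same idea of registering a scalar function can be sketched with Python's standard sqlite3 module and its `create_function` method. The table and column names DATA and PATH follow the answer above; the sample rows are made up for illustration. This version looks for the dot only in the part after the final backslash, and it also handles values that contain no backslash at all.

```python
import sqlite3


def extension(path):
    """Return the extension (including the dot) of the last path component, or ''."""
    if path is None:
        return ""
    filename = path.rsplit("\\", 1)[-1]  # text after the final backslash
    dot = filename.rfind(".")
    return filename[dot:] if dot != -1 else ""


conn = sqlite3.connect(":memory:")
conn.create_function("extension", 1, extension)  # expose it to SQL as extension(x)
conn.execute("CREATE TABLE DATA (PATH TEXT)")
conn.executemany(
    "INSERT INTO DATA VALUES (?)",
    [
        ("C:\\docs\\report.txt",),
        ("C:\\my.folder\\README",),  # dot in a folder name, no file extension
        ("C:\\pics\\photo.JPG",),
    ],
)
rows = conn.execute("SELECT DISTINCT extension(PATH) FROM DATA ORDER BY 1").fetchall()
print(rows)
```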

SQL Loader Error - SQL*Loader-503: Error appending extension to file ()

I am trying to load a bunch of Java files into a staging table using SQL Loader. I keep getting the subject line error and I'm not sure why.
My executable file that I am trying to run looks like this.
for i in `find <files in certain directory.java>`
do
echo "File name = ${i}"
COMMAND='sed'; ARG='s/XXXX/${i}/'
echo $COMMAND '; ' $ARG
cat test_load.ctl | $COMMAND "$ARG" > test_load_2.ctl
sqlldr <user>/<password> control=test_load_2.ctl log=<file_name>.log
done
I'm trying to replace the XXXX in the INFILE clause with each of the files found by the loop above.
My test_load.ctl file looks like this:
LOAD DATA
INFILE 'XXXX'
BADFILE '/<directory>/<filename>.bad'
DISCARDFILE '/<directory>/<filename>.dsc'
APPEND INTO TABLE "<SCHEMA>"."<TABLE_NAME>"
TRAILING NULLCOLS
(
id sequence (1, 1),
raw_string position (1:4000) char(4000),
file_name,
dir_path,
load_date sysdate,
line_number sequence (1, 1)
)
My test_load_2.ctl file looks like this:
LOAD DATA
INFILE '${i}'
BADFILE '/<directory>/<filename>.bad'
DISCARDFILE '/<directory>/<filename>.dsc'
APPEND INTO TABLE "<SCHEMA>"."<TABLE_NAME>"
TRAILING NULLCOLS
(
id sequence (1, 1),
raw_string position (1:4000) char(4000),
file_name,
dir_path,
load_date sysdate,
line_number sequence (1, 1)
)
I keep getting this error:
SQL*Loader-503: Error appending extension to file ()
Additional information: 7217
I'm pretty sure there is an issue with the INFILE parameter in the test_load_2.ctl file, but I am not 100% certain how to fix this?
Also I'm probably doing something wrong in the executable file as well.
Any suggestions?
Thank you!
No variable expansion happens in the control file. Specify the input file on the command line instead, using the DATA argument.
https://docs.oracle.com/cd/E11882_01/server.112/e22490/ldr_params.htm#SUTIL004
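As a rough sketch of that suggestion, the snippet below builds one sqlldr invocation per data file, passing the file through the `data=` command-line parameter so that a single, unmodified control file can be reused. The helper names `sqlldr_command` and `load_all`, the example path, and the choice of log file name are assumptions for illustration; `<user>/<password>` is the placeholder from the question. Nothing is executed unless `run=True` is passed.

```python
import subprocess
from pathlib import Path


def sqlldr_command(data_file, userid, control="test_load.ctl"):
    """Build the sqlldr argument list for one data file."""
    log = Path(data_file).with_suffix(".log").name  # e.g. Foo.java -> Foo.log
    return [
        "sqlldr",
        f"userid={userid}",
        f"control={control}",
        f"data={data_file}",  # input file given on the command line
        f"log={log}",
    ]


def load_all(directory, userid, run=False):
    """Return the commands for every .java file under directory; run them if asked."""
    commands = [sqlldr_command(p, userid) for p in sorted(Path(directory).rglob("*.java"))]
    if run:
        for cmd in commands:
            subprocess.run(cmd, check=True)  # stop at the first failing load
    return commands


# Show the command that would be run for one hypothetical file.
print(" ".join(sqlldr_command("/path/to/Example.java", "<user>/<password>")))
```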
