DataStage sequence job - how to process one file at a time when the files are in 7 different folders - unix

DataStage - There are 7 folders in a path, and in each folder there are 2 files. For example, the 2 files are in the following format: filename = test_s1_YYYYMMDD.txt, test_s1_YYYYMMDD.done. The paths for these files are:
user/test/test_s1/
user/test/test_s2/
...
user/test/test_s7/
Here s1, s2, ..., s7 represent the different folders.
In these folders the 2 files mentioned above are present, so how can I process each file in a sequence job?

First you need a job that processes a single file, with the filename as a parameter of that job.
At the Sequence level you need two levels of looping: an inner one for the two files within each folder and an outer one for the different directories.
For the inner one you can either build a loop with two iterations or simply add the processing job twice to the sequence (which reduces complexity if there will always be exactly two files).
The outer Sequence is a loop where you parameterize the path so that the loop counter generates the variable 1-7 part of it.
Check the DataStage documentation for more details on sequence loops.
You can use the loop counter (stage_label.$Counter) to parameterize your job.
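For example (a sketch only; Loop_Folders stands for whatever you name your StartLoop stage and pPath for the job's path parameter), the value expression passed to pPath in the Job Activity could be built from the counter like this:
"user/test/test_s" : Loop_Folders.$Counter : "/"
The inner level then only has to supply the file name inside that folder.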

Depending on what you want to do with the files, how you process them is an important decision. Starting a job (or several) in a sequence for each file can lead to heavy overhead just for starting the jobs. Try loading all files at once in a parallel job using the Sequential File stage.
In the Sequential File stage, set the appropriate Format. You can also set everything to none to just put each row into one column and process that in a later job; this makes the reading very flexible and forgiving. If your files all have the same structure, define your columns as needed.
To select the files, use File Patterns. In the Options of the Sequential File stage, add a File Name Column so you can process the filenames in a later job. You might also want to add a Row Number Column.
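For example (a sketch based on the paths in the question), a file pattern such as
user/test/test_s?/test_s?_*.txt
would pick up the .txt data files from all seven folders in one read, and the File Name Column then tells you which folder and date each row came from.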
This method works pretty fast.

Related

scp_download to download multiple files based on a pattern?

I need to download many files from a server (specifically Tectia), ideally using the ssh package. These files all follow a predictable pattern across multiple subfolders. The filepath is formatted like this:
/directory/subfolder/A001/abcde001.csv
Where A001 counts up alongside the last 3 digits of the filename (/A002/abcde002.csv and so on)
In the vignette for scp_download it states that the files parameter may contain wildcards so I have tried to do something like
scp_download(session, "/directory/subfolder/A.*/abcde.*[.]csv", to=tempdir())
and
scp_download(session, "directory/subfolder/A\\d{3}/abcde\\d{3}[.]csv", to=tempdir())
but no matter which combination of patterns or wildcards I try (and I can't think of many), I only get something like
Warning: SSH warning: scp: /directory/subfolder/A\d{3}/abcde\d{3}[.]csv: No such file or directory
What I'm hoping to do is either find a way to do pattern matching here, or find a way to store the Tectia directory listing as a string to be read by scp_download. I've made sure that my session is connected properly, and the download works when I don't attempt any pattern matching.
I had the same problem. The problem is that when you use * in your pattern, it gets escaped when it is sent to the server. However, when you request a specific file name like /directory/subfolder/A001/abcde001.csv, it works fine.
Finally I changed my code based on the steps below:
1. Get the list of files/folders using the ls command with the ssh_exec_wait function and store them in a variable.
2. Download the files in that variable separately.
library(ssh)

session <- ssh_connect("username@ip", passwd = "password")
# capture the ls output as a character vector of remote file paths
files <- capture.output(ssh_exec_wait(session, command = 'ls /directory/subfolder/A001/*'))
dnc1 <- scp_download(session, files[1], to = paste0(getwd(), "/data/"))
dnc2 <- scp_download(session, files[2], to = paste0(getwd(), "/data/"))
dnc3 <- scp_download(session, files[3], to = paste0(getwd(), "/data/"))
The last three commands can be done in a loop, since there could be hundreds or thousands of files.
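A minimal sketch of that loop (assuming files still holds the remote paths captured above):
for (f in files) {
  # download each listed file into ./data/; error handling omitted
  scp_download(session, f, to = paste0(getwd(), "/data/"))
}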

What could be wrong with my premake5 script that it takes so long to build a solution

I am using premake5 version 0.06 to generate a vs2012 project that contains 3000+ files in a directory tree that goes about 2 levels deep.
The project contains 6 configurations and 3 platforms.
It takes approximately 2 minutes to bake the configurations and then about 10 seconds to process the action and write out the solution and project files.
I am wondering if this is the expected time for this number of files or whether I can optimise my premake scripts to improve the bake times?
I make use of a number of overrides, and I include my files using wildcards:
files {
    path.join(includeDir, "**.h"),
    path.join(includeDir, "**.inl"),
    path.join(srcDir, "**.h"),
    path.join(srcDir, "**.inl"),
    path.join(srcDir, "**.c"),
    path.join(srcDir, "**.cpp"),
}
Is it better to put all options under one filter?
For convenience of setup, I have options set up by different functions, so I effectively list the same filter multiple times for different options, e.g.
setupOption1 = function(args)
    filter( "platforms:win" )
    -- set up option1
end

setupOption2 = function(args)
    filter( "platforms:win" )
    -- set up option2
end

-- within the project
project("myProject")
    -- global setup
    language "C++"
    kind "WindowedApp"
    -- individual options
    setupOption1(args)
    setupOption2(args)
That does sound a little long, but as this is still an alpha build, performance isn't being monitored closely right now. There is an open pull request to reduce memory usage that might help.
In general, fewer filters should help, but I would be surprised if it made a dramatic difference (unless you really have a lot of them).
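As a rough sketch of what fewer filter calls could look like (assuming the per-option settings can be grouped; the names come from the question), the platform-specific settings could sit under a single filter scope instead of being re-filtered inside each setup function:
project("myProject")
    language "C++"
    kind "WindowedApp"

    -- one filter scope per platform instead of one call per option
    filter("platforms:win")
        -- option1 settings
        -- option2 settings

    -- reset the filter so anything that follows applies everywhere
    filter({})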
I found that using ** wildcards in a files filter slows the build right down.
filter {"files:**_win.cpp", "platforms:not win"}
flags "ExcludeFromBuild"
filter {"files:**_xone.cpp", "platforms:not xone"}
flags "ExcludeFromBuild"
filter {"files:**_ps4.cpp", "platforms:not ps4" }
flags "ExcludeFromBuild"
If I comment out these filters, the configuration bake drops to about 30 seconds.

Process many EDI files through single MFX

I've created a mapping in MapForce 2013 and exported the MFX file. Now, I need to be able to run the mapping using MapForce Server. The problem is, I need to specify both the input EDI file and the output file. As far as I can tell, the usage pattern is to run the mapping with MapForce server using the input/output configuration in the MFX itself, not passed in on the command line.
I suppose I could change the input/output to some standard file name and then just write the input file to that path before performing the mapping, and then grab the output from the standard output file path when the mapping is complete.
But I'd prefer to be able to do something like:
MapForceServer run -in=MyInputFile.txt -out=MyOutputFile.xml MyMapping.mfx > MyLogFile.txt
Is something like this possible? Perhaps using parameters within the mapping?
There are two options that I've come across in dealing with a similar situation.
Option 1 - If you set the input file in the component settings to a wildcard such as *.xml, mapforceserver.exe will automatically process every matching file in the directory; the example assumes an XML source, but this should work just the same for text. Similar to the example below, you can set up a cleanup routine to move the files into another folder after processing.
Note: it looks in the folder where the schema file is located.
Option 2 - Since your output is XML, you can use Altova's RaptorXML (rack up another license charge). Generate XSLT 2.0 code from the mapping and use a batch file to execute it automatically, something like this:
@echo off
for %%f IN (*.xml) DO (
    RaptorXML xslt --xslt-version=2 --input="%%f" --output="out/%%f" %* "mymapping.xslt"
    if NOT errorlevel 1 move "%%f" processed
    if errorlevel 1 move "%%f" error
)
sleep 15
rem re-invoke this batch so it rechecks for new files
mymapping.bat
I tossed in a sleep command to loop the batch for rechecking every 15 seconds. Unfortunately this does not work if your output target is a database.

How to `diff` files to create a "common" file?

I have a slew of CSS files to go through where someone just grunted through making alterations to various core stylesheets on a number of subsites. Obviously if the original developer had had some foresight they would have just included a master stylesheet and overridden the necessary elements…
I first started off with comm thinking that it might do the trick, but quickly found that it needed to receive a sorted input file.
I then switched over to diff and have gotten down to the following through some reading and research:
diff --unchanged-group-format="## %dn,%df%c'\012'%<" --old-group-format='' --new-group-format='' --changed-group-format='' file_1.css file_2.css
The above obviously is almost there, but:
A) I need to grep out the ## lines (which should be fine, right? At first glance this appears right, but does diff throw in any other unexpected lines that need to be yanked?), and
B) I need to create two more files: the first with the leftover unique lines from file_1.css, and the second with the leftover unique lines from file_2.css.
Obviously the first "in common" file will go into an include folder and then be included into the two latter created files as an @import url("common.css");
I am thinking that the following simple alteration will create the latter two files to which I'm referring:
diff --unchanged-group-format='' --old-group-format="## %dn,%df%c'\012'%<" --new-group-format='' --changed-group-format='' file_1.css file_2.css
diff --unchanged-group-format='' --old-group-format='' --new-group-format="## %dn,%df%c'\012'%<" file_1.css file_2.css
Sample files:
file 1: https://gist.github.com/c13843972c47b5037704
file 2: https://gist.github.com/fff39eae386e8969dc10
So for example, upon executing a test of the following:
diff --unchanged-group-format="## %dn,%df%c'\012'%<" --old-group-format='' --new-group-format='' --changed-group-format='' file_1.css file_2.css | egrep -v "^##\d*" > common.css
diff --unchanged-group-format='' --old-group-format="## %dn,%df%c'\012'%<" --new-group-format='' --changed-group-format='' file_1.css file_2.css | egrep -v "^##\d*" > old.css
And then searching for body with egrep "^body" *.css, it yielded only one body rule in common.css and none in old.css, whereas file_1.css and file_2.css each contain a different body entry. So obviously this methodology is flawed.
How would one about creating these two files that would ultimately become the common include and the override files?
@ylluminate, you have a couple of options:
1. Use BeyondCompare to visually verify the differences. It does a fantastic job comparing similar files, and it allows saving common lines / left-only lines / right-only lines. The only downside is that it is interactive, and if you have a lot of files it will take some time. On the positive side, it looks like you want to build trust first by testing it out a few times.
2. Add formatting text for --changed-group-format as well, and capture the modified code (plus the old code, as your command does now). You need to run one more comparison to get what is in the new code but not in the old code (a rough command sketch appears at the end of this answer). The downside here is that validation is going to be hard.
3. Saving all the lines in a database table and comparing columns is another option. Take care to store old and new line numbers. The downsides are that the data lines need to be unique and blank lines will be dropped.
I would go with option 1 if I had fewer than 50 files.
Hope this helps.
PS: I am not associated with BeyondCompare in any way, just a happy user of the software.
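Expanding on option 2, here is a rough sketch of the three comparisons (untested against the sample gists; the output file names are just placeholders). The non-empty --changed-group-format is what keeps lines like the two differing body rules out of common.css and in the per-file leftovers, and the bare %= / %< / %> specifiers avoid the ## header lines entirely, so no grep is needed:
diff --unchanged-group-format='%=' --old-group-format='' --new-group-format='' --changed-group-format='' file_1.css file_2.css > common.css
diff --unchanged-group-format='' --old-group-format='%<' --new-group-format='' --changed-group-format='%<' file_1.css file_2.css > file_1.only.css
diff --unchanged-group-format='' --old-group-format='' --new-group-format='%>' --changed-group-format='%>' file_1.css file_2.css > file_2.only.css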

unix: can I write to the same file in parallel without missing entries?

I wrote a script that executes commands in parallel and lets them all write an entry to the same log file. It does not matter if the order is wrong or entries are interleaved, but I noticed that some entries are missing. I should probably lock the file before writing; however, is it true that if multiple processes try to write to a file simultaneously, it will result in missing entries?
Yes, if different processes independently open and write to the same file, it may result in overlapping writes and missing data. This happens because each process gets its own file offset, which advances only with its own writes.
Instead of locking, a better option might be to open the log file once in an ancestor of all worker processes, let it be inherited across fork(), and have the workers use it for logging. There will then be a single shared file offset that advances whenever any of the processes writes a new entry.
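A minimal C sketch of that idea (file name and worker count are just placeholders): the parent opens the log once, forks the workers, and every worker writes through the same inherited open file description, so they all share one offset.
#include <stdio.h>
#include <unistd.h>
#include <sys/wait.h>

int main(void) {
    /* open the log once in the parent; forked children inherit the same
       open file description, so all of them share one file offset */
    FILE *log = fopen("shared.log", "w");
    if (log == NULL)
        return 1;

    for (int i = 0; i < 4; i++) {
        if (fork() == 0) {                        /* child worker */
            fprintf(log, "worker %d: done\n", i); /* one complete line */
            fflush(log);                          /* push it out as one write */
            _exit(0);
        }
    }
    while (wait(NULL) > 0)    /* wait for all workers to finish */
        ;
    fclose(log);
    return 0;
}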
In a shell script you should use ">> file" (double greater-than) to append output to that file; the interpreter opens the destination in append mode. If your program also wants to append, follow the directives below:
Open the text file in append mode ("a+") and give preference to printing only full lines (don't do multiple 'print' calls followed by a final 'println'; print the entire line with a single 'println').
The fopen documentation states this:
DESCRIPTION
The fopen() function opens the file whose pathname is the string pointed to by filename, and associates a stream with it.
The argument mode points to a string beginning with one of the following sequences:
    r or rb            Open file for reading.
    w or wb            Truncate to zero length or create file for writing.
    a or ab            Append; open or create file for writing at end-of-file.
    r+ or rb+ or r+b   Open file for update (reading and writing).
    w+ or wb+ or w+b   Truncate to zero length or create file for update.
    a+ or ab+ or a+b   Append; open or create file for update, writing at end-of-file.
The character b has no effect, but is allowed for ISO C standard conformance (see standards(5)). Opening a file with read mode (r as the first character in the mode argument) fails if the file does not exist or cannot be read.
Opening a file with append mode (a as the first character in the mode argument) causes all subsequent writes to the file to be forced to the then current end-of-file, regardless of intervening calls to fseek(3C). If two separate processes open the same file for append, each process may write freely to the file without fear of destroying output being written by the other. The output from the two processes will be intermixed in the file in the order in which it is written.
It is because of this intermixing that you want to give preference to using only 'println' (or its equivalent).
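A minimal C sketch of that advice (file name and message are just placeholders): open the log in append mode and emit each entry as one complete line in a single call.
#include <stdio.h>

int main(void) {
    /* append mode: every write is forced to the current end-of-file */
    FILE *log = fopen("shared.log", "a");
    if (log == NULL)
        return 1;

    /* write the whole entry, newline included, in a single call */
    fprintf(log, "worker 42: step finished\n");
    fflush(log);   /* flush so the line leaves the buffer as one write */

    fclose(log);
    return 0;
}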
