Building and processing a compile graph on a set of watched source files

I think this might be quite a common problem, but somehow it is hard to find suitable answers.
Background
I'm currently investigating how to speed up the node variant of Patternlab. It is basically a system where you create HTML template files containing some fundamental elements (called atoms) like headings, paragraphs, etc. and compose them into something bigger (molecules), then compose multiple molecules until you have a full web page.
The template files are processed by a template engine which is able to reference and include other template files.
Patternlab would then read all files and build a graph of each file's predecessors and successors.
The problem is that all files in the template directory (or directories) are read and processed, the compiled output from the template engine is written somewhere, and that's it. This takes about one minute for 10-20 MB of HTML templates (think e.g. of Mustache). It would be much nicer to have differential updates.
General problem
Given a directed acyclic graph of files (nodes), with an edge f1 → f2 whenever f1 includes f2 in its contents: when f2 changes, f1 needs to be recompiled as well (i.e. the compilation order should be f2, f1). How can I efficiently find the set of all files that need recompilation, and what are the initial and target nodes, when there might be a path containing multiple changed files and nodes depending on multiple paths?
Given two parallel paths (e.g. f1 → f2 → f5 and f1 → f4 → f5), how can I find the best set of paths for parallel compilation (e.g. judged by total path length) with a minimum length l?
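To make the first question concrete, here is a minimal sketch of the propagation I have in mind: reverse reachability from the changed files, then a topological sort restricted to that subset. The names (Graph, includes, includedBy) are mine for illustration, not Patternlab's actual structures.

type FileId = string;

interface Graph {
  includes: Map<FileId, FileId[]>;   // f1 -> files that f1 includes
  includedBy: Map<FileId, FileId[]>; // f2 -> files that include f2
}

function recompilationSet(g: Graph, changed: FileId[]): FileId[] {
  // Reverse BFS: collect every file that transitively includes a changed file.
  const dirty = new Set<FileId>(changed);
  const queue = [...changed];
  while (queue.length > 0) {
    const f = queue.shift()!;
    for (const parent of g.includedBy.get(f) ?? []) {
      if (!dirty.has(parent)) {
        dirty.add(parent);
        queue.push(parent);
      }
    }
  }
  // Kahn-style topological sort restricted to the dirty set: a file becomes
  // ready once all of its dirty includes have been compiled.
  const indegree = new Map<FileId, number>();
  for (const f of dirty) {
    indegree.set(f, (g.includes.get(f) ?? []).filter(d => dirty.has(d)).length);
  }
  const ready = [...dirty].filter(f => indegree.get(f) === 0);
  const order: FileId[] = [];
  while (ready.length > 0) {
    const f = ready.pop()!;
    order.push(f);
    for (const parent of g.includedBy.get(f) ?? []) {
      if (!dirty.has(parent)) continue;
      const left = indegree.get(parent)! - 1;
      indegree.set(parent, left);
      if (left === 0) ready.push(parent);
    }
  }
  return order; // compile in this order; the initial nodes have indegree 0
}

Everything that sits in ready at the same moment has no unfinished dirty includes, so those files can be compiled in parallel; that seems to sidestep the second question without enumerating paths at all.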
Considerations
So how could this be done? We simply have to watch the template directories for the events create, modify and delete. This could be done by collecting file system events from the kernel.
Remember that for any file we can also receive the "last modified" timestamp (mtime).
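For example, collecting those events with Node's built-in watcher could look roughly like this (fs.watch's recursive option is platform-dependent, so a library like chokidar is the usual portable choice; the directory path is illustrative):

import * as fs from "fs";
import * as path from "path";

const templateDir = "source/_patterns"; // illustrative path
const changedFiles = new Set<string>();
const deletedFiles = new Set<string>();

fs.watch(templateDir, { recursive: true }, (event, filename) => {
  if (!filename) return;
  const file = path.join(templateDir, filename.toString());
  if (fs.existsSync(file)) {
    changedFiles.add(file); // created or modified
  } else {
    deletedFiles.add(file); // a 'rename' event with no file left behind
  }
});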
The first step, building an initial graph, is already done by Patternlab. Each node of the graph consists of:
the source file path and mtime
the target file path and mtime
a set of parameters for the file (i.e. username, page title, etc.)
a changed flag that indicates whether the file was modified, determined by inspecting the mtimes (see below). This always reflects the current file system state (much simpler).
i.e. there is a "shadow" graph that always describes the most recent compile output.
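As an illustration, such a node could be shaped like this (the field names are my assumption, not Patternlab's actual data structure):

interface TemplateNode {
  sourcePath: string;
  sourceMtime: number; // ms since epoch, from fs.stat
  targetPath: string;
  targetMtime: number;
  parameters: Record<string, unknown>; // username, page title, ...
  changed: boolean; // derived from the mtime comparison below
}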
This leads to these cases:
source.mtime <= target.mtime: The source was compiled and not modified afterwards. If the source file is changed after building the graph, the file watcher will set the changed flag.
source.mtime > target.mtime: The source was changed before building the graph and must be recompiled.
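A minimal sketch of that rule, with the extra assumption that a missing target simply counts as changed:

import * as fs from "fs";

function isChanged(sourcePath: string, targetPath: string): boolean {
  if (!fs.existsSync(targetPath)) return true; // never compiled yet
  // source.mtime > target.mtime means the source must be recompiled
  return fs.statSync(sourcePath).mtimeMs > fs.statSync(targetPath).mtimeMs;
}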
Now the idea is to recompile each node if the changed flag is set and return the contents of the target file otherwise.
When a file is compiled, all files referencing it (i.e. the source nodes of its incoming edges) must be recompiled as well.
Note that once the source has been compiled to the target, the flag changes to false so the file does not get compiled twice. Returning the target content when the node is visited again is okay, though.
There are multiple root nodes (only outgoing edges, i.e. pages), some "leaves" (only incoming edges, i.e. atoms), and also nodes with no edges at all.
Assume that Patternlab cannot run in parallel, i.e. there is a "busy" flag blocking concurrent overall runs.
However, it might be beneficial to use all CPU cores to compile templates faster, and to employ content caching for templates that were not modified.
If a file f1 includes two other files f2 and f3, and both f2 and f3 have files that need recompilation among their successors, f1 must wait for all of those paths to be recompiled first (see the sketch after the example below).
For instance (c = changed):
f5 -> f6 (c)
 ^
 |
f1 -> f2 -> f3 -> f4 (c)
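Here is a sketch of how that waiting rule could be enforced while still keeping all CPU cores busy: the indegree bookkeeping from the earlier sketch makes f1 ready only after f6, f5, f4, f3 and f2 are done. It reuses the Graph and FileId names from above, and compile() is an illustrative stand-in for the template engine call.

async function compileDirty(
  g: Graph,
  dirty: Set<FileId>,
  compile: (f: FileId) => Promise<void>,
  workers = 4,
): Promise<void> {
  const indegree = new Map<FileId, number>();
  for (const f of dirty) {
    indegree.set(f, (g.includes.get(f) ?? []).filter(d => dirty.has(d)).length);
  }
  const ready = [...dirty].filter(f => indegree.get(f) === 0);
  let running = 0;

  await new Promise<void>((resolve, reject) => {
    const pump = () => {
      if (ready.length === 0 && running === 0) return resolve();
      while (running < workers && ready.length > 0) {
        const f = ready.pop()!;
        running++;
        compile(f).then(() => {
          running--;
          // A parent only becomes ready once ALL of its dirty includes are done.
          for (const parent of g.includedBy.get(f) ?? []) {
            if (!dirty.has(parent)) continue;
            const left = indegree.get(parent)! - 1;
            indegree.set(parent, left);
            if (left === 0) ready.push(parent);
          }
          pump();
        }, reject);
      }
    };
    pump();
  });
}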

Related

DataStage Sequence job - how to process each file at a time if those files are in 7 different folders

DataStage - There are 7 folders in a path and in each folder there are 2 files. For example, the 2 files are in the following format: filename = test_s1_YYYYMMDD.txt, test_s1_YYYYMMDD.done. The paths for these files are:
user/test/test_s1/
user/test/test_s2/
...
user/test/test_s7/
Here s1, s2, ..., s7 represent the different folders. In these folders the 2 above-mentioned files are present, so how can I process each file in a sequence job?
First you need a job to process a file and the filename needs to be a parameter of that job.
For the Sequence level you need two levels - an inner one for the two files within each folder and an outer one for the different directories.
For the inner one you can choose to build a loop with two iterations or simply add the processing job twice to the sequence (which reduces complexity in case it will always be two files).
The outer Sequence is a loop where you could parameterize the path in such a way that the loop counter can be used to generate your flexible 1-7 path suffix.
Check out more details on loops here
You can use the loop counter (stage_label.$Counter) to parameterize your job.
Depending on what you want to do with the files, how you process them is an important decision. Starting a job (or more) in a sequence for each file can lead to heavy overhead just from starting the jobs. Try loading all files at once in a parallel job using the Sequential File stage.
In the Sequential File Stage, set the appropriate Format. You can also set everything to none to just put each row into one column and process that in a later job. This will make the reading very flexible and forgiving. If your files are all the same structure, define your columns as needed.
To select the files, use File Patterns. In the Options of the Sequential File Stage, choose to have a File Name Column so you can process the filenames in a later job. You might also want to add a Row Number Column.
This method works pretty fast.

Scons positively refuses to build into a variant_dir

I have been porting a project that used Make to SCons. Generally, I am pleased by how easy it is to use SCons relative to Make. However, there is one thing that has resisted several hours of attempts.
The files in my project are contained in a tree which starts at
ProjectHome. The sources are in several subdirectories contained in ProjectHome/src/
I have a SConstruct file in ProjectHome which defines the build environment and then calls a SConscript
(in ProjectHome) which builds the object files, which are then put into a library in ProjectHome/lib
by SConstruct.
Everything works fine, except that I would like to separate where the .o files are kept from
where the source files are.
So here's what I have
#SConstruct.py
...
# The environment is defined above, no issues
cppobj, chfobj=SConscript('./SConscript.py', 'env', variant_dir='build', src_dir='.', duplicate=False)
env.Install('lib/'+str(Dim)+'D', env.SharedLibrary(target='Grade'+str(n), source=cppobj+chfobj))
and this is for the SConscript.py
#SConscript.py
import platform
import os
import sys
def getSubdirs(abs_path_dir):
    """ returns a sorted list with the subdirectories in abs_path_dir """
    lst = [x[0] for x in os.walk(abs_path_dir)]
    lst.sort()
    return lst

Dirs = getSubdirs(os.getcwd() + '/src')  # gives me the list of directories in src
CppNodes = []
ChFNodes = []
Import('env')
for directory in Dirs[2:3]:
    CppNodes += Glob(directory + '/*.cpp')
    ChFNodes += Glob(directory + '/*.ChF')
# env.Object can work on lists
ChFobj = env.SharedObject(ChFNodes)
# This builder likes to work one at a time.
# It builds an internal representation of _F.H headers
# so that when an #include is encountered, scons looks
# at this list too, and not just at what is specified by the IncDirs.
if len(ChFNodes) == 1:  # this is ridiculous but having only one ChF file causes trouble
    os.system('touch dummyF.ChF')
    ChFNodes.append('dummyF.ChF')
ChFHeader = []
for file in ChFNodes:
    ChFHeader += env._H(source=file)
Cppobj = env.SharedObject(CppNodes)
Return('Cppobj ChFobj')
However, for the life of me, build is ignored completely. I have tried different combinations,
even placing SConscript.py in the build dir, calling SConscript('build/SConscript.py', 'env', ...), you name it: SCons stubbornly refuses to do anything with build. Any help is appreciated. To be clear, it works in creating the libraries. It's just that it places the intermediate object files in the src dirs.

Process many EDI files through single MFX

I've created a mapping in MapForce 2013 and exported the MFX file. Now, I need to be able to run the mapping using MapForce Server. The problem is, I need to specify both the input EDI file and the output file. As far as I can tell, the usage pattern is to run the mapping with MapForce server using the input/output configuration in the MFX itself, not passed in on the command line.
I suppose I could change the input/output to some standard file name and then just write the input file to that path before performing the mapping, and then grab the output from the standard output file path when the mapping is complete.
But I'd prefer to be able to do something like:
MapForceServer run -in=MyInputFile.txt -out=MyOutputFile.xml MyMapping.mfx > MyLogFile.txt
Is something like this possible? Perhaps using parameters within the mapping?
There are two options that I've come across in dealing with a similar situation.
Option 1 - If you set the input XML file to *.xml in the component settings, mapforceserver.exe will automatically process all matching files in the directory, assuming your source is XML (this should work for text just the same). Similar to the example below, you can set up a cleanup routine to move the files into another folder after processing.
Note: It looks in the folder where the schema file is located.
Option 2 - Since your output is XML, you can use Altova's RaptorXML (rack up another license charge). You can then generate code in XSLT 2.0 and use a batch file to execute it automatically, something like this:
::#echo off
for %%f IN (*.xml) DO (RaptorXML xslt --xslt-version=2 --input="%%f" --output="out/%%f" %* "mymapping.xslt"
if NOT errorlevel 1 move "%%f" processed
if errorlevel 1 move "%%f" error)
sleep 15
mymapping.bat
I tossed in a sleep command to loop the batch for rechecking every 15 seconds. Unfortunately this does not work if your output target is a database.

unix: can i write to the same file in parallel without missing entries?

I wrote a script that executes commands in parallel. I let them all write an entry to the same log file. It does not matter if the order is wrong or entries are interleaved, but I noticed that some entries are missing. I should probably lock the file before writing; however, is it true that if multiple processes try to write to a file simultaneously, it will result in missing entries?
Yes, if different processes independently open and write to the same file, it may result in overlapping writes and missing data. This happens because each process gets its own file pointer, which advances only on that process's own writes.
Instead of locking, a better option might be to open the log file once in an ancestor of all the worker processes, have it inherited across fork(), and have them all use it for logging. This means there will be a single shared file pointer, which advances whenever any of the processes writes a new entry.
In a script you should use ">> file" (double greater than) to append output to that file. The interpreter will open the destination in "append" mode. If your program also wants to append, follow the directives below:
Open the text file in "append" mode ("a+") and give preference to printing only full lines (don't do multiple 'print' calls followed by a final 'println'; print the entire line with a single 'println').
The fopen documentation states this:
DESCRIPTION
    The fopen() function opens the file whose pathname is the string
    pointed to by filename, and associates a stream with it.

    The argument mode points to a string beginning with one of the
    following sequences:

    r or rb            Open file for reading.
    w or wb            Truncate to zero length or create file for writing.
    a or ab            Append; open or create file for writing at end-of-file.
    r+ or rb+ or r+b   Open file for update (reading and writing).
    w+ or wb+ or w+b   Truncate to zero length or create file for update.
    a+ or ab+ or a+b   Append; open or create file for update, writing at end-of-file.

    The character b has no effect, but is allowed for ISO C standard
    conformance (see standards(5)). Opening a file with read mode (r as
    the first character in the mode argument) fails if the file does not
    exist or cannot be read.

    Opening a file with append mode (a as the first character in the mode
    argument) causes all subsequent writes to the file to be forced to
    the then current end-of-file, regardless of intervening calls to
    fseek(3C). If two separate processes open the same file for append,
    each process may write freely to the file without fear of destroying
    output being written by the other. The output from the two processes
    will be intermixed in the file in the order in which it is written.
It is because of this intermixing that you want to give preference to using only 'println' (or its equivalent).
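In Node terms (to match the question's scripting context), the advice boils down to something like this sketch; the file name is illustrative. Open the log once with the append flag and emit each entry as one complete line:

import * as fs from "fs";

// O_APPEND makes the kernel position every write at the current
// end-of-file, even when several processes share the log.
const log = fs.createWriteStream("worker.log", { flags: "a" });

// One full line per write call: fragments from multiple smaller
// writes are what end up interleaved mid-line.
function logLine(entry: string): void {
  log.write(entry + "\n");
}

logLine(`pid ${process.pid}: job done`);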

ack - Binding an actual file name to a filetype

For me ack is essential kit (it's aliased to a and I use it a million times a day). Mostly it has everything I need, so I'm figuring that this behavior is covered and I just can't find it.
I'd love to be able to restrict it to specific kinds of files using a type. The problem is that these files have a full filename rather than an extension. For instance, I'd like to restrict it to build files for buildr so I can search them with --buildr (similar would apply for mvn poms). I have the following defined in my .ackrc
--type-set=buildr=buildfile,.rake
The problem is that 'buildfile' is the entire filename, not an extension, and I'd like ack to match completely on this name. However if I look at the types bound to 'buildr' it shows that .buildfile is an extension rather than the whole filename.
--[no]buildr .buildfile .rake
The ability to restrict to a particular filename would be really useful for me, as there are numerous XML use cases (e.g. ant build.xml or mvn pom.xml) that it would be perfect for. I do see that binary, Makefiles and Rakefiles have special type configuration, and maybe that's the way to go. I'd really like to be able to do it within ack if possible before resorting to custom functions. Anyone know if this is possible?
No, you cannot do it. ack 1.x only uses extensions for detecting file types. ack 2.0 will have much more flexible capabilities, where you'll be able to do stuff like:
# There are four different ways to match
# is: Match the filename exactly
# ext: Match the extension of the filename exactly
# match: Match the filename against a Perl regular expression
# firstlinematch: Match the first 80 characters of the first line
# of text against a Perl regular expression. This is only for
# the --type-add option.
--type-add=make:ext:mk
--type-add=make:ext:mak
--type-add=make:is:makefile
--type-add=make:is:gnumakefile
# Rakefiles http://rake.rubyforge.org/
--type-add=rake:is:Rakefile
# CMake http://www.cmake.org/
--type-add=cmake:is:CMakeLists.txt
--type-add=cmake:ext:cmake
# Perl http://perl.org/
--type-add=perl:ext:pod
--type-add=perl:ext:pl
--type-add=perl:ext:pm
--type-add=perl:firstlinematch:/perl($|\s)/
You can see what development on ack 2.0 is doing at https://github.com/petdance/ack2. I'd love to have your help.
