How to `diff` files to create a "common" file? - css

I have a slew of CSS files to go through where someone just grunted through making alterations to various core stylesheets on a number of subsites. Obviously if the original developer had had some foresight they would have just included a master stylesheet and overridden the necessary elements…
I first started off with comm, thinking that it might do the trick, but quickly found that it requires its input files to be sorted.
I then switched over to diff and have gotten down to the following through some reading and research:
diff --unchanged-group-format="## %dn,%df%c'\012'%<" --old-group-format='' --new-group-format='' --changed-group-format='' file_1.css file_2.css
The above is obviously almost there, but:
A) I need to grep out the ## lines (which should be fine, right? At first glance this appears right, but does diff throw in any other unexpected lines that need to be yanked?), and then
B) I need to create two more files: the first containing the leftover unique lines from file_1.css, and the second the leftover unique lines from file_2.css.
Obviously the first "in common" file will go into an include folder and then be included in the two latter files via @import url("common.css");
I am thinking that the following simple alteration will create the latter two files to which I'm referring:
diff --unchanged-group-format='' --old-group-format="## %dn,%df%c'\012'%<" --new-group-format='' --changed-group-format='' file_1.css file_2.css
diff --unchanged-group-format='' --old-group-format='' --new-group-format="## %dn,%df%c'\012'%<" file_1.css file_2.css
Sample files:
file 1: https://gist.github.com/c13843972c47b5037704
file 2: https://gist.github.com/fff39eae386e8969dc10
So for example, upon executing a test of the following:
diff --unchanged-group-format="## %dn,%df%c'\012'%<" --old-group-format='' --new-group-format='' --changed-group-format='' file_1.css file_2.css | egrep -v "^##\d*" > common.css
diff --unchanged-group-format='' --old-group-format="## %dn,%df%c'\012'%<" --new-group-format='' --changed-group-format='' file_1.css file_2.css | egrep -v "^##\d*" > old.css
Then, searching for body with egrep "^body" *css, it yielded a body rule only in common.css and none in old.css, whereas file_1.css and file_2.css each contain a different body entry. So obviously this methodology is flawed.
How would one go about creating these files that would ultimately become the common include and the override files?

@ylluminate, you have a few options:
1) Use BeyondCompare to visually verify the differences. It does a fantastic job comparing similar files, and it allows saving common lines / left-only lines / right-only lines. The only downside is that it is interactive, so if you have a lot of files it will take some time. On the positive side, it looks like you want to build trust first by testing it out a few times.
2) Add formatting text for --changed-group-format and capture the modified code (and the old code, as your command does now). You need to run one more comparison to get what is in the new code but not in the old code; a rough sketch follows this list. The downside here is that validation is going to be hard.
3) Save all the lines in a database table and compare columns. Take care to store old and new line numbers. The downsides are that the data lines need to be unique and blank lines will be dropped.
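For what it's worth, here is a rough sketch of option 2 using GNU diff's group formats (%=, %< and %> stand for the common lines, the lines from the first file and the lines from the second file; the output file names are only illustrative):
# lines common to both files
diff --unchanged-group-format='%=' --old-group-format='' --new-group-format='' --changed-group-format='' file_1.css file_2.css > common.css
# lines that exist only (or differ) in file_1.css
diff --unchanged-group-format='' --old-group-format='%<' --new-group-format='' --changed-group-format='%<' file_1.css file_2.css > file_1.override.css
# lines that exist only (or differ) in file_2.css
diff --unchanged-group-format='' --old-group-format='' --new-group-format='%>' --changed-group-format='%>' file_1.css file_2.css > file_2.override.css
Using the plain %= / %< / %> specifiers avoids the ## marker lines (and hence the grep), and the --changed-group-format settings are what route modified blocks into the override files instead of silently dropping them.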
I would go with option 1 if I had fewer than 50 files.
Hope this helps.
PS: I am not associated with BeyondCompare in any way, just a happy user of the software.

Related

DataStage sequence job - how to process one file at a time when the files are in 7 different folders

DataStage - There are 7 folders in a path, and in each folder there are 2 files. For example, the 2 files are in the following format: filename = test_s1_YYYYMMDD.txt, test_s1_YYYYMMDD.done. The paths for these files are:
user/test/test_s1/
user/test/test_s2/
...
user/test/test_s7/
Here s1, s2, ..., s7 represent the different folders. In these folders the 2 files mentioned above are present, so how can I process each file in a sequence job?
First you need a job to process a file and the filename needs to be a parameter of that job.
At the Sequence level you need two nesting levels - an inner one for the two files within each folder and an outer one for the different directories.
For the inner one you can either build a loop with two iterations or simply add the processing job twice to the sequence (which reduces complexity if it will always be two files).
The outer Sequence is a loop where you parameterize the path so that the loop counter can be used to generate the variable s1-s7 part of the path.
Check out more details on loops here
You can use the loop counter (stage_label.$Counter) to parameterize your job.
Depending on what you want to do with the files, how you process them is an important decision. Starting a job (or more) in a sequence for each file can lead to heavy overhead just for starting the jobs. Try loading all files at once in a parallel job using the Sequential File stage.
In the Sequential File Stage, set the appropriate Format. You can also set everything to none to just put each row in one column and process that in a later job. This will make the reading very flexible and forgiving. If your files are all the same structure, define your columns as needed.
To select the files, use File Patterns. In the Options of the Sequential File Stage, choose to have a File Name Column so you can process the filenames in a later job. You might also want to add a Row Number Column.
This method works pretty fast.

Gulp — how to get lazy, ‘make’-like building?

I am using gulp for CSS and JS processing. Sometimes I miss the good old laziness of the Unix make command:
only generate transformed files (whatever the transformation, e.g. compilation) from the original files that have actually changed (based on timestamps).
This holds from stage 1 to 2 (.cpp -> .o), from stage 2 to 3 (linking or other steps), or whatever your dependency graph gives...
Make is not limited to source code: you can do image manipulation in several steps (efficiently 'lazy' generation of downscaled thumbs, for example) or much else, all based on the fairly simple rule: "is at least one of the source files newer than the current output file(s)?"
Unlike gulp, every step generates (more or less temporary) files, not a continuous pipe.
Is there a way to get the same kind of laziness in gulp, i.e. when generating CSS?
only transform a (less|sass|stylus) file to CSS if that particular file changed
the same for adding browser prefixes, concat, minify
Admittedly, beyond the first 1 or 2 steps the output is most likely already a single stream, so any change means 'touched'. Still, when playing with minify options for example, I'd rather be lazy about the early transpile, prefixing and concat stages (drawing prior results from a temp file). The same applies on the JavaScript side (TypeScript, ...).
lazypipe and gulp-cache sound tempting but are something else, if I understand correctly. Saying .watch() is also only a partial answer, for the very first stage.
Is there a more generic approach?
If you're set on using Gulp, then the gulp-cached and gulp-remember plugins would seem to be the way to do it.
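A minimal gulpfile sketch (gulp 3 style, assuming a Less pipeline and that the gulp-cached, gulp-remember, gulp-less, gulp-concat and gulp-clean-css plugins are installed; paths and task names are illustrative):
var gulp = require('gulp');
var cached = require('gulp-cached');
var remember = require('gulp-remember');
var less = require('gulp-less');
var concat = require('gulp-concat');
var cleanCSS = require('gulp-clean-css');

gulp.task('styles', function () {
  return gulp.src('src/**/*.less')
    .pipe(cached('styles'))    // drop files whose contents have not changed since the last run
    .pipe(less())              // so only the changed files get re-compiled
    .pipe(remember('styles'))  // put the previously compiled, unchanged files back into the stream
    .pipe(concat('site.css'))  // downstream steps still see the full set
    .pipe(cleanCSS())
    .pipe(gulp.dest('dist'));
});

gulp.task('watch', function () {
  gulp.watch('src/**/*.less', ['styles']);
});
Note that this is in-memory laziness: the cache lives inside the running gulp process, so it speeds up rebuilds during a watch session rather than giving make-style, timestamp-based laziness across separate invocations. If comparing timestamps against the written output is what you want, gulp-newer (which checks source mtimes against the destination files) is closer in spirit, though only for the first stage of a pipe.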

How to insert text into the middle of a text file in Qt?

I'm writing a program that performs several tests on a hardware unit, and logs both the results of each test and the steps taken to perform the test. The trick is that I want the program to log these results to a text file as they become available, so that if the program crashes the results that had been obtained are not lost, and the log can help debug the crash.
For example, assume a program consisting of two tests. If the program has finished the first test and is working on the second, the log file would look like:
Results:
Test 1 Result A: Passed
Test 1 Result B: 1.5 Volts
Log:
Setting up instruments.
Beginning test 1.
[Steps in test 1]
Finished test 1.
Beginning test 2.
[whatever test 2 steps have been completed]
Once the second test has finished, the log file would look like this:
Results:
Test 1 Result A: Passed
Test 1 Result B: 1.5 Volts
Test 2 Result A: Passed
Test 2 Result B: 2.0 Volts
Log:
Setting up instruments.
Beginning test 1.
[Steps in test 1]
Finished test 1.
Beginning test 2.
[Steps in test 2]
Finished test 2.
All tests complete.
How would I go about doing this? I've been looking at the help files for QFile and QTextStream, but I'm not seeing a way to insert text in the middle of existing text. I don't want to create separate files and merge them at the end because I'd end up with separate files in the event of a crash. I also don't want to write the file from scratch every time a change is made because it seems like there should be a faster, more elegant way of doing this.
QFile::readAll will read the entire file into a QByteArray. On the QByteArray you can then use insert to add text in the middle, and then write the whole thing back to the file.
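Something along these lines, for example (a rough sketch; the "Log:" marker and the function name are assumptions based on the file layout shown in the question):
#include <QFile>
#include <QString>
#include <QByteArray>

// Insert a new result line just before the "Log:" section by rewriting the whole file.
bool insertResultLine(const QString &path, const QByteArray &resultLine)
{
    QFile file(path);
    if (!file.open(QIODevice::ReadOnly))
        return false;
    QByteArray contents = file.readAll();
    file.close();

    int pos = contents.indexOf("Log:");
    if (pos < 0)
        pos = contents.size();              // no Log section yet: append at the end
    contents.insert(pos, resultLine + '\n');

    if (!file.open(QIODevice::WriteOnly | QIODevice::Truncate))
        return false;
    file.write(contents);
    return true;
}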
Or you could use classic C-style file handling, which can modify the middle of a file with the help of file pointers.
As @Roku pointed out, there is no built-in way to insert data into a file without a rewrite. However, if you know the size of the region, i.e. if the text you want to write has a fixed length, then you can write empty space into the file and replace it later. Check
this discussion on overwriting part of a file.
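A rough sketch of that placeholder idea (the field width, names and layout are illustrative assumptions):
#include <QFile>
#include <QByteArray>

const int kFieldWidth = 32;                 // fixed width reserved for one result line

// Write a blank, fixed-width line and return its offset for later.
qint64 reservePlaceholder(QFile &file)
{
    qint64 offset = file.pos();
    file.write(QByteArray(kFieldWidth, ' ') + "\n");
    return offset;
}

// Overwrite the reserved region in place once the result is known (it must fit the field).
void fillPlaceholder(QFile &file, qint64 offset, const QByteArray &result)
{
    qint64 end = file.pos();                // remember where appending left off
    file.seek(offset);
    file.write(result.leftJustified(kFieldWidth, ' '));
    file.seek(end);                         // resume appending at the end of the file
}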
I ended up going with the "write the file from scratch" method that I mentioned being hesitant about in my question. The benefit of this technique is that it results in a single file, even in the event of a crash since the log and the results are never placed in different files to begin with. Additionally, rewriting the file only happens when adding new results (an infrequent occurrence), whereas updating the log means simply appending text to the file as usual. I'm still a bit surprised that there isn't a way to have the OS insert text into a file for you.
Oh, and for those of you who absolutely must have this functionality as efficiently as possible, the following might be of use:
http://www.codeproject.com/Articles/17716/Insert-Text-into-Existing-Files-in-C-Without-Temp
You just cannot add more stuff in the middle of a file. I would go with two separate files, one for the results and another for the logs.

Compress EACH LINE of a file individually and independently of one another? (or preserve newlines)

I have a very large file (~10 GB) that can be compressed to < 1 GB using gzip. I'm interested in using sort FILE | uniq -c | sort to see how often a single line is repeated; however, the 10 GB file is too large to sort and my computer runs out of memory.
Is there a way to compress the file while preserving newlines (or an entirely different method all together) that would reduce the file to a small enough size to sort, yet still leave the file in a condition that's sortable?
Or is there any other method of finding out / counting how many times each line is repeated inside a large file (a ~10 GB CSV-like file)?
Thanks for any help!
Are you sure you're running out of memory (RAM) with your sort?
My experience debugging sort problems leads me to believe that you have probably run out of disk space for sort to create its temporary files. Also recall that the disk space used for sorting is usually in /tmp or /var/tmp.
So check out your available disk space with:
df -g
(some systems don't support -g; try -m (megabytes) or -k (kilobytes))
If you have an undersized /tmp partition, do you have another partition with 10-20GB free? If yes, then tell your sort to use that dir with
sort -T /alt/dir
Note that for sort version
sort (GNU coreutils) 5.97
The help says
-T, --temporary-directory=DIR use DIR for temporaries, not $TMPDIR or /tmp;
multiple options specify multiple directories
I'm not sure if this means you can combine a bunch of -T=/dr1/ -T=/dr2 ... to get to your 10GB*sortFactor space or not. My experience was that it only used the last dir in the list, so try to use one dir that is big enough.
Also, note that you can go to whatever dir you are using for sort, and you'll see the activity of the temporary files used for sorting.
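Putting it together, for example (a rough sketch; /alt/dir and the output file name are illustrative):
df -m /tmp /alt/dir
sort -T /alt/dir FILE | uniq -c | sort -rn -T /alt/dir > line_counts.txt
The second sort -rn just puts the most-repeated lines first.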
I hope this helps.
As you appear to be a new user here on S.O., allow me to welcome you and remind you of four things we do:
1) Read the FAQs
2) Please accept the answer that best solves your problem, if any, by pressing the checkmark sign. This gives the respondent with the best answer 15 points of reputation. It is not subtracted (as some people seem to think) from your reputation points ;-)
3) When you see good Q&A, vote them up by using the gray triangles, as the credibility of the system is based on the reputation that users gain by sharing their knowledge.
4) As you receive help, try to give it too, answering questions in your area of expertise.
There are some possible solutions:
1 - Use any text processing language (Perl, awk) to extract each line and save the line number and a hash for that line, then compare the hashes.
2 - Can / want to remove the duplicate lines, leaving just one occurrence per file? You could use a script (command) like:
awk '!x[$0]++' oldfile > newfile
3 - Why not split the file by some criterion? Supposing all your lines begin with letters (a sketch follows this list):
- break your original_file into smaller files, one per initial letter: grep "^a" original_file > a_file
- sort each small file: a_file, b_file, and so on
- verify the duplicates, count them, do whatever you want.
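A bash sketch of option 3 (assuming all lines start with ASCII letters; the file names are illustrative, and lines starting with anything else would need their own bucket):
for letter in {a..z}; do
    grep -i "^$letter" original_file > "${letter}_file"
    sort "${letter}_file" | uniq -c | sort -rn >> counts.txt
done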

Printing hard copies of code

I have to hand in a software project that requires either a paper or .pdf copy of all the code included.
One solution I have considered is grouping classes by context and doing a cat *.extension > out.txt to gather all the code, then catting the resulting text files into a single file that has classes grouped by context. This is not an ideal solution; there would be no page breaks.
Another idea I had was a shell script to inject LaTeX page breaks in between the files to be joined, which would be more acceptable, although I'm not too adept at scripting or LaTeX.
Are there any tools that will do this for me?
Take a look at enscript (or nenscript), which will convert to PostScript, render in columns, add headers/footers and perform syntax highlighting. If you want to print code in a presentable fashion, this works very nicely.
e.g. here's my setting (within a zsh function)
# -2 = 2 columns
# -G = fancy header
# -E = syntax filter
# -r = rotated (landscape)
# syntax is picked up from .enscriptrc / .enscript dir
enscript -2GrE $*
For a quick solution, see a2ps, followed by ps2pdf. For a nicer, more complex solution I would go for a simple script that puts each file in a LaTeX listings environment and combines the result.
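For the quick route, something like this (a sketch; the file glob and output names are illustrative):
a2ps --pretty-print --columns=2 --output=code.ps src/*.cpp
ps2pdf code.ps code.pdf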
