unix compare lists of file names - unix

I believe similar questions have been answered on SO before. I can't find any that seem to match my particular situation, though I am sure many others have faced this scenario.
In an FTP session on Red Hat I have produced a list of file names that reside on the server currently. The list contains the file names and only the file names. Call this file1. Perhaps it contains something like:
513569430_EDIP000754535900_MFC_20190618032554.txt
blah.txt
duh.txt
Then I have downloaded the files and produced a list of successfully downloaded files. As well, this list contains the file names and only the file names. Call this file2. Perhaps it contains something like:
loadFile.dat
513569430_EDIP000754535900_MFC_20190618032554.txt
localoutfile.log
Now I want to loop through the names in file1 and check whether they exist in file2. If a name exists in both, I will go back to the FTP server and delete that file from the server.
I have looked at while loops, comm and the test command, but I just can't seem to crack the code. I expect there are many ways to achieve this task. Any suggestions out there or working references?
My area of trouble is really not the looping itself but rather the comparing of contents between 2 files.

comm -1 -2 file1 file2 returns just the lines that are identical in both files. This can be used as the basis of a batch command file for sftp.
From the comments to the question, it seems that line-endings differ for the two files. This can be fixed in various ways, simplest probably being with tr. comm understands - as a filename to mean "read from stdin".
For example:
tr -d '\r' < file1 | comm -1 -2 - file2
If file1 or file2 are not sorted, this must be corrected for comm to operate properly. With bash, this could be:
comm -1 -2 <( tr -d '\r' < file1 | sort ) <( sort file2 )
With shells that don't understand the <( ... ) syntax, temporary files may be used explicitly.
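The temporary-file variant could be sketched roughly as follows for a plain POSIX shell. The function name common_names is made up for illustration, and the sketch assumes mktemp is available; carriage returns are stripped from both lists in case either one carries them.

```shell
# common_names FILE1 FILE2: print the names that appear in both lists.
# (common_names is a hypothetical helper name, not a standard tool.)
common_names() {
    a=$(mktemp) || return 1
    b=$(mktemp) || { rm -f "$a"; return 1; }
    # Strip carriage returns first, then sort, so that both inputs are
    # sorted in the same collation, as comm requires.
    tr -d '\r' < "$1" | LC_ALL=C sort > "$a"
    tr -d '\r' < "$2" | LC_ALL=C sort > "$b"
    LC_ALL=C comm -12 "$a" "$b"
    rm -f "$a" "$b"
}

# Possible use (names containing spaces would need extra quoting):
#   common_names file1 file2 | sed 's/^/rm /' > batch.txt
```

The resulting batch.txt could then be given to sftp with its -b option.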

Thank you for the advice @jhnc.
After giving this some deeper consideration and conversation, I realized that I don't even need to do this comparison. After I download the files I just need to produce the list of successful downloads. Then I can go and delete from server based on list of successful downloads.
However, I am still interested to know how to do the comparison when one file has '\r\n' line endings and the other has '\n'.

Related

Is there a way to make Unix diff -r compare only differences in filenames, but not check if any single file actually differs?

I need to compare two large directories with a lot of files in them. I tried using:
diff -r Directory1 Directory2
but the process is really slow due to the amount of files and their huge size.
So I thought about making the process faster by just comparing the content of the folders and not the actual content of the files.
Is there a way to make diff recursively check only if every subdirectory of Directory1 and Directory2 match in name and file content, but not check if every single file in Directory1 actually matches every single file in Directory2?
For example, let's say I have Directory1/SubDirectory1 and Directory2/Subdirectory1.
I want to check only if Directory1/SubDirectory1.1 and Directory2/Subdirectory2.1 have the same number of files with the same filenames (let's say, file1, file2, ... fileN), but I don't care about matching every file1, file2 ... fileN of Directory1/SubDirectory1.1 to every file1, file2 ... fileN of SubDirectory2.1 to see if their content is actually the same.
Is there a way of doing this?
Edit:
I tried using:
diff <(find path1) <(find path2)
but unfortunately, diff outputs the full path for each file. The output I get is thus:
< /Volume1/.../.../Directory1/SubDirectory1.1/file1
< /Volume1/.../.../Directory1/SubDirectory1.1/file2
...
> /Volume2/.../.../Directory2/SubDirectory2.1/file1
> /Volume2/.../.../Directory2/SubDirectory2.1/file2
...
Here every single filename clearly differs, because the full paths differ.
Is there a way to force find to output paths only starting from the directory you give as argument? For example:
find -(some option I'm not aware of) /Volume1/.../.../Directory1
outputs:
/Directory1/SubDirectory1.1/file1
/Directory1/SubDirectory1.1/file2
...
A simple way:
cd /.../Directory1
find . | sort >/tmp/dir1.lst
cd /.../Directory2
find . | sort >/tmp/dir2.lst
diff /tmp/dir1.lst /tmp/dir2.lst
It will fail if your filenames contain newlines, but in many cases that isn't a concern.
If scripting this, make sure to use auto-generated temp file names, e.g. with mktemp(1), to avoid symlink attacks and other problems.
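A scripted version along those lines might look like the sketch below. The function name compare_names is an assumption made for illustration; the listing is produced in subshells so the caller's working directory is untouched, and the temp files come from mktemp.

```shell
# compare_names DIR1 DIR2: diff the relative path lists of two trees.
# (compare_names is a hypothetical helper name.)
compare_names() {
    l1=$(mktemp) || return 1
    l2=$(mktemp) || { rm -f "$l1"; return 1; }
    # Each subshell changes directory only for its own find.
    (cd "$1" && find . | LC_ALL=C sort) > "$l1"
    (cd "$2" && find . | LC_ALL=C sort) > "$l2"
    diff "$l1" "$l2"
    status=$?
    rm -f "$l1" "$l2"
    # Pass on diff's exit status: 0 when the name lists are identical.
    return "$status"
}
```

As with the commands above, file contents are never read, so two files with the same name but different contents count as equal.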
Nate Eldredge, thank you for your answer!
However, I was able to solve my problem creating a script named fast_diff.sh, with just a line of code, as follows:
diff <(find "$1" | sed "s|$1\/||g" | sort) <(find "$2" | sed "s|$2\/||g" | sort)
The script takes two arguments, let's say path1 and path2:
./fast_diff.sh /Volume1/.../.../Directory1 /Volume2/.../.../Directory2
Now the variable $1 is equal to "/Volume1/.../.../Directory1" and the variable $2 is equal to "/Volume2/.../.../Directory2".
The command find gives as output something like:
/Volume1/.../.../Directory1/SubDirectory1.1/file1
/Volume1/.../.../Directory1/SubDirectory1.1/file2
...
Now I pipe this output to sed, using:
sed "s|$1||g"
which replaces every occurrence of "/Volume1/.../.../Directory1" with nothing. I used | as a separator instead of / because there are many occurrences of / in the directory path.
Employing the previous line of code, though, lists all subdirectories and files starting with a slash:
/SubDirectory1.1/file1
/SubDirectory1.1/file2
...
To remove the slash, I added \/:
sed "s|$1\/||g"

Split files linux and then grep

I'd like to split a file and grep each piece without writing them to individual files.
I've attempted a couple variations of split and grep and no such luck; any suggestions?
Something along the lines of:
split -b SIZE filename | grep "string"
I've attempted grep/fgrep to find the string but my shell complains that the files are too large. See: use fgrep instead
There is no point in splitting the file if you plan to [linearly] search each of the pieces anyway (assuming that's the only thing you are doing with it). Consider running grep on the entire file.
If however you plan to utilize the fact that the file is split later on, then the typical way would be:
Create a temporary directory and step into it
Run split/csplit on the original file
Use for loop over written fragment to do your processing.
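Those steps could be sketched as below. The function name grep_pieces and the piece. prefix are made up for illustration; the sketch hands split a prefix inside the temporary directory rather than stepping into it, and it splits by line count (-l) instead of by bytes so that no line is cut in half at a piece boundary.

```shell
# grep_pieces STRING FILE [LINES]: split FILE into pieces of LINES lines
# inside a temporary directory, then search each piece for a fixed string.
# (grep_pieces is a hypothetical helper name.)
grep_pieces() {
    tmp=$(mktemp -d) || return 1
    split -l "${3:-100000}" "$2" "$tmp/piece."
    for f in "$tmp"/piece.*; do
        [ -e "$f" ] || continue          # empty input: no pieces written
        LC_ALL=C grep -F -- "$1" "$f"
    done
    rm -rf "$tmp"
}
```

The pieces are visited in the order split wrote them, so matches come out in the same order as in the original file.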

grep -f alternative for huge files

grep -F -f file1 file2
file1 is 90 Mb (2.5 million lines, one word per line)
file2 is 45 Gb
That command doesn't actually produce anything whatsoever, no matter how long I leave it running. Clearly, this is beyond grep's scope.
It seems grep can't handle that many queries from the -f option. However, the following command does produce the desired result:
head file1 > file3
grep -F -f file3 file2
I have doubts about whether sed or awk would be appropriate alternatives either, given the file sizes.
I am at a loss for alternatives... please help. Is it worth it to learn some sql commands? Is it easy? Can anyone point me in the right direction?
Try using LC_ALL=C. It makes grep treat the pattern and input as plain ASCII bytes instead of UTF-8, which sped my search up by a factor of about 140. I have a 26G file whose search went from around 12 hours down to a couple of minutes.
Source: Grepping a huge file (80GB) any way to speed it up?
So what I do is:
LC_ALL=C fgrep "pattern" <input >output
I don't think there is an easy solution.
Imagine you write your own program which does what you want and you will end up with a nested loop, where the outer loop iterates over the lines in file2 and the inner loop iterates over file1 (or vice versa). The number of iterations grows with size(file1) * size(file2). This will be a very large number when both files are large. Making one file smaller using head apparently resolves this issue, at the cost of not giving the correct result anymore.
A possible way out is indexing (or sorting) one of the files. If you iterate over file2 and for each word you can determine whether or not it is in the pattern file without having to fully traverse the pattern file, then you are much better off. This assumes that you do a word-by-word comparison. If the pattern file contains not only full words, but also substrings, then this will not work, because for a given word in file2 you wouldn't know what to look for in file1.
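For the word-by-word case, the sorting idea might be sketched as follows. The function name words_in_both is an assumption, and note that it answers a slightly different question from grep: it reports which pattern words occur in the text, not which lines of the text contain them.

```shell
# words_in_both PATTERNS TEXT: print each line of PATTERNS that occurs
# as a whitespace-delimited word in TEXT.
# (words_in_both is a hypothetical helper name.)
words_in_both() {
    pats=$(mktemp) || return 1
    LC_ALL=C sort -u "$1" > "$pats"
    # Put every word of TEXT on its own line, sort and deduplicate,
    # then let comm do a single linear merge against the sorted patterns.
    tr -s '[:space:]' '[\n*]' < "$2" | LC_ALL=C sort -u |
        LC_ALL=C comm -12 "$pats" -
    rm -f "$pats"
}
```

Sorting costs on the order of n log n, after which the merge is linear, so the quadratic nested loop is avoided.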
Learning SQL is certainly a good idea, because learning something is always good. It will, however, not solve your problem, because SQL will suffer from the same quadratic effect described above. It may simplify indexing, should indexing be applicable to your problem.
Your best bet is probably taking a step back and rethinking your problem.
You can try ack. They are saying that it is faster than grep.
You can try parallel :
parallel --progress -a file1 'grep -F {} file2'
Parallel has got many other useful switches to make computations faster.
Grep can't handle that many queries, and at that volume, it won't be helped by fixing the grep -f bug that makes it so unbearably slow.
Are both file1 and file2 composed of one word per line? That means you're looking for exact matches, which we can do really quickly with awk:
awk 'NR == FNR { query[$0] = 1; next } query[$0]' file1 file2
NR (number of records, the line number) is only equal to the FNR (file-specific number of records) for the first file, where we populate the hash and then move onto the next line. The second clause checks the other file(s) for whether the line matches one saved in our hash and then prints the matching lines.
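To see the two counters side by side, a small throwaway function such as this one (the name show_counters is made up) prints them for every line of the files it is given:

```shell
# show_counters FILE...: print awk's NR and FNR for every input line.
# (show_counters is a hypothetical helper name.)
show_counters() {
    awk '{ print FILENAME ": NR=" NR " FNR=" FNR }' "$@"
}
```

With a two-line first file and a one-line second file, NR runs 1, 2, 3 while FNR runs 1, 2, 1. One caveat of the NR == FNR idiom is that an empty first file makes the test true for the second file as well.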
Otherwise, you'll need to iterate:
awk 'NR == FNR { query[$0]=1; next }
{ for (q in query) if (index($0, q)) { print; next } }' file1 file2
Instead of merely checking the hash, we have to loop through each query and see if it matches the current line ($0). This is much slower, but unfortunately necessary (though we're at least matching plain strings without using regexes, so it could be slower). The loop stops when we have a match.
If you actually wanted to evaluate the lines of the query file as regular expressions, you could use $0 ~ q instead of the faster index($0, q). Note that this uses POSIX extended regular expressions, roughly the same as grep -E or egrep but without bounded quantifiers ({1,7}) or the GNU extensions for word boundaries (\b) and shorthand character classes (\s,\w, etc).
These should work as long as the hash doesn't exceed what awk can store. This might be as low as 2.1B entries (a guess based on the highest 32-bit signed int) or as high as your free memory.

script,unix,compare

I have two files ...
file1:
002009092312291100098420090922111
010555101070002956200453T+00001190.81+00001295.920010.87P
010555101070002956200449J+00003128.85+00003693.90+00003128
010555101070002956200176H+00000281.14+00000300.32+00000281
file2:
002009092410521000098420090709111
010560458520002547500432M+00001822.88+00001592.96+00001822
010560458520002547500432D+00000106.68+00000114.77+00000106
In both files in every record starting with 01, the string from 3rd char to 25th char, i.e up to alphabet is the key.
Based on this key, I have to compare two files, and if there is any record matching in file 2, then I have to replace that record in file1, or else append it if it won't match.
Well, this is a fairly unspecific (and basic) programming question. We'll be better able to help you if you explain exactly what you did and where you got stuck.
Also, it looks a bit like homework, and people are wary of giving too much help on homework problems, as it might look like cheating.
To get you started:
I'd recommend Perl to solve this, but awk or another scripting language will also do. I'd recommend against sh/bash, as they are weak on text manipulation; also combining grep et al will become rather cumbersome.
First write a Perl program that filters records starting with 01. Then extract the key and put it into a hash (a Perl structure). Then output a new, combined file as required.
Using awk, pick out the records that start with 01 and take characters 3-25 as the key, with something like
awk '/^01/' file_name | cut -c 3-25
Do this for both files, keep the two results in separate buffers, and compare the buffers using a for line in loop in a shell script.
Whenever a line in the second buffer matches one in the first, grep for it in the second file and replace the corresponding line in the first file with the line from the second. I think you need to work a bit on the logic.
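An awk-only take on the same task might look like the sketch below. It rests on assumptions that the question leaves open: the key is taken as the 23 characters from position 3 to 25 inclusive, only records starting with 01 take part in the merge, and the merged result is written to standard output rather than back into file1. The function name merge_records is made up.

```shell
# merge_records FILE1 FILE2: print FILE1 with every 01 record replaced by
# the FILE2 record that has the same key; unmatched FILE2 records follow.
# (merge_records is a hypothetical helper name.)
merge_records() {
    awk '
        function key(rec) { return substr(rec, 3, 23) }  # chars 3..25
        NR == FNR {                     # first input: FILE2, the updates
            if (/^01/) {
                k = key($0)
                if (!(k in upd)) order[++n] = k
                upd[k] = $0
            }
            next
        }
        /^01/ && (key($0) in upd) {     # FILE1 record with a newer version
            k = key($0); print upd[k]; used[k] = 1; next
        }
        { print }                       # everything else passes through
        END {                           # append updates that matched nothing
            for (i = 1; i <= n; i++)
                if (!(order[i] in used)) print upd[order[i]]
        }
    ' "$2" "$1"
}
```

Something like merge_records file1 file2 > file1.new would then produce the combined file. An empty file2 would need separate handling, because NR == FNR would then be true for file1.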

Why did my use of the read command not do what I expected?

I did some havoc on my computer, when I played with the commands suggested by vezult [1]. I expected the one-liner to ask file-names to be removed. However, it immediately removed my files in a folder:
> find ./ -type f | while read x; do rm "$x"; done
I expected it to wait for my typing of stdin:s [2]. I cannot understand its action. How does the read command work, and where do you use it?
What happened there is that read reads from stdin. When you put it at the end of a pipe, it read from that pipe.
So your find becomes
file1
file2
and so on; read reads that and replaces x successively with file1 then file2, and so your loop becomes
rm "file1"
rm "file2"
and sure enough, that rm's every file starting at the current directory ".".
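A harmless way to watch this happen is to keep the same pipeline but print instead of delete. In this sketch the function name preview_removals is made up, the sort is only there to make the output order predictable, and IFS= together with -r keeps read from trimming whitespace or interpreting backslashes in the names.

```shell
# preview_removals DIR: show what the loop would delete, without deleting.
# (preview_removals is a hypothetical helper name.)
preview_removals() {
    find "$1" -type f | LC_ALL=C sort | while IFS= read -r x; do
        # read takes each name from the pipe, never from the keyboard.
        printf 'would remove: %s\n' "$x"
    done
}
```

Running preview_removals . lists every file the original one-liner would have removed.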
A couple hints.
You didn't need the "/".
It's better and safer to say
find . -type f
because should you happen to type ". /" (ie, dot SPACE slash) find will start at the current directory and then go look starting at the root directory. That trick, given the right privileges, would delete every file in the computer. "." is already the name of a directory; you don't need to add the slash.
The find or rm commands will do this
It sounds like what you wanted to do was go through all the files in all the directories starting at the current directory ".", and have it ASK if you want to delete it. You could do that with
find . -type f -exec rm -i {} \;
or
find . -type f -ok rm {} \;
and not need a loop at all. You can also do
rm -r -i *
and get nearly the same effect, except that it will try to delete directories too. If the directory is empty, that'll even work.
Another thought
Come to think of it, unless you have a LOT of files, you could also do
rm -i `find . -type f`
Now the find in backquotes will become a bunch of file names on the command line, and the '-i' interactive flag on rm will ask the yes or no question.
Charlie Martin gives you a good dissection and explanation of what went wrong with your specific example, but doesn't address the general question of:
When should you use the read command?
The answer to that is - when you want to read successive lines from some file (quite possibly the standard output of some previous sequence of commands in a pipeline), possibly splitting the lines into several separate variables. The splitting is done using the current value of '$IFS', which normally means on blanks and tabs (newlines don't count in this context; they separate lines). If there are multiple variables in the read command, then the first word goes into the first variable, the second into the second, ..., and the residue of the line into the last variable. If there's only one variable, the whole line goes into that variable.
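The splitting rule can be tried out with a small sketch like this one; the function name split_demo and the variable names are only for illustration.

```shell
# split_demo LINE: show how read distributes the words of one line
# over three variables. (split_demo is a hypothetical helper name.)
split_demo() {
    printf '%s\n' "$1" | {
        read -r first second rest
        printf 'first=%s second=%s rest=%s\n' "$first" "$second" "$rest"
    }
}
```

Called as split_demo "mode_ansi with log mode ansi", the first two words land in first and second, and everything after them ends up in rest.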
There are many uses. This is one of the simpler scripts I have that uses the split option:
#!/bin/ksh
#
# @(#)$Id: mkdbs.sh,v 1.4 2008/10/12 02:41:42 jleffler Exp $
#
# Create basic set of databases
MKDUAL=$HOME/bin/mkdual.sql
ELEMENTS=$HOME/src/sqltools/SQL/elements.sql
cat <<! |
mode_ansi with log mode ansi
logged with buffered log
unlogged
stores with buffered log
!
while read dbs logging
do
    if [ "$dbs" = "unlogged" ]
    then bw=""; cw=""
    else bw="-ebegin"; cw="-ecommit"
    fi
    sqlcmd -xe "create database $dbs $logging" \
        $bw -e "grant resource to public" -f $MKDUAL -f $ELEMENTS $cw
done
The cat command with a here-document has its output sent to a pipe, so the output goes into the while read dbs logging loop. The first word goes into $dbs and is the name of the (Informix) database I want to create. The remainder of the line is placed into $logging. The body of the loop deals with unlogged databases (where begin and commit do not work), then run a program sqlcmd (completely separate from the Microsoft new-comer of the same name; it's been around since about 1990) to create a database and populate it with some standard tables and data - a simulation of the Oracle 'dual' table, and a set of tables related to the 'table of elements'.
Other scripts that use the read command are bigger (by far), but generally read lines containing one or more file names and some other attributes of relevance, and then apply an appropriate transform to the files using the attributes.
Osiris JL: file * | grep 'sh.*script' | sed 's/:.*//' | xargs wgrep read
esqlcver:read version letter
jlss: while read directory
jlss: read x || exit
jlss: read x || exit
jlss: while read file type link owner group perms
jlss: read x || exit
jlss: while read file type link owner group perms
kb: while read size name
mkbod: while read directory
mkbod:while read dist comp
mkdbs:while read dbs logging
mkmsd:while read msdfile master
mknmd:while read gfile sfile version notes
publictimestamp:while read name type title
publictimestamp:while read name type title
Osiris JL:
'Osiris JL: ' is my command line prompt; I ran this in my 'bin' directory. 'wgrep' is a variant of grep that only matches entire words (to avoid words like 'already'). This gives some indication of how I've used it.
The 'read x || exit' lines are for an interactive script that reads a response from standard input, but exits if the command gets EOF (for example, if standard input comes from /dev/null).
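The same pattern, wrapped in a function for illustration, might look like this. The name confirm and the prompt text are assumptions, and the function uses return where the scripts above use exit, so that the caller decides what to do on EOF.

```shell
# confirm: read one line of input; succeed only on an explicit "y".
# (confirm is a hypothetical helper name.)
confirm() {
    printf 'Proceed? [y/n] ' >&2
    read -r x || return 2      # EOF: there is no answer to act on
    [ "$x" = "y" ]
}
```

With standard input redirected from /dev/null the read fails at once and the function returns 2 instead of waiting for a reply.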
