Union, intersection and exclusion with the find command?

I need to manage file lists with the find command. Suppose the lists contain arbitrary names and are not disjoint (i.e. their intersection is not the empty set). How can I do:
A \ B: find the files in list A that are not in list B
A intersection B: find the files common to lists A and B
A union B: find all the files in the two lists
EXAMPLES
$ find . | awk -F"/" '{ print $2 }'
.zcompdump
.zshrc
.bashrc
.emacs
$ find ~/bin/FilesDvorak/.* -maxdepth 0 | awk -F"/" '{ print $6 }'
.bashrc
.emacs
.gdbinit
.git
I want:
A \ B:
.zcompdump
.zshrc
A Intersection B:
.bashrc
.emacs
A Union B:
.zcompdump
.zshrc
.bashrc
.emacs
.gdbinit
.git
An attempt at the intersection
When I save the outputs to separate lists, I cannot understand why the following command does not produce the common entries, i.e. the intersection above:
find -f all_files -and -f right_files .
Further questions that emerged:
find ~/bin/FilesDvorak/.* -maxdepth 0 -and ~/.PAST_RC_files/.*
find ~/bin/FilesDvorak/.* -maxdepth 0 -and list

Seriously, this is what comm(1) is for. I don't think the man page could be much clearer: http://linux.die.net/man/1/comm
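To make the three operations concrete, here is a minimal demonstration built from the file names in the question (comm requires both inputs to be sorted):
$ printf '%s\n' .zcompdump .zshrc .bashrc .emacs | sort > A
$ printf '%s\n' .bashrc .emacs .gdbinit .git | sort > B
$ comm -23 A B    # A \ B: suppress lines unique to B (column 2) and common lines (column 3)
.zcompdump
.zshrc
$ comm -12 A B    # A intersection B: keep only the common lines
.bashrc
.emacs
$ sort -u A B     # A union B
.bashrc
.emacs
.gdbinit
.git
.zcompdump
.zshrc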

There are several tools that can help you find the intersection of file lists. find isn't one of them: find is for locating files that match certain criteria on the filesystem.
Here are some ways of finding your answer.
To generate your two file lists
find . -maxdepth 1 | sort > a
(cd ~/bin/FilesDvorak/; find . -maxdepth 1 | sort > b)
Now you have two files, a and b, that contain the directory entries without recursing into subdirectories. (To remove the leading ./, you can insert sed -e 's|^\./||' or your first awk command between the find and the sort.)
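A sketch of that substitution in place (GNU find assumed; -mindepth 1 also drops the bare "." entry from the list):
find . -mindepth 1 -maxdepth 1 | sed -e 's|^\./||' | sort > a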
To find the Union
cat a b | sort -u
To find A \ B
comm -23 a b
To find the intersection
comm -12 a b
'man comm' and 'man find' for more information.

Related

Unix Find directories containing more than 2 files

I have a directory containing over a thousand subdirectories. I only want to 'ls' the directories that contain more than 2 files; I don't need the ones that contain 2 or fewer. This is in C-shell, not bash. Does anyone know a good command for this?
I tried the command below, but it's not giving the desired output. I simply want the full list of directories with more than 2 files. One reason it isn't working is that it descends into the subdirectories of those directories when counting; I don't want a recursive search, just a list of the first-level directories inside the main directory.
$ find . -type f -printf '%h\n' | sort | uniq -c | awk '$1 > 2'
My mistake, I was thinking bash rather than csh. Although I don't have a csh to test with, I think this is the csh syntax for the same thing:
foreach d (*)
if (d "$d" && `ls -1 "$d"|wc -l` > 2) echo "$d"
end
I've added a guard so that non-directories aren't unnecessarily processed, and I've included double-quotes in case there are any "funny" file or directory names (containing spaces e.g.).
One possible problem (I don't know what your exact task is): any immediate subdirectories will also count as files.
Sorry, I was working in bash here:
for d in *; do if [ $(ls -1 "$d" | wc -l) -gt 2 ]; then echo "$d"; fi; done
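If you would rather stay with the original find pipeline, limiting the depth keeps files in deeper subdirectories out of the counts; a sketch, assuming GNU find for -printf:
# files at depth 2 sit directly inside the first-level directories;
# %h prints each file's parent directory, which is then counted per directory
find . -mindepth 2 -maxdepth 2 -type f -printf '%h\n' | sort | uniq -c | awk '$1 > 2 {print $2}'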
For a faster solution, you could try "cheating" by deconstructing the format of the directories themselves if you're on pure Unix. They're just files themselves with contents that can be analyzed in that case. Needless to say that is NOT PORTABLE, to e.g. any bash running on Windows, so not recommended.

How to find files that match names in a list and copy them to a directory?

I have a list of 50 names that look like this:
O8-E7
O8-F2
O8-F6
O8-F8
O8-H2
O9-A5
O9-B8
O9-D8
O9-E2
O9-F5
O9-H12
S37-A5
S37-B11
S37-B12
S37-C12
S37-D12
S37-E8
S37-G2
I want to look inside a specific directory for all the subdirectories whose name contains one of these elements.
For example, the directory Sample_S37-G2-from-Specimen-001 would be a match.
Inside those subdirectories, there is a file called accepted_hits.bam (unfortunately named the same way in all of them). I want to find these files and copy them into a single folder, with the name of the sample subdirectory that they came from.
For example, I would copy the accepted_hits.bam file from the subdirectory Sample_S37-G2-from-Specimen-001 to the new_dir as S37-G2_accepted_hits.bam
I tried using find, but it's not working and I don't really understand why.
cat sample.list | while read FILENAME; do find /path/to/sampleDirectories -name "$FILENAME" -exec cp '{}' new_dir \; ; done
Any ideas? Thanks!
You are looking for directories whose names exactly match the lines in your input.
The first improvement is to use wildcards:
cat sample.list | while read FILENAME; do
    find /path/to/sampleDirectories -name "*${FILENAME}*" -exec cp '{}' new_dir \;
done
Your new problem is that you are now matching directories, not files; what you actually want are the files named accepted_hits.bam inside those directories.
So your next try would be parsing the output of
find /path/to/sampleDirectories -name accepted_hits.bam | grep "${FILENAME}"
but you do not want to call find for each entry in sample.list.
You need to start with 1 find command and get the relevant subdirs from it.
A complication is that you want the matched substring from the original file's path in your destination file name. Look at the grep options -o and -f; they help!
find /path/to/sampleDirectories -name accepted_hits.bam | while read orgfile; do
    matched_part=$(echo "${orgfile}" | grep -o -f sample.list)
    if [ -n "${matched_part}" ]; then
        cp "${orgfile}" "new_dir/${matched_part}_accepted_hits.bam"
    fi
done
This will only work when your sample.list has no extra whitespace. If it does and you cannot change the file, copy/parse sample.list into a cleaned-up file first.
When one of your 50 entries in sample.list is a substring of "accepted_hits.bam", you need to do some extra work.
Edit: if [ -n "${matched_part}" ] was missing the $.
Try using egrep with alternation:
build a text file with a single line of patterns: (pat1|pat2|pat3)
call find to list all of the regular files
use egrep to select the ones based on the patterns in the pattern file
awk 'BEGIN { printf("(") } FNR==1 {printf("%s", $0)} FNR>1 {printf("|%s", $0)} END{printf(")\n") } ' sample.list > t.sed
find /path/to/sampleDirectories -type f | egrep -f t.sed > filelist
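To complete the copy step under this approach, filelist can be fed back through the same pattern file; a sketch, reusing t.sed and the new_dir destination from the question, and assuming each path matches exactly one sample name:
while read -r f; do
    # recover the sample name that matched somewhere in the path
    name=$(echo "$f" | egrep -o -f t.sed)
    [ -n "$name" ] && cp "$f" "new_dir/${name}_accepted_hits.bam"
done < filelist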

Shell script to find the number of unique files in a directory as well as its subdirectories?

I'm trying to find the number of unique files in a directory as well as its subdirectories. Is this possible?
Say, for example, a directory contains 100 files. How would I count the number of unique files under that directory?
Assuming you're asking about file names, you can
First, list all the files in the directory tree
Second, get the unique values from the list
To list all the files, you can use find. Normally find prints the entire path name of each result, but since you just want to compare the base file names, you will have to customize its output:
find directoryName -type f -printf '%f\n'
This will print each base file name, one per line. Now you can get only the unique file names by sorting and then collapsing all adjacent entries that share a name into a single entry. The sort command with the -u switch does this for you:
find directoryName -type f -printf '%f\n' | sort -u
If you want to get a count of the number of repetitions of each unique file name, then just use sort by itself and use uniq -c to handle the collapsing and the counting:
find directoryName -type f -printf '%f\n' | sort | uniq -c
Note that the above solution will get confused by file names that contain newline (\n) characters. If you have any such file names, you should read in the find manual about null-terminating (instead of newline-terminating) your output.
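With GNU find and GNU sort, a null-terminated variant of the same pipeline would look like this (the tr calls only make the output readable and countable):
find directoryName -type f -printf '%f\0' | sort -zu | tr '\0' '\n'           # list unique names
find directoryName -type f -printf '%f\0' | sort -zu | tr -cd '\0' | wc -c    # count unique names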
Finally, if you're simply looking for a numeric output, pipe the whole thing through "wc -l" to count it.
find directoryName -type f -printf '%f\n' | sort | uniq -c | wc -l

GNU find: Search in current directory first

How can I tell find to look in the current folder first and then continue the search in subfolders? I have the following:
$ find . -iname '.note'
folder/1/.note
folder/2/.note
folder/3/.note
folder/.note
What I want is this:
$ find . -iname '.note'
folder/.note
folder/1/.note
folder/2/.note
folder/3/.note
Any ideas?
find's algorithm is as follows:
For each path given on the command line, let the current entry be that path, then:
1. Match the current entry against the expression.
2. If the current entry is a directory, then perform steps 1 and 2 for every entry in that directory, in an unspecified order.
With the -depth primary, steps 1 and 2 are executed in the opposite order.
What you're asking find to do is to consider files before directories in step 2. But find has no option to do that.
In your example, all names of matching files come before names of subdirectories in the same directory, so find . -iname '.note' | sort would work. But that obviously doesn't generalize well.
One way to process non-directories before directories is to run a find command to iterate over directories, and a separate command (possibly find again) to print matching files in that directory.
find -type d -exec print-matching-files-in {} \;
If you want to use a find expression to match files, here's a general structure for the second find command to iterate only over non-directories in the specified directory, non-recursively (GNU find required):
find -type d -exec find {} -maxdepth 1 \! -type d … \;
For example:
find -type d -exec find {} -maxdepth 1 \! -type d -iname '.note' \;
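Run against the directory tree from the question, this prints the match in folder before the matches in its subdirectories (though the order among the siblings 1, 2 and 3 remains unspecified):
$ find folder -type d -exec find {} -maxdepth 1 \! -type d -iname '.note' \;
folder/.note
folder/1/.note
folder/2/.note
folder/3/.note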
In zsh, you can write
print -l **/(#i).note(Od)
**/ recurses into subdirectories; (#i) (a globbing flag) interprets what follows as a case-insensitive pattern, and (Od) (a glob qualifier) orders the outcome of recursive traversals so that files in a directory are considered before subdirectories. With (Odon), the output is sorted lexicographically within the constraint laid out by Od (i.e. the primary sort criterion comes first).
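Applied to the question's tree, this yields the desired order directly:
$ print -l **/(#i).note(Od)
folder/.note
folder/1/.note
folder/2/.note
folder/3/.note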
Workaround would be find . -iname '.note' | sort -r:
$ find . -iname '.note' | sort -r
folder/.note
folder/3/.note
folder/2/.note
folder/1/.note
But here, the output is just sorted in reverse order and that does not change find's behaviour.
For me with GNU find on Linux I get both orderings with different test runs.
Testcase:
rm -rf /tmp/depthtest ; mkdir -p /tmp/depthtest ; cd /tmp/depthtest ; for dir in 1 2 3 . ; do mkdir -p $dir ; touch $dir/.note ; done ; find . -iname '.note'
With this test I get the poster's first result. Note the ordering "1 2 3 .". If I alter this ordering to ". 1 2 3"
rm -rf /tmp/depthtest ; mkdir -p /tmp/depthtest ; cd /tmp/depthtest ; for dir in . 1 2 3 ; do mkdir -p $dir ; touch $dir/.note ; done ; find . -iname '.note'
I get the poster's second result.
In either case adding -depth to find does nothing.
EDIT:
I wrote a Perl one-liner to look into this further:
perl -e 'opendir(DH,".") ; print join("\n", readdir(DH)),"\n" ; closedir(DH)'
And I ran this against /tmp/depthtest after running testcase 1 with these results:
.
..
1
2
3
.note
I ran it again after testcase 2 with these results:
.
..
.note
1
2
3
This confirms that the results are in directory order.
The -depth option to find only controls whether e.g. ./1/.note is processed before or after ./1/, not whether ./.note or ./1/ is first, so the order of the results is purely based on directory order (which is mostly creation order).
It might be helpful to look at How do I recursively list all directories at a location, breadth-first? to learn how to work around this problem.
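In the same spirit, one level-by-level workaround in plain shell (a sketch; GNU find assumed for -quit, and it rescans the tree once per depth):
depth=0
# print matches one depth at a time, so shallower matches always come first;
# stop at the first depth that contains no entries at all
while [ -n "$(find . -mindepth "$depth" -maxdepth "$depth" -print -quit)" ]; do
    find . -mindepth "$depth" -maxdepth "$depth" -iname '.note'
    depth=$((depth + 1))
done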
Doesn't find -s . -iname '.note' help? Or find . -iname '.note' | sort?
To search the current folder only:
find ./in_save/ -maxdepth 1 -type f | more
(73 results in this example.)

Rename files based on sorted creation date?

I have a directory filled with files with random names. I'd like to be able to rename them 'file 1', 'file 2', etc. based on chronological order, i.e. file creation date. I could write a short Python script, but then I wouldn't learn anything. I was wondering if there's a clever one-line command that can solve this. Could anyone point me in the right direction?
I'm using zsh.
Thanks!
For zsh:
while IFS=$'\t' read -r -A line; do mv -- "${line[2]}" "${line[1]%.*}.${line[2]}"; done < <(find . -maxdepth 1 -type f -printf "%T+\t%f\n")
For Bash (note the differences: the option to read is -a instead of -A, and indexing is zero-based instead of one-based):
while IFS=$'\t' read -r -a line; do mv -- "${line[1]}" "${line[0]%.*}.${line[1]}"; done < <(find . -maxdepth 1 -type f -printf "%T+\t%f\n")
These rename files by adding the modification date to the beginning of the original filename, which is retained to prevent name collisions.
A filename resulting from this might look like:
2009-12-15+11:08:52.original.txt
Because a tab is used as the field separator when read splits each line, filenames containing spaces are preserved (names containing tabs or newlines would still break this).
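Since the question mentions zsh, here is a glob-qualifier sketch for the literal 'file 1', 'file 2', ... naming: *(.) restricts the glob to plain files and Om orders them oldest-first by modification time (like the commands above, this uses modification time, not creation time):
i=1
for f in *(.Om); do
    mv -- "$f" "file $i"    # 'file 1' is the oldest, 'file 2' the next, ...
    (( i++ ))
done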
