Unix command to find string set intersections or outliers?

Is there a UNIX command on par with
sort | uniq
to find string set intersections or "outliers"?
An example application: I have a list of HTML templates; some of them contain the string {% load i18n %} and others don't. I want to know which files don't.
Edit: grep -L solves the above problem.
How about this:
file1:
mom
dad
bob
file2:
dad
%intersect file1 file2
dad
%left-unique file1 file2
mom
bob

It appears that grep -L solves the poster's real problem, but for the actual question asked (finding the intersection of two sets of strings) you might want to look into the comm command. For example, if file1 and file2 each contain a sorted list of words, one word per line, then
$ comm -12 file1 file2
will produce the words common to both files. More generally, given sorted input files file1 and file2, the command
$ comm file1 file2
produces three columns of output
lines only in file1
lines only in file2
lines in both file1 and file2
You can suppress column N of the output with the -N option. So the command above, comm -12 file1 file2, suppresses columns 1 and 2, leaving only the words common to both files.
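For instance, with the question's sample files (sorted on the fly with bash process substitution, since comm expects sorted input), the full three-column output would be:
$ comm <(sort file1) <(sort file2)
bob
		dad
mom
Here bob and mom sit in column 1 (only in file1) and dad is indented into column 3 (in both files). comm -12 prints just dad, and comm -23 prints bob and mom, i.e. the "left-unique" set from the question.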

Intersect:
# sort file1 file2 | uniq -d
dad
Left unique:
# sort file1 file2 | uniq -u
bob
mom
Note that this relies on neither file containing duplicate lines of its own: a line repeated within a single file would also show up under uniq -d, and would be dropped by uniq -u even if it appears in only one of the files.

From http://www.commandlinefu.com/commands/view/5710/intersection-between-two-files:
Intersection between two (unsorted) files:
grep -Fx -f file1 file2
Lines in file2 that are not in file1:
grep -Fxv -f file1 file2
Explanation:
The -f option tells grep to read the patterns to look for from a file. That means that it performs a search of file2 for each line in file1.
The -F option tells grep to treat the search terms as fixed strings rather than patterns, so that a.c will only match a.c and not abc.
The -x option tells grep to do whole line searches, so that "foo" in file1 won't match "foobar" in file2.
By default, grep will show only the matching lines, giving you the intersection. The -v option tells grep to only show non-matching lines, giving you the lines that are unique to file2.
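To get the complementary difference, the lines of file1 that are not in file2, simply swap the roles of the two files:
grep -Fxv -f file2 file1
With the question's sample data this prints mom and bob, i.e. the "left-unique" set.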

Maybe I'm misunderstanding the question, but why not just use grep to look for the string? Use the -L option to have it print the names of the files that don't contain the string.
In other words
grep -L "{% load i18n %}" file1 file2 file3 ... etc
or with wildcards for the file names as appropriate.

From man grep:
-L, --files-without-match
Suppress normal output; instead print the name of each input file from which no output would normally have been printed. The scanning will stop on the first match.
So if your templates are .html files you want:
grep -L '{% load i18n %}' *.html

Intersection:
comm -12 <(cat file1 | sort | uniq) <(cat file2 | sort | uniq)
All lines in 3 columns (only in file1 | only in file2 | intersection):
comm <(cat file1 | sort | uniq) <(cat file2 | sort | uniq)
If your files are not sorted, or if a line is duplicated within one file but absent from the other, these one-liners sort the files and remove duplicate lines first, so you get the desired result directly.
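As a side note, cat file | sort | uniq can be shortened to sort -u file, so an equivalent and slightly tidier form of the intersection command is:
comm -12 <(sort -u file1) <(sort -u file2)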

Related

How to remove values in a text file from another text file in R? [duplicate]

I have two files, file1 and file2, where target_id is the first column in both.
I want to compare file1 to file2, and only keep the rows of file1 whose target_id matches a target_id in file2.
file2:
target_id
ENSMUST00000128641.2
ENSMUST00000185334.7
ENSMUST00000170213.2
ENSMUST00000232944.2
Any help would be appreciated.
% grep -x -f file1 file2 resulted in no output in my terminal
Sample data that actually shows overlaps between the files.
file1.csv:
target_id,KO_1D_7dpi,KO_2D_7dpi
ENSMUST00000178537.2,0,0
ENSMUST00000178862.2,0,0
ENSMUST00000196221.2,0,0
ENSMUST00000179664.2,0,0
ENSMUST00000177564.2,0,0
file2.csv:
target_id
ENSMUST00000178537.2
ENSMUST00000196221.2
ENSMUST00000177564.2
Your grep command, but swapped:
$ grep -F -f file2.csv file1.csv
target_id,KO_1D_7dpi,KO_2D_7dpi
ENSMUST00000178537.2,0,0
ENSMUST00000196221.2,0,0
ENSMUST00000177564.2,0,0
Edit: we can add the -F argument since this is a fixed-string search; it also protects against the . being treated as a regex metacharacter and matching any character. Thanks to @Sundeep for the recommendation.
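If you want to be strict and match only against the first column of file1.csv (so an ID that happens to appear in some other column can never produce a false hit), a small awk sketch along these lines should also work; it is not part of the original answer:
awk -F, '
  NR == FNR { ids[$1]; next }   # first file (file2.csv): remember every target_id
  $1 in ids                     # second file (file1.csv): print rows whose first field was remembered
' file2.csv file1.csv
The header line of file1.csv is still printed because both files share the target_id header.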

List a count of similar filenames in a folder

I have a folder which contains multiple files of a similar name (differentiated by a datetime in the filename)
I would like to be able to get a count of each file group/type in that folder.
i.e.
file1_25102019_111402.csv
file1_24102019_111502.csv
file1_23102019_121402.csv
file1_22102019_101402.csv
file2_25102019_161404.csv
file2_24102019_131205.csv
file2_23102019_121306.csv
I need to be able to return something like this;
file1 4
file2 3
Ideally the answer would be something like "count of files whose first x characters are ABCD"
The filenames can be anything. The date part in the example was just to demonstrate that the filenames begin with similar text but are distinguished by "something" further along in the name (in this case a date).
So I want to be able to group them by the first X characters in the filename.
i.e. I want to be able to say, "give me counts of all files grouped by the first 4 characters, or the first 5 characters, etc."
In SQL I'd do something like this:
select substr(object_name,1,5),
count(*)
from all_objects
group by substr(object_name,1,5)
Edited to show more examples;
File1weifwoeivnw
File15430293fjwnc
File15oiejfiwem
File2sidfsfe
File29fu09f4n
File29ewfoiwwf
File22sdiufsnvfvs
Pseudo Code:
Example 1:
ls count of first 4 characters
Output
File 7
Example 2:
ls count of first 5 characters
Output
File1 3
File2 4
Example 3
ls count of first 6 characters
Output
File1w 1
File15 2
File2s 1
File29 2
File22 1
If you want to extract the first 5 characters you can use
ls | cut -c1-5 | sort | uniq -c | awk '{ print $2,$1 }'
which prints for the first example from the question
file1 4
file2 3
If you want to have a different number of characters, change the cut command as necessary, e.g. cut -c1-6 for the first 6 characters.
If you want to separate the fields with a TAB character instead of a space, change the awk command to
awk -vOFS=\\t '{ print $2,$1 }'
This would result in
file1	4
file2	3
Other solutions that work with the first example that shows file names with a date and time string, but don't work with the additional example added later:
With your first example files, the command
ls | sed 's/_[0-9]\{8\}_[0-9]\{6\}/_*/' | sort | uniq -c
prints
4 file1_*.csv
3 file2_*.csv
Explanation:
The sed command replaces the sequence of a _, 8 digits, another _ and another 6 digits with _*.
With your first example file names, you will get file1_*.csv 4 times and
file2_*.csv 3 times.
sort sorts the lines.
uniq -c counts the number of subsequent lines that are equal.
Or if you want to strip everything from the first _ up to the end, you can use
ls | sed 's/_.*//' | sort | uniq -c
which will print
4 file1
3 file2
You can add the awk command from the first solution to change the output format.
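For example, combining the second sed variant with that awk step gives, for the first set of example files:
ls | sed 's/_.*//' | sort | uniq -c | awk '{ print $2,$1 }'
file1 4
file2 3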

Extract column using grep

I have a data frame with >100 columns each labeled with a unique string. Column 1 represents the index variable. I would like to use a basic UNIX command to extract the index column (column 1) + a specific column string using grep.
For example, if my data frame looks like the following:
Index A B C...D E F
p1 1 7 4 2 5 6
p2 2 2 1 2 . 3
p3 3 3 1 5 6 1
I would like to use some command to extract only column "X" which I will specify with grep, and display both column 1 & the column I grep'd. I know that I can use cut -f1 myfile for the first bit, but need help with the grep per column. As a more concrete example, if my grep phrase were "B", I would like the output to be:
Index B
p1 7
p2 2
p3 3
I am new to UNIX, and have not found much in similar examples. Any help would be much appreciated!!
You need to use awk:
awk '{print $1,$3}' <namefile>
This simple command allows printing the first ($1) and third ($3) column of the file. The software awk is actually much more powerful. I think you should have a look at the man page of awk.
A nice combo is using grep and awk with a pipe. The following code will print column 1 and 3 of only the lines of your file that contain 'p1':
grep 'p1' <namefile> | awk '{print $1,$3}'
If, instead, you want to select lines by line number you can replace grep with sed (note the -n, without which sed would print every line):
sed -n 1p <namefile> | awk '{print $1,$3}'
Actually, awk can be used alone in all the examples:
awk '/p1/{print $1,$3}' <namefile> # will print only lines containing p1
awk '{if(NR == 1){print $1,$3}}' <namefile> # Will print only first line
First figure out the command to find the column number.
columnname=C
sed -n "1 s/${columnname}.*//p" datafile | sed 's/[^\t*]//g' | wc -c
Once you know the number, use cut
cut -f1,3 < datafile
Combine into one command
cut -f1,$(sed -n "1 s/${columnname}.*//p" datafile |
sed 's/[^\t]//g' | wc -c) < datafile
Finished? No, you should improve the first sed command when one header can be a substring of another header: include tabs in your match and put the tabs back in the replacement string.
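One possible form of that improvement (a sketch, assuming GNU sed, tab-separated headers, and a target header that is neither the first nor the last column) would be:
columnname=B
sed -n "1 s/\t${columnname}\t.*/\t/p" datafile | sed 's/[^\t]//g' | wc -c
Anchoring the header name between tabs means B can no longer match inside a longer header such as AB or BC, and putting a tab back in the replacement keeps the tab count, and therefore the column number, correct.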
If you want to retain the first column and a column which contains a specific string in the first row (eg. B), then this should work. It assumes that your string is only present once though.
awk '{if(NR==1){c=0;for(i=1;i<=NF;i++){c++;if($i=="B"){n=c}}}; print $1,$n}' myfile.txt
There is probably a better solution with amazing awk, but this should work.
EXPLANATION: In the first row (NR==1), it iterates through all columns with for(i=1;i<=NF;i++) until it finds the string and saves that column number; it then prints the first column and the saved column for every line.
If you want to pass the string as a variable then you can use the -v option.
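For example (a sketch; col is just the name I am giving the variable here):
awk -v col="B" '
  NR == 1 { for (i = 1; i <= NF; i++) if ($i == col) n = i }  # find the column whose header matches col
  { print $1, $n }                                            # print the index column and that column
' myfile.txt
Like the hard-coded version, it assumes the header name appears exactly once.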

Compare two files and move the common record to another file

There are two text files.
File1:
xchg_rat_to_cncy_cd=PLN
xchg_rat_to_cncy_cd=PLZ
xchg_rat_to_cncy_cd=PTE
File2:
epoc_id=1455718545/xchg_rat_to_cncy_cd=PLN
epoc_id=1455718545/xchg_rat_to_cncy_cd=PLZ
epoc_id=1455718545/xchg_rat_to_cncy_cd=PTE
The number of fields is not limited to 1 in file1 or 2 in file2, but the second file will always have the extra column as its first field.
This is what I have tried so far, with file1 having one column and file2 having two columns.
awk -F'/' 'FNR==NR{a[$2]=$1;next} ($1 in a) {print a[$1],$1}' FILE1 FILE2
But this is limited to file1 having one column and file2 having two columns.
I want to make it scalable so that it can handle files with n columns.
Kindly suggest a solution.
The expected output is somewhat unclear, but try this:
grep -f FILE1 FILE2
or
grep -o -f FILE1 FILE2
xchg_rat_to_cncy_cd=PLN
xchg_rat_to_cncy_cd=PLZ
xchg_rat_to_cncy_cd=PTE
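If you specifically want an awk version that scales to any number of fields, one sketch (assuming, as the question states, that FILE2 only differs by one extra leading /-separated field) is to strip the first field from each FILE2 line and look the remainder up in FILE1:
awk '
  FNR == NR { seen[$0]; next }               # first pass: remember every full record of FILE1
  { rest = substr($0, index($0, "/") + 1)    # second pass: drop the leading epoc_id=... field
    if (rest in seen) print }                # print FILE2 lines whose remainder exists in FILE1
' FILE1 FILE2
Redirect the output (> common.txt) to collect the common records in another file, or change print to print rest if you only want the FILE1-style record.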

awk script to identify duplicates by column and tag highest version

Looking for UNIX scripting commands such as awk, sed, or sort to look at this list, determine duplicates based on the 1st, 2nd, and 3rd columns, then use the 4th column to determine which is the highest version, and tag it in the last column.
Already did all of my sorting and duplicate removal with the following commands:
cat orig.txt |sort -t\| -k3 |uniq -u -d > list.uni
sort -t'|' -k 1,3 -k 4,4n list.uni > list.txt
Sample listing of part names we will call list.txt
cat list.txt
M86-STD|LIB DRAWING|0186-528013|2| |
M86-STD|LIB DRAWING|0186-528013|3|B|
M86-STD|LIB DRAWING|0186-528013|4|C|
M86-STD|LIB DRAWING|0186-528013|5|D|
M86_STD|LIB ASSEMBLY|0186-528013|3| |
M86_STD|LIB ASSEMBLY|0186-528013|4|B|
M86_STD|LIB ASSEMBLY|0186-528013|5|D|
The result would appear as follows:
M86-STD|LIB DRAWING|0186-528013|2| |
M86-STD|LIB DRAWING|0186-528013|3|B|
M86-STD|LIB DRAWING|0186-528013|4|C|
M86-STD|LIB DRAWING|0186-528013|5|D|Highestver
M86_STD|LIB ASSEMBLY|0186-528013|3| |
M86_STD|LIB ASSEMBLY|0186-528013|4|B|
M86_STD|LIB ASSEMBLY|0186-528013|5|D|Highestver
Thanks
awk -F'|' '{b=$1 FS $2 FS $3; if(NR>1){if(b!=a)p=p "Highestver"; print p} p=$0; a=b} END{print p "Highestver"}' list.txt
The idea is that b holds the three columns identifying the current row's group and a the key of the previous row, while p buffers the previous row instead of printing it immediately. When the key changes, the buffered row was the last of its group, and since the input is sorted numerically on column 4 it carries the highest version, so Highestver is appended before it is printed. The END block tags and prints the final row, which is always the last of its group.
