Unix - Using grep to get unmatched lines

I am new to Unix. I want to grep the lines in file1 that do not match any of the patterns in file2. The real files have more than 1,000 lines each.
Example:
File1:
Hi(Everyone)
How(u)people(are)doing?
ThanksInadvance
File2:
Hi(Every
ThanksI
Required Result:
How(u)people(are)doing?
I want the pattern (e.g. "Hi(Every") to be used as-is by grep. It should return the unmatched lines from file1.

This line works for the given example:
grep -Fvf file2 file1
The three options used:
-F makes grep do fixed-string matching
-v inverts the matching
-f reads the patterns from a file
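With the sample files from the question, a quick run looks like this (the single line of output is the required result):
$ grep -Fvf file2 file1
How(u)people(are)doing?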

The grep flag -v inverts the match, so
grep -v "Hi(Every" File1
should return all lines from File1 that do not contain "Hi(Every".
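For instance, with the File1 above that would print the lines below; note that, unlike the -f approach, each pattern from File2 needs its own grep this way:
$ grep -v "Hi(Every" File1
How(u)people(are)doing?
ThanksInadvance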
best regards,
Jan

Related

Remove data in file1 against file2

This might be the worst example ever given on StackOverflow, but my purpose is to remove from File2 every line that matches a word in File1, ignoring case and removing the entire matching line. For example, Cats#123:bob would be removed from File2 because the word Cat appears in File1. So regardless of case sensitivity, if a matching word is found, the entire line should be eradicated.
Input (File1):
Cat
Dog
Horse
Wheel
MainFile (File2)
Cats#123:bob
dog#1:truth
Horse-1:fairytale
Wheel:tremendous
Divination:maximus
Desired output
Divination:maximus
As the output shows, only "Divination:maximus" should be printed, since no matching word for it appears in File1. I generally prefer to use sed or awk as I use Cygwin, but any suggestions are welcome; I can answer any questions you may have, thanks.
Here's what I've tried so far, but it isn't working: the wrong lines are being output. I'm fairly inexperienced, so I don't know how to build on the syntax below, and maybe it's completely irrelevant to the job at hand.
grep -avf file1.txt file2.txt > output.txt
The grep command can do that for you:
grep -v -i -f file1 file2
The -f file1 tells grep to use the patterns in file1
The -i flag means case insensitive
The -v flag means to select lines that do not contain those patterns
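Against the sample File1/File2 above this should leave only the unmatched line; if the File1 entries are meant literally rather than as regular expressions, adding -F is a sensible extra:
$ grep -v -i -f file1 file2
Divination:maximus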

How to use the output from sed command as a string and use that string in the grep command?

I have a file with only one string per row, "File1", and a file that has two strings per row, "File2". I want to go through File1 and, for every row/string, check how many of the rows in File2 it appears in.
I would like to use something like this:
grep '{sed -n 1p File1}' File2
But it doesn't work. Why and what should I do?
//Thanks!
It seems to me that you're asking two completely different questions: one in the title and one in the body. I'll try to summarize them and answer both.
First question:
How to use the output from sed command as a string and use that string in the grep command?
First answer:
The unix way is to pipe a command's output as input to another program. If you instead want to use the first program's output as e.g. a parameter to the subsequent command (not as its input), there's a handy program that allows you to do just that: xargs.
Example of piping the sed output as input to grep:
sed -n 1p File1 | grep 'needle'
Example of using the sed output as a search string for grep:
sed -n 1p File1 | xargs -I{} grep {} File2
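A command-substitution variant, not in the original answer but closest to what the question attempted, would be:
grep "$(sed -n 1p File1)" File2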
Second question:
I have a file with only one string per row, "File1", and a file that has two strings per row, "File2". I wan't go through File1 and check for every row/string in how many of the rows in File2 it appear.
Second answer:
awk 'NR==FNR{!a[$0]++;next} $1 in a{a[$1]++} END{for(i in a){print i" "a[i]-1}}' File1 File2
Test files:
==> File1 <==
one
two
three
==> File2 <==
one apple
two bananas
two strawberries
three kiwis
three pomegrenades
three lychees
Test run output:
three 3
two 2
one 1
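Spelled out with comments, the same logic reads roughly as follows (the leading ! in the original is a no-op as a standalone statement, so it is dropped here):
awk '
    NR==FNR { a[$0]++; next }                     # File1: remember each string; count starts at 1
    $1 in a { a[$1]++ }                           # File2: bump the count when column 1 is a known string
    END     { for (i in a) print i " " a[i]-1 }   # subtract the initial 1 from the File1 pass
' File1 File2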
That only works if you expect the string from File1 to appear as the first column of File2. For a more general approach, where you want to count every File2 row containing the string from File1:
awk 'NR==FNR{!a[$0]++;next} {for(i in a){if(match($0,i)){a[i]++}}} END{for(i in a){print i" "a[i]-1}}' File1 File2
Test files:
==> File1 <==
one
two
three
==> File2 <==
one one
two one
two two
three three
three two
three one
Test run output:
three 3
two 3
one 3
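One caveat not mentioned in the original answer: match($0, i) treats each File1 entry as a regular expression. If the entries may contain regex metacharacters, a literal-substring check with index() is safer:
awk 'NR==FNR{a[$0]++;next} {for(i in a){if(index($0,i)){a[i]++}}} END{for(i in a){print i" "a[i]-1}}' File1 File2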

One liner required, pref UNIX-based, for variation of JOIN command

I need a one-liner (that I can put in a dos batch file), preferably using a unix command like AWK or JOIN. The function I need is essentially a more elaborate version of the following JOIN command:
join -j 1 -a 1 file1.txt file2.txt -t "^" > output.txt
[Walkthrough: field separators are "^", the join key is the 1st field of both files, and I'm not exactly sure what the "-a 1" is doing, but it sticks the bit to be joined on the end of the row of the other file, which is what I want.]
Now, this one-liner works fine where both files are sorted and there is only one matching line in the 2nd file ... but I need it to try to match up to 4 lines in the 2nd file.
E.g.
file1:
12^blahblah
13^blahblahblahblah
14^blahblahblahblahblahblahblahblah
file2:
12^banana
12^orange
12^apple
13^potato
14^tomato
So I want the output like this:
12^blahblah^banana,orange,apple
13^blahblahblahblah^potato
14^blahblahblahblahblahblahblahblah^tomato
[Doesn't have to be a comma separating the new items]
You can try this awk command:
awk -F'^' 'NR==FNR{if($1 in a){a[$1]=a[$1]","$2} else {a[$1]=$2}} NR>FNR{print $0 "^" a[$1]}' file2 file1
The script fills an array a with the second field of file2, keyed by the first field and comma-joined on duplicates, and then appends the matching entry when parsing file1.
$ awk -F'^' 'NR==FNR{a[$1]=$0 FS;next} {a[$1] = a[$1] s[$1] $2; s[$1]=","} END{for (i in a) print a[i]}' file1 file2
12^blahblah^banana,orange,apple
13^blahblahblahblah^potato
14^blahblahblahblahblahblahblahblah^tomato
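For readability, the same one-liner with comments (identical behavior):
awk -F'^' '
    NR==FNR { a[$1] = $0 FS; next }             # file1: start each entry as e.g. "12^blahblah^"
    { a[$1] = a[$1] s[$1] $2; s[$1] = "," }     # file2: append the value; "," only from the second match on
    END { for (i in a) print a[i] }
' file1 file2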

Appending multiple files into one file

I append multiple data files into a single data file using the cat command. How can I assign that single file value into a new file?
I am using the command:
cat file1 file2 file3 > Newfile.txt
AnotherFile=`cat Newfile.txt`
sort $AnotherFile | uniq -c
It shows an error like "cannot open AnotherFile".
How do I assign this new file's contents to another file?
Original answer to original question
Well, the easiest way is probably cp:
cat file1 file2 file3 > Newfile.txt
cp Newfile.txt AnotherFile.txt
Failing that, you can use:
cat file1 file2 file3 > Newfile.txt
AnotherFile=$(cat Newfile.txt)
echo "$AnotherFile" > AnotherFile.txt
Revised answer to revised question
The original question had echo "$AnotherFile" as the third line; the revised question has sort $AnotherFile | uniq -c as the third line.
Assuming that sort $AnotherFile is not sorting all the contents of the files mentioned in the list created from concatenating the original files (that is, assuming that file1, file2 and file3 do not contain just lists of file names), then the objective is to sort and count the lines found in the source files.
The whole job can be done in a single command line:
cat file1 file2 file3 | tee Newfile.txt | sort | uniq -c
Or (more usually):
cat file1 file2 file3 | tee Newfile.txt | sort | uniq -c | sort -n
which lists the lines in increasing order of frequency.
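For a quick illustration of the uniq -c output format (a made-up example, not the question's data):
$ printf 'a\nb\na\na\n' | sort | uniq -c | sort -n
      1 b
      3 a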
If you really do want to sort the contents of the files listed in file1, file2, file3 but only list the contents of each file once, then:
cat file1 file2 file3 | tee Newfile.txt | sort -u | xargs sort | sort | uniq -c
It looks weird having three sort-related commands in a row, but there is justification for each step. The sort -u ensures each file name is listed once. The xargs sort converts a list of file names on standard input into a list of file names on the sort command line. The output of this is the sorted data from each batch of files that xargs produces. If there are so few files that xargs doesn't need to run sort more than once, then the following plain sort is redundant. However, if xargs has to run sort more than once, then the final sort has to deal with the fact that the first lines from the second batch produced by xargs sort probably come before the last lines produced by the first batch produced by xargs sort.
This becomes a judgement call based on knowledge of the data in the original files. If the files are small enough that xargs won't need to run multiple sort commands, omit the final sort. A heuristic would be "if the sum of the sizes of the source files is smaller than the maximum command line argument list, don't include the extra sort".
You can probably do that in one go:
# Write to two files at once. Both files have a constantly varying
# content until cat is finished.
cat file1 file2 file3 | tee Newfile.txt > Anotherfile.txt
# Save the output filename, just in case you need it later
filename="Anotherfile.txt"
# This reads the contents of Newfile into a variable called AnotherText
AnotherText=`cat Newfile.txt`
# This is the same as "cat Newfile.txt"
echo "$AnotherText"
# This saves AnotherText into Anotherfile.txt
echo "$AnotherText" > Anotherfile.txt
# This too, using cp and the saved name above
cp Newfile.txt "$filename"
If you want to create the second file all in one go, this is a common pattern:
# During this process the contents of tmpfile.tmp is constantly changing
{ slow process creating text } > tmpfile.tmp
# Very quickly create a complete Anotherfile.txt
mv tmpfile.tmp Anotherfile.txt
Create the file and redirect into it in append mode:
touch Newfile.txt
cat files* >> Newfile.txt

Intersection of two large word lists

I have two word lists (180k and 260k), and I would like to generate a third file which is the set of words that appear in BOTH lists.
What is the best (most efficient) way of doing this? I've read forums talking about using grep, but I think the word lists are too big for that method.
If the two files are sorted (or you can sort them), you can use comm -1 -2 file1 file2 to print out the intersection.
You are correct, grep would be a bad idea. Type "man join" and follow the instructions.
If your files are just lists of words in a single column, or at least, if the important word is the first on each line, then all you need to do is:
$ sort -b -o f1 file1
$ sort -b -o f2 file2
$ join f1 f2
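A toy run, with two hypothetical word lists, just to show the shape of the output:
$ printf 'apple\nbanana\ncherry\n' > file1
$ printf 'banana\ncherry\ndate\n' > file2
$ sort -b -o f1 file1
$ sort -b -o f2 file2
$ join f1 f2
banana
cherry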
Otherwise, you may need to give the join(1) command some additional instructions:
JOIN(1) BSD General Commands Manual JOIN(1)
NAME
join -- relational database operator
SYNOPSIS
join [-a file_number | -v file_number] [-e string] [-o list] [-t char] [-1 field] [-2 field] file1 file2
DESCRIPTION
The join utility performs an ``equality join'' on the specified files and writes the result to the standard output. The ``join field'' is the field in each file by which the files are compared. The
first field in each line is used by default. There is one line in the output for each pair of lines in file1 and file2 which have identical join fields. Each output line consists of the join field,
the remaining fields from file1 and then the remaining fields from file2.
. . .
. . .
Presuming one word per line, I would use grep:
grep -xFf seta setb
-x matches the whole lines (no partial matches)
-F interprets the given patterns literally (no regular expressions)
-f seta specifies the patterns to search
setb is the file to search for the contents of seta
comm will do the same thing, but requires your sets to be pre-sorted:
comm -12 <(sort seta) <(sort setb)
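A small sanity check with hypothetical seta/setb, showing both commands agree:
$ printf 'alpha\nbeta\ngamma\n' > seta
$ printf 'beta\ngamma\ndelta\n' > setb
$ grep -xFf seta setb
beta
gamma
$ comm -12 <(sort seta) <(sort setb)
beta
gamma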
grep -P '[ A-Za-z0-9]*' file1 | xargs -I {} grep {} file2 > file3
I believe this takes each line from file1, checks whether it appears in file2, and puts any matching lines into file3.
Back in the day, I found a Perl script that does something similar:
http://www.perlmonks.org/?node_id=160735
