One liner required, pref UNIX-based, for variation of JOIN command - unix

I need a one-liner (that I can put in a dos batch file), preferably using a unix command like AWK or JOIN. The function I need is essentially a more elaborate version of the following JOIN command:
join -j 1 -a 1 file1.txt file2.txt -t "^" > output.txt
[Walkthrough: the field separator is "^" and the join key is the 1st field of both files. I'm not exactly sure what the "-a 1" is doing, but the command sticks the bit to be joined on the end of the row of the other file, which is what I want.]
Now, this one-liner works fine where both files are sorted and there is only one matching line in the 2nd file ... but I need it to try to match up to 4 lines in the 2nd file.
E.g.
file1:
12^blahblah
13^blahblahblahblah
14^blahblahblahblahblahblahblahblah
file2:
12^banana
12^orange
12^apple
13^potato
14^tomato
So I want the output like this:
12^blahblah^banana,orange,apple
13^blahblahblahblah^potato
14^blahblahblahblahblahblahblahblah^tomato
[Doesn't have to be a comma separating the new items]

You can try this awk command:
awk -F'^' 'NR==FNR{if($1 in a){a[$1]=a[$1]","$2} else {a[$1]=$2}} NR>FNR{print $0 "^" a[$1]}' file2 file1
The script fills an array a with the content of file2, then appends the matching array entry to each line when parsing file1.

$ awk -F'^' 'NR==FNR{a[$1]=$0 FS;next} {a[$1] = a[$1] s[$1] $2; s[$1]=","} END{for (i in a) print a[i]}' file1 file2
12^blahblah^banana,orange,apple
13^blahblahblahblah^potato
14^blahblahblahblahblahblahblahblah^tomato

Related

Unix- Using Grep to get unmatched lines

I am new to unix. I want to grep the lines of file1 that do not match any of the patterns listed in file2. The real files have more than 1000 lines.
Example:
File1:
Hi(Everyone)
How(u)people(are)doing?
ThanksInadvance
File2:
Hi(Every
ThanksI
Required Result:
How(u)people(are)doing?
I want only the partial pattern (like "Hi(Every") to be used for the grep. It should return the unmatched lines from file1.
This line works for the given example:
grep -Fvf file2 file1
The 3 options used above:
-F makes grep do fixed-string match
-v invert matching
-f get patterns from file
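As a quick check, here is a sketch that runs the command on the sample data from the question. The scratch directory and the `unmatched` variable are just for illustration.

```shell
# Work in a scratch directory so the sample files do not clobber anything.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

printf '%s\n' 'Hi(Everyone)' 'How(u)people(are)doing?' 'ThanksInadvance' > file1
printf '%s\n' 'Hi(Every' 'ThanksI' > file2

# -F: fixed strings, -v: invert the match, -f file2: read the patterns from file2.
unmatched=$(grep -Fvf file2 file1)
printf '%s\n' "$unmatched"
```

One thing to watch for: an empty line in file2 acts as an empty pattern, which matches every line, so with -v nothing would be printed.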
The grep flag -v inverts the match.
grep -vF 'Hi(Every' File1
should return all lines from File1 that do not contain Hi(Every (the -F makes grep treat the pattern as a literal string).
best regards,
Jan

need some help on awk command

I need some help with awk. I am reading a csv file and doing some substitution on some of the columns. The 9th column (string type) should be replaced by the value of the 9th column itself + the value of the 4th column (integer), then the 15th column by $15+$12, and the 26th column by $26+$23. The same has to be done line by line for all the records. Suggestions please.
Below is the sample I/O. The first line, which is the description, must be left as is.
sample Input
EmpID|Empname|Empadd|roleId|roleDesc|Dept
100|mst|Del|20|SD|DA
101|ms|Del|21|XS|DA
Sample output
EmpID|Empname|Empadd|roleId|roleDesc|Dept
100|mst100|Del|20|SD20|DA
101|ms101|Del|21|XS21|DA
It's like Empname has been concatenated with EmpID, and roleDesc with roleId. Hope that's helpful :)
This will perform the needed transformation:
$ awk 'NR>1{$2=$2$1;$5=$5$4}1' FS='|' OFS='|' file
EmpID|Empname|Empadd|roleId|roleDesc|Dept
100|mst100|Del|20|SD20|DA
101|ms101|Del|21|XS21|DA
If you have to do this for many columns you can use a for loop like so (provided the columns follow an arithmetic or geometric step size):
$ awk 'NR>1{for(i=2;i<=5;i+=3)$i=$i$(i-1)}1' FS='|' OFS='|' file
EmpID|Empname|Empadd|roleId|roleDesc|Dept
100|mst100|Del|20|SD20|DA
101|ms101|Del|21|XS21|DA
When you say +, I'm assuming you mean string concatenation. In awk, there is no specific concatenation operator; you just put two strings side by side.
awk -F, -v OFS=, '{$9 = $9 $4; $15=$15$12; $26=$26$23; print}' file.csv
Also assuming that by "csv", you actually mean comma-separated.
If you want to edit the file in-place, you need to do this:
awk ... file.csv > newfile && mv file.csv file.csv.bak && mv newfile file.csv
Edit: to leave the first line untouched:
awk -F, -v OFS=, 'NR>1 {$9 = $9 $4; $15=$15$12; $26=$26$23} {print}' file.csv
Now the columns are modified for the 2nd and subsequent lines, but every line is printed.
You'll sometimes see that written this way:
awk -F, -v OFS=, 'NR>1 {$9 = $9 $4; $15=$15$12; $26=$26$23} 1' file.csv
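If the column pairs do not follow a regular step, one option is to pass them in as a list. The sketch below is only an illustration: the `pairs` variable and its target:source format are my own invention, and it uses the "|" separated sample from the question rather than a comma-separated file.

```shell
# Work in a scratch directory so the sample files do not clobber anything.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

printf '%s\n' 'EmpID|Empname|Empadd|roleId|roleDesc|Dept' \
              '100|mst|Del|20|SD|DA' \
              '101|ms|Del|21|XS|DA' > sample.txt

# pairs is a hypothetical "target:source" list: append column 1 to column 2,
# and column 4 to column 5.
awk -F'|' -v OFS='|' -v pairs='2:1,5:4' '
    BEGIN { n = split(pairs, pair, ",") }
    NR > 1 {
        for (i = 1; i <= n; i++) {
            split(pair[i], col, ":")
            $(col[1]) = $(col[1]) $(col[2])
        }
    }
    { print }
' sample.txt > result.txt

result=$(cat result.txt)
printf '%s\n' "$result"
```

For the columns named in the question, the list would be pairs='9:4,15:12,26:23'.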

How to save both matching and non-matching from grep

I use grep very often and am familiar with its ability to return matching lines (by default) and non-matching lines (using the -v parameter). However, I want to be able to grep a file once to separate matching and non-matching lines.
If this is not possible, please let me know. I realize I could do this easily in perl or awk, but am curious if it is possible with grep.
Thanks!
If it does NOT have to be grep, this is a single-pass split based on a pattern: lines where the pattern is found go to file1, and lines where it is not found go to file2.
awk '/pattern/ {print $0 > "file1"; next}{print $0 > "file2"}' inputfile
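A small sketch of that split on made-up data. The pattern "apple" and the file names matched.txt and unmatched.txt are placeholders I picked for the example.

```shell
# Work in a scratch directory so the sample files do not clobber anything.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

printf '%s\n' 'apple pie' 'banana split' 'apple tart' 'cherry cake' > inputfile

# One pass over the input: matching lines go to one file, the rest to another.
awk '/apple/ { print > "matched.txt"; next } { print > "unmatched.txt" }' inputfile

matched=$(cat matched.txt)
unmatched=$(cat unmatched.txt)
printf 'matched:\n%s\nunmatched:\n%s\n' "$matched" "$unmatched"
```

Inside awk, > truncates the file only when it is first opened, so later prints in the same run append to it. If no line ever goes to one of the files, that file is not created at all.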
I had the exact same problem and I wrote a small Perl script for that [1]. It only accepts one argument: the regex to grep input on.
[1] https://gist.github.com/tonejito/c9c0bffd75d8c81483f9107c609439e1
It reads STDIN by line and checks against the given regex, matched lines go to STDOUT and not matched go to STDERR.
I made it this way because this tool sits in the middle of a pipeline and I use shell redirection to save the files on their final location.
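The same stdout/stderr idea can be sketched with awk instead of Perl. This is not the linked script, only an illustration of the behaviour described, and it assumes your awk (or your OS) provides /dev/stderr. The pattern and file names are made up.

```shell
# Work in a scratch directory so the sample files do not clobber anything.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

printf '%s\n' 'apple pie' 'banana split' 'apple tart' 'cherry cake' > inputfile

# Matching lines go to stdout, everything else to stderr;
# the shell redirections decide where each stream ends up.
awk '/apple/ { print; next } { print > "/dev/stderr" }' inputfile > matched.txt 2> unmatched.txt

matched=$(cat matched.txt)
unmatched=$(cat unmatched.txt)
printf 'matched:\n%s\nunmatched:\n%s\n' "$matched" "$unmatched"
```

Keeping the file names out of the filter means the same command can sit in the middle of a pipeline, with only the non-matching lines being diverted.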
Step 1: Read the file
Step 2: Replace spaces with newlines and save the result in a temporary file
Step 3: Get only the lines that contain '_' from the temporary file and save them into multiwords.txt
Step 4: Exclude the lines that contain '_' from the temporary file, then save the result into singlewords.txt
Step 5: Delete the temporary file
tr ' ' '\n' < file > tmp.txt; grep '_' tmp.txt > multiwords.txt; grep -v '_' tmp.txt > singlewords.txt; rm tmp.txt
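The steps above can also be done in a single pass without the temporary file, by letting awk do the split. This is a sketch on made-up input, not a drop-in replacement; the sample words are invented.

```shell
# Work in a scratch directory so the sample files do not clobber anything.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

printf '%s\n' 'foo_bar baz' 'qux multi_word_term end' > file

# Put one word on each line, then route words containing "_" to
# multiwords.txt and all other words to singlewords.txt.
tr ' ' '\n' < file |
    awk '/_/ { print > "multiwords.txt"; next } { print > "singlewords.txt" }'

multi=$(cat multiwords.txt)
single=$(cat singlewords.txt)
printf 'multi:\n%s\nsingle:\n%s\n' "$multi" "$single"
```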

File1 + (File2 - first line) > File3

I have two csv/text files that I'd like to join. Both contain the same first line. I'm trying to figure out how to use sed and cat to produce a merged file, but with only one copy of the first line. And I'm having a time with syntax. Any help would be greatly appreciated :-D!
Thanks,
Andrew
Another option with awk:
awk 'NR==FNR || FNR>1' file1.txt file2.txt .. fileN.txt
This prints all lines in the first file, OR any line in subsequent files after the first line.
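A short sketch with three made-up csv files that share a header line. The file contents and the merged.txt name are only for illustration.

```shell
# Work in a scratch directory so the sample files do not clobber anything.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

printf '%s\n' 'id,name' '1,alice' '2,bob' > file1.txt
printf '%s\n' 'id,name' '3,carol' > file2.txt
printf '%s\n' 'id,name' '4,dave' > file3.txt

# NR==FNR is only true while reading the first file, so its header is kept;
# FNR>1 skips the first line of every later file.
awk 'NR==FNR || FNR>1' file1.txt file2.txt file3.txt > merged.txt

merged=$(cat merged.txt)
printf '%s\n' "$merged"
```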
This will combine files data1.txt and data2.txt in file merged.txt, skipping the first line from data2.txt. It uses awk if you are ok with it:
(cat data1.txt; awk 'NR>1' data2.txt) > merged.txt
awk appends all lines with line number > 1 from file data2.txt to file merged.txt.
NR is a built-in awk variable that holds the number of input lines (records) read so far; with a single input file, that is the current line number. If the Boolean expression NR > 1 is true, awk prints the line implicitly.
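To see how NR behaves once there is more than one input file, here is a small sketch that prints NR next to FNR, the per-file counter. The two tiny data files are made up.

```shell
# Work in a scratch directory so the sample files do not clobber anything.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

printf '%s\n' 'header' 'a' > data1.txt
printf '%s\n' 'header' 'b' > data2.txt

# NR keeps counting across files; FNR restarts at 1 for each file.
counters=$(awk '{ print FILENAME, NR, FNR }' data1.txt data2.txt)
printf '%s\n' "$counters"
```

So NR>1 skips the header when awk is given a single file, while a command that reads several files needs FNR>1 to skip the header of each one.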
If you didn't care about keeping data1.txt intact, you could just append your 2nd file (minus its first line) and reduce to just this:
awk 'NR>1' data2.txt >> data1.txt
I'd say the most straightforward solution is:
( cat file1.txt ; tail -n +2 file2.txt ) > file3.txt
It has the advantage of stating clearly just what you're doing: print the entire first file, then print all but the first line of the second file, writing the output to the third file.
Solved with one line.
'1 d' tells sed to delete the first line of its input (file2 itself is not modified).
The following command appends the result to file1:
sed '1 d' file2 >> file1

Intersection of two large word lists

I have two word lists (180k and 260k), and I would like to generate a third file which is the set of words that appear in BOTH lists.
What is the best (most efficient) way of doing this? I've read forums talking about using grep, however I think the word lists are too big for this method.
If the two files are sorted (or you can sort them), you can use comm -1 -2 file1 file2 to print out the intersection.
You are correct, grep would be a bad idea. Type "man join" and follow the instructions.
If your files are just lists of words in a single column, or at least, if the important word is the first on each line, then all you need to do is:
$ sort -b -o f1 file1
$ sort -b -o f2 file2
$ join f1 f2
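A minimal sketch of those three commands on two tiny made-up word lists. Setting LC_ALL=C is my own addition, on the assumption that sort and join should use the same collation order; the -b flag is left out because the sample words have no leading blanks.

```shell
# Work in a scratch directory so the sample files do not clobber anything.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

printf '%s\n' 'pear' 'apple' 'fig' 'banana' > file1
printf '%s\n' 'cherry' 'banana' 'apple' > file2

# Sort both lists with the same collation, then join on the first field.
LC_ALL=C sort -o f1 file1
LC_ALL=C sort -o f2 file2
common=$(LC_ALL=C join f1 f2)
printf '%s\n' "$common"
```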
Otherwise, you may need to give the join(1) command some additional instructions:
JOIN(1) BSD General Commands Manual JOIN(1)
NAME
join -- relational database operator
SYNOPSIS
join [-a file_number | -v file_number] [-e string] [-o list] [-t char] [-1 field] [-2 field] file1 file2
DESCRIPTION
The join utility performs an ``equality join'' on the specified files and writes the result to the standard output. The ``join field'' is the field in each file by which the files are compared. The
first field in each line is used by default. There is one line in the output for each pair of lines in file1 and file2 which have identical join fields. Each output line consists of the join field,
the remaining fields from file1 and then the remaining fields from file2.
. . .
. . .
Presuming one word per line, I would use grep:
grep -xFf seta setb
-x matches the whole lines (no partial matches)
-F interprets the given patterns literally (no regular expressions)
-f seta specifies the patterns to search
setb is the file to search for the contents of seta
comm will do the same thing, but requires your sets to be pre-sorted:
comm -12 <(sort seta) <(sort setb)
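A small sketch comparing the two on made-up lists, including what happens if -x is left off. It uses temporary sorted files instead of <(...) so that it also runs in a plain sh; the LC_ALL=C setting is my own addition to keep sort and comm on the same collation order.

```shell
# Work in a scratch directory so the sample files do not clobber anything.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

printf '%s\n' 'cat' 'dog' 'bird' > seta
printf '%s\n' 'category' 'dog' 'cat' > setb

# With -x only whole-line matches count; without it "category" also matches "cat".
with_x=$(grep -xFf seta setb)
without_x=$(grep -Ff seta setb)

# comm needs both inputs sorted in the same collation order.
LC_ALL=C sort seta > seta.sorted
LC_ALL=C sort setb > setb.sorted
via_comm=$(LC_ALL=C comm -12 seta.sorted setb.sorted)

printf 'grep -xFf:\n%s\ngrep -Ff:\n%s\ncomm -12:\n%s\n' "$with_x" "$without_x" "$via_comm"
```

grep prints the common words in setb's order, while comm prints them in sorted order, so the two results contain the same words but may be ordered differently.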
grep -P '[ A-Za-z0-9]*' file1 | xargs -0 -I {} grep {} file2 > file3
I believe this looks for anything in file1, then checks if what was in file1 is in file2, and puts anything that matches into file3.
Back in the day I managed to find a Perl script that does something similar:
http://www.perlmonks.org/?node_id=160735
