I want to replace exact matching strings in a tabular file. My file looks something like this:
I 1 10 100
II 2 20 200
III 3 30 300
IX 9 100 1000
I want the output file to look like this:
chrI 1 10 100
chrII 2 20 200
chrIII 3 30 300
chrIX 9 100 1000
I have tried a sed command with no success:
sed -i -e 's/"III"/chrIII/g' -e 's/"II"/chrII/g' -e 's/"I"/chrI/g' filename
Please identify the mistake. Also, the file is very large; can awk be used for the same task?
sed approach:
sed -E -i -e 's/\b(I|II|III|IX)\b/chr\1/g' input
DEMO OUTPUT:
~$ sed -e 's/\b\(I\|II\|III\|IX\)\b/chr\1/g' input
chrI 1 10 100
chrII 2 20 200
chrIII 3 30 300
chrIX 9 100 1000
POSIX version (without | and \b):
sed -e 's/^I\{1,3\}[[:blank:]]/chr&/' -e 's/^IX[[:blank:]]/chr&/' input > output
In your command:
sed -i -e 's/"III"/chrIII/g' -e 's/"II"/chrII/g' -e 's/"I"/chrI/g' filename
"III" search for "III" not III
g is not a fault but not necessary (search 1st element of the string and only 1 in your sample, is it more ?)
be careful, searching I will also find II if not delimited (here ^ for starting + a [[:blank:]] after)
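Both points can be demonstrated directly on the first sample line (a minimal sketch):

```shell
# The pattern "I" (quotes included) never matches, because the input has no quotes
printf 'I 1 10 100\n' | sed 's/"I"/chrI/'
# → I 1 10 100   (unchanged)

# Anchoring to the start of line plus a following blank matches exactly one field
printf 'I 1 10 100\n' | sed 's/^I[[:blank:]]/chrI /'
# → chrI 1 10 100
```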
awk approach:
awk '$1~/^[IXV]*$/{$1="chr"$1}1' OFS='\t' file > tmp && mv tmp file
file contents after modification:
chrI 1 10 100
chrII 2 20 200
chrIII 3 30 300
chrIX 9 100 1000
$1="chr"$1 - prepends the string chr to the first field of each line
To put "chr" in front of the first column, either of the following works well on the example input:
awk '{print "chr"$0}' file
awk '{$1="chr"$1}1' file
chrI 1 10 100
chrII 2 20 200
chrIII 3 30 300
chrIX 9 100 1000
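Note the two one-liners differ in how they treat the original spacing: print "chr"$0 leaves the record untouched, while assigning to $1 makes awk rebuild the record with OFS (a single space by default). For example, on a tab-separated line:

```shell
# Prepending to $0 preserves the original tabs
printf 'I\t1\t10\n' | awk '{print "chr"$0}'
# → chrI	1	10

# Assigning to $1 rebuilds the record, joining fields with OFS (a space)
printf 'I\t1\t10\n' | awk '{$1="chr"$1}1'
# → chrI 1 10
```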
Related
I want to match columns 1 and 2 of the file for_matching against files named /.file in different directories in the path, and print the entire row that matches those columns.
/.file (example)
carrot 124555 1 2 6
hair 9922 2 2 2
tree 2223 2 1 2
for_matching
carrot 124555
output
carrot 124555 1 2 6
Right now I can just match column 1 between the two.
for i in */*.file; do awk -F, 'FNR==NR {a[$1]=$0; next}; $1 in a {print a[$1]}' $i for_matching > $i.matched; done
Using awk
awk 'FNR==NR{arr[$1,$2];next}(($1,$2) in arr)' for_matching file
Test Results:
$ cat file
carrot 124555 1 2 6
hair 9922 2 2 2
tree 2223 2 1 2
$ cat for_matching
carrot 124555
$ awk 'FNR==NR{arr[$1,$2];next}(($1,$2) in arr)' for_matching file
carrot 124555 1 2 6
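The comma in arr[$1,$2] is not a true two-dimensional index: awk concatenates the two values with the SUBSEP character into a single string key, and ($1,$2) in arr tests that combined key. A small sketch:

```shell
awk 'BEGIN {
  a["carrot","124555"]                       # stores the key "carrot" SUBSEP "124555"
  print (("carrot","124555") in a)           # parenthesized-list membership test
  print ("carrot" SUBSEP "124555" in a)      # the same key spelled out explicitly
  print (("carrot","999") in a)              # a key that was never stored
}'
# → 1
# → 1
# → 0
```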
The same works with multiple files; there is no need for ls */*.file:
#!/usr/bin/env bash
for i in */*.file; do
awk 'FNR==NR{arr[$1,$2];next}(($1,$2) in arr)' for_matching "$i" >"$i.matched"
done
That is so simple that you could just:
$ grep -F -w -f for_matching file
carrot 124555 1 2 6
For a limitation see karakfa's comment below.
That could of course be circumvented with:
$ cat file
carrot 124555 1 2 6
1 carrot 124555 1 2 6
$ grep -w -f <(sed 's/^/^/g' for_matching) file
carrot 124555 1 2 6
I have two files which both have over a million lines and I want to print out the match from the first column of file 1 with the second of file 2.
file_1
SNP_A-2131660 1 0.0021
SNP_A-1967418 2 0.0005
SNP_A-1969580 2 0.0011
SNP_A-4263484 1 0.0024
SNP_A-1978185 0 0.0014
SNP_A-4264431 0 0.0038
SNP_A-1980898 2 0.0050
SNP_A-1983139 0 0.0011
file_2
"AFFX-SNP_10000979" "rs4147951"
"AFFX-SNP_10009702" "rs2022235"
"SNP_A-2131660" "rs6425720"
"AFFX-SNP_10021569" "rs12997193"
"AFFX-SNP_10026879" "rs9933410"
"AFFX-SNP_10029725" "rs7142489"
"AFFX-SNP_10034687" "rs1350088"
matches.txt
"SNP_A-2131660" "rs6425720"
Right now I am doing this, but it is very slow. Is there a faster way?
awk '{print $1}' file_1 | while read -r a; do grep -H $a file_2; done >> matches.txt
Could you please try the following awk:
awk 'FNR==NR{a[$1]=$0;next} {val=$1;gsub(/\"/,"",val)} (val in a)' file_1 file_2
If you want to redirect the above command's output into a file, append > output.txt (or whatever name you like for the output file) to the end of the command.
Using awk and setting the field separator FS='"' after reading the first file:
awk 'FNR==NR{a[$1];next}$2 in a' file1 FS='"' file2
# OR
awk -v q='"' 'FNR==NR{a[q $1 q];next}$1 in a' file1 file2
It should be faster than all the solutions posted so far, as we are not sorting the files, replacing double quotes, or joining them.
Test Results:
$ cat f1
SNP_A-2131660 1 0.0021
SNP_A-1967418 2 0.0005
SNP_A-1969580 2 0.0011
SNP_A-4263484 1 0.0024
SNP_A-1978185 0 0.0014
SNP_A-4264431 0 0.0038
SNP_A-1980898 2 0.0050
SNP_A-1983139 0 0.0011
$ cat f2
"AFFX-SNP_10000979" "rs4147951"
"AFFX-SNP_10009702" "rs2022235"
"SNP_A-2131660" "rs6425720"
"AFFX-SNP_10021569" "rs12997193"
"AFFX-SNP_10026879" "rs9933410"
"AFFX-SNP_10029725" "rs7142489"
"AFFX-SNP_10034687" "rs1350088"
$ awk 'FNR==NR{a[$1];next}$2 in a' f1 FS='"' f2
"SNP_A-2131660" "rs6425720"
# OR
$ awk -v q='"' 'FNR==NR{a[q $1 q];next}$1 in a' f1 f2
"SNP_A-2131660" "rs6425720"
One possible tool is join, but you need some preprocessing and some post-processing:
sort < file1 > file1_sorted
sort < file2 | sed 's/^"\([^"]*\)"/\1/' > file2_sorted
join -1 1 -2 1 file1_sorted file2_sorted
Gives:
SNP_A-2131660 1 0.0021 "rs6425720"
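Putting the pipeline together end to end (file names and contents here are cut down from the question's samples):

```shell
# Two small sample files in the question's formats
printf 'SNP_A-2131660 1 0.0021\nSNP_A-1967418 2 0.0005\n' > f1
printf '"SNP_A-2131660" "rs6425720"\n"AFFX-SNP_10000979" "rs4147951"\n' > f2

sort f1 > f1_sorted
# strip the quotes from the join field; the quote is a common prefix,
# so sorting before stripping does not change the order
sort f2 | sed 's/^"\([^"]*\)"/\1/' > f2_sorted

join -1 1 -2 1 f1_sorted f2_sorted
# → SNP_A-2131660 1 0.0021 "rs6425720"
```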
How can I print from the 3rd column to the last column using awk, if there are n columns in a file? I can get it with the cut command, but I need an awk command. I am trying awk -F" " '{for(i=3;i<=NF;i++) print $i}'; I get the output, but it is not in the correct format. Can anyone suggest the proper command?
Combining Ed Morton's answers in:
Print all but the first three columns
delete a column with awk or sed
We get something like this:
awk '{sub(/^(\S+\s*){2}/,""); sub(/(\s*\S+){2}$/,"")}1'
# ^^^^^^^^^^^^^^^^^^^^^^ ^^^^^^^^^^^^^^^^^^^^^^
# remove 2 first cols remove 2 last cols
Which you can adapt to your exact needs in terms of columns.
See an example given this input:
$ paste -d ' ' <(seq 5) <(seq 2 6) <(seq 3 7) <(seq 4 8) <(seq 5 9)
1 2 3 4 5
2 3 4 5 6
3 4 5 6 7
4 5 6 7 8
5 6 7 8 9
Let's just print the 3rd column:
$ awk '{sub(/^(\S+\s*){2}/,""); sub(/(\s*\S+){2}$/,"")}1' <(paste -d ' ' <(seq 5) <(seq 2 6) <(seq 3 7) <(seq 4 8) <(seq 5 9))
3
4
5
6
7
Your attempt was close, but it prints each and every column on a new line.
To correct this we create a variable called 'line' and initialize it to an empty string. The first time we are in the loop we just add the column to 'line'. From that point on we will append to 'line' with the field separator and the next column. Finally, we print 'line'. This will happen for each line in the file.
awk '{line="";for(i=3;i<=NF;i++) if(i==3) line=$i; else line=line FS $i; print line}'
In this example I assume awk's default field separator. Also, any line with fewer than three fields will print as a blank line.
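Running that command on two sample-style lines:

```shell
printf '1 2 3 4 5\n2 3 4 5 6\n' |
awk '{line="";for(i=3;i<=NF;i++) if(i==3) line=$i; else line=line FS $i; print line}'
# → 3 4 5
# → 4 5 6
```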
Assuming your fields are space-separated then with GNU awk for gensub():
$ cat file
a b c d e f
g h i j k l
$ awk '{print gensub(/(\S\s){2}/,"",1)}' file
c d e f
i j k l
In general to print from, say, field 3 to field 5 if they are blank separated using GNU awk again with gensub():
$ awk '{print gensub(/(\S\s){2}((\S\s){2}\S).*/,"\\2",1)}' file
c d e
i j k
or the 3rd arg to match():
$ awk 'match($0,/(\S\s){2}((\S\s){2}\S)/,a){print a[2]}' file
c d e
i j k
or in general if they are separated by any single character:
$ awk '{print gensub("([^"FS"]"FS"){2}(([^"FS"]"FS"){2}[^"FS"]).*","\\2",1)}' file
c d e
i j k
$ awk 'match($0,"([^"FS"]"FS"){2}(([^"FS"]"FS"){2}[^"FS"])",a){print a[2]}' file
c d e
i j k
If the fields are separated by a string instead of a single-character but the RS is a single character then you should temporarily change FS to RS (which by definition you KNOW can't be present in the record) so you can negate it in the bracket expressions:
$ cat file
aSOMESTRINGbSOMESTRINGcSOMESTRINGdSOMESTRINGeSOMESTRINGf
gSOMESTRINGhSOMESTRINGiSOMESTRINGjSOMESTRINGkSOMESTRINGl
$ awk -F'SOMESTRING' '{gsub(FS,RS)} match($0,"([^"RS"]"RS"){2}(([^"RS"]"RS"){2}[^"RS"])",a){gsub(RS,FS,a[2]); print a[2]}' file
cSOMESTRINGdSOMESTRINGe
iSOMESTRINGjSOMESTRINGk
If both the FS and the RS are multi-char then there are various options, but the simplest is to use the NUL character (or some other character you know can't appear in your input file) instead of RS as the temporary replacement FS:
$ awk -F'SOMESTRING' '{gsub(FS,"\0")} match($0,/([^\0]\0){2}(([^\0]\0){2}[^\0])/,a){gsub("\0",FS,a[2]); print a[2]}' file
cSOMESTRINGdSOMESTRINGe
iSOMESTRINGjSOMESTRINGk
Obviously change FS to OFS in the final gsub()s above if desired.
If the FS was a regexp instead of a string and you want to retain it in the output then you need to look at GNU awk for the 4th arg for split().
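As a sketch of that split() approach (assumes GNU awk is installed as gawk; the sample input is made up): the 4th argument collects each separator that actually matched, so a regexp FS can be reproduced verbatim in the output.

```shell
echo 'a--b-c--d-e' | gawk '{
  n = split($0, f, /-+/, seps)    # f = fields, seps = the matched separators
  out = f[3]                      # start at the 3rd field
  for (i = 3; i < n; i++)
    out = out seps[i] f[i+1]      # re-attach each original separator
  print out
}'
# → c--d-e
```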
If you don't mind normalizing the space, the most straightforward way is
$ awk '{$1=$2=""}1' | sed -r 's/^ +//'
in action
$ seq 11 40 | pr -6ts' ' | awk '{$1=$2=""}1' | sed -r 's/^ +//'
21 26 31 36
22 27 32 37
23 28 33 38
24 29 34 39
25 30 35 40
for the input
$ seq 11 40 | pr -6ts' '
11 16 21 26 31 36
12 17 22 27 32 37
13 18 23 28 33 38
14 19 24 29 34 39
15 20 25 30 35 40
To print from the third column to the end:
awk '{for(i=1;i<3;i++) $i=""; print $0}' filename
I am comparing two files containing a large array of numbers using diff. The files look identical except some numbers have minus signs instead of plus signs. Is there an option to ignore a character like '-' and then compare the files?
Just to clarify:
If file 1 is :
1 -1 5 -3
and the other is
1 1 5 3
I want the return to say they are identical.
Thanks!
You could use sed to take out the - characters before diffing:
sed s/-//g f1 | diff - f2
Example:
$ cat f1
1 -1 5 -3
$ cat f2
1 1 5 3
$ sed s/-//g f1 | diff - f2
$ echo $?
0
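If both files may contain minus signs, the same idea works by cleaning both sides first (a sketch using the question's sample data):

```shell
printf '1 -1 5 -3\n' > f1      # sample files from the question
printf '1 1 5 3\n'  > f2

# Strip '-' from both files, then diff the cleaned copies
sed 's/-//g' f1 > f1.clean
sed 's/-//g' f2 > f2.clean
diff f1.clean f2.clean && echo identical
# → identical
```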
Suppose the filter is fed these lines on standard input:
line 1
line 2
line 3
line 4
line 5
line 6
line 7
line 8
line 9
line 10
It would be nice if someone could tell me how to write a script that prints only every fourth line; in the case of the example input above:
line 1
line 5
line 9
$ yes | cat -n | head -10 | awk 'NR % 4 == 1'
1 y
5 y
9 y
That is, your answer is "awk 'NR % 4 == 1'".
awk '{ if ((NR-1) %4 ==0) print}'
awk 'NR%4 == 1 {print}' < file
Replace 4 by whatever value you want of course.
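The step can also be passed in as an awk variable instead of editing the script:

```shell
# -v n=4 sets the step from the command line; NR % n == 1 keeps lines 1, 5, 9, ...
seq 10 | awk -v n=4 'NR % n == 1'
# → 1
# → 5
# → 9
```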
sed -ne '1~4p'
(GNU sed. Not tested on OSX, etc.)