AWK to use multiple spaces as delimiter - unix

I am using the command below to join two files on their first two columns.
awk 'NR==FNR{a[$1,$2]=substr($0,3);next} ($1,$2) in a{print $0, a[$1,$2] > "br0102_3.txt"}' br01.txt br02.txt
Now, by default awk uses whitespace as the field separator, but my file may contain a single space between two words inside a field, e.g.
File 1:
ABCD TEXT1 TEXT2 123123112312312312312312312312312312
BCDEFG TEXT3TEXT4 133123123123123123123123123125423423
QWERT TEXT5TEXT6 123123123123125456678786789698758567
File 2:
ABCD TEXT1 TEXT2 12312312312312312312312312312
BCDEFG TEXT3TEXT4 31242342342342342342342342343
MNHT TEXT8 TEXT9 31242342342342342342342342343
I want the result file to be:
ABCD TEXT1 TEXT2 123123112312312312312312312312312312 12312312312312312312312312312
BCDEFG TEXT3TEXT4 133123123123123123123123123125423423 31242342342342342342342342343
QWERT TEXT5TEXT6 123123123123125456678786789698758567
MNHT TEXT8 TEXT9 31242342342342342342342342343
Any hints?

awk supports a regular expression as the value of FS, so you can specify a regular expression that matches at least two spaces. Something like -F '[[:space:]][[:space:]]+'.
$ awk '{print NF}' File2
4
3
4
$ awk -F '[[:space:]][[:space:]]+' '{print NF}' File2
3
3
3
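Applied to the original join, a minimal sketch (assuming the real files separate their columns by at least two spaces, so that $1 and $2 are the first two columns and $3 is the number; like the original command, it only emits the lines of br02.txt that match):
awk -F '[[:space:]][[:space:]]+' '
    NR==FNR { a[$1,$2] = $3; next }                       # remember the number from br01.txt, keyed on columns 1-2
    ($1,$2) in a { print $0, a[$1,$2] > "br0102_3.txt" }  # append it to matching br02.txt lines
' br01.txt br02.txt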

You are using fixed-width fields, so you should be using GNU awk's FIELDWIDTHS (or similar) to separate the fields, e.g. if the 2nd field is the 15 chars from char 8 to char 22 inclusive in this file:
$ cat file
abc def ghi klm
AAAAAAAB C D E F G H IJJJJ
abc def ghi klm
$ awk -v FIELDWIDTHS="7 15 4" '{print "<" $2 ">"}' file
<def ghi >
<B C D E F G H I>
< def ghi >
Any solution that relies on a certain number of spaces between fields will fail when you have 1 or zero spaces between your fields.
If you want to strip leading/trailing blanks from your target field(s):
$ awk -v FIELDWIDTHS="7 15 4" '{gsub(/^\s+|\s+$/,"",$2); print "<" $2 ">"}' file
<def ghi>
<B C D E F G H I>
<def ghi>

awk automatically treats runs of multiple spaces as a single separator if the field separator is set to " " (which is in fact awk's default).
Thus, this simply works:
awk -F' ' '{ print $2 }'
to get the second column if you have a table like the one mentioned.
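Note, however, that -F' ' is simply awk's default behaviour: any run of whitespace separates fields, so a single space inside a field still splits it. A hypothetical run on File 2 above:
$ awk -F' ' '{ print $2 }' File2
TEXT1
TEXT3TEXT4
TEXT8
For data where a field may contain a single space, the multi-space FS or FIELDWIDTHS approaches above are the safer choice.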

Related

Find words that match a pattern file (txt file)

On a Unix system, find all the words in a txt file that match key words given in a pattern file.
Example pattern file (txt):
1
2
3
Example a.txt file; we want to find the words that match 1, 2 or 3:
a
2
4
3
5
4
1
2
The result should be:
2
3
1
2
I tried awk, but with no luck:
awk '/1/,/2/,/3/,.... a.txt
You want an exact match between pattern.txt and a.txt. This implies that if pattern.txt contains a line:
foo
then this line can only match "foo" and not
bar foo
foo bar
foo123
For a perfect match you can use either of:
$ awk '(NR==FNR){a[$0];next}($0 in a)' pattern.txt a.txt
$ grep -xFf pattern.txt a.txt
The awk version reads pattern.txt first (NR==FNR), stores each line as an array key, and then prints the lines of a.txt that are present as keys. The grep flags mean: -x match whole lines only, -F treat the patterns as fixed strings, -f read the patterns from the given file.
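With the sample files above, either command prints the matching lines in the order they occur in a.txt:
$ grep -xFf pattern.txt a.txt
2
3
1
2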

How do I list specific lines using awk?

I have a file that is sorted like the following:
2 Good
2 Hello
3 Goodbye
3 Begin
3 Yes
3 No
I want to find the highest value in the file and display the lines that contain it:
3 Goodbye
3 Begin
3 Yes
3 No
How would I do this?
awk to the rescue!
$ awk 'FNR==NR{if(max<$1) max=$1; next} $1==max' file{,}
3 Goodbye
3 Begin
3 Yes
3 No
Double pass: find the maximum on the first pass, then filter out everything else on the second. (file{,} is shell brace expansion for file file, so the same file is read twice.)
cat file.txt | sort -r | awk '{if ($1>=prev) {print $0; prev=$1}}'
3 Yes
3 No
3 Goodbye
3 Begin
Assuming file.txt contains
2 Good
2 Hello
3 Goodbye
3 Begin
3 Yes
3 No
First get the highest value in the file into a variable. Since the file is already sorted, pick up the last line of the file, then parse out the number using awk.
highest=`tail -1 file.list|awk '{print $1}'`
Then grep the file using that value.
grep "^${highest} " file.list
This should do the job. I am only using awk as required in the question:
awk 'BEGIN {v=0} {l = l "\n" $0} {if ($1>v) {l = $0; v = $1}} END {print l}' file.txt
The variable v is initialized to 0 before the file is parsed. Each line that is read is appended to l; if the first field ($1) is greater than v, then v is updated and l is reset to just the current line. At the end, the content of l is printed.
It's easier than you think, provided you already know the highest value:
awk '/^3/' file
3 Goodbye
3 Begin
3 Yes
3 No

Convert specific column of file into upper case in unix (without using awk and sed)

My file is as below
file name = test
1 abc
2 xyz
3 pqr
How can I convert the second column of the file to upper case without using awk or sed?
You can use tr to transform lowercase into uppercase; cut will extract the individual columns and paste will combine the separated columns again.
Assumption: Columns are delimited by tabs.
paste <(cut -f1 file) <(cut -f2 file | tr '[:lower:]' '[:upper:]')
Replace file with your file name (that is test in your case).
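A hypothetical run, assuming the columns of test really are tab-separated:
$ paste <(cut -f1 test) <(cut -f2 test | tr '[:lower:]' '[:upper:]')
1	ABC
2	XYZ
3	PQR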
In pure bash
#!/bin/bash
# ${col2^^} upper-cases col2 via bash parameter expansion (bash 4+)
while read -r col1 col2
do
printf "%s%7s\n" "$col1" "${col2^^}"
done < file > output-file
Input-file
$ cat file
1 abc
2 xyz
3 pqr
Output-file
$ cat output-file
1 ABC
2 XYZ
3 PQR

How to convert multiple lines into fixed column lengths

Converting all rows into a single delimited line is easy:
cat input.txt | tr "\n" " "
But I have a long file with 84046468 lines. I wish to convert this into a file with 1910147 rows and 44 tab-delimited columns. The first column is a text string such as chrXX_12345_+ and the other 43 columns are numerical strings. Is there a way to perform this transformation?
There are NAs present, so I guess sed, substituting "\n" with "\t" whenever the preceding string is a number, won't work.
sample input.txt
chr10_1000103_+
0.932203
0.956522
1
0.972973
1
0.941176
1
0.923077
1
1
0.909091
0.9
1
0.916667
0.8
1
1
0.941176
0.904762
1
1
1
0.979592
0.93617
0.934783
1
0.941176
1
1
0.928571
NA
1
1
1
0.941176
1
0.875
0.972973
1
1
NA
0.823529
0.51366
chr10_1000104_-
0.952381
1
1
0.973684
sample output.txt
chr10_1000103_+ 0.932203 (numbers all tab-delimited)
chr10_1000104_- etc
(sorry, a lot of numbers to type manually)
sed '
# use a delimiter
s/^/M/
:Next
# put a counter
s/^/i/
# test counter
/^\(i\)\{44\}/ !{
$ !{
# not yet the 44th line and not end of file: append the next line
N
# loop
b Next
}
}
# remove marker and counter
s/^i*M//
# replace each newline with a tab
s/\n/\t/g' YourFile
Note: some sed implementations limit the \{n\} repetition count to 255, so 44 is OK.
Here's the right approach using 4 columns instead of 44:
$ cat file
chr10_1000103_+
0.932203
0.956522
1
chr10_1000104_-
0.952381
1
1
$ awk '{printf "%s%s", $0, (NR%4?"\t":"\n")}' file
chr10_1000103_+ 0.932203 0.956522 1
chr10_1000104_- 0.952381 1 1
Just change 4 to 44 for your real input.
If you are seeing control-Ms in your output, it's because they are present in your input, so use dos2unix or similar to remove them before running the tool, or with GNU awk you could just set -v RS='\r\n'.
When posting questions it's important to make it as clear, simple, and brief as possible so that as many people as possible will be interested in helping you.
BTW, cat input.txt | tr "\n" " " is a UUOC (useless use of cat) and should just be tr "\n" " " < input.txt
Not the best solution, but should work:
line="nonempty"; while [ ! -z "$line" ]; do for i in $(seq 44); do read line; echo -n "$line "; done; echo; done < input.txt
If there is an empty line in the file, it will terminate early. For a more robust solution I'd try perl.
Edit: if you are concerned with efficiency, just use awk:
awk '{ printf "%s\t", $1 } NR%44==0{ print "" }' < input.txt
You may want to strip the trailing tab character with | sed 's/\t$//' or make the awk script more complicated.
This might work for you (GNU sed):
sed '/^chr/!{H;$!d};x;s/\n/\t/gp;d' file
If a line does not begin with chr, append it to the hold space and then delete it, unless it is the last line. If the line does begin with chr, or it is the last line, swap to the hold space, replace all newlines with tabs, and print the result.
N.B. The start of the next record is left untouched in the pattern space, which becomes the new hold space.

How to delete words which start with some specific pattern in a file in unix

I want to delete all words in my file that start with 3: or 4:.
For example, the input is:
13 1:12 2:14 3:11
10 1:9 2:7 4:10 5:2
16 3:7 8:24
7 4:7 6:54
Output should be
13 1:12 2:14
10 1:9 2:7 5:2
16 8:24
7 6:54
Can someone tell me if this is possible using a sed or awk command?
This might work for you (GNU sed):
sed 's/\b[34]:\S*\s*//g' file
It looks for a word boundary, then either 3 or 4 followed by :, then zero or more non-space characters followed by zero or more spaces, and deletes every such match throughout the line.
With sed
sed -r 's/ 3:[0-9]*| 4:[0-9]*//g'
$ cat input.txt
13 1:12 2:14 3:11
10 1:9 2:7 4:10 5:2
16 3:7 8:24
7 4:7 6:54
$ cat input.txt | sed -r 's/ 3:[0-9]*| 4:[0-9]*//g'
13 1:12 2:14
10 1:9 2:7 5:2
16 8:24
7 6:54
Explanation:
-r: use extended regular expressions (ERE).
 3:[0-9]*: matches a space, then 3, then :, then [0-9], i.e. one digit; the * means zero or more repetitions of the preceding element ([0-9]), so it matches any run of digits after the colon.
|: means OR.
 4:[0-9]*: same as above, except it searches for 4 instead of 3.
//: the replacement string; whatever sits between these two slashes replaces the match, and here it is empty, so the match is simply deleted.
/g: replace every match on the line, not just the first.
With awk:
awk '{for (i=1; i<=NF; i++)
{if (! sub("^[34]:", "", $i)) d=d$i" "}
print d; d=""
}' file
It loops through the fields and stores in the variable d only those that do not start with 3: or 4:; this is done by checking whether the sub() function returned true. When the loop over the line is done, d is printed.
For your given file:
$ awk '{for (i=1; i<=NF; i++) {if (! sub("^[34]:", "", $i)) d=d$i" "} print d; d=""}' file
13 1:12 2:14
10 1:9 2:7 5:2
16 8:24
7 6:54
sed 's/[[:blank:]][34]:[^[:blank:]]\{1,\}[[:blank:]]*/ /g' YourFile
POSIX compliant, and assuming (as in the sample) that no first word on a line starts with 3: or 4:.
Assuming all words contain a : with at least one character after it:
sed 's/ [34]:[^ ]\{1,\}//g' inputfile
This matches a SPACE, 3 or 4, a colon, and then at least one non-space character; it replaces each match with nothing and does so across the whole line.
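A hypothetical run against the sample input:
$ sed 's/ [34]:[^ ]\{1,\}//g' inputfile
13 1:12 2:14
10 1:9 2:7 5:2
16 8:24
7 6:54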
