I have a text file that looks like this:
>long_name
AAC-TGA
>long_name2
CCTGGAA
And a list of column numbers: 2, 4, 7. Of course I can have these as a variable like:
cols="2 4 7"
I need to replace every column of the rows that don't start with > with a single character, e.g an N, to result in:
>long_name
ANCNTGN
>long_name2
CNTNGAN
Additional details - the file has ~200K lines. All lines that don't start with > are the same length. Line indices will never exceed the length of the non > lines.
It seems to me that some combination of sed and awk must be able to do this quickly, but I cannot for the life of me figure out how to link it all together.
E.g. I can use sed to work on all lines that don't start with a > like this (in this case replacing all spaces with N's):
sed -i.bak '/^[^>]/s/ /N/g' input.txt
And I can use AWK to replace specific columns of lines as I want to like this (I think...):
awk '$2=N'
But I am struggling to stitch this together
With GNU awk, set i/o field separators to empty string so that each character becomes a field, and you can easily update them.
awk -v cols='2 4 7' '
BEGIN {
split(cols,f)
FS=OFS=""
}
!/^>/ {
for (i in f)
$(f[i])="N"
}
1' file
Also see Save modifications in place with awk.
You can generate a list of replacement commands first and then pass them to sed
$ printf '2 4 7' | sed -E 's|[0-9]+|/^>/! s/./N/&\n|g'
/^>/! s/./N/2
/^>/! s/./N/4
/^>/! s/./N/7
$ printf '2, 4, 7' | sed -E 's|[^0-9]*([0-9]+)[^0-9]*|/^>/! s/./N/\1\n|g'
/^>/! s/./N/2
/^>/! s/./N/4
/^>/! s/./N/7
$ sed -f <(printf '2 4 7' | sed -E 's|[0-9]+|/^>/! s/./N/&\n|g') ip.txt
>long_name
ANCNTGN
>long_name2
CNTNGAN
Can also use {} grouping
$ printf '2 4 7' | sed -E 's|^|/^>/!{|; s|[0-9]+|s/./N/&; |g; s|$|}|'
/^>/!{s/./N/2; s/./N/4; s/./N/7; }
Using any awk in any shell on every UNIX box:
$ awk -v cols='2 4 7' '
BEGIN { split(cols,c) }
!/^>/ { for (i in c) $0=substr($0,1,c[i]-1) "N" substr($0,c[i]+1) }
1' file
>long_name
ANCNTGN
>long_name2
CNTNGAN
I have a file with some lines:
a
b
c
d
I would like to cat this file into a awk command to produce something like this:
letter is a
letter is b
letter is c
letter is d
using something like this:
cat file.txt | awk 'letter is $1'
But it's not printing out as expected:
$ cat raw.txt | awk 'this is $1'
a
b
c
d
At the moment, you have no { action } block, so your condition evaluates the two empty variables this and is, concatenating them with the first field $1, and checks whether the result is true (a non-empty string). It is, so the default action prints each line.
It sounds like you want to do this instead:
awk '{ print "letter is", $1 }' raw.txt
Although in this case, you might as well just use sed:
sed 's/^/letter is /' raw.txt
This command matches the start of each line and adds the string.
Note that I'm passing the file as an argument, rather than using cat with a pipe.
Not sure if you wanted sed or awk but this is in awk:
$ awk '{print "letter is " $1}' file
letter is a
letter is b
letter is c
letter is d
I have a file which contains multiple fields and 2 types of delimiters. If the number of delimiters in one of the fields reaches a defined number then I want to split the field after the number is met onto the next line while replicating the first part of the line.
Is this possible in awk or sed?
Example
Input
a1|b|c|d|1,2,3,4|
a2|b|c|d|1,2,3,4,5,6,7,8,9,10|
a3|b|c|d|1,2|
Max Number = 6, to split on commas in field 5
Output
a1|b|c|d|1,2,3,4|
a2|b|c|d|1,2,3,4,5,6|
a2|b|c|d|7,8,9,10|
a3|b|c|d|1,2|
Assuming not more than one split would be required:
$ sed -E 's/^(([^|]+\|){4})(([^,]+,){5}[^,]+),(.*)/\1\3|\n\1\5/' ip.txt
a1|b|c|d|1,2,3,4|
a2|b|c|d|1,2,3,4,5,6|
a2|b|c|d|7,8,9,10|
a3|b|c|d|1,2|
-E use ERE, some sed version uses -r option instead
^(([^|]+\|){4}) first 4 columns delimited by |
(([^,]+,){5}[^,]+) 6 columns delimited by , (without trailing ,)
, comma between 6th and 7th column
(.*) rest of line
\1\3|\n\1\5 split as required
The column and max number can be passed from shell variables too (example shown for bash)
$ col=5; max=6
$ sed -E "s/^(([^|]+\|){$((col-1))})(([^,]+,){$((max-1))}[^,]+),(.*)/\1\3|\n\1\5/" ip.txt
a1|b|c|d|1,2,3,4|
a2|b|c|d|1,2,3,4,5,6|
a2|b|c|d|7,8,9,10|
a3|b|c|d|1,2|
$ col=5; max=8
$ sed -E "s/^(([^|]+\|){$((col-1))})(([^,]+,){$((max-1))}[^,]+),(.*)/\1\3|\n\1\5/" ip.txt
a1|b|c|d|1,2,3,4|
a2|b|c|d|1,2,3,4,5,6,7,8|
a2|b|c|d|9,10|
a3|b|c|d|1,2|
awk to the rescue!
awk -F\| -v OFS=\| -v c=',' '
{n=split($5,a,c);
if(n>6)
{f=$5;
$5=a[1] c a[2] c a[3] c a[4] c a[5] c a[6];
print;
$5=f;
gsub(/([^,]+,){6}/,"",$5)}}1' file
How would you count the number of words in a text file which contains all of the letters a, b, and c. These letters may occur more than once in the word and the word may contain other letters as well. (For example, "cabby" should be counted.)
Using sample input which should return 2:
abc abb cabby
I tried both:
grep -E "[abc]" test.txt | wc -l
grep 'abcdef' testCount.txt | wc -l
both of which return 1 instead of 2.
Thanks in advance!
You can use awk and use the return value of sub function. If successful substitution is made, the return value of the sub function will be the number of substitutions done.
$ echo "abc abb cabby" |
awk '{
for(i=1;i<=NF;i++)
if(sub(/a/,"",$i)>0 && sub(/b/,"",$i)>0 && sub(/c/,"",$i)>0) {
count+=1
}
}
END{print count}'
2
We keep the condition of return value to be greater than 0 for all three alphabets. The for loop will iterate over every word of every line adding the counter when all three alphabets are found in the word.
I don't think you can get around using multiple invocations of grep. Thus I would go with (GNU grep):
<file grep -ow '\w+' | grep a | grep b | grep c
Output:
abc
cabby
The first grep puts each word on a line of its own.
Try this, it will work
sed 's/ /\n/g' test.txt |grep a |grep b|grep c
$ cat test.txt
abc abb cabby
$ sed 's/ /\n/g' test.txt |grep a |grep b|grep c
abc
cabby
hope this helps..
I have a text file with 1.3million rows and 258 columns delimited by semicolons (;). How can I search for what characters are in the file, excluding letters of the alphabet (both upper and lower case), semicolon (;), quote (') and double quote (")? Ideally the results should be in a non-duplicated list.
Use the following pipeline
# Remove the characters you want to exclude
tr -d 'A-Za-z;"'\' <file |
# One character on each line
sed 's/\(.\)/\1\
/g' |
# Remove duplicates
sort -u
Example
echo '2343abc34;ABC;;#$%"' |
tr -d 'A-Za-z;"'\' |
sed 's/\(.\)/\1\
/g' |
sort -u
$
%
2
3
4
#
you can use grep -v command and pipe it to sort and then to uniq.