Replacing a string pattern with a sequence in Unix

I want to replace the string TaskID_1 with a sequence value starting from 1001, and TaskID_1 can appear on any number of lines in my input file.
Similarly, I need to replace all occurrences of TaskID_2 in my input file with the next sequence value, 1002.
Input file:
12345|45345|TaskID_1|dksj|kdjfdsjf|12
1245|425345|TaskID_1|dksj|kdjfdsjf|12
1234|25345|TaskID_2|dksj|kdjfdsjf|12
123425|65345|TaskID_2|dksj|kdjfdsjf|12
123425|15325|TaskID_1|dksj|kdjfdsjf|12
11345|55315|TaskID_2|dksj|kdjfdsjf|12
6345|15345|TaskID_3|dksj|kdjfdsjf|12
72345|25345|TaskID_4|dksj|kdjfdsjf|12
9345|411345|TaskID_3|dksj|kdjfdsjf|12
The output file should look like:
12345|45345|1001|dksj|kdjfdsjf|12
1245|425345|1001|dksj|kdjfdsjf|12
1234|25345|1002|dksj|kdjfdsjf|12
123425|65345|1002|dksj|kdjfdsjf|12
123425|15325|1001|dksj|kdjfdsjf|12
11345|55315|1002|dksj|kdjfdsjf|12
6345|15345|1003|dksj|kdjfdsjf|12
72345|25345|1004|dksj|kdjfdsjf|12
9345|411345|1003|dksj|kdjfdsjf|12

Here's one way using awk:
awk 'BEGIN { FS=OFS="|" } { $3=1000 + NR }1' file
Or less verbosely:
awk -F '|' '{ $3=1000 + NR }1' OFS='|' file
Results:
12345|45345|1001|dksj|kdjfdsjf|12
1245|425345|1002|dksj|kdjfdsjf|12
1234|25345|1003|dksj|kdjfdsjf|12
123425|65345|1004|dksj|kdjfdsjf|12
123425|15325|1005|dksj|kdjfdsjf|12
11345|55315|1006|dksj|kdjfdsjf|12
6345|15345|1007|dksj|kdjfdsjf|12
72345|25345|1008|dksj|kdjfdsjf|12
9345|411345|1009|dksj|kdjfdsjf|12
For the first example, the field separator and output field separator are both set to a single pipe character. This is done in the BEGIN block so that it executes only once, not on every line of input. We then set the third column to 1000 plus an incrementing value. We could use ++i for this, but NR (short for record number, i.e. the line number) avoids the need for an extra variable. The 1 on the end enables printing by default. (Note that this numbers every line sequentially rather than per distinct TaskID; the EDIT below handles the actual requirement.) A more verbose version of the same solution would look like:
awk 'BEGIN { FS=OFS="|" } { $3=1000 + NR; print }' file
EDIT:
Using the updated data file, try:
awk 'BEGIN { FS=OFS="|" } { sub(/.*_/,"",$3); $3+=1000 }1' file
Results:
12345|45345|1001|dksj|kdjfdsjf|12
1245|425345|1001|dksj|kdjfdsjf|12
1234|25345|1002|dksj|kdjfdsjf|12
123425|65345|1002|dksj|kdjfdsjf|12
123425|15325|1001|dksj|kdjfdsjf|12
11345|55315|1002|dksj|kdjfdsjf|12
6345|15345|1003|dksj|kdjfdsjf|12
72345|25345|1004|dksj|kdjfdsjf|12
9345|411345|1003|dksj|kdjfdsjf|12
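If the IDs were arbitrary strings rather than numeric suffixes, a variant (a sketch, not from the original answer) can let awk hand out sequence numbers in order of first appearance, using an array keyed on field 3. The inline sample here is shortened from the question's input:

```shell
# Map each distinct value of field 3 to 1001, 1002, ... in order of first sight.
printf '12345|45345|TaskID_1|x\n1234|25345|TaskID_2|x\n9|9|TaskID_1|x\n' |
awk 'BEGIN { FS = OFS = "|" } { if (!($3 in id)) id[$3] = 1000 + ++n; $3 = id[$3] } 1'
```

Repeated IDs reuse their first-assigned number, so TaskID_1 maps to 1001 on both of its lines.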

A Perl solution using Steve's logic of adding 1000:
perl -pe 's/TaskID_(\d+)/$1+1000/e' file
This replaces 'TaskID_n' with 1000+n. The 'e' modifier causes the replacement to be evaluated as a Perl expression.

Replacing TaskID_ with 100 is super easy with sed, at least for single-digit IDs:
$ sed 's/TaskID_/100/' file
12345|45345|1001|dksj|kdjfdsjf|12
1245|425345|1001|dksj|kdjfdsjf|12
1234|25345|1002|dksj|kdjfdsjf|12
123425|65345|1002|dksj|kdjfdsjf|12
123425|15325|1001|dksj|kdjfdsjf|12
11345|55315|1002|dksj|kdjfdsjf|12
6345|15345|1003|dksj|kdjfdsjf|12
72345|25345|1004|dksj|kdjfdsjf|12
9345|411345|1003|dksj|kdjfdsjf|12
To store this change back to the file use the -i option:
sed -i 's/TaskID_/100/' file
Note: this works for TaskID_[0-9]. If you want TaskID_23 mapped to 1023, this won't do it; it would map TaskID_23 to 10023.
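The caveat is easy to see on a couple of made-up values:

```shell
echo 'TaskID_4'  | sed 's/TaskID_/100/'   # prints 1004, as intended
echo 'TaskID_23' | sed 's/TaskID_/100/'   # prints 10023, not 1023
```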

I can't come up with a better solution than the one Steve suggested in awk, so here's a worse one, using only bash.
#!/bin/bash
IFS='|'
while read -r f1 f2 f3 f4 f5 f6; do
  printf '%s|%s|%d|%s|%s|%s\n' "$f1" "$f2" "$(( ${f3#*_} + 1000 ))" "$f4" "$f5" "$f6"
done < input
It's "worse" only because it'll be much slower than awk, which is fast and efficient with this sort of problem.
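The key piece of the loop is the ${f3#*_} parameter expansion, which strips the shortest leading match of *_ and leaves the numeric suffix for the arithmetic expansion:

```shell
f3='TaskID_7'
echo "${f3#*_}"                # prints 7 (everything up to the first _ removed)
echo "$(( ${f3#*_} + 1000 ))"  # prints 1007
```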

perl -F"\|" -lane '$F[2]=~s/.*_/100/g;print join("|",@F)' your_file
Tested Below:
> cat temp
12345|45345|TaskID_1|dksj|kdjfdsjf|12
1245|425345|TaskID_1|dksj|kdjfdsjf|12
1234|25345|TaskID_2|dksj|kdjfdsjf|12
123425|65345|TaskID_2|dksj|kdjfdsjf|12
123425|15325|TaskID_1|dksj|kdjfdsjf|12
11345|55315|TaskID_2|dksj|kdjfdsjf|12
6345|15345|TaskID_3|dksj|kdjfdsjf|12
72345|25345|TaskID_4|dksj|kdjfdsjf|12
9345|411345|TaskID_3|dksj|kdjfdsjf|12
> perl -F"\|" -lane '$F[2]=~s/.*_/100/g;print join("|",@F)' temp
12345|45345|1001|dksj|kdjfdsjf|12
1245|425345|1001|dksj|kdjfdsjf|12
1234|25345|1002|dksj|kdjfdsjf|12
123425|65345|1002|dksj|kdjfdsjf|12
123425|15325|1001|dksj|kdjfdsjf|12
11345|55315|1002|dksj|kdjfdsjf|12
6345|15345|1003|dksj|kdjfdsjf|12
72345|25345|1004|dksj|kdjfdsjf|12
9345|411345|1003|dksj|kdjfdsjf|12

Related

Unix: Using filename from another file

A basic Unix question.
I have a script which counts the number of records in a delta file.
awk '{
n++
} END {
if(n >= 1000) print "${completeFile}"; else print "${deltaFile}";
}' <${deltaFile} >${fileToUse}
Then, depending on the IF condition, I want to process the appropriate file:
cut -c2-11 < ${fileToUse}
But how do I use the contents of the file as the filename itself?
And if there are any tweaks to be made, feel free.
Thanks in advance
Cheers
Simon
To use as a filename the contents of a file which is itself identified by a variable (as asked):
cut -c2-11 <"$( cat $fileToUse )"
// or in zsh just
cut -c2-11 <"$( < $fileToUse )"
unless the filename in the file ends with one or more newline characters (which people rarely do, because it's quite awkward and inconvenient), in which case use something like:
read -rdX var <$fileToUse; cut -c2-11 < "${var%?}"
// where X is a character that doesn't occur in the filename
// maybe something like $'\x1f'
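Putting the indirection together with throwaway files (the names and contents here are made up for illustration):

```shell
target=$(mktemp)    # the file whose contents we actually want
pointer=$(mktemp)   # the file that holds the target's name
printf 'xhelloworld\n' > "$target"
printf '%s\n' "$target" > "$pointer"
cut -c2-11 < "$(cat "$pointer")"   # prints helloworld
rm -f "$target" "$pointer"
```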
Tweaks: your awk prints the literal variable reference ${completeFile} or ${deltaFile} (because they're inside the single-quoted awk script), not the value of either variable. If you actually want the values, as I'd expect from your description, pass the shell variables to awk variables like this:
awk -vf="$completeFile" -vd="$deltaFile" '{n++} END{if(n>=1000)print f; else print d}' <"$deltaFile"
# the " around $var can be omitted if the value contains no whitespace and no glob chars
# people _often_ but not always choose filenames that satisfy this
# and they must not contain backslash in any case
or export the shell vars as env vars (if they aren't already) and access them like
awk '{n++} END{if(n>=1000) print ENVIRON["completeFile"]; else print ENVIRON["deltaFile"]}' <"$deltaFile"
Also, you don't need your own counter; awk already counts input records:
awk -vf=... -vd=... 'END{if(NR>=1000)print f;else print d}' <...
or more briefly
awk -vf=... -vd=... 'END{print (NR>=1000?f:d)}' <...
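The ternary behaves as you'd expect; with only a handful of records the d branch is taken (BIG and SMALL are placeholder values, not from the question):

```shell
seq 5 | awk -v f=BIG -v d=SMALL 'END { print (NR >= 1000 ? f : d) }'
# prints: SMALL
```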
or using a file argument instead of redirection so the name is available to the script
awk -vf="$completeFile" 'END{print (NR>=1000?f:FILENAME)}' "$deltaFile" # no <
and barring trailing newlines as above you don't need an intermediate file at all, just
cut -c2-11 <"$( awk -vf="$completeFile" 'END{print (NR>=1000?f:FILENAME)}' "$deltaFile" )"
Or you don't really need awk; wc can do the counting and any POSIX or classic shell can do the comparison:
if [ $(wc -l <"$deltaFile") -ge 1000 ]; then c="$completeFile"; else c="$deltaFile"; fi
cut -c2-11 <"$c"
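A self-contained run of that last variant, with temporary files standing in for ${completeFile} and ${deltaFile}:

```shell
deltaFile=$(mktemp)
completeFile=$(mktemp)
printf 'one line\n' > "$deltaFile"   # far fewer than 1000 records
if [ "$(wc -l < "$deltaFile")" -ge 1000 ]; then c="$completeFile"; else c="$deltaFile"; fi
[ "$c" = "$deltaFile" ] && echo 'delta file chosen'
rm -f "$deltaFile" "$completeFile"
```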

Enclose columns containing alphabets with single quotes using awk

Can awk process this?
Input
Neil,23,01-Jan-1990
25,Reena,19900203
Output
'Neil',23,'01-Jan-1990'
25,'Reena',19900203
awk approach:
awk -F, '{for(i=1;i<=NF;i++) if($i~/[[:alpha:]]/) $i="\047"$i"\047"}1' OFS="," file
The output:
'Neil',23,'01-Jan-1990'
25,'Reena',19900203
if($i~/[[:alpha:]]/) - if field contains alphabetic character
\047 - octal code of single quote ' character
My first attempt was incorrect:
sed -r 's/([^,]*[a-zA-Z]+[^,]*)(,{0,1})/"\1"\2/g' inputfile
#Sundeep gave an excellent comment: I need single quotes, and it can be shorter.
I tried to match up to and including the comma or end-of-line, which complicated the matching. You can instead just match between the separators, making sure there is an alphabetic character somewhere:
sed 's/[^,]*[a-zA-Z][^,]*/\x27&\x27/g' inputfile
You might use this script:
script.awk
BEGIN { OFS=FS="," }
{ for(i= 1; i<=NF; i++) {
if( !match( $i, /^[0-9]+$/ ) ) $i = "'" $i "'"
}
print
}
and run it like this: awk -f script.awk yourfile
Explanation
the first line sets the input and output field separators to ,.
the loop tests whether each field consists only of digits (/^[0-9]+$/):
if not, the field is wrapped in single quotes
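The same logic inlined on the question's sample, to contrast with the alphabetic test above: 01-Jan-1990 fails the digits-only test and gets quoted, while pure numbers pass through untouched:

```shell
printf 'Neil,23,01-Jan-1990\n25,Reena,19900203\n' |
awk 'BEGIN { OFS = FS = "," } { for (i = 1; i <= NF; i++) if (!match($i, /^[0-9]+$/)) $i = "\047" $i "\047"; print }'
```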

transpose a column in unix

I have a Unix file which has data like this.
1379545632,
1051908588,
229102020,
1202084378,
1102083491,
1882950083,
152212030,
1764071734,
1371766009,
I want to transpose it and print as a single line.
Like this:
1379545632,1051908588,229102020,1202084378,1102083491,1882950083,152212030,1764071734,1371766009
Also remove the last comma.
Can someone help? I need a shell/awk solution.
tr -d '\n' < file.txt
To remove the last comma you can pipe the result through sed 's/,$//'.
With GNU awk for multi-char RS:
$ printf 'x,\ny,\nz,\n' | awk -v RS='^$' '{gsub(/\n|(,\n$)/,"")} 1'
x,y,z
awk 'BEGIN { ORS="" } { print }' file
ORS: Output Record Separator.
Each output record is terminated with this delimiter, so setting it to the empty string joins all the lines (the trailing comma remains).
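Combining the two steps, with a short inline sample in place of file (the ORS="" output carries no final newline, which the shell tolerates):

```shell
printf '111,\n222,\n333,\n' | awk 'BEGIN { ORS = "" } { print }' | sed 's/,$//'
# prints: 111,222,333
```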

Use sed to delete everything after '>' and add index number plus a string?

I know this should be pretty simple to do, but I can't get it to work. My file looks like this
>c12345|random info goes here that I want to delete
AAAAATTTTTTTTCCCC
>c45678| more | random info| here
GGGGGGGGGGG
And what I want to do is just make this far simpler so it might look like this
>seq1 [organism=human]
AAAAATTTTTTTTCCCC
>seq2 [organism=human]
GGGGGGGGGGGG
>seq3 [organism=human]
etc....
I know I can append that constant easily once I get the indexed part in there by doing:
sed '/^>/ s/$/\[organism-human]/g'
But how do I get that index built?
With sed:
sed '/^>/d' filename | sed '=' | sed 's/^[0-9]*$/>seq& [organism=human]/'
(Thanks to NeronLeVelu for the simplification.)
Here's one way you could do it using awk:
$ awk '/^>/ { $0 = ">seq" ++i " [organism=human]" } 1' file
>seq1 [organism=human]
AAAAATTTTTTTTCCCC
>seq2 [organism=human]
GGGGGGGGGGG
When a line begins with >, the whole line is replaced with >seq followed by i (which increments by 1 each time) and [organism=human]. The 1 at the end of the command is true, so awk performs the default action, which is to print the line.
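Run against a shortened two-record sample of the question's input:

```shell
printf '>c12345|random info\nAAAAATTTT\n>c45678|more|info\nGGGGG\n' |
awk '/^>/ { $0 = ">seq" ++i " [organism=human]" } 1'
```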
Might be easier with a Perl one-liner:
perl -ne 'chomp; if (/^>/) { s/\|.*$//; print "$_ \[organism=human\]\n";} else { print "$_\n";}' filename

How do I replace empty strings in a tsv with a value?

I have a tsv, file1, that is structured as follows:
col1 col2 col3
1 4 3
22 0 8
3 5
so that the last line would look something like 3\t\t5, if it was printed out. I'd like to replace that empty string with 'NA', so that the line would then be 3\tNA\t5. What is the easiest way to go about this using the command line?
awk is designed for this scenario (among a million others ;-) )
awk -F"\t" -v OFS="\t" '{
for (i=1;i<=NF;i++) {
if ($i == "") $i="NA"
}
print $0
}' file > file.new && mv file.new file
-F"\t" indicates that the field separator (known as FS inside awk) is the tab character. We also set the output field separator (OFS) to "\t".
NF is the number of fields on a line of data. $i gets evaluated as $1, $2, $3, ... for each value between 1 and NF.
We test if the $i th element is empty with if ($i == "") and when it is, we change the $i th element to contain the string "NA".
For each line of input, we print the line's ($0) value.
Outside the awk script, we redirect the output to a temporary file, file.new. The && ensures the mv runs only if the awk script exited without errors, after which file.new replaces the original file. Depending on the safety and security requirements of your project, you may not want to "destroy" your original file.
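A minimal run on the problem line itself, writing to stdout instead of a temporary file:

```shell
printf '3\t\t5\n' |
awk -F"\t" -v OFS="\t" '{ for (i = 1; i <= NF; i++) if ($i == "") $i = "NA"; print $0 }'
# prints: 3<TAB>NA<TAB>5
```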
IHTH.
A straightforward approach is
sed -i 's/^\t/NA\t/;s/\t$/\tNA/;:0 s/\t\t/\tNA\t/;t0' file
sed -i edit file in place;
s/a/b/ replace a with b;
s/^\t/NA\t/ replace \t at the beginning of the line with NA\t
(the first column becomes NA);
s/\t$/\tNA/ the same for the last column;
s/\t\t/\tNA\t/ insert NA in between \t\t;
:0 s///; t0 repeat s/// if there was a replacement (in case there are other missing values in the line).
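The loop matters when several empty fields are adjacent, because one pass of s/\t\t/\tNA\t/ consumes two tabs at a time and misses the overlap. GNU sed is assumed here (for \t support), with the label split across -e expressions to keep the label syntax unambiguous:

```shell
# Three consecutive tabs = two adjacent empty fields.
printf '1\t\t\t4\n' | sed -e ':0' -e 's/\t\t/\tNA\t/' -e 't0'
# prints: 1<TAB>NA<TAB>NA<TAB>4
```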
