sed takes too long for processing - unix

I have to replace certain characters (around 20 combinations) in each record. I implemented this with the sed command, but it takes more than 24 hours if the file is huge (more than 80,000 records).
I use two loops: one reads the input file, the other reads the config file, where the position of each character to be replaced is specified. Each line can have more than one character that needs to be replaced, and when I replace a character I have to convert it to a decimal number, so the position of the next replacement character needs to be shifted. Please find the code snippet below:
...
#Read the input file line by line
while read -r line
do
    Flag='F'
    pos_count=0
    for pattern in `awk 'NR>1' $CONFIG_FILE`
    do
        field_type=`echo $pattern | cut -d"," -f6`
        if [[ $field_type = 'A' ]];then
            echo "For loop.."
            echo $pattern
            field_type=`echo $pattern | cut -d"," -f6`
            echo field_type $field_type
            start_pos=`echo $pattern | cut -d"," -f3`
            echo start_pos $start_pos
            end_pos=`echo $pattern | cut -d"," -f4`
            echo end_pos $end_pos
            field_len=`echo $pattern | cut -d"," -f5`
            if [[ $Flag = 'T' && $field_type = 'A' ]];then
                if [[ $replace = 'R' ]];then
                    pos_count=$(expr $pos_count + 1)
                fi
                echo pos_count $pos_count
                val=$((2 * $pos_count))
                start_pos=$(expr $start_pos + $val)
                end_pos=$(expr $end_pos + $val)
                replace=N
            fi
            echo "$line"
            field=`expr substr "$line" $end_pos 1`
            echo field $field
            if [[ $start_pos -gt 255 ]];then
                lim=255
                f_cnt=$(expr $start_pos - 1)
                c_cnt=$(expr $end_pos - 2)
                #c_cnt1=$(expr $c_cnt - 255)
                c_cnt1=$(expr $field_len - 2)
                f_cnt1=$(expr $f_cnt - 255)
                echo f_cnt1 "$f_cnt1" , c_cnt1 "$c_cnt1" f_cnt $f_cnt
            else
                lim=$(expr $start_pos - 1)
                f_cnt1=$(expr $field_len - 2)
                echo lim $lim, f_cnt1 $f_cnt1
            fi
            echo Flag $Flag
            case "$field_type" in
                A )
                    echo Field type is Amount
                    if [[ "${field}" = "{" ]];then
                        echo "Replacing { in Amount Column"
                        replace=R
                        if [[ $start_pos -gt 255 ]];then
                            line=`echo "$line"| sed -e "s/\(.\{1,$lim\}\)\(.\{1,$f_cnt1\}\)\(.\{1,$c_cnt1\}\)\([^{]*\){/\1\2+\3.\40/"`
                        else
                            line=`echo "$line"| sed -e "s/\(.\{1,$lim\}\)\(.\{1,$f_cnt1\}\)\([^{]*\){/\1+\2.\30/"`
                        fi
                        Flag='T'
                    elif [[ "${field}" = "A" ]];then
                        echo "Replacing A in Amount Column"
                        replace=R
                        if [[ $start_pos -gt 255 ]];then
                            line=`echo "$line"| sed -e "s/\(.\{1,$lim\}\)\(.\{1,$f_cnt1\}\)\(.\{1,$c_cnt1\}\)\([^A]*\)A/\1\2+\3.\41/"`
                        else
                            line=`echo "$line"| sed -e "s/\(.\{1,$lim\}\)\(.\{1,$f_cnt1\}\)\([^A]*\)A/\1+\2.\31/"`
                        fi
                        Flag='T'
                    ...
                    elif [[ "${field}" = "R" ]];then
                        echo "Replacing R in Amount Column"
                        replace=R
                        if [[ $start_pos -gt 255 ]];then
                            line=`echo "$line"| sed -e "s/\(.\{1,$lim\}\)\(.\{1,$f_cnt1\}\)\(.\{1,$c_cnt1\}\)\([^R]*\)R/\1\2-\3.\49/"`
                        else
                            line=`echo "$line"| sed -e "s/\(.\{1,$lim\}\)\(.\{1,$f_cnt1\}\)\([^R]*\)R/\1-\2.\39/"`
                        fi
                        Flag='T'
                    else
                        echo "Incrementing the size of Amount Column"
                        replace=R
                        if [[ $start_pos -gt 255 ]];then
                            line=`echo "$line"| sed -e "s/\(.\{1,$lim\}\)\(.\{1,$f_cnt1\}\)\(.\{1,$c_cnt1\}\)/\1\2\3 /"`
                        else
                            line=`echo "$line"| sed -e "s/\(.\{1,$lim\}\)\(.\{1,$f_cnt1\}\)/\1\2 /"`
                        fi
                    fi
                    ;;
                C )
                    echo "Column Type is Count"
                    ;;
                * )
                    echo Others
                    ;;
            esac
        fi
    done
    echo "$line" >> ${RES_FILE}
done < "$SRC_FILE"
echo `date`
exit 0
Below are the sample input file and config file:
CHD0000204H315604COV2013038 PROD2016022016030218481304COVCTR0000204H3156C00000000897 000000229960000024670141D0000000397577I0000000000000{00000174042
55C0000007666170B0000025070425E0000004863873E0000000631900F0000001649128{0000000018756B0000014798809C0000001890129G00000002384500000000286600000000084900000000155300000
0000055000000021388000000000048000000000003 00000897 0000000000000{0000000002706B0000001217827I000000001069
Config file:
FIELD NO.,FIELD NAME,STARTING POSITION,ENDING POSITION,LENGTH,INDICATOR
1,CHD_CONTRACT_NO,1,5,5,N
2,CHD_FILE_ID,6,21,16,N
3,PHD_CONTRACT_NO,22,26,5,N
4,PHD_PBP_ID,27,29,3,N
5,PHD_FILE_ID,30,45,16,N
6,DET_REC_ID,46,48,3,N
7,DET_SEQ_NO,49,55,7,N
8,DET_DG_CO_ST_CD,56,56,1,N
9,DET_CURR_HICN,57,76,20,N
10,DET_LAST_SUBM_HICN,77,96,20,N
11,DET_LAST_SUBM_CH_ID,97,116,20,N
12,DET_ERL_PDE_ATT_DT,117,124,8,N
13,DET_RX_COUNT,125,135,11,N
14,DET_NET_IGD_COST_AMT,136,149,14,A
15,DET_NET_DISP_FEE,150,163,14,A
16,DET_NET_SAL_TAX_AMT,164,177,14,A
17,DET_NET_GDCB,178,191,14,A
18,DET_NET_GDCA,192,205,14,A
19,DET_NET_GRS_DG_AMT,206,219,14,A
20,DET_NET_PAT_PAY_AMT,220,233,14,A
21,DET_NET_OTR_TROOP_AMT,234,247,14,A
22,DET_NET_LICS_AMT,248,261,14,A
23,DET_NET_TROOP_AMT,262,275,14,A
24,DET_NET_PLRO_AMT,276,289,14,A
25,DET_NET_CPP_AMT,290,303,14,A
26,DET_NET_NPP_AMT,304,317,14,A
27,DET_ORIG_PDE_CNT,318,329,12,N
28,DET_ADJ_PDE_CNT,330,341,12,N
29,DET_DEL_PDE_CNT,342,353,12,N
30,DET_CAT_PDE_CNT,354,365,12,N
31,DET_ATTC_PDE_CNT,366,377,12,N
32,DET_NCAT_PDE_CNT,378,389,12,N
33,DET_NON_STD_CNT,390,401,12,N
34,DET_OON_PDE_CNT,402,413,12,N
35,DET_EST_REB_AT_POS,414,427,14,A
36,DET_VAC_ADM_FEE,428,441,14,A
37,DET_RPT_GAP_DISC,442,455,14,A
38,DET_RPT_GAP_DISC_PDES,456,467,12,N
Can anyone suggest another design approach to reduce the processing time?

For massively improved performance you'll need to rewrite this. I suggest Python, Ruby, Awk, Perl, or similar.
The biggest reason you currently have disastrous performance is that your loop nesting is inverted:

for line in data:
    for line in config:
        do stuff specified in config to line

What you should be doing is:

for line in config:
    parse and store line in memory
for line in data:
    do stuff specified in config (in memory)
You can do this using any of the above languages, and I promise you those 80,000 records can be processed in just a few seconds, rather than 24 hours.
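As a sketch of that shape in awk: the file names below are demo stand-ins (the variable names follow the question's script), and the per-field rule is left as a stub, not your real replacement logic.

```shell
# Demo stand-ins for the real files; names follow the question's script.
CONFIG_FILE=config.csv SRC_FILE=src.txt RES_FILE=res.txt
printf 'FIELD NO.,FIELD NAME,START,END,LENGTH,INDICATOR\n14,AMT,1,3,3,A\n' > "$CONFIG_FILE"
printf 'abcdef\n' > "$SRC_FILE"

# Pass 1 (NR==FNR): store the config in arrays, read once.
# Pass 2: stream the data file once; every lookup happens in memory.
awk -F',' '
    NR == FNR { if (FNR > 1 && $6 == "A") { s[++n] = $3; e[n] = $4 }; next }
    {
        for (i = 1; i <= n; i++) {
            c = substr($0, e[i], 1)   # character at the field end
            # ...apply the per-field replacement rule here (stub)...
        }
        print
    }
' "$CONFIG_FILE" "$SRC_FILE" > "$RES_FILE"
```

Because no external command is spawned per record, the cost per line drops from several process forks to a handful of in-memory operations.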

First, read the comments and understand that the main problem is the number of external commands, each called 80,000 times. When this is all done in one program, the overhead and performance issues are solved. Which program/tool you use is up to you.
You will not get close to that performance if you stick to bash, but you can learn a lot by using fast bash built-ins where you can.
Some tips for improving your script:
See @John's answer: read the config file only once.
Use read to split the fields of a config file line:
while IFS="," read -r fieldno fieldname start_pos end_pos length indicator; do
    ...
done < configfile
Avoid expr: not f_cnt1=$(expr $field_len - 2) but (( f_cnt1 = field_len - 2 ))
Redirect to the output file after the final done, not once per record (currently difficult since you are echoing both debug statements and results).
Delete the debug statements.
Use <<< for strings
It would be nice if you could change the flow so that you do not need to call sed (80,000 records × 38 config lines) times: generate a complex sed script from the config file that can handle all cases, and run sed -f complex.sed "$SRC_FILE" just once.
If that is too complex, introduce a string sed_instructions. For each config file line, append that line's sed instruction to the string: sed_instructions="${sed_instructions};s/\(.\{1,$lim\}\)....". Then you only need to call sed -e "${sed_instructions}" <<< "${line}" once per record.
Even better, generate the string ${sed_instructions} once, before reading ${SRC_FILE}.
See "which is the fastest way to print in awk" for another example of performance improvements.
I think this can be improved to 10 minutes in bash, 1 minute in awk, and less in the programs @John mentioned.
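A rough sketch of the sed_instructions approach, with throwaway demo files and a placeholder s/// rule rather than the real 20 replacements:

```shell
# Demo stand-ins; the s/// body is a placeholder rule, not the real one.
CONFIG_FILE=config.csv SRC_FILE=src.txt RES_FILE=res.txt
printf 'NO,NAME,START,END,LEN,IND\n1,F1,2,3,2,A\n' > "$CONFIG_FILE"
printf 'a{bc\n' > "$SRC_FILE"

sed_instructions=""
while IFS=',' read -r fieldno fieldname start_pos end_pos field_len indicator; do
    [[ $indicator == A ]] || continue
    (( lim = start_pos - 1 ))
    # Placeholder rule: turn a "{" right after the field start into "+".
    sed_instructions="${sed_instructions}s/^\\(.\\{${lim}\\}\\){/\\1+/;"
done < <(tail -n +2 "$CONFIG_FILE")

sed -e "$sed_instructions" "$SRC_FILE" > "$RES_FILE"   # one sed call per file
```

The whole config is translated into a single sed program, so sed starts once instead of 80,000 × 38 times.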

Related

Unix partial string comparison

I need to compare a string partially to check for a given condition.
My $1 should be checked for whether it contains the string BLR,
while my input file has $1 entries such as BLR21, BLR64, IND23.
I only need a true condition when $1 equals BLR**,
where the stars can be anything.
I used a simple if condition:
if($1=="BLR21")
{print $2}
Now this only works when the whole of BLR21 is in the row.
I need to check not for BLR21 but only for BLR.
Please help.
Your question is not great, I hope I understood.
Quick and easy solution
grep BLR input.txt
This will output all the lines in which "BLR" is found, in file input.txt. It will match "BLR" with any prefix and suffix, whatever they might be (spaces, alphanumerical, tabs, ...).
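Since the snippet in the question is awk, the direct fix there is a prefix test instead of an exact comparison; the input file below is a demo stand-in:

```shell
printf 'BLR21 x\nIND23 y\nBLR64 z\n' > input.txt   # demo input

# Prefix test instead of exact equality: index()==1 means "starts with".
awk 'index($1, "BLR") == 1 { print $2 }' input.txt
# Same thing with an anchored regex:
awk '$1 ~ /^BLR/ { print $2 }' input.txt
```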
"Complicated" solution
A bit more complicated: it does the same thing, but makes sure input.txt exists, and it is in the form of a script.
Input file, input.txt:
BLR21 BLR64 IND23
Your script could be:
#!/bin/bash
#
# Arguments
inputfile="input.txt"
if [[ $# -ne 1 ]]
then
    echo "Usage: myscript.bash <STRING>"
    exit 1
else
    string="$1"
fi
# Validation, and processing...
if [[ ! -f "$inputfile" ]]
then
    echo "ERROR: file >>$inputfile<< does not exist."
    exit 2
else
    grep "$string" "$inputfile"
fi
And to call the script, you do:
./myscript.bash BLR
But really, a simple grep does the job here.
Taking it even further...
#!/bin/bash
#
# Arguments
inputfile="input.txt"
if [[ $# -ne 1 ]]
then
    echo "Usage: check.bash <STRING>"
    exit 1
else
    string="$1"
fi
# Validation, and processing...
if [[ ! -f "$inputfile" ]]
then
    echo "ERROR: file >>$inputfile<< does not exist."
    exit 2
else
    while read -r line
    do
        if [[ "$line" =~ $string ]]
        then
            echo "$line"
        fi
    done <"$inputfile"
fi
Now this one is like going to the Moon via Mars...
It reads each line of the file, one by one. Then it checks if that line contains the string, using the =~ operator inside the if.
But this is crazy, when a simple grep would do.
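One caveat with =~: it matches anywhere in the line, so a field like XBLR21 would also pass. If the match must be anchored at the start (as BLR** suggests), a glob comparison does that without a regex:

```shell
string="BLR"
line="BLR21 BLR64 IND23"
# An unquoted trailing * makes this a glob: true only if $line starts with BLR.
if [[ "$line" == "$string"* ]]; then
    echo "$line"
fi
```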

Splitting a file in a shell script adds unwanted newlines

I need to process a long text file, splitting it into many smaller files. I have a single-pass while - read - done <inputfile loop, and when a line is matched, that signals the start of a new output file. The matched lines are always preceded by a newline character in the input file.
My problem is that the output files (except the final one) are extended by a newline character. I have recreated the problem in this short example.
#!/bin/zsh
rm inputfile outputfile1 outputfile2
IFS=''
printf "section1\nsection1end\n\nsection2\nsection2end\n" >inputfile
echo " open outputfile1"
exec 3<> outputfile1
counter=1
IFS=$'\n'
while IFS= read line; do
    if [[ "$line" == "section2" ]]; then
        echo " Matched start of section2. Close outputfile1 and open outputfile2"
        exec 3>&-
        exec 3<> outputfile2
    fi
    echo "$line" >&3
    echo $counter $line
    let "counter = $counter + 1"
done <inputfile
echo " Close outputfile2"
exec 3>&-
echo
unset IFS
echo `wc -l inputfile`
echo `wc -l outputfile1`
echo `wc -l outputfile2`
echo " The above should show 5, 2, 2 as desired number of newlines in these files."
Which outputs:
open outputfile1
1 section1
2 section1end
3
Matched start of section2. Close outputfile1 and open outputfile2
4 section2
5 section2end
Close outputfile2
5 inputfile
3 outputfile1
2 outputfile2
The above should show 5, 2, 2 as desired number of newlines in these files.
Option 1
Get rid of all empty lines. This only works if you don't need to retain any of the empty lines in the middle of a section.
Change:
echo "$line" >&3
To:
[[ -n "$line" ]] && echo "$line" >&3
Option 2
Rewrite each file using command substitution to trim any trailing newlines. Works best with short files. Change:
exec 3>&-
exec 3<> outputfile2
To:
exec 3>&-
data=$(<outputfile1)
echo "$data" >outputfile1
exec 3<> outputfile2
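The reason this works: $(<file) command substitution strips all trailing newlines, and printf/echo writes the content back with exactly one. A minimal demonstration with a throwaway demo.txt:

```shell
printf 'section1\nsection1end\n\n' > demo.txt  # ends with a blank line (3 newlines)
data=$(<demo.txt)                              # $(<file) strips trailing newlines
printf '%s\n' "$data" > demo.txt               # write back a single final newline
wc -l < demo.txt                               # counts 2 now
```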
Option 3
Have the loop write the line from the prior iteration, and then do not write the final line from the prior file when you start a new file:
#!/bin/zsh
rm inputfile outputfile1 outputfile2
IFS=''
printf "section1\nsection1end\n\nsection2\nsection2end\n" >inputfile
echo " open outputfile1"
exec 3<> outputfile1
counter=1
IFS=$'\n'
priorLine=MARKER
while IFS= read line; do
    if [[ "$line" == "section2" ]]; then
        echo " Matched start of section2. Close outputfile1 and open outputfile2"
        exec 3>&-
        exec 3<> outputfile2
    elif [[ "$priorLine" != MARKER ]]; then
        echo "$priorLine" >&3
    fi
    echo $counter $line
    let "counter = $counter + 1"
    priorLine="$line"
done <inputfile
echo "$priorLine" >&3
echo " Close outputfile2"
exec 3>&-
echo
unset IFS
echo `wc -l inputfile`
echo `wc -l outputfile1`
echo `wc -l outputfile2`
echo " The above should show 5, 2, 2 as desired number of newlines in these files."
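As a side note: since the question says the matched lines are always preceded by a blank line, awk's paragraph mode can do the whole split in one pass with no trailing-newline problem at all. A sketch, assuming the outputfileN naming from the example:

```shell
printf 'section1\nsection1end\n\nsection2\nsection2end\n' > inputfile
# RS='' (paragraph mode): each blank-line-separated block is one record,
# and print writes it to its own file with a single trailing newline.
awk -v RS='' '{ print > ("outputfile" ++n) }' inputfile
```

Both output files end up with exactly the desired 2 lines.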

Find and replace specific text with sed

I have a file with a lot of test cases. What I have to do is find "eapi(" and replace it with "case(counter," where counter starts from 1, 2, etc.
input file
eapi(price1(....))
eapi(price2(....))
eapi(price3(....))
eapi(price4(....))
Expected Results:
case(1,price1(....))
case(2,price2(....))
case(3,price3(....))
case(4,price4(....))
. . . . .
I used the sed command below, but it is not working.
COUNTER=1
while read a
do
sed 's/eapi(/case(`echo $COUNTER`,/' $a
echo " $COUNTER "
COUNTER=$[$COUNTER +1]
done < input
Please advise.
You can do it this way:
COUNTER=1
cat input.txt | while read a
do
    echo "$a" | sed "s/eapi(/case($COUNTER,/"
    echo " $COUNTER "
    COUNTER=$((COUNTER + 1))
done
You could do something like:
i=1
while read -r line; do
    sed "s/eapi(/case($i,/" <<< "$line"
    ((i++))
done < "$1"
With your input the result is:
case(1,price1(....))
case(2,price2(....))
case(3,price3(....))
case(4,price4(....))
As far as I know, you have to use double quotes in sed to get the content of a variable, so with the double quotes it works.
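The quoting difference in one line, with the counter hard-coded for illustration:

```shell
i=7
echo 'eapi(x)' | sed 's/eapi(/case($i,/'   # single quotes: prints case($i,x)
echo 'eapi(x)' | sed "s/eapi(/case($i,/"   # double quotes: prints case(7,x)
```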
If the counter is always the same as the number after price, you can do this:
#!/bin/bash
IFS=$'\n'
while read a
do
    nr=$(echo "$a" | grep -oP 'eapi\(price\K([0-9]+)')
    echo "$a" | sed 's/eapi(/case('$nr',/'
done < input
For this type of replacement, you could use awk:
awk '/^eapi\(price/{sub(/eapi\(price/,"case(" ++c ",price")}1' file
This replaces the pattern ^eapi\(price by adding the counter c and the wanted string.

echo does not display proper output

The following code reads the contents of test.txt and, based on the first field, redirects everything from the third field onward to result.txt:
src_fld=s1
type=11
Logic_File=`cat /home/script/test.txt`
printf '%s\n' "$Logic_File" |
{
    while IFS=',' read -r line
    do
        fld1=`echo $line | cut -d ',' -f 1`
        if [[ $type -eq $fld1 ]];then
            query=`echo $line | cut -d ',' -f 3-`
            echo $query >> /home/stg/result.txt
        fi
    done
}
Following is the contents of test.txt:
6,STRING TO DECIMAL WITHOUT DEFAULT,cast($src_fld as DECIMAL(15,2) $tgt_fld
7,STRING TO INTERGER WITHOUT DEFAULT,cast($src_fld as integer) $tgt_fld
11,DEFAULT NO RULE,$src_fld
Everything works fine except that the output in result.txt is $src_fld instead of s1. Can anyone please tell me what is wrong in the code?
Try replacing the line below:
echo $query >> /home/stg/result.txt
with this one
eval "echo $query" >> /home/stg/result.txt
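A word of caution: eval makes this work by forcing a second round of expansion, but it will also execute anything else embedded in the line. If the templates only ever reference $src_fld and $tgt_fld, a plain parameter substitution is a safer sketch:

```shell
src_fld=s1
tgt_fld=t1
query='cast($src_fld as integer) $tgt_fld'
# Substitute only the known placeholders; nothing else gets evaluated.
query=${query//'$src_fld'/$src_fld}
query=${query//'$tgt_fld'/$tgt_fld}
echo "$query"   # cast(s1 as integer) t1
```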

Append output of a command to file without newline

I have the following line in a unix script:
head -1 $line | cut -c22-29 >> $file
I want to append this output with no newline, but rather separated with commas. Is there any way to feed the output of this command to printf? I have tried:
head -1 $line | cut -c22-29 | printf "%s, " >> $file
I have also tried:
printf "%s, " head -1 $line | cut -c22-29 >> $file
Neither of those has worked. Does anyone have any ideas?
You just want tr in your case
tr '\n' ','
will replace all the newlines ('\n') with commas
head -1 $line | cut -c22-29 | tr '\n' ',' >> $file
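If you specifically want printf, hand it the command's output via command substitution, which itself strips the trailing newline. The file names here are demo stand-ins for the question's $line (an input file) and $file:

```shell
# Demo values; $line and $file play the same roles as in the question.
line=in.txt file=out.txt
printf '123456789012345678901ABCDEFGHxx\n' > "$line"   # columns 22-29 = ABCDEFGH

# Command substitution drops the trailing newline; printf appends ", ".
printf '%s, ' "$(head -1 "$line" | cut -c22-29)" >> "$file"
```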
A very old topic, but even now I needed to do this (with limited command resources), and the command in the answer above didn't work for me because of its length.
Appending to a file can also be done using file descriptors:
touch file.txt (create a new blank file)
exec 100<> file.txt (open a new fd with id 100)
echo -n test >&100 (echo test to the new fd)
exec 100>&- (close the new fd)
Appending starting from a specific character can be done by reading the file up to a certain point, e.g.:
exec 100<> file.txt - new descriptor
read -n 4 <&100 - read 4 characters
echo -n test >&100 - append echo test to the file starting from the fourth character
exec 100>&- - close the new fd
