This is a somewhat complicated question. I want to replace every ? in a file with X, but the problem is that the input file also contains some valid ? characters that must be left alone.
eg:
input:
HELLO ?, WELCOME TO THE NEW WORLD??23, and you are most ? valid person.
output:
HELLO X, WELCOME TO THE NEW WORLDX?23, and you are most X valid person.
Here, the ? that comes before 23 is a valid one; there are many such values: ?23, ?24, ?33, ?45, etc.
I tried a sed script, but I was not able to find the exact command.
The script I used:
LINE_NUM=0
while IFS= read -r LINE
do
  LINE_NUM=$?
  EXTRACTED=`echo "${LINE}" |grep '?23' | sed 's|^.*\?23||; s|\?[0-9].*$||'`
  if [ -n "$EXTRACTED" ]
  then
    UPDATED=`echo "$EXTRACTED" | sed 's/?/X/g'`
    UPDATED_1=`echo "$UPDATED" | awk '{gsub("/","%",$0); print}'`
    if [ $EXTRACTED != $UPDATED ]
    then
      LATEST_VALUE=`echo "${LINE}" | sed "s|${EXTRACTED}|${UPDATED}|g"`
    fi
    LATEST_VALUE=`echo "${LINE}"`
    echo "$LATEST_VALUE" >> outputfile.txt
  else
    echo "$LINE" >> outputfile.txt
  fi
done < inputfile.txt
$ echo "HELLO ?, WELCOME TO THE NEW WORLD??23, and you are most ? valid person." |
sed -E 's/\?([^0-9]|$)/X\1/g'
HELLO X, WELCOME TO THE NEW WORLDX?23, and you are most X valid person.
This replaces a ? only when it is followed by a non-digit or when it is the last character on the line, so a ? followed by a digit (such as ?23) is left untouched. If your list of valid values is more restricted, adjust the regex accordingly.
With your shown samples, please try the following in GNU awk. A simple explanation: set the record separator to one ? followed by one digit, use global substitution to replace every ? with X in the current record, set the output record separator to RT (the text that matched the record separator), and then print the record.
awk -v RS='[?][0-9]' '{gsub(/\?/,"X");ORS=RT} 1' Input_file
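For example, running it on the sample line from the question (this needs GNU awk, since a regex record separator and the RT variable are gawk features):
$ echo 'HELLO ?, WELCOME TO THE NEW WORLD??23, and you are most ? valid person.' |
awk -v RS='[?][0-9]' '{gsub(/\?/,"X");ORS=RT} 1'
HELLO X, WELCOME TO THE NEW WORLDX?23, and you are most X valid person.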
I have to replace certain characters (around 20 combinations) in each record. I have implemented it with sed, but it takes more than 24 hours if the file is huge (more than 80,000 records).
I use two loops: one to read the input file and one to read a config file that gives the position of each character to be replaced. Each line can contain more than one character that needs replacing, and when I replace a character I have to convert it to a decimal number, so the position of the next character to replace has to shift accordingly. Please find the code snippet below:
...
#Read the input file line by line
while read -r line
do
  Flag='F'
  pos_count=0
  for pattern in `awk 'NR>1' $CONFIG_FILE`
  do
    field_type=`echo $pattern | cut -d"," -f6`
    if [[ $field_type = 'A' ]];then
      echo "For loop.."
      echo $pattern
      field_type=`echo $pattern | cut -d"," -f6`
      echo field_type $field_type
      start_pos=`echo $pattern | cut -d"," -f3`
      echo start_pos $start_pos
      end_pos=`echo $pattern | cut -d"," -f4`
      echo end_pos $end_pos
      field_len=`echo $pattern | cut -d"," -f5`
      if [[ $Flag = 'T' && $field_type = 'A' ]];then
        if [[ $replace = 'R' ]];then
          pos_count=$(expr $pos_count + 1)
        fi
        echo pos_count $pos_count
        val=$((2 * $pos_count))
        start_pos=$(expr $start_pos + $val)
        end_pos=$(expr $end_pos + $val)
        replace=N
      fi
      echo "$line"
      field=`expr substr "$line" $end_pos 1`
      echo field $field
      if [[ $start_pos -gt 255 ]];then
        lim=255
        f_cnt=$(expr $start_pos - 1)
        c_cnt=$(expr $end_pos - 2)
        #c_cnt1=$(expr $c_cnt - 255)
        c_cnt1=$(expr $field_len - 2)
        f_cnt1=$(expr $f_cnt - 255)
        echo f_cnt1 "$f_cnt1" , c_cnt1 "$c_cnt1" f_cnt $f_cnt
      else
        lim=$(expr $start_pos - 1)
        f_cnt1=$(expr $field_len - 2)
        echo lim $lim, f_cnt1 $f_cnt1
      fi
      echo Flag $Flag
      case "$field_type" in
        A )
          echo Field type is Amount
          if [[ "${field}" = "{" ]];then
            echo "Replacing { in Amount Column"
            replace=R
            if [[ $start_pos -gt 255 ]];then
              line=`echo "$line"| sed -e "s/\(.\{1,$lim\}\)\(.\{1,$f_cnt1\}\)\(.\{1,$c_cnt1\}\)\([^{]*\){/\1\2+\3.\40/"`
            else
              line=`echo "$line"| sed -e "s/\(.\{1,$lim\}\)\(.\{1,$f_cnt1\}\)\([^{]*\){/\1+\2.\30/"`
            fi
            Flag='T'
          elif [[ "${field}" = "A" ]];then
            echo "Replacing A in Amount Column"
            replace=R
            if [[ $start_pos -gt 255 ]];then
              line=`echo "$line"| sed -e "s/\(.\{1,$lim\}\)\(.\{1,$f_cnt1\}\)\(.\{1,$c_cnt1\}\)\([^A]*\)A/\1\2+\3.\41/"`
            else
              line=`echo "$line"| sed -e "s/\(.\{1,$lim\}\)\(.\{1,$f_cnt1\}\)\([^A]*\)A/\1+\2.\31/"`
            fi
            Flag='T'
            ...
          elif [[ "${field}" = "R" ]];then
            echo "Replacing R in Amount Column"
            replace=R
            if [[ $start_pos -gt 255 ]];then
              line=`echo "$line"| sed -e "s/\(.\{1,$lim\}\)\(.\{1,$f_cnt1\}\)\(.\{1,$c_cnt1\}\)\([^R]*\)R/\1\2-\3.\49/"`
            else
              line=`echo "$line"| sed -e "s/\(.\{1,$lim\}\)\(.\{1,$f_cnt1\}\)\([^R]*\)R/\1-\2.\39/"`
            fi
            Flag='T'
          else
            echo "Incremeting the size of Amount Column"
            replace=R
            if [[ $start_pos -gt 255 ]];then
              line=`echo "$line"| sed -e "s/\(.\{1,$lim\}\)\(.\{1,$f_cnt1\}\)\(.\{1,$c_cnt1\}\)/\1\2\3 /"`
            else
              line=`echo "$line"| sed -e "s/\(.\{1,$lim\}\)\(.\{1,$f_cnt1\}\)/\1\2\3 /"`
            fi
          fi
          ;;
        C )
          echo "Column Type is Count"
          ;;
        * )
          echo Others
          :;
      esac
    fi
  done
  echo "$line" >> ${RES_FILE}
done < "$SRC_FILE"
echo `date`
exit 0
Below is the sample input file and config file:
CHD0000204H315604COV2013038 PROD2016022016030218481304COVCTR0000204H3156C00000000897 000000229960000024670141D0000000397577I0000000000000{00000174042
55C0000007666170B0000025070425E0000004863873E0000000631900F0000001649128{0000000018756B0000014798809C0000001890129G00000002384500000000286600000000084900000000155300000
0000055000000021388000000000048000000000003 00000897 0000000000000{0000000002706B0000001217827I000000001069
Config file:
FIELD NO.,FIELD NAME,STARTING POSITION,ENDING POSITION,LENGTH,INDICATOR
1,CHD_CONTRACT_NO,1,5,5,N
2,CHD_FILE_ID,6,21,16,N
3,PHD_CONTRACT_NO,22,26,5,N
4,PHD_PBP_ID,27,29,3,N
5,PHD_FILE_ID,30,45,16,N
6,DET_REC_ID,46,48,3,N
7,DET_SEQ_NO,49,55,7,N
8,DET_DG_CO_ST_CD,56,56,1,N
9,DET_CURR_HICN,57,76,20,N
10,DET_LAST_SUBM_HICN,77,96,20,N
11,DET_LAST_SUBM_CH_ID,97,116,20,N
12,DET_ERL_PDE_ATT_DT,117,124,8,N
13,DET_RX_COUNT,125,135,11,N
14,DET_NET_IGD_COST_AMT,136,149,14,A
15,DET_NET_DISP_FEE,150,163,14,A
16,DET_NET_SAL_TAX_AMT,164,177,14,A
17,DET_NET_GDCB,178,191,14,A
18,DET_NET_GDCA,192,205,14,A
19,DET_NET_GRS_DG_AMT,206,219,14,A
20,DET_NET_PAT_PAY_AMT,220,233,14,A
21,DET_NET_OTR_TROOP_AMT,234,247,14,A
22,DET_NET_LICS_AMT,248,261,14,A
23,DET_NET_TROOP_AMT,262,275,14,A
24,DET_NET_PLRO_AMT,276,289,14,A
25,DET_NET_CPP_AMT,290,303,14,A
26,DET_NET_NPP_AMT,304,317,14,A
27,DET_ORIG_PDE_CNT,318,329,12,N
28,DET_ADJ_PDE_CNT,330,341,12,N
29,DET_DEL_PDE_CNT,342,353,12,N
30,DET_CAT_PDE_CNT,354,365,12,N
31,DET_ATTC_PDE_CNT,366,377,12,N
32,DET_NCAT_PDE_CNT,378,389,12,N
33,DET_NON_STD_CNT,390,401,12,N
34,DET_OON_PDE_CNT,402,413,12,N
35,DET_EST_REB_AT_POS,414,427,14,A
36,DET_VAC_ADM_FEE,428,441,14,A
37,DET_RPT_GAP_DISC,442,455,14,A
38,DET_RPT_GAP_DISC_PDES,456,467,12,N
Can anyone suggest any other design approach to reduce the time for processing?
For massively improved performance you'll need to rewrite this. I suggest Python, Ruby, Awk, Perl, or similar.
The biggest reason you currently have disastrous performance is that your nesting of loops is wrong:
for line in data:
    for line in config:
        do stuff specified in config to line
What you should be doing is:
for line in config:
    parse and store line in memory
for line in data:
    do stuff specified in config (in memory)
You can do this using any of the above languages, and I promise you those 80,000 records can be processed in just a few seconds, rather than 24 hours.
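To make that concrete, here is a rough two-pass awk sketch of the second structure. It only illustrates reading the config once and then processing every record in memory; the actual replacement rules from the question are left as a placeholder comment, and the names two_pass.awk, n, start and len are my own:
# two_pass.awk -- rough sketch of the config-first structure, not the full logic
NR == FNR {                       # first file on the command line: the config
    if (FNR > 1) {                # skip the header line
        split($0, f, ",")
        if (f[6] == "A") {        # remember only the Amount fields
            n++
            start[n] = f[3]
            len[n]   = f[5]
        }
    }
    next
}
{                                 # second file: the data, one pass, in memory
    for (i = 1; i <= n; i++) {
        fld = substr($0, start[i], len[i])
        # ... apply the {, A, R, ... replacement rules to fld here ...
        $0 = substr($0, 1, start[i] - 1) fld substr($0, start[i] + len[i])
    }
    print
}
It would be invoked as something like awk -f two_pass.awk "$CONFIG_FILE" "$SRC_FILE" > "$RES_FILE"; the point is only that the config is parsed once and every record is handled without spawning external commands.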
First, read the comments and understand that the main problem is the number of external commands, each called 80,000 times. When everything is done in one program, the overhead and the performance problem disappear. Which program/tool you use is up to you.
You will not get close to that performance if you stick to bash, but you can learn a lot by using fast bash built-ins wherever you can.
Some tips for improving your script:
See the answer of #John: only read the config file once.
Use read to split the fields of each config-file line:
while IFS="," read -r fieldno fieldname start_pos end_pos length indicator; do
...
done < configfile
Avoid expr
Not f_cnt1=$(expr $field_len - 2) but (( f_cnt1 = field_len - 2))
Redirect to outputfile after the last done, not for each record (currently difficult since you are echoing both debug statements and results).
Delete debug statements
Use <<< for strings
It would be nice if you could change the flow so that you do not need to call sed (80,000 records x 38 config lines) times: generate one complex sed script from the config file that handles all cases, and run sed -f complex.sed "$SRC_FILE" just once.
If that is too complex, introduce a string sed_instructions. For each config-file line, append that line's sed instruction to the string: sed_instructions="${sed_instructions};s/\(.\{1,$lim\}\)....". Then you only need to call sed -e "${sed_instructions}" <<< "${line}" once per record (see the sketch at the end of this answer).
Ideally, generate the string ${sed_instructions} once, before reading ${SRC_FILE}.
See "which is the fastest way to print in awk" for another example of performance improvements.
I think this can be brought down to about 10 minutes with bash, 1 minute with awk, and even less with the programs mentioned by #John.
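As a hedged sketch of that sed_instructions idea, assuming the config layout shown in the question (field 6 is the indicator, fields 3-5 are start position, end position and length); the single s/// rule below is only a stand-in for the real expressions from the question:
# Build the whole sed program once, before reading $SRC_FILE (illustration only)
sed_instructions=""
while IFS="," read -r fieldno fieldname start_pos end_pos length indicator
do
  [[ $indicator = 'A' ]] || continue
  (( lim = start_pos - 1 ))
  (( f_cnt1 = length - 2 ))
  # placeholder rule; the question's real s/// expressions would be appended here
  sed_instructions="${sed_instructions:+$sed_instructions;}s/\(.\{1,$lim\}\)\(.\{1,$f_cnt1\}\)\([^{]*\){/\1+\2.\30/"
done < <(tail -n +2 "$CONFIG_FILE")

# then either once per record ...
#   sed -e "${sed_instructions}" <<< "${line}"
# ... or, if the rules do not depend on each other, once for the whole file:
#   sed -e "${sed_instructions}" "$SRC_FILE" > "$RES_FILE"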
I have the below lines in a file
id=1234,name=abcd,age=76
id=4323,name=asdasd,age=43
except that the real file has many more tag=value fields on each line.
I want the final output to be like
id,name,age
1234,abcd,76
4323,asdasd,43
I want all the values to the left of the = signs to come out comma-separated as the first row, and all the values to the right of the = signs to come below it, one row per input line.
Is there a way to do this with awk or sed? Please let me know if a for loop is required.
I am working on Solaris 10; the local sed is not GNU sed (so there is no -r option, nor -E).
$ cat tst.awk
BEGIN { FS="[,=]"; OFS="," }
NR==1 {
for (i=1;i<NF;i+=2) {
printf "%s%s", $i, (i<(NF-1) ? OFS : ORS)
}
}
{
for (i=2;i<=NF;i+=2) {
printf "%s%s", $i, (i<NF ? OFS : ORS)
}
}
$ awk -f tst.awk file
id,name,age
1234,abcd,76
4323,asdasd,43
Assuming they don't really exist in your input, I removed the ...s etc. that were cluttering up your example before running the above. If that stuff really does exist in your input, clarify how you want the text "(n number of fields)" to be identified and removed (string match? position on line? something else?).
EDIT: since you like the brevity of the cat|head|sed; cat|sed approach posted in another answer, here's the equivalent in awk:
$ awk 'NR==1{h=$0;gsub(/=[^,]+/,"",h);print h} {gsub(/[^,]+=/,"")} 1' file
id,name,age
1234,abcd,76
4323,asdasd,43
FILE=yourfile.txt
# first line (header)
cat "$FILE" | head -n 1 | sed -r "s/=[^,]+//g"
# other lines (data)
cat "$FILE" | sed -r "s/[^,]+=//g"
sed -r '1 s/^/id,name,age\n/;s/id=|name=|age=//g' my_file
edit: or use
sed '1 s/^/id,name,age\n/;s/id=\|name=\|age=//g'
output
id,name,age
1234,abcd,76 ...(n number of fields)
4323,asdasd,43...
The following simply combines the best of the sed-based answers so far, showing you can have your cake and eat it too. If your sed does not support the -r option, chances are that -E will do the trick; all else failing, one can replace R+ by RR* where R is [^,]
sed -r '1s/=[^,]+//g; s/[^,]+=//g'
(That is, the portable incantation would be:
sed "1s/=[^,][^,]*//g; s/[^,][^,]*=//g"
)
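Applied to the earlier sample lines (stored here in a hypothetical file named file), both forms should produce:
$ sed "1s/=[^,][^,]*//g; s/[^,][^,]*=//g" file
id,name,age
1234,abcd,76
4323,asdasd,43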
I am new to sed. I want to replace a substring.
for example:
var1=server1:game1,server2:game2,server3:game1
output should be
server1 server2 server3 (with just spaces)
I have tried this.
echo $var1 | sed 's/,/ /g' | sed 's/:* / /g'
This is not working. Please suggest a solution.
You can try this sed,
echo $var1 | sed 's/:[^,]\+,\?/ /g'
Explanation:
:[^,]\+, matches the string from the : up to and including the following ,
\? makes that final , optional, since the last entry on the line has no trailing ,
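For example (assuming GNU sed, since \+ and \? are GNU extensions; note the trailing space left by the last replacement):
$ var1=server1:game1,server2:game2,server3:game1
$ echo "$var1" | sed 's/:[^,]\+,\?/ /g'
server1 server2 server3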
echo "$var1" | sed 's/:game[0-9]*,*/ /g'
This assumes your substring is game followed by a number ([0-9]*).
An awk variation using the same regex as the sed answer:
awk '{gsub(/:[^,]+,?/," ")}1' <<< "$var1"
PS: It's always good practice to "quote" your variables.
Just for info, you are really only matching, not replacing, so grep can be your friend (with -P):
grep -oP '[^:,=]+(?=:)'
That matches a run of characters outside the set :,= that is followed by a :, using a lookahead.
This will put the servers on different lines, which may be what you want anyway. You can put them on one line by adding tr:
grep -oP '[^:,=]+(?=:)' | tr '\n' ' '
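For example, with the sample value (assuming GNU grep built with PCRE support for -P):
$ var1=server1:game1,server2:game2,server3:game1
$ echo "$var1" | grep -oP '[^:,=]+(?=:)' | tr '\n' ' '
server1 server2 server3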