Splitting a file in a shell script adds unwanted newlines - zsh

I need to process a long text file by splitting it into many smaller files. I have a single-pass while ... read ... done <inputfile loop, and when a line is matched, that signals the start of a new output file. The matched lines are always preceded by a newline character in the input file.
My problem is that the output files (except the final one) end with an extra newline character. I have recreated the problem in this short example.
#!/bin/zsh
rm inputfile outputfile1 outputfile2
IFS=''
printf "section1\nsection1end\n\nsection2\nsection2end\n" >inputfile
echo " open outputfile1"
exec 3<> outputfile1
counter=1
IFS=$'\n'
while IFS= read line; do
    if [[ "$line" == "section2" ]]; then
        echo " Matched start of section2. Close outputfile1 and open outputfile2"
        exec 3>&-
        exec 3<> outputfile2
    fi
    echo "$line" >&3
    echo $counter $line
    let "counter = $counter + 1"
done <inputfile
echo " Close outputfile2"
exec 3>&-
echo
unset IFS
echo `wc -l inputfile`
echo `wc -l outputfile1`
echo `wc -l outputfile2`
echo " The above should show 5, 2, 2 as desired number of newlines in these files."
Which outputs:
open outputfile1
1 section1
2 section1end
3
Matched start of section2. Close outputfile1 and open outputfile2
4 section2
5 section2end
Close outputfile2
5 inputfile
3 outputfile1
2 outputfile2
The above should show 5, 2, 2 as desired number of newlines in these files.

The blank line that precedes section2 is read and written to outputfile1 on the iteration before the match is seen, so every file except the last picks up an extra trailing newline. Here are three ways to avoid it.
Option 1
Get rid of all empty lines. This only works if you don't need to retain any of the empty lines in the middle of a section.
Change:
echo "$line" >&3
To:
[[ -n "$line" ]] && echo "$line" >&3
Option 2
Rewrite each file using command substitution to trim any trailing newlines. This works best with short files, since each file is read entirely into memory. Change:
exec 3>&-
exec 3<> outputfile2
To:
exec 3>&-
data=$(<outputfile1)
echo "$data" >outputfile1
exec 3<> outputfile2
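This works because command substitution strips all trailing newlines and echo then adds exactly one back. A quick standalone demonstration (a throwaway file f, not part of the script above):
printf 'a\nb\n\n\n' > f
data=$(<f)          # $(<f) reads the file; all trailing newlines are stripped
echo "$data" > f    # echo appends exactly one newline
wc -l f             # now reports 2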
Option 3
Have the loop write the line from the prior iteration, and then do not write the final line from the prior file when you start a new file:
#!/bin/zsh
rm inputfile outputfile1 outputfile2
IFS=''
printf "section1\nsection1end\n\nsection2\nsection2end\n" >inputfile
echo " open outputfile1"
exec 3<> outputfile1
counter=1
IFS=$'\n'
priorLine=MARKER
while IFS= read line; do
    if [[ "$line" == "section2" ]]; then
        echo " Matched start of section2. Close outputfile1 and open outputfile2"
        exec 3>&-
        exec 3<> outputfile2
    elif [[ "$priorLine" != MARKER ]]; then
        echo "$priorLine" >&3
    fi
    echo $counter $line
    let "counter = $counter + 1"
    priorLine="$line"
done <inputfile
echo "$priorLine" >&3
echo " Close outputfile2"
exec 3>&-
echo
unset IFS
echo `wc -l inputfile`
echo `wc -l outputfile1`
echo `wc -l outputfile2`
echo " The above should show 5, 2, 2 as desired number of newlines in these files."

Related

unix command to extract lines with blank values in particular position in a fixed width file

I have a fixed width file with no delimiter.
I would like to extract the lines in the fixed width file which have blank values at positions 550-552.
With sed:
sed -nE '/^.{549}[[:blank:]]{3}/p' file
The [[:blank:]] class matches a space or a tab; change it to a space character if you only want to match three spaces.
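For instance, to require exactly three spaces at those positions:
sed -nE '/^.{549} {3}/p' file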
You could use egrep (or equivalently, grep -E):
#first let's build a test file, using seq to make 549 dummy characters (X), then 3 characters, then some more dummy characters (Y):
laptop:~/tmp$ (for n in `seq 1 549`; do echo -n X;done ;echo -n '   '; echo YYYYYYYYYYYYY ) > file
laptop:~/tmp$ (for n in `seq 1 549`; do echo -n X;done ;echo -n 'zzz'; echo YYYYYYYYYYYYY ) >> file
laptop:~/tmp$ (for n in `seq 1 549`; do echo -n X;done ;echo -n '   '; echo YYYYYYYYYYYYY ) >> file
laptop:~/tmp$ (for n in `seq 1 549`; do echo -n X;done ;echo -n '123'; echo YYYYYYYYYYYYY ) >> file
#then do the actual search
laptop:~/tmp$ egrep '^.{549}   ' file
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX   YYYYYYYYYYYYY
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX   YYYYYYYYYYYYY

Unix - how to merge filename inside content with cat and basename?

I want to merge files (each containing one line) with the filename at the beginning of the line.
cat merges the files and basename -a gives the filenames, but I don't know how to combine the two.
$ echo "content1" > f1.txt
$ echo "content2" > f2.txt
$ cat f*.txt > all.txt
$ cat all.txt
content1
content2
$ basename -a f*.txt
f1.txt
f2.txt
Would like this result :
$ cat all.txt
f1.txt content1
f2.txt content2
Just use grep -H, then post-process to change the delimiter:
$ for i in 1 2; do echo content$i > f$i.txt; done
$ grep -H . *.txt
f1.txt:content1
f2.txt:content2
$ grep -H . *.txt | sed 's/:/ /'
f1.txt content1
f2.txt content2
or,
$ awk '{printf "%s\t%s\n", FILENAME, $0}' *.txt
f1.txt content1
f2.txt content2
Use a loop. Iterate over every *.txt file, echo its name to the output file without a trailing newline (done with echo -n), append the file's contents to the output file, and finally append a newline. Note that >> appends; using > would overwrite.
rm -f all.txt
for f in *.txt; do echo -n "$f " >> all.txt; cat "$f" >> all.txt; echo >> all.txt; done
If your input files already end with a newline, then you can skip the final echo and just do:
for f in *.txt; do echo -n "$f " >> all.txt; cat "$f" >> all.txt; done
If you're using tcsh instead of bash, then you can use foreach, but you can't write the whole loop in a single command. Normally you would write this in a script:
foreach f (*.txt)
echo -n "$f " >> all.txt
cat "$f" >> all.txt
end
To get this in a single command line you need something like this instead:
printf 'foreach f (*.txt)\n echo -n "$f " >> all.txt\n cat "$f" >> all.txt\n end' | tcsh

sed takes more time for processing

I have to replace certain characters (around 20 combinations) in each record. I have implemented this using the sed command, but it takes more than 24 hours if a file is huge (more than 80,000 records).
I use two loops: one to read the input file and one to read the config file, where the position of each character to be replaced is specified. Each line can contain more than one character that needs to be replaced. When I replace a character I have to convert it to a decimal number, and as a result the position of the next replacement character needs to be shifted. Please find the code snippet below:
...
#Read the input file line by line
while read -r line
do
    Flag='F'
    pos_count=0
    for pattern in `awk 'NR>1' $CONFIG_FILE`
    do
        field_type=`echo $pattern | cut -d"," -f6`
        if [[ $field_type = 'A' ]];then
            echo "For loop.."
            echo $pattern
            field_type=`echo $pattern | cut -d"," -f6`
            echo field_type $field_type
            start_pos=`echo $pattern | cut -d"," -f3`
            echo start_pos $start_pos
            end_pos=`echo $pattern | cut -d"," -f4`
            echo end_pos $end_pos
            field_len=`echo $pattern | cut -d"," -f5`
            if [[ $Flag = 'T' && $field_type = 'A' ]];then
                if [[ $replace = 'R' ]];then
                    pos_count=$(expr $pos_count + 1)
                fi
                echo pos_count $pos_count
                val=$((2 * $pos_count))
                start_pos=$(expr $start_pos + $val)
                end_pos=$(expr $end_pos + $val)
                replace=N
            fi
            echo "$line"
            field=`expr substr "$line" $end_pos 1`
            echo field $field
            if [[ $start_pos -gt 255 ]];then
                lim=255
                f_cnt=$(expr $start_pos - 1)
                c_cnt=$(expr $end_pos - 2)
                #c_cnt1=$(expr $c_cnt - 255)
                c_cnt1=$(expr $field_len - 2)
                f_cnt1=$(expr $f_cnt - 255)
                echo f_cnt1 "$f_cnt1" , c_cnt1 "$c_cnt1" f_cnt $f_cnt
            else
                lim=$(expr $start_pos - 1)
                f_cnt1=$(expr $field_len - 2)
                echo lim $lim, f_cnt1 $f_cnt1
            fi
            echo Flag $Flag
            case "$field_type" in
                A )
                    echo Field type is Amount
                    if [[ "${field}" = "{" ]];then
                        echo "Replacing { in Amount Column"
                        replace=R
                        if [[ $start_pos -gt 255 ]];then
                            line=`echo "$line"| sed -e "s/\(.\{1,$lim\}\)\(.\{1,$f_cnt1\}\)\(.\{1,$c_cnt1\}\)\([^{]*\){/\1\2+\3.\40/"`
                        else
                            line=`echo "$line"| sed -e "s/\(.\{1,$lim\}\)\(.\{1,$f_cnt1\}\)\([^{]*\){/\1+\2.\30/"`
                        fi
                        Flag='T'
                    elif [[ "${field}" = "A" ]];then
                        echo "Replacing A in Amount Column"
                        replace=R
                        if [[ $start_pos -gt 255 ]];then
                            line=`echo "$line"| sed -e "s/\(.\{1,$lim\}\)\(.\{1,$f_cnt1\}\)\(.\{1,$c_cnt1\}\)\([^A]*\)A/\1\2+\3.\41/"`
                        else
                            line=`echo "$line"| sed -e "s/\(.\{1,$lim\}\)\(.\{1,$f_cnt1\}\)\([^A]*\)A/\1+\2.\31/"`
                        fi
                        Flag='T'
                    ...
                    elif [[ "${field}" = "R" ]];then
                        echo "Replacing R in Amount Column"
                        replace=R
                        if [[ $start_pos -gt 255 ]];then
                            line=`echo "$line"| sed -e "s/\(.\{1,$lim\}\)\(.\{1,$f_cnt1\}\)\(.\{1,$c_cnt1\}\)\([^R]*\)R/\1\2-\3.\49/"`
                        else
                            line=`echo "$line"| sed -e "s/\(.\{1,$lim\}\)\(.\{1,$f_cnt1\}\)\([^R]*\)R/\1-\2.\39/"`
                        fi
                        Flag='T'
                    else
                        echo "Incrementing the size of Amount Column"
                        replace=R
                        if [[ $start_pos -gt 255 ]];then
                            line=`echo "$line"| sed -e "s/\(.\{1,$lim\}\)\(.\{1,$f_cnt1\}\)\(.\{1,$c_cnt1\}\)/\1\2\3 /"`
                        else
                            line=`echo "$line"| sed -e "s/\(.\{1,$lim\}\)\(.\{1,$f_cnt1\}\)/\1\2\3 /"`
                        fi
                    fi
                    ;;
                C )
                    echo "Column Type is Count"
                    ;;
                * )
                    echo Others
                    ;;
            esac
        fi
    done
    echo "$line" >> ${RES_FILE}
done < "$SRC_FILE"
echo `date`
exit 0
Below is the sample input file and config file:
CHD0000204H315604COV2013038 PROD2016022016030218481304COVCTR0000204H3156C00000000897 000000229960000024670141D0000000397577I0000000000000{00000174042
55C0000007666170B0000025070425E0000004863873E0000000631900F0000001649128{0000000018756B0000014798809C0000001890129G00000002384500000000286600000000084900000000155300000
0000055000000021388000000000048000000000003 00000897 0000000000000{0000000002706B0000001217827I000000001069
Config file:
FIELD NO.,FIELD NAME,STARTING POSITION,ENDING POSITION,LENGTH,INDICATOR
1,CHD_CONTRACT_NO,1,5,5,N
2,CHD_FILE_ID,6,21,16,N
3,PHD_CONTRACT_NO,22,26,5,N
4,PHD_PBP_ID,27,29,3,N
5,PHD_FILE_ID,30,45,16,N
6,DET_REC_ID,46,48,3,N
7,DET_SEQ_NO,49,55,7,N
8,DET_DG_CO_ST_CD,56,56,1,N
9,DET_CURR_HICN,57,76,20,N
10,DET_LAST_SUBM_HICN,77,96,20,N
11,DET_LAST_SUBM_CH_ID,97,116,20,N
12,DET_ERL_PDE_ATT_DT,117,124,8,N
13,DET_RX_COUNT,125,135,11,N
14,DET_NET_IGD_COST_AMT,136,149,14,A
15,DET_NET_DISP_FEE,150,163,14,A
16,DET_NET_SAL_TAX_AMT,164,177,14,A
17,DET_NET_GDCB,178,191,14,A
18,DET_NET_GDCA,192,205,14,A
19,DET_NET_GRS_DG_AMT,206,219,14,A
20,DET_NET_PAT_PAY_AMT,220,233,14,A
21,DET_NET_OTR_TROOP_AMT,234,247,14,A
22,DET_NET_LICS_AMT,248,261,14,A
23,DET_NET_TROOP_AMT,262,275,14,A
24,DET_NET_PLRO_AMT,276,289,14,A
25,DET_NET_CPP_AMT,290,303,14,A
26,DET_NET_NPP_AMT,304,317,14,A
27,DET_ORIG_PDE_CNT,318,329,12,N
28,DET_ADJ_PDE_CNT,330,341,12,N
29,DET_DEL_PDE_CNT,342,353,12,N
30,DET_CAT_PDE_CNT,354,365,12,N
31,DET_ATTC_PDE_CNT,366,377,12,N
32,DET_NCAT_PDE_CNT,378,389,12,N
33,DET_NON_STD_CNT,390,401,12,N
34,DET_OON_PDE_CNT,402,413,12,N
35,DET_EST_REB_AT_POS,414,427,14,A
36,DET_VAC_ADM_FEE,428,441,14,A
37,DET_RPT_GAP_DISC,442,455,14,A
38,DET_RPT_GAP_DISC_PDES,456,467,12,N
Can anyone suggest any other design approach to reduce the time for processing?
For massively improved performance you'll need to rewrite this. I suggest Python, Ruby, Awk, Perl, or similar.
The biggest reason you currently have disastrous performance is that your loop nesting is inverted:
for line in data:
    for line in config:
        do stuff specified in config to line
What you should be doing is:
for line in config:
    parse and store line in memory
for line in data:
    do stuff specified in config (in memory)
You can do this using any of the above languages, and I promise you those 80,000 records can be processed in just a few seconds, rather than 24 hours.
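For illustration, here is a minimal awk sketch of that structure, assuming the config layout shown above; the replacement logic itself is only a placeholder comment:
# process.awk -- read the config once, then stream the data file.
# Run as: awk -f process.awk "$CONFIG_FILE" "$SRC_FILE" > result.txt
BEGIN { FS = "," }
NR == FNR {                 # first file argument: the config
    if (FNR > 1) {          # skip the header line
        n++
        start[n] = $3; len[n] = $5; type[n] = $6
    }
    next
}
{                           # second file argument: the data, config now in memory
    for (i = 1; i <= n; i++)
        if (type[i] == "A") {
            field = substr($0, start[i], len[i])
            # ... apply your replacement rules to field and rebuild $0 here ...
        }
    print
}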
First, read the comments and understand that the main problem is the number of external commands, each called 80,000 times. When all of this is done in one program, the overhead and performance issues go away. Which program/tool you use is up to you.
You will not get close to that performance if you stick to bash code, but you can learn a lot by using fast internal bash constructs where you can.
Some tips for improving your script:
See the answer from @John: read the config file only once.
Use read to split the fields of a config-file line:
while IFS="," read -r fieldno fieldname start_pos end_pos length indicator; do
...
done < configfile
Avoid expr: not f_cnt1=$(expr $field_len - 2) but (( f_cnt1 = field_len - 2 ))
Redirect to the output file after the last done, not once per record (currently difficult since you are echoing both debug statements and results); see the one-liner below.
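Assuming the debug echoes are removed and the inner echo "$line" no longer redirects itself, that is just:
done < "$SRC_FILE" > "$RES_FILE"    # open the output file once, not once per record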
Delete debug statements
Use <<< for strings
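For example, the echo | cut pipelines can become here-strings:
field_type=$(cut -d"," -f6 <<< "$pattern")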
It would be nice if you could change the flow so that you do not need to call sed (80,000 records x 38 config lines) times: generate one complex sed script from the config file that handles all cases, and run sed -f complex.sed "$SRC_FILE" just once.
When this is too complex, introduce a string sed_instructions. For each config-file line, append that line's sed instruction to the string: sed_instructions="${sed_instructions};s/\(.\{1,$lim\}\)....". Then you only need to call sed -e "${sed_instructions}" <<< "${line}" once for each record.
It would be nice if you could build the string ${sed_instructions} once, before reading ${SRC_FILE}, as sketched below.
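A rough sketch of that fallback, assuming bash and GNU sed; the single s/// command is only a stand-in for your real replacement rules:
sed_instructions=""
while IFS="," read -r fieldno fieldname start_pos end_pos length indicator; do
    if [[ $indicator == "A" ]]; then
        (( lim = start_pos - 1 ))
        # stand-in rule; build your real s/// commands for this field here
        sed_instructions="${sed_instructions}s/\(.\{$lim\}\){/\1+/;"
    fi
done < <(awk 'NR>1' "$CONFIG_FILE")

while read -r line; do
    sed -e "${sed_instructions%;}" <<< "$line"   # %; strips the trailing semicolon
done < "$SRC_FILE" > "$RES_FILE"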
See which is the fastest way to print in awk for another example of performance improvements.
I think it can be improved to 10 minutes using bash, 1 minute using awk, and even less with the programs mentioned by @John.

echo does not display proper output

The following code reads the contents of test.txt and, based on the first field, redirects the third field to result.txt.
src_fld=s1
type=11
Logic_File=`cat /home/script/test.txt`
printf '%s\n' "$Logic_File" |
{
    while IFS=',' read -r line
    do
        fld1=`echo $line | cut -d ',' -f 1`
        if [[ $type -eq $fld1 ]];then
            query=`echo $line | cut -d ',' -f 3-`
            echo $query >> /home/stg/result.txt
        fi
    done
}
Following is the contents of test.txt:
6,STRING TO DECIMAL WITHOUT DEFAULT,cast($src_fld as DECIMAL(15,2) $tgt_fld
7,STRING TO INTERGER WITHOUT DEFAULT,cast($src_fld as integer) $tgt_fld
11,DEFAULT NO RULE,$src_fld
Everything works fine except that the output in result.txt is $src_fld instead of s1. Can anyone tell me what is wrong with the code?
Try replacing the below line
echo $query >> /home/stg/result.txt
with this one
eval "echo $query" >> /home/stg/result.txt

check if file is empty or not

How do I check whether a file is empty in a Korn shell script?
I want to test in my Korn shell script whether the output CSV file is empty or not, and if it is not empty, it should give the count of values.
Thanks.
The test(1) program has a -s switch:
-s FILE
FILE exists and has a size greater than zero
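For example (a minimal sketch; output.csv stands in for your CSV file):
if [ -s output.csv ]; then
    echo "output.csv has `wc -l < output.csv` records"
else
    echo "output.csv is empty (or does not exist)"
fi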
This is just another way of doing it, albeit a roundabout one:
if [ `ls -l <file> | awk '{print $5}'` -eq 0 ]
then
    : # condition for being empty
else
    : # condition for not being empty
fi
if [ ! -f manogna.txt ]
then
    echo " Error: manogna.txt does not exist "
else
    echo " manogna.txt exists "
    echo " no of records in manogna.txt are `cat manogna.txt | wc -l`"
fi
