Splitting a file in a shell script adds unwanted newlines - zsh
I need to process a long text file, splitting it into many smaller files. I have a single-pass while read ... done <inputfile loop, and when a line is matched, that signals the start of a new output file. The matched lines are always preceded by a newline character in the input file.
My problem is that every output file except the final one ends up with an extra trailing newline. I have reproduced the problem in this short example.
#!/bin/zsh
rm -f inputfile outputfile1 outputfile2
IFS=''
printf "section1\nsection1end\n\nsection2\nsection2end\n" >inputfile
echo " open outputfile1"
exec 3<> outputfile1
counter=1
IFS=$'\n'
while IFS= read line; do
    if [[ "$line" == "section2" ]]; then
        echo " Matched start of section2. Close outputfile1 and open outputfile2"
        exec 3>&-
        exec 3<> outputfile2
    fi
    echo "$line" >&3
    echo $counter $line
    let "counter = $counter + 1"
done <inputfile
echo " Close outputfile2"
exec 3>&-
echo
unset IFS
echo `wc -l inputfile`
echo `wc -l outputfile1`
echo `wc -l outputfile2`
echo " The above should show 5, 2, 2 as desired number of newlines in these files."
Which outputs:
open outputfile1
1 section1
2 section1end
3
Matched start of section2. Close outputfile1 and open outputfile2
4 section2
5 section2end
Close outputfile2
5 inputfile
3 outputfile1
2 outputfile2
The above should show 5, 2, 2 as desired number of newlines in these files.
Option 1
Get rid of all empty lines. This only works if you don't need to retain any of the empty lines in the middle of a section.
Change:
echo "$line" >&3
To:
[[ -n "$line" ]] && echo "$line" >&3
Option 2
Rewrite each finished file using command substitution to trim any trailing newlines. Since this reads the whole file into a variable, it works best with short files. Change:
exec 3>&-
exec 3<> outputfile2
To:
exec 3>&-
data=$(<outputfile1)
echo "$data" >outputfile1
exec 3<> outputfile2
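This works because command substitution strips every trailing newline from the file's contents, and echo then adds exactly one back (it would also collapse several trailing blank lines, which is what you want here). A minimal demonstration of the mechanism on a throwaway file:

printf 'a\nb\n\n\n' > junkfile   # two data lines plus two trailing blank lines
data=$(<junkfile)                # $(<file) strips all trailing newlines
echo "$data" > junkfile          # echo writes the data back with exactly one final newline
wc -l junkfile                   # now reports 2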
Option 3
Have the loop write the line from the prior iteration, holding each line back by one step, and simply don't write the held-back line (the unwanted blank) when you start a new file:
#!/bin/zsh
rm -f inputfile outputfile1 outputfile2
IFS=''
printf "section1\nsection1end\n\nsection2\nsection2end\n" >inputfile
echo " open outputfile1"
exec 3<> outputfile1
counter=1
IFS=$'\n'
priorLine=MARKER
while IFS= read line; do
    if [[ "$line" == "section2" ]]; then
        echo " Matched start of section2. Close outputfile1 and open outputfile2"
        exec 3>&-
        exec 3<> outputfile2
    elif [[ "$priorLine" != MARKER ]]; then
        echo "$priorLine" >&3
    fi
    echo $counter $line
    let "counter = $counter + 1"
    priorLine="$line"
done <inputfile
echo "$priorLine" >&3
echo " Close outputfile2"
exec 3>&-
echo
unset IFS
echo `wc -l inputfile`
echo `wc -l outputfile1`
echo `wc -l outputfile2`
echo " The above should show 5, 2, 2 as desired number of newlines in these files."
Related
unix command to extract lines with blank values at a particular position in a fixed-width file
I have a fixed-width file with no delimiter. I would like to extract the lines in the file which have blank values in positions 550-552.
With sed:

sed -nE '/^.{549}[[:blank:]]{3}/p' file

The [[:blank:]] class matches spaces or tabs; change it to a space character if you want to match exactly three spaces.
You could use egrep (or equivalently, grep -E):

# first let's build a test file, using seq to make 549 dummy characters (X),
# then 3 characters, then some more dummy characters (Y):
laptop:~/tmp$ (for n in `seq 1 549`; do echo -n X; done; echo -n '   '; echo YYYYYYYYYYYYY) > file
laptop:~/tmp$ (for n in `seq 1 549`; do echo -n X; done; echo -n 'zzz'; echo YYYYYYYYYYYYY) >> file
laptop:~/tmp$ (for n in `seq 1 549`; do echo -n X; done; echo -n '   '; echo YYYYYYYYYYYYY) >> file
laptop:~/tmp$ (for n in `seq 1 549`; do echo -n X; done; echo -n '123'; echo YYYYYYYYYYYYY) >> file
# then do the actual search
laptop:~/tmp$ egrep '.{549}   ' file
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX   YYYYYYYYYYYYY
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX   YYYYYYYYYYYYY
Unix - how to merge filename inside content with cat and baseline?
I want to merge files (each one line long) with the filename at the beginning of the line. cat merges the files and basename -a gives the filenames, but I don't know how to get both:

$ echo "content1" > f1.txt
$ echo "content2" > f2.txt
$ cat f*.txt > all.txt
$ cat all.txt
content1
content2
$ basename -a f*.txt
f1.txt
f2.txt

I would like this result:

$ cat all.txt
f1.txt content1
f2.txt content2
Just use grep -H and post-process to change the delimiter:

$ for i in 1 2; do echo content$i > f$i.txt; done
$ grep -H . *.txt
f1.txt:content1
f2.txt:content2
$ grep -H . *.txt | sed 's/:/ /'
f1.txt content1
f2.txt content2

or,

$ awk '{printf "%s\t%s\n", FILENAME, $0}' *.txt
f1.txt content1
f2.txt content2
Use a loop. Iterate over every *.txt file, echo its name to the output file without a newline (done with echo -n), append its contents, and finally append a newline. Note that >> appends; using > would overwrite.

rm -f all.txt
for f in *.txt; do
    echo -n "$f " >> all.txt
    cat "$f" >> all.txt
    echo >> all.txt
done

If your input files already contain a newline at the end, then skip the final echo and just do:

for f in *.txt; do
    echo -n "$f " >> all.txt
    cat "$f" >> all.txt
done

If you're using tcsh instead of bash, you can use foreach, but you can't write the whole loop in a single command. Normally you would put this in a script:

foreach f (*.txt)
    echo -n "$f " >> all.txt
    cat "$f" >> all.txt
end

To get this in a single command line you need something like this instead:

printf 'foreach f (*.txt)\n echo -n "$f " >> all.txt\n cat "$f" >> all.txt\n end' | tcsh
sed takes more time for processing
I have to replace certain characters (around 20 combinations) in each record. I have implemented it using sed, but it takes more than 24 hours if a file is huge (more than 80,000 records). I have used 2 loops: one to read the input file and one to read the config file, where the position of each character to be replaced is given. Each line can have more than one character which needs to be replaced, and when I replace a character I have to convert it to a decimal number, so the position of the next replacement character needs to be increased. Please find the code snippet below:

...
#Read the input file line by line
while read -r line
do
    Flag='F'
    pos_count=0
    for pattern in `awk 'NR>1' $CONFIG_FILE`
    do
        field_type=`echo $pattern | cut -d"," -f6`
        if [[ $field_type = 'A' ]]; then
            echo "For loop.."
            echo $pattern
            field_type=`echo $pattern | cut -d"," -f6`
            echo field_type $field_type
            start_pos=`echo $pattern | cut -d"," -f3`
            echo start_pos $start_pos
            end_pos=`echo $pattern | cut -d"," -f4`
            echo end_pos $end_pos
            field_len=`echo $pattern | cut -d"," -f5`
            if [[ $Flag = 'T' && $field_type = 'A' ]]; then
                if [[ $replace = 'R' ]]; then
                    pos_count=$(expr $pos_count + 1)
                fi
                echo pos_count $pos_count
                val=$((2 * $pos_count))
                start_pos=$(expr $start_pos + $val)
                end_pos=$(expr $end_pos + $val)
                replace=N
            fi
            echo "$line"
            field=`expr substr "$line" $end_pos 1`
            echo field $field
            if [[ $start_pos -gt 255 ]]; then
                lim=255
                f_cnt=$(expr $start_pos - 1)
                c_cnt=$(expr $end_pos - 2)
                #c_cnt1=$(expr $c_cnt - 255)
                c_cnt1=$(expr $field_len - 2)
                f_cnt1=$(expr $f_cnt - 255)
                echo f_cnt1 "$f_cnt1" , c_cnt1 "$c_cnt1" f_cnt $f_cnt
            else
                lim=$(expr $start_pos - 1)
                f_cnt1=$(expr $field_len - 2)
                echo lim $lim, f_cnt1 $f_cnt1
            fi
            echo Flag $Flag
            case "$field_type" in
            A )
                echo Field type is Amount
                if [[ "${field}" = "{" ]]; then
                    echo "Replacing { in Amount Column"
                    replace=R
                    if [[ $start_pos -gt 255 ]]; then
                        line=`echo "$line" | sed -e "s/\(.\{1,$lim\}\)\(.\{1,$f_cnt1\}\)\(.\{1,$c_cnt1\}\)\([^{]*\){/\1\2+\3.\40/"`
                    else
                        line=`echo "$line" | sed -e "s/\(.\{1,$lim\}\)\(.\{1,$f_cnt1\}\)\([^{]*\){/\1+\2.\30/"`
                    fi
                    Flag='T'
                elif [[ "${field}" = "A" ]]; then
                    echo "Replacing A in Amount Column"
                    replace=R
                    if [[ $start_pos -gt 255 ]]; then
                        line=`echo "$line" | sed -e "s/\(.\{1,$lim\}\)\(.\{1,$f_cnt1\}\)\(.\{1,$c_cnt1\}\)\([^A]*\)A/\1\2+\3.\41/"`
                    else
                        line=`echo "$line" | sed -e "s/\(.\{1,$lim\}\)\(.\{1,$f_cnt1\}\)\([^A]*\)A/\1+\2.\31/"`
                    fi
                    Flag='T'
                ...
                elif [[ "${field}" = "R" ]]; then
                    echo "Replacing R in Amount Column"
                    replace=R
                    if [[ $start_pos -gt 255 ]]; then
                        line=`echo "$line" | sed -e "s/\(.\{1,$lim\}\)\(.\{1,$f_cnt1\}\)\(.\{1,$c_cnt1\}\)\([^R]*\)R/\1\2-\3.\49/"`
                    else
                        line=`echo "$line" | sed -e "s/\(.\{1,$lim\}\)\(.\{1,$f_cnt1\}\)\([^R]*\)R/\1-\2.\39/"`
                    fi
                    Flag='T'
                else
                    echo "Incrementing the size of Amount Column"
                    replace=R
                    if [[ $start_pos -gt 255 ]]; then
                        line=`echo "$line" | sed -e "s/\(.\{1,$lim\}\)\(.\{1,$f_cnt1\}\)\(.\{1,$c_cnt1\}\)/\1\2\3 /"`
                    else
                        line=`echo "$line" | sed -e "s/\(.\{1,$lim\}\)\(.\{1,$f_cnt1\}\)/\1\2\3 /"`
                    fi
                fi
                ;;
            C )
                echo "Column Type is Count"
                ;;
            * )
                echo Others
                ;;
            esac
        fi
    done
    echo "$line" >> ${RES_FILE}
done < "$SRC_FILE"
echo `date`
exit 0

Below is the sample input file:

CHD0000204H315604COV2013038 PROD2016022016030218481304COVCTR0000204H3156C00000000897 000000229960000024670141D0000000397577I0000000000000{00000174042 55C0000007666170B0000025070425E0000004863873E0000000631900F0000001649128{0000000018756B0000014798809C0000001890129G00000002384500000000286600000000084900000000155300000 0000055000000021388000000000048000000000003 00000897 0000000000000{0000000002706B0000001217827I000000001069

Config file:

FIELD NO.,FIELD NAME,STARTING POSITION,ENDING POSITION,LENGTH,INDICATOR
1,CHD_CONTRACT_NO,1,5,5,N
2,CHD_FILE_ID,6,21,16,N
3,PHD_CONTRACT_NO,22,26,5,N
4,PHD_PBP_ID,27,29,3,N
5,PHD_FILE_ID,30,45,16,N
6,DET_REC_ID,46,48,3,N
7,DET_SEQ_NO,49,55,7,N
8,DET_DG_CO_ST_CD,56,56,1,N
9,DET_CURR_HICN,57,76,20,N
10,DET_LAST_SUBM_HICN,77,96,20,N
11,DET_LAST_SUBM_CH_ID,97,116,20,N
12,DET_ERL_PDE_ATT_DT,117,124,8,N
13,DET_RX_COUNT,125,135,11,N
14,DET_NET_IGD_COST_AMT,136,149,14,A
15,DET_NET_DISP_FEE,150,163,14,A
16,DET_NET_SAL_TAX_AMT,164,177,14,A
17,DET_NET_GDCB,178,191,14,A
18,DET_NET_GDCA,192,205,14,A
19,DET_NET_GRS_DG_AMT,206,219,14,A
20,DET_NET_PAT_PAY_AMT,220,233,14,A
21,DET_NET_OTR_TROOP_AMT,234,247,14,A
22,DET_NET_LICS_AMT,248,261,14,A
23,DET_NET_TROOP_AMT,262,275,14,A
24,DET_NET_PLRO_AMT,276,289,14,A
25,DET_NET_CPP_AMT,290,303,14,A
26,DET_NET_NPP_AMT,304,317,14,A
27,DET_ORIG_PDE_CNT,318,329,12,N
28,DET_ADJ_PDE_CNT,330,341,12,N
29,DET_DEL_PDE_CNT,342,353,12,N
30,DET_CAT_PDE_CNT,354,365,12,N
31,DET_ATTC_PDE_CNT,366,377,12,N
32,DET_NCAT_PDE_CNT,378,389,12,N
33,DET_NON_STD_CNT,390,401,12,N
34,DET_OON_PDE_CNT,402,413,12,N
35,DET_EST_REB_AT_POS,414,427,14,A
36,DET_VAC_ADM_FEE,428,441,14,A
37,DET_RPT_GAP_DISC,442,455,14,A
38,DET_RPT_GAP_DISC_PDES,456,467,12,N

Can anyone suggest another design approach to reduce the processing time?
For massively improved performance you'll need to rewrite this. I suggest Python, Ruby, Awk, Perl, or similar. The biggest reason you currently have disastrous performance is that your loop nesting is inside out:

for line in data:
    for line in config:
        do stuff specified in config to line

What you should be doing is:

for line in config:
    parse and store line in memory
for line in data:
    do stuff specified in config (in memory)

You can do this in any of the above languages, and I promise you those 80,000 records can be processed in a few seconds rather than 24 hours.
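For instance, here is a rough awk sketch of that shape, assuming the comma-separated config layout shown in the question (field 6 is the A/N indicator, fields 3 and 5 the start position and length); the actual replacement logic is elided:

awk -F, '
    NR == FNR {                     # first file on the command line: the config
        if (FNR > 1 && $6 == "A")   # skip the header; keep only the Amount fields
            { start[++n] = $3; len[n] = $5 }
        next
    }
    {                               # second file: a single pass over the data
        for (i = 1; i <= n; i++) {
            f = substr($0, start[i], len[i])
            # ... transform f and splice it back into $0 here ...
        }
        print
    }
' "$CONFIG_FILE" "$SRC_FILE" > "$RES_FILE"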
First read the comments and understand that the main problem is the number of calls to external commands, each made 80,000 times. Once this is all done in one program, the overhead and performance issues are solved; which program/tool you use is up to you. You will not get close to that performance when you stick to bash code, but you can learn a lot when you try to use fast internal bash calls where you can. Some tips for improving your script:

- See the answer of @John: only read the config file once.
- Use read for splitting the fields in a line of the config file:
      while IFS="," read -r fieldno fieldname start_pos end_pos length indicator; do
          ...
      done < configfile
- Avoid expr: not f_cnt1=$(expr $field_len - 2) but (( f_cnt1 = field_len - 2 )).
- Redirect to the output file once, after the last done, not for each record (currently difficult since you are echoing both debug statements and results).
- Delete the debug statements.
- Use <<< for feeding strings to commands.

It would be nice if you could change the flow so that you do not need to call sed (80,000 records x 38 config lines) times: generate a complex sed script from the config file that can handle all cases and run sed -f complex.sed "$SRC_FILE" just once. When that is too complex, introduce a string sed_instructions and, for each config line, append that line's sed instruction to the string: sed_instructions="${sed_instructions};s/\(.\{1,$lim\}\)....". Then you only need to call sed -e "${sed_instructions}" <<< "${line}" once for each record. Even better is generating the string ${sed_instructions} once, before reading ${SRC_FILE}; a sketch of that idea follows below.

See "which is the fastest way to print in awk" for another example of performance improvements. I think this can be improved to 10 minutes using bash, 1 minute using awk, and less for the programs mentioned by @John.
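A rough sketch of that last idea, building the instruction string once from the config file. The single substitution shown (rewriting a { at the end of each Amount field to a +) is illustrative only; the real patterns would come from the question's case statement:

sed_instructions=""
{
    read -r header      # skip the config header line
    while IFS=, read -r fieldno fieldname start_pos end_pos length indicator; do
        [[ $indicator == A ]] || continue
        # hypothetical per-field instruction, one substitution per Amount field
        sed_instructions="${sed_instructions};s/^\(.\{$((start_pos - 1))\}\)\(.\{$((length - 1))\}\){/\1\2+/"
    done
} < "$CONFIG_FILE"
# one sed process for the whole file instead of one per record per field
sed -e "${sed_instructions#;}" "$SRC_FILE" > "$RES_FILE"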
echo does not display proper output
The following code reads the contents of test.txt and, based on the first field, redirects the third field to result.txt:

src_fld=s1
type=11
Logic_File=`cat /home/script/test.txt`
printf '%s\n' "$Logic_File" | {
    while IFS=',' read -r line
    do
        fld1=`echo $line | cut -d ',' -f 1`
        if [[ $type -eq $fld1 ]]; then
            query=`echo $line | cut -d ',' -f 3-`
            echo $query >> /home/stg/result.txt
        fi
    done
}

The following is the contents of test.txt:

6,STRING TO DECIMAL WITHOUT DEFAULT,cast($src_fld as DECIMAL(15,2) $tgt_fld
7,STRING TO INTERGER WITHOUT DEFAULT,cast($src_fld as integer) $tgt_fld
11,DEFAULT NO RULE,$src_fld

Everything works fine except that the output in result.txt is $src_fld instead of s1. Can anyone please tell me what is wrong in the code?
Try replacing the line

echo $query >> /home/stg/result.txt

with this one:

eval "echo $query" >> /home/stg/result.txt
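The reason this works: $query holds the literal text $src_fld read from the file, and a plain echo prints it verbatim; eval makes the shell expand the line a second time. A minimal illustration (note that eval executes whatever the line contains, so only use it on trusted input):

src_fld=s1
query='$src_fld'      # the literal string that read pulled out of test.txt
echo $query           # prints: $src_fld
eval "echo $query"    # prints: s1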
check if file is empty or not
How do I check whether a file is empty in a Korn shell script? I want to test whether the output CSV file is empty or not, and if it is not empty it should give the count of values. Thanks.
The test(1) program has a -s switch:

-s FILE
       FILE exists and has a size greater than zero
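In a Korn shell script that might look like the following sketch, assuming the output file is named result.csv (a hypothetical name) and taking the count from wc -l:

#!/bin/ksh
# Report the record count only when result.csv is non-empty.
if [[ -s result.csv ]]; then
    print "record count: $(wc -l < result.csv)"
else
    print "result.csv is empty"
fi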
This is just another way of doing it, albeit a roundabout one:

if [ `ls -l <file> | awk '{print $5}'` -eq 0 ]
then
    : # condition for being empty
else
    : # condition for not being empty
fi
if [ ! -f manogna.txt ]
then
    echo " Error: manogna.txt does not exist "
else
    echo " manogna.txt exists "
    echo " no of records in manogna.txt are `cat manogna.txt | wc -l`"
fi