Invalid sum expression while trying to obtain the sum of an index - unix

I need to take the sum of all the values present at a particular index in every line of a CSV file. The file may contain more than 50,000 records, so efficiency matters.
I was trying the following code, but it doesn't seem to be working.
#!/bin/sh
FILE=$1
# read $FILE using the file descriptors
exec 3<&0
exec 0<$FILE
while read line
do
valindex=`cut -d "," -f 3`
echo $valindex
sum=`expr $sum+$valindex`
done
echo $sum

You should initialise sum before your while loop:
sum=0
You need to cut the line you are reading:
valindex=`echo $line|cut -d "," -f 3`
You need a space before and after the plus in expr:
sum=`expr $sum + $valindex`
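With those three fixes applied, the loop would look something like this (a sketch of the corrected script; the exec redirections are replaced by a plain input redirection on the loop):
#!/bin/sh
FILE=$1
sum=0
# read $FILE line by line
while read line
do
valindex=`echo $line | cut -d "," -f 3`
sum=`expr $sum + $valindex`
done < $FILE
echo $sum
Note that this still spawns cut and expr once per line, which will be slow for 50,000 records; that is exactly why awk is the better tool here.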
Alternatively, use awk. It's a lot simpler:
awk -F, '{SUM+=$3} END{print SUM}' $FILE

Or one of my favourite patterns:
cut -d "," -f 3 "$FILE" | paste -sd+ | bc

Related

AWK : Unzipping and printing File Name and first line

I am trying to unzip the files in a folder and print the first line along with the LASTMODIFIEDDATE line.
But the code below prints the first line with '-' instead of the file name.
for file in /export/home/xxxxxx/New_folder/*.gz;
do
gzip -dc "$file" | awk 'NR=1 {print $0, FILENAME}' | awk '/LASTMODIFIEDDATE/'
done
1. How can I modify the above code to print the name of the file that is unzipped?
2. I am a beginner, so suggestions to improve the above code are welcome.
A few issues:
Your first awk should have double equals signs if you mean to address the first line:
awk 'NR==1{...}'
Your second awk only ever sees the output of the first awk, which prints nothing but the first line, so you will not see any lines with LASTMODIFIED in them unless they happen to be the first. The version below shows you the first line and any lines containing LASTMODIFIED:
for ...
do
echo $file
gzip -dc "$file" | awk 'NR==1 || /LASTMODIFIED/'
done
Or you may mean this:
for ...
do
gzip -dc "$file" | awk -v file="$file" 'NR==1{print $0 " " file} /LASTMODIFIED/'
done
which will print the first line followed by the filename and also any lines containing LASTMODIFIED.
Print the file name with an echo. Also, you might want to use grep instead of awk in this case:
for file in /export/home/xxxxxx/New_folder/*.gz;
do
echo $file
gzip -dc "$file" | grep LASTMODIFIEDDATE
done
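If you would rather have the file name on the same line as the match instead of on a line of its own, one option is to pass it into awk (a sketch along the lines of the earlier answer; the path and the LASTMODIFIEDDATE pattern are taken from the question):
for file in /export/home/xxxxxx/New_folder/*.gz
do
gzip -dc "$file" | awk -v f="$file" '/LASTMODIFIEDDATE/ {print f, $0}'
done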

Need some help on awk command

I need help with awk. I am reading a CSV file and doing some substitution on some of the columns: the 9th column (a string) should be replaced by the value of (the 9th column itself + the value of the 4th column), the 15th column by $15+$12, and the 26th column by $26+$23. The same has to be done line by line for all the records. Suggestions please.
Below is the sample input and output; the first line, which is the header, must be left as is.
Sample input
EmpID|Empname|Empadd|roleId|roleDesc|Dept
100|mst|Del|20|SD|DA
101|ms|Del|21|XS|DA
Sample output
EmpID|Empname|Empadd|roleId|roleDesc|Dept
100|mst100|Del|20|SD20|DA
101|ms101|Del|21|XS21|DA
In other words, Empname has been concatenated with EmpID, and roleDesc with roleId. Hope that's helpful :)
This will perform the needed transformation:
$ awk 'NR>1{$2=$2$1;$5=$5$4}1' FS='|' OFS='|' file
EmpID|Empname|Empadd|roleId|roleDesc|Dept
100|mst100|Del|20|SD20|DA
101|ms101|Del|21|XS21|DA
If you have to do this for many columns you can use a for loop like so (provided the column positions follow an arithmetic or geometric step size; for irregular positions, see the sketch after the output below):
$ awk 'NR>1{for(i=2;i<=5;i+=3)$i=$i$(i-1)}1' FS='|' OFS='|' file
EmpID|Empname|Empadd|roleId|roleDesc|Dept
100|mst100|Del|20|SD20|DA
101|ms101|Del|21|XS21|DA
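If the target columns do not follow a regular step, as with the 9/4, 15/12 and 26/23 pairs in the question, one option is to list the pairs explicitly and loop over them. This is only a sketch (the pair list is taken from the question, and it assumes the real file actually has that many columns):
$ awk 'BEGIN{FS=OFS="|"; n=split("9:4 15:12 26:23", pairs, " ")} NR>1{for(i=1;i<=n;i++){split(pairs[i], p, ":"); $(p[1])=$(p[1]) $(p[2])}} 1' file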
When you say +, I'm assuming you mean string concatenation. In awk there is no specific concatenation operator; you just put two strings side by side.
awk -F, -v OFS=, '{$9 = $9 $4; $15=$15$12; $26=$26$23; print}' file.csv
I am also assuming that by "csv" you actually mean comma-separated.
If you want to edit the file in-place, you need to do this:
awk ... file.csv > newfile && mv file.csv file.csv.bak && mv newfile file.csv
Edit: to leave the first line untouched:
awk -F, -v OFS=, 'NR>1 {$9 = $9 $4; $15=$15$12; $26=$26$23} {print}' file.csv
Now the columns are modified for the 2nd and subsequent lines, but every line is printed.
You'll sometimes see that written this way:
awk -F, -v OFS=, 'NR>1 {$9 = $9 $4; $15=$15$12; $26=$26$23} 1' file.csv
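As a side note on the in-place step above: if GNU awk 4.1 or later happens to be available, its inplace extension can handle the temporary-file shuffle for you (this is gawk-specific, not plain awk):
gawk -i inplace -F, -v OFS=, 'NR>1 {$9 = $9 $4; $15=$15$12; $26=$26$23} 1' file.csv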

Combining awk and csum to hash a field

I have pipe-delimited text files that require an MD5 hash of a particular field, or set of fields. Because I'm on AIX and have to use the csum command, I don't think I can simply pass the file and a hashing function to awk to do it in one fell swoop.
So I'm writing a script that reads through each line, passes the to-be-hashed field to csum, then drops the result back in as a replacement via a gsub. 99% of the time it seems to work OK, but sometimes something goes wrong because the gsub replaces something it shouldn't.
#!/bin/ksh
rm $2 #Get rid of output file
while read line; do #loop through each line
MYFIELD=$(echo "$line" | cut -d "|" -f 6); #push the 6th field into a var
MYHASH=$(echo $MYFIELD | csum -h MD5 -); #csum will hash a string only on the stdin
echo $line | sed -e "s/$MYFIELD/${MYHASH}/g" >> $2 #gsub replaces, but not always what we want
done < $1 #read in the input file
I think instead I could use awk to update the field. But it's beyond me how to do that one line at a time. Ideally I would like to have a script that would allow me to pass two mandatory parameters (infile and outfile) and then any number of field positions that would get hashed and replaced. A la
foo infile.txt outfile.txt 2 6 12
Which would read in infile.txt, hash fields 2, 6, and 12, and write out to outfile.txt.
Your suggestions would be most appreciated
What about doing it with awk?
Instead of
echo $line | sed -e "s/$MYFIELD/${MYHASH}/g" >> $2 #gsub replaces, but not always what we want
You can use
old=$MYFIELD; new=$MYHASH; echo $line | awk -F"|" -v o="$old" -v n="$new" '{OFS=FS} sub(o, n, $6) {print}' >> $2
Basically what we do is:
old=$MYFIELD; new=$MYHASH We assign the parameters to be sent to awk.
echo $line We output the line so that awk can get it.
In awk,
-F"|" define | as field separator.
-v o="$old" and -v n="$new" let awk work with variables $old and $new naming them o and n respectively.
{OFS=FS} - define the delimiter between fields. It could also be OFS="|", but this way we indicate awk to use the same we defined on -F="|". It is more flexible to keep the field separator in case it changes.
sub(o, n, $6) replaces the text on variable o (that is, $MYFIELD) with text on variable v (that is, $MYHASH), but just on field 6.
print the whole line with substituted text
This worked for me in the example you gave on comments:
old="hashit"; new="WE_DID"; echo "donthashit|foo1|bar1|foo2|bar2|hashit" | awk -F"|" -v o="$old" -v n="$new" '{OFS=FS} sub(o,n,$6) {print}'
donthashit|foo1|bar1|foo2|bar2|WE_DID
Hope it helps.
Edit
I found a way to pass variables to awk easily: -v o=${variable_name}
This way, the solution can be:
echo $line | awk -F"|" -v o=${MYFIELD} -v n=${MYHASH} '{OFS=FS} sub(o, n, $6) {print}' >> $2
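To get the exact interface asked for in the question (foo infile.txt outfile.txt 2 6 12), one possible shape is to let awk walk the file once and call csum per field via getline, replacing each field by position rather than by pattern, which also side-steps the gsub mis-replacement problem. This is only a sketch: it assumes the hash is the first word that csum -h MD5 - prints, and that the fields being hashed contain no double quotes or other characters special to the shell.
#!/bin/ksh
# Usage: foo infile outfile field [field ...]
infile=$1
outfile=$2
shift 2
fields="$*"                      # e.g. "2 6 12"

awk -F'|' -v OFS='|' -v flds="$fields" '
{
    n = split(flds, f, " ")
    for (i = 1; i <= n; i++) {
        # pipe the field value through csum and read back its output
        cmd = "echo \"" $(f[i]) "\" | csum -h MD5 -"
        cmd | getline out
        close(cmd)
        # keep only the hash itself (assumed to be the first word of the output)
        split(out, h, " ")
        $(f[i]) = h[1]
    }
    print
}' "$infile" > "$outfile"
Running an echo | csum pipeline per field is still slow, but it keeps the field replacement positional, so nothing outside the chosen fields can ever be touched.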

Count number of blank lines in a file

In "Count (non-blank) lines-of-code in bash" they explain how to count the number of non-empty lines.
But is there a way to count the number of blank lines in a file? By blank line I also mean lines that contain nothing but spaces.
Another way is:
grep -cvP '\S' file
-P '\S' (Perl regex) matches any line that contains a non-space character
-v selects the non-matching lines
-c prints a count of the selected lines
If your grep doesn't support the -P option, use -E '[^[:space:]]' instead.
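That is, the full equivalent command is:
grep -cvE '[^[:space:]]' file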
One way using grep:
grep -c "^$" file
Or with whitespace:
grep -c "^\s*$" file
You can also use awk for this:
awk '!NF {sum += 1} END {print sum}' file
From the manual: "The variable NF is set to the total number of fields in the input record". Since fields are split on whitespace by default, any line consisting of nothing, or of only spaces and tabs, will have NF=0.
Then, it is a matter of counting how many times this happens.
Test
$ cat a
aa dd
ddd
he llo
$ cat -vet a # -vet to show tabs and spaces
aa dd$
$
ddd$
$
^I$
he^Illo$
Now let's count the number of blank lines:
$ awk '!NF {s+=1} END {print s}' a
3
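One small caveat: if the file contains no blank lines at all, s is never set and the END block prints an empty line. Printing s+0 forces numeric output, so you get 0 instead:
$ awk '!NF {s+=1} END {print s+0}' a
3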
grep -v '\S' file | wc -l
(On OS X the Perl regular expressions of the -P option are not available.)
grep -cx '\s*' file
or
grep -cx '[[:space:]]*' file
That is faster than the code in Steve's answer.
Using Perl one-liner:
perl -lne '$count++ if /^\s*$/; END { print int $count }' input.file
To count how many useless blank lines your colleague has inserted in a project you can launch a one-line command like this:
blankLinesTotal=0; for file in $( find . -name "*.cpp" ); do blankLines=$(grep -cvE '\S' ${file}); blankLinesTotal=$[${blankLines} + ${blankLinesTotal}]; echo $file" has" ${blankLines} " empty lines." ; done; echo "Total: "${blankLinesTotal}
This prints:
<filename0>.cpp #blankLines
....
....
<filenameN>.cpp #blankLines
Total #blankLinesTotal
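If all you want is the grand total, the cut | paste | bc idea from the first question above can be reused; a sketch, assuming GNU grep (for -h together with -c, and for \S):
find . -name '*.cpp' -exec grep -chv '\S' {} + | paste -sd+ | bc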

Unix - Need to cut a file which has multiple blanks as delimiter - awk or cut?

I need to get the records from a text file in Unix. The delimiter is multiple blanks. For example:
2U2133 1239
1290fsdsf 3234
From this, I need to extract
1239
3234
The delimiter for all records will be always 3 blanks.
I need to do this in a Unix script (.scr) and write the output to another file, or use it as input to a while loop. I tried the below:
while read readline
do
read_int=`echo "$readline"`
cnt_exc=`grep "$read_int" ${Directory path}/file1.txt| wc -l`
if [ $cnt_exc -gt 0 ]
then
int_1=0
else
int_2=0
fi
done < awk -F' ' '{ print $2 }' ${Directoty path}/test_file.txt
test_file.txt is the input file and file1.txt is a lookup file. But the above is not working; it gives me syntax errors near awk -F.
I tried writing the output to a file. The following worked in command line:
more test_file.txt | awk -F' ' '{ print $2 }' > output.txt
This works and writes the records to output.txt on the command line, but the same command does not work in the Unix script (it is a .scr file).
Please let me know where I am going wrong and how I can resolve this.
Thanks,
Visakh
The job of replacing multiple delimiters with just one is left to tr:
cat <file_name> | tr -s ' ' | cut -d ' ' -f 2
tr translates or deletes characters, and is perfectly suited to prepare your data for cut to work properly.
The manual states:
-s, --squeeze-repeats
replace each sequence of a repeated character that is
listed in the last specified SET, with a single occurrence
of that character
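On the two sample lines from the question, that pipeline gives:
$ cat test_file.txt | tr -s ' ' | cut -d ' ' -f 2
1239
3234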
It depends on the version or implementation of cut on your machine. Some versions support an option, usually -i, that means 'ignore blank fields' or, equivalently, allow multiple separators between fields. If that's supported, use:
cut -i -d' ' -f 2 data.file
If not (and it is not universal — and maybe not even widespread, since neither GNU nor MacOS X have the option), then using awk is better and more portable.
You need to pipe the output of awk into your loop, though:
awk -F' ' '{print $2}' ${Directory_path}/test_file.txt |
while read readline
do
read_int=`echo "$readline"`
cnt_exc=`grep "$read_int" ${Directory_path}/file1.txt| wc -l`
if [ $cnt_exc -gt 0 ]
then int_1=0
else int_2=0
fi
done
The only residual issue is whether the while loop runs in a sub-shell and is therefore not modifying your main shell script's variables, just its own copy of those variables.
With bash, you can use process substitution:
while read readline
do
read_int=`echo "$readline"`
cnt_exc=`grep "$read_int" ${Directory_path}/file1.txt| wc -l`
if [ $cnt_exc -gt 0 ]
then int_1=0
else int_2=0
fi
done < <(awk -F' ' '{print $2}' ${Directory_path}/test_file.txt)
This leaves the while loop in the current shell, but arranges for the output of the command to appear as if from a file.
The blank in ${Directory path} is not normally legal — unless it is another Bash feature I've missed out on; you also had a typo (Directoty) in one place.
Other ways of doing the same thing aside, the error in your program is this: You cannot redirect from (<) the output of another program. Turn your script around and use a pipe like this:
awk -F' ' '{ print $2 }' ${Directory path}/test_file.txt | while read readline
etc.
Besides, the use of "readline" as a variable name may or may not get you into problems.
In this particular case, you can use the following line (note that the pattern collapses each run of spaces into a single tab; a plain 's/ /\t/g' would turn every space into its own tab and leave field 2 empty):
sed 's/  */\t/g' <file_name> | cut -f 2
to get your second column.
In bash you can start from something like this (field 4 because, without squeezing the blanks, cut counts the empty fields produced by the three-space delimiter):
for n in `cat ${Directory_path}/test_file.txt | cut -d " " -f 4`
do
grep -c $n ${Directory_path}/file*.txt
done
This should have been a comment, but since I cannot comment yet, I am adding this here.
This is from an excellent answer here: https://stackoverflow.com/a/4483833/3138875
tr -s ' ' <text.txt | cut -d ' ' -f4
tr -s '<character>' squeezes multiple repeated instances of <character> into one.
It's not working in the script because of the typo "Directoty" (it should be "Directory") in the last line of your script.
Cut isn't flexible enough. I usually use Perl for that:
cat file.txt | perl -F'   ' -ane 'print $F[1]."\n"'
Instead of the triple space after -F you can put any Perl regular expression. You access fields as $F[n], where n is the field number (counting starts at zero). This way there is no need for sed or tr.
