Awk command to perform action on lines excluding 1st and last - unix

I have multiple MS excel files in csv format in a particular directory.
I want to update the value of one particular column in all the rows of the csv files.
Also, the action should not be operated on 1st and last line.
So far I have come up with below code for one row:
awk -F, 'NR>2{$2=300;}1' OFS=, test.csv
But i am facing difficulty in excluding the last line.
Also, i need to perform the same for all the files in the directory.
So far tried the below but not able to succeed to replace that string value using awk.
1)
2)

This may do:
awk -F, 't{print t} {a=t=$0} NR>1{$2=300;t=$0} END {print a}' OFS=, test.csv

$ cat file
1,a,b
2,c,d
3,e,f
$ awk 'BEGIN{FS=OFS=","} NR>1{print (NR>2 ? chgd : orig)} {orig=$0; $2=300; chgd=$0} END{print orig}' file
1,a,b
2,300,d
3,e,f

You could simplify the script a bit by reading the file twice:
awk 'BEGIN{FS=OFS=","} NR==FNR {c=NR;next} !(FNR==1||FNR==c){$2=200} 1' file file
This uses the NR==FNR section merely to count lines, giving you a simple expression for determining whether to update the field in question.
And if you have GNU awk available, you might save a few CPU cycles by not reassigning the c variable for every line, using something like this:
gawk 'BEGIN{FS=OFS=","} ENDFILE {c=FNR} NR==FNR{next} !(FNR==1||FNR==c){$2=200} 1' file file
This still reads the file twice, but assigns c only after each file is read.
If you want, you can emulate the ENDFILE condition in non-GNU awk using NR>FNR && FNR==1 if you only have two files, then set c=NR-1. It won't perform as well.
I haven't tested the speed difference between these two, but I suspect it would be negligible except in cases of truly obscenely large files.

Thanks all,
I got to make it work. Below is the command:
awk -v sq="" -F, 't{print t} {a=t=$0} NR>2{$3=sq"ops_data"sq;t=$0} END {print a}' OFS=, test1.csv

Related

How can I identify lines from a delimited file, based on a lookup file in unix

Assume that there are two files
File1 - lookup.txt
CAN
USD
INR
EUR
Another file Input.txt
1~Canada~CAN
2~United States of America~USD
3~Brazil~BRL
Both files may be very huge, hypothetically several thousand of records . Now I'm trying to identify the records in Input.txt and identify them based on values in lookup file.
The expected output should be
1~Canada~CAN
2~United States of America~USD
I tried to do something like below
#!/bin/sh
lookupFile=$1 #lookup.txt
inputFile=$2 #input.txt
outputFile=$3 #output.txt
while IFS= read -r line
do
awk -F'~' '{if ($3==$line) print >> $outputFile}' $inputFile
done < "$lookupFile"
But I'm getting error like
awk: cmd. line:1: (FILENAME=input.txt FNR=2) fatal: can't redirect to
How can I fix this issue ? Also if the files really huge, with several thousand of records to search, is this an efficient way ?
With your shown samples please try following awk code. We could do this in single awk we need to take care of setting field separator as ~ before input.txt.
awk 'FNR==NR{arr[$0];next} ($3 in arr)' lookup.txt FS="~" input.txt
Explanation:
awk ' ##starting awk program from here.
FNR==NR{ ##Checking condition which will be TRUE when lookup.txt is being read.
arr[$0] ##Creating array arr with $0 as index.
next ##next to skip all further statements from here.
}
($3 in arr) ##If $3 is present in arr then print that line.
' lookup.txt FS="~" input.txt ##Mentioning Input_files and setting FS to ~ before input.txt
A non-awk solution that you could compare with on the performance point of view:
$ grep -wFf lookup.txt input.txt
1~Canada~CAN
2~United States of America~USD
Warning: this does not match only on the last word. So if some values in lookup.txt can also be found elsewhere in input.txt, prefer another solution. Or, if it contains nothing that could be interpreted as a regular expression operator, preprocess lookup.txt before grep. Example with bash, sed and grep:
$ grep -f <( sed 's/.*/~&$/' lookup.txt ) input.txt
1~Canada~CAN
2~United States of America~USD

Expressing tail through a variable

So I have a chunk of formatted text, I basically need to use awk to get certain columns out of it. The first thing I did was get rid of the first 10 lines (the header information, irrelevant to the info I need).
Next I got the tail by taking the total lines in the file minus 10.
Here's some code:
import=$HOME/$1
if [ -f "$import" ]
then
#file exists
echo "File Exists."
totalLines=`wc -l < $import`
linesMinus=`expr $totalLines - 10`
tail -n $linesMinus $import
headless=`tail -n $linesMinus $import`
else
#file does not exist
echo "File does not exist."
fi
Now I need to save this tail into a variable (or maybe even separate file) so I can access the columns.
The problem comes here:
headless=`tail -n $linesMinus $import`
When I save the code into this variable and then try to echo it back out, it's all unformatted and I can't distinguish columns to use awk on.
How can I save the tail of this file without compromising the formatting?
Just use Awk. It can do everything you need all at once and all in one program.
E.g. to skip the first 10 lines, then print the second, third, and fourth columns separated by spaces for all remaining lines from INPUT_FILE:
awk 'NR <= 10 {next;}
{print $2 " " $3 " " $4;}' INPUT_FILE
Figured it out, I kind of answered my own question when I asked it. All I did was redirect the tail command to a file in the home directory and I can cat that file. Gotta remember to delete it at the end though!

Print labels using awk

On my FreeBSD 10.1 I'm writing a little piece of code that basically calls ls and automatically breaks the results down into something like this:
directory:
2.4M .git
528K src
380K dist
184K test
file:
856K CONDUCT.md
20K README.md
........
You will only need to list out directories and regular files, and you don't have to list out . .., but you have to list out hidden files, and sort them from largest to smallest separately.
The challenge is to complete it as a one-line command without using $(cmd), &&, ||, >, >>, <, ;, & and within 12 pipes (back quotes count as well).
Currently my progress is:
ls -Alh | sort -d -h -r |
awk 'BEGIN {print "Directories:"}
NR>1 {if(substr($1,1,1)~"d")print" "$5" "$9}'
which prints out only until the last directory item. But since the entire command will output once every record, I can't find a way to print files: only once, and then print out the remaining output.
Well, you may have to store the files in an array and print at the end:
ls -Alh|sed 1d|
sort -h -k5r|
awk 'BEGIN {print "Directories:"}
/^d/{print "\t"$5"\t"$9}
/^-/{f[n++]=sprintf("\t"$5"\t"$9)}
END{print "Files:";
for(i=0;i<n;++i)print f[i]}'
One additional problem you'll need to work out: files and dirs may have spaces in the name, and the simple $9 will be insufficient for that case.

Median Calculation in Unix

I need to calculate median value for the below input file. It is working fine for odd occurrences but not for even occurrences. Below is the input file and the script used. Could you please check what is wrong with this command and correct the same.
Input file:
col1,col2
AR,2.52
AR,3.57
AR,1.29
AR,6.66
AR,3.05
AR,5.52
Desired Output:
AR,3.31
Unix command:
cat test.txt | sort -t"," -k2n,2 | awk '{arr[NR]=$1} END { if (NR%2==1) print arr[(NR+1)/2]; else print (arr[NR/2]+arr[NR/2+1])/2}'
Don't forget that your input file has an additional line, containing the header. You need to take an additional step in your awk script to skip the first line.
Also, due to the fact you're using the default field separator, $1 will contain the whole line, so your code arr[NR/2]+arr[NR/2+1])/2 is never going to work. I would suggest that you changed it so that awk splits the input on a comma, then use the second field $2.
sort -t, -k2n,2 file | awk -F, 'NR>1{a[++i]=$2}END{if(i%2==1)print a[(i+1)/2];else print (a[i/2]+a[i/2+1])/2}'
I also removed your useless use of cat. Most tools, including sort and awk, are capable of reading in files directly, so you don't need to use cat with them.
Testing it out:
$ cat file
col1,col2
AR,2.52
AR,3.57
AR,1.29
AR,6.66
AR,3.05
AR,5.52
$ sort -t, -k2n,2 file | awk -F, 'NR>1{a[++i]=$2}END{if(i%2==1)print a[(i+1)/2];else print (a[i/2]+a[i/2+1])/2}'
3.31
It shouldn't be too difficult to modify the script slightly to change the output to whatever you want.

how to split a large csv file in unix command line

I am just splitting a very large csv file in to parts. When ever i run the following command. the doesn't completely split rather returns me the following error. how can i avoid the split the whole file.
awk -F, '{print > $2}' test1.csv
awk: YY1 makes too many open files
input record number 31608, file test1.csv
source line number 1
Just close the files after writing:
awk -F, '{print > $2; close($2)}' test1.csv
You must have a lot of lines. Are you sure that the second row repeats enough to put those records into an individual file? Anyway, awk is holding the files open until the end. You'll need a process that can close the file handles when not in use.
Perl to the rescue. Again.
#!perl
while( <> ) {
#content = split /,/, $_;
open ( OUT, ">> $content[1]") or die "whoops: $!";
print OUT $_;
close OUT;
}
usage: script.pl your_monster_file.csv
outputs the entire line into a file named the same as the value of the second CSV column in the current directory, assuming no quoted fields etc.

Resources