How to split a large CSV file on the Unix command line

I am splitting a very large CSV file into parts. Whenever I run the following command, it doesn't split the file completely; instead it returns the error below. How can I avoid this and split the whole file?
awk -F, '{print > $2}' test1.csv
awk: YY1 makes too many open files
input record number 31608, file test1.csv
source line number 1

Just close each file after writing to it, and append with >> so that reopening a file later does not truncate what was already written:
awk -F, '{print >> $2; close($2)}' test1.csv
(Note that since >> appends, you should remove the old output files before re-running the command.)

You must have a lot of distinct values. Are you sure that the second field repeats enough to group those records into a reasonable number of files? In any case, awk holds every output file open until the end of the run, so you need a process that can close the file handles when they are not in use.
Perl to the rescue. Again.
#!/usr/bin/perl
while ( <> ) {
    my @content = split /,/, $_;
    # append the whole line to a file named after the second field
    open( OUT, ">> $content[1]" ) or die "whoops: $!";
    print OUT $_;
    close OUT;
}
usage: script.pl your_monster_file.csv
This outputs each line into a file in the current directory named after the value of the second CSV column, assuming no quoted fields etc.
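If the per-line open/close overhead is too slow on a huge file, another approach (a sketch; it assumes the second field is always a valid filename) is to sort on the key first, so each output file is opened exactly once and only one file is ever open at a time:
sort -t, -s -k2,2 test1.csv | awk -F, '$2 != prev { if (prev != "") close(prev); prev = $2 } { print > $2 }'
The -s keeps the sort stable, so lines sharing a key stay in their original relative order.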

Related

Awk command to perform action on lines excluding 1st and last

I have multiple MS Excel files in CSV format in a particular directory.
I want to update the value of one particular column in all rows of the CSV files.
Also, the action should not be applied to the 1st and last line.
So far I have come up with the below code for one file:
awk -F, 'NR>2{$2=300;}1' OFS=, test.csv
But I am facing difficulty in excluding the last line.
Also, I need to perform the same for all the files in the directory.
So far I have tried a few approaches but have not been able to replace that string value using awk.
This may do:
awk -F, 't{print t} {a=t=$0} NR>1{$2=300;t=$0} END {print a}' OFS=, test.csv
$ cat file
1,a,b
2,c,d
3,e,f
$ awk 'BEGIN{FS=OFS=","} NR>1{print (NR>2 ? chgd : orig)} {orig=$0; $2=300; chgd=$0} END{print orig}' file
1,a,b
2,300,d
3,e,f
You could simplify the script a bit by reading the file twice:
awk 'BEGIN{FS=OFS=","} NR==FNR {c=NR;next} !(FNR==1||FNR==c){$2=200} 1' file file
This uses the NR==FNR section merely to count lines, giving you a simple expression for determining whether to update the field in question.
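With the same sample file as above, the two-pass version gives:
$ awk 'BEGIN{FS=OFS=","} NR==FNR {c=NR;next} !(FNR==1||FNR==c){$2=200} 1' file file
1,a,b
2,200,d
3,e,f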
And if you have GNU awk available, you might save a few CPU cycles by not reassigning the c variable for every line, using something like this:
gawk 'BEGIN{FS=OFS=","} ENDFILE {c=FNR} NR==FNR{next} !(FNR==1||FNR==c){$2=200} 1' file file
This still reads the file twice, but assigns c only after each file is read.
If you want, you can emulate the ENDFILE condition in non-GNU awk using NR>FNR && FNR==1 if you only have two files, then set c=NR-1. It won't perform as well.
I haven't tested the speed difference between these two, but I suspect it would be negligible except in cases of truly obscenely large files.
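A sketch of what that non-GNU emulation might look like, again reading the file twice:
awk 'BEGIN{FS=OFS=","} NR>FNR && FNR==1 {c=NR-1} NR==FNR {next} !(FNR==1||FNR==c){$2=200} 1' file file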
Thanks all,
I got it to work. Below is the command:
awk -v sq="'" -F, 't{print t} {a=t=$0} NR>2{$3=sq"ops_data"sq;t=$0} END {print a}' OFS=, test1.csv
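To run the same edit over every CSV file in the directory, a plain shell loop around the awk command works (a sketch; adjust the glob and the in-place replacement to taste):
for f in *.csv; do
    awk -v sq="'" -F, 't{print t} {a=t=$0} NR>2{$3=sq"ops_data"sq;t=$0} END {print a}' OFS=, "$f" > "$f.tmp" &&
    mv "$f.tmp" "$f"
done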

Calling a function from awk with variable input location

I have a bunch of different files. We have used "|" as the delimiter. All files contain a column titled CARDNO, but not necessarily in the same location in all of the files. I have a function called data_mask that I want to apply to CARDNO in all of the files to change it into NEWCARDNO.
I know that if I pass in the column number of CARDNO I can do this pretty simply; say it's the 3rd column in a 5-column file, with something like:
awk -v column=$COLNUMBER '{print $1, $2, FUNCTION($column), $4, $5}' FILE
However, if all of my files have hundreds of columns and it's somewhere arbitrary in each file, this is incredibly tedious. I am looking for a way to do something along the lines of this:
awk -v column=$COLNUMBER '{print #All columns before $column, FUNCTION($column), #All columns after $column}' FILE
My function takes a string as input and changes it into a new one; it takes the value of the column as input, not the column number. Please suggest a Unix command that can pass the column value to the function and give the desired output.
Thanks in advance
If I understand your problem correctly, the first row of the file is a header and one of those columns is named CARDNO. If this is the case, then you can just search the header for that column and process accordingly.
awk 'BEGIN { FS = OFS = "|"; c = 1 }
     NR == 1 { while ($c != "CARDNO" && c <= NF) c++
               if (c > NF) exit
               $c = "NEWCARDNO" }
     NR != 1 { $c = FUNCTION($c) }
     { print }' <file>
As per the comment, if there is no header in the file but you know, per file, which column number it is, then you can simply do:
awk -v c=$column 'BEGIN{FS=OFS="|"}{$c=FUNCTION($c)}1' <file>
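FUNCTION above is a placeholder for the asker's data_mask, whose real definition is not shown. If the masking can be expressed in awk, it can be defined as a user function; the body below (X out everything but the last four characters, with c=3 as the assumed column) is purely a made-up example:
awk -v c=3 'BEGIN { FS = OFS = "|" }
# hypothetical mask: X out all but the last 4 characters
function data_mask(s,   n, head) {
    n = length(s)
    if (n <= 4) return s
    head = substr(s, 1, n - 4)
    gsub(/./, "X", head)
    return head substr(s, n - 3)
}
{ $c = data_mask($c) } 1' <file>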

Join lines depending on the line beginning

I have a file that occasionally has split lines. A split is signaled by the continuation line starting with a space, an empty line, or a non-numeric character. E.g.
40403813|7|Failed|No such file or directory|1
40403816|7|Hi,
 The Conversion System could not be reached.|No such file or directory||1
40403818|7|Failed|No such file or directory|1
...
I'd like to join the split lines back onto the previous line, like this:
40403813|7|Failed|No such file or directory|1
40403816|7|Hi, The Conversion System could not be reached.|No such file or directory||1
40403818|7|Failed|No such file or directory|1
...
using a Unix command like sed/awk. I'm not clear on how to join a line with the preceding one.
Any suggestion?
awk to the rescue!
awk -v ORS='' 'NR>1 && /^[0-9]/{print "\n"} NF' file
This only prints a newline when the current line starts with a digit; otherwise rows are appended to each other (you may want to print a separating space if the line break did not preserve one).
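If the leading space was lost at the break, a variant along these lines (a sketch) re-joins the pieces with a single space:
awk 'NR>1 && /^[0-9]/ { print ""; sep = "" } NF { printf "%s%s", sep, $0; sep = " " } END { print "" }' file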
Don't do anything based on the values of the strings in your fields as that could go wrong. You COULD get a wrapping line that starts with a digit, for example. Instead just print after every complete record of 5 fields:
$ awk -F'|' '{rec=rec $0; nf+=NF} nf>=5{print rec; nf=0; rec=""}' file
40403813|7|Failed|No such file or directory|1
40403816|7|Hi, The Conversion System could not be reached.|No such file or directory||1
40403818|7|Failed|No such file or directory|1
Try:
awk 'NF{printf("%s",$0 ~ /^[0-9]/ && NR>1?RS $0:$0)} END{print ""}' Input_file
OR
awk 'NF{printf("%s",/^[0-9]/ && NR>1?RS $0:$0)} END{print ""}' Input_file
This checks whether each line starts with a digit; if it does and it is not the first line, a newline is inserted before it, otherwise the line is printed as-is. The END block prints a newline after the whole file has been read; without it, no newline would be printed at the end.
If you only ever have the line split into two, you can use this sed command:
sed 'N;s/\n\([^[:digit:]]\)/\1/;P;D' infile
This appends the next line to the pattern space, checks if the linebreak is followed by something other than a digit, and if so, removes the linebreak, prints the pattern space up to the first linebreak, then deletes the printed part.
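A quick demonstration with GNU sed, using made-up sample data in the same pipe-delimited shape:
$ printf '1|ok|x\n2|Hi,\n more.|y\n3|ok|z\n' | sed 'N;s/\n\([^[:digit:]]\)/\1/;P;D'
1|ok|x
2|Hi, more.|y
3|ok|z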
If a single line can be broken across more than two lines, we have to loop over the substitution:
sed ':a;N;s/\n\([^[:digit:]]\)/\1/;ta;P;D' infile
This branches from ta to :a if a substitution took place.
To use with Mac OS sed, the label and branching command must be separate from the rest of the command:
sed -e ':a' -e 'N;s/\n\([^[:digit:]]\)/\1/;ta' -e 'P;D' infile
If the continuation lines always begin with a single space:
perl -0000 -lape 's/\n / /g' input
If the continuation lines can begin with an arbitrary amount of whitespace:
perl -0000 -lape 's/\n(\s+)/$1/g' input
It is probably more idiomatic to write:
perl -0777 -ape 's/\n / /g' input
You can use sed this way when the file does not already contain \r characters:
tr "\n" "\r" < inputfile | sed 's/\r\([^0-9]\)/\1/g' | tr '\r' '\n'

Expressing tail through a variable

So I have a chunk of formatted text, and I basically need to use awk to get certain columns out of it. The first thing I did was get rid of the first 10 lines (the header information, which is irrelevant to the info I need).
Next I got the tail by taking the total lines in the file minus 10.
Here's some code:
import=$HOME/$1
if [ -f "$import" ]
then
    # file exists
    echo "File Exists."
    totalLines=$(wc -l < "$import")
    linesMinus=$(expr $totalLines - 10)
    tail -n "$linesMinus" "$import"
    headless=$(tail -n "$linesMinus" "$import")
else
    # file does not exist
    echo "File does not exist."
fi
Now I need to save this tail into a variable (or maybe even separate file) so I can access the columns.
The problem comes here:
headless=`tail -n $linesMinus $import`
When I save the code into this variable and then try to echo it back out, it's all unformatted and I can't distinguish columns to use awk on.
How can I save the tail of this file without compromising the formatting?
Just use Awk. It can do everything you need all at once and all in one program.
E.g. to skip the first 10 lines, then print the second, third, and fourth columns separated by spaces for all remaining lines from INPUT_FILE:
awk 'NR <= 10 {next;}
{print $2 " " $3 " " $4;}' INPUT_FILE
Figured it out, I kind of answered my own question when I asked it. All I did was redirect the tail command to a file in the home directory and I can cat that file. Gotta remember to delete it at the end though!
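For what it's worth, the formatting is most likely lost at the echo step rather than in the assignment: echoing $headless unquoted lets the shell word-split the value, which collapses all the newlines and spacing. Quoting the expansion preserves it:
headless=$(tail -n "$linesMinus" "$import")
echo "$headless"    # the quotes preserve the embedded newlines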

Command line: output lines that are specified in another file

I am searching for a command line that takes a text file and a file with line numbers (one on each line, alternatively from stdin) and outputs only those lines from the first file.
The text file may be several hundred MB in size, and the line list may contain several thousand entries (sorted in ascending order).
In short:
one file contains data
another file contains indexes
a command should extract only the indexed lines
first file:
many lines
of course they are all very different
and contain very important data
...
more lines
...
even more lines
second file:
1
5
7
expected output:
many lines
more lines
even more lines
The second (line number) file does not necessarily have to exist; its data may also come from stdin (indeed, this would be optimal). Also, the format of that data may vary from what is shown if that would make the task easier.
This can be an approach:
$ awk 'FNR==NR {a[$1]; next} FNR in a' file_with_line_numbers file_with_data
many lines
more lines
even more lines
It reads file_with_line_numbers and stores its line numbers as indices of the array a[]. Then it reads the other file, checking whether each line number is in the array, in which case the line is printed.
The trick used is the following:
awk 'FNR==NR {something; next} {other things}' file1 file2
that performs actions related to file1 in the {something} block and then actions related to file2 in the {other things} block.
What if the line numbers are given through stdin?
For this you can use awk '...' - file, where stdin is referred to with -. This is called Naming Standard Input. So you can do:
your_commands | awk 'FNR==NR {a[$1]; next} FNR in a' - file_with_data
Test
$ echo "1
5
7" | awk 'FNR==NR {a[$1]; next} FNR in a' - file_with_data
many lines
more lines
even more lines
With sed, convert the line numbers into a sed program, and use that generated program to print out the wanted lines:
$ sed -n "$( sed 's/$/p/' second_file )" first_file
many lines
more lines
even more lines
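The inner sed simply appends a p (print) command to every line number, producing the program that the outer sed runs:
$ sed 's/$/p/' second_file
1p
5p
7p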
This works too, using csh/tcsh:
foreach line ( `cat file2` )
foreach? sed -n "$line p" file1
foreach? end
many lines
more lines
even more lines
