Split file without separating rows beginning with like values in Unix

I have a sorted .csv file that is something like this:
AABB1122,ABC,BLAH,4
AABB1122,ACD,WHATEVER,1
AABB1122,AGT,CAT,4
CCDD4444,AYT,DOG,4
CCDD4444,ACG,MUMMY,8
CCEE4444,AOP,RUN,5
DDFF9900,TUI,SAT,33
DDFF9900,WWW,INDOOR,5
I want to split the file into smaller files of roughly two lines each, but I do not want rows with like values in the first column separated.
Here, I would end up with three files:
x00000:
AABB1122,ABC,BLAH,4
AABB1122,ACD,WHATEVER,1
AABB1122,AGT,CAT,4
x00001:
CCDD4444,AYT,DOG,4
CCDD4444,ACG,MUMMY,8
x00002:
CCEE4444,AOP,RUN,5
DDFF9900,TUI,SAT,33
DDFF9900,WWW,INDOOR,5
My actual data is about 7 gigs in size and contains over 100 million lines. I want to split it into files of about 100K lines each or ~6MB. I am fine with using either file size or line numbers for splitting.
I know that I can use "split", such as:
split -a 5 -d -l 2 file.csv
Here, that would give me four files, and like values in the first column would end up split across files in most cases.
I think I probably need awk, but, even after reading through the manual, I am not sure how to proceed.
Help is appreciated! Thanks!

An awk script:
BEGIN { FS = "," }
!name { name = sprintf("%06d-%s.txt", NR, $1) }
count >= 2 && prev != $1 {
    close(name)
    name = sprintf("%06d-%s.txt", NR, $1)
    count = 0
}
{
    print >name
    prev = $1
    ++count
}
Running this on the given data will create three files:
$ awk -f script.awk file.csv
$ cat 000001-AABB1122.txt
AABB1122,ABC,BLAH,4
AABB1122,ACD,WHATEVER,1
AABB1122,AGT,CAT,4
$ cat 000004-CCDD4444.txt
CCDD4444,AYT,DOG,4
CCDD4444,ACG,MUMMY,8
$ cat 000006-CCEE4444.txt
CCEE4444,AOP,RUN,5
DDFF9900,TUI,SAT,33
DDFF9900,WWW,INDOOR,5
I have arbitrarily chosen to name each output file after the line number in the original file from which its first line was taken, together with the first field of that line.
The script counts the number of lines printed to the current output file, and if that number is greater than or equal to 2, and if the first field's value is different from the previous line's first field, the current output file is closed, a new output name is constructed, and the count is reset.
The last block simply prints to the current filename, remembers the first field in the prev variable, and increments the count.
The BEGIN block initializes the field delimiter (before the first line is read) and the !name block sets the initial output file name (when reading the very first line).
To get exactly the filenames that you have in the question, use
name = sprintf("x%05d", n++)
to set the output filename in both places where this is done (with n++ the numbering starts at x00000, matching the question; with ++n it would start at x00001).
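For reference, a sketch of the whole script with that change applied:
BEGIN { FS = "," }
!name { name = sprintf("x%05d", n++) }
count >= 2 && prev != $1 {
    close(name)
    name = sprintf("x%05d", n++)
    count = 0
}
{
    print >name
    prev = $1
    ++count
}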

With csplit, if available. With the given data:
csplit -s infile %^A% /^C/ %^C% /^D/ /^Z/ {*}
In csplit's pattern language, %re% skips ahead to a matching line without writing a file, /re/ writes a file containing the section up to (but not including) the next matching line, {*} (a common extension) repeats the previous pattern as many times as possible, and -s suppresses the usual per-file byte counts. Note that these patterns are tailored to the leading letters of the sample data; they are not a general solution for arbitrary keys.

Related

.ksh paste user input value into dataset

Good morning.
First things first: I know next to nothing about shell scripting in Unix, so please pardon my naivety.
Here's what I'd like to do, and I think it's relatively simple: I would like to create a .ksh file to do two things: 1) take a user-provided numerical value (argument) and paste it into a new column at the end of a dataset (a separate .txt file), and 2) execute a different .ksh script.
I envision calling this script at the Unix prompt, with the input value added thereafter. Something like, "paste_and_run.ksh 58", where 58 would populate a new, final (un-headered) column in an existing dataset (specifically, it'd populate the 77th column).
To be perfectly honest, I'm not even sure where to start with this, so any input would be very appreciated. Apologies for the lack of code within the question. Please let me know if I can offer any more detail, and thank you for taking a look.
I have found the answer: the "nawk" command.
TheNumber=$3
PE_Infile=$1
Where the above variables correspond to the third and first arguments from the command line, respectively. "PE_Infile" represents the file (with full path) to be manipulated, and "TheNumber" represents the number to populate the final column. Then:
nawk -F"|" -v TheNewNumber=$TheNumber '{print $0 "|" TheNewNumber/10000}' $PE_Infile > $BinFolder/Temp_Input.txt
Here, the -F"|" dictates the delimiter, and the -v passes the shell value into the awk program. For reasons unknown to myself, declaring a new variable (TheNewNumber) was necessary to perform the arithmetic manipulation within the print statement. print $0 means that the whole line is printed, while tacking the "|" symbol and the value of the command-line input divided by 10000 onto the end. Finally, we have the input file and an output file (Temp_Input.txt, within a path represented by the $BinFolder variable).
Running the desired script afterward was as simple as typing out the script name (with path), and adding corresponding arguments ($2 $3) afterward as needed, each separated by a space.
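Putting it together, a minimal sketch of what such a paste_and_run.ksh might look like; the $BinFolder value and the downstream script name are assumptions to be adjusted:
#!/bin/ksh
# Sketch only -- argument positions follow the description above.
PE_Infile=$1              # file (with full path) to be manipulated
TheNumber=$3              # value for the new final column
BinFolder=/path/to/bin    # assumption: wherever Temp_Input.txt should live

# Append a new "|"-delimited column holding the input value divided by 10000
nawk -F"|" -v TheNewNumber="$TheNumber" '{print $0 "|" TheNewNumber/10000}' "$PE_Infile" > "$BinFolder/Temp_Input.txt"

# Then run the other script (hypothetical name), passing along its arguments
"$BinFolder"/other_script.ksh "$2" "$3"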

modifying line names to avoid redundancies when files are merged in terminal

I have two files containing biological DNA sequence data. Each of these files is the output of a Python script which assigns each DNA sequence to a sample ID based on a DNA barcode at the beginning of the sequence. The output of one of these .txt files looks like this:
>S066_1 IGJRWKL02G0QZG orig_bc=ACACGTGTCGC new_bc=ACACGTGTCGC bc_diffs=0
TTAAGTTCAGCGGGTATCCCTACCTGATCCGAGGTCAACCGTGAGAAGTTGAGGTTATGGCAAGCATCCATAAGAACCCTATAGCGAGAATAATTACTACGCTTAGAGCCAGATGGCACCGCCACTGATTTTAGGGGCCGCTGAATAGCGAGCTCCAAGACCCCTTGCGGGATTGGTCAAAATAGACGCTCGAACAGGCATGCCCCTCGGAATACCAAGGGGCGCAATGTGCGTCCAAAGATTCGATGATTCACTGAATTCTGCAATTCACATTACTTATCGCATTTCGCAGCGTTCTTCATCGATGACGAGTCTAG
>S045_2 IGJRWKL02H5XHD orig_bc=ATCTGACGTCA new_bc=ATCTGACGTCA bc_diffs=0
CTAAGTTCAGCGGGTAGTCTTGTCTGATATCAGGTCCAATTGAGATACCACCGACAATCATTCGATCATCAACGATACAGAATTTCCCAAATAAATCTCTCTACGCAACTAAATGCAGCGTCTCCGTACATCGCGAAATACCCTACTAAACAACGATCCACAGCTCAAACCGACAACCTCCAGTACACCTCAAGGCACACAGGGGATAGG
The first line is the sequence ID, and the second line is the DNA sequence. The S066 in the first part of the ID indicates that the sequence is from sample 066, and the _1 indicates that it's the first sequence in the file (not the first sequence from S066 per se). Because of the nuances of the DNA sequencing technology being used, I need to generate two files like this from the raw sequencing files, and the result is an output where I have two of these files, which I then merge with cat. So far so good.
The next downstream step in my workflow does not allow identical sample names. Right now it gets halfway through, errors, and closes because it encounters some identical sequence IDs. So, it must be that the 400th sequence in both files belongs to the same sample, or something, generating identical sample IDs (i.e. both files might have S066_400).
What I would like to do is use some code to insert a number (1000, 4971, whatever) immediately after the _ on every other line in the second file, starting with the first line. This way the IDs would no longer be confounded and I could proceed. So, it would cover S066_2 to S066_24971 or S066_49712. Part of the trouble is that the ID may be variable in length, such that it could begin as S066_ or as 49BBT1_.
Try:
awk '/^>/ {$1=$1 "_13"} {print $0}' filename > tmp.tmp
mv tmp.tmp filename
This appends _13 to the first field of every header line (lines starting with >), so S066_1 becomes S066_1_13. Run it on just one of the two files and the IDs in the merged output can no longer collide.
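If you would rather have the number inserted immediately after the first underscore, as the question literally asks, a variant along these lines should work (the 4971 is an arbitrary choice):
awk '/^>/ { sub(/_/, "_4971") } { print }' file2 > tmp.tmp
mv tmp.tmp file2
sub() replaces only the first occurrence of _ on the line, which is the one inside the ID, so >S066_1 becomes >S066_49711 regardless of the ID's length.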

not able to understand NAWK use

I found a command which takes the input data from a binary file and writes into a output file.
nawk 'c-->0;$0~s{if(b)for(c=b+1;c>1;c--)print r[(NR-c+1)%b];print;c=a}b{r[NR%b]=$0}' b=1 a=19 s="<Comment>Ericsson_OCS_V1_0.0.0.7" /var/opt/fds/config/ServiceConfig/ServiceConfig.cfg > /opt/temp/"$circle"_"$sdpid"_RG.cfg
It's working, but I am not able to figure out how. Could anyone please explain how the above command works and what it does? This nawk is too tough for me to understand... :(
Thanks in advance.
nawk is not tough to understand; it is much like other languages. I guess you are not able to follow it because it is not properly formatted; once formatted, you will see how it works.
To answer your question: this command searches the given input file for lines containing an input text, and prints a few lines before and a few lines after each matched line. How many lines are printed is controlled by the variable b (number of lines before) and the variable a (number of lines after), and the string/text to be searched for is passed in the variable s.
This command is helpful in debugging/troubleshooting, where one wants to extract lines from large log files (difficult to open in vi or another editor on Unix/Linux) by searching for some error text and printing a few lines above and below it.
So in your command
b=1 ## means print only 1 line before the matching line
a=19 ## means print 19 lines after the matching line
s="<Comment>Ericsson_OCS_V1_0.0.0.7" ## means search for this string
/var/opt/fds/config/ServiceConfig/ServiceConfig.cfg ## search in this file
/opt/temp/"$circle"_"$sdpid"_RG.cfg ## store the output in this file
Your formatted command is below. The very first condition, which looked like c-->0 before formatting, is easy to interpret once spaced out: it means c-- greater than 0. The NR variable in awk gives the line number of the line currently being processed. (Note that in awk an action's opening brace must sit on the same line as its pattern, so the braces below stay with their conditions.)
nawk '
c-- > 0
$0 ~ s {
    if (b)
        for (c = b + 1; c > 1; c--)
            print r[(NR-c+1) % b]
    print
    c = a
}
b {
    r[NR % b] = $0
}' b=1 a=19 s="<Comment>Ericsson_OCS_V1_0.0.0.7" /var/opt/fds/config/ServiceConfig/ServiceConfig.cfg > /opt/temp/"$circle"_"$sdpid"_RG.cfg
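As an aside, if GNU grep is available, the same before/after context extraction is much simpler; -B and -A set the number of context lines before and after, and -F treats the search string as fixed text rather than a regular expression (grep does insert a -- separator between non-adjacent groups, which the awk version does not):
grep -F -B 1 -A 19 '<Comment>Ericsson_OCS_V1_0.0.0.7' /var/opt/fds/config/ServiceConfig/ServiceConfig.cfg > /opt/temp/"$circle"_"$sdpid"_RG.cfg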

Find lines matching a pattern, provided their value in a specified column occurs exactly twice in the input file

Say the input is (.csv file):
a,b_b,3,c
d,k_k,3,f
g,h_h,3,i
j,k_k,4,l
m,n_n,4,o
p,k_k,5,q
r,s_s,5,t
I want this output:
All lines containing the pattern "k_k" whose number in the third column is found in exactly two lines (ex.: numbers 4 and 5):
j,k_k,4,l
p,k_k,5,q
It might be a simple one but I can't find I way to achieve this. Could anyone help me using Unix command lines (awk)?
awk '/k_k/' && ?? file.csv
I think you want something like this:
awk -F, 'FNR==NR{a[$3]++;next} /k_k/ {if(a[$3]==2)print $0}' file file
I am assuming you mean that the number in column 3 appears exactly twice in the file, not that it is the number 4 or 5. This solution makes two passes over your file: the first to count the number of times each value occurs in column 3, and the second to print the matching lines. That is why the input file is specified twice on the command line.
As a note of explanation, it counts the number of times 1 occurs in column 3 in a[1], the number of times 2 occurs in column 3 in a[2], etc.
Your question title said "2 lines maximum", so if occurring on one single line is also OK, change the == in my code to <=. I cannot tell which you mean.
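Spelled out with comments, the same two-pass program looks like this (with file.csv standing in for your input, named twice so awk reads it twice):
awk -F, '
FNR == NR { a[$3]++; next }   # pass 1: count how often each column-3 value occurs
/k_k/ && a[$3] == 2           # pass 2: print k_k lines whose count is exactly 2
' file.csv file.csv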

Perform sequence of edits on a large text file

I am hoping to perform a series of edits on a large text file composed almost entirely of single letters, separated by spaces. The file is about 300 rows by about 400,000 columns, and about 250 MB.
My goal is to transform this table in a series of steps, for eventual processing in another language (R, probably). I don't have much experience working with big data files, but Perl has been suggested to me as the best way to go about this. Please let me know if there is a better way :).
So, I am hoping to write a Perl script that does the following:
Open file, edit or write to a new file the following:
remove columns 2-6
merge/concatenate pairs of columns, starting with column 2 (so, merge column 2-3,4-5, etc)
replace each character pair according to a sequential conditional algorithm running across each row:
[example pseudocode: if characters 1 and 2 of the cell both equal a, cell=1;
else if characters 1 and 2 of the cell both equal b, cell=2;
etc.] such that, except for the first column, the table becomes a numerical matrix
remove every nth column, or keep every nth column and remove all others
I am just starting to learn Perl, so I was wondering whether these operations are possible in Perl, whether Perl would be the best way to do them, and whether there are any suggestions for the syntax of these operations in the context of reading from and writing to a file.
I'll start:
use strict;
use warnings;
while (<>) {
    chomp;
    my @cols = split /\s+/;        # split on whitespace
    splice(@cols, 1, 5);           # remove columns 2-6
    my @transformed = ($cols[0]);  # start the output row with the label column
    for (my $i = 1; $i < @cols; $i += 2) {
        push @transformed, "$cols[$i]$cols[$i+1]";  # merge pairs of columns
    }
    # other transforms as required
    print join(' ', @transformed), "\n";
}
That should get you on your way.
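A hypothetical invocation, assuming the script is saved as transform.pl and the data is in bigtable.txt:
perl transform.pl bigtable.txt > transformed.txt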
You need to post some sample input and expected output, or we're just guessing what you want, but maybe this will be a start:
awk '{
    printf "%s ", $1
    for (i = 7; i <= NF; i += 2) {
        printf "%s%s ", $i, $(i+1)
    }
    print ""
}' file
