UNIX split command splitting this file, but what names are resulting?

We receive a big CSV file from a client (roughly 500k lines) that we split into smaller chunks using the split command.
You can see how we're using the command below, but my bash knowledge is a bit rusty. Could someone refresh me on the ${processFile}_ bit below, and how the files end up being named? I don't recall what the underscore does...
split -l 50000 $PROCESSING_CURRENT_DIR/$processFile ${processFile}_

This isn't really a bash question; it's about how the split(1) command processes its arguments to split the input.
Syntax is:
split [OPTION]... [FILE [PREFIX]]
DESCRIPTION
Output pieces of FILE to PREFIXaa, PREFIXab, ...; default size is 1000 lines, and default PREFIX is 'x'.
With no FILE, or when FILE is -, read standard input.
So split uses the given PREFIX to name its output files. Here the prefix is the value of $processFile followed by a literal underscore, so the chunks come out as yourfile_aa, yourfile_ab, and so on. The braces in ${processFile}_ are what keep the shell from treating the underscore as part of the variable name ($processFile_ would look up a different, unset variable).
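A minimal sketch of the resulting names, assuming a hypothetical input file called data.csv:
processFile=data.csv                            # hypothetical name
split -l 50000 "$processFile" "${processFile}_"
ls
# data.csv  data.csv_aa  data.csv_ab  data.csv_ac  ...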

Related

Is there a way to skip first x lines of a bz2 file in Python without calling next()?

I'm trying to read the latest Wikidata dump while skipping the first, say, 100 lines.
Is there a better way to do this than calling next() repeatedly?
import bz2

WIKIDATA_JSON_DUMP = bz2.open('latest-all.json.bz2', 'rt')
for n in range(100):
    next(WIKIDATA_JSON_DUMP)  # discard one line per iteration
Alternatively, is there a way to split up the file in bash by, say, using bzcat to pipe select chunks to smaller files?
If it was compressed using something like bgzip, you can skip whole blocks, but each block will contain a variable number of lines, depending on the compression ratio. For a raw bzip2 file, which is a single stream, I don't think you have any choice but to read and throw away the lines to be skipped, due to the nature of the compression format.
You can try the following in bash, to skip the first 10 lines for example:
bzcat /tmp/myfile.bz2 | tail -n +11
Notice that tail gets N+1, where N is the number of lines you want to skip: tail -n +11 starts output at line 11, skipping the first 10.
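For the second part of the question (using bzcat to pipe chunks to smaller files), a minimal sketch, assuming chunks of 1,000,000 lines and the dump file name from the question:
bzcat latest-all.json.bz2 | split -l 1000000 - chunk_
# split reads the decompressed stream from stdin ("-") and writes
# chunk_aa, chunk_ab, ... without ever storing the full dump on disk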

Issue while renaming a file with file pattern in unix

As part of our process, we get an input file in .gz format. We need to unzip this file and add a suffix at the end of the file name. The file name contains a timestamp, so I am trying to use a wildcard pattern while unzipping and renaming it.
Input file name :
Mem_Enrollment_20200515130341.dat.gz
Step 1:
Unzipping this file : (working as expected)
gzip -d Mem_Enrollment_*.dat.gz
output :
Mem_Enrollment_20200515130341.dat
Step 2: Renaming this file : (issues while renaming)
Again, I am going with the pattern, but I know this won't work in this case. So what should I do to rename this file?
mv Mem_Enrollment_*.dat Mem_Enrollment_*.dat_D11
output :
Mem_Enrollment_*.dat_D11
expected output :
Mem_Enrollment_20200515130341.dat_D11
try
for fn in Mem_Enrollment_*.dat
do
    mv "${fn}" "${fn}_D11"    # quote in case a name ever contains whitespace
done
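If you'd rather handle both steps in one pass, a sketch of a combined loop (same file pattern as above, assuming bash/POSIX sh):
for fn in Mem_Enrollment_*.dat.gz
do
    gzip -d "${fn}"                   # leaves Mem_Enrollment_<timestamp>.dat
    mv "${fn%.gz}" "${fn%.gz}_D11"    # ${fn%.gz} strips the trailing .gz
done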
With just DataStage, you could loop over ls output from an Execute Command stage via "ls Mem_Enrollment_*.dat.gz" and then use #FM as the delimiter when looping over the output list. You could then break out the gzip and the rename into two separate commands, which helps with the readability of your job.
The only caveat here is that the Start Loop stage doesn't accept #FM as the delimiter, due to some internal funkiness inside DataStage, so you need to set a user variable equal to it and pass that in instead.

Splitting a file on the basis of line number

Can you please advise the unix command, as I have a file which contains records in the below format:
333434
435435
435443
434543
343536
The total line count is 89380, and I want to create separate files from it.
I am trying to split my large file into smaller pieces by line number. For example, my file has 89380 lines and I would like to divide it into smaller files, each of which has 1000 lines.
Could you please advise a unix command to achieve this?
Can the unix split command be used here?
Use split
Syntax: split [options] filename prefix
Replace filename with the name of the large file you wish to split. Replace prefix with the name you wish to give the small output files. You can exclude [options], or replace it with either of the following:
-l linenumber
-b bytes
If you use the -l (a lowercase L) option, replace linenumber with the number of lines you'd like in each of the smaller files (the default is 1,000). If you use the -b option, replace bytes with the number of bytes you'd like in each of the smaller files.
The split command gives each output file it creates the name prefix, with an extension tacked onto the end that indicates its order. By default, the split command adds aa to the first output file, proceeding through the alphabet to zz for subsequent files. If you do not specify a prefix, most systems use x.
Example 1:
Assume myfile is a 3,000-line file:
split myfile
This will output three 1000-line files: xaa, xab, and xac.
Example 2:
split -l 500 myfile segment
This will output six 500-line files: segmentaa, segmentab, segmentac, segmentad, segmentae, and segmentaf.
Example 3:
Assume myfile is a 160KB file:
split -b 40k myfile segment
This will output four 40KB files: segmentaa, segmentab, segmentac, and segmentad.
You can use the --lines switch or its short form -l
split --lines=1000 input_file_name output_file_prefix
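Applied to the file from this question (89380 lines, split into 1000-line pieces), a sketch assuming a hypothetical input name records.txt:
split -l 1000 records.txt records_
# yields 90 files: records_aa through records_dk with 1000 lines each,
# plus records_dl holding the remaining 380 lines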
You can also use the sed command.
For example, sed -n '1,1000p' yourfile > outputfile writes lines 1 through 1000 to outputfile (-n suppresses sed's default output, and p prints only the addressed range).
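To cut out every 1000-line block that way, you would loop over the ranges; a sketch, again assuming the hypothetical name records.txt:
i=1
n=1
while [ "$i" -le 89380 ]
do
    sed -n "${i},$((i + 999))p" records.txt > "chunk_$n"
    i=$((i + 1000))
    n=$((n + 1))
done
Note that this rescans the file once per chunk, so split -l 1000 remains the more efficient choice here.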

Split files linux and then grep

I'd like to split a file and grep each piece without writing them to individual files.
I've attempted a couple of variations of split and grep with no luck; any suggestions?
Something along the lines of:
split -b SIZE filename | grep "string"
I've attempted grep/fgrep to find the string, but my shell complains that the files are too large (see: use fgrep instead).
There is no point in splitting the file if you plan to [linearly] search each of the pieces anyway (assuming that's the only thing you are doing with it). Consider running grep on the entire file instead.
If, however, you plan to make use of the fact that the file is split later on, then the typical way would be:
Create a temporary directory and step into it
Run split/csplit on the original file
Use a for loop over the written fragments to do your processing, as in the sketch below.
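A minimal sketch of that workflow, with a hypothetical 100MB chunk size and search string:
tmpdir=$(mktemp -d)                         # temporary directory for the pieces
split -b 100M /path/to/filename "$tmpdir/piece_"
for f in "$tmpdir"/piece_*
do
    grep -H "string" "$f"                   # -H prefixes matches with the piece's name
done
rm -r "$tmpdir"                             # clean up the fragments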

Loop "paste" function over multiple files in the same folder

I'm trying to horizontally concatenate a number of *.txt files (around 1000) in a folder.
How can I loop over the files using the "paste" function?
NB: all the *.txt files are in the same directory.
Why loop? You can use wildcards.
paste *.txt > combined.txt
In general, it's just a question of calling paste *.txt (and redirecting the output: paste *.txt > output.txt, as @zx did). Try it, but be aware you'll be generating some enormously long lines. If paste can't handle the line length you'll be generating, you'll have to reproduce its effect in a scripting language that has no line-length limit, like perl or python.
Another possible sticking point is if your shell can't handle this many arguments in the expansion of the glob *.txt. Again, you can solve that with a script, or with a loop like the one sketched below. It's easy to do, so if that's your situation, let us know here.
PS. Given what paste does, looping is not going to do it for you: you (presumably) need the file contents side by side in the output, not one after the other.
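If the glob really does overflow the argument list, one hedged workaround is to paste the files in one at a time (the output name deliberately doesn't end in .txt, so it never matches its own glob):
out=combined.out
for f in *.txt
do
    if [ -e "$out" ]
    then
        paste "$out" "$f" > combined.tmp && mv combined.tmp "$out"
    else
        cp "$f" "$out"    # first file just seeds the output
    fi
done
This does quadratic I/O (the growing output is re-read on every pass), but it keeps each paste call down to two arguments.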
