Split files linux and then grep

Split files linux and then grep - unix

I'd like to split a file and grep each piece without writing them to indvidual files.
I've attempted a couple variations of split and grep and no such luck; any suggestions?
Something along the lines of:
split -b SIZE filename | grep "string"
I've attempted grep/fgrep to find the string but my shell complains that the files are too large. See: use fgrep instead

There is no point in splitting the file if you plan to [linearly] search each of the pieces anyway (assuming that's the only thing you are doing with it). Consider running grep on the entire file.
If however you plan to utilize the fact that the file is split later on, then the typical way would be:
Create a temporary directory and step into it
Run split/csplit on the original file
Use for loop over written fragment to do your processing.

Related

Can I use conditional "or" statements in selecting files to download with wget

I'm trying to download a bunch of files via ftp with wget. I could do this manually for each of the variables that I am interested in, or I was wondering if I could specify these in an "or" type conditional statement in the filepath name.
For example, I would like to download all files that contain the strings "NRRS412", "NRRS443", "NRRS490", etc. I had planned to do individual calls to wget for each of these, like this:
wget -r -A "L3m*NRRS412*.nc" ftp://username:password#ftp.address
I cannot simply use "L3m*NRRS*.nc", as there are other "NRRS" strings that I don't want.
Is there a way to download all of my target strings in a single call to wget?
Thanks for any help

OK, I figured out the solution, which is to create several possible strings separated by commas:
wget -r -A "L3m*NRRS412*.nc, L3m*NRRS43*.nc, L3m*NRRS490*.nc" ftp://username:password#ftp.address

UNIX split command splitting this file, but what names are resulting?

We receive a big csv file from a client (500k lines, est) that we split into smaller chunks using the split command.
You can see how we're using the command below, but my bash knowledge is a bit rusty, could someone refresh me on the ${processFile}_ bit below, and how the files are being named in the end? Not recalling what the underscore does...
split -l 50000 $PROCESSING_CURRENT_DIR/$processFile ${processFile}_

This isn't anything to do with bash but how split(1) command processes its arguments to split the input.
Syntax is:
split [OPTION]... [FILE [PREFIX]]
DESCRIPTION
Output pieces of FILE to PREFIXaa, PREFIXab, ...; default size is 1000 lines, and default PREFIX is 'x'.
With no FILE, or when FILE is -, read standard input.
So it uses the given prefix and makes output files.

Loop "paste" function over multiple files in the same folder

I'm trying to concatenate horizontally a number of files (1000) *.txt in a folder.
How can I loop over the files using the "paste" function?
NB: all the *.txt files are in the same directory.

Why loop? You can use wildcards.
paste *.txt > combined.txt

In general, it would be a question of just calling paste *.txt (and redirecting the output: paste *.txt > output.txt, as #zx did). Try it, but you'll be generating some enormously long lines. If paste can`t handle the line length you'll be generating, you'll have to reproduce its effect using a scripting language that has no line length limit, like perl or python.
Another possible sticking point is if your shell can't handle this many arguments in the expansion of the glob *.txt. Again, you can solve that with a script. It's easy to do so if that's your situation, let us know here.
PS. Given what paste does, looping is not going to do it for you: You (presumably) need the file contents side by side in the output, not one after the other.

Remove lines which are between given patterns from a file (using Unix tools)

I have a text file (more correctly, a “German style“ CSV file, i.e. semicolon-separated, decimal comma) which has a date and the value of a measurement on each line.
There are stretches of faulty values which I want to remove before further work. I'd like to store these cuts in some script so that my corrections are documented and I can replay those corrections if necessary.
The lines look like this:
28.01.2005 14:48:38;5,166
28.01.2005 14:50:38;2,916
28.01.2005 14:52:38;0,000
28.01.2005 14:54:38;0,000
(long stretch of values that should be removed; could also be something else beside 0)
01.02.2005 00:11:43;0,000
01.02.2005 00:13:43;1,333
01.02.2005 00:15:43;3,250
Now I'd like to store a list of begin and end patterns like 28.01.2005 14:52:38 + 01.02.2005 00:11:43, and the script would cut the lines matching these begin/end pairs and everything that's between them.
I'm thinking about hacking an awk script, but perhaps I'm missing an already existing tool.

Have a look at sed:
sed '/start_pat/,/end_pat/d'
will delete lines between start_pat and end_pat (inclusive).
To delete multiple such pairs, you can combine them with multiple -e options:
sed -e '/s1/,/e1/d' -e '/s2/,/e2/d' -e '/s3/,/e3/d' ...

Firstly, why do you need to keep a record of what you have done? Why not keep a backup of the original file, or take a diff between the old & new files, or put it under source control?
For the actual changes I suggest using Vim.
The Vim :global command (abbreviated to :g) can be used to run :ex commands on lines that match a regex. This is in many ways more powerful than awk since the commands can then refer to ranges relative to the matching line, plus you have the full text processing power of Vim at your disposal.
For example, this will do something close to what you want (untested, so caveat emptor):
:g!/^\d\d\.\d\d\.\d\d\d\d/ -1 write tmp.txt >> | delete
This matches lines that do NOT start with a date (the ! negates the match), appends the previous line to the file tmp.txt, then deletes the current line.
You will probably end up with duplicate lines in tmp.txt, but they can be removed by running the file through uniq.

you are also use awk
awk '/start/,/end/' file

I would seriously suggest learning the basics of perl (i.e. not the OO stuff). It will repay you in bucket-loads.
It is fast and simple to write a bit of perl to do this (and many other such tasks) once you have grasped the fundamentals, which if you are used to using awk, sed, grep etc are pretty simple.
You won't have to remember how to use lots of different tools and where you would previously have used multiple tools piped together to solve a problem, you can just use a single perl script (usually much faster to execute).
And, perl is installed on virtually every unix/linux distro now.
(that sed is neat though :-)

use grep -L (print none matching lines)
Sorry - thought you just wanted lines without 0,000 at the end

Unable to combine pwd with filenames in Zsh

Problem: to combine PATHs with filenames, such that I can easily source many files.
I have two files A and B. ls gives their names clearly.
I run
pwd `ls`
I get the error message
too many arguments
I did not find an option for pwd which would allow me to have more than one argument.
How can you combine pwd's output to filenames.

echo $PWD/*

In addition to sigjuice's answer, if, as you state, you need it for sourcing many files, simply use
source ./*
It'll probably burn less CPU cycles because the shell doesn't have to create absolute path names for each file.

Develop Reference

r css asp.net wordpress firebase qt symfony nginx http apache-flex

Split files linux and then grep - unix

Related

Can I use conditional "or" statements in selecting files to download with wget

UNIX split command splitting this file, but what names are resulting?

Loop "paste" function over multiple files in the same folder

Remove lines which are between given patterns from a file (using Unix tools)

Unable to combine pwd with filenames in Zsh

Categories

Resources