programmatic grep command output - unix

Is there a way to get XML (or equivalent) output from the grep command that can be passed on to other programs?
For example, grep can give the file names, line numbers and context of the pattern matched.
Filename and line number extraction can be done by splitting on the ':' delimiter. However, if the filename contains a ':' character (I know it is weird, but it is possible), it would need a lot more processing.
With context (the grep -C option), it becomes even more complex. If the context of two matches overlaps, grep merges the output, which makes the matches difficult to separate.
So I am wondering whether the grep command can simply generate XML- or JSON-like output that other programs can just load.

GNU grep has a -Z (--null) option that makes the output unambiguous: it writes a NUL byte, rather than the usual ':', after each file name.
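A minimal sketch of how that NUL separator could be consumed, assuming GNU grep (for -Z) and file names that contain no newline. The helper name grep_records is hypothetical. It turns each NUL into a newline, so the file name and the "lineno:text" part arrive on alternating lines and a ':' inside the file name no longer matters:

```shell
# grep_records PATTERN FILE...   (hypothetical helper name)
# Assumes GNU grep; -Z puts a NUL byte, not ":", after each file name.
grep_records() {
    pattern=$1
    shift
    grep -HnZ -e "$pattern" -- "$@" |
        tr '\000' '\n' |
        awk '
            NR % 2 { file = $0; next }   # odd lines hold the file name
            {
                sep = index($0, ":")     # even lines hold "lineno:text"
                printf "file=%s line=%s text=%s\n",
                       file, substr($0, 1, sep - 1), substr($0, sep + 1)
            }'
}
```

It would be called as grep_records 'pattern' *.txt. The sketch does not handle context lines from -C, which use different separators.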

Related

Unix Text Processing - how to remove part of a file name from the results?

I'm searching through text files using grep and sed commands and I also want the file names displayed before my results. However, I'm trying to remove part of the file name when it is displayed.
The file names are formatted like this: aja_EPL_1999_03_01.txt
I want to have only the date without the beginning letters and without the .txt extension.
I've been searching for an answer and it seems like it's possible to do that with a sed or a grep command by using something like this to look forward and back and extract between _ and .txt:
(?<=_)\d+(?=\.)
But I must be doing something wrong, because it hasn't worked for me and I possibly have to add something as well, so that it doesn't extract only the first number, but the whole date. Thanks in advance.
Edit: Adding also the working command I've used just in case. I imagine whatever command is needed would have to go at the beginning?
sed '/^$/d' *.txt | grep -P '(^([A-ZÖÄÜÕŠŽ].*)?[Pp][Aa][Ll]{2}.*[^\.]$)' *.txt --colour -A 1
The results look like this:
aja_EPL_1999_03_02.txt:PALLILENNUD : korraga üritavad ümbermaailmalendu kaks meeskonda
A desired output would be this:
1999_03_02:PALLILENNUD : korraga üritavad ümbermaailmalendu kaks meeskonda
First off, you might want to think about your regular expression. You say the one you have works, but I wonder if it could be simplified. You told us:
(^([A-ZÖÄÜÕŠŽ].*)?[Pp][Aa][Ll]{2}.*[^\.]$)
It looks to me as if this is intended to match lines that either start with a case-insensitive "PALL" or start with a capital letter and contain it somewhere later, and that must not end in a dot. (With -P, \. inside a bracket expression is just an escaped dot; in a POSIX bracket expression the backslash would be a literal character as well.) So valid lines might be any of:
PALLILENNUD : korraga üritavad etc etc
Õlu on kena. Do I have appalling speling?
Peeter Pall is a limnologist at EMU!
If you'd care to narrow down this description a little and perhaps provide some examples of lines that should be matched or skipped, we may be able to do better. For instance, your outer parentheses are probably unnecessary.
Now, let's clarify what your pipe isn't doing.
sed '/^$/d' *.txt
This reads all your .txt files as an input stream, deletes any empty lines, and prints the output to stdout.
grep -P 'regex' *.txt --otheroptions
This reads all your .txt files, and prints any lines that match regex. It does not read stdin.
So, in the command line you're using right now, your sed command is utterly ignored, as sed's output is not being read by grep. You COULD instruct grep to read from both files and stdin:
$ echo "hello" > x.txt
$ echo "world" | grep "o" x.txt -
x.txt:hello
(standard input):world
But that's not what you're doing.
By default, when grep reads from multiple files, it will precede each match with the name of the file from whence that match originated. That's also what you're seeing in my example above -- two inputs, one x.txt and the other - a.k.a. stdin, separated by a colon from the match they supplied.
While grep does include the most minuscule capability for filtering (with -o, or GNU grep's \K when Perl-compatible regular expressions are enabled with -P), it does NOT provide you with any options for formatting the filename. Since you can't change how grep formats its output, you're limited to either parsing the output you've got, or using some other tool.
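As an illustration of that minimal filtering, here is a sketch that pulls the date out of a file name string, assuming GNU grep built with PCRE support. The function name extract_date is hypothetical. It also shows why \d+ alone cannot work: digits do not span the underscores, so the character class has to include "_" as well.

```shell
# Assumes GNU grep with -P available.
# \K discards the "_" matched so far; (?=\.txt) requires the extension
# without consuming it. [0-9_]+ spans digits and underscores, so the
# whole date is matched rather than a single number.
extract_date() {
    printf '%s\n' "$1" | grep -oP '_\K[0-9_]+(?=\.txt)'
}

extract_date 'aja_EPL_1999_03_01.txt'   # prints 1999_03_01
```

This operates on the file name itself, not on the prefix grep prints in front of each match.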
Parsing is easy, if your filenames are predictably structured as they seem to be from the two examples you've provided.
For this, we can ignore that these lines contain a file and data. For the purpose of the filter, they are a stream which follows a pattern. It looks like you want to strip off all characters from the beginning of each line up to and not including the first digit. You can do this by piping through sed:
sed 's/^[^0-9]*//'
Or you can achieve the same effect by using grep's minimal filtering to return every match starting from the first digit:
grep -o '[0-9].*'
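Both filters above leave the .txt extension in place. A sketch that also removes it, assuming every file name ends in digits and underscores followed by .txt (strip_name is a hypothetical name), could look like this. It accepts both the ":" that grep puts after the file name on matching lines and the "-" it uses on context lines from -A 1:

```shell
# Hypothetical filter: "aja_EPL_1999_03_02.txt:text" -> "1999_03_02:text".
# Assumes file names of the form <non-digits><digits/underscores>.txt
strip_name() {
    sed 's/^[^0-9]*\([0-9_]*\)\.txt\([-:]\)/\1\2/'
}

printf '%s\n' 'aja_EPL_1999_03_02.txt:PALLILENNUD : korraga' | strip_name
# prints 1999_03_02:PALLILENNUD : korraga
```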
If this kind of pipe-fitting is not to your liking, you may want to replace your entire grep with something in awk that combines functionality:
$ awk '
  /\.$/ { next }                              # skip lines ending in a dot
  /^([A-ZÖÄÜÕŠŽ].*)?[Pp][Aa][Ll][Ll]/ {       # lines to match, "pall" in any case
    f = FILENAME
    sub(/^[^0-9]*/, "", f)                    # strip unwanted part of filename, like sed
    printf "%s:%s\n", f, $0
    if ((getline) > 0)                        # simulate the "-A 1" from grep
      printf "%s:%s\n", f, $0
  }' *.txt
Note that I haven't tested this, because I don't have your data to work with.
Also, awk doesn't include any of the fancy terminal-dependent colourization that GNU grep provides through the --colour option.

Delete line containing a specific string starting with dollar sign using unix sed

I am very new to Unix.
I have a parameter file Parameter.prm containing following lines.
$$ErrorTable1=ErrorTable1
$$Filename1_New=FileNew.txt
$$Filename1_Old=FileOld.txt
$$ErrorTable2=ErrorTable2
$$Filename2_New=FileNew.txt
$$Filename2_Old=FileOld.txt
$$ErrorTable3=ErrorTable3
$$Filename3_New=FileNew.txt
$$Filename3_Old=FileOld.txt
I want get the output as
$$ErrorTable1=ErrorTable1
$$ErrorTable2=ErrorTable2
$$ErrorTable3=ErrorTable3
Basically, I need to delete line starting with $$Filename.
Since $ is a keyword, I am not able to interpret it as a string. How can I accomplish this using sed?
With sed:
$ sed '/$$Filename/d' infile
$$ErrorTable1=ErrorTable1
$$ErrorTable2=ErrorTable2
$$ErrorTable3=ErrorTable3
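The /$$Filename/ part is the address, i.e., for all lines matching this, the command following it will be executed. The single quotes keep the shell from expanding $$, and in a basic regular expression $ is only an anchor at the end of the pattern, so here both dollar signs match literally. The command is d, which deletes the line. Lines that don't match are just printed as is.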
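The address above matches $$Filename anywhere in a line. A slightly stricter sketch, which deletes only lines that start with it and escapes the dollar signs explicitly, might look like this (the function name drop_filename_lines is hypothetical):

```shell
# Hypothetical helper: delete lines that START with the literal text $$Filename.
# Single quotes stop the shell from expanding $$ (its own process ID);
# \$ makes each dollar sign literal in the regex wherever it appears.
drop_filename_lines() {
    sed '/^\$\$Filename/d' "$@"
}

printf '%s\n' '$$ErrorTable1=ErrorTable1' '$$Filename1_New=FileNew.txt' |
    drop_filename_lines
# prints $$ErrorTable1=ErrorTable1
```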
Extracting information from a text file based on a pattern search is a job for grep:
grep ErrorTable file
or even
grep -F '$$ErrorTable' file
-F tells grep to treat the search term as a fixed string instead of a regular expression.
Just to answer your question, if a regular expression needs to search for characters which have a special meaning in the regex language, you need to escape them:
grep '\$\$ErrorTable' file
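The commands above select the lines to keep. The inverse, dropping the unwanted lines, can be sketched with -v and a fixed string. The helper name keep_without is hypothetical, and the caller still has to single-quote the argument so the shell does not expand $$:

```shell
# Hypothetical helper: print every input line that does NOT contain
# the given fixed string (-F: no regex interpretation, -v: invert).
keep_without() {
    grep -v -F -e "$1"
}

printf '%s\n' '$$ErrorTable1=ErrorTable1' '$$Filename1_Old=FileOld.txt' |
    keep_without '$$Filename'
# prints $$ErrorTable1=ErrorTable1
```

Unlike an anchored regular expression, the fixed string matches anywhere in the line.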

Is it feasible to narrow down the result returned by ls() with grep in R, much like the `ls -l | grep` command in UNIX?

In Terminal/shell script, you can list all files in the current directory with ls -l, and then pipe it to execute an additional command. For example, ls -l | grep -i "calc" returns all files whose filename includes calc. In R, you can list all objects currently stored in the workspace, with ls() command.
However, I want to narrow down the list returned by ls() with something like grep in R, where the input is the list returned by ls() and the output is that list filtered by grep (or something similar), much like the UNIX pipe I mentioned above. Is it feasible to do this in R?
Also, is it feasible to narrow down the list with xargs-like functionality in R? I would like to get only the objects whose source contains the literal if: if a function in the list returned by ls() has an if-else condition inside it, I want to display that function in the console. You can do this in Terminal with find . | xargs grep "if" (of course those are files in the current directory, not R objects in the workspace; I show it just for the purpose of illustration).
Note that this is not a post on how to call shell commands from within R. It's not what I want to do.
I use OS X 10.9.3 and R 3.1.0.
ls() has a pattern parameter that might be what you need:
pattern an optional regular expression. Only names matching pattern
are returned. glob2rx can be used to convert wildcard patterns
to regular expressions.
For the second part of your question, you could use capture.output(getAnywhere()) and grep to look inside function source. You'll need to pass in the functions to that and I'd make that whole operation a function to keep the implementation clean.
You can do
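grep("calc", list.files(), value=TRUE, ignore.case=TRUE)
which should "emulate" ls -l | grep -i "calc" (ignore.case=TRUE plays the role of -i). The same call works with ls() in place of list.files(). See ?list.files and ?grep.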

How to replace string by an escape character plus string in unix

How can I convert a single line like the one below:
794170|VWSD|AAA|e|h|i|j|STRING1|794170|VWSD|BBB|q|w|e|r|STRING2|794170|VWSD|CCC|z|x|c|v|STRING3|...and so on
to a linefeed-delimited format?
Expected Output:
794170|VWSD|AAA|e|h|i|j|STRING1|
794170|VWSD|BBB|q|w|e|r|STRING2|
794170|VWSD|CCC|z|x|c|v|STRING3|
and so on.
BTW I'm not a Unix expert and just want steps or simple commands to resolve this. Appreciate your help.
I assume you have your string in a file with name "x", then you can do this.
I use the character ":" as a temporary marker that sed inserts in front of each record. Choose something else if ":" occurs in your string. Then tr changes each ":" to a newline. The output is as you desire except that there is an extra newline at the beginning.
cat x | sed 's/794170/:794170/g' | tr ':' "\n"
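A variation that avoids the extra leading newline, sketched under the same assumptions (":" never occurs in the data, and 794170 only appears as the first field of a record): match the record ID together with the "|" on either side, so the very first record, which has no "|" before it, is left alone. The function name split_records is hypothetical.

```shell
# Hypothetical helper; assumes ":" never occurs in the data and that every
# record after the first is preceded by "|".
split_records() {
    sed 's/|794170|/|:794170|/g' | tr ':' '\n'
}

printf '%s\n' '794170|VWSD|AAA|e|h|i|j|STRING1|794170|VWSD|BBB|q|w|e|r|STRING2|' |
    split_records
# prints:
# 794170|VWSD|AAA|e|h|i|j|STRING1|
# 794170|VWSD|BBB|q|w|e|r|STRING2|
```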
If every record has the same width (32 characters here), you can use the fold command:
$ fold -w32 file
794170|VWSD|AAA|e|h|i|j|STRING1|
794170|VWSD|BBB|q|w|e|r|STRING2|
794170|VWSD|CCC|z|x|c|v|STRING3|
I don't think you can do it with a simple command. There are several options for creating scripts that can split lines more or less arbitrarily. Any Unix will have the awk utility available. On most systems you will also find Python and Perl. My guess is that a Perl or Python script is the easiest way to split lines like the one you gave.
This would be one way to do it in Python
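inline = "794170|VWSD|AAA|e|h|i|j|STRING1|794170|VWSD|BBB|q|w|e|r|STRING2|794170|VWSD|CCC|z|x|c|v|STRING3|"
# Splitting on the record ID drops it, so put it back on each piece.
splits = ['794170' + s for s in inline.split('794170')]
# splits[0] is only the re-added ID in front of an empty string, so skip it.
for s in splits[1:]:
    print(s)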
794170|VWSD|AAA|e|h|i|j|STRING1|
794170|VWSD|BBB|q|w|e|r|STRING2|
794170|VWSD|CCC|z|x|c|v|STRING3|
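Since awk was mentioned as an option, here is a rough awk sketch that splits by field count instead of by record ID. It assumes every record has exactly 8 fields and that the input line ends with "|", as in the sample. The function name split_every_8 is hypothetical.

```shell
# Hypothetical helper; assumes 8 "|"-terminated fields per record and a
# trailing "|" on the input line (so the last awk field is empty).
split_every_8() {
    awk -F'|' '{
        for (i = 1; i < NF; i++)
            printf "%s|%s", $i, (i % 8 == 0 ? "\n" : "")
    }'
}

printf '%s\n' '794170|VWSD|AAA|e|h|i|j|STRING1|794170|VWSD|BBB|q|w|e|r|STRING2|' |
    split_every_8
# prints:
# 794170|VWSD|AAA|e|h|i|j|STRING1|
# 794170|VWSD|BBB|q|w|e|r|STRING2|
```

Unlike fold, this does not depend on the fields having a fixed character width.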

Grep - Tabs + a pattern file

I'm looking to feed grep a pattern file with -f, but add a literal tab before and after each pattern (from the file). This method will allow me to grep for an exact column match, since I am dealing with a tab-separated file.
Yes, I have to use Grep. Awk and perl can't seem to handle my large pattern file.
And Yes, I could just add tabs to the pattern file, but I have many pattern files, so that would take a long time, but if all fails, that is what I'll do.
This should be useful for anyone looking to do an exact column match inside of a tsv file.
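I'd try adding the tabs "on the fly":
grep -f <(sed 's/^/\t/; s/$/\t/' patterns) [args...]
Here GNU sed expands \t in the replacement to a real tab character, so grep receives literal tabs. (Plain grep does not interpret \t in a pattern as a tab; only grep -P does.) The <(...) process substitution requires bash or a similar shell.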
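A more portable sketch of the same idea, assuming the patterns are literal strings (hence -F) and that the wanted column is neither the first nor the last one, so it has a tab on both sides. The helper name grep_column is hypothetical. It builds the tab with printf instead of relying on \t support in sed, and uses a temporary file instead of process substitution:

```shell
# grep_column PATTERN_FILE DATA_FILE...   (hypothetical helper name)
# Wraps every pattern in literal tabs, then does a fixed-string search.
grep_column() {
    patterns=$1
    shift
    tab=$(printf '\t')
    tmp=$(mktemp) || return 2
    # \$ keeps the end-of-line anchor away from shell expansion.
    sed "s/^/$tab/; s/\$/$tab/" "$patterns" > "$tmp"
    grep -F -f "$tmp" -- "$@"
    status=$?
    rm -f "$tmp"
    return "$status"
}
```

It would be called as grep_column patterns data.tsv. Matching the first or last column would need the start or end of the line handled separately.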
