I want to check if a multiline text matches an input. grep comes close, but I couldn't find a way to make it interpret pattern as plain text, not regex.
How can I do this, using only Unix utilities?
Use grep -F:
-F, --fixed-strings
Interpret PATTERN as a list of fixed strings, separated by newlines, any of which is to be matched. (-F is specified by
POSIX.)
EDIT: Initially I didn't understand the question well enough. If the pattern itself contains newlines, use -z option:
-z, --null-data
Treat the input as a set of lines, each terminated by a zero
byte (the ASCII NUL character) instead of a newline. Like the
-Z or --null option, this option can be used with commands like
sort -z to process arbitrary file names.
I've tested it; multiline patterns work.
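For instance, a usage sketch of that combination (input.txt is a placeholder; as noted above this relies on GNU grep's -z handling, so it is worth double-checking on a small sample that the embedded newline is matched literally rather than splitting the pattern into two separate strings):
pattern='first line
second line'
grep -qzF -- "$pattern" input.txt && echo matched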
If the input string you are trying to match does not contain a blank line (i.e., it does not have two consecutive newlines), you can do:
awk 'index( $0, "needle\nwith no consecutive newlines" ) { m=1 }
END{ exit !m }' RS= input-file && echo matched
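For example, a quick check with made-up data in sample.txt:
$ printf 'alpha\nneedle\nwith no consecutive newlines\nbeta\n' > sample.txt
$ awk 'index($0, "needle\nwith no consecutive newlines"){ m=1 } END{ exit !m }' RS= sample.txt && echo matched
matched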
If you need to find a string with consecutive newlines, set RS to some string that is not in the file. (Note that the results of awk are unspecified if you set RS to more than one character, but most awk implementations will allow it to be a string.) If you are willing to make the sought string a regex, and if your awk supports setting RS to more than one character, you could do:
awk 'END{ exit NR == 1 }' RS='sought regex' input-file && echo matched
(The regex acts as the record separator, so if it occurs in the file awk reads more than one record and NR is greater than 1 at END, giving a zero exit status.)
I have a variable name that appears in multiple locations of a text file. This variable will always start with the same string but not always end with the same characters. For example, it can be var_name or var_name_TEXT.
I'm looking for a way to extract the first occurrence in the text file of this string starting with var_name and ending with , (but I don't want the comma in the output).
Example1: var_name, some_other_var, another_one, ....
Output: var_name
Example2: var_name_TEXT, some_other_var, another_one, ...
Output: var_name_TEXT
grep -oPm1 '\bvar_name[^, ]*(?=,)' file | head -1
This matches and outputs only variables starting with var_name and ending at a comma, without including the comma in the output; it quits after the first line with a match (-m1), and head -1 picks the first match on that line (if there is more than one).
P.S. You have to include the space in the bracket expression as well.
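For example, run against the second sample line (with the input inlined via printf):
$ printf 'var_name_TEXT, some_other_var, another_one, ...\n' | grep -oPm1 '\bvar_name[^, ]*(?=,)' | head -1
var_name_TEXT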
I suggest with GNU grep:
grep -o '\bvar_name[^,]*' file | head -n 1
All you need is (GNU awk):
$ awk 'match($0,/\<var_name[^,]*/,a){print a[0]; exit}' file
var_name_TEXT
To print the field only (i.e., var_name or var_name_TEXT only; not the line containing it) you could use awk:
awk -F, '{for (i=1;i<=NF;i++) if ($i~/^var_name/) print $i}' file
If you actually have spaces before or after the commas (as you show in your example) you can change the awk field separator:
awk -F"[, ]+" '{for (i=1;i<=NF;i++) if ($i~/^var_name/) print $i}' file
You can also use GNU grep with a word boundary assertion:
grep -o '\bvar_name[^,]*' file
Or GNU awk (note that this prints the whole matching line rather than just the field):
awk '/\<var_name/' file
If you want only one considered, add exit to awk or -m 1 to grep to exit after the first match.
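For example, with GNU grep (a sketch; -m 1 stops after the first matching line, but -o still prints every match on that line, so the head -n 1 above is still needed if you only want the very first match):
grep -o -m 1 '\bvar_name[^,]*' file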
I have a file that, occasionally, has split lines. The split is signaled by the continuation line starting with a space, being empty, or starting with some other non-numeric character. E.g.
40403813|7|Failed|No such file or directory|1
40403816|7|Hi,
The Conversion System could not be reached.|No such file or directory||1
40403818|7|Failed|No such file or directory|1
...
I'd like to join the split line back onto the previous line (as shown below):
40403813|7|Failed|No such file or directory|1
40403816|7|Hi, The Conversion System could not be reached.|No such file or directory||1
40403818|7|Failed|No such file or directory|1
...
using a Unix command like sed/awk. I'm not sure how to join a line with the preceding one.
Any suggestion?
awk to the rescue!
awk -v ORS='' 'NR>1 && /^[0-9]/{print "\n"} NF' file
This prints a newline only when the current line starts with a digit; otherwise the rows are appended directly (you may want to add a space to ORS if the line break didn't preserve the space).
Don't do anything based on the values of the strings in your fields as that could go wrong. You COULD get a wrapping line that starts with a digit, for example. Instead just print after every complete record of 5 fields:
$ awk -F'|' '{rec=rec $0; nf+=NF} nf>=5{print rec; nf=0; rec=""}' file
40403813|7|Failed|No such file or directory|1
40403816|7|Hi, The Conversion System could not be reached.|No such file or directory||1
40403818|7|Failed|No such file or directory|1
Try:
awk 'NF{printf("%s",$0 ~ /^[0-9]/ && NR>1?RS $0:$0)} END{print ""}' Input_file
OR
awk 'NF{printf("%s",/^[0-9]/ && NR>1?RS $0:$0)} END{print ""}' Input_file
This checks whether each line starts with a digit; if it does and it is not the first line, a newline (RS) is printed before it, otherwise the line is simply printed, so continuation lines are appended to the current output line. The END block prints a final newline, because without it the output would not end with a newline once the whole file has been read.
If you only ever have the line split into two, you can use this sed command:
sed 'N;s/\n\([^[:digit:]]\)/\1/;P;D' infile
This appends the next line to the pattern space, checks if the linebreak is followed by something other than a digit, and if so, removes the linebreak, prints the pattern space up to the first linebreak, then deletes the printed part.
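For example, with made-up data where each record is split into at most two lines:
$ printf '1|a\n continued\n2|b\n more\n' | sed 'N;s/\n\([^[:digit:]]\)/\1/;P;D'
1|a continued
2|b more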
If a single line can be broken across more than two lines, we have to loop over the substitution:
sed ':a;N;s/\n\([^[:digit:]]\)/\1/;ta;P;D' infile
This branches from ta to :a if a substitution took place.
To use with Mac OS sed, the label and branching command must be separate from the rest of the command:
sed -e ':a' -e 'N;s/\n\([^[:digit:]]\)/\1/;ta' -e 'P;D' infile
If the continuation lines always begin with a single space:
perl -0000 -lape 's/\n / /g' input
If the continuation lines can begin with an arbitrary amount of whitespace:
perl -0000 -lape 's/\n(\s+)/$1/g' input
It is probably more idiomatic to slurp the whole file as a single record with -0777:
perl -0777 -ape 's/\n / /g' input
You can also use sed if the file contains no \r characters: translate the newlines to carriage returns so that sed can remove the ones that precede a non-digit, then translate back:
tr "\n" "\r" < inputfile | sed 's/\r\([^0-9]\)/\1/g' | tr '\r' '\n'
How can I use the grep shell command to show a specific word from a line starting with a specific word?
Ex:
I want to print the string "myFTPpath/folderName/" from the line starting with searchStr in the line shown below.
searchStr:somestring:myFTPpath/folderName/:somestring
Something like this with awk:
awk -F: '/^searchStr/{print $3}' File
From all the lines starting with searchStr, print the 3rd field (field separator set as :)
Sample:
AMD$ cat File
someStr:somestring:myFTPpath/folderName/:somestring
someStr:somestring:myFTPpath/folderName/:somestring
searchStr:somestring:myFTPpath/folderName/:somestring
someStr:somestring:myFTPpath/folderName/:somestring
AMD$ awk -F: '/^searchStr/{print $3}' File
myFTPpath/folderName/
Remember that grep isn't the only tool that can usefully do searches.
In this particular case, where the lines are naturally broken into fields, awk is probably the best solution, as #A.M.D's answer suggests.
For more general edits, however, remember sed's -n option, which suppresses the automatic printing of lines:
sed -n 's/searchStr:[^:]*:\([^:]*\):.*/\1/p' input-file
The -n suppresses automatic printing of the line, and the trailing /p flag explicitly prints out lines on which there is a substitution.
This matching pattern is fiddly – use awk in this fielded case – but don't forget sed -n.
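Another lightweight combination in the same spirit (a sketch, assuming the colon-separated layout shown above) is grep plus cut, which prints the third colon-separated field of the matching lines:
grep '^searchStr:' input-file | cut -d: -f3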
You could get the desired output with grep itself, but you need to enable the -P and -o options.
$ echo 'searchStr:somestring:myFTPpath/folderName/:somestring' | grep -oP '^searchStr:[^:]*:\K[^:]*'
myFTPpath/folderName/
\K discards everything matched up to that point from the final output, leaving only what is matched by the part of the pattern that follows \K. It is used here because PCRE does not support a variable-length positive lookbehind assertion.
I have a file that has lines that look like this
LINEID1:FIELD1=ABCD,&FIELD2-0&FIELD3-1&FIELD4-0&FIELD9-0;
LINEID2:FIELD1=ABCD,&FIELD5-1&FIELD6-0;
LINEID3:FIELD1=ABCD,&FIELD7-0&FIELD8-0;
LINEID1:FIELD1=XYZ,&FIELD2-0&FIELD3-1&FIELD9-0
LINEID3:FIELD1=XYZ,&FIELD7-0&FIELD8-0;
LINEID1:FIELD1=PQRS,&FIELD3-1&FIELD4-0&FIELD9-0;
LINEID2:FIELD1=PQRS,&FIELD5-1&FIELD6-0;
LINEID3:FIELD1=PQRS,&FIELD7-0&FIELD8-0;
I'm interested in only the lines that begin with LINEID1 and only some elements (FIELD1, FIELD2, FIELD4 and FIELD9) from that line. The output should look like this (no & signs; they can be replaced with |):
FIELD1=ABCD|FIELD2-0|FIELD4-0|FIELD9-0;
FIELD1=XYZ|FIELD2-0|FIELD9-0;
FIELD1=PQRS|FIELD4-0|FIELD9-0;
If additional information is required, do let me know and I'll post it in an edit. Thanks!!
This is not exactly what you asked for, but no one else has answered yet and it is close enough to get you started!
awk -F'[&:]' '/^LINEID1:/{print $2,$3,$5,$6}' OFS='|' file
Output
FIELD1=ABCD,|FIELD2-0|FIELD4-0|FIELD9-0;
FIELD1=XYZ,|FIELD2-0|FIELD9-0|
FIELD1=PQRS,|FIELD3-1|FIELD9-0;|
The -F sets the input field separator to colon or ampersand. Then it looks for lines starting with LINEID1: and prints the fields you need. The OFS sets the output field separator to the pipe symbol |.
Pure awk:
awk -F ":" ' /LINEID1[^0-9]/{gsub(/FIELD[^1249]+[-=][A-Z0-9]+/,"",$2); gsub(/,*&+/,"|",$2); print $2} ' file
Updated to give proper formatting and to omit LINEID11, etc...
Output:
FIELD1=ABCD|FIELD2-0|FIELD4-0|FIELD9-0;
FIELD1=XYZ|FIELD2-0|FIELD9-0
FIELD1=PQRS|FIELD4-0|FIELD9-0;
Explanation:
awk -F ":" - split lines into LHS ($1) and RHS ($2) since output only requires RHS
/LINEID1[^0-9]/ - select only lines that contain LINEID1, ignoring LINEID11, LINEID100 etc...
gsub(/FIELD[^1249]+[-=][A-Z0-9]+/,"",$2) - remove all fields that aren't 1, 2, 4 or 9 on the RHS
gsub(/,*&+/,"|",$2) - clean up the leftover delimiters on the RHS
To select rows from data with Unix command lines, use grep, awk, perl, python, or ruby (in increasing order of power & possible complexity).
To select columns from data, use cut, awk, or one of the previously mentioned scripting languages.
First, let's get only the lines with LINEID1 (assuming the input is in a file called input).
grep '^LINEID1' input
will output all the lines beginning with LINEID1.
Next, extract the columns we care about:
grep '^LINEID1' input | # extract lines with LINEID1 in them
cut -d: -f2 | # extract column 2 (after ':')
tr ',&' '\n\n' | # turn ',' and '&' into newlines
egrep 'FIELD[1249]' | # extract only fields FIELD1, FIELD2, FIELD4, FIELD9
tr '\n' '|' | # turn newlines into '|'
sed -e $'s/\\|\\(FIELD1\\)/\\\n\\1/g' -e 's/\|$//'
The last line inserts newlines in front of the FIELD1 lines, and removes any trailing '|'.
That last sed expression is a little more challenging because sed doesn't portably accept \n in its replacement text. To put a literal newline there, a bash $'...' escape is used, which then requires doubling the backslashes throughout that string.
Here's the output from the above command:
FIELD1=ABCD|FIELD2-0|FIELD4-0|FIELD9-0;
FIELD1=XYZ|FIELD2-0|FIELD9-0
FIELD1=PQRS|FIELD4-0|FIELD9-0;
This command took only a couple of minutes to cobble up.
Even so, it's bordering on the complexity threshold where I would shift to perl or ruby because of their excellent string processing.
The same script in ruby might look like:
#!/usr/bin/env ruby
#
while line = gets do
  if line.chomp =~ /^LINEID1:(.*)$/
    f1, others = $1.split(',')
    fields = others.split('&').map {|f| f if f =~ /FIELD[1249]/}.compact
    puts [f1, fields].flatten.join("|")
  end
end
Run this script on the same input file and the same output as above will occur:
$ ./parse-fields.rb < input
FIELD1=ABCD|FIELD2-0|FIELD4-0|FIELD9-0;
FIELD1=XYZ|FIELD2-0|FIELD9-0
FIELD1=PQRS|FIELD4-0|FIELD9-0;
grep -w uses punctuation and whitespace as delimiters.
How can I set grep to use only whitespace as the delimiter for a word?
If you want only spaces to count as delimiters: instead of grep -w foo, use grep " foo ". If you also want to match at line endings or around tabs you can start doing things like: grep '\(^\| \)foo\($\| \)', but you're probably better off with perl -ne 'print if /\sfoo\s/'
You cannot change the way grep -w works. However, you can replace punctuation with, say, an X character using tr or sed and then use grep -w; that will do the trick.
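For example, a sketch of that idea (assuming the word you search for contains no punctuation itself, and that the rewritten punctuation in the output is acceptable):
sed 's/[[:punct:]]/X/g' file | grep -w 'foo'
Here "foo" no longer matches inside a string like foo:bar, because the colon has been rewritten to X, but it still matches where it is whitespace-delimited.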
The --word-regexp flag is useful, but limited. The grep man page says:
-w, --word-regexp
Select only those lines containing matches that form whole
words. The test is that the matching substring must either be
at the beginning of the line, or preceded by a non-word
constituent character. Similarly, it must be either at the end
of the line or followed by a non-word constituent character.
Word-constituent characters are letters, digits, and the
underscore.
If you want to use custom field separators, awk may be a better fit for you. Or you could just write an extended regular expression with egrep or grep --extended-regexp that gives you more control over your search pattern.
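For instance, a sketch of both approaches (foo:bar stands in for a word containing punctuation that -w would otherwise split on):
grep -E '(^| )foo:bar( |$)' file
awk '{for (i=1;i<=NF;i++) if ($i == "foo:bar") {print; next}}' file
The awk version treats any run of spaces or tabs as the delimiter; the grep -E version only handles spaces and line boundaries.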
Use tr to replace spaces with newlines, then grep for your string. The contiguous string I needed was being split up by grep -w because it has colons in it. Furthermore, I only knew the first part, and the second part was the unknown data I needed to pull. Therefore, the following helped me.
echo "$your_content" | tr ' ' '\n' | grep 'string'