Removing comments from a datafile. What are the differences? - unix

Let's say that you would like to remove comments from a datafile using one of two methods:
cat file.dat | sed -e "s/\#.*//"
cat file.dat | grep -v "#"
How do these individual methods work, and what is the difference between them? Would it also be possible for a person to write the clean data to a new file, while avoiding any possible warnings or error messages to end up in that datafile? If so, how would you go about doing this?

How do these individual methods work, and what is the difference
between them?
Yes, they work same though sed and grep are 2 different commands. Your sed command simply substitutes all those lines which having # with NULL. On other hand grep will simply skip or ignore those lines which will skip lines which have # in it.
You could get more information on these by man page as follows:
man grep:
-v, --invert-match
Invert the sense of matching, to select non-matching lines. (-v is specified by POSIX.)
man sed:
s/regexp/replacement/
Attempt to match regexp against the pattern space. If successful, replace that portion matched with replacement. The
replacement may
contain the special character & to refer to that portion of the pattern space which matched, and the special escapes \1
through \9 to
refer to the corresponding matching sub-expressions in the regexp.
Would it also be possible for a person to write the clean data to a
new file, while avoiding any possible warnings or error messages to
end up in that datafile?
yes, we could re-direct the errors by using 2>/dev/null in both the commands.
If so, how would you go about doing this?
You could try like 2>/dev/null 1>output_file
Explanation of sed command: Adding explanation of sed command too now. This is only for understanding purposes and no need to use cat and then use sed you could use sed -e "s/\#.*//" Input_file instead.
sed -e " ##Initiating sed command here with adding the script to the commands to be executed
s/ ##using s for substitution of regexp following it.
\#.* ##telling sed to match a line if it has # till everything here.
//" ##If match found for above regexp then substitute it with NULL.

That grep -v will lose all the lines that have # on them, for example:
$ cat file
first
# second
thi # rd
so
$ grep -v "#" file
first
will drop off all lines with # on it which is not favorable. Rather you should:
$ grep -o "^[^#]*" file
first
thi
like that sed command does but this way you won't get empty lines. man grep:
-o, --only-matching
Print only the matched (non-empty) parts of a matching line,
with each such part on a separate output line.

Related

Unix Text Processing - how to remove part of a file name from the results?

I'm searching through text files using grep and sed commands and I also want the file names displayed before my results. However, I'm trying to remove part of the file name when it is displayed.
The file names are formatted like this: aja_EPL_1999_03_01.txt
I want to have only the date without the beginning letters and without the .txt extension.
I've been searching for an answer and it seems like it's possible to do that with a sed or a grep command by using something like this to look forward and back and extract between _ and .txt:
(?<=_)\d+(?=\.)
But I must be doing something wrong, because it hasn't worked for me and I possibly have to add something as well, so that it doesn't extract only the first number, but the whole date. Thanks in advance.
Edit: Adding also the working command I've used just in case. I imagine whatever command is needed would have to go at the beginning?
sed '/^$/d' *.txt | grep -P '(^([A-ZÖÄÜÕŠŽ].*)?[Pp][Aa][Ll]{2}.*[^\.]$)' *.txt --colour -A 1
The results look like this:
aja_EPL_1999_03_02.txt:PALLILENNUD : korraga üritavad ümbermaailmalendu kaks meeskonda
A desired output would be this:
1999_03_02:PALLILENNUD : korraga üritavad ümbermaailmalendu kaks meeskonda
First off, you might want to think about your regular expression. While the one you have you say works, I wonder if it could be simplified. You told us:
(^([A-ZÖÄÜÕŠŽ].*)?[Pp][Aa][Ll]{2}.*[^\.]$)
It looks to me as if this is intended to match lines that start with a case insensitive "PALL", possibly preceded by any number of other characters that start with a capital letter, and that lines must not end in a backslash or a dot. So valid lines might be any of:
PALLILENNUD : korraga üritavad etc etc
Õlu on kena. Do I have appalling speling?
Peeter Pall is a limnologist at EMU!
If you'd care to narrow down this description a little and perhaps provide some examples of lines that should be matched or skipped, we may be able to do better. For instance, your outer parentheses are probably unnecessary.
Now, let's clarify what your pipe isn't doing.
sed '/^$/d' *.txt
This reads all your .txt files as an input stream, deletes any empty lines, and prints the output to stdout.
grep -P 'regex' *.txt --otheroptions
This reads all your .txt files, and prints any lines that match regex. It does not read stdin.
So .. in the command line you're using right now, your sed command is utterly ignored, as sed's output is not being read by grep. You COULD instruct grep to read from both files and stdin:
$ echo "hello" > x.txt
$ echo "world" | grep "o" x.txt -
x.txt:hello
(standard input):world
But that's not what you're doing.
By default, when grep reads from multiple files, it will precede each match with the name of the file from whence that match originated. That's also what you're seeing in my example above -- two inputs, one x.txt and the other - a.k.a. stdin, separated by a colon from the match they supplied.
While grep does include the most minuscule capability for filtering (with -o, or GNU grep's \K with optional Perl compatible RE), it does NOT provide you with any options for formatting the filename. Since you can'd do anything with the output of grep, you're limited to either parsing the output you've got, or using some other tool.
Parsing is easy, if your filenames are predictably structured as they seem to be from the two examples you've provided.
For this, we can ignore that these lines contain a file and data. For the purpose of the filter, they are a stream which follows a pattern. It looks like you want to strip off all characters from the beginning of each line up to and not including the first digit. You can do this by piping through sed:
sed 's/^[^0-9]*//'
Or you can achieve the same effect by using grep's minimal filtering to return every match starting from the first digit:
grep -o '[0-9].*'
If this kind of pipe-fitting is not to your liking, you may want to replace your entire grep with something in awk that combines functionality:
$ awk '
/[\.]$/ {next} # skip lines ending in backslash or dot
/^([A-ZÖÄÜÕŠŽ].*)?PALL/ { # lines to match
f=FILENAME
sub(/^[^0-9]*/,"",f) # strip unwanted part of filename, like sed
printf "%s:%s\n", f, $0
getline # simulate the "-A 1" from grep
printf "%s:%s\n", f, $0
}' *.txt
Note that I haven't tested this, because I don't have your data to work with.
Also, awk doesn't include any of the fancy terminal-dependent colourization that GNU grep provides through the --colour option.

how to grep nth string

How to use "grep" shell command to show specific word from a line starting with a specific word.
Ex:
I want to print a string "myFTPpath/folderName/" from the line starting with searchStr in the below mentioned line.
searchStr:somestring:myFTPpath/folderName/:somestring
Something like this with awk:
awk -F: '/^searchStr/{print $3}' File
From all the lines starting with searchStr, print the 3rd field (field seperator set as :)
Sample:
AMD$ cat File
someStr:somestring:myFTPpath/folderName/:somestring
someStr:somestring:myFTPpath/folderName/:somestring
searchStr:somestring:myFTPpath/folderName/:somestring
someStr:somestring:myFTPpath/folderName/:somestring
AMD$ awk -F: '/^searchStr/{print $3}' File
myFTPpath/folderName/
Remember that grep isn't the only tool that can usefully do searches.
In this particular case, where the lines are naturally broken into fields, awk is probably the best solution, as #A.M.D's answer suggests.
For more general case edits, however, remember sed's -n option, which suppresses printing out a line after edits:
sed -n 's/searchStr:[^:]*:\([^:]*\):.*/\1/p' input-file
The -n suppresses automatic printing of the line, and the trailing /p flag explicitly prints out lines on which there is a substitution.
This matching pattern is fiddly – use awk in this fielded case – but don't forget sed -n.
You could get the desired output with grep itself but you need to enable -P and -o parameters.
$ echo 'searchStr:somestring:myFTPpath/folderName/:somestring' | grep -oP '^searchStr:[^:]*:\K[^:]*'
myFTPpath/folderName/
\K discards the characters which are matched previously from printing at the final leaving only the characters which are matched by the pattern exists next to \K. Here we used \K instead of a variable length positive lookbehind assertion.

sed - find text between 2 strings and use it for replace

I have a file with many lines like below:
townValue.put("Aachen");
townValue.put("Aalen");
townValue.put("Ahlen");
townValue.put("Arnsberg");
townValue.put("Aschaffenburg");
townValue.put("Augsburg");
I want to change this lines to:
townValue.put("Aalen", "Aalen");
townValue.put("Ahlen", "Ahlen");
townValue.put("Arnsberg", "Arnsberg");
townValue.put("Aschaffenburg", "Aschaffenburg");
townValue.put("Augsburg", "Augsburg");
How can I achieve this with sed or awk. This seems to be a special find & replace task, I couldn't find yet in the net.
Thanks for the help
Use sed -e 's/"[^"]*"/&, &/':
$ cat 1
townValue.put("Aachen");
townValue.put("Aalen");
townValue.put("Ahlen");
townValue.put("Arnsberg");
townValue.put("Aschaffenburg");
townValue.put("Augsburg");
$ sed -e 's/"[^"]*"/&, &/' 1
townValue.put("Aachen", "Aachen");
townValue.put("Aalen", "Aalen");
townValue.put("Ahlen", "Ahlen");
townValue.put("Arnsberg", "Arnsberg");
townValue.put("Aschaffenburg", "Aschaffenburg");
townValue.put("Augsburg", "Augsburg");
According to sed(1):
s/regexp/replacement/
Attempt to match regexp against the pattern space. If successful, replace that portion matched with replacement. The replacement may contain the special character & to refer to that portion of the pattern space which matched, and the special escapes \1 through \9 to refer to the corresponding matching sub-expressions in the regexp.
Code for awk,because of the large number of quotes in the command line I recommend to use a script:
awk -f script file
script
BEGIN {FS=OFS="\""}
$3=", \""$2"\""$3
$ cat file
townValue.put("Aachen");
townValue.put("Aalen");
townValue.put("Ahlen");
townValue.put("Arnsberg");
townValue.put("Aschaffenburg");
townValue.put("Augsburg");
$ awk -f script file
townValue.put("Aachen", "Aachen");
townValue.put("Aalen", "Aalen");
townValue.put("Ahlen", "Ahlen");
townValue.put("Arnsberg", "Arnsberg");
townValue.put("Aschaffenburg", "Aschaffenburg");
townValue.put("Augsburg", "Augsburg");

modifiy grep output with find and replace cmd

I use grep to sort log big file into small one but still there is long dir path in output log file which is common every time.I have to do find and replace every time.
Isnt there any way i can grep -r "format" log.log | execute findnreplce thing?
Sed will do what you want. Basic syntax to replace all the matches of foo with bar in-place in $file is:
sed -i 's/foo/bar/g' $file
If you're just wanting to delete rather than replace, simply leave out the 'bar' (so s/foo//g).
See this tutorial for a lot more detail, such as regex support.
sed -n '/match/s/pattern/repl/p'
Will print all the lines that match the regex match, with all instances of pattern replaced by repl. Since your lines may contain paths, you will probably want to use a different delimeter. / is customary, but you can also do:
sed -n '\#match#s##repl#p`
In the second case, omitting pattern will cause match to be used for the pattern to be replaced.

How to remove blank lines from a Unix file

I need to remove all the blank lines from an input file and write into an output file. Here is my data as below.
11216,33,1032747,64310,1,0,0,1.878,0,0,0,1,1,1.087,5,1,1,18-JAN-13,000603221321
11216,33,1033196,31300,1,0,0,1.5391,0,0,0,1,1,1.054,5,1,1,18-JAN-13,059762153003
11216,33,1033246,31300,1,0,0,1.5391,0,0,0,1,1,1.054,5,1,1,18-JAN-13,000603211032
11216,33,1033280,31118,1,0,0,1.5513,0,0,0,1,1,1.115,5,1,1,18-JAN-13,055111034001
11216,33,1033287,31118,1,0,0,1.5513,0,0,0,1,1,1.115,5,1,1,18-JAN-13,000378689701
11216,33,1033358,31118,1,0,0,1.5513,0,0,0,1,1,1.115,5,1,1,18-JAN-13,000093737301
11216,33,1035476,37340,1,0,0,1.7046,0,0,0,1,1,1.123,5,1,1,18-JAN-13,045802041926
11216,33,1035476,37340,1,0,0,1.7046,0,0,0,1,1,1.123,5,1,1,18-JAN-13,045802041954
11216,33,1035476,37340,1,0,0,1.7046,0,0,0,1,1,1.123,5,1,1,18-JAN-13,045802049326
11216,33,1035476,37340,1,0,0,1.7046,0,0,0,1,1,1.123,5,1,1,18-JAN-13,045802049383
11216,33,1036985,15151,1,0,0,1.4436,0,0,0,1,1,1.065,5,1,1,18-JAN-13,000093415580
11216,33,1037003,15151,1,0,0,1.4436,0,0,0,1,1,1.065,5,1,1,18-JAN-13,000781202001
11216,33,1037003,15151,1,0,0,1.4436,0,0,0,1,1,1.065,5,1,1,18-JAN-13,000781261305
11216,33,1037003,15151,1,0,0,1.4436,0,0,0,1,1,1.065,5,1,1,18-JAN-13,000781603955
11216,33,1037003,15151,1,0,0,1.4436,0,0,0,1,1,1.065,5,1,1,18-JAN-13,000781615746
sed -i '/^$/d' foo
This tells sed to delete every line matching the regex ^$ i.e. every empty line. The -i flag edits the file in-place, if your sed doesn't support that you can write the output to a temporary file and replace the original:
sed '/^$/d' foo > foo.tmp
mv foo.tmp foo
If you also want to remove lines consisting only of whitespace (not just empty lines) then use:
sed -i '/^[[:space:]]*$/d' foo
Edit: also remove whitespace at the end of lines, because apparently you've decided you need that too:
sed -i '/^[[:space:]]*$/d;s/[[:space:]]*$//' foo
awk 'NF' filename
awk 'NF > 0' filename
sed -i '/^$/d' filename
awk '!/^$/' filename
awk '/./' filename
The NF also removes lines containing only blanks or tabs, the regex /^$/ does not.
Use grep to match any line that has nothing between the start anchor (^) and the end anchor ($):
grep -v '^$' infile.txt > outfile.txt
If you want to remove lines with only whitespace, you can still use grep. I am using Perl regular expressions in this example, but here are other ways:
grep -P -v '^\s*$' infile.txt > outfile.txt
or, without Perl regular expressions:
grep -v '^[[:space:]]*$' infile.txt > outfile.txt
sed -e '/^ *$/d' input > output
Deletes all lines which consist only of blanks (or is completely empty). You can change the blank to [ \t] where the \t is a representation for tab. Whether your shell or your sed will do the expansion varies, but you can probably type the tab character directly. And if you're using GNU or BSD sed, you can do the edit in-place, if that's what you want, with the -i option.
If I execute the above command still I have blank lines in my output file. What could be the reason?
There could be several reasons. It might be that you don't have blank lines but you have lots of spaces at the end of a line so it looks like you have blank lines when you cat the file to the screen. If that's the problem, then:
sed -e 's/ *$//' -e '/^ *$/d' input > output
The new regex removes repeated blanks at the end of the line; see previous discussion for blanks or tabs.
Another possibility is that your data file came from Windows and has CRLF line endings. Unix sees the carriage return at the end of the line; it isn't a blank, so the line is not removed. There are multiple ways to deal with that. A reliable one is tr to delete (-d) character code octal 15, aka control-M or \r or carriage return:
tr -d '\015' < input | sed -e 's/ *$//' -e '/^ *$/d' > output
If neither of those works, then you need to show a hex dump or octal dump (od -c) of the first two lines of the file, so we can see what we're up against:
head -n 2 input | od -c
Judging from the comments that sed -i does not work for you, you are not working on Linux or Mac OS X or BSD — which platform are you working on? (AIX, Solaris, HP-UX spring to mind as relatively plausible possibilities, but there are plenty of other less plausible ones too.)
You can try the POSIX named character classes such as sed -e '/^[[:space:]]*$/d'; it will probably work, but is not guaranteed. You can try it with:
echo "Hello World" | sed 's/[[:space:]][[:space:]]*/ /'
If it works, there'll be three spaces between the 'Hello' and the 'World'. If not, you'll probably get an error from sed. That might save you grief over getting tabs typed on the command line.
grep . file
grep looks at your file line-by-line; the dot . matches anything except a newline character. The output from grep is therefore all the lines that consist of something other than a single newline.
with awk
awk 'NF > 0' filename
To be thorough and remove lines even if they include spaces or tabs something like this in perl will do it:
cat file.txt | perl -lane "print if /\S/"
Of course there are the awk and sed equivalents. Best not to assume the lines are totally blank as ^$ would do.
Cheers
You can sed's -i option to edit in-place without using temporary file:
sed -i '/^$/d' file

Resources