How to remove all lines starting with Timestamp in unix

I have a text file which is basically a log file. Its entries start with a timestamp and a logId, in this format:
timestamp=2014-08-18 23:59:48.315|logId=22fef71f-979a-46aa-81b5-432d34130c34| ( followed by some text )
timestamp=2014-08-18 22:59:48.315|logId=22fef71f-979b-46aa-81b5-432d34130htf| ( followed by some text )
I need to get rid of the timestamp and keep the rest of each line.
How can I use the sed command in this case?

Use cut:
cut -f 2- -d \| file
-f 2- matches everything from 2nd field to the end of the line.
-d \| sets | as field separator.
Using sed:
sed 's#^[^|]*|##' file
[^|] matches anything that's not |
Output:
logId=22fef71f-979a-46aa-81b5-432d34130c34| ( followed by some text )
logId=22fef71f-979b-46aa-81b5-432d34130htf| ( followed by some text )

When you've got fields delimited by a single character ('|' in this case), cut is generally the way to go, as in konsolebox's answer. If the delimiter is not necessarily a single character (for example, any amount of white space), then awk is probably the answer.
However, since you asked specifically about sed, this will work:
sed 's/^[^|]*|//'
It substitutes (s) text starting at the beginning of the line (^) and consisting of any number of non-pipes ([^|]*) followed by a single pipe (|), replacing it with nothing (the nothing between the //).
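For the whitespace-delimited case mentioned above, a minimal awk sketch could look like this. The function name drop_first_field is made up for illustration, and it assumes fields are separated by runs of spaces or tabs rather than by a pipe.

```shell
# Hypothetical helper: drop the first whitespace-delimited field of each line.
drop_first_field() {
  # Remove optional leading blanks, the first field, and the blanks after it.
  awk '{ sub(/^[ \t]*[^ \t]+[ \t]+/, ""); print }' "$@"
}

printf 'timestamp=2014-08-18   logId=22fef71f  some text\n' | drop_first_field
```

Unlike assigning an empty string to $1, this leaves the spacing in the rest of the line alone. A line that has only one field is printed unchanged.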

Related

Change multiple filenames unix

I had to download 15GB of data and for some reason during the downloading process the filenames were messed up in a way so that instead of
test_file.txt
the filenames are doubled, so it's
test_file.txttest_file.txt
instead. My only idea was whether there is any way to count the letters and then rename each file with deleting the first/ or second half of the filename? The filenames are not consistent, so for example in the same folder there might also be files named
files_are_great.txtfiles_are_great.txt
so I'm struggling to find a way to loop over them.
Thanks a lot!
The command sed 's/^\(.*\)\1$/\1/' replaces a line that consists of a duplicated string with a single copy of that string, without relying on a fixed part of the file name such as .txt. It also works when the string contains spaces.
Example:
echo 'abc defabc def' | sed 's/^\(.*\)\1$/\1/'
prints
abc def
Explanation of the sed command:
^ anchors the pattern to the beginning of the line
.* is 0 or more occurrences of any character
\(...\) captures what matches the pattern in between
\1 is a reference to the first capture group, i.e. the text that was found before
$ anchors the search pattern to the end of the line
This results in a search pattern that matches a whole line that consists of any text followed by the same text.
\1 in the replacement is the same reference to the matched text, i.e. a single occurrence of the duplicated text.
Any input that does not match the pattern will remain unchanged.
Assuming you want to rename all files in the current directory, you can use it like this:
for file in *
do
new=$(printf '%s\n' "$file" | sed 's/^\(.*\)\1$/\1/')
[ "$file" = "$new" ] || mv "$file" "$new"
done
As the sed command does not change non-matching input, $new will be the same as $file for file names that don't consist of a duplicated string. This would result in an error message from mv. That's why the renaming will be skipped in this case.
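If you are nervous about renaming 15GB of downloads in one go, a more cautious variant of the loop might look like the sketch below. The function name dedupe_names and the DRY_RUN variable are my own inventions, and it assumes file names contain no newlines.

```shell
# Hypothetical cautious variant of the rename loop.
# Set DRY_RUN=1 to only print what would be renamed.
dedupe_names() {
  for file in *; do
    [ -e "$file" ] || continue        # skip the literal "*" in an empty directory
    new=$(printf '%s\n' "$file" | sed 's/^\(.*\)\1$/\1/')
    [ "$file" = "$new" ] && continue  # name is not a doubled string
    if [ -e "$new" ]; then
      echo "skip: $new already exists" >&2
    elif [ "${DRY_RUN:-0}" = 1 ]; then
      echo "would rename: $file -> $new"
    else
      mv -- "$file" "$new"
    fi
  done
}
```

Run it as DRY_RUN=1 dedupe_names first to review the list, then as dedupe_names inside the download directory. The existence check keeps it from overwriting a file that already has the short name.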
Using sed
sed 's#\(\.txt\)#& #g'
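Explanation: & in the replacement stands for the whole matched text (here .txt), so the command appends a space after every .txt. The \( \) grouping is not actually needed for this; a group would be referenced with \1, not &.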
Demo:
echo "files_are_great.txtfiles_are_great.txt" | sed 's#\(\.txt\)#& #g'
files_are_great.txt files_are_great.txt
For renaming:
for file_name in *txt*txt
do
new_file_name=$(printf '%s\n' "$file_name" | sed 's#\(\.txt\)#& #g' | cut -d' ' -f1)
mv -- "$file_name" "$new_file_name"
done
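The idea from the question, counting the letters and keeping only one half, can also be sketched without a regular expression. The helper name dedupe_by_length is hypothetical, and because cut -c may count bytes rather than characters, this assumes plain ASCII file names.

```shell
# Hypothetical helper: if a name is the same string twice, print one copy;
# otherwise print the name unchanged. Assumes ASCII names.
dedupe_by_length() {
  name=$1
  len=${#name}
  # Empty or odd-length names cannot be a doubled string.
  if [ "$len" -eq 0 ] || [ $((len % 2)) -ne 0 ]; then
    printf '%s\n' "$name"
    return
  fi
  half=$((len / 2))
  first=$(printf '%s\n' "$name" | cut -c "1-$half")
  second=$(printf '%s\n' "$name" | cut -c "$((half + 1))-")
  if [ "$first" = "$second" ]; then
    printf '%s\n' "$first"
  else
    printf '%s\n' "$name"
  fi
}

dedupe_by_length 'files_are_great.txtfiles_are_great.txt'
```

This only compares the two halves, so it does not depend on the .txt extension and copes with spaces in the name.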

Unix Text Processing - how to remove part of a file name from the results?

I'm searching through text files using grep and sed commands and I also want the file names displayed before my results. However, I'm trying to remove part of the file name when it is displayed.
The file names are formatted like this: aja_EPL_1999_03_01.txt
I want to have only the date without the beginning letters and without the .txt extension.
I've been searching for an answer and it seems like it's possible to do that with a sed or a grep command by using something like this to look forward and back and extract between _ and .txt:
(?<=_)\d+(?=\.)
But I must be doing something wrong, because it hasn't worked for me and I possibly have to add something as well, so that it doesn't extract only the first number, but the whole date. Thanks in advance.
Edit: Adding also the working command I've used just in case. I imagine whatever command is needed would have to go at the beginning?
sed '/^$/d' *.txt | grep -P '(^([A-ZÖÄÜÕŠŽ].*)?[Pp][Aa][Ll]{2}.*[^\.]$)' *.txt --colour -A 1
The results look like this:
aja_EPL_1999_03_02.txt:PALLILENNUD : korraga üritavad ümbermaailmalendu kaks meeskonda
A desired output would be this:
1999_03_02:PALLILENNUD : korraga üritavad ümbermaailmalendu kaks meeskonda
First off, you might want to think about your regular expression. While the one you have you say works, I wonder if it could be simplified. You told us:
(^([A-ZÖÄÜÕŠŽ].*)?[Pp][Aa][Ll]{2}.*[^\.]$)
It looks to me as if this is intended to match lines that contain a case insensitive "PALL", either at the start of the line or preceded by text that begins with a capital letter, and that must not end in a dot (with -P, the \. inside the brackets is simply an escaped dot). So valid lines might be any of:
PALLILENNUD : korraga üritavad etc etc
Õlu on kena. Do I have appalling speling?
Peeter Pall is a limnologist at EMU!
If you'd care to narrow down this description a little and perhaps provide some examples of lines that should be matched or skipped, we may be able to do better. For instance, your outer parentheses are probably unnecessary.
Now, let's clarify what your pipe isn't doing.
sed '/^$/d' *.txt
This reads all your .txt files as an input stream, deletes any empty lines, and prints the output to stdout.
grep -P 'regex' *.txt --otheroptions
This reads all your .txt files, and prints any lines that match regex. It does not read stdin.
So .. in the command line you're using right now, your sed command is utterly ignored, as sed's output is not being read by grep. You COULD instruct grep to read from both files and stdin:
$ echo "hello" > x.txt
$ echo "world" | grep "o" x.txt -
x.txt:hello
(standard input):world
But that's not what you're doing.
By default, when grep reads from multiple files, it will precede each match with the name of the file from whence that match originated. That's also what you're seeing in my example above -- two inputs, one x.txt and the other - a.k.a. stdin, separated by a colon from the match they supplied.
While grep does include the most minuscule capability for filtering (with -o, or GNU grep's \K with optional Perl compatible RE), it does NOT provide you with any options for formatting the filename. Since you can't change how grep prints the filename, you're limited to either parsing the output you've got, or using some other tool.
Parsing is easy, if your filenames are predictably structured as they seem to be from the two examples you've provided.
For this, we can ignore that these lines contain a file and data. For the purpose of the filter, they are a stream which follows a pattern. It looks like you want to strip off all characters from the beginning of each line up to and not including the first digit. You can do this by piping through sed:
sed 's/^[^0-9]*//'
Or you can achieve the same effect by using grep's minimal filtering to return every match starting from the first digit:
grep -o '[0-9].*'
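Both filters above keep the .txt part, while the desired output in the question drops it. A sketch that handles the prefix and the extension together might look like this. The function name date_prefix_only is made up, and it assumes every file name ends in digits and underscores followed by .txt.

```shell
# Hypothetical filter: turn "aja_EPL_1999_03_02.txt:text" into "1999_03_02:text".
date_prefix_only() {
  # Drop leading non-digits, keep the run of digits/underscores, drop ".txt".
  sed 's/^[^0-9]*\([0-9_]*\)\.txt:/\1:/'
}

printf '%s\n' 'aja_EPL_1999_03_02.txt:PALLILENNUD : example text' | date_prefix_only
```

You would append it to the grep command with a pipe. Note that the context lines GNU grep prints for -A 1 use - instead of : after the file name, so they pass through this filter unchanged unless the pattern is widened.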
If this kind of pipe-fitting is not to your liking, you may want to replace your entire grep with something in awk that combines functionality:
$ awk '
/\.$/ {next} # skip lines ending in a dot
/^([A-ZÖÄÜÕŠŽ].*)?[Pp][Aa][Ll][Ll]/ { # lines to match, "pall" in any case
f=FILENAME
sub(/^[^0-9]*/,"",f) # strip leading non-digits from the filename, like sed
sub(/\.txt$/,"",f) # strip the .txt extension
printf "%s:%s\n", f, $0
if ((getline) > 0) # simulate the "-A 1" from grep
printf "%s:%s\n", f, $0
}' *.txt
Note that I haven't tested this, because I don't have your data to work with.
Also, awk doesn't include any of the fancy terminal-dependent colourization that GNU grep provides through the --colour option.

Insert a new line at nth character after nth occurence of a pattern via a shell script

I have one big single-line string that uses '~|~' as the delimiter. Ten fields make up a row, and the 10th field is 9 characters long. I want to insert a new line after each row, that is, insert a \n after the 9th character following the 9th, 18th, 27th, ... occurrence of '~|~'.
Is there any quick single line sed/awk option available without looping through the string?
I have used
sed -e's/\(\([^~|~]*~|~\)\{9\}[^~|~]*\)~|~/\1\n/g'
but it will replace every 10th occurrence with a new line. I want to keep the delimiter but add a new line after 9 characters in field 10
cat test.txt
one~|~two~|~three~|~four~|~five~|~six~|~seven~|~eight~|~nine~|~ten1234562one~|~2two~|~2three~|~2four~|~2five~|~2six~|~2seven~|~2eight~|~2nine~|~2ten1234563one~|~3two~|~3three~|~3four~|~3five~|~3six~|~3seven~|~3eight~|~3nine~|~3ten123456
sed -e's/\(\([^~|~]*~|~\)\{9\}[^~|~]*\)~|~/\1\n/g' test.txt
one~|~two~|~three~|~four~|~five~|~six~|~seven~|~eight~|~nine~|~ten1234562one
2two~|~2three~|~2four~|~2five~|~2six~|~2seven~|~2eight~|~2nine~|~2ten1234563one~|~3two
3three~|~3four~|~3five~|~3six~|~3seven~|~3eight~|~3nine~|~3ten123456
Below is what I want
one~|~two~|~three~|~four~|~five~|~six~|~seven~|~eight~|~nine~|~ten123456
2one~|~2two~|~2three~|~2four~|~2five~|~2six~|~2seven~|~2eight~|~2nine~|~2ten123456
63one~|~3two~|~3three~|~3four~|~3five~|~3six~|~3seven~|~3eight~|~3nine~|~3ten123456
Let's try awk:
awk 'BEGIN{FS="[~|~]+"; OFS="~|~"}
{for(i=10; i<NF; i+=9){
str=$i
$i=substr(str, 1, 9)"\n"substr(str, 10, length(str))
}
print $0}' t.txt
Input:
one~|~two~|~three~|~four~|~five~|~six~|~seven~|~eight~|~nine~|~ten1234562one~|~2two~|~2three~|~2four~|~2five~|~2six~|~2seven~|~2eight~|~2nine~|~2ten1234563one~|~3two~|~3three~|~3four~|~3five~|~3six~|~3seven~|~3eight~|~3nine~|~3ten123456
The output:
one~|~two~|~three~|~four~|~five~|~six~|~seven~|~eight~|~nine~|~ten123456
2one~|~2two~|~2three~|~2four~|~2five~|~2six~|~2seven~|~2eight~|~2nine~|~2ten12345
63one~|~3two~|~3three~|~3four~|~3five~|~3six~|~3seven~|~3eight~|~3nine~|~3ten123456
I assume there is some error in your comment: if your input contains ten1234562one and 2ten1234563one, then the line break has to be inserted after 2 in the first case and after 6 in the second case (as this is the tenth character). But your expected output differs from this.
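One thing to keep in mind with the awk approach: FS="[~|~]+" is a bracket expression, so any run of ~ or | characters splits a field, not only the exact sequence ~|~. If fields may contain a lone ~ or |, a variation with an exact separator could look like the sketch below. The function name split_rows_awk is made up, and it assumes the first 9 characters of every 10th field end a row, as in the question.

```shell
# Hypothetical variant of the awk approach with an exact "~|~" separator.
split_rows_awk() {
  awk 'BEGIN { FS = "~[|]~"; OFS = "~|~" }
  {
    # Fields 10, 19, 28, ... (except the last field) hold the end of one
    # row fused to the start of the next; break them after 9 characters.
    for (i = 10; i < NF; i += 9)
      $i = substr($i, 1, 9) "\n" substr($i, 10)
    print
  }' "$@"
}
```

Called as split_rows_awk t.txt, it should produce the same three rows for the sample input, while leaving a field such as x|y intact.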
Your sed script wasn't too far off. This seems to do the job you want:
sed -e '/^$/d' \
-e 's/\([^~|]*~|~\)\{9\}.\{9\}/&\' \
-e '/' \
-e 'P;D' \
data
For your input file (I called it data), I get:
one~|~two~|~three~|~four~|~five~|~six~|~seven~|~eight~|~nine~|~ten123456
2one~|~2two~|~2three~|~2four~|~2five~|~2six~|~2seven~|~2eight~|~2nine~|~2ten12345
63one~|~3two~|~3three~|~3four~|~3five~|~3six~|~3seven~|~3eight~|~3nine~|~3ten12345
6
The script requires a little explanation, I fear. It uses some obscure shell and some obscure sed behaviour. The obscure shell behaviour is that within a single-quoted string, backslashes have no special meaning, so the backslash before the second single quote in the second -e appears to sed as a backslash at the end of the argument. The obscure sed behaviour is that it treats the argument for each -e option as if it is a line. So, the trailing backslash plus the / after the third -e is treated as if there was a backslash, newline, slash sequence, which is how BSD sed (and POSIX sed) requires you to add a newline. GNU sed treats \n in the replacement as a newline, but POSIX (and BSD) says:
The escape sequence '\n' shall match a <newline> embedded in the pattern space.
It doesn't say anything about \n being treated as a <newline> in the replacement part of a s/// substitution. So, the first two -e options combine to add a newline after what is matched. What's matched? Well, that's a sequence of 'zero or more non-tilde, non-pipe characters followed by ~|~', repeated 9 times, followed by 9 'any characters'. This is an approximation to what you want. If you had a field such as ~|~tilde~pipe|bother~|~, the regex would fail because of the ~ between 'tilde' and 'pipe' and also because of the | between 'pipe' and 'bother'. Fixing it to handle all possible sequences like that is non-trivial, and not warranted by the sample data.
The remainder of the script is straight-forward: the -e '/^$/d' deletes an empty line, which matters if the data is exactly the right length, and in -e 'P;D' the P prints the initial segment of the pattern space up to the first newline (the one we just added); the D deletes the initial segment of the pattern space up to the first newline and starts over.
I'm not convinced this is worth the complexity. It might be simpler to understand if the script was in a file, script.sed:
/^$/d
s/\([^~|]*~|~\)\{9\}.\{9\}/&\
/
P
D
and the command line was:
$ sed -f script.sed data
one~|~two~|~three~|~four~|~five~|~six~|~seven~|~eight~|~nine~|~ten123456
2one~|~2two~|~2three~|~2four~|~2five~|~2six~|~2seven~|~2eight~|~2nine~|~2ten12345
63one~|~3two~|~3three~|~3four~|~3five~|~3six~|~3seven~|~3eight~|~3nine~|~3ten12345
6
$
Needless to say, it produces the same output. Without the /^$/d, the script only works because of the odd 6 at the end of the input. With exactly 9 characters after the third record, it then flops into an infinite loop.
Using extended regular expressions
If you use extended regular expressions, you can deal with odd-ball fields that contain ~ or | (or, indeed, ~|) in the middle.
script2.sed:
/^$/d
s/(([^~|]{1,}|~[^|]|~\|[^~])*~\|~){9}.{9}/&\
/
P
D
data2:
one~|~two~|~three~|~four~|~five~|~six~|~seven~|~eight~|~nine~|~ten1234562one~|~2two~|~2three~|~2four~|~2five~|~2six~|~2seven~|~2eight~|~2nine~|~2ten1234563one~|~3two~|~3three~|~3four~|~3five~|~3six~|~3seven~|~3eight~|~3nine~|~3ten12345666=beast~tilde|pipe~|twiddle~|~4-two~|~4-three~|~4-four~|~4-five~|~4-six~|~4-seven~|~4-eighty-eight~|~4-999~|~987654321
Output from sed -E -f script2.sed data2:
one~|~two~|~three~|~four~|~five~|~six~|~seven~|~eight~|~nine~|~ten123456
2one~|~2two~|~2three~|~2four~|~2five~|~2six~|~2seven~|~2eight~|~2nine~|~2ten12345
63one~|~3two~|~3three~|~3four~|~3five~|~3six~|~3seven~|~3eight~|~3nine~|~3ten12345
666=beast~tilde|pipe~|twiddle~|~4-two~|~4-three~|~4-four~|~4-five~|~4-six~|~4-seven~|~4-eighty-eight~|~4-999~|~987654321
That still won't handle a field like tilde~~|~. Using -E is correct for BSD (Mac OS X) sed; it enables extended regular expressions. The equivalent option for GNU sed is -r.

Extract Middle Substring from a given String in Unix

I have strings of varying lengths, such as:
WATSON_AJAY_AB04_DOTHING.data
WATSON_NAVNEET_CK4_DOTHING.data
WATSON_PRASHANTH_KJ56_DOTHING.data
WATSON_ABHINAV_KD323_DOTHING.data
From the strings above, how can I extract
AB04,CK4,KJ56,KD323
in Unix?
echo "$string" | cut -d'_' -f3
You could use sed or grep for this task. But since the string is so simple, I don't think you will need to.
One method is to use the standard cut utility. Below is an example run directly on the Bash command line:
jimm#pi$ string='WATSON_AJAY_AB04_DOTHING.data'
jimm#pi$ cut -d '_' -f 3 <<< "$string"
AB04 <-- outputs the result directly
(edit: of course Lucas' answer above is also a quick 'one-liner' that does the same thing as above - he beat me to it) :)
The cut will take an _ character as the delimiter (the -d '_' part), then display the 3rd slice of the string (the -f 3 part).
Or, if you want to output that 3rd slice from a list of content (using your list above), you can write a simple shell script.
First, save the lines above ('WATSON...etc') into something like text.txt. Then open up your favorite text editor and type:
#!/bin/sh
cut -d '_' -f 3 < "$1"
Save that script to some useful name like slice.sh, and make sure it is executable with something like chmod 775 slice.sh.
Then at the command line you can execute the script against your text file, and immediately get an output of those parts of the file you want (in this case the third set of text, separated by the _ character):
$ ./slice.sh text.txt
AB04
CK4
KJ56
KD323
Hope that helps! Bear in mind that the commands above may vary a bit, depending on the flavor of *nix you are using, but it should at least point you in the right direction.
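If the value is already in a shell variable, the same slice can be taken without starting cut at all, using POSIX parameter expansion. This is only a sketch: the function name third_part is made up, and it assumes the name has at least two underscores before the part you want.

```shell
# Hypothetical helper: print the third "_"-separated part of a name.
third_part() {
  rest=${1#*_*_}               # drop the first two parts and their underscores
  printf '%s\n' "${rest%%_*}"  # drop everything from the next underscore on
}

third_part 'WATSON_AJAY_AB04_DOTHING.data'
```

For a handful of strings either way is fine; for a whole file of them, cut reading the file directly is simpler.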

Regex to remove junk from a .txt file in Unix

I am new to Unix.
I am using a sed command to remove junk from a .txt file in Unix.
This is the command that I used:
sed -e 's/[^ -~]//g' final.txt > file1_now
The junk is removed, but if my data contains a '-', that is removed as well. I don't want that.
Appreciate your help.
Thanks,
Binayak
Try doing this :
sed -e 's/[^ ~-]//g' final.txt > file1_now
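The - character must be the last (or the first) in your character class, because it has a different meaning anywhere else: there it denotes a range, as in [a-z]. Be aware, though, that [^ ~-] then lists just three literal characters (space, ~ and -), so this command deletes everything else, letters and digits included. The original range [ -~] (space through ~) already contains - in ASCII order; if - was being removed, the locale's collation order is a likely cause, and running sed with LC_ALL=C should make the range follow byte order.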
The - character is treated as a literal character if it is the last or the first (after the ^) character within the brackets: [abc-], [-abc].
http://en.wikipedia.org/wiki/Regular_expression
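To check how the range form treats a hyphen, here is a small sketch. It assumes the goal is to keep printable ASCII (space through ~, which includes -) and drop everything else. Forcing the C locale is my own assumption so that the range follows byte order, and strip_nonprintable is a made-up name.

```shell
# Hypothetical helper: delete every byte outside printable ASCII (space through ~).
strip_nonprintable() {
  LC_ALL=C sed 's/[^ -~]//g' "$@"
}

# The control character \001 is removed; the hyphen and the tilde are kept.
printf 'a-b\001c~ d\n' | strip_nonprintable
```

With the file from the question it would be called as strip_nonprintable final.txt > file1_now.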
