Unix Script reading a specific line and position - unix

I need to have a script read the files coming in and check information for verification.
On the first line of the files to be read is a date but in numeric form. eg: 20080923
But before the date is other information, I need to read it from position 27. Meaning line 1 position 27, I need to get that number and see if it’s greater then another number.
I use the grep command to check other information but I use special characters to search, in this case the information before the date is always different, so I can’t use a character to search on. It has to be done by line 1 position 27.

sed 1q $file | cut -c27-34
The sed command reads the first line of the file and the cut command chops out characters 27-34 of the one line, which is where you said the date is.
Added later:
For the more general case - where you need to read line 24, for example, instead of the first line, you need a slightly more complex sed command:
sed -n -e 24p -e 24q | cut -c27-34
sed -n '24p;24q' | cut -c27-34
The -n option means 'do not print lines by default'; the 24p means print line 24; the 24q means quit after processing line 24. You could leave that out, in which case sed would continue processing the input, effectively ignoring it.
Finally, especially if you are going to validate the date, you might want to use Perl for the whole job (or Python, or Ruby, or Tcl, or any scripting language of your choice).

You can extract the characters starting at position 27 of line 1 like so:
datestring=`head -1 $file | cut -c27-`
You'd perform your next processing step on $datestring.

Related

Unix Text Processing - how to remove part of a file name from the results?

I'm searching through text files using grep and sed commands and I also want the file names displayed before my results. However, I'm trying to remove part of the file name when it is displayed.
The file names are formatted like this: aja_EPL_1999_03_01.txt
I want to have only the date without the beginning letters and without the .txt extension.
I've been searching for an answer and it seems like it's possible to do that with a sed or a grep command by using something like this to look forward and back and extract between _ and .txt:
(?<=_)\d+(?=\.)
But I must be doing something wrong, because it hasn't worked for me and I possibly have to add something as well, so that it doesn't extract only the first number, but the whole date. Thanks in advance.
Edit: Adding also the working command I've used just in case. I imagine whatever command is needed would have to go at the beginning?
sed '/^$/d' *.txt | grep -P '(^([A-ZÖÄÜÕŠŽ].*)?[Pp][Aa][Ll]{2}.*[^\.]$)' *.txt --colour -A 1
The results look like this:
aja_EPL_1999_03_02.txt:PALLILENNUD : korraga üritavad ümbermaailmalendu kaks meeskonda
A desired output would be this:
1999_03_02:PALLILENNUD : korraga üritavad ümbermaailmalendu kaks meeskonda
First off, you might want to think about your regular expression. While the one you have you say works, I wonder if it could be simplified. You told us:
(^([A-ZÖÄÜÕŠŽ].*)?[Pp][Aa][Ll]{2}.*[^\.]$)
It looks to me as if this is intended to match lines that start with a case insensitive "PALL", possibly preceded by any number of other characters that start with a capital letter, and that lines must not end in a backslash or a dot. So valid lines might be any of:
PALLILENNUD : korraga üritavad etc etc
Õlu on kena. Do I have appalling speling?
Peeter Pall is a limnologist at EMU!
If you'd care to narrow down this description a little and perhaps provide some examples of lines that should be matched or skipped, we may be able to do better. For instance, your outer parentheses are probably unnecessary.
Now, let's clarify what your pipe isn't doing.
sed '/^$/d' *.txt
This reads all your .txt files as an input stream, deletes any empty lines, and prints the output to stdout.
grep -P 'regex' *.txt --otheroptions
This reads all your .txt files, and prints any lines that match regex. It does not read stdin.
So .. in the command line you're using right now, your sed command is utterly ignored, as sed's output is not being read by grep. You COULD instruct grep to read from both files and stdin:
$ echo "hello" > x.txt
$ echo "world" | grep "o" x.txt -
x.txt:hello
(standard input):world
But that's not what you're doing.
By default, when grep reads from multiple files, it will precede each match with the name of the file from whence that match originated. That's also what you're seeing in my example above -- two inputs, one x.txt and the other - a.k.a. stdin, separated by a colon from the match they supplied.
While grep does include the most minuscule capability for filtering (with -o, or GNU grep's \K with optional Perl compatible RE), it does NOT provide you with any options for formatting the filename. Since you can'd do anything with the output of grep, you're limited to either parsing the output you've got, or using some other tool.
Parsing is easy, if your filenames are predictably structured as they seem to be from the two examples you've provided.
For this, we can ignore that these lines contain a file and data. For the purpose of the filter, they are a stream which follows a pattern. It looks like you want to strip off all characters from the beginning of each line up to and not including the first digit. You can do this by piping through sed:
sed 's/^[^0-9]*//'
Or you can achieve the same effect by using grep's minimal filtering to return every match starting from the first digit:
grep -o '[0-9].*'
If this kind of pipe-fitting is not to your liking, you may want to replace your entire grep with something in awk that combines functionality:
$ awk '
/[\.]$/ {next} # skip lines ending in backslash or dot
/^([A-ZÖÄÜÕŠŽ].*)?PALL/ { # lines to match
f=FILENAME
sub(/^[^0-9]*/,"",f) # strip unwanted part of filename, like sed
printf "%s:%s\n", f, $0
getline # simulate the "-A 1" from grep
printf "%s:%s\n", f, $0
}' *.txt
Note that I haven't tested this, because I don't have your data to work with.
Also, awk doesn't include any of the fancy terminal-dependent colourization that GNU grep provides through the --colour option.

GVim Command / Script to delete lines from a set of 4

This post could count as duplicate , but i have not found any relevant answer in previous threads. I have a large (6 GB) text file and i wish to remove every 3rd and 4th line in a set of 4 lines . For example , the following
line1
line2
line3
line4
line5
line6
line7
line8
needs to be converted to this
line1
line2
line5
line6
Is there any vim script / command to remove those lines ? It could be also in multiple passes . 1 pass to delete the 3rd lines (in a set of 4 (line1,line2,line3,line4)) and another pass to delete again the 3rd lines (previously 4th ones , in a set of 3 (line1,line2,line3)) .
The commands :g/^/+1 d3 is close to what i want but it also removes the second lines .
If you have GNU sed, you can filter the buffer through this pipeline:
sed -e '0~4d' | sed '0~3d'
The first sed deletes every 4th line, the second deletes every 3rd line.
This has the desired effect.
To pipe the current buffer through this command, enter this in command mode:
%!sed -e '0~4d' | sed '0~3d'
The % selects the range of lines to pass to a command (% means all lines, the entire buffer), and !cmd is the command to pipe through.
To perform this outside of vim, run these two steps:
sed -ie '0~4d' file
sed -ie '0~3d' file
This will modify the file, in two steps.
Alternatively you can also use Awk.
awk 'NR%4==3||NR%4==0{next;}1' file.txt > output.txt
To do this via Vim:
%!awk 'NR\%4==3||NR\%4==0{next;}1'
UPDATE: It is a bad approach for large files, it needs ~3 sec for a 6MB file to perform a substitution.
This approach works in vim. Using regular expression, you find 4 lines and substitute them with first two lines of these 4. Works for a long file as well. Doesn't work for last 1–3 lines if there is a remainder of division of total lines number by 4.
:%s#\(^.*\n^.*\)\n^.*\n^.*\n#\1\r#g
Explanation:
:%s — substitute in the whole file, # used as a delimiter
\(^.*\n^.*\) — \(\) select two lines that will be used later as \1; \n stands for linebreak; ^ for the beginning of the line; .* for any symbol repeated as much times as possible before the linebreak
\n — linebreak after the second line
^.*\n^.*\n — next two lines to be deleted
\1\r — substitute for lines with first two lines and add a linebreak \r
g — apply to the whole file

Extract Middle Substring from a given String in Unix

I have a string in different ranges :
WATSON_AJAY_AB04_DOTHING.data
WATSON_NAVNEET_CK4_DOTHING.data
WATSON_PRASHANTH_KJ56_DOTHING.data
WATSON_ABHINAV_KD323_DOTHING.data
On these above string how can I extract
AB04,CK4,KJ56,KD323
in Unix?
echo "$string" | cut -d'_' -f3
You could use sed or grep for this task. But since the string is so simple, I dont think you will need to.
One method is to use the bash 'cut' command. Below is an example directly on the BASH shell/command line:
jimm#pi$ string='WATSON_AJAY_AB04_DOTHING.data'
jimm#pi$ cut -d '_' -f 3 <<< "$string"
AB04 <-- outputs the result directly
(edit: of course Lucas' answer above is also a quick 'one-liner' that does the same thing as above - he beat me to it) :)
The cut will take an _ character as the delimiter (the -d '_' part), then display the 3rd slice of the string (the -f 3 part).
Or, if you want to output that 3rd slice from a list of content (using your list above), you can write a simple BASH script.
First, save the lines above ('WATSON...etc') into something like text.txt. Then open up your favorite text editor and type:
#!/bin/sh
cut -d '_' -f 3 < $1
Save that script to some useful name like slice.sh, and make sure it is executable with something like chmod 775 slice.sh.
Then at the command line you can execute the script against your text file, and immediately get an output of those parts of the file you want (in this case the third set of text, separated by the _ character):
$ ./slice.sh text.txt
AB04
CK4
KJ56
KD323
Hope that helps! Bear in mind that the commands above may vary a bit, depending on the flavor of *nix you are using, but it should at least point you in the right direction.

How to delete one specific line in a file and modify the next line in unix?

I have a text file and there was a mistake when it was created. To fix this I need to delete a line with a specific unique string and delete the characters in the following line that precede the # symbol. I was able to do this with sed and cut but it only output that one line, not the many other 1000s of lines in my file. Here is an example of the part of the file that needs fixing. I know the line #s (delete 45603341 and modify 45603342) where this mistake occurs.
#HWI-1KL135:70:C305EACXX:5:2105:6727:102841 1:N:0:CAGATC
CCAAGTGTCACCTCTTTTATTTATTGATTT#HWI-1KL135:70:C305EACXX:5:1101:1178:2203 1:N:0:CAGATC
I need the output to look like this and for it to leave the rest of the file intact.
#HWI-1KL135:70:C305EACXX:5:1101:1178:2203 1:N:0:CAGATC
Thanks!
How about:
sed -i -e '45603341d;45603342s/^.*\(#.*\)$/\1/' <filename>
where you replace <filename> with the name of your file.
If you want to change a particular line and delete the above line then run,
sed -ri '45603342s/^([^#]*)(#.*)$/\2/g; 45603341d' aa
Example:
$ cat aa
#HWI-1KL135:70:C305EACXX:5:2105:6727:102841 1:N:0:CAGATC
CCAAGTGTCACCTCTTTTATTTATTGATTT#HWI-1KL135:70:C305EACXX:5:1101:1178:2203 1:N:0:CAGATC
$ sed -r '2s/^([^#]*)(#.*)$/\2/g; 1d' aa
#HWI-1KL135:70:C305EACXX:5:1101:1178:2203 1:N:0:CAGATC
This might work for you (GNU sed):
sed '45603341!b;N;s/^.*\n[^#]*//' file
Leave as is any other line ecsept 45603341. On this line , append the following line and then remove everything from the start to the first non-# in the the appended line.
An alternative approach to 'sed' can be to use vim macros (This also works on Windows). The main disadvantage is that you will not be able to integrate inside scripts like 'sed' does. The main advantage is that it allows for complex replacements like "search for this pattern, then clear the line, go down 3 lines, move to column 40, switch lines,...). If you are already familiar with VIM it's also much more intuitive.
In this particular case you will have to do something like
qq (start macro recording)
/^#HWI.*CAGATC$ (search pattern)
dd (delete line)
vw (select word)
d (delete selected word)
q (end macro)
To run the macro 100 times:
100#q

sed '$!N;$!D' explanation

I know that
cat foo | sed '$!N;$!D'
will print out the last two lines of the file foo, but I don't understand why.
I have read the man page and know that N joins the next line to the currently processed line etc - but could someone explain in 'good english' that matches the order of operation what is happening here, step by step?
thanks!
Here is what that script looks like when run through the sedsed debugger (by Aurelio Jargas):
$ echo -e 'a\nb\nc\nd' | sed '$!N;$!D' PATT:^a$
PATT:^a$
COMM:$ !N
PATT:^a\Nb$
COMM:$ !D
PATT:^b$
COMM:$ !N
PATT:^b\Nc$
COMM:$ !D
PATT:^c$
COMM:$ !N
PATT:^c\Nd$
COMM:$ !D
PATT:^c\Nd$
c
d
I've filtered out the lines that would show hold space ("HOLD") since it's not being used. "PATT" shows what's in pattern space and "COMM" shoes the command about to be executed. "\N" indicates an embedded newline. Of course, "^" and "$" indicate the beginning and end of the string.
!N appends the next line and !D deletes the previous line and loops to the beginning of the script without doing an implicit print. When the last line is read, the $! tests fail so there's nothing left to do and the script exits and performs an implicit print of what remains in the pattern space (since -n was not specified in the arguments).
Disclaimer: I am a contributor to the sedsed project and have made a few minor improvements including expanded color support, adding the ^ line-beginning indicator and preliminary support for Python 3. The bleeding edge version (which hasn't been touched lately) is here.
$!N;$!D is a sed program consisting of two statements, $!N and $!D.
$!N matches everything but the last line of the last file of input ($ negated by !) and runs the N command on it, which as you said yourself appends the next line of input to the line currently under scrutiny (the "pattern space"). In other words, sed now has two lines in the pattern space and has advanced to the next line.
$!D also matches everything but the last line, and wipes the pattern space up to the first newline. D also prevents sed from wiping the entire pattern space when reading the next line.
So, the algorithm being executed is roughly:
For every line up to but not including the last {
Read the next line and append it to the pattern space
If still not at the last line
Delete the first line in the pattern space
}
Print the pattern space

Resources