Double spacing a file using awk - unix

I was reading about awk and found the two commands below for double-spacing a file.
Can someone please explain how these commands actually work?
awk '1;{print ""}' filename
awk 'BEGIN{ORS="\n\n"};1' filename
Thanks

Your first example uses two common awk shortcuts: 1 is just a pattern that is always true, so the default action, which is "print the line", is executed for every line. Then there is a rule with an empty pattern (also always true, but you can't omit both the pattern and the action) whose action just prints an empty line.
Your second example changes the Output Record Separator (ORS), which is normally a single newline, to two newlines, so that simply copying every line is enough. (BEGIN rules are executed before the input file is read.)
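To see that both produce identical output, here is a quick demonstration (with a hypothetical sample.txt):
$ printf 'one\ntwo\n' > sample.txt
$ awk '1;{print ""}' sample.txt
one

two

$ awk 'BEGIN{ORS="\n\n"};1' sample.txt
one

two
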

Related

Unix Text Processing - how to remove part of a file name from the results?

I'm searching through text files using grep and sed commands and I also want the file names displayed before my results. However, I'm trying to remove part of the file name when it is displayed.
The file names are formatted like this: aja_EPL_1999_03_01.txt
I want to have only the date without the beginning letters and without the .txt extension.
I've been searching for an answer, and it seems it's possible to do that with a sed or grep command, using a lookbehind and lookahead like this to extract between _ and .txt:
(?<=_)\d+(?=\.)
But I must be doing something wrong, because it hasn't worked for me and I possibly have to add something as well, so that it doesn't extract only the first number, but the whole date. Thanks in advance.
Edit: Adding also the working command I've used just in case. I imagine whatever command is needed would have to go at the beginning?
sed '/^$/d' *.txt | grep -P '(^([A-ZÖÄÜÕŠŽ].*)?[Pp][Aa][Ll]{2}.*[^\.]$)' *.txt --colour -A 1
The results look like this:
aja_EPL_1999_03_02.txt:PALLILENNUD : korraga üritavad ümbermaailmalendu kaks meeskonda
A desired output would be this:
1999_03_02:PALLILENNUD : korraga üritavad ümbermaailmalendu kaks meeskonda
First off, you might want to think about your regular expression. While you say the one you have works, I wonder if it could be simplified. You told us:
(^([A-ZÖÄÜÕŠŽ].*)?[Pp][Aa][Ll]{2}.*[^\.]$)
It looks to me as if this is intended to match lines that start with a case-insensitive "pall", possibly preceded by any number of other characters starting with a capital letter, where the line must not end in a backslash or a dot. So valid lines might be any of:
PALLILENNUD : korraga üritavad etc etc
Õlu on kena. Do I have appalling speling?
Peeter Pall is a limnologist at EMU!
If you'd care to narrow down this description a little and perhaps provide some examples of lines that should be matched or skipped, we may be able to do better. For instance, your outer parentheses are probably unnecessary.
Now, let's clarify what your pipe isn't doing.
sed '/^$/d' *.txt
This reads all your .txt files as an input stream, deletes any empty lines, and prints the output to stdout.
grep -P 'regex' *.txt --otheroptions
This reads all your .txt files, and prints any lines that match regex. It does not read stdin.
So, in the command line you're using right now, your sed command is utterly ignored, as sed's output is not being read by grep. You COULD instruct grep to read from both files and stdin:
$ echo "hello" > x.txt
$ echo "world" | grep "o" x.txt -
x.txt:hello
(standard input):world
But that's not what you're doing.
By default, when grep reads from multiple files, it will precede each match with the name of the file from whence that match originated. That's also what you're seeing in my example above -- two inputs, one x.txt and the other - a.k.a. stdin, separated by a colon from the match they supplied.
While grep does include the most minuscule capability for filtering (with -o, or GNU grep's \K in Perl-compatible REs), it does NOT provide you with any options for formatting the filename. Since you can't do anything with the output of grep, you're limited to either parsing the output you've got or using some other tool.
Parsing is easy, if your filenames are predictably structured as they seem to be from the two examples you've provided.
For this, we can ignore the fact that these lines contain a filename and data. For the purposes of the filter, they are a stream that follows a pattern. It looks like you want to strip off all characters from the beginning of each line up to but not including the first digit. You can do this by piping through sed:
sed 's/^[^0-9]*//'
Or you can achieve the same effect by using grep's minimal filtering to return every match starting from the first digit:
grep -o '[0-9].*'
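Put together, a sketch of the full pipeline (untested; --colour and -A 1 dropped for clarity) might be:
grep -P '(^([A-ZÖÄÜÕŠŽ].*)?[Pp][Aa][Ll]{2}.*[^\.]$)' *.txt | sed 's/^[^0-9]*//; s/\.txt//'
Note the second substitution, s/\.txt//, which drops the .txt left behind by the prefix strip, turning aja_EPL_1999_03_02.txt:PALLILENNUD ... into the desired 1999_03_02:PALLILENNUD ...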
If this kind of pipe-fitting is not to your liking, you may want to replace your entire grep with something in awk that combines functionality:
$ awk '
    /[\.]$/ { next }                  # skip lines ending in backslash or dot
    /^([A-ZÖÄÜÕŠŽ].*)?PALL/ {         # lines to match
        f = FILENAME
        sub(/^[^0-9]*/, "", f)        # strip unwanted part of filename, like sed
        printf "%s:%s\n", f, $0
        getline                       # simulate the "-A 1" from grep
        printf "%s:%s\n", f, $0
    }' *.txt
Note that I haven't tested this, because I don't have your data to work with.
Also, awk doesn't include any of the fancy terminal-dependent colourization that GNU grep provides through the --colour option.

Zsh: How to force file completion everywhere following a set of characters?

I'm trying to figure out how to get file completion to work at any word position on the command line after a set of characters. As listed in a shell these characters would be [ =+-\'\"()] (the whitespace is tab and space). Zsh will do this, but only after the backtick character, '`', or $(. mksh does this except not after the characters [+-].
By word position on the command line, I'm talking about each set of characters you type out which are delimited by space and a few other characters. For example,
print Hello World,
has three words at positions 1-3. At position 1, when you're first typing stuff in, completion is pretty much perfect. File completion works after all of the characters I mentioned. After the first word, the completion system gets more limited since it's smart. This is useful for commands, but limiting where you can do file completion isn't particularly helpful.
Here are some examples of where file completion doesn't work for me but should in my opinion:
: ${a:=/...}
echo "${a:-/...}"
make LDFLAGS+='-nostdlib /.../crt1.o /.../crti.o ...'
env a=/... b=/... ...
I've looked at rebinding '^I' (tab) with the handful of different completion widgets Zsh comes with and changing my zstyle ':completion:*' lines. Nothing has worked so far to change this default Zsh behaviour. I'm thinking I need to create a completion function that I can add to the end of my zstyle ':completion:*' completer ... line as a last resort completion.
In the completion function, one route would be to cut out the current word I want to complete, complete it, and then re-insert the completion back into the line if that's possible. It could also be more like _precommand which shifts the second word to the first word so that normal command completion works.
I was able to modify _precommand so that you can complete commands at any word position. This is the new file, I named it _commando and added its directory to my fpath:
#compdef -
# precommands is made local in _main_complete
precommands+=($words[1,$(( CURRENT -1 ))])
shift words
CURRENT=1
_normal
To use it I added it to the end of my ':completion:*' completer ... line in my zshrc so it works with every program in $path. Basically whatever word you're typing in is considered the first word, so command completion works at every word position on the command line.
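For example (a sketch; keeping _complete first so smarter completion still gets the first try), the line might look like:
zstyle ':completion:*' completer _complete _commando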
I'm trying to figure out a way to do the same thing for file completion, but it looks a little more complicated at first glance. I'm not really sure where to go with this, so I'm looking to get some help.
I took a closer look at some of Zsh's builtin functions and noticed a few that have special completion behaviour. They belong to the typeset group, which has a function _typeset in the default fpath. I only needed to extract a few lines for what I wanted to do. These are the lines I extracted:
...
elif [[ "$PREFIX" = *\=* ]]; then
    compstate[parameter]="${PREFIX%%\=*}"
    compset -P 1 '*='
    _value
...
These few lines allow typeset completion after each slash in a command like this:
typeset file1=/... file2=~/... file3=/...
I extrapolated from this to create the following function. You can modify it to put in your fpath. I just defined it in my zshrc like this:
_reallyforcefilecompletion() {
    local prefix_char
    for prefix_char in ' ' $'\t' '=' '+' '-' "'" '"' ')' ':'; do
        if [[ "$PREFIX" = *${prefix_char}* ]]; then
            if [[ "$PREFIX" = *[\'\"]* ]]; then
                compset -q -P "*${prefix_char}"
            else
                compset -P "*${prefix_char}"
            fi
            _value
            break
        fi
    done
}
You can use this by adding it to a zstyle line like this:
zstyle ':completion:*' completer _complete _reallyforcefilecompletion
This way, it's only used as a last resort so that smarter completions can try before it. Here's a little explanation of the function starting with the few variables and the command involved:
prefix_char: This gets set to each prefix character we want to complete after. For example, env a=123 has the prefix character =.
PREFIX: Initially this will be set to the part of the current word from the beginning of the word up to the position of the cursor; it may be altered to give a common prefix for all matches.
IPREFIX (not shown in code): compset moves string matches from PREFIX to IPREFIX so that the rest of PREFIX can be completed.
compset: This command simplifies modification of the special parameters, while its return status allows tests on them to be carried out.
_value: Not really sure about this one. The documentation states it plays some sort of role in completion.
Documentation for the completion system
The function: On the second line, we declare prefix_char local to avoid variable pollution. On the third line, we start a for loop over each prefix_char we want to complete after. In the outer if block, we check whether the variable PREFIX contains one of the prefix_chars we want to complete after; the inner if checks whether PREFIX contains any quotes. If PREFIX contains quotes, we use compset -q, which basically allows the quotes to be ignored so we can complete inside them. compset -P strips the matched part of PREFIX and moves it to IPREFIX, basically so it gets ignored and completion can work.
The else branch handles a PREFIX that contains a prefix_char but no quotes, so we only use compset -P. The break exits the loop. A more correct way to write this function would be as a case statement, but since we're not using the compset return value, this works. You don't see anything about file completion besides _value; for the most part we just told the system to ignore part of the word.
Basically this is what the function does. We have a line that looks like:
env TERM=linux PATH=/<---cursor here
The cursor is at the end of that slash. This function allows PREFIX, which is PATH=, to be ignored, so we have:
env TERM=linux /<---cursor here
You can complete a file there with PATH= removed. The function doesn't actually remove the PATH= though, it just recategorizes it as something to ignore.
With this function, you can now complete in all of the examples I listed in the question and a lot more.
One last thing to mention: adding the force-list line below to your zshrc cripples this function somehow. It still works, but it seems to choke. The new completion function is way better without it anyway.
zstyle ':completion:*' force-list always
EDIT: There were a couple lines I forgot to copy into the function. Probably should have checked before posting. I think it's good now.

Delete newline in Vim

Is there a way to delete the newline at the end of a line in Vim, so that the next line is appended to the current line?
For example:
Evaluator<T>():
_bestPos(){
}
I'd like to put this all on one line without copying lines and pasting them into the previous one. It seems like I should be able to put my cursor to the end of each line, press a key, and have the next line jump onto the same one the cursor is on.
End result:
Evaluator<T>(): _bestPos(){ }
Is this possible in Vim?
If you are on the first line, pressing (upper case) J will join that line and the next line together, removing the newline. You can also combine this with a count, so pressing 3J will combine all 3 lines together.
Certainly. Vim recognizes the \n character as a newline, so you can just search and replace.
In command mode type:
:%s/\n/
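Note that this joins every line in the buffer. To join only a few lines, the :join Ex command with a range is more predictable; for example:
:1,3join
joins lines 1 through 3 into one.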
While on the upper line in normal mode, hit Shift+j.
You can prepend a count too, so 3J on the top line would join all those lines together.
As other answers mentioned, (upper case) J and search + replace for \n can be used generally to strip newline characters and to concatenate lines.
But in order to get rid of the trailing newline character in the last line, you need to do this in Vim:
:set noendofline binary
:w
J deletes extra leading spacing (if any), joining lines with a single space. (With some exceptions: after /[.!?]$/, two spaces may be inserted; before /^\s*)/, no spaces are inserted.)
If you don't want that behavior, gJ simply removes the newline and doesn't do anything clever with spaces at all.
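A quick illustration of the difference, with the cursor on the first line of a hypothetical buffer:
foo
    bar
J produces "foo bar" (leading whitespace removed, a single space inserted), while gJ produces "foo    bar" (the newline removed, whitespace left untouched).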
set backspace=indent,eol,start
in your .vimrc will allow you to use backspace and delete on \n (newline) in insert mode.
set whichwrap+=<,>,h,l,[,]
will allow you to delete the previous LF in normal mode with X (when in col 1).
All of the following assume that your cursor is on the first line:
Using normal mappings:
3Shift+J
Using Ex commands:
:,+2j
Which is an abbreviation of
:.,.+2 join
Which can also be entered by the following shortcut:
3:j
An even shorter Ex command:
:j3
It probably depends on your settings, but I usually do this with A<delete>
Where A is append at the end of the line. It probably requires nocompatible mode :)
I would just press A (append to end of line, puts you into insert mode) on the line where you want to remove the newline and then press delete.
<CURSOR>Evaluator<T>():
_bestPos(){
}
With the cursor on the first line, in NORMAL MODE, do:
shift+v
2j
shift+j
or
V2jJ
:normal V2jJ
If you don't mind using other shell tools:
tr -d "\n" < file >t && mv -f t file
sed -i.bak -e :a -e 'N;s/\n//;ba' file
awk '{printf "%s",$0 }' file >t && mv -f t file
The problem is that multiple invisible 0A characters (\n) may accumulate.
Suppose you want to clean up from line 110 to the end.
Press Esc and type : (the command line):
:110,$s/^\n//
In a Vim script:
execute '110,$s/^\n//'
Explanation: from line 110 till the end, search for lines that start with a newline (i.e., are blank) and remove them.
A very slight improvement to TinkerTank's solution if you're just looking to quickly concatenate all the lines in a text file is to have something like this in your .vimrc:
nnoremap <leader>j :%s/\n/\ /g<CR>
This globally substitutes newlines with a space, meaning you don't end up with the last word of a line being joined onto the first word of the next line. This works perfectly for my typical use case.
If you're wanting to maintain deliberate paragraph breaks, V):join is probably the easiest solution.

Unix ksh loop and put the result into variable

I have a simple thing to do, but I'm a novice in UNIX.
So, I have a file, and on each line there is an ID.
I need to go through the file and put all the IDs into one variable.
I've tried something like I would in Java, but it does not work.
for variable in `cat myFile.txt`
do
param=`echo "${param} ${variable}"`
done
It does not seem to add all the values into param.
Thanks.
I'd use:
param=$(<myFile.txt)
The parameter has white space (actually newlines) between the names. When used without quotes, the shell will expand those to spaces, as in:
cat $param
If used with quotes, the file names will remain on separate lines, as in:
echo "$param"
Note that the Korn shell special-cases the '$(<file)' notation and does not fork and execute any command.
Also note that your original idea can be made to work more simply:
param=
for variable in `cat myFile.txt`
do
param="${param} ${variable}"
done
This introduces a blank at the front of the parameter; it seldom matters. Interestingly, you can avoid the blank at the front by having one at the end, using param="${param}${variable} ". This also works without messing things up, though it looks as though it jams things together. Also, the '${var}' notation is not necessary, though it does no harm either.
And, finally for now, it is better to replace the back-tick command with '$(cat myFile.txt)'. The difference becomes crucial when you need to nest commands:
perllib=$(dirname $(dirname $(which perl)))/lib
vs
perllib=`dirname \`dirname \\\`which perl\\\`\``/lib
I know which I prefer to type (and read)!
Try this:
param=`cat myFile.txt | tr '\n' ' '`
The tr command translates all occurrences of \n (new line) to spaces. Then we assign the result to the param variable.
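For example, with the same hypothetical IDs:
$ printf 'id1\nid2\nid3\n' | tr '\n' ' '
id1 id2 id3
(Note the trailing space where the final newline was translated.)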
Lovely.
param="$(< myFile.txt)"
or
while read line
do
param="$param$line"$'\n'
done < myFile.txt
Using awk:
var=$(awk '1' ORS=" " file)
Using ksh:
while read -r line
do
t="$t $line"
done < file
echo $t

How to delete duplicate lines in a file without sorting it in Unix

Is there a way to delete duplicate lines in a file in Unix?
I can do it with the sort -u and uniq commands, but I want to use sed or awk.
Is that possible?
awk '!seen[$0]++' file.txt
seen is an associative array that is keyed by each line of the file. If a line isn't in the array, then seen[$0] will evaluate to false. The ! is the logical NOT operator and will invert the false to true. AWK will print the lines where the expression evaluates to true.
The ++ increments seen so that seen[$0] == 1 after the first time a line is found, then seen[$0] == 2, and so on.
AWK evaluates everything but 0 and "" (the empty string) to true. If a duplicate line is already in seen, then !seen[$0] will evaluate to false and the line will not be written to the output.
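A quick demonstration:
$ printf 'a\nb\na\nc\nb\n' | awk '!seen[$0]++'
a
b
c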
From http://sed.sourceforge.net/sed1line.txt:
(Please don't ask me how this works ;-) )
# delete duplicate, consecutive lines from a file (emulates "uniq").
# First line in a set of duplicate lines is kept, rest are deleted.
sed '$!N; /^\(.*\)\n\1$/!P; D'
# delete duplicate, nonconsecutive lines from a file. Beware not to
# overflow the buffer size of the hold space, or else use GNU sed.
sed -n 'G; s/\n/&&/; /^\([ -~]*\n\).*\n\1/d; s/\n//; h; P'
Perl one-liner similar to jonas's AWK solution:
perl -ne 'print if ! $x{$_}++' file
This variation removes trailing white space before comparing:
perl -lne 's/\s*$//; print if ! $x{$_}++' file
This variation edits the file in-place:
perl -i -ne 'print if ! $x{$_}++' file
This variation edits the file in-place, and makes a backup file.bak:
perl -i.bak -ne 'print if ! $x{$_}++' file
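For example:
$ printf 'a\nb\na\n' | perl -ne 'print if ! $x{$_}++'
a
b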
An alternative way using Vim (Vi compatible):
Delete duplicate, consecutive lines from a file:
vim -esu NONE +'g/\v^(.*)\n\1$/d' +wq
Delete duplicate, nonconsecutive and nonempty lines from a file:
vim -esu NONE +'g/\v^(.+)$\_.{-}^\1$/d' +wq
The one-liner that Andre Miller posted works, except with recent versions of sed, when the input file ends with a blank line with no characters. On my Mac my CPU just spins.
This is an infinite loop if the last line is blank and doesn't have any characters:
sed '$!N; /^\(.*\)\n\1$/!P; D'
It doesn't hang, but you lose the last line:
sed '$d;N; /^\(.*\)\n\1$/!P; D'
The explanation is at the very end of the sed FAQ:
The GNU sed maintainer felt that despite the portability problems
this would cause, changing the N command to print (rather than
delete) the pattern space was more consistent with one's intuitions
about how a command to "append the Next line" ought to behave.
Another fact favoring the change was that "{N;command;}" will
delete the last line if the file has an odd number of lines, but
print the last line if the file has an even number of lines.
To convert scripts which used the former behavior of N (deleting
the pattern space upon reaching the EOF) to scripts compatible with
all versions of sed, change a lone "N;" to "$d;N;".
The first solution is also from http://sed.sourceforge.net/sed1line.txt
$ echo -e '1\n2\n2\n3\n3\n3\n4\n4\n4\n4\n5' |sed -nr '$!N;/^(.*)\n\1$/!P;D'
1
2
3
4
5
The core idea is:
Print each run of duplicate consecutive lines only once, at its last appearance, and use the D command to implement the loop.
Explanation:
$!N;: if the current line is not the last line, use the N command to read the next line into the pattern space.
/^(.*)\n\1$/!P: if the content of the current pattern space is two duplicate strings separated by \n, the next line is the same as the current line, and according to our core idea we cannot print it; otherwise, the current line is the last appearance of its run of duplicate consecutive lines, and we use the P command to print the characters in the current pattern space up to and including \n.
D: we use the D command to delete the characters in the current pattern space up to and including \n, after which the content of the pattern space is the next line.
The D command also forces sed to jump back to its first command, $!N, without reading the next line from the file or standard input.
The second solution is easy to understand (it is my own):
$ echo -e '1\n2\n2\n3\n3\n3\n4\n4\n4\n4\n5' |sed -nr 'p;:loop;$!N;s/^(.*)\n\1$/\1/;tloop;D'
1
2
3
4
5
The core idea is:
print each run of duplicate consecutive lines only once, at its first appearance, and use the : command and the t command to implement the loop.
Explanation:
read a new line from the input stream or file and print it once.
use the :loop command to set a label named loop.
use N to read the next line into the pattern space.
use s/^(.*)\n\1$/\1/ to delete the next line when it is the same as the current line; the s command performs the deletion.
if the s command succeeds, use the tloop command to force sed to jump back to the label named loop, repeating until there are no more duplicate consecutive lines of the most recently printed line; otherwise, use the D command to delete the line that matches the latest printed line and to force sed to jump back to the first command, p. The content of the pattern space is then the next new line.
uniq would be fooled by trailing spaces and tabs. In order to emulate how a human makes comparisons, I am trimming all trailing spaces and tabs before comparison.
I think that the $!N; needs curly braces or else it continues, and that is the cause of the infinite loop.
I have Bash 5.0 and sed 4.7 on Ubuntu 20.10 (Groovy Gorilla). The second one-liner did not work; it failed at the character-set match.
There are three variations. The first eliminates adjacent repeated lines, the second eliminates repeated lines wherever they occur, and the third eliminates all but the last instance of each line in the file.
# First line in a set of duplicate lines is kept, rest are deleted.
# Emulate human eyes on trailing spaces and tabs by trimming those.
# Use after norepeat() to dedupe blank lines.
dedupe() {
    sed -E '
        $!{
            N;
            s/[ \t]+$//;
            /^(.*)\n\1$/!P;
            D;
        }
    ';
}
# Delete duplicate, nonconsecutive lines from a file. Ignore blank
# lines. Trailing spaces and tabs are trimmed to humanize comparisons
# squeeze blank lines to one
norepeat() {
    sed -n -E '
        s/[ \t]+$//;
        G;
        /^(\n){2,}/d;
        /^([^\n]+).*\n\1(\n|$)/d;
        h;
        P;
    ';
}
lastrepeat() {
    sed -n -E '
        s/[ \t]+$//;
        /^$/{
            H;
            d;
        };
        G;
        # delete previous repeated line if found
        s/^([^\n]+)(.*)(\n\1(\n.*|$))/\1\2\4/;
        # after searching for previous repeat, move tested last line to end
        s/^([^\n]+)(\n)(.*)/\3\2\1/;
        $!{
            h;
            d;
        };
        # squeeze blank lines to one
        s/(\n){3,}/\n\n/g;
        s/^\n//;
        p;
    ';
}
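Assuming these functions have been sourced into your shell, usage would look like:
$ dedupe < file
$ norepeat < file
$ lastrepeat < file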
This can be achieved using AWK piped into uniq (note that uniq only collapses adjacent duplicates, so this works when the duplicate lines are consecutive).
The below line will display the unique values:
awk '1' file_name | uniq
You can output these unique values to a new file:
awk '1' file_name | uniq > uniq_file_name
The new file uniq_file_name will contain only the unique values, without any adjacent duplicates.
Use:
cat filename | sort | uniq -c | awk -F" " '$1<2 {print $2}'
It prints only the lines that occur exactly once ($1<2 tests the count column produced by uniq -c). Note that this sorts the input, drops every line that has a duplicate rather than keeping one copy, and print $2 keeps only the first word of each line.

Resources