UNIX: Replace Newline w/ Colon, Preserving Newline Before EOF - unix

I have a text file ("INPUT.txt") of the format:
A<LF>
B<LF>
C<LF>
D<LF>
X<LF>
Y<LF>
Z<LF>
<EOF>
which I need to reformat to:
A:B:C:D:X:Y:Z<LF>
<EOF>
I know you can do this with 'sed'. There's a billion google hits for doing this with 'sed'. But I'm trying to emphasis readability, simplicity, and using the correct tool for the correct job. 'sed' is a line editor that consumes and hides newlines. Probably not the right tool for this job!
I think the correct tool for this job would be 'tr'. I can replace all the newlines with colons with the command:
cat INPUT.txt | tr '\n' ':'
There's 99% of my work done. I have a problem, now, though. By replacing all the newlines with colons, I not only get an extraneous colon at the end of the sequence, but I also lose the carriage return at the end of the input. It looks like this:
A:B:C:D:X:Y:Z:<EOF>
Now, I need to remove the colon from the end of the input. However, if I attempt to pass this processed input through 'sed' to remove the final colon (which would now, I think, be a proper use of 'sed'), I find myself with a second problem. The input is no longer terminated by a newline at all! 'sed' fails outright, for all commands, because it never finds the end of the first line of input!
It seems like appending a newline to the end of some input is a very, very common task, and considering I myself was just sorely tempted to write a program to do it in C (which would take about eight lines of code), I can't imagine there's not already a very simple way to do this with the tools already available to you in the Linux kernel.

This should do the job (cat and echo are unnecessary):
tr '\n' ':' < INPUT.TXT | sed 's/:$/\n/'
Using only sed:
sed -n ':a; $ ! {N;ba}; s/\n/:/g;p' INPUT.TXT
Bash without any externals:
string=($(<INPUT.TXT))
string=${string[#]/%/:}
string=${string//: /:}
string=${string%*:}
Using a loop in sh:
colon=''
while read -r line
do
string=$string$colon$line
colon=':'
done < INPUT.TXT
Using AWK:
awk '{a=a colon $0; colon=":"} END {print a}' INPUT.TXT
Or:
awk '{printf colon $0; colon=":"} END {printf "\n" }' INPUT.TXT
Edit:
Here's another way in pure Bash:
string=($(<INPUT.TXT))
saveIFS=$IFS
IFS=':'
newstring="${string[*]}"
IFS=$saveIFS
Edit 2:
Here's yet another way which does use echo:
echo "$(tr '\n' ':' < INPUT.TXT | head -c -1)"

Old question, but
paste -sd: INPUT.txt

Here's yet another solution: (assumes a character set where ':' is
octal 72, eg ascii)
perl -l72 -pe '$\="\n" if eof' INPUT.TXT

Related

Join lines depending on the line beginning

I have a file that, occasionally, has split lines. The split is signaled by the fact that the line starts with a space, empty line or a nonnumeric character. E.g.
40403813|7|Failed|No such file or directory|1
40403816|7|Hi,
The Conversion System could not be reached.|No such file or directory||1
40403818|7|Failed|No such file or directory|1
...
I'd like join the split line back with the previous line (as mentioned below):
40403813|7|Failed|No such file or directory|1
40403816|7|Hi, The Conversion System could not be reached.|No such file or directory||1
40403818|7|Failed|No such file or directory|1
...
using a Unix command like sed/awk. I'm not clear how to join a line with the preceeding one.
Any suggestion?
awk to the rescue!
awk -v ORS='' 'NR>1 && /^[0-9]/{print "\n"} NF' file
only print newline when the current line starts with a digit, otherwise append rows (perhaps you may want to add a space to ORS if the line break didn't preserve the space).
Don't do anything based on the values of the strings in your fields as that could go wrong. You COULD get a wrapping line that starts with a digit, for example. Instead just print after every complete record of 5 fields:
$ awk -F'|' '{rec=rec $0; nf+=NF} nf>=5{print rec; nf=0; rec=""}' file
40403813|7|Failed|No such file or directory|1
40403816|7|Hi, The Conversion System could not be reached.|No such file or directory||1
40403818|7|Failed|No such file or directory|1
Try:
awk 'NF{printf("%s",$0 ~ /^[0-9]/ && NR>1?RS $0:$0)} END{print ""}' Input_file
OR
awk 'NF{printf("%s",/^[0-9]/ && NR>1?RS $0:$0)} END{print ""}' Input_file
It will check if each line starts from a digit or not if yes and greater than line number 1 than it will insert a new line with-it else it will simply print it, also it will print a new line after reading the whole file, if we not mention it, it is not going to insert that at end of the file reading.
If you only ever have the line split into two, you can use this sed command:
sed 'N;s/\n\([^[:digit:]]\)/\1/;P;D' infile
This appends the next line to the pattern space, checks if the linebreak is followed by something other than a digit, and if so, removes the linebreak, prints the pattern space up to the first linebreak, then deletes the printed part.
If a single line can be broken across more than two lines, we have to loop over the substitution:
sed ':a;N;s/\n\([^[:digit:]]\)/\1/;ta;P;D' infile
This branches from ta to :a if a substitution took place.
To use with Mac OS sed, the label and branching command must be separate from the rest of the command:
sed -e ':a' -e 'N;s/\n\([^[:digit:]]\)/\1/;ta' -e 'P;D' infile
If the continuation lines always begin with a single space:
perl -0000 -lape 's/\n / /g' input
If the continuation lines can begin with an arbitrary amount of whitespace:
perl -0000 -lape 's/\n(\s+)/$1/g' input
It is probably more idiomatic to write:
perl -0777 -ape 's/\n / /g' input
You can use sed when you have a file without \r :
tr "\n" "\r" < inputfile | sed 's/\r\([^0-9]\)/\1/g' | tr '\r' '\n'

How split a file in words in unix command line?

I'm doing a faster tests for a naive boolean information retrival system, and I would like use awk, grep, egrep, sed or thing similiar and pipes for split a text file into words and save them into other file with a word per line. Example my file cotains:
Hola mundo, hablo español y no sé si escribí bien la
pregunta, ojalá me puedan entender y ayudar
Adiós.
The output file should contain:
Hola
mundo
hablo
español
...
Thank!
Using tr:
tr -s '[[:punct:][:space:]]' '\n' < file
The simplest tool is fmt:
fmt -1 <your-file
fmt designed to break lines to fit the specified width and if you provide -1 it breaks immediately after the word. See man fmt for documentation. Inspired by http://everythingsysadmin.com/2012/09/unorthodoxunix.html
Using sed:
$ sed -e 's/[[:punct:]]*//g;s/[[:space:]]\+/\n/g' < inputfile
basically this deletes all punctuation and replaces any spaces with newlines. This also assumes your flavor of sed understands \n. Some do not -- in which case you can just use a literal newline instead (i.e. by embedding it inside your quotes).
grep -o prints only the parts of matching line that matches pattern
grep -o '[[:alpha:]]*' file
Using perl:
perl -ne 'print join("\n", split)' < file
this awk line may work too?
awk 'BEGIN{FS="[[:punct:] ]*";OFS="\n"}{$1=$1}1' inputfile
Based on your responses so far, I THINK what you probably are looking for is to treat words as sequences of characters separated by spaces, commas, sentence-ending characters (i.e. "." "!" or "?" in English) and other characters that you would NOT normally find in combination with alpha-numeric characters (e.g. "<" and ";" but not ' - # $ %). Now, "." is a sentence ending character but you said that $27.00 should be considered a "word" so . needs to be treated differently depending on context. I think the same is probably true for "-" and maybe some other characters.
So you need a solution that will convert this:
I have $27.00. We're 20% under-budget, right? This is #2 - mail me at "foo#bar.com".
into this:
I
have
$27.00
We're
20%
under-budget
right
This
is
#2
mail
me
at
foo#bar.com
Is that correct?
Try this using GNU awk so we can set RS to more than one character:
$ cat file
I have $27.00. We're 20% under-budget, right? This is #2 - mail me at "foo#bar.com".
$ gawk -v RS="[[:space:]?!]+" '{gsub(/^[^[:alnum:]$#]+|[^[:alnum:]%]+$/,"")} $0!=""' file
I
have
$27.00
We're
20%
under-budget
right
This
is
#2
mail
me
at
foo#bar.com
Try to come up with some other test cases to see if this always does what you want.
cat input.txt | tr -d ",." | tr " \t" "\n" | grep -e "^$" -v
tr -d ",." deletes , and .
tr " \t" "\n" changes spaces and tabs to newlines
grep -e "^$" -v deletes empty lines (in case of two or more spaces)
A very simple option would first be,
sed 's,\(\w*\),\1\n,g' file
beware it doens't handle neither apostrophes nor punctuation
Using perl :
perl -pe 's/(?:\p{Punct}|\s+)+/\n/g' file
Output
Hola
mundo
hablo
español
y
no
sé
si
escribí
bien
la
pregunta
ojal�
me
puedan
entender
y
ayudar
Adiós
perl -ne 'print join("\n", split)'
Sorry #jsageryd
That one liner does not give correct answer as it joins last word on line with first word on next.
This is better but generates a blank line for each blank line in src. Pipe via | sed '/^$/d' to fix that
perl -ne '{ print join("\n",split(/[[:^word:]]+/)),"\n"; }'

Unix - Need to cut a file which has multiple blanks as delimiter - awk or cut?

I need to get the records from a text file in Unix. The delimiter is multiple blanks. For example:
2U2133 1239
1290fsdsf 3234
From this, I need to extract
1239
3234
The delimiter for all records will be always 3 blanks.
I need to do this in an unix script(.scr) and write the output to another file or use it as an input to a do-while loop. I tried the below:
while read readline
do
read_int=`echo "$readline"`
cnt_exc=`grep "$read_int" ${Directory path}/file1.txt| wc -l`
if [ $cnt_exc -gt 0 ]
then
int_1=0
else
int_2=0
fi
done < awk -F' ' '{ print $2 }' ${Directoty path}/test_file.txt
test_file.txt is the input file and file1.txt is a lookup file. But the above way is not working and giving me syntax errors near awk -F
I tried writing the output to a file. The following worked in command line:
more test_file.txt | awk -F' ' '{ print $2 }' > output.txt
This is working and writing the records to output.txt in command line. But the same command does not work in the unix script (It is a .scr file)
Please let me know where I am going wrong and how I can resolve this.
Thanks,
Visakh
The job of replacing multiple delimiters with just one is left to tr:
cat <file_name> | tr -s ' ' | cut -d ' ' -f 2
tr translates or deletes characters, and is perfectly suited to prepare your data for cut to work properly.
The manual states:
-s, --squeeze-repeats
replace each sequence of a repeated character that is
listed in the last specified SET, with a single occurrence
of that character
It depends on the version or implementation of cut on your machine. Some versions support an option, usually -i, that means 'ignore blank fields' or, equivalently, allow multiple separators between fields. If that's supported, use:
cut -i -d' ' -f 2 data.file
If not (and it is not universal — and maybe not even widespread, since neither GNU nor MacOS X have the option), then using awk is better and more portable.
You need to pipe the output of awk into your loop, though:
awk -F' ' '{print $2}' ${Directory_path}/test_file.txt |
while read readline
do
read_int=`echo "$readline"`
cnt_exc=`grep "$read_int" ${Directory_path}/file1.txt| wc -l`
if [ $cnt_exc -gt 0 ]
then int_1=0
else int_2=0
fi
done
The only residual issue is whether the while loop is in a sub-shell and and therefore not modifying your main shell scripts variables, just its own copy of those variables.
With bash, you can use process substitution:
while read readline
do
read_int=`echo "$readline"`
cnt_exc=`grep "$read_int" ${Directory_path}/file1.txt| wc -l`
if [ $cnt_exc -gt 0 ]
then int_1=0
else int_2=0
fi
done < <(awk -F' ' '{print $2}' ${Directory_path}/test_file.txt)
This leaves the while loop in the current shell, but arranges for the output of the command to appear as if from a file.
The blank in ${Directory path} is not normally legal — unless it is another Bash feature I've missed out on; you also had a typo (Directoty) in one place.
Other ways of doing the same thing aside, the error in your program is this: You cannot redirect from (<) the output of another program. Turn your script around and use a pipe like this:
awk -F' ' '{ print $2 }' ${Directory path}/test_file.txt | while read readline
etc.
Besides, the use of "readline" as a variable name may or may not get you into problems.
In this particular case, you can use the following line
sed 's/ /\t/g' <file_name> | cut -f 2
to get your second columns.
In bash you can start from something like this:
for n in `${Directoty path}/test_file.txt | cut -d " " -f 4`
{
grep -c $n ${Directory path}/file*.txt
}
This should have been a comment, but since I cannot comment yet, I am adding this here.
This is from an excellent answer here: https://stackoverflow.com/a/4483833/3138875
tr -s ' ' <text.txt | cut -d ' ' -f4
tr -s '<character>' squeezes multiple repeated instances of <character> into one.
It's not working in the script because of the typo in "Directo*t*y path" (last line of your script).
Cut isn't flexible enough. I usually use Perl for that:
cat file.txt | perl -F' ' -e 'print $F[1]."\n"'
Instead of a triple space after -F you can put any Perl regular expression. You access fields as $F[n], where n is the field number (counting starts at zero). This way there is no need to sed or tr.

Interpret as fixed string/literal and not regex using sed

For grep there's a fixed string option, -F (fgrep) to turn off regex interpretation of the search string.
Is there a similar facility for sed? I couldn't find anything in the man. A recommendation of another gnu/linux tool would also be fine.
I'm using sed for the find and replace functionality: sed -i "s/abc/def/g"
Do you have to use sed? If you're writing a bash script, you can do
#!/bin/bash
pattern='abc'
replace='def'
file=/path/to/file
tmpfile="${TMPDIR:-/tmp}/$( basename "$file" ).$$"
while read -r line
do
echo "${line//$pattern/$replace}"
done < "$file" > "$tmpfile" && mv "$tmpfile" "$file"
With an older Bourne shell (such as ksh88 or POSIX sh), you may not have that cool ${var/pattern/replace} structure, but you do have ${var#pattern} and ${var%pattern}, which can be used to split the string up and then reassemble it. If you need to do that, you're in for a lot more code - but it's really not too bad.
If you're not in a shell script already, you could pretty easily make the pattern, replace, and filename parameters and just call this. :)
PS: The ${TMPDIR:-/tmp} structure uses $TMPDIR if that's set in your environment, or uses /tmp if the variable isn't set. I like to stick the PID of the current process on the end of the filename in the hopes that it'll be slightly more unique. You should probably use mktemp or similar in the "real world", but this is ok for a quick example, and the mktemp binary isn't always available.
Option 1) Escape regexp characters. E.g. sed 's/\$0\.0/0/g' will replace all occurrences of $0.0 with 0.
Option 2) Use perl -p -e in conjunction with quotemeta. E.g. perl -p -e 's/\\./,/gi' will replace all occurrences of . with ,.
You can use option 2 in scripts like this:
SEARCH="C++"
REPLACE="C#"
cat $FILELIST | perl -p -e "s/\\Q$SEARCH\\E/$REPLACE/g" > $NEWLIST
If you're not opposed to Ruby or long lines, you could use this:
alias replace='ruby -e "File.write(ARGV[0], File.read(ARGV[0]).gsub(ARGV[1]) { ARGV[2] })"'
replace test3.txt abc def
This loads the whole file into memory, performs the replacements and saves it back to disk. Should probably not be used for massive files.
If you don't want to escape your string, you can reach your goal in 2 steps:
fgrep the line (getting the line number) you want to replace, and
afterwards use sed for replacing this line.
E.g.
#/bin/sh
PATTERN='foo*[)*abc' # we need it literal
LINENUMBER="$( fgrep -n "$PATTERN" "$FILE" | cut -d':' -f1 )"
NEWSTRING='my new string'
sed -i "${LINENUMBER}s/.*/$NEWSTRING/" "$FILE"
You can do this in two lines of bash code if you're OK with reading the whole file into memory. This is quite flexible -- the pattern and replacement can contain newlines to match across lines if needed. It also preserves any trailing newline or lack thereof, which a simple loop with read does not.
mapfile -d '' < file
printf '%s' "${MAPFILE//"$pat"/"$rep"}" > file
For completeness, if the file can contain null bytes (\0), we need to extend the above, and it becomes
mapfile -d '' < <(cat file; printf '\0')
last=${MAPFILE[-1]}; unset "MAPFILE[-1]"
printf '%s\0' "${MAPFILE[#]//"$pat"/"$rep"}" > file
printf '%s' "${last//"$pat"/"$rep"}" >> file
perl -i.orig -pse 'while (($i = index($_,$s)) >= 0) { substr($_,$i,length($s), $r)}'--\
-s='$_REQUEST['\'old\'']' -r='$_REQUEST['\'new\'']' sample.txt
-i.orig in-place modification with backup.
-p print lines from the input file by default
-s enable rudimentary parsing of command line arguments
-e run this script
index($_,$s) search for the $s string
substr($_,$i,length($s), $r) replace the string
while (($i = index($_,$s)) >= 0) repeat until
-- end of perl parameters
-s='$_REQUEST['\'old\'']', -r='$_REQUEST['\'new\'']' - set $s,$r
You still need to "escape" ' chars but the rest should be straight forward.
Note: this started as an answer to How to pass special character string to sed hence the $_REQUEST['old'] strings, however this question is a bit more appropriately formulated.
You should be using replace instead of sed.
From the man page:
The replace utility program changes strings in place in files or on the
standard input.
Invoke replace in one of the following ways:
shell> replace from to [from to] ... -- file_name [file_name] ...
shell> replace from to [from to] ... < file_name
from represents a string to look for and to represents its replacement.
There can be one or more pairs of strings.

Parsing line continuations

What is the simplest way to parse line continuation characters? This seems like such a basic action that I'm surprised there's no basic command for doing this. 'while read' and 'while read -r' loops don't do what I want, and the easiest solution I've found is the sed solution below. Is there a way to do this with something basic like tr?
$ cat input
Output should be \
one line with a '\' character.
$ while read l; do echo $l; done < input
Output should be one line with a '' character.
$ while read -r l; do echo $l; done < input
Output should be \
one line with a '\' character.
$ sed '/\\$/{N; s/\\\n//;}' input
Output should be one line with a '\' character.
$ perl -0777 -pe 's/\\\n//s' input
Output should be one line with a '\' character.
If by "simplest" you mean concise and legible, I'd suggest your perl-ism with one small modification:
$ perl -pe 's/\\\n//' /tmp/line-cont
No need for the possibly memory intensive ... -0777 ... (whole file slurp mode) switch.
If, however, by "simplest" you mean not the leaving shell, this will suffice:
$ { while read -r LINE; do
printf "%s" "${LINE%\\}"; # strip line-continuation, if any
test "${LINE##*\\}" && echo; # emit newline for non-continued lines
done; } < /tmp/input
(I prefer printf "%s" $USER_INPUT to echo $USER_INPUT because echo cannot portably be told to stop looking for switches, and printf is commonly a built-in anyway.)
Just tuck that in a user-defined function and never be revolted by it again. Caution: this latter approach will add a trailing newline to a file which lacks one.
The regex way looks like the way to go.
I would go with the Perl solution simply because it will likely be the most extensible if you want to add more functionality later.

Resources