How to split a file into words on the unix command line?

I'm doing quick tests for a naive boolean information retrieval system, and I would like to use awk, grep, egrep, sed or something similar, with pipes, to split a text file into words and save them to another file, one word per line. For example, my file contains:
Hola mundo, hablo español y no sé si escribí bien la
pregunta, ojalá me puedan entender y ayudar
Adiós.
The output file should contain:
Hola
mundo
hablo
español
...
Thanks!

Using tr:
tr -s '[[:punct:][:space:]]' '\n' < file
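A quick sanity check on a made-up line (the outer brackets in the set are redundant for tr but harmless, since `[` and `]` are themselves punctuation):

```shell
# -s squeezes runs of punctuation/whitespace into a single newline,
# so no empty output lines are produced.
printf 'Hola mundo, hablo bien.\n' | tr -s '[[:punct:][:space:]]' '\n'
# Hola
# mundo
# hablo
# bien
```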

The simplest tool is fmt:
fmt -1 <your-file
fmt is designed to break lines to fit the specified width, and if you pass -1 it breaks after every word. See man fmt for documentation. Inspired by http://everythingsysadmin.com/2012/09/unorthodoxunix.html
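One caveat worth checking: unlike the tr answer above, fmt keeps punctuation attached to the word it follows. A small invented example:

```shell
# Each word lands on its own line, but the comma and period stay glued on.
printf 'Hola mundo, hablo bien.\n' | fmt -1
# Hola
# mundo,
# hablo
# bien.
```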

Using sed:
$ sed -e 's/[[:punct:]]*//g;s/[[:space:]]\+/\n/g' < inputfile
Basically, this deletes all punctuation and replaces runs of whitespace with newlines. It also assumes your flavor of sed understands \n; some do not, in which case you can use a literal newline instead (i.e. by embedding it inside the quotes).

grep -o prints only the parts of each matching line that match the pattern:
grep -o '[[:alpha:]]*' file
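Note this keeps only alphabetic runs, so digits and punctuation vanish entirely. A made-up example (GNU grep simply skips the empty matches that `[[:alpha:]]*` produces between words):

```shell
printf 'foo123 bar, baz!\n' | grep -o '[[:alpha:]]*'
# foo
# bar
# baz
```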

Using perl:
perl -ne 'print join("\n", split)' < file

This awk line may work too:
awk 'BEGIN{FS="[[:punct:] ]*";OFS="\n"}{$1=$1}1' inputfile

Based on your responses so far, I THINK what you probably are looking for is to treat words as sequences of characters separated by spaces, commas, sentence-ending characters (i.e. "." "!" or "?" in English) and other characters that you would NOT normally find in combination with alpha-numeric characters (e.g. "<" and ";" but not ' - # $ %). Now, "." is a sentence ending character but you said that $27.00 should be considered a "word" so . needs to be treated differently depending on context. I think the same is probably true for "-" and maybe some other characters.
So you need a solution that will convert this:
I have $27.00. We're 20% under-budget, right? This is #2 - mail me at "foo#bar.com".
into this:
I
have
$27.00
We're
20%
under-budget
right
This
is
#2
mail
me
at
foo#bar.com
Is that correct?
Try this using GNU awk so we can set RS to more than one character:
$ cat file
I have $27.00. We're 20% under-budget, right? This is #2 - mail me at "foo#bar.com".
$ gawk -v RS="[[:space:]?!]+" '{gsub(/^[^[:alnum:]$#]+|[^[:alnum:]%]+$/,"")} $0!=""' file
I
have
$27.00
We're
20%
under-budget
right
This
is
#2
mail
me
at
foo#bar.com
Try to come up with some other test cases to see if this always does what you want.

cat input.txt | tr -d ",." | tr " \t" "\n" | grep -e "^$" -v
tr -d ",." deletes , and .
tr " \t" "\n" changes spaces and tabs to newlines
grep -e "^$" -v deletes empty lines (in case of two or more spaces)
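Tracing the three stages on an invented line with a double space:

```shell
# "Hola mundo,  hablo bien." -> delete , and .
#                            -> blanks become newlines (double blank gives an empty line)
#                            -> grep -v filters the empty line out
printf 'Hola mundo,  hablo bien.\n' | tr -d ",." | tr " \t" "\n" | grep -e "^$" -v
# Hola
# mundo
# hablo
# bien
```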

A very simple option would first be,
sed 's,\(\w*\),\1\n,g' file
but beware: it handles neither apostrophes nor punctuation.

Using perl:
perl -pe 's/(?:\p{Punct}|\s+)+/\n/g' file
Output
Hola
mundo
hablo
español
y
no
sé
si
escribí
bien
la
pregunta
ojalá
me
puedan
entender
y
ayudar
Adiós

perl -ne 'print join("\n", split)'
Sorry @jsageryd, that one-liner does not give the correct answer: it joins the last word on each line with the first word on the next.
This is better, but generates a blank line for each blank line in the source; pipe through | sed '/^$/d' to fix that:
perl -ne '{ print join("\n",split(/[[:^word:]]+/)),"\n"; }'

Related

Unix concatenate two line if shorter

I have 500k lines of fixed-length data, but some lines have an enter character in the middle of the data.
E.g. each line's length is 26 characters.
ABCDEFGHIJKLMNOUPQRSTUVWXTZ
ABCDEFGHIJKLM<BR>
NOUPQRSTUVWXYZ
ABCDEFGHIJKLMNOUPQRSTUVWXTZ
Line 2 contains an enter character. I want to remove the enter character from line 2 and join it with the line below it.
E.g.
ABCDEFGHIJKLMNOUPQRSTUVWXTZ
ABCDEFGHIJKLMNOUPQRSTUVWXYZ
ABCDEFGHIJKLMNOUPQRSTUVWXTZ
I tried to use awk and sed but the result is not correct.
If you have Perl on your system, you can simply do this:
$ perl -pe 's/<BR>\n//' your_file_name
It is a one-liner. You simply run it at your command line.
Or with awk:
awk '{ORS = sub(/<BR>/,"") ? "" : "\n"; print $0}' your_file_name
This might work for you (GNU sed):
sed 'N;s/<BR>\n//;P;D' file
or:
sed -z 's/<BR>\n//g' file
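A quick check of the -z variant on the sample data (GNU sed only; -z treats the input as one NUL-separated record, so the newline after <BR> is visible to s///):

```shell
printf 'ABCDEFGHIJKLM<BR>\nNOUPQRSTUVWXYZ\nABCDEFGHIJKLMNOUPQRSTUVWXTZ\n' |
  sed -z 's/<BR>\n//g'
# ABCDEFGHIJKLMNOUPQRSTUVWXYZ
# ABCDEFGHIJKLMNOUPQRSTUVWXTZ
```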
One, slightly off-the-wall, way of doing this is to:
remove all existing linefeeds
insert new linefeeds every 27 characters
That looks like this:
tr -d '\n' < YOURFILE | fold -w 27
ABCDEFGHIJKLMNOUPQRSTUVWXTZ
ABCDEFGHIJKLMNOUPQRSTUVWXYZ
ABCDEFGHIJKLMNOUPQRSTUVWXTZ

unix merging character and next line to one line

$ cat TPSCIS1705291200.err
0301705293504895 000003330011868452100001742N
#ERROR - Can not find Account:3504895
04117052912404797-010000005947011868455100001410N
#ERROR - Can not find Account:12404797-010
Here I am looking to join each line ending in the character N with the following line starting with #, so that they appear on one line.
Expected output should be something like
0301705293504895 000003330011868452100001742N,#ERROR - Can not find Account:3504895
04117052912404797-010000005947011868455100001410N,#ERROR - Can not find Account:12404797-010
Please assist; I am not able to find a good way to do this, as the characters N and # are on different rows.
sed normally won't match newlines, since it processes one line at a time. One possible trick is to first "translate" them to another character, then do the sed substitution.
In this code I use the tr command to replace newlines with a different character, the form feed (\f), then do the replacement with sed, and finally turn the form feeds back into newlines:
cat myfile | tr '\n' '\f' | sed -e "s/N$(echo '\f')/N,#/g" | tr '\f' '\n'
Another dirty trick above is the command substitution around echo '\f', since \f cannot be typed directly into the regular expression either!
Working code (in bash for MacOS):
/tmp » cat in
0301705293504895 000003330011868452100001742N
ERROR - Can not find Account:3504895
04117052912404797-010000005947011868455100001410N
ERROR - Can not find Account:12404797-010
/tmp » cat in | tr '\n' '\f' | sed -e "s/N$(echo '\f')/N,#/g" | tr '\f' '\n'
0301705293504895 000003330011868452100001742N,#ERROR - Can not find Account:3504895
04117052912404797-010000005947011868455100001410N,#ERROR - Can not find Account:12404797-010
It appears you are just looking to merge every other line:
awk 'NR%2 { printf "%s,", $0; next} 1' input
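On invented two-line records the pairing looks like this; note it assumes data and error lines strictly alternate, unlike the sed answer below which keys on the N/# markers:

```shell
# Odd-numbered lines are printed with a trailing comma and no newline;
# even-numbered lines complete the record.
printf 'recordN\n#ERROR one\notherN\n#ERROR two\n' |
  awk 'NR%2 { printf "%s,", $0; next } 1'
# recordN,#ERROR one
# otherN,#ERROR two
```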
This might work for you (GNU sed):
sed -i ':a;/N$/N;s/\n#/,#/;ta;P;D' file
If the current line ends in N and the following line begins with a # replace the newline with a , and repeat. Otherwise print the first line and repeat.

How to remove blank lines from a Unix file

I need to remove all the blank lines from an input file and write into an output file. Here is my data as below.
11216,33,1032747,64310,1,0,0,1.878,0,0,0,1,1,1.087,5,1,1,18-JAN-13,000603221321
11216,33,1033196,31300,1,0,0,1.5391,0,0,0,1,1,1.054,5,1,1,18-JAN-13,059762153003
11216,33,1033246,31300,1,0,0,1.5391,0,0,0,1,1,1.054,5,1,1,18-JAN-13,000603211032
11216,33,1033280,31118,1,0,0,1.5513,0,0,0,1,1,1.115,5,1,1,18-JAN-13,055111034001
11216,33,1033287,31118,1,0,0,1.5513,0,0,0,1,1,1.115,5,1,1,18-JAN-13,000378689701
11216,33,1033358,31118,1,0,0,1.5513,0,0,0,1,1,1.115,5,1,1,18-JAN-13,000093737301
11216,33,1035476,37340,1,0,0,1.7046,0,0,0,1,1,1.123,5,1,1,18-JAN-13,045802041926
11216,33,1035476,37340,1,0,0,1.7046,0,0,0,1,1,1.123,5,1,1,18-JAN-13,045802041954
11216,33,1035476,37340,1,0,0,1.7046,0,0,0,1,1,1.123,5,1,1,18-JAN-13,045802049326
11216,33,1035476,37340,1,0,0,1.7046,0,0,0,1,1,1.123,5,1,1,18-JAN-13,045802049383
11216,33,1036985,15151,1,0,0,1.4436,0,0,0,1,1,1.065,5,1,1,18-JAN-13,000093415580
11216,33,1037003,15151,1,0,0,1.4436,0,0,0,1,1,1.065,5,1,1,18-JAN-13,000781202001
11216,33,1037003,15151,1,0,0,1.4436,0,0,0,1,1,1.065,5,1,1,18-JAN-13,000781261305
11216,33,1037003,15151,1,0,0,1.4436,0,0,0,1,1,1.065,5,1,1,18-JAN-13,000781603955
11216,33,1037003,15151,1,0,0,1.4436,0,0,0,1,1,1.065,5,1,1,18-JAN-13,000781615746
sed -i '/^$/d' foo
This tells sed to delete every line matching the regex ^$ i.e. every empty line. The -i flag edits the file in-place, if your sed doesn't support that you can write the output to a temporary file and replace the original:
sed '/^$/d' foo > foo.tmp
mv foo.tmp foo
If you also want to remove lines consisting only of whitespace (not just empty lines) then use:
sed -i '/^[[:space:]]*$/d' foo
Edit: also remove whitespace at the end of lines, because apparently you've decided you need that too:
sed -i '/^[[:space:]]*$/d;s/[[:space:]]*$//' foo
awk 'NF' filename
awk 'NF > 0' filename
sed -i '/^$/d' filename
awk '!/^$/' filename
awk '/./' filename
The NF also removes lines containing only blanks or tabs, the regex /^$/ does not.
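The difference is easy to demonstrate on an invented file whose third line is a single space:

```shell
printf 'a\n\n \nb\n' | awk 'NF'       # drops the empty line AND the blank line
printf 'a\n\n \nb\n' | grep -v '^$'   # drops only the truly empty line
```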
Use grep to match any line that has nothing between the start anchor (^) and the end anchor ($):
grep -v '^$' infile.txt > outfile.txt
If you want to remove lines with only whitespace, you can still use grep. I am using Perl regular expressions in this example, but there are other ways:
grep -P -v '^\s*$' infile.txt > outfile.txt
or, without Perl regular expressions:
grep -v '^[[:space:]]*$' infile.txt > outfile.txt
sed -e '/^ *$/d' input > output
Deletes all lines which consist only of blanks (or are completely empty). You can change the blank to [ \t] where the \t is a representation for tab. Whether your shell or your sed will do the expansion varies, but you can probably type the tab character directly. And if you're using GNU or BSD sed, you can do the edit in-place, if that's what you want, with the -i option.
If I execute the above command, I still have blank lines in my output file. What could be the reason?
There could be several reasons. It might be that you don't have blank lines but you have lots of spaces at the end of a line so it looks like you have blank lines when you cat the file to the screen. If that's the problem, then:
sed -e 's/ *$//' -e '/^ *$/d' input > output
The new regex removes repeated blanks at the end of the line; see previous discussion for blanks or tabs.
Another possibility is that your data file came from Windows and has CRLF line endings. Unix sees the carriage return at the end of the line; it isn't a blank, so the line is not removed. There are multiple ways to deal with that. A reliable one is tr to delete (-d) character code octal 15, aka control-M or \r or carriage return:
tr -d '\015' < input | sed -e 's/ *$//' -e '/^ *$/d' > output
If neither of those works, then you need to show a hex dump or octal dump (od -c) of the first two lines of the file, so we can see what we're up against:
head -n 2 input | od -c
Judging from the comments that sed -i does not work for you, you are not working on Linux or Mac OS X or BSD — which platform are you working on? (AIX, Solaris, HP-UX spring to mind as relatively plausible possibilities, but there are plenty of other less plausible ones too.)
You can try the POSIX named character classes such as sed -e '/^[[:space:]]*$/d'; it will probably work, but is not guaranteed. You can try it with:
echo "Hello World" | sed 's/[[:space:]][[:space:]]*/ /'
If it works, there'll be three spaces between the 'Hello' and the 'World'. If not, you'll probably get an error from sed. That might save you grief over getting tabs typed on the command line.
grep . file
grep looks at your file line-by-line; the dot . matches anything except a newline character. The output from grep is therefore all the lines that consist of something other than a single newline.
with awk
awk 'NF > 0' filename
To be thorough and remove lines even if they include spaces or tabs, something like this in perl will do it:
cat file.txt | perl -lane "print if /\S/"
Of course there are the awk and sed equivalents. Best not to assume the lines are totally blank, as ^$ would.
Cheers
You can use sed's -i option to edit in place without using a temporary file:
sed -i '/^$/d' file

How to find lines that contain more than a single whitespace between strings in unix?

I have lines like
1|Harry|says|hi
2|Ron|says|bye
3|Her mi oh ne|is|silent
4|The|above|sentence|is|weird
I need a grep command that'll detect the third line.
This is what I'm doing:
grep -E '" "" "+' $dname".txt" >> $dname"_error.txt"
The logic I'm basing this on is that the first whitespace must be followed by one or more whitespaces to be detected as an error.
$dname is a variable that holds the filename path.
How do I get my desired output?
( which is
3|Her mi oh ne|is|silent
)
grep '[[:space:]]\{2,\}' ${dname}.txt >> ${dname}_error.txt
If you want to catch 2 or more whitespaces.
Just this will do:
grep "  " ${dname}.txt >> ${dname}_error.txt
The two spaces in a quoted string work fine. The -E turns the pattern into an extended regular expression, which makes this needlessly complicated here.
Below are four ways.
pearl.268> sed -n 's/ /&/p' ${dname}.txt >> ${dname}_error.txt
pearl.269> awk '$0~/ /{print $0}' ${dname}.txt >> ${dname}_error.txt
pearl.270> grep ' ' ${dname}.txt >> ${dname}_error.txt
pearl.271> perl -ne '/ / && print' ${dname}.txt >> ${dname}_error.txt
If you want 2 or more spaces, then:
grep -E "\s{2,}" ${dname}.txt >> ${dname}_error.txt
The reason your pattern doesn't work is the quotation marks inside it. \s matches a whitespace character. You could actually do the same thing with:
grep -E '  +' ${dname}.txt >> ${dname}_error.txt
But it's difficult to tell exactly what you are looking for with that version. \s\s+ would also work, but \s{2,} is the most concise and also gives you the option of setting an upper limit. If you wanted to find 2, 3, or 4 spaces in a row, you would use \s{2,4}
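Assuming the problem lines really do contain runs of two or more blanks (consecutive spaces are often collapsed when data is pasted into a web page), this is easy to verify. \s inside an ERE is a GNU grep extension; [[:space:]] is the portable spelling:

```shell
# Only the line containing a run of two or more spaces matches.
printf '1|Harry|says|hi\n3|Her  mi  oh  ne|is|silent\n' | grep -E '\s{2,}'
# 3|Her  mi  oh  ne|is|silent
```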

UNIX: Replace Newline w/ Colon, Preserving Newline Before EOF

I have a text file ("INPUT.txt") of the format:
A<LF>
B<LF>
C<LF>
D<LF>
X<LF>
Y<LF>
Z<LF>
<EOF>
which I need to reformat to:
A:B:C:D:X:Y:Z<LF>
<EOF>
I know you can do this with sed. There are a billion google hits for doing this with sed. But I'm trying to emphasize readability, simplicity, and using the correct tool for the correct job. sed is a line editor that consumes and hides newlines. Probably not the right tool for this job!
I think the correct tool for this job would be 'tr'. I can replace all the newlines with colons with the command:
cat INPUT.txt | tr '\n' ':'
There's 99% of my work done. Now I have a problem, though. By replacing all the newlines with colons, I not only get an extraneous colon at the end of the sequence, but I also lose the newline at the end of the input. It looks like this:
A:B:C:D:X:Y:Z:<EOF>
Now, I need to remove the colon from the end of the input. However, if I attempt to pass this processed input through 'sed' to remove the final colon (which would now, I think, be a proper use of 'sed'), I find myself with a second problem. The input is no longer terminated by a newline at all! 'sed' fails outright, for all commands, because it never finds the end of the first line of input!
It seems like appending a newline to the end of some input is a very, very common task, and considering I myself was just sorely tempted to write a program to do it in C (which would take about eight lines of code), I can't imagine there isn't already a very simple way to do this with the tools available on a Linux system.
This should do the job (cat and echo are unnecessary):
tr '\n' ':' < INPUT.TXT | sed 's/:$/\n/'
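Checked on a three-line sample. Note that \n in the sed replacement text is a GNU sed feature; with BSD sed you would embed a literal newline after a backslash instead:

```shell
# tr turns every newline into a colon; sed swaps the trailing colon
# back into the final newline.
printf 'A\nB\nC\n' | tr '\n' ':' | sed 's/:$/\n/'
# A:B:C
```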
Using only sed:
sed -n ':a; $ ! {N;ba}; s/\n/:/g;p' INPUT.TXT
Bash without any externals:
string=($(<INPUT.TXT))
string=${string[@]/%/:}
string=${string//: /:}
string=${string%*:}
Using a loop in sh:
colon=''
while read -r line
do
string=$string$colon$line
colon=':'
done < INPUT.TXT
Using AWK:
awk '{a=a colon $0; colon=":"} END {print a}' INPUT.TXT
Or:
awk '{printf colon $0; colon=":"} END {printf "\n" }' INPUT.TXT
Edit:
Here's another way in pure Bash:
string=($(<INPUT.TXT))
saveIFS=$IFS
IFS=':'
newstring="${string[*]}"
IFS=$saveIFS
Edit 2:
Here's yet another way which does use echo:
echo "$(tr '\n' ':' < INPUT.TXT | head -c -1)"
Old question, but
paste -sd: INPUT.txt
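-s serializes all lines of the file into one, -d sets the join delimiter, and the result keeps its final newline, which is exactly the requested format:

```shell
# '-' reads from stdin; with a file just run: paste -sd: INPUT.txt
printf 'A\nB\nC\nD\n' | paste -sd: -
# A:B:C:D
```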
Here's yet another solution (assuming a character set where ':' is octal 72, e.g. ASCII):
perl -l72 -pe '$\="\n" if eof' INPUT.TXT
