How to split a file into words on the unix command line? - unix
I'm doing some quick tests for a naive boolean information retrieval system, and I would like to use awk, grep, egrep, sed or something similar, with pipes, to split a text file into words and save them into another file with one word per line. For example, my file contains:
Hola mundo, hablo español y no sé si escribí bien la
pregunta, ojalá me puedan entender y ayudar
Adiós.
The output file should contain:
Hola
mundo
hablo
español
...
Thanks!
Using tr:
tr -s '[[:punct:][:space:]]' '\n' < file
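For example (a sketch on an ASCII sample; with GNU tr the character classes work as shown, and -s squeezes runs so the ", " does not produce a blank line):

```shell
# Each punctuation/whitespace character becomes a newline; -s squeezes runs
printf 'Hola mundo, hablo bien\n' | tr -s '[[:punct:][:space:]]' '\n'
# Hola
# mundo
# hablo
# bien
```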
The simplest tool is fmt:
fmt -1 <your-file
fmt is designed to break lines to fit the specified width, and if you give it -1 it breaks immediately after each word. See man fmt for documentation. Inspired by http://everythingsysadmin.com/2012/09/unorthodoxunix.html
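A quick sketch of fmt at work; note that fmt only breaks at whitespace, so punctuation stays attached to its word:

```shell
# Width 1 forces one word per line; "mundo," keeps its comma
printf 'Hola mundo, hablo bien\n' | fmt -1
# Hola
# mundo,
# hablo
# bien
```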
Using sed:
$ sed -e 's/[[:punct:]]*//g;s/[[:space:]]\+/\n/g' < inputfile
basically this deletes all punctuation and replaces any spaces with newlines. This also assumes your flavor of sed understands \n. Some do not -- in which case you can just use a literal newline instead (i.e. by embedding it inside your quotes).
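A sketch of it in action (GNU sed, which supports both \+ and \n in the replacement):

```shell
# First strip punctuation, then turn each whitespace run into one newline
printf 'Hola mundo, hablo bien\n' | sed -e 's/[[:punct:]]*//g;s/[[:space:]]\+/\n/g'
# Hola
# mundo
# hablo
# bien
```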
grep -o prints only the parts of each matching line that match the pattern:
grep -o '[[:alpha:]]*' file
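For example (GNU grep; -o prints only non-empty matches, so punctuation and spaces simply disappear):

```shell
# Each run of letters lands on its own output line
printf 'Hola mundo, hablo bien\n' | grep -o '[[:alpha:]]*'
# Hola
# mundo
# hablo
# bien
```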
Using perl:
perl -ne 'print join("\n", split)' < file
This awk line may work too:
awk 'BEGIN{FS="[[:punct:] ]*";OFS="\n"}{$1=$1}1' inputfile
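A sketch of the same idea, with the * tightened to + (so the field separator cannot match an empty string) and the punctuation class spelled out for awks without POSIX class support:

```shell
# $1=$1 forces awk to rebuild the record, re-joining fields with OFS (a newline)
printf 'Hola mundo, hablo bien\n' | awk 'BEGIN{FS="[,.;:?! ]+";OFS="\n"}{$1=$1}1'
# Hola
# mundo
# hablo
# bien
```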
Based on your responses so far, I THINK what you probably are looking for is to treat words as sequences of characters separated by spaces, commas, sentence-ending characters (i.e. "." "!" or "?" in English) and other characters that you would NOT normally find in combination with alpha-numeric characters (e.g. "<" and ";" but not ' - # $ %). Now, "." is a sentence ending character but you said that $27.00 should be considered a "word" so . needs to be treated differently depending on context. I think the same is probably true for "-" and maybe some other characters.
So you need a solution that will convert this:
I have $27.00. We're 20% under-budget, right? This is #2 - mail me at "foo#bar.com".
into this:
I
have
$27.00
We're
20%
under-budget
right
This
is
#2
mail
me
at
foo#bar.com
Is that correct?
Try this using GNU awk so we can set RS to more than one character:
$ cat file
I have $27.00. We're 20% under-budget, right? This is #2 - mail me at "foo#bar.com".
$ gawk -v RS="[[:space:]?!]+" '{gsub(/^[^[:alnum:]$#]+|[^[:alnum:]%]+$/,"")} $0!=""' file
I
have
$27.00
We're
20%
under-budget
right
This
is
#2
mail
me
at
foo#bar.com
Try to come up with some other test cases to see if this always does what you want.
cat input.txt | tr -d ",." | tr " \t" "\n" | grep -e "^$" -v
tr -d ",." deletes , and .
tr " \t" "\n" changes spaces and tabs to newlines
grep -e "^$" -v deletes empty lines (in case of two or more spaces)
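Putting the pipeline together on input with a double space (the case the final grep handles):

```shell
# Delete , and . ; turn spaces/tabs into newlines; drop the empty line left by the double space
printf 'Hola mundo, hablo  bien.\n' | tr -d ",." | tr " \t" "\n" | grep -e "^$" -v
# Hola
# mundo
# hablo
# bien
```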
A very simple first option would be:
sed 's,\(\w*\),\1\n,g' file
Beware: it handles neither apostrophes nor punctuation.
Using perl:
perl -pe 's/(?:\p{Punct}|\s+)+/\n/g' file
Output
Hola
mundo
hablo
español
y
no
sé
si
escribí
bien
la
pregunta
ojalá
me
puedan
entender
y
ayudar
Adiós
perl -ne 'print join("\n", split)'
Sorry @jsageryd, that one-liner does not give the correct answer, as it joins the last word on each line with the first word of the next.
This is better, but it generates a blank line for each blank line in the source; pipe it through | sed '/^$/d' to fix that:
perl -ne '{ print join("\n",split(/[[:^word:]]+/)),"\n"; }'
Related
Unix concatenate two lines if shorter
I have 500k lines of fixed-length data, but some lines contain an enter (newline) character in the middle of the data. E.g. each line's length is 26 characters:
ABCDEFGHIJKLMNOUPQRSTUVWXTZ
ABCDEFGHIJKLM<BR>
NOUPQRSTUVWXYZ
ABCDEFGHIJKLMNOUPQRSTUVWXTZ
Line 2 contains an enter character. I want to remove the enter character from line 2 and combine it with the line below it. E.g.:
ABCDEFGHIJKLMNOUPQRSTUVWXTZ
ABCDEFGHIJKLMNOUPQRSTUVWXYZ
ABCDEFGHIJKLMNOUPQRSTUVWXTZ
I tried to use awk and sed but the result is not correct.
If you have Perl on your system, you can simply do this:
perl -pe 's/<BR>\n//' your_file_name
It is a one-liner; you simply run it at your command line. Or with awk:
awk '{ORS = sub(/<BR>/,"") ? "" : "\n"; print $0}' your_file_name
This might work for you (GNU sed):
sed 'N;s/<BR>\n//;P;D' file
or:
sed -z 's/<BR>\n//g' file
One, slightly off-the-wall, way of doing this is to:
remove all existing linefeeds
insert new linefeeds every 27 characters
That looks like this:
tr -d '\n' < YOURFILE | fold -w 27
ABCDEFGHIJKLMNOUPQRSTUVWXTZ
ABCDEFGHIJKLMNOUPQRSTUVWXYZ
ABCDEFGHIJKLMNOUPQRSTUVWXTZ
unix: merging a character and the next line into one line
$ cat TPSCIS1705291200.err
0301705293504895 000003330011868452100001742N
#ERROR - Can not find Account:3504895
04117052912404797-010000005947011868455100001410N
#ERROR - Can not find Account:12404797-010
Here I am looking to bring the last character N and the next line starting with # onto one line. Expected output should be something like:
0301705293504895 000003330011868452100001742N,#ERROR - Can not find Account:3504895
04117052912404797-010000005947011868455100001410N,#ERROR - Can not find Account:12404797-010
Please assist; I am not able to find the best way to do this, as the characters N and # are on different rows.
sed won't match newlines. One possible trick is to first "translate" them to some other character, then do the sed substitution. In this code I use the tr command to replace newlines with a different character, form feed ('\f'), then do the replacement with sed, and finally translate the form feeds back into newlines:
cat myfile | tr '\n' '\f' | sed -e "s/N$(echo '\f')/N,#/g" | tr '\f' '\n'
Another dirty trick above is the command substitution for echo '\f', since the literal form feed cannot be part of the regular expression either! Working code (in bash on MacOS):
/tmp » cat in
0301705293504895 000003330011868452100001742N
ERROR - Can not find Account:3504895
04117052912404797-010000005947011868455100001410N
ERROR - Can not find Account:12404797-010
/tmp » cat in | tr '\n' '\f' | sed -e "s/N$(echo '\f')/N,#/g" | tr '\f' '\n'
0301705293504895 000003330011868452100001742N,#ERROR - Can not find Account:3504895
04117052912404797-010000005947011868455100001410N,#ERROR - Can not find Account:12404797-010
It appears you are just looking to merge every other line: awk 'NR%2 { printf "%s,", $0; next} 1' input
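For instance, a sketch on two record/error pairs; odd-numbered lines get a trailing comma instead of a newline:

```shell
# Odd lines: print with "," and no newline; even lines: print normally (the bare 1)
printf 'record1N\n#ERROR one\nrecord2N\n#ERROR two\n' | awk 'NR%2 { printf "%s,", $0; next} 1'
# record1N,#ERROR one
# record2N,#ERROR two
```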
This might work for you (GNU sed): sed -i ':a;/N$/N;s/\n#/,#/;ta;P;D' file If the current line ends in N and the following line begins with a # replace the newline with a , and repeat. Otherwise print the first line and repeat.
How to remove blank lines from a Unix file
I need to remove all the blank lines from an input file and write the result into an output file. Here is my data:
11216,33,1032747,64310,1,0,0,1.878,0,0,0,1,1,1.087,5,1,1,18-JAN-13,000603221321
11216,33,1033196,31300,1,0,0,1.5391,0,0,0,1,1,1.054,5,1,1,18-JAN-13,059762153003
11216,33,1033246,31300,1,0,0,1.5391,0,0,0,1,1,1.054,5,1,1,18-JAN-13,000603211032
11216,33,1033280,31118,1,0,0,1.5513,0,0,0,1,1,1.115,5,1,1,18-JAN-13,055111034001
11216,33,1033287,31118,1,0,0,1.5513,0,0,0,1,1,1.115,5,1,1,18-JAN-13,000378689701
11216,33,1033358,31118,1,0,0,1.5513,0,0,0,1,1,1.115,5,1,1,18-JAN-13,000093737301
11216,33,1035476,37340,1,0,0,1.7046,0,0,0,1,1,1.123,5,1,1,18-JAN-13,045802041926
11216,33,1035476,37340,1,0,0,1.7046,0,0,0,1,1,1.123,5,1,1,18-JAN-13,045802041954
11216,33,1035476,37340,1,0,0,1.7046,0,0,0,1,1,1.123,5,1,1,18-JAN-13,045802049326
11216,33,1035476,37340,1,0,0,1.7046,0,0,0,1,1,1.123,5,1,1,18-JAN-13,045802049383
11216,33,1036985,15151,1,0,0,1.4436,0,0,0,1,1,1.065,5,1,1,18-JAN-13,000093415580
11216,33,1037003,15151,1,0,0,1.4436,0,0,0,1,1,1.065,5,1,1,18-JAN-13,000781202001
11216,33,1037003,15151,1,0,0,1.4436,0,0,0,1,1,1.065,5,1,1,18-JAN-13,000781261305
11216,33,1037003,15151,1,0,0,1.4436,0,0,0,1,1,1.065,5,1,1,18-JAN-13,000781603955
11216,33,1037003,15151,1,0,0,1.4436,0,0,0,1,1,1.065,5,1,1,18-JAN-13,000781615746
sed -i '/^$/d' foo
This tells sed to delete every line matching the regex ^$, i.e. every empty line. The -i flag edits the file in-place; if your sed doesn't support that, you can write the output to a temporary file and replace the original:
sed '/^$/d' foo > foo.tmp
mv foo.tmp foo
If you also want to remove lines consisting only of whitespace (not just empty lines) then use:
sed -i '/^[[:space:]]*$/d' foo
Edit: also remove whitespace at the end of lines, because apparently you've decided you need that too:
sed -i '/^[[:space:]]*$/d;s/[[:space:]]*$//' foo
awk 'NF' filename
awk 'NF > 0' filename
sed -i '/^$/d' filename
awk '!/^$/' filename
awk '/./' filename
The NF test also removes lines containing only blanks or tabs; the regex /^$/ does not.
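The difference is easy to demonstrate (a sketch; the third input line below is three spaces):

```shell
# NF is 0 for both empty and whitespace-only lines, so both are dropped
printf 'a\n\n   \nb\n' | awk 'NF'
# a
# b

# the regex only matches truly empty lines, so the spaces-only line survives
printf 'a\n\n   \nb\n' | awk '!/^$/'
```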
Use grep to match any line that has nothing between the start anchor (^) and the end anchor ($):
grep -v '^$' infile.txt > outfile.txt
If you want to remove lines with only whitespace, you can still use grep. I am using Perl regular expressions in this example, but there are other ways:
grep -P -v '^\s*$' infile.txt > outfile.txt
or, without Perl regular expressions:
grep -v '^[[:space:]]*$' infile.txt > outfile.txt
sed -e '/^ *$/d' input > output
Deletes all lines which consist only of blanks (or are completely empty). You can change the blank to [ \t] where the \t is a representation for tab. Whether your shell or your sed will do the expansion varies, but you can probably type the tab character directly. And if you're using GNU or BSD sed, you can do the edit in-place, if that's what you want, with the -i option.

If I execute the above command I still have blank lines in my output file. What could be the reason?

There could be several reasons. It might be that you don't have blank lines but you have lots of spaces at the end of a line, so it looks like you have blank lines when you cat the file to the screen. If that's the problem, then:
sed -e 's/ *$//' -e '/^ *$/d' input > output
The new regex removes repeated blanks at the end of the line; see the previous discussion for blanks or tabs.

Another possibility is that your data file came from Windows and has CRLF line endings. Unix sees the carriage return at the end of the line; it isn't a blank, so the line is not removed. There are multiple ways to deal with that. A reliable one is tr, to delete (-d) character code octal 15, aka control-M or \r or carriage return:
tr -d '\015' < input | sed -e 's/ *$//' -e '/^ *$/d' > output
If neither of those works, then you need to show a hex dump or octal dump (od -c) of the first two lines of the file, so we can see what we're up against:
head -n 2 input | od -c

Judging from the comments that sed -i does not work for you, you are not working on Linux or Mac OS X or BSD; which platform are you working on? (AIX, Solaris, HP-UX spring to mind as relatively plausible possibilities, but there are plenty of other less plausible ones too.) You can try the POSIX named character classes such as sed -e '/^[[:space:]]*$/d'; it will probably work, but is not guaranteed.
You can try it with:
echo "Hello World" | sed 's/[[:space:]][[:space:]]*/ /'
If it works, the whitespace between 'Hello' and 'World' will be replaced by a single space; if not, you'll probably get an error from sed. That might save you grief over getting tabs typed on the command line.
grep . file grep looks at your file line-by-line; the dot . matches anything except a newline character. The output from grep is therefore all the lines that consist of something other than a single newline.
with awk awk 'NF > 0' filename
To be thorough and remove lines even if they include spaces or tabs something like this in perl will do it: cat file.txt | perl -lane "print if /\S/" Of course there are the awk and sed equivalents. Best not to assume the lines are totally blank as ^$ would do. Cheers
You can use sed's -i option to edit in-place without using a temporary file:
sed -i '/^$/d' file
How to find lines that contain more than a single whitespace between strings in unix?
I have lines like
1|Harry|says|hi
2|Ron|says|bye
3|Her mi oh ne|is|silent
4|The|above|sentence|is|weird
I need a grep command that'll detect the third line. This is what I'm doing:
grep -E '" "" "+' $dname".txt" >> $dname"_error.txt"
The logic I'm basing this on is that the first whitespace must be followed by one or more whitespaces to be detected as an error. $dname is a variable that holds the filename path. How do I get my desired output (which is 3|Her mi oh ne|is|silent)?
grep '[[:space:]]\{2,\}' ${dname}.txt >> ${dname}_error.txt If you want to catch 2 or more whitespaces.
Just this will do: grep " " ${dname}.txt >> ${dname}_error.txt The two spaces in a quoted string work fine. The -E turns the pattern into an extended regular expression, which makes this needlessly complicated here.
Below are four ways:
pearl.268> sed -n 's/ /&/p' ${dname}.txt >> ${dname}_error.txt
pearl.269> awk '$0~/ /{print $0}' ${dname}.txt >> ${dname}_error.txt
pearl.270> grep ' ' ${dname}.txt >> ${dname}_error.txt
pearl.271> perl -ne '/ / && print' ${dname}.txt >> ${dname}_error.txt
If you want 2 or more spaces, then:
grep -E "\s{2,}" ${dname}.txt >> ${dname}_error.txt
The reason your pattern doesn't work is the quotation marks inside it. \s matches a whitespace character. You could actually do the same thing with:
grep -E '  +' ${dname}.txt >> ${dname}_error.txt
But it's difficult to tell exactly what you are looking for with that version. \s\s+ would also work, but \s{2,} is the most concise and also gives you the option of setting an upper limit: if you wanted to find 2, 3, or 4 spaces in a row, you would use \s{2,4}.
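For example, a sketch using the question's data with a genuine run of double spaces (\s is a GNU grep extension; use [[:space:]] for strict POSIX):

```shell
# Only the line containing two-or-more consecutive whitespace characters is printed
printf '1|Harry|says|hi\n3|Her  mi  oh  ne|is|silent\n' | grep -E '\s{2,}'
# 3|Her  mi  oh  ne|is|silent
```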
UNIX: Replace Newline w/ Colon, Preserving Newline Before EOF
I have a text file ("INPUT.txt") of the format:
A<LF>
B<LF>
C<LF>
D<LF>
X<LF>
Y<LF>
Z<LF>
<EOF>
which I need to reformat to:
A:B:C:D:X:Y:Z<LF>
<EOF>
I know you can do this with sed. There are a billion google hits for doing this with sed. But I'm trying to emphasize readability, simplicity, and using the correct tool for the correct job. sed is a line editor that consumes and hides newlines; probably not the right tool for this job! I think the correct tool for this job would be tr. I can replace all the newlines with colons with the command:
cat INPUT.txt | tr '\n' ':'
There's 99% of my work done. Now I have a problem, though. By replacing all the newlines with colons, I not only get an extraneous colon at the end of the sequence, but I also lose the newline at the end of the input. It looks like this:
A:B:C:D:X:Y:Z:<EOF>
Now I need to remove the colon from the end of the input. However, if I attempt to pass this processed input through sed to remove the final colon (which would now, I think, be a proper use of sed), I find myself with a second problem: the input is no longer terminated by a newline at all! sed fails outright, for all commands, because it never finds the end of the first line of input. It seems like appending a newline to the end of some input is a very, very common task, and considering I myself was just sorely tempted to write a program to do it in C (which would take about eight lines of code), I can't imagine there's not already a very simple way to do this with the tools already available.
This should do the job (cat and echo are unnecessary):
tr '\n' ':' < INPUT.TXT | sed 's/:$/\n/'
Using only sed:
sed -n ':a; $ ! {N;ba}; s/\n/:/g;p' INPUT.TXT
Bash without any externals:
string=($(<INPUT.TXT))
string=${string[@]/%/:}
string=${string//: /:}
string=${string%*:}
Using a loop in sh:
colon=''
while read -r line
do
    string=$string$colon$line
    colon=':'
done < INPUT.TXT
Using AWK:
awk '{a=a colon $0; colon=":"} END {print a}' INPUT.TXT
Or:
awk '{printf colon $0; colon=":"} END {printf "\n" }' INPUT.TXT
Edit: Here's another way in pure Bash:
string=($(<INPUT.TXT))
saveIFS=$IFS
IFS=':'
newstring="${string[*]}"
IFS=$saveIFS
Edit 2: Here's yet another way, which does use echo:
echo "$(tr '\n' ':' < INPUT.TXT | head -c -1)"
Old question, but paste -sd: INPUT.txt
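That is, paste in serial mode (-s) with : as the delimiter (-d:) does the whole job in one step, final newline included; a sketch on the question's data read from stdin:

```shell
# -s joins all input lines into one line, -d: uses ':' as the join character
printf 'A\nB\nC\nD\nX\nY\nZ\n' | paste -sd: -
# A:B:C:D:X:Y:Z
```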
Here's yet another solution (assumes a character set where ':' is octal 72, e.g. ASCII):
perl -l72 -pe '$\="\n" if eof' INPUT.TXT