Using sed to turn paragraphs into lines - unix

Using sed and other basic commands, I'm trying to count the number of words in each separate passage of a file that contains many passages. Each passage begins with a specific number, which increases from passage to passage. Example:
0:1.1 This is the first passage...
0:1.2 This is the second passage...
The difficult thing is that each passage is a word-wrapped paragraph, not a single line. I could count the words in each passage if they were on single lines. How can I do this? Thanks for the help.
I did figure out how to count the passages themselves with:
grep '[0-9]:[0-9]' file | wc -l

This awk solution might work for you:
awk '/^[0-9]:[0-9]\.[0-9]/ {
    if (pass_num) printf "%s, word count: %i\n", pass_num, word_count
    pass_num = $1
    word_count = -1
}
{ word_count += NF }
END { printf "%s, word count: %i\n", pass_num, word_count }
' file
Test input:
# cat file
0:1.1 I am le passage one.
There are many words in me.
0:1.2 I am le passage two.
One two three four five six
Seven
0:1.3 I am "Hello world"
Test output:
0:1.1, word count: 11
0:1.2, word count: 12
0:1.3, word count: 4
How it works:
Words are separated by whitespace, so each word corresponds to a field in awk, i.e. the word count of a line is equal to NF. The word count is summed up line by line until the next passage begins.
When it encounters a new passage (indicated by the presence of a passage number), it:
prints out the previous passage's number and word count
sets the passage number to the new passage's number
resets the passage word count (to -1, because we don't want the passage number itself to be counted)
The END{..} block is needed because the final passage has no following passage header to trigger printing its number and word count.
The if (pass_num) suppresses the printf when awk encounters the first passage header, since there is no previous passage to report yet.
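To see why the counter is reset to -1 rather than 0: on a passage's header line the passage number is itself a field, e.g.:
echo '0:1.1 two words' | awk '{ print NF }'    # prints 3; one field is the passage number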

This might work for you (GNU sed):
sed -r ':a;$bb;N;/\n[0-9]+:[0-9]+\.[0-9]+/!s/\n/ /g;ta;:b;h;s/\n.*//;s/([0-9]+:[0-9]+\.[0-9]+)(.*)/echo "\1 = $(wc -w <<<"\2")"/ep;g;D' file
It gathers each section into a single line (newlines are replaced by spaces), then counts the words in the section, excluding the section number itself.
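Unrolled with comments, the same GNU sed program reads (behaviour unchanged):
sed -r ':a
# on the last line, jump straight to the section handling below
$bb
# otherwise append the next input line to the pattern space
N
# if the appended line is not a new section header, join it on with a space
/\n[0-9]+:[0-9]+\.[0-9]+/!s/\n/ /g
# and keep gathering
ta
:b
# save the gathered text, then isolate the current section
h
s/\n.*//
# build an echo command that word-counts everything after the section
# number, execute it (the e flag) and print the result (the p flag)
s/([0-9]+:[0-9]+\.[0-9]+)(.*)/echo "\1 = $(wc -w <<<"\2")"/ep
# restore the saved text, drop the printed section and restart
g
D' file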

$ cat file
0:1.1 This is the first passage...
welcome to the SO, you leart a lot of things here.
0:1.2 This is the second passage...
wer qwerqrq ewqr e
0:1.3 This is the second passage...
Using sed and GNU grep:
$ sed -n '/0:1.1/,/[0-9]:[0-9]\.[0-9]/{//!p}' file | grep -Eo '[[:alpha:]]*' | wc -l
11
0:1.1 -> give here the number of the passage whose words you want to count.
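For example, pointing it at the second passage of the sample file above counts only the wrapped continuation line, since //!p suppresses both boundary lines:
$ sed -n '/0:1.2/,/[0-9]:[0-9]\.[0-9]/{//!p}' file | grep -Eo '[[:alpha:]]*' | wc -l
4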

Here's one way with GNU awk:
awk -v RS='[0-9]+:[0-9]+\\.[0-9]+' -v FS='[ \t\n]+' 'NF > 0 { print R ": " NF - 2 } { R = RT }' file
If it is run on the file listed by doubledown, the output is:
0:1.1: 11
0:1.2: 12
0:1.3: 4
Explanation
This works by splitting the input into records at every match of [0-9]+:[0-9]+\.[0-9]+ and splitting records into fields at whitespace. The record separator lags by one record, hence the { R = RT }; the field count is off by two because each record starts and ends with an FS, hence the NF - 2.
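The lag is easy to see with a small probe (GNU awk; RT holds the text that matched RS):
$ echo '0:1.1 a b' | awk -v RS='[0-9]+:[0-9]+\\.[0-9]+' '{ print NR ": [" RT "]" }'
1: [0:1.1]
2: []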
Edit - only count fields containing [:alnum:]
The above also counts, e.g., an ellipsis (...) as a word. To avoid this, do something like:
awk -v RS='[0-9]+:[0-9]+\\.[0-9]+' -v FS='[ \t\n]+' '
NF > 0 {
    wc = NF - 2
    for (i = 2; i < NF; i++)
        if ($i !~ /[[:alnum:]]+/)
            wc--
    print R ": " wc
}
{ R = RT }' file

Related

How can I split a text based on every n-th word?

I am trying to split a text file at every 1000th word.
awk -v RS='[[:space:]]+' 'END{print NR+0}' filename
with awk I can count the words in a file, but I don't know how I can split it.
final output= filename(1).txt, filename(2).txt
This totally sick solution should work for files of fewer than 10000 words:
. <(echo -e 'uno due tre\nquattro\ncinque sei sette otto\nnove dieci undici dodici tredici' | sed -zE '
s/^/\x0/
:a
y/012345678/123456789/
s/\x0(([^ \n]+[ \n]+){4})/cat > file0 <<EOF\n\1\nEOF\n\x0/
ta
s/\x0(.*)/cat > file0 <<EOF\n\1\nEOF\n\x0/
s/\n+/\n/g')
Essentially, it intersperses some code at the points where the splits have to occur, in such a way that the resulting file is a bash script: a sequence of cat commands which read from a here-document and write to a file (a maximum of 10 files is allowed!). This script is then sourced (. file is just source file, only uglier). You can see the generated script by removing the leading . <( and the trailing ).
The nice thing is that it splits the big file in the middle of lines if necessary, without altering the lines where no split occurs.
The ugliest thing is that it numbers the files backward.
The limitation on the number of words exists because I am implementing only a one-digit increment on the filenames; it can be lifted by implementing a proper addition in a similar way as done here or here.
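For the sample input above (which splits every 4 words, per the {4} in the script), the generated script looks roughly like this; note the backward numbering:
cat > file3 <<EOF
uno due tre
quattro
EOF
cat > file2 <<EOF
cinque sei sette otto
EOF
cat > file1 <<EOF
nove dieci undici dodici
EOF
cat > file0 <<EOF
tredici
EOF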
You can do it with awk without too much trouble. It helps keep the clutter down if you write a function to actually handle outputting the words from an array to your file. Keep a counter to number the output file names, e.g. wordsfile_1 (first 1000 words), wordsfile_2 (next 1000 words) and so on. Then it is just a matter of keeping track of how many words you add to your array and call your output function when you hit 1000 words. Then delete the array, to make it ready to hold the next 1000 words, reset your word counter and keep going.
For example you could do something like:
awk '
function writefile() {
    fname = "wordsfile_" ++c + 0
    for (j = 1; j <= n; j++)
        print a[j] > fname
    delete a
    n = 0
}
{
    for (i = 1; i <= NF; i++) {
        a[++n] = $i
        if (n == 1000)
            writefile()
    }
}
END {
    writefile()
}' input_file
The function writefile() handles writing the output to your 1000-word files, deleting the array, and resetting the counter n. The END rule just calls the function once more to output any words collected since the last output.
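A quick sanity check after a run (every file except possibly the last should report 1000):
wc -w wordsfile_*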
Let me know if you have further questions.
#!/bin/bash
for FILE in *.txt
do
    #FILE="FILENAME.txt"
    read -p "HOW MANY WORDS SHOULD BE IN YOUR FILES? (~ APPROXIMATE) " BUFFER
    #BUFFER=1000 # APPROXIMATE NUMBER OF WORDS IN A FILE
    NW=$(wc -w $FILE | awk '{print $1}') #NW=NUMBER OF WORDS IN YOUR FILE
    if [[ $NW -gt $BUFFER ]]
    then
        LINENUMBER=$(wc -l $FILE | awk '{print $1}')
        WCOUNT=0
        FL=1 #FIRST LINE NUMBER OF EVERY NEW FILE
        FN=1 #FILE NUMBER
        for j in $(eval echo "{1..$LINENUMBER}")
        do
            INC=$(sed -n "${j}p" $FILE | wc -w)
            WCOUNT=$(( WCOUNT + INC ))
            if [[ $WCOUNT -gt $BUFFER ]]
            then
                sed -n "${FL},${j}p" $FILE > ${FILE%%.*}_${FN}.txt
                FL=$(( j + 1 ))
                (( FN++ ))
                WCOUNT=0
            fi
        done
        sed -n "${FL},\$p" $FILE > ${FILE%%.*}_${FN}.txt
    fi
done
I found a different solution. It generates files that have roughly 1000 words each.

Use sed to replace all occurrences of strings which start with 'xy' and are of length 5 or more

I am running AIX 6.1
I have a file which contains strings/words starting with some specific characters, say 'xy' or 'Xy' or 'xY' or 'XY' (case insensitive), and I need to mask the entire word/string with asterisks '*' if the word is 5 or more characters long.
e.g. I need a sed command which when run against a file containing the below line...
This is a test line xy12345 xy12 Xy123 Xy11111 which I need to replace specific strings
should give below as the output
This is a test line xy12 which I need to replace specific strings
I tried the below commands (I did not yet get to the stage where I restrict the word length), but they do not work and display the full line without any substitutions.
I tried using \< and \> as well as \b for word identification.
sed 's/\<xy\(.*\)\>/******/g' result2.csv
sed 's/\bxy\(.*\)\b******/g' result2.csv
You can try with awk:
echo 'This is a test line xy12345 xy12 Xy123 Xy11111 which I need to replace specific strings' | awk 'BEGIN{RS=ORS=" "} !(/^[xX][yY]/ && length($0)>=5)'
The awk record separator is set to a space in order to be able to get the length of each word.
This works with GNU awk in --posix and --traditional modes.
With sed, for the mental exercise:
sed -E '
# mark each candidate word by replacing its first and last characters with #
s/(^|[[:blank:]])([xyXY])([xyXY].{2}[^[:space:]]*)([^[:space:]])/\1#\3#/g
:A
# convert one character between the two # marks into # per pass,
# preserving the overall word length
s/(#[^#[:blank:]]*)[^#[:blank:]](#[#]*)/\1#\2/g
tA
# finally rewrite every # as *
s/#/*/g'
This requires that the text contain no # characters.
A simple POSIX awk version:
awk '{for(i=1;i<=NF;++i) if ($i ~ /^[xX][yY]/ && length($i)>=5) gsub(/./,"*",$i)}1'
This, however, does not keep the spacing intact (multiple spaces are collapsed into a single one); the following does:
awk 'BEGIN{RS=ORS=" "}(/^[xX][yY]/ && length($0)>=5){gsub(/./,"*")}1'
You may use awk:
s='This is a test line xy12345 xy12 Xy123 Xy11111 which I need to replace specific strings xy123 xy1234 xy12345 xy123456 xy1234567'
echo "$s" | awk 'BEGIN {
ORS=RS=" "
}
{
for(i=1;i<=NF;i++) {
if(length($i) >= 5 && $i~/^[Xx][Yy][a-zA-Z0-9]+$/)
gsub(/./,"*", $i);
print $i;
}
}'
A one liner:
awk 'BEGIN {ORS=RS=" "} { for(i=1;i<=NF;i++) {if(length($i) >= 5 && $i~/^[Xx][Yy][a-zA-Z0-9]+$/) gsub(/./,"*", $i); print $i; } }'
# => This is a test line ******* xy12 ***** ******* which I need to replace specific strings ***** ****** ******* ******** *********
Details
BEGIN {ORS=RS=" "} - start of the awk program: set both the output record separator and the record separator to a space
{ for(i=1;i<=NF;i++) {if(length($i) >= 5 && $i~/^[Xx][Yy][a-zA-Z0-9]+$/) gsub(/./,"*", $i); print $i; } } - iterate over each field (with for(i=1;i<=NF;i++)); if the current field's ($i) length is 5 or more (length($i) >= 5) and it matches Xy (in any letter case) followed by 1 or more alphanumeric chars ($i~/^[Xx][Yy][a-zA-Z0-9]+$/), then replace each of its chars with * (with gsub(/./,"*", $i)) and print the current field value.
This might work for you (GNU sed):
sed -r ':a;/\bxy\S{5,}\b/I!b;s//\n&\n/;h;s/[^\n]/*/g;H;g;s/\n.*\n(.*)\n.*\n(.*)\n.*/\2\1/;ta' file
If the current line does not contain a string which begins with xy (case insensitive) followed by 5 or more characters, there is no work to be done.
Otherwise (each step is also labelled in the unrolled script after this list):
Surround the string by newlines
Copy the pattern space (PS) to the hold space (HS)
Replace all characters other than newlines with *'s
Append the PS to the HS
Replace the PS with the HS
Swap the strings between the newlines retaining the remainder of the first line
Repeat
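The same command unrolled, with each step labelled (GNU sed; behaviour unchanged):
sed -r ':a
# nothing to do unless the line holds xy (case insensitive) plus 5 more characters
/\bxy\S{5,}\b/I!b
# surround the matched string by newlines
s//\n&\n/
# copy the pattern space to the hold space
h
# replace all characters other than newlines with *s
s/[^\n]/*/g
# append the starred copy to the hold space, then fetch the combined text
H
g
# rebuild the line: original prefix, starred word, original suffix
s/\n.*\n(.*)\n.*\n(.*)\n.*/\2\1/
# repeat for any further matches
ta' file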

Extracting a random pattern after matching a word in following lines

Extract household data corresponding to a keyword.
Z1/NEW "THE_PALM" 769 121003 1545
NEW HOUSE IN
SOMETHING SOMETHING
SN HOUSE CLASS
FIRST PSD93_PU 1579
CHAIRS
WOOD
SILVER SPOON
GREEN GARDEN
Z1/OLD "THE_ROSE" 786 121003 1343
NEW HOUSE OUT
SOMETHING NEW
SN HOUSE CLASS
FIRST_O PSD1000_ST 1432
CHAIRS
WOOD
GREEN GARDEN
BLACK PAINT
Z1/OLD "The_PURE" 126 121003 3097
NEW HOUSE IN
SOMETHING OLD
SN HOUSE CLASS
LAST_O JD4_GOLD 1076
CHAIRS
SILVER SPOON
I have a very large file. There is a list of items about the house at the end of every description. For the houses whose item list contains SILVER SPOON, I want to extract the house ID (e.g. PSD93_PU) and the date (e.g. 121003). I tried the following:
awk 'c-->0;$0~s{if(b)for(c=b+1;c>1;c--)print r[(NR-c+1)%b];print;c=a}b{r[NR%b]=$0}' b=7 a=0 s="SILVER" infile > outfile
But the problem is that the number of lines above the keyword SILVER is so variable that I can't figure out the solution.
Assuming each new house starts with Z1:
$ awk '$1 ~ /^Z1/ { date=$4; id=""; f=0; next; } \
$1 == "SN" { f=1; next; } \
f == 1 { id=$2; f=0; next; } \
$1" "$2 == "SILVER SPOON" { print id,date }' file
That is:
on a new house, reset all the vars and grab the date
if an SN line is matched, then the next line contains the id
grab the id from that line
if "SILVER SPOON" is found, print the id and date
if it is not found, a new house will eventually be met and the vars are reset
test with given data:
$ awk '$1 ~ /^Z1/ { date=$4; id=""; f=0; next; } $1 == "SN" { f=1; next; } f == 1 { id=$2; f=0; next; } $1 == "SILVER" && $2 == "SPOON" { print id,date }' file
PSD93_PU 121003
JD4_GOLD 121003
note :
if anybody knows how and if $1 == "SILVER" && $2 == "SPOON" can be merged together into one statement, that'd be nice :) -- like: $1,$2 == "SILVER SPOON"
edit:
it can be done with $1" "$2 == "SILVER SPOON".
one could possibly omit the space and do $1$2 == "SILVERSPOON", but that would match even if $2 was empty and $1 contained the whole string, or $1 was SILVERSPO and $2 was ON. So the space acts as a strict separator.
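A quick demonstration of the pitfall, on hypothetical input:
$ echo 'SILVERSPO ON' | awk '$1$2 == "SILVERSPOON" { print "false positive" }'
false positive
$ echo 'SILVERSPO ON' | awk '$1" "$2 == "SILVER SPOON" { print "match" }'
$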
Using sed:
sed -n -e 's/^Z1[^"]*"[^"]*"[ \t]*[0-9]*[ \t]*\([0-9]*\).*/\1/p' \
       -e '/^SN[ \t]*HOUSE/ { n; s/^[^ \t]*[ \t]*\([^ \t]*\).*/\1/p }' file
Firstly, we call sed with the -n option in order to tell it to print only what we tell it to.
The first command will search for a particular pattern to extract the date. The pattern consists of:
^Z1: A line starting with the string "Z1".
[^"]*: zero or more characters that aren't double quotes
": double quote character
[^"]*: zero or more characters that aren't double quotes
[ \t]*: zero or more characters that are either tabs or spaces
[0-9]*: zero or more digits
[ \t]*: zero or more characters that are either tabs or spaces
\([0-9]*\): zero or more digits. The backslashed parentheses are used in order to capture the match, i.e. the match is stored into an auxiliary variable \1.
.*: zero or more characters, effectively skipping all characters until the end of the line.
This matched line is then replaced with \1, which holds our captured content: the date. The p after the command tells sed to print the result.
The second line contains two commands grouped together (inside braces) so that they are only executed on the "address" before the braces. The address is a pattern, so that it is executed on every line that matches the pattern. The pattern consists of a line that starts with "SN" followed by a sequence of spaces or tabs, followed by the string "HOUSE".
When the pattern matches, we first execute the n next command, which loads the next line from input. Then, we extract the ID from the new line, in a way analogous to extracting the date. The substitute pattern to match is:
^[^ \t]*: a string that starts with zero or more characters that aren't spaces or tabs (whitespace).
[ \t]*: then has a sequence of zero or more spaces and/or tabs.
\([^ \t]*\): a sequence of non whitespace characters is then captured
.*: the remaining characters are matched so that they are skipped.
The replacement becomes the captured ID, and again we tell sed to print it out.
This will print out a line containing the date, followed by a line containing the ID. If you want a line in the format ID date, you can pipe the output of sed into another sed instance, as follows:
sed -n -e [...] | sed -n -e 'h;n;G;s/\n/ /;p'
This sed instance performs the following operations:
Reads a line, and the h command tells it to store the line into the hold space (an auxiliary buffer).
Read the next line with the n command.
The G get command will append the contents of the hold space into the pattern space (the working buffer), so now we have the ID line followed by the date line.
Finally, we replace the newline character with a space, so the two lines are joined into one, and the p prints the result (this second instance also needs -n, otherwise the n command would echo the date line a second time). The complete pipeline is shown below.
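Putting the two expressions from above together (the input file is assumed to be named file):
sed -n -e 's/^Z1[^"]*"[^"]*"[ \t]*[0-9]*[ \t]*\([0-9]*\).*/\1/p' \
       -e '/^SN[ \t]*HOUSE/ { n; s/^[^ \t]*[ \t]*\([^ \t]*\).*/\1/p }' file |
sed -n -e 'h;n;G;s/\n/ /;p'
Note that, unlike the awk answers, this extracts the ID and date for every house; filtering for SILVER SPOON would take an extra step.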
Hope this helps =)
If your records are separated by two or three blank lines and the line spacing before the household items is consistent, you could use GNU awk like this:
awk -r 'BEGIN { RS="\n{3}\n*"; FS="\n" } /SILVER SPOON/ { split($1, one, OFS); split($6, two, OFS); print two[2], one[4] }' file.txt
Results:
PSD93_PU 121003
JD4_GOLD 121003

Maximum number of characters in a field of a csv file using unix shell commands?

I have a csv file. In one of the fields, say the second field, I need to know maximum number of characters in that field. For example, given the file below:
adf,jlkjl,lkjlk
jf,j,lkjljk
jlkj,lkejflkj,adfafef,
jfje,jj,lkjlkj
jjee,eeee,ereq
the answer would be 8 because row 3 has 8 characters in the second field. I would like to integrate this into a bash script, so common unix command line programs are preferred. Imaginary bonus points for explaining what the command is doing.
EDIT: Here is what I have so far
cut --delimiter=, -f 2 test.csv | wc -m
This gives me the character count for all of the fields, not just one, so I still have progress to make.
I would use awk for the task. It uses a comma to split each line into fields, and for each line checks whether the length of the second field is bigger than the value already saved.
awk '
BEGIN {
    FS = ","
}
{ c = length( $2 ) > c ? length( $2 ) : c }
END {
    print c
}
' infile
Use it as a one-liner and assign the return value to a variable, like:
num=$(awk 'BEGIN { FS = "," } { c = length( $2 ) > c ? length( $2 ) : c } END { print c }' infile)
Well @oob, you basically provided the answer with your last edit, and it's the simplest of all the answers given. However, I also like @Birei's answer, just because I enjoy AWK. :-)
I too had to find the longest possible value for a given field inside a text file today. Tested with your sample and got the expected 8.
cut -d, -f2 test.csv | wc -L
As you can see, it is just a matter of using the correct option for wc (which I hope you have already figured out by now); note that -L, which reports the maximum line length, is a GNU extension.
My solution is to loop over the lines. Then I exchange the commas for newlines to loop over the words, then I check which is the longest word and save the data.
#!/bin/bash
lineno=1
matchline=0
matchlen=0
for line in $(cat input.txt); do
    words=`echo $line | sed -e 's/,/\n/g'`
    for word in $words; do
        # echo "line: $lineno; length: ${#word}; input: $word"
        if [ $matchlen -lt ${#word} ]; then
            matchlen=${#word}
            matchline=$lineno
        fi
    done
    lineno=$(($lineno + 1))
done
echo max length is $matchlen in line $matchline
Bash and Coreutils Solution
There are a number of ways to solve this, but I vote for simplicity. Here's a solution that uses Bash parameter expansion and a few standard shell utilities to measure each line:
cut -d, -f2 /tmp/foo |
while read; do
    echo ${#REPLY}
done | sort -n | tail -n1
The idea here is to cut out the desired column of the CSV file, and then use the parameter length expansion of the implicit REPLY variable to measure the characters on each line. When we sort the measurements numerically, the last line of the sorted output will hold the length of the longest line found.
cut out the desired column
print each line length
sort the line lengths
grab the max line length
cut -d, -f2 test.csv | awk '{print length($0);}' | sort -n | tail -n 1

How do I delete a matching line and the previous one?

I need to delete a matching line and the one previous to it.
e.g. in the file below, I need to remove lines 1 & 2.
I tried grep -v -B 1 "page.of." 1.txt
and I expected it to not print the matching lines and the context.
I tried How do I delete a matching line, the line above and the one below it, using sed? but could not understand the sed usage.
---1.txt--
**document 1** -> 1
**page 1 of 2** -> 2
testoing
testing
super crap blah
**document 1**
**page 2 of 2**
You want to do something very similar to the answer given there:
sed -n '
/page . of ./ { #when pattern matches
n #read the next line into the pattern space
x #exchange the pattern and hold space
d #skip the current contents of the pattern space (previous line)
}
x #for each line, exchange the pattern and hold space
1d #skip the first line
p #and print the contents of pattern space (previous line)
$ { #on the last line
x #exchange pattern and hold, pattern now contains last line read
p #and print that
}'
And as a single line
sed -n '/page . of ./{n;x;d;};x;1d;p;${x;p;}' 1.txt
grep -v -B1 doesn't work because it will skip those lines but will include them later on (due to the -B1). To check this out, try the command on:
**document 1** -> 1
**page 1 of 2** -> 2
**document 1**
**page 2 of 2**
**page 3 of 2**
You will notice that the page 2 of 2 line is skipped, because grep -v does not select it and does not select the line after it either, so it never reappears as -B1 context (unlike the page 1 of 2 line).
There's a simple awk solution:
awk '!/page.*of.*/ { if (m) print buf; buf=$0; m=1} /page.*of.*/ {m=0}' 1.txt
The awk command says the following:
If the current line contains that "page ... of", it signals (m=0) that the buffered previous line must not be printed. If the current line does not contain that string, the previous line (stored in buf) is printed when allowed, and the buffer is set to the current line (hence forcing the output to lag by one line).
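One caveat: as written, the last buffered line is silently dropped when the file does not end with a match. Flushing the buffer in an END block fixes that:
awk '!/page.*of.*/ { if (m) print buf; buf=$0; m=1 } /page.*of.*/ { m=0 } END { if (m) print buf }' 1.txt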
grep -vf <(grep -B1 "page.*of" file | sed '/^--$/d') file
Not too familiar with sed, but here's a perl expression to do the trick:
cat FILE | perl -e '@a = <STDIN>;
for( $i = 0 ; $i <= $#a ; $i++ ) {
    if( $i > 0 && $a[$i] =~ /xxxx/ ) {
        $a[$i] = "";
        $a[$i-1] = "";
    }
} print @a;'
edit:
where "xxxx" is what you are trying to match.
Thanks, I was trying to use the awk command given by Foo Bah
to delete the matching line and the previous one. I have to use it multiple times, so for the matching part I use a variable. The given awk command works, but when a variable is used it does not (i.e. it does not delete the matching & previous line). I tried:
awk -vvar="page.*of.*" '!/$var/ { if (m) print buf; buf=$0; m=1} /$var/ {m=0}' 1.txt
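awk never expands variables inside /.../ patterns; a regex held in a variable has to be used with the match operators instead. The corrected call (same program otherwise):
awk -v var="page.*of.*" '$0 !~ var { if (m) print buf; buf=$0; m=1 } $0 ~ var { m=0 }' 1.txt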
