I have a text file on Unix made up of multiple long lines:
ALTER Tit como(titel('42423432;434235111;757567562;2354679;5543534;6547673;32322332;54545453'))
ALTER Mit como(Alt('432322;434434211;754324237562;2354679;5543534;6547673;32322332;54545453'))
I need to split each line into multiple lines of no more than 42 characters.
The split should happen after the last ";" that fits,
so my ideal output file would be:
ALTER Tit como(titel('42423432;434235111; -
757567562;2354679;5543534;6547673; -
32322332;54545453'))
ALTER Mit como(Alt('432322;434434211; -
754324237562;2354679;5543534;6547673; -
32322332;54545453'))
I used fold -w 42 givenfile.txt | sed 's/ $/ -/g'.
It splits the lines but doesn't add the "-" at the end of each line and doesn't split after the ";".
Any help is much appreciated.
Thanks!
awk -F';' '
w {
  print ""
}
{
  w = length($1)
  printf "%s", $1
  for (i = 2; i <= NF; i++) {
    if ((w + length($i) + 1) < 42) {
      w += length($i) + 1
      printf ";%s", $i
    } else {
      w = length($i)
      printf "; -\n%s", $i
    }
  }
}
END {
  print ""
}
' file
This produces the output:
ALTER Tit como(titel('42423432;434235111; -
757567562;2354679;5543534;6547673; -
32322332;54545453'))
ALTER Mit como(Alt('432322;434434211; -
754324237562;2354679;5543534;6547673; -
32322332;54545453'))
How it works
Awk implicitly loops through each line of its input and each line is divided into fields. This code uses a single variable w to keep track of the current width of the output line.
-F';'
Tell awk to break fields on semicolons.
w{print""}
If the last line was not completed, w>0, then print a newline to terminate it before we start with a new line.
w=length($1); printf "%s",$1
Print the first field of the new line and set w according to its length.
Loop over the remaining fields:
for (i = 2; i <= NF; i++) {
  if ((w + length($i) + 1) < 42) {
    w += length($i) + 1
    printf ";%s", $i
  } else {
    w = length($i)
    printf "; -\n%s", $i
  }
}
This loops over the second to final fields of this line. Whenever we reach the point where we can't print another field without exceeding the 42 character limit, we print ; -\n.
END{print""}
Print a newline at the end of the file.
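As a quick sanity check, here is the same script run on the first sample line (the file name fold_demo.txt is just for illustration):

```shell
# Demo of the fold-at-semicolon awk script on the first sample line.
cat > /tmp/fold_demo.txt <<'EOF'
ALTER Tit como(titel('42423432;434235111;757567562;2354679;5543534;6547673;32322332;54545453'))
EOF
awk -F';' '
w { print "" }
{
  w = length($1)
  printf "%s", $1
  for (i = 2; i <= NF; i++) {
    if ((w + length($i) + 1) < 42) { w += length($i) + 1; printf ";%s", $i }
    else                           { w = length($i); printf "; -\n%s", $i }
  }
}
END { print "" }
' /tmp/fold_demo.txt
```

This prints the wrapped three-line version of that record shown in the question.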
This might work for you (GNU sed):
sed -r 's/.{1,42}$|.{1,41};/& -\n/g;s/...$//' file
This globally replaces either 1 to 42 characters followed by the end of line, or 1 to 41 characters followed by a ;, with the match plus " -" and a newline. After the last match the line has three characters too many (the trailing " -" and newline), so the second substitution deletes them.
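For example, piping the first sample line through it (GNU sed assumed):

```shell
# Demo of the GNU sed one-liner on the first sample line.
printf "%s\n" "ALTER Tit como(titel('42423432;434235111;757567562;2354679;5543534;6547673;32322332;54545453'))" |
sed -r 's/.{1,42}$|.{1,41};/& -\n/g;s/...$//'
```

This produces the same three wrapped lines as the awk solution.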
Related
I need to process a text file - a big CSV - to correct the format in it. This CSV has a field which contains XML data, formatted to be human readable: broken up into multiple lines and indented with spaces. I need every record on one line, so I am using awk to join lines, then sed to get rid of extra spaces between XML tags, and then tr to eliminate unwanted "\r" characters.
(The first field is always 8 numbers and the field separator is the pipe character: "|".)
The awk script (join4.awk) is:
BEGIN {
# initialise "line" variable. Maybe unnecessary
line=""
}
{
# check if this line is a beginning of a new record
if ( $0 ~ "^[0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][|]" ) {
# if it is a new record, then print stuff already collected
# then update line variable with $0
if (line != "") print line
line = $0
} else {
# if it is not, then just attach $0 to the line
line = line $0
}
}
END {
# print out the last record kept in line variable
if (line) print line
}
and the commandline is
cat inputdata.csv | awk -f join4.awk | tr -d "\r" | sed 's/> *</></g' > corrected_data.csv
My question is: is there an efficient way to implement the tr and sed functionality inside the awk script? This is not Linux, so I have no gawk, just plain old awk and nawk.
thanks,
--Trifo
tr -d "\r"
Is just gsub(/\r/, "").
sed 's/> *</></g'
That's just gsub(/> *</, "><").
mawk NF=NF RS='\r?\n' FS='> *<' OFS='><'
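For illustration, here is a minimal sketch of both gsub() calls applied while joining lines; the XML-ish sample input is made up:

```shell
# Sketch: tr -d "\r" and sed 's/> *</></g' folded into awk as gsub() calls.
printf '12345678|<a>\r\n  <b> </b>\r\n</a>\r\n' | awk '
{ line = line $0 }
END {
  gsub(/\r/, "", line)      # tr -d "\r" equivalent
  gsub(/> *</, "><", line)  # sed "s/> *</></g" equivalent
  print line
}'
```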
Thank you all folks!
You gave me the inspiration to get to a solution. It is like this:
BEGIN {
# initialize "line" variable. Maybe unnecessary.
line=""
}
{
# if the line begins with 8 numbers and a pipe char (the format of the first record)...
if ( $0 ~ "^[0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][|]" ) {
# ... then the previous record is ready. We can post process it, the print out
# workarounds for the missing gsub function:
# repeat sub() until no match is left
# removing extra \r characters
while ( line ~ /\r/ ) { sub( /\r/, "", line ) }
# "<text text> <tag tag>" should look like "<text text><tag tag>"
# (the pattern must require at least one space: "> *<" also matches
# the replacement "><", so that loop would never terminate)
while ( line ~ /> +</ ) { sub( /> +</, "><", line ) }
# then print the record and update line var with the beginning of the new record
print line
line = $0
} else {
# just keep extending the record with the actual line
line = line $0
}
}
END {
# print the last record kept in line var
if (line) {
while ( line ~ /\r/ ) { sub( /\r/, "", line ) }
while ( line ~ /> +</ ) { sub( /> +</, "><", line ) }
print line
}
}
And yes, it is efficient: the embedded version runs about 33% faster.
And yes, it would be nicer to create a function for the postprocessing of the records in the "line" variable; now I have to write the same code twice to process the last record in the END section. But it works, it creates the same output as the chained commands and it is way faster.
So, thanks for the inspiration again!
--Trifo
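For what it's worth, the refactoring mentioned above might look like the sketch below: the postprocessing moved into a single user-defined function (plain awk/nawk supports these), shared by the main rule and the END block. The sample input is invented, and the tag pattern requires at least one space (> +<) so the loop terminates once the tags are adjacent:

```shell
# Sketch: join records and postprocess them in one awk function.
printf '12345678|<a>\r\n <b></b>\r\n</a>\r\n87654321|<c> </c>\r\n' | awk '
function flush(s) {
  while (sub(/\r/, "", s)) ;      # drop CR characters
  while (sub(/> +</, "><", s)) ;  # drop spaces between tags
  print s
}
/^[0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][|]/ {
  if (line != "") flush(line)
  line = $0
  next
}
{ line = line $0 }
END { if (line != "") flush(line) }'
```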
I got a solution to format a Unix file containing ^M ("\r\n") characters, per the link shared earlier: https://stackoverflow.com/questions/68919927/removing-new-line-characters-in-csv-file-from-inside-columns-in-unix .
But the current ask is to get rid of the "\r\n" and ^M characters in all columns of the file except the last one (so that the "\r\n"/^M value at the end of the last column can be used to format the file using the command awk -v RS='\r\n' '{gsub(/\n/,"")} 1' test.csv).
Sample data is:
$ cat -v test.csv
234,aa,bb,cc,30,dd^M
22,cc,^M
ff,dd,^M
40,gg^M
pxy,aa,,cc,^M
40
,dd^M
Current Output::
234,aa,bb,cc,30,dd
22,cc,
ff,dd,
40,gg
pxy,aa,,cc,
40,dd
Expected output::
234,aa,bb,cc,30,dd
22,cc,ff,dd,40,gg
pxy,aa,,cc,40,dd
Would you please try a perl solution:
perl -0777 -pe 's/\r?\n(?=,)//g; s/(?<=,)\r?\n//g; s/\r//g' test.csv
Output:
234,aa,bb,cc,30,dd
22,cc,ff,dd,40,gg
pxy,aa,,cc,40,dd
The -0777 option tells perl to slurp all lines, including line endings, at once.
The -p option wraps the script in a read-print loop; -e interprets the next argument as the perl script.
The regex \r?\n(?=,) matches zero or one CR character followed by
a NL character, with a positive lookahead for a comma.
Then the substitution s/\r?\n(?=,)//g removes the line endings which match
the condition above. The following comma is not removed due to the nature
of lookaround assertions.
The substitution s/(?<=,)\r?\n//g is the switched version, which removes
the line endings after the comma.
The final s/\r//g removes still remaining CR characters.
[Edit]
As the perl script above slurps all lines into the memory, it may be slow if the file is huge. Here is an alternative which processes the input line by line using a state machine.
awk -v ORS="" ' # empty the output record separator
/^\r?$/ {next} # skip blank lines
f && !/^,/ {print "\n"} # break the line if the flag is set and the line does not start with a comma
{
sub(/\r$/, "") # remove trailing CR character
print # print current line (w/o newline)
if ($0 ~ /,$/) f = 0 # if the line has a trailing comma, clear the flag
else f = 1 # if the line properly ends, set the flag
}
END {
print "\n" # append the newline to the last line
}
' test.csv
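Feeding the sample test.csv content to this state machine (here via printf, with CRLF endings) reproduces the expected output:

```shell
# Demo of the line-by-line state machine on the question's sample data.
printf '234,aa,bb,cc,30,dd\r\n22,cc,\r\nff,dd,\r\n40,gg\r\npxy,aa,,cc,\r\n40\r\n,dd\r\n' |
awk -v ORS="" '
/^\r?$/ {next}
f && !/^,/ {print "\n"}
{
  sub(/\r$/, "")
  print
  if ($0 ~ /,$/) f = 0
  else f = 1
}
END { print "\n" }'
```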
BTW, if you want to put blank lines in between, as in the posted expected output, which looks like:
234,aa,bb,cc,30,dd

22,cc,ff,dd,40,gg

pxy,aa,,cc,40,dd
then append another \n in the print line as:
f && !/^,/ {print "\n\n"}
I am running AIX 6.1
I have a file which contains strings/words starting with some specific characters, say 'xy' or 'Xy' or 'xY' or 'XY' (case insensitive), and I need to mask the entire word/string with asterisks '*' if the word is greater than, say, 5 characters.
e.g. I need a sed command which when run against a file containing the below line...
This is a test line xy12345 xy12 Xy123 Xy11111 which I need to replace specific strings
should give below as the output
This is a test line xy12 which I need to replace specific strings
I tried the below commands (I did not yet get to the stage where I restrict the word lengths), but they do not work and display the full line without any substitutions.
I tried using \< and \> as well as \b for word identification.
sed 's/\<xy\(.*\)\>/******/g' result2.csv
sed 's/\bxy\(.*\)\b/******/g' result2.csv
You can try with awk:
echo 'This is a test line xy12345 xy12 Xy123 Xy11111 which I need to replace specific strings' | awk 'BEGIN{RS=ORS=" "} !(/^[xX][yY]/ && length($0)>=5)'
The awk record separator is set to a space in order to be able to get the length of each word.
This works with GNU awk in --posix and --traditional modes.
With sed, as a mental exercise:
sed -E '
s/(^|[[:blank:]])([xyXY])([xyXY].{2}[^[:space:]]*)([^[:space:]])/\1#\3#/g
:A
s/(#[^#[:blank:]]*)[^#[:blank:]](#[#]*)/\1#\2/g
tA
s/#/*/g'
This requires that the text not contain any # characters.
A simple POSIX awk version:
awk '{for(i=1;i<=NF;++i) if ($i ~ /^[xX][yY]/ && length($i)>=5) gsub(/./,"*",$i)}1'
This, however, does not keep the spacing intact (multiple spaces are converted to a single one); the following does:
awk 'BEGIN{RS=ORS=" "}(/^[xX][yY]/ && length($0)>=5){gsub(/./,"*")}1'
You may use awk:
s='This is a test line xy12345 xy12 Xy123 Xy11111 which I need to replace specific strings xy123 xy1234 xy12345 xy123456 xy1234567'
echo "$s" | awk 'BEGIN {
ORS=RS=" "
}
{
for(i=1;i<=NF;i++) {
if(length($i) >= 5 && $i~/^[Xx][Yy][a-zA-Z0-9]+$/)
gsub(/./,"*", $i);
print $i;
}
}'
A one liner:
awk 'BEGIN {ORS=RS=" "} { for(i=1;i<=NF;i++) {if(length($i) >= 5 && $i~/^[Xx][Yy][a-zA-Z0-9]+$/) gsub(/./,"*", $i); print $i; } }'
# => This is a test line ******* xy12 ***** ******* which I need to replace specific strings ***** ****** ******* ******** *********
Details
BEGIN {ORS=RS=" "} - start of the awk: set the output record separator equal to the space record separator
{ for(i=1;i<=NF;i++) {if(length($i) >= 5 && $i~/^[Xx][Yy][a-zA-Z0-9]+$/) gsub(/./,"*", $i); print $i; } } - iterate over each field (with for(i=1;i<=NF;i++)) and if the current field ($i) length is 5 or more (length($i) >= 5) and it matches xy in any case followed by 1 or more alphanumeric chars ($i~/^[Xx][Yy][a-zA-Z0-9]+$/), then replace each char with * (with gsub(/./,"*", $i)) and then print the current field value.
This might work for you (GNU sed):
sed -r ':a;/\bxy\S{5,}\b/I!b;s//\n&\n/;h;s/[^\n]/*/g;H;g;s/\n.*\n(.*)\n.*\n(.*)\n.*/\2\1/;ta' file
If the current line does not contain a string which begins with xy case insensitive and 5 or more following characters, then there is no work to be done.
Otherwise:
Surround the string by newlines
Copy the pattern space (PS) to the hold space (HS)
Replace all characters other than newlines with *'s
Append the PS to the HS
Replace the PS with the HS
Swap the strings between the newlines retaining the remainder of the first line
Repeat
Extract household data corresponding to a keyword.
Z1/NEW "THE_PALM" 769 121003 1545
NEW HOUSE IN
SOMETHING SOMETHING
SN HOUSE CLASS
FIRST PSD93_PU 1579
CHAIRS
WOOD
SILVER SPOON
GREEN GARDEN
Z1/OLD "THE_ROSE" 786 121003 1343
NEW HOUSE OUT
SOMETHING NEW
SN HOUSE CLASS
FIRST_O PSD1000_ST 1432
CHAIRS
WOOD
GREEN GARDEN
BLACK PAINT
Z1/OLD "The_PURE" 126 121003 3097
NEW HOUSE IN
SOMETHING OLD
SN HOUSE CLASS
LAST_O JD4_GOLD 1076
CHAIRS
SILVER SPOON
I have a very large file. There is a list of items about the house at the end of every description. For the houses containing SILVER SPOON, I want to extract the HOUSE ID (as in the data, PSD93_PU) and the date (121003). I tried the following:
awk 'c-->0;$0~s{if(b)for(c=b+1;c>1;c--)print r[(NR-c+1)%b];print;c=a}b{r[NR%b]=$0}' b=7 a=0 s="SILVER" infile > outfile
But the problem is that the number of lines above the keyword SILVER are so random, that I can't figure out the solution.
assuming each new house starts with Z1
$ awk '$1 ~ /^Z1/ { date=$4; id=""; f=0; next; } \
$1 == "SN" { f=1; next; } \
f == 1 { id=$2; f=0; next; } \
$1" "$2 == "SILVER SPOON" { print id,date }' file
That way, on a new house, all vars are reset and the date is captured
if an SN is matched then the next line contains the id
get the id from the line
if "SILVER SPOON" is found print the id and date
if it is not found, a new house will be met and the vars are reset.
test with given data:
$ awk '$1 ~ /^Z1/ { date=$4; id=""; f=0; next; } $1 == "SN" { f=1; next; } f == 1 { id=$2; f=0; next; } $1 == "SILVER" && $2 == "SPOON" { print id,date }' file
PSD93_PU 121003
JD4_GOLD 121003
note :
if anybody knows how and if $1 == "SILVER" && $2 == "SPOON" can be merge together in one statement that'd be nice :) -- like: $1,$2 == "SILVER SPOON"
edit:
it can be done with $1" "$2 == "SILVER SPOON".
one could possibly omit the space and do $1$2 == "SILVERSPOON", but that would match even if $2 was empty and $1 contained the whole string, or $1 was SILVERSPO and $2 was ON. So the space acts as a strict separator.
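A quick illustration of that pitfall, on made-up input:

```shell
# "$1 $2" cannot tell where field 1 ends and field 2 begins;
# "$1 \" \" $2" can.
printf 'SILVERSPO ON\nSILVER SPOON\n' | awk '
$1 $2 == "SILVERSPOON"      { print NR ": loose match" }
$1 " " $2 == "SILVER SPOON" { print NR ": strict match" }'
```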
Using sed:
sed -n -e 's/^Z1[^"]*"[^"]*"[ \t]*[0-9]*[ \t]*\([0-9]*\).*/\1/p' \
       -e '/^SN[ \t]*HOUSE/ { n; s/^[^ \t]*[ \t]*\([^ \t]*\).*/\1/p }' file
Firstly, we call sed with the -n option in order to tell it to print only what we tell it to.
The first command will search for a particular pattern to extract the date. The pattern consists of:
^Z1: A line starting with the string "Z1".
[^"]*: zero or more characters that aren't double quotes
": double quote character
[^"]*: zero or more characters that aren't double quotes
[ \t]*: zero or more characters that are either tabs or spaces
[0-9]*: zero or more digits
[ \t]*: zero or more characters that are either tabs or spaces
\([0-9]*\): zero or more digits. The backslashed parentheses are used in order to capture the match, i.e. the match is stored into an auxiliary variable \1.
.*: zero or more characters, effectively skipping all characters until the end of the line.
This matched line is then replaced with \1, which holds our captured content: the date. The p after the command tells sed to print the result.
The second line contains two commands grouped together (inside braces) so that they are only executed on the "address" before the braces. The address is a pattern, so that it is executed on every line that matches the pattern. The pattern consists of a line that starts with "SN" followed by a sequence of spaces or tabs, followed by the string "HOUSE".
When the pattern matches, we first execute the n next command, which loads the next line from input. Then, we extract the ID from the new line, in a way analogous to extracting the date. The substitute pattern to match is:
^[^ \t]*: a string that starts with zero or more characters that aren't spaces or tabs (whitespace).
[ \t]*: then has a sequence of zero or more spaces and/or tabs.
\([^ \t]*\): a sequence of non whitespace characters is then captured
.*: the remaining characters are matched so that they are skipped.
The replacement becomes the captured ID, and again we tell sed to print it out.
This will print out a line containing the date, followed by a line containing the ID. If you want a line in the format ID date, you can pipe the output of sed into another sed instance, as follows:
sed -n -e [...] | sed -n -e 'h;n;G;s/\n/ /p'
This second sed instance performs the following operations:
Reads a line; the h command stores it into the hold space (an auxiliary buffer).
Reads the next line with the n command.
The G command appends the contents of the hold space to the pattern space (the working buffer), so now we have the ID line followed by the date line.
Finally, we replace the newline character with a space, joining the lines into a single line, and print it with the p flag (needed because -n suppresses automatic printing).
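Here is a self-contained sketch of that join step on two sample date/ID pairs; note the -n option plus the p flag, so only the joined line is printed (without -n, the n command would also auto-print the first line of each pair):

```shell
# Pair up alternating date/ID lines into "ID date" lines.
printf '121003\nPSD93_PU\n121003\nJD4_GOLD\n' | sed -n 'h;n;G;s/\n/ /p'
```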
Hope this helps =)
If your records are separated by two or three blank lines and the line spacing before the household items is consistent, you could use GNU awk like this:
awk -r 'BEGIN { RS="\n{3}\n*"; FS="\n" } /SILVER SPOON/ { split($1, one, OFS); split($6, two, OFS); print two[2], one[4] }' file.txt
Results:
PSD93_PU 121003
JD4_GOLD 121003
I need to delete a matching line and the one before it.
e.g. in the file below I need to remove lines 1 & 2.
I tried grep -v -B 1 "page.of." 1.txt
and I expected it to not print the matching lines and their context.
I tried How do I delete a matching line, the line above and the one below it, using sed? but could not understand the sed usage.
---1.txt--
**document 1** -> 1
**page 1 of 2** -> 2
testoing
testing
super crap blah
**document 1**
**page 2 of 2**
You want to do something very similar to the answer given
sed -n '
/page . of ./ { #when pattern matches
n #read the next line into the pattern space
x #exchange the pattern and hold space
d #skip the current contents of the pattern space (previous line)
}
x #for each line, exchange the pattern and hold space
1d #skip the first line
p #and print the contents of pattern space (previous line)
$ { #on the last line
x #exchange pattern and hold, pattern now contains last line read
p #and print that
}'
And as a single line
sed -n '/page . of ./{n;x;d;};x;1d;p;${x;p;}' 1.txt
grep -v -B1 doesn't work because it will skip those lines but will include them later on (due to the -B1). To check this out, try the command on:
**document 1** -> 1
**page 1 of 2** -> 2
**document 1**
**page 2 of 2**
**page 3 of 2**
You will notice that a page line still appears in the output: it is not selected itself, but -B1 prints it anyway as the line immediately before the next selected **document 1** line.
There's a simple awk solution:
awk '!/page.*of.*/ { if (m) print buf; buf=$0; m=1 } /page.*of.*/ { m=0 } END { if (m) print buf }' 1.txt
The awk command says the following:
If the current line contains that "page ... of" string, it signals that the previous line must not be printed. Otherwise, the previous line (stored in buf) is printed and the buffer is reset to the current line (hence forcing it to lag by 1). The END block flushes the final buffered line.
grep -vf <(grep -B1 "page.*of" file | sed '/^--$/d') file
Not too familiar with sed, but here's a perl expression to do the trick:
cat FILE | perl -e '@a = <STDIN>;
for( $i=0 ; $i <= $#a ; $i++ ) {
  if($i > 0 && $a[$i] =~ /xxxx/) {
    $a[$i] = "";
    $a[$i-1] = "";
  }
} print @a;'
edit:
where "xxxx" is what you are trying to match.
Thanks, I was trying to use the awk command given by Foo Bah
to delete the matching line and the previous one. I have to use it multiple times, so for the matching part I use a variable. The given awk command works, but when using a variable it does not work (i.e. it does not delete the matching & prev. line). I tried:
awk -vvar="page.*of.*" '!/$var/ { if (m) print buf; buf=$0; m=1} /$var/ {m=0}' 1.txt
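For reference, a sketch of why this fails and one way around it: inside /…/, $var is literal regex text, not the variable's value, so the pattern never matches the intended string. Matching against the variable with the ~ and !~ operators works (this sketch also flushes the trailing buffer in an END block, a detail the original one-liner omits):

```shell
# Dynamic regex from a variable: use $0 ~ var / $0 !~ var, not /$var/.
printf 'keep me\npage 1 of 2\nkeep too\n' | awk -v var="page.*of.*" '
$0 !~ var { if (m) print buf; buf = $0; m = 1 }
$0 ~ var  { m = 0 }
END { if (m) print buf }'
```

Here "keep me" is deleted because it directly precedes a matching line, and the matching line itself is deleted, leaving only "keep too".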