How to extract everything between two keywords in perl - text-extraction

I need to extract everything between start and end.
The code below works only if there is no \n between them.
$mystring = "The start text always precedes \n the end of the text.";
if($mystring =~ m/start(.*?)end/) {
print $1;
}
The output should be: text always precedes \n the

Study perlre, in particular the /s modifier.
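In Perl this means writing m/start(.*?)end/s, since /s lets . match a newline. For comparison, here is a minimal sketch of the same idea in Python, where the re.DOTALL flag plays the role of /s. The helper name extract_between is made up for this illustration.

```python
import re

def extract_between(text, start="start", end="end"):
    """Return the text between the first `start` and the next `end`, or None."""
    # re.DOTALL lets "." match "\n", like Perl's /s modifier.
    pattern = re.escape(start) + r"(.*?)" + re.escape(end)
    match = re.search(pattern, text, re.DOTALL)
    return match.group(1) if match else None

mystring = "The start text always precedes \n the end of the text."
print(repr(extract_between(mystring)))
```

Without the flag the search finds nothing here, because . stops at the newline.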

Related

removing new line "\r\n" and ^M characters in all column except last one { in UNIX}

I got a solution for formatting a Unix file containing ^M and "\r\n", as per the link shared earlier: "https://stackoverflow.com/questions/68919927/removing-new-line-characters-in-csv-file-from-inside-columns-in-unix".
But the current ask is to get rid of the "\r\n" and ^M characters in every column of the file except the last one, so that the "\r\n" (with ^M) in the last column can still be used to format the file with the command awk -v RS='\r\n' '{gsub(/\n/,"")} 1' test.csv
sample data is ::
$ cat -v test.csv
234,aa,bb,cc,30,dd^M
22,cc,^M
ff,dd,^M
40,gg^M
pxy,aa,,cc,^M
40
,dd^M
Current Output::
234,aa,bb,cc,30,dd
22,cc,
ff,dd,
40,gg
pxy,aa,,cc,
40,dd
Expected output::
234,aa,bb,cc,30,dd
22,cc,ff,dd,40,gg
pxy,aa,,cc,40,dd
Would you please try a perl solution:
perl -0777 -pe 's/\r?\n(?=,)//g; s/(?<=,)\r?\n//g; s/\r//g' test.csv
Output:
234,aa,bb,cc,30,dd
22,cc,ff,dd,40,gg
pxy,aa,,cc,40,dd
The -0777 option tells perl to slurp all lines including line endings at once.
The -p option makes perl loop over the input and print the result, and -e takes the next argument as the perl script.
The regex \r?\n(?=,) matches zero or one CR character followed by
a NL character, with a positive lookahead for a comma.
Then the substitution s/\r?\n(?=,)//g removes the line endings that match
the condition above. The following comma is not removed, because lookaround
assertions do not consume the characters they test.
The substitution s/(?<=,)\r?\n//g is the mirrored version, which removes
the line endings after a comma.
The final s/\r//g removes any CR characters that still remain.
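The three substitutions carry over almost unchanged to other regex engines. Below is a rough Python sketch of the same steps, run on the sample data from the question; the function name join_broken_rows is hypothetical.

```python
import re

def join_broken_rows(data):
    """Join CSV rows that were split next to a comma, then drop stray CRs."""
    data = re.sub(r"\r?\n(?=,)", "", data)   # line ending followed by a comma
    data = re.sub(r"(?<=,)\r?\n", "", data)  # line ending preceded by a comma
    return data.replace("\r", "")            # any CR characters still left

sample = (
    "234,aa,bb,cc,30,dd\r\n22,cc,\r\nff,dd,\r\n40,gg\r\n"
    "pxy,aa,,cc,\r\n40\n,dd\r\n"
)
print(join_broken_rows(sample), end="")
```

Like the perl one-liner, this works on the whole input held in memory as one string.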
[Edit]
As the perl script above slurps all lines into the memory, it may be slow if the file is huge. Here is an alternative which processes the input line by line using a state machine.
awk -v ORS="" '                 # empty the output record separator
/^\r?$/ {next}                  # skip blank lines
f && !/^,/ {print "\n"}         # break the line if the flag is set and the line does not start with a comma
{
    sub(/\r$/, "")              # remove trailing CR character
    print                       # print current line (w/o newline)
    if ($0 ~ /,$/) f = 0        # if the line has a trailing comma, clear the flag
    else f = 1                  # if the line properly ends, set the flag
}
END {
    print "\n"                  # append the newline to the last line
}
' test.csv
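The same line-by-line state machine can be sketched in Python as a generator, which also avoids holding the whole file in memory. This is only an illustration: join_rows is a hypothetical name, and it strips every trailing CR or LF from a line, where the awk script removes a single trailing CR.

```python
import io

def join_rows(lines):
    """Yield complete rows from an iterable of raw lines, one line at a time."""
    row = ""           # the row being assembled
    complete = False   # True when the row so far does not end with a comma
    for raw in lines:
        line = raw.rstrip("\r\n")
        if not line:                       # skip blank lines
            continue
        if complete and not line.startswith(","):
            yield row                      # the previous row is finished
            row = ""
        row += line
        complete = not line.endswith(",")
    if row:
        yield row                          # flush the last row

sample = io.StringIO(
    "234,aa,bb,cc,30,dd\r\n22,cc,\r\nff,dd,\r\n40,gg\r\n"
    "pxy,aa,,cc,\r\n40\n,dd\r\n"
)
for row in join_rows(sample):
    print(row)
```

For a real file, passing an open file handle instead of the StringIO object should work the same way, since both are iterated line by line.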
BTW, if you want blank lines between the rows, as in the posted expected output, which looks like:
234,aa,bb,cc,30,dd

22,cc,ff,dd,40,gg

pxy,aa,,cc,40,dd
then append another \n in the print line as:
f && !/^,/ {print "\n\n"}

How to remove the new line when reading from UNIX process groovy? [duplicate]

I have a string that contains some text followed by a blank line. What's the best way to keep the part with text, but remove the whitespace newline from the end?
Use String.trim() method to get rid of whitespaces (spaces, new lines etc.) from the beginning and end of the string.
String trimmedString = myString.trim();
String.replaceAll("[\n\r]", "");
This Java code does exactly what is asked in the title of the question, that is "remove newlines from beginning and end of a string-java":
String.replaceAll("^[\n\r]", "").replaceAll("[\n\r]$", "")
Remove newlines only from the end of the line:
String.replaceAll("[\n\r]$", "")
Remove newlines only from the beginning of the line:
String.replaceAll("^[\n\r]", "")
tl;dr
String cleanString = dirtyString.strip() ; // Call new `String::strip` method.
String::strip…
The old String::trim method has a strange definition of whitespace.
As discussed here, Java 11 adds new strip… methods to the String class. These use a more Unicode-savvy definition of whitespace. See the rules of this definition in the class JavaDoc for Character::isWhitespace.
Example code.
String input = " some Thing ";
System.out.println("before->>"+input+"<<-");
input = input.strip();
System.out.println("after->>"+input+"<<-");
Or you can strip just the leading or just the trailing whitespace.
You do not mention exactly what code point(s) make up your newlines. I imagine your newline is likely included in this list of code points targeted by strip:
It is a Unicode space character (SPACE_SEPARATOR, LINE_SEPARATOR, or PARAGRAPH_SEPARATOR) but is not also a non-breaking space ('\u00A0', '\u2007', '\u202F').
It is '\t', U+0009 HORIZONTAL TABULATION.
It is '\n', U+000A LINE FEED.
It is '\u000B', U+000B VERTICAL TABULATION.
It is '\f', U+000C FORM FEED.
It is '\r', U+000D CARRIAGE RETURN.
It is '\u001C', U+001C FILE SEPARATOR.
It is '\u001D', U+001D GROUP SEPARATOR.
It is '\u001E', U+001E RECORD SEPARATOR.
It is '\u001F', U+0
If your string is potentially null, consider using StringUtils.trim() - the null-safe version of String.trim().
If you only want to remove line breaks (not spaces or tabs) at the beginning and end of a String (not in between), then you can use this approach:
Use a regular expression to remove carriage returns (\\r) and line feeds (\\n) from the beginning (^) and end ($) of a string:
s = s.replaceAll("(^[\\r\\n]+|[\\r\\n]+$)", "")
Complete Example:
public class RemoveLineBreaks {
public static void main(String[] args) {
var s = "\nHello world\nHello everyone\n";
System.out.println("before: >"+s+"<");
s = s.replaceAll("(^[\\r\\n]+|[\\r\\n]+$)", "");
System.out.println("after: >"+s+"<");
}
}
It outputs:
before: >
Hello world
Hello everyone
<
after: >Hello world
Hello everyone<
I'm going to add an answer to this as well because, while I had the same question, the provided answer did not suffice. Given some thought, I realized that this can be done very easily with a regular expression.
To remove newlines from the beginning:
// Trim left
String[] a = "\n\nfrom the beginning\n\n".split("^\\n+", 2);
System.out.println("-" + (a.length > 1 ? a[1] : a[0]) + "-");
and end of a string:
// Trim right
String z = "\n\nfrom the end\n\n";
System.out.println("-" + z.split("\\n+$", 2)[0] + "-");
I'm certain that this is not the most performance efficient way of trimming a string. But it does appear to be the cleanest and simplest way to inline such an operation.
Note that the same method can be done to trim any variation and combination of characters from either end as it's a simple regex.
Try this
function replaceNewLine(str) {
return str.replace(/[\n\r]/g, "");
}
String trimStartEnd = "\n TestString1 linebreak1\nlinebreak2\nlinebreak3\n TestString2 \n";
System.out.println("Original String : [" + trimStartEnd + "]");
System.out.println("-----------------------------");
System.out.println("Result String : [" + trimStartEnd.replaceAll("^(\\r\\n|[\\n\\x0B\\x0C\\r\\u0085\\u2028\\u2029])|(\\r\\n|[\\n\\x0B\\x0C\\r\\u0085\\u2028\\u2029])$", "") + "]");
Start of a string = ^ ,
End of a string = $ ,
regex combination = | ,
Linebreak = \r\n|[\n\x0B\x0C\r\u0085\u2028\u2029]
Another elegant solution.
String myString = "\nLogbasex\n";
myString = org.apache.commons.lang3.StringUtils.strip(myString, "\n");
For anyone else looking for answer to the question when dealing with different linebreaks:
string.replaceAll("(\n|\r|\r\n)$", ""); // Java 7
string.replaceAll("\\R$", ""); // Java 8
This should remove exactly the last line break, preserve all other whitespace in the string, and work with Unix (\n), Windows (\r\n) and old Mac (\r) line breaks: https://stackoverflow.com/a/20056634, https://stackoverflow.com/a/49791415. "\\R" is a linebreak matcher introduced in Java 8 in the Pattern class: https://docs.oracle.com/javase/8/docs/api/java/util/regex/Pattern.html
This passes these tests:
// Windows:
value = "\r\n test \r\n value \r\n";
assertEquals("\r\n test \r\n value ", value.replaceAll("\\R$", ""));
// Unix:
value = "\n test \n value \n";
assertEquals("\n test \n value ", value.replaceAll("\\R$", ""));
// Old Mac:
value = "\r test \r value \r";
assertEquals("\r test \r value ", value.replaceAll("\\R$", ""));
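For readers working outside Java, here is a rough Python sketch of the same "remove one trailing line break" behaviour. It is not a translation of the Java code above: chomp is a hypothetical helper name, and the pattern uses \Z, which in Python matches only at the very end of the string.

```python
import re

def chomp(text):
    """Remove one trailing line break (\\r\\n, \\n or \\r), if there is one."""
    # \r\n is listed first so a Windows line ending is removed as a unit.
    return re.sub(r"(?:\r\n|\n|\r)\Z", "", text)

for value in ("\r\n test \r\n value \r\n", "\n test \n value \n", "\r test \r value \r"):
    print(repr(chomp(value)))
```

Only the final line break is removed; earlier line breaks and other whitespace are left untouched.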
String text = readFileAsString("textfile.txt");
text = text.replace("\n", "").replace("\r", "");

Split line into multiple lines of 42 Unix after last given char

I have a text file in unix formed from multiple long lines
ALTER Tit como(titel('42423432;434235111;757567562;2354679;5543534;6547673;32322332;54545453'))
ALTER Mit como(Alt('432322;434434211;754324237562;2354679;5543534;6547673;32322332;54545453'))
I need to split each line into multiple lines of no more than 42 characters.
The split should be made after the last ";" that still fits,
so my ideal output file would be:
ALTER Tit como(titel('42423432;434235111; -
757567562;2354679;5543534;6547673; -
32322332;54545453'))
ALTER Mit como(Alt('432322;434434211; -
754324237562;2354679;5543534;6547673; -
32322332;54545453'))
I used fold -w 42 givenfile.txt | sed 's/ $/ -/g'
It splits the lines, but it doesn't add the "-" at the end of each line and doesn't split after the ";".
Any help is much appreciated.
Thanks!
awk -F';' '
w{
    print""
}
{
    w=length($1)
    printf "%s",$1
    for (i=2;i<=NF;i++){
        if ((w+length($i)+1)<42){
            w+=length($i)+1
            printf";%s",$i
        } else {
            w=length($i)
            printf"; -\n%s",$i
        }
    }
}
END{
    print""
}
' file
This produces the output:
ALTER Tit como(titel('42423432;434235111; -
757567562;2354679;5543534;6547673; -
32322332;54545453'))
ALTER Mit como(Alt('432322;434434211; -
754324237562;2354679;5543534;6547673; -
32322332;54545453'))
How it works
Awk implicitly loops through each line of its input and each line is divided into fields. This code uses a single variable w to keep track of the current width of the output line.
-F';'
Tell awk to break fields on semicolons.
w{print""}
If the last line was not completed, w>0, then print a newline to terminate it before we start with a new line.
w=length($1); printf "%s",$1
Print the first field of the new line and set w according to its length.
Loop over the remaining fields:
for (i=2;i<=NF;i++){
    if ((w+length($i)+1)<42){
        w+=length($i)+1
        printf";%s",$i
    } else {
        w=length($i)
        printf"; -\n%s",$i
    }
}
This loops over the second to final fields of this line. Whenever we reach the point where we can't print another field without exceeding the 42 character limit, we print ; -\n.
END{print""}
Print a newline at the end of the file.
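The width bookkeeping described above can be sketched in Python as follows. This is an illustration only: wrap_at_semicolons is a hypothetical name, and the sketch mirrors the awk logic, so the width counts the fields and semicolons but not the " -" continuation marker.

```python
def wrap_at_semicolons(line, width=42):
    """Break `line` after a ';' whenever the next field would not fit."""
    fields = line.split(";")
    pieces = [fields[0]]
    w = len(fields[0])                 # current width of the output line
    for field in fields[1:]:
        if w + len(field) + 1 < width:
            w += len(field) + 1        # field fits: keep it on this line
            pieces.append(";" + field)
        else:
            w = len(field)             # start a new output line
            pieces.append("; -\n" + field)
    return "".join(pieces)

text = "ALTER Tit como(titel('42423432;434235111;757567562;2354679;5543534;6547673;32322332;54545453'))"
print(wrap_at_semicolons(text))
```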
This might work for you (GNU sed):
sed -r 's/.{1,42}$|.{1,41};/& -\n/g;s/...$//' file
This appends " -\n" to every run of 1 to 41 characters followed by a ;, and to the final run of 1 to 42 characters before the end of the line. The last piece therefore ends with three characters too many (the space, the - and the newline), and the second substitution deletes them.

Extracting a random pattern after matching a word in following lines

Extract household data corresponding to a keyword.
Z1/NEW "THE_PALM" 769 121003 1545
NEW HOUSE IN
SOMETHING SOMETHING
SN HOUSE CLASS
FIRST PSD93_PU 1579
CHAIRS
WOOD
SILVER SPOON
GREEN GARDEN
Z1/OLD "THE_ROSE" 786 121003 1343
NEW HOUSE OUT
SOMETHING NEW
SN HOUSE CLASS
FIRST_O PSD1000_ST 1432
CHAIRS
WOOD
GREEN GARDEN
BLACK PAINT
Z1/OLD "The_PURE" 126 121003 3097
NEW HOUSE IN
SOMETHING OLD
SN HOUSE CLASS
LAST_O JD4_GOLD 1076
CHAIRS
SILVER SPOON
I have a very large file. At the end of every description there is a list of items in the house. For the houses containing SILVER SPOON, I want to extract the house ID (e.g. PSD93_PU) and the date (121003). I tried the following:
awk 'c-->0;$0~s{if(b)for(c=b+1;c>1;c--)print r[(NR-c+1)%b];print;c=a}b{r[NR%b]=$0}' b=7 a=0 s="SILVER" infile > outfile
But the problem is that the number of lines above the keyword SILVER varies so much that I can't figure out a solution.
assuming each new house starts with Z1
$ awk '$1 ~ /^Z1/ { date=$4; id=""; f=0; next; } \
$1 == "SN" { f=1; next; } \
f == 1 { id=$2; f=0; next; } \
$1" "$2 == "SILVER SPOON" { print id,date }' file
How it works:
on a new house, reset all variables and store the date
if an SN line is matched, the next line contains the ID
get the ID from that line
if "SILVER SPOON" is found, print the ID and date
if it is not found, the next house resets the variables.
test with given data:
$ awk '$1 ~ /^Z1/ { date=$4; id=""; f=0; next; } $1 == "SN" { f=1; next; } f == 1 { id=$2; f=0; next; } $1 == "SILVER" && $2 == "SPOON" { print id,date }' file
PSD93_PU 121003
JD4_GOLD 121003
note:
if anybody knows whether and how $1 == "SILVER" && $2 == "SPOON" can be merged into one statement, that'd be nice :) -- like: $1,$2 == "SILVER SPOON"
edit:
it can be done with $1" "$2 == "SILVER SPOON".
One could possibly omit the space and do $1$2 == "SILVERSPOON", but that would match even if $2 was empty and $1 contained the whole string, or if $1 was SILVERSPO and $2 was ON. So the space in between acts as a strict match.
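The same per-record state tracking can be sketched in Python. This is a minimal illustration that assumes the layout of the sample data (a Z1 line with the date in the fourth field, and the ID on the line after SN); houses_with is a hypothetical name. Unlike the awk script, it skips blank lines.

```python
def houses_with(lines, item="SILVER SPOON"):
    """Yield (house_id, date) for every record that lists `item`."""
    wanted = item.split()
    date = house_id = None
    expect_id = False
    for line in lines:
        fields = line.split()
        if not fields:                      # ignore blank lines
            continue
        if fields[0].startswith("Z1"):      # new house: reset state, keep the date
            date, house_id, expect_id = fields[3], None, False
        elif fields[0] == "SN":             # the next line carries the ID
            expect_id = True
        elif expect_id:
            house_id, expect_id = fields[1], False
        elif fields[:2] == wanted:
            yield house_id, date

SAMPLE = """\
Z1/NEW "THE_PALM" 769 121003 1545
NEW HOUSE IN
SOMETHING SOMETHING
SN HOUSE CLASS
FIRST PSD93_PU 1579
CHAIRS
WOOD
SILVER SPOON
GREEN GARDEN
Z1/OLD "THE_ROSE" 786 121003 1343
NEW HOUSE OUT
SOMETHING NEW
SN HOUSE CLASS
FIRST_O PSD1000_ST 1432
CHAIRS
WOOD
GREEN GARDEN
BLACK PAINT
Z1/OLD "The_PURE" 126 121003 3097
NEW HOUSE IN
SOMETHING OLD
SN HOUSE CLASS
LAST_O JD4_GOLD 1076
CHAIRS
SILVER SPOON
"""

for house_id, date in houses_with(SAMPLE.splitlines()):
    print(house_id, date)
```

Because it is a generator over lines, it can also be fed an open file handle, which matters for the very large file mentioned in the question.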
Using sed:
sed -n -e 's/^Z1[^"]*"[^"]*"[ \t]*[0-9]*[ \t]*\([0-9]*\).*/\1/p' \
    -e '/^SN[ \t]*HOUSE/ { n; s/^[^ \t]*[ \t]*\([^ \t]*\).*/\1/p }'
Firstly, we call sed with the -n option in order to tell it to print only what we tell it to.
The first command will search for a particular pattern to extract the date. The pattern consists of:
^Z1: A line starting with the string "Z1".
[^"]*: zero or more characters that aren't double quotes
": double quote character
[^"]*: zero or more characters that aren't double quotes
": double quote character
[ \t]*: zero or more characters that are either tabs or spaces
[0-9]*: zero or more digits
[ \t]*: zero or more characters that are either tabs or spaces
\([0-9]*\): zero or more digits. The backslashed parenthesis are used in order to capture the match, ie. the match is stored into an auxiliary variable \1.
.*: zero or more characters, effectively skipping all characters until the end of the line.
This matched line is then replaced with \1, which holds our captured content: the date. The p after the command tells sed to print the result.
The second line contains two commands grouped together (inside braces) so that they are only executed on the "address" before the braces. The address is a pattern, so that it is executed on every line that matches the pattern. The pattern consists of a line that starts with "SN" followed by a sequence of spaces or tabs, followed by the string "HOUSE".
When the pattern matches, we first execute the n next command, which loads the next line from input. Then, we extract the ID from the new line, in a way analogous to extracting the date. The substitute pattern to match is:
^[^ \t]*: a string that starts with zero or more characters that aren't spaces or tabs (whitespace).
[ \t]*: then has a sequence of zero or more spaces and/or tabs.
\([^ \t]*\): a sequence of non whitespace characters is then captured
.*: the remaining characters are matched so that they are skipped.
The replacement becomes the captured ID, and again we tell sed to print it out.
This will print out a line containing the date, followed by a line containing the ID. If you want a line in the format ID date, you can pipe the output of sed into another sed instance, as follows:
sed -n -e [...] | sed -n -e 'h;n;G;s/\n/ /p'
This sed instance performs the following operations:
Reads a line, and the h command tells it to store the line into the hold space (an auxiliary buffer).
Read the next line with the n command. Because this instance also runs with -n, the n command does not print the date line on its own before replacing it.
The G get command will append the contents of the hold space into the pattern space (the working buffer), so now we have the ID line followed by the date line.
Finally, we replace the newline character with a space, so the lines are joined into a single line, and the p flag prints the result.
Hope this helps =)
If your records are separated by two or three blank lines and the line spacing before the household items is consistent, you could use GNU awk like this:
awk -r 'BEGIN { RS="\n{3}\n*"; FS="\n" } /SILVER SPOON/ { split($1, one, OFS); split($6, two, OFS); print two[2], one[4] }' file.txt
Results:
PSD93_PU 121003
JD4_GOLD 121003

Using Vim, how can I make CSS rules into one liners?

I would like to come up with a Vim substitution command to turn multi-line CSS rules, like this one:
#main {
padding: 0;
margin: 10px auto;
}
into compacted single-line rules, like so:
#main {padding:0;margin:10px auto;}
I have a ton of CSS rules that are taking up too many lines, and I cannot figure out the :%s/ commands to use.
Here's a one-liner:
:%s/{\_.\{-}}/\=substitute(submatch(0), '\n', '', 'g')/
\_. matches any character, including a newline, and \{-} is the non-greedy version of *, so {\_.\{-}} matches everything between a matching pair of curly braces, inclusive.
The \= allows you to substitute the result of a vim expression, which we here use to strip out all the newlines '\n' from the matched text (in submatch(0)) using the substitute() function.
The inverse (converting the one-line version to multi-line) can also be done as a one liner:
:%s/{\_.\{-}}/\=substitute(submatch(0), '[{;]', '\0\r', 'g')/
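Outside Vim, the first substitution (compacting each rule) can be approximated with a regex callback. The Python sketch below is only an illustration of that idea; compact_rules is a hypothetical name, and like the Vim command it only deletes newlines, so any indentation inside the braces is kept.

```python
import re

def compact_rules(css):
    """Collapse each {...} block onto one line by deleting its newlines."""
    # Non-greedy .*? with re.DOTALL mirrors Vim's {\_.\{-}}: it matches from
    # an opening brace to the next closing brace, newlines included.
    return re.sub(
        r"\{.*?\}",
        lambda m: m.group(0).replace("\n", ""),
        css,
        flags=re.DOTALL,
    )

css = "#main {\npadding: 0;\nmargin: 10px auto;\n}\n"
print(compact_rules(css), end="")
```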
If you are at the beginning or end of the rule, V%J will join it into a single line:
Go to the opening (or closing) brace
Hit V to enter visual mode
Hit % to match the other brace, selecting the whole rule
Hit J to join the lines
Try something like this:
:%s/{\n/{/g
:%s/;\n/;/g
:%s/{\s\+/{/g
:%s/;\s\+/;/g
This removes the newlines after opening braces and semicolons ('{' and ';') and then removes the extra whitespace between the concatenated lines.
If you want to change the file, go for rampion's solution.
If you don't want (or can't) change the file, you can play with a custom folding as it permits to choose what and how to display the folded text. For instance:
" {rtp}/fold/css-fold.vim
" [-- local settings --] {{{1
setlocal foldexpr=CssFold(v:lnum)
setlocal foldtext=CssFoldText()
let b:width1 = 20
let b:width2 = 15
nnoremap <buffer> + :let b:width2+=1<cr><c-l>
nnoremap <buffer> - :let b:width2-=1<cr><c-l>
" [-- global definitions --] {{{1
if exists('*CssFold')
  setlocal foldmethod=expr
  " finish
endif
function! CssFold(lnum)
  let cline = getline(a:lnum)
  if cline =~ '{\s*$'
    return 'a1'
  elseif cline =~ '}\s*$'
    return 's1'
  else
    return '='
  endif
endfunction
function! s:Complete(txt, width)
  let length = strlen(a:txt)
  if length > a:width
    return a:txt
  endif
  return a:txt . repeat(' ', a:width - length)
endfunction
function! CssFoldText()
  let lnum = v:foldstart
  let txt = s:Complete(getline(lnum), b:width1)
  let lnum += 1
  while lnum < v:foldend
    let add = s:Complete(substitute(getline(lnum), '^\s*\(\S\+\)\s*:\s*\(.\{-}\)\s*;\s*$', '\1: \2;', ''), b:width2)
    if add !~ '^\s*$'
      let txt .= ' ' . add
    endif
    let lnum += 1
  endwhile
  return txt. '}'
endfunction
I leave the sorting of the fields as an exercise. Hint: get all the lines between v:foldstart+1 and v:foldend in a List, sort the list, build the string, and that's all.
I won’t answer the question directly; instead, I suggest that you reconsider your needs. I think that your “bad” example is in fact the better one. It is more readable, and easier to modify and reason about. Good indentation is very important not only in programming languages, but also in CSS and HTML.
You mention that CSS rules are “taking up too many lines”. If you are worried about file size, you should consider using CSS and JS minifiers like YUI Compressor instead of making the code less readable.
A convenient way of doing this transformation is to run the following
short command:
:g/{/,/}/j
Go to the first line of the file, and use the command gqG to run the whole file through the formatter. This assumes that runs of nonempty lines should be collapsed throughout the whole file.
