strsplit on periods not in quotes [duplicate]


My program reads a line from a file. This line contains comma-separated text like:
123,test,444,"don't split, this",more test,1
I would like the result of a split to be this:
123
test
444
"don't split, this"
more test
1
If I use String.split(","), I get this:
123
test
444
"don't split
this"
more test
1
In other words: The comma in the substring "don't split, this" is not a separator. How to deal with this?

You can try out this regex:
str.split(",(?=(?:[^\"]*\"[^\"]*\")*[^\"]*$)");
This splits the string on , that is followed by an even number of double quotes. In other words, it splits on comma outside the double quotes. This will work provided you have balanced quotes in your string.
Explanation:
, // Split on comma
(?= // Followed by
(?: // Start a non-capture group
[^"]* // 0 or more non-quote characters
" // 1 quote
[^"]* // 0 or more non-quote characters
" // 1 quote
)* // 0 or more repetition of non-capture group (multiple of 2 quotes will be even)
[^"]* // Finally 0 or more non-quotes
$ // Till the end (This is necessary, else every comma will satisfy the condition)
)
You can even write it like this in your code, using the (?x) modifier with your regex. The modifier ignores any whitespace in your regex, so a regex broken into multiple lines becomes easier to read:
String[] arr = str.split("(?x) " +
", " + // Split on comma
"(?= " + // Followed by
" (?: " + // Start a non-capture group
" [^\"]* " + // 0 or more non-quote characters
" \" " + // 1 quote
" [^\"]* " + // 0 or more non-quote characters
" \" " + // 1 quote
" )* " + // 0 or more repetition of non-capture group (multiple of 2 quotes will be even)
" [^\"]* " + // Finally 0 or more non-quotes
" $ " + // Till the end (This is necessary, else every comma will satisfy the condition)
") " // End look-ahead
);
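To try the pattern end to end, here is a minimal, self-contained sketch. The class and method names are made up for illustration, and the -1 limit is an addition that is not part of the answer above: it keeps trailing empty fields, which a plain split would drop. It still assumes the quotes in the line are balanced.

```java
class LookaheadSplitDemo {
    // Hypothetical helper: split on commas that are followed by an even
    // number of double quotes, i.e. commas outside quoted sections.
    static String[] splitOutsideQuotes(String line) {
        // Limit -1 (an assumption, not in the original answer) keeps
        // trailing empty fields such as the last one in "a,b,".
        return line.split(",(?=(?:[^\"]*\"[^\"]*\")*[^\"]*$)", -1);
    }

    public static void main(String[] args) {
        String line = "123,test,444,\"don't split, this\",more test,1";
        for (String field : splitOutsideQuotes(line)) {
            System.out.println(field);
        }
    }
}
```

Note that the look-ahead rescans the rest of the line for every comma, so the cost grows quadratically with line length. For large inputs a dedicated CSV parser is probably a better fit.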

Why Split when you can Match?
Resurrecting this question because for some reason, the easy solution wasn't mentioned. Here is our beautifully compact regex:
"[^"]*"|[^,]+
This will match all the desired fragments (see demo).
Explanation
With "[^"]*", we match complete "double-quoted strings"
or |
we match [^,]+ any characters that are not a comma.
A possible refinement is to improve the string side of the alternation to allow the quoted strings to include escaped quotes.
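The refinement mentioned above could look like the following sketch. It assumes quotes are escaped with a backslash (\"), which the question does not specify; CSV files more commonly double the quote (""), and that would need a different pattern. The class and method names are illustrative only.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MatchFieldsDemo {
    // Either a quoted string that may contain backslash-escaped characters
    // (assumed escaping convention), or a run of characters that are not commas.
    private static final Pattern FIELD =
            Pattern.compile("\"(?:[^\"\\\\]|\\\\.)*\"|[^,]+");

    // Hypothetical helper: collect every match as one field.
    static List<String> fields(String line) {
        List<String> out = new ArrayList<>();
        Matcher m = FIELD.matcher(line);
        while (m.find()) {
            out.add(m.group());
        }
        return out;
    }

    public static void main(String[] args) {
        // The line is: 123,"say \"hi\", then go",1
        for (String field : fields("123,\"say \\\"hi\\\", then go\",1")) {
            System.out.println(field);
        }
    }
}
```

As with the compact pattern, [^,]+ needs at least one character, so empty fields (as in a,,b) are skipped rather than returned as empty strings.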

Building upon @zx81's answer, because the matching idea is really nice, I've added a call to Java 9's results(), which returns a Stream. Since the OP wanted to use split, I've collected the matches into a String[], as split does.
Caution: if you have spaces after your comma separators (a, b, "c,d"), you need to change the pattern.
Jshell demo
$ jshell
-> String so = "123,test,444,\"don't split, this\",more test,1";
| Added variable so of type String with initial value "123,test,444,"don't split, this",more test,1"
-> Pattern.compile("\"[^\"]*\"|[^,]+").matcher(so).results();
| Expression value is: java.util.stream.ReferencePipeline$Head@2038ae61
| assigned to temporary variable $68 of type java.util.stream.Stream<MatchResult>
-> $68.map(MatchResult::group).toArray(String[]::new);
| Expression value is: [Ljava.lang.String;@6b09bb57
| assigned to temporary variable $69 of type String[]
-> Arrays.stream($69).forEach(System.out::println);
123
test
444
"don't split, this"
more test
1
Code
String so = "123,test,444,\"don't split, this\",more test,1";
Pattern.compile("\"[^\"]*\"|[^,]+")
.matcher(so)
.results()
.map(MatchResult::group)
.toArray(String[]::new);
Explanation
Regex "[^"]" matches: a quote, anything but a quote, a quote.
Regex "[^"]*" matches: a quote, anything but a quote 0 (or more) times, a quote.
That regex needs to go first to "win"; otherwise matching anything but a comma 1 or more times (that is, [^,]+) would "win".
results() requires Java 9 or higher.
It returns Stream<MatchResult>, which I map using group() call and collect to array of Strings. Parameterless toArray() call would return Object[].

You can do this very easily without a complex regular expression:
Split on the character ". You get a list of strings.
Process each string in the list: split every string at an even position in the list (indexing from zero) on "," (you get a list inside a list), and leave every odd-positioned string alone (putting it directly into its own list inside the list).
Join the list of lists, so you end up with a single list.
If you want to handle escaped '"' characters, you have to adapt the algorithm a little (rejoining parts that were split incorrectly, or switching the split to a simple regexp), but the basic structure stays the same.
So basically it is something like this:
public class SplitTest {
public static void main(String[] args) {
final String splitMe="123,test,444,\"don't split, this\",more test,1";
final String[] splitByQuote=splitMe.split("\"");
final String[][] splitByComma=new String[splitByQuote.length][];
for(int i=0;i<splitByQuote.length;i++) {
String part=splitByQuote[i];
if (i % 2 == 0){
splitByComma[i]=part.split(",");
}else{
splitByComma[i]=new String[1];
splitByComma[i][0]=part;
}
}
for (String parts[] : splitByComma) {
for (String part : parts) {
System.out.println(part);
}
}
}
}
This will be much cleaner with lambdas, promised!
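For what it's worth, a stream-based version of the same idea might look like the sketch below. The names are invented, and it assumes balanced, unescaped quotes. Like the loop above, it drops the quote characters. It also filters out the empty strings that splitting a piece such as ,more test,1 produces, which means genuinely empty fields are lost as well.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.IntStream;
import java.util.stream.Stream;

class QuoteSplitStreams {
    // Hypothetical helper implementing the split-on-quote idea with streams.
    static List<String> split(String line) {
        String[] byQuote = line.split("\"");
        return IntStream.range(0, byQuote.length)
                .boxed()
                .flatMap(i -> i % 2 == 0
                        // Even index: text outside quotes, so split it on commas
                        // and drop the empty strings next to the quoted parts.
                        ? Arrays.stream(byQuote[i].split(",")).filter(s -> !s.isEmpty())
                        // Odd index: text inside quotes, kept as a single field.
                        : Stream.of(byQuote[i]))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        split("123,test,444,\"don't split, this\",more test,1")
                .forEach(System.out::println);
    }
}
```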

Please see the code snippet below. It only covers the happy path; change it according to your requirements.
public static String[] splitWithEscape(final String str, char split,
char escapeCharacter) {
final List<String> list = new LinkedList<String>();
char[] cArr = str.toCharArray();
boolean isEscape = false;
StringBuilder sb = new StringBuilder();
for (char c : cArr) {
if (isEscape && c != escapeCharacter) {
sb.append(c);
} else if (c != split && c != escapeCharacter) {
sb.append(c);
} else if (c == escapeCharacter) {
if (!isEscape) {
isEscape = true;
if (sb.length() > 0) {
list.add(sb.toString());
sb = new StringBuilder();
}
} else {
isEscape = false;
}
} else if (c == split) {
list.add(sb.toString());
sb = new StringBuilder();
}
}
if (sb.length() > 0) {
list.add(sb.toString());
}
String[] strArr = new String[list.size()];
return list.toArray(strArr);
}

Related

Control file based attribute population : Unix shell

I have a control file, header.cntrl, which holds the header details. Example below:
cat header.cntrl
id, name, age, location, phone number
Now I am getting files from different sources,
Source 1 is sending input.dat file in the following format
cat input.dat
id, name, age, location, status, phone number
1,Abc, 34,India, active, 9999999999
Source 2 is sending data in the following format
cat input_2.dat
id, age, name, qualification, status, phone number, location
2,24,xyz, L L B, Active, 88888-88888, India
So different sources are sending files in different formats. We would need to convert those input files to header.cntrl file format.
I was trying this using awk code, but for each source, I'll need to write an awk code. Can we do it with a single script which can be used for any new future source as well?

This reformat_data script can reformat the two "non-standard" input formats and any future source formats. The key idea is to use Perl hashes to store the appropriate headings and only print those that are needed as specified in the header.cntrl file.
cat $* | perl -ne '
BEGIN {
    @std_header = ("id","name","age","location","phone number");
    print join(",", @std_header), ",\n";
    chomp($firstline=<>);
    $firstline =~ s/,\s+/,/g;
    @inputfile_header=split(/,/, $firstline);
    %hash=();
}
chomp;
@row = split(/,/);
$i=0;
for $cell (@row) {
    $cell =~ s/\s+//;
    $header=$inputfile_header[$i];
    $hash{$header} = $row[$i];
    $i++;
}
foreach $cell (@std_header) {
    print "$hash{$cell},";
}
print "\n";
'
Here are the results of running the reformat_data script using the two sample input files:
cat input.dat
id, name, age, location, status, phone number
1,Abc, 34,India, active, 9999999999
reformat_data input.dat
id,name,age,location,phone number,
1,Abc,34,India,9999999999,
cat input_2.dat
id, age, name, qualification, status, phone number, location
2,24,xyz, L L B, Active, 88888-88888, India
reformat_data input_2.dat
id,name,age,location,phone number,
2,xyz,24,India,88888-88888,

In this particular case you can check the number of fields in lines (provided that all lines of a file have the same number of fields) (awk code):
{
n = split($0, a, "[ \t]*,[ \t]*");
if (n < 7) {
print a[1] ", " a[2] ", " a[3] ", " a[4] ", " a[6];
}
else {
print a[1] ", " a[3] ", " a[2] ", " a[7] ", " a[6];
}
}
A more sophisticated solution is to use the first line as key identifier and take remaining fields "by name":
{
n = split($0, a, "[ \t]*,[ \t]*");
if (FNR == 1) {
for (i = 1; i <= n; ++i) {
lbl[a[i]] = i;
}
}
print a[lbl["id"]] ", " a[lbl["name"]] ", " a[lbl["age"]] ", " a[lbl["location"]] ", " a[lbl["phone number"]];
}
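The same by-name lookup carries over to other languages. Purely as an illustration, here is a sketch in Java. The class and method names are hypothetical, the target columns are hard-coded from the sample header.cntrl rather than read from the file, and it assumes every wanted column appears in the input header and every row has as many cells as the header.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class ReorderByHeader {
    // Assumed target layout, copied from the sample header.cntrl.
    static final String[] TARGET = {"id", "name", "age", "location", "phone number"};

    // Hypothetical helper: the first line is the source header; every line
    // (header included) is re-emitted with its cells in TARGET order.
    static List<String> reformat(List<String> lines) {
        List<String> out = new ArrayList<>();
        Map<String, Integer> position = new HashMap<>();
        for (int n = 0; n < lines.size(); n++) {
            // Split on commas, discarding the blanks around them.
            String[] cells = lines.get(n).trim().split("\\s*,\\s*");
            if (n == 0) {
                for (int i = 0; i < cells.length; i++) {
                    position.put(cells[i], i);
                }
            }
            List<String> row = new ArrayList<>();
            for (String column : TARGET) {
                row.add(cells[position.get(column)]);
            }
            out.add(String.join(",", row));
        }
        return out;
    }

    public static void main(String[] args) {
        reformat(Arrays.asList(
                "id, age, name, qualification, status, phone number, location",
                "2,24,xyz, L L B, Active, 88888-88888, India"))
                .forEach(System.out::println);
    }
}
```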

Include match index in Regex replacement string

I have a situation where I need to strip out HTML code from some text. However, some of the input text includes lists, and I want to retain the numbering in that case.
If I do
result = Regex.Replace(result, "<li>", vbNewLine & "1. ", RegexOptions.IgnoreCase)
Then after stripping out the other HTML tags, I end up with:
1. List item one
1. List item two
1. List item three
Is there a way to get the index of the match during replacement?
so for example:
result = Regex.Replace(result, "<li>", vbNewLine & replacementIndex + 1 & " ", RegexOptions.IgnoreCase)
Then after stripping out the other HTML tags, I would get:
1. List item one
2. List item two
3. List item three
Is this possible??
Note: This is inside a function, so that each list is handled separately, and unordered lists get bullets (*) instead.

This should be a good starting point: @"(\<ul\>)((.|\n)*?)(\<\/ul\>)" will match everything in between the <ul> tags.

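It's messy, but something like the following (in C#) works: replace one match at a time, incrementing the counter each time. This may be slow for large data sets.
int lineNbr = 1;
// Regex.Replace (unlike String.Replace) understands patterns and accepts a maximum replacement count.
var liTag = new Regex("<li>", RegexOptions.IgnoreCase);
string newResult = liTag.Replace(result, Environment.NewLine + (lineNbr++) + ". ", 1);
while (newResult != result)
{
    result = newResult;
    newResult = liTag.Replace(result, Environment.NewLine + (lineNbr++) + ". ", 1);
}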

Here's how I ended up doing it - first, find each ordered list:
Dim result As String = rawText
Dim orderedLists As MatchCollection = Regex.Matches(rawText, "<ol>.*?</ol>", RegexOptions.Singleline)
For Each ol As Match In orderedLists
result = Replace(result, ol.Value, EncodeOrderedList(ol.Value))
Next
And the function to convert each one:
Private Function EncodeOrderedList(ByVal rawText As String) As String
Dim result As String = rawText
result = Regex.Replace(result, "<ol>\s*<li>", "1. ", RegexOptions.IgnoreCase)
result = Regex.Replace(result, "</li>\s*</ol>", vbNewLine & vbNewLine, RegexOptions.IgnoreCase)
Dim bullets As MatchCollection = Regex.Matches(rawText, "</li>\s*<li>")
Dim i As Integer = 2
For Each li As Match In bullets
result = Replace(result, li.Value, vbNewLine & i & ". ", 1, 1)
i += 1
Next
Return result
End Function
I haven't tested it on nested lists.
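The underlying idea, computing each replacement from a running counter, can also be done in a single pass when the regex library accepts a callback. In .NET that hook is the Regex.Replace overload taking a MatchEvaluator. The sketch below shows the same technique in Java (9 or later) only as an illustration; the class and method names are hypothetical and it is not the code used above.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class NumberedListDemo {
    private static final Pattern LI = Pattern.compile("<li>", Pattern.CASE_INSENSITIVE);

    // Hypothetical helper: replace the n-th <li> with a newline and "n. ".
    static String numberItems(String html) {
        AtomicInteger counter = new AtomicInteger(1);
        // replaceAll with a function (Java 9+) calls the lambda once per match;
        // quoteReplacement guards against '$' or '\' in the replacement text.
        return LI.matcher(html).replaceAll(
                m -> Matcher.quoteReplacement("\n" + counter.getAndIncrement() + ". "));
    }

    public static void main(String[] args) {
        System.out.println(numberItems("<li>List item one<LI>List item two<li>List item three"));
    }
}
```

The counter starts again at 1 on every call, so calling the helper once per list keeps the numbering of separate lists independent.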

asp.net Substring from specific string to the end of the word

I have a multiline textbox in which the user may type whatever they want, for example:
"Hello my name is #Konstantinos and i am 20 #years old"
Now I want to add a button that, when pressed, outputs #Konstantinos and #years.
Can that be done using Substring, or is there a better way?
Thank you in advance.

If all that you want is HashTags(#) from the entire string, you can perform simple .Split() and Linq. Try this:
C#
string a = "Hello my name is #Konstantinos and i am 20 #years old";
var data = a.Split(' ').Where(s => s.StartsWith("#")).ToList();
VB
Dim a As String = "Hello my name is #Konstantinos and i am 20 #years old"
Dim data = a.Split(" ").Where(Function(s) s.StartsWith("#")).ToList()

Using regex will give you more flexibility.
You can define a pattern to search for strings starting with #.
.Net regex cheat sheet
Dim searchPattern = "#(\S+)" '\S - Matches any nonwhite space character
Dim searchString = "Hello my name is #Konstantinos and i am 20 #years old"
For Each match As Match In Regex.Matches(searchString, searchPattern, RegexOptions.Compiled)
Console.WriteLine(match.Value)
Next
Console.Read()

This will work. Try this:
string str = "Hello my name is #Konstantinos and i am 20 #years old asldkfjklsd #kumod";
int i=0;
int k = 0;
while ((i = str.IndexOf('#', i)) != -1)
{
string strOutput = str.Substring(i);
k = strOutput.IndexOf(' ');
if (k != -1)
{
Console.WriteLine(strOutput.Substring(0, k));
}
else
{
Console.WriteLine(strOutput);
}
i++;
}

How to read text files that have line feed and carriage return intermixed using X++?

I am trying to read a text file using Dynamics AX. However, the following code replaces any spaces in the lines with commas:
// Open file for read access
myFile = new TextIo(fileName , 'R');
myFile.inFieldDelimiter('\n');
fileRecord = myFile.read();
while (fileRecord)
{
line = con2str(fileRecord);
info(line);
…
I have tried various combinations of the above code, including specifying a blank '' field delimiter, but with the same behaviour.
The following code works, but seems like there should be a better way to do this:
// Open file for read access
myFile = new TextIo(fileName , 'R');
myFile.inRecordDelimiter('\n');
myFile.inFieldDelimiter('_stringnotinfile_');
fileRecord = myFile.read();
while (fileRecord)
{
line = con2str(fileRecord);
info(line);
The format of the file is field format. For example:
DATAFIELD1 DATAFIELD2 DATAFIELD3
DATAFIELD1 DATAFIELD3
DATAFIELD1 DATAFIELD2 DATAFIELD3
So what I end up with unless I use the workaround above is something like:
line=DATAFIELD1,DATAFIELD2,DATAFIELD3
The underlying problem here is that I have mixed input formats. Some of the files just have line feeds {LF} and others have {CR}{LF}. Using my workaround above seems to work for both. Is there a way to deal with both, or to strip \r from the file?

Con2Str:
Con2Str will retrieve a list of values from a container and by default uses comma (,) to separate the values.
client server public static str Con2Str(container c, [str sep])
If no value for the sep parameter is specified, the comma character will be inserted between elements in the returned string.
Possible options:
If you would like the space to be the default separator, you can pass space as the second parameter to the method Con2Str.
One other option is that you can also loop through the container fileRecord to fetch the individual elements.
Code snippet 1:
The code snippet below loads the file contents into a TextBuffer and replaces the carriage returns (\r) with newline (\n) characters. The condition if (strlen(line) > 1) skips the empty strings produced by consecutive newline characters.
TextBuffer textBuffer;
str textString;
str clearText;
int newLinePos;
str line;
str field1;
str field2;
str field3;
counter row;
;
textBuffer = new TextBuffer();
textBuffer.fromFile(@"C:\temp\Input.txt");
textString = textBuffer.getText();
clearText = strreplace(textString, '\r', '\n');
row = 0;
while (strlen(clearText) > 0 )
{
row++;
newLinePos = strfind(clearText, '\n', 1, strlen(clearText));
line = (newLinePos == 0 ? clearText : substr(clearText, 1, newLinePos));
if (strlen(line) > 1)
{
field1 = substr(line, 1, 14);
field2 = substr(line, 15, 12);
field3 = substr(line, 27, 10);
info('Row ' + int2str(row) + ', Column 1: ' + field1);
info('Row ' + int2str(row) + ', Column 2: ' + field2);
info('Row ' + int2str(row) + ', Column 3: ' + field3);
}
clearText = (newLinePos == 0 ? '' : substr(clearText, newLinePos + 1, strlen(clearText) - newLinePos));
}
Code snippet 2:
You could use File macro instead of hard coding the values \r\n and R that denotes the read mode.
TextIo inputFile;
container fileRecord;
str line;
str field1;
str field2;
str field3;
counter row;
;
inputFile = new TextIo(@"c:\temp\Input.txt", 'R');
inputFile.inFieldDelimiter("\r\n");
row = 0;
while (inputFile.status() == IO_Status::Ok)
{
row++;
fileRecord = inputFile.read();
line = con2str(fileRecord);
if (line != '')
{
field1 = substr(line, 1, 14);
field2 = substr(line, 15, 12);
field3 = substr(line, 27, 10);
info('Row ' + int2str(row) + ', Column 1: ' + field1);
info('Row ' + int2str(row) + ', Column 2: ' + field2);
info('Row ' + int2str(row) + ', Column 3: ' + field3);
}
}

I have never tried using the default RecordDelimiter as the FieldDelimiter without setting another RecordDelimiter explicitly. Normally rows (records) are delimited by \n and fields are delimited by a comma, tab, semicolon or some other symbol. You might also be hitting some weird behaviour when TextIO assumes a correct UTF format. You didn't supply an example of some rows from your data file, so guessing is hard.
Read more about TextIO here: http://msdn.microsoft.com/en-us/library/aa603840.aspx
EDIT:
With the additional example of file content, it seems to me the file is a fixed width file, where each column has its own fixed width. I would rather recommend using subStr if that is the case. Read about substr here: http://msdn.microsoft.com/en-us/library/aa677836.aspx

Use StrAlpha to restrict blank values after you convert with Con2Str.

Does DateTime.ToString("s") return always same format?

According to MSDN on DateTime.ToString ToString("s") should always return string in the format of the sortable XML Schema style formatting, e.g.: 2008-10-01T17:04:32.0000000
In Reflector I came to this pattern inside DateTimeFormatInfo.
public string SortableDateTimePattern
{
get
{
return "yyyy'-'MM'-'dd'T'HH':'mm':'ss";
}
}
Does DateTime.ToString("s") always return a string in this format, regardless of the culture, region, ...?

Yes, it does. Here is code to test that:
var dateTime = DateTime.Now;
var originialString = dateTime.ToString("s");
string testString;
foreach (var c in System.Globalization.CultureInfo.GetCultures(CultureTypes.AllCultures))
{
Thread.CurrentThread.CurrentUICulture = c;
if (c.IsNeutralCulture == false)
{
Thread.CurrentThread.CurrentCulture = c;
}
testString = dateTime.ToString("s");
Console.WriteLine("{0} ", testString);
if (originialString != testString)
{
throw new ApplicationException(string.Format("ToString(s) is returning something different for {0} " , c));
}
}

Yes it does. As others have said it only contains numeric values and string literals (e.g. 'T' and ':'), nothing that is altered by region or culture settings.

Yep. Breaking that pattern down, it's only numeric properties, there's no reference to anything like month or day names in there.
yyyy - 4 digit year
MM - 2 digit month, with leading zero
dd - 2 digit day, with leading zero
T - a literal T
HH - 2 digit hour, with leading zero, 24 hour format
mm - 2 digit minute, with leading zero
ss - 2 digit second, with leading zero
