pyparsing: group text between dates

I have log files that contain a date/time stamp, with a varying number of lines of text between it and the next date/time stamp.
e.g. a time/date stamp:
2/07/18 13:55:00.983
This is what I have so far for parsing the stamp:
msecVal = pyparsing.Word(pyparsing.nums, max=3)
numPair = pyparsing.Word(pyparsing.nums, exact=2)
dateStr = pyparsing.Combine(numPair + '/' + numPair + '/' + numPair)
timeString = pyparsing.Combine(numPair + ':' + numPair + ':' + numPair + '.' + msecVal)
The log file will look like:
time:date: line of text
possible 2nd line of text
possible 3rd line of text...
time:date: line of text
time:date: line of text
possible 2nd line of text
possible 3rd line of text...
possible <n> line of text...
time:date: line of text
Input will be a large text log file in the above format. I'd like to produce a list of grouped elements:
[[time], [all text until next time]], [[time], [all text until next time]], ...
I can do this when each time/date entry is a single line; it's the entries that span a random number of lines until the next time/date entry that I'm having problems with.

Here is how I interpret your definition of a log entry:
"A date-time at the beginning of a line, followed by a colon, followed by everything
up until the next date-time at the beginning of a line, even if there might be date-times
embedded in the line."
There are two pyparsing features that you need to solve this:
LineStart - to distinguish date-times at the start of the line vs those in the body of the line
SkipTo - quick way to skip over unstructured text until a matching expression is found
I added these expressions to your code (I imported pyparsing as 'pp' because I am a lazy typist):
dateTime = dateStr + timeString
# log entry date-time keys only match if they are at the start of the line
dateTimeKey = pp.LineStart() + dateTime
# define a log entry as a date-time key, followed by everything up to the next
# date-time key, or to the end of the input string
# (use results names to make it easy to get at the parts of the log entry)
logEntry = pp.Group(dateTimeKey("time") + ':' + pp.Empty()
                    + pp.SkipTo(dateTimeKey | pp.StringEnd())("body"))
I converted your sample to have different date times in it for testing, and we get this:
sample = """\
2/07/18 13:55:00.983: line of text
possible 2nd line of text
possible 3rd line of text...
2/07/19 13:55:00.983: line of text
2/07/20 13:55:00.983: line of text
possible 2nd line of text
possible 3rd line of text...
possible <n> line of text...
2/07/21 13:55:00.983: line of text
"""
print(pp.OneOrMore(logEntry).parseString(sample).dump())
Gives:
[['2/07/18', '13:55:00.983', ':', 'line of text\n possible 2nd line of text\n possible 3rd line of text...\n 2/07/19 13:55:00.983: line of text'], ['2/07/20', '13:55:00.983', ':', 'line of text\n possible 2nd line of text\n possible 3rd line of text...\n possible <n> line of text...'], ['2/07/21', '13:55:00.983', ':', 'line of text']]
[0]:
['2/07/18', '13:55:00.983', ':', 'line of text\n possible 2nd line of text\n possible 3rd line of text...\n 2/07/19 13:55:00.983: line of text']
- body: 'line of text\n possible 2nd line of text\n possible 3rd line of text...\n 2/07/19 13:55:00.983: line of text'
- time: ['2/07/18', '13:55:00.983']
[1]:
['2/07/20', '13:55:00.983', ':', 'line of text\n possible 2nd line of text\n possible 3rd line of text...\n possible <n> line of text...']
- body: 'line of text\n possible 2nd line of text\n possible 3rd line of text...\n possible <n> line of text...'
- time: ['2/07/20', '13:55:00.983']
[2]:
['2/07/21', '13:55:00.983', ':', 'line of text']
- body: 'line of text'
- time: ['2/07/21', '13:55:00.983']
I also had to convert your numPair to:
numPair = pp.Word(pp.nums, max=2)
otherwise it would not match the leading single-digit '2' in your sample date.
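For convenience, here is one way the pieces might be pulled together into a single script, with the named results read back out. This is just a sketch, assuming pyparsing 2.x and reusing the sample string shown above:
import pyparsing as pp

msecVal = pp.Word(pp.nums, max=3)
numPair = pp.Word(pp.nums, max=2)   # max=2 instead of exact=2, so the single-digit '2' matches
dateStr = pp.Combine(numPair + '/' + numPair + '/' + numPair)
timeString = pp.Combine(numPair + ':' + numPair + ':' + numPair + '.' + msecVal)

dateTime = dateStr + timeString
# log entry date-time keys only match at the start of a line
dateTimeKey = pp.LineStart() + dateTime
# a log entry: date-time key, ':', then everything up to the next key or end of input
logEntry = pp.Group(dateTimeKey("time") + ':' + pp.Empty()
                    + pp.SkipTo(dateTimeKey | pp.StringEnd())("body"))

# 'sample' is the multi-line log string defined above
for entry in pp.OneOrMore(logEntry).parseString(sample):
    print(list(entry["time"]), repr(entry["body"]))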

Related

How to remove double quotes (") and newlines in between ," and ", in a unix file

I am getting a comma-delimited file with double quotes around string and date fields. We are getting stray " characters and newline feeds inside the string columns, like below:
"1234","asdf","with"doublequotes","new line
feed","withmultiple""doublequotes"
I want output like:
"1234","asdf","withdoublequotes","new linefeed","withmultipledoublequotes"
I have tried
sed 's/\([^",]\)"\([^",]\)/\1\2/g;s/\([^",]\)""/\1"/g;s/""\([^",]\)/"\1/g' < infile > outfile
It removes the double quotes inside strings, but it also removes the last double quote, like below:
"1234","asdf","withdoublequotes","new line
feed","withmultiple"doublequotes
Is there a way to remove the " and newline feeds that come in between ", and ,"?
Your substitutions for two consecutive quotes didn't work because they are placed after the substitution for a single quote, by which point only one of the two quotes is left.
We can remove " by repeated substitution (otherwise a quote re-inserted by the substitution would stay), and remove the newline feed by appending the next input line whenever the current line does not end with a quote:
sed ':1;/[^"]$/{;N;s/\n//;b1;};:0;s/\([^,]\)"\([^,]\)/\1\2/g;t0' <infile >outfile
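If the sed is hard to follow, the same two-step idea can be sketched in Python, purely as an illustration of the logic rather than a drop-in replacement for the sed command:
import re
import sys

def clean(text):
    # 1) join physical lines until the logical record ends with a quote
    #    (mirrors the sed loop  :1;/[^"]$/{;N;s/\n//;b1;})
    records, buf = [], ""
    for line in text.splitlines():
        buf += line
        if buf.endswith('"'):
            records.append(buf)
            buf = ""
    if buf:
        records.append(buf)
    # 2) repeatedly drop any quote whose neighbours are not commas
    #    (mirrors the sed loop  :0;s/\([^,]\)"\([^,]\)/\1\2/g;t0)
    inner_quote = re.compile(r'([^,])"([^,])')
    cleaned = []
    for rec in records:
        prev = None
        while prev != rec:
            prev = rec
            rec = inner_quote.sub(r'\1\2', rec)
        cleaned.append(rec)
    return "\n".join(cleaned)

if __name__ == "__main__":
    sys.stdout.write(clean(sys.stdin.read()) + "\n")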

Returning text between a starting and ending regular expression

I am working on a regular expression to extract some text from files downloaded from a newspaper database. The files are mostly well formatted: the full text of each article starts with a well-defined phrase, ^Full text:. However, the end of the full text is not demarcated. The best I can figure is that the full text ends at one of a variety of metadata tags that look like Subject:, CREDIT:, or Credit:.
So I can certainly get the start of the article, but I am having a great deal of difficulty finding a way to select the text between the start and the end of the full text.
This is complicated by two factors. First, the ending string varies, although I feel I could settle on something like '^[:alnum:]{5,}: ' to capture the ending. Second, similar tags appear prior to the start of the full text. How do I get R to return only the text between the Full text regex and the ending regex?
test<-c('Document 1', 'Article title', 'Author: Author Name', 'https://a/url', 'Abstract: none', 'Full text: some article text that I need to capture','the second line of the article that I need to capture', 'Subject: A subject', 'Publication: Publication', 'Location: A country')
test2<-c('Document 2', 'Article title', 'Author: Author Name', 'https://a/url', 'Abstract: none', 'Full text: some article text that I need to capture','the second line of the article that I need to capture', 'Credit: A subject', 'Publication: Publication', 'Location: A country')
My current attempt is here:
test[(grep('Full text:', test)+1):grep('^[:alnum:]{5,}: ', test)]
Thank you.
This just searches for the element matching 'Full text:', then for the next element after that matching ':', and returns everything from the 'Full text:' element up to (but not including) that later element:
get_text <- function(x){
  start <- grep('Full text:', x)
  end <- grep(':', x)
  end <- end[which(end > start)[1]] - 1
  x[start:end]
}
get_text(test)
# [1] "Full text: some article text that I need to capture"
# [2] "the second line of the article that I need to capture"
get_text(test2)
# [1] "Full text: some article text that I need to capture"
# [2] "the second line of the article that I need to capture"

Parse error when text is split across multiple lines

I'm getting a parse error when I split a text value across multiple lines and show the JSON file on screen with the command "jq . words.json".
The JSON file with the text value on a single line looks like this
{
"words" : "one two three four five"
}
The command "jq . words.json" works fine and shows the JSON file on screen.
But when I split the value "one two three four five" over two lines and run the same command, I get a parse error:
{
"words" : "one two
three four five"
^
}
parse error: Invalid string: control characters from U+0000 through
U+001F must be escaped at line 3, column 20
The parse error points to the " at the end of the third line.
How can I solve this?
Tia,
Anthony
That's because the JSON format is invalid. It should look like this:
{
"words" : "one two \nthree four five"
}
You have to escape the end-of-line character in JSON:
{
"words" : "one two\nthree four five"
}
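If the file is produced programmatically, the simplest route is to let a JSON library do the escaping for you; for example, a small Python sketch (not specific to the asker's setup):
import json

# json.dumps escapes the embedded newline as \n, producing valid JSON
print(json.dumps({"words": "one two\nthree four five"}, indent=2))
# {
#   "words": "one two\nthree four five"
# }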
To convert the text with the multi-line string to valid JSON, you could use any-json (https://www.npmjs.com/package/any-json), and pipe that into jq:
$ any-json --input-format=cson split-string.txt
{
"words": "one two three four five"
}
$ any-json --input-format=cson split-string.txt | jq length
1
For more on handling almost-JSON texts, see the jq FAQ: https://github.com/stedolan/jq/wiki/FAQ#processing-not-quite-valid-json
The parse error points to the " at the end of the third line.
The way jq flags this error may be counterintuitive, but the error in the JSON precedes the indicated quote-mark.
If the error is non-obvious, it may be that an end-quote is missing on the prior key or value. In this case, the character in the U+0000 through U+001F range is U+000A, the line feed character in ASCII.
In the case of this question, the line feed was inserted intentionally. But, unescaped, this is invalid JSON.
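The same complaint can be reproduced with any strict JSON parser; for instance, a quick Python check (illustrative only):
import json

bad = '{ "words" : "one two\nthree four five" }'   # a real newline inside the string
try:
    json.loads(bad)
except json.JSONDecodeError as e:
    print(e)   # prints something like: Invalid control character at: line 1 column 21 (char 20)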
In case it helps somebody, I had this error:
E: parse error: Invalid string: control characters from U+0000 through
U+001F must be escaped at line 3, column 5
jq was parsing a file containing this data, which has a missing " after "someKey":
{
"someKey: {
"someData": "someValue"
}
}

Trim function usage

I have a label field where the maximum number of characters allowed is 200. If the string in the label goes above 30 characters, I need to trim the value and display the trimmed value.
When I go to edit, all 200 characters should be passed to the text box for editing.
label.Text = label.Text.Substring(0, 30) + "..."; //This is displayed in the label
After that I want to edit it; for that I need to recover the full string (200 chars) from the label. Is that possible?
TRIM function
Trim eliminates leading and trailing whitespace. When we need to remove whitespace from the beginning or end of a string, the .NET Framework's Trim method does this efficiently; an overload also removes any characters you specify.
Trim input and output
String input: " This is an example string. "<br>
Trim method result: "This is an example string."
String input: "This is an example string.\r\n\r\n"<br>
Trim method result: "This is an example string."
So it depends on your label strings.
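As an aside, the equivalent in Python is str.strip(), which behaves much like .NET's Trim (shown only as a comparison):
# leading/trailing whitespace is removed
print("  This is an example string.  ".strip())       # -> This is an example string.
# trailing newlines count as whitespace too
print("This is an example string.\r\n\r\n".strip())   # -> This is an example string.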

How to extract everything between two keywords in perl

I need to extract everything between start and end.
The code below works if there is no \n:
$mystring = "The start text always precedes \n the end of the text.";
if ($mystring =~ m/start(.*?)end/) {
    print $1;
}
The output should be: text always precedes \n the
Study perlre, in particular the /s modifier: it makes . match newlines as well, so m/start(.*?)end/s matches across the line break.
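The analogous flag in Python's re module is re.S (re.DOTALL); a small sketch of the same extraction, for anyone working outside Perl:
import re

mystring = "The start text always precedes \n the end of the text."
# re.S makes '.' match newlines, like Perl's /s modifier
m = re.search(r'start(.*?)end', mystring, re.S)
if m:
    print(m.group(1))   # ' text always precedes \n the ' (with a real newline in the middle)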
