Count number of sentences in a string in SAS

Count number of sentences in a string in SAS - count

I was wondering if there is an easy way in SAS to count sentences in a string?
In pseudo code I would search for the index of every ., ?, and !, and check if the index before that (-1 or -2) is a character.
Any better ideas?

Assuming that your sentences are correctly punctuated, there should be exactly 1 sentence per ?!., so in that case you can use countc(my_string,'?!.'). The main exceptions are probably interrobangs (?!,!?) and ellipses (...).
If your string contains lots of sentences with missing stops or double stops, one option is simply to cross your fingers and hope they more or less cancel out.
If there are lots of double stops but not so many missing ones, you could apply a regex to replace any run of consecutive stops with a single . before counting those, e.g. countc(prxchange('s/[\.!\?]{2,}/./',-1,string),'?!.').

Related

R: Regex for identifying numbers within HTML chunk

this is my first entry on stack overflow, so please be indulgent if my post might have some lack in terms of quality.
I want to learn some webscraping with R and started with a simple example --> Extracting a table from a Wikipedia site.
I managed to download the specific page and identified the HTML sections I am interested in:
<td style="text-align:right">511.000.000\n</td>
Now I want to extract the number in the data from the table by using regex. So i created a regex, which should match the structure of the number from my point of view:
pattern<-"\\d*\\.\\d*\\.\\d*\\.\\d*\\."
I also tried other variations but none of them found the number within the HTML code. I wanted to keep the pattern open as the numbers might be hundreds, thousand, millions, billions.
My questions: The number is within the HTML code, might it be
necessary to include some code for the non-number code (which should
not be extracted...)
What would be the correct version for the
pattern to identify the number correctly?
Thank you very much for your support!!

So many stars implies a lot of backtracking.
One point further, using \\d* would match more than 3 digits in any group and would also match a group with no digit.
Assuming your numbers are always integers, formatted using a . as thousand separator, you could use the following: \\d{1,3}(?:\\.\\d{3})* (note the usage of non-capturing group construct (?:...) - implying the use of perl = TRUE in arguments, as mentioned in Regular Expressions as used in R).

Look closely at your regex. You are assuming that the number will have 4 periods (\\.) in it, but in your own example there are only two periods. It's not going to match because while the asterisk marks \\d as optional (zero or more), the periods are not marked as optional. If you add a ? modifier after the 3rd and 4th period, you may find that your pattern starts matching.

Parsing - Adding a capturing group

I am attempting to use a fairly complex REGEX expression (see REGEX101 demos below), which I amended slightly from one created by an expert on this site. It parses specific patterns of log events:
1EXE_IN1EXE_CO2CONTENT_ACCESS3CONTENT_ACCESS
These log sequences will always begin with a random selection of EXE_IN or EXE_CO events, preceded sequence numbers. These selections can be any number, in any order. In this case, we just have two EXE events but this may be 200. Or 1. Note that there is a sequence number and we need to capture it.
The second part of the sequence will always be a series of digit-prefaced CONTENT.ACCESS events. Again from 1 to infinity in length.
The following demo shows a working example and probably conveys the concept better than I can : Demo 1
It nicely captures a full match, sequence number, and event in separate groups.
I need to add a timestamp to the pattern (after the sequence number, with a preceding underscore), and then parse this event log e.g.
1_11/08/2014 23:03EXE_IN1_11/08/2014 23:03EXE_CO2_12/08/2014 09:17CONTENT_ACCESS3_13/08/2014 09:17CONTENT_ACCESS
I need to capture the timestamps as well.
I attempted to adjust the regex expression, with mixed results. Please see this demo: demo2
Ideally I'd like to see something like this for each event:
Match n
Full match 266-308 `2_12/08/2014 09:17CONTENT_ACCESS`
Group 1. 266-267 `2`
Group 2. 268-284 `12/08/2014 09:17`
Group 3. 284-308 `CONTENT_ACCESS`
I hope you can help me. REGEX101 pcre testing is sufficient (for the record, I am using perl-compatible str_match_all_perl function in R).
Many thanks in advance.

(\d+)_(.*?)(EXE_CO|EXE_IN|CONTENT_ACCESS)
https://regex101.com/r/EHHcKm/1
Due to comments it was changed to (?:\G(?!^)(?(?=\d+_\d{2}\/\d{2}\/\d{4}\s\d{2}\:\d{2}(?:EXE_CO|EXE_IN))(?<!\d_\d{2}\/\d{2}\/\d{4}\s\d{2}\:\d{2}CONTENT_ACCESS))|(?=(?:\d+_\d{2}\/\d{2}\/\d{4}\s\d{2}\:\d{2}(?:EXE_CO|EXE_IN))+(?:\d+_\d{2}\/\d{2}\/\d{4}\s\d{2}\:\d{2}CONTENT_ACCESS)+))(\d+)_(\d{2}\/\d{2}\/\d{4}\s\d{2}\:\d{2})(EXE_CO|EXE_IN|CONTENT_ACCESS)
https://regex101.com/r/EHHcKm/3
Ans also another version, which is shorter
(?:\G(?!^)(?(?=\d+_.{16}(?:EXE_CO|EXE_IN))(?<!\d_.{16}CONTENT_ACCESS))|(?=(?:\d+_.{16}(?:EXE_CO|EXE_IN))+(?:\d+_.{16}CONTENT_ACCESS)+))(\d+)_(.{16})(EXE_CO|EXE_IN|CONTENT_ACCESS)
https://regex101.com/r/EHHcKm/4
And even more shorter (?:\G(?!^)(?(?=\d+_.{16}E)(?<!S))|(?=(?:\d+_.{16}(?:EXE_CO|EXE_IN))+\d+_.{16}C))(\d+)_(.{16})(EXE_CO|EXE_IN|CONTENT_ACCESS)
https://regex101.com/r/EHHcKm/5
And super short (?:\G|(?=\d+_.{16}E.*CON))(\d+)_(.*?)(EXE_CO|EXE_IN|CONTENT_ACCESS)
https://regex101.com/r/EHHcKm/8

Extract first value in boolean search string

I would like to be able to control the hierarchy of elements I extract from a search string.
Specifically, in the string "425 million won", I would like to extract "won" first, but then "n" if "won" doesn't appear.
I want the result to be "won" for the following:
stringr::str_extract("425 million won", "won|n")
Note that specifying a space before won in my regex is inadequate because of other limitations in my data (there may not necessarily be a space between "million" and "won"). Ideally, I would like to do this using regex, as opposed to if-else clauses because of performance considerations.

See code in use here
pattern <- "^(?:(?!won).)*\\K(?:won|n)"
s <- "425 million won"
m <- gregexpr(pattern,s,perl=TRUE)
regmatches(s,m)[[1]]
Explanation
^ Assert position at the start of the line
(?:(?!won).)* Tempered greedy token matching any character except instances where won proceeds
\K Resets the starting point of the match. Any previously consumed characters are no longer included in the final match
(?:won|n) Match either won or n

If you just want to extend on the code you already have:
na.omit(str_extract("420 million won", c("won", "n")))[1]

Regex in R match specified words when they all (two or more) occur in whatever order within certain distance in particular line

I have a double challenge.
First, I want to match lines that contain two (or eventually more) specified words within certain distance in whatever order.
Using lookaround I manage to select lines matching two or more words, regardless of the order within they occur. I can also easily add more words to be found in the same line, so it this can also be applied without much effort when more word must occur in order to be selected. The disadvantage is that can't detail the maximal distance between them.
^(?=.*\john)(?=.*\jack).*$
By using the pipe operator I can detail both orders in which the terms may occur as well as the accepted distance between them, but when more words should be matched the code becomes lengthy and errorsensitive.
jack.{0,100}john|john.{0,100}jack
Is there a way to combine the respective advantages of both approaches in one regular expression?
Second, ideally I would like that only 'jack' and 'john' (and are selected in the line but not the whole line.
Is there a possibility to do this all at once?

For this case, you have to use the second approach. But it can't be possible with regex alone.. You have to ask for language tools help like paste in-order to build a regex (given in the second format).
In python, I would do like below to create a long regex.
>>> def create_reg(lis):
out = []
for i in lis:
out.append(''.join(i) + '|' + ''.join([i[2],i[1], i[0]]))
return '(?:' + '|'.join(out) + ')'
>>> lst = [('john', '{0,100}', 'jack'), ('foo', '{0,100}', 'bar')]
>>> create_reg(lst)
'(?:john{0,100}jack|jack{0,100}john|foo{0,100}bar|bar{0,100}foo)'
>>>

COUNTIF of non-empty and non-blank cells

In Google Sheets I want to count the number of cells in a range (C4:U4) that are non-empty and non-blank. Counting non-empty is easy with COUNTIF. The tricky issue seems to be that I want to treat cells with one or more blank as empty. (My users keep leaving blanks in cells which are not visible and I waste a lot of time cleaning them up.)
=COUNTIF(C4:U4,"<>") treats a cell with one or more blanks as non-empty and counts it. I've also tried =COUNTA(C4:U4) but that suffers from the same problem of counting cells with one or more blanks.
I found a solution in stackoverflow flagged as a solution by 95 people but it doesn't work for cells with blanks.
After much reading I have come up with a fancy formula:
=COUNTIF(FILTER(C4:U4,TRIM(C4:U4)>="-"),"<>")
The idea is that the TRIM removes leading and trailing blanks before FILTER tests the cell to be greater than or equal to a hyphen (the lowest order of printable characters I could find). The FILTER function then returns an array to the COUNTIF function which only contains non-empty and non-blank cells. COUNTIF then tests against "<>"
This works (or at least "seems" to work) but I was wondering if I've missed something really obvious. Surely the problem of hidden blanks is very common and has been around since the dawn of excel and google sheets. there must be a simpler way.
(My first question so apologies for any breaches of forum rules.)

I don't know about Google. But for Excel you could use this array formula for multiple contiguous columns:
=ROWS(A1:B10) * COLUMNS(A1:B10)-(COUNT(IF(ISERROR(CODE(A1:B10)),1,""))+COUNT(IF(CODE(A1:B10)=32,1,"")))

Could try this but I'm not at all sure about it
=SUMPRODUCT(--(trim((substitute(A2:A5,char(160),"")))<>""))
seems in Google Sheets that you've got to put char(160) to match a space entered into a cell?
Seems this is due to a non-breaking space and could possibly apply to Excel also - as explained here - the suggestion is that you could also pass it through the CLEAN function to eliminate invisible characters with codes in range 0-31.

I found another way to do it using:
=ARRAYFORMULA(SUM(IF(TRIM($C4:$U4)<>"",1,0)))
I'm still looking for a simpler way to do it if one is available.

This should work:
=countif(C4:U4,">""")
I found this solution here:
Is COUNTA counting blank (empty) cells in new Google spreadsheets?
Please let me know if it does.

=COLUMNS(C4:U4)-COUNTBLANK(C4:U4)
This will count how many cells are in your range (C4 to U4 = 19 cells), and subtract those that are truly "empty".
Blank spaces will not get counted by COUNTBLANK, despite its name, which should really be COUNTEMPTY.