Splitting strings by first instance of pattern in R

I have a string
string <- "You know that song Mary had a little lamb? Mary is my friend."
I'd like to split it such that
> string[1]
[1] "You know that song "
> string[2]
[1] " had a little lamb? Mary is my friend."
I want to split it on the first instance of "Mary".
Closer to my actual problem, suppose I had the following string:
string <- "Name: Mary
Some stuff about Mary goes here, for a page
Name: Mary
There's more stuff about her.
Name: Sue
Now the name is different. I want to split on Sue here.
Name: Sue
Sue appears again, but because the name is Sue again I don't want to splt.
Name: Beth
The name changed again, so I want to split on Beth above (following Name: ).
Name: Amy
The name changed again and now I want to split on the 'Amy' immediately following Name: ."
Essentially, I want to split this document so that each element holds the information about one person, like this:
> string
[1] "Name: Mary\n Some stuff about Mary goes here, for a page\n Name: Mary\n There's more stuff about her.\n Name: "
[2] "Sue\n Now the name is different. I want to split on Sue here.\n Name: Sue\n Sue appears again, but because the name is Sue again I don't want to splt.\n Name: "
[3] "Beth\n The name changed again, so I want to split on Beth above (following Name: ).\n Name: "
[4] "Amy\n The name changed again and now I want to split on the 'Amy' immediately following Name: ."

Maybe this helps:
strsplit(string, '(\\b\\S+\\b)(?=.*\\b\\1\\b.*)', perl=TRUE)[[1]]
#[1] "You know that song "
#[2] " had a little lamb? Mary is my friend."
Another case
string1 <- "You know that song Mary had a little lamb? Mary is my friend and she is also a friend of another friend"
strsplit(string1, '(\\b\\S+\\b)(?=.*\\b\\1\\b.*)', perl=TRUE)[[1]]
#[1] "You know that song " " had " " little lamb? Mary "
#[4] " my " " and she is also a " " of another friend"
NOTE: I am not sure whether this is the way the OP wants to split for the second example.
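For the second example, one possible option is to work line by line instead of using a single regex. The sketch below rests on an assumption about what is wanted: it starts a new chunk at every "Name: " line whose name differs from the previous "Name: " line, so the boundary falls before "Name: " rather than after it as in the desired output in the question. The sample document is shortened and the variable names are illustrative only.

```r
# Shortened stand-in for the document in the question (illustrative data)
doc <- paste(
  "Name: Mary", "Some stuff about Mary goes here",
  "Name: Mary", "There's more stuff about her.",
  "Name: Sue",  "Now the name is different.",
  "Name: Sue",  "Sue appears again.",
  "Name: Beth", "The name changed again.",
  sep = "\n"
)

lines <- strsplit(doc, "\n", fixed = TRUE)[[1]]

# Which lines are "Name: ..." headers, and which name does each carry?
is_name <- grepl("^Name: ", lines)
names_seen <- sub("^Name: ", "", lines[is_name])

# TRUE where a header introduces a different person than the previous header
changed <- c(TRUE, names_seen[-1] != names_seen[-length(names_seen)])

# Mark chunk starts on the original lines and number the chunks
start <- logical(length(lines))
start[is_name] <- changed
group <- cumsum(start)

# Glue the lines of each chunk back together
chunks <- unname(vapply(split(lines, group), paste, character(1), collapse = "\n"))
chunks
```

Each element of chunks then holds everything about one person, including repeated "Name: " headers for the same name.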

Try this one:
regmatches(string, regexpr("Mary", string), invert = TRUE)
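As a small worked sketch of the call above: regexpr() reports only the first match, so inverting it with regmatches() leaves exactly the text before and after that first match. The result is a list with one element per input string, hence the [[1]]. Using fixed = TRUE is an addition here, on the assumption that "Mary" is meant literally.

```r
string <- "You know that song Mary had a little lamb? Mary is my friend."

# regexpr() finds only the first occurrence of the pattern
m <- regexpr("Mary", string, fixed = TRUE)

# invert = TRUE returns the pieces around the match instead of the match itself
parts <- regmatches(string, m, invert = TRUE)[[1]]
parts
```

The second "Mary" stays inside the second piece because only the first occurrence is used as the split point.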

PowerApps - One Collection feeds another, with List lookups

There are a few steps I'm trying to hit, here.
STEP 1:
I have created a Collection (ScanDataCollection) with the following command:
ClearCollect(ScanDataCollection,Split(ScanData.Text,Char(10)));
Where ScanData is a multiline text control, containing data strings such as this:
REQ1805965.RITM2055090.01
REQ1805965.RITM2055091.01
REQ1805982.RITM2055144.01
REQ1805982.RITM2055145.01
This produces a Collection of:
RESULT
REQ1805965.RITM2055090.01
REQ1805965.RITM2055091.01
REQ1805982.RITM2055144.01
REQ1805982.RITM2055145.01
The unique lookup value in this list is the RITM string (for example: RITM2055091)
I want to build a Collection that looks like this:
CUSTOMERNAME CUSTOMEREMAIL MANAGERNAME MANAGEREMAIL ITEMLIST
Edward edward#fish.com Tony tony#fish.com <li><strong>REQ1805965 - RITM2055090 - Vulcan Banana</strong></li>
Edward edward#fish.com Tony tony#fish.com <li><strong>REQ1805965 - RITM2055091 - Vulcan Grape</strong></li>
Joseph joey#fish.com Kate kate#fish.com <li><strong>REQ1805982 - RITM2055144 - Romulan Catfish</strong></li>
Joseph joey#fish.com Kate kate#fish.com <li><strong>REQ1805982 - RITM2055145 - Romulan Salmon</strong></li>
The values in the rows come from a List (called "Spiderfood" at the moment) in SharePoint (this is where RITM value is typically unique, and can be used as the lookup):
Title REQUEST RITM TASK OPENED_DATE ITEM_DESCRIPTION VIP CUSTOMER_NAME CUSTOMER_NT MANAGER_NAME MANAGER_NT TASK_DESCRIPTION CUSTOMER_LOCATION
8-5-2021 REQ1805965 RITM2055090 TASK123 7-27-2021 Vulcan Banana false Edward edward#fish.com Tony tony#fish.com a string a string
8-5-2021 REQ1805965 RITM2055091 TASK123 7-27-2021 Vulcan Grape false Edward edward#fish.com Tony tony#fish.com a string a string
8-5-2021 REQ1805982 RITM2055144 TASK123 7-27-2021 Romulan Catfish false Joseph joey#fish.com Kate kate#fish.com a string a string
8-5-2021 REQ1805982 RITM2055145 TASK123 7-27-2021 Romulan Salmon false Joseph joey#fish.com Kate kate#fish.com a string a string
...[among hundreds of other records in this List]
Then...
STEP 2:
Take the Collection I built above, and deduplicate, based on CUSTOMEREMAIL, but in the process of deduplicating, concatenate the items in the ITEMLIST column.
The result would be a Collection with only two rows, for example:
CUSTOMERNAME CUSTOMEREMAIL MANAGERNAME MANAGEREMAIL ITEMLIST
Edward edward#fish.com Tony tony#fish.com <li><strong>REQ1805965 - RITM2055090 - Vulcan Banana</strong></li><li><strong>REQ1805965 - RITM2055091 - Vulcan Grape</strong></li>
Joseph joey#fish.com Kate kate#fish.com <li><strong>REQ1805982 - RITM2055144 - Romulan Catfish</strong></li><li><strong>REQ1805982 - RITM2055145 - Romulan Salmon</strong></li>
I sure would appreciate guidance/suggestions on this, please!
Thank you kindly in advance!
Okay, for STEP 1:
ClearCollect(ScanDataCollection,Split(ScanData.Text,Char(10)));
ClearCollect(MailingListExploded, AddColumns(ScanDataCollection,
"CustomerName", LookUp('Spiderfood - RITMs', RITM = Mid(Result, 12, 11), Customer_Name),
"CustomerEmail", "edward#fish.com", // this is what I use as a test so that I don't email customers.
//"CustomerEmail", LookUp('Spiderfood - RITMs', RITM = Mid(Result, 12, 11), Customer_NT),
"ManagerName", LookUp('Spiderfood - RITMs', RITM = Mid(Result, 12, 11), Manager_Name),
"ManagerEmail", LookUp('Spiderfood - RITMs', RITM = Mid(Result, 12, 11), Manager_NT),
"ItemListHTML", "<li><strong>" & Left(Result,10) & " - " & Mid(Result, 12, 11) & " - " & LookUp('Spiderfood - RITMs', RITM = Mid(Result, 12, 11), Item_Description) & "</strong></li>"));
It carries over an extra column from ScanDataCollection called "Result", but I can live with that (I take it out later).
I had to add that specific list ('Spiderfood - RITMs') as a resource to the PowerApp project, which took me a minute to remember. Derp.
It raises a delegation warning about the use of LookUp on very large datasets (well, it's gonna be around 15,000 records, give or take), but for now I won't worry about it.
Now, on to STEP 2:
What would have gotten me there quicker is a better understanding of the GroupBy function and how it can take multiple column arguments; concatenating the strings was also a bit of a headscratcher.
But it seems to work, so here it is:
// Trim away the Result column
ClearCollect(MailingListExplodedTrimmed, DropColumns(MailingListExploded, "Result"));
// Group and concatenate - TransmissionGrid is what we need to send the emails
ClearCollect(RecordsByCustEmail, GroupBy(MailingListExplodedTrimmed, "CustomerEmail", "CustomerName", "ManagerName", "ManagerEmail", "OrderData"));
ClearCollect(TransmissionGridExtra, AddColumns(RecordsByCustEmail, "ConcatenatedOrderString", Concat(OrderData, ItemListHTML)));
ClearCollect(TransmissionGrid, DropColumns(TransmissionGridExtra, "OrderData"));
Notify("Process complete!");
I might be able to shave away some steps by nesting things, but in this instance I wanted to be super obvious in case I have to look at this in 96 hours.
Anyway, that's what did it for me. Onward!

R extract specific word after keyword

How do I extract a specific word after a keyword in R?
I have the following input text, which contains details about a policy. I need to extract the values of specific fields such as FirstName, Surname, Father's Name and Date of Birth.
input.txt
In Case of unit linked plan, Investment risk in Investment Portfolio is borne by the policyholder.
ly
c I ROPOSAL FORM z
Insurance
Proposal Form Number: 342525 PF 42242
Advisor Coe aranch Code 2
Ff roanumber =F SSOS™S™~™S~S rancid ate = |
IBR. Code S535353424
re GFN ——
INSTRUCTION FOR FILLING THES APPLICATION FORM ; 1. Compiets the proocsal form in CAPITAL LETTERS using = Black Ball Point P]n. 2. Sless= mark your selection by marking “X" insides the
Boe. 3. Slnsse bases 2 Blank soece after eect word, letter or initial 4. Slssse write "MA" for questions whic are not apolicatie. 5.00 NOT USE the Sor") to identify your initial or seperate the sddressiiine.
6. Sulmissson of age proof ie mandatory along wall Ge propel fonm.
IMPORTANT INSTRUCTIONS WITH REGARD TO DISCLOSURE OF INFORMATION: Inturance it a contract of UTMOST GOOD FAITH and itis required by disclose all material and nelevant
fach: complebehy, DO) NOT suppress any fac: in response by the questions in the priposal form. FAILURE TO PROVIDE COMPLETE AND ACCURATE INFORMATION OR
MISREPRESENTATION OF THE FACTS COULD DECLARE THES POLICY CONTRACT NULL AND VOID AFTER PAYMENT OF SURRENDER VALUE, IF ANY, SUBJECT TO SECTION 45 OF
INSURANCE ACT, 1998 As AMENDED FROM TIME TO TIME,
Section I - Details of the Life to be Assured
1. Tite E-] Mr. LJ Mrs. LJ Miss [J Or. LJ Others (Specify)
2. FirstName PETER PAUL
3. Surname T
44. Father's Name
46, Mother's Name ERIKA RESWE D
5. Date of Birth 13/02/1990 6, Gender E] Male ] Female
7. Age Proof L] School Certificate [] Driving License [] Passport {Birth Certificate E"] PAN Card
3, Marital Status D) Single EF] Married 0 Widower) 0 Civorcee
9, Spouse Name ERISEWQ FR
10. Maiden Name
iL. Nationality -] Resident Indian National [J Non Resident Indian (MRI) L] Others (Specify)
12, Education J Postgraduate / Doctorate Ee) Graduate [] 12thstd. Pass [J 10thstd. Pass [J Below 10th std.
OO Dliterate / Uneducated CJ Others (Specify)
13. Address For No 7¥%a vaigai street Flower
Communication Nagar selaiyur
Landmark
City Salem
Pin Code BO00 73: State TAMIL NADU
Address proof [] Passport ([] Driving License [] Voter ID [] Bank Statement [] Utility Bill G4 Others (Specify) Aadhaar Card
14, Permanent No 7¥a vaigai street Flower
Address :
Nagar selaiyur
Landmark
City Salem
Pin Code 5353535 state (TAMIL NADU
Address proof CJ] Passport [9 DrivingLicense [J Voter ID [ Bank Statement [ Utility Bill B] Others (Specify) Aadhaar Card
15. Contact Details Mobile 424242424 Phone (Home)
Office / Business
E-mail fdgrgtr13#yahoo.com
Preferred mode: ((] Letter EF) E-Mail
Preferred Language for Letter {other than English): [] Hindi [] Kannada [-] Tamil J Telugu C] Malayalam C) Gujarati
Bengali GOriya =D] Marathi
16. Occupation CL] Salaried-Govt /PSU ( Salaried-other [9 Self Employed Professional [J Aagriculturist {Farmer [Part Time Business
LJ Retired ] Landlord J Student (current Std) -] Others (Specify) Salaried - MNC
17. Full Name of the Capio software
Employers Businnes/
School/College
18, Designation & Exact nature of Work / Business Manager
19. AnnualIncomein 1,200,000.00 20. Annual Income of Husband / Father = 1,500,000.00
Figures (%) (for female and minor lives)
21. Exact nature of work / business of Husband / Father for female and minor lives Government Employee
Page 10fé
The code below works for me, but if the line order changes, everything breaks. Is there a way to extract the value after each keyword irrespective of line order?
Current Code
path <- getwd()
my_txt <- readLines(paste(path, "/input.txt", sep = ""))
fName <- sub('.*FirstName', '', my_txt[7])
SName <- sub('.*Surname', '', my_txt[8])
FatherNm <- sub(".*Father's Name", '', my_txt[9])
dob <- sub("6, Gender.*", '',sub(".*Date of Birth", '', my_txt[11]))
You can combine the text together as one string and extract the values based on pattern in the data. This approach will work irrespective of the line number in the data provided the pattern in the data is always valid for all the files.
my_txt <- readLines(paste(path, "/input.txt", sep = ""))
#Collapse data in one string
text <- paste0(my_txt, collapse = '\n')
#Extract text after FirstName till '\n'
fName <- sub('.*FirstName (.*?)\n.*', '\\1', text)
fName
#[1] "John Woo"
#Extract text after Surname till '\n'
SName <- sub('.*Surname (.*?)\n.*', '\\1', text)
SName
#[1] "T"
#Extract text after Father's Name till '\n'
FatherNm <- sub(".*Father's Name (.*?)\n.*", '\\1', text)
FatherNm
#[1] "Bill Woo"
#Extract numbers which come after Date of Birth.
dob <- sub(".*Date of Birth (\\d+/\\d+/\\d+).*", '\\1', text)
dob
#[1] "13/07/1970"
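A line-based variant is sketched below, in case a field is sometimes empty or missing. It is only a sketch under stated assumptions: extract_after() is a hypothetical helper (not from any package), the keyword is matched literally, and the sample lines are copied from the input in the question. It returns NA when the keyword is absent or has nothing after it, instead of handing back the whole text as sub() does when a pattern does not match.

```r
# Hypothetical helper: return the text following `keyword` on the first line
# that contains it, or NA if the keyword is missing or has no value.
extract_after <- function(lines, keyword) {
  pos <- regexpr(keyword, lines, fixed = TRUE)  # -1 where the keyword is absent
  hit <- which(pos > 0)
  if (length(hit) == 0) return(NA_character_)
  i <- hit[1]
  value <- trimws(substring(lines[i], pos[i] + nchar(keyword)))
  if (nzchar(value)) value else NA_character_
}

# A few lines taken from the input shown in the question
sample_lines <- c(
  "Section I - Details of the Life to be Assured",
  "2. FirstName PETER PAUL",
  "3. Surname T",
  "44. Father's Name",
  "5. Date of Birth 13/02/1990 6, Gender E] Male ] Female"
)

fName    <- extract_after(sample_lines, "FirstName")
SName    <- extract_after(sample_lines, "Surname")
FatherNm <- extract_after(sample_lines, "Father's Name")

# For the date, keep only the dd/mm/yyyy part of whatever follows the keyword
dob_rest <- extract_after(sample_lines, "Date of Birth")
dob <- regmatches(dob_rest, regexpr("[0-9]{2}/[0-9]{2}/[0-9]{4}", dob_rest))
```

With real files, sample_lines would be the result of readLines(), and the order of the lines does not matter because each keyword is searched for across all of them.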

str_extract: match words near each other

I would like to extract a string that starts at dog or cat, is followed within 0-5 words (separated by \r, \n or spaces) by 1., and then continues with more text until 2. appears.
myStrings <- c(
"the dog says: 1. hello cat 2. I do not care",
"the dog barks ba ba ba ba ba ba ba and says: 1. no 2. no",
"the doggie says: 1. hello 2. you",
"the cat is angry and asks: 1. hello dog 2. go away",
"the dog says: 2. nothing 3. nothing")
My approach is:
str_extract(string=myStrings,pattern=regex("(dog|cat(?:\\w+\\W+){1,5}?1.).*(?=2.)"))
I tried to implement this (https://www.regular-expressions.info/near.html) , however, my regex matches
> [1] "dog says: 1. hello cat " "dog barks ba ba ba ba ba
> ba ba: 1. no " "doggie says: 1. hello " "dog " "dog says: "
What I would need is
> [1] "dog says: 1. hello cat " "NA" "NA" "the cat is angry and asks: 1. hello dog " "NA"
Your lookbehind assertion is unbounded, meaning, it can match any amount of tokens. The engine needs to statically be able to determine the length of the lookbehind.
By the way, it seems you have unbalanced parentheses in your regex, so I can't tell which tokens are supposed to be included in the lookbehind. If you include anything like \w+, it will be unbounded.
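For reference, here is one possible pattern for the stated requirement, shown as a base R sketch with perl = TRUE so it runs without extra packages. The pattern is an assumption about the intent, not the asker's original: whole-word dog or cat, at most five words before a literal "1.", then as little text as possible up to a lookahead for a literal "2.". The same pattern should also be usable with str_extract(), though that is not tested here. Note that the match starts at "cat" in the fourth string, not at "the".

```r
myStrings <- c(
  "the dog says: 1. hello cat 2. I do not care",
  "the dog barks ba ba ba ba ba ba ba and says: 1. no 2. no",
  "the doggie says: 1. hello 2. you",
  "the cat is angry and asks: 1. hello dog 2. go away",
  "the dog says: 2. nothing 3. nothing")

# \\b(?:dog|cat)\\b     whole word dog or cat (so "doggie" is rejected)
# \\W+(?:\\w+\\W+){0,5}? up to five intervening words
# 1\\.                  a literal "1." (the dot is escaped)
# .*?(?=2\\.)           shortest text that is followed by a literal "2."
pattern <- "\\b(?:dog|cat)\\b\\W+(?:\\w+\\W+){0,5}?1\\..*?(?=2\\.)"

m <- regexpr(pattern, myStrings, perl = TRUE)

# regmatches() returns only the matched strings, so fill the rest with NA
result <- rep(NA_character_, length(myStrings))
result[m > 0] <- regmatches(myStrings, m)
result
```

The unescaped dots in "1." and "2." in the original attempt match any character, which is one reason it matched more than intended.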

How to allow a space into a wildcard?

Let's say I have this sentence :
text<-("I want to find both the greatest cake of the world but also some very great cakes but I want to find this last part : isn't it")
When I write this (kwic is a quanteda function):
kwic(text,phrase("great* cake*"))
I get
[text1, 7:8] want to find both the | greatest cake | of the world but also
[text1, 16:17] world but also some very | great cakes | but I want to find
However, when I do
kwic(text,phrase("great*cake*"))
I get a kwic object with 0 rows, i.e. nothing.
I would like to know what exactly the * replaces and, more importantly, how to "allow" a space to be taken into account in the wildcard.
To answer what the * matches, you need to understand the "glob" valuetype, which you can read about using ?valuetype and also here. In short, * matches any number of any characters including none. Note that this is very different from its use in a regular expression, where it means "match zero or more of the preceding character".
The pattern argument in kwic() matches one pattern per token, after tokenizing the text. Even wrapped in the phrase() function, it still only considers sequences of matches to tokens. So you cannot match the whitespace (which defines the boundaries between tokens) unless you actually include these inside the token's value itself.
How could you do that? Like this:
toksbi <- tokens(text, ngrams = 2, concatenator = " ")
# tokens from 1 document.
# text1 :
# [1] "I want" "want to" "to find" "find both" "both the"
# [6] "the greatest" "greatest cake" "cake of" "of the" "the world"
# [11] "world but" "but also" "also some" "some very" "very great"
# [16] "great cakes" "cakes but" "but I" "I want" "want to"
# [21] "to find" "find this" "this last" "last part" "part :"
# [26] ": isn't" "isn't it"
kwic(toksbi, "great*cake*", window = 2)
# [text1, 7] both the the greatest | greatest cake | cake of of the
# [text1, 16] some very very great | great cakes | cakes but but I
But your original usage of kwic(text, phrase("great* cake*")) is the recommended approach.
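The per-token behaviour can be illustrated without quanteda. The base R sketch below is only an analogy, not how kwic() is implemented: it splits a shortened version of the sentence on spaces (an assumption standing in for real tokenization), converts the glob to a regular expression with glob2rx() from utils, and tests it against single tokens and then against space-joined bigrams.

```r
text <- "the greatest cake of the world but also some very great cakes"

# Crude stand-in for tokenization: split on single spaces
toks <- strsplit(text, " ", fixed = TRUE)[[1]]

# Translate the glob pattern into an anchored regular expression
rx <- utils::glob2rx("great*cake*")

# No single token contains both "great..." and "cake...", so nothing matches
single_hits <- toks[grepl(rx, toks)]

# Joining neighbouring tokens with a space puts the whitespace inside the
# value being matched, so the * can now span it
bigrams <- paste(utils::head(toks, -1), utils::tail(toks, -1))
bigram_hits <- bigrams[grepl(rx, bigrams)]
bigram_hits
```

This mirrors why "great*cake*" finds nothing on ordinary tokens but does match once the tokens are bigrams.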

How to reverse the Lastname,Firstname of an expression inside a Cognos data item?

I created a list report in Cognos, and the LAST NAME column is currently in Lastname,Firstname order, with no spaces except where the first name includes a second name.
From this current set-up
LAST_NAME COLUMN
Morello,Mortel
Chopra,Deepak
Fothergill,Mike Edward
Smith,David
I'm hoping to get this result.
NEW DATA ITEM
Mortel Morello
Deepak Chopra
Mike Edward Fothergill
David Smith
I tried using a substring function but it does not work.
substring(LAST_NAME, position(',', LAST_NAME)+1,
inStr(LAST_NAME, ' ',position(',', LAST_NAME)+1, 1))
This should work:
substring([Last Name],position(',',[Last Name])+1)
+ ' ' +
substring([Last Name],1,position(',',[Last Name])-1)
