str_extract: match words near each other - r

I would like to extract a string that starts with dog or cat, is followed (after 0-5 words, \r, \n or spaces) by "1.", and then continues with more text until "2." appears.
myStrings <- c(
"the dog says: 1. hello cat 2. I do not care",
"the dog barks ba ba ba ba ba ba ba and says: 1. no 2. no",
"the doggie says: 1. hello 2. you",
"the cat is angry and asks: 1. hello dog 2. go away",
"the dog says: 2. nothing 3. nothing")
My approach is:
str_extract(string=myStrings,pattern=regex("(dog|cat(?:\\w+\\W+){1,5}?1.).*(?=2.)"))
I tried to implement this approach (https://www.regular-expressions.info/near.html); however, my regex matches
> [1] "dog says: 1. hello cat " "dog barks ba ba ba ba ba ba ba: 1. no " "doggie says: 1. hello " "dog " "dog says: "
What I would need is
> [1] "dog says: 1. hello cat " "NA" "NA" "the cat is angry and asks: 1. hello dog " "NA"

Your lookbehind assertion is unbounded, meaning it can match any number of tokens. The engine needs to be able to determine the length of the lookbehind statically.
By the way, it seems you have unbalanced parentheses in your regex, so I can't tell which tokens are supposed to be included in the lookbehind. If you include anything like \w+, it will be unbounded.
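For comparison, here is a minimal sketch of the requirement as stated in the question, written with Python's re module rather than stringr, so it is an illustration and not a drop-in fix for the R call. The helper name extract_near is hypothetical. It assumes that dog and cat must be whole words, that "1." and "2." are literal (the dots are escaped), and that the match starts at the animal word itself.

```python
import re

# dog or cat as a whole word, then at most 5 further words, then a literal "1.",
# then as little text as possible up to (but not including) a literal "2."
NEAR = re.compile(
    r"\b(?:dog|cat)\b\W+(?:\w+\W+){0,5}?1\..*?(?=2\.)",
    re.DOTALL,  # let "." cross \r and \n, as the question allows line breaks
)

def extract_near(text):
    """Return the matched substring, or None when the pattern does not match."""
    m = NEAR.search(text)
    return m.group(0) if m else None

my_strings = [
    "the dog says: 1. hello cat 2. I do not care",
    "the dog barks ba ba ba ba ba ba ba and says: 1. no 2. no",
    "the doggie says: 1. hello 2. you",
    "the cat is angry and asks: 1. hello dog 2. go away",
    "the dog says: 2. nothing 3. nothing",
]
results = [extract_near(s) for s in my_strings]
print(results)
```

Under these assumptions only the first and the fourth sample string produce a match; the others give None because of the word boundary, the five-word limit, or the missing "1.".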

Related

Extract first letter in each word but keeping specific punctuation

I have a vector of people's names, a couple of million elements long, from which I want to remove all characters except the first letter of each word (i.e. the initials) and certain characters such as ';' and '-'. The vector has large variation in name formats; a small sample looks like this:
text <- c("Alwyn Howard Gentry", "a. h. gentry", "A H GENTRY", "A. H. G.",
"Carl von Martius", "Leitão Filho, H. F. ; Shepherd, G. J.",
"Dárdano de Andrade - Lima")
I was using the solution below, which gives the desired output, but it is too time-consuming:
unlist(lapply(strsplit(text, " ", fixed = TRUE),
function(x) paste0(substr(x, 1, 1), collapse="")))
"AHG" "ahg" "AHG" "AHG" "CvM" "LFHF;SGJ" "DdA-L"
So I tried to adapt an answer I found here, based on regexp and gsub. I managed to get the initials, but not the initials and the extra characters at the same time:
gsub('\\b(\\pL)|.', '\\1', text, perl = TRUE)
"AHG" "ahg" "AHG" "AHG" "CvM" "LFHFSGJ" "DdAL"
I am really new to regexp. I tried to adapt the '\b(\pL)|.' part of the code to include the characters in the pattern, but I gave up after a couple of hours of trying.
Any ideas on which regular expression I should use with gsub() to get the same result as the one I got with strsplit() and lapply()?
Thanks a lot!
You can use
text <- c("Alwyn Howard Gentry", "a. h. gentry", "A H GENTRY", "A. H. G.", "Carl von Martius", "Leitão Filho, H. F. ; Shepherd, G. J.", "Dárdano de Andrade - Lima")
gsub("(*UCP)(\\b\\p{L}|[;-])(*SKIP)(*F)|.", "", text, perl=TRUE)
## Or, alternatively,
gsub("(*UCP)[^;-](?<!\\b\\p{L})", "", text, perl=TRUE)
See the R demo and a regex demo #1/regex demo #2.
Details:
(*UCP) - a PCRE verb that makes \b Unicode-aware
(\b\p{L}|[;-])(*SKIP)(*F) - any Unicode letter at the start of a word or a ; or -, and then the match is skipped, and the next match is searched for from the position where the failure occurred
| - or
. - any char but line break chars
[^;-](?<!\b\p{L}) - any char other than ; and - that is not a Unicode letter at the start of a word (i.e. a letter preceded by either the start of the string or a non-word char).
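As a rough cross-check of the "keep this, drop everything else" idea, here is a sketch in Python. Python's re module has no (*SKIP)(*F) verbs, so this swaps in the capture-group variant from the question: whatever lands in group 1 is kept and every other character is replaced with nothing. The function name initials is hypothetical, and [^\W\d_] is used as a stand-in for \p{L}.

```python
import re

# Group 1 captures what should survive: a letter at the start of a word, or ; or -.
# The trailing "." alternative consumes any other character without capturing it.
KEEP_OR_DROP = re.compile(r"(\b[^\W\d_]|[;-])|.")

def initials(name):
    """Keep word-initial letters plus ';' and '-', drop everything else."""
    return KEEP_OR_DROP.sub(lambda m: m.group(1) or "", name)

names = [
    "Alwyn Howard Gentry",
    "a. h. gentry",
    "Carl von Martius",
    "Leitão Filho, H. F. ; Shepherd, G. J.",
    "Dárdano de Andrade - Lima",
]
print([initials(n) for n in names])
```

For str patterns Python treats \b and \w as Unicode-aware by default, which plays the role of (*UCP) in the PCRE version.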

How to allow a space into a wildcard?

Let's say I have this sentence :
text<-("I want to find both the greatest cake of the world but also some very great cakes but I want to find this last part : isn't it")
When I write this (kwic is a quanteda function):
kwic(text,phrase("great* cake*"))
I get
[text1, 7:8] want to find both the | greatest cake | of the world but also
[text1, 16:17] world but also some very | great cakes | but I want to find
However, when I do
kwic(text,phrase("great*cake*"))
I get a kwic object with 0 rows, i.e. nothing.
I would like to know what exactly the * replaces and, more importantly, how to "allow" a space to be taken into account in the wildcard.
To answer what the * matches, you need to understand the "glob" valuetype, which you can read about using ?valuetype and also here. In short, * matches any number of any characters including none. Note that this is very different from its use in a regular expression, which means "match none or more of the preceding character".
The pattern argument in kwic() matches one pattern per token, after tokenizing the text. Even wrapped in the phrase() function, it still only considers sequences of matches to tokens. So you cannot match the whitespace (which defines the boundaries between tokens) unless you actually include these inside the token's value itself.
How could you do that? Like this:
toksbi <- tokens(text, ngrams = 2, concatenator = " ")
# tokens from 1 document.
# text1 :
# [1] "I want" "want to" "to find" "find both" "both the"
# [6] "the greatest" "greatest cake" "cake of" "of the" "the world"
# [11] "world but" "but also" "also some" "some very" "very great"
# [16] "great cakes" "cakes but" "but I" "I want" "want to"
# [21] "to find" "find this" "this last" "last part" "part :"
# [26] ": isn't" "isn't it"
kwic(toksbi, "great*cake*", window = 2)
# [text1, 7] both the the greatest | greatest cake | cake of of the
# [text1, 16] some very very great | great cakes | cakes but but I
But your original usage of kwic(text, phrase("great* cake*")) is the recommended approach.
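The per-token matching described above can be imitated outside quanteda. The following Python sketch is only an illustration of the idea, not quanteda's implementation: it assumes whitespace tokenization, uses fnmatch for glob matching, and the helper name phrase_matches is hypothetical.

```python
from fnmatch import fnmatchcase

def phrase_matches(tokens, patterns):
    """Find runs of consecutive tokens that match the glob patterns one-to-one.

    Returns (1-based start position, matched tokens) for every hit.
    """
    n = len(patterns)
    hits = []
    for i in range(len(tokens) - n + 1):
        window = tokens[i:i + n]
        if all(fnmatchcase(tok, pat) for tok, pat in zip(window, patterns)):
            hits.append((i + 1, window))
    return hits

text = ("I want to find both the greatest cake of the world "
        "but also some very great cakes but I want to find this last part")
tokens = text.split()

# Two patterns, one per token: the space is the token boundary, not part of a pattern.
two_tokens = phrase_matches(tokens, ["great*", "cake*"])

# One pattern against single tokens: no token contains both "great" and "cake".
one_token = phrase_matches(tokens, ["great*cake*"])

# One pattern against bigrams joined with a space: now * can span the space.
bigrams = [" ".join(pair) for pair in zip(tokens, tokens[1:])]
on_bigrams = phrase_matches(bigrams, ["great*cake*"])

print(two_tokens, one_token, on_bigrams)
```

The first and third calls find hits at positions 7 and 16, while the second finds nothing, which mirrors the kwic() results shown above.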

capital letters in first letter

In Python, I want a program that turns the first letter of each word into a capital letter.
For example:
turn "a red apple is sweeter than a green apple" into "A Red Apple is Sweeter Than A Green Apple"
How can I do this?
I've tried this:
d = input('insert a quote')
def mydic(d):
    dic = {}
    for i in d:
        palavras = dic.keys()
        if i in palavras:
            dic[i] += 1
        else:
            dic[i] = 1
    return dic
You could use the title() method.
For example:
sentence = str(input("Insert a quote: ")).title()
print(sentence)
Input: a red apple is sweeter than a green apple
Output: A Red Apple Is Sweeter Than A Green Apple
What you want to do is this:
Split the input string into words, e.g. string.split(' ') splits a given string on spaces and returns a list.
For each word, capitalize the first letter, e.g. word[:1].upper() + word[1:] uppercases the first letter and leaves the rest unchanged.
Collect the capitalized words, join them back into one string and return it.
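The steps above can be sketched as follows. The function name capitalize_words is made up for illustration, and splitting on a single space is an assumption carried over from the steps.

```python
def capitalize_words(sentence):
    """Uppercase the first letter of every space-separated word."""
    words = sentence.split(' ')
    # word[:1] is safe for empty strings (e.g. from double spaces), unlike word[0]
    capitalized = [word[:1].upper() + word[1:] for word in words]
    return ' '.join(capitalized)

print(capitalize_words("a red apple is sweeter than a green apple"))
print(capitalize_words("don't stop"))
print("don't stop".title())
```

One difference from title() is visible with apostrophes: title() treats the letter after an apostrophe as a new word start, whereas this version only touches the first character of each space-separated word.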

R - How to split text and punctuation with an exception?

I am analysing Facebook comments in R for sentiment analysis. Emojis are encoded in the text between <> symbols.
Example:
"Jesus te ama!!! <U+2764> Ou não...?<U+1F628> (fé em stand by)"
<U+2764> and <U+1F628> are emojis (heavy black heart and fearful face, respectively).
So I need to split words/numbers from punctuation/symbols, except inside emoji codes.
Using the gsub function, I did this:
a1 <- "([[:alpha:]])([[:punct:]])"
a2 <- "([[:punct:]])([[:alpha:]])"
b <- "\\1 \\2"
gsub(a1, b, gsub(a2, b, "Jesus te ama!!! <U+2764> Ou não...?<U+1F628> (fé em stand by)"))
...but the result, logically, also affects the emoji codes:
[1] "Jesus te ama !!! < U +2764> Ou não ...?< U +1F628> ( fé em stand by )"
The objective is to create an exception for the text between <>: split around it, but not inside it, i.e.:
[1] "Jesus te ama !!! <U+2764> Ou não ...? <U+1F628> ( fé em stand by )"
Note that:
sometimes there is no space between the sentence/word/punctuation and an emoji code (it needs to be created)
a punctuation sequence must stay joined (e.g. "!!!", "...?")
How can I do it?
You may use the following regex solution:
a1 <- "(?<=<)U\\+\\w+>(*SKIP)(*F)|(?<=\\S)(?=<U\\+\\w+>)|(?<=[[:alpha:]])(?=[[:punct:]])|(?<=[[:punct:]])(?=[[:alpha:]])"
gsub(a1, " ", "Jesus te ama!!! <U+2764> Ou não...?<U+1F628> (fé em stand by)", perl=TRUE)
# => [1] "Jesus te ama !!! <U+2764> Ou não ...? <U+1F628> ( fé em stand by )"
See the online R demo
This PCRE regex (see perl=TRUE argument in the call to gsub) matches:
(?<=<)U\\+\\w+>(*SKIP)(*F) - U+, 1+ word chars and a >, but only if preceded with <; the match value is discarded with the PCRE verbs (*SKIP)(*F), and the next match is looked for from the end of this match
| - or
(?<=\\S)(?=<U\\+\\w+>) - a non-whitespace char must be present immediately to the left of the current location, and a <U+, 1+ word chars and > must be present immediately to the right of the current location
| - or
(?<=[[:alpha:]])(?=[[:punct:]]) - a letter must be present immediately to the left of the current location, and a punctuation must be present immediately to the right of the current location
| - or
(?<=[[:punct:]])(?=[[:alpha:]]) - a punctuation must be present immediately to the left of the current location, and a letter must be present immediately to the right of the current location
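For readers without PCRE verbs at hand, a different technique gives the same result on the sample: tokenize and re-join instead of inserting spaces at zero-width positions. The Python sketch below is an assumption-laden alternative, not the gsub solution above. The name split_keep_emoji is hypothetical, and note that re-joining with single spaces also collapses any existing runs of whitespace.

```python
import re

TOKEN = re.compile(
    r"<U\+\w+>"                    # an emoji code such as <U+2764>, kept whole
    r"|\w+"                        # a run of letters/digits
    r"|(?:(?!<U\+\w+>)[^\w\s])+"   # a punctuation run that stops before an emoji code
)

def split_keep_emoji(text):
    """Separate words, punctuation runs and emoji codes with single spaces."""
    return " ".join(TOKEN.findall(text))

sample = "Jesus te ama!!! <U+2764> Ou não...?<U+1F628> (fé em stand by)"
print(split_keep_emoji(sample))
```

The negative lookahead inside the punctuation alternative is what keeps a run such as "...?" from swallowing the "<" that opens the next emoji code.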
> str <- "Jesus te ama!!! <U+2764> Ou não...?<U+1F628> (fé em stand by)"
> strsplit(str,"[[:space:]]|(?=[.!?])",perl=TRUE)
[[1]]
[1] "Jesus" "te" "ama" "!" "!" "!"
[7] "" "<U+2764>" "" "Ou" "não" "."
[13] "." "." "?" "<U+1F628>" "(fé" "em"
[19] "stand" "by)"

Splitting strings by first instance of pattern R

I have a string
string <- "You know that song Mary had a little lamb? Mary is my friend."
I'd like to split it such that
> string[1]
[1] "You know that song "
> string[2]
[1] " had a little lamb? Mary is my friend."
I want to split it on the first instance of "Mary".
Closer to my actual problem, suppose I had the following string:
string <- "Name: Mary
Some stuff about Mary goes here, for a page
Name: Mary
There's more stuff about her.
Name: Sue
Now the name is different. I want to split on Sue here.
Name: Sue
Sue appears again, but because the name is Sue again I don't want to splt.
Name: Beth
The name changed again, so I want to split on Beth above (following Name: ).
Name: Amy
The name changed again and now I want to split on the 'Amy' immediately following Name: ."
Essentially, I want to split this document so that each element corresponds to information about one person so that:
> string
[1] "Name: Mary\n Some stuff about Mary goes here, for a page\n Name: Mary\n There's more stuff about her.\n Name: "
[2] "Sue\n Now the name is different. I want to split on Sue here.\n Name: Sue\n Sue appears again, but because the name is Sue again I don't want to splt.\n Name: "
[3] "Beth\n The name changed again, so I want to split on Beth above (following Name: ).\n Name: "
[4] "Amy\n The name changed again and now I want to split on the 'Amy' immediately following Name: ."
Maybe this helps:
strsplit(string, '(\\b\\S+\\b)(?=.*\\b\\1\\b.*)', perl=TRUE)[[1]]
#[1] "You know that song "
#[2] " had a little lamb? Mary is my friend."
Another case
string1 <- "You know that song Mary had a little lamb? Mary is my friend and she is also a friend of another friend"
strsplit(string1, '(\\b\\S+\\b)(?=.*\\b\\1\\b.*)', perl=TRUE)[[1]]
#[1] "You know that song " " had " " little lamb? Mary "
#[4] " my " " and she is also a " " of another friend"
NOTE: I am not sure whether this is the way the OP wants to split for the second example.
Try this one:
regmatches(string, regexpr("Mary", string), invert = TRUE)
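For comparison, both parts of the question can be sketched in Python. This is only an illustration under assumptions: the first part relies on str.split with a split limit of 1, and the second assumes every record is introduced by the literal text "Name: " followed by a one-word name. The helper name split_on_name_change is hypothetical.

```python
import re

string = "You know that song Mary had a little lamb? Mary is my friend."

# Split on the first occurrence only: maxsplit=1 leaves later "Mary"s untouched.
first_split = string.split("Mary", 1)

def split_on_name_change(doc):
    """Cut the document just before each name that differs from the previous one."""
    pieces, start, previous = [], 0, None
    for m in re.finditer(r"Name: (\w+)", doc):
        name = m.group(1)
        if previous is not None and name != previous:
            pieces.append(doc[start:m.start(1)])  # piece ends right after "Name: "
            start = m.start(1)
        previous = name
    pieces.append(doc[start:])
    return pieces

doc = ("Name: Mary\nabout Mary\nName: Mary\nmore\n"
       "Name: Sue\nx\nName: Sue\ny\nName: Beth\nz")
pieces = split_on_name_change(doc)
print(first_split)
print(pieces)
```

As in the desired output of the question, each piece after the first starts with the new name, and the preceding piece ends with the "Name: " label.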
