I have a vector of strings and I want to separate the last sentence from each string in R.
Sentences may end with full stops (.) or even exclamation marks (!), so I am confused as to how to separate the last sentence from a string in R.
You can use strsplit to get the last sentence from each string, as shown below. Note that the punctuation used for splitting is dropped from the result:
## paragraph <- "Your vector here"
result <- strsplit(paragraph, "\\.|\\!|\\?")
last.sentences <- sapply(result, function(x) {
  trimws(x[length(x)])
})
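If the closing punctuation should be kept, one option is to split on the whitespace that follows a sentence terminator instead of on the terminator itself. The following is only a minimal sketch: it assumes sentences are separated by whitespace and that every string is non-empty, and the names x, parts and last_sentence are made up for illustration.

```r
# Illustrative input (not from the question)
x <- c("This is a sentence. And another sentence!",
       "Single sentences are not a problem.")

# Split on whitespace that is preceded by ., ! or ?; the lookbehind
# keeps the punctuation attached to the sentence it ends.
parts <- strsplit(x, "(?<=[.!?])\\s+", perl = TRUE)

# Take the last piece of each split (assumes no empty strings)
last_sentence <- vapply(parts, function(p) p[length(p)], character(1))
last_sentence
```

Because the separator is the whitespace rather than the punctuation mark, abbreviations followed by a space (such as "e.g. ") would still be treated as sentence ends.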
Provided that your input is clean enough (in particular, that there are spaces between the sentences), you can use:
sub(".*(\\.|\\?|\\!) ", "", trimws(yourvector))
It finds the longest substring ending with a punctuation mark and a space and removes it.
I added trimws just in case there are trailing spaces in some of your strings.
Example:
u <- c("This is a sentence. And another sentence!",
"By default R regexes are greedy. So only the last sentence is kept. You see ? ",
"Single sentences are not a problem.",
"What if there are no spaces between sentences?It won't work.",
"You know what? Multiple marks don't break my solution!!",
"But if they are separated by spaces, they do ! ! !")
sub(".*(\\.|\\?|\\!) ", "", trimws(u))
# [1] "And another sentence!"
# [2] "You see ?"
# [3] "Single sentences are not a problem."
# [4] "What if there are no spaces between sentences?It won't work."
# [5] "Multiple marks don't break my solution!!"
# [6] "!"
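The two failing cases above (no space after the punctuation, and marks separated by spaces) could be handled by matching the last sentence directly rather than deleting everything before it. This is a sketch under the assumption that every string ends with ., ! or ? and that the sentences contain no abbreviations; the pattern and the variable names are illustrative, not part of the answer above.

```r
# Illustrative input, including the two cases that defeat the sub() approach
u <- c("This is a sentence. And another sentence!",
       "What if there are no spaces between sentences?It won't work.",
       "But if they are separated by spaces, they do ! ! !")

# [^.!?]+          the body of the final sentence
# (?:[.!?]\\s*)+$  one or more terminators, optionally spaced, up to the end
pattern <- "[^.!?]+(?:[.!?]\\s*)+$"

# regmatches() drops elements without a match, so this assumes every
# string ends with ., ! or ?
last <- trimws(regmatches(u, regexpr(pattern, u, perl = TRUE)))
last
```

The trade-off is that the sentence body may not contain any terminator character, so text such as "e.g." inside the last sentence would cut it short.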
This regex anchors to the end of the string with $ and allows an optional '.' or '!' at the end. At the front it finds the closest ". " or "! " as the end of the prior sentence. The positive lookbehind (?<=...) ensures that this preceding "." or '!' is not part of the match, and the ^ alternative inside it provides for a single sentence. The body is [^.!]+ rather than .+, because with .+ the leftmost match would start at ^ and return the whole string.
s <- "Sentences may end with full stops(.) or even exclamatory marks(!). Hence i am confused as to how to separate the last sentence from a string in R."
library(stringr)
str_extract(s, "(?<=(\\.\\s|\\!\\s|^))[^.!]+(\\.|\\!)?$")
yields
# [1] "Hence i am confused as to how to separate the last sentence from a string in R."
I want to remove all words that start with "a" in a string.
Input:
string <- "This is a sentence about nothing."
My attempt:
stringr::str_remove_all(string,"a*\\b")
output I got:
[1] "This is sentence about nothing."
output I want:
[1] "This is sentence nothing."
I am not sure how to detect based on one letter but perform action(e.g., remove, replace) on the whole word. Any input is appreciated!
The a*\b pattern matches zero or more a characters followed by a word boundary. It therefore only removes a whole word when that word consists solely of a's, such as the standalone "a" here.
You can use
stringr::str_remove_all(string,"\\ba\\w*")
stringr::str_replace_all(string,"\\ba\\w*", "")
gsub("\\ba\\w*", "", string, perl=TRUE) ## ASCII only letters/digits
where \ba\w* matches a word boundary, a, and then zero or more word chars.
If you also want to remove any whitespaces before the word, add \s* at the start:
stringr::str_remove_all(string,"\\s*\\ba\\w*")
stringr::str_replace_all(string,"\\s*\\ba\\w*", "")
gsub("\\s*\\ba\\w*", "", string, perl=TRUE) ## ASCII only letters/digits/whitespaces
If you need to make sure you only remove natural language words consisting only of letters, then you can replace \w with \p{L}:
stringr::str_remove_all(string,"\\s*\\ba\\p{L}*")
stringr::str_replace_all(string,"\\s*\\ba\\p{L}*", "")
gsub("(*UCP)\\s*\\ba\\p{L}*", "", string, perl=TRUE) ## any Unicode letters/digits/whitespaces
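The patterns above are case-sensitive and can leave a leading space when the first word is removed. A minimal base R sketch that wraps the same idea is shown here; remove_words_starting_with is a hypothetical helper name, and it assumes the letter passed in is a single plain letter with no special regex meaning.

```r
# Hypothetical helper: drop every word that starts with `letter`
# (case-insensitively), together with the whitespace in front of it.
# Assumes `letter` is a single plain letter with no regex meaning.
remove_words_starting_with <- function(x, letter) {
  pattern <- paste0("(?i)\\s*\\b", letter, "\\w*")
  # trimws() removes the space left behind when the first word is dropped
  trimws(gsub(pattern, "", x, perl = TRUE))
}

res1 <- remove_words_starting_with("This is a sentence about nothing.", "a")
res2 <- remove_words_starting_with("An apple a day is about right.", "a")
res1  # "This is sentence nothing."
res2  # "day is right."
```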
I want to know with how many spaces a string starts. Here are some examples:
string.1 <- "    starts with 4 spaces"
string.2 <- "  starts with only 2 spaces"
My attempt was the following but this leads to 1 in both cases and I understand why this is the case.
stringr::str_count(string.1, "^ ")
stringr::str_count(string.2, "^ ")
I'd prefer if there was a solution completely like this but with another regex.
The "^ " pattern (a caret followed by a space) matches a single space at the start of the string, which is why both test cases return 1.
To match consecutive spaces at the start of the string, you may use
stringr::str_count(string.1, "\\G ")
Or, to count any whitespaces,
stringr::str_count(string.1, "\\G\\s")
The "\G " pattern matches a space at the start of the string and then each space that immediately follows the previous successful match, because \G anchors at the position where the previous match ended.
Another approach: count the length of ^\s+ matches (1 or more whitespace chars at the start of the string):
strings <- c("    starts with 4 spaces", "  starts with only 2 spaces")
matches <- regmatches(strings, regexpr("^\\s+", strings))
nchar(matches)
# => 4 2
One approach might be to take the nchar of the input string, with all content from the first non whitespace character until the end stripped.
string.1 <- "    starts with 4 spaces"
nchar(sub("\\S.*$", "", string.1))
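For a whole vector, including strings that do not start with whitespace at all, the match length reported by regexpr() can be used directly. This is a small sketch; count_leading_ws is a made-up helper name and the input vector is illustrative.

```r
# Hypothetical helper: number of leading whitespace characters per string
count_leading_ws <- function(x) {
  m <- regexpr("^\\s+", x)
  # match.length is -1 where nothing matched; clamp those to 0
  pmax(attr(m, "match.length"), 0L)
}

counts <- count_leading_ws(c("    four", "  two", "none"))
counts  # 4 2 0
```

Unlike regmatches(), which drops elements without a match, this keeps one count per input string.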
I am trying to postprocess the LaTeX (pdf_book output) of a bookdown document to collapse biblatex citations, so that they can be sorted chronologically using \usepackage[sortcites]{biblatex} later on. Thus, I need to find }{ after \\autocites and replace it with ,. I am experimenting with gsub() but can't find the correct incantation.
# example input
testcase <- "text \\autocites[cf.~][]{foxMapping2000}{wattPattern1947}{runkleGap1990} text {keep}{separate}"
# desired output
"text \\autocites[cf.~][]{foxMapping2000,wattPattern1947,runkleGap1990} text {keep}{separate}"
A simple approach was to replace all }{
> gsub('\\}\\{', ',', testcase, perl=TRUE)
[1] "text \\autocites[cf.~][]{foxMapping2000,wattPattern1947,runkleGap1990} text {keep,separate}"
But this also collapses {keep}{separate}.
I then tried to replace }{ within a 'word' (string of characters without whitespace) starting with \\autocites by using different groups and failed bitterly:
> gsub('(\\\\autocites)([^ \f\n\r\t\v}{}]+)((\\}\\{})+)', '\\1\\2\\3', testcase, perl=TRUE)
[1] "text \\autocites[cf.~][]{foxMapping2000}{wattPattern1947}{runkleGap1990} text {keep}{separate}"
Addendum:
The actual document contains more lines/elements than the testcase above. Not all elements contain \\autocites and in rare cases one element has more than one \\autocites. I didn't originally think this was relevant. A more realistic testcase:
testcase2 <- c("some text",
"text \\autocites[cf.~][]{foxMapping2000}{wattPattern1947}{runkleGap1990} text {keep}{separate}",
"text \\autocites[cf.~][]{foxMapping2000}{wattPattern1947}{runkleGap1990} text {keep}{separate} \\autocites[cf.~][]{foxMapping2000}{wattPattern1947}")
A single gsub call is enough:
gsub("(?:\\G(?!^)|\\\\autocites)\\S*?\\K}{", ",", testcase, perl=TRUE)
## => [1] "text \\autocites[cf.~][]{foxMapping2000,wattPattern1947,runkleGap1990} text {keep}{separate}"
Here, (?:\G(?!^)|\\autocites) matches either the end of the previous match or the \autocites string; \S*? then matches zero or more non-whitespace chars, but as few as possible; \K discards the text matched so far from the match; and finally }{ is matched, so only that substring is replaced with a comma.
There is also a very readable solution with one regex and one fixed text replacements using stringr::str_replace_all:
library(stringr)
str_replace_all(testcase, "\\\\autocites\\S+", function(x) gsub("}{", ",", x, fixed=TRUE))
# => [1] "text \\autocites[cf.~][]{foxMapping2000,wattPattern1947,runkleGap1990} text {keep}{separate}"
Here, \\autocites\S+ matches \autocites and then 1+ non-whitespace chars, and gsub("}{", ",", x, fixed=TRUE) replaces (very fast) each }{ with , in the matched text.
Not the prettiest solution, but it works. This repeatedly replaces }{ with , but only if it follows autocites with no intervening blanks.
while(length(grep('(autocites\\S*)\\}\\{', testcase, perl=TRUE))) {
  testcase = sub('(autocites\\S*)\\}\\{', '\\1,', testcase, perl=TRUE)
}
testcase
[1] "text \\autocites[cf.~][]{foxMapping2000,wattPattern1947,runkleGap1990} text {keep}{separate}"
I'll make the input string slightly bigger to make the algorithm more clear.
str <- "
text \\autocites[cf.~][]{foxMapping2000}{wattPattern1947}{runkleGap1990} text {keep}{separate}
text \\autocites[cf.~][]{wattPattern1947}{foxMapping2000}{runkleGap1990} text {keep}{separate}
"
We first extract all the citation blocks, replace "}{" with "," in them, and then put them back into the string.
library(stringr)
# pattern for matching citation blocks
pattern <- "\\\\autocites(\\[[^\\[\\]]*\\])*(\\{[[:alnum:]]*\\})+"
cit <- str_extract_all(str, pattern)[[1]]
cit
#> [1] "\\autocites[cf.~][]{foxMapping2000}{wattPattern1947}{runkleGap1990}"
#> [2] "\\autocites[cf.~][]{wattPattern1947}{foxMapping2000}{runkleGap1990}"
Replace in citation blocks:
newcit <- str_replace_all(cit, "\\}\\{", ",")
newcit
#> [1] "\\autocites[cf.~][]{foxMapping2000,wattPattern1947,runkleGap1990}"
#> [2] "\\autocites[cf.~][]{wattPattern1947,foxMapping2000,runkleGap1990}"
Split the original string at the places where a citation block was found:
strspl <- str_split(str, pattern)[[1]]
strspl
#> [1] "\ntext " " text {keep}{separate}\ntext " " text {keep}{separate}\n"
Insert modified citation blocks:
combined <- character(length(strspl) + length(newcit))
combined[c(TRUE, FALSE)] <- strspl
combined[c(FALSE, TRUE)] <- newcit
combined
#> [1] "\ntext "
#> [2] "\\autocites[cf.~][]{foxMapping2000,wattPattern1947,runkleGap1990}"
#> [3] " text {keep}{separate}\ntext "
#> [4] "\\autocites[cf.~][]{wattPattern1947,foxMapping2000,runkleGap1990}"
#> [5] " text {keep}{separate}\n"
Paste it together to finalize:
newstr <- paste(combined, collapse = "")
newstr
#> [1] "\ntext \\autocites[cf.~][]{foxMapping2000,wattPattern1947,runkleGap1990} text {keep}{separate}\ntext \\autocites[cf.~][]{wattPattern1947,foxMapping2000,runkleGap1990} text {keep}{separate}\n"
I suspect there could be a more elegant fully-regex solution based on the same idea, but I wasn't able to find one.
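The same extract, modify and reinsert idea can also be sketched in base R with the replacement form of regmatches(), which writes modified matches back in place and works on a vector with zero, one or several citation blocks per element. This is only a sketch: it assumes, as the other answers do, that a citation block is \autocites followed by non-whitespace characters, and the name collapsed is illustrative.

```r
# The more realistic test case from the question's addendum
testcase2 <- c("some text",
  "text \\autocites[cf.~][]{foxMapping2000}{wattPattern1947}{runkleGap1990} text {keep}{separate}",
  "text \\autocites[cf.~][]{foxMapping2000}{wattPattern1947}{runkleGap1990} text {keep}{separate} \\autocites[cf.~][]{foxMapping2000}{wattPattern1947}")

collapsed <- testcase2

# Locate every citation block: \autocites plus the non-whitespace run after it
m <- gregexpr("\\\\autocites\\S+", collapsed)

# Replace }{ with , inside each block only, then write the blocks back
regmatches(collapsed, m) <- lapply(
  regmatches(collapsed, m),
  function(hit) gsub("}{", ",", hit, fixed = TRUE)
)
collapsed
```

Text outside the matched blocks, such as {keep}{separate}, is never touched because only the matched substrings are rewritten.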
I found an incantation that works. It's not pretty:
gsub("\\\\autocites[^ ]*",
     gsub("\\}\\{", ",",
          gsub(".*(\\\\autocites[^ ]*).*", "\\\\\\1", testcase) # all those extra backslashes are there because R is ridiculous.
     ),
     testcase)
I broke it into lines to hopefully make it a little more intelligible. Basically, the innermost gsub extracts just the autocites (anything that follows \\autocites up to the first space), then the middle gsub replaces the }{s with commas, and the outermost gsub substitutes the result of the middle one for the pattern extracted in the innermost one.
This will only work with a single autocites in a string, of course.
Also, fortune(365).
How (in R) would I remove any word in a string containing punctuation, keeping words without?
test.string <- "I am:% a test+ to& see if-* your# fun/ction works o\r not"
desired <- "I a see works not"
Here is an approach using gsub which seems to work:
test.string <- "I am:% a test$ to& see if* your# fun/ction works o\r not"
gsub("[A-Za-z]*[^A-Za-z ]\\S*\\s*", "", test.string)
[1] "I a see works not"
This approach is to use the following regex pattern:
[A-Za-z]* match zero or more leading letters
[^A-Za-z ] then match a symbol once (anything that is not a letter or a space)
\\S* followed by any further non-whitespace characters
\\s* followed by any amount of whitespace
Then, we just replace with empty string, to remove the words having one or more symbols in them.
You can use this regex
(?<=\\s|^)[A-Za-z0-9]+(?=\\s|$)
(?<=\\s|^) - positive lookbehind: the match must be preceded by whitespace or the start of the string.
[A-Za-z0-9]+ - match one or more letters or digits.
(?=\\s|$) - positive lookahead: the match must be followed by whitespace or the end of the string.
Tim's edit:
This answer uses a whitelist approach, namely identify all words which the OP does want to retain in his output. We can try matching using the regex pattern given above, and then connect the vector of matches using paste:
test.string <- "I am:% a test$ to& see if* your# fun/ction works o\\r not"
result <- regmatches(test.string,gregexpr("(?<=\\s|^)[A-Za-z0-9]+(?=\\s|$)",test.string, perl=TRUE))[[1]]
paste(result, collapse=" ")
[1] "I a see works not"
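The whitelist idea can also be sketched without lookarounds by splitting on single spaces and keeping only the tokens that consist entirely of letters and digits. This is an illustrative base R sketch using the question's string (with a real carriage return after the "o"); the names tokens, clean and kept are made up.

```r
test.string <- "I am:% a test$ to& see if* your# fun/ction works o\r not"

# Split on literal spaces only, so "o\r" stays one token
tokens <- strsplit(test.string, " ", fixed = TRUE)[[1]]

# Keep tokens made up solely of letters and digits
clean <- tokens[grepl("^[A-Za-z0-9]+$", tokens)]

kept <- paste(clean, collapse = " ")
kept  # "I a see works not"
```

Testing whole tokens avoids having to describe what may surround a word, at the cost of treating only the space character as a separator.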
Here are a couple more approaches.
First approach:
library(stringr) # also provides the %>% pipe
str_split(test.string, " ", n=Inf) %>% # splitting the line into words
  unlist %>%
  .[!str_detect(., "\\W|\r")] %>% # keep words without punctuation or \r
  paste(.,collapse=" ") # collapse the words to get the line
Second approach:
str_extract_all(test.string, "^\\w+|\\s\\w+\\s|\\w+$") %>%
  unlist %>%
  trimws() %>%
  paste(., collapse=" ")
^\\w+ - words having only [a-zA-Z0-9_] and also is start of the string
\\s\\w+\\s - words with [a-zA-Z0-9_] and having space before and after the word
\\w+$ - words having [a-zA-Z0-9_] and also is the end of the string
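One caveat with the \\s\\w+\\s alternative is that it consumes the space on both sides, so when two clean words stand next to each other in the middle of a string, the second one has no leading space left to match and is skipped. A possible variant, sketched here in base R with made-up input, uses lookarounds that only check for the absence of a non-whitespace neighbour without consuming anything.

```r
x <- "alpha beta gamma delta"

# Pattern from the second approach: " beta " consumes the space that
# "gamma" would need in front of it, so "gamma" is missed
consuming <- trimws(regmatches(x, gregexpr("^\\w+|\\s\\w+\\s|\\w+$", x))[[1]])

# Lookaround variant: a word with no non-whitespace character on either side
lookaround <- regmatches(x, gregexpr("(?<!\\S)\\w+(?!\\S)", x, perl = TRUE))[[1]]

consuming   # "alpha" "beta" "delta"
lookaround  # "alpha" "beta" "gamma" "delta"
```

Note that the lookaround variant treats any whitespace as a separator, including a carriage return, so the "o" in "o\r" from the question's string would be kept by it.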
I have a vector of strings that looks like:
str <- c("bills slashed for poor families today", "your calls are charged", "complaints dept awaiting refund")
I want to get all the words that end with the letter s and remove the s. I have tried:
gsub("s$","",str)
but it doesn't work because it tries to match with the strings that end with s instead of words. I'm trying to get an output that looks like:
[1] bill slashed for poor familie today
[2] your call are charged
[3] complaint dept awaiting refund
Any pointers as to how I can do this? Thanks
$ checks for the end of the string, not the end of a word.
To check for the word boundaries you should use \b
So:
gsub("s\\b", "", str)
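The difference between the two anchors can be seen side by side in a short sketch; the input vector here is made up for illustration.

```r
x <- c("bills slashed today", "complaints about calls")

# $ only matches at the end of each string
end_of_string <- gsub("s$", "", x)

# \b matches at the end of every word
end_of_word <- gsub("s\\b", "", x)

end_of_string  # "bills slashed today"  "complaints about call"
end_of_word    # "bill slashed today"   "complaint about call"
```

Keep in mind that s\\b removes every word-final s, so words such as "is" or "bus" would be shortened as well.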
Here's a non base R solution:
library(rebus)
library(stringr)
plurals <- "s" %R% BOUNDARY
str_replace_all(str, pattern = plurals, replacement = "")
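You could also use a positive lookahead assertion:
gsub(pattern = "s(?=\\s|$)", replacement = "", x = str, perl = TRUE)
I am no expert on regex, but this expression looks for an "s" that is followed by whitespace or by the end of the string. The lookahead (?=...) only checks what follows without consuming it, so the "s" alone is replaced with an empty string and word-final "s"s are removed. (The form (?>\\s) is an atomic group rather than a lookahead: it consumes the space, which then has to be put back by the replacement.)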