Suppose I have the following text:
txt <- as.character("this is just a test! i'm not sure if this is O.K. or if it will work? who knows. regex is sorta new to me.. There are certain cases that I may not figure out?? sad! ^_^")
I want to capitalize the first alphabetical character of a sentence.
I figured out the regular expression to match as: ^|[[:alnum:]]+[[:alnum:]]+[.!?]+[[:space:]]*[[:space:]]+[[:alnum:]]
A call to gregexpr returns:
> gregexpr("^|[[:alnum:]]+[[:alnum:]]+[.!?]+[[:space:]]*[[:space:]]+[[:alnum:]]", txt)
[[1]]
[1] 1 16 65 75 104 156
attr(,"match.length")
[1] 0 7 7 8 7 8
attr(,"useBytes")
[1] TRUE
Which are the correct substring indices that match.
However, how do I implement this to properly capitalize the characters I need? I'm assuming I have to strsplit and then... ?
It appears that your regex did not work for your example, so I stole one from this question.
txt <- as.character("this is just a test! i'm not sure if this is O.K. or if it will work? who knows. regex is sorta new to me.. There are certain cases that I may not figure out?? sad! ^_^")
print(txt)
gsub("([^.!?\\s])([^.!?]*(?:[.!?](?!['\"]?\\s|$)[^.!?]*)*[.!?]?['\"]?)(?=\\s|$)", "\\U\\1\\E\\2", txt, perl=T, useBytes = F)
Using rex may make this type of task a little simpler. This implements the same regex that merlin2011 used.
txt <- as.character("this is just a test! i'm not sure if this is O.K. or if it will work? who knows. regex is sorta new to me.. There are certain cases that I may not figure out?? sad! ^_^")
re <- rex(
capture(name = 'first_letter', alnum),
capture(name = 'sentence',
any_non_puncts,
zero_or_more(
group(
punct %if_next_isnt% space,
any_non_puncts
)
),
maybe(punct)
)
)
re_substitutes(txt, re, "\\U\\1\\E\\2", global = TRUE)
#>[1] "This is just a test! I'm not sure if this is O.K. Or if it will work? Who knows. Regex is sorta new to me.. There are certain cases that I may not figure out?? Sad! ^_^"
Related
I am working with really long strings. How can I remove the first 4 words after a certain string pattern occurs? For example:
string <- "hello I am a user of stackoverflow and I am really happy with all the help the community usually offers when I'm in need of some coding expertise."
#remove the fist 4 words after and including "stackoverflow"
result
"hello I am a user of happy with all the help the community usually offers when I'm in need of some coding expertise."
Solution with base R
A one line solution:
pattern <- "stackoverflow"
string <- "hello I am a user of stackoverflow and I am really happy with all the help the community usually offers when I'm in need of some coding expertise."
gsub(paste0(pattern, "(\\W+\\w+){0,4}\\W*"), "", string)
#> [1] "hello I am a user of happy with all the help the community usually offers when I'm in need of some coding expertise."
How it works
Create the pattern you want with a regex:
"stackoverflow" followed by 4 words.
Definitely, check out ?regex for more info about it.
Words are identified by \\w+ and separators are identified by \\W+ (capital w, it includes spaces and special characters like the apostrophe that you have in the sentence)
(...){0,4} means that the combination of word and separator may repeat up to 4 times.
\\W* needs to identify a possible final separator, so that the remaining two pieces of the sentence won't have two separators dividing them. Try it without, you'll see what I mean.
gsub locates the pattern you want and replace it with "" (thus deliting it).
Handle Exceptions
Note that it works even for particular cases:
# end of a sentence with fewer than 4 words after
string <- "hello I am a user of stackoverflow and I am"
gsub(paste0(pattern, "(\\W+\\w+){0,4}\\W*"), "", string)
#> [1] "hello I am a user of "
# beginning of a sentence
string <- "stackoverflow and I am really happy with all the help the community usually offers when I'm in need of some coding expertise."
gsub(paste0(pattern, "(\\W+\\w+){0,4}\\W*"), "", string)
#> [1] "happy with all the help the community usually offers when I'm in need of some coding expertise."
# pattern == string
string <- "stackoverflow and I am really"
gsub(paste0(pattern, "(\\W+\\w+){0,4}\\W*"), "", string)
#> [1] ""
A tidyverse solution
library(stringr)
# locate start and end position of pattern
tmp <- str_locate(string, paste0(pattern,"(\\W+\\w+){0,4}\\W*"))
# get positions: start_sentence-start_pattern and end_pattern-end_sentence
tmp <- invert_match(tmp)
# get the substrings
tmp <- str_sub(string, tmp[,1], tmp[,2])
# collapse substrings together
str_c(tmp, collapse = "")
#> [1] "hello I am a user of happy with all the help the community usually offers when I'm in need of some coding expertise."
Search for your pattern with additional spaces and words after it. Find the positions of the first last match, split the string and paste it back together. At the end gsub any double (or more) spaces.
string <- "hello I am a user of stackoverflow and I am really happy with all the help the community usually offers when I'm in need of some coding expertise."
pat="stackoverflow"
library(stringr)
tmp=str_locate(
string,
paste0(
pat,
paste0(
rep("\\s?[a-zA-Z]+",4),
collapse=""
)
)
)
gsub("\\s{2,}"," ",
paste0(
substring(string,1,tmp[1]-1),
substring(string,tmp[2]+1)
)
)
[1] "hello I am a user of happy with all the help the community usually offers when I'm in need of some coding expertise."
Quick answer, I am sure you can have better code thant that:
string <- "hello I am a user of stackoverflow and I am really happy with all the help the community usually offers when I'm in need of some coding expertise."
t<-read.table(textConnection(string))
string2<-''
i<-0
j<-0
for(i in 1:length(t)){
if(t[i]=="stackoverflow"){
j=i
}else if(j>0){
if(i-j>4){
string2=paste0(string2, " " , t[i])
}
}else if(j==0){
if(i>1){
string2=paste0(string2, " " , t[i])
}else{
string2=t[i]
}
}
}
print(string2)
I have a lot of german street names. Most of them end with the word ...strasse. I want to replace strasse with its abreviation str allowing for minor misspellings (1 or 2 characters missing or wrong) such as strae or strassee.
I tried many things and looked up some more:
street_names <- c("GERBERSTRAE", "NEUE STRAASSE", "SCHLOSSSTASSE", "HAUPTSTRASSE", "WINZERGASSE")
> gsub("[STRASSE]{5,7}S?T?R?A?S?S?E?$" , "STR", street_names, perl = T)
[1] "GERBSTR" "NEUE STR" "SCHLOSTR" "HAUPSTR" "WINZERGASSE"
> gsub("S?T?R?A?S?S?E?$" , "STR", street_names, perl = T)
[1] "GERBERSTR" "NEUE STRASTR" "SCHLOSSSTR" "HAUPTSTR"
[5] "WINZERGSTR"
But so far all of them get some right and some wrong, and I don't know how to combine them. ("Winzergasse" should not be matched, as it ends on Gasse which translates to alley)
Any help is greatly appreciated.
EDIT:
more examples
street_names <- c("GERBERSTRAE", "NEUE STRAASSE", "SCHLOSSSTASSE", "HAUPTSTRASSE", "LINDENSASSE",
"WINZERGASSE", "PARKSTRASE", "ALTE STTRASSE", "BACHSTRAS", "LANGE SRASS")
You could use
gsub("GASSE(*SKIP)(*FAIL)|ST*R?[ASE]+$", "STR", street_names, perl = T)
Which yields
[1] "GERBERSTR" "NEUE STR" "SCHLOSSSTR" "HAUPTSTR" "WINZERGASSE"
The pattern here is
GASSE(*SKIP)(*FAIL) # match "GASSE" and let it fail
| # or
ST*R?[ASE]+ # S, T (0 or more times), R (optional), any A, S or E
$ # end of the string
See a demo on regex101.com.
I don't know how many different types of typografic errors you can encounter. For the examples you have given, something like this would work:
gsub("STR.*|STA.*","STR",street_names)
[1] "GERBERSTR" "NEUE STR" "SCHLOSSSTR" "HAUPTSTR"
[5] "WINZERGASSE"
Appending a question mark to every character in the pattern makes them all optional, so the pattern will basically match everything.
It's much easier to just list common misspellings fully and live with the fact that some people will find creative spellings that you didn't think of.
A bit brute force, but save I think:
gsub("(STRAE$)|(STRAASSE$)|(STASSE$)|(STRASSE$)", "STR", street_names)
[1] "GERBERSTR" "NEUE STR" "SCHLOSSSTR" "HAUPTSTR" "WINZERGASSE"
I have a string printed out like this:
"\"Jenna and Alex were making cupcakes.\", \"Jenna asked Alex whether all were ready to be frosted.\", \"Alex said that\", \" some of them \", \"were.\", \"He added\", \"that\", \"the rest\", \"would be\", \"ready\", \"soon.\", \"\""
(The "\" wasn't there. R just automatically prints it out.)
I would like to calculate how many non-empty segments there are in this string. In this case the answer should be 11.
I tried to convert it to a vector, but R ignores the quotation marks so I still ended up with a vector with length 1.
I don't know whether I need to extract those segments first and then count, or there're easier ways to do that.
If it's the former case, which regular expression function best suits my need?
Thank you very much.
You can use scan to convert your large string into a vector of individual ones, then use nchar to count the lengths. Assuming your large string is x:
y <- scan(text=x, what="character", sep=",", strip.white=TRUE)
Read 12 items
sum(nchar(y)>0)
[1] 11
I assume a segment is defined as anything between . or ,. An option using strsplit can be found as:
length(grep("\\w+", trimws(strsplit(str, split=",|\\.")[[1]])))
#[1] 11
Note: trimws is not mandatory in above statement. I have included so that one can get the value of each segment by just adding value = TRUE argument in grep.
Data:
str <- "\"Jenna and Alex were making cupcakes.\", \"Jenna asked Alex whether all were ready to be frosted.\", \"Alex said that\", \" some of them \", \"were.\", \"He added\", \"that\", \"the rest\", \"would be\", \"ready\", \"soon.\", \"\""
strsplit might be one possibility?
txt <- "Jenna and Alex were making cupcakes., Jenna asked Alex whether all were ready to be frosted.,
Alex said that, some of them , were., He added, that, the rest, would be, ready, soon.,"
a <- strsplit(txt, split=",")
length(a[[1]])
[1] 11
If the backslashes are part of the text it doesnt really change a lot, except for the last element which would have "\"" in it. By filtering that out, the result is the same:
txt <- "\"Jenna and Alex were making cupcakes.\", \"Jenna asked Alex whether all
were ready to be frosted.\", \"Alex said that\", \" some of them \",
\"were.\", \"He added\", \"that\", \"the rest\", \"would be\", \"ready\", \"soon.\", \"\""
a <- strsplit(txt, split=", \"")
length(a[[1]][a[[1]] != "\""])
[1] 11
This is an absurd idea, but it does work:
txt <- "\"Jenna and Alex were making cupcakes.\", \"Jenna asked Alex whether all were ready to be frosted.\", \"Alex said that\", \" some of them \", \"were.\", \"He added\", \"that\", \"the rest\", \"would be\", \"ready\", \"soon.\", \"\""
Txt <-
read.csv(text = txt,
header = FALSE,
colClasses = "character",
na.strings = c("", " "))
sum(!vapply(Txt, is.na, logical(1)))
Looking for some guidance on how to replace a curly apostrophe with a straight apostrophe in an R list of character vectors.
The reason I'm replacing the curly apostrophes - later in the script, I check each list item, to see if it's found in a dictionary (using qdapDictionary) to ensure it's a real word and not garbage. The dictionary uses straight apostrophes, so words with the curly apostrophes are being "rejected."
A sample of the code I have currently follows. In my test list, item #6 contains a curly apostrophe, and item #2 has a straight apostrophe.
Example:
list_TestWords <- as.list(c("this", "isn't", "ideal", "but", "we", "can’t", "fix", "it"))
func_ReplaceTypographicApostrophes <- function(x) {
gsub("’", "'", x, ignore.case = TRUE)
}
list_TestWords_Fixed <- lapply(list_TestWords, func_ReplaceTypographicApostrophes)
The result: No change. Item 6 still using curly apostrophe. See output below.
list_TestWords_Fixed
[[1]]
[1] "this"
[[2]]
[1] "isn't"
[[3]]
[1] "ideal"
[[4]]
[1] "but"
[[5]]
[1] "we"
[[6]]
[1] "can’t"
[[7]]
[1] "fix"
[[8]]
[1] "it"
Any help you can offer will be most appreciated!
This might work: gsub("[\u2018\u2019\u201A\u201B\u2032\u2035]", "'", x)
I found it over here: http://axonflux.com/handy-regexes-for-smart-quotes
You might be running up against a bug in R on Windows. Try using utf8::as_utf8 on your input. Alternatively, this also works:
library(utf8)
list_TestWords <- as.list(c("this", "isn't", "ideal", "but", "we", "can’t", "fix", "it"))
lapply(list_TestWords, utf8_normalize, map_quote = TRUE)
This will replace the following characters with ASCII apostrophe:
U+055A ARMENIAN APOSTROPHE
U+2018 LEFT SINGLE QUOTATION MARK
U+2019 RIGHT SINGLE QUOTATION MARK
U+201B SINGLE HIGH-REVERSED-9 QUOTATION MARK
U+FF07 FULLWIDTH APOSTROPHE
It will also convert your text to composed normal form (NFC).
I see a problem in your call to gsub:
gsub("/’", "/'", x, ignore.case = TRUE)
You are prefixing the curly single quote with a forward slash. I don't know why you are doing this. I could speculate that you are trying to escape the quote characters, but this is having the side effect that your pattern is now trying to match a forward slash followed by a quote. As this never occurs in your text, no replacements are being made. You should be doing this:
gsub("’", "'", x, ignore.case = TRUE)
Follow the link below for a demo which shows that using the above gsub calls works as you expect.
Demo
Was about to say the same thing.
Try using str_replace from stringr package, will not need to use slashes
I was facing similar problem. Somehow non of the solutions worked for me. So I devised an indirect way of doing it by identifying apostrophe and replacing it with the required format.
gsub("(\\w)(\\W)(\\w\\s)", "\\1'\\3","sid’s bicycle")
[1] "sid's bicycle"
Hope it helps someone.
I have a string, say
fruit <- "()goodapple"
I want to remove the brackets in the string. I decide to use stringr package because it usually can handle this kind of issues. I use :
str_replace(fruit,"()","")
But nothing is replaced, and the following is replaced:
[1] "()good"
If I only want to replace the right half bracket, it works:
str_replace(fruit,")","")
[1] "(good"
However, the left half bracket does not work:
str_replace(fruit,"(","")
and the following error is shown:
Error in sub("(", "", "()good", fixed = FALSE, ignore.case = FALSE, perl = FALSE) :
invalid regular expression '(', reason 'Missing ')''
Anyone has ideas why this happens? How can I remove the "()" in the string, then?
Escaping the parentheses does it...
str_replace(fruit,"\\(\\)","")
# [1] "goodapple"
You may also want to consider exploring the "stringi" package, which has a similar approach to "stringr" but has more flexible functions. For instance, there is stri_replace_all_fixed, which would be useful here since your search string is a fixed pattern, not a regex pattern:
library(stringi)
stri_replace_all_fixed(fruit, "()", "")
# [1] "goodapple"
Of course, basic gsub handles this just fine too:
gsub("()", "", fruit, fixed=TRUE)
# [1] "goodapple"
The accepted answer works for your exact problem, but not for the more general problem:
my_fruits <- c("()goodapple", "(bad)apple", "(funnyapple")
str_replace(my_fruits,"\\(\\)","")
## "goodapple" "(bad)apple", "(funnyapple"
This is because the regex exactly matches a "(" followed by a ")".
Assuming you care only about bracket pairs, this is a stronger solution:
str_replace(my_fruits, "\\([^()]{0,}\\)", "")
## "goodapple" "apple" "(funnyapple"
Building off of MJH's answer, this removes all ( or ):
my_fruits <- c("()goodapple", "(bad)apple", "(funnyapple")
str_replace_all(my_fruits, "[//(//)]", "")
[1] "goodapple" "badapple" "funnyapple"