I have a lot of German street names. Most of them end with the word ...strasse. I want to replace strasse with its abbreviation str, allowing for minor misspellings (1 or 2 characters missing or wrong) such as strae or strassee.
I tried many things and looked up some more:
street_names <- c("GERBERSTRAE", "NEUE STRAASSE", "SCHLOSSSTASSE", "HAUPTSTRASSE", "WINZERGASSE")
> gsub("[STRASSE]{5,7}S?T?R?A?S?S?E?$" , "STR", street_names, perl = T)
[1] "GERBSTR" "NEUE STR" "SCHLOSTR" "HAUPSTR" "WINZERGASSE"
> gsub("S?T?R?A?S?S?E?$" , "STR", street_names, perl = T)
[1] "GERBERSTR" "NEUE STRASTR" "SCHLOSSSTR" "HAUPTSTR"
[5] "WINZERGSTR"
But so far each of them gets some right and some wrong, and I don't know how to combine them. ("Winzergasse" should not be matched, as it ends in Gasse, which translates to alley.)
Any help is greatly appreciated.
EDIT:
more examples
street_names <- c("GERBERSTRAE", "NEUE STRAASSE", "SCHLOSSSTASSE", "HAUPTSTRASSE", "LINDENSASSE",
"WINZERGASSE", "PARKSTRASE", "ALTE STTRASSE", "BACHSTRAS", "LANGE SRASS")
You could use
gsub("GASSE(*SKIP)(*FAIL)|ST*R?[ASE]+$", "STR", street_names, perl = T)
Which yields
[1] "GERBERSTR" "NEUE STR" "SCHLOSSSTR" "HAUPTSTR" "WINZERGASSE"
The pattern here is
GASSE(*SKIP)(*FAIL) # match "GASSE" and let it fail
| # or
ST*R?[ASE]+ # S, T (0 or more times), R (optional), any A, S or E
$ # end of the string
See a demo on regex101.com.
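To see how far this pattern stretches, here is a small check against the longer list from the question's edit. It is only a sketch: the extra name WIESE is a hypothetical test case of mine, not from the question, and it shows that the loose `[ASE]+` tail also catches ordinary endings.

```r
street_names <- c("GERBERSTRAE", "NEUE STRAASSE", "SCHLOSSSTASSE", "HAUPTSTRASSE",
                  "LINDENSASSE", "WINZERGASSE", "PARKSTRASE", "ALTE STTRASSE",
                  "BACHSTRAS", "LANGE SRASS")
pattern <- "GASSE(*SKIP)(*FAIL)|ST*R?[ASE]+$"

# (*SKIP)(*FAIL) is PCRE syntax, so perl = TRUE is required
result <- gsub(pattern, "STR", street_names, perl = TRUE)
print(result)
# "GERBERSTR" "NEUE STR" "SCHLOSSSTR" "HAUPTSTR" "LINDENSTR"
# "WINZERGASSE" "PARKSTR" "ALTE STR" "BACHSTR" "LANGE STR"

# hypothetical extra name: a trailing "SE" is enough to trigger the replacement
over_match <- gsub(pattern, "STR", "WIESE", perl = TRUE)
print(over_match)
# "WIESTR"
```

So the pattern handles all ten examples, but it is worth testing it against names that should stay untouched before running it on the full data.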
I don't know how many different types of typographic errors you can encounter. For the examples you have given, something like this would work:
gsub("STR.*|STA.*","STR",street_names)
[1] "GERBERSTR" "NEUE STR" "SCHLOSSSTR" "HAUPTSTR"
[5] "WINZERGASSE"
Appending a question mark to every character in the pattern makes them all optional, so the pattern will basically match everything.
It's much easier to just list common misspellings fully and live with the fact that some people will find creative spellings that you didn't think of.
A bit brute force, but safe I think:
gsub("(STRAE$)|(STRAASSE$)|(STASSE$)|(STRASSE$)", "STR", street_names)
[1] "GERBERSTR" "NEUE STR" "SCHLOSSSTR" "HAUPTSTR" "WINZERGASSE"
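If listing every misspelling by hand gets tedious, the "1 or 2 characters missing or wrong" rule from the question can also be checked directly with an edit distance. The following is a rough sketch, not a tested production solution: `abbreviate_strasse` is a hypothetical helper name, and it assumes upper-case input and that comparing the last 5 to 9 characters against "STRASSE" with `utils::adist()` is good enough.

```r
# Hypothetical helper: replace a trailing "STRASSE" (or a near miss) with "STR".
abbreviate_strasse <- function(x, target = "STRASSE", max_dist = 2) {
  vapply(x, function(name) {
    n <- nchar(name)
    # candidate suffix lengths: length of the target +/- max_dist
    ks <- seq(nchar(target) - max_dist, nchar(target) + max_dist)
    ks <- ks[ks >= 1 & ks <= n]
    if (length(ks) == 0) return(name)
    suffixes <- substring(name, n - ks + 1, n)
    # Levenshtein distance of each candidate suffix to the target
    d <- drop(utils::adist(suffixes, target))
    best <- which.min(d)
    if (d[best] <= max_dist) {
      paste0(substring(name, 1, n - ks[best]), "STR")
    } else {
      name
    }
  }, character(1), USE.NAMES = FALSE)
}

street_names <- c("GERBERSTRAE", "NEUE STRAASSE", "SCHLOSSSTASSE",
                  "HAUPTSTRASSE", "WINZERGASSE")
abbreviated <- abbreviate_strasse(street_names)
print(abbreviated)
# "GERBERSTR" "NEUE STR" "SCHLOSSSTR" "HAUPTSTR" "WINZERGASSE"
```

WINZERGASSE survives because every suffix of it is at least three edits away from STRASSE. The trade-off is the same as with any fuzzy match: a real word within two edits, for example a name ending in TERRASSE, would be rewritten too.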
I am working with really long strings. How can I remove the first 4 words after a certain string pattern occurs? For example:
string <- "hello I am a user of stackoverflow and I am really happy with all the help the community usually offers when I'm in need of some coding expertise."
#remove the first 4 words after and including "stackoverflow"
result
"hello I am a user of happy with all the help the community usually offers when I'm in need of some coding expertise."
Solution with base R
A one line solution:
pattern <- "stackoverflow"
string <- "hello I am a user of stackoverflow and I am really happy with all the help the community usually offers when I'm in need of some coding expertise."
gsub(paste0(pattern, "(\\W+\\w+){0,4}\\W*"), "", string)
#> [1] "hello I am a user of happy with all the help the community usually offers when I'm in need of some coding expertise."
How it works
Create the pattern you want with a regex:
"stackoverflow" followed by 4 words.
Definitely, check out ?regex for more info about it.
Words are identified by \\w+ and separators are identified by \\W+ (capital w, it includes spaces and special characters like the apostrophe that you have in the sentence)
(...){0,4} means that the combination of word and separator may repeat up to 4 times.
\\W* needs to identify a possible final separator, so that the remaining two pieces of the sentence won't have two separators dividing them. Try it without, you'll see what I mean.
gsub locates the pattern you want and replaces it with "" (thus deleting it).
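The pieces described above can be wrapped in a small function so that the number of words is a parameter. This is just a sketch: `remove_after` is a made-up name, and it assumes the search term contains no regex metacharacters.

```r
# Hypothetical helper: drop `pattern` plus up to `n` words that follow it.
# Assumes `pattern` contains no regex metacharacters.
remove_after <- function(string, pattern, n = 4) {
  # pattern, then up to n (separator + word) pairs, then a trailing separator
  regex <- paste0(pattern, "(\\W+\\w+){0,", n, "}\\W*")
  gsub(regex, "", string)
}

string <- "hello I am a user of stackoverflow and I am really happy"
res4 <- remove_after(string, "stackoverflow")
res2 <- remove_after(string, "stackoverflow", n = 2)
print(res4)  # "hello I am a user of happy"
print(res2)  # "hello I am a user of am really happy"
```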
Handle Exceptions
Note that it works even for particular cases:
# end of a sentence with fewer than 4 words after
string <- "hello I am a user of stackoverflow and I am"
gsub(paste0(pattern, "(\\W+\\w+){0,4}\\W*"), "", string)
#> [1] "hello I am a user of "
# beginning of a sentence
string <- "stackoverflow and I am really happy with all the help the community usually offers when I'm in need of some coding expertise."
gsub(paste0(pattern, "(\\W+\\w+){0,4}\\W*"), "", string)
#> [1] "happy with all the help the community usually offers when I'm in need of some coding expertise."
# pattern == string
string <- "stackoverflow and I am really"
gsub(paste0(pattern, "(\\W+\\w+){0,4}\\W*"), "", string)
#> [1] ""
A tidyverse solution
library(stringr)
# locate start and end position of pattern
tmp <- str_locate(string, paste0(pattern,"(\\W+\\w+){0,4}\\W*"))
# get positions: start_sentence-start_pattern and end_pattern-end_sentence
tmp <- invert_match(tmp)
# get the substrings
tmp <- str_sub(string, tmp[,1], tmp[,2])
# collapse substrings together
str_c(tmp, collapse = "")
#> [1] "hello I am a user of happy with all the help the community usually offers when I'm in need of some coding expertise."
Search for your pattern with additional spaces and words after it. Find the start and end positions of the match, cut that part out of the string and paste the rest back together. At the end, gsub any double (or more) spaces.
string <- "hello I am a user of stackoverflow and I am really happy with all the help the community usually offers when I'm in need of some coding expertise."
pat="stackoverflow"
library(stringr)
tmp=str_locate(
string,
paste0(
pat,
paste0(
rep("\\s?[a-zA-Z]+",4),
collapse=""
)
)
)
gsub("\\s{2,}"," ",
paste0(
substring(string,1,tmp[1]-1),
substring(string,tmp[2]+1)
)
)
[1] "hello I am a user of happy with all the help the community usually offers when I'm in need of some coding expertise."
Quick answer, I am sure you can have better code than that:
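string <- "hello I am a user of stackoverflow and I am really happy with all the help the community usually offers when I'm in need of some coding expertise."
# split into a plain character vector of words
words <- strsplit(string, " ", fixed = TRUE)[[1]]
string2 <- ""
j <- 0  # position of "stackoverflow", 0 until it is found
for (i in seq_along(words)) {
  if (words[i] == "stackoverflow") {
    j <- i
  } else if (j > 0) {
    # after the keyword: skip the next 4 words, keep the rest
    if (i - j > 4) {
      string2 <- paste0(string2, " ", words[i])
    }
  } else {
    # before the keyword: keep every word
    if (i > 1) {
      string2 <- paste0(string2, " ", words[i])
    } else {
      string2 <- words[i]
    }
  }
}
print(string2)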
Hi everyone.
I am completely new to regex in R, and I ran into a problem when trying to retrieve a smaller pattern from the middle of a larger pattern in a tagged XML file.
Here I have a three-word sequence, "reinforce the advantage", tagged with the BNC (British National Corpus) Basic (C5) Tagset. Specifically, I want to retrieve only the three lemmatized words immediately after each "hw=" in this long sequence.
<w c5=VVI hw=reinforce pos=VERB>reinforce </w><w c5=AT0 hw=the pos=ART>the </w><w c5=NN2 hw=advantage pos=SUBST>advantages </w>
Can anyone please offer a possible solution with gsub or other functions in R? Many thanks in advance!
NF
vec <- "<w c5=VVI hw=reinforce pos=VERB>reinforce </w><w c5=AT0 hw=the pos=ART>the </w><w c5=NN2 hw=advantage pos=SUBST>advantages </w>"
m <- gregexpr("(?<=hw=)\\S+", vec, perl = T)
regmatches(vec, m)
# [[1]]
# [1] "reinforce" "the" "advantage"
copied from regex101.com
/
(?<=hw=)\S+
/
Positive Lookbehind (?<=hw=)
Assert that the Regex below matches
hw= matches the characters hw= literally (case sensitive)
\S+ matches any non-whitespace character (equal to [^\r\n\t\f\v ])
+ Quantifier — Matches between one and unlimited times, as many times as possible,
giving back as needed (greedy)
First ?unlist, then collapse (?paste0):
paste0(unlist(
regmatches(vec, m)
), collapse = " ")
# [1] "reinforce the advantage"
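The lookbehind needs perl = TRUE. If you would rather stay with the default regex engine, a two-step variant gives the same result: match "hw=" together with the word, then strip the prefix. A minimal sketch using only base R:

```r
vec <- "<w c5=VVI hw=reinforce pos=VERB>reinforce </w><w c5=AT0 hw=the pos=ART>the </w><w c5=NN2 hw=advantage pos=SUBST>advantages </w>"

# step 1: grab every "hw=..." token up to the next whitespace
hits <- regmatches(vec, gregexpr("hw=\\S+", vec))[[1]]

# step 2: drop the "hw=" prefix from each token
lemmas <- sub("^hw=", "", hits)
print(lemmas)
# "reinforce" "the" "advantage"
```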
I'm trying to remove some VERY special characters from my strings.
I've read other posts like:
Remove all special characters from a string in R?
How to remove special characters from a string?
but these are not what I'm looking for.
Let's say my string is as follows:
s = "who are í ½í¸€ bringing?"
I've tried the following:
test = tm_map(s, function(x) iconv(enc2utf8(x), sub = "byte"))
test = iconv(s, 'UTF-8', 'ASCII')
None of the above worked.
edit:
I am looking for a GENERAL solution!
I cannot (and prefer not) manually identify all the special characters.
Also, these VERY special characters MAY (not 100% sure) be the result of emoticons.
Please help or guide me to the right posts.
Thank you!
So, I'm going to go ahead and make an answer, because I believe this is what you're looking for:
> s = "who are í ½í¸€ bringing?"
> rmSpec <- "í|½|€" # The "|" designates a logical OR in regular expressions.
> s.rem <- gsub(rmSpec, "", s) # gsub replaces every match of rmSpec with "".
> s.rem
[1] "who are ¸ bringing?"
Now, this does have the caveat that you have to manually define the special character in the rmSpec variable. Not sure if you know what special characters to remove or if you're looking for a more general solution.
EDIT:
So it appears you almost had it with iconv, you were just missing the sub argument. See below:
> s
[1] "who are í ½í¸€ bringing?"
> s2 <- iconv(s, "UTF-8", "ASCII", sub = "")
> s2
[1] "who are bringing?"
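One thing to keep in mind with this approach: converting to ASCII with sub = "" removes every non-ASCII character, not only the emoji debris. A small sketch with a made-up string (written with \u escapes so it does not depend on the file's encoding) shows the effect:

```r
# hypothetical input: an accented letter and an emoji, both non-ASCII
x <- "caf\u00e9 \U0001F600 ok"

# sub = "" drops every byte that has no ASCII equivalent
ascii_only <- iconv(x, from = "UTF-8", to = "ASCII", sub = "")
print(ascii_only)
# "caf  ok"  (the accented letter is gone along with the emoji)
```

If accented letters have to survive, this is too aggressive. Using sub = "byte" instead replaces each such byte with its hex code in angle brackets, which at least makes it visible what was there.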
Suppose I have the following text:
txt <- as.character("this is just a test! i'm not sure if this is O.K. or if it will work? who knows. regex is sorta new to me.. There are certain cases that I may not figure out?? sad! ^_^")
I want to capitalize the first alphabetical character of a sentence.
I figured out the regular expression to match as: ^|[[:alnum:]]+[[:alnum:]]+[.!?]+[[:space:]]*[[:space:]]+[[:alnum:]]
A call to gregexpr returns:
> gregexpr("^|[[:alnum:]]+[[:alnum:]]+[.!?]+[[:space:]]*[[:space:]]+[[:alnum:]]", txt)
[[1]]
[1] 1 16 65 75 104 156
attr(,"match.length")
[1] 0 7 7 8 7 8
attr(,"useBytes")
[1] TRUE
Which are the correct substring indices that match.
However, how do I implement this to properly capitalize the characters I need? I'm assuming I have to strsplit and then... ?
It appears that your regex did not work for your example, so I stole one from this question.
txt <- as.character("this is just a test! i'm not sure if this is O.K. or if it will work? who knows. regex is sorta new to me.. There are certain cases that I may not figure out?? sad! ^_^")
print(txt)
gsub("([^.!?\\s])([^.!?]*(?:[.!?](?!['\"]?\\s|$)[^.!?]*)*[.!?]?['\"]?)(?=\\s|$)", "\\U\\1\\E\\2", txt, perl=T, useBytes = F)
Using rex may make this type of task a little simpler. This implements the same regex that merlin2011 used.
library(rex)
txt <- as.character("this is just a test! i'm not sure if this is O.K. or if it will work? who knows. regex is sorta new to me.. There are certain cases that I may not figure out?? sad! ^_^")
re <- rex(
capture(name = 'first_letter', alnum),
capture(name = 'sentence',
any_non_puncts,
zero_or_more(
group(
punct %if_next_isnt% space,
any_non_puncts
)
),
maybe(punct)
)
)
re_substitutes(txt, re, "\\U\\1\\E\\2", global = TRUE)
#>[1] "This is just a test! I'm not sure if this is O.K. Or if it will work? Who knows. Regex is sorta new to me.. There are certain cases that I may not figure out?? Sad! ^_^"
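For comparison, a much shorter base R pattern gets the same result on this particular text. It is only a sketch and assumes that a sentence starts either at the beginning of the string or after one of . ! ? followed by whitespace, and that only ASCII lower-case letters need capitalizing.

```r
txt <- "this is just a test! i'm not sure if this is O.K. or if it will work? who knows. regex is sorta new to me.. There are certain cases that I may not figure out?? sad! ^_^"

# group 1: start of string, or sentence-ending punctuation plus whitespace
# group 2: the lower-case letter to capitalize (\\U needs perl = TRUE)
capitalized <- gsub("(^|[.!?]\\s+)([a-z])", "\\1\\U\\2", txt, perl = TRUE)
print(capitalized)
# "This is just a test! I'm not sure if this is O.K. Or if it will work? Who knows. Regex is sorta new to me.. There are certain cases that I may not figure out?? Sad! ^_^"
```

Like the other answers, it capitalizes the "or" after "O.K.", because an abbreviation followed by a space looks the same as the end of a sentence.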
As the title says, I'm trying to figure out how to convert the spaces between words to %20.
For example,
> y <- "I Love You"
How can I make y = I%20Love%20You?
> y
[1] "I%20Love%20You"
Thanks a lot.
Another option would be URLencode():
y <- "I love you"
URLencode(y)
[1] "I%20love%20you"
gsub() is one option:
R> gsub(pattern = " ", replacement = "%20", x = y)
[1] "I%20Love%20You"
The function curlEscape() from the package RCurl gets the job done.
library('RCurl')
y <- "I love you"
curlEscape(urls=y)
[1] "I%20love%20you"
I like URLencode(), but be aware that it sometimes does not work as expected if your URL already contains a %20 together with a real space, in which case not even the repeated option of URLencode() does what you want.
In my case, I needed to run both URLencode() and gsub consecutively to get exactly what I needed, like so:
a = "encoded%20space/real space.csv"
URLencode(a)
#returns: "encoded%20space/real space.csv"
#note the spaces that are not transformed
URLencode(a, repeated=TRUE)
#returns: "encoded%2520space/real%20space.csv"
#note the %2520 in the first part
gsub(" ", "%20", URLencode(a))
#returns: "encoded%20space/real%20space.csv"
In this particular example, gsub() alone would have been enough, but URLencode() is of course doing more than just replacing spaces.
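Another way to handle a partly encoded string, sketched here with the same kind of input, is to decode first and encode afterwards, so everything goes through URLencode() exactly once. This assumes every %XX sequence in the input really is an escape and not a literal percent sign followed by two hex digits.

```r
# hypothetical input: one space already encoded, one still literal
a <- "encoded%20space/real space.csv"

# decode everything first, then encode once
normalized <- URLencode(URLdecode(a))
print(normalized)
# "encoded%20space/real%20space.csv"
```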