R Character classes - r

Could anybody explain why "aba12" shows up, when I have specified {2}?
strings=c("Ab12","aba12","BA12","A 12b","B!","d", " ab")
grep("^[[:alpha:]]{2}", strings, value=TRUE)

You can use ...
grep("^[[:alpha:]]{2}[^[:alpha:]]", strings, value=TRUE)
# [1] "Ab12" "BA12"
[...] enumerates the accepted characters and [^...] negates the set. Further, from @Mako212:
^[[:alpha:]]{2} [...] tells the regex engine to match the beginning of the string, then exactly two alphabetic characters (A-Z/a-z, plus any other letters the locale defines). It asserts nothing about the remainder of the string. The engine will process the remainder of the string, but there are no remaining criteria to match.
My answer above expects a non-alpha character following the initial two. From MrFlick's comment:
If you also want to match "AB", then use
grep("^[[:alpha:]]{2}([^[:alpha:]]|$)", strings, value=TRUE)
to match a non-alpha character or end of string.
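As a quick check, the three patterns discussed above can be run side by side. This is only a minimal sketch; the extra element "AB" is an assumption added here to illustrate the end-of-string case and is not part of the original data.

```r
# Original vector plus a hypothetical "AB" element (two letters, nothing after)
strings <- c("Ab12", "aba12", "BA12", "A 12b", "B!", "d", " ab", "AB")

# Anchored at the start only: nothing is asserted about the third character,
# so "aba12" is accepted
starts_with_two <- grep("^[[:alpha:]]{2}", strings, value = TRUE)

# Two letters that must be followed by a non-letter
two_then_other <- grep("^[[:alpha:]]{2}[^[:alpha:]]", strings, value = TRUE)

# Two letters followed by a non-letter or by the end of the string
two_then_other_or_end <- grep("^[[:alpha:]]{2}([^[:alpha:]]|$)", strings, value = TRUE)

print(starts_with_two)
print(two_then_other)
print(two_then_other_or_end)
```

Only the last pattern keeps "AB" while still rejecting "aba12".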

Related

How to remove a certain portion of the column name in a dataframe?

I have column names in the following format:
col= c('UserLanguage','Q48','Q21...20','Q22...21',"Q22_4_TEXT...202")
I would like to get the column names without everything that is after ...
[1] "UserLanguage" "Q48" "Q21" "Q22" "Q22_4_TEXT"
I am not sure how to code it. I found this post here but I am not sure how to specify the pattern in my case.
You can use gsub.
gsub("\\...*","",col)
#[1] "UserLanguage" "Q48" "Q21" "Q22" "Q22_4_TEXT"
Or you can use stringr
library(stringr)
str_remove(col, "\\...*")
Since . matches any character, we need to escape it with a backslash to match a literal period: \. in the regular expression. However, the backslash is itself the escape character in R string literals, so it has to be doubled when the pattern is written as a string: \\. . Note that in \\...* only the first period is escaped. The second . matches any single character, and the final .* matches zero or more of any character. The pattern therefore matches a literal period, one further character, and everything after it. That is enough here because none of the names contains a period before the "...". To require three literal periods explicitly, use \\.{3}.* (or \\.\\.\\..*) instead.
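To make the difference between escaped and unescaped periods concrete, here is a small sketch comparing the pattern used above with a stricter variant. The name "a.bc" is a hypothetical input chosen only to show where the two patterns diverge.

```r
col <- c("UserLanguage", "Q48", "Q21...20", "Q22...21", "Q22_4_TEXT...202")

# Pattern from the answer: one literal period, one arbitrary character, then anything
loose <- sub("\\...*", "", col)

# Stricter variant: exactly three literal periods, then anything
strict <- sub("\\.{3}.*", "", col)

# Hypothetical name containing a single period followed by other characters
loose_single <- sub("\\...*", "", "a.bc")     # the ".bc" part is removed
strict_single <- sub("\\.{3}.*", "", "a.bc")  # left unchanged, no "..." present

print(loose)
print(strict)
print(c(loose_single, strict_single))
```

Both patterns give the same result on the original column names; they differ only when a name contains a period that is not part of a "..." suffix.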
You could sub and capture the first word in each column:
col <- c("UserLanguage", "Q48", "Q21...20", "Q22...21", "Q22_4_TEXT...202")
sub("^(\\w+).*$", "\\1", col)
[1] "UserLanguage" "Q48" "Q21" "Q22" "Q22_4_TEXT"
The regex pattern used here says to match:
^ from the start of the input
(\w+) match AND capture the first word
.* then consume the rest
$ end of the input
Then, using sub we replace with \1 to retain just the first word.
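One thing to keep in mind with this approach is that \w covers letters, digits and the underscore only, so the captured "first word" stops at the first other character. A short sketch follows; the inputs "my.col...3" and "...3" are hypothetical and not from the question.

```r
col <- c("UserLanguage", "Q48", "Q21...20", "Q22...21", "Q22_4_TEXT...202")

# Keep only the leading run of word characters
first_word <- sub("^(\\w+).*$", "\\1", col)

# Hypothetical name with a period inside: the capture stops at the first period
dotted <- sub("^(\\w+).*$", "\\1", "my.col...3")

# Hypothetical name that does not start with a word character:
# the pattern does not match, so sub() returns the input unchanged
no_word <- sub("^(\\w+).*$", "\\1", "...3")

print(first_word)
print(c(dotted, no_word))
```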

How to split a string by dashes outside of square brackets

I would like to split strings like the following:
x <- "abc-1230-xyz-[def-ghu-jkl---]-[adsasa7asda12]-s-[klas-bst-asdas foo]"
by dash (-) on the condition that those dashes must not be contained inside a pair of []. The expected result would be
c("abc", "1230", "xyz", "[def-ghu-jkl---]", "[adsasa7asda12]", "s",
"[klas-bst-asdas foo]")
Notes:
There is no nesting of square brackets inside each other.
The square brackets can contain any characters / numbers / symbols except square brackets.
The other parts of the string are also variable so that we can only assume that we split by - whenever it's not inside [].
There's a similar question for python (How to split a string by commas positioned outside of parenthesis?) but I haven't yet been able to accurately adjust that to my scenario.
You could use look ahead to verify that there is no ] following sooner than a [:
-(?![^[]*\])
So in R:
strsplit(x, "-(?![^[]*\\])", perl=TRUE)
Explanation:
-: match the hyphen
(?! ): negative look ahead: if that part is found after the previously matched hyphen, it invalidates the match of the hyphen.
[^[]: match any character that is not a [
*: match any number of the previous
\]: match a literal ]. If this matches, it means we found a ] before finding a [. As all this happens in a negative look ahead, a match here means the hyphen is not a match. Note that a ] is a special character in regular expressions, so it must be escaped with a backslash (although it does work without escape, as the engine knows there is no matching [ preceding it -- but I prefer to be clear about it being a literal). And as backslashes have a special meaning in string literals (they also denote an escape), that backslash itself must be escaped again in this string, so it appears as \\].
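The lookahead can be tried out on the string from the question and on a shorter one. This is a sketch only; the short string "a-[b-c]-d" is a made-up input for illustration. Note that strsplit returns a list, so [[1]] is used to get the character vector.

```r
x <- "abc-1230-xyz-[def-ghu-jkl---]-[adsasa7asda12]-s-[klas-bst-asdas foo]"

# Split on hyphens that are NOT followed by a "]" before the next "["
pattern <- "-(?![^[]*\\])"
parts <- strsplit(x, pattern, perl = TRUE)[[1]]

# Hypothetical smaller input: only the hyphen inside the brackets is kept
small <- strsplit("a-[b-c]-d", pattern, perl = TRUE)[[1]]

print(parts)
print(small)
```

The approach relies on the assumption stated in the question that brackets are never nested and always balanced.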
Instead of splitting, extract the parts:
library(stringr)
str_extract_all(x, "(\\[[^\\[]*\\]|[^-])+")
I am not familiar with the R language, but I believe it can do regex-based search and replace. Instead of struggling with one single regex split function, I would go in 3 steps:
replace - in all [....] parts by an invisible character, like \x99
split by -
for each element in the split result (array/list), replace \x99 back to -
For the first step, you can find the parts with \[[^]]*\]
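The three steps above could be sketched in base R roughly as follows. This is one possible translation, not the answerer's code: it uses "\x01" as the placeholder instead of \x99 (an assumption made to stay within ASCII), and it assumes that character never occurs in the input.

```r
x <- "abc-1230-xyz-[def-ghu-jkl---]-[adsasa7asda12]-s-[klas-bst-asdas foo]"
placeholder <- "\x01"  # assumed not to occur anywhere in the input

# Step 1: inside every [...] part, replace "-" with the placeholder
m <- gregexpr("\\[[^]]*\\]", x)
regmatches(x, m) <- lapply(
  regmatches(x, m),
  function(p) gsub("-", placeholder, p, fixed = TRUE)
)

# Step 2: split on the remaining hyphens, which are all outside brackets
parts <- strsplit(x, "-", fixed = TRUE)[[1]]

# Step 3: turn the placeholder back into "-" in every piece
parts <- gsub(placeholder, "-", parts, fixed = TRUE)

print(parts)
```

This trades the single lookahead regex for three simpler operations, which some readers may find easier to follow and debug.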

keep only alphanumeric characters and space in a string using gsub

I have a string which has alphanumeric characters, special characters and non UTF-8 characters. I want to strip the special and non utf-8 characters.
Here's what I've tried:
gsub('[^0-9a-z\\s]','',"�+ Sample string here =�{�>E�BH�P<]�{�>")
However, this removes the special characters (punctuation + non-UTF-8), but the output has no spaces.
gsub('/[^0-9a-z\\s]/i','',"�+ Sample string here =�{�>E�BH�P<]�{�>")
The result has spaces but there are still non utf8 characters present.
Any work around?
For the sample string above, output should be:
Sample string here
You could use the classes [:alnum:] and [:space:] for this:
sample_string <- "�+ Sample 2 string here =�{�>E�BH�P<]�{�>"
gsub("[^[:alnum:][:space:]]","",sample_string)
#> [1] "ï Sample 2 string here ïïEïBHïPïï"
Alternatively you can use PCRE codes to refer to specific character sets:
gsub("[^\\p{L}0-9\\s]","",sample_string, perl = TRUE)
#> [1] "ï Sample 2 string here ïïEïBHïPïï"
Both cases show that the characters that are still there are considered letters. The EBHP inside are letters as well, so the condition you are replacing on is not correct. You don't want to keep all letters, you just want to keep A-Z, a-z and 0-9:
gsub("[^A-Za-z0-9 ]","",sample_string)
#> [1] " Sample 2 string here EBHP"
This still contains the EBHP. If you really just want to keep a section that contains only letters and numbers, you should use the reverse logic: select what you want and replace everything but that using backreferences:
gsub(".*?([A-Za-z0-9 ]+)\\s.*","\\1", sample_string)
#> [1] " Sample 2 string here "
Or, if you want to find a string, even not bound by spaces, use the word boundary \\b instead:
gsub(".*?(\\b[A-Za-z0-9 ]+\\b).*","\\1", sample_string)
#> [1] "Sample 2 string here"
What happens here:
.*? fits anything (.) at least 0 times (*) but ungreedy (?). This means that gsub will try to fit the smallest amount possible by this piece.
everything between () will be stored and can be referred to in the replacement by \\1
\\b indicates a word boundary
This is followed at least once (+) by any character that's A-Z, a-z, 0-9 or a space. You have to spell out both ranges, because other characters sit between the uppercase and lowercase letters in the code table (punctuation in ASCII, and accented letters in some locale collation orders). So a single range A-z would match more than the plain letters (and those special letters are valid UTF-8, by the way).
after that sequence, fit anything at least zero times to remove the rest of the string.
the backreference \\1 in combination with .* in the regex will make sure only the required part remains in the output.
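The same "select what you want" idea can also be written as an extraction rather than a replacement. The sketch below uses regexpr and regmatches, which is a different technique from the gsub call above, and an ASCII-only stand-in for the sample string (an assumption, so that the example does not depend on the encoding of the original input).

```r
# ASCII-only stand-in for the sample string (hypothetical)
s <- "+ Sample 2 string here ={>EBHP<]{>"

# First run of alphanumeric words separated by single spaces
m <- regexpr("[A-Za-z0-9]+( [A-Za-z0-9]+)*", s)
first_run <- regmatches(s, m)

print(first_run)
```

Because regexpr returns only the first match, the later EBHP run is ignored; gregexpr would return every run instead.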
stringr uses a different regex engine (ICU, via stringi) that supports POSIX-style character classes. ascii names the class, which must be written as [:ascii:] and placed within an outer pair of square brackets. The [^ indicates negation of the match.
library(stringr)
str_replace_all("�+ Sample string here =�{�>E�BH�P<]�{�>", "[^[:ascii:]]", "")
which results in
[1] "+ Sample string here ={>EBHP<]{>"

How to extract characters from a string based on the text surrounding them in R

I'm using the R language and I have many large lists of character strings that share a similar format. I am interested in the characters directly in front of a series of characters that is consistently in the string, but not in a consistent place within the string. For instance:
a <- "aabbccddeeff"
b <- "aabbddff"
c <- "aabbffgghhii"
d <- "bbffgghhii"
I am interested in extracting the two characters directly preceding the "ff" in each character string. I can't find any reasonable solution apart from breaking each character string down using grepl() and then processing them each independently, which seems like an inefficient way to do it.
You can match those two characters and capture them with sub and the right regular expression.
Strings = c("aabbccddeeff",
"aabbddff",
"aabbffgghhii",
"bbffgghhii")
sub(".*(\\w\\w)ff.*", "\\1", Strings)
[1] "ee" "dd" "bb" "bb"
Explanation: this replaces the entire string with the two characters before the "ff". If there are multiple "ff" in the string, this expression takes the two characters before the last "ff".
How this works: The three arguments to sub are:
1. a pattern to search for
2. What it will be replaced with
3. The strings to apply it to.
Most of the work is in the pattern, .*(\\w\\w)ff.*. The ff part should be obvious: we are targeting things near the specific string ff. What comes right before it is (\\w\\w). \w refers to a "word character": any letter a-z or A-Z, any digit 0-9, or the underscore _. We want two characters, so we write \\w\\w. Enclosing \\w\\w in parentheses turns it into a "capture group", a matched substring that is saved for later use. Since this is the first (and only) capture group in the expression, those two characters can be referred to as \1.
We want only those two characters, so to blow away everything before and after them we put .* at the front and back. . matches any character and * means zero or more times, so .* means zero or more of any character. The pattern now covers four parts: everything before the two characters, the two characters themselves, "ff", and everything after the ff. Together these cover the entire string.
sub replaces the part that was matched (everything) with the substitution pattern, in this case "\\1". That is how you write an R string that evaluates to \1, the reference to the capture group holding the two characters we want: a backslash escapes whatever follows it in a string literal, so to get the character \ itself we write \\, and "\\1" evaluates to \1. So everything in the string is replaced by the targeted two characters. sub applies this to every string in the vector Strings.
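The remark about multiple "ff" can be illustrated with a greedy and a lazy version of the pattern. This is a sketch under a few assumptions: the input "aabbffccff" and the non-matching input "abc" are made up, the lazy variant is not part of the original answer, and perl = TRUE is passed to both calls so that they run on the same engine.

```r
s <- "aabbffccff"  # hypothetical string with two occurrences of "ff"

# Greedy .* consumes as much as possible, so the LAST "ff" is used
before_last <- sub(".*(\\w\\w)ff.*", "\\1", s, perl = TRUE)

# Lazy .*? consumes as little as possible, so the FIRST "ff" is used
before_first <- sub("^.*?(\\w\\w)ff.*", "\\1", s, perl = TRUE)

# If the pattern does not match at all, sub() returns the input unchanged
no_match <- sub(".*(\\w\\w)ff.*", "\\1", "abc", perl = TRUE)

print(c(before_last, before_first, no_match))
```

The unchanged return value for non-matching strings is worth checking for when the result is used downstream, since it is not distinguishable from a real two-character capture by type alone.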

How do I use regex to match alphabetical characters only?

I want to gsub a string that contains only characters and white spaces, for example the string "to delete". I have tried this:
gsub('[^[:alpha:]$]',NA, "to delete", ignore.case=T)
But I get an NA also when the string contains digits, for example:
gsub('[^[:alpha:]$]',NA, "to 1 delete", ignore.case=T)
Anybody could tell me what I am doing wrong? Thanks.
Your regex only tests for a single unanchored bracket expression. This means that any string that has any character which matches the bracket expression will match the regex.
Your bracket expression tests for "not alphabetic not dollar". This matches many things, including spaces, digits, and all punctuation characters other than dollar.
It sounds like you want to match only strings which consist in their entirety of only alphabetic and whitespace characters. To achieve that you need an anchored multiplied bracket expression.
Also, you don't need gsub() for this; you only need sub(), since a regex that can only match the entirety of the input string cannot match multiple times within the input string.
Also, you don't need ignore.case=T, since the [:alpha:] character class already matches all alphabetics, regardless of letter-case.
regex <- '^[[:alpha:][:space:]]*$'
sub(regex, NA, 'to delete')
## [1] NA
sub(regex, NA, 'to 1 delete')
## [1] "to 1 delete"
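The same anchored bracket expression can also be used as a plain test with grepl, which makes the "match the whole string or nothing" logic explicit. This is an alternative sketch rather than the answer's method, and the vector x contains made-up inputs.

```r
# Hypothetical inputs
x <- c("to delete", "to 1 delete", "Keep2", "only letters")

regex <- "^[[:alpha:][:space:]]*$"

# TRUE where the whole string consists of letters and whitespace only
only_alpha_space <- grepl(regex, x)

# Replace exactly those elements with NA
x[only_alpha_space] <- NA

# Because of the * quantifier, the empty string also matches
empty_matches <- grepl(regex, "")

print(only_alpha_space)
print(x)
print(empty_matches)
```

Replacing * with + in the pattern would exclude the empty string if that is not wanted.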

Resources