I want to extract the last letter (the first letter counting from the right-hand side) of each element of a vector:
ABC20
BCD3
B1
AB2222
BX4444
So for the group above I would want C, D, B, B, X. Is there an easy way to do this? I know there is a substr and a numindex/charindex, so I think I can use these, but I'm not sure exactly how in R.
You can use the stringi package:
stringi::stri_extract_last_regex(x, '[A-Z]')
#[1] "C" "D" "B" "B" "X"
DATA
x <- c('ABC20', 'BCD3', 'B1', 'AB2222', 'BX4444')
Try this:
Your data (named v here so that it does not mask base::list):
v <- c("ABC20", "BCD3", "B1", "AB2222", "BX4444")
Identify the position of the first digit in each element:
number_first <- as.integer(regexpr("[0-9]", v))
Extract the character just before that position:
substr(v, number_first - 1, number_first - 1)
[1] "C" "D" "B" "B" "X"
We can use sub to capture the last upper case letter (([A-Z])) followed by zero or more digits (\\d*) until the end ($) of the string and replace it with the backreference (\\1) of the captured group
sub(".*([A-Z])\\d*$", "\\1", x)
#[1] "C" "D" "B" "B" "X"
data
x <- c("ABC20", "BCD3", "B1", "AB2222", "BX4444")
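If some elements might contain no upper-case letter at all, sub returns them unchanged. Here is a minimal base R sketch that returns NA in that case instead. Note that last_upper is a hypothetical helper name, not something from the answers above.

```r
# Hypothetical helper: last upper-case letter of each element, NA if there is none.
last_upper <- function(x) {
  # An upper-case letter followed only by non-upper-case characters up to the end
  m <- regexpr("[A-Z](?=[^A-Z]*$)", x, perl = TRUE)
  pos <- as.integer(m)
  hit <- !is.na(pos) & pos > 0
  out <- rep(NA_character_, length(x))
  # regmatches() returns one match per element that actually matched
  out[hit] <- regmatches(x, m)
  out
}

x <- c("ABC20", "BCD3", "B1", "AB2222", "BX4444")
last_upper(x)
last_upper(c("ABC20", "123"))
```

The second call shows the difference from sub: the element without a letter comes back as NA rather than as the original string.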
I have a vector c(1,3,4,2,5,4,3,1,6,3,1,4,2), and I want to map 1 to "a", 2 to "b", and so on.
So my final output should look like c(a,c,d,b...)
I know that I can use a for loop and if statements to do this, but is there a quicker way?
You may use the built-in constant letters.
vec <- c(1,3,4,2,5,4,3,1,6,3,1,4,2)
res <- letters[vec]
res
#[1] "a" "c" "d" "b" "e" "d" "c" "a" "f" "c" "a" "d" "b"
To replace with any other values you can construct a vector to subset.
value <- c('apples', 'banana', 'grapes', .....)
res <- value[vec]
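As a rough sketch of the lookup idea: the replacement values below are made up for illustration (the original list is elided), and the named lookup is an assumed variant for codes that are not simply 1..n.

```r
vec <- c(1, 3, 4, 2, 5, 4, 3, 1, 6, 3, 1, 4, 2)

# Hypothetical replacement values, one per code 1..6
value <- c("apple", "banana", "cherry", "date", "elder", "fig")
res <- value[vec]
head(res, 4)  # "apple" "cherry" "date" "banana"

# Named lookup (assumed variant), useful when the codes are not 1..n
lookup <- c("10" = "low", "20" = "mid", "30" = "high")
codes <- c(20, 10, 30, 20)
recoded <- unname(lookup[as.character(codes)])
recoded  # "mid" "low" "high" "mid"
```

Positional indexing only works when the codes are valid positions in the value vector. The named version looks values up by name instead.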
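We may use match. Matching against the sorted unique values maps the smallest value to "a", the next to "b", and so on (matching against unique(vec) alone would number the values by order of first appearance instead):
letters[match(vec, sort(unique(vec)))]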
Say I need to strsplit caabacb into individual letters except when a letter is followed by a b, thus resulting in "c" "a" "ab" "a" "cb". I tried using the following line, which looks OK on regex tester but does not work in R. What did I do wrong?
strsplit('caabacb','(?!b)',perl=TRUE)
[[1]]
[1] "c" "a" "a" "b" "a" "c" "b"
You could also add a prefix positive lookbehind that matches any character (?<=.). The positive lookbehind (?<=.) would split the string at every character (without removal of characters), but the negative lookahead (?!b) excludes splits where a character is followed by a b:
strsplit('caabacb', '(?<=.)(?!b)', perl = TRUE)
#> [[1]]
#> [1] "c" "a" "ab" "a" "cb"
strsplit() probably needs something to split. You could insert e.g. a ";" with gsub().
strsplit(gsub("(?!^.|b|\\b)", ";", "caabacb", perl=TRUE), ";", perl=TRUE)
# [[1]]
# [1] "c" "a" "ab" "a" "cb"
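Another option, sketched here as an alternative and not taken from the answers above, is to stop splitting and extract the tokens directly: each token is one character plus an optional following "b".

```r
s <- "caabacb"
# Each token is any single character, optionally followed by a "b"
tokens <- regmatches(s, gregexpr(".b?", s))[[1]]
tokens
# [1] "c"  "a"  "ab" "a"  "cb"
```

This sidesteps zero-length matches entirely, so it does not depend on how strsplit treats them.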
I want to implement a regex to extract the substring after the final dot.
For example,
a = c("a.b.c.d", "e.b.e", "c", "f.d.e", "a.e.b.g.z")
gsub(".*(\\..*)$", "\\1", a)
The code returns
".d" ".e" "c" ".e" ".z"
How do I modify the code to get
"d" "e" "" "e" "z"
That is to say, if the string contains a dot, it should return the last part without the dot; if the string doesn't contain a dot, it should return "".
Here is a way to do this using sub without capture groups. We can try replacing all content up to and including the final dot with empty string.
a = c("a.b.c.d", "e.b.e", "c", "f.d.e", "a.e.b.g.z")
sub(".*\\.", "", a)
[1] "d" "e" "c" "e" "z"
If you want to return empty string should the input have no dot, then we can use ifelse with grepl:
input <- "Hello World!"
output <- ifelse(grepl("\\.", input), sub(".*\\.", "", input), "")
The reason the code above is so verbose is that sub, by default, just returns the original string when no match is found. But in your case, you want different behavior.
Put the . outside the capture group, since you don't want it in the result:
sub(".*\\.(.*)", "\\1", a)
#[1] "d" "e" "c" "e" "z"
This will capture everything after the last dot.
For strings where we have no dots, we could check for it using grepl and then extract
ifelse(grepl("\\.", a), sub(".*\\.(.*)", "\\1", a), "")
#[1] "d" "e" "" "e" "z"
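If you would rather avoid ifelse, one possible sketch is a single sub call with an alternation: the first branch matches a whole string that contains no dot, the second matches everything up to and including the last dot. This is a suggested variant, not one of the answers above.

```r
a <- c("a.b.c.d", "e.b.e", "c", "f.d.e", "a.e.b.g.z")
# "^[^.]*$" removes a dot-free string entirely; ".*\\." removes up to the last dot
res <- sub("^[^.]*$|.*\\.", "", a)
res
# [1] "d" "e" ""  "e" "z"
```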
When using the sub() function in R, how do we use an asterisk wildcard to replace all characters after (or before) an indicator?
If we want to remove an underscore and all arbitrary text afterward:
x <- c("a_101", "a_275", "b_133", "b_277")
The following code removes nothing:
sub(pattern = "_*", replacement = "", x = x)
[1] "a_101" "a_275" "b_133" "b_277"
Desired output:
"a" "a" "b" "b"
Why does the wildcard fail?
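The wildcard fails because in a regular expression * is a quantifier, not a glob: _* means "zero or more underscores", which matches the empty string at the start of each element, so sub replaces an empty match and nothing changes. If using sub, you have to specify everything you want to replace, and what you want to replace it with. Here I've done that using a capture group for the letter of interest.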
sub('([a-z])_\\d+', replacement = '\\1', x)
[1] "a" "a" "b" "b"
Using the wild card will work too.
sub('([a-z])_.*', replacement = '\\1', x)
[1] "a" "a" "b" "b"
And finally more along the lines of what you were thinking:
sub('_.*', replacement = "", x)
[1] "a" "a" "b" "b"
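To see what the original pattern actually matched, you can inspect it with regexpr. This is a small diagnostic sketch, not part of the answer above.

```r
x <- c("a_101", "a_275", "b_133", "b_277")

m <- regexpr("_*", x)
start <- as.integer(m)                       # where the match starts
len <- as.integer(attr(m, "match.length"))   # how long the match is
start  # 1 1 1 1: the match is at the very first character
len    # 0 0 0 0: and it is empty, so sub() replaces nothing

# "_.*" means an underscore followed by any characters
fixed_x <- sub("_.*", "", x)
fixed_x  # "a" "a" "b" "b"
```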
Say I have the following dataframe consisting of two vectors containing character strings:
df <- data.frame(
"ID"= c("1a", "1b", "1c", "1d"),
"Codes" = c("BX.MX|GX.WX", "MX.RX|BX.YX", "MX.OX|GX.GX", "MX.OX|YX.OX"),
stringsAsFactors = FALSE)
I'd like a simple way to determine which characters have been used in a given vector. In other words, the output of such a function would reveal:
find.characters(df$Codes) # hypothetical function
[1] "B" "G" "M" "W" "X" "R" "Y" "O" "|" "."
find.characters(df$ID) # hypothetical function
[1] "1" "a" "b" "c" "d"
You can create a custom function to do this. The idea is to split the strings into individual characters (strsplit(v1, '')), which returns a list. We unlist it to get a vector, then take the unique elements. This is not sorted yet. Based on the example shown, you may want to sort the letters and the other characters separately. So we use grepl to flag the upper-case letters, sort the two subsets separately, and concatenate them with c().
find.characters <- function(v1){
x1 <- unique(unlist(strsplit(v1, '')))
indx <- grepl('[A-Z]', x1)
c(sort(x1[indx]), sort(x1[!indx]))
}
find.characters(df$Codes)
#[1] "B" "G" "M" "O" "R" "W" "X" "Y" "|" "."
find.characters(df$ID)
#[1] "1" "a" "b" "c" "d"
NOTE: Generally, I would use grepl('[A-Za-z]', x1), but I didn't do that because the expected result for the 'ID' column is different.
find.characters <- function(x) {
  unique(c(strsplit(x, split = ""), recursive = TRUE))
}
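Both versions find the same set of characters; they differ only in ordering. A quick sketch to check this on the example data, where chars_used is a hypothetical name that keeps characters in order of first appearance:

```r
df <- data.frame(
  "ID" = c("1a", "1b", "1c", "1d"),
  "Codes" = c("BX.MX|GX.WX", "MX.RX|BX.YX", "MX.OX|GX.GX", "MX.OX|YX.OX"),
  stringsAsFactors = FALSE)

# Hypothetical helper: unique characters in order of first appearance
chars_used <- function(v) unique(unlist(strsplit(v, "")))

ids <- chars_used(df$ID)
codes <- chars_used(df$Codes)
ids            # "1" "a" "b" "c" "d"
length(codes)  # 10
# Same set as the expected output in the question, regardless of order
setequal(codes, c("B", "G", "M", "W", "X", "R", "Y", "O", "|", "."))
```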