I am practicing with regular expressions in R.
I would like to extract the last occurrence of two upper case letters.
I tried
>str_extract("kjhdjkaYY,","[:upper:][:upper:]")
[1] "YY"
That works perfectly well. What if I would like to extract the last occurrence of such a pattern? Example:
function("kKKjhdjkaYY,")
[1] "YY"
Thank you for your help
We can use stri_extract_last_regex from the stringi package:
library(stringi)
stri_extract_last_regex("AAkjhdjkaYY,","[:upper:][:upper:]")
#[1] "YY"
Or, if you want to stick with stringr, we can extract all the matches of the pattern and then take the last one using tail:
library(stringr)
tail(str_extract_all("AAkjhdjkaYY,","[:upper:][:upper:]")[[1]], 1)
#[1] "YY"
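For comparison, here is a minimal base R sketch of the same "collect all matches, keep the last" idea, with no extra packages. It assumes a single input string; the names x, m and last_pair are only illustrative.

```r
x <- "kKKjhdjkaYY,"

# All non-overlapping runs of two upper-case letters, in order of appearance
m <- regmatches(x, gregexpr("[[:upper:]]{2}", x))[[1]]

# Keep the last one; this is character(0) if nothing matched
last_pair <- tail(m, 1)
last_pair
```

For a vector of strings, the same two steps could be wrapped in sapply or vapply.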
I'm trying to extract matches preceding a pattern in R. Let's say that I have a vector consisting of the following elements:
my_vector
> [1] "ABCC12|94160" "ABCC13|150000" "ABCC1|4363" "ACTA1|58"
[5] "ADNP2|22850" "ADNP|23394" "ARID1B|57492" "ARID2|196528"
I'm looking for a regular expression to extract all characters preceding the "|". The expected result must be something like this:
my_new_vector
> [1] "ABCC12" "ABCC13" "ABCC1" "ACTA1"
and so on.
I have already tried using stringr functions and regular expressions based on lookarounds, but I failed.
I would really appreciate your advice and help in solving my issue.
Thanks in advance!
We could use trimws and specify the whitespace argument as a regex that matches the | (a metacharacter, so it is escaped with \\) followed by zero or more characters (.*). Note that the whitespace argument requires R 3.6.0 or later.
trimws(my_vector, whitespace = "\\|.*")
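A more conventional way to get the same result is plain sub, which also works on older R versions. This is only a sketch using the first four elements shown in the question; my_new_vector is the name the question uses for the desired result.

```r
my_vector <- c("ABCC12|94160", "ABCC13|150000", "ABCC1|4363", "ACTA1|58")

# Delete the first "|" and everything after it
my_new_vector <- sub("\\|.*", "", my_vector)
my_new_vector
```

Elements that contain no "|" are returned unchanged.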
I want to understand how to use the semantics in str_extract in the stringr package in R.
I have strings that are written like this: "11_3_S11.html"
and I would like to extract from them the value after the first underscore.
I mean, I want to extract the number 3.
files = c("11_3_S11.html")
I would appreciate it if someone can explain the logic or send me a link with all the semantics.
Thank you for your time
In base R, you can use sub to extract the number after the first underscore.
sub('\\d+_(\\d+)_.*', '\\1', files)
#[1] "3"
where \\d+ matches one or more digits.
() defines a capture group, which captures the value that we are interested in; \\1 in the replacement refers to it.
You can use the same regex in str_match if you want to use stringr.
stringr::str_match(files, '\\d+_(\\d+)_.*')[, 2]
[1] "3"
Using lookarounds (with \\d+ so that numbers of more than one digit are also matched):
str_extract("11_3_S11.html", '(?<=_)\\d+(?=_)')
[1] "3"
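If the regex feels opaque, the same value can also be obtained without one by splitting on the underscore. This is a sketch that assumes every string has at least two underscore-separated pieces; parts and second are illustrative names.

```r
files <- c("11_3_S11.html")

# Split each string on literal underscores (fixed = TRUE means no regex)
parts <- strsplit(files, "_", fixed = TRUE)

# Take the second piece, i.e. the text between the first and second underscore
second <- sapply(parts, function(p) p[2])
second
```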
I want to know how many digits there are in a text variable. For example, a function that returns 3 for the text "ABC234".
I tried with this:
aa=gregexpr("[[:digit:]]+\\.*[[:digit:]]*","ABC234")
I almost have it, but honestly I still don't understand lists, so I have no idea how to get the count out of the result.
Any function? Or how to manage it with my almost-option?
Thanks
Match each digit and then take the length of the returned value:
lengths(gregexpr("\\d", "ABC234"))
## [1] 3
or replace each non-digit with a zero length string and take the length of what remains:
nchar(gsub("\\D", "", "ABC234"))
## [1] 3
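One caveat with the first form: gregexpr reports "no match" as a single -1, so lengths returns 1 rather than 0 for a string without digits. A sketch of a variant that handles this, passing the result through regmatches first (the names x, naive and robust are illustrative):

```r
x <- c("ABC234", "ABC")

# gregexpr() returns -1 when nothing matches, so the second count is 1
naive <- lengths(gregexpr("\\d", x))

# regmatches() turns "no match" into character(0), so the second count is 0
robust <- lengths(regmatches(x, gregexpr("\\d", x)))

naive
robust
```

The nchar(gsub(...)) form does not have this problem, since removing every non-digit from "ABC" leaves an empty string.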
Alternatively, you can use the stringi or stringr packages:
stringi::stri_count('ABC234', regex = '\\d')
# [1] 3
stringr::str_count('ABC234', '\\d')
# [1] 3
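You can use the readr and dplyr packages as follows (dplyr is loaded only for the %>% pipe). Note that this works only when the digits form a single number with no leading zeros: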
library(readr)
library(dplyr)
string = "ABC234"
parse_number(string) %>%
nchar()
[1] 3
I have data that looks like the following, except the numbers are out of order:
dat <- paste("Experience", 1:20, sep = "_")
Basically, I am trying to sort the columns in numerical order based on the ending number, so that they come out in the order the code above produces. However, when I sort the values, they are ordered character by character, so "Experience_10" comes before "Experience_2":
"Experience_1" "Experience_10" "Experience_11" "Experience_12"
"Experience_13" "Experience_14" "Experience_15" "Experience_16"
"Experience_17" "Experience_18" "Experience_19" "Experience_2"
"Experience_20" "Experience_3" "Experience_4" "Experience_5"
"Experience_6" "Experience_7" "Experience_8" "Experience_9"
Thoughts?
The stringr package, part of the tidyverse, has str_sort(), which sorts strings containing numbers in numeric order when numeric = TRUE.
library(stringr)
str_sort(dat, numeric = TRUE)
Another option is mixedsort from the gtools package:
gtools::mixedsort(dat)
#[1] "Experience_1" "Experience_2" "Experience_3" "Experience_4" "Experience_5" "Experience_6"
#[7] "Experience_7" "Experience_8" "Experience_9" "Experience_10" "Experience_11" "Experience_12"
#[13] "Experience_13" "Experience_14" "Experience_15" "Experience_16" "Experience_17" "Experience_18"
#[19] "Experience_19" "Experience_20"
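If adding a package is not an option, a base R sketch is to extract the trailing number and order by it. It assumes every element ends in "_<number>"; the shuffling with sample is there only to produce an out-of-order input, and num and sorted are illustrative names.

```r
# Shuffled input, standing in for the out-of-order data
dat <- sample(paste("Experience", 1:20, sep = "_"))

# Drop everything up to the last underscore and convert the rest to a number
num <- as.numeric(sub(".*_", "", dat))

# Reorder the original strings by that number
sorted <- dat[order(num)]
sorted
```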
I am trying to keep only the rows whose id contains letters, and I find that the following two approaches give different results.
df[grep("[A-Z]",df$id),]
df[grep(LETTERS,df$id),]
It seems the second way will omit many rows that actually have letters.
Why?
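grep() uses only the first element of a pattern vector (and issues a warning), so grep(LETTERS, df$id) searches only for "A". If you want to match any of several patterns held in a vector, collapse them into a single pattern: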
to_match <- paste(LETTERS, collapse = "|")
to_match
[1] "A|B|C|D|E|F|G|H|I|J|K|L|M|N|O|P|Q|R|S|T|U|V|W|X|Y|Z"
and then
df[grep(to_match, df$id), ]
Explanation:
This matches any of the letters in to_match, since they are separated by the alternation ("or") operator |.
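A small sketch that puts the three variants side by side. The data frame below is made up for illustration; grep(LETTERS[1], ...) stands in for what grep(LETTERS, ...) effectively does, since only the first pattern is used.

```r
# Hypothetical data: two ids contain a letter, two do not
df <- data.frame(id = c("123", "12B", "A45", "678"), stringsAsFactors = FALSE)

# Character class: any upper-case letter
with_class <- df[grep("[A-Z]", df$id), , drop = FALSE]

# Alternation built from LETTERS: "A|B|...|Z"
to_match <- paste(LETTERS, collapse = "|")
with_alt <- df[grep(to_match, df$id), , drop = FALSE]

# Only "A" is searched for, so the row with "12B" is lost
only_first <- df[grep(LETTERS[1], df$id), , drop = FALSE]

with_class$id
with_alt$id
only_first$id
```

The first two agree, while the third keeps only the ids containing "A", which is why the second approach in the question omits rows.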