I have a string like this:
Sample.ID<-"(<SampleID>, 2213 )"
I am using the following gsub code to extract the numbers from this string:
ID<-as.numeric(gsub("\\D", "", Sample.ID))
This is ok, but sometimes in my data the string is like this:
Sample.ID<-"(<SampleID>, 2213-EQUINOX BELL 2-P, )"
Then I have a problem, as it takes all the digits (i.e. 22132), where I just wanted 2213.
What is the work-around?
Thanks,
Phuong
You can capture the first run of digits and return it with a backreference; the lazy .*? stops at the first digit:
sub(".*?(\\d+).*", "\\1", Sample.ID)
[1] "2213" "2213"
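To see why the lazy quantifier matters, here is a small comparison of a greedy and a lazy prefix. It is only a sketch: perl = TRUE is my addition (to pin down PCRE semantics) and is not part of the answer above.

```r
Sample.ID <- c("(<SampleID>, 2213 )", "(<SampleID>, 2213-EQUINOX BELL 2-P, )")

# Greedy .* swallows as much as it can, so the group keeps only the last digit
greedy <- sub(".*(\\d+).*", "\\1", Sample.ID, perl = TRUE)

# Lazy .*? stops at the first digit, so the group captures the first full number
lazy <- sub(".*?(\\d+).*", "\\1", Sample.ID, perl = TRUE)

greedy  # "3" "2"
lazy    # "2213" "2213"
```

Both calls return character vectors, so wrap the result in as.numeric() if a number is needed.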
As your data looks like
Sample.ID<-"(<SampleID>, 2213-EQUINOX BELL 2-P, )"
use the lookbehind (?<=, )\d+ to match the number that follows the comma and space.
The following code matches the whole string and extracts the first group:
gsub(".*(?<=, )(\\d+).*", "\\1", Sample.ID, perl=TRUE)
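As a possible alternative, the lookbehind can be used on its own to pull out the match, instead of replacing the whole string. This is a sketch under the assumption that every element contains ", " followed by digits; regmatches drops elements without a match, so the result could otherwise be shorter than the input.

```r
Sample.ID <- c("(<SampleID>, 2213 )", "(<SampleID>, 2213-EQUINOX BELL 2-P, )")

# Lookbehind needs PCRE, hence perl = TRUE
m <- regexpr("(?<=, )\\d+", Sample.ID, perl = TRUE)

# regexpr finds the first match per element; regmatches extracts it
ids <- as.numeric(regmatches(Sample.ID, m))
ids  # 2213 2213
```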
We can match zero or more characters that are not a comma ([^,]*) from the start (^) of the string, followed by a comma and one or more spaces (\\s+), or (|) a hyphen or space followed by any other characters (.*), and replace the match with an empty string ("")
as.numeric(gsub("^[^,]*,\\s+|(-|\\s+).*", "", Sample.ID))
#[1] 2213 2213
If there are no other restrictions, then str_extract can be used to extract the first run of digits
library(stringr)
as.numeric(str_extract(Sample.ID, "\\d+"))
#[1] 2213 2213
Or with parse_number from readr
readr::parse_number(Sample.ID)
#[1] 2213 2213
Or a similar option with base R
as.numeric(regmatches(Sample.ID, regexpr("\\d+", Sample.ID)))
#[1] 2213 2213
data
Sample.ID <- c("(<SampleID>, 2213 )", "(<SampleID>, 2213-EQUINOX BELL 2-P, )")
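One thing to watch with the regmatches version: elements with no digits are dropped, so the result can be shorter than the input. A minimal base R sketch that keeps the length and returns NA instead (the helper name extract_first_number is hypothetical, not from any package):

```r
# Hypothetical helper: first run of digits per element, NA when there is none
extract_first_number <- function(x) {
  m <- regexpr("[0-9]+", x)             # -1 marks "no match"
  out <- rep(NA_character_, length(x))
  out[m != -1] <- regmatches(x, m)      # regmatches returns matched elements only
  as.numeric(out)
}

ids <- extract_first_number(c("(<SampleID>, 2213 )",
                              "no digits here",
                              "(<SampleID>, 2213-EQUINOX BELL 2-P, )"))
ids  # 2213 NA 2213
```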
Related
I have a character string of names which look like
"_6302_I-PAL_SPSY_000237_001"
I need to remove the first occurred underscore, so that it will be as
"6302_I-PAL_SPSY_000237_001"
I am aware of gsub, but it removes all of the underscores. Thank you for any suggestions.
gsub can do this too: anchor the pattern to the start of the string with ^, so that only a leading underscore is removed
x <- "_6302_I-PAL_SPSY_000237_001"
x <- gsub("^\\_","",x)
[1] "6302_I-PAL_SPSY_000237_001"
We can use sub with pattern as _ and replacement as blanks (""). This will remove the first occurrence of '_'.
sub("_", "", str1)
#[1] "6302_I-PAL_SPSY_000237_001"
NOTE: This will remove the first occurrence of _ wherever it is; it is not limited to the start of the string.
For example, suppose we have string
str2 <- "6302_I-PAL_SPSY_000237_001"
sub("_", "", str2)
#[1] "6302I-PAL_SPSY_000237_001"
As the example has _ at the beginning, another option is substring
substring(str1, 2)
#[1] "6302_I-PAL_SPSY_000237_001"
data
str1 <- "_6302_I-PAL_SPSY_000237_001"
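If the underscore should only be removed when it is the first character, anchoring the pattern seems like the safest variant. A short sketch on both example strings (the vector name strs is just for illustration):

```r
strs <- c("_6302_I-PAL_SPSY_000237_001",  # leading underscore
          "6302_I-PAL_SPSY_000237_001")   # no leading underscore

# ^ anchors the match to the start, so inner underscores are never touched
cleaned <- sub("^_", "", strs)
cleaned
# "6302_I-PAL_SPSY_000237_001" "6302_I-PAL_SPSY_000237_001"
```

Unlike substring(x, 2), this leaves strings without a leading underscore unchanged.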
This can be done with base R's trimws() too
string1<-"_6302_I-PAL_SPSY_000237_001"
trimws(string1, which='left', whitespace = '_')
[1] "6302_I-PAL_SPSY_000237_001"
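Note that trimws() strips every leading underscore, not just the first one, and that the whitespace argument needs R 3.6.0 or later. A small sketch of the difference, using a made-up string with two leading underscores:

```r
x <- "__6302_I-PAL"   # illustrative input, not from the question

# trimws removes the whole run of leading underscores
all_leading <- trimws(x, which = "left", whitespace = "_")

# an anchored sub removes exactly one
one_leading <- sub("^_", "", x)

all_leading  # "6302_I-PAL"
one_leading  # "_6302_I-PAL"
```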
In case we have multiple words with leading underscores, we may have to include a word boundary (\\b) in our regex, and use either gsub or stringr::str_remove_all:
string2<-paste(string1, string1)
string2
[1] "_6302_I-PAL_SPSY_000237_001 _6302_I-PAL_SPSY_000237_001"
library(stringr)
str_remove_all(string2, "\\b_")
[1] "6302_I-PAL_SPSY_000237_001 6302_I-PAL_SPSY_000237_001"
suppose I have the next string:
"palavras a serem encontradas fazer-se encontrar-se, enganar-se"
How can I extract the words "fazer-se" "encontrar-se" "enganar-se"
I'm trying to use stringr like
library(stringr)
sentence <- "palavras a serem encontradas fazer-se encontrar-se, enganar-se"
str_extract_all(sentence, "se$")
I'd like this output:
[1] "fazer-se" "encontrar-se" "enganar-se"
The $ anchor matches only at the end of the whole string, so "se$" finds a single match. Instead, we can put a word boundary (\\b) after se, and \\S+ (one or more non-whitespace characters) before it to pick up the rest of the word
library(stringr)
str_extract_all(sentence, "\\S+se\\b")[[1]]
#[1] "fazer-se" "encontrar-se" "enganar-se"
In base R, we can use gregexpr and regmatches :
regmatches(sentence, gregexpr('\\w+-se', sentence))[[1]]
#[1] "fazer-se" "encontrar-se" "enganar-se"
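The two patterns above are not equivalent on other input: \\S+se\\b also picks up any word that merely ends in "se". A sketch with a modified sentence (sentence2 is my own example, and perl = TRUE is my choice, not from the answers) shows the difference and a stricter pattern that requires the hyphen:

```r
sentence2 <- "a frase fazer-se encontrar-se, enganar-se"

# Any non-whitespace run ending in "se" at a word boundary
loose <- regmatches(sentence2,
                    gregexpr("\\S+se\\b", sentence2, perl = TRUE))[[1]]

# Word characters, a literal hyphen, then "se" at a word boundary
strict <- regmatches(sentence2,
                     gregexpr("\\w+-se\\b", sentence2, perl = TRUE))[[1]]

loose   # "frase" "fazer-se" "encontrar-se" "enganar-se"
strict  # "fazer-se" "encontrar-se" "enganar-se"
```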
Given the string "This has 4 words!" I would like to count only the letters and digits. I would like to exclude whitespace and punctuation. As such, the string above should return 13.
I'm not sure why, but I cannot get this for R.
We can use [[:alnum:]] in str_count to count only the letters and digits
library(stringr)
str_count(str1, "[[:alnum:]]")
#[1] 13
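Or in base R, use gsub to remove punctuation and whitespace ([[:punct:][:space:]]) and then get the number of characters with nchar (removing [[:punct:]] alone would leave the spaces in and give 16)
nchar(gsub("[[:punct:][:space:]]+", "", str1))
#[1] 13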
Or match characters that are not alphanumeric with a negated class ([^[:alnum:]]), replace them with an empty string ("") and get the nchar
nchar(gsub("[^[:alnum:]]+", "", str1))
#[1] 13
data
str1 <- "This has 4 words!"
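As a quick check of the base R route, it can help to look at the intermediate string before counting. This sketch shows that stripping punctuation alone still counts the spaces, while the negated alphanumeric class gives the expected 13:

```r
str1 <- "This has 4 words!"

# Only punctuation removed: the three spaces are still counted
punct_only <- gsub("[[:punct:]]+", "", str1)
nchar(punct_only)   # 16

# Everything that is not a letter or digit removed
alnum_only <- gsub("[^[:alnum:]]+", "", str1)
alnum_only          # "Thishas4words"
nchar(alnum_only)   # 13
```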
I have a data frame. One of the columns is in string format. Various letters and numbers, but always ending in a string of numbers. Sadly this string isn't always the same length.
I'd like to know how to write a bit of code to extract just the numbers at the end. So for example:
x <- c("AB ABC 19012301927 / XX - 4625",
"BC - AB / 827 / 9765",
"XXXX-9276"
)
And I'd like to get from this: (4625, 9765, 9276)
Is there any easy way to do this please?
Thank you.
A
We can use sub to capture one or more digits (\\d+) at the end ($) of the string that follows a non-digit ([^0-9]) and other characters (.*), in the replacement, specify the backreference (\\1) of the captured group
sub(".*[^0-9](\\d+)$", "\\1", x)
#[1] "4625" "9765" "9276"
Or with word from stringr
library(stringr)
word(x, -1, sep="[- ]")
#[1] "4625" "9765" "9276"
Or with stri_extract_last
library(stringi)
stri_extract_last_regex(x, "\\d+")
#[1] "4625" "9765" "9276"
Replace everything up to the last non-digit with a zero length string.
sub(".*\\D", "", x)
giving:
[1] "4625" "9765" "9276"
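The sub-based answers behave differently when a string does not end in digits: sub(".*\\D", "", x) returns an empty string, and a pattern that fails to match returns the input unchanged. If NA is preferred in that case, a base R sketch along these lines might help (the helper name trailing_digits is hypothetical):

```r
# Hypothetical helper: digits at the very end of each string, NA otherwise
trailing_digits <- function(x) {
  m <- regexpr("[0-9]+$", x)            # -1 marks "no match"
  out <- rep(NA_character_, length(x))
  out[m != -1] <- regmatches(x, m)
  out
}

res <- trailing_digits(c("AB ABC 19012301927 / XX - 4625",
                         "XXXX-9276",
                         "no digits here",
                         "12345"))
res  # "4625" "9276" NA "12345"
```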
I have the following string in R: "xxx, yyy. zzz"
I want to get the yyy part only, which are in between "," and "."
I don't want to use regex.
I searched half a day, found many string functions in R but none which deal with "cut before/after a character" function.
Is there such?
We can use gsub to match zero or more characters that are not a , ([^,]*) from the start (^) of the string, followed by a , and zero or more spaces (\\s*), or (|) a dot (\\. - it is escaped because an unescaped . is a metacharacter meaning any character) followed by other characters (.*) until the end of the string ($), and replace it with an empty string ("")
gsub("^[^,]*,\\s*|\\..*$", "", str1)
#[1] "yyy"
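If we would rather avoid gsub, we can strsplit the string on a , followed by zero or more spaces or on a . and select the second entry after pulling the vector out of the list output ([[1]]). Note that the split argument is still interpreted as a regular expression by default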
strsplit(str1, ",\\s*|\\.")[[1]][2]
#[1] "yyy"
data
str1 <- "xxx, yyy. zzz"
It could be that this suffices:
unlist(strsplit("xxx, yyy. zzz","[,.]"))[2] # get yyy with space, or:
gsub(" ","",unlist(strsplit("xxx, yyy. zzz","[,.]")))[2] # remove space
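Since the question asks for a way without regex, here is a sketch that locates the delimiters as literal text with fixed = TRUE and cuts between them. It assumes the string contains a comma followed later by a dot; the variable names are mine.

```r
str1 <- "xxx, yyy. zzz"

# fixed = TRUE treats "," and "." as literal characters, not as patterns
comma_pos <- as.integer(regexpr(",", str1, fixed = TRUE))  # 4
dot_pos   <- as.integer(regexpr(".", str1, fixed = TRUE))  # 9

# Take what lies strictly between the two, then drop surrounding spaces
middle <- trimws(substr(str1, comma_pos + 1, dot_pos - 1))
middle  # "yyy"
```

regexpr returns -1 when a delimiter is missing, so real data would need a check for that before calling substr.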