I have the following value (and similar formatting in hundreds of thousands of fields):
61.00.62.1
that I would like to split, using stringr or stringi and (likely) a regex, into
61.0
0.6
2.1
I have been unsuccessfully using the stringi::stri_split_regex command of the form
"[0-9]\\d{1,2}(\\.[0-9]\\d{1})"
Thank you.
I would try stri_extract_all_regex:
stringi::stri_extract_all_regex('61.00.62.1', '\\d{1,2}\\.\\d')
# [[1]]
# [1] "61.0" "0.6" "2.1"
\\d{1,2}\\.\\d will match 1 to 2 digits followed by one point and one digit.
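Since the question mentions hundreds of thousands of fields, here is a minimal base R sketch of the same pattern applied to a whole vector. The second value in fields is made up for illustration, and the sketch assumes every field follows the same layout as the example.

```r
# Hypothetical input: only the first value comes from the question.
fields <- c("61.00.62.1", "12.34.56.7")

# 1 to 2 digits, a literal dot, then exactly one digit.
pattern <- "\\d{1,2}\\.\\d"

# gregexpr() locates every match in each element; regmatches() extracts them
# as a list holding one character vector per input field.
pieces <- regmatches(fields, gregexpr(pattern, fields))

pieces[[1]]  # c("61.0", "0.6", "2.1")
pieces[[2]]  # c("12.3", "4.5", "6.7")
```

The result is a list, so each field keeps its own pieces even if the number of pieces differs between fields.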
We can use strsplit
strsplit(str1, "(?<=\\..)", perl = TRUE)[[1]]
#[1] "61.0" "0.6" "2.1"
data
str1 <- "61.00.62.1"
Another option: split the given string into a list of single characters, then regroup them by checking where each '.' occurs.
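The character-by-character idea could be sketched in base R roughly like this. It is only one possible reading of that suggestion, and it assumes every piece ends exactly one character after a '.' (the variable names are illustrative).

```r
str1 <- "61.00.62.1"

# Break the string into single characters and locate the dots.
chars <- strsplit(str1, "")[[1]]
dots <- which(chars == ".")

# Assumption: each piece ends one character after a dot.
ends <- dots + 1
# Each piece starts right after the previous piece ended.
starts <- c(1, head(ends, -1) + 1)

pieces <- substring(str1, starts, ends)
pieces  # c("61.0", "0.6", "2.1")
```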
I have a string that consists entirely of simple repeating patterns of a [:digit:]+[A-Z] for instance 12A432B4B.
I want to use base::strsplit() to get:
[1] "12A" "432B" "4B"
I thought I could use lookahead to split by a LETTER and keep this pattern with unlist(strsplit("12A432B4B", "(?<=.)(?=[A-Z])", perl = TRUE)) but, as can be seen, the split comes out wrong:
[1] "12" "A432" "B4" "B"
I can't get my mind around a pattern that works with this strsplit strategy. Explanations would be really appreciated.
Bonus:
I also failed to use a back reference in gsub (e.g. this pattern is not working: gsub("([[:digit:]]+[A-Z])+", "\\1", "12A432B4B")). Also, can you retrieve more than groups \\1 to \\9, say if [:digit:]+[A-Z] repeats more than 9 times?
We can use regex lookaround to split between an upper case letter and a digit
strsplit(str1, "(?<=[A-Z])(?=[0-9])", perl = TRUE)[[1]]
#[1] "12A" "432B" "4B"
data
str1 <- "12A432B4B"
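To see why the pattern from the question misplaces the cuts, it may help to run both patterns side by side: the original lookahead cuts before each letter, whereas the lookaround above cuts after a letter. On the bonus, one option is to let gsub mark the end of each token instead of capturing every repeat. gsub applies \\1 once per match, so a single group is enough however often the pattern repeats. This is a sketch that assumes a comma never occurs in the data.

```r
x <- "12A432B4B"

# Pattern from the question: the cut lands BEFORE each upper-case letter.
before <- strsplit(x, "(?<=.)(?=[A-Z])", perl = TRUE)[[1]]
# before is c("12", "A432", "B4", "B")

# Lookaround pattern: the cut lands AFTER a letter and before a digit.
after <- strsplit(x, "(?<=[A-Z])(?=[0-9])", perl = TRUE)[[1]]
# after is c("12A", "432B", "4B")

# Back-reference variant: append a comma to every token, then split on it.
# Assumption: "," never appears in the input.
marked <- gsub("([[:digit:]]+[A-Z])", "\\1,", x)
tokens <- strsplit(marked, ",", fixed = TRUE)[[1]]
# tokens is c("12A", "432B", "4B")
```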
The pattern mentioned in the post can be used as it is in str_extract_all :
str_extract_all(string, '[[:digit:]]+[A-Z]')[[1]]
#[1] "12A" "432B" "4B"
Or in base R :
regmatches(string, gregexpr('[[:digit:]]+[A-Z]', string))[[1]]
where string is :
string <- '12A432B4B'
I have strings like this: "X96HE6.10nMBI_1_2", "X96HE6.10nMBI_2_2", "X96HE6.10nMBI_3_2", and I would like to match only the numbers 1, 2 and 3 between the underscores, without the underscores themselves. The best solution I could come up with is str_match(sample_names, "_+[1-3]?"). I would really appreciate the help.
The simplest method is to use sub and a backreference:
Data:
d <- c("X96HE6.10nMBI_1_2", "X96HE6.10nMBI_2_2", "X96HE6.10nMBI_3_2")
Solution:
sub(".*_(\\d)_.*", "\\1", d)
Here, (\\d) defines the capturing group for a single digit (if the number in question can have more than one digit, use \\d+), which is 'recalled' by the backreference \\1 in sub's replacement argument.
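As a small illustration of the \\d+ variant, here is a sketch on made-up names (d2 is hypothetical: its first element has a two-digit number, its last has no underscores at all). Note that sub returns an element unchanged when the pattern does not match it.

```r
# Hypothetical data, not from the question.
d2 <- c("X96HE6.10nMBI_12_2", "X96HE6.10nMBI_3_2", "plain")

# \\d+ captures one or more digits between the two underscores.
res <- sub(".*_(\\d+)_.*", "\\1", d2)
res  # c("12", "3", "plain")
```

If unmatched elements should become NA instead of passing through, an extraction function such as str_extract may be the safer choice.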
Alternatively use str_extract and positive lookaround:
library(stringr)
str_extract(d, "(?<=_)\\d(?=_)")
(?<=_) is positive lookbehind which can be glossed as "If you see _ on the left..."
\\d is the number to be matched
(?=_) is positive lookahead, which can be glossed as "If you see _ on the right..."
Result:
[1] "1" "2" "3"
You can use lookarounds. I personally rely heavily on the stringr cheatsheets for this kind of regex, since the syntax is a bit hard to remember. On the RStudio cheatsheets page, look for stringr -> LOOK AROUNDS.
library(tidyverse)
codes <- c("X96HE6.10nMBI_1_2", "X96HE6.10nMBI_2_2", "X96HE6.10nMBI_3_2")
codes %>%
str_extract("(?<=_)[:digit:]+(?=_)")
#> [1] "1" "2" "3"
Created on 2020-06-14 by the reprex package (v0.3.0)
Using x in the Note at the end, read it in using read.table and pick off the second field. No packages or regular expressions are used.
read.table(text = x, sep = "_")[[2]]
## [1] 1 2 3
Note
x <- c("X96HE6.10nMBI_1_2", "X96HE6.10nMBI_2_2", "X96HE6.10nMBI_3_2")
No need for any third-party package:
strings <- c("X96HE6.10nMBI_1_2", "X96HE6.10nMBI_2_2", "X96HE6.10nMBI_3_2")
pattern <- "(?<=_)(\\d+)(?=_)"
unlist(regmatches(strings, gregexpr(pattern, strings, perl = TRUE)))
Which yields:
[1] "1" "2" "3"
Parsing string fields in R data frames is a bit of a mystery to me I'm afraid...would be grateful for help.
I have a string field which always ends in an indeterminate number of numeric characters. I'd like to write a bit of code to just extract the numeric part at the end of each.
An example of the data format is:
df_test <- data.frame(my_string = c("XXX-0387", "XXXX-1-999999", "XXX 12345432", "XXX-2345", "XXX1234"))
What I'd like is to put the numeric part at the end into a new field but to keep any leading zeros - so presumably the new field would have to be chr rather than int. So my output would look like:
c("0387", "999999", "12345432", "2345", "1234")
Is there an easy way to do this please?
Thank you.
One way is to use sub to capture the trailing part of the string that is a number.
sub('.*?(\\d+)$', '\\1', df_test$my_string)
#[1] "0387" "999999" "12345432" "2345" "1234"
Using stringr :
stringr::str_extract(df_test$my_string, '\\d+$')
You can use regexpr with \\d+$ to find the numbers at the end and extract them with regmatches.
regmatches(df_test$my_string, regexpr("\\d+$", df_test$my_string))
#[1] "0387" "999999" "12345432" "2345" "1234"
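One caveat when storing the result in a new column: regmatches returns only the matches, so its result is shorter than the input whenever a row has no trailing digits. A minimal sketch that keeps the rows aligned, assuming a new column called num_part and one extra made-up row ("no digits") added for illustration:

```r
df_test <- data.frame(
  my_string = c("XXX-0387", "XXXX-1-999999", "XXX 12345432",
                "XXX-2345", "XXX1234", "no digits"),  # last row is hypothetical
  stringsAsFactors = FALSE
)

# regexpr() returns -1 for elements without a match.
m <- regexpr("\\d+$", df_test$my_string)
found <- as.vector(m) > 0

# Start with NA everywhere, then fill in only the rows that matched.
df_test$num_part <- NA_character_
df_test$num_part[found] <- regmatches(df_test$my_string, m)

df_test$num_part
# c("0387", "999999", "12345432", "2345", "1234", NA)
```

The column stays character, so leading zeros such as in "0387" are preserved.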
We can use stri_extract_last from stringi
library(stringi)
stri_extract_last(df_test$my_string, regex = "\\d+")
#[1] "0387" "999999" "12345432" "2345" "1234"
I'm trying to extract the "Number" of "Humans" in the string below, for example:
string <- c("ProjectObjectives|Objectives_NA, PublishDate|PublishDate_NA, DeploymentID|DeploymentID_NA, Species|Human|Gender|Female, Species|Cat|Number|1, Species|Human|Number|1, Species|Human|Position|Left")
The position of the text in the string will constantly change, so I need R to search the string and find "Species|Human|Number|" and return 1.
Apologies if this is a duplicate of another thread, but I've looked here (extract a substring in R according to a pattern) and here (R extract part of string). But I'm not having any luck.
Any ideas?
Use a capturing approach - capture 1 or more digits (\d+) after the known substring (just escape the | symbols):
> string <- c("ProjectObjectives|Objectives_NA, PublishDate|PublishDate_NA, DeploymentID|DeploymentID_NA, Species|Human|Gender|Female, Species|Cat|Number|1, Species|Human|Number|1, Species|Human|Position|Left")
> pattern = "Species\\|Human\\|Number\\|(\\d+)"
> unlist(regmatches(string,regexec(pattern,string)))[2]
[1] "1"
A variation is to use a PCRE regex with regmatches/regexpr
> pattern="(?<=Species\\|Human\\|Number\\|)\\d+"
> regmatches(string,regexpr(pattern,string, perl=TRUE))
[1] "1"
Here, the left side context is put inside a non-consuming pattern, a positive lookbehind, (?<=...).
The same functionality can be achieved with the \K operator, which drops everything matched so far from the reported match:
> pattern="Species\\|Human\\|Number\\|\\K\\d+"
> regmatches(string,regexpr(pattern,string, perl=TRUE))
[1] "1"
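If the same lookup is needed for several species, the capturing approach could be wrapped in a small helper. extract_number is a hypothetical name, the test string s is a shortened made-up variant of the one above, and the sketch assumes the species name contains no regex metacharacters.

```r
# Hypothetical helper: returns the Number recorded for a species, or NA.
extract_number <- function(x, species) {
  # Assumption: `species` has no regex metacharacters that need escaping.
  pattern <- paste0("Species\\|", species, "\\|Number\\|(\\d+)")
  hits <- regmatches(x, regexec(pattern, x))
  # Each hit is c(full match, group 1), or character(0) when nothing matched.
  first_group <- vapply(
    hits,
    function(g) if (length(g) >= 2) g[2] else NA_character_,
    character(1)
  )
  as.integer(first_group)
}

# Made-up input with different counts per species.
s <- "Species|Human|Gender|Female, Species|Cat|Number|3, Species|Human|Number|2"

extract_number(s, "Human")  # 2
extract_number(s, "Cat")    # 3
extract_number(s, "Dog")    # NA
```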
Simplest way I can think of:
as.integer(gsub("^.*Species\\|Human\\|Number\\|(\\d+).*$", "\\1", string))
It will introduce NAs where there is no mention of Species|Human|Number. Also, there will be artefacts if any of the strings is a number (but I assume that this won't be an issue).
I have the following vector of strings
[1] "/players/playerpage.htm?ilkidn=BRYANPHI01"
[2] "/players/playerpage.htm?ilkidhh=WILLIROB027"
[3] "/players/playerpage.htm?ilkid=THOMPWIL01"
I am looking for a way to retrieve the part of each string that comes after the equal sign, meaning I would like to get a vector like this
[1] "BRYANPHI01"
[2] "WILLIROB027"
[3] "THOMPWIL01"
I tried using substr, but for it to work I have to know exactly where the equal sign is placed in the string and where the part I want to retrieve ends.
We can use sub to match the zero or more characters that are not a = ([^=]*) followed by a = and replace it with ''.
sub("[^=]*=", "", str1)
#[1] "BRYANPHI01" "WILLIROB027" "THOMPWIL01"
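Since the question mentions substr and not knowing where the equal sign sits, here is a sketch that looks the position up first and then cuts from there. It assumes every element contains an '='; pos and after_eq are illustrative names.

```r
str1 <- c("/players/playerpage.htm?ilkidn=BRYANPHI01",
          "/players/playerpage.htm?ilkidhh=WILLIROB027",
          "/players/playerpage.htm?ilkid=THOMPWIL01")

# Position of the first "=" in each element (fixed = TRUE: no regex needed).
pos <- as.vector(regexpr("=", str1, fixed = TRUE))

# substring() runs to the end of the string when no end position is given.
after_eq <- substring(str1, pos + 1)
after_eq
# c("BRYANPHI01", "WILLIROB027", "THOMPWIL01")
```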
data
str1 <- c("/players/playerpage.htm?ilkidn=BRYANPHI01",
"/players/playerpage.htm?ilkidhh=WILLIROB027",
"/players/playerpage.htm?ilkid=THOMPWIL01")
Using stringr,
library(stringr)
word(str1, 2, sep = '=')
#[1] "BRYANPHI01" "WILLIROB027" "THOMPWIL01"
Using strsplit,
strsplit(str1, "=")[[1]][2]
# [1] "BRYANPHI01"
Following Sotos's comment, to get the results as a vector:
sapply(str1, function(x){
strsplit(x, "=")[[1]][2]
})
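The sapply call above returns a vector named after the input strings. If plain, unnamed output is preferred, a variant with vapply over the split list is one option. This is a sketch, with ids as an illustrative name.

```r
str1 <- c("/players/playerpage.htm?ilkidn=BRYANPHI01",
          "/players/playerpage.htm?ilkidhh=WILLIROB027",
          "/players/playerpage.htm?ilkid=THOMPWIL01")

# Split every element once; strsplit() is already vectorised.
parts <- strsplit(str1, "=", fixed = TRUE)

# Take the second piece of each split; elements without "=" would give NA.
ids <- vapply(parts, function(p) p[2], character(1))
ids
# c("BRYANPHI01", "WILLIROB027", "THOMPWIL01")
```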
Another solution based on regex, but extracting instead of substituting, which may be more efficient.
I use the stringi package, which provides a more powerful regex engine (ICU) than base R's default one (in particular, it supports look-behind, which base R only offers with perl = TRUE).
str1 <- c("/players/playerpage.htm?ilkidn=BRYANPHI01",
"/players/playerpage.htm?ilkidhh=WILLIROB027",
"/players/playerpage.htm?ilkid=THOMPWIL01")
library(stringi)
stri_extract_all_regex(str1, pattern = "(?<==).+$", simplify = TRUE)
(?<==) is a look-behind: regex will match only if preceded by an equal sign, but the equal sign will not be part of the match.
.+$ matches everything until the end. You could replace the dot with a more precise symbol if you are confident about the format of what you match. For example, '\w' matches any word character (letters, digits and the underscore), so you could use "(?<==)\\w+$" (in an R string the \ must be escaped, so you end up with \\w).
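For comparison, the same look-behind can also be run without extra packages by switching base R to its PCRE engine with perl = TRUE. A minimal sketch (ids is an illustrative name); note that regmatches silently drops elements that do not match.

```r
str1 <- c("/players/playerpage.htm?ilkidn=BRYANPHI01",
          "/players/playerpage.htm?ilkidhh=WILLIROB027",
          "/players/playerpage.htm?ilkid=THOMPWIL01")

# (?<==) requires an "=" just before the match; \\w+$ takes the word
# characters running up to the end of the string.
ids <- regmatches(str1, regexpr("(?<==)\\w+$", str1, perl = TRUE))
ids
# c("BRYANPHI01", "WILLIROB027", "THOMPWIL01")
```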