How to extract multiple substrings in a string using stringr regex - r

I have this string:
mystring <- "HMSC-bm_in_ALL_CELLTYPES.distal"
I want to extract the substrings marked by the brackets below:
[HMSC-bm]_in_ALL_CELLTYPES.[distal]
So in the end it will yield a vector with two values: HMSC-bm and distal. How can I do it? I tried this but failed:
> stringr::str_extract(base,"\\([\\w-]+\\)_in_ALL_CELLTYPES\\.\\([\\w+]\\)")
[1] NA

I'd use str_match:
library(stringr)
mymatch <- str_match(mystring, "^(.*?)_.*?\\.(.*?)$")
mymatch
[,1] [,2] [,3]
[1,] "HMSC-bm_in_ALL_CELLTYPES.distal" "HMSC-bm" "distal"
mymatch[, 2]
[1] "HMSC-bm"
mymatch[, 3]
[1] "distal"
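As a side note on why the attempt in the question returns NA: `\\(` and `\\)` match literal parentheses instead of opening a capture group, `[\\w+]` is a character class that matches a single character, and str_extract returns the whole match rather than the groups. Below is a minimal sketch of a more explicit pattern, written with base R's regexec and regmatches so it needs no packages. The names pattern and parts are illustrative only, and the character classes assume the two pieces consist of letters, digits, underscores and hyphens.

```r
mystring <- "HMSC-bm_in_ALL_CELLTYPES.distal"

# Unescaped parentheses create capture groups; the literal dot is escaped.
pattern <- "^([[:alnum:]_-]+)_in_ALL_CELLTYPES\\.([[:alnum:]_]+)$"

# regexec() reports the full match plus each capture group;
# regmatches() turns those positions into strings.
m <- regmatches(mystring, regexec(pattern, mystring))[[1]]

# Drop the first element (the full match) to keep only the two groups.
parts <- m[-1]
print(parts)
```

Dropping the first element mirrors taking columns 2 and 3 of the str_match result.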

We can split the string on the literal _in_ALL_CELLTYPES. (fixed = TRUE keeps the dot from acting as a regex wildcard):
strsplit(mystring, split = "_in_ALL_CELLTYPES.", fixed = TRUE)[[1]]
[1] "HMSC-bm" "distal"

Related

String matching within a list of lists [duplicate]

I have a list like this:
map_tmp <- list("ABC",
c("EGF", "HIJ"),
c("KML", "ABC-IOP"),
"SIN",
"KMLLL")
> grep("ABC", map_tmp)
[1] 1 3
> grep("^ABC$", map_tmp)
[1] 1 # with the anchored regex, I get the index of "ABC" in the list
> grep("^KML$", map_tmp)
[1] 5 # I wanted 3, but I got 5. Anchoring the end of the string with "$" didn't help here.
> grep("^HIJ$", map_tmp)
integer(0) # the regex does not return the index of a string inside a vector
How can I get the index of a string (exact match) in the list? I'm fine with not using grep. Thanks!
Using lapply:
which(lapply(map_tmp, function(x) grep("^HIJ$", x))!=0)
lapply returns, for each list element, the positions within that element that match the pattern (integer(0) if there is no match). Comparing that result with != 0 and passing it to which gives the index of the list element where your string occurs.
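Some background that may help here: grep coerces its input with as.character, so a list element holding several strings is searched as one deparsed string, which is why anchors such as ^ and $ behave unexpectedly on the list itself. The sketch below shows that coercion and a variant of the lapply idea that reduces each element to a single TRUE/FALSE. find_in_list is a hypothetical helper name, not something from the answer above.

```r
map_tmp <- list("ABC",
                c("EGF", "HIJ"),
                c("KML", "ABC-IOP"),
                "SIN",
                "KMLLL")

# grep() coerces its input with as.character(); a length-2 element
# becomes one deparsed string rather than two separate strings.
flat <- as.character(map_tmp)
print(flat[3])

# Hypothetical helper: test the regex inside each element, then
# reduce each element to a single TRUE/FALSE with any().
find_in_list <- function(lst, pattern) {
  which(vapply(lst, function(x) any(grepl(pattern, x)), logical(1)))
}

hij_idx <- find_in_list(map_tmp, "^HIJ$")
kml_idx <- find_in_list(map_tmp, "^KML$")
print(hij_idx)
print(kml_idx)
```

Using any() keeps the result well defined even when one element contains several matching strings.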
Use either mapply or Map with str_detect to find the position. I have run it only for the string "KML"; you can run it for the others in the same way.
First, pad the list elements to equal length so they are easy to process:
library(stringr)
# Pad every element with NA up to the length of the longest one
map_tmp_1 <- lapply(map_tmp, `length<-`, max(lengths(map_tmp)))
val <- t(mapply(str_detect,map_tmp_1,"^KML$"))
> which(val[,1] == T)
[1] 3
> which(val[,2] == T)
integer(0)
In case of "ABC" string:
val <- t(mapply(str_detect,map_tmp_1,"ABC"))
> which(val[,1] == T)
[1] 1
> which(val[,2] == T)
[1] 3
I had the same question. I cannot explain why grep works on a list with a plain pattern but not with an anchored regex. Anyway, the best way I found to match a character string exactly in base R is:
map_tmp <- list("ABC",
c("EGF", "HIJ"),
c("KML", "ABC-IOP"),
"SIN",
"KMLLL")
sapply( map_tmp , match , 'ABC' )
It returns a list with the same structure as the input, holding 1 where a string equals 'ABC' and NA otherwise:
[[1]]
[1] 1
[[2]]
[1] NA NA
[[3]]
[1] NA NA
[[4]]
[1] NA
[[5]]
[1] NA
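The list above still has to be reduced to a list index. One possible sketch, assuming only exact (non-regex) comparison is needed, checks each element with %in% and keeps the positions that are TRUE. locate_exact is a hypothetical helper name introduced for illustration.

```r
map_tmp <- list("ABC",
                c("EGF", "HIJ"),
                c("KML", "ABC-IOP"),
                "SIN",
                "KMLLL")

# Hypothetical helper: index of every list element that contains
# `target` as one of its strings (exact comparison, no regex).
locate_exact <- function(lst, target) {
  which(vapply(lst, function(x) target %in% x, logical(1)))
}

abc_idx  <- locate_exact(map_tmp, "ABC")
kml_idx  <- locate_exact(map_tmp, "KML")
none_idx <- locate_exact(map_tmp, "XYZ")  # no element contains "XYZ"
print(abc_idx)
print(kml_idx)
```

Because the comparison is exact, "ABC" does not hit "ABC-IOP" and "KML" does not hit "KMLLL".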

How can I extract unit preceded by number with str_extract?

I think str_extract can do this, but I can't figure out how. My data contains Chinese characters, so there is no whitespace between words. I simulated the data in English as:
> dd<-c("wwe12hours,fgg23days","ffgg12334hours,23days","ffff1days")
> target <- c("hours","days","hours","days")
> target
[1] "hours" "days" "hours" "days"
How can I achieve the target?
My real case is:
> dd <- c("腹痛发热12小时,再发2天","腹痛132324月,再发1天","发热4天")
> target <- c("小时","月","天")
> target
[1] "小时" "月" "天"
It seems you are looking for a regex to capture the units. Since your input is a vector of length three, it is natural to return a result of length three as well. From your English example it is not clear how you arrive at a target of 4 units; I would expect 5 (one per number) if not 3 (one per string).
Here is how you could tackle it. This should work for any language:
English:
gsub("\\p{L}*+\\d+", "", dd, perl = TRUE)
[1] "hours,days" "hours,days" "days"
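Chinese (with ddc holding the Chinese vector from the question):
ddc <- c("腹痛发热12小时,再发2天","腹痛132324月,再发1天","发热4天")
gsub("\\p{L}*+\\d+", "", ddc, perl = TRUE)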
[1] "小时,天" "月,天" "天"
regmatches(ddc,gregexpr("(?<=\\d)\\p{L}+",ddc,perl = TRUE))
[[1]]
[1] "小时" "天"
[[2]]
[1] "月" "天"
[[3]]
[1] "天"
Or, if you prefer a package, use str_extract_all from stringr:
library(stringr)
str_extract_all(ddc,"(?<=\\d)\\p{L}+")
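To get one flat vector of units rather than a list, the lookbehind approach can be combined with unlist. The following is a minimal base R sketch on the English sample data from the question; the variable names units_by_string and units are illustrative.

```r
dd <- c("wwe12hours,fgg23days", "ffgg12334hours,23days", "ffff1days")

# (?<=\\d) is a lookbehind: the match must be preceded by a digit,
# but the digit itself is not part of the match.
# \\p{L}+ then takes the run of letters that follows.
units_by_string <- regmatches(dd, gregexpr("(?<=\\d)\\p{L}+", dd, perl = TRUE))

# Flatten the per-string results into a single character vector.
units <- unlist(units_by_string)
print(units)
```

This yields five units, one per number in the input, which matches the count the answer above expects.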
You could use str_match_all :
library(stringr)
unlist(sapply(str_match_all(dd, '\\d+(\\w+)'), function(x) x[, 2]))
#[1] "hours" "days" "hours" "days" "days"
This captures the first word that comes after each number. The intermediate result is:
str_match_all(dd, '\\d+(\\w+)') #returns
#[[1]]
# [,1] [,2]
#[1,] "12hours" "hours"
#[2,] "23days" "days"
#[[2]]
# [,1] [,2]
#[1,] "12334hours" "hours"
#[2,] "23days" "days"
#[[3]]
# [,1] [,2]
#[1,] "1days" "days"
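As mentioned by @Onyambu, we can use a lookbehind regex to avoid using sapply to subset the capture group. Note that [A-Za-z] is used rather than [A-z], because the range A-z also covers a few punctuation characters.
unlist(str_extract_all(dd,"(?<=\\d)[A-Za-z]+"))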
Base R solution:
cleaned_dd <- gsub("[[:punct:]].*", "",
unlist(lapply(strsplit(
gsub("[[:digit:]]", " ", dd), "\\s+"
), '[',-1)))

Use strsplit with multiple delimiters [duplicate]

How can I split this
Chr3:153922357-153944632(-)
Chr11:70010183-70015411(-)
into
Chr3 153922357 153944632 -
Chr11 70010183 70015411 -
I tried strsplit(df$V1, "[[:punct:]]"), but the minus sign does not appear in the final result.
You can also try str_split from stringr:
library(stringr)
lapply(str_split(df$V1, "(?<!\\()\\-|[:\\)\\(]"), function(x) x[x != ""])
Result:
[[1]]
[1] "Chr3" "153922357" "153944632" "-"
[[2]]
[1] "Chr11" "70010183" "70015411" "-"
Data:
df = read.table(text = " Chr3:153922357-153944632(-)
Chr11:70010183-70015411(-) ")
How about this in base R using strsplit and gsub:
# Your sample strings
ss <- c("Chr3:153922357-153944632(-)",
"Chr11:70010183-70015411(-)")
# Split items as list of vectors
lst <- lapply(ss, function(x)
unlist(strsplit(gsub("(.+):(\\d+)-(\\d+)\\((.)\\)", "\\1,\\2,\\3,\\4", x), ",")))
# rbind to dataframe if necessary
do.call(rbind, lst)
# [,1] [,2] [,3] [,4]
#[1,] "Chr3" "153922357" "153944632" "-"
#[2,] "Chr11" "70010183" "70015411" "-"
This should work for other chromosome names and positive strand features as well.
The issue is that - is both a character you want to extract and a delimiter. Your best bet is using capture groups and specifying the full regex string:
stringr::str_match(x, "^(.{4}):(\\d+)-(\\d+)\\((.)\\)$")
EDIT: Note that .{4} matches exactly four characters, so it fits Chr3 but not Chr11. If you want the first capture group to capture strings of arbitrary length (e.g. ChrX for any number X), change it from .{4} to Chr\\d+.
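The capture-group idea can also be sketched in base R with regexec and regmatches. The pattern below is an assumption about the input format: a name without colons, two integer coordinates, and a strand of + or - in parentheses. The column names are illustrative.

```r
ss <- c("Chr3:153922357-153944632(-)",
        "Chr11:70010183-70015411(-)")

# One capture group per field; the hyphen between the coordinates is
# matched as a delimiter, while the strand sign is captured.
pattern <- "^([^:]+):([0-9]+)-([0-9]+)\\(([+-])\\)$"

m <- regmatches(ss, regexec(pattern, ss))

# Drop the full match from each result and stack the groups row-wise.
parts <- do.call(rbind, lapply(m, function(x) x[-1]))
colnames(parts) <- c("chr", "start", "end", "strand")
print(parts)
```

The result is a character matrix; wrap it in as.data.frame and convert start and end with as.numeric if numeric coordinates are needed.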

Extract all values in a string that occur after another substring in R

Let's say I have a string:
fgjh=621729_&ioij_fgjh7=twenty-_-One-_-Forty
I want to extract the following substrings from this string:
1. "621729"
2. "twenty"
3. "One"
4. "Forty"
Basically I want to extract everything after the "fgjh=" and "fgjh7=" substrings.
I've found that this formula works in excel:
=TRIM(RIGHT(SUBSTITUTE(A1,"fgjh=",REPT(" ",LEN(A1))),LEN(A1)))
But the Excel file is too large, and I need to perform the same operation in R.
How would I deal with leading and trailing characters? Let's say the string was "lmnop_82137_hhgia=77789_pasdk_ikuk_fgjh=621729_&ioij_fgjh7=twenty--One--Forty_dsaoij_882390=lkuk" and I need to extract the data after "fgjh=", i.e. 621729, and everything after "fgjh7=", to get only "twenty", "One" and "Forty".
You could use the package stringr and the function str_match for example to parse out the interesting bits with regular expressions
> library(stringr)
> s <- "fgjh=621729_&ioij_fgjh7=twenty--One--Forty"
> str_match(s, "^fgjh=([0-9]+)_&ioij_fgjh7=(.+)--(.+)--(.+)$")
[,1] [,2] [,3] [,4] [,5]
[1,] "fgjh=621729_&ioij_fgjh7=twenty--One--Forty" "621729" "twenty" "One" "Forty"
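Another option is str_extract_all with a lookbehind, followed by strsplit:
library(stringr)
string <- "fgjh=621729_&ioij_fgjh7=twenty--One--Forty"
unlist(strsplit(str_extract_all(string,'(?<=\\=)([^_]+)')[[1]],'--'))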
[1] "621729" "twenty" "One" "Forty"
Using sub with regular expression is more flexible than splitting by position:
> sub(".*=(.*)_&.*", "\\1", "fgjh=621729_&ioij_fgjh7=twenty--One--Forty")
[1] "621729"
> sub(".*=(.*)--.*--.*", "\\1", "fgjh=621729_&ioij_fgjh7=twenty--One--Forty")
[1] "twenty"
> sub(".*--(.*)--.*", "\\1", "fgjh=621729_&ioij_fgjh7=twenty--One--Forty")
[1] "One"
> sub(".*--(.*)$", "\\1", "fgjh=621729_&ioij_fgjh7=twenty--One--Forty")
[1] "Forty"
[1] "Forty"
In one line:
strsplit(sub(".*=(.*)_&.*=(.*)--(.*)--(.*)", "\\1\\|\\2\\|\\3\\|\\4",
"fgjh=621729_&ioij_fgjh7=twenty--One--Forty" ), split="\\|")[[1]]
[1] "621729" "twenty" "One" "Forty"
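For the follow-up case with leading and trailing characters, a key-based lookbehind may be easier than describing the whole string. The sketch below is one possible approach in base R, under the assumption that a value never contains an underscore (so it ends at the next "_"). extract_after is a hypothetical helper name, and the "--" separator is taken from the answers above.

```r
s <- "lmnop_82137_hhgia=77789_pasdk_ikuk_fgjh=621729_&ioij_fgjh7=twenty--One--Forty_dsaoij_882390=lkuk"

# Hypothetical helper: return the text that follows "<key>=" up to
# (but not including) the next underscore.
extract_after <- function(x, key) {
  pattern <- paste0("(?<=", key, "=)[^_]+")
  regmatches(x, regexpr(pattern, x, perl = TRUE))
}

first <- extract_after(s, "fgjh")    # value after "fgjh="
rest  <- extract_after(s, "fgjh7")   # value after "fgjh7="

# Split the second value on the literal "--" separator.
result <- c(first, strsplit(rest, "--", fixed = TRUE)[[1]])
print(result)
```

Including the "=" in the lookbehind keeps the key "fgjh" from also matching inside "fgjh7=".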
