How to extract words between two hyphens? - r

How to extract all between two hyphens in R
ts = c("az_bna_njh","j_hj_lkiuy","ml_", "_kk")
I need to extract bna, hj, ml, and kk.

We can use
sub("^\\w+_(\\w+)_.*", "\\1", trimws(ts, whitespace = "_"))
#[1] "bna" "hj" "ml" "kk"
Or another option is
sub("^\\w+_(\\w+)_.*", "\\1", gsub("^_|_$", "", ts))
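To make the two steps explicit, here is a minimal sketch that first strips the outer underscores and then captures the middle field. Note that `\\w` also matches `_`, so this variant uses `[^_]` instead; that swap is my own choice, not part of the answer above, and the names `stripped` and `middle` are just for illustration.

```r
ts <- c("az_bna_njh", "j_hj_lkiuy", "ml_", "_kk")

# Step 1: remove a leading or trailing underscore
stripped <- gsub("^_|_$", "", ts)

# Step 2: capture the text between the first two underscores;
# elements with no match (e.g. "ml", "kk") are returned unchanged by sub()
middle <- sub("^[^_]+_([^_]+)_.*", "\\1", stripped)
middle
#[1] "bna" "hj"  "ml"  "kk"
```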

Also you can try:
#Data
ts = c("az_bna_njh","j_hj_lkiuy","ml_", "_kk")
#Code
gsub(".*_(.*)\\_.*", "\\1", trimws(ts,whitespace = '_'))
Output:
[1] "bna" "hj" "ml" "kk"

Another way you can try
library(stringr)
str_replace_all(ts, c("^.*_(\\w+)_.*$" = "\\1", "^_|_$" = ""))
#[1] "bna" "hj" "ml" "kk"

Related

How to extract text between two separators in R?

I have a vector of strings like so:
mystr <- c("./10g/13.9264.csv", "./6g/62.0544.csv")
I only want the part between the two forward slashes, i.e., "10g" and "6g".
You could sub() here with a capture group:
mystr <- c("./10g/13.9264.csv", "./6g/62.0544.csv")
sub(".*/([^/]+)/.*", "\\1", mystr)
[1] "10g" "6g"
Similar to Tim Biegeleisen's answer, but with a lookbehind and a lookahead, using str_extract from stringr:
library(stringr)
mystr <- c("./10g/13.9264.csv", "./6g/62.0544.csv")
str_extract(mystr,"(?<=/)[^/]+(?=/)")
[1] "10g" "6g"
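If you would rather avoid a package, the same lookaround pattern should also work in base R with `perl = TRUE`. This is only a sketch of an equivalent, not part of the answer above; `between_slashes` is an illustrative name.

```r
mystr <- c("./10g/13.9264.csv", "./6g/62.0544.csv")

# regexpr() finds the first match per element; perl = TRUE enables lookarounds.
# regmatches() then pulls out the matched text (elements without a match are dropped).
between_slashes <- regmatches(mystr, regexpr("(?<=/)[^/]+(?=/)", mystr, perl = TRUE))
between_slashes
#[1] "10g" "6g"
```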
More simply you can capitalize on the fact that the desired substring is one or more digits followed by literal g:
library(stringr)
str_extract(mystr, "\\d+g")
[1] "10g" "6g"
Here are a few alternatives. They use no packages and the first two do not use any regular expressions.
basename(dirname(mystr))
## [1] "10g" "6g"
read.table(text = mystr, sep = "/")[[2]]
## [1] "10g" "6g"
trimws(trimws(mystr,, "[^/]"),, "/")
## [1] "10g" "6g"
We could also reformulate these using pipes
mystr |> dirname() |> basename()
## [1] "10g" "6g"
read.table(text = mystr, sep = "/") |> (`[[`)(2)
## [1] "10g" "6g"
mystr |> trimws(, "[^/]") |> trimws(, "/")
## [1] "10g" "6g"
Note
From the question the input is
mystr <- c("./10g/13.9264.csv", "./6g/62.0544.csv")

R - Extract text between symbol or delimiter '/'

I have several vectors, like these ones:
str <- c("AT/FBA/1/12/360/26/SF/96", "AT/RLMW/1/12/360/44/SF/122", "AT/ACR/1/12/362/66/SF/175", "AT/AA/1/12/363/72/SF/281", "AT/BB/1/12/364/90/SF/310", "AT/ANT/1/123/364/92/SF/338")
N.B.: each argument between '/' may vary in length (number of characters).
I want to extract the 5th and 6th arguments delimited by the '/'.
For example, in this case:
"360/26", "360/44", "362/66", "363/72", "364/90", "364/92"
I looked at these answers to similar questions:
Extract text after a symbol in R
Extracting part of string by position in R
I tried to use:
sub("^([^/]+/){4}([^/]+).*", "\\2", str)
but it gives me only the 5th argument, as follows:
[1] "360" "360" "362" "363" "364" "364" "364" "365" "365" "366" "365" "002" "002" "002" "002" "003"
[17] "003" "003" "004" "004" "004" "005"
then I tried
scan(text=str, sep="/", what="", quiet=TRUE)[c(5:6)]
but it gives me just the two arguments without the delimiter '/'.
A simple regex solution would be
sub("^([^/]*/){4}([^/]*/[^/]*)/.*", "\\2", str)
returning the desired
[1] "360/26" "360/44" "362/66" "363/72" "364/90"
[6] "364/92"
Use read.table like this:
with(read.table(text = str, sep = "/"), paste(V5, V6, sep = "/"))
## [1] "360/26" "360/44" "362/66" "363/72" "364/90" "364/92"
Will this work?
apply(sapply(strsplit(str, split = '/'), '[', c(5,6)),2, function(x) paste(x, collapse = '/'))
[1] "360/26" "360/44" "362/66" "363/72" "364/90" "364/92"
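The strsplit idea can be wrapped in a small helper so that the field positions are a parameter. This is a rough sketch; `extract_fields` and `str_vec` are names I made up for illustration, and it assumes every string has at least as many fields as the largest position requested.

```r
str_vec <- c("AT/FBA/1/12/360/26/SF/96", "AT/RLMW/1/12/360/44/SF/122",
             "AT/ACR/1/12/362/66/SF/175", "AT/AA/1/12/363/72/SF/281",
             "AT/BB/1/12/364/90/SF/310", "AT/ANT/1/123/364/92/SF/338")

# Split each string on the separator, pick the requested fields,
# and glue them back together with the same separator.
extract_fields <- function(x, fields, sep = "/") {
  parts <- strsplit(x, sep, fixed = TRUE)
  vapply(parts, function(p) paste(p[fields], collapse = sep), character(1))
}

res <- extract_fields(str_vec, 5:6)
res
#[1] "360/26" "360/44" "362/66" "363/72" "364/90" "364/92"
```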
Here is a tidyverse solution I thought you could also use:
library(dplyr)
library(tidyr)
str %>%
  as_tibble() %>%
  separate(value, into = LETTERS[1:8], sep = "\\/") %>%
  select(5, 6) %>%
  unite("Extract", c("E", "F"), sep = "/")
# A tibble: 6 x 1
  Extract
  <chr>
1 360/26
2 360/44
3 362/66
4 363/72
5 364/90
6 364/92

String splitting with a stop character in R

My data is as follows:
“Louis Hamilton”
“Tiger Wolf”
“Sachin Tendulkar”
“Lebron James”
“Michael Shoemaker”
“Hollywood – Career as an Actor”
I need to extract all the characters until a space or a dash(-) is reached
I need to extract no more than 10 characters
My desired output is
“Louis”
“Tiger”
“Sachin”
“Lebron”
“Michael”
“Hollywood”
I tried using the function below, but it didn’t work
Sportstars<-function(charvec)
{min.length < 10, (x, hyph.pattern = Null)}
Can anyone help, please?
We can use sub
sub("^([^- ]+).*", "\\1", v1)
#[1] "Louis" "Tiger" "Sachin" "Lebron" "Michael" "Hollywood"
Or another version with the length condition as well
grep("^.{1,10}$", sub("\\s+.*", "", v1), value = TRUE)
#[1] "Louis" "Tiger" "Sachin" "Lebron" "Michael" "Hollywood"
Or with word from stringr
library(stringr)
word(v1, 1)
#[1] "Louis" "Tiger" "Sachin" "Lebron" "Michael" "Hollywood"
Also, if we need to implement the last condition as well
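sapply(strsplit(v1, "[– -]"), function(x) {
  x1 <- setdiff(x, "")          # drop empty pieces left by adjacent separators
  x1[1][nchar(x1[1]) <= 10]     # keep the first piece only if it has at most 10 characters
})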
#[1] "Louis" "Tiger" "Sachin" "Lebron" "Michael" "Hollywood"
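If "no more than 10 characters" is meant as a cap (truncate rather than drop), something like the following might do. That reading is my assumption, as are the helper name `first_word` and the made-up long input at the end.

```r
v1 <- c("Louis Hamilton", "Tiger Wolf", "Sachin Tendulkar",
        "Lebron James", "Michael Shoemaker", "Hollywood \u2013 Career as an Actor")

# Cut everything from the first hyphen, space, or en dash onward,
# then keep at most max_chars characters of what is left.
first_word <- function(x, max_chars = 10) {
  substr(sub("[- \u2013].*", "", x), 1, max_chars)
}

out <- first_word(v1)
out
#[1] "Louis"     "Tiger"     "Sachin"    "Lebron"    "Michael"   "Hollywood"

# Hypothetical input whose first word is longer than 10 characters
first_word("Abcdefghijklmnop Xyz")
#[1] "Abcdefghij"
```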
data
v1 <- c( "Louis Hamilton", "Tiger Wolf", "Sachin Tendulkar",
"Lebron James", "Michael Shoemaker", "Hollywood – Career as an Actor")

stringsplit output as new colnames

I would like to create new colnames for my dataframe MirAligner consisting of the part before the first _ in the original colnames. This is what I tried:
unlist(strsplit(as.character(colnames(MirAligner)),'_',fixed=TRUE))
Column names
head(colnames(MirAligner))
[1] "na-008_S52_L003_R1_001.mir.fa.gz" "na-014_S99_L005_R1_001.mir.fa.gz" "na015_S114_L005_R1_001.mir.fa.gz"
[4] "na-015_S50_L003_R1_001.mir.fa.gz" "na-018_S147_L007_R1_001.mir.fa.gz" "na020_S162_L007_R1_001.mir.fa.gz"
Expected output:
na-008 na-014 na015
We can use sub
sub('_.*', '', str1)
#[1] "na-014" "na015" "na-015" "na-018" "na020"
data
str1 <- c("na-014_S99_L005_R1_001.mir.fa.gz",
"na015_S114_L005_R1_001.mir.fa.gz",
"na-015_S50_L003_R1_001.mir.fa.gz",
"na-018_S147_L007_R1_001.mir.fa.gz",
"na020_S162_L007_R1_001.mir.fa.gz")
# try5 is presumably the vector of original column names, e.g. colnames(MirAligner)
gsub("^(.*?)_.*", "\\1", try5)
#[1] "na-008" "na-014" "na015"
Using strsplit within sapply:
#myColNames <- colnames(MirAligner)
myColNames <- c("na-008_S52_L003_R1_001.mir.fa.gz", "na-014_S99_L005_R1_001.mir.fa.gz")
sapply(strsplit(myColNames, "_", fixed = TRUE), "[[", 1)
#output
# [1] "na-008" "na-014"
Or using read.table:
read.table(text = myColNames, sep = "_", stringsAsFactors = FALSE)[, "V1"]
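Since the goal was to rename the columns, here is a short sketch of assigning the result back. `MirAligner_demo` is a tiny stand-in data frame I invented for illustration; the real MirAligner is not shown in the question.

```r
# Stand-in data frame with two of the original column names
MirAligner_demo <- data.frame(a = 1:2, b = 3:4)
colnames(MirAligner_demo) <- c("na-008_S52_L003_R1_001.mir.fa.gz",
                               "na-014_S99_L005_R1_001.mir.fa.gz")

# Drop everything from the first underscore onward and assign back
colnames(MirAligner_demo) <- sub("_.*", "", colnames(MirAligner_demo))
colnames(MirAligner_demo)
#[1] "na-008" "na-014"
```

If two columns ever shared the same prefix, the shortened names would collide; wrapping the result in make.unique() is one way to handle that.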

Extracting values after pattern

A beginner question...
I have a list like this:
x <- c("aa=v12, bb=x21, cc=f35", "xx=r53, bb=g-25, yy=h48", "nn=u75, bb=26, gg=m98")
(but many more lines)
I need to extract what is between "bb=" and ",". I.e. I want:
x21
g-25
26
Having read many similar questions here, I suppose it is stringr with str_extract I should use, but somehow I can't get it to work. Thanks for all help.
/Chris
strapply in the gsubfn package can do that. Note that [^,]* matches a string of non-commas.
strapply extracts the back referenced portion (the part within parentheses):
> library(gsubfn)
> strapply(x, "bb=([^,]*)", simplify = TRUE)
[1] "x21" "g-25" "26"
If there are several x vectors then provide them in a list like this:
> strapply(list(x, x), "bb=([^,]*)")
[[1]]
[1] "x21" "g-25" "26"
[[2]]
[1] "x21" "g-25" "26"
An option using regexpr:
> temp = regexpr('bb=[^,]*', x)
> substr(x, temp + 3, temp + attr(temp, 'match.length') - 1)
[1] "x21" "g-25" "26"
Here's one solution using the base regex functions in R. First we use strsplit to split on the comma. Then we use grepl to filter only the items that start with bb= and gsub to extract all the characters after bb=.
> x <- c("aa=v12, bb=x21, cc=f35", "xx=r53, bb=g-25, yy=h48", "nn=u75, bb=26, gg=m98")
> y <- unlist(strsplit(x , ","))
> unlist(lapply(y[grepl("bb=", y)], function(x) gsub("^.*bb=(.*)", "\\1", x)))
[1] "x21" "g-25" "26"
It looks like str_replace is the function you are after if you want to go that route:
> str_replace(y[grepl("bb=",y)], "^.*bb=(.*)", "\\1")
[1] "x21" "g-25" "26"
Read it in with commas as separators and take the second column:
> x.split <- read.table(textConnection(x), header=FALSE, sep=",", stringsAsFactors=FALSE)[[2]]
[1] " bb=x21" " bb=g-25" " bb=26"
Then remove the "bb="
> gsub("bb=", "", x.split )
[1] " x21" " g-25" " 26"
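A base R one-liner with sub() and a capture group may also be enough here. One caveat: sub() returns the input unchanged when there is no match, so this sketch guards with grepl() and returns NA instead. The fourth element of `x` is a made-up line without "bb=", added only to show that case.

```r
x <- c("aa=v12, bb=x21, cc=f35", "xx=r53, bb=g-25, yy=h48", "nn=u75, bb=26, gg=m98",
       "zz=1, yy=2")  # last element is hypothetical: it has no "bb="

# Capture the run of non-comma characters right after "bb=";
# return NA for elements that do not contain "bb=" at all.
has_bb <- grepl("bb=", x, fixed = TRUE)
vals <- ifelse(has_bb, sub(".*bb=([^,]*).*", "\\1", x), NA_character_)
vals
#[1] "x21"  "g-25" "26"   NA
```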
