string split and interchange the position of string in R - r

I have a vector called myvec. I would like to split it at _ and interchange the position. What would be the simplest way to do this?
myvec <- c("08AD09144_NACC022453", "08AD8245_NACC657970")
Result I want:
NACC022453_08AD09144, NACC657970_08AD8245

You can do this with regex capturing data in two groups and interchanging them using back reference.
myvec <- c("A1_B1", "B2_C1", "D1_A2")
sub('(\\w+)_(\\w+)', '\\2_\\1', myvec)
#[1] "B1_A1" "C1_B2" "A2_D1"

We can use strsplit from base R
sapply(strsplit(myvec, "_"), function(x) paste(x[2], x[1], sep = "_"))
#[1] "NACC022453_08AD09144" "NACC657970_08AD8245"

Related

append letter to a string in r

I have a vector:
c("BAAAVAST", "BAACEZ", "BAAGECBA", "LOL")
And I would like to remove "BAA" from the words that contain it. And to those words I would like to append ".PR".
Desired outcome:
c("AVAST.PR", "CEZ.PR", "GECBA.PR", "LOL")
Any ideas? Ideally using stringr. Thank you a lot.
You could use the following solution:
gsub("BAA(.*)", "\\1\\.PR", vec)
[1] "AVAST.PR" "CEZ.PR" "GECBA.PR" "LOL"
You could use
library(stringr)
# optimized thanks to Anoushiravan
str_replace(c("BAAAVAST", "BAACEZ", "BAAGECBA", "LOL"), "BAA(\\w*)", "\\1.PR")
#> [1] "AVAST.PR" "CEZ.PR" "GECBA.PR" "LOL"
use \\w* if you want to match word characters only or .* if there are no limitations to the characters.
This is verbose than the other answers. It finds strings with 'BAA' and appends 'PR.' to it.
inds <- grepl('BAA', vec, fixed = TRUE)
vec[inds] <- paste(sub('BAA', '', vec[inds]), 'PR', sep = '.')
vec
#[1] "AVAST.PR" "CEZ.PR" "GECBA.PR" "LOL"

Use Regular expressions extract specific characters

text <- c('d__Viruses|f__Closteroviridae|g__Closterovirus|s__Citrus_tristeza_virus',
'd__Viruses|o__Tymovirales|f__Alphaflexiviridae|g__Mandarivirus|s__Citrus_yellow_vein_clearing_virus',
'd__Viruses|o__Ortervirales|f__Retroviridae|s__Columba_palumbus_retrovirus')
I have tried but failed:
str_extract(text, pattern = 'f.*\\|')
How can I get
f__Closteroviridae
f__Alphaflexiviridae
f__Retroviridae
Any help will be high appreciated!
Make the regex non-greedy and since you don't want "|" in final output use positive lookahead.
stringr::str_extract(text, 'f.*?(?=\\|)')
#[1] "f__Closteroviridae" "f__Alphaflexiviridae" "f__Retroviridae"
In base R, we can use sub :
sub('.*(f_.*?)\\|.*', '\\1', text)
#[1] "f__Closteroviridae" "f__Alphaflexiviridae" "f__Retroviridae"
For a base R solution, I would use regmatches along with gregexpr:
m <- gregexpr("\\bf__[^|]+", text)
as.character(regmatches(text, m))
[1] "f__Closteroviridae" "f__Alphaflexiviridae" "f__Retroviridae"
The advantage of using gregexpr as above is that should an input contain more than one f__ matching term, we could also capture it. For example:
x <- 'd__Viruses|f__Closteroviridae|g__Closterovirus|f__some_virus'
m <- gregexpr("\\bf__[^|]+", x)
regmatches(x, m)[[1]]
[1] "f__Closteroviridae" "f__some_virus"
Data:
text <- c('d__Viruses|f__Closteroviridae|g__Closterovirus|s__Citrus_tristeza_virus',
'd__Viruses|o__Tymovirales|f__Alphaflexiviridae|g__Mandarivirus|s__Citrus_yellow_vein_clearing_virus',
'd__Viruses|o__Ortervirales|f__Retroviridae|s__Columba_palumbus_retrovirus')

Substitute strings by their first match in a dictionary

I have a vector long_strings defined as
long_strings <- c("*/1/1/1/1", "*/1/2/1/1", "*/2/1",
"*/2/2/1", "*/3/1/1/1")
and I have a dictionary of short short_strings containing the initial patterns (with differing lengths) of those strings, for example
short_strings <- c("*/1/1", "*/3", "*/2", "*/1/2")
How can I "simplify" the contents of long_strings to match their corresponding value on short_strings?
The results should look like
"*/1/1", "*/1/2", "*/2", "*/2", "*/3"
I can find where are the occurrences of a single element of short_strings using grep("\\*/2", long_strings), but I want to avoid looping over the short_strings.
An option with sapply
as.character(with(stack(sapply(setNames(paste0("\\", short_strings), short_strings),
grep, x = long_strings)), ind[order(values)]))
#[1] "*/1/1" "*/1/2" "*/2" "*/2" "*/3"
Or using str_extract
library(stringr)
str_extract(long_strings, str_c(str_c("\\", short_strings), collapse="|"))
#[1] "*/1/1" "*/1/2" "*/2" "*/2" "*/3"
We can programmatically create a capture group and use it in sub to extract it
sub(paste0(".*(",paste0("\\", short_strings, collapse = "|"), ").*"), "\\1",long_strings)
#[1] "*/1/1" "*/1/2" "*/2" "*/2" "*/3"

String between first two (.dots)

Hi have data which contains two or more dots. My requirement is to get string from first to second dot.
E.g string <- "abcd.vdgd.dhdsg"
Result expected =vdgd
I have used
pt <-strapply(string, "\\.(.*)\\.", simplify = TRUE)
which is giving correct data but for string having more than two dots its not working as expected.
e.g string <- "abcd.vdgd.dhdsg.jsgs"
its giving dhdsg.jsgs but expected is vdgd
Could anyone help me.
Thanks & Regards,
In base R we can use strsplit
ss <- "abcd.vdgd.dhdsg"
unlist(strsplit(ss, "\\."))[2]
#[1] "vdgd"
Or using gregexpr with regmatches
unlist(regmatches(ss, gregexpr("[^\\.]+", ss)))[2]
#[1] "vdgd"
Or using gsub (thanks #TCZhang)
gsub("^.+?\\.(.+?)\\..*$", "\\1", ss)
#[1] "vdgd"
Another option:
string <- "abcd.vdgd.dhdsg.jsgs"
library(stringr)
str_extract(string = string, pattern = "(?<=\\.).*?(?=\\.)")
[1] "vdgd"
I like this one because the str_extract function will return the first instance of the correct pattern, but you could also use str_extract_all to get all instances.
str_extract_all(string = string, pattern = "(?<=\\.).*?(?=\\.)")
[[1]]
[1] "vdgd" "dhdsg"
From here, you could index to get any position between two dots you want.
Another solution with the qdapRegex package:
library(qdapRegex)
ex_between("abcd.vdgd.dhdsg.jsgs", ".", ".")[[1]][1]
# "vdgd"
You can use read.table as well if you wish.Here providing the string as given in your problem and selecting the separator as dot("."), Once the column is converted into a data.frame, you may choose to select whatever column you want to pick(In this case it is column number 2).
read.table(text=string, sep=".",stringsAsFactors = FALSE)[,2]
Output:
> read.table(text=string, sep=".",stringsAsFactors = FALSE)[,2]
[1] "vdgd"
Here is a fun easy way via stringr
stringr::word(string, 2, sep = '\\.')
Here are two options that are vectorized over the input string vector:
You can try tstrsplit from data.table, which is vectorized over string:
> string <- c("abcd.vdgd.dhdsg", "abcd.vdgd.dhdsg.jsgs")
> tstrsplit(string, '.', fixed = TRUE)[[2]]
[1] "vdgd" "vdgd"
or regex:
> sub('.*?\\.(.*?)\\..*', '\\1', string)
[1] "vdgd" "vdgd"`

Extract only number between commas

I have a returned string like this from my code: (<C1>, 4.297, %)
And I am trying to extract only the value 4.297 from this string using gsub command:
Fat<-gsub("\\D", "", stringV)
However, this extracts not only 4.297 but also the number '1' in C1.
Is there a way to extract only 4.297 from this string, please can you help.
Thanks
How about this?
# Your sample character string
ss <- "(<C1>, 4.297, %)";
gsub(".+,\\s*(\\d+\\.\\d+),.+", "\\1", ss)
#[1] "4.297"
or
gsub(".+,\\s*([0-9\\.]+),.+", "\\1", ss)
Convert to numeric with as.numeric if necessary.
Another option is str_extract to match one or more numeric elements with . and is preceded by a word boundary and succeeded by word boundary(\\b)
library(stringr)
as.numeric(str_extract(stringV, "\\b[0-9.]+\\b"))
#[1] 4.297
If there are multiple numbers, use str_extract_all
data
stringV <- "(<C1>, 4.297, %)"
An alternative is to treat your vector as a comma-separated-variable, and use read.csv
df <- read.csv(text = stringV, colClasses = c("character", "numeric", "character"), header = F)
V1 V2 V3
1 (<C1> 4.297 %)
This method is relying on the 'numeric' being in the 'second' position in the vector.
you can use as.numeric convert no number string to NA.
ss <- as.numeric(unlist(strsplit(stringV, ',')))
ss[!is.na(ss)]
#[1] 4.297

Resources