R: truncate strings to a word

I'm new to R, and trying to use it to truncate words in the headers of a spreadsheet to a word. For example:
Bacteria(100);Firmicutes(100);Clostridia(100);Clostridiales(100);Lachnospiraceae(100);unclassified(100);
Bacteria(100);Tenericutes(100);Mollicutes(100);Mollicutes_RF9(100);unclassified(100);unclassified(100);
So I would like to shorten each taxon to a single word without the numbers, like Clostridia and Mollicutes. I think it can be done, but I can't figure out how.
Thanks.

We can use sub
sub("\\(.*", "", "Firmicutes(100)")
Suppose we read the data into R using read.csv/read.table with check.names=FALSE; we can then apply the same code to the column names
colnames(data) <- sub("\\(.*", "", colnames(data))
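As a minimal sketch of that step, assuming each header holds a single taxon such as Firmicutes(100). The data frame and its values below are made up for illustration and are not from the question:

```r
# Hypothetical data frame standing in for the imported spreadsheet;
# check.names = FALSE keeps parentheses in column names from being mangled
data <- data.frame(a = 1:2, b = 3:4, check.names = FALSE)
colnames(data) <- c("Firmicutes(100)", "Mollicutes(100)")

# Drop everything from the first "(" onwards in every column name
colnames(data) <- sub("\\(.*", "", colnames(data))
colnames(data)
# [1] "Firmicutes" "Mollicutes"
```

Because sub removes everything after the first "(", a header that holds a whole lineage would be reduced to its first taxon only.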
If it is a single string
library(stringr)
str1 <- "Bacteria(100);Firmicutes(100);Clostridia(100);Clostridiales(100);Lachnospiraceae(100);unclassified(100)"
str_extract_all(str1, "[^()0-9;]+")[[1]]
#[1] "Bacteria" "Firmicutes" "Clostridia" "Clostridiales" "Lachnospiraceae"
#[6] "unclassified"
Update
Suppose we need to extract the third word, i.e. "Clostridia"
sub("^([^(]+[(][^;]+;){2}(\\w+).*", "\\2", str1)
#[1] "Clostridia"

Using only base commands, the names can be extracted with this code:
nam <- c("Bacteria(100);Tenericutes(100);Mollicutes(100);Mollicutes_RF9(100);unclassified(100);unclassified(100);")
nam <- strsplit(nam, ";")[[1]]
nam <- sub("\\(.*", "", nam)  # sub is vectorized, so no sapply is needed
nam
[1] "Bacteria" "Tenericutes" "Mollicutes" "Mollicutes_RF9" "unclassified" "unclassified"

Is this what you need? Or did I completely misunderstand?
gsub('\\(.*\\)', '', unlist(strsplit(x, ';'))[3])
#[1] "Clostridia"
where x is your column name
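If the same position is needed for several headers at once, the split-then-strip idea can be wrapped in a small function. This is only a sketch; nth_taxon is a hypothetical helper name, and it assumes the taxa are always separated by ";":

```r
# Hypothetical helper: return the n-th taxon name from each lineage string
nth_taxon <- function(x, n) {
  parts <- strsplit(x, ";", fixed = TRUE)                 # one character vector per string
  taxa <- vapply(parts, function(p) p[n], character(1))   # NA if fewer than n taxa
  sub("\\(.*", "", taxa)                                  # strip "(100)" and what follows
}

x <- c("Bacteria(100);Firmicutes(100);Clostridia(100);Clostridiales(100);",
       "Bacteria(100);Tenericutes(100);Mollicutes(100);Mollicutes_RF9(100);")
nth_taxon(x, 3)
# [1] "Clostridia" "Mollicutes"
```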

Related

append letter to a string in r

I have a vector:
c("BAAAVAST", "BAACEZ", "BAAGECBA", "LOL")
And I would like to remove "BAA" from the words that contain it. And to those words I would like to append ".PR".
Desired outcome:
c("AVAST.PR", "CEZ.PR", "GECBA.PR", "LOL")
Any ideas? Ideally using stringr. Thank you a lot.
You could use the following solution:
vec <- c("BAAAVAST", "BAACEZ", "BAAGECBA", "LOL")
gsub("BAA(.*)", "\\1.PR", vec)
[1] "AVAST.PR" "CEZ.PR" "GECBA.PR" "LOL"
You could use
library(stringr)
# optimized thanks to Anoushiravan
str_replace(c("BAAAVAST", "BAACEZ", "BAAGECBA", "LOL"), "BAA(\\w*)", "\\1.PR")
#> [1] "AVAST.PR" "CEZ.PR" "GECBA.PR" "LOL"
use \\w* if you want to match word characters only or .* if there are no limitations to the characters.
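To see where the two patterns differ, here is a small sketch with a made-up input that contains a hyphen (not part of the original vector). It uses base sub rather than str_replace only to avoid a package dependency; the pattern and replacement are the same:

```r
# Hypothetical input with a non-word character (a hyphen) after the prefix
x <- c("BAAAB-CD", "LOL")

# \\w* stops at the hyphen, so ".PR" lands in the middle of the string
sub("BAA(\\w*)", "\\1.PR", x)
# [1] "AB.PR-CD" "LOL"

# .* runs to the end of the string, so ".PR" is appended at the very end
sub("BAA(.*)", "\\1.PR", x)
# [1] "AB-CD.PR" "LOL"
```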
This is more verbose than the other answers. It finds the strings containing 'BAA', removes 'BAA' from them and appends '.PR'.
inds <- grepl('BAA', vec, fixed = TRUE)
vec[inds] <- paste(sub('BAA', '', vec[inds]), 'PR', sep = '.')
vec
#[1] "AVAST.PR" "CEZ.PR" "GECBA.PR" "LOL"

string split and interchange the position of string in R

I have a vector called myvec. I would like to split it at _ and interchange the position. What would be the simplest way to do this?
myvec <- c("08AD09144_NACC022453", "08AD8245_NACC657970")
Result I want:
NACC022453_08AD09144, NACC657970_08AD8245
You can do this with regex capturing data in two groups and interchanging them using back reference.
myvec <- c("A1_B1", "B2_C1", "D1_A2")
sub('(\\w+)_(\\w+)', '\\2_\\1', myvec)
#[1] "B1_A1" "C1_B2" "A2_D1"
We can use strsplit from base R
sapply(strsplit(myvec, "_"), function(x) paste(x[2], x[1], sep = "_"))
#[1] "NACC022453_08AD09144" "NACC657970_08AD8245"

R. Remove everything between two delimiter characters [duplicate]

I have a data frame with this kind of expression in column C:
GT_rs9628326:N_rs9628326
GT_rs1111:N_rs1111
GT_rs8374:N_rs8374
Using R, I want to remove everything between the first "T" and ":", as well as everything after the "N". I know this can be done with gsub. I would get:
GT:N
GT:N
GT:N
Maybe you can try
gsub("_\\w+","",s)
giving
[1] "GT:N" "GT:N" "GT:N"
Data
s <- c("GT_rs9628326:N_rs9628326","GT_rs1111:N_rs1111","GT_rs8374:N_rs8374")
Another option is to split the strings at :, strip the unneeded text from each piece, and then paste the pieces back together with the same separator (I have used @ThomasIsCoding's data, thanks):
#Data
v1 <- c("GT_rs9628326:N_rs9628326","GT_rs1111:N_rs1111","GT_rs8374:N_rs8374")
#Code
unlist(lapply(lapply(strsplit(v1, split = ':'),
                     function(x) sub("_[^_]+$", "", x)),
              function(x) paste0(x, collapse = ':')))
Output:
[1] "GT:N" "GT:N" "GT:N"
Using str_remove_all from stringr
library(stringr)
str_remove_all(s, "_\\w+")
#[1] "GT:N" "GT:N" "GT:N"
data
s <- c("GT_rs9628326:N_rs9628326","GT_rs1111:N_rs1111","GT_rs8374:N_rs8374")
Remove the word characters that follow either "T" or "N", using @ThomasIsCoding's data.
gsub('(?<=T|N)\\w+', '', s, perl = TRUE)
#[1] "GT:N" "GT:N" "GT:N"

Extracting specific strings patterns from one column

I would like to extract specific strings with the pattern gene=something from one column in R.
An example of input:
df <- 'V1
ID=gene92;DbX;gene=BH1;genePro
ID=gene91;DbY;gene=BH2;genePro;inf2
ID=gene90;DbY;gene=BH3;genePro;inf2'
df <- read.table(text=df, header=T)
The example of the expected output:
dfout <- 'V1
gene=BH1
gene=BH2
gene=BH3'
dfout <- read.table(text=dfout, header=T)
Any ideas on how to accomplish that?
library(stringr)
str_extract(df$V1, 'gene=BH[0-9]+')
#[1] "gene=BH1" "gene=BH2" "gene=BH3"
You may also use
gsub(".*(gene=.*?)(;|$).*", "\\1", df$V1)
# [1] "gene=BH1" "gene=BH2" "gene=BH3"
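Here the leading .* consumes everything before gene=, the lazy group (gene=.*?) captures the shortest run that is followed by ; or the end of the string (;|$), and the trailing .* consumes the rest, so only the captured part gene=... is kept.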
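A base R alternative that extracts the match instead of deleting everything around it is sketched below. It assumes the value after gene= never contains a ";"; the vector v is a shortened copy of the example column:

```r
# Shortened copy of the example column as a plain character vector
v <- c("ID=gene92;DbX;gene=BH1;genePro",
       "ID=gene91;DbY;gene=BH2;genePro;inf2")

# Match "gene=" followed by everything up to the next ";" (or the end of the string)
regmatches(v, regexpr("gene=[^;]+", v))
# [1] "gene=BH1" "gene=BH2"
```

Note that regmatches silently drops elements with no match, so the result can be shorter than the input if some rows lack a gene= field.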

Consecutive string matching in a sentence using R

I have the names of 7 countries stored like this:
Random <- c('norway', 'india', 'china', 'korea', 'france','japan','iran')
Now I have to find out, using R, whether a given sentence contains these words.
Sometimes the name of a country is hidden in consecutive letters within a sentence.
For example:
You all must pay it bac**k, or ea**ch of you will be in trouble.
If this sentence is passed it should return "korea"
I have tried:
grep('You|all|must|pay|it|back|or|each|of|you|will|be|in|trouble',Random, value = TRUE,ignore.case=TRUE,
fixed = FALSE)
it should return korea
but it's not working. Perhaps I should not use partial matching, but I don't have much knowledge of it.
Any help is appreciated.
You can use the handy stringr library for this. First, lower-case the sentence we want to search and remove all punctuation and spaces from it.
> library(stringr)
> txt <- "You all must pay it back, or each of you will be in trouble."
> g <- gsub("[^a-z]", "", tolower(txt))
> g
# [1] "youallmustpayitbackoreachofyouwillbeintrouble"
Then we can use str_detect to find the matches.
> Random[str_detect(g, Random)]
# [1] "korea"
Basically you're just looking for a sub-string within a sentence, so collapsing the sentence first seems like a good way to go. Alternatively, you could use str_locate with str_sub to find the relevant sub-strings.
> no <- na.omit(str_locate(g, Random))
> str_sub(g, no[,1], no[,2])
# [1] "korea"
Edit Here's one more I came up with
> Random[Vectorize(grepl)(Random, g)]
# [1] "korea"
Using base functions only:
Random <- c('norway', 'india', 'china', 'korea', 'france','japan','iran')
Random2=paste(Random,collapse="|") #creating pattern for match
text="bac**k, or ea**ch of you will be in trouble."
text2=gsub("[[:punct:][:space:]]","",text,perl=T) #removing punctuations and space characters
regmatches(text2,gregexpr(Random2,text2))
[[1]]
[1] "korea"
You could use stringi, which is often faster for these operations
library(stringi)
Random[stri_detect_regex(gsub("[^A-Za-z]", "", txt), Random)]
#[1] "korea"
#data
Random <- c('norway', 'india', 'china', 'korea', 'france','japan','iran')
txt <- "You all must pay it back, or each of you will be in trouble."
Try:
Random <- c('norway', 'india', 'china', 'korea', 'france','japan','iran')
txt <- "You all must pay it back, or each of you will be in trouble."
tt <- gsub("[[:punct:]]|\\s+", "", txt)
unlist(sapply(Random, function(r) grep(r, tt)))
korea
1
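The approaches above share the same two steps: squeeze the sentence down to its letters, then test each country name as a plain substring. A minimal base R sketch of that idea follows; find_hidden is a hypothetical helper name, and it assumes the words to look for are already lowercase:

```r
# Hypothetical helper: which of `words` occur in `sentence` once non-letters are removed
find_hidden <- function(sentence, words) {
  squeezed <- gsub("[^a-z]", "", tolower(sentence))   # keep lowercase letters only
  # fixed = TRUE treats each word as a literal substring, not a regular expression
  hits <- vapply(words, grepl, logical(1), x = squeezed, fixed = TRUE)
  words[hits]
}

Random <- c('norway', 'india', 'china', 'korea', 'france', 'japan', 'iran')
txt <- "You all must pay it back, or each of you will be in trouble."
find_hidden(txt, Random)
# [1] "korea"
```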
