Extracting specific string patterns from one column - r

I would like to extract specific strings with the pattern gene=something from one column in R.
An example of input:
df <- 'V1
ID=gene92;DbX;gene=BH1;genePro
ID=gene91;DbY;gene=BH2;genePro;inf2
ID=gene90;DbY;gene=BH3;genePro;inf2'
df <- read.table(text=df, header=T)
The example of the expected output:
dfout <- 'V1
gene=BH1
gene=BH2
gene=BH3'
dfout <- read.table(text=dfout, header=T)
Any ideas on how to accomplish that?

library(stringr)
str_extract(df$V1, 'gene=BH[0-9]+')
#[1] "gene=BH1" "gene=BH2" "gene=BH3"

You may also use
gsub(".*(gene=.*?)(;|$).*", "\\1", df$V1)
# [1] "gene=BH1" "gene=BH2" "gene=BH3"
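This matches the whole string but captures only the gene=... part: the leading .* consumes everything before it, the lazy .*? stops at the first ; or at the end of the string (;|$), and the replacement \\1 keeps just the captured group.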
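If the gene names are not always BH followed by digits, a more general pattern may help. The following is a minimal base-R sketch, assuming every gene=... value ends at the next semicolon or at the end of the string; the vector x is just the example column copied in as plain strings.

```r
x <- c("ID=gene92;DbX;gene=BH1;genePro",
       "ID=gene91;DbY;gene=BH2;genePro;inf2",
       "ID=gene90;DbY;gene=BH3;genePro;inf2")

# "gene=" followed by one or more characters that are not a semicolon
genes <- regmatches(x, regexpr("gene=[^;]+", x))
print(genes)
```

Note that regmatches() silently drops elements with no match, so the result can be shorter than the input if some rows lack a gene= field.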

Related

append letter to a string in r

I have a vector:
c("BAAAVAST", "BAACEZ", "BAAGECBA", "LOL")
And I would like to remove "BAA" from the words that contain it. And to those words I would like to append ".PR".
Desired outcome:
c("AVAST.PR", "CEZ.PR", "GECBA.PR", "LOL")
Any ideas? Ideally using stringr. Thank you a lot.
You could use the following solution:
vec <- c("BAAAVAST", "BAACEZ", "BAAGECBA", "LOL")
gsub("BAA(.*)", "\\1.PR", vec)
[1] "AVAST.PR" "CEZ.PR" "GECBA.PR" "LOL"
You could use
library(stringr)
# optimized thanks to Anoushiravan
str_replace(c("BAAAVAST", "BAACEZ", "BAAGECBA", "LOL"), "BAA(\\w*)", "\\1.PR")
#> [1] "AVAST.PR" "CEZ.PR" "GECBA.PR" "LOL"
Use \\w* if you want to match word characters only, or .* if there are no restrictions on the characters.
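The difference between the two patterns only shows up when the string contains non-word characters. A small base-R sketch, using a hypothetical input "BAAX-1" that is not part of the question:

```r
s <- "BAAX-1"  # hypothetical input with a non-word character (the hyphen)

# \\w* stops at the hyphen, so ".PR" is inserted before it
word_only <- sub("BAA(\\w*)", "\\1.PR", s)

# .* runs to the end of the string, so ".PR" is appended at the very end
anything <- sub("BAA(.*)", "\\1.PR", s)

print(c(word_only, anything))
```

Which behavior is preferable depends on whether ".PR" should follow the first word or the whole string.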
This is more verbose than the other answers. It finds the strings containing 'BAA', removes it, and appends '.PR' to them.
inds <- grepl('BAA', vec, fixed = TRUE)
vec[inds] <- paste(sub('BAA', '', vec[inds]), 'PR', sep = '.')
vec
#[1] "AVAST.PR" "CEZ.PR" "GECBA.PR" "LOL"

subset the string matches in the middle of the column from dataframe in R

I need to extract the uniprot/swiss-prot: ID from a column of a data frame in R. The column also contains other IDs.
Below is an example:
biogrid:107054|entrez gene/locuslink:BAK1|uniprot/swiss-prot:Q16611|refseq:NP_001179
I need the below output:
Q16611
You can use -
x <- 'biogrid:107054|entrez gene/locuslink:BAK1|uniprot/swiss-prot:Q16611|refseq:NP_001179'
sub('.*swiss-prot:(\\w+)\\|.*', '\\1', x)
#[1] "Q16611"
This extracts the word between swiss-prot: and the following | in the text.
To apply this to a data frame column you can do -
df$result <- sub('.*swiss-prot:(\\w+)\\|.*', '\\1', df$col)
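One caveat: the pattern above requires a | after the ID, and sub() returns its input unchanged when nothing matches. A hedged sketch of a variant without the trailing pipe, assuming the ID consists of word characters only; the second input string is made up for illustration:

```r
x <- 'biogrid:107054|entrez gene/locuslink:BAK1|uniprot/swiss-prot:Q16611|refseq:NP_001179'
last_field <- 'biogrid:1|uniprot/swiss-prot:P12345'  # hypothetical: ID is the last field

strict  <- '.*swiss-prot:(\\w+)\\|.*'  # pattern from the answer above
relaxed <- '.*swiss-prot:(\\w+).*'     # no trailing pipe required

res_mid     <- sub(relaxed, '\\1', x)
res_last    <- sub(relaxed, '\\1', last_field)
res_nomatch <- sub(strict, '\\1', last_field)  # no match: input comes back as-is

print(c(res_mid, res_last, res_nomatch))
```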
Using str_extract
library(stringr)
str_extract(x, "(?<=prot:)\\w+")
[1] "Q16611"
data
x <- 'biogrid:107054|entrez gene/locuslink:BAK1|uniprot/swiss-prot:Q16611|refseq:NP_001179'

R combining the characters of the first two rows of a data frame

I have a dataframe:
dnames <- data.frame(x1= c("a","b"),x2= c("c","d"),x3= c("e", "f"))
dnames
I would like to combine the characters of each of the first two rows of the data frame
dnames1 <- c("ab","cd","ef")
dnames1
I tried:
dnames1 <- paste(dnames[1,],dnames[2,],sep="")
dnames1
But this did not give the correct result.
Thank you for your help.
For column wise paste, use sapply
sapply(dnames, paste, collapse="")
Or using the OP's method, unlist and paste
paste(unlist(dnames[1,]),unlist(dnames[2,]),sep="")
In tidyverse
library(dplyr)
library(stringr)
dnames %>%
summarise_all(str_c, collapse='')
To keep your code style, you can try the following code
d <- t(dnames)
dnames1 <- paste0(d[,1],d[,2])
such that
> dnames1
[1] "ab" "cd" "ef"
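The attempt in the question tends to fail when the columns are factors, which was the default for data.frame() before R 4.0.0. A minimal sketch that converts each column to character first, so it should behave the same either way; stringsAsFactors = TRUE is set here only to reproduce the older default:

```r
dnames <- data.frame(x1 = c("a", "b"), x2 = c("c", "d"), x3 = c("e", "f"),
                     stringsAsFactors = TRUE)  # mimic the pre-4.0.0 default

# For each column, take the first two rows as character and glue them together
combined <- vapply(dnames,
                   function(col) paste(as.character(col[1:2]), collapse = ""),
                   character(1))
print(combined)
```

The result is a character vector named after the columns; wrap it in unname() if the names are not wanted.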

Find first matching substring in a long string in R

I'm trying to find the first matching string from a vector in a long string. For example, I have example_string <- 'LionabcdBear1231DogextKittyisananimalTurtleisslow' and matching_vector <- c('Turtle', 'Dog'). I want it to return 'Dog', as this is the first substring from matching_vector that appears in the example string: LionabcdBear1231DogextKittyisananimalTurtleisslow
I already tried pmatch(example_string, matching_vector), but it doesn't work, obviously because it doesn't handle substrings...
Thanks!
Tim
Is the following solution working for you?
example_string <- 'LionabcdBear1231DogextKittyisananimalTurtleisslow'
matching_vector<- c('Turtle','Dog')
match_ids <- sapply(matching_vector, function(x) regexpr(x, example_string))
match_ids[match_ids < 0] <- NA  # regexpr returns -1 when there is no match
result <- names(match_ids)[which.min(match_ids)]  # which.min ignores NA
> result
[1] "Dog"
We can use stri_match_first from stringi
library(stringi)
stri_match_first(example_string, regex = paste(matching_vector, collapse="|"))
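The same alternation idea can be sketched in base R. This assumes the candidate strings contain no regex metacharacters; if they might, they would need escaping first.

```r
example_string <- 'LionabcdBear1231DogextKittyisananimalTurtleisslow'
matching_vector <- c('Turtle', 'Dog')

# Build "Turtle|Dog"; regexpr reports the leftmost match in the string
pattern <- paste(matching_vector, collapse = "|")
first_hit <- regmatches(example_string, regexpr(pattern, example_string))
print(first_hit)
```

If none of the candidates occurs, regmatches() returns character(0) rather than NA.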

Remove part of a string

How do I remove part of a string? For example in ATGAS_1121 I want to remove everything before _.
Use regular expressions. In this case, you can use gsub:
gsub("^.*?_","_","ATGAS_1121")
[1] "_1121"
This regular expression matches the beginning of the string (^), any character (.) repeated zero or more times (*), and an underscore (_). The ? makes the match "lazy" so that it only matches as far as the first underscore. That match is replaced with just an underscore. See ?regex for more details and references.
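The lazy quantifier only matters when the string has more than one underscore. A short sketch with a hypothetical input "A_B_C" to show the difference:

```r
s <- "A_B_C"  # hypothetical input with two underscores

lazy   <- sub("^.*?_", "_", s)  # stops at the first underscore
greedy <- sub("^.*_", "_", s)   # runs up to the last underscore

print(c(lazy, greedy))
```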
You can use a built-in for this, strsplit:
> s = "TGAS_1121"
> s1 = unlist(strsplit(s, split='_', fixed=TRUE))[2]
> s1
[1] "1121"
strsplit returns the pieces of the string, split on the split parameter, as a list. That's probably not what you want, so wrap the call in unlist, then index the resulting vector so that only the second of the two elements is returned.
Finally, the fixed parameter should be set to TRUE to indicate that the split parameter is not a regular expression, but a literal matching character.
If you're a Tidyverse kind of person, here's the stringr solution:
R> library(stringr)
R> strings = c("TGAS_1121", "MGAS_1432", "ATGAS_1121")
R> strings %>% str_replace(".*_", "_")
[1] "_1121" "_1432" "_1121"
# Or:
R> strings %>% str_replace("^[A-Z]*", "")
[1] "_1121" "_1432" "_1121"
Here's the strsplit solution if s is a vector:
> s <- c("TGAS_1121", "MGAS_1432")
> s1 <- sapply(strsplit(s, split='_', fixed=TRUE), function(x) (x[2]))
> s1
[1] "1121" "1432"
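Taking the second piece assumes exactly one underscore per string. The sketch below, with made-up inputs, compares it against a sub() call that removes everything up to and including the first underscore:

```r
s <- c("TGAS_1121", "A_B_C", "NOUNDERSCORE")  # last two are hypothetical edge cases

# Second piece only: loses "_C" and gives NA when there is no underscore
second_piece <- sapply(strsplit(s, split = "_", fixed = TRUE), `[`, 2)

# Drop the leading run of non-underscores plus the first underscore;
# strings without an underscore are left untouched
after_first <- sub("^[^_]*_", "", s)

print(second_piece)
print(after_first)
```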
Probably the most intuitive solution is the stringr function str_remove, which is even easier than str_replace because it needs only a pattern and no replacement argument.
The only tricky part in your example is that you want to keep the underscore, but it's possible: use a lookahead (?=pattern) so that the match stops just before the specified pattern.
See example:
strings = c("TGAS_1121", "MGAS_1432", "ATGAS_1121")
stringr::str_remove(strings, ".+?(?=_)")
[1] "_1121" "_1432" "_1121"
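The same lookahead can be sketched without stringr. Base R's default regex engine does not support lookaheads, so this assumes perl = TRUE is acceptable:

```r
strings <- c("TGAS_1121", "MGAS_1432", "ATGAS_1121")

# Remove everything before the first underscore, keeping the underscore itself
kept <- sub(".+?(?=_)", "", strings, perl = TRUE)
print(kept)
```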
Here is the strsplit solution for a data frame, using the dplyr package:
library(dplyr)
col1 = c("TGAS_1121", "MGAS_1432", "ATGAS_1121")
col2 = c("T", "M", "A")
df = data.frame(col1, col2)
df
        col1 col2
1  TGAS_1121    T
2  MGAS_1432    M
3 ATGAS_1121    A
df<-mutate(df,col1=as.character(col1))
df2<-mutate(df,col1=sapply(strsplit(df$col1, split='_', fixed=TRUE),function(x) (x[2])))
df2
  col1 col2
1 1121    T
2 1432    M
3 1121    A
