Keep only unique elements in a string in R

In genomics research, you often have many strings that contain duplicate gene names. I would like to find an efficient way to keep only the unique gene names in a string. The example below works, but is it possible to do this in one step, i.e., without having to split the entire string and then paste the unique elements back together?
genes <- c("GSTP1;GSTP1;APC")
a <- unlist(strsplit(genes, ";"))
paste(unique(a), collapse=";")
[1] "GSTP1;APC"

An alternative is doing
unique(unlist(strsplit(genes, ";")))
#[1] "GSTP1" "APC"
Then this should give you the answer
paste(unique(unlist(strsplit(genes, ";"))), collapse = ";")
#[1] "GSTP1;APC"

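Based on the example shown, perhaps a regular expression with a backreference will do (note that it only removes duplicates that are directly adjacent):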
gsub("(\\w+);\\1", "\\1", genes)
#[1] "GSTP1;APC"
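To make the difference between the two approaches concrete, here is a minimal sketch that wraps the split/unique/paste idiom in a helper and applies it to a whole vector. The function name dedupe_genes and the second test string are made up for illustration; they are not from the original posts.

```r
# Hypothetical helper: remove duplicate names within each ";"-separated string.
dedupe_genes <- function(x, sep = ";") {
  vapply(
    strsplit(x, sep, fixed = TRUE),
    function(parts) paste(unique(parts), collapse = sep),
    character(1)
  )
}

genes <- c("GSTP1;GSTP1;APC", "APC;GSTP1;APC")

dedupe_genes(genes)
#[1] "GSTP1;APC" "APC;GSTP1"

# The backreference regex only collapses duplicates that sit next to each other,
# so the second string keeps its repeated "APC".
gsub("(\\w+);\\1", "\\1", genes)
#[1] "GSTP1;APC"     "APC;GSTP1;APC"
```

If duplicates are not guaranteed to be adjacent, the split-based version is the safer choice.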

Related

Change complicated strings in R with gsub or stringr

I have a column of a data frame that has thousands of complicated sample names like this
sample <- c("16_3_S16_R1_001", "16_3_S16_R2_001", "2_3_S2_R1_001", "2_3_S2_R2_001")
I am trying, with no success, to change the sample names into the following
16.3R1, 16.3R2, 2.3R1, 2.3R2
I am thinking of solving the problem with gsub or stringr.
Any suggestions? I have tried gsub but could not get the desired names.
You can use sub to extract the parts :
sample <- c("16_3_S16_R1_001","16_3_S16_R2_001","2_3_S2_R1_001","2_3_S2_R2_001")
sub('(\\d+)_(\\d+)_.*(R\\d+).*', '\\1.\\2\\3', sample)
#[1] "16.3R1" "16.3R2" "2.3R1" "2.3R2"
\\d+ matches one or more digits. The parts of the pattern enclosed in () are called capture groups. Here we capture one or more digits (group 1), followed by an underscore and one or more digits (group 2), and finally "R" followed by one or more digits (group 3). The captured values are referred to in the replacement with backreferences: \\1 is the first group, \\2 the second, and so on.
If you split each string in sample on "_", you need only the 1st, 2nd and 4th parts:
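To see what each capture group actually holds, one option is to extract the groups with regexec and regmatches instead of substituting. This is a small sketch; the variable names sample_names, pattern and groups are chosen here for illustration.

```r
sample_names <- c("16_3_S16_R1_001", "2_3_S2_R2_001")
pattern <- "(\\d+)_(\\d+)_.*(R\\d+).*"

# regexec() records the position of the full match and of every capture group;
# regmatches() then returns them as one character vector per input string.
groups <- regmatches(sample_names, regexec(pattern, sample_names))

groups[[1]]
#[1] "16_3_S16_R1_001" "16"              "3"               "R1"

# Rebuild the short name from groups 1 to 3 (element 1 is the full match).
vapply(groups, function(g) paste0(g[2], ".", g[3], g[4]), character(1))
#[1] "16.3R1" "2.3R2"
```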
sample <- c("16_3_S16_R1_001",
"16_3_S16_R2_001",
"2_3_S2_R1_001",
"2_3_S2_R2_001")
x <- strsplit(sample, "_")
sapply(x, function(y) paste0(y[1], ".", y[2], y[4]))
Here is one way you could do it.
It helps to put the names in a data frame with a named column; below, the column is called "cats".
x <- c("16_3_S16_R1_001", "16_3_S16_R2_001", "2_3_S2_R1_001","2_3_S2_R2_001")
df <- data.frame("cats" = x)
The column needs to be a factor so that its levels can be renamed:
df$cats <- as.factor(df$cats)
levels(df$cats)[levels(df$cats)=="16_3_S16_R1_001"] <- "16.3R1"
levels(df$cats)[levels(df$cats)=="16_3_S16_R2_001"] <- "16.3R2"
levels(df$cats)[levels(df$cats)=="2_3_S2_R1_001"] <- "2.3R1"
levels(df$cats)[levels(df$cats)=="2_3_S2_R2_001"] <- "2.3R2"
And voilà
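Renaming each level by hand does not scale to thousands of names. As a sketch, the same factor-based idea can be combined with the sub() pattern from the earlier answer, so that all levels are renamed in one assignment. The data frame and column name follow the example above.

```r
x <- c("16_3_S16_R1_001", "16_3_S16_R2_001", "2_3_S2_R1_001", "2_3_S2_R2_001")
df <- data.frame(cats = factor(x))

# Rewrite every level at once instead of one assignment per level.
levels(df$cats) <- sub("(\\d+)_(\\d+)_.*(R\\d+).*", "\\1.\\2\\3", levels(df$cats))

as.character(df$cats)
#[1] "16.3R1" "16.3R2" "2.3R1"  "2.3R2"
```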

Issue with %in% in R

I am trying to get all sentences from a dataframe containing specific words into a new dataframe. I don't really know how to do this, but the first step I tried was to check if a word is in the column.
> "quality" %in% df$text[2]
[1] FALSE
> df$text[2]
[1] "Audio quality is definitely good"
Why is the output false?
Also, do you have any suggestions on how to create my new dataframe? I'd like, as an example, to have a dataframe with all sentences containing c("word1","word2").
Thank you very much in advance.
%in% tests whether a whole element matches exactly; it does not look for substrings. To match part of a string, use grepl
grepl("quality", df$text[2])
To check whether any element of the column contains "quality", wrap the call in any
any(grepl("quality", df$text))
For multiple elements, paste them together with collapse = "|"
v1 <- c("word1","word2")
any(grepl(paste(v1, collapse="|"), df$text))
According to ?%in%
%in% is currently defined as
"%in%" <- function(x, table) match(x, table, nomatch = 0) > 0
where match matches the string based on an exact match.
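For the second part of the question (building a new data frame), a minimal sketch is to use the logical vector from grepl as a row index. The example sentences and the search words below are invented for illustration; only the first sentence comes from the question.

```r
df <- data.frame(
  text = c("Audio quality is definitely good",
           "Battery died after a week",
           "Great price for what you get"),
  stringsAsFactors = FALSE
)

words <- c("quality", "price")            # hypothetical search words
pattern <- paste(words, collapse = "|")   # "quality|price"

# Keep only the rows whose text contains at least one of the words.
matches <- df[grepl(pattern, df$text), , drop = FALSE]
matches$text
#[1] "Audio quality is definitely good" "Great price for what you get"

# %in% compares whole elements, so it only succeeds on the split-up words.
"quality" %in% df$text[1]
#[1] FALSE
"quality" %in% strsplit(df$text[1], " ")[[1]]
#[1] TRUE
```

Note that a plain pattern such as "price" also matches inside longer words (for example "priceless"); wrapping each word in \\b word boundaries is one way to tighten that.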

Find match of exactly the same string in R column that has both numeric and character items

I have a column that contains strings made up of letters and digits. I'd like to find only those rows in which every item has a particular prefix. In this case, I only need rows whose items all start with SE.
df :
names
SE123, FE43, SA67
SE167, SE24, SE56, SE34
SE23
FE36, KE90, LS87
DG20, SE34, LP47
SE57, SE39
Result df
names
SE167, SE24, SE56, SE34
SE23
SE57, SE39
My code
df[grep("^SE", as.character(df$names)),]
But this selects every row that has SE. Would somebody please help in achieving the result df? Thanks.
Looking at your expected output it looks like you want to select those rows where every element starts with "SE" where each element is a word between two commas.
Using base R, one method would be to split the strings on "," and select rows where every element startsWith "SE"
df[sapply(strsplit(df$names, ","), function(x)
all(startsWith(trimws(x), "SE"))), , drop = FALSE]
# names
#2 SE167, SE24, SE56, SE34
#3 SE23
#6 SE57, SE39
If you want to detect "SE" anywhere in each element, irrespective of position, grepl may be a better choice.
df[sapply(strsplit(df$names, ","), function(x)
all(grepl("SE", trimws(x)))), , drop = FALSE]
Make sure you have names as character column before doing strsplit or run
df$names <- as.character(df$names)
names[!grepl("[A-Z]",gsub("SE","",names))]
[1] "SE167, SE24, SE56, SE34" "SE23" "SE57, SE39"
Here names is the character vector df$names. You can remove "SE" from all strings and then look for any remaining uppercase letter. Strings whose elements all contain SE have no other letters left and are thus kept by the filter.
(This also works if you have 25SE)
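For a reproducible check, the sketch below rebuilds the example data and compares the split-based filter with a single anchored regular expression. The regex is an alternative added here for illustration, not one of the posted answers, and it assumes elements are separated by a comma and optional spaces.

```r
df <- data.frame(
  names = c("SE123, FE43, SA67", "SE167, SE24, SE56, SE34", "SE23",
            "FE36, KE90, LS87", "DG20, SE34, LP47", "SE57, SE39"),
  stringsAsFactors = FALSE
)

# Split-based filter: every comma-separated element must start with "SE".
keep_split <- vapply(
  strsplit(df$names, ","),
  function(x) all(startsWith(trimws(x), "SE")),
  logical(1)
)

# Single-regex alternative: the whole string is "SE..." repeated, comma-separated.
keep_regex <- grepl("^SE[^,]*(, *SE[^,]*)*$", df$names)

identical(keep_split, keep_regex)
#[1] TRUE

df[keep_regex, , drop = FALSE]
```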

Split a character column from a dataframe based on specific token

I have a dataframe df and the first column looks like this:
[1] "760–563" "01455–1" "4672–04" "11–31234" "22–12" "11111–53" "111–21" "17–356239" "14–22352" "531–353"
I want to split that column on -.
What I'm doing is
strsplit(df[,1], "-")
The problem is that it's not working. It returns a list without splitting the elements. I already tried adding the parameter fixed = TRUE and passing a regular expression as the split parameter, but nothing worked.
What is weird is that if I replicate the column on my own, for example:
myVector <- c("760–563", "01455–1", "4672–04", "11–31234", "22–12", "11111–53", "111–21", "17–356239", "14–22352", "531–353")
and then apply the strsplit, it works.
I already checked my column type and class with
class(df[,1]) and typeof(df[,1]), and both return character, so that's fine.
I was also using the dataframe with dplyr, so it was of the type tbl_df. I converted it back to a dataframe, but that didn't work either.
I also tried apply(df, 2, function(x) strsplit(x, "-", fixed = T)), but that didn't work either.
Any clues?
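I don't know how it happened, but you have two different types of dashes: the data contains an en dash, while you are splitting on a plain hyphen. (The byte 96 below is presumably from a Windows-1252 session; in UTF-8 the en dash is the three bytes e2 80 93.)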
charToRaw(substr("760–563", 4, 4))
#[1] 96
charToRaw("-")
#[1] 2d
So the strsplit() is working just fine, it's just that the dash isn't there in your original data. Adjust this, and away you go:
strsplit("760–563", "–")
#[[1]]
#[1] "760" "563"
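A locale-independent way to spot the difference is to look at code points rather than bytes, and then split on either character. This is a sketch; the vector x mixes both kinds of dash on purpose, and the en dash is written as the escape "\u2013" so the code does not depend on the file's encoding.

```r
en_dash <- "\u2013"
x <- c(paste0("760", en_dash, "563"), paste0("22", en_dash, "12"), "111-21")

# Code points show that the two characters really are different.
utf8ToInt(en_dash)
#[1] 8211
utf8ToInt("-")
#[1] 45

# Split on either a hyphen or an en dash (the hyphen goes first inside the
# brackets so it is taken literally).
parts <- strsplit(x, paste0("[-", en_dash, "]"))
parts[[1]]
#[1] "760" "563"
```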
Alternatively, you can split on any non-numeric character, which works for either kind of dash:
library(dplyr)
library(tidyr)
data %>%
separate(your_column,
c("first_number", "second_number"),
sep = "[^0-9]")

Unlist (flatten lists) row by row in a data frame using R

I have a list of tweets in a data.frame and I can extract lists of hashtags from them using
rpg.twitter.df$hashtags <- regmatches(rpg.twitter.df$text, gregexpr("#(\\d|\\w)+", rpg.twitter.df$text))
It ends up with one list for each row. Now, I want to flatten each list in comma-separated strings (one for each row)
I tried this:
do.call("paste", c(rpg.twitter.df$hashtags, sep=", "))
but it doesn't work, as it ends up with one huge vector. The same happens if I enclose regmatches in unlist(..., recursive=FALSE).
Any idea on how to solve it?
Some data for a reproducible example:
rpg.twitter.df <- data.frame(text=rbind("World of Warcrack: http://t.co/3MNRpArnGw #wow #WorldOfWarcraft #warcraft #mmorpg #rpg #RPGChat #gaming #pcgaming #online #WoW_en #NewsWoW", "#ashleythedragon join my journey in Tweeria http://t.co/CFKDLA3ASE #rpg", "How to use of #RPG for motivation #timeboxing http://t.co/mwwN5xErHx"))
You can do:
sapply(rpg.twitter.df$hashtags, paste, collapse = ",")
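You can also use toString, applied to each list element (calling toString on the whole list would collapse everything into a single string):
sapply(rpg.twitter.df$hashtags, toString)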
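Putting the pieces together, here is a self-contained sketch of extracting the hashtags and flattening them row by row. The tweets are shortened, made-up stand-ins for the data in the question, and vapply is used instead of sapply so the result is guaranteed to be a character vector even when a tweet has no hashtags.

```r
tweets <- data.frame(
  text = c("World of Warcrack #wow #rpg",
           "no tags in this one",
           "How to use #RPG for motivation #timeboxing"),
  stringsAsFactors = FALSE
)

# One character vector of hashtags per tweet (possibly empty).
tags <- regmatches(tweets$text, gregexpr("#\\w+", tweets$text))

# Collapse each vector into a single comma-separated string.
tweets$hashtags <- vapply(tags, paste, character(1), collapse = ", ")

tweets$hashtags
#[1] "#wow, #rpg"         ""                   "#RPG, #timeboxing"
```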

Resources