Split string at a vertical bar character "|" - r

I feel like this question is asked a lot, but none of the solutions I found work for me.
I have a dataframe with a column (called ID) that holds a string of numbers and letters (e.g. Q8A203). In a few rows there are two of those strings separated by a vertical bar (e.g. Q8AA66|Q8AAT5). For my analysis it doesn't matter which one I keep, so I wanted to make a new column named NewColumn that holds only the first one, by splitting the string at |.
I know that the vertical bar must be treated differently and that I have to put \\ in front. I tried strsplit() and unlist():
df$NewColumn <- strsplit(df$ID,split='\\|',fixed=TRUE)
df$NewColumn <- unlist(strsplit(df$ID, " \\| ", fixed=TRUE))
Both options return the exact same content from column ID to the NewColumn.
I would very much appreciate the help.

Rather than splitting you can simply substitute the second part with nothing and it will keep the first ID.
df <- data.frame(ID = c("Q8A203", "Q8AA66|Q8AAT5"))
df$NewColumn <- sub("\\|.*$", "", df$ID)
df
# ID NewColumn
# 1 Q8A203 Q8A203
# 2 Q8AA66|Q8AAT5 Q8AA66
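Next time, please add a minimal reproducible example (your df here) to speed up answers ;)
strsplit can also work, but not with that combination of arguments: with fixed=TRUE the pattern '\\|' is matched literally as a backslash followed by a bar, which never occurs in your IDs. Either drop fixed=TRUE and keep the regex '\\|', or keep fixed=TRUE and use a plain '|'. Either way strsplit returns a list, so you still have to pick the first element of each entry, which is more work.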
# Working with a list
unlist(lapply(strsplit(df$ID, split='\\|'), "[[", 1))
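To see why both attempts in the question leave the ID unchanged, here is a small sketch comparing the three combinations of pattern and fixed. The vector name ids and the result names are only for illustration.

```r
ids <- c("Q8A203", "Q8AA66|Q8AAT5")

# The attempt from the question: with fixed = TRUE the pattern is the
# literal two-character string "\|", which never occurs, so nothing is split.
not_split <- strsplit(ids, "\\|", fixed = TRUE)

# Option 1: treat the pattern as a regex and escape the bar.
first_regex <- vapply(strsplit(ids, "\\|"), `[`, character(1), 1)

# Option 2: treat the pattern as a literal string, no escaping needed.
first_fixed <- vapply(strsplit(ids, "|", fixed = TRUE), `[`, character(1), 1)
```

Using vapply with character(1) instead of unlist(lapply(...)) is a design choice here: it fails loudly if any row does not yield exactly one string, so the result can be assigned to a column safely.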

Related

Extracting information between special characters in a column in R

I'm sorry because I feel like versions of this question have been asked many times, but I simply cannot find code from other examples that works in this case. I have a column where all the information I want is stored between two sets of "%%", and I want to extract the text between those two delimiters and put it into a new column, in this case called df$empty.
This is a long column, but in all cases I just want the information between the two "%%" delimiters. Is there a way to code this out across the whole column?
To be specific, I want in this example a new column that will look like "information", "wanted".
empty <- c('NA', 'NA')
information <- c('notimportant%%information%%morenotimportant', 'ignorethis%%wanted%%notthiseither')
df <- data.frame(information, empty)
In this case you can do:
df$empty <- sapply(strsplit(df$information, '%%'), '[', 2)
# information empty
# 1 notimportant%%information%%morenotimportant information
# 2 ignorethis%%wanted%%notthiseither wanted
That is, split the text by '%%' and take second elements of the resulting vectors.
Or you can get the same result using sub():
df$empty <- sub('.*%%(.+)%%.*', '\\1', df$information)
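The two approaches behave differently on rows that do not contain the delimiters, which may matter on a long column. A minimal sketch, assuming a made-up second row without any "%%" (the names by_split and by_sub are illustrative):

```r
information <- c("notimportant%%information%%morenotimportant",
                 "no delimiters here")  # hypothetical row without "%%"

# Split on the literal "%%" and take the second piece; a row without
# delimiters has only one piece, so the result is NA.
by_split <- sapply(strsplit(information, "%%", fixed = TRUE), `[`, 2)

# sub() returns its input unchanged when the pattern does not match.
by_sub <- sub(".*%%(.+)%%.*", "\\1", information)
```

So the strsplit version flags malformed rows with NA, while the sub() version silently copies them through.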

Best way to extract a single letter from each row and create a new column in R?

Below is an excerpt of the data I'm working with. I am having trouble finding a way to extract the last letter from the sbp.id column and use it to add a new column called "sex" to the data frame below. I initially tried grepl to separate the rows ending in F from the ones ending in M, but couldn't figure out how to use that to create a new column with just M or F, depending on which one is the last letter of each row in the sbp.id column.
sbp.id newID
125F 125
13000M 13000
13120M 13120
13260M 13260
13480M 13480
Another way, if you know you need the last character of the string in every row, regardless of what the other characters are or whether the elements all have different lengths:
df$sex <- substr(df$sbp.id, nchar(df$sbp.id), nchar(df$sbp.id))
This works because substr() and nchar() are both vectorized.
Using regex you can extract the last part from sbp.id
df$sex <- sub('.*([A-Z])$', '\\1', df$sbp.id)
#Also
#df$sex <- sub('.*([MF])$', '\\1', df$sbp.id)
Or another way would be to remove the digits (sub() replaces only the first run of digits, which is enough for IDs of this shape).
df$sex <- sub('\\d+', '', df$sbp.id)
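For reference, here is a small self-contained sketch of the substr() approach on a few rows from the question. It assumes every ID is digits followed by exactly one letter; the helper variable n is just for illustration.

```r
df <- data.frame(sbp.id = c("125F", "13000M", "13120M"),
                 stringsAsFactors = FALSE)

n <- nchar(df$sbp.id)                      # length of each ID
df$sex <- substr(df$sbp.id, n, n)          # last character only
df$newID <- as.integer(substr(df$sbp.id, 1, n - 1))  # everything before it
```

Because the start and stop positions are computed per row, this also works when the IDs have different lengths.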

Find whether a row in data table contains at least one word from the list

I am quite new to R and data tables, so probably my question will sound obvious, but I searched through questions here for similar issues and couldn't find a solution anyway.
So, initially, I have a data table, and one of its columns contains fields that hold many values (in fact, these values are all separate words) joined together by &&&&. I also have a list of words (list). This list is big and has 38 000 different words, but for the purpose of the example let's say that it is small.
list <- c('word1', 'word2', 'word3')
What I need is to filter the data table so that I only have rows that contain at least one word from the list of words.
I split the data on &&&& and created a list
fields_with_words <- strsplit(data_final$fields_with_words, "&&&&")
But I don't know which function I should use to check whether a row from my data table has at least one word from the list. Can you give me some clues?
Try :
data_final[sapply(strsplit(data_final$fields_with_words,"&&&&"), function(x)
any(x %in% word_list)), ]
I have used word_list instead of list here since list is a built-in function in R.
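A minimal runnable sketch of the same idea, assuming a small made-up data_final (the id column and the row contents are hypothetical):

```r
# Hypothetical data: three rows, words joined by "&&&&"
data_final <- data.frame(
  id = 1:3,
  fields_with_words = c("word1&&&&other", "nothing&&&&here", "x&&&&word3"),
  stringsAsFactors = FALSE
)
word_list <- c("word1", "word2", "word3")

# Split each field into its words, then test each row for any exact match.
tokens <- strsplit(data_final$fields_with_words, "&&&&", fixed = TRUE)
keep <- vapply(tokens, function(x) any(x %in% word_list), logical(1))

filtered <- data_final[keep, ]
```

Here fixed = TRUE is used so that the separator is treated as a literal string rather than a regex; for "&&&&" both give the same result.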
Assuming you want to scan x variable in df with the list of words lw <- c("word1","word2","word3") (character vector of words), you can use
df[grepl(paste0("(",paste(lw, collapse = "|"), ")"), x)]
if you want regular-expression matching. Note that this also matches when one of your words appears inside a longer string, such as a sentence. However, with 38k words, I don't know whether this solution scales.
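One caveat with the alternation pattern is that it also matches inside longer words. A sketch of the difference, assuming the words contain only letters and digits (otherwise they would need escaping first); the example strings and the names loose and strict are made up:

```r
lw <- c("word1", "word2", "word3")
x <- c("a sentence with word2 inside", "sword10 only", "nothing")

# Plain alternation: matches "word1" inside "sword10" as well.
loose <- grepl(paste0("(", paste(lw, collapse = "|"), ")"), x)

# Same alternation wrapped in word boundaries: whole words only.
strict <- grepl(paste0("\\b(", paste(lw, collapse = "|"), ")\\b"), x, perl = TRUE)
```

Which of the two is appropriate depends on whether partial matches should count.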
If your x column contains only words and you want exact matching, the problem is simpler. You can do:
df[x %chin% lw]
%chin% is data.table's fast version of %in% for character vectors (%in% also works, but it is slower). Note that the test must not be wrapped in any(), which would collapse it to a single TRUE/FALSE for the whole table instead of one value per row. You may get better performance by turning lw into a data.table and using merge:
merge(df, data.table(x = lw), by = "x")

R: How to use setdiff on two string vectors by only comparing the first 3 tab delimited items in each string?

I am trying to figure out a way in R to take the difference of two string vectors, but only based on the first 3 columns that are tab delimited in each string. For Example this is list1 and list2
list1:
"1\t1113200\t1118399\t1\t1101465\t1120176\tENSRNOG00000040300\tRaet1l\t0\n"
"1\t1180200\t1187599\t1\t1177682\t1221416\tENSRNOG00000061316\tAABR07000121.1\t0\n"
"1\t1180200\t1187599\t1\t1177632\t1221416\tENSRNOG00000061316\tAABR07000121.1\t0\n"
list2:
"1\t1113200\t1118399\t1\t1101465\t1120176\tENSRNOG00000040300\tRaet1l\t0\n"
"1\t1180200\t1187599\t1\t1177682\t1221416\tENSRNOG00000061316\tAABR07000121.1\t0\n"
I want to do setdiff(list2,list1), so that I just get everything in list2 that is non-existent in list1; however, I want to do it based on just the first 3 tab-delimited strings. So in list1 I would only consider:
"1\t1113200\t1118399"
from the first entry. However, I still want the full string returned. I only want to compare using the first 3 columns. I am having trouble figuring out how to do this; any help would be appreciated. I've already looked at several SO posts, but none of them seemed to help.
For extracting the first three columns (not sure why you need this as a long string rather than a dataframe...), I would use beg2char() from the qdap library. (Although, if they are all the same length base substr() will work fine.)
beg2char(list1, '\t', 3) # Will extract from the beginning up to the third tab delimiter
Then rather than setdiff I would simply use %in% to check if the substring of the element in list2 matches any of the elements in list1.
beg2char(list2, '\t', 3) %in% beg2char(list1, '\t', 3) # will give you TRUE/FALSE
list2[!(beg2char(list2, '\t', 3) %in% beg2char(list1, '\t', 3))]
This returns the full elements of list2 whose first three fields do not occur in list1.
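If you would rather not depend on qdap, the same comparison can be sketched in base R. The helper key3 is a made-up name, and the third element of list2 is a hypothetical entry added so that the result is not empty:

```r
list1 <- c(
  "1\t1113200\t1118399\t1\t1101465\t1120176\tENSRNOG00000040300\tRaet1l\t0\n",
  "1\t1180200\t1187599\t1\t1177682\t1221416\tENSRNOG00000061316\tAABR07000121.1\t0\n",
  "1\t1180200\t1187599\t1\t1177632\t1221416\tENSRNOG00000061316\tAABR07000121.1\t0\n"
)
list2 <- c(
  list1[1:2],
  "2\t500\t900\t2\t450\t950\tHYPOTHETICAL_ID\tHypoGene\t0\n"  # hypothetical entry
)

# Build a comparison key from the first three tab-delimited fields.
key3 <- function(x) {
  vapply(strsplit(x, "\t", fixed = TRUE),
         function(parts) paste(parts[1:3], collapse = "\t"),
         character(1))
}

# Keep the full strings of list2 whose key does not appear in list1.
only_in_list2 <- list2[!(key3(list2) %in% key3(list1))]
```

Unlike substr(), this does not require the first three fields to have the same width in every string.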

Select or extract a specific part of an element of a data frame

I want to get only a part of an element that is part of a data frame.
My dataframe has 1 column with 6000 rows looking like this:
chr5_122424840_122523745_NM_001136239_mRNA
chr17_38632079_38657854_NM_032865_mRNA
I want to obtain a new data frame with only
NM_001136239
NM_032865
I've tried with split and then paste, but it is not working because it eliminates the zeros when pasting (NM_1136239 instead of NM_001136239).
I've also tried with stri_sub and substr, but the length before NM is not the same in each row. I also tried gsub, but I don't know how to do it.
Thank you very much for your help, I hope I've been specific enough.
This should work
Data
df <- data.frame(col=c("chr5_122424840_122523745_NM_001136239_mRNA",
"chr17_38632079_38657854_NM_032865_mRNA"))
Code
df$col <- sub(".*(NM.*)_mRNA", "\\1", df$col)
This works as long as every string contains NM and ends in _mRNA.
There are many ways to do this. Here is one that uses the stringr package, which I recommend simply because the code is easier to understand.
library(stringr)
patron <- "NM_[0-9]+" # NM_ followed by one or more digits
str_extract(your_data_frame$your_column, patron) # Assign to a column or not
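For comparison, a base R sketch of the same extraction with regexpr() and regmatches(). The third row is a hypothetical non-matching entry, added to show one difference: regmatches() drops rows without a match, so the result is filled into an NA vector to keep it aligned with the input.

```r
col <- c("chr5_122424840_122523745_NM_001136239_mRNA",
         "chr17_38632079_38657854_NM_032865_mRNA",
         "chr1_100_200_no_accession")  # hypothetical row without an NM_ id

m <- regexpr("NM_[0-9]+", col)     # position of the first match, -1 if none
found <- m != -1

ids <- rep(NA_character_, length(col))
ids[found] <- regmatches(col, m)   # regmatches() returns matching rows only
```

The leading zeros are preserved because the ID is never converted to a number.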
