Extracting information between special characters in a column in R - r

I'm sorry because I feel like versions of this question have been asked many times, but I simply cannot find code from other examples that works in this case. I have a column where all the information I want is stored between two sets of "%%", and I want to extract the text between those two delimiters and put it into a new column, in this case called df$empty.
This is a long column, but in all cases I just want the information between the two sets of "%%". Is there a way to code this out across the whole column?
To be specific, I want in this example a new column that will look like "information", "wanted".
empty <- c('NA', 'NA')
information <- c('notimportant%%information%%morenotimportant', 'ignorethis%%wanted%%notthiseither')
df <- data.frame(information, empty)

In this case you can do:
df$empty <- sapply(strsplit(df$information, '%%'), '[', 2)
# information empty
# 1 notimportant%%information%%morenotimportant information
# 2 ignorethis%%wanted%%notthiseither wanted
That is, split the text by '%%' and take the second element of each resulting vector.
Or you can get the same result using sub():
df$empty <- sub('.*%%(.+)%%.*', '\\1', df$information)
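The two approaches differ on rows that contain no '%%' at all. The sketch below uses a small made-up vector (x is a hypothetical name, not from the question) to illustrate this: the strsplit() version yields NA for such a row, while sub() returns the input unchanged because the pattern does not match.

```r
# Toy input (hypothetical): the second element has no '%%' delimiter.
x <- c('notimportant%%information%%morenotimportant', 'nodelimiterhere')

# Split on the literal '%%' and take the second piece of each result.
by_split <- sapply(strsplit(x, '%%', fixed = TRUE), '[', 2)

# Capture the text between the two '%%' markers.
by_sub <- sub('.*%%(.+)%%.*', '\\1', x)

print(by_split)
print(by_sub)
```

If unmatched rows should end up as NA, the strsplit() form is probably the safer choice; with sub() the untouched original string is easy to mistake for a real result.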

Related

Split string at a vertical bar character "|"

I feel like this question is asked a lot but all the solutions I found don't work for me either.
I have a dataframe with a column (called ID) in which I have a string of numbers and letters (e.g: Q8A203). In a few rows there are two of those constructs separated by a vertical bar (e.g: Q8AA66|Q8AAT5). For my analysis it doesn't matter which one I keep so I wanted to make a new column named NewColumn in which I transfer the first and split the string at |.
I know that the vertical bar must be treated differently and that I have to put \\ in front. I tried strsplit() and unlist():
df$NewColumn <- strsplit(df$ID,split='\\|',fixed=TRUE)
df$NewColumn <- unlist(strsplit(df$ID, " \\| ", fixed=TRUE))
Both options return the exact same content from column ID to the NewColumn.
I would very much appreciate the help.
Rather than splitting you can simply substitute the second part with nothing and it will keep the first ID.
df <- data.frame(ID = c("Q8A203", "Q8AA66|Q8AAT5"))
df$NewColumn <- sub("\\|.*$", "", df$ID)
df
# ID NewColumn
# 1 Q8A203 Q8A203
# 2 Q8AA66|Q8AAT5 Q8AA66
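Next time, please add a minimal reproducible example (your df here) to speed up answers ;)
strsplit can also work if you remove fixed=TRUE. With fixed=TRUE the pattern '\\|' is matched literally as a backslash followed by a bar, which never occurs in the IDs, so nothing is split. Without it, '\\|' is a regular expression that matches a literal bar. strsplit returns a list, so you then need to pull the first element out of each entry, which is slightly more work.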
# Working with a list
unlist(lapply(strsplit(df$ID, split='\\|'), "[[", 1))
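As a minimal sketch of why the original attempts returned the ID column unchanged, the snippet below compares a literal split on "|" with the escaped pattern under fixed = TRUE. The vector name ids and the result names are made up for illustration.

```r
ids <- c("Q8A203", "Q8AA66|Q8AAT5")

# fixed = TRUE with a plain "|" splits on the literal bar.
parts <- strsplit(ids, "|", fixed = TRUE)
first_id <- vapply(parts, `[`, character(1), 1)

# fixed = TRUE with "\\|" looks for backslash + bar, which never occurs,
# so each string comes back unsplit.
unsplit <- unlist(strsplit(ids, "\\|", fixed = TRUE))

print(first_id)
print(unsplit)
```

So either drop the escape and keep fixed = TRUE, or keep the escape and drop fixed = TRUE. Combining both is what prevents the split.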

Assigning Unnamed Columns To Another DataFrame

I'm in a very basic class that introduces R for genetic purposes. I'm encountering a rather peculiar problem in trying to follow the instructions given. Here is what I have along with the instructor's notes:
MangrovesRaw<-read.csv("C:/Users/esteb/Documents/PopGen/MangrovesSites.csv")
#i'm going to make a new dataframe now, with one column more than the mangrovesraw dataframe but the same number of rows.
View(MangrovesRaw)
Mangroves<-data.frame(matrix(nrow = 528, ncol = 23))
#next I want you to name the first column of Mangroves "pop"
colnames(Mangroves)<-c(col1="pop")
#i'm now assigning all values of that column to be 1
Mangroves$pop<-1
#assign the rest of the columns (2 to 23) to the entirety of the MangrovesRaw dataframe
#then change the names to match the mangroves raw names
colnames(Mangroves)[2:23]<-colnames(MangrovesRaw)
I'm not really sure how to assign columns that haven't been named using the $ as we have in the past. A friend suggested I first run
colnames(Mangroves)[2:23]<-colnames(MangrovesRaw)
Mangroves$X338<-MangrovesRaw
#X338 is the name of the first column from MangrovesRaw
But while this does transfer the data from MangrovesRaw, it comes at the cost of having my column names messed up with X338. added to every subsequent column. In an attempt to modify this I found the following "fix"
colnames(Mangroves)[2:23]<-colnames(MangrovesRaw)
Mangroves$X338<-MangrovesRaw[,2]
#Mangroves$X338<-MangrovesRaw[,2:22]
#MangrovesRaw has 22 columns in total
While this transferred all the data I needed for the X338 column, it didn't transfer any data for the remaining 21 columns. The commented-out line just results in the same problem of having X338. show up in all my column names.
What am I doing wrong?
There are a few ways to solve this problem. It may be that your instructor wants it done a certain way, but here's one simple solution: just cbind() the pop column with the real data. Then the data and column names are already in place. Naming the argument keeps the first column called pop; left unnamed, it would be labelled with the expression Mangroves$pop.
Mangroves <- cbind(pop = Mangroves$pop, MangrovesRaw)
Here's another way:
Mangroves[, 2:23] <- MangrovesRaw
colnames(Mangroves)[2:23] <- colnames(MangrovesRaw)
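The cbind() idea can be tried on a tiny stand-in before running it on the real file. In this sketch raw and combined are hypothetical names, and the toy values are invented. It only illustrates that a named scalar is recycled down the rows and that the original column names survive.

```r
# Stand-in for MangrovesRaw (hypothetical toy data, 3 rows x 2 columns).
raw <- data.frame(X338 = c(10, 11, 12), X340 = c(20, 21, 22))

# Naming the first argument gives the new column the name "pop";
# the single value 1 is recycled to every row.
combined <- cbind(pop = 1, raw)

print(names(combined))
print(dim(combined))
```

This avoids pre-allocating an empty 528 x 23 data frame, although the instructor's exercise may intend the pre-allocation step on purpose.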

Find whether a row in data table contains at least one word from the list

I am quite new to R and data tables, so probably my question will sound obvious, but I searched through questions here for similar issues and couldn't find a solution anyway.
So, initially, I have a data table, and one of its columns contains fields that hold many values (in fact these values are all separate words) joined together by &&&&. I also have a list of words (list). This list is big and has 38 000 different words. But for the purpose of example let's say that it is small.
list <- c('word1', 'word2', 'word3')
What I need is to filter the data table so that I only have rows that contain at least one word from the list of words.
I unjoined the data by &&&& and created a list
fields_with_words <-strsplit(data_final$fields_with_words,"&&&&")
But I don't know which function should I use to check whether the row from my data table has at least one word from the list. Can you give me some clues?
Try :
data_final[sapply(strsplit(data_final$fields_with_words,"&&&&"), function(x)
any(x %in% word_list)), ]
I have used word_list instead of list here since list is a built-in function in R.
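A small self-contained run of this approach, with made-up rows, could look like the sketch below. The column name fields_with_words and the name word_list follow the post; the id column and the toy values are assumptions added for illustration.

```r
# Toy data (hypothetical values); column name follows the question.
data_final <- data.frame(
  id = 1:3,
  fields_with_words = c("word1&&&&other", "nothing&&&&here", "x&&&&word3"),
  stringsAsFactors = FALSE
)
word_list <- c("word1", "word2", "word3")

# Split each field once, then flag rows where any piece is in word_list.
pieces <- strsplit(data_final$fields_with_words, "&&&&", fixed = TRUE)
keep <- vapply(pieces, function(p) any(p %in% word_list), logical(1))

filtered <- data_final[keep, ]
print(filtered)
```

Using fixed = TRUE treats "&&&&" as a literal string, and vapply() with logical(1) guarantees a plain logical vector for the row filter.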
Assuming you want to scan x variable in df with the list of words lw <- c("word1","word2","word3") (character vector of words), you can use
df[grepl(paste0("(",paste(lw, collapse = "|"), ")"), x)]
if you want regular-expression matching. Note that this also matches when one of your words appears inside a longer string. With 38k words, I don't know whether this solution scales.
If your x column contains only single words and you want exact matching, the problem is simpler. You can do:
df[x %chin% lw]
%chin% is a data.table special %in% operator for character vectors (%in% can also be used but it will not be as performant). You can have better performance there if you use merge by transforming lw into a data.table:
merge(df, data.table(x = lw), by = "x")

How to remove ending from sample names

I am trying to remove endings from sample names in my data frame. There are about 200 samples so I was hoping there was a way to end the name before the first - (common to each sample).
Examples of names are:
Glyc.1.20C.1wk-ATGGTTCACCCG-CATCAGTACGCC-R1.fastq
Glyc.1.20C.2m-CACTACGCTAGA-GTTCCTCCATTA-R1.fastq
Glyc.1.20C.2wk-GCTCGAAGATTC-CGAGGGAAAGTC-R1.fastq
Glyc.1.20C.3m-GTAGGTGCTTAC-GCATAAACGACT-R1.fastq
Using the change colnames(x) <- c("Glyc.1.20C.1wk, etc) would take me forever.
Any ideas?
If df is your dataframe, take the names, remove everything after the first -, and reset the names to the new short values...
names(df) <- gsub("\\-.+","",names(df))
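To check the pattern before overwriting the real names, it can be run on a couple of the sample names from the question. This sketch uses sub(), which replaces only the first match and is sufficient here. The vector names are made up for illustration.

```r
# Two of the sample names from the question.
sample_names <- c(
  "Glyc.1.20C.1wk-ATGGTTCACCCG-CATCAGTACGCC-R1.fastq",
  "Glyc.1.20C.2m-CACTACGCTAGA-GTTCCTCCATTA-R1.fastq"
)

# Drop everything from the first hyphen to the end of the string.
short_names <- sub("-.*$", "", sample_names)
print(short_names)
```

A hyphen needs no escaping outside a character class, so "-.*$" behaves the same as the escaped form for these names.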

R: How to use setdiff on two string vectors by only comparing the first 3 tab delimited items in each string? without using qdap

I previously asked this question and the answer I received worked: R: How to use setdiff on two string vectors by only comparing the first 3 tab delimited items in each string?
However, qdap requires rJava and a correctly configured user system (see: cannot load R package qdap). So now I am re-asking the question, but am wondering if there is a way to do this without using qdap. I will repeat the question below:
I am trying to figure out a way in R to take the difference of two string vectors, but only based on the first 3 columns that are tab delimited in each string. For Example this is list1 and list2
list1:
"1\t1113200\t1118399\t1\t1101465\t1120176\tENSRNOG00000040300\tRaet1l\t0\n"
"1\t1180200\t1187599\t1\t1177682\t1221416\tENSRNOG00000061316\tAABR07000121.1\t0\n"
"1\t1180200\t1187599\t1\t1177632\t1221416\tENSRNOG00000061316\tAABR07000121.1\t0\n"
list2:
"1\t1113200\t1118399\t1\t1101465\t1120176\tENSRNOG00000040300\tRaet1l\t0\n"
"1\t1180200\t1187599\t1\t1177682\t1221416\tENSRNOG00000061316\tAABR07000121.1\t0\n"
I want to do setdiff(list2, list1), so that I just get everything in list2 that is non-existent in list1; however, I want to do it based on just the first 3 tab-delimited strings. So in list1 I would only consider:
"1\t1113200\t1118399"
from the first entry. However, I still want the full string returned. I only want to compare using the first 3 columns. I am having trouble figuring out how to do this; any help would be appreciated. I've already looked at several SO posts, and none of them seemed to help.
Looks like you just need to extract up to the third tab character (to get the first three columns) from list1 and compare that to the same in list2?
There are quite a few ways to do this in base R; here's one using regular expressions to extract the first three tab-delimited fields:
# first, let's get the first 3 columns of `list1` (get up to the third tab)
m = regexec("^(?:[^\t]+\t){3}", list1)
# regmatches() returns a list holding the first 3 columns of each element of `list1`
first3.list1 = unlist(regmatches(list1, m))
Now we have the first three columns we can match against list2. You can extract the first three columns of list2 similarly and use %in% like the answer to your previous question now. (setdiff will only return the non-matching first 3 columns, while using %in% can be used to index the original list2 to extract the entire original string)
m = regexec("^(?:[^\t]+\t){3}", list2)
first3.list2 = unlist(regmatches(list2, m))
list2[!(first3.list2 %in% first3.list1)]
(It seems for the example you provided, there are no lines in list2 whose first 3 columns are not in the first 3 columns of list1).
Other approaches include using strsplit or read.delim to split your dataframe into columns, then using paste to paste the first 3 back together, and then proceeding similarly.
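A minimal sketch of the strsplit-and-paste variant is shown below. The helper first3 is a hypothetical name, and the shortened toy strings are invented so that one entry of list2 really is missing from list1.

```r
# Toy tab-delimited records (hypothetical, shorter than the real ones).
list1 <- c("1\t100\t200\tA\t0\n", "1\t300\t400\tB\t0\n")
list2 <- c("1\t100\t200\tZ\t0\n", "2\t500\t600\tC\t0\n")

# Build a comparison key from the first three tab-delimited fields.
first3 <- function(x) {
  parts <- strsplit(x, "\t", fixed = TRUE)
  vapply(parts, function(p) paste(p[1:3], collapse = "\t"), character(1))
}

# Keep the full strings of list2 whose key does not occur in list1.
only_in_list2 <- list2[!(first3(list2) %in% first3(list1))]
print(only_in_list2)
```

The first entry of list2 is dropped even though its fourth field differs from list1, because only the first three fields take part in the comparison.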
