I have a vector that looks like this:
tt <- c("chr1:110363793:G:C", "chr1:110363823:A:G", "chr1:110363849:A:G")
How do I create a vector with POS and NEG characters appended alternatively to get this below?
"chr1:110363793:G:C_POS", "chr1:110363823:A:G_NEG", "chr1:110363849:A:G_POS", "chr1:110363793:G:C_NEG", "chr1:110363823:A:G_POS", "chr1:110363849:A:G_NEG"
Just make use of the recyling of vector with rep and paste
paste0(rep(tt, 2), c("_POS", "_NEG"))
-output
[1] "chr1:110363793:G:C_POS" "chr1:110363823:A:G_NEG" "chr1:110363849:A:G_POS" "chr1:110363793:G:C_NEG" "chr1:110363823:A:G_POS"
[6] "chr1:110363849:A:G_NEG"
Related
I have a character string that I have split into a list of smaller strings using strsplit. For example:
> full.seq <- "FZpcgK3VdAQzEFZpcAVdV8QM8ZpsEFZpacgGKi3VdVSQzEFZpcgGKAVdVRpEFKGIZpg13"
> full.seq
[1] "FZpcgK3VdAQzEFZpcAVdV8QM8ZpsEFZpacgGKi3VdVSQzEFZpcgGKAVdVRpEFKGIZpg13"
> sequences <- strsplit(full.seq, "cg")
> sequences
[[1]]
[1] "FZp" "K3VdAQzEFZpcAVdV8QM8ZpsEFZpa" "GKi3VdVSQzEFZp"
[4] "GKAVdVRpEFKGIZpg13"
I would like to give each of these new strings a unique, sequential name that I can still use to identify that they were from the same original string (for a later analysis I will do on these strings). For example, "ID.seq1", "ID.seq2", "ID.seq3" etc. I have tried doing this manually but receive this error:
> names(sequences) <- c("ID.seq1", "ID.seq2", "ID.seq3", "ID.seq4")
Error in names(sequences) <- c("ID.seq1", "ID.seq2", "ID.seq3", "ID.seq4") :
'names' attribute [4] must be the same length as the vector [1]
I would also like an automated way of doing this though, as I will need to label up to 30 new strings from a number of original strings. Any suggestions?
First of all, if you want a character vector, you will have to subset the list, because strsplit returns a list. After doing that, you can easily assign names to that vector of terms.
full.seq <- "FZpcgK3VdAQzEFZpcAVdV8QM8ZpsEFZpacgGKi3VdVSQzEFZpcgGKAVdVRpEFKGIZpg13"
sequences <- strsplit(full.seq, "cg")[[1]]
names(sequences) <- paste0("ID.seq", c(1:4))
sequences
ID.seq1 ID.seq2
"FZp" "K3VdAQzEFZpcAVdV8QM8ZpsEFZpa"
ID.seq3 ID.seq4
"GKi3VdVSQzEFZp" "GKAVdVRpEFKGIZpg13"
Answer by Tim is perfect. I just want to add if you want to keep your list and name elements of each item:
full.seq <- "FZpcgK3VdAQzEFZpcAVdV8QM8ZpsEFZpacgGKi3VdVSQzEFZpcgGKAVdVRpEFKGIZpg13"
full.seq
sequences <- strsplit(full.seq, "cg")
names(sequences[[1]]) <- paste("ID.seq",1:4,sep="")
If I want to find two different patterns in a single sequence how am I supposed to do
eg:
seq="ATGCAAAGGT"
the patterns are
pattern=c("ATGC","AAGG")
How am I supposed to find these two patterns simultaneously in the sequence?
I also want to find the location of these patterns like for example the patterns locations are 1,4 and 5,8.
Can anyone help me with this ?
Lets say your sequence file is just a vector of sequences:
seq.file <- c('ATGCAAAGGT','ATGCTAAGGT','NOTINTHISONE')
You can search for both motifs, and then return a true / false vector that identifies if both are present using the following one-liner:
grepl('ATGC', seq.file) & grepl('AAGG', seq.file)
[1] TRUE TRUE FALSE
Lets say the vector of sequences is a column within data frame d, which also contains a column of ID values:
id <- c('s1','s2','s3')
d <- data.frame(id,seq.file)
colnames(d) <- c('id','sequence')
You can append a column to this data frame, d, that identifies whether a given sequence matches with this one-liner:
d$match <- grepl('ATGC',d$sequence) & grepl('AAGG', d$sequence)
> print(d)
id sequence match
1 s1 ATGCAAAGGT TRUE
2 s2 ATGCTAAGGT TRUE
3 s3 NOTINTHISONE FALSE
The following for-loop can return a list of the positions of each of the patterns within the sequence:
require(stringr)
for(i in 1: length(d$sequence)){
out <- str_locate_all(d$sequence[i], pattern)
first <- c(out[[1]])
first.o <- paste(first[1],first[2],sep=',')
second <- c(out[[2]])
second.o <- paste(second[1],second[2], sep=',')
print(c(first.o, second.o))
}
[1] "1,4" "6,9"
[1] "1,4" "6,9"
[1] "NA,NA" "NA,NA"
You can try using the stringr library to do something like this:
seq = "ATGCAAAGGT"
library(stringr)
str_extract_all(seq, 'ATGC|AAGG')
[[1]]
[1] "ATGC" "AAGG"
Without knowing more specifically what output you are looking for, this is the best I can provide right now.
How about this using stringr to find start and end positions:
library(stringr)
seq <- "ATGCAAAGGT"
pattern <- c("ATGC","AAGG")
str_locate_all(seq, pattern)
#[[1]]
# start end
#[1,] 1 4
#
#[[2]]
# start end
#[1,] 6 9
I have a large data frame that is filled with characters such as:
x <- c("Y188","Y204" ,"Y221","EP121_1" ,"Y233" , "Y248" ,"Y268", "BB2","BB20",
"BB32" ,"BB044" ,"BB056" , "Y234" , "Y249" ,"Y271" ,"BB3", "BB21", "BB33",
"BB045","BB057" ,"Y236", "Y250", "Y272" , "BB4", "BB22" )
As you can see, certain tags such as BB20 only have two integers. I would like the entire list of characters to have at least 3 integers like this(the issue is only in the BB tags if that helps):
Y188, Y204, Y221, EP121_1, Y233, Y248, Y268, BB002, BB020, BB032, BB044,
BB056, Y234, Y249, Y271, BB003, BB021, BB033, BB045, BB057, Y236, Y250,
Y272, BB004, BB022
Ive looked into the sprintf and FormatC functions but still am having no luck.
A forceful approach with a nested gsub call:
gsub("(.*[A-Z])(\\d{1}$)", "\\100\\2",
gsub("(.*[A-Z])(\\d{2}$)", "\\10\\2", x))
# [1] "Y188" "Y204" "Y221" "EP121_1" "Y233" "Y248" "Y268" "BB002" "BB020"
# [10] "BB032" "BB044" "BB056" "Y234" "Y249" "Y271" "BB003" "BB021" "BB033"
# [19] "BB045" "BB057" "Y236" "Y250" "Y272" "BB004" "BB022"
There is surely a more general way to do this, but for such a localized task, two simple sub can be enough: add one trailing zero for two-digit numbers, two trailing zeros for one-digit numbers.
x <- sub("^BB(\\d{1})$","BB00\\1",x)
x <- sub("^BB(\\d{2})$","BB0\\1",x)
This works, but will have edge case
# indicator for numeric of length less than three
num <- gsub("[^0-9]", "", x)
id <- nchar(num) < 3
# overwrite relevant values with the reformatted ones
x[id] <- paste0(gsub("[0-9]", "", x)[id],
formatC(as.numeric(num[id]), width = 3, flag = "0"))
[1] "Y188" "Y204" "Y221" "EP121_1" "Y233" "Y248" "Y268" "BB002" "BB020" "BB032"
[11] "BB044" "BB056" "Y234" "Y249" "Y271" "BB003" "BB021" "BB033" "BB045" "BB057"
[21] "Y236" "Y250" "Y272" "BB004" "BB022"
It can be done using sprintf and gsub function.This step would extract numeric values and change its format.
num=sprintf("%03d",as.numeric(gsub("[^[:digit:]]", "", x)))
Next step would be to paste back numbers with changed format
x=paste(gsub("[^[:alpha:]]", "", x),num,sep="")
Here is a portion of my large dataframe
> a
SS29.SS29 PP1.PP1 SS4.SS4 CC43.CC43 FF57.FF57 NN23.NN23 MM25.MM25 KK9.KK9 MM55.MM55 AA75.AA75 SS88.SS88
1 669.9544 1.068153 35.86534 24.47688 1.058007 72.20306 1.854856 10.15414 0.08715572 0.02006310 0.1817582
2 651.2092 1.164428 37.59895 27.41381 1.095322 73.48029 1.927993 10.09958 0.09096972 0.02261701 0.1855258
How I'd be able to get rid of the double column names separated by a dot? e.g. for the first column I'd like to have SS29 instead of repetitive SS29.SS29, for the second column PP1 and so on. Is there any automated way of doing it?
The simplest way would be to use sub to remove the substring after the dot . character.
names(a) <- sub('\\.[^.]*', '', names(a))
You could use sub
names(a) <- sub("[.](.*)", "", names(a))
# [1] "SS29" "PP1" "SS4" "CC43" "FF57" "NN23"
# [7] "MM25" "KK9" "MM55" "AA75" "SS88"
or a substring
substring(names(a), 1, regexpr("[.]", names(a))-1)
# [1] "SS29" "PP1" "SS4" "CC43" "FF57" "NN23"
# [7] "MM25" "KK9" "MM55" "AA75" "SS88"
or strsplit
names(a) <- unlist(strsplit(names(a), "[.](.*)"))
# [1] "SS29" "PP1" "SS4" "CC43" "FF57" "NN23"
# [7] "MM25" "KK9" "MM55" "AA75" "SS88"
You can assign new column names with
colnames(a) <- new_column_names
To compute new_column_names, you can use regular expressions, e.g.. the gsub function, as ssdecontrol suggested.
new_column_names <- gsub(...)
I have a question about the use of gsub. The rownames of my data, have the same partial names. See below:
> rownames(test)
[1] "U2OS.EV.2.7.9" "U2OS.PIM.2.7.9" "U2OS.WDR.2.7.9" "U2OS.MYC.2.7.9"
[5] "U2OS.OBX.2.7.9" "U2OS.EV.18.6.9" "U2O2.PIM.18.6.9" "U2OS.WDR.18.6.9"
[9] "U2OS.MYC.18.6.9" "U2OS.OBX.18.6.9" "X1.U2OS...OBX" "X2.U2OS...MYC"
[13] "X3.U2OS...WDR82" "X4.U2OS...PIM" "X5.U2OS...EV" "exp1.U2OS.EV"
[17] "exp1.U2OS.MYC" "EXP1.U20S..PIM1" "EXP1.U2OS.WDR82" "EXP1.U20S.OBX"
[21] "EXP2.U2OS.EV" "EXP2.U2OS.MYC" "EXP2.U2OS.PIM1" "EXP2.U2OS.WDR82"
[25] "EXP2.U2OS.OBX"
In my previous question, I asked if there is a way to get the same names for the same partial names. See this question: Replacing rownames of data frame by a sub-string
The answer is a very nice solution. The function gsub is used in this way:
transfecties = gsub(".*(MYC|EV|PIM|WDR|OBX).*", "\\1", rownames(test)
Now, I have another problem, the program I run with R (Galaxy) doesn't recognize the | characters. My question is, is there another way to get to the same solution without using this |?
Thanks!
If you don't want to use the "|" character, you can try something like :
Rnames <-
c( "U2OS.EV.2.7.9", "U2OS.PIM.2.7.9", "U2OS.WDR.2.7.9", "U2OS.MYC.2.7.9" ,
"U2OS.OBX.2.7.9" , "U2OS.EV.18.6.9" ,"U2O2.PIM.18.6.9" ,"U2OS.WDR.18.6.9" )
Rlevels <- c("MYC","EV","PIM","WDR","OBX")
tmp <- sapply(Rlevels,grepl,Rnames)
apply(tmp,1,function(i)colnames(tmp)[i])
[1] "EV" "PIM" "WDR" "MYC" "OBX" "EV" "PIM" "WDR"
But I would seriously consider mentioning this to the team of galaxy, as it seems to be rather awkward not to be able to use the symbol for OR...
I wouldn't recommend doing this in general in R as it is far less efficient than the solution #csgillespie provided, but an alternative is to loop over the various strings you want to match and do the replacements on each string separately, i.e. search for "MYN" and replace only in those rownames that match "MYN".
Here is an example using the x data from #csgillespie's Answer:
x <- c("U2OS.EV.2.7.9", "U2OS.PIM.2.7.9", "U2OS.WDR.2.7.9", "U2OS.MYC.2.7.9",
"U2OS.OBX.2.7.9", "U2OS.EV.18.6.9", "U2O2.PIM.18.6.9","U2OS.WDR.18.6.9",
"U2OS.MYC.18.6.9","U2OS.OBX.18.6.9", "X1.U2OS...OBX","X2.U2OS...MYC")
Copy the data so we have something to compare with later (this just for the example):
x2 <- x
Then create a list of strings you want to match on:
matches <- c("MYC","EV","PIM","WDR","OBX")
Then we loop over the values in matches and do three things (numbered ##X in the code):
Create the regular expression by pasting together the current match string i with the other bits of the regular expression we want to use,
Using grepl() we return a logical indicator for those elements of x2 that contain the string i
We then use the same style gsub() call as you were already shown, but use only the elements of x2 that matched the string, and replace only those elements.
The loop is:
for(i in matches) {
rgexp <- paste(".*(", i, ").*", sep = "") ## 1
ind <- grepl(rgexp, x) ## 2
x2[ind] <- gsub(rgexp, "\\1", x2[ind]) ## 3
}
x2
Which gives:
> x2
[1] "EV" "PIM" "WDR" "MYC" "OBX" "EV" "PIM" "WDR" "MYC" "OBX" "OBX" "MYC"