How to concatenate two DNAStringSet sequences per sample in R?


I have two Large DNAStringSet objects. Each contains 2805 entries, and each entry has a width of 201. I want to combine them entry by entry, so that I end up with a single object that still has 2805 entries, each one the combination of the corresponding entries from both objects.
I tried this:
s12 <- c(unlist(s1), unlist(s2))
But that created a single Large DNAString object with 1127610 elements, which is not what I want. I simply want to combine them per sample.
EDIT:
Each entry in my DNAStringSet objects, named s1 and s2, has a format similar to this:
width seq
[1] 201 CCATCCCAGGGGTGATGCCAAGTGATTCCA...CTAACTCTGGGGTAATGTCCTGCAGCCGG

You can convert each DNAStringSet into a character vector. For example:
library(Biostrings)
set1 <- DNAStringSet(c("GCT", "GTA", "ACGT"))
set2 <- DNAStringSet(c("GTC", "ACGT", "GTA"))
as.character(set1)
as.character(set2)
Then paste them together into a DNAStringSet:
DNAStringSet(paste0(as.character(set1), as.character(set2)))
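As a quick sanity check on the paste0 step, here is a minimal base R sketch that uses plain character vectors in place of as.character(set1) and as.character(set2). The names seq1, seq2 and combined are made up for illustration. It shows that paste0 pairs elements by position, so the number of entries is unchanged and each width is the sum of the two input widths.

```r
# Plain character vectors standing in for as.character(set1) / as.character(set2)
seq1 <- c("GCT", "GTA", "ACGT")
seq2 <- c("GTC", "ACGT", "GTA")

# paste0 is vectorised: element i of seq1 is joined to element i of seq2
combined <- paste0(seq1, seq2)

length(combined)                                   # same number of entries as the inputs
all(nchar(combined) == nchar(seq1) + nchar(seq2))  # widths add up per entry
```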

Since you're using DNAStringSet, which comes from the Biostrings package, I recommend using that package's own functions for working with XStringSets. Base R functions tend to be slower here because they require unnecessary conversions.
So you can use the Biostrings xscat function. For example:
library(Biostrings)
set1 <- DNAStringSet(c("GCT", "GTA", "ACGT"))
set2 <- DNAStringSet(c("GTC", "ACGT", "GTA"))
xscat(set1, set2)
The result would be:
DNAStringSet object of length 3:
width seq
[1] 6 GCTGTC
[2] 7 GTAACGT
[3] 7 ACGTGTA

If your goal is to return a list in which each element is the concatenation of the corresponding elements of the original lists, giving a list of length 2805 whose elements each have length 402, you can achieve this with Map. Here is an example with a smaller pair of lists.
# set up the lists
set.seed(1234)
list.a <- list(a=1:5, b=letters[1:5], c=rnorm(5))
list.b <- list(a=6:10, b=letters[6:10], c=rnorm(5))
Each list contains 3 elements, which are vectors of length 5. Now, concatenate the lists by list position with Map and c:
Map(c, list.a, list.b)
$a
[1] 1 2 3 4 5 6 7 8 9 10
$b
[1] "a" "b" "c" "d" "e" "f" "g" "h" "i" "j"
$c
[1] -1.2070657 0.2774292 1.0844412 -2.3456977 0.4291247 0.5060559
-0.5747400 -0.5466319 -0.5644520 -0.8900378
For your problem as you have described it, you would use
s12 <- Map(c, s1, s2)
The first argument of Map is a function that tells Map what to do with the list items you give it. Above, those lists are list.a and list.b; in your example, they are s1 and s2.
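To see why Map keeps one result per entry while c(unlist(...)) collapses everything, here is a small sketch with ordinary lists. The names x, y, flat and paired are illustrative only. Map always returns a plain list, so with DNAStringSet inputs the result would not itself be a DNAStringSet; treat this as an illustration of the pairing behaviour rather than of Biostrings.

```r
x <- list(s1 = c("A", "C"), s2 = c("G", "T"))
y <- list(s1 = c("T", "T"), s2 = c("A", "A"))

flat <- c(unlist(x), unlist(y))   # one long vector: per-entry structure is lost
paired <- Map(c, x, y)            # list of 2: element i is c(x[[i]], y[[i]])

length(flat)      # 8
length(paired)    # 2
lengths(paired)   # 4 4 (named s1, s2)
```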

Related

Deconstruct DNAstringsSets into normal strings

This comes from an R library called "VariantAnnotation" and its dependency "Biostrings".
I have a DNAStringSetList and I want to transform it into a normal list or a vector of strings.
library(VariantAnnotation)
fl <- system.file("extdata", "chr22.vcf.gz", package="VariantAnnotation")
vcf <- readVcf(fl, "hg19")
tempo <- rowRanges(vcf)$ALT # Here is the DNAstringsSetList I mean.
print(tempo)
A DNAStringSet instance of length 10376
width seq
[1] 1 G
[2] 1 T
[3] 1 A
[4] 1 T
[5] 1 T
... ... ...
[10372] 1 G
[10373] 1 G
[10374] 1 G
[10375] 1 A
[10376] 1 C
tempo[[1]]
A DNAStringSet instance of length 1
width seq
[1] 1 G
But I don't want this format. I just want strings of the bases, in order to insert them as a column in a new dataframe. I want this:
G
T
A
T
T
I have accomplished this with this package method:
as.character(tempo@unlistData)
However, it returns 10 rows more than tempo has! The head and tail of this result and of tempo are exactly the same, so somewhere in the middle there are 10 extra rows that should not have been formed (not NAs)

You can call as.character on either a DNAString or a DNAStringSet.
as.character(tempo[1 : 5])
# [1] "G" "T" "A" "T" "T"

A simple loop solves the issue, using the toString function of the same library:
ALT <- character(nrow(vcf))
for (i in seq_len(nrow(vcf))) { ALT[i] <- toString(tempo[[i]]) }
However, I have no idea why tempo@unlistData retrieves too many rows. It is not trustworthy.
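One possible explanation for the extra rows, offered as an assumption rather than something confirmed in this thread: if a few elements of tempo hold more than one sequence, flattening the whole object yields more entries than tempo has elements. The sketch below imitates that with a plain list of character vectors (alt is invented data, not taken from the VCF file) and shows one way to get exactly one string per element.

```r
# Invented stand-in for a list-like object where one element holds two values
alt <- list("G", "T", c("A", "C"), "T")

length(alt)           # 4 elements
length(unlist(alt))   # 5 values once flattened, one more than length(alt)
lengths(alt)          # 1 1 2 1 shows which element is responsible

# One string per element, joining multiple values with a comma
per_element <- vapply(alt, paste, character(1), collapse = ",")
per_element           # "G" "T" "A,C" "T"
```

If this is what is happening in the real data, checking which elements have a length greater than 1 should locate the rows in question.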

Finding matching values in two vectors of different lengths in R

I have two vectors with species names following two different methods. Some names are the same, others are different and both are sorted in different ways. An example:
list 1: c(Homo sapiens sapiens, Homo sapiens neanderthalensis, Homo erectus,...,n)
List 2: c(Homo erectus, Homo sapiens, Homo neanderthalensis,...,n+1)
I write n and n+1 to denote that these lists have different lengths.
I would like to create a new list that holds one of two values at each position of List 1: if there is a match between the two vectors (e.g. Homo erectus), the name from List 2; if there is no match, a "0". So in this case the new list would be newlist: c(0,0, Homo erectus,...)
For this I have written the following code, but it does not work.
data<-read.table("species.txt",sep="\t",header=TRUE)
list1<-as.vector(data$Species1)
list2<-as.vector(data$Species2)
newlist<-as.character(rep(0,length(list1)))
for (i in 1:length(list1)){
for (j in 1:length(list2)){
if(list1[i] == list2[j]){newlist[i]<- list2[j]}else {newlist[i]= 0}
}
}
I hope this is clear.
Thanks for any help!

Take this reproducible example:
set.seed(1)
list1 <- letters[1:10]
list2 <- letters[sample(1:10, 10)]
You can avoid a loop using ifelse:
newlist <- ifelse(list1==list2, list2, 0)
The issue is that you never declared newname; did you mean newlist?
If you want to use a loop you can use only one loop and not 2 because length(list1) = length(list2):
for (i in 1:length(list1)){
if(list1[i] == list2[i]){newlist[i]<- list2[i]}else {newlist[i]= 0}
}
In general if you want to match elements in vectors you can use match like this:
> list1
[1] "a" "b" "c" "d" "e" "f" "g" "h" "i" "j"
> list2
[1] "c" "d" "e" "g" "b" "h" "i" "f" "j" "a"
> match(list1, list2)
[1] 10 5 1 2 3 8 4 6 7 9
As you can see, match returns, for each element of list1, the index of the matching element in list2. This is useful when you have another table data2 and want to fetch a column of data2 for the rows whose data2$list3 values correspond to data$list1. You would use:
data <- data.frame(list1, list2)
list3 <- list2
columntoget <- 1:length(list2)
data2 <- data.frame(list3, columntoget)
data$mynewcolumn <- data2$columntoget[match(data$list1, data2$list3)]
> data$mynewcolumn
[1] 10 5 1 2 3 8 4 6 7 9

I'm not completely certain that I understand what you're trying to achieve, but I think this does what you're after.
list1 <- c("Homo sapiens sapiens","Homo sapiens neanderthalensis","Homo erectus")
list2 <- c("Homo erectus","Homo sapiens","Homo neanderthalensis")
sapply(list1, function(x) { ifelse(x %in% list2, list2[match(x, list2)], 0) } )

The inner for loop uses newname[i] where it should be newlist[i].
Using your code, you overwrite the newlist[i] entries j times with either 0 or a species name. This is probably not what you want.
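Putting the points above together, here is a minimal sketch of a vectorised version built on match, using the three example names from the question. It assumes that an exact string match is what is wanted; partial matches such as "Homo sapiens" against "Homo sapiens sapiens" are not handled.

```r
list1 <- c("Homo sapiens sapiens", "Homo sapiens neanderthalensis", "Homo erectus")
list2 <- c("Homo erectus", "Homo sapiens", "Homo neanderthalensis")

# Position in list2 of each element of list1 (NA where there is no match)
idx <- match(list1, list2)

# Take the name from list2 where a match exists, otherwise "0"
newlist <- ifelse(is.na(idx), "0", list2[idx])
newlist
# [1] "0"            "0"            "Homo erectus"
```

Because match works on vectors of any length, this does not require list1 and list2 to have the same number of elements.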

How to make a list of many data frames without typing their names in r?

I've been working on several hundred files, which I automatically loaded into the workspace as separate dataframes (let's assume I have 500 dataframes in my workspace).
I would like to create a list consisting of all dataframes/objects in the workspace and to apply a function on all of them. Of course I could type all the objects manually, but it is not very efficient for hundreds or thousands of dataframes. I was wondering whether there is any way I can use the output of ls() function e.g.:
ls()
[1] "a" "b" "c" "d"
[5] "e" "f" "g" "h"
[9] "i" "j" "k" "l"
[13] "m" "n" "o" "p"
...
Unfortunately, when I extract from ls() output, I only end up with a character vector of strings and not a list of dataframes.
I would appreciate your ideas. Thanks.
EDITED: the following page How do I make a list of data frames in r gives some background but it doesn't answer my question as it doesn't cover large amounts of dataframes.

Yes, you can retrieve the names of all your data frames using ls, Filter and is.data.frame. For example, suppose you open an R session and type this:
> df1=data.frame(col=1:10)
> df14=data.frame(col=1:10)
> rr=3
You retrieve the data frame names with:
dfnames=Filter(function(x) is.data.frame(get(x)), ls(envir=globalenv()))
#>dfnames
#[1] "df1" "df14"
And your data.frame list is:
> lapply(dfnames, get)
[[1]]
col
1 1
2 2
3 3
4 4
5 5
6 6
7 7
8 8
9 9
10 10
[[2]]
col
1 1
2 2
3 3
4 4
5 5
6 6
7 7
8 8
9 9
10 10
Then you can do what you want with this list.

Hi, in one shot:
m1 = mtcars
m2 = mtcars
m3 = 1:10
m4 = "blabla"
df_list <- mget(ls()[sapply(ls(), function(x) is.data.frame(get(x)))])
Feel free to rearrange the code in several steps

If these data.frames are all that are in your environment you can do:
my_list=sapply(ls(),get,simplify=FALSE)
If you have other objects as well that you don't want to incorporate into your list, you can pick out the data frames of interest using grep().
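A minimal sketch of the grep() idea, assuming the data frames of interest share a name prefix (here "df", which is only an assumption for the example). It uses a separate environment called env so that it does not depend on what is in your workspace; in practice you would call ls() and mget() on the global environment.

```r
# A separate environment keeps the example independent of the current workspace
env <- new.env()
assign("df1", data.frame(col = 1:3), envir = env)
assign("df14", data.frame(col = 4:6), envir = env)
assign("rr", 3, envir = env)

# Keep only the object names that start with "df"
df_names <- grep("^df", ls(env), value = TRUE)

# mget returns a named list of the corresponding objects
df_list <- mget(df_names, envir = env)
names(df_list)   # "df1" "df14"
```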

Apply function to corresponding elements in list of data frames

I have a list of data frames in R. All of the data frames in the list are of the same size. However, the elements may be of different types. For example,
I would like to apply a function to corresponding elements of the data frames. For example, I want to use the paste function to produce a data frame such as
"1a" "2b" "3c"
"4d" "5e" "6f"
Is there a straightforward way to do this in R? I know it is possible to use the Reduce function to apply a function on corresponding elements of data frames within lists. But using the Reduce function in this case does not seem to have the desired effect.
Reduce(paste,l)
Produces:
"c(1, 4) c(\"a\", \"d\")" "c(2, 5) c(\"b\", \"e\")" "c(3, 6) c(\"c\", \"f\")"
Wondering if I can do this without writing messy for loops. Any help is appreciated!

Instead of Reduce, use Map.
# not quite the same as your data
l <- list(data.frame(matrix(1:6,ncol=3)),
data.frame(matrix(letters[1:6],ncol=3), stringsAsFactors=FALSE))
# this returns a list
LL <- do.call(Map, c(list(f=paste0),l))
#
as.data.frame(LL)
# X1 X2 X3
# 1 1a 3c 5e
# 2 2b 4d 6f

To explain @mnel's excellent answer a bit more, consider the simple example of summing the corresponding elements of two vectors:
Map(sum,1:3,4:6)
[[1]]
[1] 5 # sum(1,4)
[[2]]
[1] 7 # sum(2,5)
[[3]]
[1] 9 # sum(3,6)
Map(sum,list(1:3,4:6))
[[1]]
[1] 6 # sum(1:3)
[[2]]
[1] 15 # sum(4:6)
Why the second one is the case might be made more obvious by adding a second list, like:
Map(sum,list(1:3,4:6),list(0,0))
[[1]]
[1] 6 # sum(1:3,0)
[[2]]
[1] 15 # sum(4:6,0)
Now, the next is more tricky. As the help page ?do.call states:
‘do.call’ constructs and executes a function call from a name or a
function and a list of arguments to be passed to it.
So, doing:
do.call(Map,c(sum,list(1:3,4:6)))
calls Map with the inputs of the list c(sum,list(1:3,4:6)), which looks like:
[[1]] # first argument to Map
function (..., na.rm = FALSE) .Primitive("sum") # the 'sum' function
[[2]] # second argument to Map
[1] 1 2 3
[[3]] # third argument to Map
[1] 4 5 6
...and which is therefore equivalent to:
Map(sum, 1:3, 4:6)
Looks familiar! It is equivalent to the first example at the top of this answer.
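The equivalence can be checked directly. This short sketch compares the two calls with identical and then applies the same pattern to a list holding three vectors. The names direct, via_do_call, vecs and res are made up for the example.

```r
# The do.call form and the direct Map call build the same result
direct <- Map(sum, 1:3, 4:6)
via_do_call <- do.call(Map, c(sum, list(1:3, 4:6)))
identical(direct, via_do_call)   # TRUE

# Same pattern when the inputs already sit in a list of arbitrary length
vecs <- list(1:3, 4:6, 7:9)
res <- do.call(Map, c(f = sum, vecs))
unlist(res)                      # 12 15 18
```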

Paste column values together in a data frame

I am trying to paste together the row name along with the data in the desired columns. I wrote the following code but could not find a way to do it correctly.
The desired output will be: "a,1,11" "b,2,22" "c,3,33"
x = data.frame(cbind(f1 = c(1,2,3), f2 = c(5,6,7), f3=c(11,22,33)), row.names= c('a','b','c'))
x
# f1 f2 f3
# a 1 5 11
# b 2 6 22
# c 3 7 33
do.call("paste", c(rownames(x), x[c('f1','f3')], sep=","))
# [1] "a,b,c,1,11" "a,b,c,2,22" "a,b,c,3,33"

Two main points:
Use apply instead of do.call(paste, .)
Use cbind instead of c in this case.
If you would rather use c, you would need to coerce the row names to a list or column first, eg: c(list(rownames(x)), x)
Try the following:
apply(cbind(rownames(x), x[c('f1','f3')]), 1, paste, collapse=",")
a b c
"a,1,11" "b,2,22" "c,3,33"

Your do.call instructs R to paste the list c(rownames(x), x[c('f1','f3')]) together. But take a look at your list.
> c(rownames(x), x[c('f1','f3')])
[[1]]
[1] "a"
[[2]]
[1] "b"
[[3]]
[1] "c"
$f1
[1] 1 2 3
$f3
[1] 11 22 33
The c command takes the elements of each argument and joins them together. This properly deconstructs x[c('f1','f3')] but also deconstructs rownames(x) in a way you don't want. Obeying the standard recycling rule, paste then takes an item from each list element and patches them together with sep=",".
You could fix this by encapsulating rownames(x) inside a list structure so that your list of arguments comes out properly:
do.call("paste", c(list(rownames(x)), x[c('f1','f3')], sep=","))
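The difference between the two argument lists is easy to inspect. The sketch below rebuilds the example data frame (with data.frame directly rather than cbind, which gives the same columns) and compares the lengths of the two lists; bad and good are names invented for the illustration.

```r
x <- data.frame(f1 = c(1, 2, 3), f2 = c(5, 6, 7), f3 = c(11, 22, 33),
                row.names = c("a", "b", "c"))

bad  <- c(rownames(x), x[c("f1", "f3")])         # 5 elements: "a", "b", "c", f1, f3
good <- c(list(rownames(x)), x[c("f1", "f3")])   # 3 elements: row names, f1, f3

length(bad)    # 5
length(good)   # 3

do.call(paste, c(good, sep = ","))
# [1] "a,1,11" "b,2,22" "c,3,33"
```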

No need for do.call or apply:
paste(rownames(x),x[[1]],x[[3]] , sep=",")
[1] "a,1,11" "b,2,22" "c,3,33"
