Deconstruct DNAstringsSets into normal strings - r

This question concerns the R package "VariantAnnotation" and its dependency "Biostrings".
I have a DNAStringSetList and I want to convert it into a plain list or a character vector.
library(VariantAnnotation)
fl <- system.file("extdata", "chr22.vcf.gz", package="VariantAnnotation")
vcf <- readVcf(fl, "hg19")
tempo <- rowRanges(vcf)$ALT # Here is the DNAStringSetList I mean.
print(tempo)
A DNAStringSet instance of length 10376
width seq
[1] 1 G
[2] 1 T
[3] 1 A
[4] 1 T
[5] 1 T
... ... ...
[10372] 1 G
[10373] 1 G
[10374] 1 G
[10375] 1 A
[10376] 1 C
tempo[[1]]
A DNAStringSet instance of length 1
width seq
[1] 1 G
But I don't want this format. I just want strings of the bases, in order to insert them as a column in a new dataframe. I want this:
G
T
A
T
T
I have nearly accomplished this by extracting the object's underlying data slot:
as.character(tempo@unlistData)
However, this returns 10 more elements than tempo has! The head and tail of the result match those of tempo exactly, so somewhere in the middle there are 10 extra elements that should not be there (they are not NAs).

You can call as.character on either a DNAString or a DNAStringSet.
as.character(tempo[1 : 5])
# [1] "G" "T" "A" "T" "T"

A simple loop solves the issue, using the toString function of the same library:
ALT <- character(nrow(vcf)) # preallocate one string per record
for (i in seq_len(nrow(vcf))) { ALT[i] <- toString(tempo[[i]]) }
However, I have no idea why tempo@unlistData retrieves too many rows, so I would not rely on it.
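One plausible explanation for the 10 extra elements, not confirmed in the thread: a VCF record can list several alternate alleles, so some elements of tempo may hold more than one sequence, and flattening the whole object then yields more strings than there are records. The sketch below mimics this with a plain base R list (alt_list is a made-up stand-in, not the real data) and collapses each element into one comma-separated string so that the result has one entry per record.

```r
# Hypothetical stand-in for the ALT column: the third record has two alleles.
alt_list <- list("G", "T", c("A", "C"), "T")

# Flattening gives one string per allele, not one per record.
flat <- unlist(alt_list)
length(alt_list)  # number of records
length(flat)      # number of alleles

# lengths() reveals which records hold more than one allele.
multi <- which(lengths(alt_list) > 1)

# Collapse each record into a single string to keep one entry per record.
alt_chr <- vapply(alt_list, paste, character(1), collapse = ",")
```

On the real object, elementNROWS(tempo) from S4Vectors should play the role of lengths() here; treat that as an assumption to check against your own data.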

Related

Why does as.list() applied to a vector generate a list that is not treated the same as a list generated with list() in R?

Here is a very basic example that illustrates the differences in R
Given the following data frames:
a <- data.frame(l=c("object1", "object2"))
b <- data.frame(l=c("object3", "object4"))
Creating a vector for the names of the data frames:
vector <- c("a","b")
And then applying as.list()
list_of_vector <- as.list(vector)
If we try to loop over this:
lapply(list_of_vector, print)
The output is
[1] "a"
[1] "b"
[[1]]
[1] "a"
[[2]]
[1] "b"
Compared to just manually creating a list and then running the same loop:
straight_list <- list(a,b)
lapply(straight_list, print)
l
1 object1
2 object2
l
1 object3
2 object4
[[1]]
l
1 object1
2 object2
[[2]]
l
1 object3
2 object4
I would like to understand what makes as.list() different from list and how I would be able to convert a vector like the above to create the 2nd, rather than first output. Thanks in advance :)
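One way to get the second output is to look the objects up by name rather than wrapping the names themselves. as.list() only splits the character vector into one-element pieces, so each element is still a string. A minimal sketch, assuming the data frames live in the current environment (the name df_names is made up here to avoid masking base::vector):

```r
a <- data.frame(l = c("object1", "object2"))
b <- data.frame(l = c("object3", "object4"))
df_names <- c("a", "b")

# as.list() keeps the strings; mget() fetches the objects those strings name.
list_of_names <- as.list(df_names)
list_of_dfs <- mget(df_names)

class(list_of_names[[1]])  # "character"
class(list_of_dfs[[1]])    # "data.frame"
```

Note that mget() returns a named list (elements a and b), whereas list(a, b) is unnamed; unname() removes the names if they are unwanted.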

Is R's list() function actually creating a nested list?

R may have its own logic, but list() did not give me what I expected.
l1 <- list(1,2)
$> l1
[[1]]
[1] 1
[[2]]
[1] 2
To retrieve the element, I need to use double-bracket, i.e.,
$> l1[[1]]
[1] 1
$> class(l1[[1]])
"numeric"
A single bracket gives me a sub-list (which is also a list object):
$> l1[1]
[[1]]
[1] 1
$> class(l1[1])
"list"
I am not saying this is wrong; this isn't what I expected because I was trying to create a 1-dimensional list whereas what I actually get is a nested list, a 2-dimensional object.
What is the logic behind this behaviour and how do we create an OO type list? i.e., a 1-dimensional data structure?
The behaviour I am expecting, with a 1 dimensional data structure, is:
$> l1[1]
[1] 1
$> l1[2]
[2] 2
If you want to create a list with the two numbers in one element, you are looking for this:
l1 <- list(c(1, 2))
l1
#> [[1]]
#> [1] 1 2
Your code basically puts two vectors of length 1 into a list. To make R understand that you have one vector, you need to combine (i.e., c()) the values into a vector first.
This probably becomes clearer when we create the two vectors as objects first:
v1 <- 1
v2 <- 2
l2 <- list(v1, v2)
l2
#> [[1]]
#> [1] 1
#>
#> [[2]]
#> [1] 2
If you simply want to store the two values in an object, you want a vector:
l1 <- c(1, 2)
l1
#> [1] 1 2
For more on the different data structures in R I recommend this chapter: http://adv-r.had.co.nz/Data-structures.html
For the question about [ and [[ indexing, have a look at this classic answer: https://stackoverflow.com/a/1169495/5028841
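The difference between the two indexing operators can be checked directly. The following is a small sketch of the behaviour described above, using only base R; the variable names sub, elem and v are illustrative.

```r
l1 <- list(1, 2)

sub <- l1[1]     # single bracket: still a list, of length 1
elem <- l1[[1]]  # double bracket: the stored value itself

class(sub)   # "list"
class(elem)  # "numeric"

# A list whose elements are all length-one numerics can be flattened
# into an atomic vector, which then behaves "1-dimensionally".
v <- unlist(l1)
v[1]
```

So a list is already one-dimensional; `[` simply preserves the container type, while `[[` extracts what is inside it.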

How do I sort a nested list using two list elements in R?

I have a nested list of n elements, each element being itself a list of 4 different values. The values vary but are comparable between elements.
I need to sort the list first by the value of 'm.group', and then by the value 'att'. 'm.group' is an integer of, say, 1:3, and each element in the list is assigned a value of 1:3 (though the total number will vary). Within each numbered group, I need to arrange the members by descending order of 'att', which can be any value between 0 and 2.
I can arrange the list in ascending or descending values of either 'm.group' or 'att' using
function(a, field) a[order(sapply(a, "[[", i = field),decreasing = T)]
but I can't work out how to combine the two.
Each element looks like this.
[[1]]
[[1]]$`ind`
[1] 1
[[1]]$m.group
[1] 3
[[1]]$offspring
[1] 0
[[1]]$att
[1] 0.07626772
The values 'offspring' and 'ind' are not important at this stage.
To simplify, I need an output that looks something like this:
[[1]]
[[1]]$m.group
[1] 1
[[1]]$att
[1] 1.49352456
[[2]]
[[2]]$m.group
[1] 1
[[2]]$att
[1] 1.23452221
[[3]]
[[3]]$m.group
[1] 1
[[3]]$att
[1] 0.07626772
[[4]]
[[4]]$m.group
[1] 2
[[4]]$att
[1] 1.51852546
[[5]]
[[5]]$m.group
[1] 2
[[5]]$att
[1] 1.35648527
etc.
EDIT
You can generate a similar list with the following loop:
example <- vector(mode = "list", length = 40)
for (j in 1:40) {
  # draw a group number from 1:4 and an 'att' value between 0 and 1.7
  example[[j]]$m.group <- sample(round(80 / (10 * 2)), 1)
  example[[j]]$att <- runif(1, min = 0, max = 1.7)
}
Hope this is clear!
Thanks in advance for your help,
Andy
You can easily do that by using the list.sort function of the rlist package:
rlist::list.sort(example, m.group, (att))
The parentheses enclosing att mean you want the descending order.
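If you would rather avoid the extra dependency, the same two-key sort can be sketched in base R by passing both keys to order(). This is an alternative to the rlist call above, not a description of how rlist works internally; the seed and the helper vectors m_group and att are illustrative choices.

```r
set.seed(42)  # arbitrary seed so the sketch is reproducible
example <- lapply(1:40, function(i) {
  list(m.group = sample(4, 1), att = runif(1, min = 0, max = 1.7))
})

# Pull each sort key out into a plain vector.
m_group <- vapply(example, function(x) x$m.group, integer(1))
att <- vapply(example, function(x) x$att, numeric(1))

# Ascending m.group first; ties broken by descending att.
sorted <- example[order(m_group, -att)]
```

Negating att works because it is numeric; for a non-numeric key you would need a different way to reverse the order, such as order()'s decreasing argument with method = "radix".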

How to concatenate two DNAStringSet sequences per sample in R?

I have two large DNAStringSet objects, each containing 2805 entries, and each entry has a length of 201. I simply want to combine them into one object that still has 2805 entries, where each entry is the combination of the corresponding entries from both.
I tried to do this
s12 <- c(unlist(s1), unlist(s2))
But that created a single large DNAString object with 1127610 elements, which is not what I want. I simply want to combine them per sample.
EDIT:
Each entry in my DNAStringSet objects, named s1 and s2, has a format similar to this:
width seq
[1] 201 CCATCCCAGGGGTGATGCCAAGTGATTCCA...CTAACTCTGGGGTAATGTCCTGCAGCCGG
You can convert each DNAStringSet into a character vector. For example:
library(Biostrings)
set1 <- DNAStringSet(c("GCT", "GTA", "ACGT"))
set2 <- DNAStringSet(c("GTC", "ACGT", "GTA"))
as.character(set1)
as.character(set2)
Then paste them together into a DNAStringSet:
DNAStringSet(paste0(as.character(set1), as.character(set2)))
Since you're using DNAStringSet, which is in the Biostrings package, I recommend using that package's own functions for dealing with XStringSets. Base R functions are likely to be slower here because they require converting to character and back.
So you can use the Biostrings xscat function, for example:
library(Biostrings)
set1 <- DNAStringSet(c("GCT", "GTA", "ACGT"))
set2 <- DNAStringSet(c("GTC", "ACGT", "GTA"))
xscat(set1, set2)
The result would be:
DNAStringSet object of length 3:
width seq
[1] 6 GCTGTC
[2] 7 GTAACGT
[3] 7 ACGTGTA
If your goal is to return a list where each element is the concatenation of the corresponding elements from the original lists, resulting in a list of length 2805 where each element has a length of 402, you can achieve this with Map. Here is an example with a smaller pair of lists.
# set up the lists
set.seed(1234)
list.a <- list(a=1:5, b=letters[1:5], c=rnorm(5))
list.b <- list(a=6:10, b=letters[6:10], c=rnorm(5))
Each list contains 3 elements, which are vectors of length 5. Now, concatenate the lists by list position with Map and c:
Map(c, list.a, list.b)
$a
[1] 1 2 3 4 5 6 7 8 9 10
$b
[1] "a" "b" "c" "d" "e" "f" "g" "h" "i" "j"
$c
[1] -1.2070657 0.2774292 1.0844412 -2.3456977 0.4291247 0.5060559
-0.5747400 -0.5466319 -0.5644520 -0.8900378
For your problem as you have described it, you would use
s12 <- Map(c, s1, s2)
The first argument of Map is a function that tells Map what to do with the list items that you have given it. Above, those lists are list.a and list.b; in your example, they are s1 and s2.

Extract the factor's values positions in level

I'm returning to R after some time, and the following has me stumped:
I'd like to build a list of the positions that the factor's values have in its list of levels.
Example:
> data = c("a", "b", "a","a","c")
> fdata = factor(data)
> fdata
[1] a b a a c
Levels: a b c
> fdata$lvl_idx <- ????
Such that:
> fdata$lvl_idx
[1] 1 2 1 1 3
Appreciate any hints or tips.
If you convert a factor to integer, you get the position in the levels:
as.integer(fdata)
## [1] 1 2 1 1 3
In certain situations, this is counter-intuitive:
f <- factor(2:4)
f
## [1] 2 3 4
## Levels: 2 3 4
as.integer(f)
## [1] 1 2 3
Also if you silently coerce to integer, for example by using a factor as a vector index:
LETTERS[2:4]
## [1] "B" "C" "D"
LETTERS[f]
## [1] "A" "B" "C"
Converting to character before converting to integer gives the expected values. See ?factor for details.
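A short sketch of that conversion, using the same factor as above. The third form, indexing the converted levels by the factor, is the variant suggested in ?factor.

```r
f <- factor(2:4)

as.integer(f)                # positions in the levels
as.integer(as.character(f))  # the original values
as.integer(levels(f))[f]     # the original values, via the levels
```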
The solution provided years ago by Matthew Lundberg is not robust: as.integer() may have an S3 method defined for a specific subclass of factor. Imagine someone creates a new factor class to keep operators like >=.
as.myfactor <- function(x, ...) {
structure(as.factor(x), class = c("myfactor", "factor"))
}
# and that someone would create an S3 method for integers - it should
# only remove the operators, which makes sense...
as.integer.myfactor <- function(x, ...) {
as.integer(gsub("(<|=|>)+", "", as.character(x)))
}
Now as.integer() no longer returns the level index; it just strips the operators:
f <- as.myfactor(">=2")
as.integer(f)
#> [1] 2
But the following is robust for any factor whose level index you want to know, using which():
f <- factor(2:4)
which(levels(f) == 2)
#> [1] 1
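The which() call looks up one level at a time. To get the position of every value at once, as the question asks, one option is match() against the levels. This is a sketch using the question's data; lvl_idx is the name the asker proposed.

```r
fdata <- factor(c("a", "b", "a", "a", "c"))

# Position of each value within levels(fdata), without calling as.integer()
# on the factor itself.
lvl_idx <- match(as.character(fdata), levels(fdata))
lvl_idx
```

Because the lookup goes through as.character() and levels(), it does not depend on how as.integer() is defined for a given factor subclass, provided as.character() still returns the level labels.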

Resources