How to convert DNAbin to FASTA in R?

I am trying to convert my_dnabin1, a DNAbin object of 55 samples, to FASTA format. I am using the following code to convert it into a FASTA file.
dnabin_to_fasta <- lapply(my_dnabin1, function(x) as.character(x[1:length(x)]))
This generates a list of 55 samples which looks like:
$SS.11.01
[1] "t" "t" "a" "c" "c" "t" "a" "a" "a" "a" "a" "g" "c" "c" "g" "c" "t" "t" "c" "c" "c" "t" "c" "c" "a" "a"
[27] "c" "c" "c" "t" "a" "g" "a" "a" "g" "c" "a" "a" "a" "c" "c" "t" "t" "t" "c" "a" "a" "c" "c" "c" "c" "a"
$SS.11.02
[1] "t" "t" "a" "c" "c" "t" "a" "a" "a" "a" "a" "g" "c" "c" "g" "c" "t" "t" "c" "c" "c" "t" "c" "c" "a" "a"
[27] "c" "c" "c" "t" "a" "g" "a" "a" "g" "c" "a" "a" "a" "c" "c" "t" "t" "t" "c" "a" "a" "c" "c" "c" "c" "a"
and so on...
However, I want a fasta formatted file as the output that may look something like:
>SS.11.01 ttacctga
>SS.11.02 ttacctga

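You can try this on the character list built above (dnabin_to_fasta); pasting the raw DNAbin elements directly would not give nucleotide letters:
lapply(dnabin_to_fasta, function(x) paste0(x, collapse = ''))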
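Collapsing each sample to a string still leaves the step of writing a FASTA file. The following is a minimal base R sketch of that step, not the answerer's method: `seqs` is a hypothetical two-sample stand-in for the list returned by `lapply(my_dnabin1, as.character)`, and the output goes to a temporary file.

```r
# Hypothetical stand-in for the character list built from the DNAbin object
seqs <- list(
  SS.11.01 = c("t", "t", "a", "c", "c", "t", "a", "a"),
  SS.11.02 = c("t", "t", "a", "c", "c", "t", "g", "a")
)

# Collapse each character vector into one string per sample
collapsed <- vapply(seqs, paste0, character(1), collapse = "")

# Interleave ">name" header lines with the sequence lines
fasta_lines <- c(rbind(paste0(">", names(collapsed)), collapsed))

# Write to a temporary file (replace with a real path as needed)
out_file <- tempfile(fileext = ".fasta")
writeLines(fasta_lines, out_file)
```

If the ape package is available, its `write.FASTA()` function can write a DNAbin object to a file directly, which may make the manual conversion unnecessary.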

Related

Replacing a nucleotide in a FASTA file

I'm trying to make FASTA files for each variation of a gene using a CSV file extracted from gnomAD. In this function, x is a list with coordinates for each variation, y is a FASTA file opened using the read.fasta function from the seqinr library, and data is the file I downloaded from gnomAD. I'm having trouble with the last if statement, which is supposed to handle SNVs. For some reason, instead of inserting the nucleotide at the position specified, the value is concatenated at the end of the sequence.
I've read the documentation for the library but haven't found anything about the internal representation of the FASTA files.
Example of output:
"t" "t" "g" "c" "t" "c" "a" "c" "a" "g" "t" "g" "t" "t" "t" "g"
"a" "g" "c" "a" "g" "t" "g" "c" "t" "g" "a" "g" "c" "a" "c" "a" "a" "a" "g" "c"
"a" "g" "a" "c" "a" "c" "t" "c" "a" "a" "t" "a" "a" "a" "t" "g" "c" "t" "a" "g"
9
"a" "t" "t" "t" "a" "c" "a" "c" "a" "c" "t" "c" "C"
The C with a 9 index should be in the ninth position of the sequence
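library(stringr)  # str_detect()
library(seqinr)   # read.fasta(), write.fasta()
# x: list of coordinates per variant; y: result of read.fasta(); data: gnomAD CSV as a data frame
files <- function(x, y, data) {
  # NOTE: "[del]" is a regex character class (any of d, e, l), not the literal string "del"
  test <- str_detect(data[, "Consequence"], "[del]")
  names <- paste(data[, "Chromosome"], data[, "Position"], data[, "Reference"], data[, "Alternate"], "ACE2", sep = "-")
  for (j in 1:length(x)) {
    copy <- y
    # Deletion spanning a range: blank out the positions, then drop them
    if (length(x[[j]]) != 1 && test[j] == TRUE) {
      for (i in x[[j]][1]:x[[j]][2]) {
        copy[[1]][i] <- NA
      }
      copy <- copy[[1]][!is.na(copy[[1]])]
    }
    # Single-base deletion
    if (length(x[[j]]) == 1 && test[j] == TRUE) {
      copy[[1]][x[[j]][1]] <- NA
      copy <- copy[[1]][!is.na(copy[[1]])]
    }
    # SNV: substitute the nucleotide at position n
    if (test[j] == FALSE) {
      n <- x[[j]][1]
      # complementary() is not defined in the post
      copy[[1]][n] <- complementary(data[j, "Alternate"])
      print(copy[[1]][n])
    }
    putz <- paste(names[j], "fasta", sep = ".")
    write.fasta(copy, names[j], putz)
  }
}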
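One possible explanation, which the post itself does not confirm: the printed output shows the appended "C" labelled with the name 9, which is what R does when a vector is indexed with the character string "9" rather than the number 9. The sketch below illustrates that behaviour on a hypothetical short sequence, assuming the coordinates in x were read in as character strings.

```r
# Hypothetical 12-base sequence standing in for copy[[1]]
seq_vec <- c("a", "t", "t", "t", "a", "c", "a", "c", "a", "c", "t", "c")

# Character index: R looks for an element *named* "9", finds none,
# and appends a new element with that name at the end
by_name <- seq_vec
by_name["9"] <- "C"

# Numeric index: the ninth position is overwritten in place
by_pos <- seq_vec
by_pos[as.integer("9")] <- "C"

length(by_name)  # one element longer than the original
length(by_pos)   # same length as the original
```

If that is the cause, converting the coordinate with as.integer() before indexing would put the substitution at the intended position.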

R: Efficient way for spreading vectors

Is there an efficient way of programming to solve the following task?
Imagine the following vector:
A<-[a,b,c...k]
And would like to spread it the following way:
Let's start with e.g. n=2
B<-[a,a,b,b,c...,k,k]
And now n=4 or any number greater than 1
C<-[a,a,a,a,b,...,k,k,k,k]
Solving it with loops seems fairly easy, but is there any function or vector-based operation I missed or could use? A tidyverse solution (for use in a pipe) would be best for me.
(It is hard to research this task as I am a newbie in R and don't know the correct terms to search for. Any help would be appreciated.)
Let
A <- letters[1:11]
A
[1] "a" "b" "c" "d" "e" "f" "g" "h" "i" "j" "k"
If you use function rep with argument each, you get what you want:
rep(A, each=2)
[1] "a" "a" "b" "b" "c" "c" "d" "d" "e" "e" "f" "f" "g" "g" "h" "h" "i" "i" "j"
[20] "j" "k" "k"
rep(A, each=3)
[1] "a" "a" "a" "b" "b" "b" "c" "c" "c" "d" "d" "d" "e" "e" "e" "f" "f" "f" "g"
[20] "g" "g" "h" "h" "h" "i" "i" "i" "j" "j" "j" "k" "k" "k"
Another option is to use rep with argument times = 2 or 4 and then sort the result (this relies on A already being in sorted order). Yet another option is to use mapply and then the c operator.
c(mapply(rep, A, 2)) # OR sort(rep(A, times = 2))
#[1] "a" "a" "b" "b" "c" "c" "d" "d" "e" "e" "f" "f" "g" "g" "h" "h" "i" "i" "j" "j"
#[21] "k" "k"
c(mapply(rep, A, 4)) # OR sort(rep(A, times = 4))
#[1] "a" "a" "a" "a" "b" "b" "b" "b" "c" "c" "c" "c" "d" "d" "d" "d" "e" "e" "e" "e"
#[21] "f" "f" "f" "f" "g" "g" "g" "g" "h" "h" "h" "h" "i" "i" "i" "i" "j" "j" "j" "j"
#[41] "k" "k" "k" "k"
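Since the question asks for something usable in a pipe, here is a small sketch that wraps rep(each = ) in a helper. The name `spread_each` is hypothetical, not an existing function. It also shows why the sort-based variant only matches when the input is already sorted.

```r
# Hypothetical helper: repeat every element of x n times, keeping the input order
spread_each <- function(x, n) rep(x, each = n)

A <- letters[1:11]
B <- spread_each(A, 2)
C <- spread_each(A, 4)

# With unsorted input, rep(each = ) keeps the original order,
# while sort(rep(times = )) reorders the elements
unsorted <- c("k", "a")
kept_order <- spread_each(unsorted, 2)
resorted <- sort(rep(unsorted, times = 2))
```

Because the vector is the first argument, the helper can be used as a pipe step, e.g. `A |> spread_each(2)` with the native pipe (R 4.1 or later) or `A %>% spread_each(2)` with magrittr.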

R: (Pegas) problems with haplotypes - (error: 'h' must be of class 'haplotype')

I've recently started looking into haplotype data and I'm experimenting with data from the 1000 Genomes Project, trying to manipulate it with the pegas package in R. This is how far I've got:
library(pegas)
a <- "ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/release/20130502"
b <- "ALL.chrY.phase3_integrated_v1b.20130502.genotypes.vcf.gz"
url <- paste(a, b, sep = "/")
download.file(url, "chrY.vcf.gz")
(info <- VCFloci("chrY.vcf.gz"))
SNP <- is.snp(info)
X.SNP <- read.vcf("chrY.vcf.gz", which.loci = which(SNP))
h <- haplotype(X.SNP, 6020:6030)
net <- haploNet(h)
plot(net)
I would like to plot a haplotype net but it doesn't execute it. I get the following message: 'h' must be of class 'haplotype'
If I print out h I get:
> h
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12] [,13] [,14] [,15] [,16] [,17] [,18] [,19]
. "C" "C" "C" "C" "C" "C" "C" "C" "C" "C" "C" "C" "C" "C" "C" "C" "T" "C" "C"
. "G" "G" "G" "G" "G" "G" "G" "G" "G" "G" "G" "G" "G" "G" "G" "G" "G" "A" "G"
. "C" "C" "C" "C" "C" "C" "T" "C" "C" "C" "C" "C" "C" "C" "C" "C" "C" "C" "C"
. "T" "T" "T" "T" "T" "T" "T" "T" "T" "T" "C" "T" "T" "T" "T" "T" "T" "T" "T"
. "G" "G" "G" "G" "G" "G" "G" "G" "G" "G" "G" "G" "G" "A" "G" "G" "G" "G" "G"
. "T" "T" "T" "T" "T" "T" "T" "T" "T" "T" "T" "C" "T" "T" "T" "T" "T" "T" "T"
. "A" "A" "A" "A" "A" "A" "A" "A" "A" "C" "A" "A" "A" "A" "A" "A" "A" "A" "A"
. "G" "G" "G" "." "G" "G" "G" "G" "G" "G" "G" "G" "A" "G" "G" "G" "G" "G" "G"
. "." "T" "C" "T" "T" "C" "T" "." "." "." "T" "T" "T" "T" "C" "T" "T" "T" "T"
. "." "A" "." "A" "." "C" "A" "A" "C" "." "A" "A" "A" "A" "A" "C" "A" "A" "A"
. "T" "T" "T" "T" "T" "T" "T" "T" "T" "T" "T" "T" "T" "T" "T" "T" "T" "T" "C"
attr(,"class")
[1] "haplotype.loci"
attr(,"freq")
[1] 18 1142 2 5 25 6 1 4 2 1 2 5 1 9 1 3 1 4 1
It obviously assigned 19 haplotypes. Something must be wrong with the way the data is presented. Any advice? Also, there is very little material on pegas and on how to manipulate VCF files with it. Does anybody know a good resource (web page or book) on how to work with haplotypes from VCF files? It doesn't even have to be for pegas; any R library will do, or Python... anything really.
Thank you for the help, Peter
I know this is an old post, but in case others come along with the same issue, I have found a workaround. Using the package "vcfR", you can read in the VCF with read.vcfR() and then convert it to a DNAbin with vcfR2DNAbin(). Using haplotype() on the DNAbin results in an object of class "haplotype", not "haplotype.loci".
That's an expected result: for the moment haploNet() works only for the class "haplotype" which is generated from DNA seqs (class "DNAbin"). The output of read.vcf() is of class "loci" and haplotype() is a generic function working on both classes.
If you work on SNPs only, you can avoid this with:
class(h) <- NULL
h <- as.DNAbin(h)
The (ultimate) goal is to have haploNet() work also with the class "haplotype.loci" (which is still in development) and maybe others.
Cheers, Emmanuel

How to replace values in a data frame with another value

I have a huge data set. The columns contain values like A,B,C,D,E,F,G,H and I need to replace them with 1,2,3,4...
[1] "C" "C" "C" "C" "C" "A" "H" "G" "G" "G" "G" "G" "G" "G" "C" "C" "C" "C" "C"
[20] "C" "B" "B" "B" "H" "H" "H" "H" "H" "H" "G" "C" "A" "A" "A" "A" "A" "A" "A"
[30]----
A similar problem: one column has more than 1000 distinct values, and I need to replace them with unique numbers.
Try the replace function. Its second argument is an index (here a logical mask), not the value to look for, so in your case, e.g.:
replace(df, df == "A", 1)
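Calling replace() once per letter gets repetitive for eight values, let alone a thousand. As an alternative to the answer above, here is a sketch using match() against a lookup vector for the letter case and factor() for the many-distinct-values case. The data frame, the column name `grade`, and the vector `codes` are hypothetical examples.

```r
# Hypothetical data frame with a letter-coded column
df <- data.frame(grade = c("C", "A", "H", "G", "B"), stringsAsFactors = FALSE)

# match() returns each value's position in the lookup vector: A -> 1, B -> 2, ...
df$grade_num <- match(df$grade, LETTERS[1:8])

# For a column with many distinct values, factor() assigns one integer
# per unique value (numbered by the sorted order of the values)
codes <- c("x9", "k2", "x9", "a7")
code_ids <- as.integer(factor(codes))
```

Values that do not appear in the lookup vector come back as NA from match(), which makes unexpected entries easy to spot.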

R: repeat elements of a list based on another list

I have searched for this, but in vain.
The problem is that I have two lists. The first contains the elements to be repeated,
for example
my.list<-list(c('a','b','c','d'), c('g','h'))
and the second list is the number of times each element is to be repeated
repeat.list<-list(c(5,7,6,1), c(2,3))
I would like to create a new list in which each element in my.list is repeated based on repeat.list
i.e.
result:
[[1]]
[1] "a" "a" "a" "a" "a" "b" "b" "b" "b" "b" "b" "b" "c" "c" "c" "c" "c" "c" "d"
[[2]]
[1] "g" "g" "h" "h" "h"
Thank you in advance for your help
Use mapply:
mapply(rep, my.list, repeat.list)
[[1]]
[1] "a" "a" "a" "a" "a" "b" "b" "b" "b" "b" "b" "b" "c" "c" "c" "c" "c" "c" "d"
[[2]]
[1] "g" "g" "h" "h" "h"
lapply also does the trick, but is more verbose:
lapply(seq_along(my.list), function(i)rep(my.list[[i]], repeat.list[[i]]))
[[1]]
[1] "a" "a" "a" "a" "a" "b" "b" "b" "b" "b" "b" "b" "c" "c" "c" "c" "c" "c" "d"
[[2]]
[1] "g" "g" "h" "h" "h"
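One caveat with mapply: it simplifies its result to a matrix whenever every output happens to have the same length. A minimal sketch of a way to always get a list is Map(), which is mapply with SIMPLIFY = FALSE. The lists `same.len` and `same.rep` are hypothetical inputs chosen to trigger the simplification.

```r
my.list <- list(c("a", "b", "c", "d"), c("g", "h"))
repeat.list <- list(c(5, 7, 6, 1), c(2, 3))

# Map() always returns a list, one element per input pair
result <- Map(rep, my.list, repeat.list)

# Hypothetical inputs where both outputs have length 3:
# mapply() collapses them into a 3 x 2 character matrix
same.len <- list(c("a", "b"), c("c", "d"))
same.rep <- list(c(1, 2), c(2, 1))
as_matrix <- mapply(rep, same.len, same.rep)
as_list <- Map(rep, same.len, same.rep)
```

Passing SIMPLIFY = FALSE to mapply gives the same list result as Map().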
