How to concatenate (merge) AAStringSets by name? - r

In bioinformatics/microbial ecology literature a fairly common practice is to concatenate multiple sequence alignments of multiple genes prior to building phylogenetic trees. In R terminology it may be clearer to say 'merge' these sequences by the organism they came from, but I'm sure examples are better.
Say these are two multiple sequence alignments.
library(Biostrings)
set1<-AAStringSet(c("IVR", "RDG", "LKS"))
names(set1)<-paste("org", 1:3, sep="_")
set2<-AAStringSet(c("VRT", "RKG", "AST"))
names(set2)<-paste("org", 2:4, sep="_")
set1
A AAStringSet instance of length 3
width seq names
[1] 3 IVR org_1
[2] 3 RDG org_2
[3] 3 LKS org_3
set2
A AAStringSet instance of length 3
width seq names
[1] 3 VRT org_2
[2] 3 RKG org_3
[3] 3 AST org_4
The correct concatenation of these sequences would be
A AAStringSet instance of length 4
width seq names
[1] 6 IVR--- org_1
[2] 6 RDGVRT org_2
[3] 6 LKSRKG org_3
[4] 6 ---AST org_4
The "-" notes a 'gap' (lack of amino acid) in that position, or in this case a lack of a gene to concatenate.
I thought there would be a function to do this in BioStrings, MSA, DECIPHER, or other related packages, but have been unable to find one.
I found the following Q&As, each does not provide the desired output as described.
1: https://support.bioconductor.org/p/38955/
output
A AAStringSet instance of length 6
width seq names
[1] 3 IVR org_1
[2] 3 RDG org_2
[3] 3 LKS org_3
[4] 3 VRT org_2
[5] 3 RKG org_3
[6] 3 AST org_4
May be better described as 'appending' the sequences (joins the two sets vertically).
2: https://support.bioconductor.org/p/39878/
output
A AAStringSet instance of length 2
width seq
[1] 9 IVRRDGLKS
[2] 9 VRTRKGAST
Concatenates sequences in each set, a complete chimera of each set (certainly not desired).
3: How to concatenate two DNAStringSet sequences per sample in R?
output
A AAStringSet instance of length 3
width seq
[1] 6 IVRVRT
[2] 6 RDGRKG
[3] 6 LKSAST
Creates chimeras of sequences by the order they are in. Even worse with different number of sequences (loops and concatenates shorter set...)
4: https://www.biostars.org/p/115192/
Output
A AAStringSet instance of length 2
width seq
[1] 3 IVR
[2] 3 VRT
Only appends the first sequence from each set, not sure why anyone wants this...
I would normally think these kinds of processes would be done with some combination of bash and Python, but I'm using the DECIPHER multiple sequence aligner in R, so it makes sense to do the rest of the processing in R. In the process of writing up this question I came up with an answer that I will post, but I'm kind of expecting someone to point me to the manual I missed that describes the function that does this. Thanks!

So I am a somewhat fanatical user of data.table in R, among many things it is great to merge datasets by names. I found Biostrings::AAStringSets can be converted to matrices using as.matrix and these can be converted to data.table and merged.
set1.dt<-data.table(as.matrix(set1), keep.rownames = TRUE)
set2.dt<-data.table(as.matrix(set2), keep.rownames = TRUE)
set12.dt<-merge(set1.dt, set2.dt, by="rn", all=TRUE)
set12.dt
rn V1.x V2.x V3.x V1.y V2.y V3.y
1: org_1 I V R <NA> <NA> <NA>
2: org_2 R D G V R T
3: org_3 L K S R K G
4: org_4 <NA> <NA> <NA> A S T
This is the correct merge, but needs more work to get the final result.
Need to replace "NA" with "-". I always need to look up this question to remember the best way to do this with a data.table.
Fastest way to replace NAs in a large data.table
#slightly modified from original, added arg "x"
f_dowle = function(dt, x) { # see EDIT later for more elegant solution
na.replace = function(v,value=x) { v[is.na(v)] = value; v }
for (i in names(dt))
eval(parse(text=paste("dt[,",i,":=na.replace(",i,")]")))
}
f_dowle(set12.dt, "-")
Concatenate the sequences (not included the names with !"rn")
set12<-apply(set12.dt[ ,!"rn"], 1, paste, collapse="")
Convert back to AAStringSet and add back names
set12<-AAStringSet(set12)
names(set12)<-set12.dt$rn
Desired output
set12
A AAStringSet instance of length 4
width seq names
[1] 6 IVR--- org_1
[2] 6 RDGVRT org_2
[3] 6 LKSRKG org_3
[4] 6 ---AST org_4
This works, but seems quite cumbersome, especially converting between different data formats. Obviously can wrap it into a function to use more easily, but again seems like this should already be a function in some Bioconductor package...

Related

(v)matchPattern DNAStringSetList of Codons to Reference DNAString

I am assessing the impact of hotspot single nucleotide polymorphism (SNPs) from a next generation sequencing (NGS) experiment on the protein sequence of a virus. I have the reference DNA sequence and a list of hotspots. I need to first figure out the reading frame of where these hotspots are seen. To do this, I generated a DNAStringSetList with all human codons and want to use a vmatchpattern or matchpattern from the Biostrings package to figure out where the hotspots land in the codon reading frame.
I often struggle with lapply and other apply functions, so I tend to utilize for loops instead. I am trying to improve in this area, so welcome a apply solution should one be available.
Here is the code for the list of codons:
alanine <- DNAStringSet("GCN")
arginine <- DNAStringSet(c("CGN", "AGR", "CGY", "MGR"))
asparginine <- DNAStringSet("AAY")
aspartic_acid <- DNAStringSet("GAY")
asparagine_or_aspartic_acid <- DNAStringSet("RAY")
cysteine <- DNAStringSet("TGY")
glutamine <- DNAStringSet("CAR")
glutamic_acid <- DNAStringSet("GAR")
glutamine_or_glutamic_acid <- DNAStringSet("SAR")
glycine <- DNAStringSet("GGN")
histidine <- DNAStringSet("CAY")
start <- DNAStringSet("ATG")
isoleucine <- DNAStringSet("ATH")
leucine <- DNAStringSet(c("CTN", "TTR", "CTY", "YTR"))
lysine <- DNAStringSet("AAR")
methionine <- DNAStringSet("ATG")
phenylalanine <- DNAStringSet("TTY")
proline <- DNAStringSet("CCN")
serine <- DNAStringSet(c("TCN", "AGY"))
threonine <- DNAStringSet("ACN")
tyrosine <- DNAStringSet("TGG")
tryptophan <- DNAStringSet("TAY")
valine <- DNAStringSet("GTN")
stop <- DNAStringSet(c("TRA", "TAR"))
codons <- DNAStringSetList(list(alanine, arginine, asparginine, aspartic_acid, asparagine_or_aspartic_acid,
cysteine, glutamine, glutamic_acid, glutamine_or_glutamic_acid, glycine,
histidine, start, isoleucine, leucine, lysine, methionine, phenylalanine,
proline, serine, threonine, tyrosine, tryptophan, valine, stop))
Current for loop code:
reference_stringset <- DNAStringSet(covid)
codon_locations <- list()
for (i in 1:length(codons)) {
pattern <- codons[[i]]
codon_locations[i] <- vmatchPattern(pattern, reference_stringset)
}
Current error code. I am filtering the codon DNAStringSetList so that it is a DNAStringSet.
Error in normargPattern(pattern, subject) : 'pattern' must be a single string or an XString object
I can't give out the exact nucleotide sequence, but here is the COVID genome (link: https://www.ncbi.nlm.nih.gov/nuccore/NC_045512.2?report=fasta) to use as a reprex:
#for those not used to using .fasta files, first copy and past genome into notepad and save as a .fasta file
#use readDNAStringSet from Biostrings package to read in the .fasta file
filepath = #insert file path
covid <- readDNAStringSet(filepath)
For the current code, change the way the codons is formed. Currently the output of codons looks like this:
DNAStringSetList of length 24
[[1]] GCN
[[2]] CGN AGR CGY MGR
[[3]] AAY
[[4]] GAY
[[5]] RAY
[[6]] TGY
[[7]] CAR
[[8]] GAR
[[9]] SAR
[[10]] GGN
...
<14 more elements>
Change it from DNAStringSetList to a conglomerate DNAStringSet of the amino acids.
codons <- DNAStringSet(c(alanine, arginine, asparginine, aspartic_acid, asparagine_or_aspartic_acid,
cysteine, glutamine, glutamic_acid, glutamine_or_glutamic_acid, glycine,
histidine, start, isoleucine, leucine, lysine, methionine, phenylalanine,
proline, serine, threonine, tyrosine, tryptophan, valine, stop))
codons
DNAStringSet object of length 32:
width seq
[1] 3 GCN
[2] 3 CGN
[3] 3 AGR
[4] 3 CGY
[5] 3 MGR
... ... ...
[28] 3 TGG
[29] 3 TAY
[30] 3 GTN
[31] 3 TRA
[32] 3 TAR
When I run the script I get the following output with the SARS-CoV-2 isolate listed for the example (I'm showing a small slice)
codon_locations[27:28]
[[1]]
MIndex object of length 1
$`NC_045512.2 Severe acute respiratory syndrome coronavirus 2 isolate Wuhan-Hu-1, complete genome`
IRanges object with 0 ranges and 0 metadata columns:
start end width
<integer> <integer> <integer>
[[2]]
MIndex object of length 1
$`NC_045512.2 Severe acute respiratory syndrome coronavirus 2 isolate Wuhan-Hu-1, complete genome`
IRanges object with 554 ranges and 0 metadata columns:
start end width
<integer> <integer> <integer>
[1] 89 91 3
[2] 267 269 3
[3] 283 285 3
[4] 352 354 3
[5] 358 360 3
... ... ... ...
[550] 29261 29263 3
[551] 29289 29291 3
[552] 29472 29474 3
[553] 29559 29561 3
[554] 29793 29795 3
Looking at the ones that had an output, only those with the standard nucleotides ("ATCG", no wobbles) found matches. Those will need to be changed as well to search.
If you're on twitter, I suggest linking the question using the #rstats, #bioconductor, and #bioinformatics hashtags to generate some more traction, I've noticed that bioinformatic specific questions on SO don't generate as much buzz.

Rename row.name in data frame using matches or partial matches from a list

I have a data frame in R with 341 rows. I want to rename the row names using a list with 349 names. All 341 names will be in this list for sure. But not all of them will be perfect hits.
The data looks like this
rownames(df_RPM1)
[1] "LQNS02059392.1_11686_5p"
[2] "LQNS02277998.1_30984_3p"
[3] "LQNS02277998.1_30984_5p"
[4] "LQNS02277998.1_30988_3p"
[5] "LQNS02277998.1_30988_5p"
[6] "LQNS02277997.1_30943_3p"
[7] "miR-9|LQNS02278070.1_31740_3p"
[8] "miR-9|LQNS02278094.1_36129_3p"
head(inlist)
[1] "dpu-miR-2-03_LQNS02059392.1_11686_5p" "dpu-miR-10-P2_LQNS02277998.1_30984_3p"
[3] "dpu-miR-10-P2_LQNS02277998.1_30984_5p" "dpu-miR-10-P3_LQNS02277998.1_30988_3p"
[5] "dpu-miR-10-P3_LQNS02277998.1_30988_5p" "miR-9|LQNS02278070.1_31740_3p"
[6] "miR-9|LQNS02278094.1_36129_3p"
The order won't necessarily be the same in the two.
Can anyone suggest me how to do this in R?
Thanks a lot
Depends a lot what a "non-perfect hit" looks like. Assuming the row name is a substring of the real name, str_detect() does the job quite well:
library(tidyverse)
real_names <- c("dpu-miR-2-03_LQNS02059392.1_11686_5p",
"dpu-miR-10-P2_LQNS02277998.1_30984_3p",
"dpu-miR-10-P2_LQNS02277998.1_30984_5p",
"dpu-miR-10-P3_LQNS02277998.1_30988_3p",
"dpu-miR-10-P3_LQNS02277998.1_30988_5p",
"miR-9|LQNS02278070.1_31740_3p",
"miR-9|LQNS02278094.1_36129_3p")
str_which(real_names, "LQNS02059392.1_11686_5p")
#> [1] 1
So we can vectorize (I removed the element 6 which is not found in the example list):
pos <- map_int(rownames(df_RPM1), ~ str_which(real_names, fixed(.)))
pos
#> [1] 1 2 3 4 5 6 7
And all that's left is to change the row names:
rownames(df_RPM1) <- real_names[pos]
Of course, if a non-perfect hit means something more complicated, you may need to create a regex from the row names or something like that.

How do I subset a list with mixed data type and data structure?

I have a list which included a mix of data type (character) and data structure (dataframe).
I want to keep only the dataframes and remove the rest.
> head(list)
[[1]]
[1] "/Users/Jane/R/12498798.txt error"
[[2]]
match
1 Japan arrests man for taking gun
2 Extradition bill turns ugly
file
1 /Users/Jane/R/12498770.txt
2 /Users/Jane/R/12498770.txt
[[3]]
[1] "/Users/Jane/R/12498780.txt error"
I expect the final list to contain only dataframes:
[[2]]
match
1 Japan arrests man for taking gun
2 Extradition bill turns ugly
file
1 /Users/Jane/R/12498770.txt
2 /Users/Jane/R/12498770.txt
Based on the example, it is possible that the OP's list elements are vectors and want to remove any element having 'error' substring
list[!sapply(list, function(x) any(grepl("error$", x)))]

R "read.table" function gives only odd numbered columns while merging the even numbered columns

I am trying to read a TSV file in R using the read.table function.
myTable <- read.table("file_path", sep='\t', header=T)
But when I try the command
names(myTable)
It gives me column names which are odd numbered, while merging the even numbered columns with those.
[1] "GeneSymbol" "GSM480304_JK_C_05.07.mas5.chp"
[3] "GSM480355_JK_C_05.07.mas5.chp" "GSM480480_JK_C_05.07.mas5.chp"
[5] "GSM480555_JK_C_05.07.mas5.chp" "GSM480634_JK_C_05.07.mas5.chp"
These are exact column names and you can see that two column names are separated by space while only ODD numbered column names are listed.
The output should be like this:
[1] "GeneSymbol"
[2] "GSM480304_JK_C_05.07.mas5.chp"
[3] "GSM480355_JK_C_05.07.mas5.chp"
[4] "GSM480480_JK_C_05.07.mas5.chp"
[5] "GSM480555_JK_C_05.07.mas5.chp"
[6] "GSM480634_JK_C_05.07.mas5.chp"
This is creating problem in assigning names to another table where I want to use these column names. Any suggestions ?
As noted in the comments, R is displaying all the columns, but not in the format you expect. This can be forced by casting the result of names() with as.data.frame() as follows:
rawData <- "
Number,Name,Type1,Type2,Total,HP,Attack,Defense,SpecialAtk,SpecialDef,Speed,Generation,Legendary
1,Bulbasaur,Grass,Poison,318,45,49,49,65,65,45,1,False
2,Ivysaur,Grass,Poison,405,60,62,63,80,80,60,1,False
3,Venusaur,Grass,Poison,525,80,82,83,100,100,80,1,False
3,VenusaurMega Venusaur,Grass,Poison,625,80,100,123,122,120,80,1,False
4,Charmander,Fire,,309,39,52,43,60,50,65,1,False
5,Charmeleon,Fire,,405,58,64,58,80,65,80,1,False
6,Charizard,Fire,Flying,534,78,84,78,109,85,100,1,False
6,CharizardMega Charizard X,Fire,Dragon,634,78,130,111,130,85,100,1,False
6,CharizardMega Charizard Y,Fire,Flying,634,78,104,78,159,115,100,1,False
7,Squirtle,Water,,314,44,48,65,50,64,43,1,False
8,Wartortle,Water,,405,59,63,80,65,80,58,1,False
9,Blastoise,Water,,530,79,83,100,85,105,78,1,False"
gen01 <- read.csv(textConnection=rawData,header=TRUE)
as.data.frame(names(gen01))
...and the output:
> as.data.frame(names(gen01))
names(gen01)
1 Number
2 Name
3 Type1
4 Type2
5 Total
6 HP
7 Attack
8 Defense
9 SpecialAtk
10 SpecialDef
11 Speed
12 Generation
13 Legendary

Weird conversion from list to dataframe in R

I have a list that I created from a for loop and it looks like this:
I tried to convert it to a dataframe using the code:
dflist<- as.data.frame(mylist)
But my dataframe looks like this now:
I know I probably created my list wrong but I am thinking this is still salvageable if I just need to convert the numbers to a dataframe correctly.
My end goal is to plot the numbers against their index (1-30) and I thought creating a dataframe first to clean it up and then plot would be helpful.
Any help would be really appreciated. Thank you.
The data showed is a list. We can use unlist and create a data.frame. Based on the image showed in OP's post, each list element have a length of 1. By doing unlist, we convert the list to vector and then wrap with data.frame.
data.frame(ind= seq_along(lst), Col1= as.numeric(unlist(lst)))
Or another option would be stack after naming the list elements
df1 <- transform(stack(setNames(lst, seq_along(lst))),
values = as.numeric(values))
It gives a two column dataset. From this we can do the plotting
Regarding the OP's approach about calling as.data.frame directly on the list, it does work in a different way as it calls on as.data.frame.list. For example, if we do as.data.frame on a vector, it uses as.data.frame.vector
as.data.frame(1:5)
# 1:5
#1 1
#2 2
#3 3
#4 4
#5 5
But, if we call as.data.frame.list
as.data.frame.list(1:5)
# X1L X2L X3L X4L X5L
#1 1 2 3 4 5
we get a data.frame with 'n' columns (based on the length of the vector).
Suppose, we do the same on a list
as.data.frame(as.list(1:5))
# X1L X2L X3L X4L X5L
#1 1 2 3 4 5
It uses the as.data.frame.list. To get the complete list of methods of as.data.frame,
methods('as.data.frame')
#[1] as.data.frame.aovproj* as.data.frame.array
# [3] as.data.frame.AsIs as.data.frame.character
# [5] as.data.frame.chron* as.data.frame.complex
# [7] as.data.frame.data.frame as.data.frame.data.table*
# [9] as.data.frame.Date as.data.frame.dates*
#[11] as.data.frame.default as.data.frame.difftime
#[13] as.data.frame.factor as.data.frame.ftable*
#[15] as.data.frame.function* as.data.frame.grouped_df*
#[17] as.data.frame.idf* as.data.frame.integer
#[19] as.data.frame.ITime* as.data.frame.list <-------
#[21] as.data.frame.logical as.data.frame.logLik*
#[23] as.data.frame.matrix as.data.frame.model.matrix
#[25] as.data.frame.noquote as.data.frame.numeric
#[27] as.data.frame.numeric_version as.data.frame.ordered
#[29] as.data.frame.POSIXct as.data.frame.POSIXlt
#[31] as.data.frame.raw as.data.frame.rowwise_df*
#[33] as.data.frame.table as.data.frame.tbl_cube*
#[35] as.data.frame.tbl_df* as.data.frame.tbl_dt*
#[37] as.data.frame.tbl_sql* as.data.frame.times*
#[39] as.data.frame.ts as.data.frame.vector

Resources