(v)matchPattern DNAStringSetList of Codons to Reference DNAString - r

I am assessing the impact of hotspot single nucleotide polymorphism (SNPs) from a next generation sequencing (NGS) experiment on the protein sequence of a virus. I have the reference DNA sequence and a list of hotspots. I need to first figure out the reading frame of where these hotspots are seen. To do this, I generated a DNAStringSetList with all human codons and want to use a vmatchpattern or matchpattern from the Biostrings package to figure out where the hotspots land in the codon reading frame.
I often struggle with lapply and other apply functions, so I tend to utilize for loops instead. I am trying to improve in this area, so welcome a apply solution should one be available.
Here is the code for the list of codons:
alanine <- DNAStringSet("GCN")
arginine <- DNAStringSet(c("CGN", "AGR", "CGY", "MGR"))
asparginine <- DNAStringSet("AAY")
aspartic_acid <- DNAStringSet("GAY")
asparagine_or_aspartic_acid <- DNAStringSet("RAY")
cysteine <- DNAStringSet("TGY")
glutamine <- DNAStringSet("CAR")
glutamic_acid <- DNAStringSet("GAR")
glutamine_or_glutamic_acid <- DNAStringSet("SAR")
glycine <- DNAStringSet("GGN")
histidine <- DNAStringSet("CAY")
start <- DNAStringSet("ATG")
isoleucine <- DNAStringSet("ATH")
leucine <- DNAStringSet(c("CTN", "TTR", "CTY", "YTR"))
lysine <- DNAStringSet("AAR")
methionine <- DNAStringSet("ATG")
phenylalanine <- DNAStringSet("TTY")
proline <- DNAStringSet("CCN")
serine <- DNAStringSet(c("TCN", "AGY"))
threonine <- DNAStringSet("ACN")
tyrosine <- DNAStringSet("TGG")
tryptophan <- DNAStringSet("TAY")
valine <- DNAStringSet("GTN")
stop <- DNAStringSet(c("TRA", "TAR"))
codons <- DNAStringSetList(list(alanine, arginine, asparginine, aspartic_acid, asparagine_or_aspartic_acid,
cysteine, glutamine, glutamic_acid, glutamine_or_glutamic_acid, glycine,
histidine, start, isoleucine, leucine, lysine, methionine, phenylalanine,
proline, serine, threonine, tyrosine, tryptophan, valine, stop))
Current for loop code:
reference_stringset <- DNAStringSet(covid)
codon_locations <- list()
for (i in 1:length(codons)) {
pattern <- codons[[i]]
codon_locations[i] <- vmatchPattern(pattern, reference_stringset)
}
Current error code. I am filtering the codon DNAStringSetList so that it is a DNAStringSet.
Error in normargPattern(pattern, subject) : 'pattern' must be a single string or an XString object
I can't give out the exact nucleotide sequence, but here is the COVID genome (link: https://www.ncbi.nlm.nih.gov/nuccore/NC_045512.2?report=fasta) to use as a reprex:
#for those not used to using .fasta files, first copy and past genome into notepad and save as a .fasta file
#use readDNAStringSet from Biostrings package to read in the .fasta file
filepath = #insert file path
covid <- readDNAStringSet(filepath)

For the current code, change the way the codons is formed. Currently the output of codons looks like this:
DNAStringSetList of length 24
[[1]] GCN
[[2]] CGN AGR CGY MGR
[[3]] AAY
[[4]] GAY
[[5]] RAY
[[6]] TGY
[[7]] CAR
[[8]] GAR
[[9]] SAR
[[10]] GGN
...
<14 more elements>
Change it from DNAStringSetList to a conglomerate DNAStringSet of the amino acids.
codons <- DNAStringSet(c(alanine, arginine, asparginine, aspartic_acid, asparagine_or_aspartic_acid,
cysteine, glutamine, glutamic_acid, glutamine_or_glutamic_acid, glycine,
histidine, start, isoleucine, leucine, lysine, methionine, phenylalanine,
proline, serine, threonine, tyrosine, tryptophan, valine, stop))
codons
DNAStringSet object of length 32:
width seq
[1] 3 GCN
[2] 3 CGN
[3] 3 AGR
[4] 3 CGY
[5] 3 MGR
... ... ...
[28] 3 TGG
[29] 3 TAY
[30] 3 GTN
[31] 3 TRA
[32] 3 TAR
When I run the script I get the following output with the SARS-CoV-2 isolate listed for the example (I'm showing a small slice)
codon_locations[27:28]
[[1]]
MIndex object of length 1
$`NC_045512.2 Severe acute respiratory syndrome coronavirus 2 isolate Wuhan-Hu-1, complete genome`
IRanges object with 0 ranges and 0 metadata columns:
start end width
<integer> <integer> <integer>
[[2]]
MIndex object of length 1
$`NC_045512.2 Severe acute respiratory syndrome coronavirus 2 isolate Wuhan-Hu-1, complete genome`
IRanges object with 554 ranges and 0 metadata columns:
start end width
<integer> <integer> <integer>
[1] 89 91 3
[2] 267 269 3
[3] 283 285 3
[4] 352 354 3
[5] 358 360 3
... ... ... ...
[550] 29261 29263 3
[551] 29289 29291 3
[552] 29472 29474 3
[553] 29559 29561 3
[554] 29793 29795 3
Looking at the ones that had an output, only those with the standard nucleotides ("ATCG", no wobbles) found matches. Those will need to be changed as well to search.
If you're on twitter, I suggest linking the question using the #rstats, #bioconductor, and #bioinformatics hashtags to generate some more traction, I've noticed that bioinformatic specific questions on SO don't generate as much buzz.

Related

Replace words in a tweet with numeric value of their frequency

I am facing a problem to replace words in a tweet with the numeric value of their frequency.
I have already made a data frame showing the words ranked by their frequency.
Now I want to substitute the words in the tweets with the frequency rank of every word.
I attached snips of my data frames.
Tweets and word frequency data:
My goal is that a tweets looks like this:
[1] [3] [7] [11] [18] [12] [10] [5] [3] [44] [23] [46] [2] [90]
The [1] means that it is the most frequent word in the dataset.
Any help appreciated! :)
I think stringr::str_replace_all is an efficient way to go about it: just pass it a named vector with your word frequency and you're done.
See reprex below. The first few lines just generate random data; your frequency table looks like the df I generated below.
sentence <- "the quick brown fox jumps over the lazy dog"
sentence_split <- unique(as.character(stringr::str_split(string = sentence, pattern = " ", simplify = TRUE)))
names(sentence_split) <- sample(x = 1:1000, size = length(sentence_split))
df <- data.frame(word = sentence_split,
n = sample(x = 1:1000, size = length(sentence_split)))
df
#> word n
#> the 740
#> quick 192
#> brown 145
#> fox 809
#> jumps 700
#> over 910
#> lazy 352
#> dog 256
replace_vector <- paste0("[", df$n, "]")
names(replace_vector) <- df$word
stringr::str_replace_all(string = sentence, pattern = replace_vector)
#> [1] "[740] [192] [145] [809] [700] [910] [740] [352] [256]"
Created on 2021-07-24 by the reprex package (v2.0.0)

Extract string between characters and include new lines? [duplicate]

This question already has answers here:
How do I match any character across multiple lines in a regular expression?
(26 answers)
Closed 2 years ago.
Having issues using string to extract string between two characters. I need to get the everything between these characters including the line breaks:
reprEx <- "2100\n\nELECTRONIC WITHDRAWALS| om93 CCD ID: 964En To American Hon\nELECTRONIC WITHDRAWALSda Finance Corp 295.00\nTotal Electronic Withdrawals $93,735.18\n[OTHER WITHDRAWALS| WITHDRAWALS\nDATE DES $93,735.18\n[OTHER WITHDRAWALS| WITHDRAWALS\nDATE DESCRIPTION AMOUNT\n04/09 Pmt ID 7807388390 Refunded IN Error On 04/08"
desiredResult <- "| om93 CCD ID: 964En To American Hon\nELECTRONIC WITHDRAWALSda Finance Corp 295.00\nTotal Electronic Withdrawals $93,735.18\n[OTHER WITHDRAWALS| WITHDRAWALS\nDATE DES $93,735.18\n["
I have tried using:
desiredResult <- str_match(reprEx, "ELECTRONIC WITHDRAWALS\\s*(.*?)\\s*OTHER WITHDRAWALS")[,2]
but I just get NA back. I just want to get everything in the string that is between the first occurrence of ELECTRONIC WITHDRAWALS and the first occurrence of OTHER WITHDRAWALS. I can't tell if the new lines are what is causing the problem
I think your desiredOutput is inconsistent with your paragraph, I'll prioritize the latter:
everything in the string that is between the first occurrence of ELECTRONIC WITHDRAWALS and the first occurrence of OTHER WITHDRAWALS
first <- gregexpr("ELECTRONIC WITHDRAWALS", reprEx)[[1]]
first
# [1] 7 66
# attr(,"match.length")
# [1] 22 22
# attr(,"index.type")
# [1] "chars"
# attr(,"useBytes")
# [1] TRUE
# generalized a little, in case you change the reprEx string
leftside <- if (first[1] > 0) first[1] + attr(first, "match.length")[1] else 1
second <- gregexpr("OTHER WITHDRAWALS", substr(reprEx, leftside, nchar(reprEx)))[[1]]
second
# [1] 124 176
# attr(,"match.length")
# [1] 17 17
# attr(,"index.type")
# [1] "chars"
# attr(,"useBytes")
# [1] TRUE
rightside <- leftside + second[1] - 2
c(leftside, rightside)
# [1] 29 151
substr(reprEx, leftside, rightside)
# [1] "| om93 CCD ID: 964En To American Hon\nELECTRONIC WITHDRAWALSda Finance Corp 295.00\nTotal Electronic Withdrawals $93,735.18\n["

How to concatenate (merge) AAStringSets by name?

In bioinformatics/microbial ecology literature a fairly common practice is to concatenate multiple sequence alignments of multiple genes prior to building phylogenetic trees. In R terminology it may be clearer to say 'merge' these sequences by the organism they came from, but I'm sure examples are better.
Say these are two multiple sequence alignments.
library(Biostrings)
set1<-AAStringSet(c("IVR", "RDG", "LKS"))
names(set1)<-paste("org", 1:3, sep="_")
set2<-AAStringSet(c("VRT", "RKG", "AST"))
names(set2)<-paste("org", 2:4, sep="_")
set1
A AAStringSet instance of length 3
width seq names
[1] 3 IVR org_1
[2] 3 RDG org_2
[3] 3 LKS org_3
set2
A AAStringSet instance of length 3
width seq names
[1] 3 VRT org_2
[2] 3 RKG org_3
[3] 3 AST org_4
The correct concatenation of these sequences would be
A AAStringSet instance of length 4
width seq names
[1] 6 IVR--- org_1
[2] 6 RDGVRT org_2
[3] 6 LKSRKG org_3
[4] 6 ---AST org_4
The "-" notes a 'gap' (lack of amino acid) in that position, or in this case a lack of a gene to concatenate.
I thought there would be a function to do this in BioStrings, MSA, DECIPHER, or other related packages, but have been unable to find one.
I found the following Q&As, each does not provide the desired output as described.
1: https://support.bioconductor.org/p/38955/
output
A AAStringSet instance of length 6
width seq names
[1] 3 IVR org_1
[2] 3 RDG org_2
[3] 3 LKS org_3
[4] 3 VRT org_2
[5] 3 RKG org_3
[6] 3 AST org_4
May be better described as 'appending' the sequences (joins the two sets vertically).
2: https://support.bioconductor.org/p/39878/
output
A AAStringSet instance of length 2
width seq
[1] 9 IVRRDGLKS
[2] 9 VRTRKGAST
Concatenates sequences in each set, a complete chimera of each set (certainly not desired).
3: How to concatenate two DNAStringSet sequences per sample in R?
output
A AAStringSet instance of length 3
width seq
[1] 6 IVRVRT
[2] 6 RDGRKG
[3] 6 LKSAST
Creates chimeras of sequences by the order they are in. Even worse with different number of sequences (loops and concatenates shorter set...)
4: https://www.biostars.org/p/115192/
Output
A AAStringSet instance of length 2
width seq
[1] 3 IVR
[2] 3 VRT
Only appends the first sequence from each set, not sure why anyone wants this...
I would normally think these kinds of processes would be done with some combination of bash and Python, but I'm using the DECIPHER multiple sequence aligner in R, so it makes sense to do the rest of the processing in R. In the process of writing up this question I came up with an answer that I will post, but I'm kind of expecting someone to point me to the manual I missed that describes the function that does this. Thanks!
So I am a somewhat fanatical user of data.table in R, among many things it is great to merge datasets by names. I found Biostrings::AAStringSets can be converted to matrices using as.matrix and these can be converted to data.table and merged.
set1.dt<-data.table(as.matrix(set1), keep.rownames = TRUE)
set2.dt<-data.table(as.matrix(set2), keep.rownames = TRUE)
set12.dt<-merge(set1.dt, set2.dt, by="rn", all=TRUE)
set12.dt
rn V1.x V2.x V3.x V1.y V2.y V3.y
1: org_1 I V R <NA> <NA> <NA>
2: org_2 R D G V R T
3: org_3 L K S R K G
4: org_4 <NA> <NA> <NA> A S T
This is the correct merge, but needs more work to get the final result.
Need to replace "NA" with "-". I always need to look up this question to remember the best way to do this with a data.table.
Fastest way to replace NAs in a large data.table
#slightly modified from original, added arg "x"
f_dowle = function(dt, x) { # see EDIT later for more elegant solution
na.replace = function(v,value=x) { v[is.na(v)] = value; v }
for (i in names(dt))
eval(parse(text=paste("dt[,",i,":=na.replace(",i,")]")))
}
f_dowle(set12.dt, "-")
Concatenate the sequences (not included the names with !"rn")
set12<-apply(set12.dt[ ,!"rn"], 1, paste, collapse="")
Convert back to AAStringSet and add back names
set12<-AAStringSet(set12)
names(set12)<-set12.dt$rn
Desired output
set12
A AAStringSet instance of length 4
width seq names
[1] 6 IVR--- org_1
[2] 6 RDGVRT org_2
[3] 6 LKSRKG org_3
[4] 6 ---AST org_4
This works, but seems quite cumbersome, especially converting between different data formats. Obviously can wrap it into a function to use more easily, but again seems like this should already be a function in some Bioconductor package...

Conversion of Elastic list data output to R data frame slow

I have an output from Elastic that takes very long to convert to an R data frame. I have tried multiple options; and feel there may be some trick there to quicken the process.
The structure of the list is as follows. The list has aggregated data over 29 days (say). If lets say the Elastic query output is in list 'v_day' then l[[5]]$articles_over_time$buckets[1:29] represents each of the 29 days
length(v_day[[5]]$articles_over_time$buckets)
[1] 29
page(v_day[[5]]$articles_over_time$buckets[[1]],method="print")
$key
[1] 1446336000000
$doc_count
[1] 35332
$group_by_state
$group_by_state$doc_count_error_upper_bound
[1] 0
$group_by_state$sum_other_doc_count
[1] 0
$group_by_state$buckets
$group_by_state$buckets[[1]]
$group_by_state$buckets[[1]]$key
[1] "detail"
$group_by_state$buckets[[1]]$doc_count
[1] 876
There is a "key" value here right at the top here (1446336000000) that I am interested in (lets call it "time bucket key").
Within each day(lets take day i), "v_day[[5]]$articles_over_time$buckets[[i]]$group_by_state$buckets" has more data I am interested in. This is an aggregation over each property (property is an entity in the scheme of things here).
page(v_day[[5]]$articles_over_time$buckets[[i]]$group_by_state$buckets,method="print")
[[1]]
[[1]]$key
[1] "detail"
[[1]]$doc_count
[1] 876
[[2]]
[[2]]$key
[1] "ff8081814fdf2a9f014fdf80b05302e0"
[[2]]$doc_count
[1] 157
[[3]]
[[3]]$key
[1] "ff80818150a7d5930150a82abbc50477"
[[3]]$doc_count
[1] 63
[[4]]
[[4]]$key
[1] "ff8081814ff5f428014ffb5de99f1da5"
[[4]]$doc_count
[1] 57
[[5]]
[[5]]$key
[1] "ff8081815038099101503823fe5d00d9"
[[5]]$doc_count
[1] 56
This shows data over 5 properties in day i, each property has a "key" (lets call it "property bucket key") and a "doc_count" that I am interested in.
Eventually I want a data frame with "time bucket key", "property bucket key", "doc count".
Currently I am looping over using the below code:
v <- NULL
ndays <- length(v_day[[5]]$articles_over_time$buckets)
for (i in 1:ndays) {
v1 <- do.call("rbind", lapply(v_day[[5]]$articles_over_time$buckets[[i]]$group_by_state$buckets, data.frame))
th_dt <- as.POSIXct(v_day[[5]]$articles_over_time$buckets[[i]]$key / 1000, origin="1970-01-01")
v1$view_date <- th_dt
v <- rbind(v, v1)
msg <- sprintf("Read views for %s. Found %d \n", th_dt, sum(v1$doc_count))
cat(msg)
}
v

extract vertex from a R graph

I have the following problem.
A directed graph called tutti
I have a vector called tabellaerrori containing a vertex in each position
Now my problem is:
I want to create an array cointaining the list of vertex which are both in tutti graph and in errori vector.
I used the following code but it doesn't work:
risultato<-as.character(intersect(tabellaerrori,V(tutti)))
It gives me back always the content of tabellaerrori
What's wrong ?
For those who didn't suffer through the non-downloadable google plus image gallery, the actual line generating the error is:
graph.neighborhood(tutti, vcount(tutti), risultato, "out")
## Error in as.igraph.vs(graph, nodes) : Invalid vertex names
From the help - graph.neighborhood(graph, order, nodes=V(graph), mode=c("all", "out", "in")) is expecting nodes to be an actual igraph vertex sequence. You just need to make sure your intersected nodes are in that form.
Here's what jbaums meant by a reproducible example (provided I've made the right assumptions from your screen captures):
library(igraph)
set.seed(1492) # makes this more reproducible
# simulate your overall graph
tutti <- graph.full(604, directed=TRUE)
V(tutti)$name <- as.character(sample(5000, 604))
# simulate your nodes
tabellaerrori <- as.character(c(sample(V(tutti), 79), sample(6000:6500, 70)))
names(tabellaerrori) <- as.numeric(tabellaerrori)
# give a brief view of the overall data objects
head(V(tutti))
## Vertex sequence:
## [1] "1389" "1081" "922" "553" "261" "42"
length(V(tutti))
## [1] 604
head(tabellaerrori)
## 293 415 132 299 408 526
## "293" "415" "132" "299" "408" "526"
length(tabellaerrori)
## [1] 149
# for your answer, find the intersection of the vertext *names*
risultato <- as.character(intersect(tabellaerrori, V(tutti)$name))
risultato
## Vertex sequence:
## [1] "293" "132" "155" "261" "68" "381" "217" "394" "581"
# who are the ppl in your neighborhood
graph.neighborhood(tutti, vcount(tutti), risultato, "out")
## [[1]]
## IGRAPH DN-- 604 364212 -- Full graph
## + attr: name (g/c), loops (g/l), name (v/c)
##
## [[2]]
## IGRAPH DN-- 604 364212 -- Full graph
## + attr: name (g/c), loops (g/l), name (v/c)
##
## ... (a few more)
What you were really doing before (i.e. what intersect was doing under the covers) is:
risultato <- as.character(intersect(tabellaerrori, as.character(V(tutti))))
hence, your Invalid vertex names error.

Resources