I have a .fasta file with many different sequences. My goal is to use the Biostrings package to convert each individual sequence to the amino acid sequence for it.
The .fasta file looks like this:
>Sequence 1
AAATTTGGGCCC
>Sequence 2
TTTGGGCCCAAA
Any help would be appreciated. Thank you.
The translate function will do what you want:
library(Biostrings)
`Sequence 1` <- DNAString("AAATTTGGGCCC")
`Sequence 2` <- DNAString("TTTGGGCCCAAA")
seq_1 <- translate(`Sequence 1`, no.init.codon=TRUE)
seq_1
#> 4-letter AAString object
#> seq: KFGP
seq_2 <- translate(`Sequence 2`, no.init.codon=TRUE)
seq_2
#> 4-letter AAString object
#> seq: FGPK
To read in the entire fasta file:
seqs <- Biostrings::readDNAStringSet("file.fasta", format = "fasta", use.names = TRUE)
seqs_translated <- translate(seqs, no.init.codon = TRUE)
seqs_translated
#> AAStringSet object of length 2:
#> width seq names
#> [1] 4 KFGP Sequence 1
#> [2] 4 FGPK Sequence 2
Edit
Your problem translating your fasta file is that the sequences use the 'full' alphabet, not just ATCG - you have 'No calls' ("N"), gaps ("-"), and ambivalent/unresolved calls e.g. "K" (Guanine or Thymine). I found these using sed:
grep -v ">" SEQUENCE_orf1ab.fasta | sed 's/[ATCG]//g' | sed '/^$/d'
# explanation: remove lines beginning with ">"
# then remove all A/T/C/G's and blank lines
# what you have left is causing the "not a base" error
If you remove these "non-base" bases using e.g.
sed '/^>/! s/[-NYRKW]//g' SEQUENCE_orf1ab.fasta > test.fasta
#explanation: in lines not beginning with ">", substitute all of the characters "-NYRKW" with nothing (i.e. delete them)
then the file is translated without issue:
seqs <- Biostrings::readDNAStringSet("test.fasta", format = "fasta", use.names = TRUE)
seqs_translated <- translate(seqs, no.init.codon = TRUE)
seqs_translated
#> AAStringSet object of length 91:
#> width seq names
#> [1] 7095 MESLVPGFNEKTHVQ...KTTELLFLVMFLLT MZ505877.1 |Sever...
#> [2] 7095 MESLVPGFNEKTHVQ...KTTELLFLVMFLLT MZ020653.1 |Sever...
#> [3] 7092 MESLVPGFNEKTHVQ...KTTELLFLVMFLLT MW988268.1 |Sever...
#> [4] 7095 MESLVPGFNEKTHVQ...KTTELLFLVMFLLT MW928277.1 |Sever...
#> [5] 7095 MESLVPGFNEKTHVQ...KTTELLFLVMFLLT MW885875.1 |Sever...
#> ... ... ...
#> [87] 7095 MESLVPGFNEKTHVQ...KTTELLFLVMFLLT MN996529.1 |Sever...
#> [88] 7095 MESLVPGFNEKTHVQ...KTTELLFLVMFLLT MN996530.1 |Sever...
#> [89] 7095 MESLVPGFNEKTHVQ...KTTELLFLVMFLLT MN996531.1 |Sever...
#> [90] 7094 MESLVPGFNEKTHVQ...KTTELLFLVMFLLT MN988713.1 |Sever...
#> [91] 7095 MESLVPGFNEKTHVQ...KTTELLFLVMFLLT MN975262.1 |Sever...
Related
I have this list of character stored in a variable called x:
x <-
c(
"images/logos/france2.png",
"images/logos/cnews.png",
"images/logos/lcp.png",
"images/logos/europe1.png",
"images/logos/rmc-bfmtv.png",
"images/logos/sudradio.png",
"images/logos/franceinfo.png"
)
pattern <- "images/logos/\\s*(.*?)\\s*.png"
regmatches(x, regexec(pattern, x))[[1]][2]
I wish to extract a portion of each chr string according to a pattern, like this function does, which works fine but only for the first item in the list.
pattern <- "images/logos/\\s*(.*?)\\s*.png"
y <- regmatches(x, regexec(pattern, x))[[1]][2]
Only returns:
"france2"
How can I apply the regmatches function to all items in the list in order to get a result like this?
[1] "france2" "europe1" "sudradio"
[4] "cnews" "rmc-bfmtv" "franceinfo"
[7] "lcp" "rmc" "lcp"
FYI this is a list of src tags that comes from a scraper
Try gsub
gsub(
".*/(.*)\\.png", "\\1",
c(
"images/logos/france2.png", "images/logos/cnews.png",
"images/logos/lcp.png", "images/logos/europe1.png",
"images/logos/rmc-bfmtv.png", "images/logos/sudradio.png",
"images/logos/franceinfo.png"
)
)
which gives
[1] "france2" "cnews" "lcp" "europe1" "rmc-bfmtv"
[6] "sudradio" "franceinfo"
Output of regmatches(..., regexec(...)) is a list. You may use sapply to extract the 2nd element from each element of the list.
sapply(regmatches(x, regexec(pattern, x)), `[[`, 2)
#[1] "france2" "europe1" "sudradio" "cnews" "rmc-bfmtv" "franceinfo"
#[7] "lcp" "rmc" "lcp"
You may also use the function basename + file_path_sans_ext from tools package which would give the required output directly.
tools::file_path_sans_ext(basename(x))
#[1] "france2" "europe1" "sudradio" "cnews" "rmc-bfmtv" "franceinfo"
#[7] "lcp" "rmc" "lcp"
A possible solution:
library(tidyverse)
df <- data.frame(
stringsAsFactors = FALSE,
strings = c("images/logos/france2.png","images/logos/cnews.png",
"images/logos/lcp.png","images/logos/europe1.png",
"images/logos/rmc-bfmtv.png","images/logos/sudradio.png",
"images/logos/franceinfo.png")
)
df %>%
mutate(strings = str_remove(strings, "images/logos/") %>%
str_remove("\\.png"))
#> strings
#> 1 france2
#> 2 cnews
#> 3 lcp
#> 4 europe1
#> 5 rmc-bfmtv
#> 6 sudradio
#> 7 franceinfo
Or even simpler:
library(tidyverse)
df %>%
mutate(strings = str_extract(strings, "(?<=images/logos/)(.*)(?=\\.png)"))
#> strings
#> 1 france2
#> 2 cnews
#> 3 lcp
#> 4 europe1
#> 5 rmc-bfmtv
#> 6 sudradio
#> 7 franceinfo
I'm been working with nested lists and recursive functions in R following this instructions. Now there is just one piece I miss to design an own function, which is getting a vector with the respective names sorted from the highest to the deepest level.
The input list is:
lst <- list(
title = "References and Plant Communities in 'SWEA-Dataveg'",
author = "Miguel Alvarez",
date = "Dezember 28, 2019",
"header-includes" = c(
"- \\usepackage[utf8]{inputenc}",
"- \\usepackage[T1]{fontenc}", "- \\usepackage{bibentry}",
"- \\nobibliography{sweareferences.bib}"),
output = list(pdf_document=list(citation_package="natbib")),
"biblio-style" = "unsrtnat",
bibliography = "sweareferences.bib",
papersize = "a4")
The structure of the output list will then looks like this (printed in the console). Herewith note the vector at lst$output$pdf_document$citation_package:
$title
[1] "title"
$author
[1] "author"
$date
[1] "date"
$`header-includes`
[1] "header-includes"
$output
$output$pdf_document
$output$pdf_document$citation_package
[1] "output"
[2] "pdf_document"
[3] "citation_package"
$`biblio-style`
[1] "biblio-style"
$bibliography
[1] "bibliography"
$papersize
[1] "papersize"
Of course, the function has to be recursive to be applied in any different case.
Here is one possible approach, using only base R. The following function f replaces each terminal node (or "leaf") of a recursive list x with the sequence of names leading up to it. It treats unnamed lists like named lists with all names equal to "", which is a useful generalization.
f <- function(x, s = NULL) {
if (!is.list(x)) {
return(s)
}
nms <- names(x)
if (is.null(nms)) {
nms <- character(length(x))
}
Map(f, x = x, s = Map(c, list(s), nms))
}
f(lst)
$title
[1] "title"
$author
[1] "author"
$date
[1] "date"
$`header-includes`
[1] "header-includes"
$output
$output$pdf_document
$output$pdf_document$citation_package
[1] "output" "pdf_document" "citation_package"
$`biblio-style`
[1] "biblio-style"
$bibliography
[1] "bibliography"
$papersize
[1] "papersize"
Using an external package, this can be done quite efficiently with rrapply() in the rrapply-package:
rrapply::rrapply(lst, f = function(x, .xparents) .xparents)
#> $title
#> [1] "title"
#>
#> $author
#> [1] "author"
#>
#> $date
#> [1] "date"
#>
#> $`header-includes`
#> [1] "header-includes"
#>
#> $output
#> $output$pdf_document
#> $output$pdf_document$citation_package
#> [1] "output" "pdf_document" "citation_package"
#>
#>
#>
#> $`biblio-style`
#> [1] "biblio-style"
#>
#> $bibliography
#> [1] "bibliography"
#>
#> $papersize
#> [1] "papersize"
I am assessing the impact of hotspot single nucleotide polymorphism (SNPs) from a next generation sequencing (NGS) experiment on the protein sequence of a virus. I have the reference DNA sequence and a list of hotspots. I need to first figure out the reading frame of where these hotspots are seen. To do this, I generated a DNAStringSetList with all human codons and want to use a vmatchpattern or matchpattern from the Biostrings package to figure out where the hotspots land in the codon reading frame.
I often struggle with lapply and other apply functions, so I tend to utilize for loops instead. I am trying to improve in this area, so welcome a apply solution should one be available.
Here is the code for the list of codons:
alanine <- DNAStringSet("GCN")
arginine <- DNAStringSet(c("CGN", "AGR", "CGY", "MGR"))
asparginine <- DNAStringSet("AAY")
aspartic_acid <- DNAStringSet("GAY")
asparagine_or_aspartic_acid <- DNAStringSet("RAY")
cysteine <- DNAStringSet("TGY")
glutamine <- DNAStringSet("CAR")
glutamic_acid <- DNAStringSet("GAR")
glutamine_or_glutamic_acid <- DNAStringSet("SAR")
glycine <- DNAStringSet("GGN")
histidine <- DNAStringSet("CAY")
start <- DNAStringSet("ATG")
isoleucine <- DNAStringSet("ATH")
leucine <- DNAStringSet(c("CTN", "TTR", "CTY", "YTR"))
lysine <- DNAStringSet("AAR")
methionine <- DNAStringSet("ATG")
phenylalanine <- DNAStringSet("TTY")
proline <- DNAStringSet("CCN")
serine <- DNAStringSet(c("TCN", "AGY"))
threonine <- DNAStringSet("ACN")
tyrosine <- DNAStringSet("TGG")
tryptophan <- DNAStringSet("TAY")
valine <- DNAStringSet("GTN")
stop <- DNAStringSet(c("TRA", "TAR"))
codons <- DNAStringSetList(list(alanine, arginine, asparginine, aspartic_acid, asparagine_or_aspartic_acid,
cysteine, glutamine, glutamic_acid, glutamine_or_glutamic_acid, glycine,
histidine, start, isoleucine, leucine, lysine, methionine, phenylalanine,
proline, serine, threonine, tyrosine, tryptophan, valine, stop))
Current for loop code:
reference_stringset <- DNAStringSet(covid)
codon_locations <- list()
for (i in 1:length(codons)) {
pattern <- codons[[i]]
codon_locations[i] <- vmatchPattern(pattern, reference_stringset)
}
Current error code. I am filtering the codon DNAStringSetList so that it is a DNAStringSet.
Error in normargPattern(pattern, subject) : 'pattern' must be a single string or an XString object
I can't give out the exact nucleotide sequence, but here is the COVID genome (link: https://www.ncbi.nlm.nih.gov/nuccore/NC_045512.2?report=fasta) to use as a reprex:
#for those not used to using .fasta files, first copy and past genome into notepad and save as a .fasta file
#use readDNAStringSet from Biostrings package to read in the .fasta file
filepath = #insert file path
covid <- readDNAStringSet(filepath)
For the current code, change the way the codons is formed. Currently the output of codons looks like this:
DNAStringSetList of length 24
[[1]] GCN
[[2]] CGN AGR CGY MGR
[[3]] AAY
[[4]] GAY
[[5]] RAY
[[6]] TGY
[[7]] CAR
[[8]] GAR
[[9]] SAR
[[10]] GGN
...
<14 more elements>
Change it from DNAStringSetList to a conglomerate DNAStringSet of the amino acids.
codons <- DNAStringSet(c(alanine, arginine, asparginine, aspartic_acid, asparagine_or_aspartic_acid,
cysteine, glutamine, glutamic_acid, glutamine_or_glutamic_acid, glycine,
histidine, start, isoleucine, leucine, lysine, methionine, phenylalanine,
proline, serine, threonine, tyrosine, tryptophan, valine, stop))
codons
DNAStringSet object of length 32:
width seq
[1] 3 GCN
[2] 3 CGN
[3] 3 AGR
[4] 3 CGY
[5] 3 MGR
... ... ...
[28] 3 TGG
[29] 3 TAY
[30] 3 GTN
[31] 3 TRA
[32] 3 TAR
When I run the script I get the following output with the SARS-CoV-2 isolate listed for the example (I'm showing a small slice)
codon_locations[27:28]
[[1]]
MIndex object of length 1
$`NC_045512.2 Severe acute respiratory syndrome coronavirus 2 isolate Wuhan-Hu-1, complete genome`
IRanges object with 0 ranges and 0 metadata columns:
start end width
<integer> <integer> <integer>
[[2]]
MIndex object of length 1
$`NC_045512.2 Severe acute respiratory syndrome coronavirus 2 isolate Wuhan-Hu-1, complete genome`
IRanges object with 554 ranges and 0 metadata columns:
start end width
<integer> <integer> <integer>
[1] 89 91 3
[2] 267 269 3
[3] 283 285 3
[4] 352 354 3
[5] 358 360 3
... ... ... ...
[550] 29261 29263 3
[551] 29289 29291 3
[552] 29472 29474 3
[553] 29559 29561 3
[554] 29793 29795 3
Looking at the ones that had an output, only those with the standard nucleotides ("ATCG", no wobbles) found matches. Those will need to be changed as well to search.
If you're on twitter, I suggest linking the question using the #rstats, #bioconductor, and #bioinformatics hashtags to generate some more traction, I've noticed that bioinformatic specific questions on SO don't generate as much buzz.
I am facing a problem to replace words in a tweet with the numeric value of their frequency.
I have already made a data frame showing the words ranked by their frequency.
Now I want to substitute the words in the tweets with the frequency rank of every word.
I attached snips of my data frames.
Tweets and word frequency data:
My goal is that a tweets looks like this:
[1] [3] [7] [11] [18] [12] [10] [5] [3] [44] [23] [46] [2] [90]
The [1] means that it is the most frequent word in the dataset.
Any help appreciated! :)
I think stringr::str_replace_all is an efficient way to go about it: just pass it a named vector with your word frequency and you're done.
See reprex below. The first few lines just generate random data; your frequency table looks like the df I generated below.
sentence <- "the quick brown fox jumps over the lazy dog"
sentence_split <- unique(as.character(stringr::str_split(string = sentence, pattern = " ", simplify = TRUE)))
names(sentence_split) <- sample(x = 1:1000, size = length(sentence_split))
df <- data.frame(word = sentence_split,
n = sample(x = 1:1000, size = length(sentence_split)))
df
#> word n
#> the 740
#> quick 192
#> brown 145
#> fox 809
#> jumps 700
#> over 910
#> lazy 352
#> dog 256
replace_vector <- paste0("[", df$n, "]")
names(replace_vector) <- df$word
stringr::str_replace_all(string = sentence, pattern = replace_vector)
#> [1] "[740] [192] [145] [809] [700] [910] [740] [352] [256]"
Created on 2021-07-24 by the reprex package (v2.0.0)
So I am trying to write into a FASTA file, it does write but for some reason when I open the file it starts with an empty > and then >SOMESEQID and so on. Could someone help?
When opening the file it looks like so:
>
>NP_001997.5 fibroblast growth factor 2 isoform 34 kDa [Homo sapiens] MVGVGGGDVEDVTPRPGGCQISGRGARGCNGIPGAAAWEAALPRRRPRRHPSVNPRSRAAGSPRTRGRRT EERPSGSRLGDRGRGRALPGGRLGGRGRGRAPERVGGRGRGRGTAAPRAAPAARGSRPGPAGTMAAGSIT TLPALPEDGGSGAFPPGHFKDPKRLYCKNGGFFLRIHPDGRVDGVREKSDPHIKLQLQAEERGVVSIKGV CANRYLAMKEDGRLLASKCVTDECFFFERLESNNYNTYRSRKYTSWYVALKRTGQYKLGSKTGPGQKAIL FLPMSAKS
FGF2 is a vector of ID something like so:
FGF2 = c("ID1","ID2", ...)
Here is my code:
files = entrez_fetch(id = FGF2, rettype = "fasta", db = "protein")
files
fastFile = write.fasta(sequences = files, names = names(files), file.out = "mySeqs.fasta")
You don't need to use write.fasta . That function most likely assumes some kind of data. Just use writeLines() :
library(rentrez)
a = entrez_fetch(id=c("NP_001997.5","NP_001348594.1"),
rettype = "fasta", db = "protein")
writeLines(a,"test.fa")
readLines("test.fa")
[1] ">NP_001997.5 fibroblast growth factor 2 isoform 34 kDa [Homo sapiens]"
[2] "MVGVGGGDVEDVTPRPGGCQISGRGARGCNGIPGAAAWEAALPRRRPRRHPSVNPRSRAAGSPRTRGRRT"
[3] "EERPSGSRLGDRGRGRALPGGRLGGRGRGRAPERVGGRGRGRGTAAPRAAPAARGSRPGPAGTMAAGSIT"
[4] "TLPALPEDGGSGAFPPGHFKDPKRLYCKNGGFFLRIHPDGRVDGVREKSDPHIKLQLQAEERGVVSIKGV"
[5] "CANRYLAMKEDGRLLASKCVTDECFFFERLESNNYNTYRSRKYTSWYVALKRTGQYKLGSKTGPGQKAIL"
[6] "FLPMSAKS"
[7] ""
[8] ">NP_001348594.1 fibroblast growth factor 2 isoform 18 kDa [Homo sapiens]"
[9] "MAAGSITTLPALPEDGGSGAFPPGHFKDPKRLYCKNGGFFLRIHPDGRVDGVREKSDPHIKLQLQAEERG"
[10] "VVSIKGVCANRYLAMKEDGRLLASKCVTDECFFFERLESNNYNTYRSRKYTSWYVALKRTGQYKLGSKTG"
[11] "PGQKAILFLPMSAKS"
[12] ""
Or read in using:
library(Biostrings)
readAAStringSet("test.fa")
A AAStringSet instance of length 2
width seq names
[1] 288 MVGVGGGDVEDVTPRPGGCQISG...YKLGSKTGPGQKAILFLPMSAKS NP_001997.5 fibro...
[2] 155 MAAGSITTLPALPEDGGSGAFPP...YKLGSKTGPGQKAILFLPMSAKS NP_001348594.1 fi...