I am trying to write sequences to a FASTA file. The file is written, but when I open it, it starts with an empty > line, followed by >SOMESEQID and so on. Could someone help?
When I open the file, it looks like this:
>
>NP_001997.5 fibroblast growth factor 2 isoform 34 kDa [Homo sapiens]
MVGVGGGDVEDVTPRPGGCQISGRGARGCNGIPGAAAWEAALPRRRPRRHPSVNPRSRAAGSPRTRGRRT
EERPSGSRLGDRGRGRALPGGRLGGRGRGRAPERVGGRGRGRGTAAPRAAPAARGSRPGPAGTMAAGSIT
TLPALPEDGGSGAFPPGHFKDPKRLYCKNGGFFLRIHPDGRVDGVREKSDPHIKLQLQAEERGVVSIKGV
CANRYLAMKEDGRLLASKCVTDECFFFERLESNNYNTYRSRKYTSWYVALKRTGQYKLGSKTGPGQKAIL
FLPMSAKS
FGF2 is a vector of IDs, something like this:
FGF2 = c("ID1","ID2", ...)
Here is my code:
files = entrez_fetch(id = FGF2, rettype = "fasta", db = "protein")
files
fastFile = write.fasta(sequences = files, names = names(files), file.out = "mySeqs.fasta")
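You don't need write.fasta here. That function expects the sequences and their names as separate inputs, whereas entrez_fetch() already returns FASTA-formatted text (names(files) is NULL, which is most likely where the empty > comes from). Just use writeLines():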
library(rentrez)
a = entrez_fetch(id=c("NP_001997.5","NP_001348594.1"),
rettype = "fasta", db = "protein")
writeLines(a,"test.fa")
readLines("test.fa")
[1] ">NP_001997.5 fibroblast growth factor 2 isoform 34 kDa [Homo sapiens]"
[2] "MVGVGGGDVEDVTPRPGGCQISGRGARGCNGIPGAAAWEAALPRRRPRRHPSVNPRSRAAGSPRTRGRRT"
[3] "EERPSGSRLGDRGRGRALPGGRLGGRGRGRAPERVGGRGRGRGTAAPRAAPAARGSRPGPAGTMAAGSIT"
[4] "TLPALPEDGGSGAFPPGHFKDPKRLYCKNGGFFLRIHPDGRVDGVREKSDPHIKLQLQAEERGVVSIKGV"
[5] "CANRYLAMKEDGRLLASKCVTDECFFFERLESNNYNTYRSRKYTSWYVALKRTGQYKLGSKTGPGQKAIL"
[6] "FLPMSAKS"
[7] ""
[8] ">NP_001348594.1 fibroblast growth factor 2 isoform 18 kDa [Homo sapiens]"
[9] "MAAGSITTLPALPEDGGSGAFPPGHFKDPKRLYCKNGGFFLRIHPDGRVDGVREKSDPHIKLQLQAEERG"
[10] "VVSIKGVCANRYLAMKEDGRLLASKCVTDECFFFERLESNNYNTYRSRKYTSWYVALKRTGQYKLGSKTG"
[11] "PGQKAILFLPMSAKS"
[12] ""
Or read the file back in using:
library(Biostrings)
readAAStringSet("test.fa")
A AAStringSet instance of length 2
width seq names
[1] 288 MVGVGGGDVEDVTPRPGGCQISG...YKLGSKTGPGQKAILFLPMSAKS NP_001997.5 fibro...
[2] 155 MAAGSITTLPALPEDGGSGAFPP...YKLGSKTGPGQKAILFLPMSAKS NP_001348594.1 fi...
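For anyone who wants to check this behaviour without querying NCBI, here is a minimal base R sketch. It assumes, as the output above suggests, that the fetched record is a single string with embedded newlines. fasta_text is a made-up stand-in for the entrez_fetch() result, and the sequences in it are invented.

```r
# Hypothetical stand-in for the single string returned by entrez_fetch()
fasta_text <- ">seq1 example protein\nMVGVGG\nDVEDVT\n\n>seq2 example protein\nMAAGSI\n"

# writeLines() writes the text as-is, so the embedded newlines become file lines
out_file <- tempfile(fileext = ".fa")
writeLines(fasta_text, out_file)

# Read the file back and pick out the header lines
fasta_lines <- readLines(out_file)
headers <- fasta_lines[startsWith(fasta_lines, ">")]
print(headers)
```

headers should then hold only the two real record headers, with no empty > line in front of them.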
I am assessing the impact of hotspot single nucleotide polymorphisms (SNPs) from a next-generation sequencing (NGS) experiment on the protein sequence of a virus. I have the reference DNA sequence and a list of hotspots. I first need to figure out the reading frame at the positions where these hotspots occur. To do this, I generated a DNAStringSetList with all human codons and want to use vmatchPattern or matchPattern from the Biostrings package to figure out where the hotspots land in the codon reading frame.
I often struggle with lapply and the other apply functions, so I tend to use for loops instead. I am trying to improve in this area, so I would welcome an apply-based solution if one is available.
Here is the code for the list of codons:
alanine <- DNAStringSet("GCN")
arginine <- DNAStringSet(c("CGN", "AGR", "CGY", "MGR"))
asparginine <- DNAStringSet("AAY")
aspartic_acid <- DNAStringSet("GAY")
asparagine_or_aspartic_acid <- DNAStringSet("RAY")
cysteine <- DNAStringSet("TGY")
glutamine <- DNAStringSet("CAR")
glutamic_acid <- DNAStringSet("GAR")
glutamine_or_glutamic_acid <- DNAStringSet("SAR")
glycine <- DNAStringSet("GGN")
histidine <- DNAStringSet("CAY")
start <- DNAStringSet("ATG")
isoleucine <- DNAStringSet("ATH")
leucine <- DNAStringSet(c("CTN", "TTR", "CTY", "YTR"))
lysine <- DNAStringSet("AAR")
methionine <- DNAStringSet("ATG")
phenylalanine <- DNAStringSet("TTY")
proline <- DNAStringSet("CCN")
serine <- DNAStringSet(c("TCN", "AGY"))
threonine <- DNAStringSet("ACN")
tyrosine <- DNAStringSet("TGG")  # note: TGG actually encodes tryptophan
tryptophan <- DNAStringSet("TAY")  # note: TAY actually encodes tyrosine
valine <- DNAStringSet("GTN")
stop <- DNAStringSet(c("TRA", "TAR"))
codons <- DNAStringSetList(list(alanine, arginine, asparginine, aspartic_acid, asparagine_or_aspartic_acid,
cysteine, glutamine, glutamic_acid, glutamine_or_glutamic_acid, glycine,
histidine, start, isoleucine, leucine, lysine, methionine, phenylalanine,
proline, serine, threonine, tyrosine, tryptophan, valine, stop))
Current for loop code:
reference_stringset <- DNAStringSet(covid)
codon_locations <- list()
for (i in 1:length(codons)) {
pattern <- codons[[i]]
codon_locations[i] <- vmatchPattern(pattern, reference_stringset)
}
This is the error I currently get, even though I am subsetting the codon DNAStringSetList so that each pattern is a DNAStringSet:
Error in normargPattern(pattern, subject) : 'pattern' must be a single string or an XString object
I can't give out the exact nucleotide sequence, but here is the COVID genome (link: https://www.ncbi.nlm.nih.gov/nuccore/NC_045512.2?report=fasta) to use as a reprex:
# for those not used to .fasta files: first copy and paste the genome into Notepad and save it as a .fasta file
# then use readDNAStringSet from the Biostrings package to read in the .fasta file
filepath <- "path/to/genome.fasta"  # placeholder: insert your own file path
covid <- readDNAStringSet(filepath)
To make the current code work, change the way codons is built. At the moment, printing codons gives:
DNAStringSetList of length 24
[[1]] GCN
[[2]] CGN AGR CGY MGR
[[3]] AAY
[[4]] GAY
[[5]] RAY
[[6]] TGY
[[7]] CAR
[[8]] GAR
[[9]] SAR
[[10]] GGN
...
<14 more elements>
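Change it from a DNAStringSetList to a single combined DNAStringSet of all the codon patterns, so that codons[[i]] is a single DNAString, which is what vmatchPattern expects as its pattern: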
codons <- DNAStringSet(c(alanine, arginine, asparginine, aspartic_acid, asparagine_or_aspartic_acid,
cysteine, glutamine, glutamic_acid, glutamine_or_glutamic_acid, glycine,
histidine, start, isoleucine, leucine, lysine, methionine, phenylalanine,
proline, serine, threonine, tyrosine, tryptophan, valine, stop))
codons
DNAStringSet object of length 32:
width seq
[1] 3 GCN
[2] 3 CGN
[3] 3 AGR
[4] 3 CGY
[5] 3 MGR
... ... ...
[28] 3 TGG
[29] 3 TAY
[30] 3 GTN
[31] 3 TRA
[32] 3 TAR
When I run the script on the SARS-CoV-2 isolate given as the example, I get the following output (only a small slice is shown):
codon_locations[27:28]
[[1]]
MIndex object of length 1
$`NC_045512.2 Severe acute respiratory syndrome coronavirus 2 isolate Wuhan-Hu-1, complete genome`
IRanges object with 0 ranges and 0 metadata columns:
start end width
<integer> <integer> <integer>
[[2]]
MIndex object of length 1
$`NC_045512.2 Severe acute respiratory syndrome coronavirus 2 isolate Wuhan-Hu-1, complete genome`
IRanges object with 554 ranges and 0 metadata columns:
start end width
<integer> <integer> <integer>
[1] 89 91 3
[2] 267 269 3
[3] 283 285 3
[4] 352 354 3
[5] 358 360 3
... ... ... ...
[550] 29261 29263 3
[551] 29289 29291 3
[552] 29472 29474 3
[553] 29559 29561 3
[554] 29793 29795 3
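Looking at the results, only the patterns made up solely of standard nucleotides (A, T, C, G; no wobble codes) found matches. This is because vmatchPattern defaults to fixed = TRUE, under which an IUPAC ambiguity code in the pattern matches only that same letter in the subject. The patterns containing ambiguity codes will need to be searched differently, for example by passing fixed = FALSE so that the codes act as wildcards.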
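Since an apply-style solution was asked for and the wobble codes still need handling, here is a minimal base R sketch of the idea. It is not the Biostrings approach: it treats each IUPAC code as a regular-expression character class and uses lapply over a named character vector. The names iupac, codon_to_regex and find_codon are made up for illustration, and subject is a toy sequence rather than the real genome.

```r
# IUPAC ambiguity codes mapped to regex character classes (DNA alphabet)
iupac <- c(A = "A", C = "C", G = "G", T = "T",
           R = "[AG]", Y = "[CT]", S = "[CG]", W = "[AT]",
           K = "[GT]", M = "[AC]", B = "[CGT]", D = "[AGT]",
           H = "[ACT]", V = "[ACG]", N = "[ACGT]")

# Turn a codon pattern such as "GCN" into the regex "GC[ACGT]"
codon_to_regex <- function(codon) {
  paste(iupac[strsplit(codon, "")[[1]]], collapse = "")
}

# Start positions of all matches of one codon pattern in a sequence string
find_codon <- function(codon, subject) {
  # the lookahead makes each match zero-width, so overlapping hits are kept
  hits <- gregexpr(paste0("(?=", codon_to_regex(codon), ")"), subject, perl = TRUE)[[1]]
  starts <- as.integer(hits)
  starts[starts > 0]  # gregexpr returns -1 when there is no match
}

subject <- "ATGGCTTGGTAA"  # toy sequence, not the real genome
codons <- c(alanine = "GCN", tryptophan = "TGG", stop = "TAR")

# lapply keeps the names, so the result is a named list of start positions
codon_starts <- lapply(codons, find_codon, subject = subject)

# Reading frame (0, 1 or 2) of each hit relative to position 1
codon_frames <- lapply(codon_starts, function(s) (s - 1) %% 3)
print(codon_starts)
```

The same lapply pattern should carry over to a DNAStringSet of patterns, with the function body replaced by a vmatchPattern call.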
If you're on Twitter, I suggest linking the question with the #rstats, #bioconductor, and #bioinformatics hashtags to generate more traction. I've noticed that bioinformatics-specific questions on SO don't generate as much buzz.
I have a .fasta file with many different sequences. My goal is to use the Biostrings package to translate each individual sequence into its amino acid sequence.
The .fasta file looks like this:
>Sequence 1
AAATTTGGGCCC
>Sequence 2
TTTGGGCCCAAA
Any help would be appreciated. Thank you.
The translate function will do what you want:
library(Biostrings)
`Sequence 1` <- DNAString("AAATTTGGGCCC")
`Sequence 2` <- DNAString("TTTGGGCCCAAA")
seq_1 <- translate(`Sequence 1`, no.init.codon=TRUE)
seq_1
#> 4-letter AAString object
#> seq: KFGP
seq_2 <- translate(`Sequence 2`, no.init.codon=TRUE)
seq_2
#> 4-letter AAString object
#> seq: FGPK
To read in the entire fasta file:
seqs <- Biostrings::readDNAStringSet("file.fasta", format = "fasta", use.names = TRUE)
seqs_translated <- translate(seqs, no.init.codon = TRUE)
seqs_translated
#> AAStringSet object of length 2:
#> width seq names
#> [1] 4 KFGP Sequence 1
#> [2] 4 FGPK Sequence 2
Edit
The problem with translating your fasta file is that the sequences use the 'full' alphabet, not just ATCG: you have no-calls ("N"), gaps ("-"), and ambiguous/unresolved calls, e.g. "K" (guanine or thymine). I found these using grep and sed:
grep -v ">" SEQUENCE_orf1ab.fasta | sed 's/[ATCG]//g' | sed '/^$/d'
# explanation: remove lines beginning with ">"
# then remove all A/T/C/G's and blank lines
# what you have left is causing the "not a base" error
If you remove these "non-base" bases using e.g.
sed '/^>/! s/[-NYRKW]//g' SEQUENCE_orf1ab.fasta > test.fasta
#explanation: in lines not beginning with ">", substitute all of the characters "-NYRKW" with nothing (i.e. delete them)
then the file is translated without issue:
seqs <- Biostrings::readDNAStringSet("test.fasta", format = "fasta", use.names = TRUE)
seqs_translated <- translate(seqs, no.init.codon = TRUE)
seqs_translated
#> AAStringSet object of length 91:
#> width seq names
#> [1] 7095 MESLVPGFNEKTHVQ...KTTELLFLVMFLLT MZ505877.1 |Sever...
#> [2] 7095 MESLVPGFNEKTHVQ...KTTELLFLVMFLLT MZ020653.1 |Sever...
#> [3] 7092 MESLVPGFNEKTHVQ...KTTELLFLVMFLLT MW988268.1 |Sever...
#> [4] 7095 MESLVPGFNEKTHVQ...KTTELLFLVMFLLT MW928277.1 |Sever...
#> [5] 7095 MESLVPGFNEKTHVQ...KTTELLFLVMFLLT MW885875.1 |Sever...
#> ... ... ...
#> [87] 7095 MESLVPGFNEKTHVQ...KTTELLFLVMFLLT MN996529.1 |Sever...
#> [88] 7095 MESLVPGFNEKTHVQ...KTTELLFLVMFLLT MN996530.1 |Sever...
#> [89] 7095 MESLVPGFNEKTHVQ...KTTELLFLVMFLLT MN996531.1 |Sever...
#> [90] 7094 MESLVPGFNEKTHVQ...KTTELLFLVMFLLT MN988713.1 |Sever...
#> [91] 7095 MESLVPGFNEKTHVQ...KTTELLFLVMFLLT MN975262.1 |Sever...
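If you would rather stay in R than use grep and sed, the same check for non-ATCG characters can be sketched in base R. This is only an illustration: fasta_lines is a small made-up input standing in for readLines() on your own file.

```r
# Made-up FASTA lines; in practice these would come from readLines() on your file
fasta_lines <- c(">Sequence 1", "AAATTTGGGCCC",
                 ">Sequence 2", "TTTGNGCC-AAK")

# Keep only the sequence lines (drop headers starting with ">")
seq_lines <- fasta_lines[!startsWith(fasta_lines, ">")]

# Delete every A/C/G/T; whatever is left is not a plain base
leftover <- gsub("[ACGT]", "", seq_lines)
chars <- unlist(strsplit(leftover, ""))
chars <- chars[nzchar(chars)]

# Count each offending character
non_base_counts <- table(chars)
print(non_base_counts)
```

Here the table should report one each of "-", "K" and "N", which are the kinds of characters that trigger the "not a base" error.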
I want to write a script in R that allows me to import MSG files and store the information in a table. The fields may vary by course, so the column names are defined based on the first MSG file being imported.
The import and extraction are already working (special thanks to the user "January")
What does not work is filling in the table, which consists of two steps: adding the column names and filling in the rows.
I've tried using unlist to prepare the contents of the lists so that I can add them as columns and rows to a table.
Anmeldung <- gsub("^\\s+", "", Anmeldung) # remove whitespace at the beginning
Anmeldung <- gsub("\\s+$", "", Anmeldung) # remove whitespace at the end
words <- strsplit(Anmeldung, " *[\n\r]+ *")[[1]]
fields <- as.list(words[seq(1, length(words), 2)])
information <- as.list(words[seq(2, length(words), 2)])
resTab1 = data.frame(t(unlist(fields)))
resTab2 = data.frame(t(unlist(information)))
colnames(resTab2) = c(resTab1)
variable.names(resTab2)
When I try to create the table, this error appears:
colnames(resTab2) = c(resTab1)
Error in names(x) <- value :
'names' attribute [22] must be the same length as the vector [21]
This is what the lists fields and information look like:
Fields
> fields
[[1]]
[1] "Anrede"
[[2]]
[1] "Vorname"
[[3]]
[1] "Name"
[[4]]
[1] "Email (für Kontaktaufnahme)"
[[5]]
[1] "Telefon/Mobile (geschäftlich)"
[[6]]
[1] "Telefon/Mobile (privat)"
[[7]]
[1] "Strasse/Nr."
Information:
> information
[[1]]
[1] "Herr"
[[2]]
[1] "James"
[[3]]
[1] "Bond"
[[4]]
[1] "james.bond#email.com"
[[5]]
[1] "007 000 77 07"
[[6]]
[1] "007 000 77 07"
[[7]]
[1] "Lampenstrasse 8"
The error means you are assigning 22 names (taken from resTab1) to resTab2, which has only 21 columns. The number of names must match the number of elements. For example:
x <- c(1,2)
y <- c("a","b","c")
names(x) <- y
#Error in names(x) <- y :
#'names' attribute [3] must be the same length as the vector [2]
EDIT:
Use unlist to flatten both lists, then use fields as the names of information:
information <- unlist(information)
fields <- unlist(fields)
names(information) <- fields
information
#OUTPUT
#Anrede 'Herr'
#Vorname 'James'
#Name 'Bond'
#Email (für Kontaktaufnahme) 'james.bond#email.com'
#Telefon/Mobile (geschäftlich) '007 000 77 07'
#Telefon/Mobile (privat) '007 000 77 07'
#Strasse/Nr. 'Lampenstrasse 8'
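To get from the named vector to an actual table row, one option is the sketch below. It assumes fields and information have the same length (which, going by the error message, is not yet true for the full data), and it uses a shortened subset of the fields shown above. resTab is just an illustrative name.

```r
fields <- c("Anrede", "Vorname", "Name", "Strasse/Nr.")
information <- c("Herr", "James", "Bond", "Lampenstrasse 8")

# Fail early if a field has no value (or a value has no field)
stopifnot(length(fields) == length(information))

# One-row data frame; check.names = FALSE keeps names such as "Strasse/Nr." unchanged
resTab <- data.frame(as.list(setNames(information, fields)),
                     check.names = FALSE, stringsAsFactors = FALSE)
print(resTab)
```

Rows built the same way from further MSG files could then be appended with rbind, provided they use the same field names.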
I have a similar question to this:
How to transform XML data into a data.frame?
I have an XML file that I want to convert to a data frame. But when I try this on my data, it doesn't work because the elements of my list have different lengths:
df3 = plyr::ldply(xmlToList(books), data.frame)
Error in (function (..., row.names = NULL, check.rows = FALSE,
check.names = TRUE, : arguments imply differing number of rows: 9, 10
Could anyone tell me how to convert XML to a data frame when the list elements have different numbers of entries?
Thanks,
If you look closely at the XML file, there are 105 nodes under patient. If you pick one like "drugs", you still get 22 subnodes, some tags with text and attributes, some with only attributes and some with more subnodes. ldply can do lots of things, but not combine this mess.
doc <- xmlParse( file )
x <- xmlToList( doc)
names(x)
[1] "admin" "patient" ".attrs"
names(x$patient)
[1] "additional_studies"
[2] "tumor_tissue_site"
[3] "tumor_tissue_site_other"
[4] "prior_dx"
[5] "gender"
[6] "vital_status"
[7] "days_to_birth"
...
[103] "drugs"
[104] "radiations"
[105] "clinical_cqcf"
sapply(x$patient$drugs$drug, names)
## text and attributes (usually 9)
$tx_on_clinical_trial
[1] "text" ".attrs"
# attributes only
$regimen_number
[1] "preferred_name" "display_order" "cde" "cde_ver"
[5] "xsd_ver" "tier" "owner" "procurement_status"
[9] "restricted"
## 2 sub nodes
$therapy_types
[1] "therapy_type" "therapy_type_notes"
...
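Once you have picked out the nodes you actually want, one generic way around the "differing number of rows" error is to pad missing fields with NA before binding. The sketch below is base R only and assumes each record has already been reduced to a flat named list of single values. The records list and its field names are invented for illustration and are not taken from the XML file above.

```r
# Invented flat records with differing sets of fields
records <- list(
  list(title = "A", year = "2001"),
  list(title = "B", year = "2002", price = "9.99")
)

# Union of all field names, in order of first appearance
all_fields <- unique(unlist(lapply(records, names)))

# Turn each record into a one-row data frame, filling absent fields with NA
rows <- lapply(records, function(rec) {
  vals <- rec[all_fields]  # absent fields come back as NULL
  vals[vapply(vals, is.null, logical(1))] <- NA_character_
  names(vals) <- all_fields
  as.data.frame(vals, stringsAsFactors = FALSE)
})

books_df <- do.call(rbind, rows)
print(books_df)
```

Nested nodes like the ones listed above would first need flattening or selecting by hand; this only covers the final padding and binding step.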
I have a csv file like this,
x <- read.csv("C:/Users/XXXX/Documents/XXXX/Day1_15042014/work2.csv")
class(x)
x$Sequence.window
[1] VVELRKTGGDTLEFHKFYKNFSSGLKDVVWN
[2] PGLTTQGTKFGRKIVKTLAYRVKSTQPSSGN
[3] EATEFYLRYYVGHKGKFGHEFLEFEFREDGK
[4] LVPVVWGERKTPEIEKKGFGASSKAATSLPS
[5] NMNELPEKKNSAGFIKLEDKQKLIVEMEKSV
[6] PTLHFNYRYFETDAPKDVPGAPRQWWFGGGT
[7] PDPTTAPMEAAKQPKKKRSRSKKCKSVNNLD
[8] PAKAAKTAKVTSPAKKAVAATKKVATVATKK
x is a data frame. I would now like to extract positions 10 to 22 from each sequence window (for example, for [1] VVELRKTGGDTLEFHKFYKNFSSGLKDVVWN the output should be DTLEFHKFYKNFS), and do so for all the sequences. How can I do this within a data frame?
You can use the substr function:
#dummy data
x <- read.table(text="Sequence.window
VVELRKTGGDTLEFHKFYKNFSSGLKDVVWN
PGLTTQGTKFGRKIVKTLAYRVKSTQPSSGN
EATEFYLRYYVGHKGKFGHEFLEFEFREDGK",header=TRUE,as.is=TRUE)
#substr from 10 to 22
substr(x$Sequence.window,start=10,stop=22)
#[1] "DTLEFHKFYKNFS" "FGRKIVKTLAYRV" "YVGHKGKFGHEFL"
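If you want the result kept inside the data frame, a small sketch along the same lines is shown below. The column name window_10_22 is made up, and the as.character() call is there only in case read.csv has turned the column into a factor.

```r
x <- data.frame(
  Sequence.window = c("VVELRKTGGDTLEFHKFYKNFSSGLKDVVWN",
                      "PGLTTQGTKFGRKIVKTLAYRVKSTQPSSGN",
                      "EATEFYLRYYVGHKGKFGHEFLEFEFREDGK"),
  stringsAsFactors = FALSE
)

# Store positions 10 to 22 of every sequence as a new column
x$window_10_22 <- substr(as.character(x$Sequence.window), start = 10, stop = 22)
print(x$window_10_22)
```

Each extracted window is 13 characters long, since positions 10 and 22 are both included.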