How can I query a genetics database of SNPs (preferably with R)?

Starting with a few human single nucleotide polymorphisms (SNPs), how can I query a database of all known SNPs to generate a list (data.table or CSV file) of the 1000 or so closest SNPs, showing whether or not each SNP is a tagSNP, what its minor allele frequency (MAF) is, and how many bases it lies from the starting SNPs?
I would prefer to do this in R (although it does not have to be). Which database should I use? My only starting point would be a list of the starting SNPs (e.g. rs3091244, rs6311, etc.).
I am certain there is a nice simple Bioconductor package that could be my starting point. But which one? Have you ever done it? I imagine it can be done in about 3 to 5 lines of code.

This is off topic, but you can do everything you mention through this web-based tool from the Broad Institute:
http://www.broadinstitute.org/mpg/snap/ldsearch.php
You enter a SNP and it returns the SNPs in the surrounding window, which you can also export to CSV.
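Once a table of SNPs with positions is in hand (from an export like the one above or any other catalogue), the "closest SNPs and how many bases away" part of the question is a small ranking step. Here is a minimal Python sketch of that step only; the `Snp` record, the `closest_snps` helper, and all identifiers, positions and frequencies are made up for illustration and are not taken from any database.

```python
from typing import List, NamedTuple, Tuple


class Snp(NamedTuple):
    rsid: str
    chrom: str
    pos: int      # base-pair position on the chromosome
    maf: float    # minor allele frequency


def closest_snps(query: Snp, catalogue: List[Snp], k: int = 1000) -> List[Tuple[Snp, int]]:
    """Return up to k SNPs on the query's chromosome, nearest first,
    each paired with its signed distance in bases from the query."""
    candidates = [s for s in catalogue
                  if s.chrom == query.chrom and s.rsid != query.rsid]
    candidates.sort(key=lambda s: abs(s.pos - query.pos))
    return [(s, s.pos - query.pos) for s in candidates[:k]]


# Hypothetical records, for illustration only.
catalogue = [
    Snp("rsA", "1", 100, 0.10),
    Snp("rsB", "1", 250, 0.32),
    Snp("rsC", "1", 90, 0.05),
    Snp("rsD", "2", 101, 0.45),
]
query = Snp("rsQ", "1", 120, 0.20)

for snp, distance in closest_snps(query, catalogue, k=2):
    print(snp.rsid, snp.maf, distance)
```

SNPs on other chromosomes are dropped because a base-pair distance is only meaningful within one chromosome. The tagSNP flag is not covered here, since it depends on linkage disequilibrium data rather than on position.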
Good luck with your genetics project!

Related

Using large GVCF file for population genetics

I am following a tutorial to calculate population genetics statistics in R from files in GENETIX (.gtx), STRUCTURE (.str or .stru), FSTAT (.dat) and Genepop (.gen) formats.
https://github.com/thibautjombart/adegenet
I am starting from a GVCF file, which has more than 1.5 million rows.
I have tried different strategies to import my dataset in Genepop format.
The vcfgenind package crashes on my computer, probably because the VCF file is so large.
The functions genomic_converter() and read.vcf() fail to read the whole file and capture the INFO fields of the GVCF file incorrectly.
I think I have missed a detail, such as an intermediate conversion of the VCF to another format. Does anyone have a method for getting from a GVCF file produced by NGS to a population genetics analysis?
Disclaimer: I'm a Hail maintainer.
GVCF files are a bit different from VCF files. GVCF files are a sparse representation of the entire sequence. They contain "reference blocks" which indicate genomic intervals in which the sample is inferred to have homozygous reference calls of a uniform quality.
In contrast, a typical "project" VCF (sometimes called "PVCF", often just "VCF") file densely represents one or more samples, but only at sites at which at least one sample has a non-reference call.
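To make the sparse versus dense distinction concrete, here is a toy Python sketch. It is not how any real GVCF reader works; the intervals, genotypes and the `genotype_at` helper are all hypothetical and only illustrate the idea of reference blocks.

```python
from bisect import bisect_right

# Sparse, GVCF-like view of one sample.
# Each reference block covers [start, end] (1-based, inclusive) and asserts
# homozygous-reference calls across the whole interval.
ref_blocks = [(1, 99), (101, 249), (251, 400)]   # hypothetical intervals
# Sites with a non-reference call are stored individually.
variant_calls = {100: "0/1", 250: "1/1"}          # hypothetical genotypes

block_starts = [start for start, _ in ref_blocks]


def genotype_at(pos: int) -> str:
    """Look up the genotype at one position in the sparse representation."""
    if pos in variant_calls:
        return variant_calls[pos]
    i = bisect_right(block_starts, pos) - 1
    if i >= 0 and ref_blocks[i][0] <= pos <= ref_blocks[i][1]:
        return "0/0"
    return "./."  # position not covered by any record


# A project VCF would keep only the sites with a non-reference call ...
dense_sites = sorted(variant_calls)
print(dense_sites)
# ... whereas the sparse form can still answer for every covered position.
print(genotype_at(50), genotype_at(100), genotype_at(401))
```

This is also why tools written for project VCFs can misread a GVCF: a reference-block row describes an interval, not a single variant site.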
I am not familiar with the tools you've referenced. It's possible those tools do not support GVCF files.
You might find more success working with Hail. Hail is a Python library for working with sequences. It presents a GVCF or VCF to you as a Table or Matrix Table which are similar to Pandas data frames.
Do you have one GVCF file or many? I don't know how you can do population genetics with just one sample. If you have multiple GVCF files, I recommend looking at the Hail Variant Dataset and Variant Dataset Combiner. You can combine one or more GVCF files like this:
import hail as hl

# One GVCF path per sample
gvcfs = [
    'gs://bucket/sample_10123.g.vcf.bgz',
    'gs://bucket/sample_10124.g.vcf.bgz',
    'gs://bucket/sample_10125.g.vcf.bgz',
    'gs://bucket/sample_10126.g.vcf.bgz',
]
# Combine the GVCFs into a single Variant Dataset written to output_path
combiner = hl.vds.new_combiner(
    output_path='gs://bucket/dataset.vds',
    temp_path='gs://1-day-temp-bucket/',
    gvcf_paths=gvcfs,
    use_genome_default_intervals=True,
)
combiner.run()
vds = hl.read_vds('gs://bucket/dataset.vds')
If you only have a few thousand samples, I think it is easier to work with a "dense" (project-VCF-like) representation. You can produce this by running:
mt = vds.to_dense_mt()
From here, you might look at the Hail GWAS tutorial if you want to associate genotypes to phenotypes.
For more traditional population genetics, the Martin Lab has shared tutorials on how they analyzed the HGDP+1kg dataset.
If you're looking for something like the F statistic, you can compute that with Hail's hl.agg.inbreeding aggregator:
mt = hl.variant_qc(mt)
mt = mt.annotate_cols(
    IB=hl.agg.inbreeding(mt.GT, mt.variant_qc.AF[1])
)
mt.IB.show()
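For intuition about what such an F statistic measures, here is a plain-Python sketch of the usual method-of-moments estimator, F = (O - E) / (N - E), where O is the observed number of homozygous calls, E the number expected under Hardy-Weinberg equilibrium, and N the number of called sites. I am assuming this textbook form for illustration; it is not Hail's implementation, and the `inbreeding_f` helper and the example data are hypothetical.

```python
import math
from typing import List, Optional


def inbreeding_f(genotypes: List[Optional[int]], alt_freqs: List[float]) -> float:
    """Method-of-moments inbreeding coefficient for one sample.

    genotypes: alternate-allele count per site (0, 1 or 2), None if missing.
    alt_freqs: alternate-allele frequency at each site.
    """
    observed_homs = 0
    expected_homs = 0.0
    n_called = 0
    for gt, p in zip(genotypes, alt_freqs):
        if gt is None:
            continue                      # missing calls are skipped
        n_called += 1
        if gt != 1:
            observed_homs += 1            # 0 and 2 are homozygous calls
        # Under Hardy-Weinberg, P(homozygous) = 1 - 2p(1 - p)
        expected_homs += 1.0 - 2.0 * p * (1.0 - p)
    denominator = n_called - expected_homs
    if denominator == 0:
        return math.nan                   # every site is monomorphic
    return (observed_homs - expected_homs) / denominator


# Hypothetical sample: four called sites, each with allele frequency 0.5.
print(inbreeding_f([0, 2, 1, None, 0], [0.5, 0.5, 0.5, 0.5, 0.5]))
```

Here three of the four called sites are homozygous against an expectation of two, so F comes out positive, indicating an excess of homozygosity.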

Querying UniProt and RefSeq databases with FASTA headers

This is my first post to StackOverflow. I've been playing with this problem for over a week, and I haven't found a solution using the search function or my limited computational skills. I have a dataset comprising columns of FASTA header, sequence, and count. The FASTA headers technically contain all the info I need, but that's where I'm running into problems...
Different formats
Some of the entries come from UniProt:
tr|V5RFN6|V5RFN6_EBVG_Epstein-Barr_nuclear_antigen_2_(Fragment)_OS=Epstein-Barr_virus_(strain_GD1)_GN=EBNA2_PE=4_SV=1
Some of the entries come from RefSeq:
gi|139424477|ref|YP_001129441.1|EBNA-2[Human_herpesvirus_4_type_2]
Synonyms
I'd like to make graphs of count vs virus or gene, and I thought it would be easy enough to split up the headers and go from there. However, what I'm discovering is that there's a seemingly endless number of permutations of virus and gene names. In the example above, EBV goes by no fewer than four names, and each individual gene is formatted in several different ways.
I used a lengthy ifelse statement to create a column for virus family name. I shortened the code below to include just EBV, but you can imagine it stretching on for all common viruses.
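library(dplyr)
# Synonyms under which EBV appears in the headers
EBV <- c("EBVG", "Human_herpesvirus_4", "Epstein", "Human_gammaherpesvirus_4")
# Assumes the FASTA header is stored in the `name` column of joint.virus
joint.virus <- joint.virus %>%
  mutate(Virus_Family = ifelse(grepl(paste(EBV, collapse = "|"), name), "EBV", NA))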
This isn't so bad, but I had to do something similar for all of EBV's ~85 genes. Not only was this tedious, but it isn't feasible to do this for all the viruses I want to look at.
I looked into querying the databases using the UniProt.ws package to pull out organism name and gene name, but you need to start from the taxID (which isn't included in the UniProt header). I feel like there should be some way to use the FASTA header to get the organism name and gene name.
I am presently using R. I would greatly appreciate any advice going forward. Is there a package that I'm overlooking? Should I be using a different tool to do this?
Thanks!
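One possible direction, sketched in Python rather than R: parse each header format with its own regular expression, then normalise gene and virus names through small lookup tables instead of nested ifelse calls. The patterns are written against the two example headers above (with underscores in place of spaces) and will likely need adjusting for other entries; `parse_header`, `normalise_gene` and `virus_family` are hypothetical helpers, and the optional `OX=` field is an assumption about how newer UniProt headers look.

```python
import re
from typing import Optional

# UniProt: db|accession|entry name and description, then OS=, optional OX=/GN=, then PE=
UNIPROT = re.compile(
    r"^(?:sp|tr)\|(?P<accession>[^|]+)\|.*?_OS=(?P<organism>.+?)"
    r"(?:_OX=\d+)?(?:_GN=(?P<gene>.+?))?_PE=\d+"
)
# RefSeq: gi|number|ref|accession|description[organism]
REFSEQ = re.compile(
    r"^gi\|\d+\|ref\|(?P<accession>[^|]+)\|(?P<gene>.*?)\[(?P<organism>.+)\]$"
)

# Synonym table taken from the question; extend it with one entry per virus.
VIRUS_SYNONYMS = {
    "EBV": ["EBVG", "Human_herpesvirus_4", "Epstein", "Human_gammaherpesvirus_4"],
}


def normalise_gene(gene: Optional[str]) -> Optional[str]:
    """Collapse formatting variants such as 'EBNA-2' and 'EBNA2'."""
    return re.sub(r"[-_ ]", "", gene).upper() if gene else None


def virus_family(organism: str) -> Optional[str]:
    """Return the first family whose synonym occurs in the organism string."""
    for family, synonyms in VIRUS_SYNONYMS.items():
        if any(s in organism for s in synonyms):
            return family
    return None


def parse_header(header: str) -> Optional[dict]:
    """Extract source, accession, gene and virus family, or None if unrecognised."""
    for source, pattern in (("UniProt", UNIPROT), ("RefSeq", REFSEQ)):
        match = pattern.match(header)
        if match:
            fields = match.groupdict()
            return {
                "source": source,
                "accession": fields["accession"],
                "gene": normalise_gene(fields["gene"]),
                "virus": virus_family(fields["organism"]),
            }
    return None


headers = [
    "tr|V5RFN6|V5RFN6_EBVG_Epstein-Barr_nuclear_antigen_2_(Fragment)"
    "_OS=Epstein-Barr_virus_(strain_GD1)_GN=EBNA2_PE=4_SV=1",
    "gi|139424477|ref|YP_001129441.1|EBNA-2[Human_herpesvirus_4_type_2]",
]
for h in headers:
    print(parse_header(h))
```

The extracted accessions could also serve as keys for a database lookup, which would sidestep the synonym problem for organism names; that step is not shown here.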

How to obtain graph feature values for multiple rna sequences at once?

I need to obtain graph feature values (around 20 of them) for multiple RNA sequences at once.
If I input a file containing RNA sequences, the output file should contain these 20 features on one line per sequence. I have looked into GraPPLE, but it gives feature values for only one sequence at a time. The same goes for igraph.
I have a file of 500 sequences for which I need to obtain these feature values and then use them for training.
You may want to have a look at recent articles citing GraPPLE:
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2685108/citedby/
Specifically, RNAcon is also a graph-based approach:
https://www.ncbi.nlm.nih.gov/pubmed/24521294
Lastly, this review may be of interest:
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5153550/
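If the real obstacle is only that a tool handles one sequence at a time, a loop that writes one row per sequence may be enough. Below is a rough Python sketch of that batch pattern. It assumes a secondary structure in dot-bracket notation is already available for each sequence (the folding step is not shown), and the handful of features it computes are simple stand-ins chosen by me, not the GraPPLE or RNAcon feature sets. All names and structures are hypothetical.

```python
import csv
import sys


def base_pairs(structure: str) -> dict:
    """Map each paired position to its partner in a dot-bracket string."""
    stack, partner = [], {}
    for i, symbol in enumerate(structure):
        if symbol == "(":
            stack.append(i)
        elif symbol == ")":
            j = stack.pop()
            partner[i], partner[j] = j, i
    return partner


def graph_features(structure: str) -> dict:
    """Features of the graph whose nodes are bases and whose edges are
    backbone links plus base pairs."""
    n = len(structure)
    partner = base_pairs(structure)
    n_pairs = len(partner) // 2
    n_edges = max(n - 1, 0) + n_pairs
    # A stem starts at a pair (i, j) whose outer neighbours i-1 and j+1
    # are not paired with each other.
    n_stems = sum(1 for i, j in partner.items()
                  if i < j and partner.get(i - 1) != j + 1)
    return {
        "n_nodes": n,
        "n_edges": n_edges,
        "n_pairs": n_pairs,
        "n_stems": n_stems,
        "mean_degree": round(2 * n_edges / n, 3) if n else 0.0,
    }


# Hypothetical input: sequence name -> dot-bracket structure.
structures = {
    "rna_1": "(((...)))",
    "rna_2": "((..))..(...)",
}

fields = ["name", "n_nodes", "n_edges", "n_pairs", "n_stems", "mean_degree"]
writer = csv.DictWriter(sys.stdout, fieldnames=fields)
writer.writeheader()
for name, structure in structures.items():
    writer.writerow({"name": name, **graph_features(structure)})
```

Each output row is one sequence, so the resulting table can be fed directly to a training step.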

R script to find peptides after trypsin digest in databases/literature

I have a long list of peptide sequences which I would like to look up in batch mode in public sources/ databases, to see if the peptides have been identified as biomarkers in specific tissues (plasma, urine, etc). The problem is that the peptides have been generated using a trypsin digest, which means that I do not always get exact matches and need to also find inexact matches to my peptide queries by finding cases with irregular digest cleavages. The only R packages I am finding that can handle issues with enzyme digests deal with spectra rather than peptide sequences. I would appreciate any suggestions on how to do this, thanks!
I've used the "cleaver" R package in the past; it seems to have the most comprehensive list of enzymes that I've come across so far:
devtools::install_github("sgibb/cleaver")
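To illustrate the idea behind an in-silico digest with irregular cleavages, here is a small Python sketch (Python rather than R, and independent of cleaver). It applies the common trypsin rule, cutting after K or R unless the next residue is P, and allows a configurable number of missed cleavages. The protein sequence and the `trypsin_digest` and `classify` helpers are hypothetical.

```python
import re


def trypsin_digest(protein: str, missed_cleavages: int = 2) -> set:
    """Peptides from an in-silico trypsin digest, allowing missed cleavages."""
    # Trypsin rule: cut after K or R, but not when the next residue is P.
    pieces = [p for p in re.split(r"(?<=[KR])(?!P)", protein) if p]
    peptides = set()
    for start in range(len(pieces)):
        for extra in range(missed_cleavages + 1):
            if start + extra < len(pieces):
                # Joining neighbouring pieces models skipped cut sites.
                peptides.add("".join(pieces[start:start + extra + 1]))
    return peptides


def classify(query: str, protein: str, missed_cleavages: int = 2) -> str:
    """Label a query peptide relative to one reference protein."""
    if query in trypsin_digest(protein, missed_cleavages):
        return "tryptic"
    if query in protein:
        # Present in the protein, but not bounded by tryptic cut sites.
        return "irregular cleavage"
    return "absent"


protein = "MKAAARPEEKLLR"   # hypothetical sequence
print(sorted(trypsin_digest(protein, missed_cleavages=1)))
for query in ["AAARPEEK", "MKAAARPEEK", "ARPEEK", "WWW"]:
    print(query, classify(query, protein))
```

Matching a list of query peptides against digests of reference proteins this way separates exact tryptic matches from peptides that are present but irregularly cleaved. Looking those proteins up in biomarker databases is a separate step that is not covered here.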

Counting oligonucleotides and their reverse complements

I'm starting to use R and I need some help, if possible. I need to read FASTA files and count, for each species, the frequency of each nucleotide, each dinucleotide and each word of length 10, along with the frequency of its reverse complement. I'm using the Biostrings package. Can you help me? Thank you.
The Bioconductor Biostrings manual describes methods that match what you are looking for, with examples attached. Otherwise, if you can't figure out Biostrings, you could just read in the FASTA file and keep track of how many times each base occurs.
For the frequencies, reading the sequences from a text file (the FASTA file with the name lines removed) is sufficient, as long as you keep count of how many times each oligonucleotide appears.
I'm not exactly sure how you want to measure the reverse complement, but an array holding every possible word of length 10 would not be too large (4^10 = 1,048,576 entries), so if you add the data to the array in a logical order you could easily compare each word with its reverse complement algorithmically.
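The counting approach described above can be sketched in a few lines. This version is in Python rather than R and uses a dictionary-like counter instead of a fixed array of 4^10 slots, so only words that actually occur take up memory. The sequences and the `read_fasta`, `count_words` and `reverse_complement` helpers are hypothetical; k = 2 is used so the output stays short, and k = 10 works the same way.

```python
from collections import Counter

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Complement every base, then reverse the string."""
    return seq.translate(COMPLEMENT)[::-1]


def read_fasta(lines):
    """Yield (name, sequence) pairs from an iterable of FASTA lines."""
    name, chunks = None, []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            name, chunks = line[1:], []
        elif line:
            chunks.append(line.upper())
    if name is not None:
        yield name, "".join(chunks)


def count_words(seq: str, k: int) -> Counter:
    """Count every overlapping word of length k made only of A, C, G, T."""
    words = (seq[i:i + k] for i in range(len(seq) - k + 1))
    return Counter(w for w in words if set(w) <= set("ACGT"))


# Hypothetical FASTA content; with a real file, pass the open file handle.
fasta = [">species_1", "ACGTAC", ">species_2", "GGGCCC"]
for name, seq in read_fasta(fasta):
    counts = count_words(seq, 2)
    for word, n in sorted(counts.items()):
        # Last column: how often the word's reverse complement occurs.
        print(name, word, n, counts[reverse_complement(word)])
```

Words containing ambiguity codes such as N are skipped, which is one possible choice; counting them separately would be equally reasonable.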
