How to process the output data from Tax4Fun, an R package for predicting functional profiles from metagenomic 16S rRNA data?

I'm learning to use Tax4Fun (http://tax4fun.gobics.de/) and I have some questions. I have the Tax4Fun profiles, e.g. K00001; alcohol dehydrogenase [EC:1.1.1.1] ... (there are more than 10,000 lines in the file). What can I do with this output file in downstream analysis? Should I connect it to KEGG somehow?

I guess the answer depends on your research question, right? Why do you want to do functional profiling of your prokaryotic community, and what are you looking for?
Based on the code in your comment, you save the KO abundance table as a tab-separated file.
You could use this file to analyse your metagenomic profile with STAMP or find "biomarker" KOs using LEfSe.
In addition, I think that your gene table contains more than 6000 KOs. You can also predict pathways with Tax4Fun:
KEGG_pathways <- Tax4Fun(df, fctProfiling = FALSE, refProfile = "UProC",
                         shortReadMode = FALSE, normCopyNo = TRUE,
                         folderReferenceData = "~/work/tax4fun_files/R_ressources/SILVA123")
This would result in ~300 metabolic pathways instead of >6000 KOs. In theory you could also use your KO table with HUMANN or HUMANN2 to create metabolic pathways.
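As an illustration only (not part of the original answer), here is a minimal sketch of the KO-level run and the tab-separated export mentioned above. The input object df, the reference-data path, and the Tax4FunProfile element are assumptions based on the Tax4Fun documentation, so adjust them to your own objects:
library(Tax4Fun)

# Hypothetical sketch: KO-level profiling, then write the abundance table
# to a tab-separated file that STAMP or LEfSe can read.
KO_profile <- Tax4Fun(df, fctProfiling = TRUE, refProfile = "UProC",
                      shortReadMode = FALSE, normCopyNo = TRUE,
                      folderReferenceData = "~/work/tax4fun_files/R_ressources/SILVA123")
ko_table <- t(KO_profile$Tax4FunProfile)   # KOs as rows, samples as columns (see ?Tax4Fun)
write.table(ko_table, file = "KO_abundance.tsv",
            sep = "\t", quote = FALSE, col.names = NA)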


Using large GVCF file for population genetics

I am following a tutorial to calculate population genetics statistics in R from GENETIX (extension .gtx), STRUCTURE (.str or .stru), FSTAT (.dat) and Genepop (.gen) file formats.
https://github.com/thibautjombart/adegenet
I start from a GVCF file, which has more than 1.5 million rows.
I have tried different strategies to import my dataset in Genepop format.
The vcfgenind package crashes on my computer, probably because of the very large VCF file.
The package commands genomic_converter() and read.vcf() fail to read the whole file and incorrectly capture the INFO fields of the GVCF file.
I think I have missed a step, such as an intermediate conversion of the VCF to another format. Does anyone have a method for NGS analysis going from a GVCF file to population genetics?
Disclaimer: I'm a Hail maintainer.
GVCF files are a bit different from VCF files. GVCF files are a sparse representation of the entire sequence. They contain "reference blocks" which indicate genomic intervals in which the sample is inferred to have homozygous reference calls of a uniform quality.
In contrast, a typical "project" VCF (sometimes called "PVCF", often just "VCF") file densely represents one or more samples, but only at sites at which at least one sample has a non-reference call.
I am not familiar with the tools you've referenced. It's possible those tools do not support GVCF files.
You might find more success working with Hail. Hail is a Python library for working with sequence data. It presents a GVCF or VCF to you as a Table or MatrixTable, which are similar to Pandas data frames.
Do you have one GVCF file or many? I don't know how you can do population genetics with just one sample. If you have multiple GVCF files, I recommend looking at the Hail Variant Dataset and Variant Dataset Combiner. You can combine one or more GVCF files like this:
import hail as hl

gvcfs = [
    'gs://bucket/sample_10123.g.vcf.bgz',
    'gs://bucket/sample_10124.g.vcf.bgz',
    'gs://bucket/sample_10125.g.vcf.bgz',
    'gs://bucket/sample_10126.g.vcf.bgz',
]
combiner = hl.vds.new_combiner(
    output_path='gs://bucket/dataset.vds',
    temp_path='gs://1-day-temp-bucket/',
    gvcf_paths=gvcfs,
    use_genome_default_intervals=True,
)
combiner.run()
vds = hl.vds.read_vds('gs://bucket/dataset.vds')
If you only have a few thousand samples, I think it is easier to work with a "dense" (project-VCF-like) representation. You can produce this by running:
mt = vds.to_dense_mt()
From here, you might look at the Hail GWAS tutorial if you want to associate genotypes to phenotypes.
For more traditional population genetics, the Martin Lab has shared tutorials on how they analyzed the HGDP+1kg dataset.
If you're looking for something like the F statistic, you can compute that with Hail's hl.agg.inbreeding aggregator:
mt = hl.variant_qc(mt)
mt = mt.annotate_cols(
    IB=hl.agg.inbreeding(mt.GT, mt.variant_qc.AF[1])
)
mt.IB.show()

Predictive modelling for a 3-dimensional data frame

I have a dataset which contains all the quotes made by a company over the past 3 years. I want to create a predictive model using the library caret in R to predict whether a quote will be accepted or rejected.
The structure of the dataset is causing me some problems. It contains 45 variables; however, I have only included the two below, as they are the only variables relevant to this problem. An extract of the dataset is shown below.
contract.number item.id
0030586792 32X10AVC
0030586792 ZFBBDINING
0030587065 ZSTAIRCL
0030587065 EMS164
0030591125 YCLEANOFF
0030591125 ZSTEPSWC
contract.number <- c("0030586792","0030586792","0030587065","0030587065","0030591125","0030591125")
item.id <- c("32X10AVC","ZFBBDINING","ZSTAIRCL","EMS164","YCLEANOFF","ZSTEPSWC")
dataframe <- data.frame(contract.number,item.id)
Each unique contract.number corresponds to a single quote made. The item.id corresponds to the item that is being quoted for. Therefore, quote 0030586792 includes both items 32X10AVC and ZFBBDINING.
If I randomise the order of the dataset and model it in its current form, I am worried that the model would just learn which contract.numbers won and lost during training, which would invalidate my testing, since in the real world this is not known before the prediction is made. I also have the additional issue of what to do if the model predicts that the same contract.number will win with some item.ids and lose with others.
My ideal solution would be to condense each contract.number into a single row with multiple item.ids per row, forming a 3-dimensional data frame, but I am not sure whether caret would then be able to model this. It is not realistic to split the item.ids into separate columns, as some quotes have hundreds of item.ids. Any help would be much appreciated!
(Sorry if I haven't explained well!)
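For illustration only (this sketch is not from the original post): one way to express the "one row per quote" idea described above, using the toy data and a dplyr list-column; whether caret can model such a structure directly is a separate question.
library(dplyr)

# Collapse the toy data to one row per contract.number, gathering all
# item.ids for that quote into a list-column.
quotes <- dataframe %>%
  group_by(contract.number) %>%
  summarise(items = list(item.id), n.items = n(), .groups = "drop")

quotes$items[[1]]   # e.g. the items quoted under contract 0030586792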

textProcessor changes the number of observations of my corpus (using the stm package in R)

I'm working with a dataset that has 439 observations for text analysis in stm. When I use textProcessor, the number of observations changes to 438 for some reason. This creates problems later on: when using the findThoughts() function, for example.
##############################################
#PREPROCESSING
##############################################
#Process the data for analysis.
temp <- textProcessor(sovereigncredit$Content, sovereigncredit,
                      customstopwords = customstop, stem = FALSE)
meta  <- temp$meta
vocab <- temp$vocab
docs  <- temp$documents
length(docs)                     # QUESTION: WHY IS THIS 438 instead of 439, like the original dataset?
length(sovereigncredit$Content)  # See, this original one is 439.
out <- prepDocuments(docs, vocab, meta)
docs  <- out$documents
vocab <- out$vocab
meta  <- out$meta
An example of this becoming a problem down the line is:
thoughts1<-findThoughts(sovereigncredit1, texts=sovereigncredit$Content,n=5, topics=1)
For which the output is:
"Error in findThoughts(sovereigncredit1, texts = sovereigncredit$Content, :
Number of provided texts and number of documents modeled do not match"
In which "sovereigncredit1" is a topic model based on "out" from above.
If my interpretation is correct (and I'm not making another mistake), the problem seems to be this one-observation difference between the number of observations before and after textProcessor.
So far, I've looked at the original csv and made sure there are in fact 439 valid observations and no empty rows. I'm not sure what's up. Any help would be appreciated!
stm can't handle empty documents, so we simply drop them. textProcessor removes a lot of material from texts: custom stopwords, words shorter than 3 characters, numbers, etc. So what's happening here is that one of your documents (whichever one is dropped) loses all of its content at some point during the steps textProcessor performs.
You can work out which document it was and decide what you want to do about it in this instance. In general, if you want more control over the text manipulation, I would strongly recommend the quanteda package, which has much more fine-grained tools than stm for turning texts into a document-term matrix.
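A minimal sketch of working back which document was dropped, assuming your version of textProcessor returns the docs.removed index vector described in its help page:
# Which row was dropped during preprocessing? (docs.removed is documented
# in ?textProcessor; adjust if your stm version differs.)
temp$docs.removed                            # index of the dropped document(s)
sovereigncredit$Content[temp$docs.removed]   # inspect the offending text

# Keep the texts aligned with the modeled documents for findThoughts().
# If prepDocuments() also drops documents, subset again using out$docs.removed.
kept <- setdiff(seq_along(sovereigncredit$Content), temp$docs.removed)
thoughts1 <- findThoughts(sovereigncredit1,
                          texts = sovereigncredit$Content[kept],
                          n = 5, topics = 1)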

How can I query a genetics database of SNPs (preferably with R)?

Starting with a few human single nucleotide polymorphisms (SNPs), how can I query a database of all known SNPs to generate a list (data.table or csv file) of the 1000 or so closest SNPs, whether or not each SNP is a tagSNP, what its minor allele frequency (MAF) is, and how many bases it is away from the starting SNP?
I would prefer to do this in R (although it does not have to be). Which database should I use? My only starting point would be listing the starting SNPs (e.g. rs3091244, rs6311, etc.).
I am certain there is a nice simple Bioconductor package that could be my starting point. But what? Have you ever done it? I imagine it can be done in about 3 to 5 lines of code.
Again, this is off topic, but you can actually do all of the things you mention through this web-based tool from the Broad:
http://www.broadinstitute.org/mpg/snap/ldsearch.php
You just input a SNP and it gives you the surrounding window of SNPs, and you can export to csv as well.
Good luck with your genetics project!
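Since the question explicitly asks about an R/Bioconductor route, here is a rough sketch (my addition, not from the original answer) using biomaRt against the Ensembl variation mart; the attribute and filter names are assumptions that you should verify with listAttributes() and listFilters() before relying on them:
library(biomaRt)

# Hypothetical sketch: look up position, minor allele and MAF for the starting SNPs.
snp_mart <- useEnsembl(biomart = "snps", dataset = "hsapiens_snp")
starting <- c("rs3091244", "rs6311")

info <- getBM(attributes = c("refsnp_id", "chr_name", "chrom_start",
                             "minor_allele", "minor_allele_freq"),
              filters = "snp_filter", values = starting, mart = snp_mart)
info
# Nearby SNPs could then be pulled with the chromosomal-region filters
# (chr_name / start / end) around each chrom_start.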

Determining distribution so I can generate test data

I've got about 100M value/count pairs in a text file on my Linux machine. I'd like to figure out what sort of formula I would use to generate more pairs that follow the same distribution.
From a casual inspection, it looks power law-ish, but I need to be a bit more rigorous than that. Can R do this easily? If so, how? Is there something else that works better?
While a bit costly, you can mimic your sample's distribution exactly (without needing any hypothesis on underlying population distribution) as follows.
You need a file structure that's rapidly searchable for "highest entry with key <= X" -- Sleepycat's Berkeley database has a btree structure for that, for example; SQLite is even easier though maybe not quite as fast (but with an index on the key it should be OK).
Put your data in the form of pairs where the key is the cumulative count up to that point (sorted by increasing value). Call K the highest key.
To generate a random pair that follows exactly the same distribution as the sample, generate a random integer X between 0 and K and look it up in that file structure with the mentioned "highest that's <=" and use the corresponding value.
Not sure how to do all this in R -- in your shoes I'd try a Python/R bridge, do the logic and control in Python and only the statistics in R itself, but that's a personal choice!
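A small in-memory sketch of the same idea in R (my addition; findInterval on toy data stands in for the Berkeley DB/SQLite "highest key <= X" lookup described above):
values <- c(1, 2, 5, 10)      # toy values
counts <- c(50, 30, 15, 5)    # toy counts
cum    <- cumsum(counts)      # keys: cumulative count up to each value
K      <- cum[length(cum)]    # highest key

draw <- function(n) {
  x <- sample.int(K, n, replace = TRUE)               # random integer in 1..K
  values[findInterval(x, cum, left.open = TRUE) + 1]  # first bin whose cumulative count >= x
}
draw(10)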
To see whether you have a real power-law distribution, make a log-log plot of the frequencies and see whether they line up roughly on a straight line. If they do, you might want to read up on the Pareto distribution for more on how to describe your data.
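A quick way to do that check in R (a sketch; it assumes the tab-separated values/counts file described in the answer below, with numeric values):
# Log-log plot of the value/count pairs; a roughly straight line suggests
# power-law-like behaviour.
vc <- read.table("values-counts.txt", header = TRUE, sep = "\t")
plot(log10(vc$values), log10(vc$counts),
     xlab = "log10(value)", ylab = "log10(count)", pch = ".")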
I'm assuming that you're interested in understanding the distribution over your categorical values.
The best way to generate "new" data is to sample from your existing data using R's sample() function. This will give you values which follow the probability distribution indicated by your existing counts.
To give a trivial example, let's assume you had a file of voter data for a small town, where the values are voters' political affiliations, and counts are number of voters:
affils <- as.factor(c('democrat','republican','independent'))
counts <- c(552,431,27)
## Simulate 20 new voters, sampling from affiliation distribution
new.voters <- sample(affils,20, replace=TRUE,prob=counts)
new.counts <- table(new.voters)
In practice, you will probably bring in your 100m rows of values and counts using R's read.csv() function. Assuming you've got a header line labeled "values\t counts", that code might look something like this:
dat <- read.csv('values-counts.txt',sep="\t",colClasses=c('factor','numeric'))
new.dat <- sample(dat$values,100,replace=TRUE,prob=dat$counts)
One caveat: as you may know, R keeps all of its objects in memory, so be sure you've got enough freed up for 100m rows of data (storing character strings as factors will help reduce the footprint).
