Using a large GVCF file for population genetics in R

I am following a tutorial to calculate population genetics statistics in R from GENETIX (extension .gtx), STRUCTURE (.str or .stru), FSTAT (.dat) and Genepop (.gen) file formats.
https://github.com/thibautjombart/adegenet
I start from a GVCF file, which has more than 1.5 million rows.
I have tried different strategies to import my dataset in Genepop format.
The vcfgenind package crashes on my computer, probably because of the very large VCF file.
The package commands genomic_converter() and read.vcf() fail to read the whole file and incorrectly capture the INFO fields of the GVCF file.
I think I have missed a step, such as an intermediate conversion of the VCF to another format. Does anyone have a method for NGS analysis that goes from a GVCF file to population genetics?

Disclaimer: I'm a Hail maintainer.
GVCF files are a bit different from VCF files. GVCF files are a sparse representation of the entire sequence. They contain "reference blocks" which indicate genomic intervals in which the sample is inferred to have homozygous reference calls of a uniform quality.
In contrast, a typical "project" VCF (sometimes called "PVCF", often just "VCF") file densely represents one or more samples, but only at sites at which at least one sample has a non-reference call.
I am not familiar with the tools you've referenced. It's possible those tools do not support GVCF files.
You might find more success working with Hail. Hail is a Python library for working with sequences. It presents a GVCF or VCF to you as a Table or MatrixTable, which are similar to Pandas data frames.
Do you have one GVCF file or many? I don't know how you can do population genetics with just one sample. If you have multiple GVCF files, I recommend looking at the Hail Variant Dataset and Variant Dataset Combiner. You can combine one or more GVCF files like this:
import hail as hl

gvcfs = [
    'gs://bucket/sample_10123.g.vcf.bgz',
    'gs://bucket/sample_10124.g.vcf.bgz',
    'gs://bucket/sample_10125.g.vcf.bgz',
    'gs://bucket/sample_10126.g.vcf.bgz',
]
combiner = hl.vds.new_combiner(
    output_path='gs://bucket/dataset.vds',
    temp_path='gs://1-day-temp-bucket/',
    gvcf_paths=gvcfs,
    use_genome_default_intervals=True,
)
combiner.run()

vds = hl.read_vds('gs://bucket/dataset.vds')
If you only have a few thousand samples, I think it is easier to work with a "dense" (project-VCF-like) representation. You can produce this by running:
mt = vds.to_dense_mt()
From here, you might look at the Hail GWAS tutorial if you want to associate genotypes to phenotypes.
For more traditional population genetics, the Martin Lab has shared tutorials on how they analyzed the HGDP+1kg dataset.
If you're looking for something like the F statistic, you can compute that with Hail's hl.agg.inbreeding aggregator:
mt = hl.variant_qc(mt)
mt = mt.annotate_cols(
    IB = hl.agg.inbreeding(mt.GT, mt.variant_qc.AF[1])
)
mt.IB.show()

Related

How to process the output data from Tax4Fun, an R program for predicting functional profiles from metagenomic 16S rRNA data?

I'm learning to use Tax4Fun (http://tax4fun.gobics.de/) now, but I have some questions. I have got the Tax4Fun profiles, e.g. K00001; alcohol dehydrogenase [EC:1.1.1.1]... (there are more than 10,000 lines in the file). What can I do with the output file in the downstream analysis? Maybe make some connection to KEGG?
I guess that your answer depends on your research question, right? Why do you want to do a functional profiling of your prokaryotic community and what are you looking for?
Based on your code in the comment, you save the KO abundance table as a tab-separated file.
You could use this file to analyse your metagenomic profile with STAMP or find "biomarker" KOs using LEfSe.
In addition, I think that your gene table contains more than 6000 KOs. You can also predict pathways with Tax4Fun:
KEGG_pathways <- Tax4Fun(df,
                         fctProfiling = FALSE,
                         refProfile = "UProC",
                         shortReadMode = FALSE,
                         normCopyNo = TRUE,
                         folderReferenceData = "~/work/tax4fun_files/R_ressources/SILVA123")
This would result in ~300 metabolic pathways instead of >6000 KOs. In theory you could also use your KO table with HUMANN or HUMANN2 to create metabolic pathways.

How to get RPKM values from BED or WIG files? And what's the difference between these two types of files?

I want to download raw fastq files from RNA-seq to get gene expression values. But GEO only provides .bed.gz and .wig.gz formats. What can I do to get the RPKM values? Thank you very much!
In order to calculate RPKM, you need (mapped) raw reads as contained in BAM/SAM or even CRAM files. Wiggle, BED and their derivatives such as bigWig are compressed versions of those that only contain the coverage (mainly used for plotting); that is, they have lost the read information needed for counting and therefore for calculating RPKM (or FPKM/TPM, for that matter).
The standard approach is to start from a BAM file, extract the read counts for regions of interest and calculate RPKM etc. There are many pipelines out there, such as this one.
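For reference, the arithmetic itself is simple once you have counts. Here is a minimal R sketch; the vectors counts and gene_length_bp are hypothetical, just to show the formula:
counts <- c(geneA = 500, geneB = 1200, geneC = 30)           # hypothetical raw read counts per gene
gene_length_bp <- c(geneA = 2000, geneB = 5000, geneC = 800) # hypothetical gene lengths in bp
total_mapped <- 2.5e7                                        # total mapped reads in the library
# RPKM = counts / (gene length in kb) / (total mapped reads in millions)
rpkm <- counts / (gene_length_bp / 1e3) / (total_mapped / 1e6)
rpkm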
If BAM files are not available, GEO usually has at least the raw fastq files (or SRA files that can be converted to fastq) as a basis for mapping to obtain a BAM file. Also have a look at ArrayExpress; they could have the raw files for that project, since it mirrors GEO.
Maybe as a word of warning: if you intend to do differential expression analysis, you need to start from the raw counts, not the RPKM values.

Handling multiple raster files and executing unit conversions on them: R

I've dug around a lot for an answer to this and wasn't able to find anything, so here I am.
I have a whole bunch of ascii raster files corresponding to air temperature and dew point temperature of a certain area over 744 hourly time steps. (So I have 744 air temp and 744 dew point files corresponding to a 31-day month). The files are only about 45 kB each.
I want to stack them together so I can perform some analyses on them, and I also want to convert their units from K to deg F.
The files are named Tair1.txt, Tair2.txt, ..., Tair744.txt and Eair1.txt, Eair2.txt, ..., Eair744.txt.
Using the raster package, I can easily load all the files as rasters:
for (i in 1:744) {
  assign(paste0("Tair", i), raster(paste0("Tair", i, ".txt")))
  assign(paste0("Eair", i), raster(paste0("Eair", i, ".txt")))
}
I've tried to use ls() with pattern or glob2rx to define just the raster file names and
then do conversions on them, or to do something similar to join them in a stack, but to no avail. I also tried mget, values(mget(filename)) and things like that to get at the values in a loop.
I know R doesn't handle large datasets very well, but I'm thinking these aren't really that large so there should be something pretty simple?
I would appreciate any help and advice! Thank you.
The raster package's RasterStack is for this:
library(raster)
files <- paste0("Tair",1:744,".txt")
rs <- stack(files)
Why do you have these files in text format though? Who imposed this disaster on you? I suspect your individual layers have insufficient metadata, so try one and see if it's sensible; you can use extent(rs) <- and projection(rs) <- to fix things if needed:
r <- raster(files[1])
print(r)
Don't use assign(); that's just creating a mess.
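For the unit conversion the question asks about, arithmetic on Raster* objects is applied cell-wise across all layers, so a hedged sketch (assuming the layers really are in kelvin) is simply:
# Convert the whole stack from kelvin to degrees Fahrenheit, cell by cell
rs_F <- (rs - 273.15) * 9/5 + 32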

How can I query a genetics database of SNPs (preferably with R)?

Starting with a few human single nucleotide polymorphisms (SNPs), how can I query a database of all known SNPs so that I can generate a list (data.table or CSV file) of the 1000 or so closest SNPs, whether or not each SNP is a tagSNP, what the minor allele frequency (MAF) is, and how many bases it is away from the starting SNPs?
I would prefer to do this in R (although it does not have to be). Which database should I use? My only starting point would be listing the starting SNPs (e.g. rs3091244, rs6311, etc.).
I am certain there is a nice simple Bioconductor package that could be my starting point. But what? Have you ever done it? I imagine it can be done in about 3 to 5 lines of code.
Again, this is off topic, but you can actually do all of the things you mention through this web-based tool from the Broad:
http://www.broadinstitute.org/mpg/snap/ldsearch.php
You just input a SNP and it gives you the surrounding window of SNPs, and you can export to CSV as well.
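If you do want to stay in R, one possible route is the Bioconductor biomaRt package against Ensembl's SNP mart. This is only a hedged sketch; the dataset, filter and attribute names below are assumptions you should check with listAttributes() and listFilters():
library(biomaRt)

# Query Ensembl's human SNP mart for a few rsIDs; attribute names are assumptions
snp_mart <- useEnsembl(biomart = "snps", dataset = "hsapiens_snp")
getBM(attributes = c("refsnp_id", "chr_name", "chrom_start", "minor_allele_freq"),
      filters    = "snp_filter",
      values     = c("rs3091244", "rs6311"),
      mart       = snp_mart)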
Good luck with your genetics project!

Trimming a huge (3.5 GB) csv file to read into R

So I've got a data file (semicolon-separated) that has a lot of detail and incomplete rows (leading Access and SQL to choke). It's a county-level data set broken into segments, sub-segments, and sub-sub-segments (for a total of ~200 factors) for 40 years. In short, it's huge, and it's not going to fit into memory if I try to simply read it.
So my question is this, given that I want all the counties, but only a single year (and just the highest level of segment... leading to about 100,000 rows in the end), what would be the best way to go about getting this rollup into R?
Currently I'm trying to chop out irrelevant years with Python, getting around the filesize limit by reading and operating on one line at a time, but I'd prefer an R-only solution (CRAN packages OK). Is there a similar way to read in files a piece at a time in R?
Any ideas would be greatly appreciated.
Update:
Constraints
Needs to use my machine, so no EC2 instances
As R-only as possible. Speed and resources are not concerns in this case... provided my machine doesn't explode...
As you can see below, the data contains mixed types, which I need to operate on later
Data
The data is 3.5GB, with about 8.5 million rows and 17 columns
A couple thousand rows (~2k) are malformed, with only one column instead of 17
These are entirely unimportant and can be dropped
I only need ~100,000 rows out of this file (See below)
Data example:
County; State; Year; Quarter; Segment; Sub-Segment; Sub-Sub-Segment; GDP; ...
Ada County;NC;2009;4;FIRE;Financial;Banks;80.1; ...
Ada County;NC;2010;1;FIRE;Financial;Banks;82.5; ...
NC [Malformed row]
[8.5 Mill rows]
I want to chop out some columns and pick two out of 40 available years (2009-2010 from 1980-2020), so that the data can fit into R:
County; State; Year; Quarter; Segment; GDP; ...
Ada County;NC;2009;4;FIRE;80.1; ...
Ada County;NC;2010;1;FIRE;82.5; ...
[~200,000 rows]
Results:
After tinkering with all the suggestions made, I decided that readLines, suggested by JD and Marek, would work best. I gave Marek the check because he gave a sample implementation.
I've reproduced a slightly adapted version of Marek's implementation for my final answer here, using strsplit and cat to keep only columns I want.
It should also be noted this is MUCH less efficient than Python... as in, Python chomps through the 3.5GB file in 5 minutes while R takes about 60... but if all you have is R then this is the ticket.
## Open a connection separately to hold the cursor position
file.in <- file('bad_data.txt', 'rt')
file.out <- file('chopped_data.txt', 'wt')

line <- readLines(file.in, n = 1)
line.split <- strsplit(line, ';')

# Stitching together only the columns we want
cat(line.split[[1]][1:5], line.split[[1]][8], sep = ';', file = file.out, fill = TRUE)

## Use a loop to read in the rest of the lines
line <- readLines(file.in, n = 1)
while (length(line)) {
  line.split <- strsplit(line, ';')
  if (length(line.split[[1]]) > 1) {
    if (line.split[[1]][3] == '2009') {
      cat(line.split[[1]][1:5], line.split[[1]][8], sep = ';', file = file.out, fill = TRUE)
    }
  }
  line <- readLines(file.in, n = 1)
}
close(file.in)
close(file.out)
Failings by Approach:
sqldf
This is definitely what I'll use for this type of problem in the future if the data is well-formed. However, if it's not, then SQLite chokes.
MapReduce
To be honest, the docs intimidated me on this one a bit, so I didn't get around to trying it. It looked like it required the object to be in memory as well, which would defeat the point if that were the case.
bigmemory
This approach cleanly linked to the data, but it can only handle one type at a time. As a result, all my character vectors dropped when put into a big.table. If I need to design large data sets for the future though, I'd consider only using numbers just to keep this option alive.
scan
Scan seemed to have similar type issues as big memory, but with all the mechanics of readLines. In short, it just didn't fit the bill this time.
My try with readLines. This piece of code creates a CSV file with the selected years.
file_in <- file("in.csv","r")
file_out <- file("out.csv","a")
x <- readLines(file_in, n=1)
writeLines(x, file_out) # copy headers
B <- 300000 # depends how large is one pack
while(length(x)) {
ind <- grep("^[^;]*;[^;]*; 20(09|10)", x)
if (length(ind)) writeLines(x[ind], file_out)
x <- readLines(file_in, n=B)
}
close(file_in)
close(file_out)
I'm not an expert at this, but you might consider trying MapReduce, which would basically mean taking a "divide and conquer" approach. R has several options for this, including:
mapReduce (pure R)
RHIPE (which uses Hadoop); see example 6.2.2 in the documentation for an example of subsetting files
Alternatively, R provides several packages to deal with large data that go outside memory (onto disk). You could probably load the whole dataset into a bigmemory object and do the reduction completely within R. See http://www.bigmemory.org/ for a set of tools to handle this.
Is there a similar way to read in files a piece at a time in R?
Yes. The readChar() function will read in a block of characters without assuming they are null-terminated. If you want to read data in a line at a time you can use readLines(). If you read a block or a line, do an operation, then write the data out, you can avoid the memory issue. Although if you feel like firing up a big memory instance on Amazon's EC2 you can get up to 64GB of RAM. That should hold your file plus plenty of room to manipulate the data.
If you need more speed, then Shane's recommendation to use Map Reduce is a very good one. However if you go the route of using a big memory instance on EC2 you should look at the multicore package for using all cores on a machine.
If you find yourself wanting to read many gigs of delimited data into R, you should at least research the sqldf package, which allows you to import data directly into SQLite from R and then operate on it from within R. I've found sqldf to be one of the fastest ways to import gigs of data into R, as mentioned in this previous question.
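As a hedged sketch of that route, read.csv.sql() loads the file into a temporary SQLite database and returns only the rows matching the query; the file and column names below just follow the question's example:
library(sqldf)

# Only the 2009/2010 rows ever reach R; everything else stays in the temporary SQLite db
dat <- read.csv.sql("bad_data.txt",
                    sql = "select * from file where Year in (2009, 2010)",
                    header = TRUE, sep = ";")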
There's a brand-new package called colbycol that lets you read in only the variables you want from enormous text files:
http://colbycol.r-forge.r-project.org/
It passes any arguments along to read.table, so the combination should let you subset pretty tightly.
The ff package is a transparent way to deal with huge files.
You may see the package website and/or a presentation about it.
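A hedged sketch of the ff route, assuming the ~2k malformed rows have already been stripped out (read.table-style parsing will choke on them), and using the question's file name and separator:
library(ff)

# Data are kept in ff files on disk rather than in RAM; character columns become factors
ffd <- read.csv.ffdf(file = "bad_data.txt", header = TRUE, sep = ";")
dim(ffd)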
I hope this helps
What about using readr and the read_*_chunked family?
So in your case:
testfile.csv
County; State; Year; Quarter; Segment; Sub-Segment; Sub-Sub-Segment; GDP
Ada County;NC;2009;4;FIRE;Financial;Banks;80.1
Ada County;NC;2010;1;FIRE;Financial;Banks;82.5
lol
Ada County;NC;2013;1;FIRE;Financial;Banks;82.5
Actual code
require(readr)
f <- function(x, pos) subset(x, Year %in% c(2009, 2010))
read_csv2_chunked("testfile.csv", DataFrameCallback$new(f), chunk_size = 1)
This applies f to each chunk, remembering the col-names and combining the filtered results in the end. See ?callback which is the source of this example.
This results in:
# A tibble: 2 × 8
County State Year Quarter Segment `Sub-Segment` `Sub-Sub-Segment` GDP
* <chr> <chr> <int> <int> <chr> <chr> <chr> <dbl>
1 Ada County NC 2009 4 FIRE Financial Banks 801
2 Ada County NC 2010 1 FIRE Financial Banks 825
You can even increase chunk_size but in this example there are only 4 lines.
You could import the data into a SQLite database and then use RSQLite to select subsets.
Have you considered bigmemory?
Check out this and this.
Perhaps you can migrate to MySQL or PostgreSQL to free yourself from MS Access limitations.
It is quite easy to connect R to these systems with a DBI-based database connector (DBI is available on CRAN).
scan() has both an nlines argument and a skip argument. Is there some reason you can't just use those to read in a chunk of lines at a time, checking the date to see if it's appropriate? If the input file is ordered by date, you can store an index that tells you what your skip and nlines should be, which would speed up the process in the future.
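A hedged sketch of that idea, stepping through the file with skip/nlines and keeping only 2009/2010 rows; the chunk size and the regular expression are assumptions based on the question's example, and note that each scan() call rescans the file from the top, so a connection-based approach is faster:
chunk_size <- 100000
skip <- 1                 # skip the header line
keep <- character(0)
repeat {
  block <- scan("bad_data.txt", what = character(), sep = "\n", quote = "",
                skip = skip, nlines = chunk_size, quiet = TRUE)
  if (length(block) == 0) break
  # third field is Year in the question's example
  keep <- c(keep, grep("^[^;]*;[^;]*;20(09|10);", block, value = TRUE))
  skip <- skip + chunk_size
}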
These days, 3.5 GB just isn't really that big; I can get access to a machine with 244 GB of RAM (r3.8xlarge) on the Amazon cloud for $2.80/hour. How many hours will it take you to figure out how to solve the problem using big-data-type solutions? How much is your time worth? Yes, it will take you an hour or two to figure out how to use AWS, but you can learn the basics on a free tier, upload the data, read the first 10k lines into R to check it worked, and then fire up a big-memory instance like r3.8xlarge and read it all in! Just my 2c.
Now, in 2017, I would suggest going for Spark and SparkR.
The syntax can be written in a simple, rather dplyr-like way.
It fits quite well with limited memory (limited in the 2017 sense of the word).
However, it may be an intimidating experience to get started...
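For example, a hedged SparkR sketch; the CSV reader option names are assumptions and may need adjusting for your Spark version:
library(SparkR)
sparkR.session()

# Spark reads and filters the file out of core; only the subset is collected into R
df <- read.df("bad_data.txt", source = "csv", header = "true", sep = ";")
sub <- filter(df, df$Year == "2009" | df$Year == "2010")
local_df <- collect(sub)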
I would go for a database and then make some queries to extract the samples you need via DBI, for example as sketched below.
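A hedged DBI sketch, assuming the CSV has already been loaded into a MySQL table called counties in a database called gdp (all names here are made up to match the question's example):
library(DBI)

con <- dbConnect(RMariaDB::MariaDB(), dbname = "gdp", user = "me", password = "...")
dat <- dbGetQuery(con, "
  SELECT County, State, Year, Quarter, Segment, GDP
  FROM counties
  WHERE Year IN (2009, 2010)")
dbDisconnect(con)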
Please avoid importing a 3.5 GB CSV file into SQLite. Or at least double-check that your HUGE database fits within SQLite limits: http://www.sqlite.org/limits.html
It's a damn big DB you have. I would go for MySQL if you need speed. But be prepared to wait a lot of hours for the import to finish. Unless you have some unconventional hardware or you are writing from the future...
Amazon's EC2 could also be a good solution for instantiating a server running R and MySQL.
My two humble pennies' worth.
