I have a VCF file with 20 variants on chromosome 1 that I would like to visualise using vcfR.
What I am doing is the following:
library(vcfR)
#Read in my mouse genome, then filter and rename chromosome 1
ref_genome <- ape::read.FASTA("mouse_genome/Mus_musculus.GRCm38.dna.primary_assembly.fa", type = "DNA")
ref_genome_chr1 <- ref_genome[ grep("GRCm38:1:", names(ref_genome))]
names(ref_genome_chr1) <- "1"
ref_genome_chr1 <- as.matrix(ref_genome_chr1)
#Read in my vcf file and also a mouse gff annotation file
vis_test_vcf <- read.vcfR("test_data/filter_chr1_test.recode.vcf", verbose = TRUE)
mouse_gff <- read.table("mouse_genome/Mus_musculus.GRCm38.102.gff3", sep="\t", quote="")
#Generate chromR object
chrom_test <- create.chromR(name="chr1_test", vcf=vis_test_vcf, seq=ref_genome_chr1, ann=mouse_gff, verbose=TRUE)
#Now try and plot this
chromoqc(chrom_test)
When I head() etc. the various objects, they look OK, and I don't get any warnings about chromosome names not matching or anything. However, the plot is missing the "Variants per site" track, which is the one I actually care about. It's also not showing the DP and MQ tracks, but I'm not so worried about those at this stage...
Has anyone had a similar issue? I would be grateful for any pointers!
Kind regards
Cora
OK, so I just found the answer!
I needed to use
proc.chromR()
to process my chromR object; now it plots the variants.
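For anyone else, a minimal sketch of that processing step (window size left at the vcfR default; adjust win.size if needed):
# process the chromR object to populate the per-window summaries,
# then redraw the composite plot
chrom_test <- proc.chromR(chrom_test, verbose = TRUE)
chromoqc(chrom_test)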
Still need to work out the DP and MQ stuff..
I have the following problem: I need to process multiple raster files with the same function from the R package landscapemetrics. My raster files are parts of a country map, all of the same shape and size (i.e. quadrants). I have working code for one file, but I have to do the same with more than 600 rasters, so doing it manually is impractical. The steps in my code are the following:
# 1. I load "raster" and "landscapemetrics" packages:
library(raster)
library(landscapemetrics)
# 2. I read in my quadrant:
Quadrant <- raster("C:\\Users\\customer\\Documents\\ ... \\2434-44.tif")
# 3. I process the raster to get landscape metrics tibble:
LS_metrics <- calculate_lsm(landscape = Quadrant)
# 4. Finally, I write it into a csv:
write.csv(LS_metrics, file = "2434-44.csv")
I need to keep the same file name for my csv files as for the tif files (e.g. results from processing quadrant "2434-44.tif" need to be stored in "2434-44.csv"), ideally in a folder inside the working directory.
I am new to R. I tried to use list.files() and then apply a for loop, but my code did not work.
I need your advice.
Yours faithfully,
Denis
Your question is really about iteration and character (filename) manipulation, not about landscapemetrics as such. There are many similar questions on this site and resources elsewhere that you can consult. The basic approach can look like this:
# get input filenames
# get input filenames
inf <- list.files("/my/path", pattern="\\.tif$", full.names=TRUE)
# create output filenames (escape the dot so it is matched literally)
outf <- gsub("\\.tif$", ".csv", basename(inf))
# perhaps put the output files in a particular folder
dir.create("out", showWarnings=FALSE)
outf <- file.path("out", outf)
# iterate
for (i in seq_along(inf)) {
    # read input
    input <- raster(inf[i])
    # do something (placeholder for the real computation)
    output <- data.frame(id=1)
    # write output
    write.csv(output, outf[i])
}
It's very hard to help without further information. What was the issue with your approach of looping through all files using list.files()? In general, this should work.
Furthermore, most likely you don't want to calculate all available landscape metrics, but rather specify a subselection in the calculate_lsm() call, as sketched below.
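Putting both answers together, a sketch of the full loop; the metric names (lsm_l_shdi, lsm_l_ta) and the input folder are only placeholders for whichever metrics and path you actually use:
library(raster)
library(landscapemetrics)
# input tif files and matching csv output names
inf <- list.files("C:/Users/customer/Documents/quadrants", pattern="\\.tif$", full.names=TRUE)
dir.create("out", showWarnings=FALSE)
outf <- file.path("out", gsub("\\.tif$", ".csv", basename(inf)))
for (i in seq_along(inf)) {
    Quadrant <- raster(inf[i])
    # restrict to the metrics of interest instead of computing all of them
    LS_metrics <- calculate_lsm(landscape = Quadrant,
                                what = c("lsm_l_shdi", "lsm_l_ta"))
    write.csv(LS_metrics, file = outf[i], row.names = FALSE)
}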
I have worked a lot with MaxEnt in R recently (the dismo package), but only using cross-validation to validate my model of bird habitats (a single species). Now I want to use a self-created test sample file. I had to pick these validation points by hand and can't use random test points.
So my R-script looks like this:
library(raster)
library(dismo)
setwd("H:/MaxEnt")
memory.limit(size = 400000)
punkteVG <- read.csv("Validierung_FL_XY_2016.csv", header=T, sep=";", dec=",")
punkteTG <- read.csv("Training_FL_XY_2016.csv", header=T, sep=";", dec=",")
punkteVG$X <- as.numeric(punkteVG$X)
punkteVG$Y <- as.numeric(punkteVG$Y)
punkteTG$X <- as.numeric(punkteTG$X)
punkteTG$Y <- as.numeric(punkteTG$Y)
##### mask NA ######
mask <- raster("final_merge_8class+le_bb_mask.img")
dataframe_VG <- extract(mask, punkteVG)
dataframe_VG[dataframe_VG == 0] <- NA
dataframe_TG <- extract(mask, punkteTG)
dataframe_TG[dataframe_TG == 0] <- NA
punkteVG <- punkteVG*dataframe_VG
punkteTG <- punkteTG*dataframe_TG
#### add the raster dataset ####
habitat_all <- stack("blockstats_stack_8class+le+area_8bit.img")
#### MODEL FITTING #####
library(rJava)
system.file(package = "dismo")
options(java.parameters = "-Xmx1g" )
setwd("H:/MaxEnt/results_8class_LE_AREA")
### backgroundpoints ###
set.seed(0)
backgrVMmax <- randomPoints(habitat_all, 100000, tryf=30)
backgrVM <- randomPoints(habitat_all, 1000, tryf=30)
### Renner (2015) PPM modelfitting Maxent ###
maxentVMmax_Renner <- maxent(habitat_all, punkteTG, backgrVMmax,
                             path='H:/MaxEnt/Ergebnisse_8class_LE_AREA/maxVMmax_Renner',
                             args=c("-P",
                                    "noautofeature",
                                    "nothreshold",
                                    "noproduct",
                                    "maximumbackground=400000",
                                    "noaddsamplestobackground",
                                    "noremoveduplicates",
                                    "replicates=10",
                                    "replicatetype=subsample",
                                    "randomtestpoints=20",
                                    "randomseed=true",
                                    "testsamplesfile=H:/MaxEnt/Validierung_FL_XY_2016_swd_NA"))
After the "maxent()"-command I ran into multiple errors. First I got an error stating that he needs more than 0 (which is the default) "randomtestpoints". So I added "randomtestpoints = 20" (which hopefully doesn't stop the program from using the file). Then I got:
Error: Test samples need to be in SWD format when background data is in SWD format
Error in file(file, "rt") : cannot open the connection
The thing is, when I ran the script with the default cross-validation like this:
maxentVMmax_Renner <- maxent(habitat_all, punkteTG, backgrVMmax,
                             path='H:/MaxEnt/Ergebnisse_8class_LE_AREA/maxVMmax_Renner',
                             args=c("-P",
                                    "noautofeature",
                                    "nothreshold",
                                    "noproduct",
                                    "maximumbackground=400000",
                                    "noaddsamplestobackground",
                                    "noremoveduplicates",
                                    "replicates=10"))
...all works fine.
I also tried multiple things to get my csv validation data into the correct format: two columns (labelled X and Y), three columns (labelled species, X and Y), and other variations. I would rather use the "punkteVG" data frame (the validation data) I created with read.csv, but it seems MaxEnt wants its own file.
I can't imagine my problem is so uncommon. Someone must have used the argument "testsamplesfile" before.
I found out what the problem was. So here it is, for others to enjoy:
The correct maxent command for a subsample file looks like this:
maxentVMmax_Renner <- maxent(habitat_all, punkteTG, backgrVMmax,
                             path='H:/MaxEnt',
                             args=c("-P",
                                    "noautofeature",
                                    "nothreshold",
                                    "noproduct",
                                    "maximumbackground=400000",
                                    "noaddsamplestobackground",
                                    "noremoveduplicates",
                                    "replicates=1",
                                    "replicatetype=Subsample",
                                    "testsamplesfile=H:/MaxEnt/swd.csv"))
Of course, there cannot be multiple replicates, since you have only one subsample.
Most importantly, the "swd.csv" subsample file has to include:
the X and Y coordinates,
the values at the respective points (e.g. obtained with extract(habitat_all, punkteVG)), and
a first column consisting of the word "species" under the header "Species" (since MaxEnt uses the default "species" if you don't define one in the occurrence data).
So the last point was the issue here. Basically, if you don't define the species column in the subsample file, MaxEnt will not know how to assign the data.
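For reference, a sketch of how such a swd.csv might be assembled in R from the objects above (the exact header spelling MaxEnt expects is per the description above; double-check the written file against the MaxEnt manual):
# predictor values at the hand-picked validation points
vals_VG <- extract(habitat_all, punkteVG[, c("X", "Y")])
# SWD layout: species label, coordinates, then the extracted values
swd <- data.frame(Species = "species",
                  punkteVG[, c("X", "Y")],
                  vals_VG)
write.csv(swd, "H:/MaxEnt/swd.csv", row.names = FALSE)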
I'm trying to complete the very simple task of reading in an unphased fasta file and phasing it using ape, and then calculating Tajima's D using pegas, but my data doesn't seem to be reading in correctly. Input and output are as follows:
library("ape")
library("adegenet")
library("ade4")
library("pegas")
DNAbin8c18 <- read.dna(file="fasta8c18.fa", format="f")
I shouldn't need to attach any data since I've just generated the file, but since the data() command was in the manual, I executed
data(DNAbin8c18)
and got
Warning message: In data(DNAbin8c18) : data set ‘DNAbin8c18’ not found
I know that data() only works in certain contexts, so maybe this isn't a big deal. I looked at what had been loaded
DNAbin8c18
817452 DNA sequences in binary format stored in a matrix.
All sequences of same length: 96
Labels:
CLocus_12706_Sample_1_Locus_34105_Allele_0 [BayOfIslands_s08...
CLocus_12706_Sample_2_Locus_31118_Allele_0 [BayOfIslands_s08...
CLocus_12706_Sample_3_Locus_30313_Allele_0 [BayOfIslands_s09...
CLocus_12706_Sample_5_Locus_33345_Allele_0 [BayOfIslands_s09...
CLocus_12706_Sample_7_Locus_37388_Allele_0 [BayOfIslands_s09...
CLocus_12706_Sample_8_Locus_29451_Allele_0 [BayOfIslands_s09... ...
More than 10 million nucleotides: not printing base composition
so it looks like the data should be fine. Because of this, I tried what I actually want to do:
tajima.test(DNAbin8c18)
and got
Error: cannot allocate vector of size 2489.3 Gb
Many people have completed this same test using as many or more SNPs than I have, and also using FASTA files, but is it possible that mine is too big, or can you see another issue?
The data file can be downloaded at the following link
https://drive.google.com/open?id=0B6qb8IlaQGFZLVRYeXMwRnpMTUU
I have also sent an earlier version of this question, with the data, to the r-sig-genetics mailing list, but I have not heard back.
Any thoughts would be much appreciated.
Ella
Thank you for the comment. Indeed, you are correct. The developer just emailed me the following very helpful comments.
The problem is that your data are too big (too many sequences) and tajima.test() needs to compute the matrix of all pairwise distances. You could check this by trying:
dist.dna(DNAbin8c18, "N")
One possibility for you is to sample some observations randomly, and repeat this many times, e.g.:
tajima.test(DNAbin8c18[sample(nrow(DNAbin8c18), size = 1000), ])
This could be:
N <- 1000 # number of repeats
RES <- matrix(0, nrow = 3, ncol = N) # one column per repeat; rows hold D and the two P-values
for (i in 1:N)
    RES[, i] <- unlist(tajima.test(DNAbin8c18[sample(nrow(DNAbin8c18), size = 10000), ]))
You may adjust N and 'size =' to have something not too long to run. Then you may look at the distribution of the rows of RES.
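To actually look at that distribution, e.g. for the D statistic stored in the first row of RES (a minimal sketch, assuming RES was filled as above):
hist(RES[1, ], main = "Tajima's D across random subsamples", xlab = "D")
quantile(RES[1, ], c(0.025, 0.5, 0.975)) # median and a rough 95% range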
I am a student and have been given a project to study climate data from Giovanni (NASA). Our code is provided and we are left to 'find our way', so other answers don't seem to relate to the style of code I've been given. Further to this, I am a beginner in R, so changing the code is very difficult.
Basically I'm trying to create a time-series plot from the following code:
library(lubridate) # provides parse_date_time()
library(ggplot2)   # provides qplot()
## Function for loading Giovanni time series data
load_giovanni_time <- function(path){
file_data <- read.csv(path,
skip=6,
col.names = c("Date",
"Temperature",
"NA",
"Site",
"Bleached"))
file_data$Date <- parse_date_time(file_data$Date, orders="ymdHMS")
return(file_data)
}
## Create a list of files
file.list <- list.files("./Data/courseworktimeseries/")
file.list <- as.list(paste0("./Data/courseworktimeseries/", file.list))
# for(i in file.list){
# load_giovanni_time(i)
# }
#Load all the files
all_data <- lapply(X=file.list,
FUN=load_giovanni_time)
all_data <- as.data.frame(do.call(rbind, all_data))
## Inspect the data with a plot
p <- qplot(data=all_data,
x=Date,
y=Temperature,
colour=Site,
linetype=Bleached,
geom="line")
print(p)
Now the first problem is that when the data is merged into one dataset, all the dates change (the original range is 2002-2015, but it becomes 2002-2030), which obviously ruins the plot. I found that I can stop the dates changing by deleting this line:
file_data$Date <- parse_date_time(file_data$Date, orders="ymdHMS")
However, when this is deleted, I get the following error:
geom_path: Each group consists of only one observation. Do you need to adjust the group aesthetic?
Could anyone help me get round this without editing the code too much? I feel like the problem is that the line parsing the date formats it incorrectly, so I imagine it's only a small issue. I'm just very much a beginner and have to implement the code within 1-2 days.
Thanks
For anyone who ever has this problem... I found the solution.
file_data$Date <- parse_date_time(file_data$Date, orders="ymdHMS")
This line of code parses the date strings from my CSV in that order (year, month, day, then time). In Excel my dates were the other way round, so where a cell said 30th December 2015 (30/12/2015), R would read it in as 2030-12-20, screwing up the data.
In Excel, select all dates, press CTRL+1, and change the format to match the order R is 'parsing' (year first).
All done =)
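An alternative that avoids touching Excel at all (a sketch, assuming the dates in the CSV really are day-first like 30/12/2015) is to tell parse_date_time the correct order instead:
file_data$Date <- parse_date_time(file_data$Date, orders = "dmy HMS")
lubridate matches the orders string loosely, so "dmy" alone also works if the files carry no time component.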
This question is pretty simple and maybe even dumb, but I can't find an answer on Google. I'm trying to read a .txt file into R using this command:
data <- read.csv("perm2test.txt", sep="\t", header=FALSE, row.names=1, fill=TRUE,
                 col.names=paste("V", seq_len(max(count.fields("perm2test.txt", sep="\t"))), sep=""))
The reason I have the col.names argument is that every line in my .txt file has a different number of observations. I've tested this on a much smaller file and it works. However, when I run it on my actual dataset (which is only 48 MB), I'm not sure whether it is working: I haven't received an error message, yet it has been "running" for over 24 hours at this point (just the read.csv command above). Is it possible that it has run out of memory and just doesn't output a warning?
I've looked around and I know people say there are functions out there to reduce the size and remove lines that aren't needed, etc., but to be honest I don't think this file is THAT big, and unfortunately I do need every line in the file... (it's actually only 70 lines, but some lines contain as many as 100k entries, while others may only have, say, 100). Any ideas what is happening?
Obviously untested, but this should give you some code to modify:
datL <- readLines("perm2test.txt") # one line per group
# may want to exclude some lines, but the question is unclear
listL <- lapply(datL, function(L) read.delim(text=L, header=FALSE, colClasses="numeric"))
# this is a list of one-row data frames, one per group
dfL <- data.frame(vals = unlist(listL),
                  # grouping vector that labels each bundle of values;
                  # seq_along() rather than LETTERS so more than 26 groups work
                  groups = factor(rep(seq_along(listL), sapply(listL, length))))
# might have been able to do that last maneuver with `stack`
library(lattice)
bwplot(vals ~ groups, data=dfL)