I have 2 sets of pairwise alignments, where query genome 1 (q1) is aligned to the reference genome and query genome 2 (q2) is aligned to the same reference genome. Therefore, I have both alignments with a coordinate system in the reference genome. The alignments are in the form of GRanges objects.
I would like to project the breakpoints of q2 onto q1, by aligning the breakpoints of q1 in the center, and look for any clustering of q2 breakpoints around the q1 breakpoints, all in the reference genome coordinate system.
Therefore, I make a GRanges object of q1 with its breakpoints in the center. For example, if there is a breakpoint in q1 relative to the reference genome at scaffold 1, bp 833, then taking a window of 500 bp on either side of it, the q1 GRanges object will have an element:
GRanges object with 1 range and 0 metadata columns:
      seqnames    ranges strand
         <Rle> <IRanges>  <Rle>
  [1]       S1  333-1333      *
  -------
  seqinfo: 576 sequences from an unspecified genome; no seqlengths
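The window arithmetic above can be checked with a minimal base-R sketch. The names `bp` and `half_win` are illustrative and not part of the original code. Note that 500 bp on either side of a breakpoint gives a window 1001 bp wide, not 1000.

```r
# Illustrative values taken from the example above
bp <- 833         # breakpoint position on scaffold S1
half_win <- 500   # bases taken on each side of the breakpoint

win_start <- bp - half_win
win_end   <- bp + half_win
win_width <- win_end - win_start + 1   # ranges are inclusive at both ends

c(start = win_start, end = win_end, width = win_width)
```

Every window built this way has the same width, which is what the `windows` argument of CoverageHeatmap expects.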
I then construct a GRanges object of the breakpoints on q2, in which every range has a width of 1. I intersect this with the q1 GRanges object, so that q2 only retains points that can be projected onto q1.
The CoverageHeatmap function requires:
windows:
A set of GRanges of equal length
track:
A GRanges or RleList object specifying coverage
When I call the CoverageHeatmap function, I always get this error and warning message:
Error: subscript contains out-of-bounds ranges
In addition: Warning message:
In e1 == Rle(e2) :
longer object length is not a multiple of shorter object length
Called from: S4Vectors:::.subscript_error("subscript contains out-of-bounds ",
"ranges")
I've tried a bunch of things to try and make this work and still get the same error and warning message. This is my code (including when I've tried the function with q2 as a GRanges object and an RleList)
## BP Pairwise comparison, using 3rd genome as co-ordinate reference
# q1 is used as the centre point reference, with q2 bps projected on to it.
# gr_ref_q1 is the pw alignment between the reference and query genome 1
# gr_ref_q2 is the pw alignment between the reference and query genome 2
# We construct two GRanges objects to feed into CoverageHeatMaps
library(schoolmath)
library(heatmaps)
library(IRanges)
bp_3gen_v2 <- function(gr_ref_q1, gr_ref_q2, win){
# Failsafes (check ref genome is the same, etc)
if(!(is.even(win))){stop("win should be an even number")}
## Construct g1_rco (1st GRanges object)
# IRanges object
q1_starts1 <- start(ranges(gr_ref_q1)) - (win*0.5)
q1_starts2 <- end(ranges(gr_ref_q1)) - (win*0.5)
q1_starts <- c(q1_starts1, q1_starts2)
q1_ends1 <- start(ranges(gr_ref_q1)) + (win*0.5)
q1_ends2 <- end(ranges(gr_ref_q1)) + (win*0.5)
q1_ends <- c(q1_ends1, q1_ends2)
q1_ir_ob <- IRanges(start = q1_starts, end = q1_ends)
# GR object
g1_vec_seq <- as.vector(seqnames(gr_ref_q1))
gr1_seqnames <- c(g1_vec_seq, g1_vec_seq)
g1_rco <- GRanges(seqnames = gr1_seqnames, ranges = q1_ir_ob,
seqinfo = seqinfo(gr_ref_q1))
# Remove negative ranges from GR object
g1_rco <- g1_rco[!(start(ranges(g1_rco)) < 0)]
## Construct g2_rco (2nd GRanges object)
# IRanges object
q2_starts <- start(ranges(gr_ref_q2))
q2_ends <- end(ranges(gr_ref_q2))
q2_bps <- c(q2_starts, q2_ends)
q2_ir_ob <- IRanges(start = q2_bps, end = q2_bps)
# GR object
g2_vec_seq <- as.vector(seqnames(gr_ref_q2))
gr2_seqnames <- c(g2_vec_seq, g2_vec_seq)
g2_rco <- GRanges(seqnames = gr2_seqnames, ranges = q2_ir_ob,
seqinfo = seqinfo(gr_ref_q2))
# Try removing anywhere in g2_rco that is not present in g1_rco
# find intersection of seqnames
g_inter <- intersect(g1_vec_seq, g2_vec_seq)
# apply to g2_rco to remove out of bound scaffolds
g2_rco <- g2_rco[seqnames(g2_rco) == g_inter]
# now to remove out of bound ranges (GRanges object)
g2_red <- intersect(g1_rco, g2_rco)
# And try as RleList object
g2_red_rle <- coverage(g2_red)
# Heatmap
heat_map <- CoverageHeatmap(windows = g1_rco, track = g2_red_rle)
To avoid these problems and achieve what you need, the simplest solution is to give both GRanges the same seqlevels and seqlengths. If you know these for your reference, provide them; if not, try this:
First example datasets:
library(heatmaps)
gr1 = GRanges(seqnames=c(1,2,3),
IRanges(start=c(1,101,1001),end=c(500,600,1500)))
gr2 = GRanges(seqnames=c(2,2,3,3),
IRanges(start=c(1,301,1,1201),end=c(2500,4800,3500,9700)))
Then we make a combined range to get the levels and lengths:
combined= range(c(gr1,gr2))
seqlevels(gr1) = as.character(seqnames(combined))
seqlevels(gr2) = as.character(seqnames(combined))
seqlengths(gr1) = end(combined)
seqlengths(gr2) = end(combined)
Then the heatmap can be easily obtained by:
CoverageHeatmap(gr1,coverage(gr2))
Or if you only want to look at gr1 windows that have some values in gr2, then do:
CoverageHeatmap(gr1[countOverlaps(gr1,gr2)>0],coverage(gr2))
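As a side note, the warning quoted in the question (`longer object length is not a multiple of shorter object length`) is what R emits when `==` recycles a shorter vector, which is probably what happens in `seqnames(g2_rco) == g_inter`. The following is only a base-R sketch with made-up scaffold names, showing how `==` differs from `%in%` for this kind of filtering:

```r
# Hypothetical scaffold names, for illustration only
scaffolds <- c("S1", "S2", "S3", "S1", "S4")
shared    <- c("S1", "S3")

# `==` compares element by element and recycles `shared`;
# 5 is not a multiple of 2, so R warns (suppressed here).
by_equals <- suppressWarnings(scaffolds == shared)

# `%in%` tests membership, which is what the filter needs.
by_in <- scaffolds %in% shared

rbind(by_equals, by_in)
```

With GRanges objects the same idea would be written as `g2_rco[seqnames(g2_rco) %in% g_inter]`, assuming `g_inter` holds the shared scaffold names as in the question.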
Related
I want to perform what I would call a union on 2 sf objects.
I have the following code:
east.west.sf <- st_sfc(st_polygon(list(cbind(c(1,2,2,1,1),c(0,0,2,2,0)))),
st_polygon(list(cbind(c(0,1,1,0,0),c(0,0,2,2,0)))))
east.west.df <- data.frame(var1=c("east", "west"), var2=c(1,2))
east.west <- st_sf(east.west.df, geom=east.west.sf)
north.south.sf <- st_sfc(st_polygon(list(cbind(c(0,2,2,0,0),c(1,1,2,2,1)))),
st_polygon(list(cbind(c(0,2,2,0,0),c(0,0,1,1,0)))))
north.south.df <- data.frame(var3=c("north", "south"), var4=c(FALSE, TRUE))
north.south <- st_sf(north.south.df, geom=north.south.sf)
map.overlay <- st_union(east.west, north.south)
st_area(map.overlay)
I was expecting all four areas to be 1 unit. Why are they 3 units?
An analogous union() operation with sp objects gives me what I expect.
How can I get four features of 1 unit each with my two initial sf objects?
I have a data frame that contains overlapping sites, and I want to remove the overlapping sites according to this rule:
End(B) > Start(A) & End(A) > Start(B) then we remove site B
For example,
Chrom Start End
scaffold_98 8309 8313
scaffold_98 8311 8320
scaffold_98 8811 8815
In this case, I should remove site (8311,8320) because it overlaps with (8309,8313).
Is there a fast way to do this when working with a large data set?
Using GenomicRanges is not strictly necessary here, but I recommend it for more complex operations involving genomic coordinates. It is a very powerful library that has been designed for this kind of operation.
Here is a solution using GenomicRanges::findOverlaps.
Load the library.
library(GenomicRanges);
Read-in the sample data and turn into GRanges object.
df <- read.table(text =
"Chrom Start End
scaffold_98 8309 8313
scaffold_98 8311 8320
scaffold_98 8811 8815 ", header = T)
gr <- makeGRangesFromDataFrame(df);
Select for non-overlapping regions using findOverlaps.
gr[unique(findOverlaps(gr, type = "any", select = "first"))];
#GRanges object with 2 ranges and 0 metadata columns:
# seqnames ranges strand
# <Rle> <IRanges> <Rle>
# [1] scaffold_98 [8309, 8313] *
# [2] scaffold_98 [8811, 8815] *
# -------
# seqinfo: 1 sequence from an unspecified genome; no seqlengths
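Assuming that your df is ordered by the Start column, this might work:
remove <- integer(0)
for (i in seq_len(nrow(df))[-1]) {
  # Row i is site B, row i-1 is site A; column 2 is Start, column 3 is End
  if (df[i, 3] > df[i-1, 2] && df[i-1, 3] > df[i, 2]) {
    remove <- c(remove, i)
  }
}
# Guard against an empty index: df[-integer(0), ] would drop every row
if (length(remove) > 0) df[-remove, ] else df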
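For large data, the loop above can be replaced by a vectorised comparison of each row with the row before it. This is a minimal sketch under the same assumption (rows sorted by Start); the extra check that both rows share the same Chrom is my addition, and the variable names `cur`, `prev`, `overlaps_prev` and `kept` are illustrative. Like the loop, it only compares adjacent rows.

```r
# Sample data from the question
df <- data.frame(Chrom = rep("scaffold_98", 3),
                 Start = c(8309, 8311, 8811),
                 End   = c(8313, 8320, 8815))

# Sort so that neighbouring rows are neighbouring sites
df <- df[order(df$Chrom, df$Start), ]

n    <- nrow(df)
cur  <- seq_len(n)[-1]   # rows 2..n play the role of site B
prev <- cur - 1          # rows 1..n-1 play the role of site A

# End(B) > Start(A) & End(A) > Start(B), restricted to the same Chrom (assumption)
overlaps_prev <- c(FALSE,
                   df$Chrom[cur] == df$Chrom[prev] &
                   df$End[cur]   >  df$Start[prev] &
                   df$End[prev]  >  df$Start[cur])[seq_len(n)]

kept <- df[!overlaps_prev, ]
kept
```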
I'm lost on how to combine my data into a usable data frame. I have a list of lists of character and numeric vectors. Here is a working example of my code so far:
remove(list=ls())
# Headers for each of my column names
headers <- c("name","p","c","prophylaxis","control","inclusion","exclusion","conversion excluded","infection criteria","age criteria","mean age","age sd")
#_name = author and year
#_p = no. in experimental arm.
#_c = no. in control arm
#_abx = antibiotic used
#_con = control used
#_inc = inclusion criteria
#_exc = exclusion criteria
#_coexc = was conversion to open excluded?
#_infxn = infection criteria
#_agecrit = age criteria
#_agemean = mean age of study
#_agesd = sd age of study
# Passos 2016
passos_name <- c("Passos","2016")
passos_p <- 50
passos_c <- 50
passos_abx <- "cefazolin 1g at induction"
passos_con <- "none"
passos_inc <- c("elective LC","symptomatic cholelithiasis","low risk")
passos_exc <- c("renal impairment","hepatic impairment","immunosuppression","regular steroid use","antibiotics within 48H","acute cholecystitis","choledocolithiasis")
passos_coexc <- TRUE
passos_infxn <- c("temperature >37.8C","tachycardia","asthenia","local pain","local purulence")
passos_agecrit <- NULL
passos_agemean <- 48
passos_agesd <- 13.63
passos <- list(passos_name,passos_p,passos_c,passos_abx,passos_con,passos_inc,passos_exc,passos_coexc,passos_infxn,passos_agecrit,passos_agemean,passos_agesd)
names(passos) <- headers
# Darzi 2016
darzi_name <- c("Darzi","2016")
darzi_p <- 182
darzi_c <- 247
darzi_abx <- c("cefazolin 1g 30min prior to induction","cefazolin 1g 6H after induction","cefazolin 1g 12H after induction")
darzi_con <- "NaCl"
darzi_inc <- c("elective LC","first time abdominal surgery")
darzi_exc <- c("antibiotics within 7 days","immunosuppression","acute cholecystitis","choledocolithiasis","cholangitis","obstructive jaundice",
"pancreatitis","previous biliary tract surgery","previous ERCP","DM","massive intraoperative bleeding","antibiotic allergy","major thalassemia",
"empyema")
darzi_coexc <- TRUE
darzi_infxn <- c("temperature >38C","local purulence","intra-abdominal collection")
darzi_agecrit <- c(">18", "<75")
darzi_agemean <- 43.75
darzi_agesd <- 13.30
darzi <- list(darzi_name,darzi_p,darzi_c,darzi_abx,darzi_con,darzi_inc,darzi_exc,darzi_coexc,darzi_infxn,darzi_agecrit,darzi_agemean,darzi_agesd)
names(darzi) <- headers
# Matsui 2014
matsui_name <- c("Matsui","2014")
matsui_p <- 504
matsui_c <- 505
matsui_abx <- c("cefazolin 1g at induction","cefazolin 1g 12H after induction","cefazolin 1g 24H after induction")
matsui_con <- "none"
matsui_inc <- "elective LC"
matsui_exc <- c("emergent","concurrent surgery","regular insulin use","regular steroid use","antibiotic allergy","HD","antibiotics within 7 days","hepatic impairment","chemotherapy")
matsui_coexc <- FALSE
matsui_infxn <- c("local purulence","intra-abdominal collection","distant infection","temperature >38C")
matsui_agecrit <- ">18"
matsui_agemean <- NULL
matsui_agesd <- NULL
matsui <- list(matsui_name,matsui_p,matsui_c,matsui_abx,matsui_con,matsui_inc,matsui_exc,matsui_coexc,matsui_infxn,matsui_agecrit,matsui_agemean,matsui_agesd)
names(matsui) <- headers
# Find unique exclusion criteria in order to create the list of all possible levels
exc <- ls()[grepl("_exc",ls())]
exclist <- sapply(exc,get)
exc.levels <- unique(unlist(exclist,use.names = F))
# Find unique inclusion criteria in order to create the list of all possible levels
inc <- ls()[grepl("_inc",ls())]
inclist <- sapply(inc,get)
inc.levels <- unique(unlist(inclist,use.names = F))
# Find unique antibiotics in order to create the list of all possible levels
abx <- ls()[grepl("_abx",ls())]
abxlist <- sapply(abx,get)
abx.levels <- unique(unlist(abxlist,use.names = F))
# Find unique controls in order to create the list of all possible levels
con <- ls()[grepl("_con",ls())]
conlist <- sapply(con,get)
con.levels <- unique(unlist(conlist,use.names = F))
# Find unique age criteria in order to create the list of all possible levels
agecrit <- ls()[grepl("_agecrit",ls())]
agecritlist <- sapply(agecrit,get)
agecrit.levels <- unique(unlist(agecritlist,use.names = F))
I have been struggling with:
1) Turn each of the _exc, _inc, _abx, _con, _agecrit lists into factors using the levels generated at the end of the code block. I have been trying to use a for loop such as:
for (x in exc) {
as.name(x) <- factor(get(x),levels = exc.levels)
}
This only creates a variable, x, that stores the last parsed list as a factor.
2) Combine all of my data into a data frame formatted as such:
name, p, c, prophylaxis, control, inclusion, exclusion, conversion excluded, infection criteria, age criteria, mean age, age sd
"Passos 2016", 50, 50, "cefazolin 1g at induction", "none", ["elective LC","symptomatic cholelithiasis","low risk"], ["renal impairment","hepatic impairment","immunosuppression","regular steroid use","antibiotics within 48H","acute cholecystitis","choledocolithiasis"], TRUE, ["temperature >37.8C","tachycardia","asthenia","local pain","local purulence"], NULL, 48, 13.63
...
# [] = factors
# columns correspond to each study's variables (i.e. passos_name, passos_p, passos_c, etc.)
# rows correspond to each study (i.e., passos, darzi, matsui)
I have tried various solutions on StackOverflow, but have not found any that work; for example:
studies <- list(passos,darzi,matsui,ruangsin,turk,naqvi,hassan,sharma,uludag,yildiz,kuthe,koc,maha,tocchi,higgins,mahmoud,kumar)
library(data.table)
rbindlist(lapply(studies,as.data.frame.list))
I suspect my data may not be exactly amenable to a data frame, primarily because I am trying to store a list of factors in a column. Is that allowed? If not, how is this type of data normally stored? My goal is to be able to meaningfully compare these various criteria across studies.
This is too long for a comment, so I turn it into an "answer":
To start with, have a look at what happens here:
data.frame(name = "Passos, 2016", p = 50)
name p
1 Passos, 2016 50
data.frame(name = c("Passos", "2016"), p = 50)
name p
1 Passos 50
2 2016 50
In the first one, we created a data frame with the column "name", which contains one entry "Passos, 2016" (a single character string holding both pieces of information), and the column "p". All fine. In the second version, I specified the column "name" as you did above, using c("Passos", "2016"). This is a two-element vector, and hence we get two rows in the data frame: one with name Passos, one with name 2016, and the column p gets recycled.
Clearly, the latter is probably not what you intended. But it works anyway because R just recycles the shorter vector. Now, what do you think happens if I add a vector that contains three elements?
And this highlights the main issue with what you are doing: you are trying to get a dataframe from many vectors with different lengths. Now, in some cases this is fine if you want the shorter vector to be repeated (in R speech, we call this "recycled"), but it does not look like something you want to do here.
So, my recommendation would be this: try to imagine a matrix and make sure you understand what each element (row and column) is supposed to be. Then specify your data accordingly. If in doubt, look up "tidy data".
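To make the last two points concrete, here is a minimal sketch. The first part shows what happens with a three-element vector; the second part shows one possible "tidy" layout (one row per study and criterion), which is only one option and not the only way to store such data. The object names `res`, `inclusion` and `inclusion_long` are illustrative.

```r
# Lengths 2, 1 and 3 cannot be recycled into a common number of rows,
# so data.frame() stops with an error; we capture its message here.
res <- tryCatch(
  data.frame(name = c("Passos", "2016"),
             p = 50,
             inclusion = c("elective LC", "symptomatic cholelithiasis", "low risk")),
  error = function(e) conditionMessage(e)
)
res

# A long ("tidy") layout: one row per study/criterion pair.
inclusion <- list(
  "Passos 2016" = c("elective LC", "symptomatic cholelithiasis", "low risk"),
  "Darzi 2016"  = c("elective LC", "first time abdominal surgery")
)
inclusion_long <- data.frame(
  study     = rep(names(inclusion), lengths(inclusion)),
  criterion = unlist(inclusion, use.names = FALSE)
)

# Criteria can now be compared across studies, e.g. by counting them
table(inclusion_long$criterion)
```

Scalar variables such as p, c and mean age would go into a separate table with one row per study, and the two tables can be joined on the study name when needed.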
This is I assume a somewhat simple programming issue, but I've been struggling with it. Mostly because I don't know the right words to use, perhaps?
Given a set of "ranges" (in the form of 1-a set of numbers as below, 2-IRanges, or 3-GenomicRanges), I'd like to split it into a set of smaller ranges.
Example Beginning:
Chr Start End
1 1 10000
2 1 5000
Example size of breaks: 2000
New dataset:
Chr Start End
1 1 2000
1 2001 4000
1 4001 6000
1 6001 8000
1 8001 10000
2 1 2000
2 2001 4000
2 4001 5000
I'm doing this in R. I know I could generate these simply with seq, but I'd like to be able to do it based on a list/df of regions instead of having to manually do it every time I have a new list of regions.
Here's an example I've made using seq:
Given 22 chromosomes, loop through them and break each into pieces
# initialize df
Regions <- data.frame(Chromosome = c(), Start = c(), End = c())
# for each row, do the following
for(i in 1:nrow(Chromosomes)){
# create a sequence from the minimum start to the max end by some value
breks <- seq(min(Chromosomes$Start[Chromosomes$Chromosome == i]), max(Chromosomes$End[Chromosomes$Chromosome == i]), by=2000000)
# put this into a dataframe
database <- data.frame(Chromosome = i, Start = breks, End = c(breks[2:length(breks)]-1, max(Chromosomes$End[Chromosomes$Chromosome == i])))
# bind with what we already have
Regions <- rbind(Regions, database)
rm(database)
}
This works fine, but I'm wondering whether some package already provides this as a one-liner, or in a more flexible form, as my approach has its limitations.
Using the R / Bioconductor package GenomicRanges, here are your initial ranges
library(GenomicRanges)
rngs = GRanges(1:2, IRanges(1, c(10000, 5000)))
and then create a sliding window across the genome, generated first as a list (one set of tiles per chromosome) and then unlisted for the format you have in your question
> windows = slidingWindows(rngs, width=2000, step=2000)
> unlist(windows)
GRanges object with 8 ranges and 0 metadata columns:
seqnames ranges strand
<Rle> <IRanges> <Rle>
[1] 1 [ 1, 2000] *
[2] 1 [2001, 4000] *
[3] 1 [4001, 6000] *
[4] 1 [6001, 8000] *
[5] 1 [8001, 10000] *
[6] 2 [ 1, 2000] *
[7] 2 [2001, 4000] *
[8] 2 [4001, 5000] *
-------
seqinfo: 2 sequences from an unspecified genome; no seqlengths
Coerce from / to a data.frame with as(df, "GRanges") or as(unlist(windows), "data.frame").
Find help at ?"slidingWindows,GenomicRanges-method" (tab completion is your friend, ?"slidingW<tab>).
Embarrassingly, this seems to be implemented only in the 'devel' version of GenomicRanges (v. 1.25.93?); tile does something similar but rounds the width of ranges to be approximately equal while spanning the width of the GRanges. Here is a poor-man's version
windows <- function(gr, width, withMcols=FALSE) {
    # Window starts within each range, stepping by `width`
    starts <- Map(seq, start(gr), end(gr), by=width)
    # Each window ends just before the next start; the last one ends at the range end
    ends <- Map(function(starts, len) c(tail(starts, -1) - 1L, len),
                starts, end(gr))
    seq <- rep(seqnames(gr), lengths(starts))
    strand <- rep(strand(gr), lengths(starts))
    result <- GRanges(seq, IRanges(unlist(starts), unlist(ends)), strand)
    seqinfo(result) <- seqinfo(gr)
    if (withMcols) {
        # A GRanges has a length, not a number of rows
        idx <- rep(seq_along(gr), lengths(starts))
        mcols(result) = mcols(gr)[idx,,drop=FALSE]
    }
    result
}
invoked as
> windows(rngs, 2000)
If the approach is useful, consider asking follow-up questions on the Bioconductor support site.
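If the regions live in a plain data frame and GenomicRanges is not available, the same idea can be sketched in base R. This is a hedged illustration, not part of any package: the function name `split_regions` is made up, and the column names Chr, Start and End are assumed from the question's example.

```r
# Hypothetical helper: split each region of a data frame into pieces of `width`
split_regions <- function(regions, width) {
  pieces <- Map(function(chr, s, e) {
    starts <- seq(s, e, by = width)
    data.frame(Chr   = chr,
               Start = starts,
               # the last piece is truncated at the end of the region
               End   = pmin(starts + width - 1, e))
  }, regions$Chr, regions$Start, regions$End)
  do.call(rbind, pieces)
}

# Example data from the question
regions <- data.frame(Chr = c(1, 2), Start = c(1, 1), End = c(10000, 5000))
tiles_df <- split_regions(regions, 2000)
tiles_df
```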
I'm trying to read a GRIB file wavedata.grib with wave heights from the ECMWF ERA-40 website, using an R function. Here is my source code until now:
mylat = 43.75
mylong = 331.25
# read the GRIB file
library(rgdal)
library(sp)
gribfile<-"wavedata.grib"
grib <- readGDAL(gribfile)
summary = GDALinfo(gribfile,silent=TRUE)
save(summary, file="summary.txt",ascii = TRUE)
# >names(summary): rows columns bands ll.x ll.y res.x res.y oblique.x oblique.y
rows = summary[["rows"]]
columns = summary[["columns"]]
bands = summary[["bands"]]
# z=geometry(grib)
# Grid topology:
# cellcentre.offset cellsize cells.dim
# x 326.25 2.5 13
# y 28.75 2.5 7
# SpatialPoints:
# x y
# [1,] 326.25 43.75
# [2,] 328.75 43.75
# [3,] 331.25 43.75
myframe<-t(data.frame(grib))
# myframe[bands+1,3]=331.25 myframe[bands+2,3]=43.75
# myframe[1,3]=2.162918 myframe[2,3]=2.427078 myframe[3,3]=2.211989
# These values should match the values read by Degrib (see below)
# degrib.exe wavedata.grib -P -pnt 43.75,331.25 -Interp 1 > wavedata.txt
# element, unit, refTime, validTime, (43.750000,331.250000)
# SWH, [m], 195709010000, 195709010000, 2.147
# SWH, [m], 195709020000, 195709020000, 2.159
# SWH, [m], 195709030000, 195709030000, 1.931
lines = rows * columns
mycol = 0
for (i in 1:lines) {
if (mylat==myframe[bands+2,i] & mylong==myframe[bands+1,i]) {mycol = i+1}
}
# notice mycol = i+1 in order to get values in column to the right
myvector <- as.numeric(myframe[,mycol])
sink("output.txt")
cat("lat:",myframe[bands+2,mycol],"long:",myframe[bands+1,mycol],"\n")
for (i in 1:bands) { cat(myvector[i],"\n") }
sink()
The wavedata.grib file has gridded SWH values for the period 1957-09-01 to 2002-08-31. Each band refers to a pair of lat/long and has a series of 16436 SWH values at 00h of each day (1 band = 16436 values at a certain lat/long).
myframe has dimensions 16438 x 91. Notice 91 = 7 rows x 13 columns, and the number 16438 is almost equal to the number of bands. The additional 2 rows are the long and lat values; all other rows should be wave heights corresponding to the 16436 bands.
The problem is I want to extract SWH (wave heights) at lat/long = 43.75,331.25, but they don't match the values I get reading the file with Degrib utility at this same lat/long.
Also, the correct values I want (2.147, 2.159, 1.931, ...) are in column 4 and not column 3 of myframe, even though myframe[16438,3]=43.75 (lat) and myframe[16437,3]=331.25 (long). Why is this? I would like to know which lat/long the myframe[i,j] values actually correspond to, or whether there is some data import error in the process. I'm assuming Degrib has no errors.
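For reference, this is how I would expect the column index to follow from the grid topology printed above. It is only a sketch: the function name `cell_index` is made up, and it assumes cells are ordered west to east within a row, with rows running from north to south, as the SpatialPoints listing suggests.

```r
# Hypothetical helper; defaults are copied from the grid topology in the comments above
cell_index <- function(lon, lat,
                       x0 = 326.25, y0 = 28.75,   # cell-centre offsets
                       cellsize = 2.5,
                       n_cols = 13, n_rows = 7) {
  col          <- round((lon - x0) / cellsize) + 1        # 1 = westernmost column
  row_from_top <- n_rows - round((lat - y0) / cellsize)   # 1 = northernmost row
  (row_from_top - 1) * n_cols + col
}

idx <- cell_index(331.25, 43.75)
idx
```

Under these assumptions the point (43.75, 331.25) maps to cell 3, so I do not understand why the matching values appear in column 4.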
Is there any R routine to easily interpolate values in a matrix if I want to extract values between grid points? More generally, I need help in writing an effective R function to extract wave heights like this:
SWH <- function (latitude, longitude, date/time)
Please help.