Plotting genomic data using RCircos package - r

I am trying to use the RCircos package in R to visualize links between genomic positions. I am unfamiliar with this package and have been using the package documentation available from the CRAN repository from 2016.
I have attempted to format my data according to the package requirements. Here is what it looks like:
> head(pts3)
Chromosome ChromStart ChromEnd Chromosome.1 ChromStart.1 ChromEnd.1
1 chr1 33 34 chr1 216 217
2 chr1 33 34 chr1 789 790
3 chr1 33 34 chr1 1716 1717
4 chr1 33 34 chr1 1902 1903
5 chr1 33 34 chr2 2538 2539
6 chr1 33 34 chr2 4278 4279
Ultimately, I would like to produce a plot with tracks from ChromStart to ChromStart.1 and each gene labeled along the outside of the plot. I thought the script would look something like:
RCircos.Set.Core.Components(cyto.info = pts3,
chr.exclude = NULL,
tracks.inside = 1,
tracks.outside = 2)
RCircos.Set.Plot.Area()
RCircos.Chromosome.Ideogram.Plot()
RCircos.Link.Plot(link.data = pts3,
track.num = 3,
by.chromosome = FALSE)
It appears that to do so, I must first initialize with the RCircos.Set.Core.Components() function which requires positional information for each gene to pass to RCircos.Chromosome.Ideogram.Plot(). So, I created a second data frame containing the required information to pass to the function and this is the error that I get:
> head(genes)
Chromosome ChromStart ChromEnd GeneName Band Stain
1 chr1 0 2342 PB2 NA NA
2 chr2 2343 4683 PB1 NA NA
3 chr3 4684 6917 PA NA NA
4 chr4 6918 8710 HA NA NA
5 chr5 8711 10276 NP NA NA
6 chr6 10277 11735 NA NA NA
> RCircos.Set.Core.Components(cyto.info = genes,
+ chr.exclude = NULL,
+ tracks.inside = 1,
+ tracks.outside = 2)
Error in RCircos.Validate.Cyto.Info(cyto.info, chr.exclude) :
Cytoband start should be 0.
I don't actually have data for the Band or Stain columns and don't understand what they are for, but adding data to those columns (such as 1:8 or chr1, chr2, etc.) does not resolve the problem. Based on a recommendation from another forum, I also tried to reset the plot parameters for RCircos using the following functions, but it did not resolve the error:
core.chrom <- data.frame("Chromosome" = c("chr1", "chr2", "chr3", "chr4", "chr5", "chr6", "chr7", "chr8"),
"ChromStart" = c(0, 2343, 4684, 6918, 8711, 10277, 11736, 12763),
"ChromEnd" = c(2342, 4683, 6917, 8710, 10276, 11735, 12762, 13666),
"startLoc" = c(0, 2343, 4684, 6918, 8711, 10277, 11736, 12763),
"endLoc" = c(2342, 4683, 6917, 8710, 10276, 11735, 12762, 13666),
"Band" = NA,
"Stain" = NA)
RCircos.Reset.Plot.Ideogram(chrom.ideo = core.chrom)
Any advice would be deeply appreciated!

I'm not sure whether you figured this one out or moved on. I had the same problem and resolved it by setting the start position of each chromosome to 0, rather than continuing from the end of the previous chromosome. For you it would be:
Chromosome ChromStart ChromEnd GeneName Band Stain
1 chr1 0 2342 PB2 NA NA
2 chr2 0 2340 PB1 NA NA
3 chr3 0 2233 PA NA NA
...etc
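The conversion described above can be sketched in base R. This is a minimal sketch that assumes the cumulative coordinates from the question; `genes_fixed` is a name introduced here for illustration, and only the three segments shown in the answer are used. Positions in the link data would presumably need the same per-chromosome shift.

```r
# Cumulative coordinates as in the question (first three segments only).
genes <- data.frame(
  Chromosome = c("chr1", "chr2", "chr3"),
  ChromStart = c(0, 2343, 4684),
  ChromEnd   = c(2342, 4683, 6917),
  GeneName   = c("PB2", "PB1", "PA")
)

# Shift every segment so that it starts at 0 (genes_fixed is a hypothetical name).
genes_fixed <- genes
genes_fixed$ChromEnd   <- genes$ChromEnd - genes$ChromStart
genes_fixed$ChromStart <- 0

print(genes_fixed)
```

The resulting ChromEnd values (2342, 2340, 2233) match the table in the answer.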

Related

Error in using GADEM function from rGADEM package

I have a large peak list in BED format, which I converted to a GRanges object (GenomicRanges) to use as input for the rGADEM package to find de novo motifs. However, whenever I call the GADEM function I get the error below.
Could anybody who knows this package please help me with this error?
This is a small example of my real file with only 20 rows.
1 chr6 29723590 29723790
2 chr14 103334312 103334512
3 chr1 150579030 150579230
4 chr7 76358527 76358727
5 chr6 11537891 11538091
6 chr14 49893256 49893456
7 chr5 179623200 179623400
8 chr1 228082831 228083031
9 chr12 93441644 93441844
10 chr10 3784776 3784976
11 chr3 183635833 183636033
12 chr7 975301 975501
13 chr12 123364510 123364710
14 chr1 1615578 1615778
15 chr1 36156320 36156520
16 chr14 55051781 55051981
17 chr8 11867697 11867897
18 chr22 38706135 38706335
19 chr6 44265256 44265456
20 chr1 185316658 185316858
and the code that I use is :
library(GenomicRanges)
library(rGADEM)
data = makeGRangesFromDataFrame(data, keep.extra.columns = TRUE)
data = reduce(data)
data = resize(data, width = 50, fix='center')
gadem<-GADEM(data,verbose=1,genome=Hsapiens)
plot(gadem)
and error is:
[ Retrieving sequences... Error in .Call2("C_solve_user_SEW", refwidths, start, end, width, translate.negative.coord:
solving row 136: 'allow.nonnarrowing' is FALSE and the supplied start (55134751) is > refwidth + 1 ]
I should mention that when I try an example input file with fewer than 136 rows, it works and I get motifs.
Thanks in advance.
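One plausible reading of this error, offered as an assumption since the thread has no confirmed answer, is that row 136 has a start position beyond the end of its chromosome in the chosen genome (for example after a genome-build mismatch). A minimal base-R sketch of such a check follows; the `chrom_len` lengths and the `peaks` rows are toy values made up for illustration, not real genome data.

```r
# Toy chromosome lengths (hypothetical; real ones would come from the genome package).
chrom_len <- c(chrA = 1000, chrB = 500)

# Toy peak table in a BED-like layout.
peaks <- data.frame(
  chr   = c("chrA", "chrB", "chrB"),
  start = c(100, 450, 700),
  end   = c(300, 499, 900),
  stringsAsFactors = FALSE
)

# Flag peaks that extend past the end of their chromosome.
out_of_bounds <- unname(peaks$end > chrom_len[as.character(peaks$chr)])
print(peaks[out_of_bounds, ])
```

In a real session the lengths could probably be taken from `seqlengths(Hsapiens)` instead of a hand-written vector, and the check would be run before `resize()` and `GADEM()`.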

Finding overlaps between 2 ranges and their overlapped region lengths?

I need to find the length of the overlapping region between ranges on the same chromosome in two groups (gp1 and gp2). (Similar questions on Stack Overflow differ from my aim, because I want the overlapping region itself, not a TRUE/FALSE answer.)
For example:
gp1:
chr start end id1
chr1 580 600 1
chr1 900 970 2
chr3 400 600 3
chr2 100 700 4
gp2:
chr start end id2
chr1 590 864 1
chr3 550 670 2
chr2 897 1987 3
I'm looking for a way to compare these 2 group and get results like this:
id1 id2 chr overlapped_length
1 1 chr1 10
3 2 chr3 50
Should point you in the right direction:
Load libraries
# install.packages("BiocManager")
# BiocManager::install("GenomicRanges")
library(GenomicRanges)
library(IRanges)
Generate data
gp1 <- read.table(text =
"
chr start end id1
chr1 580 600 1
chr1 900 970 2
chr3 400 600 3
chr2 100 700 4
", header = TRUE)
gp2 <- read.table(text =
"
chr start end id2
chr1 590 864 1
chr3 550 670 2
chr2 897 1987 3
", header = TRUE)
Calculate ranges
gr1 <- GenomicRanges::makeGRangesFromDataFrame(
gp1,
seqnames.field = "chr",
start.field = "start",
end.field = "end"
)
gr2 <- GenomicRanges::makeGRangesFromDataFrame(
gp2,
seqnames.field = "chr",
start.field = "start",
end.field = "end"
)
Calculate overlaps
hits <- findOverlaps(gr1, gr2)
p <- Pairs(gr1, gr2, hits = hits)
i <- pintersect(p)
Result
> as.data.frame(i)
seqnames start end width strand hit
1 chr1 590 600 11 * TRUE
2 chr3 550 600 51 * TRUE
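The widths above are 11 and 51 rather than the 10 and 50 in the question because the ranges are treated as closed intervals, so both endpoints are counted. The arithmetic can be checked with a base-R sketch that needs no Bioconductor packages; the `ov` data frame and its added column names are introduced here for illustration.

```r
gp1 <- data.frame(chr = c("chr1", "chr1", "chr3", "chr2"),
                  start = c(580, 900, 400, 100),
                  end = c(600, 970, 600, 700),
                  id1 = 1:4)
gp2 <- data.frame(chr = c("chr1", "chr3", "chr2"),
                  start = c(590, 550, 897),
                  end = c(864, 670, 1987),
                  id2 = 1:3)

# Pair every gp1 range with every gp2 range on the same chromosome.
ov <- merge(gp1, gp2, by = "chr", suffixes = c(".1", ".2"))

# Intersection bounds; a pair overlaps when ov_start <= ov_end.
ov$ov_start <- pmax(ov$start.1, ov$start.2)
ov$ov_end <- pmin(ov$end.1, ov$end.2)
ov <- ov[ov$ov_start <= ov$ov_end, ]

# Closed-interval width (as reported above) and plain endpoint difference.
ov$width <- ov$ov_end - ov$ov_start + 1
ov$overlapped_length <- ov$ov_end - ov$ov_start

print(ov[, c("id1", "id2", "chr", "width", "overlapped_length")])
```

This all-pairs merge is only meant to show the arithmetic and to carry id1 and id2 through; for large inputs the `findOverlaps()` approach above is the better fit.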

Adding columns by splitting number, and removing duplicates

I have a dataframe like the following (this is a reduced example, I have many more rows and columns):
CH1 CH2 CH3
1 3434 282 7622
2 4442 6968 8430
3 4128 6947 478
4 6718 6716 3017
5 3735 9171 1128
6 65 4876 4875
7 9305 6944 3309
8 4283 6060 650
9 5588 2285 203
10 205 2345 9225
11 8634 4840 780
12 6383 0 1257
13 4533 7692 3760
14 9363 9846 4697
15 3892 79 4372
16 6130 5312 9651
17 7880 7386 6239
18 8515 8021 2295
19 1356 74 8467
20 9024 8626 4136
I need to create additional columns by splitting the values. For example, value 1356 would have to be split into 6, 56, and 356. I do this on a for loop splitting by string. I do this to keep the leading zeros. So far, decent.
library(stringr) # provides str_sub()
# CREATE ADDITIONAL COLUMNS
for(col in 1:3) {
# Create a temporary variable
temp <- as.character(data[,col] )
# Save the new column
for(mod in c(-1, -2, -3)) {
# Create the column
temp <- cbind(temp, str_sub(as.character(data[,col]), mod))
}
# Merge to the row
data <- cbind(data, temp)
}
My problem is that not all cells have 4 digits: some may have 1, 2 or 3 digits. Therefore, I get repeated values when I split. For example, for 79 I get: 79 (original), 9, 79, 79, 79.
Problem: I need to remove the repeated values. Of course, I could do unique, but that gives me rows of uneven number of columns. I need to fill those missing (i.e. the removed repeated values) with NA. I can only compare this by row.
I checked CJ Yetman's answer here, but they only replace consecutive numbers. I only need to keep unique values.
Reproducible Example: Here is a fiddle with my code working: http://rextester.com/IKMP73407
Expected outcome: For example, for rows 11 & 12 of the example (see the link for the reproducible example), if this is my original:
8634 4 34 634 4840 0 40 840 780 0 80 780
6383 3 83 383 0 0 0 0 1257 7 57 257
I'd like to get this:
8634 4 34 634 4840 0 40 840 780 NA 80 NA
6383 3 83 383 0 NA NA NA 1257 7 57 257
You can use apply():
The data:
data <- structure(list(CH1 = c(3434L, 4442L, 4128L, 6718L, 3735L, 65L,
9305L, 4283L, 5588L, 205L, 8634L, 6383L, 4533L, 9363L, 3892L,
6130L, 7880L, 8515L, 1356L, 9024L), CH2 = c(282L, 6968L, 6947L,
6716L, 9171L, 4876L, 6944L, 6060L, 2285L, 2345L, 4840L, 0L, 7692L,
9846L, 79L, 5312L, 7386L, 8021L, 74L, 8626L), CH3 = c(7622L,
8430L, 478L, 3017L, 1128L, 4875L, 3309L, 650L, 203L, 9225L, 780L,
1257L, 3760L, 4697L, 4372L, 9651L, 6239L, 2295L, 8467L, 4136L
)), .Names = c("CH1", "CH2", "CH3"), row.names = c(NA, 20L), class = "data.frame")
Select row 11 and 12:
data <- data[11:12, ]
Using your code:
# CREATE ADDITIONAL COLUMNS
for(col in 1:3) {
# Create a temporary variable
temp <- data[,col]
# Save the new column
for(mod in c(10, 100, 1000)) {
# Create the column
temp <- cbind(temp, data[, col] %% mod)
}
data <- cbind(data, temp)
}
data[,1:3] <- NULL
The result is:
temp V2 V3 V4 temp V2 V3 V4 temp V2 V3 V4
11 8634 4 34 634 4840 0 40 840 780 0 80 780
12 6383 3 83 383 0 0 0 0 1257 7 57 257
Then go through the data row by row and remove duplicates and transpose the outcome:
t(apply(data, 1, function(row) {
row[duplicated(row)] <- NA
return(row)
}))
The result is:
temp V2 V3 V4 temp V2 V3 V4 temp V2 V3 V4
11 8634 4 34 634 4840 0 40 840 780 NA 80 NA
12 6383 3 83 383 0 NA NA NA 1257 7 57 257
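The key step is `duplicated()`, which flags every repeat of a value after its first occurrence, so the first copy in each row survives. A minimal sketch on row 11 from above, using a plain numeric vector (`row11` is a name introduced here):

```r
row11 <- c(8634, 4, 34, 634, 4840, 0, 40, 840, 780, 0, 80, 780)

# Blank every value that already appeared earlier in the row.
row11[duplicated(row11)] <- NA
print(row11)
```

Note that the comparison runs across the whole row, so a value repeated in the columns derived from a different CH column is blanked as well, which is what the expected outcome in the question shows.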

No layers in plot (R)

I have a table that looks like the one below:
chr1 500 15 0.502 na
chr1 1000 21 0.641 0.019704955
chr1 1500 21 0.621 0.016777844
chr1 2000 22 0.534 na
chr1 2500 35 0.698 0.028712731
chr2 4500 2 0.371 na
chr2 5000 3 0.342 na
chr4 5500 1 0.068 na
chr4 6000 0 0.000 na
chr4 6500 0 0.000 na
chr5 7000 2 0.079 na
chr5 7500 12 0.440 na
From this table, I would like to generate multiple plots, one for each chr, where the x-axis and y-axis are columns 2 and 5 respectively.
Based on a response to another question, I tried this,
require(ggplot2)
require(plyr)
Y <- read.table("integ.pi")
names(Y) <- c("Chr","Window","SNPs","covfra","pi")
chrs <- levels(Y[,"Chr"])
c <- lapply(chrs, function(chr) {
ggplot(Y[Y[, "Chr"]==chr,], aes(x=as.factor(Window), y=pi))
})
lapply(c)
But I get an error:
"Error: No layers in plot".
How should I go about this? Any ideas?
Thanks.
Cheers,
Just a simple example to see how to use the commands:
library(ggplot2)
dt = data.frame(Chr = c("c1","c1","c1","c2","c2","c2","c3","c3","c3"),
x = c(1,2,3,4,5,6,7,8,9),
y = c(2,4,5,2,3,4,6,6,7))
ggplot(dt, aes(x,y, col=Chr)) +
geom_point(size = 3) +
geom_line() +
facet_grid(. ~ Chr) # remove this to have all lines in same plot
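As for the error in the question itself: a `ggplot()` call with only data and aesthetics defines no layer, so a geom such as `geom_point()` has to be added before the plot can be drawn. A hedged sketch of the per-chromosome approach follows, assuming the file layout shown in the question. A few rows are inlined via `text =` in place of `integ.pi`, `na.strings = "na"` is assumed so that the last column is read as numeric, and `plots` is a name introduced here.

```r
Y <- read.table(text = "
chr1 500 15 0.502 na
chr1 1000 21 0.641 0.019704955
chr1 1500 21 0.621 0.016777844
chr2 4500 2 0.371 na
", na.strings = "na", stringsAsFactors = FALSE)
names(Y) <- c("Chr", "Window", "SNPs", "covfra", "pi")

# unique() works for character columns, where levels() would return NULL.
chrs <- unique(Y$Chr)

# Build one plot per chromosome; skipped quietly if ggplot2 is not installed.
if (requireNamespace("ggplot2", quietly = TRUE)) {
  plots <- lapply(chrs, function(chr) {
    ggplot2::ggplot(Y[Y$Chr == chr, ], ggplot2::aes(x = Window, y = pi)) +
      ggplot2::geom_point()  # the layer that was missing in the question
  })
  names(plots) <- chrs
}
```

Each element of `plots` can then be printed or saved separately; the faceted plot above gives a similar result in a single figure.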

apply function to return a data.table, or convert the list directly to a data.table

I would like to apply a function that returns a matrix to each row of a large data.table object (original file is around 30 GB, I have 80 GB ram), and get back a data.table object. I'd like to do it efficiently. My current approach is the following:
my.function <- function(x){
alnRanges<-cigarToIRanges(x[6]);
alnStarts<-start(alnRanges)+as.numeric(x[4])-1;
alnEnds<-end(alnRanges)+as.numeric(x[4])-1;
y<-x[-4];
ys<-matrix(rep(y,length(alnRanges)),nrow=length(alnRanges),ncol=length(y),byrow=TRUE);
ys<-cbind(ys,alnStarts,alnEnds);
return(ys); # ys is a matrix
}
my.dt<-fread(my.file.name);
my.list.of.matrices<-apply(my.dt,1,my.function);
new.df<-do.call(rbind.data.frame,my.list.of.matrices);
colnames(new.df)[1:14]<-colnames(my.dt)[-4];
new.dt<-as.data.table(new.df);
Note 1: I show my.function only to make clear that it returns a matrix, so the result of my apply line is a list of matrices.
Note 2: I am not sure how slow these operations are, but it seems I could reduce the number of lines. For example, is it slow to convert a data frame to a data table for large objects?
Reproducible example:
Note that Arun and Roland made me think harder about the problem so I am still working on it... may be that I do not need these lines...
I want to take a sam file, and then create a new coordinates file where each read is split according to its CIGAR field.
My sam file:
qname rname pos cigar
2218 chr1 24613476 42M2S
2067 chr1 87221030 44M
2129 chr1 79702717 44M
2165 chr1 43113438 44M
2086 chr1 52155089 4M921N40M
code:
library("data.table");
library("GenomicRanges");
sam2bed <- function(x){
alnRanges<-cigarToIRanges(x[4]);
alnStarts<-start(alnRanges)+as.numeric(x[3])-1;
alnEnds<-end(alnRanges)+as.numeric(x[3])-1;
#y<-as.data.frame(x[,pos:=NULL]);
#ys<-y[rep(seq_len(nrow(y)),length(alnRanges)),];
y<-x[-3];
ys<-matrix(rep(y,length(alnRanges)),nrow=length(alnRanges),ncol=length(y),byrow=TRUE);
ys<-cbind(ys,alnStarts,alnEnds);
return(ys);
}
sam.chr.dt<-fread(sam.parent.chr.file);
setnames(sam.chr.dt,old=c("V1","V2","V3","V4"),new=c("qname","rname","pos","cigar"));
bed.chr.lom<-apply(sam.chr.dt,1,sam2bed);
> bed.chr.lom
[[1]]
alnStarts alnEnds
[1,] "2218" "chr1" "42M2S" "24613476" "24613517"
[[2]]
alnStarts alnEnds
[1,] "2067" "chr1" "44M" "87221030" "87221073"
[[3]]
alnStarts alnEnds
[1,] "2129" "chr1" "44M" "79702717" "79702760"
[[4]]
alnStarts alnEnds
[1,] "2165" "chr1" "44M" "43113438" "43113481"
[[5]]
alnStarts alnEnds
[1,] "2086" "chr1" "4M921N40M" "52155089" "52155092"
[2,] "2086" "chr1" "4M921N40M" "52156014" "52156053"
bed.chr.df<-do.call(rbind.data.frame,bed.chr.lom);
> bed.chr.df
V1 V2 V3 alnStarts alnEnds
1 2218 chr1 42M2S 24613476 24613517
2 2067 chr1 44M 87221030 87221073
3 2129 chr1 44M 79702717 79702760
4 2165 chr1 44M 43113438 43113481
5 2086 chr1 4M921N40M 52155089 52155092
6 2086 chr1 4M921N40M 52156014 52156053
bed.chr.dt<-as.data.table(bed.chr.df);
> bed.chr.dt
V1 V2 V3 alnStarts alnEnds
1: 2218 chr1 42M2S 24613476 24613517
2: 2067 chr1 44M 87221030 87221073
3: 2129 chr1 44M 79702717 79702760
4: 2165 chr1 44M 43113438 43113481
5: 2086 chr1 4M921N40M 52155089 52155092
6: 2086 chr1 4M921N40M 52156014 52156053
Assuming ff is your data.table, how about this?
splits <- cigarToIRangesListByAlignment(ff$cigar, ff$pos, reduce.ranges = TRUE)
widths <- width(attr(splits, 'partitioning'))
cbind(data.table(qname=rep.int(ff$qname, widths),
rname=rep.int(ff$rname, widths)), as.data.frame(splits))
qname rname space start end width
1: 2218 chr1 1 24613476 24613517 42
2: 2067 chr1 2 87221030 87221073 44
3: 2129 chr1 3 79702717 79702760 44
4: 2165 chr1 4 43113438 43113481 44
5: 2086 chr1 5 52155089 52155092 4
6: 2086 chr1 5 52156014 52156053 40
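To make the splitting logic concrete, here is a simplified base-R sketch of how a CIGAR string maps to reference blocks. It is an illustration written for this example, not the Bioconductor implementation: `cigar_blocks` is a hypothetical helper that treats M, = and X as aligned blocks, lets N and D advance the reference position, and ignores I, S, H and P.

```r
cigar_blocks <- function(cigar, pos) {
  # Split e.g. "4M921N40M" into c("4M", "921N", "40M").
  tokens <- regmatches(cigar, gregexpr("[0-9]+[MIDNSHP=X]", cigar))[[1]]
  len <- as.numeric(sub("[A-Z=]$", "", tokens))
  op  <- sub("^[0-9]+", "", tokens)

  starts <- numeric(0)
  ends   <- numeric(0)
  ref    <- pos  # current position on the reference
  for (i in seq_along(op)) {
    if (op[i] %in% c("M", "=", "X")) {
      # Aligned block: record it and move along the reference.
      starts <- c(starts, ref)
      ends   <- c(ends, ref + len[i] - 1)
      ref    <- ref + len[i]
    } else if (op[i] %in% c("D", "N")) {
      # Deletion or skipped region: advance without recording a block.
      ref <- ref + len[i]
    }
    # I, S, H and P do not consume reference positions.
  }
  data.frame(start = starts, end = ends, width = ends - starts + 1)
}

print(cigar_blocks("4M921N40M", 52155089))
print(cigar_blocks("42M2S", 24613476))
```

For the spliced read this gives the two blocks 52155089 to 52155092 and 52156014 to 52156053, matching rows 5 and 6 above. In recent Bioconductor releases the vectorised equivalent appears to be `cigarRangesAlongReferenceSpace()` in GenomicAlignments, so that is worth checking if `cigarToIRangesListByAlignment()` is not found.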
