R - retrieve specific information from several columns

I have a huge dataframe df which includes information about overlapping intervals (A) and (B) and on which chromosome (chrom) they were located. There is also information about a value (level of gene expression) observed over interval (A).
chrom value Astart Aend Bstart Bend
chr1 0 0 54519752 17408 17431
chr1 0 0 54519752 17368 17391
chr1 0 0 54519752 567761 567783
chr11 0 2 93466832 568111 568133
chr11 0 2 93466832 568149 568171
chr11 0 2 93466832 1880734 1880756
chr11 4 93466844 93466880 93466856 93466878
chr11 2 93466885 135006516 93466889 93466911
chr11 2 93466885 135006516 94199710 94199732
Note that the same interval may appear several times, for instance, an interval (B) will have been reported two times if it overlapped with two (A) intervals:
Astart(1)=========================Aend(1) Astart(2)========================Aend(2)
Bstart(1)=======================================Bend(1)
chrom value Astart Aend Bstart Bend
chr1 0 0 25 15 35 #A(1) and B(1) overlap
chr1 1 28 45 15 35 #A(2) and B(1) overlap
Likewise, an interval (A) will have been reported two or more times if it overlapped with two or more (B) intervals:
Astart(3)===================================================================Aend(3)
Bstart(2)=========Bend(2) Bstart(3)===========Bend(3) Bstart(4)===============Bend(4)
chrom value Astart Aend Bstart Bend
chr4 0 10 100 15 25 #A(3) and B(2) overlap
chr4 0 10 100 30 75 #A(3) and B(3) overlap
chr4 3 10 100 80 120 #A(3) and B(4) overlap
My goal is to output all the individual positions from intervals (B) and the corresponding values from (A). I have a piece of code that beautifully outputs all the relevant positions in (B):
position <- unlist(mapply(seq, ans$Bstart, ans$Bend - 1))
> head(position)
[1] 17408 17409 17410 17411 17412 17413
The problem is that this vector of positions alone is not enough to recover the chromosome information. I need to track chromosome AND position at the same time when I list these positions, because the same position integer may occur on several chromosomes. So I can't just run something like for position %in% range(Astart, Aend) output $chrom, $value (dummy code) afterwards.
How can I retrieve (chrom, position, value) at the same time?
The expected result would be something like this:
> head(expected_result)
chrom position value
chr1 17408 0
chr1 17409 0
chr1 17410 0
chr1 17411 0
chr1 17412 0
chr1 17413 0
#skipping some lines to show another part of the dataframe
chr11 93466856 4
chr11 93466857 4

A call to ddply might be more elegant, but the logic would be the same:
dfA = read.table(textConnection("chrom value Astart Aend Bstart Bend
chr1 0 0 54519752 17408 17431
chr1 0 0 54519752 17368 17391
chr1 0 0 54519752 567761 567783
chr11 0 2 93466832 568111 568133
chr11 0 2 93466832 568149 568171
chr11 0 2 93466832 1880734 1880756
chr11 4 93466844 93466880 93466856 93466878
chr11 2 93466885 135006516 93466889 93466911
chr11 2 93466885 135006516 94199710 94199732"), header = TRUE)
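dfB = as.data.frame(do.call(rbind,
  apply(dfA, MARGIN = 1, FUN = function(x) {
    # apply() coerces each row to a character vector, hence as.numeric()
    cbind(mapply(seq,
                 as.numeric(x['Bstart']),
                 as.numeric(x['Bend']) - 1),
          # chrom and value are recycled for every position of this row
          x['chrom'], x['value'])
  }
)))
# check the column types: the result is built from strings, so convert as needed
lapply(dfB, typeof)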
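The same expansion can also be sketched without apply(), which keeps the column types intact. This is only an illustration under a few assumptions: it uses a three-row toy subset of the question's data, the names n and res are made up here, and every row is assumed to have Bend > Bstart (otherwise seq() would count downwards).

```r
# Toy subset of the question's data (three rows, two chromosomes)
dfA <- data.frame(
  chrom  = c("chr1", "chr1", "chr11"),
  value  = c(0, 0, 4),
  Bstart = c(17408, 17368, 93466856),
  Bend   = c(17431, 17391, 93466878)
)

# Number of positions per row: Bstart, Bstart + 1, ..., Bend - 1
n <- dfA$Bend - dfA$Bstart

# Repeat chrom and value once per position so they stay aligned with it
res <- data.frame(
  chrom    = rep(dfA$chrom, times = n),
  position = unlist(Map(seq, dfA$Bstart, dfA$Bend - 1)),
  value    = rep(dfA$value, times = n)
)

head(res)
```

Because chrom is repeated alongside each position, identical position integers on different chromosomes remain distinguishable.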

Related

Error in using GADEM function from rGADEM package

I have a big peak list in BED format, which I converted to a GenomicRanges object to use as input for the GADEM package to find de novo motifs. But whenever I call the GADEM function, I get the error below.
Could anybody who knows please help me with this error?
This is a small example of my real file, showing only 20 rows.
1 chr6 29723590 29723790
2 chr14 103334312 103334512
3 chr1 150579030 150579230
4 chr7 76358527 76358727
5 chr6 11537891 11538091
6 chr14 49893256 49893456
7 chr5 179623200 179623400
8 chr1 228082831 228083031
9 chr12 93441644 93441844
10 chr10 3784776 3784976
11 chr3 183635833 183636033
12 chr7 975301 975501
13 chr12 123364510 123364710
14 chr1 1615578 1615778
15 chr1 36156320 36156520
16 chr14 55051781 55051981
17 chr8 11867697 11867897
18 chr22 38706135 38706335
19 chr6 44265256 44265456
20 chr1 185316658 185316858
and the code that I use is :
library(GenomicRanges)
library(rGADEM)
data = makeGRangesFromDataFrame(data, keep.extra.columns = TRUE)
data = reduce(data)
data = resize(data, width = 50, fix='center')
gadem<-GADEM(data,verbose=1,genome=Hsapiens)
plot(gadem)
and the error is:
[ Retrieving sequences... Error in .Call2("C_solve_user_SEW", refwidths, start, end, width, translate.negative.coord:
solving row 136: 'allow.nonnarrowing' is FALSE and the supplied start (55134751) is > refwidth + 1 ]
I should mention that when I try an example input file with fewer than 136 rows, it works and I get motifs.
Thanks in advance.

Finding overlaps between 2 ranges and their overlapped region lengths?

I need to find the length of the overlapping region between ranges on the same chromosome in two groups (gp1 and gp2). (Similar questions on Stack Overflow differ from my aim, because I want the overlapping region itself, not a TRUE/FALSE answer.)
For example:
gp1:
chr start end id1
chr1 580 600 1
chr1 900 970 2
chr3 400 600 3
chr2 100 700 4
gp2:
chr start end id2
chr1 590 864 1
chr3 550 670 2
chr2 897 1987 3
I'm looking for a way to compare these 2 groups and get results like this:
id1 id2 chr overlapped_length
1 1 chr1 10
3 2 chr3 50
Should point you in the right direction:
Load libraries
# install.packages("BiocManager")
# BiocManager::install("GenomicRanges")
library(GenomicRanges)
library(IRanges)
Generate data
gp1 <- read.table(text =
"
chr start end id1
chr1 580 600 1
chr1 900 970 2
chr3 400 600 3
chr2 100 700 4
", header = TRUE)
gp2 <- read.table(text =
"
chr start end id2
chr1 590 864 1
chr3 550 670 2
chr2 897 1987 3
", header = TRUE)
Calculate ranges
gr1 <- GenomicRanges::makeGRangesFromDataFrame(
gp1,
seqnames.field = "chr",
start.field = "start",
end.field = "end"
)
gr2 <- GenomicRanges::makeGRangesFromDataFrame(
gp2,
seqnames.field = "chr",
start.field = "start",
end.field = "end"
)
Calculate overlaps
hits <- findOverlaps(gr1, gr2)
p <- Pairs(gr1, gr2, hits = hits)
i <- pintersect(p)
Result
> as.data.frame(i)
seqnames start end width strand hit
1 chr1 590 600 11 * TRUE
2 chr3 550 600 51 * TRUE
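Note that the widths above (11 and 51) count both end points of each range, whereas the question's expected lengths (10 and 50) are end minus start. As a rough base-R sketch of the table the question asks for, assuming the end-minus-start convention and using a plain merge() rather than GenomicRanges (the names m and res are made up here; merge() pairs every range with every other range on the same chromosome, so it only suits small inputs):

```r
gp1 <- data.frame(chr   = c("chr1", "chr1", "chr3", "chr2"),
                  start = c(580, 900, 400, 100),
                  end   = c(600, 970, 600, 700),
                  id1   = 1:4)
gp2 <- data.frame(chr   = c("chr1", "chr3", "chr2"),
                  start = c(590, 550, 897),
                  end   = c(864, 670, 1987),
                  id2   = 1:3)

# All pairs of ranges that share a chromosome
m <- merge(gp1, gp2, by = "chr", suffixes = c(".1", ".2"))

# Overlap length as (smaller end) - (larger start); <= 0 means no overlap
m$overlapped_length <- pmin(m$end.1, m$end.2) - pmax(m$start.1, m$start.2)

res <- m[m$overlapped_length > 0, c("id1", "id2", "chr", "overlapped_length")]
res
```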

Plotting genomic data using RCircos package

I am trying to use the RCircos package in R to visualize links between genomic positions. I am unfamiliar with this package and have been using the package documentation available from the CRAN repository from 2016.
I have attempted to format my data according to the package requirements. Here is what it looks like:
> head(pts3)
Chromosome ChromStart ChromEnd Chromosome.1 ChromStart.1 ChromEnd.1
1 chr1 33 34 chr1 216 217
2 chr1 33 34 chr1 789 790
3 chr1 33 34 chr1 1716 1717
4 chr1 33 34 chr1 1902 1903
5 chr1 33 34 chr2 2538 2539
6 chr1 33 34 chr2 4278 4279
Ultimately, I would like to produce a plot with tracks from ChromStart to ChromStart.1 and each gene labeled along the outside of the plot. I thought the script would look something like:
RCircos.Set.Core.Components(cyto.info = pts3,
chr.exclude = NULL,
tracks.inside = 1,
tracks.outside = 2)
RCircos.Set.Plot.Area()
RCircos.Chromosome.Ideogram.Plot()
RCircos.Link.Plot(link.data = pts3,
track.num = 3,
by.chromosome = FALSE)
It appears that to do so, I must first initialize with the RCircos.Set.Core.Components() function which requires positional information for each gene to pass to RCircos.Chromosome.Ideogram.Plot(). So, I created a second data frame containing the required information to pass to the function and this is the error that I get:
> head(genes)
Chromosome ChromStart ChromEnd GeneName Band Stain
1 chr1 0 2342 PB2 NA NA
2 chr2 2343 4683 PB1 NA NA
3 chr3 4684 6917 PA NA NA
4 chr4 6918 8710 HA NA NA
5 chr5 8711 10276 NP NA NA
6 chr6 10277 11735 NA NA NA
> RCircos.Set.Core.Components(cyto.info = genes,
+ chr.exclude = NULL,
+ tracks.inside = 1,
+ tracks.outside = 2)
Error in RCircos.Validate.Cyto.Info(cyto.info, chr.exclude) :
Cytoband start should be 0.
I don't actually have data for the Band or Stain columns and don't understand what they are for, but adding data to those columns (such as 1:8 or chr1, chr2, etc.) does not resolve the problem. Based on a recommendation from another forum, I also tried to reset the plot parameters for RCircos using the following functions, but it did not resolve the error:
core.chrom <- data.frame("Chromosome" = c("chr1", "chr2", "chr3", "chr4", "chr5", "chr6", "chr7", "chr8"),
"ChromStart" = c(0, 2343, 4684, 6918, 8711, 10277, 11736, 12763),
"ChromEnd" = c(2342, 4683, 6917, 8710, 10276, 11735, 12762, 13666),
"startLoc" = c(0, 2343, 4684, 6918, 8711, 10277, 11736, 12763),
"endLoc" = c(2342, 4683, 6917, 8710, 10276, 11735, 12762, 13666),
"Band" = NA,
"Stain" = NA)
RCircos.Reset.Plot.Ideogram(chrom.ideo = core.chrom)
Any advice would be deeply appreciated!
I'm not sure if you figured this one out or moved on etc. I had the same problem and ended up resolving it by reformatting my start positions for each chromosome to 0 as opposed to a continuation of the previous chr. For you it would be:
Chromosome ChromStart ChromEnd GeneName Band Stain
1 chr1 0 2342 PB2 NA NA
2 chr2 0 2340 PB1 NA NA
3 chr3 0 2233 PA NA NA
...etc
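The reformatting described above can be sketched in a couple of lines: subtract each segment's start from its end, then set every start to 0. This only rebuilds the data frame from the question's first six rows; it does not call RCircos, and whether the package then accepts NA in Band and Stain is not checked here.

```r
genes <- data.frame(
  Chromosome = paste0("chr", 1:6),
  ChromStart = c(0, 2343, 4684, 6918, 8711, 10277),
  ChromEnd   = c(2342, 4683, 6917, 8710, 10276, 11735),
  GeneName   = c("PB2", "PB1", "PA", "HA", "NP", "NA"),  # "NA" is a gene name here
  Band       = NA,
  Stain      = NA
)

# Make each chromosome start at 0 instead of continuing from the previous one
genes$ChromEnd   <- genes$ChromEnd - genes$ChromStart
genes$ChromStart <- 0

genes
```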

Product between two data.frames columns

I have two data.frames:
The first one is the coefficients of my regressions for each day:
> parametrosBase
beta0 beta1 beta4
2015-12-15 0.1622824 -0.012956819 -0.04637442
2015-12-16 0.1641884 -0.007914548 -0.06170213
2015-12-17 0.1623660 -0.005618474 -0.05914809
2015-12-18 0.1643263 0.005380472 -0.08533237
2015-12-21 0.1667710 0.003824588 -0.09040071
The second one holds the independent (x) variables:
> head(ir_dfSTORED)
ind m h0x h1x h4x beta0_h0x beta1_h1x beta4_h4x
1 2015-12-15 21 1 0.5642792 0.2859359 0 0 0
2 2015-12-15 42 1 0.3606713 0.2831963 0 0 0
3 2015-12-15 63 1 0.2550200 0.2334554 0 0 0
4 2015-12-15 84 1 0.1943071 0.1883048 0 0 0
5 2015-12-15 105 1 0.1561231 0.1544524 0 0 0
6 2015-12-15 126 1 0.1302597 0.1297947 0 0 0
> tail(ir_dfSTORED)
ind m h0x h1x h4x beta0_h0x beta1_h1x beta4_h4x
835 2015-12-21 2415 1 0.006799321 0.006799321 0 0 0
836 2015-12-21 2436 1 0.006740707 0.006740707 0 0 0
837 2015-12-21 2457 1 0.006683094 0.006683094 0 0 0
838 2015-12-21 2478 1 0.006626457 0.006626457 0 0 0
839 2015-12-21 2499 1 0.006570773 0.006570773 0 0 0
840 2015-12-21 2520 1 0.006516016 0.006516016 0 0 0
What I want is to multiply the beta0 column of "parametrosBase" by the h0x column of "ir_dfSTORED" and store the result in the beta0_h0x column, and likewise for the others: beta1 and beta4.
The problem I'm facing is with the dates in the "ind" column. The multiplication has to respect the dates.
So, whenever the day changes in "ir_dfSTORED", I have to switch to the same day in "parametrosBase".
For example:
The first row of the "parametrosBase" df,
2015-12-15 0.1622824 -0.012956819 -0.04637442
is fixed for the day 2015-12-15, and I do the products with it. Once I reach 2015-12-16, I have to use the second row of the "parametrosBase" df.
How can I do this?
Thanks a lot. :)
Maybe you should merge the two datasets first:
parametrosBase$ind <- rownames(parametrosBase)
df <- merge(ir_dfSTORED,parametrosBase)
df <- within(df,{
  # pair each coefficient with its own regressor column
  beta0_h0x <- beta0*h0x
  beta1_h1x <- beta1*h1x
  beta4_h4x <- beta4*h4x
})
Since I don't know the structure of the data, you may have to convert the dates from rownames to a date format in order for the merge to work. Using ind as the name of the date in parametrosBase is key to making merge work, otherwise you'll have to specify the variables to merge by.
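If the row order of ir_dfSTORED should be preserved, a lookup with match() is one alternative to merging. The sketch below is only illustrative: it uses a toy version of the two data frames (the 2015-12-16 row of ir_dfSTORED is invented), assumes ind is a Date column, and introduces the helper name idx.

```r
parametrosBase <- data.frame(
  beta0 = c(0.1622824, 0.1641884),
  beta1 = c(-0.012956819, -0.007914548),
  beta4 = c(-0.04637442, -0.06170213),
  row.names = c("2015-12-15", "2015-12-16")
)
ir_dfSTORED <- data.frame(
  ind = as.Date(c("2015-12-15", "2015-12-15", "2015-12-16")),
  m   = c(21, 42, 21),
  h0x = 1,
  h1x = c(0.5642792, 0.3606713, 0.5),   # last value is made up
  h4x = c(0.2859359, 0.2831963, 0.25)   # last value is made up
)

# For every row of ir_dfSTORED, find the parametrosBase row with the same date
idx <- match(as.character(ir_dfSTORED$ind), rownames(parametrosBase))

ir_dfSTORED$beta0_h0x <- parametrosBase$beta0[idx] * ir_dfSTORED$h0x
ir_dfSTORED$beta1_h1x <- parametrosBase$beta1[idx] * ir_dfSTORED$h1x
ir_dfSTORED$beta4_h4x <- parametrosBase$beta4[idx] * ir_dfSTORED$h4x

ir_dfSTORED
```

Dates missing from parametrosBase give NA in idx, so the corresponding products come out as NA rather than being silently dropped.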

Using a column entry as a "selector" for datasets in R

My array looks like this:
Slide Index A B C DoseGroup
482 778 l 0 0 2 13Gy_p_75_42wk
483 778 r 0 0 2 13Gy_p_75_42wk
484 779 l 0 0 2 13Gy_p_75_42wk
485 779 r 0 0 2 13Gy_p_75_42wk
486 4700 l 2 2 2 14.25Gy_C_50pl_42wk
487 4700 r 0 0 1 14.25Gy_C_50pl_42wk
488 4701 l 0 0 1 14.25Gy_C_50pl_42wk
I would like to use the DoseGroup column's entries to be able to select the respective entries in the other columns. I would like to be able to tell R, e.g., "Do a wilcox.test between the 13Gy_p_75_42wk and the 14.25Gy_C_50pl_42wk datasets using column C."
How can I do this with R? Is there some way to select all rows having the entry 14.25Gy_C_50pl_42wk?
I modified your data to add a third level in DoseGroup to make it more realistic.
txt <- "Slide Index A B C DoseGroup
778 l 0 0 2 13Gy_p_75_42wk
778 r 0 0 2 13Gy_p_75_42wk
779 l 0 0 2 13Gy_p_75_42wk
779 r 0 0 2 13Gy_p_75_42wk
4700 l 2 2 2 14.25Gy_C_50pl_42wk
4700 r 0 0 1 14.25Gy_C_50pl_42wk
4701 l 0 0 1 14.25Gy_C_50pl_42wk
4702 l 0 0 10 15Gy_C_50pl_42wk"
dat <- read.table(text = txt, header = TRUE)
wilcox.test(C ~ DoseGroup, data = dat,
subset = DoseGroup %in% c("13Gy_p_75_42wk", "14.25Gy_C_50pl_42wk"))
## Wilcoxon rank sum test with continuity correction
## data: C by DoseGroup
## W = 10, p-value = 0.1175
## alternative hypothesis: true location shift is not equal to 0
To select data, you can use either of these two commands.
dat[dat$DoseGroup == "14.25Gy_C_50pl_42wk", ]
subset(dat, DoseGroup == "14.25Gy_C_50pl_42wk")
These commands are basic R, and any introduction to R will teach you how to use them.
So I urge you to read one if you want to really enjoy R.
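As a small illustration of how the row selection and the test fit together, the two groups can also be pulled out as plain vectors and passed to wilcox.test() directly. This is a sketch on the first seven rows of the data above; the names x, y and res are made up, and exact = FALSE is set only because the data contain ties.

```r
dat <- data.frame(
  Slide = c(778, 778, 779, 779, 4700, 4700, 4701),
  Index = c("l", "r", "l", "r", "l", "r", "l"),
  C     = c(2, 2, 2, 2, 2, 1, 1),
  DoseGroup = c(rep("13Gy_p_75_42wk", 4), rep("14.25Gy_C_50pl_42wk", 3))
)

# Column C for each of the two dose groups
x <- dat$C[dat$DoseGroup == "13Gy_p_75_42wk"]
y <- dat$C[dat$DoseGroup == "14.25Gy_C_50pl_42wk"]

# Two-sample Wilcoxon rank sum test with normal approximation (ties present)
res <- wilcox.test(x, y, exact = FALSE)
res
```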
