Understanding an RLE coverage value - r

Using R and Bioconductor.
I'm not sure how to interpret an integer Rle like the one returned by functions such as coverage(), for example:
integer-Rle of length 3312 with 246 runs
Lengths: 25 34 249 16 7 11 16 ... 2 32 2 26 34 49
Values : 0 1 0 1 2 3 2 ... 1 2 1 0 1 0
Okay, so I get that it represents coverage of one range vs. other ranges; in this case, reads of an experiment over a given range. What do the 'runs' mean? What about the 'Lengths' and 'Values'? I thought that maybe Lengths represent a position and Values represent the number of times it's covered, but then why would there be multiples of the same position, such as 2 above? Why would they be out of order?
I ask because I'm using
sum(coverage)
to compare the coverage of one range to another of a different length and I was wondering if that was appropriate.

Probably it's better to ask about Bioconductor packages on the Bioconductor support site.
The interpretation is that there is a run of 25 nucleotides with 0 coverage, then a run of 34 nucleotides with coverage 1 (i.e., a single read), then another run of 249 nucleotides with no coverage; then things start to get interesting as multiple reads overlap positions. From the summary line at the top of the output, your coverage vector spans 3312 nucleotides, maybe from a single transcript? If you were to
plot(as.integer(coverage))
you'd get a quick plot of how coverage varies along the length of the transcript.
Maybe sum(coverage) is appropriate; a more usual metric is to count reads rather than coverage, e.g., with summarizeOverlaps() (originally in GenomicRanges, now in GenomicAlignments), illustrated in this DESeq2 workflow in the context of RNA-seq.
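To make the Lengths/Values reading concrete, here is a minimal base-R sketch that expands only the first seven runs printed in the question (the remaining runs are elided there, so they are left out here too). The variable names are made up for illustration. On a real Rle, the S4Vectors accessors runLength() and runValue() should return the same two vectors.

```r
# First seven runs copied from the printed Rle in the question
run_lengths <- c(25, 34, 249, 16, 7, 11, 16)
run_values  <- c(0, 1, 0, 1, 2, 3, 2)

# Expand to one coverage value per nucleotide position
per_base <- rep(run_values, times = run_lengths)

length(per_base)                # 358 positions described by these seven runs
sum(per_base)                   # 129, total covered bases
sum(run_lengths * run_values)   # same total, computed without expanding

# Dividing by the number of positions gives a per-base average, which is
# comparable between ranges of different lengths (sum alone is not)
avg_cov <- sum(per_base) / length(per_base)
```

Because sum() grows with the length of the range, the per-base average is probably the safer quantity when the two ranges differ in length.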

This might help to understand the concept of RLE: https://www.youtube.com/watch?v=ypdNscvym_E
Here is an easy example:
> x <- IRanges(start=c(-2L, 1L, 3L),
+ width=c( 5L, 4L, 6L))
> x
IRanges of length 3
start end width
[1] -2 2 5
[2] 1 4 4
[3] 3 8 6
> coverage(x)
integer-Rle of length 8 with 2 runs
Lengths: 4 4
Values : 2 1
The output means that each of the first 4 positions is covered by 2 ranges and each of the next 4 positions is covered by 1 range. All positions at 0 and below were ignored!
The length is the total number of positions we are looking at, here 8.
The runs are the stretches of consecutive positions that share the same coverage value. Here there are two: a stretch where two ranges overlap, and a stretch covered by a single range.
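The same result can be reproduced in base R without IRanges, which may help to see where the runs come from. This is only a sketch of the idea, not how coverage() is implemented; the variable names are illustrative.

```r
starts <- c(-2L, 1L, 3L)
widths <- c(5L, 4L, 6L)
ends   <- starts + widths - 1L          # 2, 4, 8

# Every position touched by every range
pos <- unlist(Map(seq, starts, ends))

# Count how often each position >= 1 occurs (positions <= 0 are dropped)
cov <- tabulate(pos[pos >= 1L], nbins = max(ends))
cov                                     # 2 2 2 2 1 1 1 1

# Run-length encode the per-position coverage
runs <- rle(cov)
runs$lengths                            # 4 4
runs$values                             # 2 1
```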

Related

Given how long many things were each running for; calculating the total number of things that were running over time?

I have a few machines that start operating at the same time. I have a vector of numbers vec_num representing how many units of time each machine was running for since it was started (there are no zero values).
I'm trying to find a way to efficiently calculate how many machines were running over time by making a vector of length max(vec_num) where each element represents a unit of time, and its value represents how many machines were running.
# For instance, take
vec_num <- c(1,1,4,3,1,10)
The ideal output would be a vector as generated from:
vec_num <- lapply(vec_num, function(x) {
vec_zero <- rep(0, max(vec_num))
vec_zero[1:x] <- 1
return(vec_zero)
})
Reduce(`+`, vec_num)
>> 6 3 3 2 1 1 1 1 1 1
Here 6 machines were running at the first unit of time, 3 machines were running at the second and third, 2 at the fourth, and only one machine ran for all 10. Think of the first index as representing how many machines were running at the first unit of time, the second index the second unit of time, the third index the third, and so forth.
However, this way is computationally and memory inefficient, and doesn't scale when there are hundreds of thousands of machines that are running for several thousand units of time. Is there a more efficient way to go about calculating this?
Here is an option:
rev(cumsum(rev(tabulate(vec_num))))
#[1] 6 3 3 2 1 1 1 1 1 1
We can use sequence + table
as.integer(table(sequence(vec_num)))
#[1] 6 3 3 2 1 1 1 1 1 1
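Why the first option works: tabulate() counts how many machines stopped after exactly t units, and the reversed cumulative sum turns that into how many machines lasted at least t units. A small sketch checking both answers against the straightforward approach from the question (the names naive, fast and via_table are illustrative):

```r
vec_num <- c(1, 1, 4, 3, 1, 10)

# Straightforward approach: one 0/1 indicator vector per machine, then add up
naive <- Reduce(`+`, lapply(vec_num, function(x) {
  as.integer(seq_len(max(vec_num)) <= x)
}))

# Count machines that ran exactly t units, then accumulate from the end
stopped_at <- tabulate(vec_num)          # 3 0 1 1 0 0 0 0 0 1
fast <- rev(cumsum(rev(stopped_at)))     # 6 3 3 2 1 1 1 1 1 1

# Enumerate every (machine, time unit) pair and count per time unit
via_table <- as.integer(table(sequence(vec_num)))
```

The tabulate() version allocates only vectors of length max(vec_num), whereas sequence() materialises sum(vec_num) elements, so the former should scale better when there are many long-running machines.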

Viterbi algorithm post-treatment

I am running scripts for a school project on Hidden Markov Models with 2 hidden states. At some point, I use the Viterbi algorithm to find the most likely sequence of hidden states. My output is a vector like this:
c("1","1","1","2","2","1", "1","1","1","2", "2","2")
I would like to count how many subsequences of each state there are, and also record their lengths and positions. The output would be, for example, a matrix like this:
State Length Starting_Position
1 3 1
2 2 4
1 4 6
2 3 10
Is there any R command or package that can do that easily?
Thank you.
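One possible approach, sketched here with base R's rle(): the run values are the states, the run lengths are the subsequence lengths, and the start positions follow from a cumulative sum of the preceding lengths. The names states, runs and result are illustrative.

```r
states <- c("1", "1", "1", "2", "2", "1", "1", "1", "1", "2", "2", "2")

# Run-length encode the state sequence
runs <- rle(states)

result <- data.frame(
  State  = runs$values,
  Length = runs$lengths,
  # A run starts one position after all earlier runs have ended
  Starting_Position = cumsum(c(1L, head(runs$lengths, -1)))
)
result

# Number of subsequences per state
n_per_state <- table(result$State)
```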

Check if two intervals overlap in R

Given values in four columns (FromUp, ToUp, FromDown, ToDown), two of them always define a range (FromUp, ToUp and FromDown, ToDown). How can I test whether the two ranges overlap? It is important to note that the range values are not sorted, so the "From" value can be higher than the "To" value and the other way round.
Some Example data:
FromUp<-c(5,32,1,5,15,1,6,1,5)
ToUp<-c(5,31,3,5,25,3,6,19,1)
FromDown<-c(1,2,8,1,22,2,1,2,6)
ToDown<-c(4,5,10,6,24,4,1,16,2)
ranges<-data.frame(FromUp,ToUp,FromDown,ToDown)
So that the result would look like:
FromUp ToUp FromDown ToDown Overlap
5 5 1 4 FALSE
32 31 2 5 FALSE
1 3 8 10 FALSE
5 5 1 6 TRUE
15 25 22 24 TRUE
1 3 2 4 TRUE
6 6 1 1 FALSE
1 19 2 16 TRUE
5 1 6 2 TRUE
I tried a few things but did not get it to work; in particular, the fact that the intervals are not "sorted" makes it too difficult for my R skills to figure out a solution.
I thought about finding the min and max values of the pairs of columns (e.g. FromUp, ToUp) and then comparing them?
Any help would be appreciated.
Sort them
rng = cbind(pmin(ranges[,1], ranges[,2]), pmax(ranges[,1], ranges[,2]),
pmin(ranges[,3], ranges[,4]), pmax(ranges[,3], ranges[,4]))
and write the condition
olap = (rng[,1] <= rng[,4]) & (rng[,2] >= rng[,3])
In one step this might be
(pmin(ranges[,1], ranges[,2]) <= pmax(ranges[,3], ranges[,4])) &
(pmax(ranges[,1], ranges[,2]) >= pmin(ranges[,3], ranges[,4]))
The foverlaps() function mentioned by others (or IRanges::findOverlaps()) would be appropriate if you were looking for overlaps between any pair of ranges, but you're looking for 'parallel' (within-row?) overlaps.
The logic of the solution here is the same as the answer of @Julius, but is 'vectorized' (e.g., 1 call to pmin(), rather than nrow(ranges) calls to sort()) and should be much faster (though using more memory) for longer vectors of possible ranges.
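As a check, the vectorized condition can be run against the example data from the question and compared with the expected Overlap column. The intermediate names (up_lo, up_hi, down_lo, down_hi) are illustrative.

```r
ranges <- data.frame(
  FromUp   = c(5, 32, 1, 5, 15, 1, 6, 1, 5),
  ToUp     = c(5, 31, 3, 5, 25, 3, 6, 19, 1),
  FromDown = c(1, 2, 8, 1, 22, 2, 1, 2, 6),
  ToDown   = c(4, 5, 10, 6, 24, 4, 1, 16, 2)
)

# Sort each pair so that lo <= hi, regardless of column order
up_lo   <- pmin(ranges$FromUp, ranges$ToUp)
up_hi   <- pmax(ranges$FromUp, ranges$ToUp)
down_lo <- pmin(ranges$FromDown, ranges$ToDown)
down_hi <- pmax(ranges$FromDown, ranges$ToDown)

# Closed intervals overlap when each one starts before the other ends
ranges$Overlap <- (up_lo <= down_hi) & (up_hi >= down_lo)
ranges
```

Replacing <= and >= with < and > would treat intervals that only touch at a single point as non-overlapping.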
In general:
apply(ranges,1,function(x){y<-c(sort(x[1:2]),sort(x[3:4]));max(y[c(1,3)])<=min(y[c(2,4)])})
or, in case intervals cannot overlap at just one point (e.g. because they are open):
!apply(ranges,1,function(x){y<-sort(x)[1:2];all(y==sort(x[1:2]))|all(y==sort(x[3:4]))})

How to change the names of confidence levels per variable in linear regression

I got the confidence levels per variable in a linear regression. I wanted to use the results for sorting variables, so I kept the result set as a data frame. However, when I tried to call str() on one of the variables I got an error (shown below). How can I store the result data set so I'll be able to work on it?
df <- read.table(text = "target birds wolfs
1 9 7
1 8 4
0 2 8
1 2 3
0 1 2
1 7 1
0 1 5
1 9 7
1 8 7
0 2 7
0 2 3
1 6 3
0 1 1
0 3 9
0 1 1 ",header = TRUE)
model<-lm(target~birds+wolfs,data=df)
confint(model)
2.5 % 97.5 %
(Intercept) -0.23133823 0.36256052
birds 0.10102771 0.18768505
wolfs -0.09698902 0.00812353
s<-as.data.frame(confint(model))
str(s$2.5%)
Error: unexpected numeric constant in "str(s$2.5"
The expression behind the $ operator must be a valid R identifier. 2.5% isn’t a valid R identifier, but there’s a simple way of making it one: put it into backticks: `2.5%` [1]. In addition, you need to pay attention that the column name matches exactly (or at least its prefix does). In other words, you need to add a space before the %:
str(s$`2.5 %`)
In general, a$b is the same as a[['b']] (with some subtleties; refer to the documentation). So you can also write:
str(s[['2.5 %']])
Alternatively, you could provide different column names for the data.frame that are valid identifiers, by just assigning different column names. Beware of make.names though: it makes your strings into valid R names, but at the cost of mangling them in ways that are not always obvious. Relying on it risks confusing readers of the code, because previously undeclared identifiers suddenly appear in the code. In the same vein, you should always specify check.names = FALSE with data.frame, otherwise R once again mangles your column names.
[1] In fact, R also accepts single quotes here (s$'2.5 %'). However, I suggest you forget this immediately; it’s a historical accident of the R language, and treating identifiers and strings the same (especially since it’s done inconsistently) does more harm than good.
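A short sketch of the three options on a self-contained model. It uses the built-in mtcars data rather than the data from the question, and the replacement column names lower and upper are arbitrary choices for illustration.

```r
model <- lm(mpg ~ wt + hp, data = mtcars)
s <- as.data.frame(confint(model))
names(s)                      # "2.5 %" "97.5 %" (note the space before %)

# Option 1: backticks make the non-syntactic name usable after $
lower_a <- s$`2.5 %`

# Option 2: [[ takes the name as an ordinary string
lower_b <- s[["2.5 %"]]

# Option 3: assign syntactic names once, then work normally
names(s) <- c("lower", "upper")
sorted <- s[order(s$lower), ]   # e.g. sort variables by lower bound
str(s$lower)
```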

Summing up prior rows contingent on ID #'s in R, for loop vs apply

I have a dataframe of xyz coordinates of units in 5 different boxes, all 4x4x8 so 128 total possible locations. The units are all different lengths. So even though I know the coordinates of the unit (3 units in, 2 left, and 1 up) I don't know the exact location of the unit in the box (12' in, 14' left, 30' up?). The z dimension corresponds to length and is the dimension I am interested in.
My instinct is to run a for loop summing values, but that is generally not the most efficient in R. The key elements of the for loop would be something along the lines of:
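master$unitstartpoint <- 0  # the first unit in z starts at 0
for (i in seq_len(nrow(master)))
  if (master$unitz[i] > 1)  # otherwise: prior start point + prior length
    master$unitstartpoint[i] <- master$unitstartpoint[i-1] + master$length[i-1]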
i.e. the unit start point is 0 if it is the first in the z dimension, otherwise it is the start point of the prior unit + the length of the prior unit. Here's the data:
# generate dataframe
master<-c(rep(1,128),rep(2,128),rep(3,128),rep(4,128),rep(5,128))
master<-as.data.frame(master)
# input basic data--what load number the unit was in, where it was located
# relative to other units
master$boxNumber<-master$master
master$unitx<-rep(c(rep(1,32),rep(2,32),rep(3,32),rep(4,32)),5)
master$unity<-c(rep(1,8),rep(2,8),rep(3,8),rep(4,8))
master$unitz<-rep(1:8,80)
# create unique unit ID # based on load number and xyz coords.
# (the result of transform() must be assigned back, otherwise ID is not stored)
master<-transform(master,ID=paste0(boxNumber,unitx,unity,unitz))
# generate how long the unit is. this length will be used to identify unit
# location in the box
master$length<-round(rnorm(640,13,2))
I'm guessing there is a relatively easy way to do this with apply or by but I am unfamiliar with those functions.
Extra info: the unit ID's are unique and the master dataframe is sorted by boxNumber, unitx, unity, and then unitz, respectively.
This is what I am shooting for:
length unitx unity unitz unitstartpoint
15 1 1 1 0
14 1 1 2 15
11 1 1 3 29
13 1 1 4 40
Any guidance would be appreciated. Thanks!
It sounds like you just want a cumulative sum along the z dimension for each box/x/y combination. I used a cumulative sum because otherwise, if you reset to 0 when z=1, your definition would leave off the length at z=8. We can do this easily with ave:
ml <- with(master, ave(length, boxNumber, unitx, unity, FUN=cumsum))
I'm not exactly sure which values you want returned, but this column roughly translates to how you were redefining length above. If I combine it with the original data and look at the total length for the first box for x=1, y=1:4
# head(subset(cbind(master, ml), unitz==8),4)
master boxNumber unitx unity unitz length ID ml
8 1 1 1 1 8 17 1118 111
16 1 1 1 2 8 14 1128 104
24 1 1 1 3 8 10 1138 98
32 1 1 1 4 8 10 1148 99
we see the total lengths for those positions. Since we are using cumsum, we are assuming that the z values are sorted, as you have indicated they are. If you just want one total overall length per box/x/y combo, you can replace cumsum with sum.
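If the goal is the start point shown in the question (0, 15, 29, 40) rather than the running total, one possible variation is to subtract each unit's own length from the cumulative sum within its group. The small data frame toy below is hypothetical: its first four rows reproduce the target table from the question, and two made-up rows for a second y position show that the start point resets.

```r
toy <- data.frame(
  boxNumber = 1,
  unitx     = 1,
  unity     = c(1, 1, 1, 1, 2, 2),
  unitz     = c(1, 2, 3, 4, 1, 2),
  length    = c(15, 14, 11, 13, 12, 10)
)

# Start point = summed length of all earlier units in the same box/x/y column;
# assumes rows are sorted by unitz within each group
toy$unitstartpoint <- with(toy,
  ave(length, boxNumber, unitx, unity, FUN = function(x) cumsum(x) - x))
toy
```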

Resources