Given values in four columns (FromUp, ToUp, FromDown, ToDown), two of them always define a range (FromUp/ToUp and FromDown/ToDown). How can I test whether the two ranges overlap? It is important to note that the range values are not sorted, so the "From" value can be higher than the "To" value and the other way round.
Some example data:
FromUp<-c(5,32,1,5,15,1,6,1,5)
ToUp<-c(5,31,3,5,25,3,6,19,1)
FromDown<-c(1,2,8,1,22,2,1,2,6)
ToDown<-c(4,5,10,6,24,4,1,16,2)
ranges<-data.frame(FromUp,ToUp,FromDown,ToDown)
So that the result would look like:
FromUp ToUp FromDown ToDown Overlap
5 5 1 4 FALSE
32 31 2 5 FALSE
1 3 8 10 FALSE
5 5 1 6 TRUE
15 25 22 24 TRUE
1 3 2 4 TRUE
6 6 1 1 FALSE
1 19 2 16 TRUE
5 1 6 2 TRUE
I tried a few things but did not get it to work; especially the fact that the intervals are not "sorted" makes it too difficult for my R skills to figure out a solution.
I thought about finding the min and max values of each pair of columns (e.g. FromUp, ToUp) and then comparing them?
Any help would be appreciated.
Sort them
rng = cbind(pmin(ranges[,1], ranges[,2]), pmax(ranges[,1], ranges[,2]),
pmin(ranges[,3], ranges[,4]), pmax(ranges[,3], ranges[,4]))
and write the condition
olap = (rng[,1] <= rng[,4]) & (rng[,2] >= rng[,3])
In one step this might be
(pmin(ranges[,1], ranges[,2]) <= pmax(ranges[,3], ranges[,4])) &
(pmax(ranges[,1], ranges[,2]) >= pmin(ranges[,3], ranges[,4]))
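As a quick check against the example data, attaching the one-step result reproduces the Overlap column from the question:
ranges$Overlap <- (pmin(ranges$FromUp, ranges$ToUp) <= pmax(ranges$FromDown, ranges$ToDown)) &
    (pmax(ranges$FromUp, ranges$ToUp) >= pmin(ranges$FromDown, ranges$ToDown))
ranges$Overlap
# [1] FALSE FALSE FALSE  TRUE  TRUE  TRUE FALSE  TRUE  TRUE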
The foverlaps() function mentioned by others (or IRanges::findOverlaps()) would be appropriate if you were looking for overlaps between any pair of ranges, but you're looking for 'parallel' (within-row) overlaps.
The logic of the solution here is the same as in the answer by @Julius, but it is 'vectorized' (e.g., 1 call to pmin() rather than nrow(ranges) calls to sort()) and should be much faster (though it uses more memory) for longer vectors of possible ranges.
In general:
apply(ranges,1,function(x){y<-c(sort(x[1:2]),sort(x[3:4]));max(y[c(1,3)])<=min(y[c(2,4)])})
or, in case intervals cannot overlap at just one point (e.g. because they are open):
!apply(ranges,1,function(x){y<-sort(x)[1:2];all(y==sort(x[1:2]))|all(y==sort(x[3:4]))})
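To see where the two versions differ, consider a hypothetical pair of intervals that touch at exactly one point (this example is mine, not part of the original answer):
touching <- data.frame(FromUp = 1, ToUp = 3, FromDown = 3, ToDown = 5)
apply(touching, 1, function(x){y<-c(sort(x[1:2]),sort(x[3:4]));max(y[c(1,3)])<=min(y[c(2,4)])})
# [1] TRUE   closed intervals: sharing the single point 3 counts as overlap
!apply(touching, 1, function(x){y<-sort(x)[1:2];all(y==sort(x[1:2]))|all(y==sort(x[3:4]))})
# [1] FALSE  open intervals: touching at one point does not count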
Data science student here. New to R, in my first course. I've spent way too much time trying to figure out this exercise, so I figured I would ask someone on here.
I have created a dataframe built from 4 matrices, titled bee_numbers_data_2:
buff_tail garden_bee red_tail honeybee carder_bee
10 8 18 12 8
1 3 9 13 27
37 19 1 16 6
5 6 2 9 32
12 4 4 10 23
The exercise asks us to only show honeybee numbers >= 10.
So I've created a boolean expression to display the TRUE FALSE statements:
bee_numbers_data_2$honeybee>=10
Which returns:
[1] TRUE TRUE TRUE FALSE TRUE
However, I want to display a list of the VALUES of the true statements, not a list of TRUE FALSE statements.
I've been poring over my textbook and the internet trying to figure out this simple problem, so any help would be greatly appreciated. Thanks so much.
Although this is a fairly simple question, covered in most introductory texts on R, I could not find a duplicate on SO, so it seems worth answering here.
Let's break it down. As you already showed, we can use boolean expressions to generate a vector of boolean values:
bee_numbers_data_2 = data.frame(honeybee=c(12,13,16,9,10))
bee_numbers_data_2$honeybee >= 10
# [1] TRUE TRUE TRUE FALSE TRUE
If we want to know which of those are true, we can use the base R function which:
which(bee_numbers_data_2$honeybee >= 10)
# [1] 1 2 3 5
If we want to know the original values corresponding to those position indices, we can use those indices to subset the original data, using [
bee_numbers_data_2$honeybee[which(bee_numbers_data_2$honeybee >= 10)]
# [1] 12 13 16 10
Or, equivalently and a little more simply, we can subset using the boolean values directly:
bee_numbers_data_2$honeybee[bee_numbers_data_2$honeybee >= 10]
Note that as you learn more R, you will find that there are also some more advanced ways to filter and subset data, such as the packages data.table and dplyr. However, it is best to understand how to use base R first, as shown above.
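As a teaser for those packages, a dplyr version of the same filter might look like this (a sketch, not required for the exercise):
library(dplyr)
bee_numbers_data_2 %>% filter(honeybee >= 10)
#   honeybee
# 1       12
# 2       13
# 3       16
# 4       10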
I've googled around, but have not found anything similar to this, but I'm hoping what I'm trying to do has already been done by someone else before.
I have a set of data with timestamps.
I need a running cumulative count of transactions per second, calculated as a true rolling one-second window. It would be nice to just truncate / round off to the nearest second, but that won't be enough for my use case.
Timestamp     Current TPS
00:00:00.1    1
00:00:00.2    2
00:00:00.3    3
00:00:00.4    4
00:00:00.5    5
00:00:00.6    6
00:00:00.7    7
00:00:00.8    8
00:00:00.9    9
00:00:01.0    10    <- 10 TPS here
00:00:01.1    10
00:00:01.2    10    <- still 10 TPS here
00:00:01.4    9     <- only 9 here, because no event at 00:00:01.3
00:00:01.5    9
00:00:01.5    10
00:00:01.8    8
Initially, I was planning to calculate the time difference between rows, but that doesn't answer the question of which rows should be included in or excluded from the aggregate window.
This morning I thought about mutating a new column holding just the subsecond portion of the timestamp, subtracting that from the time column, and then cumsum-ing inside a second if_else mutate that looks back over the last X rows.
Does that sound reasonable? Have I overlooked some other/better approach?
library(dplyr)
timestamps <- c("00:00:00.1", "00:00:00.2", "00:00:00.3", "00:00:00.4", "00:00:00.5", "00:00:00.6", "00:00:00.7", "00:00:00.8", "00:00:00.9", "00:00:01.0", "00:00:01.1", "00:00:01.2", "00:00:01.4", "00:00:01.5", "00:00:01.5", "00:00:01.8") %>%
lubridate::hms %>% # convert to a time period in hours minutes seconds
as.numeric # convert that to a number of seconds
slider::slide_index_dbl(timestamps,
timestamps,
~length(.x), # = how many timestamps are in the window
.before = .99) # Note: using 1 here gave me an incorrect result,
# presumably due to floating point arithmetic errors
# https://en.wikipedia.org/wiki/Floating-point_error_mitigation
[1] 1 2 3 4 5 6 7 8 9 10 10 10 9 10 10 8
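If the floating-point workaround feels fragile, a possible alternative (an assumption on my part: the timestamps never have finer than 0.1 s resolution) is to index on integer tenths of a second, so the window boundary is exact:
# same rolling count, but on integer tenths of a second
ts_tenths <- as.integer(round(timestamps * 10))
slider::slide_index_dbl(ts_tenths,
                        ts_tenths,
                        ~length(.x),
                        .before = 9) # window spans 10 tenths = 1 second
# [1] 1 2 3 4 5 6 7 8 9 10 10 10 9 10 10 8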
Using R and Bioconductor.
I'm not sure how to interpret an integer Rle that you'd get from functions like coverage(), such as this:
integer-Rle of length 3312 with 246 runs
Lengths: 25 34 249 16 7 11 16 ... 2 32 2 26 34 49
Values : 0 1 0 1 2 3 2 ... 1 2 1 0 1 0
Okay, so I get that it represents coverage of one range vs. other ranges, in this case reads of an experiment over a given range. What do the 'runs' mean? What about the 'Lengths' and 'Values'? I thought that maybe Lengths represent a position and Values represent the number of times it's covered, but then why would there be multiples of the same position, such as 2 above? And why would they be out of order?
I ask because I'm using
sum(coverage)
to compare the coverage of one range to another of a different length and I was wondering if that was appropriate.
Probably it's better to ask about Bioconductor packages on the Bioconductor support site.
The interpretation is that there is a run of 25 nucleotides with 0 coverage, then a run of 34 nucleotides with coverage 1 (i.e., a single read), then another run of 249 nucleotides with no coverage, and then things start to get interesting as multiple reads overlap positions. From the summary line at the top of the output, your coverage vector spans 3312 nucleotides, maybe a single transcript? If you were to
plot(as.integer(coverage))
you'd get a quick plot of how coverage varies along the length of the transcript.
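If it helps, base R's rle() uses the same encoding and shows the Lengths/Values pairing in miniature:
x <- c(0,0,0,1,1,0,0,2,2,2)
rle(x)
# Run Length Encoding
#   lengths: int [1:4] 3 2 2 3
#   values : num [1:4] 0 1 0 2
# i.e., three 0s, then two 1s, then two 0s, then three 2s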
Maybe sum(coverage) is appropriate; a more usual metric is to count reads rather than coverage, e.g., with GenomicRanges::summarizeOverlaps(), illustrated in this DESeq2 workflow in the context of RNA-seq.
This might help to understand the concept of RLE: https://www.youtube.com/watch?v=ypdNscvym_E
Here is an easy example:
> x <- IRanges(start=c(-2L, 1L, 3L),
+ width=c( 5L, 4L, 6L))
> x
IRanges of length 3
start end width
[1] -2 2 5
[2] 1 4 4
[3] 3 8 6
> coverage(x)
integer-Rle of length 8 with 2 runs
Lengths: 4 4
Values : 2 1
The output means that the first 4 positions are covered by two ranges each, and the next four positions by a single range. All positions at 0 and below were ignored!
The length is the complete range we are looking at, i.e. all positions together, which is 8.
The runs are the distinct stretches of coverage that occur. Here we only have stretches covered by two overlapping ranges (value 2) and stretches covered by just one range (value 1).
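To make the two runs concrete, you can expand the Rle back to one value per position:
> as.integer(coverage(x))
[1] 2 2 2 2 1 1 1 1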
I have a dataframe of xyz coordinates of units in 5 different boxes, all 4x4x8 so 128 total possible locations. The units are all different lengths. So even though I know the coordinates of the unit (3 units in, 2 left, and 1 up) I don't know the exact location of the unit in the box (12' in, 14' left, 30' up?). The z dimension corresponds to length and is the dimension I am interested in.
My instinct is to run a for loop summing values, but that is generally not the most efficient in R. The key elements of the for loop would be something along the lines of:
for(i in 1:nrow(master)){
  master$unitstartpoint[i] <- if(master$unitz[i] == 1) 0 else
    master$unitstartpoint[i-1] + master$length[i-1]
}
i.e. the unit start point is 0 if it is the first in the z dimension; otherwise it is the start point of the prior unit plus the length of the prior unit. Here's the data:
# generate dataframe
master<-c(rep(1,128),rep(2,128),rep(3,128),rep(4,128),rep(5,128))
master<-as.data.frame(master)
# input basic data--what load number the unit was in, where it was located
# relative other units
master$boxNumber<-master$master
master$unitx<-rep(c(rep(1,32),rep(2,32),rep(3,32),rep(4,32)),5)
master$unity<-c(rep(1,8),rep(2,8),rep(3,8),rep(4,8))
master$unitz<-rep(1:8,80)
# create unique unit ID # based on load number and xyz coords.
master<-transform(master,ID=paste0(boxNumber,unitx,unity,unitz))
# generate how long the unit is. this length will be used to identify unit
# location in the box
master$length<-round(rnorm(640,13,2))
I'm guessing there is a relatively easy way to do this with apply or by but I am unfamiliar with those functions.
Extra info: the unit ID's are unique and the master dataframe is sorted by boxNumber, unitx, unity, and then unitz, respectively.
This is what I am shooting for:
length unitx unity unitz unitstartpoint
15 1 1 1 0
14 1 1 2 15
11 1 1 3 29
13 1 1 4 40
Any guidance would be appreciated. Thanks!
It sounds like you just want a cumulative sum along the z dimension for each box/x/y combination. I used a cumulative sum because if you instead reset to 0 when z=1, your definition would leave off the length at z=8. We can do this easily with ave:
clength <- with(master, ave(length, boxNumber, unitx, unity, FUN=cumsum))
I'm not exactly sure which values you want returned, but this column roughly translates to how you were redefining length above. If I combine it with the original data and look at the total length for the first box for x=1, y=1:4
# head(subset(cbind(master, clength), unitz==8), 4)
  master boxNumber unitx unity unitz length   ID clength
8 1 1 1 1 8 17 1118 111
16 1 1 1 2 8 14 1128 104
24 1 1 1 3 8 10 1138 98
32 1 1 1 4 8 10 1148 99
we see the total lengths for those positions. Since we are using cumsum, we are assuming that the z values are sorted, as you have indicated they are. If you just want one total overall length per box/x/y combo, you can replace cumsum with sum.
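If you want the start point itself, i.e. the cumulative length of the units before each one, as in the unitstartpoint column of the question, you can shift the cumulative sum within each group; a sketch along the same lines as above:
# running total excluding the current unit = cumsum shifted by one
master$unitstartpoint <- with(master, ave(length, boxNumber, unitx, unity,
                                          FUN = function(len) cumsum(len) - len))
head(master[, c("length", "unitx", "unity", "unitz", "unitstartpoint")], 4)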
I'm fairly new to R, so I'd like to apologize in advance if I don't choose the best words to explain my issue.
My problem is that I'd like to create a subset out of a dataset (old) which has several columns. So far no problem...
My subset should start when the value (x) in one of the columns reaches its highest point, and stop right after x has dropped down again to its lowest point.
Then I'd like to create a new dataset (new) with this subset of the data (old).
As there are multiple positions in my original dataset (old) where the value x behaves as described above, I'd like to have a new dataset (new1, new2, new...) for every subset I create.
I hope a was clear in what I'd like to say. If there is more information needed, I'm happy to provide it.
Thanks a lot for your help.
If for instance you have
x <- c(5,4,3,2,1,2,3,4,5,4,3,2,1,2,3,2,1)
Then
direction <- sign(diff(x))
will give a series of +1s and -1s indicating whether x is on an upward or downward swing. We're only interested in downward swings, so let's label upward points with NA, and downward points in the nth swing with the number n:
run <- rle(direction)
run$values[run$values==1] <- NA
run$values[!is.na(run$values)] <- 1:sum(!is.na(run$values))
Now it seems you want to include the last point in a run of downward points (that point itself gets a positive sign, because the point after the lowest point of a downward run is higher). So we need to extend the downward runs by one and shorten the upward runs:
run$lengths <- run$lengths + ifelse(is.na(run$values), -1, +1)
swing <- inverse.rle(run)
plot(x, col=swing)
should colour downward runs in different colours, and omit upward runs. You've now got a variable that labels the runs, and you can split your data.frame by
split(myDataFrame, swing)
You might need to check this works if the series starts/finishes on an up or a down swing.
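Putting the pieces above together as one runnable sketch (myDataFrame here is just a stand-in built from the example vector):
x <- c(5,4,3,2,1,2,3,4,5,4,3,2,1,2,3,2,1)
myDataFrame <- data.frame(x = x)
direction <- sign(diff(x))                                     # +1 up, -1 down
run <- rle(direction)
run$values[run$values==1] <- NA                                # drop upward swings
run$values[!is.na(run$values)] <- 1:sum(!is.na(run$values))    # number downward swings
run$lengths <- run$lengths + ifelse(is.na(run$values), -1, +1) # include each low point
swing <- inverse.rle(run)
split(myDataFrame, swing)                                      # one data frame per downward swing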
Here is an option where we check when direction changes with diff, and then split along that. First, make some data:
df <- data.frame(x=rep(c(1:3, 2:1), 3))
Then:
dir.vec <- c(diff(df$x) <= 0, tail(diff(df$x) <= 0, 1)) # has drop started?
split.vec <- cumsum(c(0, diff(dir.vec)) < 0) # which drop # is this?
split(df[dir.vec,,drop=F], split.vec[dir.vec]) # split drops by drop num
Original:
x
1 1
2 2
3 3
4 2
5 1
6 1
7 2
8 3
9 2
10 1
11 1
12 2
13 3
14 2
15 1
Split:
$`0`
x
3 3
4 2
5 1
$`1`
x
8 3
9 2
10 1
$`2`
x
13 3
14 2
15 1