Count number of alternations in a coin flip sequence - r

I have a sequence of ones and zeros and I would like to count the number of alternations, e.g.:
> x <- rbinom(10, 1, 1/2)
> x
[1] 0 0 1 1 1 1 1 0 1 0
Thus I would like to count (in R) how many times the sequence alternates (flips) between one and zero. In the above sequence, the number of alternations (counted by hand) is 4.

You can use diff():
> x <- rbinom(10,1,1/2)
> x
[1] 0 0 0 1 1 1 1 0 1 0
> sum(diff(x)!=0)
[1] 4

The rle function will count the number of 'runs' of the same value in a vector. Hence the number of runs (minus 1) gives you the number of alternations:
> x
[1] 0 0 0 1 1 1 1 0 1 0
> rle(x)
Run Length Encoding
lengths: int [1:5] 3 4 1 1 1
values : num [1:5] 0 1 0 1 0
> length(rle(x)$lengths)-1
[1] 4
Might be quicker or slower than the diff() method, but it also gives you the run lengths if you need them...
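Whether it beats diff() in practice is easy to check with a rough timing sketch along these lines (numbers will vary by machine):
x <- rbinom(1e6, 1, 0.5)
system.time(replicate(100, sum(diff(x) != 0)))
system.time(replicate(100, length(rle(x)$lengths) - 1))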

It definitely doesn't beat diff in terms of elegance, but here is another way:
sum(x[-1] != head(x, n=-1))
On my system, this seems to be a teeny bit faster:
> x <- rbinom(1e6, 1, 0.5)
> system.time(replicate(100, sum(x[-1] != head(x, n=-1))))
   user  system elapsed
  8.421   3.688  12.150
> system.time(replicate(100, sum(diff(x) != 0)))
   user  system elapsed
  9.431   4.525  14.248
It seems like there ought to be a nice analytic solution for the distribution of the number of unequal adjacent elements in the sequence.
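For what it's worth, for a fair coin each of the n - 1 adjacent-pair indicators diff(x) != 0 is a Bernoulli(1/2) draw, and for p = 1/2 these indicators are independent, so the count should follow a Binomial(n - 1, 1/2) distribution. A quick simulation sketch to check:
n <- 10
sims <- replicate(1e5, sum(diff(rbinom(n, 1, 0.5)) != 0))
mean(sims)  # should be near (n - 1) / 2 = 4.5
var(sims)   # should be near (n - 1) / 4 = 2.25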

The same idea as an explicit loop, in R (sequence is a vector with your coin flips):
count_alternations <- function(sequence) {
  count <- 0
  state <- sequence[1]            # value of the current run
  for (i in 2:length(sequence)) {
    if (sequence[i] != state) {   # the sequence flipped
      state <- sequence[i]
      count <- count + 1
    }
  }
  count                           # count is your result
}
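Checking it against the sequence from the question:
count_alternations(c(0, 0, 1, 1, 1, 1, 1, 0, 1, 0))
# [1] 4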

Related

How do I replace a value taking into account the previous value from a list in R?

I am trying to replace every 1 that comes right after a 0 with a 0.
The vector is something like this:
x <- c(1,1,0,1,1,1,1,1,0,1,1)
I want the output to be like this:
c(1,1,0,0,1,1,1,1,0,0,1)
So each value immediately after a 0 also becomes 0.
I have done it with loops, but because there is a large amount of data it takes a long time to run.
I hope someone could give me an idea.
x[ c(FALSE, x[-length(x)] == 0) ] <- 0
x
# [1] 1 1 0 0 1 1 1 1 0 0 1
identical(
  x,
  c(1,1,0,0,1,1,1,1,0,0,1)  # from the OP
)
# [1] TRUE
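To see how the index works, here it is unpacked step by step (lagged_zero is just an illustrative name):
x <- c(1,1,0,1,1,1,1,1,0,1,1)
lagged_zero <- c(FALSE, x[-length(x)] == 0)  # TRUE wherever the previous element is 0
lagged_zero
# [1] FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE TRUE FALSE
x[lagged_zero] <- 0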
You could do:
x[which(x == 0) + 1] <- 0
x
# [1] 1 1 0 0 1 1 1 1 0 0 1
Note that if x ends in a 0, which(x == 0) + 1 points one past the end and the assignment will append an extra 0, so guard against that case on real data.

Average length of runs selected from a rasterbrick in R

I am working with a RasterBrick "a" with thousands of layers; a closer description is not necessary for my problem. I am using the following function to create a RasterLayer of the total number of runs of at least 5 days with values greater than 1 (one layer in the brick is one day):
indices <- rep(1:69, each = 90)
ff <- function(x, na.rm = TRUE) {
  y <- x > 1
  n <- ave(y, cumsum(y == 0), FUN = cumsum)
  sum(n == 5)
}
Y <- stackApply(a, indices, fun = ff)
This works great, I tested it. In a similar manner, I wrote a new function:
fff <- function(x, na.rm = TRUE) {
  y <- x > 1
  n <- ave(y, cumsum(y == 0), FUN = cumsum)
  mean(n[n >= 5])
}
X <- stackApply(a, indices, fun = fff)
Using this function, I wanted to create a RasterLayer of the average length of those runs of at least 5 days. It seems reasonable, but it does not work correctly. For example, when there is a run of 6 days (satisfying my criterion of value > 1), it counts two runs, one of 5 and one of 6, and thus the average is 5.5 instead of 6. I am not sure how to adjust my function fff. If there is a way to do it, that would be great; otherwise I would be grateful if anyone shares another way to calculate the means of those runs. Thanks!
In the future, please include a minimal, reproducible, self-contained example. Do not describe the behavior of your code, but show it. Also, be very clear about the question. As-is, it is hard to see that your question is not about raster data at all: you are looking for a function that works on any numeric vector (which you may then apply to raster data).
You are looking for a function that finds the local maxima of at least 5 in the cumulative count of consecutive values > 1, and then averages these local maxima.
You have this function
ff <- function(x, na.rm = TRUE) {
  y <- x > 1
  n <- ave(y, cumsum(y == 0), FUN = cumsum)
  sum(n == 5)
}
Example data
x <- c(-1:10, -1:3, -1:6)
x
# [1] -1 0 1 2 3 4 5 6 7 8 9 10 -1 0 1 2 3 -1 0 1 2 3 4 5 6
ff(x)
# [1] 2
(two local maxima of at least 5)
To write the function you want, we can start with what we have:
y <- x > 1
n <- ave(y, cumsum(y == 0), FUN = cumsum)
n
# [1] 0 0 0 1 2 3 4 5 6 7 8 9 0 0 0 1 2 0 0 0 1 2 3 4 5
In this case, you need to find the numbers 9 and 5. You can start with
n[n < 5] <- 0
n
# [1] 0 0 0 0 0 0 0 5 6 7 8 9 0 0 0 0 0 0 0 0 0 0 0 0 5
And now we can use diff to find the local maxima. These are the values that are followed by a smaller value, i.e. where the difference with the next value is negative. Note the zero appended to n so that the last element of the vector can also be a maximum.
i <- which(diff(c(n, 0)) < 0)
i
# [1] 12 25
n[i]
# [1] 9 5
Such that we can put the above together in a function like this
f <- function(x) {
  y <- x > 1
  n <- ave(y, cumsum(y == 0), FUN = cumsum)
  n[n < 5] <- 0
  i <- which(diff(c(n, 0)) < 0)
  mean(n[i])
}
f(x)
# [1] 7
If you have NAs you may do
f <- function(x) {
  y <- x > 1
  y[is.na(y)] <- FALSE
  n <- ave(y, cumsum(y == 0), FUN = cumsum)
  n[n < 5] <- 0
  i <- which(diff(c(n, 0)) < 0)
  mean(n[i])
}
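Plugging this back into the OP's stackApply() call should then give the layer of average run lengths. Note that stackApply() passes a na.rm argument to fun, so f needs to accept one, as the OP's ff did:
X <- stackApply(a, indices, fun = function(x, na.rm = TRUE) f(x))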

Add index to runs of positive or negative values of certain length

I have a dataframe which contains 100,000 rows. It looks like this:
Value
1
2
1
2
2
3
4
-1
3
I want to create an extra column (column B), which consists of 0s and 1s.
It is basically 0, but when there are 5 consecutive data points that are all positive OR all negative, it should give a 1. The points must be in an unbroken run (e.g. when a run of positives is interrupted by a negative number, the count starts again).
Value B
1 0
2 0
1 0
2 0
2 1
3 1
4 1
-1 0
3 0
I tried different loops, but they didn't work. I also tried to convert the whole data frame to a list (and loop over the list), unfortunately to no avail.
Here's an approach that uses the rollmean function from the zoo package.
set.seed(1000)
df = data.frame(Value = sample(-9:9, 1000, replace = TRUE))
sign = sign(df$Value)
library(zoo)
rolling = rollmean(sign, k = 5, fill = 0, align = "right")
df$B = as.numeric(abs(rolling) == 1)
1) Generate 1000 values with positive and negative runs.
2) Extract the sign of the values: -1 for negative, 1 for positive, and 0 for 0.
3) Calculate the right-aligned rolling mean of 5 values (it averages x[1:5], x[2:6], ...). This will be 1 or -1 only if all five values in the window are positive or negative (respectively).
4) Take the absolute value and compare it against 1. This gives a logical vector that turns into the 0s and 1s you asked for.
Note - there's no need for loops. This can all be vectorised (once we have the rolling mean calculated).
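As a sanity check, running the same steps on the nine-row example from the question (values taken from the expected-output table) reproduces column B:
library(zoo)
df <- data.frame(Value = c(1, 2, 1, 2, 2, 3, 4, -1, 3))
rolling <- rollmean(sign(df$Value), k = 5, fill = 0, align = "right")
as.numeric(abs(rolling) == 1)
# [1] 0 0 0 0 1 1 1 0 0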
This will work. Not the most efficient way to do it but the logic is pretty transparent -- just check if there's only one unique sign (i.e. +, -, or 0) for each sequence of five adjacent rows:
dat <- data.frame(Value = c(1,2,1,2,2,3,4,-1,3))
dat$new_col <- NA
dat$new_col[1:4] <- 0
for (x in 5:nrow(dat)) {
  if (length(unique(sign(dat$Value[(x-4):x]))) == 1) {
    dat$new_col[x] <- 1
  } else {
    dat$new_col[x] <- 0
  }
}
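The resulting data frame matches the expected output:
dat
#   Value new_col
# 1     1       0
# 2     2       0
# 3     1       0
# 4     2       0
# 5     2       1
# 6     3       1
# 7     4       1
# 8    -1       0
# 9     3       0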
Use the cumsum(...diff(...) <condition>) idiom to create a grouping variable, and ave to calculate the indices within each group.
d$B2 <- ave(d$Value, cumsum(c(0, diff(sign(d$Value)) != 0)), FUN = function(x) {
  as.integer(seq_along(x) > 4)
})
# Value B B2
# 1 1 0 0
# 2 2 0 0
# 3 1 0 0
# 4 2 0 0
# 5 2 1 1
# 6 3 1 1
# 7 4 1 1
# 8 -1 0 0
# 9 3 0 0
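To see the grouping variable on its own (d is assumed to hold the same Value column as the other answers' dat):
d <- data.frame(Value = c(1, 2, 1, 2, 2, 3, 4, -1, 3))
cumsum(c(0, diff(sign(d$Value)) != 0))
# [1] 0 0 0 0 0 0 0 1 2
ave() then applies seq_along(x) > 4 within each of these groups, so only runs of five or more same-sign values get 1s.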

overlapping segments R

There is a data frame with which I am working; its two columns denote the start and end of a chunk. I need to know how many of these chunks are present at every position from 0 to 23110906. Sometimes the chunks overlap, and sometimes there might be a region which no chunk covers at all. It is like segments in R, but I don't need a visualisation; I just need a quick way to find the number of chunks at every position. Is there an easy way?
Here's some data
m = matrix(c(10, 20, 25, 30), 2)
An IRanges notion is coverage()
> library(IRanges)
> cvg = coverage(IRanges(start=m[,1], end=m[,2]))
> cvg
integer-Rle of length 30 with 4 runs
Lengths: 9 10 6 5
Values : 0 1 2 1
Which is a compact run-length encoding; query at the ith location
> cvg[22]
integer-Rle of length 1 with 1 run
Lengths: 1
Values : 2
> runValue(cvg[22])
[1] 2
Do math on the Rle
> cvg > 1
logical-Rle of length 30 with 3 runs
Lengths: 19 6 5
Values : FALSE TRUE FALSE
or coerce to an integer vector
> as(cvg, "integer")
[1] 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 1 1 1 1 1
This
> cumsum(tabulate(m[,1], 30)) - cumsum(tabulate(m[,2], 30))
[1] 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 1 1 1 1 1 0
will also be reasonably fast.
Note subtle differences between these, from differences in the notion of whether the ends are included (IRanges: yes; tabulate: no) in the range. If these are actually genome coordinates then GenomicRanges is the place to go, to account for seqname (chromosome) and strand.
The data structure you are looking for is called interval tree, which is a type of sorted binary tree that contains (guess what) intervals, each of which usually has start and end positions.
I never used an interval tree to store points as you need, but I guess you can define your intervals as interval.start = interval.end.
Building the tree will take linear time and querying the intervals of your data frame will take logarithmic time, which is much faster than pteetor's quadratic time approach.
The R package IRanges from Bioconductor may help you. I would try the function findOverlaps() and then table() the results. I invite you to read the documentation to see whether it fits your specific needs.
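For example, a sketch using the small matrix m from above (the query positions are made up for illustration; countOverlaps() is the one-step cousin of findOverlaps() plus table()):
library(IRanges)
chunks <- IRanges(start = m[, 1], end = m[, 2])
queries <- IRanges(start = c(5, 15, 22), width = 1)   # positions to query
countOverlaps(queries, chunks)
# [1] 0 1 2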
I took that matrix and examined the overlaps: only five intervals overlapped the next one, and none overlapped the interval after that, assuming they were ordered by their starting positions:
> sum( mat[1:28,2] > mat[2:29,1] )
[1] 5
> sum( mat[1:27,2] > mat[3:29,1] )
[1] 0
So which ones were they?
> which( mat[1:28,2] > mat[2:29,1] )
[1] 19 21 23 25 28
So it seemed rather wasteful of machine resources and time to create a vector 23 million items long; it is a lot easier to simply build a function that counts the number of intervals any particular position falls within:
fchunk <- function(pos) { sum(mat[, 1] <= pos & mat[, 2] >= pos) }
#--------
> fchunk(16675330)
[1] 2
> fchunk(16675329)
[1] 1
These are the intervals where there are 2:
sapply(which(mat[1:28, 2] > mat[2:29, 1]),
       function(int1) c(mat[int1 + 1, 1], mat[int1, 2]))
#--------
[,1] [,2] [,3] [,4] [,5]
n7 16675330 18097680 20233612 21288777 22847516
n8 16724700 18445265 20741145 22780817 22967567
If you really want the count at every position -- all 23,110,906 positions -- this code will tell you.
countChunks = function(i) sum(dfrm$n7 <= i & i <= dfrm$n8)
counts = sapply(1:23110906, countChunks)
But it's very slow. Faster code would require some clever optimization to eliminate the (very) redundant counting done by these two lines.
If you simply want the count at one position, i, just call countChunks(i).

Split numeric vector into unequal sections, then apply a custom function to each section

I have a long sequence of 1s and 0s which represent bird incubation patterns, 1 being bird ON the nest, 0 being OFF.
> Fake.data<- c(1,1,1,1,1,0,0,1,1,1,1,0,0,0,1,1,1,1,0,1,1,1,1,0,0,1,1,1,1,1,0,0,0,0,1,1,0,1,0)
As an end point I would essentially like a single value for the ratio between each OFF period and the ON period that precedes it. So ideally for Fake.data this should be a vector like this
[1] 0.4 0.75 0.25 0.5 0.8 0.5 1 #(I just typed this out!)
So far I have split the vector into sections using split()
> Diff<-diff(Fake.data)
> SPLIT<-split(Fake.data, cumsum(c(1, Diff > 0 )))
> SPLIT
Which returns...
$`1`
[1] 1 1 1 1 1 0 0
$`2`
[1] 1 1 1 1 0 0 0
$`3`
[1] 1 1 1 1 0
$`4`
[1] 1 1 1 1 0 0
$`5`
[1] 1 1 1 1 1 0 0 0 0
$`6`
[1] 1 1 0
$`7`
[1] 1 0
So I can get the ratio for a single split group using
> SPLIT$'1'<- ((length(SPLIT$'1'))-(sum(SPLIT$'1')))/sum(SPLIT$'1')
> SPLIT$'1'
[1] 0.4
However, in my data I have several thousand of these to do, and I would like to apply some sort of tapply() or for() loop to calculate this automatically for all of them and put the results into a single vector. I have tried each of these methods with little success, as the split() output structure does not seem to fit with these functions.
I create a new vector to receive the for() loop output
ratio<-rep(as.character(NA),(length(SPLIT)))
Then I attempted the for() loop using the code that worked for a single run:
for(i in SPLIT$'1':'7')
{ratio[i]<-((length(SPLIT$'[i]'))-(sum(SPLIT$'[i]')))/sum(SPLIT$'[i]')}
What I get is...
[1] "NaN" "NaN" "NaN" "NaN" "NaN" "NaN" NA
I have tried many other variations on this theme but am now really stuck!
I think you were very close with your strategy. The sapply function is very happy to work with lists. I would just change the last step to
sapply(SPLIT, function(x) sum(x==0)/sum(x==1))
which returns
1 2 3 4 5 6 7
0.40 0.75 0.25 0.50 0.80 0.50 1.00
with your sample data. No additional packages needed.
Here are two possibilities:
1) Compute the run lengths using rle; then, in the if statement, if the data starts with 0, drop the first length so we are assured of starting with a 1. Finally compute the ratios using rollapply from the zoo package:
library(zoo)
lengths <- rle(Fake.data)$lengths
if (Fake.data[1] == 0) lengths <- lengths[-1]
rollapply(lengths, 2, by = 2, function(x) x[2]/x[1])
giving:
[1] 0.40 0.75 0.25 0.50 0.80 0.50 1.00
The if line can be removed if we know that the data always starts with a 1.
2) If we can assume that the series always starts with a 1 and ends in a 0 then this one liner would work:
with( rle(Fake.data), lengths[values == 0] / lengths[values == 1] )
giving the same answer as above.
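If the series may instead start with an OFF run or end with an unfinished ON run, one hedged adaptation is to trim those before dividing (both guards are no-ops on Fake.data, which already starts with a 1 and ends with a 0):
r <- rle(Fake.data)
len <- r$lengths
val <- r$values
if (val[1] == 0) { len <- len[-1]; val <- val[-1] }                    # drop leading OFF run
if (tail(val, 1) == 1) { len <- head(len, -1); val <- head(val, -1) }  # drop trailing ON run
len[val == 0] / len[val == 1]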
