For some reason the na.pad parameter of the diff() function is not working properly. Is anyone else having this problem, or does anyone have a workaround?
yo <- c(5,3,3,4,5,6,5,8,9)
diff(yo, na.pad = TRUE)
[1] -2 0 1 1 1 -1 3 1
The resulting vector should be:
[1] NA -2 0 1 1 1 -1 3 1
The diff method you have in mind most likely comes from the xts (or zoo) package; na.pad has no effect on plain base R vectors. You also need to convert your vector to a time series:
library(xts)
library(zoo)
yy = zoo(yo)
diff(yy, na.pad=TRUE)
# 1 2 3 4 5 6 7 8 9
#NA -2 0 1 1 1 -1 3 1
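If loading a time series package is not an option, the padded result can also be produced in base R by prepending the NA by hand. This is a minimal sketch of that workaround; the name padded is only illustrative.

```r
yo <- c(5, 3, 3, 4, 5, 6, 5, 8, 9)

# diff() on a plain vector drops the first position, so pad it manually
padded <- c(NA, diff(yo))
print(padded)
```

The result has the same length as yo, with NA in the first position.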
I am working with a data frame whose two columns denote the start and end of a chunk. I need to know how many of these chunks are present at every position from 0 to 23110906. Sometimes the chunks overlap, and sometimes there is a region that no chunk covers at all. It is like segments in R, but I don't need a visualisation; I just need a quick way to find the number of chunks at every position. Is there an easy way?
Here's some data
m = matrix(c(10, 20, 25, 30), 2)
An IRanges notion is coverage()
> cvg = coverage(IRanges(start=m[,1], end=m[,2]))
> cvg
integer-Rle of length 30 with 4 runs
Lengths: 9 10 6 5
Values : 0 1 2 1
This is a compact run-length encoding. Query it at the ith location:
> cvg[22]
integer-Rle of length 1 with 1 run
Lengths: 1
Values : 2
> runValue(cvg[22])
[1] 2
Do math on the Rle
> cvg > 1
logical-Rle of length 30 with 3 runs
Lengths: 19 6 5
Values : FALSE TRUE FALSE
or coerce to an integer vector
> as(cvg, "integer")
[1] 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 1 1 1 1 1
This
> cumsum(tabulate(m[,1], 30)) - cumsum(tabulate(m[,2], 30))
[1] 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 1 1 1 1 1 0
will also be reasonably fast.
Note the subtle difference between the two results, which comes from whether the end position is counted as part of the range (IRanges: yes; tabulate: no). If these are actually genome coordinates, then GenomicRanges is the place to go, since it accounts for seqname (chromosome) and strand.
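If the end positions should be counted, as in the IRanges result, one way to adapt the tabulate approach is to register each end one position later. The sketch below assumes integer coordinates and reuses the toy data above; the names starts, ends, n and closed are illustrative.

```r
starts <- c(10, 20)
ends   <- c(25, 30)
n      <- 30

# +1 where an interval opens, -1 one position after it closes,
# then a running sum gives the coverage at each position
steps  <- tabulate(starts, n + 1) - tabulate(ends + 1, n + 1)
closed <- cumsum(steps)[seq_len(n)]
print(closed)
```

The extra bin (n + 1) is there so that an interval ending at the last position still has somewhere to record its closing step.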
The data structure you are looking for is called an interval tree, a type of sorted binary tree that contains (guess what) intervals, each of which usually has a start and an end position.
I have never used an interval tree to store points as you need, but I guess you can define your intervals with interval.start = interval.end.
Building the tree takes O(n log n) time and each query takes logarithmic time, which is much faster than pteetor's approach of scanning every interval for every position.
The R package IRanges from Bioconductor may help you. I would try the function findOverlaps() and then table() the results. I invite you to read the documentation to see whether it fits your specific needs.
I took that matrix and examined the overlaps. Assuming the intervals are ordered by their starting positions, only five of them overlap the next interval, and none overlaps the one after that:
> sum( mat[1:28,2] > mat[2:29,1] )
[1] 5
> sum( mat[1:27,2] > mat[3:29,1] )
[1] 0
So which ones were they?
> which( mat[1:28,2] > mat[2:29,1] )
[1] 19 21 23 25 28
So it seemed rather wasteful of machine resources and time to create a vector 23 million items long. It is a lot easier to build a function that counts the number of intervals containing any particular position:
fchunk <- function(pos) {sum( mat[ , 1] <= pos & mat[,2] >= pos)}
#--------
> fchunk(16675330)
[1] 2
> fchunk(16675329)
[1] 1
These are the intervals where there are 2:
sapply( which( mat[1:28,2] > mat[2:29,1] ) ,
function(int1) c( mat[int1+1, 1], mat[int1, 2] ) )
#--------
[,1] [,2] [,3] [,4] [,5]
n7 16675330 18097680 20233612 21288777 22847516
n8 16724700 18445265 20741145 22780817 22967567
If you really want the count at every position -- all 23,110,906 positions -- this code will tell you.
countChunks = function(i) sum(dfrm$n7 <= i & i <= dfrm$n8)
counts = sapply(1:23110906, countChunks)
But it's very slow. Faster code would require some clever optimization to eliminate the (very) redundant counting done by these two lines.
If you simply want the count at one position, i, just call countChunks(i).
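One possible way to avoid the redundant counting is to sort the starts and ends once and let findInterval() do a binary search for each position. This is only a sketch under the assumption of integer positions and closed intervals; starts and ends are toy stand-ins for dfrm$n7 and dfrm$n8, and count_chunks is a hypothetical name.

```r
# Toy data standing in for dfrm$n7 (starts) and dfrm$n8 (ends)
starts <- c(10, 20)
ends   <- c(25, 30)

sorted_starts <- sort(starts)
sorted_ends   <- sort(ends)

# Chunks covering pos = (number of starts <= pos) - (number of ends < pos).
# For integer positions, "ends < pos" is the same as "ends <= pos - 1".
count_chunks <- function(pos) {
  findInterval(pos, sorted_starts) - findInterval(pos - 1, sorted_ends)
}

counts <- count_chunks(1:30)
print(counts)
```

Because findInterval() is vectorized over its first argument, the same call works for a single position or for the whole range of positions.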
I'm encountering a problem that I'm failing to understand. Here is the commented code:
library(zoo)
#Pattern p used as row feeding matrix to apply() for function f
> p
[,1] [,2] [,3]
[1,] -1 1 1
[2,] 1 1 1
#Function f is supposed to take the rows of matrix p as vectors,
#compare them with vector x and return the index of the matching row
f <- function(x) { # identifies which row of p matches sign(x)
which(apply(p,1,function(row)all(row==sign(x))))
}
#rollapplying f over c(1,-1,1,1): there is no vector c(1,-1,1) in p
#so why is the first element 1?
> rollapply(c(1,-1,1,1),width=3,f,align="left")
[1] 1 1
#rollapply with the identity function; rollapply is supposed to feed each roll to the function, right?
> t = rollapply(c(1,-1,1,1),width=3,function(x)x,align="left")
[,1] [,2] [,3]
[1,] 1 -1 1
[2,] -1 1 1
#Feeding the first row of the previous matrix to f gives the expected result
> f(t[1,])
integer(0)
#rollapply feeds the rolls to the function
> rollapply(c(1,-1,1,1),width=3,function(x) paste(x,collapse=","),align="left")
[1] "1,-1,1" "-1,1,1"
#rollapply feeds vectors to the function
> rollapply(c(1,-1,1,1),width=3,function(x) is.vector(x),align="left")
[1] TRUE TRUE
#Inconsistent with the two previous results
> f(c(1,-1,1))
integer(0)
Basically, I don't understand why rollapply(c(1,-1,1,1),width=3,f,align="left") returns 1 1 when the first roll is supposed to yield the vector 1 -1 1, which is absent from the pattern matrix p. I was expecting NA 1 instead. There must be something I don't understand about rollapply, but strangely enough, if I feed the vector c(-1, -1, -1, -1) to rollapply I get the expected result NA NA. In some cases I get a mix such as 1 2, but never NA 1 or NA 2.
According to G. Grothendieck, rollapply does not support functions producing zero-length outputs. The problem goes away if f checks for that case and returns a specific value instead of a zero-length result:
f <- function(x) { # identifies which row of p matches sign(x); returns 0 if none does
  hit <- which(apply(p, 1, function(row) all(row == sign(x))))
  if (length(hit) == 0) 0 else hit
}
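For comparison, the rolling match can also be written in base R without rollapply, which makes the zero-length case explicit. This is a sketch only: match_row and window_starts are hypothetical names, it returns NA rather than 0 for a window with no match, and it keeps only the first matching row if several match.

```r
p <- matrix(c(-1, 1, 1,
               1, 1, 1), nrow = 2, byrow = TRUE)

# Index of the first row of p equal to sign(x), or NA if no row matches
match_row <- function(x) {
  hit <- which(apply(p, 1, function(row) all(row == sign(x))))
  if (length(hit) == 0) NA_integer_ else hit[1]
}

x     <- c(1, -1, 1, 1)
width <- 3
window_starts <- seq_len(length(x) - width + 1)

# vapply insists on exactly one integer per window, so a zero-length
# result cannot silently disappear
res <- vapply(window_starts,
              function(i) match_row(x[i:(i + width - 1)]),
              integer(1))
print(res)
```

Here the first window, 1 -1 1, matches no row of p and the second, -1 1 1, matches row 1.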
For completeness, here is G. Grothendieck's comment: "rollapply does not support functions producing zero length outputs." That is consistent with the behavior below.
Further confusion, at least for me (this should be a comment but I wanted some decent formatting):
sfoo<-c(1,-1,1,1,1,-1,1,1)
rollapply(sfoo,width=3,function(j){k<-f(j);print(j);return(k)})
[1] 1 -1 1
[1] -1 1 1
[1] 1 1 1
[1] 1 1 -1
[1] 1 -1 1
[1] -1 1 1
[1] 1 2 1 1 2 1
I then tried:
ff<-function(x) which(rowSums(p)==sum(x))
sbar<-c(0,1,1,1,-1,0,-1)
rollapply(sbar,width=3,function(j){k<-ff(j);print(j);return(k)})
[1] 0 1 1
[1] 1 1 1
[1] 1 1 -1
[1] 1 -1 0
[1] -1 0 -1
[1] 2 1 2 1 2
This sure looks like rollapply is doing an na.locf sort of fill-in operation.
I'm trying to assess the performance of a simple prediction model in R by binning the predictions into defined intervals and comparing them with the corresponding (binned) actual values.
I have two vectors actual and predicted as shown:
> actual <- c(0,2,0,0,41,1,3,5,2,0,0,0,0,0,6,1,0,0,15,1)
> predicted <- c(3.38,98.01,3.08,4.89,31.46,3.88,4.75,4.64,3.11,3.15,3.42,10.42,3.18,5.73,4.20,3.34,3.95,5.94,3.99)
I need to perform binning here. First off, the values of 'actual' are factorized/discretized into different levels, say:
0-5: Level 1, 6-10: Level 2, ..., 41-45: Level 9
Now I have to bin the values of 'predicted' into the same buckets.
I tried to achieve this using the cut() function in R:
binCount <- 5
binActual <- cut(actual,labels=1:binCount,breaks=binCount)
binPred <- cut(predicted,labels=1:binCount,breaks=binCount)
However, the second element of predicted (98.01) is labelled 5, even though it doesn't actually fall in the desired interval.
I feel that using a different binCount for predicted will not help. Can anyone please suggest a solution for this?
I'm not 100% sure what you want to do.
However, from what I understand, you want to return, for each element of each vector, the class it falls in, given a single set of classes that covers every value in both actual and predicted.
If that is what you want, note that your script first creates classes for values between 0 and 41 (the range of actual) and uses them to classify your first vector.
Then it creates a new set of classes for your vector predicted.
The classification is therefore no longer the same for the two vectors.
Assuming I understood what you want to do, I'd rather write:
actual <- c(0,2,0,0,41,1,3,5,2,0,0,0,0,0,6,1,0,0,15,1)
predicted <- c(3.38,98.01,3.08,4.89,31.46,3.88,4.75,4.64,3.11,3.15,3.42,10.42,3.18,5.73,4.20,3.34,3.95,5.94,3.99)
temporary <- c(actual, predicted)
maxi <- max(temporary)
mini <- min(temporary)
binCount <- 5
# binCount break points give binCount - 1 bins, shared by both vectors
s <- seq(mini, maxi, length.out = binCount)
binActual <- cut(actual, breaks = s, include.lowest = TRUE, labels = 1:(length(s) - 1))
binPred <- cut(predicted, breaks = s, include.lowest = TRUE, labels = 1:(length(s) - 1))
It gives:
> binActual
[1] 1 1 1 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
Levels: 1 2 3 4
> binPred
[1] 1 4 1 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1
Levels: 1 2 3 4
I'm not sure this is what you're looking for, so let me know; I might be able to help further.
Best wishes.
Is this what you want?
intervals <- cbind(seq(0, 40, length = 9), seq(5, 45, length = 9))
cutFixed <- function(x, intervals) {
sapply(x, function(x) ifelse(x < min(intervals) | x >= max(intervals), NA, which(x >= intervals[,1] & x < intervals[,2])))
}
This gives the following result
> cutFixed(actual, intervals)
[1] 1 1 1 1 9 1 1 2 1 1 1 1 1 1 2 1 1 1 4 1
> cutFixed(predicted, intervals)
[1] 1 NA 1 1 7 1 1 1 1 1 1 3 1 2 1 1 1 2 1
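The same fixed-width bins can probably also be obtained from cut() itself by passing explicit break points instead of a bin count. The sketch below assumes half-open bins [0,5), [5,10), ..., [40,45) to mirror cutFixed; the name breaks is illustrative, and values outside the breaks come back as NA.

```r
actual <- c(0,2,0,0,41,1,3,5,2,0,0,0,0,0,6,1,0,0,15,1)
predicted <- c(3.38,98.01,3.08,4.89,31.46,3.88,4.75,4.64,3.11,3.15,
               3.42,10.42,3.18,5.73,4.20,3.34,3.95,5.94,3.99)

# Ten break points define nine bins of width 5
breaks <- seq(0, 45, by = 5)

# right = FALSE makes each bin closed on the left and open on the right;
# labels = FALSE returns the bin number instead of a factor
binActual <- cut(actual, breaks = breaks, right = FALSE, labels = FALSE)
binPred   <- cut(predicted, breaks = breaks, right = FALSE, labels = FALSE)
print(binActual)
print(binPred)
```

Since both vectors are cut with the same breaks, a given bin number means the same interval in both results.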
I'm doing a failure analysis, for which I'd like to try some different scenarios and some random trials. So far I've done this with the mosaic package and it's working out great.
In one specific scenario I want to generate a vector of (semi)random numbers drawn from different distributions. No problem so far.
Now I want to have a defined number of negative numbers in this vector.
For example, I want to have between 0 and 5 negative numbers in a vector of 25 numbers.
I thought I could use something like rbinom(n=25,prob=5/25,size=1) to get 5 random ones first, but of course 25 draws with probability 5/25 can give more than 5 ones. This seems a dead end.
I could get it done with some for loops, but something easier probably exists.
I've tried all sorts of sample, seq and shuffle combinations, but I haven't got it to work so far.
Does anyone have any ideas or suggestions?
If you have a vector x where all elements are >= 0, let's say drawn from a Poisson distribution:
x = rpois(25, lambda=3)
You can make a random 5 of them negative (any element that is exactly 0 stays 0) by doing
x * sample(rep(c(1, -1), c(length(x) - 5, 5)))
This works because
rep(c(1, -1), c(length(x) - 5, 5))
will be
# [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 -1 -1 -1 -1 -1
and sample(rep(c(1, -1), c(length(x) - 5, 5))) simply shuffles them up randomly:
sample(rep(c(1, -1), c(length(x) - 5, 5)))
# [1] 1 1 -1 1 1 1 1 1 1 1 1 -1 1 1 1 -1 -1 1 1 1 -1 1 1 1 1
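Since the question asks for between 0 and 5 negatives rather than exactly 5, the number of sign flips can itself be drawn at random. This is a rough sketch, not a definitive recipe: the + 1 shift is an assumption added here so that every value is strictly positive and each flip really yields a negative number, and k and idx are illustrative names.

```r
set.seed(42)  # only for reproducibility of the sketch

# Strictly positive draws (Poisson shifted by 1, an assumption of this sketch)
x <- rpois(25, lambda = 3) + 1

# Draw how many negatives to create, then which positions to flip
k   <- sample(0:5, 1)
idx <- sample(length(x), k)
x[idx] <- -x[idx]

print(sum(x < 0))
```

When k is 0, idx is empty and the vector is left unchanged, so no special case is needed.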
I can suggest a very straightforward solution that guarantees 5 negative values and works for any continuous distribution. The idea is just to sort the vector and subtract each value from the 6th largest, so that only the 5 largest values end up negative:
x <- rnorm(25)
res <- sort(x, decreasing = TRUE)[6] - x
#### [1] 0.4956991 1.5799885 2.4207497 1.1639569 0.2161187 0.2443917 -0.4942884 -0.2627706 1.5188197
#### [10] 0.0000000 1.6081025 1.4922573 1.4828059 0.3320079 0.3552913 -0.6435770 -0.3106201 1.5074491
#### [19] 0.6042724 0.3707655 -0.2624150 1.1671077 2.4679686 1.0024573 0.2453597
sum(res<0)
#### [1] 5
It also works for discrete distributions, but only if there are no ties.