Smoothing a sequence without using a loop in R - r

I am implementing a statistical method from an academic paper (see the end for a citation) in R. I think there's a way to do one of the steps without using a loop, but I'm having trouble deciding how to attack it.
This method operates on a data frame with three variables: x, n, and p. It can only operate if p[i] <= p[i+1] for all i. If a pair of points violates that, they are smoothed out by setting both p[i] and p[i+1] equal to their weighted average
(n[i]*p[i]+n[i+1]*p[i+1])/(n[i]+n[i+1])
This smoothing is iterated until the p_i are a nondecreasing sequence.
The problem with this smoothing step is that a) loops are bad form in R, and b) if there are multiple points in a row such that p_i > p_(i+1) >= p_(i+2), the method can fail to terminate or take a very long time to converge. For instance, given a sequence like this:
x n p
2 10 0.6
5 10 0.5
10 10 0.5
the smooth will set the first two values of p to 0.55, then the second two to 0.525, then the first two to 0.5375, and so on, looping forever (or, if I'm lucky, reaching the limits of floating-point precision after a bajillion iterations). There should be a mathematically equivalent but more efficient way to do this by identifying adjacent decreasing data points and averaging them as a group, but I'm not sure how to approach that in R.
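To see the grouped approach on this example: pooling the whole decreasing block in one weighted average gives, in a single step, the common value that the pairwise smoothing only creeps toward (a quick sketch with base R's weighted.mean):
p <- c(0.6, 0.5, 0.5)
n <- c(10, 10, 10)
weighted.mean(p, n)  # 0.5333..., the value the whole block converges to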
If you need more background, the paper in question is Martin A. Hamilton, Rosemarie C. Russo, Robert V. Thurston.
"Trimmed Spearman-Karber method for estimating median lethal concentrations in toxicity bioassays." Environ. Sci. Technol., 1977, 11 (7), pp 714–719. I'm referring to the "first step" section on page 716.

As I understand the algorithm, you need to locate positions where p is decreasing and, starting from each of these, find out for how long the (cumulative) weighted average is decreasing so that p can be updated block by block. I do not see how this can be done without a loop of some sort. Some solutions might hide the loop under lapply or an equivalent, but IMHO this is one of those algorithms that are complex enough that I prefer a good old loop. You may lose a bit in efficiency, but the code reads nicely. My attempt, using a while loop:
smooth.p <- function(df) {
  while (any(diff(df$p) < 0)) {
    # where does it start decreasing
    idx <- which(diff(df$p) < 0)[1]
    # from there, compute the cumulative weighted average
    sub <- df[idx:nrow(df), ]
    cuml.wavg <- cumsum(sub$n * sub$p) / cumsum(sub$n)
    # and see for how long it is decreasing
    bad.streak.len <- rle(diff(cuml.wavg) <= 0)$lengths[1]
    # these are the indices for the block to average
    block.idx <- seq(from = idx, length = bad.streak.len + 1)
    # compute and apply the average p
    df$p[block.idx] <- sum(df$p[block.idx] * df$n[block.idx]) /
      sum(df$n[block.idx])
  }
  return(df)
}
Here is some data, including a rough patch like you suggested:
df <- data.frame(x = 1:9,
                 n = rep(1, 9),
                 p = c(0.1, 0.3, 0.2, 0.6, 0.5, 0.5, 0.8, 1.0, 0.9))
df
# x n p
# 1 1 1 0.1
# 2 2 1 0.3
# 3 3 1 0.2
# 4 4 1 0.6
# 5 5 1 0.5
# 6 6 1 0.5
# 7 7 1 0.8
# 8 8 1 1.0
# 9 9 1 0.9
And the output:
smooth.p(df)
# x n p
# 1 1 1 0.1000000
# 2 2 1 0.2500000
# 3 3 1 0.2500000
# 4 4 1 0.5333333
# 5 5 1 0.5333333
# 6 6 1 0.5333333
# 7 7 1 0.8000000
# 8 8 1 0.9500000
# 9 9 1 0.9500000
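A quick sanity check that the result is now monotone:
!is.unsorted(smooth.p(df)$p)  # TRUE: p is a nondecreasing sequence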

Following Glen_b above, what's described in Hamilton's paper is equivalent to gpava from the CRAN package isotone.
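For reference, a minimal sketch using isotone (assuming the fitted values live in the $x component of the gpava result, which I believe they do):
library(isotone)
fit <- gpava(z = df$x, y = df$p, weights = df$n)  # weighted pool-adjacent-violators
fit$x  # pooled, nondecreasing values; should match smooth.p(df)$p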

Related

increasing correlation values - R

I would like to identify the pairs of values that decrease the correlation between two vectors, using a threshold, and to find the values that maximize the correlation, with the restriction that at least three pairs of values remain. I am using R.
For example, suppose I have this data frame:
df <- data.frame(val1 = c(1,2,4,8,10,20), val2 = c(2,4,8,16, 35, 14))
rownames(df) <- c('a','b','c','d','e','f')
I would like to remove the pairs of values that prevent me from obtaining a correlation greater than 0.6; in this case I would like to find that row f decreases my correlation. Lastly, if it is easy, I would like to find that using rows a, b, c, and d I can obtain the highest correlation.
Do you have any idea how I can do it?
Thank you in advance for your kind help.
Best
The restriction of at least 3 rows helps. There are two ways to approach the problem. Which one is best depends a bit on what you are trying to accomplish. We can start with all of the points and remove one at a time or we can start with 3 points and add one at a time. Your example has 6 points so it does not make that much difference. Here is code to find the best 3 point combination:
combos <- combn(6, 3)
corrs <- combn(6, 3, function(x) cor(df[x, ])[1, 2])
results <- cbind(t(combos), corrs)
head(results[order(corrs, decreasing=TRUE), ])
# corrs
# [1,] 1 2 3 1.0000000
# [2,] 1 2 4 1.0000000
# [3,] 2 3 4 1.0000000
# [4,] 1 3 4 1.0000000
# [5,] 1 2 5 0.9988739
# [6,] 1 2 6 0.9940219
We use the combn() function twice: once to get a matrix of the possible combinations of 3 items out of 6, and a second time to apply the correlation function to each combination.
Then we combine the results and list the best 6. There are four 3-point solutions having correlations of +1 (any three of rows a through d, where val2 is exactly twice val1). For the 5-point solutions we get the following:
combos <- combn(6, 5)
corrs <- combn(6, 5, function(x) cor(df[x, ])[1, 2])
results <- cbind(t(combos), corrs)
head(results[order(corrs, decreasing=TRUE), ])
# corrs
# [1,] 1 2 3 4 5 0.9381942
# [2,] 1 2 3 4 6 0.7514174
# [3,] 1 2 3 5 6 0.4908234
# [4,] 1 2 4 5 6 0.4639890
# [5,] 1 3 4 5 6 0.4062324
# [6,] 2 3 4 5 6 0.3591037
Now there is one clear solution which excludes point 6 ("f"), with a correlation of +.938. In general the magnitude of the correlation will increase as points are removed, until it reaches +1 or -1. As the number of points increases, it will take more processing time to compute all of the alternatives. A shortcut would be to look at deviations from the first principal component:
df.pca <- prcomp(df)
abval <- abs(df.pca$x[, "PC2"])
df.pca$x[order(abval, decreasing=TRUE), "PC2"]
# f e a b c d
# -11.4055987 5.3497271 2.1507072 1.9191656 1.4560825 0.5299163
Point f (the 6th point) has the largest deviation from the first principal component, so removing it should improve the correlation. Likewise, removing e and f gives the best 4-point correlation. This is simpler, but generally you would want to remove a point, recompute the principal components with that point removed, and then identify the next point for removal, as in the sketch below.
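A rough sketch of that iterative removal, assuming the same two-column numeric df and the 0.6 threshold from the question (iterative_trim is just an illustrative name, not an existing function):
iterative_trim <- function(df, threshold = 0.6) {
  # drop the point farthest from the first principal component until the
  # correlation clears the threshold or only 3 rows remain
  while (nrow(df) > 3 && cor(df)[1, 2] < threshold) {
    pca <- prcomp(df)
    worst <- which.max(abs(pca$x[, "PC2"]))
    df <- df[-worst, ]
  }
  df
}
iterative_trim(df)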

Vectorizing R-loop for better performance

I am having trouble finding a vectorized representation for a specific loop in R. My objective is to improve the loop's performance, because it has to be run thousands of times in the algorithm.
For each row, I want to find the position of the lowest value within particular sections of the array, defined by a vector Level.
Example:
Level = c(2,3)
Let the first row of array X be: c(2, -1, 3, 0.5, 4).
Searching for the position of the lowest value in the range 1:Level[1] of the row (that is, (2, -1)), I get 2, because -1 < 2 and -1 is in the second position of the row. Then, searching for the position of the lowest value in the second range, (Level[1]+1):(Level[1]+Level[2]) (that is, (3, 0.5, 4)), I get 4, because 0.5 < 3 < 4 and 0.5 is in the fourth position of the row.
I have to perform this over each row in the array.
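For concreteness, here is the worked example above checked directly with which.min (just a sketch of the target behaviour):
row1 <- c(2, -1, 3, 0.5, 4)
Level <- c(2, 3)
which.min(row1[1:Level[1]])                            # 2
Level[1] + which.min(row1[(Level[1] + 1):sum(Level)])  # 4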
My solution to the problem works as follows:
Level = c(2, 3, 3)  # elements per section, here: 3 sections with 2, 3 and 3 levels
rows = 10           # number of rows in array X
X = matrix(runif(rows * sum(Level), -5, 5), rows, sum(Level))  # array with 10 rows and sum(Level) columns, here: 8
Position_min = matrix(0, rows, length(Level))  # array in which the position of minimum values for each section and row are stored
for (i in 1:rows) {
  for (j in 1:length(Level)) {  # length(Level) is number of intervals, here: 3
    if (j == 1) { coeff = 0 } else { coeff = 1 }
    Position_min[i, j] = coeff * sum(Level[1:(j - 1)]) +
      which(X[i, (coeff * sum(Level[1:(j - 1)]) + 1):sum(Level[1:j])] ==
              min(X[i, (coeff * sum(Level[1:(j - 1)]) + 1):sum(Level[1:j])]))
  }
}
It works fine but I would prefer a solution with better performance. Any ideas?
This will remove the outer level of the loop:
Level1 = c(0, cumsum(Level))
for (j in 1:(length(Level1) - 1)) {
  Position_min[, j] = max.col(-X[, (Level1[j] + 1):Level1[j + 1]]) + Level1[j]
}
Here is a "fully vectorized" solution with no explicit loops:
findmins <- function(x, level) {
  # assign each column position to its section, split the row accordingly,
  # find the within-section minima, then shift back to whole-row positions
  series <- rep(seq_along(level), level)
  x <- split(x, factor(series))
  minsSplit <- as.numeric(sapply(x, which.min))
  minsSplit + c(0, cumsum(level[-length(level)]))
}
Position_min_vectorized <- t(apply(X, 1, findmins, Level))
identical(Position_min, Position_min_vectorized)
## [1] TRUE
You can get better performance by making your matrix into a list, and then using parallel's mclapply():
X_list <- split(X, factor(1:nrow(X)))
do.call(rbind, parallel::mclapply(X_list, findmins, Level))
## [,1] [,2] [,3]
## 1 1 5 6
## 2 2 3 6
## 3 1 4 7
## 4 1 5 6
## 5 2 5 7
## 6 2 4 6
## 7 1 5 8
## 8 1 5 8
## 9 1 3 8
## 10 1 3 8
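Note that mclapply() relies on forking, which is not available on Windows; a portable sketch of the same idea with a PSOCK cluster (reusing findmins, Level and X_list from above):
cl <- parallel::makeCluster(2)           # two worker processes
parallel::clusterExport(cl, "findmins")  # ship the helper to the workers
res <- do.call(rbind, parallel::parLapply(cl, X_list, findmins, Level))
parallel::stopCluster(cl)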

Interpolating data in R

Let's suppose I have a 3 by 5 matrix in R:
4 5 5 6 8
3 4 4 5 6
2 3 3 3 4
I would like to interpolate in between these values to create a matrix of size 15 by 25. I would also like to specify if the interpolation is linear, gaussian, etc. How can I do this?
For example, if I have a small matrix like this
2 3
1 3
and I want it to become 3 by 3, then it might look like
2 2.5 3
1.5 2.2 3
1 2 3
# x is the small 2 x 2 example matrix; nr and nc are the target output dimensions
x <- matrix(c(2, 1, 3, 3), 2, 2); nr <- nc <- 3
app <- function(v, n) approx(v, n = n)$y  # or whatever interpolation that you want
apply(t(apply(x, 1, app, n = nc)), 2, app, n = nr)
[,1] [,2] [,3]
[1,] 2.0 2.50 3
[2,] 1.5 2.25 3
[3,] 1.0 2.00 3
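The same two-pass idea scales to the 3 by 5 matrix from the question (target 15 by 25); a sketch reusing app() from above:
m <- matrix(c(4, 3, 2, 5, 4, 3, 5, 4, 3, 6, 5, 3, 8, 6, 4), nrow = 3)
out <- apply(t(apply(m, 1, app, n = 25)), 2, app, n = 15)
dim(out)  # 15 25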
A long time ago I wrote a similar toy, except I never got around to defining the interpolation function. There's also raster::disaggregate.
zexpand <- function(inarray, fact = 2, interp = FALSE, ...) {
  # do same analysis of fact to allow one or two values, fact >= 1 required, etc.
  fact <- as.integer(round(fact))
  switch(as.character(length(fact)),
         '1' = xfact <- yfact <- fact,
         '2' = {xfact <- fact[1]; yfact <- fact[2]},
         {xfact <- fact[1]; yfact <- fact[2]; warning('fact is too long. First two values used.')})
  if (xfact < 1) { stop('fact[1] must be > 0') }
  if (yfact < 1) { stop('fact[2] must be > 0') }
  # does column expansion
  bigtmp <- matrix(rep(t(inarray), each = xfact), nrow(inarray), ncol(inarray) * xfact, byrow = TRUE)
  bigx <- t(matrix(rep(bigtmp, each = yfact), ncol(bigtmp), nrow(bigtmp) * yfact, byrow = TRUE))
  # the interpolation would go here. Or use interp.loess on output (won't
  # handle complex data). Also, look at fields::Tps which probably does
  # a much better job anyway. Just do separately on Re and Im data
  return(invisible(bigx))
}
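A quick usage sketch on the 2 by 2 example (pure block replication, since the interpolation step is left unimplemented):
small <- matrix(c(2, 1, 3, 3), 2, 2)
print(zexpand(small, fact = 2))  # each cell becomes a 2 x 2 block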

aggregate data frame by equal buckets

I would like to aggregate an R data.frame by equal amounts of the cumulative sum of one of the variables in the data.frame. I googled quite a lot, but probably I don't know the correct terminology to find anything useful.
Suppose I have this data.frame:
> x <- data.frame(cbind(p=rnorm(100, 10, 0.1), v=round(runif(100, 1, 10))))
> head(x, 20)
p v
1 10.002904 4
2 10.132200 2
3 10.026105 6
4 10.001146 2
5 9.990267 2
6 10.115907 6
7 10.199895 9
8 9.949996 8
9 10.165848 8
10 9.953283 6
11 10.072947 10
12 10.020379 2
13 10.084002 3
14 9.949108 8
15 10.065247 6
16 9.801699 3
17 10.014612 8
18 9.954638 5
19 9.958256 9
20 10.031041 7
I would like to reduce the x to a smaller data.frame where each line contains the weighted average of p, weighted by v, corresponding to an amount of n units of v. Something of this sort:
> n <- 100
> cum.v <- cumsum(x$v)
> f <- cum.v %/% n
> x.agg <- aggregate(cbind(v*p, v) ~ f, data=x, FUN=sum)
> x.agg$'v * p' <- x.agg$'v * p' / x.agg$v
> x.agg
f v * p v
1 0 10.039369 98
2 1 9.952049 94
3 2 10.015058 104
4 3 9.938271 103
5 4 9.967244 100
6 5 9.995071 69
First question: I was wondering if there is a better (more efficient) approach than the code above. The second, more important, question is how to correct the code above in order to obtain more precise bucketing. Namely, each row in x.agg should contain exactly 100 units of v, not just approximately, as is the case above. For example, the first row contains the aggregate of the first 17 rows of x, which correspond to 98 units of v. The next row (row 18) contains 5 units of v and is fully included in the next bucket. What I would like to achieve instead is to attribute 2 units of row 18 to the first bucket and the remaining 3 units to the following one.
Thanks in advance for any help provided.
Here's another method that does this without repeating each p value v times. The way I understand it, the place where the cumulative sum crosses 100 (see below)
18 9.954638 5 98
19 9.958256 9 107
should be changed to:
18 9.954638 5 98
19.1 9.958256 2 100 # ---> 2 units will be considered with previous group
19.2 9.958256 7 107 # ----> remaining 7 units will be split for next group
The code:
n <- 100
# get cumulative sum, an id column (for retrace) and current group id
x <- transform(x, cv = cumsum(x$v), id = seq_len(nrow(x)), grp = cumsum(x$v) %/% n)
# Install IRanges from Bioconductor:
# install.packages("BiocManager"); BiocManager::install("IRanges")
require(IRanges)
ir1 <- successiveIRanges(x$v)
ir2 <- IRanges(seq(n, max(x$cv), by=n), width=1)
o <- findOverlaps(ir1, ir2)
# gets position where multiple of n(=100) occurs
# (where we'll have to do something about it)
pos <- queryHits(o)
# how much do the values differ from multiple of 100?
val <- start(ir2)[subjectHits(o)] - start(ir1)[queryHits(o)] + 1
# we need "pos" new rows of "pos" indices
x1 <- x[pos, ]
x1$v <- val # corresponding values
# reduce the group by 1, so that multiples of 100 will
# belong to the previous row
x1$grp <- x1$grp - 1
# subtract val in the original data x
x$v[pos] <- x$v[pos] - val
# bind and order them
x <- rbind(x1,x)
x <- x[with(x, order(id)), ]
# remove unnecessary entries
x <- x[!(duplicated(x$id) & x$v == 0), ]
x$cv <- cumsum(x$v) # updated cumsum
x$id <- NULL
require(data.table)
x.dt <- data.table(x, key="grp")
x.dt[, list(res = sum(p*v)/sum(v), cv = tail(cv, 1)), by=grp]
Running on your data:
# grp res cv
# 1: 0 10.037747 100
# 2: 1 9.994648 114
Running on #geektrader's data:
# grp res cv
# 1: 0 9.999680 100
# 2: 1 10.040139 200
# 3: 2 9.976425 300
# 4: 3 10.026622 400
# 5: 4 10.068623 500
# 6: 5 9.982733 562
Here's a benchmark on a relatively big data:
set.seed(12345)
x <- data.frame(cbind(p=rnorm(1e5, 10, 0.1), v=round(runif(1e5, 1, 10))))
require(rbenchmark)
# FN1() is assumed to wrap all the steps above (IRanges splitting + data.table aggregation) into one function
benchmark(out <- FN1(x), replications = 10)
# test replications elapsed relative user.self
# 1 out <- FN1(x) 10 13.817 1 12.586
It takes about 1.4 seconds on 1e5 rows.
If you are looking for precise bucketing, I am assuming the value of p is the same for the two "split" parts of v,
i.e. in your example, the value of p for the 2 units of row 18 that go in the first bucket is 9.954638.
With the above assumption, you can do the following for datasets that are not super large:
> set.seed(12345)
> x <- data.frame(cbind(p=rnorm(100, 10, 0.1), v=round(runif(100, 1, 10))))
> z <- unlist(mapply(function(x,y) rep(x,y), x$p, x$v, SIMPLIFY=T))
This creates a vector with each value of p repeated v times for each row; the results are combined into a single vector using unlist.
After this, aggregation is trivial using the aggregate function:
> aggregate(z, by=list((1:length(z)-0.5)%/%100), FUN=mean)
Group.1 x
1 0 9.999680
2 1 10.040139
3 2 9.976425
4 3 10.026622
5 4 10.068623
6 5 9.982733
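As an aside, a slightly leaner sketch of the same step: rep() is already vectorised over both arguments, so the mapply() wrapper is not strictly needed.
z <- rep(x$p, x$v)  # each p repeated v times, in row order
aggregate(z, by=list((1:length(z)-0.5)%/%100), FUN=mean)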

What is the most efficient way to return ranks of a vector within levels of a factor, as a vector having the same order/length as the original vector?

With one more requirement - that the resulting vector is in the same order as the original.
I have a very basic function that percentiles a vector, and it works just the way I want it to:
ptile <- function(x) {
  p <- (rank(x) - 1) / (length(which(!is.na(x))) - 1)
  p[p > 1] <- NA
  p
}
data <- c(1, 2, 3, 100, 200, 300)
For example, ptile(data) generates:
[1] 0.0 0.2 0.4 0.6 0.8 1.0
What I'd really like to be able to do is use this same function (ptile) and have it work within levels of a factor. So suppose I have a "factor" f as follows:
f <- as.factor(c("a", "a", "b", "a", "b", "b"))
I'd like to be able to transform "data" into a vector that tells me, for each observation, what its corresponding percentile is relative to other observations within its same level, like this:
0.0 0.5 0.0 1.0 0.5 1.0
As a shot in the dark, I tried:
tapply(data,f,ptile)
and see that it does, in fact, succeed at doing the ranking/percentiling, but does so in a way that leaves me with no idea which observations match up to which indices in the original vector:
[1] a a b a b b
Levels: a b
> tapply(data,f,ptile)
$a
[1] 0.0 0.5 1.0
$b
[1] 0.0 0.5 1.0
This matters because the actual data I'm working with can have 1000-3000 observations (stocks) and 10-55 levels (things like sectors, groupings by other stock characteristics, etc), and I need the resulting vector to be in the same order as the way it went in, in order for everything to line up, row by row in my matrix.
Is there some "apply" variant that would do what I am seeking? Or a few quick lines that would do the trick? I've written this functionality in C# and F# with a lot more lines of code, but had figured that in R there must be some really direct, elegant solution. Is there?
Thanks in advance!
The ave function is very useful. The main gotcha is to remember that you always need to name the function with FUN=:
dt <- data.frame(data, f)
dt$rank <- with(dt, ave(data, list(f), FUN=rank))
dt
#---
data f rank
1 1 a 1
2 2 a 2
3 3 b 1
4 100 a 3
5 200 b 2
6 300 b 3
Edit: I thought I was answering the question in the title but have been asked to include the code that uses the "ptile" function:
> dt$ptile <- with(dt, ave(data, list(f), FUN=ptile))
> dt
data f rank ptile
1 1 a 1 0.0
2 2 a 2 0.5
3 3 b 1 0.0
4 100 a 3 1.0
5 200 b 2 0.5
6 300 b 3 1.0
For what you are trying to do, I would first put the stock, sector, and value as columns in a data frame, e.g. with some made-up data:
> set.seed(1)
> df <- data.frame(stock = 1:10,
+ sector = sample(letters[1:2], 10, repl = TRUE),
+ val = sample(1:10))
> df
stock sector val
1 1 a 3
2 2 a 2
3 3 b 6
4 4 b 10
5 5 a 5
6 6 b 7
7 7 b 8
8 8 b 4
9 9 b 1
10 10 a 9
Then you can use the ddply function from the plyr package to do the "sectorwise" percentile (there are other ways, but I find plyr very useful and would recommend you take a look at it):
require(plyr)
df.p <- ddply(df, .(sector), transform, pct = ptile(val))
Now of course in df.p the rows will be arranged by the factor (i.e. sector), and it's a simple matter to restore it to the original order, e.g.:
> df.p[ order(df.p$stock),]
stock sector val pct
1 1 a 3 0.3333333
2 2 a 2 0.0000000
5 3 b 6 0.4000000
6 4 b 10 1.0000000
3 5 a 5 0.6666667
7 6 b 7 0.6000000
8 7 b 8 0.8000000
9 8 b 4 0.2000000
10 9 b 1 0.0000000
4 10 a 9 1.0000000
In particular the pct column is the final vector you are seeking in your original question.
When you call tapply() with INDEX=f you get a result that is subsetted by f and broken into a list in order of the levels of f. To reverse that process, simply:
unlist(tapply(data, f, ptile))[order(order(f))]
Your example data vector happened to be in numeric order already, but this works even if the data is in random order...
ptile <- function(x) {
  p <- (rank(x) - 1) / (length(which(!is.na(x))) - 1)
  p[p > 1] <- NA
  # concatenated with the original data to make the match clear
  paste(round(p * 100, 2), x, sep = "% ")
}
data <- sample(c(1:5, (1:5)*100), 10)
f <- sample(letters[1:2], 10, replace=TRUE)
result <- unlist(tapply(data, f, ptile))[order(order(f))]
data.frame(result, data, f)
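Why order(order(f)) works: order(f) gives, for each position in the grouped (tapply) output, the index it came from in the original vector; applying order() a second time inverts that permutation, mapping grouped positions back to original ones. A tiny illustration:
f2 <- factor(c("b", "a", "b", "a"))
order(f2)         # 2 4 1 3 : grouped order (all a's, then all b's)
order(order(f2))  # 3 1 4 2 : inverse permutation, back to the original order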
