How to return possible pairs of variables based on a function? - r

I have a pair of variables (x, y), and for each variable there is a possible range of values (xmin, xmax and ymin, ymax). I am looking for the pairs that, according to a function, would all yield the same probability.
This is my function that would return probabilities.
f <- function(x, y) 1-exp(-(x^(1/0.9)+y^(1/0.9))^0.9)
Now suppose that for a certain probability, say 0.01, I want to know the possible pairs of x and y yielding that value (considering their constraints, i.e. the min and max values).
(What I have already tried is doing the whole thing the other way around: first creating a matrix for x and y, then calculating the probability for each combination. But then I would need to find the matching probabilities in the matrix, which seems even more difficult.)

So, doing some math (sorry, LaTeX formatting is not supported on SO):
P = 1 - exp(-(x^(1/0.9) + y^(1/0.9))^0.9)
ln(1-P) = -(x^(1/0.9) + y^(1/0.9))^0.9
(-ln(1-P))^(1/0.9) - y^(1/0.9) = x^(1/0.9)
((-ln(1-P))^(1/0.9) - y^(1/0.9))^0.9 = x
Now if we put this into some R code and check when results do not exist:
get_x <- function(P, y)
{
  x = ((-log(1-P))^(1/0.9) - y^(1/0.9))^0.9
  # Verification of the results.
  # If a result is not real (then x[i] = NaN), or if it does not
  # match the given probability (which should never happen),
  # the result is set to NaN.
  # This verification is for debugging only and should be removed.
  for (i in c(1:length(y))) {
    if (is.na(x[i]) | abs(P - 1 + exp(-(x[i]^(1/0.9) + y[i]^(1/0.9))^0.9)) > 0.00001)
    {
      x[i] = NaN
      print(paste0("Oops, something went wrong with y=", y[i]))
    }
  }
  return(x)
}
y_values=seq(0.01,0.99,by=0.001)
get_x(0.09,y_values)
This is pretty fast: only one loop is used instead of the two needed to fill a matrix, so it is O(n) instead of O(n^2).

Alternatively, we can calculate the probability for all possible combinations and create a data frame with the combinations that satisfy our criterion within some tolerance (for floating-point comparison).
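The answer does not show how matrix2 is constructed; one possible sketch (reusing f from the question; the grid values and step size are assumptions for illustration) is to evaluate f on a grid of x and y values and label the margins so the row/column names can be reported below:
f <- function(x, y) 1-exp(-(x^(1/0.9)+y^(1/0.9))^0.9)
x_values <- seq(0.01, 0.99, by = 0.001)
y_values <- seq(0.01, 0.99, by = 0.001)
matrix2 <- outer(x_values, y_values, f)   # probability for every (x, y) combination
dimnames(matrix2) <- list(x_values, y_values)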
tol <- 0.0001
mat <- which((matrix2 >= 0.01 - tol) & (matrix2 <= 0.01 + tol), arr.ind = TRUE)
data.frame(comb1 = rownames(matrix2)[mat[, 1]], comb2 = colnames(matrix2)[mat[, 2]])

Related

R: compute an integral with an unknown parameter equal to a certain value (for example: int x = 0.6)

I am trying to simulate values from an unknown integral (to create a climatological forecaster).
My function is: $\int_{x=0}^{x=0.25} 4 y^{-1/x} \, dx$
Normally one inputs the variable y and gets a value as output.
However, I want to input the value this integral is equal to and get the value of y as output.
I have 3 runif vectors of length 1 000, 10 000 and 100 000 (with values between 0 and 1), which I use as my input values.
Say the first value is 0.3 and the second value is 0.78.
I want to calculate for which y the integral above is equal to 0.3 (or equal to 0.78 for the second value).
How am I able to do this in R?
I've tried some stuff with the integrate function, but then I need a value for y to make that work.
You are trying to solve a non-linear equation with an integral inside.
Intuitively, what you need to do is start with an interval in which the desired y lies, then try different values of y, calculate the integral, and narrow the interval based on the result.
You can implement this in R using integrate and optimize as below:
f <- function(x, y) {
  4*y^(-1/x)
}
intf <- function(y) {
  integrate(f, 0, 0.25, y=y)
}
objective <- function(y, value) {
  abs(intf(y)$value - value)
}
optimize(objective, c(1, 10), value=0.3)
#$minimum
#[1] 1.14745
#
#$objective
#[1] 1.540169e-05
optimize(objective, c(1, 10), value=0.78)
#$minimum
#[1] 1.017891
#
#$objective
#[1] 0.0001655954
Here, f is the function to be integrated, intf calculates the integral for a given y, and objective measures the distance between the value of the integral and the desired value.
Since the optimize function finds the minimum of a function, it finds the y for which the objective (the distance to the target value) is smallest.
Note that non-linear equations with an integral inside are in general tough to solve. This case seems manageable since the function is monotonic and continuous in y. The solution y should be unique and can be easily found by narrowing down the interval.
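Since the integral decreases monotonically in y on (1, 10), a root finder is an alternative sketch to optimize(); this reuses intf() from above (uniroot() forwards the extra value argument to the objective):
g <- function(y, value) intf(y)$value - value
uniroot(g, interval = c(1, 10), value = 0.3)$root    # about 1.147
uniroot(g, interval = c(1, 10), value = 0.78)$root   # about 1.018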

how to make a fast pairwise Tanimoto distance function in R

I have a data.frame of items identified by an integer property ID, which is also the row number of the data.frame.
Each item has a vector of features FP associated to it. The elements of each FP are unique (within that FP). So for instance c(1,2,7) but never c(1,7,7).
The Tanimoto distance between any two ID's is defined as 1 minus the number of unique elements in the intersection of their FP's, divided by the number of unique elements in the union of their FP's.
I need to calculate such distances in the context of a 'maxmin' algorithm. See for instance this blog post.
The most important point to note is that I must NOT compute a full distance matrix (even with the best algorithms it would be unfeasible on the scale of datasets I am working with).
As explained in the above post, the strength of the iterative maxmin picker according to Roger Sayle's method is that one can avoid computing most of the pairwise distances, and instead calculate only the few relevant ones. Hence my question.
Here's what I could come up with so far:
# make a random dataset
set.seed(1234567)
d <- sample(30:45, 1000, replace = T)
dd <- setNames(data.frame(do.call(rbind, sapply(d,function(n) list(sample(as.character(1:(45*2)), n, replace = F)), simplify = F))), "FP")
dd["ID"] <- 1:NROW(dd)
# define a pairwise distance function for ID's
distfun <- function(ID1, ID2) {
  FP1 <- dd$FP[[ID1]]
  FP2 <- dd$FP[[ID2]]
  int <- length(intersect(FP1, FP2))
  1 - int/(d[ID1] + d[ID2] - int)
}
# test performance of distance function
x <- sample(dd$ID, 200, replace = F)
y <- sample(dd$ID[!(dd$ID %in% x)], 200, replace = F)
pairwise.dist <- NULL
system.time(
  for (i in x) {
    for (j in y) {
      dij <- distfun(i, j)
      #pairwise.dist <- rbind(pairwise.dist, c(min(i,j), max(i,j), dij))
    }
  }
)
# user system elapsed
# 0.86 0.00 0.86
Question 1: do you think the distance function could be made faster?
I tried making a sparse matrix of the features (ddu.tab in the code below, where I omitted the denominator, which is trivial to compute from the intersection) and defining the distance function as vector operations, but that was much slower (a bit to my surprise, I must say).
ddu <- do.call(rbind, sapply(dd$ID, function(x) {data.frame("ID"=x, "FP"=dd$FP[[x]], stringsAsFactors = F)}, simplify = F))
ddu.tab <- xtabs(~ID+FP, ddu, sparse = T)
system.time(
  for (i in x) {
    for (j in y) {
      dij <- t(ddu.tab[i,]) %*% ddu.tab[j,]
      #pairwise.dist <- rbind(pairwise.dist, c(min(i,j), max(i,j), dij))
    }
  }
)
# user system elapsed
# 32.35 0.03 32.66
Question 2: actually less important than the distance calculation, but if anyone can advise... Updating pairwise.dist by rbind is (apparently) very costly. I don't know if I can do it differently (meaning not adding new elements at each iteration), because in the maxmin application the pairs of IDs whose distances are to be calculated are not known upfront like in this example, and pairwise.dist is continuously read from and appended to.
Someone in the past suggested to me that lists may be better than matrices for reading/writing. If that is the case, I could store pairwise.dist as a named list.
BTW, just FYI, in this specific example the full distance matrix is calculated quite fast:
system.time(ddu.dist <- dist(ddu.tab, method = "binary"))
# user system elapsed
# 0.61 0.00 0.61
which seems to indicate that there is indeed a fast method to calculate binary distances.
If anyone could please advise and/or point me to relevant resources, it would be great.
Thanks!
Not sure about speeding up the distance function itself, but you could replace your double loop, using the tidyverse, with
library(tidyverse)
results <- crossing(x = x, y = y) %>%                 # all x, y combinations
  filter(x < y) %>%                                   # remove duplicates
  mutate(pairwise.dist = map2_dbl(x, y, distfun))     # apply the distance function
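For the rbind() growth issue raised in Question 2, a common pattern (a sketch, not part of the original answer; it reuses x, y and distfun from the question) is to collect each result in a pre-allocated list and bind once at the end:
n_pairs <- length(x) * length(y)       # upper bound on the number of rows
dist_list <- vector("list", n_pairs)   # pre-allocated container
k <- 0L
for (i in x) {
  for (j in y) {
    k <- k + 1L
    dist_list[[k]] <- c(min(i, j), max(i, j), distfun(i, j))
  }
}
pairwise.dist <- do.call(rbind, dist_list[seq_len(k)])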

Faster way to generate large list of vectors from permuted datasets [R]

Setup: For the purposes of my simulation, I'm generating a list of B=2000 elements, with each element being the output of a permutation procedure in which I first permute the rows of a 200x8000 matrix and, for each column, calculate the Kolmogorov-Smirnov test statistic between the first and second 100 rows (you can think of the first 100 rows as data from one group and the second 100 rows as data from another group).
Question: This process takes a very long time (about 30-40 minutes) to generate the list. Is there a much faster way? In the future, I'd like to increase B to a larger value.
Code:
B = 2000
n.row = 200; n.col = 8000
# Modified KS-test function (from the base package), defined before use
ks.test.stat <- function(x, y){
  x <- x[!is.na(x)]
  n <- length(x)
  y <- y[!is.na(y)]
  n.x <- as.double(n)
  n.y <- length(y)
  w <- c(x, y)
  z <- cumsum(ifelse(order(w) <= n.x, 1/n.x, -1/n.y))
  z <- z[c(which(diff(sort(w)) != 0), n.x + n.y)]  # exclude ties
  STATISTIC <- max(abs(z))
  return(STATISTIC)
}
# Generate sample data
samp.dat = matrix(rnorm(n.row*n.col), nrow = n.row)
perm.KS.list = NULL
for (b in 1:B){
  # Permute the rows
  perm.dat.tmp = samp.dat[sample(nrow(samp.dat)), ]
  # Compute the permutation-based test statistics
  perm.KS.list[[b]] = apply(perm.dat.tmp, 2, function(y) ks.test.stat(y[1:100], y[101:200]))
}
The 1:B loop has several places to optimize, but I agree that the real time consumer is that inner function. Because you're simulating well-behaved bootstrap samples, you can make two simplifying assumptions that the general base function can't:
There aren't missing values. This obviates the is.na() adjustments.
The two sides (i.e., x & y) have the same number of elements, so you don't need to count them separately. Instead of splitting y in the loop and then joining the pieces back together in the function (into w), just keep it together. The balanced sides also permit simplifications such as removing the ifelse() clause: the comparison produces a bunch of 0/1s, which are rescaled to -1/+1s with integer arithmetic.
The reduced function below saves about 25% of the time. I also add integers, instead of doubles, inside cumsum().
ks.test.stat.balanced <- function(w){
  n <- as.integer(length(w) * .5)
  # z <- cumsum(ifelse(order(w) <= n, 1L, -1L)) / n
  z <- cumsum((order(w) <= n)*2L - 1L) / n
  # z <- z[c(which(diff(sort(w)) != 0), n + n)]  # exclude ties
  return( max(abs(z)) )
}
Ties shouldn't occur often with your gaussian rng, and the diff(sort(.)) is very expensive. If you're willing to remove that protection, the time is reduced by about 65%.
If you move the equation for z into abs(), it saves a little time over all those reps. I kept it separate above, so it's easier to read.
Edit: in case of an unbalanced simulation, I'd recommend you:
still keep out the is.na,
still pass w,
still keep as much as possible in integer, not numeric, but
now include arguments n1 & n2 for the two group sizes.
Also, experiment w/ precalculating 1/n before cumsum() to avoid a lot of expensive divisions. Try to think of other math-y ways to extract calculations from an inner loop so it occurs less frequently.
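For completeness, a sketch of how the balanced function might be plugged back into the loop from the question (reusing samp.dat and B; the whole 200-row column is passed as w, since its first and second halves are the two groups):
perm.KS.list <- vector("list", B)   # pre-allocate instead of growing
for (b in 1:B){
  perm.dat.tmp <- samp.dat[sample(nrow(samp.dat)), ]
  perm.KS.list[[b]] <- apply(perm.dat.tmp, 2, ks.test.stat.balanced)
}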

R: draw from a vector using custom probability function

Forgive me if this has been asked before (I feel it must have, but could not find precisely what I am looking for).
How can I draw one element of a vector of whole numbers (from 1 through, say, 10) using a probability function that specifies different chances for the elements? If I want equal probabilities I use runif() to get a number between 1 and 10:
ceiling(runif(1,1,10))
How do I similarly sample from e.g. the exponential distribution to get a number between 1 and 10 (such that 1 is much more likely than 10), or from a logistic probability function (if I want a sigmoid, increasing probability from 1 through 10)?
The only "solution" I can come up with is first to draw 1e6 numbers from, say, the sigmoid distribution and then scale the min and max to 1 and 10 - but this looks clumsy.
UPDATE:
This awkward solution (and I don't feel it is very "correct") would go like this:
#Draw enough from a distribution, here exponential
x <- rexp(1e3)
#Scale probs to e.g. 1-10
scaler <- function(vector, min, max){
  (((vector - min(vector)) * (max - min))/(max(vector) - min(vector))) + min
}
x_scale <- scaler(x,1,10)
#And sample once (and round it)
round(sample(x_scale,1))
Are there not better solutions around?
I believe sample() is what you are looking for, as #HubertL mentioned in the comments. You can specify an increasing function (e.g. the logistic/sigmoid curve, named logit() in the code below) and pass the vector you want to sample from, v, as an input. You can then use the output of that function as a vector of probabilities p. See the code below.
logit <- function(x) {
  # logistic (inverse-logit) curve; equivalent to plogis(x)
  return(exp(x)/(exp(x)+1))
}
v <- c(seq(1,10,1))
p <- logit(seq(1,10,1))
sample(v, 1, prob = p, replace = TRUE)
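For the exponential case the asker mentions (1 much more likely than 10), the same idea works with exponential density values as weights; the rate here is an arbitrary choice for illustration:
v <- 1:10
p <- dexp(v, rate = 0.5)    # decreasing weights: 1 most likely, 10 least likely
sample(v, 1, prob = p)      # sample() rescales prob to sum to 1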

R: looping to search for max of non-monotonic function

Refer to the R code below. The function (someRfunction) operates on a vector and returns a scalar value. The data are pairs (x,y), where x and y are vectors of length n, which may be large.
I want to know the value of x* such that the result of someRfunction on y where {x>x*} is maximized. The function operates on y values and is non-monotonic in x*. I need to evaluate for all x* (i.e. each element of x). Speed is not an issue if executed once, but the code would be executed many times in a simulation. Is there any way to make this code more efficient/faster?
### x and y are vectors of length n
### sort x and y such that they are ordered by descending x
xord <- x[order(-x)]
yord <- y[order(-x)]
maxf <- -99999
maxcut <- NA
for (i in 1:n) {
  ### yi is a subvector of y that corresponds to y[x>x{i}]
  ### where x{i} is the (n-i+1)th order statistic of x
  yi <- yord[1:(i-1)]
  fxi <- someRfunction(yi)
  if (fxi > maxf) {
    maxf <- fxi
    maxcut <- xord[i]
  }
}
Thanks.
Edit: let someRfunction(yi)=t.test(yi)$statistic.
If you can say anything more about the function, particularly whether it is smooth and whether its gradient can be determined, you will get a better answer. At the moment the only speed increase will be modest: pre-specify a vector to hold the results, omit that if-max clause, and then use which.max() on the vector. You might want to look at the function optimx in the package "optimx".
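A small sketch of that suggestion (pre-allocating a result vector and using which.max()), taking someRfunction(yi) = t.test(yi)$statistic as in the edit; the simulated x and y are placeholders for the real data:
set.seed(1)
x <- rnorm(100); y <- rnorm(100)   # placeholders for the real data
n <- length(x)
xord <- x[order(-x)]
yord <- y[order(-x)]
fvals <- rep(NA_real_, n)
for (i in 3:n) {                   # t.test() needs at least two observations
  fvals[i] <- t.test(yord[1:(i - 1)])$statistic
}
maxcut <- xord[which.max(fvals)]   # which.max() skips the leading NAs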
