Say we have a data.frame where the columns represent the quantiles for a given set of probabilities. Each row represents a different subject, and the quantiles vary by subject. The goal is to take n_draws draws for each subject.
n <- 1e5
alphas <- c(.05, .25, .5, .75, .95)
n_draws <- 100
dt <- data.frame(quantile_05 = runif(n),
quantile_25 = runif(n, min = 10, max = 20),
quantile_5 = runif(n, min = 30, max = 40),
quantile_75 = runif(n, min = 50, max = 60),
quantile_95 = runif(n, min = 70, max = 80))
R has stats::approx. The issue is that it can only be applied to one row at a time.
draws <- apply(X = dt, MARGIN = 1, function(q){
stats::approx(x=alphas, y=q, yleft = 0, rule = 2, method="linear", xout = seq(.0001, .99999, length.out = n_draws))$y
})
Naturally, one way to speed this up is parallelization:
library(parallel)
cl <- makeCluster(8)
clusterExport(cl, c('alphas', 'n_draws'))
draws <- parApply(cl = cl, X = dt, MARGIN = 1, function(q){
stats::approx(x = alphas, y = q, yleft = 0, rule = 2, method = "linear", xout = seq(.0001, .99999, length.out = n_draws))$y
})
stopCluster(cl)
This parallel code is several times faster on my machine. I am curious whether anyone has suggestions for further speed-ups.
To be a little clearer, this is a simplification of what I'm doing in real life. I have more than 5 quantiles, and I want to use this approach to estimate not only other quantiles but also the mean. Further, I'd like to use the draws to model other quantities such as max(y - 10, 0) (or y minus any arbitrary value).
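For instance, once the draws matrix above exists (apply() with MARGIN = 1 returns the n_draws values column-wise, so there is one column per subject), such quantities could be estimated with something like this sketch (subject_means and excess_over_10 are just illustrative names):
```r
## sketch: per-subject summaries from the draws matrix produced above
subject_means  <- colMeans(draws)               # estimated mean for each subject
excess_over_10 <- colMeans(pmax(draws - 10, 0)) # estimated E[max(y - 10, 0)] for each subject
```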
Your proposed method is rather inefficient for the desired goal: you end up storing 100 points for each desired ECDF, and you will still need to write a function to extract a desired draw. Instead, I suggest you consider approxfun. It returns a more compact representation: an individual function for each row, with an associated environment that contains the knots for later calculation. The calculation is done with a C call that is accessed through the invisible helper function .approxfun.
Demonstrating the internals (up to a point) that I'm suggesting:
out <- approxfun(y = as.numeric(dt[1, ]), x = alphas, yleft = 0, rule = 2, method = "linear")
# So out is now a single instance using the knots in the first row
out
#function (v)
#.approxfun(x, y, v, method, yleft, yright, f, na.rm)
#<bytecode: 0x558366535968>
#<environment: 0x5583690a04f8>
ls(environment(out))
#[1] "f" "method" "na.rm" "x" "y" "yleft" "yright"
environment(out)$x
#[1] 0.05 0.25 0.50 0.75 0.95
environment(out)$y
#[1] 0.4038727 17.7069735 33.4438595 57.2753257 77.2024894
If you wanted the estimated 55th percentile for the first case, you could get it with:
out(55/100)
#[1] 38.21015
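A list with one such function per subject could be built like this (a sketch of my reading of the suggestion, using lapply over row indices; fun_list and draws_1 are illustrative names, and it is not benchmarked):
```r
## one interpolating function per subject; each closure stores its own knots
fun_list <- lapply(seq_len(nrow(dt)), function(i)
  approxfun(x = alphas, y = as.numeric(dt[i, ]), yleft = 0, rule = 2, method = "linear"))

fun_list[[1]](55/100)   # estimated 55th percentile for subject 1
## draws for a given subject only when they are needed
draws_1 <- fun_list[[1]](seq(.0001, .99999, length.out = n_draws))
```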
And now that I've suggested a way to speed up your creation of this list of functions, I'm not even sure it's worth it. I think you could just leave the dt data frame in place and call approxfun only when needed. But that's your call.
Note: This is essentially the method used by the ecdf function:
ecdf
function (x)
{
x <- sort(x)
n <- length(x)
if (n < 1)
stop("'x' must have 1 or more non-missing values")
vals <- unique(x)
rval <- approxfun(vals, cumsum(tabulate(match(x, vals)))/n,
method = "constant", yleft = 0, yright = 1, f = 0, ties = "ordered")
class(rval) <- c("ecdf", "stepfun", class(rval))
assign("nobs", n, envir = environment(rval))
attr(rval, "call") <- sys.call()
rval
}
<bytecode: 0x558364a0f360>
<environment: namespace:stats>
And you might actually want to use the ecdf function itself, because it comes with class-associated methods (for example plot, quantile and summary methods for "ecdf" objects).
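For instance (a small illustration of those methods, not code from the answer above; e is an arbitrary example object):
```r
e <- ecdf(rnorm(100))
plot(e)            # plot method for "ecdf" objects
quantile(e, 0.5)   # quantile method: inverts the ECDF
knots(e)           # the jump locations (sorted unique data values)
summary(e)
```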
Related
I have two different columns:
a <- c(65.96924, 7.084438, 81.65288)
b <- c(2197.62974660844, 1934.40212085843, 1939.64438773692)
They are different, but I wonder why I get similar output when applying:
quants <- seq(0, 1, length.out = 51)
ecdf(a)(quantile(a, quants))
ecdf(b)(quantile(b, quants))
This can be better understood if you calculate the ECDF by hand instead of using the ecdf() function: it is just a step function that steps up by 1/n at each of the n data points. Since both vectors have the same length, you get the same result.
quants <- seq(0, 1, length.out = 51)
A <- sort(a)
B <- sort(b)
A_ecdf <- 1:length(A)/length(A)
B_ecdf <- 1:length(B)/length(B)
plot(A, A_ecdf, type = "s", col = 1)
par(new = TRUE)
plot(B, B_ecdf, type = "s", col = 2)
Though I am not sure exactly what you are trying to do, a possible solution from How to find quantiles of an empirical cumulative density function (ECDF) may be:
my_quantile <- function(x, prob) {
n <- length(x)
approx(seq(0, 1, length = n), x, prob)$y
}
my_quantile(A, quants)
my_quantile(B, quants)
You are calculating the empirical cumulative distribution function of the quantiles themselves and not of the distribution, e.g. ecdf(b)(2000). By definition, you end up with a straight line.
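To see the difference, compare evaluating the ECDF at an actual data value with evaluating it at its own quantiles:
```r
ecdf(b)(2000)                 # share of b at or below 2000; depends on the actual values
ecdf(a)(quantile(a, quants))  # depends only on ranks, so equal-length a and b agree
```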
I have a dataset with 600 points: their latitude, longitude, and demand.
I have to form clusters such that the points within each cluster are near each other and the total demand of each cluster does not exceed a certain capacity limit.
A sample dataset for the problem:
set.seed(123)
id<- seq(1:600)
lon <- rnorm(600, 88.5, 0.125)
lat <- rnorm(600, 22.4, 0.15)
demand <- round(rnorm(600, 40, 20))
df<- data.frame(id, lon, lat, demand)
What I want to have approximately:
What I am getting (cluster boundaries are approximated):
The code I've written:
library(tidyverse)
constrained_cluster <- function(df,capacity=170){
lon_max <- max(df$lon)
lat_max <- max(df$lat)
#Calculating the distance between an extreme point and all other points
p <- pi/180   # degrees-to-radians factor used in the great-circle distance below
df$distance <- 6377.83*acos(sin(lat_max*p)*sin(df$lat*p) + cos(lat_max*p)*cos(df$lat*p) * cos((lon_max-df$lon)*p))
df<- df[order(df$distance, decreasing = FALSE),]
d<-0
cluster_number<-1
cluster_list<- c()
i<-1
#Writing a loop to form the cluster which will fill up the cluster_list accordingly
while (i <= length(df$distance)){
d <- d+ df$demand[i]
if(d<=capacity){
cluster_list[i] <- cluster_number
i<- i+1
}
else{
cluster_number <- cluster_number+1
d <- 0
i<-i
}
}
#Return a dataframe with the list of clusters
return(cbind(df,as.data.frame(cluster_list)))
}
df_with_cluster<- constrained_cluster(df, capacity = 1000)
Here is one possible approach, in which I treat the problem directly as an optimisation problem.
Suppose you have a feasible partition of the rows into groups: not necessarily a good one, but one that does not violate the constraints. For every group (cluster), you compute the centre. Then you compute the distances of all points in a group to the group's centre and sum them. In this way, you have a measure of the quality of your initial partition.
Now, randomly pick one row and move it into another group. You get a new solution. Repeat the computation as before and compare the new solution's quality with the previous one. If it is better, keep it; if it is worse, stay with the old solution. Now repeat this whole procedure for a fixed number of iterations.
This process is called a Local Search. Of course, it is not guaranteed to take you to an optimal solution, but it will likely give you a good one. (Note that k-means implementations are typically stochastic as well, and there is no guarantee of an "optimal" partition.)
The good thing about a Local Search is that it gives you much flexibility. For instance, I assumed you started with a feasible solution. Suppose you make a random move (i.e. move one row into another cluster), but now this new cluster is too big. You could simply discard this new, infeasible solution and draw a new one.
Here is a code example, really just an outline; but with luck it is useful for you.
set.seed(123)
id<- seq(1:600)
lon <- rnorm(600, 88.5, 0.125)
lat <- rnorm(600, 22.4, 0.15)
demand <- round(rnorm(600, 40, 20))
df<- data.frame(id, lon, lat, demand)
Fix a number of clusters, k.
k <- 5
Start with kmeans and plot the solution.
par(mfrow = c(1, 3))
km <- kmeans(cbind(df$lat, df$lon), centers = k)
cols <- hcl.colors(n = k, "Cold")
plot(df$lon,
df$lat,
type = "p", pch = 19, cex = 0.5,
main = "kmeans")
for (i in seq_len(k)) {
lines(df$lon[km$cluster == i],
df$lat[km$cluster == i],
type = "p", pch = 19,
col = cols[i])
}
Now a Local Search. I use an implementation in package NMOF (which I maintain).
library("NMOF")
## a random initial solution
x0 <- sample(1:k, length(id), replace = TRUE)
X <- as.matrix(df[, 2:3])
The objective function: it takes a partition x and computes the sum of distances for all clusters.
sum_diff <- function(x, X, k, ...) {
groups <- seq_len(k)
d_centre <- numeric(k)
for (g in groups) {
centre <- colMeans(X[x == g, ])
d <- t(X[x == g, ]) - centre
d_centre[g] <- sum(sqrt(colSums(d * d)))
}
sum(d_centre)
}
The neighbourhood function: it takes a partition and moves one row into another cluster.
nb <- function(x, k, ...) {
groups <- seq_len(k)
x_new <- x
p <- sample.int(length(x), 1)
g_ <- groups[-x_new[p]]
x_new[p] <- g_[sample.int(length(g_), 1)]
x_new
}
Run the Local Search. I actually use a method called Threshold Accepting, which is based on Local Search, but can move away from local minima. See ?NMOF::TAopt for references on that method.
sol <- TAopt(sum_diff,
list(x0 = x0,
nI = 20000,
neighbour = nb),
X = as.matrix(df[, 2:3]),
k = k)
Plot the solution.
plot(df$lon,
df$lat,
type = "p", pch = 19, cex = 0.5,
main = "Local search")
for (i in seq_len(k)) {
lines(df$lon[sol$xbest == i],
df$lat[sol$xbest == i],
type = "p", pch = 19,
col = cols[i])
}
Now, one way to add a capacity constraint. We start with a feasible solution.
## CAPACITY-CONSTRAINED
max.demand <- 6600
all(tapply(df$demand, x0, sum) < max.demand)
## TRUE
The constraint is handled in the neighbourhood. If the new solution exceeds the capacity, it is discarded.
nb_constr <- function(x, k, demand, max.demand, ...) {
groups <- seq_len(k)
x_new <- x
p <- sample.int(length(x), 1)
g_ <- groups[-x_new[p]]
x_new[p] <- g_[sample.int(length(g_), 1)]
## if capacity is exceeded, return
## original solution
if (sum(demand[x_new == x_new[p]]) > max.demand)
x
else
x_new
}
Run the method and compare the results.
sol <- TAopt(sum_diff,
list(x0 = x0,
nI = 20000,
neighbour = nb_constr),
X = as.matrix(df[, 2:3]),
k = k,
demand = df$demand,
max.demand = max.demand)
plot(df$lon,
df$lat,
type = "p", pch = 19, cex = 0.5,
main = "Local search w/ constraint")
for (i in seq_len(k)) {
lines(df$lon[sol$xbest == i],
df$lat[sol$xbest == i],
type = "p", pch = 19,
col = cols[i])
}
all(tapply(df$demand, sol$xbest, sum) < max.demand)
## TRUE
This is really just an example and could be improved. For instance, the objective function here recomputes the distance of all groups, when it would only need to look at the changed groups.
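As one illustration of a different, simpler speed-up (a vectorised full recomputation rather than the partial update just described; a sketch, not benchmarked, with sum_diff_vec as an illustrative name), the same objective value can be computed without the explicit loop over groups:
```r
sum_diff_vec <- function(x, X, k, ...) {
  centres <- rowsum(X, x) / as.vector(table(x))       # group means, one row per group
  D <- X - centres[as.character(x), , drop = FALSE]   # each point minus its group's centre
  sum(sqrt(rowSums(D * D)))
}
```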
Something like this might get you started?
nmax <- 100
num.centers <- 1
km <- kmeans(cbind(df$lat, df$lon), centers = num.centers)
# increase the number of centers until every cluster has fewer than nmax points
while (any(km$size >= nmax)) {
  num.centers <- num.centers + 1
  km <- kmeans(cbind(df$lat, df$lon), centers = num.centers)
}
plot(df$lon, df$lat, col = km$cluster, pch = 20)
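Note that this checks cluster sizes (number of points), while the question's constraint is on total demand per cluster. A variant of the same loop that checks demand instead might look like this (a sketch; cap = 1000 mirrors the capacity used in the question's function call):
```r
cap <- 1000
num.centers <- 1
km <- kmeans(cbind(df$lat, df$lon), centers = num.centers)
## increase the number of centers until no cluster's total demand exceeds cap
while (any(tapply(df$demand, km$cluster, sum) > cap)) {
  num.centers <- num.centers + 1
  km <- kmeans(cbind(df$lat, df$lon), centers = num.centers)
}
plot(df$lon, df$lat, col = km$cluster, pch = 20)
```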
I have been struggling with this for a long time. I have a logistic growth function where the growth parameter r is a matrix. The model is constructed so that the output contains two states, N1 and N2.
I would like to be able to change the r parameter over time. When time < 50 I would like r = r1, where
r1 = matrix(c(2, 3), nrow = 1, ncol = 2)
When time >= 50 I would like r = r2, where
r2 = matrix(c(1, 2), nrow = 1, ncol = 2)
Here is my function. Any help is highly appreciated.
rm(list = ls())
library(deSolve)
model <- function(time, y, params) {
with(as.list(c(y,params)),{
N = y[paste("N",1:2, sep = "")]
dN <- r*N*(1-N/K)
return(list(c(dN)))
})
}
r <- matrix(c(4, 5), nrow = 1, ncol = 2)
K <- 100
params <- list(r = r, K = K)
y<- c(N1=0.1, N2=0.2)
times <- seq(0,100,1)
out <- ode(y, times, model, params)
plot(out)
Ideally, I would like something like this, but it does not work:
model <- function(time, y, params) {
with(as.list(c(y,params)),{
N = y[paste("N",1:2, sep = "")]
r = ifelse(times < 10, matrix(c(1,3),nrow=1, ncol=2),
ifelse(times > 10, matrix(c(1,4),nrow=1, ncol=2), matrix(c(1,2),nrow=1, ncol=2)))
print(r)
dN <- r*N*(1-N/K)
return(list(c(dN)))
})
}
Thank you for your time.
Here is a generic approach that uses an extended version of the approx function. Note also some further simplifications of the model function and the additional plot of the parameter values.
Edit: changed according to the suggestion of Lewis Carter to make the parameter change at t = 3, so that its effect can be seen.
library(simecol) # contains approxTime, a vector version of approx
model <- function(time, N, params) {
r <- approxTime(params$signal, time, rule = 2, f=0, method="constant")[-1]
K <- params$K
dN <- r*N*(1-N/K)
return(list(c(dN), r))
}
signal <- matrix(
# time, r[1, 2],
c( 0, 2, 3,
3, 1, 2,
100, 1, 2), ncol=3, byrow=TRUE
)
## test of the interpolation
approxTime(signal, c(1, 2.9, 3, 100), rule = 2, f=0, method="constant")
params <- list(signal = signal, K = 100)
y <- c(N1=0.1, N2=0.2)
times <- seq(0, 10, 0.1)
out <- ode(y, times, model, params)
plot(out)
For a small number of state variables, as in the example, separate signals built with approxfun from package stats will look less generic but may be slightly faster.
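Such a variant might look like this (a sketch; the knots are taken from the signal matrix above, and r1_fun, r2_fun, model2, out2 are illustrative names):
```r
## one constant-interpolation function per element of r
r1_fun <- approxfun(x = c(0, 3, 100), y = c(2, 1, 1), method = "constant", rule = 2)
r2_fun <- approxfun(x = c(0, 3, 100), y = c(3, 2, 2), method = "constant", rule = 2)

model2 <- function(time, N, params) {
  r  <- c(r1_fun(time), r2_fun(time))
  dN <- r * N * (1 - N / params$K)
  list(dN, r)
}
out2 <- ode(y, times, model2, parms = list(K = 100))
```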
As a further improvement, one may consider replacing the "hard" transitions with smoother ones. Such a transition can then be formulated directly as a function, without the need for approx, approxfun or approxTime.
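One way such a smooth transition could be written (a sketch; the logistic form, the midpoint t0 = 3, and the steepness k0 = 10 are arbitrary illustration choices, not from the answer):
```r
r_smooth <- function(t, r_before = c(2, 3), r_after = c(1, 2), t0 = 3, k0 = 10) {
  w <- 1 / (1 + exp(-k0 * (t - t0)))   # weight: ~0 well before t0, ~1 well after
  (1 - w) * r_before + w * r_after
}
## inside the model function, r <- r_smooth(time) would then replace the approxTime() call
```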
Edit 2:
Package simecol imports deSolve, and we need only one small function from it. So instead of loading simecol, it is also possible to include the approxTime function directly in the code. The conversion from data frame to matrix improves performance, but a matrix input is preferable anyway in such cases.
approxTime <- function(x, xout, ...) {
if (is.data.frame(x)) {x <- as.matrix(x); wasdf <- TRUE} else wasdf <- FALSE
if (!is.matrix(x)) stop("x must be a matrix or data frame")
m <- ncol(x)
y <- matrix(0, nrow=length(xout), ncol=m)
y[,1] <- xout
for (i in 2:m) {
y[,i] <- as.vector(approx(x[,1], x[,i], xout, ...)$y)
}
if (wasdf) y <- as.data.frame(y)
names(y) <- dimnames(x)[[2]]
y
}
If you want to pass a matrix parameter, you should pass a list of parameters, and you can modify it inside the model function when your time limit is exceeded (in the example below you don't even have to pass the r matrix to the model function).
library(deSolve)
model <- function(time, y, params) {
with(as.list(c(y,params)),{
if(time < 3) r = matrix(c(2,3), nrow = 1, ncol = 2)
else r = matrix(c(1,3), nrow = 1, ncol = 2)
N = y[paste("N",1:2, sep = "")]
dN <- r*N*(1-N/K)
return(list(c(dN)))
})
}
y <- c(N1=0.1, N2=0.2)
params <- list(r = matrix(c(0,0), nrow = 1, ncol = 2), K=100)
times <- seq(0,10,0.1)
out <- ode(y, times, model, params)
plot(out)
You can see examples of this approach, for instance, with delay differential equations; see ?dede.
To start with, I have two functions like the following:
ef <- function(x, a){
if(a == 0){
return(x)
} else {
return(1-exp(-a*(5+x)))
}
}
f1 <- function(x) ef(x,a)-0.75*ef(2.5,a)-0.25*ef(-1,a)
If a is 2 (i.e. a <- 2), then the root should be:
uniroot(f1, c(-5, 0), tol = 0.0001)$root
Now my question is how to calculate the root x of the function as a changes from 0.05 to 3 in steps of 0.05.
I think it's more flexible to put a into f1() as an argument.
f1 <- function(x, a) ef(x, a)-0.75*ef(2.5, a)-0.25*ef(-1, a)
Then use sapply() to operate each value in the sequence seq(0.05, 3, 0.05):
sapply(seq(0.05, 3, 0.05), function(A){
uniroot(f1, c(-10, 10), tol = 0.0001, extendInt = "yes", a = A)$root
})
# [1] 1.565924900 1.503659791 1.438426382 1.370549617 1.300423929
# [6] 1.228478774 1.155273229 1.081323809 1.007194271 0.933431003 ...
The argument extendInt = "yes" overcomes the error that occurs when f1() does not have opposite signs at the endpoints. In addition, I prefer the apply family to a for loop in this case; you can read this for the reason.
Edit: for loop solution
a <- seq(0.05, 3, 0.05)
root <- numeric(length(a))
for (i in seq_along(a)) {
  root[i] <- uniroot(f1, c(-10, 10), tol = 0.0001, extendInt = "yes", a = a[i])$root
}
At the end of the loop, the variable root stores all the roots. You can check that the outputs of the two solutions are equal.
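For example (assuming the sapply() result above is stored in a variable, say roots_sapply, which the answer did not do explicitly):
```r
roots_sapply <- sapply(seq(0.05, 3, 0.05), function(A){
  uniroot(f1, c(-10, 10), tol = 0.0001, extendInt = "yes", a = A)$root
})
all.equal(roots_sapply, root)
```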
I'm doing maximum likelihood estimation using maxLik, which requires specifying starting values. Instead of specifying a single set of values, is there a way to use every row of a matrix of candidate values as start values?
My current code of maxLik is:
f12 <- function(param){
alpha <- param[1]
rho <- param[2]
lambda <- param[3]
u <- 0.5*(p12$v_50_1)^alpha + 0.5*lambda*(p12$v_50_2)^alpha
p <- 1/(1 + exp(-rho*u))
f <- sum(p12$gamble*log(p) + (1-p12$gamble)*log(1-p))
}
ml <- maxLik(f12, start = c(alpha = 1, rho=2, lambda = 1), method = "NM")
I create a data frame with a grid of potential start values:
st <- expand.grid(alpha = seq(0, 2, len = 100), rho = seq(0, 1, len = 100), lambda = seq(0, 2, len = 100))
There are 3 parameters in my function, and my goal is to loop over all the rows of st above and, after running the model from this variety of starting parameters, select the best vector of start values.
Thanks!
Consider Map (a wrapper around mapply) to pass the st columns elementwise through your method. Here, Map returns a list of maxLik objects, specifically inherited maxim class objects, each containing a list of components. The number of items in this list equals the number of rows of st.
Notice the input parameters a, r, and l being passed into the start argument of maxLik(), no longer hard-coded values, while f12 is left untouched.
maxLik_run <- function(a, r, l) {
tryCatch({
f12 <- function(param){
alpha <- param[1]
rho <- param[2]
lambda <- param[3]
u <- 0.5*(p12$v_50_1)^alpha + 0.5*lambda*(p12$v_50_2)^alpha
p <- 1/(1 + exp(-rho*u))
f <- sum(p12$gamble*log(p) + (1-p12$gamble)*log(1-p))
}
return(maxLik(f12, start = c(alpha = a, rho = r, lambda = l), method = "NM"))
}, error = function(e) return(NA))
}
st <- expand.grid(alpha = seq(0, 2, len = 100),
                  rho = seq(0, 1, len = 100),
                  lambda = seq(0, 2, len = 100))
maxLik_list <- Map(maxLik_run, st$alpha, st$rho, st$lambda)
And answering the question itself, finding the best vector of start values after running the model from a variety of starting parameters, requires a particular definition of "best". Once you define this, you can use Filter() on the returned list of objects to select the one or more elements that yield this "best".
Below is a demonstration that finds the highest maximised log-likelihood across the fits (use estimate if you need the corresponding parameters instead). Do note that the returned list can contain more than one item if the highest value is shared by other list items:
ok <- Filter(function(i) inherits(i, "maxLik"), maxLik_list)   # drop runs that failed and returned NA
highest_value <- max(sapply(ok, function(item) item$maximum))
maxLik_item_list <- Filter(function(i) i$maximum == highest_value, ok)
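For example, the parameter vector of a winning fit can then be read off its estimate component (or with coef()); a small sketch with best_fit as an illustrative name:
```r
best_fit <- maxLik_item_list[[1]]
best_fit$estimate      # the parameter vector at the optimum
summary(best_fit)
```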
In your wrapper you only need the log-likelihood function and the maxLik() call (note that f12 is also your function name, so be careful not to overwrite it), with the start values taken from one row of st rather than hard-coded. You can then run the estimation for every row of start values using apply, like this:
# create a function that runs the MLE for one vector of start values
maxlike <- function(a) {
  f12 <- function(param){
    alpha <- param[1]
    rho <- param[2]
    lambda <- param[3]
    u <- 0.5*(p12$v_50_1)^alpha + 0.5*lambda*(p12$v_50_2)^alpha
    p <- 1/(1 + exp(-rho*u))
    sum(p12$gamble*log(p) + (1-p12$gamble)*log(1-p))
  }
  maxLik(f12, start = c(alpha = a[1], rho = a[2], lambda = a[3]), method = "NM")
}
# then apply over the rows of st (MARGIN = 1 means rows)
mle <- apply(st, 1, maxlike)
mle