Function to create a vectorized piecewise function in R

I'm pretty new to R, so apologies in advance if this question is poorly constructed. Basically, I have a piecewise function that I need to evaluate for a large number of rows. My current function looks something like this:
f <- function(x){
(x <= 1000) * x^2 +
(x > 1000 & x <= 2000) * x^3 +
(x > 2000 & x <= 3000) * x^4 +
(x > 4000) * x^5
}
However, I need to be able to generalize this function to a variety of different sets of breakpoints (e.g. 1500, 2500, 3500, etc.) and to different numbers of breakpoints. Also, given the large number of rows it will be applied to, the function has to be vectorized. Any advice?
Edit:
To clarify, I made the function above from a table of breakpoints (1000, 2000, 3000, 4000) and associated powers to raise x to (2, 3, 4, 5). However, I need to be able to take multiple such tables, each with different breakpoints and a different number of breakpoints (potentially 100 or so), and apply the resulting piecewise function to a large number of rows.

A vectorised version of your function with additional breaks and power arguments can be written this way:
f <- function(x, breaks, power){
x^power[as.numeric(cut(x, breaks))]
}
as.numeric(cut(...)) gets the position of all x values in the breaks, then the square bracket looks up the power in the power vector and raises the corresponding x to the correct power. Tests:
Some breakpoints and powers:
> bp <- c(10,20,30,40)
> po = c(2,3,4)
Note the breakpoints are left-excluded:
> f(9,bp,po)
[1] NA
> f(10,bp,po)
[1] NA
So the first valid x has to be above 10:
> f(11,bp,po)
[1] 121
And gets us 11^2 as expected. So 20 gets squared and 21 gets cubed:
> f(20,bp,po)
[1] 400
> f(21,bp,po)
[1] 9261
Good so far. Vectorised?
> f(19:22, bp, po)
[1] 361 400 9261 10648
Yes - the change from square to cube happens between 20 and 21.
See the help for the right argument of the cut function (?cut) if you want the intervals to be closed on the left rather than on the right.
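If you want the semantics of the function in the question (values at or below the first breakpoint use the first power, values above the last breakpoint use the last power, so nothing becomes NA), findInterval is an alternative to cut. The following is only a minimal sketch: it assumes there is exactly one more power than there are breakpoints, and piecewise_power is an illustrative name, not something from the question.

```r
# Sketch: vectorized piecewise power function with open-ended outer intervals.
# Assumes length(powers) == length(breaks) + 1 and sorted breaks.
piecewise_power <- function(x, breaks, powers) {
  stopifnot(length(powers) == length(breaks) + 1, !is.unsorted(breaks))
  # left.open = TRUE gives intervals of the form (b[i], b[i+1]],
  # matching conditions such as x > 1000 & x <= 2000.
  # findInterval returns 0 for x <= breaks[1], hence the + 1.
  idx <- findInterval(x, breaks, left.open = TRUE) + 1
  x^powers[idx]
}

bp <- c(10, 20, 30)
po <- c(2, 3, 4, 5)
res <- piecewise_power(c(10, 11, 20, 21, 30, 31), bp, po)
```

Because findInterval uses a binary search, this should stay fast even with 100 or so breakpoints and many rows.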

From what I understand from your example code, you basically want to minimize the coding, and also want the code to be dynamic, so that you can dynamically vary the breaks and power.
Below is the sample code, which tries to do the same.
f <- function(x, breakPoints, powerX) {
cutX <- cut(x, breaks=breakPoints)
cutX1 <- factor(cutX, labels=powerX)
retX <- x ^ as.numeric(as.character(cutX1))
retX
}
x1 <- sample(1:10000, 1000)
x1 <- x1[order(x1)]
breakPoints1 <- c(min(x1)-1, 1000, 2000, 3000, max(x1))
powerX1 <- c(2, 3, 4, 5)
newX1 <- f(x1, breakPoints1, powerX1)
head(newX1) # manual check whether the values make sense
head(x1)
This code will do that.
But my suggestion will be to test this code, as much as possible, so that you can use it reliably. Hope this code is useful to you.

Related

Exponential distribution in R

I want to simulate some data from an exp(1) distribution, but the values have to be > 0.5, so I used a while loop. It does not seem to work as I would like it to. Thanks in advance for your responses!
x1<-c()
w<-rexp(1)
while (length(x1) < 100) {
if (w > 0.5) {
x1<- w }
else {
w<-rexp(1)
}
}
1) The code in the question has these problems:
we need a new random variable on each iteration but it only generates new random variables if the if condition is FALSE
x1 is repeatedly overwritten rather than extended
although while could be used, repeat seems better, since having the test at the end of the loop is a better fit than having it at the beginning
We can fix this up like this:
x1 <- c()
repeat {
w <- rexp(1)
if (w > 0.5) {
x1 <- c(x1, w)
if (length(x1) == 100) break
}
}
1a) A variation would be the following. Note that an if whose condition is FALSE evaluates to NULL if there is no else leg so if the condition is FALSE on the line marked ## then nothing is concatenated to x1.
x1 <- c()
repeat {
w <- rexp(1)
x1 <- c(x1, if (w > 0.5) w) ##
if (length(x1) == 100) break
}
2) Alternately, this generates 200 exponential random variables keeping only those greater than 0.5. If fewer than 100 are generated then repeat. At the end it takes the first 100 from the last batch generated. We have chosen 200 to be sufficiently large that on most runs only one iteration of the loop will be needed.
repeat {
r <- rexp(200)
r <- r[r > 0.5]
if (length(r) >= 100) break
}
r <- head(r, 100)
Alternative (2) is actually faster than (1) or (1a) because it is more highly vectorized. This is despite it throwing away more exponential random variables than the other solutions.
I would advise against a while (or any other accept/reject) loop; instead use the methods from truncdist:
# Sample 1000 observations from a truncated exponential
library(truncdist);
x <- rtrunc(1000, spec = "exp", a = 0.5);
# Plot
library(ggplot2);
ggplot(data.frame(x = x), aes(x)) + geom_histogram(bins = 50) + xlim(0, 10);
It's also fairly straightforward to implement a sampler using inverse transform sampling to draw samples from a truncated exponential distribution that avoids rejecting samples in a loop. This will be a more efficient method than any accept/reject-based sampling method, and works particularly well in your case, since there exists a closed form of the truncated exponential cdf.
See for example this post for more details.
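As a minimal sketch of the inverse transform idea: for X ~ Exp(rate) conditioned on X > a, the cdf is F(x) = 1 - exp(-rate * (x - a)) for x > a, so F^-1(u) = a - log(1 - u) / rate. The helper name rtrunc_exp and its arguments are illustrative assumptions, not part of truncdist.

```r
# Sketch: inverse transform sampling from an exponential truncated below at a.
# Every draw is used, so no accept/reject loop is needed.
rtrunc_exp <- function(n, a = 0.5, rate = 1) {
  u <- runif(n)            # uniform draws in (0, 1)
  a - log1p(-u) / rate     # log1p(-u) is log(1 - u), always negative here
}

set.seed(1)
x1 <- rtrunc_exp(100)      # 100 values, all greater than 0.5
```

By the memoryless property this has the same distribution as 0.5 + rexp(100), which is an even shorter way to write it for this particular case.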

Optimization - Limits and simple constraint

I have a rather simple optimization question and while I'm fairly decent with R, optimization is something I haven't done a lot.
my.function <- function(parameters){
x <- parameters[1]
y <- parameters[2]
z <- parameters[3]
((10*x^2) - ((y/2) * (z/4)))^2
}
result <- optim(c(7,10,18),fn = my.function, method = 'L-BFGS-B',
lower = c(2,7,7),
upper = c(15,20,20))
result$par
#[1] 2.205169 19.546621 19.902243
This is a made up version of the problem I'm working on, so please forgive it if its purpose makes no sense. I have limits in place using the 'L-BFGS-B' method but I need to add a constraint and I'm unsure how to do it. My rules that I'm trying to implement are as follows:
x must be between 2 and 15
y must be between 7 and 20
z must be between 7 and 20
z <= y
It's the last one I don't know how to implement. Any help would be appreciated. Thank you.
Add a large number to the objective function if the constraint is violated, i.e. change the last line of my.function to:
((10*x^2) - ((y/2) * (z/4)))^2 + ifelse(y > z, 10^5, 0)
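Note that the condition as written penalizes y > z, so it enforces y <= z; to enforce z <= y exactly as stated in the question, flip the condition to z > y. The result for the line above is the following, which satisfies y <= z. Also, since the objective is non-negative, its value cannot be less than 0, so we have reached the minimum to within numeric tolerance.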
result$par
## [1] 2.223537 19.776462 20.000000
result$value
## [1] 1.256682e-11
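An alternative to a penalty term is to reparameterize so that box constraints alone guarantee z <= y. The sketch below is one possible way to do this, not the method used in the answer above: it optimizes over a hypothetical third parameter t in [0, 1] and sets z = 7 + t * (y - 7), so z always lies between 7 and y. The helper name to_xyz and the starting value 0.5 for t are assumptions made for illustration.

```r
my.function <- function(parameters) {
  x <- parameters[1]
  y <- parameters[2]
  z <- parameters[3]
  ((10 * x^2) - ((y / 2) * (z / 4)))^2
}

# Map (x, y, t) to (x, y, z) with z = 7 + t * (y - 7), so 7 <= z <= y
to_xyz <- function(p) c(p[1], p[2], 7 + p[3] * (p[2] - 7))

result <- optim(c(7, 10, 0.5),
                fn = function(p) my.function(to_xyz(p)),
                method = "L-BFGS-B",
                lower = c(2, 7, 0),
                upper = c(15, 20, 1))

par_xyz <- to_xyz(result$par)   # x, y, z in the original parameterization
```

Since L-BFGS-B keeps t inside [0, 1] and y inside [7, 20], every point it evaluates satisfies all four rules, and the objective stays smooth rather than having a jump at the constraint boundary.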

R: draw from a vector using custom probability function

Forgive me if this has been asked before (I feel it must have, but could not find precisely what I am looking for).
How can I draw one element of a vector of whole numbers (from 1 through, say, 10) using a probability function that specifies different chances for the elements? If I want equal probabilities I use runif() to get a number between 1 and 10:
ceiling(runif(1,0,10))
How do I similarly sample from e.g. the exponential distribution to get a number between 1 and 10 (such that 1 is much more likely than 10), or a logistic probability function (if I want a sigmoid increasing probability from 1 through 10).
The only "solution" I can come up with is first to draw 1e6 numbers from, say, the sigmoid distribution and then scale min and max to 1 and 10, but this looks clumsy.
UPDATE:
This awkward solution (and I don't feel it is very "correct") would go like this:
#Draw enough from a distribution, here exponential
x <- rexp(1e3)
#Scale probs to e.g. 1-10
scaler <- function(vector, min, max){
(((vector - min(vector)) * (max - min))/(max(vector) - min(vector))) + min
}
x_scale <- scaler(x,1,10)
#And sample once (and round it)
round(sample(x_scale,1))
Are there not better solutions around ?
I believe sample() is what you are looking for, as @HubertL mentioned in the comments. You can specify an increasing function (e.g. the logistic function, called logit() in the code below) and apply it to the vector you want to sample from, v. You can then use the output of that function as a vector of probability weights p; sample() does not require the weights to sum to one. See the code below.
logit <- function(x) {
return(exp(x)/(exp(x)+1))
}
v <- c(seq(1,10,1))
p <- logit(seq(1,10,1))
sample(v, 1, prob = p, replace = TRUE)
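The same pattern works for the other shapes asked about in the question. As a rough sketch, the weights below come from the exponential density (so 1 is much more likely than 10) and from the logistic cdf centred at 5.5 (a sigmoid increase from 1 through 10). The rate, location and scale values are arbitrary choices for illustration.

```r
v <- 1:10

# Decreasing weights: exponential density evaluated at 1..10
w_exp <- dexp(v, rate = 0.5)

# Sigmoid-increasing weights: logistic cdf centred in the middle of 1..10
w_logis <- plogis(v, location = 5.5, scale = 1)

set.seed(42)
draws_exp   <- sample(v, 1e4, replace = TRUE, prob = w_exp)
draws_logis <- sample(v, 1e4, replace = TRUE, prob = w_logis)

# Empirical frequencies, to check the shapes by eye
table(draws_exp) / length(draws_exp)
table(draws_logis) / length(draws_logis)
```

For a single draw, use sample(v, 1, prob = w_exp). Unlike the rescaling approach, this never produces values outside 1..10 and needs no rounding.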

To find the distance between two roots in R

Suppose I have a function f(x) that is well defined on an interval I. I want to find the greatest and smallest roots of f(x), then take the difference between them. What is a good way to program it?
To be precise, f can at worst be a rational function like (1+x)/(1-x). It should be a (high degree) polynomial most of the times. I only need to know the result numerically to some precision.
I am thinking about the following:
Convert f(x) into a form recognizable by R. (I can do)
Use R to list all roots of f(x) on I (I found that the uniroot function only gives me one root)
Use R to find the maximum and minimum elements in the list (should be possible once I have converted it to a vector)
Taking the difference of the two roots. (should be trivial)
I am stuck on step (2) and I do not know what to do. My professor gave a brute-force solution, suggesting that I do the following:
Divide interval I into one million pieces.
Evaluate f on each end points, find the end points where f>=0.
Choose the maximum and minimum elements from the set formed in step 2.
Take the difference between them.
I feel this approach is not very efficient and might not work for all f in general, but I am having trouble implementing it even for quadratics. I do not know how to do step (2) here either. So I want to ask for a hint or some toy examples.
At this point I am trying to implement the following code:
Y=rep(0,200)
dim(Y)=c(100,2)
for(i in 1:100){
X=rnorm(9,0,1)
Z=rnorm(16,0,1)
a=0.64
b=a*sum(Z^2)/sum(X^2)
root_intervals <- function(f, interval, n = 1e6) {
xvals <- seq(interval[1], interval[2], length = n)
yvals <- f(xvals)
ypos <- yvals > 0
x1 <- which(tail(ypos, -1) != head(ypos, -1))
x2 <- x1 + 1
## so all the zeroes we can see are between x1 and x2
return(cbind(xvals[x1], xvals[x2]))
}
Up to here everything is okay, but when I try to extract the roots to Y[i,1], Y[i,2] by
Y[i,1]=(ri<-root intervals(function(x)(x/(a*x+b))^{9/2}*(1/((1-a)+a*(1-a)/b*x))^4-0.235505, c(0,40),n=1e6)[1]
I found I cannot evaluate it anymore. R keeps telling me
Error: unexpected symbol in:
"}
Y[i,1]=(ri<-root intervals"
and I got stuck. I really appreciate everyone's help as I am feeling lost.
I checked the function's expression many times using the plot function and it has no grammar mistakes. Also I believe it is well defined for all X in the interval.
This should give you a good start on the brute force solution. You're right, it's not elegant, but for relatively simple univariate functions, evaluating 1 million points is trivial.
root_intervals <- function(f, interval, n = 1e6) {
xvals <- seq(interval[1], interval[2], length = n)
yvals <- f(xvals)
ypos <- yvals > 0
x1 <- which(ypos[-1] != head(ypos, -1))
x2 <- x1 + 1
## so all the zeroes we can see are between x1 and x2
return(cbind(xvals[x1], xvals[x2]))
}
This function returns a two column matrix of x values, where the function changes sign between column 1 and column 2:
f1 <- function (x) 0.05 * x^5 - 2 * x^4 + x^3 - x^2 + 1
> (ri <- root_intervals(f1, c(-10, 10), n = 1e6))
[,1] [,2]
[1,] -0.6372706 -0.6372506
[2,] 0.8182708 0.8182908
> f1(ri)
[,1] [,2]
[1,] -3.045326e-05 6.163467e-05
[2,] 2.218895e-05 -5.579081e-05
Wolfram Alpha confirms results on the specified interval.
The top and bottom rows will be the min and max intervals found. These intervals (over which the function changes sign) are precisely what uniroot wants for its interval argument, so you could use it to solve for the (more) exact roots. Of course, if the function changes sign twice (or any even number of times) between two neighbouring grid points, the roots there won't be picked up, so choose a big n!
Response to edited question:
Looks like you're trying to define a bunch of functions, but your edits have syntax errors. Here's what I think you're trying to do (this first part might take some more work to get right):
my_funs <- list()
Y=rep(0,200)
dim(Y)=c(100,2)
for(i in 1:100){
X=rnorm(9,0,1)
Z=rnorm(16,0,1)
a=0.64
b=a*sum(Z^2)/sum(X^2)
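## local() freezes this iteration's a and b; without it every function would see the values from the last iteration
my_funs[[i]] <- local({
a <- a; b <- b
function(x){(x/(a*x+b))^{9/2}*(1/((1-a)+a*(1-a)/b*x))^4-0.235505}
})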
}
Here's using the root_intervals on the first of your generated functions.
> root_intervals(my_funs[[1]], interval = c(0, 40))
[,1] [,2]
[1,] 0.8581609 0.8582009
[2,] 11.4401314 11.4401714
Notice the output, a matrix, with the roots of the function being between the first and second columns. Being a matrix, you can't assign it to a vector. If you want a single root, use uniroot using each row to set the upper and lower bounds. This is left as an exercise to the reader.
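For reference, one possible way to do that last step is sketched below, using the example polynomial f1 from above rather than the randomly generated functions. It refines each bracketing row with uniroot and then takes the distance between the smallest and largest root; the grid size n = 1e5 and the tolerance are arbitrary choices for illustration.

```r
# Brute-force bracketing, as in the answer above
root_intervals <- function(f, interval, n = 1e6) {
  xvals <- seq(interval[1], interval[2], length.out = n)
  ypos <- f(xvals) > 0
  x1 <- which(ypos[-1] != head(ypos, -1))   # sign changes between x1 and x1 + 1
  cbind(xvals[x1], xvals[x1 + 1])
}

f1 <- function(x) 0.05 * x^5 - 2 * x^4 + x^3 - x^2 + 1
ri <- root_intervals(f1, c(-10, 10), n = 1e5)

# Refine every bracket: each row of ri is a valid interval for uniroot
roots <- apply(ri, 1, function(iv) uniroot(f1, interval = iv, tol = 1e-10)$root)

# Distance between the smallest and the largest root found
root_distance <- diff(range(roots))
```

To fill the Y matrix from the question, the same two lines could be run inside the loop for each generated function, storing min(roots) and max(roots) in Y[i, 1] and Y[i, 2].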

Generate 3 random numbers that sum to 1 in R

I am hoping to create 3 (non-negative) quasi-random numbers that sum to one, and repeat over and over.
Basically I am trying to partition something into three random parts over many trials.
While I am aware of
a = runif(3,0,1)
these of course don't sum to one. I was thinking that I could use 1-a as the max in the next runif, but it seems messy.
Any thoughts, oh wise stackoverflow-ers?
This question involves subtler issues than might be at first apparent. After looking at the following, you may want to think carefully about the process that you are using these numbers to represent:
## My initial idea (and commenter Anders Gustafsson's):
## Sample 3 random numbers from [0,1], sum them, and normalize
jobFun <- function(n) {
m <- matrix(runif(3*n,0,1), ncol=3)
m<- sweep(m, 1, rowSums(m), FUN="/")
m
}
## Andrie's solution. Sample 1 number from [0,1], then break upper
## interval in two. (aka "Broken stick" distribution).
andFun <- function(n){
x1 <- runif(n)
x2 <- runif(n)*(1-x1)
matrix(c(x1, x2, 1-(x1+x2)), ncol=3)
}
## ddzialak's solution (vectorized by me)
ddzFun <- function(n) {
a <- runif(n, 0, 1)
b <- runif(n, 0, 1)
rand1 = pmin(a, b)
rand2 = abs(a - b)
rand3 = 1 - pmax(a, b)
cbind(rand1, rand2, rand3)
}
## Simulate 10k triplets using each of the functions above
JOB <- jobFun(10000)
AND <- andFun(10000)
DDZ <- ddzFun(10000)
## Plot the distributions of values
par(mfcol=c(2,2))
hist(JOB, main="JOB")
hist(AND, main="AND")
hist(DDZ, main="DDZ")
Just draw 2 random numbers from (0, 1); if we call them a and b, then you get:
rand1 = min(a, b)
rand2 = abs(a - b)
rand3 = 1 - max(a, b)
When you want to randomly generate numbers that add to 1 (or some other value) then you should look at the Dirichlet Distribution.
There is an rdirichlet function in the gtools package and running RSiteSearch('Dirichlet') brings up quite a few hits that could easily lead you to tools for doing this (and it is not hard to code by hand either for simple Dirichlet distributions).
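As a rough sketch of coding it by hand: a Dirichlet draw can be obtained by drawing independent gamma variables with the desired shape parameters and dividing by their sum. The function name rdirichlet_simple is an illustrative assumption and is not the gtools implementation.

```r
# Sketch: n draws from a Dirichlet(alpha) distribution, one draw per row.
rdirichlet_simple <- function(n, alpha) {
  k <- length(alpha)
  # Column j holds n gamma draws with shape alpha[j] (matrix fills by column)
  g <- matrix(rgamma(n * k, shape = rep(alpha, each = n)), nrow = n, ncol = k)
  g / rowSums(g)   # normalise each row so it sums to 1
}

set.seed(123)
parts <- rdirichlet_simple(5, alpha = c(1, 1, 1))
rowSums(parts)
```

With alpha = c(1, 1, 1) the draws are uniform over all non-negative triplets that sum to 1; larger alpha values concentrate the triplets around equal shares.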
I guess it depends on what distribution you want on the numbers, but here is one way:
diff(c(0, sort(runif(2)), 1))
Use replicate to get as many sets as you want:
> x <- replicate(5, diff(c(0, sort(runif(2)), 1)))
> x
[,1] [,2] [,3] [,4] [,5]
[1,] 0.66855903 0.01338052 0.3722026 0.4299087 0.67537181
[2,] 0.32130979 0.69666871 0.2670380 0.3359640 0.25860581
[3,] 0.01013117 0.28995078 0.3607594 0.2341273 0.06602238
> colSums(x)
[1] 1 1 1 1 1
I would simply randomly select 3 numbers from uniform distribution and then divide by their sum:
n <- 3
x <- runif(n, 0, 1)
y <- x / sum(x)
sum(y) == 1
n could be any number you like.
This problem and the different solutions proposed intrigued me. I did a little test of the three basic algorithms suggested and what average values they would yield for the numbers generated.
choose_one_and_divide_rest
means: [ 0.49999212 0.24982403 0.25018384]
standard deviations: [ 0.28849948 0.22032758 0.22049302]
time needed to fill array of size 1000000 was 26.874945879 seconds
choose_two_points_and_use_intervals
means: [ 0.33301421 0.33392816 0.33305763]
standard deviations: [ 0.23565652 0.23579615 0.23554689]
time needed to fill array of size 1000000 was 28.8600130081 seconds
choose_three_and_normalize
means: [ 0.33334531 0.33336692 0.33328777]
standard deviations: [ 0.17964206 0.17974085 0.17968462]
time needed to fill array of size 1000000 was 27.4301018715 seconds
The time measurements are to be taken with a grain of salt as they might be more influenced by the Python memory management than by the algorithm itself. I'm too lazy to do it properly with timeit. I did this on 1GHz Atom so that explains why it took so long.
Anyway, choose_one_and_divide_rest is the algorithm suggested by Andrie and the poster of the question him/herself (AND): you choose one value a in [0,1], then one in [a,1] and then you look what you have left. It adds up to one but that's about it, the first division is twice as large as the other two. One might have guessed as much ...
choose_two_points_and_use_intervals is the accepted answer by ddzialak (DDZ). It takes two points in the interval [0,1] and uses the size of the three sub-intervals created by these points as the three numbers. Works like a charm and the means are all 1/3.
choose_three_and_normalize is the solution by Anders Gustafsson and Josh O'Brien (JOB). It just generates three numbers in [0,1] and normalizes them back to a sum of 1. Works just as well and surprisingly a little bit faster in my Python implementation. The variance is a bit lower than for the second solution.
There you have it. No idea to what beta distribution these solutions correspond or which set of parameters in the corresponding paper I referred to in a comment but maybe someone else can figure that out.
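The simplest solution is the probs() function from the wakefield package.
probs(3) will yield a vector of three values with a sum of 1.
Given that, you can use replicate(x, probs(3)), where x is the number of repeats for "over and over" (rep(probs(3), x) would only repeat the same three values).
No drama.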
