Sample a number from 0 to 1 with uneven probabilities - r

I would like to sample a random number between 0 and 1, with a 90% probability of sampling from 0-0.3 and a 10% probability of sampling from 0.3-1.
I tried the following:
0.9*runif(1, 0, 0.3) + 0.1*runif(1, 0.3, 1)
But that's not quite it: I will never get the number 0.8, for example.
Is there a simple way to do it in Base R?

sample(c(runif(1,0,0.3),runif(1,0.3,1)),1,prob=c(0.9,0.1))

Usually in R you want to do things in a vectorized way. So don't draw one number at a time; draw all of them in one call (much faster). Here you can use sample to draw the upper endpoint and runif to do the actual draw. Like this:
nsamples<-100000
res<-runif(nsamples,0,sample(c(0.3,1),nsamples,TRUE,prob=c(90,10)))
#just to check the result
hist(res)
#this should be around 0.93 (=0.9+0.1*0.3) if correct
mean(res<0.3)

You can write a small function to do the job whenever you need it.
runif_probs <- function(n, p = 0.9, cutpoint = 0.3){
  ifelse(runif(n) <= p, runif(n, 0, cutpoint), runif(n, cutpoint, 1))
}
set.seed(8862)
which(runif_probs(100) > 0.8)
#[1] 38 62
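As a quick sanity check (just a sketch reusing the runif_probs function above), the share of draws falling below the cutpoint should be close to p:
set.seed(1)
res <- runif_probs(1e5)
mean(res <= 0.3)  # should be close to 0.9
hist(res)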


How do I make an xspline curve symmetric in R?

I'm trying to draw a shape using the xspline function in R.
Using a set of control points, I can get the shape but it is asymmetric even though the points and shape values are all symmetric.
How do I draw this shape symmetrically?
This draws the approximate shape but the lines show how it is asymmetric.
curve <- data.frame(x=c(-0.1,-0.1,-0.1,-0.1,0.1,0.1,0.1,0.1,0.1,0.1,-0.1,-0.1,-0.1),y=c(0.1,0.1,0.1,0.3,0.3,0.1,0.1,-0.1,-0.1,-0.3,-0.3,-0.1,-0.1))
plot(curve)
xspline(curve,shape=1,open=F)
lines(x=c(-0.15,0.15),y=c(0.15,0.15),col="red")
lines(x=c(-0.15,0.15),y=c(-0.15,-0.15),col="red")
I have tried changing the shape values for each node but with no success.
Your question is actually two questions in one:
Is the curve (as a mathematical object) symmetric with respect to the x-axis?
Does it seem so in the picture?
Answer 2
Even if Answer 1 were "Yes" (which I doubt, see below), I think the answer is "No." Judging from the documentation, what xspline does is evaluate the curve at many points and then plot a polyline connecting them. You can check this yourself: with draw set to F, the following should give you two arrays, one of x-values and one of y-values.
curve <- data.frame(x=c(-0.1,-0.1,-0.1,-0.1,0.1,0.1,0.1,0.1,0.1,0.1,-0.1,-0.1,-0.1),y=c(0.1,0.1,0.1,0.3,0.3,0.1,0.1,-0.1,-0.1,-0.3,-0.3,-0.1,-0.1))
plot(curve)
pts=xspline(curve,shape=1,open=F,draw=F)
pts
I don't think there is any way of controlling the number or density of the evaluation points. So even if your curve (as a mathematical object) is symmetric, its polyline rendering is not necessarily symmetric.
This alone might explain the small differences from #Mike's comment.
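If you want to put a number on this, here is a rough sketch reusing the pts object from the snippet above: for each evaluated point, measure the distance to the nearest point of the polyline mirrored by x -> -x (the symmetry examined in Answer 1 below). Exactly mirror-symmetric vertices would give (numerically) zero distances; note this compares point sets, not the true curve-to-curve distance.
d <- sapply(seq_along(pts$x), function(i)
  min(sqrt((pts$x + pts$x[i])^2 + (pts$y - pts$y[i])^2)))
max(d)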
Answer 1
We don't know exactly how R enforces the curve being closed. Based on the documentation,
For open X-splines, the start and end control points must have a shape of 0 (and non-zero values are silently converted to zero).
I suppose that it adds another control point at the very end, makes it equal to the first one, and sets the shape of both of them to zero. But this is different from what your control points on the right-hand side of the picture look like! Your control point (0.1, 0.1) is repeated twice (not three times, as (-0.1, 0.1) is) and its shape is 1, not 0 (caveat: with the control point repeated three times, this may not have any influence; we would have to check the paper linked from the documentation).
I have adapted this and plotted the curve and its mirrored version so that we see the difference.
curve <- data.frame(
x=c(-0.1,-0.1,-0.1,-0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1,-0.1,-0.1,-0.1),
y=c( 0.1, 0.1, 0.1, 0.3, 0.3, 0.1, 0.1, 0.1,-0.1,-0.1,-0.3,-0.3,-0.1,-0.1))
xswap <- data.frame(
x=c( 0.1, 0.1, 0.1, 0.1,-0.1,-0.1,-0.1,-0.1,-0.1,-0.1,-0.1, 0.1, 0.1, 0.1),
y=c( 0.1, 0.1, 0.1, 0.3, 0.3, 0.1, 0.1, 0.1,-0.1,-0.1,-0.3,-0.3,-0.1,-0.1))
plot(curve)
xspline(curve,shape=c(0,1,1,1,1,1,1,0,1,1,1,1,1,1),open=F)
xspline(xswap,shape=c(0,1,1,1,1,1,1,0,1,1,1,1,1,1),open=F)
lines(x=c(-0.15,0.15),y=c(0.15,0.15),col="red")
lines(x=c(-0.15,0.15),y=c(-0.15,-0.15),col="red")
To me they seem pretty much overlapping, especially taking the effects from Answer 2 into account.
General recommendations
Unless absolutely necessary, do not repeat control points. This tends to have funny effects on the underlying parametrization. More often than not (in my opinion), repeated control points come from people confusing control points with knots.
When you want this kind of mirror symmetry, put the first (and last) control point on the axis of symmetry. Then you don't have to worry about finding the corresponding control point whose shape you have to set to 0.
For example,
curve <- data.frame(
x=c(0, 1, 1, 0, -1, -1),
y=c(1, 1, -1, -1, -1, 1))
curve_x_swapped <- data.frame(
x=c(0, -1, -1, 0, 1, 1),
y=c(1, 1, -1, -1, -1, 1))
plot(curve)
xspline(curve, shape=1,open=F)
xspline(curve_x_swapped,shape=1,open=F,border="red")

Why opencl spec subtracts 0.5 for CLK_FILTER_LINEAR

While reading the OpenCL 1.1 spec on CLK_FILTER_LINEAR (section 8.2, p. 258), I learned that, for calculating the weights of the bilinear filter, 0.5 is subtracted as shown below:
i0 = address_mode((int)floor(u - 0.5))
j0 = address_mode((int)floor(v - 0.5))
i1 = address_mode((int)floor(u - 0.5) + 1)
j1 = address_mode((int)floor(v - 0.5) + 1)
For CLK_FILTER_NEAREST, on the other hand, it directly floors u and v, as below:
i = address_mode((int)floor(u))
j = address_mode((int)floor(v))
So there seems to be a discrepancy. When I provide the unnormalized coordinates (5,4), the NEAREST filter will read pixel (5,4), while the LINEAR filter will produce an average of pixels (4,3), (5,3), (4,4) and (5,4). But even for the LINEAR filter I would expect it to read from (5,4), because the weights would be 1, 0, 0, 0.
opencl1.1_spec
Can anyone please clarify the spec intention?
It's true. If you want to read a non-interpolated pixel, you'll need to add (0.5,0.5) to the coordinate. "Round" numbers (ending in .0) sit between the pixels and will be equally blended.
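To make the arithmetic concrete (assuming the spec's usual bilinear weights a = frac(u - 0.5) and b = frac(v - 0.5)): for u = 5.0, i0 = floor(5.0 - 0.5) = 4, i1 = 5 and a = 0.5, so columns 4 and 5 are blended equally. For u = 5.5, i0 = floor(5.5 - 0.5) = 5, i1 = 6 and a = 0.0, so only column 5 contributes. The same holds for v, which is why adding (0.5, 0.5) reads the texel at (5, 4) un-interpolated.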

Why does R density function return nonzero values outside the interval [from, to]?

I have looked into the R code of the density function, and I noticed the following strange lines:
lo <- from - 4 * bw
up <- to + 4 * bw
To my understanding, they mean that the density is estimated on the interval [from - 4*bw, to + 4*bw] instead of [from, to].
To really understand what is going on, I have created a densityfun function, which copy-pastes the code of density except for the end, so that it returns a function (you can find the R code on GitHub here). Then I get the following:
set.seed(1)
x <- rbeta(10000, 0.5, 0.5)
f <- densityfun(x, from = 0, to = 1)
f(-0.01) # 1.135904
Surprise: f(-0.01) is nonzero!
It also implies that the integral of f on [0, 1] is not 1:
integrate(f, 0, 1) # 0.8787954
integrate(f, -0.1, 1.1) # 0.997002
So why is the density function written that way (is it a bug?), and what could I do to avoid this behaviour (to have f(-0.01) = 0 in this example) without losing any mass of f (to have integrate(f, 0, 1) approximately equal to 1 in this example)?
Thanks!
EDIT: I have changed the values used in the example a bit.
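One simple workaround (only a sketch building on the f defined above; it sidesteps rather than explains the design of density) is to truncate f to [0, 1] and renormalize it by its mass on that interval, which just rescales the estimate on [0, 1] by a constant factor:
mass01 <- integrate(f, 0, 1)$value
f_trunc <- function(t) ifelse(t >= 0 & t <= 1, f(t) / mass01, 0)
f_trunc(-0.01)            # 0
integrate(f_trunc, 0, 1)  # approximately 1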

Compare two user defined curves and score their similarity

I have a set of 2 curves (each with a few hundred to a couple of thousand data points) that I want to compare and get some similarity "score" for. Actually, I have >100 of those sets to compare... I am familiar with R (or at least Bioconductor) and would like to use it.
I tried the ccf() function but I'm not too happy with it.
For example, if I compare c1 to the following curves:
c1 <- c(0, 0.8, 0.9, 0.9, 0.5, 0.1, 0.5)
c1b <- c(0, 0.8, 0.9, 0.9, 0.5, 0.1, 0.5) # perfect match! ideally score of 1
c1c <- c(1, 0.2, 0.1, 0.1, 0.5, 0.9, 0.5) # total opposite, ideally score of -1? (what would 0 be though?)
c2 <- c(0, 0.9, 0.9, 0.9, 0, 0.3, 0.3, 0.9) #pretty good, score of ???
Note that the vectors don't have the same length, and they need to be normalized somehow... Any ideas?
If you look at those 2 lines, they are fairly similar, and I think that, as a first step, measuring the area under the 2 curves and subtracting would do. I looked at the post "Shaded area under 2 curves in R" but that is not quite what I need.
A second issue (optional) is that for lines that have the same profile but different amplitudes, I would like to score them as very similar, even though the area between them would be large:
c1 <- c(0, 0.8, 0.9, 0.9, 0.5, 0.1, 0.5)
c4 <- c(0, 0.6, 0.7, 0.7, 0.3, 0.1, 0.3) # very good, score of ??
I hope that a biologist trying to formulate a problem for programmers is OK...
I'd be happy to provide some real life examples if needed.
Thanks in advance!
They don't form curves in the usual meaning of paired x-y values unless they are of equal length. The first three are of equal length, and after packaging them in a matrix, the rcorr function in the Hmisc package returns:
> rcorr(as.matrix(dfrm))[[1]]
     c1 c1b c1c
c1    1   1  -1
c1b   1   1  -1
c1c  -1  -1   1   # as desired if you scaled them to 0-1
The correlation of the c1 and c4 vectors:
> cor( c(0, 0.8, 0.9, 0.9, 0.5, 0.1, 0.5),
c(0, 0.6, 0.7, 0.7, 0.3, 0.1, 0.3) )
[1] 0.9874975
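If the vectors differ in length (as c1 and c2 in the question do), one simple approach (only a sketch; score is a hypothetical helper name, and linear interpolation onto a common grid is an assumed normalization) is to resample both curves with approx before correlating:
# resample both curves onto a common grid of n points, then correlate
score <- function(a, b, n = 100) {
  grid <- seq(0, 1, length.out = n)
  ga <- approx(seq(0, 1, length.out = length(a)), a, xout = grid)$y
  gb <- approx(seq(0, 1, length.out = length(b)), b, xout = grid)$y
  cor(ga, gb)  # 1 = same shape, -1 = opposite
}
score(c1, c2)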
I do not have a very good answer, but I did face a similar question in the past, probably on more than one occasion. My approach is to ask myself what makes my curves similar when I evaluate them subjectively (the scientific term here is "eye-balling" :). Is it the area under the curve? Do I count linear translation, rotation, or scaling (zoom) of my curves as contributing to dissimilarity? If not, I take out all the factors that I do not care about by a suitable normalization (e.g. scale the curves to cover the same ranges in x and y).
I am confident that there is a rigorous mathematical theory for this topic; I would search for the words "affinity" and "affine". That said, my primitive/naive methods usually sufficed for the work I was doing.
You may want to ask this question on some math forum.
If the proteins you compare are reasonably close orthologs, you should be able to obtain alignments, either for each pair you want to score the similarity of, or a multiple alignment for the entire bunch. Depending on the application, I think the latter will be more rigorous. I would then extract the folding score of only those amino acids that are aligned, so that all profiles have the same length, and calculate correlation measures or squared normalized dot-products of the profiles as a similarity measure. The squared normalized dot product or the Spearman rank correlation will be less sensitive to amplitude differences, which you seem to want. That will make sure you are comparing elements which are reasonably paired (to the extent the alignment is reasonable), and will let you answer questions like: "Are corresponding residues in the compared proteins generally folded to a similar extent?".
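As a minimal sketch of the amplitude-insensitive measures mentioned above (using the question's c1 and c4 as stand-ins for two equal-length, aligned profiles; sq_norm_dot is a hypothetical helper name):
# squared normalized dot product: equals 1 when one profile is a positive
# multiple of the other, so it ignores overall amplitude
sq_norm_dot <- function(a, b) sum(a * b)^2 / (sum(a^2) * sum(b^2))
sq_norm_dot(c1, c4)               # close to 1
cor(c1, c4, method = "spearman")  # rank correlation, also amplitude-insensitive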

Looks like a simple graphing problem

At present I have a control to which I need to add the facility to apply varying degrees of acuteness (or sensitivity). The problem is best illustrated with an image:
Graph http://img87.imageshack.us/img87/7886/control.png
As you can see, I have X and Y axes that both have arbitrary limits of 100 - that should suffice for this explanation. At present, my control uses the red line (linear behaviour), but I would like to add the ability to use the other 3 curves (or more), i.e. if a control is more sensitive then a setting will ignore the linear behaviour and use one of the other lines. The starting point will always be 0, and the end point will always be 100.
I know that an exponential is too steep, but can't seem to figure a way forward. Any suggestions please?
The curves you have illustrated look a lot like gamma correction curves. The idea there is that the minimum and maximum of the range stay the same as the input, but the middle is bent like in your graphs (which, I might note, is not the circular arc you would get from the cosine implementation).
Graphically, it looks like this:
[gamma correction curves (source: wikimedia.org)]
So, with that as the inspiration, here's the math...
If your x values ranged from 0 to 1, the function is rather simple:
y = f(x, gamma) = x ^ gamma
Add an xmax value for scaling (i.e. x = 0 to 100), and the function becomes:
y = f(x, gamma) = ((x / xmax) ^ gamma) * xmax
or alternatively:
y = f(x, gamma) = (x ^ gamma) / (xmax ^ (gamma - 1))
You can take this a step further if you want to add a non-zero xmin.
When gamma is 1, the line is always perfectly linear (y = x). If gamma is less than 1, your curve bends upward. If gamma is greater than 1, your curve bends downward. The reciprocal of gamma converts the value back to the original (if y = f(x, gamma), then x = f(y, 1/gamma) = f(f(x, gamma), 1/gamma)).
Just adjust the value of gamma according to your own taste and application needs. Since you want to give the user multiple options for "sensitivity enhancement", you may want to give your users choices on a linear scale, say ranging from -4 (least sensitive) to 0 (no change) to 4 (most sensitive), and scale your internal gamma values with a power function. In other words, give the user choices of (-4, -3, -2, -1, 0, 1, 2, 3, 4), but translate that to gamma values of (5.06, 3.38, 2.25, 1.50, 1.00, 0.67, 0.44, 0.30, 0.20).
Coding that in C# might look something like this:
public class SensitivityAdjuster {
    public SensitivityAdjuster() { }

    public SensitivityAdjuster(int level) {
        SetSensitivityLevel(level);
    }

    private double _Gamma = 1.0;

    public void SetSensitivityLevel(int level) {
        _Gamma = Math.Pow(1.5, level);
    }

    public double Adjust(double x) {
        return Math.Pow(x / 100, _Gamma) * 100;
    }
}
To use it, create a new SensitivityAdjuster, set the sensitivity level according to user preferences (either using the constructor or the method; -4 to 4 would probably be reasonable level values) and call Adjust(x) to get the adjusted output value. If you wanted a wider or narrower range of reasonable levels, you would reduce or increase that 1.5 value in the SetSensitivityLevel method. And of course the 100 represents your maximum x value.
I propose a simple formula that (I believe) captures your requirement. In order to have a full "quarter circle", which is your extreme case, you would use (1-cos((x*pi)/(2*100)))*100.
What I suggest is that you take a weighted average between y=x and y=(1-cos((x*pi)/(2*100)))*100. For example, to have very close to linear (99% linear), take:
y = 0.99*x + 0.01*[(1-cos((x*pi)/(2*100)))*100]
Or more generally, say the level of linearity is L, and it's in the interval [0, 1], your formula will be:
y = L*x + (1-L)*[(1-cos((x*pi)/(2*100)))*100]
EDIT: I changed cos(x/100) to cos((x*pi)/(2*100)), because for the cosine to go from 1 down to 0 its argument should be in the range [0, pi/2] and not [0, 1]; sorry for the initial mistake.
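A minimal R sketch of this blend (bend is a hypothetical helper name; xmax = 100 matches the axis limits in the question):
# weighted average of the linear response and the bent cosine response;
# L = 1 gives the straight line, L = 0 gives the full cosine curve
bend <- function(x, L, xmax = 100) {
  L * x + (1 - L) * (1 - cos((x * pi) / (2 * xmax))) * xmax
}
curve(bend(x, L = 0.5), from = 0, to = 100)
abline(0, 1, col = "red")  # linear reference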
You're probably looking for something like polynomial interpolation. A quadratic/cubic/quartic interpolation ought to give you the sorts of curves you show in the question. The differences between the three curves you show could probably be achieved just by adjusting the coefficients (which indirectly determine steepness).
The graph of y = x^p for x from 0 to 1 will do what you want as you vary p from 1 (which will give the red line) upwards. As p increases the curve will be 'pushed in' more and more. p doesn't have to be an integer.
(You'll have to scale to get 0 to 100 but I'm sure you can work that out)
I vote for Rax Olgud's general idea, with one modification:
y = alpha * x + (1-alpha)*(f(x/100)*100)
where f(0) = 0, f(1) = 1, f(x) is superlinear, but I don't know where this "quarter circle" idea came from or why 1-cos(x) would be a good choice.
I'd suggest f(x) = x^k where k = 2, 3, 4, 5, whatever gives you the desired degree of steepness for alpha = 0. Pick a fixed value for k, then vary alpha to choose your particular curve.
For problems like this, I will often take a few points from the curve and run them through a curve-fitting program. There are a bunch of them out there. Here's one with a 7-day free trial.
I've learned a lot by trying different models. Often you can get a pretty simple expression to come close to your curve.
