Set symmetric specific cuts for bimodal data - r

I have a bimodal asymmetric distribution which I would like to cut at both ends. Specifically, I would like to calculate symmetric boundaries at the appropriate side of each 'bell'. The figure shows an extreme case of separation between bells for simplicity.
In this case the red cuts were selected by eye, and the offset of 1500 used for the blue lines at each side is an arbitrary value that could be passed to a function for the trim. My goal is to subset everything between the blue lines.
hist(p3_cut$x,50)
abline(v=c(6200,7600),col='red')
abline(v=c(6200-1500,7600+1500),col='blue')
My guess is that the problem here is basically to find the 'edges' of each curve. I cannot use half the distance between the means; I need something that recognizes the change in frequency from 0 (or a very low value) to something relatively high.

A somewhat general answer. Depending on the problem you might need to adjust the bandwidth of the density function (via the adjust argument):
# get density of x and normalize so max is one
dens <- density(x,adjust=0.1)
dens$y <- dens$y / max(dens$y)
# keep all x where density is higher than some fraction of max (here 1%)
min_frac <- 0.01
x_keep <- dens$x[dens$y > min_frac]
# find position of gap in x, and get x just before and after gap
gap_pos <- which.max(diff(x_keep))
left_cut <- x_keep[gap_pos]
right_cut <- x_keep[gap_pos + 1]
Using this code and changing the adjust parameter in the density function I was able to calculate almost perfect cuts at least for this case. I am positive that this approach is flexible enough for most situations that are similar to this one. I show the results for the cuts proposed.
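For readers who want to try the approach end to end, here is a minimal sketch on simulated data. The two normal 'bells' are only a stand-in for p3_cut$x (an assumption, since the original data is not shown), and offset is a hypothetical name for the arbitrary trim width of 1500 from the question.

```r
set.seed(1)
# simulated stand-in for p3_cut$x (assumption): two well separated bells
x <- c(rnorm(2000, mean = 5500, sd = 250),
       rnorm(1000, mean = 8300, sd = 250))

# density normalized so its maximum is one, as in the answer
dens <- density(x, adjust = 0.1)
dens$y <- dens$y / max(dens$y)

# keep grid points where the density exceeds 1% of the maximum
min_frac <- 0.01
x_keep <- dens$x[dens$y > min_frac]

# the widest jump between kept grid points is the gap between the bells
gap_pos <- which.max(diff(x_keep))
left_cut <- x_keep[gap_pos]
right_cut <- x_keep[gap_pos + 1]

# symmetric trim: keep everything between the two 'blue lines'
offset <- 1500
x_sub <- x[x >= left_cut - offset & x <= right_cut + offset]

hist(x, 50)
abline(v = c(left_cut, right_cut), col = "red")
abline(v = c(left_cut - offset, right_cut + offset), col = "blue")
```

With real data, adjust and min_frac would probably need tuning until the red lines sit on the inner edges of the two bells.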

Related

How to fix a straight line plot of a logistic map in R?

The logistic map (a map is a function that takes its value at any time step to its value at the next time step) is a model that has its roots in the prediction of animal population sizes. It has become famous, in part, due to special cases of its parameterization that exhibit surprising chaotic behavior. The logistic map equation is
x_{i+1} = r x_i (1 - x_i)
where x_i ∈ [0,1] is the ratio of current population size to maximum possible size at time i, x_{i+1} is the ratio at the next generation and r is the driving rate, representing animal reproduction and death. For r < 3.5 the population eventually reaches a stable size or will oscillate between a set of fixed values. However, if r > 3.5 then the system destabilizes and exhibits chaotic behavior!
That is background or context for the following problem statement:
Generate a set of points S = {r, x} where, for each r ∈ [1.0, 4.1] by increments of 0.001025, there will be a sequence of x_i values for i = 0,...,16. So, for each r value there will be 17 x_i values. Use x_0 = 0.01. Depending on your implementation, you may find the rbind function useful. It may take a few seconds for the code to run since it will generate a lot of points in S. No more than 10 lines of R code.
Admittedly, this is a lab assignment; however, I am not a student in the class. I am learning R, and I am trying to work through the online assignments and come up with a solution myself. I have tried to create the set of points to plot, and based on manual verification of a few points, the set looks accurate.
for(j in c(0:3024)) {
  rm(x)
  x <- 1:17
  x[1] <- 0.01
  r <- 1 + (j * 0.001025)
  for(i in c(1:(17-1))) {
    x[i+1] <- r *x[i] * (1 - x[i])
  }
  if (j==0) {
    binded <- cbind(r,x)
  } else {
    binded <- rbind(binded, cbind(r,x))
  }
}
When I invoke plot(binded, pch='.') RStudio displays the result as a straight line. So I am unsure if I am using plot correctly, or even if I am generating all the points correctly. If I decrease the maximum value of j to something less than 2000, you will see a plot; it is just when the j value iterates up to 3024 that you only plot a straight line.
I believe your code is correct. What happens is that when r exceeds 4, the iterations are wildly unstable and go to -infinity. This large variation in the y value compresses the scale and makes the plot look like a flat line.
Cutting off the tail end of the matrix makes a very interesting plot:
plot(binded[-which(binded[,2]<0),], pch=".")
If you do want to plot the entire matrix, consider manually setting your y-axis limits to [0,1]. This way, the plot won't be stretched down to -1e24.
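A minimal sketch of the y-axis suggestion, under the assumption that a vectorized rewrite of the question's loop is acceptable. It stores the iterates in a matrix with one column per r value (instead of growing binded with rbind) and then plots with ylim fixed to [0, 1]; the resulting points are the same as in the question.

```r
# same grid of r values as in the question
r <- 1 + (0:3024) * 0.001025

# 17 rows (x_0 .. x_16), one column per r; row 1 holds x_0 = 0.01
x <- matrix(0.01, nrow = 17, ncol = length(r))
for (i in 1:16) {
  x[i + 1, ] <- r * x[i, ] * (1 - x[i, ])
}

# long format: each r repeated for its 17 x values (column-major order)
binded <- cbind(r = rep(r, each = 17), x = as.vector(x))

# fixing the y-axis keeps the diverging values for r > 4 out of view
plot(binded, pch = ".", ylim = c(0, 1))
```

Points outside the limits are simply not drawn, so nothing has to be removed from the matrix.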
As an added bonus, here's a version in a different plotting library that has points colored by i.

Calculate total absolute curvature from coordinates in R

Given a set of coordinates corresponding to a closed shape, I want to calculate the total absolute curvature, which requires calculating the curvature for each point, taking the absolute value, and summing them. Simple enough.
I used the answer to this question to calculate the curvature from a matrix of x y coordinates (xymat) and get what I thought would be the total absolute curvature:
sum(abs(predict(smooth.spline(xymat), deriv = 2)$y))
The problem is that total absolute curvature has a minimum value of 2*pi and is exactly that for circles, but this code is evaluating to values less than 2*pi:
library(purrr)
xymat <- map_df(data.frame(degrees=seq(0:360)),
function(theta) data.frame(x = sin(theta), y = cos(theta)))
sum(abs(predict(smooth.spline(xymat), deriv = 2)$y))
This returns 1.311098 instead of the expected value of 6.283185.
If I change the df parameter of smooth.spline to 3 as in the previous answer, the returned value is 3.944053, still shy of 2*pi (the df value smooth.spline calculated for itself was 2.472213).
Is there a better way to calculate curvature? Is smooth.spline parameterized by arc length or will incorporating it (somehow) rescue this calculation?
Okay, a few things before we begin. You're passing degrees (0 to 360) to sin and cos, but R's trig functions expect radians, so the results are incorrect. You can check that this is wrong by taking cos(360) in R, which isn't 1. This is explained in the documentation for the trig functions under Details.
So let's change your function to this
xymat <- map_df(data.frame(degrees=seq(0,2*pi,length=360)),
function(theta) data.frame(x = sin(theta), y = cos(theta)))
If you plot this, this indeed looks like a circle.
Let's actually restrict this to the lower half of the circle. If you put a spline through this without understanding the symmetry and looking at the plot, chances are that you'll get a horizontal line through the circle.
Why? because the spline doesn't know that it's symmetric above and below y = 0. The spline is trying to fit a function that explains the "data", not trace an arc. It splits the difference between two symmetric sets of points around y = 0.
If we restrict the spline to the lower half of the circle, x runs from 1 to -1 and each x has a single y value, like this:
lower.semicircle <- data.frame(predict(smooth.spline(xymat[91:270,], all.knots = T)))
And let's fit a spline through it.
lower.semicircle.pred<-data.frame(predict(smooth.spline(lower.semicircle, all.knots = T)))
Note that I'm not using the deriv argument here. That is for a different problem in the cars example to which you linked. You want total absolute curvature and they are looking at rate of change of curvature.
What we have now is an approximation to a lower semicircle using splines. Now you want the distance between all of the little sequential points like in the integral from the wikipedia page.
Let's calculate all of the little arc distances using a distance matrix. This literally calculates the Euclidean distance from each point to every other point; the distances between consecutive points sit just below the diagonal.
all.pairwise.distances.in.the.spline.approx<-dist(lower.semicircle.pred, diag=F)
dist.matrix<-as.matrix(all.pairwise.distances.in.the.spline.approx)
seq.of.distances.you.want<-dist.matrix[row(dist.matrix) == col(dist.matrix) + 1]
This last object is what you need to sum across.
sum(seq.of.distances.you.want)
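...which evaluates to [1] 3.079 for the lower semicircle, around half of your expected value of 2*pi. (On a unit circle the curvature is 1 everywhere, so the arc length and the total absolute curvature coincide.)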
It's not perfect but splines have problems with edge effects.
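As an alternative that avoids splines altogether, the sketch below treats the closed shape as a polygon and sums the absolute turning angles at its vertices. This is a different technique from the one in the answer, and total_abs_curvature is a hypothetical helper name, not a function from any package. It assumes the vertices are ordered along the outline and that no two consecutive vertices coincide.

```r
# Hypothetical helper: total absolute curvature of a closed polygon,
# computed as the sum of absolute turning angles at its vertices.
total_abs_curvature <- function(x, y) {
  # edge vectors, wrapping from the last vertex back to the first
  dx <- c(x[-1], x[1]) - x
  dy <- c(y[-1], y[1]) - y
  heading <- atan2(dy, dx)
  # change of heading between consecutive edges, wrapped into [-pi, pi)
  turn <- diff(c(heading, heading[1]))
  turn <- (turn + pi) %% (2 * pi) - pi
  sum(abs(turn))
}

# circle in radians; drop the last point because it duplicates the first
theta <- seq(0, 2 * pi, length.out = 361)[-361]
total_abs_curvature(sin(theta), cos(theta))
```

For any convex polygon the turning angles all have the same sign and add up to one full turn, which is why a sampled circle gives 2*pi here. Noisy outlines would need smoothing first, since every small wiggle adds to the sum.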

How to count line segment occurrences by pixel in R?

I am trying to convey the concentration of lines in 2D space by showing the number of crossings through each pixel in a grid. I am picturing something similar to a density plot, but with more intuitive units. I was drawn to the spatstat package and its line segment class (psp) as it allows you to define line segments by their end points and incorporate the entire line in calculations. However, I'm struggling to find the right combination of functions to tally these counts and would appreciate any suggestions.
As shown in the example below with 50 lines, the density function produces values in (0,140), the pixellate function tallies the total length through each pixel and takes values in (0, 0.04), and as.mask produces a binary indicator of whether a line went through each pixel. I'm hoping to see something where the scale takes integer values, say 0..10.
require(spatstat)
set.seed(1234)
numLines = 50
# define line segments
L = psp(runif(numLines),runif(numLines),runif(numLines),runif(numLines), window=owin())
# image with 2-dimensional kernel density estimate
D = density.psp(L, sigma=0.03)
# image with total length of lines through each pixel
P = pixellate.psp(L)
# binary mask giving whether a line went through a pixel
B = as.mask.psp(L)
par(mfrow=c(2,2), mar=c(2,2,2,2))
plot(L, main="L")
plot(D, main="density.psp(L)")
plot(P, main="pixellate.psp(L)")
plot(B, main="as.mask.psp(L)")
The pixellate.psp function allows you to optionally specify weights to use in the calculation. I considered trying to manipulate this to normalize the pixels to take a count of one for each crossing, but the weight is applied uniquely to each line (and not specific to the line/pixel pair). I also considered calculating a binary mask for each line and adding the results, but it seems like there should be an easier way. I know that you can sample points along a line, and then do a count of the points by pixel. However, I am concerned about getting the sampling right so that there is one and only one point per line crossing of a pixel.
Is there a straightforward way to do this in R? Otherwise, would this be an appropriate suggestion for a future package enhancement? Is this more easily accomplished in another language such as Python or MATLAB?
The example above and my testing has been with spatstat 1.40-0, R 3.1.2, on x86_64-w64-mingw32.
You are absolutely right that this is something to put in as a future enhancement. It will be done in one of the next versions of spatstat. It will probably be an option in pixellate.psp to count the number of crossing lines rather than measure the total length.
For now you have to do something a bit convoluted, e.g.:
require(spatstat)
set.seed(1234)
numLines = 50
# define line segments
L <- psp(runif(numLines),runif(numLines),runif(numLines),runif(numLines), window=owin())
# split into individual lines and use as.mask.psp on each
masklist <- lapply(1:nsegments(L), function(i) as.mask.psp(L[i]))
# convert to 0-1 image for easy addition
imlist <- lapply(masklist, as.im.owin, na.replace = 0)
rslt <- Reduce("+", imlist)
# plot
plot(rslt, main = "")

Probability distribution values plot

I have
probability values: 0.06, 0.06, 0.1, 0.08, 0.12, 0.16, 0.14, 0.14, 0.08, 0.02, 0.04, summing to 1
the corresponding interval boundaries; a stochastic variable takes its value in each interval with the corresponding probability from the list above:
126, 162, 233, 304, 375, 446, 517, 588, 659, 730, 801, 839
How can I plot the probability distribution?
I would like the interval boundaries on the x axis and, between them, histogram bars whose height is the probability value.
Thanks.
How about
x <- c(126,162,233,304,375,446,517,588,659,730,801,839)
p <- c(0.06,0.06,0.1,0.08,0.12,0.16,0.14,0.14,0.08,0.02,0.04)
plot(x,c(p,0),type="s")
lines(x,c(0,p),type="S")
rect(x[-1],0,x[-length(x)],p,col="lightblue")
for a quick answer? (With the rect included you might not need the lines call and might be able to change it to plot(x,p,type="n"). As usual I would recommend par(bty="l",las=1) for my preferred graphical defaults ...)
(Explanation: "s" and "S" are two different stair-step types (see Details in ?plot): I used them both to get both the left and right boundaries of the distribution.)
edit: In your comments you say "(it) doesn't look like a histogram". It's not quite clear what you want. I added rectangles in the example above -- maybe that does it? Or you could do
b <- barplot(p,width=diff(x),space=0)
but getting the x-axis labels right is a pain.
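One way the x-axis labels could be handled is sketched below. It relies on the fact that with space = 0 barplot draws the first bar starting at 0, so the bar edges fall at the cumulative widths, that is, at x - x[1]. The names widths and edges are introduced here for illustration only.

```r
x <- c(126, 162, 233, 304, 375, 446, 517, 588, 659, 730, 801, 839)
p <- c(0.06, 0.06, 0.1, 0.08, 0.12, 0.16, 0.14, 0.14, 0.08, 0.02, 0.04)

widths <- diff(x)
# bar edges in barplot coordinates: first bar starts at 0
edges <- cumsum(c(0, widths))

barplot(p, width = widths, space = 0, col = "lightblue",
        ylab = "probability")
# label each bar edge with the original interval boundary
axis(1, at = edges, labels = x, las = 2)
```

Note that the bar heights are the probabilities themselves. Because the first and last intervals are narrower than the rest, the bar areas are not proportional to probability; dividing p by widths would give a density scale instead.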

R question about plotting probability/density histogram the right way

I have a matrix of dimension [500,2], i.e. 500 rows and 2 columns. The left column gives the value of each X observation and the right column gives the probability (density) of that X, so it is a typical probability density relationship.
My question is how to plot the histogram the right way, so that the x-axis is the x value and the y-axis is the density (0.01-1.00). The bandwidth of the estimator is 0.33.
Thanks in advance!
For orientation, the end of the data looks like this:
[490,] 2.338260830 0.04858685
[491,] 2.347839477 0.04797310
[492,] 2.357418125 0.04736149
[493,] 2.366996772 0.04675206
[494,] 2.376575419 0.04614482
[495,] 2.386154067 0.04553980
[496,] 2.395732714 0.04493702
[497,] 2.405311361 0.04433653
[498,] 2.414890008 0.04373835
[499,] 2.424468656 0.04314252
[500,] 2.434047303 0.04254907
@everyone,
Yes, I have done the estimation beforehand. The bandwidth is what I mentioned and the data is ordered from low to high values; accordingly, the probability is 0.22 at the beginning, about 0.48 at the peak and 0.15 at the end.
The density line plots like a charm, but in addition I have to plot a histogram. How can I do this and order the blocks properly (how should the data be split into bins, etc.)?
Any suggestions?
Here is a part of the data AFTER the estimation. All values are discrete, so I assume a histogram can be created, hopefully.
[491,] 4.956164 0.2618131
[492,] 4.963014 0.2608723
[493,] 4.969863 0.2599309
[494,] 4.976712 0.2589889
[495,] 4.983562 0.2580464
[496,] 4.990411 0.2571034
[497,] 4.997260 0.2561599
[498,] 5.004110 0.2552159
[499,] 5.010959 0.2542716
[500,] 5.017808 0.2533268
[501,] 5.024658 0.2523817
Best regards,
I appreciate the fast responses!
Would it do the job to create a histogram over the indexes, grouping them in blocks of, say, 25 or 50 rows, and to compute the average probability for each block (of 25, 50, 100, 150, 200, 250 etc.) as the bar height?
Assuming the rows are in order from lowest to highest value of x, as they appear to be, you can use the default plot command, the only change you need is the type:
plot(your.data, type = 'l')
EDIT:
Ok, I'm not sure this is better than the density plot, but it can be done:
x = dnorm(seq(-1, 1, length = 500))
x.bins = rep(1:50, each = 10)
bars = aggregate(x, by = list(x.bins), FUN = sum)[,2]
barplot(bars)
In your case, replace x with the probabilities from the second column of your matrix.
EDIT2:
On second thought, this only makes sense if your 500 rows represent discrete events. If they are instead points along a continuous distribution function adding them together as I have done is incorrect. Mathematically I don't think you can produce the binned probability for a range using only a few points from within that range.
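If the rows are indeed points along a continuous density, one hedged alternative to summing is to average the density within each bin and use that average as the bar height; multiplying by the bin width then gives an approximate probability for the bin. The sketch below uses a standard normal curve as a stand-in for the 500 x 2 matrix (an assumption), and the choice of 20 bins is arbitrary.

```r
# stand-in for the 500 x 2 matrix from the question (assumption):
# column 1 = x values, column 2 = estimated density at x
xs <- seq(-3, 3, length.out = 500)
M <- cbind(xs, dnorm(xs))

n_bins <- 20
breaks <- seq(min(M[, 1]), max(M[, 1]), length.out = n_bins + 1)
bin <- cut(M[, 1], breaks, include.lowest = TRUE)

# mean density per bin is the bar height on the density scale
height <- tapply(M[, 2], bin, mean)
# approximate probability mass per bin = height * bin width
mass <- height * diff(breaks)

barplot(height, width = diff(breaks), space = 0, names.arg = NA,
        ylab = "density")
sum(mass)  # should be close to the area under the curve on this range
```

This is only an approximation of the integral over each bin, so it works best when every bin contains a reasonable number of grid points.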
Assuming M is the matrix, wouldn't this just be:
plot(x=M[ , 1], y = M[ , 2] )
You have already done the density estimation since this is not the original data.
