R density plot y axis larger than 1

I want a density plot, and here is the code:
d = as.matrix(read.csv(file = '1.csv'))
plot(density(d))
My data is a list of numbers. What I don't understand is why the y-axis values are larger than 1.
I think there is something wrong. I searched the internet but couldn't find any resource on this. Can you help me?
here is the data:
link:http://pan.baidu.com/s/1hsE8Ony password:7a4z

There is nothing wrong with the density being greater than 1 at some points. The area under the curve must be 1, but at specific points the density can be greater than 1. For example,
dnorm(0, 0, 0.1)
[1] 3.989423
See this Cross Validated post
Edit:
I think the dnorm example above could be expanded on a little.
For a Gaussian distribution, with mean μ and standard deviation σ approximately 99.73% of the area under the density curve lies between
μ-3σ and μ+3σ. The example above used μ=0 and σ=0.1, so the area under the curve between -0.3 and 0.3 should be 0.9973. The width of that interval is 0.6. Compare this with a rectangle of equal area (0.9973) and the same base (0.6).
If the area of the rectangle is 0.9973 and the base is 0.6, the height must be 0.9973/0.6 = 1.6621, i.e. the average height of the curve must be 1.6621. Clearly there must be some points with height greater than 1.
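You can check this arithmetic directly in R (a small verification of mine, not part of the original answer, using only base R functions):
pnorm(0.3, mean = 0, sd = 0.1) - pnorm(-0.3, mean = 0, sd = 0.1)   # area within 3 standard deviations
[1] 0.9973002
0.9973002 / 0.6   # average height of the density over that interval
[1] 1.662167
integrate(function(x) dnorm(x, mean = 0, sd = 0.1), -Inf, Inf)$value   # total area is still 1
[1] 1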

Related

How to define transformation of curve using single parameter?

Let's assume I have a curve defined by 4 points, and I have 2 'states' of the curve, like in this picture:
I want to control the deformation of the curve with a single parameter in the range [0, 1], where 0 corresponds to the upper curve and 1 corresponds to the lower curve; intermediate values like 0.5 should represent some intermediate transformation from the upper curve to the lower curve. How can this be done?
Do you know how to parametrize the motion of one point?
Suppose you have a point that can move on a vertical line, its position varying between two extremes, y0 and y1.
Now assign a parameter t, which varies from 0 to 1, so that y(t=0) = y0 and y(t=1) = y1.
Now make y a linear function of t: y(t) = y0 + t(y1 - y0).
Now look at your curves. The only motion of the points to get from one state to the other appears to be vertical. So each of the four points moves like an example of the y(t) above, but with different values of x, y0 and y1. (From your drawing, it looks as if the two end points are stationary and the two middle points move the same way, but that's just a special case.)
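To make this concrete, here is a small R sketch of the interpolation (the point coordinates are invented for illustration, not taken from your picture):
x  <- c(0, 1, 2, 3)     # x coordinates of the 4 points (fixed)
y0 <- c(0, 1, 1, 0)     # y at t = 0 (upper curve), hypothetical values
y1 <- c(0, -1, -1, 0)   # y at t = 1 (lower curve), hypothetical values

curve.at <- function(t) data.frame(x = x, y = y0 + t * (y1 - y0))

curve.at(0)     # reproduces the upper curve
curve.at(1)     # reproduces the lower curve
curve.at(0.5)   # an intermediate state halfway between the two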

Calculate total absolute curvature from coordinates in R

Given a set of coordinates corresponding to a closed shape, I want to calculate the total absolute curvature, which requires calculating the curvature for each point, taking the absolute value, and summing them. Simple enough.
I used the answer to this question to calculate the curvature from a matrix of x y coordinates (xymat) and get what I thought would be the total absolute curvature:
sum(abs(predict(smooth.spline(xymat), deriv = 2)$y))
The problem is that total absolute curvature has a minimum value of 2*pi and is exactly that for circles, but this code is evaluating to values less than 2*pi:
library(purrr)
xymat <- map_df(data.frame(degrees=seq(0:360)),
function(theta) data.frame(x = sin(theta), y = cos(theta)))
sum(abs(predict(smooth.spline(xymat), deriv = 2)$y))
This returns 1.311098 instead of the expected value of 6.283185.
If I change the df parameter of smooth.spline to 3 as in the previous answer, the returned value is 3.944053, still shy of 2*pi (the df value smooth.spline calculated for itself was 2.472213).
Is there a better way to calculate curvature? Is smooth.spline parameterized by arc length or will incorporating it (somehow) rescue this calculation?
Okay, a few things before we begin. You're using degrees in your seq, which will give you incorrect results (0 to 360 degrees). You can check that this is wrong by taking cos(360) in R, which isn't 1. This is explained in the documentation for the trig functions under Details.
So let's change your function to this
xymat <- map_df(data.frame(degrees=seq(0,2*pi,length=360)),
function(theta) data.frame(x = sin(theta), y = cos(theta)))
If you plot this, this indeed looks like a circle.
Let's restrict this to the lower half of the circle. If you put a spline through the full circle without thinking about the symmetry or looking at the plot, chances are you'll get a horizontal line through the circle.
Why? Because the spline doesn't know that the data are symmetric above and below y = 0. The spline is trying to fit a function that explains the "data", not trace an arc, so it splits the difference between the two symmetric sets of points around y = 0.
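A quick way to see this (a sketch of mine, not part of the original answer; it only uses the corrected xymat from above):
plot(xymat, asp = 1, pch = 20)          # the corrected points trace the unit circle
full.fit <- predict(smooth.spline(xymat))
lines(full.fit, col = "red", lwd = 2)   # the fit sits near y = 0 instead of tracing the arc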
If we restrict the spline to the lower half of the circle (rows 91 to 270, where x runs from 1 down to -1), we can do it like this:
lower.semicircle <- data.frame(predict(smooth.spline(xymat[91:270,], all.knots = T)))
And let's fit a spline through it.
lower.semicircle.pred<-data.frame(predict(smooth.spline(lower.semicircle, all.knots = T)))
Note that I'm not using the deriv function here. That is for a different problem in the cars example to which you linked. You want total absolute curvature and they are looking at rate of change of curvature.
What we have now is an approximation to the lower semicircle using splines. Now you want the distances between all of the little sequential points, as in the integral from the Wikipedia page.
Let's calculate all of the little arc distances using a distance matrix. This literally calculates the Euclidean distance between each point and every other point.
all.pairwise.distances.in.the.spline.approx<-dist(lower.semicircle.pred, diag=F)
dist.matrix<-as.matrix(all.pairwise.distances.in.the.spline.approx)
seq.of.distances.you.want<-dist.matrix[row(dist.matrix) == col(dist.matrix) + 1]
This last object is what you need to sum across.
sum(seq.of.distances.you.want)
...which evaluates to [1] 3.079 for the lower semicircle, around half of your 2*pi expected value.
It's not perfect but splines have problems with edge effects.

Find scatterplot area where ~50% of points have one of 2 values

I have a data frame that has 3 values for each point in the form: (x, y, boolean). I'd like to find an area bounded by values of (x, y) where roughly half the points in the area are TRUE and half are FALSE.
I can scatterplot the data and color according to the 3rd value of each point, and I get a general idea, but I was wondering if there would be a better way. I understand that if you take a small enough area where there are only 2 points and one is TRUE and the other is FALSE then you have 50/50, so I was thinking there has to be a better way of deciding what size area to look for.
Visually I see this as drawing a square on the scatter plot and moving it around the x and y axes, each time checking the number of TRUE and FALSE points in the area, but is there a way to determine a good size for the area based on the values?
Thanks
EDIT: G5W's answer is a step in the right direction, but based on their scatterplot, I'm looking to create a square / rectangle area in which roughly half the points are green and half are red. I understand that there are potentially an infinite number of such areas, but I'm thinking there might be a good way to determine an optimal size for the area (maybe it should contain at least a certain percentage of the points or something).
Note update below
You do not provide any sample data, so I have created some bogus data like this:
TestData = data.frame(x = c(rnorm(100, -1, 1), rnorm(100, 1,1)),
y = c(rnorm(100, -1, 1), rnorm(100, 1,1)),
z = rep(c(TRUE,FALSE), each=100))
I think that what you want is how much area is taken up by each of the TRUE and FALSE points. A way to interpret that task is to find the convex hull for each group and take its area. That is, find the minimum convex polygon that contains a group. The function chull will compute the convex hull of a set of points.
plot(TestData[,1:2], pch=20, col=as.numeric(TestData$z)+2)
CH1 = chull(TestData[TestData$z,1:2])
CH2 = chull(TestData[!TestData$z,1:2])
polygon(TestData[which(TestData$z)[CH1],1:2], lty=2, col="#00FF0011")
polygon(TestData[which(!TestData$z)[CH2],1:2], lty=2, col="#FF000011")
Once you have the polygons, the polyarea function from the pracma package will compute the area. Note that it computes a "signed" area so you either need to be careful about which direction you traverse the polygon or take the absolute value of the area.
library(pracma)
abs(polyarea(TestData[which(TestData$z)[CH1],1],
TestData[which(TestData$z)[CH1],2]))
[1] 16.48692
abs(polyarea(TestData[which(!TestData$z)[CH2],1],
TestData[which(!TestData$z)[CH2],2]))
[1] 15.17897
Update
This is a completely different answer based on the updated question. I am leaving the old answer because the question now refers to it.
The question now gives a little more information about the data ("There are about twice as many FALSE than TRUE") so I have made an updated bogus data set to reflect that.
set.seed(2017)
TestData = data.frame(x = c(rnorm(100, -1, 1), rnorm(200, 1, 1)),
y = c(rnorm(100, 1, 1), rnorm(200, -1,1)),
z = rep(c(TRUE,FALSE), c(100,200)))
The problem is now to find regions where the density of TRUE and FALSE are approximately equal. The question asked for a rectangular region, but at least for this data, that will be difficult. We can get a good visualization to see why.
We can use the function kde2d from the MASS package to get the 2-dimensional density of the TRUE points and the FALSE points. If we take the difference of these two densities, we need only find the regions where the difference is near zero. Once we have this difference in density, we can visualize it with a contour plot.
library(MASS)
Grid1 = kde2d(TestData$x[TestData$z], TestData$y[TestData$z],
lims = c(c(-3,3), c(-3,3)))
Grid2 = kde2d(TestData$x[!TestData$z], TestData$y[!TestData$z],
lims = c(c(-3,3), c(-3,3)))
GridDiff = Grid1
GridDiff$z = Grid1$z - Grid2$z
filled.contour(GridDiff, color = terrain.colors)
In the plot it is easy to see the place where there are far more TRUE than FALSE near (-1,1) and where there are more FALSE than TRUE near (1,-1). We can also see that the places where the difference in density is near zero lie in a narrow band in the general area of the line y=x. You might be able to get a box where a region with more TRUEs is balanced by a region with more FALSEs, but the regions where the density is the same are small.
Of course, this is for my bogus data set which probably bears little relation to your real data. You could perform the same sort of analysis on your data and maybe you will be luckier with a bigger region of near equal densities.
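If you want to pull those near-balanced cells out programmatically, one possibility (a sketch of mine, not part of the original answer; the tolerance is an arbitrary choice) is:
tol <- 0.01                                    # arbitrary tolerance on the density difference
near.equal <- abs(GridDiff$z) < tol            # logical matrix over the kde2d grid
balanced.cells <- expand.grid(x = GridDiff$x, y = GridDiff$y)[as.vector(near.equal), ]

plot(TestData$x, TestData$y, col = ifelse(TestData$z, "green3", "red"), pch = 20)
points(balanced.cells, pch = 3, col = "blue")  # grid cells where the two densities are nearly equal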

How do I interpret the output of corrplot?

The corrplot package provides some neat plots and documentation with examples.
But I don't understand the output. I can see that if you have a matrix A_ij, you can plot it as an arrangement of n by n square tiles, where the color of tile ij corresponds to the value of A_ij. But some examples appear to have more dimensions:
Here we can guess that the color shows the correlation coefficient and that the orientation of the ellipse indicates negative/positive correlation. But what does the eccentricity show?
The documentation for method says:
the visualization method of correlation matrix to be used. Currently, it supports seven methods, named "circle" (default), "square", "ellipse", "number", "pie", "shade" and "color". See examples for details.
The areas of circles or squares show the absolute value of corresponding correlation coefficients. Method "pie" and "shade" came from Michael Friendly’s job (with some adjustment about the shade added on), and "ellipse" came from D.J. Murdoch and E.D. Chow’s job, see in section References.
So we know that the area, for circles and squares, should show the coefficient. What about the other dimensions, and other methods?
There is only one dimension shown by the plot.
Michael Friendly, in Corrgrams: Exploratory displays for correlation matrices (the corrplot documentation confusingly refers to this as his "job"), says:
In the shaded row, each cell is shaded blue or red depending on the sign of the correlation, and with the intensity of color scaled 0–100% in proportion to the magnitude of the correlation. (Such scaled colors are easily computed using RGB coding from red, (1, 0, 0), through white (1, 1, 1), to blue (0, 0, 1). For simplicity, we ignore the non-linearities of color reproduction and perception, but note that these are easily accommodated in the color mapping function.) White diagonal lines are added so that the direction of the correlation may still be discerned in black and white. This bipolar scale of color was chosen to leave correlations near 0 empty (white), and to make positive and negative values of equal magnitude approximately equally intensely shaded. Gray scale and other color schemes are implemented in our software (Section 6), but not illustrated here.
The bar and circular symbols also use the same scaled colors, but fill an area proportional to the absolute value of the correlation. For the bars, negative values are filled from the bottom, positive values from the top. The circles are filled clockwise for positive values, anti-clockwise for negative values. The ellipses have their eccentricity parametrically scaled to the correlation value (Murdoch and Chow, 1996). Perceptually, they have the property of becoming visually less prominent as the magnitude of the correlation increases, in contrast to the other glyphs.
(emphasis mine)
"Murdoch and Chow, 1996" is a publication describing the equation for drawing the ellipses (A Graphical Display of Large Correlation Matrices). The ellipses are apparently meant to be caricatures of bivariate normal distributions:
So in conclusion, the only dimension shown is always the correlation coefficient (or the value of A_ij, to use the question's terminology) itself. The multiple apparent dimensions are redundant.
I think the plot is quite self-explanatory. On the right-hand side you have the scale, which is colored from red (negative correlation) to blue (positive correlation). The color follows a gradient according to the strength of the correlation.
If the ellipse leans towards the right, it is again positive correlation and if it leans to the left, it is negative correlation.
The diffusion around a line (which denotes perfect correlation, for example mpg ~ mpg) creates an ellipse. You will see a more diffuse ellipse for lower strengths of correlation; this is typically how a weakly correlated relationship looks in a scatterplot. I think these are caricatures, however.
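If you want to reproduce a plot like this yourself, a minimal example (mine, not from the answer; it assumes the mtcars data set that the mpg ~ mpg remark suggests) is:
library(corrplot)
corrplot(cor(mtcars), method = "ellipse")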
Here is some code from the corrplot function responsible for drawing ellipses. I am not going to attempt to explain this (because it is a part of a larger system). I wanted to show that the logic is all there if you'd like to deep dive into it:
if (method == "ellipse" & plotCI == "n") {
    ell.dat <- function(rho, length = 99) {
        k <- seq(0, 2 * pi, length = length)
        x <- cos(k + acos(rho)/2)/2
        y <- cos(k - acos(rho)/2)/2
        return(cbind(rbind(x, y), c(NA, NA)))
    }
    ELL.dat <- lapply(DAT, ell.dat)
    ELL.dat2 <- 0.85 * matrix(unlist(ELL.dat), ncol = 2, byrow = TRUE)
    ELL.dat2 <- ELL.dat2 + Pos[rep(1:length(DAT), each = 100), ]
    polygon(ELL.dat2, border = col.border, col = col.fill)
}
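To see what one of these glyphs looks like on its own, here is a small standalone sketch of mine that mirrors the ell.dat helper above outside the corrplot internals (the value rho = 0.7 is an arbitrary choice):
rho <- 0.7
k <- seq(0, 2 * pi, length = 99)
x <- cos(k + acos(rho)/2)/2    # same parametric form as ell.dat above
y <- cos(k - acos(rho)/2)/2
plot(x, y, type = "l", asp = 1, main = "Ellipse glyph for rho = 0.7")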

kde2d density comparison

I have a question about kde2d (the kernel density estimator). I am computing two different kde2d estimates for two different sets of data in the same variable space. When I compare both with filled.contour2 or contours, I see that the set with the lower density of points in a scatterplot (it also has fewer points in total, by a factor of 10) has higher density contour values. I was expecting the set with the higher point density to have higher density contour values, but as I said above, that is not the case. Does it have to do with the choice of bandwidth (h)? I am using equal h, and I tried changing it, but the result did not change much. What is my error?
An example
a <- runif(1000, 5.0, 7.5)
b <- runif(1000, 2.0, 3.0)
c <- runif(100000,5.0, 7.5)
d <- runif(100000, 2.0, 3.0)
library(MASS)
abdens <- kde2d(a,b,n=100,h=0.5)
cddens <- kde2d(c,d,n=100,h=0.5)
mylevels <- seq(-2.4,30,0.9)
filled.contour2(abdens,xlab="a",ylab="b",xlim=c(5,7.5),ylim=c(2,3),
col=hsv(seq(0,1,length=length(mylevels))))
plot(a,b)
contour(abdens,nlevels=5,add=T,col="blue")
plot(c,d)
contour(cddens,nlevels=5,add=T,col="orange")
I'm not sure I agree that the densities should be different in the uniform case. I would have expected a set with a higher number of points randomly drawn from a Normal distribution to have more support in extreme regions and therefore a lower (estimated) density in the center. That effect might also occasionally be apparent with bivariate Uniform draws of 1,000 points versus 100,000. I hope my modifications to your code are acceptable; it's easier to see the contours if they are drawn after the plots.
(The theoretical density would be the same in both cases, since a density is normalized to an integral of 1.0. We are only looking at estimates, with some expected artifacts from "edge" effects. In the case of univariate densities, information about the bounds can be supplied with the density functions in the logspline package.)
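One way to convince yourself that sample size alone should not raise the density values (a sketch of mine, not part of the original answer; it reuses abdens and cddens from the question) is to check that both estimates integrate to roughly 1 over their grids:
cell.ab <- diff(abdens$x[1:2]) * diff(abdens$y[1:2])   # area of one grid cell
sum(abdens$z) * cell.ab    # close to 1 for the 1,000-point sample
cell.cd <- diff(cddens$x[1:2]) * diff(cddens$y[1:2])
sum(cddens$z) * cell.cd    # close to 1 for the 100,000-point sample
# Both fall slightly short of 1 because some kernel mass spills past the data range.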

Resources