I am trying to figure out whether my points are below or above a simple curve, and I guess I am struggling with the math at the moment.
I prepared a working example, but the math first.
I have some points and I want to check whether they are above or below a curve. The curve is given by the function y=1/(x-.5). So I thought I would move everything to one side and get 0=1/(x-.5)-y.
Evaluating the right-hand side for each point should then give negative values on one side of the curve and positive values on the other.
I noticed a problem: if an x value is smaller than .5, the denominator of 1/(x-.5) becomes negative, and all my values are negative as well.
I added a special point (point 5) which gives the expected positive value, but what about the others? How should I test those?
points <- data.frame(
x=c(-3.6030515,-0.2791478,10.2045860,-0.7457344,1,0.4037591,0.1555678,
6.1525442,1.9831603),
y=c(0.95715140,0.18139107,2.87456154,0.17190597,0.5,0.09778570,0.02708183,
2.69455955,1.09943870)
)
curves <- data.frame(x=c(seq(.1,10,.1)))
curves$y <- 1/(curves$x-.5)
plot(points$x,points$y)
lines(curves$x,curves$y)
lines(-curves$x,curves$y)
1/(points$x-.5)-points$y >= 0
To count the number of points below the curve:
## count the number below the curve
sum(points$y<1/(points$x-0.5) )
To show it graphically:
## plot it using plot and curve
plot(points$x,points$y,col=ifelse(points$y<1/(points$x-0.5) ,'blue','red'),pch=20)
curve(1/(x-.5),-4,10,add=TRUE,col='green',lwd=2)
Discontinuity:
To show the discontinuity graphically, use curve on a narrow range around x = 0.5:
curve(1/(x-.5),0,1,col='green',lwd=2)
abline(v=0.5,lwd=3)
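One caveat about the calls above: curve evaluates the function on a grid and joins consecutive values, so a range that spans x = 0.5 gets a near-vertical segment drawn across the pole. A minimal sketch that draws the two branches separately instead; the gap eps and the axis limits are arbitrary choices of mine, not part of the original answer:

```r
f <- function(x) 1 / (x - 0.5)
eps <- 0.01  # assumed gap left on either side of the pole at x = 0.5

# Empty frame first, so both branches share the same axes
plot(NULL, xlim = c(-4, 10), ylim = c(-10, 10), xlab = "x", ylab = "y")
curve(f, from = -4, to = 0.5 - eps, add = TRUE, col = "green", lwd = 2)
curve(f, from = 0.5 + eps, to = 10, add = TRUE, col = "green", lwd = 2)
abline(v = 0.5, lty = 2)  # the vertical asymptote
```

The left branch is negative everywhere and the right branch is positive everywhere, which is the reason the sign test in the question behaves differently on either side of 0.5.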
Unless I've misunderstood the question, you should be able to just evaluate the function at your points' x values, and compare the outcome (i.e. the y value according to the function) to your points' y values.
f <- function(x) 1 / (x-0.5)
f(points$x) < points$y
# [1] TRUE TRUE TRUE TRUE FALSE TRUE TRUE TRUE TRUE
With the way I've structured the inequality, TRUE indicates that the curve is below the corresponding y value in the points vector. In other words, all but the fifth point are above the curve.
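The comparison can also be wrapped in a small helper that names the side explicitly. This is only a sketch: side_of_curve and pts are names I made up (pts avoids masking graphics::points), and returning NA at the pole x = 0.5 is my own choice. Note that for x < 0.5 the curve is negative, so any point with a positive y lies above it there; that is why 1/(x-.5)-y came out negative for all of those points in the question.

```r
f <- function(x) 1 / (x - 0.5)

# Same coordinates as in the question
pts <- data.frame(
  x = c(-3.6030515, -0.2791478, 10.2045860, -0.7457344, 1, 0.4037591,
        0.1555678, 6.1525442, 1.9831603),
  y = c(0.95715140, 0.18139107, 2.87456154, 0.17190597, 0.5, 0.09778570,
        0.02708183, 2.69455955, 1.09943870)
)

# Hypothetical helper: label each point relative to the curve y = f(x)
side_of_curve <- function(x, y) {
  fx <- f(x)
  side <- ifelse(y > fx, "above", ifelse(y < fx, "below", "on"))
  side[x == 0.5] <- NA  # the curve is undefined at the pole
  side
}

pts$side <- side_of_curve(pts$x, pts$y)
table(pts$side)
```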
Related
I'm still trying to find the best way to classify bivariate point patterns:
Point pattern classification with spatstat: what am I doing wrong?
I have now analysed 110 samples of my dataset using @Adrian's suggestion with sigma=bw.diggle (as I wanted automatic bandwidth selection). f is a "resource selection function" (RSF), which describes the relationship between the intensity of the Cancer point process and the covariate (here, the kernel density of Immune):
Cancer <- split(cells)[["tumor"]]
Immune <- split(cells)[["bcell"]]
Dimmune <- density(Immune,sigma=bw.diggle)
f <- rhohat(Cancer, Dimmune)
I have doubts about some of the results. About a dozen of the rho-functions looked weird (disrupted, or with a single peak). After changing to the default sigma=NULL or to sigma=bw.scott (both of which smooth more), the functions looked "better" (see examples below). I also experimented with the following manipulations:
cells # bivariate point pattern with marks "tumor" and "bcell"
o.marks<-cells$marks # original marks
#A) randomly re-assign original marks
a.marks <- sample(cells$marks)
#B) replace marks randomly with a 50/50 proportion
b.marks<-as.factor(sample(c("tumor","bcell"), replace=TRUE, size=length(o.marks)))
#C) random (homogeneous?) pattern with the original number of points
randt<-runifpoint(npoints(subset(cells,marks=="tumor")),win=cells$window)
randb<-runifpoint(npoints(subset(cells,marks=="bcell")),win=cells$window)
cells<-superimpose(tumor=randt,bcell=randb)
#D) tumor points are associated with bcell points (is "clustered" a right term?)
Cancer<-rpoint(npoints(subset(cells,marks=="tumor")),Dimmune,win=cells$window)
#E) tumor points are segregated from bcell points
reversedD<-Dimmune
density.scale.v<-sort(unique((as.vector(Dimmune$v)[!is.na(as.vector(Dimmune$v))]))) # density scale
density.scale.v.rev<-rev(density.scale.v)# reversed density scale
new.image.v<-Dimmune$v
# Loop over matrix
for(row in 1:nrow(Dimmune$v)) {
  for(col in 1:ncol(Dimmune$v)) {
    if (is.na(Dimmune$v[row, col])) {next}
    number<-which(density.scale.v==Dimmune$v[row, col])
    new.image.v[row, col]<-density.scale.v.rev[number]
  }
}
reversedD$v<-new.image.v # reversed density
Cancer<-rpoint(npoints(subset(cells,marks=="tumor")),reversedD,win=cells$window)
A better way to generate inverse density heatmaps is given by @Adrian in his post below.
I could not generate rpoint patterns for the bw.diggle density because it contained negative values. I therefore replaced the negatives with Dimmune$v[which(Dimmune$v<0)]<-0, after which rpoint ran. As @Adrian explains in the post below, this is normal and is solved more easily with the density.ppp option positive=TRUE.
I first used bw.diggle because hopskel.test indicated "clustering" for all my patterns. I am now going to use bw.scott for my analysis, but can this decision be justified somehow? Is there a better criterion than "the RSF function looks weird"?
Some examples (images not preserved): sample10, sample20, sample110.
That is a lot of questions!
Please try to ask only one question per post.
But here are some answers to your technical questions about spatstat.
Negative values:
The help for density.ppp explains that small negative values can occur because of numerical effects. To force the density values to be non-negative, use the argument positive=TRUE in the call to density.ppp. For example density(Immune, bw.diggle, positive=TRUE).
Reversed image: to reverse the ordering of values in an image Z you can use the following code:
V <- Z
A <- order(Z[])
V[][A] <- Z[][rev(A)]
Then V is the order-reversed image.
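To see what the order-reversal does, here is a sketch of the same idea on a plain numeric matrix rather than a spatstat image, so it runs in base R. The name reverse_values is hypothetical, and the explicit NA handling is an assumption of mine for plain matrices, where order() would otherwise include the NA entries.

```r
# Hypothetical helper: the smallest value swaps with the largest, the second
# smallest with the second largest, and so on; NA cells stay where they are.
reverse_values <- function(z) {
  ok <- !is.na(z)
  vals <- z[ok]
  a <- order(vals)             # positions of the values in increasing order
  reversed <- vals
  reversed[a] <- vals[rev(a)]  # same trick as V[][A] <- Z[][rev(A)]
  z[ok] <- reversed
  z
}

z <- matrix(c(3, 1, NA, 2), nrow = 2)
reverse_values(z)
```

The set of values is unchanged; only their assignment to cells is flipped, which is also what the double loop in the question is aiming for.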
Tips for your code:
To generate a random point pattern with the same number of points and in the same window as an existing point pattern X, use Y <- runifpoint(ex=X).
To extract the marks of a point pattern X, use a <- marks(X). To assign new marks to a point pattern X, use marks(X) <- b.
To randomly permute the marks attached to the points in a point pattern X, use Y <- rlabel(X).
To assign new marks to a point pattern X where the new marks are drawn randomly-with-replacement from a given vector of values m, use Y <- rlabel(X, m, permute=FALSE).
I want to calculate the following integral using the hit-and-miss method.
I=∫x^3dx with lower limit 0 and upper limit 1
I know how to solve it by hand, but I cannot find the right R code to compute it, generate (for example) 100000 random points, and then plot them like this (image not preserved):
Thank you.
1. Generate 2 vectors from uniform distribution of the desired length
l = 10000
x = runif(l)
y = runif(l)
2. The approximation of the integral is the proportion of (x,y) points that fall below the function you want to integrate. (The points are drawn from the unit square, whose area is 1, so the proportion needs no rescaling.)
sum(y<x^3)/l
3. For the plot, you just have to plot the points, changing their color depending whether they are above or below the curve, and add the function with curve():
plot(x,y,col=1+(y<x^3))
curve(x^3,add=TRUE,col=3)
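The three steps can be collected into one function and checked against the exact value, which is 1/4 for this integral. This is a sketch under stated assumptions: hit_and_miss is a name I made up, the function being integrated must be non-negative and no larger than ymax on the interval, and the standard error is the usual binomial one for a proportion.

```r
# Hypothetical helper: hit-and-miss estimate of the integral of f over
# [lower, upper], sampling uniformly in the box [lower, upper] x [0, ymax]
hit_and_miss <- function(f, n, lower = 0, upper = 1, ymax = 1) {
  x <- runif(n, lower, upper)
  y <- runif(n, 0, ymax)
  p_hat <- mean(y < f(x))               # share of points under the curve
  box_area <- (upper - lower) * ymax
  c(estimate  = box_area * p_hat,
    std_error = box_area * sqrt(p_hat * (1 - p_hat) / n))
}

set.seed(1)
res <- hit_and_miss(function(x) x^3, n = 100000)
res
abs(res[["estimate"]] - 1 / 4)  # distance from the exact value
```

With 100000 points the standard error is on the order of 0.001, so the estimate should agree with 0.25 to about two decimal places.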
I have a data frame that has 3 values for each point in the form: (x, y, boolean). I'd like to find an area bounded by values of (x, y) where roughly half the points in the area are TRUE and half are FALSE.
I can scatterplot the data and colour the points according to the third value, which gives me a general idea, but I was wondering whether there is a better way. I understand that if you take an area small enough to contain only two points, one TRUE and one FALSE, then you have 50/50, so there has to be a better way of deciding what size of area to look for.
Visually, I see this as drawing a square on the scatter plot and moving it along the x and y axes, each time checking the number of TRUE and FALSE points inside it. But is there a way to determine a good size for the area based on the values?
Thanks
EDIT: G5W's answer is a step in the right direction, but based on their scatterplot, I'm looking to create a square or rectangle in which about half the points are green and half are red. I understand that there are potentially infinitely many such areas, but I think there might be a good way to determine an optimal size for the area (maybe it should contain at least a certain percentage of the points, or something similar).
Note update below
You do not provide any sample data, so I have created some bogus data like this:
TestData = data.frame(x = c(rnorm(100, -1, 1), rnorm(100, 1,1)),
y = c(rnorm(100, -1, 1), rnorm(100, 1,1)),
z = rep(c(TRUE,FALSE), each=100))
I think that what you want is how much area is taken up by each of the TRUE and FALSE points. A way to interpret that task is to find the convex hull for each group and take its area. That is, find the minimum convex polygon that contains a group. The function chull will compute the convex hull of a set of points.
plot(TestData[,1:2], pch=20, col=as.numeric(TestData$z)+2)
CH1 = chull(TestData[TestData$z,1:2])
CH2 = chull(TestData[!TestData$z,1:2])
polygon(TestData[which(TestData$z)[CH1],1:2], lty=2, col="#00FF0011")
polygon(TestData[which(!TestData$z)[CH2],1:2], lty=2, col="#FF000011")
Once you have the polygons, the polyarea function from the pracma package will compute the area. Note that it computes a "signed" area so you either need to be careful about which direction you traverse the polygon or take the absolute value of the area.
library(pracma)
abs(polyarea(TestData[which(TestData$z)[CH1],1],
TestData[which(TestData$z)[CH1],2]))
[1] 16.48692
abs(polyarea(TestData[which(!TestData$z)[CH2],1],
TestData[which(!TestData$z)[CH2],2]))
[1] 15.17897
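To illustrate what "signed" means here, below is a base-R sketch of the shoelace formula, which is my own stand-in and not the pracma implementation. shoelace_area is a hypothetical name. With this formula a counterclockwise traversal gives a positive area and a clockwise one a negative area; chull is documented as returning the hull indices in clockwise order, hence the abs().

```r
# Hypothetical helper: signed polygon area via the shoelace formula
shoelace_area <- function(x, y) {
  nxt <- c(seq_along(x)[-1], 1)  # index of the next vertex, wrapping around
  sum(x * y[nxt] - x[nxt] * y) / 2
}

# Unit square
sq_x <- c(0, 1, 1, 0)
sq_y <- c(0, 0, 1, 1)
shoelace_area(sq_x, sq_y)            # counterclockwise traversal
shoelace_area(rev(sq_x), rev(sq_y))  # clockwise traversal: the sign flips

hull <- chull(sq_x, sq_y)
abs(shoelace_area(sq_x[hull], sq_y[hull]))
```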
Update
This is a completely different answer based on the updated question. I am leaving the old answer because the question now refers to it.
The question now gives a little more information about the data ("There are about twice as many FALSE than TRUE") so I have made an updated bogus data set to reflect that.
set.seed(2017)
TestData = data.frame(x = c(rnorm(100, -1, 1), rnorm(200, 1, 1)),
y = c(rnorm(100, 1, 1), rnorm(200, -1,1)),
z = rep(c(TRUE,FALSE), c(100,200)))
The problem is now to find regions where the densities of TRUE and FALSE points are approximately equal. The question asked for a rectangular region, but at least for this data, that will be difficult. We can get a good visualization to see why.
We can use the function kde2d from the MASS package to get the 2-dimensional density of the TRUE points and the FALSE points. If we take the difference of these two densities, we need only find the regions where the difference is near zero. Once we have this difference in density, we can visualize it with a contour plot.
library(MASS)
Grid1 = kde2d(TestData$x[TestData$z], TestData$y[TestData$z],
lims = c(c(-3,3), c(-3,3)))
Grid2 = kde2d(TestData$x[!TestData$z], TestData$y[!TestData$z],
lims = c(c(-3,3), c(-3,3)))
GridDiff = Grid1
GridDiff$z = Grid1$z - Grid2$z
filled.contour(GridDiff, color = terrain.colors)
In the plot it is easy to see that there are far more TRUE than FALSE points near (-1,1), and more FALSE than TRUE points near (1,-1). We can also see that the places where the difference in density is near zero lie in a narrow band in the general area of the line y=x. You might be able to get a box in which a region with more TRUEs is balanced by a region with more FALSEs, but the regions where the densities are the same are small.
Of course, this is for my bogus data set which probably bears little relation to your real data. You could perform the same sort of analysis on your data and maybe you will be luckier with a bigger region of near equal densities.
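For the rectangle idea from the question, one crude alternative to the density difference is to bin the points on a grid and keep the cells whose share of TRUE is close to one half. This is a different technique from the kernel-density approach above and only a sketch: balanced_cells is a hypothetical name, and the cell size (breaks), the minimum count min_n and the tolerance tol are arbitrary choices that would need tuning on real data.

```r
set.seed(2017)
TestData <- data.frame(
  x = c(rnorm(100, -1, 1), rnorm(200, 1, 1)),
  y = c(rnorm(100, 1, 1), rnorm(200, -1, 1)),
  z = rep(c(TRUE, FALSE), c(100, 200))
)

# Hypothetical helper: grid cells with at least min_n points whose share of
# TRUE is within tol of 0.5
balanced_cells <- function(d, breaks = seq(-4, 4, by = 1),
                           min_n = 10, tol = 0.1) {
  cx <- cut(d$x, breaks)
  cy <- cut(d$y, breaks)
  n_total <- table(cx, cy)              # all points per cell
  n_true  <- table(cx[d$z], cy[d$z])    # TRUE points per cell (same levels)
  share   <- n_true / n_total           # NaN for empty cells
  keep <- which(n_total >= min_n & abs(share - 0.5) <= tol, arr.ind = TRUE)
  data.frame(
    x_cell     = rownames(n_total)[keep[, 1]],
    y_cell     = colnames(n_total)[keep[, 2]],
    n          = n_total[keep],
    share_true = share[keep]
  )
}

balanced_cells(TestData)
```

Requiring a minimum count per cell is one way to rule out the trivial two-point rectangles mentioned in the question; points outside the range of breaks are silently ignored.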
I am stuck on a simple problem. I have a scatter plot.
I have plotted confidence lines around it using a custom formula. Now I want only the names outside the cutoff lines to be displayed, and nothing inside. But I can't figure out how to subset my data based on the line coordinates.
The line is plotted using the lines function from vectors of 128 x and y values. How do I subset my data (x,y points) based on these two vectors? I can apply a static limit with a single number, such as 1, 2 or 3, but how to use a vector to subset the data has me stuck.
For a reproducible example, consider:
df=data.frame(x=seq(2,16,by=2),y=seq(2,16,by=2),lab=paste("label",seq(2,16,by=2),sep=''))
plot(df[,1],df[,2])
# adding lines
lines(seq(1,15),seq(15,1),lwd=1, lty=2)
# adding labels
text(df[,1],df[,2],labels=df[,3],pos=3,col="red",cex=0.75)
Now, I need just the labels that are outside or intersecting the line.
I was trying to subset my data frame with the values used for the lines, but I can't get it right.
Static sub-setting can be done for single values like
df[which(df[,1]>8 & df[,2]>8),] but how do I do it for the whole list?
I also tried sapply to cycle over all the x and y values used for the lines, but most points come out TRUE for one limit and FALSE for the others.
Thanks
I will address your initial volcano-type-graph problem and not the made-up one, because they are totally different.
I thought about this a lot and I believe I reached a solid conclusion. There are two options:
1. You know the equations of the lines, which would be really easy to work with.
2. You do not know the equations of the lines, which means we need to work with an approximation.
Some geometry:
A line is described by an equation of the form y = a + bx. For a given pair of coordinates (x, y), if y is greater than the right-hand side evaluated at x, the point is above the line; otherwise it is below it (or on it). The same idea applies to a curve (as in your case).
If you have the equations, it is easy to do the above in my code below and you are set. If not, you need to approximate the curve. To do that you will need the following code:
df=data.frame(x=seq(2,16,by=2),y=seq(2,16,by=2),lab=paste("label",seq(2,16,by=2),sep=''))
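# line1x, line1y, line2x, line2y (the coordinates passed to lines()) and the
# intercept a and slope b of the approximating line must be defined beforehand
make_vector <- function(df) {
  lab <- rep(NA_character_, nrow(df)) # NA for points that get no label
  for (i in seq_len(nrow(df))) {
    this_row <- df[i,] #this will contain the three elements per row
    if ( (this_row[1] < max(line1x) & this_row[2] > max(line1y) & this_row[2] < a + b*this_row[1])
         |
         (this_row[1] > min(line2x) & this_row[2] > max(line2y) & this_row[2] > a + b*this_row[1]) ) {
      lab[i] <- as.character(this_row[[3]])
    }
  }
  return(lab)
}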
#this_row[1] = your x
#this_row[2] = your y
#this_row[3] = your label
df$labels <- make_vector(df)
plot(df[,1],df[,2])
# adding lines
lines(seq(1,15),seq(15,1),lwd=1, lty=2)
# adding labels
text(df[,1],df[,2],labels=df[,4],pos=3,col="red",cex=0.75)
The important bit is the function. Imagine that you have df as you created it with x,y and labs. You also will have a vector with the x,y coordinates for line1 and x,y coordinates for line2.
Let's see the condition of line1 only (the same exists for line 2 which is implemented on the code above):
this_row[1] < max(line1x) & this_row[2] > max(line1y) & this_row[2] < a + b*this_row[1]
#translates to:
#this_row[1] < max(line1x) = your x needs to be less than the max x (vertical line in the graph below)
#this_row[2] > max(line1y) = your y needs to be greater than the max y (horizontal line in the graph below)
#this_row[2] < a + b*this_row[1] = your y needs to be less than the right-hand side of the equation (to have a point above, i.e. left of, the line)
#check below what the line is
This will make something like the below graph (this is a bit horrible and also magnified but it is just a reference. Visualize it approximating your lines):
The above code would pick all the points in the area above the triangle and within the y=1 and x=1 lines.
Finally the equation:
Given the coordinates of two points, you can work out a line's equation by solving a system of two equations in the two parameters a and b (substitute each point's x and y into y = a + bx).
The two points to pick are the two points closest to the tangent of the first line (line1). Choose them by eye according to your data: the closer to the tangent, the better. Just plot the points and eyeball them.
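Solving that two-equation system by hand gives b = (y2 - y1) / (x2 - x1) and a = y1 - b*x1. A small sketch, where line_through is a hypothetical helper name and the two points are taken from the reproducible example's dashed line purely for illustration:

```r
# Hypothetical helper: intercept a and slope b of the line y = a + b*x
# through the points (x1, y1) and (x2, y2)
line_through <- function(x1, y1, x2, y2) {
  stopifnot(x1 != x2)  # a vertical line cannot be written as y = a + b*x
  b <- (y2 - y1) / (x2 - x1)
  a <- y1 - b * x1
  c(a = a, b = b)
}

# End points of lines(seq(1,15), seq(15,1)) from the question's example
ab <- line_through(1, 15, 15, 1)
ab
```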
Having done all the above you have your points with your labels (approximately at least).
And that is the only thing you can do!
Long talk but hope it helps.
P.S. I haven't tested the code because I have no data.
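For the made-up example in the question, where the line is only known as two coordinate vectors, a different route from the one described above is to interpolate the line's height at each point's x with approx and compare. This is a sketch, not a tested solution for the real volcano plot: line_x, line_y, line_at_x and outside are names I introduced, and rule = 2 (extend the end values beyond the line's x range) is an assumption about how points past the ends should be treated.

```r
df <- data.frame(x = seq(2, 16, by = 2), y = seq(2, 16, by = 2),
                 lab = paste0("label", seq(2, 16, by = 2)),
                 stringsAsFactors = FALSE)
line_x <- seq(1, 15)
line_y <- seq(15, 1)

# Height of the line at each point's x (linear interpolation)
line_at_x <- approx(line_x, line_y, xout = df$x, rule = 2)$y
outside <- df$y >= line_at_x  # on or above the line

plot(df$x, df$y)
lines(line_x, line_y, lwd = 1, lty = 2)
# Label only the points on or outside the line
text(df$x[outside], df$y[outside], labels = df$lab[outside],
     pos = 3, col = "red", cex = 0.75)
```

Because the comparison is vectorised, there is no need to loop over the 128 line coordinates; a second cutoff line can be handled the same way and the two logical vectors combined with |.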
I have
probability values: 0.06,0.06,0.1,0.08,0.12,0.16,0.14,0.14,0.08,0.02,0.04, summing to 1
the corresponding interval boundaries; the stochastic variable falls in each interval with the corresponding probability from the list above:
126,162,233,304,375,446,517,588,659,730,801,839
How can I plot the probability distribution?
I would like the interval boundaries on the x axis and, over each interval, a histogram bar with the probability value.
Thanks.
How about
x <- c(126,162,233,304,375,446,517,588,659,730,801,839)
p <- c(0.06,0.06,0.1,0.08,0.12,0.16,0.14,0.14,0.08,0.02,0.04)
plot(x,c(p,0),type="s")
lines(x,c(0,p),type="S")
rect(x[-1],0,x[-length(x)],p,col="lightblue")
for a quick answer? (With the rect included you might not need the lines call and might be able to change the plot call to plot(x,c(p,0),type="n"), since x has one more element than p. As usual I would recommend par(bty="l",lty=1) for my preferred graphical defaults ...)
(Explanation: "s" and "S" are two different stair-step types (see Details in ?plot): I used them both to get both the left and right boundaries of the distribution.)
edit: In your comments you say "(it) doesn't look like a histogram". It's not quite clear what you want. I added rectangles in the example above -- maybe that does it? Or you could do
b <- barplot(p,width=diff(x),space=0)
but getting the x-axis labels right is a pain.
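Since the intervals have unequal widths (the first and last are narrower than the rest), a histogram in the strict sense would use bar heights of probability divided by width, so that each bar's area equals its probability. A sketch of that variant with rect, which also puts the axis ticks at the interval boundaries; the names widths and dens and the axis labels are my own choices:

```r
x <- c(126, 162, 233, 304, 375, 446, 517, 588, 659, 730, 801, 839)
p <- c(0.06, 0.06, 0.1, 0.08, 0.12, 0.16, 0.14, 0.14, 0.08, 0.02, 0.04)

widths <- diff(x)
dens <- p / widths  # bar height chosen so that bar area = probability

# Empty frame, then one rectangle per interval
plot(range(x), c(0, max(dens)), type = "n", xaxt = "n",
     xlab = "value", ylab = "probability density")
rect(x[-length(x)], 0, x[-1], dens, col = "lightblue")
axis(1, at = x, labels = x, las = 2)  # ticks at the interval boundaries
```

If plain probabilities on the y axis are what is wanted, replacing dens with p in the plot and rect calls gives the same picture as the answer above, with the boundary ticks added.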