I would like to produce what I think will be a very simple diagram in R: it will show the number of genes that fall into each of two categories.
The areas of the circles must be proportional to the counts, so that they show the vast difference between my two categories. One category has 15000 genes, the other has 15. Therefore the area of one circle should be 1000 times greater than the other. Is there a simple R script that can be used to do this? (Draw two circles, one of whose area is X times smaller than the other.)
You can draw circles using the plotrix package and draw.circle function. So to answer your question, we just need to calculate the radius of each circle. To make the comparison, it's easier to make the first circle have unit area. So,
## Calculate radius for given area
get_radius = function(area = 1) sqrt(area/pi)
##Load package and draw blank graph
library(plotrix)
plot(-10:10,seq(-10,10,length=21),type="n",xlab="",ylab="")
## Unit area
draw.circle(0, 0, get_radius())
## 10 times larger
draw.circle(0, 0, get_radius(10))
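For the numbers in the question (15000 vs 15), here is a minimal sketch using only base R. It assumes the smaller category gets unit area, and it uses symbols() rather than plotrix; the names counts, radii and area_ratio are just for illustration.

```r
## Counts from the question; the names are illustrative
counts <- c(large = 15000, small = 15)

## Radius of a circle with the given area
get_radius <- function(area = 1) sqrt(area / pi)

## Give the smaller category unit area and scale the other accordingly
radii <- get_radius(counts / min(counts))

## The area ratio is the squared radius ratio
area_ratio <- (radii[["large"]] / radii[["small"]])^2

## Draw both circles; inches = FALSE keeps the radii in x-axis units
## and asp = 1 keeps the circles round
symbols(x = c(0, 0), y = c(0, 0), circles = radii, inches = FALSE,
        asp = 1, xlab = "", ylab = "")
```

The radius ratio is only about 31.6 (the square root of 1000), so the small circle should still be visible inside the large one.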
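As shown in this post, you can use for example the shape package and its plotcircle function, where you can choose the radius. Keep in mind that area scales with the square of the radius, so the two radii below give a 100-fold difference in area. Example: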
require("shape")
emptyplot(c(0, 1))
plotcircle(mid = c(0.2, 0.5), r = 0.1)
plotcircle(mid = c(0.6, 0.5), r = 0.01)
I want to visualize proportions using points inside a circle. For example, let's say that I have 100 points that I wish to scatter (somewhat randomly jittered) in a circle.
Next, I want to use this diagram to represent the proportions of people who voted Biden/Harris in 2020 US presidential elections, in each state.
Example #1 -- Michigan
Biden got 50.62% of Michigan's votes. I'm going to draw a horizontal diameter that splits the circle into two halves, and then color the points under the diameter in blue (Democrats' color).
Example #2 -- Wyoming
Unlike Michigan, in Wyoming Biden got only 26.55% of the votes, which is approximately a quarter of the vote. In this case I'd draw a horizontal chord that divides the circle such that the disk's area under the chord is 25% of the entire disk area. Then I'll color the respective points in that area in blue. Since I have 100 points in total, 25 points represent the 25% who voted Biden in Wyoming.
My question: How can I do this with ggplot? I researched this issue, and there's a lot of geometry going on here. First, the kind of area I'm talking about is called a "circular segment". Second, there are many formulas to calculate its area, if we know some other parameters about the shape (such as the radius length, etc.). See this nice demo.
However, my goal isn't to solve geometry problems, but just to represent proportions in a very specific way:
draw a circle
sprinkle X number of points inside
draw a (real or invisible) horizontal line that divides the circle/disk area according to a given proportion
ensure that the points are arranged respective to the split. That is, if we want to represent a 30%-70% split, then have 30% of the points under the line that divides the disk.
color the points under the line.
I understand that this is somewhat an exotic visualization, but I'll be thankful for any help with this.
EDIT
I've found a reference to a JavaScript package that does something very similar to what I'm asking.
I took a crack at this for fun. There's a lot more that could be done. I agree that this is not a great way to visualize proportions, but if it's engaging your audience ...
Formulas for determining appropriate heights are taken from Wikipedia. In particular we need the formulas
a/A = (theta - sin(theta))/(2*pi)
h = 1-cos(theta/2)
where a is the area of the segment; A is the whole area of the circle; theta is the angle described by the arc that defines the segment (see Wikipedia for pictures); and h is the height of the segment.
Machinery for finding heights.
afun <- function(x) (x-sin(x))/(2*pi)
## curve(afun, from=0, to = 2*pi)
find_a <- function(a) {
uniroot(
function(x) afun(x) -a,
interval=c(0, 2*pi))$root
}
find_h <- function(a) {
1- cos(find_a(a)/2)
}
vfind_h <- Vectorize(find_h)
## find_a(0.5)
## find_h(0.5)
## curve(vfind_h(x), from = 0, to= 1)
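A quick sanity check of the machinery, as a sketch: it repeats the definitions above so it runs on its own, and compares against an independent calculation of the cap area by numerical integration. The helper cap_fraction and the target values are my own additions for illustration.

```r
## Same definitions as above, repeated so the check is self-contained
afun <- function(x) (x - sin(x)) / (2 * pi)
find_a <- function(a) {
  uniroot(function(x) afun(x) - a, interval = c(0, 2 * pi))$root
}
find_h <- function(a) 1 - cos(find_a(a) / 2)

## Area fraction of the cap of height h at the top of the unit disk,
## obtained by integrating the chord width 2 * sqrt(1 - y^2)
cap_fraction <- function(h) {
  integrate(function(y) 2 * sqrt(pmax(0, 1 - y^2)),
            lower = 1 - h, upper = 1)$value / pi
}

targets <- c(0.1, 0.25, 0.5, 0.75)
recovered <- sapply(targets, function(a) cap_fraction(find_h(a)))
round(cbind(target = targets, recovered = recovered), 3)
```

If the two columns agree to within the tolerance of uniroot, the height formula is doing what it should.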
set up a circle
dd <- data.frame(x=0,y=0,r=1)
library(ggforce)
library(ggplot2); theme_set(theme_void())
gg0 <- ggplot(dd) + geom_circle(aes(x0=x,y0=y,r=r)) + coord_fixed()
finish
props <- c(0.2,0.5,0.3) ## proportions
n <- 100 ## number of points to scatter
cprop <- cumsum(props)[-length(props)]
h <- vfind_h(cprop)
set.seed(101)
r <- runif(n)
th <- runif(n, 0, 2 * pi)
## sqrt(r) spreads the points uniformly over the disk; th already spans [0, 2*pi]
dd2 <- data.frame(x = sqrt(r) * cos(th),
y = sqrt(r) * sin(th))
dd2$g <- cut(dd2$y, c(1, 1-h, -1))
gg0 + geom_point(data=dd2, aes(x, y, colour = g), size=3)
There are a bunch of tweaks that would make this better: meaningful names for the categories; reversing the axis order to match the plot; maybe adding segments delimiting the sections, or (more work) polygons so you can shade the sections.
You should definitely check this for mistakes — e.g. there are places where I may have used a set of values where I should have used their first differences, or vice versa (values vs cumulative sum). But this should get you started.
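One thing to be aware of: with randomly scattered points, the number that lands below a given chord is itself random, so a 25% segment will not necessarily contain exactly 25 of 100 points. If exact counts matter more than the exact chord position, a different approach (not the geometric one above) is to colour the k lowest points by rank. A minimal base-R sketch, with pts, prop_blue and split_y as illustrative names:

```r
set.seed(101)
n <- 100            # number of points
prop_blue <- 0.25   # share of points to colour

## Uniform points in the unit disk (sqrt(r) avoids crowding the centre)
r <- runif(n)
th <- runif(n, 0, 2 * pi)
pts <- data.frame(x = sqrt(r) * cos(th), y = sqrt(r) * sin(th))

## Colour exactly the k lowest points
k <- round(prop_blue * n)
pts$blue <- rank(pts$y, ties.method = "first") <= k

## Implied dividing line: midway between the highest coloured point
## and the lowest uncoloured point
split_y <- mean(c(max(pts$y[pts$blue]), min(pts$y[!pts$blue])))
```

The resulting pts can be passed to geom_point(aes(x, y, colour = blue)) in the same way as dd2. The trade-off is that the dividing line sits at an empirical quantile of the points rather than at the geometrically exact chord.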
I'm working on the following: I have a store layout, example see below (cannot add the real thing for GDPR reasons but the example should do the trick) on which I have xy coordinates from visitors (anonymous of course)
I already placed a grid on the picture so I can see which route they take in the store. That works fine. origin is bottom left and x & y are scaled from 0-100.
So far so good. Now next step is identifying the coordinates of the shelves, rectangles in the picture. Is there a way to do this without having to do this manually? The real store layout contains more than 900 shelves or am I pushing out the boat too far?
The output I'm looking for is a dataframe that contains a shelf ID and the coordinates of the corners. The idea is to create some heatmaps of the store to see whether there are blind spots, hotspots, ...
The second analysis needs also the integer points. The idea is to create vectors of visitor points so we get a direction to which they are looking. By using the scope of what a human being can see I would give percentages of "seen" the products based on intersection with integer points.
thx!
JL
One approach is to perform clustering on the black pixels of the image. The clusters are then the shelves. If the shelves are axis parallel you can find the rectangles by just taking min/max in each direction. This works quite well:
Sample code (I converted the image to PNG as it is easier to read than gif):
library(png)
library(dbscan)
library(tidyverse)
library(RColorBrewer)
img <- readPNG("G18JU.png")
is_black <-
img %>%
apply(c(1, 2), sum) %>% #sum all color channels
{. < 2.5} %>% # we assume black if the sum is lower than 2.5 (max value is 3)
which(arr.ind=TRUE) # the indices of the black pixels
clust <- dbscan(is_black, 2) # identify clusters
rects <-
as_tibble(is_black) %>%
mutate(cluster = clust$cluster) %>% # add cluster information
group_by(cluster) %>%
## find corner points of rectangles normalized to [0, 1]
## (row 1 is the top of the image, so the largest row is the bottom edge)
summarise(xleft = min(col) / dim(img)[2],
ybottom = 1 - max(row) / dim(img)[1],
xright = max(col) / dim(img)[2],
ytop = 1 - min(row) / dim(img)[1])
## plot the image and the rectangles
plot(c(0, 1), c(0, 1), type="n")
rasterImage(img, 0, 0, 1, 1)
for (i in seq_len(nrow(rects))) {
rect(rects$xleft[i], rects$ybottom[i], rects$xright[i], rects$ytop[i],
border = brewer.pal(nrow(rects), "Paired")[i], lwd = 2)
}
Of course this approach also detects other black lines as "rectangles" (e.g. the black border). But I guess you can easily create a "clean" image.
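To show the min/max idea without needing the image or the extra packages, here is a small base-R sketch on a synthetic pixel matrix. Note the assumptions: the "image" is made up, and single-linkage clustering from stats stands in for dbscan, so this is an illustration of the principle rather than the code above.

```r
## Synthetic 20 x 30 "image": TRUE marks a black pixel
img_black <- matrix(FALSE, nrow = 20, ncol = 30)
draw_outline <- function(m, rows, cols) {
  m[range(rows), cols] <- TRUE   # top and bottom edges
  m[rows, range(cols)] <- TRUE   # left and right edges
  m
}
img_black <- draw_outline(img_black, 3:8, 4:12)
img_black <- draw_outline(img_black, 12:18, 15:27)

px <- which(img_black, arr.ind = TRUE)   # columns "row" and "col"

## Single-linkage clustering cut at 1.5 joins pixels that touch
## (including diagonally); this stands in for dbscan here
cl <- cutree(hclust(dist(px), method = "single"), h = 1.5)

## Bounding box per cluster, normalised to [0, 1] with the origin bottom left
boxes <- do.call(rbind, lapply(split(as.data.frame(px), cl), function(d) {
  data.frame(xleft   = min(d$col) / ncol(img_black),
             ybottom = 1 - max(d$row) / nrow(img_black),
             xright  = max(d$col) / ncol(img_black),
             ytop    = 1 - min(d$row) / nrow(img_black))
}))
boxes
```

Multiplying the normalised coordinates by 100 would put them on the 0-100 grid mentioned in the question.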
Edit: extend method to find shelves that share a black line
To extend the method such that it can separate shelves that share a black line:
First, identify the rectangles in the way outlined above.
Then, extract each rectangle from the image and compute the column means. This gives you a 1d image (= line) for each rectangle. In this line apply thresholding and clustering as before. The clusters are now the black line segments, and the mean of each cluster corresponds to a vertical line shared by two shelves.
To find horizontal shared lines, the same procedure can be applied, but with row means instead of column means.
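A rough sketch of the 1d step, using base R only. It relies on the fact that colMeans gives one value per pixel column, so a full-height vertical black line shows up as a dip in that profile. The block shelf is synthetic, and the 0.5 threshold and the grouping of consecutive columns are my own simplifications of "thresholding and clustering".

```r
## Synthetic grey-scale block: 1 = white, 0 = black (two shelves side by side)
shelf <- matrix(1, nrow = 10, ncol = 21)
shelf[c(1, 10), ] <- 0            # top and bottom borders
shelf[, c(1, 11, 21)] <- 0        # left border, shared line, right border

col_profile <- colMeans(shelf)    # one value per pixel column
dark_cols <- which(col_profile < 0.5)

## Consecutive dark columns belong to the same line segment
segment_id <- cumsum(c(1, diff(dark_cols) > 1))
line_pos <- as.numeric(tapply(dark_cols, segment_id, mean))
line_pos
```

The first and last positions are the outer borders of the block; anything in between is a candidate shared line. Using rowMeans instead would look for horizontal lines.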
I have a data frame that has 3 values for each point in the form: (x, y, boolean). I'd like to find an area bounded by values of (x, y) where roughly half the points in the area are TRUE and half are FALSE.
I can scatterplot the data and color according to the 3rd value of each point, and I get a general idea, but I was wondering if there would be a better way. I understand that if you take a small enough area where there are only 2 points and one is TRUE and the other is FALSE then you have 50/50, so I was thinking there has to be a better way of deciding what size area to look for.
Visually I see this as drawing a square on the scatter plot and moving it around the x and y axes, each time checking the number of TRUE and FALSE points in the area, but is there a way to determine what a good size for the area is based on the values?
Thanks
EDIT: G5W's answer is a step in the right direction, but based on their scatterplot, I'm looking to create a square / rectangular area in which ~ half the points are green and half are red. I understand that there is potentially an infinite number of those areas, but I'm thinking there might be a good way to determine an optimal size for the area (maybe it should contain at least a certain percentage of the points or something).
Note update below
You do not provide any sample data, so I have created some bogus data like this:
TestData = data.frame(x = c(rnorm(100, -1, 1), rnorm(100, 1,1)),
y = c(rnorm(100, -1, 1), rnorm(100, 1,1)),
z = rep(c(TRUE,FALSE), each=100))
I think that what you want is how much area is taken up by each of the TRUE and FALSE points. A way to interpret that task is to find the convex hull for each group and take its area. That is, find the minimum convex polygon that contains a group. The function chull will compute the convex hull of a set of points.
plot(TestData[,1:2], pch=20, col=as.numeric(TestData$z)+2)
CH1 = chull(TestData[TestData$z,1:2])
CH2 = chull(TestData[!TestData$z,1:2])
polygon(TestData[which(TestData$z)[CH1],1:2], lty=2, col="#00FF0011")
polygon(TestData[which(!TestData$z)[CH2],1:2], lty=2, col="#FF000011")
Once you have the polygons, the polyarea function from the pracma package will compute the area. Note that it computes a "signed" area so you either need to be careful about which direction you traverse the polygon or take the absolute value of the area.
library(pracma)
abs(polyarea(TestData[which(TestData$z)[CH1],1],
TestData[which(TestData$z)[CH1],2]))
[1] 16.48692
abs(polyarea(TestData[which(!TestData$z)[CH2],1],
TestData[which(!TestData$z)[CH2],2]))
[1] 15.17897
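To illustrate what "signed" means here, this is a small base-R sketch of the shoelace formula. It is my own helper (shoelace_area), not the pracma implementation, so the sign convention shown is that of this sketch only.

```r
## Signed polygon area via the shoelace formula:
## positive for counter-clockwise vertices, negative for clockwise
shoelace_area <- function(x, y) {
  x_next <- c(x[-1], x[1])
  y_next <- c(y[-1], y[1])
  sum(x * y_next - x_next * y) / 2
}

## Unit square, listed counter-clockwise
sq_x <- c(0, 1, 1, 0)
sq_y <- c(0, 0, 1, 1)
ccw <- shoelace_area(sq_x, sq_y)
cw  <- shoelace_area(rev(sq_x), rev(sq_y))

## Hull of the square plus an interior point; abs() removes the
## dependence on the order in which chull returns the vertices
pts <- cbind(x = c(sq_x, 0.5), y = c(sq_y, 0.5))
hull <- chull(pts)
hull_area <- abs(shoelace_area(pts[hull, "x"], pts[hull, "y"]))
```

Taking the absolute value, as in the calls above, is the simplest way to avoid worrying about the traversal direction.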
Update
This is a completely different answer based on the updated question. I am leaving the old answer because the question now refers to it.
The question now gives a little more information about the data ("There are about twice as many FALSE than TRUE") so I have made an updated bogus data set to reflect that.
set.seed(2017)
TestData = data.frame(x = c(rnorm(100, -1, 1), rnorm(200, 1, 1)),
y = c(rnorm(100, 1, 1), rnorm(200, -1,1)),
z = rep(c(TRUE,FALSE), c(100,200)))
The problem is now to find regions where the density of TRUE and FALSE are approximately equal. The question asked for a rectangular region, but at least for this data, that will be difficult. We can get a good visualization to see why.
We can use the function kde2d from the MASS package to get the 2-dimensional density of the TRUE points and the FALSE points. If we take the difference of these two densities, we need only find the regions where the difference is near zero. Once we have this difference in density, we can visualize it with a contour plot.
library(MASS)
Grid1 = kde2d(TestData$x[TestData$z], TestData$y[TestData$z],
lims = c(c(-3,3), c(-3,3)))
Grid2 = kde2d(TestData$x[!TestData$z], TestData$y[!TestData$z],
lims = c(c(-3,3), c(-3,3)))
GridDiff = Grid1
GridDiff$z = Grid1$z - Grid2$z
filled.contour(GridDiff, color = terrain.colors)
In the plot it is easy to see the place where there are far more TRUE than FALSE, near (-1,1), and where there are more FALSE than TRUE, near (1,-1). We can also see that the places where the difference in density is near zero lie in a narrow band in the general area of the line y=x. You might be able to get a box where a region with more TRUEs is balanced by a region with more FALSEs, but the regions where the densities are the same are small.
Of course, this is for my bogus data set which probably bears little relation to your real data. You could perform the same sort of analysis on your data and maybe you will be luckier with a bigger region of near equal densities.
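If you still want to try the "moving square" idea from the question, here is a rough brute-force sketch on the same bogus data. The window half-width, the minimum point count and the 5% tolerance are arbitrary choices of mine, and the minimum count is there to rule out the trivial two-point windows mentioned in the question.

```r
set.seed(2017)
TestData <- data.frame(x = c(rnorm(100, -1, 1), rnorm(200, 1, 1)),
                       y = c(rnorm(100, 1, 1), rnorm(200, -1, 1)),
                       z = rep(c(TRUE, FALSE), c(100, 200)))

half_width <- 0.75   # half the side length of the square window (arbitrary)
min_points <- 20     # ignore windows with too few points (arbitrary)

## Candidate window centres on a regular grid
centers <- expand.grid(cx = seq(-3, 3, by = 0.25),
                       cy = seq(-3, 3, by = 0.25))

## For each window: number of points inside and share of TRUE among them
scan <- t(mapply(function(cx, cy) {
  inside <- abs(TestData$x - cx) <= half_width &
            abs(TestData$y - cy) <= half_width
  c(n = sum(inside),
    prop_true = if (any(inside)) mean(TestData$z[inside]) else NA_real_)
}, centers$cx, centers$cy))
windows <- cbind(centers, scan)

## Windows that are close to a 50/50 split
candidates <- subset(windows, n >= min_points & abs(prop_true - 0.5) <= 0.05)
head(candidates[order(-candidates$n), ])
```

Repeating this for a few values of half_width and comparing how many points the best candidates contain is one way to get a feel for a sensible window size.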
Hi R expert of the world,
Assume I have a point pattern that generates an intensity map, and that this map is color-coded into 3 regions in a pixelated image. How could I get the area of each color-coded region?
here it is an example using spatstat:
library(spatstat)
japanesepines
Z<-density(japanesepines); plot(Z) # ---> I create a density map
b <- quantile(Z, probs = (0:3)/3) # ---> I "reduce it" to 3 color-coded zones
Zcut <- cut(Z, breaks = b, labels = 1:3); plot(Zcut)
class(Zcut) # ---> and Zcut is my resultant image ("im")
Thank you in advance
Sacc
In your specific example it is very easy to calculate the area because you used quantile to cut the image: This effectively divides the image into areas of equal size, so there should be three areas of size 1/3 since the window is a unit square. In general to calculate areas from a factor valued image you could use as.tess and tile.areas (continuing your example):
Ztess <- as.tess(Zcut)
tile.areas(Ztess)
In this case the areas are 0.333313, which must be due to discretization.
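The discretization effect can be reproduced without spatstat. This sketch assumes a 128 x 128 pixel grid over the unit square (I believe that is the spatstat default, but check the dimensions of your own image) and uses random values in place of the density; Z, Zcut and areas here are plain base-R objects, not spatstat ones.

```r
set.seed(1)
## Stand-in for the pixel values of a density image on the unit square
Z <- matrix(runif(128 * 128), nrow = 128)

## Cut into three zones at the terciles, as in the question
b <- quantile(Z, probs = (0:3) / 3)
Zcut <- cut(Z, breaks = b, labels = 1:3, include.lowest = TRUE)

## Area of a zone = number of pixels in it times the area of one pixel
pixel_area <- 1 / length(Z)
areas <- table(Zcut) * pixel_area
areas
```

Since 128 * 128 = 16384 is not divisible by 3, the zones cannot all have area exactly 1/3; a zone with 5461 pixels has area 5461/16384, which is about 0.333313.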
I'm not exactly sure what you're after, but you can count up the number of pixels in each color using the table() function.
table(Zcut[[1]])
I want to find the total area from multiple polygons within different contour lines from kernel densities (kde2d).
Here is an image of the kernel density and the 50% contour line. How do I calculate the area within the 50% contour line?
I also created a matrix of lat lon coordinates, which represents the points within this 50% contour line. Would it be easier to calculate the total area using these points?
Any suggestions would be greatly appreciated!
Once you have your coordinates in a cartesian system, and have done the kernel smoothing using those coordinates, you can use the contourLines function to get the coordinates of the lines, and then the areapl function from the splancs package to compute the area of each simple ring.
For example, using the example in help(kde2d):
library(MASS) # provides kde2d and the geyser data
attach(geyser)
plot(duration, waiting, xlim = c(0.5,6), ylim = c(40,100))
f1 <- kde2d(duration, waiting, n = 50, lims = c(0.5, 6, 40, 100))
image(f1)
contour(f1)
so that's our data set up - suppose we want the area in the 0.008 contour:
C8 = contourLines(f1,level=0.008)
length(C8)
[1] 3
Now C8 is a list of length 3. We need to apply the areapl function over each of these:
library(splancs) # provides areapl
sapply(C8,function(ring){areapl(cbind(ring$x,ring$y))})
[1] 14.65282 12.27329 14.75005
And we can obviously sum:
sum(sapply(C8,function(ring){areapl(cbind(ring$x,ring$y))}))
[1] 41.67617
Now this only makes sense if the coordinates are cartesian, and if the contour lines are complete loops. If the 0.008 contour were near the edge then it's possible for the contour to get clipped to the bounding box, and then bad things happen. Check at least that the last point of each ring is the same as the first.
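The closure check can be sketched in base R like this. The surface is a synthetic Gaussian bump of my own (not the geyser density), chosen because the contour at exp(-0.5) is a circle of radius 1, so the area should come out close to pi. is_closed and ring_area are illustrative helpers, with a plain shoelace formula standing in for areapl.

```r
## Synthetic surface: the contour at exp(-0.5) is the unit circle
g <- seq(-3, 3, length.out = 101)
z <- outer(g, g, function(x, y) exp(-(x^2 + y^2) / 2))
rings <- contourLines(x = g, y = g, z = z, levels = exp(-0.5))

## A ring is closed if its last point coincides with its first
is_closed <- function(ring) {
  n <- length(ring$x)
  isTRUE(all.equal(c(ring$x[1], ring$y[1]), c(ring$x[n], ring$y[n])))
}

## Unsigned shoelace area of one ring
ring_area <- function(ring) {
  x <- ring$x
  y <- ring$y
  abs(sum(x * c(y[-1], y[1]) - c(x[-1], x[1]) * y)) / 2
}

closed <- vapply(rings, is_closed, logical(1))
areas <- vapply(rings[closed], ring_area, numeric(1))
total_area <- sum(areas)
```

Any ring with closed equal to FALSE has been clipped by the bounding box, and its area should not be trusted without widening the lims passed to kde2d.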