I am plotting a graph using the following piece of code:
library (ggplot2)
png (filename = "graph.png")
stats <- read.table("processed-r.dat", header=T, sep=",")
attach (stats)
stats <- stats[order(best), ]
sp <- stats$A / stats$B
index <- seq (1, sum (sp >= 1.0))
stats <- data.frame (x=index, y=sp[sp>=1.0])
ggplot (data=stats, aes (x=x, y=y, group=1)) + geom_line()
dev.off ()
1 - How can I add a vertical line to the plot at the point where y takes a particular value (for example 2)?
2 - How can I make the y-axis start at 0.5 instead of 1?
You can add a vertical line with geom_vline(). In your case:
+ geom_vline(xintercept=2)
If you also want 0.5 to appear on your y axis, add scale_y_continuous() and set limits= and breaks=:
+ scale_y_continuous(breaks=c(0.5,1,2,3,4,5),limits=c(0.5,6))
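Putting both pieces together, here is a minimal self-contained sketch. The `demo` data frame is made up for illustration (it is not the asker's data), and the dashed line type is only a styling choice.

```r
library(ggplot2)

# Synthetic stand-in for the asker's data: a decreasing curve with y >= 1
set.seed(1)
demo <- data.frame(x = 1:50,
                   y = sort(runif(50, min = 1, max = 5), decreasing = TRUE))

p <- ggplot(demo, aes(x = x, y = y)) +
  geom_line() +
  # vertical line at x = 2
  geom_vline(xintercept = 2, linetype = "dashed") +
  # make the y axis run from 0.5 to 6 and label 0.5 explicitly
  scale_y_continuous(breaks = c(0.5, 1, 2, 3, 4, 5), limits = c(0.5, 6))

print(p)
```

Keep in mind that limits= on a scale discards data outside the range; in this sketch all y values lie between 0.5 and 6, so nothing is dropped.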
Regarding the first question:
This answer assumes that the y value you want is actually present in your data set. First, let's create a reproducible example, as I cannot access your data set:
set.seed(9999)
stats <- data.frame(y = sort(rbeta(250, 1, 10)*10 ,decreasing = TRUE), x = 1:250)
ggplot(data=stats, aes (x=x, y=y, group=1)) + geom_line()
What you need to do is to use the y column in your data frame to search for the specific value. Essentially you will need to use
ggplot(data=stats, aes (x=x, y=y, group=1)) + geom_line() +
geom_vline(xintercept = stats[stats$y == 2, "x"])
Using the data I generated above, here's an example. Since my data frame does not likely contain the exact value 2, I will use the trunc function to search for it:
stats[trunc(stats$y) == 2, ]
# y x
# 9 2.972736 9
# 10 2.941141 10
# 11 2.865942 11
# 12 2.746600 12
# 13 2.741729 13
# 14 2.693501 14
# 15 2.680031 15
# 16 2.648504 16
# 17 2.417008 17
# 18 2.404882 18
# 19 2.370218 19
# 20 2.336434 20
# 21 2.303528 21
# 22 2.301500 22
# 23 2.272696 23
# 24 2.191114 24
# 25 2.136638 25
# 26 2.067315 26
Now we know where all the values that truncate to 2 are. Since this graph is decreasing, the value closest to 2 is in the last of these rows:
tail(stats[trunc(stats$y) == 2, ], 1)
# y x
# 26 2.067315 26
And we can use that value to specify where the x intercept should be:
ggplot(data=stats, aes (x=x, y=y, group=1)) + geom_line() +
geom_vline(xintercept = rev(stats[trunc(stats$y) == 2, "x"])[1])
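As an alternative to trunc(), one could look up the row whose y is nearest to the target value. This is only a sketch of that idea: `target`, `idx` and `x_at_target` are names introduced here, and because "nearest" can fall on either side of 2, it may select a different row than the trunc() approach.

```r
library(ggplot2)

# Same reproducible data as above
set.seed(9999)
stats <- data.frame(y = sort(rbeta(250, 1, 10) * 10, decreasing = TRUE),
                    x = 1:250)

target <- 2
# index of the row whose y is closest to the target (above or below it)
idx <- which.min(abs(stats$y - target))
x_at_target <- stats$x[idx]

p <- ggplot(stats, aes(x = x, y = y)) +
  geom_line() +
  geom_vline(xintercept = x_at_target)
print(p)
```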
Hope that helps!
I have data frame, for example
df <- data.frame(x = 1:1e3, y = rnorm(1e3))
I need to split the points into N rectangles (in my case N = 6, 12 and 24), each containing an equal number of points. How can I split my df using an R-tree algorithm?
For data uniformly distributed along the x axis, kmeans clustering works (unsurprisingly) well:
library(dplyr)
library(ggplot2)
set.seed(1)
df <- data.frame(x = 1:1e3, y = rnorm(1e3))
N <- 10
df$cluster <- kmeans(df,N)$cluster
cluster_rectangles <- df %>% group_by(cluster) %>%
summarize(xmin = min(x),
xmax = max(x),
ymin = min(y),
ymax = max(y),
n = n())
ggplot() + geom_rect(data = cluster_rectangles, mapping=aes(xmin=xmin, xmax=xmax, ymin=ymin, ymax=ymax, fill=cluster)) +
geom_point(data = df,mapping=aes(x,y),color='white')
It also works if the x distribution is normal:
df <- data.frame(x = rnorm(1e3), y = rnorm(1e3))
The drawback is that the number of points in each rectangle varies:
> cluster_rectangles %>% select(cluster,n)
# A tibble: 10 x 2
cluster n
<int> <int>
1 1 137
2 2 58
3 3 121
4 4 61
5 5 72
6 6 184
7 7 78
8 8 70
9 9 126
10 10 93
For a uniform distribution, the result is quite good (with N=9):
If all the points have different x coordinates, as is the case in your example, sort the points in increasing order of x. Note that, in this case, your problem of finding a covering of the 2D points with rectangles (with equal numbers of points) can be simplified to finding a covering of 1D points with segments (i.e. you can ignore the height of the rectangles).
Here is how you can find the points in each rectangle:
num_rect <- 7 # In your example 6, 12 or 24
num_points <- 10 # In your example 1e3
# Already ordered according to x
df <- data.frame(x = 1:num_points, y = rnorm(num_points))
# Minimum number of points in the rectangles to cover all of them
points_in_rect <- ceiling(num_points/num_rect)
# Cover the first points using non-overlapping rectangles
breaks <- seq(0,num_points, by=points_in_rect)
cover <- split(seq(num_points), cut(seq(num_points), breaks))
names(cover) <- paste0("rect", seq(length(cover)))
# Cover the last points using overlapping rectangles
cur_num <- length(cover)
if (num_points < num_rect*points_in_rect ) {
  # To avoid duplicate rectangles: when num_points is an exact multiple of
  # points_in_rect, the last non-overlapping rectangle already ends at
  # num_points, so start one point earlier
  last <- num_points
  if (num_points %% points_in_rect == 0)
    last <- last - 1
  while (cur_num < num_rect) {
    cur_num <- cur_num + 1
    new_rect <- list(seq(last-points_in_rect+1, last))
    names(new_rect) <- paste0("rect", cur_num)
    cover <- c(cover,new_rect)
    last <- last - points_in_rect
  }
}
The points in the rectangles are:
$rect1
[1] 1 2
$rect2
[1] 3 4
$rect3
[1] 5 6
$rect4
[1] 7 8
$rect5
[1] 9 10
$rect6
[1] 8 9
$rect7
[1] 6 7
The minimum bounding rectangles (parallel to the axes) that enclose those sets of points are the ones you are looking for.
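If the rectangles are allowed to differ in size by at most one point, the same 1D idea can be written more compactly: sort by x, then assign consecutive runs of points to groups and take each group's bounding box. The sketch below is an alternative under that assumption, not a rewrite of the overlapping-cover code above; the `rect` column and the `rects` summary are names introduced here.

```r
set.seed(1)
df <- data.frame(x = 1:1e3, y = rnorm(1e3))
N <- 6

# Rank points by x and cut the ranks into N consecutive, near-equal runs
ord <- order(df$x)
df$rect <- integer(nrow(df))
df$rect[ord] <- ceiling(seq_len(nrow(df)) * N / nrow(df))

# Axis-parallel minimum bounding rectangle of each group
rects <- do.call(rbind, lapply(split(df, df$rect), function(g) {
  data.frame(rect = g$rect[1],
             xmin = min(g$x), xmax = max(g$x),
             ymin = min(g$y), ymax = max(g$y),
             n = nrow(g))
}))
rects
```

Because the groups are consecutive in x, the rectangles do not overlap along the x axis, and the group sizes differ by at most one when N does not divide the number of points.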
Duplicated coordinate values in both axes
Randomly rotate the points (saving the rotation angle) and check that no x (or y) coordinates are duplicated. If so, use the above strategy with the rotated coordinates (remember to sort the rotated points by their new x coordinates first), and then rotate the resulting rectangles back in the opposite direction. If duplicated coordinates remain in both axes, rotate the points again with a different (random) angle. Since you have a finite number of points, you can always find a rotation angle that separates the x (or y) coordinates.
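The rotation step could look roughly like this in base R. It is a sketch only: `rotate_points` is a helper defined here (not from any package), the 3 x 3 grid is toy data chosen because it has duplicates in both axes, and exact floating-point comparison via anyDuplicated() is assumed to be good enough.

```r
# Rotate the rows of a two-column matrix counter-clockwise by theta (radians)
rotate_points <- function(xy, theta) {
  rot <- matrix(c(cos(theta), sin(theta),
                  -sin(theta), cos(theta)), nrow = 2)
  xy %*% t(rot)
}

set.seed(42)
# Toy points with duplicated x and duplicated y coordinates
pts <- unname(as.matrix(expand.grid(1:3, 1:3)))

# Try random angles until all rotated x coordinates are distinct
repeat {
  theta <- runif(1, min = 0, max = pi / 2)
  rotated <- rotate_points(pts, theta)
  if (anyDuplicated(rotated[, 1]) == 0) break
}

# Sort by the new x coordinate before building the cover
ord <- order(rotated[, 1])

# Rotating by -theta undoes the rotation (up to rounding error)
restored <- rotate_points(rotated, -theta)
```

The same `rotate_points(..., -theta)` call can be applied to the corner points of the rectangles found in the rotated frame to bring them back to the original coordinates.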
I have Test data as below;
Test
x y
1 4324.3329 484.6496
3 3258.4572 499.9621
4 4462.8230 562.7703
7 5173.4353 572.9492
8 4188.0244 530.8349
9 3557.5385 494.6672
10 2353.1382 517.5235
11 4944.2605 537.7489
15 3335.6628 488.4479
16 4059.0555 534.5479
17 4694.1778 531.7709
18 3213.8639 496.0062
19 4119.5348 516.3399
20 4267.7457 537.1041
22 4284.2706 503.8527
23 3019.6271 498.8519
35 2549.8743 503.5473
36 4976.5386 566.5985
37 2717.9942 513.2320
38 3545.2092 448.4752
40 3352.3206 457.7265
41 3198.0481 560.4075
42 1387.7531 395.7657
43 957.6421 296.1419
44 3168.8167 489.5333
45 2717.1015 478.6760
46 3694.8913 455.2763
47 4131.9760 519.9161
48 4366.2339 502.5977
49 4314.1003 486.7103
50 3818.1977 461.5844
52 3745.0532 467.7885
I draw a scatter plot as follows:
library(ggplot2)
library(ggExtra)  # provides ggMarginal()
gg <- ggplot(Test, aes(x = x, y = y))+
  geom_point()+
  stat_ellipse()
ggMarginal(
gg,
type = "boxplot",
margins = "both",
size = 5
)
print(gg)
It seems like there are two groups;
(1) at right-top with large number of points
(2) at left-bottom with two points.
In this case, how can I divide the data into two groups?
I have tried k-means clustering as follows:
#k-mean
km <- kmeans(Test,2)
library(cluster)
clusplot(Test, km$cluster, color=TRUE, shade=TRUE, labels=2, lines=0)
But, this changes x-y coordinates into PC1 & PC2, which is not what I want in this case.
For example,
set.seed(42)
km <- kmeans(Test,2)
ggplot(Test, aes(x = x, y = y,colour = factor(km$cluster)))+
geom_point()+
stat_ellipse(type = "norm", linetype = 2)
gives a scatter plot in the original x-y coordinates, with the points coloured by cluster.
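One thing to watch with kmeans() on raw coordinates is that x spans a much wider range than y here, so x dominates the distance. A possible variation, shown as a sketch rather than a recommendation, is to standardise both columns with scale() first. The data frame below contains only ten of the rows listed in the question (copied verbatim), and nstart = 25 is an arbitrary choice; whether the two low points end up alone in one cluster depends on the data.

```r
# Ten of the rows from the question, including the two low points (42, 43)
Test <- data.frame(
  x = c(4324.3329, 3258.4572, 4462.8230, 5173.4353, 4188.0244,
        3557.5385, 2353.1382, 4944.2605, 1387.7531, 957.6421),
  y = c(484.6496, 499.9621, 562.7703, 572.9492, 530.8349,
        494.6672, 517.5235, 537.7489, 395.7657, 296.1419),
  row.names = c("1", "3", "4", "7", "8", "9", "10", "11", "42", "43")
)

set.seed(42)
# Standardise x and y so that neither dominates the Euclidean distance
km <- kmeans(scale(Test), centers = 2, nstart = 25)
Test$cluster <- factor(km$cluster)

# Row names of the points assigned to each group
split(rownames(Test), Test$cluster)
```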
I am trying to plot six histograms (2 columns of data (calories, sodium) x 3 types (beef, meat, poultry)) with these data, and I want to give them the same scale on the x and y axes. I'm using scale_x_continuous to limit the x axis, which, according to various sources, removes data that won't appear on the plot. Here is my code:
#src.table is the data frame containing my data
histogram <- function(df, dataset, n_bins, label) {
ggplot(df, aes(x=df[[dataset]])) +
geom_histogram(color="darkblue", fill="lightblue", bins = n_bins) + xlab(label)
}
src2_12.beef <- src2_12.table[src2_12.table$Type == "Beef",]
src2_12.meat <- src2_12.table[src2_12.table$Type == "Meat",]
src2_12.poultry <- src2_12.table[src2_12.table$Type == "Poultry",]
src2_12.calories_scale <- lims(x = c(min(src2_12.table$Calories), max(src2_12.table$Calories)), y = c(0, 6))
src2_12.sodium_scale <- lims(x = c(min(src2_12.table$Sodium), max(src2_12.table$Sodium)), y = c(0, 6))
#src2_12.calories_scale <- lims()
#src2_12.sodium_scale <- lims()
src2_12.plots <- list(
histogram(src2_12.beef, "Calories", 10, "Calories-Beef") + src2_12.calories_scale,
histogram(src2_12.meat, "Calories", 10, "Calories-Meat") + src2_12.calories_scale,
histogram(src2_12.poultry, "Calories", 10, "Calories-Poultry") + src2_12.calories_scale,
histogram(src2_12.beef, "Sodium", 10, "Sodium-Beef") + src2_12.sodium_scale,
histogram(src2_12.meat, "Sodium", 10, "Sodium-Meat") + src2_12.sodium_scale,
histogram(src2_12.poultry, "Sodium", 10, "Sodium-Poultry") + src2_12.sodium_scale
)
multiplot(plotlist = src2_12.plots, cols = 2, layout = matrix(c(1, 2, 3, 4, 5, 6), nrow = 2, byrow = TRUE))
Here is the output:
vs. what the data are supposed to look like:
I can't understand why some data points are missing, given that the limits I set are already the min and the max of the data.
You probably want to use coord_cartesian instead of lims. Unexpected things can happen when you're fiddling around with the limits on histograms, because a fair bit of fiddly transformations have to happen to get from your raw data to the actual histogram.
Let's peer under the hood for one example:
p <- ggplot(src2_12.beef,aes(x = Calories)) +
geom_histogram(bins = 10)
p1 <- ggplot(src2_12.beef,aes(x = Calories)) +
geom_histogram(bins = 10) +
lims(x = c(86,195))
a <- ggplot_build(p)
b <- ggplot_build(p1)
>a$data[[1]][,1:5]
y count x xmin xmax
1 1 1 114.1111 109.7222 118.5000
2 0 0 122.8889 118.5000 127.2778
3 3 3 131.6667 127.2778 136.0556
4 2 2 140.4444 136.0556 144.8333
5 5 5 149.2222 144.8333 153.6111
6 2 2 158.0000 153.6111 162.3889
7 0 0 166.7778 162.3889 171.1667
8 2 2 175.5556 171.1667 179.9444
9 3 3 184.3333 179.9444 188.7222
10 2 2 193.1111 188.7222 197.5000
> b$data[[1]][,1:5]
y count x xmin xmax
1 0 0 NA NA 90.83333
2 0 0 96.88889 90.83333 102.94444
3 1 1 109.00000 102.94444 115.05556
4 0 0 121.11111 115.05556 127.16667
5 4 4 133.22222 127.16667 139.27778
6 4 4 145.33333 139.27778 151.38889
7 4 4 157.44444 151.38889 163.50000
8 1 1 169.55556 163.50000 175.61111
9 4 4 181.66667 175.61111 187.72222
10 2 2 193.77778 187.72222 NA
So now you're wondering, how the heck did that happen, right?
Well, when you tell ggplot that you want 10 bins and the x limits go from 86 to 195, the histogram algorithm tries to create ten bins that span that actual range. That's why it's trying to create bins down below 100 even though there's no data there.
And then further oddities can happen because the bars may extend past the nominal data range (the xmin and xmax values), since the bar widths will generally encompass a little above and a little below your actual data at the high and low ends.
coord_cartesian will adjust the x limits after all this processing has happened, so it bypasses all these little quirks.
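For instance, a version of the beef histogram using coord_cartesian might look like the sketch below. The `beef` data frame is synthetic (the real src2_12.beef is not available here), and the limits are simply the ones used in the example above.

```r
library(ggplot2)

# Synthetic stand-in for src2_12.beef
set.seed(1)
beef <- data.frame(Calories = round(runif(20, min = 110, max = 195)))

p_zoom <- ggplot(beef, aes(x = Calories)) +
  geom_histogram(bins = 10) +
  # zoom the view only; the bins are still computed from the data's own range
  coord_cartesian(xlim = c(86, 195), ylim = c(0, 6))

# Inspect the computed bins: every observation is still counted
bins <- layer_data(p_zoom, 1)
sum(bins$count)  # equals nrow(beef)
```

Adding the same coord_cartesian() call to each of the six plots gives them a common visible range without changing how each histogram is binned.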
Suppose I have the following data on a student's test scores.
set.seed(1)
df <- data.frame(question = 0:10,
resp = c(NA,sample(c("Correct","Incorrect"),10,replace=TRUE)),
score.after.resp=50)
for (i in 1:10) {
ifelse(df$resp[i+1] == "Correct",
df$score.after.resp[i+1] <- df$score.after.resp[i] + 5,
df$score.after.resp[i+1] <- df$score.after.resp[i] - 5)
}
df
question resp score.after.resp
1 0 <NA> 50
2 1 Correct 55
3 2 Correct 60
4 3 Incorrect 55
5 4 Incorrect 50
6 5 Correct 55
7 6 Incorrect 50
8 7 Incorrect 45
9 8 Incorrect 40
10 9 Incorrect 35
11 10 Correct 40
I want to get the following graph:
library(ggplot2)
ggplot(df,aes(x = question, y = score.after.resp)) + geom_line() + geom_point()
My problem is: I want to colour the segments of this line according to the student's response. If the response is correct (increasing), the line segment should be green; if it is incorrect (decreasing), it should be red.
I tried the following code, but it did not work:
ggplot(df,aes(x = question, y = score.after.resp, color=factor(resp))) +
geom_line() + geom_point()
Any ideas?
I would probably approach this a little differently, and use geom_segment instead:
df1 <- as.data.frame(with(df,cbind(embed(score.after.resp,2),embed(question,2))))
colnames(df1) <- c('yend','y','xend','x')
df1$col <- ifelse(df1$y - df1$yend >= 0,'Decrease','Increase')
ggplot(df1) +
geom_segment(aes(x = x,y = y,xend = xend,yend = yend,colour = col)) +
geom_point(data = df,aes(x = question,y = score.after.resp))
A brief explanation:
I'm using embed to transform the x and y variables into starting and ending points for each line segment, and then simply adding a variable that indicates whether each segment went up or down. Then I used the previous data frame to add the original points themselves.
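To see what embed() does here, it may help to run it on a short vector. In this small illustration, `score`, `seg` and `alt` are names made up for the demo; the head/tail style indexing in `alt` is an equivalent way of pairing each value with its predecessor.

```r
score <- c(50, 55, 60, 55)

# embed(x, 2) pairs each value (column 1) with the one before it (column 2)
seg <- embed(score, 2)
colnames(seg) <- c("yend", "y")
seg
#      yend  y
# [1,]   55 50
# [2,]   60 55
# [3,]   55 60

# The same pairs built by dropping the first and the last element
alt <- cbind(yend = score[-1], y = score[-length(score)])
```

Each row is one line segment: it starts at `y` and ends at `yend`, which is why `y - yend >= 0` marks a decrease.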
Alternatively, I suppose you could use geom_line something like this:
df$resp1 <- c(as.character(df$resp[-1]),NA)
ggplot(df,aes(x = question, y = score.after.resp, color=factor(resp1),group = 1)) +
geom_line() + geom_point(color = "black")
By default ggplot2 groups the data according to the aesthetics that are mapped to factors. You can override this default by setting group explicitly,
last_plot() + aes(group=NA)
I would like to know exactly what geom_density() is doing, so that I can justify the graph, and whether there is any way of extracting the function or the points it generates for each of the curves being plotted.
Thanks
Typing get("compute_group", ggplot2::StatDensity) (or, formerly, get("calculate", ggplot2:::StatDensity)) will get you the algorithm used to calculate the density. (At root, it's a call to density() with kernel="gaussian" the default.)
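A rough base R equivalent of that calculation is sketched below, using the built-in `faithful` data set. Restricting the evaluation grid to the observed range with from= and to= is an assumption about what the stat does internally, as are the bw = "nrd0" and n = 512 settings (these are the defaults of density() itself).

```r
x <- faithful$eruptions

# Gaussian kernel density estimate evaluated on 512 points over range(x)
d <- density(x, kernel = "gaussian", bw = "nrd0", n = 512,
             from = min(x), to = max(x))

# The (x, y) pairs that trace the density curve
curve_points <- data.frame(x = d$x, y = d$y)
head(curve_points, 3)
```

Plotting `curve_points` with type = "l" gives the kind of smooth curve that a density layer draws.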
The points used in the plot are invisibly returned by print.ggplot(), so you can access them like this:
library(ggplot2)
library(ggplot2movies)  # the movies data set now lives in this package
m <- ggplot(movies, aes(x = rating))
m <- m + geom_density()
p <- print(m)
head(p$data[[1]], 3)
# y x density scaled count PANEL group ymin ymax
# 1 0.0073761 1.0000 0.0073761 0.025917 433.63 1 1 0 0.0073761
# 2 0.0076527 1.0176 0.0076527 0.026888 449.88 1 1 0 0.0076527
# 3 0.0078726 1.0352 0.0078726 0.027661 462.81 1 1 0 0.0078726
## Just to show that those are the points you are after,
## extract and use them to create a lattice xyplot
library(gridExtra)
library(lattice)
mm <- xyplot(y ~x, data=p$data[[1]], type="l")
As suggested in other answers, you can access the ggplot points using print.ggplot(). However, print()-ing the object also draws the plot, which may not be desired.
You can extract the ggplot object's data without printing the plot by using ggplot_build():
library(ggplot2)
library(ggplot2movies)
m <- ggplot(movies, aes(x = rating))
m <- m + geom_density()
p <- ggplot_build(m) # <---- INSTEAD OF `p <- print(m)`
head(p$data[[1]], 3)
# y x density scaled count n PANEL group ymin
# 1 0.007376115 1.000000 0.007376115 0.02591684 433.6271 58788 1 -1 0
# 2 0.007652653 1.017613 0.007652653 0.02688849 449.8842 58788 1 -1 0
# 3 0.007872571 1.035225 0.007872571 0.02766120 462.8127 58788 1 -1 0
# Just to show that those are the points you are after, extract and use them
# to create a lattice xyplot
library(lattice)
m2 <- xyplot(y ~x, data=p$data[[1]], type="l")
library(gridExtra)
grid.arrange(m, m2, nrow=1)