I want to draw three lines on a graph, overlaid on the data points that I used in a discriminant function analysis. From my analysis, I have two points that fall on each line. The lines represent the probability contours of the classification scheme; exactly how I got the points on each line is not relevant to my question here. However, I want the lines to extend further than the points that define them.
library(ggplot2)
df <-
  data.frame(Prob = rep(c("5", "50", "95"), each=2),
             Wing = rep(c(107,116), 3),
             Bill = c(36.92055, 36.12167, 31.66012, 30.86124, 26.39968, 25.6008))
ggplot()+
  geom_line(data=df, aes(x=Bill, y=Wing, group=Prob, color=Prob))
The above df is a dataframe for my points from which the three lines are constructed. I want the lines to extend from y=105 to y=125.
Thanks!
There are probably more idiomatic ways of doing it, but this is one way to get it done.
In short, you calculate the slope and intercept of the straight line through each pair of points, i.e. y = mx + c, and pass them to geom_abline.
library(dplyr)
library(ggplot2)
df_withFormula <- df |>
  group_by(Prob) |>
  # This mutate call creates the slope and intercept needed by geom_abline in the plotting stage.
  mutate(increaseBill = Bill - lag(Bill),
         increaseWing = Wing - lag(Wing),
         slope = increaseWing/increaseBill,
         intercept = Wing - slope*Bill)
# increaseBill, increaseWing and slope could all be combined into one calculation, but I thought it was easier to understand this way.
ggplot(df_withFormula, aes(Bill, Wing, color = Prob)) +
  # Added just so there is something to plot on top of. You could remove this and instead define all the limits manually (expand_limits would work).
  geom_point() +
  # This plots the three lines. The rows with NA (the first row of each group) are dropped with a warning. The NA could be handled more explicitly in the data prep stage.
  geom_abline(aes(slope = slope, intercept = intercept, color = Prob)) +
  # This is the crucial part: it lets you define the range of the plot window. As ablines are infinite, you can define whatever limits you want.
  expand_limits(y = c(105,125))
Hope this helps you get the graph you want.
This depends very much on the structure of your data, but it could be adapted to fit other shapes.
Similar to the approach by @James in that I compute the slopes and the intercepts from the given data and use geom_abline to plot the lines, but this version uses
summarise instead of mutate to get rid of the NA values, and
geom_blank instead of geom_point so that only the lines are displayed and not the points (note: another geom is needed to set the scale, i.e. the range of the data, for the lines to show up).
library(dplyr)
library(ggplot2)
df_line <- df |>
group_by(Prob) |>
summarise(slope = diff(Wing) / diff(Bill),
intercept = first(Wing) - slope * first(Bill))
ggplot(df, aes(x = Bill, y = Wing)) +
geom_blank() +
geom_abline(data = df_line, aes(slope = slope, intercept = intercept, color = Prob)) +
scale_y_continuous(limits = c(105, 125))
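If the lines should start and stop exactly at Wing = 105 and Wing = 125 rather than run across the whole panel, another option is to compute the end points yourself and draw them with geom_line. The following is a minimal sketch under that assumption: it inverts Wing = slope * Bill + intercept for each contour. The helper extend_line and the names wing_range and df_ext are made up for illustration and do not come from either answer.

```r
library(ggplot2)

# Data from the question: two points per probability contour.
df <- data.frame(
  Prob = rep(c("5", "50", "95"), each = 2),
  Wing = rep(c(107, 116), 3),
  Bill = c(36.92055, 36.12167, 31.66012, 30.86124, 26.39968, 25.6008)
)

# Wing (y) values at which every line should start and end.
wing_range <- c(105, 125)

# Hypothetical helper: fit the line through the two points of one contour,
# then solve Wing = slope * Bill + intercept for Bill at the target Wing values.
extend_line <- function(d, wing_out) {
  slope <- diff(d$Wing) / diff(d$Bill)
  intercept <- d$Wing[1] - slope * d$Bill[1]
  data.frame(Prob = d$Prob[1],
             Wing = wing_out,
             Bill = (wing_out - intercept) / slope)
}

# Apply the helper to each contour and stack the results.
df_ext <- do.call(rbind, lapply(split(df, df$Prob), extend_line, wing_out = wing_range))

# Draw the extended segments and overlay the original defining points.
p <- ggplot(df_ext, aes(x = Bill, y = Wing, group = Prob, color = Prob)) +
  geom_line() +
  geom_point(data = df)
```

Printing p shows the plot. Unlike geom_abline, the segments here are ordinary data, so they affect the axis limits on their own and no extra geom or expand_limits call is needed.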
I have a dataset df that I am plotting as a scatterplot. The code is as follows:
g <- ggplot() + theme_bw() +
  geom_point(data = df, aes(df[,1], df[,2]), color = 'red') +
  geom_smooth(data = df, aes(df[,1], df[,2]), formula = y ~ splines::bs(x, df = input$degree_1), method = "lm", color = "green3", level = 1, size = 0.5)
input$degree_1 is the slider to change the degree of the polynomial fit.
Secondly, I am extracting the data points of the smoothed curve like this:
r <- ggplot_build(g)$data[[2]]
Now, I want to cut that smoothed curve using two vertical lines and extract the data points of the curve lying between those two lines:
v_f1 <- subset(r, x > input$Vert1 & x < input$Vert2, select = c(x,y))
input$Vert1 and input$Vert2 are the sliders that change the positions of the vertical lines.
What I want:
I want to be able to control the number of points that are subsetted and extracted by those vertical lines in the above command.
For now, it extracts an arbitrary number of points; I want the user to be able to control that. For example, I might want to cut the profile and extract 100 points in one case and 120 points in another, and so on. Or I could just set a fixed number for all cases.
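One way to get a fixed number of points is to resample the extracted curve at evenly spaced x values between the two lines. The sketch below uses base R's approx for linear interpolation. The data frame r stands in for ggplot_build(g)$data[[2]], and vert1, vert2 and n_points are hypothetical stand-ins for input$Vert1, input$Vert2 and a user-chosen count. Unlike the subset call in the question, it includes the two end positions themselves.

```r
# Hypothetical stand-in for ggplot_build(g)$data[[2]]: a curve sampled at 80 x values.
r <- data.frame(x = seq(0, 10, length.out = 80))
r$y <- sin(r$x)

# Hypothetical stand-ins for the two slider positions and the requested point count.
vert1 <- 2
vert2 <- 7
n_points <- 100

# Evenly spaced x positions between the two vertical lines.
x_out <- seq(vert1, vert2, length.out = n_points)

# Linearly interpolate the curve at those positions; approx returns a list with x and y.
v_f1 <- as.data.frame(approx(r$x, r$y, xout = x_out))
```

The point count in the question looks arbitrary because geom_smooth evaluates the fit on a fixed grid over the whole x range (its n argument, 80 by default), so the number of grid points that fall between the lines depends on where the lines sit. Raising n makes the interpolation above track the fitted curve more closely.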
I tried to plot the distribution of my test and train data set in a histogram and found something curious:
Background:
I have a test set with 50 rows and a training set with 100 rows each with the same column structure.
I'd normally plot the data like this:
plot2 <- ggplot(data=Donald_1) +
geom_histogram(aes_string(x = "Alter", y = "..count..", fill = "Group"),
bins=20, alpha=0.7)
which results in the right histogram shown below. I then wondered how it could be that test has a higher count than training as the test set is only 50 rows instead of 100. And it seems as if the test bars show the sum of the test and training bars of the left plot.
Then I tried:
plot1 <- ggplot() +
geom_histogram(data=Donald_1 %>% filter(Group == "Training"),
aes_string(x="Alter", y="..count..", fill = "Group"),
bins=20, alpha=0.7) +
geom_histogram(data=Donald_1 %>% filter(Group == "Test"),
aes_string(x="Alter", y="..count..", fill="Group"),
bins=20, alpha=0.7)
which results in the left plot shown below, and that result makes more sense to me.
I now wonder why the first attempt doesn't result in the same plot as the second attempt. Am I missing something obvious here?
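Your data frame has a single column, "Group", that holds both values, Training and Test.
When you map it to fill in one geom_histogram layer, ggplot draws one histogram with two groups and, because the default position for geom_histogram is "stack", stacks the bars on top of each other. The Test bars therefore sit on top of the Training bars, so their tops show the sum of both counts.
Your second plot draws two separate histogram layers on the same axes, each starting from zero, and the transparency (alpha) lets you see where they overlap.
You may prefer this version, which places the bars side by side: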
plot3 <- ggplot(data=Donald_1) +
geom_histogram(aes_string(x = "Alter", y = "..count..", fill = "Group"),
bins=20, alpha=0.7, position="dodge")
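To reproduce the overlaid look of the two-layer plot within a single layer, position = "identity" is another option. This is a small sketch on simulated data: the Donald_1 built here is only a stand-in with the same column names as in the question, and plot4 is a name chosen for illustration.

```r
library(ggplot2)

set.seed(1)
# Simulated stand-in for Donald_1: 100 training rows and 50 test rows.
Donald_1 <- data.frame(
  Alter = c(rnorm(100, mean = 40, sd = 10), rnorm(50, mean = 40, sd = 10)),
  Group = rep(c("Training", "Test"), times = c(100, 50))
)

# position = "identity" draws each group's bars from zero instead of stacking them,
# so the two histograms overlap and alpha makes both visible.
plot4 <- ggplot(Donald_1, aes(x = Alter, fill = Group)) +
  geom_histogram(bins = 20, alpha = 0.7, position = "identity")
```

Because both groups are binned within the same layer, they share bin boundaries, which is not guaranteed when each group is drawn in its own geom_histogram layer.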
Background
I am playing with box plots and violin plots in ggplot2, but I have found some odd behaviour that occurs only when there are fewer than four unique data values. I am not sure whether SO is the proper place for this thread; if not, please guide me to the right place.
Single data point: plot is not rendered
df <- data.frame(state = "bedtime", value = 100)
Box plot
ggplot(aes(x = state, y = value), data = df) + geom_boxplot() + geom_point()
Violin plot
ggplot(aes(x = state, y = value), data = df) + geom_violin()
Nothing. Received a warning message.
Two to three data points: plot is sometimes rendered
If it's not rendered, the behaviour is the same as for a single data point. If it is rendered, the quantile lines are inconsistent.
df <- data.frame(state = rep("after_meal", 4), value = rep(c(178, 162), each = 2))
Box plot
ggplot(aes(x = state, y = value), data = df) + geom_boxplot() + geom_point()
Violin plot
ggplot(aes(x = state, y = value), data = df) + geom_violin(draw_quantiles = c(0.25, 0.5, 0.75))
As you can see, the quantile lines are inconsistent with each other.
Questions
Why isn't the violin plot shown when there's only one data point? I looked up kernel density estimation, and I thought there should be a very wide but flat violin. Are there other limitations or constraints in geom_violin? Or is it a rule of violin plots?
Why are the 25% and 75% quantiles put at different places between a box plot and a violin plot in the second case?
A violin plot is a kernel density estimate mirrored about the vertical axis, whereas a box plot summarises the data directly through its quantiles.
As to your first question: with one point the density is infinite, because all the mass sits at a single location of zero width, i.e. it has infinite height (to see this, replace geom_violin with geom_density).
The second issue stems from the same thing: a box plot is more accurate for a small number of points, because a density estimate is continuous and is not well defined over a very short range.
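The difference between the two kinds of quantile can be checked outside ggplot2. The sketch below compares sample quantiles with quantiles read off a kernel density estimate restricted to the data range. It only approximates what a trimmed violin does and is not ggplot2's own code; the names q_data and q_dens are chosen for illustration.

```r
# Data from the question: two unique values, each repeated twice.
value <- rep(c(178, 162), each = 2)
probs <- c(0.25, 0.5, 0.75)

# Sample quantiles computed directly from the data, as a box plot uses.
q_data <- quantile(value, probs, names = FALSE)
# q_data is 162 170 178

# Kernel density estimate evaluated only between the smallest and largest value.
dens <- density(value, from = min(value), to = max(value))

# Turn the density into a cumulative distribution and invert it at the same probabilities.
cdf <- cumsum(dens$y) / sum(dens$y)
q_dens <- approx(cdf, dens$x, xout = probs)$y
```

The sample quartiles coincide with the two observed values, while the density-based quartiles fall strictly between them, because the smoothed density spreads mass over the whole interval from 162 to 178.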
I am attempting to place individual points on a plot using ggplot2, however as there are many points, it is difficult to gauge how densely packed the points are. Here, there are two factors being compared against a continuous variable, and I want to change the color of the points to reflect how closely packed they are with their neighbors. I am using the geom_point function in ggplot2 to plot the points, but I don't know how to feed it the right information on color.
Here is the code I am using:
library(ggplot2)
s1 = rnorm(1000, 1, 10)
s2 = rnorm(1000, 1, 10)
# One task label per simulated value: 1000 for s1, then 1000 for s2.
data = data.frame(task_number = as.factor(rep(c(1, 2), each = 1000)),
                  S = c(s1, s2))
ggplot(data, aes(x = task_number, y = S)) + geom_point()
Which generates this plot:
However, I want it to look more like this image, but with one dimension rather than two (which I borrowed from this website: https://slowkow.com/notes/ggplot2-color-by-density/):
How do I change the colors of the first plot so it resembles that of the second plot?
I think the tricky thing about this is you want to show the original values, and evaluate the density at those values. I borrowed ideas from here to achieve that.
library(dplyr)
data = data %>%
group_by(task_number) %>%
# Use approxfun to interpolate the density back to
# the original points
mutate(dens = approxfun(density(S))(S))
ggplot(data, aes(x = task_number, y = S, colour = dens)) +
geom_point() +
scale_colour_viridis_c()
Result:
One could, of course, come up with a measure of proximity to neighbouring values for each value... However, wouldn't adjusting the transparency achieve basically the same goal (gauging how densely packed the points are)?
geom_point(alpha=0.03)
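For the density-colouring answer above, the key step is approxfun(density(S))(S): density returns the estimate on a regular grid, and approxfun turns that grid into a function that can be evaluated at the original values. Here is a base R sketch of that step on one simulated group; the names kde and dens_at_points are chosen for illustration.

```r
set.seed(42)
# One simulated group, drawn like s1 in the question.
S <- rnorm(1000, 1, 10)

# Kernel density estimate on a regular grid (512 points by default).
kde <- density(S)

# Build an interpolating function from the grid and evaluate it at the original values.
dens_at_points <- approxfun(kde)(S)

# Compare the density at the most central and the most extreme observation.
central <- dens_at_points[which.min(abs(S - mean(S)))]
extreme <- dens_at_points[which.max(abs(S - mean(S)))]
```

Each observation gets one density value, with larger values where the points are tightly packed, and that value is what the colour scale maps. With the default settings the grid extends beyond the range of the data, so no observation falls outside it and no NA values are produced.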
I have 28 variables, M1, M2, ..., M28, for which I compute a certain statistic x.
df = data.frame(model = paste("M", 1:28, sep = ""), x = runif(28, 1, 1.05))
levels = seq(0.8, 1.2, 0.05)
I would like to plot this data as follows:
Each circle (contour) represents a level of that statistic "x". The three blue lines simply represent three different scenarios.
The dataframe included in this example represents one scenario. The blue line would simply join the values of all the models M1 to M28 for that specific scenario.
Is there any tool in R that allows for such a plot? I tried contour() from library(MASS) but the contours are not drawn as perfect circles.
Any help would be appreciated. Thanks!
Here is a ggplot solution:
library(ggplot2)
ggplot(data=df, aes(x=model, y=x, group=1)) +
geom_line() + coord_polar() +
scale_y_continuous(limits=range(levels), breaks=levels, labels=levels)
Note this is a little confusing because of the names in your data frame. x is really the y variable here, and model the real x, so the graph scale label seems odd.
EDIT: I had to set your factor levels for model in the data frame so they plot in the correct order.
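The factor-level step mentioned in the edit is not shown, so here is a minimal sketch of one way to do it. It assumes the intended order is M1, M2, ..., M28 and sets the levels explicitly in that order.

```r
# Data as defined in the question.
df <- data.frame(model = paste("M", 1:28, sep = ""), x = runif(28, 1, 1.05))

# By default the labels are ordered alphabetically, which puts "M10" right after "M1".
# Setting the levels explicitly keeps the models in numeric order around the circle.
df$model <- factor(df$model, levels = paste0("M", 1:28))
```

ggplot2 orders a discrete axis by factor levels, so running this before the plotting code places the models in the intended sequence.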