R ggplot Histogram group shows sum of two groups - r

I tried to plot the distribution of my test and train data set in a histogram and found something curious:
Background:
I have a test set with 50 rows and a training set with 100 rows, each with the same column structure.
I'd normally plot the data like this:
plot2 <- ggplot(data=Donald_1) +
  geom_histogram(aes_string(x = "Alter", y = "..count..", fill = "Group"),
                 bins=20, alpha=0.7)
which results in the right histogram shown below. I then wondered how it could be that test has a higher count than training as the test set is only 50 rows instead of 100. And it seems as if the test bars show the sum of the test and training bars of the left plot.
Then I tried:
plot1 <- ggplot() +
  geom_histogram(data=Donald_1 %>% filter(Group == "Training"),
                 aes_string(x="Alter", y="..count..", fill = "Group"),
                 bins=20, alpha=0.7) +
  geom_histogram(data=Donald_1 %>% filter(Group == "Test"),
                 aes_string(x="Alter", y="..count..", fill="Group"),
                 bins=20, alpha=0.7)
which results in the left plot shown below, and that result makes more sense to me.
I now wonder why the first attempt doesn't result in the same plot as the second attempt. Am I missing something obvious here?

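In your data frame, the column "Group" holds both values, Training and Test.
ggplot therefore treats your first plot as one histogram with two groups, and geom_histogram() stacks groups by default (position = "stack"): one group's bars are drawn on top of the other's, so the upper bars reach the sum of both counts.
Your second plot draws two distinct histograms on the same grid, each starting at zero, and the transparency (alpha) lets you see both where they overlap.
You may also prefer this version, which places the bars side by side: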
plot3 <- ggplot(data=Donald_1) +
  geom_histogram(aes_string(x = "Alter", y = "..count..", fill = "Group"),
                 bins=20, alpha=0.7, position="dodge")
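The difference between stacked and overlaid bars can be checked on made-up data. This is a minimal sketch: the `demo` data frame is a hypothetical stand-in for Donald_1 (which is not shown in the question), and `position = "identity"` is used as a single-layer way to get overlapping bars similar to the two-layer plot.

```r
library(ggplot2)

# Hypothetical stand-in for Donald_1: 100 training rows and 50 test rows
set.seed(1)
demo <- data.frame(
  Alter = c(rnorm(100, mean = 40, sd = 10), rnorm(50, mean = 40, sd = 10)),
  Group = rep(c("Training", "Test"), times = c(100, 50))
)

# Default position = "stack": one group's bars sit on top of the other's
p_stack <- ggplot(demo, aes(x = Alter, fill = Group)) +
  geom_histogram(bins = 20, alpha = 0.7)

# position = "identity": every bar starts at zero, so the groups overlap
p_identity <- ggplot(demo, aes(x = Alter, fill = Group)) +
  geom_histogram(bins = 20, alpha = 0.7, position = "identity")

# Inspect the computed bars: ymin/ymax show where each bar starts and ends
stacked  <- layer_data(p_stack)
overlaid <- layer_data(p_identity)
```

In `stacked`, some bars have `ymin` above zero because they start where the other group's bar ends; in `overlaid`, every bar has `ymin` equal to zero.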

Related

Extend line length with geom_line

I want to represent three lines on a graph overlain with data points that I used in a discriminant function analysis. From my analysis, I have two points that fall on each line, and I want to draw these three lines. The lines represent the probability contours of the classification scheme; exactly how I got the points on the lines is not relevant to my question here. However, I want the lines to extend further than the points that define them.
df <-
  data.frame(Prob = rep(c("5", "50", "95"), each=2),
             Wing = rep(c(107,116), 3),
             Bill = c(36.92055, 36.12167, 31.66012, 30.86124, 26.39968, 25.6008))
ggplot()+
  geom_line(data=df, aes(x=Bill, y=Wing, group=Prob, color=Prob))
The above df is a dataframe for my points from which the three lines are constructed. I want the lines to extend from y=105 to y=125.
Thanks!
There are probably more idiomatic ways of doing it, but this is one way to get it done.
In short, you calculate the slope and intercept of the straight line through each pair of points, i.e. y = mx + c.
library(dplyr)
library(ggplot2)
df_withFormula <- df |>
  group_by(Prob) |>
  # This mutate() creates the slope and intercept that geom_abline() needs in the plotting stage.
  mutate(increaseBill = Bill - lag(Bill),
         increaseWing = Wing - lag(Wing),
         slope = increaseWing / increaseBill,
         intercept = Wing - slope * Bill)
# increaseBill, increaseWing and slope could all be combined into one calculation, but I thought this was easier to understand.
ggplot(df_withFormula, aes(Bill, Wing, color = Prob)) +
  # Added so there is something to plot on top of. You could remove this and define all the limits manually instead (expand_limits() would work).
  geom_point() +
  # This plots the three lines. The rows with NA are automatically ignored. More explicit handling of the NAs could be done in the data prep stage.
  geom_abline(aes(slope = slope, intercept = intercept, color = Prob)) +
  # This is the crucial part: it lets you define the range of the plot window. As ablines are infinite, you can define whatever limits you want.
  expand_limits(y = c(105, 125))
Hope this helps you get the graph you want.
This depends heavily on the structure of your data, though it could be adapted to fit different shapes.
This is similar to the approach by @James in that I compute the slopes and intercepts from the given data and use geom_abline to plot the lines, but it uses
summarise instead of mutate to get rid of the NA values,
and geom_blank instead of geom_point so that only the lines are displayed, not the points (note: having another geom is crucial to set the scale or range of the data so that the lines show up).
library(dplyr)
library(ggplot2)
df_line <- df |>
  group_by(Prob) |>
  summarise(slope = diff(Wing) / diff(Bill),
            intercept = first(Wing) - slope * first(Bill))
ggplot(df, aes(x = Bill, y = Wing)) +
  geom_blank() +
  geom_abline(data = df_line, aes(slope = slope, intercept = intercept, color = Prob)) +
  scale_y_continuous(limits = c(105, 125))
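If the lines should stop exactly at y = 105 and y = 125 rather than run to the edge of the panel, one option is to compute the end points and draw them with geom_line. This is only a sketch of that alternative, not part of either answer above; `df_ends` and `p_ends` are names made up for illustration.

```r
library(dplyr)
library(ggplot2)

df <- data.frame(Prob = rep(c("5", "50", "95"), each = 2),
                 Wing = rep(c(107, 116), 3),
                 Bill = c(36.92055, 36.12167, 31.66012, 30.86124, 26.39968, 25.6008))

# Slope and intercept of the line through each pair of points
df_line <- df |>
  group_by(Prob) |>
  summarise(slope = diff(Wing) / diff(Bill),
            intercept = first(Wing) - slope * first(Bill))

# merge() with no shared columns returns every combination of line and end value
df_ends <- merge(df_line, data.frame(Wing = c(105, 125)))

# Solve Wing = slope * Bill + intercept for Bill at each end
df_ends$Bill <- (df_ends$Wing - df_ends$intercept) / df_ends$slope

# Lines between the computed end points, with the original points on top
p_ends <- ggplot(df_ends, aes(x = Bill, y = Wing, color = Prob)) +
  geom_line() +
  geom_point(data = df)
```

Because the end points are ordinary data, the axis limits follow from them and no extra geom or expand_limits() call is needed.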

R, ggplot, How do I keep related points together when using jitter?

One of the variables in my data frame is a factor denoting whether an amount was gained or spent. Every event has a "gain" value; there may or may not be a corresponding "spend" amount. Here is an image with the observations overplotted:
Adding some random jitter helps visually, however, the "spend" amounts are divorced from their corresponding gain events:
I'd like to see the blue circles "bullseyed" in their gain circles (where the "id" are equal), and jittered as a pair. Here are some sample data (three days) and code:
library(ggplot2)
ccode<-c(Gain="darkseagreen",Spend="darkblue")
ef<-data.frame(
date=as.Date(c("2021-03-01","2021-03-01","2021-03-01","2021-03-01","2021-03-01","2021-03-01","2021-03-01","2021-03-01","2021-03-02","2021-03-02","2021-03-02","2021-03-02","2021-03-02","2021-03-02","2021-03-02","2021-03-03","2021-03-03","2021-03-03","2021-03-03","2021-03-03","2021-03-03","2021-03-03","2021-03-03","2021-03-01","2021-03-01","2021-03-01","2021-03-01","2021-03-01","2021-03-01","2021-03-01","2021-03-01","2021-03-02","2021-03-02","2021-03-02","2021-03-02","2021-03-02","2021-03-02","2021-03-02","2021-03-03","2021-03-03","2021-03-03","2021-03-03","2021-03-03","2021-03-03","2021-03-03","2021-03-03")),
site=c("Castle","Temple","Temple","Temple","Temple","Temple","Palace","Palace","Castle","Castle","Castle","Temple","Temple","Palace","Palace","Castle","Castle","Castle","Castle","Castle","Temple","Temple","Palace","Castle","Temple","Temple","Temple","Temple","Temple","Palace","Palace","Castle","Castle","Castle","Temple","Temple","Palace","Palace","Castle","Castle","Castle","Castle","Castle","Temple","Temple","Palace"),
id=c("C123","T101","T93","T94","T95","T96","P102","P96","C126","C127","C128","T100","T98","P100","P98","C129","C130","C131","C132","C133","T104","T99","P99","C123","T101","T93","T94","T95","T96","P102","P96","C126","C127","C128","T100","T98","P100","P98","C129","C130","C131","C132","C133","T104","T99","P99"),
gainspend=c("Gain","Gain","Gain","Gain","Gain","Gain","Gain","Gain","Gain","Gain","Gain","Gain","Gain","Gain","Gain","Gain","Gain","Gain","Gain","Gain","Gain","Gain","Gain","Spend","Spend","Spend","Spend","Spend","Spend","Spend","Spend","Spend","Spend","Spend","Spend","Spend","Spend","Spend","Spend","Spend","Spend","Spend","Spend","Spend","Spend","Spend"),
amount=c(6,14,34,31,3,10,6,14,2,16,16,14,1,1,15,11,8,7,2,10,15,4,3,NA,NA,4,5,NA,NA,NA,NA,NA,NA,2,NA,1,NA,3,NA,NA,2,NA,NA,2,NA,3))
#▼ 3 day, points centered
ggplot(ef,aes(date,site)) +
  geom_point(aes(size=amount,color=gainspend),alpha=0.5) +
  scale_color_manual(values=ccode) +
  scale_size_continuous(range=c(1,15),breaks=c(5,10,20))
#▼ 3 day, jittered
ggplot(ef,aes(date,site)) +
  geom_point(aes(size=amount,color=gainspend),alpha=0.5,position=position_jitter(w=0,h=0.2)) +
  scale_color_manual(values=ccode) +
  scale_size_continuous(range=c(1,15),breaks=c(5,10,20))
My main idea is the old "add jitter manually" approach. I'm wondering if a nicer approach could be something like plotting little pie charts as points a la package scatterpie.
In this case you could add a random number for the amount of jitter to each ID so points within groups will be moved the same amount. This takes doing work outside of ggplot2.
First, draw the "jitter" to add for each ID. Since a categorical axis is 1 unit wide, I choose numbers between -.3 and .3. I use dplyr for this work and set the seed so you will get the same results.
library(dplyr)
set.seed(16)
ef2 = ef %>%
  group_by(id) %>%
  mutate(jitter = runif(1, min = -.3, max = .3)) %>%
  ungroup()
Then the plot. I use a geom_blank() layer so that the categorical site axis is set up before I add the jitter. I convert site to a factor and then to numeric and add the jitter on; this works because a categorical axis in ggplot2 places its levels at the integer positions 1, 2, 3, and so on.
Now paired IDs move together.
ggplot(ef2, aes(x = date, y = site)) +
  geom_blank() +
  geom_point(aes(size = amount, color = gainspend,
                 y = as.numeric(factor(site)) + jitter),
             alpha=0.5) +
  scale_color_manual(values = ccode) +
  scale_size_continuous(range = c(1, 15), breaks = c(5, 10, 20))
#> Warning: Removed 15 rows containing missing values (geom_point).
Created on 2021-09-23 by the reprex package (v2.0.0)
You can add some jitter by id outside the ggplot() call. Draw one offset per unique id so that both rows of an event get the same value.
ids <- unique(ef$id)
jj <- data.frame(id = ids, jtr = runif(length(ids), -0.3, 0.3))
ef <- merge(ef, jj, by = 'id')
ef$sitej <- as.numeric(factor(ef$site)) + ef$jtr
But you need to make site integer/numeric to do this. So when it comes to making the plot, you need to manually add axis labels with scale_y_continuous(). (Update: the geom_blank() trick from aosmith above is a better solution!)
ggplot(ef,aes(date,sitej)) +
  geom_point(aes(size=amount,color=gainspend),alpha=0.5) +
  scale_color_manual(values=ccode) +
  scale_size_continuous(range=c(1,15),breaks=c(5,10,20)) +
  scale_y_continuous(breaks = 1:3, labels= sort(unique(ef$site)))
This seems to work. Any gain circle that still has no partner should be an event whose spend amount is NA.
Perhaps someone else has a better approach!
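A quick way to rule out a problem with the id variable is to check that every id ends up with exactly one offset and one row per gain/spend level. The sketch below uses `mini`, a hypothetical four-row miniature of `ef` invented for illustration; the same two checks can be run on the full data.

```r
# Hypothetical miniature of `ef`: two events, each with a Gain and a Spend row
mini <- data.frame(
  id = c("C123", "T93", "C123", "T93"),
  site = c("Castle", "Temple", "Castle", "Temple"),
  gainspend = c("Gain", "Gain", "Spend", "Spend"),
  amount = c(6, 34, NA, 4)
)

set.seed(16)
ids <- unique(mini$id)
# One offset per unique id (length(ids) draws, not nrow(mini) draws)
jj <- data.frame(id = ids, jtr = runif(length(ids), -0.3, 0.3))
mini <- merge(mini, jj, by = "id")

# Number of distinct offsets carried by each id: should be 1 everywhere
offsets_per_id <- tapply(mini$jtr, mini$id, function(x) length(unique(x)))

# Number of rows per id: should be 2 (one Gain row, one Spend row)
rows_per_id <- table(mini$id)
```

If `offsets_per_id` contains values above 1, or `rows_per_id` contains values other than 2, the pairing is broken before any plotting happens.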

Summarising data before plotting with geom_tile() renders different results

I have noticed that when plotting with ggplot2's geom_tile(), summarising the data before plotting renders a completely different result than when it is not pre-summarised. I don't understand why.
For a dataframe with three columns, year (character), state (character) and profit (numeric), consider the following examples:
# Plot straight away
data %>%
  ggplot(aes(x=year, y=state)) + geom_tile(aes(fill=profit))
# Summarise before plotting
data %>% group_by(year, state) %>% summarize(profit_mean = mean(profit)) %>%
  ungroup() %>%
  ggplot(aes(x=year, y=state)) + geom_tile(aes(fill=profit_mean))
These two examples render two different tile plots - the values are quite different. I thought that these two methods of plotting would be analogous and that ggplot2 would take a mean automatically - is that not so?
I tried reproducing this error on a smaller subset of data, but it didn't appear. What could be going on here?
OP, this was a very interesting question.
First, let's get this out of the way. It is clear that plotting the summary of your data plots just that: the summary. You are summarizing via mean, so what is plotted equals the mean of the values for each tile.
The actual question here is: If you have a dataset containing more than one value per tile, what is the result of plotting the "non-summarized" dataset?
User @akrun is correct: the default stat used for geom_tile is stat="identity", but it might not be clear exactly what that means. The documentation says it "leaves the data unchanged", but it is not obvious what that implies here.
Illustrative Example Dataset
For purposes of demonstration, I'll create an illustrative dataset, which will answer the question very clearly. I'm creating two individual datasets, df1 and df2, which each contain 4 "tiles" of data. The only difference between them is the values assigned to the tiles. I've included text labels on each tile for more clarity.
library(ggplot2)
library(cowplot)
df1 <- data.frame(
  x=rep(paste("Test",1:2), 2),
  y=rep(c("A", "B"), each=2),
  value=c(5,15,20,25)
)
df2 <- data.frame(
  x=rep(paste("Test",1:2), 2),
  y=rep(c("A", "B"), each=2),
  value=c(10,5,25,15)
)
tile1 <- ggplot(df1, aes(x,y, fill=value, label=value)) +
  geom_tile() + geom_text() + labs(title="df1")
tile2 <- ggplot(df2, aes(x,y, fill=value, label=value)) +
  geom_tile() + geom_text() + labs(title="df2")
plot_grid(tile1, tile2)
Plotting the Combined Data Frame
Each of the data frames df1 and df2 contains only one value per tile, so in order to see how that changes when we have more than one value per tile, we need to combine them into one so that each tile will contain 2 values. In this example, we are going to combine them in two ways: first df1 then df2, and then df2 first followed by df1.
df12 <- rbind(df1, df2)
df21 <- rbind(df2, df1)
Now, if we plot each of those as before and compare, the reason for the discrepancy the OP posted should be quite obvious. I'm including the value for each tile for each originating dataset to make things super-clear.
tile12 <- ggplot(df12, aes(x,y, fill=value, label=value)) +
  geom_tile() + labs(title="df1, then df2") +
  geom_text(data=df1, aes(label=paste("df1:",value)), nudge_y=0.1) +
  geom_text(data=df2, aes(label=paste("df2:",value)), nudge_y=-0.1)
tile21 <- ggplot(df21, aes(x,y, fill=value, label=value)) +
  geom_tile() + labs(title="df2, then df1") +
  geom_text(data=df1, aes(label=paste("df1:",value)), nudge_y=0.1) +
  geom_text(data=df2, aes(label=paste("df2:",value)), nudge_y=-0.1)
plot_grid(tile12, tile21)
Note that the legend colorbar value does not change, so it's not doing an addition. Plus, since we know it's stat="identity", we know this should not be the case. When we use the dataset that contains first observations from df1, then observations from df2, the value plotted is the one from df2. When we use the dataset that contains observations first from df2, then from df1, the value plotted is the one from df1.
Given this piece of information, it becomes clear that the value shown in geom_tile() when using stat="identity" (the default argument) corresponds to the last observation for that particular tile in the data frame: every row is drawn as its own tile, and later tiles cover earlier ones.
So, that's the reason why your plot looks odd, OP. You can either summarize beforehand as you have done, or use stat_summary(geom="tile"... to do the transformation in one go within ggplot.
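The gap between "last observation per tile" and "mean per tile" can also be put into numbers outside ggplot. This is a sketch that only mimics the visible result with dplyr, using the df1 and df2 defined above; `last_vals` and `mean_vals` are names made up for illustration.

```r
library(dplyr)

df1 <- data.frame(
  x = rep(paste("Test", 1:2), 2),
  y = rep(c("A", "B"), each = 2),
  value = c(5, 15, 20, 25)
)
df2 <- data.frame(
  x = rep(paste("Test", 1:2), 2),
  y = rep(c("A", "B"), each = 2),
  value = c(10, 5, 25, 15)
)
df12 <- rbind(df1, df2)

# What remains visible when tiles overplot: the last row for each (x, y) pair
last_vals <- df12 |>
  group_by(x, y) |>
  summarise(value = last(value), .groups = "drop")

# What the pre-summarised plot shows: the mean for each (x, y) pair
mean_vals <- df12 |>
  group_by(x, y) |>
  summarise(value = mean(value), .groups = "drop")
```

For the tile at x = "Test 1", y = "A", `last_vals` holds 10 (the df2 value) while `mean_vals` holds 7.5, which is the kind of discrepancy described in the question.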

Generating only one density graph for each group of user - R

I have a binary data frame in which each row represents data related to a user (size of data frame: 90 rows × 65 columns). The last column of this data frame contains the label for the users (4 labels: Excellent, Good, bad, fail).
My question is: how can I plot only one density curve for each label? That is, my final plot would have only 4 curves, one per label.
Thanks
I think that this question can be found here (Multiple Groups in geom_density() plot) so my answer is almost exactly the same.
The only difference is that I used mtcars with an extra column :
library(ggplot2)
test <- head(mtcars)
addcol <- c("great", "good", "bad", "great", "bad", "good")
test <- cbind(test, addcol)
ggplot() +
  geom_density(data = test, aes(x = wt, group = addcol, color = addcol), adjust=2) +
  xlab("wt") +
  ylab("Density")

Plotting level plot in R

I have 28 variables, M1, M2, ..., M28, for which I compute a certain statistic x.
df = data.frame(model = paste("M", 1:28, sep = ""), x = runif(28, 1, 1.05))
levels = seq(0.8, 1.2, 0.05)
I would like to plot this data as follows:
Each circle (contour) represents a level of that statistic "x". The three blue lines simply represent three different scenarios.
The dataframe included in this example represents one scenario. The blue line would simply join the values of all the models M1 to M28 for that specific scenario.
Is there any tool in R that allow for such a plot? I tried contour() from library(MASS) but the contours are not drawn as perfect circles.
Any help would be appreciated. Thanks!
Here is a ggplot solution:
library(ggplot2)
ggplot(data=df, aes(x=model, y=x, group=1)) +
  geom_line() + coord_polar() +
  scale_y_continuous(limits=range(levels), breaks=levels, labels=levels)
Note this is a little confusing because of the names in your data frame. x is really the y variable here, and model the real x, so the graph scale label seems odd.
EDIT: I had to set your factor levels for model in the data frame so they plot in the correct order.
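The factor-level step mentioned in the EDIT is not shown in the code, so here is a sketch of what it could look like. The seed and the names `stat_levels` and `p_polar` are assumptions added for illustration; the rest follows the question's data.

```r
library(ggplot2)

set.seed(42)  # assumed seed, only so the random values are reproducible
df <- data.frame(model = paste0("M", 1:28), x = runif(28, 1, 1.05))
stat_levels <- seq(0.8, 1.2, 0.05)

# A character column would be placed in alphabetical order (M1, M10, M11, ...),
# so fix the order explicitly with factor levels M1, M2, ..., M28
df$model <- factor(df$model, levels = paste0("M", 1:28))

p_polar <- ggplot(df, aes(x = model, y = x, group = 1)) +
  geom_line() +
  coord_polar() +
  scale_y_continuous(limits = range(stat_levels),
                     breaks = stat_levels, labels = stat_levels)
```

With the levels set this way, the models are placed around the circle in numeric order.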
