I have a dataframe (sample below) with 3 columns. My goal is to have the variable "Return" on the y-axis and "BetaRealized" on the x-axis. Based on that, I would like to have two regression lines grouped by "SML" e.g. one regression line for the two "Theoretical" values and one for the 10 "Empirical" values. Preferably I would like to use ggplot2.
I've looked through several other questions but I wasn't able to find one that fits my case. As I am very new to R, I would greatly appreciate any help. Feel free to help me improve my question for future users if necessary.
Reproducible data sample:
structure(list(SML = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 2L, 2L), .Label = c("Empirical", "Theoretical"), class = "factor"),
Return = c(0.00136162543341773, 0.00327371856919072, 0.00402550498386094,
0.00514512870557883, 0.00491788632261087, 0.00501053666090353,
0.00485590289408263, 0.00576880451680399, 0.00579134238930521,
0.00704131096883141, 0.00471917614445859, 0), BetaRealized = c(0.42574984058487,
0.576898009418581, 0.684024167075167, 0.763551381826944,
0.833875797322081, 0.902738972263857, 0.976227211834564,
1.06544414896672, 1.19436401770255, 1.50932083346054, 0.893219438045588,
0)), class = "data.frame", row.names = c(NA, -12L))
Following AntoniosK's comment, the solution seems to be geom_smooth combined with a color aesthetic. First, turn your sample data into a data frame:
df<-data.frame(structure(list(SML = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 2L, 2L), .Label = c("Empirical", "Theoretical"), class = "factor"),
Return = c(0.00136162543341773, 0.00327371856919072, 0.00402550498386094,
0.00514512870557883, 0.00491788632261087, 0.00501053666090353,
0.00485590289408263, 0.00576880451680399, 0.00579134238930521,
0.00704131096883141, 0.00471917614445859, 0), BetaRealized = c(0.42574984058487,
0.576898009418581, 0.684024167075167, 0.763551381826944,
0.833875797322081, 0.902738972263857, 0.976227211834564,
1.06544414896672, 1.19436401770255, 1.50932083346054, 0.893219438045588,
0)), class = "data.frame", row.names = c(NA, -12L)))
Then call ggplot like this:
library(ggplot2)
ggplot(df, aes(BetaRealized, Return, color = SML)) + geom_point() + geom_smooth(method = lm, se = FALSE)
The output looks like this: graph
Additionally, you can add the regression equation with stat_regline_equation() from the ggpubr package:
library(ggpubr)
ggplot(df, aes(BetaRealized, Return, color = SML)) + geom_point() + stat_smooth(method = lm, se = FALSE) +
stat_regline_equation()
Finally, depending on your objective, it may be useful to use facet_wrap to separate the categories:
ggplot(df, aes(BetaRealized, Return, color = SML)) + geom_point()+
stat_smooth(method=lm, se=FALSE)+ facet_wrap(~SML)+
stat_regline_equation()
The image will look like this: graph2
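To check which lines geom_smooth(method = lm) draws for each colour group, you can fit the same per-group regressions by hand. This is only a minimal sketch: the data frame is the sample from the question rebuilt with data.frame(), and the names fits and coefs are made up for illustration.

```r
# Sample data from the question, rebuilt with data.frame() for brevity
df <- data.frame(
  SML = factor(c(rep("Empirical", 10), rep("Theoretical", 2))),
  Return = c(0.00136162543341773, 0.00327371856919072, 0.00402550498386094,
             0.00514512870557883, 0.00491788632261087, 0.00501053666090353,
             0.00485590289408263, 0.00576880451680399, 0.00579134238930521,
             0.00704131096883141, 0.00471917614445859, 0),
  BetaRealized = c(0.42574984058487, 0.576898009418581, 0.684024167075167,
                   0.763551381826944, 0.833875797322081, 0.902738972263857,
                   0.976227211834564, 1.06544414896672, 1.19436401770255,
                   1.50932083346054, 0.893219438045588, 0)
)

# One linear model per level of SML, mirroring what the colour grouping does
fits <- lapply(split(df, df$SML), function(d) lm(Return ~ BetaRealized, data = d))

# Intercept and slope per group, one column per SML level
coefs <- sapply(fits, coef)
print(coefs)
```

The "Theoretical" group has only two points, so its line passes exactly through both of them; its slope is simply the return of the non-zero point divided by its beta.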
Trying to produce a point plot that reorders my values and also has a mean line above the values.
I can produce the plot with the mean line, or the reordered values but not both at the same time because I get the error
"geom_path: Each group consists of only one observation. Do you need to adjust the group aesthetic?".
I believe I am getting the error as some of my data only has one observation but I don't understand why this only becomes an issue with the reorder data.
In the end all I want is to be able to show the means of the two different values groups for each x value.
Here is my sample code
library(ggplot2)
typ <- c("T", "N", "T", "T", "N")
samplenum <- c(7,7,6,8,8)
values <- c(1,2,1,3,2)
df = data.frame(typ, samplenum, values)
d <- ggplot(df, aes(x= reorder(samplenum, values), y= values))
d <- d + geom_point(position=position_jitter(width=0.15, height=0.05))
d <- d + aes(colour = factor(df$typ))
d <- d + stat_summary(fun.y = mean, geom="line")
d
Thank you for the help in advance.
This is what I am going for
Here are some intermediate plots that I have produced from my larger data set:
With Line but Not Reordered
Reordered but No Mean Line
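As the error message suggests, you need to adjust the group aesthetic. When you use reorder, the x-axis becomes a discrete scale, and ggplot then groups by every combination of the discrete variables in the plot by default. Each group therefore contains a single mean, and a line cannot be drawn through one point. You want lines that connect the means across the x values, so the grouping has to be set explicitly.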
You can try this
ggplot(df, aes(x = reorder(samplenum, values), y = values, colour = factor(typ))) +
geom_jitter(width = 0.15, height = 0.05) +
stat_summary(fun.y = mean, geom = "line", aes(group = factor(typ)))
(I altered your data slightly so it contains more observations.)
data
df <- structure(list(typ = structure(c(2L, 1L, 2L, 2L, 1L, 2L, 1L,
2L, 2L, 1L, 2L, 1L, 2L, 2L, 1L), .Label = c("N", "T"), class = "factor"),
samplenum = c(7, 7, 6, 8, 8, 7, 7, 6, 8, 8, 7, 7, 6, 8, 8
), values = c(1L, 3L, 2L, 1L, 3L, 3L, 1L, 3L, 2L, 2L, 2L,
1L, 3L, 1L, 2L)), .Names = c("typ", "samplenum", "values"
), row.names = c(NA, -15L), class = "data.frame")
The resulting plot with your input data
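To see which values the mean line connects, you can compute the per-group means yourself. This is a rough sketch using the enlarged data from this answer, rebuilt with data.frame(); group_means is a name made up here.

```r
# Enlarged sample data from the answer (15 observations)
df <- data.frame(
  typ = c("T", "N", "T", "T", "N", "T", "N", "T", "T", "N", "T", "N", "T", "T", "N"),
  samplenum = c(7, 7, 6, 8, 8, 7, 7, 6, 8, 8, 7, 7, 6, 8, 8),
  values = c(1, 3, 2, 1, 3, 3, 1, 3, 2, 2, 2, 1, 3, 1, 2)
)

# One mean per (samplenum, typ) combination: these are the points that
# stat_summary(geom = "line", aes(group = factor(typ))) joins, one line per typ
group_means <- aggregate(values ~ samplenum + typ, data = df, FUN = mean)
print(group_means)
```

Without the explicit group, each of these rows would be its own group, which is what triggers the geom_path message. Also note that from ggplot2 3.3.0 onwards the fun.y argument of stat_summary is deprecated in favour of fun, so on a recent version you would write fun = mean.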
I am trying to create this type of chart from the data on the left (arbitrary values for simplicity):
The goal is to plot variable X on the x-axis with the mean on the Y-axis and error bars equal to the standard error se.
The problem is that the values 1-10 should each be represented individually (blue curve), and that the values for A and B should be plotted at each of the 1-10 values (green and red lines).
I can draw the curve if I manually save the data and manually copy the values for A and B to each value for X but this is not very time efficient. Is there a more elegant way to do this?
Thanks in advance!
EDIT: As suggested the code:
df <- structure(list(X = structure(c(1L, 3L, 4L, 5L, 6L, 7L, 8L, 9L,
10L, 2L, 11L, 12L), .Label = c("1", "10", "2", "3", "4", "5",
"6", "7", "8", "9", "A", "B"), class = "factor"), mean = c(1,
2, 3, 4, 5, 6, 7, 8, 9, 10, 5.5, 6.5), sd = c(1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), se = c(1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L)), .Names = c("X", "mean", "sd", "se"), class = "data.frame", row.names = c(NA,-12L))
df<-as.data.frame(df)
df$X<-factor(df$X)
plot <- ggplot(df, aes(x=df$X, y=df$mean)) + geom_point() + geom_errorbar(aes(ymin=mean-se, ymax=mean+se), width=.1)
plot
I'm afraid I don't know ggplot, but hopefully this is what you want (it might also help others understand your question).
You want a ggplot with three lines plus error bars:
1. df$X,df$mean
2. df$X,df$row_A_mean
3. df$X,df$row_B_mean
4. error bars of the SE column
df <- structure(list(X = structure(c(1L, 3L, 4L, 5L, 6L, 7L, 8L, 9L,
10L, 2L, 11L, 12L), .Label = c("1", "10", "2", "3", "4", "5",
"6", "7", "8", "9", "A", "B"), class = "factor"), mean = c(1,
2, 3, 4, 5, 6, 7, 8, 9, 10, 5.5, 6.5), sd = c(1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), se = c(1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L)), .Names = c("X", "mean", "sd", "se"), class = "data.frame", row.names = c(NA,-12L))
df<-as.data.frame(df)
df$X<-factor(df$X)
plot <- ggplot(df, aes(x=df$X, y=df$mean)) + geom_point() + geom_errorbar(aes(ymin=mean-se, ymax=mean+se), width=.1)
plot
#row A mean
df$row_A_mean<-rep(df[11,]$mean,nrow(df))# note that this could also be replaced by a horizontal line, unless the mean changes
#row A sd
df$row_A_sd<-rep(df[11,]$sd,nrow(df))
plot(as.numeric(df$X),df$mean,type="p",col="red")
lines(as.numeric(df$X),df$row_A_mean,col="green")
If we use a subset to define the data elements of the ggplot, we can come up with one solution using geom_hline:
theme_set(theme_bw())
ggplot(data = df[1:10,])+
geom_errorbar(aes(x = X, ymin = mean - se, ymax = mean + se))+
geom_point(aes(x = X, y = mean))+
geom_line(aes(x = X, y = mean), group = 1)+
geom_hline(data = df[11,], aes(yintercept = mean, colour = 'A'))+
geom_hline(data = df[12,], aes(yintercept = mean, colour = 'B'))
It's helpful to reorient your data into long form so that you can really utilize the aesthetic part of ggplot. Generally I would use reshape2::melt for this, but your data the way it's currently formatted doesn't really lend itself to it. I'll show you what I mean by long form and you can get the idea what we're shooting for:
#setting variables for your classes so it's a bit more scalable - reset as applicable
x.seriesLength <- 10
x.class.name <- "X" #name of the main series class; X in your example
a.vec <- c(5.5, 1, 1, "A")
b.vec <- c(6.5, 1, 1, "B")
#trimming df so we can reshape
df <- df[1:x.seriesLength, 2:4]
df$class <- x.class.name #adding class column
#converting your static A and B values to long form, sending to a data.frame and adding to df
add <- matrix(c(rep(a.vec, times = x.seriesLength),
rep(b.vec, times = x.seriesLength)),
byrow = TRUE,
ncol = 4)
colnames(add) <- c("mean", "sd", "se", "class")
df <- rbind(df, add)
print(df)
Then we need to do a bit more cleaning:
df$rownum <- rep(1:x.seriesLength, times = 3)
df[,1:3] <- sapply(df[,1:3], as.numeric) #casting as numeric
df$barmin <- df$mean - df$se #the question asks for error bars of one standard error
df$barmax <- df$mean + df$se
Now we have a long form data frame with the required data. We can then use the new class column to plot and color multiple series.
#use class column to tell ggplot which points belong to which series
g <- ggplot(data = df) +
geom_point(aes(x = rownum, y = mean, color = class)) +
geom_errorbar(aes(x = rownum, ymin=barmin, ymax=barmax, color = class), width=.1)
g
Edit: If you want lines instead of points, just replace geom_point with geom_line.
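The same long-form table can also be built without going through a character matrix, which avoids the later cast back to numeric. The following is only a sketch under a few assumptions: the first 10 rows are the main series, the remaining rows (A and B) are constants to repeat at every x position, and the names series, refs, ref_long and long are made up for illustration.

```r
# Data from the question: rows 1-10 are the X series, rows 11-12 hold A and B
df <- data.frame(
  X = factor(c(1:10, "A", "B")),
  mean = c(1:10, 5.5, 6.5),
  sd = 1,
  se = 1
)

n <- 10  # length of the main series (assumed to be the first n rows)

# Main series, one row per x position
series <- data.frame(rownum = 1:n, class = "X",
                     mean = df$mean[1:n], se = df$se[1:n])

# Reference rows (A and B), each repeated once per x position
refs <- df[(n + 1):nrow(df), ]
ref_long <- data.frame(
  rownum = rep(1:n, times = nrow(refs)),
  class = rep(as.character(refs$X), each = n),
  mean = rep(refs$mean, each = n),
  se = rep(refs$se, each = n)
)

# Long form: every column keeps its numeric type, so no casting is needed
long <- rbind(series, ref_long)
long$barmin <- long$mean - long$se
long$barmax <- long$mean + long$se
str(long)
```

The resulting data frame has the same rownum, class, mean, barmin and barmax columns as above, so it can be passed to the same ggplot call.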
For a sample dataframe:
df <- structure(list(region = structure(c(1L, 1L, 1L, 1L, 1L, 2L, 2L,
2L, 2L, 2L, 2L), .Label = c("a", "b", "c", "d"), class = "factor"),
result = c(1L, 0L, 1L, 1L, 0L, 1L, 0L, 0L, 0L, 1L, 0L), weight = c(0.126,
0.5, 0.8, 1.5, 5.3, 2.2, 3.2, 1.1, 0.1, 1.3, 2.5)), .Names = c("region",
"result", "weight"), row.names = c(NA, 11L), class = "data.frame")
I draw a cross tabulation using:
df$region <- factor(df$region)
result <- xtabs(weight ~ region + result, data=df)
result
However I want to ensure the regions of the xtab are in order of magnitude of percentage 1s in sample. (i.e. 1s represent 29% of region a and 33% of region b). Therefore I would like the xtab to be reordered, so region b is first, then a.
I know I could use relevel, however this would be dependent on me looking at the result and re-levelling where appropriate.
Instead I want this to be automatic in the code and not dependent on the user (as this code will be running lots of times, and completing further analysis on the resulting xtab).
If anyone has any ideas, I would greatly appreciate it.
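You can reorder the xtab on the values of the second column using order as follows:
result[order(result[, 2], decreasing=TRUE),]
order returns the row indices that sort the values, and decreasing=TRUE sorts them from largest to smallest. Note that this sorts by the weighted total of 1s in each region, not by their percentage within the region; for the sample data both give the same order.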
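If the order should follow the percentage of 1s within each region, as the question describes, you can take row proportions first and order on those. This is a minimal sketch with the sample data rebuilt with data.frame(); tab, share and tab_ordered are names made up here.

```r
# Sample data from the question
df <- data.frame(
  region = factor(c(rep("a", 5), rep("b", 6))),
  result = c(1L, 0L, 1L, 1L, 0L, 1L, 0L, 0L, 0L, 1L, 0L),
  weight = c(0.126, 0.5, 0.8, 1.5, 5.3, 2.2, 3.2, 1.1, 0.1, 1.3, 2.5)
)

tab <- xtabs(weight ~ region + result, data = df)

# Weighted share of 1s within each region (each row sums to 1)
share <- prop.table(tab, margin = 1)[, "1"]

# Reorder the rows of the original table by that share, largest first
tab_ordered <- tab[order(share, decreasing = TRUE), ]
print(round(share, 3))
print(tab_ordered)
```

For the sample data the shares are roughly 29% for region a and 34% for region b, so region b comes first.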
I want to overlay a density curve on a frequency histogram I have constructed. For the frequency histogram I used aes(y=..count../40) because 40 is my total sample size. I used aes(y=..density..*0.1) to force the density to be somewhere between 0 and 1, since my binwidth is 0.1. However, the density curve doesn't fit my data and it excludes the values that are equal to 1.0 (notice that the histogram shows accumulated values for the bin [1.0, 1.1) but the density curve ends at 1.0).
this is my data
data<-structure(list(variable = structure(c(1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L,
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L,
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L), .Label = c("E1", "test"
), class = "factor"), value = c(0.288888888888889, 0.0817901234567901,
0.219026548672566, 0.584795321637427, 0.927554980595084, 0.44661095636026,
1, 0.653780942692438, 1, 0.806451612903226, 1, 0.276794335371741,
1, 0.930109557990178, 0.776864728192162, 0.824909747292419, 1,
1, 1, 1, 1, 0.0875912408759124, 0.308065494238933, 1, 0.0258064516129032,
0.0167322834645669, 1, 1, 0.355605889014723, 0.310344827586207,
0.106598984771574, 0.364447494852436, 0.174724342663274, 0.77491961414791,
1, 0.856026785714286, 0.680759275237274, 0.850657108721625, 1,
1, 0, 0.851851851851852, 1, 0, 0.294954721862872, 0.819870009285051,
0, 0.734147168531706, 0.0135424091233072, 0.0189098998887653,
0.0101010101010101, 0, 0.296905222437137, 0.706837929731772,
0.269279393173198, 0.135379061371841, 0.158969804618117, 0.0902981940361193,
0.00423131170662906, 0, 0.374880611270296, 0.0425790754257908,
0.145542753183748, 0, 0.129032258064516, 0.260334645669291, 0,
0, 1, 0.175505350772889, 0.08248730964467, 0, 0.317217981340119,
0.614147909967846, 0, 0.264508928571429, 0.883520276100086, 0.0657108721624851,
0, 0.560229445506692)), row.names = c(NA, -80L), .Names = c("variable",
"value"), class = "data.frame")
Plot
library(ggplot2)
q<-ggplot(data, aes(value, fill = variable))
q + geom_density(alpha = 0.6,aes(y=..density..*0.1),binwidth=0.1) +
theme_minimal()+scale_fill_manual(values =c("#D7191C","#2B83BA")) +
theme(legend.position="bottom")+ guides(fill=guide_legend(nrow=1)) +
labs(title="Density Plot GrupoB",x="Respuesta",y="Density") +
scale_x_continuous(breaks=seq(from=0,to=1.2,by=0.1)) +
geom_histogram(alpha = 0.6,aes(y=..count../40),binwidth=0.1,position="dodge")
The output I get is this
Your plot is doing exactly what is to be expected from your data:
You plot data$value, which contains numeric values between 0 and 1, so you should expect the density curve to run from 0 to 1 as well.
You plot a histogram with binwidth 0.1. Bins are closed on the lower and open on the upper end. So the binning you get in your case is [0,0.1), [0.1, 0.2), ..., [0.9,1.0), [1.0,1.1). You have 17 values in your data that are 1 and thus go into the last bin, which is plotted from 1 to 1.1.
I think it's a bad idea to plot the histogram the way you do. For a histogram the x-axis is continuous: the bar that covers the x-axis range from, say, 0.1 to 0.2 stands for the count of values from 0.1 (included) up to 0.2 (excluded). Using dodge in this situation distorts the picture, because the bars no longer cover the correct x-axis range: two bars share a range that each of them should cover in full. This distortion is one of the reasons why the density curve does not seem to match the histogram.
So, what can you do about it? I can give you a few suggestions, but maybe others have better ideas...
Instead of plotting the histograms next to each other with position="dodge", you could use faceting, that is, plot the histograms (and corresponding density curves) into separate plots. This can be achieved by adding + facet_grid(variable~.) to your plot.
You could cheat a little bit to have the last bin, which is [0.9,1), include 1 (i.e. have it be [0.9,1.0]). Simply replace 1 in your data by 0.999 as follows: data$value[data$value==1]<-0.999. It is important that you do this only for the plot, where it really only means that you slightly redefine the binning. For all the numeric evaluations that you intend to do, you should not do this replacement! (It will, e.g., change the mean of data$value.)
Regarding the normalisation of your density curve and the histogram: there is no need for the density curve to lie between 0 and 1. The restriction is that the integral over the density curve should be 1. Thus, to make the density curve and the histogram comparable, the histogram should also have integral 1, which is achieved by also dividing the y-value by the binwidth. So, you should use geom_density(alpha = 0.6,aes(y=..density..)) (I also removed binwidth=0.1 because it has no effect for geom_density) and geom_histogram(alpha = 0.6,aes(y=..count../40/.1),binwidth=0.1) (no need for position="dodge" once you use faceting). This leads, of course, to exactly the same relative normalisation that you had, but it makes more sense because the integrals over the density curve and the histogram are 1, as they should be.
The density curve still does not perfectly match the histogram, and this has to do with the way the density estimator is calculated. I don't know this in detail and thus unfortunately cannot explain it further. But you can get a better understanding of how it works by playing with the adjust parameter of geom_density. Smaller values make the curve less smooth, so it resembles the histogram more closely.
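The normalisation argument and the effect of adjust can be checked outside of ggplot with base R. This is only a sketch: the vector x is a small made-up sample (not the data from the question), and hist() and density() stand in here for what geom_histogram and geom_density compute.

```r
# Hypothetical sample on [0, 1] with several values exactly equal to 1
x <- c(0, 0.05, 0.25, 0.31, 0.44, 0.62, 0.87, 1, 1, 1)
binwidth <- 0.1

# Left-closed bins [0,0.1), [0.1,0.2), ..., [1.0,1.1], as described above
breaks <- seq(0, 1.1, by = binwidth)
h <- hist(x, breaks = breaks, right = FALSE, plot = FALSE)

# Bar height as in aes(y = ..count../n/binwidth): count / n / binwidth
height <- h$counts / length(x) / binwidth

# Area under the histogram: sum of height * width over all bins
area <- sum(height * binwidth)
print(area)

# adjust multiplies the kernel bandwidth, so adjust = 0.2 gives a
# bandwidth five times narrower and therefore a less smooth curve
bw_default <- density(x)$bw
bw_narrow <- density(x, adjust = 0.2)$bw
print(c(bw_default, bw_narrow))
```

The three values equal to 1 land in the last bin, which starts at 1.0, and the total area of the rescaled bars is 1, which is what makes them comparable to a density curve.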
To put everything together, I have built all my suggestions into your code, used adjust=0.2 in geom_density and plotted the result:
data$value[data$value==1]<-0.999
q<-ggplot(data, aes(value, fill = variable))
q + geom_density(alpha = 0.6,aes(y=..density..),adjust=0.2) +
theme_minimal()+scale_fill_manual(values =c("#D7191C","#2B83BA")) +
theme(legend.position="bottom")+ guides(fill=guide_legend(nrow=1)) +
labs(title="Density Plot GrupoB",x="Respuesta",y="Density")+
scale_x_continuous(breaks=seq(from=0,to=1.2,by=0.1))+
geom_histogram(alpha = 0.6,aes(y=..count../40/.1),binwidth=0.1) +
facet_grid(variable~.)
Unfortunately, I cannot give you a more complete answer, but I hope these ideas give you a good start.