To perform an ANOVA in R I normally follow two steps:
1) I compute the anova summary with the function aov
2) I reorganise the data aggregating subject and condition to visualise the plot
I wonder whether this reorganisation of the data is always necessary to see the results, or whether there is a function that plots the results quickly.
Thanks for your suggestions
G.
I think what you mean is illustrating the result of your test with a figure? ANOVA results are usually illustrated with boxplots.
set.seed(1234)
data <- data.frame(group = c(rep("group_1",25),rep("group_2",25)), scores = c(runif(25,0,1),runif(25,1.5,2.5)))
mod1<-aov(scores~group,data=data)
summary(mod1)
You can make a boxplot with the built-in functions plot or boxplot:
boxplot(scores~group,data=data)
plot(scores~group,data=data)
Or with ggplot
require(ggplot2)
require(ggsignif)
ggplot(data, aes(x = group, y = scores)) +
geom_boxplot(fill = "grey80", colour = "blue") +
scale_x_discrete() + xlab("Group") +
ylab("Scores") +
geom_signif(comparisons = list(c("group_1", "group_2")),
map_signif_level=TRUE)
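If the aggregation step is the part you would like to skip, one option is to let ggplot2 compute the group summaries itself. The following is only a sketch under that assumption: it reuses the simulated data from above (stored in a new object, dat_anova, a name chosen here for illustration) and uses stat_summary with ggplot2's mean_se helper to draw each group mean with plus or minus one standard error on top of the raw points.

```r
library(ggplot2)

# Same simulated data as above; dat_anova is an illustrative name
set.seed(1234)
dat_anova <- data.frame(
  group  = factor(rep(c("group_1", "group_2"), each = 25)),
  scores = c(runif(25, 0, 1), runif(25, 1.5, 2.5))
)

p <- ggplot(dat_anova, aes(x = group, y = scores)) +
  # raw observations, slightly jittered so they do not overlap
  geom_jitter(width = 0.1, alpha = 0.4) +
  # group mean +/- 1 standard error, computed by ggplot2 from the raw data
  stat_summary(fun.data = mean_se, geom = "pointrange", colour = "blue")

# The summaries ggplot2 computed for the second layer (one row per group)
group_summary <- layer_data(p, 2)
# print(p) draws the figure
```

No manual reshaping is involved: the summary is recomputed from the raw data frame whenever the plot is built.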
Hope this helps
Related
I have created the attached plot using ggplot and the data I currently hold. I want to be able to add a simple linear forecast for future years, ideally with some sort of confidence intervals, but I can't seem to find any way to do it without calculating the forecasted values in a separate data frame.
In addition to phiver's and danloo's solutions (if you only need a quick linear projection), you can extend the axis range and set the geom_smooth layer to fill the whole plot, not only the data range:
library(ggplot2)
library(magrittr)  # provides the %>% pipe
data.frame(x = 1:10, y = 1:10 + runif(10)) %>%
ggplot(aes(x,y)) +
geom_point() +
geom_smooth(method = 'lm', ## lm for linear model
## extend smoother to plot range:
fullrange = TRUE
) +
## extend axis beyond data range:
scale_x_continuous(limits = c(1,20))
sidenote: if you want to emphasise the trend in each plot facet, not the difference between facets, you can set the scales argument to facet_wrap to any of 'free', 'free_x' or 'free_y': facet_wrap(..., scales = 'free')
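Note that the band geom_smooth draws is a confidence interval for the fitted mean, not a prediction interval for new observations. If a prediction interval is what you are after, I am not aware of a way around computing it outside ggplot. A minimal sketch with the same toy data (the object names df, future and forecast are made up for this example):

```r
library(ggplot2)

set.seed(1)
df <- data.frame(x = 1:10, y = 1:10 + runif(10))

# Fit the linear model once, outside ggplot
fit <- lm(y ~ x, data = df)

# Future x values to forecast (illustrative range)
future <- data.frame(x = 11:20)

# predict() returns the columns fit, lwr and upr for interval = "prediction"
forecast <- cbind(future,
                  predict(fit, newdata = future,
                          interval = "prediction", level = 0.95))

p <- ggplot(df, aes(x, y)) +
  geom_point() +
  # 95% prediction band over the forecast range
  geom_ribbon(data = forecast, aes(x = x, ymin = lwr, ymax = upr),
              inherit.aes = FALSE, alpha = 0.2) +
  # forecast line
  geom_line(data = forecast, aes(x = x, y = fit), linetype = "dashed")
# print(p) draws the figure
```

The prediction band is wider than the confidence band from geom_smooth because it also accounts for the residual scatter of individual observations.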
I have included a sample data set just to demonstrate what I am trying to do.
Speed <- c(400,220,490,210,500,270,200,470,480,310,240,490,420,330,280,210,300,470,230,430,460,220,250,200,390)
Hit <- c(0,1,0,1,0,0,1,0,0,1,1,0,0,1,1,1,1,1,0,0,0,1,1,1,0)
obs <- c(1:25)
msl2.data <- as.data.frame(cbind(obs,Hit,Speed))
msl2.glm <- glm(Hit ~ Speed, data = msl2.data, family = binomial)
Doing what I want with base graphics:
plot(Hit~ Speed, data = msl2.data, xlim = c(0,700), xlab = "Speed", ylab = "Hit", main = "Plot of hit vs Speed")
pi.hat<-(predict( msl2.glm, data.frame(Speed=c(0:700)), type="response" ))
lines( 0:700, pi.hat, col="blue" )
I am trying to recreate the above plot, but in ggplot. The error I have been unable to work around is that the x and y in aes(x, y) have different lengths, which is true, but I want them to have different lengths.
Any ideas for this in gg?
You have a couple of approaches; the first does all the modelling
inside of ggplot, the second does it outside and passes the relevant data
to be plotted.
First
ggplot(data=msl2.data, aes(Speed, Hit)) +
geom_point() +
geom_smooth(method="glm", method.args=list(family="binomial"),
fullrange=TRUE, se=FALSE) +
xlim(0, 700)
fullrange is specified so that the prediction line covers the whole x-range. xlim extends the x-axis.
Second
#Create prediction dataframe
pred <- data.frame(Speed=0:700, pi.hat)
ggplot() +
# prediction line
geom_line(data=pred, aes(Speed, pi.hat)) +
# points - note different dataframe is used
geom_point(data=msl2.data, aes(Speed, Hit))
I generally prefer to do the modelling outside (second approach), and use ggplot purely as a plotting mechanism.
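One thing the outside-of-ggplot route makes easy is adding an uncertainty band to the fitted curve. The following is a sketch rather than a definitive recipe: it assumes an approximate 95% Wald interval is acceptable, computes it on the link (logit) scale and back-transforms with plogis so the band stays within 0 and 1. The column names lower and upper are my own.

```r
library(ggplot2)

# Sample data from the question
Speed <- c(400,220,490,210,500,270,200,470,480,310,240,490,420,330,280,
           210,300,470,230,430,460,220,250,200,390)
Hit   <- c(0,1,0,1,0,0,1,0,0,1,1,0,0,1,1,1,1,1,0,0,0,1,1,1,0)
msl2.data <- data.frame(Speed, Hit)
msl2.glm  <- glm(Hit ~ Speed, data = msl2.data, family = binomial)

# Predict on the link scale, with standard errors
pred <- data.frame(Speed = 0:700)
link <- predict(msl2.glm, newdata = pred, type = "link", se.fit = TRUE)

# Back-transform the fit and the approximate 95% Wald limits to probabilities
pred$pi.hat <- plogis(link$fit)
pred$lower  <- plogis(link$fit - 1.96 * link$se.fit)
pred$upper  <- plogis(link$fit + 1.96 * link$se.fit)

p <- ggplot() +
  geom_ribbon(data = pred, aes(Speed, ymin = lower, ymax = upper), alpha = 0.2) +
  geom_line(data = pred, aes(Speed, pi.hat), colour = "blue") +
  geom_point(data = msl2.data, aes(Speed, Hit))
# print(p) draws the figure
```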
The most commonly cited example of how to visualize a logistic fit using ggplot2 seems to be something very much like this:
data("kyphosis", package="rpart")
ggplot(data=kyphosis, aes(x=Age, y = as.numeric(Kyphosis) - 1)) +
geom_point() +
stat_smooth(method="glm", method.args=list(family="binomial"))  # recent ggplot2 versions take family via method.args
This visualisation works great if you don't have too much overlapping data, and the first suggestion for crowded data seems to be to use injected jitter in the x and y coordinates of the points then adjust the alpha value of the points. When you get to the point where individual points aren't useful but distributions of points are, is it possible to use geom_density(), geom_histogram(), or something else to visualise the data but continue to split the categorical variable along the y-axis as it is done with geom_point()?
From what I have found, geom_density() and geom_histogram() can easily be split/grouped by the categorical variable and both levels can easily be reversed using scale_y_reverse() but I can't figure out if it is even possible to move only one of the categorical variable distributions to the top of the plot. Any help/suggestions would be appreciated.
The annotate() function in ggplot allows you to add geoms to a plot with properties that "are not mapped from the variables of a data frame, but are instead passed in as vectors," meaning that you can add layers that are unrelated to your data frame. In this case your two density curves are related to the data frame (since the variables are in it), but because you're trying to position them differently, using annotate() is useful.
Here's one way to go about it:
data("kyphosis", package="rpart")
model.only <- ggplot(data=kyphosis, aes(x=Age, y = as.numeric(Kyphosis) - 1)) +
stat_smooth(method="glm", method.args=list(family="binomial"))  # recent ggplot2 versions take family via method.args
absents <- subset(kyphosis, Kyphosis=="absent")
presents <- subset(kyphosis, Kyphosis=="present")
dens.absents <- density(absents$Age)
dens.presents <- density(presents$Age)
scaling.factor <- 10 # Make the density plots taller
model.only + annotate("line", x=dens.absents$x, y=dens.absents$y*scaling.factor) +
annotate("line", x=dens.presents$x, y=dens.presents$y*scaling.factor + 1)
This adds two annotated layers with scaled density plots for each of the kyphosis groups. For the presents variable, y is scaled and increased by 1 to shift it up.
You can also fill the density plots instead of just using a line. Instead of annotate("line"...) you need to use annotate("polygon"...), like so:
model.only + annotate("polygon", x=dens.absents$x, y=dens.absents$y*scaling.factor, fill="red", colour="black", alpha=0.4) +
annotate("polygon", x=dens.presents$x, y=dens.presents$y*scaling.factor + 1, fill="green", colour="black", alpha=0.4)
Technically you could use annotate("density"...), but that won't work when you shift the present plot up by one. Instead of shifting, it fills the whole plot:
model.only + annotate("density", x=dens.absents$x, y=dens.absents$y*scaling.factor, fill="red") +
annotate("density", x=dens.presents$x, y=dens.presents$y*scaling.factor + 1, fill="green")
The only way around that problem is to use a polygon instead of a density geom.
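Another possibility, offered here only as a sketch, is to put the precomputed densities into a data frame and draw them with geom_ribbon, which takes an explicit lower edge (ymin) and therefore does not fill down to zero when a curve is shifted. The names dens_df and baseline are invented for this example.

```r
library(ggplot2)
data("kyphosis", package = "rpart")

scaling.factor <- 10  # make the density curves taller, as above

# One density per outcome level; "present" sits on a baseline of 1
dens_df <- do.call(rbind, lapply(levels(kyphosis$Kyphosis), function(lev) {
  d <- density(kyphosis$Age[kyphosis$Kyphosis == lev])
  baseline <- if (lev == "present") 1 else 0
  data.frame(Kyphosis = lev,
             Age  = d$x,
             ymin = baseline,
             ymax = baseline + d$y * scaling.factor)
}))

p <- ggplot(dens_df, aes(x = Age, ymin = ymin, ymax = ymax, fill = Kyphosis)) +
  geom_ribbon(alpha = 0.4, colour = "black")
# print(p) draws the two filled curves, one starting at 0 and one at 1
```

Because this is an ordinary data layer rather than an annotation, the fill colours come with a legend for free.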
One final variant: flipping the top density plot about the line y = 1:
model.only + annotate("polygon", x=dens.absents$x, y=dens.absents$y*scaling.factor, fill="red", colour="black", alpha=0.4) +
annotate("polygon", x=dens.presents$x, y=(1 - dens.presents$y*scaling.factor), fill="green", colour="black", alpha=0.4)
I am not sure I get your point, but here is an attempt:
dat <- rbind(kyphosis,kyphosis)
dat$grp <- factor(rep(c('smooth','dens'),each = nrow(kyphosis)),
levels = c('smooth','dens'))
ggplot(dat,aes(x=Age)) +
facet_grid(grp~.,scales = "free_y") +
#geom_point(data=subset(dat,grp=='smooth'),aes(y = as.numeric(Kyphosis) - 1)) +
stat_smooth(data=subset(dat,grp=='smooth'),aes(y = as.numeric(Kyphosis) - 1),
method="glm", method.args=list(family="binomial")) +
geom_density(data=subset(dat,grp=='dens'))
I have a dataset that contains observations for every second of four consecutive days (roughly 340'000 data points). This is too much to display in a scatter plot. I would like to plot only a uniform sample of, say, 2000 time points.
Is it possible to achieve this with ggplot2's "grammar of graphics" approach? I haven't found any built-in "sampling" modifier, but perhaps it's easy enough to write one?
library(ggplot2)
x <- 1:100000
d <- data.frame(x=x, y=rnorm(length(x)))
ggplot(d[sample(x, 2000), ], aes(x=x, y=y)) + geom_point()
This is how it can be "hacked" by modifying the data passed to ggplot. But I don't want to modify the data, just filter it to include only a sample.
ggplot(d, aes(x=x, y=y)) + ??? + geom_point()
EDIT: I'm specifically looking for sampling, not smoothing or binning. The data I have shows the time it takes to simulate one second of a specific process. The simulation has been parallelized, and for each simulated second I have the run times for each of the cores involved (8 in total). I want to show sub-optimal load balancing by plotting just the raw data points. The reason for the sampling is just that 300'000 data points are far too many for a scatter plot: plotting takes too long and the visualization is no good.
You can subset within the geom_point call using the data argument:
... + geom_point(data=d[sample(x,2000),])
This way, you are free to add other geoms using all the data, eg, using the example data:
ggplot(d, aes(x=x, y=y)) + geom_hex() + geom_point(data=d[sample(x,2000),])
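The data argument of a layer can also be a function, in which case ggplot2 calls it with the plot's data frame and uses whatever it returns. That comes close to the "sampling modifier" asked for. A minimal sketch, where sample_rows is a hypothetical helper written for this example and not part of ggplot2:

```r
library(ggplot2)

set.seed(42)
d <- data.frame(x = 1:100000, y = rnorm(100000))

# Hypothetical helper: returns a function that keeps n random rows
sample_rows <- function(n) {
  function(df) df[sample(nrow(df), min(n, nrow(df))), , drop = FALSE]
}

# The plot keeps the full data; only this layer sees the sample
p <- ggplot(d, aes(x = x, y = y)) +
  geom_point(data = sample_rows(2000))

# Number of points the layer actually draws
n_drawn <- nrow(layer_data(p))
# print(p) draws the figure
```

One thing to be aware of: the function runs each time the plot is built, so a fresh sample is drawn on every render unless you call set.seed beforehand.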
If you want to create a scatter plot for big data, here are a couple of ggplot2 options.
They come from this course by Hadley.
library(knitr)    # opts_chunk and render_markdown come from knitr
library(ggplot2)
# upload all images to imgur.com
opts_chunk$set(fig.width = 5, fig.height = 5, dev = "png")
render_markdown(strict = TRUE)
# some autocorrelated data
set.seed(1)
x <- 1:1e+05
d <- data.frame(x = x)
d$y <- arima.sim(list(order = c(1, 1, 0), ar = 0.9), n = 1e+05 - 1)
# the basic plot
base_plot <- ggplot(d, aes(x = x, y = y))
geom_bin2d
you can set the binwidth for the x and y variables
base_plot + geom_bin2d(binwidth = c(200, 5))
geom_hex
you can set the number of bins
base_plot + geom_hex(bins = 200)
small points
Reduces overplotting
base_plot + geom_point(shape = ".")  # "." draws each point as a single pixel
use a smoother
This relies on having a smoothing method that will get you the detail you want without crashing or taking too long. In this case the number of knots was chosen by trial and error (and perhaps you will want more detail)
library(mgcv)
base_plot + stat_smooth(method = "gam", formula = y ~ s(x, k = 50))
Is there any way to add a reduced major axis line (and ideally a CI) to a ggplot? I know I can use method="lm" to get an OLS fit, but there doesn't seem to be a default method for RMA. I can get the RMA coefs and the CI from package lmodel2, but adding them with geom_abline() doesn't seem to work. Here's dummy data and code. I just want to replace the OLS line and CI with an RMA line and CI:
dat <- data.frame(a=log10(rnorm(50, 30, 10)), b=log10(rnorm(50, 20, 2)))
ggplot(dat, aes(x=a, y=b) ) +
geom_point(shape=1) +
geom_smooth(method="lm")
Edit1: the code below gets the RMA (here called SMA - standardized major axis) coefs and CIs. Package lmodel2 provides more detailed output, while package smatr returns just the coefs and CIs, if that's any help:
library(lmodel2)
fit1 <- lmodel2(b ~ a, data=dat)
library(smatr)
fit2 <- line.cis(b, a, data=dat)
Building off Joran's answer, I think it's a little easier to pass the whole data frame to geom_abline:
library(ggplot2)
library(lmodel2)
dat <- data.frame(a=log10(rnorm(50, 30, 10)), b=log10(rnorm(50, 20, 2)))
mod <- lmodel2(a ~ b, data=dat, range.y="interval", range.x="interval", nperm=99)
reg <- mod$regression.results
names(reg) <- c("method", "intercept", "slope", "angle", "p-value")
ggplot(dat) +
geom_point(aes(b, a)) +
geom_abline(data = reg, aes(intercept = intercept, slope = slope, colour = method))
As Chase commented, the actual lmodel2() code and the ggplot code you are using would be helpful. But here's an example that may point you in the right direction:
dat <- data.frame(a=log10(rnorm(50, 30, 10)), b=log10(rnorm(50, 20, 2)))
mod <- lmodel2(a ~ b, data=dat, range.y="interval", range.x="interval", nperm=99)
#EDIT: mod is a list, with components (data.frames) regression.results and
# confidence.intervals containing the intercepts+slopes for different
# estimation methods; just put the right values into geom_abline
ggplot(dat,aes(x=b,y=a)) + geom_point() +
geom_abline(intercept=mod$regression.results[4,2],
slope=mod$regression.results[4,3],colour="blue") +
geom_abline(intercept=mod$confidence.intervals[4,2],
slope=mod$confidence.intervals[4,4],colour="red") +
geom_abline(intercept=mod$confidence.intervals[4,3],
slope=mod$confidence.intervals[4,5],colour="red") +
xlim(c(-10,10)) + ylim(c(-10,10))
Full disclosure: I know nothing about RMA regression, so I just plucked out the relevant slopes and intercepts and plopped them into geom_abline(), using some example code from lmodel2 as a guide. The CIs produced in this toy example don't seem to make much sense, since I had to force ggplot to zoom out using xlim() and ylim() in order to see the CI lines (red).
But maybe this will help you construct a working example in ggplot().
EDIT2: With the OP's added code to extract the coefficients, the ggplot() call would be something like this:
ggplot(dat,aes(x=b,y=a)) + geom_point() +
geom_abline(intercept=fit2[1,1],slope=fit2[2,1],colour="blue") +
geom_abline(intercept=fit2[1,2],slope=fit2[2,2],colour="red") +
geom_abline(intercept=fit2[1,3],slope=fit2[2,3],colour="red")
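For what it's worth, the point estimate of the standardized (reduced) major axis can also be computed by hand, which makes it easy to see what ends up in geom_abline(). The slope is the ratio of the standard deviations, carrying the sign of the correlation, and the line passes through the means. This is only a sketch: it uses its own simulated data rather than the dummy data above, gives no confidence interval, and the names rma_slope and rma_intercept are mine.

```r
library(ggplot2)

# Illustrative data: b depends linearly on a, plus noise
set.seed(1)
dat <- data.frame(a = rnorm(50))
dat$b <- 0.5 * dat$a + rnorm(50, sd = 0.5)

# SMA/RMA slope: sign of the correlation times sd(y) / sd(x)
rma_slope <- sign(cor(dat$a, dat$b)) * sd(dat$b) / sd(dat$a)
# The line passes through the centroid (mean(a), mean(b))
rma_intercept <- mean(dat$b) - rma_slope * mean(dat$a)

p <- ggplot(dat, aes(x = a, y = b)) +
  geom_point(shape = 1) +
  geom_abline(intercept = rma_intercept, slope = rma_slope, colour = "blue")
# print(p) draws the figure
```

For the confidence limits you would still go to lmodel2 or smatr as above.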
I found myself in the same situation.
Obtain fitted values and their confidence intervals using the ggpmisc package:
library(ggpmisc)
ci <- predict.lmodel2(fit1, method= 'RMA', interval= "confidence")
Add the model predictions to your data:
datci <- cbind(dat, ci)
Plot the fitted line with geom_line, setting arguments such as transparency and line width (of course, you can customise them). Because fit1 models b as a function of a, a goes on the x-axis and the fit column returned by predict goes on the y-axis:
p <- ggplot(datci, aes(x= a, y= b)) + geom_point() + geom_line(aes(y= fit), lwd= 1.1, alpha= 0.6)
Use geom_ribbon if you want to add confidence intervals:
p + geom_ribbon(aes(ymin= lwr, ymax= upr), fill= "grey50", alpha= 0.3, color= NA)