Density plots in lattice that are proportional to the total group - r

I am trying to create density plots in lattice where each estimate is proportional to the total group and not just to the subset.
This is possible using histograms, like so:
require(lattice)
set.seed(1)
sex <- factor(sample(c("Men", "Women"), 200, replace = TRUE, prob = c(.3, .7)))
age <- rnorm(200, 50, 10)
histogram(~age | sex, type = "count")
But how do I manage this using densityplot()? The type = "count" argument does not seem to work here.
Edit: This is the result I'm looking for:
library(ggplot2)
data <- data.frame(sex, age)
ggplot(data, aes(age)) +
geom_density(aes(y = after_stat(count))) +
facet_grid(~ sex)
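One possible way to get the same picture in lattice is to compute the densities by hand, multiply each by its group size, and draw the curves with xyplot(). This is only a sketch under that assumption; it does not use densityplot() itself, and the column name `count` is made up for the example.

```r
library(lattice)

set.seed(1)
sex <- factor(sample(c("Men", "Women"), 200, replace = TRUE, prob = c(.3, .7)))
age <- rnorm(200, 50, 10)

# One kernel density estimate per group, rescaled by the group size so that
# the area under each curve is roughly the number of observations in the group.
dens <- do.call(rbind, lapply(levels(sex), function(g) {
  x <- age[sex == g]
  d <- density(x)
  data.frame(sex = factor(g, levels = levels(sex)),
             age = d$x,
             count = d$y * length(x))
}))

# Draw the rescaled curves as lines, one panel per group.
p <- xyplot(count ~ age | sex, data = dens, type = "l", ylab = "count")
print(p)
```

Multiplying the density by the number of observations is also what the count variable of geom_density() represents, so the two plots should be comparable.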

Related

Fit a regression line with categorical variable in ggplot?

I have simulated a dataset and stored it in a tibble:
library(tidyverse)
set.seed(2002)
tre.sett <- rnorm(n = 12, mean = 41, sd = 5) # 12 individuals
ett.sett <- rnorm(n = 12, mean = 21, sd = 5) # 12 individuals
dat <- tibble(individ = 1:24,
gruppe = rep(c("tre.sett", "ett.sett"), c(length(tre.sett), length(ett.sett))),
rm = c(tre.sett, ett.sett))
Next I can create a basic plot of rm and gruppe using ggplot from tidyverse.
ggplot(dat, aes(gruppe, rm)) +
geom_point()+
theme_bw()
This gives me the following figure:
I want to add a regression line between the two groups, but I'm struggling to implement one. If I use geom_smooth(), nothing appears in the figure. The intercept and slope from my model are 21.900 and 20.524, respectively.
One solution has been given in the comments: re-encode the categories as integers before using geom_smooth.
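A minimal sketch of that re-encoding, assuming data simulated the same way as in the question; the column name `gruppe_num` is invented for the example:

```r
library(ggplot2)

set.seed(2002)
dat <- data.frame(
  gruppe = rep(c("tre.sett", "ett.sett"), each = 12),
  rm     = c(rnorm(12, mean = 41, sd = 5), rnorm(12, mean = 21, sd = 5))
)

# Re-encode the groups as integers: factor levels sort alphabetically,
# so "ett.sett" becomes 1 and "tre.sett" becomes 2.
gruppe_f <- factor(dat$gruppe)
dat$gruppe_num <- as.integer(gruppe_f)

# With a numeric x, geom_smooth() can fit and draw the line;
# the axis labels are mapped back to the group names.
p <- ggplot(dat, aes(gruppe_num, rm)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE) +
  scale_x_continuous(breaks = 1:2, labels = levels(gruppe_f)) +
  theme_bw()
print(p)
```

Because the two codes are one unit apart, the slope of this line is simply the difference between the two group means.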
Another solution: since the "regression line" just connects the means of the two groups, you can use stat_summary:
dat %>%
ggplot(aes(gruppe, rm)) +
geom_point() +
stat_summary(geom = "line", fun = mean, group = 1) +
theme_bw()
Result:
You might also want to look at the sjPlot package which uses the plot_model function to visualise regression models. It would be used something like this:
library(sjPlot)
lm1 <- lm(rm ~ gruppe, data = dat)
lm1 %>%
plot_model(type = "pred",
terms = "gruppe",
show.data = TRUE) +
geom_line() +
theme_bw()
Result:

ggplot2 geom_qq change theoretical data

I have a set of p-values, i.e. 0 <= pval <= 1.
I want to draw a Q-Q plot of them using ggplot2.
As in the documentation, the following code draws a Q-Q plot. However, since my data are p-values, I want the theoretical values to be probabilities as well, i.e. 0 <= theoretical value <= 1.
df <- data.frame(y = rt(200, df = 5))
p <- ggplot(df, aes(sample = y))
p + stat_qq() + stat_qq_line()
I am aware of qqplot.pvalues from the gaston package. It does the job, but the plot is not as customizable as the ggplot version.
In the gaston package the theoretical data are plotted as -log10((n:1)/(n + 1)), where n is the number of p-values. How can I pass these values to ggplot as the theoretical data?
Assuming you have some p-values, say from a normal distribution, you could create the plot manually:
library(ggplot2)
data <- data.frame(outcome = rnorm(150))
data$pval <- pnorm(data$outcome)
data <- data[order(data$pval),]
# pnorm(qnorm(p)) is just p, so the theoretical quantiles are ppoints(n)
ggplot(data = data, aes(y = pval, x = ppoints(nrow(data)))) +
geom_point() +
geom_abline(slope = 1) +
labs(x = 'theoretical p-val', y = 'observed p-val', title = 'qqplot (pval-scale)')
Although I am not sure this plot is sensible to use for conclusions.
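For the -log10 scale mentioned in the question, the same manual approach should work. The sketch below takes the formula -log10((n:1)/(n + 1)) from the question as given (it has not been checked against the gaston source) and uses simulated uniform p-values as a stand-in for real ones:

```r
library(ggplot2)

set.seed(42)
pval <- runif(500)   # hypothetical p-values, uniform under the null
n <- length(pval)

qq <- data.frame(
  expected = -log10((n:1) / (n + 1)),  # theoretical values, ascending
  observed = sort(-log10(pval))        # observed values, ascending
)

p <- ggplot(qq, aes(expected, observed)) +
  geom_point() +
  geom_abline(intercept = 0, slope = 1) +
  labs(x = "expected -log10(p)", y = "observed -log10(p)")
print(p)
```

Since this is an ordinary ggplot object, themes, colours and extra layers can be added as usual.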

annotate r squared to ggplot by using facet_wrap

I just joined the community and am looking forward to getting some help with the data analysis for my master's thesis.
At the moment I have the following problem:
I plotted 42 varieties with ggplot using facet_wrap:
ggplot(sumfvvar,aes(x=TemperaturCmean,y=Fv.Fm,col=treatment))+
geom_point(shape=1,size=1)+
geom_smooth(method=lm)+
scale_color_brewer(palette = "Set1")+
facet_wrap(.~Variety)
That works very well, but I would like to annotate the plot with the r squared values of the regression lines. I have two treatments and 42 varieties, and therefore 84 regression lines.
Is there any way to calculate all the r squared values and integrate them into the ggplot? I have already found this function:
ggplotRegression <- function (fit) {
require(ggplot2)
ggplot(fit$model, aes_string(x = names(fit$model)[2], y = names(fit$model)[1])) +
geom_point() +
stat_smooth(method = "lm") +
labs(title = paste("Adj R2 = ",signif(summary(fit)$adj.r.squared, 5),
"Intercept =",signif(fit$coef[[1]],5 ),
" Slope =",signif(fit$coef[[2]], 5),
" P =",signif(summary(fit)$coef[2,4], 5)))
}
but it works for just one variety and one treatment. Could a loop over the lm() function be an option?
Here is an example with the ggpmisc package:
library(ggpmisc)
set.seed(4321)
x <- 1:100
y <- (x + x^2 + x^3) + rnorm(length(x), mean = 0, sd = mean(x^3) / 4)
my.data <- data.frame(x = x,
y = y,
group = c("A", "B"))
formula <- y ~ poly(x, 1, raw = TRUE)
ggplot(my.data, aes(x, y)) +
facet_wrap(~ group) +
geom_point() +
geom_smooth(method = "lm", formula = formula) +
stat_poly_eq(formula = formula, parse = TRUE,
mapping = aes(label = after_stat(rr.label)))
You can't apply different labels to different facets unless you add another r^2 column to your data. One way is to use geom_text, but you need to calculate the statistics first. Below I show an example with iris; for your case, just substitute your own columns (Variety corresponds to Species here, and so on).
library(tidyverse)
# simulate data for 2 treatments
# d2 is just shifted up from d1
d1 <- data.frame(iris,Treatment="A")
d2 <- data.frame(iris,Treatment="B") %>%
mutate(Sepal.Length=Sepal.Length+rnorm(nrow(iris),1,0.5))
# combine datasets
DF <- rbind(d1,d2) %>% rename(Variety = Species)
# plot like you did
# note I use "free" scales, if scales very different between Species
# your facet plots will be squished
g <- ggplot(DF,aes(x=Sepal.Width,y=Sepal.Length,col=Treatment))+
geom_point(shape=1,size=1)+
geom_smooth(method=lm)+
scale_color_brewer(palette = "Set1")+
facet_wrap(.~Variety,scales="free")
# rsq function
RSQ = function(y,x){signif(summary(lm(y ~ x))$adj.r.squared, 3)}
#calculate rsq for variety + treatment
STATS <- DF %>%
group_by(Variety,Treatment) %>%
summarise(Rsq=RSQ(Sepal.Length,Sepal.Width)) %>%
# make a label
# one other option is to use stringr::str_wrap in geom_text
mutate(Label=paste("Treat",Treatment,", Rsq=",Rsq))
# set vertical position of rsq
VJUST = ifelse(STATS$Treatment=="A",1.5,3)
# finally the plot function
g + geom_text(data=STATS,aes(x=-Inf,y=+Inf,label=Label),
hjust = -0.1, vjust = VJUST,size=3)
In the last geom_text() call, the vertical position of each label depends on the Treatment (vjust is 1.5 for A and 3 for B), so the two labels in a panel do not overlap. You might need to adjust these values depending on your plot.
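As for the loop over lm() asked about in the question: the same table of statistics can be built in base R by splitting the data and fitting one model per piece. This is only a sketch on the iris stand-in data, with the same invented Treatment column as above:

```r
set.seed(1)
d1 <- data.frame(iris, Treatment = "A")
d2 <- data.frame(iris, Treatment = "B")
d2$Sepal.Length <- d2$Sepal.Length + rnorm(nrow(iris), 1, 0.5)
DF <- rbind(d1, d2)
names(DF)[names(DF) == "Species"] <- "Variety"

# Split into one data frame per Variety x Treatment combination.
groups <- split(DF, list(DF$Variety, DF$Treatment), drop = TRUE)

# Fit lm() in each piece and keep the adjusted R squared.
STATS <- do.call(rbind, lapply(groups, function(g) {
  fit <- lm(Sepal.Length ~ Sepal.Width, data = g)
  data.frame(Variety   = g$Variety[1],
             Treatment = g$Treatment[1],
             Rsq       = signif(summary(fit)$adj.r.squared, 3))
}))
print(STATS)
```

The resulting data frame has one row per regression line and can be passed to geom_text() in the same way as the dplyr version.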

apply jittering to outliers data in a boxplot with ggplot2

Do you have any idea how to apply jittering to just the outliers of a boxplot? This is the code:
ggplot(data = a, aes(x = "", y = a$V8)) +
geom_boxplot(outlier.size = 0.5)+
geom_point(data=a, aes(x="", y=a$V8[54]), colour="red", size=3) +
theme_bw()+
coord_flip()
thank you!!
Add a column to your data set that indicates which points are outliers. Then set geom_boxplot to not plot any outliers and use geom_point to plot the outliers explicitly.
I will use the diamonds data set from ggplot2 to illustrate.
library(ggplot2)
library(dplyr)
diamonds2 <-
diamonds %>%
group_by(cut) %>%
mutate(outlier = price > median(price) + IQR(price) * 1.5) %>%
ungroup()
ggplot(diamonds2) +
aes(x = cut, y = price) +
geom_boxplot(outlier.shape = NA) + # NO OUTLIERS
geom_point(data = function(x) dplyr::filter(x, outlier), position = 'jitter') # Outliers
This is a slightly different approach from the one above (it assigns a color variable with NA for non-outliers), and it includes a correction for the upper and lower bound calculations.
The default "outlier" definition is a point below the first quartile (25th percentile) minus 1.5 x the interquartile range (IQR), or above the third quartile (75th percentile) plus 1.5 x the IQR.
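As a small worked example of that rule, here is a sketch of a helper that flags such points in a plain vector. The name `is_outlier` and the argument `k` are hypothetical and not part of ggplot2:

```r
# Flag values below Q1 - k * IQR or above Q3 + k * IQR.
# `is_outlier` is a made-up helper name for illustration.
is_outlier <- function(x, k = 1.5) {
  q <- quantile(x, c(.25, .75), na.rm = TRUE)
  iqr <- q[[2]] - q[[1]]
  x < q[[1]] - k * iqr | x > q[[2]] + k * iqr
}

x <- c(1:10, 100)
flags <- is_outlier(x)
print(x[flags])
```

For this vector the quartiles are 3.5 and 8.5, so the fences sit at -4 and 16 and only the value 100 is flagged.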
Generate some sample data:
set.seed(1)
a <- tibble::tibble(x = factor(rep(1:4, each = 1000)),
V8 = c(rnorm(1000, 25, 4),
rnorm(1000, 50, 4),
rnorm(1000, 75, 4),
rnorm(1000, 100, 4)))
Calculate the upper/lower limit outliers (uses dplyr/tidyverse functions):
library(tidyverse)
a <- a %>% group_by(x) %>%
mutate(outlier.high = V8 > quantile(V8, .75) + 1.50*IQR(V8),
outlier.low = V8 < quantile(V8, .25) - 1.50*IQR(V8))
Define a color for the upper/lower points:
a <- a %>% mutate(outlier.color = case_when(outlier.high ~ "red",
outlier.low ~ "steelblue"))
Unclassified cases get NA as their color and will not appear in the plot.
The dplyr::case_when() function was not completely stable when this was written (it may have required the GitHub development version, newer than 0.5), so here is a base R alternative in case it does not work:
a$outlier.color <- NA
a$outlier.color[a$outlier.high] <- "red"
a$outlier.color[a$outlier.low] <- "steelblue"
Plot:
a %>% ggplot(aes(x, V8)) +
geom_boxplot(outlier.shape = NA) +
geom_jitter(color = a$outlier.color, width = .2) + # NA not plotted
theme_bw() + coord_flip()

Plot summary of unique observations with ggplot

Is it possible to count unique observations via a ggplot formula? For instance, by somehow achieving the same result as the code below while cutting out the middle line (the plyr::count() step)? My efforts so far, e.g. using geom_histogram with stat='bin', have failed.
set.seed(1)
d = data.frame(year = sample(2005:2009, 50, prob = 1:5, replace = TRUE),
group = sample(letters, 50, prob = 1:26, replace = TRUE))
d2 = plyr::count(unique(d)$year)
ggplot(d2, aes(x, freq)) + geom_bar(stat='identity') + labs(x='year', y='count of groups')
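Plotting unique(d) directly will do the trick. With a discrete x, use geom_bar() (which uses stat_count()); stat_bin() only accepts a continuous x in ggplot2 2.0.0 and later:
ggplot(unique(d), aes(x = as.factor(year))) +
geom_bar() +
labs(x='year', y='count of groups')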
