I have boxplots representing results of two methods, each with two validation approaches and three scenarios, to be plotted using ggplot2. Everything works fine, but I want to change the x-axis tick label to differentiate between the type of technique used in each group.
I used the following code:
data <- read.csv("results.csv", header = TRUE, sep=',')
ggplot() +
geom_boxplot(data = data, aes(x = Validation, y = Accuracy, fill = Scenario)) +
facet_wrap(~ Method) +
labs(fill = "")
where the structure of my data is as follows:
Method Validation Scenario Accuracy
-------------------------------------------------------
Method 1 Iterations Scenario 1 0.90
Method 1 Iterations Scenario 2 0.80
Method 1 Iterations Scenario 3 0.86
Method 1 Recursive Scenario 2 0.82
Method 2 Iterations Scenario 1 0.69
Method 2 Recursive Scenario 3 0.75
and got the following plot:
I just want to change the first x-tick label (Iterations) in Method 1 and Method 2 into 100-iterations and 10-iterations, respectively.
I tried to add this code but that changes the labels for both groups.
+ scale_x_discrete(name = "Validation",
labels = c("100-iterations", "Recursive",
"10-iterations", "Recursive")) +
Thanks in advance.
The ggplot2 package's facet options were not designed for varying axis labels or scales across facets, but one workaround in this instance is to vary the underlying x-axis variable's values between facets and set scales = "free_x" in facet_wrap(), so that only the relevant values are shown on each facet's x-axis:
library(ggplot2)
library(dplyr)
ggplot(data %>%
mutate(Validation = case_when(Validation == "Recursive" ~ "Recursive",
Method == "Method 1" ~ "100-iterations",
TRUE ~ "10-iterations")),
aes(x = Validation, y = Accuracy, fill = Scenario)) +
geom_boxplot() +
facet_wrap(~ Method, scales = "free_x")
Data:
set.seed(1)
data <- data.frame(
Method = rep(c("Method 1", "Method 2"), each = 100),
Validation = rep(c("Iterations", "Recursive"), times = 100),
Scenario = sample(c("Scenario 1", "Scenario 2", "Scenario 3"), 200, replace = TRUE),
Accuracy = runif(200)
)
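If you would rather not depend on dplyr, the recoding step in the answer can be sketched in base R with nested ifelse() calls. This is only an illustration on a small made-up data frame (toy is a hypothetical name, not the asker's data); the logic mirrors the case_when() call.

```r
# Toy data with the same two columns as the question (values are made up)
toy <- data.frame(
  Method     = rep(c("Method 1", "Method 2"), each = 4),
  Validation = rep(c("Iterations", "Recursive"), times = 4),
  stringsAsFactors = FALSE
)

# Keep "Recursive" as is; otherwise pick the label from the Method
toy$Validation <- ifelse(
  toy$Validation == "Recursive", "Recursive",
  ifelse(toy$Method == "Method 1", "100-iterations", "10-iterations")
)

# Each method now has its own label for the iterations group
table(toy$Method, toy$Validation)
```

The recoded column can then be passed to ggplot() with facet_wrap(~ Method, scales = "free_x") exactly as in the dplyr version.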
Reproduced from this code:
library(haven)
library(survey)
library(dplyr)
nhanesDemo <- read_xpt(url("https://wwwn.cdc.gov/Nchs/Nhanes/2015-2016/DEMO_I.XPT"))
# Rename variables into something more readable
nhanesDemo$fpl <- nhanesDemo$INDFMPIR
nhanesDemo$age <- nhanesDemo$RIDAGEYR
nhanesDemo$gender <- nhanesDemo$RIAGENDR
nhanesDemo$persWeight <- nhanesDemo$WTINT2YR
nhanesDemo$psu <- nhanesDemo$SDMVPSU
nhanesDemo$strata <- nhanesDemo$SDMVSTRA
nhanesAnalysis <- nhanesDemo %>%
mutate(LowIncome = case_when(
    INDFMIN2 < 40 ~ TRUE,
    TRUE ~ FALSE
)) %>%
# Select the necessary columns
select(INDFMIN2, LowIncome, persWeight, psu, strata)
# Set up the design
nhanesDesign <- svydesign(id = ~psu,
strata = ~strata,
weights = ~persWeight,
nest = TRUE,
data = nhanesAnalysis)
svyhist(~log10(INDFMIN2), design=nhanesDesign, main = '')
How do I color the histogram by independent variable, say, LowIncome? I want to have two separate histograms, one for each value of LowIncome. Unfortunately I picked a bad example, but I want them to be see-through in case their values overlap.
If you want to plot a histogram from your survey design, you can extract its data with model.frame() (this is what svyhist does under the hood). To get the histogram filled by group, you can pass this data frame to ggplot:
library(ggplot2)
ggplot(model.frame(nhanesDesign), aes(log10(INDFMIN2), fill = LowIncome)) +
geom_histogram(alpha = 0.5, color = "gray60", breaks = 0:20 / 10) +
theme_classic()
Edit
As Thomas Lumley points out, this does not incorporate sampling weights, so if you wanted this you could do:
ggplot(model.frame(nhanesDesign), aes(log10(INDFMIN2), fill = LowIncome)) +
geom_histogram(aes(weight = persWeight), alpha = 0.5,
color = "gray60", breaks = 0:20 / 10) +
theme_classic()
To demonstrate this approach works, we can replicate Thomas's approach in ggplot using the data example from svyhist. To get the uneven bin sizes (if this is desired), we need two histogram layers, though I'm guessing this would not be required for most use-cases.
ggplot(model.frame(dstrat), aes(enroll)) +
geom_histogram(aes(fill = "E", weight = pw, y = after_stat(density)),
data = subset(model.frame(dstrat), stype == "E"),
breaks = 0:35 * 100,
position = "identity", col = "gray50") +
geom_histogram(aes(fill = "Not E", weight = pw, y = after_stat(density)),
data = subset(model.frame(dstrat), stype != "E"),
position = "identity", col = "gray50",
breaks = 0:7 * 500) +
scale_fill_manual(NULL, values = c("#00880020", "#88000020")) +
theme_classic()
You can't just extract the data and use ggplot, because that won't use the weights and so misses the whole point of svyhist. You can use the add=TRUE argument, though. You do need to set the x and y axis ranges correctly to make sure the whole plot is visible.
Using the data example from ?svyhist:
svyhist(~enroll, subset(dstrat,stype=="E"), col="#00880020",ylim=c(0,0.003),xlim=c(0,3500))
svyhist(~enroll, subset(dstrat,stype!="E"), col="#88000020",add=TRUE)
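Both answers use dstrat without defining it. A minimal setup, following the example in ?svyhist (the api data sets ship with the survey package), would look roughly like this:

```r
library(survey)

# Load the California API example data sets bundled with the survey package
data(api)

# Stratified design on school type, as in the ?svyhist example
dstrat <- svydesign(id = ~1, strata = ~stype, weights = ~pw,
                    data = apistrat, fpc = ~fpc)
```

With dstrat in place, either the two svyhist() calls or the ggplot version based on model.frame(dstrat) can be run as written.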
Here is my model. Exam_taken is a binary variable (0,1), and social class (1-10 scale) and GDP are continuous variables.
fit<-glm(Exam_taken~Gender+Social_class*GDP, data=final, family=binomial(link="probit"))
summary(fit)
I need to draw graphs. Goal 1) the relationship between Social_class and Exam_taken; Goal 2) the interaction of Social_class*GDP on Exam_taken.
I encountered two problems.
I used the following code for Goal 1:
#exclude missing values
final=subset(final, final$Social_class!="NA")
final=subset(final, final$Exam_taken!="NA")
#graph
library(popbio)
logi.hist.plot(final$Social_class, final$Exam_taken, boxp=FALSE, type = "hist")
I got an error "Error in seq.default(min(independ),max(independ),len=100):'from' must be a finite number"
How to fix it? Thank you so much
I have no idea how to draw the interaction with two continuous variables on a binary outcome. Can anyone provide some directions? Thanks!
It can be difficult to represent a regression involving three independent variables, since it is effectively a four-dimensional structure. However, since one of the variables (Gender) has only two levels, and Social class has 10 discrete levels, we can display the model using color scales and facets. First we create a data frame with all combinations of Gender and Social class at every value of GDP from, say, $1000 to $100,000:
pred_df <- expand.grid(Gender = c("Male", "Female"),
Social_class = 1:10,
GDP = 1:100 * 1000)
Now we get the probability of taking the exam at each combination:
pred_df$fit <- predict(fit, newdata = pred_df, type = "response")
We can then plot the model predictions like so:
library(ggplot2)
ggplot(pred_df, aes(GDP, fit, colour = Social_class, group = Social_class)) +
geom_line() +
facet_grid(Gender~.) +
scale_x_continuous(labels = scales::dollar, limits = c(0, 1e5)) +
labs(y = "Probability of taking exam",
color = "Social class") +
scale_color_viridis_c(breaks = 1:10) +
theme_minimal(base_size = 16) +
guides(color = guide_colorbar(barheight = unit(50, "mm")))
Data used
Obviously, we don't have your data, but we can make a reasonable replica given clues from your description and code.
set.seed(1)
final <- data.frame(Gender = rep(c("Male", "Female"), 100),
Social_class = sample(10, 200, TRUE),
GDP = 1000 * sample(20:60, 200, TRUE))
final$Exam_taken <- rbinom(200, 1,
c(0, 0.1) + 0.05 * final$Social_class +
final$GDP/1e5 - 0.2)
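The answer does not show the step that connects the replica data to the predictions, so here is a minimal sketch of it, assuming the replica data above and the model formula from the question. The pmin()/pmax() clamp on the simulated probabilities is my own addition to guard against floating-point values just outside [0, 1]; it is not part of the original answer.

```r
set.seed(1)

# Replica data, as in the answer
final <- data.frame(Gender       = rep(c("Male", "Female"), 100),
                    Social_class = sample(10, 200, TRUE),
                    GDP          = 1000 * sample(20:60, 200, TRUE))

# Simulated outcome; probabilities clamped to [0, 1] (assumption, see above)
p <- c(0, 0.1) + 0.05 * final$Social_class + final$GDP / 1e5 - 0.2
final$Exam_taken <- rbinom(200, 1, pmin(pmax(p, 0), 1))

# Probit model with the formula from the question
fit <- glm(Exam_taken ~ Gender + Social_class * GDP,
           data = final, family = binomial(link = "probit"))

# Predicted probabilities at a few corner values of the predictors
pred_small <- expand.grid(Gender       = c("Male", "Female"),
                          Social_class = c(1, 10),
                          GDP          = c(20000, 60000))
pred_small$fit <- predict(fit, newdata = pred_small, type = "response")
pred_small
```

Because type = "response" is used, the fit column is on the probability scale, which is why it can be plotted directly on the y-axis.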
You can use the sjPlot package to plot the predicted values from the model. If you save the output of the plot_model() function, you can modify its appearance using ggplot2.
Here is one of many pages that can show you other options with this package:
https://cran.r-project.org/web/packages/sjPlot/vignettes/plot_model_estimates.html
library(sjPlot)
plot_model(fit, type = "int")
My problem seems simple: I am using ggplot2 with geom_jitter() to plot a variable (take my picture as an example).
Jitter now adds some random noise to the variable (the variable is just called "1" in this example) to prevent overplotting. So I have now random noise in the y-direction and clearly what otherwise would be completely overplotted is now better visible.
But here is my question:
As you can see, there are still some points that overplot each other. In my example, this could easily be prevented if the offsets in the y-direction were placed more strategically instead of being random noise.
Can I somehow alter the geom_jitter() behavior or is there a similar function in ggplot2 that does exactly this?
Not really a minimal example, but also not too long:
library("imputeTS")
library("ggplot2")
data <- tsAirgap
# 2.1 Create required data
# Get all indices of the data that comes directly before and after an NA
na_indx_after <- which(is.na(data[1:(length(data) - 1)])) + 1
# starting from index 2 moves all indexes one in front, so no -1 needed for before
na_indx_before <- which(is.na(data[2:length(data)]))
# Get the actual values to the indices and put them in a data frame with a label
before <- data.frame(id = "1", type = "before", input = na_remove(data[na_indx_before]))
after <- data.frame(id = "1", type = "after", input = na_remove(data[na_indx_after]))
all <- data.frame(id = "1", type = "source", input = na_remove(data))
# Get n values for the plot labels
n_before <- length(before$input)
n_all <- length(all$input)
n_after <- length(after$input)
# 2.4 Create dataframe for ggplot2
# join the data together in one dataframe
df <- rbind(before, after, all)
# Create the plot
gg <- ggplot(data = df) +
geom_jitter(mapping = aes(x = id, y = input, color = type, alpha = type), width = 0.5 , height = 0.5)
gg <- gg + ggplot2::scale_color_manual(
  values = c("before" = "skyblue1", "after" = "yellowgreen", "source" = "gray66")
)
gg <- gg + ggplot2::scale_alpha_manual(
  values = c("before" = 1, "after" = 1, "source" = 0.3)
)
gg + ggplot2::theme_linedraw() + theme(aspect.ratio = 0.5) + ggplot2::coord_flip()
So many good suggestions... here is what Ben's suggestion would look like for my example:
I changed parts of my code to:
gg <- ggplot(data = df, aes(x = input, color = type, fill = type, alpha = type)) +
geom_dotplot(binwidth = 15)
This would basically also work as intended for me. ggbeeswarm, as suggested by Jon, also worked great for my purpose.
I thought of a hack I really like, using ggrepel. It's normally used for labels, but nothing prevents you from making the label into a point.
df <- data.frame(x = rnorm(200),
col = sample(LETTERS[1:3], 200, replace = TRUE),
y = 1)
ggplot(df, aes(x, y, label = "●", color = col)) + # using unicode black circle
ggrepel::geom_text_repel(segment.color = NA,
box.padding = 0.01, key_glyph = "point")
A downside of this method is that ggrepel can take a lot of time for a large number of points, and it will recalculate the positions differently each time you change the plot size. A faster alternative would be to use ggbeeswarm::geom_quasirandom, which uses a deterministic process to define jitter that looks random.
ggplot(df, aes(x,y, color = col)) +
ggbeeswarm::geom_quasirandom(groupOnX = FALSE)
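To make the idea of "strategically placed offsets" concrete, here is a rough base R sketch of one deterministic scheme: bin the values, then stack the points within each bin. This is roughly the idea behind a dot plot, not the actual algorithm used by geom_dotplot or ggbeeswarm, and stack_offsets is a hypothetical helper name.

```r
# Hypothetical helper (not part of ggplot2 or ggbeeswarm):
# bin x into intervals of width `binwidth`, then number the points
# inside each bin 0, 1, 2, ... so they can be stacked without overlap
stack_offsets <- function(x, binwidth) {
  bin <- floor(x / binwidth)
  ave(seq_along(x), bin, FUN = seq_along) - 1
}

x <- c(1, 2, 11, 12, 13, 25)
data.frame(x = x, offset = stack_offsets(x, binwidth = 10))
```

The offset column could be mapped to y in geom_point() in place of random jitter; the binwidth controls how close two values must be before they are stacked.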
I'm using a data frame in R with 3 variables. I want to plot (with ggplot) 2 variables (CMod4X and CMod5X) as two distinct lines, as a function of the third variable (AmtX). In the end I succeeded in creating a graph that suits me, but I fail to include a legend. I have already consulted some other threads here, but the answers don't seem to work for me.
The (artificial) data set looks like this
AmtX <- seq(from = 1, to = 10001, by = 50)
CMod4X <- rnorm(201, mean = 0.87, sd = 0.01)
CMod5X <- rnorm(201, mean = 0.84, sd = 0.01)
EvalAmtX <- as.data.frame(cbind(AmtX,CMod4X,CMod5X))
I have made the plot like this
pltX <- ggplot(data = EvalAmtX, aes (x = AmtX)) +
geom_line(aes(y = CMod4X), color = "red", show.legend = TRUE) +
geom_line(aes(y = CMod5X), color = "blue", show.legend = TRUE) +
geom_smooth(aes(y = CMod4X), color = "red", se = FALSE, show.legend = TRUE) +
geom_smooth(aes(y = CMod5X), color = "blue", se = FALSE, show.legend = TRUE) +
labs(y = "C-index", x = "Amount (Tau)", title = "model 4 and model 5") +
scale_colour_manual(name = "Models", values = c("CMod4" = "red", "CMod5" = "blue"))
pltX
But this plot doesn't include a legend. I've included my plot below:
What am I doing wrong and what must I do to obtain a plot telling me the red line is CMod4 and the blue line is CMod5?
Thx for your help!!
Leonard
I guess you need to dive a little deeper into how ggplot2 works, since your question comes down to the basic setup of your data frame. There are a lot of great resources around on this topic. Anyway, here are two solutions for putting the legend into your graph.
Solution 1: Rearrange data frame to long format
library(reshape2)
df <- melt(data = EvalAmtX, id.vars = "AmtX")
The data frame now looks like this:
head(df)
# AmtX variable value
# 1 1 CMod4X 0.8772716
# 2 51 CMod4X 0.8524197
# 3 101 CMod4X 0.8686019
# 4 151 CMod4X 0.8638835
# 5 201 CMod4X 0.8674627
# 6 251 CMod4X 0.8729925
Now, plotting is easy. Instead of telling ggplot2 the color of each individual line, you simply give it the information which column in your data frame contains the factor that should determine the color of the lines. So you add another aesthetic (col = variable). This also automatically adds a legend for color.
ggplot(df, aes(x=AmtX, y=value, col = variable)) +
geom_line()
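As an alternative to reshape2, the same long format can be produced with tidyr::pivot_longer(). This is a sketch assuming tidyr is installed; the column names variable and value are chosen only to match the melt() output above.

```r
library(tidyr)

# Recreate the artificial data from the question
set.seed(1)
EvalAmtX <- data.frame(
  AmtX   = seq(from = 1, to = 10001, by = 50),
  CMod4X = rnorm(201, mean = 0.87, sd = 0.01),
  CMod5X = rnorm(201, mean = 0.84, sd = 0.01)
)

# One row per (AmtX, model) pair; names chosen to mirror melt()
df_long <- pivot_longer(EvalAmtX,
                        cols      = c(CMod4X, CMod5X),
                        names_to  = "variable",
                        values_to = "value")
head(df_long)
```

The resulting df_long can be passed to the same ggplot(aes(x = AmtX, y = value, col = variable)) call.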
Solution 2: Use a manual color scale
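You almost got it right in your code. The missing step is to move color inside aes() and map it to a label: a color set outside aes() is a fixed property that never reaches the scale, so no legend is drawn. The labels must match the names used in scale_colour_manual().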
pltX <- ggplot(data = EvalAmtX, aes (x = AmtX)) +
geom_line(aes(y = CMod4X, color = "CMod4")) +
geom_line(aes(y = CMod5X, color = "CMod5")) +
geom_smooth(aes(y = CMod4X, color = "CMod4"), se = FALSE) +
geom_smooth(aes(y = CMod5X, color = "CMod5"), se = FALSE) +
labs(y = "C-index", x = "Amount (Tau)", title = "model 4 and model 5") +
scale_colour_manual(name = "Models", values = c(CMod4 = "red", CMod5 = "blue"))
pltX
I'm currently finishing off my Master's project and need to include some graphics for the write-up. Without boring you too much, I have some data which is associated with AR(1) parameters ranging from 0.1 to 0.9 in 0.1 increments. As such I thought of doing a faceted histogram like the one below (worry not about the hideous fruit salad of colours, it will not be used).
I used this code.
ggplot(opt_lens_geom,aes(x=l_1024,fill=factor(rho))) + geom_histogram()+coord_flip()+facet_grid(.~rho,scales = "free_x")
I would also like to draw a trend line for the median values, since the AR(1) parameter is continuous. In a later iteration I deleted the padding and made it "look" like it was one graph, but I have had issues with the endpoints matching up, since each facet is a separate graphical device. Can anyone give me some advice on how to do this? I am not particularly partial to the faceting, so if it is not needed I can do away with it.
I will try to upload sample data, but simulating 100 values for each of the 9 rhos would work to get started, like:
opt_lens_geom <- data.frame(rho= rep(seq(0.1,0.9,by=0.1),each=100),l_1024=rnorm(900))
You might consider ggridges. I've assumed here that you want a median value for each value of rho.
library(ggplot2)
library(ggridges)
library(dplyr)
set.seed(1001)
opt_lens_geom <- data.frame(rho = rep(seq(0.1, 0.9, by = 0.1), each = 100),
l_1024 = rnorm(900))
opt_lens_geom %>%
mutate(rho_f = factor(rho)) %>%
ggplot(aes(l_1024, rho_f)) +
stat_density_ridges(quantiles = 2, quantile_lines = TRUE)
You can add scale = 1 as a parameter to stat_density_ridges if you don't like the amount of overlap.
Try the following. It uses a pre-computed data frame of the medians.
library(ggplot2)
df <- iris[c(1, 5)]
names(df) <- c("val", "rho")
med <- plyr::ddply(df, "rho", plyr::summarise, m = median(val))
ggplot(data = df, aes(x = val, fill = factor(rho))) +
geom_histogram() +
coord_flip() +
geom_vline(data = med, aes(xintercept = m), colour = 'black') +
facet_wrap(~ factor(rho))
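If faceting is dropped, the pre-computed medians can also be joined into a single trend line across rho. Here is a rough sketch on the simulated data from the question, using base R aggregate() for the medians; box plots stand in for the histograms, which is my substitution rather than something the question asked for.

```r
library(ggplot2)

set.seed(1001)
opt_lens_geom <- data.frame(rho    = rep(seq(0.1, 0.9, by = 0.1), each = 100),
                            l_1024 = rnorm(900))

# One median per value of rho
med <- aggregate(l_1024 ~ rho, data = opt_lens_geom, FUN = median)

# rho stays continuous on the x-axis, so the medians can be joined by a line
p <- ggplot(opt_lens_geom, aes(x = rho, y = l_1024)) +
  geom_boxplot(aes(group = rho), outlier.shape = NA) +
  geom_line(data = med, colour = "black") +
  geom_point(data = med, colour = "black")
p
```

Since everything lives in one panel, there are no facet endpoints to line up.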
You could do a variant on this using geom_violin instead of using histograms, although you wouldn't get labelled counts, just an idea of the relative density. Example with made up data:
df = data.frame(
rho = rep(c(0.1, 0.2, 0.3), each = 50),
val = sample(1:10, 150, replace = TRUE)
)
df$val = df$val + (5 * (df$rho == 0.2)) + (8 * (df$rho == 0.3))
ggplot(df, aes(x = rho, y = val, fill = factor(rho))) +
geom_violin() +
stat_summary(aes(group = 1), colour = "black",
               geom = "line", fun = "median")
This produces a violin for each value of rho, and joins the medians for each violin.