Displaying a Cross-tabulation as a Plot in RStudio

I'm trying to visualize a cross-tabulation in RStudio using ggplot2. I've created plots before, and I already have the cross-tabulation itself, but I can't work out how to plot it. Can anyone help?
Here's my code for an x-tab:
library(dplyr)
data_dan %>%
group_by(Sex, Segment) %>%
count(variant) %>%
mutate(prop = prop.table(n))
and here's what I've got for creating a plot:
#doing a plot
variance_art_new.plot = ggplot(data_dan, aes(Segment, fill=variant)) +
geom_bar(position="fill")+
theme_classic()+
scale_fill_manual(values = c("#fc8d59", "#ffffbf", "#99d594"))
variance_art_new.plot
Here's a sample of the data I'm operating with:
Word Segment variant Position Sex
1 LIKE K R End Female
2 LITE T S End Male
3 CRACK K R End Female
4 LIKE K R End Male
5 LIPE P G End Female
6 WALK K G End Female
My aim is to have the independent variables of 'Sex', 'Segment' plotted on a boxplot against the dependent variable 'variant'.
I included the first code to show that I can create a table to show this cross-tabulation and the second bit is what I normally do for running a box plot for just one independent variable.

I'm still not sure this gets all the way to what you are asking, but if you want counts (or proportions) within two separate variables, you can use facet_wrap to separate the two groups.
(Note, all of these are run with theme_set(theme_bw()) because I prefer it for this type of plot.)
Working with the builtin dataset mtcars you can get counts with:
mtcars %>%
ggplot(aes(x = factor(cyl), fill = factor(gear))) +
geom_bar() +
facet_wrap(~vs)
Or with the sorting reversed with:
mtcars %>%
ggplot(aes(x = factor(vs), fill = factor(gear))) +
geom_bar() +
facet_wrap(~cyl, labeller = label_both)
You can also plot the within-group distribution by using position = "fill"
mtcars %>%
ggplot(aes(x = factor(vs), fill = factor(gear))) +
geom_bar(position = "fill") +
facet_wrap(~cyl, labeller = label_both) +
scale_y_continuous(name = "Within group Percentage"
, labels = scales::percent)
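Applied to the data in the question, the same idea could look like the sketch below. It rebuilds only the six sample rows shown in the question (the full data_dan is not available here), and the choice to facet by Sex with Segment on the x-axis is an assumption about the layout that is wanted.

```r
library(ggplot2)

# Six sample rows copied from the question; the real data_dan is larger.
data_dan <- data.frame(
  Word     = c("LIKE", "LITE", "CRACK", "LIKE", "LIPE", "WALK"),
  Segment  = c("K", "T", "K", "K", "P", "K"),
  variant  = c("R", "S", "R", "R", "G", "G"),
  Position = "End",
  Sex      = c("Female", "Male", "Female", "Male", "Female", "Female")
)

# One panel per Sex, one bar per Segment, each bar filled by the share of
# every variant within that Segment.
xtab_plot <- ggplot(data_dan, aes(x = Segment, fill = variant)) +
  geom_bar(position = "fill") +
  facet_wrap(~Sex, labeller = label_both) +
  scale_y_continuous(name = "Proportion within Segment",
                     labels = scales::percent) +
  theme_classic()

xtab_plot
```

Swapping Sex and Segment between the x aesthetic and facet_wrap gives the other grouping, as in the mtcars examples above.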

Related

Combine scale_x_upset with scale_y_break

I made an upset plot using the ggupset package and added a break to the y axis with scale_y_break from the ggbreak package.
However, when I add scale_y_break, the combination matrix under the bar plot disappears.
Is there a way to combine the combination matrix of the plot made without scale_y_break with the bar plot portion of a plot made with scale_y_break? I can't seem to be able to access the grobs of these plots or use any other workaround. If anyone could help, I would greatly appreciate it!
Example with scale_x_upset and scale_y_break:
df = tidy_movies %>% distinct(title, year, length, .keep_all=TRUE)
ggplot(df, aes(x=Genres)) + geom_bar() + scale_x_upset(n_intersections = 20)+ scale_y_break(breaks = c(750,1000))
I would like to combine the barplot portion of the plot created with:
df = tidy_movies %>% distinct(title, year, length, .keep_all=TRUE)
ggplot(df, aes(x=Genres)) + geom_bar() + scale_x_upset(n_intersections = 20)+ scale_y_break(breaks = c(750,1000))
with the combination matrix portion of the plot made with:
df = tidy_movies %>% distinct(title, year, length, .keep_all=TRUE)
ggplot(df, aes(x=Genres)) + geom_bar() + scale_x_upset(n_intersections = 20)
Thanks!

Where does ggplot set the order of the color scheme?

I have a data set that I'm showing in a series of violin plots with one categorical variable and one continuous numeric variable. When R generated the original series of violins, the categorical variable was plotted alphabetically (I rotated the plot, so it appears alphabetically from bottom to top). I thought it would look better if I sorted them using the numeric variable.
When I do this, the color scheme doesn't turn out as I wanted it to. It's like R assigned the colors to the violins before it sorted them; after the sorting, they kept their original colors - which is the opposite of what I wanted. I wanted R to sort them first and then apply the color scheme.
I'm using the viridis color scheme here, but I've run into the same thing when I used RColorBrewer.
Here is my code:
# Start plotting
g <- ggplot(NULL)
# Violin plot
g <- g + geom_violin(data = df, aes(x = reorder(catval, -numval,
na.rm = TRUE), y = numval, fill = catval), trim = TRUE,
scale = "width", adjust = 0.5)
(snip)
# Specify colors
g <- g + scale_fill_viridis_d() # the violins are mapped to fill, so a fill scale is needed
# Remove legend
g <- g + theme(legend.position = "none")
# Flip for readability
g <- g + coord_flip()
# Produce plot
g
Here is the resulting plot.
If I leave out the reorder() argument when I call geom_violin(), the color order is what I would like, but then my categorical variable is sorted alphabetically and not by the numeric variable.
Is there a way to get what I'm after?
I think this is a reproducible example of what you're seeing. In the diamonds dataset, the mean price does not follow the quality order of cut: for example, "Fair" diamonds have a higher mean price than "Good" or "Very Good" diamonds.
library(dplyr)
diamonds %>%
group_by(cut) %>%
summarize(mean_price = mean(price))
# A tibble: 5 x 2
#   cut       mean_price
#   <ord>          <dbl>
# 1 Fair           4359.
# 2 Good           3929.
# 3 Very Good      3982.
# 4 Premium        4584.
# 5 Ideal          3458.
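By default, reorder sorts by the mean of the sorting variable, so with -price the categories run from the highest mean price to the lowest, and after coord_flip() Good is plotted above Very Good. But the fill is still mapped to the un-reordered variable cut, which is an ordered factor in order of quality, so the colors follow quality rather than position.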
ggplot(diamonds, aes(x = reorder(cut, -price),
y = price, fill = cut)) +
geom_violin() +
coord_flip()
If you want the color to follow the ordering, then you could reorder upstream of ggplot2, or reorder in both aesthetics:
ggplot(diamonds, aes(x = reorder(cut, -price),
y = price,
fill = reorder(cut, -price))) +
geom_violin() +
coord_flip()
Or
diamonds %>%
mutate(cut = reorder(cut, -price)) %>%
ggplot(aes(x = cut, y = price, fill = cut)) +
geom_violin() +
coord_flip()
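For the code in the question, the "reorder upstream" variant might look like the sketch below. The data frame is simulated (the asker's df is not available), with catval and numval mirroring the column names in the question. It also uses scale_fill_viridis_d() rather than the colour scale, on the assumption that the viridis palette is meant for the violin fill.

```r
library(ggplot2)

# Toy data standing in for the asker's df; the values are simulated.
set.seed(1)
df <- data.frame(
  catval = rep(c("a", "b", "c"), each = 30),
  numval = c(rnorm(30, mean = 2), rnorm(30, mean = 11), rnorm(30, mean = 6))
)

# Reorder once, before plotting, so x and fill share the same level order.
df$catval <- reorder(df$catval, -df$numval)
levels(df$catval)  # "b" "c" "a": highest mean first

violin_plot <- ggplot(df, aes(x = catval, y = numval, fill = catval)) +
  geom_violin(trim = TRUE, scale = "width", adjust = 0.5) +
  scale_fill_viridis_d() +            # fill scale, matching the fill aesthetic
  theme(legend.position = "none") +
  coord_flip()

violin_plot
```

Because the factor levels are changed in the data itself, every aesthetic that uses catval picks up the new order, so the palette runs in the same direction as the sorted violins.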

Box plot with ggplot2 using data from read.table

I am plotting a box plot that shows the height of students. However, I am unsure what to use as x and y. I only have measurements, so one axis should be height and the other the number of students with that height.
x=N, y=Height
My code:
# Library
library(ggplot2)
library(tidyverse)
# 1. Read data (comma separated)
data = read.table(text = "184,180,183,184,184,160,173",
sep=",",stringsAsFactors=F, na.strings="unknown")
# 2. Print table
print(data)
# 3. Plot box plot
data %>%
pivot_longer(cols = everything()) %>%
ggplot(aes(x=value, y=value)) +
geom_boxplot() +
theme_classic() +
xlab("Students") +
ylab("Height") +
ggtitle("Height of students")
I think the best plot to represent a single vector of data is a histogram. However, you could use a boxplot by creating a dummy factor that groups your observations, i.e.
data %>%
pivot_longer(cols = everything()) %>%
mutate(type="student") %>%
ggplot(aes(x=type, y=value)) +
geom_boxplot() +
theme_classic() +
xlab("Students") +
ylab("Height") +
ggtitle("Height of students")
If you want a histogram (which I think suits your situation much better), you don't need the dummy factor and could do something like:
data %>%
pivot_longer(cols = everything()) %>%
ggplot(aes(x=value)) +
geom_histogram() +
theme_classic() +
xlab("Height") +
ylab("Students") +
ggtitle("Height of students")
To use a boxplot correctly, you need one categorical variable and one continuous variable. Put the categorical one (e.g. male, female, etc.) on the x-axis and the continuous one on the y-axis (height in your case).
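A minimal sketch of that layout follows. The heights are the seven values from the question; the group column is invented purely to give the plot a categorical x variable and does not come from the original data.

```r
library(ggplot2)

# Heights are the seven values from the question.
# `group` is a made-up label, added only to have a categorical x variable.
heights <- data.frame(
  height = c(184, 180, 183, 184, 184, 160, 173),
  group  = c("A", "A", "A", "A", "B", "B", "B")
)

# Categorical variable on x, continuous variable on y: one box per group.
box_plot <- ggplot(heights, aes(x = group, y = height)) +
  geom_boxplot() +
  theme_classic() +
  labs(x = "Group (hypothetical)", y = "Height", title = "Height of students")

box_plot
```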

How to graph "before and after" measures using ggplot with connecting lines and subsets?

I’m totally new to ggplot, relatively fresh with R, and want to make a smashing "before-and-after" scatterplot with connecting lines to illustrate the movement in percentages of different subgroups before and after a special training initiative. I’ve tried some options, but have yet to:
show each individual observation separately (now same values are overlapping)
connect the related before and after measures (x=0 and X=1) with lines to more clearly illustrate the direction of variation
subset the data along class and id using shape and colors
How can I best create a scatter plot using ggplot (or other) fulfilling the above demands?
Main alternative: geom_point()
Here is some sample data and example code using geom_point:
x <- c(0,0,0,0,0,0,0,0,0,0,1,1,1,1,1,1,1,1,1,1) # 0=before, 1=after
y <- c(45,30,10,40,10,NA,30,80,80,NA,95,NA,90,NA,90,70,10,80,98,95) # percentage of "feelings of peace"
class <- c(0,0,0,0,0,0,0,0,1,1,0,0,0,0,0,0,0,0,1,1) # 0=multiple days 1=one day
id <- c(1,1,2,3,4,4,4,4,5,6,1,1,2,3,4,4,4,4,5,6) # id = per individual
df <- data.frame(x,y,class,id)
ggplot(df, aes(x=x, y=y), fill=id, shape=class) + geom_point()
Alternative: scale_size()
I have explored stat_sum() to summarize the frequencies of overlapping observations, but then I can't subset using colors and shapes because of the overlap.
ggplot(df, aes(x=x, y=y)) +
stat_sum()
Alternative: geom_dotplot()
I have also explored geom_dotplot() to separate the overlapping observations that arise from using geom_point(), as in the example below. However, I have yet to understand how to combine the before and after measures in the same plot.
df1 <- df[1:10,] # data before
df2 <- df[11:20,] # data after
p1 <- ggplot(df1, aes(x=x, y=y)) +
geom_dotplot(binaxis = "y", stackdir = "center",stackratio=2,
binwidth=(1/0.3))
p2 <- ggplot(df2, aes(x=x, y=y)) +
geom_dotplot(binaxis = "y", stackdir = "center",stackratio=2,
binwidth=(1/0.3))
grid.arrange(p1,p2, nrow=1) # grid.arrange() comes from the gridExtra package
Or maybe it is better to summarize the data by x, id and class as the mean or median of y, filter out the ids that produce NAs (here ids 3 and 6), and connect the points with lines? If you don't really need to show the variability within each id (which may be the case if the plot only illustrates tendencies), you can do it this way:
library(ggplot2)
library(dplyr)
#library(ggthemes)
df <- df %>%
group_by(x, id, class) %>%
summarize(y = median(y, na.rm = T)) %>%
ungroup() %>%
mutate(
id = factor(id),
x = factor(x, labels = c("before", "after")),
class = factor(class, labels = c("multiple days", "one day")) # 0 = multiple days, 1 = one day
) %>%
group_by(id) %>%
mutate(nas = any(is.na(y))) %>%
ungroup() %>%
filter(!nas) %>%
select(-nas)
ggplot(df, aes(x = x, y = y, col = id, group = id)) +
geom_point(aes(shape = class)) +
geom_line(show.legend = F) +
#theme_few() +
#theme(legend.position = "none") +
ylab("Feelings of peace, %") +
xlab("")
Here's one possible solution for you.
First - to get the color and shapes determined by variables, you need to put these into the aes function. I turned several into factors, so the labs function fixes the labels so they don't appear as "factor(x)" but just "x".
To address multiple points, one solution is to use geom_smooth with method = "lm". This plots the regression line, instead of connecting all the dots.
The option se = FALSE prevents confidence intervals from being plotted - I don't think they add a lot to your plot, but play with it.
Connecting the dots is done by geom_line - feel free to try that as well.
Within geom_point, the option position = position_jitter(width = .1) adds random noise to the x-axis so points do not overlap.
ggplot(df, aes(x=factor(x), y=y, color=factor(id), shape=factor(class), group = id)) +
geom_point(position = position_jitter(width = .1)) +
geom_smooth(method = 'lm', se = FALSE) +
labs(
x = "x",
color = "ID",
shape = 'Class'
)

Split data to plot histograms side-by-side in R

I am learning R with the Australian athletes data set.
By using ggplot, I can plot a histogram like this.
library(DAAG)
library(ggplot2)
ggplot(ais, aes(wt, fill = sex)) +
geom_histogram(binwidth = 5)
According to summary(ais$wt), the 3rd quartile is 84.12. Now I want to split the data at wt = 84.12 and plot two similar histograms side by side.
The split is:
ais1 = ais$wt[which(ais$wt>=0 & ais$wt<=84.12)]
ais2 = ais$wt[which(ais$wt>84.12)]
But I don’t know how to fit them in the plotting. I tried but it doesn't work:
ggplot(ais1, aes(wt, fill = sex)) +...
How can I plot the histograms (2 similar histograms accordingly, side by side)?
Add the split as a column to your data
ais$wt_3q = ifelse(ais$wt <= 84.12, "Quartiles 1-3", "Quartile 4")
Then use facets:
ggplot(ais, aes(wt, fill = sex)) +
geom_histogram(binwidth = 5) +
facet_wrap(~ wt_3q)
The created variable is a character vector, so the facets are ordered alphabetically. If you convert it to a factor and specify the order of the levels, you can order the facets differently (lots of questions on here show how, if you search for them; it works the same way as reordering bars in a ggplot barplot). You can also let the scales vary between facets; see ?facet_wrap for more details.
Generally, you shouldn't create more data frames. Creating ais1 and ais2 is usually avoidable, and your life will be simpler if you use a single data frame for a single data set. Adding a new column for grouping makes it easy to keep things organized.
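To control which facet comes first, the grouping column can be turned into a factor with explicit levels. The sketch below assumes the cut-off of 84.12 from the question and the ais data from the DAAG package; scales = "free_x" is just one option, chosen here so each panel only spans its own weight range.

```r
library(ggplot2)
library(DAAG)  # provides the ais data set used in the question

# Cut-off taken from the question (3rd quartile of wt reported by summary()).
cutoff <- 84.12

# ifelse() returns a character vector; wrapping it in factor() with explicit
# levels fixes the left-to-right order of the facets.
ais$wt_3q <- factor(
  ifelse(ais$wt <= cutoff, "Quartiles 1-3", "Quartile 4"),
  levels = c("Quartiles 1-3", "Quartile 4")
)

facet_plot <- ggplot(ais, aes(wt, fill = sex)) +
  geom_histogram(binwidth = 5) +
  facet_wrap(~ wt_3q, scales = "free_x")  # each panel gets its own x range

facet_plot
```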
We can do this with ggarrange to arrange the plot objects for each subset
library(DAAG)
library(dplyr) # for %>% and filter()
library(ggplot2)
library(ggpubr)
p2 <- ais %>%
filter(wt>=0, wt<=84.12) %>%
ggplot(., aes(wt, fill = sex)) +
geom_histogram(binwidth = 5) +
coord_cartesian(ylim = c(0, 30))
p1 <- ais %>%
filter(wt>84.12) %>%
ggplot(., aes(wt, fill = sex)) +
geom_histogram(binwidth = 5) +
coord_cartesian(ylim = c(0, 30))
ggarrange(p1, p2, ncol =2, nrow = 1, labels = c("p1", "p2"))
