I'm fairly new to using R and ggplot2 for data analysis, and I'm trying to figure out how to get my data into the format ggplot2 expects. The data is a set of values for 5 different categories, and I want to make a stacked bar graph in which each bar is split into 3 sections based on the value, e.g. small, medium, and large values based on arbitrary cutoffs. It should look like the 100% stacked bar graph in Excel, where the proportions in each bar add up to 1 on the y axis. There is a fair amount of data (~1500 observations), if that is worth noting.
here is a sample of what the data looks like (but it has approx 1000 observations for each column) (I put an excel screenshot because I don't know if that worked below)
dput(sample-data)
This sort of problem is usually a data reformatting problem. See reshaping data.frame from wide to long format.
The following code uses built-in data set iris, with 4 numeric columns, to plot a bar graph with the data values cut into levels after reshaping the data.
I have chosen cutoff points 0.2 and 0.7, but any other numbers in (0, 1) will do. The cutoff vector is brks and the level names are in labls.
library(tidyverse)
data(iris)
brks <- c(0, 0.2, 0.7, 1)
labls <- c('Small', 'Medium', 'Large')
iris[-5] %>%
  pivot_longer(
    cols = everything(),
    names_to = 'Category',
    values_to = 'Value'
  ) %>%
  group_by(Category) %>%
  mutate(Value = (Value - min(Value))/diff(range(Value)),
         Level = cut(Value, breaks = brks, labels = labls,
                     include.lowest = TRUE, ordered_result = TRUE)) %>%
  ggplot(aes(Category, fill = Level)) +
  geom_bar(stat = 'count', position = position_fill()) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
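To see what the rescaling and cutting step does in isolation, here is a minimal base R sketch. The vector x is made up for illustration and is not from the question's data; brks and labls are reused from the code above.

```r
# Hypothetical values for a single category (illustration only)
x <- c(2, 3, 5, 8, 12)

# Min-max rescale to [0, 1], as in the mutate() step
scaled <- (x - min(x)) / diff(range(x))

brks  <- c(0, 0.2, 0.7, 1)
labls <- c("Small", "Medium", "Large")

# include.lowest = TRUE keeps the minimum (scaled value 0) in the first bin
level <- cut(scaled, breaks = brks, labels = labls,
             include.lowest = TRUE, ordered_result = TRUE)

# Share of observations per level; these shares are what position_fill() stacks
shares <- prop.table(table(level))
print(shares)
```

Without include.lowest = TRUE, the smallest value in each category would fall outside the first interval and come out as NA.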
Here's a solution requiring no data reformatting.
The diamonds dataset comes with ggplot2. Column "color" is categorical, column "price" is numeric:
library(ggplot2)
ggplot(diamonds) +
  geom_bar(aes(x = color, fill = cut(price, 3, labels = c("low", "mid", "high"))),
           position = "fill") +
  labs(fill = "price")
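One thing worth knowing about cut(price, 3): it splits the range of the values into three equal-width intervals, so skewed data can leave most rows in a single bin. The sketch below uses made-up prices (an assumption, not the diamonds data) to compare that with cutting at the tertiles, which is one possible alternative if roughly equal-sized groups are wanted.

```r
# Hypothetical right-skewed prices (illustration only)
price <- c(300, 400, 500, 700, 900, 1200, 2000, 5000, 18000)

# cut(x, 3) splits the *range* into 3 equal-width intervals
equal_width <- cut(price, 3, labels = c("low", "mid", "high"))
print(table(equal_width))

# Alternative: cut at the tertiles so each bin holds about a third of the rows
tertiles <- quantile(price, probs = c(0, 1/3, 2/3, 1))
equal_count <- cut(price, breaks = tertiles, labels = c("low", "mid", "high"),
                   include.lowest = TRUE)
print(table(equal_count))
```

Which of the two is appropriate depends on whether the cutoffs should be meaningful on the price scale or should balance the group sizes.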
So I have the following code which produces:
The issue here is twofold:
The group bar chart automatically places the highest value on the top (i.e. for avenue 4 CTP is on top), whereas I would always want FTP to be shown first then CTP to be shown after (so always blue bar then red bar)
I need all of the values to scale to 100 or 100% for their respective group (so for CTP avenue 4 would have a huge bar graph but the other avenues should be extremely tiny)
I am new to 'R'/Stack overflow so sorry if anything is wrong/you need more but any help is greatly appreciated.
library(ggplot2)
library(tidyverse)
library(magrittr)
# function to specify decimals
specify_decimal <- function(x, k) trimws(format(round(x, k), nsmall=k))
# sample data
avenues <- c("Avenue1", "Avenue2", "Avenue3", "Avenue4")
flytip_amount <- c(1000, 2000, 1500, 250)
collection_amount <- c(5, 15, 10, 2000)
# create data frame from the sample data
df <- data.frame(avenues, flytip_amount, collection_amount)
# got it working - now to test
df3 <- df
SumFA <- sum(df3$flytip_amount)
df3$FTP <- (df3$flytip_amount/SumFA)*100
df3$FTP <- specify_decimal(df3$FTP, 1)
SumCA <- sum(df3$collection_amount)
df3$CTP <- (df3$collection_amount/SumCA)*100
df3$CTP <- specify_decimal(df3$CTP, 1)
# Now we have percentages remove whole values
df2 <- df3[,c(1,4,5)]
df2 <- df2 %>% pivot_longer(-avenues)
FTGraphPos <- df2$name
ggplot(df2, aes(x = avenues, fill = as.factor(name), y = value)) +
geom_col(position = "dodge", width = 0.75) + coord_flip() +
labs(title = "Flytipping & Collection %", x = "ward_name", y = "Percentageperward") +
geom_text(aes(x= avenues, label = value), vjust = -0.1, position = "identity", size = 5)
I have tried the above and I have looked at lots of tutorials but nothing is exactly precise to what I need of ensuring the group bar charts puts the layers in the same order despite amount and scaling to 100/100%
As Camille notes, to handle ordering of the categories in a plot, you need to set them as factors, and then use functions from the forcats package to handle the order. Here I am using fct_relevel() (note that it will automatically convert character variables to factors).
Your numeric values are in fact set to character, so they need to be set to numeric for the chart to make sense.
To cover point #2, I'm using group_by() to calculate percentages within each name.
I have also fixed the labels so that they are properly dodged along with the bar chart. Also, note that you don't need to call ggplot2 or magrittr if you are calling tidyverse - those packages come along with it already.
df_plot <- df2 |>
  mutate(name = fct_relevel(name, "CTP"),
         value = as.numeric(value)) |>
  group_by(name) |>
  mutate(perc = value / sum(value)) |>
  ungroup()
ggplot(df_plot, aes(x = value, y = avenues, fill = name)) +
  geom_col(position = "dodge", width = 0.75) +
  geom_text(aes(label = value), position = position_dodge(width = 0.75), size = 5) +
  labs(title = "Flytipping & Collection %", x = "Percentageperward", y = "ward_name") +
  guides(fill = guide_legend(reverse = TRUE))
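Two of the points above can be checked without any plotting. The sketch below is base R only: it shows that the question's specify_decimal() helper returns character strings, and it computes a within-group share with ave() as a stand-in for group_by() plus mutate(). The df_long data frame and its values are made up for illustration.

```r
# Same formatting helper as in the question
specify_decimal <- function(x, k) trimws(format(round(x, k), nsmall = k))

# Collection amounts from the question, expressed as percentages
pct <- specify_decimal(c(5, 15, 10, 2000) / 2030 * 100, 1)
print(class(pct))        # character, so ggplot2 would treat it as discrete
num <- as.numeric(pct)   # convert back before plotting

# Within-group share in base R (hypothetical data, illustration only)
df_long <- data.frame(
  name  = rep(c("FTP", "CTP"), each = 2),
  value = c(80, 20, 1, 99)
)
df_long$perc <- ave(df_long$value, df_long$name, FUN = function(v) v / sum(v))
print(df_long)
```

Formatting for display is probably best left to the label aesthetic, so the plotted column can stay numeric.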
I'm not able to share my data, so sorry for that. Most of my data are either dummy or ordinal or unordered discrete variables. Only age is numeric.
I used this code to see which values are outliers
boxplot(df$var1, plot = TRUE)$out
And this code to count how many outliers there are:
length(boxplot(dataDK$sclmeet)$out)
I replaced the outliers with NA's using the sapply function.
I now want to create either a boxplot or a table that counts the outliers and shows which values they are. How is this possible?
If you help with the boxplot method, then I can make multiple boxplots and combine them into one using par(mfrow = c(,))
The boxplot could look like this, where 1 (blue) is the value of outlier and 4 (blue) is the count of how many 1 there are:
Edit:
I forgot to mention that I know this method:
out <- boxplot.stats(df$var1)$out
boxplot(df$var1,
ylab = "var1",
main = "Boxplot for var1"
)
mtext(paste("Outliers: ", paste(out, collapse = ", ")))
This will give a plot similar to this. However, it is not a good method when there are many different outliers
(taken from boxplot outlier labels):
These kinds of plots are easier with ggplot, not base R. Have you considered adding a table of your outliers next to your plot? There may be cases where you have different kinds of outliers (and thus your text would be cumbersome). However, if you already know how many outliers you have, you can use annotate to add simple text.
library(tidyverse)
library(cowplot) # to plot stuff side by side
library(gridExtra)
data(iris)
boxplot(iris$Sepal.Width, plot = TRUE)$out
length(boxplot(iris$Sepal.Width)$out)
# https://stackoverflow.com/questions/54993511/how-to-replace-outliers-with-na-in-r-from-vector-created-with-boxplotout
iris$is_outlier <- ifelse(iris$Sepal.Width %in% boxplot.stats(iris$Sepal.Width)$out, 1, 0)
iris <- iris %>%
  select(Sepal.Width, is_outlier) %>%
  mutate(Sepal.Width_NA = ifelse(is_outlier == 1, NA, Sepal.Width))
t <- iris %>%
  filter(is_outlier == 1) %>%
  select(Sepal.Width) %>% table() %>% as.data.frame() %>% tableGrob(rows = NULL)
p <- ggplot(iris, aes(y = Sepal.Width_NA)) +
  geom_boxplot()
# plot with table side-by-side
plot_grid(p, t, rel_widths = c(2, 1))
# close to your original desired plot
ggplot(iris, aes(y = Sepal.Width_NA)) +
geom_boxplot() +
annotate("label", color = "blue",
size = 4,
x = 0, y = 2,
label = "1 (4)")
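If only the table of outliers is needed, it can be built without any plotting. The following is a minimal base R sketch using boxplot.stats(). It reads the untouched datasets::iris rather than the modified copy above, and the column names outlier and n are my own choice.

```r
x <- datasets::iris$Sepal.Width

# Values flagged by the usual 1.5 * IQR boxplot rule
out <- boxplot.stats(x)$out

# One row per distinct outlier value, with how often it occurs
counts <- as.data.frame(table(outlier = out), responseName = "n")
print(counts)

# Number of outliers for every numeric column at once
num_cols <- names(datasets::iris)[sapply(datasets::iris, is.numeric)]
n_out <- sapply(datasets::iris[num_cols],
                function(v) length(boxplot.stats(v)$out))
print(n_out)
```

The counts data frame could then be passed to tableGrob() or used to build an annotate() label such as "value (count)".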
I have data looking like:
Accounts        Income    Expense   Benefit
selling food     338.96     43.18    295.78
selling books   2757.70   2341.66    416.04
selling bikes   1369.00   1157.00    212.00
and I would like to get a combined bar plot such as this:
To do that, I wrote this R script:
## Take a data set in and produces a bar chart
## Get rid of the column containing account names to only keep values:
values <- as.matrix(data)[,-1]
## Convert strings to numbers:
values <- apply(values, 2, as.numeric)
## Transpose the matrix:
values <- t(values)
## Vertical axis labels are taken from the first column:
accountNames <- data$Accounts
## The legend is the first row except the first cell:
legend <- tail(names(data), -1)
## Colors are taken randomly
colors <- rainbow(length(legend))
## Increase left margin to fit horizontal axis labels:
par(mar=c(5,8,4,2)+.1)
## Axis labels are drawn horizontal:
par(las=1)
barplot(
values,
names.arg=accountNames,
col=colors,
beside = TRUE,
legend = legend,
horiz = TRUE
)
I would like to modernize this bar chart with ggplot2 which I use for other graphs of the same document. The documentations I found to do that always assume data in a very different shape and I don't know R enough to find out what to do by myself.
Here is the basic version; you can then customize the plot the way you want.
Libraries
library(tidyverse)
Data
data <-
tibble::tribble(
~Accounts, ~Income, ~Expense, ~Benefit,
"selling food", 338.96, 43.18, 295.78,
"selling books", 2757.7, 2341.66, 416.04,
"selling bikes", 1369, 1157, 212
)
Code
data %>%
  # Pivot Income, Expense and Benefit
  pivot_longer(cols = -Accounts) %>%
  # Define each aesthetic
  ggplot(aes(x = value, y = Accounts, fill = name)) +
  # Add geometry column
  geom_col(position = position_dodge())
Results
1. Bring your data into long format with pivot_longer.
2. Then plot with geom_bar and use coord_flip.
library(tidyverse)
data %>%
  pivot_longer(
    cols = -Accounts,
    names_to = "Category",
    values_to = "values"
  ) %>%
  ggplot(aes(Accounts, y = values, fill = Category)) +
  geom_bar(stat = "identity", position = "dodge") +
  coord_flip() +
  theme_classic()
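Both answers rely on the same reshaping step, so it may help to look at the long table on its own before plotting. This is a small sketch assuming the tidyr package is installed; the data frame repeats the question's three rows. The final factor() call is an optional extra of mine that keeps the categories in the original column order instead of alphabetical order.

```r
library(tidyr)

data <- data.frame(
  Accounts = c("selling food", "selling books", "selling bikes"),
  Income   = c(338.96, 2757.70, 1369.00),
  Expense  = c(43.18, 2341.66, 1157.00),
  Benefit  = c(295.78, 416.04, 212.00)
)

# One row per (account, category) pair: 3 accounts x 3 categories
long <- pivot_longer(data, cols = -Accounts,
                     names_to = "Category", values_to = "values")

# Optional: keep the original column order in the legend
long$Category <- factor(long$Category, levels = c("Income", "Expense", "Benefit"))
print(long)
```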
data1=data.frame("Grade"=c(1,1,1,2,2,2,3,3,3),
"Class"=c(1,2,3,1,2,3,1,2,3),
"Score"=c(6,9,9,7,7,4,9,6,6))
I apologize if this was already posted, but I did not see it. I wish to prepare a stacked bar plot where the X axis is 'Grade' and each Grade is 1 bar. Every bar contains three color shades because there are three classes ('Class'). Finally, the height of the bar is 'Score' and it always starts from low class to high. So it will look something like this, but this is not to proper scale
We can use xtabs to convert the data to wide format and then apply the barplot
# rows = Class (stacked segments), columns = Grade (one bar per Grade)
barplot(xtabs(Score ~ Class + Grade, data1), legend = TRUE,
        col = c('yellow', 'red', 'orange'))
Or using ggplot
library(dplyr)
library(ggplot2)
data1 %>%
  mutate_at(vars(Grade, Class), factor) %>%
  ggplot(aes(x = Grade, y = Score, fill = Class)) +
  geom_col()
If we want to order for 'Class', convert to factor with levels specified in that order based on the 'Score' values
data1 %>%
  mutate(Class = factor(Class, levels = unique(Class[order(Score)])),
         Grade = factor(Grade)) %>%
  ggplot(aes(x = Grade, y = Score, fill = Class)) +
  geom_col()
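To see which level order unique(Class[order(Score)]) actually produces for this data, the expression can be evaluated on its own. This is a base R sketch using the question's data1.

```r
data1 <- data.frame(Grade = c(1, 1, 1, 2, 2, 2, 3, 3, 3),
                    Class = c(1, 2, 3, 1, 2, 3, 1, 2, 3),
                    Score = c(6, 9, 9, 7, 7, 4, 9, 6, 6))

# Classes in order of first appearance after sorting all rows by Score
lvls <- unique(data1$Class[order(data1$Score)])
print(lvls)  # 3 1 2, because the lowest Score overall (4) belongs to Class 3

cls <- factor(data1$Class, levels = lvls)
print(levels(cls))
```

Note that this gives a single global order for the whole plot. It is driven by the lowest score overall, not by the scores within each Grade, so the stacking order is the same in every bar.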
I try to connect jittered points between measurements from two different methods (measure) on an x-axis. These measurements are linked to one another by the probands (a), that can be separated into two main groups, patients (pat) and controls (ctr),
My df is like that:
set.seed(1)
df <- data.frame(a = rep(paste0("id", "_", 1:20), each = 2),
value = sample(1:10, 40, rep = TRUE),
measure = rep(c("a", "b"), 20), group = rep(c("pat", "ctr"), each = 2,10))
I tried
library(ggplot2)
ggplot(df,aes(measure, value, fill = group)) +
geom_point(position = position_jitterdodge(jitter.width = 0.1, jitter.height = 0.1,
dodge.width = 0.75), shape = 1) +
geom_line(aes(group = a), position = position_dodge(0.75))
Created on 2020-01-13 by the reprex package (v0.3.0)
I used the fill aesthetic in order to separate the jittered dots from both groups (pat and ctr). I realised that when I put the group = a aesthetics into the ggplot main call, then it doesn't separate as nicely, but seems to link better to the points.
My question: Is there a way to better connect the lines to the (jittered) points, but keeping the separation of the two main groups, ctr and pat?
Thanks a lot.
The big issue you are having is that you are dodging the points by only group but the lines are being dodged by a, as well.
To keep your lines with the axes as is, one option is to manually dodge your data. This takes advantage of factors being integers under the hood, moving one level of group to the right and the other to the left.
# factor() is needed in R >= 4.0, where data.frame() keeps strings as character
df = transform(df, dmeasure = ifelse(group == "ctr",
                                     as.numeric(factor(measure)) - .25,
                                     as.numeric(factor(measure)) + .25 ) )
You can then make a plot with measure as the x axis but then use the "dodged" variable as the x axis variable in geom_point and geom_line.
ggplot(df, aes(x = measure, y = value) ) +
geom_blank() +
geom_point( aes(x = dmeasure), shape = 1 ) +
geom_line( aes(group = a, x = dmeasure) )
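The manual dodge is easiest to follow on a tiny example. The sketch below is base R with a made-up four-row data frame, one row per measure and group combination. It assumes measure is stored as character, which is the data.frame() default from R 4.0 onwards; in that case it has to go through factor() first, because as.numeric() on the strings "a" and "b" returns NA.

```r
# Hypothetical miniature of the question's data (illustration only)
df <- data.frame(
  measure = c("a", "b", "a", "b"),
  group   = c("pat", "pat", "ctr", "ctr"),
  stringsAsFactors = FALSE
)

# "a" -> 1, "b" -> 2: the positions ggplot2 gives a discrete x axis
x_num <- as.numeric(factor(df$measure))

# Shift controls a quarter unit left and patients a quarter unit right
df$dmeasure <- ifelse(df$group == "ctr", x_num - 0.25, x_num + 0.25)
print(df)
```

Because points and lines both read the same dmeasure column, they cannot drift apart the way independently dodged layers can.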
If you also want jittering, that can also be added manually to both your x and y variables.
# factor() is needed in R >= 4.0, where data.frame() keeps strings as character
df = transform(df, dmeasure = ifelse(group == "ctr",
                                     jitter(as.numeric(factor(measure)) - .25, .1),
                                     jitter(as.numeric(factor(measure)) + .25, .1) ),
               jvalue = jitter(value, amount = .1) )
ggplot(df, aes(x = measure, y = jvalue) ) +
geom_blank() +
geom_point( aes(x = dmeasure), shape = 1 ) +
geom_line( aes(group = a, x = dmeasure) )
This turned out to be an astonishingly common question and I'd like to add an answer/comment to myself with a suggestion of a - what I now think - much, much better visualisation:
The scatter plot.
I originally intended to show paired data and visually guide the eye between the two comparisons. The problem with this visualisation is evident: Every subject is visualised twice. This leads to a quite crowded graphic. Also, the two dimensions of the data (measurement before, and after) are forced into one dimension (y), and the connection by ID is awkwardly forced onto your x axis.
Plot 1: The scatter plot naturally represents the ID by only showing one point per subject, but showing both dimensions more naturally on x and y. The only step needed is to make your data wider (yes, this is also sometimes necessary; ggplot does not always require long data).
The box plot
Plot 2: As rightly pointed out by user AllanCameron, another option would be to plot the difference of the paired values directly, for example as a boxplot. This is a nice visualisation of the appropriate paired t-test where the mean of the differences is tested against 0. It will require the same data shaping to "wide format". I personally like to show the actual values as well (if there are not too many).
library(tidyr)
library(dplyr)
library(ggplot2)
## first reshape the data wider (one column for each measurement)
df %>%
  pivot_wider(names_from = "measure", values_from = "value", names_prefix = "time_" ) %>%
  ## now use the new columns for your scatter plot
  ggplot() +
  geom_point(aes(time_a, time_b, color = group)) +
  ## you can add a line of equality to make it even more intuitive
  geom_abline(intercept = 0, slope = 1, lty = 2, linewidth = .2) +
  coord_equal()
Box plot to show differences of paired values
df %>%
  pivot_wider(names_from = "measure", values_from = "value", names_prefix = "time_" ) %>%
  ggplot(aes(x = "", y = time_a - time_b)) +
  geom_boxplot() +
  # optional, if you want to show the actual values
  geom_point(position = position_jitter(width = .1))
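The link between the difference plot and the paired t-test can be checked numerically. The sketch below is base R with made-up paired values; time_a and time_b mirror the wide column names above, but the numbers are hypothetical. It shows that a paired t-test gives the same result as a one-sample t-test of the differences against 0.

```r
# Hypothetical paired measurements, one pair per subject (illustration only)
time_a <- c(5, 7, 3, 9, 6, 4, 8, 2)
time_b <- c(6, 9, 3, 7, 8, 5, 9, 4)

# The quantity shown in the box plot of differences
d <- time_a - time_b

paired     <- t.test(time_a, time_b, paired = TRUE)
one_sample <- t.test(d, mu = 0)

# Both tests report the same t statistic and p-value
print(c(paired = unname(paired$statistic),
        one_sample = unname(one_sample$statistic)))
print(c(paired = paired$p.value, one_sample = one_sample$p.value))
```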