For data called df that reads:
car suv pickup
1 2 1
2 3 4
4 1 2
5 4 2
3 1 1
total = apply(df,1,sum)
barplot(total,col= rainbow(5))
Right now I am plotting a bar plot of the total number of vehicles, which is the sum of each row. What I want instead is to present it as a stacked bar plot of that sum.
For now, it just shows "total" without any segments indicating that, for example, 1 car, 2 suv and 1 pickup add up to a "total" of 4.
Note: this is different from barplot(matrix(df)), because that just divides the data by car, suv and pickup and disregards the total per row.
You can achieve this easily using ggplot2 and reshape2.
You will need an ID column to track the rows, so I have added one. I melt the data to long format so that the different groups can be managed and plotted accordingly.
Then plot using geom_bar, specifying the row IDs as the x axis and the grouping variable (fill and colour) for the stacked plot and legend.
library(reshape2)
library(ggplot2)
df <- data.frame("ID" = c(1,2,3,4,5), "car" = c(1,2,4,5,3), "suv" = c(2,3,1,4,1), "pickup" = c(1, 4, 2, 2, 1))
# neither reshape2 nor ggplot2 provides %>%, so call melt() directly
long_df <- melt(df, id.vars = c("ID"), value.name = "Number", variable.name = "Type")
ggplot(data = long_df, aes(x = ID, y = Number)) +
geom_bar(aes(fill = Type, colour = Type),
stat = "identity",
position = "stack")
With base R
# barplot() stacks the rows of a numeric matrix within each bar,
# so transpose the vehicle columns to get one bar per original row
barplot(t(as.matrix(df[, c("car", "suv", "pickup")])),
names.arg = df$ID, legend.text = TRUE)
Are you after something like this?
library(tidyverse)
df %>%
rowid_to_column("row") %>%
gather(k, v, -row) %>%
ggplot(aes(row, v, fill = k)) +
geom_col()
We use a stacked barplot here, so there is no need to manually calculate the sum. The key here is to transform data from wide to long and keep track of the row.
Sample data
df <- read.table(text =
"car suv pickup
1 2 1
2 3 4
4 1 2
5 4 2
3 1 1", header = T)
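As a quick sanity check on the point that a stacked bar plot reproduces the row totals, here is a minimal base R sketch (no extra packages; the names `long` and `stacked_heights` are made up for illustration). It reshapes the sample data to long form while keeping track of the row, then confirms that the segments of each bar sum to the same numbers as apply(df, 1, sum).

```r
df <- read.table(text =
"car suv pickup
1 2 1
2 3 4
4 1 2
5 4 2
3 1 1", header = TRUE)

# wide -> long with base R: stack() returns columns `values` and `ind`,
# stacking the columns one after another
long <- stack(df)
# keep track of the original row for every value
long$row <- rep(seq_len(nrow(df)), times = ncol(df))

# height of each stacked bar = sum of its segments
stacked_heights <- tapply(long$values, long$row, sum)

# the totals computed by hand in the question
total <- apply(df, 1, sum)
all(stacked_heights == total)
```

The comparison should come out TRUE, which is why the plotting answers never compute the sum explicitly.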
I have a df in which each variable holds ratings on a 1-5 scale; tabulating it gives the total count of each scale value per variable.
df<-data.frame(
speed=c(2,3,3,2,2),
race=c(5,5,4,5,5),
cake=c(5,5,5,4,4),
lama=c(2,1,1,1,2))
library(data.table)
dcast(melt(df), variable~value)
# variable 1 2 3 4 5
#1 speed 0 3 2 0 0
#2 race 0 0 0 1 4
#3 cake 0 0 0 2 3
#4 lama 3 2 0 0 0
I want to make a stacked bar chart showing the mean and the scale values 1-5 on the x axis, with one bar for each variable in the first column (speed, race, cake, lama).
I tried the solution from Stacked Bar Plot in R, but that is not what I am looking for.
I had to try a few things and do some workarounds to get something very close to what you are looking for (given that I understood the problem correctly):
library(dplyr)
library(ggplot2)
library(tidyr)
df<-data.frame(
speed=c(2,3,3,2,2),
race=c(5,5,4,5,5),
cake=c(5,5,5,4,4),
lama=c(2,1,1,1,2))
# get the data in right shape for ggplot2
dfp <- df %>%
# a column that identifies the rows uniquely is needed ("name of data row")
dplyr::mutate(ID = as.factor(dplyr::row_number())) %>%
# the data has to shaped into "tidy" format (similar to excel pivot)
tidyr::pivot_longer(-ID) %>%
# the groups or x-axis values have to be recoded to numeric type
# (done before grouping, because mutate() cannot modify a grouping variable)
dplyr::mutate(name = dplyr::recode(name, "cake" = 1, "lama" = 2, "race" = 3, "speed" = 4)) %>%
# order by name and ID
dplyr::arrange(name, ID) %>%
# group by name
dplyr::group_by(name) %>%
# calculate percentage and cumsum to be able to calculate label position (p2)
dplyr::mutate(p = value/sum(value),
c = cumsum(p),
p2 = c - p/2)
# calculate the mean value per group (or label) as you want them in the plot
sec_labels <- dfp %>%
dplyr::summarise(m = mean(value)) %>%
pull(m)
dfp %>%
# building base plot, telling to fill by the new name variable
ggplot2::ggplot(aes(x = name, y = value, fill = ID)) +
# make it a stacked bar chart scaled to proportions (each bar fills 0 to 1)
ggplot2::geom_bar(stat = "identity", position = "fill") +
# recode the x axis labels and add a secondary x axis with the labels
ggplot2::scale_x_continuous(breaks = 1:4,
labels = c("cake", "lama","race", "speed"),
sec.axis = sec_axis(~.,
breaks = 1:4,
labels = sec_labels)) +
# flip the chart on its side
ggplot2::coord_flip() +
# scale the y axis (shown horizontally after flipping) to percent
ggplot2::scale_y_continuous(labels=scales::percent) +
# add a layer with labels positioned according to p2
ggplot2::geom_text(aes(label = value,
y=p2)) +
# put a name to the plot
ggplot2::ggtitle("meaningful plot name") +
# put the legend on top
ggplot2::theme(legend.position = "top")
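To make the label arithmetic in the pipeline above easier to follow, here is a small base R sketch for a single variable (speed). It mirrors the p, cumulative sum and p2 columns and also computes the per-variable means used for the secondary axis. The names `cs` and `means` are introduced here only for illustration.

```r
df <- data.frame(
  speed = c(2, 3, 3, 2, 2),
  race  = c(5, 5, 4, 5, 5),
  cake  = c(5, 5, 5, 4, 4),
  lama  = c(2, 1, 1, 1, 2))

# label midpoints for one variable, mirroring p, c and p2 in the pipeline
value <- df$speed
p  <- value / sum(value)   # share of each segment within the bar
cs <- cumsum(p)            # cumulative share up to and including each segment
p2 <- cs - p / 2           # midpoint of each segment on the 0-1 scale

# means used as secondary-axis labels, in the recoded order 1 to 4
means <- colMeans(df)[c("cake", "lama", "race", "speed")]
```

For speed the values sum to 12, so the first segment covers 2/12 of the bar and its midpoint sits at 1/12; the last midpoint is 11/12.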
I have a dataset with multiple columns. I am visually summarizing several columns using simple bar plots. A simple example:
library(ggplot2)
library(tidyr)
set.seed(123)
df <-
data.frame(
a = sample(1:2, 20, replace = T),
b = sample(0:1, 20, replace = T)
)
ggplot(gather(df,, factor_key = TRUE), aes(x = factor(value))) +
geom_bar() +
facet_wrap(~ key, scales = "free_x", as.table = TRUE) +
xlab("")
Now, I want to add percentages above each of the 4 columns, showing what percent of rows in the data frame each column represents. I.e., here, the following numbers would appear right above the four columns, from left to right in this order: 55%, 45%, 60%, 40%.
How can I automate this---given that I have a large number of columns I have to do this for? (Note I want to keep the raw count of responses on the Y axis and just have percentages appear in the plots.)
In addition to the answer proposed by @BappaDas: in your particular case you want to keep the counts on the y axis and add percentages as labels, whereas the proposed answer shows percentages both on the y axis and in the text labels.
Here, a modified solution is to compute the count for each variable and then calculate the percentage. A possible way of doing it is to use the tidyr package (for reshaping the data into a "long" form) and the dplyr package:
library(tidyr)
library(dplyr)
df %>% pivot_longer(everything(), names_to = "var", values_to = "val") %>%
group_by(var) %>% count(val) %>%
mutate(Label = n/sum(n))
# A tibble: 4 x 4
# Groups: var [2]
#   var     val     n Label
#   <chr> <int> <int> <dbl>
# 1 a         1    11  0.55
# 2 a         2     9  0.45
# 3 b         0    12  0.6
# 4 b         1     8  0.4
Now at the end of this pipe sequence, you can add ggplot plotting code in order to obtain the desired output by passing the count as y argument and the percentage as label argument:
library(tidyr)
library(dplyr)
library(ggplot2)
df %>% pivot_longer(everything(), names_to = "var", values_to = "val") %>%
group_by(var) %>% count(val) %>%
mutate(Label = n/sum(n)) %>%
ggplot(aes(x = factor(val), y = n))+
geom_col()+
facet_wrap(~var, scales = "free", as.table = TRUE)+
xlab("")+
geom_text(aes(label = scales::percent(Label)), vjust = -0.5)
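The count-then-percentage step can also be checked without any packages. The sketch below is only an illustration under an assumption: instead of relying on set.seed(), it hard-codes a data frame with the same counts as the table above (11/9 for a, 12/8 for b), and the names `pct` and `labels` are hypothetical. It applies the same calculation to every column, which is the part that scales to many columns.

```r
# fixed data with the same counts as in the example output
df <- data.frame(a = rep(c(1, 2), times = c(11, 9)),
                 b = rep(c(0, 1), times = c(12, 8)))

# proportion of rows per value, computed column by column
pct <- lapply(df, function(x) prop.table(table(x)))

# format the proportions as percentage labels
labels <- lapply(pct, function(p) sprintf("%.0f%%", 100 * as.numeric(p)))
labels
```

Because lapply() walks over all columns, adding more columns to df needs no further code changes.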
I have a question concerning the ordering of stacked bars in a swimmer plot using ggplot2 in R.
I have a sample dataset of (artificial) patients, who receive treatments.
library(tidyverse)
df <- read.table(text="patient start_t_1 t_1_duration start_t_2 t_2_duration start_t_3 t_3_duration start_t_4 t_4_duration end
1 0 1.5 1.5 3 NA NA 4.5 10 10
2 0 2 4.5 2 NA NA 2 2.5 10
3 0 5 5 2 7 0.5 7.5 2 9.5
4 0 8 NA NA NA NA 8 2 10", header=TRUE)
All patients start the first treatment at time = 0. Subsequently, patients get different treatments (numbered t_2 up to t_4).
I tried to plot the swimmer plot, using the following code:
df %>%
gather(variable, value, c(t_1_duration, t_2_duration, t_3_duration, t_4_duration)) %>%
ggplot(aes(x = patient, y = value, fill = variable)) +
geom_bar(stat = "identity") +
coord_flip()
However, the treatments are not displayed in the right order.
For example: patient 3 receives all treatments in consecutive order, while patient 2 receives treatment 1 first, then 4 and eventually 2.
So, simply reversing the order does not work.
How do I order the stacked bars in a chronological way?
What about this:
df %>%
gather(variable, value, c(t_1_duration, t_2_duration, t_3_duration,t_4_duration)) %>%
ggplot(aes(x = patient,
y = value,
# here you can specify the order of the variable
fill = factor(variable,
levels =c("t_4_duration", "t_3_duration", "t_2_duration","t_1_duration")))) +
geom_bar(stat = "identity") +
coord_flip()+ guides(fill=guide_legend("My title"))
EDIT:
That has been a long trip, because it involves a kind of hack. I think it's not a dupe of that question, because it also involves some data reshaping:
library(reshape2)
# divide starts and duration
starts <- df %>% select(patient, start_t_1, start_t_2, start_t_3, start_t_4)
duration <- df %>% select(patient, t_1_duration,t_2_duration, t_3_duration, t_4_duration)
# here you melt them
starts <- melt(starts, id = 'patient') %>%
mutate(keytreat = substr(variable,nchar(as.vector(variable))-2, nchar(as.vector(variable)))) %>%
`colnames<-`(c("patient", "variable", "start","keytreat")) %>% select(-variable)
duration <- melt(duration, id = 'patient') %>% mutate(keytreat = substr(variable,1, 3)) %>%
`colnames<-`(c("patient", "variable", "duration","keytreat")) %>% select(-variable)
# join
dats <- starts %>% left_join(duration) %>% arrange(patient, start) %>% filter(!is.na(start))
# here the part for the plot
bars <- map(unique(dats$patient)
, ~geom_bar(stat = "identity", position = "stack"
, data = dats %>% filter(patient == .x)))
dats %>%
ggplot(aes(x = patient,
y = duration,
fill = reorder(keytreat,-start))) +
bars +
guides(fill=guide_legend("ordering")) + coord_flip()
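The idea behind the reshaping above is that, within each patient, the segments must be stacked in order of their start time. A minimal base R sketch for patient 2 from the sample data (the data frame name `pat2` and the derived columns are made up for illustration) shows that sorting by start gives the chronological order, and that the cumulative durations then line up with the recorded start times.

```r
# long-format rows for patient 2 (treatment 3 is NA in the data and dropped)
pat2 <- data.frame(keytreat = c("t_1", "t_2", "t_4"),
                   start    = c(0, 4.5, 2),
                   duration = c(2, 2, 2.5),
                   stringsAsFactors = FALSE)

# chronological order = order of the start times
pat2_sorted <- pat2[order(pat2$start), ]

# where each stacked segment would begin if drawn in this order
pat2_sorted$stack_start <- cumsum(pat2_sorted$duration) - pat2_sorted$duration

pat2_sorted$keytreat
```

For this patient the treatments come out as t_1, t_4, t_2, and the stacked segment starts coincide with the start columns, which is what the swimmer plot should display.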
I am working with categorical longitudinal data. My data has 3 simple variables such as :
id variable value
1 1 1 c
2 1 2 b
3 1 3 c
4 1 4 c
5 1 5 c
...
Where variable is basically time, and value is one of the 3 possible categories an id can take.
I am interested in producing a "parallel" longitudinal graph, similar to this, with ggplot2.
I am struggling a bit to get it right. What I have come up with so far is this:
dt0 %>% ggplot(aes(variable, value, group = id, colour = id)) +
geom_line(colour="grey70") +
geom_point(aes(colour=value, size = nn), size=4) +
scale_colour_brewer(palette="Set1") + theme_minimal()
The issue with this graph is that we can't really see the "thickness" of the "transitions" (the id lines).
I wondered if you could help me with the following:
a) make the id lines more visible, or make them "thicker" according to the number of ids going from one state to the other
b) I would also like to resize the points according to the number of ids in each state. I tried to do it with geom_point(aes(colour=value, size = nn), size=4) but it doesn't seem to work.
Thanks.
# data #
library(dplyr)
library(ggplot2)
library(reshape2) # provides melt()
set.seed(10)
# generate random sequences #
dt = as.data.frame( cbind(id = 1:1000, replicate(5, sample( c('a', 'b', 'c'), prob = c(0.1,0.2,0.7), 1000, replace = T)) ) )
# transform to long (person-period, PP) format #
dt = dt %>% melt(id.vars = c('id'))
# recode variable as a numeric time index (1 to 5) within each id #
dt0 = dt %>% group_by(id) %>% mutate(variable = 1:n()) %>% arrange(id)
# create the number of people in that state #
dt0 = dt0 %>% count(id, variable, value)
dt0 = dt0 %>% group_by(variable, value, n) %>% mutate(nn = n())
# to produce the first graph #
library(vcrpart)
otsplot(dt0$variable, factor(dt0$value), dt0$id)
You were so close with geom_point(aes(colour=value, size = nn), size=4). The problem was that you set size again outside aes() after mapping it inside aes(), so ggplot overrode the mapping to nn with the constant 4. Assuming you want to use nn to scale line thickness as well, you could tweak your code to this:
dt0 %>% ggplot(aes(variable, value, group = id, colour = id)) +
geom_line(colour="grey70", aes(size = nn)) +
geom_point(aes(colour=value, size = nn)) +
scale_colour_brewer(palette="Set1") + theme_minimal()
If you wanted to use a lag value for the line thickness, I would suggest adding that as a new column in dt0.
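One possible way to build such a lag column is sketched below in base R. This is only an assumption about what "lag value" could mean here: for each id the previous state is stored, and then the number of ids making the same move at the same time point is counted. The toy data and the column names `prev` and `n_trans` are hypothetical and not part of the original code.

```r
# toy data: 3 ids observed at 3 time points, sorted by id and time
dt <- data.frame(id = rep(1:3, each = 3),
                 variable = rep(1:3, times = 3),
                 value = c("c", "b", "c",  "c", "c", "c",  "a", "b", "c"),
                 stringsAsFactors = FALSE)

# previous state within each id (NA at the first time point)
dt$prev <- ave(dt$value, dt$id, FUN = function(v) c(NA, head(v, -1)))

# count how many ids make the same prev -> value move at each time point
key <- paste(dt$variable, dt$prev, dt$value, sep = "|")
dt$n_trans <- as.integer(table(key)[key])
dt$n_trans[is.na(dt$prev)] <- NA   # no transition into the first time point

dt
```

A column like n_trans could then be mapped to size inside aes() for geom_line, in the same way nn is used for the points.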
My data frame looks like this:
plant distance
one 0
one 1
one 2
one 3
one 4
one 5
one 6
one 7
one 8
one 9
one 9.9
two 0
two 1
two 2
two 3
two 4
two 5
two 6
two 7
two 8
two 9
two 9.5
I want to split the distance of each level of plant into groups by interval (for instance, interval = 3), and compute the percentage of each group. Finally, I want to plot the percentage of each group for each level, similar to this:
My code:
library(ggplot2)
library(dplyr)
dat <- my_data %>%
mutate(group = factor(cut(distance, seq(0, max(distance), 3), labels = FALSE))) %>%
group_by(plant, group) %>%
summarise(percentage = n()) %>%
mutate(percentage = percentage / sum(percentage))
p <- ggplot(dat, aes(x = plant, y = percentage, fill = group)) +
geom_bar(stat = "identity", position = "stack")+
scale_y_continuous(labels = scales::percent)
p
But my plot, shown below, is missing group 4.
I also found that dat was wrong: group 4 was NA.
The likely reason is that the range of group 4 is shorter than the interval of 3, so my question is how to fix it. Thank you in advance!
I have solved the problem. The reason is that cut(distance, seq(0, max(distance), 3), F) did not cover the minimum and maximum values: the breaks stop at 9 and the lowest break is excluded by default, so 0 and every value above 9 became NA.
Here is my solution:
dat <- my_data %>%
mutate(group = factor(cut(distance, seq(from = min(distance), by = 3, length.out = n()/ 3 + 1), include.lowest = TRUE))) %>%
count(plant, group) %>%
group_by(plant) %>%
mutate(percentage = n / sum(n))
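The effect of the breaks can be seen in isolation with a short base R sketch on the distances of plant "one". Note that it uses a slightly different, hypothetical way of choosing the number of breaks (ceiling of the range divided by the interval) rather than the n()/3 + 1 from the solution above; the variable names are made up for illustration.

```r
distance <- c(0:9, 9.9)   # plant "one" from the question
interval <- 3

# original breaks: 0, 3, 6, 9, with the lowest break excluded by default
old_breaks <- seq(0, max(distance), interval)
old_group  <- cut(distance, old_breaks, labels = FALSE)

# extend the breaks past the maximum and include the lowest value
n_breaks   <- ceiling((max(distance) - min(distance)) / interval) + 1
new_breaks <- seq(min(distance), by = interval, length.out = n_breaks)
new_group  <- cut(distance, new_breaks, labels = FALSE, include.lowest = TRUE)

# share of observations per group after the fix
prop.table(table(new_group))
```

With the original breaks, 0 and 9.9 fall outside every interval and become NA; with the extended breaks, 0 joins group 1 and 9.9 forms group 4.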