I have a list of decimal numbers, ranging from 1 to 40K and I am trying to plot a frequency histogram together with the total sum of a given bin. I'm attempting to do it using ggplot2 but getting lost on how to use the same x axis bins from the histogram:
library(ggplot2)
sales <- data.frame(amount = runif(100, min=0, max=40000))
b <- 4 # number of bins; not defined in the original snippet, assumed here
h <- hist(sales$amount, breaks=b)
sales$groups <- cut(sales$amount, breaks=h$breaks)
ggplot(sales,aes(x=groups)) +
geom_bar(stat="count")+
geom_bar(aes(x=groups, y=amount), stat="identity") +
scale_y_continuous(sec.axis = sec_axis(~.*5, name = "sum"))
I managed to create both graphs independently, but they seem to overwrite each other.
If I understand right, you tried to plot two different variables (Count and Sum) in the bar graph. As they have really different ranges, you need to define a secondary y axis.
First, the grammar of ggplot2 asks for one column for x values, one column for y values and one or several columns for groups (this is a very brief and rough summary of my understanding of how ggplot2 works).
Here, the idea is to have your "breaks" as the x variable, a second column with all the y values to be plotted, and a group column stating whether a y value belongs to the group "Count" or "amount". You can achieve this using the dplyr and tidyr packages:
set.seed(123)
sales <- data.frame(amount = runif(100, min=0, max=40000))
b = 4
h <- hist(sales$amount, breaks=b)
sales$groups <- cut(sales$amount, breaks=h$breaks)
library(tidyr)
library(dplyr)
sales %>% group_by(groups) %>% mutate(Count = n()) %>%
pivot_longer(.,cols = c(Count, amount), names_to = "Variable", values_to = "Value")
# A tibble: 200 x 3
# Groups: groups [4]
groups Variable Value
<fct> <chr> <dbl>
1 (1e+04,2e+04] Count 27
2 (1e+04,2e+04] amount 11503.
3 (3e+04,4e+04] Count 27
4 (3e+04,4e+04] amount 31532.
5 (1e+04,2e+04] Count 27
6 (1e+04,2e+04] amount 16359.
7 (3e+04,4e+04] Count 27
8 (3e+04,4e+04] amount 35321.
9 (3e+04,4e+04] Count 27
10 (3e+04,4e+04] amount 37619.
# … with 190 more rows
However, if you plot this data as it is, you will get a poor plot in which the bars for "Count" are tiny compared to those for "amount":
library(ggplot2)
library(tidyr)
library(dplyr)
sales %>% group_by(groups) %>% mutate(Count = n()) %>%
pivot_longer(.,cols = c(Count, amount), names_to = "Variable", values_to = "Value")%>%
ggplot(aes(x=groups, y = Value, fill = Variable)) +
geom_bar(stat="identity", position = position_dodge())
So, you can try to add a secondary y axis using the sec.axis argument of scale_y_continuous. However, this won't change how the bars are drawn; it simply adds a "fake" right axis whose labels are the left-axis values transformed by the formula you pass to sec.axis.
So, if you want both groups of values to be visible on your graph, you need to either scale down "amount" or scale up "Count" so that both groups have a similar range of values.
Here, as you want to have the sum on the right axis, we will scale down the "Sum" (the "amount" values) so that it falls in the same range as the "Count" values.
On the graph, you can see that the "amount" values reach around 40000, whereas the maximal value of "Count" is 30. So, you can choose the following scale factor: 40000 / 30 = 1333.333.
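As a small illustration of that arithmetic (the variable names below are made up for this sketch, and the two maxima are the approximate values read off the graph, not computed from the data):

```r
# Approximate ranges read off the plot (assumed values, not computed from data)
max_amount <- 40000  # upper end of the "amount" values
max_count  <- 30     # largest "Count" value

# Ratio between the two ranges
scale_factor <- max_amount / max_count
scale_factor
# [1] 1333.333

# Rounded to the nearest hundred for a tidier secondary axis
round(scale_factor, -2)
# [1] 1300
```

Any factor of roughly this size will do; a round number just makes the right-hand axis easier to read.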
So, now, if you create a second column called "Amount" that is the result of "amount" divided by 1300 (the scale factor above, rounded to a convenient number), you will have "Amount" and "Count" in the same range. Your data will now look like this:
library(dplyr)
library(tidyr)
sales %>% group_by(groups) %>% mutate(Count = n()) %>%
mutate(Amount = amount /1300) %>%
pivot_longer(.,cols = c(Count, Amount), names_to = "Variable", values_to = "Value")
# A tibble: 200 x 4
# Groups: groups [4]
amount groups Variable Value
<dbl> <fct> <chr> <dbl>
1 24000. (2e+04,3e+04] Count 30
2 24000. (2e+04,3e+04] Amount 18.5
3 13313. (1e+04,2e+04] Count 30
4 13313. (1e+04,2e+04] Amount 10.2
5 19545. (1e+04,2e+04] Count 30
6 19545. (1e+04,2e+04] Amount 15.0
7 38179. (3e+04,4e+04] Count 20
8 38179. (3e+04,4e+04] Amount 29.4
9 19316. (1e+04,2e+04] Count 30
10 19316. (1e+04,2e+04] Amount 14.9
# … with 190 more rows
For the secondary y axis to reflect the real "amount" values, you pass the inverse transformation to sec_axis, i.e. you multiply by 1300.
Altogether, you get the following code:
library(ggplot2)
library(dplyr)
library(tidyr)
sales %>% group_by(groups) %>% mutate(Count = n()) %>%
mutate(Amount = amount /1300) %>%
pivot_longer(.,cols = c(Count, Amount), names_to = "Variable", values_to = "Value") %>%
ggplot(aes(x=groups, y = Value, fill = Variable)) +
geom_bar(stat="identity", position = position_dodge()) +
scale_y_continuous(name = "Count",sec.axis = sec_axis(~.*1300, name = "Sum"))
Thus, you get the illusion of having plotted two different groups of values on two different scales.
Hope that this long explanation was helpful for you.
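If what you need is literally the total per bin, one possible variation is to summarise to one row per bin before plotting. This is only a sketch under some assumptions: the bin edges are fixed with seq() instead of being taken from hist(), the names per_bin and k are invented here, and the sum is drawn as points rather than as a second set of bars.

```r
library(dplyr)
library(ggplot2)

set.seed(123)
sales <- data.frame(amount = runif(100, min = 0, max = 40000))

# Fixed bin edges (an assumption; the code above takes them from hist())
sales$groups <- cut(sales$amount, breaks = seq(0, 40000, by = 10000),
                    include.lowest = TRUE)

# One row per bin: number of sales and their total amount
per_bin <- sales %>%
  group_by(groups) %>%
  summarise(Count = n(), Sum = sum(amount), .groups = "drop")

# Factor that maps the range of Sum onto the range of Count
k <- max(per_bin$Sum) / max(per_bin$Count)

ggplot(per_bin, aes(x = groups)) +
  geom_col(aes(y = Count), fill = "grey70") +                    # frequency
  geom_point(aes(y = Sum / k), colour = "firebrick", size = 3) + # rescaled sum
  scale_y_continuous(name = "Count",
                     sec.axis = sec_axis(~ . * k, name = "Sum"))
```

The same rescaling trick applies: the points are drawn on the Count scale and the right-hand axis multiplies back by k.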
I am trying to compare two groups of patients (control and intervention) across multiple study visits.
Examples of measurements: hemoglobin, troponin, myoglobin, creatinine, C-reactive protein (CRP)
This means I would like to see a difference between these groups for different Visits, e.g. intervention group has lower CRP at visit 2 than controls. Additionally, I would like to compare the patients with themselves, e.g. patient 2 has lower CRP at visit 3, than at visit 2.
Ultimately, I would like to show my data graphically (one plot per marker, with a line for the mean of the intervention group and a line for the mean of the controls) and primarily do descriptive statistics without testing, since my sample size is pretty small and this is more exploratory.
So far I have created a .csv with all data where I made columns indicating, if patients are control or intervention. This table is sortable by visit, control/intervention and patient ID.
First step is to install and load the packages.
install.packages("tidyverse")
install.packages("janitor")
library(tidyverse)
library(janitor)
library(readr)
data <- read_csv("Descriptive statistics_Sample data.csv")
The dataset had 12 columns and 31 rows with the following names
pseudonym
control(0/1)
intervention(0/1)
visit
weight-V1-3
height
Systolic.Blood.Pressure_V1-3
Diastolic.Blood.Pressure_V1-3
Pulse_V1-3
Respiration.Rate_V1-3
HS.cTnT.(ng/l)_V1-3
Myoglobin.(ug/l)_V1-3
Some of these names may be hard to work with in R, so I cleaned the names using a function called clean_names() from the janitor package.
data <- clean_names(data)
pseudonym
control_0_1
intervention_0_1
visit
weight_v1_3
height
systolic_blood_pressure_v1_3
diastolic_blood_pressure_v1_3
pulse_v1_3
respiration_rate_v1_3
hs_c_tn_t_ng_l_v1_3
myoglobin_ug_l_v1_3
Next, we need to create a new categorical variable by combining the control_0_1 and intervention_0_1 variables. The name of the variable can be anything; I have named it group. We create this variable using the mutate function. We then fill in values for this new variable using the case_when function, which helps us recode values. If there is a 1 in the control_0_1 variable, we ask it to call the value "control", and similarly for the intervention_0_1 variable.
mutate(group = case_when(control_0_1 == 1 ~ "control",
intervention_0_1 == 1 ~ "intervention"))
I like to move the newly created variable to the beginning of the dataframe to see it easier. This step is not necessary.
relocate(group, .after = 1)
Symbols like %>% are called pipes. Read them like "and then". For example, we get data (and then) mutate a new column (and then) relocate it. We are also overwriting the object with <- symbol.
data <- data %>%
mutate(group = case_when(control_0_1 == 1 ~ "control",
intervention_0_1 == 1 ~ "intervention")) %>% # creates a new categorical variable called "group".
relocate(group, .after = 1) # moves the group column from the end of the dataframe to after the 1st column - this step is not necessary, but I like to see the grouping variable close to the beginning of the dataframe.
data
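If you want to try this step without the CSV file, here is a minimal sketch on a made-up table (toy and its values are hypothetical; only the column names follow the cleaned dataset):

```r
library(dplyr)

# Hypothetical stand-in for the CSV: one indicator column per arm
toy <- tibble(
  pseudonym        = 1:4,
  control_0_1      = c(1, 0, 0, 1),
  intervention_0_1 = c(0, 1, 1, 0)
)

toy <- toy %>%
  # recode the two 0/1 indicators into a single categorical column
  mutate(group = case_when(control_0_1 == 1 ~ "control",
                           intervention_0_1 == 1 ~ "intervention")) %>%
  # move the new column right after the first one
  relocate(group, .after = 1)

toy
```

Rows where neither indicator is 1 would get NA in group, because case_when returns NA when no condition matches.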
To get the means for the entire dataset, we use a function called summarize. This is similar to mutate where we create a new column called mean_resp (name can be anything) and calculate the mean of the respiration_rate_v1_3 column. We also remove missing values if we need to with na.rm = TRUE.
data %>%
summarize(mean_resp = mean(respiration_rate_v1_3, na.rm = TRUE))
mean_resp
15.32258
To group this by the new group variable, we add a new line with group_by function and add the group variable inside like this group_by(group).
data %>%
group_by(group) %>%
summarize(mean_resp = mean(respiration_rate_v1_3, na.rm = TRUE))
This results in:
group mean_resp
control 14.80000
intervention 15.57143
To further group this by visits, we have to add visit to the group_by function.
data %>%
group_by(group, visit) %>%
summarize(mean_resp = mean(respiration_rate_v1_3, na.rm = TRUE))
group visit mean_resp
control 1 14.00000
control 2 15.60000
intervention 1 15.33333
intervention 2 15.00000
intervention 3 16.11111
This has 5 rows, but it will be nice to see this as a transposed table.
This can be done by using the pivot_wider function. We take the names from column visit and create three new columns simply called 1, 2, 3. The values for these new columns will be from the mean_resp column. We do this with this pivot_wider(names_from = visit, values_from = mean_resp)
data %>%
group_by(group, visit) %>%
summarize(mean_resp = mean(respiration_rate_v1_3, na.rm = TRUE)) %>%
pivot_wider(names_from = visit, values_from = mean_resp)
This results in
group 1 2 3
control 14.00000 15.6 NA
intervention 15.33333 15.0 16.11111
To visualise this, we can create a ggplot.
data %>%
group_by(group, visit) %>%
summarize(mean_resp = mean(respiration_rate_v1_3, na.rm = TRUE)) %>% # na.rm = TRUE as in the tables above
ggplot(aes(x = factor(visit), y = mean_resp, group = factor(group), color = factor(group))) +
geom_line(size = 1) +
geom_point() + scale_color_brewer(palette = "Dark2") + theme_minimal()
To get means by patient
data %>%
group_by(pseudonym, visit) %>%
summarize(mean_resp = mean(respiration_rate_v1_3, na.rm = TRUE)) %>%
pivot_wider(names_from = visit, values_from = mean_resp)
pseudonym 1 2 3
1 20 20 20
2 16 12 16
3 16 19 NA
4 13 13 14
5 13 15 16
6 18 16 16
7 12 14 18
8 13 16 18
9 16 16 11
10 13 14 16
data %>%
group_by(pseudonym, visit) %>%
summarize(mean_resp = mean(respiration_rate_v1_3, na.rm = TRUE)) %>% # na.rm = TRUE as in the table above
ggplot(aes(x = factor(visit), y = mean_resp, group = factor(pseudonym), color = factor(pseudonym))) +
geom_line(size = 1) +
geom_point() +
scale_y_binned(limits = c(10, 21)) +
scale_color_brewer(palette = "Paired") +
theme_bw()
Google Colab link:
https://colab.research.google.com/drive/1bNZwpvEOt6dOEOoCrN_a14G5Pz021NAf?usp=sharing
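Since the question asks for one plot per marker, the same steps could probably be extended to several columns at once. The sketch below is only an assumption about how that might look: toy is invented data with two of the cleaned column names, and across(), pivot_longer() and facet_wrap() stand in for repeating the code per marker.

```r
library(dplyr)
library(tidyr)
library(ggplot2)

# Hypothetical data with the same shape as the cleaned dataset
toy <- tibble(
  group = rep(c("control", "intervention"), each = 4),
  visit = rep(c(1, 1, 2, 2), times = 2),
  pulse_v1_3 = c(60, 64, 62, 66, 70, 74, 68, 72),
  respiration_rate_v1_3 = c(14, 16, 15, NA, 15, 17, 16, 14)
)

means_long <- toy %>%
  group_by(group, visit) %>%
  # mean of every listed marker, ignoring missing values
  summarize(across(c(pulse_v1_3, respiration_rate_v1_3),
                   ~ mean(.x, na.rm = TRUE)),
            .groups = "drop") %>%
  # one row per group, visit and marker
  pivot_longer(-c(group, visit), names_to = "marker", values_to = "mean_value")

ggplot(means_long, aes(x = factor(visit), y = mean_value,
                       group = group, color = group)) +
  geom_line() +
  geom_point() +
  facet_wrap(~ marker, scales = "free_y") # one panel per marker
```

scales = "free_y" lets each marker keep its own y range, which matters when the units differ.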
I have a peculiar problem with arranging boxplots in a given order on the x-axis. I am adding two sets of boxplots from different data frames to the same plot, and each time I add the second geom_boxplot, R reorders my x axis alphabetically instead of following the ordered levels of factor(x).
So, I have two data frames of different lengths that look something like this:
df1:
id value
1 A 1
2 A 2
3 A 3
4 A 5
5 B 10
6 B 8
7 B 1
8 C 3
9 C 7
df2:
id value
1 A 4
2 A 5
3 B 6
4 B 8
There are always more observations per id in df1 than in df2, and there are some ids in df1 that are not available in df2.
I'd like df1 to be sorted by the median(value) (ascending) and to first plot boxplots for each id in that order.
Then I add a second layer with boxplots for all other measurements per id from df2, which should maintain the same order on the x-axis.
Here's how I approached that:
vec <- df1 %>%
group_by(id) %>%
summarize(m = median(value)) %>%
arrange(m) %>%
pull(id)
p1 <- df1 %>%
ggplot(aes(x = factor(id, levels = vec), y = value)) +
geom_boxplot()
p1
p2 <- p1 +
geom_boxplot(data = df2, aes(x = factor(id, levels = vec), y = value))
p2
p1 shows the right order (ids are ordered based on ascending medians), p2 always throws my order off and goes back to plotting ids alphabetically (my id is a character column with names actually). I tried with sample dataframes and the above code achieves what is required. Hence, I am not sure what could be specifically wrong about my data so that the code fails when applied to the specific data and not the above mock data.
Any ideas?
Thanks a lot in advance!
If I understood correctly, this should work.
library(tidyverse)
# Sample data
df1 <-
tibble(
id = c("A","A","A","A","B","B","B","C","C"),
value = c(1,2,3,5,10,8,1,3,7),
type = "df1"
)
df2 <-
tibble(
id = c("A","A","B","B"),
value = c(4,5,6,8),
type = "df2"
)
df <-
# Create single data.frame
df1 %>%
bind_rows(df2) %>%
# Reorder id by median(value)
mutate(id = fct_reorder(id,value,median))
df %>%
ggplot(aes(id, y = value, fill = type)) +
geom_boxplot()
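If you would rather keep the two data frames as separate layers, another option that may work is to fix the order on the scale itself with scale_x_discrete(limits = ...). This is a sketch on the mock data only; the fill and width settings are arbitrary choices to tell the layers apart.

```r
library(dplyr)
library(ggplot2)

df1 <- data.frame(id = c("A","A","A","A","B","B","B","C","C"),
                  value = c(1,2,3,5,10,8,1,3,7))
df2 <- data.frame(id = c("A","A","B","B"),
                  value = c(4,5,6,8))

# ids of df1 sorted by ascending median
vec <- df1 %>%
  group_by(id) %>%
  summarize(m = median(value)) %>%
  arrange(m) %>%
  pull(id)
vec
# [1] "A" "C" "B"

p <- ggplot(df1, aes(x = id, y = value)) +
  geom_boxplot() +
  geom_boxplot(data = df2, fill = "grey80", width = 0.3) +
  scale_x_discrete(limits = vec) # order set once, for every layer
p
```

Because the order lives on the scale rather than in each layer's factor, it does not depend on how id is stored in either data frame.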
I have a data frame producers with two columns: person_id and year.
# A tibble: 3,207 x 2
person_id year
<chr> <chr>
1 GASH1991-04-30 2020
2 LOSP1969-06-29 2020
3 CRGM1989-08-26 2020
4 CEVE1954-07-15 2020
5 HERR1998-01-06 2020
6 TOLR1951-04-09 2020
7 BEAM1953-09-07 2020
8 ANRJ1977-07-06 2020
9 PAMH1982-02-06 2020
10 AKTE1967-11-15 2020
# ... with 3,197 more rows
I can summarise this dataframe to obtain cumulative sum:
producers %>%
select(person_id, year) %>%
group_by(year) %>%
distinct(person_id) %>%
summarise(total = n()) %>%
ungroup() %>%
mutate(cum = cumsum(total))
# A tibble: 3 x 3
year total cum
<chr> <int> <int>
1 2019 456 456
2 2020 1832 2288
3 2021 160 2448
And I can make a cumulative bar plot like this:
ggplot(producers, aes(x = as.factor(year), y = as.integer(cum))) +
geom_bar(position = "stack", stat = "identity") +
ylim(0,3000) +
xlab("Year") +
ylab("Producers") +
theme_classic()
But what I really want is something like this:
I've been trying with aes(fill = year) and other arguments but I can't get it. Thanks for your responses.
Here's an approach. Ultimately, we'll need two "year" variables, one to mark the category within each stack, and one to mark which stack we want it to appear in. Here, I set up year2 for the 2nd one, and filter out the values that shouldn't appear yet in each stack.
df2 <- data.frame(
year = 2019:2021,
total = c(456, 1832, 160)
)
library(tidyverse)
df2 %>%
crossing(year2 = df2$year) %>% # make copy for each year
filter(year <= year2) %>% # keep just the years up to current year
ggplot(aes(year2, total, fill = fct_rev(as.factor(year)))) +
geom_col() +
scale_fill_discrete(name = "Year")
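To see why this gives cumulative stacks, it may help to look at the intermediate table before it reaches ggplot. A small check, where stacked and heights are names invented for this sketch:

```r
library(dplyr)
library(tidyr)

df2 <- data.frame(
  year = 2019:2021,
  total = c(456, 1832, 160)
)

stacked <- df2 %>%
  crossing(year2 = df2$year) %>% # every year paired with every stack
  filter(year <= year2)          # keep only years already reached

# Total height of each stack equals the cumulative sum up to that year
heights <- stacked %>%
  group_by(year2) %>%
  summarise(height = sum(total))
heights$height
# [1]  456 2288 2448
```

So the 2021 bar is made of three segments (2019, 2020, 2021), the 2020 bar of two, and the 2019 bar of one.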
ggplot2 works best with data in a long format, where you have one variable to plot and then various identifying variables to control the fill, color, and faceting. Here I explicitly build a repeated data frame using map_dfr, which essentially runs a for loop over each year in the input dataset. In dat_long, the new column yearid becomes the x-axis identifier, so within 2021 we can access the data for years 2019 through 2021 to control the color fill.
library(ggplot2)
library(dplyr)
library(purrr)
library(forcats)
year = c(2019, 2020, 2021)
sum = c(456, 1832, 160)
cumsum = c(456, 2288, 2448)
dat <- data.frame(year, sum)
# note: don't need the cumsum column
# instead, create long, replicated data where we repeat
# each years entry for every year that comes after it
dat_long <-
map_dfr(unique(dat$year),
~filter(dat, year <= .x) %>%
mutate(yearid = .x))
ggplot(data = dat_long,
aes(x = yearid,
y = sum,
# note: use factor to get discrete color palette, fct_rev to stack 2021 on top
fill = fct_rev(factor(year)))) +
geom_col()
ggplot(data, aes(x, y))+ geom_count()
provides a plot showing the count of each [x, y] case. If there were 4 values in x and 6 values in y, geom_count would show 24 circles in a plot and each circle size representing the count.
How do I create a count table of two variables similarly using summarise() or any other function in dplyr? This table would show the count as a number instead of a circle size as in geom_count() for each [x, y] case.
Welcome to Stack Overflow. Based on your comments, I think something like this is what you are looking for:
library(tidyverse)
df <-
mpg %>%
mutate(
x = cyl,
y = drv
)
df %>%
# create a column called n for number of times x & y occur together
count(x, y) %>%
# create columns for each unique value of y and
# put the values of column n below
spread(key = y, value = n, fill = 0)
# x `4` f r
# ------------------------
# 4 23 58 0
# 5 0 4 0
# 6 32 43 4
# 8 48 1 21
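In current versions of tidyr, spread() is superseded by pivot_wider(), so the same table could also be built as in the sketch below (count_table is a name made up here; mpg comes with ggplot2):

```r
library(dplyr)
library(tidyr)
library(ggplot2) # only needed for the mpg dataset

count_table <- mpg %>%
  # number of rows for each (cyl, drv) combination
  count(cyl, drv) %>%
  # one column per drv value; combinations that never occur become 0
  pivot_wider(names_from = drv, values_from = n, values_fill = 0)

count_table
```

This uses cyl and drv directly instead of copying them to x and y first; the counts are the same as in the table above.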
I have two columns in a data.frame, that should have levels sorted in the same order, but I don't know how to do it in a straightforward manner.
Here's the situation:
library(ggplot2)
library(dplyr)
library(magrittr)
set.seed(1)
df1 <- data.frame(rating = sample(c("GOOD","BAD","AVERAGE"),10,T),
div = sample(c("A","B","C"),10,T),
n = sample(100,10,T))
# I'm adding a label column that I use for plotting purposes
df1 <- df1 %>% group_by(rating) %>% mutate(label = paste0(rating," (",sum(n),")")) %>% ungroup
# # A tibble: 10 x 4
# rating div n label
# <fctr> <fctr> <int> <chr>
# 1 BAD C 48 BAD (220)
# 2 BAD B 87 BAD (220)
# 3 BAD C 44 BAD (220)
# 4 GOOD B 25 GOOD (77)
# 5 AVERAGE B 8 AVERAGE (117)
# 6 AVERAGE C 10 AVERAGE (117)
# 7 AVERAGE A 32 AVERAGE (117)
# 8 GOOD B 52 GOOD (77)
# 9 AVERAGE C 67 AVERAGE (117)
# 10 BAD C 41 BAD (220)
# rating levels are sorted
df1$rating <- factor(df1$rating,c("BAD","AVERAGE","GOOD"))
ggplot(df1,aes(x=rating,y=n,fill=div)) + geom_col() # plots in the order I want
ggplot(df1,aes(x=label,y=n,fill=div)) + geom_col() # doesn't because levels aren't sorted
How do I copy the factor order from one column to another?
I can make it work this way but I think it's really awkward:
lvls <- df1 %>% select(rating,label) %>% unique %>% arrange(rating) %>% extract2("label")
df1$label <- factor(df1$label,lvls)
ggplot(df1,aes(x=label,y=n,fill=div)) + geom_col()
Instead of adding a label column and using aes(x = label), you can stick to aes(x = rating) and create the labels in scale_x_discrete:
ggplot(df1, aes(x = rating, y = n, fill = div)) +
geom_col() +
scale_x_discrete(labels = df1 %>%
group_by(rating) %>%
summarize(n = sum(n)) %>%
mutate(lab = paste0(rating, " (", n, ")")) %>%
pull(lab))
Once you have set the levels of rating, you can use forcats to set the levels of label by the order of rating like this...
library(forcats)
df1 <- df1 %>% group_by(rating) %>%
mutate(label=paste0(rating," (",sum(n),")")) %>%
ungroup %>%
arrange(rating) %>% #sort by rating
mutate(label=fct_inorder(label)) #set levels by order in which they appear
Or you can use forcats::fct_reorder to do the same thing...
df1$label <- fct_reorder(df1$label, as.numeric(df1$rating))
The plot then has the bars in the right order.
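A quick way to convince yourself is to check the resulting levels on a tiny hand-made example (the vectors below are invented for illustration and are not the df1 from the question):

```r
library(forcats)

# rating with its levels already in the desired order
rating <- factor(c("BAD", "GOOD", "AVERAGE", "BAD"),
                 levels = c("BAD", "AVERAGE", "GOOD"))
# plain character labels, one per row
label <- c("BAD (89)", "GOOD (25)", "AVERAGE (8)", "BAD (89)")

# order the label levels by the integer codes of rating
label <- fct_reorder(label, as.numeric(rating))
levels(label)
# [1] "BAD (89)"    "AVERAGE (8)" "GOOD (25)"
```

Without the reordering, the levels would be alphabetical, which puts "AVERAGE (8)" first.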