Equally distributed bar chart in ggplot2 - r

What I want to do
My dataset consists of several cases (id) with different outcomes (outcome) for a given number of repeated measures (cycle). Each cycle should be counted as 1 (val) or be visualized with equal length.
The plot I want to end up with is a stacked bar chart, where each cycle of each case has the same length. The sequence of cycles must be continuous. The sequence of the outcomes depends on the corresponding cycles.
My Problem
The sample code below produces a bar chart that sums up the cycles (although being a factor). However, using the val column instead of cycle messes with the sequence of the outcomes, which must not change.
# setup
library(ggplot2)
library(dplyr)
set.seed(0)
# test data
data.frame(
cycle=factor(rep(1:8,2),levels=1:8),
val=1,
id=factor(rep(1:2,each=8)),
outcome=factor(paste("Outcome",sample(1:8,16,T)),levels=paste("Outcome",1:8))) %>%
# plot
ggplot(.,aes(id,cycle,fill=outcome))+
geom_bar(stat="identity",position=position_stack(reverse=T),width=0.99)+
coord_flip()
My Question
Is it possible to make cycles count as 1 for each id, keeping the outcome sequence?
Thank you in advance!
The Plots
This is what I get when using the above code:
This is what I get, when using val instead of cycle:
The goal is to keep the outcome sequence, while counting each cycle as 1 or making them appear of the same length for each id.

If I understand the question correctly, you could achieve your desired result using geom_tile:
library(ggplot2)
set.seed(0)
dat <- data.frame(
cycle = factor(rep(1:8, 2), levels = 1:8),
val = 1,
id = factor(rep(1:2, each = 8)),
outcome = factor(paste("Outcome", sample(1:8, 16, T)), levels = paste("Outcome", 1:8))
)
ggplot(dat, aes(cycle, id, fill = outcome)) +
geom_tile()
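If a stacked bar is still preferred over tiles, one alternative sketch (not part of the answer above) is to keep val on the value axis and set the group aesthetic to cycle, on the assumption that position_stack orders the segments by group rather than by fill. The data frame repeats the question's test data; the object name p is only illustrative.

```r
library(ggplot2)

set.seed(0)
# Same test data as in the question
dat <- data.frame(
  cycle = factor(rep(1:8, 2), levels = 1:8),
  val = 1,
  id = factor(rep(1:2, each = 8)),
  outcome = factor(paste("Outcome", sample(1:8, 16, TRUE)),
                   levels = paste("Outcome", 1:8))
)

# group = cycle makes the stacking follow the cycle order, not the outcome order;
# every segment has height val = 1, so all cycles appear with equal length
p <- ggplot(dat, aes(id, val, fill = outcome, group = cycle)) +
  geom_col(position = position_stack(reverse = TRUE), width = 0.99) +
  coord_flip()
p
```

Whether reverse = TRUE is needed depends on which end of the bar the first cycle should start from, so it is worth checking the result against the expected sequence.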

Related

Split one massive plot into smaller sub-plots for better visualisation in ggplot

I've got data on survival/sampling dates of over 500 dogs, each dog having been sampled at least once, and several having been sampled three or four times. For example:
Microchip_number      Date        Sampling_occasion
White notched fatso   20,11,2018  First
White notched fatso   28,12,2018  Second
White notched fatso   09,04,2019  Third
White notched fatso   23,10,2019  Fourth
Tuttu Jeevan          06,12,2018  First
Tuttu Jeevan          03,01,2019  Second
Tuttu Jeevan          04,05,2019  Third
Tuppy                 22,10,2018  First
Tuppy                 20,11,2018  Second
Tuppy                 17,04,2019  Third
Tuppy                 31,07,2019  Lost to study
I've managed to plot this in ggplot, but it's a very large image which requires zooming in and scrolling to view the sampling times of each individual dog.
Plot of outcomes for all dogs
I've found suggestions to split large dataframes based on a certain variable (e.g. month) or to use facet_wrap, but in my case, I don't have any such variable to use. Is there a way to split this large plot into multiple smaller plots that don't need to be zoomed in to view all the details clearly, such as below (without having to separately plot subsets of the dataframe)?
How I'd like each split/sub-plot to appear
This is the code I'm using
library(readxl)   # provides read_xlsx()
library(ggplot2)
outcomes <- read_xlsx("Dog outcomes.xlsx", col_types = c("text", "date", "text"))
outcomes$Microchip_number<- as.factor(outcomes$Microchip_number)
outcomes$Sampling_occasion<- factor(outcomes$Sampling_occasion,
levels = c("First", "Second", "Third", "Fourth", "Lost to study", "Died"))
g<- ggplot(outcomes)
g + geom_point(aes(x = Date, y = Microchip_number, colour = Sampling_occasion, shape = Sampling_occasion)) +
geom_line(aes(x = Date, y = Microchip_number, group = Microchip_number, colour = Sampling_occasion)) +
theme_bw()
Thanks so much, Jrm FRL, the code to add the counter and subgroup columns was exactly what I needed! As Gregor mentioned, facet_wrap just made things more difficult to view, so I used a for loop using subgroup to plot 50 dogs per pdf page (or any other device). This is the code I used, and it's worked perfectly, although for some reason, the 'Microchip_number's are displaying in reverse sequence / alphabetical order (68481, 68480, 68479 etc.), despite being organised the other way round in the main dataframe 'outcomes'. Minor quibble, however! This makes it so much easier to visualise outcomes for specific individuals. Cheers!
outcomes2 <- outcomes %>%
mutate(counter = 1 + cumsum(c(0,as.numeric(diff(Microchip_number))!=0)), # this counter starting at 1 increments for each new dog
subgroup = as.factor(ceiling(counter/50)))
pdf(file = "All_outcomes_50.pdf") # one page per subgroup of 50 dogs
for (i in 1:length(unique(outcomes2$subgroup))) {
outcomes2 %>%
filter(subgroup == i) -> df
ggplot(df) + geom_point(aes(x = Date, y = Microchip_number, colour = Sampling_occasion, shape = Sampling_occasion)) +
geom_line(aes(x = Date, y = Microchip_number, group = Microchip_number, colour = Sampling_occasion)) +
theme_bw() -> wow
print(wow)
}
dev.off()
New plot after using 'for' loop
You can simply divide your dataset into sub-groups containing the same number of dogs (e.g. 10).
Add an intermediate counter column to overcome the small difficulty that there is not necessarily the same number of rows for each dog.
I would suggest:
library('dplyr')
outcomes <- outcomes %>%
mutate(counter = 1 + cumsum(c(0,as.numeric(diff(Microchip_number))!=0)), # this counter starting at 1 increments for each new dog
subgroup = as.factor(ceiling(counter/10)))
You will obtain a new dataset with a factor subgroup column whose value is different every 10th dog. Then just add a + facet_wrap(.~subgroup) to your plot.
Hope this will help.
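To see what the counter does, here is a minimal base R sketch on a toy vector. The names chip, counter and subgroup are illustrative, the group size of 2 is chosen only to keep the example short, and as.integer() is applied explicitly before diff() as an assumption to make the factor-to-code conversion visible. It also assumes the rows are already sorted so that all rows of a dog are adjacent.

```r
# Toy stand-in for outcomes$Microchip_number: three dogs with 2, 1 and 3 rows
chip <- factor(c("a", "a", "b", "c", "c", "c"))

# diff() on the integer codes is non-zero exactly where a new dog starts;
# the cumulative sum of those change points numbers the dogs 1, 2, 3, ...
counter <- 1 + cumsum(c(0, diff(as.integer(chip)) != 0))

# Two dogs per subgroup here (the answer uses 10, the follow-up uses 50)
subgroup <- ceiling(counter / 2)

data.frame(chip, counter, subgroup)
```

Every row of the same dog gets the same counter value, so a dog is never split across two subgroups.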

Generating R ggplot line graph with color/type conditional on different variables

I'm struggling to get the exact output needed for a ggplot line graph. As an example, see the code below. Overall, I have two conditions (A/B), and two treatments (C/D). So four total series, but in a factorial way. The lines can be viewed as a time series but with ordinal markings (rather than numeric).
I'd like to generate a connected line graph for the four types, where the color depends on the condition, and the line type depends on the treatment. Thus two different colors and two line types. To make things a bit more complicated, one condition (B) does not have data for the third time period.
I cannot seem to generate the graph needed for these constraints. The closest I got is shown below. What am I doing wrong? I tried removing the group=condition code, but that doesn't help either.
library(ggplot2)
set.seed(1)
example_df <- data.frame(time = c('time1','time2','time3','time1','time2','time3','time1','time2','time1','time2'),
time_order = c(1,2,3,1,2,3,1,2,1,2),
condition = c('A','A','A','A','A','A','B','B','B','B'),
treatment = c('C','C','C','D','D','D','C','C','D','D'),
value = runif(10))
ggplot(example_df, aes(x=reorder(time,time_order), y=value, color=condition , line_type=treatment, group=condition)) +
geom_line()
You've got 3 problems, from what I can tell.
1. linetype doesn't have an underscore in it.
2. With a categorical axis, you need to use the group aesthetic to set which lines get connected. You've made a start with group = condition, but this would imply one line for each condition type (2 lines), but you want one line for each condition:treatment interaction (2 * 2 = 4 lines), so you need group = interaction(condition, treatment).
3. Your sample data doesn't quite make sense. Your condition B values have two treatment Cs at time 1 and two Ds at time 2, so there is no connection between times 1 and 2. This doesn't much matter, and your real data is probably fine.
This should work:
ggplot(
example_df,
aes(
x = reorder(time, time_order),
y = value,
color = condition,
linetype = treatment,
group = interaction(condition, treatment)
)
) +
geom_line()
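As a quick check of what the grouping variable looks like, the following base R sketch builds the interaction from the condition and treatment vectors in the question. The name grp is illustrative only.

```r
# condition and treatment as defined in the question's example_df
condition <- c('A','A','A','A','A','A','B','B','B','B')
treatment <- c('C','C','C','D','D','D','C','C','D','D')

# One level per condition/treatment combination, i.e. one line per level
grp <- interaction(condition, treatment)
nlevels(grp)
table(grp)
```

Each of the four levels corresponds to one connected line in the plot, and the counts show how many points each line has.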

ggplot structuring data boxplot of treatment effects in multiple time periods

I have data currently structured like so:
set.seed(100)
require(ggplot2)
require(reshape2)
d<-data.frame("ID" = 1:30,
"Treatment1" = sample(0:1,30,replace = T, prob = c(0.5,0.5)),
"Score1" = rnorm(30)^2,
"Treatment2" = sample(0:1,30,replace = T,prob = c(0.3,0.7)),
"Score2" = rnorm(30)^2,
"Treatment3" = sample(0:1,30,replace = T,prob = c(0.2,0.8)),
"Score3" = rnorm(30)^2)
Here there are unique IDs, 3 different treatments (coded 1 if they received the given treatment and 0 if not), and the different scores the IDs have after each treatment period. I'm trying to create a boxplot that will illustrate the score distribution associated with each treatment period for each of the unique IDs in the data set, but I'm either not melting the data properly or not coding the plot properly or both.
d.melt<-melt(d,id.vars = c("ID","Treatment1","Treatment2","Treatment3"),measure.vars = c("Score1","Score2","Score3"))
I can produce the boxplot that shows the scores separated by whether they received one of the three treatments with this code:
ggplot(d.melt)+
geom_boxplot(aes(x = variable,y = value,fill = factor(Treatment1)))
But this will only plot the difference in all the scores for the IDs that got treatment 1 and not the difference in scores for all of the 3 levels...
Any help getting my head around this problem would be great. Thank you in advance
The complication is that the data has pairs of columns (Treatment1, Score1, etc.) representing each treatment/score and we need to keep track of both whether a given subject received a given Treatment and their Score for each treatment. I've used one of the map functions from the purrr package (which is part of the tidyverse suite of packages) for this.
The code steps through each of the three pairs of treatments/scores, adds a column called Treatment indicating the treatment number and returns the stacked (long format) data frame.
library(tidyverse)
dr = map2_df(seq(2,ncol(d),2), seq(3,ncol(d),2),
function(t,s) {
data.frame(ID = d[,"ID"],
Treatment = gsub(".*([0-9]$)", "\\1", names(d)[t]),
Treat_Flag = d[,t],
Score = d[,s])
})
Now we plot the data using Treatment on the x-axis to mark the treatment number and color by Treat_Flag to provide separate box plots based on whether a given subject received a given treatment.
ggplot(dr, aes(Treatment, Score, colour=factor(Treat_Flag))) +
geom_boxplot() +
theme_classic() +
labs(colour="Treatment Indicator")
Here's another way to reshape the data. The code below uses functions from tidyr rather than from reshape2 (tidyr is the successor to reshape2). In the code below, gather(d, key, value, -ID) is essentially equivalent to melt(d, id.var="ID"). You can stop the chain of functions at any step to look at the intermediate outputs. This approach is probably more in keeping with the tidyverse paradigm for data reshaping, but I find it a bit less intuitive than the map approach above.
dr = gather(d, key, value, -ID) %>%
separate(key, into=c("key", "value2"), sep="(?=[0-9])") %>%
spread(key, value) %>%
rename(Treatment=value2, Treat_Flag=Treatment)
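For comparison, a sketch of the same reshape with tidyr's pivot_longer (available in tidyr 1.0.0 and later), which is not part of the answer above. It assumes the column names follow the pattern of letters followed by a number, as in the question. Note that the column names differ from the answer's: here Period holds the treatment number and Treatment holds the 0/1 flag. The name dr_long is illustrative.

```r
library(tidyr)

set.seed(100)
# Data with the same layout as the question's data frame d
d <- data.frame(
  ID = 1:30,
  Treatment1 = sample(0:1, 30, replace = TRUE, prob = c(0.5, 0.5)),
  Score1 = rnorm(30)^2,
  Treatment2 = sample(0:1, 30, replace = TRUE, prob = c(0.3, 0.7)),
  Score2 = rnorm(30)^2,
  Treatment3 = sample(0:1, 30, replace = TRUE, prob = c(0.2, 0.8)),
  Score3 = rnorm(30)^2
)

# ".value" keeps the letter part (Treatment, Score) as column names and
# moves the trailing number into a new Period column
dr_long <- pivot_longer(
  d,
  cols = -ID,
  names_to = c(".value", "Period"),
  names_pattern = "([A-Za-z]+)([0-9]+)"
)
head(dr_long)
```

The result has one row per ID and period, and can be plotted with Period on the x-axis and factor(Treatment) as the colour.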

Convert absolute values to ranges for charting in R

Warning: still new to R.
I'm trying to construct some charts (specifically, a bubble chart) in R that shows political donations to a campaign. The idea is that the x-axis will show the amount of contributions, the y-axis the number of contributions, and the area of the circles the total amount contributed at this level.
The data looks like this:
CTRIB_NAML CTRIB_NAMF CTRIB_AMT FILER_ID
John Smith $49 123456789
The FILER_ID field is used to filter the data for a particular candidate.
I've used the following functions to convert this data frame into a bubble chart (thanks to help here and here).
vals<-sort(unique(dfr$CTRIB_AMT))
sums<-tapply( dfr$CTRIB_AMT, dfr$CTRIB_AMT, sum)
counts<-tapply( dfr$CTRIB_AMT, dfr$CTRIB_AMT, length)
symbols(vals,counts, circles=sums, fg="white", bg="red", xlab="Amount of Contribution", ylab="Number of Contributions")
text(vals, counts, sums, cex=0.75)
However, this results in way too many intervals on the x-axis. There are several million records all told, and divided up for some candidates could still result in an overwhelming amount of data. How can I convert the absolute contributions into ranges? For instance, how can I group the vals into ranges, e.g., 0-10, 11-20, 21-30, etc.?
----EDIT----
Following comments, I can convert vals to numeric and then slice into intervals, but I'm not sure then how I combine that back into the bubble chart syntax.
new_vals <- as.numeric(as.character(sub("\\$","",vals)))
new_vals <- cut(new_vals,100)
But regraphing:
symbols(new_vals,counts, circles=sums)
Is nonsensical -- all the values line up at zero on the x-axis.
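Once the amounts are binned into a factor with cut, you can use tapply again to find the counts and the sums per bin. Note that the binning has to be applied to the full CTRIB_AMT column rather than to the unique values in vals, so that the factor has one entry per contribution. For example:
amt = as.numeric(sub("\\$", "", dfr$CTRIB_AMT))
bins = cut(amt, 100)
counts = tapply(amt, bins, length)
sums = tapply(amt, bins, sum)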
For this type of thing, though, you might find the plyr and ggplot2 packages helpful. Here is a complete reproducible example:
require(plyr)    # provides ddply()
require(ggplot2)
# Options
n = 1000
breaks = 10
# Generate data
set.seed(12345)
CTRIB_NAML = replicate(n, paste(letters[sample(10)], collapse=''))
CTRIB_NAMF = replicate(n, paste(letters[sample(10)], collapse=''))
CTRIB_AMT = paste('$', round(runif(n, 0, 100), 2), sep='')
FILER_ID = replicate(10, paste(as.character((0:9)[sample(9)]), collapse=''))[sample(10, n, replace=T)]
dfr = data.frame(CTRIB_NAML, CTRIB_NAMF, CTRIB_AMT, FILER_ID)
# Format data
dfr$CTRIB_AMT = as.numeric(sub('\\$', '', dfr$CTRIB_AMT))
dfr$CTRIB_AMT_cut = cut(dfr$CTRIB_AMT, breaks)
# Summarize data for plotting
plot_data = ddply(dfr, 'CTRIB_AMT_cut', function(x) data.frame(count=nrow(x), total=sum(x$CTRIB_AMT)))
# Make plot
dev.new(width=4, height=4)
qplot(CTRIB_AMT_cut, count, data=plot_data, geom='point', size=total) + theme(axis.text.x=element_text(angle=90, hjust=1))
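The question asks for fixed ranges such as 0-10, 11-20, and so on. A minimal base R sketch of that, staying with symbols(), could pass explicit breaks to cut and plot each bin at its midpoint. The amounts below are made-up illustration values (in the question they would come from dfr$CTRIB_AMT after stripping the dollar sign), and the names amt, bins, mids and keep are assumptions.

```r
# Made-up contribution amounts, for illustration only
amt <- c(5, 12, 18, 25, 25, 49, 73, 99)

# Fixed-width bins: [0,10], (10,20], ..., (90,100]
bins <- cut(amt, breaks = seq(0, 100, by = 10), include.lowest = TRUE)

# Per-bin number of contributions and total amount (NA for empty bins)
counts <- as.vector(tapply(amt, bins, length))
sums <- as.vector(tapply(amt, bins, sum))

# Plot only the non-empty bins, at the midpoint of each range
keep <- !is.na(counts)
mids <- seq(5, 95, by = 10)

# The radius is sqrt(total / pi), so the circle area is proportional to the total
symbols(mids[keep], counts[keep], circles = sqrt(sums[keep] / pi),
        inches = 0.35, fg = "white", bg = "red",
        xlab = "Amount of Contribution", ylab = "Number of Contributions")
text(mids[keep], counts[keep], sums[keep], cex = 0.75)
```

Plotting at numeric midpoints avoids the problem from the edit in the question, where passing the factor itself to symbols() put everything at one x position.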

Get a histogram plot of factor frequencies (summary)

I've got a factor with many different values. If you execute summary(factor) the output is a list of the different values and their frequency. Like so:
A B C D
3 3 1 5
I'd like to make a histogram of the frequency values, i.e. X-axis contains the different frequencies that occur, Y-axis the number of factors that have this particular frequency. What's the best way to accomplish something like that?
edit: thanks to the answer below I figured out that what I can do is get the factor of the frequencies out of the table, get that in a table and then graph that as well, which would look like (if f is the factor):
plot(factor(table(f)))
Update in light of clarified Q
set.seed(1)
dat2 <- data.frame(fac = factor(sample(LETTERS, 100, replace = TRUE)))
hist(table(dat2), xlab = "Frequency of Level Occurrence", main = "")
gives:
Here we just apply hist() directly to the result of table(dat2). table(dat2) provides the frequencies per level of the factor and hist() produces the histogram of these data.
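For the small example in the question (A = 3, B = 3, C = 1, D = 5), the same idea can be checked by hand with a table of the table. This is a sketch; the names f, level_counts and freq_of_freq are illustrative.

```r
# The factor from the question: A and B occur 3 times, C once, D 5 times
f <- factor(rep(c("A", "B", "C", "D"), times = c(3, 3, 1, 5)))

# Frequency of each level
level_counts <- table(f)

# How many levels share each frequency
freq_of_freq <- table(as.vector(level_counts))
freq_of_freq

barplot(freq_of_freq, xlab = "Frequency of Level Occurrence",
        ylab = "Number of Levels")
```

One level occurs once, two levels occur three times and one level occurs five times, which is what the bars show.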
Original
There are several possibilities. Your data:
dat <- data.frame(fac = rep(LETTERS[1:4], times = c(3,3,1,5)))
Here are three, from column one, top to bottom:
1. The default plot method for class "table", which plots the data as histogram-like bars.
2. A bar plot, which is probably what you meant by histogram. Notice the low information-to-ink ratio here.
3. A dot plot or dot chart; it shows the same info as the other plots but uses far less ink per unit of information. Preferred.
Code to produce them:
layout(matrix(1:4, ncol = 2))
plot(table(dat), main = "plot method for class \"table\"")
barplot(table(dat), main = "barplot")
tab <- as.numeric(table(dat))
names(tab) <- names(table(dat))
dotchart(tab, main = "dotchart or dotplot")
## or just this
## dotchart(table(dat))
## and ignore the warning
layout(1)
this produces:
If you just have your data in variable factor (bad name choice by the way) then table(factor) can be used rather than table(dat) or table(dat$fac) in my code examples.
For completeness, package lattice is more flexible when it comes to producing the dot plot as we can get the orientation you want:
require(lattice)
with(dat, dotplot(fac, horizontal = FALSE))
giving:
And a ggplot2 version:
require(ggplot2)
p <- ggplot(data.frame(Freq = tab, fac = names(tab)), aes(fac, Freq)) +
geom_point()
p
giving:
