ggplot2 geom_mosaic with weight variable - r

I am trying to make a mosaic plot with ggplot2. I am using the bladdercancer data from the HSAUR3 package. I am looking to show the relationship between tumorsize and number, but I am not sure how to weight it. I know that the number in the sample with tumorsizes<=3cm is not the same as those with tumorsize>3cm. How do I incorporate that into my mosaic plot?
Here is what I did without weighting it.
library("ggplot2")
library("ggmosaic")
ggplot(data = bladdercancer, family=poisson()) +
geom_mosaic(aes(weight= 1 , x = product(tumorsize, number),
fill=factor(tumorsize)), na.rm=TRUE) +
labs(x="Number of tumors", title='Number of tumors vs Tumorsize') +
guides(fill=guide_legend(title = "Tumor Size"))

This may be late nevertheless I provide my suggestions since I am trying to do a similar thing. Below are two suggestions:
library(tidyverse)
first option:
bladdercancer %>%
group_by(tumorsize, number) %>%
# get frequencies/counts for each tumor size and for each number
summarise(n.cases = n()) %>%
ggplot() +
geom_mosaic(aes(weight = n.cases, x = product(number),
fill = factor(n.cases)), offset = 0) +
guides(fill=guide_legend(title = "Tumor Size")) +
labs(x="Number of tumors", title='Number of tumors vs Tumorsize') +
# remove background colour
theme_bw() +
theme(panel.grid.major = element_blank(),
# remove major and minor grids
panel.grid.minor = element_blank(),
# push title to the middle
plot.title = element_text(size = 10, hjust = .5))
where the categories within each column represent different counts for each tumor size e.g number 1 appears 15 times for <=3cm and 5 times for >3cm. I am however not able to partitions where the frequencies are the same, in this case number 3 and 4. Hence my option 2
second option:
ggplot(bladdercancer) +
geom_bar(aes(x = number, fill = tumorsize), position = "dodge")

Related

ggplot2 sorting but not displaying correctly

I have made a data frame consisting of NFL stats. This data frame contains team name, how many yards they have allowed, if this was passing or rushing, and the total of the yards. I want to create a stacked bar chart with the x axis ordered in ascending order. I have used reorder() but for some reason the Panthers values are flipped and appear much higher than the rest (even though they have the lowest total). I have made other graphs this same way with no problems. I tried manually setting the y axis limit but that did not solve my problem. My question is why is this happening and what have I missed?
I am also open on ways to make this more efficient.
library(rvest)
library(data.table)
library(ggplot2)
#Read and organise data
def = read_html("https://www.pro-football-reference.com/years/2021/opp.htm")
defense = html_table(def)
defense = defense[[1]]
colnames(defense) = defense[1,]
defense = defense[c(-1,-34,-35,-36),]
names = c("rank", "team", "games", "points_against", "yards_conceded", "off_plays_faced",
"yards_per_off_play", "TO", "fumbles_recovered", "1stD_faced", "pass_cmp",
"pass_att", "pass_yd_conceded", "pass_td_conceded", "int_recovered",
"net_yd_pass_att", "pass_1stD_against", "rush_att", "rush_yards_conceded",
"rush_td_conceded", "rush_yards_per_att", "rush_1stD_conceded", "pen_against",
"pen_yards_conceded", "pen_1stD", "off_score_percent", "turnover_percent",
"expected_points_conceded")
colnames(defense) = names
# Get passing and rushing totals
passing_def = data.table(defense$team, defense$pass_yd_conceded, rep("pass", 32))
rushing_def = data.table(defense$team, defense$rush_yards_conceded, rep("rush", 32))
sums = rep(rowSums(cbind(as.numeric(passing_def$V2), as.numeric(rushing_def$V2))),2)
# join them and add the total yards
yds_conc = rbind(passing_def, rushing_def)
yds_conc = data.table(yds_conc, sums)
colnames(yds_conc) = c("team", "conceded", "type", "sums")
# Create the plot so that the teams are sorted from left to right by total yards given up
defplot = ggplot(data = yds_conc, aes(fill = type, x = reorder(team, sums),
y = conceded)) +
geom_bar(stat = 'identity') +
scale_fill_manual(name = "Play Type",labels=c("Passing","Rushing"),
values = c("#006aff", "#4fb350")) +
ggtitle('2021 Season Defense: Total Yards Conceded per Play Type') + ylab("Avg Rank") +
theme(axis.title.x = element_blank(),
axis.text.x = element_text(angle = 60, vjust = 1, hjust=1),
axis.ticks.y = element_blank(), axis.text.y = element_blank())
defplot```
add this
yds_conc$conceded=as.numeric(yds_conc$conceded)
then run the same script will work

change environment size on scale_colour_manual to assign colour to factors to use across multiples plots

I need to make 5 plots of bacteria species. Each plot has a different number of species present in a range of 30-90. I want each bacteria to always have the same color in all plots, therefore I need to set an assigned color to each name.
I tried to use scale_colour_manual to create a color set but, the environment created has only 16 colors. How can I increase the number of colors present in the environment created?
the code I am using can be replicated as follow:
colour_genus <- stringi::stri_rand_strings(90, 5) #to be random names
nb.cols = nrow(colour_genus) #to set the length of my string
MyPalette = colorRampPalette(brewer.pal(12,"Set1"))(nb.cols) # the palette of choice
colGenus <- scale_color_manual(name = colour_genus, values = MyPalette)
The output formed contains only 16 values, so when I try to apply it to a figure with 90 factors, it complains I have only 16 values
abundance <- runif(90, min = 10, max = 100)
my_data <- data.frame(colour_genus, abundance)
p <- ggplot(my_data, aes(x = colour_genus, y= abundance)) +
geom_bar(aes(color = colour_genus, fill = colour_genus), stat = "identity", position = "stack") +
labs(x = "", y = "Relative Abundance\n") +
theme(panel.background = element_blank())
p + theme(legend.text= element_text(size=7, face="bold"), axis.text.x = element_text(angle = 90)) + guides(fill=guide_legend(ncol=2)) + scale_fill_manual(values=colGenus)
The following error shows:
Error: Insufficient values in manual scale. 90 needed but only 16 provided.
Thank you very much for your help.
When you know all your 90 bacci names in front of plotting, you can try.
set.seed(123)
colour_genus <- sort(stringi::stri_rand_strings(90, 5))#to be random names. I sorted the vector to illustrate the output better (optional).
MyPalette <- sample(colors(), length(colour_genus))
# named vector for scale_fill
names(MyPalette) <- colour_genus
# data
abundance <- runif(90, min = 10, max = 100)
my_data <- data.frame(colour_genus, abundance)
# two sets to show results
set1 <- my_data[20:30,]
set2 <- my_data[25:35,]
ggplot(set1, aes(x = colour_genus, y= abundance)) +
geom_col(aes(fill = colour_genus)) +
scale_fill_manual(values = MyPalette)
ggplot(set2, aes(x = colour_genus, y= abundance)) +
geom_col(aes(fill = colour_genus)) +
scale_fill_manual(values = MyPalette)

dodge columns in ggplot2

I am trying to create a picture that summarises my data. Data is about prevalence of drug use obtained from different practices form different countries. Each practice has contributed with a different amount of data and I want to show all of this in my picture.
Here is a subset of the data to work on:
gr<-data.frame(matrix(0,36))
gr$drug<-c("a","a","a","a","a","a","a","a","a","a","a","a","a","a","a","a","a","a","b","b","b","b","b","b","b","b","b","b","b","b","b","b","b","b","b","b")
gr$practice<-c("a","b","c","d","e","f","g","h","i","j","k","l","m","n","o","p","q","r","a","b","c","d","e","f","g","h","i","j","k","l","m","n","o","p","q","r")
gr$country<-c("c1","c1","c1","c1","c1","c1","c1","c1","c1","c1","c2","c2","c2","c2","c2","c2","c3","c3","c1","c1","c1","c1","c1","c1","c1","c1","c1","c1","c2","c2","c2","c2","c2","c2","c3","c3")
gr$prevalence<-c(9.14,5.53,16.74,1.93,8.51,14.96,18.90,11.18,15.00,20.10,24.56,22.29,19.41,20.25,25.01,25.87,29.33,20.76,18.94,24.60,26.51,13.37,23.84,21.82,23.69,20.56,30.53,16.66,28.71,23.83,21.16,24.66,26.42,27.38,32.46,25.34)
gr$prop<-c(0.027,0.023,0.002,0.500,0.011,0.185,0.097,0.067,0.066,0.023,0.433,0.117,0.053,0.199,0.098,0.100,0.594,0.406,0.027,0.023,0.002,0.500,0.011,0.185,0.097,0.067,0.066,0.023,0.433,0.117,0.053,0.199,0.098,0.100,0.594,0.406)
gr$low.CI<-c(8.27,4.80,12.35,1.83,7.22,14.53,18.25,10.56,14.28,18.76,24.25,21.72,18.62,19.83,24.36,25.22,28.80,20.20,17.73,23.15,21.06,13.12,21.79,21.32,22.99,19.76,29.60,15.41,28.39,23.25,20.34,24.20,25.76,26.72,31.92,24.73)
gr$high.CI<-c(10.10,6.37,22.31,2.04,10.00,15.40,19.56,11.83,15.74,21.52,24.87,22.86,20.23,20.68,25.67,26.53,29.86,21.34,20.21,26.10,32.79,13.63,26.02,22.33,24.41,21.39,31.48,17.98,29.04,24.43,22.01,25.12,27.09,28.05,33.01,25.95)
The code I wrote is this
p<-ggplot(data=gr, aes(x=factor(drug), y=as.numeric(gr$prevalence), ymax=max(high.CI),position="dodge",fill=practice,width=prop))
colour<-c(rep("gray79",10),rep("gray60",6),rep("gray39",2))
p + theme_bw()+
geom_bar(stat="identity",position = position_dodge(0.9)) +
labs(x="Drug",y="Prevalence") +
geom_errorbar(ymax=gr$high.CI,ymin=gr$low.CI,position=position_dodge(0.9),width=0.25,size=0.25,colour="black",aes(x=factor(drug), y=as.numeric(gr$prevalence), fill=practice)) +
ggtitle("Drug usage by country and practice") +
scale_fill_manual(values = colour)+ guides(fill=F)
The figure I obtain is this one where bars are all on top of each other while I want them "dodge".
I also obtain the following warning:
ymax not defined: adjusting position using y instead
Warning message:
position_dodge requires non-overlapping x intervals
Ideally I would get each bar near one another, with their error bars in the middle of its bar, all organised by country.
Also should I be concerned about the warning (which I clearly do not fully understand)?
I hope this makes sense. I hope I am close enough, but I don't seem to be going anywhere, some help would be greatly appreciated.
Thank you
ggplot's geom_bar() accepts the width parameter, but doesn't line them up neatly against one another in dodged position by default. The following workaround references the solution here:
library(dplyr)
# calculate x-axis position for bars of varying width
gr <- gr %>%
group_by(drug) %>%
arrange(practice) %>%
mutate(pos = 0.5 * (cumsum(prop) + cumsum(c(0, prop[-length(prop)])))) %>%
ungroup()
x.labels <- gr$practice[gr$drug == "a"]
x.pos <- gr$pos[gr$drug == "a"]
ggplot(gr,
aes(x = pos, y = prevalence,
fill = country, width = prop,
ymin = low.CI, ymax = high.CI)) +
geom_col(col = "black") +
geom_errorbar(size = 0.25, colour = "black") +
facet_wrap(~drug) +
scale_fill_manual(values = c("c1" = "gray79",
"c2" = "gray60",
"c3" = "gray39"),
guide = F) +
scale_x_continuous(name = "Drug",
labels = x.labels,
breaks = x.pos) +
labs(title = "Drug usage by country and practice", y = "Prevalence") +
theme_classic()
There is a lot of information you are trying to convey here - to contrast drug A and drug B across countries using the barplots and accounting for proportions, you might use the facet_grid function. Try this:
colour<-c(rep("gray79",10),rep("gray60",6),rep("gray39",2))
gr$drug <- paste("Drug", gr$drug)
p<-ggplot(data=gr, aes(x=factor(practice), y=as.numeric(prevalence),
ymax=high.CI,ymin = low.CI,
position="dodge",fill=practice, width=prop))
p + theme_bw()+ facet_grid(drug~country, scales="free") +
geom_bar(stat="identity") +
labs(x="Practice",y="Prevalence") +
geom_errorbar(position=position_dodge(0.9), width=0.25,size=0.25,colour="black") +
ggtitle("Drug usage by country and practice") +
scale_fill_manual(values = colour)+ guides(fill=F)
The width is too small in the C1 country and as you indicated the one clinic is quite influential.
Also, you can specify your aesthetics with the ggplot(aes(...)) and not have to reset it and it is not needed to include the dataframe objects name in the aes function within the ggplot call.

Histogram of stacked boxes in ggplot2

The graph I'm currently trying to make falls a little between two stools. I want to make a histogram that is composed of stacked and labelled boxes. Here's an example of exactly the sort of thing I'm talking about, taken from a recent article in the New York Times:
http://farm8.staticflickr.com/7109/7026409819_1d2aaacd0a.jpg
Is it possible to achieve this using ggplot2?
To amplify the question somewhat, so far what I have is:
dfr <- data.frame(
name = LETTERS[1:26],
percent = rnorm(26, mean=15)
)
ggplot(dfr, aes(x=percent, fill=name)) + geom_bar() +
stat_bin(geom="text", aes(label=name))
...which I'm clearly doing all wrong. Ultimately what I'd ideally like is something along the lines of the manually-modified graph below, with (say) letters A to M filled one shade and N to Z filled another.
http://farm8.staticflickr.com/7116/7026536711_4df9a1aa12.jpg
Here you go!
set.seed(3421)
# added type to mimick which candidate is supported
dfr <- data.frame(
name = LETTERS[1:26],
percent = rnorm(26, mean=15),
type = sample(c("A", "B"), 26, replace = TRUE)
)
# easier to prepare data in advance. uses two ideas
# 1. calculate histogram bins (quite flexible)
# 2. calculate frequencies and label positions
dfr <- transform(dfr, perc_bin = cut(percent, 5))
dfr <- ddply(dfr, .(perc_bin), mutate,
freq = length(name), pos = cumsum(freq) - 0.5*freq)
# start plotting. key steps are
# 1. plot bars, filled by type and grouped by name
# 2. plot labels using name at position pos
# 3. get rid of grid, border, background, y axis text and lables
ggplot(dfr, aes(x = perc_bin)) +
geom_bar(aes(y = freq, group = name, fill = type), colour = 'gray',
show_guide = F) +
geom_text(aes(y = pos, label = name), colour = 'white') +
scale_fill_manual(values = c('red', 'orange')) +
theme_bw() + xlab("") + ylab("") +
opts(panel.grid.major = theme_blank(), panel.grid.minor = theme_blank(),
axis.ticks = theme_blank(), panel.border = theme_blank(),
axis.text.y = theme_blank())

Gap Y axis in ggplot

I have the below plot of ggplot with most Y values between 0-200, and one value ~3000:
I want to "zoom" on most of the values, but still show the high value
I wrote the following code:
Figure_2 <- ggplot(data = count_df, aes(x=count_df$`ng`,
y=count_df$`Number`)) +
geom_point(col = "darkmagenta") + ggtitle("start VS Number") +
xlab(expression(paste("start " , mu, "l"))) + ylab("Number") +
theme(plot.title = element_text(hjust = 0.5, color="orange", size=14,
face="bold.italic"),
axis.title.x = element_text(color="#993333", size=10, face = "bold"),
axis.title.y = element_text(color="#993333", size=10,face = "bold"))
Anybody knows how to achieve that?
A possible solution could be found by help of facet_grid. I do not have the exact data from OP but the approach should be to think of grouping y-axis in ranges. The OP has mentioned about two ranges as 0 - 200 and ~3000 for value of Number.
Hence, we have an option to divide Number by 2000 to transform it into factors representing 2 groups. That means factor(ceiling(Number/2000)) will create two factors.
Let's take similar data as OP and try our approach:
# Data
count_df <- data.frame(ng = 1:30, Number = sample(200:220, 30, TRUE))
# Change one value high as 3000
count_df$Number[20] <- 3000
library(ggplot2)
ggplot(data = count_df, aes(x=ng, y=Number)) +
geom_point() +
facet_grid(factor(ceiling(Number/2000))~., scales = "free_y") +
ggtitle("start VS Number") +
xlab(expression(paste("start " , mu, "l")))

Resources