ggplot2 geom_bar position failure - r

I am using the ..count.. transformation in geom_bar and get the warning
position_stack requires non-overlapping x intervals when some of my categories have few counts.
This is best explained using some mock data (my data involves direction and windspeed and I retain names relating to that)
#make data
set.seed(12345)
FF=rweibull(100,1.7,1)*20 #mock speeds
FF[FF>60]=59
dir=sample.int(10,size=100,replace=TRUE) # mock directions
#group into speed classes
FFcut=cut(FF,breaks=seq(0,60,by=20),ordered_result=TRUE,right=FALSE,drop=FALSE)
# stuff into data frame & plot
df=data.frame(dir=dir,grp=FFcut)
ggplot(data=df,aes(x=dir,y=(..count..)/sum(..count..),fill=grp)) + geom_bar()
This works fine, and the resulting plot shows the frequency of directions grouped according to speed. It is of relevance that the velocity class with the fewest counts (here "[40,60)") will have 5 counts.
However, using more velocity classes leads to a warning. For instance, with
FFcut=cut(FF,breaks=seq(0,60,by=15),ordered_result=TRUE,right=FALSE,drop=FALSE)
the velocity class with the fewest counts (now "[45,60)") will have only 3 counts and ggplot2 will warn that
position_stack requires non-overlapping x intervals
and the plot will show data in this category spread out along the x axis.
It seems that 5 is the minimum size for a group to have for this to work correctly.
I would appreciate knowing if this is a feature or a bug in stat_bin (which geom_bar is using) or if I am simply abusing geom_bar.
Also, any suggestions how to get around this would be appreciated.
Sincerely

This occurs because df$dir is numeric, so ggplot assumes a continuous x-axis, and the group aesthetic is derived from the only discrete variable in the plot (fill = grp).
As a result, the default bar width is worked out separately for each grp, and when grp = [45,60) contains only a few, widely spaced dir values, its bars come out wider than the others and overlap their neighbours. This becomes more visually obvious if we split the plot into different facets:
ggplot(data=df,
aes(x=dir,y=(..count..)/sum(..count..),
fill = grp)) +
geom_bar() +
facet_wrap(~ grp)
> for(l in levels(df$grp)) print(sort(unique(df$dir[df$grp == l])))
[1] 1 2 3 4 6 7 8 9 10
[1] 1 2 3 4 5 6 7 8 9 10
[1] 2 3 4 5 7 9 10
[1] 2 4 7
We can also check manually that the minimum difference between sorted df$dir values is 1 for the first three grp values, but 2 for the last one. Since the default bar width scales with that minimum difference, the bars for the last group are wider than the rest.
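The manual check mentioned above can be sketched in base R. This is only an illustration: the dir values are copied from the printed output, the group labels assume the breaks seq(0, 60, by = 15), and dirs_by_grp and min_gap are names made up for this example.

```r
# dir values observed in each speed class, copied from the printed output above
dirs_by_grp <- list(
  "[0,15)"  = c(1, 2, 3, 4, 6, 7, 8, 9, 10),
  "[15,30)" = 1:10,
  "[30,45)" = c(2, 3, 4, 5, 7, 9, 10),
  "[45,60)" = c(2, 4, 7)
)

# smallest gap between neighbouring dir values within each group
min_gap <- sapply(dirs_by_grp, function(v) min(diff(sort(unique(v)))))
min_gap
```

The last group has a minimum gap of 2 while the other three have 1, which is the difference in spacing described above.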
The following solutions should all achieve the same result:
1. Explicitly specify the same bar width for all groups in geom_bar():
ggplot(data=df,
aes(x=dir,y=(..count..)/sum(..count..),
fill = grp)) +
geom_bar(width = 0.9)
2. Convert dir to a categorical variable before passing it to aes(x = ...):
ggplot(data=df,
aes(x=factor(dir), y=(..count..)/sum(..count..),
fill = grp)) +
geom_bar()
3. Specify that the group parameter should be based on both df$dir & df$grp:
ggplot(data=df,
aes(x=dir,
y=(..count..)/sum(..count..),
group = interaction(dir, grp),
fill = grp)) +
geom_bar()

This doesn't directly solve the issue, because I also don't get what's going on with the overlapping values, but it's a dplyr-powered workaround, and might turn out to be more flexible anyway.
Instead of relying on geom_bar to take the cut factor and give you shares via ..count../sum(..count..), you can easily enough just calculate those shares yourself up front, and then plot your bars. I personally like having this type of control over my data and exactly what I'm plotting.
First, I put dir and FF into a data frame/tbl_df, and cut FF. Then count tallies the number of observations for each combination of dir and grp, and mutate calculates the share of each n over the sum of n. I'm using geom_col, which is like geom_bar but for when you already have a y value in your aes.
library(tidyverse)
set.seed(12345)
FF <- rweibull(100,1.7,1) * 20 #mock speeds
FF[FF > 60] <- 59
dir <- sample.int(10, size = 100, replace = TRUE) # mock directions
shares <- tibble(dir = dir, FF = FF) %>%
mutate(grp = cut(FF, breaks = seq(0, 60, by = 15), ordered_result = T, right = F, drop = F)) %>%
count(dir, grp) %>%
mutate(share = n / sum(n))
shares
#> # A tibble: 29 x 4
#> dir grp n share
#> <int> <ord> <int> <dbl>
#> 1 1 [0,15) 3 0.03
#> 2 1 [15,30) 2 0.02
#> 3 2 [0,15) 4 0.04
#> 4 2 [15,30) 3 0.03
#> 5 2 [30,45) 1 0.01
#> 6 2 [45,60) 1 0.01
#> 7 3 [0,15) 6 0.06
#> 8 3 [15,30) 1 0.01
#> 9 3 [30,45) 2 0.02
#> 10 4 [0,15) 6 0.06
#> # ... with 19 more rows
ggplot(shares, aes(x = dir, y = share, fill = grp)) +
geom_col()
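If you would rather not depend on the tidyverse, roughly the same shares can be computed in base R. This is a sketch under the same mock-data assumptions as above; shares_tab and shares_df are names I made up for the example. One difference from count(): table() also keeps the dir/grp combinations that have zero observations.

```r
set.seed(12345)
FF <- rweibull(100, 1.7, 1) * 20   # mock speeds
FF[FF > 60] <- 59
dir <- sample.int(10, size = 100, replace = TRUE)   # mock directions
grp <- cut(FF, breaks = seq(0, 60, by = 15), right = FALSE, ordered_result = TRUE)

# cross-tabulate and divide every cell by the grand total
shares_tab <- prop.table(table(dir = dir, grp = grp))

# long format with one row per dir/grp combination, ready for geom_col()
shares_df <- as.data.frame(shares_tab, responseName = "share")
head(shares_df)
```

Note that dir comes back as a factor here, which also gives you a discrete x-axis if you pass shares_df to ggplot.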


line graph with multiple variables on y axis stepwise

I need some help. Here is the data I want to plot. I want to keep Path.ID on the y axis and the numeric values of all the other columns added stepwise. This is a subset of a very large dataset, so I want the Path.ID labels attached to each line, and also the values of the other columns at each point if possible.
head(table)
Path.ID sc st rc rt
<chr> <dbl> <dbl> <dbl> <dbl>
1 map00230 1 12 5 52
2 map00940 1 20 10 43
3 map01130 NA 15 8 34
4 map00983 NA 14 5 28
5 map00730 NA 5 3 26
6 map00982 NA 16 2 24
somewhat like this
Thank you
Here is the pseudo code.
library(tidyr)
library(dplyr)
library(ggplot2)
# convert your table into a long format - sorry I am more used to this type of data
table_long <- table %>% gather(x_axis, value, sc:rt)
# Plot with ggplot2
ggplot() +
# draw line
geom_line(data=table_long, aes(x=x_axis, y=value, group=Path.ID, color=Path.ID)) +
# draw label at the last x_axis in this case is **rt**
geom_label(data=table_long %>% filter(x_axis=="rt"),
aes(x=x_axis, y=value, label=Path.ID, fill=Path.ID),
color="#FFFFFF")
Note that with this code if a Path.ID doesn't have the rt value then it will not have any label
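One thing to watch for with the long format: ggplot2 orders a character x variable alphabetically (rc, rt, sc, st), which is probably not the stepwise order wanted. A possible sketch, assuming the six rows shown in the question and using pivot_longer() in place of gather(), is below. The name path_table is my own, chosen to avoid masking base::table.

```r
library(tidyr)
library(ggplot2)

# the six example rows from the question
path_table <- data.frame(
  Path.ID = c("map00230", "map00940", "map01130", "map00983", "map00730", "map00982"),
  sc = c(1, 1, NA, NA, NA, NA),
  st = c(12, 20, 15, 14, 5, 16),
  rc = c(5, 10, 8, 5, 3, 2),
  rt = c(52, 43, 34, 28, 26, 24)
)

# long format: one row per Path.ID / column pair
table_long <- pivot_longer(path_table, cols = sc:rt,
                           names_to = "x_axis", values_to = "value")

# fix the x order to the original column order instead of alphabetical
table_long$x_axis <- factor(table_long$x_axis, levels = c("sc", "st", "rc", "rt"))

p <- ggplot(table_long, aes(x = x_axis, y = value, group = Path.ID, color = Path.ID)) +
  geom_line(na.rm = TRUE) +
  geom_point(na.rm = TRUE)
```

Printing p draws the lines with the x-axis running sc, st, rc, rt.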
p<-ggplot() +
# draw line
geom_line(data=table_long, aes(x=x_axis, y=value, group=Path.ID, color=Path.ID)) +
geom_text(data=table_long %>% filter(x_axis=="rt"),
aes(x=x_axis, y=value, label=Path.ID),
color= "#050505", size = 3, check_overlap = TRUE)
p +labs(title= "title",x = "x-label", y="y-label")
I had to use geom_text as I had a large dataset and it gave me a somewhat clearer graph.
Thank you @sinh, it helped a lot.

Why is geom_bar y-axis unproportional to actual numbers?

Sorry if this question already exists - was googling for a while now already and didn't find anything.
I am relatively new to R and learning while doing all of this.
I'm supposed to create some PDF via r markdown that analyses patient-data with specific main-diagnosis and secondary-diagnosis. For this I'm supposed to plot some numbers via ggplot (geom_bar and geom_boxplot).
So what I do so far is, I retrieve data-sets that include both codes via SQL and load them into data.table-objects afterwards. Afterwards I join them to get the data I need.
After this I add columns that contain sub-strings of those codes and others that contain the count of those sub-strings (so I can plot the occurrences of every code).
I now wanted, for example, to put a certain data.table into a geom_bar or geom_boxplot and make it visible. This actually works, but my y-axis has a weird scale that doesn't fit the numbers it should show. The proportions of the bars are also not accurate.
For example: one diagnosis appears 600 times and another one 1000 times. The y-axis shows steps of 0 - 500.000 - 1.000.000 - 1.500.000 - ....
The bar that shows 600 is super small and the bar with 1000 goes up to 1.500.000
If I create a new variable beforehand, count what I need via count(), and plot that, it just works. The columns I use for the y-axis have the same data type (integer) in both variables.
So here is just how I create the data.table that I use for plotting
exazerbationsHdComorbiditiesNd <- allExazerbationsHd[allComorbiditiesNd, on="encounter_num", nomatch=0]
exazerbationsHdComorbiditiesNd <- exazerbationsHdComorbiditiesNd[, c("i.DurationGroup", "i.DurationInDays", "i.start_date", "i.end_date", "i.duration", "i.patient_num"):=NULL]
exazerbationsHdComorbiditiesNd[ , IcdHdCodeCount := .N, by = concept_cd]
exazerbationsHdComorbiditiesNd[ , IcdHdCodeClassCount := .N, by = IcdHdClass]
If I want to bar-plot now for example IcdHdClass by IcdHdCodeClassCount I do following:
ggplot(exazerbationsHdComorbiditiesNd, aes(exazerbationsHdComorbiditiesNd$IcdHdClass, exazerbationsHdComorbiditiesNd$IcdHdCodeClassCount, label=exazerbationsHdComorbiditiesNd$IcdHdCodeClassCount)) + geom_bar(stat = "identity") + geom_text(vjust = 0, size = 5)
It outputs said bar-plot with weird proportions.
If I do first:
plotTest <- count(exazerbationsHdComorbiditiesNd, exazerbationsHdComorbiditiesNd$IcdHdClass)
And then bar-plot it:
ggplot(plotTest, aes(plotTest$`exazerbationsHdComorbiditiesNd$IcdHdClass`, plotTest$n, label=plotTest$n)) + geom_bar(stat = "identity") + geom_text(vjust = 0, size = 5)
It's all perfect and works.
I checked also data-types of the columns I needed:
sapply(exazerbationsHdComorbiditiesNd, class)
sapply(plotTest, class)
In both variables the columns I need are of the type character and integer
Edit:
Unfortunately I can't post images. So here are just the links to those.
Here is a screenshot of the plot with wrong y-axis:
https://ibb.co/CbxX1n7
And here is a screenshot of the plot shown right:
https://ibb.co/Xb8gyx1
Here is some example-data that I copied out the data.table object:
Exampledata
Since you added the class counts as an additional column--rather than aggregating--what’s happening is that for each row in your data, the class counts get stacked on top of each other:
library(tidyverse)
set.seed(42)
df <- tibble(class = sample(letters[1:3], 10, replace = TRUE)) %>%
add_count(class, name = "count")
df # this is essentially what your data looks like
#> # A tibble: 10 x 2
#> class count
#> <chr> <int>
#> 1 a 5
#> 2 a 5
#> 3 a 5
#> 4 a 5
#> 5 b 3
#> 6 b 3
#> 7 b 3
#> 8 a 5
#> 9 c 2
#> 10 c 2
ggplot(df, aes(class, count)) + geom_bar(stat = "identity")
You could use position = "identity" so that the bars don’t get stacked:
ggplot(df, aes(class, count)) +
geom_bar(stat = "identity", position = "identity")
However, that creates a whole bunch of unnecessary layers in your plot that you can’t see. A better approach would be to drop the extra rows from your data before plotting:
df %>%
distinct(class, count)
#> # A tibble: 3 x 2
#> class count
#> <chr> <int>
#> 1 a 5
#> 2 b 3
#> 3 c 2
df %>%
distinct(class, count) %>%
ggplot(aes(class, count)) +
geom_bar(stat = "identity")
Created on 2019-09-05 by the reprex package (v0.3.0.9000)
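To see where the inflated bar heights come from, the stacking can be reproduced with plain arithmetic. This is a base R sketch using the ten class values from the tibble above; count and stacked_height are illustrative names.

```r
# the ten class values from the example tibble above
class <- c("a", "a", "a", "a", "b", "b", "b", "a", "c", "c")

# per-row class count, i.e. the column added by add_count() or `:= .N`
count <- ave(seq_along(class), class, FUN = length)

# stat = "identity" stacks every row, so each bar's height is the sum of its rows
stacked_height <- tapply(count, class, sum)
stacked_height
```

Each bar ends up at the square of the class count (25, 9 and 4 instead of 5, 3 and 2), which is the same kind of blow-up seen in the question.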

ggplot2 alternatives to fill in barplots, occurrence of factor in multiple rows

I'm pretty new to R and I have a problem with plotting a barplot out of my data which looks like this:
condition answer
2 H
1 H
8 H
5 W
4 M
7 H
9 H
10 H
6 H
3 W
The data consists of 100 rows with the conditions 1 to 10, each randomly generated 10 times (10 times condition 1, 10 times condition 8,...). Each of the conditions also has an answer, which could be H for Hit, M for Miss or W for Wrong.
I want to plot the number of Hits for each condition in a barplot (for example 8 Hits out of 10 for condition 1,...). For that I tried the following in ggplot2:
ggplot(data=test, aes(x=test$condition, fill=answer=="H"))+
geom_bar()+labs(x="Conditions", y="Hitrate")+
coord_cartesian(xlim = c(1:10), ylim = c(0:10))+
scale_x_continuous(breaks=seq(1,10,1))
And it looked like this:
This is actually exactly what I need, except for the red color which covers everything. You can see that conditions 3 to 5 have no blue bar, because there are no hits for these conditions.
Is there any way to get rid of this red color and maybe count the number of hits for the different conditions? I tried the count function of dplyr, but it only showed me the number of H when there were some for that particular condition. Conditions 3-5 were just "ignored" by count; there wasn't even a 0 in the output, but I'd still need those numbers for the plot.
I'm sorry for this particularly long post, but I'm really at the end of my knowledge here. I'd be open to suggestions or alternatives! Thanks in advance!
This is a situation where a little preprocessing goes a long way. I made sample data that would recreate the issue, i.e. has cases where there won't be any "H"s.
Instead of relying on ggplot to aggregate data in the way you want it, use proper tools. Since you mention dplyr::count, I use dplyr functions.
The preprocessing task is to count observations with answer "H", including cases where the count is 0. To make sure all combinations are retained, convert condition to a factor and set .drop = F in count, which is in turn passed to group_by.
library(dplyr)
library(ggplot2)
set.seed(529)
test <- data.frame(condition = rep(1:10, times = 10),
answer = c(sample(c("H", "M", "W"), 50, replace = T),
sample(c("M", "W"), 50, replace = T)))
hit_counts <- test %>%
mutate(condition = as.factor(condition)) %>%
filter(answer == "H") %>%
count(condition, .drop = F)
hit_counts
#> # A tibble: 10 x 2
#> condition n
#> <fct> <int>
#> 1 1 0
#> 2 2 1
#> 3 3 4
#> 4 4 2
#> 5 5 3
#> 6 6 0
#> 7 7 3
#> 8 8 2
#> 9 9 1
#> 10 10 1
Then just plot that. geom_col is the version of geom_bar for where you have your y-values already, instead of having ggplot tally them up for you.
ggplot(hit_counts, aes(x = condition, y = n)) +
geom_col()
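The same zero-preserving tally can also be sketched in base R, in case you want to avoid the dplyr dependency. It relies on table() keeping every factor level, including those with no observations. The mock data is the same as above; cond and hits are names I made up for the example.

```r
set.seed(529)
test <- data.frame(condition = rep(1:10, times = 10),
                   answer = c(sample(c("H", "M", "W"), 50, replace = TRUE),
                              sample(c("M", "W"), 50, replace = TRUE)),
                   stringsAsFactors = FALSE)

# a factor remembers all ten conditions even after subsetting to the hits
cond <- factor(test$condition, levels = 1:10)
hits <- table(condition = cond[test$answer == "H"])

# one row per condition, with n = 0 where there were no hits
hit_counts <- as.data.frame(hits, responseName = "n")
hit_counts
```

The resulting hit_counts can be passed to the same geom_col() call.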
One option is to just filter out anything but where answer == "H" from your dataset, and then plot.
An alternative is to use a grouped bar plot, made by setting position = "dodge":
test <- data.frame(condition = rep(1:10, each = 10),
answer = sample(c('H', 'M', 'W'), 100, replace = T))
ggplot(data=test) +
geom_bar(aes(x = condition, fill = answer), position = "dodge") +
labs(x="Conditions", y="Hitrate") +
coord_cartesian(xlim = c(1:10), ylim = c(0:10)) +
scale_x_continuous(breaks=seq(1,10,1))
Also note that if the condition is actually a categorical variable, it may be better to make it a factor:
test$condition <- as.factor(test$condition)
This means that you don't need the scale_x_continuous call, and that the grid lines will be cleaner.
Another option is to pick your fill colors explicitly and make FALSE transparent by using scale_fill_manual. Since FALSE comes alphabetically first, the first value to specify is FALSE, the second TRUE.
ggplot(data=test, aes(x=condition, fill=answer=="H"))+
geom_bar()+labs(x="Conditions", y="Hitrate")+
coord_cartesian(xlim = c(1:10), ylim = c(0:10))+
scale_x_continuous(breaks=seq(1,10,1)) +
scale_fill_manual(values = c(alpha("red", 0), "cadetblue")) +
guides(fill = "none")

Plot In R with Multiple Lines Based On A Particular Variable?

I have this accelerometer dataset and, let's say that I have some n number of observations for each subject (30 subjects total) for body-acceleration x time.
I want to make a plot so that it plots these body acceleration x time points for each subject in a different color on the y axis and the x axis is just an index. I tried this:
ggplot(data = filtered_data_walk, aes(x = seq_along(filtered_data_walk$'body-acceleration-mean-y-time'), y = filtered_data_walk$'body-acceleration-mean-y-time')) +
geom_line(aes(color = filtered_data_walk$subject))
But the problem is that it doesn't superimpose the 30 lines; instead, they run alongside each other. In other words, I end up with n1 + n2 + n3 + ... + n30 x index points, instead of max{n1, n2, ..., n30}. This is my first time posting, so I hope this makes sense (I know my formatting is bad).
One solution I thought of was to create a new variable which gives a value of 1 to n for all the observations of each subject. So, for example, if I had 6 observations for subject1, 4 observations for subject2, and 9 observations for subject3, this new variable would be sequenced like:
1 2 3 4 5 6 1 2 3 4 1 2 3 4 5 6 7 8 9
Is there an easy way to do this? Please help, ty.
Assuming your data is formatted as a data.frame or matrix, for a toy dataset like
x <- data.frame(replicate(5, rnorm(10)))
x
# X1 X2 X3 X4 X5
# 1 -1.36452272 -1.46446475 2.0444381 0.001585876 -1.1085990
# 2 -1.41303046 -0.14690269 1.6179084 -0.310162018 -1.5528733
# 3 -0.15319554 -0.18779791 -0.3005058 0.351619212 1.6282955
# 4 -0.38712167 -0.14867239 -1.0776359 0.106694311 -0.7065382
# 5 -0.50711166 -0.95992916 1.3522922 1.437085757 -0.7921355
# 6 -0.82377208 0.50423328 -0.5366513 -1.315263679 1.0604499
# 7 -0.01462037 -1.15213287 0.9910678 0.372623508 1.9002438
# 8 1.49721113 -0.84914197 0.2422053 0.337141898 1.2405208
# 9 1.95914245 -1.43041783 0.2190829 -1.797396822 0.4970690
# 10 -1.75726827 -0.04123615 -0.1660454 -1.071688768 -0.3331887
...you might be able to get there with something like
plot(x[,1], type='l', xlim=c(1, nrow(x)), ylim=c(min(x), max(x)))
for(i in 2:ncol(x)) lines(x[,i], col=i)
You could play with formatting some more, of course, do things with lty= and lwd= and maybe a color ramp of your own choosing, etc.
If your data is in the format below...
x <- data.frame(id=c("A","A","A","B","B","B","B","C","C"), acc=rnorm(9))
x
# id acc
# 1 A 0.1796964
# 2 A 0.8770237
# 3 A -2.4413527
# 4 B 0.9379746
# 5 B -0.3416141
# 6 B -0.2921062
# 7 B 0.1440221
# 8 C -0.3248310
# 9 C -0.1058267
...you could get there with
maxn <- max(with(x, tapply(acc, id, length)))
ids <- sort(unique(x$id))
plot(x$acc[x$id==ids[1]], type='l', xlim=c(1,maxn), ylim=c(min(x$acc),max(x$acc)))
for(i in 2:length(ids)) lines(x$acc[x$id==ids[i]], col=i)
Hope this helps, and that I interpreted your problem right--
That's pretty quick to do if you are OK with using dplyr: use group_by to get a separate counter for each subject and mutate to add the actual counter, and then your ggplot should work. Example with the iris dataset:
library(dplyr)
library(ggplot2)
group_by(iris, Species) %>%
mutate(index = seq_along(Petal.Length)) %>%
ggplot() + geom_line(aes(x=index, y=Petal.Length, color=Species))
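The per-subject counter described in the question can also be built in base R with ave(). A minimal sketch, assuming a hypothetical subject vector with 6, 4 and 9 observations as in the question's example:

```r
# hypothetical subject labels: 6, 4 and 9 observations
subject <- rep(c("subject1", "subject2", "subject3"), times = c(6, 4, 9))

# restart the counter at 1 within each subject
index <- ave(seq_along(subject), subject, FUN = seq_along)
index
```

This gives the 1 2 3 4 5 6 1 2 3 4 1 2 ... 9 sequence from the question, and the new column can then be mapped to x in aes().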

how to put percentage label in ggplot when geom_text is not suitable?

Here is my simplified data :
company <-c(rep(c(rep("company1",4),rep("company2",4),rep("company3",4)),3))
product<-c(rep(c(rep(c("product1","product2","product3","product4"),3)),3))
week<-c( c(rep("w1",12),rep("w2",12),rep("w3",12)))
mydata<-data.frame(company=company,product=product,week=week)
mydata$rank<-c(rep(c(1,3,2,3,2,1,3,2,3,2,1,1),3))
mydata=mydata[mydata$company=="company1",]
And, R code I used :
ggplot(mydata,aes(x = week,fill = as.factor(rank))) +
geom_bar(position = "fill")+
scale_y_continuous(labels = percent_format())
In the bar plot, I want to label the percentage by week, by rank.
The problem is the fact that the data doesn't have percentage of rank. And the structure of this data is not suitable to having one.
(of course, the original data has much more observations than the example)
Can anyone show me how to label the percentages in this graph?
I'm not sure I understand why geom_text is not suitable. Here is an answer using it, but if you specify why is it not suitable, perhaps someone might come up with an answer you are looking for.
library(ggplot2)
library(plyr)
library(reshape2) # provides melt()
mydata = mydata[,c(3,4)] #drop unnecessary variables
data.m = melt(table(mydata)) #get counts and melt it
#calculate percentage:
m1 = ddply(data.m, .(week), summarize, ratio=value/sum(value))
#order data frame (needed to comply with percentage column):
m2 = data.m[order(data.m$week),]
#combine them:
mydf = data.frame(m2,ratio=m1$ratio)
Which gives us the following data structure. The ratio column contains the relative frequency of given rank within specified week (so one can see that rank == 3 is twice as abundant as the other two).
> mydf
week rank value ratio
1 w1 1 1 0.25
4 w1 2 1 0.25
7 w1 3 2 0.50
2 w2 1 1 0.25
5 w2 2 1 0.25
8 w2 3 2 0.50
3 w3 1 1 0.25
6 w3 2 1 0.25
9 w3 3 2 0.50
Next, we have to calculate the position of the percentage labels and plot it.
#get positions of percentage labels:
mydf = ddply(mydf, .(week), transform, position = cumsum(value) - 0.5*value)
#make plot
p =
ggplot(mydf,aes(x = week, y = value, fill = as.factor(rank))) +
geom_bar(stat = "identity")
#add percentage labels using positions defined previously
p + geom_text(aes(label = sprintf("%1.2f%%", 100*ratio), y = position))
Is this what you wanted?
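As a possible alternative, ggplot2 can place the labels itself via position_fill(vjust = 0.5), so the cumsum step is not needed. This is a sketch rather than the method used above: it assumes the company1 subset from the question (ranks 1, 3, 2, 3 in each week), and counts, ratio and label are names introduced for the example.

```r
library(ggplot2)

# company1 rows from the question: ranks 1, 3, 2, 3 in every week
mydata <- data.frame(
  week = rep(c("w1", "w2", "w3"), each = 4),
  rank = rep(c(1, 3, 2, 3), times = 3)
)

# counts per week/rank, then the share of each rank within its week
counts <- as.data.frame(table(week = mydata$week, rank = mydata$rank),
                        responseName = "n")
counts$ratio <- counts$n / ave(counts$n, counts$week, FUN = sum)
counts$label <- sprintf("%1.0f%%", 100 * counts$ratio)

# position_fill(vjust = 0.5) centres each label inside its bar segment
p <- ggplot(counts, aes(x = week, y = ratio, fill = rank)) +
  geom_col(position = "fill") +
  geom_text(aes(label = label), position = position_fill(vjust = 0.5)) +
  scale_y_continuous(labels = scales::percent)
```

Because the text layer uses the same fill grouping and the same position adjustment as the bars, the labels follow whatever stacking order ggplot2 uses.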
