geom_histogram does not show all values in x axis - r

Trying to run a simple and quick analysis of some variables. I run this code:
ggplot(data, aes(var1)) +
geom_bar()
Resulting in a Histogram however in spite of having only 6 possible values in var1, x Axis only shows 2,4,6. Is it possible to easily include all 6 possible values as labels?

You want to have frequency bar plot for six individual numbers. However, you wish to see all of these numbers on the X axis, which makes me think that you actually treat them as categorical data rather then numeric data, so you actually would prefer a categorical X axis which shows all the data. Turning the x into a factor should do the trick:
data <- data.frame(var1=floor(6*runif(200) + 1))
ggplot(data, aes(factor(var1))) + geom_bar()
Below: left - without factor, right - with factor.

What does your data look like?
Assuming you have a numeric x, adding scale_x_continuous(breaks = seq(1,6, by = 1))should work.
Of course this would only work if the x values go from 1 to 6... Otherwise you can replace the seq call with a vector that contains the values you want.

Related

How to create a bar chart in R using with both X and Y axis defined? [duplicate]

I have a dataframe:
>picard
count reads
1 20681318
2 3206677
3 674351
4 319173
5 139411
6 117706
How do I plot log10(count) vs log10(reads) on a ggplot (barplot)?
I tried:
ggplot(picard) + geom_bar(aes(x=log10(count),y=log10(reads)))
But it is not accepting y=log10(reads). How do I plot my y values?
You can do something like this, but plotting the x axis, which is not continuous, with a log10 scale doesn't make sense for me :
ggplot(picard) +
geom_bar(aes(x=count,y=reads),stat="identity") +
scale_y_log10() +
scale_x_log10()
If you only want an y axis with a log10 scale, just do :
ggplot(picard) +
geom_bar(aes(x=count,y=reads),stat="identity") +
scale_y_log10()
Use stat="identity":
ggplot(picard) + geom_bar(aes(x=log10(count),y=log10(reads)), stat="identity")
You will actually get a warning with your approach:
Mapping a variable to y and also using stat="bin".
With stat="bin", it will attempt to set the y value to the count of cases in each group.
This can result in unexpected behavior and will not be allowed in a future version of ggplot2.
If you want y to represent counts of cases, use stat="bin" and don't map a variable to y.
If you want y to represent values in the data, use stat="identity".
See ?geom_bar for examples. (Deprecated; last used in version 0.9.2)
There's a direct way to do this, i.e. by using the geom_col() function. Just make a tiny adjustment to your code:
ggplot(picard) + geom_col(aes(x=log10(count), y=log10(reads)))
and it will give the same output as setting the stat argument to identity with geom_bar(). The thing is, geom_bar() uses count as default for stat, hence it will not take any variable for the y-axis. It will simply use the count, i.e, the number of occurrences of each value of the x-axis, for it's y-axis. I hope this answers your question.

Histogram: Combine continuous and discrete values in ggplot2

I have a set of times that I would like to plot on a histogram.
Toy example:
df <- data.frame(time = c(1,2,2,3,4,5,5,5,6,7,7,7,9,9, ">10"))
The problem is that one value is ">10" and refers to the number of times that more than 10 seconds were observed. The other time points are all numbers referring to the actual time. Now, I would like to create a histogram that treats all numbers as numeric and combines them in bins when appropriate, while plotting the counts of the ">10" at the side of the distribution, but not in a separate plot. I have tried to call geom_histogram twice, once with the continuous data and once with the discrete data in a separate column but that gives me the following error:
Error: Discrete value supplied to continuous scale
Happy to hear suggestions!
Here's a kind of involved solution, but I believe it best answers your question, which is that you are desiring to place next to typical histogram plot a bar representing the ">10" values (or the values which are non-numeric). Critically, you want to ensure that you maintain the "binning" associated with a histogram plot, which means you are not looking to simply make your scale a discrete scale and represent a histogram with a typical barplot.
The Data
Since you want to retain histogram features, I'm going to use an example dataset that is a bit more involved than that you gave us. I'm just going to specify a uniform distribution (n=100) with 20 ">10" values thrown in there.
set.seed(123)
df<- data.frame(time=c(runif(100,0,10), rep(">10",20)))
As prepared, df$time is a character vector, but for a histogram, we need that to be numeric. We're simply going to force it to be numeric and accept that the ">10" values are going to be coerced to be NAs. This is fine, since in the end we're just going to count up those NA values and represent them with a bar. While I'm at it, I'm creating a subset of df that will be used for creating the bar representing our NAs (">10") using the count() function, which returns a dataframe consisting of one row and column: df$n = 20 in this case.
library(dplyr)
df$time <- as.numeric(df$time) #force numeric and get NA for everything else
df_na <- count(subset(df, is.na(time)))
The Plot(s)
For the actual plot, you are asking to create a combination of (1) a histogram, and (2) a barplot. These are not the same plot, but more importantly, they cannot share the same axis, since by definition, the histogram needs a continuous axis and "NA" values or ">10" is not a numeric/continuous value. The solution here is to make two separate plots, then combine them with a bit of magic thanks to cowplot.
The histogram is created quite easily. I'm saving the number of bins for demonstration purposes later. Here's the basic plot:
bin_num <- 12 # using this later
p1 <- ggplot(df, aes(x=time)) + theme_classic() +
geom_histogram(color='gray25', fill='blue', alpha=0.3, bins=bin_num)
Thanks to the subsetting previously, the barplot for the NA values is easy too:
p2 <- ggplot(df_na, aes(x=">10", y=n)) + theme_classic() +
geom_col(color='gray25', fill='red', alpha=0.3)
Yikes! That looks horrible, but have patience.
Stitching them together
You can simply run plot_grid(p1, p2) and you get something workable... but it leaves quite a lot to be desired:
There are problems here. I'll enumerate them, then show you the final code for how I address them:
Need to remove some elements from the NA barplot. Namely, the y axis entirely and the title for x axis (but it can't be NULL or the x axes won't line up properly). These are theme() elements that are easily removed via ggplot.
The NA barplot is taking up WAY too much room. Need to cut the width down. We address this by accessing the rel_widths= argument of plot_grid(). Easy peasy.
How do we know how to set the y scale upper limit? This is a bit more involved, since it will depend on the ..count.. stat for p1 as well as the numer of NA values. You can access the maximum count for a histogram using ggplot_build(), which is a part of ggplot2.
So, the final code requires the creation of the basic p1 and p2 plots, then adds to them in order to fix the limits. I'm also adding an annotation for number of bins to p1 so that we can track how well the upper limit setting works. Here's the code and some example plots where bin_num is set at 12 and 5, respectively:
# basic plots
p1 <- ggplot(df, aes(x=time)) + theme_classic() +
geom_histogram(color='gray25', fill='blue', alpha=0.3, bins=bin_num)
p2 <- ggplot(df_na, aes(x=">10", y=n)) + theme_classic() +
geom_col(color='gray25', fill='red', alpha=0.3) +
labs(x="") + theme(axis.line.y=element_blank(), axis.text.y=element_blank(),
axis.title.y=element_blank(), axis.ticks.y=element_blank()
) +
scale_x_discrete(expand=expansion(add=1))
#set upper y scale limit
max_count <- max(c(max(ggplot_build(p1)$data[[1]]$count), df_na$n))
# fix limits for plots
p1 <- p1 + scale_y_continuous(limits=c(0,max_count), expand=expansion(mult=c(0,0.15))) +
annotate('text', x=0, y=max_count, label=paste('Bins:', bin_num)) # for demo purposes
p2 <- p2 + scale_y_continuous(limits=c(0,max_count), expand=expansion(mult=c(0,0.15)))
plot_grid(p1, p2, rel_widths=c(1,0.2))
So, our upper limit fixing works. You can get really crazy playing around with positioning, etc and the plot_grid() function, but I think it works pretty well this way.
Perhaps, this is what you are looking for:
df1 <- data.frame(x=sample(1:12,50,rep=T))
df2 <- df1 %>% group_by(x) %>%
dplyr::summarise(y=n()) %>% subset(x<11)
df3 <- subset(df1, x>10) %>% dplyr::summarise(y=n()) %>% mutate(x=11)
df <- rbind(df2,df3 )
label <- ifelse((df$x<11),as.character(df$x),">10")
p <- ggplot(df, aes(x=x,y=y,color=x,fill=x)) +
geom_bar(stat="identity", position = "dodge") +
scale_x_continuous(breaks=df$x,labels=label)
p
and you get the following output:
Please note that sometimes you could have some of the bars missing depending on the sample.

Plot a 'top 10' style list/ranking in R based on numerical column of dataframe

I have an R dataframe that contains a string variable and a numerical variable, and I would like to plot the top 10 strings, based on the value of the numerical variable.
I can of course get the top 10 entries pretty simply:
top10_rank <- rank[order(rank$numerical_var_name),]
My first approach to trying to visualize this was to simple attempt to plot this like:
ggplot(data=top10_rank, aes(x = top10_rank$numerical_var_name, y = top10_rank$string_name)) + geom_point(size=3)
And to a first approximation this "works" - the problem is that the strings on the y axis are sorted alphabetically rather than by the numerical value.
My preference would be to find a way to plot the top 10 strings without having to bother showing the numerical variable at all - just basically as a list (even better would be if I could enumerate the list). I am attempting to plot this so it looks more pleasing than simply dumping the text to the screen.
Any ideas greatly appreciated!
The y-axis tick marks may be sorted alphabetically, but the points are drawn in order(from left to right) of the top10_rank dataframe. What you need to do is change the order of the y-axis. Add this to your call of ggplot + scale_y_discrete(limits=top10_rank$String) and it should work.
ggplot(data=top10_rank, aes(x = top10_rank$Number,
y = top10_rank$String)) + geom_point(size=3) + scale_y_discrete(limits=top10_rank$String)
Here is a link to a great resource on R graphics: R Graphics Cookbook

Plotting percent change for a large number of factors on same figure using ggplot by faceting or color-coding factors

Here is an example of the code I'm working with
x<-as.factor(rep(c("tree_mean","tree_qmean","tree_skew"),3))
factor<-c(rep("mfn2_burned_99",3),rep("mfna_burned_5_7",3),rep("mfna_burned_5_7_10_12",3)))
y<-c(0.336457409,-0.347422910,-0.318945621,1.494109367, 0.003578698,-0.019985780,-0.484171146, 0.611589217,-0.322292664)
dat<-as.data.frame(cbind(x,factor,y))
head(dat)
x factor y
tree_mean mfn2_burned_99 -0.3364574
tree_qmean mfn2_burned_99 -0.3474229
tree_skew mfn2_burned_99 -0.3189456
tree_mean mfna_burned_5_7 -0.8269814
tree_qmean mfna_burned_5_7 -0.8088810
tree_skew mfna_burned_5_7 -2.5429226
tree_mean mfna_burned_5_7_10_12 -0.8601206
tree_qmean mfna_burned_5_7_10_12 -0.8474920
tree_skew mfna_burned_5_7_10_12 -2.9854178
I am trying to plot how much x deviates from 0, and facet it by each factor, as so:
ggplot(dat) +
geom_point(aes(x=x,y=y),shape=1,size=3)+
geom_linerange(aes(x=x,ymin=0,ymax=y))+
geom_hline(yintercept=0)+
facet_grid(factor~.)
This works fine when I have three factors (ignore the *: I had a significance column which I have since removed.
Example below:
However, I have 8 factors in total, and faceting obscures the plot such that the distance from zero for each x value gets very distorted.
Example below
So, my question is this: what would be a better way of coding/rendering this plot given my large number of x values and factors using faceting or color coding by factor in ggplot??
I would be very open to color-coding each distance for x by factor rather than faceting, but I have been beating my head against the wall trying to figure out how to even do that in ggplot (very new to ggplot), so I can't yet say if it would make the figure much more interpretable.
One option as you note is to color your point and/or linerange by a factor. You can then use position_dodge to move the points slightly on the x axis.
For example:
ggplot(dat, aes(color = factor)) +
geom_point(aes(x=x,y=y),shape=1,size=3, position = position_dodge(width = 0.5)+
geom_linerange(aes(x=x,ymin=0,ymax=y), position = position_dodge(width =0.5))+
geom_hline(yintercept=0)
I think this would still be difficult with many factors, but with 8 it might suit your purposes.

Why does a boxplot in ggplot requires axis x and y?

I have a variable ceroonce which is number of schools per county (integers) in 2011. When I plot it with boxplot() it only requires the ceroonce variable. A boxplot is then retrieved in which the y axis is the number of schools and the x axis is... the "factor" ceroonce. But in ggplot, when using geom_boxplot, it requires me to input both x and y axis, but I just want a boxplot of ceroonce. I have tried inputing ceroonce as both the x and y axis. But then a weird boxplot is retrieved in which the y axis is the number of schools but the x axis (which should be the factor variable) is also the number of schools? I am assuming this is very basic statistics, but I am just confused. I am attaching the images hoping this will clarify my question.
This is the code I am using:
ggplot(escuelas, aes(x=ceroonce, y=ceroonce))+geom_boxplot()
boxplot(escuelas$ceroonce)
ggplot(escuelas, aes(x="ceroonce", y=ceroonce))+geom_boxplot()
ggplot will interpret the character string "ceroonce" as a vector with the same length as the ceroonce column and it will give the result you're looking for.
There are no fancy statistics happening here. boxplot is simply assuming that since you've given it a single vector, that you want a single box in your boxplot. ggplot and geom_histogram simply don't make that assumption.
If you want a bit less typing, you can do this:
qplot(y=escuelas$ceroonce, x= 1, geom = "boxplot")
ggplot2 will automatically create a vector of 1s equal in length to the length of escuelas$ceroonce
This could work for you:
ggplot(escuelas, aes(x= "", y=ceroncee)) + geom_boxplot()

Resources