I have a data frame df with a large number of discrete values of metric and their counts, cnt.
I want a bar plot of the count for each discrete value of metric.
So I do the following:
df <- read.csv("metric.csv", header=T)
df$metric <- as.factor(df$metric)
ggplot(df, aes(x=metric, y=cnt)) +
geom_bar(stat = 'identity')
With the above I get an empty plot. Why?
The data I used for the data frame df is here - http://wikisend.com/download/569376/metric.csv
How do I get a bar plot out of this data?
I'm not immediately aware of any limitations of geom_bar, but it is unsurprising that this doesn't work very well -- I interrupted it on my machine, so I don't even know what it looks like when it finishes rendering.
Are you sure that a bar plot is appropriate for this data? Which is to say, is the "metric" column effectively a factor?
Running a scatter plot completes rapidly with results that might be more useful (here using a log scale because a linear scale is hurt by the outlier)
ggplot(df, aes(x=metric, y=cnt)) +
geom_point() +
scale_y_log10()
yields
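If metric is numeric with thousands of distinct values, converting it to a factor gives one bar per level, which may be what makes the bar plot slow and unreadable. Below is a minimal sketch of one alternative: bin metric and sum cnt per bin. The data frame is a made-up stand-in (the real metric.csv is not reproduced here), and the choice of 50 bins is arbitrary.

```r
library(ggplot2)

# Hypothetical stand-in for metric.csv: many distinct metric values with counts
set.seed(1)
df <- data.frame(metric = 1:5000, cnt = rpois(5000, lambda = 20) + 1)

# Number of distinct values that would each become a factor level (one bar each)
n_levels <- length(unique(df$metric))

# Cut the numeric metric into 50 equal-width intervals and sum the counts per interval
df$bin <- cut(df$metric, breaks = 50)
binned <- aggregate(cnt ~ bin, data = df, FUN = sum)

# 50 bars instead of 5000; the interval labels are hidden to keep the axis readable
p <- ggplot(binned, aes(x = bin, y = cnt)) +
  geom_col() +
  theme(axis.text.x = element_blank())
```

Printing p draws the plot. Whether binning is acceptable depends on whether metric really is a continuous quantity rather than a set of categories.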
I have a set of times that I would like to plot on a histogram.
Toy example:
df <- data.frame(time = c(1,2,2,3,4,5,5,5,6,7,7,7,9,9, ">10"))
The problem is that one value is ">10" and refers to the number of times that more than 10 seconds were observed. The other time points are all numbers referring to the actual time. Now, I would like to create a histogram that treats all numbers as numeric and combines them in bins when appropriate, while plotting the counts of the ">10" at the side of the distribution, but not in a separate plot. I have tried to call geom_histogram twice, once with the continuous data and once with the discrete data in a separate column but that gives me the following error:
Error: Discrete value supplied to continuous scale
Happy to hear suggestions!
Here's a fairly involved solution, but I believe it best answers your question: you want to place, next to a typical histogram, a bar representing the ">10" values (or any values which are non-numeric). Critically, you want to keep the binning associated with a histogram, which means you are not looking to simply make your scale discrete and represent the histogram with a typical bar plot.
The Data
Since you want to retain histogram features, I'm going to use an example dataset that is a bit more involved than that you gave us. I'm just going to specify a uniform distribution (n=100) with 20 ">10" values thrown in there.
set.seed(123)
df<- data.frame(time=c(runif(100,0,10), rep(">10",20)))
As prepared, df$time is a character vector, but for a histogram, we need that to be numeric. We're simply going to force it to be numeric and accept that the ">10" values are going to be coerced to NA (R will emit a coercion warning, which is expected here). This is fine, since in the end we're just going to count up those NA values and represent them with a bar. While I'm at it, I'm creating a subset of df that will be used for creating the bar representing our NAs (">10") using the count() function, which returns a data frame consisting of one row and one column: df_na$n = 20 in this case.
library(dplyr)
library(ggplot2)
library(cowplot) # provides plot_grid(), used below
df$time <- as.numeric(df$time) #force numeric and get NA for everything else
df_na <- count(subset(df, is.na(time)))
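As a quick sanity check on the coercion step, here is a small base-R sketch. It rebuilds the same example vector under different (illustrative) variable names and confirms that only the ">10" entries turn into NA.

```r
# Same example values as above, kept in standalone vectors (names are illustrative)
set.seed(123)
time_raw <- c(runif(100, 0, 10), rep(">10", 20))

# Coerce to numeric; the ">10" strings become NA (the coercion warning is silenced here)
time_num <- suppressWarnings(as.numeric(time_raw))

# Count the overflow entries and the usable numeric values
n_over  <- sum(is.na(time_num))
n_valid <- sum(!is.na(time_num))
```

With this data, n_over matches the single value that count() puts in df_na$n.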
The Plot(s)
For the actual plot, you are asking to create a combination of (1) a histogram, and (2) a barplot. These are not the same plot, but more importantly, they cannot share the same axis, since by definition, the histogram needs a continuous axis and "NA" values or ">10" is not a numeric/continuous value. The solution here is to make two separate plots, then combine them with a bit of magic thanks to cowplot.
The histogram is created quite easily. I'm saving the number of bins for demonstration purposes later. Here's the basic plot:
bin_num <- 12 # using this later
p1 <- ggplot(df, aes(x=time)) + theme_classic() +
geom_histogram(color='gray25', fill='blue', alpha=0.3, bins=bin_num)
Thanks to the subsetting previously, the barplot for the NA values is easy too:
p2 <- ggplot(df_na, aes(x=">10", y=n)) + theme_classic() +
geom_col(color='gray25', fill='red', alpha=0.3)
Yikes! That looks horrible, but have patience.
Stitching them together
You can simply run plot_grid(p1, p2) and you get something workable... but it leaves quite a lot to be desired:
There are problems here. I'll enumerate them, then show you the final code for how I address them:
1. Need to remove some elements from the NA barplot. Namely, the y axis entirely and the title for the x axis (but it can't be NULL or the x axes won't line up properly). These are theme() elements that are easily removed via ggplot.
2. The NA barplot is taking up WAY too much room. Need to cut the width down. We address this by accessing the rel_widths= argument of plot_grid(). Easy peasy.
3. How do we know how to set the y scale upper limit? This is a bit more involved, since it will depend on the ..count.. stat for p1 as well as the number of NA values. You can access the maximum count for a histogram using ggplot_build(), which is a part of ggplot2.
So, the final code requires the creation of the basic p1 and p2 plots, then adds to them in order to fix the limits. I'm also adding an annotation for number of bins to p1 so that we can track how well the upper limit setting works. Here's the code and some example plots where bin_num is set at 12 and 5, respectively:
# basic plots
p1 <- ggplot(df, aes(x=time)) + theme_classic() +
geom_histogram(color='gray25', fill='blue', alpha=0.3, bins=bin_num)
p2 <- ggplot(df_na, aes(x=">10", y=n)) + theme_classic() +
geom_col(color='gray25', fill='red', alpha=0.3) +
labs(x="") + theme(axis.line.y=element_blank(), axis.text.y=element_blank(),
axis.title.y=element_blank(), axis.ticks.y=element_blank()
) +
scale_x_discrete(expand=expansion(add=1))
#set upper y scale limit
max_count <- max(c(max(ggplot_build(p1)$data[[1]]$count), df_na$n))
# fix limits for plots
p1 <- p1 + scale_y_continuous(limits=c(0,max_count), expand=expansion(mult=c(0,0.15))) +
annotate('text', x=0, y=max_count, label=paste('Bins:', bin_num)) # for demo purposes
p2 <- p2 + scale_y_continuous(limits=c(0,max_count), expand=expansion(mult=c(0,0.15)))
plot_grid(p1, p2, rel_widths=c(1,0.2))
So, our upper limit fixing works. You can get really crazy playing around with positioning, etc and the plot_grid() function, but I think it works pretty well this way.
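If giving up the continuous axis is acceptable, a simpler single-plot variant is possible. This is a different technique from the cowplot approach above, not a replacement for it: the sketch below bins the numeric values by hand with cut(), adds ">10" as one extra category, and draws everything with geom_col(). The 1-second bin width and all variable names are assumptions for illustration.

```r
library(ggplot2)

set.seed(123)
time_raw <- c(runif(100, 0, 10), rep(">10", 20))
time_num <- suppressWarnings(as.numeric(time_raw))

# Bin the numeric values into 1-second intervals; the ">10" entries are NA at this point
bins <- cut(time_num, breaks = 0:10, include.lowest = TRUE)

# Give the NA entries their own category, placed after the last interval
bin_chr <- ifelse(is.na(bins), ">10", as.character(bins))
binned  <- factor(bin_chr, levels = c(levels(bins), ">10"))

# One row per category with its count (columns: bin, Freq)
counts <- as.data.frame(table(bin = binned))

# width = 1 makes the bars touch, mimicking a histogram; the overflow bar gets its own fill
p <- ggplot(counts, aes(x = bin, y = Freq, fill = bin == ">10")) +
  geom_col(width = 1, color = "gray25", show.legend = FALSE) +
  theme_classic()
```

The trade-off is that the x axis is now discrete, so the bin edges are fixed by cut() rather than chosen by geom_histogram().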
Perhaps, this is what you are looking for:
df1 <- data.frame(x=sample(1:12,50,replace=TRUE))
df2 <- df1 %>% group_by(x) %>%
dplyr::summarise(y=n()) %>% subset(x<11)
df3 <- subset(df1, x>10) %>% dplyr::summarise(y=n()) %>% mutate(x=11)
df <- rbind(df2,df3 )
label <- ifelse((df$x<11),as.character(df$x),">10")
p <- ggplot(df, aes(x=x,y=y,color=x,fill=x)) +
geom_bar(stat="identity", position = "dodge") +
scale_x_continuous(breaks=df$x,labels=label)
p
and you get the following output:
Please note that sometimes you could have some of the bars missing depending on the sample.
So I have 10.000 values in a vector from a Monte Carlo simulation. I want to plot this data as a histogram and a density plot. Doing this with the hist() function is easy, and it will calculate the frequency of the different values automatically. My ambition, however, is to do this in ggplot.
My biggest problem right now is how to transform the data so ggplot can handle it. I would like my x-axis to show the "price" while the y-axis shows the frequency or density. My data has a lot of decimals, as shown in the example data below.
myData <- c(266.8997, 271.5137, 225.4786, 223.3533, 258.1245, 199.5601, 234.2341, 231.7850, 260.2091, 184.5102, 272.8287, 203.7482, 212.5140, 220.9094, 221.2627, 236.3224)
My current code using the hist()-function, and the plot is shown below.
hist(myData,
xlab ="Price",
prob=TRUE)
lines(density(myData))
Histogram for the data vector containing 10000 values
How would you sort the data, and how would you do this with ggplot? Should I round the numbers as well?
Hard to say exactly without seeing a sample of your data, but have you tried:
ggplot(data.frame(Price = myData), aes(Price)) + geom_histogram()
or:
ggplot(data.frame(Price = myData), aes(Price)) + geom_density()
Just try this:
ggplot() +
geom_bar(aes(myData)) +
geom_density(aes(myData))
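To mirror hist(..., prob=TRUE) plus lines(density(...)), the histogram and the density curve need to share a y scale. Here is a minimal sketch, assuming ggplot2 3.3.0 or later (for after_stat()). The data frame name price_df and the bin count of 10 are arbitrary choices. No sorting or rounding is needed, since the binning is done by geom_histogram().

```r
library(ggplot2)

myData <- c(266.8997, 271.5137, 225.4786, 223.3533, 258.1245, 199.5601,
            234.2341, 231.7850, 260.2091, 184.5102, 272.8287, 203.7482,
            212.5140, 220.9094, 221.2627, 236.3224)

# ggplot() expects a data frame, so wrap the vector in one
price_df <- data.frame(Price = myData)

# Histogram on the density scale, with a kernel density curve drawn on top
p <- ggplot(price_df, aes(x = Price)) +
  geom_histogram(aes(y = after_stat(density)), bins = 10,
                 fill = "grey80", color = "grey25") +
  geom_density() +
  labs(x = "Price", y = "Density")
```

Printing p draws the plot. With the full 10,000 values, a larger bins value would probably look better.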
I've been searching for a while, and I've found a number of answers for problems similar to mine, but not quite working when I try to implement them.
I'm trying to make a series of radar plots for different observations of performance. The data has been normalized such that the mean is 0 and the standard deviation is 1, and the y-axis on the plot has been set from -3 to 3 so as to make it visually comparable how well the subjects performed, with more extreme observations being worse. I would like to add colors associated with that scale, preferably such that -1 to 1 is green, and then the bands between +/- 1-2 is yellow and +/- 2-3 is red. All the examples I've been able to find relating to color fills is based directly in the data or from factors rather than a fixed scale, and anything I try seems to not show correctly. I'm not even sure if it is normally in the functionality of ggplot to be able to set a color scale in the way I'm looking for...
Here's the toy data I've been working with while working out the plotting (after reshaping):
variable <- c("time", "distance", "turns")
value <- c(0.9536197, 0.5842319, -2.1814528)
df <- data.frame(variable, value)
and here's my most recent attempt as far as ggplot code goes (using ggiraphExtra):
ggplot(df, aes(x=variable, y=value, group=1)) + geom_point() + geom_polygon() +
ggiraphExtra:::coord_radar() + ylim(-3,3) +
scale_fill_gradient(low="red", high="green")
and this is the output:
radar plot with solid green geom_polygon fill
I have searched considerably for what I want to accomplish, but I haven't run across examples or plots that are specifically what I'm looking for, so I am reaching out to the community.
What I have (data downloadable here):
Time-series data (each record 2 hours apart and spanning nearly a year) with associated elevation and property ownership.
library(ggplot2)
data <- read.csv("dataex.csv")
data$timestamp <-as.POSIXct(as.character(data$timestamp),format="%m/%d/%Y %H:%M", tz="GMT")
What I want (via ggplot):
A line or bar plot showing elevation (y-axis) across time (x-axis) for each data record, colored by ownership (for a line plot, filling the area under the line, or for a bar plot, filling the bar). I've tried iterations of geom_line, geom_bar, and geom_area (with geom_bar, below, the closest I have come). I'd like at least one of the following options to come true!
Option A - The closest I have come to achieving this (plotting per data record) is with the following code:
ggplot(data, aes(x=timestamp, y=elev, fill=OWNER)) + geom_bar(stat="identity")
However, I'd like the bars to be touching each other, but if I adjust the width in geom_bar(), everything disappears. (Also, if I run the above code on other batches of similar data, it will only show a fraction of the bars, likely because they have more data records.) It seems like it's just too much data to plot. So I tried another route...
Option B - Plotting by day, which turns out to be more informative, showing each day the variability in ownership.
ggplot(data, aes(x=as.Date(Date, format='%Y-%m-%d'), y=elev, fill=OWNER)) + geom_bar(stat="identity", width=1)
However, this sums the y-axis, so the elevation is not interpretable. I could divide the y-axis by 12 (the typical number of records per day) but there are occasional days with fewer than 12 records, which causes the y-axis to be incorrect. Is there a function or a way to divide the y-axis by the respective number of records per day that is being represented in the plot? Or does someone have advice for a better solution?
Something like:
library(readr)
library(dplyr)
library(ggplot2)
library(ggalt)
readr::read_csv("~/Desktop/dataex.csv") %>%
mutate(timestamp=lubridate::mdy_hm(timestamp)) %>%
select(timestamp, elev, Owner=OWNER) -> df
ggplot(df, aes(timestamp, elev, group=Owner, color=Owner)) +
geom_segment(aes(xend=timestamp, yend=0), size=0.1) +
scale_x_datetime(expand=c(0,0), date_breaks="2 months") +
scale_y_continuous(expand=c(0,0), limits=c(0,2250), label=scales::comma) +
ggthemes::scale_color_tableau() +
hrbrmisc::theme_hrbrmstr(grid="Y") +
labs(x=NULL, y="Elevation") +
theme(legend.position="bottom") +
theme(axis.title.y=element_text(angle=0, margin=margin(r=-20)))
?
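For the Option B question about dividing by the number of records per day, one possibility is to scale each record by its own day's record count before plotting, so each stacked bar sums to that day's mean elevation. This is a sketch on made-up data (dataex.csv is not reproduced here). The column names Date, elev and OWNER follow the question; n_day and elev_share are hypothetical helper columns.

```r
library(ggplot2)

# Hypothetical stand-in for dataex.csv: one record every 2 hours, last day incomplete
records <- data.frame(
  timestamp = as.POSIXct("2015-01-01 00:00", tz = "GMT") + 7200 * (0:29)
)
records$Date  <- as.Date(records$timestamp, tz = "GMT")
records$elev  <- 1000 + 10 * (0:29)
records$OWNER <- rep(c("Private", "Public"), length.out = 30)

# Number of records on each record's own day (12, 12 and 6 here)
records$n_day <- ave(records$elev, records$Date, FUN = length)

# Each record contributes elev / n_day, so a day's stack adds up to its mean elevation
records$elev_share <- records$elev / records$n_day

p <- ggplot(records, aes(x = Date, y = elev_share, fill = OWNER)) +
  geom_col(width = 1) +
  labs(y = "Mean elevation")
```

Within each bar, the split between owners then reflects each owner's share of that day's records, weighted by elevation.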
I'm struggling with making a graph of proportion of a variable across a factor in ggplot.
Taking mtcars data as an example and stealing part of a solution from this question I can come up with
ggplot(mtcars, aes(x = as.factor(cyl))) +
geom_bar(aes(y = (..count..)/sum(..count..))) +
scale_y_continuous(labels = percent_format())
This graph gives me proportion of each cyl category in the whole dataset.
What I'd like to get though is the proportion of cars in each cyl category, that have automatic transmission (binary variable am).
On top of each bar I would like to add an error bar for the proportion.
Is it possible to do it with ggplot only? Or do I have to first prepare a data frame with summaries and use it with identity option of bar graphs?
I found some examples on Cookbook for R web page, but they deal with continuous y variable.
I think that it would be easier to make new data frame and then use it for plotting. Here I calculated proportions and lower/upper confidence interval values (took them from prop.test() result).
library(plyr)
mt.new<-ddply(mtcars,.(cyl),summarise,
prop=sum(am)/length(am),
low=prop.test(sum(am),length(am))$conf.int[1],
upper=prop.test(sum(am),length(am))$conf.int[2])
ggplot(mt.new,aes(as.factor(cyl),y=prop,ymin=low,ymax=upper))+
geom_bar(stat="identity")+
geom_errorbar()
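On the "ggplot only" part of the question: stat_summary() can compute both the proportion and the interval inside the plot call. Below is a minimal sketch, assuming ggplot2 3.3.0 or later (for the fun and fun.data arguments). The helper name prop_ci is made up. Note that in mtcars, am is coded 0 = automatic and 1 = manual, so mean(am) is the proportion of manual cars.

```r
library(ggplot2)

# Hypothetical helper: proportion of 1s plus the prop.test() confidence interval,
# returned in the y / ymin / ymax layout that stat_summary(fun.data = ...) expects
prop_ci <- function(x) {
  ci <- prop.test(sum(x), length(x))$conf.int
  data.frame(y = mean(x), ymin = ci[1], ymax = ci[2])
}

# Bars show the proportion of am == 1 per cyl; error bars come from prop_ci()
p <- ggplot(mtcars, aes(x = factor(cyl), y = am)) +
  stat_summary(fun = mean, geom = "bar") +
  stat_summary(fun.data = prop_ci, geom = "errorbar", width = 0.2) +
  labs(x = "cyl", y = "Proportion with am == 1")
```

With groups this small, prop.test() may warn that the chi-squared approximation is doubtful. binom.test() is one alternative for exact intervals.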