Hello everyone, I have some data from which I need to create a nice histogram.
First I used hist() to create a basic one, and after some research I found out that it uses the Sturges method to decide how many bins are needed. To make a more customizable and better-looking histogram, I tried the ggplot2 package and entered the number of bins manually. As you can see in the photos, the histograms are not the same: on the y-axis the hist() version reaches a frequency of about 60, while the ggplot version goes above that.
Additionally, I'm having a hard time getting ggplot to show proper ticks on the x-axis. I can't find any reference on how to modify the tick marks so that they align with the breaks without messing up the graph.
Any ideas and help would be really appreciated.
Photos:
https://prnt.sc/greVRNoGo67T
https://prnt.sc/bMl29-2Fr5BN
One way to solve the problem is to do some pre-processing and plot a bar plot.
The pre-processing is to bin the data with cut. This transforms the continuous variable Total_Volume into a categorical variable, but since it is done in a pipe the transformation is temporary and the original data remains unchanged.
The breaks are equidistant and the labels are the rounded midpoints of the bins.
Note that the bars for very small counts are too short to be visible, but they are there.
suppressPackageStartupMessages({
library(dplyr)
library(ggplot2)
})
brks <- seq(min(x) - 1, max(x) + 1, length.out = 30)
# midpoints: left edge of each bin plus half the bin width
labs <- round(head(brks, -1) + diff(brks)/2)
data.frame(Total_Volume = x) %>%
  mutate(Total_Volume = cut(Total_Volume, breaks = brks, labels = labs)) %>%
  ggplot(aes(Total_Volume)) +
  geom_bar(width = 1) +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1))
Created on 2022-10-02 with reprex v2.0.2
Data creation code.
set.seed(2022)
n <- 1e6
x <- rchisq(n, df = 1, ncp = 5.5e6)
i <- x > 5.5e6
x[i] <- rnorm(sum(i), mean = 5.5e6, sd = 1e4)
Created on 2022-10-02 with reprex v2.0.2
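As for why the two histograms differ in the first place: hist() treats the Sturges count only as a suggestion and passes it through pretty(), so its breaks are generally not the ones ggplot builds from a bare number of bins. Below is a minimal sketch of one way to make them agree, assuming it is acceptable to reuse the breaks that hist() computes. The chi-squared data is only a stand-in, not the asker's data.

```r
library(ggplot2)

# Stand-in data (an assumption; substitute the real Total_Volume vector)
set.seed(2022)
x <- rchisq(1e4, df = 3)

# Let hist() choose its breaks without drawing anything
h <- hist(x, plot = FALSE)

# Reuse the same breaks for the bins and for the x-axis ticks
p <- ggplot(data.frame(x = x), aes(x)) +
  geom_histogram(breaks = h$breaks, closed = "right",
                 fill = "grey70", colour = "black") +
  scale_x_continuous(breaks = h$breaks)

# Bin counts as computed by ggplot2, for comparison with h$counts
gg_counts <- layer_data(p)$count
```

Because the tick marks use the same vector as the bins, they should line up with the bar edges, and the bar heights can be compared directly with h$counts.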
Related
I'm trying to plot a dotplot using geom_dotplot in which each dot represents an observation of my data set. Therefore, the y-axis shouldn't represent density but actual counts. I'm aware of this thread which revolves around the same topic. However, I haven't managed to solve my issue following the same methodology.
df <- data.frame(x = sample(1:500, size = 150, replace = TRUE))
ggplot(df, aes(x)) +
geom_dotplot(method = 'histodot', binwidth = 1)
I obtain the following graph, but I want one similar to this one, where I can manipulate the dots' size, the space between them, etc.
Thanks in advance
You can modify the binwidth argument to cause the points to stack. For example,
df %>%
ggplot(aes(x = x)) +
geom_dotplot(method = "histodot", binwidth = 20)
There is a dotsize argument that can be used to modify the size of the dot.
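A small sketch of how these arguments might be combined, assuming the goal is one dot per observation with adjustable size and spacing. The values 0.8 and 1.2 are arbitrary choices for illustration, and the y-axis is hidden because geom_dotplot does not give it a meaningful count scale.

```r
library(ggplot2)

set.seed(1)
df <- data.frame(x = sample(1:500, size = 150, replace = TRUE))

p <- ggplot(df, aes(x)) +
  geom_dotplot(
    method = "histodot",
    binwidth = 20,     # wider bins make the dots stack
    dotsize = 0.8,     # dot diameter relative to binwidth
    stackratio = 1.2   # vertical spacing between stacked dots
  ) +
  # hide the y-axis, which is not informative for dotplots
  scale_y_continuous(NULL, breaks = NULL)
```

Printing p draws the plot; changing dotsize and stackratio together is usually needed to keep the dots from overlapping.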
I'm trying to replicate the image below in R (Original post). I have seen similar posts (Post 1 and Post 2), but none with a plot like this one. I'm just wondering if anyone knows how to do something similar in R. A few observations about the plot:
Bubbles do not overlap
Smaller bubbles tend to be closer to the axis (but not always!)
Bubbles are in two categories
I'm sure that data from Post 1 would be helpful!
https://docs.google.com/spreadsheets/d/11nq5AK3Y1uXQJ8wTm14w9MkZOwAxHHTyPyEvUxCeGVc/edit?usp=sharing
Thank you so much,
Ok, so this is just a starting point that people could use to formulate a better answer to the question. It uses the packcircles package to (surprisingly) pack circles. It doesn't satisfy all of your criteria, but it can serve as a useful starting point. We're just going to pretend that the eruptions column from the faithful dataset is your time variable.
library(packcircles)
library(ggplot2)
library(scales)
library(ggrepel)
# Setup some data, suppose we'd like to label 5 samples
set.seed(0)
faith2 <- faithful
faith2$label <- ""
faith2$label[sample(nrow(faith2), 5)] <- LETTERS[1:5]
# Initialise circle pack data
init <- data.frame(
x = faith2$eruptions,
y = runif(nrow(faith2)),
areas = rescale(faith2$waiting, to = c(0.01, 0.1))
)
# Use the repelling layout
res <- circleRepelLayout(
init,
xlim = range(init$x) + c(-1, 1),
ylim = c(0, Inf),
xysizecols = c(1, NA, 3),
sizetype = "radius",
weights = 0.1
)
# Prepare for ggplot2
df <- circleLayoutVertices(res$layout)
df <- cbind(df, faith2[df$id,])
This is showing that the circles are reasonably placed with respect to our fake time variable.
# Plot
ggplot(df, aes(x, y, group = id)) +
geom_polygon(aes(fill = eruptions,
colour = I(ifelse(nzchar(label), "black", NA)))) +
scale_fill_viridis_c() +
coord_equal()
And this is showing that the circle size is reasonably corresponding to a different variable.
ggplot(df, aes(x, y, group = id)) +
geom_polygon(aes(fill = waiting,
colour = I(ifelse(nzchar(label), "black", NA)))) +
scale_fill_viridis_c() +
coord_equal()
Created on 2020-07-11 by the reprex package (v0.3.0)
There are a few flaws in this; notably, it doesn't satisfy the 2nd criterion (circles aren't hugging the axis). Also, for reasons beyond my understanding, the packcircles layout couldn't place about 12% of the datapoints, which are assigned NaN in df. Anyway, hopefully somebody smarter than me will do a better job at this.
I have a scatter plot. Each color represents a categorical group, and each group has a range of values on the x-axis. The ranges of the groups should not overlap. However, because of the thickness of the scatter points, it looks as if they do. So I want to draw a line connecting the maximum point of one group to the minimum point of the adjacent group; as long as that line does not have a negative slope, it shows that the groups do not overlap.
I do not know how to use geom_line() to connect two points when the y-coordinate is a categorical variable. Is that possible?
Any help would be appreciated!
It sounds like you want geom_segment not geom_line. You'll need to aggregate your data into a new data frame that has the points you want plotted. I adapted Brian's sample data and use dplyr for this:
# sample data
df <- data.frame(xvals = runif(50, 0, 1))
df$cats <- cut(df$xvals, c(0, .25, .625, 1))
# aggregation
library(dplyr)
library(ggplot2)
df_summ <- df %>% group_by(cats) %>%
  summarize(min = min(xvals), max = max(xvals)) %>%
  mutate(adj_max = lead(max),
         adj_min = lead(min),
         adj_cat = lead(cats))
# plot
ggplot(df, aes(xvals, cats, colour = cats)) +
geom_point() +
geom_segment(data = df_summ, aes(
x = max,
xend = adj_min,
y = cats,
yend = adj_cat
))
You can keep the segments colored as the previous category, or maybe set them to a neutral color so they don't stand out as much.
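A sketch of the neutral-color variant, together with a numeric check of the "no negative slope" idea. The column names lo, hi, next_lo and next_cat are made up for this sketch, and the sample data is random, so treat it as an illustration rather than the exact code for your data.

```r
library(dplyr)
library(ggplot2)

set.seed(1)
df <- data.frame(xvals = runif(50, 0, 1))
df$cats <- cut(df$xvals, c(0, .25, .625, 1))

# Range of each group plus the start of the next group
# (lo, hi, next_lo and next_cat are hypothetical names)
df_summ <- df %>%
  group_by(cats) %>%
  summarize(lo = min(xvals), hi = max(xvals)) %>%
  mutate(next_lo = lead(lo), next_cat = lead(cats))

# TRUE when no group starts before the previous one ends
no_overlap <- all(df_summ$next_lo >= df_summ$hi, na.rm = TRUE)

p <- ggplot(df, aes(xvals, cats, colour = cats)) +
  geom_point() +
  geom_segment(
    # the last group has no neighbour, so drop its row
    data = filter(df_summ, !is.na(next_lo)),
    aes(x = hi, xend = next_lo, y = cats, yend = next_cat),
    colour = "grey40", inherit.aes = FALSE
  )
```

Setting colour outside aes() gives every segment the same grey, and no_overlap gives the same answer as inspecting the slopes by eye.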
My reading comprehension failed me, so I misunderstood the question. Ignore this answer unless you want to learn about the lineend = argument of geom_line.
# generate dummy data
df <- data.frame(xvals = runif(1000, 0, 1))
# these categories were chosen to line up
# with tick marks to show they don't overlap
df$cats <- cut(df$xvals, c(0, .25, .625, 1))
ggplot(df, aes(xvals, cats, colour = cats)) +
geom_line(size = 3)
The caveat is that there is a lineend = argument to geom_line. The default is "butt", so that lines end exactly where you want them to and butt up against things, but sometimes that's not the right look. In this case, the other options would cause visual overlap, as you can see with the gridlines.
With lineend = "square":
With lineend = "round":
The following code produces two density plots, the first using base R graphics and the second using ggplot2. The second plot has an artificial peak at the beginning of the curve which is not present in the first plot. The peak is always present when the lower x-axis limit is set above zero. Why does ggplot make this peak, and how can I avoid it?
I can't post images due to lack of reputation points. Please try it yourself. This code should work as it is.
library(ggplot2)
set.seed(101)
xval<-rlnorm(n=10000)
xdf<-data.frame(xval)
plot(density(xdf$xval), xlim=c(1, 10))
ggplot(data=xdf, aes(x=xval))+geom_density()+xlim(1, 10)
Is this a bug in ggplot2?
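If you replace xlim() with coord_cartesian(), it works. xlim() sets the scale limits, so values outside (1, 10) are dropped before the density is estimated, whereas coord_cartesian() only zooms the view and the density is still computed from all the data: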
library(ggplot2)
set.seed(101)
xval <- rlnorm(n=10000)
xdf <- data.frame(xval)
par(xaxs = "i") # change the style to fix exact x limits to (1, 10)
plot(density(xdf$xval), xlim = c(1, 10))
ggplot(data = xdf, aes(x = xval)) +
stat_density(geom = "line") +
scale_x_continuous(breaks = c(2,4,6,8,10)) +
coord_cartesian(xlim = c(1, 10))
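The effect of dropping the out-of-range values can be checked in base R. This is a rough sketch, assuming that subsetting the data by hand is a fair imitation of what xlim() does before the density is computed:

```r
set.seed(101)
xval <- rlnorm(n = 10000)

# Density from all the data (what coord_cartesian() zooms into)
d_full <- density(xval)

# Density from only the values inside the limits (imitating xlim(1, 10))
d_trunc <- density(xval[xval >= 1 & xval <= 10])

# Compare the two estimates at the same point, x = 1.5
f_full  <- approx(d_full$x,  d_full$y,  xout = 1.5)$y
f_trunc <- approx(d_trunc$x, d_trunc$y, xout = 1.5)$y
c(full = f_full, truncated = f_trunc)
```

The truncated estimate integrates to one over a much smaller sample, so it sits well above the full estimate and falls away towards the cut at x = 1, which is what shows up as the bump at the start of the ggplot curve.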
I have a basic plot of a two column data frame (x = "Periods" and y = "Range").
library (ggplot2)
qplot (Periods, Range, data=twocoltest, color=Periods, size = 3,) + geom_jitter(position=position_jitter(width=0.2))
I am trying to add a horizontal line at each period below which lie 90% of all the observations for that period. (It doesn't have to be a horizontal line, any visual indication per period would suffice).
Any help would be greatly appreciated.
Alrighty, I've read the ggplot help, and here's a go:
# example data
twocoltest <- data.frame(Periods=rep(1:3,each=3),Range=1:9)
library(ggplot2)
# stored as p so that base::c() is not masked
p <- qplot(Periods, Range, data = twocoltest, color = Periods, size = 3) + geom_jitter(position = position_jitter(width = 0.2))
q90 <- function(x) {quantile(x, probs = 0.9)}
# fun.y was renamed to fun in ggplot2 3.3.0
p + stat_summary(fun = q90, colour = "red", geom = "crossbar", size = 1, ymin = 0, ymax = 0)
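For what it's worth, here is a sketch of the same idea without qplot, assuming ggplot2 3.3.0 or later, where stat_summary takes fun, fun.min and fun.max. Passing the same function to all three collapses the crossbar into a single horizontal line at the 90th percentile of each period. Treating Periods as a factor is my assumption about how the groups should be drawn.

```r
library(ggplot2)

# example data
twocoltest <- data.frame(Periods = rep(1:3, each = 3), Range = 1:9)

# 90th percentile, with the name attribute dropped
q90 <- function(x) unname(quantile(x, probs = 0.9))

p <- ggplot(twocoltest,
            aes(factor(Periods), Range, colour = factor(Periods))) +
  geom_jitter(width = 0.2, height = 0, size = 3) +
  # ymin = y = ymax, so the crossbar is drawn as one flat line
  stat_summary(fun = q90, fun.min = q90, fun.max = q90,
               geom = "crossbar", colour = "red", width = 0.5) +
  labs(x = "Periods", colour = "Periods")

# the values the red lines are drawn at
per_period <- tapply(twocoltest$Range, twocoltest$Periods, q90)
```

With the example data, per_period holds one value per period, which makes it easy to confirm the lines are where you expect.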