I am trying to make an x-y scatter-plot. I don't mind if it's in plot or ggplot2. I don't know much about each, but I would like an example in both if you don't mind. I would like a label on the points.
Below is code and dput:
tickers <- rownames(x2)
library(zoo)
plot(x2,
main= "Vol vs Div",
xlab= "Vol (in %)",
ylab= "Div",
col= "blue", pch = 19, cex = 1, lty = "solid", lwd = 2)
text(x=x2$Volatility101,y=x2$`12m yield`, labels=tickers,cex= 0.7, pos= 3)
x2:
structure(list(Volatility101 = c(25.25353177644, 42.1628734949414,
28.527736824123), `12m yield` = c("3.08", "7.07", "4.72")), class = "data.frame", row.names = c("EUN",
"HRUB", "HUKX"))
Here is a tidyverse solution.
library(ggplot2)
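library(tibble) # provides rownames_to_column()
library(dplyr)
library(ggrepel)
x2 %>%
rownames_to_column(var = "tickers") %>%
mutate(`12m yield` = as.numeric(`12m yield`)) %>% # the yields are stored as character
ggplot(aes(x = Volatility101, y = `12m yield`)) +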
geom_point(color = "blue") +
geom_text_repel(aes(label = tickers)) +
ggtitle("Vol vs Div") +
xlab("Vol (in %)") +
ylab("Div") +
theme_classic()
I was surprised that the plot function worked at all, because the y-values are character strings. Converting them to numeric in the text call places the labels in the expected locations:
text(x=x2$Volatility101,y=as.numeric(x2$`12m yield`)+.1, labels=tickers,
cex= 0.7, col='black')
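Putting the pieces together, here is a minimal base-graphics sketch. It assumes the x2 from the question, converts the yield column to numeric once, and adds a little headroom (the 0.5 is an arbitrary choice) so the top label is not cut off:

```r
# x2 as given in the question (yields arrive as character strings)
x2 <- structure(list(Volatility101 = c(25.25353177644, 42.1628734949414,
    28.527736824123), `12m yield` = c("3.08", "7.07", "4.72")),
  class = "data.frame", row.names = c("EUN", "HRUB", "HUKX"))

# Convert once so that plot() and text() share the same numeric y-values
x2$`12m yield` <- as.numeric(x2$`12m yield`)
tickers <- rownames(x2)

plot(x2$Volatility101, x2$`12m yield`,
     main = "Vol vs Div", xlab = "Vol (in %)", ylab = "Div",
     col = "blue", pch = 19,
     ylim = range(x2$`12m yield`) + c(0, 0.5)) # extra headroom for the top label
text(x = x2$Volatility101, y = x2$`12m yield`, labels = tickers,
     cex = 0.7, pos = 3)
```

With pos = 3 the labels sit just above the points, so no manual y-offset is needed.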
A couple of notes about the question presentation: It's unclear (and misleading) why ggplot2 is a tag. The plot function is generic and in this case it uses base-graphics rather than either ggplot2 specifically or grid graphics more generally. I also think that the library(zoo) call is probably unnecessary. There is a plot.zoo function, but it would not be called in this case.
I am trying to add vertical lines to a time series plot I made in base R plot(data1,type = 'l',lwd = 1.5, family = "A", ylab ="", xlab = "", main = ""). This plot has a total of 5 plots inside of it. There are two x-axes that are the same (see current plot)
When adding vlines with abline(v=c(27,87, 167, 220, 280, 329), lty=2) I get this result
Is there a way to get them to go on the graphs so it looks something like this but with dashed lines and the lines within the plots.
Or if you know of a better way to plot this in ggplot that would be fantastic as well. Thank you so much in advance.
Here is a toy example of using ggplot to put in vlines.
library(tidyverse)
iris2 <- iris %>% pivot_longer(cols=Sepal.Length:Petal.Length)
ggplot(iris2, aes(x = Petal.Width, y = value)) +
geom_line() +
facet_wrap(~name, scales="free_y", ncol=2) +
geom_vline(xintercept=c(.25, .75, 1.25, 1.75, 2.25),
linetype='dashed', col = 'blue')
plot calls plot.ts, which has a panel= argument. You can pass it a function that defines what is drawn in each panel. Since you want a grid as well as the usual lines, this is quite easy:
panel <- function(...) {grid(col=1, ny=NA, lty=1); lines(...)}
plot(z, panel=panel)
grid usually uses the axTicks and you can define number of cells, see ?grid.
This also works with abline.
panel <- function(...) {abline(v=seq(1961, 1969, 2), col=1,lty=1); lines(...)}
plot(z, panel=panel)
Data:
set.seed(42)
z <- ts(matrix(rnorm(500), 100, 5), start = c(1961, 1), frequency = 12) # 100 x 5 = 500 values
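If the line positions are known as observation indices (as in the question) rather than in time units, one possible sketch is to look them up in time(z) first. The indices below and the name dashed_panel are made up for illustration:

```r
set.seed(42)
z <- ts(matrix(rnorm(500), 100, 5), start = c(1961, 1), frequency = 12)

# Hypothetical observation indices where the vertical lines should go
idx <- c(27, 87)
# Convert indices to the time scale used on the x-axis
breaks <- as.numeric(time(z))[idx]

# Panel function: draw the series, then dashed vertical lines inside each panel
dashed_panel <- function(...) {
  lines(...)
  abline(v = breaks, lty = 2)
}
plot(z, panel = dashed_panel, main = "")
```

Because abline is called inside the panel function, it uses each panel's own coordinate system instead of the coordinates left over after the whole figure is drawn.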
I have four series that I would like to plot.
There are 2 models : xg and algo30.
There are two types of data: predicted and observed.
This means we have the following 4 series: "predicted xg","observed xg", "predicted 30", "observed 30".
I want "xg" to be blue, "algo30" to be red.
I also want predicted to be a solid line and observed to be points.
Here is what I mean, using base plot:
library(magrittr)
library(ggplot2)
library(dplyr)
set.seed(123)
gr <- 1:10
obs.xg <- sort(runif(10, 0.5, 1))
obs.30 <- sort(runif(10, 0.5, 1))
pred.xg <- lm(obs.xg~gr) %>% predict() %>% add(rnorm(10,0,.01))
pred.30 <- lm(obs.30~gr) %>% predict() %>% add(rnorm(10,0,.01))
plot(gr, obs.xg, col="darkblue", ylim=range(c(obs.xg,obs.30)), pch=20)
lines(gr, pred.xg, col="darkblue", lwd=2)
points(gr, obs.30, col="firebrick", pch=20)
lines(gr, pred.30, col="firebrick", lwd=2)
legend("bottomright",
pch=c(20,NA,NA,NA,NA),
lty=c(NA,1,NA,1,1),
lwd=c(NA,1,NA,2,2),
col = c("black","black",NA, "darkblue","firebrick"),
legend=c("observé","prédit",NA,"xgboost","algo30"),
bty='n')
Here is my best attempt using ggplot. Notice that the legend doesn't work as I want.
xg.data <- data.frame(model= "xg", decile = seq(1:10), observed = obs.xg, predicted = pred.xg)
algo30.data <- data.frame(model = "algo30",decile = seq(1:10), observed = obs.30, predicted = pred.30)
ggplotdata <- bind_rows(xg.data, algo30.data)
ggplotdata %>%
ggplot( aes(x=decile, y= predicted, color= model))+ geom_line()+
geom_point(aes(x=decile, y= observed, color = model))
Most of the time when making a legend like this I look to override.aes in guide_legend().
The idea here is to make a legend using an additional aesthetic that you don't want mapped onto the plot itself and then using constants instead of a variable for that aesthetic. I used alpha, since both points and lines use that aesthetic.
Then the heavy lifting is done in scale_alpha_manual: removing the legend name, making sure the plot still looks right by setting the values, and then, finally, picking the correct point type and lines along with blanks for the legend.
ggplot(ggplotdata, aes(x=decile, y= predicted, color= model))+
geom_line( aes(alpha = "prédit") )+
geom_point(aes(x=decile, y= observed, alpha = "observé")) +
scale_alpha_manual(name = NULL, values = c(1, 1),
guide = guide_legend(override.aes = list(linetype = c(0, 1), shape = c(16, NA)))) +
scale_color_manual(name = NULL, values = c("firebrick", "darkblue"))
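The colours above depend on the alphabetical order of the model levels ("algo30" before "xg"). A small variation, sketched here with stand-in data of the same shape as ggplotdata (the simulated values and the name model_cols are assumptions), uses a named vector so the mapping no longer depends on level order:

```r
library(ggplot2)

set.seed(123)
# Stand-in data with the same columns as ggplotdata in the question
ggplotdata <- data.frame(
  model    = rep(c("xg", "algo30"), each = 10),
  decile   = rep(1:10, times = 2),
  observed = c(sort(runif(10, 0.5, 1)), sort(runif(10, 0.5, 1)))
)
ggplotdata$predicted <- ggplotdata$observed + rnorm(20, 0, 0.01)

# Named vector: each model gets its colour regardless of factor level order
model_cols <- c(xg = "darkblue", algo30 = "firebrick")

p <- ggplot(ggplotdata, aes(x = decile, y = predicted, color = model)) +
  geom_line(aes(alpha = "prédit")) +
  geom_point(aes(y = observed, alpha = "observé")) +
  scale_alpha_manual(name = NULL, values = c(1, 1),
    guide = guide_legend(override.aes = list(linetype = c(0, 1),
                                             shape = c(16, NA)))) +
  scale_color_manual(name = NULL, values = model_cols)
print(p)
```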
I'm trying to create a figure similar to the one below (taken from Ro, Russell, & Lavie, 2001). In their graph, they are plotting bars for the errors (i.e., accuracy) within the reaction time bars. Basically, what I am looking for is a way to plot bars within bars.
I know there are several challenges with creating a graph like this. First, Hadley points out that it is not possible to create a graph with two scales in ggplot2 because those graphs are fundamentally flawed (see Plot with 2 y axes, one y axis on the left, and another y axis on the right)
Nonetheless, the graph with superimposed bars seems to solve this dual scaling problem, and I'm trying to figure out a way to create it in R. Any help would be appreciated.
It's fairly easy in base R: use par(new = T) to draw on top of an existing graph.
set.seed(54321) # for reproducibility
data.1 <- sample(1000:2000, 10)
data.2 <- sample(seq(0, 5, 0.1), 10)
# Use xpd = F to avoid plotting the bars below the axis
barplot(data.1, las = 1, col = "black", ylim = c(500, 3000), xpd = F)
par(new = T)
# Plot the new data with a different ylim, but don't plot the axis
barplot(data.2, las = 1, col = "white", ylim = c(0, 30), yaxt = "n")
# Add the axis on the right
axis(4, las = 1)
It is pretty easy to make the bars in ggplot. Here is some example code. No two y-axes though (although look here for a way to do that too).
library(ggplot2)
data.1 <- sample(1000:2000, 10)
data.2 <- sample(500:1000, 10)
ggplot(mapping = aes(x, y)) +
geom_bar(data = data.frame(x = 1:10, y = data.1), width = 0.8, stat = 'identity') +
geom_bar(data = data.frame(x = 1:10, y = data.2), width = 0.4, stat = 'identity', fill = 'white') +
theme_classic() + scale_y_continuous(expand = c(0, 0))
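If a right-hand axis is wanted after all, one option is sec_axis, which only relabels the primary scale. The sketch below assumes the inner bars are on a smaller scale (as in the base R answer) and multiplies them by a hand-picked factor k; the factor and the axis names are illustrative assumptions:

```r
library(ggplot2)

set.seed(54321)
rt  <- sample(1000:2000, 10)      # outer bars, left axis
err <- sample(seq(0, 5, 0.1), 10) # inner bars, right axis
k   <- 100                        # hypothetical factor mapping err onto the rt scale

df <- data.frame(x = 1:10, rt = rt, err_scaled = err * k)

p <- ggplot(df, aes(x)) +
  geom_col(aes(y = rt), width = 0.8, fill = "black") +
  geom_col(aes(y = err_scaled), width = 0.4, fill = "white") +
  # The secondary axis undoes the scaling so its labels read in err units
  scale_y_continuous(name = "RT", expand = c(0, 0),
                     sec.axis = sec_axis(~ . / k, name = "Errors")) +
  theme_classic()
print(p)
```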
Using this example:
x<-mtcars;
barplot(x$mpg);
you get a graph that is a lot of barplots from (0 - 30).
My question is how can you adjust it so that the y axis is (10-30) with a split at the bottom indicating that there was data below the cut off?
Specifically, I want to do this in base R using only the barplot function and not functions from plotrix (unlike the suggestions already provided). Is this possible?
This is not recommended. It is generally considered bad practice to chop off the bottoms of bars. However, if you look at ?barplot, it has a ylim argument which can be combined with xpd = FALSE (which turns on "clipping") to chop off the bottom of the bars.
barplot(mtcars$mpg, ylim = c(10, 30), xpd = FALSE)
Also note that you should be careful here. I followed your question and used 10 and 30 as the y-bounds, but the maximum mpg is 33.9, so I also clipped the top of the 4 bars that have values > 30.
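To avoid clipping the tops, the upper bound can be derived from the data instead of hard-coded. A minimal sketch, where rounding up to the next multiple of 5 is just one arbitrary choice:

```r
mpg <- mtcars$mpg

# Round the maximum up to the next multiple of 5 so no bar is cut off at the top
upper <- ceiling(max(mpg) / 5) * 5

# Number of bars a fixed upper bound of 30 would have clipped
n_clipped <- sum(mpg > 30)

barplot(mpg, ylim = c(10, upper), xpd = FALSE)
```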
The only way I know of to make a "split" in an axis is using plotrix. So, based on
Specifically, I want to do this in base R using only the barplot function and not functions from plotrix (unlike the suggestions already provided). Is this possible?
the answer is "no, this is not possible" in the sense that I think you mean. plotrix certainly does it, and it uses base R functions, so you could do it however they do it, but then you might as well use plotrix.
You can plot on top of your barplot, perhaps a horizontal dashed line (like below) could help indicate that you're breaking the commonly accepted rules of what barplots should be:
abline(h = 10.2, col = "white", lwd = 2, lty = 2)
The resulting image is below:
Edit: You could use segments to spoof an axis break, something like this:
barplot(mtcars$mpg, ylim = c(10, 30), xpd = FALSE)
xbase = -1.5
xoff = 0.5
ybase = c(10.3, 10.7)
yoff = 0
segments(x0 = xbase - xoff, x1 = xbase + xoff,
y0 = ybase-yoff, y1 = ybase + yoff, xpd = T, lwd = 2)
abline(h = mean(ybase), lwd = 2, lty = 2, col = "white")
As-is, this is pretty fragile: xbase was adjusted by hand, and it will depend on the range of your data. You could switch the barplot to xaxs = "i" and set xbase = 0 for more predictability, but why not just use plotrix, which has already done all this work for you?!
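A somewhat less fragile variant, offered only as a sketch, reads the left edge of the plot region from par("usr") after the barplot is drawn, so xbase no longer has to be tuned by hand. The 1.5% offset and the upper limit of 35 are arbitrary choices:

```r
barplot(mtcars$mpg, ylim = c(10, 35), xpd = FALSE)

usr   <- par("usr")               # c(x1, x2, y1, y2) of the plot region
xbase <- usr[1]                   # the y-axis is drawn at the left edge
xoff  <- 0.015 * diff(usr[1:2])   # half-width of the break marks
ybase <- c(10.3, 10.7)

# Two short horizontal marks across the axis to suggest a break
segments(x0 = xbase - xoff, x1 = xbase + xoff,
         y0 = ybase, y1 = ybase, xpd = TRUE, lwd = 2)
abline(h = mean(ybase), lwd = 2, lty = 2, col = "white")
```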
ggplot: In the comments you said you don't like the look of ggplot. It is easily customized, e.g.:
library(ggplot2)
x$id <- seq_len(nrow(x)) # bar positions; x is the mtcars copy from the question
ggplot(x, aes(y = mpg, x = id)) +
geom_bar(stat = "identity", color = "black", fill = "gray80", width = 0.8) +
theme_classic()
In R, with ecdf I can plot an empirical cumulative distribution function
plot(ecdf(mydata))
and with hist I can plot a histogram of my data
hist(mydata)
How can I plot the histogram and the ecdf in the same plot?
EDIT
I am trying to make something like this:
https://mathematica.stackexchange.com/questions/18723/how-do-i-overlay-a-histogram-with-a-plot-of-cdf
Also a bit late, here's another solution that extends @Christoph's solution with a second y-axis.
par(mar = c(5,5,2,5))
set.seed(15)
dt <- rnorm(500, 50, 10)
h <- hist(
dt,
breaks = seq(0, 100, 1),
xlim = c(0,100))
par(new = T)
ec <- ecdf(dt)
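plot(x = h$mids, y=ec(h$mids)*max(h$counts), col = rgb(0,0,0,alpha=0), axes=F, xlab=NA, ylab=NA,
xlim = c(0,100), ylim = c(0, max(h$counts))) # same limits as the histogram so the overlay lines up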
lines(x = h$mids, y=ec(h$mids)*max(h$counts), col ='red')
axis(4, at=seq(from = 0, to = max(h$counts), length.out = 11), labels=seq(0, 1, 0.1), col = 'red', col.axis = 'red')
mtext(side = 4, line = 3, 'Cumulative Density', col = 'red')
The trick is the following: you don't add a line to your plot, you draw another plot on top of it, which is why we need par(new = T). You then have to add the second y-axis afterwards (otherwise it would be drawn over the y-axis on the left).
Credits go here (@tim_yates' answer) and there.
There are two ways to go about this. One is to ignore the different scales and use relative frequency in your histogram. This results in a harder to read histogram. The second way is to alter the scale of one or the other element.
I suspect this question will soon become interesting to you, particularly @hadley's answer.
ggplot2 single scale
Here is a solution in ggplot2. I am not sure you will be satisfied with the outcome though because the CDF and histograms (count or relative) are on quite different visual scales. Note this solution has the data in a dataframe called mydata with the desired variable in x.
library(ggplot2)
set.seed(27272)
mydata <- data.frame(x= rexp(333, rate=4) + rnorm(333))
ggplot(mydata, aes(x)) +
stat_ecdf(color="red") +
geom_histogram(aes(y = after_stat(count / sum(count))), bins = 30) # relative frequency per bin
base R multi scale
Here I will rescale the empirical CDF so that, instead of a maximum value of 1, its maximum equals the height of the tallest histogram bin (the largest density).
h <- hist(mydata$x, freq=F)
ec <- ecdf(mydata$x)
lines(x = knots(ec),
y = ec(knots(ec)) * max(h$density), # also correct when the data contain ties
col ='red')
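A right-hand axis for the rescaled curve can be added in the same spirit. This is only a sketch: the margin setting and the 0.25 tick spacing are arbitrary choices, and the data are regenerated so the block runs on its own:

```r
set.seed(27272)
x <- rexp(333, rate = 4) + rnorm(333)

par(mar = c(5, 4, 4, 4))          # leave room for an axis on the right
h  <- hist(x, freq = FALSE)
ec <- ecdf(x)

# Rescale the ECDF so that 1 corresponds to the tallest bin
xs     <- sort(x)
scaled <- ec(xs) * max(h$density)
lines(xs, scaled, col = "red", type = "s")

# Label the right axis in ECDF units (0 to 1)
probs <- seq(0, 1, 0.25)
axis(4, at = probs * max(h$density), labels = probs,
     col = "red", col.axis = "red")
```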
You can try a ggplot approach with a second axis:
library(tidyverse) # tibble(), the pipe, mutate() and ggplot2
set.seed(15)
a <- rnorm(500, 50, 10)
# calculate ecdf with binsize 30
binsize=30
df <- tibble(x=seq(min(a), max(a), diff(range(a))/binsize)) %>%
bind_cols(Ecdf=with(.,ecdf(a)(x))) %>%
mutate(Ecdf_scaled=Ecdf*max(a))
# plot
ggplot() +
geom_histogram(aes(a), bins = binsize) +
geom_line(data = df, aes(x=x, y=Ecdf_scaled), color=2, size = 2) +
scale_y_continuous(name = "Density",sec.axis = sec_axis(trans = ~./max(a), name = "Ecdf"))
Edit
Since the scaling was wrong, I added a second solution that calculates everything in advance:
binsize=30
a_range= floor(range(a)) +c(0,1)
b <- seq(a_range[1], a_range[2], round(diff(a_range)/binsize)) %>% floor()
df_hist <- tibble(a) %>%
mutate(gr = cut(a,b, labels = floor(b[-1]), include.lowest = T, right = T)) %>%
count(gr) %>%
mutate(gr = as.character(gr) %>% as.numeric())
# calculate ecdf with binsize 30
df <- tibble(x=b) %>%
bind_cols(Ecdf=with(.,ecdf(a)(x))) %>%
mutate(Ecdf_scaled=Ecdf*max(df_hist$n))
ggplot(df_hist, aes(gr, n)) +
geom_col(width = 2, color = "white") +
geom_line(data = df, aes(x=x, y=Ecdf*max(df_hist$n)), color=2, size = 2) +
scale_y_continuous(name = "Density",sec.axis = sec_axis(trans = ~./max(df_hist$n), name = "Ecdf"))
As already pointed out, this is problematic because the plots you want to merge have such different y-scales. You can try
set.seed(15)
mydata<-runif(50)
hist(mydata, freq=F)
lines(ecdf(mydata))
to get
Although a bit late... Another version which is working with preset bins:
set.seed(15)
dt <- rnorm(500, 50, 10)
h <- hist(
dt,
breaks = seq(0, 100, 1),
xlim = c(0,100))
ec <- ecdf(dt)
lines(x = h$mids, y=ec(h$mids)*max(h$counts), col ='red')
lines(x = c(0,100), y=c(1,1)*max(h$counts), col ='red', lty = 3) # indicates 100%
x90 <- h$mids[which.min(abs(ec(h$mids) - 0.9))] # bin midpoint where the ECDF is closest to 90%
lines(x = c(x90, x90), y = c(0, max(h$counts)), col ='black', lty = 3)
(Only the second y-axis is not working yet...)
In addition to the previous answers, I wanted ggplot to do the tedious calculation (in contrast to @Roman's solution, which was kindly updated at my request), i.e., calculate and draw the histogram, then calculate and overlay the ECDF. I came up with the following (pseudo code):
# 1. Prepare the plot
plot <- ggplot() + geom_histogram(...)
# 2. Get the max value of Y axis as calculated in the previous step
maxPlotY <- max(ggplot_build(plot)$data[[1]]$y)
# 3. Overlay scaled ECDF and add secondary axis
plot +
stat_ecdf(aes(y=..y..*maxPlotY)) +
scale_y_continuous(name = "Density", sec.axis = sec_axis(trans = ~./maxPlotY, name = "ECDF"))
This way you don't need to calculate everything beforehand and feed the results to ggplot. Just sit back and let it do everything for you!