How to scale y axis with specified values in ggplot2 - r

I am trying to scale my y-axis to work like this:
So I tried the following:
scale_y_continuous(breaks=c(0, 0.9, 0.99, 0.999))
However, the results are:
How can I scale the graph according to the numbers specified? More specifically, can I scale the graph according to a vector of values, say c(0, 0.9, 0.99, 0.999)?
Here's the code I wrote so far:
library(ggplot2)
library(extrafont)
library(scales)
results = read.csv("results.csv")
breaks = c(0, 0.9, 0.99, 0.999)
ggplot(data=results, aes(x=t, y=Values, group=Algorithm, color = factor(Algorithm), shape = factor(Algorithm))) +
geom_line(size = 1)+
theme_bw() +
theme(legend.position="top") +
labs(color="") +
theme(axis.text=element_text(size=14),
axis.title=element_text(size=16),
legend.text=element_text(size=16)) +
scale_y_log10(breaks=breaks, labels=breaks)
Sample CSV data:
t,Algorithm,Values
0,alg1,0.000000000000
0,alg2,0.000000000000
0,alg3,0.000000000000
0,alg4,0.000000000000
0,alg5,0.000000000000
100,alg1,0.950000000000
100,alg2,0.950000000000
100,alg3,0.950000000000
100,alg4,0.000000000147
100,alg5,0.000000000000
200,alg1,0.950000000005
200,alg2,0.950000000000
200,alg3,0.950000001250
200,alg4,0.004578701861
200,alg5,0.000000182645
250,alg1,0.950000259280
250,alg2,0.950000000000
250,alg3,0.950000400517
250,alg4,0.219429576450
250,alg5,0.000199361725
300,alg1,0.950314820965
300,alg2,0.950000000000
300,alg3,0.950037201876
300,alg4,0.824669958806
300,alg5,0.012390843342
400,alg1,0.992274938722
400,alg2,0.950000000000
400,alg3,0.959167637150
400,alg4,0.936487596777
400,alg5,0.603221722035
500,alg1,0.998314400000
500,alg2,0.998334835568
500,alg3,0.995747486022
500,alg4,0.978514678505
500,alg5,0.917973600000
600,alg1,0.998314400000
600,alg2,0.999100000000
600,alg3,0.999118983394
600,alg4,0.998040800000
600,alg5,0.917973600000

From what I understand of your data, you want to zoom in on the plot to see how the values in the range 0.9 to 0.99 are distributed. In ggplot2 it is recommended to use facets to highlight the important segments of your data.
You can choose to create facets by dividing your data into multiple segments (range in your case) of interest. Something like below creates 3 segments out of your range.
library(dplyr)
results = results %>%
mutate(grp = case_when(Values<0.9 ~ "0 - 0.9",
Values>=0.9 & Values<0.99 ~ "0.9 - 0.99",
Values>=0.99 ~ "0.99+"))
results %>%
ggplot(aes(x = t, y = Values, group = Algorithm, color = Algorithm)) +
geom_line(size = 1) +
facet_wrap(~grp, scales = "free") +
theme(legend.position="top") +
labs(color="") +
theme(axis.text=element_text(size=14),
axis.title=element_text(size=16),
legend.text=element_text(size=16))
Alternately, you can choose to display the whole data in one chart and create facets with the segments of your choice. Below I show only one segment in which you can zoom in.
plot_df = bind_rows(`All Data` = results,
`Segment (0.9 - 0.99)` = results %>% filter(grp=="0.9 - 0.99"),
.id = "Groups")
plot_df %>%
ggplot(aes(x = t, y = Values, group = Algorithm, color = Algorithm)) +
geom_line(size = 1) +
facet_wrap(~Groups, scales = "free") +
theme(legend.position="top") +
labs(color="") +
theme(axis.text=element_text(size=14),
axis.title=element_text(size=16),
legend.text=element_text(size=16))
It is not a good idea to break scale in one plot as it may lead to wrong interpretations by users.
Edit:
The graph in your question is reproducible using a user-defined scale transform as below.
library(scales)
foo_trans = function() trans_new("foo", function(x) log(1/(1-x)), function(x) -1/exp(x) + 1)
results %>%
ggplot(aes(x = t, y = Values, group = Algorithm, color = Algorithm)) +
geom_line(size = 1) +
theme(legend.position="top") +
labs(color="") + ylab("Values (Transformed Scale)") +
theme(axis.text=element_text(size=14),
axis.title=element_text(size=16),
legend.text=element_text(size=16)) +
scale_y_continuous(breaks = c(0,0.9,0.99,0.999), labels = c(0,0.9,0.99,0.999)) +
coord_trans(y = "foo")
As you can see for your toy example, the y-axis is transformed by the code and no transformation is applied to the data itself. Computationally this can be done, but I would prefer the first solution for presentation. You may want to refer to the additional answers here and here to work out your actual problem.
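As a quick sanity check of the transform above, here is a minimal base R sketch. The names to_scale and from_scale are illustrative only and not part of any package. It confirms that the two functions passed to trans_new are inverses of each other and that the requested breaks land at equal distances on the transformed axis. Note that the transform is only defined for values below 1.

```r
# Forward and inverse functions, copied from foo_trans above
to_scale <- function(x) log(1 / (1 - x))
from_scale <- function(x) -1 / exp(x) + 1

breaks <- c(0, 0.9, 0.99, 0.999)
positions <- to_scale(breaks)

# Each additional "9" moves the break up by the same amount, log(10)
gaps <- diff(positions)

# Round trip: applying the inverse recovers the original breaks
round_trip <- from_scale(positions)
```

Because the gaps are constant, 0, 0.9, 0.99 and 0.999 are drawn as evenly spaced ticks, which matches the layout the question asks for.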

You're looking for scale_y_log10
replace scale_y_continuous(breaks=c(0, 0.9, 0.99, 0.999)) with scale_y_log10(breaks=c(0, 0.9, 0.99, 0.999))
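One thing worth checking before relying on a log10 scale for these particular breaks is where they end up after the transformation. The base R sketch below only evaluates log10 on the break values from the question and involves no plotting.

```r
breaks <- c(0, 0.9, 0.99, 0.999)
log_positions <- log10(breaks)

# log10(0) is -Inf, so the break at 0 has no finite position on a log axis
zero_is_finite <- is.finite(log_positions[1])

# Distances between 0.9, 0.99 and 0.999 on the log10 axis
gaps <- diff(log_positions[-1])
```

The gaps shrink rather than stay constant, so on a log10 axis the breaks 0.9, 0.99 and 0.999 crowd together near the top instead of being evenly spaced.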

Related

How to add percentages on top of a histogram when data is grouped

This is not my data (for confidentiality reasons), but I have tried to create a reproducible example using a dataset included in the ggplot2 library. I have a histogram summarizing the value of some variable by group (a factor with 2 levels). First, I did not want the counts but proportions of the total, so I used this code:
library(ggplot2)
library(dplyr)
df_example <- diamonds %>% as.data.frame() %>% filter(cut=="Premium" | cut=="Ideal")
ggplot(df_example,aes(x=z,fill=cut)) +
geom_histogram(aes(y=after_stat(width*density)),binwidth=1,center=0.5,col="black") +
facet_wrap(~cut) +
scale_x_continuous(breaks=seq(0,9,by=1)) +
scale_y_continuous(labels=scales::percent_format(accuracy=2,suffix="")) +
scale_fill_manual(values=c("#CC79A7","#009E73")) +
labs(x="Depth (mm)",y="Count") +
theme_bw() + theme(legend.position="none")
It gave me this as a result.
The issue is that I would like to print the numeric percentages on top of the bins and haven't found a way to do so.
As I saw it done for printing counts elsewhere, I attempted to print them using stat_bin(), including the same y and label values as the y in geom_histogram, thinking it would print the right numbers:
ggplot(df_example,aes(x=z,fill=cut)) +
geom_histogram(aes(y=after_stat(width*density)),binwidth=1,center=0.5,col="black") +
stat_bin(aes(y=after_stat(width*density),label=after_stat(width*density*100)),geom="text",vjust=-.5) +
facet_wrap(~cut) +
scale_x_continuous(breaks=seq(0,9,by=1)) +
scale_y_continuous(labels=scales::percent_format(accuracy=2,suffix="")) +
scale_fill_manual(values=c("#CC79A7","#009E73")) +
labs(x="Depth (mm)",y="%") +
theme_bw() + theme(legend.position="none")
However, it prints far more values than there are bins, the values are not consistent with the bar heights, and they do not respect vjust=-.5, which should make them appear slightly above the bars.
What am I missing here? I know that if there was no grouping variable/facet_wrap, I could use after_stat(count/sum(count)) instead of after_stat(width*density) and it seems that it would have fixed my issue. But I need the histograms for both groups to appear next to each other. Thanks in advance!
You have to use the same arguments in stat_bin as in geom_histogram when adding your labels, so that both layers use the same binning and the labels align with the bars:
library(ggplot2)
library(dplyr)
df_example <- diamonds %>%
as.data.frame() %>%
filter(cut == "Premium" | cut == "Ideal")
ggplot(df_example, aes(x = z, fill = cut)) +
geom_histogram(aes(y = after_stat(width * density)),
binwidth = 1, center = 0.5, col = "black"
) +
stat_bin(
aes(
y = after_stat(width * density),
label = scales::number(after_stat(width * density), scale = 100, accuracy = 1)
),
geom = "text", binwidth = 1, center = 0.5, vjust = -.25
) +
facet_wrap(~cut) +
scale_x_continuous(breaks = seq(0, 9, by = 1)) +
scale_y_continuous(labels = scales::number_format(scale = 100)) +
scale_fill_manual(values = c("#CC79A7", "#009E73")) +
labs(x = "Depth (mm)", y = "%") +
theme_bw() +
theme(legend.position = "none")
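To see why after_stat(width * density) gives the share of each bin within its own facet, recall that the binning stat computes density as the count divided by the total count times the bin width, separately for each group. The base R sketch below reproduces that arithmetic on made-up values standing in for one facet's data. The vector z and the break positions are assumptions for illustration only.

```r
# Hypothetical values for a single facet
z <- c(0.2, 0.7, 1.1, 1.5, 1.9, 2.4, 3.3, 3.8)
binwidth <- 1
breaks <- seq(0, 4, by = binwidth)

# Count observations per bin
counts <- as.vector(table(cut(z, breaks = breaks)))

# Density as a histogram defines it: scaled so the bars integrate to 1
density <- counts / (sum(counts) * binwidth)

# Multiplying by the bin width turns density back into a proportion
proportion <- density * binwidth
```

Since the proportions sum to 1 within each group, this works with facet_wrap without any manual division by a group total.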

How do I add data labels to a ggplot histogram with a log(x) axis?

I am wondering how to add data labels to a ggplot showing the true value of the data points when the x-axis is in log scale.
I have this data:
date <- c("4/3/2021", "4/7/2021","4/10/2021","4/12/2021","4/13/2021","4/13/2021")
amount <- c(105.00, 96.32, 89.00, 80.84, 121.82, 159.38)
address <- c("A","B","C","D","E","F")
df <- data.frame(date, amount, address)
And I plot it in ggplot2:
plot <- ggplot(df, aes(x = log(amount))) +
geom_histogram(binwidth = 1)
plot + theme_minimal() + geom_text(label = amount)
... but I get the error
"Error: geom_text requires the following missing aesthetics: y"
I have 2 questions as a result:
Why am I getting this error with geom_histogram? Shouldn't it assume to use count as the y value?
Will this successfully show the true values of the data points from the 'amount' column despite the plot's log scale x-axis?
Perhaps like this?
ggplot(df, aes(x = log(amount), y = after_stat(count), label = after_stat(count))) +
geom_histogram(binwidth = 1) +
stat_bin(geom = "text", binwidth = 1, vjust = -0.5) +
theme_minimal()
ggplot2 layers do not (at least in any situations I can think of) take the summary calculations of other layers, so I think the simplest thing would be to replicate the calculation using stat_bin(geom = "text"...
Or perhaps simpler, you could pre-calculate the numbers:
library(dplyr)
df %>%
count(log_amt = round(log(amount))) %>%
ggplot(aes(log_amt, n, label = n)) +
geom_col(width = 1) +
geom_text(vjust = -0.5)
EDIT -- to show buckets without the log transform we could use:
df %>%
count(log_amt = round(log(amount))) %>%
ggplot(aes(log_amt, n, label = n)) +
geom_col(width = 0.5) +
geom_text(vjust = -0.5) +
# log() is the natural logarithm, so exp() maps the buckets back to amounts
scale_x_continuous(labels = ~scales::comma(exp(.)),
minor_breaks = NULL)
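The pre-calculation idea can be checked without any plotting. The base R sketch below uses the amounts from the question, buckets them by rounded natural log, and maps the bucket values back to the original units. Because log() in R is the natural logarithm, the back-transformation used here is exp(). The variable names are illustrative.

```r
amount <- c(105.00, 96.32, 89.00, 80.84, 121.82, 159.38)

# Bucket each amount by its rounded natural log
bucket <- round(log(amount))
counts <- table(bucket)

# Bucket positions on the log axis and the amounts they correspond to
bucket_values <- as.numeric(names(counts))
axis_labels <- round(exp(bucket_values))
```

With this data there are only two buckets, so the labels on the bars are the bucket counts rather than the individual amounts. Showing every true amount would need one label per data point instead of one per bin.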

ggplot boxplot + jitter plot showing random sampling of data

I'd like to use ggplot to generate a series of boxplots derived from all data within a dataset, but then with jittered points showing a random sampling of the respective data (e.g., 100 data points) to avoid over-plotting (there are thousands of data points). Can anyone please help me with the code for this? The basic framework I have now is below, but I don't know what if any arguments can be added to draw a random sampling of data to display as the jittered points. Thanks for any help.
ggplot(datafile, aes(x=factor(var1), y=var2, fill=var3)) + geom_jitter(size=0.1, position=position_jitter(width=0.3, height=0.2)) + geom_boxplot(alpha=0.5) + facet_grid(.~var3) + theme_bw() + scale_fill_manual(values=c("red", "green", "blue"))
You could take a random subset of your data using dplyr:
library(dplyr)
library(ggplot2)
ggplot(data = datafile, aes(x = factor(var1), y = var2, fill = var3)) +
  geom_jitter(
    # use a random subset of the data for the points only
    data = datafile %>% group_by(var1) %>% sample_n(100),
    aes(x = factor(var1), y = var2, fill = var3),
    size = 0.1,
    position = position_jitter(width = 0.3, height = 0.2)) +
  # the boxplots still use the full dataset
  geom_boxplot(alpha = 0.5) +
  facet_grid(.~var3) +
  theme_bw() +
  scale_fill_manual(values = c("red", "green", "blue"))
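If some groups might contain fewer than 100 rows, the per-group sample size needs a cap. As far as I know, sample_n() raises an error when asked for more rows than a group has, unless replace = TRUE is set. The base R sketch below shows one way to take at most n rows per group. The data frame contents and the helper sample_rows are hypothetical and only mirror the column names from the question.

```r
set.seed(42)  # only to make the sketch reproducible

# Hypothetical stand-in for datafile: three groups of unequal size
datafile <- data.frame(
  var1 = rep(c("a", "b", "c"), times = c(500, 50, 300)),
  var2 = rnorm(850)
)

# Draw at most n rows from one group's data frame
sample_rows <- function(d, n) {
  d[sample.int(nrow(d), min(n, nrow(d))), , drop = FALSE]
}

# Split by group, sample each piece, and stack the pieces again
jitter_df <- do.call(
  rbind,
  lapply(split(datafile, datafile$var1), sample_rows, n = 100)
)

group_sizes <- table(jitter_df$var1)
```

The resulting jitter_df could then be passed as the data argument of geom_jitter, while geom_boxplot keeps using the full datafile.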

How to transform a graph on the x-axis to look like example image?

I need to plot the error rate for knn models against 1/k like the following example:
I have the data (error rates and values for 1/k), however, most of the values for 1/k are very close together and difficult to interpret. I have tried setting the tick intervals like below, but I need the ticks to be evenly spaced like in the example. I've spent the better part of two hours looking for a solution but to no avail. Does anyone know the R function that would allow me to do this?
My results
My results with tick intervals
table with results
results table
If needed, here is the Rcode for the plot:
# tick intervals
k_ticks <- c(0.01, 0.02, 0.05, 0.10, 0.20, 0.50, 1.0)
# plot of error rates against 1/k
err_plt <- ggplot(knn_table, aes(x = knn_table$`1.div.k`),
position_dodge()) +
geom_line(aes(y = knn_table$crv.cl.error.rate),
colour = "red") +
geom_line(aes(y = knn_table$cl.error.rate),
colour = "blue") +
xlab("1/K") +
ylab("Error Rate") +
geom_point(aes(y = knn_table$crv.cl.error.rate),
col = "red") +
geom_point(aes(y = knn_table$cl.error.rate),
col = "blue") +
scale_x_continuous(limits = c(0,1),breaks = k_ticks)
The scale on the x-axis in the OP's desired image seems to be log10, so if you can accept slightly-less-than-equal intervals between breaks, using scale_x_continuous with trans = 'log10' should work.
For example, without scaling:
x_vec <- c(0.01, 0.02, 0.05, 0.1, 0.2, 0.5, 1)
test_tbl <- tibble::tibble(x = x_vec,
y = runif(length(x_vec)))
p <- ggplot(test_tbl, aes(x = x, y = y)) +
geom_point()
p
But now with the x-axis scaled by log10:
p +
scale_x_continuous(trans = 'log10',
breaks = x_vec)
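The "slightly-less-than-equal intervals" can be made concrete by looking at where the ticks fall after the log10 transformation. This base R sketch uses the k_ticks values from the question and involves no plotting.

```r
k_ticks <- c(0.01, 0.02, 0.05, 0.10, 0.20, 0.50, 1.0)

# Positions of the ticks on a log10 axis
tick_positions <- log10(k_ticks)

# Distances between neighbouring ticks: log10(2) or log10(2.5)
tick_gaps <- diff(tick_positions)

# The lower limit of 0 used in the question has no finite log10 position
lower_limit_position <- log10(0)
```

The gaps alternate between roughly 0.30 and 0.40, which is why the ticks look nearly but not exactly evenly spaced. The limits = c(0, 1) from the question's code would likely need to be dropped or given a positive lower bound when switching to a log scale.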

Need help on customizing my Odds Ratio (ggplot)!

I've been assigned to create an odds ratio plot with ggplot in R. The plot I'm supposed to create is given below.
Given plot
My job is to figure out codes which creates the exact plots in R. I've done most parts. Here is my work.
My work
Before jumping into my code, it is important to note that I am not using the correct values for boxOdds, boxCILow, and boxCIHigh, since I have not figured out the correct values yet. I wanted to work out the ggplot code first so I can enter the right values as soon as I find them.
This is the code I used:
library(ggplot2)
boxLabels = c("Females/Males", "Student-Centered Prac. (+1)", "Instructor Quality (+1)", "Undecided / STM",
"non-STEM / STM", "Pre-med / STM", "Engineering / STM", "Std. test percentile (+10)",
"No previous calc / HS calc", "College calc / HS calc")
df <- data.frame(yAxis = length(boxLabels):1,
boxOdds =
c(2.23189, 1.315737, 1.22866, 0.8197413, 0.9802449, 0.9786673, 0.6559005, 0.5929812, 0.6923759, 1.3958275),
boxCILow =
c(.7543566,1.016,.9674772,.6463458,.9643047,.864922,.4965308,.3572142, 0.4523759, 1.2023275),
boxCIHigh =
c(6.603418,1.703902,1.560353,1.039654,.9964486,1.107371,.8664225,.9843584, 0.9323759, 1.5893275)
)
(p <- ggplot(df, aes(x = boxOdds, y = boxLabels)) +
geom_vline(aes(xintercept = 1), size = 0.75, linetype = 'dashed') +
geom_errorbarh(aes(xmax = boxCIHigh, xmin = boxCILow), size = .5, height =
0, color = 'gray50') +
geom_point(size = 3.5, color = 'orange') +
theme_bw() +
theme(panel.grid.minor = element_blank()) +
scale_x_continuous(breaks = seq(0,7,1) ) +
ylab('') +
xlab('Odds Ratio') +
annotate(geom = 'text', y =1.1, x = 3.5, label ='',
size = 3.5, hjust = 0) + ggtitle('Estimated Odds of Switching') +
theme(plot.title = element_text(hjust = 0.5, size = 30),
axis.title.x = (element_text(size = 15))) +
theme(panel.grid.minor = element_blank(), panel.grid.major = element_blank())
)
p
Where I'm stuck at:
Removing the small vertical lines at the beginning and end of each row's CI. I was not sure what they are called, so I had a hard time looking it up. SOLVED
I'm also stuck at coloring specific rows in different colors.
The last part I'm stuck at is assigning the proper order of the variables on the y-axis. As you can see in my code (the "boxLabels" part), I have put all the variables in the order of the given plot, but R does not seem to respect that order. So the variable located at the very top is "Undecided / STM" instead of "Females/Males".
How do I decrease the space from 0 to 1? SOLVED
Any help would be appreciated!
First, you probably want ggstance::geom_pointrangeh. Second, you could define colors by yAxis right at the beginning; to color several factors alike, create a new variable group. Third, the ordering is related to your data, where you could assign factor labels. Fourth, remove coord_trans as suggested by @beetroot.
Assign factor labels
# yAxis runs 10:1 over the rows, so value 10 is the first label and value 1 the last
dat$yAxis <- factor(dat$yAxis, levels=1:10, labels=rev(boxLabels))
Create groups
dat$group <- 1
dat$group[which(dat$yAxis %in% c("Females/Males", "Undecided / STM", "non-STEM / STM",
"Pre-med / STM"))] <- 2
dat$group[which(dat$yAxis %in% c("Student-Centered Prac. (+1)",
"No previous calc / HS calc",
"College calc / HS calc"))] <- 3
Colors
colors <- c("#860fc2", "#fc691d", "black")
Plot
library(ggplot2)
library(ggstance)
ggplot(dat, aes(x=boxOdds, y=yAxis, color=as.factor(group))) +
geom_vline(aes(xintercept=1), size=0.75, linetype='dashed') +
geom_pointrangeh(aes(xmax=boxCIHigh, xmin=boxCILow), size=.5,
show.legend=FALSE) +
geom_point(size=3.5, show.legend=FALSE) +
theme_bw() +
scale_color_manual(values=colors)+
theme(panel.grid.minor=element_blank()) +
scale_x_continuous(breaks=seq(0,7,1), limits=c(0, max(dat[2:4]))) +
ylab('') +
xlab('Odds Ratio') +
annotate(geom='text', y =1.1, x=3.5, label ='',
size=3.5, hjust=0) + ggtitle('Estimated Odds of Switching') +
theme(plot.title=element_text(hjust=.5, size=20)) +
theme(panel.grid.minor=element_blank(), panel.grid.major=element_blank())
Gives
Data
dat <- structure(list(yAxis = 10:1, boxOdds = c(2.23189, 1.315737, 1.22866,
0.8197413, 0.9802449, 0.9786673, 0.6559005, 0.5929812, 0.6923759,
1.3958275), boxCILow = c(0.7543566, 1.016, 0.9674772, 0.6463458,
0.9643047, 0.864922, 0.4965308, 0.3572142, 0.4523759, 1.2023275
), boxCIHigh = c(6.603418, 1.703902, 1.560353, 1.039654, 0.9964486,
1.107371, 0.8664225, 0.9843584, 0.9323759, 1.5893275)), class = "data.frame", row.names = c(NA,
-10L))
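Regarding the y-axis order: ggplot2 draws a discrete axis in the order of the factor levels, with the first level at the bottom. A minimal base R sketch of that idea follows, using only the first three labels and odds values from the question. Building the factor directly from boxLabels is an alternative to relabelling yAxis and is an assumption of this sketch, not what the answer above does.

```r
boxLabels <- c("Females/Males", "Student-Centered Prac. (+1)",
               "Instructor Quality (+1)")
df <- data.frame(boxOdds = c(2.23189, 1.315737, 1.22866))

# Reverse the level order so the first label ends up at the top of the axis
df$label <- factor(boxLabels, levels = rev(boxLabels))

# The last level is the one drawn at the top of a discrete y-axis
top_label <- levels(df$label)[nlevels(df$label)]

# Each row keeps its own label; only the level order changes
row_labels <- as.character(df$label)
```

Mapping y = label in aes() should then list "Females/Males" at the top while each row keeps its own odds ratio and interval.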
