Plot one vs many actual-predicted values scatter plot using R - r

For a sample data frame df, pred_value and real_value represent the monthly predicted and actual values of a variable, and acc_level represents the accuracy level of each prediction compared with the actual value for the corresponding month. The smaller the acc_level, the more accurate the prediction:
df <- structure(list(date = structure(c(1L, 1L, 1L, 1L, 1L, 2L, 2L,
2L, 2L, 3L, 3L, 3L, 3L), .Label = c("2022/3/31", "2022/4/30",
"2022/5/31"), class = "factor"), pred_value = c(2721.8, 2721.8,
2705.5, 2500, 2900.05, 2795.66, 2694.45, 2855.36, 2300, 2799.82,
2307.36, 2810.71, 3032.91), real_value = c(2736.2, 2736.2, 2736.2,
2736.2, 2736.2, 2759.98, 2759.98, 2759.98, 2759.98, 3000, 3000,
3000, 3000), acc_level = c(1L, 1L, 2L, 3L, 3L, 1L, 2L, 2L, 3L,
2L, 3L, 2L, 1L)), class = "data.frame", row.names = c(NA, -13L
))
Out:
date pred_value real_value acc_level
1 2022/3/31 2721.80 2736.20 1
2 2022/3/31 2721.80 2736.20 1
3 2022/3/31 2705.50 2736.20 2
4 2022/3/31 2500.00 2736.20 3
5 2022/3/31 2900.05 2736.20 3
6 2022/4/30 2795.66 2759.98 1
7 2022/4/30 2694.45 2759.98 2
8 2022/4/30 2855.36 2759.98 2
9 2022/4/30 2300.00 2759.98 3
10 2022/5/31 2799.82 3000.00 2
11 2022/5/31 2307.36 3000.00 3
12 2022/5/31 2810.71 3000.00 2
13 2022/5/31 3032.91 3000.00 1
I've plotted the predicted values with code below:
library(ggplot2)
ggplot(df, aes(x=date, y=pred_value, color=acc_level)) +
geom_point(size=2, alpha=0.7, position=position_jitter(w=0.1, h=0)) +
theme_bw()
Beyond what I've done above, how can I also plot the actual values for each month as a red line with red points? Thanks.
Reference:
How to add 4 groups to make Categorical scatter plot with mean segments?

We can add the actuals using additional layers. To make the line show up, we need to specify that the points should be part of the same series.
Because the x axis is discrete, ggplot assumes by default that the data points do not belong to the same group. Alternatively, we could convert the date variable to a date data type, for example with aes(x=as.Date(date)...
library(ggplot2)
ggplot(df, aes(x=date, y=pred_value, color=as.factor(acc_level))) +
geom_point(size=2, alpha=0.7, position=position_jitter(w=0.1, h=0)) +
geom_point(aes(y = real_value), size=2, color = "red") +
geom_line(aes(y = real_value, group = 1), color = "red") +
scale_color_manual(values = c("yellow", "magenta", "cyan"),
name = "Acc Level") +
theme_bw()
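The as.Date alternative mentioned above can be sketched roughly as follows. This is only a sketch under assumptions, not the answerer's code: it rebuilds df from the values in the question, and the helper data frame actuals and the jitter width of 2 (days) are choices made here for illustration. With a continuous date axis the red line needs no group = 1.

```r
library(ggplot2)

# Same values as the question's df, with date stored as character
df <- data.frame(
  date = rep(c("2022/3/31", "2022/4/30", "2022/5/31"), times = c(5, 4, 4)),
  pred_value = c(2721.8, 2721.8, 2705.5, 2500, 2900.05, 2795.66, 2694.45,
                 2855.36, 2300, 2799.82, 2307.36, 2810.71, 3032.91),
  real_value = rep(c(2736.2, 2759.98, 3000), times = c(5, 4, 4)),
  acc_level = c(1L, 1L, 2L, 3L, 3L, 1L, 2L, 2L, 3L, 2L, 3L, 2L, 1L)
)

# Convert to a real Date so ggplot uses a continuous date scale
df$date <- as.Date(df$date, format = "%Y/%m/%d")

# One row per month for the actual values (hypothetical helper)
actuals <- unique(df[, c("date", "real_value")])

p <- ggplot(df, aes(x = date)) +
  # Predictions, jittered horizontally by up to 2 days (assumed width)
  geom_point(aes(y = pred_value, color = factor(acc_level)),
             size = 2, alpha = 0.7,
             position = position_jitter(width = 2, height = 0)) +
  # Actual values as a red line with red points
  geom_line(data = actuals, aes(y = real_value), color = "red") +
  geom_point(data = actuals, aes(y = real_value), color = "red", size = 2) +
  labs(color = "Acc Level") +
  theme_bw()
p
```

Keeping the actual values in their own three-row data frame avoids drawing the same red point several times per month.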

Related

Error: Mapping should be created with aes() or aes_() in Line Graph with Error bars

I'm trying to make a line graph with 2 lines, and error bars at each point. My code is posted below along with some sample data. My question is how to add the error bars to the plot. I can make the plot and calculate the standard error value, but R keeps replying with the message: 'Error: Mapping should be created with aes() or aes_()'. Thanks in advance for any insight!
Code:
#### Load Libraries ####
library(ggplot2)
library(plyr)
library(dplyr)
#### Load Data ####
rm(list = ls())
VisualAcuity <- read.csv(file.choose(), stringsAsFactors = TRUE)
View(VisualAcuity)
summary(VisualAcuity)
#### Make the Plot ####
VA <- ddply(VisualAcuity, c("Dir", "Day"), summarise, Acty=mean(Acuity))
View(VA)
dplyr::summarise(VisualAcuity, std_err=sd(Acuity)/sqrt(n()), n=n()) %>%
ggplot(VA, aes(x=Day, y=Acty, colour=Dir)) +
geom_line() + geom_point() + ylim (min(0), max(0.6)) +
geom_errorbar(aes(ymin=Acty - std_err, ymax = Acty + std_err)) +
ylab('Visual Acuity') + theme(axis.line=element_line(colour='black')) + theme(panel.background = element_blank())
Data:
One possible solution is to calculate the mean and standard deviation outside of ggplot2 using dplyr for example:
library(dplyr)
VisualAcuity %>% group_by(Dir,day) %>%
summarise(Mean = mean(Acuity),
SEM = sd(Acuity)/ sqrt(n()))
# A tibble: 6 x 4
# Groups: Dir [2]
Dir day Mean SEM
<fct> <int> <dbl> <dbl>
1 CCW 1 0.376 0.0347
2 CCW 2 0.395 0.0297
3 CCW 3 0.391 0.00328
4 CW 1 0.392 0.0410
5 CW 2 0.381 0.0348
6 CW 3 0.403 0.0127
Then, you can add the graphical part to get the following plot:
library(dplyr)
library(ggplot2)
VisualAcuity %>% group_by(Dir,day) %>%
summarise(Mean = mean(Acuity),
SEM = sd(Acuity)/ sqrt(n())) %>%
ggplot(aes(x = day, y = Mean, color = Dir, group = Dir))+
geom_line()+
geom_point()+
geom_errorbar(aes(ymin = Mean-SEM, ymax = Mean+SEM), width = 0.2)
An alternative is to use stat_summary as follows:
ggplot(VisualAcuity, aes(x = day, y = Acuity, color = Dir))+
stat_summary(geom = "line", fun = "mean")+
stat_summary(geom = "point", fun = "mean")+
stat_summary(fun.data = mean_se, geom = "errorbar", width = 0.2)
Does it answer your question?
Reproducible example
structure(list(Dir = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L), .Label = c("CCW",
"CW"), class = "factor"), day = c(1L, 1L, 1L, 2L, 2L, 2L, 3L,
3L, 3L, 1L, 1L, 1L, 2L, 2L, 2L, 3L, 3L, 3L), Acuity = c(0.43,
0.386, 0.311, 0.428, 0.422, 0.336, 0.389, 0.397, 0.386, 0.464,
0.389, 0.322, 0.417, 0.414, 0.311, 0.425, 0.403, 0.381)), class = "data.frame", row.names = c(NA,
-18L))
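As a quick check that the two approaches agree, the sketch below compares a hand-computed standard error with what ggplot2's mean_se() returns. It uses only the three Acuity values for Dir = "CCW", day = 1 from the reproducible example; the variable names are made up for illustration.

```r
library(ggplot2)

# Acuity values for Dir = "CCW", day = 1 from the reproducible example
x <- c(0.43, 0.386, 0.311)

# Manual standard error of the mean, as in the dplyr version
sem <- sd(x) / sqrt(length(x))
manual <- c(mean(x) - sem, mean(x), mean(x) + sem)

# mean_se() returns a one-row data frame with columns y, ymin and ymax
auto <- mean_se(x)

# Compare the manual bounds with the ones stat_summary would draw
all.equal(manual, c(auto$ymin, auto$y, auto$ymax))
```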

R stackedBar chart

If this is my dataset.
Surgery Surv_Prob Group
CV 0.5113 Diabetic
Hip 0.6619 Diabetic
Knee 0.6665 Diabetic
QFox 0.7054 Diabetic
CV 0.5113 Non-Diabetic
Hip 0.6629 Non-Diabetic
Knee 0.6744 Non-Diabetic
QFox 0.7073 Non-Diabetic
How do I plot a stacked bar plot like the one below?
Please note the values are already cumulative in nature, so the plot should show a very little increase from CV to Hip (delta = 0.6619- 0.5113)
And the order should be CV -> Hip -> Knee -> QFox
There may be a way to plot the cumulative values directly. However, one approach is to recover the per-step values and plot those as a stacked bar plot, ordering the Surgery data with factor. For the factor levels I have used rev(unique(Surgery)) for convenience, because you want the stacking order to be the reverse of the order in which the values appear in the dataset. For more complex cases you might need to set the levels manually.
library(tidyverse)
df %>%
group_by(Group) %>%
mutate(Surv_Prob1 = c(Surv_Prob[1], diff(Surv_Prob)),
Surgery = factor(Surgery, levels = rev(unique(Surgery)))) %>%
ggplot() + aes(Group, Surv_Prob1, fill = Surgery, label = Surv_Prob) +
geom_bar(stat = "identity") +
geom_text(size = 3, position = position_stack(vjust = 0.5))
data
df <- structure(list(Surgery = structure(c(1L, 2L, 3L, 4L, 1L, 2L,
3L, 4L), .Label = c("CV", "Hip", "Knee", "QFox"), class = "factor"),
Surv_Prob = c(0.5113, 0.6619, 0.6665, 0.7054, 0.5113, 0.6629,
0.6744, 0.7073), Group = structure(c(1L, 1L, 1L, 1L, 2L,
2L, 2L, 2L), .Label = c("Diabetic", "Non-Diabetic"), class =
"factor")), class = "data.frame", row.names = c(NA, -8L))
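To see what the c(Surv_Prob[1], diff(Surv_Prob)) step does, here is a small base R sketch using the Diabetic values from the question. The variable names cumulative and increments are only illustrative.

```r
# Cumulative survival probabilities for the Diabetic group, in the order
# CV, Hip, Knee, QFox (values from the question)
cumulative <- c(CV = 0.5113, Hip = 0.6619, Knee = 0.6665, QFox = 0.7054)

# The first value stays as is; every later value becomes the difference
# from the previous one
increments <- c(cumulative[1], diff(cumulative))
increments

# Stacking the increments gives back the cumulative values
all.equal(unname(cumsum(increments)), unname(cumulative))
```

The Hip segment is then 0.6619 - 0.5113 = 0.1506, which is the small increase the question asks for.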

Assigning tick marks frequency to discrete data axes in facet_grid

I'm having some trouble setting readable tick marks on my axes. The problem is that my data are at different magnitudes, so I'm not really sure how to go about it.
My data include ~400 different products, with 3/4 variables each, from two machines. I've pre-processed it into a data.table and used gather to convert it to long form- that part is fine.
Overview: Data is discrete, each X_________ on the x-axis represents a separate reading, and its relative values from machine 1/2 - the idea is to compare the two. The graphical format is perfect for my needs, I would just like to set the ticks at say, every 10 products on the x-axes, and at reasonable values on the y-axis.
Y_1: from 150 to 250
Y_2: from say, 1.5* to 2.5
Y_3: from say, 0.8* to 2.3
Y_4: from say, 0.4* to 1.5
*Bottom value, rounded down
Here's the code I'm using so far
var.Parameter <- c("Var1", "Var2", "Var3", "Var4")
MProduct$Parameter <- factor(MProduct$Parameter,
labels = var.Parameter)
labels_x <- MProduct$Lot[seq(0, 1626, by= 20)]
labels_y <- MProduct$Value[seq(0, 1626, by= 15)]
plot.MProduct <- ggplot(MProduct, aes(x = Lot,
y = Value,
colour = V4)) +
facet_grid(Parameter ~.,
scales = "free_y") +
scale_x_discrete(breaks=labels_x) +
scale_y_discrete(breaks=labels_y) +
geom_point() +
labs(title = "Product: Select Trends | 2018",
x = "Time (s)",
y = "Value") +
theme(axis.text.x = element_text (angle = 90,
hjust = 1,
vjust = 0.5))
# ggsave("MProduct.png")
plot.MProduct
Does anyone know how to make this graph more readable? Setting labels/breaks manually greatly limits flexibility and readability. There should be an option to set it to every X ticks, right? Same with y.
I need to apply this as a function to multiple datasets, so I'm not very happy about having to specify the column length of the "gathered" dataset every time either, which, in this case is 1626.
Since I'm here, I would also like to take the opportunity to ask about this code:
var.Parameter <- c("Var1", "Var2", "Var3", "Var4")
More often than not, I need to label my data in a specific order, which is not necessarily alphabetical. R, however, defaults to some kind of odd behaviour whereupon I have to plot and verify that the labels are indeed where they should be. Any clue how I could force them to be presented in order? As it is, my solution is to keep shifting their position in that line of code until it produces the graph correctly.
Many thanks.
Okay. I'm going to ignore the y axis labels because the defaults seem to work just fine as long as you don't try to overwrite them with your custom labels_y thing. Just let the defaults do their work. For the X axis, we'll give a couple options:
(A) Label every N products on the x-axis. Looking at ?scale_x_discrete, we can set the labels to a function that takes all the levels of the factor and returns the labels we want. So we'll write a function that returns a function that keeps every Nth label:
every_n_labeler = function(n = 3) {
function (x) {
ind = ((1:length(x)) - 1) %% n == 0
x[!ind] = ""
return(x)
}
}
Now let's use that as the labeler:
ggplot(df, aes(x = Lot,
y = Value,
colour = Machine)) +
facet_grid(Parameter ~ .,
scales = "free_y") +
geom_point() +
scale_x_discrete(labels = every_n_labeler(3)) +
labs(title = "Product: Select Trends | 2018",
x = "Time (s)",
y = "Value") +
theme(axis.text.x = element_text (
angle = 90,
hjust = 1,
vjust = 0.5
))
You can change the every_n_labeler(3) to (10) to make it every 10th label.
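To see what the labeler returns without plotting anything, it can be called directly on a character vector. The sketch below repeats the function so it runs on its own; the lots vector is a hypothetical stand-in for the factor levels.

```r
# Same labeler as above: keep every n-th label, blank out the rest
every_n_labeler <- function(n = 3) {
  function(x) {
    keep <- (seq_along(x) - 1) %% n == 0
    x[!keep] <- ""
    x
  }
}

# Hypothetical labels standing in for the levels of Lot
lots <- paste0("X", 1:7)

every_n_labeler(3)(lots)
# "X1" ""   ""   "X4" ""   ""   "X7"
```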
(B) Maybe more appropriate, it seems like your x-axis is actually numeric, it just happens to have "X" in front of it, let's convert it to numeric and let the defaults do the labeling work:
df$time = as.numeric(gsub(pattern = "X", replacement = "", x = df$Lot))
ggplot(df, aes(x = time,
y = Value,
colour = Machine)) +
facet_grid(Parameter ~ .,
scales = "free_y") +
geom_point() +
labs(title = "Product: Select Trends | 2018",
x = "Time (s)",
y = "Value") +
theme(axis.text.x = element_text (
angle = 90,
hjust = 1,
vjust = 0.5
))
With your full x range, I imagine that would look nice.
(C) But who wants to read those 9-digit numbers? You're labeling the x-axis "Time (s)", which makes me think it's actually a time, measured in seconds from some start time. I'll make up that your start time is 2010-01-01 and convert these seconds to actual times, and then we get a nice date-time scale:
ggplot(df, aes(x = as.POSIXct(time, origin = "2010-01-01"),
y = Value,
colour = Machine)) +
facet_grid(Parameter ~ .,
scales = "free_y") +
geom_point() +
labs(title = "Product: Select Trends | 2018",
x = "Time (s)",
y = "Value") +
theme(axis.text.x = element_text (
angle = 90,
hjust = 1,
vjust = 0.5
))
If this is the real meaning behind your data, then using a date-time axis is a big step up for readability. (Again, notice that we are not specifying the breaks, the defaults work quite well.)
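The two conversions behind options (B) and (C) can be tried on their own. This is a minimal sketch using the four Lot labels from the sample data; the origin 2010-01-01 is the same made-up start time as above, and tz = "UTC" is an added assumption so the result does not depend on the local time zone.

```r
# Lot labels from the sample data
lot <- c("X180106482", "X180126485", "X180306523", "X180526326")

# (B) strip the leading "X" and convert to numeric
time_s <- as.numeric(sub("^X", "", lot))
time_s

# (C) treat the numbers as seconds since an assumed start time
time_dt <- as.POSIXct(time_s, origin = "2010-01-01", tz = "UTC")
time_dt
```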
Using this data (I subset your sample data down to 2 facets and used dput to make it copy/pasteable):
df = structure(list(Lot = structure(c(1L, 2L, 3L, 4L, 1L, 2L, 3L,
4L, 1L, 1L, 2L, 3L, 4L, 1L, 2L, 3L, 4L, 1L, 1L, 2L, 3L, 4L, 1L,
2L, 3L, 4L, 1L), .Label = c("X180106482", "X180126485", "X180306523",
"X180526326"), class = "factor"), Value = c(201, 156, 253, 211,
178, 202.5, 203.4, 204.3, 205.2, 2.02, 2.17, 1.23, 1.28, 1.54,
1.28, 1.45, 1.61, 2.35, 1.34, 1.36, 1.67, 2.01, 2.06, 2.07, 2.19,
1.44, 2.19), Parameter = structure(c(1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L,
2L, 2L, 2L, 2L, 2L), .Label = c("Var 1", "Var 2", "Var 3", "Var 4"
), class = "factor"), Machine = structure(c(2L, 2L, 2L, 2L, 2L,
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L), .Label = c("Machine 1", "Machine 2"), class = "factor"),
time = c(180106482, 180126485, 180306523, 180526326, 180106482,
180126485, 180306523, 180526326, 180106482, 180106482, 180126485,
180306523, 180526326, 180106482, 180126485, 180306523, 180526326,
180106482, 180106482, 180126485, 180306523, 180526326, 180106482,
180126485, 180306523, 180526326, 180106482)), row.names = c(NA,
-27L), class = "data.frame")

Errorbars in r of two groups ggplot2

I'd like to plot standard deviations of the mean(z)/mean(b) which are grouped by two factors $angle and $treatment:
z= Tracer angle treatment
60 0 S
51 0 S
56.415 15 X
56.410 15 X
b=Tracer angle treatment
21 0 S
15 0 S
16.415 15 X
26.410 15 X
So far I've calculated the mean for each variable based on angle and treatment:
aggmeanz <-aggregate(z$Tracer, list(angle=z$angle,treatment=z$treatment), FUN=mean)
aggmeanb <-aggregate(b$Tracer, list(angle=b$angle,treatment=b$treatment), FUN=mean)
It now looks like this:
aggmeanz
angle treatment x
1 0 S 0.09088021
2 30 S 0.18463353
3 60 S 0.08784315
4 80 S 0.09127198
5 90 S 0.12679296
6 0 X 2.68670392
7 15 X 0.50440692
8 30 X 0.83564470
9 60 X 0.52856956
10 80 X 0.63220093
11 90 X 1.70123025
But when I come to plot it, I can't quite get what I'm after
ggplot(aggmeanz, aes(x=aggmeanz$angle,y=aggmeanz$x/aggmeanb$x, colour=treatment)) +
geom_bar(position=position_dodge(), stat="identity") +
geom_errorbar(aes(ymin=0.1, ymax=1.15),
width=.2,
position=position_dodge(.9)) +
theme(panel.grid.minor = element_blank()) +
theme_bw()
EDIT:
dput(aggmeanz)
structure(list(time = structure(c(1L, 3L, 4L, 5L, 6L, 1L, 2L,
3L, 4L, 5L, 6L), .Label = c("0", "15", "30", "60", "80", "90"
), class = "factor"), treatment = structure(c(1L, 1L, 1L, 1L,
1L, 2L, 2L, 2L, 2L, 2L, 2L), .Label = c("S", "X"), class = "factor"),
x = c(56.0841582902523, 61.2014237854156, 42.9900742785269,
42.4688447229277, 41.3354173870287, 45.7164231791512, 55.3943182966382,
55.0574951462903, 48.1575625699563, 60.5527200655174, 45.8412287451211
)), .Names = c("time", "treatment", "x"), row.names = c(NA,
-11L), class = "data.frame")
> dput(aggmeanb)
structure(list(time = structure(c(1L, 3L, 4L, 5L, 6L, 1L, 2L,
3L, 4L, 5L, 6L), .Label = c("0", "15", "30", "60", "80", "90"
), class = "factor"), treatment = structure(c(1L, 1L, 1L, 1L,
1L, 2L, 2L, 2L, 2L, 2L, 2L), .Label = c("S", "X"), class = "factor"),
x = c(56.26325504249, 61.751655279608, 43.1687113436753,
43.4147408285209, 41.9113698082799, 46.2800894420131, 55.1550995335947,
54.7531592595068, 47.3280215294235, 62.4629068516043, 44.2590192583692
)), .Names = c("time", "treatment", "x"), row.names = c(NA,
-11L), class = "data.frame")
EDIT 2: I calculated the standard dev as follows:
aggstdevz <-aggregate(z$Tracer, list(angle=z$angle,treatment=z$treatment), FUN=sd)
aggstdevb <-aggregate(b$Tracer, list(angle=b$angle,treatment=b$treatment), FUN=sd)
Any thoughts would be much appreciated,
Cheers
As others have noted, you'll need to join the two data frames together. There are also some little quirks in the dput data you showed, so I've renamed some columns to make sure that they join appropriately and match what you've attempted. NOTE: You'll need to name the two means differently so that they don't get merged together or cause conflicts.
names(aggmeanb)[names(aggmeanb) == "x"] = "mean_b"
names(aggmeanb)[names(aggmeanb) == "time"] = "angle"
names(aggmeanz)[names(aggmeanz) == "x"] = "mean_z"
names(aggmeanz)[names(aggmeanz) == "time"] = "angle"
library(plyr)  # join() comes from plyr
joined_data = join(aggmeanb, aggmeanz)
joined_data$divmean = joined_data$mean_b/joined_data$mean_z
> head(joined_data)
angle treatment mean_b mean_z divmean
1 0 S 56.26326 56.08416 1.003193
2 30 S 61.75166 61.20142 1.008991
3 60 S 43.16871 42.99007 1.004155
4 80 S 43.41474 42.46884 1.022273
5 90 S 41.91137 41.33542 1.013934
6 0 X 46.28009 45.71642 1.012330
ggplot(joined_data, aes(factor(angle), divmean)) +
geom_boxplot() +
theme_bw() +
theme(panel.grid.minor = element_blank())
It might be that the data you've included is just a bit of your real data set, but as is there's only one data point per angle-treatment group. However, when you are using a fuller dataset, you can try something like:
ggplot(joined_data, aes(factor(angle), divmean, group = treatment)) +
geom_boxplot() +
facet_grid(.~angle, scales = "free_x")
That will group the boxes by angle and then allow you to fill them by treatment.
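If you would rather not load plyr, the join step can also be sketched with base R's merge(). The two small data frames below contain only the first two rows of the head(joined_data) output above, rounded as printed, so this is an illustration rather than the full data.

```r
# First two rows of each table, using the renamed columns from above
aggmeanb <- data.frame(angle = c("0", "30"), treatment = c("S", "S"),
                       mean_b = c(56.26326, 61.75166))
aggmeanz <- data.frame(angle = c("0", "30"), treatment = c("S", "S"),
                       mean_z = c(56.08416, 61.20142))

# Join on both grouping columns, then compute the ratio of the means
joined_data <- merge(aggmeanb, aggmeanz, by = c("angle", "treatment"))
joined_data$divmean <- joined_data$mean_b / joined_data$mean_z
joined_data
```

Naming the by columns explicitly makes it obvious which columns the rows are matched on.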
Think about the problem in two steps:
1) Create a data frame (say data) which contains all the information you would like to visualize. In this case, this seems to be the two factors (angle, treatment), the mean group differences (say dif) and standard errors (say ste).
2) Visualize this information.
Step 2) will be easy. This should probably produce something very similar to your sketch.
ggplot(data, aes(x=angle, y=dif, colour=treatment)) +
geom_point(position=position_dodge(0.1)) +
geom_errorbar(aes(ymin=dif-ste, ymax=dif+ste), width=.1, position=position_dodge(0.1)) +
theme_bw()
However, at this point, you do not provide enough information to get help with Step 1. Try to include code which produces your original data (or the type of data you have) instead of copy-pasting chunks of your data output or pasting the aggregated data which lacks standard errors.
Combining your two aggregated data frames and generating random numbers for standard error produces the graph below:
#I imported your two aggregated data frames from your dput output.
data <- cbind(aggmeanb, aggmeanz$x, rnorm(11))
names(data) <- c("angle", "treatment", "meanb", "meanz", "ste")
data$dif <- data$meanz - data$meanb
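With the raw data available, real standard errors could replace the random numbers. The following is a rough sketch of part of Step 1 for one variable only, using the four rows of z shown in the question; the helper se() and the column names mean and ste are assumptions for illustration, and the difference between z and b would still have to be computed separately.

```r
# The four rows of z shown in the question
z <- data.frame(
  Tracer = c(60, 51, 56.415, 56.410),
  angle = c(0, 0, 15, 15),
  treatment = c("S", "S", "X", "X")
)

# Standard error of the mean (hypothetical helper)
se <- function(v) sd(v) / sqrt(length(v))

# Mean and standard error per angle/treatment group
means <- aggregate(Tracer ~ angle + treatment, data = z, FUN = mean)
ses <- aggregate(Tracer ~ angle + treatment, data = z, FUN = se)
names(means)[names(means) == "Tracer"] <- "mean"
names(ses)[names(ses) == "Tracer"] <- "ste"

# One row per group with both summaries
data <- merge(means, ses, by = c("angle", "treatment"))
data
```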

R- Plot graph with certain variable

This is what my dataframe looks like:
Persnr Date AmountHolidays
1 55312 X201101 2
2 55312 X201102 4.5
3 55312 X201103 5
etc.
What I want is a graph that shows the amount of holidays (on the y-axis) for each period (Date on the x-axis) for a specific person (Persnr). Basically, it's a pivot graph in R. As far as I know, it is not possible to create such a graph.
Something like this is my desired result:
http://imgur.com/62VsYdJ
Is it possible in the first place to create such a model in R? If not, what is the best way for me to visualise such graph in R?
Thanks in advance.
Something like this could do the trick?
dat <- read.table(text="Persnr Date AmountHolidays
55312 2011-01-01 2
55312 2011-02-01 4.5
55312 2011-03-01 5
55313 2011-01-01 4
55313 2011-02-01 2.5
55313 2011-03-01 6", header=TRUE)
dat$Date <- as.POSIXct(dat$Date)
dat$Persnr <- as.factor(dat$Persnr)
# Build a primary graph
plot(AmountHolidays ~ Date, data = dat[dat$Persnr==55312,], type="l", col="red",
xlim = c(1293858000, 1299301200), ylim=c(0,8))
# Add additional lines to it
lines(AmountHolidays ~ Date, data = dat[dat$Persnr==55313,], type="l", col="blue")
# Build and place a legend
legend(x=as.POSIXct("2011-02-19"), y=2.2, legend = levels(dat$Persnr),fill = c("red", "blue"))
To set X coordinates, you can use either as.POSIXct("YYYY-MM-DD") or as.numeric(as.POSIXct("YYYY-MM-DD")), as I did for the xlim values.
You can try with package ggplot2:
First option
ggplot(dat, aes(x=Date, y=AmountHolidays, group=Persnr)) +
geom_line(aes(colour=Persnr)) + scale_colour_discrete()
or
Second option
ggplot(dat, aes(x=Date, y=AmountHolidays, group=Persnr)) +
geom_line() + facet_grid(~Persnr)
One of the advantages is that you don't need to have a line per Persnr or even to specify (to know) the name or number of Persnr.
Data:
dat <- structure(list(Persnr = structure(c(1L, 1L, 1L, 2L, 2L, 2L), .Label = c("54000",
"55312"), class = "factor"), Date = structure(c(1L, 2L, 3L, 1L,
2L, 3L), .Label = c("2011-01-01", "2011-02-01", "2011-03-01"), class = "factor"),
AmountHolidays = c(5, 4.5, 2, 3, 6, 7)), .Names = c("Persnr",
"Date", "AmountHolidays"), row.names = c(3L, 5L, 6L, 1L, 2L,
4L), class = "data.frame")
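The question's Date column holds labels such as X201101, while both answers start from proper dates. One possible way to convert them is sketched below, assuming each label encodes a year and month and that the first day of the month is an acceptable stand-in; the variable names are illustrative.

```r
# Date labels as they appear in the question
date_label <- c("X201101", "X201102", "X201103")

# Drop the leading "X", append day "01", and parse as a Date
date <- as.Date(paste0(sub("^X", "", date_label), "01"), format = "%Y%m%d")

format(date)
# "2011-01-01" "2011-02-01" "2011-03-01"
```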
