I am making correlation graphs between two continuous variables using ggscatter from the ggpubr package. I am using the Kendall rank coefficient, with the p-value automatically added to the graph. I want to use scale_y_log10() since there is a large spread in one of the measurements. However, adding scale_y_log10() to the code changes the p-value.
Sample data:
sampledata <- structure(list(ID = c(1, 2, 3, 4, 5), Measure1 = c(10, 10, 50, 0, 100), Measure2 = c(5, 3, 40, 30, 20), timepoint = c(1, 1,1, 1, 1), time = structure(c(18628, 19205, 19236, 19205, 19205), class = "Date"), event = c(1, 1, NA, NA, NA), eventdate = structure(c(18779,19024, NA, NA, NA), class = "Date")), row.names = c(NA, -5L), class = "data.frame")
Graph without scale_y_log10()
ggscatter(data = sampledata, x = "Measure2", y = "Measure1",
add = "reg.line", conf.int = TRUE,
cor.coef = TRUE, cor.method = "kendall",
xlab = "measure2", ylab = "measure1", color="#0073C2FF" )
As you can see, R=0.11, P=0.8
When adding scale_y_log10()
ggscatter(data = sampledata, x = "Measure2", y = "Measure1",
add = "reg.line", conf.int = TRUE,
cor.coef = TRUE, cor.method = "kendall",
xlab = "measure2", ylab = "measure1", color="#0073C2FF" ) + scale_y_log10()
R=0.55 and P=0.28.
This is just some sample data and not my actual data.
Can anyone help me figure this out?
The reason why your p-value changes is that one of your y values (in the variable Measure1) is 0. When you perform a log transform, this 0 becomes minus infinity. It cannot be shown on the plot and is therefore removed from the plotting data. If you run ggscatter without this data point, you will see that you get the same values as you do with the log transform:
ggscatter(data = subset(sampledata, Measure1 > 0),
x = "Measure2", y = "Measure1",
add = "reg.line", conf.int = TRUE,
cor.coef = TRUE, cor.method = "kendall",
xlab = "measure2", ylab = "measure1", color="#0073C2FF" )
You can also see that the y values of the confidence interval extend below 0, so the confidence interval in your log-transformed plot is not the same as the confidence interval in your non-transformed plot: the geom_smooth layer is effectively running its linear regression on the log-transformed data, which probably isn't what you intended.
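If you want to see exactly which line the log-scaled plot is fitting, it corresponds roughly to this model (a sketch to illustrate the point above, not something ggscatter reports):
# regression on log10(Measure1) with the zero row dropped - effectively the line
# that the reg.line layer draws once scale_y_log10() is applied
lm(log10(Measure1) ~ Measure2, data = subset(sampledata, Measure1 > 0))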
As with many ggplot extensions that make creating simple plots easier, if you want to do something unusual (like excluding 0 or negative values when adding a log scale), you can't do it within that framework, and you therefore need to go back to vanilla ggplot to achieve what you want.
For example, you can create the points, line and ribbon, but excluding 0 or negative values like this:
# fit the untransformed linear model that the original (non-log) plot shows
mod <- lm(Measure1 ~ Measure2, data = sampledata)
# x values running forwards then backwards, so the ribbon can be drawn as a polygon
xvals <- seq(3, 40, length.out = 100)
xvals <- c(xvals, rev(xvals))
preds <- predict(mod, newdata = data.frame(Measure2 = xvals), se.fit = TRUE)
lower <- preds$fit - 1.96 * preds$se.fit
upper <- preds$fit + 1.96 * preds$se.fit
# clip the lower bound at 1 so the ribbon survives a log10 axis
lower[lower < 1] <- 1
pred_df <- data.frame(Measure2 = xvals, Measure1 = preds$fit)
polygon <- data.frame(Measure2 = xvals,
                      Measure1 = c(lower[1:100], upper[101:200]))
# Kendall correlation on all five points, for the annotation
ct <- cor.test(sampledata$Measure2, sampledata$Measure1, method = "kendall")
Now we can safely plot the data and style it to look like ggscatter:
p <- ggplot(subset(sampledata, Measure1 > 0),
aes(Measure2, Measure1)) +
geom_polygon(data = polygon, fill = "#0073c2", alpha = 0.5) +
geom_point(color = "#0073c2", size = 2) +
geom_line(data = pred_df, color = "#0073c2", size = 1) +
annotate("text", hjust = 0, x = min(sampledata$Measure2), y = 50, size = 5,
label = paste0("R = ", sprintf("%1.2f", ct$estimate), ", p = ",
sprintf("%1.2f", ct$p.value))) +
theme_classic(base_size = 16)
p
Except now we can safely log transform the output:
p + scale_y_log10(limits = c(1, 1000))
Related
I am trying to add a vline to a predicted plot produced with the sjPlot::plot_model function. I have a panelled graph, and for each condition (baseline, autonomous and fairness) there is a different vline I want to depict: essentially the point of divergence where the CI bands no longer overlap.
The issue I have is that I can get the lines on there, but all three lines appear in each panel rather than a unique line for each panel.
Here is the code I use to create my model:
model3 <- glmer (reject ~ (1|sn) + condition*dist*actor.age.years +
block + actor.gender,
data = en.long.ai, family = binomial,
nAGQ = 1,
control = glmerControl(optimizer = "optimx", optCtrl = list(method = "bobyqa")))
Then I do this to create my plot:
ai.pred.plot <- sjPlot::plot_model(model3, type = "pred", terms = c( "actor.age.years", "dist", "condition"), title = "AI: Predicted probability of rejections",
se = TRUE,
legend.title = "allocation type",
show.data = F,
colors = "Set2")
print (ai.pred.plot + labs(y = "Probability of rejection") + labs(x = "Age (years)"))
Here is what I get: a predicted plot of rejections by condition and age.
Then I try to make the vline. I've tried many methods....
#create a data frame
#one attempt
Primary <- c(6.6, 7.4, 4.75)
grid <- c("Baseline", "Autonomous", "Fairness")
treat <- data.frame(grid, Primary, stringsAsFactors = FALSE)
#another attempt
vline.data <- data.frame(z = c(6.6, 7.2, 4.5),
                         condition = c("Baseline", "Autonomous", "Fairness"))
# then I try to add it to the plot
ai.pred.plot + geom_vline(aes(xintercept = z), vline.data, linetype = "dashed", color = "grey30") +
guides(fill = FALSE)
#or
print (ai.pred.plot + geom_vline(data=treat,aes(xintercept = Primary), linetype = "dotdash") )
I always get this: the predicted plot with all 3 vlines drawn in every panel.
But I want one line per panel. Please help.
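Not a full answer for the sjPlot object itself, but here is a minimal plain-ggplot2 sketch (made-up data and column names) of the behaviour being asked for: geom_vline draws one line per panel when the vline data frame contains the facetting variable, with levels spelled exactly like the panel labels. For the plot_model output you would first need to check what the facetting column is actually called, e.g. by inspecting ai.pred.plot$data.
library(ggplot2)
# made-up data, just to illustrate per-panel vlines
dat <- data.frame(x = rep(1:10, 3),
                  y = rnorm(30),
                  condition = rep(c("Baseline", "Autonomous", "Fairness"), each = 10))
vline.data <- data.frame(z = c(6.6, 7.2, 4.5),
                         condition = c("Baseline", "Autonomous", "Fairness"))
ggplot(dat, aes(x, y)) +
  geom_line() +
  facet_wrap(~ condition) +
  # one dashed line per panel, matched through the shared "condition" column
  geom_vline(data = vline.data, aes(xintercept = z),
             linetype = "dashed", color = "grey30")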
I have created the following model and predictions, but I'm having trouble with the code to plot the predictions. I think it's a dimensions issue; does anyone know the changes I need to make for this to work?
Code used:
#variogram
summer_vario = variog(geo_summer_df2, option = 'bin', estimator.type='modulus', bin.cloud = TRUE)
#fitting a basic parametric model
default_summer_mod = variofit(summer_vario)
#creating predictions
preds_grid = matrix(c(-5.697, 55.441, -0.807, 51.682, -5.328, 50.218, -2.451, 54.684, -4.121, 50.355, -1.586, 54.768, -0.131, 51.505, -4.158, 52.915,
-0.442, 53.875, -3.413, 56.214, -2.860, 54.076, -3.323, 57.711, 0.566, 52.651, -0.626, 54.481, -1.185, 60.139, -2.643, 51.006,
-1.491, 53.381, -1.536, 52.424, -6.319, 58.213, -1.992, 51.503), nrow = 20, byrow = TRUE)
summer_preds = krige.conv(geo_summer_df2, locations = preds_grid, krige = krige.control(obj.model = default_summer_mod))
#plotting predictions
#mean
image(summer_preds, col = viridis::viridis(100), zlim = c(100, max(c(summer_preds$predict))),
coords.data = geo_summer_df2[1]$coords, main = 'Mean', xlab = 'x', ylab = 'y',
x.leg = c(700, 900), y.leg = c(20, 70))
#variation
image(summer_preds, values = summer_preds$krige.var, col = heat.colors(100)[100:1],
zlim = c(0,max(c(summer_preds$krige.var))), coords.data = geo_summer_df2[1]$coords,
main = 'Variance', xlab = 'x', ylab = 'y', x.leg = c(700, 900), y.leg = c(20, 70))
Data used:
https://drive.google.com/file/d/1ngwto6hgqCumoDsStOtPoG2J5EbmqxDf/view?usp=sharing
https://drive.google.com/file/d/1s9yBHsgaFRlF38CgiXCf_vum1DyhEbz4/view?usp=sharing
Data changes made before the code at the top of the page:
library(dplyr)
library(tidyr)
library(geoR)
#converting data to long format and combining both dataframes
MaxTemp %>%
  pivot_longer(., Machrihanish:Lyneham, names_to = "Location") %>%
  full_join(., metadata) -> MaxTemp_df
#renaming value column to temperature
MaxTemp_df = MaxTemp_df %>%
rename(Temp = 'value')
#filtering data for summer months
summer_df = MaxTemp_df %>%
filter(Date >= 20200701 & Date <=20200731)
#converting our data to geodata
geo_summer_df = as.geodata(summer_df, coords.col = 4:5, data.col = 3)
geo_summer_df2 = jitterDupCoords(geo_summer_df, max = 0.1, min = 0.05)
You're right about the dimensions. The predictions should be made over a regular grid of locations if you want to plot them as an image. Get all the unique x co-ordinates and all the unique y co-ordinates, sort them, then use expand.grid to get x, y co-ordinates for the whole grid. You'll then need to use this for kriging.
When you come to drawing the image, you need to arrange the predictions into a matrix:
xvals <- sort(unique(preds_grid[,1]))
yvals <- sort(unique(preds_grid[,2]))
preds_grid <- as.matrix(expand.grid(xvals, yvals))
colnames(preds_grid) <- NULL
summer_preds = krige.conv(geo_summer_df2, locations = preds_grid,
krige = krige.control(obj.model = default_summer_mod))
image(xvals, yvals, matrix(summer_preds$predict, nrow = length(xvals)),
col = viridis::viridis(100), main = 'Mean', xlab = 'x', ylab = 'y')
image(xvals, yvals, matrix(summer_preds$krige.var, nrow = length(xvals)),
col = heat.colors(100)[100:1], main = 'Variance', xlab = 'x', ylab = 'y')
Note that you will get better images if you use a finely-spaced sequence for x and y:
xvals <- seq(-7, 1, 0.1)
yvals <- seq(50, 62, 0.1)
The same plotting code then produces much smoother mean and variance images.
Update - using ggplot
The following adds the data to an outline of the British Isles:
devtools::install_github("ropensci/rnaturalearthhires")
library(rnaturalearth)
library(ggplot2)
xvals <- seq(-7, 1, 0.1)
yvals <- seq(50, 62, 0.1)
preds_grid <- as.matrix(expand.grid(xvals, yvals))
summer_preds <- krige.conv(
geo_summer_df2, locations = preds_grid,
krige = krige.control(obj.model = default_summer_mod))
df <- as.data.frame(cbind(preds_grid,
mean = summer_preds$predict,
var = summer_preds$krige.var))
gb <- sf::st_crop(ne_coastline(scale = 10, returnclass = 'sf'),
xmin = -7, xmax = 1, ymin = 50, ymax = 62)
ggplot(gb) +
geom_tile(data = df, aes(Var1, Var2, fill = mean),
width = 0.11, height = 0.11, size = 0) +
geom_sf() +
scale_fill_viridis_c() +
ggtitle('Mean')
ggplot(gb) +
geom_tile(data = df, aes(Var1, Var2, fill = var),
width = 0.11, height = 0.11, size = 0) +
geom_sf() +
scale_fill_gradientn(colors = heat.colors(100, rev = TRUE)) +
ggtitle('Variance')
I am trying to get a Spearman correlation for two scatter plots with confidence intervals on the same graph, but it is proving difficult to do.
Given the following data:
spentWithTool <- sample(1:7, 20, replace = TRUE)
understoodWithTool <- sample(1:5, 20, replace = TRUE)
spentWithoutTool <- sample(1:4, 10, replace = TRUE)
understoodWithoutTool <- sample(1:5, 10, replace = TRUE)
This is the best workaround I came up with, but it is not what I want: I want the withTool and the withoutTool plots to be on the same graph and NOT side by side.
plot_with <- ggscatter(data = data.frame(spentWithTool, understoodWithTool),
                       x = 'spentWithTool',
                       y = 'understoodWithTool',
                       color = 'darkred',
                       cor.method = "spearman",
                       cor.coef = TRUE,
                       conf.int = TRUE,
                       add = "reg.line",
                       add.params = list(color = 'red'))
plot_without <- ggscatter(data = data.frame(spentWithoutTool, understoodWithoutTool),
                          x = 'spentWithoutTool',
                          y = 'understoodWithoutTool',
                          color = 'darkblue',
                          cor.method = "spearman",
                          cor.coef = TRUE,
                          conf.int = TRUE,
                          add = "reg.line",
                          add.params = list(color = 'blue'))
ggarrange(plotlist = list(plot_with, plot_without))
The above gives me the two plots side by side.
Using ggpar(p = list(plot_with, plot_without)) just created two separate graphs so it was not helpful.
I am trying to get two lines (red and blue), where each line represents withTool and withoutTool respectively and the x and y axis represent timeSpent and understood respectively. Ideally, the confidence intervals, would be the same colour as what the line represents (red for withTool and blue for withoutTool).
Is there a way to get both correlation lines and points on the same graph?
Aside: Is there a way to use ggscatter without creating a data frame from vectors?
After checking out the docs and trying several options using the color and ggp arguments of ggscatter, IMHO the easiest and least time-consuming way to achieve your desired result is to build the plot from scratch using ggplot2, with some support from ggpubr to add the correlation statistics and the theme:
set.seed(1)
spentWithTool <- sample(1:7, 20, replace = TRUE)
understoodWithTool <- sample(1:5, 20, replace = TRUE)
spentWithoutTool <- sample(1:4, 10, replace = TRUE)
understoodWithoutTool <- sample(1:5, 10, replace = TRUE)
library(ggplot2)
library(ggpubr)
df <- rbind.data.frame(
data.frame(x = spentWithTool, y = understoodWithTool, id = "with"),
data.frame(x = spentWithoutTool, y = understoodWithoutTool, id = "without")
)
ggplot(df, aes(x, y, color = id, fill = id)) +
geom_point() +
geom_smooth(method = "lm") +
stat_cor(method = "spearman") +
scale_color_manual(values = c(with = "red", without = "blue"), aesthetics = c("color", "fill")) +
theme_pubr() +
labs(x = "timeSpent", y = "understood")
#> `geom_smooth()` using formula = 'y ~ x'
Simplifying, and trying to make my error reproducible, my code is the following; it generates a histogram and then sets the x and y axes:
set.seed(100)
dist <- data.frame(rnorm(200, sd = 300000))
histogram <- hist(dist$rnorm.200., col = "orange", breaks = 100, main = "VaR distribution", xlab = "P&L", axes = FALSE)
axis(1, at = seq(min(dist), max(dist), length = 7))
labels(formatC(seq(min(dist$rnorm.200.)/1000000, max(dist$rnorm.200.)/1000000, length = 7), format = "d", digits = 0))
axis(2, at = seq(from = 0, to = max(histogram$counts), by = 5))
labels(formatC(seq(from = 0, to = max(histogram$counts), by = 5), format = "d"))
The command to set labels on axis y works, but the x axis labels command doesn't, which is the following:
axis(1, at = seq(min(dist), max(dist), length = 7))
labels(formatC(seq(min(dist$rnorm.200.)/1000000, max(dist$rnorm.200.)/1000000, length = 7), format = "d", digits = 0))
Instead of getting a sequence of 7 values of dist$rnorm.200. divided by 1000000 and without decimals, I get the default axis labels.
Could anyone help me?
Edit: Neither the y-axis labels command nor the x one actually works; I thought it did in my original code because the values happened to match by coincidence.
You should use labels as an argument of the axis() function, not as a separate function. Something like this:
set.seed(100)
dist <- data.frame(rnorm(200, sd = 300000))
histogram <- hist(dist$rnorm.200., col = "orange", breaks = 100, main = "VaR distribution", xlab = "P&L", axes = FALSE)
axis(1, at = seq(min(dist), max(dist), length = 7),
     labels = formatC(seq(min(dist$rnorm.200.) / 1000000, max(dist$rnorm.200.) / 1000000,
                          length = 7), format = "d", digits = 0))
axis(2, at = seq(from = 0, to = max(histogram$counts), by = 5),
     labels = formatC(seq(from = 0, to = max(histogram$counts), by = 5), format = "d"))
Also, you should realize that formatC(seq(min(dist$rnorm.200.)/1000000, max(dist$rnorm.200.)/1000000, length = 7), format = "d", digits = 0) only returns zeroes, so perhaps you'd like to give some more attention to what these labels should actually be. (Perhaps divide by 100000 instead?)
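For example, the same x-axis call with that smaller divisor would be (just the suggestion above spelled out, otherwise unchanged):
axis(1, at = seq(min(dist), max(dist), length = 7),
     labels = formatC(seq(min(dist$rnorm.200.) / 100000, max(dist$rnorm.200.) / 100000,
                          length = 7), format = "d", digits = 0))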
Here is a ggplot2 approach. I myself find the code more readable, and so easier to maintain.
set.seed(100)
dist <- data.frame(rnorm(200, sd = 300000))
library(ggplot2)
ggplot(dist, aes( x = dist[,1] ) ) +
geom_histogram( bins = 100, color = "black", fill = "orange" ) +
scale_x_continuous( labels = function(x){x / 100000} ) +
labs( title = "VaR distribution",
x = "P&L",
y = "Frequency" )
I am using a copula to look at the probability of occurrence of events based on the duration and magnitude of the events. I can create contours for recurrence intervals with observed and simulated data in base R graphics, but I can't figure out how to reproduce them in ggplot2. You may be wondering why I don't just produce the graphs in base graphics and move on: I'm including the graphs in a short summary report and want consistency with numerous other graphs in the report. Below is some example code. I know that using the location, scale, and shape of the GEV distribution to create random deviates is not ideal, but it is the best way I could think of to create a somewhat reproducible example, despite the poor correlation at the end. In base R, the contours are generated from a matrix of simulated data. Is this possible in ggplot2?
library(evd)
library(copula)
dur <- rgev(500, 2.854659, 2.170122, -0.007829)
mag <- rgev(500, 0.02482, 0.01996, 0.04603)
fDurGev <- fgev(dur)
fMagGev <- fgev(mag)
durVec <- dgev(dur, fDurGev[[1]][1], fDurGev[[1]][2], fDurGev[[1]][3])
magVec <- dgev(mag, fMagGev[[1]][1], fMagGev[[1]][2], fMagGev[[1]][3])
durMagMat <- as.matrix(cbind(duration = durVec, magnitude = magVec))
theta <- coef(fitCopula(claytonCopula(dim = 2), durMagMat, method = "itau"))
clayCop <- claytonCopula(theta, dim = 2)
fCopDurMag <- pCopula(durMagMat, clayCop)
copPts <- data.frame(duration = dur, magnitude = mag, copNEP = fCopDurMag,
copEP = (1 - fCopDurMag), copRI = (1 / fCopDurMag))
fSim <- seq(0.05, 0.99998, length.out = 1000)
quaDur <- qgev(fSim, fDurGev[[1]][1], fDurGev[[1]][2], fDurGev[[1]][3])
quaMag <- qgev(fSim, fMagGev[[1]][1], fMagGev[[1]][2], fMagGev[[1]][3])
expDurMagMat <- cbind(expand.grid(fSim, fSim)$Var1,
                      expand.grid(fSim, fSim)$Var2)
simPred <- pCopula(expDurMagMat, clayCop)
simPredMat <- matrix(simPred, 1000, 1000)
simDF <- data.frame(simDur = quaDur, simMag = quaMag, simPredMat)
rndPred <- data.frame(rCopula(5000, clayCop))
rndPred$rndDur <- qgev(rndPred[,1], fDurGev[[1]][1], fDurGev[[1]][2],
fDurGev[[1]][3])
rndPred$rndMag <- qgev(rndPred[,2], fMagGev[[1]][1], fMagGev[[1]][2],
fMagGev[[1]][3])
RI <- c(1.25, 2 ,5, 10, 20, 50, 100, 200, 500)
NEP <- 1 - (1 / RI)
plot(rndPred$rndDur, rndPred$rndMag, col = "light grey", cex = 0.5,
     xlab = "Duration (time)", ylab = "Magnitude (x)")
points(copPts[, 1], copPts[, 2], col = "red", cex = 0.5)
contour(simDF$simDur, simDF$simMag, simPredMat, levels = NEP, labels = RI,
        xaxs = 'i', yaxs = 'i', labcex = 0.6, lwd = 1, col = "black",
        add = TRUE, method = "flattest", vfont = c("sans serif", "plain"))
And now for my attempt to recreate in ggplot2 (which fails to draw contours).
library(dplyr)
simDF <- data.frame(dur = expDurMagMat[, 1], mag = expDurMagMat[, 2], NEP = simPred)
simDF <- simDF %>%
dplyr::mutate(quaDur = qgev(NEP, fDurGev[[1]][1], fDurGev[[1]][2], fDurGev[[1]][3])) %>%
dplyr::mutate(quaMag = qgev(NEP, fMagGev[[1]][1], fMagGev[[1]][2], fMagGev[[1]][3]))
library(ggplot2)
ggplot(data = rndPred, aes(x = rndDur, y = rndMag)) +
geom_point(color = "light grey", alpha = 0.5) +
labs(x = "Duration (time)", y = "Magnitude (x)") +
geom_point(data = copPts, aes(x = duration, y = magnitude),
color = "red") +
geom_contour(data = simDF, aes(x = quaDur, y = quaMag, z = NEP),
inherit.aes = FALSE, breaks = NEP) +
theme_classic()
Thank you to anyone who can help.
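In case it helps, here is a minimal sketch (entirely made-up values, nothing to do with the copula fit) of the input shape geom_contour expects: a long data frame in which x and y form a regular grid and z holds the value at each grid point.
library(ggplot2)
# hypothetical regular grid - one row per (x, y) combination
grid_df <- expand.grid(x = seq(0, 10, length.out = 100),
                       y = seq(0, 5, length.out = 100))
grid_df$z <- with(grid_df, sin(x) * cos(y))
ggplot(grid_df, aes(x, y, z = z)) +
  geom_contour(breaks = c(-0.5, 0, 0.5)) +
  theme_classic()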