Nomogram plot using ggplot - r

I am trying to reproduce the naive Bayes nomogram described in Nomograms for Visualization of Naive Bayesian Classifier by Mozina. It is a great visualization for looking at Bayes probabilities. I have been searching and trying various things, but with no luck: I am unable to put all the points for a column on one row. I've computed the probabilities and put them in a data frame called df:
structure(list(.id = c("outlook", "outlook", "outlook", "windy",
"windy"), variablevalue = structure(c(1L, 2L, 3L, 5L, 6L), .Label = c("sunny",
"overcast", "rainy", "'All'", "FALSE", "TRUE"), class = "factor"),
prob = c(0.222222222222222, 0.444444444444444, 0.333333333333333,
0.666666666666667, 0.333333333333333)), .Names = c(".id",
"variablevalue", "prob"), row.names = c(1L, 3L, 5L, 11L, 13L), class = "data.frame")
Here's how the chart would look (this chart is all cut and paste):

Does this work?
ggplot(df, aes(prob,.id,label=variablevalue)) +
geom_text() +
xlim(c(0,1))

Related

Error running binomial GAM in mgcv with proportional data

I'm trying to run a GAM on proportional data (numeric between 0 and 1). But I'm getting the warning
In eval(family$initialize) : non-integer #successes in a binomial glm!
Basically I'm modelling the number of occurrences of warm adapted species vs total occurrences of warm and cold adapted species against sea surface temperature and using data from another weather system (NAO) as a random effect, and three other categorical, parametric, variables.
m5 <- gam(prop ~ s(SST_mean) + s(NAO, bs="re") + WarmCold + Cycle6 + Region,
family=binomial, data=DAT_WC, method = "REML")
prop = proportion of occurrences, WarmCold = whether species is warm adapted or cold adapted, Cycle6 = 6 year time period, Region = one of 4 regions. A sample of my dataset is below
structure(list(WarmCold = structure(c(1L, 1L, 1L, 1L, 2L, 2L), .Label = c("Cold",
"Warm"), class = "factor"), Season = structure(c(2L, 2L, 2L,
2L, 2L, 2L), .Label = c("Autumn", "Spring", "Summer", "Winter"
), class = "factor"), Region = structure(c(1L, 2L, 3L, 4L, 1L,
2L), .Label = c("OSPARII_N", "OSPARII_S", "OSPARIII_N", "OSPARIII_S"
), class = "factor"), Cycle6 = structure(c(1L, 1L, 1L, 1L, 1L,
1L), .Label = c("1990-1995", "1996-2001", "2002-2007", "2008-2013",
"2014-2019"), class = "factor"), WC.Strandings = c(18L, 10L,
0L, 3L, 5L, 25L), SST_mean = c(7.4066298185553, 7.49153086390094,
9.28247524767124, 10.8654859624361, 7.4066298185553, 7.49153086390094
), NAO = c(0.542222222222222, 0.542222222222222, 0.542222222222222,
0.542222222222222, 0.542222222222222, 0.542222222222222), AMO = c(-0.119444444444444,
-0.119444444444444, -0.119444444444444, -0.119444444444444, -0.119444444444444,
-0.119444444444444), Total.Strandings = c(23, 35, 5, 49, 23,
35), prop = c(0.782608695652174, 0.285714285714286, 0, 0.0612244897959184,
0.217391304347826, 0.714285714285714)), row.names = c(NA, 6L), class = "data.frame")
From the literature (Zuur, 2009) it seems that a binomial distribution is the best used for proportional data. But it doesn't seem to be working. It's running but giving the above warning, and outputs that don't make sense. What am I doing wrong here?
This is a warning, not an error, but it does indicate that something is not quite right: the binomial distribution has support on the non-negative integers, so it doesn't make sense to pass in non-integer proportions without the sample totals from which they were formed.
You can do this using the weights argument, which in this case should take a vector of integers containing the count total for each observation from which the proportion was computed.
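As a minimal sketch of the weights approach, the block below fits a binomial GAM to simulated proportions. The data frame toy and its columns x, total, succ and prop are made up for illustration and are not the asker's data.

```r
library(mgcv)

# Simulated example data (hypothetical): counts of successes out of known totals
set.seed(1)
n <- 200
toy <- data.frame(x = runif(n, 5, 12),
                  total = sample(5:50, n, replace = TRUE))
toy$succ <- rbinom(n, toy$total, plogis(-4 + 0.45 * toy$x))
toy$prop <- toy$succ / toy$total

# Proportion as the response, with the per-row totals supplied as weights
m_w <- gam(prop ~ s(x), family = binomial, weights = total,
           data = toy, method = "REML")
summary(m_w)
```

Because the totals are supplied, each proportion times its weight is a whole number, so the non-integer warning should not be triggered.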
Alternatively, consider using family = quasibinomial if the mean-variance relationship is right for your data; the warning will go away, but then you will not be able to use AIC and related tools that expect a real likelihood.
If your proportions are true proportions, then consider family = betar to fit a beta regression model, where the conditional distribution of the response has support on real values in the unit interval (0, 1) (but technically not 0 or 1; mgcv will add or subtract a small number to adjust the data if there are 0 or 1 values in the response).
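For the beta regression route, a small sketch on simulated data might look like this. The data frame toy and the variables x and prop are hypothetical, and the response is generated strictly inside (0, 1).

```r
library(mgcv)

# Simulated continuous proportions (hypothetical), strictly between 0 and 1
set.seed(42)
n <- 150
toy <- data.frame(x = runif(n))
mu <- plogis(-1 + 2 * toy$x)
toy$prop <- rbeta(n, mu * 20, (1 - mu) * 20)

# Beta regression GAM with a logit link
m_beta <- gam(prop ~ s(x), family = betar(link = "logit"),
              data = toy, method = "REML")
range(fitted(m_beta))
```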
I also found that, rather than calculating a proportion, passing the successes and failures as a two-column matrix via cbind() removed the warning. Note that the second column must be the number of failures, not the total, e.g.
m8 <- gam(cbind(WC.Strandings, Total.Strandings - WC.Strandings) ~ s(x1) + x2,
family=binomial(link="logit"), data=DAT, method = "REML")

Specify end points for different groups when plotting regression output in R

I'm hoping to get some help with presenting regression outputs for my Masters thesis. I am assessing the impacts of elephants on woody vegetation, particularly in relation to artificial waterholes. In addition to generally declining with distance from waterholes, the impacts differ substantially between the two vegetation types involved.
I've figured out what seems to me a satisfactory way of plotting this using visreg. In the model output shown below, both distance to waterhole and veg type explained damage, hence my attempt to show both. However, the issue is that I only have samples at the furthest distances from waterholes (x-axis) for the red vegetation type. As you can see, the regression line for the blue veg type extends beyond the last points for this vegetation type. Is there any way I can get the blue line to stop at a smaller distance from the waterhole (x-axis value) than the red line, to avoid this?
See code for the model and plot below the visreg plot.
Sample data and code
> dput(vegdata[21:52, c(4,7,33)])
structure(list(distance = c(207L, 202L, 501L, 502L, 1001L, 1004L,
2010L, 1997L, 4003L, 3998L, 202L, 194L, 499L, 494L, 1004L, 1000L,
2008L, 1993L, 4008L, 3998L, 493L, 992L, 1941L, 2525L, 485L, 978L,
1941L, 3024L, 495L, 978L, 1977L, 2952L), vegtype = structure(c(1L,
2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L), .Label = c("teak",
"term"), class = "factor"), toedl = c(35.48031025, 47.30482718,
25.16709533, 22.29360164, 17.6546533, 12.81605101, 20.34136734,
18.45809334, 11.3578081, 3.490830751, 60.54870317, 44.9863128,
18.81010698, 20.4777188, 30.36994386, 18.7417214, 21.52247156,
18.29685939, 30.26217664, 8.945486104, 43.95749178, 43.54799495,
44.42693993, 50.06207783, 48.05538594, 35.31220933, 52.37339094,
40.51569938, 41.45677007, 58.86629306, 37.80203313, 46.35633342
)), row.names = 21:52, class = "data.frame")
m1<-lm(toedl~vegtype+distance, data=vegdata)
summary(m1)
library(visreg)
visreg(m1, 'distance', by='vegtype', overlay=TRUE, gg=TRUE, points=list(size=2.5), ylab='% old elephant damage', xlab='distance from waterhole')
Regarding the comments about a reproducible example: you can just build a small data frame with representative data, as below. As a general point, avoid giving your variables the names of base functions such as all.
I'm not sure whether it's possible to use visreg to do what you want, but you can extract the information from your model using predict, then use ggplot to plot it, which may be preferable because ggplot is really good for customizing plots.
library(ggplot2)
library(visreg)
# Create reproducible data example
allData <- data.frame(vegtype = rep(c("t1", "t2"), each = 10),
oedl = c(seq(from = 35, to = 20, length.out = 10),
seq(from = 20, to = 5, length.out = 10)),
sexactd = c(seq(from = -1, to = 1, length.out = 10),
seq(from = -1, to = 2, length.out = 10)))
# Make linear model
oedl6 <- lm(formula = oedl ~ sexactd + vegtype, data = allData)
# Predict the data using the linear model
odelPred <- cbind(allData, predict(oedl6, interval = 'confidence'))
ggplot(odelPred, aes(sexactd, oedl, color = vegtype, fill = vegtype)) +
geom_point() + geom_line(aes(sexactd, fit)) +
geom_ribbon(aes(ymin = lwr, ymax = upr), alpha = 0.3)
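If smoother lines are wanted than the observed x values give, one possible variation is to predict on a grid that spans only each group's observed range. The sketch below assumes made-up data; the names toy, damage, fit and grid are hypothetical.

```r
library(ggplot2)

# Hypothetical toy data: group "term" is only sampled out to 3000 m
toy <- data.frame(
  vegtype  = rep(c("teak", "term"), each = 5),
  distance = c(200, 500, 1000, 2000, 4000, 200, 500, 1000, 2000, 3000),
  damage   = c(35, 25, 18, 20, 10, 50, 44, 43, 45, 42)
)
fit <- lm(damage ~ vegtype + distance, data = toy)

# One prediction grid per group, limited to that group's observed distances
grid <- do.call(rbind, lapply(split(toy, toy$vegtype), function(d) {
  data.frame(vegtype  = d$vegtype[1],
             distance = seq(min(d$distance), max(d$distance), length.out = 50))
}))
grid <- cbind(grid, predict(fit, newdata = grid, interval = "confidence"))

# Points from the raw data, line and ribbon from the truncated grid
p <- ggplot(toy, aes(distance, damage, colour = vegtype)) +
  geom_point() +
  geom_ribbon(data = grid, inherit.aes = FALSE,
              aes(x = distance, ymin = lwr, ymax = upr, fill = vegtype),
              alpha = 0.3) +
  geom_line(data = grid, aes(y = fit))
p
```

Each fitted line then ends at the last observed distance for its own vegetation type.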
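MR Macarthur's solution is great and deserves to be the accepted answer. Visualising a multiple regression model with several predictors in a two-dimensional graph is difficult: you are essentially limited to one predictor on the x-axis, plus a grouping variable (in your case vegtype). For that, one can simply use geom_smooth, which fits a separate line for each group and, by default, stops each line at the range of that group's data. Note that this fits a separate regression per group rather than the additive model m1.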
Using your data:
library(tidyverse)
ggplot(vegdata, aes(distance, toedl, color = vegtype)) +
geom_point() +
geom_smooth(method = 'lm')
Created on 2019-12-13 by the reprex package (v0.3.0)

How to connect several points with an arrow omitting ggplot2

I made an ordination of a time series of some vegetation data, using the vegan package. Since ordination diagrams often are cluttered with many data points, I extracted the eigenvalues of the first two ordination axes and took the mean of each group. Now I have only one point per site (11 sites in total). To still show some of the variation, I added ellipses with standard deviation and 95% confidence interval:
The last thing I want to do is to connect points of the same group (either A, B or C) with an arrow, indicating direction of change over time. All movement is from right to left.
I initially wanted to use the ordiarrows function in vegan, but this seems to work only when the class is decorana. My class is a factor.
Using ggplot2 does not seem like a valid option as the ordiellipse function (creating the ellipses) does not work there.
code for plotting data:
install.packages("vegan")
library(vegan)
plot(Ord_KIKKER, type = "n", main = "Kikkervalleien",
xlab = "DCA1 Eigenvalue = 0.62", ylab = "DCA2 Eigenvalue = 0.39")
points(Ord_KIKKER, cex = 2, pch = 19,
col = c("black", "black", "black", "red","red", "green", "green", "green", "blue", "blue", "blue"))
The resulting plot looks a bit different since I posted a reduced dataset here.
My data (Ord_KIKKER):
structure(list(DCA1 = c(2.676616032, 0.361181861, -1.363464067,
3.176862449, -0.087190269, 2.059548542, 0.167440366, -0.459090096,
1.571536367, 0.309623788, -0.25787459), DCA2 = c(0.276788721,
0.422077659, 0.181723453, 0.221610649, 0.940063655, -0.116083905,
-0.539375059, -0.545053063, -0.06120542, -0.367148924, -1.679257818
), Unique = structure(c(1L, 5L, 8L, 2L, 9L, 3L, 6L, 10L, 4L,
7L, 11L), .Label = c("2001A", "2001B", "2001C", "2001D", "2008A",
"2008C", "2008D", "2018A", "2018B", "2018C", "2018D"), class = "factor"),
BLOCK = structure(c(1L, 1L, 1L, 2L, 2L, 3L, 3L, 3L, 4L, 4L,
4L), .Label = c("A", "B", "C", "D"), class = "factor")), .Names = c("DCA1",
"DCA2", "Unique", "BLOCK"), class = "data.frame", row.names = c("2001A",
"2008A", "2018A", "2001B", "2018B", "2001C", "2008C", "2018C",
"2001D", "2008D", "2018D"))
vegan::ordiarrows() will work, if you give it only the variables that have scores:
ordiarrows(Ord_KIKKER[,1:2], Ord_KIKKER$BLOCK) # one way
However, you should also remember to have asp=1 in the initial plot to force equal aspect ratio to axes.
I cannot test this fully, because the graph cannot be reproduced with the data you posted: if you call plot(Ord_KIKKER, ...) on a data frame, you do not get an ordinary scatter plot but a panel plot of all variables against each other (a pairs() plot), and the type = "n" argument gives an error. It seems that you used some non-standard graphics tools instead, and I am not sure that the standard R graphics used by vegan::ordiarrows() can be combined with those.
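For comparison, the same kind of arrows can be drawn with base graphics alone. This is a swapped-in technique using arrows(), not vegan::ordiarrows(); it uses a rounded subset of the posted scores (blocks A and C), and the year column and the names ord, segs and cols are added here for illustration.

```r
# Rounded group means from the question, blocks A and C only
ord <- data.frame(
  DCA1  = c(2.677, 0.361, -1.363, 2.060, 0.167, -0.459),
  DCA2  = c(0.277, 0.422, 0.182, -0.116, -0.539, -0.545),
  year  = c(2001, 2008, 2018, 2001, 2008, 2018),
  BLOCK = c("A", "A", "A", "C", "C", "C")
)
cols <- c(A = "black", C = "green")

# asp = 1 forces an equal aspect ratio on both axes
plot(ord$DCA1, ord$DCA2, asp = 1, pch = 19, cex = 2,
     col = cols[as.character(ord$BLOCK)], xlab = "DCA1", ylab = "DCA2")

# Within each block, connect consecutive years: start = year i, end = year i + 1
segs <- do.call(rbind, lapply(split(ord, ord$BLOCK), function(d) {
  d <- d[order(d$year), ]
  n <- nrow(d)
  data.frame(x0 = d$DCA1[-n], y0 = d$DCA2[-n],
             x1 = d$DCA1[-1], y1 = d$DCA2[-1],
             BLOCK = as.character(d$BLOCK[-n]))
}))
arrows(segs$x0, segs$y0, segs$x1, segs$y1, length = 0.1,
       col = cols[segs$BLOCK])
```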

How modify stacked bar chart in ggplot2 so it is diverging

My data (from a likert scale question) looks like this:
head(dat)
    Consideration Importance2           Importance  Percent Count
1 Aesthetic value           1 Not at all important 0.046875     3
2 Aesthetic value           2 Of little importance 0.109375     7
3 Aesthetic value           3 Moderately important 0.250000    16
dput(head(dat,6))
structure(list(Consideration = structure(c(2L, 2L, 2L, 2L, 2L,
12L), .Label = c("", "Aesthetic value", "Affordability/cost-efficiency",
"Climate change reduction", "Eco-sourcing", "Ecosystem services provision",
"Erosion mitigation", "Habitat for native wildlife", "Habitat/species conservation",
"Human use values", "Increasing biodiversity", "Planting native species",
"Restoring ecosystem function", "Restoring to a historical state"
), class = "factor"), Importance2 = c(1L, 2L, 3L, 4L, 5L, 1L),
Importance = structure(c(4L, 5L, 3L, 2L, 6L, 4L), .Label = c("",
"Important", "Moderately important", "Not at all important",
"Of little importance", "Very Important"), class = "factor"),
Percent = c(0.046875, 0.109375, 0.25, 0.375, 0.234375, 0),
Count = c(3L, 7L, 16L, 24L, 15L, 0L), percentage = c(5L,
11L, 25L, 38L, 23L, 0L)), row.names = c(NA, 6L), class = "data.frame")
I've plotted the results using a stacked bar chart. I would like to know how to modify this so it's a diverging stacked bar chart such as the example shown below, with the Importance2 level 3 (moderately important) as the centre.
I know there is a package called likert that can be used for this, but I think my data is not in the correct format.
The code for my existing plot is:
ggplot(dat, aes(x = Consideration, y = Percent, fill = forcats::fct_rev(Importance2))) +
geom_bar(position="fill", stat = "identity", color = "black", size = 0.2, width = 0.8) +
aes(stringr::str_wrap(dat$Consideration, 34), dat$Percent) +
coord_flip() +
labs(y = "Percentage of respondents (%)") +
scale_y_continuous(breaks=c(0, 0.25, 0.50, 0.75, 1), labels=c("0", "25", "50", "75", "100")) +
theme(axis.title.y=element_blank(), panel.background = NULL, axis.text.y = element_text(size=8), legend.title = element_text(size=8), legend.text = element_text(size = 6)) +
scale_fill_manual(name="Scale", breaks=c("1", "2", "3", "4", "5"), labels=c("Not at all important", "Of little importance", "Moderately important","Important", "Very important"), values=col3)
I've tried a couple of solutions, but I think the simplest one is to convert your data for the likert() function, which is quite simple:
library(tidyr)
# you need the data in the wide format
data_l <- spread(dat[,c(1,3,4)], key = Importance, value = Percent)
# now use the Consideration column as row names
row.names(data_l) <- data_l$Consideration
# remove the useless column
data_l <- data_l[,-1]
Now you can use:
library(HH)
likert(data_l , horizontal=TRUE,aspect=1.5,
main="Here the plot",
auto.key=list(space="right", columns=1,
reverse=TRUE, padding.text=2),
sub="Here some words")
You can tweak ggplot to do this, but in that case you do not center by the center of the class you want, but by the "edge" of it.
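If centring on the middle of the neutral class is wanted in ggplot2 after all, one possible workaround is to compute the bar edges by hand and draw them with geom_rect(). The following is only a sketch under stated assumptions: the percentages are made up, and the names lik, xmin, xmax, ypos and items are hypothetical.

```r
library(ggplot2)

# Hypothetical percentages for two considerations on a 5-point scale
lik <- data.frame(
  item  = rep(c("Aesthetic value", "Erosion mitigation"), each = 5),
  level = rep(1:5, times = 2),
  pct   = c(5, 11, 25, 38, 21, 2, 8, 20, 40, 30)
)

# Per item: stack the levels left to right, then shift the whole bar so that
# the midpoint of level 3 sits at zero
lik <- do.call(rbind, lapply(split(lik, lik$item), function(d) {
  d <- d[order(d$level), ]
  right <- cumsum(d$pct)
  left  <- right - d$pct
  mid   <- left[d$level == 3] + d$pct[d$level == 3] / 2
  d$xmin <- left - mid
  d$xmax <- right - mid
  d
}))

# Numeric y positions so geom_rect() can draw one horizontal bar per item
items <- sort(unique(lik$item))
lik$ypos <- match(lik$item, items)

ggplot(lik) +
  geom_rect(aes(xmin = xmin, xmax = xmax,
                ymin = ypos - 0.4, ymax = ypos + 0.4,
                fill = factor(level)), colour = "black") +
  geom_vline(xintercept = 0, linetype = "dashed") +
  scale_y_continuous(breaks = seq_along(items), labels = items) +
  labs(x = "Percentage of respondents (centred on level 3)",
       y = NULL, fill = "Scale")
```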

Excel Dates and R?

I have a short data frame I randomly created to have a practice before it gets to Big Data frames. I made it with the same Variables as the original should be but way shorter.
The problem I'm having is that Excel writes dates with the month first, so R is confused and puts 10/1/2015 first when it's supposed to come later.
What can I do so R correctly orders the dates?
Also I want to for example calculate the Total amount of money (Data$Total) that I made in one month.
What would be the script for that?
Also if I'm already here I could kill two birds with one stone. I know there is already an answer for this, but the answer I saw involves using Direct.labels package that completely messes up with the whole graphic.
What would you advise to prevent the labels going over the plot margin?
DPUT()
dput(Data)
structure(list(JOB = structure(c(2L, 3L, 1L, 3L, 3L), .Label = c("JAGER",
"PLAY", "RUGBY"), class = "factor"), AGENCY = structure(c(1L,
1L, 2L, 1L, 1L), .Label = c("LONDON", "WILHEL"), class = "factor"),
DATE = structure(c(4L, 5L, 1L, 2L, 3L), .Label = c("10/1/2015",
"10/3/2015", "10/9/2015", "9/24/2015", "9/26/2015"), class = "factor"),
RATE = c(90L, 90L, 100L, 90L, 90L), HS = c(8L, 6L, 4L, 6L,
4L), TOTAL = c(720L, 540L, 400L, 540L, 360L)), .Names = c("JOB",
"AGENCY", "DATE", "RATE", "HS", "TOTAL"), class = "data.frame", row.names = c(NA,
-5L))
Here is how I went about what you're after; rugger is the dataset I constructed from your dput():
dates <- as.Date(rugger$DATE, "%m/%d/%Y")  # parse the month/day/year strings
xpos <- rank(dates)                         # chronological position of each row
plot(xpos, rugger$TOTAL, xaxt = "n", xlab = "", ylab = "Total")
axis(side = 1, at = xpos, labels = rep("", length(xpos)))
text(cex = 1, x = xpos + 0.1, y = min(rugger$TOTAL) - 25, labels = dates, xpd = TRUE, srt = 45, pos = 2)
The text() call gives you far more control over the labels: srt sets the rotation angle and xpd = TRUE lets the labels be drawn outside the plot region. I used rank() to turn the parsed dates into chronological positions. The DATE column itself is stored as a factor, so it sorts as text rather than as dates, which is why 10/1/2015 came first.
If you don't want dots, check out the pch argument of plot(), which selects the point symbol.
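For the ordering and the monthly totals asked about in the question, a minimal base R sketch could look like this. The data frame jobs reproduces the DATE and TOTAL values from the posted dput(); the names jobs, MONTH and monthly are introduced here for illustration.

```r
# DATE and TOTAL values taken from the posted dput()
jobs <- data.frame(
  DATE  = c("9/24/2015", "9/26/2015", "10/1/2015", "10/3/2015", "10/9/2015"),
  TOTAL = c(720, 540, 400, 540, 360)
)

# Parse the month/day/year strings into real dates, then sort chronologically
jobs$DATE <- as.Date(jobs$DATE, format = "%m/%d/%Y")
jobs <- jobs[order(jobs$DATE), ]

# Sum TOTAL within each calendar month
jobs$MONTH <- format(jobs$DATE, "%Y-%m")
monthly <- aggregate(TOTAL ~ MONTH, data = jobs, FUN = sum)
monthly
# 2015-09: 1260, 2015-10: 1300
```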
