Combine logistic regression with bar graph for maturity results - r

I am trying to present the results of a logistic regression analysis for the maturity schedule of a fish species. Below is my reproducible code.
#coded with R version 3.0.2 (2013-09-25)
#Frisbee Sailing
rm(list=ls())
library(ggplot2)
library(FSA)
#generate sample data 1 mature, 0 non mature
m<-rep(c(0,1),each=25)
tl<-seq(31,80, 1)
dat<-data.frame(m,tl)
# add some non mature individuals at random in the middle of df to
#prevent glm.fit: fitted probabilities numerically 0 or 1 occurred error
tl<-sample(50:65, 15)
m<-rep(c(0),each=15)
dat2<-data.frame(tl,m)
#final dataset
data3<-rbind(dat,dat2)
ggplot can produce a logistic regression graph showing each of the data points employed, with the following code:
#plot logistic model
ggplot(data3, aes(x=tl, y=m)) +
stat_smooth(method="glm", family="binomial", se=FALSE)+
geom_point()
I want to combine this with the probability of being mature at a given size, which is obtained and plotted with the following code:
#plot proportion of mature
#clump data in 5 cm size classes
l50<-lencat(~tl,data=data3,startcat=30,w=5)
#table of frequency of mature individuals by size
mat<-with(l50, table(LCat, m))
#proportion of mature
mat_prop<-as.data.frame.matrix(prop.table(mat, margin=1))
colnames(mat_prop)<-c("nm", "m")
mat_prop$tl<-as.factor(seq(30,80, 5))
# Bar plot probability mature
ggplot(mat_prop, aes(x=tl,y=m)) +
geom_bar(stat="bin")
What I've been trying to do, with no success, is to make a graph that combines both. Since the axes are the same it should be straightforward, but I can't seem to make it work. I have tried:
ggplot(mat_prop, aes(x=tl,y=m)) +
geom_bar(stat="bin")+
stat_smooth(method="glm", family="binomial", se=FALSE)
but it does not work. Any help would be greatly appreciated. I am new, so I am not able to add the resulting graphs to this post.

I see three problems with your code:
Using stat="bin" in your geom_bar() is inconsistent with giving values for the y-axis (y=m). If you bin, the number of x-values in each interval is counted and used as the y-value, so there is no need to map your data to the y-axis.
The data for the glm-plot is in data3, but your combined plot only uses mat_prop.
The x-axes of the two plots are actually not quite the same. In the bar plot, you use a factor variable on the x-axis, making the axis discrete, while in the glm-plot, you use a numeric variable, which leads to a continuous x-axis.
The following code gave a graph combining your two plots:
mat_prop$tl<-seq(30,80, 5)
ggplot(mat_prop, aes(x=tl,y=m)) +
geom_bar(stat="identity") +
geom_point(data=data3) +
geom_smooth(data=data3,aes(x=tl,y=m),method="glm", family="binomial", se=FALSE)
I could run it after first sourcing your script to define all the variables. The three problems mentioned above are addressed as follows:
I used geom_bar(stat="identity") so that the bar plot uses the values in m directly instead of binning.
I used the data argument in geom_point and geom_smooth so that these parts of the plot use the correct data (data3).
I redefined mat_prop$tl to make it numeric. It is then consistent with the column tl in data3, which is numeric as well.
(I also added the points. If you don't want them, just remove geom_point(data=data3).)
The resulting plot shows the proportion bars with the data points and the fitted logistic curve drawn on top.
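In current ggplot2 releases (2.0.0 and later), geom_smooth() and stat_smooth() take model arguments through method.args rather than a direct family= argument. Here is a minimal sketch of the same combined plot for a newer ggplot2. The fixed seed and the use of findInterval() and aggregate() in place of FSA::lencat() are my assumptions, not part of the original code:

```r
library(ggplot2)

set.seed(1)  # assumption: fixed seed so the sampled lengths are reproducible

# Same shape as the question's data: 25 immature and 25 mature fish (31-80 cm),
# plus 15 extra immature fish between 50 and 65 cm
data3 <- data.frame(
  tl = c(31:80, sample(50:65, 15)),
  m  = c(rep(c(0, 1), each = 25), rep(0, 15))
)

# Assign each fish to a 5 cm class (labelled by its lower bound),
# then compute the proportion mature per class
breaks <- seq(30, 85, by = 5)
data3$lcat <- breaks[findInterval(data3$tl, breaks)]
mat_prop <- aggregate(m ~ lcat, data = data3, FUN = mean)
names(mat_prop) <- c("tl", "m")

# Bars from the class proportions; points and logistic curve from the raw data
p <- ggplot(mat_prop, aes(x = tl, y = m)) +
  geom_bar(stat = "identity") +
  geom_point(data = data3) +
  geom_smooth(data = data3, method = "glm",
              method.args = list(family = "binomial"), se = FALSE)
print(p)
```

As in the answer above, each bar is centred on the lower bound of its size class. Mapping x to tl + 2.5 in the bar layer would centre the bars on the class midpoints instead.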

Related

Plotting a subset of envfit results onto an ordination

I'm working on a figure for a publication where we are looking at a combination of plant coverage and other environmental data on differing communities. I am trying to make a multi-panel figure, with one panel that displays all the envfit results, one that displays only the plants, and one that displays only the other environmental variables. Because of the complexity of the figure, it's actually a little easier to construct with the base plot function than with ggvegan.
My challenge is figuring out how to subset the results of the envfit analysis object for the different panels. A simplified example would be:
library(vegan)
data("mite")
data("mite.env")
set.seed(55)
nmds<-metaMDS(mite)
set.seed(55)
ef<-envfit(nmds, mite.env, permu=999)
plot(ef, p.max = .05)
which produces this figure
For sake of the example, does anyone have suggestions on a way I could create two separate figures, one with only the WatrCont vector and one with only the SubsDens vector? I'm sure there is a way to pull specific results out of the ef object, but my coding is not savvy enough to understand how.
Additionally, is there a way to have the jumble of text at the center not overlap, similar to jitter in ggplot?
Thank y'all for all of your help!
I would suggest extracting the data from nmds and ef and using ggplot to add the required elements to your plots.
Here is an example:
library(vegan)
library(ggplot2)
data("mite")
data("mite.env")
set.seed(55)
nmds<-metaMDS(mite)
set.seed(55)
ef<-envfit(nmds, mite.env, permu=999)
# Get the NMDS scores
nmds_values <- as.data.frame(scores(nmds, display = "sites"))
# Get the coordinates of the vectors produced for continuous predictors in your envfit
vector_coordinates <- as.data.frame(scores(ef, "vectors")) * ordiArrowMul(ef)
# Plot the vectors separately
ggplot(nmds_values,
aes(x=NMDS1, y = NMDS2)) +
geom_point() +
geom_segment(aes(x=0, y=0, xend=NMDS1, yend=NMDS2),
vector_coordinates[1,]) +
geom_text(aes(x=NMDS1,y=NMDS2),
vector_coordinates[1,],
label=row.names(vector_coordinates[1,]))
ggplot(nmds_values,
aes(x=NMDS1, y = NMDS2)) +
geom_point() +
geom_segment(aes(x=0, y=0, xend=NMDS1, yend=NMDS2),
vector_coordinates[2,]) +
geom_text(aes(x=NMDS1,y=NMDS2),
vector_coordinates[2,],
label=row.names(vector_coordinates[2,]))
You can play around with the colours and sizes of the different elements as you see fit. Coordinates for categorical predictors can be extracted in a similar manner.
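The same extraction can be done by name rather than by row position, and scores() with display = "factors" returns the centroids for the categorical predictors. The sketch below assumes a reasonably recent vegan. The helper plot_one_vector() is hypothetical, and the arrows are left unscaled (no ordiArrowMul()) to keep the example short:

```r
library(vegan)
library(ggplot2)

data("mite")
data("mite.env")
set.seed(55)
nmds <- metaMDS(mite, trace = FALSE)
set.seed(55)
ef <- envfit(nmds, mite.env, permutations = 999)

# Site scores, vector end points (continuous predictors)
# and centroids (categorical predictors)
site_scores <- as.data.frame(scores(nmds, display = "sites"))
vectors     <- as.data.frame(scores(ef, display = "vectors"))
centroids   <- as.data.frame(scores(ef, display = "factors"))

# Hypothetical helper: draw the ordination with a single named envfit vector
plot_one_vector <- function(name) {
  v <- vectors[name, , drop = FALSE]
  v$label <- name
  ggplot(site_scores, aes(x = NMDS1, y = NMDS2)) +
    geom_point() +
    geom_segment(data = v,
                 aes(x = 0, y = 0, xend = NMDS1, yend = NMDS2),
                 arrow = arrow(length = unit(0.2, "cm"))) +
    geom_text(data = v, aes(label = label), vjust = -0.5)
}

p_watr <- plot_one_vector("WatrCont")
p_subs <- plot_one_vector("SubsDens")
print(p_watr)
print(p_subs)
```

Selecting rows by name keeps the code correct if the column order of mite.env changes. The centroids data frame can be added to either plot with another geom_text() layer.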

R plotting proportions problem - ggplot making plot that looks like a table

I am trying to make a plot of proportions of a binomial distribution (yes/no) depending on one ordinal and one continuous variable. Somehow, when I include the continuous one as the color of the dots, the appearance of the plot changes radically. Can someone help me include the third variable without the plot turning into the table-like result below?
Code as follows:
#making table with proportions of people who switch (1),
## after arsenic level and education.
educ_switch <- prop.table(table(welldata$educ[welldata$switch==1],
welldata$arsenic[welldata$switch==1],
welldata$switch[welldata$switch==1]))
educ_switch <- as_data_frame(educ_switch, make.names=TRUE)
#remove observations where the proportion is 0
educ_switch1 <- educ_switch[which (educ_switch$proportion>0),]
p <- ggplot(educ_switch1, aes(x = educ, y=proportion))
If I do p + geom_point()
I get the following picture:
But when I try to distinguish the third variable by coloring it with p + geom_point(aes(colour = arsenic))
I get this weird looking thing instead:

Visualising interaction between two categorical predictors and continuous outcome with ribbon confidence intervals

I am very new to R, so I apologise in advance for a possibly simple or obvious question.
I am trying to graph an interaction between two categorical variables (A and I), which both have 2 levels (0 and 1), against one continuous variable (V). I would like V on the Y axis, A on the X axis and I as different lines on the graph. I would also like to include 95% confidence intervals, drawn as a ribbon-style CI (like geom_ribbon produces). However, I cannot do this after identifying A and I as binary categorical variables in R. The only way I can figure out how to do it is by leaving A as a continuous variable (see picture). The syntax I am using is below:
data$I <- as.factor(data$I)
data$A <- as.factor(data$A)
gp <- ggplot(data=data, aes(x=A, y=V, colour=I))
gp + geom_point() + stat_smooth(method="lm")
Though I did not set A as a categorical variable when producing the attached image.

Sorting data vector for a histogram using ggplot and R

So I have 10,000 values in a vector from a Monte Carlo simulation. I want to plot this data as a histogram and a density plot. Doing this with the hist() function is easy, and it will calculate the frequency of the different values automatically. However, my ambition is to do this in ggplot.
My biggest problem right now is how to transform the data so ggplot can handle it. I would like my x-axis to show the "price" while the y-axis shows the frequency or density. My data has a lot of decimals, as shown in the example data below.
myData <- c(266.8997, 271.5137, 225.4786, 223.3533, 258.1245, 199.5601, 234.2341, 231.7850, 260.2091, 184.5102, 272.8287, 203.7482, 212.5140, 220.9094, 221.2627, 236.3224)
My current code using the hist()-function, and the plot is shown below.
hist(myData,
xlab ="Price",
prob=TRUE)
lines(density(myData))
Histogram for the data vector containing 10000 values
How would you sort the data, and how would you do this with ggplot? Should I round the numbers as well?
Hard to say exactly without seeing a sample of your data, but have you tried:
ggplot(data.frame(Price = myData), aes(Price)) + geom_histogram()
or:
ggplot(data.frame(Price = myData), aes(Price)) + geom_density()
Just try this:
ggplot() +
geom_bar(aes(myData)) +
geom_density(aes(myData))
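A closer equivalent of the hist(..., prob=TRUE) plus lines(density(...)) combination is a histogram drawn on the density scale with a density curve on top. The sketch below assumes ggplot2 3.3.0 or later (for after_stat()). The column name Price and the choice of 8 bins are mine:

```r
library(ggplot2)

myData <- c(266.8997, 271.5137, 225.4786, 223.3533, 258.1245, 199.5601,
            234.2341, 231.7850, 260.2091, 184.5102, 272.8287, 203.7482,
            212.5140, 220.9094, 221.2627, 236.3224)

# ggplot() expects a data frame, so wrap the vector in one
price_df <- data.frame(Price = myData)

p <- ggplot(price_df, aes(x = Price)) +
  # y = after_stat(density) rescales the bars so their total area is 1,
  # which puts them on the same scale as the density curve
  geom_histogram(aes(y = after_stat(density)), bins = 8,
                 fill = "grey80", colour = "black") +
  geom_density()
print(p)
```

No sorting or rounding is needed here: geom_histogram() bins the raw values itself, just as hist() does.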

Create a ggplot2 stat_ecdf plot with standard error shading

I have data from three doses of a treatment, with three replicates per dose:
df <- data.frame(dose=c(rep(1,300),rep(3,300),rep(5,300)),
replicate=rep(c(rep("X1",100),rep("X2",100),rep("X3",100)),3),
value=c(rnorm(300,1,1),rnorm(300,3,1),rnorm(300,5,1)),stringsAsFactors=F)
df$dose <- factor(df$dose,levels=c(1,3,5))
I want to display it using a CDF plot. For each replicate I can simply plot the CDFs of the three doses with:
for(r in c("X1","X2","X3")){
# ggplot objects are not auto-printed inside a for loop, so print() is needed
print(ggplot(dplyr::filter(df,replicate==r),aes(x=value,color=dose))+
stat_ecdf(geom="step")+
theme_bw()+
theme(panel.border=element_blank(),strip.background=element_blank()))
}
But I'm looking for a way to display all replicates of each dose in one figure, with standard error shading around the mean value, similar to plots achieved with stat_smooth.
Can this be achieved?
Also, either for this or for a single replicate's plot:
r <- "X1"
ggplot(dplyr::filter(df,replicate==r),aes(x=value,color=dose))+
stat_ecdf(geom="step")+
theme_bw()+
theme(panel.border=element_blank(),strip.background=element_blank())
Is there a way to compute the area under each of the ecdfs?
You can use interaction to group the data (in ggplot) based on two columns:
ggplot(df,aes(x=value,color = interaction(replicate, dose)
, group=interaction(replicate, dose)))+
stat_ecdf(geom="step")+
theme_bw()+
theme(panel.border=element_blank(),strip.background=element_blank())
This would be your plot:
The question is a little vague, so here is an alternative.
If you want lines for a single dose, you can drop replicate from the interaction or filter on dose instead of replicate in your dplyr::filter:
ggplot(dplyr::filter(df,dose==1),aes(x=value,color = interaction(replicate, dose)
, group=interaction(replicate, dose)))+
stat_ecdf(geom="step")+
theme_bw()+
theme(panel.border=element_blank(),strip.background=element_blank())
And you'd get a step plot with one ECDF per replicate of dose 1.
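For the standard-error shading and the area asked about in the question, one possible approach is to evaluate each replicate's ECDF on a common grid, then take the mean and standard error across replicates at every grid point. The following is only a sketch. The seed, the 200-point grid and the helper summarise_dose() are my assumptions, and the area is only defined over the finite range of the grid:

```r
library(ggplot2)

set.seed(42)  # assumption: fixed seed for reproducibility
df <- data.frame(dose = rep(c(1, 3, 5), each = 300),
                 replicate = rep(rep(c("X1", "X2", "X3"), each = 100), 3),
                 value = c(rnorm(300, 1, 1), rnorm(300, 3, 1), rnorm(300, 5, 1)),
                 stringsAsFactors = FALSE)
df$dose <- factor(df$dose, levels = c(1, 3, 5))

# Common grid on which every replicate's ECDF is evaluated
grid <- seq(min(df$value), max(df$value), length.out = 200)

# Hypothetical helper: mean and standard error of the replicate ECDFs for one dose
summarise_dose <- function(d, dose_label) {
  # one column per replicate, one row per grid point
  curves <- sapply(split(d$value, d$replicate), function(v) ecdf(v)(grid))
  data.frame(dose  = dose_label,
             value = grid,
             mean  = rowMeans(curves),
             se    = apply(curves, 1, sd) / sqrt(ncol(curves)))
}

ecdf_summary <- do.call(rbind, lapply(levels(df$dose), function(lv) {
  summarise_dose(df[df$dose == lv, ], lv)
}))
ecdf_summary$dose <- factor(ecdf_summary$dose, levels = levels(df$dose))

# Trapezoidal approximation of the area under each mean ECDF over the grid range
area <- sapply(split(ecdf_summary, ecdf_summary$dose), function(s) {
  sum(diff(s$value) * (head(s$mean, -1) + tail(s$mean, -1)) / 2)
})

p <- ggplot(ecdf_summary, aes(x = value, colour = dose, fill = dose)) +
  geom_ribbon(aes(ymin = pmax(mean - se, 0), ymax = pmin(mean + se, 1)),
              alpha = 0.3, colour = NA) +
  geom_step(aes(y = mean)) +
  theme_bw() +
  theme(panel.border = element_blank(), strip.background = element_blank())
print(p)
```

With only three replicates per dose the standard error is rough, so the ribbon is better read as a visual guide than as a formal confidence band. To get the area for a single replicate, apply the same trapezoidal sum to ecdf(v)(grid).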
