Problem with Density Plot and Normal Density Plot in R

I am asked to plot the density of the residuals of my dataset against the density of a normal distribution.
When I do this with ggplot2, it shows me my residual density plot, but not the normal distribution. Additionally, two error messages occur:
Removed 134 rows containing non-finite values (stat_density).
Removed 403 row(s) containing missing values (geom_path).
Could somebody explain to me why the normal curve is not shown?
Please find my code below:
p1 <- ggplot(steak_eaters)+ geom_density(aes(x= resid))
p1 <- p1 + stat_function(fun =dnorm, n=403, args= list(mean= mean(steak_eaters$resid), sd = sd(steak_eaters$resid)), color= "red") + theme_stata()
plot(p1)

I don't have access to your steak_eaters dataset, but here is an example with the geom_function function and a built-in dataset:
library(ggplot2)
ggplot(diamonds, aes(x = depth)) +
  geom_density(color = "blue") +
  geom_function(fun = function(x)
                  dnorm(x,
                        mean = mean(diamonds$depth),
                        sd = sd(diamonds$depth)),
                color = "red")
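The messages you copied look like warnings rather than errors. The first one says that 134 missing or non-finite values in resid were dropped before the density was estimated. If those values are NA, mean(steak_eaters$resid) and sd(steak_eaters$resid) also return NA, so dnorm() returns NA at all 403 evaluation points and nothing is drawn, which is what the second warning reports. Passing na.rm = TRUE to mean() and sd() should make the normal curve appear.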
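Since the original steak_eaters data is not available, here is a minimal sketch of how missing values can suppress the normal curve and how na.rm = TRUE avoids it. The data frame df, its size, and the number of missing values are made up to mirror the warnings in the question; mu and sigma are illustrative names.

```r
library(ggplot2)

set.seed(1)
# Hypothetical stand-in for steak_eaters: 403 rows, 134 residuals missing
df <- data.frame(resid = rnorm(403))
df$resid[sample(403, 134)] <- NA

# A single NA propagates through mean() and sd() unless na.rm = TRUE
mu_raw <- mean(df$resid)
mu     <- mean(df$resid, na.rm = TRUE)
sigma  <- sd(df$resid, na.rm = TRUE)

# With finite parameters, stat_function() can draw the reference curve
p <- ggplot(df, aes(x = resid)) +
  geom_density(na.rm = TRUE) +
  stat_function(fun = dnorm, args = list(mean = mu, sd = sigma),
                colour = "red")
print(p)
```

Setting na.rm = TRUE in geom_density() only silences the warning about dropped rows; the density estimate itself is the same either way.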

Related

Boxplots by base R and ggplot2 do not match

I have a simple dataset. When I generate a boxplot of the data with base R and with ggplot2 separately, the two do not match. In fact, the base R boxplot is consistent with the output of the summary function.
library(tidyverse)
library(ggplotify)
library(patchwork)
df <- read.csv("test_boxplot_data.csv")
summary(df)
p1 <- as.ggplot(~boxplot(df$y, outline=FALSE))
p2 <- ggplot(df, aes(y=y)) + geom_boxplot(outlier.shape = NA) + ylim(0,100)
p1 + p2 + plot_layout(ncol = 2)
Any clue what is happening? It is also surprising that ggplot throws the warning "Removed 845 rows containing non-finite values (stat_boxplot)" even though there are no NAs in the data.
The warning "Removed 845 rows containing non-finite values (stat_boxplot)" is the clue. The data contains 845 points greater than 100, and these points are dropped before the box plot statistics are calculated.
From the first line of help for ylim():
"This is a shortcut for supplying the limits argument to the individual scales. By default, any values outside the limits specified are replaced with NA. Be warned that this will remove data outside the limits and this can produce unintended results. For changing x or y axis limits without dropping data observations, see coord_cartesian()."
This should provide the desired graph:
ggplot(df, aes(y = y)) +
  geom_boxplot(outlier.shape = NA) +
  coord_cartesian(ylim = c(0, 100))
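The difference between the two approaches can be checked numerically. The sketch below uses a made-up data frame (not the asker's CSV file) in which half the values exceed 100, and reads the median that each plot actually draws from layer_data():

```r
library(ggplot2)

# Hypothetical data: half of the values lie above the upper limit of 100
df <- data.frame(y = c(1:50, 101:150))

p_ylim <- ggplot(df, aes(y = y)) +
  geom_boxplot(outlier.shape = NA) +
  ylim(0, 100)                          # drops values outside the limits

p_coord <- ggplot(df, aes(y = y)) +
  geom_boxplot(outlier.shape = NA) +
  coord_cartesian(ylim = c(0, 100))     # only zooms, keeps all data

# "middle" is the median computed by stat_boxplot for each plot
med_ylim  <- suppressWarnings(layer_data(p_ylim))$middle
med_coord <- layer_data(p_coord)$middle

print(c(ylim = med_ylim, coord_cartesian = med_coord, base = median(df$y)))
```

Only the coord_cartesian() version agrees with median() on the full data, which matches the behaviour described in the help page.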

How to plot only the most abundant species in NMDS?

I need to plot an ordination plot showing only, let's say, the 20 most abundant species.
I tried taking the column sums of the species and then selecting only those above a certain value:
library(vegan)
data(dune)
abu <- colSums(dune)
abu
sol <- metaMDS(dune)
sol
plot(sol, type="text", display="species", select = abu > 40)
I get this error: select is not a graphical parameter
I would expect to see only a small number of species, but that does not happen.
How do you show only a small number of species in the NMDS plot?
This is not straightforward. You are getting an error because select is not a parameter of this plot method. Unfortunately, the result of the analysis is not a data.frame that could be handled easily (e.g. with tidyverse), and even more unfortunately, the plot() function called is not your standard plot, but a method defined specifically for objects of this class. The authors of this method did not foresee your need, and therefore we must make the plot manually. But to do that, we need to understand what is being plotted and how.
Let us find out more about the object sol:
class(sol)
# [1] "metaMDS" "monoMDS"
methods(class="metaMDS")
# [1] goodness nobs plot points print scores sppscores<- text
Oh good, we have a plot method. After a moment of digging, we find it in the vegan package (not exported, so we need to access it via vegan:::plot.metaMDS). It appears to be a wrapper around a function called ordiplot. We edit the function with edit() to figure out what it is doing. Essentially, it boils down to the following (with loads of unnecessary code):
Y <- scores(sol, display="species")
plot(Y, type="n")
text(Y[,1], Y[,2], rownames(Y), col="red")
This is, more or less, your plot. Choosing the species to show is now trivial, but first we must make sure that rows of Y are in the same order as columns of dune:
all(colnames(dune) == rownames(Y))
Y.sel <- Y[colSums(dune) > 40, ]
plot(Y.sel[,1], Y.sel[,2], type="n", xlim=c(-.8, .8), ylim=c(-.4, .4))
text(Y.sel[,1], Y.sel[,2], rownames(Y.sel), col="red")
We can of course make a much nicer plot, for example with ggplot2 (a much nicer plot is definitely possible with base R as well). We could even show the abundance of the plants using the size aesthetic:
library(ggplot2)
library(ggrepel)
Y <- data.frame(Y)
Y$abundance <- colSums(dune)
Y$labels <- rownames(Y)
ggplot(Y, aes(x = NMDS1, y = NMDS2, size = abundance)) +
  geom_point() +
  geom_text_repel(aes(label = labels)) +
  theme_minimal()
To filter the species by abundance, we now can do the following:
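library(tidyverse)
# The filtered data frame is piped in as the data argument,
# so Y must not be passed to ggplot() a second time
Y %>% filter(abundance > 40) %>%
  ggplot(aes(x = NMDS1, y = NMDS2, size = abundance)) +
  geom_point() +
  geom_text_repel(aes(label = labels)) +
  theme_minimal()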
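The question asked for the 20 most abundant species rather than a fixed abundance threshold. A minimal base R sketch of that selection follows; to keep it self-contained it uses a made-up community matrix comm and made-up scores Y in place of dune and scores(sol, display="species"), so all values here are hypothetical.

```r
set.seed(7)
# Hypothetical community matrix: 10 sites (rows) by 30 species (columns)
comm <- matrix(rpois(300, lambda = 3), nrow = 10,
               dimnames = list(NULL, paste0("sp", 1:30)))

# Hypothetical species scores standing in for scores(sol, display = "species")
Y <- cbind(NMDS1 = rnorm(30), NMDS2 = rnorm(30))
rownames(Y) <- colnames(comm)

# Rank species by total abundance and keep the names of the top 20
abu <- colSums(comm)
top <- names(sort(abu, decreasing = TRUE))[1:20]

# Index by name so the selection does not depend on row order
Y.top <- Y[top, , drop = FALSE]

plot(Y.top, type = "n")
text(Y.top[, 1], Y.top[, 2], rownames(Y.top), col = "red")
```

Indexing by species name rather than by a logical vector sidesteps the need to check that the rows of Y and the columns of the community matrix are in the same order.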

Combine logistic regression with bar graph for maturity results

I am trying to present the results of a logistic regression analysis for the maturity schedule of a fish species. Below is my reproducible code.
# coded with R version 3.0.2 (2013-09-25), "Frisbee Sailing"
rm(list=ls())
library(ggplot2)
library(FSA)
#generate sample data 1 mature, 0 non mature
m<-rep(c(0,1),each=25)
tl<-seq(31,80, 1)
dat<-data.frame(m,tl)
# add some non mature individuals at random in the middle of df to
#prevent glm.fit: fitted probabilities numerically 0 or 1 occurred error
tl<-sample(50:65, 15)
m<-rep(c(0),each=15)
dat2<-data.frame(tl,m)
#final dataset
data3<-rbind(dat,dat2)
ggplot can produce a logistic regression graph showing each of the data points employed, with the following code:
#plot logistic model
ggplot(data3, aes(x=tl, y=m)) +
stat_smooth(method="glm", family="binomial", se=FALSE)+
geom_point()
I want to combine the probability of being mature at a given size, which is obtained, and plotted with the following code:
#plot proportion of mature
#clump data in 5 cm size classes
l50<-lencat(~tl,data=data3,startcat=30,w=5)
#table of frequency of mature individuals by size
mat<-with(l50, table(LCat, m))
#proportion of mature
mat_prop<-as.data.frame.matrix(prop.table(mat, margin=1))
colnames(mat_prop)<-c("nm", "m")
mat_prop$tl<-as.factor(seq(30,80, 5))
# Bar plot probability mature
ggplot(mat_prop, aes(x=tl,y=m)) +
geom_bar(stat="bin")
What I've been trying to do, with no success, is to make a graph that combines both. Since the axes are the same it should be straightforward, but I can't seem to make it work. I have tried:
ggplot(mat_prop, aes(x=tl,y=m)) +
geom_bar(stat="bin")+
stat_smooth(method="glm", family="binomial", se=FALSE)
but it does not work. Any help would be greatly appreciated. I am new so not able to add the resulting graphs to this post.
I see three problems with your code:
Using stat="bin" in your geom_bar() is inconsistent with giving values for the y-axis (y=m). If you bin, then you count the number of x-values in an interval and use that count as the y-value, so there is no need to map your data to the y-axis.
The data for the glm plot is in data3, but your combined plot only uses mat_prop.
The x-axes of the two plots are actually not quite the same. In the bar plot, you use a factor variable on the x-axis, making the axis discrete, while in the glm plot, you use a numeric variable, which leads to a continuous x-axis.
The following code gave a graph combining your two plots:
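mat_prop$tl <- seq(30, 80, 5)
ggplot(mat_prop, aes(x = tl, y = m)) +
  geom_bar(stat = "identity") +
  geom_point(data = data3) +
  # ggplot2 >= 2.0.0 expects the family inside method.args
  geom_smooth(data = data3, aes(x = tl, y = m), method = "glm",
              method.args = list(family = "binomial"), se = FALSE)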
I could run it after first sourcing your script to define all the variables. The three problems mentioned above are addressed as follows:
I used geom_bar(stat="identity") in order not to use binning in the bar plot.
I used the data argument in geom_point and geom_smooth in order to use the correct data (data3) for these parts of the plot.
I redefined mat_prop$tl to make it numeric. It is then consistent with the column tl in data3, which is numeric as well.
(I also added the points. If you don't want them, just remove geom_point(data=data3).)
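For readers without the FSA package, the same combined plot can be sketched in a self-contained way. This version is an assumption-laden variant, not the code from the question: it bins lengths with floor() instead of lencat(), uses geom_col() in place of geom_bar(stat="identity"), and the names fish, lcat and prop are made up for illustration.

```r
library(ggplot2)

set.seed(1)
# Simulated maturity data (hypothetical): 1 = mature, 0 = immature
fish <- data.frame(tl = c(31:80, sample(50:65, 15)),
                   m  = c(rep(c(0, 1), each = 25), rep(0, 15)))

# Proportion mature per 5 cm length class, keyed by the class lower bound
fish$lcat <- 5 * floor(fish$tl / 5)
prop <- aggregate(m ~ lcat, data = fish, FUN = mean)

# Bars from the summary table, points and logistic curve from the raw data,
# all on the same continuous x-axis
p <- ggplot(prop, aes(x = lcat, y = m)) +
  geom_col() +
  geom_point(data = fish, aes(x = tl, y = m)) +
  geom_smooth(data = fish, aes(x = tl, y = m),
              method = "glm", formula = y ~ x,
              method.args = list(family = "binomial"), se = FALSE)
print(p)
```

Because both layers share a numeric x-axis, each bar is centred on the lower bound of its length class, as in the answer above.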

How to add two distributions to a density in ggplot2

I want to add two sets of bowling scores to the same density plot in ggplot2. I don't have the same number of observations in each group, but I would like to plot them on top of each other. Below is the code I have.
m <- ggplot(bowling, aes(x = as.numeric(Kenny)))
n <- ggplot(bowling, aes(x= as.numeric(Group)))
m + n geom_density()
and this is the error.
Error in p + o : non-numeric argument to binary operator
In addition: Warning message:
Incompatible methods ("+.gg", "Ops.data.frame") for "+"
I just want to plot them on top of each other but I can't figure out what the problem is.
The problem is that m and n are two separate plots with different aesthetic mappings, and ggplot2 cannot add one plot to another. Layers such as geom_density() have to be added to a single plot.
Here is a potential solution, if I understood your question correctly.
First, create a small sample dataset:
kenny <- rnorm(100, 20, 2)
group <- rnorm(100, 15, 2)
bowling <- data.frame(kenny, group)
Second, plot a geom_density layer with kenny as the x aesthetic, and then add a second geom_density layer that overrides the x aesthetic with group:
ggplot(bowling, aes(x = kenny)) +
  geom_density() +
  geom_density(aes(x = group), colour = "red")
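If the two sets of scores have different numbers of observations, they cannot sit side by side as columns of one data frame without padding. One possible approach, sketched here with simulated scores and made-up names (scores, player, score), is to stack them in long format and map the group to colour:

```r
library(ggplot2)

set.seed(42)
# Simulated scores with unequal group sizes (hypothetical values)
kenny <- rnorm(80, mean = 20, sd = 2)
group <- rnorm(120, mean = 15, sd = 2)

# Long format: one row per score, with a column saying whose score it is
scores <- rbind(data.frame(player = "Kenny", score = kenny),
                data.frame(player = "Group", score = group))

# One geom_density() layer draws a separate curve per player
p <- ggplot(scores, aes(x = score, colour = player)) +
  geom_density()
print(p)
```

Each density is normalised to integrate to one, so the curves stay comparable even though the groups differ in size.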

Smooth Error in qplot from ggplot2

I have some data that I am trying to plot faceted by its Type with a smooth (Loess, LM, whatever) superimposed. Generation code is below:
testFrame <- data.frame(
  Time = sample(20:60, 50, replace = TRUE),
  Dollars = round(runif(50, 0, 6)),
  Type = sample(c("First", "Second", "Third", "Fourth"), 50,
                replace = TRUE, prob = c(.33, .01, .33, .33)))
I have no problem either making a faceted plot or plotting the smooth, but I cannot do both. The first three lines of code below work fine. The fourth line is where I have trouble:
qplot(Time,Dollars,data=testFrame,colour=Type)
qplot(Time,Dollars,data=testFrame,colour=Type) + geom_smooth()
qplot(Time,Dollars,data=testFrame) + facet_wrap(~Type)
qplot(Time,Dollars,data=testFrame) + facet_wrap(~Type) + geom_smooth()
It gives the following error:
Error in [<-.data.frame(*tmp*, var, value = list(NA = NULL)) :
missing values are not allowed in subscripted assignments of data frames
What am I missing to overlay a smooth in a faceted plot? I could have sworn I had done this before, possibly even with the same data.
It works for me. Are you sure you have the latest version of ggplot2?
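One way to rule out a version problem is to print the installed version and rebuild the same faceted plot with ggplot(), since qplot() has been deprecated in recent ggplot2 releases. A minimal sketch, where the seed and the choice of method = "lm" as the smoother are arbitrary assumptions:

```r
library(ggplot2)

# Check which ggplot2 version is installed
print(packageVersion("ggplot2"))

set.seed(123)
testFrame <- data.frame(
  Time = sample(20:60, 50, replace = TRUE),
  Dollars = round(runif(50, 0, 6)),
  Type = sample(c("First", "Second", "Third", "Fourth"), 50,
                replace = TRUE, prob = c(.33, .01, .33, .33)))

# ggplot() equivalent of qplot(...) + facet_wrap(~Type) + geom_smooth()
p <- ggplot(testFrame, aes(x = Time, y = Dollars)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x) +
  facet_wrap(~Type)
```

Note that "Second" is sampled with probability 0.01, so that facet may contain very few points or be absent, which is worth keeping in mind when a smoother is fitted per facet.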
