I am making a function Prop.Histogram() that plots data as a histogram showing the proportions with a normal distribution curve added to it. Addition of the curve was difficult for me to achieve, but I succeeded (see code below)!
Note: I personally prefer to work with the pipe operator %>% from the package magrittr in my code. However, as not everyone is familiar with this operator or this package (or they may prefer not to use it), I'll also provide the same code without magrittr below.
Code using magrittr
Prop.Histogram <- function(data,
xlim_min, xlim_max, x_BreakSize,
ylim_max, y_steps) {
# Load packages
library(magrittr)
# Make histogram of data without y-axis
hist(data, freq = FALSE, ylab = "Proportion",
xlim = c(xlim_min, xlim_max), breaks = seq(from = xlim_min, to = xlim_max, by = x_BreakSize),
ylim = c(0, ylim_max %>% divide_by(., x_BreakSize)), yaxt = "n")
# I divided ylim_max by x_BreakSize, as I want ylim_max to be equal to the max proportion shown on the y_axis (and not to the max density)
# Add y-axis that shows proportion and not density
axis(side = 2,
at = seq(from = 0, to = ylim_max %>% divide_by(., x_BreakSize), by = y_steps %>% divide_by(., x_BreakSize)),
labels = seq(from = 0, to = ylim_max, by = y_steps))
box()
# Add curve to histogram
curve(dnorm(x, mean = mean(data), sd = sd(data)), lwd = 5, add = TRUE, yaxt = "n")
}
Same code without using magrittr
Prop.Histogram <- function(data,
xlim_min, xlim_max, x_BreakSize,
ylim_max, y_steps) {
# No packages are needed for this version
# Make histogram of data without y-axis
hist(data, freq = FALSE, ylab = "Proportion",
xlim = c(xlim_min, xlim_max), breaks = seq(from = xlim_min, to = xlim_max, by = x_BreakSize),
ylim = c(0, ylim_max/x_BreakSize), yaxt = "n")
# I divided ylim_max by x_BreakSize, as I want ylim_max to be equal to the max proportion shown on the y_axis (and not to the max density)
# Add y-axis that shows proportion and not density
axis(side = 2,
at = seq(from = 0, to = ylim_max/x_BreakSize, by = y_steps/x_BreakSize),
labels = seq(from = 0, to = ylim_max, by = y_steps))
box()
# Add curve to histogram
curve(dnorm(x, mean = mean(data), sd = sd(data)), lwd = 5, add = TRUE, yaxt = "n")
}
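The comment about dividing ylim_max by x_BreakSize relies on the fact that, with equal-width bins, proportion = density × bin width. A minimal sketch to check this, assuming simulated data like the sample further down (the seed, bin_width, and the way breaks is built are illustrative choices, not part of the original function):

```r
# Simulated data similar to the example in the question (seed chosen arbitrarily)
set.seed(1)
data <- rnorm(724, mean = 84, sd = 33)

# Equal-width breaks that are guaranteed to cover the whole data range
bin_width <- 10
breaks <- seq(floor(min(data) / bin_width) * bin_width,
              ceiling(max(data) / bin_width) * bin_width,
              by = bin_width)

# Compute the histogram without drawing it
h <- hist(data, breaks = breaks, plot = FALSE)

# Proportion of observations per bin, computed directly from the counts
proportions <- h$counts / sum(h$counts)

# Density times bin width reproduces those proportions
same <- isTRUE(all.equal(h$density * bin_width, proportions))
print(same)
print(sum(proportions))
```

This is why a tick drawn at density d can be labelled with the proportion d * x_BreakSize.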
This code does exactly what I want it to do: it plots the proportions and adds a normal distribution curve to the plot. Though, I do have difficulties understanding why addition of the curve actually works.
Main question (1): I have to put x as the first argument in dnorm(), and even though I have not defined x, it works! So my first and main question is: what is x, what does it do, and why does it work in my function?
Second question (2): My second question is whether it is possible (and, if so, how) to use magrittr pipe-operators (%>%) in the line of code that adds the curve to the plot. (Even if using operators is not the best way to do so in this case, I am still interested in the answer as I am eager to learn!)
First of all, for those who want to try out my code: here is some data that is representative of data that I want to plot:
data <- rnorm(724, mean = 84, sd = 33)
Prop.Histogram(data,
xlim_min = -50, xlim_max = 200, x_BreakSize = 10,
ylim_max = 0.15, y_steps = 0.05)
Main question (1): role of x in dnorm()/curve()
I started by using data instead of x as the first argument of dnorm(), but this didn't work as it resulted in the following error message:
Error in curve(dnorm(data, mean = mean(data), sd = sd(data)), lwd = 5, :
'expr' must be a function, or a call or an expression containing 'x'
But then, when I take dnorm(data, mean = mean(data), sd = sd(data)) and run it individually (not as an argument of curve()), it gives me 724 values (I don't know what they mean, but at least it's not an error message). That is odd, since using data as the first argument when dnorm() is part of curve() in my function results in an error message, as we saw previously.
Then, when I change data for x and run dnorm(x, mean = mean(data), sd = sd(data)) (again not as an argument of curve()), it gives me another error message:
Error in dnorm(x, mean = mean(data), sd = sd(data)) :
object 'x' not found
This I can understand, as I've not defined x anywhere in my code. But that raises the question: why do I not get this same error message when I run my (working) function?
In short, I observed that x must be the first argument in dnorm() when dnorm() is used as an argument in curve(), but x cannot be used as the first argument when dnorm() is used individually. Conclusion: I am lost.
Of course, when I am lost in R, I always look at the help page of R. The help page of dnorm() states that x is a vector of quantiles... that's it. I know those words individually, but have no idea what it means in my code (as I've not defined x, so what vector or what quantiles is the R help page talking about?).
Second question (2): use of magrittr in code
I've tried to write the code curve(dnorm(x, mean = mean(data), sd = sd(data)), lwd = 5, add = TRUE, yaxt = "n") using magrittr, but it does not work. Here are some examples I've tried:
data %>% dnorm(x, mean = mean(.), sd = sd(.)) %>% curve(., lwd = 5, add = TRUE, yaxt = "n")
data %>% dnorm(x, mean = mean(.), sd = sd(.)) %>% curve(lwd = 5, add = TRUE, yaxt = "n")
dnorm(x, mean = mean(data), sd = sd(data)) %>% curve(., lwd = 5, add = TRUE, yaxt = "n")
They all result in the same error message:
Error in dnorm(x, mean = mean(data), sd = sd(data)) :
object 'x' not found
I'd like to know if it's possible to use magrittr operators like %>% in this situation (even if it's not the best option).
PS. This is my first time posting, so please feel free to give feedback or ask me for more information if needed. Thank you in advance!
The curve() function uses non-standard evaluation. x is just a placeholder in the expression that it will plot. See ?curve for details.
In fact, x doesn't need to be the first argument; it can appear anywhere in the expression. But you would want it to be attached to the first argument of dnorm, so putting it first works well. If you want to see the effect of the sd argument on the density at 0, you could use
curve(dnorm(0, sd = x))
When you do put it first, the dummy x that curve() is looking for will be bound to the first argument of dnorm(), which happens to also be named x, as you saw on the help page. It is the location at which you want to calculate the density.
When you called dnorm(data, mean = mean(data), sd = sd(data)) you were asking it to calculate the density of a normal distribution with mean mean(data) and standard deviation sd(data) at each of the locations in data. That's why you got a long vector response.
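To make the placeholder idea concrete, here is a rough sketch of what curve() does by hand: it builds a grid of locations and evaluates the expression with that grid standing in for x. The seed, the name grid, and the plotting range are assumptions made for illustration; pdf(NULL) is only there so that no plot file is written.

```r
# Simulated data similar to the example in the question (seed chosen arbitrarily)
set.seed(42)
data <- rnorm(724, mean = 84, sd = 33)

# By hand: a grid of 101 locations, and the density evaluated at each of them
grid <- seq(-50, 200, length.out = 101)
y_manual <- dnorm(grid, mean = mean(data), sd = sd(data))

# curve() builds the same kind of grid itself, binds it to the name x while
# evaluating the expression, and invisibly returns the points it plotted
pdf(NULL)  # null graphics device: nothing is written to disk
res <- curve(dnorm(x, mean = mean(data), sd = sd(data)),
             from = -50, to = 200, n = 101)
invisible(dev.off())

# Both approaches give the same x locations and the same densities
print(all.equal(res$x, grid))
print(all.equal(res$y, y_manual))
```

Note that no object called x exists outside the curve() call here, which matches the "object 'x' not found" error seen when dnorm(x, ...) is run on its own.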
For your second question: magrittr passes the result of things on the left of the pipe into the function call on the right. There are some complicated rules for where those results appear:
If you don't use . in the function call, the value is used as the first argument.
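If you do use ., the value appears there. When the . is a direct argument of the call, it is not also inserted as the first argument; when the . only appears inside a nested call, the value is still inserted as the first argument as well. See ?"%>%" for details.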
So to get what you want, you could do this:
data %>% {curve(dnorm(x, mean = mean(.), sd = sd(.)), lwd = 5, add = TRUE, yaxt = "n")}
I had to use the curly brackets to get magrittr to handle the . properly.
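A small sketch of the placement rules, assuming the magrittr package is installed; the values and the variable names a, b, d, and e are made up for illustration:

```r
library(magrittr)

# No dot: the left-hand side becomes the first argument
a <- c(1, 4, 9) %>% sqrt()            # same as sqrt(c(1, 4, 9))

# Dot as a direct argument: the value goes only where the dot is
b <- 3 %>% rep("z", times = .)        # same as rep("z", times = 3)

# Dot only inside a nested call: the value is also inserted as first argument
d <- 1:3 %>% rep(times = length(.))   # same as rep(1:3, times = length(1:3))

# Curly braces switch off the first-argument insertion entirely
e <- 1:3 %>% { length(.) }            # same as length(1:3)

print(list(a = a, b = b, d = d, e = e))
```

In the curve() call the dots sit inside the nested dnorm(), mean(), and sd() calls, so without the braces data would also be passed as the first argument of curve().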
I am trying to do a for loop like this:
for (n in 1:200)
{
pre[n] <- aggregate(S[n]~Secs[n], data = dataframe, FUN = sum)
freqsdf[n] <- data.frame(table(SecsOnly2$Secs[n]))
AVAL[n] <- pre[n]$S[n]/freqsdf[n]$Freq
AVAL[n] <- data.frame(AVAL[n])
hist(dataframe$Secs[n], xlab = "", ylab = "", ylim = c(0, 16000), axes = FALSE, col = "grey")
axis(4, ylim = c(0, 16000), col = "black", col.axis = "black", las = 2, cex.axis = .5)
par(new = TRUE)
plot(pre[n]$Secs[n], AVAL[n], col = "red" , type = "l")
abline(h = 0.25) }
But I'm getting this error:
Error in eval(expr, envir, enclos) : object 'S' not found
My dataset that has the "S" variable has a bunch of variables including "S1" through "S200." I want R to go through all this code for all the "S" variables, the "Secs" variables, etc... This code worked fine for just S1, when I wrote it just for S1, Secs1, etc...(not in a loop). But I want R to go through the same code for all my columns. I'm not sure why "S" is not being found. I thought by going from n = 1 to n = 200, R automatically looks for "S1", "Secs1," etc... the first time the loop runs, and then "S2", "Secs2," etc... the second time it runs, and so on.
In this line (and several other places):
pre[n]$S[n]/freqsdf[n]$Freq
You are trying to paste together S or Secs and n to build the column names S1, Secs1, and so on, but R interprets S[n] as the nth element of an object called S, which does not exist. It's hard to fix without seeing your data, but you could try replacing every place you use this trick with something like:
pre[n][[paste("S", n, sep="")]]
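A minimal sketch of that idea, assuming a toy data frame whose columns follow the S1/Secs1 naming described in the question. The data, the name toy_df, and the aval list are hypothetical; only the way the column names are built carries over.

```r
# Toy data with two S/Secs column pairs (made up for illustration)
toy_df <- data.frame(
  Secs1 = c(1, 1, 2, 2, 2), S1 = c(1, 0, 1, 1, 0),
  Secs2 = c(3, 3, 3, 4, 4), S2 = c(0, 0, 1, 1, 1)
)

aval <- list()
for (n in 1:2) {
  # Build the column names as strings: "S1", "Secs1", then "S2", "Secs2"
  s_col <- paste0("S", n)
  secs_col <- paste0("Secs", n)

  # reformulate() turns the strings into the formula S1 ~ Secs1, S2 ~ Secs2
  pre <- aggregate(reformulate(secs_col, response = s_col),
                   data = toy_df, FUN = sum)

  # Number of rows for each distinct Secs value
  freqs <- as.data.frame(table(toy_df[[secs_col]]))

  # Sum of S per Secs value divided by how often that value occurs
  aval[[n]] <- pre[[s_col]] / freqs$Freq
}

print(aval)
```

Indexing with [[ ]] and a string looks the column up by name, which is what S[n] was meant to do.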
I have the following script:
FGM = function (n,r,z){
x = r*sqrt(n)/(2*z)
Px = 1-pnorm(x)
}
re = 10000
data = data.frame(abs(rnorm(re,0,1)), abs(rnorm(re,0,1)), abs(rnorm(re,0,1)))
colnames(data) = c("n","r","z")
data$Px = FGM(data$n,data$r,data$z)
data$x = data$r*sqrt(data$n)/(2*data$z)
par(mar=c(4.5,4.5,1,1))
plot(data$x,data$Px, xlim = c(0,3), pch = 19, cex = 0.1, xaxs="i", yaxs="i",
xlab = expression(paste("Standardized mutational size (",italic(x), ")")),
ylab = expression(paste("P"[a],"(",italic(x),")")))
Which is a recreation of the graph found here (box 2). You can see in this script that I do this by just plotting 10000 small black points with various values of n, z, and r. This seems like an ugly workaround; I think I should just be able to give R my function
FGM = function (n,r,z){
x = r*sqrt(n)/(2*z)
Px = 1-pnorm(x)
}
and have it plot a line on a graph. However, a few hours of scouring the web have been unproductive, and I tried a few ways with abline and lines, but nothing worked. Is there a way of doing it with these functions or another function?
Tried this...
plot(data$x,data$Px, xlim = c(0,3), ylim = c(0,0.5), xaxs="i", yaxs="i",
xlab = expression(paste("Standardized mutational size (",italic(x), ")")),
ylab = expression(paste("P"[a],"(",italic(x),")")), type = "n")
curve(1-pnorm(r*sqrt(n)/(2*z)), add=T)
Error in curve(1 - pnorm(r * sqrt(n)/(2 * z)), add = T) :
'expr' must be a function, or a call or an expression containing 'x'
@PaulRegular offered this solution, but it still plots based on the data, not the formula itself. I'm looking for a solution that can produce the curve properly without large values of "re". Using the following with "re" set to 10, you can see what I mean:
data <- data[order(data$x),]
lines(data$x, data$Px, lwd=1)
You can pass a function of just one variable to plot. I guess that you are looking for:
plot(function(x) 1-pnorm(x),0,3)
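The same thing can be sketched with curve(), which wants the expression written in terms of x. Since x = r*sqrt(n)/(2*z) is the only quantity the probability depends on, the line can be drawn from the formula alone. The axis labels are copied from the question; pdf(NULL) is only there so that no plot file is written.

```r
pdf(NULL)  # null graphics device: nothing is written to disk

# Draw Pa(x) = 1 - pnorm(x) directly from the formula, without simulated points
res <- curve(1 - pnorm(x), from = 0, to = 3, xaxs = "i", yaxs = "i",
             xlab = expression(paste("Standardized mutational size (", italic(x), ")")),
             ylab = expression(paste("P"[a], "(", italic(x), ")")))

invisible(dev.off())

# curve() invisibly returns the points it drew (101 by default)
print(length(res$x))
print(res$y[1])
```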
Try sorting your data by x, then add the line:
data <- data[order(data$x),]
lines(data$x, data$Px, lwd=2)
For instance, when I get this:
http://i.stack.imgur.com/cWTIm.jpg
It seems that I've to set the color for every outlier separately...
Here is my (fractional) guess at code:
...,par.settings = list(...,box.rectangle = list(col= c("red","blue")),...) ),...
thx already in advance!
Fair question, but please do not post "fractional guesses at code"; it's unfair to ask others to generate the sample problem.
Here is the sample code, which confirms what you found:
library(lattice)
d = data.frame(x=c(rnorm(90),20*runif(16)),group=letters[1:2])
cols = list(col=c("red","blue"),pch=c(1,16,13))
bwplot(group~x,data=d,
par.settings = list(
plot.symbol=cols,
box.rectangle = cols,
box.dot = cols,
box.umbrella=cols
))
and here is the code that shows that the outlier pch/col/alpha/cex are not grouped, and therefore are recycled incorrectly.
From panel.bwplot:
panel.points(x = rep(levels.fos, sapply(blist.out, length)),
y = unlist(blist.out), pch = plot.symbol$pch, col = plot.symbol$col,
alpha = plot.symbol$alpha, cex = plot.symbol$cex,
fontfamily = plot.symbol$fontfamily, ......
Which means that this is a missing feature in lattice (I would not call it a bug).
I would like to overlay 2 density plots on the same device with R. How can I do that? I searched the web but I didn't find any obvious solution.
My idea would be to read data from a text file (columns) and then use
plot(density(MyData$Column1))
plot(density(MyData$Column2), add=T)
Or something in this spirit.
use lines for the second one:
plot(density(MyData$Column1))
lines(density(MyData$Column2))
make sure the limits of the first plot are suitable, though.
ggplot2 is another graphics package that handles things like the range issue Gavin mentions in a pretty slick way. It also handles auto generating appropriate legends and just generally has a more polished feel in my opinion out of the box with less manual manipulation.
library(ggplot2)
#Sample data
dat <- data.frame(dens = c(rnorm(100), rnorm(100, 10, 5))
, lines = rep(c("a", "b"), each = 100))
#Plot.
ggplot(dat, aes(x = dens, fill = lines)) + geom_density(alpha = 0.5)
Adding a base graphics version that takes care of the y-axis limits, adds colors, and works for any number of columns:
If we have a data set:
myData <- data.frame(std.normal=rnorm(1000, mean=0, sd=1),
wide.normal=rnorm(1000, mean=0, sd=2),
exponent=rexp(1000, rate=1),
uniform=runif(1000, min=-3, max=3)
)
Then to plot the densities:
dens <- apply(myData, 2, density)
plot(NA, xlim=range(sapply(dens, "[", "x")), ylim=range(sapply(dens, "[", "y")))
mapply(lines, dens, col=1:length(dens))
legend("topright", legend=names(dens), fill=1:length(dens))
Which gives:
Just to provide a complete set, here's a version of Chase's answer using lattice:
library(lattice)
dat <- data.frame(dens = c(rnorm(100), rnorm(100, 10, 5))
, lines = rep(c("a", "b"), each = 100))
densityplot(~dens,data=dat,groups = lines,
plot.points = FALSE, ref = TRUE,
auto.key = list(space = "right"))
which produces a plot like this:
That's how I do it in base (it's actually mentioned in the comments on the first answer, but I'll show the full code here, including the legend, as I cannot comment yet...)
First you need to get the info on the max values for the y axis from the density plots. So you need to actually compute the densities separately first
dta_A <- density(VarA, na.rm = TRUE)
dta_B <- density(VarB, na.rm = TRUE)
Then plot them according to the first answer and define min and max values for the y axis that you just got. (I set the min value to 0)
plot(dta_A, col = "blue", main = "2 densities on one plot",
ylim = c(0, max(dta_A$y, dta_B$y)))
lines(dta_B, col = "red")
Then add a legend to the top right corner
legend("topright", c("VarA","VarB"), lty = c(1,1), col = c("blue","red"))
I took the above lattice example and made a nifty function. There is probably a better way to do this with reshape via melt/cast. (Comment or edit if you see an improvement.)
multi.density.plot=function(data,main=paste(names(data),collapse = ' vs '),...){
##combines multiple density plots together when given a list
df=data.frame();
for(n in names(data)){
idf=data.frame(x=data[[n]],label=rep(n,length(data[[n]])))
df=rbind(df,idf)
}
densityplot(~x,data=df,groups = label,plot.points = F, ref = T, auto.key = list(space = "right"),main=main,...)
}
Example usage:
multi.density.plot(list(BN1=bn1$V1,BN2=bn2$V1),main='BN1 vs BN2')
multi.density.plot(list(BN1=bn1$V1,BN2=bn2$V1))
You can use the ggjoy package. Let's say that we have three different beta distributions such as:
set.seed(5)
b1<-data.frame(Variant= "Variant 1", Values = rbeta(1000, 101, 1001))
b2<-data.frame(Variant= "Variant 2", Values = rbeta(1000, 111, 1011))
b3<-data.frame(Variant= "Variant 3", Values = rbeta(1000, 11, 101))
df<-rbind(b1,b2,b3)
You can get the three different distributions as follows:
library(tidyverse)
library(ggjoy)
ggplot(df, aes(x=Values, y=Variant))+
geom_joy(scale = 2, alpha=0.5) +
scale_y_discrete(expand=c(0.01, 0)) +
scale_x_continuous(expand=c(0.01, 0)) +
theme_joy()
Whenever there are issues of mismatched axis limits, the right tool in base graphics is to use matplot. The key is to leverage the from and to arguments to density.default. It's a bit hackish, but fairly straightforward to roll yourself:
set.seed(102349)
x1 = rnorm(1000, mean = 5, sd = 3)
x2 = rnorm(5000, mean = 2, sd = 8)
xrng = range(x1, x2)
#force the x values at which density is
# evaluated to be the same between 'density'
# calls by specifying 'from' and 'to'
# (and possibly 'n', if you'd like)
kde1 = density(x1, from = xrng[1L], to = xrng[2L])
kde2 = density(x2, from = xrng[1L], to = xrng[2L])
matplot(kde1$x, cbind(kde1$y, kde2$y))
Add bells and whistles as desired (matplot accepts all the standard plot/par arguments, e.g. lty, type, col, lwd, ...).
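As one possible set of such additions, here is a sketch that draws the two densities as lines and adds a legend. The colors, labels, and the same_grid check are illustrative choices; pdf(NULL) is only there so that no plot file is written.

```r
set.seed(102349)
x1 <- rnorm(1000, mean = 5, sd = 3)
x2 <- rnorm(5000, mean = 2, sd = 8)
xrng <- range(x1, x2)

# Same 'from' and 'to' (and the default 'n') give both estimates the same x grid
kde1 <- density(x1, from = xrng[1L], to = xrng[2L])
kde2 <- density(x2, from = xrng[1L], to = xrng[2L])
same_grid <- isTRUE(all.equal(kde1$x, kde2$x))
print(same_grid)

pdf(NULL)  # null graphics device: nothing is written to disk

# One column of y values per density; matplot picks y limits that fit both
matplot(kde1$x, cbind(kde1$y, kde2$y), type = "l", lty = 1,
        col = c("blue", "red"), xlab = "x", ylab = "Density")
legend("topright", legend = c("x1", "x2"), lty = 1, col = c("blue", "red"))

invisible(dev.off())
```

Because both columns share one x vector, matplot can scale the axes for both curves in a single call.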