Log Log Probability Chart in R

I'm sure this is easy, but I've been tearing my hair out trying to find out how to do this in R.
I have some data that I am trying to fit to a power law distribution. To do this, you need to plot the data on a log-log cumulative probability chart. The y-axis is the LOG of the frequency of the data (or log-probability, if you like), and the x-axis is the log of the values. If it's a straight line, then it fits a power law distribution, and the gradient determines the power law parameter.
If I want the frequency of the data, I can just use the ecdf() function:
My data set is called Profits.negative, and it's just a long list of trading profits that were less than zero (and I've notionally converted them all to positive numbers to avoid logging problems later on).
So I can type
plot(ecdf(Profits.negative))
And I get a handy empirical CDF function plotted. All I need to do is to convert both axes to log scales. I can do the x-axis:
Profits.negative.logs <- log(Profits.negative)
plot(ecdf(Profits.negative.logs))
Almost there! I just need to work out how to log the y-axis! But I can't seem to do it, and I can't work out how to extract the figures from the ecdf object. Can anyone help?
I know there is a power.law.fit function, but that just estimates the parameters - I want to plot the data and see if it lines up.

You can fit and plot power-laws using the poweRlaw package. Here's an example. First we generate some data from a heavy tailed distribution:
set.seed(1)
x = round(rlnorm(100, 3, 2)+1)
Next we load the package and create a data object and a displ object:
library(poweRlaw)
m = displ$new(x)
We can estimate xmin and the scaling parameter:
est = estimate_xmin(m)
and set the parameters
m$setXmin(est[[2]])
m$setPars(est[[3]])
Then plot the data and add the fitted line:
plot(m)
lines(m, col=2)
This gives a log-log plot of the data with the fitted power-law line overlaid.
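For intuition about what a scaling-parameter estimate involves, here is a minimal base-R sketch of the closed-form maximum-likelihood estimator for a continuous power law with a known xmin. It is not what poweRlaw runs for a discrete displ object (which also searches over xmin); the simulated data and the values of xmin and alpha_true are assumptions made up for the illustration.

```r
set.seed(1)
# Assumed values for the illustration (not taken from the question's data)
xmin <- 1
alpha_true <- 2.5
n <- 5000

# Inverse-transform sampling from a continuous power law p(x) ~ x^(-alpha), x >= xmin
u <- runif(n)
x <- xmin * (1 - u)^(-1 / (alpha_true - 1))

# Closed-form MLE of the exponent for a fixed, known xmin
alpha_hat <- 1 + n / sum(log(x / xmin))

# Approximate standard error of the estimate
alpha_se <- (alpha_hat - 1) / sqrt(n)

print(c(alpha_hat = alpha_hat, se = alpha_se))
```

With real data xmin is rarely known in advance, which is why a package routine that estimates it is usually the safer choice.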

Data generation first (your part, actually ;)):
set.seed(1)
Profits.negative <- runif(1e3, 50, 100) + rnorm(1e2, 5, 5)
Logging and ecdf:
Profits.negative.logs <- log(Profits.negative)
fn <- ecdf(Profits.negative.logs)
ecdf returns a function, so if you want to extract something from it, it's a good idea to look into the function's closure:
ls(environment(fn))
# [1] "f" "method" "n" "nobs" "x" "y" "yleft" "yright"
Well, now we can access x and y:
x <- environment(fn)$x
y <- environment(fn)$y
That is probably what you need. Indeed, plot(fn) and plot(x,y,type="l") show virtually the same results. To log the y-axis, you just need:
plot(x,log(y),type="l")
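A power-law check is usually drawn with the complementary CDF, P(X >= x), on both log axes, since the log of the CDF itself flattens towards zero in the tail. The following is a minimal sketch of that variant, assuming a simulated heavy-tailed sample; losses is a hypothetical stand-in for Profits.negative, and the least-squares line is only a visual aid, not a proper parameter estimate.

```r
set.seed(1)
# Hypothetical stand-in for Profits.negative: heavy-tailed sample, tail exponent 1.5
losses <- (1 - runif(1000))^(-1 / 1.5)

x <- sort(losses)
n <- length(x)

# Empirical P(X >= x_i) for the sorted values; never zero, so it can be logged
ccdf <- (n - seq_len(n) + 1) / n

# Both axes on log scales; a power law shows up as a roughly straight line
plot(x, ccdf, log = "xy",
     xlab = "Loss (log scale)", ylab = "P(X >= x) (log scale)")

# Rough visual aid only: least-squares line fitted on the log-log scale
fit <- lm(log(ccdf) ~ log(x))
slope <- unname(coef(fit)[2])
lines(x, exp(fitted(fit)), col = "red")
```

To use it on real data, replace losses with the vector of positive values.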

Here is an approach using ggplot2:
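library(ggplot2)
library(scales) # provides trans_breaks(), trans_format() and math_format()
# data
set.seed(1)
x = round(rlnorm(100, 3, 2)+1)
# organize data into a df (use = so the columns are actually named pk and k)
df <- data.frame(x = sort(x, decreasing = TRUE),
pk = ecdf(x)(x),
k = seq_along(x))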
# plot
ggplot(df, aes(x=k, y= pk)) + geom_point(alpha=0.5) +
coord_trans(x = 'log10', y = 'log10') +
scale_x_continuous(breaks = trans_breaks("log10", function(x) 10^x), labels = trans_format("log10", math_format(10^.x))) +
scale_y_continuous(breaks = trans_breaks("log10", function(x) 10^x), labels = trans_format("log10", math_format(10^.x)))

Related

Export results from LOESS plot

I am trying to export the underlying data from a LOESS plot (blue line).
I found this post on the subject and was able to get it to export like the post says:
Can I export the result from a loess regression out of R?
However, as the last comment from the poster in that post says, I am not getting the results for my LOESS line. Does anyone have any insights on how to get it to export properly?
Thanks!
Code for my export is here:
#loess object
CL111_loess <- loess(dur_cleaned~TS_LightOn, data = CL111)
#get SE
CL111_predict <- predict(CL111_loess, se=T)
CL111_ouput <- data.frame("fitted" = CL111_predict$fit, "SE"=CL111_predict$se.fit)
write.csv(CL111_ouput, "CL111_output.csv")
Data for the original plot is here:
Code for my original plot is here:
#individual plot
ggplot(data = CL111) + geom_smooth(mapping = aes(x = TS_LightOn, y = dur_cleaned), method = "lm", se = FALSE, colour = "Green") +
labs(x = "TS Light On (Seconds)", y = "TS Response Time (Seconds)", title = "Layout 1, Condition AO, INS High") +
theme(plot.title = element_text(hjust = 0.5)) +
stat_smooth(mapping = aes(x = TS_LightOn, y = dur_cleaned), method = "loess", se = TRUE) + xlim(0, 400) + ylim (0, 1.0)
#find coefficients for best fit line
lm(CL111_LM$dur_cleaned ~ CL111_LM$TS_LightOn)
You can get this information via ggplot_build().
If your plot is saved as gg1, run ggplot_build(gg1); then examine the data object (a list with one data frame per layer) and work out which layer you need. In this case, I looked for the layer whose colour column matched the smooth line.
bb <- ggplot_build(gg1)
## extract the right component, just the x/y coordinates
out <- bb$data[[2]][,c("x","y")]
## check
plot(y~x, data = out)
You can do whatever you want with this output now (write.csv(), save(), saveRDS() ...)
I agree that there is something weird/that I don't understand about the way that ggplot2 is setting up the loess fit. You do have to do predict() with the right newdata (e.g. a data frame with a single column TS_LightOn that ranges from 0 to 400) - otherwise you get predictions of the points in your data set, which may not be properly spaced/in the right order - but that doesn't resolve the difference for me.
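The point about newdata can be sketched in base R as follows. This is a minimal example on simulated data: dat is a hypothetical stand-in for CL111, and the grid of 80 points is chosen to mirror the default number of evaluation points documented for stat_smooth. It does not claim to reproduce ggplot2's fit exactly.

```r
set.seed(42)
# Hypothetical stand-in for CL111 (column names follow the question)
dat <- data.frame(TS_LightOn = sort(runif(200, 0, 400)))
dat$dur_cleaned <- 0.5 + 0.2 * sin(dat$TS_LightOn / 60) + rnorm(200, sd = 0.05)

fit <- loess(dur_cleaned ~ TS_LightOn, data = dat)

# Evenly spaced grid inside the observed range (loess does not extrapolate by default)
grid <- data.frame(TS_LightOn = seq(min(dat$TS_LightOn), max(dat$TS_LightOn),
                                    length.out = 80))

# Predict on the grid rather than on the original, unevenly spaced points
pred <- predict(fit, newdata = grid, se = TRUE)
out <- data.frame(x = grid$TS_LightOn, fitted = pred$fit, SE = pred$se.fit)

# Export to a temporary file for the illustration
write.csv(out, file.path(tempdir(), "loess_grid.csv"), row.names = FALSE)
```

If the plot restricts the x range with xlim(), the same restriction would have to be applied to the data before fitting for the two curves to be comparable.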
To complement @ben-bolker's answer, I have just written a small function that may be useful for retrieving the internal dataset created by ggplot for a geom_smooth call. It takes the resulting ggplot as input and returns the smoothed data. The problem it solves is that, as Ben observed, internally ggplot creates a smoothed fit with predicted data on random intervals, different from the interval used for the input data. This function will get you back the ggplot fit data with an interval based on integer, equally spaced values. The function runs a loess fit on the already smoothed data, using a small value of span (0.1) that is adjusted upward on the fly to cope with small numbers of values.
This is useful if you used geom_smooth with a method that is not 'loess' or using 'NULL' and you cannot easily build a model that replicates what geom_smooth is doing internally.
The function separates different series on the same plot as well as series located on different facets. It also returns the 'ymin' and 'ymax' values.
Note that this function uses an interval based on integer values of x. If you need equally spaced values of x that are not integers, pass your x interval of choice in the xInterval parameter, or tweak the line:
outOne <- data.frame(x=c(min(trunc(sub$x)):max(trunc(sub$x))))
get_geom_smooth_dataFromPlot <- function (a_ggplot, xInterval=NULL) {
#internal ggplot values read in ggTable
ggTable <- ggplot_build(a_ggplot)$data[[1]]
#facet panels
panels <- as.numeric(names(table(ggTable$PANEL)))
nPanel <- length(panels)
onePanel <- (nPanel==1)
#number of series in each plot
groups <- as.numeric(names(table(ggTable$group)))
nGroup <- length(groups)
oneGroup <- (nGroup==1)
out <- data.frame()
#are there 'ymin' and 'ymax' values?
SE_data <- "ymin" %in% colnames(ggTable)
for (pan in (1:nPanel)) {
for (grp in (1:nGroup)) {
sub <- subset(ggTable, (PANEL==panels[pan])&(group==groups[grp]))
#no group series for this facet panel?
if (dim(sub)[1] == 0) next
if (is.null(xInterval)) {
outOne <- data.frame(x=c(min(trunc(sub$x)):max(trunc(sub$x))))
} else {
outOne <- data.frame(x=xInterval)
}
nObs <- dim(outOne)[1]
#hack to avoid problems with a small range for the x interval
# when there are more than 90 x values
# we use a span of 0.1, but
# we adjust on-the-fly up to a span of 0.5
# for 10 values of the x interval
cSpan <- max (0.1, 0.5 * 10 / (nObs-(nObs-10)/2))
if (!onePanel) outOne$panel <- pan
if (!oneGroup) outOne$group <- grp
mod <- loess(y~x, data=sub, span=cSpan)
outOne$y <- predict(mod, outOne$x, se=FALSE)
if (SE_data) {
mod <- loess(ymin~x, data=sub, span=cSpan)
outOne$ymin <- predict(mod, outOne$x, se=FALSE)
mod <- loess(ymax~x, data=sub, span=cSpan)
outOne$ymax <- predict(mod, outOne$x, se=FALSE)
}
out <- rbind(out, outOne)
}
}
return (out)
}

How to customize title, axis labels, etc. in a plot of a decomposed time series

I'm reasonably familiar with the usual ways of modifying a plot by writing your own x axis labels or a main title, but I've been unable to customize the output when plotting the results of a time series decomposition.
For example,
library(TTR)
t <- ts(co2, frequency=12, start=1, deltat=1/12)
td <- decompose(t)
plot(td)
plot(td, main="Title Doesn't Work") # gets you an error message
gives you a nice, basic plot of the observed time series, trend, etc. With my own data (changes in depth below the water surface), however, I'd like to be able to switch the orientation of the y axes (eg ylim=c(40,0) for 'observed', or ylim=c(18,12) for 'trend'), change 'seasonal' to 'tidal', include the units for the x axis ('Time (days)'), and provide a more descriptive title for the figure.
My impression is that the kind of time series analyses I'm doing is pretty basic and, eventually, I may be better off using another package, perhaps with better graphical control, but I'd like to use ts() and decompose() if I can for now (yeah, cake and consumption). Assuming this doesn't get too horrendous.
Is there a way to do this?
Thanks! Pete
You can modify the plot.decomposed.ts function. That's the plot "method" that gets dispatched when you run plot on an object of class decomposed.ts (which is the class of td).
getAnywhere(plot.decomposed.ts)
function (x, ...)
{
xx <- x$x
if (is.null(xx))
xx <- with(x, if (type == "additive")
random + trend + seasonal
else random * trend * seasonal)
plot(cbind(observed = xx, trend = x$trend, seasonal = x$seasonal, random = x$random),
main = paste("Decomposition of", x$type, "time series"), ...)
}
Notice in the code above that the function hard-codes the title. So let's modify it so that we can choose our own title:
my_plot.decomposed.ts = function(x, title="", ...) {
xx <- x$x
if (is.null(xx))
xx <- with(x, if (type == "additive")
random + trend + seasonal
else random * trend * seasonal)
plot(cbind(observed = xx, trend = x$trend, seasonal = x$seasonal, random = x$random),
main=title, ...)
}
my_plot.decomposed.ts(td, "My Title")
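The same idea can be pushed further to cover panel labels, the x-axis label and reversed y axes. The sketch below draws the four components itself instead of calling the built-in method; plot_components and its arguments labels, main, xlab and flip are hypothetical names introduced for this illustration, and the label texts are placeholders.

```r
# Hypothetical helper: one panel per component, with custom labels and optional axis flips
plot_components <- function(dec,
                            labels = c("observed", "trend", "seasonal", "random"),
                            main = "", xlab = "Time", flip = rep(FALSE, 4)) {
  comps <- list(dec$x, dec$trend, dec$seasonal, dec$random)
  old <- par(mfrow = c(4, 1), mar = c(1, 4, 1, 1), oma = c(4, 0, 3, 0))
  on.exit(par(old))                 # restore graphics settings afterwards
  for (i in seq_along(comps)) {
    rng <- range(comps[[i]], na.rm = TRUE)
    if (flip[i]) rng <- rev(rng)    # a reversed ylim flips the axis direction
    plot(comps[[i]], ylab = labels[i], xlab = "", ylim = rng)
  }
  mtext(main, side = 3, outer = TRUE, line = 1)
  mtext(xlab, side = 1, outer = TRUE, line = 2)
  invisible(labels)
}

td <- decompose(co2)
shown <- plot_components(td,
                         labels = c("Observed", "Trend", "Tidal", "Random"),
                         main = "A more descriptive title",
                         xlab = "Time (days)",
                         flip = c(TRUE, TRUE, FALSE, FALSE))
```

Here the first two panels are drawn with their y axes reversed, which corresponds to passing something like ylim=c(40,0) by hand.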
Here's a ggplot version of the plot. ggplot requires a data frame, so the first step is to get the decomposed time series into data frame form and then plot it.
library(tidyverse) # Includes the packages ggplot2 and tidyr, which we use below
# Get the time values for the time series (one time stamp per observation)
Time = as.numeric(time(co2))
# Convert td to data frame
dat = cbind(Time, with(td, data.frame(Observed=x, Trend=trend, Seasonal=seasonal, Random=random)))
ggplot(gather(dat, component, value, -Time), aes(Time, value)) +
facet_grid(component ~ ., scales="free_y") +
geom_line() +
theme_bw() +
labs(y=expression(CO[2]~(ppm)), x="Year") +
ggtitle(expression(Decomposed~CO[2]~Time~Series)) +
theme(plot.title=element_text(hjust=0.5))

add exponential function given mean and intercept to cdf plot

Considering the following random data:
set.seed(123456)
# generate random normal data
x <- rnorm(100, mean = 20, sd = 5)
weights <- 1:100
df1 <- data.frame(x, weights)
#
library(ggplot2)
ggplot(df1, aes(x)) + stat_ecdf()
We can create a general cumulative distribution plot.
But, I want to compare my curve to that from data used 20 years ago. From the paper, I only know that the data is "best modeled by a shifted exponential distribution with an x intercept of 1.1 and a mean of 18"
How can I add such a function to my plot?
+ stat_function(fun=dexp, geom = "line", size=2, col="red", args = (mean=18.1))
but I am not sure how to deal with the shift (x intercept)
I think scenarios like this are best handled by making your function first outside of the ggplot call.
dexp doesn't take a parameter mean but uses rate instead, which is the same as lambda. That means you want rate = 1/18.1, since the mean of an exponential distribution is 1/rate. Also, I don't think dexp makes much sense here since it shows the density, and I think you really want the cumulative probability, which is pexp.
Your code could look something like this:
library(ggplot2)
test <- function(x) {pexp(x, rate = 1/18.1)}
ggplot(df1, aes(x)) + stat_ecdf() +
stat_function(fun=test, size=2, col="red")
You could shift your pexp distribution like this:
test <- function(x) {pexp(x-10, rate = 1/18.1)}
ggplot(df1, aes(x)) + stat_ecdf() +
stat_function(fun=test, size=2, col="red") +
xlim(10,45)
just for fun this is what using dexp produces:
I am not entirely sure I understand the concept of a mean for an exponential function. However, in general, when you pass a function as an argument (fun=dexp in your case), you can pass your own modified function instead, in the form fun = function(x) dexp(x)+1.1, for example.
Maybe experimenting with this feature will get you to the solution.
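One possible reading of the paper's description can be sketched in base R as follows. It assumes that "x intercept of 1.1" is the shift and that "mean of 18" is the mean of the shifted distribution, so the exponential part has mean 18 - 1.1; both are interpretations rather than facts from the paper, and pshiftexp is a hypothetical helper name.

```r
# Assumed interpretation: shift = x intercept, mean_total = mean of the shifted distribution
shift <- 1.1
mean_total <- 18
rate <- 1 / (mean_total - shift)

# CDF of the shifted exponential; pexp() returns 0 for negative arguments
pshiftexp <- function(q) pexp(q - shift, rate = rate)

set.seed(123456)
x <- rnorm(100, mean = 20, sd = 5)

# Empirical CDF of the new data with the reference curve on top
plot(ecdf(x), xlim = c(0, 60), main = "ECDF vs shifted exponential")
curve(pshiftexp(x), from = 0, to = 60, add = TRUE, col = "red", lwd = 2)
```

The same function can be handed to stat_function(fun = pshiftexp) in a ggplot2 version of the plot.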

Combining 3 different matrix plots in R

Can anyone tell me how to create a plot that features 3 different matrices of data? I have 3 different matrices of data, all of dimension 1*1001, and I wish to plot all 3 on the same graph.
I have managed to plot one matrix at a time, and to assemble the code that creates the other 2 matrices, but not to plot them. B[i,] is randomly generated data. What I would like to know is the code needed to get all 3 plots together on one graph.
Code for one matrix:
ntime<-1000
average.price.at.each.timestep<-matrix(0,nrow=1,ncol=ntime+1)
for(i in 1:(ntime+1)){
average.price.at.each.timestep[i]<-mean(B[i,])
}
matplot(t, t(average.price.at.each.timestep), type="l", lty=1, main="MC Price of a Zero Coupon Bond", ylab="Price", xlab = "Option Exercise Date")
Code for 3:
average.price.at.each.timestep<-matrix(0,nrow=1,ncol=ntime+1)
s.e.at.each.time <-matrix(0,nrow=1,ncol=ntime+1)
upper.c.l.at <- matrix(0,nrow=1,ncol=ntime+1)
lower.c.l.at <- matrix(0,nrow=1,ncol=ntime+1)
std <- function(x) sd(x)/sqrt(length(x))
for(i in 1:(ntime+1)){
average.price.at.each.timestep[i]<-mean(B[i,])
s.e.at.each.time[i] <- std(B[i,])
upper.c.l.at[i] <- average.price.at.each.timestep[i]+1.96*s.e.at.each.time[i]
lower.c.l.at[i] <- average.price.at.each.timestep[i]-1.96*s.e.at.each.time[i]
}
I'm still struggling with this, as I cannot get the solutions given to match my data sets. I have now included the code below that generates the matrix B as a working example, so you can see the data I am dealing with. As you can see, it produces a plot of the different prices; I would like a plot with the average price and confidence intervals of the average.
# Define Bond Price Parameters
#
P<-1 #par value
# Define Vasicek Model Parameters
#
rev.rate<-0.3 #speed of reversion
long.term.mean<-0.1 #long term level of the mean
sigma<-0.05 #volatility
r0<-0.03 #spot interest rate
Strike<-0.05
# Define Simulation Parameters
#
T<-50 #time to expiry
ntime<-1000 #number of timesteps
yearstep<-ntime/T #yearstep
npaths<-1000 #number of paths
dt<-T/ntime #timestep
R <- matrix(0,nrow=ntime+1,ncol=npaths) #matrix of Vasicek interest rate values
B <- matrix(0,nrow=ntime+1,ncol=npaths) # matrix of Bond Prices
R[1,]<-r0 #specifies that all paths start at specified spot rate
B[1,]<-P
# do loop which generates values to fill matrix R with multiple paths of Interest Rates as they evolve over time.
# stochastic process based on standard normal distribution
for (j in 1:npaths) {
for (i in 1:ntime) {
dZ <-rnorm(1,mean=0,sd=1)*sqrt(dt)
Rij<-R[i,j]
Bij<-B[i,j]
dr <-rev.rate*(long.term.mean-Rij)*dt+sigma*dZ
R[i+1,j]<-Rij+dr
B[i+1,j]<-Bij*exp(-R[i+1,j]*dt)
}
}
t<-seq(0,T,dt)
par(mfcol = c(3,3))
matplot(t, B[,1:pmin(20,npaths)], type="l", lty=1, main="Price of a Zero Coupon Bond", ylab="Price", xlab = "Time to Expiry")
Your example isn't reproducible, so I created some fake data that I hope is structured similarly to yours. If this isn't what you were looking for, let me know and I'll update as needed.
# Fake data
ntime <- 100
mat1 <- matrix(rnorm(ntime+1, 10, 2), nrow=1, ncol=ntime+1)
mat2 <- matrix(rnorm(ntime+1, 20, 2), nrow=1, ncol=ntime+1)
mat3 <- matrix(rnorm(ntime+1, 30, 2), nrow=1, ncol=ntime+1)
matplot(1:(ntime+1), t(mat1), type="l", lty=1, ylim=c(0, max(c(mat1,mat2,mat3))),
main="MC Price of a Zero Coupon Bond",
ylab="Price", xlab = "Option Exercise Date")
# Add lines for mat2 and mat3
lines(1:101, mat2, col="red")
lines(1:101, mat3, col="blue")
UPDATE: Is this what you're trying to do?
matplot(t, t(average.price.at.each.timestep), type="l", lty=1,
main="MC Price of a Zero Coupon Bond", ylab="Price",
xlab = "Option Exercise Date")
matlines(t, t(upper.c.l.at), lty=2, col="red")
matlines(t, t(lower.c.l.at), lty=2, col="green")
See plot below. If you have multiple columns that you want to plot (as in your updated example where you plot 20 separate paths) and you want to add lower and upper CIs for all of them (though this would make the plot unreadable), just use a matrix of upper and lower CI values that correspond to each path in average.price.at.each.timestep and use matlines to add them to your existing plot of the multiple paths.
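The loop that fills the mean and confidence-limit matrices can also be written without a loop and plotted in a single matplot call. A minimal sketch, assuming a small simulated matrix B with time steps in rows and paths in columns (a stand-in for the Vasicek output, not the real simulation); tt and bands are names introduced here.

```r
set.seed(1)
# Hypothetical stand-in for B: 101 time steps (rows) by 50 paths (columns)
B <- matrix(rnorm(101 * 50, mean = 1, sd = 0.1), nrow = 101, ncol = 50)
tt <- seq(0, 50, length.out = nrow(B))

avg <- rowMeans(B)                        # average price at each time step
se  <- apply(B, 1, sd) / sqrt(ncol(B))    # standard error of that average

# One column per line to draw
bands <- cbind(mean = avg, upper = avg + 1.96 * se, lower = avg - 1.96 * se)

matplot(tt, bands, type = "l", lty = c(1, 2, 2), col = c("black", "red", "red"),
        main = "MC Price of a Zero Coupon Bond",
        ylab = "Price", xlab = "Option Exercise Date")
```

Because matplot draws one line per column, the mean and both limits share the same axes automatically.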
This is doable using ggplot2 and reshape2. The structures you have are a little awkward, which you could improve by using a data frame instead of a matrix.
#Dummy data
average.price.at.each.timestep <- rnorm(1000, sd=0.01)
s.e.at.each.time <- rnorm(1000, sd=0.0005, mean=1)
#CIs (note you can vectorise this):
upper.c.l.at <- average.price.at.each.timestep+1.96*s.e.at.each.time
lower.c.l.at <- average.price.at.each.timestep-1.96*s.e.at.each.time
#create a data frame:
prices <- data.frame(time = 1:length(average.price.at.each.timestep), price=average.price.at.each.timestep, upperCI= upper.c.l.at, lowerCI= lower.c.l.at)
library(reshape2)
#turn the data frame into time, variable, value triplets
prices.t <- melt(prices, id.vars=c("time"))
#plot
library(ggplot2)
ggplot(prices.t, aes(time, value, colour=variable)) + geom_line()
This produces the following plot:
This can be improved somewhat by using geom_ribbon instead:
ggplot(prices, aes(time, price)) + geom_ribbon(aes(ymin=lowerCI, ymax=upperCI), alpha=0.1) + geom_line()
Which produces this plot:
Here's another, slightly different ggplot solution that does not require you to calculate the confidence limits first - ggplot does it for you.
# create sample dataset
set.seed(1) # for reproducible example
B <- matrix(rnorm(1000,mean=rep(10+1:10/2,each=10)),nc=10)
library(ggplot2)
library(reshape2) # for melt(...)
gg <- melt(data.frame(date=1:nrow(B),B), id="date")
ggplot(gg, aes(x=date,y=value)) +
# ggplot2 >= 3.3.0 names this argument fun (it was fun.y in older versions)
stat_summary(fun = mean, geom="line")+
stat_summary(fun = function(y)mean(y)-1.96*sd(y)/sqrt(length(y)), geom="line",linetype="dotted", color="blue")+
stat_summary(fun = function(y)mean(y)+1.96*sd(y)/sqrt(length(y)), geom="line",linetype="dotted", color="blue")+
theme_bw()
stat_summary(...) summarizes the y-values for a given value of x (the date). So in the first call, it calculates the mean, in the second the lowerCL, and in the third the upperCL.
You could also create a CL(...) function, and call that:
CL <- function(x,level=0.95,type=c("lower","upper")) {
fact <- c(lower=-1,upper=1)
mean(x) - fact[type]*qnorm((1-level)/2)*sd(x)/sqrt(length(x))
}
ggplot(gg, aes(x=date,y=value)) +
# extra arguments for CL are passed through fun.args
stat_summary(fun = mean, geom="line")+
stat_summary(fun = CL, fun.args = list(type="lower"), geom="line",linetype="dotted", color="blue")+
stat_summary(fun = CL, fun.args = list(type="upper"), geom="line",linetype="dotted", color="blue")+
theme_bw()
This produces a plot identical to the one above.
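For checking the numbers outside of a plot, the same normal-approximation limits can be computed in one call. A minimal sketch; ci_limits and sample_values are hypothetical names, and the data are simulated.

```r
# Hypothetical helper: mean with normal-approximation confidence limits
ci_limits <- function(x, level = 0.95) {
  z  <- qnorm(1 - (1 - level) / 2)   # about 1.96 for level = 0.95
  se <- sd(x) / sqrt(length(x))      # standard error of the mean
  c(lower = mean(x) - z * se, mean = mean(x), upper = mean(x) + z * se)
}

set.seed(1)
sample_values <- rnorm(50, mean = 10, sd = 2)
lims <- ci_limits(sample_values)
print(lims)
```

Applying it row by row, for example with apply(B, 1, ci_limits), gives the three curves for every time step.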

How to convert a bar histogram into a line histogram in R

I've seen many examples of a density plot, but the density plot's y-axis is the probability. What I am looking for is a line plot (like a density plot), but the y-axis should contain counts (like a histogram).
I can do this in excel where I manually make the bins and the frequencies and make a bar histogram and then I can change the chart type to a line - but can't find anything similar in R.
I've checked out both base and ggplot2; yet can't seem to find an answer. I understand that histograms are meant to be bars but I think representing them as a continuous line makes more visual sense.
Using default R graphics (i.e. without installing ggplot) you can do the following, which might also make what the density function does a bit clearer:
# Generate some data
data=rnorm(1000)
# Get the density estimate
dens=density(data)
# Plot y-values scaled by number of observations against x values
plot(dens$x,length(data)*dens$y,type="l",xlab="Value",ylab="Count estimate")
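If you want the actual bin counts rather than a scaled density estimate, hist() can compute them without drawing anything, and the counts can then be plotted as a line at the bin midpoints. A minimal sketch on simulated data; values and the choice of roughly 30 bins are assumptions for the illustration.

```r
set.seed(1)
values <- rnorm(1000)

# Compute the histogram without drawing it
h <- hist(values, breaks = 30, plot = FALSE)

# Bin midpoints on the x-axis, raw counts on the y-axis
plot(h$mids, h$counts, type = "l", xlab = "Value", ylab = "Count")
```

This mirrors the Excel workflow of building the bins first and then switching the chart type to a line.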
This is an old question, but I thought it might be helpful to post a solution that specifically addresses your question.
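In ggplot2, you can plot a histogram and display the count with bars using the following (here data is assumed to be a data frame with a numeric column called value):
ggplot(data, aes(x = value)) +
geom_histogram()
You can also plot a histogram and display the count with lines using a frequency polygon:
ggplot(data, aes(x = value)) +
geom_freqpoly()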
For more info --
ggplot2 reference
To adapt the example on the ?stat_density help page:
library(ggplot2)
library(ggplot2movies) # the movies data set now lives in this separate package
m <- ggplot(movies, aes(x = rating))
# Standard density plot.
m + geom_density()
# Density plot with y-axis scaled to counts.
m + geom_density(aes(y = after_stat(count)))
Although this is old, I thought the following might be useful.
Let's say you have a data set of 10,000 points, and you believe they belong to a certain distribution, and you would like to plot the histogram of the actual data and the line of the probability density of the ideal distribution on top of it.
noise <- 2
#
# the noise is tagged onto the end using runif
# just to demo issues w/real data and fitting
# the subtraction causes the data to have some
# negative values, which must be addressed in
# the fit later on
#
noisylognorm <- rlnorm(10000,
mean = 0.25,
sd = 1) +
(noise * runif(10000) - noise / 10)
#
# using package fitdistrplus
library(fitdistrplus)
#
# subset is used to remove the negative values
# as the lognormal distribution needs positive only
#
fitlnorm <- fitdist(subset(noisylognorm,
noisylognorm > 0),
"lnorm")
fitlnorm_density <- density(rlnorm(10000,
mean = fitlnorm$estimate[1],
sd = fitlnorm$estimate[2]))
hist(subset(noisylognorm,
noisylognorm < 25),
breaks = seq(-1, 25, 0.5),
col = "lightblue",
xlim = c(0, 25),
xlab = "value",
ylab = "frequency",
main = paste0("Log Normal Distribution\n",
"noise = ", noise))
lines(fitlnorm_density$x,
10000 * fitlnorm_density$y * 0.5,
type="l",
col = "red")
Note the * 0.5 in the lines function. It accounts for the width of the hist() bars: the expected count in a bin is roughly the number of observations times the density times the bin width, which is 0.5 here.
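The scaling can be checked on clean data where the true density is known. A minimal sketch, assuming a pure lognormal sample with no added noise and using dlnorm directly instead of a fitted model; n, binwidth, values and expected are names chosen for the illustration.

```r
set.seed(1)
n <- 10000
binwidth <- 0.5
values <- rlnorm(n, meanlog = 0.25, sdlog = 1)
kept <- values[values < 25]

# Histogram of counts with a fixed bin width
h <- hist(kept, breaks = seq(0, 25, by = binwidth), col = "lightblue",
          xlab = "value", ylab = "frequency", main = "Counts vs scaled density")

# Expected count per bin: observations * density at the bin midpoint * bin width
expected <- n * dlnorm(h$mids, meanlog = 0.25, sdlog = 1) * binwidth
lines(h$mids, expected, col = "red")
```

Changing binwidth rescales both the bars and the curve together, so the overlay should stay aligned.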
There is a very simple and fast way for count data.
First let's generate some dummy count data:
my.count.data = rpois(n = 10000, lambda = 3)
And then the plotting command (assuming you have called library(magrittr)):
my.count.data %>% table %>% plot
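The same result is available without magrittr, and the tabulated counts can be joined into a line if that is preferred over vertical bars. A minimal base-R sketch; counts and tab are names introduced here.

```r
set.seed(1)
counts <- rpois(n = 10000, lambda = 3)

# Frequency of each distinct value
tab <- table(counts)

# Values on the x-axis, frequencies on the y-axis, joined by a line
plot(as.integer(names(tab)), as.vector(tab), type = "b",
     xlab = "Value", ylab = "Count")
```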
