Extract critical points of a polynomial model object in R? - r

I am trying to solve for the critical (stationary) points of a cubic polynomial function that has been fitted to data, i.e. the values of x where the first derivative is zero.
I also need a way to find the values of y at the critical points of x.
It is easy enough to fit the model using lm() and to view the model quality with summary(). And I can plot the function easily enough by adding predictions and using geom_line().
There must be a package or a base R function dedicated to this problem. Can anyone suggest a method?
Below is a reprex to depict the problem. The arrows are drawn only to illustrate the question; they are not mapped to the true critical points, or I would not be asking this question.
library(tidyverse)
library(modelr)
set.seed(0)
#generate random data and plot the values
df <- tibble(x= sample(x= c(-100:200), size= 50),
y= -0.5*(x^3) + 50*(x^2) + 7*(x) + rnorm(n=50, mean=10000, sd=50000) )
df %>% ggplot(aes(x, y)) +
geom_point()
# fit a model to the data
cubic_poly_model <- lm(data= df, formula = y~poly(x, 3))
# plot the fitted model
df %>%
add_predictions(model = cubic_poly_model) %>%
ggplot(aes(x, y))+
geom_point(alpha=1/3)+
geom_line(aes(x, y=pred))+
annotate('text', label= 'critical point A', x=-50, y=-250000)+
geom_segment(x=-50, xend=-10, y=-200000, yend=-5000, arrow = arrow(length=unit(3, 'mm'), type = 'closed'))+
annotate('text', label= 'critical point B', x=140, y=400000)+
geom_segment(x=110, xend=90, y=300000, yend=100000, arrow = arrow(length=unit(3, 'mm'), type = 'closed'))
# But how can I get the critical values of x and the y values they produce?
Created on 2020-09-03 by the reprex package (v0.3.0)

I devised a solution using the mosaic package. The makeFun() function converts a model object into a function. You can then use base R optimize() to find the maximum or minimum of that function over a specified interval (in this case, the range of x values). Set the maximum argument of optimize() to TRUE for the local maximum, or leave it at its default of FALSE for the local minimum.
See code below:
library(magrittr)
set.seed(0)
#generate random data and plot the values
df <- tibble::tibble(x= sample(x= c(-100:200), size= 50),
y= -0.5*(x^3) + 50*(x^2) + 7*(x) + rnorm(n=50, mean=10000, sd=50000) )
cubic_poly_model <- lm(data= df, formula = y~poly(x, 3))
crit_values <- cubic_poly_model %>%
mosaic::makeFun() %>%
optimize(interval = c(min(df$x), max(df$x)), maximum = TRUE)
funct_crit_x <- crit_values[['maximum']][[1]]
funct_max <- crit_values[['objective']]
funct_crit_x
funct_max
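An alternative that returns both critical points at once is to differentiate the fitted cubic analytically. This is only a sketch, not part of the answer above: it assumes the model is refitted with poly(x, 3, raw = TRUE) so that the coefficients are ordinary polynomial coefficients, and the names fit_raw, d1, d2, crit_x and crit_y are made up for illustration. The roots of the derivative come from base R polyroot(), which expects coefficients in increasing order of power.

```r
set.seed(0)
# Same simulated data as in the question
x <- sample(-100:200, size = 50)
y <- -0.5 * x^3 + 50 * x^2 + 7 * x + rnorm(50, mean = 10000, sd = 50000)
df <- data.frame(x = x, y = y)

# Refit with raw polynomials so coef() gives b0 + b1*x + b2*x^2 + b3*x^3
fit_raw <- lm(y ~ poly(x, 3, raw = TRUE), data = df)
b <- unname(coef(fit_raw))

# Coefficients of the first derivative, in increasing order of power
d1 <- c(b[2], 2 * b[3], 3 * b[4])

# Keep only the (numerically) real roots of the derivative
roots <- polyroot(d1)
crit_x <- sort(Re(roots)[abs(Im(roots)) < 1e-6])

# Fitted y values at the critical points
crit_y <- unname(predict(fit_raw, newdata = data.frame(x = crit_x)))

# The sign of the second derivative classifies each point
d2 <- 2 * b[3] + 6 * b[4] * crit_x
data.frame(x = crit_x, y = crit_y,
           type = ifelse(d2 < 0, "local maximum", "local minimum"))
```

The raw and orthogonal parameterisations give the same fitted curve, so the critical points apply to the original model as well.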

Related

Spatial clustering/sampling in R

I have a spatial data frame in R. We have a class imbalance problem (our response variable is binary and positive cases are approx. 10% of the dataset), so I want to set aside the positive cases and then select a portion of the negative cases to combat the class imbalance in the model. I want to select negative cases that are closely related spatially, and I am really struggling to figure out how.
Some ideas I have thought about which may work
KNN to cluster negative cases
Overlay spatial grid and extract x number of samples from each grid square
Buffer analysis and randomly select within buffer
If anyone has recommendations on how to execute this in R it would be awesome.
thanks
Just going to answer here in case anyone else searches this.
I decided to go with k-means clustering, then added the cluster as a column to the data frame and randomly sampled from the clusters.
Code below!
## Cluster analysis
set.seed(1)
clusdb <- W_neg[c(
"x_coor_farm", "y_coor_farm",
"Area_Farm_SqM", "NatGrass_1km_buff",
"BioFor_1km_buff", "MixedFor_1km_buff",
"Area_Cut_012", "Area_Cut_1224", "Area_Cut_2436", "Cut_Count_012", "Cut_Count_1224", "Cut_Count_2436")]
## Write a function to loop the algorithm
kmean_withinss <- function(k) {
cluster <- kmeans(clusdb, k)
return(cluster$tot.withinss)
}
# Set maximum number of clusters
max_k <- 20
# Run algorithm over a range of k
wss <- sapply(2:max_k, kmean_withinss)
# Data frame of kmeans output to find optimal k
elbow <- data.frame(2:max_k, wss)
# Plot
library(ggplot2)
ggplot(elbow, aes(x = X2.max_k, y = wss)) + geom_point() + geom_line() + scale_x_continuous(breaks = seq(1, 20, by = 1))
# Optimal k = 8
# Re-run the model with optimal k
pc_cluster_2 <- kmeans(clusdb, 8)
pc_cluster_2$cluster
pc_cluster_2$centers
pc_cluster_2$size
pc_cluster_2$totss
pc_cluster_2$betweenss
pc_cluster_2$betweenss/pc_cluster_2$totss*100
# 92%
# Add column to data frame
W_neg$cluster <- pc_cluster_2$cluster
W_neg <- W_neg[c("TB2017", "x_coor_farm", "y_coor_farm", "Area_Farm_SqM", "NatGrass_1km_buff", "BioFor_1km_buff", "MixedFor_1km_buff", "Area_Cut_012", "Area_Cut_1224", "Area_Cut_2436", "Cut_Count_012", "Cut_Count_1224", "Cut_Count_2436", "cluster")]
ggplot(data = W_neg, aes(y = cluster)) + geom_bar(aes(fill = TB2017)) + ggtitle("Count of Clusters by Region") + theme(plot.title = element_text(hjust = 0.5))
library(factoextra) # provides fviz_cluster()
fviz_cluster(pc_cluster_2, data = scale(clusdb), geom = c("point"),ellipse.type = "euclid")
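The answer mentions randomly sampling from the clusters but does not show that step. One way it might look in base R is sketched below. The data frame W_neg_demo, the choice of four clusters and n_per_cluster are all hypothetical stand-ins, since the original W_neg data is not available.

```r
set.seed(1)

# Hypothetical stand-in for W_neg: two coordinate columns only
W_neg_demo <- data.frame(x_coor_farm = runif(200), y_coor_farm = runif(200))

# Cluster on scaled coordinates and store the label as a column
W_neg_demo$cluster <- kmeans(scale(W_neg_demo), centers = 4)$cluster

# Draw up to n_per_cluster rows from every cluster (assumed sample size)
n_per_cluster <- 10
rows_by_cluster <- split(seq_len(nrow(W_neg_demo)), W_neg_demo$cluster)
sampled_rows <- unlist(lapply(rows_by_cluster, function(idx) {
  # sample.int() avoids the surprise sample() gives for a length-one vector
  idx[sample.int(length(idx), min(n_per_cluster, length(idx)))]
}))

W_neg_sample <- W_neg_demo[sampled_rows, ]
table(W_neg_sample$cluster)
```

Sampling a fixed number per cluster gives equal representation; sampling a fixed fraction instead would preserve the relative cluster sizes.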

Shade an area under density curve, to mark the Highest Density Interval (HDI)

I thought this should be straightforward, but I'm lost, despite tons of information online.
My Problem: I have a vector of data points, for which I want to plot a density curve, then color the area under the curve to signify the Highest Density Interval (HDI). Naturally, I'm trying to achieve this with ggplot2 package, and specifically with qplot(), since my data comes as a vector, and not a data frame.
Reproducible Example
library(ggplot2)
library(HDInterval)
## create data vector
set.seed(789)
dat <- rnorm(1000)
## plot density curve with qplot and mark 95% hdi
qplot(dat, geom = "density")+
geom_vline(aes(xintercept = c(hdi(dat))))
So I get this:
But what I really want is something like this:
Is there a simple way to achieve this with ggplot2::qplot?
You can do this with the ggridges package. The trick is that we can provide HDInterval::hdi as quantile function to geom_density_ridges_gradient(), and that we can fill by the "quantiles" it generates. The "quantiles" are the numbers in the lower tail, in the middle, and in the upper tail.
As a general point of advice, I would recommend against using qplot(). It's more likely going to cause confusion, and putting a vector into a tibble is not a lot of effort.
library(tidyverse)
library(HDInterval)
library(ggridges)
#>
#> Attaching package: 'ggridges'
#> The following object is masked from 'package:ggplot2':
#>
#> scale_discrete_manual
## create data vector
set.seed(789)
dat <- rnorm(1000)
df <- tibble(dat)
## plot density curve with ggplot and mark 95% hdi
ggplot(df, aes(x = dat, y = 0, fill = stat(quantile))) +
geom_density_ridges_gradient(quantile_lines = TRUE, quantile_fun = hdi, vline_linetype = 2) +
scale_fill_manual(values = c("transparent", "lightblue", "transparent"), guide = "none")
#> Picking joint bandwidth of 0.227
Created on 2019-12-24 by the reprex package (v0.3.0)
The colors in scale_fill_manual() are in the order of the three groups, so if you, for example, only wanted to shade the left tail, you would write values = c("lightblue", "transparent", "transparent").
When I read this post I was really grateful for your answer, Wilke. But I wondered how to adjust the credible mass of the HDI. Finally I found a solution! It clicked once I figured out where the quantiles argument ends up (I access it via dots[[2]]). I wrote the following function (because passing the value to HDInterval::hdi didn't work out of the box):
hdi_custWidth <- function(...) {
dots <- list(...)
quantiles <- dots[[2]]
hdi_width <- quantiles[[length(quantiles)]] # uses the last entry if it's a vector, which should be the largest one; better to pass a single double < 1.0
if (is.na(hdi_width)) hdi_width <- .89 # happens if quantiles = 1L
message(paste0('HDI credible interval width = ', hdi_width))
HDInterval::hdi(dots[[1]], credMass = hdi_width)
}
You can use it to alter the reprex from the post above:
library(tidyverse)
library(HDInterval)
library(ggridges)
#>
#> Attaching package: 'ggridges'
#> The following object is masked from 'package:ggplot2':
#>
#> scale_discrete_manual
## create data vector
set.seed(789)
dat <- rnorm(1000)
df <- tibble(dat)
## plot density curve with ggplot and mark 90% hdi
ggplot(df, aes(x = dat, y = 0, fill = stat(quantile))) +
geom_density_ridges_gradient(quantile_lines = TRUE, quantile_fun = hdi_custWidth, quantiles = .90, vline_linetype = 2) +
scale_fill_manual(values = c("transparent", "lightblue", "transparent"), guide = "none")
#> Picking joint bandwidth of 0.227
Of course you can pick any value between 0 and 1 in the quantiles argument (not just .90) and get the corresponding hdi credible mass.
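If you would rather not depend on ggridges, the same picture can be approximated with plain ggplot2 by computing the density yourself and shading only the part inside the interval. This is a rough sketch rather than either answerer's method; dens_df, lims and in_hdi are illustrative names, and the credible mass is passed through the credMass argument of HDInterval::hdi().

```r
library(ggplot2)
library(HDInterval)

set.seed(789)
dat <- rnorm(1000)

# Kernel density estimate as a data frame
dens <- density(dat)
dens_df <- data.frame(x = dens$x, y = dens$y)

# HDI limits; hdi() returns a named vector with "lower" and "upper"
lims <- hdi(dat, credMass = 0.95)
in_hdi <- dens_df$x >= lims[["lower"]] & dens_df$x <= lims[["upper"]]

# Shade the area under the curve inside the HDI, then draw the full curve
p <- ggplot(dens_df, aes(x, y)) +
  geom_area(data = dens_df[in_hdi, ], fill = "lightblue") +
  geom_line() +
  labs(x = "dat", y = "density")
p
```

Because the shaded region is cut at the nearest grid points of density(), its edges can sit slightly inside the exact HDI limits; increasing the n argument of density() narrows that gap.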

Best Way to Plot three vectors in R?

I have a vector of length 10k for each of the variables x and z. For each of the 10k, I have also estimated propensity scores using logit and other methods. So I have another vector that contains the predicted propensity scores.
I want to plot predicted propensity vector as the height of the 3d graph and as a function of the x and z vectors (I want something like a surface). What is the best way to go about doing this? I tried using scatter3d() from the plot3d library and it looks very bad.
Sample data: https://www.dropbox.com/s/1lf36dpxvebd7kw/mydata2.csv?dl=0
Updated Answer
Using the data you provided, we can bin the data, get the average propensity score by bin and plot using geom_tile. I provide code for that below. A better option would be to fit the propensity score model using the x and z vectors (and the binary treatment variable that you're predicting). Then, create a new data frame of predicted pz_p values on a complete grid of x and z values and plot that. I don't have your binary treatment variable with which to fit the model, so I haven't produced an actual plot, but the code would look something like this:
# Propensity score model
m1 = glm(treat ~ x + z, data=dat, family=binomial)
# Get propensity scores on full grid of x and z values
n = 100 # Number of grid points. Adjust as needed.
pred.dat = expand.grid(x=seq(min(dat$x),max(dat$x),length=n),
z=seq(min(dat$z),max(dat$z),length=n))
pred.dat$pz_p = predict(m1, newdata=pred.dat, type="response")
ggplot(pred.dat, aes(x, z, fill=pz_p)) +
geom_tile() +
scale_fill_gradient2(low="red", mid="white", high="blue", midpoint=0.5, limits=c(0,1))
Code for tile plot with binned data:
library(tidyverse)
theme_set(theme_classic())
dat = read_csv("mydata2.csv")
# Bin by x and z
dat = dat %>%
mutate(xbin = cut(x,breaks=seq(round(min(x),1)-0.05,round(max(x),1)+0.05,0.1),
labels=seq(round(min(x),1), round(max(x),1),0.1)),
xbin=as.numeric(as.character(xbin)),
zbin = cut(z,breaks=seq(round(min(z),1)-0.1,round(max(z),1)+0.1,0.2),
labels=seq(round(min(z),1), round(max(z),1),0.2)),
zbin=as.numeric(as.character(zbin)))
# Calculate average pz_p by bin and then plot
ggplot(dat %>% group_by(xbin, zbin) %>%
summarise(pz_p=mean(pz_p)),
aes(xbin, zbin, fill=pz_p)) +
geom_tile() +
scale_fill_gradient2(low="red", mid="white", high="blue", midpoint=0.5, limits=c(0,1))
Original Answer
A heat map might work well here. For example:
library(ggplot2)
# Fake data
set.seed(2)
dat = expand.grid(x=seq(0,10,length=100),
z=seq(0,10,length=100))
dat$ps = 1/(1 + exp(0.3 + 0.2*dat$x - 0.5*dat$z))
ggplot(dat, aes(x, z, fill=ps)) +
geom_tile() +
scale_fill_gradient2(low="red", mid="white", high="blue", midpoint=0.5, limits=c(0,1)) +
coord_equal()
Or in 3D with rgl::persp3d:
library(rgl)
library(tidyverse)
x=unique(sort(dat$x))
z=unique(sort(dat$z))
ps=dat %>% spread(z, ps) %>% select(-1) %>% as.matrix
persp3d(x, z, ps, col="lightblue")

Change colors of select lines in ggplot2 coefficient plot in R

I would like to change the color of coefficient lines based on whether the point estimate is negative or positive in a ggplot2 coefficient plot in R. For example:
require(coefplot)
set.seed(123)
dat <- data.frame(x = rnorm(100), y1 = rnorm(100), z = rnorm(100))
mod1 <- lm(y1 ~ x + z, data = dat)
coefplot.lm(mod1)
Which produces the following plot:
In this plot, I would like to change the "x" variable to red when plotted. Any ideas? Thanks.
I don't think you can do this with a plot produced by coefplot.lm. The coefplot package uses ggplot2 as its plotting system, which is good in itself, but it does not let you play with colors as easily as you would like. To get the desired colors, you need a variable in your data set that color-codes the values, and you need to map it to color in the aes() call of the layer that draws the points with their confidence intervals. Apparently this is impossible to do with the output of the coefplot.lm function. You might be able to change the colors afterwards using ggplot2's ggplot_build() function, but I would say it's easier to write your own function for this task.
I've done this once to plot odds. If you want, you may use my code. Feel free to change it. The idea is the same as in coefplot. First, we extract coefficients from a model object and prepare the data set for plotting; second, actually plot.
The code for extracting coefficients and data set preparation
df_plot_odds <- function(x){
tmp<-data.frame(cbind(exp(coef(x)), exp(confint.default(x))))
odds<-tmp[-1,]
names(odds)<-c('OR', 'lower', 'upper')
odds$vars<-row.names(odds)
odds$col<-odds$OR>1
odds$col[odds$col==TRUE] <-'blue'
odds$col[odds$col==FALSE] <-'red'
odds$pvalue <- summary(x)$coef[-1, "Pr(>|t|)"]
return(odds)
}
Plot the output of the extract function
plot_odds <- function(df_plot_odds, xlab="Odds Ratio", ylab="", asp=1){
require(ggplot2)
p <- ggplot(df_plot_odds, aes(x=vars, y=OR, ymin=lower, ymax=upper),asp=asp) +
geom_errorbar(aes(color=col),width=0.1) +
geom_point(aes(color=col),size=3)+
geom_hline(yintercept = 1, linetype=2) +
scale_color_manual('Effect', labels=c('Positive','Negative'),
values=c('blue','red'))+
coord_flip() +
theme_bw() +
theme(legend.position="none",aspect.ratio = asp)+
ylab(xlab) +
xlab(ylab) #switch because of the coord_flip() above
return(p)
}
Plotting your example
set.seed(123)
dat <- data.frame(x = rnorm(100),y = rnorm(100), z = rnorm(100))
mod1 <- lm(y ~ x + z, data = dat)
df <- df_plot_odds(mod1)
plot <- plot_odds(df)
plot
Which yields
Note that I chose theme_bw() as the default. The output is a ggplot2 object, so you can change it quite a lot.
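The functions above plot exponentiated coefficients (odds ratios), while the question asks about the sign of the raw point estimates. A shorter variant for that case might look like the sketch below; coef_df and its column names are illustrative, and the intervals come from confint() rather than confint.default().

```r
library(ggplot2)

set.seed(123)
dat <- data.frame(x = rnorm(100), y = rnorm(100), z = rnorm(100))
mod <- lm(y ~ x + z, data = dat)

# Collect estimates and 95% confidence intervals in one data frame
ci <- confint(mod)
coef_df <- data.frame(term = names(coef(mod)),
                      estimate = unname(coef(mod)),
                      lower = ci[, 1],
                      upper = ci[, 2])

# Color-coding variable: the sign of the point estimate
coef_df$sign <- ifelse(coef_df$estimate < 0, "Negative", "Positive")

p <- ggplot(coef_df, aes(x = estimate, y = term, colour = sign)) +
  geom_vline(xintercept = 0, linetype = 2) +
  geom_segment(aes(x = lower, xend = upper, yend = term)) +
  geom_point(size = 3) +
  scale_colour_manual(values = c(Negative = "red", Positive = "blue")) +
  theme_bw()
p
```

Naming the entries of values ties each color to its label, so the mapping does not depend on the alphabetical order of the levels.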

How do I plot the first derivative of the smoothing function?

I have the following script that emulates the type of data structure I have and analysis that I want to do on it,
library(ggplot2)
library(reshape2)
n <- 10
df <- data.frame(t=seq(n)*0.1, a =sort(rnorm(n)), b =sort(rnorm(n)),
a.1=sort(rnorm(n)), b.1=sort(rnorm(n)),
a.2=sort(rnorm(n)), b.2=sort(rnorm(n)))
head(df)
mdf <- melt(df, id=c('t'))
## head(mdf)
levels(mdf$variable) <- rep(c('a','b'),3)
g <- ggplot(mdf,aes(t,value,group=variable,colour=variable))
g +
stat_smooth(method='lm', formula = y ~ ns(x,3)) +
geom_point() +
facet_wrap(~variable)
What I would like to do in addition to this is plot the first derivative of the smoothing function against t and against the factors, c('a','b'), as well. Any suggestions how to go about this would be greatly appreciated.
You'll have to construct the derivative yourself, and there are two possible ways for that. Let me illustrate by using only one group:
require(splines) # thx @Chase for the notice
lmdf <- mdf[mdf$variable=="b",]
model <- lm(value~ns(t,3),data=lmdf)
You then simply define your derivative as diff(Y)/diff(X) based on your predicted values, as you would do for differentiation of a discrete function. It's a very good approximation if you take enough X points.
X <- data.frame(t=seq(0.1,1.0,length=100) ) # make an ordered sequence
Y <- predict(model,newdata=X) # calculate predictions for that sequence
plot(X$t,Y,type="l",main="Original fit") #check
dY <- diff(Y)/diff(X$t) # the derivative of your function
dX <- rowMeans(embed(X$t,2)) # centers the X values for plotting
plot(dX,dY,type="l",main="Derivative") #check
As you can see, this way you obtain the points for plotting the derivative. You'll figure out from here how to apply this to both levels and combine those points to the plot you like. Below the plots from this sample code :
Here's one approach to plotting this with ggplot. There may be a more efficient way to do it, but this uses the manual calculations done by @Joris. We'll simply construct a long data.frame with all of the X and Y values while also supplying a variable to "facet" the plots:
require(ggplot2)
originalData <- data.frame(X = X$t, Y, type = "Original")
derivativeData <- data.frame(X = dX, Y = dY, type = "Derivative")
plotData <- rbind(originalData, derivativeData)
ggplot(plotData, aes(X,Y)) +
geom_line() +
facet_wrap(~type, scales = "free_y")
If the data is smoothed using smooth.spline, the derivative of the predicted data can be requested through the deriv argument of predict. Following on from @Joris's solution:
lmdf <- mdf[mdf$variable == "b",]
model <- smooth.spline(x = lmdf$t, y = lmdf$value)
Y <- predict(model, x = seq(0.1,1.0,length=100), deriv = 1) # first derivative
plot(Y$x, Y$y, type = 'l') # predict() returns plain vectors x and y here
Any dissimilarity in the output is most likely due to differences in the smoothing.
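To apply the finite-difference idea to both factor levels and plot the result with facets, something along these lines could work. It is a sketch on freshly simulated data with only one column per level; deriv_one(), grid and deriv_df are names invented here, and the spline fit mirrors the ns(t, 3) model used above.

```r
library(splines)
library(ggplot2)

set.seed(1)
n <- 10
df <- data.frame(t = seq(n) * 0.1, a = sort(rnorm(n)), b = sort(rnorm(n)))

# Fine grid on which the fitted curve is evaluated
grid <- data.frame(t = seq(0.1, 1.0, length.out = 100))

# Fit one spline model per variable and difference its predictions
deriv_one <- function(var) {
  model <- lm(reformulate("ns(t, 3)", response = var), data = df)
  y_hat <- predict(model, newdata = grid)
  data.frame(variable = var,
             t = (head(grid$t, -1) + tail(grid$t, -1)) / 2, # interval midpoints
             dvalue = diff(y_hat) / diff(grid$t))           # approximate slope
}

deriv_df <- do.call(rbind, lapply(c("a", "b"), deriv_one))

p <- ggplot(deriv_df, aes(t, dvalue, colour = variable)) +
  geom_line() +
  facet_wrap(~variable) +
  labs(y = "first derivative")
p
```

With 100 grid points there are 99 differences per variable, so deriv_df has 198 rows.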
