I have used 'predict' to find a fit line for a linear model (lm) I have created. Because the lm was built on only 2 data points and needs to have a positive slope, I have forced it to go through the origin (0,0). I have also weighted the model by the number of observations underlying each data point.
Question 1: (SOLVED -see comment by #Gregor)
Why does the predicted line lie so much closer to my second data point (B) than my first data point (A), when B has fewer underlying observations? Did I code something wrong here when weighting the model?
Question 2:
I am now plotting a GLM (link = logit), but how can I still force it through (0,0)? I've tried adding formula = y~0+x in several places, none of which seems to work.
M <- data.frame("rate" = c(0.4643,0.2143), "conc" = c(300,6000), "nr_dead" = c(13,3), "nr_surv" = c(15,11), "region" = c("A","B"))
M$tot_obsv <- (M$nr_dead+M$nr_surv)
M_conc <- M$conc
M_rate <- M$rate
M_tot_obsv <- M$tot_obsv
#**linear model of data, force 0,0 intercept, weighted by nr. of observations of each data point.**
M_lm <- lm(data = M, rate~0+conc, weights = tot_obsv)
#**plot line using "predict" function**
x_conc <-c(600, 6700)
y_rate <- predict(M_lm, list(conc = x_conc), weights = tot_obsv, type = 'response')
plot(x = M$conc, y = M$rate, pch = 16, ylim = c(0, 0.5), xlim = c(0,7000), xlab = "conc", ylab = "death rate")
lines(x_conc, y_rate, col = "red", lwd = 2)
#**EDIT 1:**
M_glm <- glm(cbind(nr_dead, nr_surv) ~ (0+conc), data = M, family = "binomial")
#*plot using 'predict' function*
binomial_smooth <- function(formula = (y ~ 0+x),...) {
geom_smooth(method = "glm", method.args = list(family = "binomial"), formula = (y ~ 0+x), ...)
}
tibble(x_conc = c(seq(300, 7000, 1), M$conc), y_rate = predict.glm(M_glm, list(conc = x_conc), type = "response")) %>% left_join(M, by = c('x_conc' = 'conc')) %>%
ggplot(aes(x = x_conc, y = y_rate)) + xlab("concentration") + ylab("death rate") +
geom_point(aes(y = rate, size = tot_obsv)) + binomial_smooth(formula = (y ~ 0+x)) + theme_bw()
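Both questions can be checked numerically. The following is a minimal base-R sketch using the two data points from the post; names such as slope_manual, pull_share and p_at_zero are illustrative and not from the original code. For a weighted least-squares line through the origin, the slope is sum(w*x*y) / sum(w*x^2), so each point pulls on the line in proportion to w*x^2. B's concentration is 20 times A's, which far outweighs A having twice as many observations. For the logit GLM, dropping the intercept fixes the linear predictor at 0 when conc = 0, which corresponds to a probability of 0.5, not 0.

```r
# Data from the question
M <- data.frame(rate = c(0.4643, 0.2143), conc = c(300, 6000),
                nr_dead = c(13, 3), nr_surv = c(15, 11))
M$tot_obsv <- M$nr_dead + M$nr_surv

# Question 1: closed-form weighted slope through the origin
slope_manual <- with(M, sum(tot_obsv * conc * rate) / sum(tot_obsv * conc^2))
M_lm <- lm(rate ~ 0 + conc, data = M, weights = tot_obsv)
slope_lm <- unname(coef(M_lm))

# Relative pull of each point on the slope: w * x^2 (A first, then B)
pull <- with(M, tot_obsv * conc^2)
pull_share <- pull / sum(pull)

# Question 2: a no-intercept logit model predicts p = 0.5 at conc = 0
M_glm <- glm(cbind(nr_dead, nr_surv) ~ 0 + conc, data = M, family = binomial)
p_at_zero <- unname(predict(M_glm, newdata = data.frame(conc = 0),
                            type = "response"))

print(c(manual = slope_manual, lm = slope_lm))
print(round(pull_share, 4))
print(p_at_zero)
```

Because plogis() only reaches 0 in the limit of an infinitely negative linear predictor, a logit curve in conc cannot pass through (0, 0) on the response scale, whatever formula is passed to geom_smooth.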
I'm trying to plot my predictions using the k-nearest neighbor method but am unable to do so; I get the error message shown below. I'm sure it has something to do with how I've set up my plot, but I'm unsure how I need to change it. The dataset is here: https://drive.google.com/file/d/1GYnlsXgT2GS9ubeXq8Pm7iNUWDRGogU_/view?usp=sharing
set.seed(20220719)
#splitting training and testing data
ii = createDataPartition(classification[,3], p = .75, list = F)
#split the data using the indices returned by createDataPartition
xTrain = classification[ii, 1:2] #predictors for training
yTrain = classification[ii, 3] #class label for training
xTest = classification[-ii, 1:2] #predictors for testing
yTest = classification[-ii, 3] #class label for testing
#set training options
#repeat 10 fold cross-validation, 5 times
opts = trainControl(method = 'repeatedcv', number = 10, repeats = 5)
#find optimal k (model)
kmeans_mod = train(x = xTrain, y = as.factor(yTrain),
method ='knn',
trControl = opts,
tuneGrid = data.frame(k = seq(3, 10)))
#test model on testing data
yTestPred = predict(kmeans_mod, newdata = xTest)
confusionMatrix(as.factor(yTestPred), as.factor(yTest))
#plot
plot(kmeans_mod, xTrain)
Gives the error message
Error in if (!(plotType %in% c("level", "scatter", "line"))) stop("plotType must be either level, scatter or line") :
the condition has length > 1
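I'm looking for an output like this:
The error occurs because plot() on a caret train object takes plotType as its second argument, so xTrain is interpreted as the plot type. Calling plot(kmeans_mod) on its own shows the resampled performance for each k rather than a classification map. To get a plot similar to the one in the question, you can create a grid of prediction points to produce the background classification map, then plot the test data on top using ggplot.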
# Create prediction data frame for test data
preds <- data.frame(X1 = xTest[,1], X2 = xTest[,2], Group = yTestPred)
# Create classification grid
gr <- expand.grid(X1 = seq(min(classification[,1]), max(classification[,1]),
length.out = 100),
X2 = seq(min(classification[,2]), max(classification[,2]),
length.out = 100))
gr$Group <- predict(kmeans_mod, newdata = gr)
# Plot the result
library(ggplot2)
ggplot(gr, aes(X1, X2, col = Group)) +
geom_point(size = 0.6) +
geom_point(data = preds, shape = 21, aes(fill = Group),
col = "black", size = 3) +
theme_minimal(base_size = 16)
Though you may prefer a raster:
library(ggplot2)
ggplot(gr, aes(X1, X2, fill = Group)) +
geom_raster(alpha = 0.3) +
geom_point(data = preds, shape = 21, col = "black", size = 3) +
theme_minimal(base_size = 16)
And you may wish to color the test data points with their actual level rather than their predicted level to get a visual impression of the model accuracy:
library(ggplot2)
ggplot(gr, aes(X1, X2, fill = Group)) +
geom_raster(alpha = 0.3) +
geom_point(data = within(preds, Group <- factor(yTest)),
col = "black", size = 3, shape = 21) +
theme_minimal(base_size = 16)
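The grid idea above does not depend on caret or ggplot2. As a rough sketch, here is the same technique with class::knn and base graphics. The data are made up to stand in for the question's dataset, and k = 5 is an arbitrary choice, not the tuned value.

```r
library(class)  # provides knn(); ships with R as a recommended package

set.seed(1)
# Made-up two-class data standing in for the question's dataset
n <- 100
train <- data.frame(X1 = c(rnorm(n, 0), rnorm(n, 2)),
                    X2 = c(rnorm(n, 0), rnorm(n, 2)))
labels <- factor(rep(c("a", "b"), each = n))

# Regular grid covering the range of both predictors
gr <- expand.grid(
  X1 = seq(min(train$X1), max(train$X1), length.out = 50),
  X2 = seq(min(train$X2), max(train$X2), length.out = 50)
)

# Classify every grid point; k = 5 is an arbitrary illustrative choice
gr$Group <- knn(train = train, test = gr[, c("X1", "X2")], cl = labels, k = 5)

# Background map from the grid, training points drawn on top
plot(gr$X1, gr$X2, col = c("lightblue", "pink")[gr$Group], pch = 15,
     cex = 0.6, xlab = "X1", ylab = "X2")
points(train$X1, train$X2, pch = 21, bg = c("blue", "red")[labels])
```

A finer grid (larger length.out) gives a smoother-looking boundary at the cost of more predictions.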
I'm working with the Wage dataset in the ISLR library. My objective is to perform a spline regression with knots at 3 locations (see code below). I can do this regression. That part is fine.
My issue concerns the visualization of the regression curve. Using base R functions, I seem to get the correct curve. But I can't seem to get quite the right curve using the tidyverse. This is what is expected, and what I get with the base functions:
This is what ggplot spits out
It's noticeably different. R gives me the following message when running the ggplot functions:
`geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'
What does this mean and how do I fix it?
library(tidyverse)
library(ISLR)
library(splines) # provides bs()
attach(Wage)
agelims <- range(age)
age.grid <- seq(from = agelims[1], to = agelims[2])
fit <- lm(wage ~ bs(age, knots = c(25, 40, 60), degree = 3), data = Wage) #Default is 3
plot(age, wage, col = 'grey', xlab = 'Age', ylab = 'Wages')
points(age.grid, predict(fit, newdata = list(age = age.grid)), col = 'darkgreen', lwd = 2, type = "l")
abline(v = c(25, 40, 60), lty = 2, col = 'darkgreen')
ggplot(data = Wage) +
geom_point(mapping = aes(x = age, y = wage), color = 'grey') +
geom_smooth(mapping = aes(x = age, y = fit$fitted.values), color = 'red')
I also tried
ggplot() +
geom_point(data = Wage, mapping = aes(x = age, y = wage), color = 'grey') +
geom_smooth(mapping = aes(x = age.grid, y = predict(fit, newdata = list(age = age.grid))), color = 'red')
but that looks very similar to the 2nd picture.
Thanks for any help!
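The message means that geom_smooth() does not draw your fit: it fits its own smoother to the x and y values you give it, and with no method specified it chose a GAM here. splines::bs() and s(., bs = "cs") from mgcv do very different things; the latter is a penalized regression spline. I would try (untested!)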
geom_smooth(method="lm",
formula= y ~ splines::bs(x, knots = c(25, 40, 60), degree = 3))
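Another option is to skip geom_smooth() entirely: predict from the lm fit on a grid and draw those predictions with geom_line(). The sketch below assumes made-up data in place of ISLR::Wage so that it runs without that package. The data frame names d and grid are illustrative, and the ggplot2 part only runs if ggplot2 is installed.

```r
library(splines)  # provides bs(); ships with R

set.seed(42)
# Made-up data standing in for ISLR::Wage
d <- data.frame(age = sample(18:80, 500, replace = TRUE))
d$wage <- 60 + 2.5 * d$age - 0.03 * d$age^2 + rnorm(500, sd = 15)

# Same model form as in the question
fit <- lm(wage ~ bs(age, knots = c(25, 40, 60), degree = 3), data = d)

# Predictions from the fitted model on an age grid
grid <- data.frame(age = seq(min(d$age), max(d$age)))
grid$wage <- predict(fit, newdata = grid)

if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  p <- ggplot(d, aes(age, wage)) +
    geom_point(colour = "grey") +
    geom_line(data = grid, colour = "red") +   # the lm fit, not a re-smoothed curve
    geom_vline(xintercept = c(25, 40, 60), linetype = 2)
  print(p)
}
```

Because geom_line() only connects the supplied points, the curve is exactly what predict() returned, with no second round of smoothing.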
I have data with a continuous independent variable and a binary dependent variable, so I was trying to apply logistic regression to analyse it. However, in contrast to the classical case with a single S-shaped transition, I have two transitions.
Here is an example of what I mean
library(ggplot2)
library(visreg)
classic.data = data.frame(x = seq(from = 0, by = 0.5, length = 30),
y = c(rep(0, times = 14), 1, 0, rep(1, times = 14)))
model.classic = glm(formula = y ~ x,
data = classic.data,
family = "binomial")
summary(model.classic)
visreg(model.classic,
partial = FALSE,
scale = "response",
alpha = 0)
my.data = data.frame(x = seq(from = 0, by = 0.5, length = 30),
y = c(rep(0, times = 10), rep(1, times = 10), rep(0, times = 10)))
model.my = glm(formula = y ~ x,
data = my.data,
family = "binomial")
summary(model.my)
visreg(model.my,
partial = FALSE,
scale = "response",
alpha = 0)
The blue line on each plot is the outcome of glm, while the red line is what I want to have.
Is there any way to apply logistic regression to such data? Or should I apply some other type of regression analysis?
In your second model, the log-odds of y are not a monotonic function of x. When you write y ~ x you assume that, as x increases, the probability of y = 1 only increases or only decreases, depending on the sign of the coefficient. That is not the case here: it increases and then decreases, making the average effect of x roughly zero (hence the straight line). You therefore need a non-linear function of x. You could do that with a gam from the mgcv package, where the effect of x is modelled as a smooth function:
library(mgcv)
my.data = data.frame(x = seq(from = 0, by = 0.5, length = 30),
y = c(rep(0, times = 10), rep(1, times = 10), rep(0, times = 10)))
m = gam(y ~ s(x), data = my.data, family = binomial)
plot(m)
That would lead to the following fit on the original scale:
my.data$prediction = predict(m, type = "response")
plot(my.data$x, my.data$y)
lines(my.data$x, my.data$prediction, col = "red")
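If a plain glm is preferred over a gam, one alternative (a different technique from the answer above, offered only as a sketch) is to add a quadratic term to the linear predictor, which allows one rise and one fall. The data below are simulated with noise purely for illustration; sim, m_quad and p_hat are made-up names.

```r
set.seed(7)
# Made-up noisy data with a 0 -> 1 -> 0 pattern along x
n <- 300
x <- runif(n, 0, 15)
p_true <- plogis(2 - 0.15 * (x - 7.5)^2)
y <- rbinom(n, size = 1, prob = p_true)
sim <- data.frame(x = x, y = y)

# Logistic regression with a quadratic term in the linear predictor
m_quad <- glm(y ~ x + I(x^2), data = sim, family = binomial)

# Fitted probabilities on a grid, on the response scale
grid <- data.frame(x = seq(0, 15, by = 0.1))
grid$p_hat <- predict(m_quad, newdata = grid, type = "response")

plot(sim$x, sim$y, xlab = "x", ylab = "y")
lines(grid$x, grid$p_hat, col = "red")
```

On the noise-free example data from the question, the classes are perfectly separated, so a quadratic glm is likely to warn about fitted probabilities of 0 or 1; the penalized smooth in gam tends to be more forgiving there.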
I have a data.frame object with a numeric column amount and a categorical column fraud:
amount <- c(60.00, 336.38, 119.00, 115.37, 220.01, 60.00, 611.88, 189.78, ...)
fraud <- c(1, 0, 0, 0, 0, 0, 1, 0, ...)
I want to fit a gamma distribution to amount but to plot it by factor(fraud).
I want a graph that will show me 2 curves with 2 different colors that will distinguish between the 2 sets (fraud/non fraud groups).
Here is what I have done so far:
fit.gamma1 <- fitdist(df$amount[df$fraud == 1], distr = "gamma", method = "mle")
plot(fit.gamma1)
fit.gamma0 <- fitdist(df$amount[df$fraud == 0], distr = "gamma", method = "mle")
plot(fit.gamma0)
I have used this reference:
How would you fit a gamma distribution to a data in R?
Perhaps what you want is
curve(dgamma(x, shape = fit.gamma0$estimate[1], rate = fit.gamma0$estimate[2]),
from = min(amount), to = max(amount), ylab = "")
curve(dgamma(x, shape = fit.gamma1$estimate[1], rate = fit.gamma1$estimate[2]),
from = min(amount), to = max(amount), col = "red", add = TRUE)
or with ggplot2
ggplot(data.frame(x = range(amount)), aes(x)) +
stat_function(fun = dgamma, aes(color = "Non fraud"),
args = list(shape = fit.gamma0$estimate[1], rate = fit.gamma0$estimate[2])) +
stat_function(fun = dgamma, aes(color = "Fraud"),
args = list(shape = fit.gamma1$estimate[1], rate = fit.gamma1$estimate[2])) +
theme_bw() + scale_color_discrete(name = NULL)
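The same overlay can be built for any number of groups by splitting the data and estimating parameters per group. The sketch below is base R only and, to stay dependency-free, swaps fitdist's maximum likelihood for simple method-of-moments estimates (shape = mean^2 / var, rate = mean / var), so the numbers will differ somewhat from an MLE fit. The data are simulated, and gamma_mom is a hypothetical helper.

```r
set.seed(3)
# Made-up amounts; the fraud group is drawn from a gamma with a larger mean
df <- data.frame(fraud = rep(c(0, 1), times = c(2000, 500)))
df$amount <- ifelse(df$fraud == 1,
                    rgamma(nrow(df), shape = 2, rate = 0.005),
                    rgamma(nrow(df), shape = 2, rate = 0.01))

# Method-of-moments estimates (not MLE) for a gamma distribution
gamma_mom <- function(x) c(shape = mean(x)^2 / var(x), rate = mean(x) / var(x))
est <- lapply(split(df$amount, df$fraud), gamma_mom)

# One density curve per group, evaluated on shared x values
xs <- seq(min(df$amount), max(df$amount), length.out = 400)
dens <- sapply(est, function(e) dgamma(xs, shape = e[["shape"]], rate = e[["rate"]]))

matplot(xs, dens, type = "l", lty = 1, col = c("black", "red"),
        xlab = "amount", ylab = "density")
legend("topright", legend = c("Non fraud", "Fraud"),
       col = c("black", "red"), lty = 1)
```

To use the MLE fits from the question instead, replace gamma_mom(x) with the estimate element of the corresponding fitdist result.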
I feel that I am close to finding the answer to my problem, but somehow I just cannot manage to do it. I have used the nls function to fit 3 parameters using a rather complicated function describing fertilization success of eggs (y-axis) over a range of sperm concentrations (x-axis) (Styan's model [1], [2]). Fitting the parameters works fine, but I cannot manage to plot a smoothed, extrapolated curve using the predict function (see the end of this post). I guess it is because I have used a value on the x-axis that was not fitted. My question is: how do I plot a smoothed and extrapolated curve based on a model fitted with the nls function, using a non-fitted parameter on the x-axis?
Here is an example:
library(ggplot2)
data.nls <- structure(list(S0 = c(0.23298, 2.32984, 23.2984, 232.98399, 2329.83993,
23298.39926), fert = c(0.111111111111111, 0.386792452830189,
0.158415841584158, 0.898648648648649, 0.616, 0.186440677966102
), speed = c(0.035161615379406, 0.035161615379406, 0.035161615379406,
0.035161615379406, 0.035161615379406, 0.035161615379406), E0 = c(6.86219803476946,
6.86219803476946, 6.86219803476946, 6.86219803476946, 6.86219803476946,
7.05624476582978), tau = c(1800, 1800, 1800, 1800, 1800, 1800
), B0 = c(0.000102758645352932, 0.000102758645352932, 0.000102758645352932,
0.000102758645352932, 0.000102758645352932, 0.000102758645352932
)), .Names = c("S0", "fert", "speed", "E0", "tau", "B0"), row.names = c(NA,
6L), class = "data.frame")
## Model S
modelS <- function(Fe, tb, Be) with (data.nls,{
x <- Fe*(S0/E0)*(1-exp(-B0*E0*tau))
b <- Fe*(S0/E0)*(1-exp(-B0*E0*tb))
x*exp(-x)+Be*(1-exp(-x)-(x*exp(-x)))*exp(-b)})
## Define starting values
start <- list(Fe = 0.2, tb = 0.1, Be = 0.1)
## Fit the model using nls
modelS.fitted <- nls(formula = fert ~ modelS(Fe, tb, Be), data = data.nls, start = start,
control=nls.control(warnOnly=TRUE,minFactor=1e-5),trace = T, lower = c(0,0,0),
upper = c(1, Inf, 1), algorithm = "port")
## Combine model parameters
model.data <- cbind(data.nls, data.frame(pred = predict(modelS.fitted)))
## Plot
ggplot(model.data) +
geom_point(aes(x = S0, y = fert), size = 2) +
geom_line(aes(x = S0, y = pred), lwd = 1.3) +
scale_x_log10()
I have tried following joran's example here, but it has no effect, maybe because I did not fit S0:
r <- range(model.data$S0)
S0.ext <- seq(r[1],r[2],length.out = 200)
predict(modelS.fitted, newdata = list(S0 = S0.ext))
# [1] 0.002871585 0.028289057 0.244399948 0.806316161 0.705116868 0.147974213
Your function should have the parameters (S0,E0,B0,tau,Fe,tb,Be). nls will look for these names in the data.frame passed to its data argument and only try to fit those it doesn't find there (provided that starting values are given). There is no need for this funny with business in your function. (with shouldn't be used inside functions anyway; it's meant for interactive usage.) In predict, newdata must contain all variables, that is S0, E0, B0, and tau.
Try this:
modelS <- function(S0,E0,B0,tau,Fe, tb, Be) {
x <- Fe*(S0/E0)*(1-exp(-B0*E0*tau))
b <- Fe*(S0/E0)*(1-exp(-B0*E0*tb))
x*exp(-x)+Be*(1-exp(-x)-(x*exp(-x)))*exp(-b)}
## Define starting values
start <- list(Fe = 0.2, tb = 0.1, Be = 0.1)
## Fit the model using nls
modelS.fitted <- nls(formula = fert ~ modelS(S0,E0,B0,tau,Fe, tb, Be), data = data.nls, start = start,
control=nls.control(warnOnly=TRUE,minFactor=1e-5),trace = T, lower = c(0,0,0),
upper = c(1, Inf, 1), algorithm = "port")
## Combine model parameters
model.data <- data.frame(
S0=seq(min(data.nls$S0),max(data.nls$S0),length.out=1e5),
E0=seq(min(data.nls$E0),max(data.nls$E0),length.out=1e5),
B0=seq(min(data.nls$B0),max(data.nls$B0),length.out=1e5),
tau=seq(min(data.nls$tau),max(data.nls$tau),length.out=1e5))
model.data$pred <- predict(modelS.fitted,newdata=model.data)
## Plot
ggplot(data.nls) +
geom_point(aes(x = S0, y = fert), size = 2) +
geom_line(data=model.data,aes(x = S0, y = pred), lwd = 1.3) +
scale_x_log10()
Obviously, this might not be what you want, since the function has multiple variables and more than one of them varies in newdata. Normally one would vary only one and keep the others constant for such a plot.
So this might be more appropriate:
S0 <- seq(min(data.nls$S0),max(data.nls$S0),length.out=1e4)
E0 <- seq(1,20,length.out=20)
B0 <- unique(data.nls$B0)
tau <- unique(data.nls$tau)
model.data <- expand.grid(S0,E0,B0,tau)
names(model.data) <- c("S0","E0","B0","tau")
model.data$pred <- predict(modelS.fitted,newdata=model.data)
## Plot
ggplot(model.data) +
geom_line(aes(x = S0, y = pred, color=interaction(E0,B0,tau)), lwd = 1.3) +
geom_point(data=data.nls,aes(x = S0, y = fert), size = 2) +
scale_x_log10()
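The rule described above (names found in data are variables, names given in start are parameters, and newdata must supply every variable) can be seen in a much smaller example. This is only a sketch on simulated data; the function decay, the constant k and the true values 5 and 0.3 are made up and unrelated to Styan's model.

```r
set.seed(11)
# Made-up data: y depends on variable x and a known constant k,
# with parameters a and b to be estimated
dat <- data.frame(x = seq(1, 10, length.out = 25), k = 2)
dat$y <- 5 * exp(-0.3 * dat$x) * dat$k + rnorm(25, sd = 0.05)

# All inputs are explicit arguments; nls treats names found in `data` as
# variables and names listed in `start` as parameters to estimate
decay <- function(x, k, a, b) a * exp(-b * x) * k

fit <- nls(y ~ decay(x, k, a, b), data = dat, start = list(a = 4, b = 0.2))

# newdata must supply every variable (x and k); x may extend past the data
new <- data.frame(x = seq(0, 15, length.out = 200), k = 2)
new$pred <- predict(fit, newdata = new)

plot(dat$x, dat$y, xlim = c(0, 15), ylim = range(c(dat$y, new$pred)),
     xlab = "x", ylab = "y")
lines(new$x, new$pred, col = "red")
```

Here predict() returns 200 values because newdata has 200 rows; with a function that reads its inputs from a fixed data frame via with(), it would keep returning one value per original row regardless of newdata.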