Cut function returns NA for intervals - r

I am trying to use the cut function to create age intervals. Unfortunately, I receive NAs for values that match the lower end of the first break.
For example:
AGE <- sample(18:50, 100, replace = TRUE)
AGE_GROUPS <- cut(AGE, breaks = c(18, 27, 36, 45))
DF <- data.frame(AGE, AGE_GROUPS)
For all values where AGE is 18 or above 45, I receive NA in the AGE_GROUPS variable. How can I make sure that the lowest bracket in AGE_GROUPS includes 18, and that the highest bracket includes all values >= 45?

The breaks argument isn't just the intermediate cut points; it includes the endpoints too. You can make sure you get everything with
breaks = c(-Inf, 18, 27, 36, 45, Inf)
A little more conservatively, you could use
breaks = c(0, 18, 27, 36, 45, 120)
which can be useful for catching outlier coding errors. You may also want include.lowest = TRUE. See ?cut for examples.
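For example, one way to keep the original four age brackets while covering both endpoints (a small sketch combining the suggestions above, assuming all ages are at least 18):
AGE <- sample(18:50, 100, replace = TRUE)
AGE_GROUPS <- cut(AGE, breaks = c(18, 27, 36, 45, Inf), include.lowest = TRUE)
table(AGE_GROUPS, useNA = "ifany")  # intervals [18,27], (27,36], (36,45], (45,Inf]; no NAs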

Related

simple curve fitting in R

I am trying to find a fit for my data, but so far I have had no luck.
I tried a logarithmic fit and different models from the drc package, but I am sure there must be a better one; I just don't know the type.
On a different note, I would be grateful for advice on how to go about curve hunting in general.
library(drc)
df<-structure(list(x = c(10, 11, 12, 13, 14, 15, 16, 17, 18, 19,
20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35,
36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51,
52, 53, 54, 55), y = c(0.1066, -0.6204, -0.2028, 0.2621, 0.4083,
0.4497, 0.6343, 0.7762, 0.8809, 1.0029, 0.8089, 0.7845, 0.8009,
0.9319, 0.9414, 0.9505, 0.9323, 1.0321, 0.9381, 0.8975, 1.0929,
1.0236, 0.9589, 1.0644, 1.0411, 1.0763, 0.9679, 1.003, 1.142,
1.1049, 1.2868, 1.1569, 1.1952, 1.0802, 1.2125, 1.3765, 1.263,
1.2507, 1.2125, 1.2207, 1.2836, 1.3352, 1.1311, 1.2321, 1.4277,
1.1645), w = c(898, 20566, 3011, 1364, 1520, 2376, 1923, 1934,
1366, 1010, 380, 421, 283, 262, 227, 173, 118, 113, 95, 69, 123,
70, 80, 82, 68, 83, 76, 94, 101, 97, 115, 79, 98, 84, 92, 121,
97, 102, 93, 92, 101, 74, 124, 64, 52, 63)), row.names = c(NA,
-46L), class = c("tbl_df", "tbl", "data.frame"), na.action = structure(c(`47` = 47L), class = "omit"))
fit <- drm(y ~ x, data = df, fct = LL.4(), weights = w)
plot(fit)
1) If we ignore the weights, then y = a + b * x + c/x^2 seems to fit; it is linear in the coefficients, so it is easy to fit. The data seem upward sloping, so we started with a line, but then we needed to dampen that, so we added a reciprocal term. A reciprocal quadratic worked slightly better than a plain reciprocal based on the residual sum of squares, so we switched to that.
fm <- lm(y ~ x + I(1 / x^2), df)
coef(summary(fm))
##                  Estimate   Std. Error   t value     Pr(>|t|)
## (Intercept)  1.053856e+00  0.116960752  9.010341 1.849238e-11
## x            4.863077e-03  0.002718613  1.788808 8.069195e-02
## I(1/x^2)    -1.460443e+02 16.518887452 -8.841049 3.160306e-11
The coefficient of the x term is not significant at the 5% level -- the p value is 8% in the table above -- so we can remove it and it will fit nearly as well giving a model with only two parameters. In the plot below the fm fit with 3 parameters is solid and the fm2 fit with 2 parameters is dashed.
fm2 <- lm(y ~ I(1 / x^2), df)
plot(y ~ x, df)
lines(fitted(fm) ~ x, df)
lines(fitted(fm2) ~ x, df, lty = 2)
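If you want to check formally that dropping the x term costs little, one quick sketch (not in the original answer) is to compare the nested fits:
anova(fm2, fm)        # F test for the x term
c(AIC(fm), AIC(fm2))  # or compare information criteria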
2) Another approach is to use two straight lines. This is still continuous but has one non-differentiable point at the point of transition. The model has 4 parameters, the intercepts and slopes of each line. Below we do use the weights. It has the advantage of an obvious motivation based on the appearance of the data. The break point at the intersection of the two lines may have significance as the transition point between the higher sloping initial growth and the lower sloping subsequent growth.
# starting values use lines fitted to 1st ten and last 10 points
fm_1 <- lm(y ~ x, df, subset = 1:10)
fm_2 <- lm(y ~ x, df, subset = seq(to = nrow(df), length = 10))
st <- list(a = coef(fm_1)[[1]], b = coef(fm_1)[[2]],
           c = coef(fm_2)[[1]], d = coef(fm_2)[[2]])
fm3 <- nls(y ~ pmin(a + b * x, c + d * x), df, start = st, weights = w)
# point of transition
X <- with(as.list(coef(fm3)), (a - c) / (d - b)); X
## [1] 16.38465
Y <- with(as.list(coef(fm3)), a + b * X); Y
## [1] 0.8262229
plot(y ~ x, df)
lines(fitted(fm3) ~ x, df)
The basic idea is to understand how the selected function performs. Take a function you know (e.g. the logistic) and modify it. Or (even better) go to the literature and see which functions people use in your specific domain. Then create a user-defined model, play with it to understand the parameters, define good start values and then fit it.
Here is a quick & dirty example of a user-defined function (with package growthrates). Something similar can surely be done with drc.
library("growthrates")
grow_userdefined <- function(time, parms) {
  with(as.list(parms), {
    y <- (K * y0) / (y0 + (K - y0) * exp(-mumax * time)) + shift
    return(as.matrix(data.frame(time = time, y = y)))
  })
}
fit <- fit_growthmodel(FUN = grow_userdefined,
                       p = c(y0 = -1, K = 1, mumax = 0.1, shift = 1),
                       time = df$x, y = df$y)
plot(fit)
summary(fit)
It can of course be made better. As we have no exponential start at the onset, one can for example start with a simple saturation function instead of a logistic, e.g. something Monod-like. As said, the preferred way is to use a function related to the application domain.
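For illustration, here is a minimal sketch of such a Monod-like (saturating, no exponential onset) user-defined function in the same style; the parameter names K, k and shift and the start values are illustrative guesses, not fitted results from the answer above:
grow_monod_like <- function(time, parms) {
  with(as.list(parms), {
    # simple saturating curve: rises quickly, then levels off near K + shift
    y <- K * time / (k + time) + shift
    return(as.matrix(data.frame(time = time, y = y)))
  })
}
fit2 <- fit_growthmodel(FUN = grow_monod_like,
                        p = c(K = 1.5, k = 5, shift = 0),
                        time = df$x, y = df$y)
plot(fit2)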

R:mgcv add colorbar to 2D heatmap of GAM

I'm fitting a gam with mgcv and plot the result with the default plot.gam() function. My model includes a 2D-smoother and I want to plot the result as a heatmap. Is there any way to add a colorbar for the heatmap?
I've previously looked into other GAM plotting packages, but none of them provided the necessary visualisation. Please note, this is just a simplification for illustration purposes; the actual model (and reporting needs) are much more complicated.
Edited: I initially had swapped y and z in my tensor product; I have updated both the code and the plot to reflect the correct version.
df.gam <- gam(y ~ te(x, z), data = df, method = 'REML')
plot(df.gam, scheme = 2, hcolors = heat.colors(999, rev = TRUE), rug = FALSE)
sample data:
structure(list(x = c(3, 17, 37, 9, 4, 11, 20.5, 11.5, 16, 17,
18, 15, 13, 29.5, 13.5, 25, 15, 13, 20, 20.5, 17, 11, 11, 5,
16, 13, 3.5, 16, 16, 5, 20.5, 2, 20, 9, 23.5, 18, 3.5, 16, 23,
3, 37, 24, 5, 2, 9, 3, 8, 10.5, 37, 3, 9, 11, 10.5, 9, 5.5, 8,
22, 15.5, 18, 15, 3.5, 4.5, 20, 22, 4, 8, 18, 19, 26, 9, 5, 18,
10.5, 30, 15, 13, 27, 19, 5.5, 18, 11.5, 23.5, 2, 25, 30, 17,
18, 5, 16.5, 9, 2, 2, 23, 21, 15.5, 13, 3, 24, 17, 4.5), z = c(144,
59, 66, 99, 136, 46, 76, 87, 54, 59, 46, 96, 38, 101, 84, 64,
92, 56, 69, 76, 93, 109, 46, 124, 54, 98, 131, 89, 69, 124, 105,
120, 69, 99, 84, 75, 129, 69, 74, 112, 66, 78, 118, 120, 103,
116, 98, 57, 66, 116, 108, 95, 57, 41, 20, 89, 61, 61, 82, 52,
129, 119, 69, 61, 136, 98, 94, 70, 77, 108, 118, 94, 105, 52,
52, 38, 73, 59, 110, 97, 87, 84, 119, 64, 68, 93, 94, 9, 96,
103, 119, 119, 74, 52, 95, 56, 112, 78, 93, 119), y = c(96.535,
113.54, 108.17, 104.755, 94.36, 110.74, 112.83, 110.525, 103.645,
117.875, 105.035, 109.62, 105.24, 119.485, 107.52, 107.925, 107.875,
108.015, 115.455, 114.69, 116.715, 103.725, 110.395, 100.42,
108.79, 110.94, 99.13, 110.935, 112.94, 100.785, 110.035, 102.95,
108.42, 109.385, 119.09, 110.93, 99.885, 109.96, 116.575, 100.91,
114.615, 113.87, 103.08, 101.15, 98.68, 101.825, 105.36, 110.045,
118.575, 108.45, 99.21, 109.19, 107.175, 103.14, 94.855, 108.15,
109.345, 110.935, 112.395, 111.13, 95.185, 100.335, 112.105,
111.595, 100.365, 108.75, 116.695, 110.745, 112.455, 104.92,
102.13, 110.905, 107.365, 113.785, 105.595, 107.65, 114.325,
108.195, 96.72, 112.65, 103.81, 115.93, 101.41, 115.455, 108.58,
118.705, 116.465, 96.89, 108.655, 107.225, 101.79, 102.235, 112.08,
109.455, 111.945, 104.11, 94.775, 110.745, 112.44, 102.525)), row.names = c(NA,
-100L), class = "data.frame")
It would be easier (IMHO) to do this reliably within the ggplot2 ecosystem.
I'll show a canned approach using my {gratia} package, but also check out {mgcViz}. I'll also suggest a more generic solution using tools from {gratia} to extract information about your model's smooths, which you can then plot yourself using ggplot().
library('mgcv')
library('gratia')
library('ggplot2')
library('dplyr')
# load your snippet of data via df <- structure( .... )
# then fit your model (note you have y as response & in the tensor product
# I assume z is the response below and x and y are coordinates
m <- gam(z ~ te(x, y), data=df, method='REML')
# now visualize the model using {gratia}
draw(m)
This produces:
{gratia}'s draw() methods can't plot everything yet, but where it doesn't work you should still be able to evaluate the data you need using tools in {gratia}, which you can then plot with ggplot() itself by hand.
To get values for your smooths, i.e. the data behind the plots that plot.gam() or draw() display, use gratia::smooth_estimates()
# dist controls what we do with covariate combinations too far
# from support of the data. 0.1 matches mgcv:::plot.gam behaviour
sm <- smooth_estimates(m, dist = 0.1)
yielding
r$> sm
# A tibble: 10,000 × 7
   smooth  type   by      est    se     x     y
   <chr>   <chr>  <chr> <dbl> <dbl> <dbl> <dbl>
 1 te(x,y) Tensor NA     35.3 11.5      2  94.4
 2 te(x,y) Tensor NA     35.5 11.0      2  94.6
 3 te(x,y) Tensor NA     35.7 10.6      2  94.9
 4 te(x,y) Tensor NA     35.9 10.3      2  95.1
 5 te(x,y) Tensor NA     36.2  9.87     2  95.4
 6 te(x,y) Tensor NA     36.4  9.49     2  95.6
 7 te(x,y) Tensor NA     36.6  9.13     2  95.9
 8 te(x,y) Tensor NA     36.8  8.78     2  96.1
 9 te(x,y) Tensor NA     37.0  8.45     2  96.4
10 te(x,y) Tensor NA     37.2  8.13     2  96.6
# … with 9,990 more rows
In the output, x and y are a grid of values over the range of both covariates (the number of points in the grid in each covariate is controlled by n such that the grid for a 2d tensor product smooth is of size n by n). est is the estimated value of the smooth at the values of the covariates and se its standard error. For models with multiple smooths, the smooth variable uses the internal label that {mgcv} gives each smooth - these are the labels used in the output you get from calling summary() on your GAM.
We can add a confidence interval if needed using add_confint().
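For example, a small sketch (the n argument controls the evaluation grid; check ?smooth_estimates and ?add_confint in your installed {gratia} version):
# finer evaluation grid (n x n points for a 2d smooth) plus an interval
sm <- smooth_estimates(m, smooth = "te(x,y)", n = 50, dist = 0.1)
sm <- add_confint(sm)  # adds lower/upper credible interval columns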
Now you can plot your smooth(s) by hand using ggplot(). At this point you have two options
if draw() can handle the type of smooth you want to plot, you can use the draw() method for that object and then build upon it, or
plot everything by hand.
Option 1
# evaluate just the smooth you want to plot
smooth_estimates(m, smooth = "te(x,y)", dist = 0.1) %>%
  draw() +
  geom_point(data = df, alpha = 0.2)  # add a point layer for original data
This pretty much gets you what draw() produced when given the model object itself, and you can add to it as if it were a ggplot object (which is not the case for the object returned by gratia:::draw.gam(), which is wrapped by {patchwork} and needs other ways of interacting with the plots).
Option 2
Here you are in full control
sm <- smooth_estimates(m, smooth = "te(x,y)", dist = 0.1)
ggplot(sm, aes(x = x, y = y)) +
  geom_raster(aes(fill = est)) +
  geom_point(data = df, alpha = 0.2) +  # add a point layer for original data
  scale_fill_viridis_c(option = "plasma")
which produces
A diverging palette is likely better for this, along the lines of the one gratia:::draw.smooth_estimates uses
sm <- smooth_estimates(m, smooth = "te(x,y)", dist = 0.1)
ggplot(sm, aes(x = x, y = y)) +
  geom_raster(aes(fill = est)) +
  geom_contour(aes(z = est), colour = "black") +
  geom_point(data = df, alpha = 0.2) +  # add a point layer for original data
  scale_fill_distiller(palette = "RdBu", type = "div") +
  expand_limits(fill = c(-1, 1) * abs(max(sm[["est"]])))
which produces
Finally, if {gratia} can't handle your model, I'd appreciate you filing a bug report here so that I can work on supporting as many model types as possible. But do try {mgcViz} as well for an alternative approach to visualising GAMs fitted using {mgcv}.
A base plot solution would be to use fields::image.plot() directly. Unfortunately, it requires data in a classic wide format, not the long format needed by ggplot.
We can facilitate plotting by grabbing the object returned by plot.gam() and then doing a little manipulation of that object to get what we need for image.plot().
Following on from @Anke's answer, then, instead of plotting with plot.gam() and using image.plot() only to add the legend, we use plot.gam() just to get what we need to plot, and do everything with image.plot().
library(fields)  # provides image.plot()
plt <- plot(df.gam)
plt <- plt[[1]] # plot.gam returns a list of n elements, one per plot
# extract the `$fit` variable - this is est from smooth_estimates
fit <- plt$fit
# reshape fit (which is a 1 column matrix) to have dimension 40x40
dim(fit) <- c(40,40)
# plot with image.plot
image.plot(x = plt$x, y = plt$y, z = fit, col = heat.colors(999, rev = TRUE))
contour(x = plt$x, y = plt$y, z = fit, add = TRUE)
box()
This produces:
You could also use the fields::plot.surface() function
l <- list(x = plt$x, y = plt$y, z = fit)
plot.surface(l, type = "C", col = heat.colors(999, rev = TRUE))
box()
This produces:
See ?fields::plot.surface for other arguments to modify the contour plot etc.
As shown, these all have the correct range on the colour bar. It would appear that in @Anke's version the colour bar mapping is off in all of the plots, though mostly only by a little, so it wasn't as noticeable.
Following Gavin Simpson's answer and this thread (How to add colorbar with perspective plot in R), I think I've come up with a solution that uses plot.gam() (though I really like that {gratia} takes this into a ggplot universe and will definitely look more into that).
require(fields)
library(gratia)  # provides smooth_estimates()
df.gam <- gam(y ~ te(x, z), data = df, method = 'REML')
sm <- as.data.frame(smooth_estimates(df.gam, dist = 0.1))
plot(df.gam, scheme = 2, hcolors = heat.colors(999, rev = TRUE), contour.col = 'black',
     rug = FALSE, main = '', cex.lab = 1.75, cex.axis = 1.75)
image.plot(legend.only = TRUE, zlim = range(sm$est), col = heat.colors(999, rev = TRUE),
           legend.shrink = 0.5, axis.args = list(at = c(-10, -5, 0, 5, 10, 15, 20)))
I hope I understood correctly that gratia::smooth_estimates() actually pulls out the partial effects.
For my model with multiple terms (and multiple tensor products), this seems to work nicely by indexing the sections of the respective terms in sm, except for one, where the colour bar and the heatmap don't quite match up. I can't provide the actual underlying data, but I add that plot for illustration in case anyone has any idea. I'm using the same approach as outlined above. In the colour bar, dark red is at 15-20, but in the heatmap the isolines just above 0 already correspond to dark red (while 0 is a dark yellowish colour in the colour bar).

Quick way to print the regression line slope of two vectors

Let's say I have the following two vectors:
years <- seq(1991, 2000, by = 1)
height <- c(30, 34, 40, 45, 66, 70, 81, NA, 90, 110)
I now want to perform a simple linear regression:
lm(formula = height ~ years)
All I need is the slope of my regression line to get a trend-value.
Is there any quick way or function to give me the regression line slope of two vectors?
You'd do:
model <- lm(formula = height ~ years)
model$coefficients[2]
Output:
years
8.857353
A somewhat less readable way would be to do it directly:
lm(formula = height ~ years)$coefficients[2]
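A small aside (not part of the original answer): the coef() extractor does the same job and is a little more idiomatic than reaching into the object with $coefficients:
coef(lm(height ~ years))[["years"]]  # same value as shown above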
Maybe this too:
years <- seq(1991, 2000, by = 1)
height <- c(30, 34, 40, 45, 66, 70, 81, NA, 90, 110)
df1<-data.frame(Yr=years,Ht=height)
lmfun <- function(df, yname, xname) {
  f <- as.formula(paste0(yname, "~", xname))
  lm.fit <- do.call("lm", list(data = quote(df), f))
  coef(lm.fit)
}
lmfun(df1, yname = "Ht", "Yr")[2]  # or [1] depending on the need
Output:
Yr
8.857353

R: How to Plot an Excel Table (Matrix) with R

I have a problem that I still haven't figured out how to solve. I want to plot the values MW1, MW2 and MW3 as a function of "DHT + Procymidone". How can I plot all these values in one graphic so that I get 3 different curves (in different colours and numbered, e.g. curve 1, 2, ...)? And I want the labels of the x values ("DHT + Procymidone") to be -10, -9, ..., -4 instead of 1,00E-10, ...
DHT + Procymidone   MW 1               MW 2               MW 3
1,00E-10            114,259526780335   111,022461066274   213,212408408682
1,00E-09            115,024187788314   111,083316791613   114,529425136628
1,00E-08            110,517449986348   107,867941606743   125,10230718665
1,00E-07            100,961311263444   98,4219995773135   116,045168653416
1,00E-06            71,2383604211297   73,539659636842    50,3213799775309
1,00E-05            20,3553333652104   36,1345771905088   15,42260866106
1,00E-04            4,06189509055904   18,1246447874679   10,1988107887318
I have shortened your data frame for convenience reasons, so here's an example:
mydat <- data.frame(DHT_Procymidone = c(-10, -9, -8, -7, -6, -5, -4),
MW1 = c(114, 115, 110, 100, 72, 20, 4),
MW2 = c(111, 111, 107, 98, 73, 36, 18),
MW3 = c(213, 114, 123, 116, 50, 15, 10))
library(tidyr)
library(ggplot2)
mydf <- gather(mydat, "grp", "MW", 2:4)
ggplot(mydf, aes(x = DHT_Procymidone, y = MW, colour = grp)) + geom_line()
which gives the following plot:
To use ggplot, your data needs to be in long format. gather() does this for you, stacking the columns MW1-MW3 into one MW column, while the original column names are added as values in the new grp column. This grp column identifies the different groups, i.e. the differently coloured lines in the plot.
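As an aside (not part of the original answer): in current tidyr, gather() is superseded by pivot_longer(), which does the same reshaping:
mydf <- tidyr::pivot_longer(mydat, cols = MW1:MW3, names_to = "grp", values_to = "MW")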
Depending on the type of the DHT + Procymidone column, you can e.g. use format(..., scientific = FALSE) to convert it; however, this will result in -0.0000000001 (and not -10).
However, if this data column is a character vector (you can coerce with as.character), this may work:
a <- "1,00E-10"
sub("1,00E", "", a, fixed = TRUE)
> [1] "-10"
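A related sketch (my addition, not from the original answer): parse the string as a number after fixing the decimal comma, and take log10() to get the -10, -9, ..., -4 labels directly:
a <- "1,00E-10"
log10(as.numeric(sub(",", ".", a, fixed = TRUE)))
## [1] -10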
As an alternative to @Daniel's answer which doesn't rely on ggplot (thanks Daniel for providing the reproducible data):
mydat <- data.frame(DHT_Procymidone = c(-10, -9, -8, -7, -6, -5, -4),
MW1 = c(114, 115, 110, 100, 72, 20, 4),
MW2 = c(111, 111, 107, 98, 73, 36, 18),
MW3 = c(213, 114, 123, 116, 50, 15, 10))
plot(mydat[,2] ~ mydat[,1], typ = "l", ylim = c(0,220), xlim = c(-10,-2), xlab = "DHT Procymidone", ylab = "MW")
lines(mydat[,3] ~ mydat[,1], col = "blue")
lines(mydat[,4] ~ mydat[,1], col = "red")
legend(x = -4, y = 200, legend = c("MW1","MW2","MW3"), lty = 1, bty = "n", col = c("black","blue","red"))
To change axis labels see the text in xlab and ylab. To change axis limits see xlim and ylim.

How do I create a histogram with a probability y-axis rather than a density y-axis?

I have a vector (variable dist) of which I want to draw a histogram with a bin-width of 7 units. Here's the assignment to dist:
dist <- c(
# 0-6 7-13 14-20 21-27 28-34 35-41 42-48 49-55
# --- ---- ----- ----- ----- ----- ----- -----
16,
20, 29,
17, 27, 28,
19, 21, 34,
3, 14, 26, 33, 35, 44,
1, 11, 14, 21, 29, 38, 43, 55,
4, 12, 18, 22, 32, 35, 48, 50
)
In order to draw the histogram, I use hist:
hist(dist, breaks=seq(0, 56, by=7)-0.5)
which creates this graphic:
So far, so good. There are three numbers between 0 and 6, two numbers between 7 and 13 and so forth, as is shown by the histogram.
Now, I use hist with the prob=TRUE argument, which creates the following graph:
Instead of a density on the y axis, I'd like it to show the probability for a bin. For example the bin with the values 21 through 27 has a height (or density) of 0.02304147, calculated as follows:
dens_21_27 <- length(dist[dist > 20.5 & dist < 27.5])/length(dist)/7
This can be verified by drawing a line with this height:
lines(c(-5, 56), c(dens_21_27, dens_21_27), col="#FF770070")
which draws
Yet, I'd like the y-axis to show the probability for a number to fall into the 21 through 27 bin, which is
length(dist[dist > 20.5 & dist < 27.5])/length(dist)
or 0.1612903.
Is this possible somehow?
Here's a wrapper I've used in the past to coerce the values to probabilities.
probabilityplot <- function(x, ..., prob = TRUE, ylab = "Probability") {
  xx <- hist(x, yaxt = "n", prob = prob, ylab = ylab, ...)
  bin.sizes <- diff(xx$breaks)
  if (any(bin.sizes != bin.sizes[1])) stop("bin sizes are not the same")
  marks <- axTicks(2)
  axis(2, at = marks, labels = marks * bin.sizes[1])
  xx$probabilities <- xx$density * bin.sizes[1]
  invisible(xx)
}
probabilityplot(dist, breaks = seq(0, 56, by = 7) - 0.5)
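Since the wrapper returns the modified hist object invisibly, you can also capture it and inspect the per-bin probabilities it attaches:
h <- probabilityplot(dist, breaks = seq(0, 56, by = 7) - 0.5)
h$probabilities  # per-bin probabilities (density * bin width)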
Histograms were designed to estimate the density of continuous random variables hence the preference for density over probability.
You can bin the groups by the histogram breaks and make a barplot.
bs <- hist(dist, breaks=seq(0, 56, by=7)-0.5, plot=F)$breaks
probs <- table(cut(dist, bs)) / length(dist)
barplot(probs, ylab="Probability", las=2)
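Another minimal base-R sketch (my addition, not from the answers above): overwrite the density slot of the hist object with per-bin probabilities and replot it:
h <- hist(dist, breaks = seq(0, 56, by = 7) - 0.5, plot = FALSE)
h$density <- h$counts / sum(h$counts)  # per-bin probability instead of density
plot(h, freq = FALSE, ylab = "Probability")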
