Creating data table of points above/below abline in ggplot2 - r

Is it possible to identify data points above a geom_abline in ggplot, and to create a new data table separating these data points using data.table?
I have a panel dataset with 150 unique ID's, and have fit a fixed effects model using plm(). Here is a sample of the dataset:
data <- data.frame(ID = c(1,1,1,1,2,2,3,3,3),
year = c(1,2,3,4,1,2,1,2,3),
progenyMean = c(90,78,92,69,86,73,82,85,91),
damMean = c(89,89,72,98,95,92,94,87,89))
ID, year, progenyMean, damMean
1, 1, 70, 69
1, 2, 68, 69
1, 3, 72, 72
1, 4, 69, 68
2, 1, 76, 75
2, 2, 73, 80
3, 1, 72, 74
3, 2, 75, 67
3, 3, 71, 69
# Fixed Effects Model in plm
library(plm)
fixed <- plm(progenyMean ~ damMean, data, model= "within", index = c("ID","year"))
I have plotted the response progenyMean vs damMean using the following code:
plotFunction <- function(aggData, year){
ggplot(aggData, aes(x=damMeanCentered, y=progenyMean3Y)) +
geom_point() +
geom_abline(slope=fixed$coefficients, intercept=71.09, colour='dodgerblue1', size=1)
# The intercept 71.09 was calculated using the mean of fixef(fixed)
}
plotFunction(data, '(2005 - 2012)')
Is it possible to identify the points above/below the geom_abline in ggplot and create a new data table separating these data points using data.table?

It is not clear where the intercept came from, but nevertheless the trick is
to add a prediction column to your dataset using the regression model (in your case fixed). Then keep only the rows whose actual values are higher than the prediction.
library(dplyr)
data %>%
mutate(predict = predict(fixed, newdata = data)) %>%
filter(progenyMean > predict)

First convert the data frame to a data.table (library(data.table); setDT(data)), then make the predictions
data[,newpredict:=predict(fixed, newdata=data)]
It's not clear what you want the new data.table to look like, but you'd get the rows above the predictions by doing
data[progenyMean>newpredict]
For below, you'd just change the > to <.
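If the goal is to compare the points with the exact line drawn by geom_abline, rather than with model predictions, the comparison can also be made directly from the line's slope and intercept. Below is a minimal base R sketch under stated assumptions: the slope 0.1 is a placeholder (in the question it would be fixed$coefficients), the intercept 71.09 is the one quoted in the question, and lineValue is a made-up column name. The columns compared should be the same ones mapped to x and y in the plot.

```r
# Sample data from the question
data <- data.frame(ID = c(1,1,1,1,2,2,3,3,3),
                   year = c(1,2,3,4,1,2,1,2,3),
                   progenyMean = c(90,78,92,69,86,73,82,85,91),
                   damMean = c(89,89,72,98,95,92,94,87,89))

slope     <- 0.1    # placeholder, stands in for fixed$coefficients
intercept <- 71.09  # intercept passed to geom_abline() in the question

# Height of the drawn line at each point's x position
data$lineValue <- intercept + slope * data$damMean

# Split the rows into the two groups (points exactly on the line go to 'below')
above <- data[data$progenyMean >  data$lineValue, ]
below <- data[data$progenyMean <= data$lineValue, ]
```

Once the data is a data.table, the same logical condition can be used inside data[...], for example data[progenyMean > intercept + slope * damMean].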

Related

Calculate area under a curve below a certain threshold in R

I'm trying to calculate the area below a certain point, and unsure how to do that. I've seen this question, but it's not exactly answering what I'm looking for.
Here is some example data...
test_df <- structure(list(time = c(0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11,
12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23), balance = c(27,
-45, -118, -190, -263, -343, -424, -1024, -434, -533, -613, -694,
-775, -355, -436, -516, -597, -77, -158, -239, -319, -400, -472,
-545)), row.names = c(NA, -24L), class = c("tbl_df", "tbl", "data.frame"
)) %>% as_tibble()
ggplot(test_df, aes(time, balance))+
geom_smooth(se = F)+
geom_hline(yintercept = -400)
I'd like to calculate the AUC for the trend line, but only for when it is below a certain threshold (-400, for example).
So I can extract the values for the smoothed line...
test_plot <- ggplot(test_df, aes(time, balance))+
geom_smooth(se = F)+
geom_hline(yintercept = -400)
ggp_data <- ggplot_build(test_plot)$data[[1]]
and use something like this to get an AUC value
MESS::auc(ggp_data$x, ggp_data$y)
My questions are:
How to only calculate below -400?
How to interpret the value?
What units would it be in?
If my x axis is in hours, is there a way to turn the value into an hour value?
Thanks!
To calculate the area relative to a threshold, shift the y-values so that the threshold becomes zero, i.e. subtract the threshold from y. For a negative threshold such as -400 this means adding 400. For your case that would be like this:
MESS::auc(ggp_data$x, ggp_data$y+400)
However, this calculates the AUC from 0 to 23 and therefore also includes parts that are above -400. To get the AUC for the part that is below your threshold, you have to find the x-values of the intersections between your smoothed line and the horizontal line at -400. Inspecting your values by eye, you could use the following approximations of these x-values:
x1 <- 4.45
x2 <- 15.45
x3 <- 21.35
Now we have to calculate the AUC between x1 and x2, and x3 and max(x). Then we have to add these values together:
AUC1 <- MESS::auc(ggp_data$x, ggp_data$y+400, from = x1, to = x2)
AUC2 <- MESS::auc(ggp_data$x, ggp_data$y+400, from = x3, to = max(ggp_data$x))
AUC.total <- AUC1 + AUC2
> AUC.total
[1] -1747.352
Note that the value is negative because the curve lies below the shifted zero line. There are no "negative areas", so you can take the absolute value, 1747.352, to proceed. The area has the units of the y-axis multiplied by the units of the x-axis (for example, balance units times hours), so without information on your y-axis one cannot clearly interpret this value.
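Reading the intersection points off the plot can be avoided by clipping the curve at the threshold segment by segment. The following is a rough base R sketch, not a replacement for MESS::auc: it assumes the curve is piecewise linear between the supplied points, and area_below is a hypothetical helper name.

```r
# Area between a piecewise-linear curve and a threshold, counting only the
# parts where the curve lies below the threshold. Returns a non-negative value.
area_below <- function(x, y, threshold) {
  d  <- threshold - y      # depth below the threshold (negative when above)
  h  <- diff(x)            # width of each segment
  d0 <- head(d, -1)        # depth at the left end of each segment
  d1 <- tail(d, -1)        # depth at the right end of each segment
  seg <- ifelse(
    d0 >= 0 & d1 >= 0, h * (d0 + d1) / 2,       # fully below: trapezoid
    ifelse(d0 <= 0 & d1 <= 0, 0,                # fully above: nothing
           h * pmax(d0, d1)^2 / (2 * abs(d0 - d1)))  # crossing: triangle
  )
  sum(seg)
}

# Small check with crossings at x = 0.5 and x = 2.5
demo_area <- area_below(0:4, c(0, -2, -2, 0, 0), threshold = -1)
```

With the objects from the question, the call would be area_below(ggp_data$x, ggp_data$y, -400). The result is in y-units times x-units.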

Identifying data points above fixed effects regression using data.table

I want to identify the data points above a regression line. I have a panel data set which I have fit a fixed effects model:
data <- data.frame(ID = c(1,1,1,1,2,2,3,3,3),
year = c(1,2,3,4,1,2,1,2,3),
progenyMean = c(90,78,92,69,86,73,82,85,91),
damMean = c(89,89,72,98,95,92,94,87,89))
ID, year, progenyMean, damMean
1, 1, 70, 69
1, 2, 68, 69
1, 3, 72, 72
1, 4, 69, 68
2, 1, 76, 75
2, 2, 73, 80
3, 1, 72, 74
3, 2, 75, 67
3, 3, 71, 69
# Fixed Effects Model in plm
library(plm)
fixed <- plm(progenyMean ~ damMean, data, model= "within", index = c("ID","year"))
I have plotted progenyMean vs damMean with the fixed effects regression line:
I want to identify the ID's above this regression line.
I have computed the predicted values of the fixed effects model using the following code (based off code from this post)
fitted <- as.numeric(fixed$model[[1]] - fixed$residuals)
> fitted
[1] 71.24338 79.03766 74.86613 71.34263 70.83020 71.56797 72.17324 74.54755 71.16720 73.37487
[11] 70.58863 69.27203 71.05852 59.72911 63.43947 68.69871 67.25271 75.68397 76.30475 81.12128
Is it possible to identify the ID's above the fixed effects regression line using the predicted values above and data.table in R?
Use the residuals() function. A positive residual means the point is above the line, a negative residual means it is below the line.
library(plm)
library(tidyverse)
library(ggplot2)
data <- data.frame(ID = c(1,1,1,1,2,2,3,3,3),
year = c(1,2,3,4,1,2,1,2,3),
progenyMean = c(90,78,92,69,86,73,82,85,91),
damMean = c(89,89,72,98,95,92,94,87,89))
fixed <- plm(progenyMean ~ damMean, data, model= "within", index = c("ID","year"))
residuals(fixed)
data %>% ggplot(aes(damMean, progenyMean)) +
geom_point(data=data %>% filter(residuals(fixed)>0), col="red")+
geom_point(data=data %>% filter(residuals(fixed)<0), col="blue")
data %>% mutate(
test = ifelse(residuals(fixed)>0, "up", "down") %>% factor()
) %>% group_by(test) %>% summarise(
n = n()
)
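The same residual-sign idea can be sketched in base R without plm. This swaps in a different fit: a least-squares dummy-variable regression (lm with factor(ID)), which should give the same slope and residuals as the one-way within model. Note that the residuals are measured against each ID's own fitted line (its own intercept), not against a single line drawn with an averaged intercept. The names side, groups, and ids_above are made up for illustration.

```r
data <- data.frame(ID = c(1,1,1,1,2,2,3,3,3),
                   year = c(1,2,3,4,1,2,1,2,3),
                   progenyMean = c(90,78,92,69,86,73,82,85,91),
                   damMean = c(89,89,72,98,95,92,94,87,89))

# One intercept per ID plus a common slope for damMean
fit <- lm(progenyMean ~ damMean + factor(ID), data = data)

# Label each observation by the sign of its residual
data$side <- ifelse(residuals(fit) > 0, "above", "below")

# One data frame per group, and the IDs with at least one point above
groups    <- split(data, data$side)
ids_above <- unique(groups$above$ID)
```

An ID can appear in both groups, since the labels apply to individual observations rather than to whole IDs.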

Best way to incorporate offset variable into raster for predicting poisson regression

This may be a naive question, but I did not find a solution. I have a dataframe with count data from a field survey and I want to predict species richness using Poisson regression. The surveys are allocated to grids of equal size, but a variable number of surveys was done in each grid, so I wanted to include 'number of surveys per grid' as an offset. The problem is that when I predict from the glm using a raster stack, it wants a raster layer for the offset variable (number of surveys per grid). My question is how to incorporate that offset variable into the raster stack so that I can produce a spatial prediction (i.e., the prediction should be a raster file). Below is my reproducible effort (using fewer variables):
Create the dataframe:
bio2 <- c(12.74220, 14.10092, 13.82644, 14.30550, 15.02780, 14.88224, 13.98853, 14.89524, 15.59887, 13.98664, 14.75405,
15.38178, 14.50719, 15.00427, 12.77741, 13.25432, 12.91208, 15.75312, 15.36683, 13.33202, 12.55190, 14.94755,
13.52424, 14.75273, 14.42298, 15.37897, 12.02472, 15.49786, 14.28823, 13.01982, 13.60521, 15.07687, 14.17427,
13.24491, 14.84833, 13.52594, 13.92113, 11.39738, 14.31446, 12.10239)
bio9 <- c(26.30980, 26.52826, 27.03376, 23.93621, 26.48416, 26.05859, 25.37550, 25.34595, 25.34056, 23.37793, 25.74681,
22.72016, 22.00458, 24.37140, 22.95169, 24.52542, 24.63087, 22.86291, 23.10240, 23.79215, 24.86875, 21.40718,
23.84258, 21.91964, 25.97682, 24.97625, 22.31471, 19.64094, 23.93386, 25.87234, 25.99514, 17.17149, 20.72802,
18.22862, 24.51112, 24.33626, 23.90822, 23.43660, 23.07425, 20.71244)
count <- c(37, 144, 91, 69, 36, 32, 14, 34, 48, 168, 15, 21, 36, 29, 24, 16, 14, 11, 18, 64, 37, 31, 18, 9, 4,
16, 14, 10, 14, 43, 18, 88, 69, 26, 20, 5, 9, 75, 8, 26)
sitesPerGrid <- c(3, 16, 8, 5, 3, 3, 1, 3, 3, 29, 2, 4, 5, 2, 3, 4, 2, 1, 2, 9, 6, 3, 3, 2, 1, 2, 2, 1, 2, 5, 7, 15, 9, 4,
1, 1, 2, 22, 6, 5)
testdf <- data.frame(bio2, bio9, count, sitesPerGrid)
pois1 <- glm(count ~ bio2 + bio9, offset = log(sitesPerGrid), family = poisson (link = "log"), data = testdf)
Spatial prediction:
library(raster)
bio_2 <- bio_9 <- raster(nrow=5,ncol=8, xmn=0, xmx=1,ymn=0,ymx=1)
values(bio_2) <- bio2
values(bio_9) <- bio9
predRas <- stack(bio_2, bio_9)
names(predRas) <- c("bio2", "bio9")
pdPois <- raster::predict(predRas, pois1, type = "response")
#Error in model.frame.default(Terms, newdata, na.action = na.action, xlev = #object$xlevels) :
# variable lengths differ (found for 'bio9')
#In addition: Warning message:
#'newdata' had 16 rows but variables found have 40 rows
I get an error because it expects a raster layer for sitesPerGrid. But I don't want to use sitesPerGrid as a predictor.
Update
Based on the comment and answer given by #robertHijmans I have tried using the following code:
pdPois <- raster::predict(predRas, pois1, const = testdf[, "sitesPerGrid"], type = "response")
Again I get the following error:
Error in data.frame(..., check.names = FALSE) : arguments imply differing number of rows: 143811, 40
I see that this works, because the number of data points is the same as what was used to fit the model
p <- predict(pois1, as.data.frame(predRas), type = "response")
However, this (taking two data points) does not work:
p <- predict(pois1, as.data.frame(predRas)[1:2,], type = "response")
#Error in model.frame.default(Terms, newdata, na.action = na.action, xlev = object$xlevels) :
# variable lengths differ (found for 'bio9')
#In addition: Warning message:
#'newdata' had 2 rows but variables found have 40 rows
So, irrespective of the raster data, can you (and if so how?) use a model like this to make predictions to (any number of) new data points?
The problem is solved using a raster for the offset variable. The raster is created based on a hypothesis. For example, I want to see the prediction if there is one site per grid, or mean(sitesPerGrid) or max(sitesPerGrid). If my hypothesis is mean(sitesPerGrid) then the raster for prediction would be:
# make new raster for sitesPerGrid
rasGrid <- bio_2  # copy the raster layer bio_2 as a template (bio2 is the plain numeric vector)
rasGrid[,] <- mean(testdf$sitesPerGrid)
names(rasGrid) <- "sitesPerGrid"
predRas <- stack(bio_2, bio_9, rasGrid)
p <- raster::predict(predRas, pois1, type = "response")
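Independent of the raster step, the underlying point is that predict() needs a value of the offset variable for every new data point. Below is a minimal sketch with simulated data (not the survey data above; sim, new_pts, per_survey, and two_surveys are made-up names). It assumes the offset is written inside the formula as offset(log(sitesPerGrid)), and it shows that the predicted count scales with the number of surveys supplied in newdata.

```r
set.seed(1)
n <- 40

# Simulated stand-in for the survey data
sim <- data.frame(bio2 = rnorm(n, mean = 14, sd = 1),
                  sitesPerGrid = sample(1:10, n, replace = TRUE))
sim$count <- rpois(n, lambda = sim$sitesPerGrid * exp(1 + 0.1 * (sim$bio2 - 14)))

# Poisson model with the survey effort as a log offset in the formula
fit <- glm(count ~ bio2 + offset(log(sitesPerGrid)),
           family = poisson(link = "log"), data = sim)

# New points must carry the offset variable; here one survey per grid
new_pts <- data.frame(bio2 = c(13, 15), sitesPerGrid = 1)
per_survey <- predict(fit, newdata = new_pts, type = "response")

# Same points, assuming two surveys per grid
new_pts$sitesPerGrid <- 2
two_surveys <- predict(fit, newdata = new_pts, type = "response")
```

Setting the offset variable to 1 gives the expected count per survey, which is often the easiest quantity to map.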

Grouping factors in a pooled 2 sample t-test

I have a 2*2 table of 7 men and 11 women's weight (saved as weights_gender.csv), and aim to perform a pooled t-test. I have assigned the CSV file as weight = read.csv("weights_gender.csv"), but whenever I try to run t.test(weight$men~weight$women, var.equal=TRUE), it keeps on printing this message:
grouping factor must have exactly 2 levels.
What is the issue?
Try ...
t.test(x = weight$men, y = weight$women, var.equal = TRUE)
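The formula interface (y ~ g) expects a numeric response on the left and a grouping factor with exactly two levels on the right. The way you specified the command, it treated the women's weights as the grouping variable for the men's weights, which is not what you want.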
Results...
Two Sample t-test
data: weight$men and weight$women
t = 5.9957, df = 16, p-value = 1.867e-05
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
15.26250 31.95828
sample estimates:
mean of x mean of y
77.42857 53.81818
Data
weight <- data.frame(
men = c(88, 90, 78, 75, 70, 72, 69, NA, NA, NA, NA),
women = c(45, 57, 54, 62, 60, 59, 44, 43, 67, 50, 51)
)
Your question is a bit "theoretical", so I'll make it more concrete.
Here I make two data frames with data about men's and women's weights and label them.
library(tibble)  # provides tibble()
df_m <- tibble(weight = 170 + 30*rnorm(7), sex = "Male")
df_f <- tibble(weight = 130 + 30*rnorm(11), sex = "Female")
Next we combine the data and set sex to be a factor variable
df_all <- rbind(df_m, df_f)
df_all[, 'sex'] <- lapply(df_all[, 'sex'], as.factor)
Finally we apply the t-test.
t.test(weight ~ sex, data = df_all, var.equal = TRUE)
My result was
Two Sample t-test
data: weight by sex
t = -5.2104, df = 16, p-value = 8.583e-05
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-89.84278 -37.87810
sample estimates:
mean in group Female mean in group Male
120.2316 184.0921
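The two approaches above can be connected: the wide men/women table can be reshaped to one weight column plus one grouping column, after which the formula interface works. This is a minimal base R sketch using the weight data frame shown earlier in this thread; long and res are made-up names.

```r
weight <- data.frame(
  men   = c(88, 90, 78, 75, 70, 72, 69, NA, NA, NA, NA),
  women = c(45, 57, 54, 62, 60, 59, 44, 43, 67, 50, 51)
)

# stack() returns two columns: values (the weights) and ind (factor: men / women)
long <- stack(weight)

# Pooled two-sample t-test; rows with NA weights are dropped by the formula method
res <- t.test(values ~ ind, data = long, var.equal = TRUE)
res
```

This should reproduce the test statistic and degrees of freedom of the t.test(x = weight$men, y = weight$women, var.equal = TRUE) call.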

How to undo a survival object (return a data frame)

After generating a model using cph() from rms, calling model$y will return a survival object. Is there a function that will "undo" the survival object and return a data frame?
I would like to be able to use a survival object as an argument for a function I am writing, but I also need the response data. I am trying to avoid using the data as an argument and creating the model inside the function.
A minimal working example is provided below:
library(rms)
# generate data
time <- c(82, 73, 89, 79, 72, 87, 103, 83, 100, 79)
event <- c(0, 0, 1, 0, 1, 0, 0, 0, 1, 1)
covar <- c(15, 11, 11, 20, 12, 13, 10, 11, 10, 14)
df <- data.frame(time, event, covar)
# Cox model
dd <- datadist(df)
options(datadist = 'dd')
model <- cph(Surv(time, event) ~ covar, x=TRUE, y=TRUE, surv=TRUE, data=df)
# returns a survival object
model$y
# what I want is a data frame
want <- data.frame(time, event)
want
I think you can get what you want by using as.matrix:
pl <- as.matrix(model$y)
as.data.frame(pl)
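A small sketch of the same conversion using only the survival package, with a plain Surv object standing in for model$y (an assumption made for illustration; s and out are made-up names). For right-censored data the second column comes back named status, so it is renamed here to match the wanted data frame.

```r
library(survival)

time  <- c(82, 73, 89, 79, 72, 87, 103, 83, 100, 79)
event <- c(0, 0, 1, 0, 1, 0, 0, 0, 1, 1)

# Stand-in for model$y
s <- Surv(time, event)

# A Surv object is a matrix underneath, so convert it and then rename the columns
out <- as.data.frame(as.matrix(s))
names(out) <- c("time", "event")
out
```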
