Sampling out of tables depending on other variables (R)

I am a physician who just started working in R and would appreciate any help with this question:
I have 2 tables (A, B) with the variables age (continuous), sex (binary) and test_value (binary). Each table has a different age and sex distribution.
set.seed(10)
AgeA <- round(rnorm(100, mean = 40, sd = 15))
SexA <- sample(c("M","F"), 100, replace = TRUE, prob = c(0.5, 0.5))
Test_ValueA <- rbinom(100, 1, 0.3)
set.seed(20)
AgeB <- round(rnorm(1000, mean = 50, sd = 15))
SexB <- sample(c("M","F"), 1000, replace = TRUE, prob = c(0.5, 0.5))
Test_ValueB <- rbinom(1000, 1, 0.4)
A <- data.frame(Age = AgeA, Sex = SexA, Test = Test_ValueA)
B <- data.frame(Age = AgeB, Sex = SexB, Test = Test_ValueB)
genderA<-(prop.table(table(A[,2])))
TestA<-(prop.table(table(A[,3])))
paste("median age in group A is", median(A[,1]), "proportion female in group A is", genderA[1], "proportion of test positive in A is", TestA[2])
genderB<-(prop.table(table(B[,2])))
TestB<-(prop.table(table(B[,3])))
paste("median age in group B is", median(B[,1]), "proportion female in group B is", genderB[1], "proportion of test positive in B is", TestB[2])
The difference in test proportion is now confounded by age and sex.
Now I would like to match the patients from table A with table B to adjust for age and sex. Because B is the smaller cohort I would prefer to sample out of A and match to B. Is the match package an option? Any other ideas?
Hopefully I was able to explain my problem.
Any hints as to which functions this may point to?

Hello, I have a possible answer. I will build two populations of 100 people each with the characteristics you described:
set.seed(10)
AgeA <- rnorm(100, mean = 30, sd = 10)
# population A is 50% male
SexA <- sample(c("M","F"), 100, replace = TRUE, prob = c(0.5, 0.5))
Test_ValueA <- rbinom(100, 1, 0.5)
set.seed(20)
AgeB <- rnorm(100, mean = 30, sd = 10)
# population B is 80% male
SexB <- sample(c("M","F"), 100, replace = TRUE, prob = c(0.8, 0.2))
Test_ValueB <- rbinom(100, 1, 0.3)
A <- data.frame(Age = AgeA, Sex = SexA, Test = Test_ValueA)
B <- data.frame(Age = AgeB, Sex = SexB, Test = Test_ValueB)
Then using dplyr you can summarise population B parameters:
library(dplyr)
Bsummary <- group_by(B,Sex)
Bsummary <- summarise(Bsummary, PercenteagePositive = sum(Test == 1)/length(Test == 1), PercenteageSex = n()/100)
Bsummary
If you look at the results, B is 76% male and 24% female, so if you sampled 20 people from A you would have to sample 15 males and 5 females. First, separate population A into males and females:
Amale <- filter(A, Sex == "M")
Afemale <- filter(A, Sex == "F")
And from that you sample 15 males and 5 females:
SampleAMale <- Amale[sample(nrow(Amale), 15), ]
SampleAFemale <-Afemale[sample(nrow(Afemale), 5), ]
Then join them and summarise the result:
sampleA <- rbind(SampleAMale, SampleAFemale)
ASampleSummary <- group_by(sampleA,Sex)
# divide by the sample size (20 rows here), not by the original 100
ASampleSummary <- summarise(ASampleSummary, PercenteagePositive = sum(Test == 1)/length(Test == 1), PercenteageSex = n()/nrow(sampleA))
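The steps above can also be written without hard-coding 15 and 5. This is only a sketch in base R, assuming the same simulated A and B as in this answer; `n_sample`, `target` and `draw_stratum` are names made up for the illustration, and the per-sex counts are simply the rounded proportions observed in B.

```r
# Rebuild the two example populations used in this answer
set.seed(10)
A <- data.frame(Age  = rnorm(100, mean = 30, sd = 10),
                Sex  = sample(c("M", "F"), 100, replace = TRUE, prob = c(0.5, 0.5)),
                Test = rbinom(100, 1, 0.5))
set.seed(20)
B <- data.frame(Age  = rnorm(100, mean = 30, sd = 10),
                Sex  = sample(c("M", "F"), 100, replace = TRUE, prob = c(0.8, 0.2)),
                Test = rbinom(100, 1, 0.3))

n_sample <- 20  # hypothetical size of the sample drawn from A

# Number of rows to draw per sex so that the sample mirrors B's sex split
target <- round(n_sample * prop.table(table(B$Sex)))

# Draw the required number of rows from A within one sex stratum
draw_stratum <- function(s) {
  pool <- A[A$Sex == s, ]
  pool[sample(nrow(pool), as.integer(target[s])), ]
}

sampleA <- do.call(rbind, lapply(names(target), draw_stratum))

# Sex split and share of positive tests in the stratified sample
prop.table(table(sampleA$Sex))
tapply(sampleA$Test, sampleA$Sex, mean)
```

This only balances sex; age is not taken into account, and a stratum in A that is smaller than its target count would make `sample()` fail.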

OK Fank, I think you will like this answer a little better. The first part is the same, except that the age is rounded:
set.seed(10)
AgeA <- round(rnorm(100, mean = 30, sd = 2))
# population A is 50% male
SexA <- sample(c("M","F"), 100, replace = TRUE, prob = c(0.5, 0.5))
Test_ValueA <- rbinom(100, 1, 0.5)
set.seed(20)
AgeB <- round(rnorm(100, mean = 30, sd = 2))
# population B is 80% male
SexB <- sample(c("M","F"), 100, replace = TRUE, prob = c(0.8, 0.2))
Test_ValueB <- rbinom(100, 1, 0.3)
A <- data.frame(Age = AgeA, Sex = SexA, Test = Test_ValueA)
B <- data.frame(Age = AgeB, Sex = SexB, Test = Test_ValueB)
Now you just use prop.table to get the proportions of your population. Let's say you want to sample 1000 individuals from B in the same proportions as A in terms of age and sex. You do this:
1000*(prop.table(table(A[,1:2])))
Then, by applying filters, you can sample within groups.
For example, if you want to get only the males aged 30 in group B you could use:
BMale30 <- filter(B, Sex == "M" & Age == 30)
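Putting the two pieces together (target counts per age/sex cell from A, then sampling within the matching cell of B) might look like the sketch below. It uses base R only and assumes the rounded-age populations from this answer. `n_sample`, `cells` and `matched` are illustrative names, and capping each draw at what B can supply is my own assumption rather than part of the answer.

```r
# Rebuild the rounded-age example populations
set.seed(10)
A <- data.frame(Age  = round(rnorm(100, mean = 30, sd = 2)),
                Sex  = sample(c("M", "F"), 100, replace = TRUE, prob = c(0.5, 0.5)),
                Test = rbinom(100, 1, 0.5))
set.seed(20)
B <- data.frame(Age  = round(rnorm(100, mean = 30, sd = 2)),
                Sex  = sample(c("M", "F"), 100, replace = TRUE, prob = c(0.8, 0.2)),
                Test = rbinom(100, 1, 0.3))

n_sample <- 40  # hypothetical number of rows to draw from B

# One row per Age x Sex cell of A, with the number of rows wanted from B
cells <- as.data.frame(table(Age = A$Age, Sex = A$Sex), stringsAsFactors = FALSE)
cells$Age <- as.numeric(cells$Age)
cells$n_target <- round(n_sample * cells$Freq / sum(cells$Freq))

# Sample without replacement inside each cell of B
set.seed(1)
matched <- do.call(rbind, lapply(seq_len(nrow(cells)), function(i) {
  pool <- B[B$Age == cells$Age[i] & B$Sex == cells$Sex[i], ]
  k <- min(cells$n_target[i], nrow(pool))  # cap at the rows B actually has
  pool[sample.int(nrow(pool), k), , drop = FALSE]
}))

# Compare the joint Age x Sex distribution of the result with A
table(matched$Age, matched$Sex)
```

Because of rounding and the cap, the result can contain fewer than `n_sample` rows. Cells that exist in A but are rare or missing in B are under-represented, which is worth checking before comparing test proportions.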

Related

How can I find the maximum output of a function

If I have a GLM, is there any way I can efficiently find the maximum output by changing one covariate and holding the others?
Using my simulated data:
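library(dplyr)   # provides %>%, rowwise() and mutate() used below
library(ggplot2) # used by plot_beta_duration()
# FUNCTIONS ====================================================================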
logit <- function(p){
x = log(p/(1-p))
x
}
sigmoid <- function(x){
p = 1/(1 + exp(-x))
p
}
beta_duration <- function(D, select){
logit(
switch(select,
0.05 + 0.9 / (1 + exp(-2*D + 25)),
0.9 * exp(-exp(-0.5 * (D - 11))),
0.9 * exp(-exp(-(D - 11))),
0.9 * exp(-2 * exp(-(D - 9))),
sigmoid(0.847 + 0.210 * (D - 10)),
0.7 + 0.0015 * (D - 10) ^ 2,
0.7 - 0.0015 * (D - 10) ^ 2 + 0.03 * (D - 10)
)
)
}
beta_sex <- function(sex, OR = 1){
ifelse(sex == "Female", -0.5 * log(OR), 0.5 * log(OR))
}
plot_beta_duration <- function(select){
x <- seq(10, 20, by = 0.01)
y <- beta_duration(x, select)
data.frame(x = x,
y = y) %>%
ggplot(aes(x = x, y = y)) +
geom_line() +
ylim(0, 1)
}
# DATA SIMULATION ==============================================================
duration <- c(10, 12, 14, 18, 20)
sex <- factor(c("Female", "Male"))
eta <- function(duration, sex, duration_select, sex_OR, noise_sd){
beta_sex(sex, sex_OR) + beta_duration(duration, duration_select) + rnorm(length(duration), 0, noise_sd)
}
sim_data <- function(durations_type, sex_OR, noise_sd, p_female, n, seed){
set.seed(seed)
data.frame(
duration = sample(duration, n, TRUE),
sex = sample(sex, n, TRUE, c(p_female, 1 - p_female))
) %>%
rowwise() %>%
mutate(eta = eta(duration, sex, durations_type, sex_OR, noise_sd),
p = sigmoid(eta),
cured = sample(0:1, 1, prob = c(1 - p, p)))
}
# DATA SIM PARAMETERS
durations_type <- 4 # See beta_duration for functions
sex_OR <- 3 # Odds of cure for male vs female (ref)
noise_sd <- 1
p_female <- 0.7 # proportion of females in the sample
n <- 500
data <- sim_data(durations_type = 1, # See beta_duration for functions
sex_OR = 3, # Odds of cure for male vs female (ref)
noise_sd = 1,
p_female = 0.7, # proportion of females in the sample
n = 500,
seed = 21874564)
I am fitting a fractional polynomial GLM:
library(mfp)
model1 <- mfp(cured ~ fp(duration) + sex,
family = binomial(link = "logit"),
data = data)
summary(model1)
Given that I am holding sex as constant, is there any way to find the value of duration within a certain range that gives me the highest predicted value? Something less inefficient than:
range <- seq(10, 20, by = 1e-4)
range[which.max(predict(model1, type = "response", newdata = data.frame(duration = range, sex = "Male")))]
You can use optimize here. Just create a function which returns a prediction based on the value of duration:
f <- function(x) predict(model1, list(sex = 'Male', duration = x))
And we can find the value of duration which produces the maximum log odds within the range 0-20 by doing:
optimise(f, c(0, 20), maximum = TRUE)$maximum
#> [1] 17.95679
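As a self-contained illustration of the same idea, here is a sketch that does not need the mfp package. It assumes a plain `glm` with a quadratic term as a stand-in for the fractional polynomial fit; the simulated data and the names `d`, `fit` and `best` are made up for the example. Passing `type = "response"` makes `optimise` maximise the predicted probability rather than the log odds.

```r
# Simulated data with a cure probability that peaks inside the range
set.seed(1)
d <- data.frame(duration = runif(500, 10, 20))
d$cured <- rbinom(500, 1, plogis(1 - 0.15 * (d$duration - 16)^2))

# Stand-in model: logistic regression with a quadratic duration effect
fit <- glm(cured ~ duration + I(duration^2), family = binomial, data = d)

# Predicted probability as a function of duration alone
f <- function(x) {
  predict(fit, newdata = data.frame(duration = x), type = "response")
}

# One-dimensional search over the range of interest
best <- optimise(f, interval = c(10, 20), maximum = TRUE)
best$maximum    # duration with the highest predicted probability
best$objective  # the predicted probability at that duration
```

`optimise` assumes the function has a single peak in the interval. If the fitted curve could have several local maxima, a coarse grid search first and `optimise` on the best sub-interval is a safer combination.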

Find value of covariate given a probability in R

Given a fractional polynomial GLM, I am looking to find the value of a covariate that gives me an output of a given probability.
My data is simulated using:
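library(dplyr)   # provides %>%, rowwise() and mutate() used below
library(ggplot2) # used by plot_beta_duration()
# FUNCTIONS ====================================================================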
logit <- function(p){
x = log(p/(1-p))
x
}
sigmoid <- function(x){
p = 1/(1 + exp(-x))
p
}
beta_duration <- function(D, select){
logit(
switch(select,
0.05 + 0.9 / (1 + exp(-2*D + 25)),
0.9 * exp(-exp(-0.5 * (D - 11))),
0.9 * exp(-exp(-(D - 11))),
0.9 * exp(-2 * exp(-(D - 9))),
sigmoid(0.847 + 0.210 * (D - 10)),
0.7 + 0.0015 * (D - 10) ^ 2,
0.7 - 0.0015 * (D - 10) ^ 2 + 0.03 * (D - 10)
)
)
}
beta_sex <- function(sex, OR = 1){
ifelse(sex == "Female", -0.5 * log(OR), 0.5 * log(OR))
}
plot_beta_duration <- function(select){
x <- seq(10, 20, by = 0.01)
y <- beta_duration(x, select)
data.frame(x = x,
y = y) %>%
ggplot(aes(x = x, y = y)) +
geom_line() +
ylim(0, 1)
}
# DATA SIMULATION ==============================================================
duration <- c(10, 12, 14, 18, 20)
sex <- factor(c("Female", "Male"))
eta <- function(duration, sex, duration_select, sex_OR, noise_sd){
beta_sex(sex, sex_OR) + beta_duration(duration, duration_select) + rnorm(length(duration), 0, noise_sd)
}
sim_data <- function(durations_type, sex_OR, noise_sd, p_female, n, seed){
set.seed(seed)
data.frame(
duration = sample(duration, n, TRUE),
sex = sample(sex, n, TRUE, c(p_female, 1 - p_female))
) %>%
rowwise() %>%
mutate(eta = eta(duration, sex, durations_type, sex_OR, noise_sd),
p = sigmoid(eta),
cured = sample(0:1, 1, prob = c(1 - p, p)))
}
# DATA SIM PARAMETERS
durations_type <- 4 # See beta_duration for functions
sex_OR <- 3 # Odds of cure for male vs female (ref)
noise_sd <- 1
p_female <- 0.7 # proportion of females in the sample
n <- 500
data <- sim_data(durations_type = 1, # See beta_duration for functions
sex_OR = 3, # Odds of cure for male vs female (ref)
noise_sd = 1,
p_female = 0.7, # proportion of females in the sample
n = 500,
seed = 21874564)
And my model is fitted by:
library(mfp)
model1 <- mfp(cured ~ fp(duration) + sex,
family = binomial(link = "logit"),
data = data)
summary(model1)
For each level of sex (i.e. "Male" or "Female"), I want to find the value of duration that gives me a probability equal to some value frontier <- 0.8.
So far, I can only think of using an approximation using a vector of possibilities:
pred_duration <- seq(10, 20, by = 0.1)
pred <- data.frame(expand.grid(duration = pred_duration,
sex = sex),
p = predict(model1,
newdata = expand.grid(duration = pred_duration,
sex = sex),
type = "response"))
pred[which(pred$p > 0.8), ] %>%
group_by(sex) %>%
summarize(min(duration))
But I am really after an exact solution.
The function uniroot allows you to detect the point at which the output of a function equals 0. If you create a function that takes duration as input, calculates the predicted probability from that duration, then subtracts the desired probability, then this function will have an output of 0 at the desired value of duration. uniroot will find this value for you. If you wrap this process in a little function, it makes it very easy to use:
find_prob <- function(p) {
f <- function(v) {
predict(model1, type = 'response',
newdata = data.frame(duration = v, sex = 'Male')) - p
}
uniroot(f, interval = range(data$duration), tol = 1e-9)$root
}
So, for example, to find the duration that gives an 80% probability, we just do:
find_prob(0.8)
#> [1] 12.86089
To check that this is the correct value, we can feed it directly into predict to see what the predicted probability will be given sex = 'Male' and duration = 12.86089:
predict(model1, type = 'response',
newdata = data.frame(sex = 'Male', duration = find_prob(0.8)))
#> 1
#> 0.8
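The question asks for the duration at each level of sex, while `find_prob` above fixes sex = 'Male'. One possible way to generalise it is sketched here on a self-contained example. It assumes an ordinary `glm` on simulated data instead of the mfp model, and `find_duration`, `d` and `fit` are hypothetical names introduced for the illustration.

```r
# Simulated data in which the cure probability rises with duration
set.seed(2)
d <- data.frame(duration = runif(1000, 10, 20),
                sex = factor(sample(c("Female", "Male"), 1000, replace = TRUE)))
d$cured <- rbinom(1000, 1, plogis(-6 + 0.5 * d$duration + 0.8 * (d$sex == "Male")))

fit <- glm(cured ~ duration + sex, family = binomial, data = d)

# Duration at which the predicted probability equals p for a given sex
find_duration <- function(p, sex_level, interval = c(10, 20)) {
  f <- function(v) {
    newdata <- data.frame(duration = v,
                          sex = factor(sex_level, levels = levels(d$sex)))
    predict(fit, newdata = newdata, type = "response") - p
  }
  uniroot(f, interval = interval, tol = 1e-9)$root
}

# One root per level of sex
roots <- sapply(levels(d$sex), find_duration, p = 0.8)
roots
```

`uniroot` needs the function to change sign across the interval, so it stops with an error if the target probability is never reached for one of the sexes within the given range.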

How can I run a stratified glm.nb on a mids object in R?

I have a mids object output from MICE that I need to run a series of stratified negative binomial models (glm.nb()) on. The stratification variable ("pressure") is itself imputed. The below code subsets the data within the glm.nb() function.
Is this the correct approach to run the stratified models?
library("mice") # For MI
library("MASS") # For glm.nb()
set.seed(09212020)
df <- data.frame("ethnicity" = sample(1:4, 50, replace = TRUE) ,
"education" = sample(1:4, 50, replace = TRUE) ,
"age" = sample(50:90, 50, replace = TRUE) ,
"pressure" = sample(0:12, 50, replace = TRUE),
"outcome_rate" = sample(0:6, 50, replace = TRUE),
"exposure_quart" = sample(1:4, 50, replace=TRUE,
prob=c(0.1, 0.2, 0.65, 0.05)))
# Insert NAs for MICE
df <- as.data.frame(lapply(df, function(cc) cc[ sample(c(TRUE, NA), prob = c(0.85, 0.15),
size = length(cc), replace = TRUE) ]))
# Run MICE
imp <- mice(data = df, m = 2, maxit = 1)
# Attempt to run stratified model on the mids object
# This is the portion I'm seeking to have confirmed/updated
m1 <- with(imp, (glm.nb(outcome_rate ~ exposure_quart + ethnicity + education + age + pressure,
subset = imp$imp$pressure < 9)))
m1

How can I use the MatchIt function to identify the effect of a given variable (propensity score)?

Let's say I have a data set like this:
library(dplyr)
library(MatchIt)
set.seed(1)
df <- tibble(outcome = sample(c(rep("TRUE", 50), rep("FALSE", 50))), # tibble() replaces the deprecated data_frame()
age = rnorm(100, mean = 35, sd = 15),
gender = sample(c(rep("MALE", 50), rep("FEMALE", 50))),
var1 = rnorm(100, mean = 1, sd = 0.3),
var2 = rnorm(100, mean = 1000, sd = 125),
var3 = rnorm(100, mean = 0, sd = 300))
I want to control for age and gender and determine the effect of vars 1, 2, and 3 on the outcome. I believe I can use the MatchIt function to control for age and gender like this:
match_it <- matchit(formula = outcome ~ age + gender,
data = df,
method = "nearest")
But once this is done, how can I determine the effect of var1 on outcome? Thanks for any help.

randomly assign teachers to a school with dplyr or similar?

Suppose I have a data frame with 8 schools and their characteristics, and another with 48 teachers and their characteristics. I can generate some fake data with the following code:
library(dplyr)
library(geosphere)
set.seed(6232015)
n.schools <-8
n.teachers <- 48
makeRandomString <- function(pre, n=1, length=12) {
randomString <- c(1:n) # initialize vector
for (i in 1:n) {
randomString[i] <- paste0(pre,'.', paste(sample(c(0:9, letters, LETTERS),
length, replace=TRUE),
collapse=""))
}
return(randomString)
}
gen.teachers <- function(n.teachers){
Teacher.ID <- makeRandomString(pre= 'T', n = n.teachers, length = 20)
Teacher.exp <- runif(n = n.teachers, min = 1, max = 30)
Teacher.Other <- sample(c(0,1), replace = T, prob = c(0.5, 0.5), size = n.teachers)
Teachers <- data.frame(Teacher.ID, Teacher.exp, Teacher.Other)
return(Teachers)
}
gen.schools <- function(n.schools){
School.ID <- makeRandomString(pre= 'S', n = n.schools, length = 20)
School.lat <- runif(n = n.schools, min = -2, max = 2)
School.long <- runif(n = n.schools, min = -2, max = 2)
Schools <- data.frame(School.ID, School.lat, School.long) %>%
rowwise() %>% mutate (School.distance = distHaversine(p1 = c(School.long, School.lat),
p2 = c(0, 0), r = 3961))
return(Schools)
}
Teachers <- gen.teachers(n.teachers = n.teachers)
Schools <- gen.schools(n.schools = n.schools)
To each school, I want to assign 6 teachers (every teacher gets 1 and only 1 school). I could use:
Teachers %>% sample_n(6)
to get a list of 6 teachers, assign those to a school, remove them from my original pool and keep going with a loop. My guess/hope is that there is a much easier way of doing this.
Thanks for the help!
In the context of your code,
sample(rep(Schools$School.ID, each = 6))
gives a random sequence of schools in which each School.ID appears 6 times. Set Teachers$AssignedSchool to this sample and each teacher has an assigned school.
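A minimal runnable sketch of that assignment is shown here, using short placeholder IDs instead of the random strings generated in the question (the ID values are made up for the example):

```r
set.seed(6232015)

# Placeholder stand-ins for the Schools and Teachers data frames
Schools  <- data.frame(School.ID  = paste0("S.", 1:8))
Teachers <- data.frame(Teacher.ID = paste0("T.", 1:48))

# Repeat every school 6 times, shuffle, and attach one school per teacher
Teachers$AssignedSchool <- sample(rep(Schools$School.ID, each = 6))

# Each school should appear exactly 6 times
table(Teachers$AssignedSchool)
```

This works because the number of teachers equals 6 times the number of schools. With an uneven split the repeated vector would have to be built to the exact length of `Teachers` first.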

Resources