I have trouble looping through a regression model dropping one observation each time to estimate the effect of influential observations.
I would like to run the model several times, each time dropping the ith observation, extracting the relevant coefficient estimate, and storing it in a vector. I think this could be done with a fairly straightforward loop, but I'm stuck on the specifics.
I want to be left with a vector containing n coefficient estimates from n iterations of the same model. Any help would be beneficial!
Below I provide some dummy data and example code.
#Dummy data:
set.seed(489)
patientn <- rep(1:400)
gender <- rbinom(400, 1, 0.5)
productid <- rep(c("Product A","Product B"), times=200)
country <- rep(c("USA","UK","Canada","Mexico"), each=50)
baselarea <- rnorm(400,400,60) #baseline area
baselarea2 <- rnorm(400,400,65) #baseline area2
sfactor <- c(
rep(c(0.3,0.9), times = 25),
rep(c(0.4,0.5), times = 25),
rep(c(0.2,0.4), times = 25),
rep(c(0.3,0.7), times = 25)
)
rashdummy2a <- data.frame(patientn,gender,productid,country,baselarea,baselarea2,sfactor)
library(dplyr) # provides %>% and mutate()
data <- rashdummy2a %>% mutate(rashleft = baselarea2*sfactor/baselarea*100)
## Example of how this can be done manually:
# model
m1<-lm(rashleft ~ gender + baselarea + sfactor, data = data)
# extracting relevant coefficient estimates, each time dropping a different "patient" ("patientn")
betas <- c(lm(rashleft ~ gender + baselarea + sfactor, data = rashdummy2b, patientn !=1)$coefficients[2],
lm(rashleft ~ gender + baselarea + sfactor, data = rashdummy2b, patientn !=2)$coefficients[2],
lm(rashleft ~ gender + baselarea + sfactor, data = rashdummy2b, patientn !=3)$coefficients[2])
# the betas vector now stores the relevant coefficient estimates (coefficient nr 2, for gender) for three different variations of the model.
We can use a for loop. Your question uses an object rashdummy2b that is never defined, so I used data instead; replace it with whichever object you need.
#create list to bind results to
result <- list()
#loop through patients and extract betas
for(i in unique(data$patientn)){
#construct linear model
lm.model <- lm(rashleft ~ gender + baselarea + sfactor, data = subset(data, data$patientn != i))
#create data.frame containing patient left out and coefficient
result.dt <- data.frame(beta = lm.model$coefficients[[2]],
patient_left_out = i)
#bind to list
result[[i]] <- result.dt
}
#bind to data.frame
result <- do.call(rbind, result)
Result
head(result)
beta patient_left_out
1 1.381248 1
2 1.345188 2
3 1.427784 3
4 1.361674 4
5 1.420417 5
6 1.454196 6
You can drop a particular row (or column) by using a negative index. In your case, you proceed as follows:
betas <- numeric(nrow(rashdummy2b)) # memory preallocation
for (i in 1:nrow(rashdummy2b)) {
betas[i] <- lm(rashleft ~ gender + baselarea + sfactor, data=rashdummy2b[-i,])$coefficients[2]
}
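The same leave-one-out idea can also be written without an explicit loop. The following is a minimal sketch on simulated data; the data frame `toy` and its columns `x`, `g`, and `y` are made up for illustration and are not the question's data.

```r
set.seed(1)
n <- 30
toy <- data.frame(x = rnorm(n), g = rep(0:1, length.out = n))
toy$y <- 1 + 2 * toy$x + 0.5 * toy$g + rnorm(n)

# Refit the model n times, each time leaving out row i,
# and keep only the coefficient for g
loo_beta <- vapply(seq_len(nrow(toy)), function(i) {
  coef(lm(y ~ g + x, data = toy[-i, ]))[["g"]]
}, numeric(1))

# Compare against the full-data estimate to see which row moves it most
full_beta <- coef(lm(y ~ g + x, data = toy))[["g"]]
shift <- loo_beta - full_beta
most_influential <- which.max(abs(shift))
```

Using `vapply` with `numeric(1)` makes the result a plain numeric vector of length n and fails loudly if a fit ever returns something else.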
I would like to derive individual growth rates from our growth model directly, similar to this OP and this OP.
I am working with a dataset that contains the age and weight (wt) measurements for ~2000 individuals in a population. Each individual is represented by a unique id number.
A sample of the data can be found here. Here is what the data looks like:
id age wt
1615 6 15
3468 32 61
1615 27 50
1615 60 145
6071 109 209
6071 125 207
10645 56 170
10645 118 200
I have developed a non-linear growth curve to model growth for this dataset (at the population level). It looks like this:
wt~ A*atan(k*age - t0) + m
which predicts weight (wt) for a given age and has modifiable parameters A, k, t0, and m. I have fit this model to the dataset at the population level using an nlme regression fit in which I specified individual id as a random effect and used pdDiag to specify the parameters as uncorrelated. (Note: the random effect would need to be dropped when looking at the individual level.)
The code for this looks like:
nlme.k = nlme(wt~ A*atan(k*age - t0) + m,
data = df,
fixed = A+k+t0+m~1,
random = list(id = pdDiag(A+t0+k+m~1)), #cannot include when looking at the individual level
start = c(A = 99.31,k = 0.02667, t0 = 1.249, m = 103.8), #these values are what we are using at the population level # might need to be changed for individual models
na.action = na.omit,
control = nlmeControl(maxIter = 200, pnlsMaxIter = 10, msMaxIter = 100))
I have our population level growth model (nlme.k), but I would like to use it to derive/extract individual values for each growth constant.
How can I extract individual growth constants for each id using my population level growth model (nlme.k)? Note that I don't need it to be a solution that uses nlme, that is just the model I used for the population growth model.
Any suggestions would be appreciated!
I think this is not possible because of how random effects are designed. According to this post, the effect size (your growth constant) is estimated using partial pooling, which draws on data points from other groups. Thus you cannot estimate the effect size of each group (your individual id).
Strictly speaking (see here), random effects are not really part of the model at all, but rather part of the error.
However, you can estimate the R2 for all groups together. If you want estimates at the individual level (e.g. parameter estimates for id 1), just run the same model on only the data points of that particular individual. This gives you n models with n parameter sets for n individuals.
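The per-individual approach described above could be sketched as follows. This is only an illustration under stated assumptions: it uses R's built-in Loblolly data and the self-starting SSasymp model as stand-ins for the poster's data and arctangent curve, and `pines`, `fit_one`, and `ind_coefs` are hypothetical names.

```r
# Loblolly ships with R's datasets package; copy it into a plain data frame
pines <- data.frame(height = Loblolly$height,
                    age = Loblolly$age,
                    Seed = as.character(Loblolly$Seed))

# Fit one nls model to a single individual's rows; return NAs if the fit fails
fit_one <- function(d) {
  tryCatch(
    coef(nls(height ~ SSasymp(age, Asym, R0, lrc), data = d)),
    error = function(e) c(Asym = NA_real_, R0 = NA_real_, lrc = NA_real_)
  )
}

# One row of parameter estimates per individual (row names are the Seed ids)
ind_coefs <- do.call(rbind, lapply(split(pines, pines$Seed), fit_one))
```

Each row of `ind_coefs` comes from a model that saw only that individual's data, so there is no pooling across individuals, and individuals with few observations may fail to converge and show up as NA.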
We ended up using a few loops to do this.
Note that our answer builds off a model posted in this OP if anyone wants the background script. We will also link to the published script when it is posted.
For now, this should give a general idea of how we did this.
#Individual fits dataframe generation
yid_list <- unique(young_inds$squirrel_id)
indf_prs <- list('df', 'squirrel_id', 'A_value', 'k_value', 'mx_value', 'my_value', 'max_grate', 'hit_asymptote', 'age_asymptote', 'ind_asymptote', 'ind_mass_asy', 'converge') #List of parameters
ind_fits <- data.frame(matrix(ncol = length(indf_prs), nrow = length(yid_list))) #Blank dataframe for all individual fits
colnames(ind_fits) <- indf_prs
#Calculates individual fits for all individuals and appends into ind_fits
for (i in 1:length(yid_list)) {
yind_df <-young_inds%>%filter(squirrel_id %in% yid_list[i]) #Extracts a dataframe for each squirrel
ind_fits[i , 'squirrel_id'] <- as.numeric(yid_list[i]) #Appends squirrel i's id into individual fits dataframe
sex_lab <- unique(yind_df$sex) #Identifies and extracts squirrel "i"s sex
mast_lab <- unique(yind_df$b_mast) #Identifies and extracts squirrel "i"s mast value
Hi_dp <- max(yind_df$wt) #Extracts the largest mass for each squirrel
ind_long <- unique(yind_df$longevity) #Extracts the individual death date
#Sets corresponding values for squirrel "i"
if (mast_lab==0 && sex_lab=="F") { #Female no mast
ind_fits[i , 'df'] <- "fnm" #Squirrel dataframe (appends into ind_fits dataframe)
df_asm <- af_asm #average asymptote value corresponding to sex
df_B_guess <- guess_df[1, "B_value"] #Initial guesses for nls fits corresponding to sex and mast
df_k_guess <- guess_df[1, "k_value"]
df_mx_guess <- guess_df[1, "mx_value"]
df_my_guess <- guess_df[1, "my_value"]
ind_asyr <- indf_asy #growth rate at individual asymptote
} else if (mast_lab==0 && sex_lab=="M") { #Male no mast
ind_fits[i , 'df'] <- "mnm"
df_asm <- am_asm
df_B_guess <- guess_df[2, "B_value"]
df_k_guess <- guess_df[2, "k_value"]
df_mx_guess <- guess_df[2, "mx_value"]
df_my_guess <- guess_df[2, "my_value"]
ind_asyr <- indm_asy
} else if (mast_lab==1 && sex_lab=="F") { #Female mast
ind_fits[i , 'df'] <- "fma"
df_asm <- af_asm
df_B_guess <- guess_df[3, "B_value"]
df_k_guess <- guess_df[3, "k_value"]
df_mx_guess <- guess_df[3, "mx_value"]
df_my_guess <- guess_df[3, "my_value"]
ind_asyr <- indf_asy
} else if (mast_lab==1 && sex_lab=="M") { #Males mast
ind_fits[i , 'df'] <- "mma"
df_asm <- am_asm
df_B_guess <- guess_df[4, "B_value"]
df_k_guess <- guess_df[4, "k_value"]
df_mx_guess <- guess_df[4, "mx_value"]
df_my_guess <- guess_df[4, "my_value"]
ind_asyr <- indm_asy
} else { #If sex or mast is not identified or is identified improperly in the data
print("NA")
} #End of if else loop
#Arctangent
#Fits nls model to the created dataframe
nls.floop <- tryCatch({data.frame(tidy(suppressWarnings(nls(wt~ B*atan(k*(age - mx)) + my, #tryCatch lets nls return a fallback instead of a "code stopping" error; suppressWarnings keeps the fit when only a warning is raised
data=yind_df,
start = list(B = df_B_guess, k = df_k_guess, mx = df_mx_guess, my = df_my_guess),
control= list(maxiter = 200000, minFactor = 1/100000000)))))
},
error = function(e){
data.frame(c(0,0), c(0,0)) #Returns a dummy dataframe if no convergence
}) #End of nls.floop
#Creates a dummy numerical index from nls.floop for if else loop below
numeric_floop <- as.numeric(nls.floop[1, 2])
#print(numeric_floop) #Taking a look at the values. If numeric_floop...
# == 0, function did not converge on iteration "i"
# != 0, function did converge on iteration "i" and code will run through calculations
if (numeric_floop != 0) {
results_DF <- nls.floop
ind_fits[i , 'converge'] <- 1 #converge = 1 for converging fit
#Extracting, calculating, and appending values into dataframe
B_value <- as.numeric(results_DF[1, "estimate"]) #B value
k_value <- as.numeric(results_DF[2, "estimate"]) #k value
mx_value <- as.numeric(results_DF[3, "estimate"]) #mx value
my_value <- as.numeric(results_DF[4, "estimate"]) #my value
A_value <- ((B_value*pi)/2)+ my_value #A value calculation
ind_fits[i , 'A_value'] <- A_value
ind_fits[i , 'k_value'] <- k_value
ind_fits[i , 'mx_value'] <- mx_value
ind_fits[i , 'my_value'] <- my_value #appends my_value into df
ind_fits[i , 'max_grate'] <- adr(mx_value, B_value, k_value, mx_value, my_value) #Calculates max growth rate
}
} #End of individual fits loop
Which gives this output:
> head(ind_fits%>%select(df, squirrel_id, A_value, k_value, mx_value, my_value))
df squirrel_id A_value k_value mx_value my_value
1 mnm 332 257.2572 0.05209824 52.26842 126.13183
2 mnm 1252 261.0728 0.02810033 42.37454 103.02102
3 mnm 3466 260.4936 0.03946594 62.27705 131.56665
4 fnm 855 437.9569 0.01347379 86.18629 158.27641
5 fnm 2409 228.7047 0.04919819 63.99252 123.63404
6 fnm 1417 196.0578 0.05035963 57.67139 99.65781
Note that you need to create a blank dataframe first before running the loops.
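One way to keep a result when only a warning is raised, while still falling back on errors, is to combine tryCatch with withCallingHandlers. This is a minimal sketch, not the code we actually ran; `safe_fit` and its arguments are hypothetical names.

```r
# Run fit_fun(); return its value plus a flag saying whether it warned.
# Errors yield value = NULL; warnings are recorded and muffled so the
# computation carries on and its result is kept.
safe_fit <- function(fit_fun) {
  warned <- FALSE
  value <- tryCatch(
    withCallingHandlers(
      fit_fun(),
      warning = function(w) {
        warned <<- TRUE
        invokeRestart("muffleWarning")
      }
    ),
    error = function(e) NULL
  )
  list(value = value, warned = warned)
}

ok  <- safe_fit(function() { warning("step size small"); 42 })
bad <- safe_fit(function() stop("no convergence"))
```

Here `ok$value` is 42 with `ok$warned` set to TRUE, and `bad$value` is NULL. A warning handler passed directly to tryCatch would instead abandon the computation at the first warning, so the fit itself would be lost.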
I have the following data:
set.seed(3)
library(data.table)
library(lme4)
a <- rep(1:5, times = 20)
b <- rep(c(1,1,1,1,1,2,2,2,2,2), times = 10)
ppt <- rep(101:110, each = 10)
item <- rep(1:10, times = 10)
dv <- rnorm(n = 100)
data <- data.table(cbind(ppt, item, a, b, dv))
data$ppt <- as.factor(data$ppt)
data$item <- as.factor(data$item)
data$a <- as.factor(data$a)
data$b <- as.factor(data$b)
contrasts(data$a) = contr.sum(5) # a has 5 levels; contrasts can only be set once a is a factor
I would like to get a coefficient for each level of a. u/omsa_d00d and u/dead-serious pointed me to the idea of running a model without an intercept.
If I run this model:
m1 <- lmer(dv ~ a + b -1 +(1|ppt) + (1|item), data = data)
I get coefficients for each level of a.
However if I run this model in which b comes first:
m2 <- lmer(dv ~ b + a -1 +(1|ppt) + (1|item), data = data)
I get coefficients for each level of b, but not a.
What exactly is happening in each case?
Additionally, is running m1 sufficient to get an effect of each level of a compared to the grand mean, while also controlling for b?
Does it matter if I mean centre my predictors first?
What are the different implications of dummy vs. sum coding factor a?
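One way to see what each formula does is to inspect the fixed-effects design matrix. The sketch below uses base R's model.matrix with the default treatment contrasts on a small made-up grid (`toy` is hypothetical); it assumes the fixed-effects part of an lmer formula is coded the same way.

```r
# Two crossed factors: a with 3 levels, b with 2 levels
toy <- expand.grid(a = factor(1:3), b = factor(1:2))

# Intercept removed, a listed first
cols_a_first <- colnames(model.matrix(~ a + b - 1, data = toy))
# "a1" "a2" "a3" "b2"

# Intercept removed, b listed first
cols_b_first <- colnames(model.matrix(~ b + a - 1, data = toy))
# "b1" "b2" "a2" "a3"
```

When the intercept is removed, only the first factor in the formula receives one indicator column per level; every later factor is still coded by contrasts and loses a level. So in m1 the a coefficients are cell means at the reference level of b, not deviations from a grand mean.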
I am fitting a model with glm. My data has a mix of integer variables and categorical variables. The categorical variables are stored as codes and are therefore of integer type in the data. Initially I passed the categorical variables to the model in integer format as they were. Looking at the p-values to check which ones are significant, I noticed that a few variables I was not expecting were significant.
That is when I realized that the categorical variables in integer form may be causing an issue. For example, code 3 might get a higher importance than code 1 (I am not sure about this, and it would be great if someone could confirm). After some research I found that a categorical integer variable can be converted to a factor. I did that and re-fit the model.
I also saw some posts suggesting converting to binary variables, so I did that as well. So now I have 3 results -
r1 >> with categorical integer variables
r2 >> with categorical factor variables
r3 >> with categorical variable converted to binary
I feel that output 1, with categorical integer variables, is incorrect (please confirm). But between outputs 2 and 3 I am confused about which one to use, as
the p-values are different,
which one would be more accurate?
can I relate the p-values of output 3 to those of output 2?
How does glm handle such variables?
I hope running glm inside a for loop is not an issue.
My database is big; can we do glm using data.table?
I am pasting below my code with some sample data to be reproduced
library("plyr")
library("foreign")
library("data.table")
#####Generating sample data
set.seed(1200)
id <- 1:100
bill <- sample(1:3,100,replace = T)
nos <- sample(1:40,100,replace = T)
stru <- sample(1:4,100,replace = T)
type <- sample(1:7,100,replace = T)
value <- sample(100:1000,100,replace = T)
df1 <- data.frame(id,bill,nos,stru,type,value)
var1 <- c("bill","nos","stru")
options(scipen = 999)
r1 <- data.frame()
for(type1 in unique(df1$type)){
for(var in var1){
# dynamically generate formula
fmla <- as.formula(paste0("value ~ ", var))
# fit glm model
fit <- glm(fmla, data=df1[df1$type == type1,],family='quasipoisson')
p.value <- coef(summary(fit))[8]
cfit <- coef(summary(fit))
# create data frame
df2 <- data.frame(var = var, type = type1, basket="value",p.value = cfit[8],stringsAsFactors = F)
r1 <- rbind(r1, df2)
}
}
##### converting the categorical numeric variables to factor variables
df1$bill_f <- as.factor(bill)
df1$stru_f <- as.factor(stru)
var1 <- c("bill_f","nos","stru_f")
r2 <- data.frame()
for(type1 in unique(df1$type)){
for(var in var1){
# dynamically generate formula
fmla <- as.formula(paste0("value ~ ", var))
# fit glm model
fit <- glm(fmla, data=df1[df1$type == type1,],family='quasipoisson')
p.value <- coef(summary(fit))[8]
cfit <- coef(summary(fit))
# create data frame
df2 <- data.frame(var = var, type = type1, basket="value",p.value = cfit[8],stringsAsFactors = F)
r2 <- rbind(r2, df2)
}
}
#####converting the categorical numeric variables to binary format (1/0)
df1$bill_1 <- ifelse(df1$bill == 1,1,0)
df1$bill_2 <- ifelse(df1$bill == 2,1,0)
df1$bill_3 <- ifelse(df1$bill == 3,1,0)
df1$stru_1 <- ifelse(df1$stru == 1,1,0)
df1$stru_2 <- ifelse(df1$stru == 2,1,0)
df1$stru_3 <- ifelse(df1$stru == 3,1,0)
df1$stru_4 <- ifelse(df1$stru == 4,1,0)
var1 <- c("bill_1","bill_2","bill_3","nos","stru_1","stru_2","stru_3")
r3 <- data.frame()
for(type1 in unique(df1$type)){
for(var in var1){
# dynamically generate formula
fmla <- as.formula(paste0("value ~ ", var))
# fit glm model
fit <- glm(fmla, data=df1[df1$type == type1,],family='quasipoisson')
p.value <- coef(summary(fit))[8]
cfit <- coef(summary(fit))
# create data frame
df2 <- data.frame(var = var, type = type1, basket="value",p.value = cfit[8],stringsAsFactors = F)
r3 <- rbind(r3, df2)
}
}
Your feeling is mostly correct. For a GLM you should distinguish between continuous variables and discrete (categorical) variables.
Binary variables are variables that contain only 2 levels, for example 0 and 1.
Since your categorical variables have more than 2 levels, you should use the factor() function.
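To make the difference concrete, here is a minimal sketch on simulated data; `d`, `code`, and the fit names are made up for illustration. It compares an integer-coded predictor, the same predictor as a factor, and hand-made 0/1 dummies entered together.

```r
set.seed(42)
d <- data.frame(code = sample(1:3, 60, replace = TRUE))
d$value <- rpois(60, lambda = c(5, 20, 8)[d$code])

# Integer coding: one slope, i.e. a single log-linear trend across codes 1, 2, 3
fit_int <- glm(value ~ code, data = d, family = quasipoisson)

# Factor coding: one coefficient per non-reference level
fit_fac <- glm(value ~ factor(code), data = d, family = quasipoisson)

# Hand-made dummies for levels 2 and 3, entered in the same model
d$code_2 <- as.integer(d$code == 2)
d$code_3 <- as.integer(d$code == 3)
fit_dum <- glm(value ~ code_2 + code_3, data = d, family = quasipoisson)

same_fit <- all.equal(unname(coef(fit_fac)), unname(coef(fit_dum)))
```

Here `same_fit` is TRUE: a factor is just a complete set of dummies minus a reference level. The loops in the question instead fit each dummy alone, which compares one level against all the others pooled, so those p-values answer a different question. Note also that `coef(summary(fit))[8]` is the second-row p-value only when the coefficient table has two rows; selecting the column by name, for example `coef(summary(fit))[, "Pr(>|t|)"]`, avoids picking a different cell for multi-level factors.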
I am working on a large dataset with 19 subcohorts for which I want to run a linear regression model to estimate BMI.
One of the covariates I am using is sex, but some subcohorts consist only of men, which causes problems in my loop.
If I try to run a linear regression model, I get the following error:
Error in `contrasts<-`(`*tmp*`, value = contr.funs[1 + isOF[nn]]) :
contrasts can be applied only to factors with 2 or more levels
I have found a workaround for this problem: running separate loops for subcohorts with only men and subcohorts with both men and women, using the following (simplified) code:
men <- c(1,6,15) # Cohort nrs that only contain men
menandwomen <- c(2,3,4,5,7,8,9,10,11,12,13,14,16,17,18,19)
trenddpmodelm <-list()
for(i in men) {
trenddpmodelm[[i]] <- lm(BMI ~ age + sex,
data=subcohort[subcohort$centre_a==i, ],)
}
trenddpmodelmw <-list()
for(i in menandwomen) {
trenddpmodelmw[[i]] <- lm(BMI ~ age + sex,
data=subcohort[subcohort$centre_a==i, ],)
}
trenddpmodel <- c(list(trenddpmodelm[[1]]), list(trenddpmodelmw[[2]]), list(trenddpmodelmw[[3]]), list(trenddpmodelmw[[4]]), list(trenddpmodelmw[[5]]), list(trenddpmodelm[[6]]), list(trenddpmodelmw[[7]]), list(trenddpmodelmw[[8]]), list(trenddpmodelmw[[9]]), list(trenddpmodelmw[[10]]), list(trenddpmodelmw[[11]]), list(trenddpmodelmw[[12]]), list(trenddpmodelmw[[13]]), list(trenddpmodelmw[[14]]), list(trenddpmodelm[[15]]), list(trenddpmodelmw[[16]]), list(trenddpmodelmw[[17]]), list(trenddpmodelmw[[18]]), list(trenddpmodelmw[[19]]))
After this step, I extract relevant information from the summaries and put this in a df to export to excel.
My problem is that I will be running quite a lot of analyses, which will result in pages and pages of code.
My question is therefore: is there a setting in R that allows non-varying factors to be dropped from my linear regression model in subcohorts where this applies? (This would be similar to what happens in coxph: R gives a warning that the factor does not always vary, but the loop does run.)
It is not that I cannot continue working without a solution, but I have been trying to find an answer to this question for days without success and I think it must be possible somehow. Any advice is much appreciated :)
I would recommend building your formula dynamically within the loop.
DF <- list(Cohort1 = data.frame(bmi = rnorm(25, 24, 1),
age = rnorm(25, 50, 3),
sex = sample(c("F", "M"), 25, replace = TRUE)),
Cohort2 = data.frame(bmi = rnorm(15, 24, 1),
age = rnorm(15, 55, 4),
sex = rep("M", 15)))
candidate_vars <- c("age", "sex")
Models <- vector("list", length(DF))
for (i in seq_along(DF)){
# Keep a variable if it is numeric, or if it takes more than one distinct value
# (length(unique(x)) works for character columns as well as factors)
indep <- vapply(X = DF[[i]][candidate_vars],
FUN = function(x){
if (is.numeric(x)) return(TRUE)
else return(length(unique(x)) > 1)
},
FUN.VALUE = logical(1))
# Write the formula
form <- paste("bmi ~ ", paste(candidate_vars[indep], collapse = " + "))
# Create the model
Models[[i]] <- lm(as.formula(form), data = DF[[i]])
}
I have written this R code to reproduce. Here, I have a created a unique column "ID", and I am not sure how to add the predicted column back to test dataset mapping to their respective IDs. Please guide me on the right way to do this.
#Code
library(C50)
data(churn)
data=rbind(churnTest,churnTrain)
data$ID<-seq.int(nrow(data)) #adding unique id column
rm(churnTrain)
rm(churnTest)
set.seed(1223)
ind <- sample(2,nrow(data),replace = TRUE, prob = c(0.7,0.3))
train <- data[ind==1,1:21]
test <- data[ind==2, 1:21]
xtrain <- train[,-20]
ytrain <- train$churn
xtest <- test[,-20]
ytest<- test$churn
x <- cbind(xtrain,ytrain)
## C50 Model
c50Model <- C5.0(churn ~
state +
account_length +
area_code +
international_plan +
voice_mail_plan +
number_vmail_messages +
total_day_minutes +
total_day_calls +
total_day_charge +
total_eve_minutes +
total_eve_calls +
total_eve_charge +
total_night_minutes +
total_night_calls +
total_night_charge +
total_intl_minutes +
total_intl_calls +
total_intl_charge +
number_customer_service_calls,data=train, trials=10)
# Evaluate Model
c50Result <- predict(c50Model, xtest)
table(c50Result, ytest)
#adding prediction to test data
testnew = cbind(xtest,c50Result)
#OR predict directly
xtest$churn = predict(c50Model, xtest)
I’d use match(dataID, predictedID) to match ID columns in data sets.
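A minimal sketch of that idea, with made-up data: `test`, `preds`, and the column `churn_hat` are hypothetical and only stand in for the test set and the prediction table.

```r
test  <- data.frame(ID = c(11, 12, 13, 14), x = c(0.2, 0.5, 0.9, 0.4))
preds <- data.frame(ID = c(13, 11, 14, 12),
                    churn_hat = c("yes", "no", "no", "yes"))

# match() gives, for each test ID, the row of preds holding that ID,
# so the predictions line up even when the two tables are ordered differently
test$churn_hat <- preds$churn_hat[match(test$ID, preds$ID)]
```

After this, `test$churn_hat` is "no", "yes", "yes", "no" for IDs 11 to 14. Any ID missing from `preds` would get NA rather than a misaligned value.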
In reply to your comment:
If you want to add the predicted values to the original data frame, both ways of merging data and predictions are correct and produce identical results. The only thing is, I would use
xtest$churn_hat <- predict(c50Model, xtest)
instead of
xtest$churn <- predict(c50Model, xtest)
because in the second version you are replacing the original churn (as in data$churn) with whatever the model predicted, so you can't compare the two.