create survival curve with a list of dataframes - r

The survfit() function will not accept a list, so I need a way to unlist the result of Surv() without removing information from the column. For example, using the lung data:
library(survival)
attach(lung)
lung$survObj <- with(lung, Surv(time, status ==2))
lung.2 <- lung
mylist <- list(lung,lung.2)
Here is an attempt to use lung from my list and pass lung$survObj into survfit(), grouping by column 5 (sex):
survfit(formula = mylist[[1]][11] ~ mylist[[1]][5], data = mylist[[1]])
invalid type (list) for variable 'mylist[[1]][11]'
but given the following:
nrow(mylist[[1]][11])
228
and
length(unlist(mylist[[1]][11]))
456
i.e. double!
I get the obvious complaint that I am comparing columns of unequal length.
Also notice that unlisting removes the + symbols (the censoring markers), which are vital to survfit().
Has anyone managed to call survfit on a list of dataframes?
thanks.

Here are three possible solutions to your problem.
A)
survfit(survObj ~ sex, data=mylist[[1]])
B)
Y1 <- mylist[[1]]$survObj
x1 <- mylist[[1]]$sex
survfit(Y1 ~ x1)
C)
Y2 <- mylist[[1]][11][[1]]
x2 <- mylist[[1]][5][[1]]
survfit(Y2 ~ x2)
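To run the fit over every data frame in the list (the original goal), here is a minimal sketch that simply applies solution A with lapply, assuming each element already carries the survObj column built in the question:
# fit one curve per data frame in the list, grouped by sex
fits <- lapply(mylist, function(d) survfit(survObj ~ sex, data = d))
fits[[1]]  # survfit result for the first data frame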

Related

how to change order of rows in ggforest without changing the reference

Imagine the following dataset (made-up data):
K<- c(2,2.2,2.4,2.6,2.8,3,3.5,3.8,4,4.2,4.4,4.8,5,5.2,5.4,5.6,5.8,6)
event <- c(1,1,1,0,1,1,1,1,1,0,1,1,1,1,1,0,1,1)
t<- c(8,10,25,10,8,22,30,16,32,30,32,20,8,12,14,22,10,6)
df<- data.frame(K,event,t)
I split the variable K (potassium) into a categorical variable with 3 levels (< 3, >= 3 and < 5, >= 5):
df$K_cut <- cut(K, c(0,3,5,6.5), right = F)
levels(df$K_cut) # [1] "[0,3)" "[3,5)" "[5,6.5)"
We perform a Cox regression and represent it with ggforest.
The reference category is potassium < 3
fit3<- coxph(Surv(t,event) ~ K_cut, data=df)
fit3
library(survminer)
ggforest(fit3, data=df, fontsize=0.8)
We change the reference category to normal potassium (3-5).
When plotting, the reference is now correct, but it is plotted on the first line.
df$K_cut <- relevel(df$K_cut, ref = "[3,5)")
fit4<- coxph(Surv(t,event) ~ K_cut, data=df)
fit4
library(survminer)
ggforest(fit4, data=df, fontsize=0.8)
I would like to keep K 3-5 as the reference category but have it on the center line, so that the graph reads from top to bottom: K < 3, K between 3 and 5, and K >= 5.
The result should look like this (mocked up by pasting/retouching the figure ...).
Is there a way to do this with ggforest or with another function/package, i.e. change the order of the rows and put the reference wherever you want?
In addition, can you change the spacing so that the intervals and the N = ... are not so close together, or modify the name of the variable?
I have not seen such options in the ggforest documentation.
Regards and thanks.
One option to achieve your desired result would be to relevel your factor after estimating your model and before calling ggforest:
library(survminer)
library(survival)
fit4 <- coxph(Surv(t, event) ~ K_cut, data = df)
df$K_cut <- factor(df$K_cut, levels = c("[0,3)", "[3,5)", "[5,6.5)"))
ggforest(fit4, data = df, fontsize = 0.8)
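For completeness, the whole sequence with the example data from the question (the same steps as above, consolidated into one run):
library(survival)
library(survminer)
df$K_cut <- cut(df$K, c(0, 3, 5, 6.5), right = FALSE)
df$K_cut <- relevel(df$K_cut, ref = "[3,5)")        # estimate the model with [3,5) as reference
fit4 <- coxph(Surv(t, event) ~ K_cut, data = df)
df$K_cut <- factor(df$K_cut, levels = c("[0,3)", "[3,5)", "[5,6.5)"))  # display order only, no refit
ggforest(fit4, data = df, fontsize = 0.8)           # rows now read K < 3, 3-5 (ref), >= 5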

How to write a function in R that will implement the "best subsets" approach to model selection?

So I need to write a function that takes a data frame as input. The columns are my explanatory variables (except for the last/right-most column, which is the response variable). I'm trying to fit a linear model and track each model's adjusted R-squared as the criterion used to pick the best model.
The model will use all the columns as the explanatory variables (except for the right-most column, which will be the response variable).
The function is supposed to create a tibble with a single column for the model number (I have no idea what this is supposed to mean), the subset of explanatory variables along with the response variable, the model formula, the outcome of fitting the linear model, and others as needed.
The function is supposed to output: the model number, the explanatory variables in the model, the value of adjusted R-squared, and a graph (I can figure the graph out on my own). I have an image of a table here to help with visualizing what the result should look like.
I figured out that this code will get me the explanatory and response variables:
cols <- colnames(data)
# Get the response variable.
y <- tail(cols, 1)
# Get a list of the explanatory variables.
xs <- head(cols, length(cols) - 1)
I know that I can get a model with something like this (ignore variable names for now):
model <- final_data %>%
  group_by(debt) %>%
  lm(debt ~ distance, data = .) %>%
  glance()
I also know that I'm going to have to somehow map that model to each of the rows in the tibble that I'm trying to create.
What I'm stuck on is figuring out how to put all this together and create the complete function. I wish I could provide more details but I am completely stuck. I've spent about 10 hours working on this today... I asked my professor for help and he just told me to post here.
For reference here is a very early (not working at all) attempt I made:
best_subsets <- function(data) {
  cols <- colnames(data)
  # Get the response variable.
  y <- tail(cols, 1)
  # Get a list of the explanatory variables.
  xs <- head(cols, length(cols) - 1)
  # Create the formula as a string and then later in the lm function
  # have it turned into a real formula.
  form <- paste(y, "~", xs, sep = " ")
  data %>%
    lm(as.formula(form), data = .) %>%
    glance()
}
I don't fully understand your description but I think I understand your goal. Maybe this can help in some way?:
library(tidyverse)
library(broom)
library(data.table)
lm_func <- function(df){
  fit1 <- lm(df[, 1] ~ df[, 2], data = df)
  fit2 <- lm(df[, 1] ~ df[, 3], data = df)
  fit3 <- lm(df[, 1] ~ df[, 2] + df[, 3], data = df)
  results <- list(fit1, fit2, fit3)
  names(results) <- paste0("explanatory_variables_", 1:3)
  r_sq <- lapply(results, function(x){
    glance(x)
  })
  r_sq_df <- rbindlist(r_sq, idcol = "df_name")
  r_sq_df
}
lm_func(iris)
This gives you a data frame of all the important outputs, from which you can select adj.r.squared. It would also be possible to automate. As a side note, selecting a model based on R-squared alone seems risky (danger of overfitting): a higher R-squared does not necessarily mean a better model, so consider looking into AIC as well.
Let me know if this helps at all or if I can refine the answer a little more towards your goal.
UPDATE:
lm_func <- function(df) {
  lst <- list()
  for (i in 2:ncol(df)) {
    form_df <- df[, 1:i]
    form <- DF2formula(form_df)
    fit <- lm(form, data = df)
    lst[[i - 1]] <- glance(fit)
  }
  names(lst) <- paste0("explanatory_variables_", 1:length(lst))
  lst <- rbindlist(lst, idcol = "df_name")
  lst
}
lm_func(iris)
This assumes your first column is y and you want a model for every additional column.
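For example, to keep just the model label and adjusted R-squared from that output (rbindlist() returns a data.table, so data.table subsetting works):
res <- lm_func(iris)
res[, .(df_name, adj.r.squared)]  # adjusted R-squared per model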
OK one more UPDATE:
I think this does everything possible but is probably overkill:
library(combinat)
library(data.table)
library(tidyverse)
library(broom)
# First function takes a dataframe containing only the dependent and independent variables.
# Specify them by variable name or column position.
# The function then returns a list of dataframes with every possible ordering of the
# independent variables (y ~ x1 + x2 ...), (y ~ x2 + x1 ...), so you can run your model
# on every possible sequence of explanatory variables.
formula_func <- function(df,
                         dependent = df["Sepal.Length"],
                         independents = df[c("Sepal.Width", "Petal.Length", "Petal.Width", "Species")]) {
  # length of the output should be the factorial of the number of independent variables
  independents_df_list <- permn(independents)
  # this just pastes your dependent variable back in as the first column of each df
  df_list <- lapply(independents_df_list, function(x){
    cbind(dependent, x)
  })
  df_list
}
permd_df_list <- formula_func(iris) # voila
# This function takes the output from the previous function and runs the lm building in one variable each time (y ~ x1), (y ~ x1 + x2) and so on
# So the result is many lms building in one one independent variable at a time in every possible order
# If that is as confusing to you as it is to me then check final output. You will see what model formula is used per row and in what order each explanatory variable was added
lm_func <- function(form_df_list, df) {
  mega_lst <- lapply(form_df_list, function(x) {
    lst <- vector(mode = "list", length = ncol(x) - 1)
    for (i in 2:ncol(x)) {
      form_df <- x[, 1:i]
      form <- DF2formula(form_df)
      fit <- lm(form, data = x)
      lst[[i - 1]] <- glance(fit)
      names(lst)[[i - 1]] <- deparse(form)
    }
    lst <- rbindlist(lst, idcol = "Model_formula")
    return(lst)
  })
  return(mega_lst)
}
everything_list <- lm_func(permd_df_list, iris) # VOILA!!!
# Remove duplicates and return a single df
everything_list_distinct <- everything_list %>%
  rbindlist() %>%
  distinct()
## You can now subset and select whichever column you want from the final output
I posted this as a coding exercise, so let me know if anyone spots any errors. Just one caveat: this code does NOT represent a statistically sound approach, it is just a coding experiment, so be sure to understand the stats first!
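As a rough sketch of the all-subsets search the question actually asks for (last column as the response, every non-empty subset of the remaining columns fitted, adjusted R-squared collected in a tibble); the helper and column conventions here are my own assumptions, not code from the thread:
library(broom)
library(tibble)
best_subsets <- function(data) {
  cols <- colnames(data)
  y  <- tail(cols, 1)                      # response = last column
  xs <- head(cols, length(cols) - 1)       # explanatory variables
  # every non-empty subset of the explanatory variables
  subsets <- unlist(lapply(seq_along(xs),
                           function(k) combn(xs, k, simplify = FALSE)),
                    recursive = FALSE)
  tibble(
    model_number  = seq_along(subsets),
    predictors    = vapply(subsets, paste, character(1), collapse = ", "),
    adj_r_squared = vapply(subsets, function(s) {
      fit <- lm(reformulate(s, response = y), data = data)
      glance(fit)$adj.r.squared
    }, numeric(1))
  )
}
best_subsets(mtcars)   # e.g. mtcars, with carb (the last column) as the response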

R - lm, cooks.distance & Outliers by Group

The code without a group is working great:
url <- "https://raw.githubusercontent.com/selva86/datasets/master/ozone.csv"
ozone <- read.csv(url)
ozone <- head(ozone,20)
mod <- lm(ozone_reading ~ ., data=ozone)
cooksd <- cooks.distance(mod)
influential <- as.numeric(names(cooksd)[(cooksd > 4*mean(cooksd, na.rm=T))]) # influential row numbers
(ozone[influential, ]) # influential observations.
As per my new requirements, I have to add a group and need to find the outliers for each group. My code sample is below. How do I get cooks.distance and the outliers by group? Please help.
url <- "https://raw.githubusercontent.com/selva86/datasets/master/ozone.csv"
ozone <- read.csv(url)
ozone <- head(ozone,20)
ozone$season <- c('summer','summer','summer','summer','summer','summer','summer','summer','summer','summer',
'winter','winter','winter','winter','winter','winter','winter','winter','winter','winter')
Here I need to compute mod, cooksd and influential by group.
Simply generalize your process into a function and call it with by() (an object-oriented wrapper to tapply), which subsets a data frame by one or more factors and passes each subset into the function, returning a list with one data frame per distinct group:
proc_cooks_outlier <- function(df) {
  mod <- lm(ozone_reading ~ ., data = transform(df, season = NULL))
  cooksd <- cooks.distance(mod)
  # influential row names (names carry over from the full data frame, so match by name)
  influential <- names(cooksd)[cooksd > 4 * mean(cooksd, na.rm = TRUE)]
  return(df[influential, , drop = FALSE])
}
outlier_df_list <- by(ozone, ozone$season, FUN=proc_cooks_outlier)
# REFERENCE INDIVIDUAL DFs
outlier_df_list$summer
outlier_df_list$winter
...
# COMBINE ALL INTO ONE DF
master_outlier_df <- do.call(rbind, unname(outlier_df_list))
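If you prefer a plain list instead of a by object, the same process can be sketched with split() plus lapply():
outlier_df_list2   <- lapply(split(ozone, ozone$season), proc_cooks_outlier)
master_outlier_df2 <- do.call(rbind, unname(outlier_df_list2))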

In R using the pls package, how can I obtain estimates of coefficients by group/factor

I've started looking at the pls package & I am unsure about how to extract separate coefficients by group/factor. I can run separate models per group, or consider the X ~ group interaction term, but that isn't what I'm after.
I'm using the following syntax:
model1 <- plsr(outcome ~ pred * group, data = plsDATA, 2)
I've tried using the following:
model2 <- plsr(outcome ~ embed(pred:as.factor(group)), data=plsDATA,2)
but this results in this error:
Error in model.frame.default(formula = outcome ~ embed(pred:as.factor(group)), :
variable lengths differ (found for 'embed(pred:as.factor(group))')
In addition: Warning messages:
1: In pred:as.factor(group) :
numerical expression has 640 elements: only the first used
2: In pred:as.factor(group) :
numerical expression has 32 elements: only the first used
I'm not sure why I'm getting the variable lengths error since running the following command gives compatible dimensions:
dim(group)
[1] 32 1
dim(outcome)
[1] 32 1
dim(pred)
[1] 32 20
The code is below:
library(pls)
# Dummy data
setwd("/Users/John/Documents")
Data <- read.csv("SamplePLS.csv")
# Define each of the inputs: pred is X, group is the factor & outcome is Y
pred <- as.matrix(Data[, 3:22])
group <- as.matrix(Data[, 1])
outcome <- as.matrix(Data[, 2])
# Now combine the matrices into a single data frame
plsDATA <- data.frame(SampN = c(1:nrow(Data)))
plsDATA$pred <- pred
plsDATA$group <- group
plsDATA$outcome <- outcome
# Define the model - ask for two components
model1 <- plsr(outcome ~ pred * group, data = plsDATA, 2)
# Get coefficients from this object
According to your question, you want to extract the coefficients. There is a function, coef(), that will pull them out easily. See the results below.
> Data <- read.csv("SamplePLS.csv") # Define each of the inputs: pred is X, group is the factor & outcome is Y
> pred <- as.matrix(Data[,3:22])
> group <- as.matrix(Data[,1])
> outcome <- as.matrix(Data[,2]) #now combine the matrices into a single dataframe
> plsDATA <- data.frame(SampN=c(1:nrow(Data)))
> plsDATA$pred <- pred
> plsDATA$group <- group
> plsDATA$outcome <-outcome #define the model - ask for two components
> model1 <- plsr(outcome ~ pred * group, data=plsDATA,2)
> coef(model1)
, , 2 comps
outcome
predpred1 -1.058426e-02
predpred2 2.634832e-03
predpred3 3.579453e-03
predpred4 1.135424e-02
predpred5 3.271867e-04
predpred6 4.438445e-03
predpred7 8.425997e-03
predpred8 3.001517e-03
predpred9 2.111697e-03
predpred10 -9.264594e-04
predpred11 1.885554e-03
predpred12 -2.798959e-04
predpred13 -1.390471e-03
predpred14 -1.023795e-03
predpred15 -3.233470e-03
predpred16 5.398053e-03
predpred17 9.796533e-03
predpred18 -8.237801e-04
predpred19 4.778983e-03
predpred20 1.235484e-03
group 9.463735e-05
predpred1:group -8.814101e-03
predpred2:group 9.013430e-03
predpred3:group 7.597494e-03
predpred4:group 1.869234e-02
predpred5:group 1.462835e-03
predpred6:group 6.928687e-03
predpred7:group 1.925111e-02
predpred8:group 3.752095e-03
predpred9:group 2.404539e-03
predpred10:group -1.288023e-03
predpred11:group 4.271393e-03
predpred12:group 6.704938e-04
predpred13:group -3.943964e-04
predpred14:group -5.468510e-04
predpred15:group -5.595737e-03
predpred16:group 1.090501e-02
predpred17:group 1.977715e-02
predpred18:group -3.013597e-04
predpred19:group 1.169534e-02
predpred20:group 3.389127e-03
The same results could also be achieved with the call model1$coefficients or model1[[1]]. Based on the question, I think this is the result you are looking for.
Actually, I've just figured this out. You need to dummy code the grouping variable & make it the outcome (i.e. predicted variable). In this case, I had two columns representing group membership. In each case, membership in the group was indicated by 1 and non-membership by 0. Then I called the first two columns as group (i.e. group <- as.matrix(Data[,1:2])) & ran the rest of the code as before substituting group for outcome.
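A rough sketch of that dummy-coding approach follows; the column positions and the layout of SamplePLS.csv are assumptions, since the file with the membership columns isn't shown:
library(pls)
Data <- read.csv("SamplePLS.csv")
# assumed layout: columns 1:2 are 0/1 group-membership dummies, later columns are predictors
group <- as.matrix(Data[, 1:2])     # 1 = member of the group, 0 = not
pred  <- as.matrix(Data[, 3:22])    # predictor block (position assumed)
plsDATA <- data.frame(SampN = seq_len(nrow(Data)))
plsDATA$pred  <- pred
plsDATA$group <- group
# group now plays the role of the (multivariate) response
model3 <- plsr(group ~ pred, data = plsDATA, 2)
coef(model3)                        # one set of coefficients per group column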

dynamically generating factor name in lm function in R

I have an R question here.
We do the 'normal' lm fit quite often:
model <- lm(y~0+x1+x2+x3, data=d)
However, what I am trying to do is select the names of x1, x2, x3 dynamically in the R script. Depending on which y I am fitting, the call might take the form
model<-lm(y2~0+x10+x21+x33, data=d)
model<-lm(y3~0+x41+x15+x8, data=d)
etc. Please note that y2, y3 ... are sequentially selected from a list. There is actually a function
factorOfInterest <-getFactors(y)
Does anyone know how to program an R script like this? Or is there a better way to structure the problem?
Say your data looks like this:
set.seed(1)
df1 <- data.frame(x1 = rnorm(5),
                  x2 = rnorm(5),
                  x3 = rnorm(5),
                  y1 = rnorm(5),
                  y2 = rnorm(5))
The following will go through all combinations of the xs, 7 in total:
### get columns named x
c1 <- colnames(df1)[grepl("x", colnames(df1))]
### make matrix of all combinations
library(combinat)
c2 <- combinat::hcube(rep(2, length(c1))) - 1
### remove top row (represents intercept-only model)
c2 <- c2[-1, ]
### list to store results
l1 <- as.list(vector(length = nrow(c2)))
### use matrix for y values when fitting models
lhs1 <- cbind(df1$y1, df1$y2)
for (i in 1:nrow(c2)){
  ### subset of x variables
  rhs1 <- c1[as.logical(c2[i, ])]
  rhs1 <- paste0(rhs1, collapse = "+")
  f1 <- paste("lhs1", rhs1, sep = "~")
  f1 <- as.formula(f1)
  ### fit against df1 so the x columns are found
  l1[[i]] <- lm(f1, data = df1)
}
(I'm sure faster methods are available for larger sets)...
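If the goal is only to build one formula per response from dynamically chosen names (rather than all combinations), here is a small sketch using the asker's getFactors(), assumed here to return a character vector of predictor names, together with reformulate():
fit_one <- function(y_name, d) {
  x_names <- getFactors(y_name)   # e.g. c("x10", "x21", "x33"); asker's own function
  # equivalent to y2 ~ 0 + x10 + x21 + x33 (no intercept, as in the question)
  f <- reformulate(x_names, response = y_name, intercept = FALSE)
  lm(f, data = d)
}
models <- lapply(c("y2", "y3"), fit_one, d = d)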