I am trying to fit lm() and segmented() models in R using the same independent variable (x) and multiple dependent response variables (Curve1, Curve2, etc.), one at a time. I wish to extract the estimated breakpoint and model coefficients for each response variable. An example of my data is below.
x Curve1 Curve2 Curve3
1 -0.236422 98.8169 95.6828 101.7910
2 -0.198083 98.3260 95.4185 101.5170
3 -0.121406 97.3442 94.8899 100.9690
4 0.875399 84.5815 88.0176 93.8424
5 0.913738 84.1139 87.7533 93.5683
6 1.795530 73.3582 78.1278 82.9956
7 1.833870 72.8905 77.7093 82.7039
8 1.872200 72.4229 77.3505 82.4123
9 2.907350 59.2070 67.6652 74.5374
10 3.865810 46.4807 58.5158 65.0220
11 3.904150 45.9716 58.1498 64.7121
12 3.942490 45.4626 57.8099 64.4022
13 4.939300 33.3040 48.9742 56.3451
14 4.977640 32.9641 48.6344 56.0352
15 5.936100 24.4682 36.4758 47.0485
16 5.936100 24.4682 36.4758 47.0485
17 6.012780 23.7885 35.9667 46.5002
18 6.971250 20.7387 29.6035 39.6476
19 7.009580 20.6167 29.3490 39.3930
20 8.006390 18.7209 22.7313 32.7753
21 8.121410 18.5022 22.3914 32.1292
22 9.041530 16.4722 19.6728 26.9604
23 9.079870 16.3877 19.5595 26.7450
I am able to do this one curve at a time using the code below. However, my full data set has over 1000 curves, so I would like to repeat this code over every column somehow. I have not been at all successful trying to loop it over every column, so if anyone could show me how to do that and create a summary data frame like the one generated by the code below, but with every column included, I would be extremely grateful. Thanks!
model <- lm(Curve1~x, dat) # Linear model
seg_model <- segmented(model, seg.Z = ~x) # Segmented model
breakpoint <- as.matrix(seg_model$psi.history[[5]]) # Extract breakpoint
coefficients <- as.matrix(seg_model$coefficients) # Extract coefficients
summary_curve1 <- as.data.frame(rbind(breakpoint, coefficients)) # Combine breakpoint and coefficients
colnames(summary_curve1) <- "Curve_1" # header name
summary_curve1 # display summary
Here's an approach using tidyverse and broom to return a data frame containing the results for each Curve column:
library(broom)
library(tidyverse)
library(segmented)

model.results = setNames(names(dat[,-1]), names(dat[,-1])) %>%
  map(~ lm(paste0(.x, " ~ x"), data=dat) %>%
        segmented(seg.Z=~x) %>%
        list(model=tidy(.),
             psi=data.frame(term="breakpoint", estimate=.[["psi.history"]][[5]]))) %>%
  map_df(~.[2:3] %>% bind_rows, .id="Curve")
model.results
Curve term estimate std.error statistic p.value
1 Curve1 (Intercept) 95.866127 0.14972382 640.286416 1.212599e-42
2 Curve1 x -12.691455 0.05220412 -243.112130 1.184191e-34
3 Curve1 U1.x 10.185816 0.11080880 91.922447 1.233602e-26
4 Curve1 psi1.x 0.000000 0.02821843 0.000000 1.000000e+00
5 Curve1 breakpoint 5.595706 NA NA NA
6 Curve2 (Intercept) 94.826309 0.45750667 207.267599 2.450058e-33
7 Curve2 x -9.489342 0.11156425 -85.057193 5.372730e-26
8 Curve2 U1.x 6.532312 1.17332640 5.567344 2.275438e-05
9 Curve2 psi1.x 0.000000 0.23845241 0.000000 1.000000e+00
10 Curve2 breakpoint 7.412087 NA NA NA
11 Curve3 (Intercept) 100.027990 0.29453941 339.608175 2.069087e-37
12 Curve3 x -8.931163 0.08154534 -109.523900 4.447569e-28
13 Curve3 U1.x 2.807215 0.36046013 7.787865 2.492325e-07
14 Curve3 psi1.x 0.000000 0.26319757 0.000000 1.000000e+00
15 Curve3 breakpoint 6.362132 NA NA NA
You can wrap the whole thing in a function that takes the column name and the data as arguments, and use lapply on the column names, like this:
library(segmented)
run_mod <- function(varname, data){
  data$Y <- data[,varname]
  model <- lm(Y ~ x, data)                            # Linear model
  seg_model <- segmented(model, seg.Z = ~x)           # Segmented model
  breakpoint <- as.matrix(seg_model$psi.history[[5]]) # Extract breakpoint
  coefficients <- as.matrix(seg_model$coefficients)   # Extract coefficients
  summary_curve1 <- as.data.frame(rbind(breakpoint, coefficients))
  colnames(summary_curve1) <- varname
  return(summary_curve1)
}

lapply(names(dat)[2:ncol(dat)], function(x) run_mod(x, dat))
This gives the summary for each fitted curve (I'm not sure which output you actually want).
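A side note on the breakpoint extraction used in both the question and the answers: indexing psi.history[[5]] ties the code to a particular iteration of the fitting history. If I remember the segmented API correctly, the final estimate is also available directly from the fitted object, which is less fragile if the number of iterations changes:

# Final breakpoint estimate(s), with standard errors, from the fitted object
seg_model$psi            # matrix with columns "Initial", "Est." and "St.Err"
seg_model$psi[, "Est."]  # just the estimate(s)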
I had the same issue and I'm trying to adapt the suggested answer, but I get the following error:
Error in model.frame.default(formula = Y ~ Prof, data = data, drop.unused.levels = TRUE) :
invalid type (list) for variable 'Y'
I ran this code:
run_mod <- function(varname, data){
  data$Y <- data[,varname]
  model <- lm(Y ~ Prof, data)                         # Linear model
  seg_model <- segmented(model, seg.Z = ~ Prof)       # Segmented model
  breakpoint <- as.matrix(seg_model$psi.history[[5]]) # Extract breakpoint
  coefficients <- as.matrix(seg_model$coefficients)   # Extract coefficients
  summary_curve1 <- as.data.frame(rbind(breakpoint, coefficients))
  colnames(summary_curve1) <- varname
  return(summary_curve1)
}

lapply(names(DATApiv)[3:ncol(DATApiv)], function(Prof) run_mod(Prof, DATApiv))
NOTE: Prof is the column in my data frame that corresponds to the independent variable (like the x column in this example), and DATApiv is my data frame.
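One possible cause of this error (an assumption, since I cannot see DATApiv): if DATApiv is a tibble rather than a base data frame, data[, varname] returns a one-column tibble (a list) instead of a vector, which is exactly what the "invalid type (list) for variable 'Y'" message complains about. Extracting the column with [[ works for both:

# Inside run_mod: extract the column as a plain vector
data$Y <- data[[varname]]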
I have this data frame, to which I applied the multinom function:
df <- data.frame(x = c('a','a','b','b','c','c','d','d','d','e','e','f','f',
                       'f','f','g','g','g','h','h','h','h','i','i','j','j'),
                 y = c(1,2,1,3,1,2,1,4,5,1,2,2,3,4,5,1,1,2,1,2,2,3,2,2,3,4))
df$y = factor(df$y,ordered = TRUE)
nnet::multinom(y~x, data = df)
When checking the output, I see all the variables with their coefficients (meaning everything is fine):
Coefficients:
(Intercept) xb xc xd xe xf
2 -6.045294e-05 -31.83512 3.800915e-05 -36.67053 3.800915e-05 25.00515
3 -1.613311e+01 16.13310 -1.725649e+01 -21.06832 -1.725649e+01 41.13825
4 -1.692352e+01 -14.71119 -1.428100e+01 16.92351 -1.428100e+01 41.92865
5 -2.129358e+01 -10.49359 -1.002518e+01 21.29353 -1.002518e+01 46.29867
xg xh xi xj
2 -0.6931261 0.6932295 40.499799 -25.311410
3 -24.0387863 16.1332150 -8.876562 45.191730
4 -20.2673490 -16.0884760 -6.394423 45.982129
5 -15.1755064 -11.8589447 -4.563793 -6.953942
But my original data frame (I will share only the output), which is coded like the dependent and independent variables of the df data frame above (i.e., as ordinal factors), gives the following output when it comes to interpretation, even though all the analysis runs fine:
Coefficients:
(Intercept) FIES_R.L FIES_R.Q FIES_R.C FIES_R^4 FIES_R^5
2 -0.09594409 -1.303256 0.03325169 -0.1753022 -0.46026668 -0.282463422
3 -0.18587599 -1.469957 0.42005569 -0.2977628 0.00508412 0.003068678
4 -0.58189239 -2.875183 0.33128994 -0.6787992 0.11145099 0.239368520
5 -2.68727952 -10.178604 -5.12515249 -5.8454920 -3.13775961 -1.820629143
FIES_R^6 FIES_R^7 FIES_R^8
2 -0.2179067 -0.1000471 -0.1489342
3 0.1915476 -0.5483707 -0.2565626
4 0.2585801 0.3821566 -0.2679774
5 -0.5562958 -0.6335412 -0.7205215
I don't want FIES_R.L, FIES_R.Q, and FIES_R.C. I want them as FIES_R_1, FIES_R_2, FIES_R_3, FIES_R_4, FIES_R_5, FIES_R_6, FIES_R_7, FIES_R_8.
Why do I get such an output, given that both data frames include ordinal categorical variables, and both the x variable and the FIES variable include many categories? Thanks
I just figured it out: it is because the independent variable is an ordinal factor. FIES in my dataset is an ordinal factor, so R uses polynomial contrasts (.L, .Q, .C, ^4, ...) for it. When I used the argument ordered = FALSE, the problem was solved.
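For example, a minimal sketch of that fix (mydata and FIES_R are placeholder names for my actual data frame and predictor):

# Refit with an unordered factor so multinom uses treatment contrasts
# (one coefficient per level) instead of polynomial contrasts (.L, .Q, ...)
mydata$FIES_R <- factor(mydata$FIES_R, ordered = FALSE)
mod <- nnet::multinom(y ~ FIES_R, data = mydata)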
You can change the coefnames "by hand":
mod <- nnet::multinom(y ~ x, data = df)
# Keep the intercept, then rename the remaining coefficients by taking the
# first 6 characters of each name (e.g. "FIES_R") and appending "_1", "_2", ...
mod$vcoefnames <- c("(Intercept)",
                    paste0(substr(mod$vcoefnames[-1], 1, 6), "_",
                           seq_along(mod$vcoefnames[-1])))
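Note that vcoefnames only affects how the coefficients are labelled in printed output; the underlying fit is unchanged.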
In my working project, I use the rfe function from the caret package to do recursive feature elimination. I use a toy example to illustrate my point.
library(mlbench)
library(caret)
data(PimaIndiansDiabetes)
rfFuncs$summary <- twoClassSummary
control <- rfeControl(functions=rfFuncs, method="cv", number=10)
results <- rfe(PimaIndiansDiabetes[,1:8], PimaIndiansDiabetes[,9], sizes=c(1:8), rfeControl=control, metric="ROC")
The optimal variables selected are those that give the highest AUROC in the process, and they can be retrieved with results$optVariables.
However, what I want to do is use the '1 standard error rule' to select fewer features (code below). The number of variables identified is 4.
library(dplyr)
library(ggplot2)

# AUC that is 1 SE below the highest AUC
df.results = results$results %>% dplyr::mutate(ROCSE = ROCSD/sqrt(10-1))
idx = which.max(df.results$ROC)
ROC.1se = df.results$ROC[idx] - df.results$ROCSE[idx]

# Plot ROC vs feature size
g = ggplot(df.results, aes(x=Variables, y=ROC)) +
  geom_errorbar(aes(ymin=ROC-ROCSE, ymax=ROC+ROCSE),
                width=.2, alpha=0.4, linetype=1) +
  geom_line() +
  geom_point() +
  scale_color_brewer(palette="Paired") +
  geom_hline(yintercept = ROC.1se) +
  labs(x ="Number of Variables", y = "AUROC")
print(g)
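To get that subset size programmatically rather than reading it off the plot, the smallest number of variables whose mean ROC clears the 1-SE threshold can be computed from the same data frame (a small sketch using the objects above):

# Smallest subset size whose mean ROC is within 1 SE of the best
n.1se <- min(df.results$Variables[df.results$ROC >= ROC.1se])
n.1se  # 4 in the run shown above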
Now I need to know which four variables they are. I tried the following:
results$variables %>% filter(Variables==4) %>% distinct(var)
It shows me 5 variables!
Does anyone know how I can retrieve those variables? Basically, the same applies to retrieving the selected variables for any subset size.
Thanks a lot in advance!
One-line Answer
If you know you want only the best 4 variables from the rfe resampling, this will give you what you are looking for.
results$optVariables[1:4]
# [1] "glucose" "mass" "age" "pregnant"
dplyr Answer
results$variables %>%
  group_by(var) %>%
  summarize(Overall = mean(Overall)) %>%
  arrange(-Overall)
# A tibble: 8 x 2
# var Overall
# <chr> <dbl>
# 1 glucose 34.2
# 2 mass 15.8
# 3 age 12.7
# 4 pregnant 7.92
# 5 pedigree 5.09
# 6 insulin 4.87
# 7 triceps 3.25
# 8 pressure 1.95
Why your attempt gives more than 4 variables
You are filtering 40 observations: 10 folds of the best 4 variables. The best 4 variables are not always the same within each fold. Hence, to get the top 4 variables across the resamples, you need to average their performance across the folds, as the code above does. Even simpler, the variables within optVariables are sorted in this order, so you can just grab the first 4 (as in my one-line answer). Proving that this is the case takes a bit of digging into the source code (shown below).
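One quick way to see this is to count how often each variable appears among the size-4 subsets across the 10 folds:

# Each fold contributes its own "best 4", so more than 4 distinct
# variable names can show up in total
results$variables %>%
  filter(Variables == 4) %>%
  count(var, sort = TRUE)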
Details: Digging into the source code
A good first thing to do with objects returned from functions like rfe is to try functions like print, summary, or plot. Often custom methods will exist that will give you very helpful information. For example...
# Run rfe with a random seed
# library(dplyr)
# library(mlbench)
# library(caret)
# data(PimaIndiansDiabetes)
# rfFuncs$summary <- twoClassSummary
# control <- rfeControl(functions=rfFuncs, method="cv", number=10)
# set.seed(1)
# results <- rfe(PimaIndiansDiabetes[,1:8], PimaIndiansDiabetes[,9], sizes=c(1:8),
# rfeControl=control, metric="ROC")
#
# The next two lines are identical...
results
print(results)
# Recursive feature selection
#
# Outer resampling method: Cross-Validated (10 fold)
#
# Resampling performance over subset size:
#
# Variables ROC Sens Spec ROCSD SensSD SpecSD Selected
# 1 0.7250 0.870 0.4071 0.07300 0.07134 0.10322
# 2 0.7842 0.840 0.5677 0.04690 0.04989 0.05177
# 3 0.8004 0.824 0.5789 0.02823 0.04695 0.10456
# 4 0.8139 0.842 0.6269 0.03210 0.03458 0.05727
# 5 0.8164 0.844 0.5969 0.02850 0.02951 0.07288
# 6 0.8263 0.836 0.6078 0.03310 0.03978 0.07959
# 7 0.8314 0.844 0.5966 0.03075 0.04502 0.07232
# 8 0.8316 0.860 0.6081 0.02359 0.04522 0.07316 *
#
# The top 5 variables (out of 8):
# glucose, mass, age, pregnant, pedigree
Hmm, that gives 5 variables, but you said you wanted 4. We can quickly dig into the source code to see how it calculates and returns those 5 variables as the top 5.
print(caret:::print.rfe)
#
# Only a snippet code shown below...
# cat("The top ", min(top, x$bestSubset), " variables (out of ",
# x$bestSubset, "):\n ", paste(x$optVariables[1:min(top,
# x$bestSubset)], collapse = ", "), "\n\n", sep = "")
So, basically it is pulling the top 5 variables directly from results$optVariables. How is that getting populated?
# print(caret:::rfe.default)
#
# Snippet 1 of code...
# bestVar <- rfeControl$functions$selectVar(selectedVars,
#                                           bestSubset)
#
# Snippet 2 of code...
# bestSubset = bestSubset, fit = fit, optVariables = bestVar,
Ok, optVariables is getting populated by rfeControl$functions$selectVar.
print(rfeControl)
#
# Snippet of code...
# list(functions = if (is.null(functions)) caretFuncs else functions,
From above, we see that caretFuncs$selectVar is being used...
Details: Source code that is populating optVariables
print(caretFuncs$selectVar)
# function (y, size)
# {
# finalImp <- ddply(y[, c("Overall", "var")], .(var), function(x) mean(x$Overall,
# na.rm = TRUE))
# names(finalImp)[2] <- "Overall"
# finalImp <- finalImp[order(finalImp$Overall, decreasing = TRUE),
# ]
# as.character(finalImp$var[1:size])
# }
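Given that, you can reproduce the selection yourself by calling selectVar directly on the size-4 resampling rows (a sketch; it just averages Overall by variable and returns the first 4 names, which should typically, though not necessarily, match results$optVariables[1:4]):

# Mimic what rfe does internally: average Overall importance per variable
# within the size-4 rows and take the top 4
caretFuncs$selectVar(subset(results$variables, Variables == 4), 4)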
# Create a subset of the dataset
df <- subset(dat,select = c(id,obs,day_clos,posaff,er89,qol1))
### remove rows with missing values on a variable
df <- subset(df, !is.na(day_clos))
df <- subset(df, !is.na(er89))
df <- subset(df, !is.na(qol1))
df <- subset(df,!is.na(posaff))
any(is.na(df)) ## returns FALSE
Then my data looks like this:
id obs day_clos posaff er89 qol1
1 0 16966.61 2.000000 2.785714 3
1 1 16967.79 1.666667 2.785714 4
1 2 16968.82 1.666667 3.142857 3
1 3 16969.76 1.166667 3.071429 4
1 4 16970.95 2.083333 3.000000 4
1 5 16971.75 1.416667 2.857143 4
model.Y <- lm(qol1 ~ posaff,df)
summary(model.Y)
model.M <- lm(qol1 ~ er89, df)
summary(model.M)
#### There is no problem running the regression analyses, however:
results <- mediate(model.M, model.Y, treat="posaff", mediator="er89", boot=TRUE, sims=500)
This returns the error message:
Error in `[.data.frame`(m.data, , treat) : undefined columns selected
Does anyone know how to fix this?
Variables used in treat and mediator must be present in both models:
treat a character string indicating the name of the treatment variable used in the models.
The treatment can be either binary (integer or a two-valued factor) or continuous
(numeric).
mediator a character string indicating the name of the mediator variable used in the models
Source
A trivial working example:
library("mediation")
db<-data.frame(y=c(1,2,3,4,5,6,7,8,9),x1=c(9,8,7,6,5,4,3,2,1),x2=c(9,9,7,7,5,5,3,3,1),x3=c(1,1,1,1,1,1,1,1,1))
model.M <- lm(x2 ~ x1+x3,db)
model.Y <- lm(y ~ x1+x2+x3, db)
results <- mediate(model.M, model.Y, treat="x1", mediator="x2", boot=TRUE, sims=500)
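Calling summary(results) then prints the estimated ACME, ADE, total effect, and proportion mediated, along with their bootstrap confidence intervals:

# Inspect the mediation estimates
summary(results)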
I think I have what you suggested, but it is still giving the same error message.
model.mediator <- lmer(PercAccuracy ~ factor(Rep1) +
                         (factor(Rep1) | ParticipantPublicID),
                       data = data, REML = FALSE, control = control_params)
summary(model.mediator)

model.outcome <- lmer(Sharing ~ factor(Rep1) + PercAccuracy +
                        (factor(Rep1) + PercAccuracy | ParticipantPublicID),
                      data = data, REML = FALSE, control = control_params)
summary(model.outcome)

effectModel <- mediate(model.mediator, model.outcome,
                       treat = "Rep1", mediator = "PercAccuracy")
summary(effectModel)
Using R and estimating a simple equation by least squares that has the lagged dependent variable as an independent (explanatory, right-hand-side) variable, I want to forecast out of sample and use the dependent-variable forecasts in the out-of-sample period as the lag for each step ahead.
That is, I want to extend forecasts of y beyond the data period:
a <- lm( y ~ x + lag(y,1), data= dset1)
b <- forecast(a,newdata=dset2)
where dset2 has the full period of extra x variables, but not the lagged y.
Here is an example using the AirPassengers data set, where dset2 was created with some missing ap data. The results below show that only row 143 gets filled in, not 144, because forecast did not have the lagged value from row 143.
I looked at the dyn, dynlm, and forecast packages, but none seem to work with this type of model. (I do not want to restate it as an ARMA or a VAR.)
What package can easily do this, or am I using forecast incorrectly?
I can loop and step ahead one period at a time, but I would rather not do that.
## Example case using airline data
library(dynlm)
library(dyn)       # dyn loads zoo
library(forecast)

data("AirPassengers", package = "datasets")
ap <- log(AirPassengers)
ap <- as.ts(ap)
d1 <- data.frame(ap, index = as.Date(ap))
m1 <- lm(ap ~ lag(ap,1), data=d1)
m2 <- dynlm(ap ~ lag(ap,1), data=d1)
m3 <- dyn(lm(ap ~ lag(ap,1), data=d1))
summary(m3)
## Neither lm, dyn, nor dynlm objects worked as I wanted
## Try forecasting the missing values, 2 steps, rows 143 and 144
d2 <- d1
d2$apx = d2$ap
d2$apx[143:144]= NA
mx <- lm(apx ~ lag(apx,1), data=d2)
b <- forecast(mx,newdata=d2)
Results:
> b
Point Forecast Lo 80 Hi 80 Lo 95 Hi 95
1 NA NA NA NA NA
2 4.756850 4.619213 4.894488 4.545513 4.968188
3 4.807218 4.669783 4.944653 4.596191 5.018245
....
140 6.411559 6.273546 6.549572 6.199644 6.623474
141 6.386407 6.248507 6.524306 6.174667 6.598146
142 6.216154 6.078941 6.353368 6.005467 6.426841
143 6.122453 5.985553 6.259354 5.912247 6.332659
144 NA NA NA NA NA
Other lm-like objects produced errors for forecast:
mx <- dynlm(apx ~ lag(apx,1), data=d2)
b <- forecast(mx,newdata=d2)
Error in forecast.lm(mx, newdata = d2) : invalid type/length
(symbol/0) in vector allocation
mx <- dyn(lm(apx ~ lag(apx,1), data=d2))
b <- forecast(mx,newdata=d2)
Error in predict.lm(object, newdata = newdata, se.fit = TRUE, interval
= "prediction", : formal argument "se.fit" matched by multiple actual arguments
Assume I have fit a regression model with multiple predictor variables in R, as in the following toy example:
n <- 20
x <- rnorm(n)
y <- rnorm(n)
z <- x + y + rnorm(n)
m <- lm(z ~ x + y + I(y^2))
Now I have new data, consisting of x and y values, and I want to predict the corresponding z values:
x.new <- rnorm(5)
y.new <- rnorm(5)
Question: How should I best call predict to apply the fitted model to the new data?
Here are a few things I tried, which do not work:
Attempt 1. Trying to use the x.new and y.new as the columns of a new data frame:
> predict(m, data=data.frame(x=x.new, y=y.new))
1 2 3 4 5 6 7
-0.0157090 1.1667958 -1.3797101 0.1185750 0.7786496 1.7666232 -0.6692865
8 9 10 11 12 13 14
1.9720532 0.3514206 1.1677019 0.6441418 -2.3010431 -0.3228424 -0.2181511
15 16 17 18 19 20
-0.8883275 0.4549592 -1.0377040 0.1750522 -2.4542843 1.2250101
This gave 20 values instead of 5, so cannot be right.
Attempt 2: Maybe predict got confused because the y^2 values were not supplied? I tried to use model.frame to provide the data in the correct form.
> predict(m, model.frame(~ x.new + y.new + I(y.new^2)))
1 2 3 4 5 6 7
-0.0157090 1.1667958 -1.3797101 0.1185750 0.7786496 1.7666232 -0.6692865
8 9 10 11 12 13 14
1.9720532 0.3514206 1.1677019 0.6441418 -2.3010431 -0.3228424 -0.2181511
15 16 17 18 19 20
-0.8883275 0.4549592 -1.0377040 0.1750522 -2.4542843 1.2250101
Warning message:
'newdata' had 5 rows but variables found have 20 rows
Again, this results in 20 values (plus a warning), so cannot be right.
The parameter is newdata (not data) when telling predict what to predict for. Since predict.lm has no data argument, the misnamed argument in attempt 1 is silently ignored, and predict simply returns the fitted values for the original 20 observations. In attempt 2, the supplied model frame contains variables named x.new and y.new rather than x and y, so predict falls back to the original variables, hence the warning.
predict(m, newdata = data.frame(x = x.new, y = y.new))
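With newdata supplied correctly, predict returns one value per row of the new data frame (5 here); the I(y^2) term is computed automatically from the y column, so only x and y need to be provided.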