Apply model coefficients to new data in R

I have two matrices, sub and macro_data. They contain the estimated coefficients of a model and the macro data, respectively:
> sub
  coeff   varname
1 -1.50       gdp
2 0.005 inflation
3  -2.4  constant
> macro_data
   gdp inflation
1 18.0     -0.17
2 15.8     -0.14
3 17.7     -0.15
I would like to apply the following formula: -1.5*gdp+0.005*inflation-2.4 in order to get the scores.
I have tried
for (i in 1:1){
  sub$coeff[i]*macro_data[,1] + sub$coeff[i+1]*macro_data[,sub$coeff[i+1]] + sub$coeff[i+2]
}
It works, but it is not the best solution because I would like something more general. Any ideas?

You can do a matrix multiplication:
cbind(macro_data, 1) %*% sub[, "coeff", drop=FALSE]
If your coefficients come from a fitted model object, the corresponding predict() method can normally take a newdata= argument to calculate estimates for new data.
For your example data this won't work because you only have data frames. This will do:
sub <- read.table(header=TRUE, text=
"coeff varname
-1.50 gdp
0.005 inflation
-2.4 constant ")
macro_data <- read.table(header=TRUE, text=
"gdp inflation
1 18.0 -0.17
2 15.8 -0.14
3 17.7 -0.15")
m <- cbind(macro_data, constant=1)
C <- sub$coeff
names(C) <- sub$varname
m$gdp*C["gdp"] + m$inflation*C["inflation"] + m$constant*C["constant"]
The last line can be shortened to:
as.matrix(m) %*% C[names(m)]
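As a quick sanity check (a sketch using the objects defined above), the one-liner and the named-coefficient version give the same scores, as long as the coefficient order in sub matches the column order of the design matrix:
scores1 <- cbind(as.matrix(macro_data), constant = 1) %*% sub$coeff  # coefficients in table order
scores2 <- as.matrix(m) %*% C[names(m)]                              # coefficients matched by name
all.equal(as.vector(scores1), as.vector(scores2))                    # TRUE
And if the coefficients come from an actual fitted model rather than a coefficient table, the predict() route mentioned above would look roughly like this (hypothetical fitted object and training data):
fit <- lm(y ~ gdp + inflation, data = training_data)  # hypothetical model and data
predict(fit, newdata = macro_data)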

Related

How to calculate the Intraclass correlation (ICC) in R?

I have a dataset that is in a long format with 200 variables, 94 subjects, and each subject has anywhere from 1 to 3 measurements for each variable.
Eg:
ID measurement var1 var2 . . .
 1           1    2    6
 1           2    3    8
 1           3    6   12
 2           1    3    9
 2           2    4    4
 2           3    5    3
 3           1    1   11
 3           2    1    4
 .           .    .    .
 .           .    .    .
 .           .    .    .
However, some variables have missing values for one of three measurements. It was suggested to me that before imputing missing values with the mean for the subject, I should use a repeated measures ANOVA or mixed model in order to confirm the repeatability of measurements.
The first thing I found to calculate the ICC was the ICC() function from the psych package. However, from what I can tell this requires that the data have one row per subject and one column per measurement, which would be further complicated by the fact that I have 200 variables I need to calculate the ICC for individually. I did go ahead and calculate the ICC for a single variable, and obtained this output:
Intraclass correlation coefficients
                         type  ICC   F df1 df2             p lower bound upper bound
Single_raters_absolute   ICC1 0.38 2.8  93 188 0.00000000067        0.27        0.49
Single_random_raters     ICC2 0.38 2.8  93 186 0.00000000068        0.27        0.49
Single_fixed_raters      ICC3 0.38 2.8  93 186 0.00000000068        0.27        0.49
Average_raters_absolute ICC1k 0.65 2.8  93 188 0.00000000067        0.53        0.74
Average_random_raters   ICC2k 0.65 2.8  93 186 0.00000000068        0.53        0.74
Average_fixed_raters    ICC3k 0.65 2.8  93 186 0.00000000068        0.53        0.74
Number of subjects = 94   Number of Judges = 3
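For reference, a minimal sketch of the reshaping ICC() expects for a single variable (one row per subject, one column per measurement), assuming the long data frame is called mydata with columns ID, measurement, and var1:
library(psych)
library(tidyr)
wide_var1 <- pivot_wider(mydata[, c("ID", "measurement", "var1")],
                         names_from = measurement, values_from = var1)
ICC(wide_var1[, -1])  # drop the ID column before computing the ICC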
Next, I tried to calculate the ICC using a mixed model, with this code:
library(nlme)
m1 <- lme(var1 ~ measurement, random=~1|ID, data=mydata, na.action=na.omit)
summary(m1)
The output looks like this:
Linear mixed-effects model fit by REML
Data: mydata
       AIC       BIC   logLik
 -1917.113 -1902.948 962.5564

Random effects:
 Formula: ~1 | ORIGINAL_ID
        (Intercept)    Residual
StdDev: 0.003568426 0.004550419

Fixed effects: var1 ~ measurement
                  Value    Std.Error  DF  t-value p-value
(Intercept) 0.003998953 0.0008388997 162 4.766902  0.0000
measurement 0.000473053 0.0003593452 162 1.316429  0.1899
 Correlation:
            (Intr)
measurement -0.83

Standardized Within-Group Residuals:
        Min          Q1         Med          Q3         Max
-3.35050264 -0.30417725 -0.03383329  0.25106803 12.15267443
Number of Observations: 257
Number of Groups: 94
Is this the correct model to use to assess ICC? It is not clear to me what the correlation (Intr) is measuring, and it is different from the ICC obtained using ICC().
This is my first time calculating and using intraclass correlation, so any help is appreciated!
Using a mock dataset...
set.seed(42)
n <- 6
dat <- data.frame(id = rep(1:n, 2),
                  group = as.factor(rep(LETTERS[1:2], n/2)),
                  V1 = rnorm(n),
                  V2 = runif(n*2, min=0, max=100),
                  V3 = runif(n*2, min=0, max=100),
                  V4 = runif(n*2, min=0, max=100),
                  V5 = runif(n*2, min=0, max=100))
Loading some libraries...
library(lme4)
library(purrr)
library(tidyr)
# Add list of variable names to the vector below...
var_list <- c("V1","V2","V3","V4","V5")
map_dfr() is from the purrr library. I use lme4::VarCorr() to get the variances at each level.
map_dfr(var_list,
        function(x){
          formula_mlm = as.formula(paste0(x, " ~ group + (1|id)"))
          model_fit = lmer(formula_mlm, data = dat)
          re_variances = VarCorr(model_fit, comp = "Variance") %>%
            data.frame() %>%
            dplyr::mutate(variable = x)
          return(re_variances)
        }) %>%
  dplyr::select(variable, grp, vcov) %>%
  pivot_wider(names_from = "grp", values_from = "vcov") %>%
  dplyr::mutate(icc = id / (id + Residual))
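To make the last step concrete, here is a minimal sketch (assuming dat and the libraries loaded above) of the same variance-ratio ICC computed by hand for a single variable:
fit_v2 <- lmer(V2 ~ group + (1 | id), data = dat)
vc <- as.data.frame(VarCorr(fit_v2))    # columns grp, var1, var2, vcov, sdcor
vc$vcov[vc$grp == "id"] / sum(vc$vcov)  # between-id variance / total variance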

How to treat negative values in lm(x~y) function in R?

When running my script I get the following error message: Error in lm.fit(x, y, offset = offset, singular.ok = singular.ok, ...) : 0 (non-NA) cases, and I'm guessing that is due to some negative values.
The script loops through a list of csv files; for a small selection of them the code works, but when I run it on all of them I get the error message. I checked the data: about 2% of the NDVI values are negative, and they are always -99999. I also have some soil moisture values which are 0.
I found the suggestion to add na.action=na.exclude to the lm() call:
model <- lm(NDVI ~ T + Prec + soilM, data = BeforeConf)
model <- lm(NDVI ~ T + Prec + soilM, data = BeforeConf, na.action=na.exclude)
But the same error still occurs. Do you have any other solution for this besides deleting the negative values from the data? Ideally I would either ignore these values in the linear regression (lm), or skip the whole csv file if it contains negative values.
Missing values in R should be coded as NA. You could use replace(),
replace(dat, dat == -99999, NA)
#       X1    X2    X3
# 1   1.37  1.30 -0.31
# 2     NA  2.29 -1.78
# 3   0.36 -1.39 -0.17
# 4   0.63 -0.28  1.21
# 5   0.40    NA  1.90
# 6  -0.11  0.64 -0.43
# 7   1.51 -0.28 -0.26
# 8  -0.09 -2.66 -1.76
# 9   2.02 -2.44    NA
# 10 -0.06  1.32 -0.64
which actually works directly inside the lm() call without changing the data:
lm(X1 ~ X2 + X3, replace(dat, dat == -99999, NA))$coefficients
# (Intercept) X2 X3
# 0.61499466 0.06062925 0.25979370
If there is more than one missing-value code, you could do e.g.:
replace(dat, array(unlist(dat) %in% c(-99999, -88888), dim(dat)), NA)
Data:
set.seed(42)
dat <- data.frame(matrix(round(rnorm(30), 2), 10, 3))
dat[2, 1] <- -99999
dat[5, 2] <- -99999
dat[9, 3] <- -99999
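If some files still trigger the "0 (non-NA) cases" error (e.g. because every row ends up NA after recoding), one way to skip those files instead of stopping the loop is a tryCatch() wrapper. A rough sketch, assuming a hypothetical folder of csv files that each contain the columns used above:
files <- list.files("data", pattern = "\\.csv$", full.names = TRUE)  # hypothetical folder
models <- lapply(files, function(f) {
  BeforeConf <- read.csv(f)
  BeforeConf <- replace(BeforeConf, BeforeConf == -99999, NA)  # recode missing values
  tryCatch(lm(NDVI ~ T + Prec + soilM, data = BeforeConf, na.action = na.exclude),
           error = function(e) NULL)  # NULL for files lm() cannot handle
})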

Time-series average of cross-sectional correlations

I have a panel dataset looking like this:
head(panel_data)
        date symbol close    rv rv_plus rv_minus  rskew rkurt Mkt.RF   SMB   HML
1 1999-11-19      a  25.4 19.3     6.76    12.6  -0.791  4.36  -0.11  0.35  -0.5
2 1999-11-22      a  26.8 10.1     6.44     3.69  0.675  5.38   0.02  0.22 -0.92
3 1999-11-23      a  25.2  8.97    2.56     6.41 -1.04   4.00  -1.29  0.08   0.3
4 1999-11-24      a  25.6  5.81    2.86     2.96 -0.505  5.45   0.87  0.08 -0.89
5 1999-11-26      a  25.6  2.78    1.53     1.25  0.617  5.60   0.23  0.92  -0.2
6 1999-11-29      a  26.1  5.07    2.76     2.30 -0.236  7.27  -0.6  0.570 -0.14
where the variable symbol depicts different stocks. I want to calculate the time-series average of the cross-sectional correlation between the variables rskew and rkurt. This means I need to compute the correlation between rskew and rkurt over all different stocks at each point in time and then calculate the time-series average afterwards.
I tried to do it with the rollapply function from the zoo package, but since the number of different stocks is not the same for all dates, I cannot simply define width as an integer. Here is what I tried for a sample width of 20:
panel_data <- panel_data %>%
  group_by(date) %>%
  mutate(cor_skew_kurt = rollapply(data = panel_data[7:8],
                                   width = 20,
                                   FUN = cor,
                                   align = "right",
                                   na.rm = TRUE,
                                   fill = NA)) %>%
  ungroup
Is there a way to do this without having to define a fixed width for each date group?
Or should I maybe use a different approach to do this?
[Edited] Can you try running the code below? I have recreated an example emulating your issue. If I understood your problem correctly, this code should at least put you on the path to the right solution, as it solves the issue of unequal time-window lengths.
###################
# Recreating an example dataset with unequal dates across stocks
set.seed(1)
date6 <- c('1999-11-19','1999-11-22','1999-11-23','1999-11-24','1999-11-26','1999-11-29')
date5 <- c('1999-11-19','1999-11-22','1999-11-23','1999-11-24','1999-11-26')
date4 <- c('1999-11-19','1999-11-22','1999-11-23','1999-11-24')
cor_skew_kurt <- rep(NaN, 21)
symbol <- c(rep('a',6), rep('b',5), rep('c',4), rep('d',6))
rskew <- rnorm(21, mean=1, sd=1)
rkurt <- rnorm(21, mean=5, sd=1)
panel_data <- cbind.data.frame(date = c(date6,date5,date4,date6), symbol = symbol,
                               rskew = rskew, rkurt = rkurt, cor_skew_kurt = cor_skew_kurt)
panel_data$date <- as.Date(panel_data$date, '%Y-%m-%d')
# Computing the cor_skew_kurt and filling the table <- ANSWER TO YOUR QUESTION
for (date in unique(panel_data$date)) {
  panel_data[panel_data$date == date, "cor_skew_kurt"] <-
    as.double(cor(panel_data[panel_data$date == date, 'rskew'],
                  panel_data[panel_data$date == date, 'rkurt']))
}
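To get the time-series average asked about in the question, one option (a small follow-up sketch using the panel_data built above) is to take each date's correlation once and then average:
daily_cor <- unique(panel_data[, c("date", "cor_skew_kurt")])  # one value per date
mean(daily_cor$cor_skew_kurt, na.rm = TRUE)                    # time-series average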

Why can't I use cv.glm on the output of bestglm?

I am trying to do best subset selection on the wine dataset, and then I want to get the test error rate using 10-fold CV. The code I used is:
cost1 <- function(good, pi=0) mean(abs(good-pi) > 0.5)
res.best.logistic <-
  bestglm(Xy = winedata,
          family = binomial,     # binomial family for logistic
          IC = "AIC",            # information criterion
          method = "exhaustive")
res.best.logistic$BestModels
best.cv.err <- cv.glm(winedata, res.best.logistic$BestModel, cost1, K=10)
However, this gives the error:
Error in UseMethod("family") : no applicable method for 'family' applied to an object of class "NULL"
I thought that $BestModel is the lm object that represents the best fit, which is what the manual also says. If that's the case, why can't I get the test error for it using 10-fold CV with cv.glm?
The dataset is the white wine dataset from https://archive.ics.uci.edu/ml/datasets/Wine+Quality; the packages used are boot (for cv.glm) and bestglm.
The data was processed as follows:
winedata <- read.delim("winequality-white.csv", sep = ';')
winedata$quality[winedata$quality< 7] <- "0" #recode
winedata$quality[winedata$quality>=7] <- "1" #recode
winedata$quality <- factor(winedata$quality)# Convert the column to a factor
names(winedata)[names(winedata) == "quality"] <- "good" #rename 'quality' to 'good'
The bestglm fit rearranges your data and names your response variable y; hence, if you pass it back into cv.glm, winedata does not have a column y and everything crashes after that.
It's always good to check the class first:
class(res.best.logistic$BestModel)
[1] "glm" "lm"
But if you look at the call of res.best.logistic$BestModel:
res.best.logistic$BestModel$call
glm(formula = y ~ ., family = family, data = Xi, weights = weights)
head(res.best.logistic$BestModel$model)
  y fixed.acidity volatile.acidity citric.acid residual.sugar chlorides
1 0           7.0             0.27        0.36           20.7     0.045
2 0           6.3             0.30        0.34            1.6     0.049
3 0           8.1             0.28        0.40            6.9     0.050
4 0           7.2             0.23        0.32            8.5     0.058
5 0           7.2             0.23        0.32            8.5     0.058
6 0           8.1             0.28        0.40            6.9     0.050
  free.sulfur.dioxide density   pH sulphates
1                  45  1.0010 3.00      0.45
2                  14  0.9940 3.30      0.49
3                  30  0.9951 3.26      0.44
4                  47  0.9956 3.19      0.40
5                  47  0.9956 3.19      0.40
6                  30  0.9951 3.26      0.44
You can substitute things in the call etc, but it's too much of a mess. Fitting is not costly, so make a fit on winedata and pass it to cv.glm:
best_var = apply(res.best.logistic$BestModels[,-ncol(winedata)],1,which)
# take the variable names for best model
best_var = names(best_var[[1]])
new_form = as.formula(paste("good ~", paste(best_var,collapse="+")))
fit = glm(new_form,winedata,family="binomial")
best.cv.err<- cv.glm(winedata,fit,cost1, K=10)
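The cross-validated error estimate (under the cost1 cost function defined above) is then available in the delta component returned by cv.glm:
best.cv.err$delta  # raw and adjusted 10-fold CV estimates of the misclassification cost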

Model multiple imputation with interaction terms

According to the documentation of the mice package, if we want to impute data when we're interested in interaction terms, we need to use passive imputation, which is done the following way:
library(mice)
nhanes2.ext <- cbind(nhanes2, bmi.chl = NA)
ini <- mice(nhanes2.ext, max = 0, print = FALSE)
meth <- ini$meth
meth["bmi.chl"] <- "~I((bmi-25)*(chl-200))"
pred <- ini$pred
pred[c("bmi", "chl"), "bmi.chl"] <- 0
imp <- mice(nhanes2.ext, meth = meth, pred = pred, seed = 51600, print = FALSE)
It is said that
Imputations created in this way preserve the interaction of bmi with chl
Here, a new variable called bmi.chl is created in the original dataset. The meth step tells how this variable needs to be imputed from the existing ones. The pred step says we don't want to predict bmi and chl from bmi.chl. But now, if we want to apply a model, how do we proceed? Is the product defined by "~I((bmi-25)*(chl-200))" just a way to control for the imputed values of the main effects, i.e. bmi and chl?
If the model we want to fit is glm(hyp~chl*bmi, family="binomial"), what is the correct way to specify this model from the imputed data? fit1 or fit2?
fit1 <- with(data=imp, glm(hyp~chl*bmi, family="binomial"))
summary(pool(fit1))
Or do we have to use somehow the imputed values of the new variable created, i.e. bmi.chl?
fit2 <- with(data=imp, glm(hyp~chl+bmi+bmi.chl, family="binomial"))
summary(pool(fit2))
With passive imputation, it does not matter if you use the passively imputed variable, or if you re-calculate the product term in your call to glm.
The reason that fit1 and fit2 yield different results in your example is that you are not just doing passive imputation for the product term.
Instead, you are transforming the two variables before multiplying (i.e., you calculate bmi-25 and chl-200). As a result, the passively imputed variable bmi.chl does not represent the product term bmi*chl but rather (bmi-25)*(chl-200).
If you just calculate the product term, then fit1 and fit2 yield the same results, as they should:
library(mice)
nhanes2.ext <- cbind(nhanes2, bmi.chl = NA)
ini <- mice(nhanes2.ext, max = 0, print = FALSE)
meth <- ini$meth
meth["bmi.chl"] <- "~I(bmi*chl)"
pred <- ini$pred
pred[c("bmi", "chl"), "bmi.chl"] <- 0
pred[c("hyp"), "bmi.chl"] <- 1
imp <- mice(nhanes2.ext, meth = meth, pred = pred, seed = 51600, print = FALSE)
fit1 <- with(data=imp, glm(hyp~chl*bmi, family="binomial"))
summary(pool(fit1))
# > round(summary(pool(fit1)),2)
#                est    se     t    df Pr(>|t|)   lo 95 hi 95 nmis  fmi lambda
# (Intercept) -23.94 38.03 -0.63 10.23     0.54 -108.43 60.54   NA 0.41   0.30
# chl           0.10  0.18  0.58  9.71     0.58   -0.30  0.51   10 0.43   0.32
# bmi           0.70  1.41  0.49 10.25     0.63   -2.44  3.83    9 0.41   0.30
# chl:bmi       0.00  0.01 -0.47  9.67     0.65   -0.02  0.01   NA 0.43   0.33
fit2 <- with(data=imp, glm(hyp~chl+bmi+bmi.chl, family="binomial"))
summary(pool(fit2))
# > round(summary(pool(fit2)),2)
#                est    se     t    df Pr(>|t|)   lo 95 hi 95 nmis  fmi lambda
# (Intercept) -23.94 38.03 -0.63 10.23     0.54 -108.43 60.54   NA 0.41   0.30
# chl           0.10  0.18  0.58  9.71     0.58   -0.30  0.51   10 0.43   0.32
# bmi           0.70  1.41  0.49 10.25     0.63   -2.44  3.83    9 0.41   0.30
# bmi.chl       0.00  0.01 -0.47  9.67     0.65   -0.02  0.01   25 0.43   0.33
This is not surprising because the ~I(bmi*chl) in mice and the bmi*chl in glm do the exact same thing. They merely calculate the product of the two variables.
Remark:
Note that I added a line saying that bmi.chl should be used as a predictor when imputing hyp. Without this step, passive imputation has no purpose because the imputation model would neglect the product term, thus being incongruent with the analysis model.
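A quick way to verify that the passive step really keeps bmi.chl equal to bmi*chl in every completed dataset (a sketch, assuming the imp object from the code above):
com <- complete(imp, "long")               # stack all imputed datasets
all.equal(com$bmi.chl, com$bmi * com$chl)  # should be TRUE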
