GLM combining results - r

I'm writing a Sweave file for a clearer presentation of glm() results. The GLMs are used to calculate insurance premium prices. Usually two separate GLMs are used for this: one for claim frequency and one for claim severity. To get the final price I have to multiply the coefficient estimates of the two models according to the categorization. If both models have the same independent variables with the same levels, the problem is trivial: I can just multiply the fitted values of both models and it's done. The problem arises when the factors have different levels, because levels are merged to get better results. Let's say I have the factor age, with 3 levels (0-25, 25-50, 50-110) for frequency and 2 levels (0-25, 25-110) for severity. I want to combine the fitted values to be multiplied in the following sense:
Frequency   Severity
0-25        0-25
25-50       25-110
50-110      25-110
In other words, the fitted values should be multiplied only when the categories cover the same range. This should also work for non-numeric categorizations, for instance:
Frequency   Severity
a           ab
b           ab
c           c
Is there any function/package in R that would allow me to do that? If not, what other ways exist?
Currently my only idea is to use custom labels for the factor levels and then use string comparisons between them.

The best way to do this is to create code for transforming your dataset in model-specific ways, and then call it before computing the predictions. This generalises easily to situations where your models involve different subsets of variables, or are of different forms completely. Since this is R and not SAS, you can do it all in one function.
predict_combined <- function(glm.cf, glm.cs, newdata)
{
    # recode the data the way the claim-frequency model expects it
    newdata.cf <- within(newdata, {
        age <- cut(age, c(0, 25, 50, 110))
        ...
        ...
    })
    # recode the data the way the claim-severity model expects it
    newdata.cs <- within(newdata, {
        age <- cut(age, c(0, 25, 110))
        ...
        ...
    })
    # predict on the response scale and combine
    pred.cf <- predict(glm.cf, newdata.cf, type="resp")
    pred.cs <- predict(glm.cs, newdata.cs, type="resp")
    pred.cf * pred.cs
}
This can be turned into a one-liner, but that would probably obfuscate more than it would elucidate.
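For the non-numeric categorizations mentioned in the question (levels a and b merged into ab for the severity model), the same pattern works with a level lookup instead of cut(). A minimal sketch, where grp is a hypothetical factor column:

newdata.cs <- within(newdata, {
    # map original levels to the merged levels used by the severity model
    grp <- factor(c(a = "ab", b = "ab", c = "c")[as.character(grp)])
})

forcats::fct_collapse(grp, ab = c("a", "b")) achieves the same merge if you prefer a ready-made helper.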

Related

Negative Binomial model offset seems to be creating a 2 level factor

I am trying to fit some data to a negative binomial model and run a pairwise comparison using emmeans. The data has two different sample sizes, 15 and 20 (num_sample in the example below).
I have set up two data frames: good.data, which produces the expected result of offset() using random sample sizes between 15 and 20, and bad.data, which uses a sample size of either 15 or 20 and seems to produce a factor with levels 15 and 20. The bad.data pairwise comparison produces far more comparisons than the good.data one, even though they should produce the same number.
set.seed(1)
library(dplyr)
library(emmeans)
library(MASS)

# make data that works
data.frame(site = c(rep("A", 24), rep("B", 24), rep("C", 24), rep("D", 24), rep("E", 24)),
           trt_time = rep(rep(c(10, 20, 30), 8), 5),
           pre_trt = rep(rep(c(rep("N", 3), rep("Y", 3)), 4), 5),
           storage_time = rep(c(rep(0, 6), rep(30, 6), rep(60, 6), rep(90, 6)), 5),
           num_sample = sample(c(15, 17, 20), 24 * 5, T),  # more than 2 sample sizes...
           bad = sample(c(1:7), 24 * 5, T, c(0.6, 0.1, 0.1, 0.05, 0.05, 0.05, 0.05))) -> good.data

# make data that doesn't work
data.frame(site = c(rep("A", 24), rep("B", 24), rep("C", 24), rep("D", 24), rep("E", 24)),
           trt_time = rep(rep(c(10, 20, 30), 8), 5),
           pre_trt = rep(rep(c(rep("N", 3), rep("Y", 3)), 4), 5),
           storage_time = rep(c(rep(0, 6), rep(30, 6), rep(60, 6), rep(90, 6)), 5),
           num_sample = sample(c(15, 20), 24 * 5, T),  # only 2 sample sizes...
           bad = sample(c(1:7), 24 * 5, T, c(0.6, 0.1, 0.1, 0.05, 0.05, 0.05, 0.05))) -> bad.data

# fit models
good.data %>%
  mutate(trt_time = factor(trt_time),
         pre_trt = factor(pre_trt),
         storage_time = factor(storage_time)) %>%
  MASS::glm.nb(bad ~ trt_time:pre_trt:storage_time + offset(log(num_sample)),
               data = .) -> mod.good

bad.data %>%
  mutate(trt_time = factor(trt_time),
         pre_trt = factor(pre_trt),
         storage_time = factor(storage_time)) %>%
  MASS::glm.nb(bad ~ trt_time:pre_trt:storage_time + offset(log(num_sample)),
               data = .) -> mod.bad

# pairwise comparisons
emmeans::emmeans(mod.good, pairwise ~ trt_time:pre_trt:storage_time + offset(log(num_sample)))$contrasts %>% as.data.frame()
emmeans::emmeans(mod.bad, pairwise ~ trt_time:pre_trt:storage_time + offset(log(num_sample)))$contrasts %>% as.data.frame()
First, I think you should look up how to use emmeans. The intent is not to duplicate the model formula, but rather to specify which factors you want the marginal means of.
However, that is not the issue here. What emmeans does first is to set up a reference grid consisting of all combinations of:
- the levels of each factor
- the average of each numeric predictor; except that if a numeric predictor has just two distinct values, both of its values are included.
It is that exception you have run up against. Since num_sample has just the two values 15 and 20, both values are kept separate rather than averaged. If you want them averaged, add cov.keep = 1 to the emmeans call. This has nothing to do with the offsets you specify in emmeans-related functions; it has to do with the fact that num_sample is a predictor in your model.
The reason for the exception is that a lot of people specify models with indicator variables (e.g., female having values of 1 if true and 0 if false) in place of factors. We generally want those treated like factors rather than numeric predictors.
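For completeness, a rough sketch of that cov.keep suggestion (cov.keep is documented in ref_grid() and passed through from emmeans()); with it, num_sample is averaged like any other covariate and the reference grid drops back to the 24 factor combinations:

pred.bad <- emmeans::emmeans(mod.bad, pairwise ~ trt_time:pre_trt:storage_time,
                             cov.keep = 1)
nrow(as.data.frame(pred.bad$contrasts))  # should now be choose(24, 2) = 276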
To be honest I'm not exactly sure what's going on with the expansion (276, the 'correct' number of contrasts, is choose(24, 2); the 'incorrect' number, 1128, is choose(48, 2)), but you should probably follow the guidance in the "offsets" section of one of the emmeans vignettes, which says:
If a model is fitted and its formula includes an offset() term, then by default, the offset is computed and included in the reference grid. ...
However, many users would like to ignore the offset for this kind of model, because then the estimates we obtain are rates per unit value of the (logged) offset. This may be accomplished by specifying an offset parameter in the call ...
The most natural choice is to set the offset to 0 (i.e. make predictions etc. for a sample size of 1), but in this case I don't think it matters.
get_contr <- function(x) as_tibble(x$contrasts)
cfun <- function(m) {
    emmeans::emmeans(m, pairwise ~ trt_time:pre_trt:storage_time, offset = 0) |>
        get_contr()
}
nrow(cfun(mod.good))  ## 276
nrow(cfun(mod.bad))   ## 276
From a statistical point of view I question the wisdom of looking at 276 pairwise comparisons, but that's a different issue ...

Factor scores from factor analysis on ordinal categorical data in R

I'm having trouble computing factor scores from an exploratory factor analysis on ordered categorical data. I've managed to assess how many factors to draw, and to run the factor analysis using the psych package, but can't figure out how to get factor scores for individual participants, and haven't found much help online. Here is where I'm stuck:
library(polycor)
library(nFactors)
library(psych)
# load data
dat <- read.csv("https://raw.githubusercontent.com/paulrconnor/datasets/master/data.csv")
# convert to ordered factors
for (i in 1:length(dat)) {
    dat[, i] <- as.factor(dat[, i])
}
# compute polychoric correlations
pc <- hetcor(dat, ML = T)
# choose number of factors
ev <- eigen(pc$correlations)
ap <- parallel(subject = nrow(dat), var = ncol(dat), rep = 100, cent = .05)
nS <- nScree(x = ev$values, aparallel = ap$eigen$qevpea)
dev.new(height = 4, width = 6, noRStudioGD = T)
plotnScree(nS)  # 2 factors, maybe 1
# run FA
faPC <- fa(r = pc$correlations, nfactors = 2, rotate = "varimax", fm = "ml")
faPC$loadings
Edit: I've found a way to get scores using irt.fa() and scoreIrt(), but it involved converting my ordered categories to numeric so I'm not sure it's valid. Any advice would be much appreciated!
x <- as.matrix(dat)
fairt <- irt.fa(x = x, nfactors = 2, correct = TRUE, plot = TRUE, n.obs = NULL,
                rotate = "varimax", fm = "ml", sort = FALSE)
for (i in 1:length(dat)) { dat[, i] <- as.numeric(dat[, i]) }
scoreIrt(stats = fairt, items = dat, cut = 0.2, mod = "logistic")
That's an interesting problem. Regular factor analysis assumes your input measures are interval or ratio scaled. In the case of ordinal variables you have a few options: you could use an IRT-based approach (in which case you'd be fitting something like the graded response model), or do as you do in your example and use the polychoric correlation matrix as the input to the factor analysis. You can see more discussion of this issue here.
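If you wanted to go the IRT route, a rough sketch using the mirt package (my suggestion, not part of the original question) might look like this, fitting an exploratory two-factor graded response model and extracting person scores:

library(mirt)
dat_num <- data.frame(lapply(dat, as.numeric))             # item responses as integer codes
grm_fit <- mirt(dat_num, model = 2, itemtype = "graded")   # exploratory 2-factor graded response model
head(fscores(grm_fit))                                     # factor scores, one row per participant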
Most factor analysis packages have a method for getting factor scores, but they will give you different output depending on what you choose to use as input. For example, normally you can just use factor.scores() to get your expected factor scores, but only if you input your original raw data. The problem here is that you need to use the polychoric matrix as the input to the factor analysis.
I'm not 100% sure (and someone please correct me if I'm wrong), but I think the following should be OK in your situation:
dat <- read.csv("https://raw.githubusercontent.com/paulrconnor/datasets/master/data.csv")
dat_orig <- dat
# convert to ordered factors
for (i in 1:length(dat)) {
    dat[, i] <- as.factor(dat[, i])
}
# compute polychoric correlations
pc <- hetcor(dat, ML = T)
# run FA
faPC <- fa(r = pc$correlations, nfactors = 2, rotate = "varimax", fm = "ml")
factor.scores(dat_orig, faPC)
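The returned object is the usual psych scores list, so the per-participant scores can be pulled out directly; a small usage sketch:

fs <- factor.scores(dat_orig, faPC)
head(fs$scores)  # one row per participant, one column per factor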
In essence what you're doing is:
1. calculate the polychoric correlation matrix;
2. use that matrix to conduct the factor analysis and extract 2 factors and the associated loadings;
3. use the loadings from the FA and the raw (numeric) data to get your factor scores.
Both this method and the method you use in your edit treat the original data as numeric rather than as factors. I think this should be OK, because you're just taking your raw data and projecting it onto the factors identified by the FA, and the loadings already take the ordinal nature of your variables into account (since you used the polychoric matrix as the input to the FA). The post linked above cautions against this approach, however, and suggests some alternatives; this is not a straightforward problem to solve.

Treatment of categorical variables in rpart

I wonder how rpart treats categorical variables. There are several references suggesting that for unordered factors it looks through all combinations. Actually, even the vignette, at the end of section 6.2, states:
(F)or a categorical predictor with m levels, all 2^(m-1) - 1 different possible splits are tested.
However, given my experience with the code, I find that difficult to believe. As supporting evidence, the vignette notes that running
rpart(Reliability ~ ., data=car90)
takes a really, really long time. In my case, however, it runs in seconds, despite having an unordered factor variable with 30 levels.
To demonstrate the issue further, I have created several variables with 52 levels, meaning that 2^51 - 1 ~ 2.2 × 10^15 splits would need to be checked if all possibilities were explored. This code runs in about a minute, IMHO proving that all combinations are not checked.
library(rpart)
NROW <- 50000
NVAR <- 20
rand_letters <- data.frame(replicate(NVAR, as.factor(c(
    letters[sample.int(26, floor(NROW / 2), replace = TRUE)],
    LETTERS[sample.int(26, ceiling(NROW / 2), replace = TRUE)]))))
rand_letters$target <- rbinom(n = NROW, size = 1, prob = 0.1)
system.time({
    tree_letter <- rpart(target ~ ., data = rand_letters, cp = 0.0003)
})
tree_letter
What combinations of categorical variables are ACTUALLY checked in rpart?
I know it is an old question but I found this link that might answer some of it.
Bottom line is that rpart seems to apply a simple algorithm:
- First, sort the conditional means, p_i = E(Y | X = x_i).
- Then compute the Gini indices based on the groups obtained from that ordering.
- Pick the two groups giving the best of these Gini indices (i.e. the largest decrease in impurity).
So it should not be nearly as computationally expensive; a small sketch of the idea follows below.
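To make that concrete, here is a toy sketch of the idea (my own illustration, not rpart's internals): order the factor levels by their mean response, then score only the m - 1 splits along that ordering by weighted Gini impurity.

best_categorical_split <- function(x, y) {
    p <- tapply(y, x, mean)                        # conditional means p_i = E(Y | X = x_i)
    lev <- names(sort(p))                          # levels ordered by those means
    gini <- function(v) 1 - sum((table(v) / length(v))^2)
    imp <- sapply(seq_len(length(lev) - 1), function(k) {
        left <- x %in% lev[1:k]                    # first k ordered levels vs. the rest
        mean(left) * gini(y[left]) + mean(!left) * gini(y[!left])
    })
    list(left_levels = lev[seq_len(which.min(imp))], impurity = min(imp))
}
# e.g. with one of the 52-level factors from the question:
# best_categorical_split(rand_letters$X1, rand_letters$target)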
However, I personally have a case with a single categorical variable whose categories are US states, and rpart runs for an extremely long time when trying to use it to produce a classification tree. Creating dummy variables and running rpart with the 51 variables (one for each state) works fine.

SVD with missing values in R

I am performing an SVD analysis with R, but I have a matrix with structural NA values. Is it possible to obtain an SVD decomposition in this case? Are there alternative solutions? Thanks in advance.
You might want to try the SVDmiss function in the SpatioTemporal package, which does missing value imputation and then computes the SVD of the imputed matrix. Check this link: SVDmiss Function.
However, you might want to be wary of the nature of your data and whether missing value imputation makes sense in your case.
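If you would rather not pull in a package, the simplest version of the imputation idea is a rough sketch like the one below (as I understand it, SVDmiss iterates a refinement of the same approach): fill the NAs, here with column means, and run svd() on the completed matrix.

X <- matrix(rnorm(100), 10, 10)
X[sample(length(X), 15)] <- NA                  # some missing values
col_means <- colMeans(X, na.rm = TRUE)
X_filled <- X
for (j in seq_len(ncol(X))) X_filled[is.na(X[, j]), j] <- col_means[j]
sv <- svd(X_filled)                             # SVD of the imputed matrix
str(sv[c("d", "u", "v")])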
I have tried using SVMs in R with NA values without success.
Sometimes missing values are important in the analysis, so I usually transform my data as follows:
- If you have lots of variables, try to reduce their number (clustering, lasso, etc.).
- Transform the remaining predictors like this:
  - for quantitative variables:
    - calculate deciles per predictor (leaving missing obs out)
    - calculate the frequency of Y per decile (assuming Y is qualitative)
    - regroup the deciles by their Y-frequency similarity into 2/3/4 groups
      (you can do this by looking at their plot too)
    - create a new binary variable for each group
      (e.g. X11 = 1 if X1 takes values in the interval ...)
    - calculate the Y frequency for the missing obs of that predictor
    - join the missing-obs category to the group with the closest Y frequency
  - for qualitative variables:
    - if you have variables with lots of levels, cluster them by the Y variable
    - for variables with fewer levels, you can calculate the Y frequency per class
    - regroup the classes as above
    - calculate the same thing for the missing obs and attach them to the most similar group of non-missing obs
    - recode the variable as in the numeric case
There, now you have a complete database of dummy variables and the chance to perform SVM, neural networks, LASSO, etc...
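A hypothetical sketch of the quantitative-variable part of this recipe, assuming a data frame df with a binary outcome Y and a numeric predictor X1 that contains NAs (all names are illustrative, and k-means stands in for the "regroup by similarity" step):

recode_numeric <- function(df, xname, yname, groups = 3) {
    x <- df[[xname]]
    y <- df[[yname]]
    brks <- unique(quantile(x, probs = seq(0, 1, 0.1), na.rm = TRUE))       # deciles, missing obs left out
    dec  <- cut(x, breaks = brks, include.lowest = TRUE)
    freq <- tapply(y, dec, mean)                                            # frequency of Y per decile
    grp  <- setNames(kmeans(freq, centers = groups)$cluster, names(freq))   # regroup similar deciles
    miss_grp <- grp[which.min(abs(freq - mean(y[is.na(x)])))]               # closest group for the missing obs
    out  <- ifelse(is.na(x), miss_grp, grp[as.character(dec)])
    model.matrix(~ factor(out) - 1)                                         # one binary variable per group
}
# usage (hypothetical data): X1_dummies <- recode_numeric(df, "X1", "Y")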

Using combinations of principal components in a regression model

I have a group of 51 variables to which I have applied principal component analysis and selected six factors based on the Kaiser-Guttman criterion. I'm using R for my analysis and did this with the following code:
prca.searchwords <- prcomp(searchwords.ts, scale=TRUE)
summary(prca.searchwords)
prca.searchwords$sdev^2
Next I would like to use these six extracted factors as explanatory variables in a dynamic linear regression model, in groups of one, two, three and four, and choose the regression model that explains most of the variation in the dependent variable. The six variables are prca.searchwords$x[,1], prca.searchwords$x[,2], prca.searchwords$x[,3], prca.searchwords$x[,4], prca.searchwords$x[,5] and prca.searchwords$x[,6].
Which I convert to time series before using in a regression:
prca.searchwords.1.ts <- ts(data=prca.searchwords$x[,1], freq=12, start=c(2004, 1))
prca.searchwords.2.ts <- ts(data=prca.searchwords$x[,2], freq=12, start=c(2004, 1))
I'm using the dynlm package in R for this (I chose to use dynamic regression because other regressions that I perform require lagged values of the independent variables).
For example with the first two factors it would look like this:
private.consumption.searchwords.dynlm <- dynlm(monthly.privateconsumption.ts ~ prca.searchwords.1.ts + prca.searchwords.2.ts)
summary(private.consumption.searchwords.dynlm)
The problem I'm facing is that I would like to do this for all possible combinations of one, two, three and four factors out of the six factors that I have chosen to use. This would mean six regressions for one-variable groups, 15 for two variables, 20 for three variables and 15 for four variables. I would like to do this as efficiently as possible, without having to type 56 different regressions manually.
I'm a relatively new R user and therefore I still struggle with these general coding tricks that will radically speed up my analysis. Could someone please point me into the right direction?
Thank you!
You could build all the formulas you are interested in running using string manipulation functions, then convert those to proper formulas and apply over the list of models you want to run. For example:
vars <- paste0("prca.searchwords.", 1:6, ".ts")
# all combinations of the six factors; use 1:4 instead of 1:6 to stop at groups of four
resp <- unlist(lapply(1:6, function(i) apply(combn(vars, i), 2, paste, collapse = " + ")))
result <- lapply(resp, function(r) {
    do.call("dynlm", list(as.formula(paste0("monthly.privateconsumption.ts ~ ", r))))
})
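Since the aim was to pick the model that explains the most variation, one way to rank the fits afterwards (dynlm objects inherit from lm, so summary() exposes adj.r.squared) is:

names(result) <- resp
head(sort(sapply(result, function(m) summary(m)$adj.r.squared), decreasing = TRUE))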
