I have 4 groups: sensitivity, category, extent, duration. Each of these 4 has 4 levels.
Sensitivity - A, B, C, D
Category - 1, 2, 3, 4
Extent - site, local, regional, pan-regional
Duration - short-term, medium-term, long-term, irreversible
I use a combination of these (there are 256 total) to get a severity classification.
There are 7 severity classes: S1, S2, S3, S4, S5, S6, N.
Using rpart, I get a tree whose first binary split is Duration=cd. The next split on the left side of the tree is Sensitivity=ab.
How do I know which duration levels c and d are?
And how do I know which sensitivity levels a and b are?
Thanks!
Does using print() give you what you're after? It prints a list of the nodes in the model and the split variable (with its levels) at each node.
library(rpart)  # provides rpart() and the car.test.frame example data
rpartmodel <- rpart(Price ~ Country + Mileage + Type, data = car.test.frame)
print(rpartmodel)
Alternatively try plotting your tree using the rpart.plot package?
library(rpart.plot)
prp(rpartmodel, faclen = 9)
I find this allows more customization of the plots, and it labels splits with the names of the factor levels; the faclen argument above controls the number of characters shown for factor levels in split labels.
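For context on the letters themselves: when the plotted tree abbreviates factor splits, the letters a, b, c, d index the factor's levels in the order returned by levels() (a = first level, b = second, and so on), while print() spells the level names out in full. A quick check, assuming your data frame is called mydata (a hypothetical name):
# The letters in a split label like "Duration=cd" index these levels in order
levels(mydata$Duration)     # a = 1st level, b = 2nd, c = 3rd, d = 4th
levels(mydata$Sensitivity)  # same idea for Sensitivity=ab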
I am trying to conduct a 5-fold cross validation on a generalized linear mixed model using the groupdata2 and cvms packages. This is the code I tried to run:
data <- groupdata2::fold(detect, k = 5,
                         cat_col = 'outcome',
                         id_col = 'bird') %>%
  arrange(.folds)

cvms::cross_validate(
  data,
  "outcome ~ sex + year + season + (1 | bird) + (1 | obsname)",
  family = "binomial",
  fold_cols = ".folds",
  control = NULL,
  REML = FALSE)
This is the error I receive:
Error in groupdata2::fold(detect, k = 4, cat_col = "outcome", id_col = "bird") %>% :
1 assertions failed:
* The value in 'data[[cat_col]]' must be constant within each ID.
In the package vignette, the following explanation is given: "A participant must always have the same diagnosis (‘a’ or ‘b’) throughout the dataset. Otherwise, the participant might be placed in multiple folds." This makes sense in the example. However, my data is based on the outcome of resighting birds, so outcome varies depending on whether the bird was observed on that particular survey. Is there a way around this?
Reproducible example:
bird <- c(1,1,1,1,1,2,2,2,2,2,3,3,3,3,3)
outcome <- c(0,1,1,1,0,0,0,1,0,1,0,1,0,0,1)
df <- data.frame(bird, outcome)
df$outcome <- as.factor(df$outcome)
df$bird <- as.factor(df$bird)
library(dplyr)  # for %>% and arrange()

data <- groupdata2::fold(df, k = 5,
                         cat_col = 'outcome',
                         id_col = 'bird') %>%
  arrange(.folds)
The full documentation says:
cat_col: Name of categorical variable to balance between folds.
E.g. when predicting a binary variable (a or b), we usually
want both classes represented in every fold.
N.B. If also passing an ‘id_col’, ‘cat_col’ should be
constant within each ID.
So in this case, where outcome varies within individual birds (the id_col), you simply can't specify that the folds be balanced with respect to the outcome. (I don't 100% understand this constraint in the software: it seems it should be possible to do at least approximate balancing by selecting groups (birds) with a balanced range of outcomes, but I can see how it could make the balancing procedure a lot harder.)
In my opinion, though, the importance of balancing outcomes is somewhat overrated in general. Lack of balance would mean that some of the simpler metrics in ?binomial_metrics (e.g. accuracy, sensitivity, specificity) are not very useful, but others (balanced accuracy, AUC, AIC) should be fine.
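For example, here is a hedged sketch of restricting the reported metrics via the metrics argument of cross_validate(); the metric names used below are assumptions on my part, so check ?cvms::binomial_metrics for the canonical spellings:
# Sketch: keep balanced accuracy and AUC in the output, drop plain accuracy
cvms::cross_validate(
  data,
  "outcome ~ sex + year + season + (1 | bird) + (1 | obsname)",
  family = "binomial",
  fold_cols = ".folds",
  metrics = list("Balanced Accuracy" = TRUE, "AUC" = TRUE, "Accuracy" = FALSE)
)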
A potentially greater problem is that you appear to have (potentially) crossed random effects (i.e. (1|bird) + (1|obsname)). I'm guessing obsname is the name of an observer: if some observers detected (or failed to detect) multiple birds, and some birds were detected (or missed) by multiple observers, then there may be no way to define folds that are truly independent, or at least it may be very difficult.
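As a rough check for such crossing (assuming your detect data frame contains the bird and obsname columns used in the model formula), you can cross-tabulate birds by observers:
# Crossing exists if some birds were recorded by multiple observers
# and some observers recorded multiple birds
crosstab <- table(detect$bird, detect$obsname)
sum(rowSums(crosstab > 0) > 1)  # birds seen by more than one observer
sum(colSums(crosstab > 0) > 1)  # observers who saw more than one bird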
You may be able to use the new collapse_groups() function in groupdata2 v2.0.0 instead of fold() for this. It allows you to take existing groups (e.g. bird) and collapse them into fewer groups (e.g. folds), with attempted balancing of multiple categorical columns, numeric columns, and factor columns (by number of unique levels, though the same levels might appear in multiple groups).
It does not have the constraint that fold() has with regard to changing outcomes, but on the other hand it does not come with the same "guarantees" in the "non-changing outcome" context; e.g. it does not guarantee at least one of each outcome level in every fold.
You need more birds than the number of folds though, so I've added a few to the test data:
bird <- c(1,1,1,1,1,2,2,2,2,2,3,3,3,3,3,4,4,
4,4,4,5,5,5,5,5,6,6,6,6,6,7,7,7,7)
outcome <- c(0,1,1,1,0,0,0,1,0,1,0,1,0,0,1,0,1,
0,1,1,0,1,1,0,0,1,1,0,0,1,0,0,1,1)
df <- data.frame(bird, outcome)
df$outcome <- as.factor(df$outcome)
df$bird <- as.factor(df$bird)
# Combine 'bird' groups into folds
data <- groupdata2::collapse_groups(
  data = df,
  n = 3,
  group_cols = "bird",
  cat_col = "outcome",
  col_name = ".folds"
) %>%
  arrange(.folds)

# Check the balance of the relevant columns
groupdata2::summarize_balances(
  data = data,
  group_cols = ".folds",
  cat_cols = "outcome"
)$Groups
#> # A tibble: 3 × 6
#>   .group_col .group `# rows` `# bird` `# outc_0` `# outc_1`
#>   <fct>      <fct>     <int>    <int>      <dbl>      <dbl>
#> 1 .folds     1            14        3          7          7
#> 2 .folds     2            10        2          6          4
#> 3 .folds     3            10        2          4          6
summarize_balances() shows us that we created 3 folds, with 14 rows in the first fold and 10 in each of the others. There are 3 unique bird levels in the first fold and 2 in each of the others (levels are normally only counted within each group, but here we know each bird appears in only one fold, since that is how collapse_groups() works with its group_cols argument).
The outcome counts (here # outc_0 and # outc_1) are reasonably well balanced across the folds.
With larger datasets, you might want to run multiple collapsings and choose the one with the best balance from the summary. That can be done by adding num_new_group_cols = 10 to collapse_groups() (for even better results, enable the auto_tune setting) and then listing all the created group columns when running summarize_balances().
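A hedged sketch of that idea, continuing from the test data above; I am assuming the generated columns are named ".folds_1", ".folds_2", ..., so check names(data) after running, as the exact naming may differ:
# Create 10 candidate fold columns with auto-tuned balancing
data <- groupdata2::collapse_groups(
  data = df,
  n = 3,
  group_cols = "bird",
  cat_col = "outcome",
  col_name = ".folds",
  num_new_group_cols = 10,
  auto_tune = TRUE
)
# Summarize the balance of all candidate columns and pick the best-balanced one
groupdata2::summarize_balances(
  data = data,
  group_cols = paste0(".folds_", 1:10),
  cat_cols = "outcome"
)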
Hope this helps you or others in a similar position. The constraint in fold() is hard to get around with its current internal approach, but collapse_groups hopefully does the trick in those cases.
See more at https://rdrr.io/cran/groupdata2/man/collapse_groups.html
I am trying to find a way to visualize a mixed effects model for a project I am working on, but am unsure how to do this when using multiple fixed and random effects.
The project I am working on is an attempt to estimate the helpfulness of online reviews, based on several different factors. A sample of the data looks like this:
Participant  Product Type  Star Rating  Anonymous  Product  Helpfulness
1            Exp           Extr         Yes        12       8
1            Search        Extr         Yes        6        6
1            Search        Mid          Yes        13       7
...
30           Exp           Mid          No         11       2
30           Exp           Mid          No         14       4
30           Search        Extr         No         9        5
The data is significantly longer than this (30 participants, who each saw roughly two dozen reviews, resulting in approx. 700 entries). Each participant sees a mix of products, product types, and star ratings, but all of the reviews they see will either be anonymous or not anonymous (no mix).
As a result, I tried to fit a maximal mixed model, with the following:
mixed(helpfulness ~ product_type * star_rating * anonymity
+ (product_type * star_rating | participant)
+ (star_rating * anonymity | product))
What I would like to do now is find a way of visually representing the data, likely color-coding the 8 different "groups" (essentially the unique combinations of the 3 binary independent variables: 2 product types × 2 star ratings × 2 anonymity conditions), to show how they relate to the helpfulness rating.
Try something like this:
library(ggplot2)
# make notional data
df <- data.frame(participant = seq(1, 30, 1),
                 product_type = sample(x = c('Exp', 'Search'), size = 30, replace = TRUE),
                 star_rating = sample(x = c('Extr', 'Mid'), size = 30, replace = TRUE),
                 anonymous = sample(x = c('Yes', 'No'), size = 30, replace = TRUE),
                 product = rnorm(n = 30, mean = 10, sd = 5),
                 helpfulness = rnorm(n = 30, mean = 5, sd = 3))
ggplot(df) +
  geom_col(aes(x = participant, y = helpfulness, fill = product, color = anonymous)) +
  facet_grid(product_type ~ star_rating)
This captures all six variables in your data. You can also work with alpha and other aesthetics to include more variables. Using alpha might be better if you do not want to use facet_grid(); if you map a variable to alpha, I would recommend facet_wrap() instead of facet_grid().
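For instance, a sketch of the alpha-based variant, using the same notional df (the particular aesthetic assignments are just one possible choice):
# Map star_rating to alpha instead of faceting on it, and facet_wrap the rest
ggplot(df) +
  geom_col(aes(x = participant, y = helpfulness, fill = product,
               color = anonymous, alpha = star_rating)) +
  facet_wrap(~ product_type)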
We are looking at pattern recognition data and creating different variables:
library(dplyr)   # filter(), group_by(), summarize()
library(ez)      # ezANOVA()
library(sciplot) # lineplot.CI()
unusualsubjects <- rtaverages$subject_id[rtaverages$count < 5] # list of subjects without enough data
rtaverages <- filter(rtaverages, !(subject_id %in% unusualsubjects)) # keep only data from acceptable subjects (! = not), writing the result back into the same data frame
# Another example of filtering subjects: let's say we only wanted to analyze subjects with accuracies over 95%
accurateSubjects <- averages$subject_id[averages$accuracy > .95] #returns all of the subject_ids for subjects meeting an accuracy criterion
length(accurateSubjects) # tells us how many accurate subjects there are
goodSubjectdata<-filter(data,subject_id %in% accurateSubjects) # make a new data frame that contains only the data from accurate subjects
Code to conduct the actual ANOVA of the response time results
model <- ezANOVA(data = Data, dv = rt, within = c(set_size, target_presence, task), wid = subject_id) # conduct a repeated-measures ANOVA: dv = dependent variable, within = the within-subject variables, wid = the variable that groups data by subject
model # show results of the ANOVA model
table1 <- tapply(X = Data$rt, INDEX = list(Data$task, Data$set_size), FUN = mean, trim = 0.1) # mean RT broken down by task and set size only - less broken down than the earlier tapply, which also included target_presence in the INDEX list
table1 # show the means so that one can begin to interpret the data. You'll break down rtaverages in different ways to get the different mean RTs that you need for your report
par(mar = c(4,4,4,0),mfrow=c(1, 2) ) # mfrow=c(1,2) creates two plots side by side
lineplot.CI(data=filter(rtaverages,task=="conjunctive"),x.factor=set_size,group=target_presence,x.cont=TRUE,response=rt,ylim=c(0,4000),x.leg=2,xlab="Conjunctive Set Size",ylab="RT") # produces a line graph with confidence intervals
lineplot.CI(data=filter(rtaverages,task=="disjunctive"),x.factor=set_size,group=target_presence,x.cont=TRUE,response=rt,ylim=c(0,4000),x.leg=2,xlab="Disjunctive Set Size",ylab="RT") # produces a line graph with confidence intervals
Currently attempting to put 3 lines onto one plot the following way:
# The next bit of code is to reproduce Treisman and Gelade's Figure 1, including best lines of fit
rtaverages$set_size_num<-sizes[rtaverages$set_size] # added a new column to rtaverage data frame which is the numeric/continuous version of the nominal/categorical set_size factor which will be useful for predicting RT from set_size
bySetSize<-group_by(rtaverages,set_size_num,task,target_presence) #collapse even more, so all subjects' data are combined together
collapsed<-summarize(bySetSize,rt=mean(rt,trim=0.1)) # make RT summary
collapsed # show what collapsed data look like. Note that there are now only 4 (set sizes) X 2 (tasks) X 2 (present/absent trials)=16 rows
cp<-filter(collapsed,task=="conjunctive" & target_presence=="present") # plot each of the four lines separated, filtering by the right type each time
cpf<-lm(data=cp,rt ~ set_size_num) # use a linear model to predict RT from set size. Use this to get out best fitting slope (estimate for set size) and intercept
summary(cpf) # make a summary of the linear regression model. cpf stands for: conjunctive, present fit
cp3<-filter(collapsed,task=="conjunctive" & target_presence=="absent")
caf<-lm(data=cp3,rt ~ set_size_num)
summary(caf)
cp1<-filter(collapsed,task=="disjunctive" & target_presence=="present")
dpf<-lm(data=cp1,rt ~ set_size_num)
summary(dpf)
cp2<-filter(collapsed,task=="disjunctive" & target_presence=="absent")
daf<-lm(data=cp2,rt ~ set_size_num)
summary(daf)
plot(cp$set_size_num,cp$rt,ylim=c(0,4000),xlim=c(0,30),pch=19,col="green",xlab="Set Size",ylab="Response Time (msec.)") # use a big enough range to capture all of the data
abline(cpf, col="green") # add the line with the slope and intercept derived from linear model
lines(cp3$set_size_num,cp3$rt,col="green")
abline(caf, col="green")
lines(cp1$set_size_num,cp1$rt,col="red")
abline(dpf, col="red")
lines(cp2$set_size_num,cp2$rt,col="red")
abline(daf, col="red")
legend(x=0,y=4000,pch=c(19,1,19,1),col=c("green","green","red","red"),cex=0.7,legend=c("Conjunctive present","Conjunctive absent","Disjunctive present","Disjunctive absent")) #Legend only should be plotted once, pch sets 4 symbols, and col sets 4 colors. cex < 1 so that legend box isn't too big
I got them combined onto one plot, but now the lines have lost their formatting.
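One way to keep the point and line formats consistent across the four conditions is to draw each condition's points with the pch/col combination used in the legend and then overlay only the fitted lines from abline(). A sketch, reusing the objects defined above (the line types for the "absent" conditions are my own addition):
plot(cp$set_size_num, cp$rt, ylim = c(0, 4000), xlim = c(0, 30),
     pch = 19, col = "green", xlab = "Set Size", ylab = "Response Time (msec.)")
points(cp3$set_size_num, cp3$rt, pch = 1, col = "green")   # conjunctive absent
points(cp1$set_size_num, cp1$rt, pch = 19, col = "red")    # disjunctive present
points(cp2$set_size_num, cp2$rt, pch = 1, col = "red")     # disjunctive absent
abline(cpf, col = "green")           # fitted lines from the lm() objects
abline(caf, col = "green", lty = 2)
abline(dpf, col = "red")
abline(daf, col = "red", lty = 2)
legend("topleft", pch = c(19, 1, 19, 1), lty = c(1, 2, 1, 2),
       col = c("green", "green", "red", "red"), cex = 0.7,
       legend = c("Conjunctive present", "Conjunctive absent",
                  "Disjunctive present", "Disjunctive absent"))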
the story:
I'm facing a problem in gam where only 2 input variables should be considered:
x = relative price (%) the customer paid for the product, given the entry price for the club
b = binary, whether the customer had to pay for the product (VIPs get it for free)
the output variable is
y = whether the customer took the product
and this simulates the data:
require(mgcv)
require(data.table)
set.seed(2017)
y <- sample(c(0, 1), 100, replace=T)
x <- rgamma(100, 3, 3)
b <- as.factor(ifelse(x<.5, 0, 1))
dat <- as.data.table(list(y=y, x=x, b=b))
dat[b=="0",x:=0]
plot(dat$x, dat$y, col=dat$b)
[plot: y against relative price x, colored by b]
as you can see in the plot, customers who didn't have to pay for the product have a relative price of 0%; the others have relative prices between 0.5% and 3.5%
here comes the problem:
I want to model a dummy effect for b and a smooth effect of x (of course only for those who have to pay), so I also use b as a by-variable in s(x):
mod <- bam(y~b+s(x, by=b), data=dat, family=binomial(link="logit"))
summary(mod)
par(mfrow=c(1,2))
plot(mod)
[plot: the estimated smooth effects]
my question is:
a. why can you still see a rug at 0% for the s(x) smooth with b=1? Wouldn't it make more sense if mgcv only considered those who have to pay? Does this have something to do with the knots?
b. as you can see in the summary, the dummy effect is estimated as NA. This might be because the information in b was entirely used up by the by-variable in s(x), so the dummy b itself has no more information to give? How can I overcome this problem? In other words: is there an option to model a smooth term only for a subset of the data and make mgcv actually use only that subset for fitting?
Your question is conceptually the same as How can I force dropping intercept or equivalent in this linear model?. You want to contrast b, rather than using all its levels.
In the GAM setting, you want:
dat$B <- as.numeric(dat$b) - 1
y ~ b + s(x, by = B)
For a factor 'by' smooth, mgcv does not apply contrasts to the 'by' variable if the factor is unordered. This is generally appealing, as we often want one smooth per factor level. It is thus your responsibility to use some trick to get what you want. What I did above is to coerce the two-level factor b to a numeric B, with the level you want to omit becoming numerically 0, and then use the numeric 'by' B. This idea cannot be extended to factors with more levels.
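Putting that together with the simulated data above (a sketch; mod2 is just an illustrative name):
dat$B <- as.numeric(dat$b) - 1
# The smooth is multiplied by B, so it contributes nothing for non-payers (B = 0)
mod2 <- bam(y ~ b + s(x, by = B), data = dat, family = binomial(link = "logit"))
summary(mod2)  # check whether the dummy effect for b is now estimated
plot(mod2)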
If your 'by' factor has more than 2 levels and you still want to enforce a contrast, you need to use an ordered factor. For example, you can do
dat$B <- ordered(dat$b)
y ~ b + s(x, by = B)
Read more on 'by' variables from ?gam.models.
I was wondering if there is any package for generating a D-efficient balanced design with R
I tried the AlgDesign package, but I did not manage to get a balanced design.
I am not sure if this is because the full factorial of the attributes and levels I am considering is small (32), or because balanced designs are outside the scope of AlgDesign.
I have 3 attributes: two attributes have four levels, and 1 has 2 levels
att 1: 4 levels
att 2: 4 levels
att 3: 2 levels
I use the following R code:
library(AlgDesign)
#-----------------------------
# define attributes and levels
#-----------------------------
desVarNames <- c("esource", "certified", "cost")
desLevels <- c(4, 2, 4)
n <- 6       # number of choice sets
desOpt <- 4  # number of options per choice set
set.seed(123456)
# generate full factorial
dat <- gen.factorial(desLevels, length(desLevels), varNames = desVarNames, center = TRUE)
destT <- optFederov(~., dat, nTrials = n * desOpt, criterion = "D")
destT
First, when you say "balanced" design, I suspect you mean that each factor level should appear the same number of times in the final design. (Actually, your variable dat above is a balanced design too, since each possible combination appears exactly once.)
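As a quick check of that, you can count how often each level appears in the design suggested by the question's code (a sketch using the destT object from above):
# Level frequencies per attribute in the selected trials
lapply(destT$design, table)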
The way optFederov works is by randomly selecting and replacing trials using Fedorov's exchange algorithm. As such, every time a trial is exchanged with another candidate trial, an initially balanced design will become unbalanced: if a trial "balances" a design, replacing it with any other trial will unbalance the design. If the algorithm were forced to keep a balanced design at each step, no replacement would be possible and it would get stuck.
Not only is a balanced design incompatible with the way Fedorov's exchange algorithm works, it is actually not desirable from a D-efficiency point of view.
For instance, if you have 4 factors with 2, 3, 5 and 7 levels respectively, the only way to have a balanced design is to include all 2*3*5*7 = 210 trials, whereas AlgDesign suggests only 19:
data <- gen.factorial(c(2, 3, 5, 7), factors = "all")
trials <- optFederov(data = data, center = FALSE, criterion = "D")
In short, "balanced" and "D-efficient" are largely antagonistic design goals.