I am using the mob() function from the partykit package and I am running into problems when parsing the fitted model.
My main aim is to check approximately how large a sample size needs to be in order to detect the real
structure of a data generating process (DGP) when breaks are present.
The code below performs a Monte Carlo simulation of data with breaks and tries to identify whether the break was captured by the M-fluctuation test.
More specifically, I want to count the number of times, out of the total number of simulations (nreps), that the model actually captures the DGP, conditional on a fixed sample size (N), to get a feel for how much data I would need to capture the real DGP.
At the end of the code, you can see the list that I get out of my simulations. The problem is that I cannot recover the information displayed on the console.
Additionally, I have some doubts about how to count the "correctly identified models". For now, my idea is to count a model as positive if it has a break in the correct variable (z1) at the specified region (z1 > 0), with some tolerance in the break location; for example, a break at z1 > 0.1 or z1 > -0.4 would also count as positive. So, is there any simple way of counting the models that have the stated characteristics?
I hope my example is clear enough for you to help me out. Thank you a lot in advance.
Simulate model
Ingredients to generate the DGP.
library("partykit")
library(data.table) ## optional, but what I'll use to coerce the list into a DT
library(future.apply) ## for parallel stuff
plan(multisession) ## use all available cores
#sample size
N <- 300
#coefficients
betas <- list()
betas$b0 <- 1
betas$b1_up <- 2.4
betas$b1_down <- 2
betas$b2_up <- 2.4
betas$b2_down <- 2
#mob() ingredients
ols_formula <- y ~ x1 + x2 | z1 + z2
# the ""0 +"" ---> supress the 'double' interecept
ols <- function(y, x, start = NULL, weights = NULL, offset = NULL, ...) {lm(y ~ 0 + x)}
A function that generates the data and fits OLS using the mob() algorithm.
reg_simulation_mob <- function(...){
#data
data <- data.frame(
x1 = rnorm(N),
x2 = rnorm(N),
z1 = rnorm(N),
z2 = rnorm(N),
e = rnorm(N))
#dependent variable
data$y <- betas$b0 + with(data, ifelse(z1>0,
betas$b1_up * x1 + betas$b2_up * x2 ,
betas$b1_down * x1 + betas$b2_down * x2 )
+ e )
#Estimate mob()-OLS
ols_mob <- mob(ols_formula,
data = data,
fit = ols)
# return(ols$coefficients)
return(ols_mob)
}
Monte Carlo simulation (only 2 trials) of the setup described above.
# N repetitions
nreps <- 2
## Parallel version
results <- future_lapply(1:nreps, reg_simulation_mob, future.seed = 1234L)
Obtained result
As you can see below, in the first trial (results[[1]]) the model finds the correct break, but in the second (results[[2]]) it fails to find it.
> results
[[1]]
Model-based recursive partitioning (ols)
Model formula:
y ~ x1 + x2 | z1 + z2
Fitted party:
[1] root
| [2] z1 <= 0.00029: n = 140
| x(Intercept) xx1 xx2
| 0.9597894 1.7552122 2.1360788
| [3] z1 > 0.00029: n = 160
| x(Intercept) xx1 xx2
| 0.9371795 2.4745728 2.5087608
Number of inner nodes: 1
Number of terminal nodes: 2
Number of parameters per node: 3
Objective function: 422.2329
[[2]]
Model-based recursive partitioning (ols)
Model formula:
y ~ x1 + x2 | z1 + z2
Fitted party:
[1] root: n = 300
x(Intercept) xx1 xx2
1.015224 2.175625 2.200746
Number of inner nodes: 0
Number of terminal nodes: 1
Number of parameters per node: 3
Objective function: 422.3085
In the picture below, you can observe the structure of the list results, in which I cannot find the information displayed on the console (i.e., the number of nodes, parameters, threshold values, etc.).
First, I would recommend using the lmtree() function rather than vanilla mob(). The former is faster, comes with better defaults for printing and plotting, and has more options for predictions.
Second, I recommend that you consult the vignette("partykit", package = "partykit") which explains how party objects are constructed and which classes and methods are involved.
Third, to determine which variable (if any) was used for splitting in the root node it is probably of interest to extract the results from all parameter instability tests. There is a dedicated sctest() (structural change test) method for obtaining this:
library("strucchange")
sctest(results[[1]], node = 1)
## z1 z2
## statistic 4.072483e+01 6.1762164
## p.value 5.953672e-07 0.9153013
sctest(results[[2]], node = 1)
## z1 z2
## statistic 12.2810548 10.1944484
## p.value 0.2165527 0.4179998
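If all you need is the partitioning variable selected in the root node, you can extract the column with the smallest p-value from this matrix. A minimal sketch (the helper name is mine, not part of partykit or strucchange):
## name of the partitioning variable with the smallest p-value in the root
selected_split_var <- function(tree) {
  tst <- sctest(tree, node = 1)
  colnames(tst)[which.min(tst["p.value", ])]
}
selected_split_var(results[[1]])
## [1] "z1"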
The corresponding partysplit object for the $split (if any) in the root $node is probably easiest to extract manually:
results[[1]]$node$split
## $varid
## [1] 4
##
## $breaks
## [1] 0.0002853492
##
## $index
## NULL
##
## $right
## [1] TRUE
##
## $prob
## NULL
##
## $info
## NULL
##
## attr(,"class")
## [1] "partysplit"
results[[2]]$node$split
## NULL
The variable id pertains to the order of the variables in:
names(results[[1]]$data)
## [1] "y" "x1" "x2" "z1" "z2"
Finally, as for the question of what to evaluate: everything depends on identifying the correct variable for splitting. If this is done correctly, then the split point estimates converge quickly to the true values, and then the parameter estimates also converge. See, for example, our recent arXiv paper (https://arxiv.org/abs/1906.10179), which contains a larger simulation study and also provides replication code.
Therefore, I typically either evaluate the probability of selecting the correct variable in the first split, or I look at the RMSE of the estimated vs. true coefficients for each observation.
Update: Beyond the root node you can use nodeapply() to extract information from the various nodes. However, I do not recommend evaluating all splits, because it becomes increasingly difficult to determine which estimated split matches which of the true splits. Instead, we often assess how similar the fitted partition is to the true partition, e.g., using the adjusted Rand Index. Again, you can find an example of this in the arXiv paper mentioned above.
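For completeness, here is a small sketch of the nodeapply() idea, using the standard partykit accessors nodeids() and split_node():
## splitting variable id for every node (NULL for terminal nodes)
nodeapply(results[[1]], ids = nodeids(results[[1]]),
  FUN = function(node) split_node(node)$varid)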
This answer is based on the reference that professor @AchimZeileis provided in his article (https://arxiv.org/abs/1906.10179), and it addresses the second part of my original question: how to count the correctly specified models (trees)?
The short answer.
The article divides the problem into two different types of data generating process (DGP). In the first, the data has only one break in one variable (the "stump" case), and the authors count the number of correctly identified models based on how often the M-fluctuation test correctly identified the variable generating the break (just one real break in one variable, plus 9 noisy candidate splitting variables with no break). The second DGP is a model with two breaks (the "tree" case), and they used the Adjusted Rand Index (ARI) to assess the performance of the model as a metric of the similarity between the real tree and the predicted one.
The very long answer.
Let's break down the ARI for six illustrative trees that can be obtained at different sample sizes. The code used here draws heavily on the supplementary material of the article recommended by @AchimZeileis.
Data generating process: Tree structure
The real DGP has two breaks, as illustrated in the picture below. The first is generated by the variable z2 and the second by z1. In the code snippet below, delta is equal to 1. The threshold value for the first break (depending on z2) is 0.3, and the threshold value for the second break (depending on z1) is -0.3 (both values can be seen in the object xi = c(-0.3, 0.3)).
#function from https://arxiv.org/src/1906.10179v1/anc
dgp_tree <- function(nobs = 1000, delta = 1, xi = c(-0.3, 0.3),
sigma = 1, seed = 7,
changetype = "abrupt",
variation = "all",
beta0 = NULL, beta1 = NULL)
{
# check input values
if(variation != "all") stop("variation can only be set to 'all' in dgp_tree")
if(changetype != "abrupt") stop("changetype can only be abrupt in dgp_tree")
if(!is.null(beta0) | !is.null(beta1)) warning("values for beta0 or beta1 are ignored since variation='all' for dgp_tree")
set.seed(seed)
if(length(xi)==1){
xi1 <- xi2 <- xi
} else {
xi1 <- xi[1]
xi2 <- xi[2]
}
z1 <- runif(nobs,-1,1)
z2 <- runif(nobs,-1,1)
z3 <- rnorm(nobs, 0, 1)
z4 <- rnorm(nobs, 0, 1)
z5 <- rnorm(nobs, 0, 1)
z6 <- rnorm(nobs, 0, 1)
z7 <- rnorm(nobs, 0, 1)
z8 <- runif(nobs, -1, 1)
z9 <- runif(nobs, -1, 1)
z10 <- runif(nobs, -1, 1)
id <- numeric(length(z1))
x <- runif(nobs, min = -1, max = 1)
beta0 <- delta * (-1)^(z1<xi1) * 0^(z2<xi2)
beta1 <- delta * (-1)^(z2>=xi2)
id <- 1 + (z2>=xi2) + (z2>=xi2)*(z1>=xi1)
mu <- beta0 + beta1 * x
y <- rnorm(nobs, mu, sigma)
d <- data.frame(y = y, x = x,
z1 = z1, z2 = z2, z3 = z3, z4 = z4, z5 = z5, z6 = z6, z7 = z7, z8 = z8, z9 = z9, z10 = z10,
beta0 = beta0, beta1 = beta1, mu = mu, sigma = rep.int(sigma, times = length(y)), id = id)
return(d)
}
The Adjusted Rand Index
Among the functions included in the article, there is one to compute the ARI; it is listed below and used in the following examples. It follows the standard ARI notation almost letter for letter.
# function to compute adjusted Rand Index from https://arxiv.org/src/1906.10179v1/anc
adj_rand_index <- function(x, y) {
tab <- table(x, y)
a <- rowSums(tab)
b <- colSums(tab)
M <- sum(choose(tab, 2))
N <- choose(length(x), 2)
A <- sum(choose(a, 2))
B <- sum(choose(b, 2))
c(ARI = (M - (A * B) / N) / (0.5 * (A + B) - (A * B) / N))
}
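As a quick sanity check (a toy example of my own, not from the paper): identical partitions yield an ARI of 1, and relabeling the clusters does not change the index.
## same partition, different labels: ARI is still 1
adj_rand_index(c(1, 1, 2, 2), c(2, 2, 1, 1))
## ARI
##   1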
Simulating the data
library(partykit)
library(future.apply) ## for parallel stuff
plan(multisession) ## use all available cores
ols_formula <- y ~ x | z1 + z2 + z3 + z4 + z5 + z6 + z7 + z8 + z9 + z10
ols <- function(y, x, start = NULL, weights = NULL, offset = NULL, ...) {lm(y ~ 0 + x)}
sim_ari <- function(n){
tree_data <- dgp_tree(nobs = n)
ols_mob <- mob(ols_formula,
data = tree_data,
fit = ols)
prednode <- predict(ols_mob ,
type = "node")
cross_table <- table(prednode,tree_data$id)
ari <- adj_rand_index(prednode,
tree_data$id)
print(n)
print(ari)
return(
list(
ols_mob = ols_mob,
cross_table = cross_table,
ari=ari,
data = tree_data)
)
}
n_levels <- c(55, ## no break
87, ## only one break
123, ## Correct structure, but poor performance
199, ## Nested break in second leaf
667, ## Additional break in first leaf
5000 ## Perfect model
)
ari <- future_lapply(n_levels, sim_ari, future.seed = 1234L)
A Tale of Six Trees.
The following six cases are analyzed in terms of how accurately the ARI captures the degree of similarity between the correct and the estimated tree. The key to comparing the trees is id, which records the leaf to which each observation should belong. For example, if an observation has an id value of 1, it meets the requirements assigned to node number 2 in the figure above. If id is equal to 2, the observation should be assigned to node number 4 in the same picture. Finally, if id is equal to 3, it is assigned to node number 5. You can check this reasoning in the line id <- 1 + (z2>=xi2) + (z2>=xi2)*(z1>=xi1).
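If you want to verify that mapping yourself, here is a small check of my own (using the default xi = c(-0.3, 0.3)):
## reconstruct the leaf membership from the splitting rules and compare to id
d <- dgp_tree(nobs = 1000)
leaf <- with(d, ifelse(z2 < 0.3, 1, ifelse(z1 < -0.3, 2, 3)))
all(leaf == d$id)  ## TRUE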
The First Tree (n=55): No breaks
The first case corresponds to a tree in which no breaks are identified. In this case, the ARI is equal to 0.
##### First Tree (n=55): No break ####
ari[[1]][[1]]
## Fitted party:
## [1] root: n = 55
## x(Intercept) xx
## -0.01309586 0.39291089
##
## Number of inner nodes: 0
## Number of terminal nodes: 1
## Number of parameters per node: 2
## Objective function: 95.58631
Here it is interesting to note that all the observations are assigned to the root node. Therefore, when we cross-tabulate the predicted nodes prednode_1 against id, we see that all possible id values fall into the root node [1] of the predicted tree (basically, because there is no other option). Using the function adj_rand_index(), you can check that this leads to an ARI equal to 0.
#data first tree (n=55)
data_1 <- ari[[1]][[4]]
#predicted node first iteration
data_1$prednode_1 <- predict(ari[[1]][[1]], type = "node")
#Cross table
with(data_1, table(prednode_1 ,id))
## id
## prednode_1 1 2 3
## 1 37 7 11
#adj_rand_index
ari[[1]][[3]]
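# ARI
# 0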
The Second Tree (n=87): Just one break identified
This case is interesting because the model partially identifies the tree's structure (i.e., the break on z1 is missing).
##### Second Tree (n=87): Only one break identified ####
ari[[2]][[1]]
# Fitted party:
# [1] root
# | [2] z2 <= 0.29288: n = 57
# | x(Intercept) xx
# | 0.133293 1.082701
# | [3] z2 > 0.29288: n = 30
# | x(Intercept) xx
# | 0.2598309 -1.8014133
#
# Number of inner nodes: 1
# Number of terminal nodes: 2
# Number of parameters per node: 2
# Objective function: 122.0116
Additionally, cross-tabulating the predicted and the real nodes shows that some observations are placed correctly even in this imperfect tree: 57 observations are correctly assigned to the first node and 9 to the second branch, while the remaining 21 (the 1 id-1 and 20 id-3 observations in node [3]) are misassigned because the third leaf was not identified at all. This leads to an ARI of 0.8577366, a huge improvement over the first tree.
#data second iteration (n=87)
data_2 <- ari[[2]][[4]]
#predicted node second iteration
data_2$prednode_2 <- predict(ari[[2]][[1]], type = "node")
#Cross table
with(data_2, table(prednode_2 ,id))
# id
# prednode_2 1 2 3
# 2 57 0 0
# 3 1 9 20
#adj_rand_index
ari[[2]][[3]]
# > ari[[2]][[3]]
# ARI
# 0.8577366
The Third Tree (n=123): Correct structure but poor performance
This case is interesting because it does recover the real tree structure, yet it has a lower ARI than the previous tree, which only partially identified the structure.
##### Third Tree (n=123): Correct structure but poor performance ####
ari[[3]][[1]]
# Fitted party:
# [1] root
# | [2] z2 <= 0.07319: n = 60
# | x(Intercept) xx
# | -0.1723388 1.1071878
# | [3] z2 > 0.07319
# | | [4] z1 <= -0.35485: n = 22
# | | x(Intercept) xx
# | | -0.7166565 -0.6791717
# | | [5] z1 > -0.35485: n = 41
# | | x(Intercept) xx
# | | 0.7096033 -0.8605967
#
# Number of inner nodes: 2
# Number of terminal nodes: 3
# Number of parameters per node: 2
# Objective function: 156.4397
Cross-tabulating the predicted and real nodes below, we see that 16 (10 + 6) observations are incorrectly classified, which leads to an ARI of 0.6117612.
#data third iteration (n=123)
data_3 <- ari[[3]][[4]]
#predicted node third iteration
data_3$prednode_3 <- predict(ari[[3]][[1]], type = "node")
#Cross table
with(data_3, table(prednode_3 ,id))
# id
# prednode_3 1 2 3
# 2 60 0 0
# 4 6 16 0
# 5 10 0 31
#adj_rand_index
ari[[3]][[3]]
# > ari[[3]][[3]]
# ARI
# 0.6117612
The Fourth Tree (n=199): Extra leaf at node [5]
The tree identified here deviates from the original because it has an extra split under node [5], which does not exist in the real data.
##### Fourth Tree (n=199): Extra leaf at node[5] ####
ari[[4]][[1]]
# Fitted party:
# [1] root
# | [2] z2 <= -0.19806: n = 79
# | x(Intercept) xx
# | 0.06455217 1.51512672
# | [3] z2 > -0.19806
# | | [4] z1 <= -0.27127: n = 44
# | | x(Intercept) xx
# | | -0.4863122 -0.3860951
# | | [5] z1 > -0.27127
# | | | [6] z2 <= 0.17481: n = 23
# | | | x(Intercept) xx
# | | | -0.1335096 0.2046050
# | | | [7] z2 > 0.17481: n = 53
# | | | x(Intercept) xx
# | | | 1.0868488 -0.0290925
#
# Number of inner nodes: 3
# Number of terminal nodes: 4
# Number of parameters per node: 2
# Objective function: 282.6727
Here the cross-table of real and predicted nodes is interesting because nodes [6] and [7] do not exist in the real data, yet they receive observations that, for example, belong to group id 1 (23 and 7 observations, respectively). This misallocation drives the ARI down to 0.4649789.
#data fourth iteration (n=199)
data_4 <- ari[[4]][[4]]
#predicted node fourth iteration
data_4$prednode_4 <- predict(ari[[4]][[1]], type = "node")
#Cross table
with(data_4, table(prednode_4 ,id))
# id
# prednode_4 1 2 3
# 2 79 0 0
# 4 16 27 1
# 6 23 0 0
# 7 7 0 46
#adj_rand_index
ari[[4]][[3]]
# ARI
# 0.4649789
The Fifth Tree (n=667): Extra leaf at node [2]
This is another example of a tree with an incorrect structure, where an extra leaf (based on a split on z5, which is incorrect!) hangs from node [2].
##### Fifth Tree (n=667): Extra leaf at node[2] ####
ari[[5]][[1]]
# Fitted party:
# [1] root
# | [2] z2 <= 0.28476
# | | [3] z5 <= 0.76285: n = 322
# | | x(Intercept) xx
# | | -0.1322881 0.9535337
# | | [4] z5 > 0.76285: n = 96
# | | x(Intercept) xx
# | | 0.1686863 1.3878776
# | [5] z2 > 0.28476
# | | [6] z1 <= -0.32001: n = 89
# | | x(Intercept) xx
# | | -0.9139858 -0.7957158
# | | [7] z1 > -0.32001: n = 160
# | | x(Intercept) xx
# | | 0.7661154 -0.8656553
#
# Number of inner nodes: 3
# Number of terminal nodes: 4
# Number of parameters per node: 2
# Objective function: 927.9088
The cross-table of predicted and correct nodes shows that most of the observations (322) that in reality belong to the first group (id 1) were assigned to predicted node [3]. This flawed structure leads to an ARI of 0.6932132.
#data fifth iteration (n=667)
data_5 <- ari[[5]][[4]]
#predicted node fifth iteration
data_5$prednode_5 <- predict(ari[[5]][[1]], type = "node")
#Cross table
with(data_5, table(prednode_5 ,id))
# id
# prednode_5 1 2 3
# 3 322 0 0
# 4 96 0 0
# 6 0 89 0
# 7 3 3 154
#adj_rand_index
ari[[5]][[3]]
# ARI
# 0.6932132
The Sixth Tree (n=5000): The Golden Tree!
This final tree recovers the DGP perfectly, both in the tree structure and in the allocation of the observations to the leaves.
##### Sixth Tree (n=5000): Perfect model ####
ari[[6]][[1]]
# Fitted party:
# [1] root
# | [2] z2 <= 0.29971: n = 3187
# | x(Intercept) xx
# | -0.008719923 1.022232280
# | [3] z2 > 0.29971
# | | [4] z1 <= -0.30286: n = 609
# | | x(Intercept) xx
# | | -0.9488846 -0.9813765
# | | [5] z1 > -0.30286: n = 1204
# | | x(Intercept) xx
# | | 1.0281410 -0.9565637
#
# Number of inner nodes: 2
# Number of terminal nodes: 3
# Number of parameters per node: 2
# Objective function: 6992.848
Here the cross-table of predicted and real nodes shows that every observation is allocated exactly where it belongs, leading to an ARI equal to 1.
#data sixth iteration (n=5000)
data_6 <- ari[[6]][[4]]
#predicted node sixth iteration
data_6$prednode_6 <- predict(ari[[6]][[1]], type = "node")
#Cross table
with(data_6, table(prednode_6 ,id))
# id
# prednode_6 1 2 3
# 2 3187 0 0
# 4 0 609 0
# 5 0 0 1204
#adj_rand_index
ari[[6]][[3]]
# ARI
# 1
The conclusions.
Some important takeaways can be drawn from the illustration above.
1. The ARI is useful for assessing the degree of similarity between a predicted tree and the real tree of the data generating process, even when their structures differ substantially.
2. Recovering the tree's correct structure does not by itself lead to an ARI equal to one.
3. Incorrect trees will not necessarily have an ARI equal to zero.
A Final Simulation.
To conclude, here is a small simulation to see how the ARI index behaves when the sample size increases.
### Final simulation
n_levels <- seq(from = 10, to = 2000, by = 5)
ari_sim <- lapply(n_levels, sim_ari)
ari_models<- function(i){
ari <- ari_sim[[i]]$ari
n <- nobs(ari_sim[[i]]$ols_mob)
return(
list(ari = ari , n= n )
)
}
ari_n_list <- lapply(1:length(ari_sim), ari_models)
df <- data.frame(matrix(unlist(ari_n_list), nrow=length(ari_n_list), byrow=T))
colnames(df) <- c("ARI" , "N")
library(ggplot2)
ggplot(df, aes(N)) +
geom_line(aes(y = ARI, colour = "ARI"))
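The raw curve is quite noisy with a single run per sample size; a smoother (my own addition, not part of the original code) makes the trend easier to see:
## same data, with a smoother over the simulation noise
ggplot(df, aes(N, ARI)) +
  geom_line(colour = "grey60") +
  geom_smooth(se = FALSE)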
Suppose I have a data frame of 101 variables. I select one, called Y, as the dependent variable, and the remaining 100, called X_1, X_2, ..., X_100, as independent ones.
Now I would like to create a matrix containing the 100 independent variables. Is there a direct way to do it, similar to how a linear regression model lets me use "." as a shorthand for all remaining variables, i.e. lm(Y ~ ., _____)?
You can use the grep() function to extract the column names of the data frame associated with the independent variables, and then transform that subset into a matrix. Please see the code below:
# simulation of the data frame with 100 measurements and 101 variables
n <- 100
df <- data.frame(matrix(1:(101 * n), ncol = 101))
names(df) <- c(paste0("X_", 1:100), "Y")
# extract matrix of Xs
m_x <- as.matrix(df[, grep("^X", names(df))])
dimnames(m_x)
Output:
[[1]]
NULL
[[2]]
[1] "X_1" "X_2" "X_3" "X_4" "X_5" "X_6" "X_7" "X_8" "X_9" "X_10" "X_11" "X_12" "X_13" "X_14" "X_15"
[16] "X_16" "X_17" "X_18" "X_19" "X_20" "X_21" "X_22" "X_23" "X_24" "X_25" "X_26" "X_27" "X_28" "X_29" "X_30"
[31] "X_31" "X_32" "X_33" "X_34" "X_35" "X_36" "X_37" "X_38" "X_39" "X_40" "X_41" "X_42" "X_43" "X_44" "X_45"
[46] "X_46" "X_47" "X_48" "X_49" "X_50" "X_51" "X_52" "X_53" "X_54" "X_55" "X_56" "X_57" "X_58" "X_59" "X_60"
[61] "X_61" "X_62" "X_63" "X_64" "X_65" "X_66" "X_67" "X_68" "X_69" "X_70" "X_71" "X_72" "X_73" "X_74" "X_75"
[76] "X_76" "X_77" "X_78" "X_79" "X_80" "X_81" "X_82" "X_83" "X_84" "X_85" "X_86" "X_87" "X_88" "X_89" "X_90"
[91] "X_91" "X_92" "X_93" "X_94" "X_95" "X_96" "X_97" "X_98" "X_99" "X_100"
Assume we have the following model:
set.seed(1)
d <- data.frame(a = gl(4, 1, 64), a4 = sample(4, 64, TRUE),
x = rnorm(64), y = rnorm(64))
l <- lm(y ~ a4 + a * x, d)
For the interaction x:a I will get 3 coefficients: x:a2, x:a3, x:a4. I now want to determine which coefficients are the main effects corresponding to this interaction, which would be x, a2, a3, and a4.
My idea was to use strsplit on the interactions and to retrieve the corresponding main effects:
(atoms <- strsplit(names(coef(l))[7:9], ":"))
# [[1]]
# [1] "a2" "x"
# [[2]]
# [1] "a3" "x"
# [[3]]
# [1] "a4" "x"
So far so good. But now I would like to get the value of the corresponding main effects. While this is straightforward for x, a2, and a3 (as these are unique names), I struggle to see how I can do that for a4:
lapply(atoms, function(.) coef(l)[.])
# [[1]]
# a2 x
# 0.3630732 0.2136751
# [[2]]
# a3 x
# 0.04153299 0.21367510
# [[3]]
# a4 x
# 0.04765737 0.21367510
The result for a4 is wrong, because it is the main effect associated with the variable a4 and not with the dummy coding of level 4 of factor a.
So the model shown is a valid model in R, yet the coefficient names are ambiguous. Is there any other way to make a correct mapping between the coefficients of an interaction and the corresponding main effects?
You can use the assign component of the lm object:
l$assign
[1] 0 1 2 2 2 3 4 4 4
This maps the coefficients to the expanded formula a4 + a + x + a:x.
See help("model.matrix") for documentation of the assign component.
Edit:
To expand my answer, you can do this:
terms <- labels(terms(l))
coef(l)[l$assign %in% which(terms %in% strsplit("a:x", ":", fixed = TRUE)[[1]])]
# a2 a3 a4 x
#0.36307321 0.04153299 0.23383125 0.21367510
Let's consider the following vectors, combined into a data frame:
ctrl <- rnorm(50)
x1 <- rnorm(30, mean=0.2)
x2 <- rnorm(100,mean=0.1)
x3 <- rnorm(100,mean=0.4)
x <- data.frame(data=c(ctrl,x1,x2,x3),
Group=c(
rep("ctrl", length(ctrl)),
rep("x1", length(x1)),
rep("x2", length(x2)),
rep("x3", length(x3))) )
I know I could use
pairwise.t.test(x$data,
x$Group,
pool.sd=FALSE)
to get pairwise comparisons like
Pairwise comparisons using t tests with non-pooled SD
data: x$data and x$Group
ctrl x1 x2
x1 0.08522 - -
x2 0.99678 0.10469 -
x3 0.00065 0.99678 2.8e-05
P value adjustment method: holm
However, I am not interested in every possible combination of vectors. I am seeking a way to compare the ctrl vector with every other vector, while taking alpha inflation into account. I'd like to avoid
t.test((x$data[x$Group=='ctrl']), (x$data[x$Group=='x1']), var.equal=T)
t.test((x$data[x$Group=='ctrl']), (x$data[x$Group=='x2']), var.equal=T)
t.test((x$data[x$Group=='ctrl']), (x$data[x$Group=='x3']), var.equal=T)
followed by a manual correction for multiple comparisons. What would be the best way to do this?
You can use p.adjust() to adjust multiple p-values; its default method is Holm, and method = "bonferroni" gives a Bonferroni correction. You should not bundle those unequal-length vectors into a data frame, but rather use a list.
ctrl <- rnorm(50)
x1 <- rnorm(30, mean=0.2)
x2 <- rnorm(100,mean=0.1)
x3 <- rnorm(100,mean=0.4)
> lapply( list(x1,x2,x3), function(x) t.test(x,ctrl)$p.value)
[[1]]
[1] 0.2464039
[[2]]
[1] 0.8576423
[[3]]
[1] 0.0144275
> p.adjust( .Last.value)
[1] 0.4928077 0.8576423 0.0432825
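A slightly more explicit variant of the same idea (a sketch; the names and the explicit method choice are mine):
## named p-values, then an explicit adjustment method
pvals <- sapply(list(x1 = x1, x2 = x2, x3 = x3),
                function(x) t.test(x, ctrl)$p.value)
p.adjust(pvals, method = "holm")  # or method = "bonferroni"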
@BondedDust's answer looks great. I provide a slightly more involved solution in case you really need to work with data frames.
library(dplyr)
ctrl <- rnorm(50)
x1 <- rnorm(30, mean=0.2)
x2 <- rnorm(100,mean=0.1)
x3 <- rnorm(100,mean=0.4)
x <- data.frame(data=c(ctrl,x1,x2,x3),
Group=c(
rep("ctrl", length(ctrl)),
rep("x1", length(x1)),
rep("x2", length(x2)),
rep("x3", length(x3))), stringsAsFactors = F )
# provide the combinations you want
# set1 with all from set2
set1 = c("ctrl")
set2 = c("x1","x2","x3")
dt_res =
data.frame(expand.grid(set1,set2)) %>% # create combinations
mutate(test_id = row_number()) %>% # create a test id
group_by(test_id) %>% # group by test id, so everything from now on is performed for each test separately
do({x_temp = x[(x$Group==.$Var1 | x$Group==.$Var2),] # for each test id keep groups of interest
x_temp = data.frame(x_temp)}) %>%
do(test = t.test(data~Group, data=.)) # perform the test and save it
# you create a dataset that has the test id and a column with t.tests results as elements
dt_res
# Source: local data frame [3 x 2]
# Groups: <by row>
#
# test_id test
# 1 1 <S3:htest>
# 2 2 <S3:htest>
# 3 3 <S3:htest>
# get all tests as a list
dt_res$test
# [[1]]
#
# Welch Two Sample t-test
#
# data: data by Group
# t = -1.9776, df = 58.36, p-value = 0.05271
# alternative hypothesis: true difference in means is not equal to 0
# 95 percent confidence interval:
# -0.894829477 0.005371207
# sample estimates:
# mean in group ctrl mean in group x1
# -0.447213560 -0.002484425
#
#
# [[2]]
#
# Welch Two Sample t-test
#
# data: data by Group
# t = -2.3549, df = 100.68, p-value = 0.02047
# alternative hypothesis: true difference in means is not equal to 0
# 95 percent confidence interval:
# -0.71174095 -0.06087081
# sample estimates:
# mean in group ctrl mean in group x2
# -0.44721356 -0.06090768
#
#
# [[3]]
#
# Welch Two Sample t-test
#
# data: data by Group
# t = -5.4235, df = 101.12, p-value = 4.001e-07
# alternative hypothesis: true difference in means is not equal to 0
# 95 percent confidence interval:
# -1.2171386 -0.5652189
# sample estimates:
# mean in group ctrl mean in group x3
# -0.4472136 0.4439652
PS: It's always interesting to work with p-values and alpha corrections. How to approach them is a bit of a philosophical issue, and people disagree about it. Personally, I tend to correct alpha based on all possible comparisons I could make after an experiment, because you never know when you'll come back to investigate other pairs. Imagine what happens if in the future people decide that you have to go back and compare the winning group (let's say x1) with x2 and x3. You'll focus on those pairs and again correct alpha based on those comparisons. But on the whole, you have performed all possible comparisons, apart from x2 vs x3! You may write your reports or publish findings that should have been a bit stricter on the alpha correction.
Suppose I have a random time series that I want to interpolate over another time series. How would I do this in R?
# generate time interval series from exponential distribution
s = sort(rexp(10))
# scale between 0 and 1
scale01 = function(x){(x-min(x))/(max(x)-min(x))}
s = scale01(s)
> s
[1] 0.00000000 0.02804113 0.05715588 0.10630185 0.15778932 0.20391987 0.26066608 0.27265697 0.39100373
[10] 1.00000000
# generate random normal series
x = rnorm(20)
> x
[1] -0.82530658 0.92289557 0.39827984 -0.62416117 -1.69055539 -0.28164232 -1.32717654 -1.36992509
[9] -1.54352202 -1.09826247 -0.68260576 1.07307043 2.35298180 -0.41472811 0.38919315 -0.27325343
[17] -1.52592682 0.05400849 -0.43498544 0.73841106
# interpolate 'x' over 's' ?
> approx(x,xout=s)
$x
[1] 0.00000000 0.02804113 0.05715588 0.10630185 0.15778932 0.20391987 0.26066608 0.27265697 0.39100373
[10] 1.00000000
$y
[1] NA NA NA NA NA NA NA NA NA
[10] -0.8253066
>
I want to interpolate the series 'x' over the series 's'. Let's assume the time index for the 'x' series has 20 elements distributed uniformly over the interval [0,1]. Now I want to interpolate the values of 'x' at the 10 time points described by 's'.
EDIT:
I think this does the job.
> approx(seq(0,1,length.out=20), x, xout=s)
$x
[1] 0.00000000 0.02804113 0.05715588 0.10630185 0.15778932 0.20391987 0.26066608 0.27265697 0.39100373
[10] 1.00000000
$y
[1] -0.8253066 0.1061033 0.8777987 0.3781018 -0.6221134 -1.5566990 -0.3483466 -0.4703429 -1.4444105
[10] 0.7384111
Thanks for your help, guys. I think I now understand how to use the interpolation functions in R. I should really use a time series data structure here.
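If you do move to a time series structure, here is a small sketch using the zoo package (my addition, not part of the original post) that makes the time index explicit:
library(zoo)
# attach an explicit, regular time index on [0, 1] to x
z <- zoo(x, order.by = seq(0, 1, length.out = length(x)))
# interpolate the series at the irregular time points in s
approx(index(z), coredata(z), xout = s)$y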
This isn't meant as a direct answer to the OP's Q, but rather to illustrate how approx() works so the OP can formulate a better Q.
Your Q makes next to no sense. approx() works by taking a reference set of x and y coordinates and then interpolating to find y at n locations over the range of x, or at the xout locations supplied by the user.
So in your call, you don't provide y, and x doesn't contain a y component, so I don't see how this can work.
If you want to interpolate s, so that you can find the time interval for any value over the range of s, then:
> approx(s, seq_along(s), n = 20)
$x
[1] 0.00000000 0.05263158 0.10526316 0.15789474 0.21052632 0.26315789
[7] 0.31578947 0.36842105 0.42105263 0.47368421 0.52631579 0.57894737
[13] 0.63157895 0.68421053 0.73684211 0.78947368 0.84210526 0.89473684
[19] 0.94736842 1.00000000
$y
[1] 1.00000 26.25815 42.66323 54.79831 64.96162 76.99433 79.67388
[8] 83.78458 86.14656 89.86223 91.98513 93.36233 93.77353 94.19731
[15] 94.63652 95.26239 97.67724 98.74056 99.40548 100.00000
Here $y contains the interpolated values for s at n = 20 equally spaced locations on the range of s (0,1).
Edit: If x represents the series at unstated time points, uniform on [0,1], and you want the interpolated values of x at the time points s, then you need something like this:
> set.seed(1)
> x <- rnorm(20)
> s <- sort(rexp(10))
> scale01 <- function(x) {
+ (x - min(x)) / (max(x) - min(x))
+ }
> s <- scale01(s)
>
> ## interpolate x at points s
> approx(seq(0, 1, length = length(x)), x, xout = s)
$x
[1] 0.00000000 0.04439851 0.11870795 0.14379236 0.20767388 0.21218632
[7] 0.25498856 0.29079300 0.40426335 1.00000000
$y
[1] -0.62645381 0.05692127 -0.21465011 0.94393053 0.39810806 0.29323742
[7] -0.64197207 -0.13373472 0.62763207 0.59390132
Is that closer to what you want?