Why does R not relevel?

Please find my data "w" below.
I have the covariate w$WHO, which has three levels: 1, 2 and 3.
I want to relevel it so that w$WHO == 1 is set as the reference.
I tried
w$WHO <- factor(w$WHO)
w$WHO <- relevel(w$WHO, ref=1)
and
w$WHO <- relevel(w$WHO, ref="1")
My script is
library(rms)
d <- datadist(w)
options(datadist="d")
model <- cph(Surv(rfs,recurrence)~age + WHO,data=w)
summary(model)
As you can see, the adjusted model uses w$WHO == 2 as the reference:
Effects Response : Surv(rfs, recurrence)
Factor Low High Diff. Effect S.E. Lower 0.95 Upper 0.95
age 48.545 68.907 20.362 0.28228 0.093283 0.099454 0.46512
Hazard Ratio 48.545 68.907 20.362 1.32620 NA 1.104600 1.59220
WHO - 1:2 2.000 1.000 NA -0.56706 0.156850 -0.874490 -0.25963
Hazard Ratio 2.000 1.000 NA 0.56719 NA 0.417080 0.77134
WHO - 3:2 2.000 3.000 NA 0.69360 0.152910 0.393910 0.99330
Hazard Ratio 2.000 3.000 NA 2.00090 NA 1.482800 2.70010
Here is my data:
w <- structure(list(age = c(54.36164384, 74.91232877, 64.98356164,
60.56712329, 57.61369863, 45.85205479, 78.47123288, 59.95616438,
57.4739726, 25.12876712, 56.61917808, 61.10136986, 58.74246575,
62.56438356, 55.81917808, 30.83013699, 63.11232877, 56.29863014,
47.96986301, 40.53424658, 49.9890411, 47.75616438, 40.83835616,
42.02191781, 49.85205479, 55.05479452, 59.33424658, 71.89589041,
60.30410959, 50.24383562, 41.3260274, 33.4, 73.27945205, 67.45753425
), WHO = c(3L, 1L, 2L, 2L, 1L, 2L, 2L, 2L, 2L, 1L, 1L, 1L, 3L,
2L, 3L, 2L, 1L, 2L, 1L, 1L, 2L, 3L, 1L, 2L, 1L, 1L, 3L, 1L, 1L,
1L, 1L, 1L, 1L, 1L), recurrence = c(1L, 1L, 1L, 1L, 0L, 1L, 0L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 0L, 0L, 0L, 0L, 0L, 0L, 0L), rfs = c(19.1, 15.33333333,
49.16666667, 15.6, 57.16666667, 47.63333333, 54, 16.93333333,
6.7, 102.1, 24.33666667, 127.7666667, 100.6333333, 25.96666667,
1.233333333, 13.1, 72.16666667, 62, 97.23333333, 199.1, 24.73333333,
60.46666667, 10.43333333, 31.76666667, 28.96666667, 56.43333333,
9.533333333, 114.9333333, 114.8666667, 85.06666667, 107.6, 121.2,
69.56666667, 70.03333333)), .Names = c("age", "WHO", "recurrence",
"rfs"), class = "data.frame", row.names = 271:304)
Thanks,
Best.

The solution was to change the adjust-to value stored in the datadist object (rms takes the reference level for summary() from d$limits, not from the factor's level order, which is why relevel() had no effect):
d$limits$WHO[2] <- 1
model <- cph(Surv(rfs,recurrence)~age + WHO,data=w)
summary(model)
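For contrast, a minimal base-R sketch (a toy vector, not the model above) of what relevel() does: standard modelling functions take the baseline from the factor's first level, while rms::summary() reads the reference from the datadist limits instead.

```r
# relevel() moves the chosen level to the front, so lm(), coxph(), etc.
# treat it as the baseline; rms::summary() ignores this and uses d$limits.
WHO <- factor(c(3, 1, 2, 2, 1, 3))
levels(WHO)                      # "1" "2" "3": "1" is already the baseline
WHO2 <- relevel(WHO, ref = "2")
levels(WHO2)                     # "2" "1" "3": "2" is now the baseline
```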

Related

Multiple fixed effect levels missing in lmer from lme4

I am running mixed linear models using lmer from lme4. We are testing the effect of family, strain and temperature on several growth factors for brook trout. I have 4 families (variable FAMILLE) from which we sampled our individuals: 2 are from the selected strain and 2 are from the control strain (variable Lignee). Within each strain, the 2 families were marked as either resistant (Res) or sensitive (Sens), so my fixed-effect variable FAMILLE is nested in my variable Lignee. The experiment was conducted at 3 different temperatures.
Here is what my dataframe looks like :
structure(list(BASSIN = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = c("1",
"2", "3", "4"), class = "factor"), t.visee = structure(c(2L,
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L,
2L, 2L, 2L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L,
3L, 3L, 3L, 3L, 3L, 3L, 3L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L), .Label = c("15", "17", "19"), class = "factor"), FAMILLE = structure(c(2L,
2L, 2L, 2L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L,
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 1L, 1L, 1L, 1L,
1L), .Label = c("RES", "SENS"), class = "factor"), Lignee = structure(c(2L,
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L,
2L), .Label = c("CTRL", "SEL"), class = "factor"), taux.croiss.sp.poids = c(0.8,
1.14285714285714, 1.42857142857143, 0.457142857142857, -0.228571428571429,
0.628571428571429, 0.971428571428571, 0.742857142857143, 1.08571428571429,
0.8, 0.571428571428571, 1.02857142857143, 0.8, 0.285714285714286,
0.285714285714286, 0.571428571428571, 0.742857142857143, 1.14285714285714,
0.628571428571429, 0.742857142857143, 1.02857142857143, 0.285714285714286,
0.628571428571429, 0.628571428571429, 0.857142857142857, 0.8,
1.08571428571429, 1.37142857142857, 0.742857142857143, 1.08571428571429,
0.0571428571428571, 0.571428571428571, 0.171428571428571, 0.8,
0.685714285714286, 0.285714285714286, 0.285714285714286, 0.8,
0.457142857142857, 1.02857142857143, 0.342857142857143, 0.742857142857143,
0.857142857142857, 0.457142857142857, 0.742857142857143, 1.25714285714286,
0.971428571428571, 0.857142857142857, 0.742857142857143, 0.514285714285714
)), row.names = c(NA, -50L), class = c("tbl_df", "tbl", "data.frame"
))
Lignee has 2 levels (Sel and Ctrl)
FAMILLE has 2 levels (Sens and Res)
So I have 4 distinct levels :
Lignee Sel and FAMILLE Sens
Lignee Sel and FAMILLE Res
Lignee Ctrl and FAMILLE Sens
Lignee Ctrl and FAMILLE Res
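As an aside, the columns that a nested formula generates for these four cells can be inspected with base R's model.matrix(); a toy sketch using the same factor names (not the trout data):

```r
# Which design-matrix columns does ~ Lignee/FAMILLE
# (i.e. Lignee + Lignee:FAMILLE) produce?
d <- expand.grid(Lignee  = c("CTRL", "SEL"),
                 FAMILLE = c("RES", "SENS"))
cn <- colnames(model.matrix(~ Lignee/FAMILLE, data = d))
cn
# "(Intercept)" "LigneeSEL" "LigneeCTRL:FAMILLESENS" "LigneeSEL:FAMILLESENS"
# FAMILLE == RES is the baseline within each Lignee, so no RES column appears.
```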
When I run, for example, this line to test the effect of the variables on the rate of weight gain:
model6 <- lmer((taux.croiss.sp.poids) ~ t.visee + Lignee/FAMILLE + (1 |BASSIN), data = mydata1, REML = FALSE)
and then
summary(model6)
Linear mixed model fit by maximum likelihood. t-tests use Satterthwaite's method ['lmerModLmerTest']
Formula: (taux.croiss.sp.poids) ~ t.visee + Lignee/FAMILLE + (1 | BASSIN)
Data: mydata1
AIC BIC logLik deviance df.resid
115.2 139.5 -50.6 101.2 228
Scaled residuals:
Min 1Q Median 3Q Max
-3.11527 -0.59489 0.05557 0.69775 2.79920
Random effects:
Groups Name Variance Std.Dev.
BASSIN (Intercept) 0.01184 0.1088
Residual 0.08677 0.2946
Number of obs: 235, groups: BASSIN, 4
Fixed effects:
Estimate Std. Error df t value Pr(>|t|)
(Intercept) 0.770942 0.209508 194.702337 3.680 0.000302 ***
t.visee -0.019077 0.011682 231.005933 -1.633 0.103809
LigneeSEL 0.214062 0.054471 231.007713 3.930 0.000112 ***
LigneeCTRL:FAMILLESENS -0.008695 0.054487 231.038877 -0.160 0.873358
LigneeSEL:FAMILLESENS -0.205001 0.054242 231.016973 -3.779 0.000200 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Correlation of Fixed Effects:
(Intr) t.vise LgnSEL LCTRL:
t.visee -0.948
LigneeSEL -0.131 0.000
LCTRL:FAMIL -0.124 -0.007 0.504
LSEL:FAMILL 0.000 0.000 -0.498 0.000
From what I can understand, the model chooses one group as the reference, which won't appear in the output. But the problem here is that 2 groups are missing:
LigneeCTRL:FAMILLERES
AND
LigneeSEL:FAMILLERES
Does somebody know why my output is missing not one but TWO of the groups?
I'm French Canadian, so don't hesitate to ask if some things are not clear; I will try to re-explain in other words!
Also, this is my first message on Stack. I tried to include everything needed, but don't hesitate to tell me if I should include anything else!
Thanks in advance

on the difference of confusionMatrix using caret for classification

I'm applying the example here:
https://quantdev.ssri.psu.edu/sites/qdev/files/09_EnsembleMethods_2017_1127.html
to my data, to build a model for classification using the caret package.
I got to the point:
cvcontrol <- trainControl(method="repeatedcv", number = 10, repeats=3,allowParallel=TRUE)
train.rf <- train(as.factor(variate) ~ .,
data=train.n.inp,
method="rf",
trControl=cvcontrol,
importance=TRUE)
rf.classTrain <- predict(train.rf, type="raw")
#computing confusion matrix
cM <- confusionMatrix(train.n.inp$variate,rf.classTrain)
I don't understand the need to use the predict function to calculate the confusion matrix; in other words, what is the difference between cM and train.rf$finalModel?
train.rf$finalModel
OOB estimate of error rate: 43.08%
Confusion matrix:
MV UV class.error
MV 25 12 0.3243243
UV 16 12 0.5714286
> cM
Confusion Matrix and Statistics
Reference
Prediction MV UV
MV 37 0
UV 0 28
Accuracy : 1
I am confused by the (large) difference between the two confusion matrices and unsure which one reflects the accuracy of the model. Any help appreciated.
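As far as I can tell, the two matrices answer different questions: predict(train.rf) with no newdata predicts back on the training rows (resubstitution), while the finalModel printout's confusion matrix is computed from out-of-bag samples, which is the honest estimate. A toy sketch of resubstitution optimism, using rpart (which ships with R) instead of randomForest to keep it self-contained:

```r
# Resubstitution (predicting the data the model was fit on) is optimistic:
# an overfit tree scores near 100% even when the predictors are pure noise.
library(rpart)
set.seed(1)
d <- data.frame(y  = factor(sample(c("MV", "UV"), 60, replace = TRUE)),
                x1 = rnorm(60), x2 = rnorm(60))   # noise: true accuracy ~50%
fit <- rpart(y ~ ., data = d,
             control = rpart.control(minsplit = 2, cp = 0))  # overfit on purpose
resub <- mean(predict(fit, type = "class") == d$y)
resub  # close to 1; an OOB or cross-validated estimate would sit near 0.5
```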
the data:
dput(train.n.inp)
structure(list(variate = structure(c(1L, 1L, 2L, 1L, 1L, 2L,
1L, 2L, 2L, 1L, 1L, 1L, 1L, 2L, 1L, 1L, 1L, 1L, 2L, 1L, 2L, 2L,
1L, 1L, 1L, 2L, 1L, 1L, 2L, 1L, 2L, 2L, 1L, 1L, 1L, 2L, 1L, 1L,
2L, 2L, 2L, 1L, 1L, 2L, 1L, 2L, 1L, 2L, 2L, 1L, 1L, 2L, 1L, 2L,
1L, 2L, 2L, 1L, 1L, 2L, 1L, 1L, 2L, 2L, 2L), .Label = c("MV",
"UV"), class = "factor"), AMB = c(0.148918043959789, 0.137429106929874,
0.13522219247215, 0.152139165429334, 0.193551266136034, 0.1418753904697,
0.132098434875739, 0.256245486778797, 0.136593400352133, 0.0183612037420183,
0.0235701709547339, 0.030539801539972, 0.0532418112925866, 0.0506048730618504,
0.0443005622763673, 0.172991261592386, 0.135717125493919, 0.139092406429261,
0.1225892299329, 0.13579014839877, 0.183709401293317, 0.122207888096455,
0.00542803592726925, 0.0192455922563268, 0.0731446096925737,
0.0150264910871489, 0.0487793004405717, 0.0433918327937752, 0.0122597343588996,
0.0211847560629296, 0.114451232870044, 0.113712890165437, 0.00788647372392488,
-0.03807738805183, 0.00735097242168299, -0.00173226557619129,
0.000279921135262793, 0.0487306185040041, 0.00901021509302318,
0.164378615647997, 0.081505732298031, 0.0337690366656119, 0.0520247628784008,
0.0318461001711981, 0.0467265454486446, 0.0503046677863513, 0.026150313592808,
0.102418680881792, 0.145640126897581, 0.158703113209843, 0.166192017785134,
0.145234444092853, 0.189096868940113, 0.142573164893833, 0.157794383727251,
0.312043099741174, 0.136009217113324, 0.115213916542934, 0.119757563955894,
0.120065882887488, 0.141891617781889, 0.177956819122265, 0.13731551574455,
0.328513821613157, 0.110426859447136), MB = c(-0.73416, -0.67752,
-0.66664, -0.75004, -0.9542, -0.69944, -0.65124, -1.26328, -0.6734,
-0.09052, -0.1162, -0.15056, -0.26248, -0.24948, -0.2184, -0.85284,
-0.66908, -0.68572, -0.60436, -0.66944, -0.90568, -0.60248, -0.02676,
-0.09488, -0.3606, -0.07408, -0.24048, -0.21392, -0.06044, -0.10444,
-0.56424, -0.5606, -0.0388800000000001, 0.18772, -0.0362400000000001,
0.00854000000000001, -0.00138, -0.24024, -0.04442, -0.81038,
-0.40182, -0.16648, -0.25648, -0.157, -0.23036, -0.248, -0.12892,
-0.50492, -0.718, -0.7824, -0.81932, -0.716, -0.93224, -0.70288,
-0.77792, -1.53836, -0.67052, -0.568, -0.5904, -0.59192, -0.69952,
-0.87732, -0.67696, -1.61956, -0.5444), MGE = c(1.58768, 1.6152,
1.53288, 1.52972, 1.12908, 1.50552, 1.48988, 1.67552, 1.55052,
1.23556, 1.27284, 1.21336, 0.84592, 1.30172, 1.14048, 1.26828,
1.20884, 1.21764, 1.22876, 1.22168, 1.27944, 1.22528, 1.26932,
1.25408, 1.183, 1.38032, 1.33416, 0.95584, 1.31188, 1.39796,
1.33848, 1.4458, 1.18416, 1.23868, 1.22968, 1.17838, 1.17278,
1.13368, 1.11374, 1.31642, 1.14034, 1.21984, 1.17128, 1.16364,
1.15036, 1.12984, 1.22484, 1.17244, 1.2768, 1.55744, 1.66964,
1.54848, 1.17416, 1.56424, 1.48928, 1.9326, 1.54588, 1.228, 1.29096,
1.39296, 1.38432, 1.275, 1.32704, 1.9442, 1.35128)), row.names = c(NA,
-65L), class = "data.frame")

Error in R t_test: not enough 'x' observations

I am trying to conduct a group-wise t-test, but the code I am using returns an error. It has worked fine for me previously on other data frames, but for this data frame it gives:
Error in t.test.default(x = 0.0268, y = 0.0223, paired = FALSE,
var.equal = FALSE, : not enough 'x' observations
My Code is
stat.test.BACI5 <- Flaov %>%
group_by(`Treatment`) %>%
t_test(`Observed` ~ Control, detailed = TRUE) %>%
adjust_pvalue(method = "bonferroni") %>%
add_significance()
Here is the data structure
structure(list(Treatment = structure(c(1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L,
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L
), .Label = c("Phase1", "Phase2"), class = "factor"), Group = structure(c(3L,
4L, 2L, 3L, 2L, 4L, 1L, 2L, 4L, 3L, 1L, 2L, 1L, 2L, 1L, 1L, 2L,
1L, 2L, 1L, 1L, 1L, 4L, 2L, 3L, 2L, 4L, 3L, 1L, 2L, 4L, 1L, 3L,
1L, 1L, 1L, 2L, 1L, 3L, 2L, 1L, 2L, 3L, 1L, 1L, 1L, 2L, 2L, 2L,
4L, 2L, 1L, 1L, 1L, 4L, 1L, 3L, 1L, 3L, 4L, 2L, 1L, 1L, 2L, 4L,
2L, 3L, 1L, 1L, 2L), .Label = c("Group A ", "Group B", "Group C ",
"Group D"), class = "factor"), Observed = c(0.1057, 0.151, 0.0576,
0.1267, 0.0941, 0.1554, 0.0247, 0.0832, 0.2807, 0.1137, 0.0325,
0.0777, 0.0362, 0.0637, 0.0303, 0.0223, 0.0932, 0.0363, 0.0641,
0.0453, 0.0359, 0.0334, 0.2006, 0.0538, 0.1114, 0.0661, 0.2452,
0.1043, 0.0489, 0.0663, 0.1967, 0.0321, 0.1042, 0.0268, 0.0313,
0.0255, 0.0787, 0.038, 0.1212, 0.0839, 0.0446, 0.0986, 0.1364,
0.0335, 0.0409, 0.0407, 0.0871, 0.0584, 0.0875, 0.1961, 0.0711,
0.0191, 0.0363, 0.0474, 0.1608, 0.0349, 0.1099, 0.0399, 0.1095,
0.2011, 0.057, 0.0418, 0.0394, 0.054, 0.2033, 0.0631, 0.1089,
0.0441, 0.0261, 0.0686), Control = c(0.1061, 0.154, 0.0585, 0.1289,
0.1076, 0.15856, 0.02997, 0.1022, 0.2849, 0.1193, 0.03292, 0.0888,
0.04628, 0.06454, 0.03341, 0.0239, 0.1013, 0.0364, 0.0883, 0.06363,
0.0566, 0.04036, 0.20641, 0.06206, 0.1158, 0.0687, 0.2457, 0.12643,
0.05126, 0.05705, 0.1987, 0.04719, 0.08199, 0.02312, 0.0317,
0.07045, 0.06395, 0.06043, 0.1251, 0.0912, 0.04575, 0.1018, 0.1379,
0.03834, 0.048, 0.04131, 0.0926, 0.06242, 0.0965, 0.1972, 0.0742,
0.0211, 0.04318, 0.05741, 0.1616, 0.06552, 0.1104, 0.04814, 0.11015,
0.2081, 0.06341, 0.04329, 0.04486, 0.06179, 0.2114, 0.05545,
0.1127, 0.04327, 0.03355, 0.07189), factors = structure(c(1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L,
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L,
2L, 2L, 2L, 2L, 2L), .Label = c("Phase1", "Phase2"), class = "factor")), row.names = c(NA,
70L), class = "data.frame")
If you are doing a t-test between Observed and Control within the different treatment groups, the formula is wrong: the left-hand side of the formula should be the response variable and the right-hand side should be the grouping variable.
In your case, you need to pivot the data long to get something like this:
library(tidyr)
Flaov[,c("Treatment","Observed","Control")] %>%
pivot_longer(-c(Treatment)) %>% group_by(Treatment)
# A tibble: 140 x 3
# Groups: Treatment [2]
Treatment name value
<fct> <chr> <dbl>
1 Phase1 Observed 0.106
2 Phase1 Control 0.106
3 Phase1 Observed 0.151
4 Phase1 Control 0.154
5 Phase1 Observed 0.0576
6 Phase1 Control 0.0585
7 Phase1 Observed 0.127
8 Phase1 Control 0.129
9 Phase1 Observed 0.0941
10 Phase1 Control 0.108
# … with 130 more rows
Then we further pipe it to test:
Flaov[,c("Treatment","Observed","Control")] %>%
pivot_longer(-c(Treatment)) %>%
group_by(Treatment) %>%
t_test(value ~ name)
# A tibble: 2 x 9
Treatment .y. group1 group2 n1 n2 statistic df p
* <fct> <chr> <chr> <chr> <int> <int> <dbl> <dbl> <dbl>
1 Phase1 value Control Observed 46 46 0.482 90.0 0.631
2 Phase2 value Control Observed 24 24 0.323 46.0 0.748
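If rstatix is not available, the same per-group comparison can be done in base R; a sketch on toy data shaped like Flaov (column names assumed from the question):

```r
# Welch t-test of Observed vs Control within each Treatment level, base R only.
set.seed(42)
toy <- data.frame(Treatment = rep(c("Phase1", "Phase2"), each = 20),
                  Observed  = runif(40, 0.02, 0.3),
                  Control   = runif(40, 0.02, 0.3))
pvals <- sapply(split(toy, toy$Treatment),
                function(d) t.test(d$Observed, d$Control)$p.value)
pvals  # one p-value per Treatment level
```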

Developing a function to analyse rows of a data.table in R

For a sample dataframe:
df1 <- structure(list(area = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L,
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L), .Label = c("a",
"b"), class = "factor"), region = structure(c(1L, 1L, 1L, 1L,
1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L, 3L, 3L, 3L,
3L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L), .Label = c("a1",
"a2", "b1", "b2"), class = "factor"), weight = c(0, 1.2, 3.2,
2, 1.6, 5, 1, 0.5, 0.2, 0, 1.5, 2.3, 1.5, 1.8, 1.6, 2, 1.3, 1.4,
1.5, 1.6, 2, 3, 4, 2.3, 1.3, 2.1, 1.3, 1.6, 1.7, 1.8, 2, 1.3,
1, 0.5), var.1 = c(0L, 0L, 0L, 1L, 0L, 1L, 0L, 1L, 1L, 0L, 1L,
1L, 0L, 1L, 0L, 0L, 1L, 0L, 1L, 0L, 1L, 1L, 1L, 0L, 1L, 0L, 1L,
0L, 0L, 0L, 1L, 0L, 1L, 0L), var.2 = c(1L, 1L, 1L, 1L, 1L, 1L,
1L, 0L, 1L, 1L, 1L, 1L, 0L, 1L, 0L, 0L, 0L, 0L, 1L, 0L, 1L, 0L,
1L, 0L, 0L, 0L, 1L, 0L, 1L, 0L, 1L, 1L, 1L, 1L)), .Names = c("area",
"region", "weight", "var.1", "var.2"), class = c("data.table",
"data.frame"))
I want to first produce a summary table...
area_summary <- setDT(df1)[,.(.N, freq.1 = sum(var.1==1), result = weighted.mean((var.1==1),
w = weight)*100), by = area]
...and then populate it by running the following code for each area (e.g. a, b). This finds the regions with the highest and lowest 'result', produces an xtabs table, and calculates the relative difference (RD) before adding these to the summary table. Here I have developed the code for area 'a':
#Subset to area 'a'
a_cntry <- subset(df1, area=="a")
a_cntry.summary <- setDT(a_cntry)[,.(.N, freq.1 = sum(var.1==1), result = weighted.mean((var.1==1),
w = weight)*100), by = region]
#Include only regions with highest or lowest percentage
incl <- a_cntry.summary[c(which.min(result), which.max(result)),region]
region <- as.data.frame.matrix(a_cntry)
a_cntry <- a_cntry[a_cntry$region %in% incl,]
#Produce xtabs table of RD
a_cntry.var.1 <- xtabs(weight ~ var.1 + region, data=a_cntry)
a_cntry.var.1
#Produce xtabs table
RD.var.1 <- prop.test(x=a_cntry.var.1[,2], n=rowSums(a_cntry.var.1), correct = FALSE)
RD <- round(- diff(RD.var.1$estimate), 3)
RDpvalue <- round(RD.var.1$"p.value", 4)
RD
RDpvalue
#Add RD and RDpvalue to summary table
area_summary$RD[area_summary$area == "a"] <- RD
area_summary$RDpvalue[area_summary$area == "a"] <- RDpvalue
rm(RD, RD.var.1, RDpvalue, a_cntry.var.1, incl, a_cntry,a_cntry.summary,region)
I wish to wrap this code into a function, so I can just specify the 'areas' (in the 'area' column in df1) and then the code completes all the analysis and adds the results to the summary table.
If I wanted to call my function stats, I understand it may start like this:
stats= function (df1, x) {
apply(x)
}
If anyone can start me off developing my function, I should be most grateful.
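One possible starting point (a sketch, untested against your full workflow; the helper name area_stats is made up): move the per-area steps into a function of the data and an area name, return RD and its p-value, and loop over the areas to fill the summary table.

```r
library(data.table)

# Sketch: one function per area, mirroring the steps from the question.
area_stats <- function(dat, ar) {
  cntry <- dat[area == ar]
  summ <- cntry[, .(.N, freq.1 = sum(var.1 == 1),
                    result = weighted.mean(var.1 == 1, w = weight) * 100),
                by = region]
  # keep only the regions with the lowest and the highest result
  incl <- summ[c(which.min(result), which.max(result)), region]
  tab <- xtabs(weight ~ var.1 + region, data = cntry[region %in% incl])
  # prop.test warns about non-integer counts because weights are used,
  # just as in the original code
  pt <- prop.test(x = tab[, 2], n = rowSums(tab), correct = FALSE)
  list(RD = round(-diff(pt$estimate), 3), RDpvalue = round(pt$p.value, 4))
}

# Toy data in the same shape as df1, just to show the call pattern:
set.seed(1)
toy <- data.table(area   = rep(c("a", "b"), each = 20),
                  region = rep(c("a1", "a2", "b1", "b2"), each = 10),
                  weight = runif(40, 0.5, 2),
                  var.1  = rbinom(40, 1, 0.5))
lapply(setNames(nm = unique(toy$area)), function(ar) area_stats(toy, ar))
```

With the real data the call would be area_stats(df1, "a"), and the returned RD and RDpvalue can be assigned into area_summary as in the question.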

identifying rows in data frame that exhibit patterns

Below I have data with 3 columns: a group field, an open/closed field for the store, and the rolling sum of opens over 3 months. I also have the desired solution output.
My dataset can be thought of as an employee's availability. You can assume each row to be a different time period (hour, day, month, year, whatever). The open/closed column records whether or not the employee was present, and the 3-month rolling column is a sum over the previous rows.
What I want to identify is the non-zero values in this rolling-sum column that follow a gap of at least 3 zero rows for that particular group. While not present in this dataset, you can assume that there might be more than one 'gap' of zeros.
structure(list(Group = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L,
2L, 2L, 2L), .Label = c("A", "B"), class = "factor"), X0_closed_1_open = c(0L,
1L, 0L, 1L, 0L, 0L, 0L, 0L, 0L, 0L, 1L, 1L, 0L, 1L, 0L, 0L, 0L,
0L, 0L, 0L, 0L, 0L, 1L, 0L, 0L, 0L), X3month_roll_open = c(0L,
0L, 1L, 2L, 2L, 1L, 1L, 0L, 0L, 0L, 0L, 1L, 2L, 0L, 1L, 1L, 1L,
0L, 0L, 0L, 0L, 0L, 0L, 1L, 1L, 1L), desired_solution = structure(c(1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L), .Label = c("no", "yes"), class ="factor")), .Names = c("Group", "X0_closed_1_open", "X3month_roll_open", "desired_solution"), class = "data.frame", row.names = c(NA,
-26L))
One option is to split by group, run rle() on the zero runs, and flag the non-zero values that come after a zero run of length at least 3:
res <- unsplit(
  lapply(split(df1, df1$Group), function(x) {
    rl <- with(x, rle(X3month_roll_open == 0))
    # keep only the zero runs of length >= 3, then mark everything after them
    indx <- cumsum(c(0, diff(inverse.rle(within.list(rl,
        values[values] <- lengths[values] >= 3))) < 0))
    x$Flag <- indx != 0 & x[, 3] != 0
    x
  }),
  df1$Group)
NOTE: Instead of 'yes/no', it may be better to use TRUE/FALSE to ease subsetting.
identical(c('no', 'yes')[res$Flag+1L], as.character(res$desired_solution))
#[1] TRUE
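The run-length trick is easier to see on a single vector; a slightly simplified toy sketch using group A's rolling column:

```r
# Flag non-zero values that come after a run of at least 3 zeros.
x <- c(0, 0, 1, 2, 2, 1, 1, 0, 0, 0, 0, 1, 2)   # group A's X3month_roll_open
rl <- rle(x == 0)
rl$values <- rl$values & rl$lengths >= 3   # keep only zero runs of length >= 3
after_gap <- cumsum(inverse.rle(rl)) > 0 & x != 0
which(after_gap)                           # rows 12 and 13, as desired
```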
