Two way ANOVA - repeated measure in r, missing a desired effect - r

I am trying to do a two way mixed factorial ANOVA with repeated measures. From:
aov(Estimate ~ Dose*Visit, data = AUClast)
I get 3 sums of squares: two main effects (Visit and Dose) and their interaction (Dose:Visit) which I figured out by hand are correct.
Both Dose and Visit are explanatory variables with Dose being a between subject variable with 4 levels, 3, 10, 30, 100 and Visit being a within subjects variable (repeated measure) of 2 levels, 1 and 28. Also, the subjectID variable is 'Animal'
I want to include one more effect into the result but do not know how. The desired effect is variance between Animal within Dose, or how SAS puts it Animal(Dose). The SS is calculated by:
sum((mean(Animal(ik))-mean(Dose(i))^2)
Where k is the animal of a dose i (averaging the Estimates of the observation in Visit 1 and Visit 28 for each Animal and subtracting the mean Estimate of animals in that Dose quantity squared for all Animals in this study).
Does anyone know how to adjust the formula accordingly to include the Animal(Dose) effect?
Thanks in advance for the help and sorry if all of this is too unspecific.

If I understand you correctly, I have a suggestion. First, a sample data set
#sample data
set.seed(15)
AUClast<-data.frame(
expand.grid(
Animal=1:3,
Dose=c(3,10,30,100),
Visit=c(1,28)
), Estimate=runif(24)
)
Now we calculate the interaction term as requested. First, we split the data into dosage groups, then for each does, we subtract the overall mean from the mean for each animal. Then we sum the squared of those differences. Finally, we expand them back out to does group using unsplit.
animaldose<-unsplit(lapply(split(AUClast, AUClast$Dose), function(x) {
rep(
sum((tapply(x$Estimate, x$Animal, mean) - mean(x$Estimate))^2)
, nrow(x))
}), AUClast$Dose)
And we can see what that looks like next to the original data.frame
cbind(AUClast, animaldose)
Which gives the result
Animal Dose Visit Estimate animaldose
1 1 3 1 0.60211404 0.1181935
2 2 3 1 0.19504393 0.1181935
3 3 3 1 0.96645873 0.1181935
4 1 10 1 0.65090553 0.1641363
5 2 10 1 0.36707189 0.1641363
6 3 10 1 0.98885921 0.1641363
7 1 30 1 0.81519341 0.0419291
8 2 30 1 0.25396837 0.0419291
9 3 30 1 0.68723085 0.0419291
10 1 100 1 0.83142902 0.1881314
11 2 100 1 0.10466936 0.1881314
12 3 100 1 0.64615091 0.1881314
13 1 3 28 0.50909039 0.1181935
14 2 3 28 0.70662857 0.1181935
15 3 3 28 0.86231366 0.1181935
16 1 10 28 0.84178515 0.1641363
17 2 10 28 0.44744372 0.1641363
18 3 10 28 0.96466695 0.1641363
19 1 30 28 0.14118707 0.0419291
20 2 30 28 0.77671251 0.0419291
21 3 30 28 0.80372740 0.0419291
22 1 100 28 0.79334595 0.1881314
23 2 100 28 0.35756312 0.1881314
24 3 100 28 0.05800106 0.1881314
So you can see each does group has it's own adjustment.

Related

Multiple columns in one random effect GLMER

I'm trying to find variance in infectivity trait of animals in different herds. Each herds contains a fixed number of offspring from 5 different sires.
Example of data:
Herd
S
C
DeltaT
I
sire1
I1
sire2
I2
sire3
I3
sire4
I4
sire5
I5
1
20
0
14
1
13
0
26
0
46
0
71
0
91
1
1
1
0
14
5
13
1
26
0
46
2
71
1
91
1
18
4
0
14
13
2
5
52
4
84
2
87
2
98
0
19
11
3
14
27
2
6
13
7
18
3
46
5
85
6
Herd is the herdname. S is the number of susceptible animals in the herd, C is the number of cases in the time interval. DeltaT is the time interval length. Sire# is the ID of the sire in the Herd. I# is the number of infected Ofspring of the corresponding Sire#. This means that a sireID "13" in the first two rows in the column sire1. Refers to the same sire as the "13" in sire2 of the last row. To include these 5 sires into one random effect in a glmer of lme4 is getting me in trouble.
I tried:
glmer(data = GLMM_Data,
cbind(C, S-C) ~ (1 | Herd) + (1| (I1 | sire1) + (I2 | sire2) + (I3 | sire3) + (I4 | sire4) + (I5 | sire5)),
offset = log(GLMM_Data$I/nherds * GLMM_Data$DeltaT),
family = binomial(link="cloglog"))
This gave errors. So any help on combining these 10 columns in a single random factor would be more than welcome. Thanks in advance.
p.s. I know my offset, family and the left side of the formula are working since the analysis of susceptibility is working

Episode splitting in survival analysis by the timing of an event in R

Is it possible to split episode by a given variable in survival analysis in R, similar to in STATA using stsplit in the following way: stsplit var, at(0) after(time=time)?
I am aware that the survival package allows one to split episode by given cut points such as c(0,5,10,15) in survSplit, but if a variable, say time of divorce, differs by each individual, then providing cutpoints for each individual would be impossible, and the split would have to be based on the value of a variable (say graduation, or divorce, or job termination).
Is anyone aware of a package or know a resource I might be able to tap into?
Perhaps Epi package is what you are looking for. It offers multiple ways to cut/split the follow-up time using the Lesix objects. Here is the documentation of cutLesix().
After some poking around, I think tmerge() in the survival package can achieve what stsplit var can do, which is to split episodes not just by a given cut points (same for all observations), but by when an event occurs for an individual.
This is the only way I knew how to split data
id<-c(1,2,3)
age<-c(19,20,29)
job<-c(1,1,0)
time<-age-16 ## create time since age 16 ##
data<-data.frame(id,age,job,time)
id age job time
1 1 19 1 3
2 2 20 1 4
3 3 29 0 13
## simple split by time ##
## 0 to up 2 years, 2-5 years, 5+ years ##
data2<-survSplit(data,cut=c(0,2,5),end="time",start="start",
event="job")
id age start time job
1 1 19 0 2 0
2 1 19 2 3 1
3 2 20 0 2 0
4 2 20 2 4 1
5 3 29 0 2 0
6 3 29 2 5 0
7 3 29 5 13 0
However, if I want to split by a certain variable, such as when each individuals finished school, each person might have a different cut point (finished school at different ages).
## split by time dependent variable (age finished school) ##
d1<-data.frame(id,age,time,job)
scend<-c(17,21,24)-16
d2<-data.frame(id,scend)
## create start/stop time ##
base<-tmerge(d1,d1,id=id,tstop=time)
## create time-dependent covariate ##
s1<-tmerge(base,d2,id=id,
finish=tdc(scend))
id age time job tstart tstop finish
1 1 19 3 1 0 1 0
2 1 19 3 1 1 3 1
3 2 20 4 1 0 4 0
4 3 29 13 0 0 8 0
5 3 29 13 0 8 13 1
I think tmerge() is more or less comparable with stsplit function in STATA.

Adding extreme value distributed noise (with µ=0,σ=10) to a vector of numbers in R

I have the following matrix
Measurement Treatment
38 A
14 A
54 A
69 A
20 B
36 B
35 B
10 B
11 C
98 C
88 C
14 C
I want to add extreme value distributed noise (with mean=0 and sd=10) to the Measurement values. How can I achieve that in R?
I found revd in extRemes package, but it does not work as expected. Does devd from the same package do what I want to do? (but it does not allow for mean and sd to be defined)
If you want to use your measure as the mean for the noise, then you can do this:
measure = round(runif(10,0,30),0)
data = data.frame(measure)
for(i in 1:nrow(data)){
data$measure1[i] = rnorm(1,data$measure[i],10)
}
data
measure measure1
1 6 6.281557
2 12 -5.780177
3 18 13.529773
4 26 33.665584
5 14 12.666614
6 24 41.146132
7 5 -1.850390
8 14 16.728703
9 13 26.082601
10 13 14.066475
EDIT: You can avoid the for loop with this instead:
data$measure1 = data$measure + rnorm(1,0,10)

Tidying Time Intervals for Plotting Histogram in R

I'm doing some cluster analysis on the MLTobs from the LifeTables package and have come across a tricky problem with the Year variable in the mlt.mx.info dataframe. Year contains the period that the life table was taken, in intervals. Here's a table of the data:
1751-1754 1755-1759 1760-1764 1765-1769 1770-1774 1775-1779 1780-1784 1785-1789 1790-1794
1 1 1 1 1 1 1 1 1
1795-1799 1800-1804 1805-1809 1810-1814 1815-1819 1816-1819 1820-1824 1825-1829 1830-1834
1 1 1 1 1 2 3 3 3
1835-1839 1838-1839 1840-1844 1841-1844 1845-1849 1846-1849 1850-1854 1855-1859 1860-1864
4 1 5 3 8 1 10 11 11
1865-1869 1870-1874 1872-1874 1875-1879 1876-1879 1878-1879 1880-1884 1885-1889 1890-1894
11 11 1 12 2 1 15 15 15
1895-1899 1900-1904 1905-1909 1908-1909 1910-1914 1915-1919 1920-1924 1921-1924 1922-1924
15 15 15 1 16 16 16 2 1
1925-1929 1930-1934 1933-1934 1935-1939 1937-1939 1940-1944 1945-1949 1947-1949 1948-1949
19 19 1 20 1 22 22 3 1
1950-1954 1955-1959 1956-1959 1958-1959 1960-1964 1965-1969 1970-1974 1975-1979 1980-1984
30 30 2 1 40 40 41 41 41
1983-1984 1985-1989 1990-1994 1991-1994 1992-1994 1995-1999 2000-2003 2000-2004 2005-2006
1 42 42 1 1 44 3 41 22
2005-2007
14
As you can see, some of the intervals sit within other intervals. Thankfully none of them overlap. I want to simplify the intervals so intervals such as 1992-1994 and 1991-1994 all go into 1990-1994.
An idea might be to get the modulo of each interval and sort them into their new intervals that way but I'm unsure how to do this with the interval data type. If anyone has any ideas I'd really appreciate the help. Ultimately I want to create a histogram or barplot to illustrate the nicely.
If I understand your problem, you'll want something like this:
bottom <- seq(1750, 2010, 5)
library(dplyr)
new_df <- mlt.mx.info %>%
arrange(Year) %>%
mutate(year2 = as.numeric(substr(Year, 6, 9))) %>%
mutate(new_year = paste0(bottom[findInterval(year2, bottom)], "-",(bottom[findInterval(year2, bottom) + 1] - 1)))
View(new_df)
So what this does, it creates bins, and outputs a new column (new_year) that is the bottom of the bin. So everything from 1750-1754 will correspond to a new value of 1750-1754 (in string form; the original is an integer type, not sure how to fix that). Does this do what you want? Double check the results, but it looks right to me.

Maximum Intermediate Volatility

I have two vectors, a and b. See attached.
a is the signal and is a probability.
b is the absolute percentage change the next period.
Signalt <- seq(0, 1, 0.05)
I would like to find the maximum absolute return occuring within each intermediate 5%-tile (Signalt) of the a vector. So if it is
0.01, 0.02, 0.03, 0.06 0.07
then it should calculate the maximum return between
0.01 and 0.02,
0.01 and 0.03,
0.02 and 0.03.
Then move on to
0.06 and 0.07 do it over etc.
Output would then be combined in a matrix or table when the entire sequence has run.
It should follow the index from vector a and b.
i is an index that is updated by one every time that a crosses into a new percentile. t(i) is the bucket associated with the ith cross.
a is the probability vector which has length tao. This vector should be analyzed in its 5% tiles, with the maximum intermediate absolute return being the output. The price change of next period is the vector b. This would be represented by P in the equation below.
l and m are indexes.
Every time Signal moves from one 5% tile to another, we compute the
largest absolute return that occurs between any two intermediate
buckets, until Signal moves to another 5% tile. For example, suppose
that Signal moves into the 85th percentile and 4 volume buckets later
moves into the 90th percentile. We would then calculate absolute
returns between buckets 1 and 2, 1 and 3, 1 and 4, 2 and 3, 2 and 4, 3
and 4. We are interested in the maximum absolute return. We would then
calculate the max return in the following percentile bucket, move on
to the next, which could be an 85th percentile and so on. So we let i
be an index that is updated by 1 every time that Signal moves from one
percentile into another, and τ(i) the bucket associated with the ith
cross.
This is the equation I am using. The notation might vary slightly.
Now my question is how to go about this. Perhaps someone has an intuitive solution to this.
I hope my question is clear.
"a","b"
0,0.013013698630137
0,0.0013522650439487
0,0.00135409614082593
0,0.00203389830508471
0.27804813511593,0.00135317997293627
0.300237801284318,0
0.495965075167796,0.00405405405405412
0.523741892051237,0.000672947510094168
0.558753750296458,0.00202020202020203
0.665762829019002,0.000672043010752743
0.493106479913899,0.000671591672263272
0.344592579573497,0.000672043010752854
0.336263897823707,0.00201748486886366
0.35884763774257,0.00536912751677865
0.23662807979007,0.00133511348464632
0.212636893966841,0.00267379679144386
0.362212830513403,0.000666666666666593
0.319216408413927,0.00333555703802535
0.277670854167344,0
0.310143323100971,0
0.374104373036218,0.00267737617135211
0.190943075221511,0.00268456375838921
0.165770070508112,0.00200803212851386
0.240310208616952,0.00133600534402145
0.212418038918236,0.00200133422281523
0.204282022136019,0.00200534759358306
0.363725074298064,0.000667111407605114
0.451807761954326,0.000666666666666593
0.369296011692801,0.000666222518321047
0.37503495989363,0.0026666666666666
0.323386355686901,0.00132978723404265
0.189216171830472,0.00266311584553924
0.185252052821193,0.00199203187250996
0.174882909380997,0.000662690523525522
0.149291525540782,0.00132625994694946
0.196824215268048,0.00264900662251666
0.164611993131396,0.000660501981505912
0.125470998266484,0.00132187706543285
0.179999532586703,0.00264026402640272
0.368749638521621,0.000658327847267826
0.427799340926225,0
My interpretation of the question
I hope I understand your question correctly. Here is what I understood:
For each row you compute which 5% percentile it belongs to
Whenever that percentile changes, you start a new bucket
All rows from the same bucket result in a single resulting value
If there is only a single row in a bucket, the b value from that row is the resulting value
Otherwise, you compute all abs(b[l]/b[m]-1) where m<l and both belong to the same bucket
Basic answer
Code
This code here does what I describe above:
# read the data (shortened, full data in OP)
d <- read.table(textConnection("a,b
0,0.013013698630137
[…]
0.427799340926225,0
"), sep=",", header=TRUE)
# compute percentile number for each line
d$percentile <- floor(d$a/0.05)*5 + 5
# start a new bucket whenever the percentile changes
d$bucket <- cumsum(c(1, diff(d$percentile) != 0))
# compute a single number for all rows of the same bucket
aggregate(b ~ percentile + bucket, d, function(b) {
if(length(b) == 1) return(b); # special case of only a single row
m <- outer(b, b, function(pm, pl) abs(pl/pm - 1)) # compare all pairs
return(max(m[upper.tri(m)])) # only return pairs with m < l
})
Output
The result will look like this:
percentile bucket b
1 5 1 0.8960891071
2 30 2 0.0013531800
3 35 3 0.0000000000
4 50 4 0.0040540541
5 55 5 0.0006729475
6 60 6 0.0020202020
7 70 7 0.0006720430
8 50 8 0.0006715917
9 35 9 2.0020174849
10 40 10 0.0053691275
11 25 11 1.0026737968
12 40 12 0.0006666667
13 35 13 0.0033355570
14 30 14 0.0000000000
15 35 15 0.0000000000
16 40 16 0.0026773762
17 20 17 0.2520080321
18 25 18 0.5010026738
19 40 19 0.0006671114
20 50 20 0.0006666667
21 40 21 3.0026666667
22 35 22 0.0013297872
23 20 23 0.7511597084
24 15 24 0.0013262599
25 20 25 0.7506605020
26 15 26 0.0013218771
27 20 27 0.0026402640
28 40 28 0.0006583278
29 45 29 0.0000000000
Additional columns
Code
If you also want to know the number of items in each group, then I suggest you use the plyr library:
library(plyr)
aggB <- function(b) {
if(length(b) == 1) return(b)
m <- outer(b, b, function(pm, pl) abs(pl/pm - 1))
return(max(m[upper.tri(m)]))
}
ddply(d, .(bucket), summarise,
percentile = percentile[1], n = length(b), maxr = aggB(b))
Output
This will give you the following result:
bucket percentile n maxr
1 1 5 4 0.8960891071
2 2 30 1 0.0013531800
3 3 35 1 0.0000000000
4 4 50 1 0.0040540541
5 5 55 1 0.0006729475
6 6 60 1 0.0020202020
7 7 70 1 0.0006720430
8 8 50 1 0.0006715917
9 9 35 2 2.0020174849
10 10 40 1 0.0053691275
11 11 25 2 1.0026737968
12 12 40 1 0.0006666667
13 13 35 1 0.0033355570
14 14 30 1 0.0000000000
15 15 35 1 0.0000000000
16 16 40 1 0.0026773762
17 17 20 2 0.2520080321
18 18 25 3 0.5010026738
19 19 40 1 0.0006671114
20 20 50 1 0.0006666667
21 21 40 2 3.0026666667
22 22 35 1 0.0013297872
23 23 20 3 0.7511597084
24 24 15 1 0.0013262599
25 25 20 2 0.7506605020
26 26 15 1 0.0013218771
27 27 20 1 0.0026402640
28 28 40 1 0.0006583278
29 29 45 1 0.0000000000
I am not sure to understand but here an attempt. My idea is to group data by centiles than do calculation on each group using by
To group data I create a new variable split
##dat$split <- cut(dat$a,seq(0, 1, 0.05),include.lowest=T)
dat$split <- c(0,cumsum(diff(dat$a) > 0.05))
Using by, I can performs my function en each group. I remove the singular cases of NULL prob values or one values.
by(dat,dat$split,FUN =function(x){
P <- x$b
if( is.null(P)||length(P) ==1) return(0)
nn <- length(P)
ind <- expand.grid(1:nn,1:nn) ## I generate indexes here
ret <- abs(P[ind[,1]]/P[ind[,2]]-1) ## perfom P_l/P_m-1 (vectorized)
list(P=P,
ret.max = max(ret),
ret.ind = ind[which.max(ret),])
})
Here the result list. For each interval I show ,
P ( Prob values),
The maximum return
The indexes from which this maximum is computed.
For example:
dat$split: 0
$P
[1] 0.0130 0.0014 0.0014 0.0020
$ret.max
[1] 8.6236
$ret.ind
Var1 Var2
5 1 2
---------------------------------------------------------------------------------------------------------------
dat$split: 1
$P
[1] 0.0014 0.0000
$ret.max
[1] 1
$ret.ind
Var1 Var2
2 2 1

Resources