I'm trying to find the variance in the infectivity trait of animals in different herds. Each herd contains a fixed number of offspring from 5 different sires.
Example of data:
Herd   S  C DeltaT  I sire1 I1 sire2 I2 sire3 I3 sire4 I4 sire5 I5
   1  20  0     14  1    13  0    26  0    46  0    71  0    91  1
   1   1  0     14  5    13  1    26  0    46  2    71  1    91  1
  18   4  0     14 13     2  5    52  4    84  2    87  2    98  0
  19  11  3     14 27     2  6    13  7    18  3    46  5    85  6
Herd is the herd name. S is the number of susceptible animals in the herd, C is the number of cases in the time interval, and DeltaT is the time interval length. sire# is the ID of a sire in the herd, and I# is the number of infected offspring of the corresponding sire#. Note that the sire ID "13" in column sire1 of the first two rows refers to the same sire as the "13" in column sire2 of the last row. Including these 5 sires as one random effect in a glmer from lme4 is getting me into trouble.
I tried:
glmer(data = GLMM_Data,
cbind(C, S-C) ~ (1 | Herd) + (1| (I1 | sire1) + (I2 | sire2) + (I3 | sire3) + (I4 | sire4) + (I5 | sire5)),
offset = log(GLMM_Data$I/nherds * GLMM_Data$DeltaT),
family = binomial(link="cloglog"))
This gave errors. So any help on combining these 10 columns in a single random factor would be more than welcome. Thanks in advance.
P.S. I know my offset, family and the left-hand side of the formula are working, since the analysis of susceptibility works.
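A sketch of one way to combine the ten sire/I columns into two (my illustration, not a confirmed fix for the model itself): stack the data into long format so that every herd-interval contributes one row per sire, giving a single sireID factor that can be used in one random-effect term such as (1 | sireID). Column names follow the example data; GLMM_Data is the data frame from the glmer call above, and I_sire is a name I made up for the per-sire infection count.
GLMM_long <- reshape(GLMM_Data,
                     direction = "long",
                     varying   = list(paste0("sire", 1:5), paste0("I", 1:5)),
                     v.names   = c("sireID", "I_sire"),
                     timevar   = "slot")       # 1..5: which sire column the row came from
GLMM_long$sireID <- factor(GLMM_long$sireID)   # the same ID is the same sire across herds
How C, S and the per-sire I_sire should then enter the cloglog model is a separate modelling question; this only shows the reshaping step.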
Related
I want to use R to write a model that will answer a general question about probability. The general question is below, followed by my specific questions about how to answer it using R code. If you know the answer to the general question (separate from the R code), and can explain the underlying statistical principles in plain English, I'm interested in that too!
Question: If I split a group of n objects, first through a 4-way splitter, then through a 7-way splitter (resulting in a total of 28 distinct groups), and each splitter results in a random distribution (i.e. the objects are split approximately equally), does the order of the splits impact the variance of the final 28 groups? If I split into 4 and then into 7, is that different from splitting into 7 and then into 4? Does the answer change if one splitter has greater variance than the other?
Specific R question: how can I write a model to answer this question? So far, I've tried using sample and rnorm to generate sample data. Simulating a 4-way splitter would look something like this:
sample(1:4, size=100000, replace=TRUE)
This is basically like rolling a 4-sided die 100,000 times and recording the number of instances of each number. I can use the table function to sum the instances, which gives me an output like this:
> table(sample(1:4, size=100000, replace=TRUE))
1 2 3 4
25222 24790 25047 24941
Now, I want to take each of those outputs and use them as the input for a 7-way split.
I tried saving the 4-way split as a variable and then plugging that vector into the size argument like this:
Split4way <- as.vector(table(sample(1:4, size=100000, replace=TRUE)))
as.vector(table(sample(1:7, size=Split4way, replace=TRUE)))
But when I do that, instead of a matrix with 4 rows and 7 columns, I just get a single vector of 7 values. It appears that the size argument for the 7-way split only uses 1 of the 4 outputs from the 4-way split instead of using each of them.
> as.vector(table(sample(1:7, size = Split4way, replace=TRUE)))
[1] 3527 3570 3527 3511 3550 3480 3588
So, how can I generate a table or list that shows all the outputs of the 4-way split followed by the 7-way split, for a total of 28 splits?
AND
Is there a function that will allow me to customize the standard deviation of each splitting device? For example, can I dictate that the outputs of the 4-way splitter have a standard deviation of x%, and the outputs of the 7-way splitter have a standard deviation of x%?
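On the narrow mechanical point, one workaround (my own sketch, separate from the answer below) is to run the 7-way split once for each count produced by the 4-way split, for example with sapply:
Split4way <- as.vector(table(sample(1:4, size = 100000, replace = TRUE)))
# One 7-way split per 4-way group; factor(..., levels = 1:7) keeps all 7 bins
# even if one of them happens to be empty, so the result is always a 7 x 4 matrix.
Split28 <- sapply(Split4way, function(n) {
  table(factor(sample(1:7, size = n, replace = TRUE), levels = 1:7))
})
Split28            # 7 rows (7-way outcomes) x 4 columns (4-way groups)
colSums(Split28)   # recovers the 4-way counts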
We can illustrate your set-up by writing a function that will simulate n objects being passed into the splitters.
Imagine the object comes first to the 4-splitter. Let us randomly assign it a number from one to four to determine which way it is split. Next it comes to a seven splitter; we can also randomly assign it a number from one to seven to determine which final bin it will end up in.
The set up looks like this:
Final bins
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28
| | | | | | | | | | | | | | | | | | | | | | | | | | | |
1 2 3 4 5 6 7 1 2 3 4 5 6 7 1 2 3 4 5 6 7 1 2 3 4 5 6 7
\__|__|__|__|__|_/ \__|__|__|__|__|_/ \__|__|__|__|__|_/ \__|__|__|__|__|_/
| | | |
seven splitter seven splitter seven splitter seven splitter
| | | |
1 2 3 4
\___________________|____________________|___________________/
|
four splitter
|
input
We can see that any unique pair of numbers will cause the object to end up in a different bin.
For the second set-up, we reverse the order, so that the seven splitter comes first, but otherwise each object still gets a unique bin based on a unique pair of numbers:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28
| | | | | | | | | | | | | | | | | | | | | | | | | | | |
1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4
\__|__|__/ \__|__|__/ \__|__|__/ \__|__|__/ \__|__|__/ \__|__|__/ \__|__|__/
| | | | | | |
4 splitter 4 splitter 4 splitter 4 splitter 4 splitter 4 splitter 4 splitter
| | | | | | |
1 2 3 4 5 6 7
\__________|___________|___________|___________|___________|__________/
|
7 splitter
|
input
Note that we can either draw a random 1:4 then a random 1:7, or vice versa, but in either case the unique pair will determine a unique bin. The actual bin the object ends up in will change depending on the order in which the two numbers are applied, but this will not change the fact that each bin will get 1/28 of the objects passed in, and the variance will remain the same.
That means to simulate and compare the two set ups, we need only sample from 1:4 and 1:7 for each object passed in, then apply the two numbers in a different order to calculate the final bin:
simulate <- function(n) {
df <- data.frame(fours = sample(4, n, replace = TRUE),
sevens = sample(7, n, replace = TRUE))
df$four_then_seven <- 7 * (df$fours - 1) + df$sevens
df$seven_then_four <- 4 * (df$sevens - 1) + df$fours
return(df)
}
So let's examine how this would play out for 10 objects passed in:
set.seed(69) # Makes the example reproducible
simulate(10)
#> fours sevens four_then_seven seven_then_four
#> 1 4 6 27 24
#> 2 1 5 5 17
#> 3 3 7 21 27
#> 4 2 2 9 6
#> 5 4 2 23 8
#> 6 4 3 24 12
#> 7 1 4 4 13
#> 8 3 2 16 7
#> 9 3 7 21 27
#> 10 3 2 16 7
Now let's do a table of the quantities in each bin if we had 100,000 draws:
s <- simulate(100000)
seven_four <- table(s$seven_then_four)
seven_four
#>
#> 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
#> 3434 3607 3539 3447 3512 3628 3564 3522 3540 3539 3544 3524 3552 3644 3626 3578
#> 17 18 19 20 21 22 23 24 25 26 27 28
#> 3609 3616 3673 3617 3654 3637 3542 3624 3568 3651 3486 3523
four_seven <- table(s$four_then_seven)
four_seven
#>
#> 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
#> 3434 3512 3540 3552 3609 3654 3568 3607 3628 3539 3644 3616 3637 3651 3539 3564
#> 17 18 19 20 21 22 23 24 25 26 27 28
#> 3544 3626 3673 3542 3486 3447 3522 3524 3578 3617 3624 3523
If you sort these two tables from smallest number to largest number in each bin, you will see they are actually identical apart from the labels on their bins. The distribution of counts is completely unchanged. This means the variance / standard deviation is also the same in both cases:
var(four_seven)
#> [1] 3931.439
var(seven_four)
#> [1] 3931.439
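As a quick check of the claim above (my addition, using the four_seven and seven_four tables already computed):
# The two orderings give the same multiset of bin counts
all(sort(as.vector(four_seven)) == sort(as.vector(seven_four)))
#> [1] TRUE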
The only way to change the variance / standard deviation is to "fix" the splitters so they do not have an equal probability.
I'm also struggling to interpret your use of variance and standard deviation. The best I can think of is doing this "splitting" non-uniformly.
As an alternative to Allan's code, you could generate non-uniform samples by doing:
# how should the alternatives be weighted (normalised probability is also OK)
a <- c(1, 2, 3, 4) # i.e. last four times as much as first
b <- c(1, 1, 2, 2, 3, 3, 4)
x <- sample(28, 10000, prob=a %*% t(b), replace=TRUE)
Note that prob is automatically normalised (i.e. divided by its sum) inside sample. You can check that things are working with:
table((x - 1) %% 4 + 1)   # should be close to a/sum(a) * 10000
table((x - 1) %/% 4 + 1)  # should be close to b/sum(b) * 10000
I have a large data frame with over 1 million observations. Two of my independent variables, A and B, have 18 and 72 numerically labelled categories respectively. For simplicity's sake, assume the categories are labelled 1-18 and 1-72. I'd like to partition all of my data into 36 groups, each covering a 6 x 6 block of categories (A 1-6 with B 1-6, A 1-6 with B 7-12, etc.).
Currently, I am using dplyr's mutate with 36 nested ifelse statements, such as mutate(partition = ifelse(A <= 6 & B <= 6, 1, ifelse(...))) but this is tedious and difficult to change should I want to make partitions of different sizes.
Another way of describing it is that there are 18 * 72 = 1296 unique combinations of parameters A and B, but I would like to partition these 1296 combinations into 36 groups of 36, with the flexibility to change the number of combinations per group and the number of groups.
I really feel like there should be a better way to partition my data, but nothing comes to mind immediately. The only other idea I have is to use expand.grid and use a join of sorts. What other methods exist that allow me to partition my data?
The below example is kind of how I would like my data to appear.
A B Partition
1 1 1
1 2 1
1 3 1
1 4 1
1 5 1
1 6 1
2 1 1
... ... ...
6 6 1
7 1 2
... ... ...
12 71 12
12 72 12
13 1 13
... ... ...
18 70 36
18 71 36
18 72 36
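One possible approach (my sketch, not from the original thread): since the partition only depends on which 6-wide block A and B fall into, it can be computed arithmetically, which also makes the block sizes easy to change. Here df stands for your data frame, and the numbering of the 36 groups (A-blocks varying fastest, to match rows like "7 1 2" above) is just one possible convention.
library(dplyr)

a_width <- 6   # block width for A (18 / 6 = 3 blocks)
b_width <- 6   # block width for B (72 / 6 = 12 blocks)

df <- df %>%
  mutate(partition = (ceiling(B / b_width) - 1) * (18 / a_width) +
                      ceiling(A / a_width))
Changing a_width and b_width changes the partition sizes without touching any ifelse chain.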
I'm trying to map column values of a data.frame object (consisting of a large number of bilateral trade records among 161 countries) to a 161 x 161 adjacency matrix (also of data.frame class) such that each cell represents the dyadic trade flow between any two countries.
The data looks like this
# load the data from dropbox folder
library(foreign)
example_data <- read.csv("https://www.dropbox.com/s/hf0ga22tdjlvdvr/example_data.csv?dl=1")
head(example_data, n = 10)
rid pid TradeValue
1 2 3 500
2 2 7 2328
3 2 8 2233465
4 2 9 81470
5 2 12 572893
6 2 17 488374
7 2 19 3314932
8 2 23 20323
9 2 25 10
10 2 29 9026220
length(unique(example_data$rid))
[1] 139
length(unique(example_data$pid))
[1] 161
where rid is the reporter id and pid is the (trade) partner id; a country's rid and pid are the same. Each id in the rid column is matched with multiple rows in the pid column, each with its own TradeValue.
However, there are some problems with this data. First, because countries (usually developing countries) that did not report trade statistics have no data to be extracted, their id(s) are absent in the rid column (such as country 1). On the other hand, those country id(s) may enter into pid column through other countries' reporting (in which case, the reporters tend to be developed countries). Hence, the rid column only contains some of the country id (only 139 out of 161), while the pid column has all 161 country id.
What I'm attempting to do is to map this example_data dataframe to a 161 x 161 adjacency matrix, using rid for rows and pid for columns, where each cell represents the TradeValue between two country id(s). To this end, there are a couple of things I need to tackle:
Fill in those country id(s) that are missing in the rid column of example_data and, temporarily, set all cell values in their respective rows to 0.
Building on the previous step, impute those "0" cells using bilateral trade statistics reported by other countries; if the corresponding statistics are still unavailable, leave those "0" cells as they are.
For example, for a 5-country dataframe of the following form
rid pid TradeValue
2 1 50
2 3 45
2 4 7
2 5 18
3 1 24
3 2 45
3 4 88
3 5 12
5 1 27
5 2 18
5 3 12
5 4 92
The desired output should look like this
pid_1 pid_2 pid_3 pid_4 pid_5
rid_1 0 50 24 0 27
rid_2 50 0 45 7 18
rid_3 24 45 0 88 12
rid_4 0 7 88 0 92
rid_5 27 18 12 92 0
but off the top of my head I could not figure out how. It would be really appreciated if someone could help me with this.
df1$rid = factor(df1$rid, levels = 1:5, labels = paste("rid",1:5,sep ="_"))
df1$pid = factor(df1$pid, levels = 1:5, labels = paste("pid",1:5,sep ="_"))
data.table::dcast(df1, rid ~ pid, fill = 0, drop = FALSE, value.var = "TradeValue")
# rid pid_1 pid_2 pid_3 pid_4 pid_5
#1 rid_1 0 0 0 0 0
#2 rid_2 50 0 45 7 18
#3 rid_3 24 45 0 88 12
#4 rid_4 0 0 0 0 0
#5 rid_5 27 18 12 92 0
The secrets/tricks:
use factor variables to tell R which values are possible, as well as their order.
in data.table's dcast, use fill = 0 (fill with zero where you have nothing) and drop = FALSE (make entries for factor levels that aren't observed).
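A hypothetical follow-up, not part of the answer above: the question also asks (step 2) to fill the rows of countries that never appear in rid using what their partners reported. One sketch is to build the matrix and take the elementwise maximum with its transpose, assuming both sides would report (roughly) the same TradeValue; cells reported by neither side stay 0.
wide <- as.data.frame(
  data.table::dcast(data.table::as.data.table(df1), rid ~ pid,
                    fill = 0, drop = FALSE, value.var = "TradeValue")
)
m <- as.matrix(wide[, -1])             # 5 x 5 numeric matrix in rid/pid level order
rownames(m) <- as.character(wide$rid)
m_sym <- pmax(m, t(m))                 # fill zeros from the partner's report
m_sym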
I am trying to do a two-way mixed factorial ANOVA with repeated measures. From:
aov(Estimate ~ Dose*Visit, data = AUClast)
I get 3 sums of squares: two main effects (Visit and Dose) and their interaction (Dose:Visit) which I figured out by hand are correct.
Both Dose and Visit are explanatory variables: Dose is a between-subjects variable with 4 levels (3, 10, 30, 100) and Visit is a within-subjects variable (the repeated measure) with 2 levels (1 and 28). The subject-ID variable is 'Animal'.
I want to include one more effect in the result but do not know how. The desired effect is the variance between Animals within Dose, or, as SAS puts it, Animal(Dose). Its sum of squares is calculated as
SS = sum over i, k of ( mean(Animal_ik) - mean(Dose_i) )^2
where k indexes the animals within dose i: for each Animal, average the Estimates of its Visit 1 and Visit 28 observations, subtract the mean Estimate of all animals in that Dose, square the difference, and sum over all Animals in the study.
Does anyone know how to adjust the formula accordingly to include the Animal(Dose) effect?
Thanks in advance for the help and sorry if all of this is too unspecific.
If I understand you correctly, I have a suggestion. First, a sample data set
#sample data
set.seed(15)
AUClast<-data.frame(
expand.grid(
Animal=1:3,
Dose=c(3,10,30,100),
Visit=c(1,28)
), Estimate=runif(24)
)
Now we calculate the requested term. First, we split the data into dosage groups; then, for each dose, we subtract the overall dose mean from the mean for each animal. Then we sum the squares of those differences. Finally, we expand them back out to the dose groups using unsplit.
animaldose<-unsplit(lapply(split(AUClast, AUClast$Dose), function(x) {
rep(
sum((tapply(x$Estimate, x$Animal, mean) - mean(x$Estimate))^2)
, nrow(x))
}), AUClast$Dose)
And we can see what that looks like next to the original data.frame
cbind(AUClast, animaldose)
Which gives the result
Animal Dose Visit Estimate animaldose
1 1 3 1 0.60211404 0.1181935
2 2 3 1 0.19504393 0.1181935
3 3 3 1 0.96645873 0.1181935
4 1 10 1 0.65090553 0.1641363
5 2 10 1 0.36707189 0.1641363
6 3 10 1 0.98885921 0.1641363
7 1 30 1 0.81519341 0.0419291
8 2 30 1 0.25396837 0.0419291
9 3 30 1 0.68723085 0.0419291
10 1 100 1 0.83142902 0.1881314
11 2 100 1 0.10466936 0.1881314
12 3 100 1 0.64615091 0.1881314
13 1 3 28 0.50909039 0.1181935
14 2 3 28 0.70662857 0.1181935
15 3 3 28 0.86231366 0.1181935
16 1 10 28 0.84178515 0.1641363
17 2 10 28 0.44744372 0.1641363
18 3 10 28 0.96466695 0.1641363
19 1 30 28 0.14118707 0.0419291
20 2 30 28 0.77671251 0.0419291
21 3 30 28 0.80372740 0.0419291
22 1 100 28 0.79334595 0.1881314
23 2 100 28 0.35756312 0.1881314
24 3 100 28 0.05800106 0.1881314
So you can see each dose group has its own adjustment.
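A hedged aside, separate from the answer above: if the goal is to get the Animal-within-Dose stratum directly out of aov(), the usual idiom is an Error() term for the subject. Animal IDs are reused across doses in the sample data, so a unique subject identifier is built first, and Dose and Visit are treated as factors; Subject is a name I introduce here.
AUClast$Subject <- interaction(AUClast$Dose, AUClast$Animal)
fit <- aov(Estimate ~ factor(Dose) * factor(Visit) + Error(Subject),
           data = AUClast)
summary(fit)
# The "Error: Subject" stratum tests Dose against Animal-within-Dose variation;
# the "Error: Within" stratum contains Visit and the Dose:Visit interaction.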
I have two vectors, a and b. See attached.
a is the signal and is a probability.
b is the absolute percentage change in the next period.
Signalt <- seq(0, 1, 0.05)
I would like to find the maximum absolute return occurring within each intermediate 5%-tile (Signalt) of the a vector. So if it is
0.01, 0.02, 0.03, 0.06, 0.07
then it should calculate the maximum return between
0.01 and 0.02,
0.01 and 0.03,
0.02 and 0.03.
Then move on to 0.06 and 0.07 and do it over, etc.
Output would then be combined in a matrix or table when the entire sequence has run.
It should follow the index of vectors a and b.
i is an index that is updated by one every time that a crosses into a new percentile, and τ(i) is the bucket associated with the ith cross.
a is the probability vector, which has length τ. This vector should be analyzed in its 5%-tiles, with the maximum intermediate absolute return being the output. The price change of the next period is the vector b, represented by P in the equation below.
l and m are indexes.
Every time Signal moves from one 5% tile to another, we compute the
largest absolute return that occurs between any two intermediate
buckets, until Signal moves to another 5% tile. For example, suppose
that Signal moves into the 85th percentile and 4 volume buckets later
moves into the 90th percentile. We would then calculate absolute
returns between buckets 1 and 2, 1 and 3, 1 and 4, 2 and 3, 2 and 4, 3
and 4. We are interested in the maximum absolute return. We would then
calculate the max return in the following percentile bucket, move on
to the next, which could be an 85th percentile and so on. So we let i
be an index that is updated by 1 every time that Signal moves from one
percentile into another, and τ(i) the bucket associated with the ith
cross.
The equation I am using is, up to slight differences in notation, r(i) = max over τ(i) ≤ m < l < τ(i+1) of | P(l) / P(m) − 1 |.
Now my question is how to go about this. Perhaps someone has an intuitive solution to this.
I hope my question is clear.
"a","b"
0,0.013013698630137
0,0.0013522650439487
0,0.00135409614082593
0,0.00203389830508471
0.27804813511593,0.00135317997293627
0.300237801284318,0
0.495965075167796,0.00405405405405412
0.523741892051237,0.000672947510094168
0.558753750296458,0.00202020202020203
0.665762829019002,0.000672043010752743
0.493106479913899,0.000671591672263272
0.344592579573497,0.000672043010752854
0.336263897823707,0.00201748486886366
0.35884763774257,0.00536912751677865
0.23662807979007,0.00133511348464632
0.212636893966841,0.00267379679144386
0.362212830513403,0.000666666666666593
0.319216408413927,0.00333555703802535
0.277670854167344,0
0.310143323100971,0
0.374104373036218,0.00267737617135211
0.190943075221511,0.00268456375838921
0.165770070508112,0.00200803212851386
0.240310208616952,0.00133600534402145
0.212418038918236,0.00200133422281523
0.204282022136019,0.00200534759358306
0.363725074298064,0.000667111407605114
0.451807761954326,0.000666666666666593
0.369296011692801,0.000666222518321047
0.37503495989363,0.0026666666666666
0.323386355686901,0.00132978723404265
0.189216171830472,0.00266311584553924
0.185252052821193,0.00199203187250996
0.174882909380997,0.000662690523525522
0.149291525540782,0.00132625994694946
0.196824215268048,0.00264900662251666
0.164611993131396,0.000660501981505912
0.125470998266484,0.00132187706543285
0.179999532586703,0.00264026402640272
0.368749638521621,0.000658327847267826
0.427799340926225,0
My interpretation of the question
I hope I understand your question correctly. Here is what I understood:
For each row you compute which 5% percentile it belongs to
Whenever that percentile changes, you start a new bucket
All rows from the same bucket result in a single resulting value
If there is only a single row in a bucket, the b value from that row is the resulting value
Otherwise, you compute all abs(b[l]/b[m]-1) where m < l and both rows belong to the same bucket, and take the maximum
Basic answer
Code
This code here does what I describe above:
# read the data (shortened, full data in OP)
d <- read.table(textConnection("a,b
0,0.013013698630137
[…]
0.427799340926225,0
"), sep=",", header=TRUE)
# compute percentile number for each line
d$percentile <- floor(d$a/0.05)*5 + 5
# start a new bucket whenever the percentile changes
d$bucket <- cumsum(c(1, diff(d$percentile) != 0))
# compute a single number for all rows of the same bucket
aggregate(b ~ percentile + bucket, d, function(b) {
if(length(b) == 1) return(b); # special case of only a single row
m <- outer(b, b, function(pm, pl) abs(pl/pm - 1)) # compare all pairs
return(max(m[upper.tri(m)])) # max over the pairs with m < l
})
Output
The result will look like this:
percentile bucket b
1 5 1 0.8960891071
2 30 2 0.0013531800
3 35 3 0.0000000000
4 50 4 0.0040540541
5 55 5 0.0006729475
6 60 6 0.0020202020
7 70 7 0.0006720430
8 50 8 0.0006715917
9 35 9 2.0020174849
10 40 10 0.0053691275
11 25 11 1.0026737968
12 40 12 0.0006666667
13 35 13 0.0033355570
14 30 14 0.0000000000
15 35 15 0.0000000000
16 40 16 0.0026773762
17 20 17 0.2520080321
18 25 18 0.5010026738
19 40 19 0.0006671114
20 50 20 0.0006666667
21 40 21 3.0026666667
22 35 22 0.0013297872
23 20 23 0.7511597084
24 15 24 0.0013262599
25 20 25 0.7506605020
26 15 26 0.0013218771
27 20 27 0.0026402640
28 40 28 0.0006583278
29 45 29 0.0000000000
Additional columns
Code
If you also want to know the number of items in each group, then I suggest you use the plyr library:
library(plyr)
aggB <- function(b) {
if(length(b) == 1) return(b)
m <- outer(b, b, function(pm, pl) abs(pl/pm - 1))
return(max(m[upper.tri(m)]))
}
ddply(d, .(bucket), summarise,
percentile = percentile[1], n = length(b), maxr = aggB(b))
Output
This will give you the following result:
bucket percentile n maxr
1 1 5 4 0.8960891071
2 2 30 1 0.0013531800
3 3 35 1 0.0000000000
4 4 50 1 0.0040540541
5 5 55 1 0.0006729475
6 6 60 1 0.0020202020
7 7 70 1 0.0006720430
8 8 50 1 0.0006715917
9 9 35 2 2.0020174849
10 10 40 1 0.0053691275
11 11 25 2 1.0026737968
12 12 40 1 0.0006666667
13 13 35 1 0.0033355570
14 14 30 1 0.0000000000
15 15 35 1 0.0000000000
16 16 40 1 0.0026773762
17 17 20 2 0.2520080321
18 18 25 3 0.5010026738
19 19 40 1 0.0006671114
20 20 50 1 0.0006666667
21 21 40 2 3.0026666667
22 22 35 1 0.0013297872
23 23 20 3 0.7511597084
24 24 15 1 0.0013262599
25 25 20 2 0.7506605020
26 26 15 1 0.0013218771
27 27 20 1 0.0026402640
28 28 40 1 0.0006583278
29 29 45 1 0.0000000000
I am not sure I understand, but here is an attempt. My idea is to group the data by percentiles, then do the calculation on each group using by.
To group the data I create a new variable, split.
##dat$split <- cut(dat$a,seq(0, 1, 0.05),include.lowest=T)
dat$split <- c(0,cumsum(diff(dat$a) > 0.05))
Using by, I can perform my function on each group. I remove the singular cases of NULL prob values or single values.
by(dat,dat$split,FUN =function(x){
P <- x$b
if( is.null(P)||length(P) ==1) return(0)
nn <- length(P)
ind <- expand.grid(1:nn,1:nn) ## I generate indexes here
ret <- abs(P[ind[,1]]/P[ind[,2]]-1) ## perform P_l/P_m - 1 (vectorized)
list(P=P,
ret.max = max(ret),
ret.ind = ind[which.max(ret),])
})
Here is the resulting list. For each interval I show:
P (the prob values),
the maximum return,
the indexes from which this maximum is computed.
For example:
dat$split: 0
$P
[1] 0.0130 0.0014 0.0014 0.0020
$ret.max
[1] 8.6236
$ret.ind
Var1 Var2
5 1 2
---------------------------------------------------------------------------------------------------------------
dat$split: 1
$P
[1] 0.0014 0.0000
$ret.max
[1] 1
$ret.ind
Var1 Var2
2 2 1