How can I specify the carryover in the first period of a two-treatment three-period crossover study (ABB/BAA)? - r

I have found a lot of information on how to analyze a 2*2 (AB/BA) crossover trial; however, there are fewer materials on how to disentangle the carryover effect when the study is designed in three periods and two sequences (ABB/BAA). It is worth mentioning that A and B are the treatments and there have been wash-out phases between the three periods.
As sample data, I would like to use the bioequivalence data from "daewr" library.
library("daewr")
data(bioequiv)
head(bioequiv)
Group Subject Period Treat Carry y
1 1 2 1 A none 112.25
2 1 2 2 B A 106.36
3 1 2 3 B B 88.59
4 1 3 1 A none 153.71
5 1 3 2 B A 150.13
6 1 3 3 B B 151.31
The variable Carry contains lagged information from the previous period's Treatment.
The model below should be able to disentangle the effects, but I don't know how to replace the none in the Carry column. I am not sure how to specify this, or how to check if the carryover effect is negligible.
If we don't replace the none in the Carry column, the undermentioned model faces a problem of multicollinearity.
fit <- lmer(y ~ Period+Treat+Carry+(1|Subject), bioequiv)
anova(fit)
summary(fit)

Related

R: How to measure difference with both categorical and numeric features

I'm very new to data wrangling. And now I have this problem at hand:
So basically I have used tables of biochemical measurements (all numerical) of patients to perform cluster analysis, and by doing so I sorted them into 5 clusters.
Then I also have their clinical data/features, now I want to ask if any of these clinical features (a mix of numerical and categorical features) are significantly different from one cluster to another. So how can I go about this? What test shall I perform? Is there a good library I should be looking at?
To give you an idea about the "clinical data":
ClusterAssigned PatientID age sex stage FISH IGHV IgG ...
1 S134567 50 m 4 11q mutated scig
1 S234667 80 m 2 13q mutated 6.5
1 S135677 55 f 4 11q na scig
1 S356576 94 f 2 13q,t12 unmutated 5
1 S187978 59 m 4 11q mutated scig
4 S278967 80 f 2 17q unmutated 6.5
4 S123467 75 f 4 na unmutated 9.1
4 S234577 62 m 2 t12 mutated 9
.....
So you see the Cluster assigned is based on my cluster analysis. FISH, IGHV, IgG are categorical, and you can see there are sometimes na values and sometimes one person can have multiple entry "13q,t12".
In a discounted way, I can perhaps just take cluster 1 and 4 patients out, emit all na ones, and ask if there is a difference in their age, sex, FISH, IGHV...Still what's the method I can use here to perform such test in one go?
You can convert the categorical variables into dummy variables first and then perform a normal cluster analysis.
Things get more complicated if you have ordered categorical fields

How to calculate percentages for categorical variables by items?

I have a question about calculating the percentage by items and time bins. The experiment is like this:
I conduct an eye-tracking experiment. Participants were asked to describe pictures consisting of two areas of interest(AOIs; I name them Agent and Patient). Their eye movements (fixations on the two AOIs) were recorded along the time when they plan their formulation. I worked out a dataset included time information and AOIs as below (The whole time from the picture onset was divided into separate time bins, each time bin 40 ms).
Stimulus Participant AOIs time_bin
1 M1 agent 1
1 M1 patient 2
1 M1 patient 3
1 M1 agent 4
...
1 M2 agent 1
1 M2 agent 2
1 M2 agent 3
1 M2 patient 4
...
1 M3 agent 1
1 M3 agent 2
1 M3 agent 3
1 M3 patient 4
...
2 M1 agent 1
2 M1 agent 2
2 M1 patient 3
2 M1 patient 4
I would like to create a table containing the proportion of one AOI (e.g. agent) by each stimulus of each time bin. It would be like this:
Stimulus time_bin percentage
1 1 20%
1 2 40%
1 3 55%
1 4 60%
...
2 1 30%
2 2 35%
2 3 40%
2 4 45%
I calculate the percentage because I want to do a multilevel analysis (Growth Curve Analysis) investigating the relationship between the dependent variable agent fixation proportion and the independent variable time_bin, as well as with the stimulus as a random effect.
I hope I get my question understood, due to my limited English knowledge.
If you have an idea or a suggestion, that would be a great help!
Using the tidyverse package ecosystem you could try:
library(tidyverse)
df %>%
mutate(percentage = as.integer(AOIs == "agent") ) %>%
group_by(Stimulus, time_bin) %>%
summarise(percentage = mean(percentage))
Note that this will give you ratios in the [0, 1] interval. You still have to convert it to the percentage values by multiplying with 100 and appending "%".

Using multiple weights for lm() or glm() in R

I want to fit a model in R within which I need to apply two weights at the same time. Let's say my model is glm(y ~ x1 + male + East_Germany) where male identifies a respondent's gender and East Germany is also a binary variable checking whether someone lives in East Germany.
Now let's say both females and East Germans are dramatically missrepresented in my data. Assuming that it's not due to a flawed data collection process, I would have to apply two weights. But can I really specify two weights simultaneously like this...? glm(y ~ x1 + male + East_Germany, weight=c("male_wgt","east_wgt"))
My thinking was that applying one weight to the overall data or the overall model already changes up the whole data structure, but I might be wrong. Let's give you an example:
y x1 male east male_weight east_weight
5 4 0 1 5 2
3 2 1 1 1 2
9 7 1 0 1 1
4 8 1 0 1 1
1 3 1 0 1 1
6 4 1 0 1 1
...where male==1 means "male" and male==0 is "female" and we assume that both genders should be represented equally (50%), and where east==1 means "East Germany", east==0 is "West Germany" and here too, let's assume for reasons of simplicity that both should be represented equally. yand x are just random numbers.
I'm wondering how I would apply both weights simultaneously, because if I say "let's count row#1 five times so that females gain more weight", I'm at the same time giving East Germany too much weight (without having even applied east_weight). The reasons is that if we count row#1 five times, we will end up with a new East/West ratio of 6:4. Or am I mistaken?

R: Pairwise Matrix Manipulation & Variable Construction with Many Groups

I'm starting with data of scores at the "group-person" level as follows:
group_id person_id score
1 1 3
1 2 1
1 3 5
2 1 3
2 2 3
2 3 6
The goal is to generate data on person-person pairs that looks like the following:
person_id1 person_id2 sumsquarederror
1 2 4
1 3 13
2 3 25
where the "sumsquarederror" variable is defined as the sum across all groups of the squared differences in score values for each possible pair of persons. In mathspeak, this variable would be defined like: for persons i=1 and i=2 and groups j=(1,...,J)
sumsquarederror(i=1,i=2) = sum_j (( score(i=1) - score(i=2) )^2)
Building this data is trivial with small numbers of groups and persons, but I have roughly 1,000 groups and 150,000 persons, so creating matrices/dataframes for all combinations possible quickly becomes computationally burdensome (=150K by 150K by 1K, before collapsing to the sumsquarederror variable)
I'm guessing there might be some linear algebra approaches or regression-type ideas, but am stumped. Any tips or tricks or useful packages would be greatly appreciated!

Probability of account win/loss using Bayesian Statistics

I am trying to estimate the probability of winning or losing an account, and I'd like to do this using Bayesian Methods. I'm not really that familiar with these methods, but I think I understand the general idea.
I know some information about losses and wins. Wins are usually characterized by some combination of activities; losses are usually characters by a different combination of activities. I'd like to be able to get some posterior probability of whether or not a new observation will be won or lost based on the current number of activities that are associated with that account.
Here is an example of my data: (This is just a sample for simplicity)
Email Call Callback Outcome
14 9 2 1
3 2 4 0
16 14 2 0
15 1 3 1
5 2 2 0
1 1 0 0
10 3 5 0
2 0 1 0
17 8 4 1
3 15 2 0
17 1 3 0
10 7 5 0
10 2 3 0
8 0 0 1
14 10 3 0
1 9 3 1
5 10 3 1
13 5 1 0
9 4 4 0
So from here I know that 30% of the observations have an outcome of 1 (win) and 70% have an outcome of 0 (loss). Let's say that I want to use the other columns to get a probability of win/loss for a new observation which may have a small number of events (emails, calls, and callbacks) associated with it.
Now let's say that I want to use the counts/proportions of the different events as priors for a new observation. This is where I start getting tripped up. My thinking is to create a dirichlet distribution for wins and losses, so two separate distributions, one for wins and one for losses. Using the counts/proportions of events for each outcome as the priors. I guess I'm not sure how to do this in R. I think my course of action would be estimate a dirichlet distribution (since I have 3 variables) for each outcome using maximum likelihood. I've been trying to use the dirichlet.simul and dirichlet.mle functions from the sirt package in R. I'm not sure if I need to simulate one first?
Another issue is once I have this distribution, it's unclear to me how to get a posterior distribution of a new observation. I've read several papers and can't seem to find a straightforward process on how to do this. (Or maybe there's some holes in my understanding). Any pushes in the right direction would be greatly appreciated.
This is the code I've tried so far:
### FOR WON ACCOUNTS
set.seed(789)
N <- 6
probs <- c(0.535714286, 0.330357143, 0.133928571 )
alpha <- probs
alpha <- matrix( alpha , nrow=N , ncol=length(alpha) , byrow=TRUE )
x <- dirichlet.simul( alpha )
dirichlet.mle(x)
$alpha
[1] 0.3385607 0.2617939 0.1972898
$alpha0
[1] 0.7976444
$xsi
[1] 0.4244507 0.3282088 0.2473405
### FOR LOST ACCOUNTS
set.seed(789)
N2 <- 14
probs2 <- c(0.528037383,0.308411215,0.163551402 )
alpha2 <- probs2
alpha2 <- matrix( alpha2 , nrow=N , ncol=length(alpha2) , byrow=TRUE )
x2 <- dirichlet.simul( alpha2 )
dirichlet.mle(x2)
$alpha
[1] 0.3388486 0.2488771 0.2358043
$alpha0
[1] 0.8235301
$xsi
[1] 0.4114587 0.3022077 0.2863336
Not sure if this is a correct approach or how to get posteriors from here. I realize all the outputs look similar across won/lost accounts. I just used some simulated data to represent what I'm working with.

Resources