Algorithm to optimally define groups based on multiple responses in R

I have a scheduling puzzle that I am looking for suggestions/solutions using R.
Context
I am coordinating a series of live online group discussions where registered participants will be grouped according to their availability. In a survey, 28 participants (id) indicated morning, afternoon, or evening (am, after, pm) availability on days Monday through Saturday (18 possibilities). I need to generate groups of 4-6 participants who are available at the same time, without replacement (meaning they can only be assigned to one group). Once assigned, groups will meet weekly at the same time (i.e. Group A members will always meet Monday mornings).
Problem
Currently, group assignment is done manually (by a human), but with more participants, optimizing group assignment will become increasingly challenging. I am looking for an algorithm that efficiently achieves relatively equal group placements and respects other factors, such as a person's timezone.
Sample Data
Sample data are in long format, located in an R script here.
> str(x)
'data.frame': 504 obs. of 4 variables:
$ id : Factor w/ 28 levels "1","10","11",..: 1 12 22 23 24 25 26 27 28 2 ...
$ timezone: Factor w/ 4 levels "Central","Eastern",..: 2 1 3 4 2 1 3 4 2 1 ...
$ day.time: Factor w/ 18 levels "Fri.after","Fri.am",..: 5 5 5 5 5 5 5 5 5 5 ...
$ avail : num 0 0 1 0 1 1 0 1 0 0 ...
The first 12 rows of the data look like this:
> head(x, 12)
id timezone day.time avail
1 1 Eastern Mon.am 0
2 2 Central Mon.am 0
3 3 Mountain Mon.am 1
4 4 Pacific Mon.am 0
5 5 Eastern Mon.am 1
6 6 Central Mon.am 1
7 7 Mountain Mon.am 0
8 8 Pacific Mon.am 1
9 9 Eastern Mon.am 0
10 10 Central Mon.am 0
11 11 Mountain Mon.am 0
12 12 Pacific Mon.am 1
Ideal Solution
An algorithm to optimally define groups (size = 4 to 6) that exactly match on day.time and avail while minimizing differences on other more flexible factors (in this case timezone). In the final result, a participant should only exist in a single group.
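For illustration, here is a rough greedy sketch (my own heuristic, not a known-optimal algorithm) using the long-format data frame x described above: walk through the day.time slots, pool the unassigned participants available in each slot, sort them by timezone so groups stay as homogeneous as possible, and carve off groups of up to six.
assign_groups <- function(x, min.size = 4, max.size = 6) {
  assigned <- character(0)
  groups <- list()
  for (slot in levels(x$day.time)) {
    # unassigned participants who marked this slot as available,
    # sorted by timezone so groups come out as homogeneous as possible
    pool <- x[x$day.time == slot & x$avail == 1 & !(x$id %in% assigned), ]
    pool <- pool[order(pool$timezone), ]
    while (nrow(pool) >= min.size) {
      take <- as.character(head(pool$id, max.size))
      groups[[length(groups) + 1]] <- list(slot = slot, members = take)
      assigned <- c(assigned, take)
      pool <- pool[!(pool$id %in% take), ]
    }
  }
  groups  # participants left over in a slot stay eligible for later slots
}
A truly optimal assignment would be an integer program (packages such as lpSolve or ompr can express the size and matching constraints directly), but a greedy pass like this may be adequate at this scale.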

Okay, so I am not the most knowledgeable when it comes to this, but have you looked at the k-means clustering algorithm? You can specify the number of clusters you want and the variables for the algorithm to consider. It will then partition the data into the specified number of clusters, aka categories, for you.
What do you think?
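For what it's worth, a minimal sketch of that idea on the data above (kmeans() needs a numeric matrix, so first reshape to one row per participant with 0/1 availability columns; the number of centers is an arbitrary guess here):
# one row per participant, one 0/1 column per day.time slot
wide <- reshape(x[, c("id", "day.time", "avail")],
                idvar = "id", timevar = "day.time", direction = "wide")
set.seed(1)
km <- kmeans(wide[, -1], centers = 6)  # 6 clusters chosen arbitrarily
table(km$cluster)
Note that kmeans() enforces neither the 4-6 group sizes nor exact day.time matches, so it could only serve as a starting point.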
References:
https://datascienceplus.com/k-means-clustering-in-r/
http://www.sthda.com/english/wiki/cluster-analysis-in-r-unsupervised-machine-learning

Related

Cavs vs. Warriors - probability of Cavs winning the series includes combinations like "0,1,0,0,0,1,1" - but the series is over after game 5

There is a problem in DataCamp about computing the probability of winning an NBA series. The Cavs and the Warriors are playing a seven-game championship series. The first to win four games wins the series. Each team has a 50-50 chance of winning each game. If the Cavs lose the first game, what is the probability that they win the series?
Here is how DataCamp computed the probability using Monte Carlo simulation:
B <- 10000
set.seed(1)
results <- replicate(B, {
  x <- sample(0:1, 6, replace = TRUE)  # 0 when a game is lost, 1 when won
  sum(x) >= 4
})
mean(results)
Here is a different way they computed the probability using simple code:
# Assign a variable 'n' as the number of remaining games.
n <- 6
# Assign a variable 'outcomes' as a vector of possible game outcomes: 0 indicates a loss and 1 a win for the Cavs.
outcomes <- c(0, 1)
# Assign a variable 'l' to a list of all possible outcomes in all remaining games. Use the 'rep' function on 'list(outcomes)' to create a list of length 'n'.
l <- rep(list(outcomes), n)
# Create a data frame named 'possibilities' that contains all combinations of possible outcomes for the remaining games.
possibilities <- expand.grid(l)  # my comment: note how this produces 64 combinations
# Create a vector named 'results' that indicates whether each row in the data frame 'possibilities' contains enough wins for the Cavs to win the series.
rowSums(possibilities)
results <- rowSums(possibilities) >= 4
# Calculate the proportion of 'results' in which the Cavs win the series.
mean(results)
Question/Problem:
They both produce approximately the same probability of winning the series, ~0.34. However, there seems to be a flaw in the concept and the code design. For example, the code (sampling six times) allows combinations such as the following:
G2 G3 G4 G5 G6 G7 rowSums
0 0 0 0 0 0 0 # Series over after G4 (Cavs lose). No need for games G5-G7.
0 0 0 0 1 0 1 # Series over after G4 (Cavs lose). Double counting!
0 0 0 0 0 1 1 # Double counting!
...
1 1 1 1 0 0 4 # No need for games G6 and G7.
1 1 1 1 0 1 5 # Double counting! This is the same as 1,1,1,1,0,0.
0 1 1 1 1 1 5 # No need for game G7.
1 1 1 1 1 1 6 # Series over after G5 (Cavs win). Double counting!
> rowSums(possibilities)
[1] 0 1 1 2 1 2 2 3 1 2 2 3 2 3 3 4 1 2 2 3 2 3 3 4 2 3 3 4 3 4 4 5 1 2 2 3 2 3 3 4 2 3 3 4 3 4 4 5 2 3 3 4 3 4 4 5 3 4 4 5 4 5 5 6
As you can see, these sequences can never happen. After winning the first four of the remaining six games, no more games would be played. Similarly, after losing the first three of the remaining six games (four losses in total, counting game 1), no more games would be played. So these combinations shouldn't be included in the computation of the probability of winning the series. Some of the combinations are effectively double counted.
Here is what I did to omit some of the combinations that are not possible in real life.
library(dplyr)

outcomes <- c(0, 1)
l <- rep(list(outcomes), 6)
possibilities <- expand.grid(l)
possibilities <- possibilities %>%
  mutate(rowsums = rowSums(possibilities)) %>%
  filter(rowsums <= 4)
But then I am not able to omit the other unnecessary combinations. For example, I want to remove two of these three: (a) 1,0,0,0,0,0 (b) 1,0,0,0,0,1 (c) 1,0,0,0,1,1. This is because no more games would be played after three more losses in a row (four losses in total), so they are basically double counting.
There are too many conditions for me to be able to filter them individually. There has to be a more efficient and intuitive way to do this. Can someone provide me with some hints on how to solve this whole mess?
Here is a way:
library(dplyr)

outcomes <- c(0, 1)
l <- rep(list(outcomes), 6)
possibilities <- expand.grid(l)

possibilities %>%
  mutate(rowsums = rowSums(possibilities),       # number of wins in each sequence
         anti_sum = rowSums(!possibilities)) %>% # number of losses (0 coerces to TRUE)
  filter(rowsums <= 4, anti_sum <= 3)
We use the fact that R coerces numbers into logicals, where 0 becomes FALSE (and hence !0 is TRUE); see sum(!0) as a short example.
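A different sketch (my own, not part of the answer above): truncate each of the 64 sequences at the game that decides the series, de-duplicate, and weight each distinct series by its probability (1/2)^length. This removes the "impossible" completions while still giving the correct probability.
possibilities <- expand.grid(rep(list(c(0, 1)), 6))

truncate_series <- function(g) {
  wins <- cumsum(g)            # Cavs wins over the 6 remaining games
  losses <- cumsum(1 - g) + 1  # Cavs losses, counting the game-1 loss
  end <- which(wins == 4 | losses == 4)[1]  # series is always decided by game 7
  paste(g[seq_len(end)], collapse = ",")
}

real_series <- unique(apply(possibilities, 1, truncate_series))
games <- lengths(strsplit(real_series, ","))
cavs_win <- sapply(strsplit(real_series, ","), function(s) sum(s == "1") == 4)
sum(0.5^games[cavs_win])  # ~0.34, agreeing with the full enumeration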

Episode splitting in survival analysis by the timing of an event in R

Is it possible to split episodes by a given variable in survival analysis in R, similar to Stata's stsplit, used like this: stsplit var, at(0) after(time=time)?
I am aware that the survival package allows one to split episodes at given cut points, such as c(0,5,10,15), in survSplit(), but if a variable, say time of divorce, differs for each individual, then providing cut points for each individual would be impossible, and the split would have to be based on the value of a variable (say graduation, divorce, or job termination).
Is anyone aware of a package or know a resource I might be able to tap into?
Perhaps the Epi package is what you are looking for. It offers multiple ways to cut/split the follow-up time using Lexis objects. Here is the documentation of cutLexis().
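For reference, a minimal sketch of that approach; the data frame d and the columns entry.age, exit.age, event, and divorce.age are assumed example names, not from the question:
library(Epi)

# follow-up from entry age to exit age on the age timescale
lex <- Lexis(entry = list(age = entry.age),
             exit = list(age = exit.age),
             exit.status = event,
             data = d)

# split each record at the individual's age at divorce, moving the
# person to a new state from that point on
lex2 <- cutLexis(lex, cut = d$divorce.age, timescale = "age",
                 new.state = "Divorced")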
After some poking around, I think tmerge() in the survival package can achieve what stsplit var does, which is to split episodes not just at given cut points (the same for all observations) but at the time an event occurs for each individual.
This was the only way I knew how to split data:
library(survival)

id <- c(1, 2, 3)
age <- c(19, 20, 29)
job <- c(1, 1, 0)
time <- age - 16  ## create time since age 16 ##
data <- data.frame(id, age, job, time)
id age job time
1 1 19 1 3
2 2 20 1 4
3 3 29 0 13
## simple split by time ##
## 0 up to 2 years, 2-5 years, 5+ years ##
data2 <- survSplit(data, cut = c(0, 2, 5), end = "time", start = "start",
                   event = "job")
id age start time job
1 1 19 0 2 0
2 1 19 2 3 1
3 2 20 0 2 0
4 2 20 2 4 1
5 3 29 0 2 0
6 3 29 2 5 0
7 3 29 5 13 0
However, if I want to split by a certain variable, such as when each individual finished school, each person might have a different cut point (they finished school at different ages).
## split by a time-dependent variable (age finished school) ##
d1 <- data.frame(id, age, time, job)
scend <- c(17, 21, 24) - 16
d2 <- data.frame(id, scend)
## create start/stop times ##
base <- tmerge(d1, d1, id = id, tstop = time)
## create the time-dependent covariate ##
s1 <- tmerge(base, d2, id = id,
             finish = tdc(scend))
id age time job tstart tstop finish
1 1 19 3 1 0 1 0
2 1 19 3 1 1 3 1
3 2 20 4 1 0 4 0
4 3 29 13 0 0 8 0
5 3 29 13 0 8 13 1
I think tmerge() is more or less comparable to the stsplit function in Stata.

Creating time-varying covariates when time is different for each observation

I'm attempting to conduct survival analysis with time-varying covariates. The data comes from a longitudinal survey that is administered biannually, and currently looks something like this:
id event1yr event2yr income14 income16 income18 income20
 1     2014     2020        8       10       13        8
 2     2018       NA       13       15       24       35
In the case of my study, I am trying to begin time (t_0) at event1yr and measure time from that variable, which obviously differs for each observation. So, for instance, the time to event for observation 1 is 6 years, whereas the time for observation 2 is right-censored at 2 years. The main issue is pulling data from different time points, since the beginning time differs. For instance, income for years 0-2 (exclusive) for observation 1 would come from income14, but income for years 0-2 for observation 2 would come from income18. In the end, I'd like my data to look something like this:
id st.time end.time event2 censor inc
1 0 2 0 0 8
1 2 4 0 0 10
1 4 6 1 0 13
2 0 2 0 1 24
Thus, I'm trying to think of the best way to code to account for pulling the data from different points in time since the beginning reference time is not constant across observations.
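One possible sketch (not a definitive answer), assuming 2020 is the last survey wave and that each income column covers the two years starting at its survey year:
library(dplyr)
library(tidyr)

d <- data.frame(id = c(1, 2),
                event1yr = c(2014, 2018),
                event2yr = c(2020, NA),
                income14 = c(8, 13), income16 = c(10, 15),
                income18 = c(13, 24), income20 = c(8, 35))

d %>%
  pivot_longer(starts_with("income"), names_prefix = "income",
               names_to = "year", values_to = "inc",
               names_transform = list(year = as.numeric)) %>%
  mutate(year = 2000 + year,
         st.time = year - event1yr,        # window start relative to event1yr
         end.time = st.time + 2,
         last = coalesce(event2yr, 2020),  # event year, or censoring year if none
         event2 = as.integer(!is.na(event2yr) & end.time == event2yr - event1yr),
         censor = as.integer(is.na(event2yr) & end.time == last - event1yr)) %>%
  filter(st.time >= 0, end.time <= last - event1yr) %>%
  select(id, st.time, end.time, event2, censor, inc)
The key move is converting each survey year to time since each observation's own event1yr before filtering; this reproduces the desired rows above.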

Competing risk analysis of interval data

I am studying competing risks and use R.
I would like to use the model in Fine and Gray (1999), A proportional hazards model for the subdistribution of a competing risk, JASA, 94:496-509.
I found the cmprsk package.
However, I have an “interval data” configuration with a starting time t0 and an ending time t1 for each interval, t1 being the exit time, or the right-censoring time when it is the last interval for a given entity. Here is an extract of the dataset:
entity t0 t1 cov
1 0 3 12
1 3 7 4
1 7 9 1
2 2 3 2
2 3 10 9
3 0 10 11
4 0 1 0
4 1 6 21
4 6 7 12
...
I cannot find how to implement this with cmprsk, although it is implemented, for example, in the survival package (Surv(time, time2, ...)).
Is it possible to do it with cmprsk or should I go to another package?
I know that there is a Stata command (stcrreg) that does this, but I prefer working with R.
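For comparison, here is the counting-process form that the survival package accepts. This sketch fits a cause-specific Cox model, not the Fine-Gray subdistribution model; the data frame d and the event indicator status are assumptions, since the extract above does not show them.
library(survival)

# (start, stop] form that cmprsk::crr() does not accept; 'status' is an
# assumed event-type column (0 = censored, 1 = event of interest, 2 = competing)
fit <- coxph(Surv(t0, t1, status == 1) ~ cov, data = d, id = entity)
summary(fit)
For the subdistribution hazard itself, the finegray() function in the survival package builds a weighted dataset that can then be fed to coxph(); its documentation is worth checking for how it handles start-stop data.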

R, lme: specifying random effects for mixed model of before-after-gradient analysis

I'm trying to measure the biological impacts of an industrial development using a Before-After-Gradient approach. I am using a linear mixed model approach in R, and am having trouble specifying an appropriate model, especially the random effects. I've spent a lot of time researching this, but so far haven't come up with a clear solution--at least not one that I understand. I am new to LMM (and R for that matter) so would welcome any advice.
The response variables (for example, changes in abundance of key species) will be measured as a function of distance from the edge of impact, using plots established at fixed distances along multiple transects ("gradients") radiating out from the edge of the disturbance. Ideally, each plot would be sampled at multiple times both before and after the impact; however, for simplicity I'm starting by assuming the simplest case, where each plot is sampled once before and once after the impact. Assume also that the individual gradients are far enough apart that they can be considered spatially independent.
First, some simulated data. The effect here is linear instead of curvilinear, but you get the idea.
> str(bag)
'data.frame': 30 obs. of 5 variables:
$ Plot : Factor w/ 15 levels "G1-D0","G1-D100",..: 1 2 4 5 3 6 7 9 10 8 ...
$ Gradient: Factor w/ 3 levels "1","2","3": 1 1 1 1 1 2 2 2 2 2 ...
$ Distance: Factor w/ 5 levels "0","100","300",..: 1 2 3 4 5 1 2 3 4 5 ...
$ Period : Factor w/ 2 levels "After","Before": 2 2 2 2 2 2 2 2 2 2 ...
$ response: num 0.633 0.864 0.703 0.911 0.676 ...
> bag
Plot Gradient Distance Period response
1 G1-D0 1 0 Before 0.63258749
2 G1-D100 1 100 Before 0.86422356
3 G1-D300 1 300 Before 0.70262745
4 G1-D700 1 700 Before 0.91056851
5 G1-D1500 1 1500 Before 0.67637353
6 G2-D0 2 0 Before 0.75879579
7 G2-D100 2 100 Before 0.77981992
8 G2-D300 2 300 Before 0.87714158
9 G2-D700 2 700 Before 0.62888739
10 G2-D1500 2 1500 Before 0.83217617
11 G3-D0 3 0 Before 0.87931801
12 G3-D100 3 100 Before 0.81931761
13 G3-D300 3 300 Before 0.74489963
14 G3-D700 3 700 Before 0.68984485
15 G3-D1500 3 1500 Before 0.94942006
16 G1-D0 1 0 After 0.00010000
17 G1-D100 1 100 After 0.05338171
18 G1-D300 1 300 After 0.15846741
19 G1-D700 1 700 After 0.34909588
20 G1-D1500 1 1500 After 0.77138824
21 G2-D0 2 0 After 0.00010000
22 G2-D100 2 100 After 0.05801157
23 G2-D300 2 300 After 0.11422562
24 G2-D700 2 700 After 0.34208601
25 G2-D1500 2 1500 After 0.52606733
26 G3-D0 3 0 After 0.00010000
27 G3-D100 3 100 After 0.05418663
28 G3-D300 3 300 After 0.19295391
29 G3-D700 3 700 After 0.46279103
30 G3-D1500 3 1500 After 0.58556186
As far as I can tell, the fixed effects should be Period (Before, After) and Distance, treating Distance as continuous (not a factor) so we can estimate the slope. The interaction between Period and Distance (equivalent to the difference in slopes, before vs. after) measures the impact. I'm still scratching my head over how to specify the random effects. I assume I should control for variation among gradients, as follows:
library(nlme)
result <- lme(response ~ Distance + Period + Distance:Period,
              random = ~ 1 | Gradient, data = bag)
However, I suspect I may be missing some source of variation. For example, I'm not sure the above model controls for the re-sampling of individual plots before and after. Any suggestions?
With one sample per gradient, as you have, there's no need to specify random effects or anything about the gradients; you can do this with a straight multiple regression. Once you have multiple measures in each gradient, you can use the model you've specified, which says there's an expected main effect of gradient on the intercept of the model, while the effects (slopes) of Distance, Period, and their interaction are fixed.
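As a sketch of that plain-regression route (converting Distance from factor to numeric first, since bag stores it as a factor):
bag$Dist.num <- as.numeric(as.character(bag$Distance))
fit <- lm(response ~ Dist.num * Period, data = bag)
summary(fit)  # the Dist.num:Period term estimates the before-vs-after slope change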
You could specify additional random effects if you expect an appreciable amount of variability among gradients in your other predictors. I'm not sure how you do it in lme, or even if you can, but in lmer an example might be:
library(lme4)
lmer(response ~ Distance * Period + (1 + Distance | Gradient), data = bag)
That would allow the Distance slope to have a fixed-effect component plus a component that varies by gradient. You can look up further specification of random effects, but hopefully you see the general idea and can decide how complex to make your model.
