Sample size estimation for a three-armed clinical trial in R comparing the Restricted Mean Survival Time - r

I have a three-armed clinical trial with two treatments and one placebo, and I need to compare the Restricted Mean Survival Time (RMST) among the three arms.
I need to find the sample size for each arm (equal allocation).
I know that R has the SSRMST package for calculating sample sizes based on the RMST for a two-armed trial. The code as used is:
library(SSRMST)
ssrmst(ac_rate = ac_rate, ac_period = ac_period, tot_time = tot_time, tau = tau,
       shape0 = shape0, scale0 = scale0, shape1 = shape1, scale1 = scale1,
       margin = margin, seed = seed)
So my question is: how do I use this package to calculate the sample size for a three-armed trial (with equal allocation)?
How would the above code need to be modified?
Any guidance will be very helpful.
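For concreteness, here is one possible workaround (my own assumption, not something SSRMST documents for three arms): since ssrmst() only handles two arms, size each treatment-vs-placebo comparison separately at a Bonferroni-corrected alpha and reuse the placebo arm. All numeric values below are placeholders.
library(SSRMST)
n_per_arm <- 120                               # candidate per-arm sample size to check
ac_period <- 24                                # accrual period (months)
fit_A <- ssrmst(ac_rate   = 2 * n_per_arm / ac_period,  # placebo + treatment A accrue together
                ac_period = ac_period,
                tot_time  = 36,                # total study duration (months)
                tau       = 30,                # RMST truncation time
                shape0 = 1, scale0 = 10,       # Weibull survival, placebo arm (placeholder)
                shape1 = 1, scale1 = 14,       # Weibull survival, treatment A arm (placeholder)
                margin = 0,
                one_sided_alpha = 0.025 / 2,   # Bonferroni split across the two comparisons
                ntest = 2000, seed = 123)
fit_A   # printed summary reports the simulated power; repeat for treatment B vs placebo
        # and increase n_per_arm until both comparisons reach the target power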

Related

Calculate vaccine efficacy confidence interval using the exact method

I'm trying to calculate confidence intervals for vaccine efficacy studies.
All the studies I am looking at claim that they use the exact method and cite this free PDF: Statistical Methods in Cancer Research Volume II: The Design and Analysis of Cohort Studies. It is my understanding that the exact method is also sometimes called the Clopper-Pearson method.
The data I have are: person-years of vaccinated, person-years of unvaccinated, number of cases among vaccinated, and number of cases among unvaccinated.
Efficacy is easy to calculate:
Efficacy (%) = (1 - (cases among vaccinated / person-years of vaccinated) / (cases among unvaccinated / person-years of unvaccinated)) * 100
But calculating the confidence interval is harder.
At first I thought that this website gave the code I needed:
testall <- binom.test(8, 8 + 162)          # exact test on the case split: 8 vaccinated cases out of 170 total
(theta <- testall$conf.int)                # Clopper-Pearson CI for the proportion of cases that were vaccinated
(VE <- (1 - 2 * theta) / (1 - theta))      # convert to vaccine efficacy (valid when person-time is equal in both groups)
In this example, 8 is the number of cases in the vaccinated group and 162 is the number of cases in the unvaccinated group. But I have had a few problems with this.
(1) There are some studies where the sizes of the two cohorts (vaccinated vs. not vaccinated) are different. I don't think this code works for those cohorts.
(2) I want to be able to adjust the type of confidence interval. For example, one study used a "one-sided α risk of 2.5%", whereas another used a "two-sided α level of 5%". I'm not clear whether this affects the numbers.
Either way, when I tried to run the numbers, it didn't work.
Here is an example of a data set I am trying to validate:
Number of cases among vaccinated: 176
Number of cases among unvaccinated: 221
Person-years of vaccinated: 11,793
Person-years of unvaccinated: 5,809
Efficacy: 60.8%
Two-sided 95% CI: 52.0–68.0
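For what it's worth, one standard way to handle unequal person-time (this is my own sketch, not the code from the cited report) is to condition on the total number of cases: the count of vaccinated cases is then binomial with a probability determined by the rate ratio and the person-time split, and the exact (Clopper-Pearson) interval for that proportion can be transformed back to an interval for efficacy.
cases_v <- 176; cases_u <- 221               # cases among vaccinated / unvaccinated
py_v <- 11793;  py_u <- 5809                 # person-years at risk in each group
p_ci <- binom.test(cases_v, cases_v + cases_u, conf.level = 0.95)$conf.int  # exact CI for the case split
rr_ci <- (p_ci / (1 - p_ci)) * (py_u / py_v) # convert proportion to incidence-rate ratio
ve_ci <- rev(1 - rr_ci)                      # efficacy bounds (lower, upper)
round(100 * ve_ci, 1)
With these inputs the result should come out close to the published 52.0–68.0. Note that each Clopper-Pearson limit of a two-sided 95% interval is computed at 2.5% in one tail, so a "one-sided α risk of 2.5%" yields the same bound; for a genuinely different level, change conf.level.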

Metric for evaluating inter-rater agreement on a single subject rated by multiple raters

I'm building a rating survey in R (Shiny) and I'm trying to find a metric that can evaluate agreement for just one of the "questions" in the survey. The ratings range from 1 to 5. There are multiple raters, and each rater rates a set of 10 questions on that scale.
I've used Fleiss' kappa and Krippendorff's alpha for the whole set of questions and raters and they work, but when I evaluate each question separately these metrics give negative values. I tried calculating them by hand from the formulas and still get the same results, so I suspect they don't work for such a small sample of subjects (here, a sample of 1).
I've looked at other metrics, such as rwg in the multilevel package, but so far I can't make it work. According to the R documentation:
rwg(x, grpid, ranvar=2)
Where:
x = A vector representing the construct on which to estimate agreement.
grpid = A vector identifying the groups from which x originated.
Can someone explain to me what the rwg function expects as input?
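For reference, here is a minimal sketch of how rwg might be called for a single question (the ratings are made up; ranvar = (5^2 - 1)/12 = 2 is the expected variance of a uniform null on a 1-5 scale):
library(multilevel)
ratings <- c(4, 5, 4, 3, 4, 5)        # all raters' scores for the single question (hypothetical)
grp <- rep(1, length(ratings))        # a single group id, since there is only one item
rwg(ratings, grp, ranvar = 2)         # returns the group id, the rwg value, and the group size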
If someone knows of another agreement metric that might work better, please let me know.
Thanks.

How to calculate the discrete number of steps per feature of a dataset

I have been looking for a way to calculate the minimum number of samples, Ne(min), required to train a classification model when the dataset is not normally distributed. A research paper suggests the following:
If the data are not normally distributed, an exponential relationship between d and N will be assumed and the number of samples that are required may be as plentiful as:
Ne(min) = Dsteps^d
where Dsteps is the discrete number of steps per feature and d is the dimension of the dataset.
....
It is useful to think of a histogram approach to understand this relationship. If we want to construct a histogram from data with at least one sample in each bin and with Dsteps discrete steps per feature, we will require at least Dsteps^d samples. The number of samples required to model the data accurately is in this case an exponential function of d.
I would be very grateful if someone could help me calculate this measure: the discrete number of steps per feature.
An explanation with R or Matlab code would be very helpful. Thank you :D
Edit:
Paper reference: Christiaan Maarten Van Der Walt: Data Measure that Characterises Classification Problems, 2008.
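Since the quoted passage does not pin down how Dsteps is chosen, the following is only one plausible reading (the bin rule and the use of the maximum are my assumptions, not the paper's): treat Dsteps as the number of histogram bins a feature needs, for example from the Freedman-Diaconis rule, and take the largest value across features.
X <- iris[, 1:4]                            # example data with d = 4 numeric features
d <- ncol(X)
dsteps_per_feature <- sapply(X, nclass.FD)  # Freedman-Diaconis bin counts per feature
Dsteps <- max(dsteps_per_feature)           # conservative: the feature needing the most bins
Ne_min <- Dsteps^d                          # minimum samples under the exponential rule
Ne_min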

Fit negative binomial distribution in R

I have a data set derived from the sport Snooker:
https://www.dropbox.com/s/1rp6zmv8jwi873s/snooker.csv
Column "playerRating" can take the values from 0 to 1, and describes how good a player is:
0: bad player
1: good player
Column "suc" is the number of consecutive balls potted by each player with the specific rating.
I am trying to prove two things regarding the number of consecutive balls potted until the first miss:
The distribution of successes follows a negative binomial.
The number of successes depends on the player's rating, i.e. a really good player will manage to pot more consecutive balls.
I am using the "fitdistrplus" package to fit my data; however, I am unable to find a way of using "playerRating" as an input parameter.
Any help would be much appreciated!
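A minimal sketch of one way to tackle both points (assuming the CSV, once saved locally, has columns "suc" and "playerRating"): fitdistrplus fits a single distribution and cannot take covariates, so the rating effect can instead be checked with negative binomial regression via MASS::glm.nb.
library(fitdistrplus)
library(MASS)
snooker <- read.csv("snooker.csv")                 # the file from the Dropbox link, downloaded locally
fit_nb <- fitdist(snooker$suc, "nbinom")           # (1) does a negative binomial describe the counts overall?
plot(fit_nb)                                       # diagnostic plots against the fitted distribution
mod <- glm.nb(suc ~ playerRating, data = snooker)  # (2) does the mean count depend on rating?
summary(mod)                                       # a positive, significant coefficient supports the claim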

Determining optimum number of clusters for k-means with a large dataset

I have a matrix of 62 columns and 181,408 rows that I am going to cluster using k-means. What I would ideally like is a method of identifying the optimum number of clusters. I have tried implementing the gap statistic technique using clusGap from the cluster package (reproducible code below), but this produces several error messages relating to the size of the vector (122 GB) and memory.limit problems on Windows, and "Error in dist(xs) : negative length vectors are not allowed" on OS X. Does anyone have suggestions on techniques that will work for determining the optimum number of clusters with a large dataset? Or, alternatively, how can I make my code work (without taking several days to complete)? Thanks.
library(cluster)
inputdata <- matrix(rexp(11247296, rate = .1), ncol = 62)   # 181408 x 62 stand-in for the real data
clustergap <- clusGap(inputdata, FUN = kmeans, K.max = 12, B = 10)
At 62 dimensions, the result will likely be meaningless due to the curse of dimensionality.
k-means does a minimum SSQ assignment, which technically equals minimizing the squared Euclidean distances. However, Euclidean distance is known to not work well for high dimensional data.
If you don't know the number of clusters k to provide as a parameter to k-means, there are three ways to find it automatically:
G-means algorithm: it discovers the number of clusters automatically, using a statistical test to decide whether to split a k-means center into two. The algorithm takes a hierarchical approach: it tests the hypothesis that the data assigned to a center follow a Gaussian distribution (the continuous distribution that approximates the exact binomial distribution of events), and splits the center if they do not. It starts with a small number of centers, say one cluster (k=1), then splits it into two centers (k=2), and splits each of those again (k=4), giving four centers in total. If G-means does not accept these four centers, the answer is the previous step: two centers in this case (k=2). That is the number of clusters your dataset will be divided into. G-means is very useful when you do not have an estimate of the number of clusters you will get after grouping your instances; note that a poor choice of the "k" parameter might give you wrong results. The parallel version of G-means is called P-means. G-means sources:
source 1
source 2
source 3
x-means: an algorithm that efficiently searches the space of cluster locations and numbers of clusters to optimize the Bayesian Information Criterion (BIC) or the Akaike Information Criterion (AIC). This variant of k-means both finds the number k and accelerates k-means (a rough BIC-style sketch follows this list).
Online k-means or streaming k-means: it runs k-means in a single scan over the whole data and finds the optimal number k automatically. Spark implements it.
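Not x-means itself, but a rough illustration of the BIC idea behind it (my own simplification; the spherical-Gaussian criterion and nstart value are arbitrary choices): fit k-means for a range of k, score each fit with a BIC, and keep the k that minimizes it.
set.seed(1)
X <- scale(iris[, 1:4])                       # small example data
bic_kmeans <- function(X, k) {
  km <- kmeans(X, centers = k, nstart = 10)
  n <- nrow(X); d <- ncol(X)
  sigma2 <- km$tot.withinss / (n * d)         # pooled spherical variance estimate
  loglik <- -0.5 * n * d * log(2 * pi * sigma2) - 0.5 * n * d
  n_par <- k * d + 1                          # cluster centers plus one shared variance
  -2 * loglik + n_par * log(n)                # BIC (smaller is better)
}
bics <- sapply(1:8, function(k) bic_kmeans(X, k))
which.min(bics)                               # candidate number of clusters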
This is from RBloggers.
https://www.r-bloggers.com/k-means-clustering-from-r-in-action/
You could do the following:
data(wine, package="rattle")
head(wine)
df <- scale(wine[-1])
wssplot <- function(data, nc = 15, seed = 1234) {
  # For k = 1 the within-groups sum of squares is (n - 1) * sum of the column variances.
  wss <- (nrow(data) - 1) * sum(apply(data, 2, var))
  for (i in 2:nc) {
    set.seed(seed)
    wss[i] <- sum(kmeans(data, centers = i)$withinss)
  }
  plot(1:nc, wss, type = "b",
       xlab = "Number of Clusters",
       ylab = "Within groups sum of squares")
}
wssplot(df)
This produces an elbow plot of the within-groups sum of squares against the number of clusters.
From this you can choose the value of k to be either 3 or 4: there is a clear drop in the within-groups sum of squares when moving from 1 to 3 clusters. After three clusters the decrease levels off, suggesting that a 3-cluster solution may be a good fit to the data.
But as Anony-Mousse pointed out, the curse of dimensionality is still an issue here, because k-means relies on Euclidean distance.
I hope this answer helps you to a certain extent.
