I'm working with a cancer cohort using the 'survival' package. For each patient, there are values for survival (continuous, in days), censoring (0/1), and about 30 genes, each coded as "High", "Low", or "Neither".
With just two categories, running the survival analysis to get the Kaplan-Meier plots and log-rank test values is straightforward for me. My brain has been melting trying to figure out the correct code to compare only two of three groups (i.e. "High" vs. "Low" for a particular gene). If I use survdiff with, say, ~data$geneA, it only returns the statistic comparing all three groups. I basically want to exclude the "Neither" group. Can someone help me with the code to run survdiff on only two groups when there are three (or more) values?
Similarly (but less of an issue, as I can set the "Neither" group to a transparent color), how do I generate the KM plots for only two groups?
Edit: Some of the code I'm currently using
> B<-read.table("~/Desktop/breast.csv",header=T,sep=",")
> BR1=survfit(Surv(B$death,B$status)~B$GeneA)
> BR2=survfit(Surv(B$death,B$status)~B$GeneB)
And so on for all 30 genes. Then for the statistics and KM curves:
> survdiff(Surv(B$death,B$status)~B$GeneB,rho=0)
> plot(BR1, xlim=c(0,3000), col=c("yellow3", "blue3", "transparent"))
I understand I can use 'subset' to define one value, such as
> BR3=survfit(Surv(B$death,B$status)~B$GeneB, subset=B$GeneB=="High")
But can it work with two values? Doing what makes logical sense to me:
> BR4=survfit(Surv(B$death,B$status)~B$GeneB, subset=B$GeneB==c("Low", "High"))
doesn't work correctly; it splits up one of the groups into two?
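A likely cause is that `==` recycles the two-element vector along GeneB (row 1 is compared with "Low", row 2 with "High", and so on), so only about half of the matching rows are kept. A minimal sketch of one way around this, using `%in%` and dropping the unused level: the data frame here is simulated, and only the column names (death, status, GeneB) are taken from the question.

```r
library(survival)

# Hypothetical stand-in for the breast cohort: column names follow the
# question, the values are simulated.
set.seed(1)
n <- 300
B <- data.frame(
  death  = rexp(n, rate = 1 / 1000),   # survival time in days
  status = rbinom(n, 1, 0.7),          # 1 = event, 0 = censored
  GeneB  = factor(sample(c("High", "Low", "Neither"), n, replace = TRUE))
)

# Keep only the two groups of interest and drop the now-empty "Neither" level.
B2 <- droplevels(subset(B, GeneB %in% c("High", "Low")))

# Log-rank test (rho = 0) on High vs. Low only.
fit_diff <- survdiff(Surv(death, status) ~ GeneB, data = B2, rho = 0)
print(fit_diff)

# Kaplan-Meier curves for the same two groups.
fit_km <- survfit(Surv(death, status) ~ GeneB, data = B2)
plot(fit_km, xlim = c(0, 3000), col = c("yellow3", "blue3"))
```

Subsetting the data frame once and passing it via `data =` keeps the test and the plot on exactly the same rows, and the same pattern can be repeated per gene.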
I have a set of control and treated plots that were sampled over several years. I ran the prc function in the vegan package and want to perform a permutation test to check whether control and treated plots differ significantly over the years. As my data are unbalanced, I cannot use strata. My code looks like:
I've written this, but I'm sure it is wrong because the treatment is missing here:
h1 <- how(within = Within(type = "series", mirror = FALSE),
blocks = year, nperm = 999)
Any suggestions are greatly appreciated.
Under the null hypothesis, samples from the control or treated groups are exchangeable and hence you don't want them in the permutation design; you really want to permute them to generate the permutation-based null distribution for the test statistic.
The permutation design is there to indicate what isn't exchangeable.
You haven't explained why you want samples within the blocks to be permuted in series; why are samples within years also time series? If they're not, you don't want this.
You only need to worry about imbalance if you want to permute the strata. Whilst using blocks is similar in some respects to strata, blocks are never permuted so if you can use blocks you can use strata as you won't be permuting them.
If you want to permute the years as groups of samples, then you'll need strata and you'll need balance at the year level, which you don't have.
What you have defined with your call to how() is:
blocks = year groups samples by year, and as such samples will never be swapped between years, and
within = Within(type = "series") means samples within the levels of year will be permuted in series, keeping their temporal order intact after applying cyclic shift permutations.
If that's not what you want to do, you need to explain in words what you want to do. By "do" I mean what is it you want to test? What is your model in vegan?
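If the samples within a year are not themselves a time series, one design consistent with the points above is free permutation within years. The sketch below uses the permute package with a made-up, unbalanced layout; `year` and `treatment` are hypothetical vectors, not the asker's data.

```r
library(permute)

# Hypothetical unbalanced design: a different number of plots in each year.
year <- factor(rep(c(2001, 2002, 2003), times = c(4, 6, 5)))
treatment <- factor(c("control", "treated")[
  c(1, 1, 2, 2,  1, 1, 1, 2, 2, 2,  1, 1, 2, 2, 2)
])

# Samples are shuffled freely within each year and never moved between years.
h <- how(within = Within(type = "free"), blocks = year, nperm = 999)

set.seed(42)
perm <- shuffle(length(year), control = h)

# The treatment labels are what gets exchanged under the null hypothesis.
data.frame(year, treatment, permuted_treatment = treatment[perm])
```

A design object like `h` can then be supplied through the `permutations` argument of vegan's `anova()` on the fitted model. Whether this design matches the hypothesis being tested still depends on the answers to the questions above.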
I have obtained cycle threshold values (CT values) for some genes for diseased and healthy samples. The healthy samples were younger than the diseased. I want to check if the age (exact age values) are impacting the CT values. And if so, I want to obtain an adjusted CT value matrix in which the gene values are not affected by age.
I have checked various sources for confounding variable adjustment, but they all deal with categorical confounding factors (like batch effect). I can't get how to do it for age.
I have done the following:
modcombat = model.matrix(~1, data=data.frame(data_val))
modcancer = model.matrix(~Age, data=data.frame(data_val))
combat_edata = ComBat(dat=t(data_val), batch=Age, mod=modcombat, par.prior=TRUE, prior.plots=FALSE)
pValuesComBat = f.pvalue(combat_edata, modcancer, modcombat)
qValuesComBat = p.adjust(pValuesComBat,method="BH")
data_val is the gene expression/CT values matrix.
Age is the age vector for all the samples.
For some genes the p-value is significant. So how to correctly modify those gene values so as to remove the age effect?
I tried linear regression as well (upon checking some blogs):
lm1 = lm(data_val[1,] ~ Age) #1 indicates first gene. Did this for all genes
cor.test(lm1$residuals, Age)
The blog suggested checking the p-value of the correlation between the residuals and the confounding factor. I don't understand why one would test the correlation of the residuals with age.
And how to apply a correction to CT values using regression?
Please guide if what I have done is correct.
In case it's incorrect, kindly tell me how to obtain data_val with no age effect.
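For reference, the regression idea from the question can be sketched as follows, on simulated data. The matrix layout (genes in rows, samples in columns) is inferred from `data_val[1,]`; the gene names and the helper `adjust_for_age` are hypothetical. Ordinary least-squares residuals are uncorrelated with the regressor by construction, so a correlation test of residuals against Age will always come out near zero. The more informative check is the p-value of the Age coefficient itself.

```r
set.seed(1)
n_samples <- 40
Age <- round(runif(n_samples, 20, 80))

# Simulated CT matrix: genes in rows, samples in columns.
data_val <- rbind(
  geneA = 25 + 0.05 * Age + rnorm(n_samples, sd = 0.5),  # depends on age
  geneB = 28 + rnorm(n_samples, sd = 0.5)                # does not
)

# Per gene: regress CT on age, keep the residuals, add the gene mean back
# so the adjusted values stay on the original CT scale.
adjust_for_age <- function(ct, age) {
  fit <- lm(ct ~ age)
  residuals(fit) + mean(ct)
}
data_adj <- t(apply(data_val, 1, adjust_for_age, age = Age))

# p-value of the Age slope for each gene (the test of an age effect).
age_p <- apply(data_val, 1, function(ct) {
  summary(lm(ct ~ Age))$coefficients["Age", "Pr(>|t|)"]
})
age_p

# Correlation of the adjusted values with Age is zero up to rounding error.
cor(data_adj["geneA", ], Age)
```

Because the healthy samples are younger than the diseased ones, age and disease status are partly confounded, so removing the age trend this way may also remove some of the disease signal. Including disease status as a second term in the model is one way to limit that.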
There are several ways to approach this:
Basic statistical approach
A very basic way to account for the effect of Age and make the final dataset age-agnostic is:
Centre and scale your data based on Age. By this I mean: group your data by age, compute the mean of each group, and then standardise the data within each group using that mean.
For standardising you can use one of three methods:
1) z-score normalisation: transform each data point to (x - mean(x)) / sd(x), using the group mean and group standard deviation.
2) mean normalisation: simply subtract the group mean from every observation.
3) min-max normalisation: instead of the mean and standard deviation, use the minimum and the range of the group, i.e. (x - min(x)) / (max(x) - min(x)).
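A minimal sketch of the group-wise standardisation described above, on simulated values for a single gene. Age is continuous in the question, so it has to be binned first; the cut points used here are an arbitrary assumption.

```r
set.seed(1)
Age <- round(runif(60, 20, 80))
ct  <- 25 + 0.05 * Age + rnorm(60, sd = 0.5)  # simulated CT values, one gene

# Bin the continuous age; these break points are only an illustration.
age_group <- cut(Age, breaks = c(20, 40, 60, 80), include.lowest = TRUE)

# 1) z-score within each age group
ct_z <- ave(ct, age_group, FUN = function(v) (v - mean(v)) / sd(v))

# 2) mean normalisation within each age group
ct_centred <- ave(ct, age_group, FUN = function(v) v - mean(v))

# Group means of the z-scores are zero up to rounding error.
tapply(ct_z, age_group, mean)
```

Binning only removes differences between age groups. Any trend with age inside a bin remains, so narrower bins remove more of the trend, but the estimates within each bin become noisier.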
On to more complex statistics:
You can get the importance of all the features/columns in your dataset using algorithms like PCA (principal component analysis). Though it is generally used as a dimensionality reduction algorithm, it can also be used to examine the variance in the whole dataset and the contribution of each feature.
Below is a simple example explaining it:
I have plotted the importance using a biplot, using the decathlon2 dataset from the factoextra package:
library(factoextra)
data(decathlon2)
data <- decathlon2[, 1:10] # taking only 10 variables/columns for simplicity
res.pca <- prcomp(data, scale = TRUE)
fviz_pca_var(res.pca,
             col.var = "contrib", # Color by contributions to the PC
             gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"),
             repel = TRUE # Avoid text overlapping
)
hep.PC.cor = prcomp(data, scale=TRUE)
[1] "X100m" "Long.jump" "Shot.put" "High.jump" "X400m" "X110m.hurdle"
[7] "Discus" "Pole.vault" "Javeline" "X1500m"
Along similar lines, you can run PCA on your data to see how much the age parameter contributes.
I hope this helps; if I find more such methods I will share them.
I have two groups. The treatment group is exposed to media; the control group is not. They are distinguished by a categorical variable in the data frame (exposure to media = 1, no media = 0).
Now, I want to examine whether there are any clear differences between these two groups. To do this, apply the k-means algorithm with two clusters to four variables (proportion of black population, proportion of male population, proportion of hispanic population, median income on the logarithmic scale).
How to do this in R? Could anyone give some hints? Thanks!
Try this:
km <- kmeans(your_data, centers = 2, nstart = 10)
Here your_data is a data.frame or matrix (your whole data, or just the variables you are interested in). You need to choose the number of clusters (here 2). A good way to understand your data is to try different numbers of clusters and see which fits best (for example, using a criterion such as AIC or BIC).
k-means is a method for clustering data: the observations are assumed to come from different distributions, and we would like to know which distribution each observation comes from.
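As a rough sketch of how this could look for the four variables in the question: the data frame and its column names (media, prop_black, prop_male, prop_hisp, log_income) are assumptions and the values are simulated. The variables are scaled first because k-means works on Euclidean distance, so a variable on a larger scale would otherwise dominate.

```r
set.seed(123)
n <- 200

# Hypothetical data frame: column names are assumptions, values are simulated.
df <- data.frame(
  media      = rbinom(n, 1, 0.5),          # 1 = exposed to media, 0 = not
  prop_black = runif(n),
  prop_male  = runif(n, 0.4, 0.6),
  prop_hisp  = runif(n),
  log_income = log(rlnorm(n, meanlog = 10.8, sdlog = 0.4))
)

vars <- c("prop_black", "prop_male", "prop_hisp", "log_income")
X <- scale(df[, vars])                      # put variables on a common scale

km <- kmeans(X, centers = 2, nstart = 10)

# Do the two clusters line up with the treatment and control groups?
table(cluster = km$cluster, media = df$media)

# Cluster profiles on the original scale.
aggregate(df[, vars], by = list(cluster = km$cluster), FUN = mean)
```

If the cross-table shows each cluster containing a similar mix of treated and control units, the two groups do not differ clearly on these four variables.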
You can also have a look at many tutorials about kmeans in R. For example,
I have a community matrix (species as columns, samples as rows) from which I would like to generate a species accumulation curve (SAC) using the specaccum() and fitspecaccum() functions in R's vegan package. In order for the resulting SAC and the cumulative species richness at sample X to be comparable among regions (I have one community matrix per region), I need specaccum() to use the same number of sets within each region. My problem is that some regions have more sets than others. I would like to limit the sample size to the minimum number of sets among regions (in my case 45), so I would like specaccum() to randomly sample 45 sets, 100 times (permutations=100), for each region, sampling from the entire data set available for each region. The code below has not worked: it doesn't recognize "subset=45". The vegan documentation says "subset" needs to be logical. I don't understand how a subset number can be logical, but maybe I am misinterpreting what subset is. Is there another way to do this? Would it be sufficient to run specaccum() on all the sets available for each region and then just truncate the output to 45?
pool1<-specaccum(comm.matrix, gamma="jack1", method="random", subset=45, permutations=100)
Any help is much appreciated.
Why do you want to limit the function to work in a random sample of 45 cases? Just use the species accumulation up to 45 cases. Taking a random subset of 45 cases gives you the same accumulation, except for the random error of subsampling and throwing away information. If you want to compare your different cases, just compare them at the sample size that suits all cases, that is, at 45 or less. That is the idea of species accumulation models.
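A minimal sketch of reading the accumulation off at 45 sites, using vegan's BCI data (50 plots) as a stand-in for one region's community matrix. `n_min` is a hypothetical name for the smallest number of sets among regions.

```r
library(vegan)
data(BCI)  # 50 plots; stands in for one region's community matrix

set.seed(1)
sac <- specaccum(BCI, method = "random", permutations = 100)

n_min <- 45  # smallest number of sets among regions (from the question)

# Mean accumulated richness and its standard deviation at 45 sites.
data.frame(
  sites    = sac$sites[n_min],
  richness = sac$richness[n_min],
  sd       = sac$sd[n_min]
)
```

Running the same lines on each region's matrix and comparing the rows at `n_min` gives the comparison at a common sample size without discarding any data.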
The subset argument is intended for situations where you have a (possibly) heterogeneous collection of sampling units, and you want to stratify the data. For instance, if you want to see only the species accumulation in the "OldLow" habitat type of the Barro Colorado data, you could do:
library(vegan)
data(BCI, BCI.env)
plot(specaccum(BCI, subset = BCI.env$Habitat == "OldLow"))
If you want to have, say, a subset of 30 sample plots of the same data, you could do:
take <- c(rep(TRUE, 30), rep(FALSE, 20))
plot(specaccum(BCI)) # to see it all
# repeat the following to see how taking subset influences
lines(specaccum(BCI, subset = sample(take)), col = "blue")
If you repeat the last line, you see how taking a random subset influences the results: the lines are normally within the error bars of the full data, but differ from each other due to random error.
I have about 9k observations of 2 variables that I want to test for correlation. I was initially subsetting by value, which I had no issues with. I then realised that I wouldn't get a statistically significant correlation for some value groups because of their low observation count, so I have decided to group by quantiles instead. I can currently subset the top X% with no trouble, but am having difficulty figuring out how to group all the data into multiple percentile bands, i.e. 0-5%, 5-10%, 10-15%. Help much appreciated. Thanks, Jono
We can use the cut2 function in the Hmisc package:
library(Hmisc)
cut2(x, g=20)
It divides your data into 20 quantile groups (0-5%, 5-10%, and so on), as you wish.
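If you prefer to avoid the extra dependency, a base R alternative (not cut2 itself) is to combine quantile() with cut(). The sketch below uses simulated data with hypothetical columns x and y, and then computes the correlation within each 5% band.

```r
set.seed(1)
df <- data.frame(x = rnorm(9000))
df$y <- 0.3 * df$x + rnorm(9000)   # simulated second variable

# Break points at every 5th percentile of x: 0%, 5%, ..., 100%.
breaks <- quantile(df$x, probs = seq(0, 1, by = 0.05))
df$bin <- cut(df$x, breaks = breaks, include.lowest = TRUE)

# Number of observations per band.
table(df$bin)

# Correlation between x and y within each band.
cors <- sapply(split(df, df$bin), function(d) cor(d$x, d$y))
cors
```

With heavily tied data, some quantiles can coincide and cut() will then complain about non-unique breaks. In that case the duplicated break points have to be merged first, for example with unique(breaks), which gives fewer bands.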