Selecting from a dataset using set probabilities in R

I am running some simulations for a selection experiment I am doing.
As part of this, I want to select from a dataset I've already made using probabilities to simulate selection.
I start by making an initial population using starting frequencies where the probability of getting a 1 is 0.25, a 2 is 0.5 and a 3 is 0.25. The values 1, 2 and 3 represent the three genotypes.
N <- 400
my_prob = c(0.25,0.5,0.25)
N1=sample(c(1:3), N, replace= TRUE, prob=my_prob)
P1 <-data.frame(N1)
I now want to simulate selection in my population, where one homozygote is selected against and there is partial selection against heterozygotes, so the survival probabilities are ((1-s)^2, (1-s), 1), where s = 0.2 in this example.
Initially I was sampling each group individually using the sample_frac() function and then recombining the datasets.
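library(dplyr)
s <- 0.2
# keep exactly (1-s)^2 of the genotype-1 homozygotes
S1homo <- filter(P1, N1 == 1) %>%
  sample_frac((1 - s)^2, replace = FALSE)
# keep exactly (1-s) of the heterozygotes
S1hetero <- filter(P1, N1 == 2) %>%
  sample_frac((1 - s), replace = FALSE)
# genotype 3 is not selected against
S1others <- filter(P1, N1 == 3)
S1 <- rbind(S1homo, S1hetero, S1others)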
The problem with this is that there isn't any variability in the numbers it returns, which is unrealistic. For example, S1homo will always contain exactly 64% of the 1 values when I set s = 0.2, whereas in my initial populations there is some variability in the counts for each value.
So I was wondering if there is a way to select from my P1 population using the set probabilities ((1-s)^2, (1-s), 1) for the different genotypes, so that I don't always get exactly the same numbers returned for each group being selected against.
I tried doing this using the sample() function I used before but I couldn't get it to work.
# sel is the expected fraction of the population that survives, so N*sel is the expected size of the new population
sel <-((1-s)^2 + 2*(1-s)+1)/4
S1 <-sample(P1, N*sel, replace=FALSE, prob=c((1-s)^2,(1-s),1))
Error in sample.int(length(x), size, replace, prob) :
cannot take a sample larger than the population when 'replace = FALSE'

I am not 100% sure what you are trying to do, but if you want (1-s)^2 to be the probability that a randomly chosen element is included in the sample, rather than the exact percentage chosen, you can use sample_n rather than sample_frac, with an n which is randomly chosen to reflect that rate:
S1homo<- filter(P1, N1==1) %>%
sample_n(rbinom(1,sum(N1==1),(1-s)^2))
Using rbinom like that is perhaps a bit indirect, but I don't see another way to easily do it with %>%.
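A more direct alternative, sketched here under the assumption that each individual survives independently with the probability attached to its genotype, is to draw one Bernoulli trial per row and keep the rows that come up 1. The names surv and keep are introduced only for this illustration. The number of survivors in each genotype is then binomial, which is the same distribution the rbinom/sample_n approach above produces.

```r
set.seed(123)                      # only to make the illustration reproducible
N <- 400
s <- 0.2

# initial population, as in the question
N1 <- sample(1:3, N, replace = TRUE, prob = c(0.25, 0.5, 0.25))
P1 <- data.frame(N1)

# survival probability for genotypes 1, 2 and 3 (hypothetical helper vector)
surv <- c((1 - s)^2, 1 - s, 1)

# one Bernoulli draw per individual, using the probability of its own genotype
keep <- rbinom(nrow(P1), size = 1, prob = surv[P1$N1]) == 1
S1 <- P1[keep, , drop = FALSE]

table(P1$N1)   # genotype counts before selection
table(S1$N1)   # genotype counts after selection (varies from run to run)
```

Because genotype 3 has survival probability 1, all of its rows are always kept; only the counts for genotypes 1 and 2 fluctuate.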

Related

How to simulate a matrix based on both row and column parameters

I'm looking into simulating columns of normally distributed data whilst sticking to certain row-based parameters. Specifically, let's say I want to simulate 6 rows of data for 4 columns, where the final column is the sum of the previous 3 columns. Let's say I have the fourth column filled out, and I know what I want for the means and standard deviations of the other three columns. Is there a way for me to simulate this?
For a visual representation, my question is essentially how can I fill out the blanks in the following table:
x          y          z          total
?          ?          ?          17.42
?          ?          ?          11.95
?          ?          ?          15.85
?          ?          ?          15.93
?          ?          ?          14.78
?          ?          ?          17.19
------     ------     ------     ------
mean = 5   mean = 6   mean = 5   mean = 15.5
sd = 1.2   sd = 1.5   sd = 1.3   sd = 2
Simulating each column is of course simple enough with rnorm or something similar, but the row sums are then random, and it's important that I maintain control over the variance of the total column. The final column values don't actually need to be known if there's a way to simulate the 4 columns simultaneously, as long as the approximate mean and sd parameters are maintained, and that the 4th column is a sum of the first 3.
I've fiddled with different things such as mvrnorm or rnorm_multi, which allows me a degree of control over the correlation of the columns, but only indirect unreliable influence over the variance of the final column, which again, is a crucial factor.
Any ideas?
EDIT
A bit more of my process, with a brief example. If I simulate a dataset with three variables, eg. x, y and z, I can make sure these variables stick to certain means and vars. A brief example with rnorm:
dat <- tibble(x = rnorm(200, 5, 1.3),
y = rnorm(200, 6, 1.5),
z = rnorm(200, 5, 1.7))
dat2 <- dat %>%
mutate(total = rowSums(dat))
var(dat2$total)
If you run this piece of code, you can see that the variance of the total column changes considerably between simulations. What I want is to be able to simulate data where I can specify the variance of the total column. My idea was to create the total column first and then somehow simulate the other columns (through something like rnorm) while also giving them a row-wise constraint. I might be completely off track here; if so, I'll happily listen to other solutions.
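One possible direction, offered only as a sketch and not as an answer from the original thread: the variance of a sum equals the sum of all entries of the covariance matrix, so the variance of total can be set by choosing the correlations between x, y and z. The example below assumes a single common correlation r (a simplifying assumption) and uses MASS::mvrnorm with empirical = TRUE so that the sample means and covariances match the targets exactly. Note that the mean of total is then necessarily 5 + 6 + 5 = 16, not 15.5, because means add.

```r
library(MASS)
set.seed(1)                        # only to make the illustration reproducible

n   <- 200
mu  <- c(x = 5, y = 6, z = 5)      # target column means
sds <- c(1.2, 1.5, 1.3)            # target column standard deviations
target_var <- 2^2                  # target variance of the total column

# Var(x + y + z) = sum(sds^2) + 2 * r * (sum of pairwise sd products);
# solve that equation for the common correlation r
pair_sum <- sum(outer(sds, sds)[upper.tri(diag(3))])
r <- (target_var - sum(sds^2)) / (2 * pair_sum)
stopifnot(r > -0.5, r < 1)         # range in which a 3x3 equicorrelation matrix is valid

# covariance matrix: sd on the diagonal scale, r as every off-diagonal correlation
Sigma <- diag(sds) %*% (matrix(r, 3, 3) + diag(1 - r, 3)) %*% diag(sds)

# empirical = TRUE makes the sample mean and covariance equal mu and Sigma
dat <- as.data.frame(mvrnorm(n, mu = mu, Sigma = Sigma, empirical = TRUE))
dat$total <- rowSums(dat)

var(dat$total)                     # equals target_var up to floating-point error
sapply(dat, sd)                    # column standard deviations
```

With empirical = FALSE the same code gives totals whose variance is only approximately the target, which may be preferable if sampling variability is wanted.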

How to decide best number of clusters for kamila clustering with R?

I have a mixed type data set, so I wanted to try kamila clustering. It is easy to apply it, but I would like a plot to decide the number of clusters similar to knee-plot.
library(kamila)
data <- read.csv("binarymat.csv", header = FALSE, sep = ";")
# column 9 is continuous, columns 1 to 8 are categorical
conInd <- c(9)
conVars <- data[, conInd]
conVars <- data.frame(scale(conVars))
catVarsFac <- data[, c(1, 2, 3, 4, 5, 6, 7, 8)]
catVarsFac[] <- lapply(catVarsFac, factor)
catVarsDum <- dummyCodeFactorDf(catVarsFac)
kamRes <- kamila(conVars, catVarsFac, numClust = 5, numInit = 10,
                 calcNumClust = "ps", numPredStrCvRun = 10, predStrThresh = 0.5)
summary(kamRes)
It says that the best number of clusters is 5. How does it decide that and can I see a plot indicating this?
From the kamila package documentation:
"Setting calcNumClust to 'ps' uses the prediction strength method of
Tibshirani & Walther (J. of Comp. and Graphical Stats. 14(3), 2005).
There is no perfect method for estimating the number of clusters; PS
tends to give a smaller number than, say, BIC based methods for large
sample sizes."
In your case, you have specified only one value for numClust, so you are not actually selecting the number of clusters; you have already picked one.
To select the number of clusters, you have to specify the range you are interested in, for example, numClust = 2 : 7 and also the method for selecting the number of clusters.
If you also want to select the number of clusters, something like the following might work.
kamRes <- kamila(conVars, catVarsFac, numClust = 2 : 7, numInit = 10,
calcNumClust = "ps", numPredStrCvRun = 10, predStrThresh = 0.5)
Information on the selection of the number of clusters is now present in
kamRes$nClust, and plot(2:7, kamRes$nClust$psValues) could be what you are after.

how to create a random loss sample in r using if function

I am working currently on generating some random data for a school project.
I have created a variable in R using a binomial distribution to determine if an observation had a loss yes=1 or not=0.
Afterwards I am trying to generate the loss amount using a random distribution for all observations which already had a loss (=1).
As my loss amount is a percentage, it can be anywhere between 0 and 1 (0 < x <= 1), so I looked at the beta distribution (see "What Is The Intuition Behind Beta Distribution" on stats.stackexchange).
In a third step I am looking for an if statement, which combines my two variables.
Please find below my code (which is only working for the Loss_Y_N variable):
Loss_Y_N = rbinom(1000000,1,0.01)
Loss_Amount = dbeta(x, 10, 990, ncp = 0, log = FALSE)
ideally I can combine the two into something like
if(Loss_Y_N=1 then Loss_Amount=dbeta(...) #... is meant to be a random variable with mean=0.15 and should be 0<x=<1
else Loss_Amount=0)
Any input highly appreciated!
Create a vector for your loss proportion. Fill up the elements corresponding to losses with draws from the beta. Tweak the parameters for the beta until you get the desired result.
N <- 100000
loss_indicator <- rbinom(N, 1, 0.1)
loss_prop <- numeric(N)
loss_prop[loss_indicator > 0] <- rbeta(sum(loss_indicator), 10, 990)
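As a sketch of the "tweak the parameters" step: a Beta(a, b) distribution has mean a / (a + b), so the mean of 0.15 asked for in the question can be obtained by fixing a + b and splitting it accordingly. The value 100 for a + b (called concentration here) is an arbitrary assumption; larger values give a tighter spread around the mean. Note that rbeta draws random values, whereas dbeta in the question evaluates the density.

```r
set.seed(1)                        # only to make the illustration reproducible
N <- 100000
p_loss <- 0.01                     # probability of a loss, as in the question

target_mean   <- 0.15              # desired mean loss proportion
concentration <- 100               # assumed value of a + b (controls the spread)
a <- target_mean * concentration
b <- (1 - target_mean) * concentration

loss_indicator <- rbinom(N, 1, p_loss)
has_loss <- loss_indicator == 1

# zero everywhere, beta draws only where a loss occurred
loss_prop <- numeric(N)
loss_prop[has_loss] <- rbeta(sum(has_loss), a, b)

mean(loss_prop[has_loss])          # close to target_mean
```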

K-means algorithm variation with minimum measure of size

I'm looking for some algorithm such as k-means for grouping points on a map into a fixed number of groups, by distance.
The number of groups has already been decided, but the tricky part (at least for me) is to meet the criterion that the sum of MOS in each group should be in a certain range, say bigger than 1. Is there any way to make that happen?
ID   MOS    X          Y
1    0.47   39.27846   -76.77101
2    0.43   39.22704   -76.70272
3    1.48   39.24719   -76.68485
4    0.15   39.25172   -76.69729
5    0.09   39.24341   -76.69884
I was intrigued by your question but was unsure how you might introduce some sort of random process into a grouping algorithm. It seems that the kmeans algorithm does indeed give different results if you permute your dataset (e.g. the order of the rows). I found this bit of info here. The following script demonstrates this with a random set of data. The plot shows the raw data in black and then, for each permutation (one color each), draws a segment from every point to the center of its cluster.
Since I'm not sure how your MOS variable is defined, I have added a random variable to the dataframe to illustrate how you might look for clusterings that satisfy a given criterion. The sum of MOS is calculated for each cluster and the result is stored in the MOS.sums object. In order to reproduce a favorable clustering, you can use the random seed value that was used for the permutation, which is stored in the seeds object. You can see that the permutations result in several different clusterings:
set.seed(33)
nsamples=500
nperms=10
nclusters=3
df <- data.frame(x=runif(nsamples), y=runif(nsamples), MOS=runif(nsamples))
MOS.sums <- matrix(NaN, nrow=nperms, ncol=nclusters)
colnames(MOS.sums) <- paste("cluster", 1:nclusters, sep=".")
rownames(MOS.sums) <- paste("perm", 1:nperms, sep=".")
seeds <- round(runif(nperms, min=1, max=10000))
plot(df$x, df$y)
COL <- rainbow(nperms)
for(i in seq(nperms)){
  set.seed(seeds[i])
  ORD <- sample(nsamples)
  K <- kmeans(df[ORD,1:2], centers=nclusters)
  MOS.sums[i,] <- tapply(df$MOS[ORD], K$cluster, sum)
  segments(df$x[ORD], df$y[ORD], K$centers[K$cluster,1], K$centers[K$cluster,2], col=COL[i])
}
seeds
MOS.sums
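To pick out the permutations whose clusterings satisfy the criterion, the rows of MOS.sums can be tested against a threshold. The sketch below repeats the script above without the plotting and adds that filtering step; min_sum and ok are hypothetical names, and the threshold of 70 is an arbitrary value chosen for this random data, not something from the question.

```r
set.seed(33)
nsamples  <- 500
nperms    <- 10
nclusters <- 3
df <- data.frame(x = runif(nsamples), y = runif(nsamples), MOS = runif(nsamples))

seeds <- sample.int(10000, nperms)           # one seed per permutation
MOS.sums <- matrix(NA_real_, nrow = nperms, ncol = nclusters)

for (i in seq_len(nperms)) {
  set.seed(seeds[i])
  ORD <- sample(nsamples)                    # permute the row order
  K <- kmeans(df[ORD, c("x", "y")], centers = nclusters)
  MOS.sums[i, ] <- tapply(df$MOS[ORD], K$cluster, sum)
}

min_sum <- 70                                # hypothetical lower bound per cluster
ok <- apply(MOS.sums, 1, function(s) all(s >= min_sum))

seeds[ok]                                    # seeds that reproduce acceptable clusterings
MOS.sums[ok, , drop = FALSE]                 # their per-cluster MOS sums
```

This only searches among the clusterings that kmeans happens to produce; it does not guarantee that any of them meets the constraint.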

Bootstrapping to compare two groups

In the following code I use bootstrapping to calculate the C.I. and the p-value under the null hypothesis that two different fertilizers applied to tomato plants have no effect on plant yields (the alternative being that the "improved" fertilizer is better). The first random sample (x) comes from plants where a standard fertilizer has been used, while the second sample (y) comes from plants where an "improved" one has been used.
x <- c(11.4,25.3,29.9,16.5,21.1)
y <- c(23.7,26.6,28.5,14.2,17.9,24.3)
total <- c(x,y)
library(boot)
diff <- function(x,i) mean(x[i[6:11]]) - mean(x[i[1:5]])
b <- boot(total, diff, R = 10000)
ci <- boot.ci(b)
p.value <- sum(b$t>=b$t0)/b$R
What I don't like about the code above is that resampling is done as if there was only one sample of 11 values (separating the first 5 as belonging to sample x leaving the rest to sample y).
Could you show me how this code should be modified in order to draw resamples of size 5 with replacement from the first sample and separate resamples of size 6 from the second sample, so that bootstrap resampling would mimic the “separate samples” design that produced the original data?
EDIT2:
Hack deleted, as it was a wrong solution. Instead, one has to use the strata argument of the boot function:
total <- c(x,y)
id <- as.factor(c(rep("x",length(x)),rep("y",length(y))))
b <- boot(total, diff, strata=id, R = 10000)
...
Be aware that this will not get you even close to a correct estimate of your p.value:
x <- c(1.4,2.3,2.9,1.5,1.1)
y <- c(23.7,26.6,28.5,14.2,17.9,24.3)
total <- c(x,y)
b <- boot(total, diff, strata=id, R = 10000)
ci <- boot.ci(b)
p.value <- sum(b$t>=b$t0)/b$R
> p.value
[1] 0.5162
How would you explain a p-value of 0.51 for two samples where all values of the second are higher than the highest value of the first?
The above code is fine for getting a (biased) estimate of the confidence interval, but the significance test for the difference should be done by permutation over the complete dataset.
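A minimal sketch of such a permutation test on the original data, assuming the one-sided alternative from the question (the "improved" fertilizer gives higher yields): the group labels are shuffled over all 11 values and the difference in means is recomputed each time. The names obs and perm are introduced for this illustration, and the +1 in numerator and denominator is a common convention that keeps the estimated p-value away from exactly zero.

```r
set.seed(42)                       # only to make the illustration reproducible
x <- c(11.4, 25.3, 29.9, 16.5, 21.1)
y <- c(23.7, 26.6, 28.5, 14.2, 17.9, 24.3)
total <- c(x, y)
n_x <- length(x)

obs <- mean(y) - mean(x)           # observed difference in means

R <- 10000
perm <- replicate(R, {
  shuffled <- sample(total)        # reassign all 11 values to the two groups at random
  mean(shuffled[-seq_len(n_x)]) - mean(shuffled[seq_len(n_x)])
})

# one-sided p-value: share of permuted differences at least as large as observed
p.value <- (sum(perm >= obs) + 1) / (R + 1)
p.value
```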
Following John, I think the appropriate way to use bootstrap to test if the sums of these two different populations are significantly different is as follows:
x <- c(1.4,2.3,2.9,1.5,1.1)
y <- c(23.7,26.6,28.5,14.2,17.9,24.3)
b_x <- boot(x, sum, R = 10000)
b_y <- boot(y, sum, R = 10000)
z<-(b_x$t0-b_y$t0)/sqrt(var(b_x$t[,1])+var(b_y$t[,1]))
pnorm(z)
So we can clearly reject the null that they are the same population. I may have missed a degree of freedom adjustment, I am not sure how bootstrapping works in that regard, but such an adjustment will not change your results drastically.
While the actual soil beds could be considered a stratified variable in some instances, this is not one of them. You only have the one manipulation, between the groups of plants. Therefore, your null hypothesis is that they really do come from the exact same population. Treating the items as if they're from a single set of 11 samples is the correct way to bootstrap in this case.
If you had two plots, and in each plot tried the different fertilizers over different seasons in a counterbalanced fashion, then the plots would be stratified samples and you'd want to treat them as such. But that isn't the case here.
