How to simulate a matrix based on both row and column parameters - r

I'm looking into simulating columns of normally distributed data whilst sticking to certain row-based parameters. Specifically, let's say I want to simulate 6 rows of data for 4 columns, where the final column is the sum of the previous 3 columns. Let's say I have the fourth column filled out, and I know what means and standard deviations I want for the other three columns. Is there a way for me to simulate this?
For a visual representation, my question is essentially how can I fill out the blanks in the following table:
x           y           z           total
?           ?           ?           17.42
?           ?           ?           11.95
?           ?           ?           15.85
?           ?           ?           15.93
?           ?           ?           14.78
?           ?           ?           17.19
---------   ---------   ---------   -----------
mean = 5    mean = 6    mean = 5    mean = 15.5
sd = 1.2    sd = 1.5    sd = 1.3    sd = 2
Simulating each column is of course simple enough with rnorm or something similar, but the row sums are then random, and it's important that I maintain control over the variance of the total column. The final column values don't actually need to be known if there's a way to simulate the 4 columns simultaneously, as long as the approximate mean and sd parameters are maintained and the 4th column is the sum of the first 3.
I've fiddled with different things such as mvrnorm or rnorm_multi, which give me a degree of control over the correlation of the columns, but only indirect, unreliable influence over the variance of the final column, which, again, is a crucial factor.
Any ideas?
EDIT
A bit more on my process. If I simulate a dataset with three variables, e.g. x, y and z, I can make sure these variables stick to certain means and variances. A brief example with rnorm:
library(tibble)
library(dplyr)

dat <- tibble(x = rnorm(200, 5, 1.3),
              y = rnorm(200, 6, 1.5),
              z = rnorm(200, 5, 1.7))
dat2 <- dat %>%
  mutate(total = rowSums(dat))
var(dat2$total)
If you run this piece of code, you can see that the variance of the total column changes pretty considerably from one set of simulations to the next. What I want is to be able to simulate data where I can specify the variance I want for the total column. My idea for this was to create the total column first and then somehow simulate the other columns (through something like rnorm), but giving it a row-wise parameter too. I might've been completely off track here; if so, I'll happily listen to other solutions.
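One possible way to get direct control over the total's variance (a minimal sketch, not a full solution) is to use the identity Var(x + y + z) = Var(x) + Var(y) + Var(z) + 2 * (sum of pairwise covariances) and build the covariance matrix passed to MASS::mvrnorm so that its entries add up to the target. The sketch below assumes the targets from the table above (sds of 1.2, 1.5 and 1.3, total sd of 2) and spreads the required covariance adjustment equally across the three pairs; these are illustrative choices, not part of the original question.
library(MASS)

sds    <- c(1.2, 1.5, 1.3)           # target sds for x, y, z
target <- 2^2                        # target variance for the total (sd = 2)
# spread the required covariance adjustment equally over the three pairs
off    <- (target - sum(sds^2)) / 6
Sigma  <- diag(sds^2)
Sigma[upper.tri(Sigma)] <- off
Sigma[lower.tri(Sigma)] <- off

sim <- as.data.frame(mvrnorm(200, mu = c(5, 6, 5), Sigma = Sigma))
names(sim) <- c("x", "y", "z")
sim$total <- rowSums(sim)
var(sim$total)   # close to the target of 4
Passing empirical = TRUE to mvrnorm makes the sample moments match the targets exactly rather than approximately.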

Related

Distribution of mean*standard deviation of sample from gaussian

I'm trying to assess the feasibility of an instrumental variable in my project with a variable I haven't seen before. The variable is essentially an interaction between the mean and standard deviation of a sample drawn from a Gaussian, and I'm trying to see what this distribution might look like. Below is what I'm trying to do; any help is much appreciated.
1. Generate a set of 1000 individuals with a variable x following the Gaussian distribution.
2. Draw 50 random samples of 5 individuals from this distribution, with replacement.
3. Calculate the mean and standard deviation of x for each sample.
4. Create an interaction variable named y by multiplying the mean and standard deviation of x for each sample.
5. Plot the distribution of y.
Beginner's version
There might be more efficient ways to code this, but this is easy to follow, I guess:
stat_pop <- rnorm(1000, mean = 0, sd = 1)
N = 50
# As Ben suggested, we create a data.frame filled with NA values
samples <- data.frame(mean = rep(NA, N), sd = rep(NA, N))
# Now we use a loop to populate the data.frame
for(i in 1:N){
# draw 5 samples from population (without replacement)
# I assume you want to replace for each turn of taking 5
# If you want to replace between drawing each of the 5,
# I think it should be obvious how to adapt the following code
smpl <- sample(stat_pop, size = 5, replace = FALSE)
# the data.frame currently has two columns. In each row i, we put mean and sd
samples[i, ] <- c(mean(smpl), sd(smpl))
}
# $ is used to get a certain column of the data.frame by the column name.
# Here, we create a new column y based on the existing two columns.
samples$y <- samples$mean * samples$sd
# plot a histogram
hist(samples$y)
Most functions here use positional arguments, i.e., you are not required to name every parameter. E.g., rnorm(1000, mean = 0, sd = 1) is the same as rnorm(1000, 0, 1) and even the same as rnorm(1000), since 0 and 1 are the default values.
Somewhat more efficient version
In R, explicit loops are often slower than vectorised alternatives and are thus commonly avoided. In the case of your question it does not make any noticeable difference, but for large data sets performance should be kept in mind. The following might be a bit harder to follow:
stat_pop <- rnorm(1000, mean = 0, sd = 1)
N = 50
n = 5
# again, I set replace = FALSE here; if you meant to replace each individual
# (so the same individual can be drawn more than once in each "draw 5"),
# set replace = TRUE
# replicate repeats the "draw 5" action N times
smpls <- replicate(N, sample(stat_pop, n, replace = FALSE))
# we transform the output and turn it into a data.frame to make it
# more convenient to work with
samples <- data.frame(t(smpls))
samples$mean <- rowMeans(samples)
samples$sd <- apply(samples[, c(1:n)], 1, sd)
samples$y <- samples$mean * samples$sd
hist(samples$y)
General note
Usually, you should do some research on the problem before posting here. Then you either find out how it works by yourself, or you can provide an example of what you tried. To this end, you can simply google each of the steps you outlined (e.g., google "generate random normal distribution R" to find out about the function rnorm()).
Run ?rnorm to get help on the function in RStudio.

R: Find cutoff point for continuous variable to assign observations to two groups

I have the following data
Species <- c(rep('A', 47), rep('B', 23))
Value<- c(3.8711, 3.6961, 3.9984, 3.8641, 4.0863, 4.0531, 3.9164, 3.8420, 3.7023, 3.9764, 4.0504, 4.2305,
4.1365, 4.1230, 3.9840, 3.9297, 3.9945, 4.0057, 4.2313, 3.7135, 4.3070, 3.6123, 4.0383, 3.9151,
4.0561, 4.0430, 3.9178, 4.0980, 3.8557, 4.0766, 4.3301, 3.9102, 4.2516, 4.3453, 4.3008, 4.0020,
3.9336, 3.5693, 4.0475, 3.8697, 4.1418, 4.0914, 4.2086, 4.1344, 4.2734, 3.6387, 2.4088, 3.8016,
3.7439, 3.8328, 4.0293, 3.9398, 3.9104, 3.9008, 3.7805, 3.8668, 3.9254, 3.7980, 3.7766, 3.7275,
3.8680, 3.6597, 3.7348, 3.7357, 3.9617, 3.8238, 3.8211, 3.4176, 3.7910, 4.0617)
D<-data.frame(Species,Value)
I have the two species A and B and want to find out the best cutoff point for Value to determine the species.
I found the following question:
R: Determine the threshold that maximally separates two groups based on a continuous variable?
and followed the accepted answer to find the best value with the dose.p function from the MASS package. I have several similar values and it worked for them, but not for the one given above (which is also the reason why I needed to include all 70 observations here).
D$Species_b<-ifelse(D$Species=="A",0,1)
my.glm<-glm(Species_b~Value, data = D, family = binomial)
dose.p(my.glm,p=0.5)
gives me 3.633957 as the threshold:
Dose SE
p = 0.5: 3.633957 0.1755291
This results in 45 correct assignments. However, if I look at the data, it is obvious that this is not the best value. By trial and error I found that 3.8 gives me 50 correct assignments, which is obviously better.
Why does the function work for other values, but not for this one? Am I missing an obvious mistake? Or is there maybe a different/better approach to solving my problem? I have several values I need to do this for, so I really do not want to just randomly test values until I find the best one.
Any help would be greatly appreciated.
I would typically use a receiver operating characteristic (ROC) curve for this type of analysis. It gives a visual and numerical assessment of how the sensitivity and specificity of your classification change as you adjust the threshold, letting you select the threshold at which overall accuracy is optimal. For example, using pROC:
library(pROC)
species_roc <- roc(D$Species, D$Value)
We can get a measure of how good a discriminator Value is for predicting Species by examining the area under the curve:
auc(species_roc)
#> Area under the curve: 0.778
plot(species_roc)
and we can find out the optimum cut-off threshold like this:
coords(species_roc, x = "best")
#> threshold specificity sensitivity
#> 1 3.96905 0.6170213 0.9130435
We see that this threshold correctly identifies 50 cases:
table(Actual = D$Species, Predicted = c("A", "B")[1 + (D$Value < 3.96905)])
#> Predicted
#> Actual A B
#> A 29 18
#> B 2 21
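If you want to verify that this cutoff maximises the number of correct assignments, a quick brute-force scan over candidate thresholds (just a sketch, not part of pROC) counts the correct assignments at each candidate, using the same assignment rule as the table above:
thresholds <- sort(unique(D$Value))
accuracy <- sapply(thresholds, function(t) {
  pred <- c("A", "B")[1 + (D$Value < t)]   # "B" below the threshold, "A" above
  sum(pred == D$Species)
})
thresholds[which.max(accuracy)]   # threshold with the most correct assignments
max(accuracy)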

How to decide best number of clusters for kamila clustering with R?

I have a mixed-type data set, so I wanted to try kamila clustering. It is easy to apply, but I would like a plot to help decide the number of clusters, similar to a knee plot.
library(kamila)

data <- read.csv("binarymat.csv", header = FALSE, sep = ";")
conInd <- c(9)
conVars <- data[, conInd]
conVars <- data.frame(scale(conVars))
catVarsFac <- data[, c(1, 2, 3, 4, 5, 6, 7, 8)]
catVarsFac[] <- lapply(catVarsFac, factor)
catVarsDum <- dummyCodeFactorDf(catVarsFac)
kamRes <- kamila(conVars, catVarsFac, numClust = 5, numInit = 10,
                 calcNumClust = "ps", numPredStrCvRun = 10, predStrThresh = 0.5)
summary(kamRes)
It says that the best number of clusters is 5. How does it decide that and can I see a plot indicating this?
From the kamila package documentation:
Setting calcNumClust to ’ps’ uses the prediction strength method of
Tibshirani & Walther (J. of Comp. and Graphical Stats. 14(3), 2005).
There is no perfect method for estimating the number of clusters; PS
tends to give a smaller number than, say, BIC based methods for large
sample sizes.
In your call, you have specified only one value for numClust. So it doesn't look like you are actually selecting the number of clusters - you have already picked one.
To select the number of clusters, you have to specify the range you are interested in, for example numClust = 2:7, as well as the method for selecting the number of clusters.
If you want kamila to do that selection, something like the following might work.
kamRes <- kamila(conVars, catVarsFac, numClust = 2 : 7, numInit = 10,
calcNumClust = "ps", numPredStrCvRun = 10, predStrThresh = 0.5)
Information on the selection of the number of clusters is now present in
kamRes$nClust, and plot(2:7, kamRes$nClust$psValues) could be what you are after.
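If you want something closer to a knee plot, the prediction-strength values can be plotted directly. A minimal sketch, assuming kamRes was fitted with numClust = 2:7 as above:
plot(2:7, kamRes$nClust$psValues, type = "b",
     xlab = "Number of clusters", ylab = "Prediction strength")
abline(h = 0.5, lty = 2)   # the predStrThresh used in the kamila() call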

selecting from a dataset using set probabilities in r

I am running some simulations for a selection experiment I am doing.
As part of this, I want to select from a dataset I've already made using probabilities to simulate selection.
I start by making an initial population using starting frequencies where the probability of getting a 1 is 0.25, a 2 is 0.5 and a 3 is 0.25; 1, 2 and 3 represent the 3 different genotypes.
N <- 400
my_prob = c(0.25,0.5,0.25)
N1=sample(c(1:3), N, replace= TRUE, prob=my_prob)
P1 <-data.frame(N1)
I now want to simulate selection in my population, where one homozygote is selected against and there is partial selection against heterozygotes, giving probabilities of ((1-s)^2, (1-s), 1), where s = 0.2 in this example.
Initially I was sampling each group individually using the sample_frac() function and then recombining the datasets.
library(dplyr)

s <- 0.2
S1homo <- filter(P1, N1 == 1) %>%
  sample_frac((1 - s)^2, replace = FALSE)
S1hetero <- filter(P1, N1 == 2) %>%
  sample_frac((1 - s), replace = FALSE)
S1others <- filter(P1, N1 == 3)
S1 <- rbind(S1homo, S1hetero, S1others)
The problem with this is that there isn't any variability in the numbers it returns, which is unrealistic. For example, S1homo will always return exactly 64% of the 1 values when I set s = 0.2, whereas in my initial populations there is some variability in the numbers you get for each value.
So I was wondering if there is a way to select from my P1 population using the set probabilities of ((1-s)^2,(1-s), 1) for the different genotypes so that I don't always get the exact same numbers being returned for each group being selected against.
I tried doing this using the sample() function I used before but I couldn't get it to work.
# sel is the expected fraction of the population that survives selection,
# so N * sel is the expected size of the new population
sel <- ((1 - s)^2 + 2 * (1 - s) + 1) / 4
S1 <- sample(P1, N * sel, replace = FALSE, prob = c((1 - s)^2, (1 - s), 1))
Error in sample.int(length(x), size, replace, prob) :
cannot take a sample larger than the population when 'replace = FALSE'
I am not 100% sure what you are trying to do, but if you want (1-s)^2 to be the probability that a randomly chosen element is included in the sample, rather than the exact percentage chosen, you can use sample_n rather than sample_frac, with an n which is randomly chosen to reflect that rate:
S1homo <- filter(P1, N1 == 1) %>%
  sample_n(rbinom(1, sum(N1 == 1), (1 - s)^2))
Using rbinom like that is perhaps a bit indirect, but I don't see another way to easily do it with %>%.
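If the %>% workflow isn't essential, another option (just a sketch of the same idea in base R) is to draw a Bernoulli survival indicator for each individual, with the probability set by its genotype, so the group sizes vary naturally between runs:
s <- 0.2
surv_prob <- c((1 - s)^2, (1 - s), 1)        # survival probabilities for genotypes 1, 2, 3
keep <- runif(nrow(P1)) < surv_prob[P1$N1]   # one Bernoulli draw per individual
S1 <- P1[keep, , drop = FALSE]
Selection then happens row by row, so there is no need to filter and rbind the genotype subsets afterwards.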

How to solve prcomp.default(): cannot rescale a constant/zero column to unit variance

I have a data set of 9 samples (rows) with 51608 variables (columns), and I keep getting an error whenever I try to scale it.
This works fine
pca = prcomp(pca_data)
However,
pca = prcomp(pca_data, scale = T)
gives
> Error in prcomp.default(pca_data, center = T, scale = T) :
cannot rescale a constant/zero column to unit variance
Obviously it's a little hard to post a reproducible example. Any ideas what the deal could be?
Looking for constant columns:
library(magrittr)  # for %>% (or library(dplyr))

sapply(1:ncol(pca_data), function(x){
  unique(pca_data[, x]) %>% length
}) %>% table
Output:
.
    2     3     4     5     6     7     8     9
 3892  4189  2124  1783  1622  2078  5179 30741
So no constant columns. Same with NAs:
is.na(pca_data) %>% sum
>[1] 0
This works fine:
pca_data = scale(pca_data)
But then afterwards both still give the exact same error:
pca = prcomp(pca_data)
pca = prcomp(pca_data, center = F, scale = F)
So why can't I manage to get a scaled PCA on this data? OK, let's make 100% sure that it's not constant.
pca_data = pca_data + rnorm(nrow(pca_data) * ncol(pca_data))
Same errors. Numeric data?
sapply(1:nrow(pca_data), function(row){
  sapply(1:ncol(pca_data), function(column){
    !is.numeric(pca_data[row, column])
  })
}) %>% sum
Still the same errors. I'm out of ideas.
Edit: some more details, and at least a hack that solves it.
Later, I was still having a hard time clustering this data, e.g.:
Error in hclust(d, method = "ward.D") :
NaN dissimilarity value in intermediate results.
Trimming values under a certain cutoff, e.g. < 1, to zero had no effect. What finally worked was trimming all columns that had more than x zeros in them. It worked for # zeros <= 6, but 7+ gave errors. No idea if this means that this is a problem in general or if this just happened to catch a problematic column. I'd still be happy to hear if anyone has any ideas why, because this should work just fine as long as no variable is all zeros (or constant in another way).
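In code, the workaround amounts to something like this sketch (max_zeros is the hand-tuned cutoff, 6 in the case above):
# keep only columns with at most max_zeros zero entries, then run the scaled PCA
max_zeros <- 6
keep <- colSums(pca_data == 0) <= max_zeros
pca <- prcomp(pca_data[, keep], scale. = TRUE)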
I don't think you're looking for zero-variance columns correctly. Let's try with some dummy data. First, an acceptable 10x100 matrix:
mat <- matrix(rnorm(1000, 0), nrow = 10)
And one with a zero-variance column. Let's call it oopsmat.
const <- rep(0.1,100)
oopsmat <- cbind(const, mat)
The first few elements of oopsmat look like this:
const
[1,] 0.1 0.75048899 0.5997527 -0.151815650 0.01002536 0.6736613 -0.225324647 -0.64374844 -0.7879052
[2,] 0.1 0.09143491 -0.8732389 -1.844355560 0.23682805 0.4353462 -0.148243210 0.61859245 0.5691021
[3,] 0.1 -0.80649512 1.3929716 -1.438738923 -0.09881381 0.2504555 -0.857300053 -0.98528008 0.9816383
[4,] 0.1 0.49174471 -0.8110623 -0.941413109 -0.70916436 1.3332522 0.003040624 0.29067871 -0.3752594
[5,] 0.1 1.20068447 -0.9811222 0.928731706 -1.97469637 -1.1374734 0.661594937 2.96029102 0.6040814
Let's try scaled and unscaled PCAs on oopsmat:
PCs <- prcomp(oopsmat) #works
PCs <- prcomp(oopsmat, scale. = T) #not forgetting the dot
#Error in prcomp.default(oopsmat, scale. = T) :
#cannot rescale a constant/zero column to unit variance
The scaling fails because a constant column has a standard deviation of zero, and you can't divide by zero. To identify the zero-variance column, we can use which as follows to get the variable name.
which(apply(oopsmat, 2, var)==0)
#const
#1
And to remove zero-variance columns from the dataset, you can use the same apply expression, keeping only the columns whose variance is not equal to zero.
oopsmat[ , which(apply(oopsmat, 2, var) != 0)]
Hope that helps make things clearer!
In addition to Joe's answer, check that the classes of the columns in your data frame are numeric.
If some columns are stored as integer64 (from the bit64 package), variance calculations on them can effectively come out as zero, causing the scaling to fail.
So if,
class(my_df$some_column)
is an integer64, for example, then do the following
my_df$some_column <- as.numeric(my_df$some_column)
Hope this helps someone.
The error is because one of the columns has constant values.
Calculate the standard deviation of all the numeric columns to find the zero-variance variables.
If the standard deviation is zero, you can remove the variable and compute the PCA.
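A minimal sketch of that suggestion, assuming pca_data is the all-numeric data from the question:
# drop columns whose standard deviation is zero, then run the scaled PCA
col_sds <- apply(pca_data, 2, sd)
pca <- prcomp(pca_data[, col_sds != 0], scale. = TRUE)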
