I would like to resample data with a weighted bootstrap for constructing a random forest.
The situation is as follows.
I have data consisting of normal subjects (N = 20000) and patients (N = 500).
I made a new data set with normal subjects (N = 2000) and patients (N = 500), because I conducted a certain experiment with those 2500 subjects.
As you can see, the normal subjects are a 1/10 subsample of the original data, while all of the patients were kept.
Therefore, I should give a weight to the normal subjects when running the machine learning algorithm.
Please let me know how I can bootstrap with weight in R.
Thank you.
It sounds like you really need stratified resampling rather than weighted resampling.
Your data are structured into two different groups of different sizes, and you would like to preserve that structure in your bootstrap. You didn't say what function you are applying to these data, so let's use something simple like the mean.
Generate some fake data, and take the (observed) means:
controls <- rnorm(2000, mean = 10)
patients <- rnorm(500, mean = 9.7)
mean(controls)
mean(patients)
Tell R we want to perform 200 bootstraps, and set up two empty vectors to store means for each bootstrap sample:
nbootstraps <- 200
boot_controls <- numeric(nbootstraps)
boot_patients <- numeric(nbootstraps)
Using a loop, we can draw resamples of the same size as in the original sample and calculate the means for each:
for (i in 1:nbootstraps) {
# draw bootstrap sample
new_controls <- controls[sample(1:2000, replace = TRUE)]
new_patients <- patients[sample(1:500, replace = TRUE)]
# send the mean of each bootstrap sample to boot_ vectors
boot_controls[i] <- mean(new_controls)
boot_patients[i] <- mean(new_patients)
}
Finally, plot the bootstrap distributions for group means:
p1 <- hist(boot_controls)
p2 <- hist(boot_patients)
plot(p1, col=rgb(0,0,1,1/4), xlim = c(9.5,10.5), main="")
plot(p2, col=rgb(1,0,0,1/4), add=T)
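Since the original question was about random forests: the randomForest package can draw a stratified bootstrap sample for each tree via its strata and sampsize arguments, which avoids hand-rolling the resampling. A minimal sketch (the data frame dat and the factor outcome status are assumed names, not from the question):
library(randomForest)
# Draw each tree's bootstrap sample stratified by outcome, e.g. 500 normals and
# 500 patients per tree; the order of sampsize follows the levels of the strata factor.
rf <- randomForest(
  status ~ .,
  data = dat,
  strata = dat$status,
  sampsize = c(500, 500)
)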
I have some heatmap data and I want a notion as to whether that heatmap is 'centered' around the middle of my image or skewed to one side (in R). My data are too big to give an example here, so here is some fake data of the same form (but in real life my intensity values are not uniformly distributed; I assume they are binned counts from an underlying multivariate normal distribution, but I don't know how to code that as a reproducible example).
library(tibble)
set.seed(42)
dd <- tibble(
  x = rep(0:7, each = 8),
  y = rep(0:7, 8),
  intensity = sample(0:10, 64, replace = TRUE)
)
The x value here is the horizontal index of a pixel, the y value is the vertical index of a pixel, and intensity is the value of that pixel according to a heatmap. I have managed to find a "centre" of the heatmap by marginalising these intensity values and finding the marginalised mean for x and y, but how would I perform a hypothesis test on whether the underlying multivariate normal distribution was centered around a certain point? In this case I would like a test statistic (more specifically a -log10 p-value) for whether the underlying multivariate normal distribution that generated this count data is centered around the point c(3.5, 3.5).
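For concreteness, a sketch of one way such a marginalised centre can be computed (assuming the tibble above is stored as dd; this is illustrative, not necessarily the exact code used):
# Intensity-weighted marginal means of the pixel indices
centre_x <- weighted.mean(dd$x, dd$intensity)
centre_y <- weighted.mean(dd$y, dd$intensity)
c(centre_x, centre_y)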
Furthermore, I would also like a test statistic (again, more specifically a -log10 p-value) as to whether the underlying distribution that generated the count data actually is multivariate normal.
This is all part of a larger pipeline where I would like to use dplyr and group_by to perform this test on multiple heatmaps at once so if it is possible to keep this in tidy format that would be great.
A little bit of googling finds this web page which suggests mvnormtest::mshapiro.test.
mshap <- function(z, nrow = round(sqrt(length(z)))) {
  mvnormtest::mshapiro.test(matrix(z, nrow = nrow))
}
mshap(dd$intensity)
If you want to make this more tidy-like, you could do something with map/nest/etc.
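A rough sketch of how that might look (the long data frame all_maps and its columns map_id, x, y, intensity are hypothetical names):
library(dplyr)
library(tidyr)
library(purrr)

results <- all_maps %>%
  group_by(map_id) %>%
  nest() %>%
  mutate(
    # multivariate-normality p-value per heatmap, then the -log10 transform
    mshap_p   = map_dbl(data, ~ mshap(.x$intensity)$p.value),
    neglog10p = -log10(mshap_p)
  )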
I'm not quite sure how to test the centering hypothesis (likelihood ratio test using mnormt::dmnorm ?)
Objective: The overall objective is to calculate the confidence interval (CI) of the mean for various sample sizes (n = 2, 4, ..., 1024) of rnorm draws, repeated 10,000 times each, and then count the number of times each CI fails, i.e. does not contain the true mean (this likely requires a counter and an if/else statement). Finally, the results are to be plotted.
I am trying to calculate CIs of the means for several simulated sample sizes; however, I am first trying to break down the code for one specific sample size, a = 8.
The problem I have is that I do not know how to fit a linear model for each row. Would anyone know how I can do this? Here is what I have so far:
a <- 8
n.sim.3 <- 10000
for (i in a) {
  r.mat <- matrix(rnorm(i * n.sim.3), nrow = n.sim.3, ncol = a)
  lm.tmp <- apply(r.mat, 1, lm(n.sim.3 ~ 1))  # the lm command is where I'm stuck; I don't think this is correct
  confint.tmp <- confint(lm.tmp)
}
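A sketch of one way to do what that lm line is reaching for: pass apply an anonymous function that fits an intercept-only model to each row, then collect the intervals.
# Fit an intercept-only model per row and gather the 95% CIs (slow but straightforward)
ci.mat <- t(apply(r.mat, 1, function(row) confint(lm(row ~ 1))))
# Count how many intervals fail to cover the true mean of 0
misses <- sum(ci.mat[, 1] > 0 | ci.mat[, 2] < 0)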
I have a set of simulated data that are roughly uniformly distributed. I would like to sample a subset of these data and for that subset to have a log-normal distribution with a (log)mean and (log)standard deviation that I specify.
I can figure out some slow brute-force ways to do this, but I feel like there should be a way to do it in a couple of lines using the plnorm function and the sample function with the "prob" argument set. I can't seem to get the behavior I'm looking for, though. My first attempt was something like:
probs <- plnorm(orig_data, meanlog = mu, sdlog = sigma)
new_data <- sample(orig_data, replace = FALSE, prob = probs)
I think I'm misinterpreting the way the plnorm function behaves. Thanks in advance.
If your orig_data are uniformly distributed between 0 and 1, then
new_data = qlnorm(orig_data, meanlog = mu, sdlog = sigma)
will give log-normally distributed data. If your data aren't between 0 and 1 but, say, between a and b, then first rescale:
orig_data = (orig_data - a) / (b - a)
Generally speaking, uniform random variables between 0 and 1 can be treated as probabilities, so if you want to sample from a given distribution with them, you use the corresponding q... function, i.e. take the corresponding quantile.
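A small end-to-end illustration of that idea (the bounds a and b and the log-normal parameters here are made up):
# Inverse-transform sketch: rescale uniforms to [0, 1], then map through the log-normal quantile function
set.seed(1)
a <- 2; b <- 5
orig_data <- runif(1000, min = a, max = b)
u <- (orig_data - a) / (b - a)
new_data <- qlnorm(u, meanlog = 0, sdlog = 1)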
Thanks guys for the suggestions. While they get me close, I've decided on a slightly different approach for my particular problem, which I'm posting as the solution in case it's useful to others.
One detail I left out of the original question is that I have a whole data set (stored as a data frame), and I want to resample rows from that set such that one of the variables (columns) is log-normally distributed. Here is the function I wrote to accomplish this, which relies on dlnorm to calculate probabilities and sample to resample the data frame:
resample_lognorm <- function(origdataframe, origvals, meanlog, sdlog, n) {
  # sampling probabilities from the target log-normal density
  prob <- dlnorm(origvals, meanlog = log(10) * meanlog, sdlog = log(10) * sdlog)
  # draw n rows (without replacement) with those probabilities
  newsamp <- origdataframe[sample(nrow(origdataframe),
                                  size = n, replace = FALSE, prob = prob), ]
  return(newsamp)
}
In this case origdataframe is the full data frame I want to sample from, and origvals is the column of data I want to resample to a log-normal distribution. Note that the log(10) factors in meanlog and sdlog are there because I want the distribution to be log-normal in base 10, not natural log.
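For example, a call might look like this (the data frame mydf, its column conc, and the parameter values are made-up illustrations):
sub <- resample_lognorm(mydf, mydf$conc, meanlog = 1, sdlog = 0.3, n = 500)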
I am running a meta-analysis and use the metafor package to calculate Fisher z-transformed values from correlations.
meta1 <- escalc(ri=TESTR, ni=N, measure="ZCOR", data=subdata2)
As some of the studies I include in my meta-analysis overlap in samples (i.e. in Study XY, 5 effect sizes are reported from the same N), I need to calculate means of the standardized z-values. To indicate overlapping samples, I gave all effect sizes IDs (in Excel) which are equal if the samples overlap.
To run the final meta-analysis, I would like R to sum the standardized effect sizes within each ID and calculate their means.
So the idea is:
If Effect_SIZE_ID (a variable) is the same in two rows of my data frame, then sum both effect sizes and divide by two (i.e. calculate the mean). Provide this result in a new column.
As I am a complete newbie, please let me know if you require further specification!
Thank you so much in advance.
Leon
Have a look at the summaryBy command in the doBy package.
mymean <- summaryBy(SD_effect ~ ID, FUN = mean, data = data)
This should work in general (if you provide some sample data it is easy to check that it does what you need).
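If you want the group mean as a new column of the original data frame (as described above) rather than as a summary table, base R's ave() is one option (column names assumed from the snippet above):
# per-ID mean of the effect sizes, repeated on every row of that ID
data$mean_effect <- ave(data$SD_effect, data$ID, FUN = mean)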
I am trying to generate a random set of numbers that exactly mirrors a data set that I have (in order to test it). The data set consists of 5 variables that are all correlated, with different means and standard deviations as well as ranges (they are Likert scales added together to form one variable each). I have been able to get mvrnorm from the MASS package to create a data set that replicates the correlation matrix with the observed number of observations (after 500,000+ iterations), and I can easily reassign means and standard deviations through z-score transformation, but I still have specific values within each variable vector that are far above or below the possible range of the scale whose scores I wish to replicate.
Any suggestions how to fix the range appropriately?
Thank you for sharing your knowledge!
To generate a sample that does "exactly mirror" the original dataset, you need to make sure that the marginal distributions and the dependence structure of the sample matches those of the original dataset.
A simple way to achieve this is with resampling:
my.data <- matrix(runif(1000, -1, 2), nrow = 200, ncol = 5)  # some dummy data
my.ind <- sample(1:nrow(my.data), nrow(my.data), replace = TRUE)  # bootstrap row indices
my.sample <- my.data[my.ind, ]  # resampling whole rows keeps the joint structure
This will ensure that the margins and the dependence structure of the sample (closely) matches those of the original data.
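A quick sanity check of that claim (sketch):
# correlations of the resample should track those of the original data
round(cor(my.sample) - cor(my.data), 2)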
An alternative is to use a parametric model for the margins and/or the dependence structure (copula). But as stated by @dickoa, this will require serious modeling effort.
Note that by using a multivariate normal distribution, you are (implicitly) assuming that the dependence structure of the original data is the Gaussian copula. This is a strong assumption, and it would need to be validated beforehand.