If I have a large dataset in R, how can I take a random sample that takes into account the distribution of the original data, particularly if the data are skewed and only 1% belong to a minority class, and I want to take a biased sample of the data?
The sample(x, size, replace = FALSE, prob = NULL) function takes a sample of size size from the vector x. The sample can be drawn with or without replacement, and the probability of selecting each element can either be the same for every element or be given as a vector of weights supplied by the user.
If you want to take a sample of 50 cases in which every element has the same probability of being selected, all you have to do is
n <- 50
smpl <- df[sample(nrow(df), n), ]
However, if you want to give the elements different probabilities of being selected, say probability 0.25 for rows whose sex is "M" and 0.75 for rows whose sex is "F", you should do
n <- 50
prb <- ifelse(df$sex == "M", 0.25, 0.75)
smpl <- df[sample(nrow(df), n, prob = prb), ]
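The same idea answers the original question about a 1% minority class. Here is a minimal sketch on simulated data (the data frame dat, its class column, and the 50x weight are assumptions for illustration):
# Simulated data: roughly 1% of rows belong to a minority class
set.seed(1)
dat <- data.frame(class = sample(c("minor", "major"), 10000,
                                 replace = TRUE, prob = c(0.01, 0.99)))
w <- ifelse(dat$class == "minor", 50, 1) # upweight the rare class (50x is arbitrary)
smpl <- dat[sample(nrow(dat), 500, prob = w), ]
table(smpl$class) # the minority class is now heavily over-represented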
I have a categorical column (Urban.f) in a dataset (edu2018) and would like to calculate a proportion and repeat the process 1000 times.
This is the code I have to get a random sample of size 300. It shows me how many cases of "Urban" and "Rural" there are, along with the rest of the columns in the dataset.
x1 <- sample_n(edu2018, 300, fac = "urban.f")
summary(x1)
I want to divide the count of "Urban" by the total count of "Urban" plus "Rural".
Then, I want to repeat this step 1000 times. I have tried the following code, which tries to combine everything I want, but I cannot make it work.
edu2018_df <- edu2018$urban.f
n <- 1000
sample_rural <- replicate(n, {edu2018_df <- sample(edu2018, 300, replace = TRUE)
count(sample_rural = "Urban")/count(sample_rural == "Urban","Rural") * n})
hist(sample_rural)
You could do:
sample_rural <- replicate(1000, sum(sample(edu2018_df, 300, replace = FALSE) == "Rural") / 300)
hist(sample_rural)
Data used
Obviously, we don't have your actual data, so I recreated your vector edu2018_df, which is simply a vector of "Rural" and "Urban" values:
set.seed(1)
edu2018_df <- sample(c("Rural", "Urban"), 1000, TRUE)
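As a quick sanity check (using this simulated edu2018_df), the histogram should be centered near the population share of "Rural":
mean(sample_rural)          # average of the 1000 simulated proportions
mean(edu2018_df == "Rural") # population share of "Rural"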
I know the basic loop format, but I'm unsure how to incorporate 'population' into the loop to find the probability of collecting a sample with a mean of 42 or larger.
Use a loop to find out the probability of collecting a sample (n=10) with a mean of 42 (or larger) from the dataset produced by the following code:
set.seed(1)
population<-rnorm(n=500,mean=35,sd=10)
One approach to this problem is to repeatedly sample from population and compute the frequency with which the mean of these samples is greater than or equal to 42.
set.seed(1)
population <- rnorm(n = 500, mean = 35, sd = 10)
nsim <- 100000 # the number of times we will sample
vec_mean <- numeric(nsim) # a vector to hold the sample means
for (i in 1:nsim) {
  samp <- sample(population, size = 10, replace = TRUE)
  vec_mean[i] <- mean(samp)
}
sum(vec_mean >= 42) / nsim
# [1] 0.01727
This can be interpreted as the (frequentist) probability of collecting a sample of size 10 from this population with a mean of 42 or larger.
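As a rough cross-check (this CLT approximation is an addition, not part of the original answer), the mean of 10 draws from population is approximately normal with mean mean(population) and standard deviation sd(population)/sqrt(10):
mu <- mean(population)
s <- sd(population)
# P(sample mean >= 42) under the normal approximation; same ballpark as the simulation
pnorm(42, mean = mu, sd = s / sqrt(10), lower.tail = FALSE)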
I'm trying to first extract all values <= -4 (call these p1) from a mother normal distribution. Then, I want to randomly sample 50 of the p1 values with replacement, according to their probability of being selected in the mother (call these 50 values p2). For example, -4 is more likely to be selected than -6, which is further into the tail.
I was wondering if my R code below correctly captures what I described above?
mother <- rnorm(1e6)
p1 <- mother[mother <= -4]
p2 <- sample(p1, 50, replace = T) # How can I define probability of being selected here?
You can use the prob argument of the sample function. Quoting from help("sample"):
prob: a vector of probability weights for obtaining the elements of the vector being sampled.
And in the section Details:
The optional prob argument can be used to give a vector of weights for
obtaining the elements of the vector being sampled. They need not sum
to one, but they should be non-negative and not all zero.
So you must be careful: the farther a value is from the mean, the smaller its probability; the normal distribution drops to very small probability values very quickly.
set.seed(1315) # Make the results reproducible
mother <- rnorm(1e6)
p1 <- mother[mother <= -4]
p2 <- sample(p1, 50, replace = TRUE, prob = pnorm(p1))
You can see from the histogram that it worked.
hist(p2)
Wouldn't it be easier to sample from a truncated normal distribution in the first place?
truncnorm::rtruncnorm(50, a = -Inf, b = -4)
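For comparison (this assumes the truncnorm package is installed; it is not part of base R), you can draw directly from the truncated distribution and compare the two histograms:
# install.packages("truncnorm") # if not already installed
set.seed(1315)
p2_trunc <- truncnorm::rtruncnorm(50, a = -Inf, b = -4, mean = 0, sd = 1)
hist(p2_trunc)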
I think you are looking for something like this:
mother <- rnorm(1e6)
p1 <- mother[mother <= -4]
Calculate the probability of each p1 value being selected from mother:
p2 <- sample(p1, 50, replace = TRUE, prob = pnorm(p1, mean = mean(mother), sd = sd(mother)))
From the documentation:
For bootstrap samples, simple random sampling is used.
For other data splitting, the random sampling is done within the levels of y
when y is a factor in an attempt to balance the class distributions within
the splits.
For numeric y, the sample is split into groups sections based on percentiles
and sampling is done within these subgroups.
For createDataPartition, the number of percentiles is set via the groups
argument.
I don't understand why this "balance" thing is needed. I think I understand it superficially, but any additional insight would be really helpful.
It means that if you have a data set ds with 10000 rows
set.seed(42)
ds <- data.frame(values = runif(10000))
with 2 "classes" with unequal distribution (9000 vs 1000)
ds$class <- c(rep(1, 9000), rep(2, 1000))
ds$class <- as.factor(ds$class)
table(ds$class)
# 1 2
# 9000 1000
you can create a sample that tries to maintain the ratio / "balance" of the factor classes.
library(caret)
dpart <- createDataPartition(ds$class, p = 0.1, list = FALSE)
dsDP <- ds[dpart, ]
table(dsDP$class)
# 1 2
# 900 100
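For contrast (this example is an addition), a plain random sample of the same size ignores the classes, so the minority count drifts from run to run; that drift is exactly what the stratified split avoids:
plain <- ds[sample(nrow(ds), 1000), ]
table(plain$class) # roughly 900/100 on average, but not guaranteed in any single draw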
I have a banking dataset which has 5% defaulters; the rest are good (non-defaulters).
I want to create a sample which has 30% defaulters and 70% non-defaulters.
Assuming my dataset is data and it has a column named "default" signifying 0 or 1, how do I get a sample with 30% default and 70% non-default, given that my original dataset has only 5% default?
Can someone please provide the R code. That would be great.
I tried the following to get a random sample of 100 rows with replacement
data[sample(1:nrow(data), size = 100, replace = TRUE), ]
But how do I ensure that the split is 30%/70%?
sample has a prob option that takes a vector of probability weights for obtaining the elements of the vector being sampled, so you could pass prob = c(0.3, 0.7) to sample.
For example
sample(0:1, 100, replace=TRUE, prob=c(0.3,0.7))
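Note that this draws 0/1 labels directly rather than rows of your data. If you instead weight the rows themselves, the weights must offset the 5% base rate to reach an expected 30% defaulters. A sketch (the weight formula and names here are mine, assuming data$default is coded 0/1):
p <- 0.05 # base rate of defaulters
t <- 0.30 # target share of defaulters
w <- t * (1 - p) / ((1 - t) * p) # weight ratio for defaulters, about 8.14
idx <- sample(nrow(data), 100, replace = TRUE,
              prob = ifelse(data$default == 1, w, 1))
smpl <- data[idx, ]
mean(smpl$default) # about 0.3 in expectation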
Assume df is your dataframe and default is the column indicating who defaults.
To sample without replacement:
df[c(sample(which(df$default == 1), 30), sample(which(df$default == 0), 70)), ]
To sample with replacement (i.e., possibly duplicating records):
df[c(sample(which(df$default == 1), 30, replace = TRUE), sample(which(df$default == 0), 70, replace = TRUE)), ]
Alternatively, if you don't want to specify an exact number of defaulters and non-defaulters, you can specify a sampling probability for each row:
set.seed(1)
df <- data.frame(default=rbinom(250,1,.5), y=rnorm(250))
n <- 100 # could be any number, but the closer you get to nrow(df), the less the weights matter
s <- sample(seq_along(df$default), n, prob=ifelse(df$default, .3, .7))
table(df$default[s])
#
# 0 1
# 61 39
n <- 150 # could be any number, but the closer you get to nrow(df), the less the weights matter
s <- sample(seq_along(df$default), n, prob=ifelse(df$default, .3, .7))
table(df$default[s])
#
# 0 1
# 97 53