How to create a simulation of a small data set in R

I am very new to programming, so I apologize if my question seems too fundamental.
I have a data set of approximately 300 rows. The idea is to create an entirely new data set of, say, 10,000 rows that still has the same characteristics as the small data set of 300.
ID  Category1  Category2  Amount1  Probability1
1   Class1     A          100      0.3
2   Class2     B          800      0.2
3   Class3     C          300      0.7
4   Class2     A          250      0.4
5   Class3     C          900      0.6
I have already done the exploratory analysis. I know that my numeric data follows a beta distribution, and I know the mean and sd (and the level of skewness, in case that is relevant).
For my categorical data I know the percentage distribution: for instance, category A takes 25% of the data set, category B 35%, and category C 40%.
My question is: what are the best packages for simulating this data and creating a bigger data set?
I found the simstudy package, which seemed very good; however, I am still very new to programming and I am having a hard time getting my head around the code.
Here is the link to the description
https://cran.r-project.org/web/packages/simstudy/vignettes/simstudy.html
(I also checked the R documentation, but for a newbie like me it is very hard to follow and fully understand.)
I still don't really get how I can define my categorical values there. (They set the percentage distribution of the single classes, but they don't actually specify which probability applies to which class.)
Maybe someone here could explain how I could apply it to my data set, or suggest another, better package for this?
Thank you very much in advance!
EDIT
So my current code with the simstudy package is the following:
def <- defData(varname = "Product_Class", formula = "0.25;0.35;0.4", dist = "categorical")
def <- defData(varname = "Category", formula = "0.25;0.35;0.4", dist = "categorical")
def <- defData(def, varname = "Amount", dist = "beta", formula = 0.6, variance = 0.12)
def <- defData(def, varname = "Amount2", dist = "beta", formula = 0.45, variance = 0.1)
def <- defData(def, varname = "Probability", dist = "beta", formula = 0.4, variance = 0.23)
However, my problem here is that I can't create a skewed beta distribution (and I know that my data is skewed to the right).
Alternatively, I could use rbeta(), but then I have to create each column separately and I cannot create a relationship between columns (e.g. correlation, which I would have to add later on as well):
rbeta(n, shape1, shape2)
# shape1 < shape2 (both > 0) creates a right-skewed beta distribution
rbeta(1000, 0.2, 3)
Any other suggestions for how to resolve this problem?
How do you usually simulate data sets that have only a limited number of entries?

Would it work if you just used the sample() function in R with replacement?
Here is an example using the mtcars data set.
data(mtcars)
mydata=mtcars[,1:4] # only using the first 4 columns for this example
head(mydata)
dim(mydata) # data has 32 rows 4 columns
bigdata = data.frame(mpg  = sample(mydata$mpg,  1000, replace = TRUE),
                     cyl  = sample(mydata$cyl,  1000, replace = TRUE),
                     disp = sample(mydata$disp, 1000, replace = TRUE),
                     hp   = sample(mydata$hp,   1000, replace = TRUE))
head(bigdata)
dim(bigdata)
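If it helps, a quick sanity check that the resampled columns keep roughly the same marginal distributions as the original data (a simple comparison, not a formal test):
# compare original and resampled marginal distributions
summary(mydata$mpg)
summary(bigdata$mpg)
prop.table(table(mydata$cyl))
prop.table(table(bigdata$cyl))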

I have actually done something exactly like this. I calculate the actual min and max of each variable so I can simulate data that mimics my own original dataset. Using simstudy has several advantages over just using sample(): primarily, sample() only draws from the values already present in the data, while simstudy generates any potential value between the minimum and maximum (for numeric types), or draws from the specified proportions for categorical variables. simstudy is also useful if your original data is sensitive/personal, so you can avoid privacy problems compared to using sample(). This is what I did:
library(skimr)
library(simstudy)
library(dplyr)
library(glue)
sim_definitions <-
  skim_to_wide(iris) %>%
  mutate(min = as.numeric(p0), max = as.numeric(p100)) %>%
  transmute(
    varname = variable,
    dist = case_when(
      # binary if the variable takes only two values (e.g. 0 and 1)
      n_unique == 2 ~ "binary",
      n_unique > 2 ~ "categorical",
      TRUE ~ "uniform"
    ),
    formula = case_when(
      dist == "uniform" ~ as.character(glue("{min};{max}")),
      # only for factors with 3 levels; each number is a proportion (0.3 = 30%)
      dist == "categorical" ~ "0.5;0.2;0.3",
      dist == "binary" ~ "0.2",
      # otherwise 10 is the min and 20 the max
      TRUE ~ "10;20"
    ),
    link = case_when(
      dist == "binary" ~ "logit",
      TRUE ~ "identity"
    )
  )
# 1000 is the final size of the dataset. Change to what ever you want.
simulated_data <- genData(1000, sim_definitions)
dim(simulated_data)
head(simulated_data)
NOTE: I seem to get an error with simstudy; I'm not sure if it's because of an update. Let me know if this works for you. UPDATE: The categorical specification seems to cause the error, but I was unable to find the problem.
UPDATE based on clarification in question and comments:
Your code works fine in generating a simulated dataset. If you want to force a skewed distribution, you can use base R's distribution functions like qlnorm. So:
library(simstudy)
#> Loading required package: data.table
def <- defData(varname = "Product_Class", formula = "0.25;0.35;0.4", dist = "categorical")
def <- defData(def, varname = "Category", formula = "0.25;0.35;0.4", dist = "categorical")
def <- defData(def, varname = "Amount", dist = "beta", formula = 0.6, variance = 0.12)
def <- defData(def, varname = "Amount2", dist = "beta", formula = 0.45, variance = 0.1)
def <- defData(def, varname = "Probability", dist = "beta", formula = 0.4, variance = 0.23)
simulated_data <- genData(1000, def)
hist(simulated_data$Amount2)
simulated_data$Amount2 <- qlnorm(simulated_data$Amount2)
hist(simulated_data$Amount2)
Created on 2019-03-24 by the reprex package (v0.2.1)
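One piece of the original question that the code above does not show is how to attach the class labels to the categorical columns. A minimal sketch, assuming simstudy's categorical distribution returns the integer codes 1, 2, 3 with the probabilities applied in the order they appear in the formula (so 1 = A at 25%, 2 = B at 35%, 3 = C at 40%):
# relabel the integer codes produced by the categorical definition
simulated_data$Category <- factor(simulated_data$Category,
                                  levels = 1:3, labels = c("A", "B", "C"))
prop.table(table(simulated_data$Category))  # roughly 0.25, 0.35, 0.40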

Related

A question about assigning design weights to a dataset - the case of PPS without replacement: survey package in RStudio

I am really new to the field of setting up survey weights, and I need help. I have the following example dataframe that represents a multi-stage survey (5 clusters for stage 1 and 10 clusters for stage 2):
set.seed(111)
mood <- sample(c("happy", "neutral", "grumpy"),
               size = 1000, replace = TRUE,
               prob = c(0.3, 0.3, 0.4))

set.seed(222)
sex <- sample(c("female", "male"),
              size = 1000, replace = TRUE,
              prob = c(0.6, 0.4))

set.seed(333)
age_group <- sample(c("young", "middle", "senior"),
                    size = 1000, replace = TRUE,
                    prob = c(0.2, 0.6, 0.2))

status <- data.frame(mood = mood,
                     sex = sex,
                     age_group = age_group,
                     income = trunc(runif(1000, 1000, 2000)),
                     dnum = rep(c(441, 512, 39, 99, 61), each = 200),
                     snum = rep(c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10), each = 100),
                     fpc1 = rep(c(100, 200, 300, 400, 500), each = 200),
                     fpc2 = rep(c(10, 9, 8, 10, 7, 6, 13, 9, 5, 12), each = 100))
# to take into account the two cluster populations (fpc1 and fpc2)
# I calculated the probability proportional to size of each unit as follows
# (using a method mentioned by a previous question.
# The link of the referred question is at the end of this post) :
library(dplyr)

status1 <- status %>%
  group_by(fpc1, fpc2) %>%
  summarise(n = n(), .groups = 'drop') %>%
  mutate(fpc = n / sum(n)) %>%
  right_join(status)
That way, we take into account the clusters to set up the PPS for each unit.
So my question is (assuming there are no missing values): do we create the design weights as the inverse of the new fpc column?
And if we wanted to continue adjusting for other variables (mood, sex, age_group) so that my sample becomes representative of the target population, do we adjust the design weights using some calibration method such as raking, propensity scores, etc.? Is that correct, or did I misunderstand something about using RStudio to assign weights to my sample?
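For concreteness, here is a minimal sketch of the construction described above: the inverse of the new fpc column used as a weight, attached to the two-stage cluster ids with the survey package. This only illustrates the proposed setup (the column name pw is an arbitrary choice); it does not settle whether 1/fpc is the statistically correct design weight, and the fpc / without-replacement specification is deliberately left out since that is exactly what the linked question covers.
library(survey)

# proposed design weight: inverse of the PPS column computed above
status1$pw <- 1 / status1$fpc

# two-stage cluster design: dnum/snum are the stage-1/stage-2 cluster ids
des <- svydesign(id = ~dnum + snum, weights = ~pw, data = status1)
svymean(~income, des)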
The link of the referred question :
survey package in R: How to set fpc argument (finite population correction)
Thanks.

Create ROC curve manually from data frame

I have the following conceptual problem that I can't get my head around.
Below is an example with survey data where I have a time column that indicates how long someone needs to respond to a certain question. Responses are flagged for cleaning when the time falls below a threshold (2.5 s in the example below).
Now I'm interested in how the amount of cleaning would change based on this threshold, i.e. what would happen if I increased the threshold, and what would happen if I decreased it.
So my idea was to just create a ROC curve (or other model metrics) to get a visual cue about a potential threshold. The problem is that I don't have a machine-learning-like model that would give me class probabilities. So I was wondering if there's any way to create a ROC curve nonetheless with this type of data. I had the idea of just looping through my data at maybe 100 different thresholds, calculating false and true positive rates at each threshold and then doing a simple line plot, but I was hoping for a more elegant solution that doesn't require me to loop.
Any ideas?
example data:
the time column indicates the time needed per case
the truth column indicates my current decision that I want to compare against
the predicted column indicates the cleaning decision if I cut at a time threshold of 2.5 s; this is what I need to change/loop through.
library(dplyr)

set.seed(3)
df <- data.frame(time = c(2.5 + rnorm(5), 3.5 + rnorm(5)),
                 truth = rep(c("cleaned", "final"), each = 5)) %>%
  mutate(predicted = if_else(time < 2.5, "cleaned", "final"))
You can also use ROCR for this:
library(ROCR)
library(dplyr)

set.seed(3)
df <- data.frame(time = c(2.5 + rnorm(5), 3.5 + rnorm(5)),
                 truth = rep(c("cleaned", "final"), each = 5)) %>%
  mutate(predicted = if_else(time < 2.5, "cleaned", "final"))

pred <- prediction(df$time, df$truth)
perf <- performance(pred, "tpr", "fpr")
plot(perf, colorize = TRUE)
You can also check the AUC value:
auc <- performance(pred, measure = "auc")
auc@y.values[[1]]
[1] 0.92
Cross checking the AUC value with pROC
library(pROC)
roc(df$truth, df$time)
Call:
roc.default(response = df$truth, predictor = df$time)
Data: df$time in 5 controls (df$truth cleaned) < 5 cases (df$truth final).
Area under the curve: 0.92
In both cases the AUC is the same.
So my idea was to just create a ROC curve
Creating a ROC curve is as easy as
library(pROC)
set.seed(3)
data.frame(time = c(2.5 + rnorm(5), 3.5 + rnorm(5)),
           truth = rep(c("cleaned", "final"), each = 5)) |>
  roc(truth, time) |>
  plot()
The problem is that I don't have a machine-learning-like model that would give me class probabilities.
Sorry, I do not understand what is machine-learning-like about the question.
I had the idea of just looping through my data at maybe 100 different thresholds
There is no point in looping over 100 possible thresholds if you have only 10 observations. The sensible cutoffs are the nine values situated in between your time values. You can get those from roc():
df <- data.frame(time = c(2.5 + rnorm(5), 3.5 + rnorm(5)),
                 truth = rep(c("cleaned", "final"), each = 5))
thresholds <- roc(df, truth, time)$thresholds
print(thresholds)
or
> print(thresholds)
[1] -Inf 1.195612 1.739608 1.968531 2.155908 2.329745 2.561073
[8] 3.093424 3.969994 4.586341 Inf
What exactly counts as "looping", and whether you only want to exclude an explicit for or while loop, needs a precise definition. Is c(1, 2, 3, 4) * 5 a loop? There will always be a loop running under the hood.
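To tie this back to the manual calculation described in the question, here is a loop-free sketch of the TPR/FPR computation using sapply over candidate cutoffs (which, as noted above, still runs a loop under the hood); "final" is treated as the positive class, as in the pROC output:
set.seed(3)
df <- data.frame(time = c(2.5 + rnorm(5), 3.5 + rnorm(5)),
                 truth = rep(c("cleaned", "final"), each = 5))

# candidate cutoffs: the observed times themselves (the midpoints from roc() work too)
cutoffs <- sort(df$time)

# for each cutoff, classify as "cleaned" below the cutoff and "final" otherwise,
# then compute true and false positive rates with "final" as the positive class
rates <- sapply(cutoffs, function(th) {
  predicted <- ifelse(df$time < th, "cleaned", "final")
  c(tpr = mean(predicted[df$truth == "final"] == "final"),
    fpr = mean(predicted[df$truth == "cleaned"] == "final"))
})

plot(rates["fpr", ], rates["tpr", ], type = "b",
     xlab = "False positive rate", ylab = "True positive rate")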

How to accomplish replicated calculations and plots on subsets of a dataset?

I have simulated data created like this:
library(MASS)  # for mvrnorm()

average_vector <- c(0, 0, 25)
sigma_matrix <- matrix(c(4, 1, 0, 1, 8, 0, 0, 0, 9), nrow = 3, ncol = 3)
set.seed(12345)
data0 <- as.data.frame(mvrnorm(n = 20000, mu = average_vector, Sigma = sigma_matrix))
names(data0) <- c("hard", "smartness", "age")
set.seed(13579)
data0$final <- 0.5 * data0$hard + 0.2 * data0$smartness + (-0.1) * data0$age +
  rnorm(n = dim(data0)[1], mean = 90, sd = 6)
Now I want to randomly sample 50 students 1,000 times (1,000 sets of 50 people). I used this code:
datsub <- replicate(1000, sample(1:nrow(data0), 50))
After that step I ran into an issue: I want to run the regression model on the 50 selected people (1,000 times) and record/store the point estimates of "hard" from model 4, which is given by
model4 = lm(formula = final ~ hard + smartness + age, data = data0)
and then plot the variation around the line of 0.5 (the true value). Is there any way I can achieve that? Thanks a lot!
I would highly suggest looking into either caret or the newer (and still maintained) TidyModels if you're just getting into R modelling. Either of these will make your life easier, once you get used to the dplyr-like syntax.
What you're trying to do is bootstrapping. Here is the manual approach using only base functions.
n <- nrow(data0)
k <- 1000   # number of bootstrap samples
ns <- 50    # sample size per replicate

samples <- replicate(k, sample(seq_len(n), ns))

params <- vector('list', k)
for (i in seq_len(k)) {
  params[[i]] <- coef(lm(formula = final ~ hard + smartness + age, data = data0[samples[, i], ]))
}

# merge params into a matrix, one row of coefficients per replicate
params <- do.call(rbind, params)

# Create plot from here.
plot(x = seq_len(k), y = params[, "hard"])
abline(h = 0.5)
Note the above may have a few typos as your example is not reproducible.
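A histogram of the stored coefficients is another simple way to see the spread around the true value of 0.5; a small add-on to the sketch above:
# distribution of the bootstrap estimates of the "hard" coefficient
hist(params[, "hard"], main = "Bootstrap estimates of 'hard'", xlab = "Estimate")
abline(v = 0.5, col = "red", lwd = 2)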

Matching on a quaternary variable in R

I am working with a data set you can generate with the following code:
set.seed(922)
dat<-data.frame("y" = c(rnorm(500,20,2),rnorm(500, 40,2),rnorm(500,60,2),rnorm(500,80,2)),
"quaternary" = sample(LETTERS[1:4],2000,replace = T),
"binary" = sample(c("0","1"),2000,replace = T))
(Generating a 4-modal distribution was an arbitrary decision)
The four treatment groups ("A", "B", "C", "D") are what is important.
I am trying to create a balanced matched sample based on the values of y in the data frame. I've used the MatchIt package to build balanced samples based on a binary variable:
matchit(binary~y,data = dat)
but I'm not sure how I could build matches of the 4-level factor "quaternary" on the values of "y".
I'm not certain there's an elegant way to do it in the MatchIt package, but I'm open to any suggestion on how I might stack methodologies to get a good balanced sample. Any help would be awesome.
EDIT:
OK, so I think I'm close. You can leverage dplyr in a for loop. It's a bit inefficient, and I still have to think about the implications of using this to create a balanced sample, but it's getting closer...
First, in the dat data frame, you create four new variables populated with NAs:
dat$A_match<-NA
dat$B_match<-NA
dat$C_match<-NA
dat$D_match<-NA
Then you use the summarise function in dplyr to find the values.
require(dplyr) #haha. Hey that rhymes
for (i in 1:dim(dat)[1]) {
  dat_A_index <- dat %>%
    mutate(y = ifelse(quaternary == "A", y, 0),
           abs = abs(dat[i, 1] - y)) %>%
    summarise(A_index = which.min(abs))
  dat$A_match[i] <- dat[dat_A_index$A_index, 1]
  rm(dat_A_index)

  dat_B_index <- dat %>%
    mutate(y = ifelse(quaternary == "B", y, 0),
           abs = abs(dat[i, 1] - y)) %>%
    summarise(B_index = which.min(abs))
  dat$B_match[i] <- dat[dat_B_index$B_index, 1]
  rm(dat_B_index)

  dat_C_index <- dat %>%
    mutate(y = ifelse(quaternary == "C", y, 0),
           abs = abs(dat[i, 1] - y)) %>%
    summarise(C_index = which.min(abs))
  dat$C_match[i] <- dat[dat_C_index$C_index, 1]
  rm(dat_C_index)

  dat_D_index <- dat %>%
    mutate(y = ifelse(quaternary == "D", y, 0),
           abs = abs(dat[i, 1] - y)) %>%
    summarise(D_index = which.min(abs))
  dat$D_match[i] <- dat[dat_D_index$D_index, 1]
  rm(dat_D_index)
}
I know it's clunky, but at least it selects the best match in each of the four categories for the given value of y. In a real-world application, the final balanced sample should be no larger than the smallest conditional n multiplied by 4. You also have to assume some outliers might have to be thrown out (maybe an F-test to set the last filtering rule?). At any rate, the vector y we generated already represents a balanced sample, but in a real-world application this would not be the case.
Remember that MatchIt only produces matched samples that are suitable for estimating the ATT (average treatment effect on the treated). Typically, MatchIt selects a group it considers the "treated", which is usually the treatment level labeled "1". It then matches to each treated unit one or more control units.
With multinomial treatments, you also need to decide which estimand you are interested in. If, again, you are interested in the ATT, you must select one group to be considered the "treated", and the other groups are considered "control" (I prefer to refer to them as "focal" and "non-focal"). Importantly, your treatment effect estimates will only generalize to a population similar in composition to that of the focal group.
If this is what you want, you need to select one group as focal, and then perform three separate matchit calls where each one matches units from one of the non-focal group to the units in the focal group. The focal group remains unchanged. Below is some code I might use to do this:
set.seed(922)
library(MatchIt)

dat <- data.frame("y" = c(rnorm(500, 20, 2), rnorm(500, 40, 2), rnorm(500, 60, 2), rnorm(500, 80, 2)),
                  "quaternary" = sample(LETTERS[1:4], 2000, replace = T, prob = c(.1, .3, .3, .3)),
                  "binary" = sample(c("0", "1"), 2000, replace = T))

focal <- "A"
dat$match.weights <- 1

for (lev in levels(dat$quaternary)) {
  if (lev != focal) {
    dat0 <- dat[dat$quaternary %in% c(focal, lev),]
    dat0$treat <- as.numeric(dat0$quaternary == focal)
    m.out <- matchit(treat ~ y, dat = dat0, replace = FALSE)
    dat$match.weights[dat$quaternary == lev] <- m.out$weights[dat0$treat == 0]
  }
}
library(cobalt)
bal.tab(quaternary ~ y, data = dat, weights = dat$match.weights,
        method = "matching", focal = focal, un = TRUE)
#> Note: estimand and s.d.denom not specified; assuming ATT and treated.
#> Balance summary across all treatment pairs
#> Type Max.Diff.Un Max.Diff.Adj
#> y Contin. 0.1134 0.0009
#>
#> Sample sizes
#> B C D A
#> All 593 597 612 198
#> Matched 198 198 198 198
#> Unmatched 395 399 414 0
Created on 2018-10-13 by the reprex package (v0.2.1)
Note that if your focal group is not the smallest of the groups, you must match with replacement by setting replace = TRUE in matchit(). To ensure the focal group in this example was the smallest, I set the probabilities of the randomly sampled values of quaternary so that the probability of A was lowest.
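For reference, that variant is just the matchit() call from the loop above with matching done with replacement:
# use this inside the loop when the focal group is not the smallest group
m.out <- matchit(treat ~ y, dat = dat0, replace = TRUE)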
If, on the other hand, you want the ATE, matching is probably not your best option. It would be hard to use MatchIt to produce a matched set for the ATE for a binary treatment, making it even harder to do so for multiple treatment groups. Instead, you might look into propensity score weighting, for which the weights are well defined with multinomial treatments. Below is some code to estimate the weights using the above data set to estimate either the ATT or the ATE:
library(WeightIt)
#Weighting for the ATT with A as focal:
w.out.att <- weightit(quaternary ~ y, data = dat, estimand = "ATT", focal = "A")
#> Using multinomial logit regression.
dat$w.att <- w.out.att$weights
#Weighting for the ATE:
w.out.ate <- weightit(quaternary ~ y, data = dat, estimand = "ATE")
#> Using multinomial logit regression.
dat$w.ate <- w.out.ate$weights
bal.tab(quaternary ~ y, data = dat, weights = c("w.att", "w.ate"),
        method = "weighting", estimand = c("ATT", "ATE"), un = TRUE)
#> Balance summary across all treatment pairs
#> Type Max.Diff.Un Max.Diff.w.att Max.Diff.w.ate
#> y Contin. 0.1092 0.0055 0.0024
#>
#> Effective sample sizes
#> A B C D
#> All 198.000 593.000 597.000 612.000
#> w.att 198.000 591.139 593.474 604.162
#> w.ate 196.947 592.822 596.993 611.107
Created on 2018-10-13 by the reprex package (v0.2.1)
No matter which strategy you use, you can then run a weighted regression of the outcome on the treatment using the estimated matching, ATT, or ATE weights.
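As a minimal sketch of that final step, assuming a hypothetical outcome column named outcome in dat (the example data here has no separate outcome variable):
# weighted outcome regression using the ATT weights estimated above;
# `outcome` is a placeholder for your real outcome variable
fit_att <- lm(outcome ~ quaternary, data = dat, weights = w.att)
summary(fit_att)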
[Disclosure: I'm the author of both the cobalt and WeightIt packages.]

Trying to Generate Random Data from Lists

I am trying to generate data for a project. The data need to be generated randomly from predefined lists. Essentially, I have real data, but it's very small. In order to build some classifiers (decision trees, support vector machines and naive Bayes), I want to produce 100,000 observations.
I am new to coding (I can do rudimentary things in Matlab and R) and initially tried to do this in Excel; however, the RANDOMA function generated very evenly distributed data. To be more specific, I am using 5 pieces of demographic information to predict which retailer a customer will select, e.g. retailer A, B or C. The lists for the demographic information are below:
1) Age group (18-24, 25-34, 35-44, 45-54, 55+)
2) Gender (male or female)
3) Income group (<£10k,£10k-19.99k, £20k-£29.99k, etc.)
4) Region (London, Wales, Scotland, Northern Ireland, South West, etc.)
5) Type of job (Full-time, part-time, student, etc.)
When I tried to randomly create 100,000 observations (each observation randomly selecting one entry from each of the 5 lists), they were almost equally distributed across the categories. Even worse, the value I randomly assigned to the retailer (A, B or C) was also equally distributed.
The idea is to split this randomly generated data into training and test data so I can build some models and test their suitability.
In Matlab, your best friend for this task will be the randsample function (reference here), which is part of the Statistics Toolbox. Let's make an example concerning your Gender variable:
% possible values (M for male and F for female)
% since it's a qualitative variable, let's use the categorical type
var = categorical({'M' 'F'});
prob = [0.55 0.45]; % corresponding probabilities
n = 100000; % sample size
repl = true; % replacement (true = yes, false = no)
gender = randsample(var,100000,repl,prob);
You can use the same approach to generate samples concerning Region and Job. Let's now make another example with your Age variable.
var = 1:100; % possible values (age from 1 to 100 years)
n = 100000; % sample size
repl = true; % replacement (true = yes, false = no)
% the probability argument is not provided, hence the result is equally distributed
age = randsample(var,100000,repl);
Since you want to split your Age sample into different groups, the histcounts function with the bin edges as its second argument will do that for you:
age_grps = histcounts(age,[0 18 25 35 45 55 100]);
% remove the first element if you want to exclude people from 0 to 17 years
age_grps(1) = [];
You can use the same approach to generate the Income sample.
As far as I can see, your main concern is the uniform distribution of your variables. I have shown how to set different probabilities for each possible value via the prob argument of the randsample function.
I don't know the typical distributions of your data, but the following should get you started.
library(tidyverse)
set.seed(315) # This will create the same data set each run
n.size <- 500
myData <- tibble(
  ID = 1:n.size,
  VisitDT = lubridate::today() - 30 - (runif(n.size) * 100),
  IncomeGroup = sample(c("Low", "Medium", "High"), n.size, prob = c(.7, .25, .05), replace = TRUE),
  age = round(rnorm(n = n.size, mean = 52, sd = 10), 2),
  sex = sample(c('M', 'F'), size = n.size, prob = c(.4, .6), replace = TRUE),
  region = sample(c('London', 'Wales', 'Scotland'), size = n.size, prob = c(.4, .3, .2), replace = TRUE),
  Treatment = sample(c('No', 'Yes'), size = n.size, prob = c(.1, .9), replace = TRUE)
)
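Since the question also mentions splitting the generated data into training and test sets, here is a simple sketch of an 80/20 split (the ratio and the seed are arbitrary choices):
# split the simulated data into training and test sets
set.seed(316)
train_idx <- sample(seq_len(nrow(myData)), size = floor(0.8 * nrow(myData)))
train <- myData[train_idx, ]
test  <- myData[-train_idx, ]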
