I am trying to generate data for a project. The data needs to be generated randomly from predefined lists. Essentially, I have real data but it's very small. In order to build some classifiers (decision trees, Support Vector Machines and Naive Bayes), I want to produce 100,000 observations.
I am new to coding (I can do rudimentary things in Matlab and R) and initially tried to do this in Excel; however, the RANDOMA function generated very evenly distributed data. To be more specific, I am using 5 pieces of demographic information to predict which retailer a customer will select, e.g. retailer A, B or C. The lists for the demographic information are below:
1) Age group (18-24, 25-34, 35-44, 45-54, 55+)
2) Gender (male or female)
3) Income group (<£10k,£10k-19.99k, £20k-£29.99k, etc.)
4) Region (London, Wales, Scotland, Northern Ireland, South West, etc.)
5) Type of job (Full-time, part-time, student, etc.)
When I tried to randomly create 100,000 observations (each observation randomly selected 1 value from each of the 5 lists), they were almost equally distributed across the values. Even worse, the value I randomly assigned to the retailer (A, B or C) was also equally distributed.
The idea is to split this randomly generated data into training and test data, so I can build some models and test their suitability.
In Matlab, your best friend for this task will be the randsample function, which is part of the Statistics Toolbox. Let's work through an example with your Gender variable:
% possible values (M for male and F for female)
% since it's a qualitative variable, let's use the categorical type
var = categorical({'M' 'F'});
prob = [0.55 0.45]; % corresponding probabilities
n = 100000; % sample size
repl = true; % replacement (true = yes, false = no)
gender = randsample(var,n,repl,prob);
You can use the same approach to generate samples concerning Region and Job. Let's now make another example with your Age variable.
var = 1:100; % possible values (age from 1 to 100 years)
n = 100000; % sample size
repl = true; % replacement (true = yes, false = no)
% the probability argument is not provided, hence the result is equally distributed
age = randsample(var,n,repl);
Since you want to split your Age sample into different groups, histcounts with the bin edges as the second argument will count how many observations fall into each group:
age_grps = histcounts(age,[0 18 25 35 45 55 100]);
% remove the first element if you want to exclude people from 0 to 17 years
age_grps(1) = [];
You can use the same approach to generate the Income sample.
As far as I can see, your main concern is the uniform distribution of your variables. The examples above show how to set a different probability for each possible value in the randsample function (the prob argument).
I don't know the typical distributions of your data, but the following should get you started.
library(tidyverse)
set.seed(315) # This will create the same data set each run
n.size <- 500
myData <- tibble(
ID = 1:n.size,
VisitDT = lubridate::today()-30 - (runif(n.size) * 100),
IncomeGroup = sample(c("Low", "Medium", "High" ), n.size, prob = c(.7, .25, .05), replace = TRUE),
age = round(rnorm(n = n.size, mean = 52, sd = 10),2),
sex = sample (c('M', 'F'), size = n.size, prob = c(.4, .6), replace = TRUE),
region = sample (c('London', 'Wales', 'Scotland'), size = n.size, prob = c(.4,.3,.2), replace = TRUE),
Treatment = sample(c('No','Yes'), size = n.size, prob = c(.1, .9), replace = TRUE)
)
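The original problem also mentions that the retailer came out evenly distributed and that the data will be split into training and test sets. Here is a minimal sketch of one way to handle both, assuming the retailer choice depends on age group only. All probabilities below, and the names customers and retailer_probs, are made up for illustration and should be replaced with shares estimated from your real data.

```r
set.seed(315)
n <- 100000

# Hypothetical marginal shares for two of the five demographic lists
customers <- data.frame(
  age_group = sample(c("18-24", "25-34", "35-44", "45-54", "55+"), n,
                     replace = TRUE, prob = c(0.15, 0.25, 0.25, 0.20, 0.15)),
  gender = sample(c("male", "female"), n, replace = TRUE, prob = c(0.48, 0.52)),
  stringsAsFactors = FALSE
)

# Hypothetical P(retailer | age group): one row per age group, each row sums to 1
retailer_probs <- rbind(
  "18-24" = c(A = 0.60, B = 0.30, C = 0.10),
  "25-34" = c(A = 0.50, B = 0.30, C = 0.20),
  "35-44" = c(A = 0.40, B = 0.30, C = 0.30),
  "45-54" = c(A = 0.30, B = 0.30, C = 0.40),
  "55+"   = c(A = 0.20, B = 0.30, C = 0.50)
)

# Draw the retailer separately within each age group, using that group's row
customers$retailer <- NA_character_
for (g in rownames(retailer_probs)) {
  idx <- customers$age_group == g
  customers$retailer[idx] <- sample(colnames(retailer_probs), sum(idx),
                                    replace = TRUE, prob = retailer_probs[g, ])
}

# 80/20 split into training and test sets
train_idx <- sample(n, size = round(0.8 * n))
train <- customers[train_idx, ]
test  <- customers[-train_idx, ]

# Share of each retailer within each age group
round(prop.table(table(customers$age_group, customers$retailer), margin = 1), 2)
```

If the retailer is sampled independently of the demographics, a classifier has nothing to learn, so some dependence like this has to be built in. Keep in mind that the models will then only recover whatever relationship you put into the simulation.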
Related
I am really new to the field of setting up survey weights, and I need help. I have the following example data frame, which represents a multi-stage survey (5 clusters for stage 1 and 10 clusters for stage 2):
set.seed(111)
mood <- sample(c("happy","neutral","grumpy"),
size = 1000,
replace=TRUE,
c(0.3,0.3,0.4))
set.seed(222)
sex <- sample(c("female","male"),
size=1000,
replace=TRUE,
c(0.6,0.4))
set.seed(333)
age_group <- sample(c("young","middle","senior"),
size=1000,
replace=TRUE,
c(0.2,0.6,0.2))
status <- data.frame(mood=mood,
sex=sex,
age_group=age_group,
income = trunc(runif(1000,1000,2000)),
dnum = rep(c(441,512,39,99,61),each = 200),
snum = rep(c(1,2,3,4,5,6,7,8,9,10),each=100),
fpc1 = rep(c(100,200,300,400,500),each=200),
fpc2 = rep(c(10,9,8,10,7,6,13,9,5,12),each=100) )
# to take into account the two cluster populations (fpc1 and fpc2)
# I calculated the probability proportional to size of each unit as follows
# (using a method mentioned in a previous question;
# the link to that question is at the end of this post):
library(dplyr)
status1 <- status %>%
group_by(fpc1,fpc2) %>%
summarise(n = n(), .groups = 'drop') %>%
mutate(fpc = n/sum(n)) %>%
right_join(status)
That way, we take into account the clusters to set up the PPS for each unit.
So my question is: assuming that there are no missing values, do we create the design weights as the inverse of the new fpc column?
And if we wanted to go on and adjust for other variables (mood, sex, age_group) so that the sample becomes representative of the target population, would we adjust these design weights with a calibration method such as raking or propensity scores? Is that correct, or have I misunderstood something about assigning weights to my sample in RStudio?
The link to the question referred to above:
survey package in R: How to set fpc argument (finite population correction)
Thanks.
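For the calibration step mentioned above, here is a minimal base-R sketch of raking (iterative proportional fitting) on two margins. It is only an illustration under stated assumptions: the starting weights are a placeholder of 1 for every unit rather than real design weights, and the population shares pop_sex and pop_age are invented. The helper weighted_share is hypothetical too.

```r
set.seed(111)
n <- 1000
samp <- data.frame(
  sex = sample(c("female", "male"), n, replace = TRUE, prob = c(0.6, 0.4)),
  age_group = sample(c("young", "middle", "senior"), n, replace = TRUE,
                     prob = c(0.2, 0.6, 0.2)),
  stringsAsFactors = FALSE
)

# Placeholder design weights (all 1); in practice start from 1 / inclusion probability
w <- rep(1, n)

# Hypothetical population shares to calibrate to
pop_sex <- c(female = 0.5, male = 0.5)
pop_age <- c(young = 0.3, middle = 0.4, senior = 0.3)

# Weighted share of each category of g
weighted_share <- function(w, g) tapply(w, g, sum) / sum(w)

# Alternate between the two margins until both are matched
for (iter in 1:50) {
  w <- w * unname(pop_sex[samp$sex] / weighted_share(w, samp$sex)[samp$sex])
  w <- w * unname(pop_age[samp$age_group] /
                    weighted_share(w, samp$age_group)[samp$age_group])
}

weighted_share(w, samp$sex)
weighted_share(w, samp$age_group)
```

Each pass rescales the weights so that one margin matches its target, and repeating the passes brings both margins into agreement. For real work, the survey package provides this kind of calibration together with correct variance estimation.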
I am new to R and I am using the glm() function to fit a logistic model. I have 5 columns. I need to find all possible predictors using a loop, based on their p-values (less than 0.05).
My dataset has 40,000 entries, which contain numerical and categorical variables, and it looks more or less like this:
"Age"  "Sex"     "Occupation"  "Education"    "Income"
50     Male      Farmer        High School    False
30     Female    Maid          High School    False
25     Male      Engineer      Graduate       True
The target variable "Income" denotes whether the person earns more or less than 30K: True means they earn more than 30K, and False means they earn less. I would like to find the predictor variables that can be used to predict the target using loops. Also, can I find the best 3 predictors based on their p-values?
Thanks in Advance!
If I understood your question correctly, you are looking for a way to test univariable models given your data frame (I am, in fact, not sure whether you want to test every combination of these variables, including cross variation).
My suggestion is to use the purrr::map function and create a list element for every column. Check the following example based on your information:
library(tidyr)
library(purrr)
## Sample data
df <- data.frame(
Age = rnorm(n = 40000,
mean = mean(c(50,30,25)),
sd(c(50,30,25))),
Occupation = sample(x = c("Farmer", "Maid", "Engineer"),
size = 40000,
replace = TRUE),
Education = sample(x = c("High School", "Graduate", "UnderGraduate"),
size = 40000,
replace = TRUE),
Income = as.logical(rbinom(40000, 1, 0.5))
)
## Split dataframe into lists
list_df <- Map(cbind, split.default(df[-4], names(df)[-4]))
list_df <- lapply(list_df, cbind, "target" = df[4])
## Use map to fit a model for each list
list_models <- map(.x = list_df,
.f = ~glm(Income ~ ., data = .x, family = binomial))
You can access each model with list_models[[i]].
Now, addressing the second part of your question concerning p-values. Since every project is unique, and so are its metrics, I suggest you double-check your use of p-values. Granted, they are important, but what they give you is the probability, under a specific statistical test's null hypothesis, of a result at least as extreme as the one observed, and the threshold you compare it with depends on context. They are a fundamental tool for statistical decisions (for t-tests, F-tests and so on). But for ranking predictors? Hmm, I would say that is a little odd. Just saying :)
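If a plain loop over the predictors is what you are after, here is a minimal sketch. It assumes simulated data in which Income depends on Age only (that dependence is invented for the example), and it uses a likelihood-ratio test per predictor so that a categorical variable with several levels still gets a single p-value.

```r
set.seed(42)
n <- 2000

# Simulated stand-in for the real data; column names follow the question
df <- data.frame(
  Age = round(rnorm(n, mean = 35, sd = 10)),
  Sex = sample(c("Male", "Female"), n, replace = TRUE),
  Education = sample(c("High School", "Graduate"), n, replace = TRUE),
  stringsAsFactors = FALSE
)
# Hypothetical relationship: the chance of a high income rises with Age only
df$Income <- rbinom(n, 1, plogis(-3.5 + 0.1 * df$Age)) == 1

predictors <- setdiff(names(df), "Income")

# Fit one univariable logistic model per predictor and keep its p-value
p_values <- sapply(predictors, function(v) {
  fit <- glm(reformulate(v, response = "Income"), data = df, family = binomial)
  # Likelihood-ratio test of the predictor against the intercept-only model
  anova(fit, test = "Chisq")[2, "Pr(>Chi)"]
})

sort(p_values)                     # smallest p-values first
names(p_values)[p_values < 0.05]   # predictors below the 0.05 cut-off
head(sort(p_values), 3)            # the three smallest p-values
```

With 40,000 rows almost any real effect will fall below 0.05, so the caveat above about ranking by p-value still applies.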
In an attempt to avoid nesting for loops 6-7 times, I am trying to use lapply to find the proportion of randomly drawn values (that are combined in a certain way) that exceed some arbitrary threshold values. The problem is that I have several parameters that each vary a certain number of ways, and these, in turn, will affect how the values are combined. The goal is to use the results in an ANOVA to see how varying these parameters contributes to reaching those thresholds. However, I don't understand how to do this. I have a feeling that anonymous functions could be useful, but I don't understand how they work with more than 1 parameter.
I tried to simplify the code as much as possible. But again, there are just so many parameters that must be included.
trials = 10
data_means = c(0,1,2,3)
prior_samples = c(2, 8, 32)
data_SD = c(0.5, 1, 2)
thresholds = c(10, 30, 80)
The idea is that there are two distributions, data and prior, which I draw values from. I always draw one from data, but I draw a sample (see prior_samples) of values from the prior distribution. There are four different values that determine the mean of the data distribution (see data_means), but the values are drawn the same number of times (determined by trials) from each of these four "versions" of the data distribution. These are then put into nested lists:
set.seed(123)
data_list = list()
for (nMean in data_means){ #the data values
for (nTrial in 1:trials){
data_list[[paste(nMean, sep="_")]][[paste(nTrial, sep="_")]] = rnorm(1, nMean, 1)
}
}
prior_list = list()
for (nSamples in prior_samples){ #the prior values
for (nTrial in 1:trials){
prior_list[[paste(nSamples, sep="_")]][[paste(nTrial, sep="_")]] = rnorm(nSamples, 0, 1)
}
}
Then I create another list for the prior values, because I want to calculate the means and standard deviations (SD) of the samples of prior values. I include normal SD, as well as SD/2 and SD*2:
prior_SD = list("mean"=0, "standard_devations"=list("SD/2"=0, "SD"=0, "SD*2"=0))
prior_mean_SD = rep(list(prior_SD), trials)
prior_nested_list = list("2"=prior_mean_SD, "8"=prior_mean_SD, "32"=prior_mean_SD)
for (nSamples in 1:length(prior_samples)){
for (nTrial in 1:trials){
prior_nested_list[[nSamples]][[nTrial]][["mean"]]=mean(prior_list[[nSamples]][[nTrial]])
prior_nested_list[[nSamples]][[nTrial]][["standard_devations"]][["SD/2"]]=sum(sd(prior_list[[nSamples]][[nTrial]])/2)
prior_nested_list[[nSamples]][[nTrial]][["standard_devations"]][["SD"]]=sd(prior_list[[nSamples]][[nTrial]])
prior_nested_list[[nSamples]][[nTrial]][["standard_devations"]][["SD*2"]]=sum(sd(prior_list[[nSamples]][[nTrial]])*2)
}
}
Then I combine the values from the data list and the last list, using list.zip from rlist:
library(rlist)
dataMean0 = list.zip(dMean0=data_list[["0"]], pSample2=prior_nested_list[["2"]],
pSample8=prior_nested_list[["8"]], pSample32=prior_nested_list[["32"]])
dataMean1 = list.zip(dMean1=data_list[["1"]], pSample2=prior_nested_list[["2"]],
pSample8=prior_nested_list[["8"]], pSample32=prior_nested_list[["32"]])
dataMean2 = list.zip(dMean2=data_list[["2"]], pSample2=prior_nested_list[["2"]],
pSample8=prior_nested_list[["8"]], pSample32=prior_nested_list[["32"]])
dataMean3 = list.zip(dMean3=data_list[["3"]], pSample2=prior_nested_list[["2"]],
pSample8=prior_nested_list[["8"]], pSample32=prior_nested_list[["32"]])
all_values = list(mean_difference0=dataMean0, mean_difference1=dataMean1,
mean_difference2=dataMean2, mean_difference3=dataMean3)
Now comes the tricky part. I combine the data values and the prior values in all_values by using this custom function for the Kullback-Leibler divergence. As you can see, there are 6 parameters that vary:
mean_diff refers to the means of the data distribution (data_means). It is named mean_diff because it refers to the difference in mean between the prior distribution (which is always 0) and the data distribution (which can be 0, 1, 2 or 3).
trial refers to trials,
pSample refers to the numbers of samples drawn from the prior distribution (prior_samples)
p_SD refers to the calculations of the SD based on the prior samples (normal SD, SD/2, SD*2)
data_SD refers to the SD of the data distribution, determined by data_SD
threshold refers to thresholds
The Kullback-Leibler divergence function:
kld = function(mean_diff, trial, pSample, p_SD, data_SD, threshold){
prior_mean = all_values[[mean_diff]][[trial]][[pSample]][["mean"]]
data_mean = all_values[[mean_diff]][[trial]][["mean"]]
prior_SD = all_values[[mean_diff]][[trial]][[pSample]][["standard_devations"]][[p_SD]]
posterior_SD = sqrt(1/(1/
((all_values[[mean_diff]][[trial]][[pSample]][["standard_devations"]][[p_SD]]
*all_values[[mean_diff]][[trial]][[pSample]][["standard_devations"]][[p_SD]]))
+1/(data_SD*data_SD)))
length(
which(
(log(prior_SD/posterior_SD) +
(((posterior_SD*posterior_SD) +
(prior_mean -
(((data_SD*data_SD))/
((data_SD*data_SD)+(prior_SD*prior_SD))*prior_mean +
((prior_SD*prior_SD))/
((data_SD*data_SD)+(prior_SD*prior_SD))*data_mean))^2)
/(2*(prior_SD*prior_SD)))-0.5
+
log(posterior_SD/prior_SD) +
((((prior_SD*prior_SD)) +
(prior_mean -
(((data_SD*data_SD))/
((data_SD*data_SD)+(prior_SD*prior_SD))*prior_mean +
((prior_SD*prior_SD))/
((data_SD*data_SD)+(prior_SD*prior_SD))*data_mean))^2)
/(2*(posterior_SD*posterior_SD)))-0.5
)>=threshold))/trials
}
So the question is how can one use lapply on the list with all the values (all_values) while using all the different combinations of the six parameters that are included? The data I want to end up with is the proportions of values (percentage of trials) that exceed the thresholds in all the parameter combinations.
I can't find the info I need, so any tips would be appreciated.
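One common pattern for this is to build every parameter combination with expand.grid and then call the function once per row with mapply. The sketch below assumes a stand-in function, toy_prop, in place of kld (it is hypothetical and does not compute a Kullback-Leibler divergence); trial is left out of the grid because the proportion is taken across trials inside the function.

```r
# Every combination of the varying parameters, one row each
grid <- expand.grid(
  mean_diff = c(0, 1, 2, 3),
  pSample   = c(2, 8, 32),
  p_SD      = c("SD/2", "SD", "SD*2"),
  data_SD   = c(0.5, 1, 2),
  threshold = c(10, 30, 80),
  stringsAsFactors = FALSE
)

# Hypothetical stand-in for kld(): proportion of simulated values at or above the threshold
toy_prop <- function(mean_diff, pSample, p_SD, data_SD, threshold, trials = 10) {
  sd_mult <- c("SD/2" = 0.5, "SD" = 1, "SD*2" = 2)[[p_SD]]
  values <- rnorm(trials, mean = mean_diff * pSample, sd = data_SD * sd_mult)
  mean(values >= threshold)
}

set.seed(123)
# mapply walks down the columns in parallel, so each call gets one row of the grid
grid$prop <- mapply(toy_prop, grid$mean_diff, grid$pSample, grid$p_SD,
                    grid$data_SD, grid$threshold)

head(grid)

# The resulting data frame can go straight into an ANOVA
summary(aov(prop ~ factor(mean_diff) + factor(pSample) + p_SD +
              factor(data_SD) + factor(threshold), data = grid))
```

An anonymous function with several parameters works the same way, e.g. mapply(function(a, b) a + b, x, y); the point is that the grid gives one row per combination, so no nested loops or nested lists are needed for the bookkeeping.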
I am very new to programming, so I apologize in case my question seems too fundamental.
Basically, I have a data set of approx. 300 rows. The idea is to create an entirely new data set of, for instance, 10k rows that still has the same characteristics as the small data set of 300.
ID Category1 Category2 Amount1 Probability1
1 Class1 A 100 0.3
2 Class2 B 800 0.2
3 Class3 C 300 0.7
4 Class2 A 250 0.4
5 Class3 C 900 0.6
I already did exploratory analysis. I know that my numeric data has a beta distribution and I know the mean and sd (and the level of skewness in case it is relevant)
For my categorical data I know the percent distribution so for instance category A take 25% of the data set. Category B takes 35% and category C takes 40%.
My question now is: what are the best packages in order to simulate this data and to create a bigger data set?
I found the simstudy package, which seemed very good; however, I am still very new to programming and I'm having a hard time getting my head around the code.
Here is the link to the description
https://cran.r-project.org/web/packages/simstudy/vignettes/simstudy.html
(I also checked the R documentation but for a newbie like me it is very hard to follow and fully understand it)
I still don't really get how I can define my categorical values there. (They set the percent distribution of the individual classes there, but they don't actually set which percentage applies to which class.)
Maybe someone here could help explain how I could apply it to my data set, or is there a better package for that?
Thank you very much in advance!
EDIT
So my current code with the simstudy package is the following:
def <- defData(varname = "Product_Class", formula = "0.25;0.35;0.4", dist = "categorical")
def <- defData(def, varname = "Category", formula = "0.25;0.35;0.4", dist = "categorical")
def <- defData(def, varname = "Amount", dist = "beta", formula = 0.6, variance = 0.12)
def <- defData(def, varname = "Amount2", dist = "beta", formula = 0.45, variance = 0.1)
def <- defData(def, varname = "Probability", dist = "beta", formula = 0.4, variance = 0.23)
However, my problem here is that I can't create a skewed beta distribution (and I know that my data is skewed to the right).
Alternatively, I could use this function, but here I have to create each column separately and I cannot create a relationship between columns (e.g. correlation, which I would have to create later on as well):
rbeta(n, shape1, shape2)
# shape1 < shape2 (both > 0) creates a right-skewed beta distribution
rbeta(1000, 0.2, 3)
Any other suggestions how to resolve this problem?
How do you usually simulate data sets which have only a limited number of entries?
Would it work if you just used the sample() function in R with replacement?
Here is an example using the mtcars data set.
data(mtcars)
mydata=mtcars[,1:4] # only using the first 4 columns for this example
head(mydata)
dim(mydata) # data has 32 rows 4 columns
bigdata=data.frame(mpg=sample(mydata$mpg,1000,replace = T),
cyl=sample(mydata$cyl,1000,replace = T),
disp=sample(mydata$disp,1000,replace = T),
hp=sample(mydata$hp,1000,replace = T))
head(bigdata)
dim(bigdata)
I actually have done something exactly like this. I'm calculating the actual min and max for each variable, so I can simulate to mimic my own original dataset. Using simstudy has several advantages over just using sample, primarily that sample only takes from the existing data available, while simstudy generates any potential value between the minimum and maximum (for numeric types), or a proportion for the categorical variables. Simstudy is also useful if your original data is sensitive/personal data, so you can bypass privacy problems compared to using sample. This is what I did:
library(skimr)
library(simstudy)
library(dplyr)
library(glue)
sim_definitions <-
skim_to_wide(iris) %>%
mutate(min = as.numeric(p0), max = as.numeric(p100)) %>%
transmute(
varname = variable,
dist = case_when(
# For binary data if it is only 0 and 1
n_unique == 2 ~ "binary",
n_unique > 2 ~ "categorical",
TRUE ~ "uniform"
),
formula = case_when(
dist == "uniform" ~ as.character(glue("{min};{max}")),
# For only factors with 3 levels. number is proportion. 0.3 = 30%
dist == "categorical" ~ "0.5;0.2;0.3",
dist == "binary" ~ "0.2",
# other wise 10 is min, 20 is max
TRUE ~ "10;20"
),
link = case_when(
dist == "binary" ~ "logit",
TRUE ~ "identity"
)
)
# 1000 is the final size of the dataset. Change to what ever you want.
simulated_data <- genData(1000, sim_definitions)
dim(simulated_data)
head(simulated_data)
NOTE: I seem to be getting an error with simstudy. Not sure if it's because of an update. Let me know if this works for you. UPDATE: It seems the categorical specification causes the error, but I was unable to find the problem.
UPDATE based on clarification in question and comments:
Your code works fine in generating a simulated dataset. If you want to force a skewed distribution, you can use base R's distribution functions like qlnorm. So:
library(simstudy)
#> Loading required package: data.table
def <- defData(varname = "Product_Class", formula = "0.25;0.35;0.4", dist = "categorical")
def <- defData(def, varname = "Category", formula = "0.25;0.35;0.4", dist = "categorical")
def <- defData(def, varname = "Amount", dist = "beta", formula = 0.6, variance = 0.12)
def <- defData(def, varname = "Amount2", dist = "beta", formula = 0.45, variance = 0.1)
def <- defData(def, varname = "Probability", dist = "beta", formula = 0.4, variance = 0.23)
simulated_data <- genData(1000, def)
hist(simulated_data$Amount2)
simulated_data$Amount2 <- qlnorm(simulated_data$Amount2)
hist(simulated_data$Amount2)
Created on 2019-03-24 by the reprex package (v0.2.1)
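Since the question mentions a known mean and sd for the beta-distributed columns, another option is to convert them into the two shape parameters with the method of moments and call rbeta directly. The sketch below assumes a mean of 0.3 and an sd of 0.15 (both invented), and beta_shapes is a hypothetical helper name. Category labels are attached simply by passing them to sample together with their shares.

```r
# Method of moments: shape parameters of a beta distribution with mean m and sd s
beta_shapes <- function(m, s) {
  stopifnot(m > 0, m < 1, s^2 < m * (1 - m))  # otherwise no beta distribution fits
  k <- m * (1 - m) / s^2 - 1
  c(shape1 = m * k, shape2 = (1 - m) * k)
}

set.seed(1)
n <- 10000
sh <- beta_shapes(m = 0.3, s = 0.15)  # hypothetical mean and sd

sim <- data.frame(
  # Labels and their shares go together, so "A" really gets 25%
  Category2 = sample(c("A", "B", "C"), n, replace = TRUE,
                     prob = c(0.25, 0.35, 0.40)),
  Probability1 = rbeta(n, sh[["shape1"]], sh[["shape2"]])
)

sh
c(mean = mean(sim$Probability1), sd = sd(sim$Probability1))
prop.table(table(sim$Category2))
hist(sim$Probability1)
```

A mean below 0.5 gives shape1 < shape2, which is the right-skewed case. This still draws every column independently, so any correlation between columns would have to be added separately.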
I have a question about the sign of t in a paired-sample t-test using different data structures, but the same data. I know that the sign doesn't make a difference in terms of significance, but, it does generally tell the user if there have been decreases over time or increases over time. So, I need to make sure that the code I provide produces the same results OR, is explained correctly.
I have to explain the results (and code) as an example we're giving users of our software, which uses R (Rdotnet within a C# program) for statistics. I'm unclear as to the proper order of variables in both methods in R.
Method 1 uses two matrices
## Sets seed for repetitive number generation
set.seed(2820)
## Creates the matrices
preTest <- c(rnorm(100, mean = 145, sd = 9))
postTest <- c(rnorm(100, mean = 138, sd = 8))
## Runs paired-sample T-Test just on two original matrices
t.test(preTest,postTest, paired = TRUE)
The results show significance, and the positive t tells me that the mean decreased from preTest to postTest.
Paired t-test
data: preTest and postTest
t = 7.1776, df = 99, p-value = 1.322e-10
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
6.340533 11.185513
sample estimates:
mean of the differences
8.763023
However, most people are going to get their data not from two matrices, but, from a file with values for BEFORE and AFTER. I will have these data in a csv and import them during a demo. So, to mimic this, I need to create data frame in the structure that users of our software are used to seeing. 'pstt' should look like the dataframe I have after I import a csv.
Method 2: uses a flat-file structure
## Converts the matrices into a dataframe that looks like the way these
## data are normally stored in a csv or Excel
ID <- c(1:100)
pstt <- data.frame(ID,preTest,postTest)
## Puts the data in a form that can be used by R (grouping var | data var)
pstt2 <- data.frame(
group = rep(c("preTest","postTest"),each = 100),
weight = c(preTest, postTest)
)
## Runs paired-sample T-Test on the newly structured data frame
t.test(weight ~ group, data = pstt2, paired = TRUE)
The results for this t-test have a negative t, which may indicate to the user that the variable under study has increased over time.
Paired t-test
data: weight by group
t = -7.1776, df = 99, p-value = 1.322e-10
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-11.185513 -6.340533
sample estimates:
mean of the differences
-8.763023
Is there a way to define explicitly which group is the BEFORE and which is the AFTER? Or do you have to have the AFTER group first in Method 2?
Thanks for any help/explanation.
Here is the full R program that I used:
## sets working dir
# setwd("C:\\Temp\\")
## runs file from command line
# source("paired_ttest.r",echo=TRUE)
## Sets seed for repetitive number generation
set.seed(2820)
## Creates the matrices
preTest <- c(rnorm(100, mean = 145, sd = 9))
postTest <- c(rnorm(100, mean = 138, sd = 8))
ID <- c(1:100)
## Converts the matrices into a dataframe that looks like the way these
## data are normally stored
pstt <- data.frame(ID,preTest,postTest)
## Puts the data in a form that can be used by R (grouping var | data var)
pstt2 <- data.frame(
group = rep(c("preTest","postTest"),each = 100),
weight = c(preTest, postTest)
)
print(pstt2)
## Runs paired-sample T-Test just on two original matrices
# t.test(preTest,postTest, paired = TRUE)
## Runs paired-sample T-Test on the newly structured data frame
t.test(weight ~ group, data = pstt2, paired = TRUE)
t.test treats group as a factor and uses the first level of that factor as the first group. By default, factor levels are sorted alphabetically, so "AFTER" would come before "BEFORE" and "postTest" would come before "preTest". You can explicitly set the reference level of a factor with relevel(). Wrapping group in factor() keeps this working when data.frame() has left it as a character column, as it does by default since R 4.0.
t.test(weight ~ relevel(factor(group), "preTest"), data = pstt2, paired = TRUE)
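Another way to make the direction explicit, sketched here under the assumption that the long-format rows are in the same subject order within each group, is to fix the level order once and then pass the two vectors to t.test yourself. The names pre, post and res are just for illustration. This also avoids relying on paired = TRUE in the formula interface, which I believe newer R versions no longer accept.

```r
set.seed(2820)
preTest  <- rnorm(100, mean = 145, sd = 9)
postTest <- rnorm(100, mean = 138, sd = 8)

pstt2 <- data.frame(
  group  = rep(c("preTest", "postTest"), each = 100),
  weight = c(preTest, postTest)
)

# State the order explicitly instead of relying on alphabetical sorting
pstt2$group <- factor(pstt2$group, levels = c("preTest", "postTest"))

# Pull out the two vectors; the difference is always first minus second
pre  <- pstt2$weight[pstt2$group == "preTest"]
post <- pstt2$weight[pstt2$group == "postTest"]

res <- t.test(pre, post, paired = TRUE)
res
```

Here a positive t means the values went down from pre to post, which matches Method 1 in the question.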