Data simulation and set.seed() in R

I've been assigned to create a dataset of simulated patient data in R for an assignment. We've been given the variable names and that's it. I want to take a random sample of 100 rows and use set.seed() to make it reproducible, but originally I got different sampled values each time I re-opened the script, and now I just get error messages and it won't run at all.
This is what I have:
pulse_data <- data.frame(
  group = c(rep("control", "treatment")),
  age = sample(c(20:75)),
  gender = c(rep("male", "female")),
  resting_pulse = sample(c(40:120)),
  height_cm = sample(c(140:220))
)
set.seed(30)
pulse_sim <- sample_n(pulse_data, 100, replace = FALSE)
Am I missing something fundamental?!
(total beginner, speak to me like an idiot and I might understand :) )
I've tried using sample_n() straight on the data frame with set.seed(), and putting set.seed() right before the pulse_sim line, but to no avail... and as for why I get errors now, I'm at my wit's end.

Realize that pulse_data itself is created from random data, so each time the script runs you start with different data. You only set the random seed after creating it, so sample_n() does pick the same rows it picked last time, but those rows contain different values. (The errors you see now are a separate problem with the data.frame() call: rep("control", "treatment") is not a valid call because rep()'s second argument must be a number of repetitions, and the columns would end up with different lengths anyway, which data.frame() rejects; giving every column the same length, as below, fixes that.) SOLUTION: set the random seed before you define pulse_data.
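To see the principle in isolation, here is a tiny throwaway example (toy numbers, nothing to do with your variables): anything random that runs before set.seed() changes between runs, anything after it does not.
x <- sample(1:10, 3)   # runs BEFORE the seed is set: different every time the script is run
set.seed(30)
y <- sample(1:10, 3)   # runs AFTER the seed is set: the same three numbers on every run
Applying that to your code (and making every column the same length):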
library(dplyr)   # sample_n() comes from dplyr

set.seed(30)     # set the seed BEFORE any random data is generated
pulse_data <- data.frame(
  group = rep(c("control", "treatment"), length.out = 30),
  age = sample(c(20:75), size = 30),
  gender = rep(c("male", "female"), length.out = 30),
  resting_pulse = sample(c(40:120), size = 30),
  height_cm = sample(c(140:220), size = 30)
)
pulse_sim <- sample_n(pulse_data, 10, replace = FALSE)
I have put that code, plus a bare pulse_sim on the last line (to print it), in a file 74408236.R. (Note that I loaded dplyr for sample_n(), added length.out= and size= so every column has the same length, moved set.seed() above the data creation, and changed your sample size from 100 to 10, for the sake of this demonstration.) I can run it quickly with this shell command (not in R):
$ Rscript.exe 74408236.R
       group age gender resting_pulse height_cm
 1 treatment  28 female            76       210
 2 treatment  24 female           118       140
 3   control  44   male            57       141
 4   control  70   male            96       184
 5 treatment  22 female            87       177
 6   control  30   male            50       168
 7   control  39   male            56       145
 8 treatment  37 female           120       182
 9 treatment  20 female            79       181
10 treatment  75 female            98       186
When I run it a few times in a row, I get the same output. For brevity, I'll demonstrate the sameness by showing the output's MD5 checksum; MD5 is not the most "secure" (cryptographically), but it's an easy way to show that the output is unlikely to differ. (This is shell scripting, still not in R.)
$ for rep in $(seq 1 5) ; do Rscript.exe 74408236.R | md5sum; done
0f06ecd84c1b65d6d5e4ee36dea76add -
0f06ecd84c1b65d6d5e4ee36dea76add -
0f06ecd84c1b65d6d5e4ee36dea76add -
0f06ecd84c1b65d6d5e4ee36dea76add -
0f06ecd84c1b65d6d5e4ee36dea76add -
In fact, if I repeat it 100 times, I still see no change. I'll pipe through uniq -c to replace repeated output with the count (first number) and the output (everything else, the checksum).
$ for rep in $(seq 1 100) ; do /mnt/c/R/R-4.1.2/bin/Rscript.exe 74408236.R | md5sum; done | uniq -c
100 0f06ecd84c1b65d6d5e4ee36dea76add -
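If you would rather verify this without leaving R, you can source the script twice and compare the results (a quick sketch; it assumes the file is in your working directory and that it creates pulse_sim):
source("74408236.R"); first  <- pulse_sim   # run the whole script once
source("74408236.R"); second <- pulse_sim   # run it again from scratch
identical(first, second)                    # TRUE when the output is reproducible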

Related

nnet gives me error "NA/NaN/Inf in foreign function call (arg 2)" in RStudio

I've been trying to run a neural network in RStudio to predict the outcome of a bank marketing campaign, but for some reason I get the error below.
> bankData_net <- nnet(bankData[A,c(1:4)], Train_lab[A,], size=3, maxit=100, softmax=TRUE)
# weights: 23
Error in nnet.default(bankData[A, c(1:4)], Train_lab[A, ], size = 3, maxit = 100, :
NA/NaN/Inf in foreign function call (arg 2)
In addition: Warning message:
In nnet.default(bankData[A, c(1:4)], Train_lab[A, ], size = 3, maxit = 100, :
NAs introduced by coercion
The data looks like the sample below (these are only the first 10 rows, to give an idea; the full dataset has a few thousand rows).
age job marital education y
1 56 housemaid married basic.4y no
2 57 services married high.school no
3 37 services married high.school no
4 40 admin. married basic.6y no
5 56 services married high.school yes
6 45 services married basic.9y no
7 59 admin. married professional.course no
8 41 blue-collar married unknown yes
9 24 technician single professional.course no
10 25 services single high.school no
And below is the code I'm trying to run.
# save the data set in a variable
bankData = read.csv("data/bank-additional.csv", sep = ";")
# print first 10 rows of the bank data
head(bankData, n=10)
# Remove variable "duration" which is not helpful
newbankData <- subset(bankData, select = c(age, job, marital, education, y))
head(newbankData, n=10)
library(nnet)
# create train labels: convert the text bankData responses to numeric class labels
Train_lab <- class.ind(bankData$y)
# set seed for random number generator for repeatable results
set.seed(1)
# Create indexes for training (70%) and validation (30%) data
A <- sort(sample(nrow(bankData), nrow(bankData)*.7))
# train neural net
# bankData[A,c(1:4)] is to select the first 4 variables as inputs
# size=3 for 3 hidden units, maxit=100 to train for 100 iterations
bankData_net <- nnet(bankData[A,c(1:4)], Train_lab[A,], size=3, maxit=100, softmax=TRUE)
# test
Yt <- predict(bankData_net, bankData[-A,c(1:4)], type="class")
# build a confusion matrix
conf.matrix <- table(bankData[-A,]$y, Yt)
rownames(conf.matrix) <- paste("Actual", rownames(conf.matrix))
colnames(conf.matrix) <- paste("Pred", colnames(conf.matrix))
print(conf.matrix)
Please help me figure out what I'm doing wrong and how I can fix it.

How to create a more concise table with these 2 variables? (R programming)

I am using the dataset nba_ht_wt, which can be imported with readr from the URL http://users.stat.ufl.edu/~winner/data/nba_ht_wt.csv. The question I am trying to tackle is: "What percentage of players have a BMI over 25, which is considered 'overweight'?"
I already created a new variable in the table called highbmi, which corresponds to bmi > 25. This is my code, but the resulting table is hard to read; how could I get a more concise, easier-to-read table?
nba_ht_wt = nba_ht_wt %>% mutate(highbmi = bmi>25)
tab = table(nba_ht_wt$highbmi, nba_ht_wt$Player)
100*prop.table(tab,1)
I am using R programming.
There is no variable called bmi in the data provided, so I will take a guess that it is calculated via the formula weight/height^2, where height is in meters.
data <- read.csv("http://users.stat.ufl.edu/~winner/data/nba_ht_wt.csv")
head(data)
Player Pos Height Weight Age
1 Nate Robinson G 69 180 29
2 Isaiah Thomas G 69 185 24
3 Phil Pressey G 71 175 22
4 Shane Larkin G 71 176 20
5 Ty Lawson G 71 195 25
6 John Lucas III G 71 157 30
I am no expert, but it looks to me like the height and weight columns have their names swapped for some reason.
So I will make this adjustment to calculate bmi:
data$bmi <- data$Height/(data$Weight/100)**2
And now we can answer "What percentage of players have a BMI over 25, which is considered 'overweight'?" with a simple line of code:
mean(data$bmi > 25)
Multiply this number by 100 to get the answer as a percentage, so the answer will be 1.782178%.
Assuming the formula: weight (lb) / [height (in)]^2 * 703 (source: https://www.cdc.gov/healthyweight/assessing/bmi/adult_bmi/index.html), you could do:
library(data.table)
nba_ht_wt <- fread("http://users.stat.ufl.edu/%7Ewinner/data/nba_ht_wt.csv")
nba_ht_wt[, highbmi:=(Weight / Height**2 * 703)>25][,
.(`% of Players`=round(.N/dim(nba_ht_wt)[1]*100,2)), by="highbmi"][]
#> highbmi % of Players
#> 1: TRUE 45.35
#> 2: FALSE 54.65
... or plug the formula into the previous answer for a base R solution.
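A minimal base R sketch of that (illustrative only; nba here is just a fresh copy of the CSV read with read.csv()):
# base R: imperial BMI formula, then the share of players over 25
nba <- read.csv("http://users.stat.ufl.edu/~winner/data/nba_ht_wt.csv")
nba$highbmi <- (nba$Weight / nba$Height^2 * 703) > 25
round(100 * mean(nba$highbmi), 2)   # percentage of players with BMI > 25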
This simple formula might not be really appropriate for basketball players, obviously.

Synth() and dataprep() in R

I have a dataset that I'd like to perform synth() on. Right now I'm working on the dataprep() command and I'm running into some issues. Below is a sample of what my dataset looks like:
unit.number 2000 2001 2002 2003 2004
1 400 344 252 212 344
2 342 234 111 102 222
3 244 555 512 122 152
4 515 125 324 100 155
My treated unit is unit.number = 3, and the treated time period is 2004. Since I'd like to use the lagged outcome variables as predictors, I've left the data in long format. However, I also obviously want to use the years as time.variable, so I'm not sure how to do that (I originally tried inputting it as a vector with all of the year columns -- see below). Is there some way that I can make the year columns into rows as well (kind of like converting the data to wide, but while also keeping the columns present), or is there some other way that I can construct time.variable? Here is what I have so far in terms of code:
dataprep(foo = data, predictors = c("2000 : 2004"),
predictors.op = c("mean"), dependent = "2004",
unit.variable = "unit.number", time.variable = c("2000 : 2004"),
treatment.identifier = 3, controls.identifier = c(1:2),
time.predictors.prior = c("2000 : 2003"),
time.plot = c("2000 : 2004"))
I'd really appreciate some assistance with this! Thanks.

Generating meaningful sample data in R based on conditions?

I'm trying to generate some sample insurance claims data that is meaningful instead of just random numbers.
Assuming I have two columns Age and Injury, I need meaningful values for ClaimAmount based on certain conditions:
ClaimantAge | InjuryType | ClaimAmount
---------------------------------------
35 Bruises
55 Fractures
. .
. .
. .
I want to generate claim amounts that increase as age increases, and then plateaus at around a certain age, say 65.
Claims for certain injuries need to be higher than claims for other types of injuries.
Currently I am generating my samples in a random manner, like so:
amount <- sample(0:100000, 2000, replace = TRUE)
How do I generate more meaningful samples?
There are many ways this might need to be adjusted, as I don't know the field. Given that we're talking about dollar amounts, I would use the Poisson distribution to generate the data.
set.seed(1)
n_claims <- 2000
injuries <- c("bruises", "fractures")
prob_injuries <- c(0.7, 0.3)
sim_claims <- data.frame(claimid = 1:n_claims)
sim_claims$age <- round(rnorm(n = n_claims, mean = 35, sd = 15), 0)
sim_claims$Injury <- factor(sample(injuries, size = n_claims, replace = TRUE, prob = prob_injuries))
sim_claims$Amount <- rpois(n_claims, lambda = 100 + (5 * (sim_claims$age - median(sim_claims$age))) +
dplyr::case_when(sim_claims$Injury == "bruises" ~ 50,
sim_claims$Injury == "fractures" ~ 500))
head(sim_claims)
claimid age Injury Amount
1 1 26 bruises 117
2 2 38 bruises 175
3 3 22 bruises 102
4 4 59 bruises 261
5 5 40 fractures 644
6 6 23 bruises 92
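As a quick sanity check on the simulation (not part of the original answer), you can compare average claim amounts across the injury types; fractures should come out well above bruises:
# mean simulated claim amount by injury type
aggregate(Amount ~ Injury, data = sim_claims, FUN = mean)
If you also want amounts to plateau at around age 65, one simple tweak would be to cap the age term inside lambda, e.g. with pmin(sim_claims$age, 65), before it goes into rpois().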

no smd (standardized mean differences) shown by tableone::CreateTableOne

In R, I am trying to use tableone::CreateTableOne in order to calculate smd (standardized mean differences) on a dataframe. I used this tutorial (https://cran.r-project.org/web/packages/tableone/vignettes/smd.html) - the code runs and nicely produces the desired output table, including the smd.
However, if I use my own data, e.g. the test data below, I get the table but without the smd. Probably I made some stupid mistake, but after trying a lot of things (only numeric variables, smaller and larger datasets, categorical variables as factor (as in the R help) or as character (as in the tutorial)...) I cannot figure out why I do not get the smd.
# package tableone for CreateTableOne
if (!require("tableone")) install.packages("tableone"); library("tableone")
# reproducible test data
set.seed(1234)
d <- data.frame(age = rnorm(n = 200, mean = 50, 9),
hair = as.factor(sample(x = c("brown", "black", "blond"), 200, replace = T)),
group = sample(x = c("sick", "healthy"), 200, replace = T))
str(d)
# calculate and print the table
tabUnmatched <- tableone::CreateTableOne(vars = c("age", "hair"), strata = "group", data = d, test = FALSE, smd = TRUE)
print(tabUnmatched)
results in the following table, WITHOUT smd (and no error message):
                 Stratified by group
                   healthy       sick
  n                90            110
  age (mean (SD))  49.18 (7.97)  49.72 (10.10)
  hair (%)
     black         30 (33.3)     35 (31.8)
     blond         33 (36.7)     43 (39.1)
     brown         27 (30.0)     32 (29.1)
What am I doing wrong, what do I need to do to get smd output?
errr... this? The SMDs do get calculated; they just aren't displayed unless you ask the print() method for them:
print(tabUnmatched, smd = TRUE)
                 Stratified by group
                   healthy       sick           SMD
  n                90            110
  age (mean (SD))  49.18 (7.97)  49.72 (10.10)  0.059
  hair (%)                                      0.050
     black         30 (33.3)     35 (31.8)
     blond         33 (36.7)     43 (39.1)
     brown         27 (30.0)     32 (29.1)
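If you want the SMD values themselves rather than just the printed table (e.g. for plotting, as in the vignette you linked), the tableone package also provides ExtractSmd():
tableone::ExtractSmd(tabUnmatched)   # returns the SMDs by variable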
