No SMD (standardized mean differences) shown by tableone::CreateTableOne in R

In R, I am trying to use tableone::CreateTableOne to calculate SMDs (standardized mean differences) on a dataframe. I followed this tutorial (https://cran.r-project.org/web/packages/tableone/vignettes/smd.html); its code runs and nicely produces the desired output table, including the SMDs.
However, if I use my own data, e.g. the test data below, I get the table but without SMDs. I probably made some silly mistake, but after trying a lot of things (only numeric variables, smaller and larger datasets, categorical variables as factor (as in the R help) or as character (as in the tutorial)...) I cannot figure out why I get no SMDs.
# package tableone for CreateTableOne
if (!require("tableone")) install.packages("tableone"); library("tableone")
# reproducible test data
set.seed(1234)
d <- data.frame(age   = rnorm(n = 200, mean = 50, sd = 9),
                hair  = as.factor(sample(x = c("brown", "black", "blond"), 200, replace = TRUE)),
                group = sample(x = c("sick", "healthy"), 200, replace = TRUE))
str(d)
# calculate and print the table
tabUnmatched <- tableone::CreateTableOne(vars = c("age", "hair"), strata = "group", data = d, test = FALSE, smd = TRUE)
print(tabUnmatched)
This results in the following table, WITHOUT SMDs (and no error message):
                 Stratified by group
                  healthy       sick
  n                  90           110
  age (mean (SD)) 49.18 (7.97)  49.72 (10.10)
  hair (%)
     black           30 (33.3)     35 (31.8)
     blond           33 (36.7)     43 (39.1)
     brown           27 (30.0)     32 (29.1)
What am I doing wrong, and what do I need to do to get SMD output?

Errr... this? CreateTableOne() does compute the SMDs (its smd argument defaults to TRUE when strata are given), but the print method has its own smd argument, which defaults to FALSE, so you have to ask for the column when printing:
print(tabUnmatched, smd = TRUE)
                 Stratified by group
                  healthy       sick           SMD
  n                  90           110
  age (mean (SD)) 49.18 (7.97)  49.72 (10.10)  0.059
  hair (%)                                     0.050
     black           30 (33.3)     35 (31.8)
     blond           33 (36.7)     43 (39.1)
     brown           27 (30.0)     32 (29.1)
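If you need the SMDs as numbers rather than printed text (e.g. to filter variables on them), the package also exports ExtractSmd(), which the same vignette uses:
# Extract the SMDs as a numeric matrix (one row per variable)
tableone::ExtractSmd(tabUnmatched)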

Related

Data sim and set.seed()

I've been assigned to create a dataset of simulated patient data in R for an assignment. We've been provided variable names and that's it. I want to draw a random sample of 100 and use set.seed() to make it reproducible, but when I run the code I originally got different sample values each time I re-opened the script, and now I just get error messages and it won't run.
This is what I have:
pulse_data <- data.frame(
  group = c(rep("control", "treatment")),
  age = sample(c(20:75)),
  gender = c(rep("male", "female")),
  resting_pulse = sample(c(40:120)),
  height_cm = sample(c(140:220))
)
set.seed(30)
pulse_sim <- sample_n(pulse_data, 100, replace = FALSE)
Am I missing something fundamental?!
(total beginner, speak to me like an idiot and I might understand :) )
I've tried running sample_n() straight on the dataframe with set.seed() before it, and putting set.seed() inside the pulse_sim call, but to no avail... as for why I get errors now, I'm at my wits' end.
Realize that pulse_data is created using random data, so each time the script runs you get new random data. Only after creating it do you set the random seed, so sample_n() picks the same rows it did the last time you opened the script, but ... those rows hold different data. (The hard errors come from rep("control", "treatment"): rep() expects a count or length as its second argument, not another string, and all columns of a data.frame must have the same length, hence the length.out below.) SOLUTION: set the random seed before you define pulse_data.
library(dplyr)  # for sample_n()
set.seed(30)    # seed BEFORE generating the random data
pulse_data <- data.frame(
  group = rep(c("control", "treatment"), length.out = 30),
  age = sample(20:75, size = 30),
  gender = rep(c("male", "female"), length.out = 30),
  resting_pulse = sample(40:120, size = 30),
  height_cm = sample(140:220, size = 30)
)
pulse_sim <- sample_n(pulse_data, 10, replace = FALSE)
I have put that code, plus a bare pulse_sim on the last line (to print it), in a file 74408236.R. (Note that I added length.out and changed your sample size from 100 to 10 for the sake of this demonstration.) I can run it repeatedly with this shell command (not in R):
$ Rscript.exe 74408236.R
       group age gender resting_pulse height_cm
1  treatment  28 female            76       210
2  treatment  24 female           118       140
3    control  44   male            57       141
4    control  70   male            96       184
5  treatment  22 female            87       177
6    control  30   male            50       168
7    control  39   male            56       145
8  treatment  37 female           120       182
9  treatment  20 female            79       181
10 treatment  75 female            98       186
When I run it a few times in a row, I get the same output. For brevity, I'll demonstrate the sameness by showing the output's MD5 checksum; while MD5 is not the most "secure" (cryptographically), it is an easy way to show that the output is not changing. (This is shell scripting, still not R.)
$ for rep in $(seq 1 5) ; do Rscript.exe 74408236.R | md5sum; done
0f06ecd84c1b65d6d5e4ee36dea76add -
0f06ecd84c1b65d6d5e4ee36dea76add -
0f06ecd84c1b65d6d5e4ee36dea76add -
0f06ecd84c1b65d6d5e4ee36dea76add -
0f06ecd84c1b65d6d5e4ee36dea76add -
In fact, if I repeat it 100 times, I still see no change. I'll pipe through uniq -c to collapse the repeated output into a count (the first number) followed by the output itself (the checksum).
$ for rep in $(seq 1 100) ; do /mnt/c/R/R-4.1.2/bin/Rscript.exe 74408236.R | md5sum; done | uniq -c
100 0f06ecd84c1b65d6d5e4ee36dea76add -
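If you'd rather check reproducibility without leaving R, here is a minimal sketch (mine, not part of the answer above): wrap the generation in a function that seeds first, then compare two runs.
# Seeding BEFORE the random draws makes the whole data frame
# reproducible, so two independent calls give identical results.
make_pulse_data <- function() {
  set.seed(30)  # seed first, then draw
  data.frame(
    group = rep(c("control", "treatment"), length.out = 30),
    resting_pulse = sample(40:120, size = 30)
  )
}
identical(make_pulse_data(), make_pulse_data())  # TRUE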

Generating meaningful sample data in R based on conditions?

I'm trying to generate some sample insurance claims data that is meaningful instead of just random numbers.
Assuming I have two columns Age and Injury, I need meaningful values for ClaimAmount based on certain conditions:
ClaimantAge | InjuryType | ClaimAmount
---------------------------------------
     35     | Bruises    |
     55     | Fractures  |
    ...     | ...        |
I want to generate claim amounts that increase as age increases, and then plateau at around a certain age, say 65.
Claims for certain injuries need to be higher than claims for other types of injuries.
Currently I am generating my samples in a random manner, like so:
amount <- sample(0:100000, 2000, replace = TRUE)
How do I generate more meaningful samples?
There are many ways this could need to be adjusted, as I don't know the field. Given that we're talking about dollar amounts, I would use the Poisson distribution to generate the data.
set.seed(1)
n_claims <- 2000
injuries <- c("bruises", "fractures")
prob_injuries <- c(0.7, 0.3)
sim_claims <- data.frame(claimid = 1:n_claims)
sim_claims$age <- round(rnorm(n = n_claims, mean = 35, sd = 15), 0)
sim_claims$Injury <- factor(sample(injuries, size = n_claims, replace = TRUE, prob = prob_injuries))
sim_claims$Amount <- rpois(n_claims,
  lambda = 100 + 5 * (sim_claims$age - median(sim_claims$age)) +
    dplyr::case_when(sim_claims$Injury == "bruises" ~ 50,
                     sim_claims$Injury == "fractures" ~ 500))
head(sim_claims)
  claimid age    Injury Amount
1       1  26   bruises    117
2       2  38   bruises    175
3       3  22   bruises    102
4       4  59   bruises    261
5       5  40 fractures    644
6       6  23   bruises     92
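The question also asked for amounts that plateau around age 65. As a sketch on top of the simulation above (my addition; AmountPlateau is a made-up column name), you can cap the age term with pmin() and guard the Poisson rate against going negative for the youngest simulated ages:
# Cap the age effect at 65 so amounts plateau; pmax() keeps the
# Poisson rate non-negative for very young simulated ages.
age_capped <- pmin(sim_claims$age, 65)
sim_claims$AmountPlateau <- rpois(n_claims,
  lambda = pmax(0, 100 + 5 * (age_capped - median(sim_claims$age)) +
    dplyr::case_when(sim_claims$Injury == "bruises" ~ 50,
                     sim_claims$Injury == "fractures" ~ 500)))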

Using barplot in RStudio

When I try this code for a barplot (L$neighbourhood is the apartment neighbourhood in Paris, for example Champs-Elysées or Batignolles, which is string data, and L$price is the numeric apartment price):
barplot(L$neighbourhood, L$price, main = "TITLE", xlab = "Neighbourhood", ylab = "Price")
I get an error:
Error in barplot.default(L$neighbourhood, L$price, main = "TITLE",
  xlab = "Neighbourhood", : 'height' must be a vector or a matrix
Can we not use string data as input to the barplot function in R? How can I fix this error, please?
It's quite unclear what you want to barplot. Let's assume you want to see the average price per neighborhood. If that's what you're after, you can proceed like this.
First some illustrative data:
set.seed(123)
Neighborhood <- sample(LETTERS[1:4], 10, replace = TRUE)
Price <- sample(10:100, 10, replace = TRUE)
df <- data.frame(Neighborhood, Price)
df
   Neighborhood Price
1             C    23
2             C    34
3             C    99
4             B   100
5             C    78
6             B   100
7             B    66
8             B    18
9             C    81
10            A    35
Now compute the averages by neighborhood using the function aggregate and store the result in a new dataframe:
df_new <- aggregate(x = df$Price, by = list(df$Neighborhood), FUN = mean)
df_new
  Group.1  x
1       A 35
2       B 71
3       C 63
And finally you can plot the average prices held in variable x, adding the neighborhood names from the Group.1 column:
barplot(df_new$x, names.arg = df_new$Group.1)
An even simpler solution is this, using tapply and mean:
df_new <- tapply(df$Price, df$Neighborhood, mean)
barplot(df_new, names.arg = names(df_new))
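For what it's worth, aggregate() also has a formula interface that does the same grouping in one line and keeps readable column names:
# Same averages via the formula interface of aggregate()
df_new <- aggregate(Price ~ Neighborhood, data = df, FUN = mean)
barplot(df_new$Price, names.arg = df_new$Neighborhood)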

Creating a contingency table by hypergeometric sampling with the Titanic dataset

I created a contingency table from the Titanic passenger data by hypergeometric sampling, meaning that both sets of marginal totals are preset and equal. It was created by crossing the Sex and Survived columns of 328 cases (164 men and 164 women). This is the code:
First, I ungrouped the data and deleted the unneeded columns:
titanic = as.data.frame(Titanic)
titanic = titanic[rep(1:nrow(titanic),titanic$Freq),]
titanic = titanic[,c(2,4)]
Later, I selected a sample of men:
men = subset(titanic, titanic$Sex == 'Male')
men = men[sample(nrow(men), 164), ]
table(men$Sex, men$Survived)
#           No Yes
#   Male   133  31
#   Female   0   0
Now the women's row must be filled in with the appropriate values:
n = summary.factor(men$Survived)
womenYes = subset(titanic, (titanic$Sex == 'Female' & titanic$Survived=='Yes'))
womenYes = subset(womenYes[1:n[1], ])
womenNo = subset(titanic, (titanic$Sex == 'Female' & titanic$Survived=='No'))
womenNo = subset(womenNo[1:n[2], ])
women = merge(womenYes, womenNo, all = TRUE)
hyperSample = merge(men, women, all = TRUE)
table(hyperSample$Sex, hyperSample$Survived)
#           No Yes
#   Male   133  31
#   Female  31 133
It works, but it looks a bit ugly, and I honestly think someone could find a much more elegant or efficient way to do it. Thanks.
You can sample in two stages, both using rhyper: first to determine the numbers of men and women, subject only to a total sample of 328 and assuming the population is sex-distributed as in the original data. This is what you might do if you were bootstrapping a statistic like a rate ratio. Then use rhyper twice more to determine the numbers of survivors, subject to the same probabilities as in the original sample rows.
MFmat <- apply(Titanic, c(2, 4), sum)   # Sex x Survived counts
nMale <- rhyper(1, rowSums(MFmat)[1], rowSums(MFmat)[2], 328)
#[1] 262
nFemale <- 328 - nMale
DMale <- rhyper(1, MFmat[1, 1], MFmat[1, 2], nMale)     # male deaths
SurvMale <- nMale - DMale
DFemale <- rhyper(1, MFmat[2, 1], MFmat[2, 2], nFemale) # female deaths
SurvFemale <- nFemale - DFemale
matrix(c(DMale, DFemale, SurvMale, SurvFemale), ncol = 2,
       dimnames = dimnames(MFmat))
#----
        Survived
Sex       No Yes
  Male   223  42
  Female  22  41
I suppose you could sample the two rows separately, and you should be able to use the logic above, ... if that is what you have decided to do. Which way is more appropriate will depend on the underlying problem.
# Fixed row marginals....
nMale <- 164
nFemale <- 164
DMale <- rhyper(1, MFmat[1, 1], MFmat[1, 2], nMale)
SurvMale <- nMale - DMale
DFemale <- rhyper(1, MFmat[2, 1], MFmat[2, 2], nFemale)
SurvFemale <- nFemale - DFemale
matrix(c(DMale, DFemale, SurvMale, SurvFemale), ncol = 2,
       dimnames = dimnames(MFmat))
#----------------
        Survived
Sex       No Yes
  Male   127  37
  Female  39 125
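As an aside (my addition, not part of the answer above): when both margins really are preset, base R's r2dtable() draws contingency tables with both margins fixed in a single call. The equal column margins below are an assumption matching the symmetric table in the question.
# One draw of a 2x2 table with fixed row AND column margins
tab <- r2dtable(1, r = c(164, 164), c = c(164, 164))[[1]]
dimnames(tab) <- list(Sex = c("Male", "Female"), Survived = c("No", "Yes"))
tab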

Calculate the trend for a 2000-row time series and isolate the abnormal rows

I have an R dataframe which describes the evolution of the sales of a product in approx. 2000 shops on a quarterly basis, with 5 columns (i.e. 5 periods of time). I'd like to know how to analyse it with R.
I've already tried some basic analysis, that is to say determining the average sales for each period and then comparing the evolution of each shop to this general evolution. For instance, there is a total of 55 000 sales for the 1st period and 35 000 for the 5th, so I assume that for each shop the normal sale in the 5th period should be 35/55 = 0.63 times the amount of the 1st period's sale: if shop X sold 100 items in the first period, I assume it should normally sell 63 items in the 5th period.
Obviously, this is an easy-to-do method, but it is not statistically sound.
I would like a method which would let me determine a trend curve that minimizes the squared error. My objective is to analyse the sales of the shops while neutralizing the general trend: I'd like to know precisely which shops are underperforming and which are overperforming, with a statistically correct approach.
My dataframe is structured in this way :
shopID | sum | qt1 | qt2 | qt3 | qt4 | qt5
000001 | 150 | 45 | 15 | 40 | 25 | 25
000002 | 100 | 20 | 20 | 20 | 20 | 20
000003 | 500 | 200 | 0 | 100 | 100 | 100
... (2200 rows)
I've tried to put my time series in a list, which was successful, with the following function:
reversesales = t(data.frame(sales$qt1, sales$qt2, sales$qt3, sales$qt4, sales$qt5))
# I transpose the frame so that the time periods become the rows
timeser<-ts(reversesales,start=1,end=5, deltat=1/4)
# deltat=1/4 because it is a quarterly basis, 1 and 5 because I have 5 quarters
Still, I am unable to do anything with this variable. I can't plot it (with the plot function), because with 2200 rows R wants to draw 2200 successive plots, which is obviously not what I want.
In addition, I don't know how to determine the theoretical trend and the theoretical value of the sales for each period and each shop...
Thank you for your help! (and merry Christmas)
An implementation of a mixed model:
# install.packages("nlme")
library(nlme)
library(dplyr)
# Generating some data with a structure like yours
# (data_frame() is deprecated these days; tibble() is the modern equivalent):
start <- round(sample(10:100, 50, replace = TRUE) * runif(50))
df <- data_frame(shopID = 1:50,
                 qt1 = start,
                 qt2 = round(qt1 * runif(50, .5, 2)),
                 qt3 = round(qt2 * runif(50, .5, 2)),
                 qt4 = round(qt3 * runif(50, .5, 2)),
                 qt5 = round(qt4 * runif(50, .5, 2)))
df <- as.data.frame(df)
# Converting it into the long format:
df <- reshape(df, idvar = "shopID", varying = names(df)[-1],
              direction = "long", sep = "")
Estimating the model:
mod <- lme(qt ~ time, random = ~ time | shopID, data = df)
# Extract the random effects for comparison:
random.effects(mod)
(Intercept) time
1 74.0790805 3.7034172
2 7.8713699 4.2138001
3 -8.0670810 -5.8754060
4 -16.5114428 16.4920663
5 -16.7098229 6.4685228
6 -11.9630688 -8.0411504
7 -12.9669777 21.3071366
8 -24.1099280 32.9274361
9 8.5107335 -9.7976905
10 -13.2707679 -6.6028927
11 3.6206163 -4.1017784
12 21.2342886 -6.7120725
13 -14.6489512 11.6847109
14 -14.7291647 2.1365768
15 10.6791941 3.2097199
16 -14.1524187 -1.6933291
17 5.2120647 8.0119320
18 -2.5172933 -6.5011416
19 -9.0094366 -5.6031271
20 1.4857512 -5.9913865
21 -16.5973442 3.5164298
22 -26.7724763 27.9264081
23 49.0764631 -12.9800871
24 -0.1512509 2.3589947
25 15.7723150 -7.9295698
26 2.1955489 11.0318875
27 -8.0890346 -5.4145977
28 0.1338790 -8.3551182
29 9.7113758 -9.5799588
30 -6.0257683 42.3140432
31 -15.7655545 -8.6226255
32 -4.1450984 18.7995079
33 4.1510104 -1.6384103
34 2.5107652 -2.0871890
35 -23.8640815 7.6680185
36 -10.8228653 -7.7370976
37 -14.1253093 -8.1738468
38 42.4114024 -9.0436585
39 -10.7453627 2.4590883
40 -12.0947901 -5.2763010
41 -7.6578305 -7.9630013
42 -14.9985612 -0.4848326
43 -13.4081771 -7.2655456
44 -11.5646620 -7.5365387
45 6.9116844 -10.5200339
46 70.7785492 -11.5522014
47 -7.3556367 -8.3946072
48 27.3830419 -6.9049164
49 14.3188079 -9.9334156
50 -15.2077850 -7.9161690
I would interpret the values as follows: consider them deviations from zero, so that positive values are positive deviations from the average and negative values are negative deviations from the average. The means of the two columns are zero, as checked below:
round(apply(random.effects(mod), 2, mean))
(Intercept)        time
          0           0
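To act on these numbers (a sketch of mine, not part of the answer): sort the shop-level random slopes to flag the shops whose trend deviates most from the overall trend.
# Rank shops by their deviation from the average time trend
re <- random.effects(mod)
head(re[order(re$time), ])   # strongest underperformers
head(re[order(-re$time), ])  # strongest overperformers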
library(zoo)
# Reconstructing the data with four quarter columns (instead of the five quarters in your example)
shopID <- c(1, 2, 3, 4, 5)
sum <- c(150, 100, 500, 350, 50)
qt1 <- c(40, 10, 130, 50, 10)
qt2 <- c(40, 40, 110, 100, 15)
qt3 <- c(50, 30, 140, 150, 10)
qt4 <- c(20, 20, 120, 50, 15)
myDF <- data.frame(shopID, sum, qt1, qt2, qt3, qt4)
# The ts() function converts a numeric vector into an R time series object
ts1 <- ts(as.numeric((myDF[1,3:6])), frequency=4)
ts2 <- ts(as.numeric((myDF[2,3:6])), frequency=4)
ts3 <- ts(as.numeric((myDF[3,3:6])), frequency=4)
ts4 <- ts(as.numeric((myDF[4,3:6])), frequency=4)
ts5 <- ts(as.numeric((myDF[5,3:6])), frequency=4)
# Merge the time series objects
tsm <- merge(a = as.zoo(ts1), b = as.zoo(ts2), c = as.zoo(ts3), d = as.zoo(ts4), e = as.zoo(ts5))
# Plotting the time series
plot.ts(tsm, plot.type = "single", lty = 1:5, xlab = "Time", ylab = "Sales")
The code is not optimized, and can be improved. More about time series analysis can be read here. Hope this gives some direction.
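If all you need is a quick overlay of many shops, one base-R alternative (my sketch, not part of the answer above) is matplot(), which plots each row as a line without building ts/zoo objects first:
# Each shop (row) becomes one line; quarters run along the x-axis
matplot(t(as.matrix(myDF[, 3:6])), type = "l", lty = 1,
        xlab = "Quarter", ylab = "Sales")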
