I created a contingency table with the passengers data from the Titanic by the Hypergeometric sampling -That's mean that both of the marginal totals are preset and equals-. It was created crossing the Sex and Survivor columns of 328 cases -164 men and 164 women-, this is the code:
First, I ungroup the data and deleted the useless columns
titanic = as.data.frame(Titanic)
titanic = titanic[rep(1:nrow(titanic),titanic$Freq),]
titanic = titanic[,c(2,4)]
later, selected a sample of men
men = subset(titanic, titanic$Sex == 'Male')
men = men [sample(nrow(men),164), ]
table(men$Sex, men$Survived)
# No Yes
# Male 133 31
# Female 0 0
now the row of women must be filled in with the appropriate values
n = summary.factor(men$Survived)
womenYes = subset(titanic, (titanic$Sex == 'Female' & titanic$Survived=='Yes'))
womenYes = subset(womenYes[1:n[1], ])
womenNo = subset(titanic, (titanic$Sex == 'Female' & titanic$Survived=='No'))
womenNo = subset(womenNo[1:n[2], ])
women = merge(womenYes, womenNo, all = TRUE)
hyperSample = merge(men, women, all = TRUE)
table(hyperSample$Sex, hyperSample$Survived)
# No Yes
# Male 133 31
# Female 31 133
It works, but it looks like a bit ugly and I honestly think perhaps someone could find a much more elegant or efficient way to do it. Thanks.
You can sample in two stages, both using rhyper: First to determine the number of men and women subject to only sampling 328 and assuming populations were sex-distributed as in the original sample. This is what you might do if you were trying to bootstrap a statistic like a rate ratio. And then secondly, use rhyper twice more to determine the numbers of survivors subject to the same probabilities in the original sample rows.
MFmat <- apply(Titanic, c(2, 4), sum)
nMale <- rhyper(1, rowSums(MFmat)[1], rowSums(MFmat)[2], 328)
#[1] 262
nFemale <- 328 - nMale
DMale <- rhyper(1, MFmat[1,1], MFmat[1,2], nMale)
SurvMale = nMale-DMale
DFemale = rhyper(1, MFmat[2,1], MFmat[2,2], nFemale)
SurvFemale = nFemale - DFemale
matrix( c( DMale, DFemale, SurvMale, SurvFemale), ncol=2,
dimnames=dimnames(MFmat) )
#----
Survived
Sex No Yes
Male 223 42
Female 22 41
I suppose you could sample the two rows separately and you should be able to use the logic above, ... if that what you have decided to do. Which way is more appropriate will depend on the underlying problem.
# Fixed row marginals....
nMale <-164
nFemale <- 164
DMale <- rhyper(1, MFmat[1,1], MFmat[1,2], nMale)
SurvMale = nMale-DMale
DFemale = rhyper(1, MFmat[2,1], MFmat[2,2], nFemale)
SurvFemale = nFemale - DFemale
matrix( c( DMale, DFemale, SurvMale, SurvFemale), ncol=2,
dimnames=dimnames(MFmat) )
#----------------
Survived
Sex No Yes
Male 127 37
Female 39 125
Related
I have a dataset with a repeatedly measured continuous outcome and some covariates of different classes, like in the example below.
Id y Date Soda Team
1 -0.4521 1999-02-07 Coke Eagles
1 0.2863 1999-04-15 Pepsi Raiders
2 0.7956 1999-07-07 Coke Raiders
2 -0.8248 1999-07-26 NA Raiders
3 0.8830 1999-05-29 Pepsi Eagles
4 0.1303 2005-03-04 NA Cowboys
5 0.1375 2013-11-02 Coke Cowboys
5 0.2851 2015-06-23 Coke Eagles
5 -0.3538 2015-07-29 Pepsi NA
6 0.3349 2002-10-11 NA NA
7 -0.1756 2005-01-11 Pepsi Eagles
7 0.5507 2007-10-16 Pepsi Cowboys
7 0.5132 2012-07-13 NA Cowboys
7 -0.5776 2017-11-25 Coke Cowboys
8 0.5486 2009-02-08 Coke Cowboys
I am trying to multiply impute missing values in Soda and Team using the mice package. As I understand it, because MI is not a causal model, there is no concept of dependent and independent variable. I am not sure how to setup this MI process using mice. I like some suggestions or advise from others who have encountered missing data in a repeated measure setting like this and how they used mice to tackle this problem. Thanks in advance.
Edit
This is what I have tried so far, but this does not capture the repeated measure part of the dataset.
library(mice)
init = mice(dat, maxit=0)
methd = init$method
predM = init$predictorMatrix
methd [c("Soda")]="logreg";
methd [c("Team")]="logreg";
imputed = mice(data, method=methd , predictorMatrix=predM, m=5)
There are several options to accomplish what you are asking for. I have decided to impute missing values in covariates in the so-called 'wide' format. I will illustrate this with the following worked example, which you can easily apply to your own data.
Let's first make a reprex. Here, I use the longitudinal Mayo Clinic Primary Biliary Cirrhosis Data (pbc2), which comes with the JM package. This data is organized in the so-called 'long' format, meaning that each patient i has multiple rows and each row contains a measurement of variable x measured on time j. Your dataset is also in the long format. In this example, I assume that pbc2$serBilir is our outcome variable.
# install.packages('JM')
library(JM)
# note: use function(x) instead of \(x) if you use a version of R <4.1.0
# missing values per column
miss_abs <- \(x) sum(is.na(x))
miss_perc <- \(x) round(sum(is.na(x)) / length(x) * 100, 1L)
miss <- cbind('Number' = apply(pbc2, 2, miss_abs), '%' = apply(pbc2, 2, miss_perc))
# --------------------------------
> miss[which(miss[, 'Number'] > 0),]
Number %
ascites 60 3.1
hepatomegaly 61 3.1
spiders 58 3.0
serChol 821 42.2
alkaline 60 3.1
platelets 73 3.8
According to this output, 6 variables in pbc2 contain at least one missing value. Let's pick alkaline from these. We also need patient id and the time variable years.
# subset
pbc_long <- subset(pbc2, select = c('id', 'years', 'alkaline', 'serBilir'))
# sort ascending based on id and, within each id, years
pbc_long <- with(pbc_long, pbc_long[order(id, years), ])
# ------------------------------------------------------
> head(pbc_long, 5)
id years alkaline serBilir
1 1 1.09517 1718 14.5
2 1 1.09517 1612 21.3
3 2 14.15234 7395 1.1
4 2 14.15234 2107 0.8
5 2 14.15234 1711 1.0
Just by quickly eyeballing, we observe that years do not seem to differ within subjects, even though variables were repeatedly measured. For the sake of this example, let's add a little bit of time to all rows of years but the first measurement.
set.seed(1)
# add little bit of time to each row of 'years' but the first row
new_years <- lapply(split(pbc_long, pbc_long$id), \(x) {
add_time <- 1:(length(x$years) - 1L) + rnorm(length(x$years) - 1L, sd = 0.25)
c(x$years[1L], x$years[-1L] + add_time)
})
# replace the original 'years' variable
pbc_long$years <- unlist(new_years)
# integer time variable needed to store repeated measurements as separate columns
pbc_long$measurement_number <- unlist(sapply(split(pbc_long, pbc_long$id), \(x) 1:nrow(x)))
# only keep the first 4 repeated measurements per patient
pbc_long <- subset(pbc_long, measurement_number %in% 1:4)
Since we will perform our multiple imputation in wide format (meaning that each participant i has one row and repeated measurements on x are stored in j different columns, so xj columns in total), we have to convert the data from long to wide. Now that we have prepared our data, we can use reshape to do this for us.
# convert long format into wide format
v_names <- c('years', 'alkaline', 'serBilir')
pbc_wide <- reshape(pbc_long,
idvar = 'id',
timevar = "measurement_number",
v.names = v_names, direction = "wide")
# -----------------------------------------------------------------
> head(pbc_wide, 4)[, 1:9]
id years.1 alkaline.1 serBilir.1 years.2 alkaline.2 serBilir.2 years.3 alkaline.3
1 1 1.095170 1718 14.5 1.938557 1612 21.3 NA NA
3 2 14.152338 7395 1.1 15.198249 2107 0.8 15.943431 1711
12 3 2.770781 516 1.4 3.694434 353 1.1 5.148726 218
16 4 5.270507 6122 1.8 6.115197 1175 1.6 6.716832 1157
Now let's multiply the missing values in our covariates.
library(mice)
# Setup-run
ini <- mice(pbc_wide, maxit = 0)
meth <- ini$method
pred <- ini$predictorMatrix
visSeq <- ini$visitSequence
# avoid collinearity issues by letting only variables measured
# at the same point in time predict each other
pred[grep("1", rownames(pred), value = TRUE),
grep("2|3|4", colnames(pred), value = TRUE)] <- 0
pred[grep("2", rownames(pred), value = TRUE),
grep("1|3|4", colnames(pred), value = TRUE)] <- 0
pred[grep("3", rownames(pred), value = TRUE),
grep("1|2|4", colnames(pred), value = TRUE)] <- 0
pred[grep("4", rownames(pred), value = TRUE),
grep("1|2|3", colnames(pred), value = TRUE)] <- 0
# variables that should not be imputed
pred[c("id", grep('^year', names(pbc_wide), value = TRUE)), ] <- 0
# variables should not serve as predictors
pred[, c("id", grep('^year', names(pbc_wide), value = TRUE))] <- 0
# multiply imputed missing values ------------------------------
imp <- mice(pbc_wide, pred = pred, m = 10, maxit = 20, seed = 1)
# Time difference of 2.899244 secs
As can be seen in the below three example traceplots (which can be obtained with plot(imp), the algorithm has converged nicely. Refer to this section of Stef van Buuren's book for more info on convergence.
Now we need to convert back the multiply imputed data (which is in wide format) to long format, so that we can use it for analyses. We also need to make sure that we exclude all rows that had missing values for our outcome variable serBilir, because we do not want to use imputed values of the outcome.
# need unlisted data
implong <- complete(imp, 'long', include = FALSE)
# 'smart' way of getting all the names of the repeated variables in a usable format
v_names <- as.data.frame(matrix(apply(
expand.grid(grep('ye|alk|ser', names(implong), value = TRUE)),
1, paste0, collapse = ''), nrow = 4, byrow = TRUE), stringsAsFactors = FALSE)
names(v_names) <- names(pbc_long)[2:4]
# convert back to long format
longlist <- lapply(split(implong, implong$.imp),
reshape, direction = 'long',
varying = as.list(v_names),
v.names = names(v_names),
idvar = 'id', times = 1:4)
# logical that is TRUE if our outcome was not observed
# which should be based on the original, unimputed data
orig_data <- reshape(imp$data, direction = 'long',
varying = as.list(v_names),
v.names = names(v_names),
idvar = 'id', times = 1:4)
orig_data$logical <- is.na(orig_data$serBilir)
# merge into the list of imputed long-format datasets:
longlist <- lapply(longlist, merge, y = subset(orig_data, select = c(id, time, logical)))
# exclude rows for which logical == TRUE
longlist <- lapply(longlist, \(x) subset(x, !logical))
Finally, convert longlist back into a mids using datalist2mids from the miceadds package.
imp <- miceadds::datalist2mids(longlist)
# ----------------
> imp$loggedEvents
NULL
I'm trying to generate some sample insurance claims data that is meaningful instead of just random numbers.
Assuming I have two columns Age and Injury, I need meaningful values for ClaimAmount based on certain conditions:
ClaimantAge | InjuryType | ClaimAmount
---------------------------------------
35 Bruises
55 Fractures
. .
. .
. .
I want to generate claim amounts that increase as age increases, and then plateaus at around a certain age, say 65.
Claims for certain injuries need to be higher than claims for other types of injuries.
Currently I am generating my samples in a random manner, like so:
amount <- sample(0:100000, 2000, replace = TRUE)
How do I generate more meaningful samples?
There are many ways that this could need to be adjusted, as I don't know the field. Given that we're talking about dollar amounts, I would use the poisson distribution to generate data.
set.seed(1)
n_claims <- 2000
injuries <- c("bruises", "fractures")
prob_injuries <- c(0.7, 0.3)
sim_claims <- data.frame(claimid = 1:n_claims)
sim_claims$age <- round(rnorm(n = n_claims, mean = 35, sd = 15), 0)
sim_claims$Injury <- factor(sample(injuries, size = n_claims, replace = TRUE, prob = prob_injuries))
sim_claims$Amount <- rpois(n_claims, lambda = 100 + (5 * (sim_claims$age - median(sim_claims$age))) +
dplyr::case_when(sim_claims$Injury == "bruises" ~ 50,
sim_claims$Injury == "fractures" ~ 500))
head(sim_claims)
claimid age Injury Amount
1 1 26 bruises 117
2 2 38 bruises 175
3 3 22 bruises 102
4 4 59 bruises 261
5 5 40 fractures 644
6 6 23 bruises 92
In R, I am trying to use tableone::CreateTableOne in order to calculate smd (standardized mean differences) on a dataframe. I used this tutorial (https://cran.r-project.org/web/packages/tableone/vignettes/smd.html) - the code runs and nicely produces the desired output table, including the smd.
However, if I use my own data, e.g. the test data below, I get the table but without smd. Probably I did some stupid mistake, but after trying a lot of things (only numeric, smaller or larger dataset, categorial variables as factor (as in r help) or character (as in tutorial)...) I cannot figure out why I do not get smd.
# package tableone for CreateTableOne
if (!require("tableone")) install.packages("tableone"); library("tableone")
# producible test data
set.seed(1234)
d <- data.frame(age = rnorm(n = 200, mean = 50, 9),
hair = as.factor(sample(x = c("brown", "black", "blond"), 200, replace = T)),
group = sample(x = c("sick", "healthy"), 200, replace = T))
str(d)
# calculate and print the table
tabUnmatched <- tableone::CreateTableOne(vars = c("age", "hair"), strata = "group", data = d, test = FALSE, smd = TRUE)
print(tabUnmatched)
results in the following table, WITHOUT smd (and no error message):
Stratified by group
healthy sick
n 90 110
age (mean (SD)) 49.18 (7.97) 49.72 (10.10)
hair (%)
black 30 (33.3) 35 (31.8)
blond 33 (36.7) 43 (39.1)
brown 27 (30.0) 32 (29.1)
What am I doing wrong, what do I need to do to get smd output?
errr...this?
print(tabUnmatched, smd = TRUE)
Stratified by group
healthy sick SMD
n 90 110
age (mean (SD)) 49.18 (7.97) 49.72 (10.10) 0.059
hair (%) 0.050
black 30 (33.3) 35 (31.8)
blond 33 (36.7) 43 (39.1)
brown 27 (30.0) 32 (29.1)
1. First
I have the following data frame
weight <- c(74,85,58,80)
height <- c(1.68,1.83,1.58,1.72)
age <- c(22,25,21,20)
names <- c("Peter","Joseph","Marie","Xavier")
sex <- c("Male","Male","Woman","Woman")
data <- data.frame(weight,height,age,names,sex)
I need to add a new individual to the data frame: name= "Anne", weight= 70, height= 1.72 sex= Woman. I set those values as a vector:
Anne <- c(70,1.72,24,"Anne","Woman")
I used rbind to add the vector Anne
data <- rbind(data,Anne)
But I got this warning.
Warning messages:
1: In `[`<-.factor``(*tmp*, ri, value = "Ana") :
invalid factor level, NA generated
2: In `[`<-.factor``(*tmp*, ri, value = "Mujer") :
invalid factor level, NA generated
"Anne" and "Woman" appear as NA in the data frame. How can I fix this?
2. Second
Also, how can I add a column to the data dataframe, named HEIGHT. Where, if the individual measures more than 1.78, he will have the value "High"; otherwise it will have the value "Normal".
APPRECIATE YOUR HELP
Your attempt would work if you read the data as characters as shown at the end of this post.
Alternatively, you can also do
data[nrow(data) + 1, ] <- Anne
data$HEIGHT <- ifelse(data$height > 1.78, "High", "Normal")
#Faster way would be
#data$HEIGHT <- c("Normal", "High")[(data$height > 1.78) + 1]
data
# weight height age names sex HEIGHT
#1 74 1.68 22 Peter Male Normal
#2 85 1.83 25 Joseph Male High
#3 58 1.58 21 Marie Woman Normal
#4 80 1.72 20 Xavier Woman Normal
#5 70 1.72 24 Anne Woman Normal
data
Read the data with stringsAsFactors = FALSE.
data <- data.frame(weight,height,age,names,sex, stringsAsFactors = FALSE)
So we have
13-17 18-24 25-34 35-44 45-54 55-64 65+
Female 1 45 15 6 2 3 2
Male 2 121 31 7 4 2 3
and the raw data has headers like F.13-20, F.21-35 M.13-20 etc.
How would you do this? It is hard to explain, but we can't find it anywhere.
tab <- matrix(as.numeric(WeekReach[158,3:16]), nrow=2, byrow=TRUE)
colnames(tab) <- c("13-17", "18-24", "25-34", "35-44", "45-54", "55-64", "65+")
rownames(tab) <- c("Female","Male")
Then after this is:
exInd = function() {
n = sum(tab)
p = rowSums(tab)/sum(tab)
q = colSums(tab)/sum(tab)
return(p %o% q * n)}
chiSquaredStatistic = function(E) {
return(sum((tab - E)^2/E))}
E = exInd()
x2 = replicate(1000, {
ageShuffle = sample(age)
genderShuffle = sample(gender)
Xindep = table(ageShuffle, genderShuffle)
chiSquaredStatistic(Xindep, E)})
But we need something to make male and female their own thing - It is hard to explain, the teacher wont even explain it to us #univeristyproblems
Right so this was the solution given by the teacher - note, they do not explain anything.
WeekReach = read.csv("http://staff.scm.uws.edu.au/~lapark/300958/labs/WeeklyReachDemog.csv", as.is=TRUE)
tab = matrix(as.numeric(WeekReach[158,3:16]), nrow=2, byrow=TRUE)
colnames(tab) <- c("13-17", "18-24", "25-34", "35-44",
+ "45-54", "55-64", "65+")
rownames(tab) <- c("Female","Male")
stretchTable = function(tab, variableNames) {
+ tabx = rep(rownames(tab), rowSums(tab))
+ l = ncol(tab)
+ m = nrow(tab)
+ cn = colnames(tab)
+ taby = c()
+ for (a in 1:m) {
+ for (b in 1:l) {
+ taby = c(taby, rep(cn[b], tab[a,b]))
+ }
+ }
+
+ d = data.frame(x = tabx, y = taby)
+ colnames(d) = variableNames
+ return(d)
+ }
tab2 = stretchTable(tab, c("Gender","Age"))
Verify that we the correct values
table(tab2)
This was the 'question'
We showed in the lectures that we can perform a test for independence for a given two way table (two way meaning, has more than one row and column). To perform the test, we need to:
compute the expected values of the table if the rows and columns are independent (code shown in the lecture slides).
shuffle the rows and columns of the table.
To peform the shuffling, we must first untabulate the table. For example, if we start with the table:
A B C
X 2 1 1
Y 1 3 1
We must convert it to the form:
Column Row
A X
A X
A Y
B X
B Y
B Y
B Y
C X
C Y
Write the code to do this table conversion.
Hint: To compute the two column table, the two columns can be computed seperately as gender and age, then combined using tab2 = data.frame(Gender = gender, Age = age). Also, the functions rowSums, colSums, rownames, colnames and rep may be useful (if you are unfamiliar with these functions, read the R documentation on them, e.g help(rowSums)).
Once we have the data in two columns, we shuffle the columns and recompute the table and compute the χ2 value (as shown in the lecture).
Using the above table tab and the hypotheses H0: Gender and Age are independent, HA: Gender and Age are not independent:
Compute the χ2 randomisation distribution.
Compute the χ2 statistic for tab.
Compute the p value of the test.
Finally state the conclusion of the test.
HAVE FUN.
I assume you want to perform a chi-square test of independence, to establish if there are significant differences between the expected and observed frequencies in the male/female groups across the different age brackets.
The following should get you started
df <- read.table(text =
"13-17 18-24 25-34 35-44 45-54 55-64 65+
Female 1 45 15 6 2 3 2
Male 2 121 31 7 4 2 3", header = T)
chisq.test(df)
#
# Pearson's Chi-squared test
#
#data: df
#X-squared = 4.8117, df = 6, p-value = 0.5682
Based on the sample data and the chi-square test results, we fail to reject the null hypothesis, and conclude that there is not enough evidence to infer that there is a statistically significant difference between the male and female frequencies across the different age brackets.