Proportion of patients in different strata in R

I am trying to distribute 20 patients into 4 strata. The covariates are jointly distributed. Below is my code, along with the error. Can anyone help me debug this?
runif(20)
if (0<u<3/20) then {i1} =1
if (3/20<u<14/20) then {i2} =2
if (14/20<u<19/20) then {i3} =3
if (19/20<u<20/20) then {i4} =4
Error: unexpected '<' in "if (0<u<"

You can use the cut function to assign your strata:
breaks <- c(0, 3/20, 14/20, 19/20, 1)
labels <- 1:4
set.seed(7)
x <- runif(20)
stratum <- cut(x, breaks = breaks, labels = labels)
df <- data.frame(x, stratum)
df
           x stratum
1 0.98890930       4
2 0.39774545       2
3 0.11569778       1
4 0.06974868       1
5 0.24374939       2
6 0.79201043       3
As AnilGoya mentioned, you have to clean up your comparison operators.
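For reference, the syntax error comes from chained comparisons like 0<u<3/20, which are not valid R; each bound needs its own comparison, joined with &. A minimal sketch of the corrected logic (assuming u holds the runif(20) draws; note ifelse() is vectorized where if() is not, and cut() above replaces all of this in one call):
u <- runif(20)
stratum <- ifelse(u < 3/20, 1,
            ifelse(u < 14/20, 2,
             ifelse(u < 19/20, 3, 4)))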

R: Tetrachoric correlation for multiple variables at one go?

You can see I'm a beginner at this when I'm not even able to reproduce my problem with a dummy dataset... Anyway, here goes: I want to calculate tetrachoric correlations between one grouping variable and multiple other variables, like this:
library(psych)
set.seed(42)
n <- 16
dat <- data.frame(id = 1:n,
                  group = c(rep("a", times = 5), rep("b", times = 3)),
                  x = sample(1:2, n, replace = TRUE),
                  y = sample(1:2, n, replace = TRUE),
                  z = sample(1:2, n, replace = TRUE))
dat
  id group x y z
1  1     a 1 1 2
2  2     a 1 2 2
3  3     a 1 1 2
4  4     a 1 2 2
5  5     a 2 1 1
6  6     b 2 2 1
7  7     b 2 1 1
8  8     b 2 1 1
tetrachoric(as.matrix(dat[,c("group","y")]))
Now with this example (not with my actual dataset) I get an error which I'm unable to solve:
Error in apply(x, 2, function(x) min(x, na.rm = TRUE)) :
dim(X) must have a positive length
In addition: Warning messages:
1: In var(if (is.vector(x) || is.factor(x)) x else as.double(x), na.rm = na.rm) :
NAs introduced by coercion
2: In tetrachoric(as.matrix(dat[, c("group", "y")])) :
Item = group had no variance and was deleted
My question is still: what would be the best solution to get all the correlations with a single piece of code? Thank you for your help!
The help file for tetrachoric says "The tetrachoric correlation is the inferred Pearson Correlation from a two x two table with the assumption of bivariate normality", so presumably you need to pass it a 2 x 2 table. You could write a little function that hands tetrachoric the appropriate tables and collects the results:
myfun <- function(x, y, ...){
  # build one 2 x 2 table per column of y, crossing it with the grouping vector x
  tabs <- lapply(seq_along(y), function(i) table(x, y[, i]))
  # run tetrachoric on each table, then collect the rho and tau estimates
  l <- lapply(tabs, function(x) tetrachoric(x, ...))
  rho <- sapply(l, function(x) x$rho)
  tau <- sapply(l, function(x) x$tau)
  colnames(tau) <- colnames(y)
  names(rho) <- colnames(y)
  ret <- list(rho = rho,
              tau = tau)
  ret
}
myfun(dat$group, dat[,c("x", "y", "z")])
# $rho
# x y z
# 0.5397901 -0.2605839 0.6200705
#
# $tau
# x y z
# a 0.3186394 0.3186394 0.2690661
# 1 0.1573107 0.1573107 -0.6045853
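Alternatively, note why the original call failed: group is character, so as.matrix() turned every column into character, hence the "NAs introduced by coercion" warning and the zero-variance deletion. A sketch of a more direct route (my variable names; assuming all items are recoded 0/1 first) is to recode everything numerically, pass the whole matrix to tetrachoric() once, and read off the group row of $rho:
d01 <- data.frame(group = as.numeric(dat$group == "b"), # "a"/"b" -> 0/1
                  dat[, c("x", "y", "z")] - 1)          # 1/2 -> 0/1
rho_mat <- tetrachoric(as.matrix(d01))$rho
rho_mat["group", c("x", "y", "z")] # correlation of group with each item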

I do not know how to plot the probability distribution of outcomes of some code in R

I have created a program that simulates throwing dice 100 times. I need help with adding up the results of the individual dice and with plotting the probability distribution of outcomes.
This is the code I have:
sample(1:6, size=100, replace = TRUE)
So far, what you've done is sample the dice throws (note I've added a line setting the seed for reproducibility):
set.seed(123)
x <- sample(1:6, size=100, replace = TRUE)
The simple command to "add[] up the results of the individual dice" is table():
table(x)
# x
# 1 2 3 4 5 6
# 17 16 20 14 18 15
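If by "adding up the results" you instead meant the grand total of all 100 throws rather than the per-face tallies, that is simply:
sum(x) # total of the 100 dice results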
Then, to "plot the probability distribution of outcomes," we must first get that distribution; luckily R provides the handy prop.table() function, which works for this sort of discrete distribution:
prop.table(table(x))
# x
# 1 2 3 4 5 6
# 0.17 0.16 0.20 0.14 0.18 0.15
Then we can easily plot it; for plotting PMFs, my preferred plot type is "h":
y <- prop.table(table(x))
plot(y, type = "h", xlab = "Dice Result", ylab = "Probability")
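As a quick sanity check (my addition, not part of the original question): a fair die puts probability 1/6 on each face, so a dashed reference line shows how far the sample proportions stray:
abline(h = 1/6, lty = 2) # theoretical probability for a fair die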
Update: Weighted die
sample() can easily be used to simulate a weighted die using its prob argument. From help("sample"):
Usage
sample(x, size, replace = FALSE, prob = NULL)
Arguments
[some content omitted]
prob a vector of probability weights for obtaining the elements of the vector being sampled.
So, we just add your preferred weights to the prob argument and proceed as usual (note I've also upped your sample size from 100 to 10000):
set.seed(123)
die_weights <- c(4/37, rep(6/37, 4), 9/37)
x <- sample(1:6, size = 10000, replace = TRUE, prob = die_weights)
(y <- prop.table(table(x)))
# x
# 1 2 3 4 5 6
# 0.1021 0.1641 0.1619 0.1691 0.1616 0.2412
plot(y, type = "h", xlab = "Dice Result", ylab = "Probability")
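The same trick compares the weighted simulation against the intended weights (a sketch; points() at 1:6 lines up because the table categories are the faces 1 through 6):
points(1:6, die_weights, pch = 19) # intended weights over the empirical PMF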

Why do results of matching depend on order of data (MatchIt package)?

When using the matchit function for full matching, the results differ depending on the order of the input data frame. That is, if the order of the data is changed, the results change, too. This is surprising because, in my understanding, optimal full matching should yield a single best solution.
Am I missing something, or is this an error?
Similar differences occur with the optimal algorithm.
Below is a reproducible example. The subclasses should be identical for the two data sets, but they are not.
Thank you for your help!
# create data
library(MatchIt)
set.seed(1) # seed added so the example is reproducible
nr <- c(1:100)
x1 <- rnorm(100, mean = 50, sd = 20)
x2 <- c(rep("a", 20), rep("b", 60), rep("c", 20))
x3 <- rnorm(100, mean = 230, sd = 2)
outcome <- rnorm(100, mean = 500, sd = 20)
group <- c(rep(0, 50), rep(1, 50))
df <- data.frame(x1 = x1, x2 = x2, outcome = outcome, group = group, row.names = nr, nr = nr)
df_neworder <- df[order(outcome), ] # re-order data.frame
# perform matching
model_oldorder <- matchit(group ~ x1, data = df, method = "full", distance = "logit")
model_neworder <- matchit(group ~ x1, data = df_neworder, method = "full", distance = "logit")
# store matching results
matcheddata_oldorder <- match.data(model_oldorder, distance="pscore")
matcheddata_neworder <- match.data(model_neworder, distance="pscore")
# Results based on original data.frame
head(matcheddata_oldorder[order(nr),], 10)
x1 x2 outcome group nr pscore weights subclass
1 69.773776 a 489.1769 0 1 0.5409943 1.0 27
2 63.949637 a 529.2733 0 2 0.5283582 1.0 32
3 52.217666 a 526.7928 0 3 0.5028106 0.5 17
4 48.936397 a 492.9255 0 4 0.4956569 1.0 9
5 36.501507 a 512.9301 0 5 0.4685876 1.0 16
# Results based on re-ordered data.frame
head(matcheddata_neworder[order(matcheddata_neworder$nr),], 10)
x1 x2 outcome group nr pscore weights subclass
1 69.773776 a 489.1769 0 1 0.5409943 1.0 25
2 63.949637 a 529.2733 0 2 0.5283582 1.0 31
3 52.217666 a 526.7928 0 3 0.5028106 0.5 15
4 48.936397 a 492.9255 0 4 0.4956569 1.0 7
5 36.501507 a 512.9301 0 5 0.4685876 2.0 14
Apparently, the assignment of objects to subclasses differs. In my understanding, this should not be the case.
The developers of the optmatch package (which the matchit function calls) provided useful help:
I think what we're seeing here is the result of the tolerance argument that fullmatch has. The matching algorithm requires integer distances, so we have to scale then truncate floating point distances. For a given set of integer distances, there may be multiple matchings that achieve the minimum, so the solver is free to pick among these non-unique solutions.
Developing your example a little more:
library(optmatch)
nr <- c(1:100)
x1 <- rnorm(100, mean = 50, sd = 20)
outcome <- rnorm(100, mean = 500, sd = 20)
group <- c(rep(0, 50), rep(1, 50))
df_oldorder <- data.frame(x1 = x1, outcome = outcome, group = group, row.names = nr, nr = nr)
df_neworder <- df_oldorder[order(outcome), ] # re-order data.frame
glm_oldorder <- match_on(glm(group ~ x1, data = df_oldorder), data = df_oldorder)
glm_neworder <- match_on(glm(group ~ x1, data = df_neworder), data = df_neworder)
fm_old <- fullmatch(glm_oldorder, data = df_oldorder)
fm_new <- fullmatch(glm_neworder, data = df_neworder)
mean(sapply(matched.distances(fm_old, glm_oldorder), mean))
## 0.06216174
mean(sapply(matched.distances(fm_new, glm_neworder), mean))
## 0.062058
mean(sapply(matched.distances(fm_old, glm_oldorder), mean)) -
  mean(sapply(matched.distances(fm_new, glm_neworder), mean))
## 0.00010373
which we can see is smaller than the default tolerance of 0.001. You can always decrease the tolerance level, which may require increased run time, in order to get closer to the true floating point minimum. We found 0.001 seemed to work well in practice, but there is nothing special about this value.
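So if stable subclasses matter more than run time, tightening that tolerance would look something like this (a sketch; tol is fullmatch's tolerance argument, with the 0.001 default mentioned above):
fm_old_tight <- fullmatch(glm_oldorder, data = df_oldorder, tol = 0.0001)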

BTYD Individual Level Estimations For All Observations

I am using the BG/NBD model from the BTYD package in R and did the individual-level estimates.
For instance, following the documentation on page 20 of the BTYD Walkthrough:
Code for Data Prep:
library(BTYD)
library(dplyr)     # for %>% and mutate()
library(lubridate) # for parse_date_time()

system.file("data/cdnowElog.csv", package = "BTYD") %>%
  dc.ReadLines(., cust.idx = 2, date.idx = 3, sales.idx = 5) %>%
  dc.MergeTransactionsOnSameDate() %>%
  mutate(date = parse_date_time(date, "%Y%m%d")) -> elog
end.of.cal.period <- as.Date("1997-09-30")
elog.cal <- elog[which(elog$date <= end.of.cal.period), ]
split.data <- dc.SplitUpElogForRepeatTrans(elog.cal)
birth.periods <- split.data$cust.data$birth.per
last.dates <- split.data$cust.data$last.date
clean.elog <- split.data$repeat.trans.elog
freq.cbt <- dc.CreateFreqCBT(clean.elog)
tot.cbt <- dc.CreateFreqCBT(elog)
cal.cbt <- dc.MergeCustomers(tot.cbt, freq.cbt)
cal.cbs.dates <- data.frame(birth.periods, last.dates, end.of.cal.period)
cal.cbs <- dc.BuildCBSFromCBTAndDates(cal.cbt, cal.cbs.dates, per = "week")
params <- bgnbd.EstimateParameters(cal.cbs) # BG/NBD parameters, to match the bgnbd.* calls below
With that in place, one can get estimates for a particular observation.
Code for Individual Level Estimation:
cal.cbs["1516",]
# x t.x T.cal
# 26.00 30.86 31.00
x <- cal.cbs["1516", "x"]
t.x <- cal.cbs["1516", "t.x"]
T.cal <- cal.cbs["1516", "T.cal"]
bgnbd.ConditionalExpectedTransactions(params, T.star = 52,
                                      x, t.x, T.cal)
# [1] 25.76
My question is: is it possible to run this over every row, so that I get a data frame containing the expectations for each customer instead of hard-coding a particular ID number such as "1516"?
Thanks!
Yes, it is straightforward with dplyr's mutate():
cal.cbs %>%
  data.frame() %>%
  mutate(`Conditional Expectation` = bgnbd.ConditionalExpectedTransactions(params, T.star = 52, x, t.x, T.cal))
x t.x T.cal Conditional Expectation
1 2 30.428571 38.85714 2.3224971
2 1 1.714286 38.85714 1.0646350
3 0 0.000000 38.85714 0.5607707
4 0 0.000000 38.85714 0.5607707
5 0 0.000000 38.85714 0.5607707
6 7 29.428571 38.85714 6.0231497
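A base-R equivalent, for completeness (a sketch resting on the same fact the mutate() version does: bgnbd.ConditionalExpectedTransactions() is vectorized over x, t.x, and T.cal):
cbs.df <- data.frame(cal.cbs)
cbs.df$cond.exp <- bgnbd.ConditionalExpectedTransactions(params, T.star = 52,
                                                         cbs.df$x, cbs.df$t.x, cbs.df$T.cal)
head(cbs.df)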

Running 'prop.test' multiple times in R

I have some data showing a long list of regions, the population of each region and the number of people in each region with a certain disease. I'm trying to show the confidence intervals for each proportion (but I'm not testing whether the proportions are statistically different).
One approach is to manually calculate the standard errors and confidence intervals but I'd like to use a built-in tool like prop.test, because it has some useful options. However, when I use prop.test with vectors, it runs a chi-square test across all the proportions.
I've solved this with a while loop (see dummy data below), but I sense there must be a better and simpler way to approach this problem. Would apply work here, and how? Thanks!
dat <- data.frame(1:5, c(10, 50, 20, 30, 35))
names(dat) <- c("X", "N")
dat$Prop <- dat$X / dat$N

ConfLower = 0
x = 1
while (x < 6) {
  a <- prop.test(dat$X[x], dat$N[x])$conf.int[1]
  ConfLower <- c(ConfLower, a)
  x <- x + 1
}

ConfUpper = 0
x = 1
while (x < 6) {
  a <- prop.test(dat$X[x], dat$N[x])$conf.int[2]
  ConfUpper <- c(ConfUpper, a)
  x <- x + 1
}

dat$ConfLower <- ConfLower[2:6]
dat$ConfUpper <- ConfUpper[2:6]
Here's an attempt using Map, essentially stolen from a previous answer here:
https://stackoverflow.com/a/15059327/496803
res <- Map(prop.test,dat$X,dat$N)
dat[c("lower","upper")] <- t(sapply(res,"[[","conf.int"))
# X N Prop lower upper
#1 1 10 0.1000000 0.005242302 0.4588460
#2 2 50 0.0400000 0.006958623 0.1485882
#3 3 20 0.1500000 0.039566272 0.3886251
#4 4 30 0.1333333 0.043597084 0.3164238
#5 5 35 0.1428571 0.053814457 0.3104216
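And to answer the "would apply work here" part directly: yes (a sketch; apply() hands each row to prop.test() as a named numeric vector, and the two-row result is transposed so each region gets one row of bounds):
ci <- t(apply(dat[c("X", "N")], 1, function(r) prop.test(r["X"], r["N"])$conf.int))
dat[c("lower", "upper")] <- ci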
