Adjust implausible imputed values in an optimized way - r

I have a dataset with some imputed values. According to a predefined edit rule, some of these imputed values are implausible. For that reason, I want to adjust these implausible imputed values, but the adjustment should be as small as possible.
Here is a simplified example:
# Seed
set.seed(111)
# Example data
data <- data.frame(x1 = round(rnorm(200, 5, 5), 0),
x2 = factor(round(runif(200, 1, 3), 0)),
x3 = round(rnorm(200, 2, 10), 0),
x4 = factor(round(runif(200, 0, 5), 0)))
data[data$x1 > 5 & data$x2 == 1, ]$x3 <- 4
data[data$x1 > 5 & data$x2 == 1, ]$x4 <- 5
# Missings
data$x1[sample(1:nrow(data), 25)] <- NA
data$x2[sample(1:nrow(data), 50)] <- NA
data$x3[sample(1:nrow(data), 40)] <- NA
data$x4[sample(1:nrow(data), 35)] <- NA
# Imputation
library("mice")
imp <- mice(data, m = 1)
# Imputed data
data_imp <- complete(imp, "repeated")
# So far everything works well.
# However, there is a predefined edit rule, which should not be violated.
# Edit Rule:
# If x1 > 5 and x2 == 1
# Then x3 > 3 and x4 > 4
# Because of the imputation, some of the observations have implausible values.
implausible <- data_imp[data_imp$x1 > 5 & data_imp$x2 == 1 &
(data_imp$x3 <= 3 | (data_imp$x4 != 4 & data_imp$x4 != 5)), ]
implausible
# Example 1)
# In row 26 x1 has a value > 5 and x2 equals 1.
# For that reason, x3 would have to be larger than 3 (here x3 is -17).
# Like you can see in the original data, x2 has been imputed in row 26.
data[rownames(implausible), ]
# Hence, x2 would have to be adjusted, so that it randomly gets a different category.
# Example 2)
# In row 182 are also implausible values.
# Three of the variables have been imputed in this row.
# Therefore, all/some of the imputed cells would have to be adjusted,
# but the adjustment should be as small as possible.
I have already done some research and found some relevant papers/books that describe optimization algorithms:
Pannekoek & Zhang (2011): https://www.researchgate.net/publication/269410841_Partial_donor_Imputation_with_Adjustments
de Waal, Pannekoek & Scholtus (2011): Handbook of Statistical Data Editing and Imputation
However, I am struggling to implement these algorithms in R. Is there a package available that helps with this kind of calculation? I'd really appreciate some help with my code or some hints on the topic!
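As a first step toward a minimal adjustment, it can help to list, for each violating row, which cells were imputed, since only those are candidates for change. The following is a minimal base R sketch on toy data, not one of the algorithms from the cited literature. The objects orig and imp and the helper violates() are hypothetical names, and x4 is kept numeric here for simplicity.

```r
# Toy original data (NA marks a cell that was later imputed) and its imputed version
orig <- data.frame(x1 = c(8, 7, 2), x2 = c(NA, 1, 1),
                   x3 = c(-17, NA, 0), x4 = c(5, 5, NA))
imp  <- data.frame(x1 = c(8, 7, 2), x2 = c(1, 1, 1),
                   x3 = c(-17, 1, 0), x4 = c(5, 5, 2))

# Edit rule: if x1 > 5 and x2 == 1, then x3 > 3 and x4 > 4
violates <- function(d) d$x1 > 5 & d$x2 == 1 & !(d$x3 > 3 & d$x4 > 4)

# Row indices that break the rule after imputation
bad <- which(violates(imp))

# For each violating row, the imputed cells are the only ones allowed to change
adjustable <- lapply(bad, function(i) names(orig)[is.na(unlist(orig[i, ]))])
names(adjustable) <- bad
adjustable
```

Any adjustment routine could then restrict its search to the variables listed in adjustable for each row.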

Related

Why do results of matching depend on order of data (MatchIt package)?

When using the matchit-function for full matching, the results differ by the order of the input dataframe. That is, if the order of the data is changed, results change, too. This is surprising, because in my understanding, the optimal full algorithm should yield only one single best solution.
Am I missing something or is this an error?
Similar differences occur with the optimal algorithm.
Below is a reproducible example. The subclasses should be identical for the two data sets, but they are not.
Thank you for your help!
library(MatchIt) # provides matchit() and match.data()
# create data
nr <- c(1:100)
x1 <- rnorm(100, mean=50, sd=20)
x2 <- c(rep("a", 20),rep("b", 60), rep("c", 20))
x3 <- rnorm(100, mean=230, sd=2)
outcome <- rnorm(100, mean=500, sd=20)
group <- c(rep(0, 50),rep(1, 50))
df <- data.frame(x1=x1, x2=x2, outcome=outcome, group=group, row.names=nr, nr=nr)
df_neworder <- df[order(outcome),] # re-order data.frame
# perform matching
model_oldorder <- matchit(group~x1, data=df, method="full", distance ="logit")
model_neworder <- matchit(group~x1, data=df_neworder, method="full", distance ="logit")
# store matching results
matcheddata_oldorder <- match.data(model_oldorder, distance="pscore")
matcheddata_neworder <- match.data(model_neworder, distance="pscore")
# Results based on original data.frame
head(matcheddata_oldorder[order(nr),], 10)
x1 x2 outcome group nr pscore weights subclass
1 69.773776 a 489.1769 0 1 0.5409943 1.0 27
2 63.949637 a 529.2733 0 2 0.5283582 1.0 32
3 52.217666 a 526.7928 0 3 0.5028106 0.5 17
4 48.936397 a 492.9255 0 4 0.4956569 1.0 9
5 36.501507 a 512.9301 0 5 0.4685876 1.0 16
# Results based on re-ordered data.frame
head(matcheddata_neworder[order(matcheddata_neworder$nr),], 10)
x1 x2 outcome group nr pscore weights subclass
1 69.773776 a 489.1769 0 1 0.5409943 1.0 25
2 63.949637 a 529.2733 0 2 0.5283582 1.0 31
3 52.217666 a 526.7928 0 3 0.5028106 0.5 15
4 48.936397 a 492.9255 0 4 0.4956569 1.0 7
5 36.501507 a 512.9301 0 5 0.4685876 2.0 14
Apparently, the assignment of objects to subclasses differs. In my understanding, this should not be the case.
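Note that subclass labels are arbitrary, so differing numbers (27 versus 25) do not by themselves show that the groupings differ. Below is a small base R sketch for comparing two groupings regardless of labels. same_partition() is a hypothetical helper, and it assumes both vectors are in the same row order and contain no missing values.

```r
# TRUE if two label vectors describe the same grouping up to relabeling
same_partition <- function(a, b) {
  tab <- table(a, b)
  # every label in a must map to exactly one label in b, and vice versa
  all(rowSums(tab > 0) == 1) && all(colSums(tab > 0) == 1)
}

relabeled <- same_partition(c(27, 27, 9), c(25, 25, 7))  # same groups, new labels
regrouped <- same_partition(c(1, 1, 2), c(1, 2, 2))      # genuinely different groups
c(relabeled = relabeled, regrouped = regrouped)
```

Applied to the example, both subclass vectors would first need to be sorted by nr.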
The developers of the optmatch package (which the matchit function calls) provided useful help:
I think what we're seeing here is the result of the tolerance argument
that fullmatch has. The matching algorithm requires integer distances,
so we have to scale then truncate floating point distances. For a
given set of integer distances, there may be multiple matchings that
achieve the minimum, so the solver is free to pick among these
non-unique solutions.
Developing your example a little more:
> library(optmatch)
> nr <- c(1:100)
> x1 <- rnorm(100, mean=50, sd=20)
> outcome <- rnorm(100, mean=500, sd=20)
> group <- c(rep(0, 50),rep(1, 50))
> df_oldorder <- data.frame(x1=x1, outcome=outcome, group=group, row.names=nr, nr=nr)
> df_neworder <- df_oldorder[order(outcome),] # re-order data.frame
> glm_oldorder <- match_on(glm(group~x1, data=df_oldorder), data = df_oldorder)
> glm_neworder <- match_on(glm(group~x1, data=df_neworder), data = df_neworder)
> fm_old <- fullmatch(glm_oldorder, data=df_oldorder)
> fm_new <- fullmatch(glm_neworder, data=df_neworder)
> mean(sapply(matched.distances(fm_old, glm_oldorder), mean))
> ## 0.06216174
> mean(sapply(matched.distances(fm_new, glm_neworder), mean))
> ## 0.062058
> mean(sapply(matched.distances(fm_old, glm_oldorder), mean)) -
>   mean(sapply(matched.distances(fm_new, glm_neworder), mean))
> ## 0.00010373
which we can see is smaller than the default tolerance of 0.001. You can always decrease the tolerance level, which may
require increased run time, in order to get closer to the true
floating point minimum. We found 0.001 seemed to work well in practice,
but there is nothing special about this value.
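To illustrate the scale-then-truncate step described above, here is a toy base R sketch. The costs and the scale factor of 1000 are assumptions chosen for illustration, not optmatch's actual internal scaling.

```r
# Two hypothetical total matching costs on the floating point scale
cost <- c(matching_1 = 0.0621, matching_2 = 0.0629)

# Assumed resolution of 0.001: scale, then truncate to integers
int_cost <- trunc(cost * 1000)
int_cost

# Both candidates get the same integer cost, so an integer solver
# has no reason to prefer one over the other
tied <- int_cost[["matching_1"]] == int_cost[["matching_2"]]
tied
```

Which of the tied solutions is returned can then depend on incidental details such as the row order of the input.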

BTYD Individual Level Estimations For All Observations

I am using the BG/NBD model from the BTYD package in R and have computed individual-level estimates, following the documentation on page 20 of the BTYD Walkthrough.
Code for Data Prep:
library(BTYD)
library(dplyr)     # %>% and mutate()
library(lubridate) # parse_date_time()
system.file("data/cdnowElog.csv", package = "BTYD")%>%
dc.ReadLines(., cust.idx = 2, date.idx = 3, sales.idx = 5)%>%
dc.MergeTransactionsOnSameDate()%>%
mutate(date = parse_date_time(date, "%Y%m%d")) -> elog
end.of.cal.period <- as.Date("1997-09-30")
elog.cal <- elog[which(elog$date <= end.of.cal.period), ]
split.data <- dc.SplitUpElogForRepeatTrans(elog.cal);
birth.periods <- split.data$cust.data$birth.per
last.dates <- split.data$cust.data$last.date
clean.elog <- split.data$repeat.trans.elog;
freq.cbt <- dc.CreateFreqCBT(clean.elog);
tot.cbt <- dc.CreateFreqCBT(elog)
cal.cbt <- dc.MergeCustomers(tot.cbt, freq.cbt)
cal.cbs.dates <- data.frame(birth.periods, last.dates, end.of.cal.period)
cal.cbs <- dc.BuildCBSFromCBTAndDates(cal.cbt, cal.cbs.dates,per="week")
params <- pnbd.EstimateParameters(cal.cbs);
With this, one can get estimates for a particular observation.
Code for Individual Level Estimation:
cal.cbs["1516",]
# x t.x T.cal
# 26.00 30.86 31.00
x <- cal.cbs["1516", "x"]
t.x <- cal.cbs["1516", "t.x"]
T.cal <- cal.cbs["1516", "T.cal"]
bgnbd.ConditionalExpectedTransactions(params, T.star = 52,
x, t.x, T.cal)
# [1] 25.76
My question is: is it possible to run this for every row, so that I get a data frame containing the expectation for each row instead of hard-coding a particular ID such as "1516"?
Thanks!
Yes, it is straightforward with dplyr's mutate():
cal.cbs%>%
data.frame()%>%
mutate(`Conditional Expectation` = bgnbd.ConditionalExpectedTransactions(params, T.star = 52, x, t.x, T.cal))
x t.x T.cal Conditional Expectation
1 2 30.428571 38.85714 2.3224971
2 1 1.714286 38.85714 1.0646350
3 0 0.000000 38.85714 0.5607707
4 0 0.000000 38.85714 0.5607707
5 0 0.000000 38.85714 0.5607707
6 7 29.428571 38.85714 6.0231497
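The mutate() call above works when the function accepts whole columns at once. If a per-row function is not vectorized, mapply() is a base R alternative. In the sketch below, score_one() and the toy data frame cbs are hypothetical stand-ins and not part of BTYD.

```r
# Hypothetical scalar-only scoring function (uses if, so it is not vectorized)
score_one <- function(x, t.x, T.cal) {
  if (x == 0) 0 else x * t.x / T.cal
}

# Toy customer-by-sufficient-statistic table
cbs <- data.frame(x = c(2, 1, 0), t.x = c(30, 2, 0), T.cal = c(40, 40, 40))

# mapply() calls score_one() once per row and collects the results
cbs$score <- mapply(score_one, cbs$x, cbs$t.x, cbs$T.cal)
cbs
```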

Mean and standard deviation of triplicated vector data

I have an experiment where I measured a bit less than 200 variables in triplicate. In other words, I have three vectors of ~ 200 values.
I want a quick way to determine if I should use mean or median for my calculations. I can do the mean easily ((v1 + v2 + v3) / 3), but how do I calculate the SD to have it in a vector of ~ 200 SDs? And what about the median?
After having these values, I need to do growth curves (measurements were taken over certain period of time).
Here is a dplyr solution:
require(dplyr)
d <- data.frame(
x1 = rnorm(10),
x2 = rnorm(10),
x3 = rnorm(10)
)
d %>%
rowwise() %>%
mutate(
mean = mean(c(x1, x2, x3)),
median = median(c(x1, x2, x3)),
sd = sd(c(x1, x2, x3))
)
It sounds like you also have a substantive question about longitudinal data. If so, Cross Validated would be a good platform for that question.
apply is what you want. Put your vectors in a matrix, e.g.
mydat <- matrix(rnorm(600), ncol = 3)
means <- apply(mydat, MARGIN = 1, mean) # MARGIN = 1 is rows, MARGIN = 2 would be columns...
sds <- apply(mydat, MARGIN = 1, sd)
medians <- apply(mydat, MARGIN = 1, median)
Though I have to say, with 3 values each, using median sounds pretty questionable.
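For larger data, the same row statistics can also be computed without apply(). This is a sketch using only base R; the matrix m is made-up example data, and the median shortcut is valid only for exactly three columns.

```r
set.seed(1)
m <- matrix(rnorm(30), ncol = 3)  # 10 variables measured in triplicate

means <- rowMeans(m)

# Sample SD per row: m - means recycles down the columns, so row i is centered by means[i]
sds <- sqrt(rowSums((m - means)^2) / (ncol(m) - 1))

# Median of three values = sum minus the largest minus the smallest
medians <- rowSums(m) - pmax(m[, 1], m[, 2], m[, 3]) - pmin(m[, 1], m[, 2], m[, 3])

head(cbind(means, sds, medians))
```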
A traditional for loop can also be used, though it is not preferred:
for(i in 1:nrow(d)) d[i,4]=mean(unlist(d[i,1:3]))
for(i in 1:nrow(d)) d[i,5]=sd(unlist(d[i,1:3]))
for(i in 1:nrow(d)) d[i,6]=median(unlist(d[i,1:3]))
names(d)[4:6]=c('meanval', 'sdval', 'medianval')
d
x1 x2 x3 meanval sdval medianval
1 -1.3230176 0.6956100 -0.7210798 -0.44949580 1.0363556 -0.7210798
2 -1.8931166 0.9047873 -1.0378874 -0.67540558 1.4337404 -1.0378874
3 -0.2137543 0.1846733 0.6410478 0.20398893 0.4277283 0.1846733
4 0.1371915 -1.0345325 -0.2260038 -0.37444827 0.5998009 -0.2260038
5 -0.8662465 -0.8229465 -0.2230030 -0.63739866 0.3595296 -0.8229465
6 -0.2918697 -1.3543493 1.3025262 -0.11456426 1.3372826 -0.2918697
7 -0.4931936 1.7186173 1.3757156 0.86704643 1.1904138 1.3757156
8 0.3982403 -0.3394208 1.9316059 0.66347514 1.1585131 0.3982403
9 -1.0332427 -0.3045905 1.1513260 -0.06216908 1.1122775 -0.3045905
10 -1.5603811 -0.1709146 -0.5409815 -0.75742575 0.7195765 -0.5409815
Using d from @DMC's answer.

split on factor, sapply, and lm [duplicate]

I want to apply lm() to observations grouped by subject, but cannot work out the sapply syntax. In the end, I want a data frame with one row per subject, containing the intercept and slope (i.e., rows of: subj, lm$coefficients[1], lm$coefficients[2]).
set.seed(1)
subj <- rep(c("a","b","c"), 4) # 4 observations each on 3 experimental subjects
ind <- rnorm(12) #12 random numbers, the independent variable, the x axis
dep <- rnorm(12) + .5 #12 random numbers, the dependent variable, the y axis
df <- data.frame(subj=subj, ind=ind, dep=dep)
s <- (split(df,subj)) # create a list of observations by subject
I can pull a single set of observations from s, make a dataframe, and get what I want:
df2 <- as.data.frame(s[1])
df2
lm1 <- lm(df2$a.dep ~ df2$a.ind)
lm1$coefficients[1]
lm1$coefficients[2]
I am having trouble looping over all the elements of s and getting the data into the final form I want:
lm.list <- sapply(s, FUN= function(x)
(lm(x[ ,"dep"] ~ x[,"ind"])))
a <-as.data.frame(lm.list)
I feel like I need some kind of transpose of the structure below; the columns (a,b,c) are what I want my rows to be, but t(a) does not work.
head(a)
a
coefficients 0.1233519, 0.4610505
residuals 0.4471916, -0.3060402, 0.4460895, -0.5872409
effects -0.6325478, 0.6332422, 0.5343949, -0.7429069
rank 2
fitted.values 0.74977179, 0.09854505, -0.05843569, 0.47521446
assign 0, 1
b
coefficients 1.1220840, 0.2024222
residuals -0.04461432, 0.02124541, 0.27103003, -0.24766112
effects -2.0717363, 0.2228309, 0.2902311, -0.2302195
rank 2
fitted.values 1.1012775, 0.8433366, 1.1100777, 1.0887808
assign 0, 1
c
coefficients 0.2982019, 0.1900459
residuals -0.5606330, 1.0491990, 0.3908486, -0.8794147
effects -0.6742600, 0.2271767, 1.1273566, -1.0345665
rank 2
fitted.values 0.3718773, 0.2193339, 0.5072572, 0.2500516
assign 0, 1
By the sounds of it, this might be what you're trying to do:
sapply(s, FUN= function(x)
lm(x[ ,"dep"] ~ x[,"ind"])$coefficients[c(1, 2)])
# a b c
# (Intercept) 0.71379430 -0.6817331 0.5717372
# x[, "ind"] 0.07125591 1.1452096 -1.0303726
Other alternatives, if this is what you're looking for:
I've seen it noted that in general, if you're splitting and then using s/lapply, you can usually just jump straight to by and skip the split step:
do.call(rbind,
by(data = df, INDICES=df$subj, FUN=function(x)
lm(x[, "dep"] ~ x[, "ind"])$coefficients[c(1, 2)]))
# (Intercept) x[, "ind"]
# a 0.7137943 0.07125591
# b -0.6817331 1.14520962
# c 0.5717372 -1.03037257
Or, you can use one of the packages that lets you do such sorts of calculations more conveniently, like "data.table":
library(data.table)
DT <- data.table(df)
DT[, list(Int = lm(dep ~ ind)$coefficients[1],
Slo = lm(dep ~ ind)$coefficients[2]), by = subj]
# subj Int Slo
# 1: a 0.7137943 0.07125591
# 2: b -0.6817331 1.14520962
# 3: c 0.5717372 -1.03037257
How about nlme::lmList?
library(nlme)
coef(lmList(dep~ind|subj,df))
## (Intercept) ind
## a 0.7137943 0.07125591
## b -0.6817331 1.14520962
## c 0.5717372 -1.03037257
You can transpose this if you want.
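For completeness, here is a base R sketch that produces the requested layout directly (one row per subject). The column names intercept and slope are my own choice, and the data are regenerated here so the block runs on its own.

```r
set.seed(1)
df <- data.frame(subj = rep(c("a", "b", "c"), 4),
                 ind  = rnorm(12),
                 dep  = rnorm(12) + 0.5)

# Fit one model per subject; sapply() returns a 2 x 3 matrix, t() makes it 3 x 2
coefs <- t(sapply(split(df, df$subj), function(d) coef(lm(dep ~ ind, data = d))))

# One row per subject with intercept and slope as columns
res <- data.frame(subj      = rownames(coefs),
                  intercept = unname(coefs[, 1]),
                  slope     = unname(coefs[, 2]),
                  row.names = NULL)
res
```

Using the formula dep ~ ind with a data argument, rather than x[, "dep"] ~ x[, "ind"], also gives cleaner coefficient names.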

Using split function in R

I am trying to simulate three small datasets, each containing x1, x2, x3, x4, trt and IND.
However, when I try to split the simulated data by IND using "split" in R, I get warning messages and the output is not correct. Could someone please give me a hint about what I did wrong in my R code?
# Step 2: simulate data
Alpha = 0.05
S = 3 # number of replicates
x = 8 # number of covariates
G = 3 # number of treatment groups
N = 50 # number of subjects per dataset
tot = S*N # total subjects for a simulation run
# True parameters
alpha = c(0.5, 0.8) # intercepts
b1 = c(0.1,0.2,0.3,0.4) # for pi_1 of trt A
b2 = c(0.15,0.25,0.35,0.45) # for pi_2 of trt B
b = c(1.1,1.2,1.3,1.4);
##############################################################################
# Scenario 1: all covariates are independent standard normally distributed #
##############################################################################
set.seed(12)
x1 = rnorm(n=tot, mean=0, sd=1);x2 = rnorm(n=tot, mean=0, sd=1);
x3 = rnorm(n=tot, mean=0, sd=1);x4 = rnorm(n=tot, mean=0, sd=1);
###############################################################################
p1 = exp(alpha[1]+b1[1]*x1+b1[2]*x2+b1[3]*x3+b1[4]*x4)/
(1+exp(alpha[1]+b1[1]*x1+b1[2]*x2+b1[3]*x3+b1[4]*x4) +
exp(alpha[2]+b2[1]*x1+b2[2]*x2+b2[3]*x3+b2[4]*x4))
p2 = exp(alpha[2]+b2[1]*x1+b2[2]*x2+b2[3]*x3+b2[4]*x4)/
(1+exp(alpha[1]+b1[1]*x1+b1[2]*x2+b1[3]*x3+b1[4]*x4) +
exp(alpha[2]+b2[1]*x1+b2[2]*x2+b2[3]*x3+b2[4]*x4))
p3 = 1/(1+exp(alpha[1]+b1[1]*x1+b1[2]*x2+b1[3]*x3+b1[4]*x4) +
exp(alpha[2]+b2[1]*x1+b2[2]*x2+b2[3]*x3+b2[4]*x4))
# To assign subjects to one of treatment groups based on response probabilities
tmp = function(x){sample(c("A","B","C"), 1, prob=x, replace=TRUE)}
trt = apply(cbind(p1,p2,p3),1,tmp)
IND=rep(1:S,each=N) #create an indicator for split simulated data
sim=data.frame(x1,x2,x3,x4,trt, IND)
Aset = subset(sim, trt=="A")
Bset = subset(sim, trt=="B")
Cset = subset(sim, trt=="C")
Anew = split(Aset, f = IND)
Bnew = split(Bset, f = IND)
Cnew = split(Cset, f = IND)
The warning message:
> Anew = split(Aset, f = IND)
Warning message:
In split.default(x = seq_len(nrow(x)), f = f, drop = drop, ...) :
data length is not a multiple of split variable
and the output becomes
$`2`
x1 x2 x3 x4 trt IND
141 1.0894068 0.09765185 -0.46702047 0.4049424 A 3
145 -1.2953113 -1.94291045 0.09926239 -0.5338715 A 3
148 0.0274979 0.72971804 0.47194731 -0.1963896 A 3
$`3`
[1] x1 x2 x3 x4 trt IND
<0 rows> (or 0-length row.names)
I have checked my R code several times; however, I can't figure out what I did wrong. Many thanks in advance.
IND is the global variable for the full data, sim. You want to use the one that belongs to the subset, e.g.:
Anew <- split(Aset, f = Aset$IND)
It's a warning, not an error, which means split executed successfully, but may not have done what you wanted to do.
From the "details" section of the help file:
f is recycled as necessary and if the length of x is not a multiple of
the length of f a warning is printed. Any missing values in f are
dropped together with the corresponding values of x.
Try checking the length of your IND against the size of your dataframe, maybe.
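The effect of the length mismatch can be reproduced on a small example. This is a sketch with made-up data that reuses the names sim, IND and Aset from the question. The warning is suppressed only to keep the output short.

```r
sim <- data.frame(x = 1:6,
                  trt = c("A", "B", "A", "B", "A", "A"),
                  IND = rep(1:3, each = 2))
IND <- sim$IND                   # global grouping vector, length 6
Aset <- subset(sim, trt == "A")  # 4 rows: x = 1, 3, 5, 6

# Global IND is longer than Aset: rows get paired with the wrong group labels
wrong <- suppressWarnings(split(Aset, f = IND))

# Grouping variable taken from the subset itself: lengths match
right <- split(Aset, f = Aset$IND)

sapply(wrong, nrow)
sapply(right, nrow)
```

In wrong, group 3 is empty and rows with IND equal to 3 end up under group 2, which mirrors the output in the question.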
Not sure what your goal is once you have your data split, but this sounds like a good candidate for the plyr package.
> library(plyr)
> ddply(sim, .(trt,IND), summarise, x1mean=mean(x1), x2sum=sum(x2), x3min=min(x3), x4max=max(x4))
trt IND x1mean x2sum x3min x4max
1 A 1 -0.49356448 -1.5650528 -1.016615 2.0027822
2 A 2 0.05908053 5.1680463 -1.514854 0.8184445
3 A 3 0.22898716 1.8584443 -1.934188 1.6326763
4 B 1 0.01531230 1.1005720 -2.002830 2.6674931
5 B 2 0.17875088 0.2526760 -1.546043 1.2021935
6 B 3 0.13398967 -4.8739380 -1.565945 1.7887837
7 C 1 -0.16993037 -0.5445507 -1.954848 0.6222546
8 C 2 -0.04581149 -6.3230167 -1.491114 0.8714535
9 C 3 -0.41610973 0.9085831 -1.797661 2.1174894
You can replace summarise and the arguments that follow it with any function that returns a data.frame or something that can be coerced to one. If lists are the target, ldply is your friend.
