How to impute the same values within one group? - r

I'm struggling with imputation via the mice package to solve an NA problem in my data analysis. I'm using linear mixed models to calculate intraclass correlation coefficients (ICCs). My final data frame contains several control variables (as columns) that I use as fixed effects in the model.
Some columns have missing values. I have no trouble imputing the NAs with the following commands:
imputation_list <- mice(baseline_df,
method = "pmm",
m=5) # "pmm" == predictive mean matching (numeric data)
df_imputation_final= complete(imputation_list)
But now my problem:
The IDs (persons in rows) are grouped into families. So I have to impute the NAs such that all persons within one family receive the same imputed value.
I need to impute values in the following data frame.
df_test <- data.frame(ID=c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20),
family=c("Gerrard", "Gerrard", "Gerrard", "Torres", "Torres", "Torres", "Keita", "Keita", "Keita", "Suarez", "Suarez", "Kuyt", "Kuyt", "Carragher", "Carragher", "Carragher", "Salah", "Salah", "Firmino", "Firmino"),
income_family=c(NA, NA, NA, 100, 100, 100, 90, 90, 90, 150, 150, 40, 40, NA, NA, NA, 200, 200, 99, 99))
So all persons (IDs 1, 2, 3 and 14, 15, 16) in the families "Gerrard" and "Carragher" need imputation in the income_family variable, and the imputed values must be the same for all members of a family. The result should look like this:
df_final <- data.frame(ID=c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20),
family=c("Gerrard", "Gerrard", "Gerrard", "Torres", "Torres", "Torres", "Keita", "Keita", "Keita", "Suarez", "Suarez", "Kuyt", "Kuyt", "Carragher", "Carragher", "Carragher", "Salah", "Salah", "Firmino", "Firmino"),
income_family=c(55, 55, 55, 100, 100, 100, 90, 90, 90, 150, 150, 40, 40, 66, 66, 66, 200, 200, 99, 99))
I hope you know what I mean. Thx a lot !!

It's unclear what purpose the long ID variable serves if the values for income_family are the same for every observation of family. I believe the only way to achieve your desired result is to summarize your dataset before imputation.
library(dplyr) # provides %>%, group_by() and summarize()
library(mice)
df <- data.frame(ID=c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20),
family=c("Gerrard", "Gerrard", "Gerrard", "Torres", "Torres", "Torres", "Keita", "Keita", "Keita", "Suarez", "Suarez", "Kuyt", "Kuyt", "Carragher", "Carragher", "Carragher", "Salah", "Salah", "Firmino", "Firmino"),
income_family=c(NA, NA, NA, 100, 100, 100, 90, 90, 90, 150, 150, 40, 40, NA, NA, NA, 200, 200, 99, 99))
df2 <- df %>%
group_by(family) %>%
summarize(income_family = mean(income_family))
# Same for every family
imputation_list <- mice(df2, m = 1, printFlag = FALSE)
df_imputation_final <- complete(imputation_list)
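If the person-level rows are still needed afterwards, the family-level values can be mapped back onto the original data frame. A minimal sketch in base R, assuming a family-level table shaped like df_imputation_final; the name fam_level and the values 55 and 66 are placeholders borrowed from the question's desired output, not actual mice results:

```r
# Person-level data with missing family income (subset of the question's data)
df <- data.frame(
  ID = 1:8,
  family = c("Gerrard", "Gerrard", "Gerrard", "Torres", "Torres",
             "Carragher", "Carragher", "Carragher"),
  income_family = c(NA, NA, NA, 100, 100, NA, NA, NA)
)

# Hypothetical family-level table after imputation: one row per family.
# 55 and 66 are placeholder values, not real mice output.
fam_level <- data.frame(
  family = c("Carragher", "Gerrard", "Torres"),
  income_family = c(66, 55, 100)
)

# Look up each person's family in the family-level table
df$income_family <- fam_level$income_family[match(df$family, fam_level$family)]

# Count distinct income values per family (should be 1 for each)
tapply(df$income_family, df$family, function(v) length(unique(v)))
```

Because every person inherits the single value stored for their family, the imputed income is constant within a family by construction.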
However, if you want to do proper modelling on multiply-imputed data, you will need to conduct your analyses on the mids object imputation_list, not the large dataframe df_imputation_final. If you're using lme4, see this post for details: Using imputed datasets from library mice() to fit a multi-level model in R
# Longitudinal multiple imputation
# https://rmisstastic.netlify.app/tutorials/erler_course_multipleimputation_2018/erler_practical_miadvanced_2018
imp <- mice(df, maxit = 0)
meth <- imp$method
pred <- imp$predictorMatrix
meth[c("income_family")] <- "2lonly.pmm"
pred[, "ID"] <- -2
pred[, "family"] <- 2
imputation_list <- mice::mice(df,
m = 5, maxit = 10,
method = meth,
seed = 123,
predictorMatrix = pred,
printFlag = FALSE)
fit <- with(data = imputation_list,
expr = lme4::lmer(income_family ~ (1|family)))
pool(fit)

Related

I want to calculate a formula in R

I have a dataset that starts like this:
In dput it is
structure(list(20, TRUE, c(0, 0, 1, 1, 1, 1, 2, 3, 4, 4, 4, 5,
5, 5, 5, 5, 5, 5, 5, 5, 5, 6, 6, 6, 7, 7), c(8, 1, 0, 8, 9, 5,
8, 10, 10, 5, 7, 4, 11, 12, 6, 13, 14, 15, 16, 17, 18, 4, 5,
19, 4, 17), c(1, 0, 2, 5, 3, 4, 6, 7, 9, 10, 8, 11, 14, 12, 13,
15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25), c(2, 1, 11, 21,
24, 5, 9, 22, 14, 10, 0, 3, 6, 4, 7, 8, 12, 13, 15, 16, 17, 18,
19, 25, 20, 23), c(0, 2, 6, 7, 8, 11, 21, 24, 26, 26, 26, 26,
26, 26, 26, 26, 26, 26, 26, 26, 26), c(0, 1, 2, 2, 2, 5, 8, 9,
10, 13, 14, 16, 17, 18, 19, 20, 21, 22, 24, 25, 26), list(c(1,
0, 1), structure(list(), names = character(0)), list(name = c("1",
"3", "5", "6", "8", "9", "12", "19", "2", "4", "7", "10", "11",
"14", "15", "16", "17", "18", "20", "13")), list(`Number of messages` = c(157,
1058, 2481, 833, 178, 119, 66, 222, 20, 343, 3, 4991, 47, 11,
83, 26, 10, 19, 33, 84, 51, 589, 79, 37, 110, 55))), <environment>), class = "igraph")
So far I have the following lines of code:
Datensatz <- read_xlsx("...")
Netzwerkgraph <- graph.data.frame(Datensatz[,1:3], directed = TRUE)
actors<-Datensatz$From
relations<-Datensatz$To
weight<-Datensatz$`Number of messages`
How can I calculate the following formula in R with my data set?
I've tried the following code:
Function <- function(i,j,x,y,z){
i <- actors
j <- relations
w <- weight
for(i in 1:20)
print (-1/(cumsum 1:length(actors, i)(w,i+1))logb(x,base=2)*1/(cumsum 1:length(actors, i)*w,i+1))
}
It isn't entirely clear how you wish to apply the given formula to your example data set, that is, exactly what inputs you are using and what outputs you wish to achieve. Hence, it also isn't clear if the following approach will be sufficient for your purposes. Here is my interpretation thus far.
If one interprets each unique value in the "from" column as being a node i, then it appears that you wish to calculate the sum of messages to each j in the "to" column for each sender i in the "from" column. One approach might then be to calculate all such sums by sender first and then run them all through a simple function that accepts the sum along with some lambda constant.
I used a lambda value of "2" below arbitrarily for illustrative purposes. Additionally, while the formula references a time t, there does not appear to be a time component in your example data set; time isn't represented in this approach. The output would presumably represent the expression for each node at a single point in time.
#written in R version 4.2.1
require(data.table)
##Example data frame
df = data.frame(from = c(1,1,3,3,3), to = c(2,3,1,2,4),nm = c(157,1058,2481,833,178))
df = data.table(df)
df
from to nm
1: 1 2 157
2: 1 3 1058
3: 3 1 2481
4: 3 2 833
5: 3 4 178
##Calculate the sum of messages by sender in "from" column
nf = df[,sum(nm), by = from]
colnames(nf) = c("from","message_total")
nf
from message_total
1: 1 1215
2: 3 3492
## Function
## inputs to function are the total number of messages of a sender in
## "from" column (called cit) and some lambda constant
icit = function(cit,lambda = 2){
-(1/(cit + lambda))*log(1/((cit + lambda)), base = 2)
}
##Find vector of values for each sender in the data set
ans = NULL
for(i in 1:dim(nf)[1]){
ans[i] = icit(nf$message_total[i])
}
ans
[1] 0.008421622 0.003368822
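Since arithmetic in R is vectorized, the loop is optional. A minimal sketch of the same calculation in base R, without data.table; lambda = 2 is still the arbitrary illustrative value, and the name totals is introduced here only for the example:

```r
# Same example data as above, as a plain data frame
df <- data.frame(from = c(1, 1, 3, 3, 3),
                 to   = c(2, 3, 1, 2, 4),
                 nm   = c(157, 1058, 2481, 833, 178))

# Total number of messages per sender
totals <- tapply(df$nm, df$from, sum)

# Same expression as icit above; lambda = 2 is an arbitrary choice
icit <- function(cit, lambda = 2) {
  p <- 1 / (cit + lambda)
  -p * log(p, base = 2)
}

# One call handles all senders at once
ans <- icit(totals)
ans
```

The result is named by sender, which makes it easier to match values back to nodes.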

Simple restrictions/constraint for multiple imputation (MICE) in R

I want to perform multiple imputation for a set of variables using the MICE package in R.
# Example data
data <- data.frame(
gcs = c(3, 10, NA, NA, NA, 15, 14, 15, 15, 14, 15, NA, 13, 15, 15),
hf = c(50, 66, 78, 99, NA, NA, 56, 55, NA, 76, 98, 105, NA, NA, 65),
...
)
The minimum for gcs is 3 and the maximum is 15, and it must be a whole number. How can I set these constraints in MICE? The same goes for hf, except that it only has a lower limit of 0.
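To make the constraints concrete independently of how they are wired into mice: predictive mean matching draws imputations from observed values, so it stays within the observed range, while other methods may need their imputed values clamped and rounded afterwards. A minimal sketch of that clamping step in base R; the helper name constrain is hypothetical and not part of mice, and the "imputed" values are made up for illustration:

```r
# Hypothetical helper (not part of mice): clamp values to [lower, upper]
# and optionally round them to whole numbers
constrain <- function(x, lower = -Inf, upper = Inf, integer = FALSE) {
  x <- pmin(pmax(x, lower), upper)
  if (integer) x <- round(x)
  x
}

# Made-up imputed values, for illustration only
gcs_imputed <- c(2.4, 9.7, 16.2)
hf_imputed  <- c(-5.1, 72.3, 101.8)

gcs_ok <- constrain(gcs_imputed, lower = 3, upper = 15, integer = TRUE)
hf_ok  <- constrain(hf_imputed, lower = 0)
gcs_ok
hf_ok
```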

Creating bubble plot for metaprop

I was trying to plot a metaregression for proportions using the meta package. The metaregression using metaprop works as expected. But when I run bubble, I get the error listed below the script:
library(meta)
sample <- c(74, 62,370, 72, 40, 84, 290, 244, 173, 106, 89, 139, 43, 398, 179, 31)
BLIPS <- c(23, 12, 11, 11, 1, 17, 52, 28, 6, 4, 3, 4, 1, 56, 22, 1)
covar <- c(21, 11, 14, 1, 4, 47, 2, 42, 16, 44, 3, 34, 11, 15, 21, 4)
hr <- data.frame(sample, BLIPS, covar)
meta <- metaprop(BLIPS, sample)
reg <- metareg(meta, covar)
reg
bubble(reg)
Error in [.data.frame(x$.meta$x$data, , covar.name) : undefined
columns selected
Currently your metaregression uses the variables from the global environment and not the variables from your data.frame hr. This appears to work for the regression itself, but not for the bubble plot. If you just add data = hr to your metaprop call, then the bubble plot works as expected.
hr <- data.frame(sample, BLIPS, covar)
meta <- metaprop(BLIPS, sample, data = hr)
reg <- metareg(meta, covar)
reg
bubble(reg)

Taking the derivative of a Survival Function in R

I'm looking to take the derivative of the survival function in R and store it in a new function.
Here is my code so far:
install.packages("survival")
library(survival)
survival <- matrix(c(1, 555, 0, 82, 2, 473, 8, 30, 3, 435, 8, 27, 4, 400, 7, 22, 5,
371, 7, 26, 6, 338, 28, 25, 7, 285, 31, 20,8, 234, 32, 11, 9, 191,
24, 14, 10, 153, 27, 13, 11, 113, 22, 5, 12, 86, 23, 5, 13, 58, 18,
5, 14, 35, 9, 2, 15, 24, 7, 3, 16, 14, 11, 3),
ncol=4, byrow=TRUE)
# One entry per subject: counts from column 4 first, then counts from column 3
year <- c(rep(seq_len(nrow(survival)), times = survival[, 4]),
rep(seq_len(nrow(survival)), times = survival[, 3]))
state <- c(rep(1, sum(survival[, 4])), rep(0, sum(survival[, 3])))
my.surv <- Surv(year, state)
fit <- survfit(my.surv ~ 1)
my.fit <- survfit(my.surv ~ 1)
### K-M plot
plot(my.fit, main="Kaplan-Meier estimate with 95% confidence bounds",
xlab="time", ylab="survival function")
### K-M cumulative hazard function
H.hat <- -log(my.fit$surv)
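A Kaplan-Meier estimate is a step function, so its derivative is zero almost everywhere; one option is to smooth it first and differentiate the smooth curve. A minimal sketch using stats::splinefun, assuming vectors of times and survival probabilities such as my.fit$time and my.fit$surv. The values below are a synthetic stand-in (an exponential curve), not output from the fit above, and the names S and dS are introduced only for this example:

```r
# Synthetic stand-in for my.fit$time and my.fit$surv (assumption for illustration)
time <- 1:16
surv <- exp(-0.1 * time)

# Interpolating cubic spline through the survival estimates
S <- splinefun(time, surv)

# The function returned by splinefun accepts a deriv argument,
# so the first derivative can be wrapped as a new function
dS <- function(t) S(t, deriv = 1)

# Slopes of the smoothed curve at t = 2, 8 and 14
dS(c(2, 8, 14))
```

Whether spline smoothing is appropriate depends on the analysis; the point here is only that the derivative can be stored as an ordinary R function.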

How can I calculate the mean of the top 4 observations in my column?

How can I calculate the mean of the top 4 observations in my column?
c(12, 13, 15, 1, 5, 9, 34, 50, 60, 50, 60, 4, 6, 8, 12)
For instance, in the above I would have (50+60+50+60)/4 = 55. I only know how to use the quantile, but it does not work for this.
Any ideas?
Since you're interested in only the top 4 items, you can use partial sort instead of full sort. If your vector is huge, you might save quite some time:
x <- c(12, 13, 15, 1, 5, 9, 34, 50, 60, 50, 60, 4, 6, 8, 12)
idx <- seq(length(x)-3, length(x))
mean(sort(x, partial=idx)[idx])
# [1] 55
Try this:
vec <- c(12, 13, 15, 1, 5, 9, 34, 50, 60, 50, 60, 4, 6, 8, 12)
mean(sort(vec, decreasing=TRUE)[1:4])
gives
[1] 55
Maybe something like this:
v <- c(12, 13, 15, 1, 5, 9, 34, 50, 60, 50, 60, 4, 6, 8, 12)
mean(head(sort(v, decreasing=TRUE), 4))
First, you sort your vector so that the largest values are in the beginning. Then with head you take the 4 first values in that vector, subsequently taking the mean value of that.
To be different! Also, please try to do some research on your own before posting.
x <- c(12, 13, 15, 1, 5, 9, 34, 50, 60, 50, 60, 4, 6, 8, 12)
mean(tail(sort(x), 4))
Just to show that you can use quantile in this exercise:
mean(quantile(x,1-(0:3)/length(x),type=1))
#[1] 55
However, the other answers are clearly more efficient.
You could use the order function. Order by -x to give the values in descending order, and just average the first 4:
x <- c(12, 13, 15, 1, 5, 9, 34, 50, 60, 50, 60, 4, 6, 8, 12)
mean(x[order(-x)][1:4])
[1] 55
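Any of these approaches can be wrapped into a small reusable function. A minimal sketch; the name top_n_mean and the choice to drop NA values are assumptions for illustration, not part of the question:

```r
# Hypothetical helper: mean of the n largest values, ignoring NAs
top_n_mean <- function(x, n = 4) {
  x <- x[!is.na(x)]
  mean(head(sort(x, decreasing = TRUE), n))
}

x <- c(12, 13, 15, 1, 5, 9, 34, 50, 60, 50, 60, 4, 6, 8, 12)
top_n_mean(x)         # 55, matching the answers above
top_n_mean(c(x, NA))  # NA is dropped before sorting, so still 55
```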
