data.table slow aggregating on factor column

Came across this issue today. I have a data.table with some categorical fields (i.e. factors). Something like
set.seed(2016)
dt <- data.table(
  ID = factor(sample(30000, 2000000, replace=TRUE)),
  Letter = factor(LETTERS[sample(26, 2000000, replace=TRUE)])
)
dt
ID Letter
1: 5405 E
2: 4289 E
3: 25250 J
4: 4008 J
5: 14326 G
---
Now, I'd like to calculate the gini impurity for each column of dt, grouped by the values in ID.
My attempt:
giniImpurity <- function(vals){
  # Returns the gini impurity of a set of categorical values
  # vals can either be the raw category instances (vals=c("red", "red", "blue", "green"))
  # or named category frequencies (vals=c(red=2, blue=1, green=1))
  # Gini impurity is the probability a value is incorrectly labeled when labeled
  # according to the distribution of classes in the set
  if(is(vals, "numeric")) counts <- vals else counts <- table(vals)
  total <- sum(counts)
  return(sum((counts/total)*(1-counts/total)))
}
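A quick sanity check on made-up values (both input forms should agree):
giniImpurity(c("red", "red", "blue", "green"))  # 0.625
giniImpurity(c(red=2, blue=1, green=1))         # 0.625, same result from frequencies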
# Calculate gini impurities
dt[, list(Samples=.N, ID.GiniImpurity=giniImpurity(ID), Letter.GiniImpurity=giniImpurity(Letter)), by=ID]
ID Samples ID.GiniImpurity Letter.GiniImpurity
1: 5405 66 0 0.9527
2: 4289 73 0 0.9484
3: 25250 60 0 0.9394
4: 4008 66 0 0.9431
5: 14326 79 0 0.9531
---
This works but it's incredibly slow. It seems that if I change ID from factor to numeric, it runs much quicker. Is this what I should do in practice or is there a less hacky way to speed up this operation?
Also, I know it's unnecessary to calculate the gini impurity of ID grouped by itself, but please look past this. My real dataset has many more categorical features which add to the slowness.
Also note that I'm using data.table version 1.9.7 (devel)
EDIT
Sorry guys... I just realized that when I tested this with ID as numeric instead of a factor, my call to giniImpurity() is where the speed up occurred due to the nature of how it works. I guess the call to table() is where the slowdown is. Still not 100% sure how to make this quicker though.
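A minimal sketch of the workaround (assuming the dt built above): let data.table count the (ID, Letter) pairs in one grouped pass, then compute the impurity from those counts so table() is never called inside a group:
# count each (ID, Letter) combination once
counts <- dt[, .N, by = .(ID, Letter)]
# gini impurity per ID from the precomputed counts (no per-group table() call)
counts[, .(Letter.GiniImpurity = sum((N/sum(N)) * (1 - N/sum(N)))), by = ID]
The full answer below generalizes this idea to every pair of factor columns.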

Got it.
giniImpurities <- function(dt){
  # Returns pairs of categorical fields (cat1, cat2, GI) where GI is the weighted
  # gini impurity of cat2 relative to the groups determined by cat1
  #--------------------------------------------------
  # Subset dt by just the categorical fields
  catfields <- colnames(dt)[sapply(dt, is.factor)]
  cats1 <- dt[, catfields, with=FALSE]
  # Build a table to store the results
  varpairs <- CJ(Var1=catfields, Var2=catfields)
  varpairs[Var1==Var2, GI := 0]
  # Loop through each grouping variable
  for(catcol in catfields){
    print(paste("Calculating gini impurities by field:", catcol))
    setkeyv(cats1, catcol)
    impuritiesDT <- cats1[, list(Samples=.N), keyby=catcol]
    # Loop through each of the other categorical columns
    for(colname in setdiff(catfields, catcol)){
      # Get the gini impurity for each pair (catcol, other)
      counts <- cats1[, list(.N), by=c(catcol, colname)]
      impurities <- counts[, list(GI=sum((N/sum(N))*(1-N/sum(N)))), by=catcol]
      impuritiesDT[impurities, GI := GI]
      setnames(impuritiesDT, "GI", colname)
    }
    cats1.gini <- melt(impuritiesDT, id.vars=c(catcol, "Samples"))
    cats1.gini <- cats1.gini[, list(GI=weighted.mean(x=value, w=Samples)), by=variable]
    cats1.gini <- cats1.gini[, list(Var1=catcol, Var2=variable, GI)]
    varpairs[cats1.gini, `:=`(GI=i.GI), on=c("Var1", "Var2")]
  }
  return(varpairs[])
}
giniImpurities(dt)  # output below comes from a dt that also carries factor columns Letter2 and PGroup
Var1 Var2 GI
1: Letter Letter 0.0000000
2: Letter Letter2 0.9615258
3: Letter PGroup 0.9999537
4: Letter2 Letter 0.9615254
5: Letter2 Letter2 0.0000000
6: Letter2 PGroup 0.9999537
7: PGroup Letter 0.9471393
8: PGroup Letter2 0.9470965
9: PGroup PGroup 0.0000000

Related

How to use mice for multiple imputation of missing values in longitudinal data?

I have a dataset with a repeatedly measured continuous outcome and some covariates of different classes, like in the example below.
Id y Date Soda Team
1 -0.4521 1999-02-07 Coke Eagles
1 0.2863 1999-04-15 Pepsi Raiders
2 0.7956 1999-07-07 Coke Raiders
2 -0.8248 1999-07-26 NA Raiders
3 0.8830 1999-05-29 Pepsi Eagles
4 0.1303 2005-03-04 NA Cowboys
5 0.1375 2013-11-02 Coke Cowboys
5 0.2851 2015-06-23 Coke Eagles
5 -0.3538 2015-07-29 Pepsi NA
6 0.3349 2002-10-11 NA NA
7 -0.1756 2005-01-11 Pepsi Eagles
7 0.5507 2007-10-16 Pepsi Cowboys
7 0.5132 2012-07-13 NA Cowboys
7 -0.5776 2017-11-25 Coke Cowboys
8 0.5486 2009-02-08 Coke Cowboys
I am trying to multiply impute the missing values in Soda and Team using the mice package. As I understand it, because MI is not a causal model, there is no concept of dependent and independent variables. I am not sure how to set up this MI process using mice. I'd like some suggestions or advice from others who have encountered missing data in a repeated-measures setting like this and how they used mice to tackle the problem. Thanks in advance.
Edit
This is what I have tried so far, but this does not capture the repeated measure part of the dataset.
library(mice)
init <- mice(dat, maxit = 0)
methd <- init$method
predM <- init$predictorMatrix
methd[c("Soda")] <- "logreg"
methd[c("Team")] <- "logreg"
imputed <- mice(dat, method = methd, predictorMatrix = predM, m = 5)
There are several options to accomplish what you are asking for. I have decided to impute missing values in covariates in the so-called 'wide' format. I will illustrate this with the following worked example, which you can easily apply to your own data.
Let's first make a reprex. Here, I use the longitudinal Mayo Clinic Primary Biliary Cirrhosis Data (pbc2), which comes with the JM package. This data is organized in the so-called 'long' format, meaning that each patient i has multiple rows and each row contains a measurement of variable x measured on time j. Your dataset is also in the long format. In this example, I assume that pbc2$serBilir is our outcome variable.
# install.packages('JM')
library(JM)
# note: use function(x) instead of \(x) if you use a version of R <4.1.0
# missing values per column
miss_abs <- \(x) sum(is.na(x))
miss_perc <- \(x) round(sum(is.na(x)) / length(x) * 100, 1L)
miss <- cbind('Number' = apply(pbc2, 2, miss_abs), '%' = apply(pbc2, 2, miss_perc))
# --------------------------------
> miss[which(miss[, 'Number'] > 0),]
Number %
ascites 60 3.1
hepatomegaly 61 3.1
spiders 58 3.0
serChol 821 42.2
alkaline 60 3.1
platelets 73 3.8
According to this output, 6 variables in pbc2 contain at least one missing value. Let's pick alkaline from these. We also need patient id and the time variable years.
# subset
pbc_long <- subset(pbc2, select = c('id', 'years', 'alkaline', 'serBilir'))
# sort ascending based on id and, within each id, years
pbc_long <- with(pbc_long, pbc_long[order(id, years), ])
# ------------------------------------------------------
> head(pbc_long, 5)
id years alkaline serBilir
1 1 1.09517 1718 14.5
2 1 1.09517 1612 21.3
3 2 14.15234 7395 1.1
4 2 14.15234 2107 0.8
5 2 14.15234 1711 1.0
Just by quickly eyeballing, we observe that years do not seem to differ within subjects, even though variables were repeatedly measured. For the sake of this example, let's add a little bit of time to all rows of years but the first measurement.
set.seed(1)
# add little bit of time to each row of 'years' but the first row
new_years <- lapply(split(pbc_long, pbc_long$id), \(x) {
  add_time <- 1:(length(x$years) - 1L) + rnorm(length(x$years) - 1L, sd = 0.25)
  c(x$years[1L], x$years[-1L] + add_time)
})
# replace the original 'years' variable
pbc_long$years <- unlist(new_years)
# integer time variable needed to store repeated measurements as separate columns
pbc_long$measurement_number <- unlist(sapply(split(pbc_long, pbc_long$id), \(x) 1:nrow(x)))
# only keep the first 4 repeated measurements per patient
pbc_long <- subset(pbc_long, measurement_number %in% 1:4)
Since we will perform our multiple imputation in wide format (meaning that each participant i has one row and repeated measurements on x are stored in j different columns, so xj columns in total), we have to convert the data from long to wide. Now that we have prepared our data, we can use reshape to do this for us.
# convert long format into wide format
v_names <- c('years', 'alkaline', 'serBilir')
pbc_wide <- reshape(pbc_long,
                    idvar = 'id',
                    timevar = "measurement_number",
                    v.names = v_names, direction = "wide")
# -----------------------------------------------------------------
> head(pbc_wide, 4)[, 1:9]
id years.1 alkaline.1 serBilir.1 years.2 alkaline.2 serBilir.2 years.3 alkaline.3
1 1 1.095170 1718 14.5 1.938557 1612 21.3 NA NA
3 2 14.152338 7395 1.1 15.198249 2107 0.8 15.943431 1711
12 3 2.770781 516 1.4 3.694434 353 1.1 5.148726 218
16 4 5.270507 6122 1.8 6.115197 1175 1.6 6.716832 1157
Now let's multiply impute the missing values in our covariates.
library(mice)
# Setup-run
ini <- mice(pbc_wide, maxit = 0)
meth <- ini$method
pred <- ini$predictorMatrix
visSeq <- ini$visitSequence
# avoid collinearity issues by letting only variables measured
# at the same point in time predict each other
pred[grep("1", rownames(pred), value = TRUE),
grep("2|3|4", colnames(pred), value = TRUE)] <- 0
pred[grep("2", rownames(pred), value = TRUE),
grep("1|3|4", colnames(pred), value = TRUE)] <- 0
pred[grep("3", rownames(pred), value = TRUE),
grep("1|2|4", colnames(pred), value = TRUE)] <- 0
pred[grep("4", rownames(pred), value = TRUE),
grep("1|2|3", colnames(pred), value = TRUE)] <- 0
# variables that should not be imputed
pred[c("id", grep('^year', names(pbc_wide), value = TRUE)), ] <- 0
# variables that should not serve as predictors
pred[, c("id", grep('^year', names(pbc_wide), value = TRUE))] <- 0
# multiply imputed missing values ------------------------------
imp <- mice(pbc_wide, pred = pred, m = 10, maxit = 20, seed = 1)
# Time difference of 2.899244 secs
As can be seen in the traceplots (which can be obtained with plot(imp)), the algorithm has converged nicely. Refer to the convergence section of Stef van Buuren's book Flexible Imputation of Missing Data for more information.
Now we need to convert back the multiply imputed data (which is in wide format) to long format, so that we can use it for analyses. We also need to make sure that we exclude all rows that had missing values for our outcome variable serBilir, because we do not want to use imputed values of the outcome.
# need unlisted data
implong <- complete(imp, 'long', include = FALSE)
# 'smart' way of getting all the names of the repeated variables in a usable format
v_names <- as.data.frame(matrix(apply(
  expand.grid(grep('ye|alk|ser', names(implong), value = TRUE)),
  1, paste0, collapse = ''), nrow = 4, byrow = TRUE), stringsAsFactors = FALSE)
names(v_names) <- names(pbc_long)[2:4]
# convert back to long format
longlist <- lapply(split(implong, implong$.imp),
                   reshape, direction = 'long',
                   varying = as.list(v_names),
                   v.names = names(v_names),
                   idvar = 'id', times = 1:4)
# logical that is TRUE if our outcome was not observed,
# which should be based on the original, unimputed data
orig_data <- reshape(imp$data, direction = 'long',
                     varying = as.list(v_names),
                     v.names = names(v_names),
                     idvar = 'id', times = 1:4)
orig_data$logical <- is.na(orig_data$serBilir)
# merge into the list of imputed long-format datasets:
longlist <- lapply(longlist, merge, y = subset(orig_data, select = c(id, time, logical)))
# exclude rows for which logical == TRUE
longlist <- lapply(longlist, \(x) subset(x, !logical))
Finally, convert longlist back into a mids object using datalist2mids from the miceadds package.
imp <- miceadds::datalist2mids(longlist)
# ----------------
> imp$loggedEvents
NULL
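From here the usual mice workflow applies; a minimal sketch (the model formula is purely illustrative, not part of the original answer):
# fit an illustrative model on each imputed dataset and pool with Rubin's rules
fit <- with(imp, lm(serBilir ~ alkaline + time))
summary(mice::pool(fit))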

R: creating uneven levels of factor for a numeric variable

I have a set of values (100000 entries) ranging from -0.20 to +0.15, which are return percentages.
Bulk of the values lies between +3.5% and -3.5%
I am looking to convert this into a factor such that:
any return between -0.035 and +0.035 is binned in equal 0.005 increments,
anything between -0.2 and -0.035 is binned as one factor level, and
anything between +0.035 and +0.15 is binned as one factor level.
Any thoughts on how I can achieve this in R? I did try cut, but it seems to bin only in equal increments.
So I generated the vector that holds the values (drawn from a uniform distribution):
library(data.table)
set.seed(555)   # in order to be reproducible
N <- 100000     # number of pseudonumbers to be generated
min1 <- -0.035  # arbitrary limits
max1 <- 0.035   # idem
samp <- runif(N, min = -0.2, max = 0.15)  # create the vector
level1 <- as.factor(ifelse(samp <= min1, paste0("(", min(samp), ",", min1, "]"), NA))  # create the first (lower catch-all) level
level2 <- as.factor(ifelse(samp >= max1, paste0("[", max1, ",", max(samp), ")"), NA))  # create the second (upper catch-all) level
incr <- 0.005
level3 <- cut(samp, seq(min1, max1, by = incr))  # create the intermediate levels
dt <- data.table(samp, level1, level2, level3)   # put all together
mylevels <- na.omit(unlist(matrix(t(dt[, -1]))))  # for each samp, the single non-NA range it falls in
For better visualization of results:
mylevels <- factor(mylevels, levels = unique(mylevels))
dt2 <- dt[, .(samp, levels = mylevels)]
samp levels
1: -0.07023653 (-0.199996188434307,-0.035]
2: 0.10889991 [0.035,0.149995080730878)
3: 0.04246077 [0.035,0.149995080730878)
4: -0.01193010 (-0.015,-0.01]
5: 0.02607736 (0.025,0.03]
---
99996: -0.04786692 (-0.199996188434307,-0.035]
99997: -0.08700210 (-0.199996188434307,-0.035]
99998: 0.09989973 [0.035,0.149995080730878)
99999: 0.10095336 [0.035,0.149995080730878)
100000: -0.05555869 (-0.199996188434307,-0.035]
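Note that cut() also accepts an explicit vector of uneven breakpoints, so the same binning can be produced directly (a sketch, assuming the samp vector above):
# one call: outer catch-all bins plus the 0.005-wide bins in the middle
breaks <- c(-0.2, seq(-0.035, 0.035, by = 0.005), 0.15)
dt[, levels2 := cut(samp, breaks = breaks, include.lowest = TRUE)]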

Comparing Groups in data.table Columns

I have a dataset that I need to both split by one variable (Day) and then compare between groups of another variable (Group), performing per-group statistics (e.g. mean) and also tests.
Here's an example of what I devised:
require(data.table)
data = data.table(Day = rep(1:10, each = 10),
                  Group = rep(1:2, times = 50),
                  V = rnorm(100))
data[, .(g1_mean = mean(.SD[Group == 1]$V),
         g2_mean = mean(.SD[Group == 2]$V),
         p.value = t.test(V ~ Group, .SD, alternative = "two.sided")$p.value),
     by = list(Day)]
Which produces:
Day g1_mean g2_mean p.value
1: 1 0.883406048 0.67177271 0.6674138
2: 2 0.007544956 -0.55609722 0.3948459
3: 3 0.409248637 0.28717183 0.8753213
4: 4 -0.540075365 0.23181458 0.1785854
5: 5 -0.632543900 -1.09965990 0.6457325
6: 6 -0.083221671 -0.96286343 0.2011136
7: 7 -0.044674252 -0.27666473 0.7079499
8: 8 0.260795244 -0.15159164 0.4663712
9: 9 -0.134164758 0.01136245 0.7992453
10: 10 0.496144329 0.76168408 0.1821123
I'm hoping that there's a less roundabout manner of arriving at this result.
A possible compact alternative which can also apply more functions to each group:
DTnew <- dcast(DT[, pval := t.test(V ~ Group, .SD, alternative = "two.sided")$p.value, Day],
               Day + pval ~ paste0("g", Group), fun = list(mean, sd), value.var = "V")
which gives:
> DTnew
Day pval V_mean_g1 V_mean_g2 V_sd_g1 V_sd_g2
1: 1 0.4763594 -0.11630634 0.178240714 0.7462975 0.4516087
2: 2 0.5715001 -0.29689807 0.082970631 1.3614177 0.2745783
3: 3 0.2295251 -0.48792449 -0.031328749 0.3723247 0.6703694
4: 4 0.5565573 0.33982242 0.080169698 0.5635136 0.7560959
5: 5 0.5498684 -0.07554433 0.308661427 0.9343230 1.0100788
6: 6 0.4814518 0.57694034 0.885968245 0.6457926 0.6773873
7: 7 0.8053066 0.29845913 0.116217727 0.9541060 1.2782210
8: 8 0.3549573 0.14827289 -0.319017581 0.5328734 0.9036501
9: 9 0.7290625 -0.21589411 -0.005785092 0.9639758 0.8859461
10: 10 0.9899833 0.84034529 0.850429982 0.6645952 1.5809149
A decomposition of the code:
First, a pval variable is added to the dataset with DT[, pval := t.test(V ~ Group, .SD, alternative = "two.sided")$p.value, Day]
Because DT is updated in place and by reference by the previous step, the dcast function can be applied to that directly.
In the casting formula, you specify the variables that need to stay in their current form (as row identifiers) on the LHS and the variable whose values are to be spread over columns on the RHS.
With the fun argument you can specify which aggregation function has to be used on the value.var (here V). If multiple aggregation functions are needed, you can specify them in a list (e.g. list(mean, sd)). This can be any type of function, so custom-made functions can also be used, as sketched below.
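For instance (a hypothetical extra summary, assuming DT already carries the pval column added above):
# a user-defined within-group range alongside the mean
spread <- function(x) diff(range(x))
dcast(DT, Day + pval ~ paste0("g", Group), fun = list(mean, spread), value.var = "V")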
If you want to remove the V_ from the column names, you can do:
names(DTnew) <- gsub("V_","",names(DTnew))
NOTE: I renamed the data.table to DT as it is often not wise to name your dataset after a function (check ?data)
While not a one-liner, you might consider doing your two processes separately and then merging the results. This prevents you from having to hardcode the group names.
First, we calculate the means:
my_means <- dcast(data[, mean(V), by = .(Day, Group)],
                  Day ~ paste0("Mean_Group", Group), value.var = "V1")
Or, in the less convoluted way @Akrun mentioned in the comments, with some added formatting:
my_means <- dcast(Day ~ paste0("Mean_Group", Group), data = data,
                  fun.agg = mean, value.var = "V")
Then the t-tests:
t_tests <- data[,.(p_value=t.test(V~Group)$p.value), by = Day]
And then merge:
output <- merge(my_means, t_tests)

Calculate the mean per subject and repeat the value for each subject's row

This is the first time that I ask a question on stack overflow. I have tried searching for the answer but I cannot find exactly what I am looking for. I hope someone can help.
I have a huge data set of 20416 observations. Basically, I have 83 subjects and for each subject I have several observations. However, the number of observations per subject is not the same (e.g. subject 1 has 256 observations, while subject 2 has only 64 observations).
I want to add an extra column containing the mean of the observations for each subject (the observations are reading times (RT)).
I tried with the aggregate function:
aggregate(RT ~ su, data, mean)
This formula returns the correct mean per subject. But then I cannot simply do the following:
data$mean <- aggregate(RT ~ su, data, mean)
as R returns this error:
Error in `$<-.data.frame`(`*tmp*`, "mean", value = list(su = 1:83, RT
  = c(378.1328125, ... : replacement has 83 rows, data has 20416
I understand that the formula lacks a command specifying that the mean for each subject has to be repeated for all the subject's rows (e.g. if subject 1 has 256 rows, the mean for subject 1 has to be repeated for 256 rows, if subject 2 has 64 rows, the mean for subject 2 has to be repeated for 64 rows and so forth).
How can I achieve this in R?
The data.table syntax lends itself well to this kind of problem:
Dt[, Mean := mean(Value), by = "ID"][]
# ID Value Mean
# 1: a 0.05881156 0.004426491
# 2: a -0.04995858 0.004426491
# 3: b 0.64054432 0.038809830
# 4: b -0.56292466 0.038809830
# 5: c 0.44254622 0.099747707
# 6: c -0.10771992 0.099747707
# 7: c -0.03558318 0.099747707
# 8: d 0.56727423 0.532377247
# 9: d -0.60962095 0.532377247
# 10: d 1.13808538 0.532377247
# 11: d 1.03377033 0.532377247
# 12: e 1.38789640 0.568760936
# 13: e -0.57420308 0.568760936
# 14: e 0.89258949 0.568760936
As we are applying a grouped operation (by = "ID"), data.table will automatically replicate each group's mean(Value) the appropriate number of times (avoiding the error you ran into above).
Data:
Dt <- data.table::data.table(
  ID = sample(letters[1:5], size = 14, replace = TRUE),
  Value = rnorm(14))[order(ID)]
Staying in Base R, ave is intended for this use:
data$mean = with(data, ave(x = RT, su, FUN = mean))
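The key property is that ave() returns a vector as long as its input, with each subject's mean repeated, which is exactly what the aggregate() call above lacked. A tiny check on made-up values:
with(data.frame(su = c(1, 1, 2), RT = c(2, 4, 10)),
     ave(x = RT, su, FUN = mean))
# [1]  3  3 10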
Simply merge your aggregated means with the full dataframe, joined by the subject:
aggdf <- aggregate(RT ~ su, data, mean)
names(aggdf)[2] <- "MeanOfRT"
data <- merge(data, aggdf, by = "su")
Another compelling way of handling this without generating extra data objects is to use group_by from the dplyr package:
library(dplyr)
# Generating some data
data <- data.table::data.table(
  su = sample(letters[1:5], size = 14, replace = TRUE),
  RT = rnorm(14))[order(su)]
# Performing
data %>% group_by(su) %>%
  mutate(Mean = mean(RT)) %>%
  ungroup()
Source: local data table [14 x 3]
su RT Mean
1 a -1.62841746 0.2096967
2 a 0.07286149 0.2096967
3 a 0.02429030 0.2096967
4 a 0.98882343 0.2096967
5 a 0.95407214 0.2096967
6 a 1.18823435 0.2096967
7 a -0.13198711 0.2096967
8 b -0.34897914 0.1469982
9 b 0.64297557 0.1469982
10 c -0.58995261 -0.5899526
11 d -0.95995198 0.3067978
12 d 1.57354754 0.3067978
13 e 0.43071258 0.2462978
14 e 0.06188307 0.2462978

Summing Counts of a wide variable once per subject

I have a subject dataset subjects that contains a set of variables corresponding to medications subjects have taken. From that wide variable, I've created a unique set of all the meds in the dataset (Regardless of which specific variable it came from).
I want to find the subject count of each med, such that if a subject lists a med once or more, that med's count is increased by one.
Below is a slow way of doing it, but I have 9 med variables and over 50,000 subjects. Can someone help me figure out a more efficient way? Thanks.
subjects <- data.table(
  med1 = as.factor(c("NASONEX", "ALBUTEROL", "", "BENADRYL", "THEODUR")),
  med2 = as.factor(c("", "ALBUTEROL", "ASBRON", "NASONEX", "MONTEKULAST")),
  medrecent = as.factor(c("MONTEKLUAST", "", "", "THEODUR", ""))
)
medvarnames <- c("med1", "med2", "medrecent")
allmeds <- data.table(
  med = subjects[, unique(unlist(sapply(.SD, levels))), .SDcols = medvarnames],
  count = 0
)
for(i in 1:dim(subjects)[1]){
  allmeds[, count := count +
    sapply(allmeds$med, function(.m){
      as.integer(
        any(.m == subjects[i, .SD, .SDcols = medvarnames], na.rm = TRUE)
      )
    })
  ]
}
allmeds
allmeds
med count
1: 4
2: ALBUTEROL 1
3: BENADRYL 1
4: NASONEX 2
5: THEODUR 2
6: ASBRON 1
7: MONTEKULAST 1
8: MONTEKLUAST 1
What about
as.data.frame(table(unlist(apply(subjects, 1, unique))))
There might be faster methods but it does a decent job (~1.5 sec) on a 50,000-by-9 table like you mentioned.
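If the empty-string level should not count as a medication (an assumption based on the blank entries above), a small variant filters it out first:
# assuming blanks encode "no medication", drop them before tabulating
meds <- unlist(apply(subjects, 1, unique))
as.data.frame(table(meds[meds != ""]))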
