So initially I had the following object:
> head(gs)
year disturbance lek_id complex tot_male
1 2006 N 3T Diamond 3
2 2007 N 3T Diamond 17
3 1981 N bare 3corners 4
4 1982 N bare 3corners 7
5 1983 N bare 3corners 2
6 1985 N bare 3corners 5
With that I computed general statistics (min, max, mean, and sd) of tot_male for each year within complex. I used R's data-splitting functions, assigned sensible column names, and stored each result as a separate object.
> tyc_min = aggregate(gs$tot_male, by=list(gs$year, gs$complex), FUN=min)
> names(tyc_min) = c("year", "complex", "tot_male_min")
> tyc_max = aggregate(gs$tot_male, by=list(gs$year, gs$complex), FUN=max)
> names(tyc_max) = c("year", "complex", "tot_male_max")
> tyc_mean = aggregate(gs$tot_male, by=list(gs$year, gs$complex), FUN=mean)
> names(tyc_mean) = c("year", "complex", "tot_male_mean")
> tyc_sd = aggregate(gs$tot_male, by=list(gs$year, gs$complex), FUN=sd)
> names(tyc_sd) = c("year", "complex", "tot_male_sd")
Example output (second object, tyc_max):
year complex tot_male_max
1 2003 0
2 1970 3corners 26
3 1971 3corners 22
4 1972 3corners 26
5 1973 3corners 32
6 1974 3corners 18
Now I need to add the number of samples per year/complex combination as well. Then I need to merge these into a single data object and export it as a .csv file.
I know I need to use the merge() function along with all.y, but I have no idea how to handle this error:
Error in fix.by(by.x, x) :
'by' must specify one or more columns as numbers, names or logical
Alternatively, how do I add the number of samples per year and complex? Any suggestions?
This might work (but hard to check without a reproducible example):
gsnew <- Reduce(function(...) merge(..., all = TRUE, by = c("year","complex")),
list(tyc_min, tyc_max, tyc_mean, tyc_sd))
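To cover the sample count with the same merge approach, one possible sketch is shown here. It uses a small made-up stand-in for gs, and the helper agg() and the column name n_samples are hypothetical names chosen for illustration; only base R is used.

```r
# Made-up stand-in for gs (illustrative values only)
gs <- data.frame(
  year     = c(2006, 2006, 2007, 1981, 1981, 1982),
  complex  = c("Diamond", "Diamond", "Diamond", "3corners", "3corners", "3corners"),
  tot_male = c(3, 17, 5, 4, 7, 2)
)

# Hypothetical helper: one statistic per year/complex, with a readable column name
agg <- function(fun, new_name) {
  out <- aggregate(tot_male ~ year + complex, data = gs, FUN = fun)
  names(out)[3] <- new_name
  out
}

pieces <- list(
  agg(min, "tot_male_min"),
  agg(max, "tot_male_max"),
  agg(length, "n_samples")   # number of rows per year/complex combination
)

# Merge all pieces on the two key columns
gsnew <- Reduce(function(x, y) merge(x, y, by = c("year", "complex"), all = TRUE),
                pieces)
gsnew
```

Passing the key columns by name in by = c("year", "complex") only works if every piece actually carries columns with those names, which is why the helper renames just the statistic column.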
But instead of aggregating for the separate statistics and then merging, you can also aggregate everything at once into a new dataframe / datatable with for example data.table, dplyr or base R. Then you don't have to merge afterwards (for a base R solution see the other answer):
library(data.table)
gsnew <- setDT(gs)[, .(male_min = min(tot_male),
                       male_max = max(tot_male),
                       male_mean = mean(tot_male),
                       male_sd = sd(tot_male)), by = .(year, complex)]
library(dplyr)
gsnew <- gs %>% group_by(year, complex) %>%
summarise(male_min = min(tot_male),
male_max = max(tot_male),
male_mean = mean(tot_male),
male_sd = sd(tot_male))
mystat <- function(x) c(mi=min(x), ma=max(x))
aggregate(Sepal.Length~Species, FUN=mystat, data=iris)
for you:
mystat <- function(x) c(mi=min(x), ma=max(x), m=mean(x), s=sd(x), l=length(x))
aggregate(tot_male~year+complex, FUN=mystat, data=gs)
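One thing to be aware of with this approach: when FUN returns a vector, aggregate() stores the statistics as a single matrix column, which is awkward to export. A minimal sketch of flattening it and writing the .csv follows, assuming a made-up stand-in for gs and a hypothetical output file name in a temporary directory.

```r
# Made-up stand-in for gs (illustrative values only)
gs <- data.frame(
  year     = c(2006, 2006, 2007, 1981, 1981, 1982),
  complex  = c("Diamond", "Diamond", "Diamond", "3corners", "3corners", "3corners"),
  tot_male = c(3, 17, 5, 4, 7, 2)
)

mystat <- function(x) c(min = min(x), max = max(x), mean = mean(x),
                        sd = sd(x), n = length(x))
res <- aggregate(tot_male ~ year + complex, data = gs, FUN = mystat)

# res$tot_male is a matrix; turn its columns into ordinary data frame columns
res_flat <- do.call(data.frame, res)
names(res_flat)

# Hypothetical file name; replace with your own path
out_file <- file.path(tempdir(), "tyc_stats.csv")
write.csv(res_flat, out_file, row.names = FALSE)
```

Groups with a single observation get NA for sd, which write.csv() writes as NA by default.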
I have a dataset with a repeatedly measured continuous outcome and some covariates of different classes, like in the example below.
Id y Date Soda Team
1 -0.4521 1999-02-07 Coke Eagles
1 0.2863 1999-04-15 Pepsi Raiders
2 0.7956 1999-07-07 Coke Raiders
2 -0.8248 1999-07-26 NA Raiders
3 0.8830 1999-05-29 Pepsi Eagles
4 0.1303 2005-03-04 NA Cowboys
5 0.1375 2013-11-02 Coke Cowboys
5 0.2851 2015-06-23 Coke Eagles
5 -0.3538 2015-07-29 Pepsi NA
6 0.3349 2002-10-11 NA NA
7 -0.1756 2005-01-11 Pepsi Eagles
7 0.5507 2007-10-16 Pepsi Cowboys
7 0.5132 2012-07-13 NA Cowboys
7 -0.5776 2017-11-25 Coke Cowboys
8 0.5486 2009-02-08 Coke Cowboys
I am trying to multiply impute missing values in Soda and Team using the mice package. As I understand it, because MI is not a causal model, there is no concept of dependent and independent variables. I am not sure how to set up this MI process using mice. I would like some suggestions or advice from others who have encountered missing data in a repeated-measures setting like this, and how they used mice to tackle the problem. Thanks in advance.
Edit
This is what I have tried so far, but this does not capture the repeated measure part of the dataset.
library(mice)
init = mice(dat, maxit=0)
methd = init$method
predM = init$predictorMatrix
methd [c("Soda")]="logreg";
methd [c("Team")]="logreg";
imputed = mice(dat, method=methd , predictorMatrix=predM, m=5)
There are several options to accomplish what you are asking for. I have decided to impute missing values in covariates in the so-called 'wide' format. I will illustrate this with the following worked example, which you can easily apply to your own data.
Let's first make a reprex. Here, I use the longitudinal Mayo Clinic Primary Biliary Cirrhosis Data (pbc2), which comes with the JM package. This data is organized in the so-called 'long' format, meaning that each patient i has multiple rows and each row contains a measurement of variable x taken at time j. Your dataset is also in the long format. In this example, I assume that pbc2$serBilir is our outcome variable.
# install.packages('JM')
library(JM)
# note: use function(x) instead of \(x) if you use a version of R <4.1.0
# missing values per column
miss_abs <- \(x) sum(is.na(x))
miss_perc <- \(x) round(sum(is.na(x)) / length(x) * 100, 1L)
miss <- cbind('Number' = apply(pbc2, 2, miss_abs), '%' = apply(pbc2, 2, miss_perc))
# --------------------------------
> miss[which(miss[, 'Number'] > 0),]
Number %
ascites 60 3.1
hepatomegaly 61 3.1
spiders 58 3.0
serChol 821 42.2
alkaline 60 3.1
platelets 73 3.8
According to this output, 6 variables in pbc2 contain at least one missing value. Let's pick alkaline from these. We also need patient id and the time variable years.
# subset
pbc_long <- subset(pbc2, select = c('id', 'years', 'alkaline', 'serBilir'))
# sort ascending based on id and, within each id, years
pbc_long <- with(pbc_long, pbc_long[order(id, years), ])
# ------------------------------------------------------
> head(pbc_long, 5)
id years alkaline serBilir
1 1 1.09517 1718 14.5
2 1 1.09517 1612 21.3
3 2 14.15234 7395 1.1
4 2 14.15234 2107 0.8
5 2 14.15234 1711 1.0
Just by quickly eyeballing, we observe that years do not seem to differ within subjects, even though variables were repeatedly measured. For the sake of this example, let's add a little bit of time to all rows of years but the first measurement.
set.seed(1)
# add little bit of time to each row of 'years' but the first row
new_years <- lapply(split(pbc_long, pbc_long$id), \(x) {
add_time <- 1:(length(x$years) - 1L) + rnorm(length(x$years) - 1L, sd = 0.25)
c(x$years[1L], x$years[-1L] + add_time)
})
# replace the original 'years' variable
pbc_long$years <- unlist(new_years)
# integer time variable needed to store repeated measurements as separate columns
pbc_long$measurement_number <- unlist(sapply(split(pbc_long, pbc_long$id), \(x) 1:nrow(x)))
# only keep the first 4 repeated measurements per patient
pbc_long <- subset(pbc_long, measurement_number %in% 1:4)
Since we will perform our multiple imputation in wide format (meaning that each participant i has one row and repeated measurements on x are stored in j different columns, so xj columns in total), we have to convert the data from long to wide. Now that we have prepared our data, we can use reshape to do this for us.
# convert long format into wide format
v_names <- c('years', 'alkaline', 'serBilir')
pbc_wide <- reshape(pbc_long,
idvar = 'id',
timevar = "measurement_number",
v.names = v_names, direction = "wide")
# -----------------------------------------------------------------
> head(pbc_wide, 4)[, 1:9]
id years.1 alkaline.1 serBilir.1 years.2 alkaline.2 serBilir.2 years.3 alkaline.3
1 1 1.095170 1718 14.5 1.938557 1612 21.3 NA NA
3 2 14.152338 7395 1.1 15.198249 2107 0.8 15.943431 1711
12 3 2.770781 516 1.4 3.694434 353 1.1 5.148726 218
16 4 5.270507 6122 1.8 6.115197 1175 1.6 6.716832 1157
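The same long-to-wide step can be sketched on data shaped like the question's. This is only an illustration in base R: the rows are a subset of the example table, Date is left out for brevity, and the column name visit is a hypothetical stand-in for measurement_number.

```r
# A few rows in the shape of the question's data (Date omitted for brevity)
dat <- data.frame(
  Id   = c(1, 1, 2, 2, 3),
  y    = c(-0.4521, 0.2863, 0.7956, -0.8248, 0.8830),
  Soda = c("Coke", "Pepsi", "Coke", NA, "Pepsi"),
  Team = c("Eagles", "Raiders", "Raiders", "Raiders", "Eagles")
)

# Running measurement number within each Id (rows assumed sorted by Id and date)
dat$visit <- ave(dat$Id, dat$Id, FUN = seq_along)

# One row per Id, one set of columns per measurement
dat_wide <- reshape(dat,
                    idvar = "Id",
                    timevar = "visit",
                    v.names = c("y", "Soda", "Team"),
                    direction = "wide")
dat_wide
```

Subjects with fewer measurements than the maximum (Id 3 here) get NA in the later columns; those cells are structurally absent rather than missing covariate values, so they need to be dropped again after converting back to long format.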
Now let's multiply impute the missing values in our covariates.
library(mice)
# Setup-run
ini <- mice(pbc_wide, maxit = 0)
meth <- ini$method
pred <- ini$predictorMatrix
visSeq <- ini$visitSequence
# avoid collinearity issues by letting only variables measured
# at the same point in time predict each other
pred[grep("1", rownames(pred), value = TRUE),
grep("2|3|4", colnames(pred), value = TRUE)] <- 0
pred[grep("2", rownames(pred), value = TRUE),
grep("1|3|4", colnames(pred), value = TRUE)] <- 0
pred[grep("3", rownames(pred), value = TRUE),
grep("1|2|4", colnames(pred), value = TRUE)] <- 0
pred[grep("4", rownames(pred), value = TRUE),
grep("1|2|3", colnames(pred), value = TRUE)] <- 0
# variables that should not be imputed
pred[c("id", grep('^year', names(pbc_wide), value = TRUE)), ] <- 0
# variables that should not serve as predictors
pred[, c("id", grep('^year', names(pbc_wide), value = TRUE))] <- 0
# multiply impute missing values -------------------------------
imp <- mice(pbc_wide, pred = pred, m = 10, maxit = 20, seed = 1)
# Time difference of 2.899244 secs
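The four near-identical grep() blocks can be written more compactly, and unanchored patterns such as grep("1", ...) would also match a suffix like ".10" once there are ten or more measurements. The sketch below builds the same kind of predictor matrix by hand on a toy set of variable names, without calling mice; the names vars, wave and skip are made up for illustration.

```r
# Toy variable names in the wide layout (two measurements only)
vars <- c("id", "years.1", "alkaline.1", "serBilir.1",
          "years.2", "alkaline.2", "serBilir.2")

# Start from "everything predicts everything", as in a default predictor matrix
pred <- matrix(1, nrow = length(vars), ncol = length(vars),
               dimnames = list(vars, vars))
diag(pred) <- 0  # a variable never predicts itself

# Measurement number taken from the suffix after the last dot ("id" has none)
wave <- sub(".*\\.", "", vars)

# Zero every cell whose row and column belong to different measurements
pred[outer(wave, wave, "!=")] <- 0

# id and the time variables are neither imputed nor used as predictors
skip <- c("id", grep("^years", vars, value = TRUE))
pred[skip, ] <- 0
pred[, skip] <- 0
pred
```

Rows are the variables being imputed and columns are their predictors, so a 1 in pred["alkaline.1", "serBilir.1"] means serBilir.1 is used when imputing alkaline.1.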
As can be seen in the traceplots (which can be obtained with plot(imp)), the algorithm has converged nicely. Refer to the section on convergence in Stef van Buuren's book for more info.
Now we need to convert back the multiply imputed data (which is in wide format) to long format, so that we can use it for analyses. We also need to make sure that we exclude all rows that had missing values for our outcome variable serBilir, because we do not want to use imputed values of the outcome.
# need unlisted data
implong <- complete(imp, 'long', include = FALSE)
# 'smart' way of getting all the names of the repeated variables in a usable format
v_names <- as.data.frame(matrix(apply(
expand.grid(grep('ye|alk|ser', names(implong), value = TRUE)),
1, paste0, collapse = ''), nrow = 4, byrow = TRUE), stringsAsFactors = FALSE)
names(v_names) <- names(pbc_long)[2:4]
# convert back to long format
longlist <- lapply(split(implong, implong$.imp),
reshape, direction = 'long',
varying = as.list(v_names),
v.names = names(v_names),
idvar = 'id', times = 1:4)
# logical that is TRUE if our outcome was not observed
# which should be based on the original, unimputed data
orig_data <- reshape(imp$data, direction = 'long',
varying = as.list(v_names),
v.names = names(v_names),
idvar = 'id', times = 1:4)
orig_data$logical <- is.na(orig_data$serBilir)
# merge into the list of imputed long-format datasets:
longlist <- lapply(longlist, merge, y = subset(orig_data, select = c(id, time, logical)))
# exclude rows for which logical == TRUE
longlist <- lapply(longlist, \(x) subset(x, !logical))
Finally, convert longlist back into a mids using datalist2mids from the miceadds package.
imp <- miceadds::datalist2mids(longlist)
# ----------------
> imp$loggedEvents
NULL
I have 24 data frames for 24 years of commercial flows, and all of them look like this.
t i j k v q
1 1997 4 12 90930 210.981000 160.433
2 1997 4 12 121190 2.614000 4.187
3 1997 4 20 570110 26.957000 2.500
4 1997 4 31 190211 3.747064 21.613
5 1997 4 31 252329 22.541865 626.000
I need to change the names of the variables t, i, j, k, v, q to Year, Product, Exporter, Importer, Flow, Quantity, respectively.
I'm using the function below, but it doesn't save the changes. When I run the code, it prints the data frame with the new column names, but the original data frame is not modified.
What I'm looking for is to minimize the number of lines written to achieve this, so for example doing "BACI1995 <- ChangeNames(BACI1995) BACI1996 <- ChangeNames(BACI1996)" for every year is not an option.
ChangeNames <- function(x) {
names(x) <- c("Year", "Product", "Exporter", "Importer", "flow", "Quantity")
return(x)
}
ChangeNames(BACI1995)
So if you really want to minimise the number of keystrokes, then consider this one-liner:
for (i in 1995:2018) names(.GlobalEnv[[paste0("BACI", i)]]) <- c("Year", "Product", "Exporter", "Importer", "flow", "Quantity")
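If you would rather not assign into the global environment by name, a common alternative is to collect the data frames in a list and rename them in one pass. A minimal sketch in base R, assuming two made-up one-row data frames in place of the real BACI tables:

```r
# Made-up stand-ins for two of the yearly data frames
BACI1995 <- data.frame(t = 1995, i = 4, j = 12, k = 90930, v = 210.981, q = 160.433)
BACI1996 <- data.frame(t = 1996, i = 4, j = 20, k = 570110, v = 26.957, q = 2.5)

new_names <- c("Year", "Product", "Exporter", "Importer", "flow", "Quantity")
df_names  <- paste0("BACI", 1995:1996)   # use 1995:2018 for all 24 years

# Collect the data frames in a named list and rename the columns of each one
baci <- lapply(mget(df_names), setNames, new_names)

# Optional: copy the renamed data frames back over the original objects
invisible(list2env(baci, envir = environment()))

names(BACI1995)
```

Keeping the tables in the list baci also makes later steps easier, for example stacking all years with do.call(rbind, baci).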
I have a df like this:
> df<-data.frame(Client.code =
c(100451,100451,100523,100523,100523,100525),dayref = c(24,30,15,13,17,5))
> df
Client.code dayref
1 100451 24
2 100451 30
3 100523 15
4 100523 13
5 100523 17
6 100525 5
It is a one-year distribution of payments period from issue.
Using the data above and given a df2 like this:
Client.Code Days
1 100451 16
1 100523 16
1 100460 35
As I have enough data for reasonable quantile probability calculations, I would like to know how to build a loop that assigns a quantile to every row of Days in df2, according to the first df.
We can use data.table
library(data.table)
setDT(df)[, .(Quantile = quantile(dayref)), Client.code]
Or with tidyverse
library(dplyr)
library(tidyr)
df %>%
group_by(Client.code) %>%
summarise(Quantile = list(quantile(dayref))) %>%
unnest(cols = Quantile)
tapply(df$dayref, df$Client.code, quantile)
You can specify specific percentiles by adding a vector of them
tapply(df$dayref, df$Client.code, quantile, 1:19/20)
You may need to formulate like this
tapply(df$dayref, df$Client.code, quantile, probs = 1:19/20)
And you can add na.rm = TRUE as another argument if you might have NAs
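The snippets above return each client's quantiles, but not where a given Days value falls within them. If "assign a quantile" is read as "the empirical cumulative probability of Days within that client's dayref distribution" (an assumption on my part), a base R sketch with ecdf() could look like this; the column name Prob is made up.

```r
df <- data.frame(Client.code = c(100451, 100451, 100523, 100523, 100523, 100525),
                 dayref = c(24, 30, 15, 13, 17, 5))
df2 <- data.frame(Client.Code = c(100451, 100523, 100460),
                  Days = c(16, 16, 35))

# For each row of df2: share of that client's dayref values that are <= Days
df2$Prob <- mapply(function(code, days) {
  ref <- df$dayref[df$Client.code == code]
  if (length(ref) == 0) return(NA_real_)   # client has no history in df
  ecdf(ref)(days)
}, df2$Client.Code, df2$Days)

df2
```

Clients that do not appear in df (100460 here) get NA instead of an error.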
This is the first time that I ask a question on stack overflow. I have tried searching for the answer but I cannot find exactly what I am looking for. I hope someone can help.
I have a huge data set of 20416 observations. Basically, I have 83 subjects and for each subject I have several observations. However, the number of observations per subject is not the same (e.g. subject 1 has 256 observations, while subject 2 has only 64 observations).
I want to add an extra column containing the mean of the observations for each subject (the observations are reading times (RT)).
I tried with the aggregate function:
aggregate (RT ~ su, data, mean)
This formula returns the correct mean per subject. But then I cannot simply do the following:
data$mean <- aggregate (RT ~ su, data, mean)
as R returns this error:
Error in `$<-.data.frame`(`*tmp*`, "mean", value = list(su = 1:83, RT = c(378.1328125,  :
  replacement has 83 rows, data has 20416
I understand that the formula lacks a command specifying that the mean for each subject has to be repeated for all the subject's rows (e.g. if subject 1 has 256 rows, the mean for subject 1 has to be repeated for 256 rows, if subject 2 has 64 rows, the mean for subject 2 has to be repeated for 64 rows and so forth).
How can I achieve this in R?
The data.table syntax lends itself well to this kind of problem:
Dt[, Mean := mean(Value), by = "ID"][]
# ID Value Mean
# 1: a 0.05881156 0.004426491
# 2: a -0.04995858 0.004426491
# 3: b 0.64054432 0.038809830
# 4: b -0.56292466 0.038809830
# 5: c 0.44254622 0.099747707
# 6: c -0.10771992 0.099747707
# 7: c -0.03558318 0.099747707
# 8: d 0.56727423 0.532377247
# 9: d -0.60962095 0.532377247
# 10: d 1.13808538 0.532377247
# 11: d 1.03377033 0.532377247
# 12: e 1.38789640 0.568760936
# 13: e -0.57420308 0.568760936
# 14: e 0.89258949 0.568760936
As we are applying a grouped operation (by = "ID"), data.table will automatically replicate each group's mean(Value) the appropriate number of times (avoiding the error you ran into above).
Data:
Dt <- data.table::data.table(
ID = sample(letters[1:5], size = 14, replace = TRUE),
Value = rnorm(14))[order(ID)]
Staying in Base R, ave is intended for this use:
data$mean = with(data, ave(x = RT, su, FUN = mean))
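A tiny self-contained illustration of what ave() does, using made-up reading times: it returns a vector as long as its input, with each group's result repeated on that group's rows, so it can be assigned straight to a new column.

```r
# Made-up data: subject 1 has three rows, subject 2 has two
data <- data.frame(su = c(1, 1, 1, 2, 2),
                   RT = c(300, 400, 500, 200, 400))

# One mean per subject, repeated for every row of that subject
data$mean <- ave(data$RT, data$su, FUN = mean)
data
```

The aggregate() call in the question, by contrast, returns one row per subject, which is why it cannot be assigned to a column of the full data frame.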
Simply merge your aggregated means data with full dataframe joined by the subject:
aggdf <- aggregate (RT ~ su, data, mean)
names(aggdf)[2] <- "MeanOfRT"
data <- merge(data, aggdf, by="su")
Another compelling way of handling this without generating extra data objects is by using group_by of dplyr package:
library(dplyr)
# Generating some data
data <- data.table::data.table(
su = sample(letters[1:5], size = 14, replace = TRUE),
RT = rnorm(14))[order(su)]
# Performing
> data %>% group_by(su) %>%
+ mutate(Mean = mean(RT)) %>%
+ ungroup()
Source: local data table [14 x 3]
su RT Mean
1 a -1.62841746 0.2096967
2 a 0.07286149 0.2096967
3 a 0.02429030 0.2096967
4 a 0.98882343 0.2096967
5 a 0.95407214 0.2096967
6 a 1.18823435 0.2096967
7 a -0.13198711 0.2096967
8 b -0.34897914 0.1469982
9 b 0.64297557 0.1469982
10 c -0.58995261 -0.5899526
11 d -0.95995198 0.3067978
12 d 1.57354754 0.3067978
13 e 0.43071258 0.2462978
14 e 0.06188307 0.2462978
I am computing a dplyr::summarize across a dataframe of sales data.
I do a group-by (S,D,Y), then within each group, compute medians and means for weeks 5..43, then merge those back into the parent df. Variable X is sales. X is never NA (i.e. there are no explicit NAs anywhere in df), but if there is no data (as in, no sales) for that S,D,Y and set of weeks, there will simply be no row with those values in df (take it that means zero sales for that particular set of parameters). In other words, impute X=0 in any structurally missing rows (but I hope I don't need to melt/cast the original df, to avoid bloat. Similar to cast(fill....,add.missing=T) or caret::preProcess()).
Two questions about my code idiom:
Is it better to use summarize than dplyr::filter, because filter physically drops rows so I have to assign the results to df.tmp then left-join it back to the original df (as below)? Also, big subsetting expressions repeated on every single line of summarize computations make the code harder to read.
Should I worry (or not) about caching the rows or logical indices of the subsetting operation, in the general case where I might be computing say n=20 new summary variables?
Not all combinations of S,D,Y-groups and filter (for those weeks) have rows, so how to get the summarize to replace NA on any missing rows? Currently I do as below.
Sorry both the code and dataset are proprietary, but here's the code idiom, and below is code you should run first to generate sample-data:
# Compute median, mean of X across wks 5..43, for that set of S,D,Y-values
# Issue a) filter() or repeatedly use subset() within each calculation?
df.tmp <- df %.% group_by(S,D,Y) %.% filter(Week>=5 & Week<=43) %.%
summarize(ysd_med543_X = median(X),
ysd_mean543_X = mean(X)
) %.% ungroup()
# Issue b) how to replace NAs in groups where the group_by-and-filter gave empty output?
# can you merge this code with the summarize above?
df <- left_join(df, df.tmp, copy=F)
newcols <- match(c('ysd_mean543_X','ysd_med543_X'), names(df))
df[!complete.cases(df[,newcols]), newcols] <- c(0.0,0.0)
and run this first to generate sample-data:
set.seed(1234)
rep_vector <- function(vv, n) {
unlist(as.vector(lapply(vv, function(...) {rep(...,n)} )))
}
n=7
m=3
df = data.frame(S = rep_vector(10:12, n), D = 20:26,
Y = rep_vector(2005:2007, n),
Week = round(52*runif(m*n)),
X = 4e4*runif(m*n) + 1e4 )
# Now drop some rows, to model structurally missing rows
I <- sort(sample(1:nrow(df),0.6*nrow(df)))
df = df[I,]
require(dplyr)
I don't think this has anything to do with the feature you've linked under comments (because IIUC that feature has to do with unused factor levels). Once you filter your data, IMO summarise should not (or rather can't?) be including them in the results (with the exception of factors). You should clarify this with the developers on their project page.
I'm by no means a dplyr expert, but I think, firstly, it'd be better to filter first followed by group_by + summarise. Else, you'll be filtering for each group, which is unnecessary. That is:
df.tmp <- df %.% filter(Week>=5 & Week<=43) %.% group_by(S,D,Y) %.% ...
This is just so that you're aware of it for any future cases.
IMO, it's better to use mutate here instead of summarise, as it'll remove the need for left_join, IIUC. That is:
df.tmp <- df %.% group_by(S,D,Y) %.% mutate(
md_X = median(X[Week >=5 & Week <= 43]),
mn_X = mean(X[Week >=5 & Week <= 43]))
Here, still we've the issue of replacing the NA/NaN. There's no easy/direct way to sub-assign here. So, you'll have to use ifelse, once again IIUC. But that'd be a little nicer if mutate supports expressions.
What I've in mind is something like:
df.tmp <- df %.% group_by(S,D,Y) %.% mutate(
{ tmp = Week >= 5 & Week <= 43;
md_X = ifelse(length(tmp), median(X[tmp]), 0),
mn_X = ifelse(length(tmp), mean(X[tmp]), 0)
})
So, we'll have to workaround in this manner probably:
df.tmp = df %.% group_by(S,D,Y) %.% mutate(tmp = Week >=5 & Week <= 43)
df.tmp %.% mutate(md_X = ifelse(tmp[1L], median(X), 0),
mn_X = ifelse(tmp[1L], mean(X), 0))
Or to put things together:
df %.% group_by(S,D,Y) %.% mutate(tmp = Week >=5 & Week <= 43,
md_X = ifelse(tmp[1L], median(X), 0),
mn_X = ifelse(tmp[1L], mean(X), 0))
# S D Y Week X tmp md_X mn_X
# 1 10 20 2005 6 22107.73 TRUE 22107.73 22107.73
# 2 10 23 2005 32 18751.98 TRUE 18751.98 18751.98
# 3 10 25 2005 33 31027.90 TRUE 31027.90 31027.90
# 4 10 26 2005 0 46586.33 FALSE 0.00 0.00
# 5 11 20 2006 12 43253.80 TRUE 43253.80 43253.80
# 6 11 22 2006 27 28243.66 TRUE 28243.66 28243.66
# 7 11 23 2006 36 20607.47 TRUE 20607.47 20607.47
# 8 11 24 2006 28 22186.89 TRUE 22186.89 22186.89
# 9 11 25 2006 15 30292.27 TRUE 30292.27 30292.27
# 10 12 20 2007 15 40386.83 TRUE 40386.83 40386.83
# 11 12 21 2007 44 18049.92 FALSE 0.00 0.00
# 12 12 26 2007 16 35856.24 TRUE 35856.24 35856.24
which doesn't require df.tmp.
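For comparison, here is a rough base R sketch of the same idea that tests the week range row by row within each group, so it also covers groups with several rows. The data are made up, the helper stat_or_zero() is hypothetical, and the column names simply follow the question.

```r
# Made-up data: one group with three in-range rows, one with none, one mixed
df <- data.frame(
  S    = c(10, 10, 10, 10, 11, 11),
  D    = c(20, 20, 20, 26, 20, 20),
  Y    = c(2005, 2005, 2005, 2005, 2006, 2006),
  Week = c(6, 32, 20, 0, 12, 44),
  X    = c(100, 300, 800, 500, 200, 400)
)

in_range <- df$Week >= 5 & df$Week <= 43
grp <- interaction(df$S, df$D, df$Y, drop = TRUE)

# Hypothetical helper: apply fun to the in-range X of each group,
# or return 0 when the group has no rows in weeks 5..43
stat_or_zero <- function(fun) {
  ave(as.numeric(seq_len(nrow(df))), grp, FUN = function(i) {
    x <- df$X[i][in_range[i]]
    if (length(x) == 0) 0 else fun(x)
  })
}

df$ysd_med543_X  <- stat_or_zero(median)
df$ysd_mean543_X <- stat_or_zero(mean)
df
```

The logical index in_range is computed once and reused for every summary, which addresses the caching concern raised in the question.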
HTH