I am currently trying to display the count of factor levels (e.g., gender) and their relative frequency per group (e.g., treatment group) using datasummary. In addition, I would like to combine this with the display of quantitative variables (e.g., age) with their respective mean and standard deviation.
So far, I have written a function that displays the mean and SD in one column, and I have managed to calculate N and percentages. However, I am struggling with two things: writing a function that displays N and the percentage in one column, and adding an empty column to the datasummary of the quantitative variable so that both data frames can be combined (based on "Show count of unique values in datasummary" and "combine two different tables of descriptive statistics using data").
library(modelsummary)
library(magrittr)
library(dplyr)
set.seed(123)
iris$gender <- factor(sample(1:3, size = 150, replace = T),
labels = c("Male", "Female", "Other"))
iris$job <- factor(sample(1:5, size = 150, replace = T),
labels = c("Student", "Worker", "CEO", "Other", "None"))
empty <- function(...) ""
MeanSD = function(x) {
M = mean(x, na.rm = T)
SD = sd(x, na.rm = T)
MSD = paste(round(M, 2), " (",round(SD,2), ")", sep = "")
return(MSD)
}
#This function does not work properly
NP = function(x, y) {
N = N(x)
P = Percent(x, y, denom = "col")
out = paste(N, " (",P, ")", sep = "")
return(NP)
}
iris_tab1 <- iris %>% dplyr::select(Species,
Gender = gender,
Job = job,
Length = Sepal.Length)
tbl_1 <- datasummary((Heading("")*N + Heading("")*Percent(fn = function(x, y) 100 * length(x) / length(y), denom = "col"))*(Gender + Job)~Species,
data = iris_tab1,
fmt = 2,
output = 'data.frame'
)
tbl_1
#Cannot add the empty column
tbl_2 <- datasummary(Heading("")*(MeanSD)*Length~empty+Species,
data = iris_tab1,
output = 'data.frame'
)
tbl_2
empty is a function, and so is MeanSD. In a datasummary formula, all functions need to go on the same side:
library(modelsummary)
library(magrittr)
library(dplyr)
set.seed(123)
iris$gender <- factor(sample(1:3, size = 150, replace = T),
labels = c("Male", "Female", "Other"))
iris$job <- factor(sample(1:5, size = 150, replace = T),
labels = c("Student", "Worker", "CEO", "Other", "None"))
empty <- function(...) ""
MeanSD = function(x) {
M = mean(x, na.rm = T)
SD = sd(x, na.rm = T)
MSD = paste(round(M, 2), " (", round(SD, 2), ")", sep = "")
return(MSD)
}
iris_tab1 <- iris %>%
dplyr::select(Species,
Gender = gender,
Job = job,
Length = Sepal.Length)
tbl_2 <- datasummary(Heading("") * Length ~ empty + MeanSD * Species,
data = iris_tab1,
output = "data.frame")
tbl_2
#> empty setosa versicolor virginica
#> 1 5.01 (0.35) 5.94 (0.52) 6.59 (0.64)
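As a quick cross-check of the row above, the same per-species strings can be reproduced in base R. This is only a verification sketch: it repeats the MeanSD helper so that it runs on its own, and it uses tapply rather than datasummary.

```r
# MeanSD as defined above, repeated so the snippet is self-contained
MeanSD <- function(x) {
  M <- mean(x, na.rm = TRUE)
  SD <- sd(x, na.rm = TRUE)
  paste0(round(M, 2), " (", round(SD, 2), ")")
}

# apply MeanSD to Sepal.Length within each species
check <- tapply(iris$Sepal.Length, iris$Species, MeanSD)
check
```

The three values should agree with the setosa, versicolor and virginica columns of tbl_2.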
A simple illustration of the Percent function, using a custom fn that returns the count and the percentage as one string:
library(modelsummary)
dat <- mtcars
dat$cyl <- as.factor(dat$cyl)
fn <- function(x, y) {
out <- sprintf(
"%s (%.1f%%)",
length(x),
length(x) / length(y) * 100)
}
datasummary(
cyl ~ Percent(fn = fn),
data = dat)
cyl   Percent
4     11 (34.4%)
6     7 (21.9%)
8     14 (43.8%)
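To see what fn computes for each level, the same "N (%)" strings can be built by hand in base R. This is a sketch for checking the numbers only, assuming the default denominator (all rows); it does not use modelsummary.

```r
dat <- mtcars

counts <- table(dat$cyl)          # N per level of cyl
pct <- 100 * prop.table(counts)   # share of all rows, in percent

# same "N (P%)" layout as fn above
n_pct <- sprintf("%s (%.1f%%)", as.vector(counts), as.vector(pct))
names(n_pct) <- names(counts)
n_pct
```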
I'm trying to pass vectors, each with a different number of NA values, to a function called from map(), but it returns an error.
I have a tibble with N numeric columns and 1 categorical column. For each numeric column, I want to compare its distributions across the values of the categorical column. I use overlapping::overlap() to calculate the overlap of the distributions, and I feed the numeric column names into map_dfr() for the iteration. For example:
require(overlapping)
require(dplyr)
require(purrr)
set.seed( 1 )
n <- 100
G1 <- sample( 0:30, size = n, replace = TRUE )
G2 <- sample( 0:30, size = n, replace = TRUE, prob = dbinom( 0:30, 31, .55 ))
G3 <- sample( 0:30, size = n, replace = TRUE, prob = dbinom( 0:30, 41, .65 ))
Data <- data.frame(y = G1, x = G2, z = G3, group = rep(c("G1","G2", "G3"), each = n), class = rep(c("C1","C2", "C3"), each = 1)) %>% as_tibble()
Data
overlap_fcn <- function(.x) {
## construct list of vectors
dist_list <- list(
"C1" = Data %>%
filter(class == 'C1', !is.na(.x)) %>%
pull(.x),
"C2" = Data %>%
filter(class == 'C2', !is.na(.x)) %>%
pull(.x),
"C3" = Data %>%
filter(class == 'C3', !is.na(.x)) %>%
pull(.x)
)
## calculate distribution overlaps
return(
enframe(
overlapping::overlap(dist_list)$OV*100
) %>%
mutate(value = paste0(round(value, 2), "%"),
class = .x) %>%
rename(comparison = name, overlap = value) %>%
relocate(class)
)
}
overlap_table <- purrr::map_dfr(
.x = c('y', 'x', "z"),
.f = ~overlap_fcn(.x))
overlap_table
The above works as intended. However, in practice I have different amounts of missingness in each of x, y, and z. I try to account for this with the filter on !is.na(.x), but it's not working. For example:
Data$x[1:3] <- NA
Data$y[10:20] <- NA
Data$z[100:150] <- NA
overlap_table <- purrr::map_dfr(
.x = c('x', 'y', "z"),
.f = ~overlap_fcn(.x))
returns this error:
Error in density.default(x[[j]], n = nbins, ...): 'x' contains missing values
Traceback:
1. purrr::map_dfr(.x = c("x", "y", "z"), .f = ~overlap_fcn(.x))
2. map(.x, .f, ...)
3. .f(.x[[i]], ...)
4. overlap_fcn(.x)
5. enframe(overlapping::overlap(dist_list)$OV * 100) %>% mutate(value = paste0(round(value,
. 2), "%"), class = .x) %>% rename(comparison = name, overlap = value) %>%
. relocate(class) # at line 25-33 of file <text>
6. relocate(., class)
7. rename(., comparison = name, overlap = value)
8. mutate(., value = paste0(round(value, 2), "%"), class = .x)
9. enframe(overlapping::overlap(dist_list)$OV * 100)
10. overlapping::overlap(dist_list)
11. density(x[[j]], n = nbins, ...)
12. density.default(x[[j]], n = nbins, ...)
13. stop("'x' contains missing values")
Can anyone help me out here, please? I'm sure it's something super obvious I'm missing; I just can't see what!
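Here, .x is a character string (for example "x"), so is.na(.x) tests the string itself rather than the column. The string is never NA, so no rows are dropped. We need to convert the string to a symbol with rlang::sym() and unquote it with !! so that filter() evaluates it as a column: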
overlap_fcn <- function(.x) {
## construct list of vectors
dist_list <- list(
"C1" = Data %>%
filter(class == 'C1', !is.na(!! rlang::sym(.x))) %>%
pull(.x),
"C2" = Data %>%
filter(class == 'C2', !is.na(!! rlang::sym(.x))) %>%
pull(.x),
"C3" = Data %>%
filter(class == 'C3', !is.na(!! rlang::sym(.x))) %>%
pull(.x)
)
## calculate distribution overlaps
return(
enframe(
overlapping::overlap(dist_list)$OV*100
) %>%
mutate(value = paste0(round(value, 2), "%"),
class = .x) %>%
rename(comparison = name, overlap = value) %>%
relocate(class)
)
}
Testing after creating the NAs in Data:
> purrr::map_dfr(
+ .x = c('x', 'y', "z"),
+ .f = ~overlap_fcn(.x))
# A tibble: 9 × 3
class comparison overlap
<chr> <chr> <chr>
1 x C1-C2 98.61%
2 x C1-C3 97.46%
3 x C2-C3 97.5%
4 y C1-C2 95.47%
5 y C1-C3 96.22%
6 y C2-C3 97.14%
7 z C1-C2 90.17%
8 z C1-C3 94.9%
9 z C2-C3 89.24%
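The difference between testing the string and testing the column can be seen on a tiny example. The sketch below uses a hypothetical toy tibble and the .data pronoun from dplyr, which is an alternative to !! rlang::sym() for looking up a column by name.

```r
library(dplyr)

# toy data (hypothetical): two classes, one numeric column with NAs
toy <- tibble(class = c("C1", "C1", "C2", "C2"),
              x = c(1, NA, 3, NA))
col_name <- "x"

# is.na("x") is FALSE for every row, so nothing is filtered out
kept_string <- toy %>% filter(!is.na(col_name))

# .data[[col_name]] looks the column up by name inside the data
kept_column <- toy %>% filter(!is.na(.data[[col_name]]))

nrow(kept_string)
nrow(kept_column)
```

Only the second filter removes the rows where x is missing.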
I have a list of 2D matrices. Each matrix is filled using the function fillMatrices. This function adds a number of individuals to each day 0 in a matrix and updates the columns a_M, b_M and c_M. The numbers of individuals come from an initial matrix ind. The code works but it is slow when the number of matrices within the list is large. For example with n = 10000:
user system elapsed
3.73 0.83 4.55
If possible, I would like to reduce the elapsed time to <= 1 sec and increase n to 720000 matrices. So, I am looking for a way to optimize section 3 only. Here is the code:
###############################################
###############################################
## Section 3
## Run the function "fillMatrices"
indexTime <- 1
dt_t_1 <- do.call(rbind, lapply(list_matrices, function(x) x[1,]))
dt_t <- fillMatrices(dt_t_1 = dt_t_1, species = c("a_M", "b_M", "c_M"), maxDuration = 5, matrixColumns = col_mat)
## Fill the matrices within the list
system.time(for(i in 1:n){
list_matrices[[i]][indexTime + 1,] <- dt_t[,i]
})
## test <- list_matrices[[1]]
The code in section 1 initializes the matrices, and the function fillMatrices is defined in section 2. In my example, the function is used to fill the matrices for one species. In reality, it is used for 3 species (i.e., it is applied three times) by changing the argument species = c("a_M", "b_M", "c_M"). How can I speed up my code? Any advice would be much appreciated.
Here is the code for sections 1 and 2:
rm(list=ls(all=TRUE))
library(ff)
library(dplyr)
set.seed(12345)
## Define the number of individuals
n <- 10000
###############################################
###############################################
## Section 1
## Build the list of 2D matrices
v_date <- as.vector(outer(c(paste(seq(0, 1, by = 1), "day", sep="_"), paste(seq(2, 5, by = 1), "days", sep="_")), c("a_M", "b_M", "c_M"), paste, sep="|"))
col_mat <- c("year", "day", "time", "ID", "died", v_date)
list_matrices <- list()
for(i in 1:n){
print(i)
list_matrices[[i]] <- ff(-999, dim=c(3650, length(col_mat)), dimnames=list(NULL, col_mat), vmode="double", overwrite = TRUE)
}
## test <- list_matrices[[1]]
## dim(list_matrices[[1]])
## Fill the first row of each matrix
for(i in 1:n){
print(i)
list_matrices[[i]][1,] <- c(1, 1, 1, i-1, 0, rep(0, length(v_date)))
}
## test <- list_matrices[[2]]
## Build the matrix "individual"
ind <- as.matrix(data.frame(year = rep(1, n), day = rep(1, n), time = rep(1, n), died = rep(0, n), ID = (seq(1, n, 1))- 1, a_M = sample(1:10, n, replace = T), b_M = sample(1:10, n, replace = T), c_M = sample(1:10, n, replace = T)))
## print(ind)
###############################################
###############################################
## Section 2
## Function to convert a data frame into a matrix
convertDFToMat <- function(x){
mat <- as.matrix(x[,-1])
ifelse(is(x[,1], "data.frame"), rownames(mat) <- pull(x[,1]), rownames(mat) <- x[,1])
## Convert character matrix into numeric matrix
mat <- apply(mat, 2, as.numeric)
return(mat)
}
## Define the function that is used to fill the matrices within the list
fillMatrices <- function(dt_t_1, species, maxDuration, matrixColumns){
## Format data
dt <- as.data.frame(dt_t_1) %>%
reshape::melt(id = c("ID")) %>%
arrange(ID) %>%
dplyr::mutate_all(as.character)
## summary(dt)
## Break out the variable "variable" into different columns, with one row for each individual-day
dt_reshape_filter_1 <- dt %>%
dplyr::filter(!variable %in% c("year", "day", "time", "ID", "died")) %>%
dplyr::mutate(day = variable %>% gsub(pattern = "\\_.*", replacement = "", x = .), col = variable %>% gsub(pattern = ".*\\|", replacement = "", x = .)) %>%
dplyr::select(-variable) %>%
tidyr::spread(col, value) %>%
dplyr::mutate_all(as.numeric) %>%
dplyr::arrange(ID, day)
## summary(dt_reshape_filter_1)
## Apply requested transformations and build the data frame
dt_transform <- dt_reshape_filter_1 %>%
dplyr::rename_at(vars(species), ~ c("a", "b", "c")) %>%
dplyr::mutate(day = day + 1) %>%
dplyr::filter(day < maxDuration + 1) %>%
dplyr::bind_rows(tibble(ID = ind[,c("ID")], day = 0, a = ind[,c("a_M")], b = ind[,c("b_M")])) %>%
dplyr::mutate(c = a + b) %>%
dplyr::rename_at(vars("a", "b", "c"), ~ species) %>%
dplyr::arrange(ID, day)
## summary(dt_transform)
## Take different columns of the data frame and gather them into a single column
dt_gather <- dt_transform %>%
tidyr::gather(variable, value, species) %>%
dplyr::mutate(day = if_else(day > 1, paste0(day, "_days"), paste0(day, "_day"))) %>%
tidyr::unite(variable, c("day", "variable"), sep = "|") %>%
dplyr::rename(var2 = ID) %>%
dplyr::mutate_all(as.character)
## summary(dt_gather)
## Add the other columns in the data frame and convert the resulting data frame into a matrix
dt_reshape_filter_2 <- dt %>%
dplyr::rename(var2 = ID) %>%
dplyr::filter(variable %in% c("year", "day", "time", "ID", "died")) %>%
tidyr::spread(variable, value) %>%
dplyr::arrange(as.numeric(var2)) %>%
dplyr::mutate(year = ind[,c("year")],
day = ind[,c("day")],
time = ind[,c("time")],
ID = ind[,c("ID")],
died = ind[,c("died")]) %>%
tidyr::gather(variable, value, c(year, day, time, ID, died)) %>%
dplyr::arrange(as.numeric(var2)) %>%
dplyr::mutate_all(as.character)
## summary(dt_reshape_filter_2)
## Build the output matrix
dt_bind <- bind_rows(dt_reshape_filter_2, dt_gather) %>%
tidyr::spread(var2, value) %>%
dplyr::arrange(match(variable, matrixColumns)) %>%
dplyr::select("variable", as.character(ind[,c("ID")]))
## summary(dt_bind)
dt_mat <- convertDFToMat(dt_bind)
## summary(dt_mat)
return(dt_mat)
}
Making a single 3D array instead of a list of 2D matrices gives you more options:
library(ff)
library(dplyr)
set.seed(12345)
## Define the number of individuals
n <- 10000L
n_row <- 3650L
#array way:
v_date <- as.vector(outer(c(paste(seq(0, 1, by = 1), "day", sep="_"), paste(seq(2, 5, by = 1), "days", sep="_")), c("a_M", "b_M", "c_M"), paste, sep="|"))
col_mat <- c("year", "day", "time", "ID", "died", v_date)
arr1 <- ff(-999L, dim = c(n_row, length(col_mat), n), dimnames = list(NULL, col_mat, NULL))
## Fill the first row of each matrix slice
arr1[1, , ] <- c(1L, 1L, 1L, NA, 0L, rep(0L, length(v_date)))
arr1[1, 4, ] <- seq_len(n)-1L
## Build the matrix "individual"
ind <- as.matrix(data.frame(year = rep(1L, n), day = rep(1L, n), time = rep(1L, n), died = rep(0L, n), ID = (seq(1L, n, 1L))- 1L, a_M = sample(1L:10L, n, replace = T), b_M = sample(1L:10L, n, replace = T), c_M = sample(1L:10L, n, replace = T)))
##fill the matrix
indexTime <- 1L
dt_t <- fillMatrices(dt_t_1 = t(arr1[1, ,]), species = c("a_M", "b_M", "c_M"), maxDuration = 5, matrixColumns = col_mat)
## reassign
system.time(
arr1[indexTime + 1, ,] <- dt_t
)
user system elapsed
0.05 0.70 0.7
# for comparison
#> system.time(for(i in 1:n){
#+ list_matrices[[i]][indexTime + 1,] <- dt_t[,i]
#+ })
# user system elapsed
# 4.75 1.08 5.90
As far as I can tell, it's giving me the same results as your original approach but does so a lot faster.
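The equivalence between the loop over a list and a single slice assignment can be checked on a small in-memory example. This is a sketch with plain base R arrays and made-up dimensions, not ff objects, so it illustrates the indexing only and says nothing about timings.

```r
n <- 4L                 # number of individuals (hypothetical, tiny)
n_row <- 3L             # rows per matrix
cols <- c("a", "b")     # column names (hypothetical)

# new values: one column per individual, one row per matrix column
new_vals <- matrix(seq_len(length(cols) * n), nrow = length(cols))

# list-of-matrices approach: one assignment per individual
lst <- replicate(n, matrix(0, n_row, length(cols)), simplify = FALSE)
for (i in seq_len(n)) lst[[i]][2, ] <- new_vals[, i]

# 3D-array approach: a single assignment fills row 2 of every slice
arr <- array(0, dim = c(n_row, length(cols), n))
arr[2, , ] <- new_vals

# every slice of the array matches the corresponding list element
same <- vapply(seq_len(n), function(i) identical(arr[, , i], lst[[i]]), logical(1))
all(same)
```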
I am working with the iris dataset, and manipulating it as follows to get a species, feature1, feature2, value data frame:
gatherpairs <- function(data, ...,
xkey = '.xkey', xvalue = '.xvalue',
ykey = '.ykey', yvalue = '.yvalue',
na.rm = FALSE, convert = FALSE, factor_key = FALSE) {
vars <- quos(...)
xkey <- enquo(xkey)
xvalue <- enquo(xvalue)
ykey <- enquo(ykey)
yvalue <- enquo(yvalue)
data %>% {
cbind(gather(., key = !!xkey, value = !!xvalue, !!!vars,
na.rm = na.rm, convert = convert, factor_key = factor_key),
select(., !!!vars))
} %>% gather(., key = !!ykey, value = !!yvalue, !!!vars,
na.rm = na.rm, convert = convert, factor_key = factor_key)%>%
filter(!(.xkey == .ykey)) %>%
mutate(var = apply(.[, c(".xkey", ".ykey")], 1, function(x) paste(sort(x), collapse = ""))) %>%
arrange(var)
}
test = iris %>%
gatherpairs(sapply(colnames(iris[, -ncol(iris)]), eval))
This was taken from https://stackoverflow.com/a/47731111/8315659
What this does is give me that data frame with all combinations of feature1 and feature2, but I want to remove duplicates where it is just the reverse being shown. For example, Petal.Length vs Petal.Width is the same as Petal.Width vs Petal.Length. But if there are two rows with identical values for Petal.Length vs Petal.Width, I do not want to drop that row. Therefore, just dropping rows where all values are identical except that .xkey and .ykey are reversed is what I would want to do. Essentially, this is just to recreate the bottom triangle of the ggplot matrix shown in the above linked answer.
How can this be done?
Jack
I think this could be accomplished using the first part of the source code, which performs a single gathering operation. Using the iris example, this will produce 600 rows of output, one for each of the 150 rows x 4 columns in iris.
gatherpairs <- function(data, ...,
xkey = '.xkey', xvalue = '.xvalue',
ykey = '.ykey', yvalue = '.yvalue',
na.rm = FALSE, convert = FALSE, factor_key = FALSE) {
vars <- quos(...)
xkey <- enquo(xkey)
xvalue <- enquo(xvalue)
ykey <- enquo(ykey)
yvalue <- enquo(yvalue)
data %>% {
cbind(gather(., key = !!xkey, value = !!xvalue, !!!vars,
na.rm = na.rm, convert = convert, factor_key = factor_key),
select(., !!!vars))
} # %>% gather(., key = !!ykey, value = !!yvalue, !!!vars,
# na.rm = na.rm, convert = convert, factor_key = factor_key)%>%
# filter(!(.xkey == .ykey)) %>%
# mutate(var = apply(.[, c(".xkey", ".ykey")], 1, function(x) paste(sort(x), collapse = ""))) %>%
# arrange(var)
}
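One way to keep each unordered feature pair only once is to generate the pairs with combn() instead of filtering a full cross product. The sketch below is an alternative to gatherpairs, not a modification of it; the column names simply mirror .xkey, .xvalue, .ykey and .yvalue, and all rows of iris are kept for each pair.

```r
# numeric feature columns of iris
num_cols <- names(iris)[vapply(iris, is.numeric, logical(1))]

# each unordered pair of features appears exactly once
pairs <- combn(num_cols, 2, simplify = FALSE)

# stack one block of rows per pair
lower <- do.call(rbind, lapply(pairs, function(p) {
  data.frame(Species = iris$Species,
             .xkey = p[1], .xvalue = iris[[p[1]]],
             .ykey = p[2], .yvalue = iris[[p[2]]])
}))

nrow(lower)
unique(lower[c(".xkey", ".ykey")])
```

Because the pairs are built once rather than de-duplicated afterwards, rows that happen to share identical values are never dropped.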
How can I keep the column names in numeric order when using the spread function?
library(tidyverse)
data.frame(time = paste0("t_", 1:100)) %>%
rowwise() %>%
mutate(rnd = sample(1:100, size=1)) %>%
spread(time, rnd)
The column names in the result of the code above are sorted as text (t_1, t_10, t_100, ...).
I want the column names in numeric order (t_1, t_2, t_3, ...).
You can try two things:
(1) Make "time" a factor with levels matching the order you want:
data.frame(time = factor(paste0("t_", 1:100), levels = paste0("t_", 1:100))) %>%
rowwise() %>%
mutate(rnd = sample(1:100, size=1)) %>%
spread(time, rnd)
(2) Force the order using a select statement:
data.frame(time = paste0("t_", 1:100)) %>%
rowwise() %>%
mutate(rnd = sample(1:100, size=1)) %>%
spread(time, rnd) %>%
select(paste0("t_", 1:100))
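A third option, assuming a tidyr version that provides pivot_wider(): by default it orders the new columns by first appearance in the data rather than sorting them. The sketch below uses a small hypothetical data frame with 12 time points.

```r
library(tidyr)

# small example: the rows are already in the desired order
d <- data.frame(time = paste0("t_", 1:12), rnd = 1:12)

# columns are created in order of first appearance of each name
wide <- pivot_wider(d, names_from = time, values_from = rnd)
names(wide)
```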
Here is a new function that retains column order. Only one small change is needed (see annotation):
my_spread <- function (data, key, value, fill = NA, convert = FALSE, drop = TRUE,
sep = NULL) {
key_col <- tidyr:::col_name(substitute(key))
value_col <- tidyr:::col_name(substitute(value))
tbl_df(my_spread_(data, key_col, value_col, fill = fill, convert = convert,
drop = drop, sep = sep))
}
my_spread_ <- function (data, key_col, value_col, fill = NA, convert = FALSE,
drop = TRUE, sep = NULL) {
col <- data[key_col]
#col_id <- tidyr:::id(col, drop = drop) # Old line
col_id <- seq_len(nrow(data)) # New line 1
attr(col_id, 'n') <- nrow(data) # New line 2
col_labels <- tidyr:::split_labels(col, col_id, drop = drop)
rows <- data[setdiff(names(data), c(key_col, value_col))]
if (length(rows) == 0) {
row_id <- structure(1L, n = 1L)
row_labels <- as.data.frame(matrix(nrow = 1, ncol = 0))
}
else {
row_id <- tidyr:::id(rows, drop = drop)
row_labels <- tidyr:::split_labels(rows, row_id, drop = drop)
rownames(row_labels) <- NULL
}
overall <- tidyr:::id(list(col_id, row_id), drop = FALSE)
n <- attr(overall, "n")
if (anyDuplicated(overall)) {
groups <- split(seq_along(overall), overall)
groups <- groups[vapply(groups, length, integer(1)) >
1]
str <- vapply(
groups,
function(x) paste0("(", paste0(x, collapse = ", "), ")"), character(1)
)
stop("Duplicate identifiers for rows ", paste(str, collapse = ", "),
call. = FALSE)
}
if (length(overall) < n) {
overall <- match(seq_len(n), overall, nomatch = NA)
}
else {
overall <- order(overall)
}
value <- data[[value_col]]
ordered <- value[overall]
if (!is.na(fill)) {
ordered[is.na(ordered)] <- fill
}
if (convert && !is.character(ordered)) {
ordered <- as.character(ordered)
}
dim(ordered) <- c(attr(row_id, "n"), attr(col_id, "n"))
colnames(ordered) <- enc2utf8( tidyr:::col_names(col_labels, sep = sep))
ordered <- tidyr:::as_data_frame_matrix(ordered)
if (convert) {
ordered[] <- lapply(ordered, type.convert, as.is = TRUE)
}
tidyr:::append_df(row_labels, ordered)
}
I have a data frame with I observations, and each observation belongs to one of g categories.
set.seed(9782)
I <- 500
g <- 10
library(dplyr)
anon_id <- function(n = 1, length = 12) {
randomString <- c(1:n)
for (i in 1:n)
{
randomString[i] <- paste(sample(c(0:9, letters, LETTERS),
length, replace = TRUE),
collapse = "")
}
return(randomString)
}
df <- data.frame(id = anon_id(n = I, length = 16),
group = sample(1:g, I, T))
I want to randomly assign each observation to one of J "urns", given a vector of probabilities p. That is, the probability of being assigned to urn j is p[j]; for example, the probability for urn 1 is p[1]. The added complexity is that I want to do this block by block.
If I ignore the blocks, I can do this easily:
J <- 3
p <- c(0.25, 0.5, 0.25)
df1 <- df %>% mutate(urn = sample(x = c(1:J), size = I, replace = T, prob = p))
I thought about this method to do it by "block"
# Block randomization
randomize_block <- function(g) {
df1 <- df %>% filter(group==g)
size <- nrow(df1)
df1 <- df1 %>% mutate(urn = sample(x = c(1:J),
size = size,
replace = T,
prob = p))
return(df1)
}
df2 <- lapply(1:g, randomize_block)
df2 <- data.table::rbindlist(df2)
Is there a better way?
Not sure if this is better, but here is a base R technique for the data frame df, which has a grouping column "group", urn labels 1:J, and assignment probabilities in the vector p of length J.
# draw urn assignments separately for each group
urnAssignment <- lapply(unique(df$group),
  function(i) sample(1:J, sum(df$group == i), replace = TRUE, prob = p))
# collect the row positions of the observations in each group
obsOrder <- lapply(unique(df$group),
  function(i) which(df$group == i))
# write each group's draws back to that group's rows
df$urnAssignment <- NA_integer_
df$urnAssignment[unlist(obsOrder)] <- unlist(urnAssignment)
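A more compact base R variant uses ave(), which applies a function within each group and puts the results back in the original row order. The sketch below rebuilds a simplified df (with a plain integer id instead of the random strings) so that it runs on its own; I, g, J and p are as in the question.

```r
set.seed(9782)
I <- 500
g <- 10
J <- 3
p <- c(0.25, 0.5, 0.25)

# simplified stand-in for the question's data frame
df <- data.frame(id = seq_len(I),
                 group = sample(1:g, I, replace = TRUE))

# within each group, draw one urn per observation with probabilities p
df$urn <- ave(seq_len(I), df$group,
              FUN = function(idx) sample(seq_len(J), length(idx),
                                         replace = TRUE, prob = p))

# urn counts per block
table(df$group, df$urn)
```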
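randomizr::block_ra does block randomization for you. Pass prob_each = p so that the urns are assigned with the requested probabilities. Note that block_ra uses complete random assignment within each block, so the number of units per urn in a block is fixed by p rather than drawn independently for each unit.
library(randomizr)
library(dplyr)
library(janitor) # just for the tabyl function
block_rand <- as_tibble(randomizr::block_ra(blocks = df$group, conditions = c("urn_1", "urn_2", "urn_3"), prob_each = p))
df2 <- as_tibble(bind_cols(df, block_rand))
df2 %>% janitor::tabyl(group, value)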
This does the trick using dplyr:
randomize <- function(data, groups = 2, block_id = NULL, p = NULL, seed = 9782) {
  # default to equal assignment probabilities
  if (is.null(p)) p <- rep(1 / groups, groups)
  # make the assignment reproducible
  set.seed(seed)
  if (!is.null(block_id)) {
    # draw separately within each block
    data <- data %>% group_by(across(all_of(block_id)))
  }
  data %>%
    mutate(Treatment = sample(x = seq_len(groups),
                              size = n(),
                              replace = TRUE,
                              prob = p)) %>%
    ungroup()
}
df1 <- randomize(data = df, groups = J, block_id = "group", p = p, seed = 9782)