My question can be considered an extension of the following discussion: R expss package: format numbers by statistic / apply different format to alternate rows
I would like to understand the grammar of conditions so that I can write my own custom formats. Consider the 'infert' data frame from datasets. We create the following table with expss:
infert %>%
tab_cells(parity) %>%
### TOTAL
tab_cols(total()) %>%
tab_stat_cases(label="N", total_row_position="none") %>%
### OTHER VARIABLES
tab_cols(education) %>%
tab_stat_cases(label="N", total_row_position="none") %>%
tab_stat_cpct(label="%Col.", total_row_position="none") %>%
tab_pivot(stat_position="inside_columns") %>%
format_vert()
The last line applies basic formatting, as discussed in the URL above. In detail:
format_vert = function(tbl, pct_digits=2, n_digits=0){
#Finding columns to format
pct_cols = grepl("\\|%Col.$", names(tbl), perl = TRUE)
n_cols = grepl("\\|N$", names(tbl), perl = TRUE)
#Format
recode(tbl[,-1]) = other ~ function(x) ifelse(is.numeric(x) & is.na(x), 0, x)
tbl[,pct_cols] = format(tbl[,pct_cols], digits=pct_digits, nsmall=pct_digits)
tbl[,n_cols] = format(tbl[,n_cols], digits=n_digits, nsmall=n_digits)
recode(tbl[,pct_cols]) = other ~ function(x) paste0(x, "%")
tbl
}
I understand how to format whole tables or columns (experts will have noticed the differences from the example in the URL), but what if I only want to format specific cells? For instance, how can I set digits=0 when the value is 100.00% (so that it shows only 100%)?
I don't know whether I should use recode or format, or when and where to reference tbl[,pct_cols]...
Thank you!
The simplest way is to add extra recodings to the recode call in the function format_vert. We can't use a recoding of the form '100.00' ~ '100' because by that point the columns have already been padded with spaces for alignment, so an exact comparison would not match. Instead we use regular expressions: perl requests Perl-style regex matching and \\b denotes a word boundary. Every value that matches such an expression is recoded.
data(infert)
format_vert = function(tbl, pct_digits=2, n_digits=0){
#Finding columns to format
pct_cols = grepl("\\|%Col.$", names(tbl), perl = TRUE)
n_cols = grepl("\\|N$", names(tbl), perl = TRUE)
#Format
recode(tbl[,-1]) = other ~ function(x) ifelse(is.numeric(x) & is.na(x), 0, x)
tbl[,pct_cols] = format(tbl[,pct_cols], digits=pct_digits, nsmall=pct_digits)
tbl[,n_cols] = format(tbl[,n_cols], digits=n_digits, nsmall=n_digits)
recode(tbl[,pct_cols]) = c(
perl("\\b0\\.00\\b") ~ "0% ", # additional recodings
perl("\\b100\\.00\\b") ~ "100% ", # additional recodings
other ~ function(x) paste0(x, "%")
)
tbl
}
infert %>%
tab_cells(parity) %>%
### TOTAL
tab_cols(total()) %>%
tab_stat_cases(label="N", total_row_position="none") %>%
### OTHER VARIABLES
tab_cols(education) %>%
tab_stat_cases(label="N", total_row_position="none") %>%
tab_stat_cpct(label="%Col.", total_row_position="none") %>%
tab_pivot(stat_position="inside_columns") %>%
format_vert()
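To see why the exact recoding fails while the regular expression works, here is a small base-R sketch. It calls grepl directly instead of the expss perl() criterion, and the sample values are made up for illustration.

```r
# format() pads every element to a common width, as it does inside format_vert
x <- format(c(0, 12.5, 100), nsmall = 2)
print(x)  # "  0.00" " 12.50" "100.00"

# An exact comparison misses the padded zero
exact <- x == "0.00"

# A word-boundary regex ignores the padding and does not match
# the "0.00" that sits inside "100.00"
zero_hit    <- grepl("\\b0\\.00\\b", x, perl = TRUE)
hundred_hit <- grepl("\\b100\\.00\\b", x, perl = TRUE)

print(exact)        # FALSE FALSE FALSE
print(zero_hit)     # TRUE FALSE FALSE
print(hundred_hit)  # FALSE FALSE TRUE
```

The dot is escaped so that it only matches a literal decimal point.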
The dataset below has columns with very similar names and some values which are NA.
library(tidyverse)
dat <- data.frame(
v1_min = c(1,2,4,1,NA,4,2,2),
v1_max = c(1,NA,5,4,5,4,6,NA),
other_v1_min = c(1,1,NA,3,4,4,3,2),
other_v1_max = c(1,5,5,6,6,4,3,NA),
y1_min = c(3,NA,2,1,2,NA,1,2),
y1_max = c(6,2,5,6,2,5,3,3),
other_y1_min = c(2,3,NA,1,1,1,NA,2),
other_y1_max = c(5,6,4,2,NA,2,NA,NA)
)
head(dat)
In this example, v1 and y1 are what I would consider the common "categories" among the columns. To get something similar with my current dataset, I had to use gsub to tease these out:
cats<-dat %>%
names() %>%
gsub("^(.*)_(min|max)", "\\1",.) %>%
gsub("^(.*)_(.*)", "\\2",.) %>%
unique()
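For reference, this is roughly what the two gsub passes do, shown on a few of the column names. The intermediate names step1 and step2 are only for illustration.

```r
nms <- c("v1_min", "v1_max", "other_v1_min", "other_y1_max")

# Pass 1: drop the trailing _min / _max suffix
step1 <- gsub("^(.*)_(min|max)", "\\1", nms)

# Pass 2: if an underscore remains, keep only the part after the last one
step2 <- gsub("^(.*)_(.*)", "\\2", step1)

print(step1)          # "v1" "v1" "other_v1" "other_y1"
print(unique(step2))  # "v1" "y1"
```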
Now, my goal is to mutate a new min and a new max column for each of those categories. So far the code below works just fine.
dat %>%
rowwise() %>%
mutate(min_v1 = min(c_across(contains(cats[1])), na.rm=T)) %>%
mutate(max_v1 = max(c_across(contains(cats[1])), na.rm=T)) %>%
mutate(min_y1 = min(c_across(contains(cats[2])), na.rm=T)) %>%
mutate(max_y1 = max(c_across(contains(cats[2])), na.rm=T))
However, the number of categories in my current dataset is considerably larger than 2. Is there a quicker way to implement this?
I've tried a few of the suggestions on this post but haven't quite been able to extend them to this problem.
You can use one of the map functions here to process each common category.
library(dplyr)
library(purrr)
result <- bind_cols(dat, map_dfc(cats,
~dat %>%
rowwise() %>%
transmute(!!paste('min', .x, sep = '_') := min(c_across(matches(.x)), na.rm = TRUE),
!!paste('max', .x, sep = '_') := max(c_across(matches(.x)), na.rm = TRUE))))
result
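As a possible alternative, and not what the answer above uses, base R's pmin and pmax compute row-wise minima and maxima in a vectorised way, which avoids rowwise(). A minimal sketch on a reduced, made-up version of the data, assuming each category name matches only its own columns:

```r
dat <- data.frame(
  v1_min       = c(1, NA),
  v1_max       = c(3, 5),
  other_v1_min = c(0, 2),
  other_v1_max = c(4, NA),
  y1_min       = c(2, 7),
  y1_max       = c(6, 9)
)
cats <- c("v1", "y1")

out <- dat
for (ct in cats) {
  # Select the original columns belonging to this category
  cols <- dat[grepl(ct, names(dat), fixed = TRUE)]
  # Pass the columns as separate arguments to pmin / pmax
  args <- c(as.list(cols), list(na.rm = TRUE))
  out[[paste0("min_", ct)]] <- do.call(pmin, args)
  out[[paste0("max_", ct)]] <- do.call(pmax, args)
}
print(out)
```

Unlike min(..., na.rm = TRUE), pmin returns NA rather than Inf for a row where every value is missing.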
I have the following dataset:
combined <- data.frame(
client = c('aaa','aaa','aaa','bbb','bbb','ccc','ccc','ddd','ddd','ddd'),
type = c('norm','reg','opt','norm','norm','reg','opt','opt','opt','reg'),
age = c('>50','>50','75+','<25','<25','>50','75+','25-50','25-50','75+'),
cases = c('1','2','2','1','0','1','2','0','3','2'),
IsActive = c('1','0','0','1','1','0','1','1','1','0')
)
I have identified the unique variable combinations with:
# get unique variable combinations
unique_vars <- combined %>%
select(1:3,5) %>%
distinct()
I am trying to iterate the query combined %>% anti_join(slice(unique_vars,1)) over unique_vars using purrr, saving both the output of each query and a summary of cases from each output back to the unique_vars table. The slice should step through each row of unique_vars rather than stay fixed at 1.
I tried :
qry <- combined %>% anti_join(slice(unique_vars,1))
map(.x = unique_vars %>%
slice(.),
~qry %>%
summarise(CaseCnt = sum(cases)) %>%
inner_join(.x))
My desired output would be two things:
Full output of the query
the new Field CaseCnt added to the unique_vars dataframe
Is this possible?
Although I don't completely follow the intuition behind your query, it seems that for #1 you would want:
lapply(seq_len(nrow(unique_vars)), function(x) {
  combined %>%
    anti_join(slice(unique_vars, x))
})
And for #2 you would want:
unique_vars$CaseCnt <- lapply(seq_len(nrow(unique_vars)), function(x) {
  combined %>%
    anti_join(slice(unique_vars, x)) %>%
    summarise(CaseCnt = sum(cases %>% as.numeric))
}) %>%
  do.call(what = rbind.data.frame, args = .) %>%
  pull(CaseCnt)  # keep a plain numeric vector rather than a nested data frame
Alternatively for #2 with purrr::map_df():
unique_vars$CaseCnt <- map_df(seq_len(nrow(unique_vars)), function(x) {
  combined %>%
    anti_join(slice(unique_vars, x)) %>%
    summarise(CaseCnt = sum(cases %>% as.numeric))
})$CaseCnt
Just as an aside -- you could do this directly with:
combined %>%
mutate(cases = as.numeric(cases)) %>%
mutate(tot_cases = sum(cases)) %>% # sum total cases across unique_id's
group_by(client, type, age, IsActive) %>%
summarize(CaseCnt = mean(tot_cases) - sum(cases))
Or if what you were actually looking for is the sum of cases in that group:
combined %>%
mutate(cases = as.numeric(cases)) %>%
group_by(client, type, age, IsActive) %>%
summarize(CaseCnt = sum(cases))
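The first aside relies on the fact that removing one group and summing the rest is the same as the grand total minus that group's sum. A small base-R check on made-up data, using logical indexing in place of anti_join (the names loop_cnt and direct_cnt are hypothetical):

```r
combined <- data.frame(
  client = c("aaa", "aaa", "bbb", "bbb"),
  type   = c("norm", "reg", "norm", "norm"),
  cases  = c("1", "2", "4", "3"),
  stringsAsFactors = FALSE
)
combined$cases <- as.numeric(combined$cases)
unique_vars <- unique(combined[c("client", "type")])

# Approach 1: for each unique combination, drop its rows and sum the remainder
loop_cnt <- sapply(seq_len(nrow(unique_vars)), function(i) {
  in_group <- combined$client == unique_vars$client[i] &
    combined$type == unique_vars$type[i]
  sum(combined$cases[!in_group])
})

# Approach 2: grand total minus the sum of each group
key_all    <- paste(combined$client, combined$type)
key_unique <- paste(unique_vars$client, unique_vars$type)
group_sum  <- tapply(combined$cases, key_all, sum)
direct_cnt <- sum(combined$cases) - as.vector(group_sum[key_unique])

unique_vars$CaseCnt <- loop_cnt
print(unique_vars)
```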
I have a (new) question related to expss tables. I wrote a very simple UDF (that relies on few expss functions), as follows:
library(expss)
z_indices <- function(x, m_global, std_global, weight=NULL){
if(is.null(weight)) weight = rep(1, length(x))
z <- (w_mean(x, weight)-m_global)/std_global
indices <- 100+(z*100)
return(indices)
}
Reproducible example, based on infert dataset (plus a vector of arbitrary weights):
data(infert)
infert$w <- as.vector(x=rep(2, times=nrow(infert)), mode='numeric')
infert %>%
tab_cells(age, parity) %>%
tab_cols(total(), education, case %nest% list(total(), education)) %>%
tab_weight(w) %>%
tab_stat_valid_n(label="N") %>%
tab_stat_mean(label="Mean") %>%
tab_stat_fun(label="Z", function(x, m_global, std_global, weight=NULL){
z_indices(x, m_global=w_mean(infert$age, infert$w),std_global=w_sd(infert$age, infert$w))
}) %>%
tab_pivot(stat_position="inside_columns")
The table is computed and the output for the first line is (almost) as expected.
Then things get messy on the second line, since both arguments of z_indices explicitly refer to infert$age where infert$parity is expected.
My question: is there a way to dynamically pass the variables of tab_cells as function arguments within tab_stat_fun, so that they match the variable being processed? I guess this happens inside the function declaration, but I have no clue how to proceed...
Thanks!
EDIT April 28th 2020:
The answer from @Gregory Demin works great for the infert dataset. For better scalability to larger data frames, I wrote the following loop:
var_df <- data.frame("age"=infert$age, "parity"=infert$parity)
tabZ=infert
for(each in names(var_df)){
tabZ = tabZ %>%
tab_cells(var_df[each]) %>%
tab_cols(total(), education) %>%
tab_weight(w) %>%
tab_stat_valid_n(label="N") %>%
tab_stat_mean(label="Mean") %>%
tab_stat_fun(label="Z", function(x, m_global, std_global, weight=NULL){
z_indices(x, m_global=w_mean(var_df[each], infert$w),std_global=w_sd(var_df[each], infert$w))
})
}
tabZ = tabZ %>% tab_pivot()
Hope this inspires other expss users in the future!
There is no universal solution for this case. The function passed to tab_stat_fun is always evaluated inside a single cell, so it cannot see global values.
However, in your case we can calculate the z-index before summarizing. It is not a very flexible solution, but it works:
# function for weighted z-score
w_z_index = function(x, weight = NULL){
if(is.null(weight)) weight = rep(1, length(x))
z <- (x - w_mean(x, weight))/w_sd(x, weight)
indices <- 100+(z*100)
return(indices)
}
data(infert)
infert$w <- rep(2, times=nrow(infert))
infert %>%
tab_cells(age, parity) %>%
tab_cols(total(), education, case %nest% list(total(), education)) %>%
tab_weight(w) %>%
tab_stat_valid_n(label="N") %>%
tab_stat_mean(label="Mean") %>%
# here we get z-index instead of original variables
tab_cells(age = w_z_index(age, w), parity = w_z_index(parity, w)) %>%
tab_stat_mean(label="Z") %>%
tab_pivot(stat_position="inside_columns")
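The reason this works is linearity: the mean of a linearly rescaled variable within a cell equals the rescaled cell mean, so averaging the z-index per column gives the same number as plugging the cell mean into the original z_indices formula with the global mean and SD. A minimal check in base R, where mean and sd are unweighted stand-ins for w_mean and w_sd:

```r
data(infert)
x <- infert$age
g <- infert$education

# Rescale every observation first, then average within each education level
z_index <- 100 + 100 * (x - mean(x)) / sd(x)
by_cell <- tapply(z_index, g, mean)

# Average within each level first, then rescale with the global mean and SD
direct <- 100 + 100 * (tapply(x, g, mean) - mean(x)) / sd(x)

print(all.equal(by_cell, direct))
```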
UPDATE.
A little more scalable approach:
w_z_index = function(x, weight = NULL){
if(is.null(weight)) weight = rep(1, length(x))
z <- (x - w_mean(x, weight))/w_sd(x, weight)
indices <- 100+(z*100)
return(indices)
}
w_z_index_df = function(df, weight = NULL){
df[] = lapply(df, w_z_index, weight = weight)
df
}
data(infert)
infert$w <- rep(2, times=nrow(infert))
infert %>%
tab_cells(age, parity) %>%
tab_cols(total(), education, case %nest% list(total(), education)) %>%
tab_weight(w) %>%
tab_stat_valid_n(label="N") %>%
tab_stat_mean(label="Mean") %>%
# here we get z-index instead of original variables
# we process a lot of variables at once
tab_cells(w_z_index_df(data.frame(age, parity))) %>%
tab_stat_mean(label="Z") %>%
tab_pivot(stat_position="inside_columns")
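A note on the df[] = lapply(...) idiom used in w_z_index_df: assigning into df[] replaces the columns while keeping the data frame structure and column names, whereas a plain df = lapply(...) would return a list. A small sketch with made-up data:

```r
df <- data.frame(a = c(1, 2, 3), b = c(10, 20, 30))

# Centre every column; the empty brackets keep the data frame shape
scaled <- df
scaled[] <- lapply(df, function(col) col - mean(col))

# Without the brackets the result is a plain list
as_list <- lapply(df, function(col) col - mean(col))

print(class(scaled))   # "data.frame"
print(class(as_list))  # "list"
```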
I would like to join/merge multiple tibbles/data frames using map/lapply. How can I do that?
Reproducible example:
set.seed(42)
df <- tibble::tibble(rank = rep(stringr::str_c("rank",1:10),10),
char_1 = sample(c("a","b","c"), size = 100, replace = TRUE),
points = sample(1:10000, size = 100)
)
my_top <- seq(10,90, by= 10) %>%
as.list() %>%
set_names(c(stringr::str_c("sample_",1:9)))
my_list_1 <- map(my_top , ~ df %>%
sample_n(.x) %>%
mutate(!!str_c(.x, "_score") := sample(1:10000, size = .x)))
I would like to perform this:
df %>% group_by(rank, char_1, points) %>%
left_join(my_list_1[[1]] ) %>%
left_join(my_list_1[[2]] ) %>%
left_join(my_list_1[[3]] )
and so on ... with map function.
I tried this:
map(as.list(names(my_top)), ~ df %>% group_by(rank, char_1, points) %>%
left_join(my_list_1[[.x]] ))
But of course, this does not store the joined tibble anywhere, so the next join cannot build on it!
One option is reduce:
library(dplyr)
library(purrr)
df %>%
group_by(rank, char_1, points) %>%
list(.) %>%
c(., my_list_1[1:3]) %>%
reduce(left_join)
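reduce applies the join to the first two elements, then joins that result with the third, and so on, which is exactly the "carry the joined table forward" behaviour that map lacks. A base-R sketch of the same folding pattern, using Reduce and merge on made-up data instead of purrr and dplyr:

```r
base <- data.frame(id = 1:3, points = c(10, 20, 30))
extras <- list(
  data.frame(id = 1:2, score_a = c(5, 6)),
  data.frame(id = 2:3, score_b = c(7, 8))
)

# Left join of the accumulated result with the next table in the list
left_merge <- function(acc, nxt) merge(acc, nxt, by = "id", all.x = TRUE)

# Put the base table first so that every later table is joined onto it
joined <- Reduce(left_merge, c(list(base), extras))
print(joined)
```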
This is my first answer; I'm new here. I had a similar problem recently, and join_all from plyr was the best solution I found.
library(plyr)
library(readr)
# list the files saved on your computer, for example in txt format
files <- list.files("path", pattern = "\\.txt$", full.names = TRUE)
# read the files and save them as a list
list_of_data_frames <- lapply(files, read_delim, delim = "\t")
# merge the files
merged_file <- join_all(list_of_data_frames, by = NULL)
I am trying to unnest two columns that do not always have the same number of values per cell, and then concatenate the values that correspond between the two columns. For example:
library('dplyr')
library('tidyr')
#Sample Data
df <- data.frame(id = c(1:4),
first.names = c('Michael, Jim', 'Michael, Michael', 'Creed', 'Creed, Jim'),
last.names = c('Scott, Halpert', 'Scott, Cera', '', 'Halpert'))
Not all values in df$first.names are associated with a value in df$last.names. I am trying to get the following results:
#Desired output
df.results <- data.frame(id = c(1,1,2,2,3,4,4),
first.names = c('Michael', 'Jim', 'Michael', 'Michael', 'Creed', 'Creed', 'Jim'),
last.names = c('Scott', 'Halpert', 'Scott', 'Cera', '', '', 'Halpert'),
full.names = c('Michael Scott', 'Jim Halpert', 'Michael Scott', 'Michael Cera', 'Creed', 'Creed', 'Jim Halpert'))
I have tried using unnest, it works for first.names, but not for last.names (it drops the row where last.names is blank):
#convert to characters
df$first.names <- as.character(df$first.names)
df$last.names <- as.character(df$last.names)
#Unnest first names
df <- df %>%
transform(first.names = strsplit(first.names, ',')) %>%
unnest(first.names)%>%
transform(last.names = strsplit(last.names, ',')) %>%
unnest(last.names)
I was then going to delete duplicate lines, but that still does not solve the issue with the values in df$first.names that do not have a value in df$last.names.
Is there a better way to do this?
Check this solution:
library(tidyverse)
df %>%
as_tibble() %>%
mutate_at(2:3, ~ strsplit(as.character(.x), ',') %>% map(~ str_trim(.x))) %>%
mutate(
First = map2_chr(first.names, last.names, ~ paste(.x[1], .y[1])),
Second = map2_chr(first.names, last.names, ~ paste(.x[2], .y[2]))
) %>%
mutate_at(4:5, ~ str_remove_all(.x, 'NA') %>% str_trim()) %>%
gather('x', 'full.names', First:Second) %>%
filter(full.names != '') %>%
mutate(
first.names = map_chr(full.names, ~ str_split(.x, ' ')[[1]][1]),
last.names = map_chr(full.names, ~ str_split(.x, ' ')[[1]][2]) %>%
replace_na('')
) %>%
select(-x) %>%
arrange(id)
I could add logic so that, when there is only one value in last.names, it is combined with the second value in first.names to give the same result as your desired output, but I don't think that is what you want in general. A vector indicating which first.names have no matching last.names would solve the problem.
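A minimal base-R sketch of the padding logic mentioned above. It rests on a strong assumption that is not guaranteed by the data: whenever last names are missing, they belong to the leading first names, so the last names are padded with empty strings at the front. It also assumes there are never more last names than first names. The helper split_trim is hypothetical.

```r
df <- data.frame(
  id = 1:4,
  first.names = c("Michael, Jim", "Michael, Michael", "Creed", "Creed, Jim"),
  last.names  = c("Scott, Halpert", "Scott, Cera", "", "Halpert"),
  stringsAsFactors = FALSE
)

# Split a comma-separated cell into trimmed pieces (empty cell -> no pieces)
split_trim <- function(x) trimws(strsplit(x, ",", fixed = TRUE)[[1]])

rows <- lapply(seq_len(nrow(df)), function(i) {
  first <- split_trim(df$first.names[i])
  last  <- split_trim(df$last.names[i])
  # ASSUMPTION: missing last names are the leading ones, so pad at the front
  last <- c(rep("", length(first) - length(last)), last)
  data.frame(
    id = df$id[i],
    first.names = first,
    last.names = last,
    full.names = trimws(paste(first, last)),
    stringsAsFactors = FALSE
  )
})
df.results <- do.call(rbind, rows)
print(df.results)
```

If the missing last names can fall in any position, this padding rule pairs names incorrectly, and the data would need to mark the gaps explicitly.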