R: dynamic variable name comparisons - r

I recoded a bunch of variables in a dataset and gave the newly recoded variables the prefix "r_".
I'd like to run table on the pairs to ensure the recoding was correct. Something like table(v1, r_v1), but I need to do it for lots of variables. They are not in any particular order, so I couldn't use indexing.
Here is a reproducible example of data one can use (also any tips on optimizing that code are appreciated!).
mtcars %>% select(c(disp,hp)) %>%
mutate_all(funs(if_else(.>100,1,0))) %>%
rename_(.dots=setNames(names(.), paste0('r_', names(.)))) %>%
cbind(mtcars,.)
Any ideas?

I would just use variable names and a simple for loop. Calling your modified data dd,
orig = c("disp", "hp")
trans = paste0("r_", orig)
check_list = list()
for (i in seq_along(orig)) {
check_list[[i]] = table(dd[[orig[i]]], dd[[trans[i]]])
# or whatever other check you want to do
}
check_list
You can then examine the check_list contents one at a time.
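If the list of recoded variables is long, the pairs can also be discovered from the "r_" prefix rather than typed out. This is only a sketch in base R: it rebuilds dd from mtcars so that it runs on its own, and it assumes every column starting with "r_" has an original counterpart without the prefix.

```r
# Rebuild the example data with base R only (dd is the name used above).
dd <- mtcars
for (v in c("disp", "hp")) {
  dd[[paste0("r_", v)]] <- ifelse(dd[[v]] > 100, 1, 0)
}

# Find every recoded column by its prefix, then derive the original names.
trans <- grep("^r_", names(dd), value = TRUE)
orig <- sub("^r_", "", trans)

# One cross-tabulation per pair; Map() names the results after `orig`.
check_list <- Map(
  function(o, r) table(dd[[o]], dd[[r]], dnn = c(o, r)),
  orig, trans
)

check_list$hp
```

Because the list is named, a single check can be pulled out by variable name instead of by position.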

To keep things in the tidy format with which you started:
library(purrr)
library(tidyr)
mtcars %>%
select(disp,hp) %>%
mutate_all(funs(r = if_else(.>100,1,0))) %>%
mutate(index = row_number()) %>%
gather(key = key, value = value, -index) %>%
separate(key, c("Variable", "Type")) %>%
mutate(Type = ifelse(is.na(Type), "Original", "Recode")) %>%
spread(key = Type, value = value) %>%
select(-index) %>%
split(.$Variable) %>%
map(~ select(.,-Variable)) %>%
map(~ table(.))
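On the request for tips on the recoding code itself: mutate_all(), funs() and rename_() have since been superseded in dplyr. A minimal sketch of the same recode, assuming dplyr 1.0 or later, uses across() with its .names argument to create the prefixed columns in one step. The column name over_100 is made up for this illustration.

```r
library(dplyr)

# Recode disp and hp and keep the originals; new columns get the "r_" prefix.
dd <- mtcars %>%
  mutate(across(c(disp, hp), ~ if_else(.x > 100, 1, 0), .names = "r_{.col}"))

# Quick check for one pair: each side of the threshold should map to one code.
chk <- dd %>% count(over_100 = disp > 100, r_disp)
chk
```

If the recode is correct, chk has exactly one row per recoded value, with no row where over_100 and r_disp disagree.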

Related

How do you efficiently group by multiple columns in dplyr

With dplyr you can group by columns like this:
library(dplyr)
df <- data.frame(a=c(1,2,1,3,1,4,1,5), b=c(2,3,4,1,2,3,4,5))
df %>%
group_by(a) %>%
summarise(count = n())
If I want to group by two columns all the guides say:
df %>%
group_by(a,b) %>%
summarise(count = n())
But can I not feed the group_by() parameters more efficiently somehow, rather than having to type them in explicitly, e.g. like:
cols = colnames(df)
df %>%
group_by(cols) %>%
summarise(count = n())
I have examples where I want to group by 10+ columns, and it is pretty horrible to write it out if you can just parse their names.
across() and curly-curly are the answer (even though it doesn't make sense to group_by using all your columns):
cols = colnames(df)
df %>%
group_by(across({{cols}})) %>%
summarise(count = n())
You can use across with any of the tidy selectors. For example if you want all columns
df %>%
group_by(across(everything())) %>%
summarise(count = n())
Or if you want to pass a character vector of column names:
cols <- c("a","b")
df %>%
group_by(across(all_of(cols))) %>%
summarise(count = n())
See help("language", package="tidyselect") for all the selection options.
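For completeness, here is a self-contained sketch of the character-vector case, assuming dplyr 1.0 or later. The reason group_by(cols) does not work is that group_by() treats cols as an expression to compute a grouping column from, not as a set of column names; all_of() is what tells tidyselect to look the names up. The second form, count() with across(), is a shorthand I believe is equivalent here.

```r
library(dplyr)

df <- data.frame(a = c(1, 2, 1, 3, 1, 4, 1, 5),
                 b = c(2, 3, 4, 1, 2, 3, 4, 5))
cols <- c("a", "b")

# Group by every column named in `cols`, then count rows per group.
res <- df %>%
  group_by(across(all_of(cols))) %>%
  summarise(count = n(), .groups = "drop")

# Shorthand: count() accepts the same across() selection.
res2 <- df %>% count(across(all_of(cols)), name = "count")
```

Both results have one row per distinct (a, b) combination.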

How to apply multiple functions to a list of data frames?

I have a list of more than 50 csv files with the same numbers of columns and rows.
I want to find the percentage of missing values for each of the data frames and I have found the code that works fine with a single file which is the following:
missing.values <- estaciones2 %>%
gather(key = "key", value = "val") %>%
mutate(is.missing = is.na(val)) %>%
group_by(key, is.missing) %>%
summarise(num.missing = n()) %>%
filter(is.missing==T) %>%
select(-is.missing) %>%
arrange(desc(num.missing))
Now I want to apply these functions to each of my data frames in my list.
I read that I can use the map function to loop over the list and run the code for each of my files, but I am not sure how to insert map into the code shown above. I have tried the following, but it doesn't seem right:
missing.values <- map(estaciones2, ~ map(estaciones2, ~ estaciones2 %>%
gather(key = "key", value = "val") %>%
mutate(is.missing = is.na(val)) %>%
group_by(key, is.missing) %>%
summarise(num.missing = n()) %>%
filter(is.missing==T) %>%
select(-is.missing) %>%
arrange(desc(num.missing)))
We need a lambda function (~) to loop over the list (assuming estaciones2 is a list object). Inside the lambda, .x refers to the current data.frame element of the list:
library(purrr)
library(tidyr)
library(dplyr)
map(estaciones2, ~ .x %>%
gather(key = "key", value = "val") %>%
mutate(is.missing = is.na(val)) %>%
group_by(key, is.missing) %>%
summarise(num.missing = n()) %>%
filter(is.missing==T) %>%
select(-is.missing) %>%
arrange(desc(num.missing)))
In the OP's code, map is nested inside map and the pipeline is applied to the whole list estaciones2 on every iteration, instead of to the current element .x.
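Since the stated goal was the percentage of missing values per column, a shorter route may be colMeans(is.na(.x)) inside the same map() call. This is only a sketch: estaciones_demo is a made-up stand-in for the real list of data frames.

```r
library(purrr)

# Hypothetical stand-in for the list of station data frames.
estaciones_demo <- list(
  st1 = data.frame(temp = c(1, NA, 3, NA), rain = c(NA, 2, 3, 4)),
  st2 = data.frame(temp = c(1, 2, 3, 4), rain = c(NA, NA, NA, 4))
)

# is.na() gives a logical matrix; its column means are the missing fractions.
pct_missing <- map(estaciones_demo, ~ colMeans(is.na(.x)) * 100)

pct_missing$st1
```

The result is a named list with one numeric vector per data frame, so the list names (file or station names) stay attached to the output.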

Save intermediate list output in dplyr pipeline and map it back to another list further down the pipeline - R

I am running PCAs on groups in a data set using dplyr pipelines. I am starting with group_split, so I am working with a list. In order to run the prcomp() function, only the numeric columns of each list element can be included, but I would like the factor column brought back in for plotting at the end. I have tried saving an intermediate output using {. ->> temp} partway through the pipeline, but since it is a list, I don't know how to index the grouping column when plotting.
library(tidyverse)
library(ggbiplot)
iris %>%
group_split(Species, keep = T) %>% #group by species, one pca per species
{. ->> temp} %>% # save intermediate output to preserve species column for use in plotting later
map(~.x %>% select_if(is.numeric) %>% select_if(~var(.) != 0) %>%
prcomp(scale. = TRUE))%>% #run pca on numeric columns only
map(~ggbiplot(.x), label=temp$Species)#plot each pca, labeling points as species names from the temporary object
This works to produce one PCA plot for each species in the iris data set, but since temp$Species is NULL (temp is a list of data frames, not a single data frame), the points are not labelled.
If you use map2() and pass the .y argument as the species list you can get the result I think you want. Note that in your original code the labels argument was outside the ggbiplot() function and was ignored.
library(tidyverse)
library(ggbiplot)
iris %>%
group_split(Species, keep = T) %>%
{. ->> temp} %>%
map(~.x %>%
select_if(is.numeric) %>%
select_if(~var(.) != 0) %>%
prcomp(scale. = TRUE)) %>%
map2(map(temp, "Species"), ~ggbiplot(.x, labels = .y))
In response to your comment, if you wanted to add a third argument you could use pmap() instead of map2(). In the example below, pmap() is being passed a (nested) list of the data for the ggbiplot() arguments. Note I've changed the new variable so that it's a factor and not constant across groups.
iris %>%
mutate(new = factor(sample(1:3, 150, replace = TRUE))) %>%
group_split(Species, keep = T) %>%
{. ->> temp} %>%
map(~.x %>%
select_if(is.numeric) %>%
select_if(~var(.) != 0) %>%
prcomp(scale. = TRUE)) %>%
list(map(temp, "Species"), map(temp, "new")) %>%
pmap(~ ggbiplot(pcobj = ..1, labels = ..2, groups = ..3))
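To see what pmap() hands to ..1, ..2 and ..3 without needing ggbiplot, the plotting call can be swapped for a small summary. This is a sketch under that substitution, not the plotting code itself; the names groups, pcas and args_seen, and the stand-in third argument, are made up here. It also keeps the split list in an ordinary variable instead of using ->>.

```r
library(purrr)

# Keep the split data in a variable so the Species column stays available.
groups <- split(iris, iris$Species)

# One PCA per group, on the numeric columns only.
pcas <- map(groups, ~ prcomp(.x[sapply(.x, is.numeric)], scale. = TRUE))

# Parallel lists: element i of each list goes to ..1, ..2, ..3 on iteration i.
args_seen <- pmap(
  list(pcas, map(groups, "Species"), map(groups, ~ seq_len(nrow(.x)))),
  ~ list(
    n_components = ncol(..1$x),          # from the prcomp object
    first_label  = as.character(..2[1]), # from the Species vector
    n_points     = length(..3)           # from the third list
  )
)

args_seen$setosa
```

Replacing the inner list(...) with a plotting call such as the ggbiplot() one above should then receive the same three pieces per group.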
One option is to use split and imap
library(tidyverse)
library(ggbiplot)
iris %>%
split(.$Species) %>% # split into a named list, one element per species
map(~.x %>% select_if(is.numeric) %>% select_if(~var(.) != 0) %>%
prcomp(scale. = TRUE)) %>%
imap(~ggbiplot(.x, labels = .y))
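One thing to be aware of with imap(): .y is the name of the list element, so it is a single string per group rather than a vector with one label per row. A small sketch that shows this without plotting (the names pcs and scores are made up for the illustration):

```r
library(purrr)

# split() returns a list named by species, which is what imap() relies on.
pcs <- map(
  split(iris, iris$Species),
  ~ prcomp(.x[sapply(.x, is.numeric)], scale. = TRUE)
)

# Attach the element name (.y) to the first two principal component scores.
scores <- imap(pcs, ~ data.frame(.x$x[, 1:2], Species = .y))

head(scores$virginica, 3)
```

In the data frames built here the single name is recycled across all rows of the group.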

grouped statistical test tidyverse

I'm trying to do a Wilcoxon test on long-formatted data. I want to use dplyr::group_by() to specify the subsets I'd like to do the test on.
The final result would be a new column with the p-value of the Wilcoxon test appended to the original data frame. All of the techniques I have seen require summarizing the data frame. I DO NOT want to summarize the data frame.
Please see an example reformatting the iris dataset to mimic my data, and finally my attempts to perform the task.
I am getting close, but I want to preserve all of my original data from before the Wilcoxon test.
# Reformatting Iris to mimic my data.
long_format <- iris %>%
gather(key = "attribute", value = "measurement", -Species) %>%
mutate(descriptor =
case_when(
str_extract(attribute, pattern = "\\.(.*)") == ".Width" ~ "Width",
str_extract(attribute, pattern = "\\.(.*)") == ".Length" ~ "Length")) %>%
mutate(Feature =
case_when(
str_extract(attribute, pattern = "^(.*?)\\.") == "Sepal." ~ "Sepal",
str_extract(attribute, pattern = "^(.*?)\\.") == "Petal." ~ "Petal"))
# Removing no longer necessary column.
cleaned_up <- long_format %>% select(-attribute)
# Attempt using do(), but I lose important info like "measurement"
cleaned_up %>%
group_by(Species, Feature) %>%
do(w = wilcox.test(measurement~descriptor, data=., paired=FALSE)) %>%
mutate(Wilcox = w$p.value)
# This is an attempt with the dplyr experimental group_map function. If only I could just make this a new column appended to the original df in one step.
cleaned_up %>%
group_by(Species, Feature) %>%
group_map(~ wilcox.test(measurement~descriptor, data=., paired=FALSE)$p.value)
Thanks for your help.
The model object can be wrapped in a list
library(tidyverse)
cleaned_up %>%
group_by(Species, Feature) %>%
nest %>%
mutate(model = map(data, ~
.x %>%
transmute(w = list(wilcox.test(measurement~descriptor,
data=., paired=FALSE)))))
Another option is to group_split into a list, then map over the list elements and create the 'pval' column after applying the test to each one:
cleaned_up %>%
group_split(Species, Feature) %>%
map_dfr(~ .x %>%
mutate(pval = wilcox.test(measurement~descriptor,
data=., paired=FALSE)$p.value))
Another option is to avoid the data argument entirely. The wilcox.test function only requires a data argument when the variables being tested aren't in the calling scope, but functions called within mutate have all the columns from the data frame in scope.
cleaned_up %>%
group_by(Species, Feature) %>%
mutate(pval = wilcox.test(measurement~descriptor, paired=FALSE)$p.value)
Same as akrun's output (thanks to his correction in the comments above)
akrun <-
cleaned_up %>%
group_split(Species, Feature) %>%
map_dfr(~ .x %>%
mutate(pval = wilcox.test(measurement~descriptor,
data=., paired=FALSE)$p.value))
me <-
cleaned_up %>%
group_by(Species, Feature) %>%
mutate(pval = wilcox.test(measurement~descriptor, paired=FALSE)$p.value)
all.equal(akrun, me)
# [1] TRUE
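As a side note, the reshaping step in the question can probably be shortened. The sketch below assumes tidyr 1.0 or later and uses pivot_longer() with names_sep to split "Sepal.Length" into Feature and descriptor directly, then applies the group_by() plus mutate() approach. wilcox.test() may warn about ties on this data, but it still returns a p-value.

```r
library(dplyr)
library(tidyr)

# Long format with the same columns as cleaned_up in the question.
cleaned_up <- iris %>%
  pivot_longer(
    -Species,
    names_to = c("Feature", "descriptor"),
    names_sep = "\\.",
    values_to = "measurement"
  )

# One p-value per (Species, Feature) group, repeated on every row of the group.
with_p <- cleaned_up %>%
  group_by(Species, Feature) %>%
  mutate(pval = wilcox.test(measurement ~ descriptor, paired = FALSE)$p.value) %>%
  ungroup()
```

All original rows are kept, and each group carries a single p-value.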

Labels not parsed in Expss for loop

I'm new to R and trying to explore my variables by groups, and I'm using a for loop to pass all suitable variable names to expss.
Here is a reproducible example:
require(expss)
require(dplyr)
colnoms <- as.data.frame(HairEyeColor) %>% names(.)
expss_digits(2)
for (i in colnoms){
as.data.frame(HairEyeColor) %>%
tab_cells(get(i)) %>%
tab_cols(Eye) %>%
tab_stat_cpct() %>%
tab_last_sig_cpct() %>%
tab_pivot() %>%
set_caption(i) %>%
htmlTable() %>%
print()
}
I expect the name of the variable in the output (Hair, Eye, Color), but instead I get only "get(i)".
Thanks for any advice
After get() we cannot know the original variable name. The simplest way to show the original name is to set the variable name as the label:
require(expss)
data(HairEyeColor)
HairEyeColor <- as.data.frame(HairEyeColor)
colnoms <- names(HairEyeColor)
expss_digits(2)
for (i in colnoms){
# if we don't have label we assign name as label
if(is.null(var_lab(HairEyeColor[[i]]))) var_lab(HairEyeColor[[i]]) = i
HairEyeColor %>%
tab_cells(get(i)) %>%
tab_cols(Eye) %>%
tab_stat_cpct() %>%
tab_last_sig_cpct() %>%
tab_pivot() %>%
set_caption(i) %>%
htmlTable() %>%
print()
}
