Dynamically recode and append fields to dataframe/tibble using yaml


GIVEN I want to transform a series of variables using information contained in a yaml file
AND I have the contents of the yaml file loaded into memory as list
WHEN I go to apply the re-mapping using the $responses element of each entry in the yaml file
THEN I get a new data.frame/tibble with columns appended using the names & recodes in each field's map entry
Below is a single entry in the yaml file. The entry health corresponds to a column name in the original data.frame. The text field is irrelevant for this question. The responses field contains the re-mapping recipe.
health:
  study_name: global_health
  text: Would you say your health in general is excellent, very good, good, fair, or poor
  responses:
    1: Excellent
    2: Very good
    3: Good
    4: Fair
    5: Poor
Here is an example of the original data - just focusing on a single field as an example:
health
1 1
2 3
3 1
4 2
5 2
6 4
What I would like in this case (though with many more variables) is the following:
health global_health
1 1 Excellent
2 3 Good
3 1 Excellent
4 2 Very good
5 2 Very good
6 4 Fair
What I have got so far is the following:
data_map <- yaml::read_yaml(map_filepath)
study_cols <- names(data_map)
And I have been able to leverage this structure to ignore entries that do not have a populated responses field in the yaml file - using this as my means of tracking which columns need to be recoded and which do not.
library(tidyverse)
# create simple local function to identify which have "responses" that need recoding
pluck_func <- function(x) !is.null(x$responses)
# create mask using the function
recode_cols_mask <- data_map %>%
map(~pluck(., pluck_func)) %>%
unlist()
# Identify variables that need recoding
recode_names <- study_cols[recode_cols_mask]
Now, what I would like to do is to take the variables that exist in recode_names and apply the renaming map for only those variables. My guess is that there is some clever solution with purrr's map family of functions. I just haven't been able to find the right combination of map() and mutate(). For completeness, below is an entry that would not be recoded and returned. Again the way this is determined is effectively is.null(data_map[['ANALWT_C']][['responses']]) will evaluate to TRUE.
ANALWT_C:
  study_name: weights
  text: Case-level study weight based on Census estimates
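The null check described above can be sketched in base R as follows. This is only an illustration: the inline data_map is a hypothetical, trimmed stand-in for what yaml::read_yaml() would return for the two entries shown in the question, and vapply() is used in place of the map()/pluck() combination.

```r
# Hypothetical in-memory stand-in for the parsed yaml file (trimmed for brevity)
data_map <- list(
  health = list(
    study_name = "global_health",
    responses = list(`1` = "Excellent", `2` = "Very good")
  ),
  ANALWT_C = list(study_name = "weights")
)

# TRUE for entries that carry a responses element, FALSE otherwise
needs_recode <- vapply(
  data_map,
  function(entry) !is.null(entry[["responses"]]),
  logical(1)
)

# Names of the columns that need recoding
recode_names <- names(data_map)[needs_recode]
```

Because vapply() is told to expect exactly one logical per entry, it fails loudly if an entry yields anything else, which removes the need for the separate unlist() step.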
Solutions do not need to be in tidyverse for what it is worth. Happy to use base R. I just tend to find tidyverse processing a little more readable.
UPDATE
Here is my current solution which I would love to simplify if I can. But if not I can live with this for the moment.
recode_func <- function(df, data_map, recode_names, study_names) {
  for (v in recode_names) {
    df[data_map[[v]][['study_name']]] <- factor(
      df[[v]],
      levels = seq_along(data_map[[v]][['responses']]),
      labels = data_map[[v]][['responses']] %>% unlist()
    )
  }
  for (v in setdiff(names(data_map), recode_names)) {
    df[data_map[[v]][['study_name']]] <- df[[v]]
  }
  study_df <- df[study_names]
  return(list(base_df = df,
              study_df = study_df))
}
with study_names defined as:
get_study_names <- function(x) x[['study_name']]
study_names <- data_map %>%
map(~pluck(., get_study_names)) %>%
unlist()
RESOLVED
My current implementation based on the accepted answer below:
#' Recodes dataframe using info pulled from the yaml file in the main function \code{create_study_datasets}
#'
#' @param df target data.frame
#'
#' @param data_map list read in from yaml file containing data renames and recoding recipes
#'
#' @param recode_names a character vector of names derived from the data_map in pre-processing steps that take place in
#' the parent function. These are the fields that need to be recoded in some fashion.
#'
#' @param study_names a character vector of names derived from the data_map in pre-processing steps that take place in
#' the parent function. These are all of the fields that should be returned - whether they are recoded and renamed or
#' just renamed according to the data_map.
#'
#' @seealso create_study_datasets
helper_recode_func <- function(df, data_map, recode_names, study_names) {
  # Apply the recodes as appropriate from the data map
  df <- df %>%
    imap_dfc(~ if (hasName(data_map, .y) && hasName(data_map[[.y]], "responses"))
      recode(.x, !!! data_map[[.y]][["responses"]])) %>%
    setNames(map_chr(data_map, "study_name")[names(.)]) %>%
    bind_cols(df, .)
  # Apply passthrough/name-change for variables that do not need recoding
  df <- df %>%
    select_if((names(.) %in% setdiff(names(data_map), recode_names))) %>%
    setNames(map_chr(data_map, "study_name")[names(.)]) %>%
    bind_cols(df, .)
  study_df <- df[study_names]
  return(list(base_df = df,
              study_df = study_df))
}
Which when executed returns the expected outputs for both variables that require recoding and those that do not:
ANALWT_C weights health global_health
1 1274.9937 1274.9937 1 Excellent
2 550.3692 550.3692 3 Good
3 6043.4082 6043.4082 1 Excellent
4 262.2439 262.2439 2 Very good
5 8354.2522 8354.2522 2 Very good

f <- function(x) recode(x, !!! data_map[[cur_column()]][["responses"]])
df %>%
mutate(across(any_of(names(keep(data_map, hasName, "responses"))),
f,
.names = "{data_map[[.col]][[\"study_name\"]]}"))
#> health global_health
#> 1 1 Excellent
#> 2 3 Good
#> 3 1 Excellent
#> 4 2 Very good
#> 5 2 Very good
#> 6 4 Fair
Variant:
df %>%
imap_dfc(~ if (hasName(data_map[[.y]], "responses"))
recode(.x, !!! data_map[[.y]][["responses"]])) %>%
setNames(map_chr(data_map, "study_name")[names(.)]) %>%
bind_cols(df, .)
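For readers who want to try the idea without a yaml file on disk, here is a minimal base R sketch. It assumes data_map has the shape yaml::read_yaml() would return for the entries shown in the question (a named list whose responses element is itself a named list). The inline data_map, the sample df, and the ANALWT_C values are made up for illustration, and the named-vector lookup is an alternative to recode(), not the method used in the answer above.

```r
# Hypothetical in-memory stand-in for the parsed yaml file
data_map <- list(
  health = list(
    study_name = "global_health",
    responses = list(`1` = "Excellent", `2` = "Very good", `3` = "Good",
                     `4` = "Fair", `5` = "Poor")
  ),
  ANALWT_C = list(study_name = "weights")
)

# Illustrative data; the ANALWT_C values are invented
df <- data.frame(health = c(1, 3, 1, 2, 2, 4),
                 ANALWT_C = c(10.5, 20.1, 30.2, 40.3, 50.4, 60.5))

for (v in intersect(names(data_map), names(df))) {
  new_name <- data_map[[v]][["study_name"]]
  responses <- data_map[[v]][["responses"]]
  if (is.null(responses)) {
    # No recipe: pass the column through under its study name
    df[[new_name]] <- df[[v]]
  } else {
    # Named character vector, e.g. c(`1` = "Excellent", ...), indexed by code
    lookup <- unlist(responses)
    df[[new_name]] <- unname(lookup[as.character(df[[v]])])
  }
}
```

Codes that are missing from the responses map come back as NA with this lookup, which may or may not be the desired behaviour.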

Related

R: Comparing list with grouped values in dataframe; questions about data types

I'm trying to make booleans for the dataframe testdf where when grouped by id, the boolean indicates whether everything in the vector values exists in that id's values.
I believe that ultimately this is a question about the different vector/list data types in R, which I still don't understand.
The vector values comes from the second column in lookup. R says it's a vector but not a list (should I do as.list(lookup$values) instead)?
testdf <- data.frame(id = c('a','a','a','b','b'),
value = c(1,2,3,1,2))
lookup <- data.frame(col1 = c('x','y','z'),
values = c(1,2,3))
values <- lookup$values
These produce:
> testdf
id value
1 a 1
2 a 2
3 a 3
4 b 1
5 b 2
> lookup
col1 values
1 x 1
2 y 2
3 z 3
> values
[1] 1 2 3
And my intended result would look like this. values_bool is TRUE for an id where all elements in values exist in value_list. Note that I formatted the lists in a generic way.
testdf
id value_list values_bool
1 a [1,2,3] TRUE
2 b [1,2] FALSE
This page includes some more information on the creation and use of list columns in R, although it doesn't explain the difference in data types generated within each list.
For example, a cell in the list column created by nest() looks like this:
asia <tibble [59 × 1]>
and a cell in the list column created by summarize() and list() looks like this:
asia <chr [59]>
I tried to create list columns using two methods from that page in my code:
version1 <- testdf %>%
# Make column listing values for each id
group_by(id) %>%
summarize(value_list = list(value)) %>%
ungroup()
version2 <- testdf %>%
# Nest values into list
nest(value_list = value)
If you run this you can see they produce different list types.
> version1
# A tibble: 2 × 2
id value_list
<chr> <list>
1 a <dbl [3]>
2 b <dbl [2]>
> version2
# A tibble: 2 × 2
id value_list
<chr> <list>
1 a <tibble [3 × 1]>
2 b <tibble [2 × 1]>
And when I try to add this line at the end of each, the output gives all False even though one should be True.
# Mark whether all values exist in value_list for each id
mutate(values_bool = values %in% value_list)
So what's going on with the dataframe list data types when grouping value from testdf and selecting values from lookup?
Thanks so much for your help.

@akrun answered this in the comments, but I love this question, which is really a deep inquiry into R data structures and functional programming, so I'm going to answer it in a longer form here and arrive at the same answer.
First the minimal reproducible problem setup (I simplify version1 and version2 to v1 and v2):
library(tidyverse)
testdf <- data.frame(id = c('a','a','a','b','b'),
value = c(1,2,3,1,2))
lookup <- data.frame(col1 = c('x','y','z'),
values = c(1,2,3))
values <- lookup$values
v1 <- testdf %>%
# Make column listing values for each id
group_by(id) %>%
summarize(value_list = list(value)) %>%
ungroup()
v2 <- testdf %>%
# Nest values into list
nest(value_list = value)
In v1 and v2, we create a value_list column that is a nested list. The list contents differ (you can tell by the reported dimensions in the OP).
v1$value_list is a nested atomic vector
v2$value_list is a nested data.frame
This is because list() -- when passed an atomic vector -- stores that atomic vector as an atomic vector, but nest() when passed an atomic vector, stores this as a data.frame. Why? Because nest() comes from the functional programming paradigm where we use map() (type-safe next gen lapply()) functions to operate on data.frame objects and importantly, return data.frame objects.
Okay, now let's start walking through the solution. It helps to understand how R represents the data types in v1 and v2.
We begin with v1. Calling list() on an atomic vector wraps that vector in a list, so each cell of value_list holds a 1D atomic vector. We can pull one out and examine it:
Note that throughout the post I only include code. Copy, paste, and run it interactively to see the output.
v1 # a dataframe
v1$value_list # a column, which is itself a list (see below)
class(v1$value_list) # verify this is a list
v1$value_list[[1]] # b/c it's a list, use [[ notation to pull element 1
class(v1$value_list[[1]]) # it's an atomic vector - "numeric" = integer or double
typeof(v1$value_list[[1]]) # typeof() to show it's double
typeof(1L) # as opposed to a strict integer type
length(v1$value_list[[1]]) # atomic vectors have property length
Now we examine v2:
v2 # a dataframe
v2$value_list # a column-which is itself a list, see below
class(v2$value_list) # verify this is a list
v2$value_list[[1]] # b/c it's a list, use [[ notation to pull element 1
class(v2$value_list[[1]]) # it's a data.frame, NOT an atomic vector!
typeof(v2$value_list[[1]]) # typeof() to show a dataframe is just a special list
dim(v2$value_list[[1]]) # dataframes have property dim
length(v2$value_list[[1]]) # dataframes also have property length, which is ncol()
With that background on object types, let's move to the problem posed in the OP.
And when I try to add this line at the end of each, the output gives all False even though one should be True.
# Mark whether all values exist in value_list for each id
mutate(values_bool = values %in% value_list)
Essentially, we want to test if all values occur in v1$value_list or v2$value_list. The expected result is TRUE, and then FALSE, for grouping variables a and b respectively. First, we do this the long way, and then we build to a functional programming solution.
# exhaustive approach for list 1 with atomic vectors
values %in% v1$value_list[[1]] # all values present
all(values %in% v1$value_list[[1]]) # returns TRUE as expected
# exhaustive approach for list 2 with atomic vectors
values %in% v1$value_list[[2]] # one value missing
all(values %in% v1$value_list[[2]]) # returns FALSE as expected
# now the functional programming solution without indexing. Notes on syntax:
# - Pass a list (or a vector) as the first argument.
# - Use ~ to indicate the start of the function to apply across the list
# - Use .x as a placeholder within the function for the ith list element. Think: concise for-loop
map(v1$value_list, ~all(values %in% .x)) # returns list(TRUE, FALSE)
# Two solutions: because we want to output a vector we can unlist(), or
# use map_lgl() which only returns a boolean and fails if the result is NOT bool
unlist(map(v1$value_list, ~all(values %in% .x))) # returns c(TRUE, FALSE)
map_lgl(v1$value_list, ~all(values %in% .x)) # returns c(TRUE, FALSE)
# now the functional programming solution in a pipe (put it all together):
v1 %>%
mutate(values_bool = map_lgl(value_list, ~all(values %in% .x)))
# we can do the same with v2, but we need to convert the dataframe to a vector
# so that it works with all() -- this is a bit tedious and verbose, because we
# used nest() which is designed to work with nested dataframes, but we ultimately
# want to use all() which takes vectors, so there was no need to nest() in the first
# place. For examples of how to use the power of nest():
# https://r4ds.had.co.nz/many-models.html#creating-list-columns
v2 %>%
mutate(values_bool = map_lgl(value_list, ~all(values %in% .x$value)))
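To see the same contrast without any packages, here is a small base R sketch. It rebuilds the list column by hand (the value_list object below is a hypothetical stand-in for v1$value_list) and uses vapply() in place of map_lgl().

```r
values <- c(1, 2, 3)

# Hand-built stand-in for the list column in v1
value_list <- list(a = c(1, 2, 3), b = c(1, 2))

# %in% against the list itself compares each value with whole list
# elements, not with the numbers inside them, so nothing matches here
naive <- values %in% value_list

# Checking inside each element gives one logical per id
values_bool <- vapply(value_list, function(v) all(values %in% v), logical(1))
```

The naive comparison is the pattern from the question; the element-wise version is what the map_lgl() call in the pipe is doing.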
Additional Resources:
essential reading on vectors

How to create new column (using dplyr's mutate) based on conditions applied on the entire piped dataframe

I am looking of a way to create a new column (using dplyr's mutate) based on certain "conditions".
library(tidyverse)
qq <- 5
df <- data.frame(rn = 1:qq,
a = rnorm(qq,0,1),
b = rnorm(qq,10,5))
myf <- function(dataframe,value){
result <- dataframe %>%
filter(rn<=value) %>%
nrow
return(result)
}
The above example is a rather simplified version for which I am trying to filter the piped dataframe (df) and obtain a new column (foo) whose values will depict how many rows there are with rn less than or equal to the current rn (each row's rn - coming from the piped df ). Below you can see the output I am getting vs the one I expect to obtain :
df %>%
mutate(
foo_i_am_getting = myf(.,rn),
foo_expected = 1:qq)
rn a b foo_i_am_getting foo_expected
1 1 -0.5403937 -4.945643 5 1
2 2 0.7169147 2.516924 5 2
3 3 -0.2610024 -7.003944 5 3
4 4 -0.9991419 -1.663043 5 4
5 5 1.4002610 15.501411 5 5
The actual calculation I am trying to perform is more cumbersome, however, if I solve the above simplified version, I believe I can handle the rest of the manipulation/calculations inside the custom function.
BONUS QUESTION : Currently the name of the column I want to apply the filter on (i.e. rn) is hardcoded in the custom function (filter(rn<=value)). It would be great if this was an argument of the custom function, to be passed 'tidyverse' style - i.e. without quotation marks - e.g. myf <- function(dataframe,rn,value)
Disclaimer : I 've done my best to describe the problem at hand, however, if there are still unclear spots please let me know so I can elaborate further.
Thanks in advance for your support!

You need to do it one value at a time, because right now you are passing the whole rn vector to filter instead of a single value per row:
df %>%
mutate(
foo_i_am_getting = map_dbl(.$rn, function(x) nrow(filter(., rn <= x))),
foo_expected = 1:qq)
Now 1 is passed to filter for the rn column (and the function returns the number of matching rows), then 2, and so on for each value of rn.
Function could be:
myf <- function(vec_filter, dataframe, vec_rn) {
map_dbl(vec_filter, ~ nrow(filter(dataframe, {{vec_rn}} <= .x)))
}
df %>%
mutate(
foo_i_am_getting = map_dbl(.$rn, function(x) nrow(filter(., rn <= x))),
foo_expected = 1:qq,
foo_function = myf(rn, ., rn))
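The one-value-at-a-time idea can also be sketched in base R, without filter(). This is only an illustration under the question's setup: the data frame below mirrors the question's df, and set.seed() is added here so the random columns are reproducible.

```r
set.seed(1)  # added for reproducibility; not in the original question
qq <- 5
df <- data.frame(rn = 1:qq,
                 a = rnorm(qq, 0, 1),
                 b = rnorm(qq, 10, 5))

# For each rn value, count the rows whose rn is less than or equal to it
df$foo <- vapply(df$rn, function(x) sum(df$rn <= x), integer(1))
```

Summing a logical vector counts its TRUE entries, so sum(df$rn <= x) plays the role of nrow(filter(df, rn <= x)).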

R use of lapply() to populate and name one column in list of dataframes

After searching for some time, I cannot find a smooth R-esque solution.
I have a list of vectors that I want to convert to dataframes and add a column with the names of the vectors. I can't do this with cbind() and melt() to a single dataframe because the vectors have different numbers of rows.
Basic example would be:
list<-list(a=c(1,2,3),b=c(4,5,6,7))
var<-"group"
What I have come up with and works is:
list<-lapply(list, function(x) data.frame(num=x,grp=""))
for (j in 1:length(list)){
list[[j]][,2]<-names(list[j])
names(list[[j]])[2]<-var
}
But I am trying to better use lapply() and have cleaner coding practices. Right now I rely so heavily on for and if statements, which a lot of the base functions do already and much more efficiently than I can code at this point.
The pseudo code I would like is something like:
list<-lapply(list, function(x) data.frame(num=x,get(var)=names(x))
Is there a clean way to get this done?
Second closely related question, if I already have a list of dataframes, why is it so hard to reassign column values and names using lapply()?
So using something like:
list<-list(a=data.frame(num=c(1,2,3),grp=""),b=data.frame(num=c(4,5,6,7),grp=""))
var<-"group"
#pseudo code
list<-lapply(list, function(x) x[,2]<-names(x)) #populate second col with name of df[x]
list<-lapply(list, function(x) names[[x]][2]<-var) #set 2nd col name to 'var'
The first line of pseudo code throws an error about matching row lengths. Why does lapply() not just loop over and repeat names(x) like the same function on a single dataframe does in a for loop?
For the second line, as I understand it I can use setNames() to reassign all the column names, but how do I make this work for just one of the col names?
Many thanks for any ideas or pointing to other threads that cover this and helping me understand the behavior of lapply() in this context.

A full R base approach without using loops
> l<-list(a=c(1,2,3),b=c(4,5,6,7))
> data.frame(grp=rep(names(l), lengths(l)), num=unlist(l), row.names = NULL)
grp num
1 a 1
2 a 2
3 a 3
4 b 4
5 b 5
6 b 6
7 b 7

Related to your first/main question you can use the function enframe from package tibble for this purpose
library(tibble)
library(tidyr)
library(dplyr)
l<-list(a=c(1,2,3),b=c(4,5,6,7))
l %>%
enframe(name = "group", value="value") %>%
unnest(value) %>%
group_split(group)

Try this:
library(dplyr)
mylist <- list(a = c(1,2,3), b = c(4,5,6,7))
bind_rows(lapply(names(mylist), function(x) tibble(grp = x, num = mylist[[x]])))
# A tibble: 7 x 2
grp num
<chr> <dbl>
1 a 1
2 a 2
3 a 3
4 b 4
5 b 5
6 b 6
7 b 7
This is essentially a lapply-based solution where you iterate over the names of your list, and not the individual list elements themselves. If you prefer to do everything in base R, note that the above is equivalent to
do.call(rbind, lapply(names(mylist), function(x) data.frame(grp = x, num = mylist[[x]], stringsAsFactors = F)))
Having said that, tibbles, as a modern implementation of data.frames, are preferred, as is bind_rows over the do.call(rbind, ...) construct.
As to the second question, note the following:
lapply(mylist, function(x) str(x))
num [1:3] 1 2 3
num [1:4] 4 5 6 7
....
lapply(mylist, function(x) names(x))
$a
NULL
$b
NULL
What you see here is that the function inside of lapply gets the elements of mylist. In this case, it gets to work with the numeric vector. This does not have any name as far as the function that is called inside lapply is concerned. To highlight this, consider the following:
names(c(1,2,3))
NULL
Which is the same: the vector c(1,2,3) does not have a name attribute.
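Since the function inside lapply() never sees the element's name, one way to keep a list of data frames, as the question asked for, is to iterate over the elements and their names together. The following is a minimal base R sketch using Map(); the result object name is made up, and var is taken from the question.

```r
mylist <- list(a = c(1, 2, 3), b = c(4, 5, 6, 7))
var <- "group"

# Map() walks over the elements and their names in parallel
result <- Map(function(values, name) {
  out <- data.frame(num = values, grp = name, stringsAsFactors = FALSE)
  names(out)[2] <- var  # rename only the second column
  out
}, mylist, names(mylist))
```

The single string in name is recycled to the length of values by data.frame(), which is why vectors of different lengths are not a problem here.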

using variable column names in dplyr summarise

I found this question already asked but without a proper answer: R using variable column names in summarise function in dplyr
I want to calculate the difference between two column means, but the column name should be provided by variables... So far I found only the function as.name to provide column names as text, but this somehow doesn't work here...
With fixed column names it works.
x <- c('a','b')
df <- group_by(data.frame(a=c(1,2,3,4), b=c(2,3,4,5), c=c(1,1,2,2)), c)
df %>% summarise(mean(a) - mean(b))
With variable columns, it doesn't work
df %>% summarise(mean(x[1]) - mean(x[2]))
df %>% summarise(mean(as.name(x[1])) - mean(as.name(x[2])))
Since this was asked already 3 years ago and dplyr is under good development, I am wondering if there is an answer to this now.

You can use base::get:
df %>% summarise(mean(get(x[1])) - mean(get(x[2])))
# # A tibble: 2 x 2
# c `mean(a) - mean(b)`
# <dbl> <dbl>
# 1 1 -1
# 2 2 -1
get searches the current environment by default.
As the error message says, mean expects a logical or numeric object, as.name returns a name:
class(as.name("a")) # [1] "name"
You could evaluate your name, that would work as well :
df %>% summarise(mean(eval(as.name(x[1]))) - mean(eval(as.name(x[2]))))
# # A tibble: 2 x 2
# c `mean(eval(as.name(x[1]))) - mean(eval(as.name(x[2])))`
# <dbl> <dbl>
# 1 1 -1
# 2 2 -1
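Another option, not mentioned in the answer above, is dplyr's .data pronoun, which looks a column up by the string stored in a variable. The sketch below assumes a dplyr version that supports .data; the column name diff and the object name res are made up for illustration.

```r
library(dplyr)

x <- c("a", "b")
df <- data.frame(a = c(1, 2, 3, 4), b = c(2, 3, 4, 5), c = c(1, 1, 2, 2))

# .data[[...]] resolves the string to a column of the current group
res <- df %>%
  group_by(c) %>%
  summarise(diff = mean(.data[[x[1]]]) - mean(.data[[x[2]]]))
```

Unlike get(), the pronoun only looks inside the data, so a same-named object elsewhere in the environment cannot be picked up by accident.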

This is not a direct answer to your question but maybe could be useful for other people reading your post:
It could be easier to use variable columns directly, like
df %>% summarise(someName = mean(.[[1]]) - mean(.[[2]]))
############ which is the same as ############
df %>% summarise(someName = mean(.[,1,drop=T]) - mean(.[,2,drop=T]))
Note that drop=T is because when using just single square bracket the result preserves the class (in this case class( . ) = data.frame) and this isn't what we want (columns must be given in vector form to the summarise function)

dplyr gives me different answers depending on how I select columns

I may be having trouble understanding some of the basics of dplyr, but it appears that R behaves very differently depending on whether you subset columns as one column data frames or as traditional vectors. Here is an example:
mtcarsdf<-tbl_df(mtcars)
example<-function(x,y) {
df<-tbl_df(data.frame(x,y))
df %>% group_by(x) %>% summarise(total=sum(y))
}
#subsetting to cyl this way gives integer vector
example(mtcars$gear,mtcarsdf$cyl)
# 3 112
# 4 56
# 5 30
#subsetting this way gives a one column data table
example(mtcars$gear,mtcarsdf[,"cyl"])
# 3 198
# 4 198
# 5 198
all(mtcarsdf$cyl==mtcarsdf[,"cyl"])
# TRUE
Since my inputs are technically equal the fact that I am getting different outputs tells me I am misunderstanding how the two objects behave. Could someone please enlighten me on how to improve the example function so that it can handle different objects more robustly?
Thanks

First, the items that you are comparing with == are not really the same. This could be identified using all.equal instead of ==:
all.equal(mtcarsdf$cyl, mtcarsdf[, "cyl"])
## [1] "Modes: numeric, list"
## [2] "Lengths: 32, 1"
## [3] "names for current but not for target"
## [4] "Attributes: < target is NULL, current is list >"
## [5] "target is numeric, current is tbl_df"
With that in mind, you should be able to get the behavior you want by using [[ to extract the column instead of [.
mtcarsdf <- tbl_df(mtcars)
example<-function(x,y) {
df<-tbl_df(data.frame(x,y))
df %>% group_by(x) %>% summarise(total=sum(y))
}
example(mtcars$gear, mtcarsdf[["cyl"]])
However, a safer approach might be to integrate the renaming of the columns as part of your function, like this:
example2 <- function(x, y) {
df <- tbl_df(setNames(data.frame(x, y), c("x", "y")))
df %>% group_by(x) %>% summarise(total = sum(y))
}
Then, any of the following should give you the same results.
example2(mtcars$gear, mtcarsdf$cyl)
example2(mtcars$gear, mtcarsdf[["cyl"]])
example2(mtcars$gear, mtcarsdf[, "cyl"])
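A likely reason the renaming matters (this is an inference from how data.frame() names its columns, not something stated in the answer above) is that a one-column data frame keeps its own column name when passed to data.frame(). The base R sketch below uses a hypothetical helper make_df() and plain mtcars instead of tbl_df to show the difference.

```r
# Hypothetical helper mirroring the data.frame(x, y) call in the question
make_df <- function(x, y) data.frame(x, y)

# A vector picks up the argument name "y"
names_from_vector <- names(make_df(mtcars$gear, mtcars[["cyl"]]))

# A one-column data frame keeps its own column name, "cyl"
names_from_df <- names(make_df(mtcars$gear, mtcars["cyl"]))

# Total of the whole cyl column, for comparison with the question's output
cyl_total <- sum(mtcars[["cyl"]])
```

With no column called y in the second case, sum(y) inside summarise() would plausibly fall back to the function argument y, that is, the entire column, which would explain the same total appearing for every group.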
