I found this question asked already, but without a proper answer: R using variable column names in summarise function in dplyr
I want to calculate the difference between two column means, but the column names should be provided by variables. So far the only function I have found for supplying a column name as text is as.name, but that somehow doesn't work here.
With fixed column names it works:
x <- c('a','b')
df <- group_by(data.frame(a=c(1,2,3,4), b=c(2,3,4,5), c=c(1,1,2,2)), c)
df %>% summarise(mean(a) - mean(b))
With variable column names, it doesn't work:
df %>% summarise(mean(x[1]) - mean(x[2]))
df %>% summarise(mean(as.name(x[1])) - mean(as.name(x[2])))
Since this was asked three years ago and dplyr is under active development, I am wondering whether there is an answer to this now.
You can use base::get:
df %>% summarise(mean(get(x[1])) - mean(get(x[2])))
# # A tibble: 2 x 2
# c `mean(a) - mean(b)`
# <dbl> <dbl>
# 1 1 -1
# 2 2 -1
get will search in current environment by default.
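Since the question asks what current dplyr offers: on versions recent enough to expose the .data pronoun (roughly dplyr 0.7 onwards; treat the exact version bound as my assumption), the same computation can be sketched as:
library(dplyr)
x <- c('a', 'b')
df <- group_by(data.frame(a = c(1,2,3,4), b = c(2,3,4,5), c = c(1,1,2,2)), c)
# .data[[<string>]] looks the character name up in the data mask, per group
df %>% summarise(diff = mean(.data[[x[1]]]) - mean(.data[[x[2]]]))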
As the error message says, mean expects a logical or numeric object, while as.name returns a name:
class(as.name("a")) # [1] "name"
You could evaluate your name; that would work as well:
df %>% summarise(mean(eval(as.name(x[1]))) - mean(eval(as.name(x[2]))))
# # A tibble: 2 x 2
# c `mean(eval(as.name(x[1]))) - mean(eval(as.name(x[2])))`
# <dbl> <dbl>
# 1 1 -1
# 2 2 -1
This is not a direct answer to your question, but it may be useful to other people reading your post:
It could be easier to use the variable columns directly by position, like
df %>% summarise(someName = mean(.[[1]]) - mean(.[[2]]))
############ which is the same as ############
df %>% summarise(someName = mean(.[,1,drop=T]) - mean(.[,2,drop=T]))
Note that drop=T is needed because single-bracket subsetting preserves the class of the object (here class(.) is still a data.frame), which is not what we want: the columns must be passed to summarise in vector form.
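A quick way to see what drop does here (a sketch with a small tibble; group_by() returns a tibble-backed grouped_df, and single brackets on a tibble never drop):
library(dplyr)
tb <- tibble(a = 1:3, b = 4:6)
class(tb[, 1])              # still a data frame ("tbl_df" ...), which mean() rejects
class(tb[, 1, drop = TRUE]) # "integer" - the bare column vector we actually want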
I'm trying to make a boolean column for the dataframe testdf: when grouped by id, the boolean should indicate whether everything in the vector values exists in that id's values.
I believe that ultimately this is a question about the different vector/list data types in R, which I still don't understand.
The vector values comes from the second column of lookup. R says it's a vector but not a list (should I use as.list(lookup$values) instead?).
testdf <- data.frame(id = c('a','a','a','b','b'),
value = c(1,2,3,1,2))
lookup <- data.frame(col1 = c('x','y','z'),
values = c(1,2,3))
values <- lookup$values
These produce:
> testdf
id value
1 a 1
2 a 2
3 a 3
4 b 1
5 b 2
> lookup
col1 values
1 x 1
2 y 2
3 z 3
> values
[1] 1 2 3
And my intended result would look like this: values_bool is TRUE for an id when all elements of values exist in that id's value_list. Note that I formatted the lists in a generic way.
testdf
id value_list values_bool
1 a [1,2,3] TRUE
2 b [1,2] FALSE
This page includes some more information on the creation and use of list columns in R, although it doesn't explain the difference in data types generated within each list.
For example, a cell in the list column created by nest() looks like this:
asia <tibble [59 × 1]>
and a cell in the list column created by summarize() and list() looks like this:
asia <chr [59]>
I tried to create list columns using two methods from that page in my code:
version1 <- testdf %>%
# Make column listing values for each id
group_by(id) %>%
summarize(value_list = list(value)) %>%
ungroup()
version2 <- testdf %>%
# Nest values into list
nest(value_list = value)
If you run this you can see they produce different list types.
> version1
# A tibble: 2 × 2
id value_list
<chr> <list>
1 a <dbl [3]>
2 b <dbl [2]>
> version2
# A tibble: 2 × 2
id value_list
<chr> <list>
1 a <tibble [3 × 1]>
2 b <tibble [2 × 1]>
And when I try to add this line at the end of each, the output gives all FALSE even though one should be TRUE.
# Mark whether all values exist in value_list for each id
mutate(values_bool = values %in% value_list)
So what's going on with the dataframe list data types when grouping value from testdf and selecting values from lookup?
Thanks so much for your help.
@akrun answered this in the comments, but I love this question, which is really a deep inquiry into R data structures and functional programming, so I'm going to answer it in a longer form here and arrive at the same answer.
First the minimal reproducible problem setup (I simplify version1 and version2 to v1 and v2):
library(tidyverse)
testdf <- data.frame(id = c('a','a','a','b','b'),
value = c(1,2,3,1,2))
lookup <- data.frame(col1 = c('x','y','z'),
values = c(1,2,3))
values <- lookup$values
v1 <- testdf %>%
# Make column listing values for each id
group_by(id) %>%
summarize(value_list = list(value)) %>%
ungroup()
v2 <- testdf %>%
# Nest values into list
nest(value_list = value)
In v1 and v2, we create a value_list column that is a nested list. The list contents differ (you can tell by the reported dimensions in the OP).
v1$value_list is a nested atomic vector
v2$value_list is a nested data.frame
This is because list(), when passed an atomic vector, stores it as an atomic vector, whereas nest(), when passed an atomic vector, stores it as a data.frame. Why? Because nest() comes from the functional programming paradigm, where we use the map() functions (a type-safe, next-generation lapply()) to operate on data.frame objects and, importantly, to return data.frame objects.
Okay, now let's start walking through the solution. It helps to understand how R represents the data types in v1 and v2.
We begin with v1. Calling list() on an atomic vector results in a 1D atomic vector. We can pull it out and examine it:
Note that throughout the post I only include code. Copy/paste and run it interactively to see the output.
v1 # a dataframe
v1$value_list # a column, which is itself a list (see below)
class(v1$value_list) # verify this is a list
v1$value_list[[1]] # b/c it's a list, use [[ notation to pull element 1
class(v1$value_list[[1]]) # it's an atomic vector - "numeric" = integer or double
typeof(v1$value_list[[1]]) # typeof() to show it's double
typeof(1L) # as opposed to a strict integer type
length(v1$value_list[[1]]) # atomic vectors have property length
Now we examine v2:
v2 # a dataframe
v2$value_list # a column, which is itself a list (see below)
class(v2$value_list) # verify this is a list
v2$value_list[[1]] # b/c it's a list, use [[ notation to pull element 1
class(v2$value_list[[1]]) # it's a data.frame, NOT an atomic vector!
typeof(v2$value_list[[1]]) # typeof() to show a dataframe is a just special list
dim(v2$value_list[[1]]) # dataframes have property dim
length(v2$value_list[[1]]) # dataframes also have property length, which is ncol()
With that background on object types, let's move to the problem posed in the OP.
And when I try to add this line at the end of each, the output gives all FALSE even though one should be TRUE.
# Mark whether all values exist in value_list for each id
mutate(values_bool = values %in% value_list)
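As an aside, here is my reading of why that attempt returns all FALSE (worth verifying in your own session): %in% with a list on the right-hand side coerces the list elements to character by deparsing them, so a numeric value never matches.
as.character(list(c(1, 2, 3), c(1, 2))) # "c(1, 2, 3)" "c(1, 2)"
1 %in% list(c(1, 2, 3), c(1, 2))        # FALSE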
Essentially, we want to test if all values occur in v1$value_list or v2$value_list. The expected result is TRUE, and then FALSE, for grouping variables a and b respectively. First, we do this the long way, and then we build to a functional programming solution.
# exhaustive approach for list 1 with atomic vectors
values %in% v1$value_list[[1]] # all values present
all(values %in% v1$value_list[[1]]) # returns TRUE as expected
# exhaustive approach for list 2 with atomic vectors
values %in% v1$value_list[[2]] # one value missing
all(values %in% v1$value_list[[2]]) # returns FALSE as expected
# now the functional programming solution without indexing. Notes on syntax:
# - Pass a list (or a vector) as the first argument.
# - Use ~ to indicate the start of the function to apply across the list
# - Use .x as a placeholder within the function for the ith list element. Think: concise for-loop
map(v1$value_list, ~all(values %in% .x)) # returns c(TRUE, FALSE)
# Two solutions: because we want to output a vector we can unlist(), or
# use map_lgl() which only returns a boolean and fails if the result is NOT bool
unlist(map(v1$value_list, ~all(values %in% .x))) # returns c(TRUE, FALSE)
map_lgl(v1$value_list, ~all(values %in% .x)) # returns c(TRUE, FALSE)
# now the functional programming solution in a pipe (put it all together):
v1 %>%
mutate(values_bool = map_lgl(value_list, ~all(values %in% .x)))
# we can do the same with v2, but we need to convert the dataframe to a vector
# so that it works with all() -- this is a bit tedious and verbose, because we
# used nest() which is designed to work with nested dataframes, but we ultimately
# want to use all() which takes vectors, so there was no need to nest() in the first
# place. For examples of how to use the power of nest():
# https://r4ds.had.co.nz/many-models.html#creating-list-columns
v2 %>%
mutate(values_bool = map_lgl(value_list, ~all(values %in% .x$value)))
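For comparison, a base R sketch that sidesteps list columns entirely (same testdf and values as above):
# split value by id, then test each group's vector against values
sapply(split(testdf$value, testdf$id), function(v) all(values %in% v))
#     a     b
#  TRUE FALSE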
Additional Resources:
essential reading on vectors
I am looking for a way to create a new column (using dplyr's mutate) based on certain "conditions".
library(tidyverse)
qq <- 5
df <- data.frame(rn = 1:qq,
a = rnorm(qq,0,1),
b = rnorm(qq,10,5))
myf <- function(dataframe,value){
result <- dataframe %>%
filter(rn<=value) %>%
nrow
return(result)
}
The above example is a much simplified version in which I am trying to filter the piped dataframe (df) and obtain a new column (foo) whose values show how many rows have rn less than or equal to the current row's rn (coming from the piped df). Below you can see the output I am getting versus the one I expect to obtain:
df %>%
mutate(
foo_i_am_getting = myf(.,rn),
foo_expected = 1:qq)
rn a b foo_i_am_getting foo_expected
1 1 -0.5403937 -4.945643 5 1
2 2 0.7169147 2.516924 5 2
3 3 -0.2610024 -7.003944 5 3
4 4 -0.9991419 -1.663043 5 4
5 5 1.4002610 15.501411 5 5
The actual calculation I am trying to perform is more cumbersome, however, if I solve the above simplified version, I believe I can handle the rest of the manipulation/calculations inside the custom function.
BONUS QUESTION: Currently the name of the column I want to apply the filter on (i.e. rn) is hardcoded in the custom function (filter(rn<=value)). It would be great if this were an argument of the custom function, passed 'tidyverse' style, i.e. without quotation marks, e.g. myf <- function(dataframe,rn,value).
Disclaimer: I've done my best to describe the problem at hand; however, if anything is still unclear, please let me know so I can elaborate further.
Thanks in advance for your support!
You need to do it one element at a time, because right now you are passing the whole vector to filter instead of a single value each time:
df %>%
mutate(
foo_i_am_getting = map_dbl(.$rn, function(x) nrow(filter(., rn <= x))),
foo_expected = 1:qq)
Now we pass 1 to filter for the rn column (and the function returns the number of rows), then 2, and so on.
The function could be:
myf <- function(vec_filter, dataframe, vec_rn) {
map_dbl(vec_filter, ~ nrow(filter(dataframe, {{vec_rn}} <= .x)))
}
df %>%
mutate(
foo_i_am_getting = map_dbl(.$rn, function(x) nrow(filter(., rn <= x))),
foo_expected = 1:qq,
foo_function = myf(rn, ., rn))
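For this particular count you can also skip the repeated filter() calls; a sketch of the same idea, comparing the rn column against each of its own values inside a single mutate():
df %>%
  mutate(foo = map_dbl(rn, ~ sum(rn <= .x)))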
GIVEN I want to transform a series of variables using information contained in a yaml file
AND I have the contents of the yaml file loaded into memory as list
WHEN I go to apply the re-mapping using the $responses element of each entry in the yaml file
THEN I get a new data.frame/tibble with columns appended using the names & recodes in each field's map entry
Below is a single entry in the yaml file. The entry health corresponds to a column name in the original data.frame. The text field is irrelevant for this question. The responses field contains the re-mapping recipe.
health:
study_name: global_health
text: Would you say your health in general is excellent, very good, good, fair, or poor
responses:
1: Excellent
2: Very good
3: Good
4: Fair
5: Poor
Here is an example of the original data - just focusing on a single field as an example:
health
1 1
2 3
3 1
4 2
5 2
6 4
What I would like in this case (though with many more variables) is the following:
health global_health
1 1 Excellent
2 3 Good
3 1 Excellent
4 2 Very good
5 2 Very good
6 4 Fair
What I have got so far is the following:
data_map <- yaml::read_yaml(map_filepath)
study_cols <- names(data_map)
And I have been able to leverage this structure to ignore entries that do not have a populated responses field in the yaml file - using this as my means of tracking which columns need to be recoded and which do not.
library(tidyverse)
# create simple local function to identify which have "responses" that need recoding
pluck_func <- function(x) !is.null(x$responses)
# create mask using the function
recode_cols_mask <- data_map %>%
map(~pluck(., pluck_func)) %>%
unlist()
# Identify variables that need recoding
recode_names <- study_cols[recode_cols_mask]
Now, what I would like to do is take the variables that exist in recode_names and apply the renaming map for only those variables. My guess is that there is some clever solution with purrr's map family of functions; I just haven't been able to find the right combination of map() and mutate(). For completeness, below is an entry that would not be recoded and returned. Again, the way this is determined is that is.null(data_map[['ANALWT_C']][['responses']]) effectively evaluates to TRUE.
ANALWT_C:
study_name: weights
text: Case-level study weight based on Census estimates
Solutions do not need to be in tidyverse for what it is worth. Happy to use base R. I just tend to find tidyverse processing a little more readable.
UPDATE
Here is my current solution which I would love to simplify if I can. But if not I can live with this for the moment.
recode_func <- function(df, data_map, recode_names, study_names) {
for(v in recode_names) {
df[data_map[[v]][['study_name']]] <- factor(
df[[v]],
levels = seq_along(data_map[[v]][['responses']]),
labels = data_map[[v]][['responses']] %>% unlist()
)
}
for(v in setdiff(names(data_map), recode_names)) {
df[data_map[[v]][['study_name']]] <- df[[v]]
}
study_df <- df[study_names]
return(list(base_df = df,
study_df = study_df))
}
with study_names defined as:
get_study_names <- function(x) x[['study_name']]
study_names <- data_map %>%
map(~pluck(., get_study_names)) %>%
unlist()
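(Side note: purrr can extract a named element directly, so the helper above can be condensed to one line; the accepted answer below uses the same shorthand.)
study_names <- map_chr(data_map, "study_name")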
RESOLVED
My current implementation based on the accepted answer below:
#' Recodes dataframe using info pulled from the yaml file in the main function \code{create_study_datasets}
#'
#' @param df target data.frame
#'
#' @param data_map list read in from yaml file containing data renames and recoding recipes
#'
#' @param recode_names a character vector of names derived from the data_map in pre-processing steps that take place in
#' the parent function. These are the fields that need to be recoded in some fashion.
#'
#' @param study_names a character vector of names derived from the data_map in pre-processing steps that take place in
#' the parent function. These are all of the fields that should be returned - whether they are recoded and renamed or
#' just renamed according to the data_map.
#'
#' @seealso create_study_datasets
helper_recode_func <- function(df, data_map, recode_names, study_names) {
# Apply the recodes as appropriate from the data map
df <- df %>%
imap_dfc(~ if (hasName(data_map, .y) && hasName(data_map[[.y]], "responses"))
recode(.x, !!! data_map[[.y]][["responses"]])) %>%
setNames(map_chr(data_map, "study_name")[names(.)]) %>%
bind_cols(df, .)
# Apply passthrough/name-change for variables that do not need recoding
df <- df %>%
select_if((names(.) %in% setdiff(names(data_map), recode_names))) %>%
setNames(map_chr(data_map, "study_name")[names(.)]) %>%
bind_cols(df, .)
study_df <- df[study_names]
return(list(base_df = df,
study_df = study_df))
}
Which when executed returns the expected outputs for both variables that require recoding and those that do not:
ANALWT_C weights health global_health
1 1274.9937 1274.9937 1 Excellent
2 550.3692 550.3692 3 Good
3 6043.4082 6043.4082 1 Excellent
4 262.2439 262.2439 2 Very good
5 8354.2522 8354.2522 2 Very good
f <- function(x) recode(x, !!! data_map[[cur_column()]][["responses"]])
df %>%
mutate(across(any_of(names(keep(data_map, hasName, "responses"))),
f,
.names = "{data_map[[.col]][[\"study_name\"]]}"))
#> health global_health
#> 1 1 Excellent
#> 2 3 Good
#> 3 1 Excellent
#> 4 2 Very good
#> 5 2 Very good
#> 6 4 Fair
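To see what the !!! splice inside f is doing, you can run recode() on its own (a sketch against the health entry, assuming yaml::read_yaml stored the responses as a list named "1" through "5"):
recode(c(1, 3, 2), !!! data_map[["health"]][["responses"]])
#> [1] "Excellent" "Good"      "Very good"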
Variant:
df %>%
imap_dfc(~ if (hasName(data_map[[.y]], "responses"))
recode(.x, !!! data_map[[.y]][["responses"]])) %>%
setNames(map_chr(data_map, "study_name")[names(.)]) %>%
bind_cols(df, .)
I am trying to search a database and then label the output with a name derived from the original search, "derived_name" in the reproducible example below. I am using a dplyr pipe %>%, and I am having trouble with quasiquotation and/or non-standard evaluation. Specifically, using count_colname, a character object derived from "derived_name", in the final top_n() function fails to subset the dataframe.
search_name <- "derived_name"
set.seed(1)
letrs <- letters[rnorm(52, 13.5, 5)]
letrs_count.df <- letrs %>%
table() %>%
as.data.frame()
count_colname <- paste0(search_name, "_letr_count")
colnames(letrs_count.df) <- c("letr", count_colname)
letrs_top.df <- letrs_count.df %>%
top_n(5, count_colname)
identical(letrs_top.df, letrs_count.df)
# [1] TRUE
Based on this discussion I thought the code above would work. And this post led me to try top_n_(), which does not seem to exist.
I am studying vignette("programming") which is a little over my head. This post led me to try the !! sym() syntax, which works, but I have no idea why! Help understanding why the below code works would be much appreciated. Thanks.
colnames(letrs_count.df) <- c("letr", count_colname)
letrs_top.df <- letrs_count.df %>%
top_n(5, (!! sym(count_colname)))
letrs_top.df
# letr derived_name_letr_count
# 1 l 5
# 2 m 6
# 3 o 7
# 4 p 5
# 5 q 6
Additional confusing examples based on @lionel's and @Tung's questions and comments below. What is confusing me here is that the help files say that sym() "takes strings as input and turns them into symbols" and that !! "unquotes its argument". However, in the examples below, sym(count_colname) appears to unquote to derived_name_letr_count. I do not understand why the !! is needed in !! sym(count_colname), since sym(count_colname) and qq_show(!! sym(count_colname)) give the same value.
count_colname
# [1] "derived_name_letr_count"
sym(count_colname)
# derived_name_letr_count
qq_show(count_colname)
# count_colname
qq_show(sym(count_colname))
# sym(count_colname)
qq_show(!! sym(count_colname))
# derived_name_letr_count
qq_show(!! count_colname)
# "derived_name_letr_count"
According to the top_n documentation (?top_n), it doesn't support character/string input, which is why the 1st example didn't work. In your 2nd example, rlang::sym converted the string to a variable name (a symbol), and then !! unquoted it so that it could be evaluated inside top_n. Note: top_n and other dplyr verbs automatically quote their inputs.
Using rlang::qq_show as suggested by @lionel, we can see it doesn't work because there is no count_colname column in letrs_count.df:
library(tidyverse)
set.seed(1)
letrs <- letters[rnorm(52, 13.5, 5)]
letrs_count.df <- letrs %>%
table() %>%
as.data.frame()
search_name <- "derived_name"
count_colname <- paste0(search_name, "_letr_count")
colnames(letrs_count.df) <- c("letr", count_colname)
letrs_count.df
#> letr derived_name_letr_count
#> 1 b 1
#> 2 c 1
#> 3 f 2
...
rlang::qq_show(top_n(letrs_count.df, 5, count_colname))
#> top_n(letrs_count.df, 5, count_colname)
sym and !! create the right column name, one that does exist in letrs_count.df:
rlang::qq_show(top_n(letrs_count.df, 5, !! sym(count_colname)))
#> top_n(letrs_count.df, 5, derived_name_letr_count)
letrs_count.df %>%
top_n(5, !! sym(count_colname))
#> letr derived_name_letr_count
#> 1 l 5
#> 2 m 6
#> 3 o 7
#> 4 p 5
#> 5 q 6
top_n(x, n, wt)
Arguments:
x: a tbl() to filter
n: number of rows to return. If x is grouped, this is the number of rows per group. Will include more than n rows if there are ties. If n is positive, selects the top n rows. If negative, selects the bottom n rows.
wt: (Optional). The variable to use for ordering. If not specified, defaults to the last variable in the tbl.
This argument is automatically quoted and later evaluated in the context of the data frame. It supports unquoting. See vignette("programming") for an introduction to these concepts.
See also these answers: 1st, 2nd, 3rd
So, I've realized that what I was struggling with in this question (and many other problems) is not really quasiquotation and/or non-standard evaluation, but rather converting character strings into object names. Here is my new solution:
letrs_top.df <- letrs_count.df %>%
top_n(5, get(count_colname))
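If you prefer to stay with tidy-eval tools, the .data pronoun performs the same string-to-column lookup; and on dplyr >= 1.0.0 (version bound is my assumption), slice_max() supersedes top_n(). A sketch:
letrs_count.df %>%
  top_n(5, .data[[count_colname]])
# or, on dplyr >= 1.0.0:
letrs_count.df %>%
  slice_max(.data[[count_colname]], n = 5)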
Does anyone know how to remove an entire column from a data.frame in R? For example if I am given this data.frame:
> head(data)
chr genome region
1 chr1 hg19_refGene CDS
2 chr1 hg19_refGene exon
3 chr1 hg19_refGene CDS
4 chr1 hg19_refGene exon
5 chr1 hg19_refGene CDS
6 chr1 hg19_refGene exon
and I want to remove the 2nd column.
You can set it to NULL.
> Data$genome <- NULL
> head(Data)
chr region
1 chr1 CDS
2 chr1 exon
3 chr1 CDS
4 chr1 exon
5 chr1 CDS
6 chr1 exon
As pointed out in the comments, here are some other possibilities:
Data[2] <- NULL # Wojciech Sobala
Data[[2]] <- NULL # same as above
Data <- Data[,-2] # Ian Fellows
Data <- Data[-2] # same as above
You can remove multiple columns via:
Data[1:2] <- list(NULL) # Marek
Data[1:2] <- NULL # does not work!
Be careful with matrix-subsetting though, as you can end up with a vector:
Data <- Data[,-(2:3)] # vector
Data <- Data[,-(2:3),drop=FALSE] # still a data.frame
To remove one or more columns by name, when the column names are known (as opposed to being determined at run-time), I like the subset() syntax. E.g. for the data-frame
Data <- data.frame(a=1:3, d=2:4, c=3:5, b=4:6)
to remove just the a column you could do
Data <- subset( Data, select = -a )
and to remove the b and d columns you could do
Data <- subset( Data, select = -c(d, b ) )
You can remove all columns between d and b with:
Data <- subset( Data, select = -c( d : b ) )
As I said above, this syntax works only when the column names are known. It won't work when say the column names are determined programmatically (i.e. assigned to a variable). I'll reproduce this Warning from the ?subset documentation:
Warning:
This is a convenience function intended for use interactively.
For programming it is better to use the standard subsetting
functions like '[', and in particular the non-standard evaluation
of argument 'subset' can have unanticipated consequences.
(For completeness) If you want to remove columns by name, you can do this:
cols.dont.want <- "genome"
cols.dont.want <- c("genome", "region") # if you want to remove multiple columns
data <- data[, ! names(data) %in% cols.dont.want, drop = F]
Including drop = F ensures that the result will still be a data.frame even if only one column remains.
The posted answers are very good when working with data.frames. However, these tasks can be pretty inefficient from a memory perspective. With large data, removing a column can take an unusually long amount of time and/or fail due to out of memory errors. Package data.table helps address this problem with the := operator:
library(data.table)
> dt <- data.table(a = 1, b = 1, c = 1)
> dt[,a:=NULL]
b c
[1,] 1 1
I should put together a bigger example to show the differences. I'll update this answer at some point with that.
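In the meantime, a small sketch of removing several columns at once by reference (quoted column names go on the left-hand side of :=):
dt[, c("b", "c") := NULL]  # both columns removed instantly, without copying dt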
There are several options for removing one or more columns with dplyr::select() and some helper functions. The helper functions can be useful because some do not require naming all the specific columns to be dropped. Note that to drop columns using select() you need to use a leading - to negate the column names.
Using the dplyr::starwars sample data for some variety in column names:
library(dplyr)
starwars %>%
select(-height) %>% # a specific column name
select(-one_of('mass', 'films')) %>% # any columns named in one_of()
select(-(name:hair_color)) %>% # the range of columns from 'name' to 'hair_color'
select(-contains('color')) %>% # any column name that contains 'color'
select(-starts_with('bi')) %>% # any column name that starts with 'bi'
select(-ends_with('er')) %>% # any column name that ends with 'er'
select(-matches('^v.+s$')) %>% # any column name matching the regex pattern
select_if(~!is.list(.)) %>% # not by column name but by data type
head(2)
# A tibble: 2 x 2
homeworld species
<chr> <chr>
1 Tatooine Human
2 Tatooine Droid
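On dplyr >= 1.0.0 (version bound is my assumption), select_if() is superseded, and the type-based drop at the end of that chain can be written with where():
starwars %>%
  select(-where(is.list))  # drop every list-column by type rather than by name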
You can also drop by column number:
starwars %>%
select(-2, -(4:10)) # column 2 and columns 4 through 10
With this you can remove the column and store the result in another variable:
df = subset(data, select = -c(genome) )
Using dplyr, the following works:
data <- select(data, -genome)
as per the documentation found here: https://www.marsja.se/how-to-remove-a-column-in-r-using-dplyr-by-name-and-index/
I just thought I'd add one that wasn't mentioned yet. It's simple but also interesting, because in all my perusing of the internet I did not see it, even though the highly related %in% appears in many places.
df <- df[ , -which(names(df) == 'removeCol')]
Also, I didn't see anyone post grep alternatives. These can be very handy for removing multiple columns that match a pattern.
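For example, a sketch of the grep route, with one caveat: when nothing matches, grep() returns integer(0), and indexing with -integer(0) selects zero columns, so the grepl() form is the safer one.
df <- df[ , -grep('^removeCol', names(df))]                 # risky if no match
df <- df[ , !grepl('^removeCol', names(df)), drop = FALSE]  # safe either way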