unquoted column names of a data frame in select function - r

I ran into some confusion while doing my homework. The problem is as follows:
Does this function defined below work if you provide it unquoted column names of a data frame? In 3-4 sentences, explain why or why not.
sel <- function(x, col_names) {
  select(x, col_names)
}
I used the iris data as an example:
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
5.1 3.5 1.4 0.2 setosa
I tried
select(iris, Sepal.Length)
it worked well,
but when I tried
sel(iris, Sepal.Length)
it didn't work; the error message is as below:
Error in .f(.x[[i]], ...): can't find 'Sepal.Length'
Can anyone help me out?

Unquoted column names do not work as-is inside a function; one way is to use the curly-curly ({{ }}) operator from rlang.
library(rlang)
library(dplyr) # needed for the %>% pipe used below
sel <- function(x, col_names) dplyr::select(x, {{ col_names }})
sel(iris, Sepal.Length) %>% head
# Sepal.Length
#1 5.1
#2 4.9
#3 4.7
#4 4.6
#5 5.0
#6 5.4
which is the successor to the earlier enquo()-and-unquote (!!) pattern:
sel <- function(x, col_names) dplyr::select(x, !!enquo(col_names))
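If you would rather pass the column name as a string, tidyselect's all_of() helper is another option. A small sketch (sel_chr is just an illustrative name, not part of the original answer):
# string-based variant using tidyselect's all_of()
sel_chr <- function(x, col_names) dplyr::select(x, dplyr::all_of(col_names))
sel_chr(iris, "Sepal.Length") %>% head(2)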

You can use an rlang/tidyeval approach, but in this case simply forwarding the dots (...) also works:
library(dplyr)
sel <- function(x, ...) select(x, ...)
# tests
sel(iris, Sepal.Length)
sel(iris, Sepal.Length, Sepal.Width)
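Because the dots are forwarded straight to select(), tidyselect helpers should pass through unchanged as well; a quick illustrative check (not from the original answer):
sel(iris, starts_with("Sepal")) %>% head(2)
sel(iris, -Species) %>% head(2)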

How to use subset() in a for loop in R

I need to select the levels of Species in the iris dataset (available in R) with the function subset() and calculate the mean of the column Petal.Length for each level, all within a for loop. I know that I can do these calculations with the function tapply, but the task requires using a for loop.
I tried writing a vector in which I would put the results:
medie <- rep(NA,3)
names(medie) <- levels(iris$Species)
and then this as the loop:
for (i in 1:length(medie)){
medie[i] <- mean(subset(iris, Species==levels(Species))$Petal.Length)
}
but these are the results I get:
> medie
setosa versicolor virginica
3.796 3.796 3.796
Any help?
I think you need to include the index i, i.e. levels(Species)[i]:
for (i in 1:length(medie)){
medie[i] <- mean(subset(iris, Species==levels(Species)[i])$Petal.Length)
}
> medie
setosa versicolor virginica
1.462 4.260 5.552
There is an argument called select in subset() to select your target column; adding drop = TRUE returns the single column as a vector so mean() works on it directly:
medie[i] <- mean(subset(iris, Species == levels(Species)[i], select = Petal.Length, drop = TRUE))
Here's a dplyr approach if you someday want to avoid the for loop.
library(dplyr)
iris %>%
group_by(Species) %>%
summarise(medie = mean(Petal.Length))
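For completeness, a base R one-liner with split() gives the same per-species means without a loop (just for comparison, since the exercise requires the loop); the values match the loop result above:
sapply(split(iris$Petal.Length, iris$Species), mean)
#    setosa versicolor  virginica
#     1.462      4.260      5.552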

r successive filtering with n arguments in a list

I am trying to apply successive filters on a dataframe without knowing in advance the number of filters or their arguments. The arguments are stored in a list. With 1 or 2 filters, I can do it with purrr.
For instance, with 2 filters:
require(tidyverse)
data("iris")
head(iris)
f2 <- list("Species" = "virginica", "Sepal.Length" = c(5.8, 6.3))
iris_f2 <- map2_df(.x = f2[[1]],
.y = f2[[2]],
.f = ~{
iris %>%
filter(get(names(f2)[1]) %in% .x,
get(names(f2)[2]) %in% .y)
})
# With 3 filters or more, I am completely stuck!
f3 <- list("Species" = "virginica", "Sepal.Length" = c(5.8, 6.3), "Sepal.Width" = 2.7)
I would like to generalize my code so that it applies successive filters with n arguments in a list (n can be 1, 2 as in my example, or more).
Ideally, I would like to know how to do it with purrr but I am also interested in loop-based solutions.
Here is one way that uses call() to construct defused expressions that can be spliced inside of filter().
library(purrr)
library(dplyr)
fns <- imap(f3, ~ call(if (length(.x) == 1) "==" else "%in%", sym(.y), .x))
Which gives the following:
$Species
Species == "virginica"
$Sepal.Length
Sepal.Length %in% c(5.8, 6.3)
$Sepal.Width
Sepal.Width == 2.7
However, the names cause an issue when spliced, so it needs to be unnamed before use:
iris %>%
filter(!!!unname(fns))
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 5.8 2.7 5.1 1.9 virginica
2 6.3 2.7 4.9 1.8 virginica
3 5.8 2.7 5.1 1.9 virginica
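Since the question also asked about loop-based solutions, here is a minimal sketch of one (filter_list is a hypothetical helper name); it walks over the list with a plain for loop and uses the .data pronoun instead of get(). Because %in% handles both single values and vectors, it works for any number of filters:
library(dplyr)
filter_list <- function(data, flt) {
  out <- data
  for (nm in names(flt)) {
    out <- filter(out, .data[[nm]] %in% flt[[nm]])
  }
  out
}
filter_list(iris, f3)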

Iterate sequentially over two lists in R

I have two data frames that look something like this:
library(tidyverse)
iris <- iris %>% mutate_at(1:4, ~ . + 2)
iris2 <- iris
names(iris2)<-sub(".", "_", names(iris2), fixed = TRUE)
My aim is to reduce the values of the variables in iris that are above the maximum values of the corresponding variable in iris2, to match the maximum value in iris2.
I have written a function that does this.
max(iris$Sepal.Length)
[1] 9.9
max(iris2$Sepal_Length)
[1] 7.9
# I want every value of iris that is >= the max value of iris2 to be set to the max value of iris2.
# my function:
fixmax<- function(data,data2,var1,var2) {
data<- data %>%
mutate("{var1}" := ifelse(get(var1)>=max(data2[[var2]],na.rm = T),
max(data2[[var2]],na.rm = T),get(var1)))
return(data)
}
# apply my function to a variable
tst_iris <- fixmax(iris,iris2,"Sepal.Length","Sepal_Length")
max(tst_iris$Sepal.Length)
7.9 # it works!
The challenge I face is that I would like to iterate my function sequentially over two lists of variables, i.e. Sepal.Length with Sepal_Length, Sepal.Width with Sepal_Width, etc.
Does anyone know how I can do this?
I tried using Map but I am doing something wrong.
lst1 <- names(iris[,1:4])
lst2 <- names(iris2[,1:4])
final_iris<- Map(fixmax,iris, iris2,lst1,lst2)
My goal is to obtain a df (final_iris) where every variable has been adjusted using the criteria specified by fixmax.
I know I can do this by running my function on every variable like so.
final_iris <- iris
final_iris <- fixmax(final_iris,iris2,"Sepal.Length","Sepal_Length")
final_iris <- fixmax(final_iris,iris2,"Sepal.Width","Sepal_Width")
final_iris <- fixmax(final_iris,iris2,"Petal.Length","Petal_Length")
final_iris <- fixmax(final_iris,iris2,"Petal.Width","Petal_Width")
But in the real data, I have to run this operation tens of times and I would like to be able to loop my function sequentially.
Does anyone know how I can loop my fixmax over lst1 and lst2 sequentially?
Rather than explicitly iterating over the different datasets and columns by name, you can take advantage of the vectorization built into R. If the dataframes have the same column/variable ordering a function mapped to both dataframes using mapply or purrr::map2 will iterate column by column without the need to specify column names.
Given two input data frames (df_small and df_big) the steps are:
Calculate the max of each column in df_small to create df_small_max
Apply the pmin function to each column of df_big and each value of df_small_max using mapply (or purrr::map2_dfc if you prefer tidyverse mapping)
#set up fake data
df_small <- iris[,1:4]
df_big <- df_small + 2
# find max of each col in df_small
df_small_max <- sapply(df_small, max)
# replace values of df_big which are larger than df_small_max
df_big_fixed <- mapply(pmin, df_big, df_small_max)
# sanity check -- Note the change in Sepal.Width
df_small_max
#> Sepal.Length Sepal.Width Petal.Length Petal.Width
#> 7.9 4.4 6.9 2.5
head(df_big, 3)
#> Sepal.Length Sepal.Width Petal.Length Petal.Width
#> 1 7.1 5.5 3.4 2.2
#> 2 6.9 5.0 3.4 2.2
#> 3 6.7 5.2 3.3 2.2
head(df_big_fixed, 3)
#> Sepal.Length Sepal.Width Petal.Length Petal.Width
#> [1,] 7.1 4.4 3.4 2.2
#> [2,] 6.9 4.4 3.4 2.2
#> [3,] 6.7 4.4 3.3 2.2
Created on 2021-07-31 by the reprex package (v2.0.0)
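The purrr equivalent mentioned above would look roughly like this (a sketch; map2_dfc pairs each column of df_big with the corresponding element of df_small_max and returns a tibble rather than the matrix mapply gives):
library(purrr)
df_big_fixed_tbl <- map2_dfc(df_big, df_small_max, pmin)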
It's likely that your issue is related to the fact that dataframes are themselves lists. Map() expects the non-function arguments to be lists of the same length. Any arguments that are shorter than the longest list are "recycled" to match its length.
Currently, you have:
final_iris<- Map(fixmax,iris, iris2,lst1,lst2)
This is actually equivalent to:
final_iris<- Map(fixmax,
list(iris$Sepal.Length,
iris$Sepal.Width,
iris$Petal.Length,
iris$Petal.Width,
iris$Species),
list(iris2$Sepal_Length,
iris2$Sepal_Width,
iris2$Petal_Length,
iris2$Petal_Width,
iris2$Species),
lst1,
lst2)
(To understand why, you must remember that dataframes like iris and iris2 are, technically, under the hood, lists of [atomic] vectors.)
I suspect that you want iris and iris2 to be supplied to each call to fixmax(). In order to have Map() recycle these two data frames, they need to be supplied as single-element lists, like so:
final_iris<- Map(fixmax, list(iris), list(iris2),lst1,lst2)
To combine a list of dataframes into a single dataframe do
do.call(rbind, final_iris)
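If the goal is instead to apply the fixes cumulatively to one data frame, as in the manual chain of fixmax() calls in the question, a Reduce()-based sketch is closer to that intent (it feeds the result of each call into the next):
final_iris <- Reduce(
  function(acc, i) fixmax(acc, iris2, lst1[i], lst2[i]),
  seq_along(lst1),
  init = iris
)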
Here is a mostly base R way. I also renamed the variables because I had trouble reproducing the example, since the original approach saves over the iris object.
The approach: instead of mutating a data.frame object, the modified function only returns the vector of expected values. Then we re-assign those values back to our original data.frame.
fixmax2 = function(x, y) {
  max_y = max(y, na.rm = TRUE)
  ifelse(x >= max_y, max_y, x)
}
cols = which(sapply(df_plus, is.numeric))
df_plus[cols] = Map(fixmax2, df_plus[cols], df_iris[cols])
df_plus
Raw data:
library(dplyr)
df_plus = iris %>% mutate_at((1:4), ~. + 2) ## let's not save over iris
df_iris = iris
names(df_iris)<-sub(".", "_", names(df_iris), fixed = TRUE)
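A quick sanity check (a sketch; after the Map() call both sets of column maxima should agree at 7.9, 4.4, 6.9, 2.5):
sapply(df_plus[cols], max)
sapply(df_iris[cols], max)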
Is that what you're expecting?
my_a <- iris %>% mutate_at((1:4),~.+2)
iris2 <- iris
names(iris2)<-sub(".", "_", names(iris2), fixed = TRUE)
my_var <- which(my_a$Sepal.Length >= max(iris2$Sepal_Length) & my_a$Sepal.Width >= max(iris2$Sepal_Width))
if (length(my_var)) {
my_a <- my_a[my_var,]
}
Your function seems convoluted and hard to read at first glance. We can tidy it up with a quick helper that caps each value in a column at max_val:
#function to correct max
adjust_max <- function(x, max_val) {
return(ifelse(x >= max_val, max_val, x))
}
Finally, we want to apply this automatically and sequentially using the two dataframes. We will use a simple for loop. Code to set up the problem is attached.
#libraries
library(tidyverse)
#set up fake data
iris_big <- iris %>% mutate_at(1:4, ~ . + 2)
iris_small <- iris
names(iris_small) <- sub(".", "_", names(iris_small), fixed = TRUE)
#check which is the bigger one and the smaller
max(iris_big$Sepal.Length) #bigger
max(iris_small$Sepal_Length) #smaller
#function to correct max
adjust_max <- function(x, max_val) {
return(ifelse(x >= max_val, max_val, x))
}
#apply it to get a final result
iris_final <- iris_big
# iterate over columns, assuming same positions
# you can edit the 1:ncol(iris_final) to only take the columns you want
for (i in 1:ncol(iris_final)) {
#check numeric
if (is.numeric(iris_final[,i])) {
#applies the function - notice we call iris_final and iris_small
iris_final[,i] <- sapply(iris_final[,i],
adjust_max,
max_val = max(iris_small[,i]))
}
}
#check answer is correct
apply(iris_final[,1:4], 2, max)
apply(iris_small[,1:4], 2, max)
tail(iris_final)
For a tidyverse approach you can use transmute instead of mutate. transmute returns only the one column created in each iteration, whereas mutate would return all the columns every time.
Apart from that, to make it more tidyverse-friendly, I am using .data instead of get(), and pmin() instead of the more complicated ifelse() solution.
library(dplyr)
library(purrr)
fixmax<- function(data,data2,var1,var2) {
data<- data %>% transmute("{var1}" := pmin(.data[[var1]], max(data2[[var2]])))
return(data)
}
To apply the function to each pair of columns you can use map2_dfc which will also combine the results in one dataframe.
lst1 <- names(iris[,1:4])
lst2 <- names(iris2[,1:4])
Compare the max values of two dataframes before applying the function.
map_dbl(iris[lst1], max)
#Sepal.Length Sepal.Width Petal.Length Petal.Width
# 9.9 6.4 8.9 4.5
map_dbl(iris2[lst2], max)
#Sepal_Length Sepal_Width Petal_Length Petal_Width
# 7.9 4.4 6.9 2.5
Apply the function -
iris[lst1] <- map2_dfc(lst1, lst2, ~fixmax(iris, iris2, .x, .y))
Compare the max values of two dataframes after applying the function.
map_dbl(iris[lst1], max)
#Sepal.Length Sepal.Width Petal.Length Petal.Width
# 7.9 4.4 6.9 2.5
map_dbl(iris2[lst2], max)
#Sepal_Length Sepal_Width Petal_Length Petal_Width
# 7.9 4.4 6.9 2.5
You should consider using column indices; a complete (not including the data-frame construction) base R solution could look like:
# Resolve the indices of the numeric vectors in
# iris: num_cols => integer vector
num_cols <- which(
vapply(
iris,
is.numeric,
logical(1)
),
arr.ind = TRUE
)
# Map the pmin function over iris to select the
# minimum of the vector element in iris and the
# maximum values of that vector in iris2:
# iris => data.frame
iris[,num_cols] <- Map(function(i){
pmin(
iris[,i],
max(
iris2[,i],
na.rm = TRUE
)
)
},
num_cols
)
You can do this by creating a matrix of the max value repeated in each column and use pmin to take the minimum values between the max values in iris2 and the values in the other dataframe. I created a new fixmax function which only takes the two dataframes as arguments.
Preparing the data
library(tidyverse)
initial <- iris %>% mutate_at(1:4, ~.+2)
iris2 <- iris
names(iris2)<-sub(".", "_", names(iris2), fixed = TRUE)
print(max(initial$Sepal.Length))
# [1] 9.9
print(max(iris2$Sepal_Length))
# [1] 7.9
Creating the function
fixmax <- function(df, dfmax){
colids <- which(unlist(lapply(dfmax, is.numeric)))
dfmax <- apply(dfmax[, colids], 2, max) %>%
matrix(nrow=nrow(dfmax), ncol=length(colids), byrow=TRUE) %>%
as.data.frame()
df[, colids] <- pmin(df[,colids], dfmax)
return(df)
}
Testing the function
newiris <- fixmax(initial, iris2)
print(max(newiris$Sepal.Length))
# [1] 7.9
assertthat::assert_that(!identical(newiris, iris2))
# [1] TRUE
assertthat::assert_that(all((initial == newiris) | (iris2 == newiris)))
# [1] TRUE
imax = apply(iris2[, 1:4], 2, max) %>%
matrix(nrow=nrow(iris2), ncol=4, byrow=TRUE) %>%
as.data.frame()
assertthat::assert_that(all(newiris[, 1:4] <= imax))
# [1] TRUE
print(head(newiris))
# Sepal.Length Sepal.Width Petal.Length Petal.Width Species
# 1 7.1 4.4 3.4 2.2 setosa
# 2 6.9 4.4 3.4 2.2 setosa
# 3 6.7 4.4 3.3 2.2 setosa
# 4 6.6 4.4 3.5 2.2 setosa
# 5 7.0 4.4 3.4 2.2 setosa
# 6 7.4 4.4 3.7 2.4 setosa

How do I summarise only part of a table?

I have two related use-cases in which I need to summarise just parts of a table, specified in a way similar to filter.
In a nutshell, I want something like this:
iris %>%
use_only(Species == 'setosa') %>%
summarise_each(funs(sum), -Species) %>%
mutate(Species = 'setosa_sum') %>%
use_all()
To yield this:
Source: local data frame [101 x 5]
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 250.3 171.4 73.1 12.3 setosa_sum
2 7.0 3.2 4.7 1.4 versicolor
3 6.4 3.2 4.5 1.5 versicolor
4 6.9 3.1 4.9 1.5 versicolor
5 5.5 2.3 4.0 1.3 versicolor
…
So instead of grouping by the value of a column, I use a filtering criterion to operate on a view of the table, without actually losing the rest of the table (unlike filter).
How do I smartly implement use_only/use_all? Even better, is this functionality already contained in dplyr and how do I use it?
It’s of course quite easy to generate the result above, but I need to do something similar for many different cases, with complex and variable criteria for filtering.
I implemented this with the approach of having use_only save the rest of the table into a global option dplyr_use_only_rest, and having use_all bind it back together.
use_only <- function(.data, ...) {
if (!is.null(.data$.index)) {
stop("data cannot already have .index column, would be overwritten")
}
filt <- .data %>%
mutate(.index = row_number()) %>%
filter(...)
rest <- .data %>% slice(-filt$.index)
options(dplyr_use_only_rest = rest)
select(filt, -.index)
}
use_all <- function(.data, ...) {
rest <- getOption("dplyr_use_only_rest")
if (is.null(rest)) {
stop("called use_all() without earlier use_only()")
}
options(dplyr_use_only_rest = NULL)
bind_rows(.data, rest)
}
I recognize setting global options is less than ideal design for functional programming, but I don't think there's another way to ensure that the remainder of the data frame passes through any intermediate functions untouched. Adding an extra attribute to the object wouldn't survive functions such as do or summarize.
At this point,
iris %>%
use_only(Species == 'setosa') %>%
summarise_each(funs(sum), -Species) %>%
mutate(Species = 'setosa_sum') %>%
use_all()
returns, as desired:
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 250.3 171.4 73.1 12.3 setosa_sum
2 7.0 3.2 4.7 1.4 versicolor
3 6.4 3.2 4.5 1.5 versicolor
4 6.9 3.1 4.9 1.5 versicolor
5 5.5 2.3 4.0 1.3 versicolor
...
Any intermediate steps could be used in place of summarize_each and mutate (do, filter, etc) and they would happen only to the specified rows. You could even add or remove columns (the remainder would be filled in with NAs).
I think your approach of searching for a function to satisfy that particular syntax is too restrictive. This is what I would do using data.table (I'm not sure if dplyr allows for variable rows like this yet, I know it's been an FR for a while):
library(data.table)
dt = as.data.table(iris)
dt[, if (Species[1] == 'setosa') lapply(.SD, sum) else .SD, by = Species]
# Species Sepal.Length Sepal.Width Petal.Length Petal.Width
# 1: setosa 250.3 171.4 73.1 12.3
# 2: versicolor 7.0 3.2 4.7 1.4
# 3: versicolor 6.4 3.2 4.5 1.5
# 4: versicolor 6.9 3.1 4.9 1.5
# 5: versicolor 5.5 2.3 4.0 1.3
# ---
You can also add [Species == 'setosa', Species := 'setosa_sum'] at the end to modify the name in place. It should be straightforward to extend to multiple criteria/whatever function.
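Spelled out, that extension might look like the sketch below (Species is converted to character first so the relabelling assignment is valid, and Species[1] keeps the if() condition length one):
library(data.table)
dt <- as.data.table(iris)
dt[, Species := as.character(Species)]
res <- dt[, if (Species[1] == 'setosa') lapply(.SD, sum) else .SD, by = Species]
res[Species == 'setosa', Species := 'setosa_sum']
res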
You can create a new column to group by:
iris %>%
mutate( group1 = ifelse(Species == "setosa", "", row_number())) %>%
group_by( group1, Species ) %>%
summarise_each(funs(sum), -Species, -group1) %>%
ungroup() %>%
select(-group1)
Update: a more general solution
library(lazyeval)
use_only_ <- function(x, condition, ...) {
condition <- as.lazy(condition, parent.frame())
mutate_(x, .group = condition) %>%
group_by_(".group", ...)
}
use_only <- function(x, condition, ...) {
use_only_(x, lazy(condition), ...)
}
use_all <- function(x) {
ungroup(x) %>%
select(- .group)
}
Use use_only with any condition evaluated in the context of the data frame and the calling environment. In this case:
iris %>%
use_only( ifelse(Species == "setosa", "", row_number()), "Species") %>%
summarise_each(funs(sum), -Species, -.group) %>%
use_all()
use_only_ can be used with a formula or a string. For example:
condition <- ~ifelse(Species == "setosa", "", row_number())
or
condition <- "ifelse(Species == 'setosa' , "", row_number())"
And call:
iris %>%
use_only_(condition, "Species") %>%
summarise_each(funs(sum), -Species, -.group) %>%
use_all()
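lazyeval has since been superseded by rlang's tidy evaluation; a rough modern sketch of the same idea (function names kept from above, summarise_each swapped for across(), which is my substitution rather than part of the original answer):
library(dplyr)
use_only <- function(x, condition, ...) {
  x %>%
    mutate(.group = {{ condition }}) %>% # mark which rows get collapsed vs. kept
    group_by(.group, ...)
}
use_all <- function(x) {
  x %>% ungroup() %>% select(-.group)
}
iris %>%
  use_only(ifelse(Species == "setosa", "", row_number()), Species) %>%
  summarise(across(everything(), sum)) %>%
  use_all()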
When mutate-ing between the use_only and use_all calls, you must take care to change only values inside the marked group.

understanding difference in results between dplyr group_by vs tapply

I was expecting to see the same results from these two runs, but they are different. This makes me question whether I really understand how the dplyr code is working (I have read pretty much everything I can find about dplyr in the package docs and online). Can anyone explain why the results are different, or how to obtain similar results?
library(dplyr)
x <- iris
x <- x %.%
group_by(Species, Sepal.Width) %.%
summarise (freq=n()) %.%
summarise (mean_by_group = mean(Sepal.Width))
print(x)
x <- iris
x <- tapply(x$Sepal.Width, x$Species, mean)
print(x)
Update: I don't think this is the most efficient way to do this, but the following code gives a result that matches the tapply approach. Per Hadley's suggestion, I scrutinized the results line by line, and this is the best I could come up with using dplyr
library(dplyr)
x <- iris
x <- x %.%
group_by(Species, Sepal.Width) %.%
summarise (freq=n()) %.%
mutate (mean_by_group = sum(Sepal.Width*freq)/sum(freq)) %.%
print(x)
Update: for some reason I thought I had to group all variables I wanted to analyse, which is what sent things in the wrong direction. This is all I needed, which is closer to the examples in the package.
x <- iris %.%
group_by(Species) %.%
summarise(Sepal.Width = mean(Sepal.Width))
print(x)
Maybe this...
- dplyr:
require(dplyr)
iris %>% group_by(Species) %>% summarise(mean_width = mean(Sepal.Width))
# Source: local data frame [3 x 2]
#
# Species mean_width
# 1 setosa 3.428
# 2 versicolor 2.770
# 3 virginica 2.974
- tapply:
tapply(iris$Sepal.Width, iris$Species, mean)
# setosa versicolor virginica
# 3.428 2.770 2.974
NOTE: tapply() simplifies output by default whereas summarise() does not:
typeof(tapply(iris$Sepal.Width, iris$Species, mean, simplify=TRUE))
# [1] "double"
it returns a list otherwise:
typeof(tapply(iris$Sepal.Width, iris$Species, mean, simplify=FALSE))
# [1] "list"
So to actually get the same type of output from tapply() you would need:
tbl_df(
data.frame(
mean_width = tapply( iris$Sepal.Width,
iris$Species,
mean )))
# Source: local data frame [3 x 1]
#
# mean_width
# setosa 3.428
# versicolor 2.770
# virginica 2.974
and this still isn't quite the same, as the species labels (unique(iris$Species)) end up as row names, an attribute, rather than as a column of the data frame...
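To make the tapply() result directly comparable, the names can be promoted to a real column by hand, e.g. (a small base R sketch; wrap it in as_tibble() if you want a tbl_df):
res <- tapply(iris$Sepal.Width, iris$Species, mean)
data.frame(Species = names(res), mean_width = as.numeric(res))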
