Take function from dataframe - R

I need to perform calculations based on inputs defined in a dataframe. Refer to the dataframe RefDf below. It has three columns: Variables, Calculation, and NewVariable. When the Calculation column contains count, the n_distinct() function should be used.
RefDf <- read.table(text = "Variables Calculation NewVariable
Sepal.Length sum Sepal.Length2
Petal.Length count Petal.LengthNew
", header = T)
Manual approach, which needs to be automated via the inputs in RefDf. Species always remains the grouping column.
library(dplyr)
iris %>% group_by_at("Species") %>%
summarise(Sepal.Length2 = sum(Sepal.Length,na.rm = T),
Petal.LengthNew = n_distinct(Petal.Length, na.rm = T)
)
I am looking for a dplyr or base R based solution.

Here's a solution with the data.table package.
library(data.table)
library(dplyr)
# using data.table
dt <- as.data.table(RefDf)
dt[Calculation == "count", Calculation := "n_distinct"]
# function for doing grouping calculation
inner.fun <- function(calc, data, column, group = "Species") {
  print(column)
  data.dt <- as.data.table(data)
  data.dt[, .(as.numeric(get(calc)(get(column)))), by = group][]
}
out <- dt[, inner.fun(calc=Calculation, data=iris, column=Variables), by=NewVariable]
# reshape from long to wide
out2 <- dcast(data=out, Species ~ NewVariable, value.var="V1")
# convert to data.frame
out_df <- as.data.frame(out2)
out_df
Species Petal.LengthNew Sepal.Length2
1 setosa 9 250.3
2 versicolor 19 296.8
3 virginica 20 329.4
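Since the question asked for a dplyr or base R based solution, here is a rough sketch of one alternative (not part of the answer above; it assumes dplyr >= 1.0): build one summary per row of RefDf using a small lookup of functions, then join the pieces by Species.
library(dplyr)

# map the Calculation keywords to functions ("count" -> n_distinct, as requested)
fun_map <- list(sum   = function(x) sum(x, na.rm = TRUE),
                count = function(x) n_distinct(x, na.rm = TRUE))

# one grouped summary per row of RefDf
summaries <- Map(
  function(v, calc, newname) {
    iris %>%
      group_by(Species) %>%
      summarise(!!newname := fun_map[[calc]](.data[[v]]), .groups = "drop")
  },
  as.character(RefDf$Variables),
  as.character(RefDf$Calculation),
  as.character(RefDf$NewVariable)
)

# join the per-variable summaries into one dataframe by Species
Reduce(function(x, y) left_join(x, y, by = "Species"), summaries)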

Related

Iterate sequentially over two lists in R

I have two dataframes that look something like this:
library(tidyverse)
iris <- iris %>% mutate_at(1:4, ~ . + 2)
iris2 <- iris
names(iris2) <- sub(".", "_", names(iris2), fixed = TRUE)
My aim is to reduce the values of the variables in iris that are above the maximum values of the corresponding variable in iris2, to match the maximum value in iris2.
I have written a function that does this.
max(iris$Sepal.Length)
[1] 9.9
max(iris2$Sepal_Length)
[1] 7.9
# I want every value of iris that is >= the max value of iris2 to be set equal to the max value of iris2.
# my function:
fixmax <- function(data, data2, var1, var2) {
  data <- data %>%
    mutate("{var1}" := ifelse(get(var1) >= max(data2[[var2]], na.rm = T),
                              max(data2[[var2]], na.rm = T), get(var1)))
  return(data)
}
# apply my function to a variable
tst_iris <- fixmax(iris,iris2,"Sepal.Length","Sepal_Length")
max(tst_iris$Sepal.Length)
7.9 # it works!
The challenge I face is that I would like to iterate my function sequentially over two lists of variables, i.e. Sepal.Length with Sepal_Length, Sepal.Width with Sepal_Width, etc.
Does anyone know how I can do this?
I tried using Map but I am doing something wrong.
lst1 <- names(iris[,1:4])
lst2 <- names(iris2[,1:4])
final_iris<- Map(fixmax,iris, iris2,lst1,lst2)
My goal is to obtain a df (final_iris) where every variable has been adjusted using the criteria specified by fixmax.
I know I can do this by running my function on every variable like so.
final_iris <- iris
final_iris <- fixmax(final_iris,iris2,"Sepal.Length","Sepal_Length")
final_iris <- fixmax(final_iris,iris2,"Sepal.Width","Sepal_Width")
final_iris <- fixmax(final_iris,iris2,"Petal.Length","Petal_Length")
final_iris <- fixmax(final_iris,iris2,"Petal.Width","Petal_Width")
But in the real data, I have to run this operation tens of times and I would like to be able to loop my function sequentially.
Does anyone know how I can loop my fixmax over lst1 and lst2 sequentially?
Rather than explicitly iterating over the different datasets and columns by name, you can take advantage of the vectorization built into R. If the dataframes have the same column/variable ordering, a function mapped to both dataframes using mapply or purrr::map2 will iterate column by column without the need to specify column names.
Given two input data frames (df_small and df_big) the steps are:
Calculate the max of each column in df_small to create df_small_max
Apply the pmin function to each column of df_big and each value of df_small_max using mapply (or purrr::map2_dfc if you prefer tidyverse mapping; see the sketch after the output below)
#set up fake data
df_small <- iris[,1:4]
df_big <- df_small + 2
# find max of each col in df_small
df_small_max <- sapply(df_small, max)
# replace values of df_big which are larger than df_small_max
df_big_fixed <- mapply(pmin, df_big, df_small_max)
# sanity check -- Note the change in Sepal.Width
df_small_max
#> Sepal.Length Sepal.Width Petal.Length Petal.Width
#> 7.9 4.4 6.9 2.5
head(df_big, 3)
#> Sepal.Length Sepal.Width Petal.Length Petal.Width
#> 1 7.1 5.5 3.4 2.2
#> 2 6.9 5.0 3.4 2.2
#> 3 6.7 5.2 3.3 2.2
head(df_big_fixed, 3)
#> Sepal.Length Sepal.Width Petal.Length Petal.Width
#> [1,] 7.1 4.4 3.4 2.2
#> [2,] 6.9 4.4 3.4 2.2
#> [3,] 6.7 4.4 3.3 2.2
Created on 2021-07-31 by the reprex package (v2.0.0)
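As mentioned above, the same mapping can be written with purrr. A minimal sketch (assuming purrr is installed): map2_dfc() pairs each column of df_big with the matching element of df_small_max and column-binds the results into a tibble instead of a matrix.
library(purrr)

# tidyverse equivalent of the mapply() call above
df_big_fixed_tbl <- map2_dfc(df_big, df_small_max, pmin)
head(df_big_fixed_tbl, 3)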
It's likely that your issue is related to the fact that dataframes are themselves lists. Map() expects the non-function arguments to be lists of the same length. Any arguments that are shorter than the longest list are "recycled" to match its length.
Currently, you have:
final_iris<- Map(fixmax,iris, iris2,lst1,lst2)
This is actually equivalent to:
final_iris <- Map(fixmax,
                  list(iris$Sepal.Length,
                       iris$Sepal.Width,
                       iris$Petal.Length,
                       iris$Petal.Width,
                       iris$Species),
                  list(iris2$Sepal_Length,
                       iris2$Sepal_Width,
                       iris2$Petal_Length,
                       iris2$Petal_Width,
                       iris2$Species),
                  lst1,
                  lst2)
(To understand why, you must remember that dataframes like iris and iris2 are, technically, under the hood, lists of [atomic] vectors.)
I suspect that you want the full iris and iris2 dataframes to be supplied to each call to fixmax(). In order to have Map() recycle these two arguments, they need to be supplied as single-element lists. Like so:
final_iris <- Map(fixmax, list(iris), list(iris2), lst1, lst2)
To combine a list of dataframes into a single dataframe do
do.call(rbind, final_iris)
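Since each element of final_iris above is a full copy of iris with a single column capped, here is a small follow-up sketch (not part of the original answer) for pulling the adjusted columns back into one dataframe:
final_iris <- Map(fixmax, list(iris), list(iris2), lst1, lst2)

# copy each capped column from the corresponding list element into one dataframe
adjusted_iris <- iris
for (i in seq_along(lst1)) {
  adjusted_iris[[lst1[i]]] <- final_iris[[i]][[lst1[i]]]
}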
Here is a mostly base R way. I also renamed the variables because I had some trouble replicating the setup: the original code saves over the iris object.
The approach: instead of mutating a data.frame object, the modified function only returns the vector of expected values. We then re-assign those values back to the original data.frame.
fixmax2 = function(x, y) {
  max_y = max(y, na.rm = TRUE)
  ifelse(x >= max_y, max_y, x)
}
cols = which(sapply(df_plus, is.numeric))
df_plus[cols] = Map(fixmax2, df_plus[cols], df_iris[cols])
df_plus
Raw data:
library(dplyr)
df_plus = iris %>% mutate_at((1:4), ~. + 2) ## let's not save over iris
df_iris = iris
names(df_iris)<-sub(".", "_", names(df_iris), fixed = TRUE)
Is that what you're expecting?
my_a <- iris %>% mutate_at((1:4),~.+2)
iris2 <- iris
names(iris2)<-sub(".", "_", names(iris2), fixed = TRUE)
my_var <- which(my_a$Sepal.Length >= max(iris2$Sepal_Length) & my_a$Sepal.Width >= max(iris2$Sepal_Width))
if (length(my_var)) {
  my_a <- my_a[my_var, ]
}
Your function seems convoluted and hard to read at first glance. We can tidy it up into a quick helper that caps each value in a column at max_val (returning max_val whenever x >= max_val, and x otherwise):
#function to correct max
adjust_max <- function(x, max_val) {
  return(ifelse(x >= max_val, max_val, x))
}
Finally, we want to apply this automatically and sequentially using the two dataframes. We will use a simple for loop. Code to set up the problem is attached.
#libraries
library(tidyverse)
#set up fake data
iris_big <- iris%>% mutate_at((1:4),~.+2)
iris_small <- iris
names(iris_small)<- sub(".", "_", names(iris_small), fixed = TRUE)
#check which is the bigger one and the smaller
max(iris_big$Sepal.Length) #bigger
max(iris_small$Sepal_Length) #smaller
#function to correct max
adjust_max <- function(x, max_val) {
  return(ifelse(x >= max_val, max_val, x))
}
#apply it to get a final result
iris_final <- iris_big
# iterate over columns, assuming same positions
# you can edit the 1:ncol(iris_final) to only take the columns you want
for (i in 1:ncol(iris_final)) {
  # check numeric
  if (is.numeric(iris_final[, i])) {
    # applies the function - notice we call iris_final and iris_small
    iris_final[, i] <- sapply(iris_final[, i],
                              adjust_max,
                              max_val = max(iris_small[, i]))
  }
}
#check answer is correct
apply(iris_final[,1:4], 2, max)
apply(iris_small[,1:4], 2, max)
tail(iris_final)
For a tidyverse approach you can use transmute instead of mutate. transmute would return only one column in each iteration whereas mutate would return all the columns every time.
Apart from that, to make it more tidyverse-friendly I am using .data instead of get(), and pmin() instead of the more complicated ifelse() solution.
library(dplyr)
library(purrr)
fixmax <- function(data, data2, var1, var2) {
  data <- data %>% transmute("{var1}" := pmin(.data[[var1]], max(data2[[var2]])))
  return(data)
}
To apply the function to each pair of columns you can use map2_dfc which will also combine the results in one dataframe.
lst1 <- names(iris[,1:4])
lst2 <- names(iris2[,1:4])
Compare the max values of two dataframes before applying the function.
map_dbl(iris[lst1], max)
#Sepal.Length Sepal.Width Petal.Length Petal.Width
# 9.9 6.4 8.9 4.5
map_dbl(iris2[lst2], max)
#Sepal_Length Sepal_Width Petal_Length Petal_Width
# 7.9 4.4 6.9 2.5
Apply the function -
iris[lst1] <- map2_dfc(lst1, lst2, ~fixmax(iris, iris2, .x, .y))
Compare the max values of two dataframes after applying the function.
map_dbl(iris[lst1], max)
#Sepal.Length Sepal.Width Petal.Length Petal.Width
# 7.9 4.4 6.9 2.5
map_dbl(iris2[lst2], max)
#Sepal_Length Sepal_Width Petal_Length Petal_Width
# 7.9 4.4 6.9 2.5
You should consider using column indices; a complete (not including the data-frame construction) base R solution could look like:
# Resolve the indices of the numeric vectors in
# iris: num_cols => integer vector
num_cols <- which(
  vapply(iris, is.numeric, logical(1)),
  arr.ind = TRUE
)
# Map the pmin function over iris to select the
# minimum of the vector element in iris and the
# maximum value of that vector in iris2:
# iris => data.frame
iris[, num_cols] <- Map(function(i) {
  pmin(iris[, i], max(iris2[, i], na.rm = TRUE))
}, num_cols)
You can do this by creating a matrix of the max values repeated in each column and using pmin to take the minimum between the max values in iris2 and the values in the other dataframe. I created a new fixmax function which only takes the two dataframes as arguments.
Preparing the data
library(tidyverse)
initial <- iris %>% mutate_at(1:4, ~.+2)
iris2 <- iris
names(iris2)<-sub(".", "_", names(iris2), fixed = TRUE)
print(max(initial$Sepal.Length))
# [1] 9.9
print(max(iris2$Sepal_Length))
# [1] 7.9
Creating the function
fixmax <- function(df, dfmax) {
  colids <- which(unlist(lapply(dfmax, is.numeric)))
  dfmax <- apply(dfmax[, colids], 2, max) %>%
    matrix(nrow = nrow(dfmax), ncol = length(colids), byrow = TRUE) %>%
    as.data.frame()
  df[, colids] <- pmin(df[, colids], dfmax)
  return(df)
}
Testing the function
newiris <- fixmax(initial, iris2)
print(max(newiris$Sepal.Length))
# [1] 7.9
assertthat::assert_that(!identical(newiris, iris2))
# [1] TRUE
assertthat::assert_that(all((initial == newiris) | (iris2 == newiris)))
# [1] TRUE
imax = apply(iris2[, 1:4], 2, max) %>%
  matrix(nrow = nrow(iris2), ncol = 4, byrow = TRUE) %>%
  as.data.frame()
assertthat::assert_that(all(newiris[, 1:4] <= imax))
# [1] TRUE
print(head(newiris))
# Sepal.Length Sepal.Width Petal.Length Petal.Width Species
# 1 7.1 4.4 3.4 2.2 setosa
# 2 6.9 4.4 3.4 2.2 setosa
# 3 6.7 4.4 3.3 2.2 setosa
# 4 6.6 4.4 3.5 2.2 setosa
# 5 7.0 4.4 3.4 2.2 setosa
# 6 7.4 4.4 3.7 2.4 setosa

dplyr: Is it possible to return two columns in summarize using one function?

Say I have a function that returns two scalars, and I want to use it with summarize, e.g.
fn = function(x) {
  list(mean(x), sd(x))
}
iris %>%
summarize(fn(Petal.Length)) # Error: Column `fn(Petal.Length)` must be length 1 (a summary value), not 2
iris %>%
summarize(c("a","b") := fn(Petal.Length))
# Error: The LHS of `:=` must be a string or a symbol Run `rlang::last_error()` to see where the error occurred.
I tried both ways, but can't figure it out.
However, this can be done with data.table
library(data.table)
iris1 = copy(iris)
setDT(iris1)[, fn(Petal.Length)]
Is there a way to do this in dplyr?
Yes, you can save them as a list in a column and then use unnest_wider to separate them into different columns.
fn = function(x) {
  list(mean = mean(x), sd = sd(x))
}
library(dplyr)
library(tidyr)
iris %>%
summarise(temp = list(fn(Petal.Length))) %>%
unnest_wider(temp)
# A tibble: 1 x 2
# mean sd
# <dbl> <dbl>
#1 3.76 1.77
Or unnest_longer to have them in separate rows
iris %>%
summarise(temp = list(fn(Petal.Length))) %>%
unnest_longer(temp)
# temp temp_id
# <dbl> <chr>
#1 3.76 mean
#2 1.77 sd
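As a side note (not from the original answers): assuming dplyr >= 1.0.0, summarise() can also unpack a data frame returned by the function into separate columns, which avoids the explicit unnest step. A minimal sketch, with fn_df as a hypothetical variant of fn that returns a one-row data frame:
library(dplyr)

# hypothetical variant returning a data frame instead of a list
fn_df = function(x) {
  data.frame(mean = mean(x), sd = sd(x))
}

iris %>%
  summarise(fn_df(Petal.Length))
#    mean       sd
# 1 3.758 1.765298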

Summarizing by dynamic column name in dplyr

So I'm trying to do some programming in dplyr and I am having some trouble with the enquo and !! evaluations.
Basically I would like to mutate a column to a dynamic column name, and then be able to further manipulate that column (i.e. summarize). For instance:
my_function <- function(data, column) {
  quo_column <- enquo(column)
  new_col <- paste0(quo_column, "_adjusted")[2]
  data %>%
    mutate(!!new_col := (!!quo_column) + 1)
}
my_function(iris, Petal.Length)
This works great and returns a column called "Petal.Length_adjusted", which is just Petal.Length increased by one.
However I can't seem to summarize this new column.
my_function <- function(data, column) {
  quo_column <- enquo(column)
  new_col <- paste0(quo_column, "_adjusted")[2]
  mean_col <- paste0(quo_column, "_meanAdjusted")[2]
  data %>%
    mutate(!!new_col := (!!quo_column) + 1) %>%
    group_by(Species) %>%
    summarize(!!mean_col := mean(!!new_col))
}
my_function(iris, Petal.Length)
This results in a warning stating the argument "Petal.Length_adjusted" is not numeric or logical, although the output from the mutate call gives a numeric column.
How do I reference this dynamically generated column name to pass it in further dplyr functions?
Unlike quo_column, which is a quosure, new_col and mean_col are strings, so we convert the string to a symbol using sym() (from rlang) and then do the evaluation:
my_function <- function(data, column) {
  quo_column <- enquo(column)
  new_col <- paste0(quo_column, "_adjusted")[2]
  mean_col <- paste0(quo_column, "_meanAdjusted")[2]
  data %>%
    mutate(!!new_col := (!!quo_column) + 1) %>%
    group_by(Species) %>%
    summarise(!!mean_col := mean(!! rlang::sym(new_col)))
}
head(my_function(iris, Petal.Length))
# A tibble: 3 x 2
# Species Petal.Length_meanAdjusted
# <fct> <dbl>
#1 setosa 2.46
#2 versicolor 5.26
#3 virginica 6.55
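As an aside (not part of the original answer): assuming a recent rlang (>= 1.0) and dplyr (>= 1.0), you can skip the paste0(quo_column, ...)[2] trick and build the names with rlang::englue() and the embrace operator. A rough sketch, where my_function2 is just my name for this variant:
library(dplyr)
library(rlang)

my_function2 <- function(data, column) {
  new_col  <- englue("{{ column }}_adjusted")       # e.g. "Petal.Length_adjusted"
  mean_col <- englue("{{ column }}_meanAdjusted")
  data %>%
    mutate(!!new_col := {{ column }} + 1) %>%
    group_by(Species) %>%
    summarise(!!mean_col := mean(.data[[new_col]]), .groups = "drop")
}

my_function2(iris, Petal.Length)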

Unnesting results of function returning multiple values in summarize

The "wanted" result is given by the "do" function below. I thought that I could get the same with some use of unnest, but could not get it to work.
library(dplyr)
library(tidyr)
# Function rr is given
rr = function(x){
  # This should be an expensive and possibly random function
  r = range(x + rnorm(length(x), 0.1))
  # setNames(r, c("min", "max")) # fails, expecting single value
  # list(min = r[1], max = r[2]) # fails
  list(r) # Works, but result is in "long" form without min/max
}
# Works, but syntactically awkward
iris %>% group_by(Species) %>%
  do({
    r = rr(.$Sepal.Width)[[1]]
    data_frame(min = r[1], max = r[2])
  })
# This gives the long format, but without column
# names min/max
iris %>% group_by(Species) %>%
summarize(
range = rr(Sepal.Length)
) %>% unnest(range)
Here's a pretty straightforward alternative using the data.table package:
# Function rr is given
rr = function(x) as.list(setNames(range(x + rnorm(length(x), 0.1)), c("min", "max")))
library(data.table)
data.table(iris)[, rr(Sepal.Width), by = Species]
# Species min max
# 1: setosa 1.839845 6.341040
# 2: versicolor 1.063727 5.498810
# 3: virginica 1.232525 5.402483
Unnest() will always unlist your nested columns in a "long" format, but you could use spread() to get the desired output if you create a key column.
library(dplyr)
library(tidyr)
iris %>%
group_by(Species) %>%
summarize(range = rr(Sepal.Length)) %>%
unnest(range) %>% mutate(newcols = rep(c("min", "max"), 3)) %>%
spread(newcols, range)
# Species max min
# (fctr) (dbl) (dbl)
#1 setosa 7.636698 3.292692
#2 versicolor 9.792319 3.337382
#3 virginica 9.810723 3.367066
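As an aside (not from the original answers): assuming tidyr >= 1.0, you can keep rr() exactly as given and use unnest_wider() on a named vector, which avoids the manual key column and spread() step. A rough sketch:
library(dplyr)
library(tidyr)

iris %>%
  group_by(Species) %>%
  summarize(range = list(setNames(rr(Sepal.Length)[[1]], c("min", "max")))) %>%
  unnest_wider(range)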

Correlation using funs in dplyr

I want to find the rank correlation of various columns in a data.frame using dplyr.
I am sure there is a simple solution to this problem, but I think the problem lies in me not being able to use two inputs in summarize_each_ in dplyr when using the cor function.
For the following df:
df <- data.frame(Universe=c(rep("A",5),rep("B",5)),AA.x=rnorm(10),BB.x=rnorm(10),CC.x=rnorm(10),AA.y=rnorm(10),BB.y=rnorm(10),CC.y=rnorm(10))
I want to get the rank correlations between all the .x and the .y combinations. My problem is in the function below, where you see ????
cor <- df %>% group_by(Universe) %>%
summarize_each_(funs(cor(.,method = 'spearman',use = "pairwise.complete.obs")),????)
I want cor to just include the correlation pairs AA.x/AA.y, AA.x/BB.y, ... for each Universe.
Please help!
An alternative approach is to just call the cor function once since this will calculate all required correlations. Repeated calls to cor might be a performance issue for a large data set. Code to do this and extract the correlation pairs with labels could look like:
#
# calculate correlations and display in matrix format
#
cor_matrix <- df %>% group_by(Universe) %>%
do(as.data.frame(cor(.[,-1], method="spearman", use="pairwise.complete.obs")))
#
# to add row names
#
cor_matrix1 <- cor_matrix %>%
data.frame(row=rep(colnames(.)[-1], n_groups(.)))
#
# calculate correlations and display in column format
#
library(reshape2)  # melt() on the correlation matrix comes from reshape2
num_col <- ncol(df[,-1])
out_indx <- which(upper.tri(diag(num_col)))
cor_cols <- df %>% group_by(Universe) %>%
  do(melt(cor(.[,-1], method="spearman", use="pairwise.complete.obs"),
          value.name="cor")[out_indx,])
So here follows the winning (time-wise) solution to my problem:
library(tidyr)  # for gather()/unite()
d <- df %>% gather(R1, R1v, contains(".x")) %>%
  gather(R2, R2v, contains(".y"), -Universe) %>%
  group_by(Universe, R1, R2) %>%
  summarize(ICAC = cor(x = R1v, y = R2v, method = 'spearman',
                       use = "pairwise.complete.obs")) %>%
  unite(Pair, R1, R2, sep = "_")
It takes only about 0.005 milliseconds in this example, though adding data adds time.
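For reference, a rough equivalent with tidyr::pivot_longer() (assuming tidyr >= 1.0 and dplyr >= 1.0), since gather() is now superseded; the pairing logic is the same as the gather chain above:
library(dplyr)
library(tidyr)

cor_pairs <- df %>%
  pivot_longer(ends_with(".x"), names_to = "R1", values_to = "R1v") %>%
  pivot_longer(ends_with(".y"), names_to = "R2", values_to = "R2v") %>%
  group_by(Universe, R1, R2) %>%
  summarize(ICAC = cor(R1v, R2v, method = "spearman",
                       use = "pairwise.complete.obs"),
            .groups = "drop") %>%
  unite(Pair, R1, R2, sep = "_")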
Try this:
library(data.table) # needed for fast melt
setDT(df) # sets by reference, fast
mdf <- melt(df[, id := 1:.N], id.vars = c('Universe','id'))
mdf %>%
mutate(obs_set = substr(variable, 4, 4) ) %>% # ".x" or ".y" subgroup
full_join(.,., by=c('Universe', 'obs_set', 'id')) %>% # see notes
group_by(Universe, variable.x, variable.y) %>%
filter(variable.x != variable.y) %>%
dplyr::summarise(rank_corr = cor(value.x, value.y,
method='spearman', use='pairwise.complete.obs'))
Produces:
Universe variable.x variable.y rank_corr
(fctr) (fctr) (fctr) (dbl)
1 A AA.x BB.x -0.9
2 A AA.x CC.x -0.9
3 A BB.x AA.x -0.9
4 A BB.x CC.x 0.8
5 A CC.x AA.x -0.9
6 A CC.x BB.x 0.8
7 A AA.y BB.y -0.3
8 A AA.y CC.y 0.2
9 A BB.y AA.y -0.3
10 A BB.y CC.y -0.3
.. ... ... ... ...
Explanation:
Melt: converts table to long form, one row per observation. To do the melt in a dplyr chain, you would have to use tidyr::gather, I believe, so pick your dependency. Using data.table there is faster and not hard to understand. The step also creates an id for each observation, 1 to nrow(df). The rest is in dplyr like you wanted.
Full join: joins the melted table to itself to create paired observations from all variable pairings based on common Universe and observation id (edit: and now '.x' or '.y' subgroup).
Filter: we don't need to correlate observations paired to themselves, we know those correlations = 1. If you wanted to include them for a correlation matrix or something, comment out this step.
Summarize using Spearman correlation. Note you should use dplyr::summarise since if you have plyr also loaded you might accidentally call plyr::summarise.
