`dplyr::summarise` does not accept external functions - r

I have the following dataset:
dataset=structure(list(var1 = c(28.5627505742013, 22.8311421908438, 95.2216156944633,
43.9405107684433, 97.11211245507, 48.4108281508088, 77.1804554760456,
27.1229329891503, 69.5863061584532, 87.2112890332937), var2 = c(32.9009465128183,
54.1136392951012, 69.3181485682726, 70.2100433968008, 44.0986660309136,
62.8759404085577, 79.4413498230278, 97.4315509572625, 62.2505457513034,
76.0133410431445), var3 = c(89.6971945464611, 67.174579706043,
37.0924087055027, 87.7977314218879, 29.3221596442163, 37.5143952667713,
62.6237869635224, 71.3644423149526, 95.3462834469974, 27.4587387405336
), var4 = c(41.5336912125349, 98.2095112837851, 80.7970978319645,
91.1278881691396, 66.4086666144431, 69.2618868127465, 67.7560870349407,
71.4932355284691, 21.345994155854, 31.1811877787113), var5 = c(33.9312525652349,
88.1815139763057, 98.4453701227903, 25.0217059068382, 41.1195872165263,
37.0983888953924, 66.0217586159706, 23.8814191706479, 40.9594196081161,
79.7632974945009), var6 = c(39.813664201647, 80.6405956856906,
30.0273275375366, 34.6203793399036, 96.5195455029607, 44.5830867439508,
78.7370151281357, 42.010761089623, 23.0079878121614, 58.0372223630548
), kmeans = structure(c(2L, 1L, 3L, 1L, 3L, 1L, 1L, 1L, 2L, 3L
), .Label = c("1", "2", "3"), class = "factor")), .Names = c("var1",
"var2", "var3", "var4", "var5", "var6", "kmeans"), row.names = c(NA,
-10L), class = c("tbl_df", "tbl", "data.frame"))
And the following function:
myfun <- function(x){
  c(sum(x), mean(x), sd(x))
}
With dplyr::summarise only, the result is ok:
library(tidyverse)
my1<-dataset%>%
summarise_if(.,is.numeric,.funs=funs(sum,mean,sd))
But with myfun it doesn't work:
my2<-dataset%>%
summarise_if(.,is.numeric,.funs=funs(myfun))
Error in summarise_impl(.data, dots) :
Column var1 must be length 1 (a summary value), not 3
What's the problem?

You can try this approach. Your approach will not yield the correct result because summarise is not able to wrap the multiple values returned by your custom function into a single cell. To circumvent the problem, I used enframe together with list in the custom function:
library(tidyverse)
myfun <- function(x){
  return(list(enframe(c('sum' = sum(x), 'mean' = mean(x), 'sd' = sd(x)))))
}
For example, with the mtcars data:
my2 <- mtcars %>%
  summarise_at(c('mpg','drat'), function(x) myfun(x)) %>%
  unnest() %>%
  select(-name1) %>%
  set_names(nm = c('name', 'mpg', 'drat'))
it will yield:
name mpg drat
1 sum 642.900000 115.0900000
2 mean 20.090625 3.5965625
3 sd 6.026948 0.5346787
Also, there is an alternative way of solving it using purrr.
For example:
f <- function(x, ...){
  list('mean' = mean(x, ...), 'sum' = sum(x, ...))
}
mtcars %>%
  select(mpg, drat) %>%
  map_dfr(~ f(.x, na.rm = TRUE), .id = "Name") %>%
  data.frame()

When you apply this:
dataset %>% summarise_if(is.numeric, .funs = funs(sum, mean, sd))
you are applying three different functions (sum, mean and sd), each of which is applied to every numeric column individually. So all three functions run on every numeric column, and each function returns exactly one value per column.
Regarding your function, I think what you were trying to do was
myfun<-function(x){
c(sum(x),mean(x),sd(x))
}
Now, when this function is applied to one column it returns three values, so here a single function is returning three values instead of one.
myfun(dataset$var1)
#[1] 597.17994 59.71799 29.03549
As @NelsonGon mentioned in the comments, you are trying to store three values in a single column. You could return them as a list, as @Pkumar showed, or some variation of do() would also help you achieve that (see the sketch after the code below). If you break the function down into three separate functions, it works the same way as what you showed earlier.
myfun1 <- function(x) sum(x)
myfun2 <- function(x) mean(x)
myfun3 <- function(x) sd(x)
dataset %>% summarise_if(is.numeric,.funs=funs(myfun1,myfun2,myfun3))
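For reference, here is a minimal sketch of the do() variation mentioned above, assuming the dataset and myfun from the question; only var1 and var2 are shown for brevity:
library(dplyr)
# do() receives the whole data frame as `.` and may return a
# multi-row data frame, so the three statistics fit in naturally
dataset %>%
  do(data.frame(stat = c("sum", "mean", "sd"),
                var1 = myfun(.$var1),
                var2 = myfun(.$var2)))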

It's not the most elegant way, but if your external function is really just a collection of other functions, maybe you can simply use a list of functions:
myfun_ls <- list(sum,mean,sd)
my2<-dataset%>%
summarise_if(.,is.numeric,.funs=myfun_ls)
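On newer dplyr versions (1.1.0 or later) there is also reframe(), which explicitly allows a summary function to return more than one value per column, so the original myfun can be used as-is. A minimal sketch under that assumption:
library(dplyr)
# reframe() is the dplyr >= 1.1.0 replacement for multi-row summarise():
# one row per statistic (sum, mean, sd), one column per numeric variable
dataset %>%
  reframe(across(where(is.numeric), myfun))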

Related

My R function accepts individual column names but not lists of column names when passed through map()

After many months of using this forum, I finally have a question for the community that I can't seem to find sufficiently addressed elsewhere.
In R, I created a function that accepts individual column names but not lists of column names when passed through map(). The problem appears to be one of evaluation, so I have tried quo() and enquo(), but since I don't properly understand how they work, I need some help.
I've tried iterating through different versions of the function (commenting out the offending lines as per the error messages), but this only moves the problem around without solving it. Thanks in advance.
# Load:
library(tidyverse)
# Create df:
set.seed(12)
df <- tibble(col1 = sample(c("a", "b", "c"), 10, replace = TRUE),
col2 = sample(1:4, 10, replace = TRUE),
col3 = sample(1:4, 10, replace = TRUE))
# My function:
my_function <- function(col_name) {
  df <- df %>%
    filter({{ col_name }} != 1) %>%
    group_by(fct_relevel(factor(col1), "c", "b", "a")) %>%
    mutate(col4 = 5 - {{ col_name }}) %>%
    summarise("Score" = mean(col4)) %>%
    rename("Levels" = `fct_relevel(factor(col1), "c", "b", "a")`)
  return(df)
}
# List of col_names to pass to function:
col_list <- list(df$col2, df$col3)
# Attempt function in map() using list of col_names:
map(col_list, my_function)
# Gives error message:
# Error in `mutate()`:
# ! Problem while computing `col4 = 5 - c(1L, 2L, 1L, 2L,
# 4L, 2L, 2L, 3L, 4L, 1L)`.
# ✖ `col4` must be size 2 or 1, not 10.
# ℹ The error occurred in group 1: fct_relevel(factor(col1), "c",
# "b", "a") = c.
One issue you're having is that col_list is not actually a list of column names, but rather the actual data from those columns.
I'm not totally sure what output you're hoping for, but I'm guessing it's the full_join of the result of my_function applied to each column. One way to do that is:
new_f <- function(...){
  df %>%
    mutate(across(-col1, ~ if_else(.x == 1L, NA_integer_, .x))) %>%
    group_by("Levels" = fct_relevel(factor(col1), "c", "b", "a")) %>%
    select(Levels, ...) %>%
    summarize(across(everything(), ~ mean(5 - .x, na.rm = TRUE)))
}
new_f(col2, col3)
new_f(col2)
new_f(col3)
Now, I realize that maybe I have missed your true intention. For example, maybe you're trying to understand how to use purrr::map. If so, please comment or update your question.
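In the meantime, here is a minimal sketch of that route, assuming the column names are passed as strings and selected with the .data pronoun instead of {{ }}; my_function_chr below is a simplified, hypothetical variant of your function, not a drop-in replacement:
library(tidyverse)
# hypothetical helper: takes a column name as a string
my_function_chr <- function(col_name) {
  df %>%
    filter(.data[[col_name]] != 1) %>%
    group_by(Levels = fct_relevel(factor(col1), "c", "b", "a")) %>%
    summarise(Score = mean(5 - .data[[col_name]]))
}
# iterate over column names, not over the column data
map(c("col2", "col3"), my_function_chr)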
In any case, you should check out Programming with dplyr

cannot reference grouped data in summarize(across(...))

When I try to create several columns within summarize(), I would like to reference a newly created column name in the same summarize statement.
Example:
Goal: Try to calculate the standard error ("se") based on the standard deviation ("sd").
Step 1 (start to assign sd for se):
data %>%
group_by(style) %>%
summarise(across(score,list(mean = mean, sd = sd, se = sd)))
returns
style score_mean score_sd score_se
* <fct> <dbl> <dbl> <dbl>
1 S1 3.5 0.707 0.707
Step 2: calculate se based on sd
data %>%
group_by(style) %>%
summarise(across(score,list(mean = mean, sd = sd, se = sd/sqrt(nrow(score)))))
returns
Error: Problem with `summarise()` input `..1`.
x non-numeric argument to binary operator
ℹ Input `..1` is `across(score, list(mean = mean, sd = sd, se = sd/sqrt(nrow(data))))`.
ℹ The error occured in group 1: style = "S1".
Step 3 debugging assignment term
3a) grouped data reference
I replaced the grouped-data reference in nrow(score) with other column names, or even nrow(data), but they all led to the same error message.
3b) assignment operation
I replaced the expression for se, sd/sqrt(nrow(score)), with different variations, all leading to the same error. The simplest was sd/2, so even dividing by a constant doesn't work.
3c) assignment reference
I replaced sd by score_sd to reference the new column created, as seen in the output (Step 1). Still the same error message.
Question: Why does Step 1 work but not Step 2?
The error message just refers to the whole across() statement, so it doesn't help to narrow down the root cause.
My hunch is that I have to reference the grouped data somehow, but I tried
se = sd(.)/sqrt(nrow(data)) with no success.
Would be grateful for any hints...
Minimal reproducible example:
data <- structure(list(style = structure(c(1L, 2L, 3L, 4L, 5L, 1L, 2L,
3L, 4L, 5L), .Label = c("S1", "S2", "S3", "S4", "S5"), class = "factor"),
param = structure(c(1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L
), .Label = c("A", "B", "C"), class = "factor"), score = c(4,
1, 1, 3, 3, 3, 5, 1, 1, 1)), row.names = c(NA, -10L), class = c("tbl_df",
"tbl", "data.frame"))
After many trial & error attempts, I found the solution myself. This is for everyone who is not yet familiar with the across function, as dplyr 1.0.0 is not yet released.
So the answer to my question is:
You must reference the grouped data with the . placeholder, BUT ONLY IF you use the purrr-style formula operator ~.
However, you must NOT reference the grouped data inside n(), because n() does not accept the . placeholder.
The second point took endless trials to figure out, and it is the reason I wanted to share this solution.
It may also seem unintuitive that, even though n() is written with parentheses, it never takes the . placeholder as an argument; it always refers to the grouped data implicitly.
This is what this double trick looks like:
data %>%
group_by(style) %>%
summarise(across(
score,
list(mean = mean, sd = sd, se = ~sd(.)/sqrt(n()))
))
If you know it, it's easy :-)
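For comparison, the same summary can be written without across(); in a plain summarise() the newly created score_sd can even be referenced directly by the next expression, which is exactly what the function list inside across() does not allow. A minimal sketch with the example data above:
library(dplyr)
data %>%
  group_by(style) %>%
  summarise(score_mean = mean(score),
            score_sd   = sd(score),
            score_se   = score_sd / sqrt(n()))  # later expressions may use earlier results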

eval parse is not working properly in my function

I have the following dataset, which includes 2 variables:
dt4<-structure(list(a1 = c(4L, 4L, 3L, 4L, 4L), a2 = c(1L,
3L, 4L, 5L, 4L)), .Names = c("a1", "a2"
), row.names = c(NA, -5L), class = c("tbl_df", "tbl", "data.frame"
))
I have the following function that adds labels and levels to an existing dataset:
Add_Labels_Level_To_Dataset <- function(df, df_name, levels_list, labels_list) {
  df[] <- lapply(df, ordered)
  for (i in 1:length(colnames(df))) {
    arg0 <- paste0(df_name, "[i]", "<-ordered(", df_name, "$'", colnames(df)[i], "', levels=c(", levels_list[[i]], "), labels = c(", labels_list[[i]], "))")
    eval(parse(text=arg0))
  }
  df
}
which is run with this R command:
Add_Labels_Level_To_Dataset(dt4, "dt4", level_list, labels_list)
The lists supplied in the R command are the following, which represent the ordered levels and labels of each variable in the dataset, respectively:
label_list=list("'S','SA','SB','SC,'SD'", "'S','SA','SB','SC,'SD'")
level_list=list("5,4,3,2,1", "5,4,3,2,1")
Why is my function not working properly?
I don't know what is wrong with it!
When I run the R commands outside the function, they attach the levels/labels to the given dataset. However, when I run my R function, this does not happen!
df_name="dt4"
df=dt4
levels_list=level_list
labels_list=label_list
i=3
df[] <- lapply( df, ordered)
arg0<-paste0(df_name,"[i]", "<-ordered(", df_name, "$'", colnames(df)[i], "', levels=c(", levels_list[[i]], "), labels = c(", labels_list[[i]],"))" )
eval(parse(text=arg0))
Can you help?
This is an XY problem. I agree with @MrFlick that parse should be avoided.
In the original post, the main issue is that the function should be returning dt4 and not df. There are also some missing single quotes (') when defining label_list.
We could use mapply and avoid the quoted strings altogether:
label_list=list(c('S','SA','SB','SC','SD'), c('S','SA','SB','SC','SD'))
level_list=list(c(5,4,3,2,1), c(5,4,3,2,1))
as.data.frame(mapply(function(x, levels, labels) ordered(x, levels, labels), dt4, level_list, label_list, SIMPLIFY = FALSE))
# a1 a2
#1 SA SD
#2 SA SB
#3 SB SA
#4 SA S
#5 SA SA
Using eval/parse should be avoided. There are typically much easier ways to do what you want in R. For example, here we can just write
Add_Labels_Level_To_Dataset <- function(df, levels_list, labels_list) {
  df[] <- Map(function(data, levels, labels) {
    ordered(data, levels = strsplit(levels, ",")[[1]], labels = strsplit(labels, ",")[[1]])
  }, df, levels_list, labels_list)
  df
}
And we can call it like
dt4 <- Add_Labels_Level_To_Dataset(dt4, level_list, label_list)
Note that it returns a new data.frame, which you can reassign to dt4 or some other variable. Functions in R should never modify objects outside their own scope, which is one of the other reasons you were running into problems with your function.

How to do a simple transpose/pivot in R

I simply want to take a dataframe with two columns, one with a grouping variable and the second with values, and transform it so that the grouping variable becomes columns with the appropriate values. A very simple question, but after searching for about an hour, I cannot find a good answer. Here is a toy example:
var <- c("Var1", "Var1", "Var2", "Var2")
value <- c(1, 2, 3, 4)
df <- data.frame(var, value)
df.one <- df[df$var == "Var1", ]
df.two <- df[df$var == "Var2", ]
desired.df <- data.frame(df.one[2], df.two[2])
colnames(desired.df) <- c("Var1", "Var2")
desired.df
With more variables and values, this bit of code could become extremely clunky. Can anyone suggest a better method? Any advice would be greatly appreciated!
Data:
df <- structure(list(var = structure(c(1L, 1L, 2L, 2L),
.Label = c("Var1", "Var2"), class = "factor"),
value = c(1, 2, 3, 4)), .Names = c("var", "value"),
class = "data.frame", row.names = c(NA, -4L))
It looks like it is useful to introduce a new variable that identifies the observation within var (I call this case below); you can remove it after reshaping if you like.
With reshape2/plyr:
library("plyr")
library("reshape2")
## add 'case' identifier
df <- ddply(df,"var",mutate,case=1:length(var))
## dcast() to reshape; then drop identifier
dcast(df,case~var)[,-1]
With tidyr (same strategy):
library("tidyr")
library("dplyr")
df %>%
  group_by(var) %>%
  mutate(case = seq(n())) %>%
  spread(var, value) %>%
  select(-case)
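If you are on a more recent tidyr (1.0.0 or later), the same reshape can be written with pivot_wider(), which supersedes spread(); a minimal sketch under that assumption:
library(dplyr)
library(tidyr)
df %>%
  group_by(var) %>%
  mutate(case = row_number()) %>%  # identifier within each group, as above
  ungroup() %>%
  pivot_wider(names_from = var, values_from = value) %>%
  select(-case)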
This could probably be done with reshape() in base R as well, but I have never been able to figure it out ...
Base R solution:
data.frame(split(df$value,df$var))
# Var1 Var2
#1 1 3
#2 2 4
This solution assumes that all 'VarN' subsets have equal length.
A more general solution is:
z <- split(df$value,df$var)
max.length <- max(sapply(z,length))
data.frame(lapply(z,`length<-`,max.length))
which pads the shorter vectors with NA to make sure that all have the same length.

Returning first row of group

I have a dataframe consisting of an ID (the same for each element in a group), two datetimes, and the time interval between them. One of the datetime columns is my relevant time marker. Now I would like to get a subset of the dataframe that contains the earliest entry for each group. The entries (especially the time interval) need to stay untouched.
My first approach was to sort the frame by 1. ID and 2. the relevant datetime. However, I wasn't able to return the first entry for each new group.
I then looked at the aggregate() and ddply() functions, but I could not find an option in either that just returns the first entry without applying an aggregation function to the time interval value.
Is there an (easy) way to accomplish this?
ADDITION:
Maybe I was unclear with my aggregate() and ddply() notes. I do not necessarily need to aggregate. Given that the dataframe is sorted so that the first row of each group is the row I am looking for, it would suffice to return a subset containing every row whose ID differs from that of the previous row (i.e., the start row of each new group).
Example data:
structure(list(ID = c(1454L, 1322L, 1454L, 1454L, 1855L, 1669L,
1727L, 1727L, 1488L), Line = structure(c(2L, 1L, 3L, 1L, 1L,
1L, 1L, 1L, 1L), .Label = c("A", "B", "C"), class = "factor"),
Start = structure(c(1357038060, 1357221074, 1357369644, 1357834170,
1357913412, 1358151763, 1358691675, 1358789411, 1359538400
), class = c("POSIXct", "POSIXt"), tzone = ""), End = structure(c(1357110430,
1357365312, 1357564413, 1358230679, 1357978810, 1358674600,
1358853933, 1359531923, 1359568151), class = c("POSIXct",
"POSIXt"), tzone = ""), Interval = c(1206.16666666667, 2403.96666666667,
3246.15, 6608.48333333333, 1089.96666666667, 8713.95, 2704.3,
12375.2, 495.85)), .Names = c("ID", "Line", "Start", "End",
"Interval"), row.names = c(NA, -9L), class = "data.frame")
By reproducing the example data frame and testing it I found a way of getting the needed result:
Order data by relevant columns (ID, Start)
ordered_data <- data[order(data$ID, data$Start),]
Find the first row for each new ID
final <- ordered_data[!duplicated(ordered_data$ID),]
As you don't provide any data, here is an example using base R with a sample data frame:
df <- data.frame(group=c("a", "b"), value=1:8)
## Order the data frame with the variable of interest
df <- df[order(df$value),]
## Aggregate
aggregate(df, list(df$group), FUN=head, 1)
EDIT: As Ananda suggests in his comment, the following call to aggregate is better:
aggregate(.~group, df, FUN=head, 1)
If you prefer to use plyr, you can replace aggregate with ddply :
ddply(df, "group", head, 1)
Using ffirst from collapse
library(collapse)
ffirst(df, g = df$group)
data
df <- data.frame(group=c("a", "b"), value=1:8)
This could also be achieved with dplyr using group_by() and the slice family of functions:
data %>%
group_by(ID) %>%
slice_head(n = 1)
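Note that slice_head() simply takes the first row in the current order, so it assumes the data are already sorted within each group. If they are not, slice_min() on the relevant time marker avoids the separate sorting step; a minimal sketch, assuming Start is the relevant marker:
library(dplyr)
data %>%
  group_by(ID) %>%
  slice_min(Start, n = 1, with_ties = FALSE) %>%
  ungroup()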
