Use a function of the data in dplyr::summarise

Assume I have a function that takes a data.frame and returns a single number. I would like to use summarise in dplyr so that the new variable is this function applied to the data.frame, grouped by another variable.
Here is a toy example:
df <- data.frame(id=rep(c("A","B"),each=5),diff=rnorm(10))
func <- function(data) {
  mean(data$diff)
}
I know this example is easily done using summarise(Mean = mean(diff)), but the point is not to solve this particular example; it is to use summarise with a function of a data.frame in general.
My attempt so far is:
df %>% group_by(id) %>% summarise(New = func(.))
but it gives the same value for every group: the result of the function applied to the whole data frame.
I hope everything is clear.

I'm not sure I understand what you are trying to do, and I'm not familiar with the differences between the plyr and dplyr packages. The most straightforward way to do what I think you want is daply from plyr:
library(plyr)
daply(df, .(id), func)
#          A          B
# -0.0301488  0.2088815

As akrun pointed out in the comments, you can do this using do in dplyr:
df %>% group_by(id) %>% do(data.frame(New=func(.)))
You can also add other variables, though you have to use .$:
df %>% group_by(id) %>% do(data.frame(New=func(.), SmthElse = sd(.$diff)))
# id New SmthElse
#1 A 0.1934552 1.0932424
#2 B -0.4161216 0.4841031
That said, a simpler and faster solution is to use data.table:
library(data.table)
dt = as.data.table(df) # or convert in place using setDT
dt[, .(New = func(.SD), SmthElse = sd(diff)), by = id]
# id New SmthElse
#1: A 0.1934552 1.0932424
#2: B -0.4161216 0.4841031
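A note for readers on a recent dplyr: the attempt in the question returns the same number for every group because the magrittr placeholder . always refers to the whole piped data frame, not to the current group. The following is a minimal sketch of one alternative, assuming dplyr 1.1.0 or later (where pick() is available). The seed and the base R split()/sapply() cross-check are additions for illustration only.

```r
library(dplyr)

set.seed(1)  # seed chosen only to make the sketch reproducible
df <- data.frame(id = rep(c("A", "B"), each = 5), diff = rnorm(10))

# The question's function: takes a data frame, returns one number
func <- function(data) {
  mean(data$diff)
}

# pick(everything()) returns the current group's rows (non-grouping columns)
# as a data frame, so func() is evaluated once per group
res <- df %>%
  group_by(id) %>%
  summarise(New = func(pick(everything())))

# Base R cross-check: split into one data frame per id, then apply func
check <- sapply(split(df, df$id), func)
```

Both res$New and check should hold one value per id, and the two should agree.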

Related

Product of columns selected by starts_with()

I am wondering if there is a more efficient way to compute the row-wise product of a selection of columns in dplyr.
I know one way to do it (see below), but rowwise() takes a long time to run on my large data set, so I am looking for an alternative.
df = df %>%
  rowwise() %>%
  mutate(myprod = prod(c_across(starts_with('var_xyz'))))

Here are some alternative options.
If you want to stay in tidyverse you can try pmap_dbl :
library(dplyr)
library(purrr)
df %>% mutate(myprod = pmap_dbl(select(., starts_with('var_xyz')), prod))
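Base R offers an option with Reduce, and matrixStats provides rowProds:
cols <- grep('^var_xyz', names(df))
# Option 2: multiply the selected columns together element-wise
df$myprod <- Reduce(`*`, df[cols])
# Option 3: row products of the selected columns as a matrix
df$myprod <- matrixStats::rowProds(as.matrix(df[cols]))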
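To see why the Reduce approach avoids the row-by-row cost, here is a small base R sketch on made-up data (the column names and values are assumptions for illustration). Reduce multiplies whole columns at once, so it makes one vectorised call per column rather than one call per row.

```r
# Toy data; the column names are made up for this illustration
df <- data.frame(var_xyz1 = c(1, 2, 3),
                 var_xyz2 = c(4, 5, 6),
                 other    = c(7, 8, 9))

cols <- grep("^var_xyz", names(df))

# Vectorised: multiplies whole columns together, one `*` call per column
fast <- Reduce(`*`, df[cols])

# Row-by-row reference, similar in spirit to rowwise() + prod()
slow <- apply(df[cols], 1, prod)

fast
# [1]  4 10 18
```

Note that any NA in a selected column propagates to that row's product in both variants.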

dplyr: add rows within group_by groups

Is there a better way to add rows within group_by() groups than using bind_rows()? Here's an example that's a little clunky:
df <- data.frame(a=c(1,1,1,2,2), b=1:5)
df %>%
  group_by(a) %>%
  do(bind_rows(data.frame(a=.$a[1], b=0), ., data.frame(a=.$a[1], b=10)))
The idea is that columns that we're already grouping on could be inferred from the groups.
I was wondering whether something like this could work instead:
df %>%
group_by(a) %>%
insert(b=0, .at=0) %>%
insert(b=10)
Like append(), it could default to inserting after all existing elements, and it could be smart enough to use group values for any columns unspecified. Maybe use NA for non-grouping columns unspecified.
Is there an existing convenient syntax I've missed, or would this be helpful?

Here's an approach using data.table:
library(data.table)
setDT(df)
rbind(df, expand.grid(b = c(0, 10), a = df[ , unique(a)]))[order(a, b)]
Depending on your actual context this much simpler alternative would work too:
df[ , .(b = c(0, b, 10)), by = a]
(and we can simply use c(0, b, 10) in j if we don't care about keeping the name b)
The former has the advantage that it will work even if df has more columns -- just have to set fill = TRUE for rbind.data.table.
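For readers who prefer to stay in dplyr, the second data.table idiom has a close counterpart. This is a sketch assuming dplyr 1.1.0 or later, where reframe() exists and may return any number of rows per group. As with the data.table version, any columns not mentioned are dropped.

```r
library(dplyr)

df <- data.frame(a = c(1, 1, 1, 2, 2), b = 1:5)

# reframe() may return any number of rows per group, so each group's b
# can simply be wrapped with the extra values
out <- df %>%
  group_by(a) %>%
  reframe(b = c(0, b, 10))

out$b
# [1]  0  1  2  3 10  0  4  5 10
```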

Using gtools::mixedsort or alternatives with dplyr::arrange

I am trying to order a dataframe using dplyr::arrange. The issue is that the column I am trying to sort on contains a fixed string followed by a number, as generated by the dummy code below.
dummydf<-data.frame(values=rnorm(100),sortcol=paste0("ABC",sample(1:100,100,replace=FALSE)))
By default, using dummydf %>% arrange(sortcol) would generate a df which is sorted alphanumerically (?) but this is of course not the desired result:
values sortcol
0.708081720 ABC1
0.041348322 ABC10
1.730962886 ABC100
0.423480861 ABC11
-1.545837266 ABC12
-1.345539947 ABC13
-0.078998792 ABC14
0.088712174 ABC15
0.670583024 ABC16
1.238837680 ABC17
-1.459044293 ABC18
-2.028535223 ABC19
0.779514385 ABC2
1.360509910 ABC20
In this example, I would like to sort the column as gtools::mixedsort would, so that ABC2 directly follows ABC1 instead of coming after ABC10 to ABC19 and ABC100. mixedsort(as.character(dummydf$sortcol)) does the trick.
Now, I am aware I could do this by using sub in my arrange argument: dummydf %>% arrange(as.numeric(sub("ABC","",sortcol))) but that is mainly because my string is something fixed (although any regex could be used to capture the last digits following any string I suppose).
I am just wondering: is there a more "elegant" and generic way to get this done with dplyr::arrange, in the same fashion as gtools::mixedsort?
Kind regards,
FM

Here's a functional solution making use of the mysterious identity order(order(x)) == rank(x), which holds when x has no ties (with ties, it matches rank(x, ties.method = "first")).
mixedrank = function(x) order(gtools::mixedorder(x))
dummydf %>% dplyr::arrange(mixedrank(sortcol))
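The identity matters because arrange() sorts by the values of its argument, so it needs a rank for each row rather than an ordering permutation. A small base R sketch of the idea follows. It uses a numeric-suffix key as a stand-in for gtools::mixedorder(), which is an assumption that only suits this fixed "ABC<number>" pattern.

```r
x <- c("ABC10", "ABC1", "ABC2")

# Stand-in for gtools::mixedorder(): order by the numeric suffix
# (an assumption that works only for this fixed "ABC<number>" pattern)
key <- as.numeric(sub("^ABC", "", x))
ord <- order(key)          # positions of the rows in sorted order: 2 3 1

# Inverting the permutation gives each element's rank: 3 1 2
rnk <- order(ord)

identical(rnk, rank(key, ties.method = "first"))
# [1] TRUE

# Sorting by the rank vector, as arrange() would, yields the natural order
x[order(rnk)]
# [1] "ABC1"  "ABC2"  "ABC10"
```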

I don't see this answer posted so I'll throw it out. You can use mixedorder with slice to arrange it.
dummydf %>%
  slice(gtools::mixedorder(sortcol))

Using data.table
library(data.table)
dummydf = data.table(dummydf)
dummydf[gtools::mixedorder(as.character(sortcol))]
Honestly, I just copied your example and stuck it in as the i (row selection) argument in the data.table syntax. You already did all the hard work :).

Credit to Akhil Nair for his data.table answer which is what the first code snippet derives from. If you like the data.table answer but still want magrittr piping, you can consider calculating a new column and using piping with data.table to get your output:
dummydf %>%
dplyr::mutate(row_lookup = gtools::mixedorder(as.character(sortcol))) %>%
data.table::data.table() %>%
.[.$row_lookup]
I think it's debatable whether that helps or detracts from the readability.
If you don't want to call data.table, you can go through some extra contortions to calculate a column you can use dplyr::arrange on. Here's one example:
library(dplyr)
bind_cols(dummydf,
          dummydf %>%
            tibble::rowid_to_column("order") %>%
            mutate(rowname = gtools::mixedorder(as.character(sortcol))) %>%
            arrange(rowname) %>%
            select(order)) %>%
  arrange(order)
I think this code is more confusing to read and isn't worth those extra contortions to avoid data.table.

Here is a solution that will allow for sorting if there are repeats and multiple conditions to sort. Most previous answers are not generic: they freeze the ordering at level 1.
df <- data.frame(values = rnorm(100),
                 sortcol1 = paste0("ASORT", sample(1:100, 100, replace = TRUE)),
                 sortcol2 = paste0("BSORT", sample(1:100, 100, replace = TRUE)),
                 stringsAsFactors = FALSE)
df %>%
  mutate(
    sortcol1 = factor(sortcol1, ordered = TRUE, levels = unique(gtools::mixedsort(sortcol1))),
    sortcol2 = factor(sortcol2, ordered = TRUE, levels = unique(gtools::mixedsort(sortcol2)))
  ) %>%
  arrange(sortcol1, sortcol2)

Working with temporary columns (created on-the-fly) more efficiently in a dataframe

Consider the following dataframe:
df <- data.frame(replicate(5, sample(1:10, 10, replace = TRUE)))
If I want to divide each row by its sum (to make a probability distribution), I need to do something like this:
df %>% mutate(rs = rowSums(.)) %>% mutate_each(funs(. / rs), -rs) %>% select(-rs)
This really feels inefficient:
1. Create an rs column.
2. Divide each of the values by the corresponding row sum from rowSums().
3. Remove the temporarily created column to clean up the original dataframe.
When working with existing columns, it feels much more natural:
df %>% summarise_each(funs(weighted.mean(., X1)), -X1)
Using dplyr, would there be a better way to work with temporary columns (created on-the-fly) than having to add and remove them after processing?
I'm also interested in how data.table would handle such a task.

As I mentioned in a comment above I don't think that it makes sense to keep that data in either a data.frame or a data.table, but if you must, the following will do it without converting to a matrix and illustrates how to create a temporary variable in the data.table j-expression:
dt = as.data.table(df)
dt[, names(dt) := {sums = Reduce(`+`, .SD); lapply(.SD, '/', sums)}]

Why not consider base R as well:
as.data.frame(as.matrix(df)/rowSums(df))
Or just with your data.frame:
df/rowSums(df)
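The base R one-liner works because the row-sum vector is recycled down each column, so row i ends up divided by its own sum. The sketch below checks that on made-up data and, for comparison, shows a dplyr variant that keeps the row sums in an ordinary variable instead of a temporary column. It assumes dplyr 1.0.0 or later for across(); mutate_each() and funs() from the question are deprecated in current dplyr.

```r
library(dplyr)

# Toy data, made up for this illustration
df <- data.frame(X1 = c(1, 2), X2 = c(3, 2), X3 = c(4, 4))

# Base R: the row-sum vector is recycled down each column,
# so row i is divided by its own sum
p_base <- df / rowSums(df)

# dplyr: compute the row sums once, outside the pipeline, so that
# no temporary column has to be added and removed
rs <- unname(rowSums(df))
p_dplyr <- df %>% mutate(across(everything(), ~ .x / rs))
```

Computing rs before the pipeline also avoids a subtle trap: inside mutate(), columns are updated one after another, so row sums recomputed mid-pipeline would see already-rescaled values.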

pass grouped dataframe to own function in dplyr

I am trying to move from plyr to dplyr. However, I still can't figure out how to call my own functions in a chained dplyr call.
I have a data frame with a factorised ID variable and an order variable. I want to split the frame by the ID, order it by the order variable and add a sequence in a new column.
My plyr function looks like this:
f <- function(x) cbind(x[order(x$order_variable), ], Experience = 0:(nrow(x)-1))
data <- ddply(data, .(ID_variable), f)
In dplyr I thought this should look something like this:
f <- function(x) cbind(x[order(x$order_variable), ], Experience = 0:(nrow(x)-1))
data <- data %>% group_by(ID_variable) %>% f
Can anyone tell me how to modify my dplyr call to successfully pass my own function and get the same functionality my plyr function provides?
EDIT: If I use the dplyr formula as described here, it DOES pass an object to f. However, while plyr seems to pass a number of different tables (split by the ID variable), dplyr does not pass one table per group but the ENTIRE table (as some kind of dplyr object where groups are annotated), thus when I cbind the Experience variable it appends a counter from 0 to the length of the entire table instead of the single groups.
I have found a way to get the same functionality in dplyr using this approach:
data <- data %>%
group_by(ID_variable) %>%
arrange(ID_variable,order_variable) %>%
mutate(Experience = 0:(n()-1))
However, I would still be keen to learn how to pass grouped variables split into different tables to own functions in dplyr.

For those who get here from google. Let's say you wrote your own print function.
printFunction <- function(dat) print(dat)
df <- data.frame(a = 1:6, b = 1:2)
Called the way the question does it,
df %>%
group_by(b) %>%
printFunction(.)
it prints the entire data. To make dplyr print one table per group, use do:
df %>%
group_by(b) %>%
do(printFunction(.))
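In current dplyr, do() is superseded, and group_modify() is one way to get the plyr-style behaviour the question asks for: the function is called once per group and receives only that group's rows. Below is a minimal sketch under stated assumptions: the data values are made up, and add_experience is a hypothetical rewrite of the question's f that avoids cbind.

```r
library(dplyr)

# Hypothetical data using the question's column names
data <- data.frame(ID_variable    = c("a", "a", "b", "b", "b"),
                   order_variable = c(2, 1, 3, 1, 2))

# Receives ONE group's rows (without the grouping column) and
# returns a data frame for that group
add_experience <- function(x) {
  x <- x[order(x$order_variable), , drop = FALSE]
  x$Experience <- seq_len(nrow(x)) - 1L   # 0, 1, 2, ... within the group
  x
}

out <- data %>%
  group_by(ID_variable) %>%
  group_modify(~ add_experience(.x))

out$Experience
# [1] 0 1 0 1 2
```

Using seq_len(nrow(x)) - 1L instead of 0:(nrow(x)-1) also behaves sensibly if a group were ever empty.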
