pass grouped dataframe to own function in dplyr - r

I am trying to move from plyr to dplyr. However, I still can't figure out how to call my own functions in a chained dplyr call.
I have a data frame with a factorised ID variable and an order variable. I want to split the frame by the ID, order it by the order variable and add a sequence in a new column.
My plyr function looks like this:
f <- function(x) cbind(x[order(x$order_variable), ], Experience = 0:(nrow(x)-1))
data <- ddply(data, .(ID_variable), f)
In dplyr I thought this should look something like this:
f <- function(x) cbind(x[order(x$order_variable), ], Experience = 0:(nrow(x)-1))
data <- data %>% group_by(ID_variable) %>% f
Can anyone tell me how to modify my dplyr call to successfully pass my own function and get the same functionality my plyr function provides?
EDIT: If I use the dplyr formula as described here, it DOES pass an object to f. However, while plyr seems to pass a number of different tables (split by the ID variable), dplyr does not pass one table per group but the ENTIRE table (as some kind of dplyr object where groups are annotated), thus when I cbind the Experience variable it appends a counter from 0 to the length of the entire table instead of the single groups.
I have found a way to get the same functionality in dplyr using this approach:
data <- data %>%
  group_by(ID_variable) %>%
  arrange(ID_variable, order_variable) %>%
  mutate(Experience = 0:(n()-1))
However, I would still be keen to learn how to pass grouped variables split into different tables to own functions in dplyr.

For those who get here from Google: let's say you wrote your own print function.
printFunction <- function(dat) print(dat)
df <- data.frame(a = 1:6, b = 1:2)
Calling it the way the question does,
df %>%
  group_by(b) %>%
  printFunction(.)
prints the entire data frame once. To have dplyr call the function once per group and print one table per group, use do():
df %>%
  group_by(b) %>%
  do(printFunction(.))
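The same idea can be applied to the question's own function. The following is only a minimal sketch, assuming dplyr 0.8.1 or later (which provides group_modify()) and a small made-up data frame; the column names ID_variable and order_variable come from the question, the values are invented for illustration.

```r
library(dplyr)

# Made-up data for illustration only
data <- data.frame(
  ID_variable    = factor(c("a", "a", "a", "b", "b")),
  order_variable = c(3, 1, 2, 20, 10)
)

# group_modify() calls f once per group: x holds that group's rows
# (without the grouping column), key is a one-row tibble with the group value
f <- function(x, key) {
  x <- x[order(x$order_variable), ]
  x$Experience <- seq_len(nrow(x)) - 1
  x
}

result <- data %>%
  group_by(ID_variable) %>%
  group_modify(f) %>%
  ungroup()

print(result)
```

Unlike do(), group_modify() expects the function to return a data frame, and it adds the grouping column back to the result itself.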


Create a dataframe looping a function's results

So this is a simplification of my problem.
I have a dataframe like this:
df <- data.frame(name=c("lucas","julio","jack","juan"),number=c(1,15,100,22))
And I have a function that creates new values for every name, like this:
var_number <- function(x) {
  example <- df %>%
    filter(name %in% unique(df$name)[x]) %>%
    select(-name) %>%
    mutate(value1=number/2^5, value2=number^5)
  (example)
}
var_number(1)
0.03125 1
Now I have two new values for every name and I would like to create a loop to save each result in a new dataframe.
I know how to solve this particular problem, but I need a general solution that allows me to save the results of all functions into a dataframe.
I'm looking for an automatic way to do something like this:
result<- bind_rows(var_number(1),var_number(2),var_number(3),var_number(4))
I would have to apply var_number around 1000 times, and the length would change with every test I do.
Is there any way I can do something like this? I was thinking about doing it with a "for" loop, but I'm not really sure how to do it. I have just started with R and I am a total newbie.
This answers my problem:
library(tidyverse) # contains the purrr library
# an arbitrary function that always outputs a dataframe
# with a consistent number of columns, in this case 3
myfunc <- function(x){
  data.frame(a=x*2,
             b=x^2,
             c=log2(x))
}
# iterate over 1:10 as inputs to myfunc, and
# combine the results rowwise into a df
purrr::map_dfr(1:10,
               ~myfunc(.))
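Applied to the question's own var_number function, the pattern might look like the sketch below. It assumes the df and var_number definitions from the question, and it derives the indices from the data with seq_along(), so the number of calls follows the number of distinct names instead of being typed out by hand.

```r
library(dplyr)
library(purrr)

df <- data.frame(name = c("lucas", "julio", "jack", "juan"),
                 number = c(1, 15, 100, 22))

# var_number as defined in the question: one row of derived values per name index
var_number <- function(x) {
  df %>%
    filter(name %in% unique(df$name)[x]) %>%
    select(-name) %>%
    mutate(value1 = number / 2^5, value2 = number^5)
}

# Call var_number once per distinct name and stack the results row-wise
result <- map_dfr(seq_along(unique(df$name)), var_number)
print(result)
```

A base R alternative with the same effect would be bind_rows(lapply(seq_along(unique(df$name)), var_number)).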
Why do you want to apply var_number function for each name, create a new dataframe for each and then combine all of them together?
Do it only once in the same dataframe.
library(dplyr)
df1 <- df %>%
  mutate(value1=number/2^5, value2=number^5) %>%
  select(-name)
If you want to do it only for specific names, you can filter them first before applying the above.

How can I write this R expression in the pipe operator format?

I am trying to rewrite this expression to magrittr’s pipe operator:
print(mean(pull(df, height), na.rm=TRUE))
which returns 175.4 for my dataset.
I know that I have to start with the data frame and write it as >df%>% but I’m confused about how to write it inside out. For example, should the na.rm=TRUE go inside mean(), pull() or print()?
UPDATE: I actually figured it out by trial and error...
df %>%
  pull(height) %>%
  mean(na.rm=TRUE) %>%
  print()
which returns 175.4.
It would be good practice to make a reproducible example, with dummy data like this:
height <- 1:30
weight <- 1:30
df <- data.frame(height, weight)
The pipe operator works with most of the tidyverse, not just magrittr; pull() itself comes from dplyr. Summary functions such as mean, sd, min and max return NA as soon as the input contains an NA value, so na.rm=T is needed whenever the data may contain missing values.
df %>% pull(height) %>% mean(na.rm=T) %>% print()
Unless your data is nested, you may not even need to use pull:
df %>% summarise(mean = mean(height,na.rm=T))
Also, using summarise you can pipe these into another dataframe rather than just printing, and call them out of the dataframe whenever you want.
df %>% summarise(meanHt = mean(height,na.rm=T), sdHt = sd(height,na.rm=T)) -> summary
summary[1]
summary[2]
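To see why na.rm matters, here is a small sketch with an invented data frame that contains one missing height. The values are made up for illustration and are not the asker's data.

```r
library(dplyr)

# Made-up heights with one missing value
df <- data.frame(height = c(170, NA, 180))

# Without na.rm the missing value propagates and the result is NA
mean_with_na <- df %>% pull(height) %>% mean()

# With na.rm = TRUE the missing value is dropped before averaging
mean_no_na <- df %>% pull(height) %>% mean(na.rm = TRUE)

print(mean_with_na)
print(mean_no_na)
```

So na.rm=TRUE belongs inside mean(), because mean() is the function that has to decide what to do with the missing values.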

How can you obtain the group_by value for use in passing to a function?

I am trying to use dplyr to apply a function to a data frame that is grouped using the group_by function. I am applying a function to each row of the grouped data using do(). I would like to obtain the value of the group_by variable so that I might use it in a function call.
So, effectively, I have-
tmp <-
  my_data %>%
  group_by(my_grouping_variable) %>%
  do(my_function_call(data.frame(x = .$X, y = .$Y),
                      GROUP_BY_VARIABLE))
I'm sure that I could call unique and get it...
do(my_function_call(data.frame(x = .$X, y = .$Y),
                    unique(.$my_grouping_variable)))
But, it seems clunky and would inefficiently call unique for every grouping value.
Is there a way to get the value of the group_by variable in dplyr?
I'm going to prematurely say sorry if this is a crazy easy thing to answer. I promise that I've exhaustively searched for an answer.
First, if necessary, check if it's a grouped data frame: inherits(data, "grouped_df").
If you want the subsets of data frames, you could nest the groups:
mtcars %>% group_by(cyl) %>% nest()
Usually, you won't nest within the pipe chain, but check inside your function:
your_function <- function(x) {
  if (inherits(x, "grouped_df")) x <- nest(x)
  x  # return the (possibly nested) data
}
Your function should then iterate over the list-column data, which holds all grouped subsets. If you instead use a function within mutate, e.g.
mtcars %>% group_by(cyl) %>% mutate(abc = your_function_call(mpg))
then note that your function is called once per group and directly receives that group's column values (here mpg) as a plain vector. The easiest way to see this is to try it out and debug your_function_call step by step.
You can use groups() to get the grouping variables. However, a standard-evaluation (SE) version of it does not exist, so I'm unsure of its use in programming.
library(dplyr)
df <- mtcars %>% group_by(cyl, mpg)
groups(df)
[[1]]
cyl
[[2]]
mpg
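If the goal is to hand the current group's value to your own function, one option is group_modify(), which passes the group key as a second argument. This is a sketch only, assuming dplyr 0.8.1 or later; the summary computed here (a label plus the mean mpg) is just a stand-in for the asker's my_function_call.

```r
library(dplyr)

# Inside the formula, .x is the data for one group and
# .y is a one-row tibble holding that group's key (here the cyl value)
res <- mtcars %>%
  group_by(cyl) %>%
  group_modify(~ data.frame(
    label    = paste("cyl", .y$cyl),
    mean_mpg = mean(.x$mpg)
  )) %>%
  ungroup()

print(res)
```

This avoids calling unique() on the grouping column inside every group, because dplyr already knows the key for each group.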

How to analyse a data set both grouped by and ungrouped in one analysis using dplyr

This is my first stackoverflow question.
I'm trying to use dplyr to process and output a summary of data grouped by a categorical variable (inj_length_cat3) in my dataset. Actually, I generate this variable (from inj_length) on the fly using mutate(). I also want to output the same summary of the data without grouping. The only way I figured out how to do that is to do the analysis twice over, once with, once without grouping, and then combine the outputs. Ugh.
I'm sure there is a more elegant solution than this and it bugs me. I wonder if anyone would be able to help.
Thanks!
library(dplyr)
df<-data.frame(year=sample(c(2005,2006),20,replace=T),inj_length=sample(1:10,20,replace=T),hiv_status=sample(0:1,20,replace=T))
tmp <- df %>%
  mutate(inj_length_cat3 = cut(inj_length, breaks=c(0,3,100), labels = c('<3 years','>3 years'))) %>%
  group_by(year, inj_length_cat3) %>%
  summarise(
    r=sum(hiv_status,na.rm=T),
    n=length(hiv_status),
    p=prop.test(r,n)$estimate,
    cilow=prop.test(r,n)$conf.int[1],
    cihigh=prop.test(r,n)$conf.int[2]
  ) %>%
  filter(inj_length_cat3 %in% c('<3 years','>3 years'))
tmp_all <- df %>%
  group_by(year) %>%
  summarise(
    r=sum(hiv_status,na.rm=T),
    n=length(hiv_status),
    p=prop.test(r,n)$estimate,
    cilow=prop.test(r,n)$conf.int[1],
    cihigh=prop.test(r,n)$conf.int[2]
  )
tmp_all$inj_length_cat3=as.factor('All')
tmp<-merge(tmp_all,tmp,all=T)
I'm not sure you consider this more elegant, but you can get a solution to work if you first create a dataframe that has all your data twice: once so that you can get the subgroups and once to get the overall summary:
df1 <- rbind(df,df)
df1$inj_length_cat3 <- cut(df$inj_length, breaks=c(0,3,100,Inf),
labels = c('<3 years','>3 years','All'))
df1$inj_length_cat3[-(1:nrow(df))] <- "All"
Now you just need to run your first analysis without mutate():
tmp <- df1 %>%
  group_by(year, inj_length_cat3) %>%
  summarise(
    r=sum(hiv_status,na.rm=T),
    n=length(hiv_status),
    p=prop.test(r,n)$estimate,
    cilow=prop.test(r,n)$conf.int[1],
    cihigh=prop.test(r,n)$conf.int[2]
  ) %>%
  filter(inj_length_cat3 %in% c('<3 years','>3 years','All'))
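Another way to avoid writing the summary twice is to put it in a small helper function and apply it to both groupings. The sketch below is one possible arrangement, not the answer's method: the helper name summarise_hiv is hypothetical, the confidence-interval columns are left out to keep it short, and it assumes dplyr 1.0 or later for the .groups argument.

```r
library(dplyr)

set.seed(1)
df <- data.frame(year = sample(c(2005, 2006), 20, replace = TRUE),
                 inj_length = sample(1:10, 20, replace = TRUE),
                 hiv_status = sample(0:1, 20, replace = TRUE))

# Hypothetical helper: the summary is written once and reused for any grouping
summarise_hiv <- function(d) {
  d %>% summarise(r = sum(hiv_status, na.rm = TRUE),
                  n = n(),
                  p = r / n,
                  .groups = "drop")
}

# Category stored as character so it can be combined with the "All" label
df_cat <- df %>%
  mutate(inj_length_cat3 = as.character(
    cut(inj_length, breaks = c(0, 3, 100), labels = c("<3 years", ">3 years"))))

result <- bind_rows(
  df_cat %>% group_by(year, inj_length_cat3) %>% summarise_hiv(),
  df_cat %>% group_by(year) %>% summarise_hiv() %>% mutate(inj_length_cat3 = "All")
)

print(result)
```

The data are still summarised twice, but the summary logic lives in one place, so adding the prop.test() columns back only requires changing the helper.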

Split up a dataframe by number of rows

I have a dataframe made up of 400'000 rows and about 50 columns. As this dataframe is so large, it is too computationally taxing to work with.
I would like to split this dataframe up into smaller ones, after which I will run the functions I would like to run, and then reassemble the dataframe at the end.
There is no grouping variable that I would like to use to split up this dataframe. I would just like to split it up by number of rows. For example, I would like to split this 400'000-row table into 400 1'000-row dataframes.
How might I do this?
Make your own grouping variable.
d <- split(my_data_frame,rep(1:400,each=1000))
You should also consider the ddply function from the plyr package, or the group_by() function from dplyr.
edited for brevity, after Hadley's comments.
If you don't know how many rows are in the data frame, or if the number of rows might not be an exact multiple of your desired chunk size, you can do
chunk <- 1000
n <- nrow(my_data_frame)
r <- rep(1:ceiling(n/chunk),each=chunk)[1:n]
d <- split(my_data_frame,r)
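The full split, process and reassemble cycle from the question can be sketched in base R as follows. The 25-row data frame, the chunk size of 10 and the halving step are all invented for illustration; the halving stands in for whatever per-chunk function you need to run.

```r
# Made-up data: 25 rows, so the last chunk is shorter than the others
my_data_frame <- data.frame(id = 1:25, value = (1:25)^2)

chunk <- 10
n <- nrow(my_data_frame)

# Chunk label for every row: 1,1,...,2,2,...,3,... truncated to n rows
r <- rep(seq_len(ceiling(n / chunk)), each = chunk)[seq_len(n)]
pieces <- split(my_data_frame, r)

# Run the (placeholder) per-chunk step on each piece
processed <- lapply(pieces, function(p) transform(p, value = value / 2))

# Reassemble in the original row order
reassembled <- do.call(rbind, processed)
rownames(reassembled) <- NULL

print(sapply(pieces, nrow))
```

Because split() keeps the pieces in label order and each piece keeps its internal row order, binding them back together restores the original ordering.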
You could also use
r <- ggplot2::cut_width(1:n,chunk,boundary=0)
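For future readers, methods based on the dplyr and data.table packages will probably be (much) faster for doing group-wise operations on data frames, e.g. something like the following, where ngrps (the number of chunks) and full_number (the rows per chunk) are placeholders you define yourself:
(my_data_frame
%>% mutate(index=rep(1:ngrps,each=full_number)[seq_len(n())])
%>% group_by(index)
# %>% then mutate(), summarise() or do() on each chunk ...
)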
I had a similar question and used this:
library(tidyverse)
n = 100 # number of rows per group
split <- df %>% group_by((row_number() - 1) %/% n) %>% group_map(~ .x)
From left to right:
you assign your result to split
you start with df as your input dataframe
then you group your data by dividing the zero-based row number by n (the number of rows per group) using integer division, so rows 1 to n form the first group, rows n+1 to 2n the second, and so on.
then you pass each group through the group_map function, which returns a list.
So in the end split is a list in which each element is one group of your dataset.
On the other hand, you could also immediately write your data by replacing the group_map call by e.g. group_walk(~ write_csv(.x, paste0("file_", .y, ".csv"))).
You can find more info on these powerful tools in the dplyr cheat sheet (which explains group_by) and in the dplyr documentation for the follow-up functions group_map and group_walk.
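A quick way to check the chunk sizes produced by this approach is the sketch below. It uses an invented 250-row data frame and assumes dplyr 0.8.1 or later for group_map(); the grouping column is named chunk here only for readability.

```r
library(dplyr)

# Made-up data: 250 rows, chunks of 100 rows each
df <- data.frame(id = 1:250)
n <- 100

# Zero-based row number divided by n: rows 1-100 -> 0, 101-200 -> 1, 201-250 -> 2
chunks <- df %>%
  group_by(chunk = (row_number() - 1) %/% n) %>%
  group_map(~ .x)

sizes <- vapply(chunks, nrow, integer(1))
print(sizes)

# group_map() drops the grouping column from .x, so binding restores df
restored <- bind_rows(chunks)
```

Subtracting 1 before the integer division matters: with row_number() %/% n alone, the first chunk would hold only n - 1 rows.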
