I'm new to functions. I'm trying to create one that will aggregate the total number of unique values of one variable by some category. Ex. the number of unique visitors to a store every day.
I was not able to get this to work using ddply, which was my original plan. I was successful, however, using aggregate. My problem is that I want the columns to keep their original names instead of taking on the names used inside the function (that is, the returned data frame should have columns day and visitor_id instead of a and b).
I have the a and b in the function because that was the only way I could figure out how to make it look for a variable instead of an object.
data <- data.frame(day = rep(c("Mon", "Tues", "Wed", "Thurs", "Fri"), times = 5),
visitor_id = c(111,222,333,222,111,222,333,222,222,222,222,111,222,222,333,111,111,222,222,111,222,333,333,333,333))
total_unique <- function(var) {
x <- length(unique(var))
return(x)
}
my_function <- function(data, ag_category, var) {
a <- eval(substitute(ag_category), data)
b <- eval(substitute(var), data)
x <- aggregate(b~a, data, FUN=total_unique)
return(x)
}
test <- my_function(data=data, ag_category=day, var=visitor_id)
Also, if anyone can point out what I did wrong with the ddply code, that would also be really helpful!
my_function2 <- function(data, ag_category, var) {
require(plyr)
a <- eval(substitute(ag_category), data)
b <- eval(substitute(var), data)
x <- ddply(data,~a,summarise, length(unique(b)))
return(x)
}
test2 <- my_function2(data=data, ag_category=day, var=visitor_id)
Here is a solution:
library(tidyverse)
myFun <- function(data, ag_category, var){
varname <- quo({{var}})
data %>%
group_by({{ag_category}}) %>%
summarise(!!varname := length(unique({{var}})))
}
myFun(data=data, ag_category=day, var=visitor_id)
#> # A tibble: 5 x 2
#> day visitor_id
#> <fct> <int>
#> 1 Fri 3
#> 2 Mon 2
#> 3 Thurs 2
#> 4 Tues 3
#> 5 Wed 2
Instead of copying the columns into new variables inside the function, we use rlang (part of the tidyverse) to pass the column names through from the function call. We group by the grouping variable and then count the number of unique values of the other variable within each group.
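A shorter variant of the same idea, as a sketch only: it assumes a recent dplyr/rlang where the "{{ var }}" := naming syntax is available, and it swaps length(unique()) for dplyr's n_distinct(). The function name count_unique_by and the small visits data frame are made up for illustration.

```r
library(dplyr)

# count_unique_by() is a hypothetical name used only for this sketch
count_unique_by <- function(data, ag_category, var) {
  data %>%
    group_by({{ ag_category }}) %>%
    # "{{ var }}" := names the result column after the column passed in
    summarise("{{ var }}" := n_distinct({{ var }}))
}

# toy data, not the data from the question
visits <- data.frame(
  day = c("Mon", "Mon", "Tues", "Tues", "Tues"),
  visitor_id = c(111, 111, 222, 333, 222)
)

res <- count_unique_by(visits, day, visitor_id)
res
```

The result keeps the original column names (day and visitor_id), which is what the question asks for.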
If you really want to pass in the names as symbols, then you need to take extra care to construct the formula you want. Here's one way to do it
my_function <- function(data, ag_category, var) {
ff <- do.call("~", list(substitute(var), substitute(ag_category)))
x <- aggregate(ff, data, FUN=total_unique)
return(x)
}
my_function(data=data, ag_category=day, var=visitor_id)
It would be even easier if you passed in the names as strings rather than symbols
my_function_str <- function(data, ag_category, var) {
x <- aggregate(reformulate(ag_category, var), data, FUN=total_unique)
return(x)
}
my_function_str(data=data, ag_category="day", var="visitor_id")
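To see that both approaches hand aggregate the same formula, here is a small base R sketch. The helper names f_sym and f_str and the toy visits data are made up for illustration; only do.call, substitute, reformulate and aggregate come from the answer above.

```r
# f_sym(): build the formula from unquoted names (symbols)
f_sym <- function(ag_category, var) {
  do.call("~", list(substitute(var), substitute(ag_category)))
}

# f_str(): build the formula from names given as strings
f_str <- function(ag_category, var) {
  reformulate(ag_category, response = var)
}

ff1 <- f_sym(day, visitor_id)
ff2 <- f_str("day", "visitor_id")
print(ff1)
print(ff2)

# toy data, not the data from the question
visits <- data.frame(
  day = c("Mon", "Mon", "Tues", "Tues", "Tues"),
  visitor_id = c(111, 111, 222, 333, 222)
)

# either formula can be passed to aggregate; columns keep their names
agg <- aggregate(ff1, visits, FUN = function(v) length(unique(v)))
agg
```

Both formulas read visitor_id ~ day, so the aggregated data frame has columns day and visitor_id rather than a and b.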
Related
Hi, I'd like to group by two dataframe columns and apply a function to another two dataframe columns.
For e.g.,
ticker <- c("A", "A", 'A', "B", "B", "B")
date <- c(1,1,2,1,2,1)
ret <- c(1,2,4,6,9,5)
vol <- c(3,5,1,6,2,3)
dat <- data.frame(ticker,date,ret,vol)
For each ticker and each date, I'd like to calculate its PIN.
Now, to avoid further confusion, perhaps it helps to just speak out the actual function. YZ is a function in the InfoTrad package, and YZ only accepts a dataframe with two columns. It uses some optimisation tool and returns an estimated PIN.
install.packages("InfoTrad")
library(InfoTrad)
get_pin_yz <- function(data) {
return(YZ(data[ ,c('volume_krw_buy', 'volume_krw_sell')])[['PIN']])
}
I know how to do this in R using a for loop. But a for loop is computationally very costly, and it might take weeks to finish running on my large dataset. Thus, I would like to ask how to do this using groupby.
# output format is wide wrt long format as "dat"
dat_w <- data.frame(ticker = NA, date = NA, PIN = NA)
for (j in c("A", "B")){
for (k in c(1:2)){
subset <- dat %>% subset((ticker == j & date == k), select = c('ret', "vol"))
new_row <- data.frame(ticker = j, date = k, PIN = YZ(subset)$PIN)
dat_w <- rbind(dat_w, new_row)
}
}
dat_w <- dat_w[-1, ]
dat_w
Don't know if this can help you help me -- I know how to do this in python: I just write a function and run df.groupby(['ticker','date']).apply(function).
Finally, the wanted dataframe is:
ticker <- c('A','A','B','B')
date <- c(1,2,1,2)
PIN <- c(1.05e-17,2.81e-09,1.12e-08,5.39e-09)
data.frame(ticker,date,PIN)
Could somebody help out, please?
Thank you!
Best,
Darcy
Previous stuff (Feel free to ignore)
Previously, I wrote this:
My function is:
get_rv <- function(data) {
return(data[['vol']] + data[['ret']])
}
What I want is:
ticker_wanted <- c('A','A', 'B', 'B')
date_wanted <- c(1,2,1,2)
rv_wanted <- c(7,5,10,11)
df_wanted <-data.frame(ticker_wanted,date_wanted,rv_wanted)
But this is not literally what my actual function is. The vol+ret is just an example. I'm more interested in the more general case: how to groupby and apply a general function to two or more dataframes. I use the vol + ret just because I didn't want to bother others by asking them to install some potentially irrelevant package on their PC.
Update based on real-life example:
You can do a direct approach like this:
library(tidyverse)
library(InfoTrad)
dat %>%
group_by(ticker, date) %>%
summarize(PIN = YZ(as.data.frame(cur_data()))$PIN)
# A tibble: 4 x 3
# Groups: ticker [2]
ticker date PIN
<chr> <dbl> <dbl>
1 A 1 1.05e-17
2 A 2 1.56e- 1
3 B 1 1.12e- 8
4 B 2 7.07e- 9
The difficulty here was that the YZ function only accepts true data frames, not tibbles and that it returns several values, not just PIN.
You could theoretically wrap this up into your own function and then run your own function like I've shown in the example below, but maybe this way already does the trick.
I also don't expect this to run much faster than a for loop. It seems that this YZ function has a more-than-linear runtime, so passing larger amounts of data will still take some time. You can start with a small set of data, then repeatedly increase the size of your data by a factor of maybe 10 and check how fast it runs each time.
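A rough way to run that scaling check, as a sketch: slow_fn below is a made-up stand-in for an expensive call such as YZ (it is not the InfoTrad function), and the sizes are arbitrary. Swap in your own function and subsets of your own data.

```r
set.seed(1)

# slow_fn() is a hypothetical stand-in for an expensive call such as YZ();
# pairwise distances make the work grow roughly quadratically with nrow(df)
slow_fn <- function(df) {
  sum(dist(df))
}

# grow the input by a factor of 10 each time
sizes <- c(30, 300, 3000)

timings <- sapply(sizes, function(n) {
  toy <- data.frame(buy = runif(n), sell = runif(n))
  system.time(slow_fn(toy))[["elapsed"]]
})

data.frame(rows = sizes, seconds = timings)
```

If the elapsed time grows by much more than a factor of 10 per step, the runtime is more than linear, and grouping alone will not make the full dataset fast.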
In your example, you can do:
my_function <- function(data) {
data %>%
summarize(rv = sum(ret, vol))
}
library(tidyverse)
df %>%
group_by(ticker, date) %>%
my_function()
# A tibble: 4 x 3
# Groups: ticker [2]
ticker date rv
<chr> <dbl> <dbl>
1 A 1 7
2 A 2 5
3 B 1 10
4 B 2 11
But as mentioned in my comment, I'm not sure if this general example would help in your real-life use case.
It might also be that you don't need to create your own function because built-in functions already exist. As in the example, you are better off summarizing directly instead of wrapping it into a function.
you could just do this? (with summarise as an example of your function):
ticker <- c("A", "A", 'A', "B", "B", "B")
date <- c(1,1,2,1,2,1)
ret <- c(1,-2,4,6,9,-5)
vol <- c(3,5,1,6,2,3)
df <- data.frame(ticker,date,ret,vol)
library(dplyr)
get_rv <- function(data){
  result <- data %>%
    group_by(ticker,date) %>%
    summarise(rv =sum(ret) + sum(vol)) %>%
    as.data.frame()
  names(result) <- c('ticker_wanted', 'date_wanted', 'rv_wanted')
  return(result)
}
# call the function only after it has been defined
df_wanted <- get_rv(df)
Assuming that your dataframe is as follows:
data <- data.frame(ticker,date,ret,vol)
Use split to split your dataframe into a list of dataframes based on the values of ticker and date.
dflist = split(data, f = list(data$ticker, data$date), drop = TRUE)
Now use lapply or sapply to run the function YZ() on each dataframe member of dflist.
# YZ() only accepts a two-column data frame, so drop ticker and date first
pins <- lapply(dflist, function(x) YZ(x[, c("ret", "vol")])$PIN)
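The list returned by lapply still has to be turned back into the ticker/date/PIN table the question asks for. A minimal base R sketch of that step follows; toy_stat is a made-up stand-in for YZ (so this runs without InfoTrad), and its numbers have nothing to do with a real PIN.

```r
dat <- data.frame(
  ticker = c("A", "A", "A", "B", "B", "B"),
  date   = c(1, 1, 2, 1, 2, 1),
  ret    = c(1, 2, 4, 6, 9, 5),
  vol    = c(3, 5, 1, 6, 2, 3)
)

# toy_stat() is a hypothetical stand-in for YZ(): it takes a two-column
# data frame and returns a list with one number in the element PIN
toy_stat <- function(two_cols) {
  list(PIN = sum(two_cols[[1]]) / sum(two_cols[[2]]))
}

# one data frame per ticker/date combination that actually occurs
pieces <- split(dat, list(dat$ticker, dat$date), drop = TRUE)

# one output row per group, keeping the group keys next to the result
rows <- lapply(pieces, function(g) {
  data.frame(
    ticker = g$ticker[1],
    date   = g$date[1],
    PIN    = toy_stat(g[, c("ret", "vol")])$PIN
  )
})

result <- do.call(rbind, rows)
rownames(result) <- NULL
result
```

Replacing toy_stat with YZ should give the same shape of output, assuming YZ returns a list with a PIN element as described in the question.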
I am writing a function that subsets a dataframe and then feeds the subset to a second action. The output of the function is the result of that second action. However, since I still need the cleaned dataframe for another purpose, I was wondering whether I can store it in the environment so that it can be used later?
For instance,
Let's say I have this dataframe.
ID Var1
1 5 3
2 6 1
And my function is like this:
mu_fuc <- function(df, condition) {
#clean dataset
condition <- eval(as.list(match.call())$condition, df)
workingdf <- subset(df, condition < 3) #### I am trying to store this working dataframe for later use.
#second action
result = sum(workingdf[condition])
#output of the function
return(result)
}
Since the result of the function would be used later as well, I can't add workingdf to return. Otherwise, the output of the function would contain workingdf when I try to feed the output to another function, which is something I don't want.
So, for example, if I want to run the following, I need the output of the function to be a single number only.
mu_fuc(data, Var1) - 5
I hope I am making myself clear.
Any help is greatly appreciated!!
You can return a list from the function with the result that you want.
mu_fuc <- function(df, condition) {
#clean dataset
condition <- eval(as.list(match.call())$condition, df)
workingdf <- subset(df, condition < 3)
#second action
result = sum(workingdf)
#output of the function
return(list(result = result, workingdf = workingdf))
}
Call it as :
output <- mu_fuc(df, Var1)
You can separate out the result using $ operator and process them separately.
output$result
output$workingdf
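A self-contained sketch of the list approach, to show that the numeric part can still be used in arithmetic on its own. The function name clean_and_sum is made up, and the column is passed as a string here to keep the example simple, which differs from the match.call version above.

```r
d <- data.frame(ID = 5:6, Var1 = c(3, 1))

# clean_and_sum() is a hypothetical name; `column` is a column name as a string
clean_and_sum <- function(df, column) {
  values <- df[[column]]
  # clean dataset: keep rows where the chosen column is below 3
  workingdf <- df[values < 3, , drop = FALSE]
  # second action: sum the chosen column of the cleaned data
  result <- sum(workingdf[[column]])
  list(result = result, workingdf = workingdf)
}

out <- clean_and_sum(d, "Var1")

out$result - 5   # plain number, usable in further calculations
out$workingdf    # the cleaned data frame, kept for later
```

Only out$result goes into the next calculation, so the cleaned data frame never gets mixed into it.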
You may store workingdf in an attribute.
mu_fuc <- function(df, condition) {
## clean dataset
condition <- eval(as.list(match.call())$condition, df)
workingdf <- subset(df, condition < 3)
## second action
result <- sum(condition)
attr(result, "workingdf") <- workingdf
return(result)
}
Calculation with the result as usual.
r <- mu_fuc(d, Var1)
r - 5
# [1] -1
# attr(,"workingdf")
# ID Var1
# 2 6 1
To keep the attribute from being printed along with the result (a purely cosmetic issue), wrap the result in as.numeric, which drops attributes.
as.numeric(r) - 5
# [1] -1
or
r2 <- as.numeric(mu_fuc(d, Var1))
r2 - 5
# [1] -1
To get workingdf, fetch it from the attribute.
wdf <- attr(mu_fuc(d, Var1), "workingdf")
wdf
# ID Var1
# 2 6 1
Data:
d <- data.frame(ID=5:6, Var1=c(3, 1))
I'm trying to make my code general, I'd only want to change the YEAR variable without having to change everything in the code
YEAR = 1970
y <- data.frame(col1 = c(1:5))
function (y){
summarize(column_YEAR = sum(col1))
}
#Right now this gives
column_YEAR
1 15
#I would like this function to output this (so col1 is changed to column_1970)
column_1970
1 15
or for example this
df <- list("a_YEAR" = anotherdf)
#I would like to have a list with a df with the name a_1970
I tried things like
df <- list(assign(paste0(a_, YEAR), anotherdf))
But it does not work, does somebody have any advice? Thanks in advance :)
rlang provides a flexible way to defuse R expressions. You can use that functionality to create dynamic column names within a dplyr pipeline. In this example, the dynamic column name is built from the suffix argument passed to a wrapper function around dplyr's summarise.
library("tidyverse")
YEAR = 1970
y <- data.frame(col1 = c(1:5))
function (y) {
summarize(column_YEAR = sum(col1))
}
my_summarise <- function(.data, suffix, sum_col) {
var_name <- paste0("column_", suffix)
summarise(.data,
{{var_name}} := sum({{sum_col}}))
}
my_summarise(.data = y, suffix = YEAR, sum_col = col1)
Results
my_summarise(.data = y, suffix = YEAR, sum_col = col1)
# column_1970
# 1 15
You can also take values directly from the global environment, but from a readability perspective this is a poorer solution, as it's not immediately clear where the function gets the suffix from.
my_summarise_two <- function(.data, sum_col) {
var_name <- paste0("column_", YEAR)
summarise(.data,
{{var_name}} := sum({{sum_col}}))
}
my_summarise_two(.data = y, sum_col = col1)
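The second part of the question (a list element named a_1970) is not covered above. A base R sketch, assuming anotherdf is just some data frame (the one below is a placeholder):

```r
YEAR <- 1970
anotherdf <- data.frame(col1 = 1:5)  # placeholder for the real data frame

# build the name as a string, then attach it to the list element
df <- setNames(list(anotherdf), paste0("a_", YEAR))
names(df)

# equivalent: start with an empty list and assign by computed name
df2 <- list()
df2[[paste0("a_", YEAR)]] <- anotherdf
names(df2)
```

The attempt in the question fails because a_ is written without quotes, so R looks for an object called a_, and because assign creates a variable rather than naming a list element.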
I am trying to create a custom function where some operation is carried out only on one column of the dataframe. But I want the function to work in such a way that it will output not only the column on which the operation was carried out, but also the entire dataframe from which that particular column was drawn. This is a very simple example of what I want to achieve:
# libraries needed
library(dplyr)
library(rlang)
# creating a dataframe
data <-
as.data.frame(cbind(
x = rnorm(5),
y = rnorm(5),
z = rnorm(5)
))
# defining the custom function
custom.fn <- function(data, x) {
df <- dplyr::select(.data = data,
x = !!rlang::enquo(x)) # how can I also retain all the other columns in the dataframe apart from x?
df$x <- df$x / 2
return(df)
}
# calling the function (also want y and z here in the ouput)
custom.fn(data = data, x = x)
#> x
#> 1 0.49917536
#> 2 -0.03373202
#> 3 -1.24845349
#> 4 -0.15809688
#> 5 0.11237030
Created on 2018-02-14 by the reprex package (v0.1.1.9000).
Just specify the columns you want to include in your select call:
custom.fn <- function(data, x) {
df <- dplyr::select(.data = data,
x = !!rlang::enquo(x), y, z)
df$x <- df$x / 2
return(df)
}
If you don't want to name the rest of the columns explicitly, you can also use everything():
df <- dplyr::select(.data = data, x = !!rlang::enquo(x), dplyr::everything())
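Another option, sketched here as an alternative rather than what the answer above does, is to skip select entirely and modify the column in place with mutate, so the other columns come along automatically. This assumes a recent dplyr/rlang with the "{{ x }}" := syntax; halve_column and the toy data are made up. Unlike the select version, the chosen column keeps its own name instead of being renamed to x.

```r
library(dplyr)

# halve_column() is a hypothetical name used only for this sketch
halve_column <- function(data, x) {
  # overwrite the chosen column; all other columns are left untouched
  mutate(data, "{{ x }}" := {{ x }} / 2)
}

# toy data, not the random data from the question
toy <- data.frame(x = c(2, 4), y = c(1, 1), z = c(0, 0))

out <- halve_column(toy, x)
out
```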
I have a table, called table_wo_nas, with multiple columns, one of which is titled ID. For each value of ID there are many rows. I want to write a function that for input x will output a data frame containing the number of rows for each ID, with column headers ID and nobs respectively as below for x <- c(2,4,8).
## id nobs
## 1 2 1041
## 2 4 474
## 3 8 192
This is what I have. It works when x is a single value (ex. 3), but not when it contains multiple values, for example 1:10 or c(2,5,7). I receive the warning "In ID[counter] <- x : number of items to replace is not a multiple of replacement length". I've just started learning R and have been struggling with this for a week and have searched manuals, this site, Google, everything. Can someone help please?
counter <- 1
ID <- vector("numeric") ## contain x
nobs <- vector("numeric") ## contain nrow
for (i in x) {
r <- subset(table_wo_nas, ID %in% x) ## create subset for rows of ID=x
ID[counter] <- x ## add x to ID
nobs[counter] <- nrow(r) ## add nrow to nobs
counter <- counter + 1 } ## loop
result <- data.frame(ID, nobs) ## create data frame
In base R,
# To make a named vector, either:
tmp <- sapply(split(table_wo_nas, table_wo_nas$ID), nrow)
# OR just:
tmp <- table(table_wo_nas$ID)
# AND
# arrange into data.frame
# as.vector() keeps a table from expanding into two columns
nobs_df <- data.frame(ID = names(tmp), nobs = as.vector(tmp))
Alternately, coerce the table into a data.frame directly, and rename:
nobs_df <- data.frame(table(table_wo_nas$ID))
names(nobs_df) <- c('ID', 'nobs')
If you only want certain rows, subset:
# match on the ID values rather than on row positions
nobs_df[nobs_df$ID %in% c(2, 4, 8), ]
There are many, many more options; these are just a few.
With dplyr,
library(dplyr)
table_wo_nas %>% group_by(ID) %>% summarise(nobs = n())
If you only want certain IDs, add on a filter:
table_wo_nas %>% group_by(ID) %>% summarise(nobs = n()) %>% filter(ID %in% c(2, 4, 8))
Seems pretty straightforward if you just use table again:
tbl <- table( table_wo_nas[ , 'ID'] )
data.frame( IDs = names(tbl), nobs= as.vector(tbl))
Could also get a quick answer although with different column names using:
as.data.frame(table( table_wo_nas[ , 'ID'] ))
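Since the question asks for a function of x, here is the table approach wrapped up as one, as a base R sketch. The name count_obs and the toy data are made up; note that an ID with no rows at all simply does not appear in the output.

```r
# count_obs() is a hypothetical name: rows per ID, restricted to `ids`
count_obs <- function(tbl, ids) {
  counts <- as.data.frame(
    table(ID = tbl$ID),
    responseName = "nobs",
    stringsAsFactors = FALSE
  )
  # table() stores the IDs as labels, so turn them back into numbers
  counts$ID <- as.numeric(counts$ID)
  out <- counts[counts$ID %in% ids, ]
  rownames(out) <- NULL
  out
}

# toy stand-in for table_wo_nas
toy <- data.frame(ID = c(2, 2, 4, 8, 8, 8, 3), value = 1:7)

res <- count_obs(toy, c(2, 4, 8))
res
```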
Try this.
x <- c(2, 4, 8)
# df is your data frame table_wo_nas
count_of <- function(x) {
  count_of_id <- numeric(length(x))
  for (i in seq_along(x)) {
    # number of rows whose ID equals x[i]
    count_of_id[i] <- sum(df$ID == x[i])
  }
  # one row per requested ID, with its row count
  data.frame(id = x, nobs = count_of_id)
}
count_of(x)