R: Group by and apply a general function to two columns

Hi, I'd like to group by two dataframe columns and apply a function to another two dataframe columns.
For example:
ticker <- c("A", "A", "A", "B", "B", "B")
date <- c(1, 1, 2, 1, 2, 1)
ret <- c(1, 2, 4, 6, 9, 5)
vol <- c(3, 5, 1, 6, 2, 3)
dat <- data.frame(ticker, date, ret, vol)
For each ticker and each date, I'd like to calculate its PIN.
Now, to avoid further confusion, perhaps it helps to just speak out the actual function. YZ is a function in the InfoTrad package, and YZ only accepts a dataframe with two columns. It uses some optimisation tool and returns an estimated PIN.
install.packages("InfoTrad")
library(InfoTrad)
get_pin_yz <- function(data) {
  return(YZ(data[, c('volume_krw_buy', 'volume_krw_sell')])[['PIN']])
}
I know how to do this in R with a for loop, but a for loop is computationally costly and might take weeks to finish on my large dataset. So I'd like to ask how to do this with a group-by.
# output format is wide with respect to the long format of "dat"
library(dplyr)
dat_w <- data.frame(ticker = NA, date = NA, PIN = NA)
for (j in c("A", "B")) {
  for (k in 1:2) {
    subset <- dat %>% subset((ticker == j & date == k), select = c("ret", "vol"))
    new_row <- data.frame(ticker = j, date = k, PIN = YZ(subset)$PIN)
    dat_w <- rbind(dat_w, new_row)
  }
}
dat_w <- dat_w[-1, ]
dat_w
I don't know if this helps you help me, but I know how to do this in Python: I just write a function and run df.groupby(['ticker','date']).apply(function).
Finally, the wanted dataframe is:
ticker <- c('A','A','B','B')
date <- c(1,2,1,2)
PIN <- c(1.05e-17,2.81e-09,1.12e-08,5.39e-09)
data.frame(ticker,date,PIN)
Could somebody help out, please?
Thank you!
Best,
Darcy
Previous stuff (Feel free to ignore)
Previously, I wrote this:
My function is:
get_rv <- function(data) {
  return(data[['vol']] + data[['ret']])
}
What I want is:
ticker_wanted <- c('A','A', 'B', 'B')
date_wanted <- c(1,2,1,2)
rv_wanted <- c(7,5,10,11)
df_wanted <- data.frame(ticker_wanted, date_wanted, rv_wanted)
But this is not literally what my actual function is. The vol + ret is just an example. I'm more interested in the general case: how to group by and apply a general function to two or more dataframe columns. I used vol + ret only because I didn't want to bother others by asking them to install a potentially irrelevant package on their PC.

Update based on real-life example:
You can do a direct approach like this:
library(tidyverse)
library(InfoTrad)
dat %>%
  group_by(ticker, date) %>%
  summarize(PIN = YZ(as.data.frame(cur_data()))$PIN)
# A tibble: 4 x 3
# Groups: ticker [2]
ticker date PIN
<chr> <dbl> <dbl>
1 A 1 1.05e-17
2 A 2 1.56e- 1
3 B 1 1.12e- 8
4 B 2 7.07e- 9
The difficulty here was that the YZ function only accepts true data frames, not tibbles, and that it returns several values, not just PIN.
You could theoretically wrap this up into your own function and then run your own function like I've shown in the example below, but maybe this way already does the trick.
I also don't expect this to run much faster than a for loop. It seems that this YZ function has more-than-linear runtime, so passing larger amounts of data will still take some time. You can start with a small set of data, then repeatedly increase its size by a factor of maybe 10 and check how fast it runs.
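For instance, a rough way to check that scaling, reusing the grouped call from above (a sketch only; the copy factors below are arbitrary):
library(tidyverse)
library(InfoTrad)
for (n in c(1, 10, 100)) {
  big <- dat[rep(seq_len(nrow(dat)), n), ]   # n copies of the original rows
  t <- system.time(
    big %>%
      group_by(ticker, date) %>%
      summarize(PIN = YZ(as.data.frame(cur_data()))$PIN)
  )
  print(c(copies = n, elapsed = unname(t["elapsed"])))
}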
In your example, you can do:
my_function <- function(data) {
  data %>%
    summarize(rv = sum(ret, vol))
}
library(tidyverse)
df %>%
  group_by(ticker, date) %>%
  my_function()
# A tibble: 4 x 3
# Groups: ticker [2]
ticker date rv
<chr> <dbl> <dbl>
1 A 1 7
2 A 2 5
3 B 1 10
4 B 2 11
But as mentioned in my comment, I'm not sure if this general example will help in your real-life use case.
It might also be that you don't need to create your own function because built-in functions already exist. As in the example, you are better off summarizing directly instead of wrapping it into a function.

You could just do this (with summarise as an example of your function):
library(dplyr)

ticker <- c("A", "A", "A", "B", "B", "B")
date <- c(1, 1, 2, 1, 2, 1)
ret <- c(1, -2, 4, 6, 9, -5)
vol <- c(3, 5, 1, 6, 2, 3)
df <- data.frame(ticker, date, ret, vol)

get_rv <- function(data) {
  result <- data %>%
    group_by(ticker, date) %>%
    summarise(rv = sum(ret) + sum(vol)) %>%
    as.data.frame()
  names(result) <- c('ticker_wanted', 'date_wanted', 'rv_wanted')
  return(result)
}

df_wanted <- get_rv(df)
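With the example data above, df_wanted should come out as:
#   ticker_wanted date_wanted rv_wanted
# 1             A           1         7
# 2             A           2         5
# 3             B           1        10
# 4             B           2        11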

Assuming that your dataframe is as follows:
data <- data.frame(ticker,date,ret,vol)
Use split to split your dataframe into a list of dataframes based on the values of ticker and date.
dflist = split(data, f = list(data$ticker, data$date), drop = TRUE)
Now use lapply or sapply to run the function YZ() on each dataframe in dflist. Since YZ() only accepts a two-column data frame, select the ret and vol columns inside the call.
pins <- lapply(dflist, function(x) YZ(x[, c("ret", "vol")])$PIN)
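To get from that list back to the wanted (ticker, date, PIN) layout, one possibility is the sketch below; it assumes the ticker values contain no dots, since split() glues the grouping values together with a ".":
pins <- unlist(pins)                                      # named vector, names like "A.1", "B.2"
keys <- do.call(rbind, strsplit(names(pins), ".", fixed = TRUE))
dat_w <- data.frame(ticker = keys[, 1],
                    date   = as.numeric(keys[, 2]),
                    PIN    = unname(pins))
dat_w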

Related

pass a list of variable names as an argument to an R function

I am trying to achieve the following: I have a dataset and a function that subsets this dataset and then performs a series of operations on the subset. Subsetting happens based on row names. I am able to do it step by step (i.e. running this function for each subset separately), but I have a list of desired subsets and I would like to loop over this list. It sounds complicated, so please check the example below.
This is what I can do:
# dataframe with rownames
whole_dataset <- data.frame(wt1 = c(1, 2, 3, 6, 6),
                            wt2 = c(2, 3, 4, 4, 2))
row.names(whole_dataset) <- c("HTA1", "HTA2", "HTB2", "CSE1", "CSE2")
# two different non-overlapping subsets
his <- c("HTA1", "HTA2", "HTB2")
cse <- c("CSE1", "CSE2")
# this is the function I have
fav_complex <- function(data, complex) {
  small_data <- data[complex, ] # subset only the rows that you need
  sum.all <- colSums(small_data) # calculate sum of columns
  return(sum.all)
}
# I generate two separate named vectors
his_data <- fav_complex(data = whole_dataset, complex = his)
cse_data <- fav_complex(data = whole_dataset, complex = cse)
#and merge them
merged_data<- rbind(his_data,cse_data)
it looks like this
> merged_data
wt1 wt2
his_data 6 9
cse_data 12 6
I would like to somehow generate the merged_data dataframe without having to call the 'fav_complex' function multiple times. In real life I have about 20 subsets, and it is a lot of code. This is my solution that doesn't work
#I first have a character vector listing all the variable names
subset_list <- c("his", "cse")
#then create a loop that goes over this list
#make an empty dataframe
merged_data2 <- data.frame()
#fill it with a for loop output
for (element in subset_list) {
  result <- fav_complex(data = whole_dataset, element)
  merged_data2 <- rbind(merged_data2, result)
}
I know this is wrong. In this loop, 'element' is just a string, rather than the variable with stuff in it. But I don't know how to make it a variable. noquote(element) didn't work. I tried reading about non-standard evaluation and eval(), substitute(), but it is too abstract for me - I think I am not there yet with my R expertise.
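One way to see the missing piece: the loop needs to map the string "his" back to the object it names, and base R's get() does exactly that lookup. A rough sketch reusing fav_complex, whole_dataset, and subset_list from above:
merged_data2 <- do.call(rbind, lapply(subset_list, function(element) {
  fav_complex(data = whole_dataset, complex = get(element))  # get() returns the vector named by the string
}))
rownames(merged_data2) <- subset_list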
Consider by to run the needed operation across all subsets. But first, create a group column:
# ANY FUNCTION TO APPLY ON SUBSETS (REMOVE GROUP COL)
fav_complex_new <- function(sub) {
  sum.all <- colSums(transform(sub, group = NULL))
  return(sum.all)
}
# ASSIGN GROUPING
whole_dataset$group <- ifelse(row.names(whole_dataset) %in% his, "his",
                              ifelse(row.names(whole_dataset) %in% cse, "cse", NA))
# BY CALL
df_list <- by(whole_dataset, whole_dataset$group, FUN=fav_complex_new)
# COMBINE ALL DFs IN LIST
merged_data <- do.call(rbind, df_list)
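With the example data, merged_data should come out equivalent to the hand-built version shown earlier (rows ordered alphabetically by group):
# merged_data
#     wt1 wt2
# cse  12   6
# his   6   9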
Following @Gregor's suggestion of a modified workflow, would you consider this solution, including some bonus data wrangling?
Put the data that's currently in row names into its own column.
Add a column for complex. We can do this programmatically in case the data are large.
Use dplyr to create split-apply-combine summaries of the data, grouped by complex.
It could work like this:
library(dplyr)
whole_dataset <- tibble(wt1 = c(1, 2, 3, 6, 6),
                        wt2 = c(2, 3, 4, 4, 2),
                        id = factor(c("HTA1", "HTA2", "HTB2", "CSE1", "CSE2")))
whole_dataset <- mutate(whole_dataset,
                        complex = case_when(
                          grepl("^HT", id) ~ "his",
                          grepl("^CSE", id) ~ "cse")
                        ) %>%
  group_by(factor(complex))
whole_dataset %>% summarize(sum_wt1 = sum(wt1),
                            sum_wt2 = sum(wt2))
# # A tibble: 2 x 3
# `factor(complex)` sum_wt1 sum_wt2
# <fct> <dbl> <dbl>
# 1 cse 12 6
# 2 his 6 9

Variable names change with R function return

I'm new to functions. I'm trying to create one that will aggregate the total number of unique values of one variable by some category. Ex. the number of unique visitors to a store every day.
I was not able to get this to work using ddply, which was my original plan; I was successful, however, using aggregate. My problem is that I want the variables to retain their original names instead of taking on the names used inside the function (i.e. return the column names in the dataframe as day and visitor_id instead of a and b).
I have the a and b in the function because that was the only way I could figure out how to make it look for a variable instead of an object.
data <- data.frame(day = rep(c("Mon", "Tues", "Wed", "Thurs", "Fri"), times = 5),
                   visitor_id = c(111,222,333,222,111,222,333,222,222,222,222,111,222,222,333,111,111,222,222,111,222,333,333,333,333))
total_unique <- function(var) {
  x <- length(unique(var))
  return(x)
}
my_function <- function(data, ag_category, var) {
  a <- eval(substitute(ag_category), data)
  b <- eval(substitute(var), data)
  x <- aggregate(b ~ a, data, FUN = total_unique)
  return(x)
}
test <- my_function(data=data, ag_category=day, var=visitor_id)
Also, if anyone can point out what I did wrong with the ddply code, that would also be really helpful!
my_function2 <- function(data, ag_category, var) {
  require(plyr)
  a <- eval(substitute(ag_category), data)
  b <- eval(substitute(var), data)
  x <- ddply(data, ~a, summarise, length(unique(b)))
  return(x)
}
test2 <- my_function2(data=data, ag_category=day, var=visitor_id)
Here is a solution:
library(tidyverse)
myFun <- function(data, ag_category, var){
  varname <- quo({{var}})
  data %>%
    group_by({{ag_category}}) %>%
    summarise(!!varname := length(unique({{var}})))
}
myFun(data=data, ag_category=day, var=visitor_id)
#> # A tibble: 5 x 2
#> day visitor_id
#> <fct> <int>
#> 1 Fri 3
#> 2 Mon 2
#> 3 Thurs 2
#> 4 Tues 3
#> 5 Wed 2
Instead of saving the variables as new variables, we use rlang from the tidyverse to pass the variable name from the function call. We group by the grouping variable and then summarize the unique number of observations.
If you really want to pass in the names as symbols, then you need to take extra care to construct the formula you want. Here's one way to do it
my_function <- function(data, ag_category, var) {
  ff <- do.call("~", list(substitute(var), substitute(ag_category)))
  x <- aggregate(ff, data, FUN = total_unique)
  return(x)
}
my_function(data=data, ag_category=day, var=visitor_id)
It would be even easier if you passed in the names as strings rather than symbols:
my_function_str <- function(data, ag_category, var) {
  x <- aggregate(reformulate(ag_category, var), data, FUN = total_unique)
  return(x)
}
my_function_str(data=data, ag_category="day", var="visitor_id")
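Either way the original column names are kept; with the example data the result should look like this (the same counts as the tibble above, as a plain data frame ordered by factor level):
#     day visitor_id
# 1   Fri          3
# 2   Mon          2
# 3 Thurs          2
# 4  Tues          3
# 5   Wed          2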

Creating a function to perform a calculation and add answer as column to existing dataframe

I am trying to work out how to create a user defined function to perform a calculation on a series of columns in a dataframe, and add the answer as an additional column to the same dataframe. To keep things simple, the test example I have been using is to calculate percentage growth from one year to the next, but the goal is to be able to create more elaborate calculations that are too cumbersome and repetitive to manually calculate.
The practice data I have been using is...
a <- c(10, 12)
b <- c(11, 9)
df <- t(data.frame(a, b))
df <- data.frame(df)
colnames(df) <- c(2001, 2002)
Which will look like...
2001 2002
a 10 12
b 11 9
The manual calculation I have been using is...
df$PercGrowth <- (df$`2002` - df$`2001`) / df$`2001` * 100
Which returns:
2001 2002 PercGrowth
a 10 12 20.00000
b 11 9 -18.18182
How do I turn this into a user defined function where I can specify the columns to perform the calculation, and then have the answer added to the dataframe as a derived value?
What I initially thought might work was...
pg <- function(data, c1, c2)
  df <- mutate(data, PercGrowth = ((df[c2] - df[c1]) / df[c1] * 100))
pg(df, 1, 2)
However I keep getting the error message:
Error: Column PercGrowth is of unsupported class data.frame
How do I get this to work?
This is actually more complicated than it looks - you need to use dplyr pronouns and quasiquotation in order to pass the column names as arguments in the function. The following code works:
library(dplyr)
a <- c(10, 12)
b <- c(11, 9)
df <- t(data.frame(a, b))
df <- data.frame(df)
colnames(df) <- c("year1", "year2")
pg <- function(df, col1, col2) {
  quo_col1 <- enquo(col1)
  quo_col2 <- enquo(col2)
  df %>%
    mutate(pct_growth = (!! quo_col2 - !! quo_col1) / !! quo_col1 * 100)
}
pg(df, year1, year2)
I renamed the columns to strings so they are easier to work with. You can read more at this link: https://dplyr.tidyverse.org/articles/programming.html
Another option could be to use some kind of string matching on the column names you're interested in, perform operations using those columns, and then join the result back to the main data frame.
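As a rough sketch of that idea (the "^year" pattern and pct_growth name are just placeholders for the renamed example columns above):
year_cols <- grep("^year", names(df), value = TRUE)   # find the columns of interest by name
stopifnot(length(year_cols) >= 2)
growth <- (df[[year_cols[2]]] - df[[year_cols[1]]]) / df[[year_cols[1]]] * 100
df <- cbind(df, pct_growth = growth)                   # join the result back to the main data frame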

R Check a row of strings, if equal, assign equal ID, less time consuming

I'm fairly new to R and was wondering if anyone here had a better solution to my problem, as mine is too time consuming. I know R is not very "for-loop-friendly", so I am sure there is a better way to solve this.
I have a data frame where x is a text string and y is a numeric id:
x = c("a", "b", "c", "b", "a")
y = c(1,2,3,4,5)
df <- data.frame(x, y)
I want to find all matches in column x and assign them the same numeric value as the first occurrence in y. I have solved this with the following:
library(foreach)
library(iterators)
for (i in 1:NROW(df)) {
  for (j in i:NROW(df)) {
    if (df$x[j] == df$x[i]) {
      df$y[j] <- df$y[i]
    }
    j = j + 1
  }
  i = i + 1
}
Problem is, I have a fairly large dataset which makes this process take a lot of time! Hope anyone here knows a less time consuming alternative!
If your dataset is indeed large, then data.table will probably be the fastest solution.
library(data.table)
setDT(df)
df[, y := first(y), by = x]
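For the example data frame above, this updates df in place to roughly:
#    x y
# 1: a 1
# 2: b 2
# 3: c 3
# 4: b 2
# 5: a 1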
R likes vectorised code, so things like arithmetic operations and assignments can be slow if done in a loop. Consider for example assigning the vector 1, 2, ... 1,000,000 to a variable x in two different ways
x <- 1:1e6
and
x <- numeric(1e6) # initialise a numeric vector of length 1 million
for (i in 1:1e6) x[i] <- i
If you try this out you will see that the second method takes much longer.
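For instance, a quick check (exact timings will vary by machine):
system.time(x <- 1:1e6)        # vectorised: essentially instantaneous
system.time({
  x <- numeric(1e6)
  for (i in 1:1e6) x[i] <- i   # element-by-element assignment: noticeably slower
})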
Coming to your problem, you want to group the data by the value in df$x and replace the values of y by their first element
df.by <- by(df, INDICES = df$x, FUN = function(d) transform(d, y = y[1]))
will set the second column of each subset of df (subsetting based on df$x) equal to its first element. The result is
#df$x: a
# x y
#1 a 1
#5 a 1
#------------------------------------------------------------
#df$x: b
# x y
#2 b 2
#4 b 2
#------------------------------------------------------------
#df$x: c
# x y
#3 c 3
To combine these back to a data frame, use df.new <- do.call(rbind, df.by). One (possibly unwanted) side effect of this operation is that it will change the order of the rows.
If you are new to R, check out the dplyr package; it has a smooth learning curve and easy-to-write, easy-to-read syntax. What you want to do can be accomplished in only a few lines.
library(dplyr)
df %>% group_by(x) %>% mutate(y = y[1])
will do it!

Create a function for mean calculation using a specific rule

I need to create a function that calculates means following a specific rule, without using apply or aggregate functions. I have three variables, and I would like to calculate the mean of var3 for each change in var2 first, and then the mean of var3 for each change in var1, all in the same function. Is this possible? My code is:
# Variable 1
var1 <- sort(rep(LETTERS[1:3], 10))
# Variable 2
var2 <- rep(1:5, 6)
# Variable 3
var3 <- rnorm(30)
# Create data frame
DB <- NULL
DB <- cbind(var1, var2, as.numeric(var3))
head(DB)
# Function to calculate the mean following a rule
mymean <- function(x, db=DB){
for (1:length(db[,1])){
if (db[,[i]] != db[,[i]]) {
mean(db[,[i]])
}
else (db[,[i]] == db[,[i]]) {
stop("invalid rule")
}}
This is where the problems start, and it doesn't work.
Thanks
Alexandre
It appears that you want to obtain means by groups.
To do this I would use the dplyr package
library(dplyr)
db <- data.frame(var1 = sort(rep(LETTERS[1:3], 10)), var2 = rep(1:5, 6), var3 = rnorm(30))
db %>%
  group_by(var1) %>%
  summarise(mean_over_va1 = mean(var3))
var1 mean_over_va1
1 A 0.07314416
2 B -0.05983557
3 C -0.03592565
db %>%
  group_by(var2) %>%
  summarise(mean_over_va2 = mean(var3))
var2 mean_over_va2
1 1 -0.4512942044
2 2 -0.1331316802
3 3 0.0821958902
4 4 -0.0001081054
5 5 0.4646429921
From your comments, however, it appears that you don't want to use base R commands like apply and aggregate, so I assume you may not like the above solution.
If I had to do this with brute force, I would do something like this:
db <- data.frame(var1 = sort(rep(LETTERS[1:3], 10)), var2 = rep(1:5, 6), var3 = rnorm(30), stringsAsFactors = FALSE)
# Obtaining groups
group1 <- unique(db$var1)
group2 <- unique(db$var2)
# Obtaining the number of different groups so I don't have to keep calling length
N1 <- length(group1)
N2 <- length(group2)
# Preallocating; not necessary, but a good habit
res1 <- data.frame(group = group1, mean = rep(NA, N1))
res2 <- data.frame(group = group2, mean = rep(NA, N2))
# Looping over the group members rather than each row of data. I like this approach because it
# relies more heavily on subsetting than on iteration, which is always a good idea in R.
for (i in seq(1, N1)) {
  res1[i, "mean"] <- mean(db[db$var1 %in% group1[i], "var3"])
}
for (i in seq(1, N2)) {
  res2[i, "mean"] <- mean(db[db$var2 %in% group2[i], "var3"])
}
res <- list(res1, res2)
