Create a function for mean calculation using a specific rule - r

I need to create a function for mean calculation using a specific rule, without using apply or aggregate functions. I have 3 variables, and I would like to calculate, within the same function, first the mean of var3 for each value of var2 and second the mean of var3 for each value of var1. Is this possible? My code is:
# Variable 1
var1 <- sort(rep(LETTERS[1:3], 10))
# Variable 2
var2 <- rep(1:5, 6)
# Variable 3
var3 <- rnorm(30)
# Create data frame
DB <- NULL
DB <- cbind(var1, var2, as.numeric(var3))
head(DB)
# Function to calculate the mean following a rule
mymean <- function(x, db = DB){
  for (1:length(db[,1])){
    if (db[,[i]] != db[,[i]]) {
      mean(db[,[i]])
    }
    else (db[,[i]] == db[,[i]]) {
      stop("invalid rule")
    }}
This is where the problems start and it doesn't work.
Thanks
Alexandre

It appears that you want to obtain means by groups.
To do this I would use the dplyr package:
library(dplyr)
db <- data.frame(var1 = sort(rep(LETTERS[1:3],10)), var2=rep(1:5,6), var3=rnorm(30))
db %>%
  group_by(var1) %>%
  summarise(mean_over_var1 = mean(var3))
var1 mean_over_var1
1 A 0.07314416
2 B -0.05983557
3 C -0.03592565
db %>%
  group_by(var2) %>%
  summarise(mean_over_var2 = mean(var3))
var2 mean_over_var2
1 1 -0.4512942044
2 2 -0.1331316802
3 3 0.0821958902
4 4 -0.0001081054
5 5 0.4646429921
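If you want both sets of means from a single call, as the question asks, one option is a small wrapper around the two pipelines above (group_means is a hypothetical name I chose; it is just a sketch that bundles the two summaries into a list):
group_means <- function(db) {
  # Returns a list with the group means of var3, first by var1, then by var2
  list(
    by_var1 = db %>% group_by(var1) %>% summarise(mean_var3 = mean(var3)),
    by_var2 = db %>% group_by(var2) %>% summarise(mean_var3 = mean(var3))
  )
}
group_means(db)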
From your comments, however, it appears that you don't want to use base R commands like apply and aggregate, so I assume you may not like the above solution.
If I had to do this with brute force, I would do something like this:
db <- data.frame(var1 = sort(rep(LETTERS[1:3],10)), var2=rep(1:5,6), var3=rnorm(30), stringsAsFactors = FALSE)
#Obtaining Groups
group1 <- unique(db$var1)
group2 <- unique(db$var2)
# Obtaining the number of different groups so I don't have to keep calling length
N1 <- length(group1)
N2 <- length(group2)
#Preallocating, not necessary but a good habit
res1 <- data.frame(group = group1, mean = rep(NA, N1))
res2 <- data.frame(group = group2, mean = rep(NA, N2))
# Looping over the group members rather than each row of data. I like this approach because it relies more heavily on subsetting than it does on iteration, which is always a good idea in R.
for (i in seq(1, N1)){
  res1[i, "mean"] <- mean(db[db$var1 %in% group1[i], "var3"])
}
for (i in seq(1, N2)){
  res2[i, "mean"] <- mean(db[db$var2 %in% group2[i], "var3"])
}
res <- list(res1, res2)
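If you prefer this packaged as the single function the question asks for, here is a minimal sketch built only from the loops above (the name mymean and the returned list structure are my own choices, not part of the original answer):
# Hypothetical wrapper: group means of var3 by var1 and by var2,
# using only loops and subsetting (no apply/aggregate)
mymean <- function(db) {
  res <- list()
  for (col in c("var1", "var2")) {
    groups <- unique(db[[col]])
    out <- data.frame(group = groups, mean = rep(NA_real_, length(groups)))
    for (i in seq_along(groups)) {
      out$mean[i] <- mean(db[db[[col]] == groups[i], "var3"])
    }
    res[[col]] <- out
  }
  res
}
mymean(db)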

Related

R: Group by and Apply a general function to two columns

Hi, I'd like to group by two dataframe columns and apply a function to another two dataframe columns.
For example:
ticker <- c("A", "A", 'A', "B", "B", "B")
date <- c(1,1,2,1,2,1)
ret <- c(1,2,4,6,9,5)
vol <- c(3,5,1,6,2,3)
dat <- data.frame(ticker,date,ret,vol)
For each ticker and each date, I'd like to calculate its PIN.
Now, to avoid further confusion, perhaps it helps to just speak out the actual function. YZ is a function in the InfoTrad package, and YZ only accepts a dataframe with two columns. It uses some optimisation tool and returns an estimated PIN.
install.packages("InfoTrad")
library(InfoTrad)
get_pin_yz <- function(data) {
  return(YZ(data[, c('volume_krw_buy', 'volume_krw_sell')])[['PIN']])
}
I know how to do this in R using a for loop, but a for loop is computationally very costly, and it might take weeks to finish running on my large dataset. Thus, I would like to ask how to do this with a group-by.
# output format is wide wrt long format as "dat"
dat_w <- data.frame(ticker = NA, date = NA, PIN = NA)
for (j in c("A", "B")){
for (k in c(1:2)){
subset <- dat %>% subset((ticker == j & date == k), select = c('ret', "vol"))
new_row <- data.frame(ticker = j, date = k, PIN = YZ(subset)$PIN)
dat_w <- rbind(dat_w, new_row)
}
}
dat_w <- dat_w[-1, ]
dat_w
I don't know if this can help you help me, but I know how to do this in Python: I just write a function and run df.groupby(['ticker','date']).apply(function).
Finally, the wanted dataframe is:
ticker <- c('A','A','B','B')
date <- c(1,2,1,2)
PIN <- c(1.05e-17,2.81e-09,1.12e-08,5.39e-09)
data.frame(ticker,date,PIN)
Could somebody help out, please?
Thank you!
Best,
Darcy
Previous stuff (Feel free to ignore)
Previously, I wrote this:
My function is:
get_rv <- function(data) {
  return(data[['vol']] + data[['ret']])
}
What I want is:
ticker_wanted <- c('A','A', 'B', 'B')
date_wanted <- c(1,2,1,2)
rv_wanted <- c(7,5,10,11)
df_wanted <-data.frame(ticker_wanted,date_wanted,rv_wanted)
But this is not literally what my actual function is; the vol + ret is just an example. I'm more interested in the general case: how to group by and apply a general function to two or more dataframe columns. I used vol + ret only because I didn't want to bother others by asking them to install a potentially irrelevant package on their PC.
Update based on real-life example:
You can do a direct approach like this:
library(tidyverse)
library(InfoTrad)
dat %>%
  group_by(ticker, date) %>%
  summarize(PIN = YZ(as.data.frame(cur_data()))$PIN)
# A tibble: 4 x 3
# Groups: ticker [2]
ticker date PIN
<chr> <dbl> <dbl>
1 A 1 1.05e-17
2 A 2 1.56e- 1
3 B 1 1.12e- 8
4 B 2 7.07e- 9
The difficulty here was that the YZ function only accepts true data frames, not tibbles and that it returns several values, not just PIN.
You could theoretically wrap this up into your own function and then run your own function like I've shown in the example below, but maybe this way already does the trick.
I also don't expect this to run much faster than a for loop. It seems that this YZ function has some more-than-linear runtime, so passing larger amounts of data will still take some time. You can try to start with a small set of data and then repeat it, increasing the size of your data by a factor of maybe 10, and check how fast it runs.
In your example, you can do:
my_function <- function(data) {
  data %>%
    summarize(rv = sum(ret, vol))
}
library(tidyverse)
df %>%
  group_by(ticker, date) %>%
  my_function()
# A tibble: 4 x 3
# Groups: ticker [2]
ticker date rv
<chr> <dbl> <dbl>
1 A 1 7
2 A 2 5
3 B 1 10
4 B 2 11
But as mentioned in my comment, I'm not sure if this general example would help in your real-life use case.
It might also be that you don't need to create your own function because built-in functions already exist. As in the example, you are better off directly summarizing instead of wrapping it into a function.
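For the real-life case, another option besides cur_data() is group_modify(), which also hands each group to a function. This is only a sketch under the same assumptions as above (I have not benchmarked it, and it still converts each group with as.data.frame() because YZ() expects a plain data frame):
library(dplyr)
library(InfoTrad)

dat %>%
  group_by(ticker, date) %>%
  # each group's ret/vol columns are passed as .x; return a one-row tibble per group
  group_modify(~ tibble(PIN = YZ(as.data.frame(.x))$PIN)) %>%
  ungroup()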
You could just do this (with summarise as an example of your function):
ticker <- c("A", "A", 'A', "B", "B", "B")
date <- c(1,1,2,1,2,1)
ret <- c(1,-2,4,6,9,-5)
vol <- c(3,5,1,6,2,3)
df <- data.frame(ticker,date,ret,vol)
library(dplyr)

get_rv <- function(data){
  result <- data %>%
    group_by(ticker, date) %>%
    summarise(rv = sum(ret) + sum(vol)) %>%
    as.data.frame()
  names(result) <- c('ticker_wanted', 'date_wanted', 'rv_wanted')
  return(result)
}
df_wanted <- get_rv(df)
Assuming that your dataframe is as follows:
data <- data.frame(ticker,date,ret,vol)
Use split to split your dataframe into a list of dataframes based on the values of ticker and date.
dflist = split(data, f = list(data$ticker, data$date), drop = TRUE)
Now use lapply or sapply to run the function YZ() on each dataframe member of dflist.
pins <- lapply(dflist, function(x) YZ(x)$PIN)
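To get the result back into the shape shown in the question, one possible follow-up (assuming the dflist and pins objects above; the regular expressions rely on split() joining the group labels with a dot) is:
# Rebuild a ticker/date/PIN data frame from the named list returned by lapply()
pin_df <- data.frame(
  ticker = sub("\\..*$", "", names(pins)),
  date   = as.numeric(sub("^.*\\.", "", names(pins))),
  PIN    = unlist(pins),
  row.names = NULL
)
pin_df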

Conditional creation (mutate) of new columns

I have a vector containing "potential" column names:
col_vector <- c("A", "B", "C")
I also have a data frame, e.g.
library(tidyverse)
df <- tibble(A = 1:2,
B = 1:2)
My goal now is to create all columns mentioned in col_vector that don't yet exist in df.
For the above example, my code below works:
df %>%
mutate(!!sym(setdiff(col_vector, colnames(.))) := NA)
# A tibble: 2 x 3
A B C
<int> <int> <lgl>
1 1 1 NA
2 2 2 NA
The problem is that this code fails as soon as a) more than one column from col_vector is missing or b) no column from col_vector is missing. I thought about some sort of if_else, but I don't know how to make the column creation conditional in such a way - preferably in a tidyverse way. I know I can just create a loop going through all the missing columns, but I'm wondering if there is a more direct approach.
Example data where code above fails:
df2 <- tibble(A = 1:2)
df3 <- tibble(A = 1:2,
B = 1:2,
C = 1:2)
This should work.
df[,setdiff(col_vector, colnames(df))] <- NA
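If you also want to guard the case where no column is missing (case b in the question, where setdiff() returns character(0)), one option is to wrap the one-liner in a small helper with a length check (add_missing_cols is a hypothetical name, just a sketch):
# Hypothetical helper: add any columns from 'cols' that 'df' lacks, filled with NA
add_missing_cols <- function(df, cols) {
  missing <- setdiff(cols, names(df))
  if (length(missing) > 0) {
    df[, missing] <- NA
  }
  df
}

df2 <- add_missing_cols(df2, col_vector)  # adds B and C
df3 <- add_missing_cols(df3, col_vector)  # nothing missing, df3 unchanged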
Solution
This base operation might be simpler than a full-fledged dplyr workflow:
library(tidyverse) # Optional here: setdiff() is also available in base R.
# ...
# Code to generate 'df'.
# ...
# Find the subset of missing names, and create them as columns filled with 'NA'.
df[, setdiff(col_vector, names(df))] <- NA
# View results
df
Results
Given your sample col_vector and df here
col_vector <- c("A", "B", "C")
df <- tibble(A = 1:2, B = 1:2)
this solution should yield the following results:
# A tibble: 2 x 3
A B C
<int> <int> <lgl>
1 1 1 NA
2 2 2 NA
Advantages
An advantage of my solution over the alternative linked above by @geoff is that you need not code the set of column names by hand, as symbols and strings, within the dplyr workflow:
df %>% mutate(
#####################################
A = ifelse("A" %in% names(.), A, NA),
B = ifelse("B" %in% names(.), B, NA),
C = ifelse("C" %in% names(.), C, NA)
# ...
# etc.
#####################################
)
My solution, by contrast, is more dynamic:
##############################
df[, setdiff(col_vector, names(df))] <- NA
##############################
It keeps working if you ever decide to change (or even dynamically calculate!) your variable names midstream, since it determines the setdiff() at runtime.
Note
Incredibly, @AustinGraves posted their answer at precisely the same time (2021-10-25 21:03:05Z) as I posted mine, so both answers qualify as original solutions.

How can I create a function to generate new variables based on values in a different dataframe in R

I would like to create a function like this (obviously not proper code):
forEach ID in DATAFRAME1 look at each row with ID in DATAFRAME2 {
if DATAFRAME2$VARIABLE1 = something {
DATAFRAME1$VARIABLE1 = TRUE;
DATAFRAME1$VARIABLE2 = DATAFRAME2$VARIABLE2
}
}
In plain text, I've got a list of individuals and a database with mixed information on these
individuals. Let's say DATAFRAME2 contains information on books read, c(id, title, author, date). I want to create a new variable in DATAFRAME1 with a boolean indicating whether the individual has read a specific book (VARIABLE1 above) and the date they first read it (VARIABLE2 above). Also adding a third variable with the number of times read would be interesting, but it is not necessary.
I haven't really done this in R before, mostly doing basic statistics and basic wrangling with dplyr. I guess I could use dplyr and join but this feels like a better approach. Any help to get me started would be much appreciated.
The following function does what the question asks for. Its arguments are
DF1 and DF2 have an obvious meaning;
var1 and var2 are VARIABLE1 and VARIABLE2 in the question;
value is the value of something.
The test data is at the end.
fun <- function(DF1, DF2, ID = 'ID', var1, var2, value){
  # create the output columns, filled with NA
  DF1[[var1]] <- NA
  DF1[[var2]] <- NA
  # rows of DF2 where var1 matches the requested value
  k <- DF2[[var1]] == value
  for(id in DF1[[ID]]){
    i <- DF1[[ID]] == id
    j <- DF2[[ID]] == id
    if(any(j & k)){
      DF1[[var1]][i] <- TRUE
      DF1[[var2]][i] <- DF2[[var2]][j & k]
    }
  }
  DF1
}
fun(df1, df2, value = 4, var1 = 'X', var2 = 'Y')
# ID X Y
#1 a NA NA
#2 d TRUE 19
Test data.
set.seed(1234)
df1 <- data.frame(ID = c("a", "d"))
df2 <- data.frame(ID = rep(letters[1:5], 4),
                  X = sample(20, 20, TRUE),
                  Y = sample(20))
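For comparison, here is a rough sketch of the dplyr/join route mentioned in the question, run against the same test data with value 4 (ordering and the handling of multiple matches may differ slightly from fun()):
library(dplyr)

matches <- df2 %>%
  filter(X == 4) %>%                   # rows of DF2 matching the rule
  group_by(ID) %>%
  summarise(X = TRUE, Y = first(Y))    # flag plus first matching Y per ID

left_join(df1, matches, by = "ID")     # IDs with no match keep NA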

Create a function that replaces values where n < 5 with a random number between 1 and 4 (integer)

I am quite new to R and have run into a problem I apparently can't solve by myself. It should be fairly easy, though.
I aim to write a generic function that manipulates column n in dataframe df. I want it to perform a simple task: for each row, when n < 5 it should replace that value with a random number between 1 and 4.
df <- data.frame(n = 1:10, y = letters[1:10],
                 stringsAsFactors = FALSE)
What is the most elegant solution?
One way to do this is to create a logical index based on the column, subset the column based on the index, and assign the sampled values:
f1 <- function(dat, col) {
  i1 <- dat[[col]] < 5
  dat[[col]][i1] <- sample(1:4, sum(i1), replace = TRUE)
  dat
}
f1(df, "n")

How can I select rows from a dataframe that do not match?

I'm trying to identify the values in a data frame that do not match, but can't figure out how to do this.
# make data frame
a <- data.frame( x = c(1,2,3,4))
b <- data.frame( y = c(1,2,3,4,5,6))
# select only values from b that are not in 'a'
# attempt 1:
results1 <- b$y[ !a$x ]
# attempt 2:
results2 <- b[b$y != a$x,]
If a = c(1,2,3) this works, since the length of b$y is a multiple of the length of a$x and the values are recycled. However, I'm trying to select all the values from b$y that are not in a$x, and I don't understand which function to use.
If I understand correctly, you need the negation of the %in% operator. Something like this should work:
subset(b, !(y %in% a$x))
> subset(b, !(y %in% a$x))
y
5 5
6 6
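The same idea also works with bracket indexing, which directly fixes the attempts in the question (drop = FALSE keeps the single-column result as a data frame):
# keep rows of b whose y value does not appear in a$x
results2 <- b[!(b$y %in% a$x), , drop = FALSE]
results2
#   y
# 5 5
# 6 6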
Try the set difference function setdiff. So you would have
results1 = setdiff(a$x, b$y) # elements in a$x NOT in b$y
results2 = setdiff(b$y, a$x) # elements in b$y NOT in a$x
You could also use dplyr for this task. To find what is in b but not a:
library(dplyr)
anti_join(b, a, by = c("y" = "x"))
# y
# 1 5
# 2 6

Resources