I have a list of functions, for example:
myFunctions = list(
calculateMean = function(x) {mean(x)},
calculateMedian = function(x) {median(x)}
)
I need to call the functions stored in myFunctions based on some criteria. For example, I have a table (myTable) with prices and I need to calculate means and medians (I also need to do other things, like standardizing names, joining a specific value with a table of codes, etc.).
If a value in a column of myTable is "a", I want to use calculateMean; if it is "b", calculateMedian; if it is "c", calculateMean again.
What is the best way to do this? I am saving the functions in a list because I will have a lot of them. How can I call a function in myFunctions based on a specific criterion?
Thanks!
Maybe the following does what the question asks for.
Depending on ID, function priceStat determines which function from myFunctions to apply to column price.
priceStat <- function(x, funlist) {
type <- unique(as.character(x[["ID"]]))
f <- switch(type,
pear = funlist[[1]],
orange = funlist[[2]])
f(x[["price"]])
}
myFunctions = list(
calculateMean = function(x) {mean(x)},
calculateMedian = function(x) {median(x)}
)
set.seed(1234)
df1 <- data.frame(ID = sample(c("pear", "orange"), 20, TRUE),
price = runif(20),
stringsAsFactors = FALSE)
sapply(split(df1, df1$ID), priceStat, myFunctions)
# orange pear
#0.3036828 0.5427695
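If the number of criteria grows, a lookup vector may scale better than a switch. The following is only a sketch under assumptions: the dispatch vector, the toy myTable, and its ID and price columns are made up for illustration and are not from the question.

```r
# Functions stored by name, as in the question
myFunctions <- list(
  calculateMean = function(x) mean(x),
  calculateMedian = function(x) median(x)
)

# Hypothetical lookup: criterion value -> name of the function to use
dispatch <- c(a = "calculateMean", b = "calculateMedian", c = "calculateMean")

# Toy data (assumed layout: one criterion column, one price column)
myTable <- data.frame(ID = c("a", "a", "b", "b", "b", "c", "c"),
                      price = c(1, 3, 2, 4, 100, 10, 20),
                      stringsAsFactors = FALSE)

# Split prices by criterion, then call the function registered for each one
groups <- split(myTable$price, myTable$ID)
results <- mapply(function(id, prices) myFunctions[[dispatch[[id]]]](prices),
                  names(groups), groups)
results
```

Adding a new criterion then only means adding one entry to dispatch, and the functions themselves stay untouched.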
Here is something that I think does what you are suggesting.
library(dplyr)
Create some data.
set.seed(1234)
data <- tibble(id = rep(letters[1:2], each = 3), price = rnorm(6, 100, 5))
data
# # A tibble: 6 x 2
# id price
# <chr> <dbl>
# 1 a 94.0
# 2 a 101.
# 3 a 105.
# 4 b 88.3
# 5 b 102.
# 6 b 103.
Create a list of functions. Note we named the list item for the id we want to apply it to.
myFunctions <- list(
a = mean,
b = median
)
Group the data on the id. Then iterate over each group, calling summarize(). For each group (.x is the subset of the data for that id and .y holds the id itself), call the matching function from the myFunctions list.
data %>%
group_by(id) %>%
group_modify(~ summarize(.x, calc = myFunctions[[pull(.y[1])]](.x$price)))
# # A tibble: 2 x 2
# id calc
# <chr> <dbl>
# 1 a 100.
# 2 b 102.
Testing it out.
> mean(data$price[data$id == "a"])
[1] 100.258
> median(data$price[data$id == "b"])
[1] 102.1456
There is a similar question for data.table at Replace sets of rows in a data.table with a single row but I am looking for an equivalent solution in tidyverse. So, I have a tibble like:
DT <- tibble (
id=c(1,1,1,1,2,2,2,2),
location = c("a","b","c","d","a","b","d","e"),
seq = c(1,2,3,4,1,2,3,4))
For every id, I want to look for the sequence b,c,d and if there is such a thing, I want to replace the rows with b and c with a single row, let's say z. The values for the other variables should retain the values of the previous b (in this case id and seq)
So in this case, the new tibble should be
DT.Tobe <- tibble (
id=c(1,1,1,2,2,2,2),
location = c("a","z","d","a","b","d","e"),
seq = c(1,2,4,1,2,3,4))
I was not able to find even a starting point for this...
library(dplyr)
# library(zoo) # rollapply
DT %>%
group_by(id) %>%
mutate(
isseq = zoo::rollapply(location, 3, FUN = function(z) identical(z, c("b", "c", "d")), align = "left", partial = TRUE),
isseq = isseq | lag(isseq, default = FALSE)
) %>%
group_by(id, isseq) %>%
summarize(
across(everything(), ~ {
if (cur_group()$isseq) {
if (cur_column() == "location") "z" else first(.)
} else .
})
) %>%
ungroup() %>%
select(-isseq)
# # A tibble: 7 x 3
# id location seq
# <dbl> <chr> <dbl>
# 1 1 a 1
# 2 1 d 4
# 3 1 z 2
# 4 2 a 1
# 5 2 b 2
# 6 2 d 3
# 7 2 e 4
The order is changed because the group_by(isseq) tends to keep "like" together. This should be easy to either re-order (assuming "seq" is meaningful) or pre-add an order variable and using it later.
If it is possible for a single id to have multiple of such sequences (if so, say something), then run-length encoding will be needed here as well (to differentiate between different b-c-d sequences in the same id).
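To illustrate the run-length encoding idea, here is a rough base R sketch on a single id's location vector. The example vector and the names starts, in_run and run_id are made up for illustration; inside the pipeline above this would have to be computed per id.

```r
# Hypothetical locations for one id, containing two separate b, c, d runs
location <- c("a", "b", "c", "d", "a", "b", "c", "d", "e")
n <- length(location)

# TRUE at every position where a b, c, d run starts
starts <- sapply(seq_len(n), function(i) {
  i + 2 <= n && identical(location[i:(i + 2)], c("b", "c", "d"))
})

# Mark the b row and the c row of every run (the rows to be collapsed)
in_run <- starts | c(FALSE, head(starts, -1))

# Run-length encoding gives every stretch of TRUE/FALSE its own id,
# so the two b, c pairs end up in different groups
r <- rle(in_run)
run_id <- rep(seq_along(r$lengths), r$lengths)
run_id
```

Grouping by id and run_id (instead of id and isseq) should then keep the two runs apart, because a "d" always sits between them.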
One possible option would be to use a loop. Here is a sketch (it assumes the rows are sorted by id).
i <- 1
while (i <= nrow(DT) - 2) { # visit every row that could start a b, c, d run
same_id <- DT$id[i] == DT$id[i + 2] # the three rows must belong to one id
if (same_id && identical(DT$location[i:(i + 2)], c("b", "c", "d"))) {
DT$location[i] <- "z" # replace b with z; id and seq of the b row are kept
DT <- DT[-(i + 1), ] # delete the c row
}
i <- i + 1
}
The dplyr cheat sheet has some useful functions that may help with finding the right tools for the if-statement part of this code.
What are your thoughts?
Is it possible in R to create argument names in a function call dynamically?
For example, if we start with
name <- "variable"
I would like to create a new data frame like this
a.new.data.frame <- data.frame(name = c(1, 2))
which of course does not work.
The only solution I could invent was
arg <- list(c(1, 2))
names(arg) <- name
a.new.data.frame <- do.call(data.frame, arg)
a.new.data.frame
# variable
#1 1
#2 2
I don't like this code, since it seems not to be elegant.
Is there a better way to do it?
PS Important! This is a more general problem I have when writing R-programmes (e.g. when I use ggplot, and many other cases). So, I expect general solutions to this (creation of data.frame is only an example).
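A more compact version of the dynamic-argument code could look like this (setNames() is needed so that the value stored in name, and not the literal word "name", becomes the argument name):
df <- do.call(data.frame, setNames(list(c(1, 2)), name))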
You could use the ?dotsMethods to encapsulate the do.call in a generic function like this to save the noisy list() part of the call:
call.with.dyn.args <- function(f, ...) {
args <- list(...)
do.call(f, args)
}
df1 <- call.with.dyn.args(data.frame, a = 1:2, b = letters[1:2])
df1
# a b
# 1 1 a
# 2 2 b
But you also have other options for dynamic argument passing to functions without a do.call, eg.:
dyn.values <- c(1:2)
name = "dyn.values"
df2 <- data.frame(dyn.values, # values from a variable
name = dyn.values, # values from a variable + new name
static.arg = letters[1:2], # usual direct passing of an arg
name.from.variable = get(name)) # get the values from a variable whose name is stored in another variable
df2
# dyn.values name static.arg name.from.variable
# 1 1 1 a 1
# 2 2 2 b 2
An option using tidyverse
library(tibble)
library(dplyr)
tibble(!! name := c(1, 2))
# A tibble: 2 x 1
# variable
# <dbl>
#1 1
#2 2
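For functions that do not understand !! and :=, the do.call pattern can be generalized. This is only a sketch: data.frame stands in for any function, and the names dynamic, fixed and args are made up for illustration.

```r
name <- "variable"

# Arguments whose names are only known at run time
dynamic <- setNames(list(c(1, 2)), name)

# Arguments with fixed names can be combined with the dynamic ones
fixed <- list(label = c("x", "y"), stringsAsFactors = FALSE)
args <- c(dynamic, fixed)

# do.call() passes the list elements as named arguments
a.new.data.frame <- do.call(data.frame, args)
names(a.new.data.frame)
```

The same list can be handed to any other function through do.call, which is why this pattern is not specific to data.frame.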
I need to merge several different dataframes.
On the one hand, I have several data frames with metadata A and, on the other hand, respective information B.
A.
[1] "LOJun_Meta" "LOMay_Meta" "VOJul_Meta" "VOJun_Meta" "VOMay_Meta" "ZOJun_Meta"
[7] "ZOMay_Meta"
B.
[1] "LOJun_All." "LOMay_all." "VOJul_All." "VOJun_all." "VOMay_all." "ZOJun_all."
[7] "ZOMay_all."
The names of the data frames are already in a list format (i.e. list1 and list2) and the data frames are already imported in R.
My aim is to create a loop that merges the respective data frames with dplyr's left_join. For example:
LOJun_Meta + LOJun_All.; LOMay_Meta + LOMay_all. etc...
What I have a hard time on is creating the loop that would "synchronize" the "merging" procedure.
I am unsure if I should create a function which would have two inputs and would do such "merging".
It would be something like
merging <- function(list1, list2) {
for (i in seq_along(list1)) {
left_join(list1[[i]], list2[[i]], by = c("PrimaryKey" = "ForeignKey"))
}
}
I reckon the problem is that the function should refer to data frames which are not list1 & list2 values but data frame names stored in list1 & list2.
Any ideas?
Thanks a lot! Cheers
A diagram of what I intend to achieve is presented below:
[Diagram of loop - dplyr / several dataframes]
An example of what I am keen to automate would be this action:
ZOMay <- left_join(ZOMay_Meta, ZOMay_all., by = c("Primary Key" = "Foreign Key"))
ZOJun <- left_join(ZOJun_Meta, ZOJun_all., by = c("Primary Key" = "Foreign Key"))
write.csv(ZOMay, file = "ZOMay_Consolidated.csv")
write.csv(ZOJun, file = "ZOJun_Consolidated.csv")
Here's an example of how you could build a reproducible example for your situation:
library(tidyverse)
df1a <- tibble(id = 1:3, var1 = LETTERS[1:3])
df2a <- tibble(id = 1:3, var1 = LETTERS[4:6])
df1b <- tibble(id = 1:3, var2 = LETTERS[7:9])
df2b <- tibble(id = 1:3, var2 = LETTERS[10:12])
list1 <- list(df1a, df2a)
list2 <- list(df1b, df2b)
Now as I understand it you want to do a left_join for df1a and df1b, as well as df2a and df2b. Instead of a loop, you can use map2 from the purrr package. This will iterate over two lists and apply a function to each pair of elements.
map2(list1, list2, left_join)
# [[1]]
# # A tibble: 3 x 3
# id var1 var2
# <int> <chr> <chr>
# 1 1 A G
# 2 2 B H
# 3 3 C I
#
# [[2]]
# # A tibble: 3 x 3
# id var1 var2
# <int> <chr> <chr>
# 1 1 D J
# 2 2 E K
# 3 3 F L
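If list1 and list2 hold the names of the data frames rather than the data frames themselves, mget() can look the objects up first. The sketch below rests on assumptions: the toy data frames and their columns are made up, and base R's merge(..., all.x = TRUE) stands in for left_join so that it runs without extra packages.

```r
# Toy stand-ins for the imported data frames (hypothetical contents)
LOJun_Meta <- data.frame(PrimaryKey = 1:3, site = c("x", "y", "z"))
LOJun_All. <- data.frame(ForeignKey = 1:2, value = c(10, 20))
LOMay_Meta <- data.frame(PrimaryKey = 1:2, site = c("p", "q"))
LOMay_all. <- data.frame(ForeignKey = 2:1, value = c(200, 100))

# The names of the data frames, paired by position
list1 <- c("LOJun_Meta", "LOMay_Meta")
list2 <- c("LOJun_All.", "LOMay_all.")

# Turn the names into lists of the actual data frames
meta <- mget(list1, envir = environment())
info <- mget(list2, envir = environment())

# Left-join each pair; all.x = TRUE keeps every row of the metadata
merged <- Map(function(m, a) {
  merge(m, a, by.x = "PrimaryKey", by.y = "ForeignKey", all.x = TRUE)
}, meta, info)
names(merged) <- sub("_Meta$", "", list1)

# Write one consolidated file per pair (to a temporary folder here)
for (nm in names(merged)) {
  write.csv(merged[[nm]],
            file.path(tempdir(), paste0(nm, "_Consolidated.csv")),
            row.names = FALSE)
}
```

With dplyr loaded, the merge() call could presumably be swapped for left_join(m, a, by = c("PrimaryKey" = "ForeignKey")) without changing the rest.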
I am trying to get away from loops in R and was looking to both vectorize and speed up a section of my code.
I am looking to convert a For loop using lapply, but am getting an error:
Reproducible example:
library(dplyr)
# This works using a For loop -----------------------------------
# create sample data frame
df <- data.frame(Date = rep(c("Jan1", "Jan2", "Jan3"), 3),
Item = c(rep("A", 3), rep("B", 3), rep("C", 3)),
Value = 10:18)
diff <- numeric() # initialize
# Loop through each item and take difference of latest value from earlier values
for (myitem in unique(df$Item)) {
y = df[df$Date == last(df$Date) & df$Item == myitem, "Value"] # Latest value for an item
x = df[df$Item == myitem, "Value"] # Every value for an item
diff <- c(diff, y-x)
}
df_final <- mutate(df, Difference = diff)
df_final
I found related questions here (lapply), here (lapply), and here ($ operator) but none really helped me with my question.
Here is how I tried to vectorize using lapply:
# Same thing using vectorized approach ----------------------------------
mylist <- list(unique(df$Item))
myfunction <- function(df = df, diff = numeric()) {
y = df[df$Date == last(df$Date) & df$Item == mylist, "Value"] # Latest value for an item
x = df[df$Item == mylist, "Value"] # Every value for an item
diff <- c(diff, y-x)
}
# throws error
diff_vector <- unlist(lapply(mylist, myfunction))
df_final2 <- mutate(df, Difference = diff_vector)
df_final2
My real data set has hundreds of thousands of rows. If someone could point me in the right direction on how to vectorize this to get the same output as the For loop I would appreciate it.
Thanks!
So lapply isn't being used quite right here, that's all!
lapply applies a function to each element of a list. To be explicit, it takes each element of a list, and applies the function to that element.
So if you want it to apply a function to several subsets of a data frame, you need to give it a list made up of those subsets. So let's create that list first.
We can do this using the split function: it splits your data frame into several data frames based on a column and stores these as a list. A list of subsets of a data frame. Perfect!
So let's replace the line where you create mylist with this line instead.
mylist <- split(df,df[,c("Item")])
Now we just need to make some changes to myfunction. Remember we're now passing through our data already subsetted, so we can remove the conditions about the Item matching with what we'd expect. Remember this function will get applied to each of these data frames in their entirety.
myfunction <- function(df = df, diff = numeric()) {
y = df[df$Date == last(df$Date), "Value"] # Latest value for an item
x = df[, "Value"] # Every value for an item
diff <- c(diff, y-x)
}
And the rest my friend, is exactly as you have it :)
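Put together, the whole thing might look like the base R sketch below. It assumes the rows are sorted by Item and that the last row of each subset holds the latest value; pieces and diff_from_latest are illustrative names, not from the question.

```r
df <- data.frame(Date = rep(c("Jan1", "Jan2", "Jan3"), 3),
                 Item = c(rep("A", 3), rep("B", 3), rep("C", 3)),
                 Value = 10:18)

# One data frame per item
pieces <- split(df, df$Item)

# Latest value of a subset minus every value of that subset
diff_from_latest <- function(piece) {
  latest <- piece$Value[nrow(piece)]
  latest - piece$Value
}

# Apply to every subset and flatten back into one vector
diff_vector <- unlist(lapply(pieces, diff_from_latest), use.names = FALSE)
df_final2 <- cbind(df, Difference = diff_vector)
df_final2
```

Because split() orders the pieces by Item, the flattened vector only lines up with the original rows when df is already sorted by Item.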
I'm not sure lapply is the right approach to take. I'd stick with mutate - which you already seem to be using:
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
df <- data.frame(Date = rep(c("Jan1", "Jan2", "Jan3"), 3),
Item = c(rep("A", 3), rep("B", 3), rep("C", 3)),
Value = 10:18)
df <- df %>%
group_by(Item) %>%
mutate(diff = last(Value) - Value)
df
#> # A tibble: 9 x 4
#> # Groups: Item [3]
#> Date Item Value diff
#> <fct> <fct> <int> <int>
#> 1 Jan1 A 10 2
#> 2 Jan2 A 11 1
#> 3 Jan3 A 12 0
#> 4 Jan1 B 13 2
#> 5 Jan2 B 14 1
#> 6 Jan3 B 15 0
#> 7 Jan1 C 16 2
#> 8 Jan2 C 17 1
#> 9 Jan3 C 18 0
Created on 2018-06-27 by the reprex package (v0.2.0).
This does assume that the observations (at least within the "Item" group) are arranged in order. If not, add arrange(Date) %>% as a step after group_by
You could create a table with the latest value per item, join it with the original table and take the difference, or use data.table to add a column with the latest value:
library(data.table)
df <- data.frame(Date = rep(c("Jan1", "Jan2", "Jan3"), 3),
Item = c(rep("A", 3), rep("B", 3), rep("C", 3)),
Value = 10:18)
setDT(df)
df[,latestVal:=last(Value),by=.(Item)][,diff:=latestVal-Value][,.(Date,Item,Value,diff)]
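The first option (a separate table of latest values that is joined back) could be sketched in base R as follows. It assumes the last row of each item is the latest one; latest and joined are illustrative names.

```r
df <- data.frame(Date = rep(c("Jan1", "Jan2", "Jan3"), 3),
                 Item = c(rep("A", 3), rep("B", 3), rep("C", 3)),
                 Value = 10:18)

# Table with one row per item holding its latest (last) value
latest <- aggregate(Value ~ Item, data = df, FUN = function(v) v[length(v)])
names(latest)[names(latest) == "Value"] <- "latestVal"

# Join the latest values back onto the original rows and subtract
joined <- merge(df, latest, by = "Item")
joined$diff <- joined$latestVal - joined$Value
joined[, c("Date", "Item", "Value", "diff")]
```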
I am learning the syntax of manipulating data.table variables. Although I can do simple things, my understanding is not thorough enough for more complex tasks. For example, I would like to transform the following data to have one distinct "type" value per row, separate columns generated based on the value of "subtype", and unique values collapsed when there are multiple rows with the same "type/subtype" combination.
Given the input data:
data = data.frame(
var1 = c("a","b","c","b","d","e","f"),
var2 = c("aa","bb","cc","dd","ee","ee","ff"),
subtype = c("1","2","2","2","1","1","2"),
type = c("A","A","A","A","B","B","B")
)
var1 var2 subtype type
1 a aa 1 A
2 b bb 2 A
3 c cc 2 A
4 b dd 2 A
5 d ee 1 B
6 e ee 1 B
7 f ff 2 B
I would like to derive:
1.var1 1.var2 2.var1 2.var2 2.type
A "a" "aa" "b|c" "bb|cc|dd" "A"
B "d|e" "ee" "f" "ff" "B"
Using a data frame, I can achieve this with the following code:
data.derived = do.call(
rbind,
lapply(
split(data,list(data$type)),
function(x) {
do.call (
c,
lapply(
split(x, list(x$subtype)),
function(y) {
result = c(
var1 = paste(unique(y$var1),collapse ="|"),
var2 = paste(unique(y$var2),collapse ="|")
)
if (as.character(y$subtype[1]) == "2") {
result = c(result, type = as.character(y$type[1]))
}
result}))}))
How can I do the same using a data table?
From your result, it's clear that you are transforming the data from long to wide format, with subtype spread across the columns, so you will need dcast from data.table. And since you want to aggregate the values of var1 and var2 into a single string, you will need a custom aggregate function that uses paste to collapse the result:
library(data.table)
setDT(data)
dcast(data, type ~ subtype, value.var = c("var1", "var2"),
fun = function(v) paste0(unique(v), collapse = "|"))
# type var1_function_1 var1_function_2 var2_function_1 var2_function_2
# 1: A a b|c aa bb|cc|dd
# 2: B d|e f ee ff
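To make the two steps that dcast performs here more visible (aggregate, then go from long to wide), here is a rough base R sketch. It is not data.table code; collapse_unique, long and wide are illustrative names.

```r
data <- data.frame(
  var1 = c("a", "b", "c", "b", "d", "e", "f"),
  var2 = c("aa", "bb", "cc", "dd", "ee", "ee", "ff"),
  subtype = c("1", "2", "2", "2", "1", "1", "2"),
  type = c("A", "A", "A", "A", "B", "B", "B"),
  stringsAsFactors = FALSE
)

# Step 1: collapse the unique values within each type/subtype combination
collapse_unique <- function(v) paste(unique(v), collapse = "|")
long <- aggregate(data[c("var1", "var2")],
                  by = data[c("type", "subtype")],
                  FUN = collapse_unique)

# Step 2: spread subtype into columns, one row per type
wide <- reshape(long, idvar = "type", timevar = "subtype",
                direction = "wide", sep = ".")
wide
```

The column names come out as var1.1, var2.1 and so on rather than 1.var1, so they would still need renaming to match the layout in the question.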
I'm not sure whether you want to use the data.table package and its commands, or whether you want to find out if your code works with data tables as well.
I think complex computations call for the appropriate packages. The above script works for you, but it's hard for anyone who didn't write it to see what it does.
Before you start using data.table, check out some nice packages that make life easier for you. For example:
library(dplyr)
library(tidyr)
data = data.frame(
var1 = c("a","b","c","b","d","e","f"),
var2 = c("aa","bb","cc","dd","ee","ee","ff"),
subtype = c("1","2","2","2","1","1","2"),
type = c("A","A","A","A","B","B","B")
)
data %>%
group_by(type, subtype) %>%
summarise(x1 = paste(unique(var1),collapse ="|"),
x2 = paste(unique(var2),collapse ="|")) %>%
unite(xx,x1,x2) %>%
spread(subtype,xx) %>%
separate(`1`, c("1.var1","1.var2"), sep="_") %>%
separate(`2`, c("2.var1","2.var2"), sep="_") %>%
ungroup
# # A tibble: 2 x 5
# type 1.var1 1.var2 2.var1 2.var2
# * <fctr> <chr> <chr> <chr> <chr>
# 1 A a aa b|c bb|cc|dd
# 2 B d|e ee f ff
You can use the same code, or even your own script, with a data table instead of a data frame. But if you're looking to use data.table commands, that's a different story.