Passing a list of functions to ddply in R

I want to be able to apply a list of functions to ddply so that the list can expand based on what I need.
Something like this:
func_list = list(Start.Date = "min(Date)", End.Date = "max(Date)")
do.call(ddply, c(list(.data = df, .variables = grps, .fun=summarize), func_list))
When I run that, it calls ddply, but the result has a variable called Start.Date that just equals the string "min(Date)". I've tried not quoting, but that doesn't work either.
Edit (example code and desired results):
library(plyr)
get_summary <- function(grps, func_list, df = raw) {
  out <- do.call(ddply,
                 c(list(.data = df, .variables = grps, .fun = summarize),
                   func_list))
  return(out)
}
raw <- data.frame(date = c(as.Date("2015-5-1"),
as.Date("2015-5-1"),
as.Date("2015-5-2"),
as.Date("2015-5-2")),
count = c(2,4,6,8),
amnt = c(100,200,300,400))
func_list <- list(
total_count = "sum(count)",
avg_amnt = "mean(amnt)"
)
get_summary("date", func_list)
# Desired output:
# date total_count avg_amnt
# 1 2015-05-01 6 150
# 2 2015-05-02 14 350
#
# Which is equivalent to:
# ddply(raw, "date", summarize, total_count = sum(count), avg_amnt = mean(amnt))

To fix your code, you only need parse():
func_list <- list(
  total_count = parse(text = "sum(count)"),
  avg_amnt = parse(text = "mean(amnt)"))
This tells the interpreter that the text should be evaluated as code rather than as a string.
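For completeness, calling the function from the question with this parsed func_list (plyr loaded as shown there) should reproduce the desired output:
get_summary("date", func_list)
#         date total_count avg_amnt
# 1 2015-05-01           6      150
# 2 2015-05-02          14      350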

Related

Run functions stored in a list based on criteria

I have a list of functions, for example:
myFunctions = list(
calculateMean = function(x) {mean(x)},
calculateMedian = function(x) {median(x)}
)
I need to call the stored functions in myFunctions based on some criteria. For example, I have a table (myTable) with prices and I need to calculate means and medians (I also need to do more things, like standardize names, join a specific value with a table of codes, etc.).
If a value in a column of myTable is == "a" I want to use calculateMean, if == "b" I want to use calculateMedian, and if == "c" use calculateMean.
What is the best way to do this? I am saving the functions as a list because I will have a lot of them. And how can I call a function in myFunctions based on a specific criterion?
Thanks!
Maybe the following does what the question asks for.
Depending on ID, function priceStat determines which function from myFunctions to apply to column price.
priceStat <- function(x, funlist) {
  type <- unique(as.character(x[["ID"]]))
  f <- switch(type,
              pear = funlist[[1]],
              orange = funlist[[2]])
  f(x[["price"]])
}
myFunctions = list(
calculateMean = function(x) {mean(x)},
calculateMedian = function(x) {median(x)}
)
set.seed(1234)
df1 <- data.frame(ID = sample(c("pear", "orange"), 20, TRUE),
price = runif(20),
stringsAsFactors = FALSE)
sapply(split(df1, df1$ID), priceStat, myFunctions)
# orange pear
#0.3036828 0.5427695
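If the criterion values themselves ("a", "b", "c") drive the choice, another option is a named lookup vector mapping each criterion to a function name, so myFunctions can grow without touching the dispatch code. A rough sketch (the a/b/c mapping comes from the question; the myTable columns here are made up for illustration):
myFunctions = list(
  calculateMean = function(x) {mean(x)},
  calculateMedian = function(x) {median(x)}
)
# criterion value -> name of the function to apply
funByCriterion <- c(a = "calculateMean", b = "calculateMedian", c = "calculateMean")
set.seed(1234)
myTable <- data.frame(criterion = sample(c("a", "b", "c"), 20, TRUE),
                      price = runif(20),
                      stringsAsFactors = FALSE)
prices <- split(myTable$price, myTable$criterion)
sapply(names(prices), function(cr) myFunctions[[funByCriterion[[cr]]]](prices[[cr]]))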
Here is something that I think does what you are suggesting.
library(dplyr)
Create some data.
set.seed(1234)
data <- tibble(id = rep(letters[1:2], each = 3), price = rnorm(6, 100, 5))
data
# # A tibble: 6 x 2
# id price
# <chr> <dbl>
# 1 a 94.0
# 2 a 101.
# 3 a 105.
# 4 b 88.3
# 5 b 102.
# 6 b 103.
Create a list of functions. Note we name each list item after the id we want to apply it to.
myFunctions <- list(
a = mean,
b = median
)
Group the data on the id, then iterate over each group with group_modify(), calling summarize(). For each group (which is the subset of the data for that given id), call the matching function from the myFunctions list.
data %>%
group_by(id) %>%
group_modify(~ summarize(.x, calc = myFunctions[[pull(.y[1])]](.x$price)))
# # A tibble: 2 x 2
# id calc
# <chr> <dbl>
# 1 a 100.
# 2 b 102.
Testing it out.
> mean(data$price[data$id == "a"])
[1] 100.258
> median(data$price[data$id == "b"])
[1] 102.1456
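A variant that seems to give the same result without group_modify(), assuming (as above) exactly one function per id, is to index the list directly inside summarize():
data %>%
  group_by(id) %>%
  summarize(calc = myFunctions[[first(id)]](price))
Here first(id) picks the group's id and uses it to look up the matching function by name.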

If else ladder not working in R

I have this in my dataframe after reading and rearranging multiple csv files. Basically, I want an if/else ladder that refers to the ID column and, if it matches a number from one of the vectors below, places a word in a new "group" column:
# of int. int. not.int. ID
1 50 218.41 372.16 1
3 33 134.94 158.17 3
I then made these vectors to refer to:
veh = as.character(c('1', '5'))
thc1 = as.character(c('2', '6'))
thc2 = as.character(c('3', '7'))
thc3 = as.character(c('4', '8'))
Then I made an if else ladder to list the corresponding values.
social.dat$group = if (social.dat$ID == veh) {
  social.dat$group == "veh"
} else if (social.dat$group == thc1) {
  social.dat$group == "thc1"
} else if (social.dat$group == thc2) {
  social.dat$group == "thc2"
} else {
  social.dat$group == "thc3"
}
However, I then get this warning message.
Warning message:
In if (social.dat$ID == veh) { :
the condition has length > 1 and only the first element will be used
I have looked up this warning message in multiple variations and have not found anything that really helped. Any help would be much appreciated, or alternate options would be good as well. I apologize in advance if I missed an existing solution on Stack Overflow.
EDIT:
I tried using
social.dat$group = ifelse(social.dat$ID == veh, "veh", "thc")
social.dat$group = ifelse(social.dat$ID == thc, "thc", "veh")
but it changed the output of the dataframe after each line.
Here is the full code I am using to rearrange the csv files and get the dataframe I first mentioned above.
#calls packages
library(tidyr)
library( plyr )
library(reshape2)
#make sure to change your working directory to where the files are kept
setwd("C:/Users/callej03/Desktop/test")
wd = "C:/Users/callej03/Desktop/test"
files = list.files(path=wd, pattern="*.csv", full.names=TRUE,
recursive=FALSE)
################################################################
#this function creates a list of the number of interactions for each file in the folder
lap.list = lapply(files, function(x) {
  dat = read.csv(x, header = TRUE)
  dat = dat[-c(1),]
  dat = as.data.frame(dat)
  dat = separate(data = dat, col = dat, into = c("lap", "duration"), sep = "\\ ")
  dat$count = 1:nrow(dat)
  y = dat$count
  i = y %% 2 == 0
  dat$interacting = i
  int = dat[which(dat$interacting == TRUE),]
  interactions = sum(int$interacting)
})
#########################################################################
#this changes the row name to the name of the file - i.e. the rat ID
lap.list = as.data.frame(lap.list)
lap.list = t(lap.list)
colnames(lap.list) = c("# of int.")
row.names(lap.list) = sub(wd, "", files)
row.names(lap.list) = gsub("([0-9]+).*$", "\\1", rownames(lap.list))
row.names(lap.list) = gsub('/', "", row.names(lap.list), fixed = TRUE)
###########################################################################
#this applies almost the same function as the one listed above except I call it a different vector name so it can be manipulated
int.duration = lapply(files, function(x) {
  dat2 = read.csv(x, header = TRUE)
  dat2 = dat2[-c(1),]
  dat2 = as.data.frame(dat2)
  dat2 = separate(data = dat2, col = dat2, into = c("lap", "duration"), sep = "\\ ")
  dat2$count = 1:nrow(dat2)
  y = dat2$count
  i = y %% 2 == 0
  dat2$interacting = i
  int = dat2[which(dat2$interacting == TRUE),]
})
noint.duration = lapply(files, function(x) {
  dat2 = read.csv(x, header = TRUE)
  dat2 = dat2[-c(1),]
  dat2 = as.data.frame(dat2)
  dat2 = separate(data = dat2, col = dat2, into = c("lap", "duration"), sep = "\\ ")
  dat2$count = 1:nrow(dat2)
  y = dat2$count
  i = y %% 2 == 0
  dat2$interacting = i
  noint = dat2[which(dat2$interacting == FALSE),]
})
###################################################################
#this splits the output time of minutes, seconds, and milliseconds.
#then it combines them into a total seconds.milliseconds readout.
#after that, it takes the sum of the times for each file and combines them with the total interactions dataframe.
int.duration = melt(int.duration)
int.duration = as.data.frame(int.duration)
int.left = as.data.frame(substr(int.duration$duration, 1, 2))
colnames(int.left) = "min"
int.mid = as.data.frame(substr(int.duration$duration, 4, 4 + 2 - 1))
colnames(int.mid) = "sec"
int.right = as.data.frame(substr(int.duration$duration,
nchar(int.duration$duration) - (2-1), nchar(int.duration$duration)))
colnames(int.right) = "ms"
int.time = cbind(int.left, int.mid, int.right)
int.time$min = as.numeric(as.character(int.time$min))
int.time$sec = as.numeric(as.character(int.time$sec))
int.time$ms = as.numeric(as.character(int.time$ms))
int.time$ms = int.time$ms/100
int.time$min = ifelse(int.time$min > 0, int.time$min*60,0)
int.time$sum = rowSums(int.time)
int.file = as.data.frame(int.duration$L1)
colnames(int.file) = "file"
int.time = cbind(int.time, int.file)
int.tot = as.data.frame(tapply(int.time$sum, int.time$file, sum))
colnames(int.tot) = "int."
social.dat = cbind(lap.list, int.tot)
noint.duration = melt(noint.duration)
noint.duration = as.data.frame(noint.duration)
noint.left = as.data.frame(substr(noint.duration$duration, 1, 2))
colnames(noint.left) = "min"
noint.mid = as.data.frame(substr(noint.duration$duration, 4, 4 + 2 - 1))
colnames(noint.mid) = "sec"
noint.right = as.data.frame(substr(noint.duration$duration,
nchar(noint.duration$duration) - (2-1), nchar(noint.duration$duration)))
colnames(noint.right) = "ms"
noint.time = cbind(noint.left, noint.mid, noint.right)
noint.time$min = as.numeric(as.character(noint.time$min))
noint.time$sec = as.numeric(as.character(noint.time$sec))
noint.time$ms = as.numeric(as.character(noint.time$ms))
noint.time$ms = noint.time$ms/100
noint.time$min = ifelse(noint.time$min > 0, noint.time$min*60,0)
noint.time$sum = rowSums(noint.time)
noint.file = as.data.frame(noint.duration$L1)
colnames(noint.file) = "file"
noint.time = cbind(noint.time, noint.file)
noint.tot = as.data.frame(tapply(noint.time$sum, noint.time$file, sum))
colnames(noint.tot) = "not.int."
social.dat = cbind(social.dat, noint.tot)
social.dat$ID = rownames(social.dat)
Here is an example of a csv file I am working with. The words are all in the same column and separated by spaces.
Total time 10:00.61
Lap times
01 00:07.46
02 00:05.64
03 00:01.07
04 00:01.04
05 00:04.71
06 00:06.43
07 00:12.52
08 00:07.34
09 00:05.46
10 00:05.81
11 00:05.52
12 00:06.51
13 00:10.75
14 00:00.83
15 00:03.64
16 00:02.75
17 00:01.20
18 00:06.17
19 00:04.40
20 00:00.75
21 00:00.84
22 00:01.29
23 00:02.31
24 00:03.04
25 00:02.85
26 00:05.86
27 00:05.76
28 00:05.06
29 00:00.96
30 00:06.91
@akrun suggested ifelse, which works great for one or two nestings. Much past that, my personal preference is to use dplyr::case_when or a separate data.frame in a merge/join of sorts.
If you are in the "simple case" of assigning consistently by the same fields (id in this case), then the merge/join is my preferred method: it makes maintenance much simpler (IMO). (By "consistently by the same fields", I mean that you could have id1 and id2 fields by which you define the individual records and their applicable groups. Likely too much for your example, so I'll keep this answer at one merge key.)
Three methods (data far below):
Base R
dat2a <- merge(dat, groups, by="id", all.x=TRUE)
dat2a
# id int group
# 1 1 22 veh
# 2 2 33 thc1
# 3 3 44 <NA>
Note that any id not included in the definition of groups will have NA group. You can assign a default group with this:
dat2a$group[is.na(dat2a$group)] <- "somedefaultgroup"
dat2a
# id int group
# 1 1 22 veh
# 2 2 33 thc1
# 3 3 44 somedefaultgroup
dplyr, merge/join
Similar concept, but using dplyr-esque verbs.
library(dplyr)
dat2c <- left_join(dat, groups, by="id") %>%
mutate(group = if_else(is.na(group), "somedefaultgroup", group))
dplyr::case_when
(This does not use groups as I defined for the merge/join cases.)
In case you really want to do some ladder/nesting of if/else-like statements, case_when is easier to read (and debug) and might be faster, depending on your use-case.
Most direct:
library(dplyr)
dat2b <- dat
dat2b$group <- case_when(
dat2b$id %in% c("1","5") ~ "veh",
dat2b$id %in% c("2","6") ~ "thc1",
TRUE ~ "somedefaultgroup"
)
A little easier to read than the previous by using with(...), but functionally identical. (If your "ladder" is much longer, then code-golf (number of characters in the code) can be significantly reduced.)
dat2b <- dat
dat2b$group <- with(dat2b, case_when(
id %in% c("1","5") ~ "veh",
id %in% c("2","6") ~ "thc1",
TRUE ~ "somedefaultgroup"
))
If you want to use some dplyr verbs, then:
dat2b <- dat
dat2b <- dat2b %>%
mutate(
group = case_when(
id %in% c("1","5") ~ "veh",
id %in% c("2","6") ~ "thc1",
TRUE ~ "somedefaultgroup"
)
)
Data
When doing merge/join actions, it's important to use stringsAsFactors=FALSE so that the absence of factor levels (of the newly-assigned groups) is not a problem. (This can be worked around, but ...)
dat <- data.frame(id=c("1","2","3"), int=c(22L,33L,44L),
stringsAsFactors=FALSE)
Optional use for the merge examples above:
groups <- data.frame(id=c("1","5","2","6"), group=c("veh","veh","thc1","thc1"),
stringsAsFactors=FALSE)
groups
# id group
# 1 1 veh
# 2 5 veh
# 3 2 thc1
# 4 6 thc1
The premise is that you define one row for each unique id.
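For reference, the plain ifelse() route mentioned at the top of this answer also works here with %in% and nesting (a sketch on the same dat, fine for one or two levels but unwieldy beyond that):
dat2d <- dat
dat2d$group <- ifelse(dat2d$id %in% c("1","5"), "veh",
                 ifelse(dat2d$id %in% c("2","6"), "thc1", "somedefaultgroup"))
dat2d   # should match dat2a above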
Thanks to @r2evans, the following code worked exactly as I wanted it to (using dplyr::case_when):
social.dat$group = case_when(
social.dat$ID %in% c("1","5") ~ "veh",
social.dat$ID %in% c("2","6") ~ "thc1",
social.dat$ID %in% c("3","7") ~ "thc2",
social.dat$ID %in% c("4","8") ~ "thc3"
)
This was the final output of the dataframe
# of int. int. not.int. ID group
1 50 218.41 372.16 1 veh
3 33 134.94 158.17 3 thc2

r aggregate dynamic columns

I'd like to create an aggregation without knowing either the column names or their positions, i.e. I retrieve the names dynamically.
Further, I'm able to use data.frame or data.table, as I'm forced to use R version 3.1.1.
Is there an option like do.call, as explained in this answer for 'order'?
Trying a similar do.call with 'aggregate' leads to an error:
# generate a small dataset
set.seed(1234)
smalldat <- data.frame(group1 = rep(1:2, each = 5),
group2 = rep(c('a','b'), times = 5),
x = rnorm(10),
y = rnorm(10))
group_by <- c('group1','group2')
test <- do.call( aggregate.data.frame , c(by=group_by, x=smalldat, FUN=mean))
#output
#Error in is.data.frame(x) : Argument "x" missing (no default)
or is there an option with data.table?
# generate a small dataset
set.seed(1234)
smalldat <- data.frame(group1 = rep(1:2, each = 5),
group2 = rep(c('a','b'), times = 5),
x = rnorm(10),
y = rnorm(10))
# convert the data.frame to a data.table
library(data.table)
smalldat <- data.table(smalldat)
# convert aggregated variable into raw data file
smalldat[, aggGroup1 := mean(x), by = group1]
Thanks for advice!
aggregate can take a formula, and you can build a formula from a string.
form = as.formula(paste(". ~", paste(group_by, collapse = " + ")))
aggregate(form, data = smalldat, FUN = mean)
# group1 group2 x y
# 1 1 a 0.1021667 -0.09798418
# 2 2 a -0.5695960 -0.67409059
# 3 1 b -1.0341342 -0.46696381
# 4 2 b -0.3102046 0.46478476
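Since the question also asks about data.table: a possible equivalent there (a sketch, assuming smalldat has been converted with data.table() as in the question) passes the character vector of grouping names straight to by and picks up the remaining columns dynamically via .SDcols:
library(data.table)
smalldat <- data.table(smalldat)
group_by <- c('group1', 'group2')
smalldat[, lapply(.SD, mean), by = group_by, .SDcols = setdiff(names(smalldat), group_by)]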

How to rbind the first 4 columns one above the other with a tag

Below I have tried to reproduce it in a representable form:
v <- data.frame(C1TEMP = c(3,6,1,8,9,2,2,9,1,23),
C1VIB = c(5,6,1,8,9,2,2,9,1,23),
C1DE = c(9,6,1,8,9,2,2,9,1,23),
C1NDE = c(8,6,1,8,9,2,2,9,1,23),
C2TEMP = c(5,6,1,8,9,2,2,9,1,23),
C2VIB = c(378,6,1,8,9,2,2,9,1,23),
C2DE = c(3,78,1,8,9,2,2,9,1,23),
C2NDE = c(3,6,1,8,9,2,2,9,1,23),
C3TEMP= c(3,6,89,8,9,2,2,9,1,23),
C3VIB = c(3,6,1,98,9,2,2,9,1,23),
C3DE = c(33,56,91,82,99,12,22,19,81,23),
C3NDE = c(13,76,91,88,59,42,22,39,21,23))
Here I want to rbind every 4 columns one above the other, with a tag number alongside. The number of columns will always be divisible by 4. I am also attaching an image for a clear picture of the expected result.
EXPECTED OUTPUT:
I agree with YCR's comment. Still, this is a way to tackle your problem. Use the following code:
# data frames must have column names, so convert to matrices (which can have NULL colnames)
v01 <- as.matrix(v[, 1:4])
v02 <- as.matrix(v[, 5:8])
v03 <- as.matrix(v[, 9:12])
# remove columnnames
colnames(v01) <- NULL
colnames(v02) <- NULL
colnames(v03) <- NULL
# now you can use rbind and give the columnnames back
v2 <- rbind( v01, v02, v03)
colnames(v2) <- c("C1TEMP", "C1VIB", "C1DE", "C1NDE")
v2
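If you also want the tag column shown in the expected output (marking which block of four columns each set of rows came from), one possible follow-up (the Tag name is my own choice):
v2 <- data.frame(Tag = rep(1:3, each = nrow(v)), v2)
head(v2)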
Try this. It is a bit more convoluted than the previous answers, but it should be more adaptable to other data frames.
# how many blocks have you got?
howMany <-table(gsub(names(v),pattern = "[0-9]",replacement = ""))[1]
# make a common name string
NAMES <- unique(gsub(names(v),pattern = "[0-9]",replacement = ""))
# create a list
list() -> V
for(i in 1:howMany){
  # get the columns whose names contain the index number i
  v[, grep(names(v), pattern = i)] -> vi
  names(vi) <- NAMES # change the names
  data.frame(Tag = i, vi) -> V[[i]] # put it in the list
}
# combine the tables in the list into one data frame
do.call(rbind, V)
The melt and reshape way. It requires an identifier per row:
v<- data.frame(C1TEMP = c(3,6,1,8,9,2,2,9,1,23),
C1VIB = c(5,6,1,8,9,2,2,9,1,23),
C1DE = c(9,6,1,8,9,2,2,9,1,23),
C1NDE = c(8,6,1,8,9,2,2,9,1,23),
C2TEMP = c(5,6,1,8,9,2,2,9,1,23),
C2VIB = c(378,6,1,8,9,2,2,9,1,23),
C2DE = c(3,78,1,8,9,2,2,9,1,23),
C2NDE = c(3,6,1,8,9,2,2,9,1,23),
C3TEMP= c(3,6,89,8,9,2,2,9,1,23),
C3VIB = c(3,6,1,98,9,2,2,9,1,23),
C3DE = c(33,56,91,82,99,12,22,19,81,23),
C3NDE = c(13,76,91,88,59,42,22,39,21,23),
id = 1:10
, stringsAsFactors = F)
library(tidyverse)
# melt the dataframe (reshape from wide to long format):
v_melt <- reshape2::melt(v, id.vars = "id")
# modify the aggregation variables
v_melt <- v_melt %>%
  mutate(var = substr(as.character(variable), 3, 8),
         group_id = paste0(substr(as.character(variable), 1, 2), "_", id))
# reshape the data frame in a wide format:
v_cast <- reshape2::dcast(v_melt, group_id ~ var, value.var = "value")

How do I pass names for new summary columns to data.table in a function?

Say I want to create a function that calculates a summary dataset from a data.table in R, and I want to be able to pass the name of the new calculated variable in programmatically.
For example:
library(data.table)
# generate some fake data
set.seed(919)
dt <- data.table(x = rnorm(50), by.var = rep(c("a", "b"), 25))
dt[, list(group.means = mean(x)), by = "by.var"] # This is what I want
# But I want to do in a function, so I can do it repeatedly:
groupMeans <- function(out.var, by.var, dat = dt) {
  return(dat[, list(out.var = mean(x)), by = by.var]) # doesn't work
}
groupMeans("group.means", "by.var") # out.var should be "group.means"
How do I do this?
Courtesy of docendo discimus, you can use a named list created with setNames, like this:
groupMeans <- function(out.var, by.var, dat = dt) {
  return(dat[, setNames(list(mean(x)), out.var), by = by.var])
}
groupMeans("group.means", "by.var")
# by.var group.means
# 1: a -0.1159832
# 2: b 0.2910531
You could consider changing the column names inside your function:
groupMeans <- function(out.var, by.var, dat = dt) {
  res <- dat[, list(mean(x)), by = by.var]
  setnames(res, "V1", out.var)
  res
}
We could use setnames to name the summarised column with the 'out.var' vector.
groupMeans <- function(out.var, by.var, dat = dt) {
  setnames(dat[, list(mean(x)), by = by.var],
           length(by.var) + 1L, out.var)
}
groupMeans("group.var","by.var", dt)[]
# by.var group.var
#1: a -0.1159832
#2: b 0.2910531
EDIT: Based on @Frank's suggestion.
