Creating multiple functions with dplyr non-standard evaluation

I am trying to use mutate_() to create multiple columns, where each is based on a custom function called with different inputs. I can use paste() to build multiple quoted function calls, but this doesn't work because dplyr's NSE requires formulas (~) rather than quoted strings in order to find the function. How can I write the "dots =" line below so that the function can be found? I experimented with ~, as.formula(), and lazyeval::interp(), but couldn't get any of them to work. My actual "prefixes" is a long vector, so I don't want to write out the function calls for each new column separately. Thanks!
library(dplyr)
library(lazyeval)
library(nycflights13)
myfunc = function(x, y) { x - y }
# this works
flights1 <- mutate(flights, dep_time_sched = myfunc(dep_time, dep_delay),
                   arr_time_sched = myfunc(arr_time, arr_delay))
# this doesn't - Error: could not find function "myfunc"
prefixes <- c('dep', 'arr')
dots = as.list(paste0('myfunc(',
                      paste0(prefixes, '_time'), ', ',
                      paste0(prefixes, '_delay)')))
flights2 <- mutate_(flights, .dots = setNames(dots, paste0(prefixes, '_time_sched')))

You could approach this by using interp with lapply to loop through your prefixes and get a list in the desired format for mutate_.
dots = lapply(prefixes, function(var)
  interp(~myfunc(x, y),
         .values = list(x = as.name(paste0(var, "_time")),
                        y = as.name(paste0(var, "_delay")))))
dots
[[1]]
~myfunc(dep_time, dep_delay)
<environment: 0x0000000019e51f00>
[[2]]
~myfunc(arr_time, arr_delay)
<environment: 0x0000000019f1e5b0>
This gives the same results as your flights1.
flights2 = mutate_(flights, .dots = setNames(dots, paste0(prefixes, '_time_sched')))
identical(flights1, flights2)
[1] TRUE
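For the record, mutate_() and lazyeval have since been superseded by rlang-style quasiquotation, so in current dplyr (>= 0.7) the same loop could be written roughly as below; this is a sketch, with flights3 as a hypothetical result name:
library(dplyr)
library(rlang)
# build one call per prefix, e.g. myfunc(dep_time, dep_delay)
dots <- lapply(prefixes, function(var)
  expr(myfunc(!!sym(paste0(var, "_time")), !!sym(paste0(var, "_delay")))))
# splice the named expressions into mutate()
flights3 <- mutate(flights, !!!set_names(dots, paste0(prefixes, "_time_sched")))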

My actual "prefixes" is a long vector so I don't want to separately write out the function calls for each new column.
If that's the case, you should really transform your data to long format. To clarify what I mean, let's look at a smaller example:
mydat <- flights[1:5, c(paste0(prefixes,"_time"), paste0(prefixes,"_delay"))]
#   dep_time arr_time dep_delay arr_delay
#      (int)    (int)     (dbl)     (dbl)
# 1      517      830         2        11
# 2      533      850         4        20
# 3      542      923         2        33
# 4      544     1004        -1       -18
# 5      554      812        -6       -25
library(data.table)
longdat <- setDT(mydat)[, .(
  pref  = rep(prefixes, each = .N),
  time  = unlist(mget(paste0(prefixes, "_time"))),
  delay = unlist(mget(paste0(prefixes, "_delay")))
)]
longdat[, time_sched := myfunc(time, delay) ]
#     pref time delay time_sched
#  1:  dep  517     2        515
#  2:  dep  533     4        529
#  3:  dep  542     2        540
#  4:  dep  544    -1        545
#  5:  dep  554    -6        560
#  6:  arr  830    11        819
#  7:  arr  850    20        830
#  8:  arr  923    33        890
#  9:  arr 1004   -18       1022
# 10:  arr  812   -25        837
Besides being simpler, calling the function a single time takes advantage of its vectorization.
While I used data.table to construct longdat, I'm sure there's a tool to do the same thing in the tidyr package (companion to dplyr). Similarly, the addition of the time_sched column is just a mutate.
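For reference, current tidyr (>= 1.0) can do this reshape in a single pivot_longer() call via its ".value" sentinel; a sketch, assuming mydat still holds only the four *_time/*_delay columns:
library(tidyr)
library(dplyr)
longdat2 <- mydat %>%
  pivot_longer(everything(),
               names_to = c("pref", ".value"),  # split "dep_time" into pref = "dep", value column "time"
               names_sep = "_") %>%
  mutate(time_sched = myfunc(time, delay))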
Alternative ways of reshaping
Thanks to @akrun, here is another way to get to longdat, using melt syntax that will be available in the next version of data.table (1.9.8, not released yet):
longdat <- melt(mydat,
                measure = patterns('time$', 'delay$'),
                variable.name = "pref",
                value.name = c('time', 'delay')
)[, pref := prefixes[pref]]
or, also thanks to @akrun, here is a way to reshape that automatically constructs the prefixes, given the suffixes (time and delay), using @AnandaMahto's splitstackshape package:
library(splitstackshape)
longdat <- merged.stack(transform(mydat, ind = 1:nrow(mydat)),
                        var.stubs = c('_time', '_delay'),
                        sep = 'var.stubs',
                        atStart = FALSE)

Related

R data.table - how to use assigned variables as column names for computing summaries _and_ grouping

The problem is well known: unlike data.frames, where one can refer to columns via character variables, data.table by default wants actual column names (e.g. you cannot do DT[, "X"]; you must do DT[, X] if your table has a column named "X").
In some cases this is a problem, because one wants to handle a generic dataset with arbitrary, user-defined column names.
I saw a couple of posts about this:
Pass column name in data.table using variable
Select / assign to data.table when variable names are stored in a character vector
And the official FAQ says I should use with = FALSE:
https://cran.r-project.org/web/packages/data.table/vignettes/datatable-faq.html#i-assigned-a-variable-mycol-x-but-then-dt-mycol-returns-x.-how-do-i-get-it-to-look-up-the-column-name-contained-in-the-mycol-variable
I do not really understand the quote + eval method, and the one with .. gave an error before doing anything at all.
So I only compared the method using the actual column names (which I could not use in real practice), the one using get, and the one using with = FALSE.
Interestingly, the latter, i.e. the official, recommended one, is the only one that does not work at all.
And get, while it works, for some reason is far slower than using the actual column names, which I really don't get (no pun intended).
So I guess I am doing something wrong...
Incidentally, but importantly, I turned to data.table because I needed to compute a grouped mean of a fairly large dataset, and my previous attempts using aggregate, by, or tapply were either too slow or too memory-hungry, and crashed R.
I cannot disclose the actual data I am working with, so I made a simulated dataset of the same size here:
require(data.table)
row.var = "R"
col.var = "C"
value.var = "V"
set.seed(934293)
d <- setNames(data.frame(sample(1:758145, 7582953, replace = T),
                         sample(1:450, 7582953, replace = T),
                         runif(7582953, 5, 9)),
              c(row.var, col.var, value.var))
DT <- as.data.table(d)
rm(m)
print(system.time({
m <- DT[, mean(V), by = .(R, C)]
}))
#    user  system elapsed
#    1.64    0.27    0.51
rm(m)
print(system.time({
m <- DT[, mean(get(value.var)), by = .(get(row.var), get(col.var))]
}))
#    user  system elapsed
#   16.05    0.02   14.97
rm(m)
print(system.time({
m <- DT[, mean(value.var), by = .(row.var, col.var), with = FALSE]
}))
# Error in h(simpleError(msg, call)) :
#   error in evaluating the argument 'x' in selecting a method for
#   function 'print': missing value where TRUE/FALSE needed
# In addition: Warning message:
# In mean.default(value.var) :
#   argument is not numeric or logical: returning NA
# Timing stopped at: 0 0 0
Any ideas?
collap() from the collapse package gives better timing:
library(collapse)
> system.time(collap(DT, reformulate(c(row.var, col.var),
                                     response = value.var), fmean))
   user  system elapsed
  0.881   0.020   0.901
> system.time(fmean(fgroup_by(DT, c(row.var, col.var))))
   user  system elapsed
  0.931   0.021   0.952
> system.time(DT[, mean(V), by = .(R, C)])
   user  system elapsed
  5.052   0.099   0.646
As the get approach and the one with .SDcols take time, another option is to interpolate the values into a string and evaluate it:
system.time(eval(parse(text = glue::glue(
  "DT[, mean({value.var}), by = .({row.var}, {col.var})]"))))
   user  system elapsed
  5.065   0.105   0.660
Checking the output:
> out_c <- collap(DT, reformulate(c(row.var, col.var), response = value.var), fmean)
> out_d <- DT[, mean(V), by = .(R, C)]
> out_dte <- eval(parse(text = glue::glue("DT[, mean({value.var}), by = .({row.var}, {col.var})]")))
> out_c
R C V
<int> <int> <num>
1: 1 16 5.237421
2: 1 77 5.080965
3: 1 95 5.822834
4: 1 107 7.276902
5: 1 224 5.565160
---
7499419: 758145 162 8.077307
7499420: 758145 181 6.094559
7499421: 758145 251 7.816277
7499422: 758145 310 8.373066
7499423: 758145 435 8.222885
> out_d[order(R, C)]
R C V1
<int> <int> <num>
1: 1 16 5.237421
2: 1 77 5.080965
3: 1 95 5.822834
4: 1 107 7.276902
5: 1 224 5.565160
---
7499419: 758145 162 8.077307
7499420: 758145 181 6.094559
7499421: 758145 251 7.816277
7499422: 758145 310 8.373066
7499423: 758145 435 8.222885
> out_dte[order(R, C)]
R C V1
<int> <int> <num>
1: 1 16 5.237421
2: 1 77 5.080965
3: 1 95 5.822834
4: 1 107 7.276902
5: 1 224 5.565160
---
7499419: 758145 162 8.077307
7499420: 758145 181 6.094559
7499421: 758145 251 7.816277
7499422: 758145 310 8.373066
7499423: 758145 435 8.222885
Once we have the output, the column names can be updated with setnames:
> setnames(out_dte[order(R, C)], 'V1', value.var)[]
R C V
<int> <int> <num>
1: 1 16 5.237421
2: 1 77 5.080965
3: 1 95 5.822834
4: 1 107 7.276902
5: 1 224 5.565160
---
7499419: 758145 162 8.077307
7499420: 758145 181 6.094559
7499421: 758145 251 7.816277
7499422: 758145 310 8.373066
7499423: 758145 435 8.222885
This particular problem of programming data.tables can be solved without get() at least in two different ways:
1. Using .SDcols and passing character values to by
Here, .SDcols takes a character vector of column names to operate on, and by = also accepts a character vector of column names.
DT[, lapply(.SD, mean), .SDcols = value.var, by = c(row.var, col.var)]
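A side benefit of this form: lapply(.SD, mean) keeps the original column name (V here) in the result, so no setnames() step is needed afterwards.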
2. Using the new env parameter
With development version 1.14.3, data.table has gained a new interface for programming on data.table (see item 10 in the Changelog).
data.table::update.dev.pkg() # Install latest dev version (1.14.3)
library(data.table)
DT[, mean(v1), by = .(v2, v3), env = list(v1 = value.var, v2 = row.var, v3 = col.var)]
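Here env = list(v1 = value.var, ...) substitutes the character values for the placeholder symbols before evaluation, so the call above is evaluated exactly as DT[, mean(V), by = .(R, C)] while the code stays fully parameterized.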
Benchmarks
microbenchmark::microbenchmark(
nat = DT[, mean(V), by = .(R, C)],
# get = DT[, mean(get(value.var)), by = .(get(row.var), get(col.var))],
chr = DT[, lapply(.SD, mean), .SDcols = value.var, by = c(row.var, col.var)],
env = DT[, mean(v1), by = .(v2, v3), env = list( v1 = value.var, v2 = row.var, v3 = col.var)],
times = 3L,
check = "equivalent"
)
Unit: seconds
 expr      min       lq     mean   median       uq      max neval
  nat 1.275479 1.313737 1.356826 1.351995 1.397500 1.443005     3
  chr 1.279219 1.297576 1.328261 1.315933 1.352782 1.389630     3
  env 1.324985 1.327743 1.331603 1.330502 1.334912 1.339323     3
All three variants took approximately the same time; get was excluded from this run because its run time was far too long.
Edit: Benchmark results including get:
Unit: seconds
 expr        min         lq       mean     median         uq        max neval
  nat   1.238719   1.288629   1.315594   1.338539   1.354032   1.369525     3
  get 569.560843 572.417951 576.482720 575.275059 579.943658 584.612257     3
  chr   1.275734   1.279536   1.309346   1.283338   1.326153   1.368967     3
  env   1.298941   1.316105   1.326649   1.333268   1.340503   1.347738     3
So, get takes over 400 times longer than the other three variants.
An explanation for this behaviour is given by data.table when the verbose option is switched on:
options(datatable.verbose = TRUE)
Now,
DT[, mean(get(value.var)), by = .(get(row.var), get(col.var))]
gives the following hints:
Argument 'by' after substitute: .(get(row.var), get(col.var))
'(m)get' found in j. ansvars being set to all columns. Use .SDcols or a
  single j=eval(macro) instead. Both will detect the columns used which is
  important for efficiency. Old ansvars: [R, C, V]  New ansvars: [R, C, V]
Finding groups using forderv ... forder.c received 7582953 rows and 2 columns
  0.600s elapsed (0.720s cpu)
Finding group sizes from the positions (can be avoided to save RAM) ...
  0.070s elapsed (0.060s cpu)
Getting back original order ... forder.c received a vector type 'integer'
  length 7499423
  0.470s elapsed (0.640s cpu)
lapply optimization is on, j unchanged as 'mean(get(value.var))'
GForce is on, left j unchanged
Old mean optimization changed j from 'mean(get(value.var))' to
  '.External(Cfastmean, get(value.var), FALSE)'
Making each group and running j (GForce FALSE) ...
[...]
The three other variants all use data.table's GForce optimization.
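To verify this yourself, switch the verbose option on and run, e.g., the chr variant; the output should report a GForce-optimized j (the exact wording varies by version):
options(datatable.verbose = TRUE)
DT[, lapply(.SD, mean), .SDcols = value.var, by = c(row.var, col.var)]
# look for a line mentioning GForce optimizing j in the verbose output
options(datatable.verbose = FALSE)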

return output as columns instead of list after applying a function using dplyr [duplicate]

I have a dataset with several columns, one of which is a column for reaction times. These reaction times are comma separated to denote the reaction times (of the same participant) for the different trials.
For example: row 1 (i.e.: the data from participant 1) has the following under the column "reaction times"
reaction_times
2000,1450,1800,2200
Hence these are the reaction times of participant 1 for trials 1,2,3,4.
I now want to create a new data set in which the reaction times for these trials all form individual columns. This way I can calculate the mean reaction time for each trial.
trial 1 trial 2 trial 3 trial 4
participant 1: 2000 1450 1800 2200
I tried colsplit from the reshape2 package, but that doesn't seem to split my data into new columns (perhaps because my data is all in one cell).
Any suggestions?
I think you are looking for the strsplit() function:
a = "2000,1450,1800,2200"
strsplit(a, ",")
[[1]]
[1] "2000" "1450" "1800" "2200"
Notice that strsplit returns a list, in this case with only one element. This is because strsplit takes vectors as input. Therefore, you can also feed it a long vector of your single-cell strings and get back a split list for that whole vector. In a more relevant example this looks like:
# Create some example data
dat = data.frame(reaction_time =
                   apply(matrix(round(runif(100, 1, 2000)),
                                25, 4), 1, paste, collapse = ","),
                 stringsAsFactors = FALSE)
splitdat = do.call("rbind", strsplit(dat$reaction_time, ","))
splitdat = data.frame(apply(splitdat, 2, as.numeric))
names(splitdat) = paste("trial", 1:4, sep = "")
head(splitdat)
trial1 trial2 trial3 trial4
1 597 1071 1430 997
2 614 322 1242 1140
3 1522 1679 51 1120
4 225 1988 1938 1068
5 621 623 1174 55
6 1918 1828 136 1816
and finally, to calculate the mean per person:
apply(splitdat, 1, mean)
[1] 1187.50 361.25 963.75 1017.00 916.25 1409.50 730.00 1310.75 1133.75
[10] 851.25 914.75 881.25 889.00 1014.75 676.75 850.50 805.00 1460.00
[19] 901.00 1443.50 507.25 691.50 1090.00 833.25 669.25
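Note that the question actually asks for the mean per trial (i.e. per column), which is the column-wise counterpart:
colMeans(splitdat)  # equivalently: apply(splitdat, 2, mean)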
A nifty, if rather heavy-handed, way is to use read.csv in conjunction with textConnection. Assuming your data is in a data frame df (header = FALSE keeps the first row from being read as column names):
x <- read.csv(textConnection(df[["reaction times"]]), header = FALSE)
Old question, but I came across it from another recent question (which seems unrelated).
Both existing answers are appropriate, but I wanted to share an answer related to a package I have created called "splitstackshape" that is fast and has straightforward syntax.
Here's some sample data:
set.seed(1)
dat = data.frame(
  reaction_time = apply(matrix(round(
    runif(24, 1, 2000)), 6, 4), 1, paste, collapse = ","))
This is the splitting:
library(splitstackshape)
cSplit(dat, "reaction_time", ",")
# reaction_time_1 reaction_time_2 reaction_time_3 reaction_time_4
# 1: 532 1889 1374 761
# 2: 745 1322 769 1555
# 3: 1146 1259 1540 1869
# 4: 1817 125 996 425
# 5: 404 413 1436 1304
# 6: 1797 354 1984 252
And, optionally, if you need to take the rowMeans:
rowMeans(cSplit(dat, "reaction_time", ","))
# [1] 1139.00 1097.75 1453.50 840.75 889.25 1096.75
Another option using dplyr and tidyr with Paul Hiemstra's example data is:
# load the needed packages
library(dplyr)
library(tidyr)
library(stringr)
# create example data
data = data.frame(reaction_time =
                    apply(matrix(round(runif(100, 1, 2000)),
                                 25, 4), 1, paste, collapse = ","),
                  stringsAsFactors = FALSE)
head(data)
# clean data
data2 <- data %>%
  mutate(split_reaction_time = str_split(as.character(reaction_time), ",")) %>%
  unnest(split_reaction_time)
data2$col_names <- c("trial1", "trial2", "trial3", "trial4")
data2 <- data2 %>%
  spread(key = col_names, value = split_reaction_time) %>%
  select(-reaction_time)
head(data2)
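From data2, per-trial means then follow directly; a sketch assuming dplyr >= 1.0 (the split values are still character, hence the as.numeric()):
data2 %>% summarise(across(starts_with("trial"), ~ mean(as.numeric(.x))))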

ifelse didn't work in dataframe in R

I have a question about ifelse in a data.frame in R. I checked several SO posts about it, and unfortunately none of the solutions fit my case.
My case is making a conditional calculation in a data frame, but it returns the warning "the condition has length > 1 and only the first element will be used", even though I used the ifelse function, which should work perfectly according to the SO posts I checked.
Here is my sample code:
library(scales)
head(temp[, 2:3])
previous current
1 0 10
2 50 57
3 92 177
4 84 153
5 30 68
6 162 341
rate = function(yest, tod){
  value = tod/yest
  if(value > 1){
    return(paste("+", percent(value - 1), sep = ""))
  } else {
    return(paste("-", percent(1 - value), sep = ""))
  }
}
temp$change = ifelse(temp$previous > 0, rate(temp$previous, temp$current), temp$current)
So if I run the ifelse one, I get the following result:
head(temp[, 2:4])
previous current change
1 0 10 10
2 50 57 +NaN%
3 92 177 +NaN%
4 84 153 +NaN%
5 30 68 +NaN%
6 162 341 +NaN%
So my question is: how should I deal with this? I tried assigning 0 to the last column before running ifelse, but it still failed.
Many thanks in advance!
Try the following two segments; both should do what you want. Maybe it is the second one you are looking for.
library(scales)
set.seed(1)
temp <- data.frame(previous = rnorm(5), current = rnorm(5))
rate <- function(i) {
  yest <- temp$previous[i]
  tod  <- temp$current[i]
  if (yest <= 0)
    return(tod)
  value = tod/yest
  if (value > 1) {
    return(paste("+", percent(value - 1), sep = ""))
  } else {
    return(paste("-", percent(1 - value), sep = ""))
  }
}
temp$change <- unlist(lapply(1:dim(temp)[1], rate))
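Note that the row-indexed rate() sidesteps the vectorization problem because each call sees a single scalar, at the cost of an R-level loop via lapply().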
Second:
ind <- which(temp$previous > 0)
temp$change <- temp$current
temp$change[ind] <- unlist(lapply(ind, function(i)
  rate(temp$previous[i], temp$current[i])))
In the second segment, the function rate is same as you've coded it.
Here's another way to do the same:
# 1: load dplyr
# if needed: install.packages("dplyr")
library(dplyr)
# 2: recreate your data
your_dataframe = as_tibble(cbind(c(0, 50, 92, 84, 30, 162),
                                 c(10, 57, 177, 153, 68, 341))) %>%
  rename(previous = V1, current = V2)
# 3: obtain the change using your conditions
your_dataframe %>%
  mutate(change = ifelse(previous > 0,
                         ifelse(current/previous > 1,
                                paste0("+%", (current/previous - 1) * 100),
                                paste0("-%", (1 - current/previous) * 100)),
                         current))
Result:
# A tibble: 6 x 3
previous current change
<dbl> <dbl> <chr>
1 0 10 10
2 50 57 +%14
3 92 177 +%92.3913043478261
4 84 153 +%82.1428571428571
5 30 68 +%126.666666666667
6 162 341 +%110.493827160494
Only the first element of value is evaluated, so the output of rate depends solely on the first row of temp.
Adopting the advice I received from warm-hearted SO users, I vectorized some of my functions and it worked! Raise a glass to SO community!
Here is the solution:
temp$rate = ifelse(temp$previous > 0, ifelse(temp$current/temp$previous > 1,
temp$current/temp$previous - 1,
1 - temp$current/temp$previous),
temp$current)
This will return rate with scientific notation. If "regular" notation is needed, here is an update:
temp$rate = format(temp$rate, scientific = F)
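If the formatted percentage strings from the original rate() are wanted, the function can also be vectorized directly with ifelse() and percent(); a sketch along the same lines (the result column becomes character because of the mixing):
library(scales)
rate_vec <- function(yest, tod) {
  value <- tod / yest
  ifelse(value > 1,
         paste0("+", percent(value - 1)),
         paste0("-", percent(1 - value)))
}
# rows with previous == 0 keep the raw current value
temp$change <- ifelse(temp$previous > 0,
                      rate_vec(temp$previous, temp$current),
                      temp$current)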

How to work with all subsets in vectorized way

I have a stock price dataframe containing many symbols, and I would like to perform operations on the subset for every symbol in a vectorized way. My data is:
head(dataset)
date open high low close volume symbol
1 2014-08-29 34.59 34.6800 34.59 34.6800 200 AAIT
2 2014-08-28 34.96 34.9600 34.96 34.9600 211 AAIT
3 2014-08-27 35.28 35.2800 35.28 35.2800 507 AAIT
4 2014-08-26 35.02 35.0200 35.02 35.0200 00 AAIT
5 2014-08-25 34.57 35.0200 34.57 35.0200 385 AAIT
6 2014-08-22 34.80 34.8299 34.80 34.8299 802 AAIT
For every symbol I would like to do something like this:
for (symb in unique(dataset$symbol)){
  dataset$night = with(subset(dataset, dataset$symbol == symb),
                       open[-length(open)] - close[-1])
}
This causes the last row to be filled with NA, so I can't do it on the whole dataframe. I could replace the last line afterwards, but I would prefer to work with the subsets for convenience. Is it possible to do the for loop in a vectorized way? (for loops are very slow in R, which can become a problem if I have too many symbols)
You could use dplyr:
library(dplyr)
dataset <- dataset %>%
  group_by(symbol) %>%
  mutate(night = c(head(open, -1) - tail(close, -1), NA))
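Here head(open, -1) - tail(close, -1) pairs each row's open with the next row's close within a symbol, and the trailing NA pads the result back to the group's length.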
or plyr:
library(plyr)
dataset <- ddply(dataset, .(symbol), mutate,
                 night = c(head(open, -1) - tail(close, -1), NA))
or data.table:
library(data.table)
dt <- data.table(dataset)
setkey(dt, symbol)
dt[, night := c(head(open, -1) - tail(close, -1), NA), by = symbol]
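If your data.table version has shift() (added in 1.9.6, if memory serves), the same lead can be written more directly; a sketch:
dt[, night := open - shift(close, type = "lead"), by = symbol]
# shift(close, type = "lead") is the next row's close; the last row per symbol is NA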
