How to store replicate runs in a dataframe

I have a function (call it random_func) that generates random numbers according to some rules using parameters. I'm trying to repeatedly call that function and store the results in a dataframe.
df <- lapply(c(1,2,3,4,5), FUN = function(x) replicate(100, expr = random_func(n=10, param=x)))
Right now, the output is a list of 5 vectors each with 100 elements. What R voodoo do I need to do in order to get it to look something like:
param, result
1, 5
1, 6
1, 8
...
5, 10

set.seed(42)
do.call(rbind, # rbind results for different x together
        lapply(c(1, 2), FUN = function(x)
          data.frame(param = x, # will be recycled
                     result = do.call(what = c, # concatenate results of replicate
                                      replicate(n = 2,
                                                expr = rnorm(n = 3, mean = x), # replace with random_func
                                                simplify = FALSE))))) # when FALSE, replicate returns list
# param result
# 1 1 2.3709584
# 2 1 0.4353018
# 3 1 1.3631284
# 4 1 1.6328626
# 5 1 1.4042683
# 6 1 0.8938755
# 7 2 3.5115220
# 8 2 1.9053410
# 9 2 4.0184237
# 10 2 1.9372859
# 11 2 3.3048697
# 12 2 4.2866454
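
For the exact call in the question (100 replicates for each of five parameter values), the same idea can be sketched as below. This is only a sketch: random_func here is a hypothetical stand-in, since the real function was not shown, and as.vector() is there to flatten whatever shape replicate() returns.

```r
# Hypothetical stand-in for the asker's function (the real one is not shown).
random_func <- function(n, param) rnorm(n, mean = param)

set.seed(42)
params <- 1:5

# One data frame per parameter value; `param` is recycled to the result length.
runs <- lapply(params, function(p) {
  res <- replicate(100, random_func(n = 10, param = p))  # a 10 x 100 matrix here
  data.frame(param = p, result = as.vector(res))         # flatten into one column
})

df <- do.call(rbind, runs)  # stack the five pieces into one data frame
str(df)
```

If random_func returns a single number per call, replicate() already returns a plain vector and as.vector() is a harmless no-op.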

rerun and map_df solution from purrr (note that rerun() has been deprecated since purrr 1.0.0)
library(dplyr)
library(purrr)
Random function
random_func <- function(n, param) {
rnorm(n)+(param*10)
}
solution
myfun <- function() {
  # return the result directly instead of assigning it to a local variable
  100 %>%
    rerun(x=10, y=1:5) %>%
    map_df(~data.frame(param=.x$y, result=random_func(n=.x$x, param=.x$y)))
}
Output
df <- myfun()
head(df)
param result
1 1 10.15325
2 2 19.52867
3 3 30.08218
4 4 40.06418
5 5 48.39804
6 1 11.00435
Additional validation
df %>%
group_by(param) %>%
summarise(mean = mean(result))
param mean
1 1 10.00634
2 2 20.03874
3 3 30.11093
4 4 40.06166
5 5 50.02632
Performance
library(microbenchmark)
microbenchmark(myfun())
expr min lq mean median uq max neval
myfun() 65.93166 66.80521 69.42349 68.5152 69.57185 90.77295 100
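
Because rerun() is deprecated in current purrr releases, a roughly equivalent sketch using map() and list_rbind() (both available in purrr 1.0.0 and later) might look like this. The name myfun2 and the `times` argument are made up for illustration; random_func is the same toy function as above.

```r
library(purrr)

# Same toy function as in the answer above.
random_func <- function(n, param) rnorm(n) + (param * 10)

# Hypothetical replacement for myfun(): repeat the draw `times` times
# and bind the resulting data frames by row.
myfun2 <- function(times = 100) {
  pieces <- map(seq_len(times), function(i) {
    # param (length 5) is recycled against the 10 results, as in the original
    data.frame(param = 1:5, result = random_func(n = 10, param = 1:5))
  })
  list_rbind(pieces)
}

set.seed(1)
df2 <- myfun2()
head(df2)
```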

Related

Split duplicated rows of matrices in lists with base R

I have a list of matrices, with duplicated values in column id. How can I split the duplicates in all list elements?
The way I do it with data.frames is with lapply + split + duplicated, but that doesn't work with matrices, because split() breaks them up into plain numeric vectors. I would like to keep the matrix structure.
## Data.frame - all good
df <- data.frame(
id = rep(1:10, each = 2),
val = rep(10, each = 20)
)
df_list <- rep(list(df), 2);
lapply(df_list, function(x){split(x, duplicated(x[,'id']))$'FALSE'})
## Matrix - Here's my problem
mt <- as.matrix(data.frame(
id = rep(seq(1,10,1), each = 2),
val = rep(10, each = 20)
))
mt_list <- rep(list(mt), 2)
lapply(mt_list, function(x){split(x, duplicated(x[,'id']))$'FALSE'})

Maybe try
split(df,ave(df$id, df$id, FUN= function(x) seq_along(x)))
$`1`
id val
1 1 10
3 2 10
5 3 10
7 4 10
9 5 10
11 6 10
13 7 10
15 8 10
17 9 10
19 10 10
$`2`
id val
2 1 10
4 2 10
6 3 10
8 4 10
10 5 10
12 6 10
14 7 10
16 8 10
18 9 10
20 10 10
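
The answer above splits a data.frame. A possible way to carry the same occurrence-counter idea over to a matrix is to split the row indices rather than the matrix itself, and then index with drop = FALSE. This is a sketch on a small matrix built with cbind(); the names occ and parts are only illustrative.

```r
# Small example matrix with each id appearing twice.
mt <- cbind(id = rep(1:10, each = 2), val = 10)

# Occurrence number of each row within its id: 1 for the first, 2 for the second.
occ <- ave(mt[, "id"], mt[, "id"], FUN = seq_along)

# Split row *indices* by occurrence, then subset; drop = FALSE keeps a matrix.
parts <- lapply(split(seq_len(nrow(mt)), occ),
                function(rows) mt[rows, , drop = FALSE])

str(parts)
```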

While writing this question and fiddling around with the code, I came up with a solution.
Since I didn't find anything about this specific setup, I thought I'd post it anyway.
The functions subset / subset.matrix work:
lapply(mt_list, function(x){subset.matrix(x, !duplicated(x[,'id']))})
I benchmarked the different options; subset.matrix seems to be slightly faster than just subset.
mt <- as.matrix(data.frame(
id = rep(seq(1,1000,1), each = 2),
val = rep(1000, each = 20)
))
mt_list <- rep(list(mt), 50)
mc <- microbenchmark::microbenchmark(
subset = lapply(mt_list, function(x){subset(x, !duplicated(x[,'id']))}),
subset.matrix = lapply(mt_list, function(x){subset.matrix(x, !duplicated(x[,'id']))}),
split = lapply(mt_list, function(x){matrix(split(x, duplicated(x[,'id']))$'FALSE', ncol = 2)}),
unique = lapply( mt_list, unique )
)
mc
Unit: milliseconds
expr min lq mean median uq max neval cld
subset 3.758708 3.862849 4.256363 3.900580 3.981629 9.713416 100 a
subset.matrix 3.583632 3.700450 4.174137 3.729881 3.821947 9.611992 100 a
split 32.630604 33.061503 34.535531 33.262841 33.726039 77.531039 100 b
unique 144.832487 148.408874 155.099591 150.326865 155.456601 430.992916 100 c
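
Plain logical row indexing with drop = FALSE was not part of the benchmark above, but it should keep the matrix structure in the same way. A minimal sketch on a small example (no timing claims are made here):

```r
# Small example: a list of two identical matrices with duplicated ids.
mt <- cbind(id = rep(1:10, each = 2), val = 10)
mt_list <- rep(list(mt), 2)

# Keep the first row for every id; drop = FALSE guarantees a matrix result
# even if only one row survives.
first_rows <- lapply(mt_list, function(x) {
  x[!duplicated(x[, "id"]), , drop = FALSE]
})

first_rows[[1]]
```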

Return vector position in list r

I am trying to determine the vector where an element is coming from in a list I have created. I'll give a repeatable example here:
set.seed(101)
a <- runif(10, min=0, max=100)
b <- runif(10, min=0, max=100)
c <- runif(10, min=0, max=100)
d <- runif(10, min=0, max=100)
information <- list(a, b, c, d)
information.wanted <- do.call(pmax, information)
The code to get the information.wanted works just fine. What I am now trying to find is the individual vector in the list where each of the maximum values comes from. For example, value 1 in information.wanted (87.97...) comes from vector b in the information list. I would like to create another piece of code that gives the vector where the information.wanted comes from.
> information.wanted
[1] 87.97957 95.68375 73.19726 93.16344 92.33189 91.34787 82.04361 81.42830 62.20120
[10] 92.48044
I have no idea how to do this though. None of the code that I've tried has gotten me anywhere close.
position.of.information.wanted <- ??
I'm looking to get something like this. A numeric vector is fine. I can supplement the values in later.
> position.of.information.wanted
[1] 2 3 ...
Any help would be greatly appreciated. Thanks.

You need to apply which.max to each "i" index of each element in "information":
f1 = function(x)
sapply(seq_along(x[[1]]), function(i) which.max(sapply(x, "[[", i)))
f1(information)
# [1] 2 3 2 2 3 4 2 4 1 4
mapply already provides that kind of "parallel" functionality:
f2 = function(x)
unlist(.mapply(function(...) which.max(c(...)), x, NULL))
f2(information)
# [1] 2 3 2 2 3 4 2 4 1 4
Or, instead of concatenating "information" in chunks, convert to a "matrix" -as David Arenburg notes in the comments- at start and apply which.max to its rows:
f3a = function(x)
apply(do.call(cbind, x), 1, which.max)
f3a(information)
# [1] 2 3 2 2 3 4 2 4 1 4
or its columns:
f3b = function(x)
apply(do.call(rbind, x), 2, which.max)
f3b(information)
# [1] 2 3 2 2 3 4 2 4 1 4
also, max.col is convenient for a "matrix":
f4 = function(x)
max.col(do.call(cbind, x), "first")
f4(information)
# [1] 2 3 2 2 3 4 2 4 1 4
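
The "first" argument matters because max.col breaks ties at random by default, whereas which.max always returns the first maximum. A tiny made-up matrix with ties shows the difference:

```r
# Row 1 has a tie between columns 1 and 2; row 2 between columns 2 and 3.
m <- rbind(c(3, 3, 1),
           c(1, 2, 2))

by_which_max <- apply(m, 1, which.max)            # first maximum per row
first_max    <- max.col(m, ties.method = "first") # same rule as which.max
last_max     <- max.col(m, ties.method = "last")  # last maximum per row

by_which_max  # 1 2
first_max     # 1 2
last_max      # 2 3
```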
In a language other than R, a simple loop over the elements would provide both which.max and max; R can run the same loop over whole vectors at once:
f5 = function(x)
{
    ans = rep_len(1L, length(x[[1]]))
    maxs = x[[1]]
    for(i in seq_along(x)[-1]) { # skips the loop when x has a single element
        wh = x[[i]] > maxs
        maxs[wh] = x[[i]][wh]
        ans[wh] = i
    }
    ans # or 'data.frame(i = ans, val = maxs)' for both
}
f5(information)
# [1] 2 3 2 2 3 4 2 4 1 4
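
If both the position and the maximum value are wanted, one possible sketch is to compute the positions with max.col and then pull the values out with matrix indexing. The names m, idx, and vals are only illustrative; the data are built as in the question.

```r
set.seed(101)
a <- runif(10, min = 0, max = 100)
b <- runif(10, min = 0, max = 100)
c <- runif(10, min = 0, max = 100)
d <- runif(10, min = 0, max = 100)
information <- list(a, b, c, d)

m   <- do.call(cbind, information)        # one column per vector
idx <- max.col(m, ties.method = "first")  # which vector holds each maximum

# A two-column index matrix (row, column) picks one element per row.
vals <- m[cbind(seq_len(nrow(m)), idx)]

# The extracted values agree with the parallel maximum.
identical(vals, do.call(pmax, information))

data.frame(position = idx, value = vals)
```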
It had to end with a benchmark:
set.seed(007)
dat = replicate(13, runif(1e4), FALSE)
identical(f1(dat), f2(dat))
#[1] TRUE
identical(f2(dat), f3a(dat))
#[1] TRUE
identical(f3a(dat), f3b(dat))
#[1] TRUE
identical(f3b(dat), f4(dat))
#[1] TRUE
identical(f4(dat), f5(dat))
#[1] TRUE
microbenchmark::microbenchmark(f1(dat), f2(dat), f3a(dat), f3b(dat), f4(dat), f5(dat), do.call(pmax, dat), times = 50)
#Unit: microseconds
# expr min lq mean median uq max neval cld
# f1(dat) 274995.963 298662.210 339279.948 318937.172 350822.539 723673.972 50 d
# f2(dat) 94619.397 100079.205 114664.776 107479.127 114619.439 226733.260 50 c
# f3a(dat) 19767.925 23423.688 26382.919 25795.499 29215.839 40100.656 50 b
# f3b(dat) 20351.872 22829.997 28889.845 25090.446 30503.100 140311.058 50 b
# f4(dat) 975.102 1109.431 1546.571 1169.462 1361.733 8954.100 50 a
# f5(dat) 2427.665 2470.816 5299.386 2520.755 3197.793 112986.612 50 a
# do.call(pmax, dat) 1477.618 1530.166 1627.934 1551.046 1602.898 2814.295 50 a

R-ddply function over loop variable

I need to loop over a data frame and calculate functions over the variable that is being looped.
A table example:
table<-data.frame(num1=seq(1,10,len=20), num2=seq(20,30,len=20),
char1=c(rep('a',10), rep('b',10)),
target=c(rep(1,10), rep(0,10)))
I create a list of variables:
nums<-colnames(table)[sapply(table, class)=='numeric']
nums<-nums[nums!='target']
And the table that I will populate:
planF<-data.frame(deciles=c(1), min=c(1), max=c(1), pos=c(1))
planF<-planF[-1,]
And the loop:
library(plyr)
for (i in 1:length(nums)){
  table$deciles<-dplyr::ntile(table[,nums[i]],5) # ntile() comes from dplyr, not plyr
  plan<-ddply(table, 'deciles', summarize, min=min(nums[i]),
              max=max(nums[i]),pos=sum(target))
  planF<-rbind(planF,plan)
}
I need to get the min and max of the variable for each decile. But instead I get:
deciles min max pos
1 1 num1 num1 4
2 2 num2 num2 4
3 3 <NA> <NA> 2
4 4 <NA> <NA> 0
5 5 <NA> <NA> 0
6 1 num1 num1 4
7 2 num2 num2 4
8 3 <NA> <NA> 2
9 4 <NA> <NA> 0
10 5 <NA> <NA> 0
For variable num1 I need to get the result of:
ddply(table, 'deciles', summarize, min=min(num1),
max=max(num1),pos=sum(target))
deciles min max pos
1 5.736842 7.157895 0
2 7.631579 9.052632 0
3 1.000000 10.000000 2
4 1.947368 3.368421 4
5 3.842105 5.263158 4
And below the result of doing the same with num2.
I understand that I need to introduce the variable with the following form:
num1
but the code is writing
'num1'
I tried with:
min=min(as.name(nums[i]))
But I get an error:
Error in min(as.name(nums[i])) : 'type' (symbol) not valid argument
how can I calculate a function over the variable that is being looped?

The gist of your question is to apply a list of functions over the split-apply-combine method, so here is one way you can do this in base r.
## your data
table<-data.frame(num1=seq(1,10,len=20), num2=seq(20,30,len=20),
char1=c(rep('a',10), rep('b',10)),
target=c(rep(1,10), rep(0,10)))
nums<-colnames(table)[sapply(table, class)=='numeric']
nums<-nums[nums!='target']
table$deciles <- dplyr::ntile(table[, nums[1]], 5) # ntile() comes from dplyr
FUNS <- list(min = min, max = max, mean = mean)
## split the variable num1 by deciles
## apply each function to each piece
x <- with(table, tapply(num1, deciles, function(x)
setNames(sapply(FUNS, function(y) y(x)), names(FUNS))))
## combine results
do.call('rbind', x)
# min max mean
# 1 1.000000 2.421053 1.710526
# 2 2.894737 4.315789 3.605263
# 3 4.789474 6.210526 5.500000
# 4 6.684211 8.105263 7.394737
# 5 8.578947 10.000000 9.289474
Instead of using a loop, since we have the above which works and is fairly simple, put it into a function like below
f <- function(num, data = table) {
FUNS <- list(min = min, max = max, mean = mean)
x <- tapply(data[, num], data[, 'deciles'], function(x)
setNames(sapply(FUNS, function(y) y(x)), names(FUNS)))
cbind(deciles = as.numeric(names(x)), do.call('rbind', x))
}
This way we have the method generalized so it can use any column you have with any data you have. You can call it for individual columns like
f('num1')
f('num2')
Or use a loop to get everything at once
lapply(c('num1','num2'), f)
# [[1]]
# deciles min max mean
# 1 1 1.000000 2.421053 1.710526
# 2 2 2.894737 4.315789 3.605263
# 3 3 4.789474 6.210526 5.500000
# 4 4 6.684211 8.105263 7.394737
# 5 5 8.578947 10.000000 9.289474
#
# [[2]]
# deciles min max mean
# 1 1 20.00000 21.57895 20.78947
# 2 2 22.10526 23.68421 22.89474
# 3 3 24.21053 25.78947 25.00000
# 4 4 26.31579 27.89474 27.10526
# 5 5 28.42105 30.00000 29.21053
If you don't like lapply, you can Vectorize the function to make it a little easier:
Vectorize(f, SIMPLIFY = FALSE)(c('num1', 'num2'))
Which you would more commonly use like this (SIMPLIFY = FALSE to retain the list structures)
v <- Vectorize(f, SIMPLIFY = FALSE)
v(c('num1','num1'))
# $num1
# deciles min max mean
# 1 1 1.000000 2.421053 1.710526
# 2 2 2.894737 4.315789 3.605263
# 3 3 4.789474 6.210526 5.500000
# 4 4 6.684211 8.105263 7.394737
# 5 5 8.578947 10.000000 9.289474
#
# $num1
# deciles min max mean
# 1 1 1.000000 2.421053 1.710526
# 2 2 2.894737 4.315789 3.605263
# 3 3 4.789474 6.210526 5.500000
# 4 4 6.684211 8.105263 7.394737
# 5 5 8.578947 10.000000 9.289474
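
The loop from the question can also be kept and stacked into a single table. The key point is that min(nums[i]) takes the minimum of the string 'num1', whereas dat[[col]] looks up the column of that name. Below is a base R sketch under some assumptions: the data frame is called dat here rather than table, and quintile_group is a hypothetical helper standing in for ntile(x, 5) (it gives five equal-sized groups for this data, but is not guaranteed to match ntile in every case).

```r
dat <- data.frame(num1 = seq(1, 10, length.out = 20),
                  num2 = seq(20, 30, length.out = 20),
                  target = rep(c(1, 0), each = 10))
nums <- c("num1", "num2")

# Hypothetical helper: rank-based split into five groups.
quintile_group <- function(x) ceiling(rank(x, ties.method = "first") * 5 / length(x))

planF <- do.call(rbind, lapply(nums, function(col) {
  grp <- quintile_group(dat[[col]])  # dat[[col]] fetches the column by its name
  data.frame(variable = col,
             group = sort(unique(grp)),
             min = as.vector(tapply(dat[[col]], grp, min)),
             max = as.vector(tapply(dat[[col]], grp, max)),
             pos = as.vector(tapply(dat$target, grp, sum)))
}))

planF
```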

I would strictly prefer to use dplyr for this, even though there is some ugliness in handling string variable names in the call to summarize_ (note the trailing _; the underscore verbs such as summarize_ have since been deprecated in dplyr):
library(lazyeval)
library(dplyr)
# create the data.frame
dfX = data.frame(num1=seq(1,10,len=20),
num2=seq(20,30,len=20),
char1=c(rep('a',10), rep('b',10)),
target=c(rep(1,10), rep(0,10))
)
# select the numeric columns
numericCols = names(dfX)[sapply(dfX, is.numeric)]
numericCols = setdiff(numericCols, "target")
# cycle over numeric columns, creating summary data.frames
liDFY = setNames(
lapply(
numericCols, function(x) {
# compute the quantiles
quantiles = quantile(dfX[[x]], probs = seq(0, 1, 0.2))
# create quantile membership
dfX[["quantile_membership"]] =
findInterval(dfX[[x]], vec = quantiles,
rightmost.closed = TRUE,
all.inside = TRUE)
# summarize variables by decile
dfX %>%
group_by(quantile_membership) %>%
summarize_(min = interp( ~ min(x_name), x_name = as.name(x)),
max = interp( ~ max(x_name), x_name = as.name(x)),
mean = interp( ~ mean(x_name), x_name = as.name(x)))
}),
numericCols
)
# inspect the output
liDFY[[numericCols[1]]]
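
With a recent dplyr, the same summary can probably be written without lazyeval by using the .data pronoun to refer to a column by its string name. This is a sketch, not the original answer's code: summarise_by_quintile is a made-up name, and ntile() is used for the grouping instead of findInterval(), so group boundaries may differ slightly from the version above.

```r
library(dplyr)

dfX <- data.frame(num1 = seq(1, 10, length.out = 20),
                  num2 = seq(20, 30, length.out = 20))

# Hypothetical helper: `col` is a column name given as a string.
summarise_by_quintile <- function(col) {
  dfX %>%
    mutate(quantile_membership = ntile(.data[[col]], 5)) %>%
    group_by(quantile_membership) %>%
    summarise(min  = min(.data[[col]]),
              max  = max(.data[[col]]),
              mean = mean(.data[[col]]))
}

# Named list with one summary table per column.
liDFY <- lapply(setNames(nm = c("num1", "num2")), summarise_by_quintile)
liDFY[["num1"]]
```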

Resampling from subject id's in R

Assume we have the following data
set.seed(123)
dat <- data.frame(var1=c(10,35,13,19,15,20,19), id=c(1,1,2,2,2,3,4))
(sampledIDs <- sample(min(dat$id):max(dat$id), size=3, replace=TRUE))
> [1] 2 4 2
The sampledIDs is a vector of id's that is sampled (with replacement) from dat$id.
I need the code that results in (and works also for a large dataset with more variables):
var1 id
13 2
19 2
15 2
19 4
13 2
19 2
15 2
The code dat[which(dat$id%in%sampledIDs),] does not give me what I want, since the result of this code is
var1 id
13 2
19 2
15 2
19 4
where the subject with dat$id==2 appears only once in this data (I understand why this is the result, but don't know how to get what I want). Can someone please help?
EDIT: Thank you for the answers, here the runtime of all answers (for those who are interested):
test replications elapsed relative user.self
3 dat[unlist(lapply(sampledIDs, function(x) which(x == dat$id))), ] 1000 0.67 1.000 0.64
1 dat[which(sapply(sampledIDs, "==", dat$id), arr.ind = TRUE)[, 1], ] 1000 0.67 1.000 0.67
2 do.call(rbind, split(dat, dat$id)[as.character(sampledIDs)]) 1000 1.83 2.731 1.83
4 setkey(setDT(dat), id)[J(sampledIDs)] 1000 1.33 1.985 1.33

This is probably the fastest approach for a big data set, using data.table's binary search
library(data.table)
setkey(setDT(dat), id)[J(sampledIDs)]
# var1 id
# 1: 13 2
# 2: 19 2
# 3: 15 2
# 4: 19 4
# 5: 13 2
# 6: 19 2
# 7: 15 2
Edit:
Here's a benchmark on a moderately sized data set (1e+05 rows), which shows a clear winner
library(data.table)
library(microbenchmark)
set.seed(123)
n <- 1e5
dat <- data.frame(var1 = sample(seq_len(100), n, replace = TRUE), id = sample(seq_len(10), n, replace = TRUE))
(sampledIDs <- sample(min(dat$id) : max(dat$id), size = 3, replace = TRUE))
dat2 <- copy(dat)
Sven1 <- function(dat) dat[unlist(lapply(sampledIDs, function(x) which(x == dat$id))), ]
Sven2 <- function(dat) dat[which(sapply(sampledIDs, "==", dat$id), arr.ind = TRUE)[ , 1], ]
flodel <- function(dat) do.call(rbind, split(dat, dat$id)[as.character(sampledIDs)])
David <- function(dat2) setkey(setDT(dat2), id)[J(sampledIDs)]
Res <- microbenchmark(Sven1(dat),
Sven2(dat),
flodel(dat),
David(dat2))
Res
# Unit: milliseconds
# expr min lq median uq max neval
# Sven1(dat) 4.356151 4.817557 6.715533 7.313877 45.407768 100
# Sven2(dat) 9.750984 12.385677 14.324671 16.655005 54.797096 100
# flodel(dat) 36.097602 39.680006 42.236017 44.314981 82.261879 100
# David(dat2) 1.813387 2.068749 2.154774 2.335442 8.665379 100
boxplot(Res)
If, for example, we sample not just 3 ids but, say, 10, the gap in the benchmark becomes dramatic
(sampledIDs <- sample(min(dat$id) : max(dat$id), size = 10, replace = TRUE))
[1] 7 6 10 9 5 9 5 3 7 3
# Unit: milliseconds
# expr min lq median uq max neval
# Sven1(dat) 80.124502 89.141162 97.908365 104.111738 175.40919 100
# Sven2(dat) 99.010410 127.797966 159.404395 170.751069 209.96887 100
# flodel(dat) 129.722435 144.847505 157.737362 178.242103 232.41293 100
# David(dat2) 2.431682 2.721038 2.855103 3.057796 19.60826 100

You can do:
do.call(rbind, split(dat, dat$id)[as.character(sampledIDs)])

One approach:
dat[unlist(lapply(sampledIDs, function(x) which(x == dat$id))), ]
# var1 id
# 3 13 2
# 4 19 2
# 5 15 2
# 7 19 4
# 3.1 13 2
# 4.1 19 2
# 5.1 15 2
An alternative approach:
dat[which(sapply(sampledIDs, "==", dat$id), arr.ind = TRUE)[ , 1], ]
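
A variation on the approaches above, sketched in base R: split the row numbers by id once, then look up each sampled id in that list. Repeated ids then repeat their rows. The names rows_by_id and rows are only illustrative, and sampledIDs is hard-coded to the draw shown in the question. This variant was not part of the benchmarks above.

```r
dat <- data.frame(var1 = c(10, 35, 13, 19, 15, 20, 19),
                  id   = c(1, 1, 2, 2, 2, 3, 4))
sampledIDs <- c(2, 4, 2)  # the draw shown in the question

# Row numbers belonging to each id, computed once.
rows_by_id <- split(seq_len(nrow(dat)), dat$id)

# Look up every sampled id (duplicates included) and flatten to one index vector.
rows <- unlist(rows_by_id[as.character(sampledIDs)], use.names = FALSE)

res <- dat[rows, ]
res$var1  # 13 19 15 19 13 19 15
```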

Apply a correction factor to one column based on the value of a second column

Example Data
A<-c(1,4,5,6,2,3,4,5,6,7,8,7)
B<-c(4,6,7,8,2,2,2,3,8,8,7,8)
DF<-data.frame(A,B)
What I would like to do is apply a correction factor to column A, based on the values of column B. The rules would be something like this
If B is less than 4 <- multiply A by 1
If B is greater than or equal to 4 and less than 6 <- multiply A by 2
If B is greater than or equal to 6 <- multiply A by 4
I suppose I could write an "if" statement (and I'd be glad to see a good example), but I'd also be interested in using square bracket indexing to speed things up.
The end result would look like this
A B
2 4
16 6
20 7
24 8
etc.

Use this:
within(DF, A <- ifelse(B>=6, 4, ifelse(B<4, 1, 2)) * A)
Or this (corrected by @agstudy):
within(DF, {A[B>=6] <- A[B>=6]*4; A[B>=4 & B<6] <- A[B>=4 & B<6]*2})
Benchmarking:
DF <- data.frame(A=rpois(1e4, 5), B=rpois(1e4, 5))
a <- function(DF) within(DF, A <- ifelse(B>=6, 4, ifelse(B<4, 1, 2)) * A)
b <- function(DF) within(DF, {A[B>=6] <- A[B>=6]*4; A[B>=4 & B<6] <- A[B>=4 & B<6]*2})
identical(a(DF), b(DF))
#[1] TRUE
microbenchmark(a(DF), b(DF), times=1000)
#Unit: milliseconds
# expr min lq median uq max neval
# a(DF) 8.603778 10.253799 11.07999 11.923116 53.91140 1000
# b(DF) 3.763470 3.889065 5.34851 5.480294 39.72503 1000

Similar to @Ferdinand's solution but using transform
transform(DF, newcol = ifelse(B<4, A,
ifelse(B>=6,4*A,2*A)))
A B newcol
1 1 4 2
2 4 6 16
3 5 7 20
4 6 8 24
5 2 2 2
6 3 2 3
7 4 2 4
8 5 3 5
9 6 8 24
10 7 8 28
11 8 7 32
12 7 8 28

I prefer to use findInterval as an index into a set of factors for such operations. The proliferation of nested test-conditional and consequent vectors with multiple ifelse calls offends my efficiency sensibilities:
DF$A <- DF$A * c(1,2,4)[findInterval(DF$B, c(-Inf,4,6,Inf) ) ]
DF
A B
1 2 4
2 16 6
3 20 7
4 24 8
snipped ....
Benchmark:
DF <- data.frame(A=rpois(1e4, 5), B=rpois(1e4, 5))
a <- function(DF) within(DF, A <- ifelse(B>=6, 4, ifelse(B<4, 1, 2)) * A)
b <- function(DF) within(DF, {A[B>=6] <- A[B>=6]*4; A[B>=4 & B<6] <- A[B>=4 & B<6]*2})
ccc <- function(DF) within(DF, {A <- A * c(1,2,4)[findInterval(B, c(-Inf,4,6,Inf) ) ]}) # assign to A, otherwise within() returns DF unchanged
microbenchmark(a(DF), b(DF), ccc(DF), times=1000)
#-----------
Unit: microseconds
expr min lq median uq max neval
a(DF) 7616.107 7843.6320 8105.0340 8322.5620 93549.85 1000
b(DF) 2638.507 2789.7330 2813.8540 3072.0785 92389.57 1000
ccc(DF) 604.555 662.5335 676.0645 698.8665 85375.14 1000
Note: I would not have done this using within if I were coding my own function, but thought for fairness to the earlier effort, I would make it apples <-> apples.
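
To make the findInterval lookup easier to follow, here it is step by step on the example data from the question. The intermediate names breaks, factors, bin, and corrected are only for illustration.

```r
A <- c(1, 4, 5, 6, 2, 3, 4, 5, 6, 7, 8, 7)
B <- c(4, 6, 7, 8, 2, 2, 2, 3, 8, 8, 7, 8)

breaks  <- c(-Inf, 4, 6, Inf)  # intervals [-Inf, 4), [4, 6), [6, Inf)
factors <- c(1, 2, 4)          # one correction factor per interval

bin <- findInterval(B, breaks) # interval number (1, 2 or 3) for every B
bin        # 2 3 3 3 1 1 1 1 3 3 3 3

corrected <- A * factors[bin]  # look up the factor and apply it
corrected  # 2 16 20 24 2 3 4 5 24 28 32 28

# Same result as the nested ifelse version from the earlier answer.
identical(corrected, ifelse(B >= 6, 4, ifelse(B < 4, 1, 2)) * A)
```

Adding another rule then only means adding one break and one factor, rather than another level of ifelse nesting.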
