R - traverse data.frame levels in loop or vectorially?

I'm processing a data.frame of products called "all" whose first variable all$V1 is a product family. There are several rows per product family, i.e. length(levels(all$V1)) < length(all$V1).
I want to traverse the data.frame and process by product family "p". I'm new to R, so I haven't fully grasped when I can do something vectorially, or when to loop. At the moment, I can traverse and get subsets by:
for (i in levels(all$V1)) {
  p <- all[which(all[, 'V1'] == i), ]
  calculateStuff(p)
}
Is this the way to do this, or is there a groovy vectorial way of doing this with apply or something? There are only a few thousand rows, so the performance gain is probably negligible, but I'd like to develop good habits for larger data sets.

Data:
all = data.frame(V1=c("a","b","c","d","d","c","c","b","a","a","a","d"))
'all' can be split by V1:
> ll = split(all, all$V1)
> ll
$a
V1
1 a
9 a
10 a
11 a
$b
V1
2 b
8 b
$c
V1
3 c
6 c
7 c
$d
V1
4 d
5 d
12 d
sapply can be used to analyze each component of the list 'll'. The following finds the number of rows in each component (each of which represents a product family):
calculateStuff <- function(p) {
  nrow(p)
}
> sapply(ll, calculateStuff)
a b c d
4 2 3 3
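For this particular counting task, tapply gives the same result in one step, without building the list first (a sketch for the count case only):
> tapply(all$V1, all$V1, length)
a b c d
4 2 3 3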

There is unlikely to be much of a performance gain with the below, but it is at least more compact and returns the results of calculateStuff as a convenient list:
lapply(levels(all$V1), function(i) calculateStuff(all[all$V1 == i, ]) )
As @SimonG points out in his comment, depending on exactly what your calculateStuff function does, aggregate may also be useful if you want your results in the form of a data frame.
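For instance, if calculateStuff just counts rows as above, a minimal aggregate sketch looks like this (the column name family is merely a label chosen for this example):
aggregate(all$V1, by = list(family = all$V1), FUN = length)
#   family x
# 1      a 4
# 2      b 2
# 3      c 3
# 4      d 3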


Split column into vectors by group R - independent of column order

Edit
This question seems to be a duplicate of How to group a vector into a list of vectors?, where the answer split(df$b, df$id) was suggested. Although initially happy with that solution, I realized that the given answers do not fully address my question: below, I would like to obtain a list in which the vector elements are named with the values of a third column (in my example df$a). This is important, as otherwise the order of df$b plays a role. Obviously I could arrange by df$a and then call split(), but maybe there is another way.
My sample df:
require(dplyr)  # data_frame and %>% come from dplyr
df <- data_frame(id = paste0('id', rep(1:2, each = 5)), a = rep(letters[1:5], 2), b = c(1:5, 5:1))
df should be grouped by id (df$id), and for each group I would like a vector containing the values of df$b. My approach:
require(tidyr)
spread_df <- df %>% spread(id, b)  # makes a new column for each id
# loop over spread_df
list_group_elements <- list()  # initialize the result list
for (i in 1:length(spread_df)) {
  list_group_elements[i] <- list(spread_df[[i]])
  # I want each vector to be named by the identifiers of column df$a, therefore:
  names(list_group_elements[[i]]) <- list_group_elements[[1]]
}
This results in:
list_group_elements
[[1]]
a b c d e
"a" "b" "c" "d" "e"
[[2]]
a b c d e
1 2 3 4 5
[[3]]
a b c d e
5 4 3 2 1
I don't need the first element of the list, but the rest is basically what I need. I have the impression that my approach is not ideal, though, and if someone has an idea for improving it (e.g., with dplyr?), that would be highly appreciated. Why do I want this: I made a function that takes vectors as arguments, and I would like to run this function over certain columns of data frames, using only the grouped values as arguments rather than the entire column.
You may make df$b a named vector using setNames, and then split it into a list:
split(setNames(df$b, df$a), df$id)
# $id1
# a b c d e
# 1 2 3 4 5
#
# $id2
# a b c d e
# 5 4 3 2 1
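The grouped, named vectors can then be fed one at a time to the vector-valued function mentioned at the end of the question, for example (myFun is a placeholder for that function):
results <- lapply(split(setNames(df$b, df$a), df$id), myFun)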
One way, using unique rather than levels (df$id here is a character column, so levels(df$id) would return NULL), is
lapply(unique(df$id), function(L) df$b[df$id == L])
[[1]]
[1] 1 2 3 4 5
[[2]]
[1] 5 4 3 2 1
Consider by, an object-oriented wrapper of tapply designed to split a data frame by factor(s):
by(df, df$id, FUN=function(i) i$b)
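If the names from df$a are wanted here as well, a small (untested) variant of the same idea is:
by(df, df$id, FUN = function(i) setNames(i$b, i$a))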

Methods to exhaustively partition a vector into pairs in R

(This is inspired by another question marked as a duplicate. I think it is an interesting problem, though perhaps there is an easy solution from combinatorics, about which I am very ignorant.)
Problem
For a vector of length n, where n mod 2 is zero, find all possible ways to partition all elements of the vector into pairs, without replacement, where order does not matter.
For example, for a vector c(1,2,3,4):
list(c(1,2), c(3,4))
list(c(1,3), c(2,4))
list(c(1,4), c(2,3))
My approach has been the following (apologies in advance for novice code):
# Write a function that recursively breaks down a list of unique pairs
# (generated with combn). The natural ordering produced by combn means that
# on the first pass we take as the starting pair all pairings of element 1
# of the vector with all other elements. After that has been allocated, we
# iterate through the first p/2 pairs (this avoids duplicating).
pairer2 <- function(kn, pair_list) {
  pair1_partners <- lapply(kn, function(x) {
    # remove any pairs in the 'master list' that contain elements of the starting pair
    partners <- Filter(function(t) !any(t %in% x), pair_list)
    if (length(partners) > 1) {
      # run the function again
      pairer2(kn = partners[1:(length(partners)/2)], partners)
    } else {
      return(partners)
    }
  })
  # accumulate results into a nested list structure
  return(mapply(function(x, y) list(root = x, partners = y),
                kn, pair1_partners, SIMPLIFY = FALSE))
}
# This function generates all possible unique pairs for a vector k as the
# starting point, then runs the pairing-off function above.
pair_combn <- function(k, n = 2) {
  p <- combn(k, n, simplify = FALSE)
  pairer2(kn = p[1:(length(k) - 1)], p)
}
# so, for a vector k = 1:4:
pair_combn(1:4)
[[1]]
[[1]]$root
[1] 1 2
[[1]]$partners
[[1]]$partners[[1]]
[1] 3 4
[[2]]
[[2]]$root
[1] 1 3
[[2]]$partners
[[2]]$partners[[1]]
[1] 2 4
[[3]]
[[3]]$root
[1] 1 4
[[3]]$partners
[[3]]$partners[[1]]
[1] 2 3
It also works for larger k as far as I can tell. This isn't that efficient though, possibly because Filter is slow for large lists, and I have to confess I can't collapse the nested lists (which are a tree representation of the possible solutions) into a flat list of partitionings. It feels like there should be a more elegant solution (in R)?
Mind you, it is interesting that this recursive approach generates a parsimonious (albeit inconvenient) representation of the possible solutions.
Here is one way:
> x <- c(1,2,3,4)
> xc <- combn(as.data.frame(combn(x, 2)), 2, simplify = FALSE)
> Filter(function(x) all(1:4 %in% unlist(x)), xc)
[[1]]
V1 V6
1 1 3
2 2 4
[[2]]
V2 V5
1 1 2
2 3 4
[[3]]
V3 V4
1 1 2
2 4 3
More generally:
pair_combn <- function(x) {
  Filter(function(e) all(unique(x) %in% unlist(e)),
         combn(as.data.frame(combn(x, 2)),
               length(x) / 2, simplify = FALSE))
}
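As a quick sanity check, the number of pairings of a vector of length n is the double factorial (n - 1)!!, so for n = 6 we expect 5 * 3 * 1 = 15. Note, however, that Filter scans all choose(choose(6, 2), 3) = 455 candidate column sets to find them, which is why this approach scales poorly:
length(pair_combn(1:6))
# [1] 15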

Parallel processing for multiple nested for loops

I am trying to run simulation scenarios which in turn should provide me with the best scenario for a given date, back-tested over a couple of months. The input for a specific scenario has 4 variables, each of which can be in 5 states (5^4 = 625 combinations). The flow of the model is as follows:
Simulate the 625 scenarios to get each of their profits
Rank each of the scenarios according to its profit
Repeat the process through a 1-day expanding window over the last 2 months, starting on 1 Dec 2015, creating a time series of ranks for each of the 625 scenarios
The unfortunate result of this is 5 nested for loops, which can take extremely long to run. I had a look at the foreach package, but I am concerned about how the combining of the outputs will work in my scenario.
The current code that I am using works as follows. First I create the possible states of each of the inputs, along with the window:
a <- seq(as.Date("2015-12-01", "%Y-%m-%d"), as.Date(Sys.Date() - 1, "%Y-%m-%d"), by = "day")
# input variables
b <- seq(1, 5, 1)
c <- seq(1, 5, 1)
d <- seq(1, 5, 1)
e <- seq(1, 5, 1)
set.seed(3142)
tot_results <- NULL
Next the nested for loops proceed to run through the simulations for me.
for (i in 1:length(a))
{
  cat(paste0("\n", "Current estimation date: ", a[i]), "; iteration:", i, "\n")
  # subset data for backtesting
  dataset_calc <- dataset[which(dataset$Date <= a[i]), ]
  p <- 1
  results <- data.frame(rep(NA, 625))
  for (j in 1:length(b))
  {
    for (k in 1:length(c))
    {
      for (l in 1:length(d))
      {
        for (m in 1:length(e))
        {
          if (i == 1)
          {
            # create a unique ID to merge onto later
            unique_ID <- paste0(replicate(1, paste(sample(LETTERS, 5, replace = TRUE), collapse = "")), round(runif(n = 1, min = 1, max = 1000000)))
          }
          # run the profit calculation
          post_sim_results <- profit_calc(dataset_calc, param1 = e[m], param2 = d[l], param3 = c[k], param4 = b[j])
          # extract the final profit amount
          profit <- round(post_sim_results[nrow(post_sim_results), ], 2)
          results[p, ] <- data.frame(unique_ID, profit)
          p <- p + 1
        }
      }
    }
  }
  # extract the ranks for all scenarios
  rank <- rank(results$profit)
  # bind the ranks for the expanding window
  if (i == 1)
  {
    tot_results <- data.frame(ID = results[, 1], rank)
  } else {
    tot_results <- cbind(tot_results, rank)
  }
  suppressMessages(gc())
}
My biggest concern is the binding of the results, given that the outer loop's actions depend on the output of the inner loops.
Any advice on how to proceed would be greatly appreciated.
So I think that you can vectorize most of this, which should give a big reduction in run time.
Currently, you use for-loops (5, to be exact) to create every combination of values, and then run the values one by one through profit_calc (a function that is not specified). Ideally, you'd just take all possible combinations in one go and push them through profit_calc in one single operation.
-- Rationale --
a <- 1:10
b <- 1:10
d <- rep(NA,10)
for (i in seq(a)) d[i] <- a[i] * b[i]
d
# [1] 1 4 9 16 25 36 49 64 81 100
Since * also works on vectors, we can rewrite this to:
a <- 1:10
b <- 1:10
d <- a*b
d
# [1] 1 4 9 16 25 36 49 64 81 100
While it may save us only one line of code, it actually reduces the problem from 10 steps to 1 step.
-- Application --
So how does that apply to your code? Well, given that we can vectorize profit_calc, you can generate a data frame in which each row is one possible combination of your parameters. We can do this with expand.grid:
foo <- expand.grid(b,c,d,e)
head(foo)
# Var1 Var2 Var3 Var4
# 1 1 1 1 1
# 2 2 1 1 1
# 3 3 1 1 1
# 4 4 1 1 1
# 5 5 1 1 1
# 6 1 2 1 1
Let's say we have a formula... (a - b) * (c + d)... Then it would work like:
bar <- (foo[,1] - foo[,2]) * (foo[,3] + foo[,4])
head(bar)
# [1] 0 2 4 6 8 -2
So basically, try to find a way to replace the for-loops with vectorized alternatives. If you cannot vectorize something, look into apply instead, as that can also save you some time in most cases. If your code runs too slowly, you'd ideally first see whether you can write a more efficient script. You may also be interested in the microbenchmark package, or ?system.time.
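If profit_calc turns out to be hard to vectorize internally, a hedged middle ground (reusing profit_calc, dataset_calc, and the grid foo from above) is to map it over the rows of the parameter grid, which at least removes the bookkeeping of five nested loops:
profits <- mapply(function(p1, p2, p3, p4) {
  res <- profit_calc(dataset_calc, param1 = p1, param2 = p2,
                     param3 = p3, param4 = p4)
  round(res[nrow(res), ], 2)  # final profit, as in the original loop
}, foo$Var4, foo$Var3, foo$Var2, foo$Var1)
On a Unix-alike, parallel::mcmapply would be a drop-in substitute if parallelism is still wanted after vectorizing.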

R: Make new vector correlating to a specified level in a factor

I'm stuck with a really easy task and hope someone can help me with it.
I'd like to make a new (sub)vector from an existing vector, based on 1 level of a factor.
Example:
v = c(1,2,3,4,5,6,7,8,9,10)
f = factor(rep(c("Drug","Placebo"),5))
I want to make new vectors from v, containing only the "Drug" values or only the "Placebo" values, resulting in:
vDrug = 1,3,5,7,9
vPlacebo = 2,4,6,8,10
Thanks in advance!
You can easily subset v by f:
v[ f == "Drug" ]
[1] 1 3 5 7 9
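If you want both subsets in one go, split returns a named list:
split(v, f)
$Drug
[1] 1 3 5 7 9

$Placebo
[1] 2 4 6 8 10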
However, this kind of free-floating subsetting can become error-prone in a more complex environment or with larger data sets. Accordingly, it would be better to store v and f together in a data.frame and then perform all kinds of queries and transformations on that data.frame:
mdf <- data.frame( v = c(1,2,3,4,5,6,7,8,9,10), f = factor(rep(c("Drug","Placebo"),5)) )
mdf
v f
1 1 Drug
2 2 Placebo
3 3 Drug
4 4 Placebo
...
If you want to look at your data interactively, you can subset using the subset function:
subset( mdf, f == "Drug", select=v )
If you are doing this programmatically, you should rather use
mdf[ mdf$f == "Drug", "v" ]
For the difference between the two, have a look at: Why is `[` better than `subset`?

Elements within lists

I'm relatively new in R (~3 months), and so I'm just getting the hang of all the different data types. While lists are a super useful way of holding dissimilar data all in one place, they are also extremely inflexible for function calls, and riddle me with angst.
For the work I'm doing, I often use lists because I need to hold a bunch of vectors of different lengths. For example, I'm tracking performance statistics of about 10,000 different vehicles, and there are certain vehicles which are so similar they can essentially be treated as the same vehicle for certain analyses.
So let's say we have this list of vehicle ID's:
List <- list(a=1, b=c(2,3,4), c=5)
For simplicity's sake.
I want to do two things:
1. Tell me which element of a list a particular vehicle is in. So when I tell R I'm working with vehicle 2, it should tell me b or [2]. I feel like it should be something simple, like how you can do
match(3, b)
# [1] 2
2. Convert it into a data frame or something similar so that it can be saved as a CSV. Unused rows could be blank or NA. What I've had to do so far is:
for (i in 1:length(List)) {
  length(List[[i]]) <- max(as.numeric(as.matrix(summary(List)[, 1])))
}
DF <- as.data.frame(List)
Which seems dumb.
For your first question:
which(sapply(List, `%in%`, x = 3))
# b
# 2
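The same lookup extends to several vehicles at once, for example (a small sketch against the List above):
vehicles <- c(2, 5)
sapply(vehicles, function(v) names(which(sapply(List, `%in%`, x = v))))
# [1] "b" "c"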
For your second question, you could use a function like this one:
list.to.df <- function(arg.list) {
  max.len <- max(sapply(arg.list, length))
  arg.list <- lapply(arg.list, `length<-`, max.len)
  as.data.frame(arg.list)
}
list.to.df(List)
# a b c
# 1 1 2 5
# 2 NA 3 NA
# 3 NA 4 NA
Both of those tasks (and many others) would become much easier if you were to "flatten" your data into a data.frame. Here's one way to do that:
fun <- function(X)
  data.frame(element = X, vehicle = List[[X]], stringsAsFactors = FALSE)
df <- do.call(rbind, lapply(names(List), fun))
# element vehicle
# 1 a 1
# 2 b 2
# 3 b 3
# 4 b 4
# 5 c 5
With a data.frame in hand, here's how you could perform your two tasks:
## Task #1
with(df, element[match(3, vehicle)])
# [1] "b"
## Task #2
write.csv(df, file = "outfile.csv")
