Fast method for combining list elements based on criteria in R

I'm building a little function in R that takes size measurements from several species and several sites, combines all the data by site (lumping many species together), and then computes some statistics on those combined data.
Here is some simplistic sample data:
SiteID <- rep(c("D00002", "D00003", "D00004"), c(5, 2, 3))
SpeciesID <- c("CHIL", "CHIP", "GAM", "NZMS", "LUMB", "CHIL", "SIMA", "CHIP", "CHIL", "NZMS")
Counts <- data.frame(matrix(sample(0:99,200, replace = TRUE), nrow = 10, ncol = 20))
colnames(Counts) <- paste0('B', 1:20)
spec <- cbind(SiteID, SpeciesID, Counts)
stat1 <- data.frame(unique(SiteID))
colnames(stat1) <- 'SiteID'
stat1$Mean <- NA
Here is the function, which creates a list, lsize1, where each list element is a vector of the sizes (B1 to B20) for a given SpeciesID in a given SiteID, multiplied by the number of counts for each size class. From this, the function creates a list, lsize2, which combines list elements from lsize1 that have the same SiteID. Finally, it gets the mean of each element in lsize2 (i.e., the average size of an individual for each SiteID, regardless of SpeciesID), and outputs that as a result.
fsize <- function(){
  specB <- spec[, 3:22]
  lsize1 <- apply(specB, 1, function(x) rep(1:20, x))
  names(lsize1) <- spec$SiteID
  lsize2 <- sapply(unique(names(lsize1)),
                   function(x) unlist(lsize1[names(lsize1) == x], use.names = FALSE),
                   simplify = FALSE)
  stat1[stat1$SiteID %in% names(lsize2), 'Mean'] <- round(sapply(lsize2, mean), 2)
  return(stat1)
}
In creating this function, I followed the suggestion here: combine list elements based on element names, which gets at the crux of my problem: combining list elements based on some criterion they have in common (in my case, combining all elements from the same SiteID). The function works as intended, but my question is whether there's a way to make it substantially faster.
Note: for my actual data set, which is ~40,000 rows in length, I find that the function runs in ~ 0.7 seconds, with the most time consuming step being the creation of lsize2 (~ 0.5 seconds). I need to run this function many, many times, with different permutations and subsets of the data, so I'm hoping there's a way to cut this processing time down significantly.

There shouldn't be any need for loops here. Here's one attempt:
tmp <- data.frame(spec["SiteID"], sums = rowSums(specB * col(specB)), counts=rowSums(specB) )
tmp <- aggregate(. ~ SiteID, tmp, sum)
tmp$avg <- tmp$sums / tmp$counts
tmp
# SiteID sums counts avg
#1 D00002 46254 4549 10.16795
#2 D00003 20327 1810 11.23039
#3 D00004 29651 2889 10.26341
Compare:
fsize()
# SiteID Mean
#1 D00002 10.17
#2 D00003 11.23
#3 D00004 10.26
This code essentially multiplies each value by its index (col(specB)), then aggregates the sums and counts by SiteID. This logic should be relatively transferable to other methods (data.table/dplyr) as well. E.g., in data.table:
setDT(spec)
spec[, .(avg = sum(.SD * col(.SD)) / sum(unlist(.SD))), by=SiteID, .SDcols=B1:B20]
# SiteID avg
#1: D00002 10.16795
#2: D00003 11.23039
#3: D00004 10.26341
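The logic carries over to dplyr as well. Here is a rough sketch of that (my addition, not from the original answer, and untested), reusing specB from above and assuming spec is still a plain data frame (i.e., run it before the setDT(spec) line):
library(dplyr)

spec %>%
  mutate(sums   = rowSums(specB * col(specB)),  # size-class index times count, summed per row
         counts = rowSums(specB)) %>%           # total individuals per row
  group_by(SiteID) %>%
  summarise(avg = sum(sums) / sum(counts))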

Related

Issue with tapply

I am using tapply to summarize a table by Sample ID (SID). The first sample on the list has 3 measurements, but it appears as only one.
I need to pass four things to the new table: first, the SID; second, the mean of the Areas for all measurements with that SID; third, the Distances; and finally, the number of measurements.
cases_iTLS <- data.frame(unique(iTLS$SID))
colnames(cases_iTLS)[colnames(cases_iTLS)=="unique.iTLS.SID."] <- "SID"
cases_iTLS$SID <- factor(cases_iTLS$SID)
# Average of TLS on one slide for area
cases_iTLS$Area_iTLS <- tapply(iTLS$Area, iTLS$SID,FUN=mean)
# Average of TLS on one slide for distance
cases_iTLS$Distance_iTLS <- tapply(iTLS$Distance, iTLS$SID,FUN=mean)
# Number of measurements per SID
cases_iTLS$Count_iTLS <- tapply(iTLS$Region_Index, iTLS$SID,FUN=length)
   SID  Region_Index      Area  Distance  Type  Location
112906             1  53531.53    71.982  iTLS  intratumoral
112906             3  76809.61    97.384  iTLS  intratumoral
112906             5  40937.30     9.643  iTLS  intratumoral
112947             1  35071.66     2.067  iTLS  intratumoral
112947             3  17979.88    36.319  iTLS
Because you need to run different aggregation functions (mean and length) across multiple columns (Area, Distance, and Region_Index), consider using aggregate, which performs grouped aggregation and returns a data frame.
tapply usually runs on a single numeric column, not across several columns or functions, and returns a single named atomic vector. The code below also calls do.call + data.frame to bind the nested (matrix) columns produced by aggregating with a multi-valued function.
aggregate
# AGGREGATE ACROSS COLS AND FUNCS
# AGGREGATE ACROSS COLS AND FUNCS
cases_iTLS <- aggregate(cbind(Area, Distance, Region_Index) ~ SID, iTLS,
                        function(x) c(mean = mean(x), count = length(x)))
# BIND NESTED, UNDERLYING RESULTS (each aggregated column is a two-column matrix)
cases_iTLS <- do.call(data.frame, cases_iTLS)
# KEEP NEEDED COLUMNS
cases_iTLS <- cases_iTLS[c("SID", "Area.mean", "Distance.mean", "Region_Index.count")]
tapply
Should you want to go the tapply route, consider building a matrix of your separate aggregations with rbind and a transpose, t():
cases_iTL_mat <- with(iTLS,
t(rbind(Area_mean = tapply(Area, SID, FUN=mean) ,
Distance_mean = tapply(Distance, SID, FUN=mean),
Region_count = tapply(Region_Index, SID, FUN=length)
))
)
by
And I would be remiss not to point out by (the object-oriented wrapper to tapply):
cases_iTL_mat <- do.call(rbind,
by(iTLS, iTLS$SID, function(sub) {
c(Area_mean = mean(sub$Area),
Distance_mean = mean(sub$Distance),
Region_count = length(sub$Region_Index))
})
)

equivalent of melt+reshape that splits on column names

Question:
How do I do the following "partial melt" in a way that memory can support?
Details:
I have a few million rows and around 1000 columns. The names of the columns have 2 pieces of information in them.
Normally I would melt to a data frame (or table) comprised of a pair of columns, then I would split on the variable name to create two new columns, then I would cast using one of the new splits for new column names and one for row names.
This isn't working: melted, the data runs to a billion or so rows, and the additional columns overwhelm my memory.
Outside the "iterative force" (as opposed to brute force) of a for-loop, is there a clean and effective way to do this?
Thoughts:
- this is a little like melt-colsplit-cast
- libraries common for this seem to be "dplyr", "tidyr", "reshape2", and "data.table"
- tidyr's gather + separate + spread looks good, but doesn't like not having a unique row identifier
- reshape2's dcast (I'm looking for 2D output) wants to aggregate
- brute force loses the labels. By brute force I mean df <- rbind(df[, block1], ...), where block1 is the first 200 column indices, block2 is the second, etcetera.
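For concreteness (this sketch is not part of the original post), the melt / split-on-name / cast workflow described above might look roughly like this in data.table, using the dummy df from the Update below with Nrow and Ncol scaled down; the long intermediate table it builds is exactly what overwhelms memory at full size:
library(data.table)
setDT(df)

# melt everything to (variable, value), remembering the original row
long <- melt(df, measure.vars = names(df), variable.factor = FALSE)
long[, `:=`(grp = substr(variable, 1, 1),                      # letter part of the name
            idx = substr(variable, 2, 4),                      # the "001".."200" part
            row = rep(seq_len(nrow(df)), times = ncol(df)))]   # original row id

# cast back: one row per (group, original row), with columns 001..200
wide <- dcast(long, grp + row ~ idx, value.var = "value")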
Update (dummy code):
#libraries
library(stringr)
#reproducibility
set.seed(56873504)
#geometry
Ncol <- 2e3
Nrow <- 1e6
#column names
namelist <- character(length = Ncol)
for (i in 1:(Ncol/200)) {
  col_idx <- 1:200 + 200*(i - 1)
  if (i < 26) {
    namelist[col_idx] <- paste0(intToUtf8(64 + i), str_pad(string = 1:200, width = 3, pad = "0"))
  } else {
    namelist[col_idx] <- paste0(intToUtf8(96 + i), str_pad(string = 1:200, width = 3, pad = "0"))
  }
}
#random data
df <- as.data.frame(matrix(runif(n=Nrow*Ncol,min=0, max=16384),nrow=Nrow,ncol=Ncol))
names(df) <- namelist
The output that I would be looking for would have a column with the first character of the current name (single alphabet character) and colnames would be 1 to 200. It would be much less wide than "df" but not fully melted. It would also not kill my cpu or memory.
(Ugly/Manual) Brute force version:
(working on it... )
Here are two options both using data.table.
If you know that each column string always has 200 (or n) fields associated with it (i.e., A001 - A200), you can use melt() and make a list of measurement variables.
melt(dt
, measure.vars = lapply(seq_len(Ncol_p_grp), seq.int, to = Ncol_p_grp * n_grp, by = Ncol_p_grp)
, value.name = as.character(seq_len(Ncol_p_grp))
)[, variable := rep(namelist_letters, each = Nrow)][]
#this data set used Ncol_p_grp <- 5 to help condense the data.
variable 1 2 3 4 5
1: A 0.2655087 0.06471249 0.2106027 0.41530902 0.59303088
2: A 0.3721239 0.67661240 0.1147864 0.14097138 0.55288322
3: A 0.5728534 0.73537169 0.1453641 0.45750426 0.59670404
4: A 0.9082078 0.11129967 0.3099322 0.80301300 0.39263068
5: A 0.2016819 0.04665462 0.1502421 0.32111280 0.26037592
---
259996: Z 0.5215874 0.78318812 0.7857528 0.61409610 0.67813484
259997: Z 0.6841282 0.99271480 0.7106837 0.82174887 0.92676493
259998: Z 0.1698301 0.70759513 0.5345685 0.09007727 0.77255570
259999: Z 0.2190295 0.14661878 0.1041779 0.96782695 0.99447460
260000: Z 0.4364768 0.06679642 0.6148842 0.91976255 0.08949571
Alternatively, we can use rbindlist(lapply(...)) to go through the data set and subset it based on the letter within the columns.
rbindlist(
lapply(namelist_letters,
function(x) setnames(
dt[, grep(x, names(dt), value = T), with = F]
, as.character(seq_len(Ncol_p_grp)))
)
, idcol = 'ID'
, use.names = F)[, ID := rep(namelist_letters, each = Nrow)][]
With 78 million elements in this dataset, it takes around a quarter of a second. I tried to up it to 780 million, but I just don't really have the RAM to generate the data that quickly in the first place.
#78 million elements - 10,000 rows * 26 grps * 200 cols_per_group
Unit: milliseconds
             expr      min       lq     mean   median       uq      max neval
      melt_option 134.0395 135.5959 137.3480 137.1523 139.0022 140.8521     3
 rbindlist_option 290.2455 323.4414 350.1658 356.6373 380.1260 403.6147     3
Data: Run this before everything above:
#packages ----
library(data.table)
library(stringr)
#data info
Nrow <- 10000
Ncol_p_grp <- 200
n_grp <- 26
#generate data
set.seed(1)
dt <- data.table(replicate(Ncol_p_grp * n_grp, runif(n = Nrow)))
names(dt) <- paste0(rep(LETTERS[1:n_grp], each = Ncol_p_grp)
, str_pad(rep(seq_len(Ncol_p_grp), n_grp), width = 3, pad = '0'))
#first letter
namelist_letters <- unique(substr(names(dt), 1, 1))

Aggregating multiple dependent measures

I need to aggregate over a number of dependent measures (DMs) in R. I found the following discussion here quite useful:
Aggregate / summarize multiple variables per group (i.e. sum, mean, etc)
Based on this, the code below basically does what I need. It gets quite lengthy, however, as the number of DMs increases (I have many DMs):
aggregate(cbind(DM1, DM2, DM3, DM4, DM5 ... DMn) ~ F1 + F2 + F3,
          data = sst2, mean, na.rm = TRUE)
I was therefore wondering if there was a more efficient way of writing the DMs, without having to individually type every one of them. Most DMs of interest are next to one another (i.e. DM3, DM4, DM5 etc.), so I was thinking of using something along the lines of cbind(DM1, DM3:DM10, DM14), but this doesn't seem to work. I also tried generating a list of the relevant column names. Unfortunately this didn't work either:
pr<-colnames(sst2)
pr2<-pr[pr!="DM2" & pr!="DM11" & pr!="DM12" & pr!="DM13"]
pr3<-noquote(paste(pr2,collapse=","))
pp<-aggregate(cbind(pr3) ~ F1 + F2 +
F3, data = sst2, mean, na.rm=TRUE)
Any suggestions on how to efficiently include a large number of DMs in the aggregate function (or other related functions such as ddply) would be much appreciated.
I believe that this should work
sst2 <- data.frame(F1=c("A","A","B","B","C","C"),
F2=c("A","A","A","B","B","B"),
F3=c("D","D","D","D","D","D"),
DM1=c(5,6,21,61,2,3),
DM2=c(1,5,3,6,1,6),
DM3=c(1,7,9,1,4,44))
n = 3 # number of DM columns
m = 2 # number of F columns
DM <- paste0("DM", 1:n)
attach(sst2)
# use sapply(DM,get) but this produces separate columns
tmp <- aggregate(sapply(DM, get) ~ F1 + F2,
data = sst2, mean, na.rm=TRUE)
detach(sst2)
# combine these separate columns. The apply is to each row of tmp
data.frame(F1 = tmp$F1, F2 = tmp$F2,
DM = apply(tmp[(m+1):(n+length(DM)-1)], 1, mean))
# F1 F2 DM
# 1 A A 4.166667
# 2 B A 11.000000
# 3 B B 22.666667
# 4 C B 10.000000
Edit
If your variable names are different, the only line that would need to change is
DM <- c("mean.go.RT", "mean.SRT", "mean.SSD", "SSRT")
If these variables are in your data frame, you could easily get them with
DM <- names(sst2)[4:6]
or whatever other columns (i.e. instead of 4-6) that you want
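Not part of the original answer, but here are two base-R ways to avoid attach()/get() entirely, assuming DM holds the measure column names and F1/F2 are the grouping columns as above:
# 1) build the formula from the column names
form <- as.formula(paste("cbind(", paste(DM, collapse = ", "), ") ~ F1 + F2"))
aggregate(form, data = sst2, FUN = mean, na.rm = TRUE)

# 2) skip the formula interface: x is the measure columns, by is the grouping columns
aggregate(sst2[DM], by = sst2[c("F1", "F2")], FUN = mean, na.rm = TRUE)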
An alternative solution using select, ddply and numcolwise:
library(dplyr)
library(plyr)
sst21 <- data.frame(F1=c("A","A","B","B","C","C"),
F2=c("A","A","A","B","B","B"),
F3=c("D","D","D","D","D","D"),
DM1=c(5,6,21,61,2,3),
DM2=c(1,5,3,6,1,6),
DM3=c(1,7,9,1,4,44),
DM4=c(2,3,6,7,2,33),
DM5=c(44,55,66,77,55,88))
sel1 <- dplyr::select(sst21, starts_with("F"), .data$DM1 : .data$DM3, .data$DM5) # select columns of interest
sel1 <- dplyr::select(sst21, -c(.data$DM4)) # Alternative: specifying columns to be excluded
sst22 <- plyr::ddply(sel1, .(F1, F2, F3), plyr::numcolwise(mean, na.rm = TRUE)) # Aggregate selected data
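As a further variant (my addition, not in the original answer), the same aggregation can be written with dplyr alone; this assumes dplyr >= 1.0 for across(), and keeps the dplyr:: prefixes because plyr is attached after dplyr and masks summarise():
sst23 <- sst21 %>%
  dplyr::select(starts_with("F"), DM1:DM3, DM5) %>%
  dplyr::group_by(F1, F2, F3) %>%
  dplyr::summarise(dplyr::across(everything(), function(x) mean(x, na.rm = TRUE)),
                   .groups = "drop")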

Very quickly sequentially subsetting a large data frame (or data table) on a numeric position column via a large vector of numeric positions

I want to do this much faster:
library(data.table)

set.seed(100)
pos <- sample(1:100000000, 10000000, replace=F)
df <- data.table(pos, name="arbitrary_string")
query <- sample(1:100000000, 10000, replace=F)
df_list <- lapply(query, function(x) subset(df, pos >= x - 10000 & pos <= x + 10000))
output <- rbindlist(df_list)
So basically, I'm looping through a vector of positions X and extracting every row from a data frame that has a number in the "pos" column falling within the range [X - 10000, X + 10000]. I expect some rows will be represented multiple times in "output"; this is desirable. The ordering doesn't need to be the same as returned by the code posted in this question.
This toy example is based on a much larger set of data that I estimate will take ~10,000 hours to run on a single core as programmed above. It is therefore valuable to me to have a radically faster solution to this problem. I'd like a pure R solution, but I'm open to solutions that involve other languages.
In this solution, I'm going to assume the row ordering doesn't matter (unfortunately it won't work if you need the exact row ordering you have in your original post). The approach I propose is:
1. Determine the number of times you need elements from each range within the pos variable using cumsum.
2. Determine the range each element of df$pos falls into using a single call to the cut function.
3. Grab each row the appropriate number of times, subsetting df only once.
This approach cuts down on the number of times you scan df and grab a subset, which should yield a significant speedup. Let's start with a reproducible example:
library(data.table)
set.seed(144)
pos <- sample(1:100000000, 10000000, replace=F)
df <- data.table(pos, name="arbitrary_string")
query <- c(100000, 101000, 200000)
Now, let's determine the ranges and number of times we need rows from each range:
query.cut <- rbind(data.frame(x=query-10000, y=1), data.frame(x=query+10001, y=-1))
query.cut <- query.cut[order(query.cut$x),]
query.cut$y <- cumsum(query.cut$y)
query.cut
# x y
# 1 90000 1
# 2 91000 2
# 4 110001 1
# 5 111001 0
# 3 190000 1
# 6 210001 0
We will take rows with pos value 90000-90999 once, rows with pos value 91000-110000 twice, rows with pos value 110001-111000 once, and rows with pos value 190000-210000 once.
To determine which range an element falls in, we can use the cut function, looking up the relevant number of replications in our query.cut table:
num.rep <- query.cut$y[as.numeric(cut(df$pos, query.cut$x))]
num.rep[is.na(num.rep)] <- 0
table(num.rep)
# num.rep
# 0 1 2
# 9995969 2137 1894
For our current small query, almost all rows are never taken at all. The last step is to grab each row the appropriate number of times.
output <- df[rep(1:nrow(df), times=num.rep),]
We can get a pretty solid speedup even with a relatively small set of queries (300 here):
OP <- function(query) {
  df_list <- lapply(query, function(x) subset(df, pos >= x - 10000 & pos <= x + 10000))
  rbindlist(df_list)
}
josilber <- function(query) {
  query.cut <- rbind(data.frame(x=query-10000, y=1), data.frame(x=query+10001, y=-1))
  query.cut <- query.cut[order(query.cut$x),]
  query.cut$y <- cumsum(query.cut$y)
  query.cut <- query.cut[!duplicated(query.cut$x, fromLast=T),]
  num.rep <- query.cut$y[as.numeric(cut(df$pos, query.cut$x))]
  num.rep[is.na(num.rep)] <- 0
  df[rep(1:nrow(df), times=num.rep),]
}
set.seed(144)
big.query <- sample(df$pos, 300)
system.time(OP(big.query))
# user system elapsed
# 196.693 17.824 217.141
system.time(josilber(big.query))
# user system elapsed
# 3.418 0.124 3.673
As the size of the query set grows, the advantage of the new approach should get larger, because it still makes just one pass through df$pos while the original approach makes one pass for each element in query (i.e., the new approach is asymptotically faster).
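Another direction (my addition, not benchmarked above and not part of the original answer): since df is already a data.table, an interval join with data.table's foverlaps() also makes a single pass and returns one output row per query/row overlap, so the intended duplicates are preserved. A rough sketch, reusing df and query from above; the result should match the original output up to row order:
library(data.table)

ranges <- data.table(start = query - 10000, end = query + 10000)
setkey(ranges, start, end)

dt2 <- copy(df)[, `:=`(start = pos, end = pos)]  # degenerate [pos, pos] intervals per row
setkey(dt2, start, end)

# one row per (query interval, overlapping df row); unmatched intervals are dropped
output2 <- foverlaps(ranges, dt2, type = "any", nomatch = 0L)[, .(pos, name)]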

apply a function over groups of columns

How can I use apply or a related function to create a new data frame that contains the results of the row averages of each pair of columns in a very large data frame?
I have an instrument that outputs n replicate measurements on a large number of samples, where each single measurement is a vector (all measurements are the same length vectors). I'd like to calculate the average (and other stats) on all replicate measurements of each sample. This means I need to group n consecutive columns together and do row-wise calculations.
For a simple example, with three replicate measurements on two samples, how can I end up with a data frame that has two columns (one per sample), one that is the average each row of the replicates in dat$a, dat$b and dat$c and one that is the average of each row for dat$d, dat$e and dat$f.
Here's some example data
dat <- data.frame( a = rnorm(16), b = rnorm(16), c = rnorm(16), d = rnorm(16), e = rnorm(16), f = rnorm(16))
a b c d e f
1 -0.9089594 -0.8144765 0.872691548 0.4051094 -0.09705234 -1.5100709
2 0.7993102 0.3243804 0.394560355 0.6646588 0.91033497 2.2504104
3 0.2963102 -0.2911078 -0.243723116 1.0661698 -0.89747522 -0.8455833
4 -0.4311512 -0.5997466 -0.545381175 0.3495578 0.38359390 0.4999425
5 -0.4955802 1.8949285 -0.266580411 1.2773987 -0.79373386 -1.8664651
6 1.0957793 -0.3326867 -1.116623982 -0.8584253 0.83704172 1.8368212
7 -0.2529444 0.5792413 -0.001950741 0.2661068 1.17515099 0.4875377
8 1.2560402 0.1354533 1.440160168 -2.1295397 2.05025701 1.0377283
9 0.8123061 0.4453768 1.598246016 0.7146553 -1.09476532 0.0600665
10 0.1084029 -0.4934862 -0.584671816 -0.8096653 1.54466019 -1.8117459
11 -0.8152812 0.9494620 0.100909570 1.5944528 1.56724269 0.6839954
12 0.3130357 2.6245864 1.750448404 -0.7494403 1.06055267 1.0358267
13 1.1976817 -1.2110708 0.719397607 -0.2690107 0.83364274 -0.6895936
14 -2.1860098 -0.8488031 -0.302743475 -0.7348443 0.34302096 -0.8024803
15 0.2361756 0.6773727 1.279737692 0.8742478 -0.03064782 -0.4874172
16 -1.5634527 -0.8276335 0.753090683 2.0394865 0.79006103 0.5704210
I'm after something like this
X1 X2
1 -0.28358147 -0.40067128
2 0.50608365 1.27513471
3 -0.07950691 -0.22562957
4 -0.52542633 0.41103139
5 0.37758930 -0.46093340
6 -0.11784382 0.60514586
7 0.10811540 0.64293184
8 0.94388455 0.31948189
9 0.95197629 -0.10668118
10 -0.32325169 -0.35891702
11 0.07836345 1.28189698
12 1.56269017 0.44897971
13 0.23533617 -0.04165384
14 -1.11251880 -0.39810121
15 0.73109533 0.11872758
16 -0.54599850 1.13332286
which I did with this, but it's obviously no good for my much larger data frame...
data.frame(cbind(
apply(cbind(dat$a, dat$b, dat$c), 1, mean),
apply(cbind(dat$d, dat$e, dat$f), 1, mean)
))
I've tried apply and loops and can't quite get it together. My actual data has some hundreds of columns.
This may be more generalizable to your situation in that you pass a list of indices. If speed is an issue (large data frame) I'd opt for lapply with do.call rather than sapply:
x <- list(1:3, 4:6)
do.call(cbind, lapply(x, function(i) rowMeans(dat[, i])))
Works if you just have col names too:
x <- list(c('a','b','c'), c('d', 'e', 'f'))
do.call(cbind, lapply(x, function(i) rowMeans(dat[, i])))
EDIT
Just happened to think maybe you want to automate this to do every three columns. I know there's a better way but here it is on a 100 column data set:
dat <- data.frame(matrix(rnorm(16*100), ncol=100))
n <- 1:ncol(dat)
ind <- matrix(c(n, rep(NA, 3 - ncol(dat)%%3)), byrow=TRUE, ncol=3)
ind <- data.frame(t(na.omit(ind)))
do.call(cbind, lapply(ind, function(i) rowMeans(dat[, i])))
EDIT 2
Still not happy with the indexing. I think there's a better/faster way to pass the indexes. here's a second though not satisfying method:
n <- 1:ncol(dat)
ind <- data.frame(matrix(c(n, rep(NA, 3 - ncol(dat)%%3)), byrow=F, nrow=3))
nonna <- sapply(ind, function(x) all(!is.na(x)))
ind <- ind[, nonna]
do.call(cbind, lapply(ind, function(i)rowMeans(dat[, i])))
A similar question was asked here by @david: averaging every 16 columns in r (now closed), which I answered by adapting @TylerRinker's answer above, following a suggestion by @joran and @Ben. Because the resulting function might be of help to OP or future readers, I am copying that function here, along with an example for OP's data.
# Function to apply 'fun' to object 'x' over every 'by' columns
# Alternatively, 'by' may be a vector of groups
byapply <- function(x, by, fun, ...)
{
  # Create index list
  if (length(by) == 1)
  {
    nc <- ncol(x)
    split.index <- rep(1:ceiling(nc / by), each = by, length.out = nc)
  } else # 'by' is a vector of groups
  {
    nc <- length(by)
    split.index <- by
  }
  index.list <- split(seq(from = 1, to = nc), split.index)
  # Pass index list to fun using sapply() and return object
  sapply(index.list, function(i)
  {
    do.call(fun, list(x[, i], ...))
  })
}
Then, to find the mean of the replicates:
byapply(dat, 3, rowMeans)
Or, perhaps the standard deviation of the replicates:
byapply(dat, 3, apply, 1, sd)
Update
by can also be specified as a vector of groups:
byapply(dat, c(1,1,1,2,2,2), rowMeans)
Mean for rows from columns a, b, c:
rowMeans(dat[1:3])
Mean for rows from columns d, e, f:
rowMeans(dat[4:6])
All in one call you get:
results <- cbind(rowMeans(dat[1:3]), rowMeans(dat[4:6]))
If you only know the names of the columns and not their order, you can use:
rowMeans(cbind(dat["a"], dat["b"], dat["c"]))
rowMeans(cbind(dat["d"], dat["e"], dat["f"]))
# I don't know how much damage this does to speed, but it should still be quick
The rowMeans solution will be faster, but for completeness here's how you might do this with apply:
t(apply(dat,1,function(x){ c(mean(x[1:3]),mean(x[4:6])) }))
Inspired by @joran's suggestion I came up with this (actually a bit different from what he suggested, though the transposing suggestion was especially useful):
Make a data frame of example data with p cols to simulate a realistic data set (following @TylerRinker's answer above and unlike my poor example in the question)
p <- 99 # how many columns?
dat <- data.frame(matrix(rnorm(4*p), ncol = p))
Rename the columns in this data frame to create groups of n consecutive columns, so that for groups of three columns I get names like 1,1,1,2,2,2,3,3,3, etc., and for groups of four columns I'd get 1,1,1,1,2,2,2,2,3,3,3,3, etc. I'm going with three for now (I guess this is a kind of indexing for people like me who don't know much about indexing).
n <- 3 # how many consecutive columns in the groups of interest?
names(dat) <- rep(seq(1:(ncol(dat)/n)), each = n, len = (ncol(dat)))
Now use apply and tapply to get row means for each of the groups
dat.avs <- data.frame(t(apply(dat, 1, tapply, names(dat), mean)))
The main downsides are that the column names in the original data are replaced (though this could be overcome by putting the grouping numbers in a new row rather than the colnames) and that the column names are returned by the apply-tapply function in an unhelpful order.
Further to @joran's suggestion, here's a data.table solution:
p <- 99 # how many columns?
dat <- data.frame(matrix(rnorm(4*p), ncol = p))
dat.t <- data.frame(t(dat))
n <- 3 # how many consecutive columns in the groups of interest?
dat.t$groups <- as.character(rep(seq(1:(ncol(dat)/n)), each = n, len = (ncol(dat))))
library(data.table)
DT <- data.table(dat.t)
setkey(DT, groups)
dat.av <- DT[, lapply(.SD,mean), by=groups]
Thanks everyone for your quick and patient efforts!
There is a beautifully simple solution if you are interested in applying a function to each unique combination of columns, in what is known as combinatorics.
combinations <- combn(colnames(dat), 2, function(x) rowMeans(dat[x]))
To calculate statistics for every unique combination of three columns, etc., just change the 2 to a 3. combn() handles the looping over column pairs for you, so this stays compact even as the number of columns grows. If the order of the columns matters, you instead need a permutation generator designed to reproduce ordered sets: combinat::permn.
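A small illustration of what this returns, using the question's dat (my addition, not from the original answer):
pair_means <- combn(colnames(dat), 2, function(cols) rowMeans(dat[cols]))
dim(pair_means)   # 16 rows x choose(6, 2) = 15 columns, one column per pair of columns
colnames(pair_means) <- combn(colnames(dat), 2, paste, collapse = "_")
head(pair_means[, c("a_b", "d_e", "d_f")])

# If ordered arrangements are needed instead, combinat::permn() enumerates them,
# e.g. length(combinat::permn(letters[1:3])) is 3! = 6 (requires the combinat package).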
