equivalent of melt+reshape that splits on column names - r

Point: if you are going to vote to close, it is poor form not to give a reason why. If it can be improved without requiring a close, take the 10 seconds it takes to write a brief comment.
Question:
How do I do the following "partial melt" in a way that memory can support?
Details:
I have a few million rows and around 1000 columns. The names of the columns have 2 pieces of information in them.
Normally I would melt to a data frame (or table) comprised of a pair of columns, then I would split on the variable name to create two new columns, then I would cast using one of the new splits for new column names, and one for row names.
This isn't working. My billion or so rows of data are making the additional columns overwhelm my memory.
Outside the "iterative force" (as opposed to brute force) of a for-loop, is there a clean and effective way to do this?
Thoughts:
this is a little like melt-colsplit-cast
libraries common for this seem to be "dplyr", "tidyr", "reshape2", and "data.table".
tidyr's gather+separate+spread looks good, but doesn't like not having a unique row identifier
reshape2's dcast (I'm looking for 2d output) wants to aggregate
brute force loses the labels. By brute force I mean df <- rbind(df[,block1],...) where block1 is the first 200 column indices, block2 is the second 200, et cetera.
Update (dummy code):
#libraries
library(stringr)
#reproducibility
set.seed(56873504)
#geometry
Ncol <- 2e3
Nrow <- 1e6
#column names
namelist <- character(length=Ncol)
for(i in 1:(Ncol/200)){
col_idx <- 1:200+200*(i-1)
if(i<26){
namelist[col_idx] <- paste0(intToUtf8(64+i),str_pad(string=1:200,width=3,pad="0"))
} else {
namelist[col_idx] <- paste0(intToUtf8(96+i),str_pad(string=1:200,width=3,pad="0"))
}
}
#random data
df <- as.data.frame(matrix(runif(n=Nrow*Ncol,min=0, max=16384),nrow=Nrow,ncol=Ncol))
names(df) <- namelist
The output that I would be looking for would have a column with the first character of the current name (single alphabet character) and colnames would be 1 to 200. It would be much less wide than "df" but not fully melted. It would also not kill my cpu or memory.
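The target shape can be illustrated on a toy data frame. This is only a sketch in base R under assumed sizes (3 rows, 2 letter groups, 3 columns per group); the names toy, grp, idx, blocks and partial are made up for the illustration, and the approach is the block-wise rbind from the list above with the letter kept as a label column.

```r
# Toy data: 3 rows, groups A and B with 3 columns each (illustrative sizes only)
set.seed(1)
toy <- as.data.frame(matrix(runif(3 * 6), nrow = 3))
names(toy) <- paste0(rep(c("A", "B"), each = 3), sprintf("%03d", 1:3))

# Split each column name into its two pieces of information
grp <- substr(names(toy), 1, 1)              # leading letter
idx <- as.integer(substr(names(toy), 2, 4))  # trailing number

# Stack one block per letter and keep the letter as a label column
blocks <- lapply(unique(grp), function(g) {
  block <- toy[, grp == g, drop = FALSE]
  names(block) <- idx[grp == g]
  cbind(variable = g, block)
})
partial <- do.call(rbind, blocks)
partial
```

The result has one row per (letter, original row) pair and one column per trailing number, so it is narrower than the input without being fully melted.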
(Ugly/Manual) Brute force version:
(working on it... )

Here are two options both using data.table.
If you know that each column string always has 200 (or n) fields associated with it (i.e., A001 - A200), you can use melt() and make a list of measurement variables.
melt(dt
, measure.vars = lapply(seq_len(Ncol_p_grp), seq.int, to = Ncol_p_grp * n_grp, by = Ncol_p_grp)
, value.name = as.character(seq_len(Ncol_p_grp))
)[, variable := rep(namelist_letters, each = Nrow)][]
#this data set used Ncol_p_grp <- 5 to help condense the data.
variable 1 2 3 4 5
1: A 0.2655087 0.06471249 0.2106027 0.41530902 0.59303088
2: A 0.3721239 0.67661240 0.1147864 0.14097138 0.55288322
3: A 0.5728534 0.73537169 0.1453641 0.45750426 0.59670404
4: A 0.9082078 0.11129967 0.3099322 0.80301300 0.39263068
5: A 0.2016819 0.04665462 0.1502421 0.32111280 0.26037592
---
259996: Z 0.5215874 0.78318812 0.7857528 0.61409610 0.67813484
259997: Z 0.6841282 0.99271480 0.7106837 0.82174887 0.92676493
259998: Z 0.1698301 0.70759513 0.5345685 0.09007727 0.77255570
259999: Z 0.2190295 0.14661878 0.1041779 0.96782695 0.99447460
260000: Z 0.4364768 0.06679642 0.6148842 0.91976255 0.08949571
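To see what the measure.vars list in the melt() call contains, here is a small base R sketch with assumed toy sizes (3 columns per group, 2 groups). The name measure_idx is made up for the illustration.

```r
# Toy sizes, chosen only to keep the printed list short
Ncol_p_grp <- 3
n_grp <- 2

# One element per output column: the positions of A00k, B00k, ... in the wide table
measure_idx <- lapply(seq_len(Ncol_p_grp), seq.int,
                      to = Ncol_p_grp * n_grp, by = Ncol_p_grp)
measure_idx
# element 1 holds positions 1 and 4 (A001, B001),
# element 2 holds positions 2 and 5, element 3 holds positions 3 and 6
```

Each list element becomes one value column in the melted result, which is why the approach depends on every letter having the same number of columns in the same order.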
Alternatively, we can use rbindlist(lapply(...)) to go through the data set and subset it based on the letter within the columns.
rbindlist(
lapply(namelist_letters,
function(x) setnames(
dt[, grep(paste0('^', x), names(dt), value = TRUE), with = FALSE]
, as.character(seq_len(Ncol_p_grp)))
)
, idcol = 'ID'
, use.names = F)[, ID := rep(namelist_letters, each = Nrow)][]
With 52 million elements in this dataset, it takes around a quarter of a second. I tried to increase it tenfold, but I just don't have the RAM to generate that much data in the first place.
#52 million elements - 10,000 rows * 26 grps * 200 cols_per_group
Unit: milliseconds
expr min lq mean median uq max neval
melt_option 134.0395 135.5959 137.3480 137.1523 139.0022 140.8521 3
rbindlist_option 290.2455 323.4414 350.1658 356.6373 380.1260 403.6147 3
Data: Run this before everything above:
#packages ----
library(data.table)
library(stringr)
#data info
Nrow <- 10000
Ncol_p_grp <- 200
n_grp <- 26
#generate data
set.seed(1)
dt <- data.table(replicate(Ncol_p_grp * n_grp, runif(n = Nrow)))
names(dt) <- paste0(rep(LETTERS[1:n_grp], each = Ncol_p_grp)
, str_pad(rep(seq_len(Ncol_p_grp), n_grp), width = 3, pad = '0'))
#first letter
namelist_letters <- unique(substr(names(dt), 1, 1))

Related

Fast method for combining list elements based on criteria

I'm building a little function in R that takes size measurements from several species and several sites, combines all the data by site (lumping many species together), and then computes some statistics on those combined data.
Here is some simplistic sample data:
SiteID <- rep(c("D00002", "D00003", "D00004"), c(5, 2, 3))
SpeciesID <- c("CHIL", "CHIP", "GAM", "NZMS", "LUMB", "CHIL", "SIMA", "CHIP", "CHIL", "NZMS")
Counts <- data.frame(matrix(sample(0:99,200, replace = TRUE), nrow = 10, ncol = 20))
colnames(Counts) <- paste0('B', 1:20)
spec <- cbind(SiteID, SpeciesID, Counts)
stat1 <- data.frame(unique(SiteID))
colnames(stat1) <- 'SiteID'
stat1$Mean <- NA
Here is the function, which creates a list, lsize1, where each list element is a vector of the sizes (B1 to B20) for a given SpeciesID in a given SiteID, multiplied by the number of counts for each size class. From this, the function creates a list, lsize2, which combines list elements from lsize1 that have the same SiteID. Finally, it gets the mean of each element in lsize2 (i.e., the average size of an individual for each SiteID, regardless of SpeciesID), and outputs that as a result.
fsize <- function(){
specB <- spec[, 3:22]
lsize1 <- apply(specB, 1, function(x) rep(1:20, x))
names(lsize1) <- spec$SiteID
lsize2 <- sapply(unique(names(lsize1)), function(x) unlist(lsize1[names(lsize1) == x], use.names = FALSE), simplify = FALSE)
stat1[stat1$SiteID %in% names(lsize2), 'Mean'] <- round(sapply(lsize2, mean), 2)
return(stat1)
}
In creating this function, I followed the suggestion here: combine list elements based on element names, which gets at the crux of my problem: combining list elements based on some criteria in common (in my case, combining all elements from the same SiteID). The function works as intended, but my question is if there's a way to make it substantially faster?
Note: for my actual data set, which is ~40,000 rows in length, I find that the function runs in ~ 0.7 seconds, with the most time consuming step being the creation of lsize2 (~ 0.5 seconds). I need to run this function many, many times, with different permutations and subsets of the data, so I'm hoping there's a way to cut this processing time down significantly.
There shouldn't be any need for loops here. Here's one attempt:
specB <- spec[, 3:22]  # the count columns B1 to B20, as inside fsize()
tmp <- data.frame(spec["SiteID"], sums = rowSums(specB * col(specB)), counts=rowSums(specB) )
tmp <- aggregate(. ~ SiteID, tmp, sum)
tmp$avg <- tmp$sums / tmp$counts
tmp
# SiteID sums counts avg
#1 D00002 46254 4549 10.16795
#2 D00003 20327 1810 11.23039
#3 D00004 29651 2889 10.26341
Compare:
fsize()
# SiteID Mean
#1 D00002 10.17
#2 D00003 11.23
#3 D00004 10.26
This code essentially multiplies each value by its column index (col(specB)), then aggregates the sums and counts by SiteID. This logic should be relatively transferable to other methods (data.table/dplyr) as well. E.g., in data.table:
setDT(spec)
spec[, .(avg = sum(.SD * col(.SD)) / sum(unlist(.SD))), by=SiteID, .SDcols=B1:B20]
# SiteID avg
#1: D00002 10.16795
#2: D00003 11.23039
#3: D00004 10.26341
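The same weighted-mean logic can also be sketched in base R with rowsum(). This is an illustration under assumptions: the seed is arbitrary (so the numbers differ from the output above), Counts is kept as a plain matrix, and the names sizes, size_sum, count_sum and avg are made up for the example.

```r
set.seed(42)  # assumed seed, only so the sketch is reproducible
SiteID <- rep(c("D00002", "D00003", "D00004"), c(5, 2, 3))
Counts <- matrix(sample(0:99, 200, replace = TRUE), nrow = 10, ncol = 20)
sizes <- seq_len(ncol(Counts))  # size classes 1..20

# Per row: total size (sum of size * count) and total number of individuals
size_sum  <- as.vector(Counts %*% sizes)
count_sum <- rowSums(Counts)

# Aggregate both by site, then divide to get the mean size per individual
avg <- rowsum(size_sum, SiteID) / rowsum(count_sum, SiteID)
avg
```

Because only two numbers per row are carried forward, no per-individual vectors are ever built, which is where the original lsize2 step spent most of its time.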

Replace values in a dataframe based on lookup table

I am having some trouble replacing values in a dataframe. I would like to replace values based on a separate table. Below is an example of what I am trying to do.
I have a table where every row is a customer and every column is an animal they purchased. Let's call this dataframe table.
> table
# P1 P2 P3
# 1 cat lizard parrot
# 2 lizard parrot cat
# 3 parrot cat lizard
I also have a table that I will reference called lookUp.
> lookUp
# pet class
# 1 cat mammal
# 2 lizard reptile
# 3 parrot bird
What I want to do is create a new table called new with a function that replaces all values in table with the class column in lookUp. I tried this myself using an lapply function, but I got the following warnings.
new <- as.data.frame(lapply(table, function(x) {
gsub('.*', lookUp[match(x, lookUp$pet) ,2], x)}), stringsAsFactors = FALSE)
Warning messages:
1: In gsub(".*", lookUp[match(x, lookUp$pet), 2], x) :
argument 'replacement' has length > 1 and only the first element will be used
2: In gsub(".*", lookUp[match(x, lookUp$pet), 2], x) :
argument 'replacement' has length > 1 and only the first element will be used
3: In gsub(".*", lookUp[match(x, lookUp$pet), 2], x) :
argument 'replacement' has length > 1 and only the first element will be used
Any ideas on how to make this work?
You posted an approach in your question which was not bad. Here's a similar approach:
new <- df # create a copy of df
# using lapply, loop over columns and match values to the look up table. store in "new".
new[] <- lapply(df, function(x) look$class[match(x, look$pet)])
An alternative approach which will be faster is:
new <- df
new[] <- look$class[match(unlist(df), look$pet)]
Note that I use empty brackets ([]) in both cases to keep the structure of new as it was (a data.frame).
(I'm using df instead of table and look instead of lookup in my answer)
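To see what the empty brackets do, here is a minimal sketch. The data frames df and look are re-created from the question's example so the snippet runs on its own; the names replaced, with_brackets and without_brackets are made up for the illustration.

```r
df <- data.frame(P1 = c("cat", "lizard", "parrot"),
                 P2 = c("lizard", "parrot", "cat"),
                 P3 = c("parrot", "cat", "lizard"),
                 stringsAsFactors = FALSE)
look <- data.frame(pet = c("cat", "lizard", "parrot"),
                   class = c("mammal", "reptile", "bird"),
                   stringsAsFactors = FALSE)

# Flat character vector of length 9, in column-major order
replaced <- look$class[match(unlist(df), look$pet)]

# With []: the values are written into the existing 3 x 3 data frame
with_brackets <- df
with_brackets[] <- replaced

# Without []: the variable is simply rebound to the flat vector
without_brackets <- df
without_brackets <- replaced

str(with_brackets)
str(without_brackets)
```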
Another option is a combination of tidyr and dplyr:
library(dplyr)
library(tidyr)
table %>%
gather(key = "pet") %>%
left_join(lookup, by = "pet") %>%
spread(key = pet, value = class)
Anytime you have two separate data.frames and are trying to bring info from one to the other, the answer is to merge.
Everyone has their own favorite merge method in R. Mine is data.table.
Also, since you want to do this to many columns, it'll be faster to melt and dcast -- rather than loop over columns, apply it once to a reshaped table, then reshape again.
library(data.table)
#the row names will be our ID variable for melting
setDT(table, keep.rownames = TRUE)
setDT(lookUp)
#now melt, merge, recast
# melting (reshape wide to long)
table[ , melt(.SD, id.vars = 'rn')
# merging
][lookUp, new_value := i.class, on = c(value = 'pet')
#reform back to original shape
][ , dcast(.SD, rn ~ variable, value.var = 'new_value')]
# rn P1 P2 P3
# 1: 1 mammal reptile bird
# 2: 2 reptile bird mammal
# 3: 3 bird mammal reptile
In case you find the dcast/melt bit a bit intimidating, here's an approach that just loops over columns; dcast/melt is simply sidestepping the loop for this problem.
setDT(table) #don't need row names this time
setDT(lookUp)
sapply(names(table), #(or to whichever are the relevant columns)
function(cc) table[lookUp, (cc) := #merge, replace
#need to pass a _named_ vector to 'on', so use setNames
i.class, on = setNames("pet", cc)])
Make a named vector, and loop through every column and match, see:
# make lookup vector with names
lookUp1 <- setNames(as.character(lookUp$class), lookUp$pet)
lookUp1
# cat lizard parrot
# "mammal" "reptile" "bird"
# match on names get values from lookup vector
res <- data.frame(lapply(df1, function(i) lookUp1[as.character(i)]))
# reset rownames
rownames(res) <- NULL
# res
# P1 P2 P3
# 1 mammal reptile bird
# 2 reptile bird mammal
# 3 bird mammal reptile
data
df1 <- read.table(text = "
P1 P2 P3
1 cat lizard parrot
2 lizard parrot cat
3 parrot cat lizard", header = TRUE)
lookUp <- read.table(text = "
pet class
1 cat mammal
2 lizard reptile
3 parrot bird", header = TRUE)
I did it using the factor built-in.
table$P1 <- factor(table$P1, levels=lookUp$pet, labels=lookUp$class)
table$P2 <- factor(table$P2, levels=lookUp$pet, labels=lookUp$class)
table$P3 <- factor(table$P3, levels=lookUp$pet, labels=lookUp$class)
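The three near-identical lines above can be folded into a single lapply() call. This is a sketch on the question's example data; the data frame is named pets here (a made-up name) to avoid masking base R's table() function.

```r
pets <- data.frame(P1 = c("cat", "lizard", "parrot"),
                   P2 = c("lizard", "parrot", "cat"),
                   P3 = c("parrot", "cat", "lizard"),
                   stringsAsFactors = FALSE)
lookUp <- data.frame(pet = c("cat", "lizard", "parrot"),
                     class = c("mammal", "reptile", "bird"),
                     stringsAsFactors = FALSE)

# Relabel every column in one pass; [] keeps pets a data frame
pets[] <- lapply(pets, factor, levels = lookUp$pet, labels = lookUp$class)
pets
```

Note that the columns end up as factors; wrap the result in as.character() per column if plain strings are needed.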
The answer above showing how to do this in dplyr doesn't answer the question: the resulting table is filled with NAs. The following worked for me; I would appreciate any comments showing a better way:
# Add a customer column so that we can put things back in the right order
table$customer = seq(nrow(table))
classTable <- table %>%
# put in long format, naming column filled with P1, P2, P3 "petCount"
gather(key="petCount", value="pet", -customer) %>%
# add a new column based on the pet's class in data frame "lookUp"
left_join(lookUp, by="pet") %>%
# since you wanted to replace the values in "table" with their
# "class", remove the pet column
select(-pet) %>%
# put data back into wide format
spread(key="petCount", value="class")
Note that it would likely be useful to keep the long table that contains the customer, the pet, the pet's species(?) and their class. This example simply adds an intermediary save to a variable:
table$customer = seq(nrow(table))
petClasses <- table %>%
gather(key="petCount", value="pet", -customer) %>%
left_join(lookUp, by="pet")
custPetClasses <- petClasses %>%
select(-pet) %>%
spread(key="petCount", value="class")
I tried other approaches and they took a really long time with my very large dataset. I used the following instead:
# make table "new" using ifelse. See data below to avoid re-typing it
new <- ifelse(table1 =="cat", "mammal",
ifelse(table1 == "lizard", "reptile",
ifelse(table1 =="parrot", "bird", NA)))
This method requires you to write more text for your code, but the vectorization of ifelse makes it run faster. You have to decide, based on your data, if you want to spend more time writing code or waiting for your computer to run. If you want to make sure it worked (that is, you didn't have any typos in your ifelse commands), you can use apply(new, 2, function(x) mean(is.na(x))).
data
# create the data table
table1 <- read.table(text = "
P1 P2 P3
1 cat lizard parrot
2 lizard parrot cat
3 parrot cat lizard", header = TRUE)
Benchmark
Out of burning curiosity, I just ran a benchmark with some of the approaches that I want to share with you. I couldn't quite believe some of the statements about performance in the answers and am trying to clarify this herewith. In order not to be misled by different rows/columns ratios, I consider three scenarios:
ncol == nrow
ncol << nrow
ncol >> nrow.
It might be beneficial to coerce as.matrix beforehand, so I included this as an additional solution (unlist_mat).
microbenchmark::microbenchmark(
lapply=Dat1[col_set] <- lapply(Dat1[col_set], function(x) Look$class[match(x, Look$pet)]),
unlist=Dat2[col_set] <- Look$class[match(unlist(Dat2[col_set]), Look$pet)],
unlist_mat=Mat[, col_set] <- Look$class[match(as.vector(Mat[, col_set]), Look$pet)], ## added
ifelse=Dat3[col_set] <- ifelse(Dat3[col_set] == "cat", "mammal",
ifelse(Dat3[col_set] == "lizard", "reptile",
ifelse(Dat3[col_set] == "parrot", "bird", NA))),
look_vec=Dat4[] <- lapply(Dat4, function(i) look[i]),
times=3L
)
## 1e3 x 1e3
# Unit: milliseconds
# expr min lq mean median uq max neval cld
# lapply 40.42905 63.47053 78.03831 86.51201 96.84294 107.17387 3 a
# unlist 513.25197 540.55981 656.25420 567.86766 727.75531 887.64297 3 b
# unlist_mat 45.91743 56.51087 68.50595 67.10432 79.80021 92.49611 3 a
# ifelse 117.83513 153.23771 366.16708 188.64030 490.33306 792.02581 3 ab
# look_vec 58.54449 88.40293 112.91165 118.26137 140.09522 161.92908 3 a
## 1e4 x 1e4
# Unit: seconds
# expr min lq mean median uq max neval cld
# lapply 2.427077 3.558234 3.992481 4.689390 4.775183 4.860977 3 a
# unlist 73.125989 79.203107 94.027433 85.280225 104.478155 123.676084 3 b
# unlist_mat 4.940254 5.011684 5.576553 5.083114 5.894703 6.706291 3 a
# ifelse 9.714553 14.444899 36.176777 19.175244 49.407889 79.640535 3 a
# look_vec 8.460969 8.558600 8.784463 8.656230 8.946209 9.236188 3 a
## 1e5 x 1e3
# Unit: seconds
# expr min lq mean median uq max neval cld
# lapply 2.314427 2.403001 3.270708 2.491575 3.748848 5.006120 3 a
# unlist 64.098825 66.850221 81.402676 69.601616 90.054601 110.507586 3 b
# unlist_mat 5.018869 5.060865 5.638499 5.102861 5.948314 6.793767 3 a
# ifelse 6.244744 16.488266 39.208119 26.731788 55.689807 84.647825 3 ab
# look_vec 4.512672 6.434651 7.496267 8.356630 8.988064 9.619498 3 a
## 1e3 x 1e5
# Unit: seconds
# expr min lq mean median uq max neval cld
# lapply 52.833019 55.373432 71.308981 57.913845 80.546963 103.180080 3 ab
# unlist 164.901805 168.710285 186.454796 172.518765 197.231292 221.943819 3 c
# unlist_mat 3.872551 4.422904 4.695393 4.973257 5.106814 5.240372 3 a
# ifelse 72.592437 76.473418 103.930063 80.354399 119.598876 158.843354 3 b
# look_vec 56.444824 58.904604 62.677267 61.364383 65.793488 70.222593 3 ab
Note: Performed on an Intel(R) Xeon(R) CPU E5-2690 v4 @ 2.60GHz using R --vanilla.
all(sapply(list(Dat2, as.data.frame(Mat), Dat3, Dat4), identical, Dat1)) ## *
# [1] TRUE
## *manipulate the data first outside the benchmark, of course!
Conclusion
Using lapply with a lookup matrix appears to be a good choice if the number of columns is rather low/lower than the number of rows. If we have many columns, especially compared to rows, we might benefit from coercing the respective columns of the data frame into a matrix first, which should only take a blink of an eye.
set.seed(42)
n <- 1e4; m <- 1e4
Dat <- data.frame(matrix(sample(c("cat", "lizard", "parrot"), n*m, replace=TRUE), n, m))
Look <- structure(list(pet = c("cat", "lizard", "parrot"), class = c("mammal", "reptile", "bird")),
class = "data.frame", row.names = c("1", "2", "3"))
look <- setNames(as.character(Look$class), Look$pet)
col_set <- names(Dat)
system.time(
Mat <- as.matrix(Dat)
)
# user system elapsed
# 0.844 0.318 1.161
Dat1 <- Dat2 <- Dat3 <- Dat4 <- Dat
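The matrix coercion recommended in the conclusion can be sketched on a small example. The sizes (5 x 4) are assumed only to keep the snippet quick, and the names res and ref are made up; ref is the plain lapply result used for comparison.

```r
set.seed(42)
n <- 5; m <- 4
Dat <- data.frame(matrix(sample(c("cat", "lizard", "parrot"), n * m, replace = TRUE), n, m),
                  stringsAsFactors = FALSE)
Look <- data.frame(pet = c("cat", "lizard", "parrot"),
                   class = c("mammal", "reptile", "bird"),
                   stringsAsFactors = FALSE)

# Coerce once, then do a single vectorised match over all cells
Mat <- as.matrix(Dat)
Mat[] <- Look$class[match(Mat, Look$pet)]
res <- as.data.frame(Mat, stringsAsFactors = FALSE)

# Column-by-column reference result
ref <- Dat
ref[] <- lapply(Dat, function(x) Look$class[match(x, Look$pet)])
all(as.matrix(ref) == Mat)
```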

R generating a sparse matrix

I have a large file with the following format which I read as x
userid,productid,freq
293994,8,3
293994,5,3
949859,2,1
949859,1,1
123234,1,1
123234,3,1
123234,4,1
...
It gives the product a given user bought and its frequency. I'm trying to make it into a matrix which gives all the productid's as columns and userids as rows with the frequency value as the entry. So the expected output is
1 2 3 4 5 8
293994 0 0 0 0 3 3
949859 1 1 0 0 0 0
123234 1 0 1 1 0 0
It is a sparse matrix. I tried doing table(x[[1]],x[[2]]) which works for small files, but beyond a point table gives an error
Error in table(x[[1]], x[[2]]) :
attempt to make a table with >= 2^31 elements
Execution halted
Is there a way to get this to work? I'm on R-3.1.0 and it's supposed to support 2^51 sized vectors, so I'm confused why it can't handle the file size. I have 40MM lines with a total file size of 741M. Thanks in advance.
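The error relates to the size of the dense result rather than the length of the input: table() needs one cell per user/product pair. A rough sketch of the arithmetic follows; the user and product counts are hypothetical (the question does not state them), and only the 40 million rows come from the question.

```r
n_users    <- 5e6   # hypothetical number of distinct userids
n_products <- 500   # hypothetical number of distinct productids
n_rows     <- 40e6  # lines in the file, from the question

dense_cells <- n_users * n_products
dense_cells >= 2^31       # TRUE: the dense table exceeds the limit in the error message
n_rows / dense_cells      # fraction of cells that would be non-zero at most
```

With counts like these, fewer than 2% of the cells could be non-zero, which is the case a sparse representation is meant for.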
One data.table way of doing it is:
library(data.table)
library(reshape2)
# adjust fun.aggregate as necessary - not very clear what you want from OP
dcast.data.table(your_data_table, userid ~ productid, fill = 0L)
You can check if that works for your data.
#This is old, but worth noting the Matrix package sparseMatrix() to directly format object without reshaping.
userid <- c(293994,293994,949859,949859,123234,123234,123234)
productid <- c(8,5,2,1,1,3,4)
freq <- c(3,3,1,1,1,1,1)
library(Matrix)
#The dgCMatrix sparseMatrix is a fraction of the size and builds much faster than reshaping if the data gets large
x <- sparseMatrix(i=as.integer(as.factor(userid)),
j=as.integer(as.factor(productid)),
dimnames = list(as.character(levels(as.factor(userid))),
as.character(levels(as.factor(productid)))
),
x=freq)
#Easily converted to a matrix.
x <- as.matrix(x)
#Learned this the hard way using recommenderlab (package built on top of Matrix) to build a binary matrix, so in case it helps someone else.
Here is a tidyr approach to this:
library(tidyverse)
library(magrittr)
# Replicate your example data
example_data <- matrix(
c(293994,8,3,
293994,5,3,
949859,2,1,
949859,1,1,
123234,1,1,
123234,3,1,
123234,4,1),
ncol = 3,
byrow = TRUE) %>%
as.data.frame %>%
set_colnames(c('userid','productid','freq'))
# Convert data into wide format
spread(example_data, key = productid, value = freq, fill = 0)
spread will be a lot faster than the base R table operation, but at scale, data.table would in turn easily outperform tidyr / dplyr. However, as noted in the previous answer, the data.table equivalent dcast isn't working properly. This seems to be a known issue which, unfortunately, remains unresolved.
I tried the tidyr approach at scale (2 million records) and couldn't make it work on my local machine. So you'll have to either chop it up (then combine the pieces with rbind) or take it to a cluster (with rhadoop or sparklyr).
Nonetheless, code for a reproducible "big data" example below in case somebody else would like to add something.
# Make some random IDs
randomkey <- function(digits){
paste(sample(LETTERS, digits, replace = TRUE), collapse = '')
}
products <- replicate(10, randomkey(20)) %>% unique
customers <- replicate(500000, randomkey(50)) %>% unique
big_example_data <- data.frame(
useruid = rep(sample(customers, length(customers), replace = FALSE), 4),
productid = sample(products, replace = TRUE),
freq = sample(1:5)
)
# 2 million rows of purchases
dim(big_example_data)
# With useruid, productid, freq
head(big_example_data)
# Test tidyr approach
system.time(
big_matrix <- spread(big_example_data, key = productid, value = freq, fill = 0)
)

Extract values from a transition matrix using columns of a data.frame in R

I have a transition matrix, with the cost of going from one state to another, e.g.
cost <- data.frame( a=c("aa","ab"),b=c("ba","bb"))
(pretending that the string "aa" is the cost of moving from a to a)
I've got a data.frame, with states in:
transitions <- data.frame( from=c("a","a","b"), to=c("a","b","b") )
I'd like to be able to add a column to transitions, with the cost of each transition in, so it ends up being:
from to cost
1 a a aa
2 a b ab
3 b b bb
I'm sure there is an R-ish way to do this. I've ended up using a for loop:
n <- dim(data)[1]
v <- vector("numeric",n)
for( i in 1:n )
{
z<-data[i,c(col1,col2),with=FALSE]
za <- z[[col1]]
zb <- z[[col2]]
v[i] <- dist[za,zb]
}
data <- cbind(data,d=v)
names(data)[dim(data)[2]] <- colName
data
But this feels pretty ugly, and it's incredibly slow - it takes about 20 minutes on a 2M row data.frame (and an operation to compute distances between elements of the same table takes less than a second).
Is there a simple, fast, one or two line command that would get me the cost column above?
UPDATE: Consider known states
data.table solution:
require(utils)
require(data.table)
## Data generation
N <- 2e6
set.seed(1)
states <- c("a","b")
cost <- data.frame(a=c("aa","ab"),b=c("ba","bb"))
transitions <- data.frame(from=sample(states, N, replace=T),
to=sample(states, N, replace=T))
## Expanded cost matrix construction
f <- expand.grid(states, states)
f <- f[order(f$Var1, f$Var2),]
f$cost <- unlist(cost)
## Prepare data.table
dt <- data.table(transitions)
setkey(dt, from, to)
## Routine itself
dt[,cost:=as.character("")] # You don't need this line if cost is numeric
apply(f, 1, function(x) dt[J(x[1],x[2]),cost:=x[3]])
With 2M rows in transitions it takes about 0.3 seconds to run.
Here's one way: (At least this works on this example and I believe it'll work on larger data as well. Please write back with an example if it doesn't)
# load both cost and transition with stringsAsFactors = FALSE
# so that strings are NOT by default loaded as factors
cost <- data.frame( a = c("aa","ab"), b = c("ba","bb"), stringsAsFactors=F)
transitions <- data.frame(from = c("a","a","b"), to = c("a","b","b"),
stringsAsFactors = FALSE)
# convert cost to vector: it'll have names a1, a2, b1, b2. we'll exploit that.
cost.vec <- unlist(cost)
# convert "to" to factor and create id column with "from" and as.integer(to)
# the as.integer(to) will convert it into its levels
transitions$to <- as.factor(transitions$to)
transitions$id <- paste0(transitions$from, as.integer(transitions$to))
# now, you'll have a1, a2 etc.. here as well, just match it with the vector
transitions$val <- unname(cost.vec[transitions$id])
# from to id val
# 1 a a a1 aa
# 2 a b a2 ab
# 3 b b b2 bb
You can of course remove the id. If this wouldn't work in any case, let me know. I'll try to fix it.
Starting from Arun's answer, I went with:
library(reshape)
cost <- data.frame( a = c("aa","ab"), b = c("ba","bb") )
transitions <- data.frame(from = c("a","a","b"), to = c("a","b","b") )
row.names(cost) <- c("a","b") #Normally get this from the csv file
cost$from <- row.names(cost)
m <- melt(cost, id.vars=c("from"))
m$transition = paste(m$from,m$variable)
transitions$transition=paste(transitions$from,transitions$to)
merge(m, transitions, by.x="transition",by.y="transition")
It's a few more lines, but I'm a bit untrusting of factor orderings as indexes. It also means that when they are data.tables, I can do:
setkey(m,transition)
setkey(transitions,transition)
m[transitions]
I haven't benchmarked, but on large datasets, I'm pretty confident the data.table merge will be faster than the merge or vector scan approaches.
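Another option, not used in the answers above, is base R's indexing of a matrix by a two-column character matrix. This sketch assumes the rows of cost are in the same state order as its columns (so row names can be copied from the column names); cost_m is a made-up name.

```r
cost <- data.frame(a = c("aa", "ab"), b = c("ba", "bb"), stringsAsFactors = FALSE)
transitions <- data.frame(from = c("a", "a", "b"), to = c("a", "b", "b"),
                          stringsAsFactors = FALSE)

cost_m <- as.matrix(cost)
rownames(cost_m) <- colnames(cost_m)  # assumption: rows follow the column order

# In the question's layout the column is the "from" state and the row is the "to" state
# (cost$a[2] is "ab"), so each lookup pair is (row = to, column = from)
transitions$cost <- cost_m[cbind(transitions$to, transitions$from)]
transitions
```

The lookup is a single vectorised subsetting operation, so it avoids both the row-by-row loop and the intermediate key column.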

apply a function over groups of columns

How can I use apply or a related function to create a new data frame that contains the results of the row averages of each pair of columns in a very large data frame?
I have an instrument that outputs n replicate measurements on a large number of samples, where each single measurement is a vector (all measurements are the same length vectors). I'd like to calculate the average (and other stats) on all replicate measurements of each sample. This means I need to group n consecutive columns together and do row-wise calculations.
For a simple example, with three replicate measurements on two samples, how can I end up with a data frame that has two columns (one per sample), one that is the average of each row of the replicates in dat$a, dat$b and dat$c and one that is the average of each row for dat$d, dat$e and dat$f?
Here's some example data
dat <- data.frame( a = rnorm(16), b = rnorm(16), c = rnorm(16), d = rnorm(16), e = rnorm(16), f = rnorm(16))
a b c d e f
1 -0.9089594 -0.8144765 0.872691548 0.4051094 -0.09705234 -1.5100709
2 0.7993102 0.3243804 0.394560355 0.6646588 0.91033497 2.2504104
3 0.2963102 -0.2911078 -0.243723116 1.0661698 -0.89747522 -0.8455833
4 -0.4311512 -0.5997466 -0.545381175 0.3495578 0.38359390 0.4999425
5 -0.4955802 1.8949285 -0.266580411 1.2773987 -0.79373386 -1.8664651
6 1.0957793 -0.3326867 -1.116623982 -0.8584253 0.83704172 1.8368212
7 -0.2529444 0.5792413 -0.001950741 0.2661068 1.17515099 0.4875377
8 1.2560402 0.1354533 1.440160168 -2.1295397 2.05025701 1.0377283
9 0.8123061 0.4453768 1.598246016 0.7146553 -1.09476532 0.0600665
10 0.1084029 -0.4934862 -0.584671816 -0.8096653 1.54466019 -1.8117459
11 -0.8152812 0.9494620 0.100909570 1.5944528 1.56724269 0.6839954
12 0.3130357 2.6245864 1.750448404 -0.7494403 1.06055267 1.0358267
13 1.1976817 -1.2110708 0.719397607 -0.2690107 0.83364274 -0.6895936
14 -2.1860098 -0.8488031 -0.302743475 -0.7348443 0.34302096 -0.8024803
15 0.2361756 0.6773727 1.279737692 0.8742478 -0.03064782 -0.4874172
16 -1.5634527 -0.8276335 0.753090683 2.0394865 0.79006103 0.5704210
I'm after something like this
X1 X2
1 -0.28358147 -0.40067128
2 0.50608365 1.27513471
3 -0.07950691 -0.22562957
4 -0.52542633 0.41103139
5 0.37758930 -0.46093340
6 -0.11784382 0.60514586
7 0.10811540 0.64293184
8 0.94388455 0.31948189
9 0.95197629 -0.10668118
10 -0.32325169 -0.35891702
11 0.07836345 1.28189698
12 1.56269017 0.44897971
13 0.23533617 -0.04165384
14 -1.11251880 -0.39810121
15 0.73109533 0.11872758
16 -0.54599850 1.13332286
which I did with this, but is obviously no good for my much larger data frame...
data.frame(cbind(
apply(cbind(dat$a, dat$b, dat$c), 1, mean),
apply(cbind(dat$d, dat$e, dat$f), 1, mean)
))
I've tried apply and loops and can't quite get it together. My actual data has some hundreds of columns.
This may be more generalizable to your situation in that you pass a list of indices. If speed is an issue (large data frame) I'd opt for lapply with do.call rather than sapply:
x <- list(1:3, 4:6)
do.call(cbind, lapply(x, function(i) rowMeans(dat[, i])))
Works if you just have col names too:
x <- list(c('a','b','c'), c('d', 'e', 'f'))
do.call(cbind, lapply(x, function(i) rowMeans(dat[, i])))
EDIT
Just happened to think maybe you want to automate this to do every three columns. I know there's a better way but here it is on a 100 column data set:
dat <- data.frame(matrix(rnorm(16*100), ncol=100))
n <- 1:ncol(dat)
ind <- matrix(c(n, rep(NA, 3 - ncol(dat)%%3)), byrow=TRUE, ncol=3)
ind <- data.frame(t(na.omit(ind)))
do.call(cbind, lapply(ind, function(i) rowMeans(dat[, i])))
EDIT 2
Still not happy with the indexing. I think there's a better/faster way to pass the indexes. Here's a second, though still unsatisfying, method:
n <- 1:ncol(dat)
ind <- data.frame(matrix(c(n, rep(NA, 3 - ncol(dat)%%3)), byrow=F, nrow=3))
nonna <- sapply(ind, function(x) all(!is.na(x)))
ind <- ind[, nonna]
do.call(cbind, lapply(ind, function(i)rowMeans(dat[, i])))
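One simpler way to build the index, offered as a possible alternative rather than as the answerer's method, is integer division on the column positions. The seed and the names grp, idx and res are assumptions made for this sketch.

```r
set.seed(1)
dat <- data.frame(matrix(rnorm(16 * 100), ncol = 100))
n <- 3  # columns per group

# 1,1,1,2,2,2,... : group number for each column position
grp <- (seq_len(ncol(dat)) - 1) %/% n + 1
idx <- split(seq_len(ncol(dat)), grp)

# Drop an incomplete trailing group, matching what the na.omit() step does above
idx <- idx[lengths(idx) == n]

res <- do.call(cbind, lapply(idx, function(i) rowMeans(dat[, i])))
dim(res)
```

With 100 columns and groups of three, column 100 is left over and dropped, giving 33 averaged columns.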
A similar question was asked here by @david: averaging every 16 columns in r (now closed), which I answered by adapting @TylerRinker's answer above, following a suggestion by @joran and @Ben. Because the resulting function might be of help to OP or future readers, I am copying that function here, along with an example for OP's data.
# Function to apply 'fun' to object 'x' over every 'by' columns
# Alternatively, 'by' may be a vector of groups
byapply <- function(x, by, fun, ...)
{
# Create index list
if (length(by) == 1)
{
nc <- ncol(x)
split.index <- rep(1:ceiling(nc / by), each = by, length.out = nc)
} else # 'by' is a vector of groups
{
nc <- length(by)
split.index <- by
}
index.list <- split(seq(from = 1, to = nc), split.index)
# Pass index list to fun using sapply() and return object
sapply(index.list, function(i)
{
do.call(fun, list(x[, i], ...))
})
}
Then, to find the mean of the replicates:
byapply(dat, 3, rowMeans)
Or, perhaps the standard deviation of the replicates:
byapply(dat, 3, apply, 1, sd)
Update
by can also be specified as a vector of groups:
byapply(dat, c(1,1,1,2,2,2), rowMeans)
Means for rows across columns a, b, c:
rowMeans(dat[1:3])
Means for rows across columns d, e, f:
rowMeans(dat[4:6])
All in one call:
results<-cbind(rowMeans(dat[1:3]),rowMeans(dat[4:6]))
If you only know the names of the columns and not their order, you can use:
rowMeans(cbind(dat["a"],dat["b"],dat["c"]))
rowMeans(cbind(dat["d"],dat["e"],dat["f"]))
#I don't know how much damage this does to speed, but it should still be quick
The rowMeans solution will be faster, but for completeness here's how you might do this with apply:
t(apply(dat,1,function(x){ c(mean(x[1:3]),mean(x[4:6])) }))
Inspired by @joran's suggestion I came up with this (actually a bit different from what he suggested, though the transposing suggestion was especially useful):
Make a data frame of example data with p cols to simulate a realistic data set (following @TylerRinker's answer above and unlike my poor example in the question)
p <- 99 # how many columns?
dat <- data.frame(matrix(rnorm(4*p), ncol = p))
Rename the columns in this data frame to create groups of n consecutive columns, so that if I'm interested in the groups of three columns I get column names like 1,1,1,2,2,2,3,3,3, etc or if I wanted groups of four columns it would be 1,1,1,1,2,2,2,2,3,3,3,3, etc. I'm going with three for now (I guess this is a kind of indexing for people like me who don't know much about indexing)
n <- 3 # how many consecutive columns in the groups of interest?
names(dat) <- rep(seq(1:(ncol(dat)/n)), each = n, len = (ncol(dat)))
Now use apply and tapply to get row means for each of the groups
dat.avs <- data.frame(t(apply(dat, 1, tapply, names(dat), mean)))
The main downsides are that the column names in the original data are replaced (though this could be overcome by putting the grouping numbers in a new row rather than the colnames) and that the column names are returned by the apply-tapply function in an unhelpful order.
Further to @joran's suggestion, here's a data.table solution:
p <- 99 # how many columns?
dat <- data.frame(matrix(rnorm(4*p), ncol = p))
dat.t <- data.frame(t(dat))
n <- 3 # how many consecutive columns in the groups of interest?
dat.t$groups <- as.character(rep(seq(1:(ncol(dat)/n)), each = n, len = (ncol(dat))))
library(data.table)
DT <- data.table(dat.t)
setkey(DT, groups)
dat.av <- DT[, lapply(.SD,mean), by=groups]
Thanks everyone for your quick and patient efforts!
There is a beautifully simple solution if you are interested in applying a function to each unique combination of columns, in what is known as combinatorics.
combinations <- combn(colnames(df),2,function(x) rowMeans(df[x]))
To calculate statistics for every unique combination of three columns, etc., just change the 2 to a 3. Note that combn() still calls the function once per combination internally, so any speed advantage over the apply-family approaches above comes from rowMeans() being vectorized rather than from avoiding iteration. If the order of the columns matters, you need permutations rather than combinations; combinat::permn generates the permutations of a vector.
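A small runnable sketch of the combn() call follows, using a three-column data frame named dat (the example above uses df, which is not defined there). The seed and the name pair_means are assumptions made for the illustration.

```r
set.seed(1)
dat <- data.frame(a = rnorm(4), b = rnorm(4), c = rnorm(4))

# One column of row means per unordered pair of columns
pair_means <- combn(colnames(dat), 2, function(x) rowMeans(dat[x]))

# Label each result column with the pair it came from: a_b, a_c, b_c
colnames(pair_means) <- combn(colnames(dat), 2, paste, collapse = "_")
pair_means
```

The result is a plain matrix with one row per observation and one column per pair.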
