Combining objects across a list - r

I have a simple question. I have a list of objects. Each object holds a few lists. Before this gets too complicated, let me illustrate:
x = a list
x[[1]] = some object
x[[2]] = another object
...
x[[n]] = another object
And as I said, each object holds some more lists. But I'm interested in a specific list, let's call it "a".
x[[1]][[a]] = ('A': 1, 'B': 2, 'C': 3, ..., Z: 26)
Sorry for the Python-like syntax! I am really just learning R. Anyway, what I want to do is combine the lists held in these objects and then take their median. To make this clearer, I want to group all 'A' elements and then take their median:
x[[1]][[a]][['A']], x[[2]][[a]][['A']], x[[3]][[a]][['A']], ..., x[[n]][[a]][['A']]
Similarly I want to group all 'B', 'C', ..., 'Z' elements and take their median...
x[[1]][[a]][['Z']], x[[2]][[a]][['Z']], x[[3]][[a]][['Z']], ..., x[[n]][[a]][['Z']]
So the question is what's the best way to do this? I've spent hours trying to figure this out! It would be great if someone could help me.
And if you would like to know what I'm actually doing, basically I have a list (x) of random forest objects. So x[[1]] is the first random forest, x[[100]] is the 100th random forest. Each random forest has a list of predicted values, which are stored in, e.g. x[[1]][['predicted']]. Each prediction list has a label associated with its predicted value. What I'm actually trying to do is calculate each label's median predicted value across all 100 random forests. And I want to do it efficiently. In Python, this is easy, but in R I'm not so sure. Anyway, thanks for the help!!! I really appreciate it.

Here's one way you could do it. It's a bit tough because you can't use rapply to subset by the names of list elements (which is frustrating). But you can unlist and then subset on names and take the median that way...
# Make some reproducible data
set.seed(1)
l <- list( a = sample(10,3) , b = sample(10,3) , c = sample(10,3) )
ll <- list( l , l , l )
# Unlist - we get a named vector but all a's have unique names - e.g. a1 , a2... an
unl <- unlist(ll)
# a1 a2 a3 b1 b2 b3 c1 c2 c3 a1 a2 a3 b1 b2 b3 c1 c2 c3 a1 a2 a3 b1 b2 b3 c1 c2 c3
# 3 4 5 10 2 8 10 6 9 3 4 5 10 2 8 10 6 9 3 4 5 10 2 8 10 6 9
# Subset by those elements that contain 'a' in their name
a.unl <- unl[ grepl("a",names(unl)) ]
# a1 a2 a3 a1 a2 a3 a1 a2 a3
# 3 4 5 3 4 5 3 4 5
# Take median
median( a.unl )
# [1] 4
To loop over multiple names try this...
sapply( c( "a" , "b" , "c" ) , function(x) median( unl[ grepl(x,names(unl) ) ] ) )
# a b c
# 4 8 9
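Mapping this back to the random-forest case in the question, here is a hedged sketch (it assumes each forest in x stores its predictions as a named 'predicted' element keyed by label):
labels <- names(x[[1]][['predicted']])
sapply(labels, function(lbl)
  median(sapply(x, function(rf) rf[['predicted']][[lbl]])))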

You could do this with a simple loop for every A, B, C, ... (note: use a separate accumulator rather than reusing x, or you would overwrite the very list you are reading from)
vals <- c()
for( i in 1:n ) vals <- c( vals, x[[i]][['a']][['A']] )
median(vals)

Sample data for creating your top-level list x:
x <- replicate(3, list(a = as.list(setNames(sample(1:100, 26), LETTERS)),
                       b = runif(10)),
               simplify = FALSE)
First, extract a from each list:
a.only <- lapply(x, `[[`, "a")
Then, to compute all A through Z medians in one shot, do:
do.call(mapply, c(a.only, FUN = function(...) median(unlist(list(...)))))
# A B C D E F G H I J K L M N O P Q R S T U V W X Y Z
# 55 59 41 21 93 72 65 74 51 42 87 25 60 40 13 77 35 31 92 51 57 37 87 67 29 46
If the sublists contain more items than you need, say you only want to compute medians on A, C, Z, do:
a.slices <- lapply(a.only, `[`, c("A", "C", "Z"))
do.call(mapply, c(a.slices, FUN = function(...) median(unlist(list(...)))))
# A C Z
# 55 41 46
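If you don't mind a package dependency, here is a hedged purrr equivalent of the same idea (a sketch, reusing the a.only list from above):
library(purrr)
# transpose() flips the list inside-out: one element per name A, B, C, ...
map_dbl(transpose(a.only), ~ median(unlist(.x)))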

Related

Remove duplicate documents with high similarity

When downloading LexisNexis newspaper articles, there are often a lot of duplicate articles in the corpus. I want to remove them, and I was thinking of doing so by using cosine similarity statistics, but I'm not sure how to automate this. Any ideas?
Your question is fairly thin on details - such as a reproducible example - but it's an interesting question and challenge. So here goes.
Let's say we have a corpus consisting of two sets of similar documents, { (a1, a2, a3), (b1, b2) } where the letters indicate similarity. We want to keep just one document when the others are "duplicates", defined as similarity exceeding a threshold, say 0.80.
We can use textstat_simil() to generate a similarity matrix, and then form pairwise sets directly from the returned dist object, and then keep just one of the similar sets.
library("quanteda")
# Loading required package: quanteda
# Package version: 1.3.14
mydocs <- c(a1 = "a a a a a b b c d w g j t",
            b1 = "l y y h x x x x x y y y y",
            a2 = "a a a a a b c s k w i r f",
            b2 = "p q w e d x x x x y y y y",
            a3 = "a a a a a b b x k w i r f")
mydfm <- dfm(mydocs)
(sim <- textstat_simil(mydfm))
# a1 b1 a2 b2
# b1 -0.22203788
# a2 0.80492203 -0.23090513
# b2 -0.23427416 0.90082239 -0.28140219
# a3 0.81167608 -0.09065452 0.92242890 -0.12530944
# create a data.frame of the unique pairs and their similarities
sim_pair_names <- t(combn(docnames(mydfm), 2))
sim_pairs <- data.frame(sim_pair_names,
                        sim = as.numeric(sim),
                        stringsAsFactors = FALSE)
sim_pairs
# X1 X2 sim
# 1 a1 b1 -0.22203788
# 2 a1 a2 0.80492203
# 3 a1 b2 -0.23427416
# 4 a1 a3 0.81167608
# 5 b1 a2 -0.23090513
# 6 b1 b2 0.90082239
# 7 b1 a3 -0.09065452
# 8 a2 b2 -0.28140219
# 9 a2 a3 0.92242890
# 10 b2 a3 -0.12530944
Subsetting this on our threshold condition, we can extract the names of the unlucky documents to be dropped, and feed this to a logical condition in dfm_subset().
# set the threshold for similarity
threshold <- 0.80
# discard one of the pair if similarity > threshold
todrop <- subset(sim_pairs, select = X1, subset = sim > threshold, drop = TRUE)
todrop
# [1] "a1" "a1" "b1" "a2"
# then subset the dfm, keeping only the "keepers"
dfm_subset(mydfm, !docnames(mydfm) %in% todrop)
# Document-feature matrix of: 2 documents, 20 features (62.5% sparse).
# 2 x 20 sparse Matrix of class "dfm"
# features
# docs a b c d w g j t l y h x s k i r f p q e
# b2 0 0 0 1 1 0 0 0 0 4 0 4 0 0 0 0 0 1 1 1
# a3 5 2 0 0 1 0 0 0 0 0 0 1 0 1 1 1 1 0 0 0
Other solutions to this problem of similar documents would be to form them into clusters, or to reduce the document matrix using principal components analysis, along the lines of latent semantic analysis.
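To make the clustering route concrete, here is a hedged sketch (assuming the sim object above can be coerced to a matrix, which may differ across quanteda versions): convert similarity to distance, cluster, cut the tree at 1 - threshold, and keep one document per cluster.
doc_dist <- as.dist(1 - as.matrix(sim))
hc <- hclust(doc_dist, method = "average")
clusters <- cutree(hc, h = 1 - 0.80)               # merge anything more similar than 0.80
keepers <- names(clusters)[!duplicated(clusters)]  # first document of each cluster
dfm_subset(mydfm, docnames(mydfm) %in% keepers)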
You already received some excellent answers. But if you prefer a more automated approach targeted at your specific use case, you can use the package LexisNexisTools (which I wrote). It comes with a function called lnt_similarity(), which does exactly what you are looking for. I wrote a quick tutorial with mock data here.
The main difference between the solutions here and in lnt_similarity() is that I also take into account word order, which can make a big difference in some rare cases (see this blog post).
I also suggest you think carefully about thresholds, as you might otherwise remove some articles wrongfully. I included a function called lnt_diff() to visualize the difference between two articles, so you can get a better grip on the data you are removing.
If you have thousands of documents, it takes a lot of space in your RAM to save all the similarity scores, but you can set a minimum threshold in textstat_proxy(), the underlying function of textstat_simil().
In this example, cosine similarity scores smaller than 0.9 are all ignored.
library("quanteda")
mydocs <- c(a1 = "a a a a a b b c d w g j t",
            b1 = "l y y h x x x x x y y y y",
            a2 = "a a a a a b c s k w i r f",
            b2 = "p q w e d x x x x y y y y",
            a3 = "a a a a a b b x k w i r f")
mydfm <- dfm(mydocs)
(sim <- textstat_proxy(mydfm, method = "cosine", min_proxy = 0.9))
# 5 x 5 sparse Matrix of class "dsTMatrix"
# a1 b1 a2 b2 a3
# a1 1 . . . .
# b1 . 1.0000000 . 0.9113423 .
# a2 . . 1.0000000 . 0.9415838
# b2 . 0.9113423 . 1.0000000 .
# a3 . . 0.9415838 . 1.0000000
matrix2list <- function(x) {
  names(x@x) <- rownames(x)[x@i + 1]
  split(x@x, factor(x@j + 1, levels = seq(ncol(x)), labels = colnames(x)))
}
matrix2list(sim)
# $a1
# a1
# 1
#
# $b1
# b1
# 1
#
# $a2
# a2
# 1
#
# $b2
# b1 b2
# 0.9113423 1.0000000
#
# $a3
# a2 a3
# 0.9415838 1.0000000
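A hedged follow-up to turn that list into a keep/drop decision: because the sparse matrix stores the upper triangle, every entry besides a document's own 1.0 points to an earlier near-duplicate, so any element of length > 1 can be dropped.
simil_list <- matrix2list(sim)
todrop <- names(simil_list)[sapply(simil_list, length) > 1]
dfm_subset(mydfm, !docnames(mydfm) %in% todrop)   # keeps a1, b1, a2 here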
See https://koheiw.net/?p=839 for the performance differences.

Split dataframe into bins based on another vector

Suppose I have the following dataframe:
x <- c(12,30,45,100,150,305,2,46,10,221)
x2 <- letters[1:10]
df <- data.frame(x,x2)
df <- df[with(df, order(x)), ]
x x2
7 2 g
9 10 i
1 12 a
2 30 b
3 45 c
8 46 h
4 100 d
5 150 e
10 221 j
6 305 f
And I would like to split these into groups based on another vector,
v <- seq(0, 500, 50)
Basically, I would like to partition out each row based on column x and how it matches with v (so, for example, x <= an element in v); the location/index of that element in v is then used to assign a group for that row. The resulting table should look something like the following:
x x2 group
7 2 g g1
9 10 i g1
1 12 a g1
2 30 b g1
3 45 c g1
8 46 h g2
4 100 d g3
5 150 e g4
10 221 j g4
6 305 f g6
I could try to loop through each row and match it to v, but I'm still confused as to how I could easily detect where the match x <= element of v occurs so that I can assign a group id to it. Thanks.
You can use cut to break up df$x by the values of v:
df$group <- as.numeric(cut(df$x, breaks = v))
df$group <- paste0('g', df$group)
cut returns a factor so you can use as.numeric to just pull out which numeric bucket the value of df$x falls into based on v.
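Applied to the sample data above, this gives the following (note that cut numbers the right-closed buckets (0,50], (50,100], ..., so the group labels differ slightly from the table sketched in the question):
df$group <- paste0('g', as.numeric(cut(df$x, breaks = v)))
df
#      x x2 group
# 7    2  g    g1
# 9   10  i    g1
# 1   12  a    g1
# 2   30  b    g1
# 3   45  c    g1
# 8   46  h    g1
# 4  100  d    g2
# 5  150  e    g3
# 10 221  j    g5
# 6  305  f    g7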

Generating a 96 or 384 well plate layout in R

I am trying to write some code that will take a .csv file containing some sample names as input and will output a data.frame containing the sample names and either a 96 well plate or 384 well plate format (A1, B1, C1...). For those who do not know, a 96 well plate has eight alphabetically labeled rows (A, B, C, D, E, F, G, H) and 12 numerically labeled columns (1:12), and a 384 well plate has 16 alphabetically labeled rows (A:P) and 24 numerically labeled columns (1:24). I am trying to write some code that will generate either of these formats (there CAN be two different functions to do this), allowing for the samples to be labeled either DOWN (A1, B1, C1, D1, E1, F1, G1, H1, A2...) or ACROSS (A1, A2, A3, A4, A5 ...).
So far, I have figured out how to get the row names fairly easily
rowLetter <- rep(LETTERS[1:8], length.out = variable)
#variable will be based on how many samples I have
I just cannot figure out how to get the numeric column names to apply correctly... I have tried:
colNumber <- rep(1:12, times = variable)
but it isn't that simple. All 8 rows must be filled before the col number increases by 1 if you're going 'DOWN' or all 12 columns must be filled before the row letter increases by 1 if you're going 'ACROSS'.
EDIT:
Here is a clunky version. It takes the number of samples that you have, a 'plate format' which IS NOT functional yet, and a direction, and will return a data.frame with the wells and plate numbers. Next, I am going to a) fix the plate format so that it will work correctly and b) give this function the ability to take a list of sample names or IDs or whatever and return the sample names, well positions, and plate numbers!
plateLayout <- function(numOfSamples, plateFormat = 96, direction = "DOWN"){
  #This assumes that each well will be filled in order. I may need to change this, but let's get it working first.
  #Calculate the number of plates required
  platesRequired <- ceiling(numOfSamples/plateFormat)
  rowLetter <- character(0)
  colNumber <- numeric(0)
  plateNumber <- numeric(0)
  #The following will work if the samples are going DOWN
  if(direction == "DOWN"){
    for(k in 1:platesRequired){
      rowLetter <- c(rowLetter, rep(LETTERS[1:8], length.out = 96))
      for(i in 1:12){
        colNumber <- c(colNumber, rep(i, times = 8))
      }
      plateNumber <- c(plateNumber, rep(k, times = 96))
    }
    plateLayout <- paste0(rowLetter, colNumber)
    plateLayout <- data.frame(plateLayout, plateNumber)
    plateLayout <- plateLayout[1:numOfSamples,]
    return(plateLayout)
  }
  #The following will work if the samples are going ACROSS
  if(direction == "ACROSS"){
    for(k in 1:platesRequired){
      colNumber <- c(colNumber, rep(1:12, times = 8))
      for(i in 1:8){
        rowLetter <- c(rowLetter, rep(LETTERS[i], times = 12))
      }
      plateNumber <- c(plateNumber, rep(k, times = 96))
    }
    plateLayout <- paste0(rowLetter, colNumber)
    plateLayout <- data.frame(plateLayout, plateNumber)
    plateLayout <- plateLayout[1:numOfSamples,]
    return(plateLayout)
  }
}
Does anybody have any thoughts on what else might make this cool? I'm going to use this function to generate .csv or .txt files to use as sample name imports for different instruments so I will be kind of constrained in terms of 'cool features', but I think it would be cool to use ggplot to make a graphic which shows the plates and sample names?
You don't need for loops. Here is a start:
#some sample ids
ids <- c(LETTERS, letters)
#plate size:
n <- 96
nrow <- 8
samples <- character(n)
samples[seq_along(ids)] <- ids
samples <- matrix(samples, nrow=nrow)
colnames(samples) <- seq_len(n/nrow)
rownames(samples) <- LETTERS[seq_len(nrow)]
# 1 2 3 4 5 6 7 8 9 10 11 12
# A "A" "I" "Q" "Y" "g" "o" "w" "" "" "" "" ""
# B "B" "J" "R" "Z" "h" "p" "x" "" "" "" "" ""
# C "C" "K" "S" "a" "i" "q" "y" "" "" "" "" ""
# D "D" "L" "T" "b" "j" "r" "z" "" "" "" "" ""
# E "E" "M" "U" "c" "k" "s" "" "" "" "" "" ""
# F "F" "N" "V" "d" "l" "t" "" "" "" "" "" ""
# G "G" "O" "W" "e" "m" "u" "" "" "" "" "" ""
# H "H" "P" "X" "f" "n" "v" "" "" "" "" "" ""
library(reshape2)
samples <- melt(samples)
samples$position <- paste0(samples$Var1, samples$Var2)
# Var1 Var2 value position
# 1 A 1 A A1
# 2 B 1 B B1
# 3 C 1 C C1
# 4 D 1 D D1
# 5 E 1 E E1
# 6 F 1 F F1
# 7 G 1 G G1
# 8 H 1 H H1
# 9 A 2 I A2
# 10 B 2 J B2
# 11 C 2 K C2
# 12 D 2 L D2
# 13 E 2 M E2
# 14 F 2 N F2
# 15 G 2 O G2
# 16 H 2 P H2
# 17 A 3 Q A3
# 18 B 3 R B3
# 19 C 3 S C3
# 20 D 3 T D3
# 21 E 3 U E3
# 22 F 3 V F3
# 23 G 3 W G3
# 24 H 3 X H3
# 25 A 4 Y A4
# 26 B 4 Z B4
# 27 C 4 a C4
# 28 D 4 b D4
# 29 E 4 c E4
# 30 F 4 d F4
# 31 G 4 e G4
# 32 H 4 f H4
# 33 A 5 g A5
# 34 B 5 h B5
# 35 C 5 i C5
# 36 D 5 j D5
# 37 E 5 k E5
# 38 F 5 l F5
# 39 G 5 m G5
# 40 H 5 n H5
# 41 A 6 o A6
# 42 B 6 p B6
# 43 C 6 q C6
# 44 D 6 r D6
# 45 E 6 s E6
# 46 F 6 t F6
# 47 G 6 u G6
# 48 H 6 v H6
# 49 A 7 w A7
# 50 B 7 x B7
# 51 C 7 y C7
# 52 D 7 z D7
# 53 E 7 E7
# 54 F 7 F7
# 55 G 7 G7
# 56 H 7 H7
# 57 A 8 A8
# 58 B 8 B8
# 59 C 8 C8
# 60 D 8 D8
# 61 E 8 E8
# 62 F 8 F8
# 63 G 8 G8
# 64 H 8 H8
# 65 A 9 A9
# 66 B 9 B9
# 67 C 9 C9
# 68 D 9 D9
# 69 E 9 E9
# 70 F 9 F9
# 71 G 9 G9
# 72 H 9 H9
# 73 A 10 A10
# 74 B 10 B10
# 75 C 10 C10
# 76 D 10 D10
# 77 E 10 E10
# 78 F 10 F10
# 79 G 10 G10
# 80 H 10 H10
# 81 A 11 A11
# 82 B 11 B11
# 83 C 11 C11
# 84 D 11 D11
# 85 E 11 E11
# 86 F 11 F11
# 87 G 11 G11
# 88 H 11 H11
# 89 A 12 A12
# 90 B 12 B12
# 91 C 12 C12
# 92 D 12 D12
# 93 E 12 E12
# 94 F 12 F12
# 95 G 12 G12
# 96 H 12 H12
Use the byrow argument to fill the matrix in the other direction:
samples <- matrix(samples, nrow=nrow, byrow=TRUE)
To fill more than one plate, you can use basically the same idea, but use an array instead of a matrix.
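A hedged sketch of the array idea, reusing the ids from above: because R fills arrays in column-major order, samples run down each plate's columns, one plate after another.
nplates <- 3
plates <- array("", dim = c(8, 12, nplates),
                dimnames = list(LETTERS[1:8], 1:12, paste0("plate", 1:nplates)))
plates[seq_along(ids)] <- ids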
I've never written this code in R before, but it should be the same as in Perl, Python, or Java.
For row-major order (going across), the pseudocode algorithm is simply:
for each( i : 0..totalNumWells - 1){
    column = (i % numColumns)
    row = ((i % totalNumWells) / numColumns)
}
Where numColumns is 12 for a 96-well plate or 24 for a 384-well plate, and totalNumWells is 96 or 384 respectively. This will give you a column and row index in 0-based coordinates, which is perfect for accessing arrays.
wellName = ABCs[row] + (column + 1)
Where ABCs is an array of all the valid letters in your plate (or A-Z). The + 1 converts 0-based into 1-based; otherwise the first well would be A0 instead of A1.
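For what it's worth, a hedged R translation of that pseudocode (well_names is a made-up helper; %% and %/% are R's modulo and integer division):
well_names <- function(totalNumWells = 96, numColumns = 12) {
  i <- 0:(totalNumWells - 1)                   # 0-based well index
  column <- i %% numColumns                    # 0-based column
  row <- (i %% totalNumWells) %/% numColumns   # 0-based row
  paste0(LETTERS[row + 1], column + 1)         # convert back to 1-based names
}
head(well_names())
# [1] "A1" "A2" "A3" "A4" "A5" "A6"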
I also want to point out that 384-well plates often aren't filled in row-major order. I've most often seen sequencing centers prefer a "checkerboard" pattern (A01, A03, A05..., then A02, A04, A06..., B01, B03..., etc.) so they can combine four 96-well plates into a single 384-well plate without changing the layout, which simplifies the picking robot's work. That's a much harder pattern to compute the ith well for.
The following code does what I set out to do. You can use it to make as many plates as you need, with the assumptions that whatever your import list is will be in order. It can make as many plates as you need and will add a column for "plateNumber" which will indicate which batch it's on. It can only handle 96 or 384 well plates, but that is all I deal in so that is fine.
plateLayout <- function(numOfSamples, plateFormat = 96, direction = "DOWN"){
  #This assumes that each well will be filled in order.
  #Calculate the number of plates required
  platesRequired <- ceiling(numOfSamples/plateFormat)
  rowLetter <- character(0)
  colNumber <- numeric(0)
  plateNumber <- numeric(0)
  #define the number of columns and number of rows based on plate format (96 or 384 well plate)
  switch(as.character(plateFormat),
         "96" = {numberOfColumns = 12; numberOfRows = 8},
         "384" = {numberOfColumns = 24; numberOfRows = 16})
  #The following will work if the samples are going DOWN
  if(direction == "DOWN"){
    for(k in 1:platesRequired){
      rowLetter <- c(rowLetter, rep(LETTERS[1:numberOfRows], length.out = plateFormat))
      for(i in 1:numberOfColumns){
        colNumber <- c(colNumber, rep(i, times = numberOfRows))
      }
      plateNumber <- c(plateNumber, rep(k, times = plateFormat))
    }
    plateLayout <- paste0(rowLetter, colNumber)
    plateLayout <- data.frame(plateNumber, plateLayout)
    plateLayout <- plateLayout[1:numOfSamples,]
    return(plateLayout)
  }
  #The following will work if the samples are going ACROSS
  if(direction == "ACROSS"){
    for(k in 1:platesRequired){
      colNumber <- c(colNumber, rep(1:numberOfColumns, times = numberOfRows))
      for(i in 1:numberOfRows){
        rowLetter <- c(rowLetter, rep(LETTERS[i], times = numberOfColumns))
      }
      plateNumber <- c(plateNumber, rep(k, times = plateFormat))
    }
    plateLayout <- paste0(rowLetter, colNumber)
    plateLayout <- data.frame(plateNumber, plateLayout)
    plateLayout <- plateLayout[1:numOfSamples,]
    return(plateLayout)
  }
}
An example of how to use this would be as follows
#load whatever data you're going to use to get a plate layout on (sample ID's or names or whatever)
thisData <- read.csv("data.csv")
#make a data.frame containing your sample names and the function's output
#alternatively you can use length() if you have a list
plateLayoutDataFrame <- data.frame(thisData$sampleNames, plateLayout(nrow(thisData), plateFormat = 96, direction = "DOWN"))
#It will return something similar to the following, depending on your selections
#data plateNumber plateLayout
#sample1 1 A1
#sample2 1 B1
#sample3 1 C1
#sample4 1 D1
#sample5 1 E1
#sample6 1 F1
#sample7 1 G1
#sample8 1 H1
#sample9 1 A2
#sample10 1 B2
#sample11 1 C2
#sample12 1 D2
#sample13 1 E2
#sample14 1 F2
#sample15 1 G2
That sums up this function for now. Roland offered a good method of doing this which is less verbose, but I wanted to avoid the use of external packages if possible. I'm working on a shiny app now which actually uses this! I want it to be able to automatically subset based on the 'plateNumber' and write each plate as its own file... for more on this, go to: Automatic multi-file download in R-Shiny
Here's how I'd do it.
put_samples_in_plates = function(sample_list, nwells=96, direction="across")
{
  if(!nwells %in% c(96, 384)){
    stop("Invalid plate size")
  }
  nsamples = nrow(sample_list)
  nplates = ceiling(nsamples/nwells)
  if(nwells==96){
    rows = LETTERS[1:8]
    cols = 1:12
  }else if(nwells==384){
    rows = LETTERS[1:16]
    cols = 1:24
  }else{
    stop("Unrecognized nwells")
  }
  nrows = length(rows)
  ncols = length(cols)
  if(tolower(direction)=="down"){
    single_plate_df = data.frame(row = rep(rows, times=ncols),
                                 col = rep(cols, each=nrows))
  }else if(tolower(direction)=="across"){
    single_plate_df = data.frame(row = rep(rows, each=ncols),
                                 col = rep(cols, times=nrows))
  }else{
    stop("Unrecognized direction")
  }
  single_plate_df = transform(single_plate_df,
                              well = sprintf("%s%02d", row, col))
  toobig_plate_df = cbind(data.frame(plate=rep(1:nplates, each=nwells)),
                          do.call("rbind", replicate(nplates,
                                                     single_plate_df,
                                                     simplify=FALSE)))
  res = cbind(sample_list, toobig_plate_df[1:nsamples,])
  return(res)
}
# Quick test
a_sample_list = data.frame(x=1:386, y=rnorm(386))
r.096.across = put_samples_in_plates(sample_list = a_sample_list,
                                     nwells = 96,
                                     direction = "across")
r.096.down = put_samples_in_plates(sample_list = a_sample_list,
                                   nwells = 96,
                                   direction = "down")
r.384.across = put_samples_in_plates(sample_list = a_sample_list,
                                     nwells = 384,
                                     direction = "across")
r.384.down = put_samples_in_plates(sample_list = a_sample_list,
                                   nwells = 384,
                                   direction = "down")
Two points worth noting in the function above:
the use of the times and each parameters within the rep function to differentiate "across" and "down" directions, and
the use of replicate to repeat the individual plate as many times as needed along with the use of a call to rbind from do.call.
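A quick illustration of that times/each distinction in the first point:
rep(1:3, times = 2)  # 1 2 3 1 2 3 (the whole vector cycles)
rep(1:3, each = 2)   # 1 1 2 2 3 3 (each element repeats in place)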

Difference between tilde and "by" while using aggregate function in R

Every time I do an aggregate on a data.frame I default to using the "by = list(...)" parameter. But I do see solutions on stackoverflow and elsewhere where tilde (~) is used in the "formula" parameter. I kinda see the "by" parameter as the "pivot" around these variables.
In some cases, the output is exactly the same. For example:
aggregate(cbind(df$A, df$B, df$C), FUN = sum, by = list("x" = df$D, "y" = df$E))
AND
aggregate(cbind(df$A, df$B, df$C) ~ df$E, FUN = sum)
What is the difference between the two and when do you use which?
I would not entirely disagree that it doesn't really matter which approach you use; however, it is important to note that they do behave differently.
I'll illustrate with a small example.
Here's some sample data:
set.seed(1)
mydf <- data.frame(A = c(1, 1, 1, 2, 2, 3, 3, 3, 3, 4, 4, 4),
                   B = LETTERS[c(1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2)],
                   matrix(sample(100, 36, replace = TRUE), nrow = 12))
mydf[3:5] <- lapply(mydf[3:5], function(x) {
  x[sample(nrow(mydf), 1)] <- NA
  x
})
mydf
# A B X1 X2 X3
# 1 1 A 27 69 27
# 2 1 A 38 NA 39
# 3 1 A 58 77 2
# 4 2 A 91 50 39
# 5 2 A 21 72 87
# 6 3 B 90 100 35
# 7 3 B 95 39 49
# 8 3 B 67 78 60
# 9 3 B 63 94 NA
# 10 4 B NA 22 19
# 11 4 B 21 66 83
# 12 4 B 18 13 67
First, the formula interface. The following three commands will all yield the same output.
aggregate(cbind(X1, X2, X3) ~ A + B, mydf, sum)
aggregate(cbind(X1, X2, X3) ~ ., mydf, sum)
aggregate(. ~ A + B, mydf, sum)
# A B X1 X2 X3
# 1 1 A 85 146 29
# 2 2 A 112 122 126
# 3 3 B 252 217 144
# 4 4 B 39 79 150
Here's a related command for the "by" interface. Pretty cumbersome to type (but that can be addressed by using with, if required).
aggregate(cbind(mydf$X1, mydf$X2, mydf$X3),
          by = list(mydf$A, mydf$B), sum)
#   Group.1 Group.2  V1  V2  V3
# 1       1       A 123  NA  68
# 2       2       A 112 122 126
# 3       3       B 315 311  NA
# 4       4       B  NA 101 169
Now, stop and make note of any differences.
The two that pop into my mind are:
The formula method does a nicer job of preserving names but it doesn't let you control the names directly in your command, which you can do in the data.frame method:
aggregate(cbind(NewX1 = mydf$X1, NewX2 = mydf$X2, NewX3 = mydf$X3),
          by = list(NewA = mydf$A, NewB = mydf$B), sum)
The formula method and the data.frame method treat NA values differently. To get the same result with the formula method as you do with the data.frame method, you need to use na.action = na.pass.
aggregate(. ~ A + B, mydf, sum, na.action=na.pass)
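For the sample data above, this reproduces the NA-containing sums from the data.frame method:
#   A B  X1  X2  X3
# 1 1 A 123  NA  68
# 2 2 A 112 122 126
# 3 3 B 315 311  NA
# 4 4 B  NA 101 169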
Again, it is not entirely wrong to say "I don't think it really matters", and I'm not going to state my preference here since that's not really what Stack Overflow is about, but it is important to always read the function documentation carefully before making such decisions.
From the help page,
aggregate.formula is a standard formula interface to aggregate.data.frame
So I don't think it really matters. Use whichever approach you're comfortable with, or which fits existing variables and formulas in your workspace.

Move cells to new column at each 48th row

I have a list with names in A1:A144 and I want to move A49:A96 to B1:B48 and A97:A144 to C1:C48.
So for each 48th row, I want the next 48 rows moved to a new column.
How to do that?
If you want to consider a VBA alternative then:
Sub MoveData()
    nF = 1
    nL = 48
    nSize = Cells(Rows.Count, "A").End(xlUp).Row
    nBlock = nSize / nL
    'the first block stays in column A; blocks 2..nBlock move to columns B, C, ...
    For k = 1 To nBlock - 1
        nF = nF + 48
        nL = nL + 48
        Range("A" & nF & ":A" & nL).Copy Cells(1, k + 1)
        Range("A" & nF & ":A" & nL).ClearContents
    Next k
End Sub
Not sure how scalable this solution is, but it does work.
First, let's pretend your names are in x and you want the solution in new.df.
number.shifts <- ceiling(length(x) / 48) # work out how many columns we need
# create an empty (NA) matrix with the dimensions we need: 48 rows, one column per block
new.df <- matrix(data = NA, nrow = 48, ncol = number.shifts)
# run a for-loop over x; the row wraps around and the column advances every 48th element
for (i in seq_along(x)) {
  new.df[(i - 1) %% 48 + 1, ceiling(i / 48)] <- x[i]
}
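As an aside, since matrix() fills column-wise by default, the same reshaping needs no loop at all when length(x) is an exact multiple of 48:
new.df <- matrix(x, nrow = 48)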
I think you have to elaborate on your question a little more. Do you have the data in R or in Excel and do you want the output to be in R or in Excel?
That being said, if x is your vector indicating clusters
x <- rep(1:3, each = 48)
and y is the variable containing names or whatever that you want to distribute over columns A:C (each having 48 rows),
y <- sample(letters, 3 * 48, replace = TRUE)
you can do this:
y.wide <- do.call(cbind, split(y, x))
Just as there is stack in R to create a very long representation of a group of columns, there is unstack to take a long column and make it into a wide form.
Here's a basic example:
mydf <- data.frame(A = 1:144)
mydf$groups <- paste0("A", gl(n=3, k=48)) ## One of many ways to create groups
mydf2 <- unstack(mydf)
head(mydf2)
# A1 A2 A3
# 1 1 49 97
# 2 2 50 98
# 3 3 51 99
# 4 4 52 100
# 5 5 53 101
# 6 6 54 102
tail(mydf2)
# A1 A2 A3
# 43 43 91 139
# 44 44 92 140
# 45 45 93 141
# 46 46 94 142
# 47 47 95 143
# 48 48 96 144
