I have a data frame consisting of about 22 fields, some system ids and some measurements, such as:
bsystemid dcesystemid lengthdecimal heightquantity
2218 58 22 263
2219 58 22 197
2220 58 22 241
What I want:
1. loop through a list of field ids
2. define a function to test for a condition
3. such that both x and y can vary
Where does the definition of y belong if both x and y are to vary? Or does this need a different structure altogether?
This code block works for a single field and value of y:
library(plyr)  # for count()
varlist4 <- names(brg)[c(6)]
f1 <- function(x, y) count(brg[, x] < y)
lapply(varlist4, f1, y = c(7.5))
This code block executes, but the counts are off:
varlist4 <- names(brg)[c(6, 8, 10, 12)]
f1 <- function(x, y) count(brg[, x] < y)
lapply(varlist4, f1, y = c(7.5, 130, 150, 0))
For example,
varlist4 <- names(brg)[c(6)]
f1 <- function(x, y) count(brg[, x] < y)
lapply(varlist4, f1, y = c(7.5))
returns (correctly),
x freq
1 FALSE 9490
2 TRUE 309
3 NA 41
whereas the multiple x,y block of code above returns this for the first case,
x freq
1 FALSE 4828
2 TRUE 4971
3 NA 41
Thanks for any comments.
Update:
What I would like is to automate counting of occurrences of values in specified fields of a data frame, meeting some condition. The conditions are numeric constants or text strings, one for each field. For example, I might want to count occurrences meeting the condition >360 in field1, >0 in field2, and so on. What I mean by allowing x and y to vary is therefore reading x and y vectors, holding the field names and the corresponding conditions, into a looping structure.
I'd like to automate this task because it involves around 30 tables, each with up to 50 or so fields. And I'll need to do it twice, scanning once for values exceeding a maximum and once for values less than a minimum. Better still might be loading the conditions into a table and referencing that in the loop. That may be the next step but I'd like to understand this piece first.
This working example
t1 <- 18:29
t2 <- c(76.1, 77, 78.1, 78.2, 78.8, 79.7, 79.9, 81.1, 81.2, 81.8, 82.8, 83.5)
t3 <- c(1.2, -0.2, -0.3, 1.2, 2.2, 0.4, 0.6, 0.4, -0.8, -0.1, 5.0, 3.1)
t <- data.frame(v1 = t1, v2 = t2, v3 = t3)
varlist <- names(t)[c(1)]
f1 <- function(x, y) count(t[, x] > y)
lapply(varlist, f1, y = c(27))
illustrates the correct answer for the first field, returning
x freq
1 FALSE 10
2 TRUE 2
But if I add in other fields and the corresponding conditions (the y's) I get something different for the first case:
varlist <- names(t)[c(1, 2, 3)]
f1 <- function(x, y) count(t[, x] > y)
lapply(varlist, f1, y = c(27, 83, 3))
[[1]]
x freq
1 FALSE 8
2 TRUE 4
[[2]]
x freq
1 FALSE 1
2 TRUE 11
[[3]]
x freq
1 FALSE 11
2 TRUE 1
My sense is that I'm not structuring the y part correctly.
Thanks for any comments.
You can use mapply. Let's create some data:
set.seed(123) # to get exactly the same results
brg = data.frame(x = rnorm(100), y=rnorm(100), z=rnorm(100))
brg$x[c(10, 15)] = NA # some NAs
brg$y[c(12, 21)] = NA # more NAs
Then you need to define the function to do the job. The inner function .f1 counts the data and ensures there are always three levels (TRUE, FALSE, NA). f1 then uses .f1 in an mapply call, so that x and y vary together: element i of x is paired with element i of y. (In your lapply attempts, the whole y vector was passed to every call and recycled against the column, which is why the counts were off.) Finally, there are some improvements in the output (changing the names of the columns).
f1 = function(x, y, data) {
  .f1 = function(x, y, data) {
    out = factor(data[, x] < y,
                 levels = c("TRUE", "FALSE", NA), exclude = NULL)
    return(table(out))
  }
  out = mapply(.f1, x, y, MoreArgs = list(data = data))  # check ?mapply
  colnames(out) = paste0(x, "<", y)  # clearer names for the output
  return(out)
}
Finally, the test:
varlist = names(brg)
threshold = c(0, 1, 1000)
f1(x=varlist, y=threshold, data=brg)
And you should get
x<0 y<1 z<1000
TRUE 46 87 100
FALSE 52 11 0
<NA> 2 2 0
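As a variation (my suggestion, not part of the pattern above), base R's Map() is a thin wrapper around mapply that also pairs x with y elementwise, which fits the conditions-table idea from the update. A minimal sketch using the brg data above:
# hypothetical conditions table: one field name and one threshold per row
conds = data.frame(field = c("x", "y", "z"),
                   threshold = c(0, 1, 1000),
                   stringsAsFactors = FALSE)
# Map() pairs field i with threshold i instead of recycling the whole vector
Map(function(f, thr) table(brg[, f] < thr, useNA = "always"),
    conds$field, conds$threshold)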
Related
I am currently trying to program a function that shows me possible ways to finish a certain number of points in 1 to 3 dart throws. Only the range from 170 to 2 points is necessary.
Example: I have 120 points left. I want a function that gives out possible ways to finish 120 points. For example by throwing 40 - 40 - 40; 60 - 60 or 57 - 57 - 6 and so on...
I am new to programming in R and to programming in general and have absolutely no idea how and where to start. Does anyone have a recommendation for me? Is there a specific function that can be applied in this case?
Thank you very much in advance!
Here is another approach. One thing you would have to fix (if needed): this solution produces separate 'solutions' for, let's say, a single 2 and a double 1.
so, 4 can be ended with
single 1, single 1, double 2
single 2, double 2
double 2, double 2
and this gets worse on the three-dart finishes...
library(data.table)  # for rbindlist() and the DT[...] syntax below

singles <- c(1:20, 25)
doubles <- 2 * singles
triples <- 3 * 1:20

# all possible endings (the last dart must be a double)
one.dart <- tidyr::crossing(first = doubles)
two.dart <- tidyr::crossing(first = unique(c(singles, doubles, triples)),
                            second = doubles)
three.dart <- tidyr::crossing(first = unique(c(singles, doubles, triples)),
                              second = unique(c(singles, doubles, triples)),
                              third = doubles)

# bind together
DT <- rbindlist(list(one.dart, two.dart, three.dart), use.names = TRUE, fill = TRUE)

# calculate finish total and number of darts used
DT[, finish := rowSums(.SD, na.rm = TRUE)]
DT[, darts := rowSums(!is.na(.SD)), .SDcols = 1:3]

calculate_finish <- function(x) {
  DT[finish == x, 1:3]
}
calculate_finish(120)
# first second third
# 1: 10 60 50
# 2: 13 57 50
# 3: 16 54 50
# 4: 19 51 50
# 5: 20 50 50
# ---
# 130: 60 40 20
# 131: 60 42 18
# 132: 60 48 12
# 133: 60 50 10
# 134: 60 54 6
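To keep, say, double 1 and single 2 distinct, one option (a sketch along the same data.table lines, not part of the answer above) is to label every throw with its type and cross row indices instead of raw scores; a two-dart version:
library(data.table)
singles <- c(1:20, 25)
# every throw gets a distinct label, so D1 and S2 no longer collapse into "2"
throws <- data.table(
  label = c(paste0("S", singles), paste0("D", singles), paste0("T", 1:20)),
  score = c(singles, 2 * singles, 3 * 1:20)
)
doubles <- throws[grepl("^D", label)]
# two-dart finishes on 100 points: any first throw, then a double
two.dart <- CJ(first = seq_len(nrow(throws)), last = seq_len(nrow(doubles)))
two.dart <- two.dart[throws$score[first] + doubles$score[last] == 100]
two.dart[, .(first = throws$label[first], last = doubles$label[last])]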
The question you have is rather general - more like outsourcing your programming than finding help with specific implementation parts of the code. Anyhow, here goes a simplified solution that is far from perfect programming-wise but does the job. I hope it helps you to understand programming and functions plus a bit of R, given that you have a specific problem.
# we need this library to be able to use the %>% operator below
library(dplyr)
# vector of all the possible field numbers (I have no idea, but you can look them up and fill the field)
dboard <- c(1, 2, 3, 4, 5)
# expand this to a grid of all possible combinations
all_combinations <- expand.grid(dboard, dboard, dboard, stringsAsFactors = FALSE)
# your function with two inputs
get_throw_combinations <- function(remaining_points, throws){
  # use the possible combinations defined outside the function
  res <- all_combinations %>%
    # select the first to the number-of-throws columns
    dplyr::select(1:throws) %>%
    # reduce redundancy (happens when throws < 3)
    dplyr::distinct() %>%
    # filter those where the row sum is equal to the target
    dplyr::filter(rowSums(.) == remaining_points)
  # return the result
  return(res)
}
# run your function for 5 points with two throws
get_throw_combinations(5, 2)
# results R will display
Var1 Var2
1 4 1
2 3 2
3 2 3
4 1 4
Here is a brute-force approach using expand.grid + subset
v <- 2:170
v <- 2:170
res <- lapply(
  1:3,
  function(k) subset(u <- expand.grid(rep(list(v), k)), rowSums(u) == 120)
)
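res is then a list with one data frame per dart count; for instance (my quick inspection, not part of the answer):
head(res[[2]])  # two-value combinations summing to 120, e.g. Var1 = 118, Var2 = 2
nrow(res[[3]])  # number of three-value combinations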
I have an imported dataframe Measurements that contains many observations from an experiment.
Measurements <- data.frame(X = 1:4,
Data = c(90, 85, 100, 105))
X Data
1 90
2 85
3 100
4 105
I want to add another column Condition that specifies the treatment group for each datapoint. I know which observation ranges are from which condition (e.g. observations 1:2 are from the control and observations 3:4 are from the experimental group).
I have devised two solutions already that give the desired output but neither are ideal. First:
Measurements["Condition"] <- c(rep("Cont", 2), rep("Exp", 2))
X Data Condition
1 90 Cont
2 85 Cont
3 100 Exp
4 105 Exp
The benefit of this is that it is one line of code and a single command. But it is not ideal, since I need to do the math separately (e.g. 3:4 = 2 obs, etc.), which can be tricky, unclear, and indirect with larger datasets and more conditions (e.g. 47:83 = ? obs), and it is liable to perpetuate errors: a small error in length for an early assignment also shifts the assignment of later groups (e.g. if the rep count for Cont is mistakenly 1, then Exp gets mistakenly assigned to rows 2:3 too).
I also thought of assigning like this, which gives the desired output too:
Measurements[1:2, "Condition"] <- "Cont"
Measurements[3:4, "Condition"] <- "Exp"
X Data Condition
1 90 Cont
2 85 Cont
3 100 Exp
4 105 Exp
This makes it more clear/simple/direct which rows will receive which assignment, but this requires separate assignments and repetition. I feel like there should be a way to "vectorize" this assignment, which is the solution I'm looking for.
I'm having trouble finding complex indexing rules from online. Here is my first intuitive guess of how to achieve this:
Measurements[c(1:2, 3:4), "Condition"] <- list("Cont", "Exp")
X Data Condition
1 90 Cont
2 85 Cont
3 100 Cont
4 105 Cont
But this doesn't work. It seems to combine 1:2 and 3:4 into a single equivalent range (1:4) and assigns only the first condition to this range, which suggests I also need to specify the column again. When I try to specify the column again:
Measurements[c(1:2, 3:4), c("Condition", "Condition")] <- list("Cont", "Exp")
X Data Condition Condition.1
1 90 Cont Exp
2 85 Cont Exp
3 100 Cont Exp
4 105 Cont Exp
For some reason this creates a second new column (??), and it again seems to combine 1:2 and 3:4 into essentially 1:4. So I think I need to index the two row ranges in a way that keeps them separate and only specify the column once, but I'm stuck on how to do this. I assume the solution is simple but I can't seem to find an example of what I'm trying to do. Maybe to keep them separate I do have to assign them separately, but I'm hoping there is a way.
Can anyone help? Thank you a ton in advance from an R noobie!
If you already have a list of observations which belong to each condition you could use dplyr::case_when to do a conditional mutate. Depending on how you have this information stored you could use something like the following:
library(dplyr)
Measurements <- data.frame(X = 1:4,
Data = c(90, 85, 100, 105))
# set which observations belong to each condition
Cont <- 1:2
Exp <- 3:4
Measurements %>%
  mutate(Condition = case_when(
    X %in% Cont ~ "Cont",
    X %in% Exp ~ "Exp"
  ))
# X Data Condition
# 1 90 Cont
# 2 85 Cont
# 3 100 Exp
# 4 105 Exp
Note that this does not require the observations to be in consecutive rows.
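For instance (a quick check of that claim, reusing the objects above with hypothetical non-consecutive groups):
Cont <- c(1, 4)
Exp <- c(2, 3)
Measurements %>%
  mutate(Condition = case_when(
    X %in% Cont ~ "Cont",
    X %in% Exp ~ "Exp"
  ))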
I normally see this done with a merge operation. The trick is getting your conditions data into a nice shape.
composeConditions <- function(...) {
  conditions <- list(...)
  data.frame(
    X = unname(unlist(conditions)),
    condition = unlist(unname(lapply(
      names(conditions),
      function(x) rep(x, times = length(conditions[x][[1]]))
    )))
  )
}
conditions <- composeConditions(Cont = 1:2, Exp = 3:4)
> conditions
X condition
1 1 Cont
2 2 Cont
3 3 Exp
4 4 Exp
merge(Measurements, conditions, by = "X")
X Data condition
1 1 90 Cont
2 2 85 Cont
3 3 100 Exp
4 4 105 Exp
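As an aside (base R, my addition rather than part of the answer above), utils::stack() builds the same lookup table in one call:
# stack() turns a named list into a values/ind data frame
conditions <- setNames(stack(list(Cont = 1:2, Exp = 3:4)), c("X", "condition"))
merge(Measurements, conditions, by = "X")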
Efficient for larger datasets is to know the data pattern and index into a vector of the condition names (dat below).
Measurements <- data.frame(X = 1:4, Data = c(90, 85, 100, 105))
dat <- c("Cont","Exp")
pattern <- c(1,1,2,2)
Or draw the pattern from the data, e.g. conditionally on Measurements$Data:
pattern <- sapply(Measurements$Data >= 100, function(x) if (x) 2 else 1)
# [1] 1 1 2 2
Then you can add the data simply by doing:
Measurements$Condition <- dat[pattern]
# X Data Condition
#1 1 90 Cont
#2 2 85 Cont
#3 3 100 Exp
#4 4 105 Exp
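The sapply() above can also be written arithmetically; this vectorized one-liner (my equivalent rewrite) yields the same pattern:
pattern <- as.integer(Measurements$Data >= 100) + 1  # FALSE -> 1, TRUE -> 2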
I have a function that I want to iterate over only certain rows of my dataset, and then save the results in a variable in the dataset.
So for example say I have this set up:
library(tidyverse)
add_one <- function(vector, x_id){
  return(vector[x_id] + 1)
}
test <- data.frame(x = c(1,2,3,4), y = c(1,2,3,4), run_on = c(TRUE,FALSE,TRUE,FALSE))
test
So the test data frame looks like:
> x y run_on
>1 1 1 TRUE
>2 2 2 FALSE
>3 3 3 TRUE
>4 4 4 FALSE
So what I want to do is iterate over the dataframe and set the y column to be the result of applying the function add_one() to the x column for just the rows where run_on is TRUE. I want the end result to look like this:
> x y run_on
>1 1 2 TRUE
>2 2 2 FALSE
>3 3 4 TRUE
>4 4 4 FALSE
I have been able to iterate the function over all of the rows using apply(). So for example:
test$y <- apply(test,1,add_one,x_id = 1)
test
> x y run_on
>1 1 2 TRUE
>2 2 3 FALSE
>3 3 4 TRUE
>4 4 5 FALSE
But this also applies the function to rows 2 and 4, which I do not want. I suspect there may be some way to do this using versions of the map() functions from purrr, which is why I tagged this post as such.
In reality, I am using this kind of procedure to repeatedly iterate over a large dataset multiple times, so I need it to be done automatically and cleanly. Any help or suggestions would be very much appreciated.
UPDATE
I managed to find a solution. Some of the solutions offered here did work in my toy example but did not extend to the more complex function I was actually using. Ultimately what worked was something similar to what tmfmnk suggested. I just wrapped the original function inside another function that included an if statement to determine whether or not to apply the original function. So to extend my toy example, my solution looks like this:
add_one_if <- function(vector, x_id, y_id, run_on_id){
  if (vector[run_on_id]) {
    return(add_one(vector, x_id))
  } else {
    return(vector[x_id])
  }
}
test$y <- apply(test, 1, add_one_if, x_id = 1, y_id = 2, run_on_id = 3)
It seems a little convoluted, but it worked for me and is reproducible and reliable in the way I need it to be.
You can also do:
add_one <- function(data, vector, x_id, n, is.true = c(TRUE, FALSE)) {
  if (is.true) {
    return(data[[vector]] + data[[x_id]] * n)
  } else {
    return(data[[vector]] + (!data[[x_id]]) * n)
  }
}
add_one(test, vector = "y", x_id = "run_on", 1, is.true = TRUE)
[1] 2 2 4 4
add_one(test, vector = "y", x_id = "run_on", 5, is.true = FALSE)
[1] 1 7 3 9
It may be that your real case is more complicated than allowed by this, but why not just use ifelse? (Writing the add_one() logic inline as test$x + 1, since ifelse works on whole columns:)
test$y <- ifelse(test$run_on, test$x + 1, test$y)
Or even:
test$y[test$run_on] <- test$x[test$run_on] + 1
You won't need to use purrr until you are applying the same function to multiple columns. Since you want to modify only one column, but based on a condition, you can use mutate() + case_when().
mutate(test, y = case_when(run_on ~ add_one(y),
!run_on ~ y))
#> x y run_on
#> 1 1 2 TRUE
#> 2 2 2 FALSE
#> 3 3 4 TRUE
#> 4 4 4 FALSE
Horrible title question, but this is what I am trying to achieve. For Table1 I want to add the Column "BETWEEN", verifying if the "POSITION" falls between any of the "START" and "STOP" values for the corresponding "BIN" in Table2.
Table1. BIN names (character) and POSITION in BIN (numeric):
BIN POSITION
1 12
1 52
1 86
7 6
7 22
X 112
X 139
MT 3
MT 26
Table2: BIN name (character) and START and STOP positions (numeric)
BIN START STOP
1 2 64
1 90 110
7 20 100
7 105 200
X 1 5
MT 1 1000
And the desired result - Table 1 with "BETWEEN":
BIN POSITION BETWEEN
1 12 TRUE
1 52 TRUE
1 86 FALSE
7 6 FALSE
7 22 TRUE
X 112 FALSE
X 139 FALSE
MT 3 TRUE
MT 26 TRUE
My Table 1 has about 4,000,000 rows, and Table 2 about 500,000 rows, and anything I came up with was very slow.
As an example of bigger tables, use the following:
library(data.table)
positions <- seq(1, 100000, 10)
bins <- c("A","B","C","D","E","F","G","H","I","J")
tab1 <- data.table(bin = rep(bins, 1, each = length(positions)), pos = rep(positions, 10))
tab2 <- data.table(bin = rep(bins, 1, each = 2000), start = seq(5, 100000, 50))
tab2[, stop := start + 25]
The desired output would be:
tab1
bin pos between
1: A 1 FALSE
2: A 11 TRUE
3: A 21 TRUE
4: A 31 FALSE
5: A 41 FALSE
The following method requires that, for a given bin, the bounds are mutually exclusive (e.g. you can't have bin A with bounds 1-5 and another bin A with bounds 4-8). Also, I modified your example a bit.
positions <- seq(1, 100000, 10)
bins <- c("A","B","C","D","E","F","G","H","I","J")
tab1 <- data.table(bin = rep(bins, 1, each = length(positions)), pos = rep(positions, 10))
setkey(tab1, "bin", "pos")
tab2 <- data.table(bin = rep(bins, 1, each = 2000), start = seq(5, 100000, 50))
tab2[, end := start + 25]

# join tab1 to tab2 keyed on the start bounds (roll = TRUE rolls the
# nearest preceding start forward onto each pos)
tab2[, pos := start]
setkey(tab2, "bin", "pos")
x <- tab2[tab1, roll = TRUE, nomatch = 0]

# join tab1 to tab2 keyed on the end bounds (roll = -Inf rolls the
# nearest following end backward onto each pos)
tab2[, pos := end]
setkey(tab2, "bin", "pos")
y <- tab2[tab1, roll = -Inf, nomatch = 0]

# inner join of the two results: rows where both joins hit the same
# interval, i.e. pos falls inside a bin's bounds
setkey(x, "bin", "pos", "start")
setkey(y, "bin", "pos", "start")
inBin <- x[y, nomatch = 0]
inBin[, between := TRUE]

# map the flag back onto tab1; unmatched rows are FALSE
setkey(tab1, "bin", "pos")
setkey(inBin, "bin", "pos")
result <- inBin[, list(bin, pos, between)][tab1]
result[is.na(between), between := FALSE]
I don't have the time to explain my solution in depth right now. Instead I'll take the cheap way out and refer you to research the roll parameter of data.table. The basic methodology above is that I'm joining tab1 and tab2, rolling pos forward to the nearest end bound. Then I join tab1 and tab2, rolling pos backward to the nearest start bound. Then I do an inner join on those two sets, giving me all rows in tab1 which fall inside the bounds of a bin. From that point, it's just grunt work.
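As a side note (my addition; it assumes a data.table version that ships foverlaps()), the same check can be expressed directly as an overlap join, which avoids the double rolling join:
library(data.table)
# point queries need zero-width intervals on the left table
tab1[, `:=`(start = pos, end = pos)]
setkey(tab2, bin, start, end)
# for each tab1 row, the first tab2 interval containing pos (NA if none)
hits <- foverlaps(tab1, tab2, by.x = c("bin", "start", "end"),
                  type = "within", mult = "first", which = TRUE)
tab1[, between := !is.na(hits)]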
The most straightforward approach is to nest the matching loops, I think. You may need to handle factors slightly differently. I haven't tested to see what happens if it does not find a bin match.
BIN <- c("1","1","1","7","7","X","X","MT","MT")
POSITION <- c(12,52,86,6,22,112,139,3,26)
npos <- length(POSITION)
BETWEEN <- vector(mode="logical",length=npos)
tab1 <- as.data.frame(cbind(BIN,POSITION))
BIN2 <- c("1","1","7","7","X","MT")
START <- c(2,90,20,105,1,1)
STOP <- c(64,110,100,200,5,1000)
tab2 <- as.data.frame(cbind(BIN2,START,STOP))
bins <- unique(tab1$BIN)
for(bin in bins){
  #print(paste("bin=", bin))
  t1.bin.matches <- which(tab1$BIN == bin)
  t2.bin.compares <- which(tab2$BIN2 == bin)
  #print(t1.bin.matches)
  #print(t2.bin.compares)
  for(match in t1.bin.matches){
    between = FALSE
    candidate = as.numeric(as.vector(tab1$POSITION)[match])
    for(compare in t2.bin.compares){
      comp.start <- as.numeric(as.vector(tab2$START)[compare])
      comp.stop <- as.numeric(as.vector(tab2$STOP)[compare])
      if(candidate >= comp.start && candidate <= comp.stop){
        between = TRUE
        break
      }
    }
    #print(paste(comp.start, candidate, comp.stop, between))
    BETWEEN[match] = between
  }
}
tab1 <- as.data.frame(cbind(tab1,BETWEEN))
tab1
Make sure your BIN columns are character, and POSITION, START, STOP are numeric.
Table1$BIN = as.character(Table1$BIN)
Table1$POSITION = as.numeric(Table1$POSITION)
Table2$BIN = as.character(Table2$BIN)
Table2$START = as.numeric(Table2$START)
Table2$STOP = as.numeric(Table2$STOP)
Convert your data.frames to data.tables (library(data.table)), because the code below might otherwise be slow.
Table1 = as.data.table(Table1)
Table2 = as.data.table(Table2)
Generate desired output
z = apply(Table1, 1, function(x) {
  nrow(Table2[(as.numeric(x[2]) > START) & (as.numeric(x[2]) < STOP) & (BIN == as.character(x[1])), ]) > 0
})
cbind(Table1, z)
The old function is z(), the new one is y(). With the sample Table1 and Table2, the new function is 30% faster. I don't know how this advantage will scale as nrow increases, but I'm guessing the scaling will be very positive. Let me know.
library(microbenchmark)
z = function(a) {
  apply(Table1, 1, function(x) {
    z = subset(Table2, Table2$BIN == as.character(x[1]))
    any(as.numeric(x[2]) > z$START & as.numeric(x[2]) < z$STOP)
  })
}
y = function(a) {
  apply(Table1, 1, function(x) {
    nrow(Table2[(as.numeric(x[2]) > START) & (as.numeric(x[2]) < STOP) & (BIN == as.character(x[1])), ]) > 0
  })
}
microbenchmark(z(), y(), times = 1000L)
expr min lq median uq max neval
z() 1168.283 1219.793 1237.791 1276.267 3481.576 1000
y() 809.575 848.052 863.257 885.909 1683.383 1000
edit: you might need to muck with the as.numeric and as.character in the subsetting. I lost the data.table I created earlier and directly used the data.frame from the answer above.
I would like to aggregate a data.frame over 3 categories, with one of them varying. Unfortunately this one varying category contains NAs (actually it's the reason why it needs to vary). Thus I created a list of data.frames. Every data.frame within this list contains only complete cases with respect to three variables (with only one of them changing).
Let's reproduce this:
library(plyr)
mydata <- warpbreaks
names(mydata) <- c("someValue","group","size")
mydata$category <- c(1,2,3)
mydata$categoryA <- c("A","A","X","X","Z","Z")
# add some NA
mydata$category[c(8,10,19)] <- NA
mydata$categoryA[c(14,1,20)] <- NA
# create a list of dfs that contains TRUE FALSE
noNAList <- function(vec){
  res <- !is.na(vec)
  return(res)
}
testTF <- lapply(mydata[,c("category","categoryA")],noNAList)
# create a list of data.frames
selectDF <- function(TFvec){
  res <- mydata[TFvec,]
  return(res)
}
# check x and see that it may contain NAs as long
# as it's not in one of the 3 categories I want to aggregate over
x <-lapply(testTF,selectDF)
## let's ddply get to work
doddply <- function(df){
  ddply(df, .(group, size), summarize, sumTest = sum(someValue))
}
y <- lapply(x, doddply);y
y comes very close to what I want to get
$category
group size sumTest
1 A L 375
2 A M 198
3 A H 185
4 B L 254
5 B M 259
6 B H 169
$categoryA
group size sumTest
1 A L 375
2 A M 204
3 A H 200
4 B L 254
5 B M 259
6 B H 169
But I also need to aggregate over a third, varying variable, which is in this case category or categoryA. Like this:
group size category sumTest sumTestTotal
1 A H 1 46 221
2 A H 2 46 221
3 A H 3 93 221
and so forth. How can I add names(x) to lapply, or do I need a loop or environment here?
EDIT:
Note that I want EITHER category OR categoryA added to the mix. In reality I have about 15 mutually exclusive categorical vars.
I think you might be making this really hard on yourself, if I understand your question correctly.
If you want to aggregate the data.frame 'myData' by three (or four) variables, you would simply do this:
aggregate(someValue ~ group + size + category + categoryA, sum, data=mydata)
group size category categoryA someValue
1 A L 1 A 51
2 B L 1 A 19
3 A M 1 A 17
4 B M 1 A 63
aggregate will automatically remove rows that include NA in any of the categories. If someValue is sometimes NA, then you can add the parameter na.rm=T.
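For instance, if someValue itself contained NAs (my illustration; the extra argument is passed through to sum()):
aggregate(someValue ~ group + size + category + categoryA, sum, data = mydata, na.rm = TRUE)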
I also noted that you put a lot of unnecessary code into functions. For example:
# create a list of data.frames
selectDF <- function(TFvec){
  res <- mydata[TFvec,]
  return(res)
}
Can be written like:
selectDF <- function(TFvec) mydata[TFvec,]
Also, using lapply to create a list of two data frames without the NA is overkill. Try this code:
x = list(mydata[!is.na(mydata$category),],mydata[!is.na(mydata$categoryA),])
I know the question explicitly requests a ddply()/lapply() solution.
But ... if you are willing to come on over to the dark side, here is a data.table()-based function that should do the trick:
# Convert mydata to a data.table
library(data.table)
dt <- data.table(mydata, key = c("group", "size"))
# Define workhorse function
myfunction <- function(dt, VAR) {
  E <- as.name(substitute(VAR))
  dt[i = !is.na(eval(E)),
     j = {n <- sum(.SD[, someValue])
          .SD[, list(sumTest = sum(someValue),
                     sumTestTotal = n,
                     share = sum(someValue)/n),
              by = VAR]
     },
     by = key(dt)]
}
# Test it out
s1 <- myfunction(dt, "category")
s2 <- myfunction(dt, "categoryA")
ADDED ON EDIT
Here's how you could run this for a vector of different categorical variables:
catVars <- c("category", "categoryA")
ll <- lapply(catVars,
             FUN = function(X) {
               do.call(myfunction, list(dt, X))
             })
names(ll) <- catVars
lapply(ll, head, 3)
# $category
# group size category sumTest sumTestTotal share
# [1,] A H 2 46 185 0.2486486
# [2,] A H 3 93 185 0.5027027
# [3,] A H 1 46 185 0.2486486
#
# $categoryA
# group size categoryA sumTest sumTestTotal share
# [1,] A H A 79 200 0.395
# [2,] A H X 68 200 0.340
# [3,] A H Z 53 200 0.265
Finally, I found a solution that might not be as slick as Josh's, but it works without any dark forces (data.table). You may laugh; here's my reproducible example using the same sample data as in the question.
qual <- c("category","categoryA")
# get T / F vectors
noNAList <- function(vec){
  res <- !is.na(vec)
  return(res)
}
selectDF <- function(TFvec) mydata[TFvec,]
NAcheck <- lapply(mydata[,qual],noNAList)
# create a list of data.frames
listOfDf <- lapply(NAcheck,selectDF)
workhorse <- function(charVec, listOfDf){
  dfs <- list2env(listOfDf)
  # create expression list, one ddply() call per categorical variable
  exlist <- list()
  for(i in 1:length(charVec)){
    exlist[[charVec[i]]] <- parse(text = paste("ddply(", charVec[i],
      ",.(group,size,", charVec[i], "),summarize,sumTest = sum(someValue))",
      sep = ""))
  }
  res <- lapply(exlist, eval, envir = dfs)
  return(res)
}
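Called like this (matching the objects defined above), workhorse() returns one summary per categorical variable:
res <- workhorse(qual, listOfDf)
res$category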
Is this more like what you mean? I find your example extremely difficult to understand. In the code below, the method can take any column and then aggregate by it. It can return multiple aggregation functions of someValue. I then find all the column names you would like to aggregate by, and apply the method to that vector.
# Build a method to aggregate by column.
agg.by.col = function(column) {
  by.list = list(mydata$group, mydata$size, mydata[, column])
  names(by.list) = c('group', 'size', column)
  aggregate(mydata$someValue, by = by.list, function(x) c(sum = sum(x), mean = mean(x)))
}
# Find all the column names you want to aggregate by
cols = names(mydata)[!(names(mydata) %in% c('someValue','group','size'))]
# Apply the method to each column name.
lapply (cols, agg.by.col)