How can I use ddply with varying .variables? - r

I use ddply to summarize some data.frame by various categories, like this:
# with both group and size being factors / categorical
split.df <- ddply(mydata,.(group,size),summarize,
sumGroupSize = sum(someValue))
This works smoothly, but often I like to calculate ratios which implies that I need to divide by the group's total. How can I calculate such a total within the same ddply call?
Let's say I'd like to have the share of observations in group A that are in size class 1. Obviously I have to calculate the sum of all observations in size class 1 first.
Sure, I could do this with two ddply calls, but doing it all in one call would be more convenient. Is there a way to do so?
EDIT:
I did not mean to ask an overly specific question, but I realize I was confusing people here. So here's my specific problem. In fact I do have an example that works, but I don't consider it really nifty. Plus it has a shortcoming that I need to overcome: it does not work correctly with apply.
library(plyr)
# make the dataset more "realistic"
mydata <- warpbreaks
names(mydata) <- c("someValue","group","size")
mydata$category <- c(1,2,3)
mydata$categoryA <- c("A","A","X","X","Z","Z")
# add some NA
mydata$category[c(8,10,19)] <- NA
mydata$categoryA[c(14,1,20)] <- NA
# someValue is summarized !
# note we have another, varying category, hence we need the parameter a
calcShares <- function(a, data) {
# !is.na needs to be specific!
tempres1 <- eval(substitute(ddply(data[!is.na(a),],.(group,size,a),summarize,
sumTest = sum(someValue,na.rm=T))),
envir=data, enclos=parent.frame())
tempres2 <- eval(substitute(ddply(data[!is.na(a),],.(group,size),summarize,
sumTestTotal = sum(someValue,na.rm=T))),
envir=data, enclos=parent.frame())
res <- merge(tempres1,tempres2,by=c("group","size"))
res$share <- res$sumTest/res$sumTestTotal
return(res)
}
test <- calcShares(category,mydata)
test2 <- calcShares(categoryA,mydata)
head(test)
head(test2)
As you can see, I intend to run this over different categorical variables. In the example I have only two (category, categoryA), but in fact I have more, so using apply with my function would be really nice. Somehow it does not work correctly, though.
applytest <- head(apply(mydata[grep("^cat",
names(mydata),value=T)],2,calcShares,data=mydata))
.. returns a warning message and a strange name (newX[, i] ) for the category var.
So how can I do THIS a) more elegantly and b) fix the apply issue?

This seems simple, so I may be missing some aspect of your question.
First, define a function that calculates the values you want inside each level of group. Then, instead of using .(group, size) to split the data.frame, use .(group), and apply the newly defined function to each of the split pieces.
library(plyr)
# Create a dataset with the names in your example
mydata <- warpbreaks
names(mydata) <- c("someValue", "group", "size")
# A function that calculates the proportional contribution of each size class
# to the sum of someValue within a level of group
getProps <- function(df) {
with(df, ave(someValue, size, FUN=sum)/sum(someValue))
}
# The call to ddply()
res <- ddply(mydata, .(group),
.fun = function(X) transform(X, PROPS=getProps(X)))
head(res, 12)
# someValue group size PROPS
# 1 26 A L 0.4785203
# 2 30 A L 0.4785203
# 3 54 A L 0.4785203
# 4 25 A L 0.4785203
# 5 70 A L 0.4785203
# 6 52 A L 0.4785203
# 7 51 A L 0.4785203
# 8 26 A L 0.4785203
# 9 67 A L 0.4785203
# 10 18 A M 0.2577566
# 11 21 A M 0.2577566
# 12 29 A M 0.2577566
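If one row per group/size combination is wanted instead of one row per observation, the same shares can be computed on the summarized table. This is a minimal sketch in base R (aggregate and ave swapped in for ddply so it runs without plyr); the names sums and share are made up for illustration.

```r
# Same renamed warpbreaks data as above
mydata <- warpbreaks
names(mydata) <- c("someValue", "group", "size")

# One row per group/size with the summed value
sums <- aggregate(someValue ~ group + size, data = mydata, FUN = sum)

# Divide each sum by the total of its group
sums$share <- sums$someValue / ave(sums$someValue, sums$group, FUN = sum)
sums
```

The share for group A, size L should agree with the PROPS value printed above, since both divide the same sum by the same group total.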

Related

Optimize for a parameter across many different sites

I have data that looks similar to the following
Site Unknown_Parameter X Y Z Predicted Actual
A 2 3 4 2 5 6
A 2 4 3 2 7 5
B 3 6 8 9 12 9
B 3 4 6 2 10 10
etc...
I am trying to create a function that minimizes the RMSE of each site by determining the optimal value for the unknown parameter. I can do this for a single site at a time using the following pseudocode
fn <- function(unknown_parameter) {
df$Predicted <- calculations with unknown_parameter and X Y Z
RMSE <- sqrt(mean((df$Predicted - df$Actual)^2))
RMSE
}
optimize(fn, c(1,10))
I am able to obtain the optimal value for the unknown parameter as well as the RMSE for a single site, but I would like to scale this to do it for every site since I have 100s. Ideally, I would want my output to look like the following
Site Optimal_Value RMSE
A 1.7 2.45
B 1.2 3.24
C 1.3 9.21
etc...
I have been trying to use the split command, but this transforms my data into a list, and I'm not really sure how to work with it. Any thoughts?
split produces a list of data frames, one per level of the input factor. Consider by instead: it also subsets the data frame by one or more factors, but additionally passes each subset to a function. To bind the returned data frames together, run do.call(rbind, ...) on the resulting list.
# USER-DEFINED METHOD RECEIVING subsetted df AS INPUT AND RETURNING dataframe AS OUTPUT
subset_process <- function(subdf) {
fn <- function(unknown_parameter) {
subdf$Predicted <- calculations with unknown_parameter and X Y Z
RMSE <- sqrt(mean((subdf$Predicted - subdf$Actual)^2))
return(RMSE)
}
opt <- optimize(fn, c(1,10))
tmp <- data.frame(Site = subdf$Site[[1]],
Optimal_Value = opt$minimum, # parameter value at the minimum found by optimize()
RMSE = opt$objective) # RMSE at that minimum
return(tmp)
}
# SPLIT + RUN METHOD ON EACH SUBSET
df_list <- by(df, df$Site, FUN=subset_process)
# APPEND ALL DF ELEMENTS INTO MASTER DF
final_df <- do.call(rbind, df_list)
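For reference, here is a self-contained sketch of the same split, optimize, and bind pattern. The toy data and the prediction formula (Predicted = parameter * X) are made up purely so the code runs; substitute your real calculation. fit_site and true_par are hypothetical names.

```r
# Toy data: Actual is generated from a known per-site parameter (assumption)
set.seed(1)
df <- data.frame(Site = rep(c("A", "B"), each = 20),
                 X = runif(40, 1, 10),
                 stringsAsFactors = FALSE)
true_par <- c(A = 2, B = 3)
df$Actual <- unname(true_par[df$Site]) * df$X

# Fit one site: minimise the RMSE over the unknown parameter
fit_site <- function(subdf) {
  rmse <- function(p) sqrt(mean((p * subdf$X - subdf$Actual)^2))
  opt <- optimize(rmse, c(1, 10))
  data.frame(Site = subdf$Site[1],
             Optimal_Value = opt$minimum,
             RMSE = opt$objective)
}

# Split by site, fit each piece, bind the one-row results together
result <- do.call(rbind, lapply(split(df, df$Site), fit_site))
result
```

optimize() returns a list with elements minimum (the parameter value) and objective (the function value there), which is why both can be read from a single call.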

Creating combination of sequences

I am trying to solve following problem:
Consider 5 simple sequences: 0:100, 100:0, rep(0,101), rep(50,101), rep(100,101)
I need sets of 3 numeric variables, which have above sequences in all combinations. Since there are 5 sequences and 3 variables, there can be 5*5*5 combinations, hence total of 12625 (5*5*5*101) numbers in each variable (101 for each sequence).
These can be grouped in a data.frame of 12625 rows and 4 columns. First column (V) will simply have seq(1:12625) (rownumbers can be used in its place). Other 3 columns (A,B,C) will have above 5 sequences in different combinations. For example, the first 101 rows will have 0:100 in all 3 A,B and C. Next 101 rows will have 0:100 in A and B, and 100:0 in C. And so on...
I can create sequences as:
s = list()
s[[1]] = 0:100
s[[2]] = 100:0
s[[3]] = rep(0,101)
s[[4]] = rep(50,101)
s[[5]] = rep(100,101)
But how to proceed further? I do not really need the data frame but I need a function that returns a list containing the values of c(A,B,C) for the number (first or V column) sent to it. The number can obviously vary from 1 to 12625.
How can I create such a function? I would prefer a vectorized solution or one using apply family functions to optimize the speed.
You asked for a vectorized solution, so here's one using only data.table (similar to @SimonG's methodology):
library(data.table)
grd <- CJ(A = seq_len(5), B = seq_len(5), C = seq_len(5))
res <- grd[, lapply(.SD, function(x) unlist(s[x]))]
res
# A B C
# 1: 0 0 0
# 2: 1 1 1
# 3: 2 2 2
# 4: 3 3 3
# 5: 4 4 4
# ---
# 12621: 100 100 100
# 12622: 100 100 100
# 12623: 100 100 100
# 12624: 100 100 100
# 12625: 100 100 100
I came up with two solutions. I find this hard to do with apply and the likes since they tend to give an output that is not so nice to handle (maybe someone can "tame" them better than I can :D)
The first solution uses separate calls to lapply; the second one uses a for loop and some programming no-nos. Personally I prefer the second one, although the first one is faster.
grd <- expand.grid(a=1:5,b=1:5,c=1:5)
# apply-ish
A <- lapply(grd[,1], function(z){ s[[z]] })
B <- lapply(grd[,2], function(z){ s[[z]] })
C <- lapply(grd[,3], function(z){ s[[z]] })
dfr <- data.frame(A=do.call(c,A), B=do.call(c,B), C=do.call(c,C))
# for-ish
mat <- NULL
for(i in 1:nrow(grd)){
cur <- grd[i,]
tmp <- cbind(s[[cur[,1]]],s[[cur[,2]]],s[[cur[,3]]])
mat <- rbind(mat,tmp)
}
The output of both dfr and mat seem to be what you describe.
Cheers!
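If the full table is not needed, the values for a given V can also be computed directly with integer arithmetic. This is a minimal sketch, assuming the ordering described in the question (C cycles through the sequences fastest, then B, then A); get_abc is a hypothetical helper name.

```r
s <- list(0:100, 100:0, rep(0, 101), rep(50, 101), rep(100, 101))

get_abc <- function(v) {
  block <- (v - 1) %/% 101      # which of the 125 combinations (0-based)
  pos   <- (v - 1) %% 101 + 1   # position inside the 101-element sequence
  ia <- block %/% 25 + 1        # sequence index for A (slowest)
  ib <- (block %/% 5) %% 5 + 1  # sequence index for B
  ic <- block %% 5 + 1          # sequence index for C (fastest)
  c(A = s[[ia]][pos], B = s[[ib]][pos], C = s[[ic]][pos])
}

get_abc(1)      # first row: all three at the start of 0:100
get_abc(102)    # first row of the second block: C has switched to 100:0
get_abc(12625)  # last row: all three in rep(100, 101)
```

Nothing is precomputed here, so each lookup only touches three list elements.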

Transferring factor properties between two data frames

I've built a predictive model that uses a large number (30 or so) of independent factor variables. As the dataset I'm using is much larger than the RAM of my machine, I have sampled it for both my training and test sets.
I am now looking to use the model to make predictions over the entire dataset. I'm pulling in the dataset 1 million rows at a time, and each time, I find new levels for some of my factor variables that were not in my training and test set, therefore preventing the model from making predictions.
As there are so many independent factor variables (and so many overall observations), correcting each case by hand is becoming a real pain.
One additional wrinkle to be aware of: there is no guarantee that the order of variables in the overall dataframe and the training/test sets are the same, as I do pre-processing on the data that changes their order.
As such, I'd like to write a function that:
1. Selects and sorts the columns of the new data based on the configuration of my sampled dataframe.
2. Loops through the sampled and new dataframe and designates all factor levels in the new dataframe that do not exist in their corresponding column in the sample dataframe as Other.
3. If a factor level exists in my sample but not the new dataframe, creates the level (with no observations assigned to it) in its corresponding column in the new dataframe.
I've got #1 together, but don't know the best way to do #2 and #3. If it were any other language, I'd use for loops, but I know that's frowned upon in R.
Here's a reproducible example:
sampleData <- data.frame(abacus=factor(c("a","b","a","a","a")), montreal=factor(c("f","f","f","f","a")), boston=factor(c("z","y","z","z","q")))
dataset <- data.frame(florida=factor(c("e","q","z","d","b", "a")), montreal=factor(c("f","f","f","f","a", "a")), boston=factor(c("m","y","z","z","r", "f")), abacus=factor(c("a","b","z","a","a", "g")))
sampleData
abacus montreal boston
1 a f z
2 b f y
3 a f z
4 a f z
5 a a q
dataset
florida montreal boston abacus
1 e f m a
2 q f y b
3 z f z z
4 d f z a
5 b a r a
6 a a f g
sampleData <- sampleData[,order(names(sampleData))]
dataset <- dataset[,order(names(dataset))]
dataset <- dataset[,colnames(sampleData)]
Below is what I would want dataset to look like once this function is complete. (I don't really care about the final ordering of the columns in dataset; I'm just thinking it's necessary for the loop, or whatever you guys deem best, to work.) Notice that the column dataset$florida is omitted:
dataset
montreal boston abacus
1 f Other a
2 f y b
3 f z Other
4 f z a
5 a Other a
6 a Other Other
Also note that in dataset, the 'q' level for boston does not appear, although it does appear in sampleData. Therefore, the levels will differ if we omit 'q' from the factor in dataset, meaning that in 'dataset', we need boston to include the level q, but to have no actual observations assigned to it.
Last, note that as I'm doing this on 30 variables at a time, I need a programmatic solution and not one that reassigns factors by using explicit column names.
This seems like it might work.
From this function, the new levels returned for the boston column are Other y z q, even though there are no values for the level q. Regarding your comment in the original question, the only way I've found to effectively apply new factor levels is also with a for loop like you, and it's worked well for me so far.
A function, findOthers() :
findOthers <- function(newData) ## might want a second argument for sampleData
{
## take only those columns that are in 'sampleData'
dset <- newData[, names(sampleData)]
## change the 'dset' columns to character
dsetvals <- sapply(dset, as.character)
## change the 'sampleData' levels to character
samplevs <- sapply(sampleData, function(y) as.character(levels(y)))
## find the unmatched elements
others <- sapply(seq(ncol(dset)), function(i){
!(dsetvals[,i] %in% samplevs[[i]])
})
## change the unmatched elements to 'Other'
dsetvals[others] <- "Other"
## create new data frame
newDset <- data.frame(dsetvals)
## get the new levels for each column
newLevs <- lapply(seq(newDset), function(i){
Get <- c(as.character(newDset[[i]]), as.character(samplevs[[i]]))
ul <- unique(unlist(Get))
})
## set the new levels for each column
for(i in seq(newDset)) newDset[,i] <- factor(newDset[,i], newLevs[[i]])
## result
newDset
}
Your sample data :
sampleData <- data.frame(abacus=factor(c("a","b","a","a","a")),
montreal=factor(c("f","f","f","f","a")),
boston=factor(c("z","y","z","z","q")))
dataset <- data.frame(florida=factor(c("e","q","z","d","b", "a")),
montreal=factor(c("f","f","f","f","a", "a")),
boston=factor(c("m","y","z","z","r", "f")),
abacus=factor(c("a","b","z","a","a", "g")))
Call findOthers() and view the result with the new factor levels :
(new <- findOthers(newData = dataset))
# abacus montreal boston
# 1 a f Other
# 2 b f y
# 3 Other f z
# 4 a f z
# 5 a a Other
# 6 Other a Other
as.list(new)
# $abacus
# [1] a b Other a a Other
# Levels: a b Other
#
# $montreal
# [1] f f f f a a
# Levels: f a
#
# $boston
# [1] Other y z z Other Other
# Levels: Other y z q ## note the new level 'q', with no value in the column
To answer just the question you ask (rather than suggest what you might do instead): here we have to convert each column to character, replace the value, and then re-factorise. Note that sapply would collapse the data frame into a matrix, so lapply with sampleData[] <- is used to keep the data.frame structure.
sampleData[] <- lapply(sampleData, as.character)
sampleData[] <- lapply(sampleData, function(x) gsub("q", "other", x))
sampleData[] <- lapply(sampleData, as.factor)
This depends on "q" only inhabiting one column. Otherwise you just have to edit each column separately to get only the changes you want:
sampleData[] <- lapply(sampleData, as.character)
sampleData$boston <- gsub("q", "other", sampleData$boston)
sampleData[] <- lapply(sampleData, as.factor)
However, I think you should just filter these rows out of the train and test data: they are so few that they will make absolutely no difference to your model. Otherwise you're making it difficult.
summary(dataset)
dataset <- dataset[dataset$abacus!="z", ]
If the dataset is very large and that is why you are not doing this, you may want to use something like the dplyr package and its filter function.
Does this accomplish what you want?
# Select and sort the columns of dataset as in sampleData
sampleData <- sampleData[, order(names(sampleData))]
dataset <- dataset[, colnames(sampleData)]
f <- function(dataset, sampleData, col) {
# For a given column col, assign "Other" to all factor levels
# in dataset[col] that do not exist in sampleData[col].
# If a factor level exists in sampleData[col] but not in dataset[col],
# preserve it as a factor level.
v <- factor(dataset[, col], levels = c(levels(sampleData[, col]), "Other"))
v[is.na(v)] <- "Other"
v
}
# Apply f to all columns of dataset
l <- lapply(colnames(dataset), function(x) f(dataset, sampleData, x))
res <- data.frame(l) # Format into a data frame
colnames(res) <- colnames(dataset) # Assign the names of dataset
dataset <- res # Assign the result to dataset
You can test as follows
> dataset[, "boston"]
[1] Other y z z Other Other
Levels: q y z Other
> dataset[, "montreal"]
[1] f f f f a a
Levels: a f Other
> dataset[, "abacus"]
[1] a b Other a a Other
Levels: a b Other
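The core step of f can be seen in isolation on a single vector. This is only an illustrative sketch using the boston values from the question; ref, new_vals and aligned are made-up names.

```r
ref <- factor(c("z", "y", "z", "z", "q"))   # boston in sampleData
new_vals <- c("m", "y", "z", "z", "r", "f") # boston in dataset

# Values not listed in levels become NA; unused levels such as "q" are kept
aligned <- factor(new_vals, levels = c(levels(ref), "Other"))

# Turn the NAs (unseen values) into the explicit "Other" level
aligned[is.na(aligned)] <- "Other"

levels(aligned)
table(aligned)
```

The table shows a count of zero for q, which is the "level with no observations" asked for in the question.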

Incorporating external function in R's apply

Given this data.frame
x y z
1 1 3 5
2 2 4 6
I'd like to add the values of columns x and z plus a coefficient of 10, for every row in dat.
The intended result is this
x y z result
1 1 3 5 16 #(1+5+10)
2 2 4 6 18 #(2+6+10)
But why doesn't this code produce the desired result?
dat <- data.frame(x=c(1,2), y=c(3,4), z=c(5,6))
Coeff <- 10
# Function
process.xz <- function(v1,v2,cf) {
return(v1+v2+cf)
}
# It breaks here
sm <- apply(dat[,c('x','z')], 1, process.xz(dat$x,dat$y,Coeff ))
# Later I'd do this:
# cbind(dat,sm);
I wouldn't use an apply here. Since the addition + operator is vectorized, you can get the sum using
> process.xz(dat$x, dat$z, Coeff)
[1] 16 18
To write this in your data.frame, don't use cbind, just assign it directly:
dat$result <- process.xz(dat$x, dat$z, Coeff)
It fails because apply doesn't work like that: you must pass the function itself (not the result of calling it), followed by any additional parameters. Each row of the data frame is then passed (as a single vector) as the first argument to that function.
dat <- data.frame(x=c(1,2), y=c(3,4), z=c(5,6))
Coeff <- 10
# Function
process.xz <- function(x,cf) {
return(x[1]+x[2]+cf)
}
sm <- apply(dat[,c('x','z')], 1, process.xz,cf=Coeff)
I completely agree that there's no point in using apply here though - but it's good to understand anyway.
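To make the difference concrete, here is a small sketch that computes the same result both ways; row_sum is a made-up name. apply() converts the selected columns to a matrix and hands each row to the function as one named vector, whereas rowSums() works on the whole block at once.

```r
dat <- data.frame(x = c(1, 2), y = c(3, 4), z = c(5, 6))
Coeff <- 10

# apply(): each row arrives as a single named numeric vector
row_sum <- apply(dat[, c("x", "z")], 1,
                 function(row) row[["x"]] + row[["z"]] + Coeff)

# Vectorized equivalent, assigned straight into the data frame
dat$result <- rowSums(dat[, c("x", "z")]) + Coeff
dat
```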

Aggregate over categories that contain NAs with ddply and lapply?

I would like to aggregate a data.frame over 3 categories, with one of them varying. Unfortunately this one varying category contains NAs (actually it's the reason why it needs to vary). Thus I created a list of data.frames. Every data.frame within this list contains only complete cases with respect to three variables (with only one of them changing).
Let's reproduce this:
library(plyr)
mydata <- warpbreaks
names(mydata) <- c("someValue","group","size")
mydata$category <- c(1,2,3)
mydata$categoryA <- c("A","A","X","X","Z","Z")
# add some NA
mydata$category[c(8,10,19)] <- NA
mydata$categoryA[c(14,1,20)] <- NA
# create a list of logical vectors: TRUE where the category is not NA
noNAList <- function(vec){
res <- !is.na(vec)
return(res)
}
testTF <- lapply(mydata[,c("category","categoryA")],noNAList)
# create a list of data.frames
selectDF <- function(TFvec){
res <- mydata[TFvec,]
return(res)
}
# check x and see that it may contain NAs as long
# as it's not in one of the 3 categories I want to aggregate over
x <-lapply(testTF,selectDF)
## let's ddply get to work
doddply <- function(df){
ddply(df,.(group,size),summarize,sumTest = sum(someValue))
}
y <- lapply(x, doddply);y
y comes very close to what I want to get
$category
group size sumTest
1 A L 375
2 A M 198
3 A H 185
4 B L 254
5 B M 259
6 B H 169
$categoryA
group size sumTest
1 A L 375
2 A M 204
3 A H 200
4 B L 254
5 B M 259
6 B H 169
But I need to implement aggregation over a third varying variable, which is in this case category and categoryA. Just like:
group size category sumTest sumTestTotal
1 A H 1 46 221
2 A H 2 46 221
3 A H 3 93 221
and so forth. How can I add names(x) to lapply, or do I need a loop or environment here?
EDIT:
Note that I want EITHER category OR categoryA added to the mix. In reality I have about 15 mutually exclusive categorical vars.
I think you might be making this really hard on yourself, if I understand your question correctly.
If you want to aggregate the data.frame 'mydata' by three (or four) variables, you would simply do this:
aggregate(someValue ~ group + size + category + categoryA, sum, data=mydata)
group size category categoryA someValue
1 A L 1 A 51
2 B L 1 A 19
3 A M 1 A 17
4 B M 1 A 63
aggregate will automatically remove rows that include NA in any of the categories. If someValue is sometimes NA, then you can add the parameter na.rm=T.
I also noted that you put a lot of unnecessary code into functions. For example:
# create a list of data.frames
selectDF <- function(TFvec){
res <- mydata[TFvec,]
return(res)
}
Can be written like:
selectDF <- function(TFvec) mydata[TFvec,]
Also, using lapply to create a list of two data frames without the NA is overkill. Try this code:
x = list(mydata[!is.na(mydata$category),],mydata[!is.na(mydata$categoryA),])
I know the question explicitly requests a ddply()/lapply() solution.
But ... if you are willing to come on over to the dark side, here is a data.table()-based function that should do the trick:
# Convert mydata to a data.table
library(data.table)
dt <- data.table(mydata, key = c("group", "size"))
# Define workhorse function
myfunction <- function(dt, VAR) {
E <- as.name(substitute(VAR))
dt[i = !is.na(eval(E)),
j = {n <- sum(.SD[,someValue])
.SD[, list(sumTest = sum(someValue),
sumTestTotal = n,
share = sum(someValue)/n),
by = VAR]
},
by = key(dt)]
}
# Test it out
s1 <- myfunction(dt, "category")
s2 <- myfunction(dt, "categoryA")
ADDED ON EDIT
Here's how you could run this for a vector of different categorical variables:
catVars <- c("category", "categoryA")
ll <- lapply(catVars,
FUN = function(X) {
do.call(myfunction, list(dt, X))
})
names(ll) <- catVars
lapply(ll, head, 3)
# $category
# group size category sumTest sumTestTotal share
# [1,] A H 2 46 185 0.2486486
# [2,] A H 3 93 185 0.5027027
# [3,] A H 1 46 185 0.2486486
#
# $categoryA
# group size categoryA sumTest sumTestTotal share
# [1,] A H A 79 200 0.395
# [2,] A H X 68 200 0.340
# [3,] A H Z 53 200 0.265
Finally, I found a solution that might not be as slick as Josh's, but it works without any dark forces (data.table). You may laugh. Here's my reproducible example using the same sample data as in the question.
qual <- c("category","categoryA")
# get T / F vectors
noNAList <- function(vec){
res <- !is.na(vec)
return(res)
}
selectDF <- function(TFvec) mydata[TFvec,]
NAcheck <- lapply(mydata[,qual],noNAList)
# create a list of data.frames
listOfDf <- lapply(NAcheck,selectDF)
workhorse <- function(charVec,listOfDf){
dfs <- list2env(listOfDf)
# create expression list
exlist <- list()
for(i in seq_along(charVec)){ # use the charVec argument rather than the global qual
exlist[[charVec[i]]] <- parse(text=paste("ddply(",charVec[i],
",.(group,size,",charVec[i],"),summarize,sumTest = sum(someValue))",
sep=""))
}
res <- lapply(exlist,eval,envir=dfs)
return(res)
}
Is this more like what you mean? I find your example extremely difficult to understand. In the below code, the method can take any column, and then aggregate by it. It can return multiple aggregation functions of someValue. I then find all the column names you would like to aggregate by, and then apply the function to that vector.
# Build a method to aggregate by column.
agg.by.col = function (column) {
by.list=list(mydata$group,mydata$size,mydata[,column])
names(by.list) = c('group','size',column)
aggregate(mydata$someValue, by=by.list, function(x) c(sum=sum(x),mean=mean(x)))
}
# Find all the column names you want to aggregate by
cols = names(mydata)[!(names(mydata) %in% c('someValue','group','size'))]
# Apply the method to each column name.
lapply (cols, agg.by.col)
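For completeness, here is a base R sketch that produces the sumTest, sumTestTotal and share columns for each categorical variable in turn. It is not the approach from any of the answers above: aggregate and ave are swapped in for ddply, calc_shares is a hypothetical name, and the totals are taken over the rows where the varying category is not NA.

```r
# Sample data as in the question
mydata <- warpbreaks
names(mydata) <- c("someValue", "group", "size")
mydata$category <- c(1, 2, 3)
mydata$categoryA <- c("A", "A", "X", "X", "Z", "Z")
mydata$category[c(8, 10, 19)] <- NA
mydata$categoryA[c(14, 1, 20)] <- NA

calc_shares <- function(cat_name, data) {
  # Keep only rows where the varying category is known
  d <- data[!is.na(data[[cat_name]]), ]
  # Sum someValue per group/size/category
  part <- aggregate(d["someValue"],
                    by = d[c("group", "size", cat_name)], FUN = sum)
  names(part)[names(part) == "someValue"] <- "sumTest"
  # Total per group/size, repeated on every row of that group/size
  part$sumTestTotal <- ave(part$sumTest, part$group, part$size, FUN = sum)
  part$share <- part$sumTest / part$sumTestTotal
  part
}

cat_vars <- c("category", "categoryA")
shares <- lapply(setNames(cat_vars, cat_vars), calc_shares, data = mydata)
lapply(shares, head, 3)
```

Because the column is passed by name as a string, the function can be mapped over any number of categorical variables with lapply, and the result is a named list with one data frame per variable.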

Resources