Split Data into groups of equal means - r

I'm looking for a way to split a data frame into groups of equal size (essentially same number of rows in each group), whose groups have a nearly equal mean.
User Data
1 5.0
2 4.5
3 3.5
4 6.0
5 7.0
6 6.5
7 5.5
8 6.2
9 5.7
10 5.9
This is very similar to this request; however, that one only splits the data into 2 groups.
My actual dataset contains anywhere from 75-150 rows, and I need to split it into anywhere from 5-10 groups of equal mean and fairly equal size.
I've researched on Google & Stack Exchange for the last few days, and I'm just not having much luck. Any guidance would be great.
Thanks in advance!
More details:
Maybe I need to provide some more details; below I've included a real dataset. We are a transportation company, and this data set provides Driver ID, Miles, and Gallons. What I have been doing is reading the data into R and adding an MPG column like so:
data <- read.csv('filename')
data$MPG <- data$Miles / data$Gallons
Then I tried the two provided answers below. Arun's idea gives me almost equal group sizes (9 members per group, 10 groups), however the variation of the means is large, from 6.615 - 7.093 which is too large of a variation for me to start off with. Thomas' idea gets a little bit tighter variation, but the group sizes are all different from 6 - 13 members.
What we are looking to do is improve fleet MPG, and we're going to accomplish this with a team based competition, so I need to randomly put the teams together with them all starting from relatively the same group MPG.
Maybe that helps and can lead us in the correct direction? I tried doing this just in my programming language, but it locks the computer up every time, so I figured that R would probably be able to process the data better.
Thanks again!

If similar means is really all that matters, I've put together a simulation below that basically looks at a bunch of different combinations of the data (n) for a particular group size (k) and then minimizes the variance of the group means. With that minimization you can then extract that grouping from the simulation results.
df <- data.frame(User=1:1000,Data=rnorm(1000,0,1)) # example data
myfun = function(){
k <- 5 # number of groups
tmp <- seq_len(nrow(df)) %% k # balanced group labels 0..k-1, adapted from @qwwqwwq's answer
thisgroup <- sample(tmp, dim(df)[1], FALSE) # pull a sample
# thisgroup <- sample(1:k,dim(df)[1],TRUE) # original version
thisavg <- as.vector(by(df$Data, thisgroup, mean)) # group means
thisvar <- var(thisavg) # variance of means
return(list(group=thisgroup, avgs=thisavg, var=thisvar))
}
n <- 1000 # number of simulations
sorts <- replicate(n, myfun(), simplify=FALSE)
wh <- which.min(sapply(sorts, function(x) x$var)) # minimization
# sorts[[wh]] # this is the sample you want
split(df, sorts[[wh]]$group) # list of separate dataframes for each group
You could also let the number of groups vary, if you don't care how many cases are in each group, by replacing the fixed k <- 5 line inside the function with a random draw from the range of group counts you're willing to have.
There are probably other ways to do this, though.
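For the fleet data described in the question, the same idea can be sketched end to end as follows. This is only an illustration under assumptions: the 90 MPG values are simulated stand-ins for the real drivers, and the names labels, one_try, and best are made up for the example. Because every trial shuffles the same balanced label vector, the group sizes stay equal by construction.

```r
set.seed(1)                                   # reproducible illustration
# Simulated stand-in for the real driver data (an assumption)
df <- data.frame(User = 1:90, Data = rnorm(90, mean = 6.8, sd = 0.6))
k <- 10                                       # number of teams
labels <- seq_len(nrow(df)) %% k              # labels 0..k-1, each used 9 times

one_try <- function() {
  grp  <- sample(labels)                      # shuffle labels; sizes stay equal
  avgs <- as.vector(tapply(df$Data, grp, mean))
  list(group = grp, var = var(avgs))          # variance of the group means
}

tries <- replicate(2000, one_try(), simplify = FALSE)
best  <- tries[[which.min(vapply(tries, function(x) x$var, numeric(1)))]]

group_sizes <- table(best$group)              # team sizes
group_means <- tapply(df$Data, best$group, mean)
print(group_sizes)
print(round(group_means, 3))
```

Raising the number of trials tightens the spread of the means at the cost of run time; it is a random search, so it finds a good grouping rather than a provably optimal one.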

Going by Thomas' idea, here's a brute-force/greedy approach, which'll give more or less the same values (you can opt for more repetitions until you agree with the closeness of the solution).
# Assuming the data you provided is in `df`
grp <- 5
myfun <- function() {
samp <- sample(nrow(df))
s.mean <- tapply(df$Data, samp %% grp, mean)
s.var <- var(s.mean)
list(samp, s.mean, s.var)
}
out <- replicate(1000, myfun(), simplify=FALSE)
min.pos <- which.min(sapply(out, `[[`, 3))
min.idx <- out[[min.pos]][[1]]
split(df$Data, min.idx %% grp) # the labels min.idx %% grp line up with the rows of df
$`0`
[1] 6.0 5.7
$`1`
[1] 5.5 5.9
$`2`
[1] 5.0 6.2
$`3`
[1] 3.5 7.0
$`4`
[1] 4.5 6.5
This is what out[min.pos] looks like:
out[min.pos]
[[1]]
[[1]][[1]]
[1] 7 9 8 5 3 4 1 2 10 6
[[1]][[2]]
0 1 2 3 4
5.85 5.70 5.60 5.25 5.50
[[1]][[3]]
[1] 0.05075

Simplest way I can think of: sort the data, take the indices modulo the number of groups, and you're done. I think this should work well if the data are normally distributed. It has the advantage that the groups are as equally sized as possible.
mpg <- rnorm(150)
mpg <- sort(mpg)
ngroups = 13
df = data.frame( mpg=mpg, group=seq(length(mpg))%%ngroups)
tapply(df$mpg, df$group, mean)
           0            1            2            3            4            5            6            7            8
 0.080400272 -0.110797283 -0.046698548 -0.014177675  0.024410834  0.048370962  0.066265303  0.087119914 -0.062259638
           9           10           11           12
-0.042172496 -0.003451581  0.033853024  0.056947458
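One possible variation on the sort-and-modulo idea, not part of the answer above, is a serpentine ("snake draft") assignment: after sorting, labels run 1..k and then k..1, so no group always receives the lowest value in each block. The sketch below uses simulated MPG values as a stand-in for real data, and the names pattern, group, sizes, and means are illustrative.

```r
set.seed(42)
mpg <- rnorm(150, mean = 6.8, sd = 0.5)        # simulated stand-in data (assumption)
ngroups <- 10

ord <- order(mpg)                              # row positions from lowest to highest mpg
pattern <- c(seq_len(ngroups), rev(seq_len(ngroups)))  # 1..k followed by k..1

group <- integer(length(mpg))
group[ord] <- rep(pattern, length.out = length(mpg))   # deal labels along the sorted order

sizes <- table(group)                          # members per group
means <- tapply(mpg, group, mean)              # mean mpg per group
print(sizes)
print(round(means, 3))
```

With 150 rows and 10 groups every group ends up with exactly 15 members; whether the means are close enough still has to be checked on the actual data.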

Related

How do I describe() a column while meeting criteria for a unique value in another column?

I'm very new to R so excuse any incorrect language. I'm not sure if I even asked this question correctly, but here is the problem I'm dealing with.
Suppose I have a data frame that contains lengths and weights for 10 different species of fish, with 100 samples for each species (1000 rows of data). Is it possible to return the describe() output of a column for each unique species of fish without having to create an object for each species?
For example if I write:
Catfish <- filter(dataframe, dataframe$lengths == "Catfish")
describe(Catfish$lengths)
Do I have to manually create an object (Catfish, for example) for each species and then describe it? Or is there a simpler way to return describe() for the lengths of each unique species directly from my original data frame? Hopefully I asked this clearly enough. Thanks for any help!
I think what you might want to look into is a split-apply-combine technique (example below)
df
  value ID
1     1 ID
2     2 ID
3     3 PD
4     4 PD
5     5 ID
library(psych) # describe() with the output format shown below comes from the psych package
# split by the grouping variable (in your case, the fish species)
df_split <- split(df, df$ID)
# apply a function (in your case, describe)
df_split <- lapply(df_split, function(x) { x["ID"] <- NULL; x }) # drop ID for easier merging
df_split <- lapply(df_split, describe)
# combine
Result <- Reduce(rbind, df_split)
Result
    vars n mean   sd median trimmed  mad min max range skew kurtosis  se
X1     1 3 2.67 2.08    2.0    2.67 1.48   1   5     4 0.29    -2.33 1.2
X11    1 2 3.50 0.71    3.5    3.50 0.74   3   4     1 0.00    -2.75 0.5
What would improve this script is to add the specific grouping variable to each row (so "ID" in this example). But I think this provides a starting point for you.
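One way to keep the grouping variable attached to each row is to split only the column of interest and let the list names become row names when combining. The sketch below uses base R only, with a small hand-written summary function standing in for describe(); the data frame fish, its column names, and summarise_one are made up for illustration.

```r
set.seed(7)
# Hypothetical fish data: 3 species, 4 length measurements each
fish <- data.frame(
  species = rep(c("Catfish", "Bass", "Trout"), each = 4),
  lengths = round(runif(12, min = 20, max = 60), 1)
)

# Small stand-in for describe(): a named numeric vector per group
summarise_one <- function(x) {
  c(n = length(x), mean = mean(x), sd = sd(x), min = min(x), max = max(x))
}

# split -> apply -> combine; the names of the split list become the row names
by_species <- do.call(rbind, lapply(split(fish$lengths, fish$species), summarise_one))
print(round(by_species, 2))
```

The psych package also provides describeBy(), which takes a grouping argument; its documentation is worth checking if the full describe() output is needed per group.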

Selecting all values above a threshold and then a random sample of the values below the threshold

Using the cars dataset for example (shown below), I want to create a new data.frame by selecting all of the rows where cars$speed >= 12 and a random sample of 5 from the rows where cars$speed < 12.
> head(cars)
  speed dist
1     4    2
2     4   10
3     7    4
4     7   22
5     8   16
6     9   10
I can do this in multiple steps by first selecting the rows where speed >= 12 using
one <- cars[cars$speed >=12 , ]
then select a random sample of 5 where cars$speed < 12
two <- cars[sample( which( cars$speed < 12) , 5 ) , ]
and then rbind() the two data.frames.
However, I want to add the code to a larger loop and a single step with the same outcome would be greatly preferred.
Thanks in advance for any suggestions.
I think you still need to rbind the results but you can do it in one line of code. I've used the sample_n function from dplyr for the sampling:
library(dplyr)
rbind(sample_n(cars[cars$speed<12,], 5), cars[cars$speed>=12,])
dplyr also has the bind_rows function (the successor to the older rbind_list) if you need something faster than rbind.
I'm not sure how much more this can be condensed, but here's one approach:
library(data.table)
cdt <- data.table(cars)
set.seed(1)
##
rbind(
cdt[speed>11,],
cdt[speed<12,][sample(1:.N,5),])
and just to show that the 5 rows were correctly sampled,
R> rbind(
cdt[speed>11,],
cdt[speed<12,][sample(1:.N,5),]
)[order(speed)][1:8,]
   speed dist
1:     4    2
2:     7   22
3:     7    4
4:     8   16
5:    10   18
6:    12   14
7:    12   20
8:    12   24
This isn't necessarily a data.table specific approach, I just find the syntax convenient to use. Also, if your actual data is very large, you might see a performance benefit from using a data.table and replacing rbind(<first subset>, <second subset>) with rbindlist(list(<first subset>,<second subset>)).
Edit:
Thanks to @Arun, another data.table approach:
cdt[
,if (grp) .SD else .SD[sample(.N, 5L)],
.(grp=speed>11),
.SDcols=1:2]
Perhaps a less elegant solution, but one that just uses plain vanilla R:
cars[c(which(cars$speed >=12),sample(which(cars$speed < 12) , 5 )) ,]
This might not speed things up but it avoids the additional call of rbind.
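Since the goal is to call this inside a larger loop, the base R one-liner can be wrapped in a small function. This is a sketch; the function name and its arguments are made up for illustration. It samples positions with sample.int, which avoids the surprise sample() produces when handed a single number.

```r
# Keep every row at or above the threshold, plus a random n_below rows under it
sample_below_keep_above <- function(d, threshold = 12, n_below = 5) {
  above <- which(d$speed >= threshold)
  below <- which(d$speed < threshold)
  picked <- below[sample.int(length(below), n_below)]  # errors if n_below > length(below)
  d[c(above, picked), ]
}

set.seed(1)
res <- sample_below_keep_above(cars)
print(nrow(res))
print(head(res[order(res$speed), ], 8))
```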

finding set of multinomial combinations

Let's say I have a vector of integers 1:6
w=1:6
I am attempting to obtain a matrix of 90 rows and 6 columns that contains the multinomial combinations from these 6 integers taken as 3 groups of size 2.
6!/(2!*2!*2!)=90
So, columns 1 and 2 of the matrix would represent group 1, columns 3 and 4 would represent group 2 and columns 5 and 6 would represent group 3. Something like:
1 2 3 4 5 6
1 2 3 5 4 6
1 2 3 6 4 5
1 2 4 5 3 6
1 2 4 6 3 5
...
Ultimately, I would want to expand this to other multinomial combinations of limited size (because the numbers get large rather quickly) but I am having trouble getting things to work. I've found several functions that do binomial combinations (only 2 groups) but I could not locate any functions that do this when the number of groups is greater than 2.
I've tried two approaches to this:
1. Building up the matrix from nothing using for loops, and attempting things with the reshape package (thinking melt() might offer something for this).
2. Working backwards from the permutation matrix (720 rows) by attempting to retain unique rows within groups and/or removing duplicated rows within groups.
Neither worked for me.
The permutation matrix can be obtained with
library(gtools)
dat=permutations(6, 6, set=TRUE, repeats.allowed=FALSE)
I think working backwards from the full permutation matrix is a bit excessive, but I'm trying anything at this point.
Is there a package with a prebuilt function for this? Does anyone have any ideas how I should proceed?
Here is how you can implement your "working backwards" approach:
gps <- list(1:2, 3:4, 5:6)
get.col <- function(x, j) x[, j]
is.ordered <- function(x) !colSums(diff(t(x)) < 0)
is.valid <- Reduce(`&`, Map(is.ordered, Map(get.col, list(dat), gps)))
dat <- dat[is.valid, ]
nrow(dat)
# [1] 90
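If building the full permutation matrix becomes too expensive for larger cases, the rows can also be constructed directly: choose the first group with combn, then recurse on the remaining items. The function below is a sketch under that assumption; multinomial_groups is a made-up name, not a function from gtools or any other package.

```r
# Enumerate all ways to split `items` into consecutive groups of the given sizes.
# Returns a matrix with one row per arrangement, groups laid out left to right.
multinomial_groups <- function(items, sizes) {
  stopifnot(sum(sizes) == length(items))
  if (length(sizes) == 1L) {
    return(matrix(items, nrow = 1L))           # last group: only one way
  }
  first <- combn(length(items), sizes[1L])     # positions chosen for the first group
  rows <- lapply(seq_len(ncol(first)), function(j) {
    idx  <- first[, j]
    rest <- multinomial_groups(items[-idx], sizes[-1L])
    # repeat the chosen first group on every row of the remaining arrangements
    cbind(matrix(items[idx], nrow = nrow(rest), ncol = length(idx), byrow = TRUE), rest)
  })
  do.call(rbind, rows)
}

res <- multinomial_groups(1:6, c(2, 2, 2))
print(nrow(res))
print(head(res, 5))
```

The recursion visits only valid arrangements, so the work grows with the size of the result rather than with the factorial of the number of items.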

Summary in R for frequency tables?

I have a set of user recommendations
review=matrix(c(5:1,10,2,1,1,2), nrow=5, ncol=2, dimnames=list(NULL,c("Star","Votes")))
and wanted to use summary(review) to show basic properties: mean, median, quartiles, and min/max.
But it gives back the summary of both columns. I refrain from using data.frame because the 'Star' values are ordered factors.
How can I tell R that Star is an ordered numeric score and that Votes holds the frequency of each score?
I'm not exactly sure what you mean by taking the mean in general if Star is supposed to be an ordered factor. However, in the example you give where Star is actually a set of numeric values, you can use the following:
library(Hmisc)
R> review=matrix(c(5:1,10,2,1,1,2), nrow=5, ncol=2, dimnames=list(NULL,c("Star","Votes")))
R> wtd.mean(review[, 1], weights = review[, 2])
[1] 4.0625
R> wtd.quantile(review[, 1], weights = review[, 2])
0% 25% 50% 75% 100%
1.00 3.75 5.00 5.00 5.00
I don't understand what's the problem. Why shouldn't you use data.frame?
rv <- data.frame(star = ordered(review[, 1]), votes = review[, 2])
You should convert your data.frame to vector:
( vts <- with(rv, rep(star, votes)) )
[1] 5 5 5 5 5 5 5 5 5 5 4 4 3 2 1 1
Levels: 1 < 2 < 3 < 4 < 5
Then do the summary... I just don't know what kind of summary, since summary will bring you back to the start. O_o
summary(vts)
1 2 3 4 5
2 1 1 2 10
EDIT (on @Prasad's suggestion)
Since vts is an ordered factor, you should convert it to numeric and then calculate the summary (for the moment I will disregard the background statistical issues):
nvts <- as.numeric(levels(vts)[vts]) ## numeric conversion
summary(nvts) ## "ordinary" summary
fivenum(nvts) ## Tukey's five number summary
Just to clarify -- when you say you would like "mean, median, quartiles and min/max", you're talking in terms of number of stars? e.g mean = 4.062 stars?
Then using aL3xa's code, would something like summary(as.numeric(as.character(vts))) be what you want?
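Put together, the expand-then-summarise idea from the answers above can be sketched in base R without a factor detour, assuming the star values are meant to be treated as plain numbers. The variable name stars is made up for the example.

```r
review <- matrix(c(5:1, 10, 2, 1, 1, 2), nrow = 5, ncol = 2,
                 dimnames = list(NULL, c("Star", "Votes")))

# Repeat each star value by its vote count to recover one entry per vote
stars <- rep(review[, "Star"], times = review[, "Votes"])

print(summary(stars))   # min, quartiles, median, mean, max of the star ratings
print(fivenum(stars))   # Tukey's five-number summary
```

The resulting mean agrees with the weighted mean of 4.0625 shown in the Hmisc answer.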

How do I perform a function on each row of a data frame and have just one element of the output inserted as a new column in that row

It is easy to do an exact binomial test on two values, but what happens if one wants to do the test on a whole set of numbers of successes and numbers of trials? I created a data frame of test sensitivities and potential numbers of enrollees in a study, and then for each row I calculate how many successes that would be. Here is the code.
sens <-seq(from=.1, to=.5, by=0.05)
enroll <-seq(from=20, to=200, by=20)
df <-expand.grid(sens=sens,enroll=enroll)
df <-transform(df,succes=sens*enroll)
But now how do I use each row's combination of successes and number of trials to do the binomial test?
I am only interested in the upper limit of the 95% confidence interval of the binomial test. I want that single number to be added to the data frame as a column called "upper.limit"
I thought of something along the lines of
binom.test(succes,enroll)$conf.int
alas, conf.int gives something such as
[1] 0.1266556 0.2918427
attr(,"conf.level")
[1] 0.95
All I want is just 0.2918427
Furthermore I have a feeling that there has to be do.call in there somewhere and maybe even an lapply but I do not know how that will go through the whole data frame. Or should I perhaps be using plyr?
Clearly my head is spinning. Please make it stop.
If this gives you (almost) what you want, then try this:
binom.test(succes,enroll)$conf.int[2]
And apply across the board or across the rows as it were:
> df$UCL <- apply(df, 1, function(x) binom.test(x[3],x[2])$conf.int[2] )
> head(df)
  sens enroll succes       UCL
1 0.10     20      2 0.3169827
2 0.15     20      3 0.3789268
3 0.20     20      4 0.4366140
4 0.25     20      5 0.4910459
5 0.30     20      6 0.5427892
6 0.35     20      7 0.5921885
Here you go:
R> newres <- do.call(rbind, apply(df, 1, function(x) {
+ bt <- binom.test(x[3], x[2])$conf.int;
+ newdf <- data.frame(t(x), UCL=bt[2]) }))
R>
R> head(newres)
  sens enroll succes     UCL
1 0.10     20      2 0.31698
2 0.15     20      3 0.37893
3 0.20     20      4 0.43661
4 0.25     20      5 0.49105
5 0.30     20      6 0.54279
6 0.35     20      7 0.59219
R>
This uses apply to loop over your existing data, compute the test, and return the value you want by sticking it into a new (one-row) data.frame. We then glue all 90 of those data.frame objects into a single new one with do.call(rbind, ...) over the list we got from apply.
Ah yes, if you just want to directly insert a single column the other answer rocks as it is simple. My longer answer shows how to grow or construct a data.frame during the sweep of apply.
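An alternative to apply, which converts each row to a plain vector and relies on column positions, is mapply over the two columns by name. The sketch below rebuilds the data frame from the question and adds the upper.limit column that was asked for; the round() call is an added safeguard (an assumption) against floating-point products that are not exactly whole numbers.

```r
sens   <- seq(from = 0.1, to = 0.5, by = 0.05)
enroll <- seq(from = 20, to = 200, by = 20)
df <- expand.grid(sens = sens, enroll = enroll)
df$succes <- round(df$sens * df$enroll)        # whole-number successes for binom.test

# Upper bound of the 95% confidence interval, one test per row
df$upper.limit <- mapply(
  function(x, n) binom.test(x, n)$conf.int[2],
  df$succes, df$enroll
)
print(head(df))
```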

Resources