I have a continuous variable that I want to split into bins, returning a numeric vector (of length equal to my original vector) whose values relate to the values of the bins. Each bin should have roughly the same number of elements.
This question: splitting a continuous variable into equal sized groups describes a number of techniques for related situations. For instance, if I start with
x = c(1,5,3,12,5,6,7)
I can use cut() to get:
cut(x, 3, labels = FALSE)
[1] 1 2 1 3 2 2 2
This is undesirable because the returned values are just sequential integers; they have no direct relation to the underlying original values in my vector.
Another possibility is cut2, for instance:
library(Hmisc)
cut2(x, g = 3, levels.mean = TRUE)
[1] 3.5 3.5 3.5 9.5 3.5 6.0 9.5
This is better because now the return values relate to the values of the bins. It is still less than ideal, though, since:
(a) it yields a factor, which then needs to be converted to numeric, which is both slow and awkward code-wise.
(b) Ideally I'd like to be able to choose whether to use the top or bottom end points of the intervals, instead of just the means.
I know that there are also options using regex on the factors returned by cut or cut2 to get the top or bottom points of the intervals. These too seem overly cumbersome.
Is this just a situation that requires some not-so-elegant hacking? Or, is there some easier functionality to accomplish this?
My current best effort is as follows:
MyDiscretize = function(x, N_Bins){
f = cut2(x, g = N_Bins, levels.mean = TRUE)
return(as.numeric(levels(f))[f])
}
My goal is to find something faster, more elegant, and easily adaptable to use either of the endpoints, rather than just the means.
Edit:
To clarify: my desired output would be:
(a) an equivalent to what I can achieve right now in the example with cut2 but without needing to convert the factor to numeric.
(b) if possible, the ability to also easily choose to use either of the endpoints of the interval, instead of the midpoint.
Use ave like this:
Given:
x = c(1,5,3,12,5,6,7)
Mean:
ave(x,cut2(x,g = 3), FUN = mean)
[1] 3.5 3.5 3.5 9.5 3.5 6.0 9.5
Min:
ave(x,cut2(x,g = 3), FUN = min)
[1] 1 1 1 7 1 6 7
Max:
ave(x,cut2(x,g = 3), FUN = max)
[1] 5 5 5 12 5 6 12
Or standard deviation:
ave(x,cut2(x,g = 3), FUN = sd)
[1] 1.914854 1.914854 1.914854 3.535534 1.914854 NA 3.535534
Note the NA result where an interval contains only one data point (the standard deviation of a single value is undefined).
Hope this is what you need.
NOTE:
Parameter g in cut2 is the number of quantile groups. Groups might not have the same number of data points, and the intervals might not have the same length.
On the other hand, cut splits the interval into several of equal length.
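To make the distinction in the note concrete, here is a minimal base R sketch (no Hmisc) that bins the same x both ways. The variable names equal_width, q_breaks and equal_count are only illustrative, and the quantile breakpoints use quantile()'s default method, which will not necessarily match the cutpoints cut2 chooses.

```r
x <- c(1, 5, 3, 12, 5, 6, 7)

# Equal-width bins: cut() divides range(x) into 3 intervals of the same length
equal_width <- cut(x, 3, labels = FALSE)

# Quantile-based bins: breakpoints at the 0, 1/3, 2/3 and 1 quantiles of x
q_breaks <- quantile(x, probs = seq(0, 1, length.out = 4), names = FALSE)
equal_count <- cut(x, breaks = q_breaks, labels = FALSE, include.lowest = TRUE)

table(equal_width)   # bin sizes driven by where the values fall in the range
table(equal_count)   # bin sizes driven by ranks (ties can still unbalance them)
```

With only seven values and a tie at 5, neither scheme is perfectly balanced; the difference becomes clearer on larger, skewed data.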
Maybe not very elegant, but it should be efficient. Try this function:
myCut<-function(x,breaks,retValues=c("means","highs","lows")) {
retValues<-match.arg(retValues)
if (length(breaks)!=1) stop("breaks must be a single number")
breaks<-as.integer(breaks)
if (is.na(breaks)||breaks<2) stop("breaks must be greater than or equal to 2")
intervals<-seq(min(x),max(x),length.out=breaks+1)
bins<-findInterval(x,intervals,all.inside=TRUE)
if (retValues=="means") return(rowMeans(cbind(intervals[-(breaks+1)],intervals[-1]))[bins])
if (retValues=="highs") return(intervals[-1][bins])
intervals[-(breaks+1)][bins]
}
x = c(1,5,3,12,5,6,7)
myCut(x,3)
#[1] 2.833333 6.500000 2.833333 10.166667 6.500000 6.500000 6.500000
myCut(x,3,"highs")
#[1] 4.666667 8.333333 4.666667 12.000000 8.333333 8.333333 8.333333
myCut(x,3,"lows")
#[1] 1.000000 4.666667 1.000000 8.333333 4.666667 4.666667 4.666667
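Note that myCut builds equal-width intervals, whereas the question asks for bins with roughly equal counts. A possible quantile-based variant, as a sketch only: the name myQuantCut and its value argument are hypothetical, breakpoints come from quantile()'s default method, and the input is assumed to contain no NA values and at least two distinct values.

```r
# Hypothetical helper: quantile-based bins returned directly as a numeric vector
myQuantCut <- function(x, n_bins, value = c("means", "lows", "highs")) {
  value <- match.arg(value)
  probs <- seq(0, 1, length.out = n_bins + 1)
  # unique() guards against duplicated breakpoints when x has many ties
  breaks <- unique(quantile(x, probs = probs, names = FALSE))
  # rightmost.closed = TRUE keeps max(x) in the last bin
  bins <- findInterval(x, breaks, rightmost.closed = TRUE)
  switch(value,
         means = ave(x, bins, FUN = mean),  # mean of the data in each bin
         lows  = breaks[bins],              # lower endpoint of each bin
         highs = breaks[bins + 1])          # upper endpoint of each bin
}

x <- c(1, 5, 3, 12, 5, 6, 7)
lows  <- myQuantCut(x, 3, "lows")
highs <- myQuantCut(x, 3, "highs")
means <- myQuantCut(x, 3, "means")
```

Unlike myCut, "means" here is the mean of the observations in each bin rather than the midpoint of the interval, which is closer to what cut2(..., levels.mean = TRUE) reports.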
Related
I have a problem with documentation and the return value of plot() for factors. I'd like to add a horizontal line with the mean value to the plot, but I fail to compute it. I was hoping to be able to use the value of the plot, but I failed. For example:
> x<-sample(5, 10, replace=TRUE)
> x
[1] 3 5 1 4 5 4 2 4 1 5
> y<-plot(factor(x))
> y
[,1]
[1,] 0.7
[2,] 1.9
[3,] 3.1
[4,] 4.3
[5,] 5.5
Obviously the domain and range are all integers, so what do these numbers returned by plot really mean, and how could I get the mean bar height?
Of course (if there's not a more elegant solution) I can iterate over the factor levels, counting the number of items for each, and then take the mean value of those. Also, if you use hist() instead of plot(), then the solution is very simple: abline(h=mean(hist(x)$counts))
To add a horizontal line, just say:
abline(h = mean(x))
abline(h = whatever) gives you a horizontal line. abline(v = whatever) gives you a vertical line.
Probably the (most?) complicated solution invented by myself is this:
abline(h=mean(unlist(lapply(min(x):max(x), function(ff) length(which(x == ff))))))
Of course this solution only works if x is a factor, and the levels are numeric; otherwise replace min(x):max(x) with levels(x).
And (harder for me to understand) a simpler solution seems to be (from @Marco Sandri):
abline(h=length(x)/length(y))
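For what it is worth, here is a small sketch of why that last one-liner works. As far as I can tell, plot() on a factor draws a barplot of table(x) and returns the bar midpoints on the x axis (0.7, 1.9, ...), so length(y) is simply the number of bars. The mean bar height is then the mean of the counts; the names counts, mids and mean_height are just for illustration.

```r
x <- c(3, 5, 1, 4, 5, 4, 2, 4, 1, 5)   # the sample shown in the question

counts <- table(factor(x))             # bar heights: one count per level
mids <- barplot(counts)                # returns the x positions of the bar centres
mean_height <- mean(counts)            # same as length(x) / number of bars

abline(h = mean_height, lty = 2)       # horizontal line at the mean bar height
```

This avoids looping over the levels and works whether or not the levels are numeric.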
I'm looking for a way to split a data frame into groups of equal size (essentially same number of rows in each group), whose groups have a nearly equal mean.
User Data
1 5.0
2 4.5
3 3.5
4 6.0
5 7.0
6 6.5
7 5.5
8 6.2
9 5.7
10 5.9
This is very similar to this request. However, that only splits the data into 2 groups.
My actual dataset contains anywhere from 75-150 rows, and I need to split it into anywhere from 5-10 groups of equal mean and fairly equal size.
I've researched on Google & Stack Exchange for the last few days, and I'm just not having much luck. Any guidance would be great.
Thanks in advance!
More details:
Maybe I need to provide some more details. Below I've included a real dataset. We are a transportation company, and this data set has Driver ID, Miles, and Gallons. What I have been doing is reading the data into R and adding an MPG column like so:
data <- read.csv('filename')
data$MPG <- data$Miles / data$Gallons
Then I tried the two provided answers below. Arun's idea gives me almost equal group sizes (9 members per group, 10 groups), however the variation of the means is large, from 6.615 - 7.093 which is too large of a variation for me to start off with. Thomas' idea gets a little bit tighter variation, but the group sizes are all different from 6 - 13 members.
What we are looking to do is improve fleet MPG, and we're going to accomplish this with a team based competition, so I need to randomly put the teams together with them all starting from relatively the same group MPG.
Maybe that helps and can lead us in the correct direction? I tried doing this just in my programming language, but it locks the computer up every time, so I figured that R would probably be able to process the data better.
Thanks again!
If similar means is really all that matters, I've put together a simulation below that basically looks at a bunch of different combinations of the data (n) for a particular group size (k) and then minimizes the variance of the group means. With that minimization you can then extract that grouping from the simulation results.
df <- data.frame(User=1:1000,Data=rnorm(1000,0,1)) # example data
myfun = function(){
k <- 5 # number of groups
tmp <- seq_len(nrow(df)) %% k # equal-sized group labels; really efficient code from @qwwqwwq's answer
thisgroup <- sample(tmp, dim(df)[1], FALSE) # pull a sample
# thisgroup <- sample(1:k,dim(df)[1],TRUE) # original version
thisavg <- as.vector(by(df$Data, thisgroup, mean)) # group means
thisvar <- var(thisavg) # variance of means
return(list(group=thisgroup, avgs=thisavg, var=thisvar))
}
n <- 1000 # number of simulations
sorts <- replicate(n, myfun(), simplify=FALSE)
wh <- which.min(sapply(sorts, function(x) x$var)) # minimization
# sorts[[wh]] # this is the sample you want
split(df, sorts[[wh]]$group) # list of separate dataframes for each group
You could also let the number of groups vary, if you don't care how many cases are in each group, by replacing the fixed k <- 5 line inside the function with a random draw from the range of group counts you're willing to have.
There are probably other ways to do this, though.
Going by Thomas' idea, here's a brute-force/greedy approach, which'll give more or less the same values (you can opt for more repetitions until you agree with the closeness of the solution).
# Assuming the data you provided is in `df`
grp <- 5
myfun <- function() {
samp <- sample(nrow(df))
s.mean <- tapply(df$Data, samp %% grp, mean)
s.var <- var(s.mean)
list(samp, s.mean, s.var)
}
out <- replicate(1000, myfun(), simplify=FALSE)
min.pos <- which.min(sapply(out, `[[`, 3))
min.idx <- out[[min.pos]][[1]]
split(df$Data, min.idx %% grp) # labels in min.idx line up with the rows of df
$`0`
[1] 6.0 5.7
$`1`
[1] 5.5 5.9
$`2`
[1] 5.0 6.2
$`3`
[1] 3.5 7.0
$`4`
[1] 4.5 6.5
This is what out[min.pos] looks like:
out[min.pos]
[[1]]
[[1]][[1]]
[1] 7 9 8 5 3 4 1 2 10 6
[[1]][[2]]
0 1 2 3 4
5.85 5.70 5.60 5.25 5.50
[[1]][[3]]
[1] 0.05075
Simplest way I can think of: sort the data, take the indices modulo the number of groups, and you're done. Should work well if the data are normally distributed, I think. It has the advantage of the groups being as equally sized as possible.
mpg <- rnorm(150)
mpg <- sort(mpg)
ngroups = 13
df = data.frame( mpg=mpg, group=seq(length(mpg))%%ngroups)
tapply(df$mpg, df$group, mean)
0 1 2 3 4 5 6 7 8
0.080400272 -0.110797283 -0.046698548 -0.014177675 0.024410834 0.048370962 0.066265303 0.087119914 -0.062259638
9 10 11 12
-0.042172496 -0.003451581 0.033853024 0.056947458
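One caveat with the sort-and-modulo idea: when each group only receives a few values, the group that always gets the largest value of each round ends up with a higher mean. A possible refinement, sketched here on the ten sample values from the question, is to reverse the assignment order on every other round (a "snake" or serpentine order). This is a swapped-in variation rather than the method from the answer above, and the names group_plain, group_snake and the rest are illustrative only.

```r
data <- c(5.0, 4.5, 3.5, 6.0, 7.0, 6.5, 5.5, 6.2, 5.7, 5.9)
k <- 5                               # number of groups

ord <- order(data)                   # positions of the values in increasing order
i <- seq_along(data) - 1             # 0-based rank of each sorted value
pos <- i %% k                        # plain modulo assignment: 0,1,..,k-1,0,1,..
# Snake assignment: go 0..k-1 on even rounds and k-1..0 on odd rounds
snake <- ifelse((i %/% k) %% 2 == 0, pos, k - 1 - pos)

# Map the labels back to the original row order
group_plain <- numeric(length(data)); group_plain[ord] <- pos
group_snake <- numeric(length(data)); group_snake[ord] <- snake

plain_means <- tapply(data, group_plain, mean)
snake_means <- tapply(data, group_snake, mean)
diff(range(plain_means))             # spread of group means, plain modulo
diff(range(snake_means))             # spread of group means, snake order
```

Both assignments give groups of equal size; on this small sample the snake order narrows the spread of the group means. A random search over permutations can then be used to refine either starting point.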
I was trying to calculate equal quantile cuts for a vector by using cut2 from Hmisc.
library(Hmisc)
c <- c(-4.18304,-3.18343,-2.93237,-2.82836,-2.13478,-2.01892,-1.88773,
-1.83124,-1.74953,-1.74858,-0.63265,-0.59626,-0.5681)
cut2(c, g=3, onlycuts=TRUE)
[1] -4.18304 -2.01892 -1.74858 -0.56810
But I was expecting the following result (33%, 33%, 33%):
[1] -4.18304 -2.13478 -1.74858 -0.56810
Should I still use cut2 or try something different? How can I make it work? Thanks for your advice.
You are seeing the cutpoints, but you want the tabular counts, and you want them as fractions of the total, so do this instead:
> prop.table(table(cut2(c, g=3) ) )
[-4.18,-2.019) [-2.02,-1.749) [-1.75,-0.568]
0.3846154 0.3076923 0.3076923
(Obviously you cannot expect cut2 to create an exact split when the count of elements was not evenly divisible by 3.)
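The same proportions can be checked in base R without Hmisc, taking the cutpoints reported by cut2 in the question as given. This is only a sketch: the vector is called vals here (to avoid masking the function c), and findInterval is used with left-closed intervals and a closed final interval, which mirrors the [a,b) ... [c,d] labels cut2 prints.

```r
vals <- c(-4.18304, -3.18343, -2.93237, -2.82836, -2.13478, -2.01892, -1.88773,
          -1.83124, -1.74953, -1.74858, -0.63265, -0.59626, -0.5681)

# Cutpoints as reported by cut2(c, g = 3, onlycuts = TRUE) in the question
cuts <- c(-4.18304, -2.01892, -1.74858, -0.5681)

# Left-closed intervals; rightmost.closed = TRUE puts max(vals) in the last group
grp <- findInterval(vals, cuts, rightmost.closed = TRUE)

sizes <- table(grp)
sizes                 # 5, 4 and 4 values
prop.table(sizes)     # 5/13, 4/13, 4/13
```

Thirteen values cannot be split 33/33/33, so one group has to take the extra point.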
It seems that there were accidentally thirteen values in the original data set, instead of twelve. Thirteen values cannot be equally divided into three quantile groups (as mentioned by BondedDust). Here is the original problem, except that one selected data value (-1.74953) is excluded, making it twelve values. This gives the result originally expected:
library(Hmisc)
c<-c(-4.18304,-3.18343,-2.93237,-2.82836,-2.13478,-2.01892,-1.88773,-1.83124,-1.74858,-0.63265,-0.59626,-0.5681)
cut2(c, g=3,onlycuts=TRUE)
#[1] -4.18304 -2.13478 -1.74858 -0.5681
To make it clearer to anyone not familiar with cut2 from the Hmisc package (like me as of this morning), here's a similar problem, except that we'll use the integers 1 through 12 (assigned to the vector dozen_values).
library(Hmisc)
dozen_values <-1:12
quantile_groups <- cut2(dozen_values,g=3)
levels(quantile_groups)
## [1] "[1, 5)" "[5, 9)" "[9,12]"
cutpoints <- cut2(dozen_values, g=3, onlycuts=TRUE)
cutpoints
## [1] 1 5 9 12
# Show which values belong to which quantile group, using a data frame
quantile_DF <- data.frame(dozen_values, quantile_groups)
names(quantile_DF) <- c("value", "quantile_group")
quantile_DF
## value quantile_group
## 1 1 [1, 5)
## 2 2 [1, 5)
## 3 3 [1, 5)
## 4 4 [1, 5)
## 5 5 [5, 9)
## 6 6 [5, 9)
## 7 7 [5, 9)
## 8 8 [5, 9)
## 9 9 [9,12]
## 10 10 [9,12]
## 11 11 [9,12]
## 12 12 [9,12]
Notice that the first quantile group includes everything up to, but not including, 5 (i.e. 1 through 4, in this case). The second quantile group contains 5 up to, but not including, 9 (i.e. 5 through 8, in this case). The third (last) quantile group contains 9 through 12, which includes the last value 12. Unlike the other quantile groups, the third quantile group includes the last value shown.
Anyway, you can see that the "cutpoints" 1, 5, 9, and 12 describe the start and end points of the quantile groups in the most concise way, but they are hard to interpret without reading the relevant documentation (link to single page Inside-R site, instead of the almost 400 page PDF manual).
See this explanation about the parentheses vs square bracket notation, if it is unfamiliar to you.
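The half-open interval convention can also be reproduced with base R's cut(), which may help if Hmisc is not at hand. A minimal sketch, assuming the cutpoints 1, 5, 9, 12 from the example above; the name base_groups is illustrative.

```r
dozen_values <- 1:12
cutpoints <- c(1, 5, 9, 12)

# right = FALSE makes each interval closed on the left and open on the right;
# include.lowest = TRUE then closes the last interval so that 12 is not dropped
base_groups <- cut(dozen_values, breaks = cutpoints,
                   right = FALSE, include.lowest = TRUE)

table(base_groups)   # four values in each of the three groups
```

Without include.lowest = TRUE the value 12 would fall outside every interval and come back as NA.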
I have a dataframe:
> df <- data.frame(
+ Species = rep(LETTERS[1:4], times=c(5,6,7,6)),
+ Length = rep(11:14, each=3)
+ )
>
> df
I need to be able to count the number of individuals of a certain Length for each Species (i.e., how many individuals in Species A have a length of 1, 2, 3, etc?) Then, I need to perform a series of additional analyses on the output. For example, I need to calculate the density of individuals of each length, and the decrease in density from one length class to the next.
This is easy if I subset the data first:
Spec.A<-df[df$Species=="A",]
#count number of specimens of each length;
count<-table(Spec.A$Length)
count
#calculate density per length category (divide by total area sampled =30)
density<-count/(30)
density
#calculate the decrease in density (delta.N) from one length category to the next;
delta.N<-diff(density, lag=1, differences=1)
delta.N
The problem is that I need to do these calculations for each species (i.e., to loop through each subset).
On the one hand, I could use tapply(), with a function that uses table():
#function: count number of specimens of each length;
count<-function(x){
table(x)
}
Number<-tapply(df$Length, df$Species, FUN=count, simplify=FALSE)
Number
This gives me what I want, but the format of the output is funky, and I can't figure out how to perform additional analyses on the results.
I have tried using ddply() from plyr, something like:
ddply(df$Length, df$Species,
count)
But I clearly don't have it right, and I'm not even sure ddply() is appropriate for my problem, given that I have a different number of length observations for each species.
Should I be looking more closely at other options in plyr? Or is there a way to write a for loop to do what I need?
You're on the right track! tapply with list output is definitely one way to go, and may be a good choice since your outputs will have varying lengths.
ddply, like you guessed, is another way. The key is that the output of the function you give to ddply should be a data frame with all your statistics in a "long" mode (so that they will stack nicely). The simple count function can't do this, so you'll need to make your own function. The way I go about devising a function for a ddply call like this is actually very similar to what you were doing: I get a subset of the data, and then craft my function using that. Then, when you submit it to ddply, it'll nicely apply that function across all the subsets.
SpeciesStats <- function(df) {
counts = table(df$Length)
densities = counts/30
delta.N = diff(densities, lag=1, differences=1)
data.frame(Length = names(counts),
Count = as.numeric(counts),
Density = as.numeric(densities),
delta.N = c(NA, delta.N),
row.names=NULL)
}
> ddply(df, 'Species', SpeciesStats)
Species Length Count Density delta.N
1 A 11 3 0.10000000 NA
2 A 12 2 0.06666667 -0.03333333
3 B 12 1 0.03333333 NA
4 B 13 3 0.10000000 0.06666667
5 B 14 2 0.06666667 -0.03333333
6 C 11 3 0.10000000 NA
7 C 12 3 0.10000000 0.00000000
8 C 14 1 0.03333333 -0.06666667
9 D 13 3 0.10000000 NA
10 D 14 3 0.10000000 0.00000000
You can do this in a simpler way by using the count function in plyr
df1 <- ddply(df, .(Species, Length), count)
df2 <- ddply(df1, .(Species), mutate, Dens = freq/30, Del = diff(c(NA, Dens)))
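If plyr is not available, the same split-apply-combine idea can be sketched in base R with split(), lapply() and rbind(). The function name species_stats is hypothetical, the sampled area of 30 is taken from the question, and the Length column is recycled explicitly to 24 rows so the example does not depend on data.frame()'s recycling.

```r
df <- data.frame(
  Species = rep(LETTERS[1:4], times = c(5, 6, 7, 6)),
  Length  = rep(rep(11:14, each = 3), length.out = 24)
)

# Per-species statistics in "long" form, one row per length class
species_stats <- function(d, area = 30) {
  counts <- table(d$Length)
  dens <- as.numeric(counts) / area
  data.frame(Species = d$Species[1],
             Length  = as.integer(names(counts)),
             Count   = as.integer(counts),
             Density = dens,
             delta.N = c(NA, diff(dens)))   # change from the previous length class
}

# Split by species, apply the function to each piece, stack the results
result <- do.call(rbind, lapply(split(df, df$Species), species_stats))
rownames(result) <- NULL
result
```

Because each piece returns a data frame with the same columns, species with different numbers of length classes stack without any padding.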
I have a set of user recommendations
review=matrix(c(5:1,10,2,1,1,2), nrow=5, ncol=2, dimnames=list(NULL,c("Star","Votes")))
and want to use summary(review) to show basic properties: mean, median, quartiles, and min/max.
But it gives back the summary of both columns. I refrain from using data.frame because the factors 'Star' are ordered.
How can I tell R that Star is an ordered factor of numeric scores and that Votes holds their frequencies?
I'm not exactly sure what you mean by taking the mean in general if Star is supposed to be an ordered factor. However, in the example you give where Star is actually a set of numeric values, you can use the following:
library(Hmisc)
R> review=matrix(c(5:1,10,2,1,1,2), nrow=5, ncol=2, dimnames=list(NULL,c("Star","Votes")))
R> wtd.mean(review[, 1], weights = review[, 2])
[1] 4.0625
R> wtd.quantile(review[, 1], weights = review[, 2])
0% 25% 50% 75% 100%
1.00 3.75 5.00 5.00 5.00
I don't understand what's the problem. Why shouldn't you use data.frame?
rv <- data.frame(star = ordered(review[, 1]), votes = review[, 2])
You should convert your data.frame to vector:
( vts <- with(rv, rep(star, votes)) )
[1] 5 5 5 5 5 5 5 5 5 5 4 4 3 2 1 1
Levels: 1 < 2 < 3 < 4 < 5
Then do the summary... I just don't know what kind of summary, since summary will bring you back to the start. O_o
summary(vts)
1 2 3 4 5
2 1 1 2 10
EDIT (on @Prasad's suggestion)
Since vts is an ordered factor, you should convert it to numeric and then calculate the summary (at this moment I will disregard the background statistical issues):
nvts <- as.numeric(levels(vts)[vts]) ## numeric conversion
summary(nvts) ## "ordinary" summary
fivenum(nvts) ## Tukey's five number summary
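The conversion goes through levels() because as.numeric() applied directly to a factor returns the internal integer codes, not the labels. With levels 1 to 5 the two happen to coincide, so here is a small illustration with made-up scores where they do not; the factor f is purely an example.

```r
f <- factor(c("10", "20", "10", "40"))   # hypothetical scores stored as a factor

codes  <- as.numeric(f)                  # internal codes: 1 2 1 3
values <- as.numeric(levels(f))[f]       # the labels as numbers: 10 20 10 40
same   <- as.numeric(as.character(f))    # equivalent, converts every element
```

as.numeric(levels(f))[f] converts each distinct level once and then indexes, which is why it is often preferred for long vectors.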
Just to clarify: when you say you would like "mean, median, quartiles and min/max", you're talking in terms of number of stars? E.g., mean = 4.062 stars?
Then using aL3xa's code, would something like summary(as.numeric(as.character(vts))) be what you want?
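Putting the pieces together in base R only, as a sketch: treat Star as a numeric score, use Votes as frequencies, and either compute a weighted mean directly or expand the scores with rep() and summarise them. The names stars, votes, wm and expanded are illustrative, and whether a mean is meaningful for an ordinal rating is a separate statistical question.

```r
review <- matrix(c(5:1, 10, 2, 1, 1, 2), nrow = 5, ncol = 2,
                 dimnames = list(NULL, c("Star", "Votes")))

stars <- review[, "Star"]
votes <- review[, "Votes"]

wm <- weighted.mean(stars, w = votes)   # 65 stars over 16 votes = 4.0625

expanded <- rep(stars, times = votes)   # one entry per vote
summary(expanded)                       # min, quartiles, median, mean, max
```

The mean agrees with the wtd.mean result shown earlier; quartiles from summary() may differ slightly from wtd.quantile because the two use different interpolation rules.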