I have decided to learn R. I am trying to get a sense of how to write "R style" functions and to avoid looping. Here is a sample situation:
Given a vector a, I would like to compute a vector b whose elements b[i] (the vector index begins at 1) are defined as follows:
1 <= i <= 4:           b[i] = NaN
5 <= i <= length(a):   b[i] = mean(a[(i-4):i])
Essentially, if we pretend 'a' is a list of speeds where the first entry is at time = 0, the second at time = 1 second, the third at time = 2 seconds... I would like to obtain a corresponding vector describing the average speed over the past 5 seconds.
E.g.:
If a is (1,1,1,1,1,4,6,3,6,8,9) then b should be (NaN, NaN, NaN, NaN, 1, 1.6, 2.6, 3, 4, 5.4, 6.4)
I could do this using a loop, but I feel that doing so would not be in "R style".
Thank you,
Tungata
Because rolling functions like this come up so often with time-series data, some of the newer and richer time-series packages already do this for you:
R> library(zoo) ## load zoo
R> speed <- c(1,1,1,1,1,4,6,3,6,8,9)
R> zsp <- zoo( speed, order.by=1:length(speed) ) ## creates a zoo object
R> rollmean(zsp, 5) ## default use
3 4 5 6 7 8 9
1.0 1.6 2.6 3.0 4.0 5.4 6.4
R> rollmean(zsp, 5, na.pad=TRUE, align="right") ## with padding and aligned
1 2 3 4 5 6 7 8 9 10 11
NA NA NA NA 1.0 1.6 2.6 3.0 4.0 5.4 6.4
R>
The zoo package has excellent documentation that will show you many, many more examples, in particular how to do this with real (and possibly irregular) dates; xts extends this further, but zoo is a better starting point.
Something like b = filter(a, rep(1.0/5, 5), sides=1) will do the job, although you will get NA in the first few slots rather than NaN. R has a large library of built-in functions, and "R style" is to use those wherever possible. Take a look at the documentation for the filter function.
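For example, a minimal sketch using the speed vector from above (converting the leading NAs to NaN to match the requested output):

a <- c(1,1,1,1,1,4,6,3,6,8,9)
b <- as.numeric(filter(a, rep(1/5, 5), sides = 1))  # trailing 5-point moving average
b[is.na(b)] <- NaN                                  # first 4 slots have no full window
b
# [1] NaN NaN NaN NaN 1.0 1.6 2.6 3.0 4.0 5.4 6.4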
You can also use a combination of cumsum and diff to get the sum over sliding windows. You'll need to pad with your own NaN, though:
> speed <- c(1,1,1,1,1,4,6,3,6,8,9)
> diff(cumsum(c(0,speed)), 5)/5
[1] 1.0 1.6 2.6 3.0 4.0 5.4 6.4
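One way to do that padding (a small sketch), so the result lines up with the original vector:

b <- c(rep(NaN, 4), diff(cumsum(c(0, speed)), 5) / 5)
b
# [1] NaN NaN NaN NaN 1.0 1.6 2.6 3.0 4.0 5.4 6.4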
Related
I have a continuous variable that I want to split into bins, returning a numeric vector (of length equal to my original vector) whose values relate to the values of the bins. Each bin should have roughly the same number of elements.
This question: splitting a continuous variable into equal sized groups describes a number of techniques for related situations. For instance, if I start with
x = c(1,5,3,12,5,6,7)
I can use cut() to get:
cut(x, 3, labels = FALSE)
[1] 1 2 1 3 2 2 2
This is undesirable because the values of the factor are just sequential integers; they have no direct relation to the underlying values in my vector.
Another possibility is cut2 from the Hmisc package, for instance:
library(Hmisc)
cut2(x, g = 3, levels.mean = TRUE)
[1] 3.5 3.5 3.5 9.5 3.5 6.0 9.5
This is better because now the returned values relate to the values of the bins. It is still less than ideal, though, since:
(a) it yields a factor which then needs to be converted to numeric (see, e.g.), which is both slow and awkward code-wise.
(b) ideally I'd like to be able to choose whether to use the top or bottom endpoints of the intervals, instead of just the means.
I know there are also options using regex on the factor levels returned by cut or cut2 to extract the top or bottom points of the intervals. These too seem overly cumbersome.
Is this just a situation that requires some not-so-elegant hacking? Or, is there some easier functionality to accomplish this?
My current best effort is as follows:
MyDiscretize = function(x, N_Bins){
  f = cut2(x, g = N_Bins, levels.mean = TRUE)
  return(as.numeric(levels(f))[f])
}
My goal is to find something faster, more elegant, and easily adaptable to use either of the endpoints, rather than just the means.
Edit:
To clarify: my desired output would be:
(a) an equivalent to what I can achieve right now in the example with cut2 but without needing to convert the factor to numeric.
(b) if possible, the ability to also easily choose to use either of the endpoints of the interval, instead of the midpoint.
Use ave like this:
Given:
x = c(1,5,3,12,5,6,7)
Mean:
ave(x,cut2(x,g = 3), FUN = mean)
[1] 3.5 3.5 3.5 9.5 3.5 6.0 9.5
Min:
ave(x,cut2(x,g = 3), FUN = min)
[1] 1 1 1 7 1 6 7
Max:
ave(x,cut2(x,g = 3), FUN = max)
[1] 5 5 5 12 5 6 12
Or standard deviation:
ave(x,cut2(x,g = 3), FUN = sd)
[1] 1.914854 1.914854 1.914854 3.535534 1.914854 NA 3.535534
Note the NA result when an interval contains only one data point.
Hope this is what you need.
NOTE:
The parameter g in cut2 is the number of quantile groups. The groups might not contain the same number of data points, and the intervals might not have the same length.
cut, on the other hand, splits the range into intervals of equal length.
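If you would rather have this wrapped up in the style of the question's MyDiscretize, a minimal sketch could look like the following (the function name is just illustrative; note that min and max give the group minima and maxima of the data, as in the examples above, rather than theoretical bin boundaries):

library(Hmisc)

DiscretizeBy <- function(x, n_bins, FUN = mean) {
  # replace each value by FUN applied to its quantile group
  ave(x, cut2(x, g = n_bins), FUN = FUN)
}

DiscretizeBy(x, 3)        # group means (midpoint-style values)
DiscretizeBy(x, 3, min)   # lower endpoints (group minima)
DiscretizeBy(x, 3, max)   # upper endpoints (group maxima)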
Maybe not the most elegant, but it should be efficient. Try this function:
myCut <- function(x, breaks, retValues = c("means", "highs", "lows")) {
  retValues <- match.arg(retValues)
  if (length(breaks) != 1) stop("breaks must be a single number")
  breaks <- as.integer(breaks)
  if (is.na(breaks) || breaks < 2) stop("breaks must be greater than or equal to 2")
  intervals <- seq(min(x), max(x), length.out = breaks + 1)
  bins <- findInterval(x, intervals, all.inside = TRUE)
  if (retValues == "means") return(rowMeans(cbind(intervals[-(breaks + 1)], intervals[-1]))[bins])
  if (retValues == "highs") return(intervals[-1][bins])
  intervals[-(breaks + 1)][bins]
}
x = c(1,5,3,12,5,6,7)
myCut(x,3)
#[1] 2.833333 6.500000 2.833333 10.166667 6.500000 6.500000 6.500000
myCut(x,3,"highs")
#[1] 4.666667 8.333333 4.666667 12.000000 8.333333 8.333333 8.333333
myCut(x,3,"lows")
#[1] 1.000000 4.666667 1.000000 8.333333 4.666667 4.666667 4.666667
What would be your scholarly recommendation for modeling a population in R when

ΔZ = 0.2·Z, with Z(0) = 10?

The output should be similar to the following.
Or, as another example, suppose a population is described by the model N(t+1) = 1.5·N(t) with N(5) = 7.3. Find N(t) for t = 0, 1, 2, 3, and 4.
t     0    1     2      3       4        5        6
Zt   10   12  14.4  17.28  20.736  24.8832  29.8598
Recursions like this (i.e. Z ← k·Z) are done quite easily in a spreadsheet such as Excel. In R, however, the following (far from efficient) attempts have been made thus far:
# loop implementation in R
Z <- 10; print(Z)
for (t in 1:6) { Z <- 0.2*Z + Z; print(Z) }

# or, written out term by term
Z0 <- 10
Z1 <- 0.2*Z0 + Z0; Z2 <- 0.2*Z1 + Z1; Z3 <- 0.2*Z2 + Z2
Z4 <- 0.2*Z3 + Z3; Z5 <- 0.2*Z4 + Z4; Z6 <- 0.2*Z5 + Z5
Zn <- c(Z0, Z1, Z2, Z3, Z4, Z5, Z6)
Since R tries to avoid for loops and iteration at all costs, what would be your recommendation (preferably without iteration, if possible)?
What has been done in Excel is the following:
t Nt
5 7.3 k=1.5
4 =B2/$C$2
3 =B3/$C$2
2 =B4/$C$2
1 =B5/$C$2
0 =B6/$C$2
It is a lot easier:
R> Z <- 10
R> Z * 1.2 ^ (0:6)
[1] 10.00000 12.00000 14.40000 17.28000 20.73600 24.88320 29.85984
R>
We set Z to ten and then multiply it by the growth rate raised to the power t, for t from 0 to 6; compounding the growth is really just taking 'growth' to the t-th power.
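The second example from the question works the same way, just running backwards from N(5) = 7.3 (a quick sketch):

N5 <- 7.3
k  <- 1.5
N5 / k^(5:0)          # N(t) for t = 0, 1, ..., 5
# [1] 0.9613169 1.4419753 2.1629630 3.2444444 4.8666667 7.3000000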
There is a nice short tutorial in the appendix of the 'An Introduction to R' manual that came with your copy of R. I went over it a number of times when I started.
Is there any difference between what these two lines of code do:
mv_avg[i-2] <- (sum(file1$rtn[i-2]:file1$rtn[i+2])/5)
and
mv_avg[i-2] <- mean(file1$rtn[i-2]:file1$rtn[i+2])
I'm trying to calculate the moving average of the first 5 elements in my dataset. I was running a for loop and the two lines give different outputs. Sorry for not providing the data and the rest of the code for you to run (I can't share them, unfortunately).
I just want to know whether they do the same thing or whether there's a subtle difference between them.
It's not an issue with mean or sum. The example below illustrates what's happening with your code:
x = seq(0.5,5,0.5)
i = 8
# Your code
x[i-2]:x[i+2]
[1] 3 4 5
# Index this way to get the five values for the moving average
x[(i-2):(i+2)]
[1] 3.0 3.5 4.0 4.5 5.0
x[i-2]=3 and x[i+2]=5, so x[i-2]:x[i+2] is equivalent to 3:5. You're seeing different results with mean and sum because your code is not returning 5 values. Therefore dividing the sum by 5 does not give you the average. In my example, sum(c(3,4,5))/5 != mean(c(3,4,5)).
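Applying the same fix to the code in the question (assuming mv_avg and file1$rtn as defined there), the two lines then agree:

mv_avg[i-2] <- sum(file1$rtn[(i-2):(i+2)]) / 5
# is now equivalent to
mv_avg[i-2] <- mean(file1$rtn[(i-2):(i+2)])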
@G.Grothendieck mentioned rollmean. Here's an example:
library(zoo)
rollmean(x, k=5, align="center")
[1] 1.5 2.0 2.5 3.0 3.5 4.0
I am aiming at smoothing out a curve with set values. To do this, I currently generate a vector between points in my curve like so:
> y.values <- c(values[1], mean(values[1:2]), values[2], ...)
This is not the fastest approach, to say the least (and this snippet only covers two of the numbers!). I need a better way to take a vector of known non-linear values and insert a value between each pair, like so:
> values
[1] 1 2 4 6 9
> y.values <- magic(values)
> y.values
[1] 1 1.5 2 3 4 5 6 7.5 9
This question feels basic but I researched it and cannot seem to find a proper method for my non-linear vector, and any help is appreciated. Thank you for reading.
Maybe not the most elegant way to do this but it works:
values <- c(1,2,4,6,9)

# lapply is used to create the mean values, which get merged
# in between your values inside the function
a <- c(unlist(lapply(1:(length(values) - 1),
                     function(x) c(values[x], (values[x] + values[x+1]) / 2))),
       values[length(values)])
Output:
> a
[1] 1.0 1.5 2.0 3.0 4.0 5.0 6.0 7.5 9.0
Or as a function:
magic <- function(x) {
  c(unlist(lapply(1:(length(x) - 1),
                  function(z) c(x[z], (x[z] + x[z+1]) / 2))),
    x[length(x)])
}
> magic(values)
[1] 1.0 1.5 2.0 3.0 4.0 5.0 6.0 7.5 9.0
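A fully vectorized alternative (just a sketch for comparison, not part of the original answer) interleaves the values with the pairwise means via rbind:

magic2 <- function(x) {
  left  <- head(x, -1)
  right <- tail(x, -1)
  c(rbind(left, (left + right) / 2), tail(x, 1))  # value, midpoint, value, midpoint, ...
}
magic2(values)
# [1] 1.0 1.5 2.0 3.0 4.0 5.0 6.0 7.5 9.0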
I'm looking for a way to split a data frame into groups of equal size (essentially the same number of rows in each group), where the groups have nearly equal means.
User Data
1 5.0
2 4.5
3 3.5
4 6.0
5 7.0
6 6.5
7 5.5
8 6.2
9 5.7
10 5.9
This is very similar to this request. However, that only splits the data into 2 groups.
My actual dataset contains anywhere from 75-150 rows, and I need to split it into anywhere from 5-10 groups of equal mean and fairly equal size.
I've researched on Google & Stack Exchange for the last few days, and I'm just not having much luck. Any guidance would be great.
Thanks in advance!
More details:
Maybe I need to provide some more details; below I've included a real dataset. We are a transportation company, and this dataset has Driver ID, Miles, and Gallons provided. What I have been doing is reading the data into R and adding an MPG column like so:
data <- read.csv('filename')
data$MPG <- data$Miles / data$Gallons
Then I tried the two answers provided below. Arun's idea gives me almost equal group sizes (9 members per group, 10 groups), but the variation of the means is large, from 6.615 to 7.093, which is too much variation for me to start off with. Thomas' idea gives a somewhat tighter variation, but the group sizes vary from 6 to 13 members.
What we are looking to do is improve fleet MPG, and we're going to accomplish this with a team-based competition, so I need to randomly put the teams together with all of them starting from roughly the same group MPG.
Maybe that helps and can lead us in the correct direction? I tried doing this in my own programming language, but it locks the computer up every time, so I figured that R would probably be able to process the data better.
Thanks again!
If similar means is really all that matters, I've put together a simulation below that basically looks at a bunch of different combinations of the data (n) for a particular group size (k) and then minimizes the variance of the group means. With that minimization you can then extract that grouping from the simulation results.
df <- data.frame(User = 1:1000, Data = rnorm(1000, 0, 1)) # example data

myfun <- function(){
  k <- 5                                      # number of groups
  tmp <- seq(nrow(df)) %% k                   # really efficient code from @qwwqwwq's answer
  thisgroup <- sample(tmp, nrow(df), FALSE)   # pull a sample (a random permutation of the labels)
  # thisgroup <- sample(1:k, nrow(df), TRUE)  # original version
  thisavg <- as.vector(by(df$Data, thisgroup, mean)) # group means
  thisvar <- var(thisavg)                     # variance of means
  return(list(group = thisgroup, avgs = thisavg, var = thisvar))
}

n <- 1000 # number of simulations
sorts <- replicate(n, myfun(), simplify = FALSE)
wh <- which.min(sapply(sorts, function(x) x$var)) # minimization
# sorts[[wh]] # this is the sample you want
split(df, sorts[[wh]]$group) # list of separate data frames for each group
You could also let k vary, if you don't care how many cases end up in each group, by moving the k <- 5 line into the function and making it a random draw from the range of group counts you're willing to have.
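A minimal sketch of that variant (the 5:10 range below is just an assumed example of acceptable group counts):

myfun_vark <- function(){
  k <- sample(5:10, 1)                        # draw the number of groups each time
  thisgroup <- sample(seq(nrow(df)) %% k, nrow(df), FALSE)
  thisavg <- as.vector(by(df$Data, thisgroup, mean))
  list(group = thisgroup, avgs = thisavg, var = var(thisavg))
}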
There are probably other ways to do this, though.
Going by Thomas' idea, here's a brute-force/greedy approach that gives more or less the same values (you can run more repetitions until you are satisfied with the closeness of the solution).
# Assuming the data you provided is in `df`
grp <- 5
myfun <- function() {
  samp <- sample(nrow(df))
  s.mean <- tapply(df$Data, samp %% grp, mean)
  s.var <- var(s.mean)
  list(samp, s.mean, s.var)
}
out <- replicate(1000, myfun(), simplify = FALSE)
min.pos <- which.min(sapply(out, `[[`, 3))
min.idx <- out[[min.pos]][[1]]
split(df$Data, min.idx %% grp) # group each point by its sampled label, as in myfun
$`0`
[1] 6.0 5.7
$`1`
[1] 5.5 5.9
$`2`
[1] 5.0 6.2
$`3`
[1] 3.5 7.0
$`4`
[1] 4.5 6.5
This is what out[min.pos] looks like:
out[min.pos]
[[1]]
[[1]][[1]]
[1] 7 9 8 5 3 4 1 2 10 6
[[1]][[2]]
0 1 2 3 4
5.85 5.70 5.60 5.25 5.50
[[1]][[3]]
[1] 0.05075
Simplest way I can think of: sort the data, take all the indices modulo the number of groups, and you're done. This should work well if the data are normally distributed, I think. It has the advantage that the groups are as equally sized as possible.
mpg <- rnorm(150)
mpg <- sort(mpg)
ngroups = 13
df = data.frame( mpg=mpg, group=seq(length(mpg))%%ngroups)
tapply(df$mpg, df$group, mean)
0 1 2 3 4 5 6 7 8
0.080400272 -0.110797283 -0.046698548 -0.014177675 0.024410834 0.048370962 0.066265303 0.087119914 -0.062259638
9 10 11 12
-0.042172496 -0.003451581 0.033853024 0.056947458
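As a quick follow-up check (a sketch), the group sizes differ by at most one:

table(df$group)
#  0  1  2  3  4  5  6  7  8  9 10 11 12
# 11 12 12 12 12 12 12 12 11 11 11 11 11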