Generate values between non-linear points - r

I am aiming to smooth out a curve with set values. To do this, I currently generate a vector between points of my curve by hand, like so:
> y.values <- c(values[1], mean(values[1:2]), values[2], ...)
This is not the fastest approach, to say the least (and this snippet handles just two of the numbers!). I need a better way to take a vector of known non-linear values and insert a value between each pair, like so:
> values
[1] 1 2 4 6 9
> y.values <- magic(values)
> y.values
[1] 1 1.5 2 3 4 5 6 7.5 9
This question feels basic but I researched it and cannot seem to find a proper method for my non-linear vector, and any help is appreciated. Thank you for reading.

Maybe not the most elegant way to do this, but it works:
values <- c(1, 2, 4, 6, 9)
# lapply() walks over each pair of neighbouring values and returns the value
# followed by the midpoint; the final value is appended at the end
a <- c(unlist(lapply(1:(length(values) - 1),
                     function(x) c(values[x], (values[x] + values[x + 1]) / 2))),
       values[length(values)])
Output:
> a
[1] 1.0 1.5 2.0 3.0 4.0 5.0 6.0 7.5 9.0
Or as a function:
magic <- function(x) {
  c(unlist(lapply(1:(length(x) - 1),
                  function(z) c(x[z], (x[z] + x[z + 1]) / 2))),
    x[length(x)])
}
> magic(values)
[1] 1.0 1.5 2.0 3.0 4.0 5.0 6.0 7.5 9.0
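Since each inserted value is just the midpoint of its neighbours, this amounts to linear interpolation at the half-way points, so base R's approx() can produce the same vector in one call (a sketch of my own, not from the answer above; it assumes the values sit at evenly spaced index positions):
approx(values, n = 2 * length(values) - 1)$y
[1] 1.0 1.5 2.0 3.0 4.0 5.0 6.0 7.5 9.0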

Related

Is there any subtle difference between 'mean' and 'average' in R?

Is there any difference between what these two lines of code do:
mv_avg[i-2] <- (sum(file1$rtn[i-2]:file1$rtn[i+2])/5)
and
mv_avg[i-2] <- mean(file1$rtn[i-2]:file1$rtn[i+2])
I'm trying to calculate the moving average of the first 5 elements in my dataset. I was running a for loop, and the two lines give different outputs. Sorry for not providing the data and the rest of the code for you to execute and see (I can't, due to some issues).
I just want to know if they both do the same thing or if there's a subtle difference between them.
It's not an issue with mean or sum. The example below illustrates what's happening with your code:
x <- seq(0.5, 5, 0.5)
i <- 8
# Your code
x[i-2]:x[i+2]
[1] 3 4 5
# Index this way to get the five values for the moving average
x[(i-2):(i+2)]
[1] 3.0 3.5 4.0 4.5 5.0
x[i-2] is 3 and x[i+2] is 5, so x[i-2]:x[i+2] is equivalent to 3:5, a three-element sequence. You're seeing different results from mean and sum because your code is not returning 5 values, so dividing the sum by 5 does not give you the average. In my example, sum(c(3,4,5))/5 != mean(c(3,4,5)).
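For the loop itself, the fix is just the indexing; a minimal sketch, assuming file1$rtn holds the returns as in the question:
mv_avg <- numeric(length(file1$rtn) - 4) # one result per complete 5-element window
for (i in 3:(length(file1$rtn) - 2)) {
  mv_avg[i - 2] <- mean(file1$rtn[(i - 2):(i + 2)])
}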
@G.Grothendieck mentioned rollmean. Here's an example:
library(zoo)
rollmean(x, k=5, align="center")
[1] 1.5 2.0 2.5 3.0 3.5 4.0
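If you want the result padded back to the original length, rollmean() can pad the ends for you with its fill argument (a small addition of my own):
rollmean(x, k=5, fill=NA, align="center")
[1]  NA  NA 1.5 2.0 2.5 3.0 3.5 4.0  NA  NA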

Double For Loop and calculate averages in R

I have a minor problem, and I'm unsure how to fix the error.
Basically, I have two columns, and I want to use a double for loop to calculate the averages between each number in both columns, resulting in a vector of averages. To clarify, apply with mean isn't the best fit here because I only need half of the total possible combinations of averages. For example:
Col1<-c(1,2,3,4,5)
Col2<-c(1,2,3,4,5)
Q1<-data.frame(cbind(Col1, Col2))
Q1$mean<-0
for (i in 1:length(Q1$Col1)) {
  for (j in i+1:length(Q1$Col2)) {
    Q1$mean[i] <- (Q1$Col1[i] + Q1$Col2[j]) / 2
  }
}
Basically, for each number in Q1$Col1, I want to average it with each number in Q1$Col2. The reason why I want to use a double for loop is to eliminate duplicates. This is the matrix version to provide visualization:
1.0 1.5 2.0 2.5 3.0
1.5 2.0 2.5 3.0 3.5
2.0 2.5 3.0 3.5 4.0
2.5 3.0 3.5 4.0 4.5
3.0 3.5 4.0 4.5 5.0
Here, each row represents a number from Q1$Col1 and each column represents a number from Q1$Col2. However, notice that there is redundancy on both sides of the matrix diagonal. Using the double for loop, I want to eliminate that redundancy and obtain the averages of only the unique combinations of cases. Using the matrix above, it should look like this:
1.0 1.5 2.0 2.5 3.0
2.0 2.5 3.0 3.5
3.0 3.5 4.0
4.0 4.5
5.0
What I think you're asking is this: given two vectors of numbers, how can I find the mean of the first items in each vector, the mean of the second items, and so on? If that's the case, then here is a way to do that.
First, you want to use cbind(), not rbind(), in order to get columns rather than rows.
Col1<-c(1,2,3,4,5)
Col2<-c(2,3,4,5,6)
Q1<-cbind(Col1, Col2)
Then you can use the function rowMeans() to figure out (you guessed it) the means of each row. (See also rowSums(), colMeans(), and colSums().)
rowMeans(Q1)
#> [1] 1.5 2.5 3.5 4.5 5.5
The more general way to do this is the apply() function, which lets us apply a function to each row or column of a matrix. Here we use the margin argument 1 to apply mean to rows (because each row pairs one item from Col1 with one from Col2).
apply(Q1, 1, mean)
The results are these:
#> [1] 1.5 2.5 3.5 4.5 5.5
If you really want them in your existing matrix, you could do something like this:
means <- rowMeans(Q1)
cbind(Q1, means)
You do not need the loops to get the averages; you can use vectorised operations:
Col1 <- c(1,2,3,4,5)
Col2 <- c(2,3,4,5,6)
Mean <- (Col1+Col2)/2
Q1 <- rbind(Col1, Col2, Mean)
Note that rbind() treats your vectors as rows; use cbind() if you want them as columns.
You could just use the outer() function to calculate all the pairwise averages first, then use lower.tri() to fill the area underneath the diagonal of the matrix with NA values.
avgs <- outer(Q1$Col1, Q1$Col2, "+") / 2
avgs[lower.tri(avgs)] <- NA
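If you then want the unique averages as a single vector rather than a matrix (a follow-up of my own, not part of the original answer), you can extract the upper triangle including the diagonal; note the values come out in column-major order:
avgs[upper.tri(avgs, diag = TRUE)]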

Split Data into groups of equal means

I'm looking for a way to split a data frame into groups of equal size (essentially the same number of rows in each group) whose means are nearly equal.
User Data
1 5.0
2 4.5
3 3.5
4 6.0
5 7.0
6 6.5
7 5.5
8 6.2
9 5.7
10 5.9
This is very similar to this request; however, that only splits the data into 2 groups.
My actual dataset contains anywhere from 75-150 rows, and I need to split it into anywhere from 5-10 groups of equal mean and fairly equal size.
I've researched on Google & Stack Exchange for the last few days, and I'm just not having much luck. Any guidance would be great.
Thanks in advance!
More details:
Maybe I need to provide some more details. Below I've included a real dataset. We are a transportation company; this dataset has Driver ID, Miles, and Gallons provided. What I have been doing is reading the data into R and adding an MPG column like so:
data <- read.csv('filename')
data$MPG <- data$Miles / data$Gallons
Then I tried the two answers provided below. Arun's idea gives me almost equal group sizes (9 members per group, 10 groups), but the variation of the means is large, from 6.615 to 7.093, which is too much variation for me to start off with. Thomas' idea gets a somewhat tighter variation, but the group sizes vary from 6 to 13 members.
What we are looking to do is improve fleet MPG, and we're going to accomplish this with a team-based competition, so I need to randomly assemble the teams so that they all start from roughly the same group MPG.
Maybe that helps and can lead us in the correct direction? I tried doing this just in my programming language, but it locks the computer up every time, so I figured that R would probably be able to process the data better.
Thanks again!
If similar means are really all that matters, I've put together a simulation below that looks at a bunch of different groupings of the data (n draws) for a particular number of groups (k) and then minimizes the variance of the group means. From that minimization you can then extract the best grouping from the simulation results.
df <- data.frame(User=1:1000, Data=rnorm(1000,0,1)) # example data
myfun <- function() {
  k <- 5 # number of groups
  tmp <- seq(nrow(df)) %% k # really efficient code from @qwwqwwq's answer
  thisgroup <- sample(tmp, nrow(df), FALSE) # shuffle the equal-sized labels
  # thisgroup <- sample(1:k, nrow(df), TRUE) # original version (group sizes vary)
  thisavg <- as.vector(by(df$Data, thisgroup, mean)) # group means
  thisvar <- var(thisavg) # variance of the means
  return(list(group=thisgroup, avgs=thisavg, var=thisvar))
}
n <- 1000 # number of simulations
sorts <- replicate(n, myfun(), simplify=FALSE)
wh <- which.min(sapply(sorts, function(x) x$var)) # minimization
# sorts[[wh]] # this is the sample you want
split(df, sorts[[wh]]$group) # list of separate dataframes for each group
You could also let k vary across simulations, if you don't care exactly how many cases end up in each group, by moving the k <- 5 line into the function and making it a random draw from the range of group counts you're willing to accept.
There are probably other ways to do this, though.
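To see how tight the winning grouping actually is, you can compute its group means directly (a small usage sketch of my own, reusing the objects defined above):
sapply(split(df$Data, sorts[[wh]]$group), mean) # the group means
sorts[[wh]]$var # the minimized variance of those means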
Going by Thomas' idea, here's a brute-force/greedy approach that'll give more or less the same values (you can run more repetitions until you're happy with how close the solution is).
# Assuming the data you provided is in `df`
grp <- 5
myfun <- function() {
  samp <- sample(nrow(df))
  s.mean <- tapply(df$Data, samp %% grp, mean)
  s.var <- var(s.mean)
  list(samp, s.mean, s.var)
}
out <- replicate(1000, myfun(), simplify=FALSE)
min.pos <- which.min(sapply(out, `[[`, 3))
min.idx <- out[[min.pos]][[1]]
split(df$Data, min.idx %% grp)
$`0`
[1] 6.0 5.7
$`1`
[1] 5.5 5.9
$`2`
[1] 5.0 6.2
$`3`
[1] 3.5 7.0
$`4`
[1] 4.5 6.5
This is what out[min.pos] looks like:
out[min.pos]
[[1]]
[[1]][[1]]
[1] 7 9 8 5 3 4 1 2 10 6
[[1]][[2]]
0 1 2 3 4
5.85 5.70 5.60 5.25 5.50
[[1]][[3]]
[1] 0.05075
Simplest way I can think of: sort the data, take the indices modulo the number of groups, and you're done. It should work well if the data are normally distributed, I think, and it has the advantage of the groups being as equally sized as possible.
mpg <- rnorm(150)
mpg <- sort(mpg)
ngroups <- 13
df <- data.frame(mpg=mpg, group=seq(length(mpg)) %% ngroups)
tapply(df$mpg, df$group, mean)
0 1 2 3 4 5 6 7 8
0.080400272 -0.110797283 -0.046698548 -0.014177675 0.024410834 0.048370962 0.066265303 0.087119914 -0.062259638
9 10 11 12
-0.042172496 -0.003451581 0.033853024 0.056947458
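Since 150 rows don't divide evenly into 13 groups, the group sizes here differ by at most one, which a quick check of my own confirms:
table(df$group)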

R as.numeric in a formula

I have the following data frame:
id<-c(1,2,3,4)
x<-c(0,2,1,0)
A<-c(1,3,4,3)
df<-data.frame(id,x,A)
Now I want to make a variable called value such that if x > 0, value for each id equals A + 1/x, and if x == 0, value equals A.
With this aim I have typed:
value <- df$A + as.numeric(df$x > 0)*(1/df$x)
What I expect is a vector as follows:
[1] 1 3.5 5.0 3
However, what I get by typing the above command is:
[1] NaN 3.5 5.0 NaN
I wonder if anyone can help me with this problem.
Thanks in advance!
I think it would be simpler to use function ifelse in this case:
ifelse(df$x>0, df$A + (1/df$x), df$A)
[1] 1.0 3.5 5.0 3.0
See ?ifelse for more details.
With the command you were trying, although as.numeric(df$x > 0) was indeed giving you a vector of 1s and 0s (so it was a good idea), 1/df$x is Inf when df$x equals 0, since division by zero yields Inf in R, and multiplying that Inf by 0 is what produces the NaN.
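An alternative sketch of my own (not from the original answer) avoids evaluating 1/x at the zero entries altogether by using replacement with logical indexing:
pos <- df$x > 0 # rows where the extra term applies
value <- df$A
value[pos] <- value[pos] + 1/df$x[pos]
value
[1] 1.0 3.5 5.0 3.0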

Avoiding loops in R

I have decided to learn R. I am trying to get a sense of how to write "R style" functions and to avoid looping. Here is a sample situation:
Given a vector a, I would like to compute a vector b whose elements b[i] (the vector index begins at 1) are defined as follows:
b[i] = NaN, for 1 <= i <= 4
b[i] = mean(a[(i-4):i]), for 5 <= i <= length(a)
Essentially, if we pretend 'a' is a list of speeds where the first entry is at time = 0, the second at time = 1 second, the third at time = 2 seconds... I would like to obtain a corresponding vector describing the average speed over the past 5 seconds.
E.g.:
If a is (1,1,1,1,1,4,6,3,6,8,9) then b should be (NaN, NaN, NaN, NaN, 1, 1.6, 2.6, 3, 4, 5.4, 6.4)
I could do this using a loop, but I feel that doing so would not be in "R style".
Thank you,
Tungata
Because these rolling functions come up so often with time-series data, some of the newer and richer time-series data-handling packages already do this for you:
R> library(zoo) ## load zoo
R> speed <- c(1,1,1,1,1,4,6,3,6,8,9)
R> zsp <- zoo( speed, order.by=1:length(speed) ) ## creates a zoo object
R> rollmean(zsp, 5) ## default use
3 4 5 6 7 8 9
1.0 1.6 2.6 3.0 4.0 5.4 6.4
R> rollmean(zsp, 5, na.pad=TRUE, align="right") ## with padding and aligned
1 2 3 4 5 6 7 8 9 10 11
NA NA NA NA 1.0 1.6 2.6 3.0 4.0 5.4 6.4
zoo has excellent documentation that will show you many, many more examples, in particular how to do this with real (and possibly irregular) dates; xts extends this further, but zoo is a better starting point.
Something like b <- stats::filter(a, rep(1/5, 5), sides=1) will do the job, although you will get NA values in the first few slots instead of NaN. R has a large library of built-in functions, and "R style" is to use those wherever possible. Take a look at the documentation for the filter function.
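A quick check of my own with the vector from the question confirms the padding behaviour (filter returns a ts object, so as.numeric is used here just to print a plain vector):
a <- c(1,1,1,1,1,4,6,3,6,8,9)
as.numeric(stats::filter(a, rep(1/5, 5), sides=1))
[1]  NA  NA  NA  NA 1.0 1.6 2.6 3.0 4.0 5.4 6.4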
You can also use a combination of cumsum and diff to get the sum over sliding windows. You'll need to pad with your own NaN, though:
> speed <- c(1,1,1,1,1,4,6,3,6,8,9)
> diff(cumsum(c(0,speed)), 5)/5
[1] 1.0 1.6 2.6 3.0 4.0 5.4 6.4
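Padding the front yourself, as this answer suggests, might look like this (my own completion of the note above):
b <- c(rep(NaN, 4), diff(cumsum(c(0, speed)), lag = 5)/5)
b
[1] NaN NaN NaN NaN 1.0 1.6 2.6 3.0 4.0 5.4 6.4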
