Difference between mean and manual calculation in R?

I am writing a simple function in R to calculate percentage differences between two input numbers.
pdiff <-function(a,b)
{
if(length(a>=1)) a <- median(a)
if(length(b>=1)) b <- median(b)
(abs(a-b)/((a+b)/2))*100
}
pdiffa <-function(a,b)
{
if(length(a>=1)) a <- median(a)
if(length(b>=1)) b <- median(b)
(abs(a-b)/mean(a,b))*100
}
When I run them with sample values of a and b, the two functions give different results:
x <- 5
y <- 10
pdiff(x,y) #gives 66%
pdiffa(x,y) #gives 100%
Stepping through the code, I see that (x+y)/2 = 7.5 but mean(x,y) = 5. Am I missing something really obvious and stupid here?

This is due to a nasty "gotcha" in the mean() function (not listed in the list of R traps, but probably should be): you want mean(c(a,b)), not mean(a,b). From ?mean:
mean(x, ...)
[snip snip snip]
... further arguments passed to or from other methods.
So what happens if you call mean(5,10)? mean calls the mean.default method, which has trim as its second argument:
trim the fraction (0 to 0.5) of observations to be trimmed from each end of x before the mean is computed. Values of trim outside that range are taken as the nearest endpoint.
The last phrase, "values of trim outside that range are taken as the nearest endpoint", means that values of trim larger than 0.5 are set to 0.5. We are then asking mean to throw away 50% of the data from each end of the data set, so all that's left is the median. Debugging our way through mean.default, we see that we indeed end up at this code:
if (trim >= 0.5)
return(stats::median(x, na.rm = FALSE))
So mean(x, <value_greater_than_0.5>) with x = 5 returns the median of c(5), which is just 5.
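As a small illustration of the trim behavior described above (the vector v is made up for this example), compare an ordinary mean, a trimmed mean, and a call where a second positional argument is silently taken as trim:

```r
v <- c(1, 2, 3, 100)            # hypothetical data with one outlier

plain    <- mean(v)              # ordinary mean: 26.5
trimmed  <- mean(v, trim = 0.25) # drops 1 and 100, averages 2 and 3: 2.5
as_trim  <- mean(v, 10)          # 10 is matched to trim, clamped to 0.5: the median, 2.5
gotcha   <- mean(5, 10)          # x = 5, trim = 10: median of c(5), i.e. 5
intended <- mean(c(5, 10))       # both numbers in one vector: 7.5
```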

Try mean(5, 10) by itself.
mean(5, 10)
[1] 5
Now try mean(c(5, 10)).
mean(c(5, 10))
[1] 7.5
mean takes a vector as its first argument.
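Putting the fix into the original function might look like the sketch below. The name pdiff_fixed is hypothetical, and median() is applied unconditionally, which assumes the length checks in the question were only meant to collapse vector input to a single number.

```r
# Percentage difference relative to the mean of the two values.
pdiff_fixed <- function(a, b) {
  a <- median(a)                    # collapse vector input to one number
  b <- median(b)
  abs(a - b) / mean(c(a, b)) * 100  # note c(a, b): one vector, not two arguments
}

res <- pdiff_fixed(5, 10)  # 5 / 7.5 * 100, about 66.67
```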

Related

what is the most efficient way to find the most common value in a vector?

I'm trying to create a function to solve this puzzle:
An Arithmetic Progression is defined as one in which there is a constant difference between the consecutive terms of a given series of numbers. You are provided with consecutive elements of an Arithmetic Progression. There is however one hitch: exactly one term from the original series is missing from the set of numbers which have been given to you. The rest of the given series is the same as the original AP. Find the missing term.
You have to write the function findMissing(list), list will always be at least 3 numbers. The missing term will never be the first or last one.
The next section of code shows my attempt at this function. The site I'm on runs tests against the function, all of which passed, in that they output the correct missing integer.
The problem I'm facing is a timeout error, because it takes too long to run all the tests. There are 102 tests, and the site says they take over 12 seconds to complete. Taking more than 12 seconds means the function isn't efficient enough.
My own timing tests in RStudio suggest the function takes considerably less than 12 seconds to run, but regardless, I need to make it more efficient to be able to complete the puzzle.
I asked on the site forum and someone said "Sorting is expensive, think of another way of doing it without it." I took this to mean I shouldn't be using the sort() function. Is this what they mean?
I've since found a few other ways of getting my_diff, which is currently calculated using the sort() function. All of them are even less efficient than the original.
Can anyone give me a more efficient way of finding my_diff without the sort, or maybe make other parts of the code more efficient? The sort() part is apparently the inefficient part of the code, though.
find_missing <- function(sequence){
  len <- length(sequence)
  if(len > 3){
    my_diff <- as.integer(names(sort(table(diff(sequence)), decreasing = TRUE))[1])
    complete_seq <- seq(sequence[1], sequence[len], my_diff)
  }else{
    differences <- diff(sequence)
    complete_seq_1 <- seq(sequence[1],sequence[len],differences[1])
    complete_seq_2 <- seq(sequence[1],sequence[len],differences[2])
    if(length(complete_seq_1) == 4){
      complete_seq <- complete_seq_1
    }else{
      complete_seq <- complete_seq_2
    }
  }
  complete_seq[!complete_seq %in% sequence]
}
Here are a couple of sample sequences to check the code works:
find_missing(c(1,3,5,9,11))
find_missing(c(1,5,7))
Here are some of the other things I tried instead of sort:
1:
library(pracma)
Mode(diff(sequence))
2:
library(dplyr)
(data.frame(diff_1 = diff(sequence)) %>%
group_by(diff_1) %>%
summarise(count = n()) %>%
ungroup() %>%
filter(count==max(count)))[1]
3:
MaxTable <- function(sequence, mult = FALSE) {
differences <- diff(sequence)
if (!is.factor(differences)) differences <- factor(differences)
A <- tabulate(differences)
if (isTRUE(mult)) {
as.integer(levels(differences)[A == max(A)])
}
else as.integer(levels(differences)[which.max(A)])
}

Here is one way to do this using seq. We can create a sequence from minimum value in sequence to maximum value in the sequence having length as length(x) + 1 as there is exactly one term missing in the sequence.
find_missing <- function(x) {
setdiff(seq(min(x), max(x), length.out = length(x) + 1), x)
}
find_missing(c(1,3,5,9,11))
#[1] 7
find_missing(c(1,5,7))
#[1] 3

This approach takes the diff() of the vector: for an increasing sequence, exactly one difference (the gap left by the missing term) will be larger than the others.
find_missing <- function(x) {
diffs <- diff(x)
x[which.max(diffs)] + min(diffs)
}
find_missing(c(1,3,5,9,11))
[1] 7
find_missing(c(1,5,7))
[1] 3
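The version above relies on the sequence being increasing; for a decreasing input such as c(11, 9, 7, 3, 1) it would return 7 rather than 5. A possible variant that handles both directions is sketched below. The function name is hypothetical; the idea is to locate the largest gap by absolute value and return its midpoint.

```r
find_missing_any_direction <- function(x) {
  diffs <- diff(x)
  k <- which.max(abs(diffs))  # position of the double-sized gap
  x[k] + diffs[k] / 2         # the missing term sits in the middle of that gap
}

inc <- find_missing_any_direction(c(1, 3, 5, 9, 11))  # 7
dec <- find_missing_any_direction(c(11, 9, 7, 3, 1))  # 5
```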

There is actually a simple formula for this, which will work even if your vector is not sorted...
find_missing <- function(x) {
(length(x) + 1) * (min(x) + max(x))/2 - sum(x)
}
find_missing(c(1,5,7))
[1] 3
find_missing(c(1,3,5,9,11,13,15))
[1] 7
find_missing(c(2,8,6))
[1] 4
It is based on the fact that the sum of the full series should be the average value times the length.
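To make the reasoning concrete: the full progression has length(x) + 1 terms with average (min(x) + max(x))/2, so its sum is known, and subtracting sum(x) leaves the missing term. The loop below is only a sanity check over a few made-up progressions (start value 10, several step sizes), not a proof.

```r
find_missing <- function(x) {
  (length(x) + 1) * (min(x) + max(x)) / 2 - sum(x)
}

ok <- TRUE
for (d in c(-3, 1, 2, 5)) {                    # hypothetical step sizes
  full <- seq(10, by = d, length.out = 8)      # complete progression
  for (k in 2:7) {                             # never drop the first or last term
    partial <- rev(full[-k])                   # order does not matter for the formula
    ok <- ok && isTRUE(all.equal(find_missing(partial), full[k]))
  }
}
ok  # TRUE if the formula recovered every dropped term
```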

Poisson Process algorithm in R (renewal processes perspective)

I have the following MATLAB code and I'm working on translating it to R:
nproc=40
T=3
lambda=4
tarr = zeros(1, nproc);
i = 1;
while (min(tarr(i,:))<= T)
tarr = [tarr; tarr(i, :)-log(rand(1, nproc))/lambda];
i = i+1;
end
tarr2=tarr';
X=min(tarr2);
stairs(X, 0:size(tarr, 1)-1);
It is the Poisson Process from the renewal processes perspective. I've done my best in R but something is wrong in my code:
nproc<-40
T<-3
lambda<-4
i<-1
tarr=array(0,nproc)
lst<-vector('list', 1)
while(min(tarr[i]<=T)){
tarr<-tarr[i]-log((runif(nproc))/lambda)
i=i+1
print(tarr)
}
tarr2=tarr^-1
X=min(tarr2)
plot(X, type="s")
The loop prints an aleatory (random) number of arrays, and only the last one is kept in tarr afterwards.
The result has to look like...
Thank you in advance. All interesting and supportive comments will be rewarded.

Adding on to the previous comment, there are a few things happening in the MATLAB script that are not in the R code:
[tarr; tarr(i, :)-log(rand(1, nproc))/lambda]; from my understanding, this adds another row to your matrix and populates it with tarr(i, :)-log(rand(1, nproc))/lambda.
You will need to use a different method, as MATLAB and R handle this type of thing differently.
One glaring thing that stands out to me is that you seem to be using R's tarr[i] and MATLAB's tarr(i, :) as equivalents, when they are very different. What I think you are trying to get is all the columns in a given row i, which in R is tarr[i, ].
The use of min also differs: R's min() returns the minimum of the whole matrix (just one number), whereas MATLAB's min() returns the minimum value of each column. For this in R you can use Rfast::colMins from the Rfast package.
I am not very familiar with the stairs part, but something like ggplot2::qplot(..., geom = "step") may work.
I have tried to create something that works in R, but I am not really sure what the required output is. Nevertheless, hopefully some of the basics can help you get it done on your side. Below is a quick attempt:
nproc <- 40
T0 <- 3
lambda <- 4
i <- 1
tarr <- matrix(rep(0, nproc), nrow = 1, ncol = nproc)
while(min(tarr[i, ]) <= T0){
# Major alteration, create a temporary row from previous row in tarr
temp <- matrix(tarr[i, ] - log((runif(nproc))/lambda), nrow = 1)
# Join temp row to tarr matrix
tarr <- rbind(tarr, temp)
i = i + 1
}
# I am not sure what was meant by tarr' in the matlab script I took it as inverse of tarr
# which in matlab is tarr.^(-1)??
tarr2 = tarr^(-1)
library(ggplot2)
library(Rfast)
min_for_each_col <- colMins(tarr2, value = TRUE)
qplot(seq_along(min_for_each_col), sort(min_for_each_col), geom="step")
As you can see, I have sorted min_for_each_col so that the plot is actually a stair plot and not some random stepwise plot. I think there is a problem, since in the MATLAB code 0:size(tarr, 1)-1 gives the number of rows less 1, and I can't figure out why, when taking colMins (and there are 40 columns), we would create only around 20 steps. But I might be completely misunderstanding! Also, I have changed T to T0, since in R T exists as shorthand for TRUE and it is not good to overwrite it!
Hope this helps!

I downloaded GNU Octave today to actually run the MATLAB code. After watching the code run, I made a few tweaks to the great answer by @Croote:
nproc <- 40
T0 <- 3
lambda <- 4
i <- 1
tarr <- matrix(rep(0, nproc), nrow = 1, ncol = nproc)
while(min(tarr[i, ]) <= T0){
temp <- matrix(tarr[i, ] - log(runif(nproc))/lambda, nrow = 1) #fixed paren
tarr <- rbind(tarr, temp)
i = i + 1
}
tarr2 = t(tarr) #takes transpose
library(ggplot2)
library(Rfast)
min_for_each_col <- colMins(tarr2, value = TRUE)
qplot(seq_along(min_for_each_col), sort(min_for_each_col), geom="step")
Edit: Some extra plotting tweaks -- seems to be closer to the original
qplot(min_for_each_col, seq_along(min_for_each_col), geom="step", ylab="", xlab="")
#or with ggplot2
df1 <- data.frame(min_for_each_col = min_for_each_col, index = seq_along(min_for_each_col)) # built directly, so no pipe operator is needed
ggplot() +
geom_step(data = df1, mapping = aes(x = min_for_each_col, y = index), color = "blue") +
labs(x = "", y = "")
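For comparison, the same translation can be sketched in base R only, without Rfast or ggplot2. This assumes, as in the answers above, that the MATLAB min(tarr') amounts to the minimum across processes for each row of tarr; set.seed(1) is added here only to make the run repeatable.

```r
set.seed(1)
nproc  <- 40
T0     <- 3    # renamed from T, which is shorthand for TRUE in R
lambda <- 4

tarr <- matrix(0, nrow = 1, ncol = nproc)
i <- 1
while (min(tarr[i, ]) <= T0) {
  # next arrival time of each process: previous time plus an Exp(lambda) gap
  tarr <- rbind(tarr, tarr[i, ] - log(runif(nproc)) / lambda)
  i <- i + 1
}

X <- apply(tarr, 1, min)  # row-wise minimum, like min(tarr') in MATLAB
plot(X, seq_along(X) - 1, type = "s", xlab = "", ylab = "")  # like stairs(X, 0:n-1)
```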

I'm not too familiar with renewal processes or matlab so bear with me if I misunderstood the intention of your code. That said, let's break down your R code step by step and see what is happening.
The first 4 lines assign numbers to variables.
The fifth line creates an array with 40 (nproc) zeros.
The sixth line (which doesn't seem to be used later) creates a list of length one.
The seventh line starts a while loop. I suspect this line is supposed to say "while the min value of tarr is less than or equal to T", or perhaps "while i is less than or equal to T".
It actually takes the minimum of a single logical value (tarr[i] <= T). This can work because TRUE and FALSE are treated like numbers, namely:
TRUE == 1 # returns TRUE
FALSE == 0 # returns TRUE
TRUE == 0 # returns FALSE
FALSE == 1 # returns FALSE
However, since the value of tarr[i] depends on a random number (see line 8), the same code can run differently each time it is executed. This might explain why the code "prints an aleatory number of arrays".
The eighth line overwrites tarr with the result of the computation on the right. It takes the single value tarr[i] and subtracts from it the natural log of (runif(nproc) divided by lambda, which is 4), giving 40 different values. Only the forty values from the last pass through the loop remain stored in tarr.
If you want to store all forty values from each pass through the loop, I'd suggest storing them in, say, a matrix or data frame instead. If that's what you want to do, here's an example of storing them in a matrix:
for(i in 1:nrow(yourMatrix)){
# computations
yourMatrix[i,] <- rowCreatedByComputations
}
See this answer for more info about that. Also, since it's a set number of values per run, you could keep them in a vector and simply append to the vector each loop like this:
vector <- c(vector,newvector)
The ninth line increases i by one.
The tenth line prints tarr.
The eleventh line closes the loop.
Then after the loop tarr2 is assigned 1/tarr. Again this will be 40 values from the last time through the loop (line 8)
Then X is assigned the min value of tarr2.
This single value is plotted in the last line.
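The matrix-storage suggestion in the walkthrough above can be made concrete with a small sketch. The sizes n_steps and nproc and the use of runif() as the per-pass computation are placeholders chosen for illustration, not part of the original code.

```r
set.seed(42)
n_steps <- 5   # hypothetical number of passes through the loop
nproc   <- 3   # hypothetical number of values produced per pass

# Preallocate one row per pass, then fill the rows in place.
out <- matrix(NA_real_, nrow = n_steps, ncol = nproc)
for (i in seq_len(n_steps)) {
  out[i, ] <- runif(nproc)  # stand-in for the real computation
}

dim(out)  # every pass is kept, not just the last one
```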
Also note that runif samples from the uniform distribution -- if you're looking for a Poisson distribution see: Poisson
Hope this helped! Let me know if there's more I can do to help.

Median for a sequence with missing values in R

I'm trying to write a median-like function that handles vectors with both even and odd numbers of observations (the calculation method differs between the two cases). I need to include an option for missing values as well. How can I do this? I need to do it without using the built-in median(x) in R. Is that possible? Below is what I have tried so far:
x<-c(1,2,3)
function (x, na.rm = FALSE) {
x <- x[!is.na(x)]
return(mediana = median(x))
}
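One way the requested function could be written without calling median() is sketched below. The name manual_median is hypothetical, and returning NA for empty input or for missing values when na.rm = FALSE is an assumption chosen to mirror the built-in behavior.

```r
manual_median <- function(x, na.rm = FALSE) {
  if (na.rm) {
    x <- x[!is.na(x)]        # drop missing values on request
  } else if (anyNA(x)) {
    return(NA_real_)         # otherwise missing values propagate
  }
  n <- length(x)
  if (n == 0L) return(NA_real_)
  s <- sort(x)
  mid <- n %/% 2L
  if (n %% 2L == 1L) {
    s[mid + 1L]              # odd count: the middle value
  } else {
    (s[mid] + s[mid + 1L]) / 2  # even count: mean of the two middle values
  }
}

manual_median(c(1, 2, 3))                     # 2
manual_median(c(4, 1, 2, 3))                  # 2.5
manual_median(c(4, 1, NA, 3), na.rm = TRUE)   # 3
```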

Finding a value in an interval

Sorry if this is a basic question. I have been trying to figure this out but have not been able to.
I have a vector of values called sym.
> head(sym)
[,1]
val 3.652166e-05
val -2.094026e-05
val 4.583950e-05
val 6.570184e-06
val -1.431486e-05
val -5.339604e-06
I put these into intervals by calling factor() on the result of cut() applied to sym.
factorx<-factor(cut(sym,breaks=nclass.Sturges(sym)))
[1] (2.82e-05,5.28e-05] (-2.11e-05,3.55e-06] (2.82e-05,5.28e-05] (3.55e-06,2.82e-05] (-2.11e-05,3.55e-06] (-2.11e-05,3.55e-06]
[7] (-2.11e-05,3.55e-06] (2.82e-05,5.28e-05] (3.55e-06,2.82e-05] (7.74e-05,0.000102]
Levels: (-2.11e-05,3.55e-06] (3.55e-06,2.82e-05] (2.82e-05,5.28e-05] (7.74e-05,0.000102]
So clearly, four intervals were created in factorx. Now I have a new value tmp=3.7e-06.
My question is: how can I find which of the above intervals it belongs to? I tried to use findInterval(), but it does not seem to work on factors like factorx.
Thanks

If you plan to re-classify new values, it's best to explicitly set the breaks= parameter with a vector rather than a number of intervals. Note that had those values been in the set originally, you might have had different breaks, and your new values may fall outside all the levels of your existing data, which can be troublesome.
So first, I will generate some sample data.
set.seed(18)
x <- runif(50)
Now I will show two different ways to calculate breaks. Here are b1() and b2():
b1<-function(x, n=nclass.Sturges(x)) {
#like default cut()
nb <- as.integer(n + 1)
dx <- diff(rx <- range(x, na.rm = TRUE))
if (dx == 0)
dx <- abs(rx[1L])
seq.int(rx[1L] - dx/1000, rx[2L] + dx/1000,
length.out = nb)
}
b2<-function(x, n=nclass.Sturges(x)) {
#like default hist()
pretty(range(x), n=n)
}
So each of these functions will give break points similar to either the default behaviors of cut() or hist(). Rather than just a single number of breaks, they each return a vector with all the break points explicitly stated. This allows you to use cut() to create your factor
mybreaks <- b1(x)
factorx <- cut(x,breaks=mybreaks)
(Note that you don't have to wrap cut() in factor(), as cut() already returns a factor.) Now, if you get new values, you can classify them using findInterval() and the breaks vector you've already prepared:
nv <- runif(5)
grp <- findInterval(nv,mybreaks)
And we can check the results with
data.frame(grp=levels(factorx)[grp], x=nv)
# grp x
# 1 (0.831,0.969] 0.8769438
# 2 (0.00131,0.14] 0.1188054
# 3 (0.416,0.554] 0.5467373
# 4 (0.14,0.278] 0.2327532
# 5 (0.554,0.693] 0.6022678
and everything looks pretty good. In this case, findInterval() tells you which level of the previously created factor each item belongs to. It returns 0 if a value is smaller than the first break, and it returns length(mybreaks), one more than the number of levels, for anything at or beyond the last break, so such values index past the last level. This differs somewhat from cut(), which returns NA in both cases. Also note that cut() uses right-closed intervals by default, whereas findInterval() by default treats each interval as closed on the left and open on the right.
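The boundary differences are easiest to see on a tiny example. The breaks and values below are made up; left.open = TRUE is used so that findInterval() follows the same (a, b] convention as the default cut().

```r
breaks   <- c(0, 1, 2, 3)          # hypothetical explicit break points
old      <- c(0.5, 1.5, 2.5)
f        <- cut(old, breaks = breaks)   # levels: (0,1] (1,2] (2,3]

new_vals <- c(0.2, 1, 2.7, 5)      # 1 sits on a break, 5 is beyond the last break
idx      <- findInterval(new_vals, breaks, left.open = TRUE)

idx             # 1 1 3 4: the 4 points past the last level
levels(f)[idx]  # "(0,1]" "(0,1]" "(2,3]" NA
```

Values at or below the first break would get index 0, which silently drops the element when used for subsetting, so it may be worth checking for zeros before indexing levels().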

why does sd in R return a vector for matrix input, and what can I do about it?

I am somewhat confused as to why the sd function in R returns an array for matrix input (I suppose to maintain backwards compatibility, it always will). This is very odd behaviour to me:
#3d input, same same
print(length(mean(array(rnorm(60),dim=c(3,4,5)))))
print(length(sd(array(rnorm(60),dim=c(3,4,5)))))
#1d input, same same
print(length(mean(array(rnorm(60),dim=c(60)))))
print(length(sd(array(rnorm(60),dim=c(60)))))
#2d input, different!
print(length(mean(array(rnorm(60),dim=c(12,5)))))
print(length(sd(array(rnorm(60),dim=c(12,5)))))
I get
[1] 1
[1] 1
[1] 1
[1] 1
[1] 1
[1] 5
That is sd behaves differently from mean when the input is a 2-d array (and apparently only in that case!) Consider then, this failed function to rescale each column of a k-dimensional array by the standard deviation:
re.scale <- function(x) {
#rescale by the standard deviation of each column
scales <- apply(x,2,sd)
ret.val <- sweep(x,2,scales,"/")
}
#this works just fine
x <- array(rnorm(60),dim=c(12,5))
y <- re.scale(x)
#this throws a warning
x <- array(rnorm(60),dim=c(3,4,5))
y <- re.scale(x)
Is there some other function to replace sd without this weird behavior? How would one write re.scale properly? Or a Z-score-by-column function?

It is behaving as documented in sd's help page. At the very top it announces:
"If x is a matrix or a data frame, a vector of the standard deviation of the columns is returned."
Note that it does not mention arrays in general, so only two-dimensional arrays (matrices) are affected. If you want to avoid this behavior, just make a vector out of the input with c():
sd( c(array(rnorm(60),dim=c(12,5))) )
# [1] 0.9505643
I see that you added a request for column z scores. Try this for matrices:
colMeans(x)/sd(x)
And this for arrays (although the definition of a "column" may need clarification):
apply(x, 2:3, mean)/apply(x, 2:3, sd) # will generalize to higher dimensions

The behavior of sd changed across R versions:
1. version 2.13.2(2011-09-30) and earlier
> set.seed(1)
> sd(array(rnorm(60),dim=c(12,5)))
[1] 0.8107276 1.1234795 0.7925743 0.6186082 0.9464160
Description
This function computes the standard deviation of the values in x. If
na.rm is TRUE then missing values are removed before computation
proceeds.
If x is a matrix or a data frame, a vector of the standard
deviation of the columns is returned.
2. R version 2.14.0(2011-10-31) - 2.15.3(2013-03-01)
> set.seed(1)
> sd(array(rnorm(60),dim=c(12,5)))
[1] 0.8107276 1.1234795 0.7925743 0.6186082 0.9464160
WARNING:
sd(<matrix>) is deprecated.
Use apply(*, 2, sd) instead.
Details
Prior to R 2.14.0, sd(dfrm) worked directly for a data.frame
dfrm. This is now deprecated and you are expected to use sapply(dfrm,
sd).
3. R version 3.0.0 (2013-04-03) and later
> sd(array(rnorm(60),dim=c(12,5)))
[1] 0.8551688
>
(no warning)
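Given that history, a version-independent way to get per-column scaling is to flatten each slice before calling sd. The sketch below is one possible approach, with set.seed(1) and the variable names chosen for illustration; for the matrix case it is checked against scale(), which computes column z-scores directly.

```r
set.seed(1)

# Matrix case: column z-scores by hand, compared with scale().
m      <- array(rnorm(60), dim = c(12, 5))
col_sd <- apply(m, 2, sd)
z      <- sweep(sweep(m, 2, colMeans(m), "-"), 2, col_sd, "/")
same   <- isTRUE(all.equal(z, scale(m), check.attributes = FALSE))

# 3-d array case: as.vector() ensures sd always sees a plain vector,
# whatever sd() does with matrices in the running R version.
a        <- array(rnorm(60), dim = c(3, 4, 5))
scales   <- apply(a, 2, function(v) sd(as.vector(v)))
a_scaled <- sweep(a, 2, scales, "/")
```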