quick standard deviation with weights - r

I wanted to use a function that would quickly give me the standard deviation of a vector and allow me to include weights for the elements in the vector, i.e.
sd(c(1,2,3)) #weights all equal 1
#[1] 1
sd(c(1,2,3,3,3)) #weights equal 1,1,3 respectively
#[1] 0.8944272
For weighted means I can use wt.mean() from library(SDMTools) e.g.
> mean(c(1,2,3))
[1] 2
> wt.mean(c(1,2,3),c(1,1,1))
[1] 2
>
> mean(c(1,2,3,3,3))
[1] 2.4
> wt.mean(c(1,2,3),c(1,1,3))
[1] 2.4
but the wt.sd function does not seem to provide what I thought I wanted:
> sd(c(1,2,3))
[1] 1
> wt.sd(c(1,2,3),c(1,1,1))
[1] 1
> sd(c(1,2,3,3,3))
[1] 0.8944272
> wt.sd(c(1,2,3),c(1,1,3))
[1] 1.069045
I am expecting a function that returns 0.8944272 for my weighted sd. Preferably I would be using this on a data.frame like:
data.frame(x=c(1,2,3),w=c(1,1,3))

library(Hmisc)
sqrt(wtd.var(1:3,c(1,1,3)))
#[1] 0.8944272

You can use rep to replicate the values according to their weights. Then, sd can be computed for the resulting vector.
x <- c(1, 2, 3) # values
w <- c(1, 1, 3) # weights
sd(rep(x, w))
[1] 0.8944272
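If the weights are not whole numbers, rep() no longer works, but the same number can be computed directly by treating the weights as frequency counts. The following is only a sketch in base R: weighted_sd and reliability_sd are hypothetical helper names, not functions from any package. The second helper uses a different ("reliability weights") normalisation, which happens to reproduce the 1.069045 seen in the question, so the discrepancy looks like a difference in definition rather than a bug.

```r
# Weighted sd with weights read as frequency counts (hypothetical helper).
weighted_sd <- function(x, w) {
  stopifnot(length(x) == length(w), all(w >= 0), sum(w) > 1)
  m <- sum(w * x) / sum(w)                  # weighted mean
  sqrt(sum(w * (x - m)^2) / (sum(w) - 1))   # n - 1 denominator with n = sum(w)
}

# Alternative normalisation for "reliability" weights, for comparison only.
reliability_sd <- function(x, w) {
  s <- sum(w)
  m <- sum(w * x) / s
  sqrt(sum(w * (x - m)^2) * s / (s^2 - sum(w^2)))
}

df <- data.frame(x = c(1, 2, 3), w = c(1, 1, 3))
weighted_sd(df$x, df$w)      # matches sd(rep(df$x, df$w))
reliability_sd(df$x, df$w)   # the larger value reported in the question
```

With frequency weights the denominator is sum(w) - 1, which is exactly what sd() uses on the replicated vector.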

Related

R - which of these functions is giving the correct value of the integrals?

I am trying to compute the integral of 1/x from -Inf to Inf in R.
I found three functions which can be used for this and they are all giving me different results. Here is the code:
integrand <- function(x){
r <- 1/x
return(r)
}
First is the option from base R:
integrate(integrand,-Inf, Inf)
Giving the result:
0 with absolute error < 0
The second is from the pracma package:
quadinf(integrand, -Inf, Inf)
Giving this output:
$Q
[1] -106.227
$relerr
[1] 108.0135
$niter
[1] 7
And the last one is from the cubature package:
cubintegrate(integrand, -Inf, Inf)
Which gives the following result:
$integral
[1] Inf
$error
[1] NaN
$neval
[1] 15
$returnCode
[1] 0
So then, which one of these is correct and which should I trust? Is it 0, infinity, or -106.227? Why are they all different in the first place?
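1/x isn't integrable over the range (-Inf, Inf) because it has a non-integrable singularity at 0. The integral diverges, so none of the three results is a meaningful value.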
On an integrable range, results are similar:
integrate(\(x) 1/x,1,2)
#0.6931472 with absolute error < 7.7e-15
pracma::quadinf( \(x) 1/x,1,2)
#$Q
#[1] 0.6931472
#$relerr
#[1] 7.993606e-15
#$niter
#[1] 4
Note that on the range (0, Inf) an antiderivative of 1/x is log(x), so the integral from 1 to 2 is:
log(2)-log(1)
#[1] 0.6931472
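To see why the singularity at 0 is the problem, here is a small sketch that uses the antiderivative log(x) instead of a numerical routine. The integral of 1/x from eps to 1 is log(1) - log(eps) = -log(eps), and it grows without bound as eps shrinks (eps is just an illustrative name for the lower limit).

```r
# Integral of 1/x over [eps, 1], evaluated analytically as -log(eps)
eps <- 10^(-(1:6))
data.frame(eps = eps, integral = -log(eps))

# In the limit eps -> 0 the value is infinite
-log(0)
```

The same happens on the negative side with the opposite sign, so any finite answer over (-Inf, Inf) depends on how a routine happens to pair up the two diverging halves.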

Prob function with rolldie: I don't know how to write it exactly (see question b)

Preconditions: the "prob" package and the packages it requires have been installed.
a) Consider the experiment of rolling three dice. Using R, show how you would use a user-defined function to define a random variable that is the mean of the three rolls rounded to the nearest integer.
> rollthree <- rolldie(3, makespace = TRUE)
> rollthree$mean = as.integer((rollthree$X1 + rollthree$X2 + rollthree$X3)/3)
> rollthree
X1 X2 X3 probs mean
1 1 1 1 0.00462963 1
2 2 1 1 0.00462963 1
... ...
b) Using the above result, what is the probability that the random variable equals 3? What is the probability that the random variable takes a value of at most 3? What is the probability that the random variable takes on a value of at least 3? Use the Prob function as shown in the code samples.
> equal3 <- subset(rollthree$mean, rank == 3)
Error in rank == 3 :
comparison (1) is possible only for atomic and list types
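I believe the issue here is that rank is not a column of rollthree, so the name resolves to the base R function rank, and comparing a function with 3 fails. subset() also needs the whole data frame rather than just the mean column. One solution is equal3 <- subset(rollthree, mean == 3), which stores all of the rows where the mean is 3. Then we can sum the probabilities of those rows or, since every outcome is equally likely, multiply the probability of a single outcome by the number of rows.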
Using your code as a base I have produced the following code.
library(prob)
# Part a
rollthree <- rolldie(3, makespace = TRUE)
# Note: as.integer() truncates toward zero rather than rounding
rollthree$mean <- as.integer((rollthree$X1 + rollthree$X2 + rollthree$X3) / 3)
# Part b
print("Probability mean is 3:")
# Note here we sum the probabilities of the events we want to occur
# Additionally we do this in one line by taking only the probs column from the subset
sum(subset(rollthree, mean == 3)$probs)
print("Probability mean is less than or equal to 3:")
sum(subset(rollthree, mean <= 3)$probs)
print("Probability mean is greater than or equal to 3:")
sum(subset(rollthree, mean >= 3)$probs)
#> [1] "Probability mean is 3:"
#> [1] 0.3657407
#> [1] "Probability mean is less than or equal to 3:"
#> [1] 0.625
#> [1] "Probability mean is greater than or equal to 3:"
#> [1] 0.7407407
Created on 2021-06-08 by the reprex package (v2.0.0)
An alternate approach for a) is written below:
library(prob)
# part a
#function to roll and calculate the means for some number of dice
roll_x_mean_int <- function(x) {
# Check the input value is an integer
if(typeof(x) != "integer"){
stop("Input value is not an integer")
}
# Check the input value is positive
if(x < 1){
stop("Input integer is not positive")
}
# Roll the die
vals <- rolldie(x, makespace = TRUE)
# Calculate the mean of each row of rolls (excluding the probability column),
# truncated to an integer by as.integer()
vals$mean <- as.integer(rowSums(vals[1:x]/x))
return(vals)
}
# Call the function with 3 dice (note the L makes the value an integer)
rollthree <- roll_x_mean_int(3L)
# part b
# Called this section as one block
{
print("Probability mean is 3:")
print(sum(subset(rollthree, mean == 3)$probs))
print("Probability mean is less than or equal to 3:")
print(sum(subset(rollthree, mean <= 3)$probs))
print("Probability mean is greater than or equal to 3:")
print(sum(subset(rollthree, mean >= 3)$probs))
}
#> [1] "Probability mean is 3:"
#> [1] 0.3657407
#> [1] "Probability mean is less than or equal to 3:"
#> [1] 0.625
#> [1] "Probability mean is greater than or equal to 3:"
#> [1] 0.7407407
Created on 2021-06-08 by the reprex package (v2.0.0)
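One thing to watch: the exercise asks for the mean rounded to the nearest integer, while as.integer() truncates, so a mean of 3.67 becomes 3 rather than 4. The sketch below compares the two in base R only. It builds the sample space with expand.grid() instead of rolldie(), and the column names mean_trunc and mean_round are made up for illustration.

```r
# Sample space of three fair dice, built without the prob package
rolls <- expand.grid(X1 = 1:6, X2 = 1:6, X3 = 1:6)
rolls$probs <- 1 / nrow(rolls)            # 216 equally likely outcomes

avg <- rowMeans(rolls[c("X1", "X2", "X3")])
rolls$mean_trunc <- as.integer(avg)       # truncation, as in the code above
rolls$mean_round <- round(avg)            # rounding to the nearest integer

p_trunc <- sum(rolls$probs[rolls$mean_trunc == 3])   # sums 9, 10, 11: 79/216
p_round <- sum(rolls$probs[rolls$mean_round == 3])   # sums 8, 9, 10: 73/216
c(truncated = p_trunc, rounded = p_round)
```

The truncated version reproduces the 0.3657407 shown above. If the rounded definition is what the exercise intends, swapping as.integer() for round() in either answer should be enough.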

How to generate random numbers with conditions imposed in R?

I would like to generate 500 different combinations of a, b, and c meeting the following conditions:
a + b + c = 1 and
a < b < c
Here is a basic sample of generating random numbers; however, I need to generate them based on the aforementioned conditions.
Coeff = data.frame(a=runif(500, min = 0, max = 1),
b=runif(500, min = 0, max = 1),
c=runif(500, min = 0, max = 1))
myrandom <- function(n) {
m <- matrix(runif(3*n), ncol=3)
m <- cbind(m, rowSums(m)) # rowSums is efficient
t(apply(m, 1, function(a) sort(a[1:3] / a[4])))
}
Demonstration:
set.seed(2)
(m <- myrandom(5))
# [,1] [,2] [,3]
# [1,] 0.1099815 0.3287708 0.5612477
# [2,] 0.1206611 0.2231769 0.6561620
# [3,] 0.2645362 0.3509054 0.3845583
# [4,] 0.2057215 0.2213517 0.5729268
# [5,] 0.2134069 0.2896015 0.4969916
all(abs(rowSums(m) - 1) < 1e-8) # CONSTRAINT 1: a+b+c = 1
# [1] TRUE
all(apply(m, 1, diff) > 0) # CONSTRAINT 2: a < b < c
# [1] TRUE
Note:
My test for "sum to 1" is more than just ==1 because of IEEE-754 floating-point arithmetic (see R FAQ 7.31): floating-point results should be compared with a tolerance rather than tested for exact equality. If you test for ==1, you will eventually find rows where the test does not appear to be satisfied:
set.seed(2)
m <- myrandom(1e5)
head(which(rowSums(m) != 1))
# [1] 73 109 199 266 367 488
m[73,]
# [1] 0.05290744 0.24824770 0.69884486
sum(m[73,])
# [1] 1
sum(m[73,]) == 1
# [1] FALSE
abs(sum(m[73,]) - 1) < 1e-15
# [1] TRUE
max(abs(rowSums(m) - 1))
# [1] 1.110223e-16
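The same effect shows up in the classic R FAQ 7.31 example, sketched here with values that are not from the question. all.equal() compares with a tolerance and is wrapped in isTRUE() because it returns a character description rather than FALSE when the values differ.

```r
x <- 0.1 + 0.2

exact_match <- x == 0.3                     # exact comparison, fails here
tol_match   <- isTRUE(all.equal(x, 0.3))    # comparison with default tolerance
manual      <- abs(x - 0.3) < 1e-8          # explicit tolerance, as in the answer

c(exact = exact_match, all.equal = tol_match, manual = manual)
```

Either tolerance-based form works for the rowSums(m) check; the explicit abs() version just makes the chosen tolerance visible.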
I would like to point out that ANY distribution law (uniform, gaussian, exponential, ...) will produce numbers a, b and c meeting your condition as soon as you normalize and sort them, so there should be some domain knowledge to prefer one over the other.
As an alternative, I would propose the Dirichlet distribution, which produces numbers that naturally satisfy your first condition: a+b+c=1. I believe it has been applied to rainfall modelling as well (https://arxiv.org/pdf/1801.02962.pdf)
library(MCMCpack)
n <- 500 # number of (a, b, c) triples
abc <- rdirichlet(n, c(1,1,1)) # one triple per row
sum(abc) # should output n, since each row sums to 1
You could vary the concentration parameters (here c(1,1,1)) to shape the data and, of course, sort each row to satisfy your second condition. In many cases it is easier to reason about your model's behaviour if it uses a Dirichlet (for example, the Dirichlet is the conjugate prior for the multinomial in the Bayesian approach).
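If you would rather not install MCMCpack, a Dirichlet draw can also be sketched in base R, because independent gamma variables divided by their sum follow a Dirichlet distribution. The names alpha, g and abc below are illustrative, and each row is sorted afterwards to enforce a < b < c.

```r
set.seed(1)
n <- 500
alpha <- c(1, 1, 1)   # concentration parameters; c(1, 1, 1) is flat on the simplex

# Column j holds gamma draws with shape alpha[j]
g <- matrix(rgamma(n * 3, shape = rep(alpha, each = n)), ncol = 3)
abc <- g / rowSums(g)            # each row now sums to 1
abc <- t(apply(abc, 1, sort))    # sort within rows so that a < b < c
colnames(abc) <- c("a", "b", "c")

head(abc, 3)
```

Sorting does not change the row sums, so both conditions hold at once. Changing alpha shifts how spread out the three values tend to be.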

How can I print the p-value with 2 significant figures?

When I print my p value from my t.test by doing:
ttest_bb[3]
It returns the full p-value. How can I make it print only the first two digits, i.e. .03 instead of .034587297?
The output from t.test is a list. If you only use [ to grab the p-value then what is returned is a list with one element. You want to use [[ to grab the element contained at the spot in the list returned by t.test if you want to treat it as a vector.
> ttest_bb <- t.test(rnorm(20), rnorm(20))
> ttest_bb
Welch Two Sample t-test
data: rnorm(20) and rnorm(20)
t = -2.5027, df = 37.82, p-value = 0.01677
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-1.4193002 -0.1498456
sample estimates:
mean of x mean of y
-0.3727489 0.4118240
> # Notice that what is returned when subsetting like this is
> # a list with the name p.value
> ttest_bb[3]
$`p.value`
[1] 0.01676605
> # If we use double brackets then it extracts just the vector contained
> ttest_bb[[3]]
[1] 0.01676605
> # What you're seeing is this:
> round(ttest_bb[3])
Error in round(ttest_bb[3]) :
non-numeric argument to mathematical function
> # If you use double brackets you can use that value
> round(ttest_bb[[3]],2)
[1] 0.02
> # I prefer using the named argument to make it more clear what you're grabbing
> ttest_bb$p.value
[1] 0.01676605
> round(ttest_bb$p.value, 2)
[1] 0.02
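Note that round(x, 2) keeps two decimal places, which is not the same as the two significant figures asked for in the title. A short sketch of the difference, with the p-value copied from the output above:

```r
p <- 0.01676605

two_decimals <- round(p, 2)             # two decimal places
two_sig      <- signif(p, 2)            # two significant figures
as_text      <- format(p, digits = 2)   # character string, two significant digits

two_decimals
two_sig
as_text
```

signif() keeps the value numeric, while format() returns a string that is handy when pasting the p-value into a label or report.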

How can I detect changes from negative to positive value?

I have calculated the differences of my data points and received this vector:
> diff(smooth$a)/(diff(smooth$b))
[1] -0.0099976150 0.0011162606 0.0116275973 0.0247594149 0.0213592319 0.0205187495 0.0179274056 0.0207752713
[9] 0.0231903072 -0.0077549224 -0.0401528643 -0.0477294350 -0.0340842051 -0.0148157337 0.0003829642 0.0160912230
[17] 0.0311189830
Now I want to get the positions (index) where I have a change from negative to positive when the following 3 data points are also positive.
So my output would be like this:
> output
-0.0099976150 -0.0148157337
How could I do this?
One way to do it:
vec <- diff(smooth$a)/(diff(smooth$b)) # the vector of differences from the question
series <- paste(ifelse(vec < 0, 0, 1), collapse = '')
vec[gregexpr('0111', series)[[1]]]
#[1] -0.009997615 -0.014815734
The first line creates a sequence of 0s and 1s depending on the sign of the number. In the second line of the code we capture the sequence with gregexpr. Finally, we use these indices to subset the original vector.
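One caveat with the regex approach: when there is no match, gregexpr returns -1, and vec[-1] then silently drops the first element instead of returning nothing. A regex-free sketch that returns the positions directly is below. neg_then_pos is a hypothetical helper name, and it treats an exact 0 as not positive, whereas the ifelse() above counts 0 as positive.

```r
vec <- c(-0.0099976150, 0.0011162606, 0.0116275973, 0.0247594149,
         0.0213592319, 0.0205187495, 0.0179274056, 0.0207752713,
         0.0231903072, -0.0077549224, -0.0401528643, -0.0477294350,
         -0.0340842051, -0.0148157337, 0.0003829642, 0.0160912230,
         0.0311189830)

# Positions i where v[i] is negative and the next k values are all positive
neg_then_pos <- function(v, k = 3) {
  n <- length(v)
  if (n <= k) return(integer(0))
  starts <- seq_len(n - k)             # positions with k values after them
  ok <- v[starts] < 0
  for (j in seq_len(k)) ok <- ok & (v[starts + j] > 0)
  starts[ok]
}

idx <- neg_then_pos(vec)
idx        # positions of the sign changes
vec[idx]   # the negative values at those positions
```

Because the function returns integer(0) when nothing matches, indexing with the result gives an empty vector rather than an accidental negative subscript.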
Imagine a vector z:
z <- seq(-2, 2, length.out = 20)
z
#> [1] -2.0000000 -1.7894737 -1.5789474 -1.3684211 -1.1578947 -0.9473684 -0.7368421 -0.5263158
#> [9] -0.3157895 -0.1052632 0.1052632 0.3157895 0.5263158 0.7368421 0.9473684 1.1578947
#> [17] 1.3684211 1.5789474 1.7894737 2.0000000
then you can do
turn_point <- which(z == max(z[z < 0]))
turn_plus_one <- c(turn_point, turn_point + 1)
z[turn_plus_one]
#> [1] -0.1052632 0.1052632

Resources