Formula to compute the between-group sum of squares in R

Can anyone tell me how to code the between-group sum of squares (SS between) in R?
Computed by hand, it is ∑ nᵢ (x̄ᵢ − x̄)², where x̄ᵢ is the mean of group i, nᵢ is the group size, and x̄ is the grand mean.
thanks,
lp

If you have a vector of values x and its mean in x_mean, you can compute a sum of squares manually like this:
> x=c(1,2,3,4,5)
> x_mean = mean(x)
> x-x_mean
[1] -2 -1 0 1 2
> (x-x_mean)^2
[1] 4 1 0 1 4
> sum((x-x_mean)^2)
[1] 10

Not sure this is what you want, but
# create sample dataset: 5 groups, 10 values per group
set.seed(1)
df <- data.frame(group=rep(LETTERS[1:5],each=10),value=rnorm(50))
# calculate between-group sum of squares (SSB)
sum((aggregate(value~group,df,mean)$value-mean(df$value))^2)
# [1] 0.07938908
This calculates the mean by group using aggregate(...) and then sums the squared difference between that and the grand mean (mean(df$value)).
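Note that the formula in the question weights each squared deviation by the group size nᵢ; with equal group sizes that is only a constant factor, but it does change the value. A minimal sketch of the fully weighted version, reusing df from above and cross-checking against R's own ANOVA table:
# group means and group sizes
grp_means <- tapply(df$value, df$group, mean)
grp_n <- tapply(df$value, df$group, length)
# weighted between-group sum of squares: sum of n_i * (group mean - grand mean)^2
sum(grp_n * (grp_means - mean(df$value))^2)
# should match the "group" row of the ANOVA table
anova(aov(value ~ group, data = df))["group", "Sum Sq"]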


Prob function on rolldie output: I don't know how to write it exactly (see question b)

Preconditions: the "prob" package and the series of packages it requires have been installed.
a) Consider the experiment of rolling three dice. Using R, show how you would use a user-defined function to define a random variable that is the mean of the three rolls rounded to the nearest integer.
> rollthree <- rolldie(3, makespace = TRUE)
> rollthree$mean = as.integer((rollthree$X1 + rollthree$X2 + rollthree$X3)/3)
> rollthree
X1 X2 X3 probs mean
1 1 1 1 0.00462963 1
2 2 1 1 0.00462963 1
... ...
b) Using the above result, what is the probability that the random variable equals 3? What is the probability that the random variable takes a value of at most 3? What is the probability that the random variable takes on a value of at least 3? Use the Prob function as shown in the code samples.
> equal3 <- subset(rollthree$mean, rank == 3)
Error in rank == 3 :
comparison (1) is possible only for atomic and list types
I believe the issue here is that rank is a built-in R function, not a column, so the comparison rank == 3 fails. One solution would be equal3 <- subset(rollthree, mean == 3), which would store all of the rows where the mean is 3. Then we can sum the probabilities, or multiply the probability of a single outcome by the number of matching rows.
Using your code as a base I have produced the following code.
library(prob)
# Part a
rollthree <- rolldie(3, makespace = T)
rollthree$mean = as.integer((rollthree$X1 + rollthree$X2 + rollthree$X3)/ 3)
# Part b
print("Probability mean is 3:")
# Note here we sum the probabilities of the events we want to occur
# Additionally we have done this all in one line by taking only the probs column from the subset
sum(subset(rollthree, mean == 3)$probs)
print("Probability mean is less than or equal to 3:")
sum(subset(rollthree, mean <= 3)$probs)
print("Probability mean is greater than or equal to 3:")
sum(subset(rollthree, mean >= 3)$probs)
#> [1] "Probability mean is 3:"
#> [1] 0.3657407
#> [1] "Probability mean is less than or equal to 3:"
#> [1] 0.625
#> [1] "Probability mean is greater than or equal to 3:"
#> [1] 0.7407407
Created on 2021-06-08 by the reprex package (v2.0.0)
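Since part b) asks for the Prob function specifically, the same three values can also be computed with the prob package's Prob(space, event) helper; a short sketch along those lines:
# Prob sums the probs column over the rows where the event holds,
# which is what the subset/sum idiom above does by hand
Prob(rollthree, mean == 3)
Prob(rollthree, mean <= 3)
Prob(rollthree, mean >= 3)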
An alternate approach for a) is written below:
library(prob)
# part a
# function to roll some number of dice and compute the integer mean of each outcome
roll_x_mean_int <- function(x) {
  # Check the input value is an integer
  if (typeof(x) != "integer") {
    stop("Input value is not an integer")
  }
  # Check the input value is positive
  if (x < 1) {
    stop("Input integer is not positive")
  }
  # Roll the dice
  vals <- rolldie(x, makespace = TRUE)
  # Calculate the integer mean of each row (excluding the probability column)
  vals$mean <- as.integer(rowSums(vals[1:x] / x))
  return(vals)
}
# Call the function with 3 dice (note the L makes the value an integer)
rollthree <- roll_x_mean_int(3L)
# part b
# Run this section as one block
{
  print("Probability mean is 3:")
  print(sum(subset(rollthree, mean == 3)$probs))
  print("Probability mean is less than or equal to 3:")
  print(sum(subset(rollthree, mean <= 3)$probs))
  print("Probability mean is greater than or equal to 3:")
  print(sum(subset(rollthree, mean >= 3)$probs))
}
#> [1] "Probability mean is 3:"
#> [1] 0.3657407
#> [1] "Probability mean is less than or equal to 3:"
#> [1] 0.625
#> [1] "Probability mean is greater than or equal to 3:"
#> [1] 0.7407407
Created on 2021-06-08 by the reprex package (v2.0.0)

How to pull points that are within a certain distance away in R?

I have a parameter space given by (x,y), with x values from 1:5 and y values from 1:8. Let's say my current point p is located at (2,5). My goal is to pull all the points within one unit of p in each coordinate, i.e. the eight grid points surrounding it.
I was wondering if there was an efficient way to do this. Let's say my variables are stored in the following way:
xrange <- 1:5
yrange <- 1:8
grid <- expand.grid(xrange,yrange)
p <- data.frame(x=2,y=5)
I would like to store p along with those points in this fashion:
res <- data.frame(x=c(1,1,1,2,2,3,3,3),y=c(4,6,4,5,6,4,5,6))
res <- rbind(p,res)
> res
x y
1 2 5
2 1 4
3 1 6
4 1 4
5 2 5
6 2 6
7 3 4
8 3 5
9 3 6
The ultimate goal is to have a parameter space that is more than 2 dimensional. So I would eventually like to find all points that are some euclidean distance s away and similarly have a resulting dataframe with each column being a parameter in the parameter space and each row being a point with coordinates (x,y,z,..,etc) from its columns.
EDIT: I have tried the following implementation for a circle of euclidean distance s, and it seems to work. I am not sure how efficient the solution is, though.
eucdist <- function(z, p) {
  return(dist(rbind(z, p)))
}
# in this case s=1 since that is the <= condition
res <- do.call(rbind, lapply(1:nrow(grid), function(m) {
  if (eucdist(as.numeric(grid[m, ]), as.numeric(p[1, ])) <= 1) return(grid[m, ])
}))
More information: for now, my parameter space is discretized on a grid as above. Eventually some parameters will be continuous, mixed in with discrete parameters as well. Thank you so much!
The euclidean distance of each point on the grid from the target point p can be efficiently computed with:
dists <- sqrt(rowSums(mapply(function(x, y) (x - y)^2, grid, p)))
Basically the inner mapply call will result in a matrix of the same size as grid but that has the squared distance of that point from the target point in that dimension; rowSums and sqrt efficiently then compute the euclidean distance.
Here everything within distance 1.5 of the target point is included, which captures the diagonal neighbors at distance sqrt(2):
grid[dists < 1.5,]
# Var1 Var2
# 16 1 4
# 17 2 4
# 18 3 4
# 21 1 5
# 22 2 5
# 23 3 5
# 26 1 6
# 27 2 6
# 28 3 6
The use of mapply (operating over dimensions) and rowSums makes this much more efficient than an approach that loops through individual points on the grid, computing the distance to the target point. To see this, consider a slightly larger example with 1000 randomly distributed points in three dimensions:
set.seed(144)
grid <- data.frame(x=rnorm(1000), y=rnorm(1000), z=rnorm(1000))
p <- data.frame(x=rnorm(1), y=rnorm(1), z=rnorm(1))
lim <- 1.5
byrow <- function(grid, p, lim) grid[apply(grid, 1, function(x) sqrt(sum((x-p)^2))) < lim,]
vectorized <- function(grid, p, lim) grid[sqrt(rowSums(mapply(function(x,y) (x-y)^2, grid, p))) < lim,]
identical(byrow(grid, p, lim), vectorized(grid, p, lim))
[1] TRUE
library(microbenchmark)
microbenchmark(byrow(grid, p, lim), vectorized(grid, p, lim), times = 10)
# Unit: microseconds
#                      expr       min         lq        mean      median         uq        max neval
#       byrow(grid, p, lim) 446792.71 473428.137 500680.0431 495824.7765 521185.093 579999.745    10
#  vectorized(grid, p, lim)    855.33    881.981    954.1773    907.3805   1081.658   1108.679    10
The vectorized approach is 500 times faster than the approach that loops through the rows.
This approach can be used in cases where you have many more points (1 million in this example):
set.seed(144)
grid <- data.frame(x=rnorm(1000000), y=rnorm(1000000), z=rnorm(1000000))
p <- data.frame(x=rnorm(1), y=rnorm(1), z=rnorm(1))
lim <- 1.5
system.time(vectorized(grid, p, lim))
#    user  system elapsed
#   3.466   0.136   3.632
Here's how to do it with package FNN. The result is different from what you have because your solution has (1 4) and (2 5) twice. The solution also handles border points: if x or y is 1 or otherwise on the edge of your grid, you will only get 6 points back (the target plus its 5 neighbors).
library(FNN)
x <- 2
y <- 5
pt <- grid[grid$Var1 == x & grid$Var2 == y, ]  # target point
distance <- knnx.dist(grid, pt, k = 9)         # distances from pt to its 9 nearest points
k <- length(distance[distance < 2])            # keep distances below 2; useful for border data
nearest <- knnx.index(grid, pt, k = k)         # indices of the k nearest neighbors
grid[nearest, ]
Var1 Var2
22 2 5
23 3 5
27 2 6
21 1 5
17 2 4
26 1 6
28 3 6
18 3 4
16 1 4
I see that you also asked about higher dimensions. It would still work with the following changes:
x <- 2
y <- 5
z <- 3
pt <- grid[grid$Var1 == x & grid$Var2 == y & grid$Var3 == z, ]  # 3-dimensional point
distance <- knnx.dist(grid, pt, k = 27)                         # increase to k = 27
k <- length(distance[distance < 2])
nearest <- knnx.index(grid, pt, k = k)
grid[nearest, ]
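For an arbitrary number of dimensions and an arbitrary euclidean radius s, the two FNN calls can be wrapped in a small helper. This is a hypothetical convenience function, not part of FNN, and its default k = nrow(grid) computes every distance, so cap k on large grids:
library(FNN)
# hypothetical helper: return all rows of grid within euclidean distance s of pt
points_within <- function(grid, pt, s, k = nrow(grid)) {
  d <- knnx.dist(grid, pt, k = k)            # distances to the k nearest rows, ascending
  keep <- sum(d <= s)                        # how many of them fall inside the radius
  nearest <- knnx.index(grid, pt, k = keep)  # indices of those rows (assumes keep >= 1)
  grid[as.vector(nearest), ]
}
# with the original 2-d grid and p from the question:
points_within(grid, p, s = 1.5)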

comparing a vector to a probability distribution

I have a vector:
r <- runif(10)
r
[1] 0.52324423 0.89110751 0.44616915 0.70163640 0.63741495 0.31263977
[7] 0.73947973 0.83278799 0.04971461 0.01820381
I also have a probability distribution
p <- c(0, cumsum(rep(0.25, 4)))
p
[1] 0.00 0.25 0.50 0.75 1.00
I would like to assign factors to r based on the probability distribution in p.
In other words, I would like my output to be:
r
[1] 3 4 2 3 3 2 3 4 1 1
When I try this, I get a warning:
which( r >= p) -1
[1] 3
Warning message:
In r < p : longer object length is not a multiple of shorter object length
In other words, only the first value in r is compared to p.
How would I go about converting r into a vector of levels that I can then turn into factors?
You can use cut
as.integer(cut(r, breaks=p))
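If you want to keep the result as a factor, as the question mentions, note that cut already returns one; a small sketch using the r and p from the question:
bins <- cut(r, breaks = p, labels = 1:4)  # a factor with one level per interval of p
bins
as.integer(bins)  # the integer codes from the answer above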

R: generate a random sample with a specified lower and upper limit, x = abs(rnorm(100, mean=10, sd=5))

I am using R software for learning.
I tried generating a random sample, but I do not know how to restrict the sample to a range.
I would like to generate a random sample with mean 10 and standard deviation 5, but with a specified lower and upper limit; for example, car mileage ranging between 14 and 20 miles.
Interpreting the question liberally, you can generate a distribution with mean 10 and sd of 5 and then simply pick out the values between 14 and 20.
x <- rnorm(1000000, mean = 10, sd = 5)
x <- x[x > 14 & x < 20][1:100]
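One caveat to add (my note, not part of the original answer): if fewer than 100 draws land in the interval, the [1:100] indexing silently pads the result with NA, so it is worth a quick check:
anyNA(x)  # FALSE means at least 100 draws fell in (14, 20)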
Unlike some of the commenters, I did not read the question as wanting the resulting set of values to have a mean of 10, but rather as asking how to restrict the sample to a particular range in the "right side" of the (sample) distribution:
> x=abs(rnorm(100,mean=10,sd=5))
> Hmisc::describe(x)
x
n missing unique Mean .05 .10 .25 .50 .75 .90 .95
100 0 100 10.01 2.713 3.830 6.995 9.899 13.338 15.890 17.244
lowest : 0.2297 1.2789 1.4875 2.1591 2.7105
highest: 17.4245 18.2214 18.5216 20.6713 29.7004
> x[ x >10 & x < 14 ]
[1] 12.18180 12.20234 10.88162 13.29551 11.93855 12.88514 10.16698 11.59605 10.49250
[10] 11.87572 12.39030 11.94950 13.46728 13.82126 10.83314 12.43298 10.14287 11.55421
[19] 10.30776 12.24819 13.55493 10.49318 10.40018 12.51552 10.63401 10.15365 11.16643
[28] 11.01404
> x[ x >14 & x < 20 ]
[1] 15.67082 14.23257 15.32423 17.23473 16.49950 17.19964 15.98334 15.87908 17.42446
[10] 18.22140 14.20300 14.40036 15.82410 15.15962 18.52156 14.55390 15.36474 14.37368
[19] 16.90896 14.61357
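If the goal is a sample guaranteed to stay inside [14, 20] while keeping the normal shape, a truncated-normal sampler avoids the oversample-and-filter step; a minimal sketch, assuming the truncnorm package is available:
# install.packages("truncnorm")
library(truncnorm)
# 100 draws from N(10, 5^2) truncated to [14, 20]
x <- rtruncnorm(100, a = 14, b = 20, mean = 10, sd = 5)
range(x)  # all values fall between 14 and 20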

calculate cumulative probability from binomial logit output in R

I have a fitted binomial logit model and want to calculate the cumulative probability of experiencing an event at or below some value of a covariate.
For example, if I have a fitted model that predicts an outcome based on a continuous distance range (0-8.5 km), I might want to find the cumulative probability for distance <= 4.5 km.
I have vectors of estimated probabilities and the associated distances as below
dat <- structure(list(km = c(0, 0.447368421052632, 0.894736842105263,
1.34210526315789, 1.78947368421053, 2.23684210526316, 2.68421052631579,
3.13157894736842, 3.57894736842105, 4.02631578947368, 4.47368421052632,
4.92105263157895, 5.36842105263158, 5.81578947368421, 6.26315789473684,
6.71052631578947, 7.15789473684211, 7.60526315789474, 8.05263157894737,
8.5), prob = c(0.99010519543441, 0.985413663823809, 0.97854588563623,
0.968547716962174, 0.954108659036907, 0.933496091194704, 0.904551377544634,
0.864833064332603, 0.81202174997839, 0.744668375529677, 0.663191827576796,
0.570704402277059, 0.47300143764816, 0.377323442817887, 0.290336664745317,
0.216433162546689, 0.157174982015906, 0.111825887625402, 0.0783449309507567,
0.054275681518511)), .Names = c("km", "prob"), row.names = c(NA,
-20L), class = "data.frame")
What I ultimately want to say is "x% of observations within distance x are predicted to experience an event". Is this the right way to go about that?
Also, is there an easy way to calculate the distance at which the cumulative probability reaches 50%?
Thanks, Tim
There is probably some way to extract this from your model, but if you were doing it from scratch I would try to fit your data to a distribution, then extract your relevant data points.
First define an error function:
rmse <- function(x,y) sqrt(sum((x-y)^2)/length(x)) # or some other error fxn
Now let's say your data sort of looks like a gamma distribution, so try:
gdf <- function(x, d=dat$km) pgamma(d,shape=x[1], scale=x[2])
So your function to optimize will be the error function of your data and the fit distribution:
error_fxn <- function(x) rmse(rev(dat$prob),gdf(x)) # rev data to make ascending
Now optimize this function to get your parameters for the distribution of interest:
rr <- optim(c(1,1),error_fxn)
And let's see how good the fit is (just ok...):
rr
# $par
# [1] 3.108392 1.112584
# $value
# [1] 0.0333369
# $counts
# function gradient
#      119       NA
# $convergence
# [1] 0
# $message
# NULL
Or graphically:
with(dat,plot(km,prob,xlim=c(10,0)))
with(dat,lines(rev(km),pgamma(km,shape=rr$par[1], scale=rr$par[2]),col='red'))
Take a look at the values for the CDF:
kms <- seq(0,8.5,0.5)
data.frame(dist = kms, cdf = pgamma(kms,shape=rr$par[1], scale=rr$par[2]))
# dist cdf
# 1 0.0 0.000000000
# 2 0.5 0.008634055
# 3 1.0 0.053615340
# 4 1.5 0.137291689
# 5 2.0 0.245961242
# 6 2.5 0.363956061
# 7 3.0 0.479070721
# 8 3.5 0.583659363
# 9 4.0 0.673982194
# 10 4.5 0.749075757
# 11 5.0 0.809691054
# 12 5.5 0.857478086
# 13 6.0 0.894431622
# 14 6.5 0.922551998
# 15 7.0 0.943661710
# 16 7.5 0.959325076
# 17 8.0 0.970830577
# 18 8.5 0.979207658
And to answer your final question, get the distance at 50% of the CDF:
qgamma(0.5,shape=rr$par[1], scale=rr$par[2])
# [1] 3.095395
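As a model-free cross-check (my addition, using only the data in the question), the 50% distance can also be read off the curve by linear interpolation; prob descends with km, so it is reversed to ascend the way the gamma fit assumed:
# distance at which the reversed probability curve crosses 0.5
approx(x = rev(dat$prob), y = dat$km, xout = 0.5)$y
# about 3.26, in the same ballpark as the qgamma value above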
