Generating 3D data with cube as a decision surface - r

I am new to R. My task is to write an R function that simulates 500 observations from a standard normal distribution on three variables, x, y, and z.
I am to use a cube as a decision surface to categorize the observations based on whether they fall inside or outside the cube.
Below is my code. I am able to plot the 3D data, but I am not sure how to categorize the observations into two classes.
library(scatterplot3d)
set.seed (1234)
nObs <- 500
x <- matrix (rnorm (1.25*nObs), ncol =2)
y <- matrix (rnorm (1.25*nObs), ncol =2)
z <- matrix (rnorm (1.25*nObs), ncol =2)
mSample <- function(nObs,x,y,z){
x1 <- rnorm(1,x)
x1[y==1,] <- x[y==1,] + 1
mSample <- as_tibble(rbind(mvnfast::rmvn(x,y = y1,z = z1), mvnfast::rmvn(x,y = y1,z = z1)))
mSample[1:x1, 1.25] <- 0
mSample[(x1 + 1):(x1 + 1), 1.25] <- 1
mSample <- mSample[sample(nrow(mSample)), ]
colnames(mSample <- c("x", "y", "class"))
mSample
}
spl <- scatterplot3d(x,y,z)
spl <- scatterplot3d(x,y,z,pch=16,highlight.3d=TRUE)

I had a similar question recently. To decide whether a given point is inside or outside a cube, you first need to know the size of the cube. In the test below, cubeLength is the half-side length of a cube centred at the origin, so the cube spans -cubeLength to cubeLength along each axis.
Then iterate over all the points (nObs) and test each one with an if statement:
if (x > -cubeLength && x < cubeLength && y > -cubeLength && y < cubeLength && z > -cubeLength && z < cubeLength) {
# classify positive (inside the cube)
} else {
# classify negative (outside the cube)
}
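The same test can be applied to all observations at once, without an explicit loop. The following is a minimal sketch, assuming a cube centred at the origin; the function name simulate_cube and the argument half_side are made up for illustration and do not come from the question.

```r
# Simulate n_obs standard normal points in 3D and label each one by
# whether it falls strictly inside a cube centred at the origin.
# half_side is an assumed parameter: the cube spans [-half_side, half_side]
# along every axis.
simulate_cube <- function(n_obs = 500, half_side = 1) {
  sim <- data.frame(x = rnorm(n_obs), y = rnorm(n_obs), z = rnorm(n_obs))
  inside <- abs(sim$x) < half_side &
    abs(sim$y) < half_side &
    abs(sim$z) < half_side
  sim$class <- as.integer(inside)  # 1 = inside the cube, 0 = outside
  sim
}

set.seed(1234)
sim <- simulate_cube(500, half_side = 1)
table(sim$class)

# Optional plot, coloured by class, if scatterplot3d is installed
if (requireNamespace("scatterplot3d", quietly = TRUE)) {
  scatterplot3d::scatterplot3d(sim$x, sim$y, sim$z, pch = 16,
                               color = ifelse(sim$class == 1, "red", "grey40"))
}
```

Using abs() with the vectorized & operator replaces the six scalar comparisons joined by && in the if statement.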

Related

Generate a binary variable with a predefined correlation to an already existing variable

For a simulation study, I want to generate a set of random variables (both continuous and binary) that have predefined associations to an already existing binary variable, denoted here as x.
For this post, assume that x is generated following the code below. But remember: in real life, x is an already existing variable.
set.seed(1245)
x <- rbinom(1000, 1, 0.6)
I want to generate both a binary variable and a continuous variable. I have figured out how to generate a continuous variable (see code below)
set.seed(1245)
cor <- 0.8 #Correlation
y <- rnorm(1000, cor*x, sqrt(1-cor^2))
But I can't find a way to generate a binary variable that is correlated to the already existing variable x. I found several R packages, such as copula which can generate random variables with a given dependency structure. However, they do not provide a possibility to generate variables with a set dependency on an already existing variable.
Does anyone know how to do this in an efficient way?
Thanks!
If we look at the formula for correlation:
rho = (E(XY) - E(X)E(Y)) / sqrt(Var(X) * Var(Y))
For a binary variable, Var(X) = E(X) - E(X)^2, because X^2 = X.
For the new vector y, the problem is easier to solve if we preserve the mean. That means we copy the vector x and flip an equal number of 1s and 0s to achieve the intended correlation value.
If we let E(X) = E(Y) = x_bar , and E(XY) = xy_bar, then for a given rho, we simplify the above to:
(xy_bar - x_bar^2) / (x_bar - x_bar^2) = rho
Solve and we get:
xy_bar = rho * x_bar + (1-rho)*x_bar^2
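As a quick numerical check of this algebra, the sketch below plugs in the values from the question (x_bar = 0.6 and rho = 0.8 are taken as exact here, which is an assumption, since the sample mean of x will differ slightly) and confirms that the solved xy_bar gives back the intended rho.

```r
x_bar <- 0.6   # assumed mean of the binary variable
rho   <- 0.8   # intended correlation

# Solved expression for E(XY)
xy_bar <- rho * x_bar + (1 - rho) * x_bar^2

# Substitute back into the simplified correlation formula
implied_rho <- (xy_bar - x_bar^2) / (x_bar - x_bar^2)

xy_bar
implied_rho
```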
And we can derive a function to flip a number of 1s and 0s to get the result:
create_vector = function(x,rho){
n = length(x)
x_bar = mean(x)
xy_bar = rho * x_bar + (1-rho)*x_bar^2
toflip = sum(x == 1) - round(n * xy_bar)
y = x
y[sample(which(x==0),toflip)] = 1
y[sample(which(x==1),toflip)] = 0
return(y)
}
For your example it works:
set.seed(1245)
x <- rbinom(1000, 1, 0.6)
cor(x,create_vector(x,0.8))
[1] 0.7986037
There are some extreme combinations of intended rho and p where you might run into problems, for example:
library(ggplot2)
set.seed(111)
res = lapply(1:1000,function(i){
this_rho = runif(1)
this_p = runif(1)
x = rbinom(1000,1,this_p)
data.frame(
intended_rho = this_rho,
p = this_p,
resulting_cor = cor(x,create_vector(x,this_rho))
)
})
res = do.call(rbind,res)
ggplot(res,aes(x=intended_rho,y=resulting_cor,col=p)) + geom_point()
Here's a binomial one - the formula for q only depends on the mean of x and the correlation you desire.
set.seed(1245)
cor <- 0.8
x <- rbinom(100000, 1, 0.6)
p <- mean(x)
q <- 1/((1-p)/cor^2+p)
y <- rbinom(100000, 1, q)
z <- x*y
cor(x,z)
#> [1] 0.7984781
This is not the only way to do this - note that mean(z) is always less than mean(x) in this construction.
The continuous variable is even less well defined - do you really not care about its mean/variance, or anything else about its distribution?
Here's another simple version where it flips the variable both ways:
set.seed(1245)
cor <- 0.8
x <- rbinom(100000, 1, 0.6)
p <- mean(x)
q <- (1+cor/sqrt(1-(2*p-1)^2*(1-cor^2)))/2
y <- rbinom(100000, 1, q)
z <- x*y+(1-x)*(1-y)
cor(x,z)
#> [1] 0.8001219
mean(z)
#> [1] 0.57908
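Both formulas for q can be checked without simulation. The sketch below assumes that y is drawn independently of x with P(y = 1) = q, and works out the population correlation between x and z for each construction; the helper names cor_one_way and cor_two_way are made up for illustration.

```r
# Population correlation between x ~ Bernoulli(p) and z = x * y
cor_one_way <- function(p, q) {
  sqrt(q * (1 - p) / (1 - p * q))
}

# Population correlation between x and z = x * y + (1 - x) * (1 - y)
cor_two_way <- function(p, q) {
  m <- p * q + (1 - p) * (1 - q)   # mean of z
  (2 * q - 1) * sqrt(p * (1 - p) / (m * (1 - m)))
}

p   <- 0.6
rho <- 0.8

q1 <- 1 / ((1 - p) / rho^2 + p)
q2 <- (1 + rho / sqrt(1 - (2 * p - 1)^2 * (1 - rho^2))) / 2

cor_one_way(p, q1)   # equals rho up to floating-point error
cor_two_way(p, q2)   # equals rho up to floating-point error
```

In the first construction the mean of z is p * q, which is below p whenever q < 1. In the second it is p * q + (1 - p) * (1 - q), which stays closer to p.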

Plot Sphere with custom gridlines in R

I would like to plot a sphere in R with gridlines on the surface that correspond to an equal-area gridding of the sphere, using the arccos transformation.
I have been experimenting with the R package rgl and got some help from the question "Plot points on a sphere in R", which plots the gridlines with equal lat/long spacing.
I have the function below, which returns a data frame of the points where the grid lines I want cross, but I am not sure how to proceed.
plot_sphere <- function(theta_num,phi_num){
theta <- seq(0,2*pi,(2*pi)/(theta_num))
phi <- seq(0,pi,pi/(phi_num))
tmp <- seq(0,2*phi_num,2)/phi_num
phi <- acos(1-tmp)
tmp <- cbind(rep(seq(1,theta_num),each = phi_num),rep(seq(1,phi_num),times = theta_num))
results <- as.data.frame(cbind(theta[tmp[,1]],phi[tmp[,2]]))
names(results) <- c("theta","phi")
results$x <- cos(results$theta)*sin(results$phi)
results$y <- sin(results$theta)*sin(results$phi)
results$z <- cos(results$phi)
return(results)
}
sphere <- plot_sphere(10,10)
Can anyone help? In general, I am finding the rgl functions tricky to work with.
If you use lines3d or plot3d(..., type="l"), you'll get a plot joining the points in your dataframe. To get breaks (you don't want one long line), add rows containing NA values.
The code in your plot_sphere function seems really messed up (you compute phi twice, you don't generate vectors of the requested length, etc.), but this function based on it works:
library(rgl) # provides lines3d
plot_sphere <- function(theta_num,phi_num){
theta0 <- seq(0,2*pi, len = theta_num)
tmp <- seq(0, 2, len = phi_num)
phi0 <- acos(1-tmp)
i <- seq(1, (phi_num + 1)*theta_num) - 1
theta <- theta0[i %/% (phi_num + 1) + 1]
phi <- phi0[i %% (phi_num + 1) + 1]
i <- seq(1, phi_num*(theta_num + 1)) - 1
theta <- c(theta, theta0[i %% (theta_num + 1) + 1])
phi <- c(phi, phi0[i %/% (theta_num + 1) + 1])
results <- data.frame( x = cos(theta)*sin(phi),
y = sin(theta)*sin(phi),
z = cos(phi))
lines3d(results)
}
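To see why spacing phi as acos(1 - tmp) gives an equal-area grid, note that the area of the band between two latitudes on a unit sphere is 2 * pi times the difference in z = cos(phi). Evenly spaced z values therefore give bands of equal area. The sketch below checks this numerically and needs no rgl; phi_num = 10 is an arbitrary choice.

```r
phi_num <- 10

# Equal-area spacing: z = cos(phi) evenly spaced from 1 down to -1
z_edges <- 1 - seq(0, 2, length.out = phi_num)
band_area <- 2 * pi * abs(diff(z_edges))   # area of each latitude band

# Equal-angle spacing for comparison: phi evenly spaced from 0 to pi
phi_equal <- seq(0, pi, length.out = phi_num)
band_area_equal <- 2 * pi * abs(diff(cos(phi_equal)))

range(band_area)         # all bands have the same area
range(band_area_equal)   # bands near the poles are much smaller
sum(band_area)           # total surface area of the unit sphere, 4 * pi
```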

Example for dimension reduction (SVD vs Random Projection) in R

I am learning about dimension reduction techniques in R. I took one image as input and reduced its dimension with SVD using this code:
library(raster)
img <- raster("C:/Users/***/Pictures/pansy.jpg")
img_flip <- flip(img, direction = "y")
img <- t(as.matrix(img_flip))
dim(img)
image(img,col=grey(seq(0,1,length=256))) # Actual
img_svd <- svd(img)
u <- img_svd$u
d <- diag(img_svd$d)
v <- img_svd$v
u1 <- as.matrix(u[, 1:50])
d1 <- as.matrix(d[1:50, 1:50])
v1 <- as.matrix(v[, 1:50])
photo1 <- u1 %*% d1 %*% t(v1)
image(photo1, col = grey(seq(0, 1, length = 256))) #Reduced
After that, I got the reduced image as output.
Then I learned about random projection and wanted to compare it with SVD.
I followed the steps below for the same image:
Convert the image into a matrix [465 x 600].
Create a random matrix of size [600 x 300] filled with +1 and -1 with equal probability.
Multiply the two matrices to get a matrix of size [465 x 300].
Code:
rp_img <- raster("C:/Users/***/Pictures/pansy.jpg")
img_flip <- flip(rp_img, direction = "y")
rp_img <- t(as.matrix(img_flip))
rm <- form_sparse_matrix(600,n_cols = ncol(rp_img),TRUE,0.5,method = "probability") # To create a random matrix with +1 and -1 of equal probability
photo2 <- rp_img %*% rm
image(photo2, col = grey(seq(0, 1, length = 256)))
After that, I viewed the resulting image.
Are my steps for performing random projection correct? Where have I made a mistake?
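One way to check the three steps above is to look at what a random projection is meant to preserve. The sketch below uses base R only and a synthetic matrix rather than the image; the 1/sqrt(k) scaling and the sizes n, d, and k are assumptions for illustration, not taken from the question. It suggests that the projected matrix keeps pairwise distances between rows approximately, which is a different goal from the low-rank SVD reconstruction, so the projected matrix is not expected to look like the original picture when displayed.

```r
set.seed(42)
n <- 20    # number of rows (observations)
d <- 600   # original dimension
k <- 300   # reduced dimension

X <- matrix(rnorm(n * d), nrow = n)

# Random matrix with entries +1 / -1 (equal probability), scaled by 1/sqrt(k)
R <- matrix(sample(c(-1, 1), d * k, replace = TRUE), nrow = d) / sqrt(k)

Y <- X %*% R   # projected data, n x k

# Ratio of pairwise row distances after vs. before projection
ratio <- as.vector(dist(Y)) / as.vector(dist(X))
summary(ratio)
```

The ratios cluster around 1, so distance-based methods applied to Y behave similarly to the same methods applied to X.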

Adding two random variables via convolution in R

I would like to compute the convolution of two probability distributions in R and I need some help. For the sake of simplicity, let's say I have a variable x that is normally distributed with mean = 1.0 and stdev = 0.5, and y that is log-normally distributed with mean = 1.5 and stdev = 0.75. I want to determine z = x + y. I understand that the distribution of z is not known a priori.
As an aside, the real-world example I am working with requires the addition of two random variables that follow a number of different distributions.
Does anyone know how to add two random variables by convolving the probability density functions of x and y?
I have tried generating n normally distributed random values (with above parameters) and adding them to n log-normally distributed random values. However, I wish to know if I can use the convolution method instead. Any help would be greatly appreciated.
EDIT
Thank you for these answers. I define a pdf, and try to do the convolution integral, but R complains on the integration step. My pdfs are Log Pearson 3 and are as follows
dlp3 <- function(x, a, b, g) {
p1 <- 1/(x*abs(b) * gamma(a))
p2 <- ((log(x)-g)/b)^(a-1)
p3 <- exp(-1* (log(x)-g) / b)
d <- p1 * p2 * p3
return(d)
}
f.m <- function(x) dlp3(x,3.2594,-0.18218,0.53441)
f.s <- function(x) dlp3(x,9.5645,-0.07676,1.184)
f.t <- function(z) integrate(function(x,z) f.s(z-x)*f.m(x),-Inf,Inf,z)$value
f.t <- Vectorize(f.t)
integrate(f.t, lower = 0, upper = 3.6)
R complains at the last step since the f.t function is bounded and my integration limits are probably not correct. Any ideas on how to solve this?
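One possible reading of the problem: with b < 0, the density as written is only defined where (log(x) - g) / b > 0, that is, for 0 < x < exp(g), and it returns NaN elsewhere. The sketch below is an assumption about what is intended, not a confirmed fix: it defines the density as 0 outside that support and restricts the inner integral to the interval where both densities can be non-zero. The names up.m and up.s are made up for illustration.

```r
# Log Pearson 3 density, set to 0 outside its support
dlp3 <- function(x, a, b, g) {
  u <- suppressWarnings((log(x) - g) / b)   # NaN for x < 0
  ok <- is.finite(u) & u > 0                # inside the support
  out <- numeric(length(x))
  out[ok] <- u[ok]^(a - 1) * exp(-u[ok]) / (x[ok] * abs(b) * gamma(a))
  out
}

f.m <- function(x) dlp3(x, 3.2594, -0.18218, 0.53441)
f.s <- function(x) dlp3(x, 9.5645, -0.07676, 1.184)

# Upper ends of the supports (b < 0 in both cases)
up.m <- exp(0.53441)
up.s <- exp(1.184)

# Convolution integral over the overlap of the two supports
f.t <- function(z) {
  lo <- max(0, z - up.s)
  hi <- min(up.m, z)
  if (hi <= lo) return(0)
  integrate(function(x) f.s(z - x) * f.m(x), lo, hi)$value
}
f.t <- Vectorize(f.t)

# Each density should integrate to about 1 over its support
integrate(f.m, 0, up.m)$value
integrate(f.s, 0, up.s)$value

# P(Z <= 3.6)
integrate(f.t, lower = 0, upper = 3.6)$value
```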
Here is one way.
f.X <- function(x) dnorm(x,1,0.5) # normal (mean=1, sd=0.5)
f.Y <- function(y) dlnorm(y,1.5, 0.75) # log-normal (meanlog=1.5, sdlog=0.75)
# convolution integral
f.Z <- function(z) integrate(function(x,z) f.Y(z-x)*f.X(x),-Inf,Inf,z)$value
f.Z <- Vectorize(f.Z) # need to vectorize the resulting fn.
set.seed(1) # for reproducible example
X <- rnorm(1000,1,0.5)
Y <- rlnorm(1000,1.5,0.75)
Z <- X + Y
# compare the methods
hist(Z,freq=F,breaks=50, xlim=c(0,30))
z <- seq(0,50,0.01)
lines(z,f.Z(z),lty=2,col="red")
Same thing using package distr.
library(distr)
N <- Norm(mean=1, sd=0.5) # N is signature for normal dist
L <- Lnorm(meanlog=1.5,sdlog=0.75) # same for log-normal
conv <- convpow(L+N,1) # object of class AbscontDistribution
f.Z <- d(conv) # density function
hist(Z,freq=F,breaks=50, xlim=c(0,30))
z <- seq(0,50,0.01)
lines(z,f.Z(z),lty=2,col="red")
I was having trouble getting integrate() to work for different density parameters, so I came up with an alternative to @jlhoward's, using a Riemann approximation:
set.seed(1)
#densities to be convolved. could also put these in the function below
d1 <- function(x) dnorm(x,1,0.5) #
d2 <- function(y) dlnorm(y,1.5, 0.75)
#Riemann approximation of convolution
conv <- function(t, a, b, d) { # [a, b] needs to cover the range of the densities above; d needs to be small for an accurate approximation
x <- seq(a, b, d) # integration grid
z <- numeric(length(t)) # preallocate the result
for (i in seq_along(t)){
z[i] <- sum(d1(x)*d2(t[i]-x)*d) # Riemann sum of d1(x) * d2(t - x) dx
}
return(z)
}
#check against sampled convolution
X <- rnorm(1000, 1, 0.5)
Y <- rlnorm(1000, 1.5, 0.75)
Z <- X + Y
t <- seq(0, 50, 0.05) #range to evaluate t, smaller increment -> smoother curve
hist(Z, breaks = 50, freq = F, xlim = c(0,30))
lines(t, conv(t, -100, 100, 0.1), type = "s", col = "red")

Offline orthogonal range counting implementation

It seems that a statistical problem that I am working on requires doing something known in computational geometry as "offline orthogonal range counting":
Suppose I have a set of n points (for the moment, in the plane). For every pair of points i and j, I would like to count the number of remaining points in the set that are in the rectangle whose diagonal is the segment with endpoints i and j. The overall output is then a vector of n(n-1) values, each in {0, 1, 2, ..., n-2}.
I've seen that a rich literature on the problem (or at least a very similar problem) exists, but I cannot find an implementation. I would prefer an R (a statistical computing language) package, but I guess that's asking too much. An open source C/C++ implementation will also work.
Thanks.
I hope I understand your problem correctly. Here is an implementation in R using the geometry package. I use the mesh.drectangle function, which computes a signed distance from points p to the boundary of a rectangle.
I create all pairs of points using combn.
For each pair, I compute the signed distance from the rectangle defined by that pair to the other points.
If the distance is < 0, the point is inside the rectangle and I keep it.
For example:
library(geometry)
## I generate some data
set.seed(1234)
p.x <- sample(1:100,size=30,replace=T)
p.y <- sample(1:100,size=30,replace=T)
points <- cbind(p.x,p.y)
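## the algorithm
ll <- combn(1:nrow(points),2,function(x){
## mesh.drectangle expects (x1, y1) as the lower-left and (x2, y2) as the upper-right corner
x1 <- min(p.x[x]); y1 <- min(p.y[x])
x2 <- max(p.x[x]); y2 <- max(p.y[x])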
p <- points[-x,]
d <- mesh.drectangle(p,x1,y1,x2,y2)
res <- NA
if(length(which(d <0))){
points.in = as.data.frame(p,ncol=2)[ d < 0 , ]
res <- list(n = nrow(points.in),
rect = list(x1=x1,x2=x2,y1=y1,y2=y2),
points.in = points.in)
}
res
},simplify=F)
ll <- ll[!is.na(ll)]
## the result
nn <- do.call(rbind,lapply(ll,'[[','n'))
To visualize the results, I plot, for example, only the rectangles that contain 5 points.
library(grid)
grid.newpage()
vp <- plotViewport(xscale = extendrange(p.x),
yscale = extendrange(p.y))
pushViewport(vp)
grid.xaxis()
grid.yaxis()
grid.points(x=points[,'p.x'],y=points[,'p.y'],pch='*')
cols <- rainbow(length(ll))
ll <- ll[nn == 5] ## here I plot only the rectangle with 5 points
lapply(seq_along(ll),function(i){
x <- ll[[i]]
col <- sample(cols,1)
x1<- x$rect$x1; x2<- x$rect$x2
y1<- x$rect$y1; y2<- x$rect$y2
grid.rect(x=(x1+x2)*.5,y=(y1+y2)*.5,
width= x2-x1,height = y2-y1,
default.units ='native',
gp=gpar(fill=col,col='red',alpha=0.2)
)
grid.points(x=x$points.in$p.x,y=x$points.in$p.y,pch=19,
gp=gpar(col=rep(col,x$n)))
}
)
upViewport()
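For comparison, the counting step can also be sketched in base R without the geometry package. This is a brute-force version (roughly n^3 work), not one of the efficient offline range-counting algorithms from the literature; the function name count_in_rect is made up for illustration, and points on the rectangle boundary are not counted, matching the distance < 0 test above.

```r
# For every pair (i, j), count the other points that lie strictly inside
# the axis-aligned rectangle whose diagonal joins point i and point j.
count_in_rect <- function(px, py) {
  n <- length(px)
  counts <- matrix(0L, n, n)
  for (i in seq_len(n - 1)) {
    for (j in (i + 1):n) {
      inside <- px > min(px[i], px[j]) & px < max(px[i], px[j]) &
        py > min(py[i], py[j]) & py < max(py[i], py[j])
      inside[c(i, j)] <- FALSE   # never count the two corner points
      counts[i, j] <- counts[j, i] <- sum(inside)
    }
  }
  counts
}

# Small example: only the rectangle spanned by points 1 and 2 contains
# another point (point 3 at (1, 1))
px <- c(0, 2, 1, 3)
py <- c(0, 2, 1, 0)
counts <- count_in_rect(px, py)
counts
```

The result is a symmetric n x n matrix, so the n(n-1) ordered-pair values asked for in the question are its off-diagonal entries.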

Resources