How to simulate data in R for a polynomial SVM - r

I'm new to simulating data in R and would like to know how to generate a polynomial separator of degree 2. This is the task:
Generate a Data Set by Simulations:
We seek to generate 5000 cases x^1 ... x^5000 in R^4
each case x = [ x1 x2 x3 x4 ] has 4 numerical features.
using random sampling of a uniform distribution over the interval [-2, +2]:
select 16 random numbers Aij with i= 1 2 3 4 and j = 1 2 3 4; select 4 random numbers Bi with i= 1 2 3 4; select 1 random number c
Define the polynomial of degree 2 in the 4 variables x1 x2 x3 x4 as follows:
Pol(x) = ∑i ∑j Aij xi xj + ∑i Bi xi + c/20
So far I've generated A, B, and c (stored as C):
A <- matrix(runif(16, -2, 2), nrow=4, ncol=4)
B <- runif(4, -2, 2)
C <- runif(1, -2, 2)
But I'm having trouble finding out how to define a polynomial using the values I generated.
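One way to turn the generated coefficients into the polynomial is to note that the double sum is the quadratic form x' A x. The following is only a sketch: the seed, the names X, pol, pol_values and y, and the rule of labelling each case by the sign of Pol(x) are assumptions that the task statement above does not confirm.

```r
set.seed(1)  # assumed seed, only for reproducibility

A <- matrix(runif(16, -2, 2), nrow = 4, ncol = 4)
B <- runif(4, -2, 2)
C <- runif(1, -2, 2)

# 5000 cases with 4 features each, uniform on [-2, 2]; one case per row
X <- matrix(runif(5000 * 4, -2, 2), ncol = 4)

# Pol(x) = sum_ij A_ij x_i x_j + sum_i B_i x_i + c/20 for a single case x
pol <- function(x) as.numeric(t(x) %*% A %*% x + sum(B * x) + C / 20)

# Same polynomial for all rows at once:
# rowSums((X %*% A) * X) is the quadratic form x' A x of each row
pol_values <- rowSums((X %*% A) * X) + as.numeric(X %*% B) + C / 20

# Assumed labelling rule (not stated in the question): class = sign of Pol(x)
y <- ifelse(pol_values > 0, 1, -1)
```

pol(X[1, ]) and pol_values[1] should agree, which is a quick check that the vectorised form matches the definition.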

Related

simulating two-level data with level 1 interaction term

I am trying to simulate two-level data with a level-1 interaction term.
For example,
I have two level-2 variables, two level-1 variables, and an interaction variable.
CN <- 40 # number of clusters
nj <- 2 # cluster size; in this case, dyadic data
l2v1 <- rep(rnorm(CN, mean = 0, sd = 1), each = nj) # level 2 variable 1
l2v2 <- rep(sample(rep(c(-1, 1), length.out = CN)), each = nj) # level 2 variable 2, which is a binary variable
l1v1 <- rnorm(CN * nj, 0, 1) # level 1 variable 1
l1v2 <- sample(rep(c(-1, 1), length.out = CN * nj)) # level 1 variable 2, which is binary
error2 <- rep(rnorm(CN, 0, 1), each = nj) # error for level 2
error1 <- rnorm(CN * nj) # error for level 1
## putting together (coef1 to coef5 must be defined beforehand)
y <- coef1*l1v1 + coef2*l1v2 + coef3*l2v1 + coef4*l2v2 + coef5*l1v1*l1v2 + error2 + error1
In this case, how can I control ICC?
For example, I want to simulate this data with ICC of 0.3
ICC = between variance / total variance
ICC = {coef3^2 + coef4^2 + 1} / {coef3^2 + coef4^2 + 1 + 1 + coef1^2 + coef2^2 + coef5^2}
and coef1 = coef3, coef2 = coef4 due to some research questions.
So I plugged arbitrary numbers into coef1 to coef4 and tried to set coef5 so that the data would have the targeted ICC. However, it does not seem to work.
Did I miss something?
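For reference, the ICC formula written above can be inverted for coef5 directly. This sketch only solves the question's own formula; it assumes that formula is appropriate (independent predictors, each with variance 1), and the coefficient values are arbitrary illustrations. With only 40 clusters of size 2, the ICC estimated from any one simulated data set will also vary around the population value.

```r
target_icc <- 0.3

# Arbitrary illustrative values, with coef1 = coef3 and coef2 = coef4
coef1 <- 0.5
coef2 <- 0.3
coef3 <- coef1
coef4 <- coef2

# Between-cluster variance according to the question's formula
between <- coef3^2 + coef4^2 + 1

# ICC = between / (between + within)  =>  within = between * (1 - ICC) / ICC
within_needed <- between * (1 - target_icc) / target_icc

# within = 1 + coef1^2 + coef2^2 + coef5^2, so solve for coef5^2
coef5_sq <- within_needed - (1 + coef1^2 + coef2^2)
stopifnot(coef5_sq >= 0)  # otherwise the target ICC is unreachable with these values
coef5 <- sqrt(coef5_sq)

# Plug back in to confirm the population ICC implied by the formula
icc_check <- between / (between + 1 + coef1^2 + coef2^2 + coef5^2)
```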

Is it mathematically possible to solve this problem?

x <- abs(rnorm(8))
C <- (x[1]*x[2]*x[3])^(1/3)
y <- log(x/C)
Is it mathematically possible to determine x[1:3] given you only have y? Here, x and y are always vectors of length 8. I should note that x is known for some of my dataset, which could be useful to find a solution for the other portion of the data where x is unknown. All of my code is implemented in R, so R code would be appreciated if this is solvable!
Defining f as
f <- function(x) {
C <- (x[1]*x[2]*x[3])^(1/3)
log(x/C)
}
we first note that if k is any positive scalar constant then f(x) and f(k*x) give the same result, so if we have y = f(x) we can't tell whether y came from x or from k*x. That is, y could have come from any positive scalar multiple of x; therefore, we cannot recover x from y.
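The scale invariance can be checked numerically. In this small sketch the seed and the multiplier k = 2.5 are arbitrary choices for illustration:

```r
f <- function(x) {
  C <- (x[1] * x[2] * x[3])^(1/3)
  log(x / C)
}

set.seed(123)        # arbitrary seed for reproducibility
x <- abs(rnorm(8))
k <- 2.5             # any positive constant

# f(x) and f(k * x) agree up to floating point error
same <- isTRUE(all.equal(f(x), f(k * x)))
```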
Linear formulation
Although we cannot recover x we can determine x up to a scalar multiple. Define the matrix A:
ones <- rep(1, 8)
a <- c(1, 1, 1, 0, 0, 0, 0, 0)
A <- diag(8) - outer(ones, a) / 3
in which case f(x) equals:
A %*% log(x)
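As a quick check that the matrix form reproduces f, the sketch below compares the two on a random x (the seed is an arbitrary assumption):

```r
f <- function(x) {
  C <- (x[1] * x[2] * x[3])^(1/3)
  log(x / C)
}

ones <- rep(1, 8)
a <- c(1, 1, 1, 0, 0, 0, 0, 0)
A <- diag(8) - outer(ones, a) / 3

set.seed(42)         # arbitrary seed for reproducibility
x <- abs(rnorm(8))

# Row i of A %*% log(x) is log(x[i]) minus the mean of log(x[1:3]),
# which is exactly log(x[i] / C)
linear_form <- as.numeric(A %*% log(x))
agree <- isTRUE(all.equal(linear_form, f(x)))
```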
Inverting formula
From this formula, given y and solving for x, the value of x would equal
exp(solve(A) %*% y) ## would equal x if A were invertible
if A were invertible, but unfortunately it is not. For example, rowSums(A) equals zero, which means A times the vector of ones is zero, so the columns of A are linearly dependent and A is not invertible.
all.equal(rowSums(A), rep(0, 8))
## [1] TRUE
Rank and nullspace
Note that A is a projection matrix. This follows from the fact that it is idempotent, i.e. A %*% A equals A.
all.equal(A %*% A, A)
## [1] TRUE
Consistent with this, its eigenvalues are all 0 and 1:
zapsmall(eigen(A)$values)
## [1] 1 1 1 1 1 1 1 0
From the eigenvalues we see that A has rank 7 (the number of nonzero eigenvalues) and the dimension of the nullspace is 1 (the number of zero eigenvalues).
Another way to see this is that knowing that A is a projection matrix its rank equals its trace, which is 7, so its nullspace must have dimension 8-7=1.
sum(diag(A)) # rank of A
## [1] 7
Taking scalar multiples spans a one-dimensional space, so from the fact that the nullspace has dimension 1 it must be the entirety of the values that map into the same y.
Key formula
Now replacing solve in the line marked ## above with the generalized inverse, ginv, we have this key formula for our approximation to x given that y = f(x) for some x:
library(MASS)
exp(ginv(A) %*% y) # approximation to x accurate up to scalar multiple
or equivalently if y = f(x)
exp(y - mean(y))
While these do not give x they do determine x up to a scalar multiple. That is if x' is the value produced by the above expressions then x equals k * x' for some scalar constant k.
For example, using x and y from the question:
exp(ginv(A) %*% y)
## [,1]
## [1,] 1.2321318
## [2,] 0.5060149
## [3,] 3.4266146
## [4,] 0.1550034
## [5,] 0.2842220
## [6,] 3.7703442
## [7,] 1.0132635
## [8,] 2.7810703
exp(y - mean(y)) # same
## [1] 1.2321318 0.5060149 3.4266146 0.1550034 0.2842220 3.7703442 1.0132635
## [8] 2.7810703
exp(y - mean(y))/x
## [1] 2.198368 2.198368 2.198368 2.198368 2.198368 2.198368 2.198368 2.198368
Note
Note that y - mean(y) can be written as
B <- diag(8) - outer(ones, ones) / 8
B %*% y
and if y = f(x) then y must be in the range of A so we can verify that:
all.equal(ginv(A) %*% A, B %*% A)
## [1] TRUE
It is not true that the matrix ginv(A) equals B. It is only true that they act the same on the range of A which is all that we need.
No, it's not possible. You have three unknowns. That means you need three independent pieces of information (equations) to solve for all three. y gives you only one piece of information. Knowing that the x's are positive imposes a constraint, but doesn't necessarily allow you to solve. For example:
x1 + x2 + x3 = 6
Doesn't allow you to solve. x1 = 1, x2 = 2, x3 = 3 is one solution, but so is x1 = 1, x2 = 1, x3 = 4. There are many other solutions. [Imposing your "all positive" constraint would rule out solutions such as x1 = 100, x2 = 200, x3 = -294, but in general would leave more than one remaining solution.]
x1 + x2 + x3 = 6,
x1 + x2 - x3 = 0
Constrains x3 to be 3, but allows arbitrary solutions for x1 and x2, subject to x1 + x2 = 3.
x1 + x2 + x3 = 6,
x1 + x2 - x3 = 0,
x1 - x2 + x3 = 2
Gives the unique solution x1 = 1, x2 = 2, x3 = 3.
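The three-equation system can be solved in R with solve(), and the rank of the two-equation system shows why it leaves a free direction. The names M and rhs are just illustrative:

```r
# Coefficient matrix and right-hand side of the three equations
M <- rbind(c(1,  1,  1),
           c(1,  1, -1),
           c(1, -1,  1))
rhs <- c(6, 0, 2)

# Unique solution because M is square and has full rank
sol <- solve(M, rhs)

# Using only the first two equations: rank 2 with 3 unknowns,
# so one degree of freedom remains
rank_two_eq <- qr(M[1:2, ])$rank
```

Here sol should come out as x1 = 1, x2 = 2, x3 = 3, matching the solution stated above.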

How do I get all solutions from this system?

I am new to linear algebra and I am trying to solve a system of three equations with five unknowns. The system I have is the following:
x1 + x2 + x3 + x4 + x5 = 1
-x1 + x2 + x3 - 2x4 - 2x5 = 1
2x1 + 2x2 - x3 - x4 + x5 = 1
So what I did was set up the augmented matrix like this:
1 1 1 1 1 1
-1 1 1 -2 -2 1
2 2 -1 -1 1 1
Then I try to obtain an identity matrix on the left side and end up with the following:
1 0 0 3/2 3/2 0
0 1 0 -3/2 -5/6 2/3
0 0 1 1 1/3 1/3
So I think the answer is x1 = 0, x2 = 2/3 and x3 = 1/3
But when I look in my answer sheet it reads:
(x1, x2, x3, x4, x5) = (0, 2/3, 1/3, 0, 0) + s(−3/2, 3/2, −1, 1, 0) + t(−3/2, 5/6, −1/3, 0, 1)
I have no idea how to interpret this. My x1,x2,x3 seems to match the first three in the first five-tuple but what are the other two five-tuples? Can someone explain what I am missing here? I would highly appreciate it.
A system of equations can be represented in matrix form as
Ax = b
where A is the matrix of coefficients, x is the column vector (x1, ..., xn) and b is the column vector with as many entries as there are equations.
When b is not 0 we say that the system is not homogeneous. The associated homogeneous system is
Ax = 0
where the 0 on the right is again a column vector.
When you have a non-homogeneous system, like in this case, the general solution has the form
P + G
where P is any particular solution and G is the generic solution of the homogeneous system.
In your case the vector
P = (0, 2/3, 1/3, 0, 0)
satisfies all the equations and is therefore a valid particular solution.
The other two vectors (−3/2, 3/2, −1, 1, 0) and (−3/2, 5/6, −1/3, 0, 1) satisfy the homogeneous equations (take a moment to check this). And since there are 3 (independent) equations with 5 unknowns (x1..x5), the space of homogeneous solutions has dimension 5 − 3 = 2 and can be generated by these two vectors (again because they are independent).
So, to describe the space of all homogeneous solutions you need two scalar variables s and t. In other words
G = s(−3/2, 3/2, −1, 1, 0) + t(−3/2, 5/6, −1/3, 0, 1)
will generate all homogeneous solutions as s and t take all possible real values.
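If you want to check this numerically, here is a small R sketch. The values chosen for s and t are arbitrary, and the variable names are only for illustration:

```r
# Coefficient matrix and right-hand side from the question
A <- rbind(c( 1, 1,  1,  1,  1),
           c(-1, 1,  1, -2, -2),
           c( 2, 2, -1, -1,  1))
b <- c(1, 1, 1)

P  <- c(0, 2/3, 1/3, 0, 0)       # particular solution
v1 <- c(-3/2, 3/2, -1, 1, 0)     # first homogeneous solution
v2 <- c(-3/2, 5/6, -1/3, 0, 1)   # second homogeneous solution

res_P  <- as.numeric(A %*% P)    # should equal b
res_v1 <- as.numeric(A %*% v1)   # should equal 0
res_v2 <- as.numeric(A %*% v2)   # should equal 0

# Any choice of s and t gives another solution of A x = b
s_val <- 2
t_val <- -7
x <- P + s_val * v1 + t_val * v2
res_x <- as.numeric(A %*% x)     # should equal b again
```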

How to generate correlated numbers?

I would like to generate two sets of numbers with correlations of .9, .5, and .0.
A derives from rnorm(30,-0.5,1)
B derives from rnorm(30,.5,2)
and I want to make A and B correlated at .9, .5, and .0.
You are describing a multivariate normal distribution, which can be sampled from with the mvrnorm function:
library(MASS)
meanA <- -0.5
meanB <- 0.5
sdA <- 1
sdB <- 2
correlation <- 0.9
set.seed(144)
vals <- mvrnorm(10000, c(meanA, meanB), matrix(c(sdA^2, correlation*sdA*sdB,
correlation*sdA*sdB, sdB^2), nrow=2))
mean(vals[,1])
# [1] -0.4883265
mean(vals[,2])
# [1] 0.5201586
sd(vals[,1])
# [1] 0.9994628
sd(vals[,2])
# [1] 1.992816
cor(vals[,1], vals[,2])
# [1] 0.8999285
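With only 30 draws, as in the question, the sample correlation will wander around the target. If the sample itself must hit the target, mvrnorm has an empirical argument that forces the sample means and covariance to match exactly. The sketch below loops over the three requested correlations; make_pair is a hypothetical helper name, and note that with empirical = TRUE the columns are no longer plain independent rnorm draws.

```r
library(MASS)

# Hypothetical helper: n pairs whose sample moments match the targets exactly
make_pair <- function(rho, n = 30, meanA = -0.5, meanB = 0.5, sdA = 1, sdB = 2) {
  Sigma <- matrix(c(sdA^2,           rho * sdA * sdB,
                    rho * sdA * sdB, sdB^2), nrow = 2)
  mvrnorm(n, mu = c(meanA, meanB), Sigma = Sigma, empirical = TRUE)
}

targets <- c(0.9, 0.5, 0)

# Sample correlation achieved for each target
cors <- sapply(targets, function(rho) {
  vals <- make_pair(rho)
  cor(vals[, 1], vals[, 2])
})
```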
As an alternative, please consider the following. Let the random variables X ~ N(0,1) and Y ~ N(0,1) be independent. Then the random variables X and rho X + sqrt(1 - rho^2) Y are both distributed N(0,1), but are now correlated with correlation rho. So possible R code could be
# Define the parameters
meanA <- -0.5
meanB <- 0.5
sdA <- 1
sdB <- 2
correlation <- 0.9
n <- 10000 # You want 30
# Generate from independent standard normals
x <- rnorm(n, 0, 1)
y <- rnorm(n, 0, 1)
# Transform
x2 <- x # could be avoided
y2 <- correlation*x + sqrt(1 - correlation^2)*y
# Fix up means and standard deviations
x3 <- meanA + sdA*x2
y3 <- meanB + sdB*y2
# Check summary statistics
mean(x3)
# [1] -0.4981958
mean(y3)
# [1] 0.4999068
sd(x3)
# [1] 1.014299
sd(y3)
# [1] 2.022377
cor(x3, y3)
# [1] 0.9002529
I created the correlate package to be able to create a correlation between any type of variable (regardless of distribution) within a given tolerance. It does so by permutations.
install.packages('correlate')
library('correlate')
A <- rnorm(30, -0.5, 1)
B <- rnorm(30, .5, 2)
C <- correlate(cbind(A,B), 0.9)
# 0.9012749
D <- correlate(cbind(A,B), 0.5)
# 0.5018054
E <- correlate(cbind(A,B), 0.0)
# -0.00407327
You can pretty much decide the whole matrix if you want (for multiple variables), by giving a matrix as second argument.
Ironically, you can also use it to create a multivariate normal.....

creating a function to calculate auc in R

I am extremely new to R and have an assignment that I am having a lot of trouble with. I have defined a discrete probability distribution:
s P(s)
0 1/9
1 4/9
2 1/9
3 0/9
4 1/9
5 0/9
6 0/9
7 1/9
8 0/9
9 1/9
Now I have to work on this question:
Consistent with other distributions available in R, create a
family of support functions for your probability distribution:
f = dsidp(d) # pmf - the height of the curve/bar for digit d
p = psidp(d) # cdf - the probability of a value being d or less
d = qsidp(p) # icdf - the digit corresponding to the given
# cumulative probability p
d[] = rsidp(n) # generate n random digits based on your probability distribution.
If someone could help me get started on writing these functions, it would be greatly appreciated!
Firstly, read the data:
dat <- read.table(text = "s P(s)
0 1/9
1 4/9
2 1/9
3 0/9
4 1/9
5 0/9
6 0/9
7 1/9
8 0/9
9 1/9", header = TRUE, stringsAsFactors = FALSE)
names(dat) <- c("s", "P")
Transform the fractions (represented as strings) to numeric values:
dat$P <- sapply(strsplit(dat$P, "/"), function(x) as.numeric(x[1]) / as.numeric(x[2]))
The functions:
# pmf - the height of the curve/bar for digit d
dsidp <- function(d) {
with(dat, P[s == d])
}
# cdf - the probability of a value being d or less
psidp <- function(d) {
with(dat, cumsum(P)[s == d])
}
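# icdf - the digit corresponding to the given cumulative probability p
# (smallest digit whose cumulative probability reaches p; the small
# tolerance guards against floating point error in cumsum)
qsidp <- function(p) {
with(dat, s[which(cumsum(P) >= p - 1e-9)[1]])
}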
Note. Since some probabilities are zero, some digits have identical cumulative probabilities. In these cases the lowest digit is returned by function qsidp.
# generate n random digits based on your probability distribution.
rsidp <- function(n) {
with(dat, sample(s, n, TRUE, P))
}
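As a self-contained variant, the four functions can also be written against plain vectors so that they accept several inputs at once. This is only a sketch: the names with the _v suffix are hypothetical, and the probabilities are typed in directly from the table in the question instead of being parsed from text.

```r
# Support and probabilities from the table in the question
s <- 0:9
P <- c(1, 4, 1, 0, 1, 0, 0, 1, 0, 1) / 9

# pmf: probability of each digit in d
dsidp_v <- function(d) P[match(d, s)]

# cdf: probability of a value being d or less
psidp_v <- function(d) cumsum(P)[match(d, s)]

# inverse cdf: smallest digit whose cumulative probability reaches p
qsidp_v <- function(p) {
  idx <- vapply(p, function(q) which(cumsum(P) >= q - 1e-9)[1], integer(1))
  s[idx]
}

# random digits drawn according to P
rsidp_v <- function(n) sample(s, n, replace = TRUE, prob = P)

total  <- sum(dsidp_v(0:9))        # probabilities sum to 1
digits <- qsidp_v(psidp_v(0:2))    # round trip through cdf and inverse cdf
draws  <- rsidp_v(1000)            # digits with zero probability never appear
```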
