I'm a complete beginner with R and I need to perform regressions on some data sets. My problem is that I'm not sure how to translate a model formula into its mathematical form. Interactions and the poly function confuse me the most. Can they be understood as a product and a polynomial?
Example
Let's take the following model, where both a and b are numeric vectors:
y ~ poly(a, 2):b
Can it be rewritten mathematically like this?
y = a*b + a^2 * b
Example 2
And when I get the following expression from the fit summary,
poly(a, 2)2:b
is it equal to the following formula?
a^2 * b
Your question is two-fold:
what does poly do;
what does : do.
For the first question, I refer you to my answer https://stackoverflow.com/a/39051154/4891738 for a complete explanation of poly. Note that for most users, it is sufficient to know that it generates a design matrix with degree columns, each of which is a basis function.
: is not a mystery. In your case, where b is also numeric, poly(a, 2):b will return
Xa <- poly(a, 2) # a matrix of two columns
X <- Xa * b # row-wise scaling of Xa by b
So your guess in the question is correct. But note that poly gives you an orthogonal polynomial basis, so it is not the same as I(a) and I(a^2). You can set raw = TRUE when calling poly to get the ordinary polynomial basis.
Xa has column names. poly(a,2)2 just means the 2nd column of Xa.
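For example, here is a minimal sketch with made-up numeric data (the variable names are mine) that checks the row-scaling claim against model.matrix:

set.seed(1)
a <- rnorm(10)
b <- rnorm(10)
Xa <- poly(a, 2)                    # orthogonal polynomial basis, 2 columns
X  <- model.matrix(~ poly(a, 2):b)  # intercept plus the two interaction columns
colnames(X)                         # "(Intercept)" "poly(a, 2)1:b" "poly(a, 2)2:b"
all.equal(as.vector(X[, -1]), as.vector(Xa * b))  # TRUE: it is just Xa scaled row-wise by b
head(poly(a, 2, raw = TRUE))        # raw = TRUE gives the ordinary basis a, a^2 instead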
Note that when b is a factor, there will be a design matrix, say Xb, for b. Obviously this is a 0-1 binary matrix as factor variables are coded as dummy variables. Then poly(a,2):b forms a row-wise Kronecker product between Xa and Xb. This sounds tricky, but is essentially just pair-wise multiplication between all columns of two matrices. So if Xa has ka columns and Xb has kb columns, the resulting matrix has ka * kb columns. Such mixing is called 'interaction'.
The resulting matrix also has column names. For example, poly(a, 2)2:b3 means the product of the 2nd column of Xa and the dummy column in Xb for the third level of b. I am not saying 'the 3rd column of Xb', as this is false if b is contrasted. Usually a factor will be contrasted, so if b has 5 levels, Xb will have 4 columns. Then the dummy column for the third level will be the 2nd column of Xb, if the first factor level is the reference level (hence not appearing in Xb).
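Here is a small sketch of the factor case with made-up data; I use the full model poly(a, 2) * b so that b enters the interaction via treatment contrasts, as described above:

set.seed(1)
a <- rnorm(100)
b <- factor(sample(1:5, 100, replace = TRUE))  # 5 levels -> 4 contrast columns
X <- model.matrix(~ poly(a, 2) * b)
ncol(X)                                # 15 = 1 + 2 + 4 + 2 * 4
grep(":", colnames(X), value = TRUE)   # "poly(a, 2)1:b2" ... "poly(a, 2)2:b5";
                                       # "poly(a, 2)2:b3" is the 2nd column of Xa
                                       # times the dummy for level 3 of b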
I'm estimating a Phillips curve model and, as such, need to take into account the unemployment gap, which is the difference between actual unemployment and the NAIRU (here, the NAIRU is the unobservable variable).
I need to impose the following constraint: some of the coefficients (say, beta 1 and beta 2) in the Z matrix (which relates to the NAIRU) must be the same as in the D matrix (which accounts for unemployment). However, it doesn't seem possible to impose such a constraint on the Z and D matrices simultaneously.
Can you guys help me?
I've tried setting the same "names" for the coefficients in the Z and D matrix, but that didn't work.
I have used SMOTE in R to create new data, and this worked fine. But when doing further research on how exactly SMOTE works, I couldn't find an answer to how SMOTE handles categorical data.
In the paper, an example is shown (page 10) with just numeric values. But I still do not know how SMOTE creates new data from categorical example data.
This is the link to the paper:
https://arxiv.org/pdf/1106.1813.pdf
That indeed is an important thing to be aware of. In terms of the paper that you are referring to, Sections 6.1 and 6.2 describe possible procedures for the cases of nominal-continuous and just nominal variables. However, DMwR does not use something like that.
If you look at the source code of SMOTE, you can see that the main work is done by DMwR:::smote.exs. I'll now briefly explain the procedure.
The summary is that the order of factor levels matters and that currently there seems to be a bug regarding factor variables which makes things work oppositely. That is, if we want to find an observation close to one with a factor level "A", then anything other than "A" is treated as "close" and those with level "A" are treated as "distant". Hence, the more factor variables there are, the fewer levels they have, and the fewer continuous variables there are, the more drastic the effect of this bug should be.
So, unless I'm wrong, the function should not be used with factors.
As an example, let's consider the case of perc.over = 600 with one continuous and one factor variable. We then arrive at smote.exs with the sub-data frame corresponding to the undersampled class (say, 50 rows) and proceed as follows.
Matrix T contains all but the class variable. Columns corresponding to the continuous variables remain unchanged, while factors or characters are coerced to integers. This means that the order of factor levels is essential.
Next we generate 50 * 6 = 300 new observations. We do so by creating 6 new observations (n = 1, ..., 6) for each of the 50 present ones (i = 1, ..., 50).
We scale the data by xd <- scale(T, T[i, ], ranges) so that xd shows deviations from the i-th observation. E.g., for i = 1 we may have
# [,1] [,2]
# [1,] 0.00000000 0.00
# [2,] -0.13333333 0.25
# [3,] -0.26666667 0.25
meaning that the continuous variable for i = 2, 3 is smaller than for i = 1, but that the factor levels of i = 2, 3 are "higher".
Then, by running for (a in nomatr) xd[, a] <- xd[, a] == 0, we ignore most of the information in the second column related to factor-level deviations: we set the deviation to 1 for those cases that have the same factor level as the i-th observation, and to 0 otherwise. (I believe it should be the opposite, meaning that it's a bug; I'm going to report it.)
Then we set dd <- drop(xd^2 %*% rep(1, ncol(xd))), which can be seen as a vector of squared distances of each observation from the i-th one, and kNNs <- order(dd)[2:(k + 1)] gives the indices of the k nearest neighbours. It is purposefully 2:(k + 1), as the first element should be i (its distance from itself should be zero). However, in this case the first element is not always i, due to the factor handling just described, which confirms the bug.
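To make this concrete, here is a small self-contained sketch of these steps with made-up data (Tm plays the role of the matrix T, and nomatr is the index of the nominal column, mirroring the names used inside DMwR:::smote.exs):

k <- 3
set.seed(42)
Tm <- cbind(num = rnorm(10),                              # one continuous variable
            fac = as.integer(factor(rep(c("A", "B"), 5))))  # one factor, coerced to integer
nomatr <- 2                                               # column index of the factor
ranges <- apply(Tm, 2, function(col) max(col) - min(col))

i  <- 1
xd <- scale(Tm, Tm[i, ], ranges)            # deviations from the i-th observation
for (a in nomatr) xd[, a] <- xd[, a] == 0   # 1 = same level as row i, 0 = different
dd <- drop(xd^2 %*% rep(1, ncol(xd)))       # squared "distances" to row i
kNNs <- order(dd)[2:(k + 1)]                # indices of the k "nearest" neighbours
dd[i]                                       # equals 1 rather than 0, because row i "matches"
order(dd)[1]                                # its own level; so this need not be i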
Now we create the n-th new observation, similar to the i-th one. First we pick one of the nearest neighbours, neig <- sample(1:k, 1). Then difs <- T[kNNs[neig], ] - T[i, ] is the component-wise difference between this neighbour and the i-th observation, e.g.,
difs
# [1] -0.1 -3.0
This means that the neighbour has lower values for both variables.
The new case is constructed by running T[i, ] + runif(1) * difs, which is indeed a convex combination of the i-th observation and the neighbour. This line is for the continuous variable(s) only. For the factors we have c(T[kNNs[neig], a], T[i, a])[1 + round(runif(1), 0)], which means that the new observation will have the same factor levels as the i-th observation with 50% probability, and the same as the chosen neighbour with the other 50% probability. So this is a kind of discrete interpolation.
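Continuing the sketch above (using the hypothetical Tm, kNNs, nomatr and i from before), one synthetic row would be built roughly like this:

neig <- sample(1:k, 1)                    # pick one of the k neighbours at random
difs <- Tm[kNNs[neig], ] - Tm[i, ]        # component-wise difference to that neighbour
new_case <- Tm[i, ] + runif(1) * difs     # convex combination for the continuous part
for (a in nomatr)                         # factor columns: 50/50 pick between
  new_case[a] <- c(Tm[kNNs[neig], a],     # the neighbour's level and row i's level
                   Tm[i, a])[1 + round(runif(1), 0)]
new_case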
I noticed that when using dummy coding to fit my linear models, R excludes certain parameters when forming the model matrix. What is the algorithm R uses for doing this?
It is not well documented, but it goes back to the column-pivoting strategy used by the underlying LINPACK-derived QR code. From the source code of lm.fit:
z <- .Call(C_Cdqrls, x, y, tol, FALSE)
...
coef <- z$coefficients
pivot <- z$pivot
...
r2 <- if(z$rank < p) (z$rank+1L):p else integer()
if (is.matrix(y)) {
    ....
} else {
    coef[r2] <- NA
    ## avoid copy
    if(z$pivoted) coef[pivot] <- coef
    ...
}
If you want to dig back further, you need to look into dqrdc2.f, which says (for what it's worth):
c dqrdc2 uses householder transformations to compute the qr
c factorization of an n by p matrix x. a limited column
c pivoting strategy based on the 2-norms of the reduced columns
c moves columns with near-zero norm to the right-hand edge of
c the x matrix. this strategy means that sequential one
c degree-of-freedom effects can be computed in a natural way.
In practice I have generally found that R eliminates the last (rightmost) column of a set of collinear predictor variables ...
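For instance, here is a minimal sketch with made-up data illustrating that behaviour:

set.seed(1)
x1 <- rnorm(20)
x2 <- 2 * x1                 # perfectly collinear with x1
y  <- x1 + rnorm(20)
coef(lm(y ~ x1 + x2))        # the coefficient of x2 (the rightmost collinear column) is NA
coef(lm(y ~ x2 + x1))        # swap the order and it is x1 that gets the NA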
Given a set of variables x_1, ..., x_n, I want to find the values of the coefficients in this equation:
y = a_1*x_1 +... +a_n*x_n + c
where a_1, a_2, ..., a_n are all unknown. Thinking of this in terms of a data frame, I want to create this value of y for every row in the data.
My question is: with y, a_1, ..., a_n and c all unknown, is there a way for me to find a set of solutions a_1, ..., a_n under the condition that corr(y, x_1), corr(y, x_2), ..., corr(y, x_n) are all greater than 0.7? For simplicity, take correlation here to mean Pearson correlation. I know there would be no unique solution, but how can I construct a set of solutions for a_1, ..., a_n that fulfils this condition?
I spent a day searching for ideas but could not find anything useful. An answer in any programming language is welcome, or at least some references.
No, it is not possible in general. It may be possible in some special cases.
Given x₁, x₂, ... you want to find y = a₁x₁ + a₂x₂ + ... + c so that all the correlations between y and the x's are greater than some target R. Since the correlation is
Corr(y, xi) = Cov(y, xi) / Sqrt[ Var(y) * Var(xi) ]
your constraint is
Cov(y, xi) / Sqrt[ Var(y) * Var(xi) ] > R
which can be rearranged to
Cov(y, xi)² > R² * Var(y) * Var(xi)
and this needs to be true for all i.
Consider the simple case where there are only two columns x₁ and x₂, and further assume that they both have mean zero (so you can ignore the constant c) and variance 1, and that they are uncorrelated. In that case y = a₁x₁ + a₂x₂ and the covariances and variances are
Cov(y, x₁) = a₁
Cov(y, x₂) = a₂
Var(x₁) = 1
Var(x₂) = 1
Var(y) = (a₁)² + (a₂)²
so you need to simultaneously satisfy
(a₁)² > R² * ((a₁)² + (a₂)²)
(a₂)² > R² * ((a₁)² + (a₂)²)
Adding these inequalities together, you get
(a₁)² + (a₂)² > 2 * R² * ((a₁)² + (a₂)²)
which means that in order to satisfy both of the inequalities, you must have R < Sqrt(1/2) (by cancelling the common factor on both sides of the inequality). So the very best you could do in this simple case is to choose a₁ = a₂ (the exact value doesn't matter as long as they are equal), and both of the correlations Corr(y, x₁) and Corr(y, x₂) will be equal to 0.707. You cannot achieve correlations higher than this between y and all of the x's simultaneously in this case.
For the more general case with n columns (each of which has mean zero, variance 1 and zero correlation between columns) you cannot simultaneously achieve correlations greater than 1 / sqrt(n) (as pointed out in the comments by #kazemakase).
In general, the more independent variables there are, the lower the correlation you will be able to achieve between y and the x's. Also (although I haven't mentioned it above) the correlations between the x's matter. If they are in general positively correlated, you will be able to achieve a higher target correlation between y and the x's. If they are in general uncorrelated or negatively correlated, you will only be able to achieve low correlations between y and the x's.
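Here is a small numeric sketch of the two-column case (made-up data):

set.seed(1)
n  <- 1e5
x1 <- rnorm(n)
x2 <- rnorm(n)          # essentially uncorrelated with x1, both roughly unit variance
y  <- x1 + x2           # a1 = a2 = 1, c = 0
cor(y, x1)              # approximately 0.707 = 1/sqrt(2)
cor(y, x2)              # approximately 0.707; you cannot push both above this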
I am not expert in this field so read with extreme prejudice!
I am a bit confused by your y.
Is your y a single constant, and you want the correlation between it and all the x_i values to be > 0.7? I am no math/statistics expert, but my feeling is that this is achievable only if the correlations between the x_i, x_j satisfy the same condition. In that case you can simply take the average of the x_i, like this:
y=(x_1+x_2+x_3+...+x_n)/n
so a_i = 1.0/n and c = 0.0. But still the question is:
What meaning does a correlation between just 2 numbers have?
It would be more reasonable if y were a function of x,
for example like this:
y(x) = a_1*(x-x_1)+... +a_n*(x-x_n) + c
or any other equation (it is hard to suggest one without knowing where the data came from and for what purpose). Then you can compute the correlation between the two sets
X = { x_1 , x_2 ,..., x_n }
Y = { y(x_1),y(x_2),...y(x_n) }
In that case I would try an approximation search for the c, a_i constants to maximize the correlation between X and Y, but the complexity of doing that for all constants at once would be insane. So instead I would tweak just one constant at a time (see the sketch after this list):
1. Set some safe starting constants c, a_1, a_2, ...
2. Tweak a_1: compute the correlation for (a_1 - delta) and (a_1 + delta), then move in whichever direction improves the correlation, and keep going in that direction until the correlation coefficient starts to drop. Then you can do this again recursively with a smaller delta (btw, this is exactly what my approx class does).
3. Loop step 2 through all the a_i.
4. Loop the whole thing a few times to improve precision.
Maybe you could also recompute c after each run to minimize the distance between the X and Y sets.
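Here is a rough sketch of that constant-at-a-time search in R (my own illustrative code, not the approx class; made-up data, and I take "the correlation" to be the smallest of corr(y, x_i), as in the original question):

set.seed(1)
n <- 200
X <- matrix(rnorm(n * 3), n, 3)             # three made-up predictor columns

min_cor <- function(a) {                    # objective: the smallest corr(y, x_i);
  y <- drop(X %*% a)                        # the additive constant c does not change
  min(cor(y, X))                            # any correlation, so it is ignored here
}

a     <- rep(1, ncol(X))                    # 1. some safe starting constants
delta <- 0.5
for (pass in 1:20) {                        # 4. loop the whole thing a few times
  for (i in seq_along(a)) {                 # 3. loop step 2 through all the a_i
    for (step in 1:50) {                    # 2. tweak a_i while it helps
      up   <- replace(a, i, a[i] + delta)
      down <- replace(a, i, a[i] - delta)
      best <- which.max(c(min_cor(a), min_cor(up), min_cor(down)))
      if (best == 1) break                  # correlation stopped improving
      a <- list(a, up, down)[[best]]
    }
  }
  delta <- delta / 2                        # then retry with a smaller delta
}
min_cor(a)                                  # best minimum correlation found; for
                                            # independent columns it stays near 1/sqrt(3)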
I am given a table, indexed on 2 dimensions (x, y), with values z (not necessarily ordered, though I think that isn't a horrendously unsafe assumption), such that f(x, y) = z.
So given an x and a y, I interpolate to find a z value. Now, given an x value (or a y value, I suppose; it's not really important) and a z value, I need to find the y value that corresponds to the data. Is it possible to do this without knowledge of an ordering in the z values of the table? If there is an ordering to the z values of the table, is it possible? In my head, given ordering, it should be possible to find a unique solution, but I don't know how I can do it if I am not given ordering.
Could you post some, or preferably all, of the data? Assuming it is linear and continuous, we have n copies of a*x + b*y = z. Let's say x = 3 and z = 4; then we have 3 unknowns, which we can put in a matrix with n rows and 3 columns. The first row would look like 3 b 4, since we are treating a, y and z as unknowns. Now try echelon row reduction. More specifically, do row1 - row2 (but don't then also do row2 - row1), row1 - row3, row1 - row4, ..., row2 - row3, row2 - row4, ...; there should be n choose 2 of these combinations. If there is a solution, then each combination will be of the form q_i*a = q_j*z (the a and z won't be there, of course), where q_i and q_j are the known numbers.
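For what it's worth, if the table is ordered in the sense that z is monotone in y for a fixed x (the ordering case from the question), a simple alternative sketch in R (made-up grid data) is to interpolate in x first and then invert the resulting one-dimensional relationship in y:

xs <- seq(0, 1, by = 0.25)
ys <- seq(0, 1, by = 0.25)
z  <- outer(xs, ys, function(x, y) x + y^2)   # hypothetical tabulated f(x, y), monotone in y
x0 <- 0.3; z0 <- 0.7                          # given x and z, find y

z_at_x0 <- apply(z, 2, function(col) approx(xs, col, xout = x0)$y)  # z(x0, ys)
y0 <- approx(z_at_x0, ys, xout = z0)$y        # invert along y (valid because monotone)
y0                                            # close to the exact value sqrt(0.4) ~ 0.632,
                                              # up to linear-interpolation error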