Should I shrink multiple factor variables into one?

Should I shrink multiple factor variables into one? - r

* NEW CAR: Purpose for this credit/application? (Binary)
+ 0: No
+ 1: Yes
* USED CAR: Purpose for this credit/application? (Binary)
+ 0: No
+ 1: Yes
* FURNITURE: Purpose for this credit/application? (Binary)
+ 0: No
+ 1: Yes
* RADIO/TV: Purpose for this credit/application? (Binary)
+ 0: No
+ 1: Yes
* EDUCATION: Purpose for this credit/application? (Binary)
+ 0: No
+ 1: Yes
* RETRAINING: Purpose for this credit/application? (Binary)
+ 0: No
+ 1: Yes
So these are some of the variables I have for a new dataset. Since they are related to the purpose behind loan applications, would it be wise to group them under one variable that I could call "Purpose"?
I plan on running a Random Forest Model. I want to try out both ways and I probably will as soon as I conduct an analysis as it is.
To be clear, I've already checked that two variables don't hold a "Yes" in the same row number.
Is there any downside to creating ONE vector/variable that looks like when creating decision trees?:
c("NEW_CAR", "RADIO/TV", "RETRAINING", "FURNITURE", "USED_CAR", "USED_CAR", "NEW_CAR")
I'm worried that there is a major downside as each current variable is binary and a tree can only split into two nodes. Perhaps it doesn't make a difference considering that there could be multiple nodes with a choice between the current variables.
I'm new. Thank you.
Thank You.

Related

lme4 translate formula to code in 3-level model

I have been provided with the following formulas and need to find the correct lme4 code. I find this rather challenging and could not find a good example I could follow...perhaps you can help?
I have two patient groups: group1 and group2. Both groups came to the lab 4 times for testing (4 VISITs) and during each of these visits, their memory was tested 4 times (4 RECALLs). The memory performance should be predicted from Age, Sex and two sleep parameters, which were assessed during each VISIT.
Thus Level 1 should be the RECALL (index i), Level 2 the VISIT (index j) and Level 3 the subject level (index k).
Level 1:
MEMSCOREijk = β0jk + β1jk * RECALLijk + Rijk
Level 2:
β0jk = γ00k + γ01k * VISITjk + U0jk
β1jk = γ10k + γ11k * VISITjk + U1jk
Level 3:
γ00k = δ000 + δ001 * SLEEPPARAM + V0k
γ10k = δ100 + δ101 * SLEEPPARAM + V1k
Thanks so much for your thoughts!

Something like this should work:
lmer(memscore ~ age + sex + sleep1 + sleep2 + (1 | visit) + (1 + sleep1 + sleep2 | subject), data = mydata)
By adding sleep1 and sleep2 in (1 + sleep1 + sleep2 | subject) you will be allowing the effects of the two sleep parameters to vary by participant (random slopes), and have a random intercept (more on that next sentence).(1 | visit) will allow random intercepts for each visit (random intercepts would model data where different visits had a higher or lower mean memscore), but not random slopes; I don't think that you want random slopes for the sleep parameters by visit - were they only measured one time per visit? If so, there would be no slope variation to model, I believe.
Hope that helps! I found this book very useful:
Snijders, T. A. B., & Bosker, R. J. (2012). Multilevel analysis: an introduction to basic and advanced multilevel modeling (2nd ed.): Sage.

Why variable importance is not reflected in variable actually used in tree construction?

I generated an (unpruned) classification tree on R with the following code:
fit <- rpart(train.set$line ~ CountryCode + OrderType + Bon + SupportCode + prev_AnLP + prev_TXLP + prev_ProfLP + prev_EVProfLP + prev_SplLP + Age + Sex + Unknown.Position + Inc + Can + Pre + Mol, data=train.set, control=rpart.control(minsplit=5, cp=0.001), method="class")
printcp(fit) shows:
Variables actually used in tree construction:
Age
CountryCode
SupportCode
OrderType
prev_AnLP
prev_EVProfLP
prev_ProfLP
prev_TXLP
prev_SplLP
Those are the same variables I can see at each node in the classification tree, so they are correct.
What I do not understand is the result of summary(fit):
Variable importance:
29 prev_EVProfLP
19 prev_AnLP
16 prev_TXLP
15 prev_SplLP
9 prev_ProfLP
7 CountryCode
2 OrderType
1 Pre
1 Mol
From summary(fit) results it seems that variables Pre and Mol are more important than SupportCode and Age, but in the tree Pre and Mol are not used to split the data, while SupportCode and Age are used (just before two leafs, actually... but still used!).
Why?

The importance of an attribute is based on the sum of the improvements in all nodes in which the attribute appears as a splitter (weighted by the fraction of the training data in each node split). Surrogates are also included in the importance calculations, which means that even a variable that never splits a node may be assigned a large importance score. This allows the variable importance rankings to reveal variable masking and nonlinear correlation among the attributes. Importance scores may optionally be confined to splitters; comparing the splitters-only and the full (splitters and surrogates) importance rankings is a useful diagnostic.
Also see chapter 10 of book 'The Top Ten Algorithms in Data Mining' for more information
https://www.researchgate.net/profile/Dan_Steinberg2/publication/265031802_Chapter_10_CART_Classification_and_Regression_Trees/links/567dcf8408ae051f9ae493fe/Chapter-10-CART-Classification-and-Regression-Trees.pdf.

Solving a single, long linear equation in R - many unknown variables

I have a single twelve-term equation for different classes of disease prevalence (nx) that looks like this:
y = f*b0*n0 + f*b1*n1 + f*b2*n2 + ... + f*b12*n12
I also have an equality constraint, such that
n0 + n1 + n2 + ... + n12 = 1 ## prevalence must add up to one
The unknowns are nx (where x=[1,12] ).
I would prefer to solve this analytically, but can also make do with a numerical solution.
I would be grateful for any pointers on R packages or approaches (or is this unsolvable?). The project is in support of the current humanitarian response in Iraq - so hopefully a worthwhile cause.

LMEM: Chi-square = 0 , prob = 1 - what's wrong with my code?

I'm running a LMEM (linear mixed effects model) on some data, and compare the models (in pairs) with the anova function. However, on a particular subset of data, I'm getting nonsense results.
This is my full model:
m3_full <- lmer(totfix ~ psource + cond + psource:cond +
1 + cond | subj) + (1 + psource + cond | object), data, REML=FALSE)
And this is the model I'm comparing it to: (basically dropping out one of the main effects)
m3_psource <- lmer (totfix ~ psource + cond + psource:cond -
psource + (1 + cond | subj) + (1 + psource + cond | object),
data, REML=FALSE)
Running the anova() function (anova(m3_full, m3_psource) returns Chisq = 0, pr>(Chisq) = 1
I'm doing the same for a few other LMEMs and everything seems fine, it's just this particular response value that gives me the weird chi-square and probability values. Anyone has an idea why and how I can fix it? Any help will be much appreciated!

This is not really a mixed-model-specific question: rather, it has to do with the way that R constructs model matrices from formulas (and, possibly, with the logic of your model comparison).
Let's narrow it down to the comparison between
form1 <- ~ psource + cond + psource:cond
and
form2 <- ~ psource + cond + psource:cond - psource
(which is equivalent to ~cond + psource:cond). These two formulas give equivalent model matrices, i.e. model matrices with the same number of columns, spanning the same design space, and giving the same overall goodness of fit.
Making up a minimal data set to explore:
dd <- expand.grid(psource=c("A","B"),cond=c("a","b"))
What constructed variables do we get with each formula?
colnames(model.matrix(form1,data=dd))
## [1] "(Intercept)" "psourceB" "condb" "psourceB:condb"
colnames(model.matrix(form2,data=dd))
## [1] "(Intercept)" "condb" "psourceB:conda" "psourceB:condb"
We get the same number of contrasts.
There are two possible responses to this problem.
There is one school of thought (typified by Nelder, Venables, etc.: e.g. see Venables' famous (?) but unpublished exegeses on linear models, section 5, or Wikipedia on the principle of marginality) that says that it doesn't make sense to try to test main effects in the presence of interaction terms, which is what you're trying to do.
There are occasional situations (e.g in a before-after-control-impact design where the 'before' difference between control and impact is known to be zero due to experimental protocol) where you really do want to do this comparison. In this case, you have to make up your own dummy variables and add them to your data, e.g.
## set up model matrix and drop intercept and "psourceB" column
dummies <- model.matrix(form1,data=dd)[,-(1:2)]
## d='dummy': avoid colons in column names
colnames(dummies) <- c("d_cond","d_source_by_cond")
colnames(model.matrix(~d_cond+d_source_by_cond,data.frame(dd,dummies)))
## [1] "(Intercept)" "d_cond" "d_source_by_cond"
This is a nuisance. My guess at the reason for this being difficult is that the original authors of R and S before it were from school of thought #1, and figured that generally when people were trying to do this it was a mistake; they didn't make it impossible, but they didn't go out of their way to make it easy.

Mathematical (Arithmetic) representation of XOR

I have spent the last 5 hours searching for an answer. Even though I have found many answers they have not helped in any way.
What I am basically looking for is a mathematical, arithmetic only representation of the bitwise XOR operator for any 32bit unsigned integers.
Even though this sounds really simple, nobody (at least it seems so) has managed to find an answer to this question.
I hope we can brainstorm, and find a solution together.
Thanks.

XOR any numerical input between 0 and 1 including both ends
a + b - ab(1 + a + b - ab)
XOR binary input
a + b - 2ab or (a-b)²
Derivation
Basic Logical Operators
NOT = (1-x)
AND = x*y
From those operators we can get...
OR = (1-(1-a)(1-b)) = a + b - ab
Note: If a and b are mutually exclusive then their and condition will always be zero - from a Venn diagram perspective, this means there is no overlap. In that case, we could write OR = a + b, since a*b = 0 for all values of a & b.
2-Factor XOR
Defining XOR as (a OR B) AND (NOT (a AND b)):
(a OR B) --> (a + b - ab)
(NOT (a AND b)) --> (1 - ab)
AND these conditions together to get...
(a + b - ab)(1 - ab) = a + b - ab(1 + a + b - ab)
Computational Alternatives
If the input values are binary, then powers terms can be ignored to arrive at simplified computationally equivalent forms.
a + b - ab(1 + a + b - ab) = a + b - ab - a²b - ab² + a²b²
If x is binary (either 1 or 0), then we can disregard powers since 1² = 1 and 0² = 0...
a + b - ab - a²b - ab² + a²b² -- remove powers --> a + b - 2ab
XOR (binary) = a + b - 2ab
Binary also allows other equations to be computationally equivalent to the one above. For instance...
Given (a-b)² = a² + b² - 2ab
If input is binary we can ignore powers, so...
a² + b² - 2ab -- remove powers --> a + b - 2ab
Allowing us to write...
XOR (binary) = (a-b)²
Multi-Factor XOR
XOR = (1 - A*B*C...)(1 - (1-A)(1-B)(1-C)...)
Excel VBA example...
Function ArithmeticXOR(R As Range, Optional EvaluateEquation = True)
Dim AndOfNots As String
Dim AndGate As String
For Each c In R
AndOfNots = AndOfNots & "*(1-" & c.Address & ")"
AndGate = AndGate & "*" & c.Address
Next
AndOfNots = Mid(AndOfNots, 2)
AndGate = Mid(AndGate, 2)
'Now all we want is (Not(AndGate) AND Not(AndOfNots))
ArithmeticXOR = "(1 - " & AndOfNots & ")*(1 - " & AndGate & ")"
If EvaluateEquation Then
ArithmeticXOR = Application.Evaluate(xor2)
End If
End Function
Any n of k
These same methods can be extended to allow for any n number out of k conditions to qualify as true.
For instance, out of three variables a, b, and c, if you're willing to accept any two conditions, then you want a&b or a&c or b&c. This can be arithmetically modeled from the composite logic...
(a && b) || (a && c) || (b && c) ...
and applying our translations...
1 - (1-ab)(1-ac)(1-bc)...
This can be extended to any n number out of k conditions. There is a pattern of variable and exponent combinations, but this gets very long; however, you can simplify by ignoring powers for a binary context. The exact pattern is dependent on how n relates to k. For n = k-1, where k is the total number of conditions being tested, the result is as follows:
c1 + c2 + c3 ... ck - n*∏
Where c1 through ck are all n-variable combinations.
For instance, true if 3 of 4 conditions met would be
abc + abe + ace + bce - 3abce
This makes perfect logical sense since what we have is the additive OR of AND conditions minus the overlapping AND condition.
If you begin looking at n = k-2, k-3, etc. The pattern becomes more complicated because we have more overlaps to subtract out. If this is fully extended to the smallest value of n = 1, then we arrive at nothing more than a regular OR condition.
Thinking about Non-Binary Values and Fuzzy Region
The actual algebraic XOR equation a + b - ab(1 + a + b - ab) is much more complicated than the computationally equivalent binary equations like x + y - 2xy and (x-y)². Does this mean anything, and is there any value to this added complexity?
Obviously, for this to matter, you'd have to care about the decimal values outside of the discrete points (0,0), (0,1), (1,0), and (1,1). Why would this ever matter? Sometimes you want to relax the integer constraint for a discrete problem. In that case, you have to look at the premises used to convert logical operators to equations.
When it comes to translating Boolean logic into arithmetic, your basic building blocks are the AND and NOT operators, with which you can build both OR and XOR.
OR = (1-(1-a)(1-b)(1-c)...)
XOR = (1 - a*b*c...)(1 - (1-a)(1-b)(1-c)...)
So if you're thinking about the decimal region, then it's worth thinking about how we defined these operators and how they behave in that region.
Non-Binary Meaning of NOT
We expressed NOT as 1-x. Obviously, this simple equation works for binary values of 0 and 1, but the thing that's really cool about it is that it also provides the fractional or percent-wise compliment for values between 0 to 1. This is useful since NOT is also known as the Compliment in Boolean logic, and when it comes to sets, NOT refers to everything outside of the current set.
Non-Binary Meaning of AND
We expressed AND as x*y. Once again, obviously it works for 0 and 1, but its effect is a little more arbitrary for values between 0 to 1 where multiplication results in partial truths (decimal values) diminishing each other. It's possible to imagine that you would want to model truth as being averaged or accumulative in this region. For instance, if two conditions are hypothetically half true, is the AND condition only a quarter true (0.5 * 0.5), or is it entirely true (0.5 + 0.5 = 1), or does it remain half true ((0.5 + 0.5) / 2)? As it turns out, the quarter truth is actually true for conditions that are entirely discrete and the partial truth represents probability. For instance, will you flip tails (binary condition, 50% probability) both now AND again a second time? Answer is 0.5 * 0.5 = 0.25, or 25% true. Accumulation doesn't really make sense because it's basically modeling an OR condition (remember OR can be modeled by + when the AND condition is not present, so summation is characteristically OR). The average makes sense if you're looking at agreement and measurements, but it's really modeling a hybrid of AND and OR. For instance, ask 2 people to say on a scale of 1 to 10 how much do they agree with the statement "It is cold outside"? If they both say 5, then the truth of the statement "It is cold outside" is 50%.
Non-Binary Values in Summary
The take away from this look at non-binary values is that we can capture actual logic in our choice of operators and construct equations from the ground up, but we have to keep in mind numerical behavior. We are used to thinking about logic as discrete (binary) and computer processing as discrete, but non-binary logic is becoming more and more common and can help make problems that are difficult with discrete logic easier/possible to solve. You'll need to give thought to how values interact in this region and how to translate them into something meaningful.

"mathematical, arithmetic only representation" are not correct terms anyway. What you are looking for is a function which goes from IxI to I (domain of integer numbers).
Which restrictions would you like to have on this function? Only linear algebra? (+ , - , * , /) then it's impossible to emulate the XOR operator.
If instead you accept some non-linear operators like Max() Sgn() etc, you can emulate the XOR operator with some "simpler" operators.

Given that (a-b)(a-b) quite obviously computes xor for a single bit, you could construct a function with the floor or mod arithmetic operators to split the bits out, then xor them, then sum to recombine. (a-b)(a-b) = a2 -2·a·b + b2 so one bit of xor gives a polynomial with 3 terms.
Without floor or mod, the different bits interfere with each other, so you're stuck with looking at a solution which is a polynomial interpolation treating the input a,b as a single value: a xor b = g(a · 232 + b)
The polynomial has 264-1 terms, though will be symmetric in a and b as xor is commutative so you only have to calculate half of the coefficients. I don't have the space to write it out for you.

I wasn't able to find any solution for 32-bit unsigned integers but I've found some solutions for 2-bit integers which I was trying to use in my Prolog program.
One of my solutions (which uses exponentiation and modulo) is described in this StackOverflow question and the others (some without exponentiation, pure algebra) can be found in this code repository on Github: see different xor0 and o_xor0 implementations.
The nicest xor represention for 2-bit uints seems to be: xor(A,B) = (A + B*((-1)^A)) mod 4.
Solution with +,-,*,/ expressed as Excel formula (where cells from A2 to A5 and cells from B1 to E1 contain numbers 0-4) to be inserted in cells from A2 to E5:
(1-$A2)*(2-$A2)*(3-$A2)*($A2+B$1)/6 - $A2*(1-$A2)*(3-$A2)*($A2+B$1)/2 + $A2*(1-$A2)*(2-$A2)*($A2-B$1)/6 + $A2*(2-$A2)*(3-$A2)*($A2-B$1)/2 - B$1*(1-B$1)*(3-B$1)*$A2*(3-$A2)*(6-4*$A2)/2 + B$1*(1-B$1)*(2-B$1)*$A2*($A2-3)*(6-4*$A2)/6

It may be possible to adapt and optimize this solution for 32-bit unsigned integers. It's complicated and it uses logarithms but seems to be the most universal one as it can be used on any integer number. Additionaly, you'll have to check if it really works for all number combinations.

I do realize that this is sort of an old topic, but the question is worth answering and yes, this is possible using an algorithm. And rather than go into great detail about how it works, I'll just demonstrate with a simple example (written in C):
#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <time.h>
typedef unsigned long
number;
number XOR(number a, number b)
{
number
result = 0,
/*
The following calculation just gives us the highest power of
two (and thus the most significant bit) for this data type.
*/
power = pow(2, (sizeof(number) * 8) - 1);
/*
Loop until no more bits are left to test...
*/
while(power != 0)
{
result *= 2;
/*
The != comparison works just like the XOR operation.
*/
if((power > a) != (power > b))
result += 1;
a %= power;
b %= power;
power /= 2;
}
return result;
}
int main()
{
srand(time(0));
for(;;)
{
number
a = rand(),
b = rand();
printf("a = %lu\n", a);
printf("b = %lu\n", b);
printf("a ^ b = %lu\n", a ^ b);
printf("XOR(a, b) = %lu\n", XOR(a, b));
getchar();
}
}

I think this relation might help in answering your question
A + B = (A XOR B ) + 2*(A.B)

(a-b)*(a-b) is the right answer. the only one? I guess so!

Develop Reference

r css asp.net wordpress firebase qt symfony nginx http apache-flex

Should I shrink multiple factor variables into one? - r

Related

lme4 translate formula to code in 3-level model

Why variable importance is not reflected in variable actually used in tree construction?

Solving a single, long linear equation in R - many unknown variables

LMEM: Chi-square = 0 , prob = 1 - what's wrong with my code?

Mathematical (Arithmetic) representation of XOR

Categories

Resources