ML: Dummy variable trap, or why do we have n-1 dummy variables? - dummy-variable

Is it possible to explain the dummy variable trap, i.e. the reason why we use n-1 dummy variables?
If we have n dummy variables instead, what would be the effect?
Here n is the number of unique categories.
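Since the question asks what goes wrong with n dummies, here is a minimal sketch in R (the factor and its levels are invented for illustration): with an intercept plus one dummy per category, the dummy columns sum to the intercept column, so the design matrix loses rank and the coefficients are not identifiable.

```r
# Toy factor with n = 3 categories (made-up levels)
f <- factor(c("red", "green", "blue", "red", "green"))

# Intercept plus all 3 dummies: the dummies sum to the intercept column,
# so the design matrix is rank-deficient (rank 3, not 4): the dummy trap
X_full <- cbind(Intercept = 1, model.matrix(~ f - 1))
qr(X_full)$rank

# R's default coding: intercept plus n - 1 = 2 dummies, full column rank
X_ref <- model.matrix(~ f)
qr(X_ref)$rank
```

This is why regression software drops one level as the reference category by default.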

Related

Translating a for-loop to perhaps an apply through a list

I have an R code question that has kept me from completing several tasks for the last year, but I am relatively new to R. I am trying to loop over a list to create two variables with a specified correlation structure. I have been able to "cobble" this together with a for loop. To further complicate matters, I need to be able to put the correlation number into a data frame two times.
For my ultimate usage, I am concerned about speed, efficiency, and long-term effectiveness of my code.
library(mvtnorm)
n <- 100
d <- NULL
col <- c(0, .3, .5)
for (j in 1:length(col)) {
  X.corr <- matrix(c(1, col[j], col[j], 1), nrow = 2, ncol = 2)
  x <- rmvnorm(n, mean = c(0, 0), sigma = X.corr)
  x1 <- x[, 1]
  x2 <- x[, 2]
}
d <- rbind(d, c(j))
Let me describe my code so my logic is clear. This is part of a larger simulation. I am trying to draw 2 correlated variables with rmvnorm at 3 different correlation levels, one level per pass, using 100 observations [toy data to get the coding correct]. d is an empty data frame.

The 3 correlation levels occur in the following way: pass 1 uses correlation 0 to create the variables, and then other code runs; pass 2 uses correlation .3 to create 2 new variables, and then other code runs; pass 3 uses correlation .5 to create 2 new variables, and then other code runs.

Within my larger code, the for-loop gets the job done. The last line puts the number of the correlation into the data frame. I realize that as presented here it will only put 1 number into this data frame, but when it is incorporated into my larger code it works as desired, putting 3 different numbers in a single column (1=0, 2=.3, and 3=.5).

To reiterate, the for-loop gets the job done, but I believe there is a better way, perhaps something in the apply family. I do not know how to construct this and still access which correlation is being used. Would someone help me develop this little piece of code? Thank you.
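Not an answer from the thread, but one sketch of how the loop above might be expressed with lapply while still tracking which correlation produced each draw (variable names are illustrative; rmvnorm is from mvtnorm as in the question):

```r
library(mvtnorm)

n <- 100
cors <- c(0, .3, .5)

# One list element per correlation level; the level is stored alongside the draws
draws <- lapply(cors, function(r) {
  sigma <- matrix(c(1, r, r, 1), nrow = 2, ncol = 2)
  x <- rmvnorm(n, mean = c(0, 0), sigma = sigma)
  data.frame(cor = r, x1 = x[, 1], x2 = x[, 2])
})

# Stack into one data frame: 300 rows, with the correlation recorded per row
d <- do.call(rbind, draws)
```

Because each element of `draws` carries its own `cor` column, the "other code" for each pass can be placed inside the anonymous function and still see which correlation is in use.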

Dummy Variable Problems in Latent Class Analysis with R

I am a beginner in R.
I am running a conjoint analysis and a latent class analysis (LCA) at the same time.
The conjoint analysis went well, but there was a problem with the LCA.
When running poLCA, the following warning message appears:
'ALERT: some manifest variables contain values that are not positive integers. For poLCA to run, please recode categorical outcome variables to increment from 1 to the maximum number of outcome categories for each variable.'
So, after thinking about it, I recoded all the dummy variables in the data from 0 to 1 and from 1 to 2.
In addition, all categorical variables, such as marital status and gender, were changed from 0 to 1 and from 1 to 2.
After that, the results came out without errors.
Usually, dummy variables are coded as '0 or 1' or as '-1 or +1'.
Is it possible to code dummy variables as 1 or 2 in R (poLCA)?
Or did I do something wrong? I do not know.
It's my first time with LCA and R, so I'm unfamiliar with them.
Advice from experienced users, please.
Thank you.
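For what it's worth, poLCA does require manifest variables coded 1, 2, ..., K, exactly as its warning says, so shifting 0/1 dummies to 1/2 is the standard fix rather than an error. A minimal sketch of the recoding (the data frame and column names here are invented):

```r
# Invented 0/1-coded data for illustration
dat <- data.frame(gender  = c(0, 1, 1, 0),
                  married = c(1, 0, 1, 1))

# Shift every 0/1 column up by one so categories run from 1 to 2,
# as poLCA expects; the meaning of the categories is unchanged
dat[] <- lapply(dat, function(x) x + 1)
```

The category labels only shift; the model estimated on 1/2 codes is the same as on 0/1 codes.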

How to save values in Vector using R

I am supposed to find the mean and standard deviation at each given sample size (N) using a for loop. I started writing the code below, and I am required to save all the means into a vector p. How do I save all the means into one vector?
sample.sizes = c(3, 10, 50, 100, 500, 1000)
mean.sds = numeric(0)
for (N in sample.sizes) {
  x <- rnorm(3, mean = 0, sd = 1)
  mean.sds[i]
}
mean(x)
Actually, you are doing several things wrong.
You declare the loop variable N, but you never use it anywhere.
for (N in some_vector) means N takes the vector's values one by one. So N in sample.sizes will first be 3, then 10, then 50, and so on.
Now where does i come into the picture?
You calculate x on each iteration over N, but in fact you never use N inside the loop.
Also, x holds 3 values. On the next line you intend to store these three values in just the i-th element of mean.sds, where i is undefined, and storing three values into one element, as is, is not logically possible.
Do you want this?
sample.sizes <- c(3, 10, 50, 100, 500, 1000)
mean.sds <- numeric(0)
for (i in seq_along(sample.sizes)) {
  x <- rnorm(sample.sizes[i], mean = 0, sd = 1)
  mean.sds[i] <- mean(x)
}
mean.sds
[1] 0.6085489531 -0.1547286299 0.0052106559 -0.0452804986 -0.0374094936 0.0005667246
I replaced N with seq_along(sample.sizes), which gives one iteration per element of the vector: six in this example.
I passed the i-th sample size as the first argument of rnorm to generate that many random values.
These are stored in the single vector x; its mean (one value only) is then calculated and stored in the i-th element of your initially empty vector.
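The same computation can also be written without an explicit loop; a sketch using sapply over the same sample sizes:

```r
sample.sizes <- c(3, 10, 50, 100, 500, 1000)

set.seed(1)  # only so repeated runs match
# One mean per sample size; sapply simplifies the results to a numeric vector
mean.sds <- sapply(sample.sizes, function(N) mean(rnorm(N, mean = 0, sd = 1)))
mean.sds
```

Here the loop index disappears entirely: each sample size is passed directly to the anonymous function, so there is no i to mismanage.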

How to use a binary variable to build a logistic regression model?

As you can see, this is the structure of my dependent variable (G3):
G3 is the student's final-period grade. It is a binary variable: if G3 < 10 the student fails; if G3 >= 10 the student passes. It is coded so that "1" means fail and "2" means pass.
Now I am going to build a logistic regression model. I need to convert this binary variable into a numeric variable, coded so that G3 equals 1 if the student failed and 0 if the student passed. What should I do?
I checked the structure of G3 again: it is now a numeric variable, but "fail" and "pass" are still represented by "1" and "2". How can I change them to "1" and "0"?
How about
performance$G3 <- 2-performance$G3
?
Alternatively, you could have started at the beginning with
performance$G3 <- ifelse(performance$G3=="fail",1,0)
Finally, you can use a factor variable as a response. From ?binomial, if the response variable is a factor,
... ‘success’ is interpreted as the factor not having the first level (and hence usually of having the second level).
You'd have to change the order of the levels, e.g.
performance$G3 <- factor(performance$G3, levels=c("pass", "fail"))
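Once G3 is coded 1 = fail, 0 = pass, the model itself fits with glm; a minimal sketch using an invented data frame and an invented predictor (absences), since the question's other columns are not shown:

```r
# Invented toy data: G3 already recoded so 1 = fail, 0 = pass
performance <- data.frame(
  G3       = c(1, 0, 1, 0, 0, 1),
  absences = c(10, 2, 8, 1, 9, 3)
)

# Logistic regression; the fitted probabilities are P(fail)
fit <- glm(G3 ~ absences, data = performance, family = binomial)
summary(fit)
```

With the factor-response route instead, the same glm call works on the releveled factor, and "fail" (the second level) is modeled as the success.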

"Factor codes of type double or float detected" when using read.dta13

I am using read.dta13 packages to load data. There are a bunch of categorical variables with Stata values labels in the data set. The data set looks like below in Stata:
cohort year age gender income health migration
1101 2010 35 F 13034 healthy yes
1102 2010 54 M 34134 unhealthy no
For gender, health and migration, the original values are numeric; for example, gender = 1 for male. In Stata, for ease of understanding, I added value labels for the categorical variables using label define, so the data shows as above, but the original values are kept. Now let's go to R. If I simply type
mydata <- read.dta13("mydata_stata13.dta")
I get a lot of warnings like these
Factor codes of type double or float detected - no labels assigned.
Set option nonint.factors to TRUE to assign labels anyway.
All the value labels I added in Stata are dropped, which is what I need in R. The problem is that R gives warnings even for some variables that should be taken as numeric, for example income. I don't want to set nonint.factors = TRUE, since I need the numeric values of the categorical variables for the calculation.
It's not actually an error, but I would like to know whether it is safe to just ignore the warnings.
As the warning states, there are doubles or floats with value labels assigned. This happens, I assume, because you created a categorical variable without telling Stata to store it as a byte. readstata13 gives you a warning because it cannot tell whether floats/doubles with value labels are categorical or continuous variables.
Let's say gender is the wrongly stored variable; I assume the person who coded the variables in Stata created it as:
gen gender = *expr*
instead of
gen byte gender = *expr*
This can be solved either by always prefixing categorical variables with gen byte or by running compress (see Stata's manual) before saving/exporting the whole dataset. You can detect which variables are wrongly coded by using describe and checking for value-label assignment on non-byte variables. This will in turn store your data efficiently.
In addition, I assume that for some reason the same person accidentally added a value label to a "true" float variable, like income, at some point. Check the labelbook command to correct such problems.
