R glm regression not including several dummy variables [closed]

I have a data set (acs_hh) in which one of the columns is race_eth.
I run the following regression:
reg <- glm(acs_hh$own ~ acs_hh$hhincome + acs_hh$race_eth, family = "binomial")
summary(reg)
However, the summary only shows coefficients for four of the race categories, even though asian is also a race in my dataset.
Why is R not calculating a coefficient for asians, i.e. acs_hh$race_ethasian, non-hisp?

When using dummy variables, one of the categories is excluded and serves as the reference category to which all the others are compared. So to calculate fitted values for Asian, non-hisp, you would set all of the other category dummies to 0.

Because "asian" is the reference level of acs_hh$race_eth -- all the other coefficients represent the effect relative to the reference level (which in your case, I suspect is "asian" because that is the alphabetically first level).

Related

Issue with the Summarize Function [closed]

Code:
data(tips)
tips %>%
  group_by(sex) %>%
  summarize(variance = var(tip))
Output:
variance
1 1.914455
The output isn't the desired one. The result should be a tibble with variance computed against each group (in this case, sex). The summarize function is computing the variance of the entire tip column, rather than calculating the variance of each group.
Tried executing the code and restarting RStudio several times, but it didn't work.
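One common cause of this behaviour is that another attached package (for example plyr or Hmisc) masks dplyr's summarize. A minimal sketch that loads the needed packages and calls summarize with an explicit namespace, assuming the tips data set comes from reshape2:
library(reshape2)   # assumed source of the tips data set
library(dplyr)

data(tips)
tips %>%
  group_by(sex) %>%
  dplyr::summarize(variance = var(tip))   # explicit namespace avoids masking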

Looking for resources for help modeling logistic regression with an event/trial syntax in R? [closed]

I am modeling bird nesting success, and my advisor wants me to use event/trials syntax in R to model the number of eggs that hatched versus the total number of eggs per nest (i.e. events/trials) against a variety of predictor variables, essentially in the logistic regression framework.
This is totally new to me, so any online resources or code help would be incredibly useful! Thank you!
I haven't tried much yet because I can't find the resources; I can only find information for SAS.
When specifying a logistic regression with glm(), the response can be given in several different ways. One is a two-column matrix whose first column is the number of successes and whose second column is the number of failures.
So if you have two variables, total_eggs and hatched, try
mod <- glm(cbind(hatched, total_eggs - hatched) ~ x + ...,
           data = your_data_frame, family = binomial)
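An equivalent way to specify the same model, using the columns hatched and total_eggs from above, is a proportion response weighted by the number of trials; x again stands for your predictor(s):
mod <- glm(hatched / total_eggs ~ x,
           data = your_data_frame, family = binomial,
           weights = total_eggs)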

How to do a fancy box-plot using ggplot2? [closed]

I am trying to plot a box-plot with ggplot2 using the Wage data set in the ISLR package. The box-plot is meant to visualize wage versus educational level, which is recorded in five categories. When I try to use the typical code to generate the box-plot, I get the following warning and error from RStudio:
Don't know how to automatically pick scale for object of type data.frame. Defaulting to continuous.
Error: Aesthetics must be either length 1 or the same as the data (3000): y
My code is
library("ISLR")
library("MASS")
library("ggplot2")
setwd("C:/Users/Alonso/Desktop/ITSL")
View(Wage)
ggplot(Wage, aes(x = education, y = Wage)) +
  geom_boxplot(outlier.colour = "red", outlier.shape = 8, outlier.size = 4) +
  labs(x = "Nivel de estudio", y = "Salario")
I have made other plots, but only with numeric variables; maybe the problem is that now I am using a categorical variable. Any ideas? Thanks in advance and greetings from Chile.
You were almost there, just needed a lowercase y=wage because the column name is wage and not Wage.
ggplot(Wage, aes(x = education, y = wage)) +
  geom_boxplot(outlier.colour = "red", outlier.shape = 8, outlier.size = 4) +
  labs(x = "Nivel de estudio", y = "Salario")

Machine learning - Calculating the importance of a "value" in a variable [closed]

I'm analyzing a medical dataset containing 15 variables and 1.5 million data points. I would like to predict hospitalization and, more importantly, which type of medication may be responsible. The medicine variable has around 700 types of drugs. Does anyone know how to calculate the importance of a "value" (type of drug in this case) within a variable for boosting? I need to know whether 'drug A' is better for prediction than 'drug B', both within a variable called 'medicine'.
A logistic regression model can give such information in terms of p-values for each drug, but I would like to use a more complex method. Of course you can create a binary variable for each type of drug, but this gives 700 extra variables and does not seem to work very well. I'm currently using R. I really hope you can help me solve this problem. Thanks in advance! Kind regards, Peter
See varImp() in the caret package, which supports the ML algorithms you mentioned.
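A minimal sketch of that idea, assuming a hypothetical data frame med_data with a factor column medicine and a binary factor outcome hospitalized. With the formula interface, caret expands medicine into per-drug dummies, so varImp() reports an importance score for each drug:
library(caret)

set.seed(1)
fit <- train(hospitalized ~ medicine,   # hypothetical column names
             data = med_data,
             method = "gbm",            # gradient boosting
             verbose = FALSE)
varImp(fit)                             # importance per dummy, i.e. per drug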

Should Categorical predictors within a linear model be normally distributed? [closed]

I am running simple linear models (Y ~ X) in R where my predictor is a categorical variable (0-10). However, this variable is not normally distributed and none of the available transformation techniques are helpful (e.g. log, square root, etc.), as the data is not negatively/positively skewed but rather all over the place. I am aware that for lm the outcome variable (Y) has to be normally distributed, but is this also required for predictors? If yes, any suggestions of how to do this would be more than welcome.
Also, as the data I am looking at has two groups, patients vs controls (I am interested in group differences, as you can guess), do I have to look at whether the data is normally distributed within the two groups or overall across the two groups?
Thanks.
See @Roman Luštrik's comment above: it does not matter how your predictors are distributed (except for problems with multicollinearity). What is important is that the residuals be normal (and have homogeneous variances).
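A minimal sketch of checking those assumptions on the residuals rather than the predictor, assuming a hypothetical data frame dat with outcome Y and categorical predictor X:
fit <- lm(Y ~ X, data = dat)   # dat, Y and X are hypothetical names

par(mfrow = c(1, 2))
qqnorm(resid(fit)); qqline(resid(fit))   # normality of residuals
plot(fitted(fit), resid(fit),            # homogeneity of variance
     xlab = "Fitted values", ylab = "Residuals")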
