Regression with several dummy variables - r

I am running a logistic regression and I want to control for the country of the respondents. I have 12 countries. I used the "fastDummies" package to create dummies for each country:
ALL<-dummy_cols(ALL, select_columns = "country")
I get something like this:
country_Japan 1 1 0 0 0 0
country_Taiwan 0 0 1 1 0 0
country_China 0 0 0 0 1 1
and so on...
As you can see, the dummy variables always sum to 1 in every row, which creates perfect collinearity. For this reason, I cannot estimate the model.
I read that I need to include a variable with 0s as the last country dummy to avoid this collinearity. Is this correct? I included the intercept (a column of 1s), but it did not help.
I would appreciate your suggestions. Thanks

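Check the remove_first_dummy parameter of the dummy_cols function, i.e. set it to TRUE. This drops the dummy for the first country, which then acts as the reference category, so the remaining dummies no longer add up to the intercept column. This should solve your multicollinearity problem.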
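To illustrate why dropping one dummy helps, here is a minimal base-R sketch. The data are simulated and the column names x and y are made up for the example; only three of the countries are used. It also shows that if country is stored as a factor, glm() picks a reference level by itself, so hand-made dummies may not be needed at all.

```r
# Simulated stand-in for the real data (x and y are hypothetical columns)
set.seed(42)
ALL <- data.frame(
  country = factor(rep(c("Japan", "Taiwan", "China"), each = 40)),
  x = rnorm(120)
)
ALL$y <- rbinom(120, 1, plogis(0.5 * ALL$x))

# A full set of dummies plus an intercept is rank deficient:
# the three dummy columns add up to the column of 1s
full <- cbind(Intercept = 1, model.matrix(~ country - 1, data = ALL))
rank_full <- qr(full)$rank   # 3, although there are 4 columns

# With a factor, glm() drops the first level (the reference) automatically
fit <- glm(y ~ x + country, data = ALL, family = binomial)
X <- model.matrix(fit)
rank_fit <- qr(X)$rank       # equals ncol(X), so the model is estimable
colnames(X)
```

The coefficients of the remaining country terms are then interpreted relative to the dropped reference country.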

Related

Create a new Variable of values of another variable-multilevel regression

I am trying to set up a multilevel analysis (and I am a total newbie).
In this analysis I want to test whether a high value of one predictor (here: senseofhumor, a numeric value converted into "high", "low", "medium") predicts the (numeric) outcome better than the other (numeric) predictors (senseofhumor, seriousness, friendlyness).
I have a dataset with many people and groups and want to compare the outcome between the groups regarding the influence of SenseofhumorHIGH.
The code for that might look like this:
RandomslopeEC <- lme(criteria(timepoint1) ~ senseofhumor + seriousness + friendlyness , data = DATA, random = ~ SenseofhumorHIGH|group)
For that reason I created the values "high", "low" and "med" for my numeric predictor via
library(tidyverse)
DATA <- DATA %>%
  mutate(predictorNew = case_when(
    senseofhumor < quantile(senseofhumor, 0.5) ~ 'low',
    senseofhumor > quantile(senseofhumor, 0.75) ~ 'high',
    TRUE ~ 'med'))
Now they look like this:
Person Group senseofhumor
1 56 low
7 1 high
87 7 low
764 45 high
Now I realized I might need to split the values of this variable into separate variables if I want to test my idea.
Do any of you know how to generate variables so that they look like this?
Person Group senseofhumorHIGH senseofhumorMED senseofhumorLOW
1 56 0 0 1
7 1 1 0 0
87 7 0 0 1
764 45 1 0 0
51 3 1 0 0
362 9 1 0 0
87 27 0 0 1
Does this make any sense to you regarding my approach? Or do you have a better idea?
Thanks a lot in advance
Welcome to learning R. You will want to convert these types of variables to "factors," and R will be able to process them accordingly. To do that, use as.factor(variable) - so for you it may be DATA$senseofhumor <- as.factor(DATA$senseofhumor). If you need to convert multiple columns, you can use:
factor_cols <- c("Var1","Var2","Var3") # list columns you want as factors
DATA[factor_cols] <- lapply(DATA[factor_cols], as.factor)
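If you do want explicit 0/1 columns like the ones in your second table, one possible base-R sketch is below. It assumes the categorised column is called predictorNew (as in your mutate() call) and uses the first four rows of your example; the names of the new columns are set by hand to match your table.

```r
# First four rows of the example, with the categorised predictor
DATA <- data.frame(
  Person = c(1, 7, 87, 764),
  Group = c(56, 1, 7, 45),
  predictorNew = c("low", "high", "low", "high")
)

# Fix the level order so "med" gets a column even if no row has it
DATA$predictorNew <- factor(DATA$predictorNew, levels = c("low", "med", "high"))

# "- 1" removes the intercept, giving one 0/1 column per level
dummies <- model.matrix(~ predictorNew - 1, data = DATA)
colnames(dummies) <- paste0("senseofhumor", toupper(levels(DATA$predictorNew)))

DATA <- cbind(DATA, dummies)
DATA
```

In most model formulas you can pass the factor itself and R builds these indicator columns internally, so the explicit columns are mainly useful when a single level has to be named, as in a random slope.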
Since you are new, note that this forum is typically for questions whose answers can't easily be found online. This question is relatively routine and more details can be found with a quick Google search. While SO is a great place to learn R, you may be penalized by the SO community in the future for routine questions like this. Just trying to help ensure you keep learning!

Matches in binary columns-R

I am fitting some prediction models. I have 2 binary columns, one with the predicted values and the other with the actual values.
Since the columns contain few ones (they flag people with cancer), I want to see how many cases the model detected (how many real ones it predicted) and the percentage of sick persons correctly predicted.
Brief description of the data: the first column shows the real values and the second one shows the predicted values:
> predictedvsreal
real prediction
39240 0 0
39241 0 0
39242 0 0
39243 1 0
39244 0 1
39245 0 0
39246 0 0
39247 0 0
39248 1 1
39249 0 0
39250 0 0
39251 0 0
39252 0 0
Thanks!
Next time please include a reproducible example as it makes the question much better - both for letting people who answer have a concrete example to work with and to catch edge-cases, and for future readers to see a real example.
There are lots of good recommendations for how to create nice, minimal, reproducible examples at this link.
From what you describe, you want the table function, probably like this:
with(your_data, table(your_first_column_name, your_second_column_name))
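As a sketch of how that could look with the rows shown in the question (the data frame below is typed in from the printout, and the names detected and sensitivity are only illustrative):

```r
# The 13 rows printed in the question
predictedvsreal <- data.frame(
  real       = c(0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0),
  prediction = c(0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0),
  row.names  = 39240:39252
)

# Cross-tabulate actual against predicted values
confusion <- with(predictedvsreal, table(real, prediction))
confusion

# Real cases the model detected (real == 1 and prediction == 1)
detected <- confusion["1", "1"]

# Share of real cases that were detected
sensitivity <- detected / sum(confusion["1", ])
```

For these rows one of the two real cases is detected, so the share is 0.5.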

How to subset data in R: participant only needs to meet one of five criteria?

I'm having a lot of trouble figuring out how to subset a data set in R despite reading through many pages here. The set contains information from over 3000 participants. Each participant was asked about five different health conditions and gave binary answers (i.e., yes/no diabetes; yes/no obesity, etc.). How do I make a subset that includes people who have only ONE of the conditions? For instance, everyone in this new subset would have either obesity or diabetes or high cholesterol, but none would have two or more conditions.
Thank you!!
ETA: After a night's sleep, I looked at everything (and the comments) again. Here's some clarification and what I've done since.
Sample data (mydata) (0 = no, 1 = yes)
Participant HighCho Diabetes Obesity
1 1 1 0
2 0 1 1
3 1 0 0
4 0 0 0
5 0 1 0
I want my subset outcome to include only those with none of the three conditions or only one of the three:
Participant HighCho Diabetes Obesity
3 1 0 0
4 0 0 0
5 0 1 0
I've written:
new.data <- subset(mydata, (HighCho == 0 & Diabetes == 0 & Obesity == 0) | HighCho == 1 | Diabetes == 1 | Obesity == 1)
My problem is that even though I capture everyone who is free from all conditions, I still include people who have more than one condition. I thought with my "or" statement, I would only include those with only one of the three conditions (rather than two). Any insights as to what I might be doing incorrectly?
You can use the apply function to sum the number of conditions each participant has.
mydata[apply(mydata[, c('HighCho', 'Diabetes', 'Obesity')], 1, sum) %in% 0:1, ]
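The same idea can be written with rowSums(), which may read a little more easily. This is a sketch using the sample data from the question; n_conditions and only_one are names made up for the example.

```r
# Sample data from the question (0 = no, 1 = yes)
mydata <- data.frame(
  Participant = 1:5,
  HighCho  = c(1, 0, 1, 0, 0),
  Diabetes = c(1, 1, 0, 0, 1),
  Obesity  = c(0, 1, 0, 0, 0)
)

# Number of conditions per participant
n_conditions <- rowSums(mydata[, c("HighCho", "Diabetes", "Obesity")])

# None or exactly one condition
new.data <- mydata[n_conditions <= 1, ]

# Exactly one condition
only_one <- mydata[n_conditions == 1, ]
```

The original "or" statement keeps a row as soon as any single condition equals 1, which is also true for people with two or more conditions; counting the conditions avoids that.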

adonis function from vegan doesn't work

I've got a problem fighting one error. Here is the line I try to execute:
library(vegan)
adonis(data = dset, adiv ~ N+P+K)
It returns a failure message:
Error in rowSums(x, na.rm = TRUE) :
'x' must be an array of at least two dimensions
Everything seems to be alright with the dataset, because aov(data = dset, adiv ~ N+P+K) works just fine. I know that such errors appear when some functions drop data frame dimensions, but I don't know how to fix it in this case.
Edit. Adding a piece of my dataset.
treatment N P K M adiv
N 1 0 0 0 0.2059
P 0 1 0 0 0.20856
K 0 0 1 0 0.22935
O 0 0 0 0 0.10729
NP 1 1 0 0 0.30674
NK 1 0 1 0 0.30509
PK 0 1 1 0 0.30606
NPK+ 1 1 1 1 0.50389
NPK 1 1 1 0 0.40731
manure 0 0 0 1 0.2085
Before I try to execute adonis I convert treatment values into factors with:
dset$N <- as.factor(dset$N)
dset$P <- as.factor(dset$P)
dset$K <- as.factor(dset$K)
dset$M <- as.factor(dset$M)
Then I just try to execute the function and get the error.
As I've already mentioned, everything works just fine when I try aov() or lm().
This is guesswork, since there is nothing reproducible in your question. However, I can trigger a similar error if I use a univariate response: adonis is intended for multivariate responses and may not work with univariate ones. The adonis help page can be read with ?adonis, and it says that the left-hand side of the formula should be "either a dissimilarity object (inheriting from class "dist") or data frame or a matrix." Following this helps when I try it (but I really cannot reproduce your example): you could try a left-hand side of as.matrix(Nitrososphaeraceae) or dist(Nitrososphaeraceae).
The adonis function is really intended for multivariate responses, and using it with a univariate response needs care. You should also carefully consider the type of dissimilarity (or distance) you use with such models. For instance, the two alternatives above will give different results because they use different dissimilarity measures. I am not at all sure that it makes much sense to use distance-based methods like adonis with univariate responses.
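The error itself can be reproduced in base R, which may make the cause easier to see. The sketch below types in the data from the question and only demonstrates the dimension problem; it does not run vegan. The commented-out adonis2() line at the end is an assumption about how the call could look and has not been run here.

```r
# Data typed in from the question
dset <- data.frame(
  treatment = c("N", "P", "K", "O", "NP", "NK", "PK", "NPK+", "NPK", "manure"),
  N = c(1, 0, 0, 0, 1, 1, 0, 1, 1, 0),
  P = c(0, 1, 0, 0, 1, 0, 1, 1, 1, 0),
  K = c(0, 0, 1, 0, 0, 1, 1, 1, 1, 0),
  M = c(0, 0, 0, 0, 0, 0, 0, 1, 0, 1),
  adiv = c(0.2059, 0.20856, 0.22935, 0.10729, 0.30674,
           0.30509, 0.30606, 0.50389, 0.40731, 0.2085)
)

# A single column is a plain vector with no dimensions
has_no_dim <- is.null(dim(dset$adiv))

# rowSums() on such a vector fails with the message from the question
msg <- tryCatch(rowSums(dset$adiv, na.rm = TRUE),
                error = function(e) conditionMessage(e))

# Either of these gives an object of the kind the help page asks for
adiv_matrix <- as.matrix(dset$adiv)   # 10 x 1 matrix
adiv_dist   <- dist(dset$adiv)        # Euclidean distances, class "dist"

# Assumed usage, not run here:
# vegan::adonis2(adiv_dist ~ N + P + K, data = dset)
```

Whether a distance-based test is sensible for a single response variable is a separate question, as noted above.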

plotting variables of procrustes analysis in r?

I have performed non-metric multidimensional scaling (NMDS) on two data frames, each containing different variables but for the same sites. I am using the vegan package:
> head (ResponsesS3)
R1_S3 R10_S3 R11_S3 R12_S3 R2_S3 R3_S3 R4_S3 R6_S3 R7_S3 R8_S3 R9_S3
4 0 0 0 0 0 1 0 0 0 0 0
5 0 0 0 0 0 1 0 0 0 0 0
7 1 0 0 1 0 0 0 0 0 0 0
12 0 0 0 0 0 1 0 0 0 0 0
14 2 2 0 0 0 0 2 0 0 0 0
16 0 0 1 0 0 0 0 0 0 1 0
> head (EnvtS3)
Dep_Mark Dep_Work Dep_Ext Use_For Use_Fish Use_Ag Div_Prod
4 0.06222836 1.0852315 0.8367309 1.1415929 1.644670 0.1006964 0.566474
5 0.25946808 1.3342266 0.0000000 1.7123894 0.822335 0.0000000 0.283237
7 2.20668862 0.0000000 0.8769881 0.4280973 0.822335 0.5244603 0.849711
12 2.26323697 0.0000000 0.8090991 1.1415929 0.000000 1.4957609 1.416185
14 1.65107675 0.5195901 0.2921132 0.5707965 0.822335 1.7873609 0.849711
16 1.82230225 0.4760163 0.1915366 2.2831858 0.000000 1.6614904 0.849711
> ResponsesS3.mds = metaMDS (ResponsesS3, k =2, trymax = 100)
> EnvtS3.mds = metaMDS (EnvtS3, k =2, trymax = 100)
I fit the results using a procrustean superimposition
> pro.ResponsesS3.EnvtS3.mds <- procrustes(ResponsesS3.mds,EnvtS3.mds)
I am most interested in understanding how the variables from each dataset fit together. I would like to use the plot() function to return a graph of the variables from ResponsesS3 and from EnvtS3, rather than the sites (which is what the plot function returns by default).
Is this possible?
No, this is not possible. The problem you'll find you have is that there will be different numbers of variables in the two datasets which causes the procrustes() method to fail if you try procrustes(..., scores = "species").
Even if you fit with procrustes(..., scores = "sites") (the default), how do you propose to draw the plot if we could extract the species information? The current plot joins rows from one matrix with the rows of the other; this works in the default setting because the datasets are assumed to be measurements on the same locations/sites. But this is not possible with species/variables. More fundamentally, how should we pair up species with environmental variables?
Finally, you are trying to look at how the variables compare yet have used a method that essentially throws this information away once dissimilarities are computed.
I would look at the method of coinertia analysis, of which there is a crude interface in my cocorresp package and a fuller one in the ade4 package. If you find yourself wanting to compare two sets of species data, try cocorrespondence analysis, which cocorresp fits.
Like Gav said, the points must match each other one to one for Procrustes rotation. However, once you have a Procrustes rotation, you can naturally apply it to other matrices with the same number of columns. The number of columns is crucial: if you have a 2-dimensional NMDS, your variables must also be mapped into these 2 dimensions. Function metaMDS() will give you such column scores corresponding to your ordination of row scores, but I don't know how adequate these are in your case. The easiest way to rotate those scores in vegan is to use the predict method with newdata. Continuing with your example:
predict(pro.ResponsesS3.EnvtS3.mds, newdata=scores(EnvtS3.mds, "species"))
This will only rotate your column scores ("species") in the same way as your row scores were rotated.
We do not know what you are trying to achieve, and indeed there may be better ways to achieve your goal (check Gavin's answer for a start). However, this will do the rotation.
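To show what "applying the rotation to another matrix" means, here is a small base-R sketch. It is not vegan's implementation: it leaves out scaling and translation, and all matrices are made up for the illustration. It only relies on the standard result that the orthogonal Procrustes rotation can be obtained from a singular value decomposition.

```r
set.seed(1)

# Target configuration: 10 points in 2 dimensions, centred on the origin
X <- scale(matrix(rnorm(20), ncol = 2), scale = FALSE)

# Second configuration: the same points turned by a known rotation
theta <- pi / 6
R <- matrix(c(cos(theta), sin(theta), -sin(theta), cos(theta)), ncol = 2)
Y <- X %*% R

# Orthogonal Procrustes rotation that maps Y back onto X
s <- svd(crossprod(Y, X))
A <- s$u %*% t(s$v)

# The same rotation applied to another 2-column matrix,
# e.g. hypothetical column ("species") scores
new_scores <- matrix(c(1, 0,
                       0, 1,
                       1, 1), ncol = 2, byrow = TRUE)
rotated_new <- new_scores %*% A
```

The new matrix needs the same number of columns as the ordination, but it can have any number of rows, which is why column scores can be rotated even though they cannot be matched one to one.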
