Displaying of factor levels and labels in R

I am having an issue with displaying the correct grouping of a factor variable after using MICE. I believe this is an R thing, but I included it with mice just to be sure.
So, I run my mice algorithm; here is a snippet showing how I format the variable before calling mice. Note that I want it to be 0 for no drug and 1 for yes drug, so I coerce it to a factor with levels 0 and 1 before I run it:
mydat$drug=factor(mydat$drug,levels=c(0,1),labels=c(0,1))
I then run mice and it runs logistic regression (this is the default) on drug, along with my other variables to be imputed.
I can extract the results of one of the imputations when it is complete by
drug=complete(imp,1)$drug
We can view it
> head(drug)
[1] 0 0 1 0 1 1
attr(,"contrasts")
  2
0 0
1 1
Levels: 0 1
So the data is certainly 0,1.
However, when I do something with it, like cbind, it changes to 1's and 2's
> head(cbind(drug))
drug
[1,] 1
[2,] 1
[3,] 2
[4,] 1
[5,] 2
[6,] 2
Even when I coerce it to a numeric
> head(as.numeric(drug))
[1] 1 1 2 1 2 2
I want to say it has something to do with the contrasts, but when I delete the contrast by doing
attr(drug,"contrasts")=NULL
It still shows up as 1's and 2's when it is passed to and printed by other functions.
I am able to get it to print correctly by using I()
> head(I(drug))
[1] 0 0 1 0 1 1
Levels: 0 1
So, I believe that this is an R issue, but I don't know how to remedy it. Is using I() the correct solution, or is it just a workaround that happens to work here? What is actually happening behind the scenes that is making the output display as 1's and 2's?
Thanks

Factors start with the first level being represented internally by 1.
Your two options:
1) Adjust for 1-based index of levels:
as.numeric(drug) - 1
2) Take the labels of the factors and convert to numeric:
as.numeric(as.character(drug))
Some people will point you in the direction of the faster option that does the same thing:
as.numeric(levels(drug))[drug]
I'd also consider using logical values instead of factor in the first place.
mydat$drug = as.logical(mydat$drug)
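As a quick check of the options above, here is a minimal sketch on a made-up vector. The values mirror the head(drug) output in the question; the vector itself is an assumption and nothing here depends on mice.

```r
# Hypothetical stand-in for the imputed variable from the question.
drug <- factor(c(0, 0, 1, 0, 1, 1), levels = c(0, 1), labels = c(0, 1))

codes     <- as.integer(drug)                 # internal codes, 1-based
by_offset <- as.numeric(drug) - 1             # option 1: shift the codes
by_label  <- as.numeric(as.character(drug))   # option 2: parse the labels
by_levels <- as.numeric(levels(drug))[drug]   # same result via the levels

# A logical vector built from the original numeric 0/1 data avoids factors.
as_flag <- as.logical(c(0, 0, 1, 0, 1, 1))
```

Option 1 only gives the right answer when the levels are exactly 0 and 1 in that order. Option 2 reads the labels themselves, so it also works for labels such as 5 and 10.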

The 0s and 1s are the names of your levels. The underlying integer corresponding to the names is 1 and 2. You can see with str,
str(drug)
# Factor w/ 2 levels "0","1": 2 2 2 2 2 2 1 1 2 2
When you coerce the factor to numeric, you drop the names and get the integer representation.
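To make the names-versus-integers distinction concrete, the sketch below takes a small made-up factor apart and shows why cbind() ends up with 1's and 2's. The example vector is an assumption, not the asker's data.

```r
# Hypothetical example factor with levels "0" and "1".
drug <- factor(c(0, 0, 1, 0, 1, 1), levels = c(0, 1))

labels_seen <- levels(drug)     # the names of the levels
codes_seen  <- unclass(drug)    # the integer codes, with a "levels" attribute

# cbind() on a bare factor discards the class and keeps only the codes.
m_codes <- cbind(drug)

# Converting first keeps the 0/1 values in the matrix.
m_values <- cbind(drug = as.numeric(as.character(drug)))

# A data frame keeps the factor intact, labels included.
df <- data.frame(drug)
```

If the goal is only to carry the variable alongside other columns, a data frame is usually the safer container, since a matrix cannot hold a factor.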

This is how R encodes factors. The underlying numeric representation of a factor always starts with 1, as you can see with the following two examples:
as.numeric(factor(c(0,1)))
as.numeric(factor(c("A","B")))
Not sure about the specifics about how MICE works, but if it requires a factor instead of a simple 0/1 numeric variable to use logistic regression, you can always hack the results with something like the following:
as.numeric(as.character(factor(c(0,1))))
or in your specific case
drug <- as.numeric(as.character(drug))

Related

R 'memisc' package: why has "as.data.frame()" changed 0/1 values of data.set to 1/2 in data.frame?

I'm trying to prepare an SPSS .sav data file with survey data for performing analyses in R.
Now I have an issue that some variables with binary values 0/1 (signifying no/yes) have been transformed unexpectedly.
I have used the memisc package to import the data as a data.set object.
Dset.core <- spss.system.file(file="C://..../data_coded.sav",
varlab.file=NULL,
codes.file=NULL,
missval.file=NULL,
count.cases=TRUE,
to.lower=FALSE
)
This worked all fine, from what I saw from str() and codebook() outputs. One example of a 0/1 variable $AMEVYES (labels are 0=no, 1=yes) is shown here:
str(Dset.core)
Data set with 1999 obs. of 106 variables:
(...)
$ AMEVYES : Nmnl. item w/ 2 labels for 0,1 num 0 0 0 0 0 0 0 0 0 1 ...
I now want to convert the special data.set object created by memisc into a data frame with:
Dset2Df.core <- as.data.frame(Dset.core)
As intended, the nominal 0/1 variable was changed into a factor variable with corresponding levels. But for some strange reason, this procedure also changed the values of the variables, from 0/1 to 1/2, like in this example output:
str(Dset2Df.core)
'data.frame': 1999 obs. of 106 variables:
(...)
$ AMEVYES : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 1 1 1 2 ...
Why did this happen, and most importantly, how can I stop this from happening?
Many thanks for a hint!
PS: I'm rather new to R and new to this forum, so please excuse if I missed any best practices when formulating my question.
As The Carpentries states:
Factors are stored as integers, and have labels associated with these
unique integers. While factors look (and often behave) like character
vectors, they are actually integers under the hood, and you need to be
careful when treating them like strings.
Factors are internally stored as integers starting from 1. You cannot change these internally stored values. You can, however, change their labels ("Yes", "No") or (0, 1).
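If the 0/1 values are still needed after the conversion, they can be rebuilt from the factor. The following is a minimal base R sketch on a made-up vector; it does not use memisc, and the level names "No" and "Yes" are taken from the str() output above.

```r
# Hypothetical factor shaped like AMEVYES after as.data.frame().
answers <- factor(c("No", "No", "Yes", "No"), levels = c("No", "Yes"))

stored <- as.integer(answers)              # internal codes start at 1

# Rebuild the 0/1 coding from the labels rather than from the codes.
zero_one <- as.integer(answers == "Yes")

# Relabelling changes what is displayed, not the stored codes.
relabelled <- factor(answers, levels = c("No", "Yes"), labels = c(0, 1))
```

Comparing against the label ("Yes") does not depend on the order of the levels, which makes it a little more robust than subtracting 1 from the codes.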

Like dcast but without sum of data

I have data organized for the R survival package, but want to export it to work in Graphpad Prism, which uses a different structure.
#Example data
Treatment<-c("A","A","A","A","A","B","B","B","B","B")
Time<-c(3,4,5,5,5,1,2,2,3,5)
Status<-c(1,1,0,0,0,1,1,1,1,1)
df<-data.frame(Treatment,Time,Status)
The R survival package data structure looks like this
Treatment Time Status
A 3 1
A 4 1
A 5 0
A 5 0
A 5 0
B 1 1
B 2 1
B 2 1
B 3 1
B 5 1
The output I need organizes each treatment as one column, and then sorts by time. Each individual is then recorded as a 1 or 0 according to its Status. The output should look like this:
Time  A  B
1        1
2        1
2        1
3     1  1
4     1
5     0  1
5     0
5     0
5 0
dcast() does something similar to what I want, but it sums up the Status values and merges them into one cell for all individuals with matching Time values.
Thanks for any help!
I ran into a weird issue when trying to implement Sotos' code to my actual data. I got the error:
Error in Math.factor(var) : ‘abs’ not meaningful for factors
Which is weird, because Sotos' code works for the example. When I checked the example data frame using sapply() it gave me the result:
> sapply(df,class)
Treatment Time Status
"factor" "numeric" "numeric"
My issue, as far as I could tell, was that my Status variable was read as numeric in my example but as integer in my real data:
> sapply(df,class)
Treatment Time Status
"factor" "numeric" "integer"
I loaded my data from a .csv, so maybe that's what caused the change in the variable's class. I ended up converting my Status variable with as.numeric() and then re-generating the data frame.
Status<-as.numeric(df$Status)
df<-data.frame(Treatment, Time, Status)
And was able to apply Sotos' code to the new dataframe.
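For reference, the layout asked for in the question can be produced in base R along the following lines. This is only a sketch under my own assumptions, not Sotos' code (which is not reproduced here); the helper column idx is a hypothetical name used to number repeated Time values within each treatment so that they are not merged.

```r
# Example data from the question.
Treatment <- c("A","A","A","A","A","B","B","B","B","B")
Time      <- c(3,4,5,5,5,1,2,2,3,5)
Status    <- c(1,1,0,0,0,1,1,1,1,1)
df <- data.frame(Treatment, Time, Status)

# Number the individuals that share a Time within each Treatment.
df$idx <- ave(df$Time, df$Treatment, df$Time, FUN = seq_along)

# One small table per treatment, with Status renamed to the treatment label.
pieces <- lapply(split(df, df$Treatment), function(g) {
  out <- g[, c("Time", "idx", "Status")]
  names(out)[3] <- as.character(g$Treatment[1])
  out
})

# Full outer join on (Time, idx) keeps every individual on its own row.
wide <- Reduce(function(x, y) merge(x, y, by = c("Time", "idx"), all = TRUE),
               pieces)
wide <- wide[order(wide$Time, wide$idx), ]
wide$idx <- NULL
```

Empty cells come out as NA; writing the result with write.csv(wide, na = "") should leave them blank for import elsewhere.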

How to change the names of confidence levels per variable in linear regression

I computed the confidence intervals for each variable in a linear regression. I wanted to use the results for sorting the variables, so I kept the result set as a data frame. However, when I called str() on one of the columns I got an error (shown below). How can I store the result so that I can work with it?
df <- read.table(text = "target birds wolfs
1 9 7
1 8 4
0 2 8
1 2 3 3
0 1 2
1 7 1
0 1 5
1 9 7
1 8 7
0 2 7
0 2 3
1 6 3
0 1 1
0 3 9
0 1 1 ",header = TRUE)
model<-lm(target~birds+wolfs,data=df)
confint(model)
2.5 % 97.5 %
(Intercept) -0.23133823 0.36256052
birds 0.10102771 0.18768505
wolfs -0.09698902 0.00812353
s<-as.data.frame(confint(model))
str(s$2.5%)
Error: unexpected numeric constant in "str(s$2.5"
The expression behind the $ operator must be a valid R identifier. 2.5% isn’t a valid R identifier, but there’s a simple way of making it one: put it into backticks: `2.5%` [1]. In addition, you need to pay attention that the column name matches exactly (or at least its prefix does). In other words, you need to add a space before the %:
str(s$`2.5 %`)
In general, a$b is the same as a[['b']] (with some subtleties; refer to the documentation). So you can also write:
str(s[['2.5 %']])
Alternatively, you could provide different column names for the data.frame that are valid identifiers, by just assigning different column names. Beware of make.names though: it makes your strings into valid R names, but at the cost of mangling them in ways that are not always obvious. Relying on it risks confusing readers of the code, because previously undeclared identifiers suddenly appear in the code. In the same vein, you should always specify check.names = FALSE with data.frame, otherwise R once again mangles your column names.
[1] In fact, R also accepts single quotes here (s$'2.5 %'). However, I suggest you forget this immediately; it’s a historical accident of the R language, and treating identifiers and strings the same (especially since it’s done inconsistently) does more harm than good.
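A short sketch of both approaches, accessing the column as-is and renaming it, is given below. It uses the built-in mtcars data instead of the question's data frame, and the replacement names lower and upper are my own choice.

```r
# Fit any linear model; mtcars ships with R and stands in for the real data.
model <- lm(mpg ~ wt + hp, data = mtcars)
s <- as.data.frame(confint(model))

original_names <- names(s)       # column names as produced by confint()

# Access with backticks or with [[ ]]; both need the exact name.
lower_bt <- s$`2.5 %`
lower_br <- s[["2.5 %"]]

# Alternatively, assign names that are valid identifiers and sort by them.
names(s) <- c("lower", "upper")
s_sorted <- s[order(s$lower), ]
```

Renaming once, right after confint(), keeps later code free of backticks while still making it obvious where the names came from.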

Force model.matrix() in R to use a given set of levels

I need to convert a small number of categorical variables in a survey dataframe to dummy variables. The variables are grouped by type (e.g. food type), and within each type survey respondents ranked their 1st, 2nd and 3rd preferences. The list of choices available for each type is similar but not identical. My problem is that I want to force the superset of category choices to be dummy-coded in every case.
set.seed(1)
d<-data.frame(foodtype1rank1=sample(c('noodles','rice','cabbage','pork'),5,replace=T),
foodtype1rank2=sample(c('noodles','rice','cabbage','pork'),5,replace=T),
foodtype1rank3=sample(c('noodles','rice','cabbage','pork'),5,replace=T),
foodtype2rank1=sample(c('noodles','rice','cabbage','tuna'),5,replace=T),
foodtype2rank2=sample(c('noodles','rice','cabbage','tuna'),5,replace=T),
foodtype2rank3=sample(c('noodles','rice','cabbage','tuna'),5,replace=T),
foodtype3rank1=sample(c('noodles','rice','cabbage','pork','mackerel'),5,replace=T),
foodtype3rank2=sample(c('noodles','rice','cabbage','pork','mackerel'),5,replace=T),
foodtype3rank3=sample(c('noodles','rice','cabbage','pork','mackerel'),5,replace=T))
To recap, model.matrix() will create dummy variables for any individual variable:
model.matrix(~d[,1]-1)
d[, 1]cabbage d[, 1]noodles d[, 1]pork d[, 1]rice
1 0 0 0 1
2 0 0 0 1
3 1 0 0 0
4 0 0 1 0
5 0 1 0 0
Or via sapply() for all variables:
sapply(d,function(x) model.matrix(~x-1))
Naturally, model.matrix() will only consider the levels that are present in each factor separately. But I want to force the complete set of foodtypes to be included for each type: noodles, rice, cabbage, pork, tuna, mackerel. In this example that would generate 54 dummy variables (3 types x 3 ranks x 6 categories). I assume I would pass the complete set explicitly to model.matrix() in some way, but can't see how.
Finally, I know R models automatically dummy-code factors internally but I still need to do it, including for exporting outside R.
The best way to achieve this is to specify the full set of levels explicitly for each factor:
d$foodtype1rank1=factor(sample(c('noodles','rice','cabbage','pork'), 5, replace=T),
levels=c('noodles','rice','cabbage','pork','tuna','mackerel'))
When you know the data, this is always good practice.
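Applied to every column at once, the idea might look like the sketch below. The small data frame and the column naming scheme (column name, underscore, level) are assumptions made for illustration; the level set is the superset listed in the question.

```r
# Superset of categories that every variable should be coded against.
all_foods <- c("noodles", "rice", "cabbage", "pork", "tuna", "mackerel")

# Hypothetical miniature version of the survey data.
d <- data.frame(foodtype1rank1 = c("rice", "pork", "noodles"),
                foodtype2rank1 = c("tuna", "rice", "cabbage"))

# Give every column the same explicit levels, present in the data or not.
d[] <- lapply(d, factor, levels = all_foods)

# One dummy block per column; unused levels become all-zero columns.
dummy_list <- lapply(names(d), function(nm) {
  m <- model.matrix(~ d[[nm]] - 1)
  colnames(m) <- paste(nm, levels(d[[nm]]), sep = "_")
  m
})
dummies <- do.call(cbind, dummy_list)
```

Note that model.matrix() drops rows containing NA by default, so columns with missing answers would need separate handling before the blocks are bound together.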

ChoiceModelR - Hierarchical Bayes Multinomial Logit Model

I hope that some of you are a bit experienced with the R package ChoiceModelR by Sermas and Colias, to estimate a Hierarchical Bayes Multinomial Logit Model. Actually, I am quite a newbie on both R and Hierarchical Bayes. However, I tried to get some estimates by using the script provided by Sermas and Colias in the help file. I have a data set in the same structure as they use (ID, choice set, alternative, independent variables, and choice variable). I have four independent variables all of them binary coded as categorical variables, none of them restricted. I have eight choice sets with three alternatives within each set as well as one no-choice-option as fourth alternative. I tried the following script:
library (ChoiceModelR)
data <- read.delim("Z:/KLU/CSR/CBC/mp3_vio.txt")
xcoding=c(0,0,0,0)
mcmc = list(R = 10, use = 10)
options = list(none=FALSE, save=TRUE, keep=1)
attlevels=c(2,2,2,2)
c1=matrix(c(0,0,0,0),2,2)
c2=matrix(c(0,0,0,0),2,2)
c3=matrix(c(0,0,0,0),2,2)
c4=matrix(c(0,0,0,0),2,2)
constraints = list(c1, c2, c3, c4)
out = choicemodelr(data, xcoding, mcmc = mcmc, options = options, constraints = constraints)
and have got the following error message:
Error in 1:nalts[i] : result would be too long a vector
In addition: There were 50 or more warnings (use warnings() to see the first 50). The mentioned warnings are of the following:
In max(temp[temp[, 2] == j, 3]) : no non-missing arguments to max; returning -Inf
In max(temp[temp[, 2] == j, 3]) : no non-missing arguments to max; returning -Inf
Actually, I have no idea what went wrong, since I used the same data structure, although I have more independent variables, more choice sets, and more alternatives within a choice set. It would be fantastic if anybody could shed some light on this.
I know that this may not be helpful since you posted so long ago, but if it comes up again in the future, this could prove useful.
One of the most common reasons for this error (in my experience) has been that either the scenario variable or the alternative variable is not in ascending order within your data.
id  scenario  alt  x1  ...  y
1   1         1    4        1
1   1         2    1        0
1   3         1    4        2
1   3         2    5        0
2   1         4    3        1
2   1         5    1        0
2   2         1    4        2
2   2         2    3        0
This dataset will give you errors since the scenario and alternative variables must be ascending, and they must not skip any values. Just to fully reiterate what I mean, the scenario and alt variables must be reordered as follows in order to work:
id  scenario  alt  x1  ...  y
1   1         1    4        1
1   1         2    1        0
1   2         1    4        2
1   2         2    5        0
2   1         1    3        1
2   1         2    1        0
2   2         1    4        2
2   2         2    3        0
I work with ChoiceModelR quite frequently, and this is what has caused these errors for me in the past. If you have a github account, you can also post your data (or modified data) there if you end up wanting to have other users take a look.
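The renumbering described above can be done in base R before the data is passed on. This is a minimal sketch that assumes the columns are called id, scenario and alt as in the tables above; your file may use other names. It only renumbers the labels and does not call ChoiceModelR.

```r
# Hypothetical data with a skipped scenario (3) and skipped alternatives (4, 5).
dat <- data.frame(id       = c(1, 1, 1, 1, 2, 2, 2, 2),
                  scenario = c(1, 1, 3, 3, 1, 1, 2, 2),
                  alt      = c(1, 2, 1, 2, 4, 5, 1, 2),
                  x1       = c(4, 1, 4, 5, 3, 1, 4, 3))

# Sort so that scenario and alt ascend within each respondent.
dat <- dat[order(dat$id, dat$scenario, dat$alt), ]

# Renumber scenarios 1, 2, ... within each id, closing any gaps.
dat$scenario <- ave(dat$scenario, dat$id,
                    FUN = function(s) match(s, sort(unique(s))))

# Renumber alternatives 1, 2, ... within each id and scenario.
dat$alt <- ave(dat$alt, dat$id, dat$scenario, FUN = seq_along)
```

Renumbering alt this way is only appropriate when the alternative number is a position within the choice set. If it identifies a specific option (for example a fixed no-choice alternative), the original coding should be checked before it is overwritten.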
