"Factor codes of type double or float detected" when using read.dta13 - r

I am using read.dta13 from the readstata13 package to load data. There are a bunch of categorical variables with Stata value labels in the data set. The data set looks like this in Stata:
cohort year age gender income health migration
1101 2010 35 F 13034 healthy yes
1102 2010 54 M 34134 unhealthy no
For gender, health and migration, the original values are numeric, for example, gender = 1 for male. In Stata, for the convenience of understanding, I add value labels for categorical variables using label define, so it shows as above. But the original values are kept. Now let's go to R. If I simply type
mydata <- read.dta13("mydata_stata13.dta")
I get a lot of warnings like these
Factor codes of type double or float detected - no labels assigned.
Set option nonint.factors to TRUE to assign labels anyway.
All the value labels I added in Stata are dropped, which is what I need in R. The problem is that R gives the warning even for some variables that should be treated as numeric, for example income. I don't want to set nonint.factors = TRUE since I need the numeric values of the categorical variables for the calculation.
It's not actually an error, but I would like to know whether it is safe to just ignore the warnings.

As the warning states, there are doubles or floats with value labels assigned. I assume this is because you created a categorical variable without telling Stata to store it as a byte. readstata13 gives you a warning because it cannot tell whether floats/doubles with value labels are categorical or continuous variables.
Let's say gender is the wrongly stored variable. I assume the person who coded the variables in Stata created it as:
gen gender = *expr*
instead of
gen byte gender = *expr*
This can be solved either by always creating categorical variables with gen byte or by using compress (see Stata's manual) before saving/exporting the whole dataset. You can detect which variables are wrongly coded by running describe and checking for value labels assigned to non-byte variables. This will in turn store your data more efficiently.
In addition, I assume that for some reason the same person accidentally added a value label to a "true" float variable, like income, at some point. See the labelbook command to correct such problems.
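The same storage distinction can be illustrated on the R side without Stata. This is a minimal base-R sketch, not readstata13's own code; the code 2 for female and the labels "M"/"F" are assumptions made for illustration (the question only states that 1 means male).

```r
# Hypothetical numeric codes as they might arrive from a float-typed Stata variable.
# Assumption: 1 = male (from the question), 2 = female (made up for this sketch).
gender_code <- c(2, 1)
typeof(gender_code)   # whole numbers, but stored as double

# Converting to integer storage is roughly the R analogue of `gen byte` / `compress`.
gender_int <- as.integer(gender_code)
typeof(gender_int)

# The numeric codes stay available for calculations;
# the labels live in a separate factor built from those codes.
gender_lab <- factor(gender_int, levels = c(1L, 2L), labels = c("M", "F"))
gender_lab
```

Keeping the codes and the labelled factor as separate objects means the calculation can use the numbers while tables and plots use the labels.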

Related

How to change a class variable into an ordered variable in R?

Hey guys I am trying to calculate the p value of individual variables to see if they have an impact when the other variable is set to 0. Here is my code:
quiet_result = aov(overbearing ~ as.factor(Intention)*as.factor(quiet_only), data=df)
summary(quiet_result)
loud_result = aov(overbearing ~ as.factor(Intention)*as.factor(loud_only), data = df)
summary(loud_result)
For context, the intention variable only has the values of -1 and 1. -1 is intentional and 1 is intentional. Quiet_only and loud_only are new columns created from a data set. quiet_only only has the values of 0 and 2 and it is the original column of sound + 1, and loud_only only has the values of -2 and 0 because it is only the original column of sound - 1. Therefore these are all ordered variables and they are not supposed to be assessed by their actual numerical value like a class variable. However, my code keeps reading it as a class variable even though I changed all the variables to factors to make them ordered variables. Therefore, when I run anova on them, they all return the same result. I am wondering how I can change the variables to make them into ordered variables because the anova is only reading the change between the intention and quiet_only/loud_only columns, which would obviously return the same anovas because there is no actual change if you subtract or add 1 to a column. Therefore, I'm trying to find the p value of the intention variable with loud_only and quiet_only and this p value should change depending on whether I use loud_only or quiet_only.
Sorry if this doesn't make any sense lol. This is research work for a graduate professor so it uses concept that I don't fully understand (I'm undergrad) so I don't think I explained it very well. Anyways, if any of you have any ideas that would be great.
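One way to see why both models return the same table: as.factor() only records which distinct value each row has, so adding or subtracting a constant does not change the coding. The following is a small base-R sketch with made-up values for the sound column (the data are an assumption, not taken from the question).

```r
# Hypothetical stand-in for the original sound column (values -1 and 1).
sound <- c(-1, 1, 1, -1)
quiet_only <- sound + 1   # values 0 and 2
loud_only  <- sound - 1   # values -2 and 0

f_quiet <- as.factor(quiet_only)
f_loud  <- as.factor(loud_only)

# The level labels differ, but the underlying codes are the same,
# so a model using either factor sees the same grouping.
same_codes <- identical(as.integer(f_quiet), as.integer(f_loud))
same_codes

# An ordered factor has to be requested explicitly; as.factor() is unordered.
o_quiet <- factor(quiet_only, levels = c(0, 2), ordered = TRUE)
is.ordered(f_quiet)
is.ordered(o_quiet)
```

If the numeric spacing is supposed to matter, the columns would have to enter the model as numeric predictors rather than as factors.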

How to discretize a variable with only 2 distinct values?

I am trying to discretize the variable- DEATH, into two bins.
DEATH can only be a value of 0 or 1
The command I am using is as follows:
# to convert DEATH to a factor variable using unsupervised discretization with equal frequency binning
burn$DEATH<-discretize(burn$DEATH, method="interval", breaks=2)
summary(burn$DEATH)
However, my output is the entire range of values. I would like to show the individual count for 0 and 1.
My current output:
summary(burn$DEATH)
[0,1]
1000
I think the user specified method would be the solution but when I tried this, I received an error stating that 'x must be numeric'
burn$FACILITY <- discretize(burn$FACILITY, method="fixed", breaks=c(-Inf,0, 1, Inf))
Additional note: This is for a class so I'm assuming they wouldn't want us to use a method that we haven't discussed yet. I'd prefer to use a discretization method if possible! Someone suggested I use the factor() command, but how do I see the summary statistics with the levels if I do this?
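For a variable that only takes the values 0 and 1, the counts per level can be shown with base R alone. This is a sketch on a hypothetical vector standing in for burn$DEATH (the values are made up); cut() is included as one possible base-R way of binning with user-specified breaks.

```r
# Hypothetical stand-in for burn$DEATH.
death <- c(0, 1, 1, 0, 0)

# factor() keeps one level per distinct value; summary() then reports counts per level.
death_f <- factor(death, levels = c(0, 1))
summary(death_f)
table(death_f)

# Binning with explicit breaks: (-Inf, 0] gets label "0", (0, Inf] gets label "1".
death_cut <- cut(death, breaks = c(-Inf, 0, Inf), labels = c("0", "1"))
summary(death_cut)
```

Both approaches need a numeric input for the binning step; if a column is already a factor or character, it would have to be converted first.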

What are the practical differences between 'factor' and 'string' data types in R?

From other programming languages I am familiar with the string data type. In addition to this data type, R also has the factor data type. I am new to the R language, so I am trying to wrap my head around the intent behind this new data type.
Question: What are the practical differences between 'factor' and 'string' data types in R?
I get that (on a conceptual/philosophical level) the factor data type is supposed to encode the values of a categorical random variable, but I do not understand (on a practical level) why the string data type would be insufficient for this purpose.
Seemingly having duplicate data types which serve the same practical purpose would be bad design. However, if R were truly poorly designed on such a fundamental level, it would be much less likely to have achieved the level of popularity it has. So either a very improbable event has happened, or I am misunderstanding the practical significance/purpose of the factor data type.
Attempt: The one thing I could think of is the concept of "factor levels", whereby one can assign an ordering to factors (which one can't do for strings), which is helpful when describing "ordinal categorical variables", i.e. categorical variables with an order (e.g. "Low", "Medium", "High").
(Although even this wouldn't seem to make factors strictly necessary. Since the ordering is always linear, i.e. no true partial orders, on countable sets, we could always just accomplish the same with a map from some subset of the integers to the strings in question -- however in practice that would probably be a pain to implement over and over again, and a naive implementation would probably not be as efficient as the implementation of factors and factor levels built into R.)
However, not all categorical variables are ordinal, some are "nominal" (i.e. have no order). And yet "factors" and "factor levels" still seem to be used with these "nominal categorical variables". Why is this? I.e. what is the practical benefit to using factors instead of strings for such variables?
The only other information I could find on this subject is the following quote here:
Furthermore, storing string variables as factor variables is a more efficient use of memory.
What is the reason for this? Is this only true for "ordinal categorical variables", or is it also true for "nominal categorical variables"?
Related but different questions: These questions seem relevant, but don't specifically address the heart of my question -- namely, the difference between factors and strings, and why having such a difference is useful (from a programming perspective, not a statistical one).
Difference between ordered and unordered factor variables in R
Factors ordered vs. levels
Is there an advantage to ordering a categorical variable?
factor() command in R is for categorical variables with hierarchy level only?
Practical differences:
If x is a string, it can take any value. If x is a factor, it can only take values from the list of its levels. That makes these variables more memory efficient as well.
example:
> x <- factor(c("cat1","cat1","cat2"),levels = c("cat1","cat2") )
> x
[1] cat1 cat1 cat2
Levels: cat1 cat2
> x[3] <- "cat3"
Warning message:
In `[<-.factor`(`*tmp*`, 3, value = "cat3") :
invalid factor level, NA generated
> x
[1] cat1 cat1 <NA>
Levels: cat1 cat2
As you said, you can have ordinal factors, meaning that you can add extra information about your variable, for instance that level1 < level2 < level3. Characters don't have that. However, the order doesn't necessarily have to be linear; not sure where you found that.
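A short base-R sketch of what these differences look like in practice: a factor is stored as integer codes plus a levels attribute, and an ordered factor supports comparisons that plain strings only do alphabetically. The "Low"/"Medium"/"High" values come from the question; the vector itself is made up.

```r
x <- factor(c("cat1", "cat1", "cat2"), levels = c("cat1", "cat2"))

# Internally a factor is an integer vector of codes plus a character vector of levels.
typeof(x)
as.integer(x)
levels(x)

# Ordered factor: comparisons and sorting follow the level order.
size <- factor(c("Low", "High", "Medium"),
               levels = c("Low", "Medium", "High"), ordered = TRUE)
size[1] < size[2]
sorted_factor <- as.character(sort(size))

# Plain strings sort alphabetically instead.
sorted_chars <- sort(c("Low", "High", "Medium"))
```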

Is there anything like numerical variable with labels?

I have a numerical variable with discrete levels, that have a special meaning for me, e.g.
-1 'less than zero'
0 'zero'
1 'more than zero'
I know, that I can convert the variable as factor/ordinal and keep the labels, but then the numerical representation of the variable would be
1 'less than zero'
2 'zero'
3 'more than zero'
which is useless for me. I cannot afford having two copies of the variable, because of memory constraints (it is a very big data.table).
Is there any standard way of adding text labels to certain levels of the numerical (possibly integer) variable, so that I can get a nice looking frequency tables just like if it was a factor, and simultaneously being able to treat it as the source numerical variable with values untouched?
I'm going to say the answer to your questions is "no". There's no standard or built-in way of doing what you want.
Because, as you note, factors have positive non-zero integer codes, and integers can't be denoted by label strings in a vector. Not in a "standard" way anyway.
So you will have to do the labelling yourself, in whatever outputs you want to present, manually.
Any tricks like keeping your data (once) as a factor and subtracting a number to get the negative values you need (presumably for your analysis) will make a copy of that data. Keep the numbers, do the analysis, then do replacement with the results (which I presume are tables and plots and so aren't as big as the data).
R also doesn't have an equivalent to the "enumerated type" of many languages, which is one way this can be done.
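The "keep the numbers, relabel the results" idea can be sketched in base R as follows. The data vector and the name value_labels are hypothetical; only the small frequency table is relabelled, so the large numeric column is never copied into a factor.

```r
# Hypothetical numeric variable; in practice this would be the big column.
x <- c(-1, 0, 1, 1, -1, 1)

# Lookup from the numeric value (as text) to its label.
value_labels <- c("-1" = "less than zero",
                  "0"  = "zero",
                  "1"  = "more than zero")

# Tabulate the raw numbers, then relabel only the (small) result.
counts <- table(x)
labelled_counts <- setNames(as.vector(counts),
                            unname(value_labels[names(counts)]))
labelled_counts

# The original variable is untouched and still usable as a number.
mean(x)
```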
You could use a vector. Would that work?
var <- c(-1,0,1)
names(var) <- c("less than zero", "zero", "more than zero")
that would give you
> var
less than zero           zero more than zero
            -1              0              1
Hope that helps,
Umberto

Run nested logit regression in R

I want to run a nested logistic regression in R, but the examples I found online didn't help much. I read over an example from this website (Step by step procedure on how to run nested logistic regression in R) which is similar to my problem, but I found that it seems not resolved in the end (The questioner reported errors and I didn't see more answers).
So I have 9 predictors (continuous scores), and 1 categorical dependent variable (DV). The DV is called "effect", and it can be divided into 2 general categories: "negative (0)" and "positive (1)". I know how to run a simple binary logit regression (using the general grouping way, i.e., negative (0) and positive (1)), but this is not enough. "positive" can be further grouped into two types: "physical (1)" and "mental (2)". So I want to run a nested model which includes these 3 categories (negative (0), physical (1), and mental (2)), and reflects the nature that "physical" and "mental" are nested in "positive". Maybe R can compare these two models (general vs. detailed) together? So I created two new columns, one is called "effect general", in which the individual scores are "negative (0)" and "positive (1)"; the other is called "effect detailed", which contains 3 values - negative (0), physical (1), and mental (2). I ran a simple binary logit regression only using "effect general", but I don't know how to run a nested logit model for "effect detailed".
From the example I searched and other materials, the R package "mlogit" seems right, but I'm stuck with how to make it work for my data. I don't quite understand the examples in R-help, and this part in the example from this website I mentioned earlier (...shape='long', alt.var='town.list', nests=list(town.list)...) makes me very confused: I can see that my data shape should be 'wide', but I have no idea what "alt.var" and "nests" are...
I also looked at page 19 of the mlogit manual for examples of nested logit model calls. But I still cannot decide what I need in terms of options. (http://cran.r-project.org/web/packages/mlogit/mlogit.pdf)
Could someone provide me with detailed steps and notes on how to do it? I'm sure this example (if well discussed and resolved) is also going to help me and others a lot!
Thanks for your help!!!
I can help you with understanding the mlogit structure. When using the mlogit.data() command, specify choice = yourchoicevariable (and id.var = respondentid if you have a panel dataset, i.e. you have multiple responses from the same individual), along with the shape='wide' argument. The new data.frame created will be in long format, with a line for each alternative in each choice situation: negative, physical, mental. So you will have 3 rows where you only had one in the wide data format. Whatever your MN choice var is, it will now be a column of logical values, with TRUE for the row that the respondent chose. The row names will now be in the format observation#.level(choice variable). So in your case, if in the first row of your dataset the person had a response of negative, you would see:
row.name | choice
1.negative | TRUE
1.physical | FALSE
1.mental | FALSE
Also note that the actual factor level for each choice is stored in an index called alt of the mlogit.data.frame, which you can see with index(your.data.frame), and the observation number (i.e. the row number from your wide format data.frame) is stored in chid. This is in essence what the row.name is telling you, i.e. chid.alt. Also note you DO NOT have to specify alt.var if your data is in wide format, only in long format. The mlogit.data function does that for you as I have just described. Essentially, it takes unique(choice) when you specify your choice variable and creates the alt.var for you, so it is redundant if your data is in wide format.
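To make the wide-to-long layout concrete, here is a base-R sketch that builds the same kind of table by hand. It only illustrates the shape described above and is not how mlogit implements it; the two-row data frame and the column names are made up.

```r
# Hypothetical wide data: one row per respondent, one chosen alternative each.
wide <- data.frame(
  id = 1:2,
  effect = factor(c("negative", "mental"),
                  levels = c("negative", "physical", "mental"))
)

alts <- levels(wide$effect)

# One row per (observation, alternative) pair; alt varies fastest.
long <- expand.grid(alt = alts, chid = seq_len(nrow(wide)),
                    stringsAsFactors = FALSE)

# TRUE on the row of the alternative that was actually chosen.
long$choice <- long$alt == as.character(wide$effect[long$chid])

# Row names in the chid.alt format described above.
rownames(long) <- paste(long$chid, long$alt, sep = ".")
long
```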
You then specify the nests by adding to the mlogit() command a named list of the nests like this, assuming your factor levels are just '0','1','2':
mlogit(..., nests = list(negative = c('0'), positive = c('1','2')))
or if the factor levels were 'negative','physical','mental' it would be like this:
mlogit(..., nests = list(negative = c('negative'), positive = c('physical','mental')))
Also note a nest of one still MUST be specified with a c() argument per the package documentation. The resulting model will then have the iv estimate between nests if you specify the un.nest.el=T argument, or nest specific estimates if un.nest.el=F
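Since nests is described as a named list, it can help to check the object in base R before passing it on. The sketch below is independent of mlogit and assumes the factor levels are '0', '1', '2'; it shows that list() keeps one entry per nest, whereas wrapping everything in c() would flatten the nests into a single character vector.

```r
# Flattening with c(): one element per alternative, nest names get mangled.
flat <- c(negative = c("0"), positive = c("1", "2"))
names(flat)

# A named list keeps one character vector per nest.
nests <- list(negative = c("0"), positive = c("1", "2"))
names(nests)
lengths(nests)

# Sanity check: every alternative appears in exactly one nest.
alts <- c("0", "1", "2")
covers_all  <- setequal(unlist(nests), alts)
no_overlaps <- anyDuplicated(unlist(nests)) == 0
```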
You may find Kenneth Train's Examples useful
