What are the practical differences between 'factor' and 'string' data types in R?

From other programming languages I am familiar with the string data type. In addition to this data type, R also has the factor data type. I am new to the R language, so I am trying to wrap my head around the intent behind this new data type.
Question: What are the practical differences between 'factor' and 'string' data types in R?
I get that (on a conceptual/philosophical level) the factor data type is supposed to encode the values of a categorical random variable, but I do not understand (on a practical level) why the string data type would be insufficient for this purpose.
Seemingly having duplicate data types which serve the same practical purpose would be bad design. However, if R were truly poorly designed on such a fundamental level, it would be much less likely to have achieved the level of popularity it has. So either a very improbable event has happened, or I am misunderstanding the practical significance/purpose of the factor data type.
Attempt: The one thing I could think of is the concept of "factor levels", whereby one can assign an ordering to a factor's levels (which one can't do for strings), which is helpful when describing "ordinal categorical variables", i.e. categorical variables with an order (e.g. "Low", "Medium", "High").
(Although even this wouldn't seem to make factors strictly necessary: since the ordering is always linear, i.e. a total order rather than a true partial order, on a countable set, we could always accomplish the same thing with a map from some subset of the integers to the strings in question. In practice, though, that would be a pain to implement over and over again, and a naive implementation would probably not be as efficient as the implementation of factors and factor levels built into R.)
However, not all categorical variables are ordinal, some are "nominal" (i.e. have no order). And yet "factors" and "factor levels" still seem to be used with these "nominal categorical variables". Why is this? I.e. what is the practical benefit to using factors instead of strings for such variables?
The only other information I could find on this subject is the following quote here:
Furthermore, storing string variables as factor variables is a more efficient use of memory.
What is the reason for this? Is this only true for "ordinal categorical variables", or is it also true for "nominal categorical variables"?
Related but different questions: These questions seem relevant, but don't specifically address the heart of my question -- namely, the difference between factors and strings, and why having such a difference is useful (from a programming perspective, not a statistical one).
Difference between ordered and unordered factor variables in R
Factors ordered vs. levels
Is there an advantage to ordering a categorical variable?
factor() command in R is for categorical variables with hierarchy level only?

Practical differences:
If x is a string, it can take any value. If x is a factor, it can only take values from its list of levels. That also makes factor variables more memory efficient.
Example:
> x <- factor(c("cat1","cat1","cat2"),levels = c("cat1","cat2") )
> x
[1] cat1 cat1 cat2
Levels: cat1 cat2
> x[3] <- "cat3"
Warning message:
In `[<-.factor`(`*tmp*`, 3, value = "cat3") :
invalid factor level, NA generated
> x
[1] cat1 cat1 <NA>
Levels: cat1 cat2
As you said, you can have ordinal factors, meaning that you can add extra information about your variable, for instance that level1 < level2 < level3. Character vectors don't have that. However, the order doesn't necessarily have to be linear; I'm not sure where you found that.
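For illustration, a minimal sketch of an ordered factor (the level names here are made up):
> x <- factor(c("Low","High","Medium"), levels = c("Low","Medium","High"), ordered = TRUE)
> x
[1] Low    High   Medium
Levels: Low < Medium < High
> x < "High"
[1]  TRUE FALSE  TRUE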

Related

Pavlidis Template Matching (PTM, DataVisEasy) R function with 3 levels

I need to perform a correlation analysis on a data frame constructed as follows:
Rows: features --> gene variants related to different levels of severity of the disease we are studying, in the form of a Boolean matrix;
Columns: observations --> the list of patients.
The discriminant of my analysis is thus the severity, marked as follows:
A: less severe than expected
B: equal to what expected
C: more severe than expected
Suppose I have a lot more features than observations and I want to use the PTM function with a three-level annotation (i.e. A,B,C) as a match template. The function requires you to set the annotation.level.set.high parameter, but it's not clear to me how it works. For example, if I set annotation.level.set.high='A', does that mean I'm making a comparison between A vs B&C? So I can only do a comparison between two groups/classes even if I have multiple levels? Because my goal is to compare all levels with each other (i.e. A vs B vs C), but it is not clear to me how to achieve this comparison, if it is possible.
Thanks

R factor and level

Levels make sense (they are the unique values of the vector), but I can't get my head around what a factor is. It just seems to repeat the vector's values.
> factor(c(1,2,3,3,4,5,1))
[1] 1 2 3 3 4 5 1
Levels: 1 2 3 4 5
Can anyone explain what factor is supposed to do, or why would I used it?
I'm starting to wonder if factors are like a code table in a database, where the factor name is the code table name and the levels are the unique options in the code table?
A factor is stored as an integer vector of codes plus a lookup table of levels, rather than as a raw character vector. What does this imply? There are two major benefits.
Much smaller memory footprint. Consider a text file containing the phrase "New Jersey" 100,000 times over, encoded in ASCII. Now imagine if you just had to store the number 16 (in binary) 100,000 times, plus a small table indicating that 16 means "New Jersey". It's leaner and faster.
Especially for visualization and statistical analysis, we frequently test for values "across all categories" (think ANOVA, or what you would color a stacked barplot by). We can either repeatedly write code to collect the observed choices from a string vector, or we can simply create a new type of vector that tells you what the valid choices are. That is called a factor, and the valid choices are called levels.
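As a rough illustration of the memory point (exact sizes vary by R version, since R also caches repeated strings in a global pool):
x_chr <- rep("New Jersey", 100000)  # character vector: one pointer per element
x_fac <- factor(x_chr)              # factor: one integer code per element, plus a one-entry levels table
object.size(x_chr)
object.size(x_fac)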

"Factor codes of type double or float detected" when using read.dta13

I am using the read.dta13 function from the readstata13 package to load data. There are a bunch of categorical variables with Stata value labels in the data set. The data set looks like this in Stata:
cohort  year  age  gender  income  health     migration
1101    2010  35   F       13034   healthy    yes
1102    2010  54   M       34134   unhealthy  no
For gender, health and migration, the original values are numeric; for example, gender = 1 for male. In Stata, for ease of reading, I added value labels to the categorical variables using label define, so the data displays as above, but the original numeric values are kept. Now let's go to R. If I simply type
mydata <- read.dta13("mydata_stata13.dta")
I get a lot of warnings like these
Factor codes of type double or float detected - no labels assigned.
Set option nonint.factors to TRUE to assign labels anyway.
All the value labels I added in Stata will be dropped, which is what I need in R. The problem is that R gives these warnings even for some variables that should be taken as numeric, for example income. I don't want to set nonint.factors = TRUE since I need the numeric values of the categorical variables for the calculation.
It's not actually an error, but I would like to know whether it is safe to just ignore the warnings.
As the warning states, there are doubles or floats with value labels assigned. My guess is that the categorical variables were created without telling Stata to store them as bytes. readstata13 warns you because it cannot tell whether floats/doubles with value labels are categorical or continuous variables.
Let's say gender is the wrongly stored variable; I assume the person who coded the variables in Stata created it as:
gen gender = *expr*
instead of
gen byte gender = *expr*
This can be solved either by always prefixing categorical variables with gen byte, or by running compress (see Stata's manual) before saving/exporting the whole dataset. You can detect which variables are wrongly coded by running describe and checking for value label assignments on non-byte variables. This will in turn store your data efficiently.
In addition, I assume that at some point the same person accidentally added a value label to a "true" float variable, like income. Check the labelbook command to correct such problems.
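If you cannot fix the .dta file itself, one R-side option is to suppress factor conversion entirely so every labelled variable stays numeric. A minimal sketch, assuming the convert.factors argument of read.dta13 behaves in your version as documented:
library(readstata13)
# no value labels are turned into factors; everything stays numeric
mydata <- read.dta13("mydata_stata13.dta", convert.factors = FALSE)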

Is there anything like numerical variable with labels?

I have a numerical variable with discrete levels, that have a special meaning for me, e.g.
-1 'less than zero'
0 'zero'
1 'more than zero'
I know that I can convert the variable to a factor (possibly ordered) and keep the labels, but then the numerical representation of the variable would be
1 'less than zero'
2 'zero'
3 'more than zero'
which is useless for me. I cannot afford having two copies of the variable, because of memory constraints (it is a very big data.table).
Is there any standard way of adding text labels to certain levels of the numerical (possibly integer) variable, so that I can get a nice looking frequency tables just like if it was a factor, and simultaneously being able to treat it as the source numerical variable with values untouched?
I'm going to say the answer to your question is "no": there is no standard or built-in way of doing what you want.
That's because, as you note, factors have positive integer codes starting at 1, and the integers in a plain vector can't be denoted by label strings, not in a "standard" way anyway.
So you will have to do the labelling yourself, in whatever outputs you want to present.
Any trick like keeping your data (once) as a factor and subtracting a number to get the negative values you need (presumably for your analysis) will make a copy of that data. Keep the numbers, do the analysis, then do the replacement in the results (which I presume are tables and plots, and so aren't as big as the data).
R also doesn't have an equivalent to the "enumerated type" of many languages, which is one way this can be done.
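For example, a minimal sketch of labelling only the output, leaving the data untouched (the data and label vector here are made up):
x <- c(-1, 0, 1, 1, 0, -1, -1)  # the big numeric variable stays as-is
labs <- c(`-1` = "less than zero", `0` = "zero", `1` = "more than zero")
tab <- table(x)
names(tab) <- labs[names(tab)]  # relabel only the small summary table
tab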
You could use a named vector. Would that work?
var <- c(-1,0,1)
names(var) <- c("less than zero", "zero", "more than zero")
that would give you
> var
less than zero zero more than zero 
            -1    0              1 
Hope that helps,
Umberto

Run nested logit regression in R

I want to run a nested logistic regression in R, but the examples I found online didn't help much. I read through an example from this website (Step by step procedure on how to run nested logistic regression in R) which is similar to my problem, but it seems it was never resolved (the questioner reported errors and I didn't see any more answers).
So I have 9 predictors (continuous scores) and 1 categorical dependent variable (DV). The DV is called "effect", and it can be divided into 2 general categories: "negative (0)" and "positive (1)". I know how to run a simple binary logit regression (using the general grouping, i.e. negative (0) and positive (1)), but this is not enough. "Positive" can be further grouped into two types: "physical (1)" and "mental (2)". So I want to run a nested model which includes these 3 categories (negative (0), physical (1), and mental (2)) and reflects the fact that "physical" and "mental" are nested in "positive". Maybe R can compare these two models (general vs. detailed) together?
So I created two new columns: one is called "effect general", in which the individual scores are "negative (0)" and "positive (1)"; the other is called "effect detailed", which contains 3 values: negative (0), physical (1), and mental (2). I ran a simple binary logit regression using only "effect general", but I don't know how to run a nested logit model for "effect detailed".
From the example I searched and other materials, the R package "mlogit" seems right, but I'm stuck with how to make it work for my data. I don't quite understand the examples in R-help, and this part in the example from this website I mentioned earlier (...shape='long', alt.var='town.list', nests=list(town.list)...) makes me very confused: I can see that my data shape should be 'wide', but I have no idea what "alt.var" and "nests" are...
I also looked at page 19 of the mlogit manual for examples of nested logit model calls. But I still cannot decide what I need in terms of options. (http://cran.r-project.org/web/packages/mlogit/mlogit.pdf)
Could someone provide me with detailed steps and notes on how to do it? I'm sure this example (if well discussed and resolved) is also going to help me and others a lot!
Thanks for your help!!!
I can help you with understanding the mlogit structure. When using the mlogit.data() command, specify choice = yourchoicevariable (and id.var = respondentid if you have a panel dataset, i.e. multiple responses from the same individual), along with the shape='wide' argument. The new data.frame created will be in long format, with a line for each choice situation: negative, physical, mental. So you will have 3 rows where you had one in the wide data format. Whatever your MN choice variable is, it will now be a column of logical values, with TRUE for the row that the respondent chose. The row names will now be in the format observation#.level(choice variable). So if in the first row of your dataset the person had a response of negative, you would see:
row.name | choice
1.negative | TRUE
1.physical | FALSE
1.mental | FALSE
Also note that the actual factor level for each choice is stored in an index called alt of the mlogit data.frame, which you can see with index(your.data.frame), and the observation number (i.e. the row number from your wide-format data.frame) is stored in chid. This is in essence what the row.name is telling you, i.e. chid.alt. Also note you DO NOT have to specify alt.var if your data is in wide format, only long format; the mlogit.data function does that for you as I have just described. Essentially, it takes unique(choice) when you specify your choice variable and creates the alt.var for you, so it is redundant if your data is in wide format.
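A minimal sketch of that reshaping step (the data frame and column name are stand-ins for the asker's real data):
library(mlogit)
# wide data: one row per respondent, 'effect' holding the 3-level choice
long <- mlogit.data(mydata, choice = "effect", shape = "wide")
head(index(long))  # shows the chid/alt indexes described above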
You then specify the nests by adding to the mlogit() command a named list of the nests, like this, assuming your factor levels are just '0','1','2':
mlogit(..., nests = list(negative = c('0'), positive = c('1','2')))
or, if the factor levels were 'negative','physical','mental', like this:
mlogit(..., nests = list(negative = c('negative'), positive = c('physical','mental')))
Also note that a nest of one still MUST be specified with a c() argument, per the package documentation. The resulting model will then have a single inclusive-value (IV) estimate shared across nests if you specify the un.nest.el = TRUE argument, or nest-specific IV estimates if un.nest.el = FALSE.
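Putting it together, a hedged sketch of the full call (x1 and x2 stand in for the nine predictor scores, and the levels are assumed to be '0','1','2'):
fit <- mlogit(effect ~ 0 | x1 + x2,  # individual-specific predictors only
              data = long,
              nests = list(negative = c('0'), positive = c('1','2')),
              un.nest.el = TRUE)
summary(fit)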
You may find Kenneth Train's Examples useful
