Is there anything like a numerical variable with labels? - r

I have a numerical variable with discrete levels, that have a special meaning for me, e.g.
-1 'less than zero'
0 'zero'
1 'more than zero'
I know that I can convert the variable to a factor/ordered factor and keep the labels, but then the numerical representation of the variable would be
1 'less than zero'
2 'zero'
3 'more than zero'
which is useless for me. I cannot afford to keep two copies of the variable because of memory constraints (it is a very big data.table).
Is there any standard way of adding text labels to certain levels of a numerical (possibly integer) variable, so that I can get nice-looking frequency tables just as if it were a factor, while still being able to treat it as the original numerical variable with its values untouched?

I'm going to say the answer to your question is "no". There's no standard or built-in way of doing what you want.
As you note, factors have positive integer codes, and the values of an integer vector can't be tagged with label strings. Not in a "standard" way, anyway.
So you will have to do the labelling yourself, in whatever outputs you want to present.
Any trick like keeping your data (once) as a factor and subtracting a number to get the negative values you need (presumably for your analysis) will make a copy of that data. Keep the numbers, do the analysis, and then relabel the results (which I presume are tables and plots, and so aren't as big as the data).
R also doesn't have an equivalent to the "enumerated type" of many languages, which is one way this can be done.
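To make that concrete, here is a minimal sketch of relabelling the results rather than the data, assuming the big integer column is dt$x (both dt and x are hypothetical names):
x_labels <- c(`-1` = "less than zero", `0` = "zero", `1` = "more than zero")
tab <- table(dt$x)                   # the table is small, so relabelling it is cheap
names(tab) <- x_labels[names(tab)]   # label the output, not the data
tab
The big column itself stays an untouched integer vector; only the small summary object gets the text labels.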

You could use a vector. Would that work?
var <- c(-1,0,1)
names(var) <- c("less than zero", "zero", "more than zero")
that would give you
> var
less than zero zero more than zero
            -1    0              1
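If you need the labels elsewhere (say, to annotate values of a much longer vector), the same named vector can be inverted into a value-to-label lookup; a hedged sketch:
lookup <- setNames(names(var), var)   # names become "-1", "0", "1"; values become the labels
lookup[as.character(c(1, -1, 0))]     # "more than zero" "less than zero" "zero"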
Hope that helps,
Umberto

Related

How can I identify inconsistencies and outliers in a dataset in R

I have a big dataset with a lot of columns, most of them containing non-numeric values. I need to find inconsistencies in the data as well as outliers, and the part about finding inconsistencies would be easy if the dataset weren't so big (7032 rows, to be exact).
An inconsistency would be something like: an ID is supposed to be 4 letters and 4 numbers and I obtain something else (like 3 numbers and 2 letters); another example would be a number that should be 0 or 1 and I obtain a -1 or a 2.
Is there any function that I can use to obtain the inconsistencies in each column?
For the columns that don't have numeric values, I thought of using a regex to validate each row of a given column, but I didn't find information on how to do that.
For the outliers, I made a boxplot to see whether I could spot any, like this:
boxplot(dataset$column)
But the plot didn't give me any outliers. Should I be OK with the result I get from the plot, or should I try something else to see whether there really are outliers in the data?
For the specific examples you've given:
an ID must be four numbers and four letters:
!grepl("^[0-9]{4}-[[:alpha:]]{4}$", ID)
will be TRUE for inconsistent values (^ and $ mean beginning- and end-of-string respectively; {4} means "the previous pattern repeats exactly four times"; [0-9] means "any symbol between 0 and 9", i.e. any numeral; [[:alpha:]] means "any alphabetic character"; the literal - matches a hyphen, so drop it if your IDs don't contain one). If you only want uppercase letters you could use [A-Z] instead (assuming you are not working in some weird locale like Estonian).
If you need a numeric value to be 0 or 1, then !num_val %in% c(0,1) will work (this will work for any set of allowed values; you can use it for a specific set of allowed character values as well)
If you need a numeric value to be between a and b then !(a < num_val & num_val < b) ...
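Pulling those checks together, here is a hedged sketch for flagging the offending rows; it assumes dataset has columns named ID and flag (hypothetical names):
bad_id   <- !grepl("^[0-9]{4}-[[:alpha:]]{4}$", dataset$ID)   # malformed IDs
bad_flag <- !(dataset$flag %in% c(0, 1))                      # values outside the allowed set
dataset[bad_id | bad_flag, ]          # rows failing at least one check
colSums(cbind(bad_id, bad_flag))      # how many inconsistencies of each kind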

Does 0 play any important role in the as.numeric function when using factors in R

Hi guys :) I know this question has been asked before (here, for example), but I would like to ask whether 0 plays any important role when using the as.numeric function. For example, we have the following simple code
x2<-factor(c(2,2,0,2), label=c('Male','Female'))
as.numeric(x2) # knowing that this is not the appropriate command; as.numeric(levels(x2))[x2] would be more appropriate but returns NAs
this returns
[1] 2 2 1 2
Is 0 being replaced here by 1? Moreover,
unclass(x2)
seems to give the same thing as well:
[1] 2 2 1 2
attr(,"levels")
[1] "Male" "Female"
It might be simple, but I am trying to figure this out and it seems that I can't. Any help would be highly appreciated, as I am new to R.
0 has no special meaning for factor.
As commenters have pointed out, factor recodes the input vector to an integer vector (starting with 1) and slaps a name tag onto each integer (the levels).
In the simplest case, factor(c(2,2,0,2)), the function takes the unique values of the input vector, sorts them, and converts them to a character vector to use as the levels. I.e. the factor is internally represented as c(2,2,1,2), where 1 corresponds to level '0' and 2 to level '2'.
You then go further by giving the levels labels; by default these are identical to the levels. In your case, factor(c(2,2,0,2), labels=c('Male','Female')), the internal codes are still the same c(2,2,1,2), but the two levels now carry the labels Male (first level) and Female (second level).
We can also decide which levels should be used, as in factor(c(2,2,0,2), levels=c(2,0), labels=c('Male','Female')). Now we have been explicit about which input value gets which level and label.
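A small demonstration of those points (expected output noted in the comments):
x2 <- factor(c(2, 2, 0, 2), labels = c("Male", "Female"))
levels(x2)      # "Male" "Female": labels attached to the sorted unique values 0 and 2
as.integer(x2)  # 2 2 1 2: the internal codes, not the original 0/2 values
x3 <- factor(c(2, 2, 0, 2), levels = c(2, 0), labels = c("Male", "Female"))
as.integer(x3)  # 1 1 2 1: being explicit makes 2 the first level ("Male")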

R factor and level

Levels make sense; they are the unique values of the vector. But I can't get my head around what a factor is. It just seems to repeat the vector's values.
factor(c(1,2,3,3,4,5,1))
[1] 1 2 3 3 4 5 1
Levels: 1 2 3 4 5
Can anyone explain what a factor is supposed to do, or why I would use it?
I'm starting to wonder if factors are like a code table in a database, where the factor name is the code-table name and the levels are the unique options in the code table?
A factor is stored as an integer vector of codes plus a table of levels, rather than as a raw character vector. What does this imply? There are two major benefits.
Much smaller memory footprint. Consider a text file containing the phrase "New Jersey" 100,000 times over, encoded in ASCII. Now imagine if you just had to store the number 16 (in binary) 100,000 times, plus another small table indicating that 16 means "New Jersey". It's leaner and faster.
Especially for visualization and statistical analysis, frequently we test for values "across all categories" (think ANOVA or what you would color a stacked barplot by). We can either repeatedly encode all of our functions to stack up observed choices in a string vector or we can simply create a new type of vector which will tell you what the valid choices are. That is called a factor, and the valid choices are called levels.
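A rough illustration of the memory point (a sketch only; exact sizes vary by platform, and because R already caches identical strings the gap here is smaller than the text-file analogy suggests):
s <- rep("New Jersey", 1e5)   # 100,000 copies of the same string
f <- factor(s)                # 100,000 integer codes plus one level
object.size(s)                # roughly 800 kB on a 64-bit build (one pointer per element)
object.size(f)                # roughly 400 kB (4-byte codes plus the single level)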

What are the practical differences between 'factor' and 'string' data types in R?

From other programming languages I am familiar with the string data type. In addition to this data type, R also has the factor data type. I am new to the R language, so I am trying to wrap my head around the intent behind this new data type.
Question: What are the practical differences between 'factor' and 'string' data types in R?
I get that (on a conceptual/philosophical level) the factor data type is supposed to encode the values of a categorical random variable, but I do not understand (on a practical level) why the string data type would be insufficient for this purpose.
Seemingly having duplicate data types which serve the same practical purpose would be bad design. However, if R were truly poorly designed on such a fundamental level, it would be much less likely to have achieved the level of popularity it has. So either a very improbable event has happened, or I am misunderstanding the practical significance/purpose of the factor data type.
Attempt: The one thing I could think of is the concept of "factor levels", whereby one can assign an ordering to factors (which one can't do for strings), which is helpful when describing "ordinal categorical variables", i.e. categorical variables with an order (e.g. "Low", "Medium", "High").
(Although even this wouldn't seem to make factors strictly necessary. Since the ordering is always linear, i.e. no true partial orders, on countable sets, we could always just accomplish the same with a map from some subset of the integers to the strings in question -- however in practice that would probably be a pain to implement over and over again, and a naive implementation would probably not be as efficient as the implementation of factors and factor levels built into R.)
However, not all categorical variables are ordinal, some are "nominal" (i.e. have no order). And yet "factors" and "factor levels" still seem to be used with these "nominal categorical variables". Why is this? I.e. what is the practical benefit to using factors instead of strings for such variables?
The only other information I could find on this subject is the following quote here:
Furthermore, storing string variables as factor variables is a more efficient use of memory.
What is the reason for this? Is this only true for "ordinal categorical variables", or is it also true for "nominal categorical variables"?
Related but different questions: These questions seem relevant, but don't specifically address the heart of my question -- namely, the difference between factors and strings, and why having such a difference is useful (from a programming perspective, not a statistical one).
Difference between ordered and unordered factor variables in R
Factors ordered vs. levels
Is there an advantage to ordering a categorical variable?
factor() command in R is for categorical variables with hierarchy level only?
Practical differences:
If x is a string, it can take any value. If x is a factor, it can only take values from the list of its levels. That also makes these variables more memory efficient.
example:
> x <- factor(c("cat1","cat1","cat2"),levels = c("cat1","cat2") )
> x
[1] cat1 cat1 cat2
Levels: cat1 cat2
> x[3] <- "cat3"
Warning message:
In `[<-.factor`(`*tmp*`, 3, value = "cat3") :
invalid factor level, NA generated
> x
[1] cat1 cat1 <NA>
Levels: cat1 cat2
As you said, you can have ordinal factors, meaning that you can attach extra information to your variable, for instance that level1 < level2 < level3. Characters don't have that. However, the order doesn't necessarily have to be linear; I'm not sure where you found that.
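For completeness, a short sketch of an ordered factor (the level names are made up):
sev <- factor(c("Low", "High", "Medium", "Low"),
              levels = c("Low", "Medium", "High"),
              ordered = TRUE)
sev < "High"   # TRUE FALSE TRUE TRUE: comparisons respect the declared order
sort(sev)      # Low Low Medium High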

Xgboost - Do we have to convert integers to factors if they are only 0 & 1

I have many columns in a dataframe that are flags, "0" and "1". They belong to class "integer" when I import the dataframe.
0 denotes absence and 1 denotes presence in all columns.
Do I need to convert them to factors? (Factors will make the levels 1 & 2, while currently the values are already 0 & 1, albeit as integers.)
I plan to later use xgboost to build a predictive model.
Xgboost works only on numeric columns, so if I convert the columns to factors then I will need to one-hot encode them to get back to numeric.
(Side question: do we always need to drop one column when one-hot encoding, to remove collinearity?)
Short answer: it depends. Yes, if only for better variable interpretation; no, because for 0/1 variables, integers and factors behave the same.
If you ask my personal opinion, I lean towards YES, since you will most likely also have some categorical variables that either hold string values, have more than 2 levels, or have 2 integer levels other than 0 and 1. In all of those cases an integer variable and a factor are NOT the same; only in the specific case of 0/1 binary levels are they equivalent. So you may want to keep your coding consistent and use factors for the 0/1 case as well.
To see for yourself:
a <- c(1,2,1,2,1,2,5)
c <- as.character(a)
b <- as.factor(c)
d <- as.integer(b)
Here I am just playing with a vector, which in the end gives me:
> d
[1] 1 2 1 2 1 2 3
So if you don't want to debug later why values have changed, use as.factor() from the start.
Side Answer: Yes. Search for model.matrix() and contrasts.arg for getting this done in R.
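A hedged sketch of what that looks like (the data frame and column names are made up):
df <- data.frame(y = c(1, 0, 1, 0), grp = factor(c("a", "b", "c", "a")))
# default treatment contrasts: one level is dropped (used as the reference)
model.matrix(~ grp, data = df)
# full one-hot encoding: keep every level via contrasts.arg
model.matrix(~ grp, data = df,
             contrasts.arg = list(grp = contrasts(df$grp, contrasts = FALSE)))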
If you get an error stating that xgb.DMatrix takes numeric values while your data are integers, you can fix it as follows.
To convert the data to numeric use
train[] <- lapply(train, as.numeric)
and then use
xgb.DMatrix(data=data.matrix(train))
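As a fuller (hedged) sketch, assuming train is a data.frame whose 0/1 target column is called y (a hypothetical name):
library(xgboost)
X <- data.matrix(train[, setdiff(names(train), "y")])   # numeric feature matrix
dtrain <- xgb.DMatrix(data = X, label = train$y)        # label supplied separately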
