When or why convert a numeric variable to factor? - r

I am quite new to R and mostly learning by myself. I have a data set with 43 variables and I want to forecast one of them. Some are numeric variables and some are factor variables.
The question is that I don't know when one should convert factors to numerics and vice versa. I found on the internet that you should not keep variables as numeric if they always take integer values in a narrow range (for example, if the values are always between 1 and 7).
One of my variables is "NSM", which represents the number of seconds since midnight for each day. The values are integer and discrete (61200, 61800, 62400, 63600, 64200, 65400, 66000, 66600, 68400, 69000, 69600, 70800, 72000, 72600, 73200, etc.; you can observe there is a step of 600).
They go from 0 to 85800.
So I would like the opinion of someone more experienced than me (I have none). Should I keep NSM numeric, or convert it to a factor and then group the factor values into levels (otherwise I would have 144 levels, which would be too many and not relevant)?
Thank you,

I generally only convert a variable to a factor if one or more of the following are true:
the values of the variable represent some form of grouping, i.e. the variable is categorical in nature.
there are substantial memory savings to be had - this is usually the case where character variables have been used to identify group levels.
the variable is numeric in nature, but is highly non-linear, and there's no better way of entering it into a model than to convert it to a factor with one or two meaningful cut-points chosen.
However, manipulating factor variables can be more fiddly than characters or integers, so I tend to save the factoring until the very end, unless memory pressures force my hand.
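For the NSM variable above, a hedged sketch of that last point might look like this (the breakpoints are assumptions, chosen purely for illustration):
# Bin NSM (seconds since midnight) into a few meaningful periods
# instead of 144 raw levels; the cut-points here are hypothetical.
nsm <- c(61200, 61800, 62400, 63600, 72000, 85800)
time_of_day <- cut(nsm,
                   breaks = c(0, 6, 12, 18, 24) * 3600,
                   labels = c("night", "morning", "afternoon", "evening"),
                   include.lowest = TRUE)
table(time_of_day)
#> time_of_day
#>     night   morning afternoon   evening
#>         0         0         4         2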

I have also been self-learning, just like you, and as I understand this topic it is better to use factors when you have a limited set of nominal/categorical values, especially in the case of character vectors like gender ("Male", "Female"). This saves you from comparison errors with characters, such as case sensitivity or spelling mistakes.
Also, internally factors are stored as integers, and if there is a limited set of categorical integer values then it is suggested to use factors so that the levels carry more meaningful information. In your case, my opinion is to use integers rather than factors, as there are too many levels to attach any meaningful information to, even if that were required.
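You can see this integer storage directly:
# A factor is stored as integer codes plus a levels attribute:
f <- factor(c("low", "high", "low"))
typeof(f)
#> [1] "integer"
as.integer(f)
#> [1] 2 1 2
levels(f)
#> [1] "high" "low"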
Finally, you are the best judge of whether you should use factors in your code, as you know exactly where you are going to use them again in your program; some algorithms explicitly demand factors rather than character vectors.

Related

Using factor vs. character and integer vs. double columns in dplyr/ggplot2

I am setting up a pipeline to import, format, normalize, and plot a bunch of datasets. The pipeline will rely heavily on tidyverse solutions (dplyr and ggplot2).
During the input/format step I would like to decide if/when to use factors vs. characters in various columns that contain letters. Likewise, I need to decide if I should designate numerical columns as integers (when it's reasonable) or use double.
My gut feeling is that by default I should just use character and double. Neither speed nor space is an issue, since the resulting datasets are relatively small (~20 x 10,000 max), so I figure this will give me the most flexibility. Are there disadvantages to going down this road?
Performance shouldn't be a concern in most use cases; the criterion is the meaning of the variables.
Factor vs character
Use character if your data is just strings that do not hold specific meaning; use factor if it's a categorical variable with a limited set of values. The main advantages of using factors are:
you get a warning (and an NA is introduced) if you try to assign a value that is not in the levels, so that can save you from typos (see the example after this list)
you can give an order to the levels and get an ordered factor
some functions (especially when modelling) require an explicit factor for categorical variables
you make it clear to the reader that these are not random character strings.
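As a small illustration of the first point:
# Assigning a value outside the declared levels yields NA with a warning:
sex <- factor(c("Male", "Female"))
sex[3] <- "male"   # typo: not a valid level
#> Warning message:
#> invalid factor level, NA generated
sex
#> [1] Male   Female <NA>
#> Levels: Female Male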
Integer vs double
If you know your column will only ever contain integer values, integer can be a better choice. Indeed, computations on doubles can suffer from numeric error, and in some situations you can end up with 26.0000000001 != 26. In addition, some packages may be aware of the type of input (although I can't think of an example).
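To make that rounding error concrete, a classic illustration:
0.1 + 0.2 == 0.3
#> [1] FALSE
sqrt(2)^2 == 2
#> [1] FALSE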
For big numbers (2^31 or more), integers won't be able to store them, whereas doubles will still behave correctly:
as.integer(2147483647)
#> [1] 2147483647
as.integer(2147483648)
#> [1] NA
#> Warning message:
#> NAs introduced by coercion to integer range
But when the numbers get even bigger, doubles will also start losing significant digits:
1234578901234567890 == 1234578901234567891
#> [1] TRUE
Overall, I don't think it makes a big difference in practice; using an integer type can be a way to signal to the reader and to the program that if there is a decimal number in that column, something went wrong.

How to convert the levels of an integer variable of a dataset to string characters

Hi, I have a problem with one of my assignments. I am using the following dataset: http://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-white.csv
One of the questions asks to "reduce the levels of rating for quality to three levels as high, medium, and low".
I would like to output the summary of the quality variable with these strings.
They are originally stored as integers.
If it is homework you should really try to work out part of the solution yourself; nevertheless, here are some ideas to help:
You want to cut, or bin, the variable. E.g. if you have a scale of 1-6, you could cut it into three groups: 1-2, 3-4, and 5-6.
Once you have cut or binned your variable, you can transform the binned variable (which is now a factor) to the desired levels using transformations like "5-6" -> "high"; a sketch follows below.
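As a minimal sketch of that approach, assuming the usual semicolon-separated UCI file and purely illustrative breakpoints:
# Bin the integer quality scores into three labels in one step.
# The breakpoints (4 and 6) are assumptions; choose ones that fit your data.
wine <- read.csv("winequality-white.csv", sep = ";")
wine$quality_level <- cut(wine$quality,
                          breaks = c(-Inf, 4, 6, Inf),
                          labels = c("low", "medium", "high"))
summary(wine$quality_level)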
Can you provide at least some code you have already worked on and where your problems are? Then I could give better feedback instead of just providing a solution.

How do I code opposite 'levels' or 'factors' in R

I have browsed through many different pages on the net but cannot seem to find an answer to this question:
How do I code up equally-opposing levels for use in multiple regression in R?
The only way I can currently see to achieve this is to assign the data type of these columns as 'numeric'; however, I know this is not an entirely accurate way of doing it, as they should ideally be treated as factors.
Essentially what I am looking for is creating 'dummy' variables that are factors rather than numeric.
By way of example, suppose we have a scenario of sporting teams opposing each other. Each row represents a different game, where Team A's presence in the game is indicated by the '1' dummy and Team B's presence by '-1' (all other teams' values are set to 0 because they are not playing in this game).
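For concreteness, here is a small sketch of the matrix I have in mind (the teams and games are made up):
# Hypothetical example: rows are games, columns are teams;
# +1 marks the first team, -1 its opponent, 0 everyone else.
games <- data.frame(team_a = c("A", "B", "A"),
                    team_b = c("B", "C", "C"))
teams <- sort(unique(c(games$team_a, games$team_b)))
X <- matrix(0, nrow = nrow(games), ncol = length(teams),
            dimnames = list(NULL, teams))
X[cbind(seq_len(nrow(games)), match(games$team_a, teams))] <- 1
X[cbind(seq_len(nrow(games)), match(games$team_b, teams))] <- -1
X
#>      A  B  C
#> [1,] 1 -1  0
#> [2,] 0  1 -1
#> [3,] 1  0 -1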
I have attempted to change the data type of these columns to factors; however, this results in three levels (1, 0 and -1), and the beta assigned to 1 and -1 is never equal and opposite (and in theory the value assigned to 0 should be zero).
Does anybody know how to achieve this correctly, both mathematically and programmatically? Your help would be much appreciated!

How to deal with categorical features having large number of levels in it

I am working on a data set in R having dimensions
dim(adData)
[1] 15844717 11
Out of 11 features,
one feature has 273596 unique values out of 15844717 (random integers used as an id).
a second feature has 884353 unique values out of 15844717 (random integers used as an id).
My confusion is whether or not to convert them into factors, because categorical variables with a large number of levels will create problems at modelling time. Please suggest how to treat them.
I am new to Data Science and never worked on large data sets before.
~300k categories for one variable is sure to cause computational issues. I would first take a step back and examine the nature of this variable and its relevance to the prediction at hand. Without knowing the source of the data, it is hard to give specific advice.
If it is truly a categorical variable, it would be silly to leave the ids as numeric variables since the scale and order of the ids are likely meaningless.
Is it possible to group the levels into fewer but still meaningful categories?
Example 1: If the ids were zipcodes in the United States, there are potentially 40,000 unique values. These can be grouped into states or regions, reducing the number of levels to 50 or fewer.
Example 2: If the ids were product ids from an e-commerce site, they could be grouped by product category or sub-category. There would be far fewer distinct values to work with.
Another option is to examine the relative frequency of each category. If there are a few very common categories and thousands of rare ones, you can leave the common levels intact and group the rare levels into an 'other' category, as sketched below.
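A hedged sketch of that lumping step (the column name id_feature and the threshold of 1000 are placeholders; forcats::fct_lump_min() offers similar behaviour):
# Keep levels seen at least 1000 times; lump the rest into "other".
freq <- table(adData$id_feature)
common <- names(freq)[freq >= 1000]
adData$id_grouped <- factor(ifelse(adData$id_feature %in% common,
                                   as.character(adData$id_feature),
                                   "other"))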

Make use of available data and neglect missing data for building classifier

I am using the randomForest package in R to build a binary classifier. There are about 30,000 rows, with 14,000 in the positive class and 16,000 in the negative class. I have 15 variables that are known to be important for classification.
I have about 5 additional variables with missing information. These variables take the values 1 or 0: 1 means presence of something, but 0 means it is not known whether it is present or absent. It is widely known that these variables would be the most important for classification when they are 1 (a 1 increases the reliability of classification and makes it more likely that the sample lies in the positive class) but useless when they are 0. Only 5% of the rows have the value 1, so each variable is useful for only 5% of the cases. The 5 variables are independent of each other, so I expect they will be highly useful for 15-25% of my data.
Is there a way to make use of the available data but neglect the missing/unknown data present in a single column? Your ideas and suggestions would be appreciated. The implementation does not have to be specific to random forests or R; if this is possible using other machine learning techniques or on other platforms, those are also most welcome.
Thank you for your time.
Regards
I can see at least the following approaches. Personally, I prefer the third option.
1) Discard the extra columns
You can choose to discard those 5 extra columns. Obviously this is not optimal, but it is good to know the performance of this option, to compare with the following.
2) Use the data as it is
In this case, those 5 extra columns are left as they are. The definite presence (1) or unknown presence/absence (0) in each of those 5 columns is used as information. This is the same as saying "if I'm not sure whether something is present or absent, I'll treat it as absent". I know this is obvious, but if you haven't tried this, you should, to compare it to option 1.
3) Use separate classifiers
If around 95% of each of those 5 columns is zeroes, and they are roughly independent of each other, that's 0.95^5 = 77.38% of the data (roughly 23,200 rows) with zeroes in ALL of those columns. You can train a classifier on those 23,200 rows, removing the 5 columns that are all zeroes (since those columns are equal for all points, they have zero predictive utility anyway). You can then train a separate classifier on the remaining points, which have at least one of those columns set to 1. For these points, you leave the data as it is.
Then, for your test point, if all those columns are zeroes you use the first classifier; otherwise you use the second. A rough sketch follows below.
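A minimal sketch of this idea with randomForest (df, its class label y, and the extra column names are placeholders for illustration):
library(randomForest)
extra_cols <- paste0("extra", 1:5)   # hypothetical column names
all_zero <- rowSums(df[, extra_cols]) == 0
# Classifier 1: rows where every extra column is 0; drop those constant columns.
rf_zero <- randomForest(y ~ ., data = df[all_zero, setdiff(names(df), extra_cols)])
# Classifier 2: rows with at least one extra column equal to 1; keep all columns.
rf_nonzero <- randomForest(y ~ ., data = df[!all_zero, ])
# At prediction time, route each test row to the matching classifier;
# this assumes df$y is a factor (the binary class label).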
Other tips
If the 15 "normal" variables are not binary, make sure you use a classifier which can handle variables with different normalizations. If you're not sure, normalize the 15 "normal" variables to lie in the interval [0,1] -- you probably won't lose anything by doing this.
I'd like to add a further suggestion to Herr Kapput's: if you use a probabilistic approach, you can treat "missing" as a value which you have a certain probability of observing, either globally or within each class (not sure which makes more sense). If it's missing, it has probability of occurring p(missing), and if it's present it has probability p(not missing) * p(val | not missing). This allows you to gracefully handle the case where the values have arbitrary range when they are present.
