I have an R data frame and some of the variables are categorical. For example, sex is "male" or "female" and "do you smoke" is 0 or 1. The other variables are continuous.
I would like to know whether there is a way to decide if a variable is categorical and, if so, to compute its frequencies.
I think in my case a good test would be to check whether the variable takes fewer than k=4 distinct values.
While you should use factors for categorical variables, you can find the unique values in a vector x with unique, and count them:
length(unique(x))
You can use class(dataframe$variable) to find the class of a variable within a data frame, and is.factor(dataframe$variable) to check whether it is a factor.
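A minimal sketch of the test proposed in the question, assuming a small made-up data frame (the column names and values here are hypothetical). It flags every column with fewer than k distinct non-missing values and then tabulates the flagged columns:

```r
# Hypothetical example data; column names are made up for illustration.
df <- data.frame(
  sex    = c("male", "female", "female", "male", "female"),
  smoke  = c(0, 1, 0, 0, 1),
  height = c(180.2, 165.5, 170.1, 175.8, 160.3)
)

k <- 4  # threshold from the question: fewer than k distinct values

# TRUE for each column with fewer than k distinct non-missing values
is_categorical <- sapply(df, function(x) length(unique(na.omit(x))) < k)

# Frequency table for each column flagged as categorical
freqs <- lapply(df[is_categorical], table)
```

Whether k = 4 is a sensible cut-off depends on the data: a continuous variable with very few observations would also be flagged.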
I have a dataset in which 2 variables are factor variables. The first one is 'Frequency', which has 4 values: Mly, Qly, Hly and Yly. The second one is 'Type', which has values like Trad, Ulip, Term and Pension. Is it advisable to convert these variables to numeric, for example by assigning the values 1 to 4, and then do the prediction?
I am new to datascience, hence the question
I think you'd better leave categorical variables as they are and not convert them to numeric. The regression functions in R, for instance, handle factor variables correctly (even without you defining dummy variables). Moreover, when you do logistic regression the response variable must be categorical.
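As a rough illustration of how a factor is handled, here is a sketch using the level names from the question. The response column premium and all of its values are made up for the example:

```r
set.seed(1)

# Hypothetical data: Frequency uses the level names from the question,
# premium is an invented numeric response.
d <- data.frame(
  Frequency = factor(rep(c("Mly", "Qly", "Hly", "Yly"), each = 5)),
  premium   = rnorm(20, mean = 100, sd = 10)
)

# lm() expands the factor into indicator columns on its own
fit <- lm(premium ~ Frequency, data = d)

# One coefficient per non-reference level; the first level in
# alphabetical order ("Hly") is used as the reference.
coef_names <- names(coef(fit))
```

Coding the levels as 1 to 4 instead would force a single slope across them, which implies both an ordering and equal spacing between the categories.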
Currently have a list of 27 correlation matrices with 7 variables, doing social science research.
Some correlations are "NA" due to missing data.
When I do the analysis, however, I do not analyse all variables in one go.
In a particular instance, I would like to keep one of the variables conditionally, that is, only if it contains at least some value other than "NA". Since there are 7 variables, I am keeping anything that DOES NOT contain 6 "NA"s plus the correlation with itself, 1. This is the tricky part, because 1 is a value, but it is meaningless to me in a correlation matrix.
I would appreciate it if anyone could enlighten me regarding the code.
I am rather new to R, and the only thought I have is to use an if statement to set the condition. I have been trying for hours to no avail, as this is my first real coding experience.
Thanks a lot.
Since you didn't provide sample data, I am first going to convert your matrix into a data frame, and then assume that you want to see whether your data frame df has a variable var with at least one value that is neither NA nor 1.
df <- as.data.frame(as.table(matrix)) should convert your matrix into a data frame (in long format, with columns Var1, Var2 and Freq).
table(df$var) will show you the distribution of values in your data frame's variable. From here you can make a judgement call on whether to keep the variable or not.
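An alternative that works on the matrices directly is to ignore the diagonal and test whether any other correlation is non-missing. This is only a sketch: the 3-variable matrix, the helper name has_info and the list name mats are all made up for illustration.

```r
# Hypothetical 3-variable correlation matrix; "c" has no usable correlations.
m <- matrix(c(1,   0.4, NA,
              0.4, 1,   NA,
              NA,  NA,  1),
            nrow = 3, byrow = TRUE,
            dimnames = list(c("a", "b", "c"), c("a", "b", "c")))

# TRUE if `var` has at least one non-NA correlation with another variable.
has_info <- function(mat, var) {
  off_diag <- mat[var, colnames(mat) != var]  # drop the self-correlation of 1
  any(!is.na(off_diag))
}

keep_a <- has_info(m, "a")
keep_c <- has_info(m, "c")

# Applied to a list of matrices (here just two copies of m)
mats <- list(m, m)
keep <- sapply(mats, has_info, var = "c")
```

Dropping the diagonal sidesteps the problem of the meaningless 1, so no if statement is needed to special-case it.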
I have a CSV dataset that has 1000 rows and 21 variables. Of these 21, 9 are categorical variables with more than 2 values. How do I create dummy variables for them in R? I wish to run a logistic regression on this data set to interpret it. I tried using factors and levels to convert them, but I think that works best for variables with only 2 values. I googled quite a bit and found many sites that explain how to do it in theory, but no code or function is mentioned that would let me understand it fully. On this website, I came across the model.matrix() function, the dummies package and the dummy.code() function. However, I am still stuck because I am new to R. Sorry for the long question, this is my first time asking here. Thanks in advance!
In R, most modelling functions will recognize when you are passing categorical values (gender, location, etc.) and will automatically create the dummy variables! For example, if you are doing a linear regression you can just do lm(CSV_DATA). If the categorical values are represented by actual numbers, it is recommended to first convert them to a string (or a factor) so that R can adjust accordingly!
If you must do this manually, you can instead write a loop that iterates through your dataset and populates additional variables. For each categorical variable, you will need n-1 additional variables to represent it numerically, n being the number of possible categories the variable contains. You assign each of your n-1 new variables to one category of the original categorical variable. The last category is represented by 0's in all of the n-1 new variables. For example, if you are trying to represent location and your data can be "New York", "LA", or "Miami", you would create two (n-1) dummy variables; for ease of explanation we will call them city1 and city2. If the original variable was equal to "New York" you would set city1 = 1 and city2 = 0, if it was "LA" you would set city1 = 0 and city2 = 1, and if it was "Miami" you would set city1 = 0 and city2 = 0.
The reason this works is because it does not rank any one of the categories numerically higher than any of the rest, and it uses the last category as a 'reference' to which all the rest are compared! As said previously, if you represent your variables as strings R will do this automatically for you.
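The city example can be reproduced without a loop using model.matrix(), which the question mentions. This is a sketch on made-up data. Note that R's default coding takes the first factor level as the reference rather than the last, so "Miami" is placed first here to match the description above:

```r
# Hypothetical data; "Miami" is listed first so that it becomes the reference.
d <- data.frame(
  city = factor(c("New York", "LA", "Miami", "LA"),
                levels = c("Miami", "New York", "LA"))
)

# Design matrix: an intercept plus n-1 = 2 indicator columns
X <- model.matrix(~ city, data = d)

# Columns are "(Intercept)", "cityNew York" and "cityLA";
# the "Miami" row has 0 in both indicator columns.
```

For a logistic regression, glm(..., family = binomial) builds this matrix internally when given a factor, so calling model.matrix() by hand is mainly useful for inspecting the coding.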
Or will the package realize that they are not continuous and treat them as factors? I know that, for classification, the feature being classified does need to be a factor. But what about predictive features? I've run it on a couple of toy datasets, and I get slightly different results depending on whether categorical features are numeric or factors, but the algorithm is random, so I do not know if the differences in my results are meaningful.
Thank you!
Yes there is a difference between the two. If you want to use a factor variable you should specify it as such and not leave it as a numeric.
For categorical data (this is actually a very good answer on CrossValidated):
A split on a factor with N levels is actually a selection of one of the (2^N)−2 possible combinations. So, the algorithm will check all the possible combinations and choose the one that produces the better split
For numerical data (as seen here):
Numerical predictors are sorted then for every value Gini impurity or entropy is calculated and a threshold is chosen which gives the best split.
So yeah it makes a difference whether you will add it as a factor or as a numeric variable. How much of a difference depends on the actual data.
laglaw only takes on 0 or 1. My categorical variables suffixed with TRUE were created using the syntax YX$jan <- seatbelt$month[t]==1 and if I call them with YX$jan they return a dataframe of TRUE or FALSE. Are these being treated differently than my laglaw categorical variable containing 0's and 1's?
If laglaw is not being treated as a categorical variable
1) What information does it provide in this form that's different from if it were categorical?
2) How can I make it categorical?
Your categorical variables were NOT suffixed with TRUE. Only the labels for the coefficients were displayed with the variable name followed by the level values to which they applied. In the case of numeric variables there is no such labeling. The interpretation would be the same in this case, but if laglaw had values 0, 1, 2 you would still not see that sort of level-specific estimate but rather a single estimate for a one-unit increase in the variable's value. To make numeric variables factor variables you use the function .... wait for it .... factor.
Furthermore, YX$jan does NOT return a data.frame but rather a vector.
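Both points can be checked on a small example. The sketch below uses made-up data: the response y and the values of laglaw and month are hypothetical, and only the names YX, laglaw and jan come from the question.

```r
set.seed(42)

# Hypothetical data standing in for the question's YX data frame
YX <- data.frame(
  y      = rnorm(12),
  laglaw = rep(c(0, 1), each = 6),
  month  = rep(1:6, times = 2)
)
YX$jan <- YX$month == 1          # logical column, as in the question

jan_is_df  <- is.data.frame(YX$jan)  # FALSE: `$` extracts a vector
jan_is_lgl <- is.logical(YX$jan)     # TRUE

fit_num <- lm(y ~ laglaw + jan, data = YX)          # laglaw kept numeric
fit_fac <- lm(y ~ factor(laglaw) + jan, data = YX)  # laglaw as a factor

# Only the coefficient labels differ; with a 0/1 variable the
# estimates themselves are the same.
names_num <- names(coef(fit_num))
names_fac <- names(coef(fit_fac))
same_est  <- all.equal(unname(coef(fit_num)), unname(coef(fit_fac)))
```

The numeric version labels the coefficient laglaw, while the factor version labels it factor(laglaw)1, which is the same variable-name-plus-level pattern that produces janTRUE.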