Transform factor variable to numeric in R

I have tried multiple things without success, so I'll ask my question here.
I have a dataset consisting of 5 columns. The first one lists countries (text), the second is Year (integer), and columns 3-5 are my variables, which are currently factors.
I want to run a regression with my 3 variables, which is not possible right now because (I guess) my variables are not numeric/integer. I tried to convert them to numeric directly, but that only returned ranks (the internal level codes) instead of the actual values. I also tried converting them to character first and then to integer/numeric (I tried both), but that also only turned my 3 variables into ranks. I used transform() with as.integer(), thus creating a new dataset:
x<-transform(GDPall, HardWork = as.integer(HardWork), FamilyImportance = as.integer(FamilyImportance), GDPWorker = as.integer(GDPWorker))
How can I transform my 3 variables into a class which allows me to run my regression?
Thank you in advance!
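A minimal sketch of the usual conversion, assuming the factor labels are plain numeric strings. The small GDPall data frame below is made up for illustration (only the column names come from the question), and factor_to_numeric is a hypothetical helper name:

```r
# Made-up stand-in for the asker's data; the values are illustrative only.
GDPall <- data.frame(
  Country   = c("A", "B", "C"),
  Year      = c(2000L, 2000L, 2000L),
  HardWork  = factor(c("10.5", "3.2", "7")),
  GDPWorker = factor(c("52000", "31000", "47000"))
)

# as.integer() on a factor returns the internal level codes ("ranks"),
# not the numbers printed as labels.
codes <- as.integer(GDPall$HardWork)

# Going through the labels recovers the values themselves.
factor_to_numeric <- function(f) as.numeric(as.character(f))

x <- transform(GDPall,
               HardWork  = factor_to_numeric(HardWork),
               GDPWorker = factor_to_numeric(GDPWorker))
str(x)
```

If this produces NA values with a warning, the labels probably contain something as.numeric() cannot parse (a decimal comma or a thousands separator, for example); levels(GDPall$HardWork) shows what is actually stored.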

Related

Is it necessary to convert factor variables to numeric while doing predictions?

I have a dataset in which 2 variables are factor variables. The first one is 'Frequency', which has 4 values: Mly, Qly, Hly and Yly. The second one is 'Type', which has values like Trad, Ulip, Term and Pension. Is it advisable to convert these variables to numeric, e.g. by assigning the values 1 to 4, before doing the prediction?
I am new to data science, hence the question.
I think you'd better leave categorical variables as they are and not convert them to numeric. Assigning 1 to 4 would impose an order and equal spacing on categories that have neither. The regression functions in R, for instance, handle factor variables correctly (even without you defining dummy variables). Moreover, when you do logistic regression the response variable must be categorical.
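As a rough illustration of that advice, the sketch below fits a linear model with a factor predictor left as a factor. The policies data frame and its Premium column are invented for the example; only the Frequency values come from the question:

```r
# Invented example data; Premium is a hypothetical numeric response.
policies <- data.frame(
  Premium   = c(120, 340, 610, 1150, 130, 360, 590, 1210),
  Frequency = factor(rep(c("Mly", "Qly", "Hly", "Yly"), 2))
)

# lm() expands the factor into dummy columns on its own.
fit <- lm(Premium ~ Frequency, data = policies)
coef_names <- names(coef(fit))
print(coef_names)

# New data only needs the same factor levels, not numeric codes.
new_policy <- data.frame(
  Frequency = factor("Qly", levels = levels(policies$Frequency))
)
pred <- predict(fit, newdata = new_policy)
print(pred)
```

The coefficient names show one dummy per level except the first (alphabetical) level, which R uses as the reference. A second factor such as Type could be added to the formula in the same way.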

Removing data frames from a list that contains a certain value under a variable in R

I currently have a list of 27 correlation matrices with 7 variables each, for social science research.
Some correlations are NA due to missing data.
When I do the analysis, however, I do not analyse all variables in one go.
In a particular instance, I would like to keep a matrix only if one of the variables contains at least some real value. Since there are 7 variables, that means keeping anything where that variable does NOT consist of 6 NAs plus its correlation with itself, which is 1. This is the tricky part: 1 is a value, but it is meaningless to me in a correlation matrix.
I would appreciate it if anyone could enlighten me regarding the code.
I am rather new to R, and the only idea I have is to use an if statement to set the condition. I have been trying for hours to no avail, as this is my first real coding experience.
Thanks a lot.
Since you didn't provide sample data, I am first going to convert your matrix into a data frame, and then I am going to assume that you want to check whether your data frame df has a variable var with at least one value other than NA or 1.
df <- as.data.frame(as.table(matrix)) should convert your matrix into a data frame.
table(df$var) will show you the distribution of values in your data frame's variable. From there you can make a judgement call on whether to keep the variable or not.
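An alternative sketch that filters the list directly, without converting to a data frame. It assumes each matrix has row and column names; the two 3x3 matrices and the helper has_real_cor are made up for illustration:

```r
# Two small made-up correlation matrices standing in for the 27 real ones.
vars <- c("a", "b", "c")
m1 <- matrix(c(1,   0.3, NA,
               0.3, 1,   NA,
               NA,  NA,  1),
             nrow = 3, dimnames = list(vars, vars))
m2 <- m1
m2["c", "a"] <- m2["a", "c"] <- 0.5
cor_list <- list(first = m1, second = m2)

# TRUE if `var` has at least one non-NA correlation with another variable.
# The diagonal (the correlation of 1 with itself) is skipped on purpose.
has_real_cor <- function(m, var) {
  others <- setdiff(colnames(m), var)
  any(!is.na(m[var, others]))
}

# Keep only the matrices where variable "c" has a usable correlation.
kept <- Filter(function(m) has_real_cor(m, "c"), cor_list)
names(kept)
```

Filter() avoids an explicit if statement inside a loop: it applies the condition to every list element and returns only those for which it is TRUE.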

class/type columns and match factor levels of dataframe based on another dataframe

I need to score a big data frame: 250M rows x ~2000 cols (1TB). A model has been fit (with the C5.0 library) on training data. Currently the entire dataset won't fit in memory. Big data solutions are being investigated, but I'm wondering if I can cut up the file and run predict() in chunks.
The basic problem is that I want to carry column classes and factor levels over from one data frame to another. More detail:
PROBLEM #1: Some of the column classes don't match when being read from Hadoop.
PROBLEM #2: By chunking you can get factors that have fewer levels than in the training dataset (since you're looking at a subsample of the full set). Because of this, predict() refuses to score a set that has missing factor levels.
QUESTION: I was hoping to just take the classes and factor levels from the training set, then set the column classes and 'relevel' the factors in the big scoring set with levels(). Can one take the column classes and factor levels of one data frame and apply them to the columns with the same names in another data frame? I suppose this could be done with a for loop that reads the factors from one frame and applies them to the other. But this would require lots of if statements for all the classes, and it seems it would be messy. Is there a way to do this with apply functions or a simpler 'one-liner'?
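One possible sketch of such a helper, under stated assumptions: columns are matched by name, non-factor columns have a basic class with a matching as.<class>() function (numeric, integer, character, logical), and values unseen in training may become NA. The name match_to_template and the toy data are hypothetical:

```r
# Toy stand-ins for the training set and one chunk of the scoring set.
train <- data.frame(
  color = factor(c("red", "green", "blue")),
  size  = c(1.5, 2, 3)
)
chunk <- data.frame(
  color = c("red", "red", "purple"),   # arrives as character, with fewer levels
  size  = c("4", "5", "6"),            # arrives with the wrong class
  stringsAsFactors = FALSE
)

match_to_template <- function(new, template) {
  for (nm in intersect(names(template), names(new))) {
    target <- template[[nm]]
    if (is.factor(target)) {
      # Rebuild the factor with the full set of training levels.
      new[[nm]] <- factor(as.character(new[[nm]]), levels = levels(target))
    } else {
      # Look up as.numeric, as.integer, ... from the template's class.
      convert <- match.fun(paste0("as.", class(target)[1]))
      new[[nm]] <- convert(new[[nm]])
    }
  }
  new
}

fixed <- match_to_template(chunk, train)
str(fixed)
```

Only the factor case needs special handling, so a single if is enough. Levels that never appeared in training (here "purple") turn into NA, which is worth checking before calling predict() on the chunk.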

How to create appropriate dummy variables for all categorical variables with more than 2 values in R?

I have a CSV dataset that has 1000 rows and 21 variables. Of these 21, 9 are categorical variables with more than 2 values. How do I create dummy variables for them in R? I wish to run a logistic regression on this dataset in order to interpret it. I tried using factors and levels to convert them, but I think that works best for variables with only 2 values. I googled quite a bit and found many sites that explain how to do it in theory, but none of them show the code or function needed to understand it fully. On this website, I came across the model.matrix() function, the dummies package for R and the dummy.code() function. However, I am still stuck because I am new to R. Sorry for the long question, this is my first time asking here. Thanks in advance!
In R, most modelling functions recognize categorical values (gender, location, etc.) and automatically create the dummy variables. For example, if you are doing a linear regression you can just do lm(CSV_DATA). If the categories are represented by actual numbers, it is recommended to convert them to strings (or factors) first, so that R treats them as categories rather than as quantities.
If you must do this manually, you can instead write a loop that iterates through your dataset and populates additional variables. For each categorical variable you need n-1 additional variables to represent it numerically, n being the number of possible categories the variable contains. You assign each of the n-1 new variables to one category of the original categorical variable. The last category is represented by 0's in all of the n-1 new variables. For example, if you are trying to represent location and your data can be "New York", "LA" or "Miami", you would create two (n-1) dummy variables; for ease of explanation we will call them city1 and city2. If the original variable is "New York" you set city1 = 1 and city2 = 0, if it is "LA" you set city1 = 0 and city2 = 1, and if it is "Miami" you set city1 = 0 and city2 = 0.
This works because it does not rank any one of the categories numerically higher than the rest, and it uses the last category as a 'reference' to which all the others are compared. As said previously, if you represent your variables as strings, R will do this for you automatically.
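The city example can be reproduced with model.matrix(), which the question mentions. This is a small sketch on made-up rows; the levels are ordered with "Miami" first because R's default treatment coding uses the first factor level, not the last, as the reference:

```r
# Made-up data using the three cities from the example above.
cities <- data.frame(
  city = factor(c("New York", "LA", "Miami", "LA"),
                levels = c("Miami", "New York", "LA"))  # first level = reference
)

# One intercept column plus n-1 dummy columns for the n = 3 categories.
dummies <- model.matrix(~ city, data = cities)
print(dummies)

# Drop the intercept to keep just the two dummy variables.
city_dummies <- dummies[, -1]
```

The two remaining columns play the role of city1 and city2: "New York" rows are (1, 0), "LA" rows are (0, 1) and "Miami" rows are (0, 0). Modelling functions such as lm() and glm() build this matrix internally when a predictor is a factor.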

Efficient way of counting through categorical data values and printing to excel

I have a probably very basic question, but I cannot figure out the necessary control structures, as I am pretty new to R programming.
The situation is as follows:
I have a data.frame with ten factor variables that have 4 levels each (from very important to not important). Now I want to count the occurrences of each level in each variable and put the counts in a new data frame, which should then look something like this:
              Var1   Var2   etc.
Important       78    ...
....            12    ...
....             4    ...
Unimportant      0    etc.
As of now, I can only think of counting each of the original variables with count() from the plyr package and then somehow using cbind() to put the columns together. However, this would require a lot of typing, and I can't shake the feeling that there must be a better way to do this in R.
Try this:
data.frame(sapply(your.dataframe, function(x) { summary(x) }))
summary() is a handy little R function that, when given a factor (a data frame column in this case), returns the number of occurrences of each factor level.
Note that this solution only works cleanly if every column in your.dataframe has the same factor levels (which is true in your original problem, where each column has the same 4 levels).
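A variant sketch that does not depend on the columns already sharing levels: each column is re-coded against one common set of levels before counting. The survey data and the level labels below are invented for illustration:

```r
# Invented level labels and data; the real data frame has ten such columns.
lv <- c("Very important", "Important", "Less important", "Unimportant")
survey <- data.frame(
  Var1 = factor(c("Important", "Important", "Unimportant"), levels = lv),
  Var2 = factor(c("Very important", "Important", "Important"), levels = lv)
)

# Count every level in every column; levels with no occurrences give 0.
counts <- sapply(survey, function(x) table(factor(x, levels = lv)))
print(counts)

# Turn the matrix into a data frame, keeping the level names as row names.
counts_df <- as.data.frame(counts)
```

The result has one row per level and one column per variable. From there, write.csv(counts_df, "counts.csv") would produce a file that Excel can open; the file name is only an example.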
