factors to dummies in R

My data set describes smartphones.
To fit a random forest, I need to convert my factor Brand into a set of dummy variables.
I tried this code:
m <- model.matrix( ~ Brand, data = data_price)
Intercept BrandApple BrandAcer BrandAlcatel ...
1 0 0 1
1 1 0 0
...
The problem is that the original data has 2039 rows, while the output of this has only 2038.
Now I want to add the dummies to my data_price, but this doesn't work because the row counts differ.
How can I create the dummies and add them to my data set?

Your approach using model.matrix should work fine, and we only need to figure out what happened to that missing row. I guess the issue is that there are missing values in your factor. Consider the following:
dat <- factor(mtcars$cyl)
dat2 <- dat
dat2[1] <- NA
Here, I have taken a factor, namely the number of cylinders in the mtcars dataset, and for comparison I have created a second factor where I have replaced one value with NA. Let's look at the number of rows that model.matrix will spit out in each case:
nrow(model.matrix(~dat))
[1] 32
nrow(model.matrix(~dat2))
[1] 31
You see that in the case where the factor variable had a missing value, the output of model.matrix had one row less. This is not surprising: by default, rows with missing values are dropped before the matrix is built.
You can either create a separate factor level for the missing value, or you can drop the row with the missing value from your original data set, if this seems appropriate given your application. The output of model.matrix contains row names, which you can use to merge the dummies back onto the original data frame if you want to go down that route.
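Both options can be sketched on a small toy data set. This is only an illustration: the data frame below is made up to stand in for data_price, and the level name "Missing" is an arbitrary choice.

```r
# Toy stand-in for data_price (hypothetical values); one Brand is missing.
data_price <- data.frame(
  Brand = factor(c("Apple", "Acer", NA, "Alcatel", "Apple")),
  Price = c(900, 300, 250, 150, 850)
)

# Option 1: give the missing value its own level so that no row is dropped.
brand_chr <- as.character(data_price$Brand)
brand_chr[is.na(brand_chr)] <- "Missing"
data_price$Brand2 <- factor(brand_chr)
m1 <- model.matrix(~ Brand2, data = data_price)
nrow(m1) == nrow(data_price)  # all rows are kept

# Option 2: accept that the row is dropped and use the row names of the
# model matrix to pick the matching rows of the original data.
m2 <- model.matrix(~ Brand, data = data_price)
kept <- data_price[rownames(m2), c("Brand", "Price")]
with_dummies <- cbind(kept, m2[, -1, drop = FALSE])  # drop the intercept column
```

With option 2 the combined data frame has one row fewer than the original, so it only makes sense if losing that observation is acceptable.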

Related

why a specific model is not appropriate, given data with 6 variables (they are chr variables)

I want to show why a specific model is not appropriate, given data with 6 variables (they are chr variables).
The model is y= abc*(x1+x2)
a and b in the data are non-numeric, so is it enough to say that we can't use this model because they are not numeric?
And if I want to convert a chr variable to numeric, how can I do this without it becoming NA?
You can still potentially model non-numeric data by converting it to a factor: df$a <- as.factor(df$a); df$b <- as.factor(df$b).
However, this does not always make sense. For example, if df$a is a unique ID for each observation, you don't want to use it as a predictor.
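On the second part of the question, here is a small sketch of what happens during conversion. The data frame and its values are invented for illustration. A chr column only converts cleanly with as.numeric() when the text actually holds numbers; labels turn into NA, and going through a factor gives integer codes instead.

```r
# Hypothetical data: 'a' holds labels, 'b' holds numbers stored as text.
df <- data.frame(
  a = c("low", "high", "low"),
  b = c("1.5", "2", "3.25"),
  stringsAsFactors = FALSE
)

# Text that holds numbers converts without producing NA.
df$b_num <- as.numeric(df$b)

# Labels cannot be parsed as numbers, so every value becomes NA
# (suppressWarnings hides the "NAs introduced by coercion" warning).
a_direct <- suppressWarnings(as.numeric(df$a))

# Converting to a factor first gives integer codes, one per level.
# Levels are sorted alphabetically here: high = 1, low = 2.
df$a_fac  <- as.factor(df$a)
df$a_code <- as.integer(df$a_fac)
```

Keep in mind that the integer codes are arbitrary labels, so treating them as a numeric predictor is usually only sensible if the levels have a meaningful order.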

How to create a dummy variable based on row numbers in R

I have a 10 (question items) by 500 (respondents) vector in R.
Upper 250 are male while lower 250 are female.
Can you tell me how to create a gender variable, and assign 0 and 1 to this variable based on row numbers in R?
Thank you very much! Stay safe.
This solution assumes your dataset is in a data frame, not a vector, that the dataset is named "dat" (change it to whatever you are calling your data), and that the variable "gender" does not already exist in "dat".
dat$gender <- NA # Creates a new, empty column in the dataset (NA stands for missing data, or not available)
dat[1:250, "gender"] <- "0" # assigns the category 0 to rows 1-250
dat[251:500, "gender"] <- "1" # assigns the category 1 to rows 251-500
Hope this helps! As the comments suggest, providing a sample of your data will help us help you.
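A more compact variant of the same idea, as a sketch under the same assumptions (a data frame called "dat" with exactly 500 rows, the first 250 being male). The placeholder data below is made up so the snippet runs on its own, and it stores the codes as numbers rather than as text.

```r
# Placeholder for the real data: 500 respondents by 10 items (all zeros).
dat <- as.data.frame(matrix(0, nrow = 500, ncol = 10))

# rep() repeats 0 for the first 250 rows and 1 for the next 250 rows.
dat$gender <- rep(c(0, 1), each = 250)

# Equivalent version based on the row index: rows above 250 get 1.
dat$gender_alt <- as.integer(seq_len(nrow(dat)) > 250)

table(dat$gender)
```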

Corr.test exclusion of NA values in R

I am trying to run the corr.test function in a for loop between a range of columns in a data frame against the rest of the columns in the same data frame. However, I have a lot of NA values throughout this data frame. I don't want to omit the rows altogether and lose the rest of the data in the rows and I also don't want to set NA = 0 because it will interfere with the rest of the data (scores that are either -1, 1, or 0). Every time I try to run the corr.test function, R keeps saying that x or y are not numeric vectors.
Is there any way to get around this?
The first column (rownames) of my data frame is a list of sample IDs, columns 2-50 are scores, and columns 51 onward are scores of a different type. What I've been doing so far is using a for loop to run corr.test between each range of columns, like this example:
cor.test(data[1:50], data[51:200])
This works fine in the for loop if I convert NA values to 0 but is there any way to avoid doing that?
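One possible way around this, sketched on made-up data: base R's cor() accepts two data frames of numeric columns, and use = "pairwise.complete.obs" drops missing values separately for each pair of columns, so NA does not have to be recoded as 0. The column names below are hypothetical. Note also that cor.test() expects two numeric vectors rather than data frames, which would explain the "not numeric vectors" message, and that it already drops incomplete pairs on its own.

```r
# Made-up scores of -1, 0, 1 with some missing values.
scores <- data.frame(
  a1 = c(1, 0, -1, NA, 1, 0, -1, 1),
  a2 = c(0, 1, NA, -1, 1, 0, 1, -1),
  b1 = c(1, NA, -1, 0, 1, 1, -1, 0),
  b2 = c(NA, 1, 0, -1, 0, 1, 1, -1)
)

# Correlation matrix between the two blocks of columns; each cell uses
# only the rows where both columns of that pair are observed.
r <- cor(scores[c("a1", "a2")], scores[c("b1", "b2")],
         use = "pairwise.complete.obs")

# For a p-value on a single pair, pass two vectors; rows with a missing
# value in either vector are left out automatically.
res <- cor.test(scores$a1, scores$b1)
```

Because each pair can be based on a different number of rows, the correlations in one matrix are not always directly comparable.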

how to ignore factors or levels while using summary function in R

I am working on a dataset which has a column with only 2 possible values i.e. 0 and 1. I applied as.factor() to this column and it created two levels for me.
dr$col <- as.factor(dr$col)
Now when I do summary(dataset) it gives me occurrences of those values instead of mean/max/min etc. values.
summary(dr)
col
0:12
1:34
How can I tell the summary function to ignore the factor for that column and calculate aggregate values like it does for other numeric columns?
Let's assume you have the following:
vec <- c(1, 1, 1, 1, 0, 0, 0)
vecf <- as.factor(vec)
Then the following will give you the desired result:
summary(as.numeric(as.character(vecf)))
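The detour through as.character() matters. A short sketch with the same toy vector shows that calling as.numeric() directly on a factor returns the internal level codes rather than the original values:

```r
vec  <- c(1, 1, 1, 1, 0, 0, 0)
vecf <- as.factor(vec)

# Direct conversion returns the level codes (level "0" is 1, level "1" is 2).
codes <- as.numeric(vecf)

# Converting to character first recovers the original 0/1 values.
values <- as.numeric(as.character(vecf))

mean(codes)   # mean of the codes, not of the data
mean(values)  # mean of the original 0/1 values
```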

keep most common factor levels in R

I used the "dummies" package to create 42 dummy variables for the 42 levels of a factor variable in my data-frame. Now I only want to keep the 5 dummies that represent the five most common factor levels. I used:
counts <- colSums(dummy_variables)
rank <- sort(counts)
to figure out what those levels are, but now I want to be able to reference the most common ones and keep them in my data frame. I am somewhat new to R - I just can't figure out the syntax to do this.
Filter out the top 5 variables, and then subset only those columns.
rank <- sort(counts)[(length(counts)-4):length(counts)]
dummy_variables <- dummy_variables[names(dummy_variables) %in% names(rank)]
Or in one line as the commenter suggested,
dummy_variables[names(dummy_variables) %in% names(tail(sort(colSums(dummy_variables)),5))]
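Here is a self-contained sketch of the same idea on toy data. The factor and its level counts are invented, and the dummies are built with model.matrix() rather than the "dummies" package so that the snippet has no dependencies. Selecting by column name also works if the dummies are stored as a matrix instead of a data frame.

```r
# Toy factor with 7 levels of different frequencies (hypothetical data).
f <- factor(rep(LETTERS[1:7], times = c(1, 4, 2, 6, 3, 5, 7)))

# One dummy column per level ("- 1" removes the intercept).
dummy_variables <- as.data.frame(model.matrix(~ f - 1))

# Names of the five most frequent levels, most common first.
counts <- colSums(dummy_variables)
top5 <- names(sort(counts, decreasing = TRUE))[1:5]

# Keep only those five columns.
top_dummies <- dummy_variables[, top5, drop = FALSE]
```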
