Replace missing values of categorical variables and multiple variables in R


I am facing the following issues:
I want to replace all NA's of a certain categorical variable with "Unknown", however it does not work.
Here's the code:
x <- "Unknown"
kd$form_of_address[which(is.na(kd$form_of_address))] <- x
The problem arises when I perform
levels(kd$form_of_address)
Sadly, my output does not include "Unknown".
My data also includes ebooks, whose weight is always 0. How can I replace the NAs of the variable weight with 0 in the rows where ebook_count > 0?
Thank you in advance :)

I assume your variable is a factor. R does not let you assign a value to an element of a factor unless that value is already one of its levels.
Using iris, let's see what may have happened and how it can be solved.
str(iris)
'data.frame': 150 obs. of 5 variables:
$ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
$ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
$ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
$ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
$ Species : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
We can see that the Species variable is a factor.
We can put some NAs into it by:
iris[c(11:13),5] = NA
iris[c(11:15), ]
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
11 5.4 3.7 1.5 0.2 <NA>
12 4.8 3.4 1.6 0.2 <NA>
13 4.8 3.0 1.4 0.1 <NA>
14 4.3 3.0 1.1 0.1 setosa
15 5.8 4.0 1.2 0.2 setosa
Now, if I try to fill those NAs with "Unknown" using your code:
x = "Unknown"
iris$Species[which(is.na(iris$Species))] = x
which will generate:
Warning message:
In `[<-.factor`(`*tmp*`, which(is.na(iris$Species)), value = c(1L, :
invalid factor level, NA generated
What you need to do first is add the new level to the factor; after that, the assignment works:
levels(iris$Species) = c(levels(iris$Species), "Unknown")
levels(iris$Species)
[1] "setosa" "versicolor" "virginica" "Unknown"
#You can see now Unknown is one of the levels
iris$Species[which(is.na(iris$Species))] = "Unknown"
table(iris$Species)
setosa versicolor virginica Unknown
47 50 50 3
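
The same two steps can be applied to a data frame shaped like the one in the question, together with the weight part that is not covered above. This is only a sketch: the small kd below is made up for illustration, and only the column names (form_of_address, weight, ebook_count) come from the question.

```r
# Hypothetical stand-in for the asker's data frame `kd`; all values are invented.
kd <- data.frame(
  form_of_address = factor(c("Mr", NA, "Mrs", NA)),
  weight          = c(NA, 250, NA, 0),
  ebook_count     = c(1, 0, 0, 2)
)

# 1. Add the level first, then fill the NAs of the factor.
levels(kd$form_of_address) <- c(levels(kd$form_of_address), "Unknown")
kd$form_of_address[is.na(kd$form_of_address)] <- "Unknown"
levels(kd$form_of_address)

# 2. Set weight to 0 only where it is missing AND the row contains ebooks.
#    The !is.na() guard keeps rows with a missing ebook_count untouched.
is_ebook_gap <- is.na(kd$weight) & !is.na(kd$ebook_count) & kd$ebook_count > 0
kd$weight[is_ebook_gap] <- 0
kd$weight
```

Rows with a missing weight but ebook_count equal to 0 stay NA, which seems to be what the question asks for.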

Related

Looping Over Categories to Create Individual Regressions

In addendum to my previous question, I want to then create a for loop, each iteration of which would create a regression for each unique code I have created. More specifically, I want to create a regression with only the data that correspond to each unique code. How do I do this?
I have tried Googling "for loops in R" and have failed to find an answer that suits my need to iterate over categories rather than variables.

Here is an example predicting Sepal.Length from Sepal.Width, Petal.Length, and Petal.Width for each of three species using the iris data:
data(iris)
str(iris)
# 'data.frame': 150 obs. of 5 variables:
# $ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
# $ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
# $ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
# $ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
# $ Species : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
Now split and lapply:
iris.split <- split(iris[, -5], iris$Species)
iris.lm <- lapply(iris.split, \(x) lm(Sepal.Length~Sepal.Width+Petal.Length+Petal.Width, x))
lapply(iris.lm, summary)
The last line prints summary reports for the regressions for each species.
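
If you do want an explicit for loop over the categories, as the question asks, a minimal sketch could look like the following. The names fits and sp are just illustrative; the model formula is the one used above.

```r
data(iris)

# One fitted model per level of the grouping factor, stored in a named list.
fits <- list()
for (sp in levels(iris$Species)) {
  # Keep only the rows that belong to the current category.
  rows <- iris[iris$Species == sp, ]
  fits[[sp]] <- lm(Sepal.Length ~ Sepal.Width + Petal.Length + Petal.Width,
                   data = rows)
}

# Coefficients side by side, one column per species.
sapply(fits, coef)
```

This should give the same models as the split/lapply version; the loop just makes the iteration over categories visible.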

Spearman CI on subset of data

I've calculated Spearman's rho for a subset of my data:
cor.test(formula = ~ mvd_score + total_jealousy_score,
data = analytic_data_survey,
subset = sex == "Female",
method = "spearman")
But this does not give me a confidence interval, so I know I need to use spearmanCI:
spearmanCI(analytic_data_survey$mvd_score, analytic_data_survey$total_jealousy_score,
level = 0.95,
method = "Euclidean")
However, this runs on the full data rather than on the subset of female participants, and spearmanCI does not accept subset = sex == "Female".
Anyone have any suggestions/guidance?

The manual page for spearmanCI does not say that it supports the formula method used by cor.test, so you need to subset the data yourself. R has many ways of extracting data (?Extract) and subsetting data (?subset). Here is one approach using the iris data set that comes with R:
library(spearmanCI)
data(iris)
str(iris)
# 'data.frame': 150 obs. of 5 variables:
# $ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
# $ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
# $ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
# $ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
# $ Species : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
setosa <- iris[iris$Species=="setosa", ]
with(setosa, spearmanCI(Sepal.Length, Sepal.Width))
# confidence interval
# 2.5 % 97.5 %
# 0.625475 0.8836843
# sample estimate
# 0.7553375
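
Applied to the variable names from the question, the pattern would be roughly as follows. This is a sketch under assumptions: the data frame below is simulated because the real analytic_data_survey is not available, and the spearmanCI call is left commented out so the snippet also runs without that package. The base R lines only check that the manual subset matches what cor.test uses with its subset argument.

```r
# Simulated stand-in for analytic_data_survey; the values are random.
set.seed(1)
analytic_data_survey <- data.frame(
  sex                  = rep(c("Female", "Male"), each = 20),
  mvd_score            = rnorm(40),
  total_jealousy_score = rnorm(40)
)

# Subset first, then pass plain vectors to whichever function you need.
females <- subset(analytic_data_survey, sex == "Female")

# Sanity check: rho on the manual subset equals rho from the formula interface.
rho_subset  <- cor(females$mvd_score, females$total_jealousy_score,
                   method = "spearman")
rho_formula <- cor.test(~ mvd_score + total_jealousy_score,
                        data = analytic_data_survey,
                        subset = sex == "Female",
                        method = "spearman")$estimate

# With the spearmanCI package installed, the confidence interval would be:
# with(females, spearmanCI::spearmanCI(mvd_score, total_jealousy_score,
#                                      level = 0.95, method = "Euclidean"))
```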

Display column names that are only factors

Is there a way to extract only the names of the columns that are factors? For example, in the iris dataset the last column is a factor, so only Species (the column name, not the entire column) should be returned.
> head(iris)
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3.0 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa
4 4.6 3.1 1.5 0.2 setosa
5 5.0 3.6 1.4 0.2 setosa
6 5.4 3.9 1.7 0.4 setosa
> str(head(iris))
'data.frame': 6 obs. of 5 variables:
$ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4
$ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9
$ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7
$ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4
$ Species : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1

We can use:
names(iris)[sapply(iris, is.factor)]
#[1] "Species"
Or using Filter:
names(Filter(is.factor, iris))

Another solution which involves the dplyr package (if by chance you are already using it in your own project) is
library(dplyr)
names(iris %>% select_if(is.factor))
or equivalently (choose the one you like more)
iris %>% select_if(is.factor) %>% names()
Output
# [1] "Species"

Error using Levene test in R after a group by [error: is not a numeric variable]

I'm trying to use leveneTest from the car package in R with the iris dataset.
The code I have is:
library(tidyverse)
library(car)
iris %>% group_by (Species) %>% leveneTest( Sepal.Length )
From there I'm getting the following error:
Error in leveneTest.default(., Sepal.Length) :
. is not a numeric variable
I don't know how to fix this, since the columns seem to be of the right type:
> str(iris)
'data.frame': 150 obs. of 5 variables:
$ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
$ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
$ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
$ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
$ Species : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...

Your question is mainly about R syntax, which is not on topic on CrossValidated. That being said, you can either use the formula interface as in
leveneTest(Sepal.Length ~ Species, data=iris)
or state the data directly as in
leveneTest(y = iris$Sepal.Length, group = iris$Species)
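
The error in the question arises because the pipe hands the whole grouped data frame to leveneTest as its first argument, where a numeric vector is expected. To see what the test does with the two pieces it actually needs (a numeric response and a grouping factor), here is a base R sketch of a Levene-type test centred on the group medians, which as far as I know is what car::leveneTest computes by default. The names group_median, abs_dev and levene_by_hand are only illustrative.

```r
data(iris)

# Median of the response within each group, repeated for every row.
group_median <- ave(iris$Sepal.Length, iris$Species, FUN = median)

# Absolute deviation of each observation from its own group median.
abs_dev <- abs(iris$Sepal.Length - group_median)

# A one-way ANOVA on those deviations gives the F statistic of the test.
levene_by_hand <- anova(lm(abs_dev ~ iris$Species))
levene_by_hand
```

The F value and p-value in the first row of this table should agree with the output of the formula call above.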

Is there an easy way to separate categorical vs continuous variables into two dataset in R

Say I have about 500 variables available, and I'm trying to do variable selection for my model ( response is binary )
I am planning to do some kind of correlation analysis for all the continuous variables first, and then deal with the categorical ones.
Since there are a lot of variables involved, I can't do it manually.
Is there a function that I can use, or maybe a package?

I'm using the iris data set available in R. Then
sapply(iris, is.factor)
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
FALSE FALSE FALSE FALSE TRUE
will tell you whether each of your columns is a factor or not. So using
iris[ ,sapply(iris, is.factor)]
you can pick factor columns only. And
iris[ ,!sapply(iris, is.factor)]
will give you the columns which are not factors. You can also use is.numeric, is.character and the other is.* predicates in the same way.
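
One detail to watch when turning this into two data sets: if the selection matches exactly one column (as with the factor columns of iris), `[` returns a plain vector rather than a data frame unless drop = FALSE is given. A minimal sketch, where the names is_fac, categorical and continuous are just illustrative:

```r
data(iris)

# One TRUE/FALSE per column; vapply guarantees a logical vector.
is_fac <- vapply(iris, is.factor, logical(1))

# drop = FALSE keeps the result a data frame even for a single column.
categorical <- iris[, is_fac, drop = FALSE]
continuous  <- iris[, !is_fac, drop = FALSE]

names(categorical)
names(continuous)
```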

You can use str(df) to see which columns are factors and which are not (df is your dataframe). For example, for data iris in R:
str(iris)
'data.frame': 150 obs. of 5 variables:
$ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
$ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
$ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
$ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
$ Species : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
Or, you can use lapply(iris,class)
$Sepal.Length
[1] "numeric"
$Sepal.Width
[1] "numeric"
$Petal.Length
[1] "numeric"
$Petal.Width
[1] "numeric"
$Species
[1] "factor"

Create a function that returns TRUE when the number of unique values in a column is less than some fraction of its length (I'm picking 5%):
discreteL <- function(x) length(unique(x)) < 0.05*length(x)
Now sapply it (with negation for continuous variables) to the data.frame:
> str( iris[ , !sapply(iris, discreteL)] )
'data.frame': 150 obs. of 4 variables:
$ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
$ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
$ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
$ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
You could have picked a particular number, say 15, as your criterion I suppose.
I should make clear that the statistical theory suggests this procedure to be dangerous for the purpose outlined. Just picking the variables that are most correlated with a binary response is not well-supported. There have been many studies that show better approaches to variable selection. So my answer is really only how to do the separation, but not an endorsement of the overall plan that you have vaguely described.
