I want to one-hot encode in R through tidyverse, and not use packages such as caret, mltools, etc.
## Load vcd package
library(vcd)
## Load Arthritis dataset (data frame)
data(Arthritis)
Arthritis[1:5, ][2:5]
Treatment Sex Age Improved
1 Treated Male 27 Some
2 Treated Male 29 None
3 Treated Male 30 None
4 Treated Male 32 Marked
5 Treated Male 46 Marked
Is there an easy way to do this in tidyverse where I keep n-1 of the values for each categorical column? For example Sex is binary in this dataset so I would only need a one-hot encoded column for either Male or Female. The age feature would be ignored.
For your specific example you could do this:
library(dplyr)
Arthritis |>
as_tibble() |> # not necessary, just using it for output readability
mutate(sex_male = as.numeric(Sex) - 1)
#> # A tibble: 84 × 6
#> ID Treatment Sex Age Improved sex_male
#> <int> <fct> <fct> <int> <ord> <dbl>
#> 1 57 Treated Male 27 Some 1
#> 2 46 Treated Male 29 None 1
#> 3 77 Treated Male 30 None 1
#> 4 17 Treated Male 32 Marked 1
#> 5 36 Treated Male 46 Marked 1
#> 6 23 Treated Male 58 Marked 1
#> 7 75 Treated Male 59 None 1
#> 8 39 Treated Male 59 Marked 1
#> 9 33 Treated Male 63 None 1
#> 10 55 Treated Male 63 None 1
#> # … with 74 more rows
This only works because Sex is a factor variable with two levels/distinct values. More complex variables will need more attention, unless you are willing to use a function from a package.
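For a factor with more than two levels, the same idea can be sketched with dplyr and purrr alone. The helper name `one_hot_n1` and the toy data below are hypothetical, not part of the question; the sketch assumes the column is a factor and treats its first level as the reference level that gets no indicator column.

```r
library(dplyr)
library(purrr)

# Hypothetical helper: add one 0/1 column per level of `col`, skipping the first level
one_hot_n1 <- function(data, col) {
  lv <- levels(data[[col]])[-1]                      # drop the reference level
  dummies <- map(set_names(lv, paste0(col, "_", lv)),
                 ~ as.numeric(data[[col]] == .x))    # 1 where the row has that level
  bind_cols(data, as_tibble(dummies))
}

# Made-up toy data standing in for Arthritis
toy <- tibble::tibble(
  Age = c(27, 29, 30, 32),
  Improved = factor(c("Some", "None", "Marked", "None"),
                    levels = c("None", "Some", "Marked"))
)

result <- one_hot_n1(toy, "Improved")
```

Here `result` gains `Improved_Some` and `Improved_Marked` but no `Improved_None`, so a three-level factor yields two indicator columns.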
You are asking for a tidyverse solution. The recipes package is part of tidymodels.
library(recipes)
Arthritis |>
recipe(Improved ~ .) |>
step_dummy(Sex, Treatment) |>
prep() |>
bake(Arthritis)
#> # A tibble: 84 × 5
#> ID Age Improved Sex_Male Treatment_Treated
#> <int> <int> <ord> <dbl> <dbl>
#> 1 57 27 Some 1 1
#> 2 46 29 None 1 1
#> 3 77 30 None 1 1
#> 4 17 32 Marked 1 1
#> 5 36 46 Marked 1 1
#> 6 23 58 Marked 1 1
#> 7 75 59 None 1 1
#> 8 39 59 Marked 1 1
#> 9 33 63 None 1 1
#> 10 55 63 None 1 1
#> # … with 74 more rows
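If you would rather not list the columns by hand, a selector can pick every factor predictor at once. This is a minimal sketch on made-up data (the `toy` data frame and its outcome `y` are assumptions for illustration); it relies on `all_nominal_predictors()`, which is available in recent versions of recipes.

```r
library(recipes)

# Hypothetical toy data; `y` is a made-up outcome column
toy <- data.frame(
  y = c(1.2, 3.4, 2.2, 5.1),
  Sex = factor(c("Male", "Female", "Male", "Female")),
  Treatment = factor(c("Treated", "Placebo", "Placebo", "Treated"))
)

encoded <- recipe(y ~ ., data = toy) |>
  step_dummy(all_nominal_predictors()) |>  # n-1 dummy columns per factor predictor
  prep() |>
  bake(new_data = NULL)                    # NULL returns the processed training data
```

Numeric predictors are left untouched, which matches the requirement that Age be ignored.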
I agree with Till that recipes is the way to go here. But if you want a solution strictly from the tidyverse, you could do something like this:
library(vcd)
library(tidyverse)
Arthritis %>%
as_tibble() %>%
mutate(d = map_dfc(unique(Improved) %>%
set_names(.),
~ Improved == .x
) %>%
.[-1]
)
#> # A tibble: 84 × 6
#> ID Treatment Sex Age Improved d$None $Marked
#> <int> <fct> <fct> <int> <ord> <lgl> <lgl>
#> 1 57 Treated Male 27 Some FALSE FALSE
#> 2 46 Treated Male 29 None TRUE FALSE
#> 3 77 Treated Male 30 None TRUE FALSE
#> 4 17 Treated Male 32 Marked FALSE TRUE
#> 5 36 Treated Male 46 Marked FALSE TRUE
#> 6 23 Treated Male 58 Marked FALSE TRUE
#> 7 75 Treated Male 59 None TRUE FALSE
#> 8 39 Treated Male 59 Marked FALSE TRUE
#> 9 33 Treated Male 63 None TRUE FALSE
#> 10 55 Treated Male 63 None TRUE FALSE
#> # … with 74 more rows
You could use a combination of pivot_longer and pivot_wider for this.
Arthritis %>%
as_tibble() %>% # not necessary, for better viewing
mutate(across(everything(), as.character)) %>%
pivot_longer(c(Sex, Treatment, Improved), names_to = 'variable', values_to = 'value') %>% # specify the columns to encode here
mutate(ind = 1) %>%
unite(col_name, variable, value) %>%
pivot_wider(values_from = ind, names_from = col_name, values_fill = 0)
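For the n-1 encoding, once the data is in long format, you could filter out one value per variable. Note that a row whose values all fall in the dropped levels disappears entirely, which is why the output below has 78 rows instead of 84.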
long_format <- Arthritis %>%
as_tibble() %>%
mutate(across(everything(), as.character)) %>%
pivot_longer(c(Sex, Treatment, Improved), names_to = 'variable', values_to = 'value') %>%
mutate(ind = 1)
# for the n-1
values_to_keep <- long_format %>%
count(variable, value) %>%
group_by(variable) %>%
slice(-1) %>%
pull(value)
long_format %>%
filter(value %in% values_to_keep) %>%
unite(col_name, variable, value) %>%
pivot_wider(values_from = ind, names_from = col_name, values_fill = 0)
# A tibble: 78 x 6
ID Age Sex_Male Treatment_Treated Improved_Some Improved_None
<chr> <chr> <dbl> <dbl> <dbl> <dbl>
1 57 27 1 1 1 0
2 46 29 1 1 0 1
3 77 30 1 1 0 1
4 17 32 1 1 0 0
5 36 46 1 1 0 0
6 23 58 1 1 0 0
7 75 59 1 1 0 1
8 39 59 1 1 0 0
9 33 63 1 1 0 1
10 55 63 1 1 0 1
You may be able to use just model.matrix for this. I've altered your sample data a little to ensure there are 2 or more levels for all:
dat <- structure(list(Treatment = c("Treated", "Treated", "UnTreated", "Treated", "Treated"), Sex = c("Male", "Male", "Male", "FeMale", "Male"), Age = c(27L, 29L, 30L, 32L, 46L), Improved = c("Some", "None", "None", "Marked", "Marked")), class = "data.frame", row.names = c("1", "2", "3", "4", "5"))
dat
# Treatment Sex Age Improved
# 1 Treated Male 27 Some
# 2 Treated Male 29 None
# 3 UnTreated Male 30 None
# 4 Treated FeMale 32 Marked
# 5 Treated Male 46 Marked
From there,
isnum <- sapply(dat, is.numeric)
iscat <- !isnum & lengths(lapply(dat, unique)) > 1
paste("~ 0 +", paste(names(dat)[iscat], collapse = " + "))
# [1] "~ 0 + Treatment + Sex + Improved"
cbind(dat[, !iscat, drop=FALSE],
model.matrix(formula(paste("~ 0 +", paste(names(dat)[iscat], collapse = " + "))), data = dat))
# Age TreatmentTreated TreatmentUnTreated SexMale ImprovedNone ImprovedSome
# 1 27 1 0 1 0 1
# 2 29 1 0 1 1 0
# 3 30 0 1 1 1 0
# 4 32 1 0 0 0 0
# 5 46 1 0 1 0 0
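Note that with `~ 0 +` the first factor keeps all of its levels (both TreatmentTreated and TreatmentUnTreated appear above). If every categorical column should get n-1 columns, one option is to keep the intercept in the formula and drop the intercept column afterwards. A minimal base R sketch on the same altered sample data:

```r
dat <- data.frame(
  Treatment = c("Treated", "Treated", "UnTreated", "Treated", "Treated"),
  Sex = c("Male", "Male", "Male", "FeMale", "Male"),
  Age = c(27L, 29L, 30L, 32L, 46L),
  Improved = c("Some", "None", "None", "Marked", "Marked")
)

# With an intercept, each factor is coded against its first level (n-1 columns)
mm <- model.matrix(~ Treatment + Sex + Improved, data = dat)

# Drop the "(Intercept)" column and put the untouched numeric column back
encoded <- cbind(dat["Age"], mm[, -1, drop = FALSE])
```

The reference level of each variable is its first level in sorted order (Treated, FeMale, Marked here); convert the columns to factors with explicit `levels` if a different reference is wanted.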
Related
I have a task that's becoming quite difficult for me.
I have to create a variable (pr_test_1) to test whether a variable for a procedure (I10_PR1) is in a list of procedures, and this code is working great:
df <- df %>%
mutate(pr_test_1=ifelse(I10_PR1 %in% abl_pr, 1,0))
However, I have 25 variables for procedures (I10_PR1 to I10_PR25) and I have to create one for each (pr_test_1 to pr_test_25).
I don't seem to find the right syntax to get a for loop to work.
Any help will be greatly appreciated!
dplyr::across lets you apply a function to multiple columns chosen with a selector (the example below uses the starts_with selector).
library(dplyr)
library(purrr)
# sample data
df <- tibble::tibble(
I10_PR1 = sample(100),
I10_PR2 = sample(100),
I10_PR3 = sample(100),
I10_PR4 = sample(100)
)
# a sample list of values to compare against
match_list <- sample(10)
df %>%
mutate(
across(
starts_with("I10_PR"),
~ if_else(.x %in% match_list, 1, 0),
.names = "pr_test_{.col}"
)
)
#> # A tibble: 100 × 8
#> I10_PR1 I10_PR2 I10_PR3 I10_PR4 pr_test_I10_PR1 pr_test_I10…¹ pr_te…² pr_te…³
#> <int> <int> <int> <int> <dbl> <dbl> <dbl> <dbl>
#> 1 93 45 47 46 0 0 0 0
#> 2 91 89 90 76 0 0 0 0
#> 3 16 32 30 24 0 0 0 0
#> 4 66 26 46 41 0 0 0 0
#> 5 53 51 79 9 0 0 0 1
#> 6 36 64 61 32 0 0 0 0
#> 7 45 75 23 25 0 0 0 0
#> 8 86 61 77 52 0 0 0 0
#> 9 17 87 64 53 0 0 0 0
#> 10 6 42 57 33 1 0 0 0
#> # … with 90 more rows, and abbreviated variable names ¹pr_test_I10_PR2,
#> # ²pr_test_I10_PR3, ³pr_test_I10_PR4
Created on 2022-10-26 with reprex v2.0.2
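The new columns above are named pr_test_I10_PR1 and so on, while the question asks for pr_test_1 to pr_test_25. One way to get there is to rename after the fact with `rename_with`. The data and `abl_pr` values below are made-up stand-ins for the asker's objects.

```r
library(dplyr)

# Hypothetical stand-ins for the asker's data and procedure list
df <- tibble::tibble(
  I10_PR1 = c("A", "B", "C"),
  I10_PR2 = c("D", "A", "E")
)
abl_pr <- c("A", "E")

result <- df %>%
  mutate(across(starts_with("I10_PR"),
                ~ if_else(.x %in% abl_pr, 1, 0),
                .names = "pr_test_{.col}")) %>%
  # strip the original prefix so pr_test_I10_PR1 becomes pr_test_1
  rename_with(~ sub("^pr_test_I10_PR", "pr_test_", .x),
              starts_with("pr_test_"))
```

The original I10_PR columns keep their names because the second selector only matches the newly created columns.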
This for() loop works perfectly with your one (slightly modified) line of code and dynamic variable names
for(i in 1:3){
df <- df %>%
mutate(!!paste0("pr_test_",i) := ifelse(!!as.name(paste0("I10_PR",i)) %in% abl_pr, 1,0))
}
Data used:
abl_pr <- sample(LETTERS)[1:10]
I10_PR1 <- sample(LETTERS)
I10_PR2 <- sample(LETTERS)
I10_PR3 <- sample(LETTERS)
df <- data.frame(I10_PR1,I10_PR2,I10_PR3)
I have the following dataset:
Lines <- "id time sex Age A
1 1 male 90 0
1 2 male 91 0
1 3 male 92 1
2 1 female 87 0
2 2 female 88 0
2 3 female 89 0
3 1 male 50 0
3 2 male 51 1
3 3 male 52 0
4 1 female 54 0
4 2 female 55 0
4 3 female 56 0"
I would like to obtain the mean value for the numeric variable and the highest value for the binary variables (e.g. if 0 and 1, then 1). And for the string variables, just the name as this does not vary across id.
The resulting data frame should be something like this:
Lines <- "id time sex Age A
1 2 male 91 1
2 2 female 88 0
3 2 male 51 1
4 2 female 55 0"
I have more or less an idea:
Lines <- Lines %>%
group_by(id, across(where(is.character))) %>%
summarise(across(where(all(column %in% 0:1)), max)) %>%
summarise(across(where(is.numeric), mean)) %>%
ungroup
I thought about using across() to apply mean to only the numeric variables in one go, but I noticed they are stored as integer, just like your binary variable. So we either have to change their classes first and then use across(), or explicitly name every variable like this:
library(dplyr)
df %>%
group_by(id) %>%
summarise(id = first(id), sex = first(sex), time = mean(time),
Age = mean(Age), A = max(A))
# A tibble: 4 x 5
id sex time Age A
<int> <chr> <dbl> <dbl> <int>
1 1 male 2 91 1
2 2 female 2 88 0
3 3 male 2 51 1
4 4 female 2 55 0
Here is another solution, however I coerced some variables into classes mentioned in the question before applying our transformation. It would be a good idea if you could do it too:
df %>%
group_by(id) %>%
mutate(across(c(time, Age), as.numeric)) %>%
summarise(id = first(id), sex = first(sex),
across(where(is.integer), max, .names = "{.col}"),
across(where(is.double), mean, .names = "{.col}"))
# A tibble: 4 x 5
id sex A time Age
<int> <chr> <int> <dbl> <dbl>
1 1 male 1 2 91
2 2 female 0 2 88
3 3 male 1 2 51
4 4 female 0 2 55
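If coercing classes by hand is not practical, the binary columns can also be detected by their values, which is close to what the attempt in the question was reaching for. This is a sketch under the assumption that "binary" means a numeric column containing only 0 and 1; the helper names `is_binary` and `is_continuous` are made up for illustration.

```r
library(dplyr)

Lines <- "id time sex Age A
1 1 male 90 0
1 2 male 91 0
1 3 male 92 1
2 1 female 87 0
2 2 female 88 0
2 3 female 89 0
3 1 male 50 0
3 2 male 51 1
3 3 male 52 0
4 1 female 54 0
4 2 female 55 0
4 3 female 56 0"
df <- read.table(text = Lines, header = TRUE)

# Hypothetical predicates: 0/1-only numeric columns vs. all other numeric columns
is_binary <- function(x) is.numeric(x) && all(x %in% 0:1)
is_continuous <- function(x) is.numeric(x) && !is_binary(x)

result <- df %>%
  group_by(id, across(where(is.character))) %>%   # strings are constant within id
  summarise(across(where(is_binary), max),
            across(where(is_continuous), mean),
            .groups = "drop")
```

Grouping columns are skipped by across(), so id and sex pass through unchanged while A gets max and time and Age get mean.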
I am applying a user-defined function to the numeric variables of a dataset, but when I apply it with map, the first column of each result is named x instead of the variable's name. How do I replace x with the variable name in map functions?
dataset: hd_trn
age sex cp trestbps chol fbs restecg thalach exang
<int> <fctr> <fctr> <int> <int> <fctr> <fctr> <int> <fctr>
63 1 1 145 233 1 2 150 0
67 1 4 160 286 0 2 108 1
67 1 4 120 229 0 2 129 1
37 1 3 130 250 0 0 187 0
41 0 2 130 204 0 2 172 0
56 1 2 120 236 0 0 178 0
User-defined function to find the most frequent elements of each column:
top_freq_elements <- function(x){
table(x) %>% as.data.frame() %>% top_n(5, Freq) %>% arrange(desc(Freq))
}
Applying function
hd_trn %>% select_if(is.numeric) %>% map(., .f = top_freq_elements)
######### output #########
x Freq
<fctr> <int>
54 51
58 43
55 41
56 38
57 38
Desired: in the output above, I want the variable name instead of x.
I tried rewriting the code with imap, as below, but that does not give the variable name either:
hd_trn %>%
select_if(is.numeric) %>%
imap(function(feature_value, feature_name){
table(feature_value) %>%
as.data.frame() %>% #head()
rename(feature_name = feature_value) %>%
top_n(5, Freq) %>%
arrange(desc(Freq))
})
######### output #########
feature_name Freq
<fctr> <int>
54 51
58 43
55 41
56 38
57 38
You can rename the first column of each list element:
library(dplyr)
library(purrr)
iris %>%
select(where(is.numeric)) %>%
imap(function(feature_value, feature_name){
table(feature_value) %>%
as.data.frame() %>%
rename_with(~feature_name, 1) %>%
slice_max(n = 5, Freq) %>%
arrange(desc(Freq))
})
This could be achieved using e.g. curly-curly {{ and := in rename like so:
top_freq_elements <- function(x){
table(x) %>% as.data.frame() %>% top_n(5, Freq) %>% arrange(desc(Freq))
}
library(dplyr)
library(purrr)
hd_trn %>%
select_if(is.numeric) %>%
imap(function(feature_value, feature_name){
table(feature_value) %>%
as.data.frame() %>% #head()
rename({{feature_name}} := feature_value) %>%
top_n(5, Freq) %>%
arrange(desc(Freq))
})
#> $age
#> age Freq
#> 1 67 2
#> 2 37 1
#> 3 41 1
#> 4 56 1
#> 5 63 1
#>
#> $sex
#> sex Freq
#> 1 1 5
#> 2 0 1
#>
#> $cp
#> cp Freq
#> 1 2 2
#> 2 4 2
#> 3 1 1
#> 4 3 1
#>
#> $trestbps
#> trestbps Freq
#> 1 120 2
#> 2 130 2
#> 3 145 1
#> 4 160 1
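The same naming idea also works without going through table(). The sketch below is an alternative, not the code from the answers: it uses made-up toy data in place of hd_trn, counts values with dplyr::count, and injects the column name with `!!name :=`. Unlike the table() route, the value column stays numeric instead of becoming a factor.

```r
library(dplyr)
library(purrr)

# Made-up toy data standing in for the numeric columns of hd_trn
toy <- tibble::tibble(
  age  = c(63, 67, 67, 37),
  chol = c(233, 286, 229, 233)
)

top_freq <- imap(toy, function(values, name) {
  tibble::tibble(value = values) %>%
    count(value, name = "Freq", sort = TRUE) %>%  # most frequent values first
    slice_head(n = 5) %>%                         # keep at most five rows
    rename(!!name := value)                       # name the column after the variable
})
```

Each element of `top_freq` is a small table whose first column carries the variable's own name, for example `top_freq$age` has columns age and Freq.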
I am given a data set called ChickWeight. This has the weights of chicks over a time period. I need to introduce a new variable that measures the current weight difference compared to day 0.
I first cleaned the data set and kept only the chicks that were recorded for all 12 weigh-ins:
library(datasets)
library(dplyr)
Frequency <- plyr::count(ChickWeight$Chick) # plyr::count returns the columns x and freq used below
colnames(Frequency)[colnames(Frequency)=="x"] <- "Chick"
a <- inner_join(ChickWeight, Frequency, by='Chick')
complete <- a[(a$freq == 12),]
head(complete,3)
The ChickWeight data set ships with R in the datasets package.
You can try:
library(dplyr)
ChickWeight %>%
group_by(Chick) %>%
filter(any(Time == 21)) %>%
mutate(wdiff = weight - first(weight))
# A tibble: 540 x 5
# Groups: Chick [45]
weight Time Chick Diet wdiff
<dbl> <dbl> <ord> <fct> <dbl>
1 42 0 1 1 0
2 51 2 1 1 9
3 59 4 1 1 17
4 64 6 1 1 22
5 76 8 1 1 34
6 93 10 1 1 51
7 106 12 1 1 64
8 125 14 1 1 83
9 149 16 1 1 107
10 171 18 1 1 129
# ... with 530 more rows
I am given a data set called ChickWeight. This has the weights of chicks over a time period. I need to introduce a new variable that measures the current weight difference compared to day 0. The data set is in library(datasets), so you should have it.
library(dplyr)
weightgain <- ChickWeight %>%
group_by(Chick) %>%
filter(any(Time == 21)) %>%
mutate(weightgain = weight - first(weight))
I have this code, but it just subtracts 42, the weight at time 0 for chick 1, from every weight. I need each chick's own weight at time 0 to be subtracted so that the weightgain column is correct.
We could do
library(dplyr)
ChickWeight %>%
group_by(Chick) %>%
mutate(weightgain = weight - weight[Time == 0])
#Or mutate(weightgain = weight - first(weight))
# A tibble: 578 x 5
# Groups: Chick [50]
# weight Time Chick Diet weightgain
# <dbl> <dbl> <ord> <fct> <dbl>
# 1 42 0 1 1 0
# 2 51 2 1 1 9
# 3 59 4 1 1 17
# 4 64 6 1 1 22
# 5 76 8 1 1 34
# 6 93 10 1 1 51
# 7 106 12 1 1 64
# 8 125 14 1 1 83
# 9 149 16 1 1 107
#10 171 18 1 1 129
# … with 568 more rows
Or using base R ave
with(ChickWeight, ave(weight, Chick, FUN = function(x) x - x[1]))
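A quick way to confirm that the baseline is taken per chick is to compare the two approaches and check that every chick's gain at time 0 is zero. This is a small sanity-check sketch; the object names `gain_dplyr` and `gain_base` are made up for illustration.

```r
library(dplyr)

gain_dplyr <- ChickWeight %>%
  as.data.frame() %>%                                # drop the extra groupedData classes
  group_by(Chick) %>%
  mutate(weightgain = weight - weight[Time == 0]) %>%
  ungroup()

# Base R equivalent; relies on rows being ordered by Time within each chick
gain_base <- with(ChickWeight, ave(weight, Chick, FUN = function(x) x - x[1]))

# Both methods agree, and each chick starts from a gain of zero
same_result   <- all(gain_dplyr$weightgain == gain_base)
zero_at_start <- all(gain_dplyr$weightgain[gain_dplyr$Time == 0] == 0)
```

If `zero_at_start` were FALSE, the subtraction would be using one global baseline rather than each chick's own.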