Recoding variables with R

Recoding variables in R seems to be my biggest headache. Which functions, packages, and processes do you use to ensure the best result?
I've found very few useful examples on the Internet that give a one-size-fits-all solution to recoding and I'm interested to see what you guys and gals are using.
Note: This may be a community wiki topic.

Recoding can mean a lot of things, and is fundamentally complicated.
Changing the levels of a factor can be done using the levels function:
> #change the levels of a factor
> levels(veteran$celltype) <- c("s","sc","a","l")
Transforming a continuous variable simply involves the application of a vectorized function:
> mtcars$mpg.log <- log(mtcars$mpg)
For binning continuous data, look at cut and cut2 (in the Hmisc package). For example:
> #make 4 groups with equal sample sizes
> mtcars[['mpg.tr']] <- cut2(mtcars[['mpg']], g=4)
> #make 4 groups with equal bin width
> mtcars[['mpg.tr2']] <- cut(mtcars[['mpg']],4, include.lowest=TRUE)
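If Hmisc is not available, roughly equal-sized groups can also be sketched in base R by passing quantile breaks to cut. This is a minimal sketch and not a description of what cut2 does internally; the column name mpg.q is made up for illustration.

```r
# Quartile breaks (0%, 25%, 50%, 75%, 100%) of mpg
breaks <- quantile(mtcars$mpg, probs = seq(0, 1, by = 0.25))

# include.lowest = TRUE keeps the minimum value inside the first bin
mtcars$mpg.q <- cut(mtcars$mpg, breaks = breaks, include.lowest = TRUE)

# Inspect the group sizes
table(mtcars$mpg.q)
```

Group sizes are only approximately equal when several observations tie at a break.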
For recoding continuous or factor variables into a categorical variable, there is recode in the car package and recode.variables in the Deducer package:
> mtcars[c("mpg.tr2")] <- recode.variables(mtcars[c("mpg")] , "Lo:14 -> 'low';14:24 -> 'mid';else -> 'high';")
If you are looking for a GUI, Deducer implements recoding with the Transform and Recode dialogs:
http://www.deducer.org/pmwiki/pmwiki.php?n=Main.TransformVariables
http://www.deducer.org/pmwiki/pmwiki.php?n=Main.RecodeVariables

I found mapvalues from the plyr package very handy. The package also contains the function revalue, which is similar to car::recode.
The following example recodes the letters r, o, m, a, and n to upper case and leaves everything else untouched:
> mapvalues(letters, from = c("r", "o", "m", "a", "n"), to = c("R", "O", "M", "A", "N"))
[1] "A" "b" "c" "d" "e" "f" "g" "h" "i" "j" "k" "l" "M" "N" "O" "p" "q" "R" "s" "t" "u" "v" "w" "x" "y" "z"

I find this very convenient when several values should be transformed (it's like doing recodes in Stata):
# load package and gen some data
require(car)
x <- 1:10
# do the recoding
x
## [1] 1 2 3 4 5 6 7 8 9 10
recode(x,"10=1; 9=2; 1:4=-99")
## [1] -99 -99 -99 -99 5 6 7 8 2 1
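For comparison, the same recode can probably be written in base R with logical indexing. This is only a sketch (the name out is made up); the point is that every condition is tested against the original x, so a value changed by one rule is not changed again by a later one.

```r
x <- 1:10
out <- x

# Each rule indexes with the ORIGINAL x, not with the partly recoded out
out[x %in% 1:4] <- -99
out[x == 9] <- 2
out[x == 10] <- 1

out
# [1] -99 -99 -99 -99   5   6   7   8   2   1
```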

I've found that it can sometimes be easier to convert non-numeric factors to character before attempting to change them. For example:
df <- data.frame(example=letters[1:26])
example <- as.character(df$example)
example[example %in% letters[1:20]] <- "a"
example[example %in% letters[21:26]] <- "b"
Also, when importing data, it can be useful to ensure that numbers are actually numeric before attempting to convert:
df <- data.frame(example=1:100)
example <- as.numeric(df$example)
example[example < 20] <- 1
example[example >= 20 & example < 80] <- 2
example[example >= 80] <- 3
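The three threshold assignments above can also be expressed as a single call to cut. A minimal sketch, assuming the same cut points of 20 and 80 (the name recoded is made up for illustration):

```r
example <- 1:100

# right = FALSE gives the intervals [-Inf, 20), [20, 80), [80, Inf)
# labels = FALSE returns the integer bin codes 1, 2, 3 instead of a factor
recoded <- cut(example, breaks = c(-Inf, 20, 80, Inf),
               right = FALSE, labels = FALSE)

table(recoded)
```

This avoids overwriting values step by step, which can go wrong when a new code falls inside a range tested by a later rule.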

When you want to recode levels of a factor, forcats might come in handy. You can read a chapter of R for Data Science for an extensive tutorial, but here is the gist of it.
library(tidyverse)
library(forcats)
gss_cat %>%
mutate(partyid = fct_recode(partyid,
"Republican, strong" = "Strong republican",
"Republican, weak" = "Not str republican",
"Independent, near rep" = "Ind,near rep",
"Independent, near dem" = "Ind,near dem",
"Democrat, weak" = "Not str democrat",
"Democrat, strong" = "Strong democrat",
"Other" = "No answer",
"Other" = "Don't know",
"Other" = "Other party"
)) %>%
count(partyid)
#> # A tibble: 8 × 2
#> partyid n
#> <fctr> <int>
#> 1 Other 548
#> 2 Republican, strong 2314
#> 3 Republican, weak 3032
#> 4 Independent, near rep 1791
#> 5 Independent 4119
#> 6 Independent, near dem 2499
#> # ... with 2 more rows
You can even let R decide what categories (factor levels) to merge together.
Sometimes you just want to lump together all the small groups to make a plot or table simpler. That’s the job of fct_lump(). [...] The default behaviour is to progressively lump together the smallest groups, ensuring that the aggregate is still the smallest group.
gss_cat %>%
  mutate(relig = fct_lump(relig)) %>%
  count(relig)
#> # A tibble: 2 × 2
#> relig n
#> <fctr> <int>
#> 1 Protestant 10846
#> 2 Other 10637
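Without forcats, levels can be merged in base R by assigning a named list to levels(). The sketch below uses a small made-up factor (party) rather than gss_cat; note that any old level not mentioned in the list becomes NA.

```r
party <- factor(c("No answer", "Strong republican", "Don't know", "Other party"))

# Names are the new levels, values are the old levels they absorb
levels(party) <- list(
  "Republican, strong" = "Strong republican",
  "Other" = c("No answer", "Don't know", "Other party")
)

as.character(party)
# [1] "Other"              "Republican, strong" "Other"              "Other"
```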

Consider this sample data.
df <- data.frame(a = 1:5, b = 5:1)
df
# a b
#1 1 5
#2 2 4
#3 3 3
#4 4 2
#5 5 1
Here are two options -
1. case_when :
For single column -
library(dplyr)
df %>%
mutate(a = case_when(a == 1 ~ 'a',
a == 2 ~ 'b',
a == 3 ~ 'c',
a == 4 ~ 'd',
a == 5 ~ 'e'))
# a b
#1 a 5
#2 b 4
#3 c 3
#4 d 2
#5 e 1
For multiple columns -
df %>%
mutate(across(c(a, b), ~case_when(. == 1 ~ 'a',
. == 2 ~ 'b',
. == 3 ~ 'c',
. == 4 ~ 'd',
. == 5 ~ 'e')))
# a b
#1 a e
#2 b d
#3 c c
#4 d b
#5 e a
2. dplyr::recode :
For single column -
df %>%
mutate(a = recode(a, '1' = 'a', '2' = 'b', '3' = 'c', '4' = 'd', '5' = 'e'))
For multiple columns -
df %>%
mutate(across(c(a, b),
~recode(., '1' = 'a', '2' = 'b', '3' = 'c', '4' = 'd', '5' = 'e')))

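Create a lookup vector using setNames, then match on name. Convert a factor to character first; otherwise it indexes the lookup by its integer codes and only works when the level order happens to match:
# iris as an example data
table(iris$Species)
# setosa versicolor virginica
# 50 50 50
x <- setNames(c("x","y","z"), c("setosa","versicolor","virginica"))
iris$Species <- unname(x[ as.character(iris$Species) ])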
table(iris$Species)
# x y z
# 50 50 50
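One caveat with lookup vectors is worth a small demonstration: indexing with a factor uses its integer codes, not its labels. The sketch below uses a made-up factor f whose level order differs from the order of the lookup names.

```r
lookup <- setNames(c("x", "y", "z"), c("setosa", "versicolor", "virginica"))

# Levels deliberately in a different order than the lookup names
f <- factor(c("virginica", "setosa"),
            levels = c("virginica", "setosa", "versicolor"))

by_code <- unname(lookup[f])                # uses codes 1, 2: wrong result
by_name <- unname(lookup[as.character(f)])  # matches on the labels

by_code
# [1] "x" "y"
by_name
# [1] "z" "x"
```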

Related

Automating the process of recoding numeric variables to meaningful factor variables

I have a large data frame (hundreds of variables wide) in which all values of categorical variables are saved as numerics, for example, 1, 2, 8, representing no, yes, and unknown.
However, this is not always consistent. There are variables that have ten or more categories with 88 representing unknown etc.
data <- data.frame("ID" = c(1:5),
"Var1" = c(2,2,8,1,8),
"Var2" = c(5,8,4,88,10))
For each variable, I do have all information on which value represents which category. Currently, I have this information stored in vectors that are each correctly ordered, like
> Var1_values
[1] 8 2 1
with a corresponding vector containing the categories:
> Var1_categories
[1] "unknown" "yes" "no"
But I cannot figure out a process for how to bring this information together in order to automate the recoding process towards an expected result like
| ID | Var1 | Var2 |
|----|---------|-------------------|
| 1 | yes | condition E |
| 2 | yes | condition H |
| 3 | unknown | condition D |
| 4 | no | unknown condition |
| 5 | unknown | condition H |
where each column is a meaningful factor variable.
As I said, the data frame is very wide and things might change internally, so doing this manually is not an option. I feel like I'm being stupid as I have all the necessary information readily available, so any insight would be greatly appreciated, and a cup of coffee is the least I can do for helpful advice.
// edit:
I forgot to mention that I have already made some kind of a mapping-dataframe but I couldn't really put it to use, yet. It looks like this:
mapping <- data.frame("Variable" = c("Var1", "Var2", "Var3", "Var4"),
"Value1" = c(2,2,2,7),
"Word1" = c("yes","yes","yes","condition A"),
"Value2" = c(1,1,1,6),
"Word2" = c("no","no","no","Condition B"),
"Value3" = c(8,8,8,5),
"Word3" = c("unk","unk","unk", "Condition C"),
"Value4" = c(NA,NA,NA,4),
"Word4" = c(NA,NA,NA,"Condition B")
)
I would like to "long"-transform it so I can use it with @r2evan's solution.
Here's one thought, though it requires reshaping (twice) the data.
mapping <- data.frame(
Var = c(rep("Var1", 3), rep("Var2", 5)),
Val = c(1, 2, 8, 4, 5, 8, 10, 88),
Words = c("no", "yes", "unk", "D", "E", "H", "H", "unk")
)
mapping
# Var Val Words
# 1 Var1 1 no
# 2 Var1 2 yes
# 3 Var1 8 unk
# 4 Var2 4 D
# 5 Var2 5 E
# 6 Var2 8 H
# 7 Var2 10 H
# 8 Var2 88 unk
library(dplyr)
library(tidyr) # pivot_*
data %>%
pivot_longer(-ID, names_to = "Var", values_to = "Val") %>%
left_join(mapping, by = c("Var", "Val")) %>%
pivot_wider(id_cols = ID, names_from = "Var", values_from = "Words")
# # A tibble: 5 x 3
# ID Var1 Var2
# <int> <chr> <chr>
# 1 1 yes E
# 2 2 yes H
# 3 3 unk D
# 4 4 no unk
# 5 5 unk H
With this method, you control the number-to-words mapping for each variable.
Another option is to use a map list, similar to above but it does not require double-reshaping.
maplist <- list(
Var1 = c("1" = "no", "2" = "yes", "8" = "unk"),
Var2 = c("4" = "D", "5" = "E", "8" = "H", "10" = "H", "88" = "unk")
)
maplist
# $Var1
# 1 2 8
# "no" "yes" "unk"
# $Var2
# 4 5 8 10 88
# "D" "E" "H" "H" "unk"
nms <- c("Var1", "Var2")
data[,nms] <- Map(function(val, lookup) lookup[as.character(val)],
data[nms], maplist[nms])
data
# ID Var1 Var2
# 1 1 yes E
# 2 2 yes H
# 3 3 unk D
# 4 4 no unk
# 5 5 unk H
Between the two, I think I prefer the first if your data doesn't punish you for reshaping it (many things could make this less appealing). One reason it's good is that maintaining the mapping can be as easy as maintaining a CSV (which might be done in your favorite spreadsheet tool, e.g., Excel or Calc).
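The wide mapping data frame from the question's edit can be brought into the long Var/Val/Words layout used above. The following is a base R sketch under the assumption that the columns are always named Value1..ValueN and Word1..WordN in matching order; the names value_cols, word_cols and long are made up for illustration.

```r
mapping <- data.frame(
  Variable = c("Var1", "Var2", "Var3", "Var4"),
  Value1 = c(2, 2, 2, 7), Word1 = c("yes", "yes", "yes", "condition A"),
  Value2 = c(1, 1, 1, 6), Word2 = c("no", "no", "no", "Condition B"),
  Value3 = c(8, 8, 8, 5), Word3 = c("unk", "unk", "unk", "Condition C"),
  Value4 = c(NA, NA, NA, 4), Word4 = c(NA, NA, NA, "Condition B"),
  stringsAsFactors = FALSE
)

value_cols <- grep("^Value", names(mapping), value = TRUE)
word_cols  <- grep("^Word",  names(mapping), value = TRUE)

# unlist() stacks the columns one after another, so the variable
# names are repeated once per Value/Word pair
long <- data.frame(
  Var   = rep(mapping$Variable, times = length(value_cols)),
  Val   = unlist(mapping[value_cols], use.names = FALSE),
  Words = unlist(mapping[word_cols],  use.names = FALSE),
  stringsAsFactors = FALSE
)

# Drop the padding rows and sort for readability
long <- long[!is.na(long$Val), ]
long <- long[order(long$Var, long$Val), ]
long
```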
Here's a way of doing it that requires no reshaping of your original data, and can be conceivably applied to any number of columns. First, put all your existing "values" and "categories" vectors into lists, formatting all as characters:
library(tidyverse)
# Recreating your existing vectors
Var1_values <- c(8, 2, 1)
Var2_values <- c(88, 10, 8, 5, 4)
Var1_categories <- c("unknown", "yes", "no")
Var2_categories <- c("unknown condition", "condition H", "condition H", "condition E", "condition D")
Var_values <- list(Var1_values, Var2_values) %>%
map(as.character)
Var_categories <- list(Var1_categories, Var2_categories)
Add names to each element of each vector in Var_categories using Var_values, and get a list of variable names to recode from your dataset:
for (i in seq_along(Var_categories)) {
names(Var_categories[[i]]) <- Var_values[[i]]
}
vars_names <- str_subset(colnames(data), "Var")
Then, use map2 to recode all of your target variables, before transforming into a tibble with ID column.
data_recoded <- map2(vars_names, Var_categories, ~ dplyr::recode(unlist(data[.x], use.names = FALSE), !!!.y)) %>%
  as_tibble(.name_repair = ~ vars_names) %>%
  add_column(ID = data$ID, .before = "Var1")
Output (data_recoded):
ID Var1 Var2
<int> <chr> <chr>
1 1 yes condition E
2 2 yes condition H
3 3 unknown condition D
4 4 no unknown condition
5 5 unknown condition H
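Since the question asks for factor columns, another possibility is base R's factor() with its levels and labels arguments, which takes the values and categories vectors directly. This is a sketch only; the codebook list and the name recoded are made up for illustration, and duplicated labels (two codes mapping to "condition H") need R 3.4.0 or later.

```r
data <- data.frame(ID = 1:5,
                   Var1 = c(2, 2, 8, 1, 8),
                   Var2 = c(5, 8, 4, 88, 10))

# Hypothetical codebook: one entry per variable, values and categories aligned
codebook <- list(
  Var1 = list(values = c(8, 2, 1),
              categories = c("unknown", "yes", "no")),
  Var2 = list(values = c(88, 10, 8, 5, 4),
              categories = c("unknown condition", "condition H", "condition H",
                             "condition E", "condition D"))
)

recoded <- data
nms <- names(codebook)

# factor() maps each numeric code (levels) to its category (labels)
recoded[nms] <- Map(
  function(x, cb) factor(x, levels = cb$values, labels = cb$categories),
  data[nms], codebook
)
recoded
```

Codes that are missing from the codebook come out as NA, which makes gaps in the mapping easy to spot.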

Row-wise Boolean comparison of data

I have grouped my data by the appropriate grouping, and I need to be sure that the "x" and "y" values equal each other for each unique combination of Group1 and Group2. In other words, what code could I use to cycle through this dataset and ensure that A1x == A1y, A2x == A2y, and so on?
"Group1","Group2","group3","values"
"A" "1" x 10
"A" "1" y 10
"A" "2" x 15
"A" "2" y 15
To help make the answer easier, here is the data.frame from the example
d <- data.frame(Group1= c("A", "A", "A", "A"),
Group2= c("1", "1", "2", "2"),
group3= c("x", "y", "x", "y"),
values= c(10, 10, 15, 15))
With dplyr, you can do:
library(dplyr)
d %>%
group_by(Group1, Group2) %>%
mutate(cond = all(values == first(values)))
Group1 Group2 group3 values cond
<fct> <fct> <fct> <dbl> <lgl>
1 A 1 x 10 TRUE
2 A 1 y 10 TRUE
3 A 2 x 15 TRUE
4 A 2 y 15 TRUE
Or:
d %>%
group_by(Group1, Group2) %>%
mutate(cond = n_distinct(values) == 1)
You can also do this with pivot_wider:
tidyr::pivot_wider(d, names_from='group3', values_from='values') %>%
dplyr::mutate(eq=x==y)
I think you went too far in turning your data into long format. Maybe this is easier to manipulate:
d %>%
pivot_wider(names_from = group3,values_from = values) %>%
mutate(is_equal = x == y)
Here is a base R solution using ave(), which flags each row with whether all values in its Group1/Group2 combination are identical:
d <- within(d, isequal <- as.logical(ave(values, Group1, Group2, FUN = function(v) length(unique(v)) == 1)))
such that
> d
Group1 Group2 group3 values isequal
1 A 1 x 10 TRUE
2 A 1 y 10 TRUE
3 A 2 x 15 TRUE
4 A 2 y 15 TRUE
Another option if the data is grouped properly and has 2 rows for each group:
d$check <- rep(d$values[seq(1L,nrow(d),2L)]==d$values[seq(2L,nrow(d),2L)], each=2L)
A simple way would be to merge the sub tables with group x and group y to compare the values.
> d[d$group3=="y",]
# Group1 Group2 group3 values
# 2 A 1 y 10
# 4 A 2 y 15
> merge(d[d$group3=="y",],d[d$group3=="x",],by=c("Group1","Group2"))
# Group1 Group2 group3.x values.x group3.y values.y
# 1 A 1 y 10 x 10
# 2 A 2 y 15 x 15
with(merge(d[d$group3=="y",], d[d$group3=="x",],
by=c("Group1","Group2")),
values.x==values.y)
## [1] TRUE TRUE
Of course there are fancier ways of doing it, but it is not bad to start simple.
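To get one verdict per Group1/Group2 combination rather than one per row, a base R sketch with aggregate could look like the following. The data frame d2 is a made-up variant of the example with a deliberate mismatch in the second group, and all_equal is a hypothetical column name.

```r
d2 <- data.frame(Group1 = c("A", "A", "A", "A"),
                 Group2 = c("1", "1", "2", "2"),
                 group3 = c("x", "y", "x", "y"),
                 values = c(10, 10, 15, 16))

# One row per group: TRUE when the group contains a single distinct value
res <- aggregate(values ~ Group1 + Group2, data = d2,
                 FUN = function(v) length(unique(v)) == 1)
names(res)[3] <- "all_equal"
res
```

Unlike a pairwise x == y comparison, this also works when a group has more than two rows.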

Is there a way to find the indices of common (exactly the same) elements in a dataframe?

Given a dataframe such as,
num <- c(5,10,15,20,25)
letter <- c("A", "B", "A", "C", "B")
thelist <- data.frame(num, letter)
I need to find the indices where the letters are the same.
Output:
A 1 3
B 2 5
C 4
Then, take these indices and find the mean of those indices in num.
Output:
A 10
B 17.5
C 20
I cannot use loops or if statements, I am looking at using a sort of apply, which, etc.
As the objective is to find the mean for each 'letter', it is better to group by 'letter' and get the mean of 'num':
library(dplyr)
thelist %>%
group_by(letter) %>%
summarise(num = mean(num))
# A tibble: 3 x 2
# letter num
# <fct> <dbl>
#1 A 10
#2 B 17.5
#3 C 20
or in base R
aggregate(num ~ letter, thelist, mean)
To find the indices of rows with the same 'letter', we can split the sequence of rows by 'letter':
split(seq_len(nrow(thelist)), thelist$letter)
#$A
#[1] 1 3
#$B
#[1] 2 5
#$C
#[1] 4
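The index list can then be fed to sapply to take the mean of num at those positions, which follows the two steps in the question literally and uses no explicit loop. A minimal sketch (idx and means are made-up names):

```r
num <- c(5, 10, 15, 20, 25)
letter <- c("A", "B", "A", "C", "B")
thelist <- data.frame(num, letter)

# Step 1: row indices for each letter
idx <- split(seq_len(nrow(thelist)), thelist$letter)

# Step 2: mean of num at those indices
means <- sapply(idx, function(i) mean(thelist$num[i]))
means
#    A    B    C 
# 10.0 17.5 20.0
```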
Another option using data.table:
library(data.table)
setDT(thelist)[, .(ind = paste(.I, collapse = " "),
mean_num = mean(num)
),
by = letter]
Output:
letter ind mean_num
1: A 1 3 10.0
2: B 2 5 17.5
3: C 4 20.0
I'd use dplyr/tidyverse for this:
# setup
library(tidyverse)
# group by letters then get mean of num
thelist %>%
group_by(letter) %>%
summarise(mean_num = mean(num))
You could also use base R with a for loop:
lets <- unique(thelist$letter)
x <- rep(NA, length(lets))
for (i in seq_along(lets)) {
x[i] <- mean(thelist$num[thelist$letter %in% lets[i]])
}
x

Combining data under different factor levels while retaining original levels

I would like to have a tidyverse solution for the following problem. In my dataset, I have data on various factor levels. I would like to create a new factor level "Total" that is the sum of all values Y at existing factor levels of X. This can be done, for example, with:
mutate(Data, X = fct_collapse(X, Total = c("A", "B", "C", "D"))) %>%
group_by(X) %>%
summarize(Y = sum(Y))
However, this also necessarily overwrites the original factor levels. I would have to combine the original dataset with the new collapsed dataset in an additional step.
One solution I have used in the past to retain the original levels is to bring data in the wide format and proceed with rowwise() and mutate() to create a new variable with the "Total" and then reshape back to long.
spread(Data, key = X, value = Y) %>%
rowwise() %>%
mutate(Total = sum(A, B, C, D)) %>%
gather(1:5, key = "X", value = "Y")
However, I am very unhappy with this solution since using rowwise() is not considered good practice. It would be great if you could point me to an alternative way to combine data under different factor levels while retaining the original levels.
Minimal reproducible example:
Data<-data.frame(
X = factor(c("A", "B", "C", "D")),
Y = c(1000, 2000, 3000, 4000))
Expected result:
# A tibble: 5 x 2
X Y
<chr> <dbl>
1 A 1000
2 B 2000
3 C 3000
4 D 4000
5 Total 10000
Using the janitor library, this would be straightforward.
output <- Data %>% janitor::adorn_totals("row") %>% mutate(X = factor(X))
output
# X Y
# A 1000
# B 2000
# C 3000
# D 4000
# Total 10000
Looking at the output structure:
str(output)
# 'data.frame': 5 obs. of 2 variables:
# $ X: Factor w/ 5 levels "A","B","C","D",..: 1 2 3 4 5
# $ Y: num 1000 2000 3000 4000 10000
Using the suggestion in @M--'s first version of his comment to the question, now edited, I have added bind_rows.
I have also changed the input dataset a bit. Following the OP's and @camille's comments, this dataset has a factor level "Z" but keeps the original order and adds level "Total" at the end.
Data <- data.frame(
X = factor(c("A", "B", "C", "Z")),
Y = c(1000, 2000, 3000, 4000))
Data %>%
mutate(lvl = levels(X),
X = fct_collapse(X, Total = c("A", "B", "C", "Z")),
X = as.character(X)) %>%
bind_rows(mutate(Data, X = as.character(X)), .) %>%
mutate(X = factor(X, levels = c(lvl, "Total"))) %>%
group_by(X) %>%
summarize(Y = sum(Y)) -> d
d
## A tibble: 5 x 2
# X Y
# <fct> <dbl>
#1 A 1000
#2 B 2000
#3 C 3000
#4 Z 4000
#5 Total 10000
Check the output factor levels.
levels(d$X)
#[1] "A" "B" "C" "Z" "Total"
This solution can also be used in this case:
library(dplyr)
Data %>%
add_row(X = "Total", Y = sum(.$Y)) %>%
mutate(X = factor(X))
X Y
1 A 1000
2 B 2000
3 C 3000
4 D 4000
5 Total 10000
Data %>%
add_row(X = "Total", Y = sum(.$Y)) %>%
mutate(X = factor(X)) %>%
{levels(.$X)}
[1] "A" "B" "C" "D" "Total"
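If no extra package is wanted, the total row can probably also be appended in base R. The sketch below sets the level order explicitly so that "Total" is guaranteed to come last; with_total is a made-up name.

```r
Data <- data.frame(
  X = factor(c("A", "B", "C", "D")),
  Y = c(1000, 2000, 3000, 4000))

# Append the total and rebuild the factor with "Total" as the last level
with_total <- data.frame(
  X = factor(c(as.character(Data$X), "Total"),
             levels = c(levels(Data$X), "Total")),
  Y = c(Data$Y, sum(Data$Y))
)

with_total
levels(with_total$X)
# [1] "A"     "B"     "C"     "D"     "Total"
```

Setting levels by hand matters when an original level would sort after "Total" alphabetically, as with the "Z" level used in the earlier answer.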

How to rank column accordingly using case_when?

I wanted to create another column (called delayGrade) where the top 10% of values (closest to 0) from another column (averageDelay) get assigned the letter 'A', the next 25% 'B', and the remaining 'C'. I figured I could use a case_when function to do so, but I'm not sure how to go about doing it. Any ideas?
Here is a toy data frame and a solution:
library(tidyverse)
df <- tibble(
averageDelay = rnorm(10)
)
df %>%
mutate(
delayGrade = case_when(
averageDelay < quantile(averageDelay, .1) ~ "A",
averageDelay < quantile(averageDelay, .35) ~ "B",
TRUE ~ "C"
)
) %>%
arrange(averageDelay) # Not necessary, but improves readability
# A tibble: 10 x 2
averageDelay delayGrade
<dbl> <chr>
1 -1.57878473 A
2 -1.00129022 B
3 -0.34245100 B
4 -0.08652020 B
5 -0.05240453 C
6 0.15732711 C
7 0.21509389 C
8 0.34202367 C
9 0.90296373 C
10 0.90820894 C
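The same grading can be sketched in base R with cut and quantile breaks. The delays below are made-up, non-negative values (so the smallest values are also the ones closest to 0), and the names averageDelay and delayGrade mirror the question. With right = FALSE the bins are closed on the left, which reproduces the strict < comparisons used in case_when.

```r
# Hypothetical delays, fixed so the result is reproducible
averageDelay <- c(12, 3, 45, 7, 30, 1, 22, 18, 9, 60)

# Breaks at the minimum, the 10% and 35% quantiles, and the maximum
breaks <- quantile(averageDelay, probs = c(0, 0.10, 0.35, 1))

# include.lowest = TRUE keeps the maximum inside the last bin
delayGrade <- cut(averageDelay, breaks = breaks,
                  labels = c("A", "B", "C"),
                  right = FALSE, include.lowest = TRUE)

data.frame(averageDelay, delayGrade)
```

cut requires unique breaks, so heavily tied data may need the breaks deduplicated first.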

Resources