Automating the process of recoding numeric variables to meaningful factor variables

I have a large data frame (hundreds of variables wide) in which all values of categorical variables are saved as numerics, for example, 1, 2, 8, representing no, yes, and unknown.
However, this is not always consistent. There are variables that have ten or more categories with 88 representing unknown etc.
data <- data.frame("ID" = c(1:5),
                   "Var1" = c(2,2,8,1,8),
                   "Var2" = c(5,8,4,88,10))
For each variable, I do have all information on which value represents which category. Currently, I have this information stored in vectors that are each correctly ordered, like
> Var1_values
[1] 8 2 1
with a corresponding vector containing the categories:
> Var1_categories
[1] "unknown" "yes" "no"
But I cannot figure out how to bring this information together in order to automate the recoding and arrive at an expected result like
| ID | Var1 | Var2 |
|----|---------|-------------------|
| 1 | yes | condition E |
| 2 | yes | condition H |
| 3 | unknown | condition D |
| 4 | no | unknown condition |
| 5 | unknown | condition H |
where each column is a meaningful factor variable.
As I said, the data frame is very wide and things might change internally, so doing this manually is not an option. I feel like I'm being stupid as I have all the necessary information readily available, so any insight would be greatly appreciated, and a cup of coffee is the least I can do for helpful advice.
// edit:
I forgot to mention that I have already made some kind of mapping data frame, but I couldn't really put it to use yet. It looks like this:
mapping <- data.frame("Variable" = c("Var1", "Var2", "Var3", "Var4"),
"Value1" = c(2,2,2,7),
"Word1" = c("yes","yes","yes","condition A"),
"Value2" = c(1,1,1,6),
"Word2" = c("no","no","no","Condition B"),
"Value3" = c(8,8,8,5),
"Word3" = c("unk","unk","unk", "Condition C"),
"Value4" = c(NA,NA,NA,4),
"Word4" = c(NA,NA,NA,"Condition B")
)
I would like to "long"-transform it so I can use it with #r2evan 's solution.

Here's one thought, though it requires reshaping the data (twice).
mapping <- data.frame(
  Var = c(rep("Var1", 3), rep("Var2", 5)),
  Val = c(1, 2, 8, 4, 5, 8, 10, 88),
  Words = c("no", "yes", "unk", "D", "E", "H", "H", "unk")
)
mapping
# Var Val Words
# 1 Var1 1 no
# 2 Var1 2 yes
# 3 Var1 8 unk
# 4 Var2 4 D
# 5 Var2 5 E
# 6 Var2 8 H
# 7 Var2 10 H
# 8 Var2 88 unk
library(dplyr)
library(tidyr) # pivot_*
data %>%
pivot_longer(-ID, names_to = "Var", values_to = "Val") %>%
left_join(mapping, by = c("Var", "Val")) %>%
pivot_wider(ID, names_from = "Var", values_from = "Words")
# # A tibble: 5 x 3
# ID Var1 Var2
# <int> <chr> <chr>
# 1 1 yes E
# 2 2 yes H
# 3 3 unk D
# 4 4 no unk
# 5 5 unk H
With this method, you control the number-to-words mapping for each variable.
Another option is to use a map list, similar to above but it does not require double-reshaping.
maplist <- list(
  Var1 = c("1" = "no", "2" = "yes", "8" = "unk"),
  Var2 = c("4" = "D", "5" = "E", "8" = "H", "10" = "H", "88" = "unk")
)
maplist
# $Var1
# 1 2 8
# "no" "yes" "unk"
# $Var2
# 4 5 8 10 88
# "D" "E" "H" "H" "unk"
nms <- c("Var1", "Var2")
data[,nms] <- Map(function(val, lookup) lookup[as.character(val)],
data[nms], maplist[nms])
data
# ID Var1 Var2
# 1 1 yes E
# 2 2 yes H
# 3 3 unk D
# 4 4 no unk
# 5 5 unk H
Between the two, I think I prefer the first if your data doesn't punish you for reshaping it (many things could make this less appealing). One reason it's good is that maintaining the mapping can be as easy as maintaining a CSV (which might be done in your favorite spreadsheet tool, e.g., Excel or Calc).
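For example, a minimal sketch of reading such a CSV back into the long mapping format ("mapping.csv" is a hypothetical file name; columns Var, Val, Words are assumed):
# read the number-to-words mapping maintained in a spreadsheet/CSV
mapping <- read.csv("mapping.csv", stringsAsFactors = FALSE)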

Here's a way of doing it that requires no reshaping of your original data, and that can conceivably be applied to any number of columns. First, put all your existing "values" and "categories" vectors into lists, formatting all as characters:
library(tidyverse)
# Recreating your existing vectors
Var1_values <- c(8, 2, 1)
Var2_values <- c(88, 10, 8, 5, 4)
Var1_categories <- c("unknown", "yes", "no")
Var2_categories <- c("unknown condition", "condition H", "condition H", "condition E", "condition D")
Var_values <- list(Var1_values, Var2_values) %>%
map(as.character)
Var_categories <- list(Var1_categories, Var2_categories)
Add names to each element of each vector in Var_categories using Var_values, and get a list of variable names to recode from your dataset:
for (i in 1:length(Var_categories)) {
  names(Var_categories[[i]]) <- Var_values[[i]]
}
vars_names <- str_subset(colnames(data), "Var")
Then, use map2 to recode all of your target variables, before transforming into a tibble with an ID column.
data_recoded <- map2(vars_names, Var_categories, ~ dplyr::recode(unlist(data[.x], use.names = F), !!!.y)) %>%
as_tibble(.name_repair = ~ vars_names) %>%
add_column(ID = 1:5, .before = "Var1")
Output (data_recoded):
ID Var1 Var2
<int> <chr> <chr>
1 1 yes condition E
2 2 yes condition H
3 3 unknown condition D
4 4 no unknown condition
5 5 unknown condition H
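The question asks for factor variables, while the recoded columns above are character. A small hedged follow-up (assuming the category vectors should define the level order; Var2_categories contains a duplicate label, hence unique()):
data_recoded <- data_recoded %>%
  mutate(Var1 = factor(Var1, levels = Var1_categories),
         Var2 = factor(Var2, levels = unique(Var2_categories)))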

Related

Identify rows with a value greater than threshold, but only the direct one above per group

Suppose we have a dataset with a grouping variable, a value, and a threshold that is unique per group. Say I want to identify a value that is greater than the threshold, but only one per group.
test <- data.frame(
  grp = c("A", "A", "A", "B", "B", "B"),
  value = c(1, 3, 5, 1, 3, 5),
  threshold = c(4,4,4,2,2,2)
)
want <- data.frame(
  grp = c("A", "A", "A", "B", "B", "B"),
  value = c(1, 3, 5, 1, 3, 5),
  threshold = c(4,4,4,2,2,2),
  want = c(NA, NA, "yes", NA, "yes", NA)
)
In the table above, group A has a threshold of 4 and only the value 5 is higher. In group B the threshold is 2 and both 3 and 5 are higher; however, only the row with value 3 is marked.
I was able to do this by identifying which rows had value greater than threshold, then removing the repeated value:
library(dplyr)
test %>%
group_by(grp) %>%
mutate(want = if_else(value > threshold, "yes", NA_character_)) %>%
mutate(across(want, ~replace(.x, duplicated(.x), NA)))
I was wondering if there was a direct way to do this using a single logical statement rather than this two-step method, something along the lines of:
test %>%
group_by(grp) %>%
mutate(want = if_else(???, "yes", NA_character_))
The answer doesn't have to be in R either; an explanation of the logical step would suffice as well. Perhaps using a rank?
Thank you!
library(dplyr)
test %>%
group_by(grp) %>%
mutate(want = (value > threshold), want = want & !lag(cumany(want))) %>%
ungroup()
# # A tibble: 6 × 4
# grp value threshold want
# <chr> <dbl> <dbl> <lgl>
# 1 A 1 4 FALSE
# 2 A 3 4 FALSE
# 3 A 5 4 TRUE
# 4 B 1 2 FALSE
# 5 B 3 2 TRUE
# 6 B 5 2 FALSE
If you really want strings, you can if_else after this.
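For example, a minimal sketch appended to the pipeline above (same logic, then mapping TRUE to "yes" and FALSE to NA):
test %>%
  group_by(grp) %>%
  mutate(want = (value > threshold), want = want & !lag(cumany(want))) %>%
  ungroup() %>%
  mutate(want = if_else(want, "yes", NA_character_))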
Here is a more direct way.
The essential part:
With min(which((value > threshold) == TRUE)) we get the position of the first TRUE in each group.
Next we use ifelse to compare that position to the row number and set the value accordingly:
library(dplyr)
test %>%
group_by(grp) %>%
mutate(want = ifelse(row_number()==min(which((value > threshold) == TRUE)),
"yes", NA_character_))
grp value threshold want
<chr> <dbl> <dbl> <chr>
1 A 1 4 NA
2 A 3 4 NA
3 A 5 4 yes
4 B 1 2 NA
5 B 3 2 yes
6 B 5 2 NA
This is a perfect chance for a data.table answer using its non-equi matching and multiple match handling capabilities:
library(data.table)
setDT(test)
test[test, on=.(grp, value>threshold), mult="first", flag := TRUE]
test
# grp value threshold flag
# <char> <num> <num> <lgcl>
#1: A 1 4 NA
#2: A 3 4 NA
#3: A 5 4 TRUE
#4: B 1 2 NA
#5: B 3 2 TRUE
#6: B 5 2 NA
Find the "first" matching value in each group that is greater than > the threshold and set := it to TRUE

R get column names for changed rows

I have two dataframes, old and new, in R. Is there a way to add a column (called changed) to the new dataframe that lists the column names (in this case, separated with a ";") where the values differ between the two dataframes? I am also trying to use this in a function where the column names that I am comparing are contained in other variables (x1, x2, x3). Ideally, I would only refer to x1, x2, x3 instead of the actual column names, but I can make do if this isn't possible. A tidy solution is preferable.
old <- data.frame(var1 = c(1, 2, 3, 5), var2 = c("A", "B", "C", "D"))
new <- data.frame(var1 = c(1, 4, 3, 6), var2 = c("A", "B", "D", "Z"))
x1 <- "var1"
x2 <- "var2"
x3 <- "changed"
#Output, adding a new column changed to new dataframe
var1 var2 changed
1 1 A NA
2 4 B var1
3 3 D var2
4 6 Z var1; var2
A tidyverse way -
library(dplyr)
library(tidyr)
library(purrr) # for map2_df
cols <- names(new)
bind_cols(new, map2_df(old, new, `!=`) %>%
  rowwise() %>%
  transmute(changed = {
    x <- c_across()
    if (any(x)) paste0(cols[x], collapse = ';') else NA
  }))
# var1 var2 changed
#1 1 A <NA>
#2 4 B var1
#3 3 D var2
#4 6 Z var1;var2
The same logic can be implemented in base R as well -
new$changed <- apply(mapply(`!=`, old, new), 1, function(x)
if(any(x)) paste0(cols[x], collapse = ';') else NA)
Here is a base R approach.
new$changed <- apply(old != new, 1L, \(r, nms) toString(nms[which(r)]), colnames(old))
Output
var1 var2 changed
1 1 A
2 4 B var1
3 3 D var2
4 6 Z var1, var2
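If the column names really are held in variables such as x1, x2 and x3 (as in the question), a hedged variation of the base approach might look like this (names taken from the question's setup, not part of the original answers):
# compare only the columns named in x1/x2 and write the result
# into the column named in x3
cmp_cols <- c(x1, x2)
new[[x3]] <- apply(old[cmp_cols] != new[cmp_cols], 1L,
                   function(r) if (any(r)) paste(cmp_cols[r], collapse = "; ") else NA)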

Iterating through all df column pairs and counting non-zero intersections

I have a ~15000*1000 dataframe, where each column represents an individual, and each row represents the incidence of a trait (0 or 1).
I want to efficiently compare all pairs of columns, and generate a comma separated list of all mutual traits (row names) for all possible pairs.
Currently, I am looping through all the columns via combn and pasting mutual row names into a string. That is to say, I have a solution; however, it is very, very slow (probably quadratic in the number of columns).
Is there a way to vectorise this problem/approach it with tidyr/dplyr etc.? I can't seem to find a way.
For example:
------|individual1 | individual2 | individual3 | ...
trait1| 0 | 1 | 1 | ...
trait2| 0 | 0 | 0 | ...
trait3| 1 | 1 | 1 | ...
... | ... | ... | ... | ...
Yields the string trait1,trait3 for the pair individual 2 and individual 3.
Thanks!
Toy data (the actual data is too sparse just to pull a subset):
df <- data.frame(trait = c("a", "b", "c", "d", "e"), ind1 = c(0, 1, 1, 0, 1), ind2 = c(1, 0, 1, 0, 1), ind3 = c(1, 0, 1, 1, 1))
Try applying a custom function to each combination of columns. Maybe the efficiency can be improved a little.
t(combn(1:(ncol(df)-1), 2, function(x){
  string <- paste(df$trait[df[[x[1]+1]] == 1 & df[[x[2]+1]] == 1], collapse = ",")
  c(names(df)[x+1], string)
}))
# [,1] [,2] [,3]
# [1,] "Alice" "Bob" "c,e"
# [2,] "Alice" "Charlie" "c,e"
# [3,] "Bob" "Charlie" "a,c,e"
Data
df <- data.frame(trait = c("a", "b", "c", "d", "e"),
                 Alice = c(0, 1, 1, 0, 1),
                 Bob = c(1, 0, 1, 0, 1),
                 Charlie = c(1, 0, 1, 1, 1))
Although this question has an accepted answer, I would like to suggest a different approach which uses dplyr and tidyr as well as a data.table variant.
Whenever column names are treated as data items this indicates that the dataset is stored in an untidy format, IMHO. Reshaping the data into long format allows us to apply the usual data manipulations like joining, grouping, and aggregating.
dplyr and tidyr
library(dplyr)
library(tidyr)
df %>%
pivot_longer(!"trait") %>%
filter(value == 1L) %>%
select(-value) %>%
inner_join(., ., by = "trait") %>%
filter(name.x < name.y) %>%
group_by(name.x, name.y) %>%
summarise(traits = toString(trait)) %>%
ungroup()
# A tibble: 3 x 3
name.x name.y traits
<chr> <chr> <chr>
1 Alice Bob c, e
2 Alice Charlie c, e
3 Bob Charlie a, c, e
Explanation
df %>%
pivot_longer(!"trait") %>%
filter(value == 1L)
reshapes the data into long format, which is a compact representation of the original matrix in wide format:
# A tibble: 10 x 3
trait name value
<fct> <chr> <dbl>
1 a Bob 1
2 a Charlie 1
3 b Alice 1
4 c Alice 1
5 c Bob 1
6 c Charlie 1
7 d Charlie 1
8 e Alice 1
9 e Bob 1
10 e Charlie 1
The value column is dropped as it is no longer needed. Then, the long data is joined with itself to find all names which match on trait. The result includes pairs of names given in a different order, e.g., (Alice, Bob) and (Bob, Alice), as well as duplicate names, e.g., (Bob, Bob). These are removed.
Finally, the data are grouped and summarised.
data.table
The data.table variant implements the same approach but has the advantage of allowing a non-equi self-join, which reduces the number of rows directly in the join rather than in a subsequent filtering step.
library(data.table)
long <- melt(setDT(df), id.vars = "trait", variable.name = "name")[value == 1]
long[long, on = .(trait, name < name), .(name1 = x.name, name2 = i.name, trait), nomatch = NULL][
, .(traits = toString(trait)), keyby = .(name1, name2)]
name1 name2 traits
1: Alice Bob c, e
2: Alice Charlie c, e
3: Bob Charlie a, c, e

Unlist column in data frame with list columns

I have a list with multiple levels, and I would like to turn the data level into a data frame where the variable chr is collapsed into single strings.
myList <- list(total_reach = list(4),
               data = list(list(reach = 2,
                                chr = list("A", "B", "C"),
                                nr = 3,
                                company = "Company A"),
                           list(reach = 2,
                                chr = list("A", "B", "C"),
                                nr = 3,
                                company = "Company B")))
I would like to transform this into a data frame that looks like this:
reach chr nr company
1 2 A, B, C 3 Company A
2 2 A, B, C 3 Company B
Using dplyr and data.table I've come this far.
library(data.table)
library(dplyr)
df <- data.frame(rbindlist(myList[2])) %>% t() %>% as.data.frame()
colnames(df) <- names(myList$data[[1]])
rownames(df) <- c(1:nrow(df))
df$chr <- as.character(df$chr)
df <- df %>%
  mutate_all(funs(unlist(., recursive = F, use.names = F)))
However, the chr column contains strings with "list()" wrapped around them.
reach chr nr company
1 2 list("A", "B", "C") 3 Company A
2 2 list("A", "B", "C") 3 Company B
A) Is there a better way to unlist this kind of list and turn it into a data frame?
B) How do I collapse the lists in chr to strings or factors?
Here is an option using tidyverse
library(tidyverse)
myList[-1] %>%
map_df(transpose) %>%
mutate_at(vars(c('reach', 'nr', 'company')), funs(unlist))
With data.table you can try
library(data.table)
rbindlist(lapply(myList$data, as.data.table))[, .(chr = toString(chr)),
by = .(reach, nr, company)]
reach nr company chr
1: 2 3 Company A A, B, C
2: 2 3 Company B A, B, C
Note that there is a difference in using as.data.table or as.data.frame:
rbindlist(lapply(myList$data, as.data.table))
reach chr nr company
1: 2 A 3 Company A
2: 2 B 3 Company A
3: 2 C 3 Company A
4: 2 A 3 Company B
5: 2 B 3 Company B
6: 2 C 3 Company B
rbindlist(lapply(myList$data, as.data.frame))
reach chr..A. chr..B. chr..C. nr company
1: 2 A B C 3 Company A
2: 2 A B C 3 Company B
Alternatively, chr can be manipulated before converting the list into a data.table:
rbindlist(lapply(myList$data, function(x) {
  x$chr = toString(x$chr)
  return(as.data.table(x))
}))
reach chr nr company
1: 2 A, B, C 3 Company A
2: 2 A, B, C 3 Company B
I'm using rbind to put everything together, then I reformat the chr column with sapply
library(magrittr)
myList$data %>%
do.call(rbind,.) %>%
transform(chr %<>% sapply(paste,collapse=","))
# reach chr nr company
# 1 2 A,B,C 3 Company A
# 2 2 A,B,C 3 Company B
EDIT a few months later:
One line longer but a more idiomatic tidyverse variation:
library(tidyverse)
myList$data %>%
map_df(as_tibble) %>%
group_by(reach,nr,company) %>%
summarize_at("chr",paste,collapse=",")

Recoding variables with R

Recoding variables in R seems to be my biggest headache. What functions, packages, and processes do you use to ensure the best result?
I've found very few useful examples on the Internet that give a one-size-fits-all solution to recoding and I'm interested to see what you guys and gals are using.
Note: This may be a community wiki topic.
Recoding can mean a lot of things, and is fundamentally complicated.
Changing the levels of a factor can be done using the levels function:
> #change the levels of a factor
> levels(veteran$celltype) <- c("s","sc","a","l")
Transforming a continuous variable simply involves the application of a vectorized function:
> mtcars$mpg.log <- log(mtcars$mpg)
For binning continuous data look at cut and cut2 (in the Hmisc package). For example:
> #make 4 groups with equal sample sizes
> mtcars[['mpg.tr']] <- cut2(mtcars[['mpg']], g=4)
> #make 4 groups with equal bin width
> mtcars[['mpg.tr2']] <- cut(mtcars[['mpg']],4, include.lowest=TRUE)
For recoding continuous or factor variables into a categorical variable, there is recode in the car package and recode.variables in the Deducer package:
> mtcars[c("mpg.tr2")] <- recode.variables(mtcars[c("mpg")] , "Lo:14 -> 'low';14:24 -> 'mid';else -> 'high';")
If you are looking for a GUI, Deducer implements recoding with the Transform and Recode dialogs:
http://www.deducer.org/pmwiki/pmwiki.php?n=Main.TransformVariables
http://www.deducer.org/pmwiki/pmwiki.php?n=Main.RecodeVariables
I found mapvalues from the plyr package very handy. The package also contains the function revalue, which is similar to car::recode.
The following example will "recode" the letters r, o, m, a, n to their uppercase equivalents:
> mapvalues(letters, from = c("r", "o", "m", "a", "n"), to = c("R", "O", "M", "A", "N"))
[1] "A" "b" "c" "d" "e" "f" "g" "h" "i" "j" "k" "l" "M" "N" "O" "p" "q" "R" "s" "t" "u" "v" "w" "x" "y" "z"
I find this very convenient when several values should be transformed (it's like doing recodes in Stata):
# load package and gen some data
require(car)
x <- 1:10
x
## [1] 1 2 3 4 5 6 7 8 9 10
# do the recoding
recode(x,"10=1; 9=2; 1:4=-99")
## [1] -99 -99 -99 -99 5 6 7 8 2 1
I've found that it can sometimes be easier to convert non-numeric factors to character before attempting to change them. For example:
df <- data.frame(example=letters[1:26])
example <- as.character(df$example)
example[example %in% letters[1:20]] <- "a"
example[example %in% letters[21:26]] <- "b"
Also, when importing data, it can be useful to ensure that numbers are actually numeric before attempting to convert:
df <- data.frame(example=1:100)
example <- as.numeric(df$example)
example[example < 20] <- 1
example[example >= 20 & example < 80] <- 2
example[example >= 80] <- 3
When you want to recode levels of a factor, forcats might come in handy. You can read a chapter of R for Data Science for an extensive tutorial, but here is the gist of it.
library(tidyverse)
library(forcats)
gss_cat %>%
mutate(partyid = fct_recode(partyid,
"Republican, strong" = "Strong republican",
"Republican, weak" = "Not str republican",
"Independent, near rep" = "Ind,near rep",
"Independent, near dem" = "Ind,near dem",
"Democrat, weak" = "Not str democrat",
"Democrat, strong" = "Strong democrat",
"Other" = "No answer",
"Other" = "Don't know",
"Other" = "Other party"
)) %>%
count(partyid)
#> # A tibble: 8 × 2
#> partyid n
#> <fctr> <int>
#> 1 Other 548
#> 2 Republican, strong 2314
#> 3 Republican, weak 3032
#> 4 Independent, near rep 1791
#> 5 Independent 4119
#> 6 Independent, near dem 2499
#> # ... with 2 more rows
You can even let R decide what categories (factor levels) to merge together.
Sometimes you just want to lump together all the small groups to make a plot or table simpler. That’s the job of fct_lump(). [...] The default behaviour is to progressively lump together the smallest groups, ensuring that the aggregate is still the smallest group.
gss_cat %>%
mutate(relig = fct_lump(relig, n = 10)) %>%
count(relig, sort = TRUE) %>%
print(n = Inf)
#> # A tibble: 2 × 2
#> relig n
#> <fctr> <int>
#> 1 Protestant 10846
#> 2 Other 10637
Consider this sample data.
df <- data.frame(a = 1:5, b = 5:1)
df
# a b
#1 1 5
#2 2 4
#3 3 3
#4 4 2
#5 5 1
Here are two options -
1. case_when :
For single column -
library(dplyr)
df %>%
mutate(a = case_when(a == 1 ~ 'a',
a == 2 ~ 'b',
a == 3 ~ 'c',
a == 4 ~ 'd',
a == 5 ~ 'e'))
# a b
#1 a 5
#2 b 4
#3 c 3
#4 d 2
#5 e 1
For multiple columns -
df %>%
mutate(across(c(a, b), ~case_when(. == 1 ~ 'a',
. == 2 ~ 'b',
. == 3 ~ 'c',
. == 4 ~ 'd',
. == 5 ~ 'e')))
# a b
#1 a e
#2 b d
#3 c c
#4 d b
#5 e a
2. dplyr::recode :
For single column -
df %>%
mutate(a = recode(a, '1' = 'a', '2' = 'b', '3' = 'c', '4' = 'd', '5' = 'e'))
For multiple columns -
df %>%
mutate(across(c(a, b),
~recode(., '1' = 'a', '2' = 'b', '3' = 'c', '4' = 'd', '5' = 'e')))
Create a lookup vector using setNames, then match on name:
# iris as an example data
table(iris$Species)
# setosa versicolor virginica
# 50 50 50
x <- setNames(c("x","y","z"), c("setosa","versicolor","virginica"))
iris$Species <- x[ iris$Species ]
table(iris$Species)
# x y z
# 50 50 50
