I have the following table in R, which lists each person's race, gender, age, and cholesterol test result. Age and cholesterol test are stored as dummy variables: age can be low, medium, or high, while the cholesterol test can be low or high. I want to collapse the age and cholesterol columns into single columns where low is coded as 1, medium as 2, and high as 3. The cholesterol test can be neither low nor high if a person never took one; in that case it should be N/A in the expected output.
I want the solution to be dynamic, so that the code still works if I have multiple columns in this format (i.e. there may be new tests, stored as low, medium, or high dummy variables).
How can I do this in R?
input:
race gender age.low_tm1 age.medium_tm1 age.high_tm1 chol_test.low_tm1 chol_test.high_tm1
<chr> <int> <int> <int> <int> <int> <int>
1 white 0 1 0 0 0 0
2 white 0 1 0 0 0 0
3 white 1 1 0 0 0 0
4 black 1 0 1 0 0 0
5 white 0 0 0 1 0 1
6 black 0 0 1 0 1 0
expected output:
race gender age chol_test
1 white 0 1 n/a
2 white 0 1 n/a
3 white 1 1 n/a
4 black 1 2 n/a
5 white 0 3 3
6 black 0 2 1
Perhaps this helps
library(dplyr)
library(tidyr)
library(stringr)
df1 %>%
mutate(across(contains("_"), ~
. * setNames(1:3, c("low", "medium", "high"))[
str_extract(cur_column(), "low|medium|high")])) %>%
rename_with(~ str_remove(., "_tm1")) %>%
pivot_longer(cols = -c(race, gender),
names_to = c(".value", "categ"), names_sep = "\\.") %>%
filter(age > 0|chol_test > 0) %>%
select(-categ) %>%
mutate(chol_test = na_if(chol_test, 0))
-output
# A tibble: 7 × 4
race gender age chol_test
<chr> <int> <int> <int>
1 white 0 1 NA
2 white 0 1 NA
3 white 1 1 NA
4 black 1 2 NA
5 white 0 3 3
6 black 0 0 1
7 black 0 2 NA
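Note that the output above has seven rows: person 6 appears twice, because the medium age dummy and the low cholesterol dummy end up in different rows after pivoting. A minimal base R sketch that keeps one row per person is shown below. It assumes the column names follow the `<variable>.<level>_tm1` pattern from the question; `collapse_dummies` and `level_codes` are hypothetical names introduced here and are not part of the answer above.

```r
# Example data copied from the question
df1 <- data.frame(
  race = c("white", "white", "white", "black", "white", "black"),
  gender = c(0L, 0L, 1L, 1L, 0L, 0L),
  age.low_tm1 = c(1L, 1L, 1L, 0L, 0L, 0L),
  age.medium_tm1 = c(0L, 0L, 0L, 1L, 0L, 1L),
  age.high_tm1 = c(0L, 0L, 0L, 0L, 1L, 0L),
  chol_test.low_tm1 = c(0L, 0L, 0L, 0L, 0L, 1L),
  chol_test.high_tm1 = c(0L, 0L, 0L, 0L, 1L, 0L)
)

# Assumed coding of the levels (hypothetical helper object)
level_codes <- c(low = 1L, medium = 2L, high = 3L)

# Collapse all dummy columns of one variable into a single coded vector.
# Rows where no dummy is 1 stay NA.
collapse_dummies <- function(data, prefix, codes = level_codes) {
  out <- rep(NA_integer_, nrow(data))
  for (lev in names(codes)) {
    col <- paste0(prefix, ".", lev, "_tm1")
    if (col %in% names(data)) {
      out[data[[col]] == 1L] <- codes[[lev]]
    }
  }
  out
}

# Find the variable prefixes dynamically from the column names
pattern <- "\\.(low|medium|high)_tm1$"
prefixes <- unique(sub(pattern, "", grep(pattern, names(df1), value = TRUE)))

result <- df1[c("race", "gender")]
for (p in prefixes) {
  result[[p]] <- collapse_dummies(df1, p)
}
result
```

Because the prefixes are read from the column names, a new test stored in the same naming pattern would be picked up without changing the code.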
data
df1 <- structure(list(race = c("white", "white", "white", "black", "white",
"black"), gender = c(0L, 0L, 1L, 1L, 0L, 0L), age.low_tm1 = c(1L,
1L, 1L, 0L, 0L, 0L), age.medium_tm1 = c(0L, 0L, 0L, 1L, 0L, 1L
), age.high_tm1 = c(0L, 0L, 0L, 0L, 1L, 0L), chol_test.low_tm1 = c(0L,
0L, 0L, 0L, 0L, 1L), chol_test.high_tm1 = c(0L, 0L, 0L, 0L, 1L,
0L)), class = "data.frame", row.names = c("1", "2", "3", "4",
"5", "6"))
We could first define a custom function that recodes dummy variables based on their variable names, below called var_nm2value.
This function takes the values of a variable as its x argument; in dplyr::across this is the .x part. It also takes a list of name-value pairs as its values_ls argument. The function loops over the name-value pairs and checks, using grepl on dplyr::cur_column(), whether each name in values_ls occurs in the variable name. On a match, it replaces all 1s with the corresponding value from values_ls and returns all other values, that is the zeros, as is.
Then we can define a list of recode values, below recode_ls.
Finally, we use purrr::map_dfc inside dplyr::summarise, where we i) loop over the names of the variables we want to create, "age" and "chol_test", ii) select only the columns whose names contain that string, iii) apply dplyr::across to recode the values, iv) pipe the result into do.call with pmax to get the row-wise maximum, and finally v) recode 0s to NA:
# custom function to recode 0/1 dummy variables based on their variable name
var_nm2value <- function(x, values_ls) {
for (val in seq_along(values_ls)) {
if(grepl(names(values_ls)[val], dplyr::cur_column())) {
return(ifelse(x == 1L, values_ls[[val]], x))
}
}
}
# define list of recode values
recode_ls <- list(low = 1, medium = 2, high = 3)
library(tidyverse)
# apply functions to data.frame
df1 %>%
summarise(race = race,
gender = gender,
map_dfc(set_names(c("age", "chol_test")), # i)
function(x) {
select(., contains(x)) %>% # ii)
summarise("{x}" := across(everything(), var_nm2value, recode_ls) %>% # iii)
do.call("pmax", .) %>% # iv)
ifelse(. == 0, NA, .))} # v)
))
#> race gender age chol_test
#> 1 white 0 1 NA
#> 2 white 0 1 NA
#> 3 white 1 1 NA
#> 4 black 1 2 NA
#> 5 white 0 3 3
#> 6 black 0 2 1
Created on 2022-01-03 by the reprex package (v0.3.0)
Related
I am populating a data frame in R with 0s and 1s and, in some cases, I need to insert a row between two consecutive rows as long as a condition is met. The condition would be that if Column C == 1 (first row; i) or Column A == 1 (consecutive row, i + 1), the row to be inserted in between them should have a 1 in Column B and 0s in Columns A and C. This condition should also apply in the opposite way (if Column A == 1 and the consecutive row has Column C == 1)
Column A  Column B  Column C
1         0         0
0         1         0
0         0         1
0         0         1
1         0         0
My data wrangling skills are not that advanced; I tried the functions complete() and fill(), but without luck.
I did not fully understand the condition, because you first say OR and later AND, but here is code that you can use and adapt to the condition you want.
Data
df <-
tibble::tribble(
~Column.A, ~Column.B, ~Column.C,
1L, 0L, 0L,
0L, 1L, 0L,
0L, 0L, 1L,
0L, 0L, 1L,
1L, 0L, 0L
)
Code
library(tidyr)
library(dplyr)
df %>%
mutate(aux = row_number()) %>%
bind_rows(
df %>%
mutate(aux = row_number() + .5) %>%
filter((Column.A == 1 & lead(Column.C) == 1) | (Column.C == 1 & lead(Column.A) == 1)) %>%
mutate(Column.B = 1, Column.A = 0, Column.C = 0)
) %>%
arrange(aux)
Output
# A tibble: 6 x 4
Column.A Column.B Column.C aux
<dbl> <dbl> <dbl> <dbl>
1 1 0 0 1
2 0 1 0 2
3 0 0 1 3
4 0 0 1 4
5 0 1 0 4.5
6 1 0 0 5
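The same idea (give the new rows a fractional position, then sort) can be sketched in base R without the helper column ending up in the result. This is only a sketch under the same reading of the condition as the code above; `needs_row` and `filler` are hypothetical names introduced here.

```r
# Example data from the answer above
df <- data.frame(
  Column.A = c(1L, 0L, 0L, 0L, 1L),
  Column.B = c(0L, 1L, 0L, 0L, 0L),
  Column.C = c(0L, 0L, 1L, 1L, 0L)
)

n <- nrow(df)
i <- seq_len(n - 1)  # index of the first row in each consecutive pair

# TRUE where row i and row i + 1 form an A/C or C/A pair
needs_row <- (df$Column.A[i] == 1 & df$Column.C[i + 1] == 1) |
  (df$Column.C[i] == 1 & df$Column.A[i + 1] == 1)

# One filler row (0, 1, 0) per matching pair
k <- sum(needs_row)
filler <- data.frame(
  Column.A = rep(0L, k),
  Column.B = rep(1L, k),
  Column.C = rep(0L, k)
)

# Original rows sit at positions 1..n, filler rows at i + 0.5
position <- c(seq_len(n), i[needs_row] + 0.5)
result <- rbind(df, filler)[order(position), ]
rownames(result) <- NULL
result
```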
I have age columns like so that are dummy encoded.
How can I transform these columns to one column using dplyr?
Input:
age_0-10 age_11-20 age_21-30 age_31-40 age_41-50 age_51-60 gender
1 0 1 0 0 0 0 0
2 0 0 1 0 0 0 1
3 0 0 0 1 0 0 0
4 0 1 0 0 0 0 1
5 0 0 0 0 0 1 1
Expected output:
age gender
1 11-20 0
2 21-30 1
3 31-40 0
4 11-20 1
5 51-60 1
A possible solution, now using names_prefix thanks to @Adam's comment:
library(tidyverse)
df <- data.frame(
check.names = FALSE,
`age_0-10` = c(0L, 0L, 0L, 0L, 0L),
`age_11-20` = c(1L, 0L, 0L, 1L, 0L),
`age_21-30` = c(0L, 1L, 0L, 0L, 0L),
`age_31-40` = c(0L, 0L, 1L, 0L, 0L),
`age_41-50` = c(0L, 0L, 0L, 0L, 0L),
`age_51-60` = c(0L, 0L, 0L, 0L, 1L),
gender = c(0L, 1L, 0L, 1L, 1L)
)
df %>%
pivot_longer(cols = starts_with("age"), names_to = "age", names_prefix = "age_") %>%
filter(value == 1) %>%
select(age, gender)
#> # A tibble: 5 × 2
#> age gender
#> <chr> <int>
#> 1 11-20 0
#> 2 21-30 1
#> 3 31-40 0
#> 4 11-20 1
#> 5 51-60 1
Here is a way in dplyr using c_across().
library(dplyr)
library(stringr)
df %>%
rowwise() %>%
mutate(age = str_remove(names(.)[which(c_across(starts_with("age")) == 1)], "^age_")) %>%
ungroup() %>%
select(age, gender)
# # A tibble: 5 x 2
# age gender
# <chr> <int>
# 1 11-20 0
# 2 21-30 1
# 3 31-40 0
# 4 11-20 1
# 5 51-60 1
Try the base R code below using max.col
cbind(
age = gsub("^age_", "", head(names(df), -1)[max.col(df[-ncol(df)])]),
df[ncol(df)]
)
which gives
age gender
1 11-20 0
2 21-30 1
3 31-40 0
4 11-20 1
5 51-60 1
Here is another tidyverse solution (note that, as written, the result keeps the age_ prefix):
library(dplyr)
library(purrr)
df %>%
mutate(age = pmap_chr(select(cur_data(), !gender),
~ names(df)[-ncol(df)][as.logical(c(...))])) %>%
select(age, gender)
age gender
1 age_11-20 0
2 age_21-30 1
3 age_31-40 0
4 age_11-20 1
5 age_51-60 1
I have a series of numeric columns ranging from 0 to 8. I want to create a binary variable that is coded "high" if a row contains a value of 3 or more at least once, and "low" otherwise.
structure(list(AE_1 = c(0L, 1L, 0L, 0L, 0L, 2L, 0L), AE_2 = c(0L,
1L, 2L, 1L, 0L, 0L, 0L), AE_3 = c(1L, 4L, 1L, 8L, 0L, 8L, 1L),
AE_4 = c(0L, 1L, 1L, 0L, 0L, 0L, 0L), AE_5 = c(0L, 0L, 1L,
1L, 0L, 0L, 1L), AE_6 = c(0L, 5L, 1L, 3L, 0L, 4L, 1L), AE_7 = c(0L,
1L, 1L, 1L, 0L, 2L, 0L), AE_8 = c(0L, 2L, 1L, 2L, 0L, 0L,
0L), new_AE = c("low", "low", "low", "low", "low", "low",
"low")), class = "data.frame", row.names = c(NA, -7L))
I tried this code, but the outcome is "low" for all rows.
df<-df%>%
mutate(new_AE= pmap_chr(select(., starts_with('AE')), ~
case_when(any(c(...) <= 2) ~ "low" , any(c(...) >=3) ~ "high")))
while I want something like this :
This can be done easily by checking the maximum of each row in base R using pmax. Of course, you would not want to type all 8 column names into pmax, so use do.call instead.
df[,9] <- c("low", "high")[ 1 + (do.call(pmax, df[,-9]) >= 3)]
> df
AE_1 AE_2 AE_3 AE_4 AE_5 AE_6 AE_7 AE_8 new_AE
1 0 0 1 0 0 0 0 0 low
2 1 1 4 1 0 5 1 2 high
3 0 2 1 1 1 1 1 1 low
4 0 1 8 0 1 3 1 2 high
5 0 0 0 0 0 0 0 0 low
6 2 0 8 0 0 4 2 0 high
7 0 0 1 0 1 1 0 0 low
Note that the expression inside [] returns TRUE/FALSE according to your desired condition
# this returns max of each row
do.call(pmax, df[,-9])
[1] 1 5 2 8 0 8 1
# this checks whether max of each row is 3 or more
do.call(pmax, df[,-9]) >= 3
[1] FALSE TRUE FALSE TRUE FALSE TRUE FALSE
If you aren't comfortable with this indexing strategy, you may use replace instead
df$new_AE <- replace(df$new_AE, do.call(pmax, df[,-9]) >= 3, "high")
Update
I made a slight modification to my solution as it appears new_AE column exists from the beginning and only the values were not right so here is also another solution just in case you would like to use pmap in one go. However, you already received some fabulous solutions.
library(dplyr)
library(purrr)
df %>%
# pmap_chr returns a character vector; plain pmap would create a list column
mutate(new_AE = pmap_chr(df %>%
select(-9), ~ ifelse(any(c(...) >= 3), "high", "low")))
AE_1 AE_2 AE_3 AE_4 AE_5 AE_6 AE_7 AE_8 new_AE
1 0 0 1 0 0 0 0 0 low
2 1 1 4 1 0 5 1 2 high
3 0 2 1 1 1 1 1 1 low
4 0 1 8 0 1 3 1 2 high
5 0 0 0 0 0 0 0 0 low
6 2 0 8 0 0 4 2 0 high
7 0 0 1 0 1 1 0 0 low
The issue is that the first condition in case_when is TRUE for every row (each row has at least one value <= 2), so we only get 'low'. Here we don't even need case_when, as there are only two categories: we can convert the logical to a numeric index and use it to pick from a vector of labels
library(dplyr)
df %>%
rowwise %>%
mutate(new_AE = c('low', 'high')[1+ any(c_across(where(is.numeric)) >=3)]) %>%
ungroup
-output
# A tibble: 7 x 9
# AE_1 AE_2 AE_3 AE_4 AE_5 AE_6 AE_7 AE_8 new_AE
# <int> <int> <int> <int> <int> <int> <int> <int> <chr>
#1 0 0 1 0 0 0 0 0 low
#2 1 1 4 1 0 5 1 2 high
#3 0 2 1 1 1 1 1 1 low
#4 0 1 8 0 1 3 1 2 high
#5 0 0 0 0 0 0 0 0 low
#6 2 0 8 0 0 4 2 0 high
#7 0 0 1 0 1 1 0 0 low
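Or this may be done more easily with rowSums from base R, restricted to the AE_ columns so that the character column new_AE is not compared as well
df$new_AE <- c("low", "high")[(!!rowSums(df[startsWith(names(df), "AE_")] >= 3)) + 1]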
df$new_AE
#[1] "low" "high" "low" "high" "low" "high" "low"
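One thing to watch when comparing a whole data frame against a number: a character column such as new_AE is compared as text, and `"low" >= 3` evaluates to TRUE, so an unrestricted comparison can flag every row. The sketch below restricts the check to the numeric AE_ columns; `ae_cols` and `any_high` are hypothetical names introduced here.

```r
# Example data from the question, including the pre-existing new_AE column
df <- data.frame(
  AE_1 = c(0L, 1L, 0L, 0L, 0L, 2L, 0L),
  AE_2 = c(0L, 1L, 2L, 1L, 0L, 0L, 0L),
  AE_3 = c(1L, 4L, 1L, 8L, 0L, 8L, 1L),
  AE_4 = c(0L, 1L, 1L, 0L, 0L, 0L, 0L),
  AE_5 = c(0L, 0L, 1L, 1L, 0L, 0L, 1L),
  AE_6 = c(0L, 5L, 1L, 3L, 0L, 4L, 1L),
  AE_7 = c(0L, 1L, 1L, 1L, 0L, 2L, 0L),
  AE_8 = c(0L, 2L, 1L, 2L, 0L, 0L, 0L),
  new_AE = rep("low", 7)
)

# The number 3 is coerced to "3" and compared as text
"low" >= 3

# Only look at the columns whose names start with "AE_"
ae_cols <- startsWith(names(df), "AE_")

# TRUE for rows with at least one value of 3 or more
any_high <- rowSums(df[ae_cols] >= 3) > 0

df$new_AE <- ifelse(any_high, "high", "low")
df$new_AE
```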
When applying case_when, we have to consider the order of the logical statements, because the first matching condition wins. If we test the second row of the OP's data
v1 <- c(1, 1, 4, 1, 0, 5, 1, 2)
any(v1 <= 2)
#[1] TRUE
which is the first expression in case_when. As the first one already evaluates to TRUE, the subsequent expressions are not considered
case_when(any(v1 <=2) ~ 'low', any(v1 >=3) ~ 'high')
#[1] "low"
By reversing the order, we get "high"
case_when( any(v1 >=3) ~ 'high', any(v1 <=2) ~ 'low')
#[1] "high"
So decide which condition has higher priority and order the expressions accordingly
I am playing around with binary data.
I have data in columns in the following manner:
A B C D E F G H I J K L M N
-----------------------------------------------------
1 1 1 1 1 1 1 1 1 0 0 0 0 0
0 0 0 0 1 1 1 0 1 1 0 0 1 0
0 0 0 0 0 0 0 1 1 1 1 1 0 0
1 indicates that the system was on and 0 indicates that the system was off.
I am trying to find a way to summarize the gaps between the on/off transitions of these systems.
For example,
for the first row, it stops working after 'I'
for the second row, it works from 'E' to 'G', then works again from 'I' to 'J' and in 'M', but is off otherwise.
Is there a way to summarize this?
I wish to see my result in the following form
row-number Number of 1's Range
------------ ------------------ ------
1 9 A-I
2 3 E-G
2 2 I-J
2 1 M
3 5 H-L
Here's a tidyverse solution:
library(tidyverse)
df %>%
rowid_to_column() %>%
gather(col, val, -rowid) %>%
group_by(rowid) %>%
# This counts the number of times a new streak starts
mutate(grp_num = cumsum(val != lag(val, default = -99))) %>%
filter(val == 1) %>%
group_by(rowid, grp_num) %>%
summarise(num_1s = n(),
range = paste0(first(col), "-", last(col)))
## A tibble: 5 x 4
## Groups: rowid [3]
# rowid grp_num num_1s range
# <int> <int> <int> <chr>
#1 1 1 9 A-I
#2 2 2 3 E-G
#3 2 4 2 I-J
#4 2 6 1 M-M
#5 3 2 5 H-L
An option with data.table: convert the 'data.frame' to a 'data.table' while creating a row number column 'rn' (setDT with keep.rownames = TRUE), melt from 'wide' to 'long' format with 'rn' as the id.var, create a run-length-id (rleid) column 'grp' on the 'value' column grouped by 'rn', subset the rows where 'value' is 1, summarise with the number of rows (.N) and the pasted range of 'variable' values grouped by 'grp' and 'rn', assign the column that is no longer needed to NULL, and order by 'rn' if necessary.
library(data.table)
melt(setDT(df1, keep.rownames = TRUE), id.var = 'rn')[,
grp := rleid(value), rn][value == 1, .(NumberOfOnes = .N,
Range = paste(range(as.character(variable)), collapse="-")),
.(grp, rn)][, grp := NULL][order(rn)]
# rn NumberOfOnes Range
#1: 1 9 A-I
#2: 2 3 E-G
#3: 2 2 I-J
#4: 2 1 M-M
#5: 3 5 H-L
Or using base R with rle
do.call(rbind, apply(df1, 1, function(x) {
rl <- rle(x)
i1 <- rl$values == 1
l1 <- rl$lengths[i1]
nm1 <- tapply(names(x), rep(seq_along(rl$values), rl$lengths),
FUN = function(y) paste(range(y), collapse="-"))[i1]
data.frame(NumberOfOnes = l1, Range = nm1)}))
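The solutions above label a single-column run as "M-M", while the expected output in the question shows just "M". A minimal base R sketch that uses the run positions from rle instead of range(), and prints a lone column without the dash, could look like this. `summarise_runs` is a hypothetical helper name introduced here.

```r
# Example data from the question
df1 <- data.frame(
  A = c(1L, 0L, 0L), B = c(1L, 0L, 0L), C = c(1L, 0L, 0L), D = c(1L, 0L, 0L),
  E = c(1L, 1L, 0L), F = c(1L, 1L, 0L), G = c(1L, 1L, 0L), H = c(1L, 0L, 1L),
  I = c(1L, 1L, 1L), J = c(0L, 1L, 1L), K = c(0L, 0L, 1L), L = c(0L, 0L, 1L),
  M = c(0L, 1L, 0L), N = c(0L, 0L, 0L)
)

# Summarise the runs of 1s in one named 0/1 vector
summarise_runs <- function(x, row_id) {
  rl <- rle(as.integer(x))
  ends <- cumsum(rl$lengths)          # last position of each run
  starts <- ends - rl$lengths + 1L    # first position of each run
  on <- rl$values == 1L               # keep only the runs of 1s
  first <- names(x)[starts[on]]
  last <- names(x)[ends[on]]
  data.frame(
    row = rep(row_id, sum(on)),
    n_ones = rl$lengths[on],
    range = ifelse(first == last, first, paste(first, last, sep = "-"))
  )
}

result <- do.call(rbind, lapply(seq_len(nrow(df1)), function(i) {
  summarise_runs(unlist(df1[i, ]), i)
}))
result
```

Using the start and end positions rather than range() also avoids relying on the column names sorting alphabetically in column order.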
data
df1 <- structure(list(A = c(1L, 0L, 0L), B = c(1L, 0L, 0L), C = c(1L,
0L, 0L), D = c(1L, 0L, 0L), E = c(1L, 1L, 0L), F = c(1L, 1L,
0L), G = c(1L, 1L, 0L), H = c(1L, 0L, 1L), I = c(1L, 1L, 1L),
J = c(0L, 1L, 1L), K = c(0L, 0L, 1L), L = c(0L, 0L, 1L),
M = c(0L, 1L, 0L), N = c(0L, 0L, 0L)), class = "data.frame", row.names = c(NA,
-3L))
I want to have a list of positive and negative values corresponding to each value that comes after grouping a column. My data looks like this:
dataset <- read.table(text =
"id value
1 4
1 -2
1 0
2 6
2 -4
2 -5
2 -1
3 0
3 0
3 -4
3 -5",
header = TRUE, stringsAsFactors = FALSE)
I want my result to look like this:
id num_pos_value num_neg_value num_zero_value
1 1 1 1
2 1 3 0
3 0 2 2
I want to extend the columns of the above result by adding sum of the positive and negative values.
id num_pos num_neg num_zero sum_pos sum_neg
1 1 1 1 4 -2
2 1 3 0 6 -10
3 0 2 2 0 -9
We group by 'id' and sum logical vectors to count the positive, negative, and zero values
library(dplyr)
df1 %>%
group_by(id) %>%
summarise(num_pos = sum(value > 0),
num_neg = sum(value < 0),
num_zero = sum(value == 0))
# A tibble: 3 x 4
# id num_pos num_neg num_zero
# <int> <int> <int> <int>
#1 1 1 1 1
#2 2 1 3 0
#3 3 0 2 2
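The question also asks for the sums of the positive and negative values. The summarise call above can be extended for that by subsetting value before summing; this is a sketch that reuses the question's `dataset` and the column names from its expected output.

```r
library(dplyr)

# Example data from the question
dataset <- read.table(text =
"id value
1 4
1 -2
1 0
2 6
2 -4
2 -5
2 -1
3 0
3 0
3 -4
3 -5",
header = TRUE, stringsAsFactors = FALSE)

result <- dataset %>%
  group_by(id) %>%
  summarise(
    num_pos  = sum(value > 0),
    num_neg  = sum(value < 0),
    num_zero = sum(value == 0),
    sum_pos  = sum(value[value > 0]),  # an empty selection sums to 0
    sum_neg  = sum(value[value < 0])
  )
result
```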
Or get the table of sign of 'value' and spread it to 'wide'
library(tidyr)
df1 %>%
group_by(id) %>%
summarise(num = list(table(factor(sign(value), levels = -1:1)))) %>%
unnest %>%
mutate(grp = rep(paste0("num_", c("neg", "zero", "pos")), 3)) %>% # same order as levels -1:1
spread(grp, num)
Or using count
df1 %>%
count(id, val = sign(value)) %>%
spread(val, n, fill = 0)
data
df1 <- structure(list(id = c(1L, 1L, 1L, 2L, 2L, 2L, 2L, 3L, 3L, 3L,
3L), value = c(4L, -2L, 0L, 6L, -4L, -5L, -1L, 0L, 0L, -4L, -5L
)), class = "data.frame", row.names = c(NA, -11L))