A dataset that I am using has recorded respondents' ethnicity. Responses are recorded across multiple variables, and respondents were allowed to pick more than one. Example:
ethnicity1 ethnicity2 ethnicity3 ethnicity4 ethnicity5 ethnicity6
         1          0          0          0          0          0
         0          2          0          0          0          0
         0          0          3          4          0          0
Each ethnicity has its own column. I have recoded each column using the recode command so that different numbers represent different ethnicities (e.g., black is 1, white is 2, etc.). Now I want to:
A) create a single column from the multiple columns combined
B) have it so that anyone who reported multiple columns gets designated "multiple".
My expected output would look like this:
Ethnicity
1
2
999
(I'm not sure whether it's best to have a numerical value signify multiple races for coding purposes, or to use a character value such as "multiple".)
Initially I tried the following, but it isn't working as I'd hoped:
Ethnicity <- df %>% dplyr::na_if(0)
## create column for ethnicity
Ethnicity %>% unite("RaceEthnicity", ethnicity1:ethnicity5, na.rm = TRUE, remove = FALSE)
Here's a tidyverse solution. I assume your data has a column for the respondent. I've added that and named it ID.
To understand what's happening, you can run the code step by step, adding one line at a time up to (but not including) the trailing pipe (%>%), and look at the output.
The columns used for pivot_longer will depend on what your real data looks like: here the ethnicities are in 1-6, ID in 7.
library(dplyr)
library(tidyr)
mydata %>%
  # add IDs for respondents
  mutate(ID = LETTERS[1:3]) %>%
  # convert to 'long' format
  pivot_longer(1:6) %>%
  # remove zero-value rows
  filter(value != 0) %>%
  # group by person
  group_by(ID) %>%
  # use value where there is one row per person, otherwise use 999
  # we need doubles for both values (existing are int)
  summarise(ethnicity = case_when(n() == 1 ~ as.double(value),
                                  TRUE ~ 999)) %>%
  ungroup() %>%
  # discard duplicate rows
  distinct()
Result:
ID ethnicity
<chr> <dbl>
1 A 1
2 B 2
3 C 999
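If you would rather show a text label than the 999 sentinel (per the aside in the question), one option is to relabel at the end of the pipeline. A hedged sketch, where result is a placeholder name for the summarised output above:
# 'result' stands for the ID/ethnicity tibble produced above (hypothetical name)
result %>%
  mutate(ethnicity = if_else(ethnicity == 999, "multiple", as.character(ethnicity)))
Note this turns the column into character, so keep the numeric version around if you need it for modelling.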
Your example data with corrected column names:
mydata <- structure(list(ethnicity1 = c(1L, 0L, 0L),
ethnicity2 = c(0L, 2L, 0L),
ethnicity3 = c(0L, 0L, 3L),
ethnicity4 = c(0L, 0L, 4L),
ethnicity5 = c(0L, 0L, 0L),
ethnicity6 = c(0L, 0L, 0L)),
class = "data.frame",
row.names = c(NA, -3L))
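As a side note, if the ethnicity columns won't always sit in positions 1 to 6, a tidyselect helper avoids hard-coded indices; a sketch assuming your real columns share the "ethnicity" prefix:
# same pipeline as above, selecting by name prefix instead of position
mydata %>%
  mutate(ID = LETTERS[1:3]) %>%
  pivot_longer(starts_with("ethnicity"))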
Here's an approach with dplyr and purrr:
library(dplyr); library(purrr)
df %>%
  mutate(RaceEthnicity = select(cur_data(), enthnicity1:ethnicity6) %>%
           {case_when(pmap_lgl(., ~ all(is.na(c(...)))) ~ NA_real_,
                      rowSums(., na.rm = TRUE) == 0 ~ 0,
                      rowSums(., na.rm = TRUE) != pmap_int(., pmax, na.rm = TRUE) ~ 999,
                      TRUE ~ rowSums(., na.rm = TRUE))})
enthnicity1 enthnicity2 ethnicity3 enthnicity4 enthnicity5 ethnicity6 RaceEthnicity
1 1 0 0 0 0 0 1
2 0 2 0 0 0 0 2
3 0 0 3 4 0 0 999
This is probably not the most novice-friendly approach, but you can define the columns within the select call. After select, we pass the data into a set of {} so that the data is represented by the . symbol (a minimal illustration of this braces idiom follows the list below). From there, we use dplyr::case_when to test multiple conditions:
If all columns are NA, return NA
If the rowSums = 0, return 0
If the rowSums don't equal the row max, return 999
Otherwise, return the rowSum (because it will only be length one and be the ethnicity of interest).
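The braces trick is general magrittr behaviour rather than anything specific to this problem; a minimal, self-contained illustration:
library(magrittr)
# inside { }, the piped value is referred to as . and is not
# automatically inserted as the first argument
c(1, 5, 3) %>% { max(.) - min(.) }
#> [1] 4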
Note that you misspelled some of your column names (enthnicity vs. ethnicity); the code and data here match your spelling.
Data:
df <- structure(list(enthnicity1 = c(1L, 0L, 0L), enthnicity2 = c(0L,
2L, 0L), ethnicity3 = c(0L, 0L, 3L), enthnicity4 = c(0L, 0L,
4L), enthnicity5 = c(0L, 0L, 0L), ethnicity6 = c(0L, 0L, 0L)), class = "data.frame", row.names = c(NA, -3L))
Here is another tidyverse solution. I create a new column with mutate and use pmap to collect, for each row, all of the columns that start with "ethnicity" into a vector. I then drop the 0s from that vector, replace its contents with 999 whenever more than one value remains, and keep the single unique value otherwise.
library(tidyverse)
df %>%
  mutate(Ethnicity = pmap(
    select(., starts_with("ethnicity")),
    ~ c(...) %>%
      keep(~ all(. != 0)) %>%
      replace(length(.) > 1, 999) %>%
      unique
  ))
If you need to select the columns explicitly (since your real data may not actually have the word "ethnicity" in each column name), you can use the column indices (e.g., c(1:6)) or the column names (shown below).
df %>%
  mutate(Ethnicity = pmap(
    select(., c("ethnicity1", "ethnicity2", "ethnicity3", "ethnicity4", "ethnicity5", "ethnicity6")),
    ~ c(...) %>%
      keep(~ all(. != 0)) %>%
      replace(length(.) > 1, 999) %>%
      unique
  ))
Another option is to use mutate with ifelse and change any row with multiple values to 999.
library(tidyverse)
df %>%
  mutate(Ethnicity = pmap(select(., starts_with("ethnicity")), ~ c(...) %>%
                            keep(~ all(. != 0)))) %>%
  rowwise %>%
  mutate(Ethnicity = ifelse(length(Ethnicity) > 1, 999, Ethnicity)) %>%
  select(Ethnicity)
Output
# A tibble: 3 × 1
# Rowwise:
Ethnicity
<dbl>
1 1
2 2
3 999
Data
df <-
structure(
list(
ethnicity1 = c(1L, 0L, 0L),
ethnicity2 = c(0L, 2L, 0L),
ethnicity3 = c(0L, 0L, 3L),
ethnicity4 = c(0L, 0L, 4L),
ethnicity5 = c(0L, 0L, 0L),
ethnicity6 = c(0L, 0L, 0L)
),
class = "data.frame",
row.names = c(NA,-3L)
)
In base R you could do (the \(x) lambda shorthand requires R >= 4.1):
aggregate(. ~ row, data.frame(which(df > 0, arr.ind = TRUE)),
          \(x) if (length(x) > 1) 999 else x)
  row col
1   1   1
2   2   2
3   3 999
Here col is the matching column index, which happens to coincide with the ethnicity code in this example; rename it as needed.
I would propose another strategy to consider. If the number of ethnicityn columns is limited (fewer than 32 in the simple case), a better approach could be a bitmask, where each column maps to one bit of an integer. This technique is used in many languages for similar purposes, for instance MySQL SET columns, sets in Pascal/Delphi, etc. In this case, the resulting column would hold the values c(1L, 2L, 12L) (row 3 sets bits 3 and 4, i.e. 4 + 8 = 12).
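A minimal base R sketch of that encoding, using the mydata frame defined earlier (the power-of-two weighting is my illustration of the idea, not code from the proposal):
# each answered column contributes one power of two:
# ethnicity1 -> 1, ethnicity2 -> 2, ethnicity3 -> 4, ...
weights <- 2L^(seq_along(mydata) - 1L)
mydata$EthnicityMask <- as.integer(as.matrix(mydata > 0) %*% weights)
mydata$EthnicityMask
#> [1]  1  2 12
Membership is then a simple bit test: bitwAnd(mask, 2L^(k - 1)) > 0 says whether ethnicity k was picked.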
Maybe it's that simple? Or am I overlooking something? Since at most one column is non-zero for a single response, the row sum is already the respondent's code; any row with more than one non-zero column is recoded to 999.
library(dplyr)
df %>%
  mutate(Ethnicity = rowSums(select(., contains("ethnicity"))),
         Ethnicity = ifelse(rowSums(select(., contains("ethnicity")) > 0) > 1,
                            999, Ethnicity))
ethnicity1 ethnicity2 ethnicity3 ethnicity4 ethnicity5 ethnicity6 Ethnicity
1 1 0 0 0 0 0 1
2 0 2 0 0 0 0 2
3 0 0 3 4 0 0 999
Related
I have the following table in R which lists a person's race, gender, age, and cholesterol test. Age and cholesterol test are displayed as dummy variables. Age can be categorized as low, medium, or high, while a cholesterol test can be categorized as low or high. I want to transform the age and cholesterol columns into single columns where low is coded as 1, medium as 2, and high as 3. A cholesterol test can be neither low nor high if a person never took one, and should be N/A in the expected output.
I want the solution to be dynamic so that the code still works if I have more columns in this format (i.e., there may be new tests, also categorized as low, medium, or high dummy variables).
How can I do this in R?
input:
race gender age.low_tm1 age.medium_tm1 age.high_tm1 chol_test.low_tm1 chol_test.high_tm1
<chr> <int> <int> <int> <int> <int> <int>
1 white 0 1 0 0 0 0
2 white 0 1 0 0 0 0
3 white 1 1 0 0 0 0
4 black 1 0 1 0 0 0
5 white 0 0 0 1 0 1
6 black 0 0 1 0 1 0
expected output:
race gender age chol_test
1 white 0 1 n/a
2 white 0 1 n/a
3 white 1 1 n/a
4 black 1 2 n/a
5 white 0 3 3
6 black 0 2 1
Perhaps this helps:
library(dplyr)
library(tidyr)
library(stringr)
df1 %>%
  mutate(across(contains("_"),
                ~ . * setNames(1:3, c("low", "medium", "high"))[
                        str_extract(cur_column(), "low|medium|high")])) %>%
  rename_with(~ str_remove(., "_tm1")) %>%
  pivot_longer(cols = -c(race, gender),
               names_to = c(".value", "categ"), names_sep = "\\.") %>%
  filter(age > 0 | chol_test > 0) %>%
  select(-categ) %>%
  mutate(chol_test = na_if(chol_test, 0))
-output
# A tibble: 7 × 4
race gender age chol_test
<chr> <int> <int> <int>
1 white 0 1 NA
2 white 0 1 NA
3 white 1 1 NA
4 black 1 2 NA
5 white 0 3 3
6 black 0 0 1
7 black 0 2 NA
data
df1 <- structure(list(race = c("white", "white", "white", "black", "white",
"black"), gender = c(0L, 0L, 1L, 1L, 0L, 0L), age.low_tm1 = c(1L,
1L, 1L, 0L, 0L, 0L), age.medium_tm1 = c(0L, 0L, 0L, 1L, 0L, 1L
), age.high_tm1 = c(0L, 0L, 0L, 0L, 1L, 0L), chol_test.low_tm1 = c(0L,
0L, 0L, 0L, 0L, 1L), chol_test.high_tm1 = c(0L, 0L, 0L, 0L, 1L,
0L)), class = "data.frame", row.names = c("1", "2", "3", "4",
"5", "6"))
We could first define a custom function that allows us to recode dummy variables based on their variable names, below called var_nm2value.
This function takes the values of a variable as its x argument; inside dplyr::across this is the .x part. It also takes a list of name-value pairs as its values_ls argument. The function loops over the list of name-value pairs and checks whether each name is found in the current variable name, using grepl on dplyr::cur_column(). On a match it replaces all 1s with the value from values_ls and returns all other values, that is the zeros, as is.
Then we can define a list of recode values, below recode_ls.
Finally, we use purrr::map_dfc inside a dplyr::summarise where we i) iterate over the variable stems we want to create, "age" and "chol_test", ii) select only the columns whose names contain that stem, iii) apply dplyr::across to recode the values, iv) pipe the result into a do.call("pmax", ...) to take the row-wise maximum, and finally v) recode 0s to NA:
# custom function to recode 0/1 dummy variables based on their variable name
# and a list of name-value pairs
var_nm2value <- function(x, values_ls) {
  for (val in seq_along(values_ls)) {
    if (grepl(names(values_ls)[val], dplyr::cur_column())) {
      return(ifelse(x == 1L, values_ls[[val]], x))
    }
  }
}
# define list of recode values
recode_ls <- list(low = 1, medium = 2, high = 3)
library(tidyverse)
# apply functions to data.frame
df1 %>%
  summarise(race = race,
            gender = gender,
            map_dfc(set_names(c("age", "chol_test")),                    # i)
                    function(x) {
                      select(., contains(x)) %>%                         # ii)
                        summarise("{x}" := across(everything(), var_nm2value, recode_ls) %>%  # iii)
                                    do.call("pmax", .) %>%               # iv)
                                    ifelse(. == 0, NA, .))               # v)
                    }))
#> race gender age chol_test
#> 1 white 0 1 NA
#> 2 white 0 1 NA
#> 3 white 1 1 NA
#> 4 black 1 2 NA
#> 5 white 0 3 3
#> 6 black 0 2 1
Created on 2022-01-03 by the reprex package (v0.3.0)
I have a dataframe that resembles the following:
ID X Y A_1_l A_2_m B_1_n B_2_l C_1_m C_2_n C_3_l
w  X Y     0     0     0     0     0     0     0
x  X Y     0     0     3     0     0     0     0
y  X Y     0     1     0     4     0     1     0
z  X Y     3     4     5     6     2     1     5
In each column name, the first letter denotes a sample, the number a repetition, and the second letter a batch. I am trying to count, for each ID, the number of samples with at least one value > 0, and store these counts in a list.
This is the desired result, as a list that I can append to an existing dataframe:
0,1,3,3
For a previous analysis I used strsplit to count the total number of samples per batch.
colsList <- colnames(df)
cols <- grep("_", colsList, value = TRUE)
splitList <- strsplit(cols, "_\\d_")
stats <- data.frame(t(as.data.frame.list(splitList)))
rownames(stats) <- NULL
names(stats) <- c("Sample", "Batch")
perSample <- aggregate(Sample ~ Batch, stats,
                       function(x) length(unique(x)))  # number of strains
And I was able to find the total number of columns with a value > 0 using rowSums(df[sapply(df, is.numeric)] > 0), but I can't seem to figure out how to combine the two to find the number of samples with a value > 0.
First filter the data to keep only the numeric columns.
Use split.default to divide the data into groups so that you have all the 'A' columns in one group, 'B' in another, and so on. Within each group, return TRUE for a row if it has at least one value greater than 0; rowSums over the resulting TRUE/FALSE matrix then counts, per row, how many groups (samples) have a non-zero value, giving the final count.
tmp <- Filter(is.numeric, df)
rowSums(sapply(split.default(tmp, sub('_.*', '', names(tmp))),
function(x) rowSums(x) > 0))
#[1] 0 1 3 3
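For intuition, the intermediate matrix produced by the sapply call (before the outer rowSums) should look roughly like this, with one logical column per sample:
      A     B     C
1 FALSE FALSE FALSE
2 FALSE  TRUE FALSE
3  TRUE  TRUE  TRUE
4  TRUE  TRUE  TRUE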
We can do this in the tidyverse as well: reshape to long, count the non-zero values per (ID, sample), then count the samples with a non-zero total per ID.
library(dplyr)
library(stringr)
library(tidyr)
df1 %>%
  select(ID, where(is.numeric)) %>%
  pivot_longer(cols = -ID) %>%
  mutate(name = str_remove(name, "_.*")) %>%
  group_by(ID, name) %>%
  summarise(value = sum(value > 0), .groups = 'drop_last') %>%
  summarise(value = sum(value > 0))
# A tibble: 4 x 2
ID value
<chr> <int>
1 w 0
2 x 1
3 y 3
4 z 3
data
df1 <- structure(list(ID = c("w", "x", "y", "z"), X = c("X", "X", "X",
"X"), Y = c("Y", "Y", "Y", "Y"), A_1_l = c(0L, 0L, 0L, 3L), A_2_m = c(0L,
0L, 1L, 4L), B_1_n = c(0L, 3L, 0L, 5L), B_2_l = c(0L, 0L, 4L,
6L), C_1_m = c(0L, 0L, 0L, 2L), C_2_n = c(0L, 0L, 1L, 1L), C_3_l = c(0L,
0L, 0L, 5L)), class = "data.frame", row.names = c(NA, -4L))
I am working with the data frame given in the Data section below (the IndicatorPercent column there shows the desired output).
I am wondering how I can create a new column which shows the percentage of the indicator column over all previous observations within the group.
Basically, the new column just indicates the percentage (in decimal form) of the indicator up to that point within the group: it divides the sum of the indicator column up to that point by the count of previous observations within the group. For example, the third row of group 1 has seen the indicators 1 and 0, so its value is (1 + 0) / 2 = 0.5.
My first thought was to use group_by along with row_number in order to reference previous observations, but I couldn't figure out how to make it work.
Data:
structure(list(Group = c(1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L,
2L), Indicator = c(1L, 0L, 1L, 1L, 0L, 0L, 0L, 0L, 1L, 0L), IndicatorPercent = c(NA,
1, 0.5, 0.67, 0.75, NA, 0, 0, 0, 0.25)), class = "data.frame", row.names = c(NA,
-10L))
We get the cummean of 'Indicator' after grouping by 'Group' and then take the lag of it:
library(dplyr)
df1 %>%
  group_by(Group) %>%
  mutate(IndicatorPercent = lag(cummean(Indicator))) %>%
  ungroup
-output
# A tibble: 10 x 3
# Group Indicator IndicatorPercent
# <int> <int> <dbl>
# 1 1 1 NA
# 2 1 0 1
# 3 1 1 0.5
# 4 1 1 0.667
# 5 1 0 0.75
# 6 2 0 NA
# 7 2 0 0
# 8 2 0 0
# 9 2 1 0
#10 2 0 0.25
If we want to compute this based only on the rows matching another column's value (here a hypothetical color column), use replace together with fill:
library(tidyr)
df1 %>%
  group_by(Group) %>%
  mutate(IndicatorPercent = replace(rep(NA_real_, n()),
                                    color == 'red',
                                    lag(cummean(Indicator[color == "red"])))) %>%
  fill(IndicatorPercent) %>%
  ungroup
Or with data.table
library(data.table)
setDT(df1)[color == 'red',
IndicatorPercent := shift(cummean(Indicator)), Group][,
IndicatorPercent := nafill(IndicatorPercent, type = 'locf'), Group][]
I have a df (coming from a csv) that has this structure
id att1_beer att1_wine att2_beer att2_wine
1 1 1 0 0
2 0 1 0 1
3 1 1 0 1
4 0 1 0 1
5 1 1 0 0
I would like to get a table (preferably with the tidyverse) in this format:
Beer Wine
Att1 3 5
Att2 0 3
Is this possible? I'm trying to avoid exporting to Excel to do it.
Can reshape to 'long' with pivot_longer and then get the sum by group
library(dplyr)
df %>%
  select(-id) %>%
  pivot_longer(cols = everything(), names_sep = "_",
               names_to = c("grp", ".value")) %>%
  group_by(grp) %>%
  summarise(across(everything(), sum), .groups = 'drop')
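With the sample data below, this should give the requested counts (a quick check: att1_wine sums to 5, att2_wine to 3):
# A tibble: 2 × 3
  grp    beer  wine
  <chr> <int> <int>
1 att1      3     5
2 att2      0     3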
Or using base R
sapply(split.default(df[-1], sub(".*_", "", names(df)[-1])), colSums)
data
df <- structure(list(id = 1:5, att1_beer = c(1L, 0L, 1L, 0L, 1L),
att1_wine = c(1L,
1L, 1L, 1L, 1L), att2_beer = c(0L, 0L, 0L, 0L, 0L), att2_wine = c(0L,
1L, 1L, 1L, 0L)), class = "data.frame", row.names = c(NA, -5L
))
data.table solution for completeness' sake
library(data.table); setDT(df)                        # prepare
ans <- melt(df, id.vars = "id")                       # melt to long format
ans[, c("att", "drink") := tstrsplit(variable, "_")]  # split the column into variables
dcast(ans, att ~ drink, fun.aggregate = sum)          # cast to wide and sum
A base R option using xtabs + colSums
u <- colSums(df[-1])
xtabs(
u ~ .,
data.frame(
u,
do.call(rbind, strsplit(names(u), "_"))
)
)
gives
X2
X1 beer wine
att1 3 5
att2 0 3
I am playing around with binary data.
I have data in columns in the following manner:
A B C D E F G H I J K L M N
-----------------------------------------------------
1 1 1 1 1 1 1 1 1 0 0 0 0 0
0 0 0 0 1 1 1 0 1 1 0 0 1 0
0 0 0 0 0 0 0 1 1 1 1 1 0 0
1 indicates that the system was on; 0 indicates that it was off.
I am trying to find a way to summarize the runs between the on/off transitions of these systems.
For example,
for the first row, the system stops working after 'I';
for the second row, it works from 'E' to 'G', works again from 'I' to 'J' and at 'M', and is off otherwise.
Is there a way to summarize this?
I wish to see my result in the following form
row-number Number of 1's Range
------------ ------------------ ------
1 9 A-I
2 3 E-G
2 2 I-J
2 1 M
3 5 H-L
Here's a tidyverse solution:
library(tidyverse)
df %>%
  rowid_to_column() %>%
  gather(col, val, -rowid) %>%
  group_by(rowid) %>%
  # This counts the number of times a new streak starts
  mutate(grp_num = cumsum(val != lag(val, default = -99))) %>%
  filter(val == 1) %>%
  group_by(rowid, grp_num) %>%
  summarise(num_1s = n(),
            range = paste0(first(col), "-", last(col)))
## A tibble: 5 x 4
## Groups: rowid [3]
# rowid grp_num num_1s range
# <int> <int> <int> <chr>
#1 1 1 9 A-I
#2 2 2 3 E-G
#3 2 4 2 I-J
#4 2 6 1 M-M
#5 3 2 5 H-L
An option with data.table. Convert the data.frame to a data.table while creating a row-number column (setDT with keep.rownames = TRUE), melt from 'wide' to 'long' format with the row-number column 'rn' as the id variable, create a run-length-id (rleid) column on the 'value' column grouped by 'rn', subset the rows where 'value' is 1, summarise with the number of rows (.N) and the pasted range of 'variable' values grouped by 'grp' and 'rn', drop the helper column, and order by 'rn' if necessary.
library(data.table)
melt(setDT(df1, keep.rownames = TRUE), id.var = 'rn')[,
  grp := rleid(value), rn][value == 1,
  .(NumberOfOnes = .N,
    Range = paste(range(as.character(variable)), collapse = "-")),
  .(grp, rn)][, grp := NULL][order(rn)]
# rn NumberOfOnes Range
#1: 1 9 A-I
#2: 2 3 E-G
#3: 2 2 I-J
#4: 2 1 M-M
#5: 3 5 H-L
Or using base R with rle
do.call(rbind, apply(df1, 1, function(x) {
  rl <- rle(x)
  i1 <- rl$values == 1
  l1 <- rl$lengths[i1]
  nm1 <- tapply(names(x), rep(seq_along(rl$values), rl$lengths),
                FUN = function(y) paste(range(y), collapse = "-"))[i1]
  data.frame(NumberOfOnes = l1, Range = nm1)
}))
data
df1 <- structure(list(A = c(1L, 0L, 0L), B = c(1L, 0L, 0L), C = c(1L,
0L, 0L), D = c(1L, 0L, 0L), E = c(1L, 1L, 0L), F = c(1L, 1L,
0L), G = c(1L, 1L, 0L), H = c(1L, 0L, 1L), I = c(1L, 1L, 1L),
J = c(0L, 1L, 1L), K = c(0L, 0L, 1L), L = c(0L, 0L, 1L),
M = c(0L, 1L, 0L), N = c(0L, 0L, 0L)), class = "data.frame", row.names = c(NA,
-3L))