Collapsing multiple columns with repeating values - r

I am currently working with a previously collected dataframe. Participant race is currently split among several categories (Race_White, Race_Black, etc.) where each participant has a value of 1 for Yes or 2 for No. For example, a White participant that does not identify with any other race would have a 1 in the Race_White column and 2's in all other Race_X columns.
I would like to merge these into one "Race" column, where 1 = White, 2 = Black, etc. Does anyone know of a nice piece of code/function/package to do this efficiently?
This is what I have been trying:
Race <- mutate(mydata,
Race = case_when(
Race_White == 1 & Race_Black == 2 & Race_Asian == 2 & Race_NoReply == 2 ~ 1,
Race_White == 2 & Race_Black == 1 & Race_Asian == 2 & Race_NoReply == 2 ~ 2,
Race_White == 2 & Race_Black == 2 & Race_Asian == 1 & Race_NoReply == 2 ~ 3,
Race_White == 2 & Race_Black == 2 & Race_Asian == 2 & Race_NoReply == 1 ~ 4,
TRUE ~ NA_real_))

I would use pivot_longer and str_remove like this:
library(magrittr) # provides the %>% pipe
tib <- tibble::tibble( # example data
individual = 1:10,
race_white = sample(c(0, 1), 10, TRUE),
race_black = 1 - race_white
)
tib %>%
tidyr::pivot_longer(dplyr::contains('race')) %>%
dplyr::filter(value == 1) %>%
dplyr::mutate(
name = stringr::str_remove(name, 'race_')
) %>%
dplyr::select(-value, race = name)
If you want them integer coded, you could use case_when on the character column.
But it is hard to know exactly what you want without example data.
Here is the output:
# A tibble: 10 x 2
individual race
<int> <chr>
1 1 white
2 2 white
3 3 white
4 4 white
5 5 white
6 6 white
7 7 black
8 8 white
9 9 white
10 10 black
Edit:
I used 0 = No, and 1 = Yes. But that does not change anything. I also added package notation to all functions.
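The integer coding mentioned above could be sketched as follows. This uses a named lookup vector in base R as an alternative to case_when; the `codes` mapping and the example `race` vector are assumptions made for illustration, not part of the original data.

```r
# Hypothetical character column, as produced by the pivot_longer step
race <- c("white", "white", "black", "white")

# Assumed mapping from label to integer code (1 = White, 2 = Black)
codes <- c(white = 1L, black = 2L)

# Index the lookup vector by label; labels missing from `codes` become NA
race_int <- unname(codes[race])
race_int
```

Any label that is not in `codes` ends up as NA, which makes unexpected categories easy to spot.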

You could do:
names(df)[max.col(df==1)]
[1] "Race_yellow" "Race_red" "Race_green" "Race_red" "Race_green" "Race_yellow"
[7] "Race_red" "Race_purple" "Race_purple" "Race_yellow" "Race_yellow" "Race_blue"
[13] "Race_purple" "Race_red" "Race_purple"
The data:
df <- read.table(text =
"Race_yellow Race_green Race_purple Race_blue Race_red
1 1 2 2 2 2
2 2 2 2 2 1
3 2 1 2 2 2
4 2 2 2 2 1
5 2 1 2 2 2
6 1 2 2 2 2
7 2 2 2 2 1
8 2 2 1 2 2
9 2 2 1 2 2
10 1 2 2 2 2
11 1 2 2 2 2
12 2 2 2 1 2
13 2 2 1 2 2
14 2 2 2 2 1
15 2 2 1 2 2")
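If integer codes are wanted instead of column names, max.col already returns the column position. The sketch below assumes the 1 = Yes / 2 = No coding from the question and assumes that participants with no "Yes" or with more than one "Yes" should become NA (mirroring the `TRUE ~ NA_real_` fallback); the small data frame is made up for illustration.

```r
# Made-up example using the column names from the question
df <- data.frame(
  Race_White   = c(1, 2, 2, 2, 1),
  Race_Black   = c(2, 1, 2, 2, 1),
  Race_Asian   = c(2, 2, 1, 2, 2),
  Race_NoReply = c(2, 2, 2, 2, 2)
)

is_yes <- df == 1            # logical matrix: TRUE where the answer is Yes
n_yes  <- rowSums(is_yes)    # number of Yes answers per participant

# Column position of the first Yes in each row (1 = White, 2 = Black, ...)
race <- max.col(is_yes, ties.method = "first")

# Rows without exactly one Yes are ambiguous, so set them to NA
race[n_yes != 1] <- NA_integer_
race
```

Setting `ties.method = "first"` avoids the default random tie-breaking, so the result is reproducible even before the ambiguous rows are masked.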

Related

Rank ordering the rows of a data.frame in R

I was wondering if there is a way to rank-order the rows of my Data below such that rows that simultaneously have the largest values on each of risk1, risk2 and risk3 (NOT the TOTAL of the three) are at the top?
For example, in my Desired_output, you see that id == 4 simultaneously has the largest values on risk1, risk2 and risk3 (4,3,2).
For all other ids, there is a 1 or 0 on at least one of the risk1, risk2 and risk3.
Note: Ties are fine. 4,3,2 == 2,3,4 == 3,2,4.
Data = data.frame(id=1:4,risk1 = c(1,3,5,4), risk2 = c(8,2,1,3), risk3 = c(0,1,4,2))
Desired_output = read.table(h=T,text="
id risk1 risk2 risk3
4 4 3 2
3 5 1 4
2 3 2 1
1 1 8 0
")
Maybe this helps - loop over the rows, sort the elements, paste, convert to numeric, use that to order the rows
Data[order(-apply(Data[-1], 1, \(x)
as.numeric(paste(sort(x), collapse = "")))),]
-output
id risk1 risk2 risk3
4 4 4 3 2
3 3 5 1 4
2 2 3 2 1
1 1 1 8 0
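Pasting the sorted digits into one number only compares correctly when every value is a single digit. A variant that avoids this, sketched here under the assumption that rows should be ranked by their smallest value first, then the second smallest, and so on, passes the sorted values to order as separate keys:

```r
Data <- data.frame(id = 1:4, risk1 = c(1, 3, 5, 4),
                   risk2 = c(8, 2, 1, 3), risk3 = c(0, 1, 4, 2))

# Sort each row's risk values ascending; one row per id after transposing
sorted <- t(apply(Data[-1], 1, sort))

# One ordering key per sorted position: smallest value, second smallest, ...
keys <- lapply(seq_len(ncol(sorted)), function(j) sorted[, j])

# Order rows by those keys, largest first
ord <- do.call(order, c(keys, list(decreasing = TRUE)))
Data[ord, ]
```

For the example data this gives the same row order as the paste-based approach (ids 4, 3, 2, 1).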
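For this example, simply reversing the row order also gives the desired output (this relies on the data already being in the opposite order, so it is not a general solution):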
library(dplyr)
Data %>%
arrange(-row_number())
id risk1 risk2 risk3
1 4 4 3 2
2 3 5 1 4
3 2 3 2 1
4 1 1 8 0

Recoding specific column values using reference list

My dataframe looks like this
data = data.frame(ID=c(1,2,3,4,5,6,7,8,9,10),
Gender=c('Male','Female','Female','Female','Male','Female','Male','Male','Female','Female'))
And I have a reference list that looks like this -
ref=list(Male=1,Female=2)
I'd like to replace values in the Gender column using this reference list, without adding a new column to my dataframe.
Here's my attempt
do.call(dplyr::recode, c(list(data), ref))
Which gives me the following error -
no applicable method for 'recode' applied to an object of class
"data.frame"
Any inputs would be greatly appreciated
One option would be to do a left_join after stacking the 'ref' list into a two-column data.frame
library(dplyr)
left_join(data, stack(ref), by = c('Gender' = 'ind')) %>%
select(ID, Gender = values)
A base R approach would be
unname(unlist(ref)[as.character(data$Gender)])
#[1] 1 2 2 2 1 2 1 1 2 2
In base R:
data$Gender = sapply(data$Gender, function(x) ref[[x]])
You can use factor, i.e.
factor(data$Gender, levels = names(ref), labels = ref)
#[1] 1 2 2 2 1 2 1 1 2 2
You can unlist ref to give you a named vector of codes, and then index this with your data:
transform(data,Gender=unlist(ref)[as.character(Gender)])
ID Gender
1 1 1
2 2 2
3 3 2
4 4 2
5 5 1
6 6 2
7 7 1
8 8 1
9 9 2
10 10 2
Surprisingly, this works as well, although Gender then becomes a list column rather than a numeric vector:
data$Gender <- ref[as.character(data$Gender)]
#> data
# ID Gender
# 1 1 1
# 2 2 2
# 3 3 2
# 4 4 2
# 5 5 1
# 6 6 2
# 7 7 1
# 8 8 1
# 9 9 2
# 10 10 2
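The lookup-based answers above can be combined into a short in-place recode. This is a minimal sketch that assumes every value in Gender should have an entry in ref; the check for unmatched labels is an addition for illustration and not part of the original answers.

```r
data <- data.frame(ID = 1:10,
                   Gender = c('Male', 'Female', 'Female', 'Female', 'Male',
                              'Female', 'Male', 'Male', 'Female', 'Female'))
ref <- list(Male = 1, Female = 2)

# Flatten the reference list into a named numeric vector
codes <- unlist(ref)

# Labels without an entry in ref would silently become NA, so check first
unmatched <- setdiff(unique(as.character(data$Gender)), names(codes))
stopifnot(length(unmatched) == 0)

# Overwrite the column in place with the numeric codes
data$Gender <- unname(codes[as.character(data$Gender)])
str(data)
```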

R: How to create equal blocks of variables randomly?

I have a data frame of n = 20 variables (number of columns) spread over b = 5 blocks (4 variables per block).
I would like to create p = 4 random and equal sized blocks of variables from the 5 blocks of variables.
I tried :
sample (x = 1: p, size = n, replace = TRUE)
[1] 1 1 1 1 1 1 1 1 1 2 2 2 3 3 3 4 4 4 4 4
Example of expected result (5 variables per block):
[1] 4 1 2 1 4 2 3 1 2 3 2 1 4 3 1 2 3 3 4 4
Thanks for your help !
You can try:
sample(x = rep(1:p,n/p), size = n, replace = FALSE)
Having discussed this in comments below, here is a solution:
Create a vector that looks like what you want, and then use sample to randomly sort it by sampling the whole vector without replacement:
p <- 4
b <- 5
sample(rep(1:p, b), size = p * b)
[1] 3 1 4 3 3 4 1 1 4 2 2 4 3 2 1 2 2 4 3 1
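To confirm that the blocks really are equal in size, and to see which variables land in which block, something like the following could be used. The seed and the variable names V1 to V20 are assumptions made only so the example is reproducible and self-contained.

```r
set.seed(42)   # assumption: fixed seed, only for reproducibility

n <- 20        # number of variables
p <- 4         # number of blocks wanted

# Build a balanced vector of block labels, then shuffle it
blocks <- sample(rep(seq_len(p), each = n / p))

# Each block should contain exactly n / p = 5 variables
table(blocks)

# Hypothetical variable names, split by their assigned block
vars <- paste0("V", seq_len(n))
split(vars, blocks)
```

Because the labels are shuffled rather than drawn with replacement, the block sizes are fixed in advance and only the assignment is random.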

create variable conditionally by group in R (write function)

I want to create a variable by group, conditioned on an existing individual-level variable. Each individual has an outlier variable with value 1, 2, or 3. I want to create a new group-level variable so that the new var = 2 whenever at least one individual in the group has outlier = 2, and the new var = 3 whenever at least one individual in the group has outlier = 3.
The data looks like this
grpid id outlier
1 1 1
1 2 1
1 3 2
2 4 1
2 5 3
2 6 1
3 7 1
3 8 1
3 9 1
Ideal output like this
grpid id outlier goutlier
1 1 1 2
1 2 1 2
1 3 2 2
2 4 1 3
2 5 3 3
2 6 1 3
3 7 1 1
3 8 1 1
3 9 1 1
Any suggestions?
Thanks!
It is easy with dplyr
library(dplyr)
df <- read.table(header = TRUE,sep = ",",
text = "grpid,id,outlier
1,1,1
1,2,1
1,3,2
2,4,1
2,5,3
2,6,1
3,7,1
3,8,1
3,9,1")
df %>% group_by(grpid) %>% mutate(goutlier = max(outlier))
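The same group-wise maximum can also be computed without dplyr. This is a sketch using base R's ave, with the example data rebuilt as a plain data frame:

```r
df <- data.frame(grpid   = c(1, 1, 1, 2, 2, 2, 3, 3, 3),
                 id      = 1:9,
                 outlier = c(1, 1, 2, 1, 3, 1, 1, 1, 1))

# ave() applies FUN within each group and returns a vector aligned with the rows
df$goutlier <- ave(df$outlier, df$grpid, FUN = max)
df
```

Taking the maximum works here because the codes are ordered: a group containing a 3 gets 3, otherwise a group containing a 2 gets 2, and all other groups stay at 1.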

Subset data frame that include a variable

I have a list of events and sequences. I would like to print the sequences in a separate table if event = x is included somewhere in the sequence. See table below:
Event Sequence
1 a 1
2 a 1
3 x 1
4 a 2
5 a 2
6 a 3
7 a 3
8 x 3
9 a 4
10 a 4
In this case I would like a new table that includes only the sequences where Event=x was included:
Event Sequence
1 a 1
2 a 1
3 x 1
4 a 3
5 a 3
6 x 3
Base R solution:
d[d$Sequence %in% d$Sequence[d$Event == "x"], ]
Event Sequence
1: a 1
2: a 1
3: x 1
4: a 3
5: a 3
6: x 3
data.table solution:
library(data.table)
setDT(d)[Sequence %in% Sequence[Event == "x"]]
As you can see, the syntax/logic is quite similar between these two solutions:
Find events that are equal to x
Extract their Sequence
Subset the table according to the specified Sequence
We can use dplyr to group the data and filter the sequence with any "x" in it.
library(dplyr)
df2 <- df %>%
group_by(Sequence) %>%
filter(any(Event %in% "x")) %>%
ungroup()
df2
# A tibble: 6 x 2
Event Sequence
<chr> <int>
1 a 1
2 a 1
3 x 1
4 a 3
5 a 3
6 x 3
DATA
df <- read.table(text = " Event Sequence
1 a 1
2 a 1
3 x 1
4 a 2
5 a 2
6 a 3
7 a 3
8 x 3
9 a 4
10 a 4",
header = TRUE, stringsAsFactors = FALSE)
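The base R and data.table answers refer to an object `d` that is not defined in the thread. Assuming it holds the same data as `df` above, the three steps can be written out one at a time as a sketch:

```r
# Assumption: d contains the same data as df in the DATA block
d <- data.frame(Event    = c("a", "a", "x", "a", "a", "a", "a", "x", "a", "a"),
                Sequence = c(1, 1, 1, 2, 2, 3, 3, 3, 4, 4))

# Step 1: find the events that are equal to "x"
is_x <- d$Event == "x"

# Step 2: extract the sequences those events belong to
seq_with_x <- unique(d$Sequence[is_x])

# Step 3: keep every row whose Sequence is one of those
result <- d[d$Sequence %in% seq_with_x, ]
result
```

Note that base R keeps the original row names (1, 2, 3, 6, 7, 8) in the result.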
