How do I find the length of the highest repeated character in a string?
col1        repeated letter   repeated number
apples333   2                 3
summer13    2                 0
talk77      0                 2
Aa6668      2                 3
I can use lengths(regmatches(str, gregexpr("a", str))) or str_count(str, "a"), but the idea is to automatically check which character/number repeats the most and return that count.
Using rle() and the rawConversion functions (charToRaw()/rawToChar()):
d <- data.frame(col1 = c("apples333", "summer13", "talk77", "Aa6668"))
foo <- function(x, p){
  r <- rle(charToRaw(tolower(x)))
  res <- max(r$lengths[grepl(p, rawToChar(r$values, multiple = TRUE))])
  if(res == 1) res <- 0
  res
}
d$repLetter <- sapply(d$col1, foo, p = "[a-z]")
d$repNumber <- sapply(d$col1, foo, p = "[0-9]")
d
# col1 repLetter repNumber
# 1 apples333 2 3
# 2 summer13 2 0
# 3 talk77 0 2
# 4 Aa6668 2 3
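One edge case to be aware of: if a string contains no character matching p at all (for example no digits), max() is called on an empty vector and returns -Inf with a warning. A guarded variant could look like this (a sketch, not part of the original answer; foo_safe is a hypothetical name):
foo_safe <- function(x, p){
  r <- rle(charToRaw(tolower(x)))
  len <- r$lengths[grepl(p, rawToChar(r$values, multiple = TRUE))]
  if(!length(len)) return(0)   # no matching characters at all
  res <- max(len)
  if(res == 1) res <- 0
  res
}
foo_safe("apples", "[0-9]")
# [1] 0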
There is probably an elegant regex-based solution for this (obviously I am not a big regex-er). The following is based on determining the run lengths of a vector with base R's rle() function, i.e. counting the repetition of elements.
As a strategy, we develop a function to work on a single string input providing the different portions and associated occurrences/counts. Then, to operate over several input strings, we apply (loop) a function to each element of the input vector.
single string
Let's see how rle() works:
x <- "abba" # a test string - who does not know ABBA
x_split <- strsplit(x, "") %>% unlist # split the string, unlist to coerce vector
x_rle <- rle(x_split) # apply rle()
# now let's check what we have
x_rle
Run Length Encoding
lengths: int [1:3] 1 2 1
values : chr [1:3] "a" "b" "a"
rle() returns a list. As you want to filter, etc. on your results, it might be easier to turn this into a data frame. We also store the actual input.
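For the abba example, that conversion can be done directly in base R (a quick sketch before we wrap everything into a function below):
data.frame(input = x, values = x_rle$values, lengths = x_rle$lengths)
  input values lengths
1  abba      a       1
2  abba      b       2
3  abba      a       1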
With a view to apply this to other strings (e.g. loop over input vector), we wrap this into a function call:
library(dplyr)
check_rle_char_num <- function(x){
  # split the string and count occurrences
  x_split <- strsplit(x, "") %>% unlist()
  x_rle <- rle(x_split)
  # turn it into a tibble
  df <- with(x_rle, tibble(values, lengths)) %>%
    # ----------- store the input string and check for chars/numerics
    mutate( input = x
          , is_num = grepl(pattern = "[0-9]", values) # logical check for numbers
          ) %>%
    # ----------- order output tibble
    select(input, everything())
}
check that it works:
> ( check_rle_char_num("Appllles44777") )
# A tibble: 7 x 4
input values lengths is_num
<chr> <chr> <int> <lgl>
1 Appllles44777 A 1 FALSE
2 Appllles44777 p 2 FALSE
3 Appllles44777 l 3 FALSE
4 Appllles44777 e 1 FALSE
5 Appllles44777 s 1 FALSE
6 Appllles44777 4 2 TRUE
7 Appllles44777 7 3 TRUE
We have all the pieces on which you can filter, select, etc. your desired output.
loop over multiple input strings
We use tidyverse's {purrr} package for this.
# multiple input strings
my_strings <- c("apples333", "summer13","talk77","Aa6668","Appllles44777")
# loop over my_strings
library(purrr)
test <- my_strings %>%
map_dfr(.f = ~ check_rle_char_num(.x)) # map_dfr returns a data frame
test
# A tibble: 29 x 4
input values lengths is_num
<chr> <chr> <int> <lgl>
1 apples333 a 1 FALSE
2 apples333 p 2 FALSE
3 apples333 l 1 FALSE
4 apples333 e 1 FALSE
5 apples333 s 1 FALSE
6 apples333 3 3 TRUE
7 summer13 s 1 FALSE
8 summer13 u 1 FALSE
9 summer13 m 2 FALSE
10 summer13 e 1 FALSE
final push, filter, and reshape a nice output tibble
# per problem statement - filter for maximum and min 2 counts (i.e. > 1)
result <- test %>%
group_by(input, is_num) %>%
filter(lengths == max(lengths), lengths > 1)
> result
# A tibble: 7 x 4
# Groups: input, is_num [7]
input values lengths is_num
<chr> <chr> <int> <lgl>
1 apples333 p 2 FALSE
2 apples333 3 3 TRUE
3 summer13 m 2 FALSE
4 talk77 7 2 TRUE
5 Aa6668 6 3 TRUE
6 Appllles44777 l 3 FALSE
7 Appllles44777 7 3 TRUE
To emulate the results listed in the problem statement a bit more closely, one can reshuffle the columns and provide "nice" column names:
library(tidyr) # for reshuffling
result %>%
  tidyr::pivot_wider( names_from  = is_num
                    , values_from = c(values, lengths)
                    ) %>%
  #---------- pivot_wider() spreads the tibble; the new column names combine the old column names with TRUE/FALSE - remember that TRUE marks the numbers
  rename( char       = values_FALSE
        , char_count = lengths_FALSE
        , nums       = values_TRUE
        , nums_count = lengths_TRUE) %>%
  #---------- changing order of columns for nice output
  select(input, starts_with("char"), starts_with("num"))
# A tibble: 5 x 5
# Groups: input [5]
input char char_count nums nums_count
<chr> <chr> <int> <chr> <int>
1 apples333 p 2 3 3
2 summer13 m 2 NA NA
3 talk77 NA NA 7 2
4 Aa6668 NA NA 6 3
5 Appllles44777 l 3 7 3
final notes
The solution presented
does the filtering on the result data frame (after the loop). If there are no other operations on your data, you can lift this into the function.
does not clean the NAs in the final output. If you need zeros for no letter or no number, you can replace the NAs (see the short sketch after these notes).
keeps characters and numbers in a single data frame. Obviously, you can split them; one could combine both again with a join() or bind_cols() on the input variable, which saves the pivot_wider() step.
does not handle "ties", i.e. cases where several characters and/or numbers share the same maximum count. You may have to deal with this yourself.
Last but not least: simplify the code, if none of the columns/variables kept in the tibble help for your problem.
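Regarding the NA note above: assuming you assigned the wide table from the last step to a variable, say result_wide (a hypothetical name), the counts could be zero-filled like this (a sketch):
result_wide %>%
  mutate( char_count = tidyr::replace_na(char_count, 0L)
        , nums_count = tidyr::replace_na(nums_count, 0L))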
Solution
You can go with this:
library(stringr)
max_freq <- Vectorize(function(x) max(tabulate(factor(x))))
df$repeated_letter <- max_freq(str_extract_all(str_to_lower(df$col1), "[:alpha:]"))
df$repeated_letter <- max_freq(str_extract_all(str_to_lower(df$col1), "[:digit:]"))
df
#> col1 repeated_letter repeated_number
#> 1 apples333 2 3
#> 2 summer13 2 1
#> 3 talk77 1 2
#> 4 Aa6668 2 3
#> 5 Appllles44777 3 3
Explanation
Here is a breakdown of the solution, step by step, with some explanations:
# take your column
df$col1 |>
# set to lower so A and a is the same character
str_to_lower() |>
# extract only letters as a list of vectors (use "[:digit:]" for the numbers column)
str_extract_all("[:alpha:]") |>
# get frequency table for each vector
lapply(factor) |> lapply(tabulate) |>
# extract the count of most repeated letter for each table and return a vector
sapply(max)
#> [1] 2 2 1 2 3
Data
Where df is:
df <- data.frame(col1 = c("apples333", "summer13", "talk77", "Aa6668", "Appllles44777"))
Warnings
When there are no repeated characters, 1 will be returned, which is actually a more consistent answer, since the most repeated character will be repeated once. If you prefer zero, you can replace all ones with zeros.
In the case of no letters or no digits at all, -Inf will be returned (with a warning). If you want a different result (such as zero), you can replace it; your example did not include such a case.
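If you do prefer zeros in both situations, a small post-processing step does it (a sketch, assuming the df from the Data section; fix_zero is a hypothetical helper):
fix_zero <- function(v) ifelse(!is.finite(v) | v == 1, 0, v)
df$repeated_letter <- fix_zero(df$repeated_letter)
df$repeated_number <- fix_zero(df$repeated_number)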
Though late to the party, this method might still be of interest:
library(tidyr)
library(stringr)
library(dplyr)
d %>%
  # count the number of character repetitions:
  mutate(
    # for letters:
    dup_w = lapply(str_extract_all(col1, "(?i)([a-z])\\1+"), nchar),
    # for numbers:
    dup_n = lapply(str_extract_all(col1, "([0-9])\\1+"), nchar)) %>%
  # throw all repetition counts into a single column:
  pivot_longer(c(dup_w, dup_n)) %>%
  # show items in list:
  unnest(cols = value) %>%
  # group:
  group_by(col1, name) %>%
  # reduce dataframe to maximum values per group:
  filter(value == max(value)) %>%
  # widen the dataframe back to original format:
  pivot_wider(names_from = name, values_from = value)
# A tibble: 5 x 3
# Groups: col1 [5]
col1 dup_w dup_n
<chr> <int> <int>
1 11applesssss333 5 3
2 summer13 2 NA
3 talk77 NA 2
4 Aa6668 2 3
5 Appllles44777 3 3
Data (with lots more repetitions to make things clearer):
d <- data.frame(col1 = c("11applesssss333", "summer13", "talk77",
"Aa6668", "Appllles44777"))
Related
I want to create a function that I can simulate n number of times. My ultimate goal is to find the sum of c over every n simulations. I am a beginner in R coding, so I am just starting to practice with for loops and if-else statements.
This is what I hope to achieve as of now: if a > b, c would be 2, and if a < b, c would be -2. If a = b, c would be determined by the a and b values of the NEXT row. This is what I have so far, but I keep getting errors. I would like to know if what I have for a = b is how I should approach this. Any help is appreciated.
a<-c(5,6,7,8,9,10,1,4,6,7)
b<-c(4,6,8,5,3,4,5,2,1,3)
c<-c(0,0,0,0,0,0,0,0,0,0)
df<-data.frame(a,b,c)
if(df$a > df$b){
df$c<- c(2)}
else if(df$a < df$b){
df$c<- c(-2)}
else if(df$a == df$b){ # a=b
if(df$a[+1,] > df$b[+1,]) {
df$c<- c(2)}
else(df$a[+1,] < df$b[+1,]){
df$c<- c(-2) }
}
else
print("error")
}
sum(df$c)
The problem
if() and else in R are meant for control flow and are not vectorized. In plain English, this means that if() expects a statement evaluating to a single TRUE or FALSE. When you write df$a > df$b you get a logical vector with as many elements as your data frame has rows. When this happens, if() only uses the first element and gives you a warning (or, in newer versions of R, an error). This gives you the wrong answers.
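A quick illustration of the difference (the exact behaviour of the first line depends on your R version; since R 4.2.0 it is an error rather than a warning):
x <- c(TRUE, FALSE, TRUE)
if (x) "yes"            # uses only x[1]: warning in older R, error since R 4.2.0
ifelse(x, "yes", "no")  # vectorised: one result per element
#> [1] "yes" "no"  "yes"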
A better solution
I think you are looking for ifelse() which is vectorized. And since you have nested if-else statements you are probably better off with dplyr::case_when().
Here is an example which also fixes cases where a == b for multiple rows:
# Note that I've added two consecutive rows where a == b
a <- c(5,6,6,7,8,9,10,1,4,6,7)
b <- c(4,6,6,8,5,3,4,5,2,1,3)
df <- data.frame(a, b)
library(dplyr)
df %>%
  mutate(
    c = case_when(
      a > b ~ 2,
      a < b ~ -2,
      # If neither a > b nor a < b is TRUE, they must be equal,
      # so we set all other cases to NA...
      TRUE ~ NA_real_
    )
  ) %>%
  # ... and then we use fill() to replace each NA with the first
  # non-NA value after it
  tidyr::fill(c, .direction = "up")
#> a b c
#> 1 5 4 2
#> 2 6 6 -2
#> 3 6 6 -2
#> 4 7 8 -2
#> 5 8 5 2
#> 6 9 3 2
#> 7 10 4 2
#> 8 1 5 -2
#> 9 4 2 2
#> 10 6 1 2
#> 11 7 3 2
Created on 2022-03-30 by the reprex package (v2.0.1)
How this works:
ifelse() works like if() and else() in your code, but it accepts multiple values
case_when() acts like nested ifelse() statements, so it will first check if a > b and set those values equal to 2, next it will check the remaining rows if a < b and set those to -2 and so on.
In cases where a is not less nor more than b, they must be equal. We set these cases to NA.
Afterwards we use tidyr::fill() to replace missing values with the first non-missing value after them. This handles cases where there are multiple consecutive rows with a == b.
Edit: two users already pointed out what to do if there are consecutive rows with a == b. This is a good opportunity to dive into the tidyverse (as already suggested by others):
library(dplyr)
library(tidyr)
df <- data.frame(
a = c(5,6,7,8,9,10,1,4,6,7),
b = c(4,6,8,5,3,4,5,2,1,3)
)
df %>%
  mutate(c = ifelse(a == b, NA, 2 * sign(a - b))) %>%  ## (1)
  fill(c, .direction = 'up')                           ## (2)
(1) set c to NA when a == b
(2) 'fill' (replace) NAs with the next available value down the rows
Starting out with R, it's helpful to know that vectorizing (the x[n] thing) usually makes your code more concise and, in certain situations, much faster than using loops. In your case:
library(dplyr)                 ## for lead()
df$c <- 2 * sign(df$a - df$b)  ## see ?sign
z <- df$c == 0                 ## see (1)
df$c[z] <- lead(df$c, 1)[z]    ## see (2)
(1) equal numbers have sign zero, z is a boolean vector indicating the positions (rows) where a == b (thus: z is TRUE)
(2) change c only at the positions where z is TRUE. lead and lag are functions taking a vector and returning its shifted (by a given number of positions) vector.
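To see what lead() and lag() do on their own (a small illustration; both come from dplyr):
library(dplyr)
v <- c(10, 20, 30)
lead(v)  ## next value, NA at the end
## [1] 20 30 NA
lag(v)   ## previous value, NA at the start
## [1] NA 10 20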
Here is a tidyverse solution. This will also work with multiple equal a and b in series (I have added row 3 to the data to demonstrate).
It relies on cumsum() to group the data, such that rows with a == b are in the same group as the next row that is a != b. Then it sets c to the last value in the group.
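To see what the grouping expression does on its own, here it is applied to the first five rows of the data below (a small illustration):
a <- c(5, 6, 5, 7, 8)  # first five values of the data below
b <- c(4, 6, 5, 8, 5)
rev(cumsum(rev(a != b)))
#> [1] 3 2 2 2 1
Rows 2 and 3 (where a == b) land in group 2 together with row 4, the next row where a != b, so last(c) within that group picks up c from row 4.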
library(tidyverse)
a<-c(5,6,5,7,8,9,10,1,4,6,7)
b<-c(4,6,5,8,5,3,4,5,2,1,3)
df <-data.frame(a,b)
df |>
  mutate(c = ifelse(a > b, 2, -2),          # determines c for the `a != b` cases
         grp = rev(cumsum(rev(a != b)))) |> # create group variable; use rev() since we want a backward cumsum
  group_by(grp) |>
  mutate(c = last(c)) |>
  ungroup() |>
  select(-grp)
#> # A tibble: 11 × 3
#> a b c
#> <dbl> <dbl> <dbl>
#> 1 5 4 2
#> 2 6 6 -2
#> 3 5 5 -2
#> 4 7 8 -2
#> 5 8 5 2
#> 6 9 3 2
#> 7 10 4 2
#> 8 1 5 -2
#> 9 4 2 2
#> 10 6 1 2
#> 11 7 3 2
Created on 2022-03-30 by the reprex package (v2.0.1)
I want a way to count values in a data frame based on their presence by row
a = data.frame(c('a','b','c','d','f'),
c('a','b','a','b','d'))
colnames(a) = c('let', 'let2')
In this reproducible example, the letter "a" appears in the first and third rows, for a total of two appearances. I've written this code to count a value when its presence in a row is TRUE, but I want it to do this automatically for all the variables present in the data frame:
# count the variable "a" and attribute the count to the b data frame
b = data.frame(unique(unique(unlist(a))))
b$count = 0
for(i in 1:nrow(a)){
  if(TRUE %in% apply(a[i,], 2, function(x) x %in% 'a') == TRUE){
    b$count[1] = b$count[1] + 1
  }
}
b$count[1]
[1] 2
The problem is that I have to do this manually for every variable, and I want a way to do it automatically. Is there a way? The expected output is:
1 a 2
2 b 2
3 c 1
4 d 2
5 f 1
It can be done in base R by taking the unique values separately from each column, unlisting to a vector, and getting the frequency count with table(). If needed, convert the table object to a two-column data.frame with stack():
stack(table(unlist(lapply(a, unique))))[2:1]
-output
# ind values
#1 a 2
#2 b 2
#3 c 1
#4 d 2
#5 f 1
If it is based on row, use apply with MARGIN = 1
table(unlist(apply(a, 1, unique)))
Or do a group by row to get the unique and count with table
table(unlist(tapply(unlist(a), list(row(a)), unique)))
Or a faster approach with dapply from collapse
library(collapse)
table(unlist(dapply(a, funique, MARGIN = 1)))
Does this work:
library(dplyr)
library(tidyr)
a %>% pivot_longer(cols = everything()) %>% distinct() %>% count(value)
# A tibble: 5 x 2
value n
<chr> <int>
1 a 2
2 b 2
3 c 1
4 d 2
5 f 1
Data used:
a
let let2
1 a a
2 b b
3 c a
4 d b
5 f d
I have a dataframe in R and I want to create a single numeric vector by splitting all of the character values in a specific column and then appending them to the vector or list. The values in the column are all comma-separated numbers and there are rows with missing values or NA.
Current data
id col
1 2,6,10
2 NA
3 5, 10
4 1
Final vector
# v <- c(2, 6, 10, 5, 10, 1)
# v
[1] 2 6 10 5 10 1
I'm able to do this by iterating through all the values in the column but I know this isn't the most efficient way since R is made to work easily with vectors. Is there a better way to do this?
v <- c()
for(val in df$col){
  if(!is.na(val)){
    ints <- as.numeric(unlist(strsplit(val, ",")))
    v <- c(v, ints)
  }
}
You already have the answer in your code since all the functions you are using are vectorised.
v <- as.numeric(na.omit(unlist(strsplit(df$col, ','))))
v
#[1] 2 6 10 5 10 1
Does this work:
library(dplyr)
library(tidyr)
df %>% separate_rows(col) %>% na.omit() %>% pull(col) %>% as.numeric() -> v
v
[1] 2 6 10 5 10 1
Data used:
df
# A tibble: 4 x 2
id col
<dbl> <chr>
1 1 2,6,10
2 2 NA
3 3 5, 10
4 4 1
I have a data frame like the one below, with name and email columns.
df <- data.frame(name=c("maay,bhtr","nsgu,nhuts thang","affat,nurfs","nukhyu,biyts","ngyst,muun","nsgyu,noon","utrs guus,book","thum,cryant","mumt,cant","bhan,btan","khtri,ntuk","ghaan,rstu","shaan,btqaan","nhue,bjtraan","wutys,cyun","hrtsh,jaan"),
email=c("maay.bhtr#email.com","nsgu.nhuts#gmail.com","asfa.1234#gmail.com","nukhyu.biyts#gmail.com","ngyst.muun#gmail.com","nsgyu.noon#gmail.com","utrs.book#hotmail.com","thum.cryant#live.com","mumt.cant#gmail.com","bhan.btan#gmail.com","khtri.ntuk#gmail.c.om","chang.lee#gmail.com","shaan.btqaan#gmail.com","nhue.bjtraan#gmail.com","wutys.cyun#gmailcom","hrtsh.jaan#gmail.com"))
I am looking for a function with which I can check whether the first name or last name matches the email id, and then mutate a new column to TRUE.
In Base R we can utilize Map() and sapply() to loop through your list and create a logical vector to then append to your df:
Since this code includes a lot of nested apply statements, let me try to explain what's going on. The code is probably best understood when starting from the inside.
# t is the strsplit() names column
strsplit(df[,1], ",")
# this next line checks if the names occur in the email address
grepl(t, y, fixed = T)
# this statement wrapped in sapply returns a list with each entry containing two true/false statements for first and last name
# the sapply() statement above allows us to do exactly that for every row
# lastly we convert this list into a single true/false for each df entry
Code:
a <- sapply(Map(function(x, y){
                  sapply(x, function(t){
                    grepl(t, y, fixed = T)
                  })
                },
                strsplit(df[,1], ","), df[, 2]),
            function(p){
              if(any(p)){
                T
              } else {
                F
              }
            })
# result
cbind(df, a)
name email a
1 maay,bhtr maay.bhtr#email.com TRUE
2 nsgu,nhuts thang nsgu.nhuts#gmail.com TRUE
3 affat,nurfs asfa.1234#gmail.com FALSE
4 nukhyu,biyts nukhyu.biyts#gmail.com TRUE
5 ngyst,muun ngyst.muun#gmail.com TRUE
6 nsgyu,noon nsgyu.noon#gmail.com TRUE
7 utrs guus,book utrs.book#hotmail.com TRUE
8 thum,cryant thum.cryant#live.com TRUE
9 mumt,cant mumt.cant#gmail.com TRUE
10 bhan,btan bhan.btan#gmail.com TRUE
11 khtri,ntuk khtri.ntuk#gmail.c.om TRUE
12 ghaan,rstu chang.lee#gmail.com FALSE
13 shaan,btqaan shaan.btqaan#gmail.com TRUE
14 nhue,bjtraan nhue.bjtraan#gmail.com TRUE
15 wutys,cyun wutys.cyun#gmailcom TRUE
16 hrtsh,jaan hrtsh.jaan#gmail.com TRUE
Maybe you can try
within(
  df,
  consistent <- mapply(
    function(x, y) 1 - any(mapply(grepl, x, y)),
    strsplit(name, ","),
    strsplit(gsub("#.*", "", email), "\\.")
  )
)
which gives
name email consistent
1 maay,bhtr maay.bhtr#email.com 0
2 nsgu,nhuts thang nsgu.nhuts#gmail.com 0
3 affat,nurfs asfa.1234#gmail.com 1
4 nukhyu,biyts nukhyu.biyts#gmail.com 0
5 ngyst,muun ngyst.muun#gmail.com 0
6 nsgyu,noon nsgyu.noon#gmail.com 0
7 utrs guus,book utrs.book#hotmail.com 0
8 thum,cryant thum.cryant#live.com 0
9 mumt,cant mumt.cant#gmail.com 0
10 bhan,btan bhan.btan#gmail.com 0
11 khtri,ntuk khtri.ntuk#gmail.c.om 0
12 ghaan,rstu chang.lee#gmail.com 1
13 shaan,btqaan shaan.btqaan#gmail.com 0
14 nhue,bjtraan nhue.bjtraan#gmail.com 0
15 wutys,cyun wutys.cyun#gmailcom 0
16 hrtsh,jaan hrtsh.jaan#gmail.com 0
You could do this as follows - code commented below.
df <- data.frame(name=c("maay,bhtr","nsgu,nhuts thang","affat,nurfs","nukhyu,biyts","ngyst,muun","nsgyu,noon","utrs guus,book","thum,cryant","mumt,cant","bhan,btan","khtri,ntuk","ghaan,rstu","shaan,btqaan","nhue,bjtraan","wutys,cyun","hrtsh,jaan"),
email=c("maay.bhtr#email.com","nsgu.nhuts thang#gmail.com","asfa.1234#gmail.com","nukhyu.biyts#gmail.com","ngyst.muun#gmail.com","nsgyu.noon#gmail.com","utrs guus.book#hotmail.com","thum.cryant#live.com","mumt.cant#gmail.com","bhan.btan#gmail.com","khtri.ntuk#gmail.c.om","chang.lee#gmail.com","shaan.btqaan#gmail.com","nhue.bjtraan#gmail.com","wutys.cyun#gmailcom","hrtsh.jaan#gmail.com"))
library(stringr)
library(dplyr)
## extract all of the names: any string of letters unbroken by a space, punctuation, or a number
names <- str_extract_all(df$name, "[A-Za-z]*") %>%
  ## make a matrix out of the names
  do.call(rbind, .) %>%
  ## turn the names into a data frame
  as.data.frame()
## some of the columns have all "" in them, find which ones are all ""
w <- sapply(names, function(x) all(x == ""))
## if any of the columns are all "" then ...
if(any(w)){
  ## remove those columns from the dataset
  names <- names[, -which(w)]
}
## add email into this dataset that has the individual names
names$email <- df$email
library(tidyr)
## pipe the names dataset (which has individual names and an e-mail address)
out <- names %>%
  ## switch from wide to long format
  pivot_longer(-email, names_to = "V", values_to = "n") %>%
  ## create consistent = 1 if the name is not detected in the e-mail
  mutate(consistent = !str_detect(email, n)) %>%
  ## group the data by e-mail
  group_by(email) %>%
  ## take the maximum of consistent by group
  ## this will be 1 if any of the names are not detected in the e-mail
  summarise(consistent = max(consistent)) %>%
  ## join back together with the original data
  left_join(df) %>%
  ## change the variable ordering back
  select(name, email, consistent)
out
# # A tibble: 16 x 3
# name email consistent
# <chr> <chr> <int>
# 1 affat,nurfs asfa.1234#gmail.com 1
# 2 bhan,btan bhan.btan#gmail.com 0
# 3 ghaan,rstu chang.lee#gmail.com 1
# 4 hrtsh,jaan hrtsh.jaan#gmail.com 0
# 5 khtri,ntuk khtri.ntuk#gmail.c.om 0
# 6 maay,bhtr maay.bhtr#email.com 0
# 7 mumt,cant mumt.cant#gmail.com 0
# 8 ngyst,muun ngyst.muun#gmail.com 0
# 9 nhue,bjtraan nhue.bjtraan#gmail.com 0
# 10 nsgu,nhuts thang nsgu.nhuts thang#gmail.com 0
# 11 nsgyu,noon nsgyu.noon#gmail.com 0
# 12 nukhyu,biyts nukhyu.biyts#gmail.com 0
# 13 shaan,btqaan shaan.btqaan#gmail.com 0
# 14 thum,cryant thum.cryant#live.com 0
# 15 utrs guus,book utrs guus.book#hotmail.com 0
# 16 wutys,cyun wutys.cyun#gmailcom 0
#
Note, I had to change two of the values of e-mail in your dataset to match the image you posted.
Problem
In some health datasets, a column may categorize various disease manifestations of interest for individual cases. In some summaries it is beneficial to tabulate various combinations of these manifestations, including counting if a given case had 'greater than' or 'less than' a selection of key manifestations.
In SAS, a column can be assigned a multilabel format, which can allow various overlapping categories to be summarized at the same time during procedure steps. I have struggled to find a satisfactory solution in R that replicates this feature from SAS. I am aware that a combination of dplyr or base functions chained together can tabulate and append different combinations, effectively creating a dataset that duplicates rows needed for representing all overlapping levels.
Aim
To create a function that allows for easy creation of a dataset that considers various overlapping levels of a target category. This would allow for the transformation of the example data provided below into a new dataset that appends the correct rows, and can provide checks within groups to see if a certain grouping matches all the desired levels to be considered part of a new grouping.
library(tibble)
# Example data (Repeat groups)
exampleData <- tibble(group = c(1, 1, 1, 2, 3, 3),
condition = factor(c('A', 'B', 'C', 'A', 'B', 'Q'), ordered = F))
# Initial output
# A tibble: 6 x 2
group condition
<dbl> <fct>
1 1 A
2 1 B
3 1 C
4 2 A
5 3 B
6 3 Q
# Function to add new level combinations, based upon the levels within each group.
create_multilevelFactor(exampleData , target_col = 'condition', group_col = 'group', new_levels = list('AB' = c('A', 'B'), 'QB' = c('Q', 'B')))
# Desired output
# A tibble: 8 x 3
group condition track_col
<dbl> <chr> <dbl>
1 1 A 1
2 1 B 1
3 1 C 1
4 2 A 1
5 3 B 1
6 3 Q 1
7 1 AB 2
8 3 QB 3
You will note that the original factor levels persist, and the groups that contained the correct levels in the named list will form a new row if the combination exists. In more realistic examples, the grouping for AB could be considered as group 1 having 'at least A or B disease manifestations'.
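For reference, the kind of hand-rolled dplyr chain mentioned in the Problem section might look like this for the single 'AB' level (a minimal sketch against exampleData above; ab_rows is a hypothetical intermediate, not the requested function):
library(dplyr)
ab_rows <- exampleData %>%
  group_by(group) %>%
  filter(all(c('A', 'B') %in% condition)) %>%     # keep only groups containing both A and B
  summarise(condition = 'AB', .groups = 'drop')   # one appended row per qualifying group
bind_rows(mutate(exampleData, condition = as.character(condition)), ab_rows)
Doing this for every new level quickly becomes repetitive, which is what the function below aims to avoid.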
Challenge
I suspect that others may have a similar need for this function and like me, are either ignorant of a simpler approach or have not come across an existing solution that is easy to use. During my thought process for this question, I have created a function (trying to use base R primarily) that, albeit inelegantly, creates the aforementioned desired output.
It is my hope that others can provide a more ideal solution using an alternative approach or increase robustness and wider applicability of the function.
The following function provides a working, albeit inelegant, solution to the problem. I tend to overthink processes, which is likely reflected in the answer here.
This function will take in the initial dataset, and based upon if a grouping function is provided, it will create a new dataset with additional rows for various combinations of aggregated factor levels if those levels existed within the groupings. Various new levels can be provided as a list, and an additional column makes it easy to see which new levels were added in addition to the original rows.
#-----------------------------------------------------------#
# Create function for multilevel labelling of factor groups #
#-----------------------------------------------------------#
# target_col is a character string for the column of interest to be adjusted
# group_col is a character string for the column to check levels that exist within groupings
# new_levels is a list that uses name and value pairs to determine how new levels should be aggregated
# collapse will ensure that only unique combinations of the new level is appended
# track will add a flag to ensure one can easily see the new combinations that were appended
create_multilevelFactor <- function(data, target_col, new_levels, group_col, collapse = T, track = T) {
  #
  # Do some basic checks on inputs
  #
  # Check if new_levels is provided as a list
  if(!is.list(new_levels)) stop('The provided set of levels is not in a list format, please provide as a list')
  # Check if target_col is a factor
  if(!is.factor(data[[target_col]])) stop('The target column for multiple levels is not a factor, convert to a factor before proceeding.')
  # Check if levels are in list
  for(i in 1:length(new_levels)) {
    if(length(setdiff(levels(factor(new_levels[[i]])),
                      levels(factor(data[[target_col]])))) > 0) { # If the provided list contains a level not in the column, throw an error
      stop('Levels in list do not match the levels in the target column')
    }
  }
  # State if grouping col was provided and its purpose
  if(!missing(group_col)) {
    message(paste0('The following column is used as a grouping variable for summarizing multilevel factoring: ',
                   group_col, '. If you do not want labels determined by those within groupings, leave argument blank.'))
  }
  #
  # Main
  #
  # Set new column for tracking if desired
  if(track == T) {track_col <- rep(NA, nrow(data)); data$track_col <- 1; trackColIndex <- 1;}
  OutData <- as.data.frame(NULL) # Empty data frame to fill and append later
  # Loop for all new levels of interest to add
  for(i in 1:length(new_levels)){
    tempData <- data # Look at fresh data every pass
    levelIndex <- which(levels(tempData[[target_col]]) %in% new_levels[[i]]) # Index of matches
    # If grouping provided, do necessary splits and rbinds
    if(!missing(group_col)) {
      tempData <- split(tempData, tempData[[group_col]]) # Split if there are groupings
      tempData <- lapply(tempData, function(x) {
        if(!(length(setdiff(levels(factor(new_levels[[i]])), levels(factor(x[[target_col]])))) > 0)) { # If the grouping does not have all the levels for the new grouping, do nothing
          levels(x[[target_col]])[levelIndex] <- names(new_levels)[i]
          x
        }
      })
      tempData <- do.call(rbind, tempData) # If the necessary group conditions were not met, this comes back empty
      rownames(tempData) <- NULL # Correct row names for tibble
    } else { # If not grouping
      levels(tempData[[target_col]])[levelIndex] <- names(new_levels)[i]
    }
    tempData <- tempData[tempData[[target_col]] %in% names(new_levels)[i],] # Only keep new factor levels (could be empty if no group matches)
    if(collapse == T) tempData <- unique(tempData[(tempData[[target_col]] %in% names(new_levels)[i]),]) # Collapse to unique combinations if desired
    if(track == T){track_col <- rep(NA, nrow(tempData)); tempData$track_col <- trackColIndex + 1; trackColIndex <- trackColIndex + 1;} # Add track column to the new rows
    OutData <- suppressWarnings(dplyr::bind_rows(OutData, tempData)) # Append all the new rows
  }
  # Append new rows to the original rows
  OutData <- suppressWarnings(dplyr::bind_rows(data, OutData))
  return(OutData)
}
Using the example data initially provided, this can produce the following outputs:
#Original data
library(tibble)
# Example data (Repeat groups)
exampleData <- tibble(group = c(1, 1, 1, 2, 3, 3),
condition = factor(c('A', 'B', 'C', 'A', 'B', 'Q'), ordered = F))
# Original data
# A tibble: 6 x 2
group condition
<dbl> <fct>
1 1 A
2 1 B
3 1 C
4 2 A
5 3 B
6 3 Q
##################
newData <- create_multilevelFactor(exampleData,
target_col = 'condition',
group_col = 'group',
new_levels = list('AB' = c('A', 'B'), 'QB' = c('Q', 'B')),
collapse = T, track = T)
newData
# Data with grouping argument
# A tibble: 8 x 3
group condition track_col
<dbl> <chr> <dbl>
1 1 A 1
2 1 B 1
3 1 C 1
4 2 A 1
5 3 B 1
6 3 Q 1
7 1 AB 2
8 3 QB 3
addmargins(table(newData$group,newData$condition))
A AB B C Q QB Sum
1 1 1 1 1 0 0 4
2 1 0 0 0 0 0 1
3 0 0 1 0 1 1 3
Sum 2 1 2 1 1 1 8
newData <- create_multilevelFactor(exampleData,
target_col = 'condition',
new_levels = list('AB' = c('A', 'B'), 'QB' = c('Q', 'B')),
collapse = T, track = T)
newData
# Without grouping argument
# A tibble: 11 x 3
group condition track_col
<dbl> <chr> <dbl>
1 1 A 1
2 1 B 1
3 1 C 1
4 2 A 1
5 3 B 1
6 3 Q 1
7 1 AB 2
8 2 AB 2
9 3 AB 2
10 1 QB 3
11 3 QB 3
newData <- create_multilevelFactor(exampleData,
target_col = 'condition',
new_levels = list('AB' = c('A', 'B'), 'QB' = c('Q', 'B')),
collapse = F, track = T)
newData
# Without collapse and grouping argument
# A tibble: 13 x 3
group condition track_col
<dbl> <chr> <dbl>
1 1 A 1
2 1 B 1
3 1 C 1
4 2 A 1
5 3 B 1
6 3 Q 1
7 1 AB 2
8 1 AB 2
9 2 AB 2
10 3 AB 2
11 1 QB 3
12 3 QB 3
13 3 QB 3