Intricate variable generation with conditionals against multiple factor variables in R - r

I'm trying to generate a new variable using multiple conditionals that evaluate against factor variables.
So, let's say I have this data.frame of factor variables:
x<-c("1", "2", "1","NA", "1", "2", "NA", "1", "2", "2", "NA" )
y<-c("1","NA", "2", "1", "1", "NA", "2", "1", "2", "1", "1" )
z<-c("1", "2", "3", "4", "1", "2", "3", "4", "1", "2", "3")
w<- c("01", "02", "03", "04","05", "06", "07", "01", "02", "03", "04")
df<-data.frame(x,y,z,w)
df$x<-as.factor(df$x)
df$y<-as.factor(df$y)
df$z<-as.factor(df$z)
df$w<-as.factor(df$w)
str(df)
So I need to get a new column v in my data frame that takes the value 1, 0, or NA according to the following conditions:
Takes value 1 if: x = "1", y = "1", z = "1" or "2", and w = "01" to "06".
Takes value 0 if at least one of those conditions is not met.
Takes value NA if any of x, y, z, or w is NA.
I tried using a pipe (%>%) with mutate and case_when but have been unable to make it work.
So my desired result would be a new column v in df which would look like this:
[1] 1 NA 0 NA 1 NA NA 0 0 0 NA

Here I also use mutate with case_when. Since the NA values in your dataset are the literal string "NA", we cannot use functions like is.na() to identify them. I would recommend changing them to real NA values (by removing the double quotes in your input).
As I've pointed out in the comment, I'm not sure why the eighth entry is "1" when the corresponding z is not "1" or "2".
library(dplyr)
df %>% mutate(v = case_when(x == "1" & y == "1" & z %in% c("1", "2") & w %in% paste0("0", 1:6) ~ "1",
x == "NA" | y == "NA" | z == "NA" | w == "NA" ~ NA_character_,
TRUE ~ "0"))
x y z w v
1 1 1 1 01 1
2 2 NA 2 02 <NA>
3 1 2 3 03 0
4 NA 1 4 04 <NA>
5 1 1 1 05 1
6 2 NA 2 06 <NA>
7 NA 2 3 07 <NA>
8 1 1 4 01 0
9 2 2 1 02 0
10 2 1 2 03 0
11 NA 1 3 04 <NA>
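As a rough sketch of the recommendation above, the literal "NA" strings can first be converted to real missing values, after which the rule can be written in base R. This assumes the columns are kept as character rather than factor (a simplification made here, not part of the question), and `all_met` is just an illustrative name.

```r
# Same data as in the question, kept as character columns (assumption: no factors).
x <- c("1", "2", "1", "NA", "1", "2", "NA", "1", "2", "2", "NA")
y <- c("1", "NA", "2", "1", "1", "NA", "2", "1", "2", "1", "1")
z <- c("1", "2", "3", "4", "1", "2", "3", "4", "1", "2", "3")
w <- c("01", "02", "03", "04", "05", "06", "07", "01", "02", "03", "04")
df <- data.frame(x, y, z, w, stringsAsFactors = FALSE)

# Turn the literal string "NA" into a real missing value.
df[df == "NA"] <- NA

# TRUE only when every condition for v = 1 holds.
all_met <- df$x == "1" & df$y == "1" &
  df$z %in% c("1", "2") & df$w %in% sprintf("%02d", 1:6)

# Rows with any missing input become NA; all other rows become 1 or 0.
df$v <- ifelse(complete.cases(df), as.integer(all_met), NA_integer_)
df$v
```

With real NA values in place, complete.cases() handles the "NA if any input is missing" rule, so the missing-value check no longer has to be spelled out column by column.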

Related

Make a new column with values according to other columns - in R

Taking this dummy data for example
structure(list(Metastasis_Brain = c("1", "1", "0", "1", "0",
"0"), Metastasis_Liver = c("0", "0", "1", "1", "1", "0"), Metastasis_Bone = c("1",
"1", "0", "1", "1", "0")), class = "data.frame", row.names = c("Patient_1",
"Patient_2", "Patient_3", "Patient_4", "Patient_5", "Patient_6"
))
Example of what I'm searching for: If there is 1 in columns Metastasis_Brain and Metastasis_Liver, the new column will contain "Brain, Liver".
If all three tissues are 1, then that row in the new column will contain "Brain, Liver, Bone".
If all are 0, then it doesn't matter, NA would be fine.
Using tidyverse:
library(tidyverse)
df %>%
rownames_to_column() %>%
left_join(pivot_longer(.,-rowname, names_prefix = '.*_') %>%
filter(value>0) %>%
group_by(rowname) %>%
summarise(nm = toString(name)))
rowname Metastasis_Brain Metastasis_Liver Metastasis_Bone nm
1 Patient_1 1 0 1 Brain, Bone
2 Patient_2 1 0 1 Brain, Bone
3 Patient_3 0 1 0 Liver
4 Patient_4 1 1 1 Brain, Liver, Bone
5 Patient_5 0 1 1 Liver, Bone
6 Patient_6 0 0 0 <NA>
In base R you could do:
aggregate(ind~rn, subset(transform(stack(df),
ind = sub('.*_', '', ind), rn = rownames(df)), values>0), toString)
rn ind
1 Patient_1 Brain, Bone
2 Patient_2 Brain, Bone
3 Patient_3 Liver
4 Patient_4 Brain, Liver, Bone
5 Patient_5 Liver, Bone
Another base R option:
df <- data.frame(
stringsAsFactors = FALSE,
row.names = c("Patient_1","Patient_2","Patient_3","Patient_4","Patient_5","Patient_6"),
Metastasis_Brain = c("1", "1", "0", "1", "0", "0"),
Metastasis_Liver = c("0", "0", "1", "1", "1", "0"),
Metastasis_Bone = c("1", "1", "0", "1", "1", "0"),
res = c("Brain, Bone","Brain, Bone",
"Liver","Brain, Liver, Bone","Liver, Bone",NA)
)
df$res <- sapply(apply(df, 1, function(x) gsub("Metastasis_", "", names(df)[x == 1])), toString)
df
#> Metastasis_Brain Metastasis_Liver Metastasis_Bone res
#> Patient_1 1 0 1 Brain, Bone
#> Patient_2 1 0 1 Brain, Bone
#> Patient_3 0 1 0 Liver
#> Patient_4 1 1 1 Brain, Liver, Bone
#> Patient_5 0 1 1 Liver, Bone
#> Patient_6 0 0 0 NA
Created on 2022-06-20 by the reprex package (v2.0.1)
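A variant of the base R idea that yields a real NA for patients with no positive site might look like the sketch below. It assumes the original three-column data frame (without a pre-filled res column); `sites` and `hits` are illustrative names, not taken from the answers above.

```r
# Dummy data from the question, as character columns.
df <- data.frame(
  Metastasis_Brain = c("1", "1", "0", "1", "0", "0"),
  Metastasis_Liver = c("0", "0", "1", "1", "1", "0"),
  Metastasis_Bone  = c("1", "1", "0", "1", "1", "0"),
  row.names = paste0("Patient_", 1:6),
  stringsAsFactors = FALSE
)

# Tissue names without the common prefix.
sites <- sub("^Metastasis_", "", names(df))

# Logical matrix: TRUE where a site is positive.
hits <- df == "1"

# Join the positive sites per row; rows with no positive site get NA.
df$res <- unname(apply(hits, 1, function(row) {
  if (any(row)) toString(sites[row]) else NA_character_
}))
df
```

Computing `hits` before adding the new column keeps the comparison restricted to the three indicator columns.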

How to format data from excel containing two rows of column headers to be able to use in R?

I am importing the following table 1 into R but am struggling with the formatting, as each column has two headers. My desired output is the second table 2. I plan to use tidyr to gather the data.
Another obstacle I have is the merged cells. I have been using fillMergedCells=TRUE to duplicate this.
read.xlsx(xlsxFile ="C:/Users/X/X/Desktop/X.xlsx",fillMergedCells = TRUE)
One option would be to:
1. Read your Excel file with the option colNames = FALSE.
2. Paste the first two rows together and use the result as the column names. Here I use an underscore as the separator, which makes it easy to split the names later on.
3. Get rid of the first two rows.
4. Use tidyr::pivot_longer to convert to long format.
# df <- openxlsx::read.xlsx(xlsxFile ="data/test2.xlsx", fillMergedCells = TRUE, colNames = FALSE)
# Use first two rows as names
names(df) <- paste(df[1, ], df[2, ], sep = "_")
names(df)[1] <- "category"
# Get rid of first two rows and columns containing year average
df <- df[-c(1:2), ]
df <- df[, !grepl("^Year", names(df))]
library(tidyr)
library(dplyr)
df %>%
pivot_longer(-category, names_to = c("Time", ".value"), names_pattern = "^(.*?)_(.*)$") %>%
arrange(Time)
#> # A tibble: 16 × 4
#> category Time Y Z
#> <chr> <chr> <chr> <chr>
#> 1 Total Feb-21 1 1
#> 2 A Feb-21 2 2
#> 3 B Feb-21 3 3
#> 4 C Feb-21 4 4
#> 5 D Feb-21 5 5
#> 6 E Feb-21 6 6
#> 7 F Feb-21 7 7
#> 8 G Feb-21 8 8
#> 9 Total Jan-21 1 1
#> 10 A Jan-21 2 2
#> 11 B Jan-21 3 3
#> 12 C Jan-21 4 4
#> 13 D Jan-21 5 5
#> 14 E Jan-21 6 6
#> 15 F Jan-21 7 7
#> 16 G Jan-21 8 8
DATA
df <- structure(list(X1 = c(
NA, NA, "Total", "A", "B", "C", "D", "E",
"F", "G"
), X2 = c(
"Year Rolling Avg.", "Share", NA, "1", "1",
"1", "1", "1", "1", "1"
), X3 = c(
"Year Rolling Avg.", "Y", "1",
"2", "3", "4", "5", "6", "7", "8"
), X4 = c(
"Year Rolling Avg.",
"Z", "1", "2", "3", "4", "5", "6", "7", "8"
), X5 = c(
"Jan-21",
"Y", "1", "2", "3", "4", "5", "6", "7", "8"
), X6 = c(
"Jan-21",
"Z", "1", "2", "3", "4", "5", "6", "7", "8"
), X7 = c(
"Feb-21",
"Y", "1", "2", "3", "4", "5", "6", "7", "8"
), X8 = c(
"Feb-21",
"Z", "1", "2", "3", "4", "5", "6", "7", "8"
)), row.names = c(
NA,
10L
), class = "data.frame")
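To see what the names_pattern argument above does with the pasted header names, the same regular expression can be tried in base R. This is only an illustration of the split; the vector `header_names` is a made-up sample of the combined names.

```r
# Sample of column names produced by pasting the two header rows (illustrative).
header_names <- c("Jan-21_Y", "Jan-21_Z", "Feb-21_Y")

# Same pattern as in names_pattern: text before the first underscore, then the rest.
pattern <- "^(.*?)_(.*)$"

# First capture group: the month label, which ends up in the Time column.
time_part <- sub(pattern, "\\1", header_names, perl = TRUE)

# Second capture group: the measure, which becomes a column via ".value".
value_part <- sub(pattern, "\\2", header_names, perl = TRUE)

data.frame(header_names, time_part, value_part)
```

The lazy `.*?` stops at the first underscore, so a second-row header that itself contained an underscore would stay intact in the second group.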

Select first row per run by group [duplicate]

I have data with a grouping variable (ID) and some values (type):
ID <- c("1", "1", "1", "1", "2", "2", "2", "2", "3", "3", "3", "3")
type <- c("1", "3", "3", "2", "3", "3", "1", "1", "1", "2", "2", "1")
dat <- data.frame(ID,type)
Within each ID, I want to delete values that repeat the value immediately before them, that is, consecutive duplicates rather than all duplicates. I have annotated some examples:
# ID type
# 1 1 1
# 2 1 3 # first value in a run of 3s within ID 1: keep
# 3 1 3 # 2nd value: remove
# 4 1 2
# 5 2 3
# 6 2 3
# 7 2 1
# 8 2 1
# 9 3 1
# 10 3 2 # first value in a run of 2s within ID 3: keep
# 11 3 2 # 2nd value: remove
# 12 3 1
For example, ID 3 has the sequence of values 1, 2, 2, 1. The third value is the same as the second value, so it should be deleted, leaving 1, 2, 1.
Thus, the desired output is:
data.frame(ID = c("1", "1", "1", "2", "2", "3", "3", "3"),
type = c("1", "3", "2", "3", "1", "1", "2", "1"))
ID type
1 1 1
2 1 3
3 1 2
4 2 3
5 2 1
6 3 1
7 3 2
8 3 1
I've tried
dat[!duplicated(dat), ]
however what I got was
ID <- c("1", "1", "1", "2", "2", "3", "3")
type<- c("1", "3", "2", "3", "1", "1", "2")
I know duplicated() keeps only the first occurrence of each unique row. How can I get the values I want?
Thanks for the help in advance!
Does this work:
library(dplyr)
dat %>% group_by(ID) %>%
mutate(flag = case_when(type == lag(type) ~ TRUE, TRUE ~ FALSE)) %>%
filter(!flag) %>% select(-flag)
# A tibble: 8 x 2
# Groups: ID [3]
ID type
<chr> <chr>
1 1 1
2 1 3
3 1 2
4 2 3
5 2 1
6 3 1
7 3 2
8 3 1
Using data.table rleid and duplicated -
library(data.table)
setDT(dat)[!duplicated(rleid(ID, type))]
# ID type
#1: 1 1
#2: 1 3
#3: 1 2
#4: 2 3
#5: 2 1
#6: 3 1
#7: 3 2
#8: 3 1
Improved answer including a suggestion from @Henrik.
Base R way, if you want to eliminate consecutive duplicate rows only (8 rows of output):
ID <- c("1", "1", "1", "1", "2", "2", "2", "2", "3", "3", "3", "3")
type<- c("1", "3", "3", "2", "3", "3", "1", "1", "1", "2", "2", "1")
dat <- data.frame(ID,type)
subset(dat, !duplicated(with(rle(paste(dat$ID, dat$type)), rep(seq_len(length(lengths)), lengths))))
#> ID type
#> 1 1 1
#> 2 1 3
#> 4 1 2
#> 5 2 3
#> 7 2 1
#> 9 3 1
#> 10 3 2
#> 12 3 1
Created on 2021-05-22 by the reprex package (v2.0.0)
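The idea shared by the answers above (drop a row when it matches the row directly before it) can also be sketched with a plain shifted comparison in base R. This assumes the rows are already ordered so that each ID forms one contiguous block, as in the example data; `same_as_prev` is an illustrative name.

```r
ID <- c("1", "1", "1", "1", "2", "2", "2", "2", "3", "3", "3", "3")
type <- c("1", "3", "3", "2", "3", "3", "1", "1", "1", "2", "2", "1")
dat <- data.frame(ID, type, stringsAsFactors = FALSE)

n <- nrow(dat)

# Compare each row (from the second on) with the row before it.
# The first row has no predecessor, so it is never flagged.
same_as_prev <- c(FALSE,
                  dat$ID[-1] == dat$ID[-n] & dat$type[-1] == dat$type[-n])

# Keep only rows that start a new run of (ID, type).
result <- dat[!same_as_prev, ]
result
```

Including ID in the comparison keeps the first row of each ID even when its type equals the last type of the previous ID.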

Creating all combinations of sampling from two groups of columns in R

I have the data frame below, which contains two groups of columns: A and B, and D and E. I would like to enumerate all combinations of filters on these columns, choosing exactly one column from each group at a time, and then summarise the data for each combination. I don't know the right way to do this, and the real problem is much bigger.
df=
Size A B D E
1 1 1 0 0
5 0 0 1 0
10 1 1 1 0
3 1 0 0 0
2 1 1 1 1
55 0 0 0 1
5 1 0 1 1
2 0 0 1 1
1 1 1 1 1
4 1 1 1 0
So the combinations to filter should be
Filter 1: A=1 AND D=1
Filter 2: A=1 AND D=0
Filter 3: A=1 AND E=1
Filter 4: A=1 AND E=0
Filter 5: A=0 AND D=1
Filter 6: A=0 AND D=0
Filter 7: A=0 AND E=1
Filter 8: A=0 AND E=0
Filter 9: B=1 AND D=1
Filter 10: B=1 AND D=0
Filter 11: B=1 AND E=1
Filter 12: B=1 AND E=0
Filter 13: B=0 AND D=1
Filter 14: B=0 AND D=0
Filter 15: B=0 AND E=1
Filter 16: B=0 AND E=0
I want to find a way to efficiently create these filter groups (drawing always 1 filter from either columns A&B or D&E) and then to find the average and count of the Size column for each filter setting. I only managed to do this without different groups to sample the filter from.
What I tried was in the form of this:
groupNames <- names(df)[2:5]
myGroups <- Map(combn,list(groupNames),seq_along(groupNames),simplify = FALSE) %>% unlist(recursive = FALSE)
results = lapply(myGroups, FUN = function(x) {do.call(what = group_by_, args = c(list(df), x)) %>% summarise( n = length(Size), avgVar1 = mean(Size))})
It treats the four columns equally and does not consider sampling from the 2 groups. What could I do to the code to make this work?
Thank you very much.
library(tidyverse)
df <- tribble(~Size, ~A, ~B, ~D, ~E,
1, "1", "1", "0", "0",
5, "0", "0", "1", "0",
10, "1", "1", "1", "0",
3, "1", "0", "0", "0",
2, "1", "1", "1", "1",
55, "0", "0", "0", "1",
5, "1", "0", "1", "1",
2, "0", "0", "1", "1",
1, "1", "1", "1", "1",
4, "1", "1", "1", "0")
p <- function(...) paste0(...) # for legibility, should rather use glue
all_filtering_groups <- list(c("A", "B"), c("D", "E")) # assuming these are known
all_combns <- map(1:length(all_filtering_groups), ~ combn(all_filtering_groups, .))
res <- vector("list", length(all_combns)) # pre-allocate one slot per combination size
#microbenchmark::microbenchmark({
for(comb_length in seq_along(all_combns)){
res[[comb_length]] <- vector("list", ncol(all_combns[[comb_length]])) # one slot per set of groups
for(col_i in seq_len(ncol(all_combns[[comb_length]]))){
filtering_groups <- all_combns[[comb_length]][,col_i]
group_names <- as.character(seq_along(filtering_groups))
# prepare grid of all combinations
filtering_combs <- c(filtering_groups, rep(list(0:1), length(filtering_groups)))
names(filtering_combs) <- c(p("vars_", group_names), p("vals_", group_names))
full_grid <- expand.grid(filtering_combs)
for(ll in 1:nrow(full_grid)){ # for each line in the full_grid
# find df lines that correspond
cond <- as.logical(rep(TRUE, nrow(df)))
for(grp in group_names){
cond <- cond & df[[full_grid[p("vars_", grp)][ll,]]] == full_grid[p("vals_", grp)][ll,]
}
# and compute whatever
full_grid$lines[ll] <- paste(which(cond), collapse = ", ") #for visual verification
full_grid$n[ll] <- length(df$Size[cond])
full_grid$sum[ll] <- sum(df$Size[cond])
full_grid$mean[ll] <- mean(df$Size[cond])
}
res[[comb_length]][[col_i]] <- full_grid
}
}
#}, times = 10) #microbenchmark
bind_rows(res) %>% relocate(starts_with("vars") | starts_with("vals"))
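For the specific case asked about (exactly one column from each of the two groups), the 16 filters can also be enumerated directly with expand.grid. The following is a minimal base R sketch, assuming numeric 0/1 columns rather than the character columns used above; `filters`, `var1`, `val1`, `var2` and `val2` are illustrative names.

```r
# Data from the question, with 0/1 stored as numbers (assumption).
df <- data.frame(
  Size = c(1, 5, 10, 3, 2, 55, 5, 2, 1, 4),
  A = c(1, 0, 1, 1, 1, 0, 1, 0, 1, 1),
  B = c(1, 0, 1, 0, 1, 0, 0, 0, 1, 1),
  D = c(0, 1, 1, 0, 1, 0, 1, 1, 1, 1),
  E = c(0, 0, 0, 0, 1, 1, 1, 1, 1, 0)
)

# One row per filter: a column and value from group 1, and one from group 2.
filters <- expand.grid(var1 = c("A", "B"), val1 = 0:1,
                       var2 = c("D", "E"), val2 = 0:1,
                       stringsAsFactors = FALSE)

filters$n <- NA_integer_
filters$mean <- NA_real_
for (i in seq_len(nrow(filters))) {
  # Rows of df that satisfy both conditions of filter i.
  keep <- df[[filters$var1[i]]] == filters$val1[i] &
    df[[filters$var2[i]]] == filters$val2[i]
  filters$n[i] <- sum(keep)
  # Leave the mean as NA when no row matches.
  filters$mean[i] <- if (any(keep)) mean(df$Size[keep]) else NA_real_
}
filters
```

Each row of `filters` corresponds to one of the 16 filter settings listed in the question, together with the count and average of Size for the matching rows.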
Following the discussion in the comments, I think we can treat the groups as variables. So we need to reshape the dataframe to have one column per factor, then we can use standard tidyverse approaches. I'm assuming the groups are defined by the column names (A1...Ak, B1...Bk, ...).
library(tidyverse)
df <- tribble(~Size, ~A1, ~A2, ~B1, ~B2,
1, "1", "1", "0", "0",
5, "0", "0", "1", "0",
10, "1", "1", "1", "0",
3, "1", "0", "0", "0",
2, "1", "1", "1", "1",
55, "0", "0", "0", "1",
5, "1", "0", "1", "1",
2, "0", "0", "1", "1",
1, "1", "1", "1", "1",
4, "1", "1", "1", "0")
get_levels <- function(col){
paste(names(col)[col == "1"], collapse = ",")
}
# Rewrite with groups as factors
df_factors <- df %>%
mutate(id = row_number()) %>% #to avoid aggregating same Size
nest(A = starts_with("A"), B = starts_with("B")) %>%
mutate(A = factor(map_chr(A, get_levels)),
B = factor(map_chr(B, get_levels)))
# Now look at factor combinations
df_factors %>%
group_by(A, B) %>%
summarize(n = n(),
mean = mean(Size))
# A tibble: 8 x 4
# Groups: A [3]
# A B n mean
# <fct> <fct> <int> <dbl>
# 1 "" "B1" 1 5
# 2 "" "B1,B2" 1 2
# 3 "" "B2" 1 55
# 4 "A1" "" 1 3
# 5 "A1" "B1,B2" 1 5
# 6 "A1,A2" "" 1 1
# 7 "A1,A2" "B1" 2 7
# 8 "A1,A2" "B1,B2" 2 1.5
I called "A" and "B" explicitly. It still seems doable to do that with 6 groups. If you have more, it would become necessary to automate this, but I'm not sure how to do that easily.
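One possible way to avoid naming each group by hand is to keep the group definitions in a named list and build the per-group labels in a loop. The sketch below uses base R rather than the tidyverse code above and is only one option; `groups`, `labels` and `key` are illustrative names, and the list of groups is assumed to be known in advance.

```r
df <- data.frame(
  Size = c(1, 5, 10, 3, 2, 55, 5, 2, 1, 4),
  A1 = c("1", "0", "1", "1", "1", "0", "1", "0", "1", "1"),
  A2 = c("1", "0", "1", "0", "1", "0", "0", "0", "1", "1"),
  B1 = c("0", "1", "1", "0", "1", "0", "1", "1", "1", "1"),
  B2 = c("0", "0", "0", "0", "1", "1", "1", "1", "1", "0"),
  stringsAsFactors = FALSE
)

# Group definitions (assumed known); add more entries for more groups.
groups <- list(A = c("A1", "A2"), B = c("B1", "B2"))

# For each group, label every row with the columns that equal "1".
labels <- lapply(groups, function(cols) {
  apply(df[cols] == "1", 1, function(r) paste(cols[r], collapse = ","))
})

# Combine the per-group labels into one key and summarise Size per key.
key <- interaction(labels, drop = TRUE, sep = " | ")
n <- tapply(df$Size, key, length)
avg <- tapply(df$Size, key, mean)
data.frame(n, mean = avg)
```

Because `groups` is an ordinary list, extending it to more groups only requires adding entries; the rest of the code does not change.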

R dataframe, expand rows by string variable [duplicate]

Can anyone please help with this little data.frame expansion problem?
Thanks in advance!
# I have
data.frame(rbind(c("1", "2", "3", "a/b/c"),
c("11", "0", "5", "c/d"),
c("3", "33", "0", "a"))
)
# X1 X2 X3 X4
# 1 1 2 3 a/b/c
# 2 11 0 5 c/d
# 3 3 33 0 a
# I want
data.frame(rbind(c("1", "2", "3", "a"),
c("1", "2", "3", "b"),
c("1", "2", "3", "c"),
c("11", "0", "5", "c"),
c("11", "0", "5", "d"),
c("3", "33", "0", "a"))
)
# X1 X2 X3 X4
# 1 1 2 3 a
# 2 1 2 3 b
# 3 1 2 3 c
# 4 11 0 5 c
# 5 11 0 5 d
# 6 3 33 0 a
We can use data.table (assuming the data frame above is stored as df1):
library(data.table)
setDT(df1)[, strsplit(as.character(X4), "/"), by = .(X1, X2, X3)]
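Without data.table, the same expansion can be sketched in base R by splitting the last column and repeating each row once per piece. This assumes the data is stored as `df1` with character columns; `parts` and `expanded` are illustrative names.

```r
df1 <- data.frame(
  X1 = c("1", "11", "3"),
  X2 = c("2", "0", "33"),
  X3 = c("3", "5", "0"),
  X4 = c("a/b/c", "c/d", "a"),
  stringsAsFactors = FALSE
)

# Split X4 on "/" to get one character vector per row.
parts <- strsplit(df1$X4, "/", fixed = TRUE)

# Repeat each row as many times as it has pieces, then fill in the pieces.
expanded <- df1[rep(seq_len(nrow(df1)), lengths(parts)), ]
expanded$X4 <- unlist(parts)
rownames(expanded) <- NULL
expanded
```

Unlike the data.table call, this keeps the column name X4 and does not require the other columns to form unique groups.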
