I have a dataset that stores observations for participants vertically over time. Participants can have any number of follow-ups; currently there are anywhere from 1 to 14 lines per participant, and more are expected to be added with time.
I have a list of variables var that the participants have presumably reported at each follow-up, and I want to create a new set of "ever" variables vare that indicate whether, at any time up to and including a given follow-up, a participant reported "yes" for the corresponding variable.
Here is an example of the desired input/output:
var = c("var1","var2")
vare = paste0(var,"_ever")
data = data.frame(idno = c(123,123,123,123,123,123,123),
                  followup_num = c(0,1,2,3,4,5,6),
                  var1 = c(0,NA,0,1,0,NA,1),
                  var2 = c(1,NA,NA,0,0,0,1))
data$var1_ever = c(0,0,0,1,1,1,1)
data$var2_ever = c(1,1,1,1,1,1,1)
idno followup_num var1 var1_ever var2 var2_ever
 123            0    0         0    1         1
 123            1   NA         0   NA         1
 123            2    0         0   NA         1
 123            3    1         1    0         1
 123            4    0         1    0         1
 123            5   NA         1    0         1
 123            6    1         1    1         1
This is the code I am currently using. Obviously, nested for loops are not ideal in R, and this segment of code is particularly slow when handed a few thousand lines.
# For each ID
for (i in unique(data$idno)) {
  id = data$idno %in% i               # Get the relevant lines for this ID
  fus = sort(data$followup_num[id])   # Get the follow-up numbers
  # For each variable in the list
  for (v in seq_along(var)) {
    # Loop through the follow-ups. If the variable reports "yes", mark this
    # and every subsequent follow-up as having reported that variable ever.
    # Otherwise, mark 0 at that line and move on to the next follow-up.
    for (f in fus) {
      if (data[id & data$followup_num %in% f, var[v]] %in% 1) {
        data[id & data$followup_num >= f, vare[v]] = 1
        break
      } else {
        data[id & data$followup_num %in% f, vare[v]] = 0
      }
    }
  }
}
Is this a problem with an existing solution? Is there a way to optimize/simplify? Is there a use of apply/sapply/etc. functions that I neglected to try?
At its core, the solution is the base function cummax(). We need to take NA into account, so I added tidyr::replace_na(). And we need to account for additional idno values by using group_by().
A minimal vectorized solution is
df$var1_test <- cummax(replace_na(df$var1, 0))
This is a great problem to solve with the tidyverse mutate() and across() functions!
df = data.frame(idno = c(123,123,123,123,123,123,123),
                followup_num = c(0,1,2,3,4,5,6),
                var1 = c(0,NA,0,1,0,NA,1),
                var2 = c(1,NA,NA,0,0,0,1))
df %>% group_by(idno) %>%
arrange(idno, followup_num) %>%
mutate(across(.cols=starts_with("var"),
.fns= ~ cummax(tidyr::replace_na(.x, 0)),
.names="{.col}_ever2"))
idno followup_num var1 var2 var1_ever2 var2_ever2
1 123 0 0 1 0 1
2 123 1 NA NA 0 1
3 123 2 0 NA 0 1
4 123 3 1 0 1 1
5 123 4 0 0 1 1
6 123 5 NA 0 1 1
7 123 6 1 1 1 1
Alternatively, if you want to summarise the data to a single row per idno, then a grouped max works:
df %>%
group_by(idno) %>%
summarise(across(.cols = starts_with("var"),
                 .fns = ~ max(.x, na.rm = TRUE),
                 .names = "{.col}_ever3"))
idno var1_ever3 var2_ever3
1 123 1 1
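One caveat, assuming a group could be all NA for some variable: max(.x, na.rm = TRUE) returns -Inf (with a warning) for such groups. A possible workaround is a small guard helper (safe_max is a hypothetical name, not part of the original answer):
# Hypothetical helper: returns NA instead of -Inf when a group is all NA
safe_max <- function(x) if (all(is.na(x))) NA_real_ else max(x, na.rm = TRUE)

df %>%
  group_by(idno) %>%
  summarise(across(.cols = starts_with("var"),
                   .fns = safe_max,
                   .names = "{.col}_ever3"))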
P.S. data() is an existing base R function, so it's better to name the variable df.
Consider ave + cummax (with ifelse to handle NAs):
data <- within(
data, {
var2_ever <- ave(var2, idno, FUN=\(x) cummax(ifelse(is.na(x), 0, x)))
var1_ever <- ave(var1, idno, FUN=\(x) cummax(ifelse(is.na(x), 0, x)))
}
)
For many columns:
vars <- grep("^var\\d+$", names(data), value = TRUE) # avoid also matching the *_ever columns
data[paste0(vars, "_ever")] <- sapply(
  vars, \(var) ave(data[[var]], data$idno, FUN = \(x) cummax(ifelse(is.na(x), 0, x)))
)
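A quick sanity check against the expected columns from the question (both comparisons should print TRUE):
identical(data$var1_ever, c(0, 0, 0, 1, 1, 1, 1))
identical(data$var2_ever, c(1, 1, 1, 1, 1, 1, 1))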
I have a dataframe containing a long list of binary variables. Each row represents a participant, and columns represent whether a participant made a certain choice (1) or not (0). For the sake of simplicity, let's say there are only four binary variables and 6 participants.
df <- data.frame(a = c(0,1,0,1,0,1),
b = c(1,1,1,1,0,1),
c = c(0,0,0,1,1,1),
d = c(1,1,0,0,0,0))
>df
# a b c d
# 1 0 1 0 1
# 2 1 1 0 1
# 3 0 1 0 0
# 4 1 1 1 0
# 5 0 0 1 0
# 6 1 1 1 0
In the dataframe, I want to create a list of columns that reflect each unique three-way combination of variables in df (i.e., abc, abd, acd, bcd). Then, for each row, I want to add the value "1" if the row contains the particular combination corresponding to the column. So, if a participant scored 1 on "a", "b", and "c", and 0 on "d", they would have a score of 1 in the newly created column "abc", but 0 in the other columns. Ideally, it would look something like this.
>df_updated
# a b c d abc abd acd bcd
# 1 0 1 0 1 0 0 0 0
# 2 1 1 0 1 0 1 0 0
# 3 0 1 0 0 0 0 0 0
# 4 1 1 1 0 1 0 0 0
# 5 0 0 1 0 0 0 0 0
# 6 1 1 1 0 1 0 0 0
The ultimate goal is to have an idea of the frequency of each of the combinations, so I can order them from the most frequently chosen to the least frequently chosen. I've been thinking about this issue for days now, but couldn't find an appropriate answer. I would very much appreciate the help.
Something like this?
funCombn <- function(data){
  f <- function(x, data){
    data <- data[x]
    list(
      name = paste(x, collapse = ""),
      vec = apply(data, 1, function(x) +all(as.logical(x)))
    )
  }
  res <- combn(names(data), 3, f, simplify = FALSE, data = data)
  out <- do.call(cbind.data.frame, lapply(res, '[[', 'vec'))
  names(out) <- sapply(res, '[[', 'name')
  cbind(data, out)
}
funCombn(df)
# a b c d abc abd acd bcd
#1 0 1 0 1 0 0 0 0
#2 1 1 0 1 0 1 0 0
#3 0 1 0 0 0 0 0 0
#4 1 1 1 0 1 0 0 0
#5 0 0 1 0 0 0 0 0
#6 1 1 1 0 1 0 0 0
Base R option using combn:
n <- 3
cbind(df, do.call(cbind, combn(names(df), n, function(x) {
setNames(data.frame(as.integer(rowSums(df[x] == 1) == n)),
paste0(x, collapse = ''))
}, simplify = FALSE))) -> result
result
# a b c d abc abd acd bcd
#1 0 1 0 1 0 0 0 0
#2 1 1 0 1 0 1 0 0
#3 0 1 0 0 0 0 0 0
#4 1 1 1 0 1 0 0 0
#5 0 0 1 0 0 0 0 0
#6 1 1 1 0 1 0 0 0
Using combn, create all combinations of column names taking n columns at a time. For each of those combinations, assign 1 to the rows where all n columns are 1, and 0 otherwise.
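Since the ultimate goal is a frequency ordering, one possible follow-up on result (a sketch, not part of the original answer):
# Count how often each combination column equals 1, most frequent first
comb_cols <- setdiff(names(result), names(df))
sort(colSums(result[comb_cols]), decreasing = TRUE)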
If you are just looking for a frequency of the combinations (and they don't need to be back in the original data), then you could use something like this:
df <- data.frame(a = c(0,1,0,1,0,1),
b = c(1,1,1,1,0,1),
c = c(0,0,0,1,1,1),
d = c(1,1,0,0,0,0))
n <- names(df)
out <- sapply(n, function(x) ifelse(df[[x]] == 1, x, ""))
combs <- apply(out, 1, paste, collapse="")
sort(table(combs))
# combs
# abd b bd c abc
# 1 1 1 1 2
Ok, so let's use your data, including one row without any 1's:
df <- data.frame(
a = c(0,1,0,1,0,1,0),
b = c(1,1,1,1,0,1,0),
c = c(0,0,0,1,1,1,0),
d = c(1,1,0,0,0,0,0)
)
Now I want to paste all column names together if they have a 1, and then make that a wide table (so that there is a column for each combination). Of course, I fill all resulting NAs with 0's.
df2 <- df %>%
dplyr::mutate(
combination = paste0(
ifelse(a == 1, "a", ""), # There is possibly a way to automate this as well using across()
ifelse(b == 1, "b", ""),
ifelse(c == 1, "c", ""),
ifelse(d == 1, "d", "")
),
combination = ifelse(
combination == "",
"nothing",
paste0("comb_", combination)
),
value = ifelse(
is.na(combination),
0,
1
),
i = dplyr::row_number()
) %>%
tidyr::pivot_wider(
names_from = combination,
values_from = value,
names_repair = "unique"
) %>%
replace(., is.na(.), 0) %>%
dplyr::select(-i)
Since you want to order the original df by frequency, you can create a summary of all combinations (excluding those without anything filled in). Then you make it a long table, arrange by frequency, and pull the column name for every combination.
comb_in_order <- df2 %>%
dplyr::select(
-tidyselect::any_of(
c(
names(df),
"nothing" # I think you want these last.
)
)
) %>%
dplyr::summarise(
dplyr::across(
.cols = tidyselect::everything(),
.fns = sum
)
) %>%
tidyr::pivot_longer(
cols = tidyselect::everything(),
names_to = "combination",
values_to = "frequency"
) %>%
dplyr::arrange(
dplyr::desc(frequency)
) %>%
dplyr::pull(combination)
The only thing left to do is to arrange df2 by these combination columns and then select the original columns of df again.
df2 %>%
dplyr::arrange(
across(
tidyselect::any_of(comb_in_order),
desc
)
) %>%
dplyr::select(
tidyselect::any_of(names(df))
)
This should work for all possible combinations.
I am working to classify rows of a wide data frame in an added column, based on a threshold (> 0) across multiple columns. Previous examples here on SO require complete column names and ifelse() statements with > and ==, but I need to be able to use grep() or contains() to isolate columns based on a common string.
Input dataframe:
library(tidyverse)
df <- data.frame(
"ID" = c("asdf","vfdkun", "seifu", "seijd", "qweri"),
"elephant_zoo" = c(1,1,1,2,0), #Should not be useful there
"rhino_zoo" = c(1,2,3,1,0), #Should not be useful there
"hippo_zoo" = c(1,1,0,0,0),
"elephant_wild_A" = c(0,0,1,1,3),
"rhino_wild_A" = c(0,0,4,3,1),
"elephant_wild_B" = c(0,0,0,0,0),
"rhino_wild_C" = c(0,0,1,5,7),
"hippo_wild_B" = c(0,0,0,0,0)) %>%
column_to_rownames(var = "ID")
df
In reality, this has many more columns and rows!
The desired output dataframe has classification columns (ZOO and WILD) and a compilation of these in CLASSIFICATION.
df_goal <- data.frame(
"ID" = c("asdf","vfdkun", "seifu", "seijd", "qweri"),
"elephant_zoo" = c(1,1,1,2,2), #Should not be useful there
"rhino_zoo" = c(1,2,3,1,2), #Should not be useful there
"hippo_zoo" = c(1,1,0,0,2),
"elephant_wild_A" = c(0,0,1,1,3),
"rhino_wild_A" = c(1,0,4,3,1),
"elephant_wild_B" = c(0,0,0,0,0),
"rhino_wild_C" = c(6,0,1,5,7),
"hippo_wild_B" = c(0,0,0,0,0)) %>%
column_to_rownames(var = "ID") %>%
add_column(ZOO = c("zoo", "zoo", "zoo", "zoo", "")) %>%
add_column(WILD = c("", "", "wild", "wild", "wild")) %>%
add_column(CLASSIFICATION = c("zoo only", "zoo only", "both", "both", "wild only"))
df_goal
I am hoping to use a combination of mutate() and case_when(), but I cannot get the selection of multiple columns correct.
Examples of attempts:
# using an if else statement
df %>%
mutate(ZOO = ifelse(select(contains("zoo")) > 0, "zoo", "F"))
# using mutate and case_when
df %>%
mutate(ZOO = case_when(
select(contains("zoo")) > 0 ~ "zoo",
TRUE ~ ""))
My actual data frame has many more categories, so I need to be able to break it down into ZOO vs. WILD first and then follow up with the compiled column.
You can try using reduce from the purrr package. One may use an intermediate function any_cols to make the code clearer and use it with across:
library(tidyverse)
any_cols <- function(df) reduce(df, `|`)
df %>%
  mutate(ZOO = ifelse(any_cols(across(contains("zoo"), ~ .x > 0)), "zoo", "F"))
elephant_zoo rhino_zoo hippo_zoo elephant_wild_A rhino_wild_A elephant_wild_B rhino_wild_C hippo_wild_B ZOO
1 1 1 1 0 0 0 0 0 zoo
2 1 2 1 0 0 0 0 0 zoo
3 1 3 0 1 4 0 1 0 zoo
4 2 1 0 1 3 0 5 0 zoo
5 0 0 0 3 1 0 7 0 F
df %>%
  mutate(ZOO = case_when(
    any_cols(across(contains("zoo"), ~ .x > 0)) ~ "zoo",
    TRUE ~ "F"))
elephant_zoo rhino_zoo hippo_zoo elephant_wild_A rhino_wild_A elephant_wild_B rhino_wild_C hippo_wild_B ZOO
1 1 1 1 0 0 0 0 0 zoo
2 1 2 1 0 0 0 0 0 zoo
3 1 3 0 1 4 0 1 0 zoo
4 2 1 0 1 3 0 5 0 zoo
5 0 0 0 3 1 0 7 0 F
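The question also asks for a WILD column and a compiled CLASSIFICATION. A hedged sketch extending the same any_cols() pattern (column names taken from the question's df; not part of the original answer):
df %>%
  mutate(
    ZOO  = ifelse(any_cols(across(contains("zoo"),  ~ .x > 0)), "zoo",  ""),
    WILD = ifelse(any_cols(across(contains("wild"), ~ .x > 0)), "wild", ""),
    # compile the two indicator columns into the requested CLASSIFICATION
    CLASSIFICATION = case_when(
      ZOO == "zoo" & WILD == "wild" ~ "both",
      ZOO == "zoo"                  ~ "zoo only",
      WILD == "wild"                ~ "wild only",
      TRUE                          ~ ""
    )
  )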
The condition is that, within any event, if there are two or more consecutive rows with values higher than 1, the whole group should be deleted.
For example:
Event<- c(1,1,1,1,2,2,2,2,2,2,3,3,3,3,3)
Value<- c(1,0,0,0,8,7,1,0,0,0,8,0,0,0,0)
A<- data.frame(Event, Value)
Event Value
1 1
1 0
1 0
1 0
2 8
2 7
2 1
2 0
2 0
2 0
3 8
3 0
3 0
3 0
3 0
In this example the group for event 2 should be deleted because it has two consecutive rows with values higher than 1. So the result should look like:
Event Value
1 1
1 0
1 0
1 0
3 8
3 0
3 0
3 0
3 0
Any suggestions?
We can use rle by groups.
library(dplyr)
A %>%
group_by(Event) %>%
filter(!any(with(rle(Value > 1), lengths[values] > 1)))
#Opposite way using all
#filter(all(with(rle(Value > 1), lengths[values] < 2)))
# Event Value
# <dbl> <dbl>
#1 1 1
#2 1 0
#3 1 0
#4 1 0
#5 3 8
#6 3 0
#7 3 0
#8 3 0
#9 3 0
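To see why the filter works, here is what the rle() expression evaluates to for Event 2 (Values 8, 7, 1, 0, 0, 0):
x <- c(8, 7, 1, 0, 0, 0) > 1  # TRUE TRUE FALSE FALSE FALSE FALSE
rle(x)
# Run Length Encoding
#   lengths: int [1:2] 2 4
#   values : logi [1:2] TRUE FALSE
# lengths[values] picks the lengths of the TRUE runs: 2
# 2 > 1, so any(...) is TRUE and Event 2 gets filtered out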
The same logic can be used in base R:
subset(A, !ave(Value > 1, Event, FUN = function(x)
any(with(rle(x), lengths[values] > 1))))
as well as data.table
library(data.table)
setDT(A)[, .SD[!any(with(rle(Value > 1), lengths[values] > 1))], Event]
Using dplyr
A %>%
  group_by(Event) %>%
  # flag rows where this value and the previous one are both > 1
  mutate(consec = Value > 1 & lag(Value, default = 0) > 1) %>%
  filter(!any(consec)) %>%
  ungroup() %>%
  select(-consec)
A base R approach:
# split the Value > 1 flags into separate vectors, one per Event
A_split <- split(A$Value > 1, A$Event)
# for each event, check for consecutive TRUE values;
# "keep" is a named T/F vector whose names correspond to A$Event
keep <- sapply(A_split, function(x) sum(head(x, -1) * tail(x, -1))) == 0
# convert keep to a numeric vector of the A$Event values to retain
keep <- as.numeric(names(keep)[keep])
# subset A based on the keep vector
A[A$Event %in% keep, ]
I'm relatively new to R and am struggling with what is potentially a very simple problem.
I have data with multiple columns named in a similar way. Here is some sample data:
df = data.frame(PPID = 1:50,
time1 = sample(c(0,1), 50, replace = TRUE),
time2 = sample(c(0,1), 50, replace = TRUE),
time3 = sample(c(0,1), 50, replace = TRUE),
condition1 = sample(c(0:3), 50, replace = TRUE),
condition2 = sample(c(0:3), 50, replace = TRUE))
In my actual data, I have much more columns - approximately 50 for time and 10 for condition.
I want to multiply the time columns by the condition columns, e.g. in the sample data it should give me 6 extra columns: time1_condition1, time1_condition2, time2_condition1, time2_condition2, time3_condition1, time3_condition2.
I tried solutions that were suggested in this thread, but they did not work (presumably because I didn't understand how mapply/apply work and did not make the appropriate changes) - it gave me the error message that the longer argument is not a multiple of the length of the shorter.
Any help would be greatly appreciated!
# Get all the "time" columns
time_cols <- grep("^time", names(df))
# Get all the "condition" columns
condition_cols <- grep("^condition", names(df))
# Multiply each "time" column with all the "condition" columns,
# creating a new dataframe
new_df <- do.call("cbind", lapply(df[time_cols], function(x) x * df[condition_cols]))
# Combine both dataframes
complete_df <- cbind(df, new_df)
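At this point the combined columns carry dot-separated names auto-generated by cbind, along these lines:
names(new_df)
# [1] "time1.condition1" "time1.condition2" "time2.condition1"
# [4] "time2.condition2" "time3.condition1" "time3.condition2"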
We can also generate column names in the requested time*_condition* format using expand.grid:
grid <- expand.grid(names(df)[condition_cols], names(df)[time_cols])
new_names <- paste(grid$Var2, grid$Var1, sep = "_") # e.g. "time1_condition1"
colnames(complete_df)[-seq_along(df)] <- new_names
Here is a tidyverse alternative
library(tidyverse)
idx.time <- grep("time", names(df), value = T)
idx.cond <- grep("condition", names(df), value = T)
bind_cols(
df,
map_dfc(transpose(expand.grid(idx.time, idx.cond, stringsAsFactors = F)),
~setNames(data.frame(df[, .x$Var1] * df[, .x$Var2]), paste(.x$Var1, .x$Var2, sep = "_"))))
# PPID time1 time2 time3 condition1 condition2 time1_condition1
#1 1 1 0 1 3 0 3
#2 2 0 1 1 0 1 0
#3 3 0 1 1 0 2 0
#4 4 0 0 1 0 3 0
#5 5 0 0 0 0 3 0
#...
Explanation: expand.grid creates all pairwise combinations of idx.time and idx.cond. transpose turns a list/data.frame inside-out and returns a list, similar to apply(..., 1, as.list); map_dfc then operates on every element of that list and column-binds results.
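A minimal illustration of what transpose() does to such a grid (assumed output shown as comments):
grid <- expand.grid(c("time1", "time2"), "condition1", stringsAsFactors = FALSE)
str(transpose(grid))
# List of 2
#  $ :List of 2
#   ..$ Var1: chr "time1"
#   ..$ Var2: chr "condition1"
#  $ :List of 2
#   ..$ Var1: chr "time2"
#   ..$ Var2: chr "condition1"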
Using
library(tidyverse)
a = df[grep("time", names(df))]
b = df[grep("condition", names(df))]
we can do:
map(a, ~ .x * b) %>%
  bind_cols() %>%
  set_names(paste(rep(names(a), each = ncol(b)), names(b), sep = "_"))
or we can:
cross2(a, b) %>%
  map(lift(`*`)) %>%
  set_names(paste(rep(names(a), times = ncol(b)), rep(names(b), each = ncol(a)), sep = "_")) %>%
  data.frame()
time1_condition1 time2_condition1 time3_condition1 time1_condition2 time2_condition2 time3_condition2
1 3 0 3 2 0 2
2 3 3 0 1 1 0
3 0 0 0 0 0 0
4 3 3 0 0 0 0
5 0 0 2 0 0 1
6 0 0 1 0 0 1
7 2 2 0 0 0 0