Note: This question was closed as a duplicate. The solutions offered here and here did not answer my question: they showed how to merge when a single value falls within a range, while I'm trying to identify overlapping ranges and join them. Perhaps my title could have been better...
I have a main data set main_df with a start and end time (in seconds). I would like to see if the time range in main_df falls within a list of ranges in lookup_df, and if so, grab the value from lookup_df. Additionally, if the range in main_df overlaps two different lookup ranges, duplicate the row so each value is represented.***
library(tibble)

main_df <- tibble(start = c(30, 124, 161),
                  end = c(80, 152, 185))
lookup_df <- tibble(start = c(34, 73, 126, 141, 174, 221),
                    end = c(69, 123, 136, 157, 189, 267),
                    value = c('a', 'b', 'b', 'b', 'b', 'a'))
# Do something here to get the following:
> final_df
# A tibble: 4 x 4
start end value notes
<dbl> <dbl> <chr> <chr>
1 30 80 a ""
2 30 80 b "Duplicate because it falls within a and b"
3 124 152 b "Falls within two lookups but both are b"
4 161 185 b ""
***Edit: Looking at the way I've structured the problem...
#Not actual code
left_join(main_df, lookup_df, by(some_range_join_function)) %>%
add_rows(through_some_means)
Rather than having to add a new row I could flip how I'm joining them...
semi_join(lookup_df, main_df, by(some_range_join_function))
You could do some logical comparisons and then handle the cases: what should happen if all matches are 'b', if they are 'a' and 'b', and so on. This way you could easily add more cases that you didn't declare in the OP, e.g. both are 'a', or one 'a' and several 'b'. The approach yields NULL if there are no matches, which gets omitted during rbind.
f <- \(x, y) {
  # rows of y whose range overlaps the range x = c(start, end)
  w <- which((x[1] >= y[, 1] & x[1] <= y[, 2]) | (x[2] >= y[, 1] & x[1] <= y[, 2]))
  if (length(w) > 0) {
    d <- data.frame(t(x), value = cbind(y[w, 3]), notes = '')
    if (length(w) >= 2) {
      if (all(d$value == 'b')) {
        # several matches but a single distinct value: keep just one row
        d <- d[!duplicated(d$value), ]
        d$notes[1] <- 'both b'
      } else {
        d$notes[nrow(d)] <- 'a & b'
      }
    }
    d
  }
}
apply(main_df, 1, f, lookup_df, simplify=F) |> do.call(what=rbind)
# start end value notes
# 1 30 80 a
# 2 30 80 b a & b
# 3 124 152 b both b
# 4 161 185 b
Data:
main_df <- structure(list(start = c(2, 30, 124, 161),
                          end = c(1, 80, 152, 185)),
                     row.names = c(NA, -4L), class = "data.frame")
lookup_df <- structure(list(start = c(34, 73, 126, 141, 174, 221),
                            end = c(69, 123, 136, 157, 189, 267),
                            value = c("a", "b", "b", "b", "b", "a")),
                       row.names = c(NA, -6L), class = "data.frame")
Another option is fuzzyjoin::interval_join:
library(fuzzyjoin)
library(dplyr)
interval_join(main_df, lookup_df, by = c("start", "end"), mode = "inner") %>%
  group_by(value, start.x, end.x) %>%
  slice(1) %>%
  select(start = start.x, end = end.x, value)
# A tibble: 4 × 3
# Groups: value, start, end [4]
start end value
<dbl> <dbl> <chr>
1 30 80 a
2 30 80 b
3 124 152 b
4 161 185 b
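If you also need the OP's notes column, one hedged way (assuming dplyr is loaded; the exact note wording is an assumption) is to count the distinct values per main_df interval before slicing:
interval_join(main_df, lookup_df, by = c("start", "end"), mode = "inner") %>%
  group_by(start.x, end.x) %>%
  mutate(n_values = n_distinct(value)) %>%  # how many distinct values overlap this interval
  group_by(value, start.x, end.x) %>%
  slice(1) %>%
  ungroup() %>%
  mutate(notes = if_else(n_values > 1, "falls within a and b", "")) %>%
  select(start = start.x, end = end.x, value, notes)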
You can use foverlaps from data.table for this.
library(data.table)
setDT(main_df) # make it a data.table if needed
setDT(lookup_df) # make it a data.table if needed
setkey(main_df, start, end) # set the keys of 'y'
foverlaps(lookup_df, main_df, nomatch = NULL) # do the lookup
# start end i.start i.end value
# 1: 30 80 34 69 a
# 2: 30 80 73 123 b
# 3: 124 152 126 136 b
# 4: 124 152 141 157 b
# 5: 161 185 174 189 b
Or, to get the cleaned result (the OP's final_df):
unique(foverlaps(lookup_df, main_df, nomatch = NULL)[, .(start, end, value)])
start end value
1: 30 80 a
2: 30 80 b
3: 124 152 b
4: 161 185 b
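A hedged sketch for rebuilding something like the OP's notes column on top of this (the note text is an assumption; it flags main_df intervals that matched more than one distinct value):
res <- unique(foverlaps(lookup_df, main_df, nomatch = NULL)[, .(start, end, value)])
res[, notes := if (.N > 1) "falls within a and b" else "", by = .(start, end)]
res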
A possible solution, based on powerjoin:
library(tidyverse)
library(powerjoin)
power_left_join(
  main_df, lookup_df,
  by = ~ (.x$start <= .y$start & .x$end >= .y$end) |
    (.x$start >= .y$start & .x$start <= .y$end) |
    (.x$start <= .y$start & .x$end >= .y$start),
  keep = "left") %>%
  distinct()
#> # A tibble: 4 x 3
#> start end value
#> <dbl> <dbl> <chr>
#> 1 30 80 a
#> 2 30 80 b
#> 3 124 152 b
#> 4 161 185 b
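As an aside, those three clauses reduce to the standard interval-overlap test (two closed ranges overlap iff each starts before the other ends), so the join condition can, I believe, be written more compactly; a minimal sketch with the same data:
power_left_join(
  main_df, lookup_df,
  by = ~ .x$start <= .y$end & .x$end >= .y$start,  # standard overlap test
  keep = "left") %>%
  distinct()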
Or using tidyr::crossing:
library(tidyverse)
crossing(main_df, lookup_df,
         .name_repair = ~ c("start", "end", "start2", "end2", "value")) %>%
  filter((start <= start2 & end >= end2) |
           (start >= start2 & start <= end2) |
           (start <= start2 & end >= start2)) %>%
  select(-start2, -end2) %>%
  distinct()
#> # A tibble: 4 x 3
#> start end value
#> <dbl> <dbl> <chr>
#> 1 30 80 a
#> 2 30 80 b
#> 3 124 152 b
#> 4 161 185 b
You can use the fuzzyjoin package to join based on intervals with the fuzzyjoin::interval_*_join() functions.
I'll be using an inner join, because if you use a semi join like you propose, you will lose the value column and get just 3 rows.
library(tidyverse)
library(fuzzyjoin)
fuzzyjoin::interval_inner_join(lookup_df, main_df, by = c("start", "end"), type = "any")
#> # A tibble: 5 × 5
#> start.x end.x value start.y end.y
#> <dbl> <dbl> <chr> <dbl> <dbl>
#> 1 34 69 a 30 80
#> 2 73 123 b 30 80
#> 3 126 136 b 124 152
#> 4 141 157 b 124 152
#> 5 174 189 b 161 185
As you can see, interval_inner_join() preserves the by columns from both tables, since in a fuzzy join they are not the same. Also, we still have separate rows for those cases in main_df that match multiple cases in lookup_df. Thus, we do some cleanup of the joined table:
interval_inner_join(lookup_df, main_df,
                    by = c("start", "end"),
                    type = "any") |>
  select(-ends_with(".x")) |>                       # remove lookup interval cols
  distinct() |>                                     # remove duplicates
  rename_with(str_remove, ends_with(".y"), "\\.y")  # remove suffixes from col names
#> # A tibble: 4 × 3
#> value start end
#> <chr> <dbl> <dbl>
#> 1 a 30 80
#> 2 b 30 80
#> 3 b 124 152
#> 4 b 161 185
Finally, a clarification of terminology: in your question you state you want to join based on the interval from main_df falling within the interval from lookup_df. This is possible by using type = "within" in interval_*_join(). But based on the examples you provide, it appears you want to join based on any overlap. This can be done with type = "any", but that is the default, so you don't need to specify it.
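A quick sketch of the difference on this data; with type = "within", none of the main_df intervals is fully contained in a lookup interval, so the join should return zero rows:
interval_inner_join(main_df, lookup_df, by = c("start", "end"), type = "within")
#> # A tibble: 0 × 5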
I have a dataframe like the following:
id = c(1,2,3,4,5)
value = c(100, 200, 300, 400, 500)
tech = c('A','B','C','D','E')
tech2 = c(NA, NA,'A','B', NA)
data = data.frame(id, value, tech, tech2)
I want to convert data to the following:
Where tech is only one column, and any id with a non-NA tech2 is duplicated and has its value split in two for each tech, e.g. id number 3 has two techs and a value of 300, so each tech gets 150.
I have looked at pivot_wider and pivot_longer but the examples either have numerical values within the tech column, or only one tech column.
Any suggestions?
A possible solution:
library(tidyverse)
data %>%
  mutate(
    value = if_else(rowSums(is.na(across(tech:tech2))) == 0, value / 2, value),
    tech = paste(tech, tech2), tech2 = NULL) %>%
  separate_rows(tech) %>%
  filter(tech != "NA")
#> # A tibble: 7 × 3
#> id value tech
#> <dbl> <dbl> <chr>
#> 1 1 100 A
#> 2 2 200 B
#> 3 3 150 C
#> 4 3 150 A
#> 5 4 200 D
#> 6 4 200 B
#> 7 5 500 E
A base R option using reshape + ave
transform(
  na.omit(
    reshape(
      data,
      direction = "long",
      idvar = c("id", "value"),
      varying = -c(1:2),
      v.names = "tech"
    )
  )[-3],
  value = value / ave(value, id, FUN = length)
)
gives
id value tech
1.100.1 1 100 A
2.200.1 2 200 B
3.300.1 3 150 C
4.400.1 4 200 D
5.500.1 5 500 E
3.300.2 3 150 A
4.400.2 4 200 B
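Since the question mentions pivot_longer, here is a hedged tidyr sketch of the same idea, splitting value evenly across however many non-NA tech columns a row has:
library(tidyverse)

data %>%
  pivot_longer(starts_with("tech"), values_to = "tech") %>%  # stack tech and tech2
  filter(!is.na(tech)) %>%
  group_by(id) %>%
  mutate(value = value / n()) %>%  # split value across this id's techs
  ungroup() %>%
  select(id, value, tech)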
I'm fairly new to R and struggling to get this. The type of problem I'm trying to address involves one data frame containing books and the start and end pages of a particular chapter.
books <- c("Dune", "LOTR", "LOTR", "OriginOfSpecies", "OldManSea")
chapt.start <- c(300, 8, 94, 150, 600)
chapt.end <- c(310, 19, 110, 158, 630)
df1 <- data.frame(books, chapt.start, chapt.end)
df1
books chapt.start chapt.end
1 Dune 300 310
2 LOTR 8 19
3 LOTR 94 110
4 OriginOfSpecies 150 158
5 OldManSea 600 630
My second dataframe contains a list of book titles and a single page.
title <- c("LOTR", "LOTR", "LOTR", "OriginOfSpecies", "OldManSea", "OldManSea")
page <- c(4, 12, 30, 200, 620, 650)
df2 <- data.frame(title, page)
df2
title page
1 LOTR 4
2 LOTR 12
3 LOTR 30
4 OriginOfSpecies 200
5 OldManSea 620
6 OldManSea 650
What I'm trying to ask is, for each row in df1, whether df2 contains any rows with the corresponding book title and a page within the chapter, i.e. df2$title == df1$books, df2$page > df1$chapt.start, and df2$page < df1$chapt.end.
The desired output for these data would be FALSE, TRUE, FALSE, FALSE, TRUE
Is this best approached as some kind of for/ifelse loop, sapply, or something different? Thanks for your help, people!
This is a range-based join. There are three good ways to do this in R. All of these return the page number itself instead of TRUE/FALSE; it should be straightforward to convert to logical with something like !is.na(page).
sqldf
library(sqldf)
sqldf(
  "select df1.*, df2.page
   from df1
   left join df2 on df1.books = df2.title
   and df2.page between df1.[chapt.start] and df1.[chapt.end]")
# books chapt.start chapt.end page
# 1 Dune 300 310 NA
# 2 LOTR 8 19 12
# 3 LOTR 94 110 NA
# 4 OriginOfSpecies 150 158 NA
# 5 OldManSea 600 630 620
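For instance, to get the OP's logical vector from this result (a sketch, assuming the query result is stored in res):
res <- sqldf(
  "select df1.*, df2.page
   from df1
   left join df2 on df1.books = df2.title
   and df2.page between df1.[chapt.start] and df1.[chapt.end]")
!is.na(res$page)
# [1] FALSE  TRUE FALSE FALSE  TRUE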
fuzzyjoin
(Edited out, see @IanCampbell's answer.)
data.table
library(data.table)
DT1 <- as.data.table(df1)
DT2 <- as.data.table(df2)
DT2[, p2 := page][DT1, on = .(title == books, p2 >= chapt.start, p2 <= chapt.end)]
# title page p2 p2.1
# <char> <num> <num> <num>
# 1: Dune NA 300 310
# 2: LOTR 12 8 19
# 3: LOTR NA 94 110
# 4: OriginOfSpecies NA 150 158
# 5: OldManSea 620 600 630
The reason I add p2 as a copy of page is that in data.table range joins, the left table's inequality columns are replaced with those from the right (or something like that), so we'd otherwise lose that bit of info.
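A sketch of recovering the OP's TRUE/FALSE vector from the join result, relying on page being NA where nothing matched:
res <- DT2[, p2 := page][DT1, on = .(title == books, p2 >= chapt.start, p2 <= chapt.end)]
!is.na(res$page)
# [1] FALSE  TRUE FALSE FALSE  TRUE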
You're looking for a non-equi join. This can be accomplished in many ways, but I prefer the fuzzyjoin package:
library(fuzzyjoin)
fuzzy_left_join(df1, df2,
                by = c("books" = "title", "chapt.start" = "page", "chapt.end" = "page"),
                match_fun = c(`==`, `<=`, `>=`))
books chapt.start chapt.end title page
1 Dune 300 310 <NA> NA
2 LOTR 8 19 LOTR 12
3 LOTR 94 110 <NA> NA
4 OriginOfSpecies 150 158 <NA> NA
5 OldManSea 600 630 OldManSea 620
From here it's easy to get to the desired output:
library(dplyr)
fuzzy_left_join(df1, df2,
                by = c("books" = "title", "chapt.start" = "page", "chapt.end" = "page"),
                match_fun = c(`==`, `<=`, `>=`)) %>%
  mutate(result = !is.na(page)) %>%
  select(-c(title, page))
books chapt.start chapt.end result
1 Dune 300 310 FALSE
2 LOTR 8 19 TRUE
3 LOTR 94 110 FALSE
4 OriginOfSpecies 150 158 FALSE
5 OldManSea 600 630 TRUE
Using dplyr only, i.e. without purrr or fuzzyjoin:
df2 %>%
  right_join(df1 %>% mutate(id = row_number()), by = c("title" = "books")) %>%
  group_by(id, title) %>%
  summarise(desired = ifelse(is.na(as.logical(sum(chapt.start <= page & page <= chapt.end))),
                             FALSE,
                             as.logical(sum(chapt.start <= page & page <= chapt.end))))
# A tibble: 5 x 3
# Groups: id [5]
id title desired
<int> <chr> <lgl>
1 1 Dune FALSE
2 2 LOTR TRUE
3 3 LOTR FALSE
4 4 OriginOfSpecies FALSE
5 5 OldManSea TRUE
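The ifelse around sum() can arguably be collapsed to any() with na.rm = TRUE, which handles the unmatched (NA) rows directly; a hedged sketch:
df2 %>%
  right_join(df1 %>% mutate(id = row_number()), by = c("title" = "books")) %>%
  group_by(id, title) %>%
  summarise(desired = any(chapt.start <= page & page <= chapt.end, na.rm = TRUE),
            .groups = "drop")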
Another approach using purrr without joining data
Create a logical check variable for df1
library(dplyr)
library(purrr)
# This function is designed to take ..., which is a row of data from pmap,
# and then look up whether any record in df2 matches the conditions
look_up_check_df1 <- function(..., page_df) {
  book_record <- tibble(...)
  any_record <- page_df %>%
    filter(title == book_record[["books"]],
           page >= book_record[["chapt.start"]],
           page <= book_record[["chapt.end"]])
  nrow(any_record) > 0
}
df1$check <- pmap_lgl(df1, look_up_check_df1, page_df = df2)
df1
#> books chapt.start chapt.end check
#> 1 Dune 300 310 FALSE
#> 2 LOTR 8 19 TRUE
#> 3 LOTR 94 110 FALSE
#> 4 OriginOfSpecies 150 158 FALSE
#> 5 OldManSea 600 630 TRUE
The same logic, just done for df2:
# If the check is for df2 then just need to revised it a bit
look_up_check <- function(..., book_chapters_df) {
  page_record <- tibble(...)
  any_record <- book_chapters_df %>%
    filter(books == page_record[["title"]],
           chapt.start <= page_record[["page"]],
           chapt.end >= page_record[["page"]])
  nrow(any_record) > 0
}
# Run pmap_lgl, which passes each row of df2 into the function look_up_check
# and returns a logical vector of TRUE/FALSE
df2$check <- pmap_lgl(df2, look_up_check, book_chapters_df = df1)
df2
#> title page check
#> 1 LOTR 4 FALSE
#> 2 LOTR 12 TRUE
#> 3 LOTR 30 FALSE
#> 4 OriginOfSpecies 200 FALSE
#> 5 OldManSea 620 TRUE
#> 6 OldManSea 650 FALSE
Created on 2021-04-12 by the reprex package (v1.0.0)
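For comparison, a hedged base-R equivalent of the same row-wise check for df2, without pmap:
# mapply passes title and page pairwise; any() checks for a matching chapter
df2$check <- mapply(function(title, page) {
  any(df1$books == title & df1$chapt.start <= page & df1$chapt.end >= page)
}, df2$title, df2$page)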
I have a dataframe like the one below...
df <- data.frame(row.names = c(1, 2, 3, 4, 5, 6, 7, 8),
                 Week = c(1, 1, 2, 2, 52, 52, 53, 53),
                 State = c("Florida", "Georgia", "Florida", "Georgia",
                           "Florida", "Georgia", "Florida", "Georgia"),
                 Count_2001 = c(25, 16, 83, 45, 100, 98, 22, 34),
                 Count_2002 = c(3, 78, 22, 5, 78, 6, 88, 97))
I am now trying to manipulate this dataset so that only weeks 52 and 53 get summed together for each state, across all of the Count columns, similar to this example: GROUP BY for specific rows.
The new dataset should have these rows summed together to create the new Week 52 row for each state, like this example below...
df2 <- data.frame(row.names = c(1, 2, 3, 4, 5, 6),
                  Week = c(1, 1, 2, 2, 52, 52),
                  State = c("Florida", "Georgia", "Florida", "Georgia",
                            "Florida", "Georgia"),
                  Count_2001 = c(25, 16, 83, 45, 122, 132),
                  Count_2002 = c(3, 78, 22, 5, 166, 103))
Is there an easy solution for this in R?
Change your 53s to 52s and do a sum by group:
library(dplyr)
df %>%
  mutate(Week = case_when(Week == 53 ~ 52, TRUE ~ Week)) %>%
  group_by(State, Week) %>%
  summarize(across(everything(), sum))
# # A tibble: 6 x 4
# # Groups: State [2]
# State Week Count_2001 Count_2002
# <chr> <dbl> <dbl> <dbl>
# 1 Florida 1 25 3
# 2 Florida 2 83 22
# 3 Florida 52 122 166
# 4 Georgia 1 16 78
# 5 Georgia 2 45 5
# 6 Georgia 52 132 103
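A hedged data.table sketch of the same recode-then-sum idea, in case the real data is large:
library(data.table)
setDT(df)
df[, Week := fifelse(Week == 53, 52, Week)]  # fold week 53 into week 52
df[, lapply(.SD, sum), by = .(State, Week), .SDcols = patterns("^Count")]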
Using aggregate.
s <- 52:53
tp <- transform(aggregate(cbind(Count_2001, Count_2002) ~ State, df[df$Week %in% s, ], sum),
                Week = 52)
df <- merge(df[!df$Week %in% s, ], tp, all = TRUE)
df
# Week State Count_2001 Count_2002
# 1 1 Florida 25 3
# 2 1 Georgia 16 78
# 3 2 Florida 83 22
# 4 2 Georgia 45 5
# 5 52 Florida 122 166
# 6 52 Georgia 132 103
A simple alternative to anything state-specific would be to create a new column with weeks at the level of aggregation that works.
I'd get this by doing (using the tidyverse library):
df <- df %>%
  mutate(week1 = if_else(Week %in% c(52, 53), 52, Week))
and then you can summarise as:
dfsumm <- df %>%
  group_by(State, week1) %>%
  summarise(across(starts_with("Count"), sum))
I am having a hard time with this one... I am trying to find points within each group that are close to each other, and then group them further. Let me explain on the basis of the example data below:
Group X Y Z
1 110 3762 431 10
2 112 4950 880 10
3 113 5062 873 20
4 113 5225 874 30
5 113 5262 875 10
6 113 5300 874 20
structure(list(Group = c(110, 112, 113, 113, 113, 113), X = c(3762,
4950, 5062, 5225, 5262, 5300), Y = c(431, 880, 873, 874, 875,
874), Z = c(10, 10, 20, 30, 10, 20)), row.names = c(NA, -6L), class = "data.frame")
As we can see, we have the grouping column Group, the X and Y columns are our coordinates, and the Z column should be further summarised where points are defined as "close" (Euclidean distance < 100).
What I have tried:
I have successfully calculated the Euclidean distance between consecutive points using this loop:
for (i in 1:nrow(test)) {
  if (i > 1 && test$Group[i] == test$Group[i - 1]) {
    test$Distance[i] <- sqrt((test$X[i] - test$X[i - 1])^2 + (test$Y[i] - test$Y[i - 1])^2)
  } else {
    test$Distance[i] <- NA
  }
}
Which gives me this:
Group X Y Z Distance
1 110 3762 431 10 NA
2 112 4950 880 10 NA
3 113 5062 873 20 NA
4 113 5225 874 30 163.00307
5 113 5262 875 10 37.01351
6 113 5300 874 20 38.01316
And here everything gets complicated, as there are NAs in the first row of each Group, etc.
What I want to achieve:
I would like to find points per group whose distance is not greater than 100 (Distance < 100), and on that basis summarise them (a simple sum of the Z column). So, done manually:
Group Z Grouped
1 110 10 no
2 112 10 no
3 113 20 no
4 113 60 yes
Thanks for help!
That was difficult. I'm not sure I have figured it out completely.
# get data and libraries
library(tidyverse)
library(reshape2)   # for melt()
library(data.table) # for rleid(), used in the update below
df <- read.table(text = "
Group X Y Z Distance
1 110 3762 431 10 NA
2 112 4950 880 10 NA
3 113 5062 873 20 NA
4 113 5225 874 30 163.00307
5 113 5262 875 10 37.01351
6 113 5300 874 20 38.01316", header = T, stringsAsFactors = F)
df %>%
  group_by(Group) %>%
  do(melt(outer(.$Distance, .$Distance, `-`))) %>%
  filter(between(value, -100, 0) | between(value, 0, 100)) %>%
  distinct(Var1) %>%
  mutate(grouped = 1) %>%
  rename(row = Var1) -> rows
df %>%
  group_by(Group) %>%
  mutate(row = row_number()) %>%
  left_join(rows, by = c("row", "Group")) %>%
  mutate(grouped = ifelse(is.na(grouped), "no", "yes")) %>%
  group_by(Group, grouped) %>%
  mutate(Z = ifelse(!is.na(grouped), sum(Z), Z)) %>%
  distinct(Group, Z, grouped)
# A tibble: 4 x 3
# Groups: Group, grouped [4]
Group Z grouped
<int> <int> <chr>
1 110 10 no
2 112 10 no
3 113 20 no
4 113 60 yes
Hope it's what you were looking for, if not maybe it gave you some new ideas.
UPDATE: And now what I hope will really help you:
df %>%
  group_by(Group) %>%
  mutate(int1 = lead(Distance) < 100 | Distance < 100,
         int1 = replace(int1, is.na(int1), FALSE),
         int2 = rleid(int1),
         int2 = replace(int2, !int1 | is.na(int1), NA)) -> df2
df2 %>%
  filter(int1) %>%
  group_by(Group, int2) %>%
  summarise(Z = sum(Z),
            Grouped = "yes") %>%
  select(Group, Z, Grouped) %>%
  bind_rows(df2 %>%
              filter(!int1) %>%
              mutate(Grouped = "no") %>%
              select(Group, Z, Grouped)) %>%
  arrange(Group)
# A tibble: 4 x 3
# Groups: Group [3]
Group Z Grouped
<int> <int> <chr>
1 110 10 no
2 112 10 no
3 113 60 yes
4 113 20 no
I worked out a little use case that can get you started. It is a base R approach using a for loop, with aggregation driven by a vector of columns paired with a vector of functions by which to aggregate.
df <- read.table(text = "
Group X Y Z Distance
1 110 3762 431 10 NA
2 112 4950 880 10 NA
3 113 5062 873 20 NA
4 113 5225 874 30 163.00307
5 113 5262 875 10 37.01351
6 113 5300 874 20 38.01316
7 114 5300 874 30 NA
8 114 5300 874 20 38.01316", header = T, stringsAsFactors = F)
aggregateIt <- function(df = data,     # data.frame
                        returnRaw = FALSE, # return the raw unaggregated df (only the first `grouped` case per `subgroup` is usable here)
                        colsToAgg = c("Z1", "Z2", "Z3"), # cols to aggregate
                        how = c("sum", "sum", "max"))    # how to aggregate each col: `Z1` by sum, `Z2` by sum, `Z3` by max
{
  count <- 1L
  result <- vector("integer", nrow(df))
  grouped <- vector("character", nrow(df))
  for (i in seq_len(length(result) - 1L)) {
    if (df$Group[i] != df$Group[i + 1L]) {
      result[i] <- count
      grouped[i] <- "no"
      count <- count + 1L
      if ((i + 1L) == length(result)) {
        result[i + 1L] <- count
        grouped[i + 1L] <- "no"
      }
    } else {
      if (df$Distance[i + 1L] > 100L) {
        result[i] <- count
        grouped[i] <- "no"
        count <- count + 1L
        if ((i + 1L) == length(result)) {
          result[i + 1L] <- count
          grouped[i + 1L] <- "no"
        }
      } else {
        result[i] <- count
        grouped[i] <- "yes"
        if ((i + 1L) == length(result)) {
          result[i + 1L] <- count
          grouped[i + 1L] <- "yes"
        }
      }
    }
  }
  df <- within(df, {subgroup <- result; grouped <- grouped})
  if (returnRaw) return(df)
  A <- Reduce(function(a, b) merge(a, b, by = "subgroup"),
              lapply(seq_along(how), function(x)
                aggregate(. ~ subgroup, df[, c(colsToAgg[x], "subgroup")], how[x])))
  B <- df[!duplicated(df$subgroup, fromLast = FALSE), c("Group", "subgroup", "grouped")]
  out <- merge(A, B, by = "subgroup")
  return(out[, c("Group", colsToAgg, "grouped")])
}
aggregateIt(df = df, colsToAgg = "Z", how = "sum")
# Group Z grouped
#1 110 10 no
#2 112 10 no
#3 113 20 no
#4 113 60 yes
#5 114 50 yes
I'm not claiming this is the most efficient solution, but it points the way. Hope this helps!
I have to perform some simple operations on a few vectors and rows.
Assume that I have a database such as:
observation outcome_1_a outcome_2_a outcome_1_b outcome_2_b choice_a choice_b
1 41 34 56 19 1 1
2 32 78 43 6 2 1
3 39 19 18 55 1 2
For each observation, outcome_1 and outcome_2 are the two possible outcomes, choice is the outcome chosen, and the suffix _i, with i = a, b, indicates the repetition of the situation.
If I want to create variables storing the highest outcome for each situation (a, b), I can do:
max.a <- pmax(data$outcome_1_a, data$outcome_2_a)
max.b <- pmax(data$outcome_1_b, data$outcome_2_b)
Similarly, if I want to create variables storing the values chosen in each situation, I can do:
choice.a <- ifelse(data$choice_a == "1", data$outcome_1_a, data$outcome_1_b)
choice.b <- ifelse(data$choice_b == "1", data$outcome_2_a, data$outcome_2_b)
Finally, If I'd like to compute the mean by row of the situations a and b, I can do:
library(data.table)
setDT(data)
data[, .(Mean = rowMeans(.SD)), by = observation, .SDcols = c("outcome_1_a","outcome_2_a", "outcome_1_b", "outcome_2_b")]
Now, all of these work just fine. However, I was wondering if such operations can be done in a more efficient way.
In the example there are only a few situations, but if in the future I have to deal with, say, 15 or more different situations (a, b, c, d, ...), writing such operations might be annoying.
Is there a way to automate such a process based on the different prefixes and/or suffixes of the variables?
Thank you for your help
You can select columns with some regex. For example, to get your max.a values:
library(data.table)
setDT(data)
data[, do.call(pmax, .SD), .SDcols = names(data) %like% "\\d+_a$"]
[1] 41 78 39
Alternatively, you could select your columns with some regex outside of the data.table; there are lots of ways to go about this, e.g. the sketch below.
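A sketch of one such way, looping over every suffix found in the outcome columns (the column-name parsing is an assumption about your naming scheme):
outcome_cols <- grep("^outcome", names(data), value = TRUE)
suffixes <- unique(sub(".*_", "", outcome_cols))  # "a", "b", ...
for (s in suffixes) {
  cols <- grep(paste0("_", s, "$"), outcome_cols, value = TRUE)
  data[, (paste0("max.", s)) := do.call(pmax, .SD), .SDcols = cols]
}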
Similar application to your last command.
data[,
     .(Mean = rowMeans(.SD)),
     by = observation,
     .SDcols = names(data) %like% "^outcome"]
observation Mean
1: 1 37.50
2: 2 39.75
3: 3 32.75
For choice.a, how would you choose between b, c, d, e etc?
For instance:
outcome_1_a outcome_2_a outcome_1_b outcome_2_b outcome_1_c outcome_2_c outcome_1_d outcome_2_d outcome_1_e outcome_2_e choice_a choice_b choice_c choice_d choice_e
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 12 85 32 28 91 42 32 96 27 29 2 1 1 1 1
2 17 22 84 53 11 69 16 66 11 41 1 2 2 1 1
3 92 98 76 83 18 27 21 51 92 41 1 1 1 1 2
4 63 49 61 64 100 28 43 51 22 94 1 2 1 1 1
Define an index variable that will help you go through the loops:
seqmax <- seq(1, 10, by = 2)
seqmax is 1 3 5 7 9. The reason is that there are 5 letters ("a" "b" "c" "d" "e"), so this sequence patterns the loop. This can be automated for any number of letters: just find the index of the last column before choice_a, i.e. seq(1, grep(names(data), pattern = "choice_a") - 1, by = 2). The by = 2 argument can be adjusted to the number of columns per letter.
I use lapply with <<- to assign the new columns to data.
lapply(c(1:5), function(x) {
  data[, paste0("max.", letters[x])] <<- apply(data[, c(seqmax[x], seqmax[x] + 1)], 1, max)
  data[, paste0("choice.", letters[x])] <<- ifelse(
    data[, grep(names(data), pattern = paste0("choice_", letters[x]), value = T)] == 1,
    data[, seqmax[x]], data[, seqmax[x] + 1])
  data[, paste0("mean.", letters[x])] <<- rowMeans(
    data[, grep(names(data), pattern = paste0("outcome_\\d+_", letters[x]), value = T)])
})