I have a data frame with two grouping variables, 'mkt' and 'mdl', and some values 'pr':
df <- data.frame(mkt = c(1,1,1,1,2,2,2,2,2),
                 mdl = c('a','a','b','b','b','a','b','a','b'),
                 pr = c(120,120,110,110,145,130,145,130,145))
df
mkt mdl pr
1 1 a 120
2 1 a 120
3 1 b 110
4 1 b 110
5 2 b 145
6 2 a 130
7 2 b 145
8 2 a 130
9 2 b 145
Within each 'mkt', the mean 'pr' for each 'mdl' should be calculated as the mean of 'pr' over all other 'mdl' in the same 'mkt', excluding the current 'mdl'.
For example, for the group defined by mkt == 1 and mdl == 'a', 'avgother' is calculated as the average of 'pr' for mkt == 1 (same 'mkt') and mdl == 'b' (all 'mdl' other than the current group 'a').
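To make this concrete, that value can be checked directly:
# average of 'pr' over the mkt == 1 rows that are not model 'a'
mean(df$pr[df$mkt == 1 & df$mdl != 'a'])
# [1] 110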
Desired result:
# mkt mdl pr avgother
# 1 1 a 120 110
# 2 1 a 120 110
# 3 1 b 110 120
# 4 1 b 110 120
# 5 2 b 145 130
# 6 2 a 130 145
# 7 2 b 145 130
# 8 2 a 130 145
# 9 2 b 145 130
First get the average 'pr' for each mkt/mdl combination; then, within each 'mkt', exclude the current group's value and average the remaining values.
library(dplyr)
library(purrr)
df %>%
  group_by(mkt, mdl) %>%
  summarise(avgother = mean(pr)) %>%
  mutate(avgother = map_dbl(row_number(), ~ mean(avgother[-.x]))) %>%
  ungroup %>%
  inner_join(df, by = c('mkt', 'mdl'))
# mkt mdl avgother pr
# <dbl> <chr> <dbl> <dbl>
#1 1 a 110 120
#2 1 a 110 120
#3 1 b 120 110
#4 1 b 120 110
#5 2 a 145 130
#6 2 a 145 130
#7 2 b 130 145
#8 2 b 130 145
#9 2 b 130 145
Using data.table, calculate the sum and length of 'pr' by 'mkt'. Then, within each mkt/mdl group, calculate the mean as (mkt sum - group sum) / (mkt length - group length):
library(data.table)
setDT(df)[ , `:=`(s = sum(pr), n = .N), by = mkt]
df[ , avgother := (s - sum(pr)) / (n - .N), by = .(mkt, mdl)]
df[ , `:=`(s = NULL, n = NULL)]
# mkt mdl pr avgother
# 1: 1 a 120 110
# 2: 1 a 120 110
# 3: 1 b 110 120
# 4: 1 b 110 120
# 5: 2 b 145 130
# 6: 2 a 130 145
# 7: 2 b 145 130
# 8: 2 a 130 145
# 9: 2 b 145 130
Consider base R with multiple ave calls at the different grouping levels, using the decomposed version of the mean as sum / count:
df <- within(df, {
  avgoth <- (ave(pr, mkt, FUN = sum) - ave(pr, mkt, mdl, FUN = sum)) /
    (ave(pr, mkt, FUN = length) - ave(pr, mkt, mdl, FUN = length))
})
df
# mkt mdl pr avgoth
# 1 1 a 120 110
# 2 1 a 120 110
# 3 1 b 110 120
# 4 1 b 110 120
# 5 2 b 145 130
# 6 2 a 130 145
# 7 2 b 145 130
# 8 2 a 130 145
# 9 2 b 145 130
For the sake of completeness, here is another data.table approach which uses grouping by each i, i.e., join and aggregate simultaneously.
For demonstration, an enhanced sample dataset is used which has a third market with 3 products:
df <- data.frame(mkt = c(1,1,1,1,2,2,2,2,2,3,3,3),
                 mdl = c('a','a','b','b','b','a','b','a','b', letters[1:3]),
                 pr = c(120,120,110,110,145,130,145,130,145, 1:3))
library(data.table)
mdt <- setDT(df)[, .(mdl, s = sum(pr), .N), by = .(mkt)]
df[mdt, on = .(mkt, mdl), avgother := (sum(pr) - s) / (.N - N), by = .EACHI][]
mkt mdl pr avgother
1: 1 a 120 110.0
2: 1 a 120 110.0
3: 1 b 110 120.0
4: 1 b 110 120.0
5: 2 b 145 130.0
6: 2 a 130 145.0
7: 2 b 145 130.0
8: 2 a 130 145.0
9: 2 b 145 130.0
10: 3 a 1 2.5
11: 3 b 2 2.0
12: 3 c 3 1.5
The temporary table mdt contains the sum and count of prices within each mkt, replicated for each product mdl within the market:
mdt
mkt mdl s N
1: 1 a 460 4
2: 1 a 460 4
3: 1 b 460 4
4: 1 b 460 4
5: 2 b 695 5
6: 2 a 695 5
7: 2 b 695 5
8: 2 a 695 5
9: 2 b 695 5
10: 3 a 6 3
11: 3 b 6 3
12: 3 c 6 3
Having mkt and mdl in mdt allows for grouping by each i (by = .EACHI).
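The duplicated rows in mdt are harmless here because each duplicate assigns the same value; as a sketch (not part of the original answer), the lookup table could also be deduplicated before the join:
# build the lookup table without the replicated rows
mdt_u <- unique(setDT(df)[, .(mdl, s = sum(pr), N = .N), by = mkt])
df[mdt_u, on = .(mkt, mdl),
   avgother := (sum(pr) - s) / (.N - N), by = .EACHI][]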
Here is an approach which computes avgother directly by subsetting the pr values which do not belong to the current value of mdl before computing the averages.
This is quite different from the other answers posted so far, which, IMHO, justifies posting it as a separate answer.
# enhanced sample dataset covering more corner cases
df <- data.frame(mkt = c(1,1,1,1,2,2,2,2,2,3,3,3,4),
                 mdl = c('a','a','b','b','b','a','b','a','b', letters[1:3], 'd'),
                 pr = c(120,120,110,110,145,130,145,130,145, 1:3, 9))
library(data.table)
setDT(df)[, avgother := sapply(mdl, function(m) mean(pr[m != mdl])), by = mkt][]
mkt mdl pr avgother
1: 1 a 120 110.0
2: 1 a 120 110.0
3: 1 b 110 120.0
4: 1 b 110 120.0
5: 2 b 145 130.0
6: 2 a 130 145.0
7: 2 b 145 130.0
8: 2 a 130 145.0
9: 2 b 145 130.0
10: 3 a 1 2.5
11: 3 b 2 2.0
12: 3 c 3 1.5
13: 4 d 9 NaN
Difference between approaches
The other answers share more or less the same approach (although implemented in different manners):
1. compute sums and counts of pr for each mkt,
2. compute sums and counts of pr for each mkt and mdl,
3. subtract the mkt/mdl sums and counts from the mkt sums and counts,
4. compute avgother.
This approach:
1. groups by mkt,
2. loops through mdl within each mkt,
3. subsets pr to keep only the values which do not belong to the current value of mdl,
4. computes mean() directly on that subset (a dplyr version of the same idea is sketched below).
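For comparison, here is that dplyr sketch (not part of the original answer):
library(dplyr)
# group by market, then for each row average the pr values of all
# other models in the same market
df %>%
  group_by(mkt) %>%
  mutate(avgother = sapply(mdl, function(m) mean(pr[mdl != m]))) %>%
  ungroup()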
Caveat concerning performance: although the code essentially is a one-liner, this does not imply it is the fastest approach.
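If speed matters, it is worth measuring; a sketch with microbenchmark (timings will depend on the size and shape of the data):
library(microbenchmark)
library(data.table)
microbenchmark(
  # direct subsetting approach from above
  direct = copy(df)[, avgother := sapply(mdl, function(m)
             mean(pr[m != mdl])), by = mkt],
  # sum/count difference approach from the earlier answers
  sums   = copy(df)[, c("s", "n") := .(sum(pr), .N), by = mkt][
             , avgother := (s - sum(pr)) / (n - .N), by = .(mkt, mdl)][
             , c("s", "n") := NULL]
)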
Related
I wrote a script based on two for loops that I would like to optimize to speed up its running time.
Below are reproducible data, simplified from my own, together with the code that I am using.
nuc2 is a vector of 101 positions, and
tel is a data frame with pairs of coordinates 'aa' and 'bb'.
The aim is to calculate, for each position, the number of coordinate pairs (aa, bb) that the position falls between. For example, position 111 falls within 3 pairs of coordinates: G, H and J.
#data
tel = data.frame(aa = c(153,113,163,117,193,162,110,109,186,103),
                 bb = c(189,176,185,130,200,189,156,123,198,189),
                 ID = c("A", "B", "C", "D", "E", "F", "G", "H", "I", "J"))
> tel
aa bb ID
1 153 189 A
2 113 176 B
3 163 185 C
4 117 130 D
5 193 200 E
6 162 189 F
7 110 156 G
8 109 123 H
9 186 198 I
10 103 189 J
nuc2=100:200
# Loop
count_occ = 0
count_occ_int = NULL
count_occ_fin = NULL
for (j in 1:length(nuc2)) {
  for (i in 1:nrow(tel)) {
    if (nuc2[j] < tel$bb[i] & nuc2[j] > tel$aa[i]) {
      count_occ = count_occ + 1
    }
  }
  count_occ_int = count_occ
  count_occ_fin = c(count_occ_fin, count_occ_int)
  count_occ = 0
}
nuc_occ = data.frame(nuc = nuc2, occ = count_occ_fin)
> head(nuc_occ,20)
nuc occ
1 100 0
2 101 0
3 102 0
4 103 0
5 104 1
6 105 1
7 106 1
8 107 1
9 108 1
10 109 1
11 110 2
12 111 3
13 112 3
14 113 3
15 114 4
16 115 4
17 116 4
18 117 4
19 118 5
20 119 5
In my data, the length of my nuc vector is 9,304,567 and the number of coordinate pairs is 53 (I will have several hundred soon), and it took more than 60 hours to run the code!
Any ideas to help me speed up this code?
I thought of the apply functions, but I am not sure how to combine the two for loop operations.
You can use a data.table non-equi join like this:
library(data.table)
setDT(tel)[SJ(v = nuc2), on = .(aa <= v, bb >= v)][, .(occ = sum(!is.na(ID))), by = .(nuc = aa)]
Explanation:
setDT(tel) sets the tel data.frame to be of class data.table
SJ(v=nuc2) is a convenience function for converting a vector to a data.table; in this case converting nuc2 to a data.table with one column v. I'm doing this because I want to join two data.tables: tel (with columns aa, bb and ID) and one which has a single column v holding the values in nuc2
the join conditions are given in the on=... param of the setDT(tel)[...] clause; here the join condition is that the v value must be >= the aa value and <= the bb value
the final step (i.e. the next chained data.table operation) simply counts the number of rows where ID is not NA, by nuc value (by=.(nuc=aa))
Output:
nuc occ
<int> <int>
1: 100 0
2: 101 0
3: 102 0
4: 103 1
5: 104 1
---
97: 196 2
98: 197 2
99: 198 2
100: 199 1
101: 200 1
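Note that the join bounds above are inclusive (aa <= v <= bb), while the original loop used strict comparisons; that is why nuc 103 has occ 1 here but 0 in the question's output. If the strict version is wanted, the join conditions can be tightened (a sketch):
# same non-equi join, but with strict bounds matching the double loop
setDT(tel)[SJ(v = nuc2), on = .(aa < v, bb > v)][
  , .(occ = sum(!is.na(ID))), by = .(nuc = aa)]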
Here's a tidyverse solution:
library(dplyr)
library(tibble)

lapply(
  100:200,
  \(x) tel %>%
    filter(aa <= x & x <= bb) %>%
    summarise(occ = n()) %>%
    add_column(nuc = x, .before = 1)
) %>%
  bind_rows() %>%
  as_tibble()
# A tibble: 101 × 2
nuc occ
<int> <int>
1 100 0
2 101 0
3 102 0
4 103 1
5 104 1
6 105 1
7 106 1
8 107 1
9 108 1
10 109 2
# … with 91 more rows
Using microbenchmark to assess performance, this gives
Unit: nanoseconds
expr min lq mean median uq max neval
lapply 7 9 8.8 9 9 9 10
original 8 9 23.8 9 9 158 10
In other words, a reduction in run time of about two-thirds. And the tidyverse is not known for speed; a base R solution is likely to be faster still.
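For reference, here is one such vectorized base R version (a sketch, using the inclusive bounds of the join answer):
# count, for each position, how many (aa, bb) intervals contain it
occ <- vapply(nuc2, function(v) sum(tel$aa <= v & tel$bb >= v), integer(1))
nuc_occ <- data.frame(nuc = nuc2, occ = occ)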
I have a data frame (datadf) with 3 columns, 'x', 'y', and 'z'. Several 'x' values are missing (NA). 'y' and 'z' are non-measured variables.
x y z
153 a 1
163 b 1
NA d 1
123 a 2
145 e 2
NA c 2
NA b 1
199 a 2
I have another data frame (imputeddf) with the same three columns:
x y z
123 a 1
145 a 2
124 b 1
168 b 2
123 c 1
176 c 2
184 d 1
101 d 2
I wish to replace NA in 'x' in 'datadf' with values from 'imputeddf' where 'y' and 'z' matches between the two data sets (each combo of 'y' and 'z' has its own value of 'x' to fill in).
The desired result:
x y z
153 a 1
163 b 1
184 d 1
123 a 2
145 e 2
176 c 2
124 b 1
199 a 2
I am trying things like:
finaldf <- datadf
finaldf$x <- if(datadf[!is.na(datadf$x)]){ddply(datadf, x=imputeddf$x[datadf$y == imputeddf$y & datadf$z == imputeddf$z])}else{datadf$x}
but it's not working.
What is the best way for me to fill in the NAs in 'datadf' using my imputed-value data frame?
I would do this:
library(data.table)
setDT(DF1); setDT(DF2)
DF1[DF2, x := ifelse(is.na(x), i.x, x), on=c("y","z")]
which gives
x y z
1: 153 a 1
2: 163 b 1
3: 184 d 1
4: 123 a 2
5: 145 e 2
6: 176 c 2
7: 124 b 1
8: 199 a 2
Comments. This approach isn't so great, since it merges the whole of DF1, while we only need to merge the subset where is.na(x). Here, the improvement looks like (thanks, #Arun):
DF1[is.na(x), x := DF2[.SD, x, on=c("y", "z")]]
This way is analogous to #RHertel's answer.
From #Jakob's comment:
does this work for more than one x variable? If I want to fill up entire datasets with several columns?
You can enumerate the desired columns:
DF1[DF2, `:=`(
  x = ifelse(is.na(x), i.x, x),
  w = ifelse(is.na(w), i.w, w)
), on = c("y","z")]
The expression could be constructed using lapply and substitute, probably, but if the set of columns is fixed, it might be cleanest just to write it out as above.
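For a larger set of columns, one possibility is to combine fcoalesce with mget (a sketch; the column set cols is hypothetical, and the example DF1 has no 'w' column):
# patch every column in cols with the matching value from DF2, keeping
# the existing value where it is not NA (fcoalesce returns the first
# non-NA of its arguments)
cols <- c("x", "w")  # hypothetical column set
DF1[DF2, (cols) := Map(fcoalesce, mget(cols), mget(paste0("i.", cols))),
    on = c("y", "z")]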
Here's an alternative with base R:
df1[is.na(df1$x),"x"] <- merge(df2,df1[is.na(df1$x),][,c("y","z")])$x
> df1
# x y z
#1 153 a 1
#2 163 b 1
#3 124 b 1
#4 123 a 2
#5 145 e 2
#6 176 c 2
#7 184 d 1
#8 199 a 2
A dplyr solution, conceptually identical to the answers above. To pull out just the rows of imputeddf that correspond to NAs in datadf, use semi_join. Then, use another join to match back to datadf. (This step is not very clean, unfortunately.)
library(dplyr)
replacement_rows <- imputeddf %>%
  semi_join(datadf %>% filter(is.na(x)), by = c("y", "z"))
datadf <- datadf %>%
  left_join(replacement_rows, by = c("y", "z")) %>%
  mutate(x = if_else(is.na(x.x), x.y, x.x)) %>%
  select(x, y, z)
This gets what you want:
> datadf
# A tibble: 8 x 3
x y z
<dbl> <chr> <dbl>
1 153 a 1
2 163 b 1
3 184 d 1
4 123 a 2
5 145 e 2
6 176 c 2
7 124 b 1
8 199 a 2
In dplyr, you can use rows_patch to update NAs:
rows_patch(datadf, imputeddf, by = c("y", "z"), unmatched = "ignore")
# x y z
# 1 153 a 1
# 2 163 b 1
# 3 184 d 1
# 4 123 a 2
# 5 145 e 2
# 6 176 c 2
# 7 124 b 1
# 8 199 a 2
data:
datadf <- read.table(header = T, text = "x y z
153 a 1
163 b 1
NA d 1
123 a 2
145 e 2
NA c 2
NA b 1
199 a 2")
imputeddf <- read.table(header = T, text = " x y z
123 a 1
145 a 2
124 b 1
168 b 2
123 c 1
176 c 2
184 d 1
101 d 2")
I looked here and elsewhere, but I cannot find something that does exactly what I'm trying to accomplish in R.
I have data similar to below, where col1 ('ID') is a unique ID, col2 ('GroupID') is a group ID variable, and col3 ('Status') is a status code. I need to flag all rows with the same group ID as 1 where any row in that group has the specific status code, X in this case, otherwise 0.
ID GroupID Status Flag
1 100 A 1
2 100 X 1
3 102 A 0
4 102 B 0
5 103 B 1
6 103 X 1
7 104 X 1
8 104 X 1
9 105 A 0
10 105 C 0
I have tried writing an ifelse along the lines of "where groupID == groupID and status == X then 1 else 0", but that doesn't work. The pattern of Status is random. In this example each GroupID comes in pairs, but I don't want to assume that in the code, because I have other instances where there are 3 or more rows per GroupID.
It would be helpful if this were open-ended, i.e. if I could add other conditions when necessary, like flagging each matching group ID where Status == X or some other value, etc.
Thank you!
Group-based operations like this are easy to do with the dplyr package.
The data:
library(dplyr)
txt <- 'ID GroupID Status
1 100 A
2 100 X
3 102 A
4 102 B
5 103 B
6 103 X
7 104 X
8 104 X
9 105 A
10 105 C '
df <- read.table(text = txt, header = T)
Once we have the data frame, we establish dplyr groups with the group_by function. The mutate command is then applied per group, creating a new column entry for each row.
df.new <- df %>%
  group_by(GroupID) %>%
  mutate(Flag = as.numeric(any(Status == 'X')))
# A tibble: 10 x 4
# Groups: GroupID [5]
ID GroupID Status Flag
<int> <int> <fct> <dbl>
1 1 100 A 1
2 2 100 X 1
3 3 102 A 0
4 4 102 B 0
5 5 103 B 1
6 6 103 X 1
7 7 104 X 1
8 8 104 X 1
9 9 105 A 0
10 10 105 C 0
From base R:
ave(df$Status == 'X', df$GroupID, FUN = any)
[1] TRUE TRUE FALSE FALSE TRUE TRUE TRUE TRUE FALSE FALSE
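Coerced to numeric and assigned, this gives the 0/1 Flag column:
df$Flag <- as.numeric(ave(df$Status == 'X', df$GroupID, FUN = any))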
Data.table way:
library(data.table)
setDT(df)
df[ , flag := sum(Status == "X") > 0, by=GroupID]
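A minor variation using any(), coerced to 0/1 with a unary +:
df[, Flag := +any(Status == "X"), by = GroupID]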
An alternative using data.table
library(data.table)
dt <- read.table(stringsAsFactors = FALSE,text = "ID GroupID Status
1 100 A
2 100 X
3 102 A
4 102 B
5 103 B
6 103 X
7 104 X
8 104 X
9 105 A
10 105 C", header=T)
setDT(dt)[, .(ID, Status, Flag = ifelse("X" %in% Status, 1, 0)), by = GroupID]
#returns
GroupID ID Status Flag
1: 100 1 A 1
2: 100 2 X 1
3: 102 3 A 0
4: 102 4 B 0
5: 103 5 B 1
6: 103 6 X 1
7: 104 7 X 1
8: 104 8 X 1
9: 105 9 A 0
10: 105 10 C 0
A base R option with rowsum
i1 <- with(df, rowsum(+(Status == "X"), group = GroupID) > 0)
transform(df, Flag = +(GroupID %in% row.names(i1)[i1]))
Or using table
df$Flag <- +(with(df, GroupID %in% names(which(table(GroupID,
  Status == "X")[, 2] > 0))))
Let's say I have the following data.table:
set.seed(123)
dt <- data.table(id = 1:10,
                 group = sample(LETTERS[1:3], 10, replace = TRUE),
                 val = sample(1:100, 10, replace = TRUE),
                 ltr = sample(letters, 10),
                 col5 = sample(100:200, 10))
setkey(dt, id)
(dt)
# id group val ltr col5
# 1: 1 A 96 x 197
# 2: 2 C 46 r 190
# 3: 3 B 68 p 168
# 4: 4 C 58 w 177
# 5: 5 C 11 o 102
# 6: 6 A 90 v 145
# 7: 7 B 25 k 172
# 8: 8 C 5 l 120
# 9: 9 B 33 f 129
# 10: 10 B 96 c 121
Now I want to process it grouped by 'group'; within each group, I need to order the records by the val column and then do some manipulation within each ordered group (for example, add a column with the values of ltr concatenated in order):
# id group val ltr letters
# 1 6 A 90 v v_x
# 2 1 A 96 x v_x
# 3 7 B 25 k k_f_p_c
# 4 9 B 33 f k_f_p_c
# 5 3 B 68 p k_f_p_c
# 6 10 B 96 c k_f_p_c
# 7 8 C 5 l l_o_r_w
# 8 5 C 11 o l_o_r_w
# 9 2 C 46 r l_o_r_w
# 10 4 C 58 w l_o_r_w
(in this example the whole table is ordered but this is not required)
That's how I imagine the code in general:
dt1 <- dt[,
          {
            # processing here, reorder somehow
            # ???
            # ...
            list(id = id, ltr = ltr, letters = paste0(ltr, collapse = "_"))
          },
          by = group]
Thanks in advance for any ideas!
UPD. As noted in the answers, for my example I can simply order by group and then by val. But what if I need several different orderings? For example, suppose I want to sort by col5 and add a col5diff column which shows the difference of the col5 values:
# id group val ltr col5 letters col5diff
# 1: 6 A 90 v 145 v_x
# 2: 1 A 96 x 197 v_x 52
# 3: 10 B 96 c 121 k_f_p_c
# 4: 9 B 33 f 129 k_f_p_c 8
# 5: 3 B 68 p 168 k_f_p_c 47
# 6: 7 B 25 k 172 k_f_p_c 51
# 7: 5 C 11 o 102 l_o_r_w
# 8: 8 C 5 l 120 l_o_r_w 18
# 9: 4 C 58 w 177 l_o_r_w 75
#10: 2 C 46 r 190 l_o_r_w 88
OK, for this example the calculations of letters and col5diff are independent, so I can simply do them consecutively:
setkey(dt, "group", "val")
dt[, letters := paste(ltr, collapse="_"), by = group]
setkey(dt, "group", "col5")
dt<-dt[, col5diff:={
diff <- NA;
for (i in 2:length(col5)) {diff <- c(diff, col5[i]-col5[1]);}
diff; # updated to use := instead of list - thanks to comment of #Frank
}, by = group]
But I would also be glad to know what to do if I needed to use both of these orderings (in a single {} block).
I think you're just looking for order:
dt[, letters:=paste(ltr[order(val)], collapse="_"), by=group]
dt[order(group, val)]
# id group val ltr col5 letters
# 1: 6 A 90 v 145 v_x
# 2: 1 A 96 x 197 v_x
# 3: 7 B 25 k 172 k_f_p_c
# 4: 9 B 33 f 129 k_f_p_c
# 5: 3 B 68 p 168 k_f_p_c
# 6: 10 B 96 c 121 k_f_p_c
# 7: 8 C 5 l 120 l_o_r_w
# 8: 5 C 11 o 102 l_o_r_w
# 9: 2 C 46 r 190 l_o_r_w
#10: 4 C 58 w 177 l_o_r_w
Or, if you do not want to add a column by reference:
dt[, list(id, val, ltr, letters=paste(ltr[order(val)], collapse="_")),
by=group][order(group, val)]
# group id val ltr letters
# 1: A 6 90 v v_x
# 2: A 1 96 x v_x
# 3: B 7 25 k k_f_p_c
# 4: B 9 33 f k_f_p_c
# 5: B 3 68 p k_f_p_c
# 6: B 10 96 c k_f_p_c
# 7: C 8 5 l l_o_r_w
# 8: C 5 11 o l_o_r_w
# 9: C 2 46 r l_o_r_w
#10: C 4 58 w l_o_r_w
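Regarding the UPD: since order() works inside j, both orderings can also be combined in a single {} block (a sketch; col5diff is taken relative to the smallest col5 in each group, as in the question):
dt[, c("letters", "col5diff") := {
  o5 <- order(col5)                        # within-group order by col5
  d <- rep(NA_integer_, .N)
  d[o5[-1]] <- col5[o5[-1]] - col5[o5[1]]  # difference to the smallest col5
  .(paste(ltr[order(val)], collapse = "_"), d)  # letters ordered by val
}, by = group]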
Unless I'm missing something, this just requires setting the key of your data.table to group and val:
setkey(dt, "group", "val")
# id group val ltr col5
# 1: 6 A 90 v 145
# 2: 1 A 96 x 197
# 3: 7 B 25 k 172
# 4: 9 B 33 f 129
# 5: 3 B 68 p 168
# 6: 10 B 96 c 121
# 7: 8 C 5 l 120
# 8: 5 C 11 o 102
# 9: 2 C 46 r 190
# 10: 4 C 58 w 177
You see that the rows are automatically ordered. Now you can aggregate by group:
dt[, letters := paste(ltr, collapse="_"), by = group]
# id group val ltr col5 letters
# 1: 6 A 90 v 145 v_x
# 2: 1 A 96 x 197 v_x
# 3: 7 B 25 k 172 k_f_p_c
# 4: 9 B 33 f 129 k_f_p_c
# 5: 3 B 68 p 168 k_f_p_c
# 6: 10 B 96 c 121 k_f_p_c
# 7: 8 C 5 l 120 l_o_r_w
# 8: 5 C 11 o 102 l_o_r_w
# 9: 2 C 46 r 190 l_o_r_w
# 10: 4 C 58 w 177 l_o_r_w