The ultimate objective is to compare the variance and standard deviation of a simple statistic (numerator / denominator / true_count) against avg_score, over 10 trials of incrementally sized random samples per word, from a dataset similar to:
library(data.table)
set.seed(1)
df <- data.frame(
word_ID = c(rep(1,4),rep(2,3),rep(3,2),rep(4,5),rep(5,5),rep(6,3),rep(7,4),rep(8,4),rep(9,6),rep(10,4)),
word = c(rep("cat",4), rep("house", 3), rep("sung",2), rep("door",5), rep("pretty", 5), rep("towel",3), rep("car",4), rep("island",4), rep("ran",6), rep("pizza", 4)),
true_count = c(rep(234,4),rep(39,3),rep(876,2),rep(4,5),rep(67,5),rep(81,3),rep(90,4),rep(43,4),rep(54,6),rep(53,4)),
occurrences = c(rep(234,4),rep(34,3),rep(876,2),rep(4,5),rep(65,5),rep(81,3),rep(90,4),rep(43,4),rep(54,6),rep(51,4)),
item_score = runif(40),
avg_score = rnorm(40),
line = c(71,234,71,34,25,32,573,3,673,899,904,2,4,55,55,1003,100,432,100,29,87,326,413,32,54,523,87,988,988,12,24,754,987,12,4276,987,93,65,45,49),
validity = sample(c("T", "F"), 40, replace = T)
)
dt <- data.table(df)
dt[ , denominator := 1:.N, by=word_ID]
dt[ , numerator := 1:.N, by=c("word_ID", "validity")]
dt$numerator[dt$validity == "F"] <- 0
df <- dt
df
word_ID word true_count occurrences item_score avg_score line validity denominator numerator
1: 1 cat 234 234 0.25497614 0.15268651 71 F 1 0
2: 1 cat 234 234 0.18662407 1.77376261 234 F 2 0
3: 1 cat 234 234 0.74554352 -0.64807093 71 T 3 1
4: 1 cat 234 234 0.93296878 -0.19981748 34 T 4 2
5: 2 house 39 34 0.49471189 0.68924373 25 F 1 0
6: 2 house 39 34 0.64499368 0.03614551 32 T 2 1
7: 2 house 39 34 0.17580259 1.94353631 573 F 3 0
8: 3 sung 876 876 0.60299465 0.73721373 3 T 1 1
9: 3 sung 876 876 0.88775767 2.32133393 673 F 2 0
10: 4 door 4 4 0.49020940 0.34890935 899 T 1 1
11: 4 door 4 4 0.01838357 -1.13391666 904 T 2 2
The data represents each detection of a word in a document, so it's possible for a word to appear on the same line more than once. The sample size should count unique values of the line column, but every row sharing a sampled line number should be returned, meaning the actual number of rows returned can exceed the specified sample size. So, for one trial with a sample size of two lines for "cat", the desired result would take the form:
word_ID word true_count occurrences item_score avg_score line validity denominator numerator
1: 1 cat 234 234 0.25497614 0.15268651 71 F 1 0
2: 1 cat 234 234 0.18662407 1.77376261 234 F 2 0
3: 1 cat 234 234 0.74554352 -0.64807093 71 T 3 1
My basic iteration (found on this site) currently looks like:
a2 <- vector("list", 10)
b3 <- vector("list", 10)
for (i in 1:10) {
  a2[[i]] <- lapply(split(df, df$word_ID), function(x) x[sample(nrow(x), 2, replace = TRUE), ])
  b3[[i]] <- lapply(split(df, df$word_ID), function(x) x[sample(nrow(x), 3, replace = TRUE), ])
}
So I can draw the standard fixed-size random samples, but I'm unsure how to approach the goal stated above (I couldn't find anything similar, or perhaps I wasn't searching the right way). Is there a straightforward way to approach this?
Thanks,
Here is a data.table solution that uses a join on a sampled data.table.
set.seed(1234)
df[df[, .(line=sample(unique(line), 2)), by=word], on=.(word, line)]
The inner data.table consists of two columns, word and line, with two rows per word, each holding a unique value of line. The line values are drawn by sample(), which is fed the unique values of line separately for each word (via by=word). You can vary the number of unique line values by changing 2 to your desired value. This data.table is then joined onto the main data.table to select the desired rows.
In this instance, you get
word_ID word true_count occurrences item_score avg_score line validity
1: 1 cat 234 234 0.26550866 0.91897737 71 F
2: 1 cat 234 234 0.57285336 0.07456498 71 T
3: 1 cat 234 234 0.37212390 0.78213630 234 T
4: 2 house 39 34 0.89838968 -0.05612874 32 T
5: 2 house 39 34 0.94467527 -0.15579551 573 F
6: 3 sung 876 876 0.62911404 -0.47815006 673 T
7: 3 sung 876 876 0.66079779 -1.47075238 3 T
8: 4 door 4 4 0.06178627 0.41794156 899 F
9: 4 door 4 4 0.38410372 -0.05380504 55 F
10: 5 pretty 67 65 0.71761851 -0.39428995 100 F
11: 5 pretty 67 65 0.38003518 1.10002537 100 F
12: 5 pretty 67 65 0.49769924 -0.41499456 1003 F
13: 6 towel 81 81 0.21214252 -0.25336168 326 F
14: 6 towel 81 81 0.93470523 -0.16452360 87 F
15: 7 car 90 90 0.12555510 0.55666320 32 T
16: 7 car 90 90 0.26722067 -0.68875569 54 F
17: 8 island 43 43 0.01339033 0.36458196 87 T
18: 8 island 43 43 0.38238796 0.76853292 988 F
19: 8 island 43 43 0.86969085 -0.11234621 988 T
20: 9 ran 54 54 0.59956583 -0.61202639 754 F
21: 9 ran 54 54 0.82737332 1.43302370 4276 F
22: 10 pizza 53 51 0.79423986 -0.36722148 93 F
23: 10 pizza 53 51 0.41127443 -0.13505460 49 T
word_ID word true_count occurrences item_score avg_score line validity
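To connect this with the stated goal of 10 trials per sample size, the join can simply be wrapped in a loop. A minimal sketch, assuming df is the data.table built in the question; the names sizes and trials are illustrative, and min(n, uniqueN(line)) guards against words that have fewer unique lines than the requested sample size:
sizes  <- 2:3                  # illustrative sample sizes to try in each trial
trials <- vector("list", 10)   # one element per trial
for (i in seq_along(trials)) {
  trials[[i]] <- lapply(sizes, function(n)
    # sample unique line values per word, then join back to keep all matching rows
    df[df[, .(line = sample(unique(line), min(n, uniqueN(line)))), by = word],
       on = .(word, line)])
}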
If you sample from a de-duplicated data.frame and do a subsequent left-join with the original data, you can guarantee the behavior you need.
I'm not proficient with data.table, so I'll use base functions. (dplyr would work well here, too, but since you're using data.table, I'll avoid it for now.) (As I was about to hit submit, @lmo provided a data.table-specific answer ...)
By "de-duplicate", I mean:
subdf <- df[,c("word_ID", "line")]
subdf <- subdf[!duplicated(subdf),]
dim(subdf)
# [1] 36 2
head(subdf)
# word_ID line
# 1 1 71
# 2 1 234
# 4 1 34
# 5 2 25
# 6 2 32
# 7 2 573
Note that subdf has only three rows for word_ID 1, whereas the original data has four:
df[1:4,]
# word_ID word true_count occurrences item_score avg_score line validity
# 1 1 cat 234 234 0.2655087 0.91897737 71 F
# 2 1 cat 234 234 0.3721239 0.78213630 234 T
# 3 1 cat 234 234 0.5728534 0.07456498 71 T
# 4 1 cat 234 234 0.9082078 -1.98935170 34 T
I'm using by here instead of lapply/split, but the results should be the same:
out <- by(subdf, subdf$word_ID, function(x) merge(x[sample(nrow(x), 2, replace=TRUE),], df, by=c("word_ID", "line")))
out[1]
# $`1`
# word_ID line word true_count occurrences item_score avg_score validity
# 1 1 34 cat 234 234 0.9082078 -1.98935170 T
# 2 1 71 cat 234 234 0.5728534 0.07456498 T
# 3 1 71 cat 234 234 0.2655087 0.91897737 F
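From either approach, the variance and standard deviation of the question's statistic can then be computed across the sampled groups. A sketch under one plausible reading of that statistic (the row-wise ratio numerator / denominator / true_count, averaged within each word), assuming the sampled rows still carry those three columns:
stat <- sapply(out, function(s)
  mean(s$numerator / s$denominator / s$true_count))
var(stat)
sd(stat)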
My question is described in the code below. I have looked here and in other forums for similar problems, but haven't found a solution that quite matches what I'm asking here. If it can be solved relying only on basic R, that would be preferable, but using a package is fine too.
id1 <- c("A", "A", "A", "B", "B", "C", "C", "C")
id2 <- c(10, 20, 30, 10, 30, 10, 20, 30)
x.1 <- ceiling(runif(8)*80) + 20
y.1 <- ceiling(runif(8)*15) + 200
x.2 <- ceiling(runif(8)*90) + 20
y.2 <- ceiling(runif(8)*20) + 200
x.3 <- ceiling(runif(8)*80) + 40
# The data frame contains two kinds of data values, x and y, repeated with a suffix number. In my example,
# neither the id part nor the data part is structured in a completely uniform manner.
mywidedata <- data.frame(id1, id2, x.1, y.1, x.2, y.2, x.3)
# If I wanted to make the data frame even wider, this would work. It generates NAs for the missing combination (B,20).
reshape(mywidedata, idvar = "id1", timevar = "id2", direction = "wide")
# What I want is "long", and this fails.
reshape(mywidedata, varying = c(3:7), direction = "long")
# I could introduce the needed column. This works.
mywidecopy <- mywidedata
mywidecopy$y.3 <- NA
mylongdata <- reshape(mywidecopy, idvar=c(1,2), varying = c(3:8), direction = "long", sep = ".")
# (sep-argument not needed in this case - the function can figure out the system)
names(mylongdata)[(names(mylongdata)=="time")] <- "id3"
# I want to reach the same outcome without manual manipulation. Is it possible with just the
# built-in 'reshape'?
# Trying 'melt'. Not what I want.
reshape::melt(mywidedata, id.vars = c(1,2))
You can use pivot_longer from tidyr:
tidyr::pivot_longer(mywidedata,
cols = -c(id1, id2),
names_to = c('.value', 'id3'),
names_sep = '\\.')
# A tibble: 24 x 5
# id1 id2 id3 x y
# <chr> <dbl> <chr> <dbl> <dbl>
# 1 A 10 1 66 208
# 2 A 10 2 95 220
# 3 A 10 3 89 NA
# 4 A 20 1 34 208
# 5 A 20 2 81 219
# 6 A 20 3 82 NA
# 7 A 30 1 23 201
# 8 A 30 2 80 204
# 9 A 30 3 75 NA
#10 B 10 1 52 210
# … with 14 more rows
Just cbind the missing level as NA.
reshape(cbind(mywidedata, y.3=NA), varying=3:8, direction="long")
# id1 id2 time x y id
# 1.1 A 10 1 98 215 1
# 2.1 A 20 1 38 208 2
# 3.1 A 30 1 97 205 3
# 4.1 B 10 1 61 207 4
# 5.1 B 30 1 73 201 5
# 6.1 C 10 1 96 202 6
# 7.1 C 20 1 100 202 7
# 8.1 C 30 1 94 202 8
# 1.2 A 10 2 73 208 1
# 2.2 A 20 2 69 218 2
# 3.2 A 30 2 64 219 3
# 4.2 B 10 2 104 213 4
# 5.2 B 30 2 99 203 5
# 6.2 C 10 2 92 206 6
# 7.2 C 20 2 49 206 7
# 8.2 C 30 2 59 209 8
# 1.3 A 10 3 63 NA 1
# 2.3 A 20 3 91 NA 2
# 3.3 A 30 3 42 NA 3
# 4.3 B 10 3 67 NA 4
# 5.3 B 30 3 90 NA 5
# 6.3 C 10 3 74 NA 6
# 7.3 C 20 3 86 NA 7
# 8.3 C 30 3 83 NA 8
We can use melt from data.table:
library(data.table)
melt(setDT(mywidedata), measure = patterns("^x", "^y"), value.name = c('x', 'y'))
# id1 id2 variable x y
# 1: A 10 1 97 215
# 2: A 20 1 75 202
# 3: A 30 1 87 213
# 4: B 10 1 51 206
# 5: B 30 1 75 203
# 6: C 10 1 41 210
# 7: C 20 1 58 211
# 8: C 30 1 50 207
# 9: A 10 2 92 204
#10: A 20 2 60 207
#11: A 30 2 35 201
#12: B 10 2 83 202
#13: B 30 2 81 202
#14: C 10 2 55 216
#15: C 20 2 68 204
#16: C 30 2 70 218
#17: A 10 3 89 NA
#18: A 20 3 108 NA
#19: A 30 3 47 NA
#20: B 10 3 78 NA
#21: B 30 3 43 NA
#22: C 10 3 106 NA
#23: C 20 3 92 NA
#24: C 30 3 96 NA
I have a data frame like this in wide format:
set.seed(1)
df = data.frame(item=letters[1:6], field1a=sample(6,6),field1b=sample(60,6),
field1c=sample(200,6),field2a=sample(6,6),field2b=sample(60,6),
field2c=sample(200,6))
What would be the best way to stack all the a columns together, all the b columns together, and all the c columns together, like this?
items fielda fieldb fieldc
a 2 52 121
a 1 44 57
Using base R:
cbind(item=df$item,unstack(transform(stack(df,-1),ind=sub("\\d+","",ind))))
item fielda fieldb fieldc
1 a 2 57 138
2 b 6 39 77
3 c 3 37 153
4 d 4 4 99
5 e 1 12 141
6 f 5 10 194
7 a 3 17 97
8 b 4 23 120
9 c 5 1 98
10 d 1 22 37
11 e 2 49 163
12 f 6 19 131
Or you can use the reshape function in Base R:
reshape(df,varying = split(names(df)[-1],rep(1:3,2)),idvar = "item",direction = "long")
item time field1a field1b field1c
a.1 a 1 2 57 138
b.1 b 1 6 39 77
c.1 c 1 3 37 153
d.1 d 1 4 4 99
e.1 e 1 1 12 141
f.1 f 1 5 10 194
a.2 a 2 3 17 97
b.2 b 2 4 23 120
c.2 c 2 5 1 98
d.2 d 2 1 22 37
e.2 e 2 2 49 163
f.2 f 2 6 19 131
You can also rename the columns yourself so that reshape can parse them:
names(df)=sub("(\\d)(.)","\\2.\\1",names(df))
reshape(df,varying= -1,idvar = "item",direction = "long")
If we are using the tidyverse, we can gather into 'long' format, do some rearrangement of the column names, and spread:
library(tidyverse)
out <- df %>%
gather(key, val, -item) %>%
mutate(key1 = gsub("\\d+", "", key),
key2 = gsub("\\D+", "", key)) %>%
select(-key) %>%
spread(key1, val) %>%
select(-key2)
head(out, 2)
# item fielda fieldb fieldc
#1 a 2 57 138
#2 a 3 17 97
Or a similar option is melt/dcast from data.table, where we melt into 'long' format, substring the 'variable' and then dcast to 'wide' format
library(data.table)
dcast(melt(setDT(df), id.var = "item")[, variable := sub("\\d+", "", variable)
], item + rowid(variable) ~ variable, value.var = 'value')[
, variable := NULL][]
# item fielda fieldb fieldc
# 1: a 2 57 138
# 2: a 3 17 97
# 3: b 6 39 77
# 4: b 4 23 120
# 5: c 3 37 153
# 6: c 5 1 98
# 7: d 4 4 99
# 8: d 1 22 37
# 9: e 1 12 141
#10: e 2 49 163
#11: f 5 10 194
#12: f 6 19 131
NOTE: This should also work when the lengths are not balanced across cases.
data
set.seed(1)
df = data.frame(item = letters[1:6],
field1a=sample(6,6),
field1b=sample(60,6),
field1c=sample(200,6),
field2a=sample(6,6),
field2b=sample(60,6),
field2c=sample(200,6))
Following up from this post:
Calculate ranks for each group
df <- ddply(df, .(type), transform, pos = rank(x, ties.method = "min")-1)
Using the method described in the above post, when you have multiple ties within the same TYPE, the ranking output (Pos) gets a little messy and hard to interpret, though it is technically still accurate.
For example:
library(plyr)
df <- data.frame(type = c(rep("a",11), rep("b",6), rep("c",2), rep("d", 6)),
x = c(50:53, rep(54, 3), 55:56, rep(57, 2), rep(51,3), rep(52,2), 56,
53, 57, rep(52, 2), 54, rep(58, 2), 70))
df<-ddply(df,.(type),transform, pos=rank(x,ties.method="min")-1)
Produces:
Type X Pos
a 50 0
a 51 1
a 52 2
a 53 3
a 54 4
a 54 4
a 54 4
a 55 7
a 56 8
a 57 9
a 57 9
b 51 0
b 51 0
b 51 0
b 52 3
b 52 3
b 56 5
c 53 0
c 57 1
d 52 0
d 52 0
d 54 2
d 58 3
d 58 3
d 70 5
The Pos relative ranking is correct (equal values are ranked the same, lower values ranked lower, and higher values ranked higher), but I have been trying to make the output look prettier. Any thoughts?
I'd like to get the output to look like this:
Type X Pos
a 50 1
a 51 2
a 52 3
a 53 4
a 54 5
a 54 5
a 54 5
a 55 6
a 56 7
a 57 8
a 57 8
b 51 1
b 51 1
b 51 1
b 52 2
b 52 2
b 56 3
c 53 1
c 57 2
d 52 1
d 52 1
d 54 2
d 58 3
d 58 3
d 70 4
This format, of course, assumes that the total number of records in each group doesn't matter. By taking away the "-1", we can remove the 0's, but that only solves one aspect. I've tried playing around with different formulas and ties.method options, but to no avail.
Maybe the rank() function isn't what I should be using?
It seems you are looking for dense-rank:
as.data.table(df)[, pos := frank(x, ties.method = 'dense'), by = 'type'][]
# type x pos
# 1: a 50 1
# 2: a 51 2
# 3: a 52 3
# 4: a 53 4
# 5: a 54 5
# 6: a 54 5
# 7: a 54 5
# 8: a 55 6
# 9: a 56 7
# 10: a 57 8
# 11: a 57 8
# 12: b 51 1
# 13: b 51 1
# 14: b 51 1
# 15: b 52 2
# 16: b 52 2
# 17: b 56 3
# 18: c 53 1
# 19: c 57 2
# 20: d 52 1
# 21: d 52 1
# 22: d 54 2
# 23: d 58 3
# 24: d 58 3
# 25: d 70 4
# type x pos
dense_rank in dplyr does the same thing:
library(dplyr)
df %>% group_by(type) %>% mutate(pos = dense_rank(x)) %>% ungroup()
# # A tibble: 25 x 3
# type x pos
# <fctr> <dbl> <int>
# 1 a 50 1
# 2 a 51 2
# 3 a 52 3
# 4 a 53 4
# 5 a 54 5
# 6 a 54 5
# 7 a 54 5
# 8 a 55 6
# 9 a 56 7
# 10 a 57 8
# # ... with 15 more rows
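If you would rather stay in base R (no packages), the same dense rank can be computed with ave(), matching each value against the sorted unique values of its group; a sketch:
df$pos <- ave(df$x, df$type, FUN = function(v) match(v, sort(unique(v))))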
I am trying to find a proper way, in R, to find duplicated values and add 1 cumulatively to each subsequent duplicated value, grouped by id. For example:
data = data.table(id = c('1','1','1','1','1','2','2','2'),
value = c(95,100,101,101,101,20,35,38))
data$new_value <- ifelse(data$value == dplyr::lag(data$value, 1),
                         dplyr::lag(data$value, 1) + 1, data$value)
data$desired_value <- c(95,100,101,102,103,20,35,38)
Produces:
id value new_value desired_value
1: 1 95 NA 95
2: 1 100 100 100
3: 1 101 101 101 # first 101 in id 1: add 0
4: 1 101 102 102 # second 101 in id 1: add 1
5: 1 101 102 103 # third 101 in id 1: add 2
6: 2 20 20 20
7: 2 35 35 35
8: 2 38 38 38
I tried doing this with ifelse, but it doesn't work recursively, so it only applies to the following row and not to any subsequent rows. Also, the lag function causes me to lose the first value in value.
I've seen examples with character variables with make.names or make.unique, but haven't been able to find a solution for a duplicated numeric value.
Background: I am doing a survival analysis and I am finding that with my data there are stop times that are the same, so I need to make it unique by adding a 1 (stop times are in seconds).
Here's an attempt. You're essentially grouping by id and value and adding 0:(length(value)-1). So:
data[, onemore := value + (0:(.N-1)), by=.(id, value)]
# id value new_value desired_value onemore
#1: 1 95 96 95 95
#2: 1 100 101 100 100
#3: 1 101 102 101 101
#4: 1 101 102 102 102
#5: 1 101 102 103 103
#6: 2 20 21 20 20
#7: 2 35 36 35 35
#8: 2 38 39 38 38
With base R we can use ave: take the first value of each group and add each row's position within that group.
data$value1 <- ave(data$value, data$id, data$value, FUN = function(x)
x[1] + seq_along(x) - 1)
# id value new_value desired_value value1
#1: 1 95 96 95 95
#2: 1 100 101 100 100
#3: 1 101 102 101 101
#4: 1 101 102 102 102
#5: 1 101 102 103 103
#6: 2 20 21 20 20
#7: 2 35 36 35 35
#8: 2 38 39 38 38
Here is one option with tidyverse
library(dplyr)
data %>%
group_by(id, value) %>%
mutate(onemore = value + row_number()-1)
# id value onemore
# <chr> <dbl> <dbl>
#1 1 95 95
#2 1 100 100
#3 1 101 101
#4 1 101 102
#5 1 101 103
#6 2 20 20
#7 2 35 35
#8 2 38 38
Or we can use base R without an anonymous function call:
data$onemore <- with(data, value + ave(value, id, value, FUN =seq_along)-1)
data$onemore
#[1] 95 100 101 102 103 20 35 38
To avoid (a potentially costly) by, you may use rowid:
data[, res := value + rowid(id, value) - 1]
# data
# id value new_value desired_value res
# 1: 1 95 96 95 95
# 2: 1 100 101 100 100
# 3: 1 101 102 101 101
# 4: 1 101 102 102 102
# 5: 1 101 102 103 103
# 6: 2 20 21 20 20
# 7: 2 35 36 35 35
# 8: 2 38 39 38 38
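Given the survival-analysis motivation (stop times must be unique within id), a quick sanity check on the result; shown here for the res column, but it works the same for onemore or value1:
data[, .(has_dups = anyDuplicated(res) > 0), by = id]
#    id has_dups
# 1:  1    FALSE
# 2:  2    FALSE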
Let's say I have the following data.table:
set.seed(123)
dt <- data.table (id=1:10,
group=sample(LETTERS[1:3], 10, replace=TRUE),
val=sample(1:100, 10, replace=TRUE),
ltr=sample(letters, 10),
col5=sample(100:200, 10)
)
setkey(dt, id)
(dt)
# id group val ltr col5
# 1: 1 A 96 x 197
# 2: 2 C 46 r 190
# 3: 3 B 68 p 168
# 4: 4 C 58 w 177
# 5: 5 C 11 o 102
# 6: 6 A 90 v 145
# 7: 7 B 25 k 172
# 8: 8 C 5 l 120
# 9: 9 B 33 f 129
# 10: 10 B 96 c 121
Now I want to process it grouped by group: within each group I need to order the records by the val column and then do some manipulations on the ordered group (for example, add a column with the ltr values merged in order):
# id group val ltr letters
# 1 6 A 90 v v_x
# 2 1 A 96 x v_x
# 3 7 B 25 k k_f_p_c
# 4 9 B 33 f k_f_p_c
# 5 3 B 68 p k_f_p_c
# 6 10 B 96 c k_f_p_c
# 7 8 C 5 l l_o_r_w
# 8 5 C 11 o l_o_r_w
# 9 2 C 46 r l_o_r_w
# 10 4 C 58 w l_o_r_w
(in this example the whole table is ordered but this is not required)
That's how I imagine the code in general:
dt1 <- dt[,
{
# processing here, reorder somehow
# ???
# ...
list(id=id, ltr=ltr, letters=paste0(ltr,collapse="_"))
},
by=group]
Thanks in advance for any ideas!
UPD. As noted in the answers, for my example I can simply order by group and then by val. But what if I need several different orderings? For example, suppose I want to sort by col5 and add a col5diff column showing the difference from each group's first (smallest) col5 value:
# id group val ltr col5 letters col5diff
# 1: 6 A 90 v 145 v_x
# 2: 1 A 96 x 197 v_x 52
# 3: 10 B 96 c 121 k_f_p_c
# 4: 9 B 33 f 129 k_f_p_c 8
# 5: 3 B 68 p 168 k_f_p_c 47
# 6: 7 B 25 k 172 k_f_p_c 51
# 7: 5 C 11 o 102 l_o_r_w
# 8: 8 C 5 l 120 l_o_r_w 18
# 9: 4 C 58 w 177 l_o_r_w 75
#10: 2 C 46 r 190 l_o_r_w 88
OK, for this example the calculations of letters and col5diff are independent, so I can simply do them consecutively:
setkey(dt, "group", "val")
dt[, letters := paste(ltr, collapse="_"), by = group]
setkey(dt, "group", "col5")
dt <- dt[, col5diff := {
  diff <- NA
  for (i in 2:length(col5)) { diff <- c(diff, col5[i] - col5[1]) }
  diff  # updated to use := instead of list - thanks to the comment from @Frank
}, by = group]
But I would also be glad to know what to do if I needed to use both of these orderings (in a single {} block).
I think you're just looking for order:
dt[, letters:=paste(ltr[order(val)], collapse="_"), by=group]
dt[order(group, val)]
# id group val ltr col5 letters
# 1: 6 A 90 v 145 v_x
# 2: 1 A 96 x 197 v_x
# 3: 7 B 25 k 172 k_f_p_c
# 4: 9 B 33 f 129 k_f_p_c
# 5: 3 B 68 p 168 k_f_p_c
# 6: 10 B 96 c 121 k_f_p_c
# 7: 8 C 5 l 120 l_o_r_w
# 8: 5 C 11 o 102 l_o_r_w
# 9: 2 C 46 r 190 l_o_r_w
#10: 4 C 58 w 177 l_o_r_w
Or, if you do not want to add a column by reference:
dt[, list(id, val, ltr, letters=paste(ltr[order(val)], collapse="_")),
by=group][order(group, val)]
# group id val ltr letters
# 1: A 6 90 v v_x
# 2: A 1 96 x v_x
# 3: B 7 25 k k_f_p_c
# 4: B 9 33 f k_f_p_c
# 5: B 3 68 p k_f_p_c
# 6: B 10 96 c k_f_p_c
# 7: C 8 5 l l_o_r_w
# 8: C 5 11 o l_o_r_w
# 9: C 2 46 r l_o_r_w
#10: C 4 58 w l_o_r_w
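Building on this, the UPD's second ordering can be handled in the same grouped call without re-keying, since each ordering is just an order() or min() computation inside j. A sketch, with col5diff defined as in the question (NA for the group's smallest col5, otherwise the difference from it; assumes no ties at the minimum):
dt[, `:=`(
  letters  = paste(ltr[order(val)], collapse = "_"),
  col5diff = ifelse(col5 == min(col5), NA, col5 - min(col5))
), by = group]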
Unless I'm missing something, this just requires setting the key of your data.table to group and val:
setkey(dt, "group", "val")
# id group val ltr col5
# 1: 6 A 90 v 145
# 2: 1 A 96 x 197
# 3: 7 B 25 k 172
# 4: 9 B 33 f 129
# 5: 3 B 68 p 168
# 6: 10 B 96 c 121
# 7: 8 C 5 l 120
# 8: 5 C 11 o 102
# 9: 2 C 46 r 190
# 10: 4 C 58 w 177
You see that the rows are automatically ordered. Now you can compute by group:
dt[, letters := paste(ltr, collapse="_"), by = group]
# id group val ltr col5 letters
# 1: 6 A 90 v 145 v_x
# 2: 1 A 96 x 197 v_x
# 3: 7 B 25 k 172 k_f_p_c
# 4: 9 B 33 f 129 k_f_p_c
# 5: 3 B 68 p 168 k_f_p_c
# 6: 10 B 96 c 121 k_f_p_c
# 7: 8 C 5 l 120 l_o_r_w
# 8: 5 C 11 o 102 l_o_r_w
# 9: 2 C 46 r 190 l_o_r_w
# 10: 4 C 58 w 177 l_o_r_w
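For the UPD's col5diff under this keyed approach, the key can be reset to col5 before a second grouped assignment; a sketch consistent with the question's own loop (NA for the first row of each group):
setkey(dt, "group", "col5")
dt[, col5diff := c(NA, col5[-1] - col5[1]), by = group]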