I have data set
ID <- c(1,1,2,2,2,2,3,3,3,3,3,4,4,4)
Eval <- c("A","A","B","B","A","A","A","A","B","B","A","A","A","B")
med <- c("c","d","k","k","h","h","c","d","h","h","h","c","h","k")
df <- data.frame(ID,Eval,med)
> df
ID Eval med
1 1 A c
2 1 A d
3 2 B k
4 2 B k
5 2 A h
6 2 A h
7 3 A c
8 3 A d
9 3 B h
10 3 B h
11 3 A h
12 4 A c
13 4 A h
14 4 B k
I try to create variable x and y, group by ID and Eval. For each ID, if Eval = A, and med = "h" or "k", I set x = 1, other wise x = 0, if Eval = B and med = "h" or "k", I set y = 1, other wise y = 0. I use the way I don't like it, I got answer but it seem like not that great
df <- data.table(df)
setDT(df)[, count := uniqueN(med) , by = .(ID,Eval)]
setDT(df)[Eval == "A", x:= ifelse(count == 1 & med %in% c("k","h"),1,0), by=ID]
setDT(df)[Eval == "B", y:= ifelse(count == 1 & med %in% c("k","h"),1,0), by=ID]
ID Eval med count x y
1: 1 A c 2 0 NA
2: 1 A d 2 0 NA
3: 2 B k 1 NA 1
4: 2 B k 1 NA 1
5: 2 A h 1 1 NA
6: 2 A h 1 1 NA
7: 3 A c 3 0 NA
8: 3 A d 3 0 NA
9: 3 B h 1 NA 1
10: 3 B h 1 NA 1
11: 3 A h 3 0 NA
12: 4 A c 2 0 NA
13: 4 A h 2 0 NA
14: 4 B k 1 NA 1
Then I need to collapse the row to get unique ID, I don't know how to collapse rows, any idea?
The output
ID x y
1 0 0
2 1 1
3 0 1
4 0 1
We create the 'x' and 'y' variables grouped by 'ID' without the NA elements directly coercing the logical vector to binary (as.integer)
df[, x := as.integer(Eval == "A" & count ==1 & med %in% c("h", "k")) , by = ID]
and similarly for 'y'
df[, y := as.integer(Eval == "B" & count ==1 & med %in% c("h", "k")) , by = ID]
and summarise it, using any after grouping by "ID"
df[, lapply(.SD, function(x) as.integer(any(x))) , ID, .SDcols = x:y]
# ID x y
#1: 1 0 0
#2: 2 1 1
#3: 3 0 1
#4: 4 0 1
If we need a compact approach, instead of assinging (:=), we summarise the output grouped by "ID", "Eval" based on the conditions and then grouped by 'ID', we check if there is any TRUE values in 'x' and 'y' by looping over the columns described in the .SDcols.
setDT(df)[, if(any(uniqueN(med)==1 & med %in% c("h", "k"))) {
.(x= Eval=="A", y= Eval == "B") } else .(x=FALSE, y=FALSE),
by = .(ID, Eval)][, lapply(.SD, any) , by = ID, .SDcols = x:y]
# ID x y
#1: 1 FALSE FALSE
#2: 2 TRUE TRUE
#3: 3 FALSE TRUE
#4: 4 FALSE TRUE
If needed, we can convert to binary similar to the approach showed in the first solution.
The OP's goal...
"I try to create variable x and y, group by ID and Eval. For each ID, if Eval = A, and med = "h" or "k", I set x = 1, other wise x = 0, if Eval = B and med = "h" or "k", I set y = 1, other wise y = 0. [...] Then I need to collapse the row to get unique ID"
can be simplified to...
For each ID and Eval, flag if all med values are h or all med values are k.
setDT(df) # only do this once
df[, all(med=="k") | all(med=="h"), by=.(ID,Eval)][, dcast(.SD, ID ~ Eval, fun=any)]
ID A B
1: 1 FALSE FALSE
2: 2 TRUE TRUE
3: 3 FALSE TRUE
4: 4 FALSE TRUE
To see what dcast is doing, read ?dcast and try running just the first part on its own, df[, all(med=="k") | all(med=="h"), by=.(ID,Eval)].
The change to use x and y instead of A and B is straightforward but ill-advised (since unnecessary renaming can be confusing and lead to extra work when there are new Eval values); and ditto the change for 1/0 instead of TRUE/FALSE (since the values captured are actually boolean).
Here is my dplyr solution since I find it more readable than data.table.
library(dplyr)
df %>%
group_by(ID, Eval) %>%
mutate(
count = length(unique(med)),
x = ifelse(Eval == "A" &
count == 1 & med %in% c("h", "k"), 1, 0),
y = ifelse(Eval == "B" &
count == 1 & med %in% c("h", "k"), 1, 0)
) %>%
group_by(ID) %>%
summarise(x1 = max(unique(x)),
y1 = max(unique(y)))
A one liner solution for collapsing the rows of your result :
df[,lapply(.SD,function(i) {ifelse(1 %in% i,ifelse(!0 %in% i,1,0),0)}),.SDcols=x:y,by=ID]
ID x y
1: 1 0 0
2: 2 1 1
3: 3 0 1
4: 4 0 1
Related
I have a data table of this form (2000000+ rows, 1000+groups):
set.seed(1)
dt <- data.table(id = rep(1:3, each = 5), values = sample(c("a", "b","c"), 15, TRUE))
> dt
id values
1: 1 a
2: 1 c
3: 1 a
4: 1 b
5: 1 a
6: 2 c
7: 2 c
8: 2 b
9: 2 b
10: 2 c
11: 3 c
12: 3 a
13: 3 a
14: 3 a
15: 3 b
I want to, within each ID group, replace the entire sequence of character "a", that precedes the character "b", and I want to replace them with "b". So the condition is that if "a" or a sequence of "a"s appear before "b", replace all the "a"s. (actually, in my real table, it's when "b" is preceded by "a","x", or"y", preceding character should be replaced, but I should be able to generalize)
In the example above,the value of "a" in row 3 should be replaced (easy to do with (shift) in data.table), as well as all the "a"s in rows 12-14 (not sure how to do). So, the desired output is this:
> dt
id values
1: 1 a
2: 1 c
3: 1 b
4: 1 b
5: 1 a
6: 2 c
7: 2 c
8: 2 b
9: 2 b
10: 2 c
11: 3 c
12: 3 b
13: 3 b
14: 3 b
15: 3 b
What comes to my mind is looping from the last index, but I am not exactly sure how to do that with if I have multiple groupings (say, ID and DATE), and anyway, this doesn't seem to be the fastest dt solution.
Here's another data.table approach:
dt[, x := rleid(values), by = .(id)]
dt[dt[values == "b", .(id, x=x-1, values="a")],
on = .(id, x, values),
values := "b"
][, x := NULL]
create a new column "x" with the run length ids per value grouped by id
join on itself while modifying the run length ids (x) to be the preceeding value and values to be "a" (the specific value you want to change), then update values with "b"
delete column x afterwards
The result is:
dt
# id values
# 1: 1 a
# 2: 1 c
# 3: 1 b
# 4: 1 b
# 5: 1 a
# 6: 2 c
# 7: 2 c
# 8: 2 b
# 9: 2 b
# 10: 2 c
# 11: 3 c
# 12: 3 b
# 13: 3 b
# 14: 3 b
# 15: 3 b
And here's a generalization to the case where you want to replace values "a", "x", or "y" followed by "b" with "b":
dt[, x := rleid(values), by = .(id)]
dt[dt[values == "b", .(values=c("a", "x", "y")), by = .(id, x=x-1)],
on = .(id, x, values),
values := "b"
][, x := NULL]
Late to the party and several nice run length alternatives were already provided ;) So here I try nafill instead.
(1) Create a variable 'v2' which is NA when 'values' are "a". (2) Fill missing values by next observation carried backward. (3) When the original 'values' are "a" and the corresponding filled values in 'v2' are "b", update 'v' with 'v2'.
# 1
dt[values != "a" , v2 := values]
# 2
d1[, v2 := v2[nafill(replace(seq_len(.N), is.na(v2), NA), type = "nocb")], by = id]
# 3
dt[values == "a" & v2 == "b", values := v2]
# clean-up
dt[ , v2 := NULL]
Currently, nafill only works with numeric variables, hence replace step in chunk # 2 (modified from #chinsoon12 in the issue nafill, setnafill for character, factor and other types).
The NA replacement code may be slightly shortened by using zoo::nalocf:
dt[, v2 := zoo::na.locf(v2, fromLast = TRUE, na.rm = FALSE), by = id]
However, note that na.locf is slower.
When comparing the answers on larger data (data.table(id = rep(1:1e4, each = 1e4, replace = TRUE), values = sample(c("a", "b", "c"), 1e8, replace = TRUE)), it turns out that this alternative actually is faster than the others.
This is not pretty but I think this is what you are after:
dt[, .N, by = .(id, values = paste0(values, rleid(values)))
][, values := sub("[0-9]+", "", values)
][, values := fifelse(values == "a" & shift(values, -1L) == "b" & !is.na(shift(values, -1L)), "b", values), by = id
][, .SD[rep(seq_len(.N), N)]
][, !"N"]
id values
1: 1 a
2: 1 c
3: 1 b
4: 1 b
5: 1 a
6: 2 c
7: 2 c
8: 2 b
9: 2 b
10: 2 c
11: 3 c
12: 3 b
13: 3 b
14: 3 b
15: 3 b
You can use rle().
Note: To avoid ambiguity, I rename the "values" column to "var" because the rle() function also produces a list containing a vector named "values".
dt[, new := with(rle(var), rep(ifelse(values == "a" & c(values[-1], "") == "b", "b", values), lengths)), by = id]
dt
# id var new
# 1: 1 a a
# 2: 1 c c
# 3: 1 a b
# 4: 1 b b
# 5: 1 a a
# 6: 2 c c
# 7: 2 c c
# 8: 2 b b
# 9: 2 b b
# 10: 2 c c
# 11: 3 c c
# 12: 3 a b
# 13: 3 a b
# 14: 3 a b
# 15: 3 b b
I have a data.table:
groups <- data.table(group = c("A", "B", "C", "D", "E", "F", "G"),
code_1 = c(2,2,2,7,8,NA,5),
code_2 = c(NA,3,NA,3,NA,NA,2),
code_3 = c(4,1,1,4,4,1,8))
group code_1 code_2 code_3
A 2 NA 4
B 2 3 1
C 2 NA 1
D 7 3 4
E 8 NA 4
F NA NA 1
G 5 2 8
What I would like to achieve, is for each group to find the immediate neighbors based on the available codes. For example: Group A has immediate neighbors groups B, C due to code_1 (code_1 is equal to 2 in all groups) and has immediate neighbor groups D,E due to code_3 (code_3 is equal to 4 in all those groups).
What I tried is for each code, subsetting the first column (group) based on the matches as follows:
groups$code_1_match = list()
for (row in 1:nrow(groups)){
set(groups, i=row, j="code_1_match", list(groups$group[groups$code_1[row] == groups$code_1]))
}
group code_1 code_2 code_3 code_1_match
A 2 NA 4 A,B,C,NA
B 2 3 1 A,B,C,NA
C 2 NA 1 A,B,C,NA
D 7 3 4 D,NA
E 8 NA 4 E,NA
F NA NA 1 NA,NA,NA,NA,NA,NA,...
G 5 2 8 NA,G
This "kinda" works but I would assume there is a more data table kind of way of doing this. I tried
groups[, code_1_match_2 := list(group[code_1 == groups$code_1])]
But this doesn't work.
Am I missing some obvious data table trick to deal with it?
My ideal case result would look like this (which currently would require using my method for all 3 columns and then concatenating the results):
group code_1 code_2 code_3 Immediate neighbors
A 2 NA 4 B,C,D,E
B 2 3 1 A,C,D,F
C 2 NA 1 A,B,F
D 7 3 4 B,A
E 8 NA 4 A,D
F NA NA 1 B,C
G 5 2 8
Using igraph, get 2nd degree neighbours, drop numeric nodes, paste remaining nodes.
library(data.table)
library(igraph)
# reshape wide-to-long
x <- melt(groups, id.vars = "group")[!is.na(value)]
# convert to graph
g <- graph_from_data_frame(x[, .(from = group, to = paste0(variable, "_", value))])
# get 2nd degree neighbours
x1 <- ego(g, 2, nodes = groups$group)
# prettify the result
groups$res <- sapply(seq_along(x1), function(i) toString(intersect(names(x1[[ i ]]),
groups$group[ -i ])))
# group code_1 code_2 code_3 res
# 1: A 2 NA 4 B, C, D, E
# 2: B 2 3 1 A, C, D, F
# 3: C 2 NA 1 A, B, F
# 4: D 7 3 4 B, A, E
# 5: E 8 NA 4 A, D
# 6: F NA NA 1 B, C
# 7: G 5 2 8
More info
This is how our data looks like before converting to igraph object. We want to ensure code1 with value 2 is different from code2 with value 2, etc.
x[, .(from = group, to = paste0(variable, "_", value))]
# from to
# 1: A code_1_2
# 2: B code_1_2
# 3: C code_1_2
# 4: D code_1_7
# 5: E code_1_8
# 6: G code_1_5
# 7: B code_2_3
# 8: D code_2_3
# 9: G code_2_2
# 10: A code_3_4
# 11: B code_3_1
# 12: C code_3_1
# 13: D code_3_4
# 14: E code_3_4
# 15: F code_3_1
# 16: G code_3_8
Here is how our network looks like:
Note that A..G nodes are always connected through code_x_y.
So we need to get the 2nd degree, ego(..., order = 2) gives us neighbours up to including 2nd degree neighbours, and returns a list object.
To get the names:
lapply(x1, names)
# [[1]]
# [1] "A" "code_1_2" "code_3_4" "B" "C" "D" "E"
#
# [[2]]
# [1] "B" "code_1_2" "code_2_3" "code_3_1" "A" "C" "D" "F"
#
# [[3]]
# [1] "C" "code_1_2" "code_3_1" "A" "B" "F"
#
# [[4]]
# [1] "D" "code_1_7" "code_2_3" "code_3_4" "B" "A" "E"
#
# [[5]]
# [1] "E" "code_1_8" "code_3_4" "A" "D"
#
# [[6]]
# [1] "F" "code_3_1" "B" "C"
#
# [[7]]
# [1] "G" "code_1_5" "code_2_2" "code_3_8"
To prettify the result, we need to remove code_x_y nodes and the origin node (1st node)
sapply(seq_along(x1), function(i) toString(intersect(names(x1[[ i ]]), groups$group[ -i ])))
#[1] "B, C, D, E" "A, C, D, F" "A, B, F" "B, A, E" "A, D" "B, C" ""
There is probably some more practical way of achieving this but you could do something like this, using melts and joins:
mgrp <- melt(groups, id.vars = "group")[!is.na(value)]
setkey(mgrp, variable, value)
for (i in seq_along(groups$group)) {
let = groups$group[i]
set(
groups,
i = i,
j = "inei",
value = list(mgrp[mgrp[group == let], setdiff(unique(group), let)])
)
}
groups
# group code_1 code_2 code_3 inei
# 1: A 2 NA 4 B,C,D,E
# 2: B 2 3 1 A,C,D,F
# 3: C 2 NA 1 A,B,F
# 4: D 7 3 4 B,A,E
# 5: E 8 NA 4 A,D
# 6: F NA NA 1 B,C
# 7: G 5 2 8
As mentioned by zx8754, using data.table::melt with combn and then igraph::as_adjacency_matrix
library(data.table)
df <- melt(groups, id.vars="group", na.rm=TRUE)[,
if (.N > 1L) transpose(combn(group, 2L, simplify=FALSE)), value][, (1) := NULL]
library(igraph)
as_adjacency_matrix(graph_from_data_frame(df, FALSE))
output:
7 x 7 sparse Matrix of class "dgCMatrix"
A B C E D G F
A . 1 1 1 1 1 .
B 1 . 2 . 1 1 1
C 1 2 . . . 1 1
E 1 . . . 1 1 .
D 1 1 . 1 . . .
G 1 1 1 1 . . .
F . 1 1 . . . .
or without using igraph
x <- df[, unique(c(V1, V2))]
df <- rbindlist(list(df, data.table(x, x)))
tab <- table(df) #or xtabs(~ V1 + V2, data=df)
ans <- t(tab) + tab
diag(ans) <- 0L
ans
output:
V1
V2 A B C D E F G
A 0 1 1 1 1 0 1
B 1 0 2 1 0 1 1
C 1 2 0 0 0 1 1
D 1 1 0 0 1 0 0
E 1 0 0 1 0 0 1
F 0 1 1 0 0 0 0
G 1 1 1 0 1 0 0
This is inspired by #sindri_baldur's melt. This solution:
Melts the groups
Performs a cartesian self-join.
Pastes together all the groups that matches.
Joins back to the original DT
library(data.table)
#> Warning: package 'data.table' was built under R version 3.6.2
groups <- data.table(group = c("A", "B", "C", "D", "E", "F", "G"), code_1 = c(2,2,2,7,8,NA,5), code_2 = c(NA,3,NA,3,NA,NA,2), code_3=c(4,1,1,4,4,1,8))
molten_grps = melt(groups, measure.vars = patterns("code"), na.rm = TRUE)
inei_dt = molten_grps[molten_grps,
on = .(variable, value),
allow.cartesian = TRUE
][,
.(inei = paste0(setdiff(i.group, .BY[[1L]]), collapse = ", ")),
by = group]
groups[inei_dt, on = .(group), inei := inei]
groups
#> group code_1 code_2 code_3 inei
#> <char> <num> <num> <num> <char>
#> 1: A 2 NA 4 B, C, D, E
#> 2: B 2 3 1 A, C, D, F
#> 3: C 2 NA 1 A, B, F
#> 4: D 7 3 4 B, A, E
#> 5: E 8 NA 4 A, D
#> 6: F NA NA 1 B, C
#> 7: G 5 2 8
I'm trying to create a new column that indicates if an ID was present in a previous group. Here's my data:
data <- data.table(ID = c(1:3, c(9,2,3,4),c(5,1)),
groups = c(rep(c("a", "b", "c"), c(3, 4,2))))
ID groups
1: 1 a
2: 2 a
3: 3 a
4: 9 b
5: 2 b
6: 3 b
7: 4 b
8: 5 c
9: 1 c
I'm not sure how to specify lagged groups. I tried to use shift, but it's not working:
data[,.(ID=ID,match_lagged=ID %in% shift(ID)),by=groups]
Here's my desired result.
The first 3 lines are not matched because there is no previous group. FALSE would also work for these three rows. ID=4 (in group b) is not matched in group a. ID=5 (in group c) is not matched in group b.
Note that ID 1 in group c is not matched in group b so it should be false even though it exists in group a. This is why duplicated(data$ID) does not work. Data from a group has to be matched from the previous group.
groups ID match_lagged
1: a 1 NA
2: a 2 NA
3: a 3 NA
4: b 9 FALSE
5: b 2 TRUE
6: b 3 TRUE
7: b 4 FALSE
8: c 5 FALSE
9: c 1 FALSE
A dplyr solution would also work.
Number the groups, and then check if the diff is equal to one for each ID.
data[, grp.id := .GRP, by = groups]
data[, match_lagged := c(FALSE, diff(grp.id) == 1), by = ID][
grp.id == 1, match_lagged := NA][]
# ID groups grp.id match_lagged
#1: 1 a 1 NA
#2: 2 a 1 NA
#3: 3 a 1 NA
#4: 9 b 2 FALSE
#5: 2 b 2 TRUE
#6: 3 b 2 TRUE
#7: 4 b 2 FALSE
#8: 5 c 3 FALSE
#9: 1 c 3 FALSE
This assumes that you find each ID only once in each group. If that's not the case you can unique, do the above, and then merge in.
This works. There is probably an easier solution:
data <- data.frame(ID = c(1:3, 1:4,c(5,1)),
groups = c(rep(c("a", "b", "c"), c(3, 4,2))))
z <- data %>% group_by(groups) %>% summarize(all_vals = list(ID))
z <- z %>% mutate(lagged_id = lag(all_vals,1))
match_lagged <- lapply(1:nrow(z) , function(x) {
(z$all_vals[x] %>% unlist) %in% (z$lagged_id[x] %>% unlist)
})
data$match_lagged = match_lagged %>% unlist
I would like to add a counter column in a data frame based on a set of identical rows. To do this, I used the package data.table. In my case, the comparison between rows need doing from the combination of columns "z" AND ("x" OR "y").
I tested:
DF[ , Index := .GRP, by = c("x","y","z") ]
but the result is the combination of "z" AND "x" AND "y".
How can I have the combination of "z" AND ("x" OR "y") ?
Here is a data example:
DF = data.frame(x=c("a","a","a","b","c","d","e","f","f"), y=c(1,3,2,8,8,4,4,6,0), z=c("M","M","M","F","F","M","M","F","F"))
DF <- data.table(DF)
I would like to have this output:
> DF
x y z Index
1: a 1 M 1
2: a 3 M 1
3: a 2 M 1
4: b 8 F 2
5: c 8 F 2
6: d 4 M 3
7: e 4 M 3
8: f 6 F 4
9: f 0 F 4
The new group starts if the value for z is changing or the values both for x and y are changing.
Try this example.
require(data.table)
DF <- data.table(x = c("a","a","a","b","c","d","e","f","f"),
y = c(1,3,2,8,8,4,4,6,0),
z=c("M","M","M","F","F","M","M","F","F"))
# The functions to compare if value is not equal with the previous value
is.not.eq.with.lag <- function(x) c(T, tail(x, -1) != head(x, -1))
DF[, x1 := is.not.eq.with.lag(x)]
DF[, y1 := is.not.eq.with.lag(y)]
DF[, z1 := is.not.eq.with.lag(z)]
DF
DF[, Index := cumsum(z1 | (x1 & y1))]
DF
I know a lot of people warn against a for loop in R, but in this instance I think it is a very direct way of approaching the problem. Plus, the result isn't growing in size so performance issues aren't a large issue. The for loop approach would be:
dt$grp <- rep(NA,nrow(dt))
for (i in 1:nrow(dt)){
if (i == 1){
dt$grp[i] = 1
}
else {
if(dt$z[i-1] == dt$z[i] & (dt$x[i-1] == dt$x[i] | dt$y[i-1] == dt$y[i])){
dt$grp[i] = dt$grp[i-1]
}else{
dt$grp[i] = dt$grp[i-1] + 1
}
}
}
Trying this on OPs original problem, the result is:
DF = data.frame(x=c("a","a","a","b","c","d","e","f","f"), y=c(1,3,2,8,8,4,4,6,0), z=c("M","M","M","F","F","M","M","F","F"))
dt <- data.table(DF)
dt$grp <- rep(NA,nrow(dt))
for (i in 1:nrow(dt)){
if (i == 1){
dt$grp[i] = 1
}
else {
if(dt$z[i-1] == dt$z[i] & (dt$x[i-1] == dt$x[i] | dt$y[i-1] == dt$y[i])){
dt$grp[i] = dt$grp[i-1]
}else{
dt$grp[i] = dt$grp[i-1] + 1
}
}
}
dt
x y z grp
1: a 1 M 1
2: a 3 M 1
3: a 2 M 1
4: b 8 F 2
5: c 8 F 2
6: d 4 M 3
7: e 4 M 3
8: f 6 F 4
9: f 0 F 4
Trying this on the data.table in #Frank's comment, gives the expected result as well:
dt<-data.table(x = c("b", "a", "a"), y = c(1, 1, 2), z = c("F", "F", "F"))
dt$grp <- rep(NA,nrow(dt))
for (i in 1:nrow(dt)){
if (i == 1){
dt$grp[i] = 1
}
else {
if(dt$z[i-1] == dt$z[i] & (dt$x[i-1] == dt$x[i] | dt$y[i-1] == dt$y[i])){
dt$grp[i] = dt$grp[i-1]
}else{
dt$grp[i] = dt$grp[i-1] + 1
}
}
}
dt
x y z grp
1: b 1 F 1
2: a 1 F 1
3: a 2 F 1
EDITED TO ADD: This solution is in some ways a more verbose version of the one advocated by djhurio above. I think it shows what is happening a bit more so I'll leave it.
I think this is a task easier to do if it is broken down a little bit. The below code creates TWO indices at first, one for changes in x (nested in z) and one for changes in y (nested in z). We then find the first row from each of these indices. Taking the cumulative sum of the case where both FIRST.x and FIRST.y is true should give your desired index.
library(data.table)
dt_example <- data.table(x = c("a","a","a","b","c","d","e","f","f"),
y = c(1,3,2,8,8,4,4,6,0),
z = c("M","M","M","F","F","M","M","F","F"))
dt_example[,Index_x := .GRP,by = c("z","x")]
dt_example[,Index_y := .GRP,by = c("z","y")]
dt_example[,FIRST.x := !duplicated(Index_x)]
dt_example[,FIRST.y := !duplicated(Index_y)]
dt_example[,Index := cumsum(FIRST.x & FIRST.y)]
dt_example
x y z Index_x Index_y FIRST.x FIRST.y Index
1: a 1 M 1 1 TRUE TRUE 1
2: a 3 M 1 2 FALSE TRUE 1
3: a 2 M 1 3 FALSE TRUE 1
4: b 8 F 2 4 TRUE TRUE 2
5: c 8 F 3 4 TRUE FALSE 2
6: d 4 M 4 5 TRUE TRUE 3
7: e 4 M 5 5 TRUE FALSE 3
8: f 6 F 6 6 TRUE TRUE 4
9: f 0 F 6 7 FALSE TRUE 4
This approach looks for changes in x & z | y & z. The extra columns are left in the data.table to show the calculations.
DF[, c("Ix", "Iy", "Iz", "dx", "dy", "min.change", "Index") :=
#Create index of values based on consecutive order
list(ix <- rleid(x), iy <- rleid(y), iz <- rleid(z),
#Determine if combinations of x+z OR y+z change
ix1 <- c(0, diff(rleid(ix+iz))),
iy1 <- c(0, diff(rleid(iy+iz))),
#Either combination is constant (no change)?
change <- pmin(ix1, iy1),
#New index based on change
cumsum(change) + 1
)]
x y z Ix Iy Iz dx dy min.change Index
1: a 1 M 1 1 1 0 0 0 1
2: a 3 M 1 2 1 0 1 0 1
3: a 2 M 1 3 1 0 1 0 1
4: b 8 F 2 4 2 1 1 1 2
5: c 8 F 3 4 2 1 0 0 2
6: d 4 M 4 5 3 1 1 1 3
7: e 4 M 5 5 3 1 0 0 3
8: f 6 F 6 6 4 1 1 1 4
9: f 0 F 6 7 4 0 1 0 4
I'm using dcast.data.table to convert a long data.table to a wide data.table
library(data.table)
library(reshape2)
set.seed(1234)
dt.base <- data.table(A = rep(c(1:3),2), B = rep(c(1:2),3), C=c(1:4,1,2),thevalue=rnorm(6))
#from long to wide using dcast.data.table()
dt.cast <- dcast.data.table(dt.base, A ~ B + C, value.var = "thevalue", fun = sum)
#now some stuff happens e.g., please do not bother what happens between dcast and melt
setkey(dt.cast, A)
dt.cast[2, c(2,3,4):=1,with = FALSE]
now i want to melt the data.table back again to the original column layout and here i'm stuck, how do I separate the concatenated columnames from the casted data.table, this is my problem
dt.melt <- melt(dt.cast,id.vars = c("A"), value.name = "thevalue")
I need two columns instead of one
the result that i'm looking for can be produced with this code
#update
dt.base[A==2 & B == 1 & C == 1, thevalue :=1]
dt.base[A==2 & B == 2 & C == 2, thevalue :=1]
#insert (2,1,3 was not there in the base data.table)
dt.newrow <- data.table(A=2, B=1, C=3, thevalue = 1)
dt.base <-rbindlist(list(dt.base, dt.newrow))
dt.base
As always any help is appreciated
Would that work for you?
colnames <- c("B", "C")
dt.melt[, (colnames) := (colsplit(variable, "_", colnames))][, variable := NULL]
subset(dt.melt, thevalue != 0)
# or dt.melt[thevalue != 0, ]
# A thevalue B C
#1: 1 -1.2070657 1 1
#2: 2 1.0000000 1 1
#3: 2 1.0000000 1 3
#4: 3 1.0844412 1 3
#5: 2 1.0000000 2 2
#6: 3 0.5060559 2 2
#7: 1 -2.3456977 2 4
If your data set isn't representable and there could be zeros in valid rows, here's alternative approach
colnames <- c("B", "C")
setkey(dt.melt[, (colnames) := (colsplit(variable, "_",colnames))][, variable := NULL], A, B, C)
setkey(dt.base, A, B, C)
dt.base <- dt.melt[rbind(dt.base, data.table(A = 2, B = 1, C = 3), fill = T)]
dt.base[, thevalue.1 := NULL]
## A B C thevalue
## 1: 1 1 1 -1.2070657
## 2: 1 2 4 -2.3456977
## 3: 2 1 1 1.0000000
## 4: 2 2 2 1.0000000
## 5: 3 1 3 1.0844412
## 6: 3 2 2 0.5060559
## 7: 2 1 3 1.0000000
Edit
As. suggested by #Arun, the most efficient way would be to use #AnandaMahto cSplit function, as it is using data.table too, i.e,
cSplit(dt.melt, "variable", "_")
Second Edit
In order to save the manual merges, you can set fill = NA (for example) while dcasting and then do everything in one go with csplit, e.g.
dt.cast <- dcast.data.table(dt.base, A ~ B + C, value.var = "thevalue", fun = sum, fill = NA)
setkey(dt.cast, A)
dt.cast[2, c(2,3,4):=1,with = FALSE]
dt.melt <- melt(dt.cast,id.vars = c("A"), value.name = "thevalue")
dt.cast <- cSplit(dt.melt, "variable", "_")[!is.na(thevalue)]
setnames(dt.cast, 3:4, c("B","C"))
# A thevalue B C
# 1: 1 -1.2070657 1 1
# 2: 2 1.0000000 1 1
# 3: 2 1.0000000 1 3
# 4: 3 1.0844412 1 3
# 5: 2 1.0000000 2 2
# 6: 3 0.5060559 2 2
# 7: 1 -2.3456977 2 4