R data.table - update by summing over subsets coded by columns

I have the following problem. I have a list of sets encoded in a data.table sets, where id.s encodes the id of the set and id.e encodes its elements.
For each set s there is its value v(s). The values of the function v() are in another data.table v, where each row contains a set id id.s and its value.
sets <- data.table(
id.s = c(1,2,2,3,3,3,4,4,4,4),
id.e = c(3,3,4,2,3,4,1,2,3,4))
v <- data.table(id.s = 1:4, value = c(1/10,2/10,3/10,4/10))
I need to calculate a new function v'() such that

v'(s) = Σ_{t ⊇ s} (-1)^|t \ s| · v(t)

where |s| denotes the cardinality of the set s (the number of elements), t ⊇ s means that t is a superset of s, and b \ a denotes set subtraction (modifying a set b by removing the elements it shares with a). For example, set 1 = {3} has the supersets {3}, {3,4}, {2,3,4} and {1,2,3,4}, so v'({3}) = 0.1 - 0.2 + 0.3 - 0.4 = -0.2.
Right now, I do it using a for-loop where I update row by row. However, it takes too much time for large data.tables with thousands of sets and thousands of elements.
Do you have any idea how to make it faster?
My current code:
# convert data.table to wide format
dc <- dcast(sets, id.s ~ id.e, drop = FALSE, value.var = "id.e", fill = 0)
# take columns corresponding to elements id.e
cols <- names(dc)[-1]
# convert columns cols to 0-1 coding
dc[, (cols) := lapply(.SD, function(x) ifelse(x > 0,1,0)), .SDcols = cols]
# join dc with v
dc <- dc[v, on = "id.s"]
# calculate the cardinality of each set
dc[, cardinality := sum(.SD > 0), .SDcols = cols, by = id.s]
# prepare column for new value
dc[, value2 := 0]
# id.s 1 2 3 4 value cardinality value2
#1: 1 0 0 1 0 0.1 1 0
#2: 2 0 0 1 1 0.2 2 0
#3: 3 0 1 1 1 0.3 3 0
#4: 4 1 1 1 1 0.4 4 0
# for each set (row of dc)
for (i in 1:nrow(dc)) {
  row <- dc[i, ]
  set <- as.numeric(row[, cols, with = FALSE])
  row.cardinality <- as.numeric(row$cardinality)
  # find its supersets
  dc[, is.superset := ifelse(rowSums(mapply("*", dc[, cols, with = FALSE], set)) == row.cardinality, 1, 0)][]
  # use the formula to update the value
  res <- dc[is.superset == 1, ][, sum := sum((-1)^(cardinality - row.cardinality) * value)]$sum[1]
  dc[i, value2 := res]
}
dc[, .(id.s, value2)]
# id.s value2
#1: 1 -0.2
#2: 2 0.3
#3: 3 -0.1
#4: 4 0.4

This might work for you:
First, make a little function to get the supersets of a given set (itself included):
get_superset <- function(el, setvalue) {
  c(setvalue, sets[id.s != setvalue, setequal(intersect(el, id.e), el), by = id.s][V1 == TRUE, id.s])
}
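For example, with the sets table from the question (the function reads sets from the enclosing environment), set 2 = {3,4} is contained in sets 3 and 4, so:
get_superset(c(3, 4), 2)
# [1] 2 3 4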
1. Get the cardinality of each set in the sets object, but also save it separately for later use (see step 4):
sets[, cardinality:=.N, by=.(id.s)]
cardinality = unique(sets[, .(id.s, cardinality)])
2. Add the supersets, by set, using the above function:
sets <- unique(sets[,!c("id.e")][sets[, .("supersets"=get_superset(id.e, .GRP)),by=id.s], on=.(id.s)])
(Note: As an alternative, step 2 could be broken into three sub-steps, like this)
# 2a. Get the supersets
supersets = sets[, .("supersets"=get_superset(id.e, .GRP)),by=id.s]
# 2b. Merge the supersets on the original sets
sets = sets[supersets, on=.(id.s)]
# 2c. Retain only necessary columns, and make unique
sets = unique(sets[, .(id.s, cardinality,supersets)])
3. Add the value:
sets <- sets[v,on=.(supersets=id.s)][order(id.s)]
4. Grab the cardinality of each superset:
sets <- sets[cardinality, on=.(supersets=id.s)]
5. Get the result (i.e. estimate your v' function):
result = sets[, .(value2 = sum((-1)^(i.cardinality-cardinality)*value)), by=.(id.s)]
Output:
id.s value2
1: 1 -0.2
2: 2 0.3
3: 3 -0.1
4: 4 0.4
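As a side note, if the 0-1 incidence matrix from the question's dcast step fits in memory, the per-row superset search can also be replaced by two matrix products. A minimal sketch, assuming dc, cols and value are defined exactly as in the question (one row per set, 0-1 element columns); note the n x n superset matrix has to fit in memory:
inc  <- as.matrix(dc[, cols, with = FALSE])  # sets x elements incidence matrix
card <- rowSums(inc)                         # cardinality of each set
# shared[i, j] counts elements common to sets i and j;
# set j is a superset of set i exactly when shared[i, j] == card[i]
shared   <- inc %*% t(inc)
superset <- shared == card                   # card is recycled down the columns
# (-1)^(card[j] - card[i]) has the same parity as (-1)^card[i] * (-1)^card[j]
dc[, value2 := (-1)^card * as.vector(superset %*% ((-1)^card * value))]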


Flag duplicate obs based on two ID variables

Updated Example (see RULES)
I have a data.table with id1 and id2 columns (as below):
data.table(id1=c(1,1,2,3,3,3,4), id2=c(1,2,2,1,2,3,2))
id1 id2
1   1
1   2
2   2
3   1
3   2
3   3
4   2
I would like to generate a flag to identify the duplicate association between id1 and id2.
RULE: once an id2 has been associated with an id1, any later row pairing that id2 with a different id1 should be flagged; one unique id2 should be associated with one id1 only (see explanation below).
I am looking for a) an efficient solution and b) a solution that only uses base R and data.table functions.
id1 id2 flag
1   1
1   2   Y   <== since id2=1 is associated with id1=1 in 1st row
2   2
3   1   Y   <== since id2=1 is associated with id1=1 in 1st row
3   2   Y   <== since id2=2 is associated with id1=2 in 3rd row
3   3
4   2   Y   <== since id2=2 is associated with id1=2 in 3rd row
This is a tricky one. If I understand correctly, my translation of OP's rules is as follows:
For each id1 group, exactly one row is not flagged.
If the id1 group consists only of one row it is not flagged.
Within an id1 group, all rows where id2 has been used in previous groups are flagged.
If more than one row within an id1 group has not been flagged up to now, only the first of those rows is left unflagged; all other rows are flagged.
So, the approach is to
create a vector of available id2 values,
step through the id1 groups,
find the first row within each group where the id2 value has not already been consumed in previous groups,
flag all other rows,
and update the vector of available (not consumed) id2 values.
avail <- unique(DT$id2)
DT[, flag := {
  idx <- max(first(which(id2 %in% avail)), 1L)
  avail <- setdiff(avail, id2)
  replace(rep("Y", .N), idx, "")
}, by = id1][]
id1 id2 flag
1: 1 1
2: 1 2 Y
3: 2 2
4: 3 1 Y
5: 3 2 Y
6: 3 3
7: 4 2
Caveat
The above code reproduces the expected result for the use case provided by the OP. However, there might be other use cases and/or edge cases where the code needs to be tweaked to comply with OP's expectations. E.g., it is unclear what the expected result is for an id1 group in which all id2 values have already been consumed in previous groups.
Edit:
The OP has edited the expected result so that row 7 is now flagged as well.
Here is a tweaked version of my code which reproduces the expected result after the edit:
avail <- unique(DT$id2)
DT[, flag := {
  idx <- first(which(id2 %in% avail))
  avail <- setdiff(avail, id2[idx])
  replace(rep("Y", .N), idx, "")
}, by = id1][]
id1 id2 flag
1: 1 1
2: 1 2 Y
3: 2 2
4: 3 1 Y
5: 3 2 Y
6: 3 3
7: 4 2 Y
Data
library(data.table)
DT = data.table(id1 = c(1, 1, 2, 3, 3, 3, 4),
id2 = c(1, 2, 2, 1, 2, 3, 2))
This is a really convoluted chain, but I think it produces the result (the result in your question does not follow your own logic):
library(data.table)
a = data.table(id1=c(1,1,2,3,3,3,4), id2=c(1,2,2,1,2,3,2))
a[, .SD[1, ], by = id2][
  , Noflag := "no"][
  a, on = .(id2, id1)][
  is.na(Noflag), flag := "y"][
  , Noflag := NULL][]
What's in there:
a[, .SD[1, ], by = id2] gets the first row of each subgroup by id2. These groups shouldn't be flagged, so
[, Noflag := "no"] flags them as "not flagged" (go figure; I said it was convoluted). We need to join this no-flagged table with the original one:
[a, on = .(id2, id1)] joins the last table with the original a on both id1 and id2. Now we need to flag the rows that aren't flagged as "shouldn't be flagged":
[is.na(Noflag), flag := "y"]. The last part is to remove the unnecessary Noflag column:
[, Noflag := NULL] and add a [] to print the new table to the screen.
I agree with the comment by @akrun regarding igraph being not only more efficient but also having a simpler syntax.
# replicate your data
df <- data.table(id1 = c(1, 1, 2, 3, 3, 3, 4), id2 = c(1, 2, 2, 1, 2, 3, 2))
# create and append a new, empty column that will later be filled with the
# info on whether the row is a duplicate or not
empty_col <- rep(NA_real_, nrow(df)) # create empty vector
df[, `:=`(duplicate = empty_col)]    # append to existing data set
# loop over the rows and iteratively check whether the id1 value of the current
# row already appeared in any row before it, then fill the new column accordingly;
# note that instead of df$id1 you could also index the column (e.g. df[[1]])
for (i in 1:nrow(df)) {
  if (i >= 2 && any(df$id1[1:(i - 1)] == df$id1[i])) { # value seen in an earlier row
    set(df, i = i, j = "duplicate", value = 1)
  } else {
    set(df, i = i, j = "duplicate", value = 0) # when they don't match: 0, false
  }
}
# note the i >= 2 guard: the comparison only makes sense from the second row on,
# as 1:(i - 1) would misbehave for i = 1
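For what it's worth, the same "seen in an earlier row" check can be done without a loop; a minimal sketch using base duplicated():
# duplicated() is TRUE for every value that already occurred in an earlier row
df[, duplicate := as.integer(duplicated(id1))]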

Applying a custom function to a data.table doesn't work, even though the function seems okay individually

tl;dr: My function seems to work, but then I lapply it and it doesn't. Is it the function or the lapplying?
The data
I have a data.table that contains text which has already been tokenised into character vectors:
id text
1: 1 c("sadness", "joy")
2: 2 c("anger", "scream")
3: 3 c("relief", "sadness")
I want to annotate my tokenised texts with emotional values with a dictionary that has words and associated emotional values:
words emotion1 emotion2
1: sadness 1 5
2: anger 2 6
3: relief 3 7
The ultimate goal
I am expecting my_emotion_function to output something similar to this:
my_emotion_function(c("relief", "sadness"), lexicon_emotions)
emotion1 emotion2
1: 2 6
my_emotion_function(c("relief", "meh"), lexicon_emotions)
emotion1 emotion2
1: 3 7
my_emotion_function(c("meh", "ugh"), lexicon_emotions)
emotion1 emotion2
1: NA NA
Applying this to the tokens, I would add new columns and fill them with the results.
id text emotion1 emotion2
1: 1 c("sadness", "joy") 1 5
2: 2 c("anger", "scream") 2 6
3: 3 c("relief", "sadness") 2 6
The function that half-works
The function takes a character vector, subsets the (keyed) emotional dictionary for matching words and calculates the average score for each emotional dimension.
my_emotion_function <- function(characters, lexicon) {
  return(lexicon[.(characters), lapply(.SD, mean, na.rm = TRUE), .SDcols = 2:3])
}
What I don't understand
What I am baffled by and can't understand is why this function seems to work well when tested on a single character vector (the examples above work fine), but doesn't work when I want to lapply it to a data.table.
I am not sure whether the function is wrong in some aspect or my lapplying of it to the data.table. I can't figure out why the single instance works, but not repeatedly on a data.table.
If I execute the above code with an equal number of tokens in each "text" row, then I just get NaN for every cell, no matter the words.
id text emotion1 emotion2
1: 1 c("sadness", "joy") NaN NaN
2: 2 c("anger", "scream") NaN NaN
3: 3 c("relief", "sadness") NaN NaN
If you test it with an unequal number of tokens (say, a single token in the first row), then every row contains the value for the first row.
id text emotion1 emotion2
1: 1 sadness 1 5
2: 2 c("anger", "scream") 1 5
3: 3 c("relief", "sadness") 1 5
I can't find a reason as to why I either get only the same result or NA's everywhere.
Complete code for reproduction
library(data.table)
table_of_tokens <- data.table("id" = 1:3,
"text" = list(c("sadness", "joy"),
c("anger", "scream"),
c("relief", "sadness")))
table_of_tokens[, "text" := as.character(text)]
#convert to character vector to use key-subsetting in data.table
lexicon_emotions <-
data.table(
"words" = c("sadness", "anger", "relief"),
"emotion1" = 1:3,
"emotion2" = 5:7
)
setkey(lexicon_emotions, words)
my_emotion_function <- function(characters, lexicon) {
  return(lexicon[.(characters),
                 lapply(.SD, mean, na.rm = TRUE), .SDcols = 2:3])
}
table_of_tokens[, c("emotion1", "emotion2") :=
my_emotion_function(text, lexicon_emotions)]
Credit: this is basically a re-write of the syuzhet R package, which relies on data.frames and is therefore not flexible or efficient enough for a large dataset in my situation.
Edit: this should get you what you want.
library(data.table)
table_of_tokens <- data.table(
  "id" = 1:4,
  "text" = list(
    c("sadness"),
    c("anger", "scream"),
    c("relief", "grief"),
    c("relief", "grief", "sadness")
  )
)
lexicon_emotions <- data.table("words" = c("sadness", "anger", "relief"),
                               "emotion1" = 1:3,
                               "emotion2" = 5:7,
                               key = "words")
emotions <- names(lexicon_emotions)[-1]
table_of_tokens[, (emotions) := {
  res <- lapply(text, function(x) {
    lexicon_emotions[words %chin% x,
                     lapply(.SD, mean, na.rm = TRUE),
                     .SDcols = emotions]
  })
  rbindlist(res)
}]
print(table_of_tokens)
> print(table_of_tokens)
   id                 text emotion1 emotion2
1:  1              sadness        1        5
2:  2         anger,scream        2        6
3:  3         relief,grief        3        7
4:  4 relief,grief,sadness        2        6
One of the most important aspects of code writing is debugging. Let's use a simple print() call to figure out what is happening during the function call:
my_emotion_function <- function(characters, lexicon) {
  print(characters) ## for debugging
  return(lexicon[.(characters),
                 lapply(.SD, mean, na.rm = TRUE), .SDcols = 2:3])
}
table_of_tokens[, c("emotion1", "emotion2") :=
my_emotion_function(text, lexicon_emotions)]
## [1] "c(\"sadness\", \"joy\")" "c(\"anger\", \"scream\")" "c(\"relief\", \"sadness\")"
What that means is that we are actually performing:
lexicon["c(\"sadness\", \"joy\")", ...]
## what we actually want for each token:
lexicon[c("sadness", "joy"), lapply(.SD, mean, na.rm = TRUE), .SDcols = 2:3]
To get there, we do not want to convert from a list to a character, as was suggested by @IanCampbell. The other item is that we want to loop through each element, which means lapply() can be our friend:
table_of_tokens[, c("emotion1", "emotion2") :=
rbindlist(lapply(text, my_emotion_function, lexicon_emotions))]
table_of_tokens
## id text emotion1 emotion2
## 1: 1 sadness,joy 1 5
## 2: 2 anger,scream 2 6
## 3: 3 relief,sadness 2 6
I am still uncertain what would happen if there were no matches.
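Regarding the no-match case with the %chin% approach above: an aggregate j on an empty subset still returns a single row, and the mean of an empty vector is NaN, so (a quick sketch, worth double-checking on your own data):
lexicon_emotions[words %chin% c("meh", "ugh"),
                 lapply(.SD, mean, na.rm = TRUE),
                 .SDcols = c("emotion1", "emotion2")]
#    emotion1 emotion2
# 1:      NaN      NaN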

Replace nth consecutive occurrence of a value

I want to replace the nth consecutive occurrence of a particular code in my data frame. This should be a relatively easy task but I can't think of a solution.
Given a data frame
df <- data.frame(Values = c(1,4,5,6,3,3,2),
Code = c(1,1,2,2,2,1,1))
I want a result
df_result <- data.frame(Values = c(1,4,5,6,3,3,2),
Code = c(1,0,2,2,2,1,0))
The data frame is time-ordered so I need to keep the same order after replacing the values. I guess that the nth() or duplicated() functions could be useful here but I'm not sure how to use them. What I'm missing is a function that would count the number of consecutive occurrences of a given value. Once I have it, I could then use it to replace the nth occurrence.
This question had some ideas that I explored but still didn't solve my problem.
EDIT:
After an answer by #Gregor I wrote the following function which solves the problem
library(data.table)
library(dplyr)
replace_nth <- function(x, nth, code) {
  y <- data.table(x)
  y <- y[, code_rleid := rleid(y$Code)]
  y <- y[, seq := seq_along(Code), by = code_rleid]
  y <- y[seq == nth & Code == code, Code := 0]
  drop.cols <- c("code_rleid", "seq")
  y %>% select(-one_of(drop.cols)) %>% data.frame() %>% return()
}
To get the solution, simply run replace_nth(df, 2, 1)
Using data.table:
library(data.table)
setDT(df)
df[, code_rleid := rleid(df$Code)]
df[, seq := seq_along(Code), by = code_rleid]
df[seq == 2 & Code == 1, Code := 0]
df
# Values Code code_rleid seq
# 1: 1 1 1 1
# 2: 4 0 1 2
# 3: 5 2 2 1
# 4: 6 2 2 2
# 5: 3 2 2 3
# 6: 3 1 3 1
# 7: 2 0 3 2
You could combine some of these (and drop the extra columns after). I'll leave it clear and let you make modifications as you like.
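For instance, a compact sketch of the same three steps, using rleid() directly in by and dropping the helper column afterwards:
df[, seq := seq_along(Code), by = rleid(Code)][seq == 2 & Code == 1, Code := 0][, seq := NULL]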

How to merge lists of vectors based on one vector belonging to another vector?

In R, I have two data frames that contain list columns
d1 <- data.table(
group_id1=1:4
)
d1$Cat_grouped <- list(letters[1:2],letters[3:2],letters[3:6],letters[11:12] )
And
d_grouped <- data.table(
group_id2=1:4
)
d_grouped$Cat_grouped <- list(letters[1:5],letters[6:10],letters[1:2],letters[1] )
I would like to merge these two data.tables based on the vectors in d1$Cat_grouped being contained in the vectors in d_grouped$Cat_grouped
To be more precise, there could be two matching criteria:
a) all elements of each vector of d1$Cat_grouped must be in the matched vector of d_grouped$Cat_grouped
Resulting in the following match:
result_a <- data.table(
  group_id1 = c(1, 2),
  group_id2 = c(1, 1)
)
b) at least one of the elements in each vector of d1$Cat_grouped must be in the matched vector of d_grouped$Cat_grouped
Resulting in the following match:
result_b <- data.table(
group_id1=c(1,2,3,3),
group_id2=c(1,1,1,2)
)
How can I implement a) or b) ? Preferably in a data.table way.
EDIT1: added the expected results of a) and b)
EDIT2: added more groups to d_grouped, so grouping variables overlap. This breaks some of the proposed solutions
So I think long form is better, though my answer feels a little roundabout. I bet someone who's a little sleeker with data.table can do this in fewer steps, but here's what I've got:
First, let's unpack the vectors in your example data:
d1_long <- d1[, list(cat=unlist(Cat_grouped)), group_id1]
d_grouped_long <- d_grouped[, list(cat=unlist(Cat_grouped)), group_id2]
Now, we can merge on the individual elements:
result_b <- merge(d1_long, d_grouped_long, by='cat')
Based on our example, it seems you don't actually need to know which elements were part of the match...
result_b[, cat := NULL]
Finally, my answer has duplicated group_id pairs because it gets a join for each pairwise match, not just the vector-level matches. So we can unique them away.
result_b <- unique(result_b)
Here's my result_b:
group_id1 group_id2
1: 1 1
2: 2 1
3: 3 1
4: 3 2
We can use b as an intermediate step to a, since having any elements in common is a subset of having all elements in common.
Let's merge the original tables to see what the candidates are in terms of subvectors and vectors
result_a <- merge(result_b, d1, by = 'group_id1')
result_a <- merge(result_a, d_grouped, by = 'group_id2')
So now, if the length of Cat_grouped.x matches the number of TRUEs about Cat_grouped.x being %in% Cat_grouped.y, that's a bingo.
I tried a handful of clean ways, but the weirdness of having lists in the data table defeated the most obvious attempts. This seems to work though:
Let's add a row column to operate by
result_a[, row := 1:.N]
Now let's get the length and number of matches...
result_a[, x.length := length(Cat_grouped.x[[1]]), row]
result_a[, matches := sum(Cat_grouped.x[[1]] %in% Cat_grouped.y[[1]]), row]
And filter down to just rows where length and matches are the same
result_a <- result_a[x.length==matches]
This answer focuses on part a) of the question.
It follows Harland's approach but tries to make better use of the data.table idiom for performance reasons as the OP has mentioned that his production data may contain millions of observations.
Sample data
library(data.table)
d1 <- data.table(
group_id1 = 1:4,
Cat_grouped = list(letters[1:2], letters[3:2], letters[3:6], letters[11:12]))
d_grouped <- data.table(
group_id2 = 1:2,
Cat_grouped = list(letters[1:5], letters[6:10]))
Result a)
grp_cols <- c("group_id1", "group_id2")
unique(d1[, .(unlist(Cat_grouped), lengths(Cat_grouped)), by = group_id1][
d_grouped[, unlist(Cat_grouped), by = group_id2], on = "V1", nomatch = 0L][
, .(V2, .N), by = grp_cols][V2 == N, ..grp_cols])
group_id1 group_id2
1: 1 1
2: 2 1
Explanation
While expanding the list elements of d1 and d_grouped into long format, the number of list elements in d1 is determined using the lengths() function. lengths() (note the difference to length()) gets the length of each element of a list and was introduced with R 3.2.0.
After the inner join (note the nomatch = 0L parameter), the number of rows in the result set is counted (using the special symbol .N) for each combination of grp_cols. Only those rows are considered where the count in the result set matches the original length of the list. Finally, the unique combinations of grp_cols are returned.
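For illustration, the long-format expansion of d1 (the first part of the expression) looks like this:
d1[, .(unlist(Cat_grouped), lengths(Cat_grouped)), by = group_id1]
#     group_id1 V1 V2
#  1:         1  a  2
#  2:         1  b  2
#  3:         2  c  2
#  4:         2  b  2
#  5:         3  c  4
#  6:         3  d  4
#  7:         3  e  4
#  8:         3  f  4
#  9:         4  k  2
# 10:         4  l  2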
Result b)
Result b) can be derived from above solution by omitting the counting stuff:
unique(d1[, unlist(Cat_grouped), by = group_id1][
d_grouped[, unlist(Cat_grouped), by = group_id2], on = "V1", nomatch = 0L][
, c("group_id1", "group_id2")])
group_id1 group_id2
1: 1 1
2: 2 1
3: 3 1
4: 3 2
Another way:
Cross-join to get all pairs of group ids:
Y = CJ(group_id1=d1$group_id1, group_id2=d_grouped$group_id2)
Then merge in the vectors:
Y = Y[d1, on='group_id1'][d_grouped, on='group_id2']
# group_id1 group_id2 Cat_grouped i.Cat_grouped
# 1: 1 1 a,b a,b,c,d,e
# 2: 2 1 c,b a,b,c,d,e
# 3: 3 1 c,d,e,f a,b,c,d,e
# 4: 4 1 k,l a,b,c,d,e
# 5: 1 2 a,b f,g,h,i,j
# 6: 2 2 c,b f,g,h,i,j
# 7: 3 2 c,d,e,f f,g,h,i,j
# 8: 4 2 k,l f,g,h,i,j
Now you can use mapply to filter however you like:
Y[mapply(function(u,v) all(u %in% v), Cat_grouped, i.Cat_grouped), 1:2]
# group_id1 group_id2
# 1: 1 1
# 2: 2 1
Y[mapply(function(u,v) length(intersect(u,v)) > 0, Cat_grouped, i.Cat_grouped), 1:2]
# group_id1 group_id2
# 1: 1 1
# 2: 2 1
# 3: 3 1
# 4: 3 2

Value as column names in data.table

I have the following data.table:
dat<-data.table(Y=as.factor(c("a","b","a")),"a"=c(1,2,3),"b"=c(3,2,1))
It looks like:
Y a b
1: a 1 3
2: b 2 2
3: a 3 1
What I want is to subtract the value of the column indicated by the value of Y by 1. E.g. the Y value of the first row is "a", so the value of the column "a" in the first row should be reduced by one.
The result should be:
Y a b
1: a 0 3
2: b 2 1
3: a 2 1
Is this possible? If yes, how? Thank you!
Using self-joins and get:
for (yval in dat[, unique(Y)]) {
  dat[yval, (yval) := get(yval) - 1L, on = "Y"]
}
dat[]
# Y a b
# 1: a 0 3
# 2: b 2 1
# 3: a 2 1
We can use melt/dcast to do this. melt the dataset to 'long' format after creating a row sequence ('N'), subtract 1 from the 'value' column where the 'Y' and 'variable' elements are equal, assign (:=) the output to 'value', then dcast the 'long' format back to 'wide'.
dcast(melt(dat[, N := 1:.N], id.var = c("Y", "N"))[Y==variable,
value := value -1], N + Y ~variable, value.var = "value")[, N := NULL][]
# Y a b
#1: a 0 3
#2: b 2 1
#3: a 2 1
First, an apply() function to make the actual transformation. We need to apply by row and then use the first element (the Y value) to name the element to access and overwrite. The values I was accessing in a and b were strings because apply() coerces the data.table to a matrix first, and since Y is a factor the whole matrix becomes character; that is why I used as.numeric() to transform them back to numbers.
tformDat <- apply(dat, 1, function(x) {x[x[1]] <- as.numeric(x[x[1]]) - 1;x})
Then you need to reformat back to the original data.table format
data.table(t(tformDat))
The whole thing can be done in one line.
data.table(t(apply(dat, 1, function(x) {x[x[1]] <- as.numeric(x[x[1]]) - 1;x})))
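Alternatively, base R matrix indexing can do the same row/column lookup in one vectorised step without coercing everything to character; a minimal sketch, assuming the numeric columns are named exactly after Y's levels:
cols <- c("a", "b")
m <- as.matrix(dat[, ..cols])
# one (row, column) pair per row, selecting the column named by Y
idx <- cbind(seq_len(nrow(dat)), match(as.character(dat$Y), cols))
m[idx] <- m[idx] - 1
dat[, (cols) := as.data.table(m)]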
