Flag duplicate obs based on two ID variables - r

Updated Example (see RULES)
I have a data.table with id1 and id2 columns (as below):
data.table(id1=c(1,1,2,3,3,3,4), id2=c(1,2,2,1,2,3,2))
id1 id2
  1   1
  1   2
  2   2
  3   1
  3   2
  3   3
  4   2
I would like to generate a flag to identify the duplicate association between id1 and id2.
RULE: if a particular id1 is already associated with an id2, it should be flagged; one unique id2 should be associated with one id1 only (see explanation below).
I am looking for a) an efficient solution and b) a solution that only uses base R and data.table functions.
id1 id2 flag
  1   1
  1   2  Y    <== since id2=1 is associated with id1=1 in 1st row
  2   2
  3   1  Y    <== since id2=1 is associated with id1=1 in 1st row
  3   2  Y    <== since id2=2 is associated with id1=2 in 3rd row
  3   3
  4   2  Y    <== since id2=2 is associated with id1=2 in 3rd row

This is a tricky one. If I understand correctly, my translation of OP's rules is as follows:
For each id1 group, exactly one row is not flagged.
If the id1 group consists only of one row it is not flagged.
Within an id1 group, all rows where id2 has been used in previous groups are flagged.
If more than one row within an id1 group has not been flagged so far, only the first of those rows is not flagged; all other rows are flagged.
So, the approach is to
create a vector of available id2 values,
step through the id1 groups,
find the first row within each group where the id2 value has not already been consumed in previous groups,
flag all other rows,
and update the vector of available (not consumed) id2 values.
avail <- unique(DT$id2)
DT[, flag := {
  idx <- max(first(which(id2 %in% avail)), 1L)
  avail <- setdiff(avail, id2)
  replace(rep("Y", .N), idx, "")
}, by = id1][]
id1 id2 flag
1: 1 1
2: 1 2 Y
3: 2 2
4: 3 1 Y
5: 3 2 Y
6: 3 3
7: 4 2
Caveat
The above code reproduces the expected result for the use case provided by the OP. However, there might be other use cases and/or edge cases where the code might need to be tweaked to comply with OP's expectations. E.g., it is unclear what the expected result is in the case of an id1 group where all id2 values have already been consumed in previous groups.
Edit:
The OP has edited the expected result so that row 7 is now flagged as well.
Here is a tweaked version of my code which reproduces the expected result after the edit:
avail <- unique(DT$id2)
DT[, flag := {
  idx <- first(which(id2 %in% avail))
  avail <- setdiff(avail, id2[idx])
  replace(rep("Y", .N), idx, "")
}, by = id1][]
id1 id2 flag
1: 1 1
2: 1 2 Y
3: 2 2
4: 3 1 Y
5: 3 2 Y
6: 3 3
7: 4 2 Y
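Regarding the caveat above: here is a minimal sketch of how the tweaked version could handle an id1 group whose id2 values have all been consumed in previous groups; the assumption that such a group should be flagged entirely is mine, since the OP left this case unspecified.
avail <- unique(DT$id2)
DT[, flag := {
  idx <- which(id2 %in% avail)[1L]  # NA when every id2 in this group is already consumed
  avail <- setdiff(avail, id2[idx])
  if (is.na(idx)) rep("Y", .N) else replace(rep("Y", .N), idx, "")
}, by = id1][]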
Data
library(data.table)
DT = data.table(id1 = c(1, 1, 2, 3, 3, 3, 4),
                id2 = c(1, 2, 2, 1, 2, 3, 2))

This is a really convoluted chain, but I think it produces the desired result (note that the result in your question does not follow your own logic):
library(data.table)
a = data.table(id1=c(1,1,2,3,3,3,4), id2=c(1,2,2,1,2,3,2))
a[, .SD[1, ], by = id2][
  , Noflag := "no"][
  a, on = .(id2, id1)][
  is.na(Noflag), flag := "y"][
  , Noflag := NULL][]
What's in there:
a[, .SD[1, ], by = id2] gets the first row of each subgroup by id2. These rows shouldn't be flagged, so
[, Noflag := "no"] flags them as "not flagged" (go figure. I said it was convoluted). We need to join this no-flagged table with the original one:
[a, on = .(id2, id1)] joins the last table with the original a on both id1 and id2. Now we need to flag the rows that aren't flagged as "shouldn't be flagged":
[is.na(Noflag), flag := "y"]. The last part removes the now-unnecessary Noflag column:
[, Noflag := NULL], and a trailing [] displays the new table on screen.
I agree with the comment by @akrun regarding igraph not only being more efficient, but also having a simpler syntax.

# replicate your data
df <- data.table(id1 = c(1,1,2,3,3,3,4), id2 = c(1,2,2,1,2,3,2))
# create and append a new, empty column that will later be filled with the info whether they match or not
df[, duplicate := NA_real_]  # append empty column to existing data set
# create a loop to iteratively check if the statement is true, then fill the new column accordingly
# note that instead of indexing the columns (e.g. df[[1]]) you could also use their names (e.g. df$id1)
for (i in 1:nrow(df)) {
  # check if the id1 value in row i matches that of any row before (rows 1:(i-1))
  if (i >= 2 && any(df[[1]][i] == df[[1]][1:(i - 1)])) {
    df[i, duplicate := 1]
  } else {
    df[i, duplicate := 0]  # when they don't match: 0, false
  }
}
# note that the comparison only starts at the second row (the i >= 2 guard), as 1:(i-1) would be invalid for i = 1 and cause an error
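For what this loop actually computes (flagging rows whose id1 appeared in an earlier row), a vectorized one-liner with duplicated() should give the same 0/1 column without the row-by-row loop:
df[, duplicate := as.integer(duplicated(id1))]  # 1 = id1 seen in an earlier row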

Related

Filter by column index number data.table R

Interestingly, I am unable to find a way to filter using the column number. I do not know the name of the column because it changes, but I always know its position.
This seems pretty trivial but it seems like I can only reference the i portion using the column name.
table = data.table(one = c(1,2,3), two = c("a","b","c"))
> table
one two
1: 1 a
2: 2 b
3: 3 c
I do not know that the second column is "two". I just want to filter by second column.
> table[two == "a"]
one two
1: 1 a
UPDATE:
As Ronak described, I could use
> table[table[[2]]=="a"]
one two
1: 1 a
However, I would next like to update this same column; for example, I would like to turn "a" into "c".
what I need:
> table
one two
1: 1 c
2: 2 b
3: 3 c
I have tried:
> table[table[[2]]=="a", table[[2]]:= "c"]
> table
one two a b c
1: 1 a c c c
2: 2 b <NA> <NA> <NA>
3: 3 c <NA> <NA> <NA>
So it seems like I am taking all the values in the second column and creating new columns for them instead of just changing the filtered rows to c.
> table[table[[2]]=="a", table[2]:= "c"]
Error in `[.data.table`(table, table[[2]] == "a", `:=`(table[2], "c")) :
LHS of := must be a symbol, or an atomic vector (column names or positions).
So I think I need to know the position of the second column.
Using [[ works:
library(data.table)
dt <- data.table(a = 1:5, b = 2:6)
dt[dt[[1]] == 1]
# a b
#1: 1 2
This gives the same output as dt[a == 1].
As we know we need the 2nd column, get the 2nd column name and use it as a variable column name; see the example:
library(data.table)
d <- data.table(one = c(1,2,3), two = c("a","b","c"))
# get the 2nd column name
myCol <- colnames(d)[ 2 ]
# subset
d[ get(myCol) == "a", ]
# subset and update
d[ get(myCol) == "a", (myCol) := "c" ]
We can use .SD
dt[dt[, .SD[[1]] == 1]]
# a b
#1: 1 2
data
dt <- data.table(a = 1:5, b = 2:6)
You can also try this:
table[[2]][table[[2]]=="a"] <- "c"
table
> table
one two
1: 1 c
2: 2 b
3: 3 c
I have figured it out:
> table[table[[2]]=="a", colnames(table)[2]:= "c"]
> table
one two
1: 1 c
2: 2 b
3: 3 c
Thanks!
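For completeness, data.table::set() also addresses columns purely by position; a minimal sketch, assuming a fresh copy of the example table:
table = data.table(one = c(1,2,3), two = c("a","b","c"))
set(table, i = which(table[[2]] == "a"), j = 2L, value = "c")  # column referenced only by number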

data.table "out of range", how to add value to new row

While working with a data.frame it is simple to insert a new value by using the row number:
df1 <- data.frame(c(1:3))
df1[4,1] <- 1
> df1
c.1.3.
1 1
2 2
3 3
4 1
It does not work with a data.table:
df1 <- data.table(c(1:3))
df1[4,1] <- 1
Error in `[<-.data.table`(`*tmp*`, 4, 1, value = 1) : i[1] is 4 which is out of range [1,nrow=3].
How can I do it?
data.tables were designed to work much faster with some common operations like subset, join, group, sort, etc., and as a result have some differences from data.frames.
Some operations like the one you pointed out will not work on data.tables. You need to use data.table-specific operations.
dt1 <- data.table(c(1:3))
dt1 <- rbindlist(list(dt1, list(1)), use.names=FALSE)
dt1
# V1
# 1: 1
# 2: 2
# 3: 3
# 4: 1
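An equivalent form, as a sketch, uses data.table's rbind method, which accepts a named list for the new row:
dt1 <- rbind(dt1, list(V1 = 1))  # appends one row with V1 = 1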

How to merge lists of vectors based on one vector belonging to another vector?

In R, I have two data frames that contain list columns
d1 <- data.table(group_id1 = 1:4)
d1$Cat_grouped <- list(letters[1:2], letters[3:2], letters[3:6], letters[11:12])
And
d_grouped <- data.table(group_id2 = 1:4)
d_grouped$Cat_grouped <- list(letters[1:5], letters[6:10], letters[1:2], letters[1])
I would like to merge these two data.tables based on the vectors in d1$Cat_grouped being contained in the vectors in d_grouped$Cat_grouped
To be more precise, there could be two matching criteria:
a) all elements of each vector of d1$Cat_grouped must be in the matched vector of d_grouped$Cat_grouped
Resulting in the following match:
result_a <- data.table(
  group_id1 = c(1, 2),
  group_id2 = c(1, 1)
)
b) at least one of the elements in each vector of d1$Cat_grouped must be in the matched vector of d_grouped$Cat_grouped
Resulting in the following match:
result_b <- data.table(
  group_id1 = c(1, 2, 3, 3),
  group_id2 = c(1, 1, 1, 2)
)
How can I implement a) or b) ? Preferably in a data.table way.
EDIT1: added the expected results of a) and b)
EDIT2: added more groups to d_grouped, so grouping variables overlap. This breaks some of the proposed solutions
So I think long form is better, though my answer feels a little roundabout. I bet someone who's a little sleeker with data.table can do this in fewer steps, but here's what I've got:
first, let's unpack the vectors in your example data:
d1_long <- d1[, list(cat=unlist(Cat_grouped)), group_id1]
d_grouped_long <- d_grouped[, list(cat=unlist(Cat_grouped)), group_id2]
Now, we can merge on the individual elements:
result_b <- merge(d1_long, d_grouped_long, by='cat')
Based on our example, it seems you don't actually need to know which elements were part of the match...
result_b[, cat := NULL]
Finally, my answer has duplicated group_id pairs because it gets a join for each pairwise match, not just the vector-level matches. So we can unique them away.
result_b <- unique(result_b)
Here's my result_b:
   group_id1 group_id2
1:         1         1
2:         2         1
3:         3         1
4:         3         2
We can use b as an intermediate step to a, since having all elements in common is a special case of having at least one element in common.
Let's merge the original tables to see what the candidates are in terms of subvectors and vectors
result_a <- merge(result_b, d1, by = 'group_id1')
result_a <- merge(result_a, d_grouped, by = 'group_id2')
So now, if the length of Cat_grouped.x matches the number of TRUEs about Cat_grouped.x being %in% Cat_grouped.y, that's a bingo.
I tried a handful of clean ways, but the weirdness of having lists in the data table defeated the most obvious attempts. This seems to work though:
Let's add a row column to operate by
result_a[, row := 1:.N]
Now let's get the length and number of matches...
result_a[, x.length := length(Cat_grouped.x[[1]]), row]
result_a[, matches := sum(Cat_grouped.x[[1]] %in% Cat_grouped.y[[1]]), row]
And filter down to just rows where length and matches are the same
result_a <- result_a[x.length==matches]
This answer focuses on part a) of the question.
It follows Harland's approach but tries to make better use of the data.table idiom for performance reasons as the OP has mentioned that his production data may contain millions of observations.
Sample data
library(data.table)
d1 <- data.table(
  group_id1 = 1:4,
  Cat_grouped = list(letters[1:2], letters[3:2], letters[3:6], letters[11:12]))
d_grouped <- data.table(
  group_id2 = 1:2,
  Cat_grouped = list(letters[1:5], letters[6:10]))
Result a)
grp_cols <- c("group_id1", "group_id2")
unique(d1[, .(unlist(Cat_grouped), lengths(Cat_grouped)), by = group_id1][
  d_grouped[, unlist(Cat_grouped), by = group_id2], on = "V1", nomatch = 0L][
  , .(V2, .N), by = grp_cols][V2 == N, ..grp_cols])
group_id1 group_id2
1: 1 1
2: 2 1
Explanation
While expanding the list elements of d1 and d_grouped into long format, the number of list elements in d1 is determined using the lengths() function. lengths() (note the difference from length()) gets the length of each element of a list and was introduced with R 3.2.0.
After the inner join (note the nomatch = 0L parameter), the number of rows in the result set is counted (using the special symbol .N) for each combination of grp_cols. Only those rows are considered where the count in the result set matches the original length of the list. Finally, the unique combinations of grp_cols are returned.
Result b)
Result b) can be derived from above solution by omitting the counting stuff:
unique(d1[, unlist(Cat_grouped), by = group_id1][
  d_grouped[, unlist(Cat_grouped), by = group_id2], on = "V1", nomatch = 0L][
  , c("group_id1", "group_id2")])
group_id1 group_id2
1: 1 1
2: 2 1
3: 3 1
4: 3 2
Another way:
Cross-join to get all pairs of group ids:
Y = CJ(group_id1=d1$group_id1, group_id2=d_grouped$group_id2)
Then merge in the vectors:
Y = Y[d1, on='group_id1'][d_grouped, on='group_id2']
# group_id1 group_id2 Cat_grouped i.Cat_grouped
# 1: 1 1 a,b a,b,c,d,e
# 2: 2 1 c,b a,b,c,d,e
# 3: 3 1 c,d,e,f a,b,c,d,e
# 4: 4 1 k,l a,b,c,d,e
# 5: 1 2 a,b f,g,h,i,j
# 6: 2 2 c,b f,g,h,i,j
# 7: 3 2 c,d,e,f f,g,h,i,j
# 8: 4 2 k,l f,g,h,i,j
Now you can use mapply to filter however you like:
Y[mapply(function(u,v) all(u %in% v), Cat_grouped, i.Cat_grouped), 1:2]
# group_id1 group_id2
# 1: 1 1
# 2: 2 1
Y[mapply(function(u,v) length(intersect(u,v)) > 0, Cat_grouped, i.Cat_grouped), 1:2]
# group_id1 group_id2
# 1: 1 1
# 2: 2 1
# 3: 3 1
# 4: 3 2

Value as column names in data.table

I have the following data.table:
dat<-data.table(Y=as.factor(c("a","b","a")),"a"=c(1,2,3),"b"=c(3,2,1))
It looks like:
Y a b
1: a 1 3
2: b 2 2
3: a 3 1
What I want is to subtract the value of the column indicated by the value of Y by 1. E.g. the Y value of the first row is "a", so the value of the column "a" in the first row should be reduced by one.
The result should be:
Y a b
1: a 0 3
2: b 2 1
3: a 2 1
Is this possible? If yes, how? Thank you!
Using self-joins and get:
for (yval in dat[, unique(Y)]) {
  dat[yval, (yval) := get(yval) - 1L, on = "Y"]
}
dat[]
# Y a b
# 1: a 0 3
# 2: b 2 1
# 3: a 2 1
We can use melt/dcast to do this: melt the dataset to 'long' format after creating a row sequence ('N'), subtract 1 from the 'value' column where the 'Y' and 'variable' elements are equal, assign (:=) the output to 'value', then dcast the 'long' format back to 'wide'.
dcast(melt(dat[, N := 1:.N], id.var = c("Y", "N"))[Y == variable,
  value := value - 1], N + Y ~ variable, value.var = "value")[, N := NULL][]
# Y a b
#1: a 0 3
#2: b 2 1
#3: a 2 1
First, an apply function to make the actual transformation. We need to apply by row and then use the first element to name the second element to access and overwrite. The values I was accessing in a and b were strings because apply() coerces the data.table to a matrix, and with the non-numeric column Y present that is a character matrix; hence the as.numeric() to turn them back into numbers.
tformDat <- apply(dat, 1, function(x) {x[x[1]] <- as.numeric(x[x[1]]) - 1;x})
Then you need to reformat back to the original data.table format
data.table(t(tformDat))
The whole thing can be done in one line.
data.table(t(apply(dat, 1, function(x) {x[x[1]] <- as.numeric(x[x[1]]) - 1;x})))
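Since the string issue comes from apply() coercing the table to a character matrix, here is a sketch of an alternative that stays inside data.table and updates by reference with set(), avoiding the coercion and the transpose round-trip:
for (i in seq_len(nrow(dat))) {
  j <- as.character(dat$Y[i])                      # column to modify for this row
  set(dat, i = i, j = j, value = dat[[j]][i] - 1)  # subtract 1 by reference
}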

Number of Unique Obs by Variable in a Data Table

I have read in a large data file into R using the following command
data <- as.data.set(spss.system.file(paste(path, file, sep = '/')))
The data set contains columns which should not belong, and contain only blanks. This issue has to do with R creating new variables based on the variable labels attached to the SPSS file (Source).
Unfortunately, I have not been able to determine the options necessary to resolve the problem. I have tried all of foreign::read.spss, memisc::spss.system.file, and Hmisc::spss.get, with no luck.
Instead, I would like to read in the entire data set (with ghost columns) and remove unnecessary variables manually. Since the ghost columns contain only blank spaces, I would like to remove any variables from my data.table where the number of unique observations is equal to one.
My data are large, so they are stored in data.table format. I would like to determine an easy way to check the number of unique observations in each column, and drop columns which contain only one unique observation.
require(data.table)
### Create a data.table
dt <- data.table(a = 1:10,
                 b = letters[1:10],
                 c = rep(1, times = 10))
### Create a comparable data.frame
df <- data.frame(dt)
### Expected result
unique(dt$a)
### Expected result
length(unique(dt$a))
However, I wish to calculate the number of obs for a large data file, so referencing each column by name is not desired. I am not a fan of eval(parse()).
### I want to determine the number of unique obs in
# each variable, for a large list of vars
lapply(names(df), function(x) {
  length(unique(df[, x]))
})
### Unexpected result
length(unique(dt[, 'a', with = F])) # Returns 1
It seems to me the problem is that
dt[, 'a', with = F]
returns an object of class "data.table". It makes sense that the length of this object is 1, since it is a data.table containing 1 variable. We know that data.frames are really just lists of variables, and so in this case the length of the list is just 1.
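A quick check on the example data makes the distinction concrete:
length(dt[, 'a', with = FALSE])  # 1: a one-column data.table (a list of 1 column)
length(unique(dt[['a']]))        # 10: the number of unique values in the vector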
Here's pseudo code for how I would remedy the solution, using the data.frame way:
for (x in names(data)) {
  unique.obs <- length(unique(data[, x]))
  if (unique.obs == 1) {
    data[, x] <- NULL
  }
}
Any insight as to how I may more efficiently ask for the number of unique observations by column in a data.table would be much appreciated. Alternatively, a recommendation on how to drop variables that contain only one unique observation from a data.table would be even better.
Update: uniqueN
As of version 1.9.6, there is a built in (optimized) version of this solution, the uniqueN function. Now this is as simple as:
dt[ , lapply(.SD, uniqueN)]
If you want to find the number of unique values in each column, something like
dt[, lapply(.SD, function(x) length(unique(x)))]
## a b c
## 1: 10 10 1
To get your function to work you need to extract the column as a vector rather than as a one-column data.table, either with with = FALSE plus [[, or simply with [[ directly (read fortune(312) as well...)
lapply(names(df), function(x) length(unique(dt[, x, with = FALSE][[1]])))
or
lapply(names(df), function(x) length(unique(dt[[x]])))
will work
In one step
dt[, names(dt) := lapply(.SD, function(x) if (length(unique(x)) == 1) NULL else x)]
# or to avoid calling `.SD`
dt[, Filter(names(dt), f = function(x) length(unique(dt[[x]])) == 1) := NULL]
The approaches in the other answers are good. Another way to add to the mix, just for fun:
for (i in names(DT)) if (length(unique(DT[[i]]))==1) DT[,(i):=NULL]
or if there may be duplicate column names :
for (i in ncol(DT):1) if (length(unique(DT[[i]]))==1) DT[,(i):=NULL]
NB: (i) on the LHS of := is a trick to use the value of i rather than a column named "i".
Here is a solution to your core problem (I hope I got it right).
require(data.table)
### Create a data.table
dt <- data.table(a = 1:10,
                 b = letters[1:10],
                 d1 = "",
                 c = rep(1, times = 10),
                 d2 = "")
dt
a b d1 c d2
1: 1 a 1
2: 2 b 1
3: 3 c 1
4: 4 d 1
5: 5 e 1
6: 6 f 1
7: 7 g 1
8: 8 h 1
9: 9 i 1
10: 10 j 1
First, I introduce two columns d1 and d2 that have no values whatsoever. Those are the ones you want to delete, right? If so, I just identify those columns and select all other columns in the dt.
only_space <- function(x) {
  length(unique(x)) == 1 && x[1] == ""
}
bolCols <- apply(dt, 2, only_space)
dt[, (1:ncol(dt))[!bolCols], with = FALSE]
Somehow, I have the feeling that you could further simplify it...
Output:
a b c
1: 1 a 1
2: 2 b 1
3: 3 c 1
4: 4 d 1
5: 5 e 1
6: 6 f 1
7: 7 g 1
8: 8 h 1
9: 9 i 1
10: 10 j 1
There is an easy way to do that using the dplyr library and its select function, as follows:
library(dplyr)
newdata <- select(old_data, first_variable, second_variable)
Note that you can choose as many variables as you like.
Then you will get the type of data that you want.
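To tie this back to the original problem (dropping columns that hold a single unique value), a hypothetical dplyr variant, assuming dplyr >= 1.0.0 for where():
library(dplyr)
# keep only columns with more than one unique observation
newdata <- select(old_data, where(~ length(unique(.x)) > 1))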
