Generate All ID Pairs, by group, with data.table in R

I have a data.table with many individuals (with ids) in many groups. Within each group, I would like to find every combination of ids (every pair of individuals). I know how to do this with a split-apply-combine approach, but I am hoping that a data.table would be faster.
Sample data:
dat <- data.table(ids=1:20, groups=sample(x=c("A","B","C"), 20, replace=TRUE))
Split-Apply-Combine Method:
datS <- split(dat, f=dat$groups)
datSc <- lapply(datS, function(x){ as.data.table(t(combn(x$ids, 2)))})
rbindlist(datSc)
head(rbindlist(datSc))
V1 V2
1: 2 5
2: 2 10
3: 2 19
4: 5 10
5: 5 19
6: 10 19
My best data.table attempt produces a single column, not two columns with all the possible combinations:
dat[, combn(x=ids, m=2), by=groups]
Thanks in advance.

The result of t(combn()) is a matrix, so you need to convert it to a data.table or data.frame. This should work:
library(data.table)
set.seed(10)
dat <- data.table(ids=1:20, groups=sample(x=c("A","B","C"), 20, replace=TRUE))
dt <- dat[, as.data.table(t(combn(ids, 2))), .(groups)]
head(dt)
groups V1 V2
1: C 1 3
2: C 1 5
3: C 1 7
4: C 1 10
5: C 1 13
6: C 1 14
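On bigger tables, a self-join on groups can be faster still, since it avoids calling combn() once per group. A minimal sketch (the ids < i.ids filter keeps each unordered pair exactly once; i.ids is data.table's prefix for the joined copy of ids):
pairs <- dat[dat, on = "groups", allow.cartesian = TRUE][ids < i.ids, .(groups, V1 = ids, V2 = i.ids)]
head(pairs)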

library(data.table)
dat <- data.table(ids=1:20, groups=sample(x=c("A","B","C"), 20, replace=TRUE))
ind <- unique(dat$groups)
lapply(seq_along(ind), function(i) combn(dat$ids[dat$groups == ind[i]], 2))
You can then convert the list to any other format you might need; for example:
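Here is a hedged sketch that captures the per-group combn matrices in a variable (res is my name, not part of the answer above) and binds them into one two-column data.table with a group id:
res <- lapply(ind, function(g) combn(dat$ids[dat$groups == g], 2))
names(res) <- ind
rbindlist(lapply(res, function(m) as.data.table(t(m))), idcol = "groups")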

Related

Row operations on selected columns based on substring in data.table

I would like to apply a function to selected columns that match two different substrings. I've found this post related to my question but I couldn't get an answer from there.
Here is a reproducible example with my failed attempt. For the sake of this example, I want to do a row-wise operation that takes the sum of the values in all columns starting with v and subtracts the average of the values in all columns starting with f.
Update: the proposed solution must (a) use the := operator to make the most of data.table's fast performance, and (b) be flexible to operations other than mean and sum, which I used here just for the sake of simplicity.
library(data.table)
# generate data
dt <- data.table(id = letters[1:5],
                 v1 = 1:5,
                 v2 = 1:5,
                 f1 = 11:15,
                 f2 = 11:15)
dt
#> id v1 v2 f1 f2
#> 1: a 1 1 11 11
#> 2: b 2 2 12 12
#> 3: c 3 3 13 13
#> 4: d 4 4 14 14
#> 5: e 5 5 15 15
# what I've tried
dt[, Y := sum( .SDcols=names(dt) %like% "v" ) - mean( .SDcols=names(dt) %like% "f" ) by = id]
We melt the dataset into 'long' format using the measure argument, get the difference between the sum of 'v' and the mean of 'f' grouped by 'id', join the result to the original dataset on the 'id' column, and assign (:=) the 'V1' column as the 'Y' variable:
dt[melt(dt, measure = patterns("^v", "^f"), value.name = c("v", "f"))[
  , sum(v) - mean(f), id], Y := V1, on = .(id)]
dt
# id v1 v2 f1 f2 Y
#1: a 1 1 11 11 -9
#2: b 2 2 12 12 -8
#3: c 3 3 13 13 -7
#4: d 4 4 14 14 -6
#5: e 5 5 15 15 -5
Another option is Reduce, after creating indices for the 'v' and 'f' columns:
nmv <- which(startsWith(names(dt), "v"))
nmf <- which(startsWith(names(dt), "f"))
l1 <- length(nmf)  # the divisor for the mean of the 'f' columns
dt[, Y := Reduce(`+`, .SD[, nmv, with = FALSE]) - (Reduce(`+`, .SD[, nmf, with = FALSE]) / l1)]
rowSums and rowMeans combined with grep can also accomplish this (anchor the patterns with ^ so only the intended columns match):
dt$Y <- rowSums(dt[, grep("^v", names(dt)), with = FALSE]) - rowMeans(dt[, grep("^f", names(dt)), with = FALSE])
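To satisfy the OP's update (use := and stay flexible about the functions applied), here is a minimal sketch that chains two assignments via .SDcols; rowSums and rowMeans are placeholders that can be swapped for any other row-wise functions:
vcols <- grep("^v", names(dt), value = TRUE)
fcols <- grep("^f", names(dt), value = TRUE)
# first assign the row sums of the v-columns, then subtract the row means of the f-columns
dt[, Y := rowSums(.SD), .SDcols = vcols][, Y := Y - rowMeans(.SD), .SDcols = fcols]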

Group a data.table using a column which is a list

I have a really big problem: looping through the data.table to do what I want is too slow, so I am trying to get around looping. Let's assume I have a data.table as follows:
a <- data.table(i = c(1,2,3), j = c(2,2,6), k = list(c("a","b"),c("a","c"),c("b")))
> a
i j k
1: 1 2 a,b
2: 2 2 a,c
3: 3 6 b
And I want to group based on the values in k. So something like this:
a[, sum(j), by = k]
right now I am getting the following error:
Error in `[.data.table`(a, , sum(i), by = k) :
The items in the 'by' or 'keyby' list are length (2,2,1). Each must be same length as rows in x or number of rows returned by i (3).
The answer I am looking for is to group first all the rows having "a" in column k and calculate sum(j) and then all rows having "b" and so on. So the desired answer would be:
k V1
a 4
b 8
c 2
Any hint on how to do it efficiently? I can't melt column k by repeating the rows, since the resulting data.table would be too big for my case.
I think this might work:
a[, .(k = unlist(k)), by=.(i,j)][,sum(j),by=k]
k V1
1: a 4
2: b 8
3: c 2
If we are using tidyr, a compact option would be
library(tidyr)
unnest(a, k)[, sum(j), k]
# k V1
#1: a 4
#2: b 8
#3: c 2
Or using the dplyr/tidyr pipes
unnest(a, k) %>%
  group_by(k) %>%
  summarise(V1 = sum(j))
# k V1
# <chr> <dbl>
#1 a 4
#2 b 8
#3 c 2
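A version note, as an assumption about newer releases: in tidyr >= 1.0 the unnested column must be passed via the cols argument, so the call becomes:
unnest(a, cols = k) %>%
  group_by(k) %>%
  summarise(V1 = sum(j))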
Since by-group operations can be slow, I'd consider...
dat = a[rep(1:.N, lengths(k)), c(.SD, .(k = unlist(a$k))), .SDcols=setdiff(names(a), "k")]
i j k
1: 1 2 a
2: 1 2 b
3: 2 2 a
4: 2 2 c
5: 3 6 b
We're repeating the rows of columns i:j to match the unlisted k. The data should probably be kept in this long format rather than as a list column. From there, as in #MikeyMike's answer, we can do dat[, sum(j), by=k].
In data.table 1.9.7+, we can similarly do
dat = a[, c(.SD[rep(.I, lengths(k))], .(k = unlist(k))), .SDcols=i:j]
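Either way, the final aggregation on the flattened dat is a plain grouped sum, which reproduces the desired output:
dat[, sum(j), by = k]
#    k V1
# 1: a  4
# 2: b  8
# 3: c  2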

How do I subset a data.table in R to get the rows unique to it

I know this may be a simple question but I can't seem to get it right.
I have two data.tables, old_dt and new_dt. Both data.tables have the same two columns. My goal is to get the rows from new_dt that are not in old_dt.
Here is an example. old_dt:
v1 v2
1 a
2 b
3 c
4 d
Here is new_dt
v1 v2
3 c
4 d
5 e
What I want is to get just the 5 e row.
Using setdiff didn't work because my real data is more than 3 million rows. Using subset like this:
sub.cti <- subset(new_dt, old_dt$v1 != new_dt$v1 & old_dt$v2 != new_dt$v2)
only resulted in new_dt itself.
Using
sub.cti <- new_dt[,.(!old_dt$v1, !old_dt$v2)]
resulted in multiple rows of FALSEs.
Can somebody help me?
Thank you in advance
We can do a join (data from #giraffehere's post)
df2[!df1, on = "a"]
# a b
#1: 6 14
#2: 7 15
To get rows in 'df1' that are not in 'df2' based on the 'a' column
df1[!df2, on = "a"]
# a b
#1: 4 9
#2: 5 10
In the OP's example we need to join on both columns
new_dt[!old_dt, on = c("v1", "v2")]
# v1 v2
#1: 5 e
NOTE: Here I assumed that 'new_dt' and 'old_dt' are data.tables.
Of course, dplyr is a good package. For dealing with this problem, a shorter anti_join can be used
library(dplyr)
anti_join(new_dt, old_dt)
# v1 v2
# (int) (chr)
#1 5 e
Or setdiff from dplyr, which works on data.frame, data.table, tbl_df, etc.:
setdiff(new_dt, old_dt)
# v1 v2
#1: 5 e
However, the question is tagged as data.table.
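data.table itself also provides a set operation for this (assuming data.table >= 1.9.8; rows are compared across all columns):
fsetdiff(new_dt, old_dt)
#    v1 v2
# 1:  5  e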
dplyr helps a lot when you deal with tabular data in R; I'd recommend learning more about it.
library(dplyr)
library(magrittr) # this is just for shorter code with %<>%
# Create a sequence key that combines v1 & v2
# (a separator, e.g. paste(v1, v2, sep = "_"), would be safer against collisions)
Old_dt %<>%
  mutate(sequence = paste0(v1, v2))
new_dt %<>%
  mutate(sequence = paste0(v1, v2))
# Keep only the rows of new_dt whose sequence does not exist in Old_dt
result <- new_dt %>%
  filter(!(sequence %in% Old_dt$sequence)) %>%
  select(v1:v2)
v1 v2
5 e
EDIT: I noticed the OP wanted to match on both columns, not just one. I'll keep the data initialization part of the solution here as it is referenced above by #akrun. However, use the top solution #akrun posted; it is more the "data.table way".
df1 <- data.table(a = 1:5, b = 6:10)
df2 <- data.table(a = c(1, 2, 3, 6, 7), b = 11:15)
head(df1)
a b
1: 1 6
2: 2 7
3: 3 8
4: 4 9
5: 5 10
head(df2)
a b
1: 1 11
2: 2 12
3: 3 13
4: 6 14
5: 7 15
If column a has repeats, you could try this base R hack:
id.var1 <- paste(df1$a, df1$b, sep = "_")
id.var2 <- paste(df2$a, df2$b, sep = "_")
df2Keep <- df2[!(id.var2 %in% id.var1), ]

reshape data and coerce missing to zero

EDIT:
The original dataset can be found here: link
I have a matrix like:
data <- matrix(c("a","1","10",
                 "b","1","20",
                 "c","1","30",
                 "a","2","10",
                 "b","2","20",
                 "a","3","10",
                 "c","3","20"),
               ncol=3, byrow=TRUE)
I would like to reshape it as a data frame, coercing the missing values to zero:
data <- matrix(c("a","1","10",
                 "b","1","20",
                 "c","1","30",
                 "a","2","10",
                 "b","2","20",
                 "c","2","0",
                 "a","3","10",
                 "b","3","0",
                 "c","3","20"),
               ncol=3, byrow=TRUE)
How can I do it with the reshape package?
Thanks
We can use complete from tidyr, after converting your data a little:
library(tidyr)
data <- as.data.frame(data)
data$V3 <- as.numeric(as.character(data$V3))
complete(data, V1, V2, fill = list(V3 = 0))
tidyr is better, but if you want to use reshape2 you can:
library(reshape2)
data2 <- dcast(data = as.data.frame(data), V1 ~ V2)
data3 <- melt(data2, measure.vars = colnames(data2)[-1])
data3[is.na(data3)] <- "0"
Seems to me like you are handling something like a multivariate time series. Therefore I would suggest using a proper time series object.
library(zoo)
res <- read.zoo(data.frame(data, stringsAsFactors = FALSE),
                split = 1,
                index.column = 2,
                FUN = as.numeric)
coredata(res) <- as.numeric(coredata(res))
coredata(res)[is.na(res)] <- 0
This gives
res
# a b c
#1 10 20 30
#2 10 20 0
#3 10 0 20
I think the underlying issue is keeping mixed types in a matrix, which forces every value to character. First convert to a data.frame or a data.table, then convert all the columns to their proper types. Something like:
library(data.table) # V 1.9.6+
# Convert to data.table
DT <- as.data.table(data)
# Convert to correct column types
for(j in names(DT)) set(DT, j = j, value = type.convert(DT[[j]], as.is = TRUE))  # as.is = TRUE keeps strings as character
Then we can expand rows using data.table::CJ and assign zeroes to NA values
## Cross join all column except the third
DT <- DT[do.call(CJ, c(unique = TRUE, DT[, -3, with = FALSE])), on = names(DT)[-3]]
## Or if you want only to operate on these two columns you can alternatively do
# DT <- DT[CJ(V1, V2, unique = TRUE), on = c("V1", "V2")]
## Fill with zeroes
DT[is.na(V3), V3 := 0]
DT
# V1 V2 V3
# 1: a 1 10
# 2: a 2 10
# 3: a 3 10
# 4: b 1 20
# 5: b 2 20
# 6: b 3 0
# 7: c 1 30
# 8: c 2 0
# 9: c 3 20
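Alternatively, data.table's own dcast can fill the missing cells directly. A sketch, assuming DT as it stood right after the type-conversion step (before the CJ expansion):
wide <- dcast(DT, V1 ~ V2, value.var = "V3", fill = 0)  # one column per V2 level, gaps filled with 0
long <- melt(wide, id.vars = "V1", variable.name = "V2", value.name = "V3")  # back to long format; note V2 comes back as a factor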

Speed up data.frame rearrangement

I have a data frame with coordinates ("start","end") and labels ("group"):
a <- data.frame(start=1:4, end=3:6, group=c("A","B","C","D"))
a
start end group
1 1 3 A
2 2 4 B
3 3 5 C
4 4 6 D
I want to create a new data frame in which labels are assigned to every element of the sequence on the range of coordinates:
V1 V2
1 1 A
2 2 A
3 3 A
4 2 B
5 3 B
6 4 B
7 3 C
8 4 C
9 5 C
10 4 D
11 5 D
12 6 D
The following code works but it is extremely slow with wide ranges:
df <- data.frame()
for (i in 1:dim(a)[1]) {
  s <- seq(a[i, 1], a[i, 2])
  df <- rbind(df, data.frame(s, rep(a[i, 3], length(s))))
}
colnames(df) <- c("V1", "V2")
How can I speed this up?
You can try data.table
library(data.table)
setDT(a)[, start:end, by = group]
which gives
group V1
1: A 1
2: A 2
3: A 3
4: B 2
5: B 3
6: B 4
7: C 3
8: C 4
9: C 5
10: D 4
11: D 5
12: D 6
Obviously this would only work if you have one row per group, which it seems you have here.
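If you prefer a named result column, a small variant of the same idea:
setDT(a)[, .(V1 = start:end), by = group]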
If you want a very fast solution in base R, you can manually create the data.frame in two steps:
1. Use mapply to create a list of your ranges from "start" to "end".
2. Use rep + lengths to repeat the "groups" column to the expected number of rows.
The base R approach shared here won't depend on having only one row per group.
Try:
temp <- mapply(":", a[["start"]], a[["end"]], SIMPLIFY = FALSE)
data.frame(group = rep(a[["group"]], lengths(temp)),
           values = unlist(temp, use.names = FALSE))
If you're doing this a lot, just put it in a function:
myFun <- function(indf) {
  temp <- mapply(":", indf[["start"]], indf[["end"]], SIMPLIFY = FALSE)
  data.frame(group = rep(indf[["group"]], lengths(temp)),
             values = unlist(temp, use.names = FALSE))
}
Then, if you want some larger sample data to try it with:
set.seed(1)
a <- data.frame(start=1:4, end=sample(5:10, 4, TRUE), group=c("A","B","C","D"))
x <- do.call(rbind, replicate(1000, a, FALSE))
y <- do.call(rbind, replicate(100, x, FALSE))
Note that this does seem to slow down as the number of different unique values in "group" increases.
(In other words, the "data.table" approach will make the most sense in general. I'm just sharing a possible base R alternative that should be considerably faster than your existing approach.)
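On R >= 4.0.0 (an assumption about your R version), base sequence() accepts a from argument, giving an even more direct vectorized route with the original a from the question:
len <- a$end - a$start + 1
data.frame(V1 = sequence(len, from = a$start),  # all ranges, concatenated
           V2 = rep(a$group, len))              # labels repeated to match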
