I am working in R with data.tables. I have the following data.table encoding a set of points with coordinates A, B, C, D and a column set indicating which set each point belongs to.
library(data.table)
A B C D set
1: 0 0 0 0 1
2: 1 0 1 0 2
3: 1 1 1 0 2
4: 0 1 0 0 2
5: 1 0 1 1 2
6: 0 1 0 0 3
7: 1 1 0 0 3
8: 0 0 1 0 4
9: 1 0 1 0 4
10: 0 1 0 1 4
11: 0 0 0 0 5
12: 1 0 0 0 5
13: 1 1 1 0 5
14: 1 1 1 1 5
dt = setDT(structure(list(A = c(0L, 1L, 1L, 0L, 1L, 0L, 1L, 0L, 1L, 0L,
0L, 1L, 1L, 1L), B = c(0L, 0L, 1L, 1L, 0L, 1L, 1L, 0L, 0L, 1L,
0L, 0L, 1L, 1L), C = c(0L, 1L, 1L, 0L, 1L, 0L, 0L, 1L, 1L, 0L,
0L, 0L, 1L, 1L), D = c(0L, 0L, 0L, 0L, 1L, 0L, 0L, 0L, 0L, 1L,
0L, 0L, 0L, 1L), set = c(1L, 2L, 2L, 2L, 2L, 3L, 3L, 4L, 4L,
4L, 5L, 5L, 5L, 5L)), .Names = c("A", "B", "C", "D", "set"), row.names = c(NA,
-14L), class = "data.frame"))
I have another table encoding, e.g., the probability (mass) of each set.
set mass
1: 1 0.27809187
2: 2 0.02614841
3: 3 0.36890459
4: 4 0.28975265
5: 5 0.03710247
wt = setDT(structure(list(set = 1:5, mass = c(0.27809187, 0.02614841, 0.36890459,
0.28975265, 0.03710247)), .Names = c("set", "mass"), row.names = c(NA,
-5L), class = "data.frame"))
I would like to have a procedure to create a projection onto a subspace, e.g. C,D. (Note that the original points 1, 4, 6, 7, 11, and 12 coincide in this case; sets 1 and 3 are the same in this subspace, as are sets 2 and 5.)
unique(dt[,c("C","D", "set")])
> C D set
1: 0 0 1
2: 1 0 2
3: 0 0 2
4: 1 1 2
5: 0 0 3
6: 1 0 4
7: 0 1 4
8: 0 0 5
9: 1 0 5
10: 1 1 5
and to identify identical sets, keep just the unique ones, and sum the corresponding masses. I.e. in this case:
> C D set
1: 0 0 1
2: 1 0 2
3: 0 0 2
4: 1 1 2
5: 1 0 4
6: 0 1 4
set mass
1: 1 0.64699646 # set 1 + set 3
2: 2 0.06325088 # set 2 + set 5
3: 4 0.28975265
Thank you for your ideas.
Similar in concept to Frank's, we can map the binary values of each set to a decimal with x * 2 ^ ((length(x) - 1):0). Subsetting, also, for "C" and "D", we get:
coords = c("C", "D")
d = data.frame(set = dt$set,
               val = Reduce("+", Map("*", list(dt$C, dt$D), 2 ^ ((length(coords) - 1):0))))
d
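As an aside, the val computation can be written against coords directly instead of hardcoding dt$C and dt$D; a sketch, assuming the coordinate columns are 0/1:
d = data.frame(set = dt$set,
               val = Reduce("+", Map("*", as.list(dt[, coords, with = FALSE]),
                                     2 ^ ((length(coords) - 1):0))))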
Then, we can group identical sets following the same idea:
tab = table(d$val, d$set) > 0L ## `table(d) > 0` to ignore the duplicates
gr = colSums(tab * (2 ^ ((nrow(tab) - 1):0)))
gr
# 1 2 3 4 5
# 8 11 8 6 11
## another (pre-edit) alternative with unnecessary overhead
#gr = cutree(hclust(dist(table(d) > 0L)), h = 0)
#gr
#1 2 3 4 5
#1 2 1 3 2
and aggregate based on this group:
rowsum(wt$mass[match(names(gr), wt$set)], gr, reorder = FALSE)
# [,1]
#8 0.64699646
#11 0.06325088
#6 0.28975265
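One caveat: the 2^ fingerprint in gr reads each column of tab as a binary number, which overflows double precision once there are more than ~50 distinct projected points. A string key gives the same grouping without that limit (a sketch):
gr2 = apply(tab, 2, function(col) paste(which(col), collapse = "_"))
gr2
#    1       2       3     4       5
#  "1" "1_3_4"   "1" "2_3" "1_3_4"
rowsum(wt$mass[match(names(gr2), wt$set)], gr2, reorder = FALSE)
#            [,1]
#1     0.64699646
#1_3_4 0.06325088
#2_3   0.28975265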
A somewhat clunky option: make a unique string for each set, and then group on that.
coords = c("C", "D")
gDT = setorder(unique(dt[,c(coords, "set"), with=FALSE]))[,
.(s = paste(do.call(paste, c(.SD, .(sep="_"))), collapse="."))
, by=set, .SDcols = coords][,
g := .GRP
, by=s][]
# set s g
# 1: 1 0_0 1
# 2: 2 0_0.1_0.1_1 2
# 3: 3 0_0 1
# 4: 5 0_0.1_0.1_1 2
# 5: 4 0_1.1_0 3
gDT[wt, on=.(set), mass := i.mass ]
gDT[, .(set = first(set), mass = sum(mass)), by=g]
# g set mass
# 1: 1 1 0.64699646
# 2: 2 2 0.06325088
# 3: 3 4 0.28975265
Comments
You can get rid of the g by chaining on [, g := NULL][] in the last line (see the sketch after these comments).
setorder is just sorting the data so that the unique string turns out the same for sets that are the same.
Grouped first and sum operations are optimized, as you can see if you add verbose = TRUE to the final line, like gDT[, .(set = first(set), mass = sum(mass)), by=g, verbose=TRUE].
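For instance, the final aggregation with g dropped (as mentioned in the first comment) looks like:
gDT[, .(set = first(set), mass = sum(mass)), by=g][, g := NULL][]
#    set       mass
# 1:   1 0.64699646
# 2:   2 0.06325088
# 3:   4 0.28975265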
Related
Suppose I have a dataframe containing binary values as:
A B C D
a 1 0 0 0
b 0 1 1 0
c 1 1 0 1
d 0 0 1 1
e 1 1 1 1
f 1 0 0 1
I'd like to count the number of common indices shared between each pair of columns in the dataframe,
as shown below. What is the most efficient way to do so in R?
A B C D
A - 2 1 3
B 2 - 2 2
C 1 2 - 2
D 3 2 2 -
Any help is greatly appreciated. Thanks!
Maybe this is what you are after
> `diag<-`(crossprod(as.matrix(df)),NA)
A B C D
A NA 2 1 3
B 2 NA 2 2
C 1 2 NA 2
D 3 2 2 NA
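This works because, for a 0/1 matrix X, crossprod(X) computes t(X) %*% X, whose (i, j) entry counts the rows in which columns i and j are both 1; the `diag<-` replacement function then blanks the diagonal. A more explicit equivalent sketch:
m <- as.matrix(df)
res <- t(m) %*% m   # (i, j) = number of rows where columns i and j are both 1
diag(res) <- NA     # blank the self-counts on the diagonal
res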
Data
> dput(df)
structure(list(A = c(1L, 0L, 1L, 0L, 1L, 1L), B = c(0L, 1L, 1L,
0L, 1L, 0L), C = c(0L, 1L, 0L, 1L, 1L, 0L), D = c(0L, 0L, 1L,
1L, 1L, 1L)), class = "data.frame", row.names = c("a", "b", "c",
"d", "e", "f"))
I have a database with 100 columns, but a minimal reproduction of my data is as follows:
df1 <- read.table(text="PG1S1AW KOM1S1zo PG2S2AW KOM2S2zo PG3S3AW KOM3S3zo PG4S4AW KOM4S4zo PG5S5AW KOM5S5zo
4 1 2 4 4 3 0 4 0 5
4 4 3 1 3 1 0 3 0 1
2 3 5 3 3 2 1 4 0 2
1 1 1 1 1 3 0 5 0 1
2 5 3 4 4 5 0 1 3 4", header=TRUE)
I want to get the columns starting with KOM or PG whose number is greater than 3. So we need PG4, KOM4, and above. Put simply, the columns starting with PG and KOM whose number is 4 or greater.
The intended output is:
PG4S4AW KOM4S4zo PG5S5AW KOM5S5zo
0 4 0 5
0 3 0 1
1 4 0 2
0 5 0 1
0 1 3 4
I have used the following code, but it does not work for me:
df2 <- df1 %>% select(contains("KO"))
Thanks for your help.
It is not entirely clear about the patterns. We create a function (f1) to extract the one or more digits (\\d+) that follow 'KOM' or (|) 'PG' with str_extract (from stringr) and convert them to numeric ('v1'); similarly, we extract the numbers after the 'S' ('v2'). We then check whether these values are the same and whether the value is greater than 3, and wrap the result in which so that any NAs produced by str_extract are dropped (which returns the column index while removing NAs). Finally, we use the function inside select to pick the columns that follow the pattern.
library(dplyr)
library(stringr)
f1 <- function(nm) {
v1 <- as.numeric(str_extract(nm, "(?<=(KOM|PG))\\d+"))
v2 <- as.numeric(str_extract(nm, "(?<=S)\\d+"))
nm[which((v1 == v2) & (v1 > 3))]
}
df1 %>%
select(f1(names(.)))
# PG4S4AW KOM4S4zo PG5S5AW KOM5S5zo
#1 0 4 0 5
#2 0 3 0 1
#3 1 4 0 2
#4 0 5 0 1
#5 0 1 3 4
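To see what f1 keeps, here are the intermediate extractions on the column names (the (?<=...) lookbehind anchors the digits to the prefix):
nm <- names(df1)
as.numeric(str_extract(nm, "(?<=(KOM|PG))\\d+"))
#[1] 1 1 2 2 3 3 4 4 5 5
as.numeric(str_extract(nm, "(?<=S)\\d+"))
#[1] 1 1 2 2 3 3 4 4 5 5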
data
df1 <- structure(list(PG1S1AW = c(4L, 4L, 2L, 1L, 2L), KOM1S1zo = c(1L,
4L, 3L, 1L, 5L), PG2S2AW = c(2L, 3L, 5L, 1L, 3L), KOM2S2zo = c(4L,
1L, 3L, 1L, 4L), PG3S3AW = c(4L, 3L, 3L, 1L, 4L), KOM3S3zo = c(3L,
1L, 2L, 3L, 5L), PG4S4AW = c(0L, 0L, 1L, 0L, 0L), KOM4S4zo = c(4L,
3L, 4L, 5L, 1L), PG5S5AW = c(0L, 0L, 0L, 0L, 3L), KOM5S5zo = c(5L,
1L, 2L, 1L, 4L)), class = "data.frame", row.names = c(NA, -5L
))
Given your example data, you can instead just look for the numbers 4 or 5.
df1 %>%
select(matches("4|5"))
#> PG4S4AW KOM4S4zo PG5S5AW KOM5S5zo
#> 1 0 4 0 5
#> 2 0 3 0 1
#> 3 1 4 0 2
#> 4 0 5 0 1
#> 5 0 1 3 4
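Note that matches("4|5") matches a 4 or 5 anywhere in a column name, which may be too loose with 100 columns. Anchoring the pattern to the prefix is safer; a sketch that gives the same output here, assuming single-digit set numbers:
df1 %>%
  select(matches("^(PG|KOM)[4-9]"))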
I would like to create variable "Time" which basically indicates the number of times variable ID showed up within each day minus 1. In other words, the count is lagged by 1 and the first time ID showed up in a day should be left blank. Second time the same ID shows up on a given day should be 1.
Basically, I want to create the "Time" variable in the example below.
ID Day Time Value
 1   1         0
 1   1    1    0
 1   1    2    0
 1   2         0
 1   2    1    0
 1   2    2    0
 1   2    3    1
 2   1         0
 2   1    1    0
 2   1    2    0
Below is the code I am working on; I have not been successful with it.
data$time<-data.frame(data$ID,count=ave(data$ID==data$ID, data$Day, FUN=cumsum))
We can do this with data.table. Convert the 'data.frame' to 'data.table' (setDT(df1)); then, grouped by 'ID' and 'Day', we take the lag of the row sequence (shift(seq_len(.N))) and assign (:=) it as the "Time" column.
library(data.table)
setDT(df1)[, Time := shift(seq_len(.N)), .(ID, Day)]
df1
# ID Day Value Time
# 1: 1 1 0 NA
# 2: 1 1 0 1
# 3: 1 1 0 2
# 4: 1 2 0 NA
# 5: 1 2 0 1
# 6: 1 2 0 2
# 7: 1 2 1 3
# 8: 2 1 0 NA
# 9: 2 1 0 1
#10: 2 1 0 2
Or with base R
with(df1, ave(Day, Day, ID, FUN= function(x)
ifelse(seq_along(x)!=1, seq_along(x)-1, NA)))
#[1] NA 1 2 NA 1 2 3 NA 1 2
Or without the ifelse
with(df1, ave(Day, Day, ID, FUN= function(x)
NA^(seq_along(x)==1)*(seq_along(x)-1)))
#[1] NA 1 2 NA 1 2 3 NA 1 2
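If you prefer dplyr, an equivalent sketch; lag(row_number()) yields NA, 1, 2, ... within each group:
library(dplyr)
df1 %>%
  group_by(ID, Day) %>%
  mutate(Time = lag(row_number())) %>%
  ungroup()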
data
df1 <- structure(list(ID = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L),
Day = c(1L, 1L, 1L, 2L, 2L, 2L, 2L, 1L, 1L, 1L), Value = c(0L,
0L, 0L, 0L, 0L, 0L, 1L, 0L, 0L, 0L)), .Names = c("ID", "Day",
"Value"), row.names = c(NA, -10L), class = "data.frame")
Please find here a very small subset of a long data.table I am working with:
dput(dt)
structure(list(id = 1:15, pnum = c(4298390L, 4298390L, 4298390L,
4298558L, 4298558L, 4298559L, 4298559L, 4299026L, 4299026L, 4299026L,
4299026L, 4300436L, 4300436L, 4303566L, 4303566L), invid = c(15L,
101L, 102L, 103L, 104L, 103L, 104L, 106L, 107L, 108L, 109L, 87L,
111L, 2L, 60L), fid = structure(c(1L, 1L, 1L, 2L, 2L, 2L, 2L,
4L, 4L, 4L, 4L, 3L, 3L, 2L, 2L), .Label = c("CORN", "DowCor",
"KIM", "Texas"), class = "factor"), dom_kn = c(1L, 0L, 0L, 0L,
1L, 0L, 1L, 0L, 0L, 0L, 0L, 1L, 0L, 1L, 1L), prim_kn = c(1L,
0L, 0L, 0L, 1L, 0L, 1L, 0L, 0L, 0L, 0L, 1L, 0L, 1L, 0L), pat_kn = c(1L,
0L, 0L, 0L, 1L, 0L, 1L, 0L, 0L, 0L, 0L, 1L, 0L, 1L, 0L), net_kn = c(1L,
0L, 0L, 1L, 1L, 1L, 1L, 0L, 0L, 0L, 0L, 1L, 0L, 1L, 1L), age_kn = c(1L,
0L, 0L, 1L, 1L, 1L, 1L, 0L, 0L, 0L, 0L, 1L, 0L, 1L, 0L), legclaims = c(5L,
0L, 0L, 2L, 5L, 2L, 5L, 0L, 0L, 0L, 0L, 5L, 0L, 5L, 2L), n_inv = c(3L,
3L, 3L, 2L, 2L, 2L, 2L, 4L, 4L, 4L, 4L, 2L, 2L, 2L, 2L)), .Names = c("id",
"pnum", "invid", "fid", "dom_kn", "prim_kn", "pat_kn", "net_kn",
"age_kn", "legclaims", "n_inv"), class = "data.frame", row.names = c(NA,
-15L))
I am looking to apply a tweaked greater-than comparison across 5 different columns.
Within each pnum (patent), there are multiple invid (inventors). I want to compare the values of the columns dom_kn, prim_kn, pat_kn, net_kn, and age_kn per row, to the values in the other rows with the same pnum. The comparison is simply > and if the value is indeed bigger than the other, one "point" should be attributed.
So for the first row pnum == 4298390 and invid == 15, you can see the values in the five columns are all 1, while the values for invid == 101 | 102 are all zero. This means that if we individually compare (is greater than?) each value in the first row to each cell in the second and third row, the total sum would be 10 points. In every single comparison, the value in the first row is bigger and there are 10 comparisons.
The number of comparisons is by design 5 * (n_inv -1).
The result I am looking for in row 1 should then be 10 / 10 = 1.
For pnum == 4298558, the columns net_kn and age_kn both have the value 1 in the two rows (for invid 103 and 104), so each row should get 0.5 points (if three inventors had the value 1, each should get 0.33 points). The same goes for pnum == 4298559.
For the next pnum == 4299026 all values are zero so every comparison should result in 0 points.
Thus note the difference: There are three different dyadic comparisons
1 > 0 --> assign 1
1 = 1 --> assign 1 / number of positive values in column subset
0 = 0 --> assign 0
Desired result
An extra column result in the data.table with values 1 0 0 0.2 0.8 0.2 0.8 0 0 0 0 1 0 0.8 0.2
Any suggestions on how to compute this efficiently?
Thanks!
vars = grep('_kn', names(dt), value = TRUE)
# all you need to do is simply assign the correct weight and sum the numbers up;
# grouping by c('pnum', var) splits each patent into its 0-rows and 1-rows,
# so within the 1-rows group each row receives get(var)/.N = 1/.N, i.e. ties share the point
dt[, res := 0]
for (var in vars)
  dt[, res := res + get(var) / .N, by = c('pnum', var)]
# normalize
dt[, res := res/sum(res), by = pnum]
# id pnum invid fid dom_kn prim_kn pat_kn net_kn age_kn legclaims n_inv res
# 1: 1 4298390 15 CORN 1 1 1 1 1 5 3 1.0
# 2: 2 4298390 101 CORN 0 0 0 0 0 0 3 0.0
# 3: 3 4298390 102 CORN 0 0 0 0 0 0 3 0.0
# 4: 4 4298558 103 DowCor 0 0 0 1 1 2 2 0.2
# 5: 5 4298558 104 DowCor 1 1 1 1 1 5 2 0.8
# 6: 6 4298559 103 DowCor 0 0 0 1 1 2 2 0.2
# 7: 7 4298559 104 DowCor 1 1 1 1 1 5 2 0.8
# 8: 8 4299026 106 Texas 0 0 0 0 0 0 4 NaN
# 9: 9 4299026 107 Texas 0 0 0 0 0 0 4 NaN
#10: 10 4299026 108 Texas 0 0 0 0 0 0 4 NaN
#11: 11 4299026 109 Texas 0 0 0 0 0 0 4 NaN
#12: 12 4300436 87 KIM 1 1 1 1 1 5 2 1.0
#13: 13 4300436 111 KIM 0 0 0 0 0 0 2 0.0
#14: 14 4303566 2 DowCor 1 1 1 1 1 5 2 0.8
#15: 15 4303566 60 DowCor 1 0 0 1 0 2 2 0.2
Dealing with the above NaN case (arguably the correct answer) is left to the reader.
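If you would rather report 0 for those all-zero patents, one possible follow-up (a sketch):
# 0/0 produces NaN; overwrite it with an explicit 0
dt[is.nan(res), res := 0]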
Here's a fastish solution using dplyr:
library(dplyr)
dt %>%
group_by(pnum) %>% # group by pnum
mutate_each(funs(. == max(.) & max(.) != 0), ends_with('kn')) %>%
#give a 1 if the value is the max and not 0, only for the columns ending with kn
mutate_each(funs(. / sum(.)) , ends_with('kn')) %>%
#correct for multiple maximums
select(ends_with('kn')) %>%
#remove all non kn columns
do(data.frame(x = rowSums(.[-1]), y = sum(.[-1]))) %>%
#make a new data frame with x = the row sums for each individual
# and y the column sums
mutate(out = x/y)
#divide by y (we could just use /5 if we always have five columns)
giving your desired output in the column out:
Source: local data frame [15 x 4]
Groups: pnum [6]
pnum x y out
(int) (dbl) (dbl) (dbl)
1 4298390 5 5 1.0
2 4298390 0 5 0.0
3 4298390 0 5 0.0
4 4298558 1 5 0.2
5 4298558 4 5 0.8
6 4298559 1 5 0.2
7 4298559 4 5 0.8
8 4299026 NaN NaN NaN
9 4299026 NaN NaN NaN
10 4299026 NaN NaN NaN
11 4299026 NaN NaN NaN
12 4300436 5 5 1.0
13 4300436 0 5 0.0
14 4303566 4 5 0.8
15 4303566 1 5 0.2
The NaNs come from the groups with no winners; convert them back using e.g.:
x[is.na(x)] <- 0
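As an aside, mutate_each() and funs() are deprecated in current dplyr; the first two steps could be written with across() instead (a sketch, assuming dplyr >= 1.0):
dt %>%
  group_by(pnum) %>%
  mutate(across(ends_with('kn'), ~ .x == max(.x) & max(.x) != 0)) %>%
  mutate(across(ends_with('kn'), ~ .x / sum(.x)))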
What is the best way to determine a factor or create a new category field based on a number of boolean fields? In this example, I need to count the number of unique combinations of medications.
> MultPsychMeds
ID OLANZAPINE HALOPERIDOL QUETIAPINE RISPERIDONE
1 A 1 1 0 0
2 B 1 0 1 0
3 C 1 0 1 0
4 D 1 0 1 0
5 E 1 0 0 1
6 F 1 0 0 1
7 G 1 0 0 1
8 H 1 0 0 1
9 I 0 1 1 0
10 J 0 1 1 0
Perhaps another way to state it is that I need to pivot or cross tabulate the pairs. The final results need to look something like:
Combination Count
OLANZAPINE/HALOPERIDOL 1
OLANZAPINE/QUETIAPINE 3
OLANZAPINE/RISPERIDONE 4
HALOPERIDOL/QUETIAPINE 2
This data frame can be replicated in R with:
MultPsychMeds <- structure(list(ID = structure(1:10, .Label = c("A", "B", "C",
"D", "E", "F", "G", "H", "I", "J"), class = "factor"), OLANZAPINE = c(1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 0L, 0L), HALOPERIDOL = c(1L, 0L,
0L, 0L, 0L, 0L, 0L, 0L, 1L, 1L), QUETIAPINE = c(0L, 1L, 1L, 1L,
0L, 0L, 0L, 0L, 1L, 1L), RISPERIDONE = c(0L, 0L, 0L, 0L, 1L,
1L, 1L, 1L, 0L, 0L)), .Names = c("ID", "OLANZAPINE", "HALOPERIDOL",
"QUETIAPINE", "RISPERIDONE"), class = "data.frame", row.names = c(NA,
-10L))
Here's one approach using the reshape and plyr packages:
library(reshape)
library(plyr)
#Melt into long format
dat.m <- melt(MultPsychMeds, id.vars = "ID")
#Group at the ID level and paste the drugs together with "/"
out <- ddply(dat.m, "ID", summarize, combos = paste(variable[value == 1], collapse = "/"))
#Calculate a table
with(out, count(combos))
x freq
1 HALOPERIDOL/QUETIAPINE 2
2 OLANZAPINE/HALOPERIDOL 1
3 OLANZAPINE/QUETIAPINE 3
4 OLANZAPINE/RISPERIDONE 4
Just for fun, a base R solution (that can be turned into a one-liner :-) ):
data.frame(table(apply(MultPsychMeds[,-1], 1, function(currow){
  wc <- which(currow == 1)
  # +1 compensates for dropping the ID column above
  paste(colnames(MultPsychMeds)[wc+1], collapse="/")
})))
Another way could be:
subset(
as.data.frame(
with(MultPsychMeds, table(OLANZAPINE, HALOPERIDOL, QUETIAPINE, RISPERIDONE)),
responseName="count"
),
count>0
)
which gives
OLANZAPINE HALOPERIDOL QUETIAPINE RISPERIDONE count
4 1 1 0 0 1
6 1 0 1 0 3
7 0 1 1 0 2
10 1 0 0 1 4
It's not exactly the format you want, but it's fast and simple.
There is a shorthand in the plyr package:
require(plyr)
count(MultPsychMeds, c("OLANZAPINE", "HALOPERIDOL", "QUETIAPINE", "RISPERIDONE"))
# OLANZAPINE HALOPERIDOL QUETIAPINE RISPERIDONE freq
# 1 0 1 1 0 2
# 2 1 0 0 1 4
# 3 1 0 1 0 3
# 4 1 1 0 0 1
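For reference, a dplyr equivalent of this shorthand (a sketch; the count column is named n instead of freq):
library(dplyr)
MultPsychMeds %>%
  count(OLANZAPINE, HALOPERIDOL, QUETIAPINE, RISPERIDONE)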