count shared occurrences and remove duplicates - r

I have this data.frame :
df <- read.table(text= " section to from time
a 1 5 9
a 2 5 9
a 1 5 10
a 2 6 10
a 2 7 11
a 2 7 12
a 3 7 12
a 4 7 12
a 4 6 13 ", header = TRUE)
Each row identifies the simultaneous occurrence of an id in to and from at a timepoint time. Basically, it is a time-explicit network of ids in to and from.
I want to know which to ids shared a from id within a particular time range, which is 2. In other words, I want to know whether ids 1 and 2 in to both went to coffee shop 5 within two days of each other, i.e.
ids 1 and 2 in to shared id 5 in from at times 9 and 10 respectively, and so would have 1 shared event within the time window of 2. If they also shared a from id at time point 13, e.g.
a 1 5 9
a 2 5 9
a 1 7 13
a 2 7 13
then 1 and 2 would get a 2
So the final output I would like for the df would be:
section to.a to.b noShared
a 1 2 1
a 2 3 1
a 2 4 1
a 3 4 1
I can get some of the way there with:
library(plyr)
library(tnet)
a <- ddply(df, .(section,to,time), function(x)
data.frame(from = unique(x$from)) )
b <- ddply(a, .(section,time), function(x) {
b <- as.tnet(x[, c("to","from")], type="binary two-mode tnet")
b <- projecting_tm(b, method="sum")
return(b)
})
This gets me which ids in to shared ids in from within each time point.
However there are two main problems with b.
Firstly within each time point the pairs of ids appear twice in both directions i.e.
1 2 5 9 # id 1 and 2 went to coffee shop 5 at time 9
2 1 5 9 # id 2 and 1 went to coffee shop 5 at time 9
I only want each combination to appear once:
1 2 5 # id 1 and 2 went to coffee shop 5 at time 9
Secondly, I need to bin the results within the time window so that my final result doesn't have time, just the number of shared events, i.e.
EDIT
The time issue has more issues than expected. The first problem is enough for this question.
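One possible reading of the time-window requirement can be sketched in base R: for each unordered pair of to ids, count the distinct from ids that both visited at times no more than the window apart. The names window, pairs, shared and res are illustrative, and treating the window as an absolute time difference of at most 2 is an assumption, not something settled above.

```r
df <- read.table(text = "section to from time
a 1 5 9
a 2 5 9
a 1 5 10
a 2 6 10
a 2 7 11
a 2 7 12
a 3 7 12
a 4 7 12
a 4 6 13", header = TRUE)

window <- 2  # assumption: two visits count as shared if |time.a - time.b| <= window

# Self-join on section and from: all pairs of rows that visited the same 'from'
pairs <- merge(df, df, by = c("section", "from"), suffixes = c(".a", ".b"))

# Keep each unordered pair once (to.a < to.b) and only visits inside the window
pairs <- pairs[pairs$to.a < pairs$to.b &
               abs(pairs$time.a - pairs$time.b) <= window, ]

# Count the distinct shared 'from' ids per pair
shared <- unique(pairs[, c("section", "to.a", "to.b", "from")])
res <- aggregate(from ~ section + to.a + to.b, data = shared, FUN = length)
names(res)[names(res) == "from"] <- "noShared"
res <- res[order(res$to.a, res$to.b), ]
res
```

On the sample data this gives one row for each of the pairs (1,2), (2,3), (2,4) and (3,4), each with noShared equal to 1. A different definition of a shared event (for example, counting every co-visit rather than every distinct from id) would need a different final aggregation.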

For the generation of b (first part of the question),
I changed the code of projecting_tm, which is a transformation of a network.
b <- ddply(a, .(section,time), function(x) {
## first I create the origin network
net2 <- x[, c("to","from")]
colnames(net2) <- c('i','p')
net2 <- net2[order(net2[, "i"], net2[, "p"]), ]
np <- table(net2[, "p"])
net2 <- merge(net2, cbind(p = as.numeric(rownames(np)),np = np))
## transformed network
net1 <- merge(net2, cbind(j = net2[, "i"], p = net2[, "p"]))
net1 <- net1[net1[, "i"] != net1[, "j"], c("i", "j","np")]
net1 <- net1[order(net1[, "i"], net1[, "j"]), ]
index <- !duplicated(net1[, c("i", "j")])
net1 <- cbind(net1[index, c("i", "j")])
net1
})
So here you get your b without any warning
> b
section time i j
1 a 9 1 2
2 a 9 2 1
3 a 12 2 3
4 a 12 2 4
5 a 12 3 2
6 a 12 3 4
7 a 12 4 2
8 a 12 4 3
For the second part of the question, do you want to remove the duplicates from b?
b[!duplicated(t(apply(b[3:4], 1, sort))), ]
section time i j
1 a 9 1 2
3 a 12 2 3
4 a 12 2 4
6 a 12 3 4
For this part, I use an answer to this question.
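A variant of the same deduplication, sketched with pmin() and pmax() instead of sorting each row. Unlike the one-liner above, it also uses section and time as part of the key, which is an assumption about what should count as a duplicate. The names lo, hi and b_unique are illustrative, and b is rebuilt by hand here so the example runs on its own.

```r
# b as printed above, rebuilt by hand for a self-contained example
b <- data.frame(section = "a",
                time = c(9, 9, 12, 12, 12, 12, 12, 12),
                i = c(1, 2, 2, 2, 3, 3, 4, 4),
                j = c(2, 1, 3, 4, 2, 4, 2, 3))

# Put each pair into a canonical order: smaller id first, larger id second
lo <- pmin(b$i, b$j)
hi <- pmax(b$i, b$j)

# Keep the first row of every (section, time, lo, hi) combination
b_unique <- b[!duplicated(data.frame(b$section, b$time, lo, hi)), ]
b_unique
```

Because pmin() and pmax() are vectorised, this avoids the row-wise apply() and tends to scale better on long tables.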

Related

Attempting to remove a row in R using variable names

I am trying to remove some rows in a for loop in R. The conditional involves comparing it to the line below it, so I can't filter within the brackets.
I know that I can remove a row when a constant is specified: dataframe[-2, ]. I just want to do the same with a variable: dataframe[-x, ]. Here's the full loop:
for (j in 1:(nrow(referrals) - 1)) {
k <- j + 1
if (referrals[j, "Client ID"] == referrals[k, "Client ID"] &
referrals[j, "Provider SubCode"] == referrals[k, "Provider SubCode"]) {
referrals[-k, ]
}
}
The code runs without complaint, but no rows are removed (and I know some should be). Of course, if I test it with a constant, it works fine: referrals[-2, ].
You need to add a reproducible example for people to work with. I don't know the structure of your data, so I can only guess if this will work for you. I would not use a loop, for the reasons pointed out in the comments. I would identify the rows to remove first, and then remove them using normal means. Consider:
set.seed(4499) # this makes the example exactly reproducible
d <- data.frame(Client.ID = sample.int(4, 20, replace=TRUE),
Provider.SubCode = sample.int(4, 20, replace=TRUE))
d
# Client.ID Provider.SubCode
# 1 1 1
# 2 1 4
# 3 3 2
# 4 4 4
# 5 4 1
# 6 2 2
# 7 2 2 # redundant
# 8 3 1
# 9 4 4
# 10 3 4
# 11 1 3
# 12 1 3 # redundant
# 13 3 4
# 14 1 2
# 15 3 2
# 16 4 4
# 17 3 4
# 18 2 2
# 19 4 1
# 20 3 3
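# compare every row (except the last) with the row below it
redundant.rows <- with(d, Client.ID[-nrow(d)] == Client.ID[-1] &
Provider.SubCode[-nrow(d)] == Provider.SubCode[-1])
# logical indexing also works when no row is redundant, unlike d[-which(...), ]
d[!c(FALSE, redundant.rows), ]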
# Client.ID Provider.SubCode
# 1 1 1
# 2 1 4
# 3 3 2
# 4 4 4
# 5 4 1
# 6 2 2
# 8 3 1 # 7 is missing
# 9 4 4
# 10 3 4
# 11 1 3
# 13 3 4 # 12 is missing
# 14 1 2
# 15 3 2
# 16 4 4
# 17 3 4
# 18 2 2
# 19 4 1
# 20 3 3
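The loop in the question removed nothing because referrals[-k, ] only returns a modified copy; the result has to be assigned back to take effect. A minimal sketch on a hypothetical toy data frame (toy, id, code and same_as_prev are illustrative names) that flags rows equal to the row above and assigns the filtered result:

```r
toy <- data.frame(id   = c(1, 1, 2, 2, 3),
                  code = c("x", "x", "y", "z", "z"))

# TRUE where a row has the same id and code as the row directly above it;
# the first row has no predecessor, so it is never flagged
same_as_prev <- c(FALSE,
                  toy$id[-1] == toy$id[-nrow(toy)] &
                  toy$code[-1] == toy$code[-nrow(toy)])

# The assignment is what actually drops the rows
toy <- toy[!same_as_prev, ]
toy
```

Only row 2 is dropped here, since it is the only row that repeats the row directly above it.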
Using all information given by you, I believe this could be a good alternative:
duplicated.rows <- duplicated(referrals)
Then, if you want the duplicated results run:
referrals.double <- referrals[duplicated.rows, ]
However, if you want the non duplicated results run:
referrals.not.double <- referrals[!duplicated.rows, ]
If you prefer to restrict the comparison to the two columns of interest (maybe it's interesting for you):
key.cols <- c("Client ID", "Provider SubCode")
# TRUE for every row whose Client ID / Provider SubCode pair has appeared before
duplicated.rows.key <- duplicated(referrals[, key.cols])
referrals.not.double <- referrals[!duplicated.rows.key, ]

Percolation clustering

Consider the following groupings:
> data.frame(x = c(3:5,7:9,12:14), grp = c(1,1,1,2,2,2,3,3,3))
x grp
1 3 1
2 4 1
3 5 1
4 7 2
5 8 2
6 9 2
7 12 3
8 13 3
9 14 3
Let's say I don't know the grp values but only have a vector x. What is the easiest way to generate grp values, essentially an id field of groups of values within a threshold from each other? Is this a percolation algorithm?
One option would be to compare the next with the current value and check if the difference is greater than 1, and get the cumulative sum.
df1$grp <- cumsum(c(TRUE, diff(df1$x) > 1))
df1$grp
#[1] 1 1 1 2 2 2 3 3 3
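The same idea can be wrapped in a small helper that takes the gap threshold as a parameter and does not require x to be sorted. This is a sketch; group_by_gap and threshold are hypothetical names, not part of any package.

```r
group_by_gap <- function(x, threshold = 1) {
  ord <- order(x)                      # positions of x in increasing order
  grp <- integer(length(x))
  # start a new group wherever the gap to the previous sorted value exceeds threshold
  grp[ord] <- cumsum(c(TRUE, diff(x[ord]) > threshold))
  grp
}

group_by_gap(c(3:5, 7:9, 12:14))
# [1] 1 1 1 2 2 2 3 3 3
group_by_gap(c(12, 3, 4, 13, 8, 7))
# [1] 3 1 1 3 2 2
```

Groups are numbered by increasing value of x, so in the second call the values 12 and 13 land in group 3 even though 12 comes first in the input.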
EDIT: From @geotheory's comments.

Using two grouping designations to create one 'combined' grouping variable

Given a data.frame:
df <- data.frame(grp1 = c(1,1,1,2,2,2,3,3,3,4,4,4),
grp2 = c(1,2,3,3,4,5,6,7,8,6,9,10))
#> df
# grp1 grp2
#1 1 1
#2 1 2
#3 1 3
#4 2 3
#5 2 4
#6 2 5
#7 3 6
#8 3 7
#9 3 8
#10 4 6
#11 4 9
#12 4 10
Both columns are grouping variables, such that all 1's in column grp1 are known to be grouped together, and so on with all 2's, etc. Then the same goes for grp2. All 1's are known to be the same, all 2's the same.
Thus, if we look at the 3rd and 4th row, based on column 1 we know that the first 3 rows can be grouped together and the second 3 rows can be grouped together. Then since rows 3 and 4 share the same grp2 value, we know that all 6 rows, in fact, can be grouped together.
Based off the same logic we can see that the last six rows can also be grouped together (since rows 7 and 10 share the same grp2).
Aside from writing a fairly involved set of for() loops, is there a more straightforward approach to this? I haven't been able to think of one yet.
The final output that I'm hoping to obtain would look something like:
# > df
# grp1 grp2 combinedGrp
# 1 1 1 1
# 2 1 2 1
# 3 1 3 1
# 4 2 3 1
# 5 2 4 1
# 6 2 5 1
# 7 3 6 2
# 8 3 7 2
# 9 3 8 2
# 10 4 6 2
# 11 4 9 2
# 12 4 10 2
Thank you for any direction on this topic!
I would define a graph and label nodes according to connected components:
gmap = unique(stack(df))
gmap$node = seq_len(nrow(gmap))
oldcols = unique(gmap$ind)
newcols = paste0("node_", oldcols)
df[ newcols ] = lapply(oldcols, function(i) with(gmap[gmap$ind == i, ],
node[ match(df[[i]], values) ]
))
library(igraph)
g = graph_from_edgelist(cbind(df$node_grp1, df$node_grp2), directed = FALSE)
gmap$group = components(g)$membership
df$group = gmap$group[ match(df$node_grp1, gmap$node) ]
grp1 grp2 node_grp1 node_grp2 group
1 1 1 1 5 1
2 1 2 1 6 1
3 1 3 1 7 1
4 2 3 2 7 1
5 2 4 2 8 1
6 2 5 2 9 1
7 3 6 3 10 2
8 3 7 3 11 2
9 3 8 3 12 2
10 4 6 4 10 2
11 4 9 4 13 2
12 4 10 4 14 2
Each unique element of grp1 or grp2 is a node and each row of df is an edge.
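If igraph is not available, the same connected-component labelling can be sketched in base R. This swaps the graph library for a simple label-propagation loop: every row starts with its own label, and rows that share a grp1 or grp2 value repeatedly take the smallest label among them until nothing changes. The names label and old are illustrative.

```r
df <- data.frame(grp1 = c(1,1,1,2,2,2,3,3,3,4,4,4),
                 grp2 = c(1,2,3,3,4,5,6,7,8,6,9,10))

label <- seq_len(nrow(df))            # each row starts as its own component
repeat {
  old <- label
  for (col in c("grp1", "grp2")) {
    # rows sharing a value in this column all take the smallest label among them
    label <- ave(label, df[[col]], FUN = min)
  }
  if (all(label == old)) break        # stable: components are fully merged
}

# renumber the surviving labels as 1, 2, ... in order of first appearance
df$combinedGrp <- match(label, unique(label))
df
```

Labels never increase and only spread between rows that are linked, so the loop stops once every connected component carries a single label. On long chains of links it may need many passes, in which case the igraph version above is likely the better choice.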
One way to do this is via a matrix that defines links between rows based on group membership.
This approach is related to @Frank's graph answer but uses an adjacency matrix rather than using edges to define the graph. An advantage of this approach is that it can deal immediately with more than two grouping columns with the same code (so long as you write the function that determines links flexibly). A disadvantage is that you need to make all pairwise comparisons between rows to construct the matrix, so for very long vectors it could be slow. As is, @Frank's answer would work better for very long data, or if you only ever have two columns.
The steps are:
1. compare rows based on groups and define these rows as linked (i.e., create a graph)
2. determine the connected components of the graph defined by the links in 1.
You could do 2 a few ways. Below I show a brute force way where you 2a) collapse links, till reaching a stable link structure using matrix multiplication and 2b) convert the link structure to a factor using hclust and cutree. You could also use igraph::clusters on a graph created from the matrix.
1. construct an adjacency matrix (matrix of pairwise links) between rows
(i.e., if they are in the same group, the matrix entry is 1, otherwise it's 0). First, I make a helper function that determines whether two rows are linked
linked_rows <- function(data){
## helper function
## returns a _function_ to compare two rows of data
## based on group membership.
## Use Vectorize so it works even on vectors of indices
Vectorize(function(i, j) {
## numeric: 1= i and j have overlapping group membership
common <- vapply(names(data), function(name)
data[i, name] == data[j, name],
FUN.VALUE=FALSE)
as.numeric(any(common))
})
}
which I use in outer to construct a matrix,
rows <- 1:nrow(df)
A <- outer(rows, rows, linked_rows(df))
2a. collapse 2-degree links to 1-degree links. That is, if rows are linked by an intermediate node but not directly linked, lump them in the same group by defining a link between them.
One iteration involves: i) matrix multiply to get the square of A, and
ii) set any non-zero entry in the squared matrix to 1 (as if it were a first degree, pairwise link)
## define as a function to use below
lump_links <- function(A) {
A <- A %*% A
A[A > 0] <- 1
A
}
repeat this till the links are stable
oldA <- 0
i <- 0
while (any(oldA != A)) {
oldA <- A
A <- lump_links(A)
}
2b. Use the stable link structure in A to define groups (connected components of the graph). You could do this a variety of ways.
One way, is to first define a distance object, then use hclust and cutree. If you think about it, we want to define linked (A[i,j] == 1) as distance 0. So the steps are a) define linked as distance 0 in a dist object, b) construct a tree from the dist object, c) cut the tree at zero height (i.e., zero distance):
df$combinedGrp <- cutree(hclust(as.dist(1 - A)), h = 0)
df
In practice you can encode steps 1 - 2 in a single function that uses the helper lump_links and linked_rows:
lump <- function(df) {
rows <- 1:nrow(df)
A <- outer(rows, rows, linked_rows(df))
oldA <- 0
while (any(oldA != A)) {
oldA <- A
A <- lump_links(A)
}
df$combinedGrp <- cutree(hclust(as.dist(1 - A)), h = 0)
df
}
This works for the original df and also for the structure in @rawr's answer
df <- data.frame(grp1 = c(1,1,1,2,2,2,3,3,3,4,4,4,5,5,6,7,8,9),
grp2 = c(1,2,3,3,4,5,6,7,8,6,9,10,11,3,12,3,6,12))
lump(df)
grp1 grp2 combinedGrp
1 1 1 1
2 1 2 1
3 1 3 1
4 2 3 1
5 2 4 1
6 2 5 1
7 3 6 2
8 3 7 2
9 3 8 2
10 4 6 2
11 4 9 2
12 4 10 2
13 5 11 1
14 5 3 1
15 6 12 3
16 7 3 1
17 8 6 2
18 9 12 3
PS
Here's a version using igraph, which makes the connection with @Frank's answer more clear:
lump2 <- function(df) {
rows <- 1:nrow(df)
A <- outer(rows, rows, linked_rows(df))
cluster_A <- igraph::clusters(igraph::graph.adjacency(A))
df$combinedGrp <- cluster_A$membership
df
}
Hope this solution helps you a bit:
Assumption: df is ordered on the basis of grp1.
## split dataset using values of grp1
split_df <- split.default(df$grp2,df$grp1)
parent <- vector('integer',length(split_df))
## find out which combinations have values of grp2 in common
for (i in seq(1,length(split_df)-1)){
for (j in seq(i+1,length(split_df))){
inter <- intersect(split_df[[i]],split_df[[j]])
if (length(inter) > 0){
parent[j] <- i
}
}
}
ans <- vector('list',length(split_df))
index <- which(parent == 0)
## index contains indices of elements that have no element common
for (i in seq_along(index)){
ans[[index[i]]] <- rep(i,length(split_df[[index[i]]]))
}
rest_index <- seq(1,length(split_df))[-index]
for (i in rest_index){
val <- ans[[parent[i]]][1]
ans[[i]] <- rep(val,length(split_df[[i]]))
}
df$combinedGrp <- unlist(ans)
df
grp1 grp2 combinedGrp
1 1 1 1
2 1 2 1
3 1 3 1
4 2 3 1
5 2 4 1
6 2 5 1
7 3 6 2
8 3 7 2
9 3 8 2
10 4 6 2
11 4 9 2
12 4 10 2
Based on https://stackoverflow.com/a/35773701/2152245, I used a different implementation of igraph because I already had an adjacency matrix of sf polygons from st_intersects():
library(igraph)
library(sf)
library(dplyr) # for group_by(), summarize() and %>% used below
# Use example data
nc <- st_read(system.file("shape/nc.shp", package="sf"))
nc <- nc[-sample(1:nrow(nc),nrow(nc)*.75),] #drop some polygons
# Find intersections
b <- st_intersects(nc, sparse = FALSE)
g <- graph.adjacency(b)
clu <- components(g)
gr <- igraph::groups(clu) # dplyr also exports groups(), so name the package
# Quick loop to assign the groups
for(i in 1:nrow(nc)){
for(j in 1:length(gr)){
if(i %in% gr[[j]]){
nc[i,'group'] <- j
}
}
}
# Make a new sfc object
nc_un <- group_by(nc, group) %>%
summarize(BIR74 = mean(BIR74), do_union = TRUE)
plot(nc_un['BIR74'])

How to calculate the # of unique player (when repeat entry is allowed)?

I am trying to calculate the number of unique players in an experiment where each player is allowed to re-enter the game. Here is what the data look like
x <- read.table(header=T, text="group timepast Name NoOfUniquePlayer
1 0.02703 A 1
1 0.02827 B 2
1 0.02874 A 2
1 0.02875 A 2
1 0.02875 D 3
2 0.03255 M 1
2 0.03417 K 2
2 0.10029 T 3
2 0.10394 T 3
2 0.10605 K 3
2 0.16522 T 3
3 0.11938 E 1
3 0.12607 F 2
3 0.13858 E 2
3 0.16084 G 3
3 0.19830 G 3
3 0.24563 V 4")
The original experiment data contain the first 3 columns: the first is the group number of each experiment (3 groups here), the second is the normalized time at which each player joined the experiment (I've sorted this column from smallest to largest), and the third is the name of each player (each player joins only one group).
What I want to generate is the last column, the number of unique players. E.g. for group 1, five entries (A B A A D) are recorded but there are only 3 unique players (A B D): player A started the game (1st row) and re-joined (3rd row) after player B played (2nd row), then player A joined the game again (4th row), and finally player D entered and finished the whole game.
Can anyone help me figure out how to program in R to get this problem solved?
I think this will give you what you want (I think there is an error in your example for group 2)
x$uniquenum <- unlist(
tapply(
x$Name,
x$group,
function(y)
cummax(as.numeric(factor(y,levels=y[!duplicated(y)])))
)
)
group timepast Name NoOfUniquePlayer uniquenum
1 1 0.02703 A 1 1
2 1 0.02827 B 2 2
3 1 0.02874 A 2 2
4 1 0.02875 A 2 2
5 1 0.02875 D 3 3
6 2 0.03255 M 1 1
7 2 0.03417 K 2 2
8 2 0.10029 T 3 3
9 2 0.10394 T 3 3
10 2 0.10605 K 4 3
11 2 0.16522 T 4 3
12 3 0.11938 E 1 1
13 3 0.12607 F 2 2
14 3 0.13858 E 2 2
15 3 0.16084 G 3 3
16 3 0.19830 G 3 3
17 3 0.24563 V 4 4
Slightly more compactly, using data.table:
library(data.table)
DT <- data.table(x)
DT[, uniqueNum := cummax(match(Name,unique(Name))), by = group]
If you want the total number of unique players per group, then:
DT[, totalUnique := max(uniqueNum), by = group]
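For completeness, a base R sketch of the same running count that needs no extra package. It uses ave() over row indices and counts first appearances with cumsum(!duplicated(...)). The data frame below reuses groups 1 and 2 from the question; the column name uniquenum follows the first answer.

```r
x <- data.frame(group = c(1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2),
                Name  = c("A", "B", "A", "A", "D", "M", "K", "T", "T", "K", "T"))

# Within each group, add 1 every time a name appears for the first time
x$uniquenum <- ave(seq_along(x$Name), x$group,
                   FUN = function(i) cumsum(!duplicated(x$Name[i])))
x$uniquenum
# [1] 1 2 2 2 3 1 2 3 3 3 3
```

This assumes the rows are already sorted by time within each group, as in the question.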

Calculating the occurrences of numbers in the subsets of a data.frame

I have a data frame in R which is similar to the following. Actually my real ’df’ dataframe is much bigger than this one here but I really do not want to confuse anybody, so that is why I try to simplify things as much as possible.
So here’s the data frame.
id <-c(1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2,2,3,3,3,3,3,3,3,3,3,3)
a <-c(3,1,3,3,1,3,3,3,3,1,3,2,1,2,1,3,3,2,1,1,1,3,1,3,3,3,2,1,1,3)
b <-c(3,2,1,1,1,1,1,1,1,1,1,2,1,3,2,1,1,1,2,1,3,1,2,2,1,3,3,2,3,2)
c <-c(1,3,2,3,2,1,2,3,3,2,2,3,1,2,3,3,3,1,1,2,3,3,1,2,2,3,2,2,3,2)
d <-c(3,3,3,1,3,2,2,1,2,3,2,2,2,1,3,1,2,2,3,2,3,2,3,2,1,1,1,1,1,2)
e <-c(2,3,1,2,1,2,3,3,1,1,2,1,1,3,3,2,1,1,3,3,2,2,3,3,3,2,3,2,1,3)
df <-data.frame(id,a,b,c,d,e)
df
Basically what I would like to do is to get the occurrences of numbers for each column (a,b,c,d,e) and for each id group (1,2,3) (for this latter grouping see my column ’id’).
So, for column ’a’ and for id number ’1’ (for the latter see column ’id’) the code would be something like this:
as.numeric(table(df[1:10,2]))
##The results are:
[1] 3 7
Just to briefly explain my results: in column ’a’ (and regarding only those records which have number ’1’ in column ’id’) we can say that number '1' occurred 3 times and number '3' occurred 7 times.
Again, just to show you another example. For column ’a’ and for id number ’2’ (for the latter grouping see again column ’id’):
as.numeric(table(df[11:20,2]))
##After running the codes the results are:
[1] 4 3 3
Let me explain a little again: in column ’a’ (and regarding only those observations which have number ’2’ in column ’id’) we can say that number '1' occurred 4 times, number '2' occurred 3 times and number '3' occurred 3 times.
So this is what I would like to do. Calculating the occurrences of numbers for each custom-defined subsets (and then collecting these values into a data frame). I know it is not a difficult task but the PROBLEM is that I’m gonna have to change the input ’df’ dataframe on a regular basis and hence both the overall number of rows and columns might change over time…
What I have done so far is that I have separated the ’df’ dataframe by columns, like this:
for (z in (2:ncol(df))) assign(paste("df",z,sep="."),df[,z])
So df.2 will refer to df$a, df.3 will equal df$b, df.4 will equal df$c etc. But I’m really stuck now and I don’t know how to move forward…
Is there a proper, ”automatic” way to solve this problem?
How about -
> library(reshape)
> dftab <- table(melt(df,'id'))
> dftab
, , value = 1
variable
id a b c d e
1 3 8 2 2 4
2 4 6 3 2 4
3 4 2 1 5 1
, , value = 2
variable
id a b c d e
1 0 1 4 3 3
2 3 3 3 6 2
3 1 4 5 3 4
, , value = 3
variable
id a b c d e
1 7 1 4 5 3
2 3 1 4 2 4
3 5 4 4 2 5
So to get the number of '3's in column 'a' and group '1'
you could just do (the dimensions are ordered id, variable, value)
> dftab[1,'a',3]
[1] 7
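A similar lookup structure can be sketched in base R without reshape, by cross-tabulating id against each value column. Only columns a and b from the question are rebuilt here to keep the example short; counts is an illustrative name.

```r
id <- rep(1:3, each = 10)
a <- c(3,1,3,3,1,3,3,3,3,1,3,2,1,2,1,3,3,2,1,1,1,3,1,3,3,3,2,1,1,3)
b <- c(3,2,1,1,1,1,1,1,1,1,1,2,1,3,2,1,1,1,2,1,3,1,2,2,1,3,3,2,3,2)
df <- data.frame(id, a, b)

# One id-by-value contingency table per column; works for any number of columns
counts <- lapply(df[-1], function(col) table(id = df$id, value = col))

counts$a["1", "3"]   # number of 3s in column a for id 1: 7
counts$a["1", "2"]   # values that never occur in a group are reported as 0
counts$b["2", "1"]   # number of 1s in column b for id 2: 6
```

Indexing by character names ("1", "3") rather than by position avoids mixing up the order of the dimensions, and the code keeps working when rows or columns are added to df.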
A combination of tapply and apply can create the data you want:
tapply(df$id,df$id,function(x) apply(df[df$id==x[1],-1],2,table))
However, when a grouping doesn't have all the elements in it, as in 1a, the result will be a list for that id group rather than a nice table (matrix).
$`1`
$`1`$a
1 3
3 7
$`1`$b
1 2 3
8 1 1
$`1`$c
1 2 3
2 4 4
$`1`$d
1 2 3
2 3 5
$`1`$e
1 2 3
4 3 3
$`2`
a b c d e
1 4 6 3 2 4
2 3 3 3 6 2
3 3 1 4 2 4
$`3`
a b c d e
1 4 2 1 5 1
2 1 4 5 3 4
3 5 4 4 2 5
I'm sure someone will have a more elegant solution than this, but you can cobble it together with a simple function and dlply from the plyr package.
library(plyr) # provides dlply()
ColTables <- function(df) {
counts <- list()
for(a in names(df)[names(df) != "id"]) {
counts[[a]] <- table(df[a])
}
return(counts)
}
results <- dlply(df, "id", ColTables)
This gets you back a list - the first "layer" of the list will be the id variable; the second the table results for each column for that id variable. For example:
> results[['2']]['a']
$a
1 2 3
4 3 3
For id variable = 2, column = a, per your above example.
A way to do it is using the aggregate function, but you have to add a column to your dataframe
> df$freq <- 0
> aggregate(freq~a+id,df,length)
a id freq
1 1 1 3
2 3 1 7
3 1 2 4
4 2 2 3
5 3 2 3
6 1 3 4
7 2 3 1
8 3 3 5
Of course you can write a function to do it, so it's easier to do it frequently, and you don't have to add a column to your actual data frame
> frequency <- function(df,groups) {
+ relevant <- df[,groups]
+ relevant$freq <- 0
+ aggregate(freq~.,relevant,length)
+ }
> frequency(df,c("b","id"))
b id freq
1 1 1 8
2 2 1 1
3 3 1 1
4 1 2 6
5 2 2 3
6 3 2 1
7 1 3 2
8 2 3 4
9 3 3 4
You didn't say how you'd like the data. The by function might give you the output you like.
by(df, df$id, function(x) lapply(x[,-1], table))
