I want to run a for loop that assigns group_ids to a dataset that has two columns, name and location. The goal is to plot large, mutually exclusive and exhaustive networks. It is sort of like contact tracing, where individuals interact with locations and then, as a function of both individual interactions and locations, you get a finite network of people who belong to a group. So A and C both interact with location B. Both A and C are in the same group, but so is every other person who interacted with location B. Likewise, any other location, x1-xn, that A or C or anyone else from location B interacted with is part of the same group. And so forth.
Basically, in the for loop, I want to do the following. First, for the first row in the dataframe, I want to check if the name exists in the initialized name_group_id list. If it does, then assign the group_id associated with that name. If not, then check if the location exists in the location_group_id list. If it does, return that group_id; if it does not, create a new group_id and add this row's name/group_id to name_group_id and its location/group_id to location_group_id.
This code does not work yet, as row three should have group_id 1 because name interacted with location 1 in row 1. But it seems close. Help tweaking this code would be awesome.
Thanks, everyone.
# Create data frame with the sample data
df <- data.frame(name = c("a", "a", "b", "b", "b", "c", "c", "d", "d", "e", "e", "f", "g", "g", "h"),
                 location = c(1, 2, 1, 3, 4, 3, 2, 5, 6, 7, 8, 4, 9, 10, 5))
# Initialize two lists
name_group_id <- list()
location_group_id <- list()
# Counter for creating new group IDs
group_id_counter <- 0
# For loop that implements the logic
for (i in 1:nrow(df)) {
  row <- df[i, ]
  name <- row$name
  location <- row$location
  # Check if the name exists in name_group_id
  if (name %in% names(name_group_id)) {
    group_id <- name_group_id[[name]]
  } else {
    # Check if the location exists in location_group_id
    if (location %in% names(location_group_id)) {
      group_id <- location_group_id[[location]]
    } else {
      group_id_counter <- group_id_counter + 1
      group_id <- paste0("Group ", group_id_counter)
      name_group_id[[name]] <- group_id
      location_group_id[[location]] <- group_id
    }
  }
  # Assign the group_id to the current row
  df[i, "group_id"] <- group_id
}
Right now, this code would return group_id 2 to row three, but it should be group_id 1 because location 1 was already seen in row 1 and assigned group_id 1. Help would be greatly appreciated.
Expected results for the fake data frame supplied are:
name location Group ID
a 1 1
a 2 1
b 1 1
b 3 1
b 4 1
c 3 1
c 2 1
d 5 2
d 6 2
e 7 3
e 8 3
f 4 1
g 9 4
g 10 4
h 5 2
As Gregor Thomas mentioned, this is the same as determining connected components in graphs. Another term for this is equivalence groups. The code below is a bit of a hack using an internal function of one of my packages (on CRAN); you could do this using the exported functions, but then you have to convert the data into the right form; this is easier:
First make a list of all node 'identifiers':
x <- c(unique(df$name), unique(df$location))
Determine which nodes belong to the same group
res <- reclin2:::equivalence(x = x,
df$name, df$location)
The result res is a vector with identifiers for each group for each value in x. To get the end result:
res[match(df$name, x)]
Edit: the OP mentioned that the datasets are large; this should work on large datasets.
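For readers who prefer not to rely on an internal package function, here is a minimal base R sketch of the same connected-components idea using a simple union-find. It is not the reclin2 implementation; the helper find_root and the "n_"/"l_" key prefixes are names invented for this illustration, and the sketch assumes each row of df is an edge between a name and a location.

```r
# Sample data from the question
df <- data.frame(
  name     = c("a", "a", "b", "b", "b", "c", "c", "d", "d", "e", "e", "f", "g", "g", "h"),
  location = c(1, 2, 1, 3, 4, 3, 2, 5, 6, 7, 8, 4, 9, 10, 5)
)

# Prefix the keys so that a name can never collide with a location
name_key <- paste0("n_", df$name)
loc_key  <- paste0("l_", df$location)
nodes    <- unique(c(name_key, loc_key))

# parent[i] is the index of node i's parent; a root is its own parent
parent <- seq_along(nodes)

# Hypothetical helper: follow parent pointers up to the root of node i
find_root <- function(i) {
  while (parent[i] != i) i <- parent[i]
  i
}

# Every row is an edge between a name and a location: merge their sets
for (k in seq_len(nrow(df))) {
  a <- find_root(match(name_key[k], nodes))
  b <- find_root(match(loc_key[k], nodes))
  if (a != b) parent[max(a, b)] <- min(a, b)
}

# Root of each row's name, renumbered 1, 2, ... in order of first appearance
roots <- vapply(match(name_key, nodes), find_root, integer(1))
df$group_id <- match(roots, unique(roots))
df
```

On the sample data this reproduces the expected Group ID column from the question. The sketch does no path compression or union by rank, so for very large data a dedicated graph package is likely the better choice.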
Suppose I have the following directed acyclic graph (DAG) with each node having a weight of 1.
I am interested in calculating the accumulated sum of each node based on the value of its ancestor. Assuming as I said earlier that the weight of each node is 1, then this is what I would expect to get
This is what I tried to do:
library(tidygraph, quietly = TRUE)
library(tidyverse)
library(ggraph)
# create adjacencies
grafo_df <- tribble(
  ~from, ~to,
  "C", "A",
  "C", "B",
  "A", "D",
  "B", "D")
# create the graph
grafo <- as_tbl_graph(grafo_df)
# calculate accumulated sum
grafo %>%
  arrange(node_topo_order()) %>%
  mutate(
    revenue = 1,
    cum_weight = map_dfs(1, .f = function(node, path, ...) {
      sum(.N()$revenue[c(node, path$node)])
    })) %>%
  as_tibble() %>%
  unnest("cum_weight")
#> # A tibble: 4 x 3
#> name revenue cum_weight
#> <chr> <dbl> <dbl>
#> 1 C 1 1
#> 2 A 1 2
#> 3 B 1 2
#> 4 D 1 3
Created on 2021-05-13 by the reprex package (v2.0.0)
As you can see, the accumulated sum of D comes out as 3 and not 4; the value of D should be the sum of the accumulated values of A and B. I do not understand why D does not add up to 4.
I have tried to understand the solution given here, but had a hard time understanding it
How can I get the accumulated sum?
Update # 1
I am not concerned (for the moment) with the complexity of the algorithm, that is, if the algorithm does it in O(V + E) it is not relevant.
Something important mentioned in this question is the problem of double counting: the partial sum for A is C(1) + A(1) = 2, and the partial sum for B is C(1) + B(1) = 2, so one could argue that the value of D cannot be the sum of the partial sums A(2) + B(2), because C would be counted twice. I think that objection does not apply in this situation, for the following reason:
Let's imagine that each of these 4 nodes (A, B, C and D) are internet nodes that generate revenue of $1 each, so the total accumulated income of the 4 nodes would be $4. If D is the convergence node of the rest of nodes, then in a scenario where D stops working, the income of the remaining nodes and that of D would no longer be possible, therefore, its value is $4.
Update # 2
If I add a new path from C to D, the value of D should still be 4, because the number of nodes it depends on is unchanged; what should matter in the accumulated sum is the number of dependent nodes. For example, in the solution proposed by @ThomasIsCoding, if I add this new path, the value of D becomes 5. I think this is partly because their algorithm uses the degrees as a parameter to calculate the cumulative sum; however, if I add an additional node, the calculation is correct.
Update # 3
The example that I have placed is simple with the intention that it is easy to understand the objective, however, I did not specify that it should be generalizable for a graph with many nodes with three different topologies. The outermost layers are trees, the middle layers are rings, and the innermost layer is a full mesh.
Here is an igraph option using distances() with the argument mode = "in":
If your nodes are unweighted, i.e., revenue=1 for all nodes
library(igraph)
library(dplyr)
# grafo_df is defined in the Data section below
g <- graph_from_data_frame(grafo_df)
data.frame(name = names(V(g))) %>%
mutate(revenue = 1) %>%
mutate(cum_weight = rowSums((!is.infinite(distances(g, mode = "in"))) %*% diag(revenue)))
which gives you
name revenue cum_weight
1 C 1 1
2 A 1 3
3 B 1 2
4 F 1 1
5 D 1 5
If your nodes are weighted, e.g.,
data.frame(name = names(V(g))) %>%
mutate(revenue = 1:n()) %>%
mutate(cum_weight = rowSums((!is.infinite(distances(g, mode = "in"))) %*% diag(revenue)))
which gives you
name revenue cum_weight
1 C 1 1
2 A 2 7
3 B 3 4
4 F 4 4
5 D 5 15
Data
grafo_df <- tribble(
~from, ~to,
"C", "A",
"C", "B",
"A", "D",
"C", "D",
"B", "D",
"F", "A"
)
and the DAG by plot(g) is given as
Now that the question is clear, I can propose an algorithm. I cannot code it, since I don't know the language you are using.
For each node Ni in the graph we calculate the set of ancestors Ai; the accumulated sum for each node is then |Ai| + 1.
1. Initialize all nodes with an empty ancestor set Ai = {}
2. Start with a set S0 containing all nodes with no incoming edges
3. Initialize the next set Sn+1
4. Iterate over Sn; for each node N:
   - For all nodes D with an incoming edge from N:
     - Merge the ancestor set of D with the ancestor set of N plus N itself
     - Remove the edge N->D
     - If D has no other incoming edges, add it to Sn+1
5. If Sn+1 is not empty, increase the pass to n+1 and repeat from step 3.
The main limitation of this solution is its complexity; I'll try to find an optimized version later.
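Since the rest of the thread is in R, the steps above can be sketched in base R as follows. This is one reading of the algorithm rather than the answerer's own code: the variable names (ancestors, indegree, current, next_set) are made up for the illustration, "removing" an edge is modelled by decrementing a remaining in-degree counter, and the edge list is the one from the "Data" section of the igraph answer, with unit weights.

```r
# Edge list from the "Data" section (includes C -> D and F -> A)
edges <- data.frame(
  from = c("C", "C", "A", "C", "B", "F"),
  to   = c("A", "B", "D", "D", "D", "A"),
  stringsAsFactors = FALSE
)
nodes <- unique(c(edges$from, edges$to))

# Every node starts with an empty ancestor set
ancestors <- setNames(lapply(nodes, function(n) character(0)), nodes)

# Number of incoming edges not yet processed, per node
indegree <- vapply(nodes, function(n) sum(edges$to == n), integer(1))

# Start from the nodes without incoming edges
current <- nodes[indegree == 0]

while (length(current) > 0) {
  next_set <- character(0)
  for (n in current) {
    for (d in edges$to[edges$from == n]) {
      # Merge: d inherits n and all of n's ancestors
      ancestors[[d]] <- union(ancestors[[d]], c(ancestors[[n]], n))
      # "Remove" the edge n -> d by lowering d's remaining in-degree
      indegree[[d]] <- indegree[[d]] - 1L
      if (indegree[[d]] == 0L) next_set <- c(next_set, d)
    }
  }
  current <- next_set
}

# With unit weights: number of ancestors plus the node itself
cum_weight <- lengths(ancestors) + 1L
cum_weight
```

For this graph the result is C = 1, A = 3, B = 2, F = 1, D = 5, which agrees with the unweighted output of the igraph answer. Because ancestors are stored as sets, the extra C -> D edge does not count C twice.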
I have a data frame with Column1, which can take the value of any letter of the alphabet. I want to create a second column that spells out the number corresponding to that letter. I am trying to do this with an if then statement... But keep getting an error. Sorry this is a simple question but I have tried the R for dummies website http://www.dummies.com/how-to/content/how-to-use-if-statements-in-r.html with no luck!
x$Column2 <- NULL
if (x$Column1 == "A") then[x$Column2 <- "One"]
The best way to do this is create a reference table:
>Reference = data.frame(Number = c("One", "Two", "Three", "Four"), Letter = c("A", "B", "C", "D"))
> Reference
Number Letter
1 One A
2 Two B
3 Three C
4 Four D
> Data = data.frame(Letter = c("B", "B", "C", "A", "D"))
> Data
Letter
1 B
2 B
3 C
4 A
5 D
Then you can find the indices:
> Indices = sapply(Data$Letter, function(x) which(x == Reference$Letter))
> Indices
[1] 2 2 3 1 4
And use them to create the column
> Data$Number = Reference[Indices,]$Number
> Data
Letter Number
1 B Two
2 B Two
3 C Three
4 A One
5 D Four
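The same lookup can also be written without sapply, by indexing a named vector. This is a minimal sketch under the assumption that the asker's data frame x has a column Column1 holding the letters; the lookup vector and the sample values are made up for the illustration.

```r
# Hypothetical data standing in for the asker's data frame `x`
x <- data.frame(Column1 = c("B", "B", "C", "A", "D"), stringsAsFactors = FALSE)

# Named vector: names are the letters, values the spelled-out numbers
lookup <- c(A = "One", B = "Two", C = "Three", D = "Four")

# Index by name; as.character() guards against Column1 being a factor
x$Column2 <- unname(lookup[as.character(x$Column1)])
x
```

Letters that are missing from lookup would come back as NA rather than raising an error.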
As I understand it, what you want to do here is like creating a dummy variable. Try
> x$dummy <- as.numeric(x$Column1 != "A")
and you should get 0 for all A's and 1 for other values.
Look at Generate a dummy-variable for further information.
Let's say I have this data frame A :
A = data.frame(first=c("a", "b","c", "d"), second=c(1, 2, 3, 4))
first second
1 a 1
2 b 2
3 c 3
4 d 4
And I have this data frame B :
B = data.frame(first=c("x", "a", "c"), second=c(1, 4, 3))
first second
1 x 1
2 a 4
3 c 3
I want to count the number of times a pair of the data frame B (B$first, B$second) is in the data frame A. The counting part is not the problem, I just can't find the function to determine whether a pair is in a data frame.
The result would be that only c("c", 3) is an element of A, so it should be 1. Both "a" and 4 are in data frame A, but the pair c("a", 4) does not exist in data frame A, so I don't want to count it. I want an exact match.
I'm looking for a function like %in% that could work for pairs.
Thanks for your help
Maybe something like this
apply(B, 1, function(r, A){ sum(A$first==r[1] & A$second==r[2]) }, A)
Basically, what it does is the following: for every row of B it applies a function that inspects which elements of A are in accordance with row r from B (part A$first==r[1] & A$second==r[2]) and then sums obtained logicals to derive the number of rows in A that are in accordance with row r.
If you also want grouping it can easily be done with dplyr like this
cbind(B, tmp) %>% group_by(first, second) %>% summarise(n = max(tmp))
where tmp is a variable representing the result of the aforementioned apply
Here's an alternative: rbind your data.frames together and use duplicated.
AB <- do.call(rbind, mget(c("A", "B")))
AB$ind <- as.numeric(duplicated(AB))
AB[grep("^B", rownames(AB)), ]
# first second ind
# B.1 x 1 0
# B.2 a 4 0
# B.3 c 3 1
You can also probably try to use "digest" to generate a hash for each row, but I'm not sure how efficient this would be:
library(digest)
Reduce(function(x, y) y %in% x,
lapply(mget(c("A", "B")), function(x)
apply(x, 1, digest)))
# [1] FALSE FALSE TRUE
An alternative is to merge by row, e.g. mB <- apply(B, 1, function(j) paste0(j[1], "_", j[2])), and similarly for A, at which point you can loop over mB[j] %in% mA[k]
Not that I would really recommend doing this :-)
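The pasted-key idea does not need a loop, because %in% is already vectorised. Here is a minimal sketch using the data frames from the question; the helper key and the "\r" separator are assumptions made for the illustration (any separator that cannot occur in the data would do).

```r
A <- data.frame(first = c("a", "b", "c", "d"), second = c(1, 2, 3, 4))
B <- data.frame(first = c("x", "a", "c"), second = c(1, 4, 3))

# Hypothetical helper: one key per row; "\r" is assumed never to occur in the data
key <- function(d) paste(d$first, d$second, sep = "\r")

# TRUE for each row of B whose (first, second) pair also occurs in A
in_A <- key(B) %in% key(A)
in_A
sum(in_A)
```

Here only the third row of B, c("c", 3), matches, so the count is 1.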
I have a dataframe with categories and values. Based on the category I want to subtract values that are stored in another table.
myframe <- data.frame(
x = factor(c("A", "D", "A", "C")),
y = c(8, 3, 9, 9))
reference <- c('A'= 1, 'B'= 2, 'C'= 3, 'D'= 4)
The desired (y-ref) outcome would be:
result <- data.frame(
x = factor(c("A", "D", "A", "C")),
y = c(8, 3, 9, 9),
r = c(7, -1, 8, 6))
x y r
1 A 8 7
2 D 3 -1
3 A 9 8
4 C 9 6
The reference 'table' is a named vector in this case but it could be changed to a better suited data format.
I am not sure how to accomplish this...
This is a fairly straightforward task using match and [...
myframe$r <- myframe$y - reference[ match( myframe$x , names( reference ) ) ]
# x y r
#1 A 8 7
#2 D 3 -1
#3 A 9 8
#4 C 9 6
Pretty sure this is a (several-times over) duplicate so we should find you a good pointer and close the question (but I commend you for showing input data and desired result, many questions are often not that well laid out).
EDIT
Well there are many, many match based questions on the site. It's hard to pick one to point to as an exact duplicate. But I suggest having a browse of a few of these by searching for "r match" (you can search by specific tags by enclosing the search term in square brackets like this "[r]").
The data.table way:
library(data.table)
# convert to data.table and set key for the upcoming merge
dt = data.table(myframe, key = 'x')
ref = data.table(x = names(reference), val = reference)
# merge and add a new column
dt[ref, r := y - val]
dt
# x y r
#1: A 8 7
#2: A 9 8
#3: C 9 6
#4: D 3 -1
@Aniko points out that one way to view my problem is that I need to find the connected components of a graph, where the vertices are the groups and the variables group and nominated_group indicate an edge between two groups. My goal is to create a variable parent_group which indexes the connected components. Or, as I put it before:
I have a dataframe with four variables: ID, group, nominated_ID, and nominated_group.
Consider sister-groups: Groups A and B are sister-groups if there is at least one case in the data where group==A and nominated_group==B, or vice versa.
I would like to create a variable parent_group which takes on a unique value for each set of sister-groups. In other words, no nominations should occur between cases in different parent_groups. Making the parent_group sequential numbers seems like a good idea.
Many thanks for the help I already received here! I can't really contribute here but note that I try to pay it forward at stats.exchange and on wikipedia.
In my fake data, A and B are sister-groups. Either of the cases in rows 4 and 5 (ID=4 or ID=3) is sufficient to make this true. Each group is also its own sister-group. The goal, the creation of parent_group, should result in one parent_group for all cases in A or B, and another parent_group for group C.
df <- data.frame(ID = c(9, 5, 2, 4, 3, 7),
group = c("A", "A", "B", "B", "A", "C"),
nominated_ID = c(9, 8, 4, 9, 2, 7) )
df$nominated_group <- with(df, group[match(nominated_ID, ID)])
df
ID group nominated_ID nominated_group
1 9 A 9 A
2 5 A 8 <NA>
3 2 B 4 B
4 4 B 9 A
5 3 A 2 B
6 7 C 7 C
Consider a graph with the groups as its vertices and the edges indicating that the two groups occur for the same ID. Then I think you are looking for connected components of this graph. The following is a quick and dirty (and probably not optimal) implementation of this idea using the graph package:
library(graph)
#make some fake data
nom <- data.frame(group = c("A","A","A","B","B","C","C"),
group2 = c("A","A","B","B","A","C","C"),
stringsAsFactors=FALSE)
#remove duplicated pairs
#it will keep A-B distinct from B-A, could probably be fixed
nom1 <- nom[!duplicated(nom),]
#define empty graph
grps <- union(unique(nom$group), unique(nom$group2))
gg <- new("graphNEL", nodes=grps, edgeL=list())
#add an edge for every pair
for (i in 1:nrow(nom1)) gg <- addEdge(nom1$group[i], nom1$group2[i], gg, 1)
#find connected components
cc <- connComp(gg)
#assign parent by matching within cc
nom$parent <- apply(nom, 1,
function(x) which(sapply(cc, function(y) x["group"] %in% y)))
nom
group group2 parent
1 A A 1
2 A A 1
3 A B 1
4 B B 1
5 B A 1
6 C C 2
7 C C 2