Simple data-manipulation in R - r

@Aniko points out that one way to view my problem is that I need to find the connected components of a graph, where the vertices are the groups, and the variables group and nominated_group indicate an edge between those two groups. My goal is to create a variable parent_group which indexes the connected components. Or as I put it before:
I have a dataframe with four variables: ID, group, nominated_ID, and nominated_group.
Consider sister-groups: Groups A and B are sister-groups if there is at least one case in the data where group==A and nominated_group==B, or vice versa.
I would like to create a variable parent_group which takes on a unique value for each set of sister-groups. In other words, no nominations should occur between cases in different parent_groups. Making the parent_group sequential numbers seems like a good idea.
Many thanks for the help I already received here! I can't really contribute here but note that I try to pay it forward at stats.exchange and on wikipedia.
In my fake data, A and B are sister-groups. Either case ID=4 (group B nominating into A) or case ID=3 (group A nominating into B) is sufficient to make this true. Each group is also its own sister-group. The goal, the creation of parent_group, should result in one parent_group for all cases in A or B, and another parent_group for group C.
df <- data.frame(ID = c(9, 5, 2, 4, 3, 7),
group = c("A", "A", "B", "B", "A", "C"),
nominated_ID = c(9, 8, 4, 9, 2, 7) )
df$nominated_group <- with(df, group[match(nominated_ID, ID)])
df
ID group nominated_ID nominated_group
1 9 A 9 A
2 5 A 8 <NA>
3 2 B 4 B
4 4 B 9 A
5 3 A 2 B
6 7 C 7 C

Consider a graph with the groups as its vertices and the edges indicating that the two groups occur for the same ID. Then I think you are looking for connected components of this graph. The following is a quick and dirty (and probably not optimal) implementation of this idea using the graph package:
library(graph)
#make some fake data
nom <- data.frame(group = c("A","A","A","B","B","C","C"),
group2 = c("A","A","B","B","A","C","C"),
stringsAsFactors=FALSE)
#remove duplicated pairs
#it will keep A-B distinct from B-A, could probably be fixed
nom1 <- nom[!duplicated(nom),]
#define empty graph
grps <- union(unique(nom$group), unique(nom$group2))
gg <- new("graphNEL", nodes=grps, edgeL=list())
#add an edge for every pair
for (i in 1:nrow(nom1)) gg <- addEdge(nom1$group[i], nom1$group2[i], gg, 1)
#find connected components
cc <- connComp(gg)
#assign parent by matching within cc
nom$parent <- apply(nom, 1,
function(x) which(sapply(cc, function(y) x["group"] %in% y)))
nom
group group2 parent
1 A A 1
2 A A 1
3 A B 1
4 B B 1
5 B A 1
6 C C 2
7 C C 2
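As a rough sketch of how the same connected-components idea could be applied directly to the asker's df without extra packages, the block below builds a group-by-group adjacency matrix and closes it transitively (Warshall's algorithm). This is a swapped-in technique, not what connComp does internally, and the names adj, first_member and parent_of_group are made up for illustration. It assumes that nominations pointing outside the data (NA in nominated_group) can simply be skipped.

```r
df <- data.frame(ID = c(9, 5, 2, 4, 3, 7),
                 group = c("A", "A", "B", "B", "A", "C"),
                 nominated_ID = c(9, 8, 4, 9, 2, 7),
                 stringsAsFactors = FALSE)
df$nominated_group <- with(df, group[match(nominated_ID, ID)])

grps <- sort(unique(df$group))

# adj[i, j] is TRUE when groups i and j are sister-groups;
# the diagonal starts TRUE because each group is its own sister-group
adj <- diag(length(grps)) > 0
dimnames(adj) <- list(grps, grps)

# Assumption: nominations that leave the data (NA) are ignored
ok <- !is.na(df$nominated_group)
pairs <- cbind(df$group[ok], df$nominated_group[ok])
adj[pairs] <- TRUE                        # index by (row name, column name)
adj[pairs[, 2:1, drop = FALSE]] <- TRUE   # nominations count in both directions

# Warshall's algorithm: i and j end up TRUE if any chain of sister-groups links them
for (k in seq_along(grps)) {
  adj <- adj | outer(adj[, k], adj[k, ], "&")
}

# Label each group by the first member of its component, then renumber 1, 2, ...
first_member <- apply(adj, 1, function(r) which(r)[1])
parent_of_group <- match(first_member, unique(first_member))
df$parent_group <- parent_of_group[match(df$group, grps)]
df
```

The matrix is quadratic in the number of groups, so this is only meant for a modest number of distinct groups.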

Related

for loop to determine mutually exclusive/exhaustive network membership

I want to run a for loop that assigns group_ids to a dataset that has two columns, name and location. The goal is to plot large mutually exclusive and exhaustive networks. It is sort of like contact tracing, where individuals interact with locations and then, as a function of both individual interaction and locations, you get a finite network of people who belong to a group. So A and C both interact with location B. Both A and C are in the same group, but so is every other person who interacted with location B. Likewise, any other locations, x1-xn, that interacted with A or C or anyone else from location B are part of the same group. And so forth.
Basically, in the for loop, I want to do the following. First, for the first row in the dataframe, I want to check if the name exists in the name_group_id list initialized. If it does, then assign the group_id associated with that name. If not, then check if the location exists in the location_group_id list. If it does, return that group_id, if it does not, then create a new group_id and add this row's name/group_id to the name_group_id and the location/group_id to the location_group_id
This code does not work yet, as row three should have group_id 1 because name interacted with location 1 in row 1. But it seems close. Help tweaking this code would be awesome.
Thanks, everyone.
# Create data frame with the sample data
df <- data.frame(name = c("a", "a", "b", "b", "b", "c", "c", "d", "d", "e", "e", "f", "g", "g", "h"),
location = c(1, 2, 1, 3, 4, 3, 2, 5, 6, 7, 8, 4, 9, 10, 5))
# Initialize two lists
name_group_id <- list()
location_group_id <- list()
# Counter for creating new group IDs
group_id_counter <- 0
# For loop that implements the logic
for (i in 1:nrow(df)) {
  row <- df[i, ]
  name <- row$name
  location <- row$location
  # Check if the name exists in name_group_id
  if (name %in% names(name_group_id)) {
    group_id <- name_group_id[[name]]
  } else {
    # Check if the location exists in location_group_id
    if (location %in% names(location_group_id)) {
      group_id <- location_group_id[[location]]
    } else {
      group_id_counter <- group_id_counter + 1
      group_id <- paste0("Group ", group_id_counter)
      name_group_id[[name]] <- group_id
      location_group_id[[location]] <- group_id
    }
  }
  # Assign the group_id to the current row
  df[i, "group_id"] <- group_id
}
Right now, this code would return group_id 2 to row three, but it should be group_id 1 because location 1 was already seen in row 1 and assigned group_id 1. Help would be greatly appreciated.
Expected results for the fake data frame supplied are:
name location Group ID
a 1 1
a 2 1
b 1 1
b 3 1
b 4 1
c 3 1
c 2 1
d 5 2
d 6 2
e 7 3
e 8 3
f 4 1
g 9 4
g 10 4
h 5 2
As Gregor Thomas mentioned, this is the same as determining connected components in graphs. Another term for this is equivalence groups. The code below is a bit of a hack using an internal function of one of my packages (on CRAN); you could do this using the external functions, but then you have to convert the data into the right form; this is easier:
First make a list of all node 'identifiers':
x <- c(unique(df$name), unique(df$location))
Determine which nodes belong to the same group
res <- reclin2:::equivalence(x = x,
df$name, df$location)
The result res is a vector with identifiers for each group for each value in x. To get the end result:
res[match(df$name, x)]
Edit: the OP mentioned that the datasets are large; this should work on large datasets.
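For readers who would rather not depend on an internal function, here is a minimal base R sketch of the same connected-components idea using a small union-find (disjoint-set) structure. It is not the reclin2 implementation; the helper find_root and the "n_"/"l_" prefixes are hypothetical names introduced here so that a name can never collide with a location.

```r
df <- data.frame(
  name = c("a", "a", "b", "b", "b", "c", "c", "d", "d", "e", "e", "f", "g", "g", "h"),
  location = c(1, 2, 1, 3, 4, 3, 2, 5, 6, 7, 8, 4, 9, 10, 5),
  stringsAsFactors = FALSE)

# Prefix both columns so names and locations live in one shared node list
n_nodes <- paste0("n_", df$name)
l_nodes <- paste0("l_", df$location)
nodes <- unique(c(n_nodes, l_nodes))

# parent[i] is the index of node i's parent; a root points at itself
parent <- seq_along(nodes)
find_root <- function(i) {
  while (parent[i] != i) i <- parent[i]
  i
}

# Every row links one name node to one location node: merge their sets
a <- match(n_nodes, nodes)
b <- match(l_nodes, nodes)
for (k in seq_len(nrow(df))) {
  ra <- find_root(a[k])
  rb <- find_root(b[k])
  if (ra != rb) parent[max(ra, rb)] <- min(ra, rb)
}

# Number the components in order of first appearance in the data
roots <- vapply(a, find_root, integer(1))
df$group_id <- match(roots, unique(roots))
df
```

On the sample data this reproduces the expected Group ID column from the question. The sketch skips path compression and union by rank, so very large inputs may want a dedicated graph package instead.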

Finding the cumulative sum of the value of nodes in a DAG

Suppose I have the following directed acyclic graph (DAG) with each node having a weight of 1.
I am interested in calculating the accumulated sum of each node based on the value of its ancestor. Assuming as I said earlier that the weight of each node is 1, then this is what I would expect to get
This is what I tried to do:
library(tidygraph, quietly = TRUE)
library(tidyverse)
library(ggraph)
# create adjacencies
grafo_df <- tribble(
~from, ~to,
"C", "A",
"C", "B",
"A", "D",
"B", "D")
# create the graph
grafo <- as_tbl_graph(grafo_df)
# calculate accumulated sum
grafo %>%
arrange(node_topo_order()) %>%
mutate(
revenue = 1,
cum_weight = map_dfs(1, .f = function(node, path, ...) {
sum(.N()$revenue[c(node, path$node)])
})) %>%
as_tibble() %>%
unnest("cum_weight")
#> # A tibble: 4 x 3
#> name revenue cum_weight
#> <chr> <dbl> <dbl>
#> 1 C 1 1
#> 2 A 1 2
#> 3 B 1 2
#> 4 D 1 3
Created on 2021-05-13 by the reprex package (v2.0.0)
As you can see, the accumulated sum of D comes out as 3 and not 4, even though the value of D should be the sum of the accumulated values of A and B. I do not understand why D does not add up to 4.
I have tried to understand the solution given here, but had a hard time understanding it
How can I get the accumulated sum?
Update # 1
I am not concerned (for the moment) with the complexity of the algorithm, that is, if the algorithm does it in O(V + E) it is not relevant.
Something important that is mentioned in this question is the problem of counting twice: the partial sum of A is C(1) + A(1) = 2 and the partial sum of B is C(1) + B(1) = 2, so taking the value of D to be the partial sums of A (2) + B (2) would count the value of C twice. I think this objection does not apply in this situation, for the following reason:
Let's imagine that each of these 4 nodes (A, B, C and D) are internet nodes that generate revenue of $1 each, so the total accumulated income of the 4 nodes would be $4. If D is the convergence node of the rest of nodes, then in a scenario where D stops working, the income of the remaining nodes and that of D would no longer be possible, therefore, its value is $4.
Update # 2
If I add a new path from C to D, then the value of D should still be 4 because the number of dependent nodes is unchanged; that is, what should matter in the accumulated sum is the number of dependent nodes. For example, in the solution proposed by @ThomasIsCoding, if I add this new path, the value of D is now 5. I think this is partly because their algorithm uses the degrees as a parameter to calculate the cumulative sum; however, if I add an additional node, then the calculation is correct.
Update # 3
The example that I have placed is simple with the intention that it is easy to understand the objective, however, I did not specify that it should be generalizable for a graph with many nodes with three different topologies. The outermost layers are trees, the middle layers are rings, and the innermost layer is a full mesh.
Here is an igraph option using distances() with the argument mode = "in".
If your nodes are unweighted, i.e., revenue=1 for all nodes
library(igraph)
library(dplyr)
g <- graph_from_data_frame(grafo_df)
data.frame(name = names(V(g))) %>%
mutate(revenue = 1) %>%
mutate(cum_weight = rowSums((!is.infinite(distances(g, mode = "in"))) %*% diag(revenue)))
which gives you
name revenue cum_weight
1 C 1 1
2 A 1 3
3 B 1 2
4 F 1 1
5 D 1 5
If your nodes are weighted, e.g.,
data.frame(name = names(V(g))) %>%
mutate(revenue = 1:n()) %>%
mutate(cum_weight = rowSums((!is.infinite(distances(g, mode = "in"))) %*% diag(revenue)))
which gives you
name revenue cum_weight
1 C 1 1
2 A 2 7
3 B 3 4
4 F 4 4
5 D 5 15
Data
grafo_df <- tribble(
~from, ~to,
"C", "A",
"C", "B",
"A", "D",
"C", "D",
"B", "D",
"F", "A"
)
and the DAG by plot(g) is given as
Now the question is clear, so I propose an algorithm. I cannot code it, since I don't know the language that you are using.
For each node Ni in the graph we will calculate the set of ancestors Ai, then the accumulated sum for each node will be |Ai| + 1.
Initialize all nodes with an empty ancestor set Ai = {}
Start with a set S0 containing all nodes with no incoming edges
Initialize the next set Sn+1
Iterate over Sn, for each node N:
  For all nodes D with an incoming edge from N:
    Merge the ancestor set of D with the ancestor set of N plus N itself
    Remove the edge N->D
    If D has no other incoming edges, add it to Sn+1
If Sn+1 is not empty, increase pass to n+1 and repeat from 2.
The big limitation of this solution is its complexity; I'll try later to find an optimized solution.
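The ancestor-set procedure described above might look roughly like this in base R. It is only a sketch under the assumption that every node has weight 1 and that the graph is acyclic; the variable names (ancestors, frontier, remaining) are illustrative and not taken from any package. The edge list is the four-node example from the question.

```r
edges <- data.frame(from = c("C", "C", "A", "B"),
                    to   = c("A", "B", "D", "D"),
                    stringsAsFactors = FALSE)
nodes <- unique(c(edges$from, edges$to))

# One (initially empty) ancestor set per node
ancestors <- setNames(rep(list(character(0)), length(nodes)), nodes)

remaining <- edges                     # edges that have not been processed yet
frontier <- setdiff(nodes, edges$to)   # S0: nodes with no incoming edges

while (length(frontier) > 0) {
  next_frontier <- character(0)
  for (n in frontier) {
    children <- remaining$to[remaining$from == n]
    # Remove every edge n -> child before looking at the children
    remaining <- remaining[remaining$from != n, , drop = FALSE]
    for (d in children) {
      # Ancestors of d: what it already had, plus n and everything above n
      ancestors[[d]] <- union(ancestors[[d]], c(ancestors[[n]], n))
      # d is ready once none of its incoming edges are left
      if (!(d %in% remaining$to)) next_frontier <- union(next_frontier, d)
    }
  }
  frontier <- next_frontier
}

# With unit weights the accumulated sum is |ancestors| + 1
cum_weight <- lengths(ancestors) + 1
cum_weight
```

Because the result depends on sets of ancestors rather than on paths, adding a direct C -> D edge to this example leaves the value of D at 4, which is the behaviour asked for in Update # 2.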

Iterate over combinations from one row in each index by group in r

I have a dataset (example) as below,
data <- data.frame(pc = c("A","A","A","A","A","A", "B","B","B","B"), #categorical
index = c(1, 1, 2, 2, 2, 3, 4, 5, 5, 5), #categorical
g= c(1, 2, 4, 3, 6, 7, 8, 5, 9, 3), #numeric
h= c(1, 1, 1, 2, 2, 3, 3, 3, 3, 4)) #categorical
I want to group by 'pc', iterate over all combinations based on 'index' to get the summation of values in 'g' and number of categories in 'h' columns, and keep the rows of the combination that yields the highest summation value from 'g' + number of categories from 'h'.
For example, in pc=A group, index=1 has two rows, index=2 has three, index=3 has one, so in total I have 2x3x1= 6 combinations (each combination has three rows, one with index=1, one with index=2, one with index=3). I want to keep the rows (one row from each unique index) that yields the highest (summation value from 'g' + number of categories from 'h'). The number of index and length of each index are all different in each pc group.
Just an example to visualise the combination for pc=A group,
combination sum_of_values_in_g number_of_categories_in_h
#1 12 2
#2 11 3
#3 14 3
#4 13 2
#5 12 3
#6 15 3
My desired result in this example will be
pc index g h
A 1 2 1
A 2 6 2
A 3 7 3
B 4 8 3
B 5 9 3
I have done some research on how to get combinations (Iterate over unique combination of groups in a data frame, How to iterate through all combinations of columns and apply function by group in R? and Combinations by group in R), but I couldn't figure out how to get the right combination in each group and run further operations on each combination. Any input or direction will be appreciated!
Here is a brute force solution. The run time could be really long given a large dataset.
We need functions from these packages:
library(tidyr)
library(dplyr)
library(purrr)
This is the first step, we need a function to first split your data into several groups (split(transpose(df), df[[split_by]])), then find all possible row combinations across them (cross(...)), and finally merge each of them into a single dataframe (lapply(..., bind_rows)).
perm_all <- function(df, split_by){
lapply(cross(split(transpose(df), df[[split_by]])), bind_rows)
}
(transpose turns an n-row dataframe into an n-element list of single-row dataframes)
This is the second step, we loop through all dataframes in that list to see which one satisfies your requirements.
which_max <- function(ls_of_df, numer, categ) {
test_stats <- vapply(
ls_of_df,
function(df) {
temp <- length(unique(df[[categ]]))
c(sum(df[[numer]]) + temp, temp)
},
double(2L)
)
# You could have multiple maxima for those sums
out <- which(test_stats[1L, ] == max(test_stats[1L, ]))
# but after the second test (i.e. find the greatest number of categories), you should have one and only one combination left
out[[which.max(test_stats[2L, out])]]
}
Now, we use a single function to perform these two steps.
max_of_all_perm <- function(df, group_var, numer, categ) {
l <- perm_all(df, group_var)
l[[which_max(l, numer, categ)]]
}
And run it across all groups defined by pc
data %>%
nest(data = -pc) %>%
mutate(data = lapply(data, max_of_all_perm, "index", "g", "h")) %>%
unnest(data)
Output
# A tibble: 5 x 4
pc index g h
<chr> <dbl> <dbl> <dbl>
1 A 1 2 1
2 A 2 6 2
3 A 3 7 3
4 B 4 8 3
5 B 5 9 3
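The same brute-force search can also be sketched in base R with expand.grid, which may be handy because purrr::cross() is deprecated in recent purrr releases. The helper name best_rows is hypothetical. One assumption differs from the answer above: ties in the score are broken by which.max (the first maximum wins) rather than by the number of categories.

```r
data <- data.frame(
  pc    = c("A", "A", "A", "A", "A", "A", "B", "B", "B", "B"),
  index = c(1, 1, 2, 2, 2, 3, 4, 5, 5, 5),
  g     = c(1, 2, 4, 3, 6, 7, 8, 5, 9, 3),
  h     = c(1, 1, 1, 2, 2, 3, 3, 3, 3, 4),
  stringsAsFactors = FALSE)

best_rows <- function(d) {
  # Row numbers of d, grouped by index: one list element per index value
  rows_by_index <- split(seq_len(nrow(d)), d$index)
  # Each row of combos picks exactly one row number per index
  combos <- as.matrix(expand.grid(rows_by_index))
  # Score = sum of g + number of distinct categories in h
  score <- apply(combos, 1, function(r) sum(d$g[r]) + length(unique(d$h[r])))
  d[sort(combos[which.max(score), ]), ]
}

# Apply the search within each pc group and stack the winners
result <- do.call(rbind, lapply(split(data, data$pc), best_rows))
rownames(result) <- NULL
result
```

Like the purrr version, this enumerates every combination, so the number of candidates grows as the product of the group sizes within each pc.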

Error in Adabag boosting function

df1 <- data.frame(ID = c(1, 2, 3, 4, 5),
var1 = c('a', 'b', 'c', 'd', 'e'),
var2 = c(1, 1, 0, 0, 1))
library(adabag)
ada = boosting(formula=var1~., data=df1)
Error in cbind(yval2, yprob, nodeprob) :
el número de filas de las matrices debe coincidir (vea arg 2)
(in English: number of rows of matrices must match (see arg 2))
Hi everyone, I'm trying to use the boosting function from the adabag package, but it's telling me that the number of rows of the matrices (?) must match. This data is not the original, but it seems to throw the same error.
Could you help me?
Thank you.
You should not use ID as explanatory variable.
Unfortunately, your df1 dataset is too small, and it is not possible to tell whether ID is the source of your problem.
Below I generate a bigger data set:
library(adabag)
set.seed(1)
n <- 100
df1 <- data.frame(ID = 1:n,
var1 = sample(letters[1:5], n, replace=TRUE),
var2 = sample(c(0,1), n, replace=TRUE))
head(df1)
# ID var1 var2
#
# 1 1 b 1
# 2 2 b 0
# 3 3 c 0
# 4 4 e 1
# 5 5 b 1
# 6 6 e 0
ada <- boosting(var1~var2, data=df1)
ada.pred <- predict.boosting(ada, newdata=df1)
ada.pred$confusion
#                Observed Class
# Predicted Class  a  b  c  d  e
#               b  5 20  2  7 11
#               c  2  2 10  2  2
#               d  6  3  7 17  4
Pablo, if we take a closer look at your sample data, we will notice a property that makes it impossible for the classification algorithm to handle. Your dataset consists of five samples, each with a unique label (the dependent variable): a, b, c, d, e. The dataset has only one feature (the independent variable var2, as ID should be excluded from the features’ list), which takes two values: 0 and 1. This means that several labels of the dependent variable correspond to the same value of the independent variable. When the algorithm tries to build a model, it runs into a problem because of this property and throws the error (number of rows of matrices must match (see arg 2)).
Marco's data, instead, has some healthy diversity: in the six samples shown, there are only three labels (b, c, e) and two values of var2 (0, 1). The data set is diverse and reliable enough for the algorithm to handle it.
So, in order to use adabag’s boosting (which uses rpart trees as its base classifiers), you should make your data more diverse and reliable. Good luck!

Conditional calculations in R

I have a dataframe with categories and values. Based on the category I want to subtract values that are stored in another table.
myframe <- data.frame(
x = factor(c("A", "D", "A", "C")),
y = c(8, 3, 9, 9))
reference <- c('A'= 1, 'B'= 2, 'C'= 3, 'D'= 4)
The desired (y-ref) outcome would be:
result <- data.frame(
x = factor(c("A", "D", "A", "C")),
y = c(8, 3, 9, 9),
r = c(7, -1, 8, 6))
x y r
1 A 8 7
2 D 3 -1
3 A 9 8
4 C 9 6
The reference 'table' is a named vector in this case but it could be changed to a better suited data format.
I am not sure how to accomplish this...
This is a fairly straightforward task using match and [ ...
myframe$r <- myframe$y - reference[ match( myframe$x , names( reference ) ) ]
# x y r
#1 A 8 7
#2 D 3 -1
#3 A 9 8
#4 C 9 6
Pretty sure this is a (several-times over) duplicate so we should find you a good pointer and close the question (but I commend you for showing input data and desired result, many questions are often not that well laid out).
EDIT
Well there are many, many match based questions on the site. It's hard to pick one to point to as an exact duplicate. But I suggest having a browse of a few of these by searching for "r match" (you can search by specific tags by enclosing the search term in square brackets like this "[r]").
The data.table way:
library(data.table)
# convert to data.table and set key for the upcoming merge
dt = data.table(myframe, key = 'x')
ref = data.table(x = names(reference), val = reference)
# merge and add a new column
dt[ref, r := y - val]
dt
# x y r
#1: A 8 7
#2: A 9 8
#3: C 9 6
#4: D 3 -1
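As a small variation on the match-based lookup, the named vector can also be indexed by name directly. The sketch below uses the question's own data; the one thing to watch (and the reason for as.character) is that x is a factor, and indexing with a factor uses its integer codes rather than its labels.

```r
myframe <- data.frame(
  x = factor(c("A", "D", "A", "C")),
  y = c(8, 3, 9, 9))
reference <- c(A = 1, B = 2, C = 3, D = 4)

# Look up by label; unname() keeps the A/D/A/C names out of the new column
myframe$r <- myframe$y - unname(reference[as.character(myframe$x)])

# For comparison: indexing with the factor itself silently uses its codes
# (levels are A, C, D, so "D" has code 3 and picks up the value for C)
wrong <- unname(reference[myframe$x])
myframe
```

Here wrong comes out as 1, 3, 1, 2 instead of the intended 1, 4, 1, 3, which is why the conversion to character is worth keeping even though the code runs without it.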
