Vertices/edges between specified nodes - r

This seems like an easy problem to solve, but even after a week, I can't find a solution to this.
I have an undirected graph with 1700+ nodes and 5000+ edges. I would like to filter/subset this graph for a source node and a set of target nodes, regardless of the neighborhood order (i.e. how many hops separate them). I tried the following code and it works for this example dataset, but all_simple_paths doesn't terminate for my real data set.
In this working example, I am showing the connections between A and the targets C and F.
library(igraph)
gg <- graph.data.frame(data.frame(n1 = c('A', 'A', 'B', 'A', 'A', 'F', 'E'),
                                  n2 = c('D', 'B', 'C', 'E', 'F', 'C', 'F')),
                       directed = FALSE,
                       vertices = LETTERS[1:6])
plot(gg, layout = layout_with_kk(gg))
paths <- all_simple_paths(gg, from = "A", to = c("C", "F"))
gg_s <- subgraph(gg, v = unique(rapply(paths, f = c)))
plot(gg_s, layout = layout_with_kk(gg_s))
I also tried subcomponent and intersection, but subcomponent returns all possible connections. I tried creating ego graphs and then doing an intersection, but that's not giving the best results.
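One narrowing that does terminate quickly, if restricting to shortest paths were acceptable (a stricter criterion than all simple paths), would be all_shortest_paths; its $res component holds the vertex paths:
# minimal sketch, same toy graph as above; keeps only vertices on some shortest path
sp <- all_shortest_paths(gg, from = "A", to = c("C", "F"))$res
gg_sp <- induced_subgraph(gg, vids = unique(rapply(sp, f = c)))
plot(gg_sp, layout = layout_with_kk(gg_sp))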

Related

Create a dataframe that is the result of a difference between vectors from other dataframes R

I have the following datasets and information: first, I have i different plots that I want to analyze. In each plot, I have j species for which I want to obtain some information, such as:
plot1 = c(rep(1, 3), rep(2, 4), rep(3, 5))
spp1 = c('a', 'b', 'c', 'a', 'b', 'c', 'd', 'b', 'b', 'b', 'e', 'f')
data.1 = data.frame(plot1, spp1)
The above-mentioned information repeats for a second dataframe of similar structure:
plot2 = c(rep(1, 2), rep(2, 3), rep(3, 5))
spp2 = c('a', 'a', 'b', 'c', 'c', 'b', 'b', 'b', 'e', 'f')
data.2 = data.frame(plot2, spp2)
What I'm trying to do is, for each plot i, compute setdiff(unique(data.1$spp1), unique(data.2$spp2)) and add the obtained information to a dataframe that has 2 columns: plot and spp_name.
For the example datasets I'd like to obtain a final dataframe such as:
df_result = data.frame(plot = c(1, 1, 2, 2, 3), spp_name = c('b', 'c', 'a', 'd', 0))
0 (or similar) must be returned when setdiff(unique()) returns character(0). So, in a way, my df_result needs to have, for each plot i, length equal to the number of setdiff strings between data.1$spp1 and data.2$spp2.
The first thing I did was use a for loop over each plot i. Getting the setdiff() string result is fine, but I don't know how to add this information to an empty dataframe... do I need to loop over each species? I really hope my question is comprehensible.
Thanks already
You could use anti_join and add rows for the missing values:
library(dplyr)
anti_join(data.1, data.2, by = c("plot1" = "plot2", "spp1" = "spp2")) %>%
  add_row(plot1 = setdiff(data.1$plot1, .$plot1))
# plot1 spp1
#1 1 b
#2 1 c
#3 2 a
#4 2 d
#5 3 <NA>
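A base-R sketch of the same per-plot setdiff idea, assuming the corrected data.1 and data.2 above (an NA row stands in for the 0 placeholder):
# split each spp column by plot, take the per-plot setdiff, bind the pieces
res <- Map(function(p, a, b) {
  d <- setdiff(unique(a), unique(b))
  data.frame(plot = p, spp_name = if (length(d)) d else NA)
}, sort(unique(data.1$plot1)),
   split(data.1$spp1, data.1$plot1),
   split(data.2$spp2, data.2$plot2))
do.call(rbind, res)
#  plot spp_name
#1    1        b
#2    1        c
#3    2        a
#4    2        d
#5    3     <NA>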

Generalizable function to select and filter dataframe r - using shiny input

I am building a shiny app. The user will need to be able to reduce the data by selecting variables and filtering on specific values for those variables. I am stuck trying to get a generalizable function that can work based on all possible selections.
Here is an example - I skip the shiny code because I think the problem is with the function:
#sample dataframe
df <- data.frame('date' = c(1, 2, 3, 2, 2, 3, 1),
                 'time' = c('a', 'b', 'c', 'e', 'b', 'a', 'e'),
                 'place' = c('A', 'A', 'A', 'H', 'A', 'H', 'H'),
                 'result' = c('W', 'W', 'L', 'W', 'W', 'L', 'L'))
If the user selected date and result, with date values 1 and 2 and result value W, I would do the following:
out <- df %>%
  select(date, result) %>%
  filter(date %in% c(1, 2)) %>%
  filter(result %in% c('W'))
The challenge I am having is that the user can select any unique combination of variables and values. Using the input$ values from my shiny app, I can get the selected variables into a vector and the selected values into a list of values, positionally matching the selected variables. For example:
selected_variables <- c('date', 'result')
selected_values <- list(c(1,2), c('W'))
What I think I then need is a generalizable function that will match up the filter calls with the correct variables. Something like:
#function that takes data frame, vector of selected variables, list of vectors of chosen values for each variable
#Returns a reduced table of selected variables, filtered values
table_reducer <- function(df, select_var, filter_values) {
  #select the variables
  out <- df %>%
    select(select_var)
  #now filter each variable by the values contained in the list
  out <- [for loop that iterates over select_var and filter_values, filtering accordingly]
  out #return out
}
My thinking would be to use a zip equivalent from Python, but all my searching on that just points me to mapply, and I can't see how to use that within the for loop (which I also know is not always approved of in R - but I am talking about a relatively small number of iterations). If there is a better solution to this I would welcome it.
Here's a one-liner table_reducer function in base R:
table_reducer <- function(df, select_var, filter_values) {
  subset(df, Reduce(`&`, Map(`%in%`, df[select_var], filter_values)))
}
selected_variables <- c('date', 'result')
selected_values <- list(c(1,2), c('W'))
table_reducer(df, selected_variables, selected_values)
# date time place result
#1 1 a A W
#2 2 b A W
#4 2 e H W
#5 2 b A W
Map is a wrapper over mapply, so you were right in thinking that mapply fits this task. This answer is also free of the dreaded for loop.
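For completeness, a hedged dplyr/purrr sketch of the same idea that also performs the select step from the question (assumes the purrr package; table_reducer_tidy is just an illustrative name):
library(dplyr)
library(purrr)
table_reducer_tidy <- function(df, select_var, filter_values) {
  # one logical vector per variable, combined with &, then select the chosen columns
  keep <- reduce(map2(df[select_var], filter_values, `%in%`), `&`)
  df %>% filter(keep) %>% select(all_of(select_var))
}
table_reducer_tidy(df, selected_variables, selected_values)
#  date result
#1    1      W
#2    2      W
#3    2      W
#4    2      W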

Finding All Pairwise Commonalities in R

I have been a StackExchange Lurker forever now, but have not had much luck finding this question in R before, so I created a username just for this.
Basically I have a data set of Customers and Stores (roughly 260k customers and 300 stores with most of the customers visiting at least 10 unique stores), and I want to see which sites overlap on customers the most (i.e. Site A and B share this many customers, A and C that many, ... for ALL Pairs of sites).
Reproducible example:
begindata <- data.frame(customer_id = c(1, 2, 3, 1, 2, 3, 4, 1, 4, 5),
                        site_visited = c('A', 'A', 'A', 'B', 'B', 'B', 'B', 'C', 'D', 'D'))
and I would like to see the following, if possible:
final_table <- data.frame(site_1 = c('A', 'A', 'A', 'B', 'B', 'C'),
                          site_2 = c('B', 'C', 'D', 'C', 'D', 'D'),
                          number_of_commonalities = c(3, 1, 1, 1, 1, 0))
I have tried joining begindata to itself based on customer_id, something like this...
attempted_df <- begindata %>%
  left_join(begindata, by = "customer_id") %>%
  count(site_visited.x, site_visited.y)
Which I know is redundant (lines that go A, B, 3; B, A, 3; as well as lines that go A, A, 3).
However, this cannot be executed with my actual data set (260k members and 300 sites) due to size limitations.
Any advice would be greatly appreciated! Also go easy on me if my post sucks--I think I have followed the rules and suggestions?
We could use combn:
number_of_commonalities <- combn(unique(begindata$site_visited), 2,
                                 FUN = function(x) with(begindata,
                                   length(intersect(customer_id[site_visited == x[1]],
                                                    customer_id[site_visited == x[2]]))))
names(number_of_commonalities) <- combn(unique(begindata$site_visited), 2,
                                        FUN = paste, collapse = "_")
stack(as.list(number_of_commonalities))[2:1]
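For the full-size data (260k customers, 300 sites), one scalable sketch is a sparse customer-by-site incidence matrix and crossprod from the Matrix package (assumes begindata as defined above, with at most one row per customer/site pair; deduplicate first if visits can repeat):
library(Matrix)
cust <- factor(begindata$customer_id)
site <- factor(begindata$site_visited)
# rows = customers, columns = sites, 1 = customer visited that site
m <- sparseMatrix(i = as.integer(cust), j = as.integer(site), x = 1,
                  dimnames = list(levels(cust), levels(site)))
overlap <- crossprod(m)   # site-by-site counts of shared customers
overlap["A", "B"]
# [1] 3
The upper triangle of overlap holds one entry per site pair, which can then be reshaped into the long final_table layout if needed.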

Frequent Sequential Patterns

What would be the best way to get the sequential pattern for such data in R:
The idea is to get the frequency of letters in process 1, 2, and 3. Is there a GSP function that can do that? Any insight or tutorial is appreciated.
You can use an apply and table combo (provided you read your data into R):
dat <- data.frame(process1 = c('A', 'B', 'A', 'A', 'C'),
                  process2 = c('B', 'C', 'B', 'B', 'A'),
                  process3 = c('C', 'C', 'A', 'B', 'B'))
apply(dat, 2, table)
# process1 process2 process3
#A 3 1 1
#B 1 3 2
#C 1 1 2
apply iterates over the columns of dat (this is what the argument 2 refers to) and applies table to each, which counts each unique element. See the help pages for the *apply family of functions for more info.
d.b's solution above, lapply(dat, table), does the same thing but returns a list rather than a matrix.
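A base-R alternative builds the same cross-tab in a single call, assuming the columns of dat are plain character vectors as above:
with(stack(dat), table(values, ind))
# same counts as the apply() result: letters as rows, process1/process2/process3 as columns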

How do I find the edges of a vertex using igraph and R?

Say I have this example graph; I want to find the edges connected to vertex 'a'.
d <- data.frame(p1 = c('a', 'a', 'a', 'b', 'b', 'b', 'c', 'c', 'd'),
                p2 = c('b', 'c', 'd', 'c', 'd', 'e', 'd', 'e', 'e'))
library(igraph)
g <- graph.data.frame(d, directed=FALSE)
print(g, e=TRUE, v=TRUE)
I can easily find a vertex:
V(g)[V(g)$name == 'a' ]
But I need to reference all the edges connected to the vertex 'a'.
See the documentation on igraph iterators; in particular the from() and to() functions.
In your example, "a" is V(g)[1] (vertex indices are 1-based in current igraph), so to find all edges connected to "a":
E(g)[ from(1) ]
Result:
[1] b -- a
[2] c -- a
[3] d -- a
If you do not know the index of the vertex, you can find it using match() before using the from() function.
idx <- match("a", V(g)$name)
E(g) [ from(idx) ]
Found a simpler version combining the two efforts above that may be useful too.
E(g)[from(V(g)["a"])]
I use this function for getting number of edges for all nodes:
sapply(V(g)$name, function(x) length(E(g)[from(V(g)[x])]))
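Note that current igraph releases also accept vertex names directly in helpers such as incident() and degree(), which avoids the manual index lookups:
incident(g, "a", mode = "all")   # all edges touching vertex "a"
degree(g)                        # number of incident edges for every vertex at once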
