Finding All Pairwise Commonalities in R

I have been a StackExchange Lurker forever now, but have not had much luck finding this question in R before, so I created a username just for this.
Basically I have a data set of Customers and Stores (roughly 260k customers and 300 stores, with most customers visiting at least 10 unique stores), and I want to see which sites overlap on customers the most (i.e., sites A and B share this many customers, A and C that many, and so on for ALL pairs of sites).
Reproducible example:
begindata<-data.frame(customer_id=c(1,2,3,1,2,3,4,1,4,5), site_visited=c('A', 'A', 'A', 'B', 'B', 'B', 'B', 'C', 'D', 'D'))
and I would like to see the following, if possible:
final_table<-data.frame(site_1=c('A', 'A', 'A', 'B', 'B', 'C'), site_2=c('B', 'C', 'D', 'C', 'D', 'D'), number_of_commonalities=c(3, 1, 0, 1, 1, 0))
I have tried joining begindata to itself based on customer_id, something like this...
attempted_df<-begindata %>% left_join(begindata, by="customer_id") %>% count(site_visited.x, site_visited.y)
Which I know is redundant (lines that go A, B, 3; B, A, 3; as well as lines that go A, A, 3).
However, this cannot be executed with my actual data set (260k members and 300 sites) due to size limitations.
Any advice would be greatly appreciated! Also go easy on me if my post sucks--I think I have followed the rules and suggestions?

We could use combn
# count shared customers for every unordered pair of sites
number_of_commonalities <- combn(unique(begindata$site_visited), 2,
    FUN = function(x) with(begindata,
        length(intersect(customer_id[site_visited == x[1]],
                         customer_id[site_visited == x[2]]))))

# label each count with its site pair, then reshape into a two-column data frame
names(number_of_commonalities) <- combn(unique(begindata$site_visited), 2,
                                        FUN = paste, collapse = "_")
stack(number_of_commonalities)[2:1]
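If combn() over all ~45,000 site pairs is still too slow on the full data (260k customers, 300 sites), another option worth sketching, assuming the Matrix package is available, is to build a sparse customer-by-site incidence matrix and take its cross-product; every pairwise shared-customer count then comes out of one matrix operation. The names below mirror the question's final_table but are otherwise illustrative.
library(Matrix)

# sparse customer x site incidence matrix; collapse any repeat visits to 0/1
m <- xtabs(~ customer_id + site_visited, data = begindata, sparse = TRUE)
m <- 1 * (m > 0)

# site x site matrix: entry (i, j) is the number of customers common to sites i and j
co <- as.matrix(crossprod(m))

# keep each unordered pair once (upper triangle, diagonal dropped)
idx <- which(upper.tri(co), arr.ind = TRUE)
final_table <- data.frame(site_1 = rownames(co)[idx[, 1]],
                          site_2 = colnames(co)[idx[, 2]],
                          number_of_commonalities = co[idx])
The pair ordering differs from the expected output above, but the counts agree with the combn() result.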

Related

Create a dataframe that is the result of a difference between vectors from other dataframes R

I have the following datasets and information: first, I have i different plots that I want to analyze. In each plot, I have j species for which I want to obtain some information, such as:
plot1 = c(rep(1, 3), rep(2, 4), rep(3, 5))
spp1 = c('a', 'b', 'c', 'a', 'b', 'c', 'd', 'b', 'b', 'b', 'e', 'f')
data.1 = data.frame(plot1, spp1)
The above mentioned information repeats for a second dataframe of similar structure:
plot2 = c(rep(1, 2), rep(2, 3), rep(3, 5))
spp2 = c('a', 'a', 'b', 'c', 'c', 'b', 'b', 'b', 'e', 'f')
data.2 = data.frame(plot2, spp2)
What I'm trying to do is, for each plot i, run setdiff(unique(data.1$spp1), unique(data.2$spp2)) and add the result to a dataframe that has 2 columns: plot and spp_name.
For the example datasets I'd like to obtain a final dataframe such as:
df_result = data.frame(plot = c(1, 1, 2, 2, 3), spp_name = c('b', 'c', 'a', 'd', 0))
0 (or similar) must be returned when setdiff(unique()) returns character(0). So, in a way, my df_result needs to have, for each plot i, a length equal to the number of setdiff strings between data.1$spp1 and data.2$spp2.
The first thing I did was use a for loop over each plot i. Getting the setdiff() string result is OK, but I don't know how to add this information to an empty dataframe... do I need to loop over each species? I really hope my question is comprehensible.
Thanks already
You could use anti_join and add rows for the missing values:
library(dplyr)
anti_join(data.1, data.2, by = c("plot1" = "plot2", "spp1" = "spp2")) %>%
  add_row(plot1 = setdiff(data.1$plot1, .$plot1))
# plot1 spp1
#1 1 b
#2 1 c
#3 2 a
#4 2 d
#5 3 <NA>
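For reference, a base-R sketch of the per-plot setdiff() loop described in the question (using the example data.1 and data.2 column names, and padding with NA rather than 0 when there is no difference) could look like this:
plots <- sort(unique(data.1$plot1))
res <- lapply(plots, function(p) {
  d <- setdiff(unique(data.1$spp1[data.1$plot1 == p]),
               unique(data.2$spp2[data.2$plot2 == p]))
  # pad with NA when setdiff() returns character(0)
  data.frame(plot = p, spp_name = if (length(d)) d else NA)
})
df_result <- do.call(rbind, res)
This gives the same rows as the anti_join() result, just built plot by plot.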

Frequent Sequential Patterns

What would be the best way to get the sequential pattern for such data in R?
The idea is to get the frequency of letters in processes 1, 2, and 3. Is there a GSP (Generalized Sequential Pattern) function that can do that? Any insight or tutorial is appreciated.
You can use an apply and table combo (provided you read your data into R):
dat <- data.frame(process1 = c('A', 'B', 'A', 'A', 'C'), process2 = c('B', 'C', 'B', 'B', 'A'), process3 = c('C', 'C', 'A', 'B', 'B'))
apply(dat, 2, table)
# process1 process2 process3
#A 3 1 1
#B 1 3 2
#C 1 1 2
apply iterates through the columns of dat (this is what the argument 2 refers to) and applies table to each column, which counts each unique element. See the help pages for the *apply family of functions for more info.
d.b's solution above, lapply(dat, table), does the same thing but returns a list rather than a matrix.
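If a long-format table of counts is easier to work with downstream, a small sketch with stack() and table() gives the same numbers, one row per letter/process combination:
# stack() turns the three process columns into (values, ind) pairs;
# table() then counts each letter within each process
long_counts <- as.data.frame(table(stack(dat)))
long_counts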

Vertices/edges between specified nodes

This seems like an easy problem to solve, but even after a week, I can't find a solution to this.
I have an undirected graph with 1700+ nodes and 5000+ edges. I would like to filter/subset this graph for a source node and other target nodes, regardless of the order of the neighborhood. I tried the following code and it works for this example dataset, but all_simple_paths doesn't terminate for my real data set.
In this working example, I am showing the connections between A and C & F.
library(igraph)
gg <- graph.data.frame(data.frame(n1 = c('A', 'A', 'B', 'A', 'A', 'F', 'E'),
                                  n2 = c('D', 'B', 'C', 'E', 'F', 'C', 'F')),
                       directed = FALSE,
                       vertices = LETTERS[1:6])
plot(gg, layout = layout_with_kk(gg))
paths <- all_simple_paths(gg, from = "A", to = c("C", "F"))
gg_s <- subgraph(gg, v = unique(rapply(paths, f = c)))
plot(gg_s, layout = layout_with_kk(gg_s))
I also tried subcomponent and intersection, but subcomponent returns all possible connections. I tried creating ego graphs and then doing an intersection, but that's not giving the best results.
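One hedged workaround, since enumerating every simple path grows combinatorially: if seeing how the source connects to the targets via shortest paths is enough, all_shortest_paths() terminates quickly even on large graphs, and its visited vertices can feed the same subgraph step (this is a sketch using the gg from the example above; the vertex paths come back in the $res component).
sp <- all_shortest_paths(gg, from = "A", to = c("C", "F"))
keep <- unique(unlist(lapply(sp$res, as_ids)))

gg_s <- induced_subgraph(gg, vids = keep)
plot(gg_s, layout = layout_with_kk(gg_s))
This keeps only vertices lying on some shortest A-to-C or A-to-F path, which is a weaker requirement than all simple paths but scales far better.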

R find value in multiple data frame columns

Given a data set where a value could be in any of a set of columns from the dataframe:
df <- data.frame(h1=c('a', 'b', 'c', 'a', 'a', 'b', 'c'), h2=c('b', 'c', 'd', 'b', 'c', 'd', 'b'), h3=c('c', 'd', 'e', 'e', 'e', 'd', 'c'))
How can I get a logical vector that specifies which rows contain the target value? In this case, searching for 'b', I'd want a logical vector with rows (1,2,4,6,7) as TRUE.
The real data set is much larger and more complicated so I'm trying to avoid a for loop.
thanks
EDIT:
This seems to work.
i <- apply(df, 1, function(x) 'b' %in% as.vector(t(x)))
i
# [1]  TRUE  TRUE FALSE  TRUE FALSE  TRUE  TRUE
If speed is a concern I would go with:
rowSums(df == "b") > 0
The apply-based equivalent returns the same logical vector but is slower on larger data:
apply(df, 1, function(r) any(r == "b"))
I'd rather wrap it into a small helper function that also returns the matching rows and performs a case-insensitive search across all columns:
require(dplyr)
require(stringr)

# returns the rows of df where any column contains search_term (case-insensitive)
search_df = function(df, search_term) {
  apply(df, 1, function(r) {
    any(str_detect(as.character(r), fixed(search_term, ignore_case = TRUE)))
  }) %>% subset(df, .)
}

search_df(iris, "Setosa")
To keep it more generic, this can also be rewritten to expose the matching expression/rule as a function argument:
match_df = function(df, search_expr) {
  # capture the unevaluated expression and turn it into a one-argument filter function
  filter_fun = eval(substitute(function(x) { search_expr }))
  apply(df, 1, function(r) any(filter_fun(r))) %>% subset(df, .)
}

match_df(iris, str_detect(x, "setosa"))

How do I find the edges of a vertex using igraph and R?

Say I have this example graph; I want to find the edges connected to vertex 'a':
d <- data.frame(p1 = c('a', 'a', 'a', 'b', 'b', 'b', 'c', 'c', 'd'),
                p2 = c('b', 'c', 'd', 'c', 'd', 'e', 'd', 'e', 'e'))
library(igraph)
g <- graph.data.frame(d, directed=FALSE)
print(g, e=TRUE, v=TRUE)
I can easily find a vertex:
V(g)[V(g)$name == 'a' ]
But I need to reference all the edges connected to the vertex 'a'.
See the documentation on igraph iterators; in particular the from() and to() functions.
In your example, "a" is V(g)[0], so to find all edges connected to "a":
E(g) [ from(0) ]
Result:
[0] b -- a
[1] c -- a
[2] d -- a
If you do not know the index of the vertex, you can find it using match() before using the from() function.
idx <- match("a", V(g)$name)
E(g) [ from(idx) ]
I found a simpler version combining the two efforts above that may be useful too:
E(g)[from(V(g)["a"])]
I use this function to get the number of edges for each node:
sapply(V(g)$name, function(x) length(E(g)[from(V(g)[x])]))
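Note that the answers above use the 0-based vertex indexing of older igraph releases. In current igraph, vertices are 1-indexed, and a minimal sketch with incident() and degree() retrieves the same information directly by vertex name:
# all edges touching vertex "a" (vertices can be addressed by name)
incident(g, "a")

# number of incident edges for every vertex, equivalent to the sapply() above
degree(g)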
