I am working with the R programming language.
Suppose there are 100 people - each person is denoted with an ID from 1:100. Each person can be friends with other people. The dataset can be represented in graph/network format and looks something like this:
# Set the seed for reproducibility
set.seed(123)
# Generate a vector of ID's from 1 to 100
ids <- 1:100
# Initialize an empty data frame to store the "from" and "to" values
edges <- data.frame(from=integer(), to=integer(), stringsAsFactors=FALSE)
# Iterate through the ID's
for(id in ids) {
  # Randomly select a minimum of 1 and a maximum of 8 neighbors for the current ID
  neighbors <- sample(ids[ids != id], size=sample(1:8, size=1))
  # Add a new row to the data frame for each "to" value
  for(neighbor in neighbors) {
    edges <- rbind(edges, data.frame(from=id, to=neighbor))
  }
}
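As a side note, calling rbind() inside a double loop gets slow as the edge list grows; here is a sketch of the same generation logic that builds one small data frame per ID and binds everything once at the end (with the same seed set beforehand, this should draw the same random numbers in the same order and so produce the same edges):
edge_list <- lapply(ids, function(id) {
  neighbors <- sample(ids[ids != id], size=sample(1:8, size=1))
  data.frame(from=id, to=neighbors)
})
edges <- do.call(rbind, edge_list)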
As we can see, the data can be visualized to reveal the graph/network format:
library(igraph)
library(visNetwork)
# Convert the data frame to an igraph object
g <- graph_from_data_frame(edges, directed=FALSE)
# Plot the graph
plot(g)
# Optional visualization
#visIgraph(g)
Now, suppose each person in this dataset has a certain number of cookies. This looks something like this:
set.seed(123)
cookies <- data.frame(id = 1:100,
                      number_of_cookies = c(abs(as.integer(rnorm(25, 15, 5))),
                                            abs(as.integer(rnorm(75, 5, 5)))))
Here is my question:
I want to make sure that no person in this dataset has fewer than 12 cookies. That is, if someone has fewer than 12 cookies, they can "pool" their cookies with their neighbors (first-degree neighbors, then second-degree neighbors, and so on up to n-th degree neighbors) until this condition is satisfied.
However, I also want to make sure that during this pooling process no pooled group of friends ends up with more than 20 cookies (which might require "un-pooling" neighbors that were previously pooled together).
And finally, once someone has been placed in a group with other people, they cannot then be placed into another group (i.e. no "double dipping").
I wrote a function that takes an ID as an input and then returns the total number of cookies for this ID and all of this ID's neighbors:
library(data.table)
library(dplyr)
sum_cookies_for_id <- function(id) {
  # Get the IDs connected to the given ID
  # (the graph is undirected, so look at both the "from" and the "to" columns)
  connected_ids <- c(id, edges$to[edges$from == id], edges$from[edges$to == id])
  # Sum the number of cookies for all connected IDs
  sum(cookies[cookies$id %in% connected_ids, "number_of_cookies"])
}
# Test the function
sum_cookies_for_id(23)
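To extend this to second-degree (or generally n-th degree) neighbors, one option is to lean on the igraph object built above instead of filtering the edge list by hand. A minimal sketch, assuming the graph g from earlier (the helper name sum_cookies_within_degree is mine; ego() returns the vertices within a given number of steps of a node, including the node itself):
sum_cookies_within_degree <- function(id, degree = 1) {
  # All vertices within 'degree' steps of the given ID (vertex names are characters)
  nbhd <- ego(g, order = degree, nodes = as.character(id))[[1]]
  connected_ids <- as.integer(nbhd$name)
  sum(cookies$number_of_cookies[cookies$id %in% connected_ids])
}
# First- and second-degree totals for ID 23
sum_cookies_within_degree(23, degree = 1)
sum_cookies_within_degree(23, degree = 2)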
But beyond this, I am not sure how to continue.
Can someone please show me how I might be able to continue writing the code for this problem? Can Dynamic Programming be used in such an example?
Thanks!
Notes:
I think this problem might have a "stochastic" nature - that is, depending on which ID you begin with and which other IDs you group it with, you might end up with fewer or more people left with less than 12 cookies.
I think writing a "greedy" algorithm that performs random groupings might be the easiest option for such a problem (a rough sketch of that idea follows these notes).
In the future, I would be interested in seeing how more "sophisticated" algorithms (e.g. a genetic algorithm) could be used to make groupings such that the fewest possible people are left with less than 12 cookies.
Food for Thought: Could some pre-existing graph/network clustering algorithm (e.g. Louvain community detection) be used for this problem while taking these total/sum constraints into account?
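One possible starting point for that greedy idea: visit the cookie-poor IDs one at a time, grow a group outward through un-grouped neighbors while the pooled total stays at or below 20, and stop as soon as the total reaches 12. The sketch below is only one way to do this (the helper name assign_groups_greedy and the use of igraph's neighbors() are my own choices, not an established algorithm), and the outcome will depend on the order in which IDs are visited:
assign_groups_greedy <- function(g, cookies, min_cookies = 12, max_cookies = 20) {
  group_of <- rep(NA_integer_, nrow(cookies))   # group number per person (NA = un-grouped)
  names(group_of) <- cookies$id
  next_group <- 1L
  # Visit the poorest people first
  for (id in cookies$id[order(cookies$number_of_cookies)]) {
    if (!is.na(group_of[as.character(id)])) next                       # no double dipping
    if (cookies$number_of_cookies[cookies$id == id] >= min_cookies) next
    members <- id
    total <- cookies$number_of_cookies[cookies$id == id]
    # Un-grouped candidates adjacent to the current group
    frontier <- setdiff(as.integer(neighbors(g, as.character(id))$name), members)
    while (total < min_cookies && length(frontier) > 0) {
      nb <- frontier[1]
      frontier <- frontier[-1]
      if (!is.na(group_of[as.character(nb)])) next                     # already in a group
      nb_cookies <- cookies$number_of_cookies[cookies$id == nb]
      if (total + nb_cookies > max_cookies) next                       # would exceed 20: skip
      members <- c(members, nb)
      total <- total + nb_cookies
      # Expand the frontier with the new member's neighbors
      frontier <- setdiff(union(frontier, as.integer(neighbors(g, as.character(nb))$name)), members)
    }
    if (total >= min_cookies) {                                        # only keep groups that reach 12
      group_of[as.character(members)] <- next_group
      next_group <- next_group + 1L
    }
  }
  data.frame(id = cookies$id, group = group_of[as.character(cookies$id)])
}
groups <- assign_groups_greedy(g, cookies)
head(groups)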
Related
Let's say I have this data. My objective is to extract combinations of sequences.
I have one constraint: the time between two events may not be more than 5; let's call this maxGap.
User <- c(rep(1,3)) # One user
Event <- c("C","B","C") # Random events; could be anything from LETTERS[1:4]
Time <- c(1,12,13) # This is a timeline
df <- data.frame(User=User,
Event=Event,
Time=Time)
I want to use these sequences as binary explanatory variables for analysis.
Given this data frame, the result should look like this:
res.df <- data.frame(User=1,
C=1,
B=1,
CB=0,
BC=1,
CBC=0)
(CB) and (CBC) will be 0 since their gap exceeds maxGap = 5.
I was trying to write a function for this using many for-loops, but it becomes very complex as the sequences get longer and the number of different events grows, and also when the number of different users grows to 100,000.
Is it possible of doing this in TraMineR with the help of seqeconstraint?
Here is how you would do that with TraMineR
df.seqe <- seqecreate(id=df$User, timestamp=df$Time, event=df$Event)
constr <- seqeconstraint(maxGap=5)
subseq <- seqefsub(df.seqe, minSupport=0, constraint=constr)
(presence <- seqeapplysub(subseq, method="presence"))
which gives
(B) (B)-(C) (C)
1-(C)-11-(B)-1-(C) 1 1 1
presence is a table with a column for each subsequence that occurs at least once in the data set. So, if you have several individuals (event sequences), the table will have one row per individual and the columns will be the binary variable you are looking for. (See also TraMineR: Can I get the complete sequence if I give an event sub sequence? )
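If you then want it in the same shape as res.df above (one row per user plus binary columns), something along these lines should do it; this is only a sketch and assumes one event sequence per user, in the same order as unique(df$User):
res.df <- data.frame(User = unique(df$User), presence, check.names = FALSE)
res.df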
However, be aware that TraMineR works well only with subsequences of length up to about 4 or 5, so we suggest setting maxK=3 or 4 in seqefsub. The number of individuals should not be a problem, nor should the number of different possible events (the alphabet), as long as you restrict the maximal subsequence length you are looking for.
Hope this helps
I have a list of over 1,000,000 numbers. I have a lookup table that has a range of numbers and a category. For example, 0-200 is category A, 201-650 is category B (the ranges are not of equal length)
I need to simply iterate over the list of 1,000,000 numbers and get a list of the 1,000,000 corresponding categories.
EDIT:
For example, the first few elements of my list are 100, 125.5, 807.5, 345.2, and it should return something like 1, 1, 8, 4 as categories. The logic for the mapping is implemented in a function, categoryLookup(cd), and I'm using the following command to get the categories:
cats <- sapply(list.cd, categoryLookup)
However, while this seems to be working quickly on lists of size up to 10000, it is taking a lot of time for the whole list.
What is the fastest way to do the same? Is there any form of indexing that can help speed up the process?
The numbers:
numbers <- sample(1:1000000)
groups:
groups <- sort(rep(letters, 40000))
lookup:
categories <- groups[numbers]
EDIT:
If you don't yet have the vector of "groups" you can create it first.
Assume you have a data frame with the range info (note that the starting coordinate is 1, not 0 - see the edit below):
ranges <- data.frame(group=c("A","B","C"),
                     start=c(1,300001,600001),
                     end=c(300000,600000,1000000)
                     )
ranges
group start end
1 A 1 3e+05
2 B 300001 6e+05
3 C 600001 1e+06
# if groups are sorted and don't overlap:
groups <- rep(ranges$group, (ranges$end-ranges$start)+1)
Then continue as before
categories <- groups[numbers]
EDIT: as @jbaums said, you will have to add +1 to (ranges$end - ranges$start) in this case (already edited in the example above). Also, in this case your starting coordinate should be 1, not 0 (also reflected in the example above).
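A related option that avoids building the full groups vector, and that also copes with non-integer values such as 125.5, is findInterval() (or cut()); a sketch, again assuming sorted, non-overlapping ranges:
# Index of the range whose start is the largest value <= each number
idx <- findInterval(numbers, ranges$start)
categories <- ranges$group[idx]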
This may be a fairly esoteric question.
I'm trying to implement some of the ideas from Albatineh et al. (2006) (DOI: 10.1007/s00357-006-0017-z) for a spatial clustering algorithm. The basic idea is that one way to assess the stability of a clustering result is to examine how often pairs of observations end up in the same class. In a well-defined solution, pairs of observations should frequently end up in the same group.
The challenge is that in a large data set there are n^2 possible pairs (and most don't occur). We have structured our output as follows:
A B C C A
B A A A B
A B C C A
Where the column index is the observation ID and each row represents a run from the clustering algorithm. In this example there are 5 observations and the algorithm was run 3 times. The cluster labels A:C are essentially arbitrary between runs. I'd like an efficient way to calculate something like this:
ID1 ID2
1 5
2
3 4
4 3
5 1
1 2
2 3
2 4
...
This accomplishes my goal but is super slow, especially for a large data frame:
testData <- matrix(data=sample(x=c("A", "B", "C"), 15, replace=TRUE), nrow=3)
cluPr <- function(pr.obs){
  pairs <- data.frame()
  for (row in 1:dim(pr.obs)[1]){
    for (ob in 1:dim(pr.obs)[2]){
      ob.pairs <- which(pr.obs[row,] %in% pr.obs[row,ob], arr.ind=TRUE)
      pairs <- rbind(pairs, cbind(ob, ob.pairs))
    }
  }
  return(pairs)
}
cluPr(testData)
Here's a relatively quick approach using the combn() function. I assumed that the name of your matrix was m.
results <- t(combn(dim(m)[2], 2, function(x) c(x[1], x[2], sum(m[, x[1]] == m[, x[2]]))))
results2 <- results[results[, 3]>0, ]
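For instance, with the toy matrix from the question it could be used like this (only renaming testData to m; the third column is the number of runs in which the two observations share a cluster):
m <- testData
results <- t(combn(dim(m)[2], 2, function(x) c(x[1], x[2], sum(m[, x[1]] == m[, x[2]]))))
# Keep only pairs that co-occur at least once; columns: observation 1, observation 2, co-occurrence count
results2 <- results[results[, 3] > 0, ]
head(results2)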
Try this:
clu.pairs <- function(k, row)
{
  w <- which(row==k)
  expand.grid(w, w)
}
row.pairs <- function(row)
{
  do.call(rbind, lapply(unique(row), function(k) clu.pairs(k, row)))
}
full.pairs <- function(data)
{
  do.call(rbind, lapply(seq_len(nrow(data)), function(i) row.pairs(data[i,])))
}
And use full.pairs(testData). The result is not in the same order as yours, but it's equivalent.
My first implementation of the pair-counting metrics (not in R; my code is in Java, which is much faster) used ordered generators and then a merge-sort style computation of the intersection. It was still on the order of O(n^2) run time, but with much lower memory use.
However, you need to realize that you don't need to know the exact pairs. You only need the sizes of the intersections, and those can be computed straightforwardly from the intersection matrix, just like most other similarity measures. It's substantially faster if you only need the set intersection sizes; with hash tables, set intersection should be in O(n).
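To make that concrete in R (my own illustration, not code from the answer): for two label vectors the intersection matrix can be built with table(), and the pair counts follow from the cell counts via choose(n, 2) without ever enumerating pairs:
run1 <- c("A", "B", "C", "C", "A")
run2 <- c("B", "A", "A", "A", "B")
# Intersection matrix: how many observations fall in cluster i of run1 and cluster j of run2
ct <- table(run1, run2)
# Pairs placed together in both runs (summed over all cells)
pairs_both <- sum(choose(ct, 2))
# Pairs placed together in run1 / in run2, regardless of the other run
pairs_run1 <- sum(choose(rowSums(ct), 2))
pairs_run2 <- sum(choose(colSums(ct), 2))
pairs_both; pairs_run1; pairs_run2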
I don't have time to look it up, but we may have touched on this in the discussion of
Elke Achtert, Sascha Goldhofer, Hans-Peter Kriegel, Erich Schubert, Arthur Zimek: "Evaluation of Clusterings – Metrics and Visual Support". Data Engineering (ICDE), 2012 IEEE 28th International Conference on,
where we demonstrated a visual tool to explore the pair-counting based measures, also for more than two clustering solutions (unfortunately, a visual inspection mostly works for toy data sets, not for real data which is usually too messy and high-dimensional).
Roughly said: try computing the values using the formulas on page 303 in the publication you cited, instead of computing and then counting the pairs as explained in the intuition/motivation!
I want to ask your opinion since I am not so sure how to do it. This is regarding one part of my paper project and my situation is:
Stage I
I have 2 groups, and for each group I need to perform the following steps:
Generate 3 random numbers from a normal distribution, square them, and sum the squares.
Repeat step 1 fifteen times; at the end I will get 15 numbers.
I have already done stage I using a for loop.
n1<-3
n2<-3
miu<-0
sd1<-1
sd2<-1
asim<-15
w<-rep(NA,asim)
x<-rep(NA,asim)
for (i in 1:asim) {
  print(i)
  set.seed(i)
  data1 <- rnorm(n1,miu,sd1)
  data2 <- rnorm(n2,miu,sd2)
  w[i] <- sum(data1^2)
  x[i] <- sum(data2^2)
}
w
x
Second stage is;
Stage II
For each group, I need to:
Sort the group;
Find trimmed mean for each group.
For the whole process (stage I and stage II) I need to run the simulation 5000 times. How should I proceed with stage II? Do you think I need to add another loop for it?
Those are tasks you can do without explicit loops. First, note a few things: it is the same whether you generate 3 x 15 x 2000 random numbers in pieces or all at once; they still share the same distribution.
Next: setting the seed within each loop iteration makes every iteration individually deterministic; call set.seed once at the start of your script instead.
So, what we will do is to generate all random numbers at once, then compute their squared norms for groups of three, then build groups of 15.
First some variable definitions:
set.seed(20131301)
repetitions <- 2000
numperval <- 3
numpergroup <- 15
miu <- 0
sd1 <- 1
sd2 <- 1
As we need two groups, we wrap the group generation in a small custom function. This is not necessary, but it helps keep the code clean and readable.
generateGroup <- function(repetitions, numperval, numpergroup, m, s) {
  # Generate all data
  data <- rnorm(repetitions*numperval*numpergroup, m, s)
  # Build groups of 3:
  data <- matrix(data, ncol=numperval)
  # And generate the squared norm of those
  data <- rowSums(data*data)
  # Finally build a matrix with 15 columns, each column one dataset of numbers, each row one repetition
  matrix(data, ncol=numpergroup)
}
Great, now we can generate random numbers for our group:
group1 <- generateGroup(repetitions, numperval, numpergroup, miu, sd1)
group2 <- generateGroup(repetitions, numperval, numpergroup, miu, sd2)
To compute the trimmed mean, we again utilize apply:
trimmedmeans_group1 <- apply(group1, 1, mean, trim=0.25)
trimmedmeans_group2 <- apply(group2, 1, mean, trim=0.25)
I used mean with the trim argument instead of sorting, discarding values, and computing the mean by hand. If you need the sorted numbers explicitly, you could do it by hand (just for one group this time):
sorted <- t(apply(group1, 1, sort))
# We have to transpose as apply by default returns a matrix with each observation in one column. I chose the other way around above, so we stick with this convention and transpose.
Now it would be easy to drop the outer columns and compute the mean manually: with trim=0.25 and 15 values per row, mean() drops floor(15 * 0.25) = 3 values from each end, i.e. the first and last three columns of the sorted matrix.
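For completeness, a manual version along those lines (a sketch; it should agree with the apply(..., trim=0.25) result above):
# Keep columns 4 to 12 (drop the 3 smallest and 3 largest of the 15 sorted values), then average
manual_trimmed_group1 <- rowMeans(sorted[, 4:12])
# Should match the trimmed means computed with mean(..., trim=0.25)
all.equal(manual_trimmed_group1, trimmedmeans_group1)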
I'm new to R, and my problem is that I know what I need to do, just not how to do it in R. I have a very large data frame from a web services load test, ~20M observations. It has the following variables:
epochtime, uri, cache (hit or miss)
I'm thinking I need to do a couple of things. I need to subset my data frame to the top 50 distinct URIs, then for each observation in each subset calculate the % cache hits at that point in time. The end goal is a plot of cache hit/miss % over time by URI.
I have read, and am still reading, various posts here on this topic, but R is pretty new to me and I have a deadline. I'd appreciate any help I can get.
EDIT:
I can't provide exact data, but it looks like this; it's at least 20M observations I'm retrieving from a Mongo database. Time is epoch time and we're recording many thousands of events per second, so time has a lot of dupes; that's expected. There could be more than 50 URIs, but I only care about the top 50. The end result would be a line plot over time of % TCP_HIT relative to the total occurrences, by URI. Hope that's clearer.
time uri action
1355683900 /some/uri TCP_HIT
1355683900 /some/other/uri TCP_HIT
1355683905 /some/other/uri TCP_MISS
1355683906 /some/uri TCP_MISS
You are looking for the aggregate function.
Call your data frame u:
> u
time uri action
1 1355683900 /some/uri TCP_HIT
2 1355683900 /some/other/uri TCP_HIT
3 1355683905 /some/other/uri TCP_MISS
4 1355683906 /some/uri TCP_MISS
Here is the ratio of hits for a subset (using the order of factor levels, TCP_HIT=1, TCP_MISS=2 as alphabetical order is used by default), with ten-second intervals:
ratio <- function(u) aggregate(u$action ~ u$time %/% 10,
FUN=function(x) sum((2-as.numeric(x))/length(x)))
Now use lapply to get the final result:
lapply(seq_along(levels(u$uri)),
       function(l) list(uri=levels(u$uri)[l],
                        hits=ratio(u[as.numeric(u$uri) == l,])))
[[1]]
[[1]]$uri
[1] "/some/other/uri"
[[1]]$hits
u$time%/%10 u$action
1 135568390 0.5
[[2]]
[[2]]$uri
[1] "/some/uri"
[[2]]$hits
u$time%/%10 u$action
1 135568390 0.5
Or otherwise filter the data frame by URI before computing the ratio.
@MatthewLundberg's code is the right idea. Specifically, you want something that uses the split-apply-combine strategy.
Given the size of your data, though, I'd take a look at the data.table package.
You can see why visually here: data.table is just faster.
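For example, a rough data.table version of the hit-ratio calculation might look like the sketch below (dt, top50, bucket and hit_pct are names I made up; the column names follow the sample data above):
library(data.table)
dt <- as.data.table(u)
# The 50 most frequent URIs
top50 <- head(dt[, .N, by = uri][order(-N)], 50)$uri
# Hit percentage per URI in ten-second buckets
hit_rates <- dt[uri %in% top50,
                .(hit_pct = 100 * mean(action == "TCP_HIT")),
                by = .(uri, bucket = time %/% 10)]
hit_rates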
Thought it would be useful to share my solution to the plotting part of the problem.
My R "noobness" may shine here, but this is what I came up with. It makes a basic line plot. It's plotting the actual values; I haven't done any conversions.
for (i in 1:length(h)) {
  name <- unlist(h[[i]][1])
  dftemp <- as.data.frame(do.call(rbind, h[[i]][2]))
  names(dftemp) <- c("time", "cache")
  plot(dftemp$time, dftemp$cache, type="o")
  title(main=name)
}