Let's say I have the data below. My objective is to extract combinations of event sequences.
I have one constraint: the time between two events may not be more than 5; let's call this maxGap.
User <- c(rep(1,3)) # One user
Event <- c("C","B","C") # Random events; could be anything from LETTERS[1:4]
Time <- c(1,12,13) # This is the timeline
df <- data.frame(User=User,
                 Event=Event,
                 Time=Time)
I want to use these sequences as binary explanatory variables in an analysis.
Given this data frame, the result should look like this.
res.df <- data.frame(User=1,
                     C=1,
                     B=1,
                     CB=0,
                     BC=1,
                     CBC=0)
(CB) and (CBC) will be 0 since the gap between C (time 1) and B (time 12) is larger than maxGap = 5.
I was trying to write a function for this using many for-loops, but it becomes very complex when the sequences get longer and the number of distinct events grows, and also when the number of different users grows to 100,000.
Is it possible to do this in TraMineR with the help of seqeconstraint?
Here is how you would do that with TraMineR:
df.seqe <- seqecreate(id=df$User, timestamp=df$Time, event=df$Event) # build the event sequence object
constr <- seqeconstraint(maxGap=5)                                   # time constraint between events
subseq <- seqefsub(df.seqe, minSupport=0, constraint=constr)         # mine the subsequences
(presence <- seqeapplysub(subseq, method="presence"))                # 0/1 presence of each subsequence
which gives
                   (B) (B)-(C) (C)
1-(C)-11-(B)-1-(C)   1       1   1
presence is a table with a column for each subsequence that occurs at least once in the data set. So, if you have several individuals (event sequences), the table will have one row per individual, and the columns are the binary variables you are looking for. (See also TraMineR: Can I get the complete sequence if I give an event sub sequence?)
However, be aware that TraMineR works fine only with subsequences of length up to about 4 or 5. We suggest setting maxK=3 or 4 in seqefsub. The number of individuals should not be a problem, nor should the number of different possible events (the alphabet), as long as you restrict the maximal subsequence length you are looking for.
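For example, to restrict the mining to subsequences of length at most 3:
subseq <- seqefsub(df.seqe, minSupport=0, maxK=3, constraint=constr) # only subsequences of length <= 3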
Hope this helps
Related
I am working with the R programming language.
Suppose there are 100 people - each person is denoted with an ID from 1:100. Each person can be friends with other people. The dataset can be represented in graph/network format and looks something like this:
# Set the seed for reproducibility
set.seed(123)
# Generate a vector of ID's from 1 to 100
ids <- 1:100
# Initialize an empty data frame to store the "from" and "to" values
edges <- data.frame(from=integer(), to=integer(), stringsAsFactors=FALSE)
# Iterate through the ID's
for(id in ids) {
  # Randomly select a minimum of 1 and a maximum of 8 neighbors for the current ID
  neighbors <- sample(ids[ids != id], size=sample(1:8, size=1))
  # Add a new row to the data frame for each "to" value
  for(neighbor in neighbors) {
    edges <- rbind(edges, data.frame(from=id, to=neighbor))
  }
}
As we can see, the data can be visualized to reveal the graph/network format:
library(igraph)
library(visNetwork)
# Convert the data frame to an igraph object
g <- graph_from_data_frame(edges, directed=FALSE)
# Plot the graph
plot(g)
# Optional visualization
#visIgraph(g)
Now, suppose each person in this dataset has a certain number of cookies. This looks something like this:
set.seed(123)
cookies = data.frame(id = 1:100, number_of_cookies = c(abs(as.integer(rnorm(25, 15, 5))), abs(as.integer(rnorm(75, 5, 5)))))
Here is my question:
I want to make sure that no person in this dataset has fewer than 12 cookies - that is, if someone has fewer than 12 cookies, they can "pool" their cookies with their neighbors (first-degree neighbors, second-degree neighbors, ..., n-th degree neighbors, until this condition is satisfied) and see if they then reach at least 12 cookies.
However, I also want to make sure that during this pooling process, no pooled group of friends has more than 20 cookies (i.e. this might require "un-pooling" neighbors that were previously pooled together).
And finally, if someone is already in a group with other people, this person cannot then be placed into another group (i.e. no "double dipping").
I wrote a function that takes an ID as an input and then returns the total number of cookies for this ID and all of this ID's neighbors:
library(data.table)
library(dplyr)

sum_cookies_for_id <- function(id) {
  # Get the connected IDs for the given ID (the graph is undirected,
  # so look at both the "from" and the "to" side of each edge)
  connected_ids <- unique(c(id,
                            edges$to[edges$from == id],
                            edges$from[edges$to == id]))
  # Sum the number of cookies for all connected IDs
  sum(cookies[cookies$id %in% connected_ids, "number_of_cookies"])
}
# Test the function
sum_cookies_for_id(23)
But beyond this, I am not sure how to continue.
Can someone please show me how I might be able to continue writing the code for this problem? Can Dynamic Programming be used in such an example?
Thanks!
Notes:
I think that this problem might have a "stochastic" nature - that is, depending on which ID you begin with and which other IDs you group it with, you might end up with fewer or more people left with fewer than 12 cookies.
I think that writing a "greedy" algorithm that performs random groupings might be the easiest option for such a problem? (A sketch of that idea follows these notes.)
In the future, I would be interested in seeing how more "sophisticated algorithms" (e.g. Genetic Algorithm) could be used to make groupings such that the fewest number of people with less than 12 cookies are left behind.
Food for Thought: Is it possible that perhaps some pre-existing graph/network clustering algorithm could be used for this problem while taking into consideration these total/sum constraints (e.g. Louvain Community Detection)?
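As a starting point, here is a minimal greedy sketch of the idea from the notes above. It is only one possible interpretation (it forms disjoint groups, adds nearest neighbors first, and does not attempt the "un-pooling" step), the object names pools and assigned are just illustrative, and it assumes the edges, g and cookies objects defined earlier:
library(igraph)

assigned <- rep(FALSE, max(cookies$id))  # no "double dipping": each ID joins at most one group
pools    <- list()

# Walk over everyone, poorest first; skip people who already have 12+ cookies or a group
for (p in cookies$id[order(cookies$number_of_cookies)]) {
  if (assigned[p] || cookies$number_of_cookies[cookies$id == p] >= 12) next
  group <- p
  total <- cookies$number_of_cookies[cookies$id == p]
  # Candidate partners ordered by graph distance (1st-degree, then 2nd-degree, ...)
  d    <- distances(g, v = as.character(p))[1, ]
  cand <- as.integer(names(sort(d[is.finite(d) & d > 0])))
  for (nb in cand) {
    if (total >= 12) break                       # lower bound reached, stop growing
    extra <- cookies$number_of_cookies[cookies$id == nb]
    if (!assigned[nb] && total + extra <= 20) {  # never let a pool exceed 20
      group <- c(group, nb)
      total <- total + extra
    }
  }
  if (total >= 12) {                             # only commit groups that reach 12
    assigned[group] <- TRUE
    pools[[length(pools) + 1]] <- group
  }
}

length(pools)  # number of pooled groups formed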
I am trying to count the lengths of runs of a value in a vector, such as
q <- c(1,1,1,1,1,1,4,4,4,4,4,4,4,4,4,4,4,4,6,6,6,6,6,6,6,6,6,6,1,1,4,4,4)
Actual vectors are longer than this, and are time-based. What I would like is an output for 4 that tells me it occurred for 12 time steps (before the vector changes to 6) and then for 3 time steps (not that it occurred 15 times total).
Currently my ideas for doing this are pretty inefficient (a loop that walks the vector element by element and stops when the value no longer equals the one I specified). Can anyone recommend a more efficient method?
x <- with(rle(q), data.frame(values, lengths)) will pull the information that you want (courtesy of d.b. in the comments).
From the R Documentation: rle is used to "Compute the lengths and values of runs of equal values in a vector – or the reverse operation."
y <- x[x$values == 4, ] will subset the data frame to include only the value of interest (4). You can then see clearly that 4 ran for 12 times and then later for 3.
Modifying the code will let you check whatever value you want.
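For the example vector q above, running those two lines gives (output shown as comments):
x <- with(rle(q), data.frame(values, lengths))
x
#   values lengths
# 1      1       6
# 2      4      12
# 3      6      10
# 4      1       2
# 5      4       3

y <- x[x$values == 4, ]
y
#   values lengths
# 2      4      12
# 5      4       3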
I have a big dataset (around 100k rows) with 2 columns referencing a device_id and a date and the rest of the columns being attributes (e.g. device_repaired, device_replaced).
I'm building a ML algorithm to predict when a device will have to be maintained. To do so, I want to calculate certain features (e.g. device_reparations_on_last_3days, device_replacements_on_last_5days).
I have a function that subsets my dataset and returns a calculation:
For the specified device,
That happened before the day in question,
As long as there's enough data (e.g. if I want last 3 days, but only 2 records exist this returns NA).
Here's a sample of the data and the function outlined above:
data = data.frame(device_id=c(rep(1,5),rep(2,10))
                  ,day=c(1:5,1:10)
                  ,device_repaired=sample(0:1,15,replace=TRUE)
                  ,device_replaced=sample(0:1,15,replace=TRUE))
# Example: how many times device 1 was repaired over the 2 days before day 3
# => getCalculation(3, 1, data, "device_repaired", 2)
getCalculation <- function(fday, fdeviceid, fdata, fattribute, fpreviousdays){
  # Subset dataset
  df = subset(fdata, day < fday & day > (fday - fpreviousdays - 1) & device_id == fdeviceid)
  # Make sure there's enough data; if so, make calculation
  if(nrow(df) < fpreviousdays){
    calculation = NA
  } else {
    calculation = sum(df[, fattribute])
  }
  return(calculation)
}
My problem is that the number of available attributes (e.g. device_repaired) and of features to calculate (e.g. device_reparations_on_last_3days) has grown considerably, and my script takes around 4 hours to execute, since I need to loop over each row and calculate all these features.
I'd like to vectorize this logic using some apply-family approach, which would also allow me to parallelize its execution, but I don't know if/how it's possible to pass these extra arguments to an lapply-style call.
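One way this could look - a minimal sketch assuming the data and getCalculation above are used as-is, with the new column name repairs_last_2days chosen only for illustration - is mapply(), which loops over the per-row arguments while MoreArgs carries the arguments that stay fixed; parallel::mcmapply() is a drop-in replacement if you want to spread the work over several cores:
# Sketch: compute "repairs over the last 2 days" for every row of `data`
data$repairs_last_2days <- mapply(
  getCalculation,
  fday      = data$day,        # varies per row
  fdeviceid = data$device_id,  # varies per row
  MoreArgs  = list(fdata = data, fattribute = "device_repaired", fpreviousdays = 2)
)

# Same call spread over 4 cores (Unix-alikes only):
# data$repairs_last_2days <- parallel::mcmapply(
#   getCalculation, fday = data$day, fdeviceid = data$device_id,
#   MoreArgs = list(fdata = data, fattribute = "device_repaired", fpreviousdays = 2),
#   mc.cores = 4)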
I am trying to create a simple loop to generate a Wright-Fisher simulation of genetic drift with the sample() function (I'm actually not dead-set on using this function, but, in my naivety, it seems like the right way to go). I know that sample() randomly selects values from a vector based on certain probabilities. My goal is to create a system that will keep running making random selections from successive sets. For example, if it takes some original set of values and samples a second set, I'd like the loop to take another random sample from the second set (using the probabilities that were defined earlier).
I'd like to just learn how to do this in a very general way. Therefore, the specific probabilities and elements are arbitrary at this point. The only things that matter are (1) that every element can be repeated and (2) the size of the set must stay constant across generations, per Wright-Fisher. For an example, I've been playing with the following:
V <- c(1,1,2,2,2,2)
sample(V, size=6, replace=TRUE, prob=c(1,1,1,1,1,1))
Regrettably, I don't have any code to share yet, precisely because I'm not sure how to start writing this kind of loop. I know that for() loops are used to repeat a block of code multiple times, so my guess is to start there. However, from what I've researched, it seems that you have to start with a loop variable (typically i), and I don't have any variable in this sampling that seems obvious - which isn't to say one couldn't be made up.
If you wanted to repeatedly sample from a population with replacement for a total of iter iterations, you could use a for loop:
set.seed(144) # For reproducibility
population <- init.population
for (i in seq_len(iter)) {
  population <- sample(population, replace=TRUE)
}
population
# [1] 1 1 1 1 1 1
Data:
init.population <- c(1, 1, 2, 2, 2, 2)
iter <- 100
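If you also want to keep track of the intermediate generations (usually the point of a drift simulation), here is a small sketch along the same lines - the generations and freq1 names are just illustrative:
set.seed(144)
population  <- init.population
generations <- matrix(NA, nrow = iter, ncol = length(init.population))
for (i in seq_len(iter)) {
  population <- sample(population, replace = TRUE)  # one Wright-Fisher generation
  generations[i, ] <- population                    # store generation i
}
freq1 <- rowMeans(generations == 1)  # frequency of allele "1" in each generation
head(freq1)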
I have a list of over 1,000,000 numbers. I have a lookup table that has a range of numbers and a category. For example, 0-200 is category A, 201-650 is category B (the ranges are not of equal length).
I need to simply iterate over the list of 1,000,000 numbers and get a list of the 1,000,000 corresponding categories.
EDIT:
For example, the first few elements of my list are 100, 125.5, 807.5, 345.2, and it should return something like 1, 1, 8, 4 as categories. The logic for the mapping is implemented in a function, categoryLookup(cd), and I'm using the following command to get the categories:
cats <- sapply(list.cd, categoryLookup)
However, while this seems to work quickly on lists of size up to 10,000, it takes a lot of time for the whole list.
What is the fastest way to do the same? Is there any form of indexing that can help speed up the process?
The numbers:
numbers <- sample(1:1000000)
groups:
groups <- sort(rep(letters, 40000))
lookup:
categories <- groups[numbers]
EDIT:
If you don't yet have the vector of "groups" you can create it first.
Assume you have a data frame with the range info:
ranges <- data.frame(group=c("A","B","C"),
                     start=c(1,300001,600001),
                     end=c(300000,600000,1000000)
                     )
ranges
  group  start   end
1     A      1 3e+05
2     B 300001 6e+05
3     C 600001 1e+06
# if groups are sorted and don't overlap:
groups <- rep(ranges$group, (ranges$end-ranges$start)+1)
Then continue as before
categories <- groups[numbers]
EDIT: as @jbaums said, you have to add +1 to (ranges$end - ranges$start) in this case (already edited in the example above). Also, in this case the starting coordinate should be 1 and not 0.
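As a side note - a sketch assuming the same sorted, non-overlapping ranges data frame as above - findInterval() can do the lookup directly, without building a groups vector with one element per possible value, and it also works when the numbers are not integers (e.g. 125.5):
# For each number, find the index of the range whose start it falls at or after,
# then pick the corresponding group label
idx <- findInterval(numbers, ranges$start)
categories <- ranges$group[idx]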