Creating transition matrix per indexed Time interval - r

I have a dataset on the transitions of users between states (9 in total) in specific time intervals. This dataset will be used for a Markov chain model. In total there are 96 time intervals, so for every user there are 96 observations, each of which records the time interval, a start location and an end location. A state that combines two locations simply means that the user is still in transition between those two locations.
Below is a fictional dataset. In this example, unlike the actual dataset, start and end location are not necessarily linked, but I believe this will serve just as well as an illustration of the problem.
ID <- rep(1:10, each = 96)
TimeInterval <- rep(1:96, 10)
Locations <- c("Home", "Bakery", "Grocery", "Home-Bakery", "Home-Grocery", "Bakery-Home", "Bakery-Grocery", "Grocery-Home", "Grocery-Bakery")
startLocation <- sample(Locations, 960, replace = TRUE)
endLocation <- sample(Locations, 960, replace = TRUE)
df <- data.frame(ID, TimeInterval, startLocation, endLocation)
I want to calculate a transition matrix for every time interval, where the transition probability is the probability of being in a state/location at that time interval given the state/location at the previous time interval. For instance, to calculate the transition probability matrix for TimeInterval 37, the probability of being in a certain state in TimeInterval 37 given the state in TimeInterval 36 is used.
This will result in a total of 96 transition matrices. The probability of transitioning from one state (location) to another in a specific time interval is then based on the observations of all users combined.
However, I do not know how to aggregate the results of the individual transitions. What would be an efficient way to calculate these matrices?
The transition matrix per time interval should be a 9x9 matrix that includes all the states.
Edit:
A (very ugly) dplyr solution that worked for a single transition matrix:
library(dplyr)

Interval36 <- df %>% filter(TimeInterval == 36)
Interval37 <- df %>% filter(TimeInterval == 37)
# 'journey' is the combined state column in the actual dataset;
# rows are the state at interval 36, columns the state at interval 37
timeBlock37 <- data.frame(from = Interval36$journey, to = Interval37$journey)
mTimeBlock37 <- as.data.frame.matrix(table(timeBlock37))
timeBlock <- prop.table(mTimeBlock37)
timeBlock

I solved it myself, although not in the most elegant way: with quite an unstructured for-loop.
matrixList <- list()
states <- Locations

for (i in 1:96) {
  j <- ifelse(i < 96, i + 1, 96)               # pair interval i with the next one
  t1 <- df %>% filter(TimeInterval == i)       # states at the previous interval
  t2 <- df %>% filter(TimeInterval == j)       # states at the current interval
  # 'journey' is the combined state column in the actual dataset
  transitions <- data.frame(from = t1$journey, to = t2$journey)
  mTime <- as.data.frame.matrix(table(transitions))
  probs <- as.matrix(prop.table(mTime))
  # empty 9x9 template so that states missing from this interval still appear
  mat1 <- matrix(0, nrow = 9, ncol = 9, dimnames = list(states, states))
  rowsNeeded <- rownames(mat1)[rownames(mat1) %in% rownames(probs)]
  colsNeeded <- colnames(mat1)[colnames(mat1) %in% colnames(probs)]
  mat1[rowsNeeded, colsNeeded] <- probs[rowsNeeded, colsNeeded]
  matrixList[[i]] <- mat1
}
I create an empty 9x9 matrix inside the for-loop and populate it with the computed probabilities, to account for situations where not all states appear within a certain time interval.
If someone still has a more elegant/cleaner solution, feel free to contribute for future readers.
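For future readers, here is a more compact sketch along the same lines (my own suggestion, not part of the original question or answer; it uses the fictional df above, with startLocation standing in for the combined journey column of the real data). Fixing the factor levels to all nine states makes table() return a full 9x9 matrix even when some states never occur in an interval, and the result is 95 matrices because interval 1 has no predecessor:
states <- Locations
matrixList <- lapply(1:95, function(i) {
  # uses df and Locations from the example above; rows within an interval are
  # ordered by ID, so positions line up between interval i and interval i + 1
  from <- factor(df$startLocation[df$TimeInterval == i],     levels = states)
  to   <- factor(df$startLocation[df$TimeInterval == i + 1], levels = states)
  # margin = 1 conditions on the previous state (rows sum to 1); rows for states
  # that never occur at interval i come out as NaN and can be set to 0 if preferred
  prop.table(table(from, to), margin = 1)
})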

Related

What is the fastest way to perform an exhaustive search in R

I am implementing a version of the Very Large Scale Relieff algorithm detailed here.
Simply put, Very Large Scale Relieff splits the set of features N into several random subsets Ns, where Ns << N. It then calculates the Relieff weights for the features in each subset Ns. For each feature, the final weight is the highest weight assigned among the different subsets in which that particular feature appears.
I have ~80000 features for ~100 subjects. I can calculate 10000 subsets of 8000 features each in a reasonable amount of time (~5 minutes running on 25 cores) with the following code (scaled down to 100 features so that it is easier to profile):
library(tidyverse)
library(magrittr)
library(CORElearn)
library(doParallel)

# create fake data for the example
fake_table <- matrix(rnorm(100 * 100), ncol = 100) %>%
  as_tibble()
outcome <- rnorm(100)

# VLSRelieff code
start_time <- Sys.time()
myCluster <- makeCluster(25,        # number of cores to use
                         type = "FORK")
registerDoParallel(myCluster)
result <- foreach(x = seq(1, 10000)) %dopar% {
  # set seed for consistent results across runs
  set.seed(x)
  # subsample the feature table by extracting a subset of columns
  subset_index <- sample(seq(1, ncol(fake_table)), size = round(ncol(fake_table) * .01))
  subset_matrix <- fake_table[, subset_index]
  # append the outcome as the last column of the subset
  subset_matrix[, ncol(subset_matrix) + 1] <- outcome
  # use attrEval from the CORElearn package to calculate the Relieff weights for the subset
  rf_weights <- attrEval(formula = ncol(subset_matrix), subset_matrix, estimator = "RReliefFequalK")
  # build a one-row tibble with the Relieff weights, one column per feature in the subset
  rf_weights %>%
    unname() %>%
    matrix(., ncol = length(.), byrow = TRUE) %>%
    as_tibble() %>%
    set_colnames(., names(rf_weights))
}
end_time <- Sys.time()
end_time - start_time
However, the code above does only half of the work: the other half is, for each feature, to go through the results of the different repetitions and find the maximum value obtained. I have managed to write working code, but it is outrageously slow (I let it run for 2 hours before stopping it, although it did work when testing with fewer features; again, it is scaled down to 100 features here and should run in ~7 seconds):
start_time <- Sys.time()
myCluster <- makeCluster(25,        # number of cores to use
                         type = "FORK")
registerDoParallel(myCluster)
# get all feature names
feat_names <- colnames(fake_table)
# initialize an empty vector of zeros, named after the features
feat_weights <- rep(0, length(feat_names))
names(feat_weights) <- feat_names
# loop in parallel over the feature names
feat_weight_foreach <- foreach(feat = feat_names, .combine = 'c') %dopar% {
  # initialize the weight as 0
  current_weight <- 0
  # for every element of result (i.e. every repetition of the subsampling process)
  for (el in 1:length(result)) {
    # look up the weight of this feature in the current repetition
    new_weight <- result[[el]][[1, feat]]
    # skip if empty (i.e. the feature is not present in the current subset)
    if (is_empty(new_weight)) next
    # if the new weight is higher than the current weight, keep it
    if (current_weight < new_weight) current_weight <- new_weight
  }
  current_weight
}
end_time <- Sys.time()
end_time - start_time
If I understood what you are trying to do correctly, then the answer is simpler than you think.
Correct me if I'm wrong, but you are trying to get the max value obtained from attrEval per feature?
If so, then why not just bind all the results into one data frame (or data.table) and then get the max per column, like so:
allResults <- result %>% data.table::rbindlist(fill = TRUE)
apply(allResults, 2, max, na.rm=TRUE)
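A small data.table-native variation on the same idea (my addition, not from the original answer):
library(data.table)

# result is the list of one-row tibbles produced by the foreach loop above
allResults <- rbindlist(result, fill = TRUE)
# column-wise max, ignoring the NAs that fill = TRUE introduces
allResults[, lapply(.SD, max, na.rm = TRUE)]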
This follows #DS_UNI's idea, but instead of binding a list, the approach is to create a matrix from the initial loop. That is, a list of tibbles makes us do extra work; instead, we already have everything we need to make a matrix:
library(tidyverse)
library(magrittr)
library(CORElearn)
library(doParallel)

nr = 50L
nc = 200L

## generate data
set.seed(123)
mat = matrix(rnorm(nr * nc), ncol = nc, dimnames = list(NULL, paste0('V', seq_len(nc))))
outcome = rnorm(nr)

## constants for sampling
n_reps = nc
nc_sample_size = round(nc * 0.01)

## pre-allocate result
res = matrix(0, nrow = n_reps, ncol = ncol(mat), dimnames = dimnames(mat))

st = Sys.time()
for (i in seq_len(n_reps)) {
  set.seed(i)
  ## similar data simulation as the OP
  sub_cols = sample(seq_len(nc), nc_sample_size)
  sub_mat = cbind(mat[, sub_cols], outcome)
  rf_weights = attrEval(formula = ncol(sub_mat), as.data.frame(sub_mat), estimator = 'RReliefFequalK')
  ## assign back into the pre-allocated result
  res[i, sub_cols] = rf_weights
}
## get the max of each column
apply(res, 2L, max)
et = Sys.time()
et - st
The downside is that this loses the parallel workers. The upside is that we have fewer memory slowdowns, because we allocate most of what we need up front.
This is not a final answer, but since this is a numerical problem I would suggest writing the critical function in C++. That should increase the speed significantly, by an order of magnitude or more I would guess. In my opinion, using plain R for this very specific numerical task is just hitting a brick wall.
The first chapter of Rcpp for everyone, "Suitable situations to use Rcpp", says:
R is weak in some kinds of operations. If you need the operations listed below, it is time to consider using Rcpp.
Loop operations in which later iterations depend on previous iterations.
Accessing each element of a vector/matrix.
Recurrent function calls within loops.
Changing the size of vectors dynamically.
Operations that need advanced data structures and algorithms.
Wickham's Advanced R has a good chapter on that topic too.
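As a minimal illustration of that suggestion (my own sketch, not from the original answer), here is a column-wise maximum written in C++ via Rcpp, which could stand in for apply(res, 2, max) on the pre-allocated result matrix from the previous answer:
library(Rcpp)

# sketch only: assumes a numeric matrix such as `res` from the answer above
cppFunction('
NumericVector col_max(NumericMatrix x) {
  int nr = x.nrow(), nc = x.ncol();
  NumericVector out(nc);
  for (int j = 0; j < nc; ++j) {
    double m = x(0, j);
    for (int i = 1; i < nr; ++i) {
      if (x(i, j) > m) m = x(i, j);
    }
    out[j] = m;
  }
  return out;
}
')

# usage, keeping the feature names:
# setNames(col_max(res), colnames(res))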

Speed optimization - calculating weighted column in data.table with distance matrix

I am trying to apply weights to a numeric vector in a data.table. The weights come from the Euclidean distances of each point to all the other points. If a point is close to another point, the weight assigned to the pair is higher; if the distance between two points is greater than a threshold, the weight is 0; and the weight assigned to the distance between a point and itself is 10000.
I can illustrate with the code below:
library(data.table)
library(dplyr)
library(tictoc)

set.seed(42)
df <- data.table(
  LAT = rnorm(500, 42),
  LONG = rnorm(500, -72),
  points = rnorm(500)
)
df2 <- copy(df)                          # for the new solution
d <- as.matrix(dist(df[, .(LAT, LONG)])) # compute the distance matrix

# function to calculate the weights
func <- function(j, cols, threshold) {
  N <- which(d[j, ] <= threshold)  # points whose distances are below the threshold
  K <- 1 / (d[j, N] ^ 2)           # weights are inversely proportional to squared distance
  K[which(d[j, N] == 0)] <- 10000  # weight of a point to itself is 10000
  return((K %*% as.matrix(df[N, ..cols])) / sum(K))  # weighted point for one row
}

tic('Old way')
# compute the weighted point calculation for every row
result <- tapply(1:nrow(df), 1:nrow(df), function(i) func(i, 'points', 0.5))
df[, 'weighted_points' := result]  # assign the results back to the data.table
toc()
The current function works well for a small number of points, but it takes much longer to compute the weighted points for about 220K rows.
I have come up with another solution that cuts the time in half, but I think it can still be improved.
d <- as.matrix(dist(df[, .(LAT, LONG)]))
df2[, 'weighted_points' := points]
dt <- as.data.table(d)
cols <- names(dt)

tic('New way')
# compute the weights
dt[, (cols) := lapply(.SD, function(x) case_when(
  x == 0 ~ 10000,
  x <= 0.5 ~ 1 / (x^2),
  TRUE ~ 0)), .SDcols = cols]
# compute the weighted point for each row
for (i in 1L:nrow(dt)) {
  set(df2, i, 'weighted_points', value = sum(df2[['points']] * dt[[i]]) / sum(dt[[i]]))
}
toc()
round(sum(df$weighted_points - df2$weighted_points), 0)
The time difference may be small for this small data set, but I have tested with the real data set and the new way is quite a bit faster.
My question is: how can I make the new approach even faster? I know I am using case_when from dplyr, which may trade speed for readability, but are there other things I am doing incorrectly in data.table that could help make it faster?
From a data-analysis perspective, I think you could improve your code with an approximation for which points count as close.
I once worked with NCDC station locations and tried to find the closest stations for each other; because there were so many stations, it was very time-consuming. The idea I came up with was: after computing dist on the coordinates of the points, rank the distances for each point and set a threshold on how many stations to use for the real weight calculation.
For example, after ranking, take the 50 closest points (within the rank) and give them their weights; all other points simply get a weight of 0. A rough sketch of that idea is below.
Hope this helps
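The sketch (my own illustration, untested on the real data; it reuses the distance matrix d and the data.table df from the question, with k as the number of nearest points to keep):
# assumes d and df from the question above
k <- 50
weighted_k <- sapply(seq_len(nrow(df)), function(j) {
  nearest <- order(d[j, ])[seq_len(k)]  # the k closest points, including the point itself
  w <- 1 / d[j, nearest]^2              # inverse-squared-distance weights
  w[d[j, nearest] == 0] <- 10000        # weight of a point to itself, as in the question
  sum(w * df$points[nearest]) / sum(w)
})
df[, weighted_points_k := weighted_k]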

How to Generate Normal Random Samples within Mean±3Sigma

I want to draw normal random numbers into an array of dimension 100 x 8 x 5000 with a specific mean (M) and standard deviation (S), but I want them to lie only within the range M ± 3S, so that the array contains no outliers exceeding those limits.
Any suggestions? I want to write a program in R based on this array for some simulation studies. I am using the following R code to generate my data set:
# muu and sigma are length-8 vectors of means and standard deviations;
# Dat is a pre-allocated 100 x 8 x 5000 array
for (i in 1:5000) {
  for (j in 1:8) {
    Dat[, j, i] <- rnorm(100, mean = muu[j], sd = sigma[j])
  }
}
Now, we want to get rid of the values in the above data that fall outside muu ± 3*sigma. Of course, we have to replace the discarded values with fresh values so that the dimensions of the Dat array stay intact.
First Solution
Here is a start, but I bet there is a more elegant solution.
First generate a sample; the next step is to subset it to your desired values. Of course you have to adjust the bounds to your needs.
set.seed(123)
rs <- rnorm(10000, mean = 10, sd = 3)
rs1 <- rs[rs >= 1 & rs <= 19]  # keep values within 10 +/- 3*3
Second (better) solution
I don't think my first solution worked so well. I have just written some code that might be better suited to your purpose. Here are the steps:
create an array of NAs with the required dimensions
fill it with random numbers
create a logical array that is TRUE wherever a value falls outside the bounds
subset the data with that array and replace those values with the mean used to generate the samples
data <- array(NA, dim = c(100, 8, 5000))
for (i in 1:5000) {
  data[, , i] <- rnorm(800, 3, 1)
}
bound <- 3 + c(-1, 1) * 3 * 1              # mean +/- 3 * sd
pr <- data <= bound[1] | data >= bound[2]  # TRUE where a value is out of range
data[pr] <- 3                              # replace out-of-range values with the mean
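Replacing out-of-range values with the mean does change the shape of the distribution. If the goal is to keep the draws random, one option (my own sketch, assuming muu and sigma are the length-8 vectors from the question) is to redraw only the out-of-range values until everything lies within muu ± 3*sigma:
# assumes muu and sigma as defined in the question
Dat <- array(NA_real_, dim = c(100, 8, 5000))
for (i in 1:5000) {
  for (j in 1:8) {
    x <- rnorm(100, mean = muu[j], sd = sigma[j])
    out <- abs(x - muu[j]) > 3 * sigma[j]
    while (any(out)) {
      # resample only the values that fell outside the 3-sigma band
      x[out] <- rnorm(sum(out), mean = muu[j], sd = sigma[j])
      out <- abs(x - muu[j]) > 3 * sigma[j]
    }
    Dat[, j, i] <- x
  }
}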

Select the most dissimilar individual using cluster analysis

I want to cluster my data into, say, 5 clusters and then select the 50 individuals that are most dissimilar from the rest of the data, in proportion to the cluster sizes. That means if cluster one contains 100 individuals, cluster two 200, cluster three 400, cluster four 200, and cluster five 100, I have to select 5 from the first cluster + 10 from the second + 20 from the third + 10 from the fourth + 5 from the fifth.
Data example:
mydata<-matrix(nrow=100,ncol=10,rnorm(1000, mean = 0, sd = 1))
What I have done so far is cluster the data and rank the individuals within each cluster, then export the result to Excel and go from there …
That has become a problem since my data has grown really big.
I would appreciate any help or suggestion on how to do this in R.
I'm not sure whether this is exactly what you are looking for, but maybe it helps:
mydata <- matrix(nrow = 100, ncol = 10, rnorm(1000, mean = 0, sd = 1))
rownames(mydata) <- paste0("id", 1:100)  # some id for identification

# cluster objects and calculate the dissimilarity matrix
cl <- cutree(hclust(
  sim <- dist(mydata, diag = TRUE, upper = TRUE)), 5)

# combine the results, take the sum to aggregate dissimilarity
res <- data.frame(id = rownames(mydata),
                  cluster = cl, dis_sim = rowSums(as.matrix(sim)))
# order, lowest overall dissimilarity first
res <- res[order(res$dis_sim), ]

# split by cluster
reslist <- split(res, f = res$cluster)

## take the three items with the highest overall dissimilarity per cluster
lapply(reslist, tail, n = 3)

## return the ids with the highest overall dissimilarity, top 20% per cluster
lapply(reslist, function(x, p) tail(x, round(nrow(x) * p)), p = 0.2)
Regarding your comment, find the code below.
Please note that the code could be improved in terms of elegance and efficiency. Furthermore, I put it in a second answer because otherwise it would get too messy.
# calculation of centroids based on:
# https://stat.ethz.ch/pipermail/r-help/2006-May/105328.html
cl <- hclust(dist(mydata, diag = TRUE, upper = TRUE))
cent <- tapply(mydata,
               list(rep(cutree(cl, 5), ncol(mydata)), col(mydata)), mean)
dimnames(cent) <- list(NULL, dimnames(mydata)[[2]])

# add the cluster number to the data and split by cluster
newdf <- data.frame(data = mydata, cluster = cutree(cl, k = 5))
newdfl <- split(newdf, f = newdf$cluster)

# add the centroids and drop the cluster column
totaldf <- lapply(1:5,
                  function(i, li, cen) rbind(cen[i, ], li[[i]][, -11]),
                  li = newdfl, cen = cent)

# calculate the distance of every member to its centroid and sort
dist_to_cent <- lapply(totaldf, function(x)
  sort(as.matrix(dist(x, diag = TRUE, upper = TRUE))[1, ]))
dist_to_cent
For the calculation of centroids from hclust, see the R mailing list thread linked above.
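To tie this back to the original question, a possible follow-up (my own sketch, not part of the answer) is to take a fixed share of each cluster, picking the members farthest from their centroid; note that the first element of each sorted vector is the centroid's zero distance to itself:
# uses dist_to_cent from the answer above
lapply(dist_to_cent, function(x) {
  members <- x[-1]  # drop the centroid's distance to itself
  tail(names(members), round(length(members) * 0.2))
})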

