Transition probability with zero-frequency - r

With the code below I can calculate the Markov chain transition probabilities per time step. However, I would like to make the following changes to my code:
a.) speed up the transitions: increase the count of each transition by one
AND, as a separate change,
b.) speed up each transition only between specific time steps (for example 04:10-04:30)
Can this be done with Laplace smoothing?
The mathematical formula for the Laplace smoothing is:
where nij represents the count of users whose activity has changed from i to j in the data, and ni represents the count of users whose activity is i at the time step k.
The mathematical formula for calculating the transition probability matrices is:
Sample code:
library(readr)
library(markovchain)
mcListFist <- markovchainListFit(data = df[, ], name = "df")
matrixList <- list()
for (i in 1:dim(mcListFist$estimate)) {
  myMatr <- mcListFist$estimate[[i]] # transitionMatrix
  matrixList[[i]] <- myMatr
}
matrixList
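A minimal base R sketch of add-one (Laplace) smoothing for this kind of data. It assumes two states (0 and 1), that each column is a time step, and that the smoothed estimate takes the common form (nij + 1) / (ni + m), with m the number of states. The helper smoothed_step and the other names are hypothetical and not part of markovchain; user pairs containing an NA are dropped.

```r
# Sample data from the question: rows are users, columns are time steps.
df <- data.frame(
  `04:00` = c(1, 1, 1, 1, 1, 0, 0, 0, 0, 0),
  `04:10` = c(1, 1, 0, 0, 0, 0, 0, 0, 0, NA),
  `04:20` = c(1, 1, 1, 0, 0, 0, 0, 0, 1, NA),
  `04:30` = c(0, 0, 0, 0, 0, 0, 1, 1, 1, NA),
  check.names = FALSE
)
states <- c(0, 1)  # assumed state space

# Hypothetical helper: transition matrix between two adjacent columns,
# with `alpha` added to every transition count (alpha = 1 is add-one smoothing).
smoothed_step <- function(from, to, states, alpha = 1) {
  ok <- !is.na(from) & !is.na(to)
  counts <- table(factor(from[ok], levels = states),
                  factor(to[ok], levels = states))
  counts <- unclass(counts) + alpha
  counts / rowSums(counts)  # each row is divided by its own total
}

# One entry per transition between adjacent time steps.
steps <- seq_len(ncol(df) - 1)
names(steps) <- paste(names(df)[-ncol(df)], names(df)[-1], sep = "->")

# (a) smooth every step
all_smoothed <- lapply(steps, function(k) {
  smoothed_step(df[[k]], df[[k + 1]], states)
})

# (b) smooth only the steps that start at the chosen times (04:10-04:30)
smooth_from <- c("04:10", "04:20")
partial <- lapply(steps, function(k) {
  alpha <- if (names(df)[k] %in% smooth_from) 1 else 0
  smoothed_step(df[[k]], df[[k + 1]], states, alpha)
})
```

all_smoothed[["04:00->04:10"]] is the smoothed matrix for the first step. With alpha = 0, a state that is never observed at a given step produces a row of NaN, which is the zero-frequency case the smoothing is meant to avoid.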
Sample data:
df<-structure(list(`04:00` = c(1, 1, 1, 1, 1, 0, 0, 0, 0, 0), `04:10` = c(1,
1, 0, 0, 0, 0, 0, 0, 0, NA), `04:20` = c(1, 1, 1, 0, 0, 0, 0,
0, 1, NA), `04:30` = c(0, 0, 0, 0, 0, 0, 1, 1, 1, NA)), row.names = c(NA,
-10L), spec = structure(list(cols = list(`04:00` = structure(list(), class = c("collector_double",
"collector")), `04:10` = structure(list(), class = c("collector_double",
"collector")), `04:20` = structure(list(), class = c("collector_double",
"collector")), `04:30` = structure(list(), class = c("collector_double",
"collector"))), default = structure(list(), class = c("collector_guess",
"collector")), delim = ","), class = "col_spec"), class = c("spec_tbl_df",
"tbl_df", "tbl", "data.frame"))

Related

Count edges using the adjacency matrix

Based on the adjacency matrix, I would like to count the number of unique edges in a network. In the example below I coloured the unique edges between the different nodes, but I don't know how to proceed.
Desired output:
Sample data
structure(list(...1 = c("m1", "m2", "m3", "m4"), m1 = c(0.2,
0.2, 0.2, 0.3), m2 = c(0.1, 0.2, 0.2, 0.6), m3 = c(0.5, 0.2,
1, 0), m4 = c(0.3, 0, 0, 0.1)), row.names = c(NA, -4L), spec = structure(list(
cols = list(...1 = structure(list(), class = c("collector_character",
"collector")), m1 = structure(list(), class = c("collector_double",
"collector")), m2 = structure(list(), class = c("collector_double",
"collector")), m3 = structure(list(), class = c("collector_double",
"collector")), m4 = structure(list(), class = c("collector_double",
"collector"))), default = structure(list(), class = c("collector_guess",
"collector")), delim = ","), class = "col_spec"), class = c("spec_tbl_df",
"tbl_df", "tbl", "data.frame"))
Assuming that this is an undirected graph in which 0 indicates no edge and a positive number indicates an edge, convert the input DF to a logical matrix and from that to an igraph object. Then get its edges and their names; length(e) gives the number of edges. (Another possible output is as_edgelist(g), which returns a 2-column matrix in which each row defines an edge.)
If the graph is intended to be directed, replace "undirected" with "directed"; in that case a character vector of 13 edge names is produced instead of the 9 undirected edges shown below.
library(igraph)
m <- as.matrix(DF[-1])
rownames(m) <- colnames(m)
g <- graph_from_adjacency_matrix(m > 0, "undirected")
e <- E(g)
attr(e, "vnames")
## [1] "m1|m1" "m1|m2" "m1|m3" "m1|m4" "m2|m2" "m2|m3" "m2|m4" "m3|m3" "m4|m4"
Alternatively, as a pipeline:
library(igraph)
library(tibble)
DF %>%
column_to_rownames("...1") %>%
as.matrix %>%
sign %>%
graph_from_adjacency_matrix("undirected") %>%
E %>%
attr("vnames")
## [1] "m1|m1" "m1|m2" "m1|m3" "m1|m4" "m2|m2" "m2|m3" "m2|m4" "m3|m3" "m4|m4"
The graph of g looks like this. (If "directed" had been chosen above then the edges would have arrowheads on them.)
set.seed(123)
plot(g)
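If only the count is needed, a base R sketch without igraph is possible. It assumes, as above, that any positive entry is an edge and that self-loops count as edges. The sample matrix is not symmetric (the m4 row has 0.6 under m2, while the m2 row has 0 under m4), so this sketch treats a pair as connected when either direction is positive, which reproduces the 9 edges listed above.

```r
# Adjacency matrix from the sample data (filled column by column).
m <- matrix(
  c(0.2, 0.2, 0.2, 0.3,
    0.1, 0.2, 0.2, 0.6,
    0.5, 0.2, 1,   0,
    0.3, 0,   0,   0.1),
  nrow = 4,
  dimnames = list(paste0("m", 1:4), paste0("m", 1:4))
)

adj <- m > 0          # TRUE where an edge exists
und <- adj | t(adj)   # undirected: connected if either direction is positive

# Count each unordered pair once, including self-loops on the diagonal.
n_undirected <- sum(und[upper.tri(und, diag = TRUE)])
n_undirected
# [1] 9

# Directed count: every positive entry is its own edge.
n_directed <- sum(adj)
n_directed
# [1] 13
```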
Note
DF <-
structure(list(...1 = c("m1", "m2", "m3", "m4"), m1 = c(0.2,
0.2, 0.2, 0.3), m2 = c(0.1, 0.2, 0.2, 0.6), m3 = c(0.5, 0.2,
1, 0), m4 = c(0.3, 0, 0, 0.1)), row.names = c(NA, -4L), spec = structure(list(
cols = list(...1 = structure(list(), class = c("collector_character",
"collector")), m1 = structure(list(), class = c("collector_double",
"collector")), m2 = structure(list(), class = c("collector_double",
"collector")), m3 = structure(list(), class = c("collector_double",
"collector")), m4 = structure(list(), class = c("collector_double",
"collector"))), default = structure(list(), class = c("collector_guess",
"collector")), delim = ","), class = "col_spec"), class = c("spec_tbl_df",
"tbl_df", "tbl", "data.frame"))

Loop in tidyverse

I am learning the tidyverse with a time-series dataset, and I have selected the columns that start with sec. I would like to identify the values in those columns that equal 123, keep them, and replace the rest with 0, but I don't know how to loop over sec1:sec4. Also, how can I sum() per column?
df1<-df %>%
select(starts_with("sec")) %>%
select(ifelse("sec1:sec4"==123, 1, 0))
Sample data:
structure(list(sec1 = c(1, 123, 1), sec2 = c(123, 1, 1), sec3 = c(123,
0, 0), sec4 = c(1, 123, 1)), spec = structure(list(cols = list(
sec1 = structure(list(), class = c("collector_double", "collector"
)), sec2 = structure(list(), class = c("collector_double",
"collector")), sec3 = structure(list(), class = c("collector_double",
"collector")), sec4 = structure(list(), class = c("collector_double",
"collector"))), default = structure(list(), class = c("collector_guess",
"collector")), delim = ","), class = "col_spec"), row.names = c(NA,
-3L), class = c("spec_tbl_df", "tbl_df", "tbl", "data.frame"))
I think you would have to use mutate and across to accomplish this. The code below mutates across each column starting with sec, keeping all values that are 123 and replacing all others with 0.
library(dplyr)
df1 <- df %>%
  select(starts_with("sec")) %>%
  mutate(across(starts_with("sec"), .fns = function(x) { ifelse(x == 123, x, 0) }))
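For the second part of the question (summing per column), one possible sketch applies summarise with across to the same columns. The data frame is rebuilt here from the sample data so the block runs on its own; col_sums and n_matches are illustrative names.

```r
library(dplyr)

# Sample data from the question.
df <- data.frame(
  sec1 = c(1, 123, 1), sec2 = c(123, 1, 1),
  sec3 = c(123, 0, 0), sec4 = c(1, 123, 1)
)

# Keep values equal to 123, replace everything else with 0.
df1 <- df %>%
  mutate(across(starts_with("sec"), ~ ifelse(.x == 123, .x, 0)))

# Sum of each column after the replacement.
col_sums <- df1 %>%
  summarise(across(starts_with("sec"), sum))

# Number of cells equal to 123 in each original column.
n_matches <- df %>%
  summarise(across(starts_with("sec"), ~ sum(.x == 123)))
```

In base R, colSums(df1) gives the same per-column totals as a named vector.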

Return two or more max equal values

It is my intent to return the time of day of the column where the most measures were taken. As an example, I found the highest numbers in this to be 16:10 and 16:20. I am using the which.max(col) function but it returns only one occurrence, not all. What is the best way to return two (or more) equal maximum values?
Update: If I sum each column with colSums, how can I also return the columns with the maximum value (e.g. 16:10 and 16:20)?
Desired output: 16:10 16:20
Sample data:
structure(list(Duration = c(10, 20, 30, 40, 50, 60), `16:00` = c(1,
0, 0, 0, 1, 0), `16:10` = c(0, 0, 0, 2, 1, 0), `16:20` = c(0,
0, 0, 2, 1, 0)), class = c("spec_tbl_df", "tbl_df", "tbl", "data.frame"
), row.names = c(NA, -6L), spec = structure(list(cols = list(
Duration = structure(list(), class = c("collector_double",
"collector")), `16:00` = structure(list(), class = c("collector_double",
"collector")), `16:10` = structure(list(), class = c("collector_double",
"collector")), `16:20` = structure(list(), class = c("collector_double",
"collector"))), default = structure(list(), class = c("collector_guess",
"collector")), skip = 1L), class = "col_spec"))
Compare the max value with the values in the dataframe and return the column names that have at least one occurrence of the max value.
tmp <- df[-1]
names(tmp)[colSums(tmp == max(tmp)) > 0]
#[1] "16:10" "16:20"
For the update this should work -
tmp <- colSums(df[-1])
names(tmp)[tmp == max(tmp)]
#[1] "16:10" "16:20"
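To see why which.max alone is not enough, here is a small self-contained sketch using the sample data: which.max stops at the first maximum, whereas comparing against max() keeps every tie. The names first_only and all_ties are illustrative.

```r
# Sample data from the question (check.names = FALSE keeps the time labels).
df <- data.frame(
  Duration = c(10, 20, 30, 40, 50, 60),
  `16:00` = c(1, 0, 0, 0, 1, 0),
  `16:10` = c(0, 0, 0, 2, 1, 0),
  `16:20` = c(0, 0, 0, 2, 1, 0),
  check.names = FALSE
)

totals <- colSums(df[-1])

# which.max returns only the first position of the maximum.
first_only <- names(which.max(totals))
first_only
# [1] "16:10"

# Comparing against the maximum keeps all tied columns.
all_ties <- names(which(totals == max(totals)))
all_ties
# [1] "16:10" "16:20"
```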
Does this work?
colnames(df)[apply(df, 2, function(x) any(x == max(df[-1])))]
#[1] "16:10" "16:20"
This can also be done with a combination of base R and tidyverse:
library(dplyr)
df[-1] %>%
  select(where(~ sum(.x) == max(colSums(df[-1])))) %>%
  names()
#[1] "16:10" "16:20"
We can use base R
v1 <- sapply(df[-1], max)
names(which(v1 == max(v1)))
#[1] "16:10" "16:20"
Or with tidyverse
library(dplyr)
library(tidyr)
df %>%
  summarise(across(-Duration, max)) %>%
  pivot_longer(everything()) %>%
  filter(value == max(value)) %>%
  pull(name)
#[1] "16:10" "16:20"

Regression in R on panel data

I would like to run a linear regression on panel data. Below is my code so far, but I don't understand why it is not returning fit and rsq. Any suggestions?
Sample code:
for(i in names(df))
{
if(is.numeric(df[3,i])) ##if row 3 is numeric, the entire column is
{
fit <- lm(df[3,i] ~ Gender, data=df) #does a regression for each column in my csv file against my independent variable 'Gender'
rsq <- summary(fit)$r.squared
}
}
Sample data:
df<-structure(list(id = c(1, 1, 2, 2, 2), id1 = c(1, 2, 1, 2, 3),
a1 = c(5, 8, 7, 6, 3), a2 = c(1, 4, 3, 10, 5), a3 = c(2,
34, 3, 12, 6), a4 = c(9, 2, 3, 12, 7), a5 = c(0, 0, 0, 7,
8), a6 = c(7, 7, 0, 0, 9), a7 = c(5, 8, 7, 6, 0), a8 = c(1,
4, 3, 10, 3), a9 = c(2, 34, 3, 12, 3), a10 = c(9, 2, 3, 12,
3), Gender = c(1, 2, 1, 1, 2)), class = c("spec_tbl_df",
"tbl_df", "tbl", "data.frame"), row.names = c(NA, -5L), spec = structure(list(
cols = list(id = structure(list(), class = c("collector_double",
"collector")), id1 = structure(list(), class = c("collector_double",
"collector")), a1 = structure(list(), class = c("collector_double",
"collector")), a2 = structure(list(), class = c("collector_double",
"collector")), a3 = structure(list(), class = c("collector_double",
"collector")), a4 = structure(list(), class = c("collector_double",
"collector")), a5 = structure(list(), class = c("collector_double",
"collector")), a6 = structure(list(), class = c("collector_double",
"collector")), a7 = structure(list(), class = c("collector_double",
"collector")), a8 = structure(list(), class = c("collector_double",
"collector")), a9 = structure(list(), class = c("collector_double",
"collector")), a10 = structure(list(), class = c("collector_double",
"collector")), Gender = structure(list(), class = c("collector_double",
"collector"))), default = structure(list(), class = c("collector_guess",
"collector")), skip = 1L), class = "col_spec"))
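In the loop, df[3, i] selects a single cell rather than a column (and on a tibble it returns a 1x1 tibble, for which is.numeric() is FALSE), so no model is ever fitted; fit and rsq would also be overwritten on every iteration. To fit a linear regression on each column you can use lapply. We use reformulate to create a formula object from each column name and pass it to lm inside lapply. The R-squared value can then be extracted from the summary of each model.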
cols <- grep('a\\d+', names(df), value = TRUE)
cols
#[1] "a1" "a2" "a3" "a4" "a5" "a6" "a7" "a8" "a9" "a10"
fit <- lapply(cols, function(x) {
  lm(reformulate('Gender', x), df)
})
r.squared <- sapply(fit, function(x) summary(x)$r.squared)
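A self-contained variant of the same idea, shown as a sketch: naming the list by column makes it easier to see which R-squared belongs to which response. Only a1 to a3 are reproduced from the sample data to keep it short; the anchored pattern "^a\\d+$" and the names fits and r_squared are choices made for this example.

```r
# Subset of the sample data from the question.
df <- data.frame(
  id = c(1, 1, 2, 2, 2),
  id1 = c(1, 2, 1, 2, 3),
  a1 = c(5, 8, 7, 6, 3),
  a2 = c(1, 4, 3, 10, 5),
  a3 = c(2, 34, 3, 12, 6),
  Gender = c(1, 2, 1, 1, 2)
)

# Response columns: "a" followed by digits only (excludes id, id1, Gender).
cols <- grep("^a\\d+$", names(df), value = TRUE)

# One model per response column; setNames() carries the names to the result.
fits <- lapply(setNames(cols, cols), function(x) {
  lm(reformulate("Gender", response = x), data = df)
})

# Named numeric vector: one R-squared per response column.
r_squared <- sapply(fits, function(f) summary(f)$r.squared)
```

r_squared["a2"] then returns the value for the a2 model, and fits[["a2"]] returns the fitted model itself.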

ggplot loop deal with special characters

Hi there, I'm trying to plot a defined number of graphs using gridExtra.
This works, but unfortunately it does not handle special characters in the column names. I tried to work around this by using R-friendly names and adding the actual name as a subtitle:
library(gridExtra)
library(ggplot2)
Dataframe<-read.csv2("File_with_R_friendly_names.csv")
names<-read.csv2("File_with_actual_names.csv")
bar<-colnames(names)
list_of_plots<-lapply(names(Dataframe)[2:10], function(i) {
ggplot(Dataframe, aes_string(x="X1", y=i)) + geom_point()+labs(x=i, y="Intensity", subtitle=bar[i])
})
do.call(grid.arrange, c(list_of_plots, ncol=3))
If I put in bar[2], every graph gets an actual name, but it is the same one for all of them; if I use bar[i], all graphs get NA.
The names I use to suit R are
Met1, Met2, Met3, Met4, Met5, Met6, Met7, Met8, Met9 and Met10
Examples of names that I need on the plots are:
-(-)-Corey lactone
-(2R)-2,3-Dihydroxypropanoic acid
-(D-(+)-Glyceric acid?)
-1,5-Naphthalenediamine
-12-Aminododecanoic acid
-2,5-di-tert-Butylhydroquinone
-2,6-di-tert-Butylphenol
-2-Amino-N,N-diethylacetamide
-2-Ethyl-2-phenylmalonamide
-2-Naphthalenesulfonic acid
Here is the dput to reproduce the bar (names):
bar<-c("X1", "(-)-Corey lactone", "(2R)-2,3-Dihydroxypropanoic acid (D-(+)-Glyceric acid?)", "1,5-Naphthalenediamine", "12-Aminododecanoic acid", "2,5-di-tert-Butylhydroquinone", "2,6-di-tert-Butylphenol", "2-Amino-N,N-diethylacetamide", "2-Ethyl-2-phenylmalonamide", "2-Naphthalenesulfonic acid")
Here is the dput to reproduce the dataframe:
Dataframe<-structure(list(X1 = c(0, 0, 0.25, 0.25, 0.5, 0.5, 1, 1, 2, 2),
Met1 = c(0, 0, 38096319.85, 45978353.93, 35077691.7, 42146132.41,
62606961.17, 32786049.6, 51054004.82, 48898547.32), Met2 = c(0,
0, 1288905.771, 948466.4001, 645979.6463, 1228663.251, 1137957.136,
940928.9344, 1443680.706, 1755726.385), Met3 = c(0, 0, 575887.464,
693692.0349, 1362477.6, 1515767.293, 2241120.502, 2417932.908,
3866432.112, 3894701.876), Met4 = c(0, 0, 16737068.73, 21915551.3,
12088089.1, 16003037.3, 17720785.29, 11957614.24, 13127281.5,
14192542.13), Met5 = c(0, 0, 4556006.426, 4782909.936, 4484706.271,
8019957.826, 5112289.476, 8537488.48, 6680688.948, 5959748.061
), Met6 = c(0, 0, 16874476.32, 15721984.25, 18093323.61,
18619817.92, 22055835.04, 19754379.11, 29211315.88, 27321333.35
), Met7 = c(0, 0, 6604385.457, 6396794.568, 13823034.64,
15449539.63, 26013299.82, 20262673.28, 35301685.57, 33367520.66
), Met8 = c(0, 0, 6727973.448, 7166827.569, 13238311.46,
13986568.69, 20957194.23, 19186953.76, 34513697.47, 31192991.75
), Met9 = c(0, 0, 2373752.304, 3259738.104, 1998529.732,
2387445.15, 2479309.442, 26924139.6, 4611277.427, 2439602.098
)), class = c("tbl_df", "tbl", "data.frame"), row.names = c(NA,
-10L), .Names = c("X1", "Met1", "Met2", "Met3", "Met4", "Met5",
"Met6", "Met7", "Met8", "Met9"), spec = structure(list(cols = structure(list(
X1 = structure(list(), class = c("collector_double", "collector"
)), Met1 = structure(list(), class = c("collector_double",
"collector")), Met2 = structure(list(), class = c("collector_double",
"collector")), Met3 = structure(list(), class = c("collector_double",
"collector")), Met4 = structure(list(), class = c("collector_double",
"collector")), Met5 = structure(list(), class = c("collector_double",
"collector")), Met6 = structure(list(), class = c("collector_double",
"collector")), Met7 = structure(list(), class = c("collector_double",
"collector")), Met8 = structure(list(), class = c("collector_double",
"collector")), Met9 = structure(list(), class = c("collector_double",
"collector"))), .Names = c("X1", "Met1", "Met2", "Met3",
"Met4", "Met5", "Met6", "Met7", "Met8", "Met9")), default = structure(list(), class = c("collector_guess",
"collector"))), .Names = c("cols", "default"), class = "col_spec"))
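bar[i] returns NA because i is a column name (a character string such as "Met1"), and bar has no names to match against. Loop over the column positions instead, and use the position to look up both the column name and the label:
list_of_plots<-lapply(2:10, function(i) {
  col <- names(Dataframe)[i]
  ggplot(Dataframe, aes_string(x="X1", y=col)) + geom_point()+labs(x=col,
    y="Intensity", subtitle=bar[i])
})
do.call(grid.arrange, c(list_of_plots, ncol=3))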
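The indexing behaviour can be checked without any plotting. The sketch below uses a shortened, hypothetical pair of vectors (short_names standing in for names(Dataframe)) and assumes both are in the same order; it also shows an alternative where bar is given names so that labels can be looked up by column name.

```r
# Shortened stand-ins for names(Dataframe) and bar, assumed to be in the same order.
short_names <- c("X1", "Met1", "Met2")
bar <- c("X1", "(-)-Corey lactone", "1,5-Naphthalenediamine")

# Indexing an unnamed vector by a character string finds nothing.
by_name <- unname(bar["Met1"])
by_name
# [1] NA

# Indexing by position works.
by_position <- bar[2]
by_position
# [1] "(-)-Corey lactone"

# Alternative: attach the short names, then look labels up by column name.
lookup <- setNames(bar, short_names)
by_lookup <- lookup[["Met1"]]
by_lookup
# [1] "(-)-Corey lactone"
```

With lookup in place, a loop over column names could use lookup[[i]] for the subtitle instead of bar[i].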
