I'm using the markovchain package in R and the function
mc<-markovchainFit(data)
I have a probability matrix mc$estimate and I want to round the probabilities. How do I do that?
Another question: how can I write that matrix to a text file or to Excel?
I have matrix like this:
mc$estimate
MLE Fit
A 22 - dimensional discrete Markov Chain with following states
A B C D E F G H I J K L M N O P Q R S T Y Z
The transition matrix (by rows) is defined as follows
A B C D E F
A 0.468053492 0.008172363 0.028974740 0.014858841 0.023031204 0.063150074
B 0.003191489 0.590425532 0.020212766 0.019148936 0.011702128 0.102127660
C 0.004054198 0.001707031 0.817134322 0.015896725 0.004374267 0.017497066
D 0.004519774 0.006214689 0.052824859 0.505367232 0.024011299 0.035310734
E 0.005132930 0.001710977 0.005396157 0.010002632 0.698078442 0.068570676
F 0.001155435 0.001386522 0.002195326 0.001675381 0.007683642 0.903347873
G 0.004933473 0.002690985 0.014800419 0.012856929 0.020032890 0.073105098
H 0.005486028 0.004114521 0.016629522 0.022458426 0.035487742 0.053317332
I 0.007445734 0.002271580 0.020570419 0.021327612 0.031423523 0.028899546
J 0.011885111 0.003796633 0.024430505 0.021294156 0.015351601 0.056949488
K 0.008743754 0.001784440 0.022127052 0.026945039 0.021234832 0.070663812
L 0.003227759 0.003026024 0.012507565 0.014726649 0.016743998 0.052854549
M 0.007148954 0.002560819 0.013551003 0.014511310 0.015258216 0.067008109
N 0.010998878 0.002918070 0.018406285 0.025140292 0.029405163 0.073400673
O 0.003787879 0.001578283 0.003787879 0.008207071 0.006313131 0.067866162
P 0.000000000 0.000000000 0.000000000 0.007518797 0.000000000 0.007518797
Q 0.005144695 0.004501608 0.003215434 0.012861736 0.013504823 0.052733119
R 0.009460298 0.003566998 0.022797767 0.024193548 0.015973945 0.095068238
I would like to round it to 2 decimals and then write it to an Excel or text file. How is that possible?
mc$estimate is an S4 object of the following class:
# [1] "markovchain"
# attr(,"package")
# [1] "markovchain"
by using str you can see what slots that object has that you could print/round,
str(mc$estimate)
# Formal class 'markovchain' [package "markovchain"] with 4 slots
# ..@ states : chr -----
# ..@ byrow : logi TRUE
# ..@ transitionMatrix: num -----
# .. ..- attr(*, "dimnames")=List of 2
# .. .. ..$ : chr [1:x] ----
# .. .. ..$ : chr [1:x] ----
# ..@ name : chr "MLE Fit"
Your output will be a little different since you did not provide the data. But you might see that the slot transitionMatrix looks like a matrix. Let us check,
class(mc$estimate@transitionMatrix)
# [1] "matrix"
Voila! We can easily round a matrix,
print(round(mc$estimate@transitionMatrix, digits=2))
and to store it,
write.csv(mc$esimate@transitionMatrix, file = "transition_matrix.csv", row.names = FALSE)
Hope this helps
It's a typo in the write.csv line for estimate after mc$
write.csv(mc$estimate@transitionMatrix, file = "transition_matrix.csv", row.names = FALSE)
once fixed it works.
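To tie the rounding and export steps together, here is a minimal sketch using a made-up 2 x 2 matrix in place of mc$estimate@transitionMatrix. The values, the object names (tm, csv_file, txt_file) and the temporary files are assumptions for illustration only. It keeps the row names so the state labels survive the round trip; note that rounded rows may no longer sum exactly to 1.

```r
# Toy transition matrix standing in for mc$estimate@transitionMatrix (hypothetical values)
tm <- matrix(c(0.468053492, 0.531946508,
               0.003191489, 0.996808511),
             nrow = 2, byrow = TRUE,
             dimnames = list(c("A", "B"), c("A", "B")))

# Round every probability to 2 decimals
tm_rounded <- round(tm, digits = 2)

# CSV file (opens directly in Excel); row names keep the "from" state labels
csv_file <- tempfile(fileext = ".csv")
write.csv(tm_rounded, file = csv_file, row.names = TRUE)

# Tab-separated text file; col.names = NA adds an empty header cell above the row names
txt_file <- tempfile(fileext = ".txt")
write.table(tm_rounded, file = txt_file, sep = "\t", quote = FALSE, col.names = NA)

# Read the CSV back to check that nothing was lost
tm_back <- as.matrix(read.csv(csv_file, row.names = 1))
```

Replace the temporary file names with real paths to keep the output.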
I have a sparse matrix of type dgTMatrix built with the Matrix package in R and I'm trying to solve for the vector x in Ax=b using QR decomposition, but it is not working correctly.
For example below I am solving for a random A, b and you can see the method works with the norm of Ax-b = 0
A <- matrix(rnorm(9),ncol=3)
decomp <- qr(A)
b <- rnorm(3)
x <- qr.coef(decomp,b)
(A %*% x - b) %>% norm
[1] 3.885781e-16
My A and b are like this:
A: 173700 x 173700 sparse Matrix of class "dgTMatrix"
b: 173700 x 1 sparse Matrix of class "dgCMatrix"
When I take a QR decomposition of A I get the following:
qr_decomp <- qr(A, LAPACK = TRUE, tol = 1e-10)
qr_decomp
'MatrixFactorization' of Formal class 'sparseQR' [package "Matrix"] with 6 slots
..@ V :Formal class 'dgCMatrix' [package "Matrix"] with 6 slots
.. .. ..@ i : int [1:13350377] 0 173283 173284 173285 173286 1 2 3 2 3 ...
.. .. ..@ p : int [1:173701] 0 5 8 11 19 22 25 30 33 36 ...
.. .. ..@ Dim : int [1:2] 173700 173700
.. .. ..@ Dimnames:List of 2
.. .. .. ..$ : NULL
.. .. .. ..$ : NULL
.. .. ..@ x : num [1:13350377] -9.27e+01 2.35e-03 -1.28e+02 2.35e-03 6.40e+01 ...
.. .. ..@ factors : list()
..@ beta: num [1:173700] 6.88e-05 7.80e-14 1.24e-06 3.88e-05 7.80e-14 ...
..@ p : int [1:173700] 34977 38066 38838 39610 38067 38839 34980 38069 38841 39613 ...
..@ R :Formal class 'dgCMatrix' [package "Matrix"] with 6 slots
.. .. ..@ i : int [1:24075359] 0 1 1 2 0 1 2 3 4 4 ...
.. .. ..@ p : int [1:173701] 0 1 2 4 8 9 11 12 13 15 ...
.. .. ..@ Dim : int [1:2] 173700 173700
.. .. ..@ Dimnames:List of 2
.. .. .. ..$ : NULL
.. .. .. ..$ : NULL
.. .. ..@ x : num [1:24075359] 1.57e+02 3.58e+06 -1.26e+03 3.58e+06 1.92e-03 ...
.. .. ..@ factors : list()
..@ q : int [1:173700] 35749 38838 38066 36522 38839 38067 35752 38841 38069 36525 ...
..@ Dim : int [1:2] 173700 173700
According to the docs for sparseQR class,
For a sparse m×n (“long”: m≥n) rectangular matrix A, the sparse QR decomposition is either
of the form PA=QR with a (row) permutation matrix P, (encoded in the p slot of the result) if the q slot is of length 0,
or of the form PAP∗=QR with an extra (column) permutation matrix P∗ (encoded in the q slot).
Since qr_decomp@q exists, per the docs, this must be of the form PAP∗=QR.
Anyway, the same procedure as above is not working in this case:
decomp <- qr(A)
x <- qr.coef(decomp, b)
(A %*% x - b) %>% norm
[1] 3.540814e+24
The above value is really far from zero.
Inspecting matrix A everything looks okay to me:
!(A@x %>% is.finite) %>% any
[1] FALSE
!(b@x %>% is.finite) %>% any
[1] FALSE
> A[1:10, 1:10]
10 x 10 sparse Matrix of class "dgTMatrix"
[1,] -5.366088e+08 5.301600e-02 . . . . . . . .
[2,] 4.418000e-02 -5.366088e+08 4.418000e-02 . . . . . . .
[3,] . 4.123467e-02 -5.366088e+08 4.123467e-02 . . . . . .
[4,] . . 3.976200e-02 -5.366088e+08 3.976200e-02 . . . . .
[5,] . . . 3.887840e-02 -5.366088e+08 3.887840e-02 . . . .
[6,] . . . . 3.828933e-02 -5.366088e+08 3.828933e-02 . . .
[7,] . . . . . 3.786857e-02 -5.366088e+08 3.786857e-02 . .
[8,] . . . . . . 2.650800e-02 -5.366088e+08 1.251767e-02 .
[9,] . . . . . . . 9.719600e-03 -5.366088e+08 9.719600e-03
[10,] . . . . . . . . 9.572333e-03 -5.366088e+08
However, there are large values present in the matrices:
max(b)
[1] 3.978441e+22
max(A)
[1] 3.979517e+22
min(b)
[1] 0
min(A)
[1] -7.958754e+22
Could it be the large values messing things up with some kind of round off error? Or am I missing some key info about QR not converging in some case?
tl;dr it does appear that large values will lead to problems; scaling the inputs appears to help.
An example that isn't problematic, even though I used Cauchy-distributed samples to fill in the matrix (which leads to a wide range of values):
set.seed(101)
library(Matrix)
## d <- 173700
d <- 1e3
## n <- 1e6
n <- 1e5
randfun <- rcauchy
As <- sparseMatrix(i=integer(0),
j=integer(0),
dims = c(d, d), repr = "T")
As[cbind(sample(d, size = n, replace = TRUE),
sample(d, size = n, replace = TRUE))] <- randfun(n)
bs <- sparseMatrix(i = 1:d, j = rep(1, d), x = randfun(d))
qr_decomp <- qr(As, LAPACK = TRUE, tol = 1e-10)
decomp <- qr(As)
xs <- qr.coef(decomp, bs)
(As %*% xs - bs) |> norm() ## 1.19e-09
If I use larger values for d (matrix/vector dimensions) and n (number of non-zero elements), either the computation takes a long time or I get a segmentation fault.
However, if I set randfun <- function(n) rcauchy(n, scale = 1e20) in the example above I do get large deviations.
Scaling the inputs seems to solve the problem, although I don't know if it will work for what you want to do ...
As2 <- As/1e20
bs2 <- bs/1e20
decomp2 <- qr(As2)
xs2 <- qr.coef(decomp2, bs2)
(As2 %*% xs2 - bs2) |> norm()
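Dividing both sides by the same constant is legitimate because (A/s) x = b/s has exactly the same solution x as A x = b; only the magnitudes that the factorization has to work with change. A minimal dense sketch of that identity in base R follows. The 5 x 5 system, the seed and the scale factor s are arbitrary assumptions, and it does not reproduce the sparse failure above, it only illustrates why rescaling leaves x unchanged.

```r
set.seed(1)

# Small dense system with a strongly weighted diagonal so it is well conditioned
A <- matrix(rnorm(25), nrow = 5) + 10 * diag(5)
b <- rnorm(5)

# Hypothetical scale factor, mimicking entries of order 1e20
s <- 1e20
A_big <- A * s
b_big <- b * s

# Reference solution on the original scale
x_ref <- qr.solve(A, b)

# Solve the rescaled version of the "large" system
x_scaled <- qr.solve(A_big / s, b_big / s)

# Both solutions agree, and the residual on the original scale is tiny
max(abs(x_ref - x_scaled))
max(abs(A %*% x_scaled - b))
```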
I have a for loop like this, trying to implement the solution here, with dummy vars such that
aaa <- DFM %*% t(DFM) #DFM is Quanteda dfm-sparse-matrix
for(i in 1:nrow(aaa)) aaa[i,] <- aaa[i,][order(aaa[i,], decreasing = TRUE)]
but now
for(i in 1:nrow(mmm)) mmm[i,] <- aaa[i,][order(aaa[i,], decreasing = TRUE)]
where mmm does not exist yet. The goal is to do the same thing as mmm <- t(apply(a, 1, sort, decreasing = TRUE)). Before the for loop I need to initialise mmm, otherwise I get Error: object 'mmm' not found. Both aaa and mmm are of type dgCMatrix, produced by the matrix multiplication of two Quanteda DFM matrices.
Structure
aaaFunc is given by the matrix multiplication DFM %*% t(DFM) where DFM is the Quanteda Sparse dfm-matrix. The structure is such that
> str(aaaFunc)
Formal class 'dgCMatrix' [package "Matrix"] with 6 slots
..@ i : int [1:39052309] 0 2 1 0 2 2616 2880 3 4 5 ...
..@ p : int [1:38162] 0 2 3 7 8 10 13 15 16 96 ...
..@ Dim : int [1:2] 38161 38161
..@ Dimnames:List of 2
.. ..$ : chr [1:38161] "90120000" "90120000" "90120000" "86140000" ...
.. ..$ : chr [1:38161] "90120000" "90120000" "90120000" "86140000" ...
..@ x : num [1:39052309] 1 1 1 1 2 1 1 1 2 1 ...
..@ factors : list()
Errors on the DFM with the methods mentioned here, in the general question on replicating an R object without its content but with its structure:
A. error with aaaFunc.mt[]<- NA
> aaaFunc.mt <- aaaFunc[0,]; aaaFunc.mt[] <- NA; aaaFunc.mt[1,]
Error in intI(i, n = x@Dim[1], dn[[1]], give.dn = FALSE) : index larger than maximal 0
B. error with mySparseMatrix.mt[nrow(mySparseMatrix),]<-
> aaaFunc.mt <- aaaFunc[0,]; aaaFunc.mt[nrow(aaaFunc),] <- NA
Error in intI(i, n = di[margin], dn = dn[[margin]], give.dn = FALSE) :
index larger than maximal 0
C. error with replace(...,NA)
Browse[2]> mmmFunc <- replace(aaaFunc,NA);
Error in replace(aaaFunc, NA) :
argument "values" is missing, with no default
Browse[2]> mmmFunc <- replace(aaaFunc,,NA);
Error in `[<-`(`*tmp*`, list, value = NA) :
argument "list" is missing, with no default
Browse[2]> mmmFunc <- replace(aaaFunc,c(),NA);
Error in .local(x, i, j, ..., value) :
not-yet-implemented 'Matrix[<-' method
How do you initialise an empty dgCMatrix with the structure given by the matrix multiplication of two Quanteda DFM matrices?
The following will either initialize an empty sparse matrix or reset an existing sparse matrix while preserving both the dimensions and dimnames
library(Matrix)
i <- c(1,3:8)
j <- c(2,9,6:10)
x <- 7 * (1:7)
A <- sparseMatrix(i, j, x = x)
rownames(A) <- letters[seq_len(nrow(A))]
A2 <- sparseMatrix(i = integer(0), j = integer(0), dims = A@Dim, dimnames = A@Dimnames)
A@i <- integer(0)
A@p[] <- 0L
A@x <- numeric(0)
setequal(A, A2)
[1] TRUE
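Applied to the row-sorting loop from the question, the same idea might look like the sketch below. The 3 x 3 matrix and its values are invented; the names aaa and mmm follow the question. Passing x = numeric(0) is meant to make the empty matrix numeric rather than a pattern matrix. Row-by-row assignment into a sparse matrix can be slow for very large inputs, so treat this as an illustration rather than a tuned solution.

```r
library(Matrix)

# Toy stand-in for DFM %*% t(DFM) (hypothetical values)
aaa <- sparseMatrix(i = c(1, 1, 2, 3),
                    j = c(1, 3, 2, 3),
                    x = c(2, 5, 1, 4),
                    dims = c(3, 3))

# Empty numeric sparse matrix with the same dimensions as aaa
mmm <- sparseMatrix(i = integer(0), j = integer(0), x = numeric(0),
                    dims = dim(aaa))

# Fill each row of mmm with the sorted values of the matching row of aaa
for (r in seq_len(nrow(aaa))) {
  mmm[r, ] <- sort(aaa[r, ], decreasing = TRUE)
}

# Dense reference computed with apply(), for comparison
expected <- t(apply(as.matrix(aaa), 1, sort, decreasing = TRUE))
```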
I would like to be able to write a function that runs regressions in a data.table by groups and then nicely organizes the results. Here is a sample of what I would like to do:
require(data.table)
dtb = data.table(y=1:10, x=10:1, z=sample(1:10), weights=1:10, thedate=1:2)
models = c("y ~ x", "y ~ z")
res = lapply(models, function(f) {dtb[,as.list(coef(lm(f, weights=weights, data=.SD))),by=thedate]})
#do more stuff with res
I would like to wrap all this into a function, since the "do more stuff" part might be long. The issue I face is how to pass the various names of things to data.table. For example, how do I pass the column name weights? How do I pass thedate? I envision a prototype that looks like this:
myfun = function(dtb, models, weights, dates)
Let me be clear: passing the formulas to my function is NOT the problem. If the weights I wanted to use and the column name describing the date, thedate were known then my function could simply look like this:
myfun = function(dtb, models) {
res = lapply(models, function(f) {dtb[,as.list(coef(lm(f, weights=weights, data=.SD))),by=thedate]})
#do more stuff with res
}
However the column names corresponding to thedate and to the weights are unknown in advance. I would like to pass them to my function as so:
#this will not work
myfun = function(dtb, models, w, d) {
res = lapply(models, function(f) {dtb[,as.list(coef(lm(f, weights=w, data=.SD))),by=d]})
#do more stuff with res
}
Thanks
Here is a solution that relies on having the data in long format (which makes more sense to me, in this case):
library(reshape2)
dtlong <- data.table(melt(dtb, measure.var = c('x','z')))
foo <- function(f, d, by, w ){
# get the name of the w argument (weights)
w.char <- deparse(substitute(w))
# convert `list(a,b)` to `c('a','b')`
# obviously, this would have to change depending on how `by` was defined
by <- unlist(lapply(as.list(as.list(match.call())[['by']])[-1], as.character))
# create the call substituting the names as required
.c <- substitute(as.list(coef(lm(f, data = .SD, weights = w))), list(w = as.name(w.char)))
# actually perform the calculations
d[,eval(.c), by = by]
}
foo(f= y~value, d= dtlong, by = list(variable, thedate), w = weights)
variable thedate (Intercept) value
1: x 1 11.000000 -1.00000000
2: x 2 11.000000 -1.00000000
3: z 1 1.009595 0.89019190
4: z 2 7.538462 -0.03846154
one possible solution:
fun = function(dtb, models, w_col_name, date_name) {
res = lapply(models, function(f) {dtb[,as.list(coef(lm(f, weights=eval(parse(text=w_col_name)), data=.SD))),by=eval(parse(text=paste0("list(",date_name,")")))]})
}
Can't you just add (inside that anonymous function call):
f <- as.formula(f)
... as a separate line before the dtb[,as.list(coef(lm(f, ...)? That's the usual way of turning a character element into a formula object.
> res = lapply(models, function(f) {f <- as.formula(f)
dtb[,as.list(coef(lm(f, weights=weights, data=.SD))),by=thedate]})
>
> str(res)
List of 2
$ :Classes ‘data.table’ and 'data.frame': 2 obs. of 3 variables:
..$ thedate : int [1:2] 1 2
..$ (Intercept): num [1:2] 11 11
..$ x : num [1:2] -1 -1
..- attr(*, ".internal.selfref")=<externalptr>
$ :Classes ‘data.table’ and 'data.frame': 2 obs. of 3 variables:
..$ thedate : int [1:2] 1 2
..$ (Intercept): num [1:2] 6.27 11.7
..$ z : num [1:2] 0.0633 -0.7995
..- attr(*, ".internal.selfref")=<externalptr>
If you need to build character versions of formulas from component names, just use paste or paste0 and pass to the models character vector. Tested code supplied with receipt of testable examples.
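As a minimal sketch of the paste0 approach just described, using base R only: the names response, predictors and toy are assumptions for illustration, and the z values are invented. The strings are built from component names, converted with as.formula, and one of them is fitted with lm.

```r
# Component names to combine into formulas (hypothetical choices)
response   <- "y"
predictors <- c("x", "z")

# Build character versions of the formulas: "y ~ x" and "y ~ z"
models <- paste0(response, " ~ ", predictors)

# Convert each string into a formula object
formulas <- lapply(models, as.formula)

# Toy data with the same y and x columns as in the question; z is invented
toy <- data.frame(y = 1:10, x = 10:1, z = c(3, 1, 4, 1, 5, 9, 2, 6, 5, 3))

# Fit the first model; y = 11 - x holds exactly for these data
fit <- lm(formulas[[1]], data = toy)
coef(fit)
```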
I got a file containing the following data:
str(dat)
List of 2
$ x: Named num [1:28643] 2714769 2728569 NA 2728569 2740425 ...
..- attr(*, "names")= chr [1:28643] "h" "h" "" "h" ...
$ y: Named num [1:28643] 925000 925000 NA 925000 925000 ...
..- attr(*, "names")= chr [1:28643] "h" "h" "" "h" ...
- attr(*, "class")= chr [1:2] "bor" "list"
dat$x[1:10]
h h h h h h h
2714769 2728569 NA 2728569 2740425 NA 2740425 2751585 NA 2751585
dat$y[1:10]
h h h h h h h
925000 925000 NA 925000 925000 NA 925000 925000 NA 925000
class(dat)
"bor" "list"
table(names(dat$x))
h
479 28164
table(names(dat$y))
h
479 28164
plot(dat, type='l') results in a nice map.
I read about an old/simple form of line-'objects' used in S in "Applied Spatial Data Analysis with R" (Bivand, Pebesma, Gomez-Rubio; Springer 2008) on Page 38, which seem to have similarities to my file. This format defines a line as "start-point; end-point; NA" triplet.
Do you know this format?
How can I convert it to an sp-object?
Thanks in advance
Based on your information, here is one possible way to go:
Assuming that your data represent lines and that the NA values indicate the end of each line, you can convert your data to spatial lines doing the following:
library(sp)  # provides Line(), Lines() and SpatialLines()
# Creating artificial data for the example
dat <- list()
dat$x <- rnorm(1000) + rep(c(rep(0, 99), NA), 10)
dat$y <- dat$x + rnorm(1000)
# For simplicity, convert to data frame
# (this would be the first step for you to do with your data)
mydat <- data.frame(x = dat$x, y = dat$y)
# Convert each part to a line, using the NA values as breaks
mylines <- list()
last <- 1
for(i in 1:nrow(mydat)){
if(is.na(mydat$x[i])){
print(i)
mylines[[as.character(i)]] <- Lines(Line(mydat[last:(i-1),]), ID = as.character(i))
last <- i+1
}
}
# Convert to spatial lines object
mylines <- SpatialLines(mylines)
# Plot to see if it worked
plot(mylines)
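The splitting step itself (using NA as a break between segments) can also be done without a loop. The sketch below uses base R only and invented coordinates that follow the "start-point; end-point; NA" pattern from the question; the names seg_id, keep and segments are assumptions. Each element of the resulting list is a two-column data frame that could then be turned into a line object as in the loop above.

```r
# Invented coordinates in "start; end; NA" triplets
x <- c(1, 2, NA, 2, 3, NA, 3, 4, NA)
y <- c(5, 5, NA, 5, 6, NA, 6, 7, NA)

# Running count of NAs seen so far: it increases by one at every break
seg_id <- cumsum(is.na(x))

# Keep only real coordinates, then split them by segment id
keep <- !is.na(x)
segments <- split(data.frame(x = x[keep], y = y[keep]), seg_id[keep])

# Number of segments and the coordinates of the first one
length(segments)
segments[[1]]
```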
Can anyone help me with this?
If I run:
> mode(iris$Species)
[1] "numeric"
> mode(iris$Sepal.Width)
[1] "numeric"
Then I get "numeric" as answer
Cheers
M
The function mode() is used to find out the storage mode of the object; in this case it is stored as mode "numeric". This function is not used to find the most frequently observed value in a data set, i.e. it is not used to find the statistical mode. See ?mode for more on what this function does in R and why it isn't useful for your problem.
For discrete data, the mode is the most frequent observed value among the set:
> set.seed(1) ## reproducible example
> dat <- sample(1:5, 100, replace = TRUE) ## dummy data
> (tab <- table(dat)) ## tabulate the frequencies
dat
1 2 3 4 5
13 25 19 26 17
> which.max(tab) ## which is the mode?
4
4
> tab[which.max(tab)] ## what is the frequency of the mode?
4
26
For continuous data, the mode is the value of the data at which the probability density function (PDF) reaches a maximum. As your data are generally a sample from some continuous probability distribution, we don't know the PDF but we can estimate it through a histogram or better through a kernel density estimate.
Returning to the iris data, here is an example of determining the mode from continuous data:
> sepalwd <- with(iris, density(Sepal.Width)) ## kernel density estimate
> plot(sepalwd)
> str(sepalwd)
List of 7
$ x : num [1:512] 1.63 1.64 1.64 1.65 1.65 ...
$ y : num [1:512] 0.000244 0.000283 0.000329 0.000379 0.000436 ...
$ bw : num 0.123
$ n : int 150
$ call : language density.default(x = Sepal.Width)
$ data.name: chr "Sepal.Width"
$ has.na : logi FALSE
- attr(*, "class")= chr "density"
> with(sepalwd, which.max(y)) ## which value has maximal density?
[1] 224
> with(sepalwd, x[which.max(y)]) ## use the above to find the mode
[1] 3.000314
See ?density for more info. By default, density() evaluates the kernel density estimate at n = 512 equally spaced locations. If this is too crude for you, increase the number of locations evaluated and returned:
> sepalwd2 <- with(iris, density(Sepal.Width, n = 2048))
> with(sepalwd, x[which.max(y)])
[1] 3.000314
In this case it doesn't alter the result.
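The two steps (estimate the density, then take the location of its maximum) can be wrapped in a small helper. This is only a sketch: the name density_mode is made up here, and the result depends on the bandwidth and grid that density() chooses, so exact digits may vary between R versions.

```r
# Hypothetical helper: location of the maximum of a kernel density estimate
density_mode <- function(x, n = 512) {
  d <- density(x, n = n)   # kernel density estimate on a grid of n points
  d$x[which.max(d$y)]      # grid point where the estimated density peaks
}

# Mode of Sepal.Width on the default grid and on a finer grid
mode_default <- density_mode(iris$Sepal.Width)
mode_fine    <- density_mode(iris$Sepal.Width, n = 2048)
c(mode_default, mode_fine)
```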
see ?mode : mode is giving you the storage mode. If you want the value with the maximum count, then use table.
> Sample <- sample(letters[1:5],50,replace=T)
> tmp <- table(Sample)
> tmp
Sample
a b c d e
9 12 9 7 13
> tmp[which(tmp==max(tmp))]
e
13
Please, read the help files if a function is not doing what you think it should.
Some extra explanation :
max(tmp) is the maximum of tmp
tmp == max(tmp) gives a logical vector with a length of tmp, indicating whether a value is equal or not to max(tmp).
which(tmp == max(tmp)) returns the index of the values in the vector that are TRUE. These indices you use to select the value in tmp that is the maximum value.
See the help files ?which, ?max and the introductory manuals for R.
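Put together, those steps fit in a short function. The name stat_mode is an assumption for this sketch. Because it compares against max(tab) instead of using which.max, it returns every value that ties for the highest count.

```r
# Hypothetical helper: most frequent value(s) of a discrete vector, keeping ties
stat_mode <- function(x) {
  tab <- table(x)                          # frequency of each distinct value
  names(tab)[as.vector(tab) == max(tab)]   # all values reaching the maximum count
}

# "b" and "c" both occur twice, so both are returned (as character strings)
tied <- stat_mode(c("a", "b", "b", "c", "c"))

# Each species occurs 50 times in iris, so all three are returned
species_modes <- stat_mode(iris$Species)
```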
See ?mode : mode is giving you the storage mode.
If you want to know the mode of a continuous random variable, I recently released the package ModEstM. In addition to the method proposed by Gavin Simpson, it addresses the case of multimodal variables.
For example, in case you study the sample:
> x2 <- c(rbeta(1000, 23, 4), rbeta(1000, 4, 16))
Which is clearly bimodal, you get the answer:
> ModEstM::ModEstM(x2)
[[1]]
[1] 0.8634313 0.1752347