R Code package markovchain doesn't properly recognize numbers as states

I have checked the documentation of the package and found an example of how they fitted a DTMC on data.frame objects using the following code:
library(markovchain)
data(holson)
singleMc<-markovchainFit(data=holson[,2:12],name="holson")
My data is structured essentially the same way as the holson data, except that there are 10 states: the numbers 1 to 10. The values in my Excel file are integers, not characters. When I run the code on my data, it gives me a transition matrix whose states are listed in the order (1,10,2,3,4,5,6,7,8,9), so in the matrix the state following 1 is 10.
It appears that R treats 10 as falling between 1 and 2 (lexicographic sorting?). How can I fix this so that the package recognizes 10 as the state following 9?
EDIT:
Here is an example
library(markovchain)
set.seed(12)
Test <- data.frame(entity = LETTERS[1:100],
Time1 = round(runif(n = 100, min = 1, max = 10)),
Time2 = round(runif(n = 100, min = 1, max = 10)),
Time3 = round(runif(n = 100, min = 1, max = 10)))
Test_Fit <- markovchainFit(data=Test[,2:4] , name="Test_FIT")
Est_Test_Fit <- Test_Fit$estimate
Est_Test_Fit@transitionMatrix

I am not familiar with the markovchain package. I tend to use r4jags.
Reading the markovchain manual, it seems that a call to markovchainFit should be preceded by a call to createSequenceMatrix. (See the example code on page 11 of the manual.) The first parameter of createSequenceMatrix is "... a n x n matrix or a character vector or a list". Thus, contrary to my comment above, it seems that markovchain expects the state labels to be character rather than numeric. Given your question, it seems that your states are ordered rather than merely categorical, so the ordering "1", "10", "2" is a problem for you.
The solution would be to convert your numeric state labels to character yourself, in a form that sorts correctly, before calling markovchainFit/createSequenceMatrix. Here are two possible ways of doing this:
charState <- LETTERS[state]
which will give you state labels of "A" to "J", or
charState <- sprintf("%02i", state)
which produces "01", "02", ... , "10".
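To see why the padded labels help, here is a minimal sketch in base R. It does not call markovchain at all, and the state vector is made up for illustration. Character labels sort lexicographically, so "10" lands before "2" unless the labels are zero-padded:

```r
# Hypothetical vector of numeric states, for illustration only
state <- c(1, 10, 2, 9, 10, 3)

# Plain conversion to character sorts lexicographically
plain <- as.character(state)
sort(plain)    # "1" "10" "10" "2" "3" "9"

# Zero-padded labels sort in numeric order
padded <- sprintf("%02i", as.integer(state))
sort(padded)   # "01" "02" "03" "09" "10" "10"
```

The same conversion could be applied to each state column of the data frame before fitting. I have not checked how markovchainFit orders its labels beyond the sorting behaviour shown here.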
By the way, did you run your test code before adding it to your question? Rows 27 to 100 of your Test dataframe have entity equal to NA, which I suspect is not what you intended. In addition, I suspect your columns Time1 to Time3 are misnamed because I believe they contain states of the process rather than times at which the process(es) was/were in a given state.

Related

Poisson Process algorithm in R (renewal processes perspective)

I have the following MATLAB code and I'm working on translating it to R:
nproc=40
T=3
lambda=4
tarr = zeros(1, nproc);
i = 1;
while (min(tarr(i,:))<= T)
tarr = [tarr; tarr(i, :)-log(rand(1, nproc))/lambda];
i = i+1;
end
tarr2=tarr';
X=min(tarr2);
stairs(X, 0:size(tarr, 1)-1);
It is the Poisson Process from the renewal processes perspective. I've done my best in R but something is wrong in my code:
nproc<-40
T<-3
lambda<-4
i<-1
tarr=array(0,nproc)
lst<-vector('list', 1)
while(min(tarr[i]<=T)){
tarr<-tarr[i]-log((runif(nproc))/lambda)
i=i+1
print(tarr)
}
tarr2=tarr^-1
X=min(tarr2)
plot(X, type="s")
The loop prints an aleatory number of arrays and only the last is saved by tarr after it.
The result has to look like...
Thank you in advance. All interesting and supportive comments will be rewarded.
Adding on to the previous comment, there are a few things which are happening in the matlab script that are not in the R:
[tarr; tarr(i, :)-log(rand(1, nproc))/lambda]; from my understanding, you are adding another row to your matrix and populating it with tarr(i, :)-log(rand(1, nproc))/lambda.
You will need to use a different method as Matlab and R handle this type of thing differently.
One glaring thing that stands out to me, is that you seem to be using R: tarr[i] and M: tarr(i, :) as equals where these are very different, as what I think you are trying to achieve is all the columns in a given row i so in R that would look like tarr[i, ]
Now the use of min is also different as R: min() will return the minimum of the matrix (just one number) and M: min() returns the minimum value of each column. So for this in R you can use the Rfast package Rfast::colMins.
The stairs part is something I am not familiar with much but something like ggplot2::qplot(..., geom = "step") may work.
Now I have tried to create something that works in R but am not sure really what the required output is. But nevertheless, hopefully some of the basics can help you get it done on your side. Below is a quick try to achieve something!
nproc <- 40
T0 <- 3
lambda <- 4
i <- 1
tarr <- matrix(rep(0, nproc), nrow = 1, ncol = nproc)
while(min(tarr[i, ]) <= T0){
# Major alteration, create a temporary row from previous row in tarr
temp <- matrix(tarr[i, ] - log((runif(nproc))/lambda), nrow = 1)
# Join temp row to tarr matrix
tarr <- rbind(tarr, temp)
i = i + 1
}
# I am not sure what was meant by tarr' in the matlab script I took it as inverse of tarr
# which in matlab is tarr.^(-1)??
tarr2 = tarr^(-1)
library(ggplot2)
library(Rfast)
min_for_each_col <- colMins(tarr2, value = TRUE)
qplot(seq_along(min_for_each_col), sort(min_for_each_col), geom="step")
As you can see, I have sorted min_for_each_col so that the plot is actually a stair plot and not some random stepwise plot. I think there is a problem: in the Matlab code, 0:size(tarr, 1)-1 gives the number of rows less 1, but I can't figure out why, if we grab colMins (and there are 40 columns), we would create around 20 steps. But I might be completely misunderstanding! Also, I have changed T to T0, since in R T exists as an alias for TRUE and is best not overwritten!
Hope this helps!
I downloaded GNU Octave today to actually run the MATLAB code. After watching the code run, I made a few tweaks to the great answer by @Croote:
nproc <- 40
T0 <- 3
lambda <- 4
i <- 1
tarr <- matrix(rep(0, nproc), nrow = 1, ncol = nproc)
while(min(tarr[i, ]) <= T0){
temp <- matrix(tarr[i, ] - log(runif(nproc))/lambda, nrow = 1) #fixed paren
tarr <- rbind(tarr, temp)
i = i + 1
}
tarr2 = t(tarr) #takes transpose
library(ggplot2)
library(Rfast)
min_for_each_col <- colMins(tarr2, value = TRUE)
qplot(seq_along(min_for_each_col), sort(min_for_each_col), geom="step")
Edit: Some extra plotting tweaks; this seems to be closer to the original
qplot(seq_along(min_for_each_col), c(1:length(min_for_each_col)), geom="step", ylab="", xlab="")
#or with ggplot2
df1 <- as.data.frame(cbind(min_for_each_col, 1:length(min_for_each_col))) # base R, so no pipe operator (%>%) is needed
colnames(df1)[2] <- "index"
ggplot() +
geom_step(data = df1, mapping = aes(x = min_for_each_col, y = index), color = "blue") +
labs(x = "", y = "")
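For comparison, here is a sketch of the same translation using only base R, under the assumption that the MATLAB lines X=min(tarr2) and stairs(X, 0:size(tarr, 1)-1) mean "take the minimum of each row of tarr and plot it against the event count". The seed is my own addition so the run is repeatable:

```r
set.seed(1)          # arbitrary seed, not in the original code
nproc  <- 40
T0     <- 3          # named T0 to avoid masking T (TRUE)
lambda <- 4

tarr <- matrix(0, nrow = 1, ncol = nproc)
i <- 1
while (min(tarr[i, ]) <= T0) {
  # next row = previous row plus exponential waiting times
  tarr <- rbind(tarr, tarr[i, ] - log(runif(nproc)) / lambda)
  i <- i + 1
}

# MATLAB's min(tarr') is the minimum of each row of tarr
X <- apply(tarr, 1, min)

# stairs(X, 0:size(tarr, 1)-1): time on the x axis, event count on the y axis
plot(X, seq_along(X) - 1, type = "s", xlab = "", ylab = "")
```

This avoids the Rfast and ggplot2 dependencies, at the cost of a plainer plot.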
I'm not too familiar with renewal processes or matlab so bear with me if I misunderstood the intention of your code. That said, let's break down your R code step by step and see what is happening.
The first 4 lines assign numbers to variables.
The fifth line creates an array with 40 (nproc) zeros.
The sixth line (which doesn't seem to be used later) creates a vector of mode 'list' with a single (NULL) element.
The seventh line starts a while loop. I suspect this line is supposed to say while the min value of tarr is less than or equal to T ...
or it's supposed to say while i is less than or equal to T ...
It actually takes the minimum of a single boolean value (tarr[i] <= T). Now this can work because TRUE and FALSE are treated like numbers. Namely:
TRUE == 1 # returns TRUE
FALSE == 0 # returns TRUE
TRUE == 0 # returns FALSE
FALSE == 1 # returns FALSE
However, since the value of tarr[i] depends on a random number (see line 8), this could lead to the same code running differently each time it is executed. This might explain why the code "prints an aleatory number of arrays ".
The eighth line overwrites tarr with the computation on the right. It takes the single value tarr[i] and subtracts from it the natural log of runif(nproc) divided by 4 (lambda), which gives 40 different values. Only the forty values from the last pass through the loop remain stored in tarr.
If you want to store all forty values from each pass through the loop, I'd suggest storing them in, say, a matrix or data frame instead. If that's what you want to do, here's an example of storing them in a matrix:
for(i in 1:nrow(yourMatrix)){
# computations
yourMatrix[i,] <- rowCreatedByComputations
}
See this answer for more info about that. Also, since it's a set number of values per run, you could keep them in a vector and simply append to the vector each loop like this:
vector <- c(vector,newvector)
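The two storage patterns above can be sketched with toy values. The sizes n_iter and n_vals and the "computation" i * seq_len(n_vals) are placeholders of mine, not part of the question:

```r
n_iter <- 5   # hypothetical number of loop passes
n_vals <- 4   # hypothetical number of values per pass

# Pattern 1: preallocate a matrix and fill one row per pass
results <- matrix(NA_real_, nrow = n_iter, ncol = n_vals)
for (i in seq_len(n_iter)) {
  results[i, ] <- i * seq_len(n_vals)   # stand-in for the real computation
}

# Pattern 2: append each pass to a growing vector
acc <- numeric(0)
for (i in seq_len(n_iter)) {
  acc <- c(acc, i * seq_len(n_vals))
}
```

Both hold the same numbers. The matrix keeps track of which pass each value came from, whereas the vector is one flat run of values.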
The ninth line increases i by one.
The tenth line prints tarr.
The eleventh line closes the loop statement.
Then after the loop tarr2 is assigned 1/tarr. Again this will be 40 values from the last time through the loop (line 8)
Then X is assigned the min value of tarr2.
This single value is plotted in the last line.
Also note that runif samples from the uniform distribution -- if you're looking for a Poisson distribution see: Poisson
Hope this helped! Let me know if there's more I can do to help.
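A side note on the runif call, as a hedged sketch rather than a correction of the code above: -log(U)/lambda with U uniform on (0, 1) is the inverse-transform way of drawing exponential waiting times with rate lambda, which is how the MATLAB code generates the gaps between arrivals. The check below uses only base R; the sample size and seed are arbitrary choices of mine:

```r
set.seed(42)
lambda <- 4
u <- runif(1e5)

# Inverse-transform sampling: exponential waiting times with rate lambda
gaps <- -log(u) / lambda

mean(gaps)   # should be close to 1 / lambda

# Same values via the exponential quantile function (upper tail)
same_as_qexp <- isTRUE(all.equal(gaps, qexp(u, rate = lambda, lower.tail = FALSE)))
```

So the uniform draws are not a mistake in themselves; rexp(nproc, rate = lambda) would be the more direct way to get the same kind of waiting times.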

Complex self-referencing of a dataframe

I could not find anything that answered this question, so I apologize if it is a duplicate. I'm also not sure exactly how to phrase it.
Here is my example I created for stackoverflow - my real dataset is much more complex:
Here is the example dataframe I am using
The idea behind this is that this is a dataset of workers. Each worker has info in columns named name, age, State (where they are located), and State_Lead, a boolean column that represents whether or not if they are the worker who is in charge of that State.
My goal here is twofold - I want a code that
1) references the State and State_Lead columns and require 1 (Not zero, not >1) State_Lead =TRUE per State. If there is more than or less than 1, I want to randomize who in each State becomes the State Lead
2) Calls up the current State_Lead=TRUE for each State. Ideally I could reference a State and be able to call anything from the row of the State_Lead (where the rows are named the same as the Name column).
#I made Jack not the state lead so the goal should be to return James and Jill
Database["Jack", "State_Lead"]=FALSE
All_States <- unique(Database$State)
All_States
##Here I thought I could cycle through each state and return the rows that matched each State Leader
heads <- NULL
for(i in All_States){
heads <- append( heads, Database[, "State"==i])
}
heads
## heads just returns "list()"
###attempt 2
heads <- NULL
for(i in All_States){
if (sum(Database[Database[,"State"==i], "State_Lead"]) = 1)
heads <-append(heads, Database[,"State"==i], "State_Lead"])
else Database$State==i <- NA
all_in_state <- subset(Database[, State="i"])
sample(all_in_state, 1)
}
All right, so it looks like you're definitely brand new to programming as a whole, and not just R. So first and foremost, I'd highly recommend checking out some of the MOOCs on Coursera, such as this one. But, as for your question, let's look at each piece of it that seems to be causing confusion.
First, when asking for help on this site, it's always best to provide actual data, and not a picture of your dataset. Given that you already had a dataframe in R that you were working with, you could easily take advantage of the dput function and then copy its output into your question. So, for example, you might have the following dataframe:
df = data.frame(name=c("John", "Jim", "Sally"), state=c("MI", "FL", "NY"), state_leader=c(TRUE, FALSE, TRUE))
df
name state state_leader
1 John MI TRUE
2 Jim FL FALSE
3 Sally NY TRUE
Then we can just use dput(df) and get the following output:
dput(df)
structure(list(name = structure(c(2L, 1L, 3L), .Label = c("Jim",
"John", "Sally"), class = "factor"), state = structure(c(2L,
1L, 3L), .Label = c("FL", "MI", "NY"), class = "factor"), state_leader = c(TRUE,
FALSE, TRUE)), .Names = c("name", "state", "state_leader"), row.names = c(NA,
-3L), class = "data.frame")
Those of us on Stack Overflow can now copy the output from dput and have a working copy of your dataset.
Next, let's look at your confusion around how to set new values in a dataset. In your updated text, you tried to set state_leader equal to FALSE with code like df["John", "state_leader"] = FALSE. On a data frame like df, this does not do what you want: a character value in the row position is matched against the row names, and the row names of df are 1, 2, 3, so "John" doesn't point to any existing row (an assignment will typically append a new row named "John" instead of changing John's value). It only works if you have set the row names to the workers' names. A way that does not depend on row names is the following.
df[df$name == "John", "state_leader"] = FALSE
This way, R knows that you want the variable name to be equal to "John".
So now that we have that, it'd probably be a good time to look at the [ operator and understand how it works. Because your complex algorithm for trying to find your values is not nearly as complex as you think when you understand how indexing works.
If you have a one-dimensional object in R, such as a vector, [ takes one parameter. If you have a two-dimensional object, such as a dataframe or matrix, [ takes two parameters, either one of which is optional. Let's look at a few examples.
x = 1:10 # A one-dimensional vector
x[1:3] # Get the first three elements of x
x[c(1, 3, 5, 7, 9)] # Get all odd elements of x
x[x %% 2 != 0] # Get all odd elements of x
In the examples above, we're working with a one-dimensional vector. The three operations we perform highlight a few key points about [. First, [ expects positions (a numeric input), or something that R can resolve to positions, such as a logical vector. Second, the positions do not have to be consecutive. Lastly, the index can be an expression that evaluates to such a vector, such as x %% 2 != 0. This last example is perfect for demonstrating what I mean by "something that R can resolve to positions". You can think of it in the following way: first, R computes x %% 2. It then checks each element to see whether it is equal to 0, which returns a vector of Boolean values equal to TRUE or FALSE. It then keeps the positions whose value is TRUE, namely c(1, 3, 5, 7, 9), which is identical to our second example.
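The step from a Boolean vector to positions can be made visible with which(). This is only a way of picturing what [ does with a logical index; the variable name is_odd is mine:

```r
x <- 1:10

# Logical vector: TRUE where the element is odd
is_odd <- x %% 2 != 0

# Positions of the TRUE values
which(is_odd)    # 1 3 5 7 9

# Indexing by the logical vector or by the positions gives the same result
identical(x[is_odd], x[which(is_odd)])    # TRUE
```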
Now, let's look at df to see how [ works on two-dimensional objects. When working with 2D objects, the first parameter to [ tells it which rows you want, and the second parameter tells it which columns you want.
df[df$name == "John", ] # Get all rows where name equals "John" and ALL columns
df[, c(1, 3)] # Get all rows and only the first and third column
df[grepl("^J", df$name), 3] # Get all rows with names that start with "J" and only the third column
As we see in the first two examples above, you do not need to provide a value for each parameter in [. If you leave one of the values blank, the default is to return all available rows or columns from the object. You'll also notice that we specifically call the column name even when we're specifying rows, such as df[df$name == "John", ]. This is because we need R to understand which column we want to check to determine whether we keep the row. Lastly, you should also notice that everything we learned about [ on one-dimensional objects holds here: it expects positions, or something that can be resolved to positions. So, in the first example, df$name == "John" results in a Boolean vector with values c(TRUE, FALSE, FALSE), and R keeps the positions that are TRUE, here 1, indicating that only the first row matches that criterion.
So now that we understand how [ works, let's see how to use it to solve our question here. We know that we want all of the columns, so we can ignore the second parameter in [. And we know that we want only the rows where state_leader is TRUE. So let's use that condition in our index.
df[df$state_leader == TRUE, ]
name state state_leader
1 John MI TRUE
3 Sally NY TRUE
As an exercise to you, how would you make this output better by only returning the name and state variables?
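The first goal in the question (exactly one leader per state, picked at random when the count is wrong) is not covered above. Here is one possible sketch using the same indexing ideas. The workers data frame and its values are invented for illustration, and the column names follow the toy df in this answer rather than the asker's real data:

```r
set.seed(1)   # arbitrary seed so the random pick is repeatable

# Hypothetical data: MI has two leaders, NY has none, FL has exactly one
workers <- data.frame(
  name         = c("John", "Jim", "Sally", "Ann", "Bob"),
  state        = c("MI", "MI", "NY", "NY", "FL"),
  state_leader = c(TRUE, TRUE, FALSE, FALSE, TRUE),
  stringsAsFactors = FALSE
)

for (s in unique(workers$state)) {
  rows <- which(workers$state == s)
  if (sum(workers$state_leader[rows]) != 1) {
    # Zero or several leaders: clear the state and pick one row at random
    workers$state_leader[rows] <- FALSE
    pick <- rows[sample.int(length(rows), 1)]
    workers$state_leader[pick] <- TRUE
  }
}

# Second goal: look up the current leader of each state
leads <- workers[workers$state_leader, c("name", "state")]
leads
```

sample.int(length(rows), 1) is used instead of sample(rows, 1) because sample() on a single number n samples from 1:n, which would go wrong for a state with only one worker.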

Correlation using rolling window on second vector

I'm a bit of an R newbie, and I am a little stuck on how to run a correlation on time-series data where the second vector is much longer and I want to use a rolling time window.
My data looks something like this :
set.seed(1)
# "Target sample" (this is always of known fixed length N, e.g. 20 )
target <- data.frame(Date=rep(seq(Sys.Date(),by="1 day",length=20)),Measurement=rnorm(2))
# "Potential Sample" (this is always much longer and of unknown length,e.g. 730 in this example)
potential <- data.frame(Date=rep(seq(Sys.Date()-1095,by="1 day",length=730)),Measurement=rnorm(2))
What I would like to do is take a rolling window of size N (i.e matching the size of target sample), incrementing the roll by one day at a time, and then print two columns for each window :
WindowStartDate and the result of cor(target,potentialWindow)
So in pseudo-code (using the generated example above) :
Start at Sys.Date()-1095, take window size N values
Print (or,probably better, put in to new data frame) Sys.Date()-1095 and result of cor(target,potentialWindow)
Roll forward +1 day to Sys.Date()-1094 , take window size N values
Print (or, probably better, put in to new data frame) Sys.Date()-1094 and result of cor(target,potentialWindow)
etc. etc.
N.B. The roll forward +1 day is obviously a variable that could be tweaked depending on desired overlap.
Here's a way we can do it. Note that in your original example you only specified rnorm(2), which worked because R can recycle arguments, but it's probably not what you wanted. We just need to initialize a few things, and then send it through a for loop.
It seems like we can just pull the date you want from the potential data set, but if you want to use the Sys.Date() - X formula, I've shown how to do that as well.
set.seed(1)
# "Target sample" (this is always of known fixed length N, e.g. 20 )
target <- data.frame(Date = rep(seq(Sys.Date(), by = "1 day", length = 20)),
Measurement = rnorm(20))
# "Potential Sample" (this is always much longer and of unknown length,e.g. 730 in this example)
potential <- data.frame(Date = rep(seq(Sys.Date() - 1095, by = "1 day", length = 730)),
Measurement = rnorm(730))
#initialize values
N <- 20
len_potential <- nrow(potential) - (N - 1)
time_start <- 1096
result.df <- data.frame(Day = potential[1,1],
Corr = numeric(len_potential),
Day2 = potential[1,1],
stringsAsFactors = FALSE
)
#use a for loop
for(i in 1:len_potential){
result.df[i,1] = as.Date(potential[i,1])
result.df[i,2] = cor(target[,2], potential[i:(i+N-1), 2])
result.df[i,3] = Sys.Date() - (time_start - i)
}
Also, as a note on posting questions to SO, sometimes it is helpful to provide desired output.
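The same rolling computation can also be written without filling a data frame row by row. A minimal sketch, assuming fixed dates instead of Sys.Date() so the result is repeatable, a shorter potential series, and a step variable (my name) for the roll-forward size mentioned in the question:

```r
set.seed(1)
N    <- 20   # window size, equal to the length of the target sample
step <- 1    # roll forward by this many days each time

target <- rnorm(N)
potential <- data.frame(
  Date        = seq(as.Date("2020-01-01"), by = "1 day", length.out = 100),
  Measurement = rnorm(100)
)

# Row index at which each window starts
starts <- seq(1, nrow(potential) - N + 1, by = step)

result <- data.frame(
  WindowStartDate = potential$Date[starts],
  Corr = vapply(
    starts,
    function(i) cor(target, potential$Measurement[i:(i + N - 1)]),
    numeric(1)
  )
)
head(result)
```

Changing step to, say, 5 gives windows that overlap less, without touching the rest of the code.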

Programming a sensitivity analysis in R: Vary 1 parameter (column), hold others constant. Better way?

I want to test the sensitivity of a calculation to the value of 4 parameters. To do this, I want to vary one parameter at a time -- i.e., change Variable 1, hold variables 2-4 at a "default" value (e.g., 1). I thought an easy way to organize these values would be in a data.frame(), where each column corresponds to a different variable, and each row to a set of parameters for which the calculation should be made. I would then loop through each row of the data frame, evaluating a function given the parameter values in that row.
This seems like it should be a simple thing to do, but I can't find a quick way to do it.
The problem might be my overall approach to programming the sensitivity analysis, but I can't think of a good, simple way to program the aforementioned data.frame.
My code for generating the data.frame:
Adj_vals <- c(seq(0, 1, by=0.1), seq(1.1, 2, by=0.1)) #a series of values for 3 of the parameters to use
A_Adj_vals <- 10^(seq(1,14,0.5)) #a series of values for another one of the parameters to use
n1 <- length(Adj_vals)
n2 <- length(A_Adj_vals)
data.frame(
"Dg_Adj"=c(Adj_vals, rep(1, n1*2+n2)), #this parameter's default is 1
"Df_Adj"=c(rep(1, n1), Adj_vals, rep(1, n1+n2)), #this parameter's default is 1
"sd_Adj"=c(rep(1, n1*2), 0.01, Adj_vals[-1], rep(1, n2)), #This parameter has default of 1, but unlike the others using Adj_vals, it can only take on values >0
"A"=c(rep(1E7, n1*3), A_Adj_vals) #this parameter's default is 10 million
)
This code produces the desired data.frame. Is there a simpler way to achieve the same result? I would accept an answer where sd_Adj takes on 0 instead of 0.01.
It's pretty debatable if this is better, but another way to do it would be to follow this pattern:
defaults<-data.frame(a=1,b=1,c=1,d=10000000)
merge(defaults[c("b","c","d")],data.frame(a=c(seq(0, 1, by=0.1), seq(1.1, 2, by=0.1))))
This should be pretty easy to cook up into a function that automatically removes the correct column from defaults, based on the column name in the data frame you are merging with.
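One way such a function might look, as a sketch only. The function name one_at_a_time and the ranges list are my own inventions; the parameter names and values come from the question. It replicates the defaults row instead of using merge, which keeps the column order fixed:

```r
# Build a one-at-a-time design: vary each parameter over its range
# while all the other parameters stay at their default values.
one_at_a_time <- function(defaults, ranges) {
  blocks <- lapply(names(ranges), function(p) {
    # Repeat the single defaults row once per value of parameter p
    block <- defaults[rep(1, length(ranges[[p]])), , drop = FALSE]
    block[[p]] <- ranges[[p]]
    block
  })
  out <- do.call(rbind, blocks)
  rownames(out) <- NULL
  out
}

Adj_vals <- c(seq(0, 1, by = 0.1), seq(1.1, 2, by = 0.1))
defaults <- data.frame(Dg_Adj = 1, Df_Adj = 1, sd_Adj = 1, A = 1e7)
ranges <- list(
  Dg_Adj = Adj_vals,
  Df_Adj = Adj_vals,
  sd_Adj = c(0.01, Adj_vals[-1]),   # must stay above 0
  A      = 10^seq(1, 14, 0.5)
)

design <- one_at_a_time(defaults, ranges)
dim(design)
```

Adding a fifth parameter then only needs one more column in defaults and one more entry in ranges.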

data.table and "by must evaluate to list" Error

I would like to use the data.table package in R to dynamically generate aggregations, but I am running into an error. Below, let my.dt be of type data.table.
library(data.table)
sex <- c("M","F","M","F")
age <- c(19, 23, 26, 21)
dependent.variable <- c(1400, 1500, 1250, 1100)
my.dt <- data.table(sex, age, dependent.variable)
grouping.vars <- c("sex", "age")
for (i in 1:2) {
my.dt[,sum(dependent.variable), by=grouping.vars[i]]
}
If I run this, I get errors:
Error in `[.data.table`(my.dt, , sum(dependent.variable), by = grouping.vars[i]) :
by must evaluate to list
Yet the following works without error:
my.dt[,sum(dependent.variable), by=sex]
I see why the error is occurring, but I do not see how to use a vector with the by parameter.
[UPDATE] 2 years after question was asked ...
On running the code in the question, data.table is now more helpful and returns this (using 1.8.2) :
Error in `[.data.table`(my.dt, , sum(dependent.variable), by = grouping.vars[i]) :
'by' appears to evaluate to column names but isn't c() or key(). Use by=list(...)
if you can. Otherwise, by=eval(grouping.vars[i]) should work. This is for efficiency
so data.table can detect which columns are needed.
and following the advice in the second sentence of error :
my.dt[,sum(dependent.variable), by=eval(grouping.vars[i])]
sex V1
1: M 2650
2: F 2600
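One further thing to watch in the question's loop: inside a for loop R does not auto-print the value of an expression, so even with a working by the loop would show nothing. A minimal sketch, assuming data.table is installed, that uses the by=eval(...) form from the error message and stores each aggregation in a list (the names totals and total are mine):

```r
library(data.table)

my.dt <- data.table(
  sex = c("M", "F", "M", "F"),
  age = c(19, 23, 26, 21),
  dependent.variable = c(1400, 1500, 1250, 1100)
)
grouping.vars <- c("sex", "age")

# Keep one aggregated table per grouping variable
totals <- list()
for (g in grouping.vars) {
  totals[[g]] <- my.dt[, list(total = sum(dependent.variable)), by = eval(g)]
}

print(totals$sex)   # print() is needed to display results from inside a loop too
```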
Old answer from Jul 2010 (by can now be double and character, though) :
Strictly speaking the by needs to evaluate to a list of vectors each with storage mode integer, though. So the numeric vector age could also be coerced to integer using as.integer(). This is because data.table uses radix sorting (very fast) but the radix algorithm is specifically for integers only (see wikipedia's entry for 'radix sort'). Integer storage for key columns and ad hoc by is one of the reasons data.table is fast. A factor is of course an integer lookup to unique strings.
The idea behind by being a list() of expressions is that you are not restricted to column names. It is usual to write expressions of column names directly in the by. A common one is to aggregate by month; for example :
DT[,sum(col1), by=list(region,month(datecol))]
or a very fast way to group by yearmonth is by using a non epoch based date, such as yyyymmddL as seen in some of the examples in the package, like this :
DT[,sum(col1), by=list(region,month=datecol%/%100L)]
Notice how you can name the columns inside the list() like that.
To define and reuse complex grouping expressions :
e = quote(list(region,month(datecol)))
DT[,sum(col1),by=eval(e)]
DT[,sum(col2*col3/col4),by=eval(e)]
Or if you don't want to re-evaluate the by expressions each time, you can save the result once and reuse the result for efficiency; if the by expressions themselves take a long time to calculate/allocate, or you need to reuse it many times :
byval = DT[,list(region,month(datecol))]
DT[,sum(col1),by=byval]
DT[,sum(col2*col3/col4),by=byval]
Please see http://datatable.r-forge.r-project.org/ for latest info and status. A new presentation will be up there soon and hoping to release v1.5 to CRAN soon too. This contains several bug fixes and new features detailed in the NEWS file. The datatable-help list has about 30-40 posts a month which may be of interest too.
I made two changes to your original code:
sex <- c("M","F","M","F")
age <- c(19, 23, 26, 21)
age<-as.factor(age)
dependent.variable <- c(1400, 1500, 1250, 1100)
my.dt <- data.table(sex, age, dependent.variable)
for ( a in 1:2){
print(my.dt[,sum(dependent.variable), by=list(sex,age)[a]])
}
Numerical vector age should be forced into factors. As to by parameter, do not use quote for column names but group them into list(...). At least this is what the author has suggested.
