MHSMM package R input data format with multiple variables - r

My problem is similar to the question linked here: the problem of R-input Format.
I tried the code from that link and revised parts of it to suit my data. My data looks like the following.
I want my data to be turned into a data frame with 4 variable vectors. The code I revised is:
formatMhsmm <- function(data){
nb.sequences = nrow(data)
nb.variables = ncol(data)
data_df <- data.frame(matrix(unlist(data), ncol = 4, byrow = TRUE))
# iterate over these in loops
rows <- 1: nb.sequences
# build vector with id value
id = numeric(length = nb.sequences)
for( i in rows)
{
id[i] = data_df[i,2]
}
# build vector with time value
time = numeric (length = nb.sequences)
for( i in rows)
{
time[i] = data_df[i,3]
}
# build vector with observation values
sequences = numeric(length = nb.sequences)
for(i in rows)
{
sequences[i] = data_df[i, 4]
}
data.df = data.frame(id,time,sequences)
# creation of hsmm data object need for training
N <- as.numeric(table(data.df$id))
train <- list(x = data.df$sequences, N = N)
class(train) <- "hsmm.data"
return(train)
}
library(mhsmm)
dataset <- read.csv("location.csv", header = TRUE)
train <- formatMhsmm(dataset)
print(train)
The output observations are not the data from the 4th column; they are a list like (4, 8, 12, ..., 396, 1, 1, ..., 56, 192, ..., 6550, 68, NA, NA, ...). It has picked up 1/4 of the data from each column. Why does this happen?
Thank you very much!!!!

Why don't you simply count your observations by id and create the hsmm.data object directly? Supposing your data frame is called "data", we have:
N <- as.numeric(table(data$id))
train <- list(x=data$location, N = N)
class(train) <- "hsmm.data"
Extracted from http://www.jstatsoft.org/v39/i04/paper
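As for why the original function returns a mix of all columns: unlist() flattens a data frame column by column, so refilling a matrix with byrow = TRUE places consecutive values of one column across each row. The sketch below shows this on a small made-up data frame (the column names X, id, time, and location and all values are assumptions, since the real location.csv is not shown).

```r
# Toy data frame standing in for location.csv (hypothetical values)
toy <- data.frame(X = 1:4, id = c(1, 1, 2, 2), time = c(1, 2, 1, 2),
                  location = c(56, 192, 68, 70))

# unlist() flattens column by column: all of X, then all of id, and so on.
flat <- unlist(toy, use.names = FALSE)

# Filling by row puts four consecutive values of one column into each row,
# so column 4 ends up holding every 4th value of every column.
scrambled <- matrix(flat, ncol = 4, byrow = TRUE)

# Filling by column (the default) restores the original layout.
restored <- matrix(flat, ncol = 4)

# Simplest: read the columns by name and skip the matrix round trip.
N <- as.numeric(table(toy$id))
train <- list(x = toy$location, N = N)
class(train) <- "hsmm.data"
```

Here scrambled[, 4] is a mix of values from all four columns, while restored[, 4] matches toy$location.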

Related

How to run function on individual columns instead of data frame?

Hello everyone. I have two data frames and am bootstrapping with Script1 below. In Script1 I take the number of rows from data frames one and two. Instead of taking the row count from the entire data frame, I want to split each column out as its own data frame, remove the zero values, take the row count, and then do the bootstrapping with the script below. In Script2 I create the individual data frames in a for loop, but as I am new to R I am unsure how to add the Script1 logic to it efficiently.
Please advise. Below are Script1, which runs, and Script2, where I try to subset each column into an individual data frame.
Script1
set.seed(2)
m1 <- matrix(sample(c(0, 1:10), 100, replace = TRUE), 10)
m2 <- matrix(sample(c(0, 1:5), 50, replace = TRUE), 5)
m1 <- as.data.frame(m1)
m2 <- as.data.frame(m2)
nboot <- 1e3
n_m1 <- nrow(m1); n_m2 <- nrow(m2)
temp<- c()
for (j in seq_len(nboot)) {
boot <- sample(x = seq_len(n_m1), size = n_m2, replace = TRUE)
value <- colSums(m2)/colSums(m1[boot,])
temp <- rbind(temp, value)
}
boot_data<- apply(temp, 2, median)
Script2
for (i in colnames(m1)){
m1_subset=(m1[m1[[i]] > 0, ])
m1_subset=m1_subset[i]
m2_subset=m2[m2[[i]] >0, ]
m2_subset=m2_subset[i]
num_m1 <- nrow(m1_subset); n_m2 <- nrow(m2_subset)# after this wanted add above script changing input
}
If I understand correctly, you want to do the sampling and calculation on each column individually, after removing the 0 values. I modified your code to work on a single vector instead of a data frame (i.e., using length() instead of nrow() and sum() instead of colSums()). I also suggest creating the empty matrix for your results ahead of time and filling it in; it will be faster.
temp <- matrix(nrow = nboot, ncol = ncol(m1))
for (i in seq_along(m1)){
m1_subset = m1[m1[,i] > 0, i]
m2_subset = m2[m2[,i] > 0, i]
n_m1 <- length(m1_subset); n_m2 <- length(m2_subset)
for (j in seq_len(nboot)) {
boot <- sample(x = seq_len(n_m1), size = n_m2, replace = TRUE)
temp[j, i] <- sum(m2_subset)/sum(m1_subset[boot])
}
}
boot_data <- apply(temp, 2, median)
boot_data <- setNames(data.frame(t(boot_data)), names(m1))
boot_data

Function in R that performs multiple operations over columns of two datasets

I have two datasets, each with 5 columns and 10,000 rows. I want to calculate y from values in matching columns of the two datasets: column 1 in dataset 1 with column 1 in dataset 2, then column 2 in dataset 1 with column 2 in dataset 2, and so on. The y needs, however, to follow a set of rules before being calculated. What I have done so far doesn't work, and I cannot figure out why, or whether there is an easier way to do all of this.
Create data from t-distributions
mx20 <- as.data.frame(replicate(10000, rt(20,19)))
mx20.50 <- as.data.frame(replicate(10000, rt(20,19)+0.5))
Calculates the mean for each simulated sample
m20 <- apply(mx20, FUN=mean, MARGIN=2)
m20.05 <- apply(mx20.50, FUN=mean, MARGIN=2)
Steps 1 and 2 above are repeated for the other sample sizes, using the t-distributions rt(30,29), rt(50,49), rt(100,99), and rt(1000,999), which gives five sample sizes in total.
Bind tables (create data.frame) for each t-distribution specification
tbl <- cbind(m20, m30, m50, m100, m1000)
tbl.50 <- cbind(m20.05, m30.05, m50.05, m100.05, m1000.05)
Finally, I want to calculate the y as specified above. But here is where I get totally lost. Please see below my best attempt so far.
y = (mtheo-m0)/(m1-m0), where y = 0 when m1 < m0 and y = y when m1 >= m0. mtheo is a constant (e.g. 0.50), m1 is value in column 1 of tbl and m0 is value in column 1 of tbl.50.
ycalc <- function(mtheo, m1, m0) {
ifelse(m1>=m0) {
y = (mteo-m0)/(m1-m0)
} ifelse(m1<m0) {
y=0
} returnValue(y)
}
You can try this. I used data frames instead of data tables.
This code is more versatile. You can add or remove parameters. Below are the parameters that you can use to create t distributions.
params = data.frame(
n = c(20, 30, 50, 100, 1000),
df = c(19, 29, 49, 99, 999)
)
And here is a loop that creates the values you need for each t distribution. You can ignore this part if you already have those values (or code to create those values).
tbl = data.frame(i = c(1:10000))
tbl.50 = data.frame(i = c(1:10000))
for (i in 1:nrow(params)) {
mx = as.data.frame(replicate(10000, rt(params[i, 1], params[i, 2])))
m <- apply(mx, FUN=mean, MARGIN=2)
tbl = cbind(tbl, m)
names(tbl)[ncol(tbl)] = paste("m", params[i, 1], sep="")
mx.50 = as.data.frame(replicate(10000, rt(params[i, 1], params[i, 2])+.5))
m.50 <- apply(mx.50, FUN=mean, MARGIN=2)
tbl.50 = cbind(tbl.50, m.50)
names(tbl.50)[ncol(tbl.50)] = paste("m", params[i, 1], ".50", sep="")
}
tbl = tbl[-1]
tbl.50 = tbl.50[-1]
And here is the loop that does the calculations. I save them in a data frame (y). Each column in this data frame is the result of your function applied for all rows.
mtheo = .50
y = data.frame(i = c(1:10000))
for (i in 1:nrow(params)) {
y$dum = 0
idx = which(tbl[, i] >= tbl.50[, i])
y[idx, ]$dum =
(mtheo - tbl.50[idx, i]) /
(tbl[idx, i] - tbl.50[idx, i])
names(y)[ncol(y)] = paste("y", params[i, 1], sep="")
}
y = y[-1]
You could try this, if the first column in tbl is called m1 and the first column in tbl.50 is called m0:
mteo <- 0.5
ycalc <- ifelse(tbl$m1 >= tbl.50$m0, (mteo - tbl.50$m0)/(tbl$m1 - tbl.50$m0),
ifelse(tbl$m1 < tbl.50$m0, 0, "no"))
Using the same column names provided by your code, and transforming your matrices into dataframes:
tbl <- data.frame(tbl)
tbl.50 <- data.frame(tbl.50)
mteo <- 0.5
ycalc <- ifelse(tbl$m20 >= tbl.50$m20.05, (mteo - tbl.50$m20.05)/(tbl$m20 - tbl.50$m20.05),
ifelse(tbl$m20 < tbl.50$m20.05, "0", "no"))
This results in:
head(ycalc)
[1] "9.22491706576716" "0" "0" "0" "0" "1.77027049630147"
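Because the nested ifelse() above mixes numbers with the strings "0" and "no", the result is a character vector. A minimal sketch of a numeric alternative follows, assuming the rule stated in the question (y = 0 when m1 < m0, otherwise (mtheo - m0)/(m1 - m0)); the demo matrices are made-up stand-ins for tbl and tbl.50.

```r
# Vectorised rule: 0 where m1 < m0, otherwise (mtheo - m0) / (m1 - m0).
ycalc <- function(mtheo, m1, m0) {
  ifelse(m1 >= m0, (mtheo - m0) / (m1 - m0), 0)
}

# Small made-up matrices standing in for tbl and tbl.50
set.seed(1)
tbl_demo   <- cbind(m20 = rnorm(5), m30 = rnorm(5))
tbl50_demo <- cbind(m20.05 = rnorm(5, 0.5), m30.05 = rnorm(5, 0.5))

# Comparison and arithmetic on equally sized matrices are element-wise,
# so a single call covers every pair of matching columns.
y <- ycalc(0.5, tbl_demo, tbl50_demo)
```

The result stays numeric and keeps the shape of the inputs. Note that rows where m1 equals m0 exactly would divide by zero and give Inf or NaN.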

Eliminate for-loop through functional programming

I would like to take input from a data frame with a systemName variable and a popNum variable and use it to generate a named list of vectors whose elements are random numbers (1-6)*5, i.e. drawn from (5, 10, 15, 20, 25, 30), where each vector's length equals the popNum of the system.
The following code works:
## Data
#Create a vector of integers
popNum <- c (2,5,3,9)
#Create corresponding names
systemNames <- c("Ruie", "Regina", "Roupe", "Efate")
# form up into a rectangular data frame
dataSource <- cbind.data.frame(systemNames,popNum )
## Create and Fill the List
#initialise the list
availableCargoes <- vector( mode = "list", length = nrow(dataSource))
#name the list
names(availableCargoes) <- dataSource$systemNames
#fill the list
for (loopCounter in 1:nrow(dataSource)) {
availableCargoes[[loopCounter]] <- sample.int( n = 6,
size = dataSource$popNum[loopCounter],
replace = TRUE) * 5
}
How can I get rid of the for-loop through something from the apply family or the purrr package? The problem I am having a hard time resolving is what is the X that the lapply runs the sample.int over? How do I pass the vector of popNum as an argument to control the size of the resulting vectors?
Use lapply to loop directly through dataSource$popNum.
Note that I set the RNG seed to make the results reproducible.
set.seed(1234)
for (loopCounter in 1:nrow(dataSource)) {
availableCargoes[[loopCounter]] <- sample.int( n = 6,
size = dataSource$popNum[loopCounter],
replace = TRUE) * 5
}
set.seed(1234)
ac <- lapply(dataSource$popNum, function(s)
sample.int(n = 6, size = s, replace = TRUE)*5)
names(ac) <- dataSource$systemNames
ac
identical(availableCargoes, ac)
#[1] TRUE
sapply version
## Data
#Create a vector of integers
popNum <- c (2,5,3,9)
#Create corresponding names
systemNames <- c("Ruie", "Regina", "Roupe", "Efate")
# form up into a rectangular data frame
dataSource <- cbind.data.frame(systemNames,popNum )
## Create and Fill the List
#initialise the list
availableCargoes <- vector( mode = "list", length = nrow(dataSource))
#name the list
names(availableCargoes) <- dataSource$systemNames
#fill the list
availableCargoes <- sapply(as.character(dataSource$systemNames),function(sysname){
sample.int( n = 6,
size = dataSource$popNum[dataSource$systemNames==sysname],
replace = TRUE) * 5
},USE.NAMES=T,simplify = F)
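A further variation, sketched here in base R only: lapply() carries over the names of its input, so naming the popNum vector first removes the separate names<- step. The data repeat the question's example.

```r
popNum <- c(2, 5, 3, 9)
systemNames <- c("Ruie", "Regina", "Roupe", "Efate")
dataSource <- data.frame(systemNames, popNum)

set.seed(1234)
# setNames() attaches the system names to the sizes; lapply() keeps them.
availableCargoes <- lapply(
  setNames(dataSource$popNum, dataSource$systemNames),
  function(s) sample.int(n = 6, size = s, replace = TRUE) * 5
)
```

Each list element is named after its system and has as many entries as that system's popNum.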

Create a matrix from a list consisting of unequal matrices for individual bootstraps

I am trying to create a matrix from a list that consists of N matrices of unequal size.
The reason is to draw R individual bootstrap samples.
The example below has 2 companies, one with 10 observations and one with just 5.
Data:
set.seed(7)
Time <- c(10,5)
xv <- matrix(c(rnorm(10,5,2), rnorm(5,20,1), rnorm(10,5,2), rnorm(5,20,1)), ncol=2);
y <- matrix( c(rnorm(10,5,2), rnorm(5,20,1)));
z <- matrix(c(rnorm(10,5,2), rnorm(5,20,1), rnorm(10,5,2), rnorm(5,20,1)), ncol=2)
# create data frame of input variables which helps
# to conduct the rowise bootstrapping
data <- data.frame (y = y, xv = xv, z = z);
rows <- dim(data)[1];
cols <- dim(data)[2];
# create the index to sample from the different panels
cumTime <- c(0, cumsum (Time));
index <- findInterval (seq (1:rows), cumTime, left.open = TRUE);
# draw R individual bootstrap samples
bootList <- replicate(n = 5, list(), simplify = FALSE); # replicate() takes n, not R
bootList <- lapply (bootList, function(x) by (data, INDICES = index, FUN = function(x) dplyr::sample_n (tbl = x, size = dim(x)[1], replace = T)));
---------- UNLISTING ---------
Currently, I am doing it incorrectly like this:
Example for just 1 entry of the list:
matrix(unlist(bootList[[1]], recursive = T), ncol = cols)
The desired output is just
bootList[[1]]
as a matrix.
Do you have an idea how to do this & if possible reasonably efficient?
The matrices are then processed in unfortunately slow MLE estimations...
I found a solution for you. From what I gather, you have a data frame containing all observations of all companies, which may have different panel lengths. As a result you would like a bootstrap sample for each company, of the same size as the original panel length.
You merely have to add a company indicator:
data$company = c(rep(1, 10), rep(2, 5)) # this could even be a factor.
L1 = split(data, data$company)
L2 = lapply(L1, FUN = function(s) s[sample(x = 1:nrow(s), size = nrow(s), replace = TRUE),] )
Stop here if you would like to have separate bootstrap samples, e.g. in case you want to estimate separately.
bootdata = do.call(rbind, L2)
Best wishes,
Tim
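To get the matrices the question asks for, the stacked data frame can be passed through as.matrix(). The following is a rough sketch rather than a tuned implementation: the helper name boot_once, the column names, and the simulated values are all assumptions made for illustration.

```r
set.seed(7)
Time <- c(10, 5)
# Made-up panel data: 10 rows for company 1, 5 rows for company 2
data <- data.frame(y = rnorm(15), x1 = rnorm(15), x2 = rnorm(15))
company <- rep(seq_along(Time), times = Time)

# One bootstrap draw: resample row indices within each company,
# then stack the selected rows into a single numeric matrix.
boot_once <- function(d, g) {
  idx <- unlist(lapply(split(seq_len(nrow(d)), g),
                       function(i) i[sample.int(length(i), replace = TRUE)]))
  as.matrix(d[idx, , drop = FALSE])
}

R <- 5  # number of bootstrap samples
bootList <- replicate(R, boot_once(data, company), simplify = FALSE)
```

Each element of bootList is then a 15 x 3 numeric matrix whose first 10 rows come from company 1 and last 5 from company 2. Sampling indices instead of data frames avoids the by() and dplyr overhead, which may matter when R is large.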

extracting results from a for loop

I am currently working on a program for event studies in R. I have a large set of events and have used the code below to create a for loop that runs the regression n times, where n is the number of events in my dataset.
I am really struggling to get the results from each regression into a data frame and was wondering if anyone has an idea how to do it. My code is here:
#Reading in the data from excel
AllData <- read.xlsx("eventstudy.xlsx", 1)
EventDates <- read.xlsx("eventstudy.xlsx", 2)
Event <- list()
p <- for(i in 1:length(EventDates$Date.of.notice.)) {
Event[[i]] <- evReturn(y = AllData,
firm = as.character(EventDates$Bank.[i]),
y.date = "date",
index = "s.p500",
est.win = 252,
digits = 3,
event.date = as.character( EventDates$Date.of.notice.[i] ),
event.win = 30)
Event[[i]]
plot(Event[[i]])
print(Event[[i]])
}
d <- matrix(nrow=20, ncol=3)
for (i in 1:20) {
d[i,] <- c(i+i, i*i, i/1)
}
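The same idea carries over to regressions: build one row of results per iteration and stack the rows at the end. The structure of the evReturn output is not shown here, so the sketch below uses plain lm() on made-up returns as a stand-in; which elements to pull out of each Event[[i]] is an assumption you would check with str(Event[[1]]).

```r
set.seed(42)
# Made-up returns for three hypothetical firms plus a market index
returns <- data.frame(market = rnorm(100))
firms <- c("A", "B", "C")
for (f in firms) returns[[f]] <- 0.8 * returns$market + rnorm(100, sd = 0.5)

# Fit one regression per firm and keep a one-row data frame
# holding the numbers of interest.
fits <- lapply(firms, function(f) {
  fit <- lm(returns[[f]] ~ returns$market)
  data.frame(firm = f,
             alpha = unname(coef(fit)[1]),
             beta = unname(coef(fit)[2]),
             r.squared = summary(fit)$r.squared)
})

# Stack the rows into a single data frame, one row per regression.
results <- do.call(rbind, fits)
```

Collecting rows in a list and binding once at the end avoids growing a data frame inside the loop.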
