Efficient way to calculate average MAPE and MSE in R

I have real data and predicted data, and I want to calculate the overall MAPE and MSE. The data are time series, with each column representing data for a different week. I predict a value for each of the 52 weeks for each of the items, as shown below. What would be the best way to calculate the overall error in R?
real = matrix(
  c("item1", "item2", "item3", "item4", .5, .7, 0.40, 0.6, 0.3, 0.29, 0.7, 0.09, 0.42, 0.032, 0.3, 0.37),
  nrow = 4,
  ncol = 4)
colnames(real) <- c("item", "week1", "week2", "week3")
predicted = matrix(
  c("item1", "item2", "item3", "item4", .55, .67, 0.40, 0.69, 0.13, 0.9, 0.47, 0.19, 0.22, 0.033, 0.4, 0.37),
  nrow = 4,
  ncol = 4)
colnames(predicted) <- c("item", "week1", "week2", "week3")

How did you get the predicted values in the first place? The model you used to obtain them is probably based on minimising some function of the prediction errors (usually the MSE). So if you fitted the model yourself, the residuals, and with them some MSE and MAPE metrics, have most likely been computed somewhere along the line, and you can probably retrieve them directly.
If the predicted values happened to be thrown into your lap and you had nothing to do with fitting the model, then you can calculate MSE and MAPE as follows:
You have only one record per week for every item. So for every item, you can only calculate one prediction error per week. Depending on your application, you can choose to calculate the MSE and MAPE per item or per week.
This is what your data looks like:
real <- matrix(
c(.5, .7, 0.40, 0.6, 0.3, 0.29, 0.7, 0.09, 0.42, 0.032, 0.3, 0.37),
nrow = 4, ncol = 3)
colnames(real) <- c("week1", "week2", "week3")
predicted <- matrix(
c(.55, .67, 0.40, 0.69, 0.13, 0.9, 0.47, 0.19, 0.22, 0.033, 0.4, 0.37),
nrow = 4, ncol = 3)
colnames(predicted) <- c("week1", "week2", "week3")
Calculate the (percentage/squared) errors for every entry:
pred_error <- real - predicted
pct_error <- pred_error/real
squared_error <- pred_error^2
Calculate MSE, MAPE:
# For per-item prediction errors (MARGIN = 1 averages across each row)
apply(squared_error, MARGIN = 1, mean) # MSE
apply(abs(pct_error), MARGIN = 1, mean) # MAPE
# For per-week prediction errors (MARGIN = 2 averages across each column)
apply(squared_error, MARGIN = 2, mean) # MSE
apply(abs(pct_error), MARGIN = 2, mean) # MAPE
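If instead you want a single overall figure across all items and weeks, you can average over every cell of the error matrices at once; a minimal sketch, assuming every item-week combination carries equal weight:
mean(squared_error)  # overall MSE
mean(abs(pct_error)) # overall MAPE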

Related

LMest: problem introducing covariates to the measurement model when fitting a Latent Markov Model to continuous data

I am working with longitudinal continuous data that reflect the linguistic abilities of children. In that regard I seek to fit a Latent Transition Model, more exactly a Latent Markov Model, using the LMest package in R. As far as I have understood, this means creating both a measurement model and, subsequently, a latent model, into both of which covariates (X) can be introduced; however, I fail when I try to add them to the measurement model. Can anyone tell me why?
##### SIMULATED DATA OF THE SAME NATURE
ID <- c(1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2,3,3,3,3,3,3,3,3,3)
time <- c(0,1,2,3,4,5,6,7,8,0,1,2,3,4,5,6,7,8,0,1,2,3,4,5,6,7,8)
gender <- c(1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2,1,1,1,1,1,1,1,1,1)
response_y <- c(NA, 0.15, 0.2, 0.4, 0.64, NA, 0.85, 0.89, NA, 0.02, NA, 0.01, 0.11, 0.35, 0.63, NA, NA, NA, NA, 0.3, NA, 0.56, 0.84, 0.81, 0.9, NA, NA)
response_y1 <- c(NA, 0.1, 0.3, 0.5, NA, NA, 0.7, 0.89, NA, NA, NA, 0.01, 0.11, 0.35, NA, NA, NA, NA, NA, 0.3, NA, 0.56, 0.84, NA, 0.9, 0.91, NA)
d = data.frame(ID, time, gender, response_y, response_y1)
I have so far tried to model it like this:
library(LMest)
## COVARIATES INTRODUCED TO THE MEASUREMENT MODEL (gives error)
lmestCont(responsesFormula = response_y + response_y1 ~ gender,
          latentFormula = NULL, index = c("ID", "time"), k = 1:5,
          data = d, modBasic = 1, output = TRUE, tol = 10^-5, out_se = TRUE)
But I keep getting this error:
Warning: multivariate data are not allowed; only the first response variable is considered
Steps of EM:
1...2...3...4...5...6...7...8...9...10...11...12...13...14...15...16...17...18...19...20...21...22...23...24...25...26...27...28...29...30...31...32...33...34...35...36...37...38...39...40...41...42...43...44...
Missing data in the dataset. imp.mix function (mix package) used for imputation.
Error in aicv[kv] <- out[[kv]]$aic : replacement has length zero
When introducing the covariates to the latent model it works, and looks like this:
## COVARIATES INTRODUCED TO THE LATENT MODEL (RUNS)
mod_con <- lmestCont(responsesFormula = response_y + response_y1 ~ NULL,
                     latentFormula = ~ gender | gender, index = c("ID", "time"),
                     k = 1:5, data = d, modBasic = 1, output = TRUE,
                     tol = 10^-5, out_se = TRUE)
All kinds of advice are happily received, also on LMest in general; maybe I have misunderstood something! Thanks.

IV for logistic regression with clustered standard errors in R

I have individual-level data to analyze the effect of state-level educational expenditures on individual-level students' performances. Students' performance is a binary variable (0 when they do not pass, 1 when they pass the test). I run the following glm model with state-level clustering of standard errors:
library(miceadds)
df_logit <- data.frame(performance = c(0, 1, 0, 1, 0, 0, 0, 1, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0),
state = c("MA", "MA", "MB", "MC", "MB", "MD", "MA", "MC", "MB", "MD", "MB", "MC", "MA", "MA", "MA", "MA", "MD", "MA","MB","MA","MA","MD","MC","MA","MA","MC","MB","MB","MD", "MB"),
expenditure = c(123000, 123000,654000, 785000, 654000, 468000, 123000, 785000, 654000, 468000, 654000, 785000,123000,123000,123000,123000, 468000,123000, 654000, 123000, 123000, 468000,785000,123000, 123000, 785000, 654000, 654000, 468000,654000),
population = c(0.25, 0.25, 0.12, 0.45, 0.12, 0.31, 0.25, 0.45, 0.12, 0.31, 0.12, 0.45, 0.25, 0.25, 0.25, 0.25, 0.31, 0.25, 0.12, 0.25, 0.25, 0.31, 0.45, 0.25, 0.25, 0.45, 0.12, 0.12, 0.31, 0.1),
left_wing = c(0.10, 0.10, 0.12, 0.18, 0.12, 0.36, 0.10, 0.18, 0.12, 0.36, 0.12, 0.18, 0.10, 0.10, 0.10, 0.10, 0.36, 0.10, 0.12, 0.10, 0.10, 0.36, 0.18, 0.10, 0.10,0.18, 0.12, 0.12, 0.36, 0.12))
df_logit$performance <- as.factor(df_logit$performance)
glm_clust_1 <- miceadds::glm.cluster(data = df_logit,
                                     formula = performance ~ expenditure + population,
                                     cluster = "state", family = binomial(link = "logit"))
summary(glm_clust_1)
Since I cannot rule out the possibility that expenditures are endogenous, I would like to use the share of left-wing parties at the state level as an instrument for education expenditures.
However, I have not found a command in ivtools or other packages to run two-stage least squares with control variables in a logistic regression with state-level clustered standard errors.
Which commands can I use to extend my logit model with the instrument "left_wing" (also included in the example dataset) and at the same time output the common tests like the Wu-Hausman test or the weak-instrument test (as ivreg does for OLS)?
Ideally, I could adapt the following command to binary dependent variables and cluster the standard errors at the state level:
iv_1 <- ivreg(performance ~ population + expenditure | left_wing + population, data=df_logit)
summary(iv_1, cluster="state", diagnostics = TRUE)
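For the linear model, the clustering itself can already be handled, for example via the sandwich package; a minimal sketch of what I mean, assuming AER's ivreg and a numeric 0/1 outcome (the helper column performance_num and the object name iv_lpm are just illustrative):
library(AER)       # provides ivreg()
library(sandwich)  # provides vcovCL() for cluster-robust covariance matrices
# ivreg() needs a numeric response, so convert the factor back to 0/1
df_logit$performance_num <- as.numeric(as.character(df_logit$performance))
iv_lpm <- ivreg(performance_num ~ population + expenditure | left_wing + population,
                data = df_logit)
summary(iv_lpm, vcov. = vcovCL(iv_lpm, cluster = df_logit$state), diagnostics = TRUE)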
Try this?
require(mlogit)
require(ivprobit)
test <- ivprobit(performance ~ population | expenditure | left_wing + population, data = df_logit)
summary(test)
I wasn't completely sure about the clustering part, but according to this thread on CrossValidated, it might not be necessary. Please give it a read and let me know what you think.
Essentially, what I understood is that since the likelihood of binary data is already fully specified, there is no need to include the clusters. This is only true when your model is "correct"; if you believe that there is something in the joint distribution that is not accounted for, then you should cluster, though from my reading it doesn't seem to be possible to implement clustering for an IV logit model in R.
In terms of the model itself, there is a really good explanation in this SO question: How can I use the "ivprobit" function in the "ivprobit" package in R?
From my reading, there should also be almost no difference between the end results of a logit vs. a probit model.
The basic breakdown (using the variable names from that question) is as follows:
y  = d2 = the dichotomous l.h.s.
x  = ltass + roe + div = the r.h.s. exogenous variables
y1 = eqrat + bonus = the r.h.s. endogenous variables
x2 = ltass + roe + div + gap + cfa = the complete set of instruments
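In formula terms, those three blocks are separated by |, so a call follows the pattern sketched below; the variable names are the ones from the breakdown above, and the data frame name dat is a placeholder:
# pattern: dichotomous outcome ~ exogenous | endogenous | complete instrument set
fit <- ivprobit(d2 ~ ltass + roe + div | eqrat + bonus | ltass + roe + div + gap + cfa,
                data = dat)  # 'dat' is a hypothetical data frame containing these columns
summary(fit)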
Feel free to comment/edit/give feedback on this answer, as I'm definitely not an expert in applications of causal analysis and it's been a long time since I've implemented one. I also have not explored the potential post-hoc tests from this final model, so that is still left for completion.

bnlearn Error: Wrong number of conditional probability distributions

I am learning to work with bnlearn and I keep running into the following error in the last line of my code below:
Error in custom.fit(dag, cpt) : wrong number of conditional probability distributions
What am I doing wrong?
modelstring(dag) = "[s][r][nblw|r][nblg|nblw][mlw|s:r][f|s:r:mlw][mlg|mlw:f][mlgr|mlg:nblg]"
###View DAG Specifics
dag
arcs(dag)
nodes(dag)
# Create Levels
State <- c("State0", "State1")
## Create the given probability distributions; these are all 2-d because they have 1 or 2 nodes
cptS <- matrix(c(0.6, 0.4), ncol=2, dimnames=list(NULL, State))
cptR <- matrix(c(0.7, 0.3), ncol=2, dimnames=list(NULL, State))
cptNBLW <- matrix(c(0.95, 0.05, 0.05, 0.95), ncol=2, dimnames=list(NULL, "r"= State))
cptNBLG <- matrix(c(0.9, 0.099999999999999998, 0.2, 0.8), ncol=2, dimnames=list(NULL, "nblw"=State))
cptMLG <- matrix(c(0.95, 0.05, 0.4, 0.6, 0.2, 0.8, 0.05, 0.95),ncol=2,nrow = 2,
dimnames=list("mlw"= State, "f"=State))
cptMLGR <- matrix(c(0.6,0.4,0.95,0.05,0.2,0.8,0.55,0.45),ncol=2,nrow = 2,
dimnames=list("mlg"= State, "nblg"=State))
cptMLW <-matrix(c(0.95, 0.05, 0.1, 0.9, 0.2, 0.8, 0.01, 0.99), ncol=2,nrow = 2,byrow = TRUE,
dimnames=list("r"= State, "s"=State))
# Build 3-d matrices (because you have 3 nodes, you can't use the matrix function; you have to build it from scratch)
cptF <- c(0.05, 0.95, 0.4, 0.6, 0.9, 0.1, 0.99, 0.01, 0.9, 0.1, 0.95, 0.05, 0.95, 0.05, 0.99, 0.01)
dim(cptF) <- c(2, 2, 2, 2)
dimnames(cptF) <- list("s"=State, "r"=State, "mlw"=State)
###Create CPT Table
cpt <- list(s = cptS, r = cptR, mlw = cptMLW,nblw= cptNBLW,
mlg= cptMLG, nblg= cptNBLG, mlgr= cptMLGR)
# Construct BN network with Conditional Probability Table
S.net <- custom.fit(dag,cpt)
Reference: https://rpubs.com/sarataheri/bnlearnCGM
You have several errors in your CPT definitions. Primarily, you need to make sure that:
the number of probabilities supplied is equal to the product of the number of states of the child and parent nodes;
the number of dimensions of the matrix/array is equal to the number of parent nodes plus one, for the child node;
the child node is given in the first dimension when the array has more than one dimension;
the names given in the dimnames arguments (e.g. the names in dimnames=list(ThisName = ...)) match the names that were defined in the DAG, in your case with modelstring and in my answer with model2network. (So my earlier suggestion of using dimnames=list(cptNBLW = ...) should be dimnames=list(nblw = ...) to match how node nblw was declared in the model string.)
You also did not add node f into your cpt list.
Below is your code with comments where things have been changed. (I have commented out the offending lines and added corrected ones straight after.)
library(bnlearn)
dag <- model2network("[s][r][nblw|r][nblg|nblw][mlw|s:r][mlg|mlw:f][mlgr|mlg:nblg][f|s:r:mlw]")
State <- c("State0", "State1")
cptS <- matrix(c(0.6, 0.4), ncol=2, dimnames=list(NULL, State))
cptR <- matrix(c(0.7, 0.3), ncol=2, dimnames=list(NULL, State))
# add child node into first slot of dimnames
cptNBLW <- matrix(c(0.95, 0.05, 0.05, 0.95), ncol=2, dimnames=list(nblw=State, "r"= State))
cptNBLG <- matrix(c(0.9, 0.099999999999999998, 0.2, 0.8), ncol=2, dimnames=list(nblg=State,"nblw"=State))
# Use a 3d array and not matrix, and add child node into dimnames
# cptMLG <- matrix(c(0.95, 0.05, 0.4, 0.6, 0.2, 0.8, 0.05, 0.95),ncol=2,nrow = 2, dimnames=list("mlw"= State, "f"=State))
cptMLG <- array(c(0.95, 0.05, 0.4, 0.6, 0.2, 0.8, 0.05, 0.95),dim=c(2,2,2), dimnames=list(mlg = State, "mlw"= State, "f"=State))
# cptMLGR <- matrix(c(0.6,0.4,0.95,0.05,0.2,0.8,0.55,0.45),ncol=2,nrow = 2, dimnames=list("mlg"= State, "nblg"=State))
cptMLGR <- array(c(0.6,0.4,0.95,0.05,0.2,0.8,0.55,0.45), dim=c(2,2,2), dimnames=list(mlgr=State, "mlg"= State, "nblg"=State))
# cptMLW <-matrix(c(0.95, 0.05, 0.1, 0.9, 0.2, 0.8, 0.01, 0.99), ncol=2,nrow = 2,byrow = TRUE, dimnames=list("r"= State, "s"=State))
cptMLW <-array(c(0.95, 0.05, 0.1, 0.9, 0.2, 0.8, 0.01, 0.99), dim=c(2,2,2), dimnames=list(mlw=State, "r"= State, "s"=State))
# add child into first slot of dimnames
cptF <- c(0.05, 0.95, 0.4, 0.6, 0.9, 0.1, 0.99, 0.01, 0.9, 0.1, 0.95, 0.05, 0.95, 0.05, 0.99, 0.01)
dim(cptF) <- c(2, 2, 2, 2)
dimnames(cptF) <- list("f" = State, "s"=State, "r"=State, "mlw"=State)
# add missing node f into list
cpt <- list(s = cptS, r = cptR, mlw = cptMLW,nblw= cptNBLW, mlg= cptMLG, nblg= cptNBLG, mlgr= cptMLGR, f=cptF)
# Construct BN network with Conditional Probability Table
S.net <- custom.fit(dag, dist=cpt)
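If the dimensions and dimnames all line up, custom.fit() returns a fitted bn.fit object without error. As a quick sanity check you can print a single node's conditional probability table, for example:
S.net$mlw  # prints the CPT of node mlw, conditional on its parents r and s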

Problem assigning probabilities in sample function

I have generated a bunch of variables through the following:
max_no = 10
list2env(setNames(as.list(c(0.2, 0.25, 0.15, 0.1, 0.1, 0.05, 0.03, 0.06, 0.03, 0.02, 0.01)),
                  paste0("proportion", 0:max_no)),
         envir = .GlobalEnv)
These generate objects like "proportion0", "proportion1", ..., "proportion10" with values 0.2, 0.25, and so on.
I want to plug these proportion values as a vector of probabilities into the sample function like the following:
sample(seq.int(0, max_no, 1), size=10000, replace=TRUE, prob=c(paste0("proportion", 0:max_no)))
But I get the error message: Error in sample.int(length(x), size, replace, prob) : NA in probability vector
What would be a simple way of feeding in the probabilities?
I am not sure what the use case for this is, but you can use mget and unlist:
sample(seq.int(0, max_no, 1), size = 10000, replace = TRUE,
       prob = unlist(mget(paste0("proportion", 0:max_no))))
Why not pass the probabilities directly instead of storing each number in a different variable?
sample(seq.int(0, max_no, 1), size = 10000, replace = TRUE,
       prob = c(0.2, 0.25, 0.15, 0.1, 0.1, 0.05, 0.03, 0.06, 0.03, 0.02, 0.01))
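If you do need the values both under their own names and as a probability vector, you can also keep everything in a single named vector from the start; a small sketch along those lines:
proportions <- c(0.2, 0.25, 0.15, 0.1, 0.1, 0.05, 0.03, 0.06, 0.03, 0.02, 0.01)
names(proportions) <- paste0("proportion", 0:max_no)
sample(seq.int(0, max_no, 1), size = 10000, replace = TRUE, prob = proportions)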

How to store the output into a list of matrices

Data:
x <- seq(0, 1, len = 1024)
pos <- c(0.1, 0.13, 0.15, 0.23, 0.25, 0.40, 0.44, 0.65, 0.76, 0.78, 0.81)
hgt <- c(4, 5, 3, 4, 5, 4.2, 2.1, 4.3, 3.1, 5.1, 4.2)
wdt <- c(0.005, 0.005, 0.006, 0.01, 0.01, 0.03, 0.01, 0.01, 0.005, 0.008, 0.005)
pSignal <- numeric(length(x))
for (i in seq(along = pos)) {
  pSignal <- pSignal + hgt[i] / (1 + abs((x - pos[i]) / wdt[i]))^4
}
df = as.data.frame(rbind(pSignal,pSignal,pSignal))
dflist=list(df,df,df)
I'm trying to run the pracma package's findpeaks() function to find the local maxima of each row of each data.frame in the list dflist. The output is an N x 4 matrix, where N is the number of peaks found; so if it finds 4 peaks in the first row of the first data.frame, the result is a 4 x 4 matrix. My goal is to loop this function over every row of each data.frame and store each output matrix in a list.
My code:
## Find Peaks
pks = list()
for (i in 1:length(dflist)) {
  for (j in 1:length(dflist[[i]])) {
    row = dflist[[i]][j, ]
    temppks = findpeaks(as.vector(row, mode = 'numeric'),
                        minpeakheight = 1.1, nups = 2)
    pks[i][[j]] = rbind(pks, temppks)
  }
}
This doesn't seem to be doing quite what I want it to. Any ideas?
A combination of apply() and sapply() could do the job:
my.f.row <- function(row) findpeaks(as.vector(row,mode='numeric'), minpeakheight = 1.1, nups=2)
sapply(dflist, function(df.i) apply(df.i, 1, my.f.row))
Eventually you may have to reorganize the result, depending on the shape you need.
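If you would rather keep the nesting from the question, one list per data.frame with one peak matrix per row, a nested lapply() avoids the simplification that sapply() may apply; a sketch under that assumption:
library(pracma)
# one matrix (or NULL, when no peaks are found) per row, grouped per data.frame
pks <- lapply(dflist, function(df.i) {
  lapply(seq_len(nrow(df.i)), function(j) {
    findpeaks(unlist(df.i[j, ]), minpeakheight = 1.1, nups = 2)
  })
})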
