Outlier removal (outlierMAD command in pracma package) - r

I would like to remove outliers from each column of my dataset. From searching the web, it seems the Hampel identifier should be a good solution to my problem. It flags as outliers any points that fall outside the interval [median – t * MAD, median + t * MAD].
With the instruction provided by: http://exploringdatablog.blogspot.com/2012/01/moving-window-filters-and-pracma.html I would like to use the "outlierMAD" command to fulfill my task:
outlierMAD <- function (x, k){
  n <- length(x)
  y <- x
  ind <- c()
  L <- 1.4826
  t0 <- 3
  for (i in (k + 1):(n - k)) {
    x0 <- median(x[(i - k):(i + k)])
    S0 <- L * median(abs(x[(i - k):(i + k)] - x0))
    if (abs(x[i] - x0) > t0 * S0) {
      y[i] <- x0
      ind <- c(ind, i)
    }
  }
  list(y = y, ind = ind)
}
But I got an error when I try:
Error in if (abs(x[i] - x0) > t0 * S0) { :
missing value where TRUE/FALSE needed
Can anyone help me on this? How can I avoid NAs in my data?
A link of my sample data can be found at:
https://drive.google.com/file/d/0B86_a8ltyoL3NHNaeWk3d1QyQms/view?usp=sharing

The expression abs(x[i] - x0) > t0 * S0 must evaluate to TRUE or FALSE. The error occurs if any of the variables involved is NA or is treated as non-numeric.
When NAs are present, median returns NA instead of the median value if na.rm=FALSE. Setting na.rm=TRUE in median fixes x0 and S0, but it does not help with the logical expression above if x[i] is itself NA.
Check your data for NA or other non-numeric values. Hope this helps.
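A minimal sketch of that advice, assuming it is acceptable to leave NA entries untouched and to ignore NAs inside each window. The name outlierMAD_na and the t0 argument are illustrative additions, not part of pracma:

```r
# Hypothetical NA-tolerant variant of the outlierMAD function from the question.
outlierMAD_na <- function(x, k, t0 = 3) {
  n <- length(x)
  stopifnot(n >= 2 * k + 1)          # the window must fit inside the series
  y <- x
  ind <- integer(0)
  L <- 1.4826
  for (i in (k + 1):(n - k)) {
    if (is.na(x[i])) next            # cannot test a missing point; leave it as NA
    window <- x[(i - k):(i + k)]
    x0 <- median(window, na.rm = TRUE)
    S0 <- L * median(abs(window - x0), na.rm = TRUE)
    if (abs(x[i] - x0) > t0 * S0) {
      y[i] <- x0                     # replace the outlier by the window median
      ind <- c(ind, i)
    }
  }
  list(y = y, ind = ind)
}

x <- c(1, 2, NA, 3, 100, 2, 1, 2, 3)
res <- outlierMAD_na(x, k = 2)
res$ind                              # positions flagged as outliers
```

For a data frame, something like lapply(df, outlierMAD_na, k = 2) would apply it column by column, assuming every column is numeric.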

I got the same error using the hampel function of the 'pracma' package. The problem was caused by my time series having fewer than 5 observations. Check whether yours is long enough.

You need to reduce the k value, or else add a condition: if nrow(df) > 5, apply hampel; otherwise skip hampel and just take the mean of those 5 or fewer observations.
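The following sketch illustrates why a short series triggers the error and one way to guard against it. The helper filter_if_long_enough is hypothetical, and filter_fun stands for whichever filter is in use (for example the outlierMAD function from the question):

```r
# Why a short series fails: the loop range counts down instead of being empty.
n <- 4; k <- 3
idx <- (k + 1):(n - k)   # 4 3 2 1, so x[(i - k):(i + k)] reaches past the ends of x
idx

# Hypothetical guard: only filter when a full window of 2 * k + 1 points fits,
# otherwise return the series unchanged with no flagged indices.
filter_if_long_enough <- function(x, k, filter_fun) {
  if (length(x) < 2 * k + 1) {
    return(list(y = x, ind = integer(0)))
  }
  filter_fun(x, k)
}
```

Indexing past the end of a vector yields NA, which is how the missing value reaches the if condition even when the data themselves contain no NA.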

Related

How to find the solutions of multiple nonlinear equations in R using the fsolve function

I want to use the "fsolve" function to solve a system of nonlinear equations; the equations and code are below. So far I can only get fsolve to solve a single set of equations. I have three values in each of the coefficient vectors A_coeff and B_coeff, and my idea is that each pair of values should yield one set of solutions, so there should be three sets of solutions in total. How can I achieve this?
A_coeff<-c(177506.9,177639.3,178039.4)
B_coeff<-c(0.0003485474,0.0005155126,0.0004671370)
C_coeff<-5.511464
D_coeff<-23.39138
E_coeff<-5.0866e+17
F_coeff<-0.9732414
library('pracma')
Para_fun <- function(temp1) {
new <- sqrt((4*temp1-1)/3)
return(new)
}
Para_fun2<- function(temp1) {
new2 <- ceiling(temp1/C_coeff)
return(new2)
}
F_try<- function(x){
s_actual <- x[1]
K_actual <- x[2]
n_tube <- x[3]
c( A_coeff/K_actual-s_actual,
(B_coeff+F_coeff/(E_coeff/Para_fun(n_tube)^(2/3))^0.25)^-1-K_actual,
Para_fun2(s_actual)*D_coeff-n_tube)}
x0_xinitial_value<- c(20,2000,20)
X_result<- fsolve(F_try, x0_xinitial_value)
X_result$x
The easiest way to solve your problem is to solve the set of equations for each pair of A_coeff and B_coeff with a loop.
Redefine the function F_try as follows (I have rewritten the code to make it easier to read and less confusing):
F_try <- function(x, k) {
  s_actual <- x[1]
  K_actual <- x[2]
  n_tube <- x[3]
  y <- numeric(length(x))
  y[1] <- A_coeff[k]/K_actual - s_actual
  y[2] <- (B_coeff[k] + F_coeff/(E_coeff/Para_fun(n_tube)^(2/3))^0.25)^-1 - K_actual
  y[3] <- Para_fun2(s_actual)*D_coeff - n_tube
  y
}
The argument k is the index into the coefficient vectors A_coeff and B_coeff.
If you try this like so
X_result <- matrix(NA,nrow=3,ncol=3)
xstart <- x0_xinitial_value
for( k in 1:3){
z <- fsolve(F_try, xstart,k=k)
X_result[k,] <- z$x
}
X_result
you will get an error message
Error in if (norm(s, "F") < tol || norm(as.matrix(ynew), "F") < tol) break :
with message
missing value where TRUE/FALSE needed
Calls: fsolve -> broyden
In addition: Warning message:
In sqrt((4 * temp1 - 1)/3) : NaNs produced
Execution halted
It is not immediately clear what is wrong and why the error occurs.
There is another package nleqslv which gives more insight into what is going wrong.
You can use it like this
library(nleqslv)
X_result <- matrix(NA,nrow=3,ncol=3)
xstart <- x0_xinitial_value
for( k in 1:3){
z <- nleqslv(xstart,F_try,k=k)
X_result[k,] <- z$x
}
X_result
Inspecting X_result shows that the third solution is most likely wrong.
Cutting a long story short it appears that for k=3 and the starting values you provided the algorithms cannot find a solution.
A solution is to make the starting value for each k equal to the solution for the previous k. Like so
X_result <- matrix(NA,nrow=3,ncol=3)
xstart <- x0_xinitial_value
for( k in 1:3){
z <- nleqslv(xstart,F_try,k=k)
X_result[k,] <- z$x
xstart <- z$x
}
X_result
resulting in
[,1] [,2] [,3]
[1,] 72.60480 2444.837 327.4793
[2,] 102.59563 1731.451 444.4362
[3,] 94.16426 1890.732 421.0448
It is advisable to check the exit code of nleqslv for each row of this matrix
to make sure that a solution was found.
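A sketch of such a check, assuming the nleqslv package is installed and relying on the termcd and message components of the list it returns (a termcd of 1 indicates that the function values have converged). The termcd vector and the warning text are illustrative additions:

```r
library(nleqslv)  # assumed to be installed

A_coeff <- c(177506.9, 177639.3, 178039.4)
B_coeff <- c(0.0003485474, 0.0005155126, 0.0004671370)
C_coeff <- 5.511464
D_coeff <- 23.39138
E_coeff <- 5.0866e+17
F_coeff <- 0.9732414

Para_fun  <- function(temp1) sqrt((4 * temp1 - 1) / 3)
Para_fun2 <- function(temp1) ceiling(temp1 / C_coeff)

F_try <- function(x, k) {
  c(A_coeff[k] / x[2] - x[1],
    (B_coeff[k] + F_coeff / (E_coeff / Para_fun(x[3])^(2/3))^0.25)^-1 - x[2],
    Para_fun2(x[1]) * D_coeff - x[3])
}

X_result <- matrix(NA_real_, nrow = 3, ncol = 3)
termcd <- integer(3)               # one exit code per coefficient pair
xstart <- c(20, 2000, 20)
for (k in 1:3) {
  z <- nleqslv(xstart, F_try, k = k)
  X_result[k, ] <- z$x
  termcd[k] <- z$termcd
  if (z$termcd != 1) warning("k = ", k, ": ", z$message)
  xstart <- z$x                    # warm start for the next k
}
cbind(X_result, termcd)
```

Any row whose termcd differs from 1 should be treated with suspicion, and z$message gives the reason the solver stopped.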

Non-conformable arguments in R

For practice, I am rewriting in R an algorithm I originally wrote in C++, the Finite Difference Method. I am pretty new to R, so I don't know all the rules regarding vector/matrix multiplication. For some reason I am getting a non-conformable arguments error when I do this:
ST_u <- matrix(0,M,1)
ST_l <- matrix(0,M,1)
for(i in 1:M){
Z <- matrix(gaussian_box_muller(i),M,1)
ST_u[i] <- (S0 + delta_S)*exp((r - (sigma*sigma)/(2.0))*T + sigma*sqrt(T)%*%Z)
ST_l[i] <- (S0 - delta_S)*exp((r - (sigma*sigma)/(2.0))*T + sigma*sqrt(T)%*%Z)
}
I get this error:
Error in sqrt(T) %*% Z : non-conformable arguments
Here is my whole code:
gaussian_box_muller <- function(n){
theta <- runif(n, 0, 2 * pi)
rsq <- rexp(n, 0.5)
x <- sqrt(rsq) * cos(theta)
return(x)
}
d_j <- function(j, S, K, r, v,T) {
return ((log(S/K) + (r + (-1^(j-1))*0.5*v*v)*T)/(v*(T^0.5)))
}
call_delta <- function(S,K,r,v,T){
return (S * dnorm(d_j(1, S, K, r, v, T))-K*exp(-r*T) * dnorm(d_j(2, S, K, r, v, T)))
}
Finite_Difference <- function(S0,K,r,sigma,T,M,delta_S){
ST_u <- matrix(0,M,1)
ST_l <- matrix(0,M,1)
for(i in 1:M){
Z <- matrix(gaussian_box_muller(i),M,1)
ST_u[i] <- (S0 + delta_S)*exp((r - (sigma*sigma)/(2.0))*T + sigma*sqrt(T)%*%Z)
ST_l[i] <- (S0 - delta_S)*exp((r - (sigma*sigma)/(2.0))*T + sigma*sqrt(T)%*%Z)
}
Delta <- matrix(0,M,1)
totDelta <- 0
for(i in 1:M){
if(ST_u[i] - K > 0 && ST_l[i] - K > 0){
Delta[i] <- ((ST_u[i] - K) - (ST_l[i] - K))/(2*delta_S)
}else{
Delta <- 0
}
totDelta = totDelta + exp(-r*T)*Delta[i]
}
totDelta <- totDelta * 1/M
Var <- 0
for(i in 1:M){
Var = Var + (Delta[i] - totDelta)^2
}
Var = Var*1/M
cat("The Finite Difference Delta is : ", totDelta)
call_Delta_a <- call_delta(S,K,r,sigma,T)
bias <- abs(call_Delta_a - totDelta)
cat("The bias is: ", bias)
cat("The Variance of the Finite Difference method is: ", Var)
MSE <- bias*bias + Var
cat("The marginal squared error is thus: ", MSE)
}
S0 <- 100.0
delta_S <- 0.001
K <- 100.0
r <- 0.05
sigma <- 0.2
T <- 1.0
M <- 10
result1 <- Finite_Difference(S0,K,r,sigma,T,M,delta_S)
I can't seem to figure out the problem, any suggestions would be greatly appreciated.
In R, the %*% operator is reserved for multiplying two conformable matrices. As one special case, you can also use it to multiply a vector by a matrix (or vice versa), if the vector can be treated as a row or column vector that conforms to the matrix; as a second special case, it can be used to multiply two vectors to calculate their inner product.
However, one thing it cannot do is perform scalar multiplication. Scalar multiplication of vectors or matrices always uses the plain * operator. Specifically, in the expression sqrt(T) %*% Z, the first term sqrt(T) is a scalar, and the second, Z, is a matrix. If what you intend to do here is multiply the matrix Z by the scalar sqrt(T), then this should just be written sqrt(T) * Z.
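A small illustration of the difference, using made-up values (the names Z, s, scaled and res here are only for this example):

```r
Z <- matrix(c(1, 2, 3), nrow = 3, ncol = 1)   # 3 x 1 column matrix
s <- sqrt(4)                                  # a plain scalar (length-1 vector)

scaled <- s * Z                               # elementwise scaling keeps the 3 x 1 shape
dim(scaled)

# The same product written with %*% is rejected, as in the question.
res <- try(s %*% Z, silent = TRUE)
inherits(res, "try-error")
```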
When I made this change, your program still didn't work because of another bug -- S is used but never defined -- but I don't understand your algorithm well enough to attempt a fix.
A few other comments on the program not directly related to your original question:
The first loop in Finite_Difference looks suspicious: gaussian_box_muller(i) generates a vector of length i as i varies in the loop from 1 up to M, and forcing these vectors into a column matrix of length M to generate Z is probably not doing what you want. It will "reuse" the values in a cycle to populate the matrix. Try these to see what I mean:
matrix(gaussian_box_muller(1),10,1) # all one value
matrix(gaussian_box_muller(3),10,1) # cycle of three values
You also use loops in many places where R's vector operations would be easier to read and (typically) faster to execute. For example, your definition of Var is equivalent to:
Var <- sum((Delta - totDelta)^2)/M
and the definitions of Delta and totDelta could also be written in this simplified fashion.
I'd suggest Googling for "vector and matrix operations in r" or something similar and reading some tutorials. Vector arithmetic in particular is idiomatic R, and you'll want to learn it early and use it often.
You might find it helpful to consider the rnorm function to generate random Gaussians.
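As a rough sketch of what a vectorized version might look like, here is one reading of the loops in Finite_Difference. It assumes one normal draw per path, assumes the else branch was meant to set Delta[i] to 0, and renames the maturity T to mat; none of this is confirmed by the original post:

```r
set.seed(42)                 # arbitrary seed, for reproducibility only
S0 <- 100; K <- 100; r <- 0.05; sigma <- 0.2
mat <- 1.0                   # maturity, called T in the question
M <- 10; delta_S <- 0.001

Z <- rnorm(M)                # one standard normal draw per path
growth <- exp((r - sigma^2 / 2) * mat + sigma * sqrt(mat) * Z)
ST_u <- (S0 + delta_S) * growth
ST_l <- (S0 - delta_S) * growth

# Vectorized replacements for the Delta, totDelta and Var loops.
in_money <- ST_u > K & ST_l > K
Delta <- ifelse(in_money, (ST_u - ST_l) / (2 * delta_S), 0)
totDelta <- mean(exp(-r * mat) * Delta)
Var <- sum((Delta - totDelta)^2) / M
```

Each line operates on all M paths at once, so no explicit index i is needed.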
Happy R-ing!

R, coding a discontinuous/interval function within a function

I'm new to R, and I'm trying to code a function that only counts values in a certain interval, so I have decided to set k=1 if the value lies in [lower, upper] and 0 if it lies elsewhere (where lower and upper have been defined earlier in the function). However, when I try to call the function, it always comes back with this:
myfun(10,0.5,0.05)
#Error in k[i] <- function(p) ifelse(p >= lower & p <= upper, 1, 0) :
# incompatible types (from closure to double) in subassignment type fix
I don't really know what this means, I've tried finding an answer, but most pages just say how to fix their particular problem rather than saying what it actually means. Maybe I haven't been looking hard enough, and I apologise if I haven't, but any help would be greatly appreciated. Here is the full function, if it would help:
myfun <- function(a, q, m) {
k <- rep(0,a+1)
bin.prob <- rep(0,a+1)
for (i in 1:(a+1)) {
x <- i-1
qhat <- x/a
z <- qnorm(1-m/2)
upper <- qhat+(z*sqrt(qhat*(1-qhat)*(a^-1)))
lower <- qhat-(z*sqrt(qhat*(1-qhat)*(a^-1)))
k[i] <- function(q) ifelse(q>=lower & q<=upper, 1, 0)
bin.prob[i] <- dbinom(x,a,q)
}
C <- sum(k*bin.prob)
return(C)
}
myfun(10,0.5,0.05)
#Error in k[i] <- function(q) ifelse(q >= lower & q <= upper, 1, 0) :
# incompatible types (from closure to double) in subassignment type fix
NEW PROBLEM
Hey, I'm encountering a new problem when trying to adapt this function to an adjusted data set, i.e. a becomes a+4 and x becomes x+2:
> myfun2 <- function(a,q,m) {
+ fn <- function(a) a+4
+ abar <- fn(a)
+ kadj <- rep(0,abar+1)
+ bin.prob.adj <- rep(0,abar+1)
+ for (j in 1:(abar+1)) {
+ x <- j-1
+ fx <- function(x) x+2
+ xbar <- fx(x)
+ qhatadj <- xbar/abar
+ z <- (1-(m/2))
+ upperadj <- qhatadj+(z*sqrt(qhatadj*(1-qhatadj)*(abar^-1)))
+ loweradj <- qhatadj-(z*sqrt(qhatadj*(1-qhatadj)*(abar^-1)))
+ kadj[j] <- q>=loweradj & q<=upperadj
+ bin.prob.adj[j] <- dbinom(xbar,abar,q)
+ }
+ D <- sum(kadj*bin.prob.adj)
+ return(D)
+ }
> myfun2(10,0.5,0.05)
[1] NA
Warning messages:
1: In sqrt(qhatadj * (1 - qhatadj) * (abar^-1)) : NaNs produced
2: In sqrt(qhatadj * (1 - qhatadj) * (abar^-1)) : NaNs produced
3: In sqrt(qhatadj * (1 - qhatadj) * (abar^-1)) : NaNs produced
4: In sqrt(qhatadj * (1 - qhatadj) * (abar^-1)) : NaNs produced
I've been trying to find out why this has arisen, and have found that the NaNs warning could mean there is a square root of a negative number. However, I can't see why that would have happened. It may be bad coding on my part, or it could be something else entirely (I'm new to R). Thanks for any help.
As the error message suggests, the problem starts at the line:
k[i] <- function(q) ifelse(q >= lower & q <= upper, 1, 0)
In the line above you are assigning the function function(q) ifelse(q >= lower & q <= upper, 1, 0) itself to an element of the vector k, when you really want to assign the result of evaluating this function on the scalar q given as an argument to the original function. A function (a closure, in R's terminology) cannot be stored in a numeric (double) vector, which is what "incompatible types (from closure to double)" means, so R stops at this assignment. Note also that a closure must be explicitly called with an argument in order for it to evaluate. Even if the assignment were allowed, the line:
C <- sum(k * bin.prob)
would then try to multiply functions by bin.prob, when what you want is to multiply the result of evaluating function(q) for the scalar q defined in the arguments to the original function. In this case, there appears to be no need for you to define function(q) at all. The assignment can be replaced with:
k[i] <- ifelse(q >= lower & q <= upper, 1, 0)
Since R coerces logical vectors to numeric vectors where necessary, treating TRUE as 1 and FALSE as 0, the above assignment can be expressed more succinctly as:
k[i] <- q >= lower & q <= upper
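Putting this together, a vectorized sketch of the whole function might look as follows. The name myfun_fixed and the intermediate half_width are illustrative, and the loop is dropped only because every step works on the whole vector 0:a at once:

```r
# Storing a function in a numeric vector fails at the assignment itself.
k <- rep(0, 2)
err <- try(k[1] <- function(q) q, silent = TRUE)
inherits(err, "try-error")

# Hypothetical vectorized rewrite of myfun from the question.
myfun_fixed <- function(a, q, m) {
  x <- 0:a
  qhat <- x / a
  z <- qnorm(1 - m / 2)
  half_width <- z * sqrt(qhat * (1 - qhat) / a)
  inside <- q >= qhat - half_width & q <= qhat + half_width  # logical vector
  sum(inside * dbinom(x, a, q))                              # TRUE counts as 1
}

myfun_fixed(10, 0.5, 0.05)
```

Separately, the NaN warnings in the follow-up myfun2 look consistent with xbar/abar exceeding 1 for the last two values of j (xbar reaches 16 while abar is 14), which makes qhatadj*(1-qhatadj) negative inside sqrt.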

Error in R: nonconformable arguments. Not true?

This is my code:
#define likelihood function (including an intercept/constant in the function.)
lltobit <- function(b,x,y) {
sigma <- b[3]
y <- as.matrix(y)
x <- as.matrix(x)
vecones <- rep(1,nrow(x))
x <- cbind(vecones,x)
bx <- x %*% b[1:2]
d <- y != 0
llik <- sum(d * ((-1/2)*(log(2*pi) + log(sigma^2) + ((y - bx)/sigma)^2))
+ (1-d) * (log(1 - pnorm(bx/sigma))))
return(-llik)
}
n <- nrow(censored) #define number of variables
y <- censored$y #define y and x for easier use
x1 <- as.matrix(censored$x)
x <- cbind(rep(1,n),x1) #include constant/intercept
bols <- (solve(t(x) %*% x)) %*% (t(x) %*% y) #compute ols estimator (XX) -1 XY
init <- rbind(as.matrix(bols[1:nrow(bols)]),1) #initial values
init
tobit1 <- optim(init, lltobit, x=x, y=y, hessian=TRUE, method="BFGS")
where censored is my data table, including 200 (censored) values of y and 200 values of x.
Everything works, but when running the optim command, I get the following error:
tobit1 <- optim(init, lltobit, x=x, y=y, hessian=TRUE, method="BFGS")
Error in x %*% b[1:2] : non-conformable arguments
I know what it means, but since x is a 200 by 2 matrix, and b[1:2] a vector of 2 by 1, what goes wrong? I tried transposing both, and also the initial values vector, but nothing works. Can anyone help me?
I stumbled upon a similar problem today (a "non-conformable arguments" error, even though everything seemed OK), and the solution in my case lay in the basic rule of matrix multiplication: the number of columns of the left matrix must equal the number of rows of the right matrix. I had to switch the order of the operands in the multiplication.
In other words, in matrix multiplication (unlike ordinary multiplication), A %*% B is not the same as B %*% A.
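The rule can be checked directly with dim before multiplying. The sketch below uses small made-up matrices, and its second half shows one way the dimensions can drift without being noticed: binding an intercept column onto a matrix that already has one, which is what appears to happen in the question, where x is given a column of ones both outside and inside lltobit:

```r
C <- matrix(1:6, nrow = 2)      # 2 x 3
D <- matrix(1:4, nrow = 2)      # 2 x 2

dim(D %*% C)                    # ncol(D) == nrow(C), so this is fine
wrong_order <- try(C %*% D, silent = TRUE)   # ncol(C) is 3 but nrow(D) is 2

# Adding the intercept twice turns an n x 2 design matrix into n x 3.
x1 <- matrix(c(0.1, 0.4, 0.9, 1.6, 2.5), ncol = 1)
x <- cbind(1, x1)               # 5 x 2, intercept already included
x_again <- cbind(1, x)          # 5 x 3 after a second intercept
b <- c(0.5, 2)                  # two coefficients

ok  <- x %*% b                  # conformable: 5 x 2 times length-2 vector
bad <- try(x_again %*% b, silent = TRUE)
inherits(bad, "try-error")
```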
Here is one case from Principal Component Regression (PCR) in R. I ran into this problem today when trying to predict on test data with a fitted model; it returned an error:
> pcr.pred = predict(pcr.fit, test.data, ncomp=6)
Error in newX %*% B[-1, , i] : non-conformable arguments
In addition: Warning message:
The problem was that the test data had a factor level that was not present in the training data. To find which column has the problem:
cols = colnames(train)
for (col in cols){
  if(class(ori.train[[col]]) == 'factor'){
    print(col)
    print(summary(train[[col]]))
    print(summary(test[[col]]))
  }
}
Once you know which column has the new level, you can replace that level with a value that also occurs in the training data, save the data with write.csv, reload it, and rerun the PCR prediction.

efficient way of calculating lots of matrices

I'm trying to write a program that does the following:
Given two intervals A and B, for every (a,b) with a in A and b in B
create a variance matrix ymat, depending on (a,b)
calculate the (multivariate normal) density of some vector y
with mean 0 and variance matrix ymat
I learned that using loops is bad in R, so I wanted to use outer(). Here are my two functions:
y_mat <- function(n,lambda,theta,sigma) {
L <- diag(n);
L[row(L) == col(L) + 1] <- -1;
K <- t(1/n * L - theta*diag(n))%*%(1/n * L - theta*diag(n));
return(sigma^2*diag(n) + 1/lambda*K);
}
make_plot <- function(y,sigma,theta,lambda) {
n <- length(y)
sig_intv <- seq(.1,2*sigma,.01);
th_intv <- seq(-abs(2*theta),abs(2*theta),.01);
z <- outer(sig_intv,th_intv,function(s,t){dmvnorm(y,rep(0,n),y_mat(n,lambda,theta=t,sigma=s))})
contour(sig_intv,th_intv,z);
}
The shape of the variance matrix isn't relevant for this question. n and lambda are just two scalars, as are sigma and theta.
When I try
make_plot(y,.5,-3,10)
I get the following error message:
Error in t(1/n * L - theta * diag(n)) :
dims [product 25] do not match the length of object [109291]
In addition: Warning message:
In theta * diag(n) :
longer object length is not a multiple of shorter object length
Could someone enlighten me as to what's going wrong? Am I maybe going about this the wrong way?
The third argument of outer should be a vectorized function. Wrapping it with Vectorize should suffice:
make_plot <- function(y, sigma, theta, lambda) {
  n <- length(y)
  sig_intv <- seq(.1, 2*sigma, .01)
  th_intv <- seq(-abs(2*theta), abs(2*theta), .01)
  z <- outer(
    sig_intv, th_intv,
    Vectorize(function(s,t){dmvnorm(y,rep(0,n),y_mat(n,lambda,theta=t,sigma=s))})
  )
  contour(sig_intv, th_intv, z)
}
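To see why the wrapper matters, here is a base-R sketch with a stand-in function f that, like y_mat, builds a matrix from two scalars. The function f and the small grids are made up for illustration. outer calls its function once with two long vectors rather than once per pair, which is what breaks the matrix construction:

```r
# Stand-in for a scalar-in, scalar-out function that builds a matrix internally.
f <- function(s, t) det(s^2 * diag(3) + t * matrix(1, 3, 3))

sig <- c(0.5, 1)
th  <- c(0, 0.1, 0.2)

# Without Vectorize, f receives vectors of length 6 and outer fails.
bad <- suppressWarnings(try(outer(sig, th, f), silent = TRUE))
inherits(bad, "try-error")

# With Vectorize, f is called once for each (s, t) pair.
z <- outer(sig, th, Vectorize(f))
dim(z)
```

Vectorize still loops internally (via mapply), so this fixes correctness rather than speed.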
