R: Find outliers with mvBACON - r

I'm new to R and working on an assignment where I am supposed to replicate the results of a linear regression (time series data with 1360 observations and 52 variables, 11 of which are in the regression model). In the original study the researchers identified outliers with the Hadi method. It seems that this is best done in R with the mvBACON function; is this correct? I cannot find a good explanation of how to use it, though. Could anyone please tell me how to use this function to find the outliers?
(I would very much appreciate an answer that is explained as simply as possible since R is very new to me).
Thank you very much!

Yes, mvBACON is for outlier identification based on a distance measure. By default it uses the Mahalanobis distance.
The following code walks you through a simple example of identifying outliers with mvBACON on a subset of the mtcars dataset:
# load packages
library(dplyr)
library(magrittr)
# Use mtcars (sub)dataset and plot it
data <- mtcars %>% select(mpg, disp)
plot(data, main = "mtcars")
# Add some outliers and plot again
data <- rbind(data,
data.frame(mpg = c(1, 80), disp = c(800, 1000)))
plot(data, main = "mtcars")
# Use mvBACON to calculate the distances and get the outliers
# install.packages("robustX") # uncomment this line to install the package
library(robustX)
# compute distances - the default is Mahalanobis
distances <- mvBACON(data)
# Plot it again...
plot(data, main = "mtcars")
# ...with highlighting the outliers
points(data[!distances$subset, ], col = "red", pch = 19)
# Some fine-tuning, since many of the flagged points still seem fine for regression
distances <- mvBACON(data, alpha = 0.6)
# update plot
plot(data, main = "mtcars")
points(data[!distances$subset, ], col = "red", pch = 19)
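To make the idea of a distance-based outlier rule concrete, here is a minimal base R sketch of the classical (non-robust) Mahalanobis check on the same toy data. This is not the BACON algorithm itself, which refines the set of "clean" points iteratively; the 97.5% chi-squared cutoff is an assumption chosen for illustration.

```r
# Same toy data as above: two mtcars columns plus two planted outliers
data <- rbind(mtcars[, c("mpg", "disp")],
              data.frame(mpg = c(1, 80), disp = c(800, 1000)))

# Classical squared Mahalanobis distance of each row from the column means
d2 <- mahalanobis(data, center = colMeans(data), cov = cov(data))

# Flag rows beyond a chi-squared quantile (assumed cutoff, not mvBACON's rule)
cutoff  <- qchisq(0.975, df = ncol(data))
flagged <- d2 > cutoff

# Inspect the flagged rows
data[flagged, ]
```

Because the mean and covariance here are computed from all rows, extreme points can distort them and hide other outliers, which is the problem robust methods such as BACON are meant to address.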

Related

Perona-Malik model in R for smoothing time series of data

Recently I used the Savitzky-Golay filter from the signal package to smooth my data, but it does not work well. I have heard that the Perona-Malik model is a good smoothing method for this task, but I could not implement it. My question: is it possible to smooth the data with the Perona-Malik model in R?
Thanks
hees
Simple example.
library(signal)
bf <- butter(5,1/3)
x <- c(rep(0,15), rep(10, 10), rep(0, 15))
###
sg <- sgolayfilt(x) # replace at here
plot(sg, type="l")
lines(filtfilt(rep(1, 5)/5,1,x), col = "red") # averaging filter
lines(filtfilt(bf,x), col = "blue") # butterworth
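For reference, a minimal sketch of what a 1-D Perona-Malik (anisotropic diffusion) smoother could look like in base R. The function name perona_malik_1d and its arguments n_iter, kappa and lambda are illustrative and do not come from any package; the exponential edge-stopping function and the explicit update scheme are one common choice among several.

```r
# Explicit 1-D Perona-Malik diffusion (illustrative, not from a package)
perona_malik_1d <- function(x, n_iter = 50, kappa = 2, lambda = 0.25) {
  stopifnot(is.numeric(x), length(x) >= 2, lambda > 0, lambda <= 0.5)
  g <- function(d) exp(-(d / kappa)^2)   # edge-stopping function: ~0 at big jumps
  u <- as.numeric(x)
  for (k in seq_len(n_iter)) {
    d_east <- c(diff(u), 0)    # u[i+1] - u[i], zero flux at the right end
    d_west <- c(0, -diff(u))   # u[i-1] - u[i], zero flux at the left end
    u <- u + lambda * (g(d_east) * d_east + g(d_west) * d_west)
  }
  u
}

# Step signal from the example above, with some added noise
x <- c(rep(0, 15), rep(10, 10), rep(0, 15))
set.seed(1)
noisy <- x + rnorm(length(x), sd = 0.5)
pm <- perona_malik_1d(noisy)

plot(noisy, type = "p")
lines(pm, col = "darkgreen")   # noise is smoothed, the two steps stay sharp
```

Differences much smaller than kappa are diffused away, while jumps much larger than kappa are left almost untouched, so kappa needs tuning to the noise level of the data.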

R - Meta-Analysis - Plotting forest plot from multi-level random-effects model with subgroups

I am having trouble with plotting a forest plot based on a multi-level model, in which I'd also like to display pooled effects of subgroups, as well as the results for subgroup differences.
So far, I have managed to produce a plot of the data where clusters are grouped together. I would like to extend this plot by adding pooled effects of subgroups at the right positions, without losing the grouping of the clusters. (As it is explained here, but also while keeping what is shown in the last example of this).
This is the code I have used so far to produce the "normal" forest plot for my model (sorry, it's pretty long):
# ma_data => my data
# main_3L => my multi-level model
# Prepare row argument for separation by study
dd <- c(0, diff(ma_data$ID))
dd[dd > 0] <- 1
rows <- (1:main_3L$k) + cumsum(dd)
par(tck=-.01, mgp = c(1.6,.2,0), cex=1)
# refactor ID var
ma_data$ID_plot <- substr(ma_data$short_cite, 1, nchar(ma_data$short_cite))
ma_data$ID_plot <- paste(sub(" ||) ","",substr(ma_data$ID_plot,0,2)), substr(ma_data$ID_plot,3,nchar(ma_data$ID_plot)), sep="")
tiff("./figures/forestFull_ext1.tiff", width=3200,height=4500, res=300)
# Plot the forest!
metafor::forest(main_3L,
addpred = TRUE, # adds prediction interval
cex=0.5,
header="Author(s) and Year",
rows=rows, # uses the vector created above
order=order(ma_data$ID, ma_data$es_adj),
ylim=c(0.5,max(rows)+3),
xlim=c(-5,3),
xlab="Hedges' G",
ilab=cbind(as.character(ma_data$setup),as.character(ma_data$target_1), as.character(ma_data$measure_type), ma_data$task, as.character(ma_data$cogdom_pooled), ma_data$sample_size_exp),
ilab.xpos=c(-3.9,-3.6,-3.3,-2.8,-2.2,-1.7),
slab=ma_data$ID_plot,
mlab = mlabfun("Overall RE Modell", main_3L, main_3L.I2)) # Adds Q,Qp, I² and sigma² values.
abline(h = rows[c(1,diff(rows)) == 2] - 1, lty="dotted")
# adds a second polygon with robust estimates for standard error
addpoly(coeftest.main_3L$beta, sei = coeftest.main_3L$SE,
rows = -2.5,
cex = 0.5,
mlab = "Robust RE Model estimate",
col = "darkred")
par(cex=0.5, font=2)
# text(c(-4,-3.7,-3.2,-2.5, -2), 150.5, pos=3, c("Target", "Measure","Task","Cognitive Domain", "N"))
text(c(-3.9,-3.6,-3.3,-2.8,-2.2,-1.7), 150.5, pos=3, c("Setup", "Target", "Measure","Task","Cognitive Domain", "N"))
dev.off()
Specifically, I need to know how to "make space" for the additional rows and polygons.
Also, is there an option in the forest() function to display only the pooled effects of subgroups and the main effect, but not the individual effect sizes? I know that it is possible in the meta package, but have not found anything similar in metafor.
Any help is greatly appreciated!
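Regarding "making space": the rows vector decides where each estimate is drawn, so leaving larger gaps in it leaves free rows. Below is a minimal base R sketch of that idea. The id and subgroup vectors and the gap sizes are hypothetical, and the data are assumed to be sorted by subgroup and then by study.

```r
# Hypothetical study IDs and subgroup labels, sorted by subgroup then ID
id       <- c(1, 1, 2, 3, 3, 4)
subgroup <- c("A", "A", "A", "B", "B", "B")

# 1 where a new study starts, 0 otherwise
new_study <- c(0, diff(id) != 0)
# 1 where a new subgroup starts, 0 otherwise
new_subgroup <- c(0, subgroup[-1] != subgroup[-length(subgroup)])

gap_study    <- 1   # one empty row between studies
gap_subgroup <- 2   # two further empty rows between subgroups

rows <- seq_along(id) +
  cumsum(new_study * gap_study + new_subgroup * gap_subgroup)
rows
# 1 2 4 8 9 11
```

Rows 5 to 7 stay empty here, so a subgroup summary could in principle be drawn there, for example by passing one of those positions as the rows argument of addpoly(). The ylim of the forest plot would need to grow accordingly.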

Plotting quantile regression by variables in a single page

I am running quantile regressions for several independent variables separately (same dependent). I want to plot only the slope estimates over several quantiles of each variable in a single plot.
Here's some toy data:
set.seed(1988)
y <- rnorm(50, 5, 3)
x1 <- rnorm(50, 3, 1)
x2 <- rnorm(50, 1, 0.5)
# Running Quantile Regression
require(quantreg)
fit1 <- summary(rq(y~x1, tau=1:9/10), se="boot")
fit2 <- summary(rq(y~x2, tau=1:9/10), se="boot")
I want to plot only the slope estimates over quantiles. Hence, I am giving parm=2 in plot.
plot(fit1, parm=2)
plot(fit2, parm=2)
Now, I want to combine both these plots in a single page.
What I have tried so far:
I tried setting par(mfrow=c(2,2)) and plotting them, but it produces a blank page.
I tried gridExtra and gridGraphics without success, and tried converting the base graphs into grob objects as stated here.
I tried the layout function as in this document.
I am trying to read the source code of plot.rqs, but I cannot work out how it plots the confidence bands (I can only plot the coefficients over quantiles) or how to change the mfrow parameter there.
Can anybody point out where I am going wrong? Should I look into the source code of plot.rqs and change any parameters there?
While quantreg::plot.summary.rqs has an mfrow parameter, it uses it to override par('mfrow') so as to facet over parm values, which is not what you want to do.
One alternative is to parse the objects and plot manually. You can pull the tau values and coefficient matrix out of fit1 and fit2, which are just lists of values for each tau, so in tidyverse grammar,
library(tidyverse)
c(fit1, fit2) %>% # concatenate lists, flattening to one level
# iterate over list and rbind to data.frame
map_dfr(~cbind(tau = .x[['tau']], # from each list element, cbind the tau...
coef(.x) %>% # ...and the coefficient matrix,
data.frame(check.names = TRUE) %>% # cleaned a little
rownames_to_column('term'))) %>%
filter(term != '(Intercept)') %>% # drop intercept rows
# initialize plot and map variables to aesthetics (positions)
ggplot(aes(x = tau, y = Value,
ymin = Value - Std..Error,
ymax = Value + Std..Error)) +
geom_ribbon(alpha = 0.5) +
geom_line(color = 'blue') +
facet_wrap(~term, nrow = 2) # make a plot for each value of `term`
Pull more out of the objects if you like, add the horizontal lines of the original, and otherwise go wild.
Another option is to use magick to capture the original images (or save them with any device and reread them) and manually combine them:
library(magick)
plots <- image_graph(height = 300) # graphics device to capture plots in image stack
plot(fit1, parm = 2)
plot(fit2, parm = 2)
dev.off()
im1 <- image_append(plots, stack = TRUE) # attach images in stack top to bottom
image_write(im1, 'rq.png')
The plot function used by the quantreg package has its own mfrow parameter. If you do not specify it, it enforces a layout that it chooses on its own (and thus overrides your par(mfrow = c(2,2))).
Using the mfrow parameter within plot.rqs:
# make one plot, change the layout
plot(fit1, parm = 2, mfrow = c(2,1))
# add a new plot
par(new = TRUE)
# create a second plot
plot(fit2, parm = 2, mfrow = c(2,1))
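A third route is to skip plot.summary.rqs and draw the panels with base graphics, so that par(mfrow) behaves as usual. The sketch below assumes that each element of the summary object carries a coefficients matrix with "Value" and "Std. Error" columns (as in the tidyverse answer above). The bands are plus or minus one bootstrap standard error, not the intervals quantreg draws by default.

```r
library(quantreg)

# Toy data as in the question, collected in a data frame
set.seed(1988)
d <- data.frame(y = rnorm(50, 5, 3), x1 = rnorm(50, 3, 1), x2 = rnorm(50, 1, 0.5))
taus <- 1:9 / 10

fits <- list(
  x1 = summary(rq(y ~ x1, tau = taus, data = d), se = "boot"),
  x2 = summary(rq(y ~ x2, tau = taus, data = d), se = "boot")
)

op <- par(mfrow = c(1, 2))   # two panels side by side
for (nm in names(fits)) {
  # Row 2 of each coefficient matrix is the slope
  est <- sapply(unclass(fits[[nm]]), function(s) s$coefficients[2, "Value"])
  se  <- sapply(unclass(fits[[nm]]), function(s) s$coefficients[2, "Std. Error"])
  plot(taus, est, type = "b", ylim = range(est - se, est + se),
       xlab = "tau", ylab = "slope", main = nm)
  lines(taus, est - se, lty = 2)   # lower band
  lines(taus, est + se, lty = 2)   # upper band
  abline(h = 0, col = "grey")
}
par(op)                      # restore the previous layout
```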

R corrplot - Getting small squeezed plot

I am trying to make a correlation plot from the correlation matrix using corrplot function.
But I am getting a squeezed and unreadable plot. Also, the plot is generated at the extreme right end of the window. The usual ways of expanding a ggplot plot do not work here.
> col <- colorRampPalette(c("#BB4444", "#EE9988", "#FFFFFF", "#77AADD", "#4477AA"))
> corrplot(correlation_matrix, method="color", col=col(200),
type="upper", order="hclust",
addCoef.col = "black", # Add coefficient of correlation
tl.col="black", tl.srt=45, #Text label color and rotation
# hide correlation coefficient on the principal diagonal
diag=FALSE
)
Here is the plot generated
As somebody suggested above, you should either repair the names you have or change the parameters of your plot. I will use ggcorrplot instead because I find it easier to work with (and better looking), but the illustration will show the same problem. If I switch out the names for the airquality data to be hideous like so and plot it:
#### Libraries ####
library(tidyverse)
library(ggcorrplot)
#### Change Data Names ####
bad.names <- airquality %>%
rename(Approximate_Ozone_Measurement_in_Some_Measure = Ozone,
Solar_Radiation_Based_On_Sun_Movements = Solar.R,
Wind_Barometer_Ratings_And_Such = Wind,
Temperature_In_Fahrenheit_To_Nearest_Degree = Temp)
#### Run Correlation ####
# base R cor(); "complete.obs" drops rows with missing values
bad.corr <- bad.names %>%
cor(use = "complete.obs")
#### Plot ####
ggcorrplot(bad.corr)
You get something like this:
There are two ways around this: either rename your variables or rotate the names in some way to fix the angle so it's readable. With ridiculous names like these, it's much easier to simply fix them than to squeeze them in artificially:
#### Fix Names ####
good.names <- bad.names %>%
rename(Ozone= Approximate_Ozone_Measurement_in_Some_Measure,
Solar.R = Solar_Radiation_Based_On_Sun_Movements,
Wind = Wind_Barometer_Ratings_And_Such,
Temp = Temperature_In_Fahrenheit_To_Nearest_Degree)
#### Run Correlation ####
# base R cor(); "complete.obs" drops rows with missing values
good.corr <- good.names %>%
cor(use = "complete.obs")
#### Replot ####
ggcorrplot(good.corr,
lab = TRUE,
type = "lower")
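If renaming by hand is tedious, base R's abbreviate() can shorten the names programmatically before the correlation matrix is computed. A minimal sketch, with the long names taken from the example above and minlength = 8 chosen arbitrarily:

```r
# First four airquality columns with deliberately long names
long <- airquality[, 1:4]
names(long) <- c("Approximate_Ozone_Measurement_in_Some_Measure",
                 "Solar_Radiation_Based_On_Sun_Movements",
                 "Wind_Barometer_Ratings_And_Such",
                 "Temperature_In_Fahrenheit_To_Nearest_Degree")

# Shorten every name; abbreviate() keeps the results unique
short <- long
names(short) <- abbreviate(names(long), minlength = 8)

# Correlation matrix now carries the short labels
corr <- cor(short, use = "complete.obs")
colnames(corr)
```

The resulting matrix can be passed to corrplot() or ggcorrplot() as before. Abbreviated names are not always readable, so a hand-written lookup of short labels may still be preferable for a final figure.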

Visualize data using histogram in R

I am trying to visualize some data and in order to do it I am using R's hist.
Below are my data
jancoefabs <- as.numeric(as.vector(abs(Janmodelnorm$coef)))
jancoefabs
[1] 1.165610e+00 1.277929e-01 4.349831e-01 3.602961e-01 7.189458e+00
[6] 1.856908e-04 1.352052e-05 4.811291e-05 1.055744e-02 2.756525e-04
[11] 2.202706e-01 4.199914e-02 4.684091e-02 8.634340e-01 2.479175e-02
[16] 2.409628e-01 5.459076e-03 9.892580e-03 5.378456e-02
Now, as the more cunning of you might have guessed, these are the absolute values of some model's coefficients.
What I need is a histogram that will have for axes:
x will be the number (count or length) of coefficients which is 19 in total, along with their names.
y will show values of each column (as breaks?) having a ylim="" set, according to min and max of those values (or something similar).
Note that Janmodelnorm$coef simply produces the following
(Intercept) LON LAT ME RAT
1.165610e+00 -1.277929e-01 -4.349831e-01 -3.602961e-01 -7.189458e+00
DS DSA DSI DRNS DREW
-1.856908e-04 1.352052e-05 4.811291e-05 -1.055744e-02 -2.756525e-04
ASPNS ASPEW SI CUR W_180_270
-2.202706e-01 -4.199914e-02 4.684091e-02 -8.634340e-01 -2.479175e-02
W_0_360 W_90_180 W_0_180 NDVI
2.409628e-01 5.459076e-03 -9.892580e-03 -5.378456e-02
So far, consulting ?hist, I have been playing with the code below without success. Therefore I am starting again from scratch.
# hist(jancoefabs, col="lightblue", border="pink",
# breaks=8,
# xlim=c(0,10), ylim=c(20,-20), plot=TRUE)
When plot=FALSE is set, I get a bunch of somewhat useful info about the set. I also find it hard to use the breaks argument efficiently.
Any suggestion will be appreciated. Thanks.
Rather than using hist, why not use a barplot or a standard plot? For example,
## Generate some data
set.seed(1)
y = rnorm(19, sd=5)
names(y) = c("Inter", LETTERS[1:18])
Then plot the coefficients
barplot(y)
Alternatively, you could use a scatter plot
plot(1:19, y, axes=FALSE, ylim=c(-10, 10))
axis(2)
axis(1, 1:19, names(y))
and add error bars to indicate the standard errors (see for example Add error bars to show standard deviation on a plot in R)
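A minimal base R sketch of such error bars, using a regression on the built-in mtcars data as a stand-in for the asker's model. The bars span plus or minus one standard error, which is an arbitrary choice for illustration.

```r
# Stand-in model; replace with your own fitted lm object
fit <- lm(mpg ~ ., data = mtcars)

# Coefficient table: columns include "Estimate" and "Std. Error"
tab <- summary(fit)$coefficients
est <- tab[, "Estimate"]
se  <- tab[, "Std. Error"]
idx <- seq_along(est)

# Points for the estimates, one x position per coefficient
plot(idx, est, pch = 19, axes = FALSE, xlab = "", ylab = "Coefficient",
     ylim = range(est - se, est + se))
axis(2)
axis(1, at = idx, labels = names(est), las = 2)   # coefficient names, rotated

# Error bars: arrows with flat heads at both ends
arrows(idx, est - se, idx, est + se, angle = 90, code = 3, length = 0.05)
abline(h = 0, lty = 3)
```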
Are you sure you want a histogram for this? A lattice barchart might be pretty nice. An example with the mtcars built-in data set.
> coef <- lm(mpg ~ ., data = mtcars)$coef
> library(lattice)
> barchart(coef, col = 'lightblue', horizontal = FALSE,
ylim = range(coef), xlab = '',
scales = list(y = list(labels = coef),
x = list(labels = names(coef))))
A base R dotchart might be good too,
> dotchart(coef, pch = 19, xlab = 'value')
> text(coef, seq(coef), labels = round(coef, 3), pos = 2)
