How to find outliers in R for MDS?

I'm making a MDS plot with Salary and WAR data from baseball players. I have attached it here. Because there are so many data points I just want to label the Baseball Player names for the outlier data points. Any idea on how to do so?

Well, first it depends on which method you want to use to detect the outliers. I created a reproducible example, since you did not provide a dataset. I'm using the mtcars dataset, which ships with R.
For outlier detection I'm using Cook's distance (you can read about that in a post here).
First I set up the data and calculate Cook's distance (cooksd) with cooks.distance from stats, for a linear model of hp on qsec:
df<-cbind( mtcars[,c("hp", "qsec")], cooks.distance(lm(mtcars$hp~mtcars$qsec)))
names(df)[3]<-"cooksd"
plot(cooks.distance(lm(mtcars$hp~mtcars$qsec)))
We see that we have two outliers. I'm using a threshold of 4*mean(cooksd): points above it are considered outliers, and I flag them in my dataset with the following loop.
for (i in 1:nrow(df)) {
  if (df$cooksd[i] > 4 * mean(df$cooksd)) {
    df$outlier[i] <- 1
  } else {
    df$outlier[i] <- 0
  }
}
The last step is to plot the values and label only those above the threshold. I'm using the row.names here, but you can use any column; just filter it the same way as the other data points.
plot(df$hp, df$qsec)
# add labels only for the rows flagged as outliers
text(df[df$outlier==1,"hp"], df[df$outlier==1,"qsec"], row.names(df[df$outlier == 1,]), cex=0.6, pos=4, col="red")
A big problem with outlier detection is that there are many different methods, and the interpretation has to be adjusted for every case. Which method or threshold you should use won't be answered here, since that's more a discussion for Stack Exchange. You can use cooksd, stdev or whatever seems most useful to you. But the solution stays the same:
use your method to find the outliers
label them in your dataset
add text to your plot only for the points you labelled.
For your three clusters, you just have to fit a separate lm model for each cluster, label the outliers, and then join all three results.
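The per-cluster steps above can be sketched roughly as follows. This is only an illustration under assumptions: mtcars is used again, the cyl column stands in for a cluster label (your data would need its own cluster column), and flag_outliers is a hypothetical helper name, not part of any package.

```r
# Hypothetical helper: fit lm within one cluster and flag points whose
# Cook's distance exceeds 4 * mean (the same threshold used above)
flag_outliers <- function(d) {
  cd <- cooks.distance(lm(hp ~ qsec, data = d))
  d$cooksd <- cd
  d$outlier <- as.integer(cd > 4 * mean(cd))
  d
}

# Assumption: cyl plays the role of the cluster label
groups  <- split(mtcars[, c("hp", "qsec", "cyl")], mtcars$cyl)
flagged <- do.call(rbind, unname(lapply(groups, flag_outliers)))

# Plot all points, coloured by cluster, and label only the flagged ones
plot(flagged$hp, flagged$qsec, col = as.integer(factor(flagged$cyl)), pch = 19)
out <- flagged[flagged$outlier == 1, ]
if (nrow(out) > 0) {
  text(out$hp, out$qsec, labels = row.names(out), cex = 0.6, pos = 4, col = "red")
}
```

Because the threshold is computed inside each cluster, a point can be flagged here even if it would not stand out in a single model fitted to all the data.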

Related

Weight ggridges by another variable

I'm trying to visualize some data with a ridge plot, but I'm wondering if there's a way I can weight the densities of the ridges.
Basically I have the following:
set.seed(1)
example <- data.frame(matrix(nrow=100,ncol=3))
colnames(example) <- c("year","position","weight")
example$year <- as.character(rep(c(1,2,3,4,5),each=20) )
example$position <- runif(100,1,10)
example$weight <- sample(1:3,100,replace = T)
A sample of position in 5 different years. I want to plot the distribution change over time with a ridge plot, but in the dataset, there is also a column for "weight," meaning that some samples counted more than others. Is there a way to incorporate this into my ridges distribution plot? And also is there a way to make rows with more sample*weight be taller than rows with less? So not normalize every year's height to one?
ggplot(example,aes(x=position,y=year))+
ggridges::geom_density_ridges()+
theme_classic()
I was thinking I could try to pipe the dataset to repeat rows for number of weight value that they have, and so they would get counted more than x number of times (or, "weight" number of times) and change the density. Can't quite figure out how to do that though. Also, in my dataset, the weights aren't integers, so I'm hoping for a better solution.
Or, is there another package/technique that might achieve that?
For this dataset we can repeat the rows based on weight column and then plot:
library(ggplot2)
library(ggridges)
example2 <- example[rep(seq_along(example$weight), example$weight), ]
ggplot(example2,aes(x=position,y=year))+
ggridges::geom_density_ridges()+
theme_classic()
#> Picking joint bandwidth of 1.02
However, if you have weights that are not integers, this will not work. There's this open issue on GitHub that you may want to look at.
Another idea is to turn the weights in your original dataset into integers: multiply them by 10 to the power of your desired precision and round the result. Then you can apply the previous solution to your actual dataset.
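A minimal sketch of that rounding idea, assuming hypothetical non-integer weights drawn with runif and one decimal place of precision (the digits value is an assumption, pick whatever suits your data). The second part shows base R's density(), which accepts a weights argument directly, as a possible cross-check that needs no row replication.

```r
set.seed(1)
example <- data.frame(
  year     = as.character(rep(1:5, each = 20)),
  position = runif(100, 1, 10),
  weight   = runif(100, 0.5, 3)   # hypothetical non-integer weights
)

digits <- 1                                  # assumed precision: one decimal
reps   <- round(example$weight * 10^digits)  # integer replication counts
example2 <- example[rep(seq_len(nrow(example)), times = reps), ]

# Alternative without replication: weighted kernel density for one year.
# density() expects the weights to sum to 1, so they are normalised first.
yr1 <- example[example$year == "1", ]
d1  <- density(yr1$position, weights = yr1$weight / sum(yr1$weight))
```

example2 can then be passed to the same ggplot call as before. Note that higher precision multiplies the number of rows by ten per extra digit, which may get slow for large datasets.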

R: Re-colour and connect certain points in a model-plot

This is my first question here so please excuse any mistakes I (may or may not) make.
The premise:
I got a vegetational dataset containing paired data on different plots for old and new observations. I used the 'openxlsx'-package to load my data, and 'vegan'-package to execute an NMDS as follows:
mydata <- read.xlsx(mydata)
mydataMDS <- metaMDS(mydata, k=2, trymax=500)
The result is then used for a model via the "envfit()"-function, including environmental variables:
myenvdata <- read.xlsx(myenvdata)
mydataMDS_fit <- envfit(mydataMDS, myenvdata, perm=10000, na.rm=TRUE)
plot(mydataMDS, display="sites")
plot(mydataMDS_fit, p.max=0.01, axis=TRUE)
Now I have a plot of my "mydataMDS" analysis, including the vectors that R calculated for "mydataMDS_fit".
The problem:
I want to colour and connect certain points within this plot. As "mydata" consists of observations within the same plot at different times, I intend to colour all points of old observations in one colour, and all the new ones in a different one. I've read about adding columns in order to group old and new observations, but as I'm working with a model there are no columns. How can I edit my datasets ("mydata", "mydataMDS", "myenvdata", "mydataMDS_fit") in order to show old and new plots in 2 different colours (one colour for old, one colour for new), and connect the paired observations with lines? Or: Is there a possibility to directly re-colour the points within my graphical output via checking for old/new observations?
(Sorry, I feel like my explanation is quite complicated, but I still hope someone may be able to help)

Adding multiple lines to plot, without ggplot

I would like to plot multiple lines on the same plot, without using ggplot.
I have scores for different individuals across a set time period and wish to plot a line between yearly scores for each individual. Data is organised with each row representing an individual and each column an observed value in a given year.
Currently I am using a for loop, but am aware that this is often not efficient in R, and am interested if there are any more suitable approaches available within base R.
I will be working with up to 100,000 individuals.
Thanks.
Code:
df=data.frame(runif(10,0,100),runif(10,0,100),runif(10,0,100),runif(10,0,100))
df=data.frame(t(df))
Years=seq(1,10,1)
plot(1,type="n",xlab="Year",ylab="Score", xlim=c(1,10), ylim=c(0,100))
for(x in 1:4){lines(Years,df[x,])}
Efficiency is not much of a consideration when plotting since plotting to a device is a slow operation in itself. You can use matplot (which uses a loop internally). It's basically a more sophisticated version of your code wrapped in a function.
matplot(Years, t(df), xlab="Year", ylab="Score", type = "l")

PCA biplot one variables shown R

I ran a pca on a set of 45000 genes on 5 different samples, and when I perform a biplot, all I see is a mass of text (responding to the observation names), and cannot see the location of my samples. Is there a way to plot the location of the samples only, and not the observation, in a biplot?
Using built in data from R
usa <- USArrests
pca1 <- prcomp(usa)
biplot(pca1)
This generates a biplot where all the states (observation names) overlap the variables (my different samples) rape, etc. Is it possible to plot only the variables (samples), and not the states (observation names)?
biplot.default uses text to write the categorical variable name of the observation. As it doesn't use points you need to modify the source if you only want the points (and not the labels) to be plotted.
However, you could "hack" it by doing something like:
biplot(pca1, xlabs = rep(".", nrow(usa)))
I hope this is what you're looking for!
Edit: If this is not satisfactory, you can modify the source shown when running stats:::biplot.default so that it uses points.
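Instead of editing the biplot source, one could also draw a similar figure by hand from the prcomp result. The following is only a sketch: the arrow scaling is an arbitrary visual choice (arrow_scale is a made-up name) and does not reproduce the scaling that biplot() applies.

```r
usa  <- USArrests
pca1 <- prcomp(usa)

scores <- pca1$x[, 1:2]         # observations (states) on PC1 and PC2
loads  <- pca1$rotation[, 1:2]  # variable loadings on PC1 and PC2

# Arbitrary factor so the arrows are visible on the score axes
# (an assumption for display only, not biplot()'s own scaling)
arrow_scale <- 0.8 * max(abs(scores)) / max(abs(loads))

# Observations as small points, variables as labelled arrows
plot(scores, pch = 16, cex = 0.5, col = "grey40", xlab = "PC1", ylab = "PC2")
arrows(0, 0, loads[, 1] * arrow_scale, loads[, 2] * arrow_scale,
       length = 0.08, col = "red")
text(loads * arrow_scale, labels = rownames(loads), col = "red", pos = 3)
```

To show only the variables and no observations at all, the plot() call could use type = "n" so that the axes are set up without drawing the points.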

Is there a better way to plot multicolor lines in R than splitting the data?

Sequential portions of my time series are under different treatments, and I'd like to separately color a line connecting observations in each portion.
For example, in the series under treatment A I'd have a red line, and in the succeeding series under treatment B I'd have a blue line.
plot(response, type="l",col="treatment") failed - all observations were connected with a line the same color.
This listhost posting proposed just splitting the data by treatment and then separately plotting each subset on the same plot. (http://r.789695.n4.nabble.com/Can-R-plot-multicolor-lines-td791081.html).
Is there a more elegant way?
An alternative using Map that avoids manually plotting segments:
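dat <- data.frame(treatment=factor(rep(LETTERS[1:2],3:4)),
response=c(6,5,2,1,5,6,7),time=1:7)
plot(response ~ time, data=dat, type="n")
# treatment is a factor, so its integer codes (1, 2) pick colours from the palette;
# since R 4.0 data.frame() no longer converts strings to factors automatically
Map(
function(x) lines(response ~ time, data=x, col=as.integer(x$treatment[1])),
split(dat, dat$treatment)
)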
There are two popular, more elegant ways. One is to use the ggplot2 package. Without more information it's hard to advise you other than to look at the help or at examples in various places. The other is to check out the function matplot. That will require you to first restructure your data as a matrix, but it can easily do what you want. Keep in mind that while the help says "Plot the columns of one matrix against the columns of another", the x-axis matrix can be a vector of the same length as one column of the matrix containing your line information. The function will just recycle the x vector.
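The matplot route could look roughly like this. It is a sketch on a small made-up dataset (the same toy values as the Map example), with one matrix column per treatment and NA wherever another treatment applies; the colours are an arbitrary choice.

```r
dat <- data.frame(treatment = rep(LETTERS[1:2], 3:4),
                  response  = c(6, 5, 2, 1, 5, 6, 7),
                  time      = 1:7)

trts <- unique(dat$treatment)

# One column per treatment; NA where a different treatment is in effect
mat <- sapply(trts, function(tr) ifelse(dat$treatment == tr, dat$response, NA))

# The single time vector is recycled against every column of mat
matplot(dat$time, mat, type = "l", lty = 1, col = c("red", "blue"),
        xlab = "time", ylab = "response")
```

With this layout the last point of treatment A and the first point of treatment B are not joined, so there is a gap where the treatment changes. If a continuous line is wanted, the boundary observation would have to appear in both columns.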
