How to remove outliers from a dataset using bivariate boxplot

How to remove outliers from a dataset using bivariate boxplot - r

I have a data-set (See below) that is made up of multiple variables, two of these are 'manu' and 'popul' and they both contain numeric values.
From this data I plotted a bivariate boxplot using 'manu' and 'popul' so that I could find outliers between these two variables. This is what it looks like:
Then from this plot I can see there are a few outliers, I was able to identify what values are outliers using the code below, I will also show which values are outliers:
What I would like to know is how do I now take the row that contains these values and remove them from the dataset?
Thanks in advance.

You can extract the desired subset like so:

Related

Clustering with only two variables?

I want to cluster my two-dimensional dataset, but I couldn't figure it out. My dataset looks like below,
dt<-data.frame(x=c(rnorm(10, 2,1), rnorm(10, 6,1)), categorize=c(rep(1,10), rep(2,10)))
I just want to plot this dataset like the graph below, if I add the third value like c(1:nrow(dt)) does it work or what do you recommend me?

ggplot making a descriptive bar graph with no clear y variable

I've been trying to create a proportional stacked bar graph using ggplot and a huge data set that is one column of a dummy variable and one column a factor variable with 14 different levels.
I posted a small sample of the data here.
Despite not having a clear y-variale in my data, I can produce a plot that is only really useful looking at the factors that have a lot of observations, but when there's only one or two, you can't see the proportion at all. The code I used is here.
ggplot(data,aes(factor(data$factor),fill=data$dummy))+
geom_bar()
ggplot says you need to apply a ddply function to the data frame.
ce<-ddply(data,"factor",transform, percent_y=y/sum(y)*100)
Their example doesn't really apply in the case of this data since there's no clear y-variable to call in the plot; just counts of each factor that is 1 or 0.
My best guess for a ddply function spits out an error about differeing number of rows.
ce<-ddply(plot,"factor(data$factor)",transform,
percent=sum(data$dummy)*100/(dim(data$dummy)[1]))

How to structure data for R?

So... newbie R user here. I have some observations that I'd like to record using R and be able to add to later.
The items are sorted by weights, and the number at each weight recorded. So far what I have looks like this:
weights <- c(rep(171.5, times=1), rep(171.6, times=2), rep(171.7, times=4), rep(171.8, times=18), rep(171.9, times=39), rep(172.0, times=36), rep(172.1, times=34), rep(172.2, times=25))
There will be a total of 500 items being observed.
I'm going to be taking additional observations over time to (hopefully) see how the distribution of weights changes with use/wear. I'd like to be able plots showing either stacked histograms or boxplots.
What would be the best way to format / store this data to facilitate this kind of use case? A matrix, dataframe, something else?

As other comments have suggest, the most versatile (and perhaps useful) container (structure) for your data would be a data frame - for use with the library(ggplot2) for your future plotting and graphing needs(such as BoxPlot with ggplot and various histograms
Toy example
All the code below does is use your weights vector above, to create a data frame with some dummy IDs and plot a box and whisker plot, and results in the below plot.
library(ggplot2)
IDs<-sample(LETTERS[1:5],length(weights),TRUE) #dummy ID values
df<-data.frame(ID=IDs,Weights=weights) #make data frame with your
#original `weights` vector
ggplot(data=df,aes(factor(ID),Weights))+geom_boxplot() #box-plot

PCA biplot one variables shown R

I ran a pca on a set of 45000 genes on 5 different samples, and when I perform a biplot, all I see is a mass of text (responding to the observation names), and cannot see the location of my samples. Is there a way to plot the location of the samples only, and not the observation, in a biplot?
Using built in data from R
usa <- USArrests
pca1 <- prcomp(usa)
biplot(pca1)
This generates a biplot where all the states (observation names) overlap the variables (my different samples) rape, etc. Is it possible to plot only the variables (samples), and not the states (observation names)?

biplot.default uses text to write the categorical variable name of the observation. As it doesn't use points you need to modify the source if you only want the points (and not the labels) to be plotted.
However, you could "hack" it by doing something like:
biplot(pca1, xlabs = rep(".", nrow(usa)))
I hope this is what you're looking for!
Edit If this is not satisfactory, you can modify the source given when running stats:::biplot.default to use points.

Plotting distribution of differences in R

I have a dataset with numbers indicating daily difference in some measure.
https://dl.dropbox.com/u/22681355/diff.csv
I would like to create a plot of the distribution of the differences with special emphasis on the rare large changes.
I tried plotting each column using the hist() function but it doesn't really provide a detailed picture of the data.
For example plotting the first column of the dataset produces the following plot:
https://dl.dropbox.com/u/22681355/Rplot.pdf
My problem is that this gives very little detail to the infrequent large deviations.
What is the easiest way to do this?
Also any suggestions on how to summarize this data in a table? For example besides showing the min, max and mean values, would you look at quantiles? Any other ideas?

You could use boxplots to visualize the distribution of the data:
sdiff <- read.csv("https://dl.dropbox.com/u/22681355/diff.csv")
boxplot(sdiff[,-1])
Outliers are printed as circles.

I back #Sven's suggestion for identifying outliers, but you can get more refinement in your histograms by specifying a denser set of breakpoints than what hist chooses by default.
d <- read.csv('https://dl.dropbox.com/u/22681355/diff.csv', header=TRUE, row.names=1)
with(d, hist(a, breaks=seq(min(a), max(a), length.out=100)))

Violin plots could be useful:
df <- read.csv('https://dl.dropbox.com/u/22681355/diff.csv')
library(vioplot)
with(df,vioplot(a,b,c,d,e,f,g,h,i,j))
I would use a boxplot on transformed data, e.g.:
boxplot(df[,-1]/sqrt(abs(df[,-1])))
Obviously a histogram would also look better after transformation.

Develop Reference

r css asp.net wordpress firebase qt symfony nginx http apache-flex

How to remove outliers from a dataset using bivariate boxplot - r

You can extract the desired subset like so:

Related

Clustering with only two variables?

ggplot making a descriptive bar graph with no clear y variable

How to structure data for R?

PCA biplot one variables shown R

Plotting distribution of differences in R

Categories

Resources