How to make categorical scatterplot in R with median marking - r

My data are set up so that one column contains a continuous value testosterone concentration and the second column contains one of four "Kit type" values being "EIA," "RIA," "Other," or "All." I wanted to make the kit types into categories along the x axis with testosterone concentration along the y. I can't seem to figure out how to make sort of a cross between a boxplot and a scatterplot, but with only the individual data points and a median marking for each category marked on the graph?
This seemed to get me the data points into catagories alright, but the summarySE function does not have a median: Categorical scatter plot with mean segments using ggplot2 in R

Without data, I'm only guessing here, but ...
## create some data
set.seed(42)
n <- 100
dat <- data.frame(Testo=rbeta(n, 2, 5),
Kit=sample(c('EIA', 'RIA', 'Other', 'All'), size = n, replace = TRUE))
## show unequal distribution of points, no problem
table(dat$Kit)
## All EIA Other RIA
## 23 30 14 33
## break into individual levels
dat2 <- lapply(levels(dat$Kit), function(lvl) dat$Testo[ dat$Kit == lvl ])
names(dat2) <- levels(dat$Kit)
## parent plot
boxplot(dat2, main = 'Testosterone Levels per Kit')
## adding individual points
for (lvl in seq_along(dat2)) {
points(jitter(rep(lvl, length(dat2[[lvl]]))), dat2[[lvl]],
pch = 16, col = '#88888888')
}

Related

In summary plots of a linear model, how to label outliers with a grouping variable instead of the index value?

I have a data frame structured like this:
set.seed(123)
data<- data.frame(
ID=factor(letters[seq(20)]),
Location = rep(c("alph","brav", "char","delt"), each = 5),
Var1 = rnorm(20),
Var2 = rnorm(20),
Var3 = rnorm(20)
)
I have built a linear model: mod1 <- lm(Var1~Location,mydata). When I use: plot(mod1) on the linear model object, outliers are labeled with the index of the value. Is there a way to label those points with the value in ID? In other words, in this example values 6, 16, and 18 are labeled in the plots, and I want to them to be labeled with f, p, and r, respectively, because those are their corresponding values in ID
stats:::plot.lm is used to plot the diagnostic plots, and there are two options:
id.n: number of points to be labelled in each plot, starting with
the most extreme.
labels.id: vector of labels, from which the labels for extreme points
will be chosen. ‘NULL’ uses observation numbers.
By default id.n=3, so they always label the 3 observations with the largest cook's distance. I am including this as part of the answer because you might want to be careful about interpreting them as outliers.
To get these points, you do
mod1 <- lm(Var1~Location,data)
outl = order(-cooks.distance(mod1))[1:3]
outl
[1] 18 6 16
To plot, you can either provide the labels.id the ID you want, or you start from scratch:
par(mfrow=c(1,2))
plot(mod1,which=1,labels.id =data$ID)
plot(fitted(mod1),residuals(mod1))
panel.smooth(fitted(mod1),residuals(mod1))
text(fitted(mod1)[outl]+0.01,residuals(mod1)[outl],
data$ID[outl],col="red")
To go through all the plots, do:
plot(mod1,labels.id=data$ID)

display issue with matplot in r

I have a matrix composed of 20 sampled data like this (the original data has 30 observations):
## dummy data
dat <- rnorm(30, 1, 0.5)
## generate 20 sampled data
resamples <- lapply(1:20, function(i) sample(dat, replace = T))
## create matrix combining all sampled data together
mat <- t(do.call(rbind, resamples))
I want to draw a dot plot showing the change of the 30 observation across the 20 sampled dataset. The matplot function seems to work, but it displays numbers and alphabets instead of points in the figure:
## draw plot
matplot(mat, type = "p", ylab = " ")
Does anyone know how to fix this problem? And how can I make the x-axis ranges from 1 to 30, separated by 1? (I tried xlim but did not work)
Thanks!!

Plotting two principal component score vectors, using a different color to indicate three unique classes

After generating a simulated data set with 20 observations in each of three classes (i.e., 60 observations total), and 50 variables, I need to plot the first two principal component score vectors, using a different color to indicate the three unique classes.
I believe I can create the simulated data set (please verify), but I am having issues figuring out how to color the classes and plot. I need to make sure the three classes appear separated in the plot (or else I need to re-run the simulated data).
#for the response variable y (60 values - 3 classes 1,2,3 - 20 observations per class)
y <- rep(c(1,2,3),20)
#matrix of 50 variables i.e. 50 columns and 60 rows i.e. 60x50 dimensions (=3000 table cells)
x <- matrix( rnorm(3000), ncol=50)
xymatrix <- cbind(y,x)
dim(x)
[1] 60 50
dim(xymatrix)
[1] 60 51
pca=prcomp(xymatrix, scale=TRUE)
How should I correctly plot and color this principal component analysis as noted above? Thank you.
If I understand your question correctly, ggparcoord in Gally package would help you.
library(GGally)
y <- rep(c(1,2,3), 20)
# matrix of 50 variables i.e. 50 columns and 60 rows
# i.e. 60x50 dimensions (=3000 table cells)
x <- matrix(rnorm(3000), ncol=50)
xymatrix <- cbind(y,x)
pca <- prcomp(xymatrix, scale=TRUE)
# Principal components score and group label 'y'
pc_label <- data.frame(pca$x, y=as.factor(y))
# Plot the first two principal component scores of each samples
ggparcoord(data=pc_label, columns=1:2, groupColumn=ncol(pc_label))
However, I think it makes more sense to do PCA on x rather than xymatrix that includes the target y. So the following codes should be more appropriate in your case.
pca <- prcomp(x, scale=TRUE)
pc_label <- data.frame(pca$x, y=as.factor(y))
ggparcoord(data=pc_label, columns=1:2, groupColumn=ncol(pc_label))
If you want a scatter plot of first two principal component scores, you can do it using ggplot.
library(ggplot2)
ggplot(data=pc_label) +
geom_point(aes(x=PC1, y=PC2, colour=y))
Here's a base R solution, to show how simply this can be done. First do the PCA on the x matrix only and from the resulting object get a matrix of the transformed variables which we'll call PCs.
x <- matrix(rnorm(3000), ncol=50)
pca <- prcomp(x, scale=TRUE)
PCs <- as.matrix(pca$x)
Now we can make vector of colour names based on your y for the labels.
col.labs <- rep(c("Green", "Blue", "Red"), 20)
Now just plot as a scatter, passing the colour vector to col.
plot(PCs[, 1], PCs[, 2], col=col.labs, pch=19, xlab = "Scores on PC1", ylab="Scores on PC2")

How to measure area between 2 distribution curves in R / ggplot2

The specific example is that imagine x is some continuous variable between 0 and 10 and that the red line is distribution of "goods" and the blue is "bads", I'd like to see if there is value in incorporating this variable into checking for 'goodness' but I'd like to first quantify the amount of stuff in the areas where the blue > red
Because this is a distribution chart, the scales look the same, but in reality there is 98 times more good in my sample which complicates things, since it's not actually just measuring the area under the curve, but rather measuring the bad sample where it's distribution is along lines where it's greater than the red.
I've been working to learn R, but am not even sure how to approach this one, any help appreciated.
EDIT
sample data:
http://pastebin.com/7L3Xc2KU <- a few million rows of that, essentially.
the graph is created with
graph <- qplot(sample_x, bad_is_1, data=sample_data, geom="density", color=bid_is_1)
The only way I can think of to do this is to calculate the area between the curve using simple trapezoids. First we manually compute the densities
d0 <- density(sample$sample_x[sample$bad_is_1==0])
d1 <- density(sample$sample_x[sample$bad_is_1==1])
Now we create functions that will interpolate between our observed density points
f0 <- approxfun(d0$x, d0$y)
f1 <- approxfun(d1$x, d1$y)
Next we find the x range of the overlap of the densities
ovrng <- c(max(min(d0$x), min(d1$x)), min(max(d0$x), max(d1$x)))
and divide that into 500 sections
i <- seq(min(ovrng), max(ovrng), length.out=500)
Now we calculate the distance between the density curves
h <- f0(i)-f1(i)
and using the formula for the area of a trapezoid we add up the area for the regions where d1>d0
area<-sum( (h[-1]+h[-length(h)]) /2 *diff(i) *(h[-1]>=0+0))
# [1] 0.1957627
We can plot the region using
plot(d0, main="d0=black, d1=green")
lines(d1, col="green")
jj<-which(h>0 & seq_along(h) %% 5==0); j<-i[jj];
segments(j, f1(j), j, f1(j)+h[jj])
Here's a way to shade the area between two density plots and calculate the magnitude of that area.
# Create some fake data
set.seed(10)
dat = data.frame(x=c(rnorm(1000, 0, 5), rnorm(2000, 0, 1)),
group=c(rep("Bad", 1000), rep("Good", 2000)))
# Plot densities
# Use y=..count.. to get counts on the vertical axis
p1 = ggplot(dat) +
geom_density(aes(x=x, y=..count.., colour=group), lwd=1)
Some extra calculations to shade the area between the two density plots
(adapted from this SO question):
pp1 = ggplot_build(p1)
# Create a new data frame with densities for the two groups ("Bad" and "Good")
dat2 = data.frame(x = pp1$data[[1]]$x[pp1$data[[1]]$group==1],
ymin=pp1$data[[1]]$y[pp1$data[[1]]$group==1],
ymax=pp1$data[[1]]$y[pp1$data[[1]]$group==2])
# We want ymax and ymin to differ only when the density of "Good"
# is greater than the density of "Bad"
dat2$ymax[dat2$ymax < dat2$ymin] = dat2$ymin[dat2$ymax < dat2$ymin]
# Shade the area between "Good" and "Bad"
p1a = p1 +
geom_ribbon(data=dat2, aes(x=x, ymin=ymin, ymax=ymax), fill='yellow', alpha=0.5)
Here are the two plots:
To get the area (number of values) in specific ranges of Good and Bad, use the density function on each group (or you can continue to work with the data pulled from ggplot as above, but this way you get more direct control over how the density distribution is generated):
## Calculate densities for Bad and Good.
# Use same number of points and same x-range for each group, so that the density
# values will line up. Use a higher value for n to get a finer x-grid for the density
# values. Use a power of 2 for n, because the density function rounds up to the nearest
# power of 2 anyway.
bad = density(dat$x[dat$group=="Bad"],
n=1024, from=min(dat$x), to=max(dat$x))
good = density(dat$x[dat$group=="Good"],
n=1024, from=min(dat$x), to=max(dat$x))
## Normalize so that densities sum to number of rows in each group
# Number of rows in each group
counts = tapply(dat$x, dat$group, length)
bad$y = counts[1]/sum(bad$y) * bad$y
good$y = counts[2]/sum(good$y) * good$y
## Results
# Number of "Good" in region where "Good" exceeds "Bad"
sum(good$y[good$y > bad$y])
[1] 1931.495 # Out of 2000 total in the data frame
# Number of "Bad" in region where "Good" exceeds "Bad"
sum(bad$y[good$y > bad$y])
[1] 317.7315 # Out of 1000 total in the data frame

Get a histogram plot of factor frequencies (summary)

I've got a factor with many different values. If you execute summary(factor) the output is a list of the different values and their frequency. Like so:
A B C D
3 3 1 5
I'd like to make a histogram of the frequency values, i.e. X-axis contains the different frequencies that occur, Y-axis the number of factors that have this particular frequency. What's the best way to accomplish something like that?
edit: thanks to the answer below I figured out that what I can do is get the factor of the frequencies out of the table, get that in a table and then graph that as well, which would look like (if f is the factor):
plot(factor(table(f)))
Update in light of clarified Q
set.seed(1)
dat2 <- data.frame(fac = factor(sample(LETTERS, 100, replace = TRUE)))
hist(table(dat2), xlab = "Frequency of Level Occurrence", main = "")
gives:
Here we just apply hist() directly to the result of table(dat). table(dat) provides the frequencies per level of the factor and hist() produces the histogram of these data.
Original
There are several possibilities. Your data:
dat <- data.frame(fac = rep(LETTERS[1:4], times = c(3,3,1,5)))
Here are three, from column one, top to bottom:
The default plot methods for class "table", plots the data and histogram-like bars
A bar plot - which is probably what you meant by histogram. Notice the low ink-to-information ratio here
A dot plot or dot chart; shows the same info as the other plots but uses far less ink per unit information. Preferred.
Code to produce them:
layout(matrix(1:4, ncol = 2))
plot(table(dat), main = "plot method for class \"table\"")
barplot(table(dat), main = "barplot")
tab <- as.numeric(table(dat))
names(tab) <- names(table(dat))
dotchart(tab, main = "dotchart or dotplot")
## or just this
## dotchart(table(dat))
## and ignore the warning
layout(1)
this produces:
If you just have your data in variable factor (bad name choice by the way) then table(factor) can be used rather than table(dat) or table(dat$fac) in my code examples.
For completeness, package lattice is more flexible when it comes to producing the dot plot as we can get the orientation you want:
require(lattice)
with(dat, dotplot(fac, horizontal = FALSE))
giving:
And a ggplot2 version:
require(ggplot2)
p <- ggplot(data.frame(Freq = tab, fac = names(tab)), aes(fac, Freq)) +
geom_point()
p
giving:

Resources