Excluding outliers when plotting a Stripchart with ggplot2 - r

I'm trying to create a combination Boxplot/Scatterplot. I'm doing alright with it so far but there's one issue that's really bothering me that I've been unable to figure out. I'm in R and I've installed the ggplot2 package. Here's the code I'm using:
#(xx= stand in for my data set, which I imported from excel with the
# column labels as the X-axis values)
> boxplot(xx, lwd = 1.5, ylab = 'Minutes', xlab = "Epoch")
> stripchart(xx, vertical = TRUE,
+ method = "jitter", add = TRUE, pch = 20, col = 'blue')
This gives me a plot that is pretty close to what I want but the problem is that the outliers are placed on the chart twice. If possible, I'd like to have the stripchart exclude them (highest groups of blue dots) and only use the ones from the boxplot (black outlined circles) so they stand out as different and don't look so sloppy.
I've tried to alter the points in question by putting a lot of different outlier arguments into the stripchart command, unfortunately with no luck. I've tried setting y-limits below their values, tried using outline=false (which completely removes the stripchart), tried changing outlier color, outpch, etc. The command has not worked for any of these attempts. Here's an example of ylim:
> stripchart(xx, vertical = TRUE,
+ method = "jitter", add = TRUE, pch = 20, col = 'blue', ylim = true,
ylim (0,20))
Error in ylim(0, 20) : could not find function "ylim"
And here's an example with outlier color:
> stripchart(xx vertical = TRUE,
+ method = "jitter", add = TRUE, pch = 20, col = 'blue', outcol = "black")
Warning messages:
1: In plot.xy(xy.coords(x, y), type = type, ...) : "outcol" is not a
graphical parameter
.......# warning messages continue as such.
Are stripcharts capable of outlier exclusion? Or do I simply not know enough about them yet (and R as a whole, for that matter) to effectively write the code?
If this can be done, how should I proceed? I'm totally fine with solutions that don't directly address the outlier issue in terms of the data as long as the visual effect on the plot is the same.
Thank you for your time and any help you can give!
Edit: Here's some of the data to play around with. The top row holds the column labels and the data is beneath. Sorry if this formatting is bad. The 29s and 30s in the 9th row of data (10th row overall) are examples of points plotted as outliers in my graphs that I would like to keep in the boxplot but not in the scatterplot/stripchart.
1 5 10 15 30 60
7.233333333 8.166666667 9.666666667 7.75 9 7
7.133333333 9.25 9.333333333 9.75 10 11
0.733333333 0.5 0.833333333 1 1 0
1.766666667 1.166666667 1 0.75 1 0
1.75 2.25 2.333333333 2.25 1 1
6.75 7 7.166666667 7.75 6.5 7
1.516666667 1.75 1.333333333 2 2 2
1.533333333 1.5 2 1.25 1.5 2
27.3 28.33333333 29.33333333 30.25 28.5 29
6.35 6 6.333333333 7 6 6
7.083333333 8.333333333 8.833333333 8.75 8 8
8.533333333 10.08333333 10.5 12 10.5 11
7.65 8.416666667 9 10.75 9 12
6.85 7.333333333 8 7.25 6 8
4.433333333 5 5.5 5 6.5 6
8.616666667 10 11.66666667 12.25 13 12
3.633333333 3.75 3.5 3.25 3 2
0.8 0.75 0.833333333 1 1 0
7.283333333 8.583333333 9.666666667 9.75 12 8
7.483333333 8.75 8.333333333 7.75 6.5 7
3.466666667 2.916666667 3.166666667 2.5 2 0
5.483333333 6.416666667 6.833333333 6.75 7 8

There are a few things going on here. If you want to stick with the base plotting functions (boxplot() and stripchart()), you can tell stripchart() to plot only the points that meet some criterion. One common convention treats any point 3 or more standard deviations away from the mean as an outlier. Instead of passing your unmodified data set to stripchart(), subset each column first (note the [ ] brackets).
boxplot(xx)
# keep, per column, only the values no more than 3 SDs above that column's mean
inliers <- lapply(xx, function(v) v[v <= mean(v) + 3 * sd(v)])
stripchart(inliers, vertical = TRUE, method = 'jitter', add = TRUE, pch = 20, col = 'blue')
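A 3-SD cutoff will not necessarily remove exactly the circles that boxplot() draws, because boxplot() uses its own rule (by default, points more than 1.5 times the interquartile range beyond the hinges, as computed by boxplot.stats()). A minimal sketch that reuses that rule is shown here; the small data frame and the helper name drop_box_outliers are made up for illustration and stand in for the real xx.

```r
# Hypothetical stand-in for the question's data frame `xx`
xx <- data.frame(
  X1 = c(7.2, 7.1, 0.7, 1.8, 1.75, 6.75, 1.5, 27.3, 6.35, 7.1),
  X5 = c(8.2, 9.25, 0.5, 1.2, 2.25, 7, 1.75, 28.3, 6, 8.3)
)

# Illustrative helper: drop the values that boxplot.stats() reports as outliers
drop_box_outliers <- function(v) v[!v %in% boxplot.stats(v)$out]

# One vector of non-outlying values per column
inliers <- lapply(xx, drop_box_outliers)

boxplot(xx, lwd = 1.5, ylab = "Minutes", xlab = "Epoch")
# stripchart() accepts a list of numeric vectors, one per group
stripchart(inliers, vertical = TRUE, method = "jitter",
           add = TRUE, pch = 20, col = "blue")
```

With this approach the blue dots and the black outlier circles should never coincide, since both come from the same rule.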
Of course, if you really did want to use ggplot2 (and I recommend installing not only that package, but the entire tidyverse with install.packages('tidyverse')), you could produce an arguably nicer plot:
The data formatting and commands needed to produce the ggplot version are quite different from the base graphics version, and beyond the scope of this answer. Reproducible code follows.
library(tidyverse)
df <- structure(list(X1 = c(7.233333333, 7.133333333, 0.733333333, 1.766666667, 1.75, 6.75, 1.516666667, 1.533333333, 27.3, 6.35, 7.083333333, 8.533333333, 7.65, 6.85, 4.433333333, 8.616666667, 3.633333333, 0.8, 7.283333333, 7.483333333, 3.466666667, 5.483333333 ), X5 = c(8.166666667, 9.25, 0.5, 1.166666667, 2.25, 7, 1.75, 1.5, 28.33333333, 6, 8.333333333, 10.08333333, 8.416666667, 7.333333333, 5, 10, 3.75, 0.75, 8.583333333, 8.75, 2.916666667, 6.416666667 ), X10 = c(9.666666667, 9.333333333, 0.833333333, 1, 2.333333333, 7.166666667, 1.333333333, 2, 29.33333333, 6.333333333, 8.833333333, 10.5, 9, 8, 5.5, 11.66666667, 3.5, 0.833333333, 9.666666667, 8.333333333, 3.166666667, 6.833333333), X15 = c(7.75, 9.75, 1, 0.75, 2.25, 7.75, 2, 1.25, 30.25, 7, 8.75, 12, 10.75, 7.25, 5, 12.25, 3.25, 1, 9.75, 7.75, 2.5, 6.75), X30 = c(9, 10, 1, 1, 1, 6.5, 2, 1.5, 28.5, 6, 8, 10.5, 9, 6, 6.5, 13, 3, 1, 12, 6.5, 2, 7), X60 = c(7L, 11L, 0L, 0L, 1L, 7L, 2L, 2L, 29L, 6L, 8L, 11L, 12L, 8L, 6L, 12L, 2L, 0L, 8L, 7L, 0L, 8L)), .Names = c("X1", "X5", "X10", "X15", "X30", "X60"), class = "data.frame", row.names = c(NA, -22L))
df.long <- gather(df, x, value) %>%
mutate(x = as.factor(as.numeric(gsub('X', '', x)))) %>%
group_by(x) %>%
mutate(is.outlier = value > mean(value) + sd(value) * 3)
plot.df <- ggplot(data = df.long, aes(x = x, y = value, group = x)) +
geom_boxplot() +
geom_point(data = filter(df.long, !is.outlier), color = '#0000ff88', position = position_jitter(width = 0.1))
print(plot.df)
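For readers who want to see what the reshaping step does without the tidyverse, here is a rough base-R sketch of the same idea: stack the wide columns into long format, then flag outliers within each group using the 3-SD rule from above. The toy data frame is invented for illustration; only the column naming (X1, X5) follows the answer.

```r
# Hypothetical wide data: one column per epoch
df <- data.frame(
  X1 = c(rep(5, 20), 6, 50),
  X5 = rep(c(3, 4, 5, 6), length.out = 22)
)

# stack() produces one row per observation, with columns `values` and `ind`
df.long <- stack(df)
names(df.long) <- c("value", "x")

# Turn "X1"/"X5" into the factor levels 1 and 5, as the gsub() step does
df.long$x <- factor(as.numeric(sub("X", "", as.character(df.long$x))))

# ave() applies the rule within each level of x; it returns 0/1, hence `== 1`
df.long$is.outlier <- ave(df.long$value, df.long$x,
                          FUN = function(v) v > mean(v) + 3 * sd(v)) == 1

df.long[df.long$is.outlier, ]
```

The resulting df.long has the same three columns (x, value, is.outlier) that the ggplot call expects.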

Related

How to draw histograms in r

My code is reading in a file which looks like this:
Wafer Analyses
%HN20396815 150.000000 03/27/2016 04:06:27
Vf2 Site
2.76 1
2.32 2
2.56 3
2.45 2
2.76 4
2.98 3
2.58 1
2.42 2
2.76 1
2.32 4
2.56 3
2.45 2
2.76 4
2.98 3
2.58 1
2.42 2
I then segregate the "Vf2" data by "Site" and plot a histogram for each site. However, the histograms don't get drawn. How do I rectify this? This is my code:
blufile <- read.table("C:/blufiles/WFRHN20396815_700.blu", skip = 2, header = TRUE, sep="\t")
Vf2 = blufile["Vf2"]
Site = blufile["Site"]
blufile2 <- data.frame(Vf2, Site)
blufile2$Vf2[blufile2$Site == '2']
set.seed(42)
p1 <- blufile2$Vf2[blufile2$Site == '1'] # centered at 4
p2 <- blufile2$Vf2[blufile2$Site == '2'] # centered at 6
p3 <- blufile2$Vf2[blufile2$Site == '3'] # centered at 4
p4 <- blufile2$Vf2[blufile2$Site == '4'] # centered at 6
plot( p1, col=rgb(0,0,1,1/4), xlim=c(0,4)) # first histogram
plot( p2, col=rgb(1,0,0,1/4), xlim=c(0,4), add=T) # second
plot( p3, col=rgb(0,1,0,1/4), xlim=c(0,4), add=T) # first histogram
plot( p4, col=rgb(1,1,0,1/4), xlim=c(0,4), add=T) # second
I want a similar output ->
The solution @alistaire provided works with the data you posted.
ggplot(df, aes(x=Vf2)) +
geom_histogram(bins=5) +
facet_wrap(~Site)
The data df used in this case is:
structure(list(Vf2 = c(2.76, 2.32, 2.56, 2.45, 2.76, 2.98, 2.58,
2.42, 2.76, 2.32, 2.56, 2.45, 2.76, 2.98, 2.58, 2.42), Site = c(1L,
2L, 3L, 2L, 4L, 3L, 1L, 2L, 1L, 4L, 3L, 2L, 4L, 3L, 1L, 2L)), .Names = c("Vf2",
"Site"), class = "data.frame", row.names = c(NA, -16L))
Please elaborate on your issue if this still does not work.
EDIT:
The issue with your code is that you used plot(), which produces an x-y point plot, instead of hist().
Here's a solution with base:
df_list <- split(df, df$Site) # list of 4 data frames, one per level of Site
hist_cols <- c(rgb(0,0,1,1/4), rgb(1,0,0,1/4), rgb(0,1,0,1/4), rgb(1,1,0,1/4)) # one transparent colour per site
for(i in seq_along(df_list)){
dat <- df_list[[i]]$Vf2 # select Vf2 data from site i
# draw the first histogram normally, then overlay the rest with add = TRUE
hist(dat, col=hist_cols[i], xlim=c(0,4), ylim=c(0, 5), add = i > 1)
}
The solution for ggplot is slightly more confusing in this case:
ggplot(data=df, aes(x=Vf2, fill=factor(Site), colour=factor(Site))) +
geom_histogram(data=df %>% filter(Site==1), bins=5, alpha=.5) +
geom_histogram(data=df %>% filter(Site==2), bins=5, alpha=.5) +
geom_histogram(data=df %>% filter(Site==3), bins=5, alpha=.5) +
geom_histogram(data=df %>% filter(Site==4), bins=5, alpha=.5) +
theme_bw()
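A possibly simpler way to get overlaid histograms in ggplot2 is a single geom_histogram() layer with position = "identity", which draws the groups on top of each other instead of stacking them. This is a sketch of that alternative, not the answer's own method; the data frame is rebuilt from the values shown above.

```r
library(ggplot2)

# Same values as the `df` used above
df <- data.frame(
  Vf2 = c(2.76, 2.32, 2.56, 2.45, 2.76, 2.98, 2.58, 2.42,
          2.76, 2.32, 2.56, 2.45, 2.76, 2.98, 2.58, 2.42),
  Site = c(1, 2, 3, 2, 4, 3, 1, 2, 1, 4, 3, 2, 4, 3, 1, 2)
)

# One layer; fill by site and overlay rather than stack the bars
p <- ggplot(df, aes(x = Vf2, fill = factor(Site))) +
  geom_histogram(bins = 5, alpha = 0.5, position = "identity") +
  theme_bw()
print(p)
```

Unlike the four-layer version, all sites here share the same bin edges, which makes the overlaid bars directly comparable.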

calculate gaussian curve fitting on a list

I have a list like the one below. I want to perform a nonlinear-regression Gaussian curve fit between mids and counts for each element of the list and report the mean and standard deviation.
mylist<- structure(list(A = structure(list(breaks = c(-10, -9,
-8, -7, -6, -5, -4, -3, -2, -1, 0, 1, 2, 3, 4), counts = c(1L,
0L, 1L, 5L, 9L, 38L, 56L, 105L, 529L, 2858L, 17L, 2L, 0L, 2L),
density = c(0.000276014352746343, 0, 0.000276014352746343,
0.00138007176373171, 0.00248412917471709, 0.010488545404361,
0.0154568037537952, 0.028981507038366, 0.146011592602815,
0.788849020149048, 0.00469224399668783, 0.000552028705492686,
0, 0.000552028705492686), mids = c(-9.5, -8.5, -7.5, -6.5,
-5.5, -4.5, -3.5, -2.5, -1.5, -0.5, 0.5, 1.5, 2.5, 3.5),
xname = "x", equidist = TRUE), .Names = c("breaks", "counts",
"density", "mids", "xname", "equidist"), class = "histogram"),
B = structure(list(breaks = c(-7, -6, -5,
-4, -3, -2, -1, 0), counts = c(2L, 0L, 6L, 2L, 2L, 1L, 3L
), density = c(0.125, 0, 0.375, 0.125, 0.125, 0.0625, 0.1875
), mids = c(-6.5, -5.5, -4.5, -3.5, -2.5, -1.5, -0.5), xname = "x",
equidist = TRUE), .Names = c("breaks", "counts", "density",
"mids", "xname", "equidist"), class = "histogram"), C = structure(list(
breaks = c(-7, -6, -5, -4, -3, -2, -1, 0, 1), counts = c(2L,
2L, 4L, 5L, 14L, 22L, 110L, 3L), density = c(0.0123456790123457,
0.0123456790123457, 0.0246913580246914, 0.0308641975308642,
0.0864197530864197, 0.135802469135802, 0.679012345679012,
0.0185185185185185), mids = c(-6.5, -5.5, -4.5, -3.5,
-2.5, -1.5, -0.5, 0.5), xname = "x", equidist = TRUE), .Names = c("breaks",
"counts", "density", "mids", "xname", "equidist"), class = "histogram")), .Names = c("A",
"B", "C"))
I have read this
Fitting a density curve to a histogram in R
but that shows how to fit a curve to a histogram. What I want are the "Best-fit values": "Mean" and "SD".
If I use PRISM to do it, I should get the following results
for A
Mids Counts
-9.5 1
-8.5 0
-7.5 1
-6.5 5
-5.5 9
-4.5 38
-3.5 56
-2.5 105
-1.5 529
-0.5 2858
0.5 17
1.5 2
2.5 0
3.5 2
performing nonlinear regression Gaussian curve fitting , I get
"Best-fit values"
" Amplitude" 3537
" Mean" -0.751
" SD" 0.3842
for the second set
B
Mids Counts
-6.5 2
-5.5 0
-4.5 6
-3.5 2
-2.5 2
-1.5 1
-0.5 3
"Best-fit values"
" Amplitude" 7.672
" Mean" -4.2
" SD" 0.4275
and for the third one
Mids Counts
-6.5 2
-5.5 2
-4.5 4
-3.5 5
-2.5 14
-1.5 22
-0.5 110
0.5 3
I get this
"Best-fit values"
" Amplitude" 120.7
" Mean" -0.6893
" SD" 0.4397
To convert the histogram back to an estimate of the mean and standard deviation, first expand it by repeating each bin midpoint as many times as its count. This gives an approximation of the original data.
Based on your example above:
#extract the mid points and create list of simulated data
simdata<-lapply(mylist, function(x){rep(x$mids, x$counts)})
#if the original data were integers then this may give a better estimate
#simdata<-lapply(mylist, function(x){rep(x$breaks[-1], x$counts)})
#find the mean and sd of simulated data
means<-lapply(simdata, mean)
sds<-lapply(simdata, sd)
#or use sapply in the above 2 lines depending on future process needs
If your original data were integers, using the breaks as the bin values will provide a better estimate. Depending on how the histogram was built (i.e., right = TRUE or FALSE), the results may be shifted by one.
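As a small worked check of the expansion idea, the mean and sd of the repeated midpoints are the same as the count-weighted mean and sd of the midpoints. The sketch below uses data set B from the question; the variable names are illustrative.

```r
# Data set B from the question
mids   <- c(-6.5, -5.5, -4.5, -3.5, -2.5, -1.5, -0.5)
counts <- c(2, 0, 6, 2, 2, 1, 3)

# Expanded ("simulated") data: each midpoint repeated by its count
sim <- rep(mids, counts)

# Equivalent count-weighted formulas, no expansion needed
n      <- sum(counts)
w_mean <- sum(mids * counts) / n
w_sd   <- sqrt(sum(counts * (mids - w_mean)^2) / (n - 1))

c(expanded = mean(sim), weighted = w_mean)
c(expanded = sd(sim),   weighted = w_sd)
```

For B the mean works out to -55 / 16 = -3.4375, which differs from the PRISM value quoted in the question; that gap is what the edits below try to explain.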
Edit
I thought this was going to be an easy one. I reviewed the video, the sample data shown was:
mids<-seq(-7, 7)
counts<-c(7, 1, 2, 2, 2, 5, 217, 70, 18, 0, 2, 1, 2, 0, 1)
simdata<-rep(mids, counts)
The video results were mean = -0.7359 and sd= 0.4571. The solution which I found provided the closest results was using the "fitdistrplus" package:
library(fitdistrplus)
fitdist(simdata, "norm", "mge")
Using the "maximizing goodness-of-fit estimation" resulted in mean = -0.7597280 and sd= 0.8320465.
At this point, the method above provides a close estimate but not an exact match. I do not know what technique was used to calculate the fit in the video.
Edit #2
The solutions above recreate the original data and fit it, either by taking the mean/sd directly or by using the fitdistrplus package. This one attempts a least-squares fit using the Gaussian distribution.
simdata<-lapply(mylist, function(x){rep(x$mids, x$counts)})
means<-sapply(simdata, mean)
sds<-sapply(simdata, sd)
#Data from video
#mids<-seq(-7, 7)
#counts<-c(7, 1, 2, 2, 2, 5, 217, 70, 18, 0, 2, 1, 2, 0, 1)
#make list of the bins and distribution in each bin
mids<-lapply(mylist, function(x){x$mids})
dis<-lapply(mylist, function(x) {x$counts/sum(x$counts)})
#function to perform the least square fit
nnorm<-function(values, mids, dis) {
means<-values[1]
sds<-values[2]
#print(paste(means, sds))
#calculate out the Gaussian distribution for each bin
modeld<-dnorm(mids, means, sds)
#sum of the squares
diff<-sum( (modeld-dis)^2)
diff
}
#use the optim function with the mean and sd as initial guesses
#find the minimum with the mean and SD as fit parameters
lapply(1:3, function(i) {optim(c(means[[i]], sds[[i]]), nnorm, mids=mids[[i]], dis=dis[[i]])})
This solution comes closer to the PRISM results, but still does not match them. Here is a comparison of all 4 solutions.
From the table, the least-squares fit (the one just above) provides the closest approximation. Tweaking the midpoints passed to dnorm() might help. However, the Case B data is the farthest from being normally distributed, yet PRISM still reports a small standard deviation, while the other methods give results similar to one another. It is possible that PRISM performs some type of data filtering to remove outliers before the fit.
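Another possibility worth checking, since the PRISM output in the question lists an "Amplitude" alongside "Mean" and "SD", is that PRISM fits a three-parameter curve Amplitude * exp(-0.5 * ((x - Mean) / SD)^2) directly to the counts. That model form is an assumption here, not something the question confirms. The sketch below evaluates it at the PRISM values quoted for data set A and then runs a plain least-squares fit of the same form with optim(); the names gauss, sse and start are illustrative.

```r
# Assumed model: Gaussian with a free amplitude
gauss <- function(x, amplitude, mean, sd) {
  amplitude * exp(-0.5 * ((x - mean) / sd)^2)
}

# Data set A from the question
mids_a   <- seq(-9.5, 3.5, by = 1)
counts_a <- c(1, 0, 1, 5, 9, 38, 56, 105, 529, 2858, 17, 2, 0, 2)

# Evaluate the model at the PRISM best-fit values quoted in the question
prism_pred <- gauss(mids_a, amplitude = 3537, mean = -0.751, sd = 0.3842)
round(cbind(mids = mids_a, counts = counts_a, predicted = prism_pred), 1)

# Least-squares fit of the same model to the counts
sse <- function(p) sum((counts_a - gauss(mids_a, p[1], p[2], p[3]))^2)
start <- c(max(counts_a), mids_a[which.max(counts_a)], 1)
fit <- optim(start, sse,
             control = list(parscale = c(1000, 1, 1), maxit = 5000))
fit$par  # amplitude, mean, sd found by the optimiser
```

At the quoted PRISM values the predicted heights at -1.5 and -0.5 come out close to the observed 529 and 2858, which is consistent with this model form. Whether optim() lands on the same parameters depends on the starting values and convergence, so treat fit$par as something to compare rather than a guaranteed match.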

ggplot graph in R based on many variables

I have two data sets, d1 and d2, in csv files. Each data set has 6 columns. I managed to combine them with the melt command and graph them together in ggplot. After adding one extra column, another variable the graph should depend on, I can no longer get the graph I need. A sample of the dataset and my code follow.
The dataset after using melt and reshape package:
initi A B C D E L1
0.005 1 23.7 1.0 1.0 24.7 d2
0.005 2 31.2 2.0 2.1 31.2 d2
0.005 3 35.8 3.1 3.2 35.6 d2
1 1 6.2 1.0 1.0 6.2 d1
1 2 10.1 2.0 2.1 7.0 d1
1 3 11.2 3.0 3.5 7.0 d1
2 1 14.2 8.0 14.3 5.2 d1
2 2 15.9 7.0 13.0 5.5 d1
2 3 16.0 6.2 12.4 5.8 d1
I need to graph A in the X-axis and B in the Y-axis. The initi value will represent each graph. In other words, it will be in the legend. For d1, I need to plot the results between A and B. For d2 I want to plot two graphs where the first graph when the initi = 1 and the second for initi = 2. All of the graphs for d1 and d2 are between A and B and combined in graph. The total in this case 3 lines combined in one graph.
I managed to graph d1 and d2 before I added the initi column. Now I am struggling. Below is my code:
dlist <- list(d1 =data1 ,d2 = data2)
reshaped_data <- melt(dlist, id.vars = c('initi','A','B','C','D','E'))
graph_AB<-ggplot(reshaped_data,aes(x = A, y = B, colour = initi)) +
geom_point(size = 5)+
geom_line() +
ggtitle("DATA1 vs DATA2")
The above code is close to what I want, except that the lines are connected in a strange way. I should get one line for each value of "initi". Also, the legend does not show each value of 'initi'.
You say you want two graphs, one for when initi is 1 and another for when it is 2, but it also takes the value 0.005. You need to subset your data first if you want to omit the 0.005 level.
In ggplot2, multiple graphs are called "facets", and since your facets depend on one variable, the command you want is facet_wrap(). (You'd use facet_grid if your facets depended on 2 variables.) Facets should be on factors, so we'll make sure initi is a factor
reshaped_data$initi <- factor(reshaped_data$initi)
then just add + facet_wrap(~ initi) to your ggplot.
EDIT:
If you want just one graph, try
graph_AB<-ggplot(reshaped_data,aes(x = A, y = B, colour = initi, group = L1)) +
geom_point(size = 5)+
geom_line() +
ggtitle("DATA1 vs DATA2")
graph_AB
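If the goal is one line per value of initi (three lines for the sample data) with each value listed in the legend, a variation worth trying is to make initi a factor and use it for both colour and group. This is a sketch under that reading of the question, using only the columns needed from the sample data.

```r
library(ggplot2)

# Columns of the sample data that the plot uses
reshaped_data <- data.frame(
  initi = c(0.005, 0.005, 0.005, 1, 1, 1, 2, 2, 2),
  A     = c(1, 2, 3, 1, 2, 3, 1, 2, 3),
  B     = c(23.7, 31.2, 35.8, 6.2, 10.1, 11.2, 14.2, 15.9, 16),
  L1    = c("d2", "d2", "d2", "d1", "d1", "d1", "d1", "d1", "d1")
)

# A factor gives a discrete legend with one entry per value of initi
reshaped_data$initi <- factor(reshaped_data$initi)

# group = initi makes geom_line() connect points within each initi value only
graph_AB <- ggplot(reshaped_data,
                   aes(x = A, y = B, colour = initi, group = initi)) +
  geom_point(size = 5) +
  geom_line() +
  ggtitle("DATA1 vs DATA2")
print(graph_AB)
```

Grouping by L1 instead would join the initi = 1 and initi = 2 points of d1 into a single line, so the choice of grouping variable decides how many lines appear.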
In the future, post data using dput. For example, if you do dput(reshaped_data) and paste it into your question, the output will be this:
structure(list(initi = c(0.005, 0.005, 0.005, 1, 1, 1, 2, 2,
2), A = c(1L, 2L, 3L, 1L, 2L, 3L, 1L, 2L, 3L), B = c(23.7, 31.2,
35.8, 6.2, 10.1, 11.2, 14.2, 15.9, 16), C = c(1, 2, 3.1, 1, 2,
3, 8, 7, 6.2), D = c(1, 2.1, 3.2, 1, 2.1, 3.5, 14.3, 13, 12.4
), E = c(24.7, 31.2, 35.6, 6.2, 7, 7, 5.2, 5.5, 5.8), L1 = structure(c(2L,
2L, 2L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = c("d1", "d2"), class = "factor")), .Names = c("initi",
"A", "B", "C", "D", "E", "L1"), class = "data.frame", row.names = c(NA,
-9L))
which anyone can paste into R and use easily.
dlist <- list(d1 =data1 ,d2 = data2)
reshaped_data <- melt(dlist, id.vars = c('initi','A','B','C','D','E'))
graph_AB<-ggplot(reshaped_data,aes(x = A, y = B, colour = initi)) +
geom_point(size = 5)+
ggtitle("DATA1 vs DATA2")
print(graph_AB)
I just removed geom_line from the original question!

connecting line like tree in r

I have following type data for human family:
indvidual <- c("John", "Kris", "Peter", "King", "Marry", "Renu", "Kim", "Ken", "Lu")
Parent1 <- c( NA, NA, "John", "John", "John", NA, "Peter", NA, NA)
Parent2 <- c( NA, NA, "Kris", "Kris", "Renu", NA, "Lu", NA, NA)
X <- c( 2, 3, 2, 3, 4, 5, 1.5, 1, 1)
Y <- c( 3, 3, 2, 2, 2, 3, 1, 3, 2)
pchsize <- c( 4.5, 4.3, 9.2, 6.2, 3.2, 6.4, 2.1, 1.9, 8)
fillcol <- c( 8.5, 8.3, 1.2, 3.2, 8.2, 2.4, 2.6, 6.1, 3.2)
myd <- data.frame (indvidual, Parent1, Parent2, X, Y, pchsize,fillcol)
indvidual Parent1 Parent2 X Y pchsize fillcol
1 John <NA> <NA> 2.0 3 4.5 8.5
2 Kris <NA> <NA> 3.0 3 4.3 8.3
3 Peter John Kris 2.0 2 9.2 1.2
4 King John Kris 3.0 2 6.2 3.2
5 Marry John Renu 4.0 2 3.2 8.2
6 Renu <NA> <NA> 5.0 3 6.4 2.4
7 Kim Peter Lu 1.5 1 2.1 2.6
8 Ken <NA> <NA> 1.0 3 1.9 6.1
9 Lu <NA> <NA> 1.0 2 8.0 3.2
I want to plot something like the following, where each individual's point is connected to its parents (preferably with different line colors for Parent1 and Parent2). The pch size and pch fill should also be scaled to the variables pchsize and fillcol. The plot outline is:
Here is my progress in ggplot2:
require(ggplot2)
ggplot(data=myd, aes(X, Y,fill = fillcol)) +
geom_point(aes(size = pchsize, fill = fillcol), pch = "O") +
geom_text(aes (label = indvidual, vjust=1.25))
Unsolved issues: the connecting lines, and making the pch large and fill-colored at the same time.
Here is a ggplot2 solution:
library(ggplot2)
individual <- c("John", "Kris", "Peter", "King", "Marry", "Renu", "Kim", "Ken", "Lu")
Parent1 <- c( NA, NA, "John", "John", "John", NA, "Peter", NA, NA)
Parent2 <- c( NA, NA, "Kris", "Kris", "Renu", NA, "Lu", NA, NA)
X <- c( 2, 3, 2, 3, 4, 5, 1.5, 1, 1)
Y <- c( 3, 3, 2, 2, 2, 3, 1, 3, 2)
pchsize <- c( 4.5, 4.3, 9.2, 6.2, 3.2, 6.4, 2.1, 1.9, 8)
fillcol <- c( 8.5, 8.3, 1.2, 3.2, 8.2, 2.4, 2.6, 6.1, 3.2)
myd <- data.frame (individual, Parent1, Parent2, X, Y, pchsize,fillcol)
SegmentParent1 <- merge(
myd[, c("individual", "X", "Y")],
myd[!is.na(myd$Parent1), c("Parent1", "X", "Y")],
by.x = "individual", by.y = "Parent1")
SegmentParent2 <- merge(
myd[, c("individual", "X", "Y")],
myd[!is.na(myd$Parent1), c("Parent2", "X", "Y")],
by.x = "individual", by.y = "Parent2")
Segments <- rbind(SegmentParent1, SegmentParent2)
ggplot(data=myd, aes(X, Y)) +
geom_segment(data = Segments, aes(x = X.x, xend = X.y, y = Y.x, yend = Y.y)) +
geom_point(aes(size = pchsize, colour = fillcol)) +
geom_text(aes (label = individual), vjust = 0.5, colour = "red", fontface = 2) +
scale_x_continuous("", expand = c(0, 0.6), breaks = NULL) +
scale_y_continuous("", expand = c(0, 0.4), breaks = NULL) +
scale_size(range = c(20, 40)) +
theme_bw()
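The question also asked for different line colors for Parent1 and Parent2. One way to prepare for that, sketched here in base R, is to build the segment table with match() and keep a column saying which parent each segment belongs to. The helper name make_segments and the output column names are made up for this example.

```r
individual <- c("John", "Kris", "Peter", "King", "Marry", "Renu", "Kim", "Ken", "Lu")
Parent1 <- c(NA, NA, "John", "John", "John", NA, "Peter", NA, NA)
Parent2 <- c(NA, NA, "Kris", "Kris", "Renu", NA, "Lu", NA, NA)
X <- c(2, 3, 2, 3, 4, 5, 1.5, 1, 1)
Y <- c(3, 3, 2, 2, 2, 3, 1, 3, 2)

# Illustrative helper: one row per child that has the given parent recorded
make_segments <- function(parent, label) {
  idx <- match(parent, individual)   # row of each parent, NA if none
  has_parent <- !is.na(idx)
  data.frame(
    child  = individual[has_parent],
    x      = X[idx[has_parent]],     # segment starts at the parent
    y      = Y[idx[has_parent]],
    xend   = X[has_parent],          # and ends at the child
    yend   = Y[has_parent],
    parent = label
  )
}

Segments <- rbind(make_segments(Parent1, "Parent1"),
                  make_segments(Parent2, "Parent2"))
Segments
```

A table like this could then be passed to geom_segment() with aes(x = x, y = y, xend = xend, yend = yend, colour = parent) to color the two kinds of line differently.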
Here is a solution just using plot(), text(), and arrows(). The for loop is a bit cluttered, but will work for larger data sets and it should be easy to play with the plot and arrows:
plot(myd$X,myd$Y, col='white', type="p", main="", ylab="", xlab="",
axes = FALSE, ylim = c(min(myd$Y)*.8, max(myd$Y)*1.2),
xlim = c(min(myd$X)*.8, max(myd$X)*1.2))
child = data.frame()
child = myd[!is.na(myd$Parent1),]
DArrows = matrix(0,nrow(child),4);
MArrows = matrix(0,nrow(child),4);
for (n in 1:nrow(child)){
d<-child[n,];
c1<-myd$indvidual==as.character(d$Parent1);
b1<-myd[t(c1)];
c2<-myd$indvidual==as.character(d$Parent2);
b2<-myd[t(c2)];
DArrows[n, 1]=as.double(d$X)
DArrows[n, 2]=as.double(d$Y)
DArrows[n, 3]=as.double(b1[4])
DArrows[n, 4]=as.double(b1[5])
MArrows[n, 1]=as.double(d$X)
MArrows[n, 2]=as.double(d$Y)
MArrows[n, 3]=as.double(b2[4])
MArrows[n, 4]=as.double(b2[5])
}
arrows(DArrows[,3],DArrows[,4],DArrows[,1],DArrows[,2],lwd= 2, col = "blue",length=".1")
arrows(MArrows[,3],MArrows[,4],MArrows[,1],MArrows[,2],lwd=2, col = "red",length=".1")
par(new=TRUE)
plot(myd$X,myd$Y,type = "p", main = "", ylab = "", xlab = "",cex = myd$pchsize,
axes = FALSE, pch = 21, ylim = c(min(myd$Y)*.8, max(myd$Y)*1.2),
xlim=c(min(myd$X)*.8, max(myd$X)*1.2), bg = myd$fillcol,fg = 'black')
text(1.12*myd$X, .85*myd$Y, myd$indvidual)
arrows((DArrows[,3]+DArrows[,1])/2, (DArrows[,4]+DArrows[,2])/2,
DArrows[,1], DArrows[,2], lwd = 2, col = "blue", length = ".1")
arrows((MArrows[,3]+MArrows[,1])/2, (MArrows[,4]+MArrows[,2])/2,
MArrows[,1], MArrows[,2], lwd = 2, col = "red", length = ".1")
One thing that jumped out at me was to treat this as a network; R has many packages for plotting these.
Here's a very simple solution:
First, I used your parent list to make a sociomatrix - you can generally input networks using edge lists as well - here I put 1 for the first parental relationship and 2 for the second.
psmat <- rbind(c(0, 0, 1, 1, 1, 0, 0, 0, 0),
c(0, 0, 2, 2, 0, 0, 0, 0, 0),
c(0, 0, 0, 0, 0, 0, 1, 0, 0),
rep(0, 9),
rep(0, 9),
c(0, 0, 0, 0, 2, 0, 0, 0, 0),
rep(0, 9),
rep(0, 9),
c(0, 0, 0, 0, 0, 0, 2, 0, 0))
Then, using the network package I just hit:
require(network)
plot(network(psmat), coord = cbind(X, Y), vertex.cex = pchsize,
vertex.col = fillcol, label = indvidual, edge.col = psmat)
This isn't terribly pretty in itself, but I think gives you all the basic elements you wanted.
For the colors, I believe the decimal places are just rounded - I wasn't sure what to do with those.
I know I've seen people plot networks in ggplot, so that might give you a better result.
Edit:
So here's a really messy way of turning your data into a network object directly - someone else might be able to fix it. Additionally, I add an edge attribute (named 'P' for parental status) and give the first set a value of 1 and the second set a value of 2. This can be used when plotting to set the colors.
P1 <- match(Parent1, indvidual)
e1 <- cbind(P1, 1:9); e1 <- na.omit(e1); attr(e1, 'na.action') <- NULL
P2 <- match(Parent2, indvidual)
e2 <- cbind(P2, 1:9); e2 <- na.omit(e2); attr(e2, 'na.action') <- NULL
en1 <- network.initialize(9)
add.edges(en1, e1[,1], e1[,2])
set.edge.attribute(en1, 'P', 1)
add.edges(en1, e2[,1], e2[,2], names.eval = 'P', vals.eval = 2)
plot(en1, coord = cbind(X, Y), vertex.cex = pchsize,
vertex.col = fillcol, label = indvidual, edge.col = 'P')
An alternative solution using igraph:
library(igraph)
mm<-data.frame(dest=c(as.character(myd$Parent1),as.character(myd$Parent2)))
mm$orig<-myd$individual
g<-graph.edgelist(as.matrix(mm[!is.na(mm$dest),]))
rownames(myd)<-as.character(myd[,1])
l<-as.matrix(myd[V(g)$name,4:5])
plot(g,layout=l,vertex.color=myd[V(g)$name,6],vertex.size=myd[V(g)$name,6])
Just play a bit with the colors and sizes!

How do I plot more than one series using qplot?

I'm trying to understand how to have more than one series on a plot, using the following data.
Year <- c('1950', '1960', '1970', '1980')
Bus <- c(10,20,30,40)
Bus.sd <- c(1.1, 2.2, 3.3, 4.4)
Car <- c(20, 20, 40, 40)
Car.sd <- c(1.1, 2.2, 3.3, 4.4)
sample_data = data.frame(Year, Bus, Bus.sd, Car, Car.sd)
qplot(Year, Bus, data=sample_data, geom="pointrange",
ymin = Bus - Bus.sd/2, ymax = Bus + Bus.sd/2)
For example, using the above data, how do I show both sample_data$Bus and sample_data$Car on the same plot in different colors?
What I tried doing was:
p <- qplot(...)
then
p <- p + qplot(...)
where I replicated the previous line, but this gave me an error.
I don't fully understand how AES works. I have studied the ggplot2 examples, but have difficulty understanding the relevant examples here. Or, if it is possible to make a stacked bar (geom_bar) using this data, I think that would also represent it appropriately.
I hope this helps.
ggplot2 works best with data in long format, like so:
Year score sd variable
1 1950 10 1.1 bus
2 1960 20 2.2 bus
3 1970 30 3.3 bus
4 1980 40 4.4 bus
5 1950 20 1.1 car
6 1960 20 2.2 car
7 1970 40 3.3 car
8 1980 40 4.4 car
This will get the data into R:
data <- structure(list(Year = structure(c(1L, 2L, 3L, 4L, 1L, 2L, 3L,
4L), class = "factor", .Label = c("1950", "1960", "1970", "1980"
)), score = c(10, 20, 30, 40, 20, 20, 40, 40), sd = c(1.1, 2.2,
3.3, 4.4, 1.1, 2.2, 3.3, 4.4), variable = c("bus", "bus", "bus",
"bus", "car", "car", "car", "car")), .Names = c("Year", "score",
"sd", "variable"), row.names = c(NA, -8L), class = "data.frame")
And this will make the plot, with dodging and all. You probably need the dodge because your data points overlap. You can control the amount of dodging with the width argument.
ggplot(data, aes(x=Year, y=score,col=variable))+
geom_point(position=position_dodge(width=0.2))+
geom_pointrange(aes(ymin=score-sd, ymax=score+sd),position=position_dodge(width=0.2))
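The long-format table above was typed in directly; it can also be derived from the question's wide sample_data. Here is a minimal base-R sketch of that conversion, with column names chosen to match the long table (score, sd, variable).

```r
# Wide data from the question
Year   <- c('1950', '1960', '1970', '1980')
Bus    <- c(10, 20, 30, 40)
Bus.sd <- c(1.1, 2.2, 3.3, 4.4)
Car    <- c(20, 20, 40, 40)
Car.sd <- c(1.1, 2.2, 3.3, 4.4)
sample_data <- data.frame(Year, Bus, Bus.sd, Car, Car.sd)

# Stack the Bus rows on top of the Car rows, labelling each block
long <- rbind(
  data.frame(Year = sample_data$Year, score = sample_data$Bus,
             sd = sample_data$Bus.sd, variable = "bus"),
  data.frame(Year = sample_data$Year, score = sample_data$Car,
             sd = sample_data$Car.sd, variable = "car")
)
long
```

Each series becomes a block of rows with its own label, which is what lets a single aes(col = variable) mapping separate Bus from Car.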
