Two sided bean plots with connection in R - r

I am trying to create two sided bean plots in R.
My data looks like:
> t1
Country Women Kids
1 China 2 5
2 China 4 10
3 China 3 10
4 China 1 3
5 China 2 2
6 USA 1 1
7 USA 1 2
8 USA 2 1
9 USA 2 3
10 USA 1 0
11 Swiss 1 3
12 Swiss 2 6
13 Swiss 2 5
14 Swiss 1 2
15 Swiss 3 9
I tried the following using R package "beanplot":
> t2=melt(t1)
Using Country as id variables
> t2$C.M=paste(t2$Country,t2$variable,sep=" ")
> beanplot(value ~ C.M, data = t2, ll = 0.04,
+ main = NA, side = "both",ylab = "Count",
+ border = NA, col = list("blue", c("orange", "white")),what=c(1,1,1,1))
And I get the bean plots:
Bean plots for family structure per country
However, I want a bean plot that tells the relation of pairs of points (i.e. women with kids) with connections per country. It should be something like:
This plot but with two-sided bean plot instead of box plot for each country.
Is there a way to achieve this?

You can do:
library(beanplot)
library(reshape2)
library(beeswarm)
# melt
d1 <- melt(t1)
# draw the beans using the at to specify the positions, boxwex
# to increase the size of the beans and xlim to increase the x-axis limits:
beanplot(d1$value ~ interaction(d1$variable, d1$Country), at=c(1.5,3.5,5.5),
side="both",col = list("blue", c("orange", "white")), what=c(1,1,1,1),
boxwex=2, xlim=c(0,7))
# add the points
n <- beeswarm(d1$value ~ interaction(d1$variable, d1$Country), add=T, cex=2,
pwcol = d1$variable, pch=16)
# and finally the segments
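# t1 is the original wide data frame (column 2 = Women, column 3 = Kids);
# groups are drawn in the order China, Swiss, USA
segments(matrix(n$x,5,)[,1], t1[1:5, 2], matrix(n$x,5,)[,2], t1[1:5, 3], lwd= 2)
segments(matrix(n$x,5,)[,3], t1[11:15, 2], matrix(n$x,5,)[,4], t1[11:15, 3], lwd= 2)
segments(matrix(n$x,5,)[,5], t1[6:10, 2], matrix(n$x,5,)[,6], t1[6:10, 3], lwd= 2)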
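The row ranges in the three segments() calls follow from the order of the groups that interaction() creates. Here is a minimal base-R sketch of that order, assuming the data frame from the question (rebuilt by hand here, with the long format built manually instead of with melt()). The names d1_long, group_levels and rows_by_country are illustrative only:

```r
# Rebuild the question's data (values copied from the printed t1)
t1 <- data.frame(
  Country = rep(c("China", "USA", "Swiss"), each = 5),
  Women   = c(2, 4, 3, 1, 2,  1, 1, 2, 2, 1,  1, 2, 2, 1, 3),
  Kids    = c(5, 10, 10, 3, 2,  1, 2, 1, 3, 0,  3, 6, 5, 2, 9)
)

# Long format, equivalent in layout to reshape2::melt(t1)
d1_long <- data.frame(
  Country  = rep(t1$Country, 2),
  variable = factor(rep(c("Women", "Kids"), each = nrow(t1)),
                    levels = c("Women", "Kids")),
  value    = c(t1$Women, t1$Kids)
)

# Groups in the order the beans (and beeswarm columns) are drawn
group_levels <- levels(interaction(d1_long$variable, d1_long$Country))
print(group_levels)

# Rows of t1 that belong to each country, in plotting order
country_order   <- sort(unique(t1$Country))
rows_by_country <- lapply(country_order, function(cn) which(t1$Country == cn))
names(rows_by_country) <- country_order
print(rows_by_country)
```

The countries come out alphabetically (China, Swiss, USA), which is why the second segments() call uses rows 11:15 and the third uses rows 6:10.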

Related

how to count and group categorical data by range in r

I have data from a questionnaire that has a column for year of birth. The range of the data was too large and my plot became confusing. I'm now trying to take the years, group them by decade, and then chart them. But I don't know how to group them.
my data is like:
birth_year <- data.frame("years"=c(
"1920","1923","1930","1940","1932","1935","1942","1944","1952","1956","1996","1961",
"1962","1966","1978","1987","1998","1999","1967","1934","1945","1988","1976","1978",
"1951","1986","1942","1999","1935","1920","1933","1987","1998","1999","1931","1977",
"1920","1931","1977","1999","1967","1992","1998","1984"
))
and my plot is like:
However, I want my data by group as:
birth_year count
(1920-1930]: 5
(1931-1940]: 8
(1941-1950]: 4
(1951-1960]: 3
(1961-1970]: 5
(1971-1980]: 5
(1981-1990]: 5
(1991-2000]: 9
and then plot as a range group.
We can use cut() to group the data, and then plot with ggplot().
birth_year <- data.frame("years"=c(
"1920","1923","1930","1940","1932","1935","1942","1944","1952","1956","1996","1961",
"1962","1966","1978","1987","1998","1999","1967","1934","1945","1988","1976","1978",
"1951","1986","1942","1999","1935","1920","1933","1987","1998","1999","1931","1977",
"1920","1931","1977","1999","1967","1992","1998","1984"
))
birth_year$yearGroup <- cut(as.integer(birth_year$years),breaks = 8,dig.lab = 4,
include.lowest = FALSE)
library(ggplot2)
ggplot(birth_year,aes(x = yearGroup)) + geom_bar()
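# Alternative: fixed 10-year bins starting at 1920, using dplyr and ggplot2::cut_width()
library(dplyr)
library(ggplot2)
birth_year %>%
mutate(val=cut_width(as.numeric(years),10,boundary = 1920, dig.lab=-1))%>%
count(val)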
val n
1 [1920,1930] 5
2 (1930,1940] 8
3 (1940,1950] 4
4 (1950,1960] 3
5 (1960,1970] 5
6 (1970,1980] 5
7 (1980,1990] 5
8 (1990,2000] 9
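Note that cut(..., breaks = 8) splits the observed range into eight equal-width bins, so the bin edges do not fall on decade boundaries. If decade bins are wanted with base R alone, one option is to pass the break points explicitly. A sketch under that assumption (decade_breaks, year_group and counts are illustrative names):

```r
# Birth years from the question, as integers
years <- c(
  1920, 1923, 1930, 1940, 1932, 1935, 1942, 1944, 1952, 1956, 1996, 1961,
  1962, 1966, 1978, 1987, 1998, 1999, 1967, 1934, 1945, 1988, 1976, 1978,
  1951, 1986, 1942, 1999, 1935, 1920, 1933, 1987, 1998, 1999, 1931, 1977,
  1920, 1931, 1977, 1999, 1967, 1992, 1998, 1984
)

# Explicit decade boundaries; include.lowest = TRUE keeps 1920 in the first bin
decade_breaks <- seq(1920, 2000, by = 10)
year_group <- cut(years, breaks = decade_breaks,
                  include.lowest = TRUE, dig.lab = 4)

# Count per decade and draw a simple bar chart
counts <- table(year_group)
print(counts)
barplot(counts, las = 2, ylab = "count")
```

The counts match the grouping requested in the question (5, 8, 4, 3, 5, 5, 5, 9).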

Is there a way to omit variables with NA values from facet wrap plots?

# A tibble: 8 × 4
measurement log2_fc locus operon
<chr> <dbl> <chr> <chr>
1 transcriptome 1 PA3552 arn
2 transcriptome 1.5 PA3553 arn
3 proteome NA PA3552 arn
4 proteome 2 PA3553 arn
5 transcriptome 2.5 PA1179 opr
6 transcriptome 3 PA1180 opr
7 proteome NA PA1179 opr
8 proteome NA PA1180 opr
plot <- ggplot(data=x,aes(x=locus,y=log2_fc,color=measurement)) +
geom_jitter()
plot + facet_wrap(~operon, ncol=2)
I'm working with the code above to create a plot comparing log2_fc of genes as obtained through two different measurement methods. I want to separate the plot out by the operon that the genes belong to but I would like to only have the genes in that operon plotted along the x axis in each facet. Currently it is creating the plot below:
Is there a way to only plot each locus value once along the x axis and still have the data separated by operon?
You just need to free the x axis:
plot + facet_wrap(~operon, ncol=2, scales = "free_x")
Without scales = "free_x", facet_wrap() uses its default, scales = "fixed", so every facet shares the same x axis with the same limits and breaks.
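A self-contained sketch of this, rebuilding the example tibble as a plain data frame. Dropping the NA rows first is an optional extra step that is not part of the answer above; it only avoids the warning about removed rows. The names x_complete and loci_by_operon are illustrative, and the plotting part runs only if ggplot2 is installed:

```r
# Example data copied from the question
x <- data.frame(
  measurement = rep(rep(c("transcriptome", "proteome"), each = 2), 2),
  log2_fc     = c(1, 1.5, NA, 2, 2.5, 3, NA, NA),
  locus       = c("PA3552", "PA3553", "PA3552", "PA3553",
                  "PA1179", "PA1180", "PA1179", "PA1180"),
  operon      = rep(c("arn", "opr"), each = 4)
)

# Optional: keep only rows that have a value to plot
x_complete <- x[!is.na(x$log2_fc), ]

# Loci that belong to each operon, i.e. what each facet should show on its x axis
loci_by_operon <- lapply(split(x$locus, x$operon), unique)
print(loci_by_operon)

if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  p <- ggplot(x_complete, aes(x = locus, y = log2_fc, color = measurement)) +
    geom_jitter(width = 0.1, height = 0) +
    facet_wrap(~operon, ncol = 2, scales = "free_x")
  print(p)
}
```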

Visualising the distribution for different subgroups

I'm using the "d.pizza" data. There is a variable called "delivery_min", which is the delivery time (in minutes), and a variable called "area", which can be one of three areas (Camden, Westminster and Brent).
I want to draw a density plot that visualises the distribution of delivery time for these three areas.
I tried
plot.ecdf(pizza_d$delivery_min)
This code works, but how can I do it for each area?
head(d.pizza)=
index date week weekday area count rabate price operator driver delivery_min
1 1 1 01.03.2014 9 6 Camden 5 TRUE 65.655 Rhonda Taylor 20.0
2 2 2 01.03.2014 9 6 Westminster 2 FALSE 26.980 Rhonda Butcher 19.6
3 3 3 01.03.2014 9 6 Westminster 3 FALSE 40.970 Allanah Butcher 17.8
4 4 4 01.03.2014 9 6 Brent 2 FALSE 25.980 Allanah Taylor 37.3
5 5 5 01.03.2014 9 6 Brent 5 TRUE 57.555 Rhonda Carter 21.8
6 6 6 01.03.2014 9 6 Camden 1 FALSE 13.990 Allanah Taylor 48.7
temperature wine_ordered wine_delivered wrongpizza quality
1 53.0 0 0 FALSE medium
2 56.4 0 0 FALSE high
3 36.5 0 0 FALSE <NA>
4 NA 0 0 FALSE <NA>
5 50.0 0 0 FALSE medium
6 27.0 0 0 FALSE low
You could do:
library(DescTools)
data(d.pizza)
plot.ecdf(subset(d.pizza, area == "Camden")$delivery_min,
col = "red", main = "ECDF for pizza deliveries")
plot.ecdf(subset(d.pizza, area == "Westminster")$delivery_min,
add = TRUE, col = "blue")
plot.ecdf(subset(d.pizza, area == "Brent")$delivery_min,
add = TRUE, col = "green")
library(DescTools)
data(d.pizza)
summary(d.pizza$delivery_min)
plot(NULL,ylab='',xlab='', xlim=c(5,66), ylim=0:1)
for(A in 1:3) {
plot.ecdf(d.pizza$delivery_min[d.pizza$area == levels(d.pizza$area)[A]],
pch=20, col=A+1, add=T)
}
legend("bottomright", legend=levels(d.pizza$area),
bty='n', pch=20, col=2:4)
I'd recommend the ggplot2 library for data visualization in R. Here's some code using ggplot2 that can create a density plot with the three groups overlaid:
library(ggplot2)
# make example dataframe
d.pizza <- data.frame(delivery_min = rnorm(n=30), area = rep(c("Camden", "Westminster", "Brent"), 10))
# plot data in ggplot2
ggplot(d.pizza, aes(x = delivery_min, fill = area, color = area)) + geom_density(alpha = 0.5)
If you want a histogram, that can be done too:
ggplot(d.pizza, aes(x = delivery_min, fill = area, color = area)) + geom_histogram(alpha = 0.5, position = 'identity')
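If you would rather stay in base graphics, overlaid density curves can be drawn by splitting the delivery times by area and calling density() on each piece. The sketch below uses simulated data (the means and standard deviations are made up, not taken from d.pizza), and the names sim and dens are illustrative:

```r
set.seed(42)

# Simulated stand-in for d.pizza: three areas, 40 deliveries each (made-up values)
sim <- data.frame(
  delivery_min = c(rnorm(40, 25, 8), rnorm(40, 22, 6), rnorm(40, 30, 10)),
  area         = rep(c("Camden", "Westminster", "Brent"), each = 40)
)

# One kernel density estimate per area
dens <- lapply(split(sim$delivery_min, sim$area), density, na.rm = TRUE)

# Common axis limits so that every curve fits in the plot
x_rng <- range(sapply(dens, function(d) range(d$x)))
y_max <- max(sapply(dens, function(d) max(d$y)))

# Empty plot, then one line per area
plot(0, type = "n", xlim = x_rng, ylim = c(0, y_max),
     xlab = "delivery_min", ylab = "density",
     main = "Delivery time by area (simulated)")
for (i in seq_along(dens)) {
  lines(dens[[i]], col = i + 1, lwd = 2)
}
legend("topright", legend = names(dens), col = seq_along(dens) + 1,
       lwd = 2, bty = "n")
```

With the real data, replacing sim with d.pizza should work the same way, since na.rm = TRUE lets density() skip missing delivery times.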

Vertex Labels in igraph R

I am using igraph to plot a non directed force network.
I have a dataframe of nodes and links as follows:
> links
source target value sourceID targetID
1 3 4 0.6245 1450552 1519842
2 6 8 0.5723 2607133 3051992
3 9 7 0.7150 3101536 3025831
4 0 1 0.7695 401517 425784
5 2 5 0.5535 1045501 2258363
> nodes
name group size
1 401517 1 8
2 425784 1 8
3 1045501 1 8
4 1450552 1 8
5 1519842 1 8
6 2258363 1 8
7 2607133 1 8
8 3025831 1 8
9 3051992 1 8
10 3101536 1 8
I plot these using igraph as follows:
gg <- graph.data.frame(links,directed=FALSE)
plot(gg, vertex.color = 'lightblue', edge.label=links$value, vertex.size=1, edge.color="darkgreen",
vertex.label.font=1, edge.label.font =1, edge.label.cex = 1,
vertex.label.cex = 2 )
On this plot, igraph has used the proxy indexes for source and target as vertex labels.
I want to use the real IDs, which are given in my links table as sourceID and targetID.
So, for:
source target value sourceID targetID
1 3 4 0.6245 1450552 1519842
This would show as:
(1450552) ----- 0.6245 ----- (1519842)
Instead of:
(3) ----- 0.6245 ----- (4)
(Note that the proxy indexes are zero indexed in the links dataframe, and one indexed in the nodes dataframe. This offset by 1 is necessary for igraph plotting).
I know I need to somehow match or map the proxy indexes to their corresponding name within the nodes dataframe. However, I am at a loss as I do not know the order in which igraph plots labels.
How can I achieve this?
I have consulted the following questions to no avail:
Vertex Labels in igraph with R
how to specify the labels of vertices in R
R igraph rename vertices
You can specify the labels like this:
library(igraph)
gg <- graph.data.frame(
links,directed=FALSE,
vertices = rbind(
setNames(links[,c(1,4)],c("id","label")),
setNames(links[,c(2,5)], c("id","label"))))
plot(gg, vertex.color = 'lightblue', edge.label=links$value,
vertex.size=1, edge.color="darkgreen",
vertex.label.font=1, edge.label.font =1, edge.label.cex = 1,
vertex.label.cex = 2 )
You could also pass
merge(rbind(
setNames(links[,c(1,4)],c("id","label")),
setNames(links[,c(2,5)], c("id","label"))),
nodes,
by.x="label", by.y="name")
to the vertices argument if you needed the other node attributes.
Data:
links <- read.table(header=T, text="
source target value sourceID targetID
1 3 4 0.6245 1450552 1519842
2 6 8 0.5723 2607133 3051992
3 9 7 0.7150 3101536 3025831
4 0 1 0.7695 401517 425784
5 2 5 0.5535 1045501 2258363")
nodes <- read.table(header=T, text="
name group size
1 401517 1 8
2 425784 1 8
3 1045501 1 8
4 1450552 1 8
5 1519842 1 8
6 2258363 1 8
7 2607133 1 8
8 3025831 1 8
9 3051992 1 8
10 3101536 1 8")
It appears I was able to repurpose the answer to this question to achieve this.
r igraph - how to add labels to vertices based on vertex id
The key was to use the vertex.label attribute within plot() and select a sliced subset of nodes$name.
For our index we can use the ordered default labels that igraph assigns automatically. To extract these, you can type V(gg)$name.
Within plot(gg) we can then write:
vertex.label = nodes[c(as.numeric(V(gg)$name)+1),]$name
# 1 Convert to numeric
# 2 Add 1 for offset between proxy links index and nodes index
# 3 Select subset of nodes with above as row index. Return name column
As full code:
gg <- graph.data.frame(links,directed=FALSE)
plot(gg, vertex.color = 'lightblue', edge.label=links$value, vertex.size=1, edge.color="darkgreen",
vertex.label.font=1, edge.label.font =1, edge.label.cex = 1,
vertex.label.cex = 2, vertex.label = nodes[c(as.numeric(V(gg)$name)+1),]$name)
With the data above, this gave:
The easiest solution would be to reorder the columns of links, because according to the documentation:
"If vertices is NULL, then the first two columns of d are used as a symbolic edge list and additional columns as edge attributes."
Hence, your code will give the correct output after running:
links <- links[,c(4,5,3)]
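The offset between the zero-based proxy indexes and the rows of nodes can be checked without igraph. The base-R sketch below uses the data from this question; source_name, target_name and edges are illustrative names. It maps each index to its node name and confirms that the result matches sourceID and targetID:

```r
links <- read.table(header = TRUE, text = "
source target value sourceID targetID
1 3 4 0.6245 1450552 1519842
2 6 8 0.5723 2607133 3051992
3 9 7 0.7150 3101536 3025831
4 0 1 0.7695 401517 425784
5 2 5 0.5535 1045501 2258363")

nodes <- read.table(header = TRUE, text = "
name group size
1 401517 1 8
2 425784 1 8
3 1045501 1 8
4 1450552 1 8
5 1519842 1 8
6 2258363 1 8
7 2607133 1 8
8 3025831 1 8
9 3051992 1 8
10 3101536 1 8")

# Proxy indexes are zero-based, rows of nodes are one-based: add 1 to look up names
source_name <- nodes$name[links$source + 1]
target_name <- nodes$name[links$target + 1]

# Both lookups should reproduce the ID columns already present in links
print(identical(source_name, links$sourceID))
print(identical(target_name, links$targetID))

# Edge list keyed by real IDs: the first two columns become the vertex names
edges <- data.frame(from = source_name, to = target_name, value = links$value)
print(edges)
```

A data frame shaped like edges can be passed to graph.data.frame() in place of links, since the first two columns are used as the edge list.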

Plotting tetrahedron with data points in R

I'm in a little bit of pain at the moment.
I'm looking for a way to plot compositional data (https://en.wikipedia.org/wiki/Compositional_data). I have four categories, so the data must be representable in a 3D simplex (since one category is always 1 minus the sum of the others).
So I have to plot a tetrahedron (whose vertices will be my four categories) that contains my data points.
I've found this GitHub gist, https://gist.github.com/rmaia/5439815, but the use of the pavo package (tcs, vismodel...) is pretty obscure to me.
I've also found something else in the compositions package, with the function plot3D. But in this case an RGL device is opened(?!) and I don't really need a rotating plot, just a static plot, since I want to save it as an image and insert it into my thesis.
Update: data looks like this. Consider only columns violent_crime (total), rape, murder, robbery, aggravated_assault
cities violent_crime murder rape rape(legally revised) robbery aggravated_assault
1 Autauga 68 2 8 NA 6 52
2 Baldwin 98 0 4 NA 18 76
3 Barbour 17 2 2 NA 2 11
4 Bibb 4 0 1 NA 0 3
5 Blount 90 0 6 NA 1 83
6 Bullock 15 0 0 NA 3 12
7 Butler 44 1 7 NA 4 32
8 Calhoun 15 0 3 NA 1 11
9 Chambers 4 0 0 NA 2 2
10 Cherokee 49 2 8 NA 2 37
Update: my final plot with the compositions package
Here is how you can do this without a dedicated package by using geometry and plot3D. Using the data you provided:
# Load test data
df <- read.csv("test.csv")[, c("murder", "robbery", "rape", "aggravated_assault")]
# Convert absolute data to relative
df <- t(apply(df, 1, function(x) x / sum(x)))
# Compute tetrahedron coordinates according to https://mathoverflow.net/a/184585
simplex <- function(n) {
qr.Q(qr(matrix(1, nrow=n)) ,complete = TRUE)[,-1]
}
tetra <- simplex(4)
# Convert barycentric coordinates (4D) to cartesian coordinates (3D)
library(geometry)
df3D <- bary2cart(tetra, df)
# Plot data
library(plot3D)
scatter3D(df3D[,1], df3D[,2], df3D[,3],
xlim = range(tetra[,1]), ylim = range(tetra[,2]), zlim = range(tetra[,3]),
col = "blue", pch = 16, box = FALSE, theta = 120)
lines3D(tetra[c(1,2,3,4,1,3,1,2,4),1],
tetra[c(1,2,3,4,1,3,1,2,4),2],
tetra[c(1,2,3,4,1,3,1,2,4),3],
col = "grey", add = TRUE)
text3D(tetra[,1], tetra[,2], tetra[,3],
colnames(df), add = TRUE)
You can tweak the orientation with the phi and theta arguments in scatter3D.
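To see what the two key steps do, here is a small base-R sketch. It checks that simplex(4) gives a regular tetrahedron, and it performs the barycentric-to-Cartesian conversion as a plain matrix product, which to my understanding is equivalent to what bary2cart() computes. The three rows of counts are copied from the table in the question; the names counts, comp and pts are illustrative:

```r
# Vertices of a regular simplex with n corners, embedded in n - 1 dimensions
simplex <- function(n) {
  qr.Q(qr(matrix(1, nrow = n)), complete = TRUE)[, -1]
}
tetra <- simplex(4)

# All six edges of the tetrahedron should have the same length
edge_lengths <- as.vector(dist(tetra))
print(edge_lengths)

# First three counties from the question: murder, robbery, rape, aggravated_assault
counts <- rbind(
  Autauga = c(murder = 2, robbery = 6,  rape = 8, aggravated_assault = 52),
  Baldwin = c(murder = 0, robbery = 18, rape = 4, aggravated_assault = 76),
  Barbour = c(murder = 2, robbery = 2,  rape = 2, aggravated_assault = 11)
)

# Convert counts to proportions: each row now sums to 1
comp <- counts / rowSums(counts)

# Barycentric (4 weights) to Cartesian (3 coordinates): weighted sum of the vertices
pts <- comp %*% tetra
print(pts)
```

A composition made up of a single category lands exactly on the matching vertex, and mixed compositions fall inside the tetrahedron. Since plot3D draws with base graphics, wrapping the plotting calls in png("plot.png") and dev.off() should give a static image for a thesis.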
