Mosaic Plot of Data Frame - r

I have a data set (called group2) that looks like this
ticks var1 var2
11 2010-09-19 0 2
12 2010-09-20 1 4
16 2010-09-24 0 1
17 2010-09-26 1 1
18 2010-09-27 0 1
27 2010-10-06 0 1
29 2010-10-08 0 1
30 2010-10-10 1 1
31 2010-10-12 2 2
38 2010-10-19 0 2
39 2010-10-20 0 2
41 2010-10-22 0 2
42 2010-10-23 1 5
43 2010-10-24 2 3
44 2010-10-25 1 2
68 2010-11-19 3 4
83 2010-12-04 1 1
I want to make a mosaic plot with the dates on the x-axis and the categories (var1, var2) as the segments of the vertical bars.
I used mosaicplot(group2[,2:3], col = c(7, 5), las = 3), but the top part of the image does not look right.
I also want the dates to show at the top (vertically).
Thanks!

mosaicplot needs a table (or a matrix) as its first argument. Here is a workaround for your setting:
## Fake data set up
group2 <- data.frame(
"ticks" = as.Date(c("2010-09-19","2010-09-20","2010-09-24")),
"var1" = c(0,1,0),
"var2" = c(2,4,1))
## matrix creation
my.tab <- as.matrix(group2[,2:3])
rownames(my.tab) <- as.character(group2$ticks)
colnames(my.tab) <- c("var1","var2")
## plotting
mosaicplot(my.tab,
col = c(7, 5),
las = 3,
main = "Mosaic plot")
With more dates, the plot should look better than in this quick three-date example.
You may also consider las=2 (more readable).
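The same workaround can be tried on more of the posted data. This is only a sketch: it reads the first five rows of the question's table from inline text, and the name counts is made up for illustration.

```r
# First five rows of the data shown in the question, read from inline text
group2 <- read.table(text = "
ticks var1 var2
2010-09-19 0 2
2010-09-20 1 4
2010-09-24 0 1
2010-09-26 1 1
2010-09-27 0 1
", header = TRUE, stringsAsFactors = FALSE)

# Numeric matrix with one row per date; the dates become the row names
counts <- as.matrix(group2[, c("var1", "var2")])
rownames(counts) <- group2$ticks

# Rows (dates) are split along the x-axis, columns (var1, var2) within each bar
mosaicplot(counts, col = c(7, 5), las = 2, main = "Mosaic plot")
```

Using the dates as row names is what puts them on the x-axis; las only controls how the labels are rotated.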

Related

Setting correct Y-axis labels from column names for boxplots using loops

I have a data frame (data2) with multiple columns as variables. I ran the code below to create individual boxplots, but the y-axis of each boxplot is labeled "i" instead of the column name. How do I fix this?
for (i in data2[, c(5:36)]) {
boxplot(i ~ data2$cv,
xlab = "CV")
}
A sample of the first 6 rows and first 7 columns is below.
block loc cv rep days_til_flower days_til_anthesis days_til_harvest
1 1 H12 CR 1 9 21 59
2 1 H12 CR 2 7 20 57
3 1 H12 LB 1 7 20 62
4 1 H12 LB 2 13 21 62
5 1 H12 YC 1 7 17 59
6 1 H12 YC 2 7 16 59
Not sure what you're looking for exactly, but here's how I would do it with the mtcars data:
par(mfrow=c(2,3))
for(i in c(1,3,4,5,6,7)){
boxplot(mtcars[,i] ~ mtcars$am, xlab="Transmission (0 = automatic, 1 = manual)", ylab = names(mtcars)[i])
}
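Applied to the data in the question, the same idea might look like the sketch below. It loops over column names instead of column contents, so the name is available for ylab. The data frame is retyped from the six sample rows, and measure_cols is a made-up name; on the full data the index would be 5:36.

```r
# Six sample rows retyped from the question
data2 <- data.frame(
  block = 1,
  loc = "H12",
  cv = rep(c("CR", "LB", "YC"), each = 2),
  rep = rep(1:2, times = 3),
  days_til_flower = c(9, 7, 7, 13, 7, 7),
  days_til_anthesis = c(21, 20, 20, 21, 17, 16),
  days_til_harvest = c(59, 57, 62, 62, 59, 59)
)

# Names of the measurement columns (5:7 here, 5:36 on the full data)
measure_cols <- names(data2)[5:7]

par(mfrow = c(1, 3))
for (nm in measure_cols) {
  # data2[[nm]] fetches the column, nm itself is the label
  boxplot(data2[[nm]] ~ data2$cv, xlab = "CV", ylab = nm)
}
```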

Dividing all possible rows within a given sub-data in R

My data looks like this:
set <- c(1,1,1,2,2,3,3,3,3,3,4,4)
density <- c(1,3,3,1,3,1,1,1,3,3,1,3)
counts <- c(100,2,4,76,33,12,44,13,54,36,65,1)
data <- data.frame(set,density,counts)
data$set <- as.factor(data$set)
data$density <- as.factor(data$density)
Within a given set there are two levels of density, "1" or "3". For a given set, I want to divide the counts of density "1" by the counts of density "3" for all possible combinations. I then want to print the original counts associated with density "1", the ratio, and the set.
For example, the result for the first few rows should look like:
set counts ratio
1 100 50 #100/2
1 100 25 #100/4
2 76 2.3 #76/33
3 12 0.22 #12/54
3 12 0.33 #12/36
3 44 0.8148 #44/54
...
I thought I could achieve it with dplyr, but it seems a little too complicated for dplyr.
It looks like the comments get you most of the way there. Here's a dplyr solution. With left_join each of the density1's get matched up with all density3's in the same set, providing output in line with your specification.
# Edited below to use dplyr syntax; my base syntax had a typo
library(dplyr)
data_combined <- data %>% filter(density == 1) %>%
# Match each 1 w/ each 3 in the set
left_join(data %>% filter(density == 3), by = "set") %>%
mutate(ratio = counts.x / counts.y) %>%
select(set, counts.x, counts.y, ratio)
data_combined
# set counts.x counts.y ratio
#1 1 100 2 50.0000000
#2 1 100 4 25.0000000
#3 2 76 33 2.3030303
#4 3 12 54 0.2222222
#5 3 12 36 0.3333333
#6 3 44 54 0.8148148
#7 3 44 36 1.2222222
#8 3 13 54 0.2407407
#9 3 13 36 0.3611111
#10 4 65 1 65.0000000
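For comparison, the same pairing can be sketched in base R with merge(), which also matches every density-1 row with every density-3 row of the same set. The names d1, d3 and combos are made up for this illustration, and the row order of the result may differ from the dplyr output.

```r
# Data from the question
set <- c(1, 1, 1, 2, 2, 3, 3, 3, 3, 3, 4, 4)
density <- c(1, 3, 3, 1, 3, 1, 1, 1, 3, 3, 1, 3)
counts <- c(100, 2, 4, 76, 33, 12, 44, 13, 54, 36, 65, 1)
data <- data.frame(set, density, counts)

# Split by density level, keeping only the columns needed for the join
d1 <- data[data$density == 1, c("set", "counts")]
d3 <- data[data$density == 3, c("set", "counts")]

# merge() returns all combinations of matching rows within each set
combos <- merge(d1, d3, by = "set", suffixes = c(".d1", ".d3"))
combos$ratio <- combos$counts.d1 / combos$counts.d3
combos
```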

Plotting tetrahedron with data points in R

I'm in a little bit of pain at the moment.
I'm looking for a way to plot compositional data (https://en.wikipedia.org/wiki/Compositional_data). I have four categories, so the data must be representable in a 3D simplex (since one category is always 1 minus the sum of the others).
So I have to plot a tetrahedron (whose vertices will be my four categories) that contains my data points.
I've found this gist, https://gist.github.com/rmaia/5439815, but its use of the pavo package (tcs, vismodel...) is pretty obscure to me.
I've also found something else in the compositions package, the function plot3D. But in this case an RGL device is opened (?!), and I don't really need a rotating plot, just a static one, since I want to save it as an image and insert it into my thesis.
Update: data looks like this. Consider only columns violent_crime (total), rape, murder, robbery, aggravated_assault
   cities   violent_crime murder rape rape(legally revised) robbery aggravated_assault
1  Autauga             68      2    8                    NA       6                 52
2  Baldwin             98      0    4                    NA      18                 76
3  Barbour             17      2    2                    NA       2                 11
4  Bibb                 4      0    1                    NA       0                  3
5  Blount              90      0    6                    NA       1                 83
6  Bullock             15      0    0                    NA       3                 12
7  Butler              44      1    7                    NA       4                 32
8  Calhoun             15      0    3                    NA       1                 11
9  Chambers             4      0    0                    NA       2                  2
10 Cherokee            49      2    8                    NA       2                 37
Update: I made my final plot with the compositions package.
Here is how you can do this without a dedicated package by using geometry and plot3D. Using the data you provided:
# Load test data
df <- read.csv("test.csv")[, c("murder", "robbery", "rape", "aggravated_assault")]
# Convert absolute data to relative
df <- t(apply(df, 1, function(x) x / sum(x)))
# Compute tetrahedron coordinates according to https://mathoverflow.net/a/184585
simplex <- function(n) {
qr.Q(qr(matrix(1, nrow=n)) ,complete = TRUE)[,-1]
}
tetra <- simplex(4)
# Convert barycentric coordinates (4D) to cartesian coordinates (3D)
library(geometry)
df3D <- bary2cart(tetra, df)
# Plot data
library(plot3D)
scatter3D(df3D[,1], df3D[,2], df3D[,3],
xlim = range(tetra[,1]), ylim = range(tetra[,2]), zlim = range(tetra[,3]),
col = "blue", pch = 16, box = FALSE, theta = 120)
lines3D(tetra[c(1,2,3,4,1,3,1,2,4),1],
tetra[c(1,2,3,4,1,3,1,2,4),2],
tetra[c(1,2,3,4,1,3,1,2,4),3],
col = "grey", add = TRUE)
text3D(tetra[,1], tetra[,2], tetra[,3],
colnames(df), add = TRUE)
You can tweak the orientation with the phi and theta arguments in scatter3D.
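As a cross-check that needs no extra packages: a point's Cartesian position is the composition-weighted average of the tetrahedron's vertices, which is a plain matrix product. The sketch below assumes that this is what bary2cart computes, and it uses three rows typed in from the table in the question instead of test.csv. The names counts, props and xyz are made up for illustration.

```r
# Three rows from the question's table (murder + robbery + rape +
# aggravated_assault add up to violent_crime)
counts <- rbind(
  Autauga = c(murder = 2, robbery = 6,  rape = 8, aggravated_assault = 52),
  Baldwin = c(murder = 0, robbery = 18, rape = 4, aggravated_assault = 76),
  Barbour = c(murder = 2, robbery = 2,  rape = 2, aggravated_assault = 11)
)

# Convert counts to proportions so each row sums to 1
props <- counts / rowSums(counts)

# Vertices of a regular tetrahedron, same construction as in the answer
simplex <- function(n) {
  qr.Q(qr(matrix(1, nrow = n)), complete = TRUE)[, -1]
}
tetra <- simplex(4)

# Weighted average of the four vertices: a 3 x 3 matrix of x, y, z coordinates
xyz <- props %*% tetra
xyz
```

For a static image, the plotting calls can be wrapped between png("tetra.png") and dev.off(), where the file name is just a placeholder.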

Tidying Time Intervals for Plotting Histogram in R

I'm doing some cluster analysis on the MLTobs from the LifeTables package and have come across a tricky problem with the Year variable in the mlt.mx.info dataframe. Year contains the period that the life table was taken, in intervals. Here's a table of the data:
1751-1754 1755-1759 1760-1764 1765-1769 1770-1774 1775-1779 1780-1784 1785-1789 1790-1794
1 1 1 1 1 1 1 1 1
1795-1799 1800-1804 1805-1809 1810-1814 1815-1819 1816-1819 1820-1824 1825-1829 1830-1834
1 1 1 1 1 2 3 3 3
1835-1839 1838-1839 1840-1844 1841-1844 1845-1849 1846-1849 1850-1854 1855-1859 1860-1864
4 1 5 3 8 1 10 11 11
1865-1869 1870-1874 1872-1874 1875-1879 1876-1879 1878-1879 1880-1884 1885-1889 1890-1894
11 11 1 12 2 1 15 15 15
1895-1899 1900-1904 1905-1909 1908-1909 1910-1914 1915-1919 1920-1924 1921-1924 1922-1924
15 15 15 1 16 16 16 2 1
1925-1929 1930-1934 1933-1934 1935-1939 1937-1939 1940-1944 1945-1949 1947-1949 1948-1949
19 19 1 20 1 22 22 3 1
1950-1954 1955-1959 1956-1959 1958-1959 1960-1964 1965-1969 1970-1974 1975-1979 1980-1984
30 30 2 1 40 40 41 41 41
1983-1984 1985-1989 1990-1994 1991-1994 1992-1994 1995-1999 2000-2003 2000-2004 2005-2006
1 42 42 1 1 44 3 41 22
2005-2007
14
As you can see, some of the intervals sit within other intervals. Thankfully none of them overlap. I want to simplify the intervals so intervals such as 1992-1994 and 1991-1994 all go into 1990-1994.
One idea might be to take the modulo of each interval and sort the intervals into their new bins that way, but I'm unsure how to do this with the interval data type. If anyone has any ideas I'd really appreciate the help. Ultimately I want to create a histogram or barplot to illustrate the data nicely.
If I understand your problem, you'll want something like this:
bottom <- seq(1750, 2010, 5)
library(dplyr)
new_df <- mlt.mx.info %>%
arrange(Year) %>%
mutate(year2 = as.numeric(substr(Year, 6, 9))) %>%
mutate(new_year = paste0(bottom[findInterval(year2, bottom)], "-",(bottom[findInterval(year2, bottom) + 1] - 1)))
View(new_df)
What this does is create five-year bins and add a new column (new_year) holding the label of the bin that each interval's end year falls into. So everything from 1750-1754 will correspond to a new value of 1750-1754, and both 1991-1994 and 1992-1994 end up in 1990-1994 (as a string; the original is an integer type, and I'm not sure how to fix that). Does this do what you want? Double check the results, but it looks right to me.
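The binning step can be checked on its own with a handful of interval labels taken from the table in the question. This is a base R sketch without dplyr; year, end_year and idx are made-up names.

```r
# A few interval labels copied from the table in the question
year <- c("1751-1754", "1815-1819", "1816-1819",
          "1991-1994", "1992-1994", "2000-2003", "2005-2007")

# Lower bounds of the five-year bins
bottom <- seq(1750, 2010, by = 5)

# End year of each interval (characters 6 to 9 of the label)
end_year <- as.numeric(substr(year, 6, 9))

# Index of the bin each end year falls into, then the bin label
idx <- findInterval(end_year, bottom)
new_year <- paste0(bottom[idx], "-", bottom[idx + 1] - 1)

# Number of entries per simplified interval
table(new_year)
```

On the real data, barplot(table(new_df$new_year), las = 2) would then give one bar per simplified interval.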

ggplot2 is plotting a line strangely

I am trying to plot the time series x_t = A + (-1)^t B.
To do this I am using the following code. The problem is that the ggplot output is wrong.
require (ggplot2)
set.seed(42)
N<-2
A<-sample(1:20,N)
B<-rnorm(N)
X<-c(A+B,A-B)
dat<-sapply(1:N,function(n) X[rep(c(n,N+n),20)],simplify=FALSE)
dat<-data.frame(t=rep(1:20,N),w=rep(A,each=20),val=do.call(c,dat))
ggplot(data=dat,aes(x=t, y=val, color=factor(w)))+
geom_line()+facet_grid(w~.,scale = "free")
Looking at the head of dat, everything looks right:
> head(dat)
t w val
1 1 12 10.5533
2 2 12 13.4467
3 3 12 10.5533
4 4 12 13.4467
5 5 12 10.5533
6 6 12 13.4467
So the lower (blue) line should only have values 10.5533 and 13.4467. But it also takes different values. What is wrong in my code?
Thanks in advance for any help
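You really should be more careful before asserting that something is "wrong". The way you are creating dat, val has 80 elements (each series repeats its pair of values 20 times, which gives 40 values, not 20) while t and w have only 40, so data.frame() recycles t and w and every combination of t and w appears twice. The rows are also not ordered by dat$t, so head(...) is not displaying the extra values: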
head(dat[order(dat$w,dat$t),],10)
# t w val
# 21 1 18 18.43530
# 61 1 18 18.36313
# 22 2 18 19.56470
# 62 2 18 17.63687
# 23 3 18 18.43530
# 63 3 18 18.36313
# 24 4 18 19.56470
# 64 4 18 17.63687
# 25 5 18 18.43530
# 65 5 18 18.36313
Note the row numbers.
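One possible fix, as a sketch: repeat each pair of values 10 times instead of 20, so that every series gets exactly 20 values and nothing is recycled. The per-panel check at the end is an addition for illustration and is not part of the original code.

```r
set.seed(42)
N <- 2
A <- sample(1:20, N)
B <- rnorm(N)
X <- c(A + B, A - B)

# Each series alternates between X[n] and X[N + n]; 10 repetitions of the
# pair give 20 values per series, matching t = 1:20
val <- unlist(lapply(1:N, function(n) X[rep(c(n, N + n), 10)]))

dat <- data.frame(t = rep(1:20, N), w = rep(A, each = 20), val = val)

# Every panel (value of w) should now contain exactly two distinct values
tapply(dat$val, dat$w, function(v) length(unique(v)))
```

Passing this dat to the ggplot call from the question should then draw one clean zigzag per panel.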
