Plotting a tetrahedron with data points in R

I'm in a little bit of pain at the moment.
I'm looking for a way to plot compositional data (https://en.wikipedia.org/wiki/Compositional_data). I have four categories, so the data can be represented in a 3D simplex (since each category is always 1 minus the sum of the others).
So I have to plot a tetrahedron (the vertices will be my four categories) that contains my data points.
I've found this GitHub gist, https://gist.github.com/rmaia/5439815, but its use of the pavo package (tcs, vismodel...) is pretty obscure to me.
I've also found something else in the compositions package, the function plot3D. But in this case an RGL device is opened(?!), and I don't really need a rotating plot, just a static one, since I want to save it as an image and insert it into my thesis.
Update: data looks like this. Consider only columns violent_crime (total), rape, murder, robbery, aggravated_assault
   cities   violent_crime murder rape rape(legally revised) robbery aggravated_assault
1  Autauga             68      2    8                    NA       6                 52
2  Baldwin             98      0    4                    NA      18                 76
3  Barbour             17      2    2                    NA       2                 11
4  Bibb                 4      0    1                    NA       0                  3
5  Blount              90      0    6                    NA       1                 83
6  Bullock             15      0    0                    NA       3                 12
7  Butler              44      1    7                    NA       4                 32
8  Calhoun             15      0    3                    NA       1                 11
9  Chambers             4      0    0                    NA       2                  2
10 Cherokee            49      2    8                    NA       2                 37
Update: my final plot with composition package

Here is how you can do this without a dedicated package by using geometry and plot3D. Using the data you provided:
# Load test data
df <- read.csv("test.csv")[, c("murder", "robbery", "rape", "aggravated_assault")]
# Convert absolute data to relative
df <- t(apply(df, 1, function(x) x / sum(x)))
# Compute tetrahedron coordinates according to https://mathoverflow.net/a/184585
simplex <- function(n) {
  qr.Q(qr(matrix(1, nrow = n)), complete = TRUE)[, -1]
}
tetra <- simplex(4)
# Convert barycentric coordinates (4D) to cartesian coordinates (3D)
library(geometry)
df3D <- bary2cart(tetra, df)
# Plot data
library(plot3D)
scatter3D(df3D[,1], df3D[,2], df3D[,3],
          xlim = range(tetra[,1]), ylim = range(tetra[,2]), zlim = range(tetra[,3]),
          col = "blue", pch = 16, box = FALSE, theta = 120)
# Draw the six edges of the tetrahedron as one connected path
lines3D(tetra[c(1,2,3,4,1,3,1,2,4),1],
        tetra[c(1,2,3,4,1,3,1,2,4),2],
        tetra[c(1,2,3,4,1,3,1,2,4),3],
        col = "grey", add = TRUE)
# Label each vertex with its category name
text3D(tetra[,1], tetra[,2], tetra[,3],
       colnames(df), add = TRUE)
You can tweak the orientation with the phi and theta arguments in scatter3D.
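As a sanity check on the conversion above, here is a minimal base-R sketch. It assumes that the barycentric-to-Cartesian step is just a weighted average of the vertices, i.e. a matrix product of the compositions with the vertex matrix. The counts are the first three rows of the question's table, and the names counts, comp and xyz are only illustrative.

```r
# Vertices of a regular tetrahedron (same helper as in the answer)
simplex <- function(n) qr.Q(qr(matrix(1, nrow = n)), complete = TRUE)[, -1]
tetra <- simplex(4)

# All six edges should have the same length
edge_lengths <- as.numeric(dist(tetra))

# First three rows of the question's data (murder, robbery, rape, aggravated_assault)
counts <- matrix(c(2,  6, 8, 52,
                   0, 18, 4, 76,
                   2,  2, 2, 11),
                 ncol = 4, byrow = TRUE,
                 dimnames = list(c("Autauga", "Baldwin", "Barbour"),
                                 c("murder", "robbery", "rape", "aggravated_assault")))

# Convert counts to compositions (each row sums to 1)
comp <- counts / rowSums(counts)

# Barycentric -> Cartesian: weighted average of the four vertices
xyz <- comp %*% tetra
print(xyz)
```

Since plot3D draws on ordinary base graphics devices, a static image for a thesis can be produced by wrapping the plotting calls between png("tetra.png") and dev.off(); the file name here is just an example.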

Related

R: Find out which observations are located in each "bar" of the histogram

I am working with the R programming language. Suppose I have the following data:
a = rnorm(1000,10,1)
b = rnorm(200,3,1)
c = rnorm(200,13,1)
d = c(a,b,c)
index <- 1:1400
my_data = data.frame(index,d)
I can make the following histograms of the same data by adjusting the "bin" length (via the "breaks" option):
hist(my_data$d, breaks = 10, main = "Histogram #1, Breaks = 10")
hist(my_data$d, breaks = 100, main = "Histogram #2, Breaks = 100")
hist(my_data$d, breaks = 5, main = "Histogram #3, Breaks = 5")
My Question: In each one of these histograms there are a different number of "bars" (i.e. bins). For example, in the first histogram there are 8 bars and in the third histogram there are 4 bars. For each one of these histograms, is there a way to find out which observations (from the original file "d") are located in each bar?
Right now, I am trying to manually do this, e.g. (for histogram #3)
histogram3_bar1 <- my_data[which(my_data$d < 5 & my_data$d > 0), ]
histogram3_bar2 <- my_data[which(my_data$d < 10 & my_data$d > 5), ]
histogram3_bar3 <- my_data[which(my_data$d < 15 & my_data$d > 10), ]
histogram3_bar4 <- my_data[which(my_data$d < 20 & my_data$d > 15), ]
head(histogram3_bar1)
index d
1001 1001 4.156393
1002 1002 3.358958
1003 1003 1.605904
1004 1004 3.603535
1006 1006 2.943456
1007 1007 1.586542
But is there a more "efficient" way to do this?
Thanks!
hist itself can solve the problem of finding out which data points are in which intervals: it returns a list whose first member, breaks, holds the bin boundaries.
First, make the problem reproducible by setting the RNG seed.
set.seed(2021)
a = rnorm(1000,10,1)
b = rnorm(200,3,1)
c = rnorm(200,13,1)
d = c(a,b,c)
Now save the return value of hist and have findInterval report which bin each data point falls in.
h1 <- hist(d, breaks = 10)
f1 <- findInterval(d, h1$breaks)
h1$breaks
# [1] -2 0 2 4 6 8 10 12 14 16
head(f1)
#[1] 6 7 7 7 7 6
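The first six observations fall in bins 6 and 7, whose end points are 8, 10 and 12. To verify this, compare head(d) with those end points. Indexing d by f1 does not show it: f1 holds bin numbers, not positions in d, so the call below just returns d[6] and d[7]:
head(d[f1])
#[1] 8.07743 10.26174 10.26174 10.26174 10.26174 8.07743
As for whether the intervals are left- or right-closed: findInterval uses left-closed intervals by default, while hist uses right-closed ones (right = TRUE), so a value lying exactly on a break point can land in different bins. See help("findInterval") and its left.open argument.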
As a final check, table the values returned by findInterval and see if they match the histogram's counts.
table(f1)
#f1
# 1 2 3 4 5 6 7 8 9
# 2 34 130 34 17 478 512 169 24
h1$counts
#[1] 2 34 130 34 17 478 512 169 24
To get the interval end points for each data point, build the following data frame:
bins <- data.frame(bin = f1, min = h1$breaks[f1], max = h1$breaks[f1 + 1L])
head(bins)
# bin min max
#1 6 8 10
#2 7 10 12
#3 7 10 12
#4 7 10 12
#5 7 10 12
#6 6 8 10
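To go one step further and collect the observations bar by bar, one possible sketch (swapping findInterval for cut, which with include.lowest = TRUE follows the same right-closed convention as hist's defaults) splits the data on the bin factor. The names h, bin and by_bin are only illustrative.

```r
set.seed(2021)
d <- c(rnorm(1000, 10, 1), rnorm(200, 3, 1), rnorm(200, 13, 1))
my_data <- data.frame(index = seq_along(d), d = d)

# Compute the histogram without drawing it
h <- hist(my_data$d, breaks = 5, plot = FALSE)

# Assign each observation to a bin, using right-closed intervals like hist()
bin <- cut(my_data$d, breaks = h$breaks, include.lowest = TRUE)

# One data frame per bar, named after its interval
by_bin <- split(my_data, bin)

# Number of observations in each bar
print(sapply(by_bin, nrow))
```

Each element of by_bin then plays the role of the hand-written histogram3_bar1, histogram3_bar2, and so on.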

Instantaneous velocity in RStudio

(RStudio) Suppose I have a data set like this:
# Circle X Y
1 A 21 8
2 A 32 17
3 A 23 32
4 B 22 4
5 B 43 12
6 C 12 4
.....
I need to find the instantaneous velocity of each circle at each time frame.
The first line of each circle is the starting point, so the velocity there is 0. The formula I want to apply to each circle's (X, Y) coordinates is sqrt(((x2-x1)^2 + (y2-y1)^2)/2)), where (x1, y1) comes from the previous line and (x2, y2) from the current one (e.g. line 1 & line 2, line 2 & line 3). The final result I want is as below:
# Circle X Y Instant velocity
1 A 21 8 0
2 A 32 17 sqrt(((32-21)^2 + (17-8)^2)/2))
3 A 23 32 sqrt(((23-32)^2 + (32-17)^2)/2))
4 B 22 4 0
5 B 43 12 sqrt(((43-22)^2 + (12-4)^2)/2))
6 C 12 4 0
.....
Could anyone help me achieve this in RStudio?
You have one more ) than ( in your code example, which makes me a bit confused about where the /2 goes, but if you verify my syntax something like this should work:
library(dplyr)
your_data %>%
  group_by(Circle) %>%
  mutate(
    # lag() is NA on the first row of each circle; coalesce() turns that into 0
    instant_velocity = coalesce(sqrt(((X - lag(X))^2 + (Y - lag(Y))^2) / 2), 0)
  ) %>%
  ungroup()
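For comparison, the same per-circle computation can be sketched in base R with ave() and diff(). This assumes, like the answer above, that the /2 belongs inside the square root; the helper name step_sq is only illustrative.

```r
dat <- data.frame(
  Circle = c("A", "A", "A", "B", "B", "C"),
  X = c(21, 32, 23, 22, 43, 12),
  Y = c(8, 17, 32, 4, 12, 4)
)

# Squared step from the previous row; the first row of a group gets 0
step_sq <- function(v) c(0, diff(v))^2

# Apply the helper within each circle
dx2 <- ave(dat$X, dat$Circle, FUN = step_sq)
dy2 <- ave(dat$Y, dat$Circle, FUN = step_sq)

# Assumption: the division by 2 sits inside the square root
dat$instant_velocity <- sqrt((dx2 + dy2) / 2)
print(dat)
```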

approx() without duplicates?

I am using approx() to interpolate values.
x <- 1:20
y <- c(3,8,2,6,8,2,4,7,9,9,1,3,1,9,6,2,8,7,6,2)
df <- cbind.data.frame(x,y)
> df
x y
1 1 3
2 2 8
3 3 2
4 4 6
5 5 8
6 6 2
7 7 4
8 8 7
9 9 9
10 10 9
11 11 1
12 12 3
13 13 1
14 14 9
15 15 6
16 16 2
17 17 8
18 18 7
19 19 6
20 20 2
interpolated <- approx(x=df$x, y=df$y, method="linear", n=5)
gets me this:
interpolated
$x
[1] 1.00 5.75 10.50 15.25 20.00
$y
[1] 3.0 3.5 5.0 5.0 2.0
Now, the first and last values are duplicates of my real data. Is there any way to prevent this, or is there something I don't understand properly about approx()?
You may want to specify xout to avoid this. For instance, if you want to always exclude the first and the last points, here's how you can do that:
specify_xout <- function(x, n) {
  # n equally spaced points strictly between min(x) and max(x)
  seq(from = min(x), to = max(x), length.out = n + 2)[-c(1, n + 2)]
}
plot(df$x, df$y)
points(approx(df$x, df$y, xout=specify_xout(df$x, 5)), pch = "*", col = "red")
This does not prevent an interpolated point from coinciding with an existing point somewhere in the middle (which is exactly what happens in the picture below).
approx will pass through all your original data points if you give it a chance (change n=5 to xout=df$x to see this). Interpolation is the process of generating values of y for unobserved values of x, but the result should agree with the data wherever x has already been observed.
The method="linear" setup 'draws' linear segments joining up your original coordinates exactly (and so returns the y values you gave it at integer x). You only observe 'new' y values because, with n=5, every x other than the first and last is not an integer (and therefore not one of your input values), and so gets interpolated.
If you want observed values not to be reproduced exactly, then maybe add some noise via rnorm?
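Both points can be checked with a short sketch on the question's data: interpolating at the observed x returns the observed y, and the specify_xout helper from the first answer yields n interior points that exclude the end points. The names same and inner are only illustrative.

```r
x <- 1:20
y <- c(3, 8, 2, 6, 8, 2, 4, 7, 9, 9, 1, 3, 1, 9, 6, 2, 8, 7, 6, 2)

# Interpolating at the observed x values reproduces the observed y values
same <- approx(x, y, xout = x)$y

# n equally spaced points strictly between min(x) and max(x)
specify_xout <- function(x, n) {
  seq(from = min(x), to = max(x), length.out = n + 2)[-c(1, n + 2)]
}

# Five interpolated points that avoid the first and last observation
inner <- approx(x, y, xout = specify_xout(x, 5))
print(inner)
```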

R: Interpolation between raster layers of different dates

Let's say I have 4 raster layers with the same extent, holding data for 4 different years: 2006, 2008, 2010 and 2012:
library(raster)
r2006<-raster(ncol=3, nrow=3)
values(r2006)<-1:9
r2008<-raster(ncol=3, nrow=3)
values(r2008)<-3:11
r2010<-raster(ncol=3, nrow=3)
values(r2010)<-5:13
r2012<-raster(ncol=3, nrow=3)
values(r2012)<-7:15
Now I want to create raster layers for every year between 2006 and 2013 (or even longer) by inter-/extrapolating (a linear method should be a good start) the values of the 4 raster layers. The result should look like this:
r2006<-raster(ncol=3, nrow=3)
values(r2006)<-1:9
r2007<-raster(ncol=3, nrow=3)
values(r2007)<-2:10
r2008<-raster(ncol=3, nrow=3)
values(r2008)<-3:11
r2009<-raster(ncol=3, nrow=3)
values(r2009)<-4:12
r2010<-raster(ncol=3, nrow=3)
values(r2010)<-5:13
r2011<-raster(ncol=3, nrow=3)
values(r2011)<-6:14
r2012<-raster(ncol=3, nrow=3)
values(r2012)<-7:15
r2013<-raster(ncol=3, nrow=3)
values(r2013)<-8:16
Using lm() or approxExtrap doesn't seem to help much.
One way to do this is to separate your problem into two parts:
1. Perform the numerical interpolation on the raster values.
2. Apply the interpolated values to the appropriate intermediate raster layers.
Idea: build a data frame from the values() of the raster layers, index it by time, and then apply linear interpolation to those numbers. For linear interpolation I use approxTime from the simecol package.
For your example above,
library(raster)
library(simecol)
df <- data.frame("2006" = 1:9, "2008" = 3:11, "2010" = 5:13, "2012"=7:15)
#transpose since we want time to be the first col, and the values to be columns
new <- data.frame(t(df))
times <- seq(2006, 2012, by=2)
new <- cbind(times, new)
# Now, apply linear interpolation to each cell of the raster
approxTime(new, 2006:2012, rule = 2)
This gives:
# times X1 X2 X3 X4 X5 X6 X7 X8 X9
#1 2006 1 2 3 4 5 6 7 8 9
#2 2007 2 3 4 5 6 7 8 9 10
#3 2008 3 4 5 6 7 8 9 10 11
#4 2009 4 5 6 7 8 9 10 11 12
#5 2010 5 6 7 8 9 10 11 12 13
#6 2011 6 7 8 9 10 11 12 13 14
#7 2012 7 8 9 10 11 12 13 14 15
You can then store this, take each row, and assign it to the values of that year's raster object.
Note: approxTime does not do linear extrapolation. With rule = 2 it simply takes the closest value, so you need to account for that.
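If 2013 is needed as well, one way to account for the missing extrapolation is sketched below in base R. It swaps approxTime for approx() applied cell by cell and extends the last observed segment linearly; that extension rule is an assumption of this sketch, not something approxTime provides, and the names vals, interp_cell and res are only illustrative.

```r
years  <- c(2006, 2008, 2010, 2012)
target <- 2006:2013

# One row per raster cell, one column per observed year
vals <- cbind(1:9, 3:11, 5:13, 7:15)

interp_cell <- function(v) {
  # Linear interpolation inside the observed range (NA outside it)
  out <- approx(years, v, xout = target)$y
  # Assumed rule: extend the last observed segment in a straight line
  n <- length(years)
  slope <- (v[n] - v[n - 1]) / (years[n] - years[n - 1])
  after <- target > years[n]
  out[after] <- v[n] + slope * (target[after] - years[n])
  out
}

# Rows = cells, columns = target years
res <- t(apply(vals, 1, interp_cell))
colnames(res) <- target
print(res)

# Each column can then be assigned to a layer,
# e.g. values(r2013) <- res[, "2013"] with the raster package
```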

Mosaic Plot of Data Frame

I have a data set (called group2) that looks like this
ticks var1 var2
11 2010-09-19 0 2
12 2010-09-20 1 4
16 2010-09-24 0 1
17 2010-09-26 1 1
18 2010-09-27 0 1
27 2010-10-06 0 1
29 2010-10-08 0 1
30 2010-10-10 1 1
31 2010-10-12 2 2
38 2010-10-19 0 2
39 2010-10-20 0 2
41 2010-10-22 0 2
42 2010-10-23 1 5
43 2010-10-24 2 3
44 2010-10-25 1 2
68 2010-11-19 3 4
83 2010-12-04 1 1
I want to make a mosaic plot such that the dates are on the x-axis and the categories (var1, var2) are on the vertical bars.
I used mosaicplot(group2[,2:3], col = c(7, 5), las = 3), but the top part of the image does not look right.
I also want the dates to show at the top (vertically).
Thanks!
mosaicplot needs a table (or a matrix) as its first argument. Here is a workaround for your setting:
## Fake data set up
group2 <- data.frame(
  "ticks" = as.Date(c("2010-09-19", "2010-09-20", "2010-09-24")),
  "var1" = c(0, 1, 0),
  "var2" = c(2, 4, 1))
## matrix creation
my.tab <- as.matrix(group2[, 2:3])
rownames(my.tab) <- as.character(group2$ticks)
colnames(my.tab) <- c("var1", "var2")
## plotting
mosaicplot(my.tab,
           col = c(7, 5),
           las = 3,
           main = "Mosaic plot")
With more dates the result should look better than this quick image:
You may also consider las=2 (more readable).
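For reference, here is the same workaround sketched on the full data set from the question rather than the three-row fake data. The values are typed in from the table above, and las = 2 is the variant suggested as more readable.

```r
group2 <- data.frame(
  ticks = as.Date(c("2010-09-19", "2010-09-20", "2010-09-24", "2010-09-26",
                    "2010-09-27", "2010-10-06", "2010-10-08", "2010-10-10",
                    "2010-10-12", "2010-10-19", "2010-10-20", "2010-10-22",
                    "2010-10-23", "2010-10-24", "2010-10-25", "2010-11-19",
                    "2010-12-04")),
  var1 = c(0, 1, 0, 1, 0, 0, 0, 1, 2, 0, 0, 0, 1, 2, 1, 3, 1),
  var2 = c(2, 4, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 5, 3, 2, 4, 1)
)

# Matrix with dates as row names and categories as column names
my.tab <- as.matrix(group2[, c("var1", "var2")])
rownames(my.tab) <- format(group2$ticks)

# Dates along the x-axis, var1/var2 stacked within each bar
mosaicplot(my.tab, col = c(7, 5), las = 2, main = "Mosaic plot")
```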
