How to create gaps in a smoother for "missing" values (R, ggplot)

If I have a data set like this
set.seed(100)
data <- data.frame("x" = c(1, 1, 1, 2, 2, 2, 3, 4, 4, 4, 5, 5, 5),
"y" = rnorm(13),
"factor" = c("a","b","c","a","b", "c", "c", "a",
"b", "c", "a", "b","c"))
so it looks like this
x y factor
1 1 -0.50219235 a
2 1 0.13153117 b
3 1 -0.07891709 c
4 2 0.88678481 a
5 2 0.11697127 b
6 2 0.31863009 c
7 3 -0.58179068 c
8 4 0.71453271 a
9 4 -0.82525943 b
10 4 -0.35986213 c
11 5 0.08988614 a
12 5 0.09627446 b
13 5 -0.20163395 c
I would like to plot this with a separate smoother for each factor (a, b, c)
library(ggplot2)
ggplot(data = data, aes(x = x, y = y, col = factor)) +
geom_smooth(aes(group = factor))
However, there are no values for "a" and "b" at x = 3, so I would like the smoothers for "a" and "b" to have a break at x = 3. What's the best strategy to accomplish that?

I would create an expansion of the combinations of x and factor and then do a database-like join on the combinations and the data. For example, first I form a new data frame df with the combinations of the unique values of x and factor
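# lapply() always returns a list; sapply() would simplify to a matrix if both columns had equally many unique values
df <- expand.grid(lapply(data[, c("x", "factor")], unique))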
> df
x factor
1 1 a
2 2 a
3 3 a
4 4 a
5 5 a
6 1 b
7 2 b
8 3 b
9 4 b
10 5 b
11 1 c
12 2 c
13 3 c
14 4 c
15 5 c
Then we can simply perform a join operation on df and your data, requesting that we return all the rows from the left-hand side (the x argument, hence df) and include the corresponding values of y from the right-hand side (data). Where there is no corresponding row on the right-hand side (in data), we will get an NA.
newdf <- merge(df, data, all.x = TRUE)
> newdf
x factor y
1 1 a -0.50219235
2 1 b 0.13153117
3 1 c -0.07891709
4 2 a 0.88678481
5 2 b 0.11697127
6 2 c 0.31863009
7 3 a NA
8 3 b NA
9 3 c -0.58179068
10 4 a 0.71453271
11 4 b -0.82525943
12 4 c -0.35986213
13 5 a 0.08988614
14 5 b 0.09627446
15 5 c -0.20163395
Now we can fit and predict from a loess model by hand, but this is a little tedious; easier options are available via mgcv::gam()
loessFun <- function(XX, span = 0.85) {
fit <- loess(y ~ x, data = XX, na.action = na.exclude, span = span)
predict(fit)
}
Now split the data by factor and apply the loessFun() wrapper
fits <- lapply(split(newdf, newdf$factor), loessFun)
newdf <- transform(newdf, fitted = unsplit(fits, factor))
> head(newdf)
x factor y fitted
1 1 a -0.50219235 -0.50219235
2 1 b 0.13153117 0.13153117
3 1 c -0.07891709 -0.07891709
4 2 a 0.88678481 0.88678481
5 2 b 0.11697127 0.11697127
6 2 c 0.31863009 0.31863009
We can then plot using the new data frame
ggplot(newdf, aes(x = x, y = fitted, col = factor)) +
geom_line(aes(group = factor))
which gives:
It looks a bit funky because of the very low resolution of the sample data you provided and because this method that I've used predicts at the observed data only, preserving NAs. geom_smooth() is actually predicting over the range of x for each group separately and as such there are no missing xs in the data used to draw the geom layer.
Unless you can explain within what region of x = 3 we should add a break (an NA), this may well be the best that you can do. Alternatively, we could predict over the region from the models and then set anything 2.5 < x < 3.5 back to being NA. Add a comment if that is what you wanted and I'll expand my answer with an example of doing that if you can indicate how we are to envisage the gaps.
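A minimal sketch of that second option, on made-up data rather than the sample above (which has very few points per group): fit loess(), predict on a fine grid, then set the fitted values in the region without observations back to NA. The names toy and grid, the gap (4, 6) and the 0.1 grid step are all illustrative assumptions, not something taken from the question.

```r
# Illustrative data (not the question's): no observations for 4 < x < 6
set.seed(42)
x <- c(seq(1, 4, by = 0.25), seq(6, 10, by = 0.25))
toy <- data.frame(x = x, y = sin(x) + rnorm(length(x), sd = 0.2))

# Fit the smoother to the observed points only
fit <- loess(y ~ x, data = toy)

# Predict on a fine grid spanning the observed range of x
grid <- data.frame(x = seq(1, 10, by = 0.1))
grid$fitted <- predict(fit, newdata = grid)

# Blank the predictions where there is no data to support them
grid$fitted[grid$x > 4 & grid$x < 6] <- NA
```

Drawing grid with geom_line(aes(x = x, y = fitted)) should then leave a break across the blanked region, since geom_line() does not connect across NA values. How wide the blanked region should be is a judgment call that depends on the data.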

Related

extract and format data from dataset into matrix in R

I want to make this dataframe
into this matrix
I have tried:
x <- read.csv("sample1.csv")
ax <- matrix(c(x[1,1],x[2,1],x[1,3],x[1,1],x[3,1],x[1,4],x[1,1],x[4,1],x[1,5],x[1,1],x[5,1],x[1,6],x[1,1],x[6,1],x[1,7],x[2,1],x[1,1],x[2,2],x[2,1],x[3,1],x[2,4],x[2,1],x[4,1],x[2,5],x[2,1],x[5,1],x[2,6],x[3,1],x[6,1],x[2,7],x[3,1],x[1,1],x[3,2],x[3,1],x[2,1],x[3,3],x[3,1],x[4,1],x[3,5],x[3,1],x[5,1],x[3,6],x[3,1],x[6,1],x[3,7],x[4,1],x[1,1],x[4,2],x[4,1],x[2,1],x[4,3],x[4,1],x[3,1],x[4,4],x[4,1],x[5,1],x[4,6],x[4,1],x[6,1],x[4,7],x[5,1],x[1,1],x[2,2],x[5,1],x[2,1],x[2,4],x[5,1],x[3,1],x[2,5],x[5,1],x[4,1],x[2,6],x[5,1],x[6,1],x[2,7],x[6,1],x[1,1],x[2,2],x[6,1],x[2,1],x[2,4],x[6,1],x[3,1],x[2,5],x[6,1],x[4,1],x[2,6],x[6,1],x[5,1],x[2,7]),10,3, byrow=TRUE)
bx <- ax[order(ax[,3], decreasing = TRUE),]
But it's not elegant at all, and it would be a lot of work to redo for different sample data.
So I would like to simplify it if possible. Any suggestions?
This can be achieved by using the melt() function from the reshape2 package:
> a = matrix(c(1:9), nrow = 3, ncol = 3, dimnames = list(LETTERS[1:3], letters[1:3]))
> a
a b c
A 1 4 7
B 2 5 8
C 3 6 9
> library(reshape2)
> melt(a, na.rm = TRUE)
Var1 Var2 value
1 A a 1
2 B a 2
3 C a 3
4 A b 4
5 B b 5
6 C b 6
7 A c 7
8 B c 8
9 C c 9
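If adding a dependency is not an option, base R can do the same reshaping. This is a sketch of an alternative, not part of the answer above: as.table() followed by as.data.frame() gives one row per matrix cell, and order() then sorts by value, as the question's attempt does. The object name long is just illustrative.

```r
# Same toy matrix as above
a <- matrix(1:9, nrow = 3, ncol = 3,
            dimnames = list(LETTERS[1:3], letters[1:3]))

# One row per cell: row label, column label, cell value
long <- as.data.frame(as.table(a), responseName = "value")

# Sort by value, largest first
long <- long[order(long$value, decreasing = TRUE), ]
```

Note that the label columns come back as factors; wrap them in as.character() if plain strings are needed.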

collapse package: sum over two vectors but keep empty intersections

I would like to aggregate a vector/ matrix y by two variables a and b via the fsum function of the collapse package. fsum does not return values for empty intersections. Is there a way to keep empty intersection using the collapse package? I know that I could e.g. work through cross-joins and data.table, but as my function input is a vector and speed really matters, I would like to avoid converting the input matrix to a data.table and then convert the output back to a matrix / vector (for a solution with data.table, see e.g. here: data.table calculate sums by two variables and add observations for "empty" groups).
Here is an example:
library(collapse)
set.seed(1)
a <- sample(1:5, 10, replace = TRUE)
b <- sample(1:3, 10, replace = TRUE)
y <- matrix(rnorm(10), 10, 1)
fsum(x = y, g = data.frame(a = a, b = b))
#> fsum(x = y, g = data.frame(a = a, b = b))
# [,1]
#1.1 -0.40955189
#1.2 -0.05710677
#2.2 0.50360797
#2.3 -1.28459935
#3.1 0.04672617
#3.2 -0.69095384
#3.3 -0.23570656
#4.1 0.80418951
#5.2 1.08576936
What I would like to get: the regular output above, but keeping the empty intersections of (a, b) - e.g (a = 1, b = 3) and assign a missing or zero:
# a b y
#1: 1 1 -0.7702614
#2: 1 2 -0.2992151
#3: 1 3 NA
#4: 2 1 NA
#5: 2 2 -0.4115108
#6: 2 3 0.4356833
#.................
As an addition: base::aggregate() has a function argument drop = FALSE that achieves this:
aggregate(y, data.frame(a, b), sum, drop = FALSE)
a b V1
#1 1 1 -0.7702614
#2 2 1 NA
#3 3 1 -1.2375384
#4 4 1 -0.2894616
#5 5 1 NA
#6 1 2 -0.2992151
#7 2 2 -0.4115108
#8 3 2 -0.8919211
#9 4 2 NA
#10 5 2 0.2522234
#11 1 3 NA
#12 2 3 0.4356833
#13 3 3 -0.2242679
#14 4 3 NA
#15 5 3 NA
Nevertheless, in my experience both data.table and collapse are significantly faster, but collapse has the advantage that it also works with matrix objects (which do not need to be converted to data.tables).
Is there a way to achieve this via collapse?
Yes, you can do that with fsum; however, other functions like fmedian will warn about that. To do it, you need to create factors and interact them using : like so:
library(collapse)
set.seed(1)
a <- sample(1:5, 10, replace = TRUE)
b <- sample(1:3, 10, replace = TRUE)
y <- matrix(rnorm(10), 10, 1)
fsum(x = y, g = qF(a):qF(b))
# [,1]
# 1:1 -0.7702614
# 1:2 -0.2992151
# 1:3 NA
# 2:1 NA
# 2:2 -0.4115108
# 2:3 0.4356833
# 3:1 -1.2375384
# 3:2 -0.8919211
# 3:3 -0.2242679
# 4:1 -0.2894616
# 4:2 NA
# 4:3 NA
# 5:1 NA
# 5:2 0.2522234
# 5:3 NA
For the earlier example you gave, I'd also like to note that the expensive call to data.frame is absolutely not necessary, fsum(x = y, g = list(a = a, b = b)) is sufficient.
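Why this works can be seen with base R alone: interacting two factors with : produces a factor whose levels are all combinations of the input levels, whether or not they occur in the data, so a grouped sum has a slot for each one. The sketch below uses factor() and tapply() as stand-ins for qF() and fsum(), purely to show the level structure; it is not a statement about collapse internals. The explicit levels arguments are an added assumption so that values which happen not to be sampled still get a level.

```r
set.seed(1)
a <- sample(1:5, 10, replace = TRUE)
b <- sample(1:3, 10, replace = TRUE)
y <- matrix(rnorm(10), 10, 1)

# Give each factor its full set of levels, then interact them
g <- factor(a, levels = 1:5):factor(b, levels = 1:3)
nlevels(g)   # 15 = 5 * 3, including combinations that never occur

# Base-R grouped sum: combinations with no observations come back as NA
sums <- tapply(y[, 1], g, sum)
```

If zeros are preferred over NA, replacing them afterwards with sums[is.na(sums)] <- 0 is one option.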

R - find clusters of group 2 (pairs)

I am looking for a way to find clusters of group 2 (pairs).
Is there a simple way to do that?
Imagine I have some kind of data where I want to match on x and y, like
library(cluster)
set.seed(1)
df = data.frame(id = 1:10, x_coord = sample(10,10), y_coord = sample(10,10))
I want to find the closest pair of distances between the x_coord and y_coord:
d = stats::dist(df[,c(1,2)], diag = T)
h = hclust(d)
plot(h)
I get a dendrogram like the one below. What I would like is for the pairs (9,10), (1,3), (6,7), (4,5) to be grouped together, and for cases 8 and 2 to be left unpaired and removed.
Maybe there is a more effective alternative for doing this than clustering.
Ultimately I would like to remove the unmatched ids, keep the pairs, and have a dataset like this one:
id x_coord y_coord pair_id
1 9 3 1
3 7 5 1
4 1 8 2
5 2 2 2
6 5 6 3
7 3 10 3
9 6 4 4
10 8 7 4
You could use the element h$merge. Any rows of this two-column matrix that both contain negative values represent a pairing of singletons. Therefore you can do:
pairs <- -h$merge[apply(h$merge, 1, function(x) all(x < 0)),]
df$pair <- (match(df$id, c(pairs)) - 1) %% nrow(pairs) + 1
df <- df[!is.na(df$pair),]
df
#> id x_coord y_coord pair
#> 1 1 9 3 4
#> 3 3 7 5 4
#> 4 4 1 8 1
#> 5 5 2 2 1
#> 6 6 5 6 2
#> 7 7 3 10 2
#> 9 9 6 4 3
#> 10 10 8 7 3
Note that the pair numbers equate to "height" on the dendrogram. If you want them to be in ascending order according to the order of their appearance in the dataframe you can add the line
df$pair <- as.numeric(factor(df$pair, levels = unique(df$pair)))
Anyway, if we repeat your plotting code on our newly modified df, we can see there are no unpaired singletons left:
d = stats::dist(df[,c(1,2)], diag = T)
h = hclust(d)
plot(h)
And we can see the method scales nicely:
df = data.frame(id = 1:50, x_coord = sample(50), y_coord = sample(50))
d = stats::dist(df[,c(1,2)], diag = T)
h = hclust(d)
pairs <- -h$merge[apply(h$merge, 1, function(x) all(x < 0)),]
df$pair <- (match(df$id, c(pairs)) - 1) %% nrow(pairs) + 1
df <- df[!is.na(df$pair),]
d = stats::dist(df[,c(1,2)], diag = T)
h = hclust(d)
plot(h)
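For readers unfamiliar with h$merge, a tiny made-up example may help. In the merge matrix documented in ?hclust, a negative entry -j means observation j was merged as a singleton, while a positive entry refers to a cluster formed at an earlier step. The five points below are hypothetical and chosen so that the pairs are obvious.

```r
# Five points on a line: (1, 2) and (10, 11.5) are close, 50 is isolated
pts <- c(1, 2, 10, 11.5, 50)
h <- hclust(dist(pts))
h$merge   # one row per merge step

# Keep only the steps that join two singletons (both entries negative)
is_pair <- apply(h$merge, 1, function(x) all(x < 0))
pairs <- -h$merge[is_pair, , drop = FALSE]
pairs     # each row holds the two observation indices of one pair
```

The drop = FALSE is an addition in this sketch: it keeps pairs a matrix even when only one row qualifies, so that nrow(pairs) still works.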

How do I adjust the scale of a geom_tile in ggplot2?

I am trying to adjust the colour scale of a geom_tile plot.
A short version of my data (in data.frame format) is:
mydat <- read.table(header = TRUE, text = "
Sc K n minC
A 2 1 NA
A 2 2 37.453023
A 2 3 23.768316
A 2 4 17.628376
A 3 1 NA
A 3 2 12.693124
A 3 3 8.884226
A 3 4 7.436250
A 10 1 2.128121
A 10 2 2.116539
A 10 3 2.737923
A 10 4 3.509773
A 20 1 1.104592
A 20 2 1.840195
A 20 3 2.717198
A 20 4 3.616501
B 2 1 NA
B 2 2 25.090085
B 2 3 15.924186
B 2 4 11.811022
B 3 1 NA
B 3 2 8.827183
B 3 3 6.179484
B 3 4 5.175331
B 10 1 2.096934
B 10 2 2.064984
B 10 3 2.662373
B 10 4 3.407246
B 20 1 1.096871
B 20 2 1.802418
B 20 3 2.649153
B 20 4 3.517776
")
My code to prepare the data to plot is the following:
library(reshape2)  # provides melt()
mydat$Sc <- factor(mydat$Sc, levels = c("A", "B"))
mydat$K <- factor(mydat$K, levels = c("2", "3", "10", "20"))
mydat.m <- melt(mydat, id.vars = c("Sc", "K", "n"), measure.vars = c("minC"))
I want to plot with geom_tile the value of minC with K and n as axis and different facets for Sc with the following:
mydat.m.p <- ggplot(mydat.m, aes(x=n, y=K))
mydat.m.p +
geom_tile(data=mydat.m, aes(fill=value)) +
scale_fill_gradient(low="palegreen", high="lightcoral") +
facet_wrap(~ Sc, ncol=2)
This gives me a plot for each Sc factor. However, the colour scale does not reflect what I want to portray, because a few high values make all the low values look the same.
I want to adjust to a relevant scale in 4 breaks, i.e., 1-2, 2-3, 3-5, >5.
Looking at other questions there was a suggestion to use the cut function and scale fill manual as:
mydat.m$value1 <- cut(mydat.m$value, breaks = c(1:5, Inf), right = FALSE)
Then use the following in geom_tile:
scale_fill_manual(breaks = c("[1,2)", "[2,3)", "[3,5)", "[5,Inf)"),
values = c("darkgreen", "palegreen", "lightcoral", "red"))
However, I am not sure how this can be applied to a data.frame with other factors and in long format.
You're almost there. Simply use cut before melting:
mydat$minC.cut <- cut(mydat$minC, breaks = c(1:3, 5, Inf), right = FALSE)
mydat.cut <- melt(mydat, id.vars=c("Sc", "K", "n"), measure.vars=c("minC.cut"))
Now, you don't need to specify breaks since we took care of that already.
ggplot(mydat.cut, aes(x=n, y=K)) +
geom_tile(aes(fill=value)) +
facet_wrap(~ Sc, ncol=2) +
scale_fill_manual(values = c("darkgreen", "palegreen", "lightcoral", "red"))
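To check what cut() produced before plotting, it can help to look at the factor levels directly; with right = FALSE the intervals are closed on the left. The sketch also builds a named colour vector, which scale_fill_manual(values = ...) accepts, so each colour is tied to a level by name rather than by position. The sample values are a few minC entries from the question; the names vals, bins and fill_cols are illustrative.

```r
# A few minC values from the question, plus a missing one
vals <- c(1.104592, 2.737923, 3.509773, 37.453023, NA)

# Same breaks as in the answer: [1,2), [2,3), [3,5), [5,Inf)
bins <- cut(vals, breaks = c(1:3, 5, Inf), right = FALSE)
levels(bins)
# "[1,2)"   "[2,3)"   "[3,5)"   "[5,Inf)"

# Tie each colour to a level by name
fill_cols <- setNames(c("darkgreen", "palegreen", "lightcoral", "red"),
                      levels(bins))
```

Missing values stay NA after cut(), so the NA tiles are not assigned to any of the four bins.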

Replacing header in data frame based on values in second data frame

Say I have a data frame which looks like this:
df.A
A B C
x 1 3 4
y 5 4 6
z 8 9 1
And I want to replace the column names in the first based on column values in a second:
df.B
Low High
A D
B F
C G
Such that I get:
df.A
D F G
x 1 3 4
y 5 4 6
z 8 9 1
How would I do it?
I have tried extracting the vector df.B$High from df.B and using this in names(df.A), but everything is in alphabetical order and shifted over one. Furthermore, this only works if the order of columns in df.A is conserved with respect to the elements in df.B$High, which is not always the case (and in my real example there is no numeric or alphabetical way to sort the two to the same order). So I think I need an rbind-type argument for matching elements, but I'm not sure.
Thanks!
You can use rename from plyr:
library(plyr)
dat <- read.table(text = " A B C
x 1 3 4
y 5 4 6
z 8 9 1", header = TRUE, sep = "")
new <- read.table(text = "Low High
A D
B F
C G", header = TRUE, sep = "")
rename(dat, replace = setNames(new$High, new$Low))
D F G
x 1 3 4
y 5 4 6
z 8 9 1
Using match:
df.A <- read.table(sep=" ", header=T, text="
A B C
x 1 3 4
y 5 4 6
z 8 9 1")
df.B <- read.table(sep=" ", header=T, text="
Low High
A D
B F
C G")
df.C <- df.A
names(df.C) <- df.B$High[match(names(df.A), df.B$Low)]
df.C
# D F G
# x 1 3 4
# y 5 4 6
# z 8 9 1
You can play games with the row names of df.B to make a lookup more convenient:
rownames(df.B) <- df.B$Low
names(df.A) <- df.B[names(df.A),"High"]
df.A
## D F G
## x 1 3 4
## y 5 4 6
## z 8 9 1
Here's an approach abusing factor:
f <- factor(names(df.A), levels=df.B$Low)
levels(f) <- df.B$High
f
## [1] D F G
## Levels: D F G
names(df.A) <- f
## Desired results
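The question also mentions that the columns of df.A are not always in the same order as the rows of df.B. A named lookup vector handles that, and a small fallback keeps the original name for any column that has no entry in df.B. This is a sketch on a reordered copy of the example with one extra, hypothetical column Q; lookup and new_names are illustrative names.

```r
# Reordered copy of the example, plus a column Q that df.B does not mention
df.A <- data.frame(C = c(4, 6, 1), A = c(1, 5, 8), B = c(3, 4, 9),
                   Q = c(0, 0, 0), row.names = c("x", "y", "z"))
df.B <- data.frame(Low = c("A", "B", "C"), High = c("D", "F", "G"),
                   stringsAsFactors = FALSE)

# Named vector: old name -> new name
lookup <- setNames(df.B$High, df.B$Low)

# Look each column up by name; unmatched columns come back as NA
new_names <- unname(lookup[names(df.A)])

# Fall back to the original name where there was no match
new_names[is.na(new_names)] <- names(df.A)[is.na(new_names)]
names(df.A) <- new_names
names(df.A)
# "G" "D" "F" "Q"
```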
