Highlight specific points from vector in scatterplot - r

I have a data frame df with two columns, which I plot as a scatterplot using ggplot. I have divided the curve into intervals, and the boundary points of the intervals are stored in a vector r. I now want to highlight these points to make the intervals easier to see. I was thinking about coloring the boundary points, or marking the intervals by adding vertical lines to the plot. I have tried some commands, but they didn't work for me.
Here is what my data frame looks like:
d is the first column, e is the second, with the number of instances.
d e
1 4
2 4
3 5
4 5
5 5
6 4
7 2
8 3
9 1
10 3
11 2
12 3
13 3
14 3
15 3
16 3
17 3
18 4
My vector r shows where the interval borders were set.
7
8
9
10
11
12
18
Any ideas how to do so? Thanks!

You can try a tidyverse approach. The idea is to flag the matching points using mutate and %in%, then color by the resulting logical vector gr. I also added vertical lines to illustrate the "intervals".
library(tidyverse)
d %>%
mutate(gr=d %in% r) %>%
ggplot(aes(d,e, color=gr)) +
geom_vline(xintercept=r, alpha=.1) +
geom_point()
Edit: Without tidyverse you can add gr using
d$gr <- d$d %in% r
ggplot(d, aes(d,e, color=gr)) ...
The data
d <- read.table(text=" d e
1 4
2 4
3 5
4 5
5 5
6 4
7 2
8 3
9 1
10 3
11 2
12 3
13 3
14 3
15 3
16 3
17 3
18 4", header=T)
r <- c(7:12,18)
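For a self-contained variant that needs only ggplot2, the snippet below builds the flag column in base R and uses explicit highlight colors. It is a sketch rather than the only way to do it: the colors, point size, line type, and legend labels are assumptions added for illustration.

```r
library(ggplot2)

# Data and interval borders from the question
d <- data.frame(
  d = 1:18,
  e = c(4, 4, 5, 5, 5, 4, 2, 3, 1, 3, 2, 3, 3, 3, 3, 3, 3, 4)
)
r <- c(7:12, 18)

# Flag the rows whose d value is an interval border
d$gr <- d$d %in% r

# Colors, size and labels below are illustrative choices
p <- ggplot(d, aes(d, e, color = gr)) +
  geom_vline(xintercept = r, linetype = "dashed", alpha = 0.3) +
  geom_point(size = 2) +
  scale_color_manual(
    values = c("FALSE" = "grey40", "TRUE" = "red"),
    labels = c("FALSE" = "other point", "TRUE" = "interval border"),
    name = NULL
  )
print(p)
```

Drawing the vertical lines before the points keeps the points on top of the lines.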

Related

How can I multiply columns by columns from different matrix in R?

I have two matrices as follows:
d <- cbind(c(1,2,3,4),c(1,1,1,1),c(1,2,4,8))
v <- cbind(c(2,2,2,2),c(3,3,3,3))
I want to get a matrix consisting of the products d_i v_j (each column of d times each column of v), as follows:
d1v1 d1v2 d2v1 d2v2 d3v1 d3v2
2 3 2 3 2 3
4 6 2 3 4 6
6 9 2 3 8 12
8 12 2 3 16 24
This is an example of my question. Can you tell me how to write code to solve it? Many thanks.
matrix(apply(v,2,function(x){x*d}),4,6)
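Note that the one-liner above returns the columns in the order d1v1 d2v1 d3v1 d1v2 d2v2 d3v2, which differs from the order shown in the question. If the requested order matters, one possible sketch is to loop over the columns of d instead; the column names are added here only for readability.

```r
d <- cbind(c(1, 2, 3, 4), c(1, 1, 1, 1), c(1, 2, 4, 8))
v <- cbind(c(2, 2, 2, 2), c(3, 3, 3, 3))

# For each column i of d, multiply it element-wise with every column of v,
# then bind the resulting 4 x 2 blocks side by side
res <- do.call(cbind, lapply(seq_len(ncol(d)), function(i) d[, i] * v))

# Label the columns d1v1, d1v2, d2v1, ...
colnames(res) <- paste0("d", rep(seq_len(ncol(d)), each = ncol(v)),
                        "v", rep(seq_len(ncol(v)), times = ncol(d)))
res
#      d1v1 d1v2 d2v1 d2v2 d3v1 d3v2
# [1,]    2    3    2    3    2    3
# [2,]    4    6    2    3    4    6
# [3,]    6    9    2    3    8   12
# [4,]    8   12    2    3   16   24
```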

How to cluster points (coordinates) in a large data set

I have a large data set made of x and y coordinates (not necessarily lat, lng). Points are not ordered.
df <- data.frame(point=1:7, x=c(3,7,2,23,5,67,16) , y=c(1,4,5,23,17,89,20))
>df
point x y
1 3 1
2 7 4
3 2 5
4 23 23
5 5 17
6 67 89
7 16 20
Is there an easy way to cluster points? (according to a radius for example)
So for example:
points 1, 2, 3 would be together - group A
points 4, 5, 7 would be together - group B
point 6 would be group C
I have tried using %>% arrange to sort the values and then taking coordinate differences, x-(x+1),
but the method is not perfect and there are situations where the clustering isn't done properly.
Any suggestions or comments?
Thanks
You can try hclust with cutree.
df$group <- cutree(hclust(dist(df[,2:3])), 3)
df
# point x y group
#1 1 3 1 1
#2 2 7 4 1
#3 3 2 5 1
#4 4 23 23 2
#5 5 5 17 1
#6 6 67 89 3
#7 7 16 20 2
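Since the question mentions grouping by a radius, cutree can also cut the tree at a height h instead of asking for a fixed number of groups. The sketch below assumes single linkage and a cut-off of 12, both picked for illustration rather than taken from the answer; with single linkage, two clusters are merged as soon as any pair of points across them is within the cut-off.

```r
df <- data.frame(point = 1:7,
                 x = c(3, 7, 2, 23, 5, 67, 16),
                 y = c(1, 4, 5, 23, 17, 89, 20))

# Single-linkage tree on the Euclidean distances between the coordinates
hc <- hclust(dist(df[, c("x", "y")]), method = "single")

# Cut at height 12 (an assumed radius) rather than at a fixed number of groups
df$group <- cutree(hc, h = 12)

# List the point ids in each group
split(df$point, df$group)
```

For this sample data the cut puts points 1, 2, 3 together, points 4, 5, 7 together, and point 6 on its own, which is the grouping described in the question. The result is sensitive to the cut-off, so it would need tuning on the real data.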

How to extract a sample of pairs in grouping variable

My data looks like this:
x y
1 1
2 2
3 2
4 4
5 5
6 6
7 6
8 8
9 9
10 9
11 11
12 12
13 13
14 13
15 14
16 15
17 14
18 16
19 17
20 18
y is a grouping variable. I would like to see how well this grouping went.
To do this, I want to extract a sample of n pairs of cases that are grouped together by variable y
and n pairs of cases that are not grouped together by variable y, in order to calculate the number of
false positives and false negatives (pairs that were falsely grouped or falsely left ungrouped). How do I extract a sample of grouped pairs
and a sample of not-grouped pairs?
I would like the samples to look like this (for n=6) :
Grouped sample:
x y
2 2
3 2
9 9
10 9
15 14
17 14
Not-grouped sample:
x y
1 1
2 2
6 8
6 8
11 11
19 17
How would I go about this in R?
I'm not entirely clear on what you'd like to do, partly because I feel there is some context missing as to what you're trying to achieve. I also don't quite understand your expected output (for example, the not-grouped sample contains an entry 6 8 that does not exist in your original data).
That aside, here is a possible approach.
# Maximum number of samples per group
n <- 3;
# Set fixed RNG seed for reproducibility
set.seed(2017);
# Grouped samples
# Draw up to n rows at random from each group that has more than one row
df.grouped <- do.call(rbind.data.frame, lapply(split(df, df$y),
function(x) if (nrow(x) > 1) x[sample(nrow(x), min(n, nrow(x))), ]));
df.grouped;
# x y
#2.3 3 2
#2.2 2 2
#6.6 6 6
#6.7 7 6
#9.10 10 9
#9.9 9 9
#13.13 13 13
#13.14 14 13
#14.15 15 14
#14.17 17 14
# Ungrouped samples
df.ungrouped <- df[sample(nrow(df.grouped)), ];
df.ungrouped;
# x y
#7 7 6
#1 1 1
#9 9 9
#4 4 4
#3 3 2
#2 2 2
#5 5 5
#6 6 6
#10 10 9
#8 8 8
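Explanation: Split df based on y, then draw min(n, nrow(x)) samples from each subset x containing more than one row; rbinding the results gives df.grouped. We then take nrow(df.grouped) rows from df to produce df.ungrouped. Note that sample(k) returns a permutation of 1:k, so as written this takes the first nrow(df.grouped) rows of df in random order; sample(nrow(df), nrow(df.grouped)) would draw from all rows instead.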
Sample data
df <- read.table(text =
"x y
1 1
2 2
3 2
4 4
5 5
6 6
7 6
8 8
9 9
10 9
11 11
12 12
13 13
14 13
15 14
16 15
17 14
18 16
19 17
20 18", header = T)
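If the goal is to sample pairs rather than rows, another option is to enumerate every pair of rows, mark whether the two rows share the same y, and sample from each set. This is a sketch under that reading of the question, not the method used above; the names pairs, same_group and sample_rows are made up for illustration, and enumerating all pairs with combn only scales to moderately sized data.

```r
df <- data.frame(x = 1:20,
                 y = c(1, 2, 2, 4, 5, 6, 6, 8, 9, 9, 11, 12, 13, 13,
                       14, 15, 14, 16, 17, 18))

# Every unordered pair of row indices (i < j)
pairs <- as.data.frame(t(combn(nrow(df), 2)))
names(pairs) <- c("i", "j")

# TRUE when both rows of the pair carry the same grouping value
pairs$same_group <- df$y[pairs$i] == df$y[pairs$j]

# Helper (hypothetical name): pick up to n rows of a data frame at random
sample_rows <- function(data, n) {
  data[sample.int(nrow(data), min(n, nrow(data))), , drop = FALSE]
}

n <- 3
set.seed(2017)
grouped_pairs   <- sample_rows(pairs[pairs$same_group, ], n)
ungrouped_pairs <- sample_rows(pairs[!pairs$same_group, ], n)

# Look up the x values of each sampled grouped pair
data.frame(x1 = df$x[grouped_pairs$i], x2 = df$x[grouped_pairs$j])
```

Sampling from the pair table keeps the two kinds of pairs cleanly separated: a pair flagged as not grouped can never contain two rows with the same y.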

Sum certain values from changing dataframe in R

I have a data frame that I would like to aggregate by adding certain values. Say I have six clusters. I then feed data from each cluster into some function that generates a value x which is then put into the output data frame.
cluster year lambda v e x
1 1 1 -0.12160997 -0.31105287 -0.253391178 15
2 1 2 -0.12160997 -1.06313732 -0.300349972 10
3 1 3 -0.12160997 -0.06704185 0.754397069 40
4 2 1 -0.07378295 -0.31105287 -1.331764904 4
5 2 2 -0.07378295 -1.06313732 0.279413039 19
6 2 3 -0.07378295 -0.06704185 -0.004581941 23
7 3 1 -0.02809310 -0.31105287 0.239647063 28
8 3 2 -0.02809310 -1.06313732 1.284568047 38
9 3 3 -0.02809310 -0.06704185 -0.294881283 18
10 4 1 0.33479251 -0.31105287 -0.480496125 15
11 4 2 0.33479251 -1.06313732 -0.380251626 12
12 4 3 0.33479251 -0.06704185 -0.078851036 34
13 5 1 0.27953088 -0.31105287 1.435456851 100
14 5 2 0.27953088 -1.06313732 -0.795435607 0
15 5 3 0.27953088 -0.06704185 -0.166848530 0
16 6 1 0.29409366 -0.31105287 0.126647655 44
17 6 2 0.29409366 -1.06313732 0.162961658 18
18 6 3 0.29409366 -0.06704185 -0.812316265 13
To aggregate, I then add up the x value for cluster 1 across all three years with seroconv.cluster1=sum(data.all[c(1:3),6]) and repeat for each cluster.
Every time I change the number of clusters right now I have to manually change the addition of the x's. I would like to be able to say n.vec <- seq(6, 12, by=2) and feed n.vec into the functions and get x and have R add up the x values for each cluster every time with the number of clusters changing. So it would do 6 clusters and add up all the x's per cluster. Then 8 and add up the x's and so on.
It seems you are asking for an easy way to split your data up, apply a function (sum in this case) and then combine it all back together. Split apply combine is a common data strategy, and there are several split/apply/combine strategies in R, the most popular being ave in base, the dplyr package and the data.table package.
Here's an example for your data using dplyr:
library(dplyr)
# summarise_each() and funs() are deprecated in current dplyr; across() replaces them
df %>% group_by(cluster, year) %>% summarise(across(everything(), sum))
To get the sum of x for each cluster as a vector, you can use tapply:
tapply(df$x, df$cluster, sum)
# 1 2 3 4 5 6
# 65 46 84 61 100 75
If you instead wanted to output as a data frame, you could use aggregate:
aggregate(x~cluster, sum, data=df)
# cluster x
# 1 1 65
# 2 2 46
# 3 3 84
# 4 4 61
# 5 5 100
# 6 6 75
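Because tapply groups by whatever values appear in the cluster column, the same call works unchanged when the number of clusters varies. The sketch below loops over n.vec as described in the question; simulate_clusters is a hypothetical stand-in for the function that generates x, and its random values are made up.

```r
# Hypothetical stand-in for the function that produces x; values are random
simulate_clusters <- function(n_clusters, n_years = 3) {
  data.frame(
    cluster = rep(seq_len(n_clusters), each = n_years),
    year    = rep(seq_len(n_years), times = n_clusters),
    x       = rpois(n_clusters * n_years, lambda = 20)
  )
}

n.vec <- seq(6, 12, by = 2)
set.seed(1)

# One named vector of per-cluster sums for each cluster count in n.vec
sums <- lapply(n.vec, function(n) {
  data.all <- simulate_clusters(n)
  tapply(data.all$x, data.all$cluster, sum)
})
names(sums) <- n.vec

# Number of per-cluster sums obtained for each run
lengths(sums)
```

Each element of sums then holds one total per cluster, so no row indices need to be edited by hand when the cluster count changes.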

R: How to use intervals as input data for histograms?

I would like to import the data into R as intervals, then count all the numbers falling within these intervals and draw a histogram from these counts.
Example:
start end freq
1 8 3
5 10 2
7 11 5
.
.
.
Result:
number freq
1 3
2 3
3 3
4 3
5 5
6 5
7 10
8 10
9 7
10 7
11 5
Any suggestions?
Thank you very much!
Assuming your data is in df, you can create a data set that has each number in the range repeated by freq. Once you have that it's trivial to use the summarizing functions in R. This is a little roundabout, but a lot easier than explicitly computing the sum of the overlaps (though that isn't that hard either).
dat <- unlist(apply(df, 1, function(x) rep(x[[1]]:x[[2]], x[[3]])))
hist(dat, breaks=0:max(df$end))
You can also do table(dat)
dat
1 2 3 4 5 6 7 8 9 10 11
3 3 3 3 5 5 10 10 7 7 5
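The explicit overlap computation mentioned above can be sketched as follows: for every integer in the overall range, add up the freq of each interval that contains it. The data frame is rebuilt from the three example rows in the question, and the names numbers, counts and result are chosen here for illustration.

```r
df <- data.frame(start = c(1, 5, 7), end = c(8, 10, 11), freq = c(3, 2, 5))

# Every integer covered by at least the overall range of the intervals
numbers <- min(df$start):max(df$end)

# For each number k, sum the freq of all intervals with start <= k <= end
counts <- sapply(numbers, function(k) sum(df$freq[df$start <= k & k <= df$end]))

result <- data.frame(number = numbers, freq = counts)
result

# Bar plot of the counts, as an alternative to hist() on the expanded data
barplot(result$freq, names.arg = result$number)
```

This avoids materializing the repeated values, which can matter when freq is large.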
