I have a load of genomic data as follows:
chr leftPos Sample1 AnotherSample EtcSample
1 4324 434 43 33
1 5353 63 34 532
1 6632 543 3544 23
2 1443 25 345 543
2 7644 74 26 324
2 8886 23 9 23
3 1287 643 45 23
3 5443 93 23 77
3 7668 33 45 33
I would like to create a heatmap organised by chromosome with sample along the x-axis and leftPos along the Y axis. I think this would look good in a facet_wrap image (organised by chromosome) but this means I have to use heatmaps in ggplots and I understand this isn't a thing so I have to use geom_tiles().
So I tried googling all over the place but I'm stuck with how to firstly do a heatmap per chromosome and secondly do tiles per sample. All the examples seem to just use two columns.
df <- data.frame(chr=c(1,1,1,2,2,2,3,3,3),
leftPos=c(4324, 5353, 6632, 1443, 7644, 8886, 1287, 5443, 7668),
Sample1=c(434,63,543,25,74,23,643,93,33),
AnotherSample=c(43,34,3544,345,26,9,45,23,45),
EtcSample=c(33,532,23,543,324,23,23,77,33))
Reshape your data in a long format.
df.l <- reshape(df,
varying = c("Sample1", "AnotherSample", "EtcSample"),
idvar="chr",
v.names = "value",
timevar = "sample",
times=c("Sample1", "AnotherSample", "EtcSample"),
new.row.names=c(1:(3*nrow(df))),
direction = "long")
> df.l
chr leftPos sample value
1 1 4324 Sample1 434
2 1 5353 Sample1 63
3 1 6632 Sample1 543
4 2 1443 Sample1 25
5 2 7644 Sample1 74
...
12 1 6632 AnotherSample 3544
13 2 1443 AnotherSample 345
14 2 7644 AnotherSample 26
15 2 8886 AnotherSample 9
16 3 1287 AnotherSample 45
...
23 2 7644 EtcSample 324
24 2 8886 EtcSample 23
25 3 1287 EtcSample 23
26 3 5443 EtcSample 77
27 3 7668 EtcSample 33
For representation purpose based on your data, I converted leftPos into factor.
library(ggplot2)
df.l$leftPos <- factor(df.l$leftPos)
ggplot(df.l, aes(sample, leftPos)) + geom_tile(aes(fill = value)) +
scale_fill_gradient(low = "white", high = "red") + facet_wrap(~chr)+
theme(panel.grid.major = element_blank(), panel.grid.minor = element_blank())
Related
I have 4 data frames that all look like this:
Product 2018
Number
Minimum
Maximum
1
56
1
5
2
42
12
16
3
6523
23
56
4
123
23
102
5
56
23
64
6
245623
56
87
7
546
25
540
8
54566
253
560
Product 2019
Number
Minimum
Maximum
1
56
32
53
2
642
423
620
3
56423
432
560
4
3
431
802
5
2
2
6
6
4523
43
68
7
555
23
54
8
55646
3
6
Product 2020
Number
Minimum
Maximum
1
23
2
5
2
342
4
16
3
223
3
5
4
13
4
12
5
2
4
7
6
223
7
8
7
5
34
50
8
46
3
6
Product 2021
Number
Minimum
Maximum
1
234
3
5
2
3242
4
16
3
2423
43
56
4
123
43
102
5
24
4
6
6
2423
4
18
7
565
234
540
8
5646
23
56
I want to join all the tables so I get a table that looks like this:
Products
Number 2021
Min-Max 2021
Number 2020
Min-Max 2020
Number 2019
Min-Max 2019
Number 2018
Min-Max 2018
1
234
3 to 5
23
2 to 5
...
...
...
...
2
3242
4 to 16
342
4 to 16
...
...
...
...
3
2423
43 to 56
223
3 to 5
...
...
...
...
4
123
43 to 102
13
4 to 12
...
...
...
...
5
24
4 to 6
2
4 to 7
...
...
...
...
6
2423
4 to 18
223
7 to 8
...
...
...
...
7
565
234 to 540
5
34 to 50
...
...
...
...
8
5646
23 to 56
46
3 to 6
...
...
...
...
The Product for all years are the same so I would like to have a data frame that contains the number for each year as a column and joins the column for minimum and maximum as one.
Any help is welcome!
How about something like this. You are trying to join several dataframes by a single column, which is relatively straight forward using full_join. The difficulty is that you are trying to extract information from the column names and combine several columns at the same time. I would map out everying you want to do and then reduce the list of dataframes at the end. Here is an example with two dataframes, but you could add as many as you want to the list at the begining.
library(tidyverse)
#test data
set.seed(23)
df1 <- tibble("Product 2018" = seq(1:8),
Number = sample(1:100, 8),
Minimum = sample(1:100, 8),
Maximum = map_dbl(Minimum, ~sample(.x:1000, 1)))
set.seed(46)
df2 <- tibble("Product 2019" = seq(1:8),
Number = sample(1:100, 8),
Minimum = sample(1:100, 8),
Maximum = map_dbl(Minimum, ~sample(.x:1000, 1)))
list(df1, df2) |>
map(\(x){
year <- str_extract(colnames(x)[1], "\\d+?$")
mutate(x, !!quo_name(paste0("Min-Max ", year)) := paste(Minimum, "to", Maximum))|>
rename(!!quo_name(paste0("Number ", year)) := Number)|>
rename_with(~gsub("\\s\\d+?$", "", .), 1) |>
select(-c(Minimum, Maximum))
}) |>
reduce(full_join, by = "Product")
#> # A tibble: 8 x 5
#> Product `Number 2018` `Min-Max 2018` `Number 2019` `Min-Max 2019`
#> <int> <int> <chr> <int> <chr>
#> 1 1 29 21 to 481 50 93 to 416
#> 2 2 28 17 to 314 78 7 to 313
#> 3 3 72 40 to 787 1 91 to 205
#> 4 4 43 36 to 557 47 55 to 542
#> 5 5 45 70 to 926 52 76 to 830
#> 6 6 34 96 to 645 70 20 to 922
#> 7 7 48 31 to 197 84 6 to 716
#> 8 8 17 86 to 951 99 75 to 768
This is a similar answer, but includes bind_rows to combine the data.frames, then pivot_wider to end in a wide format.
The first steps strip the year from the Product XXXX column name, as this carries relevant information on year for that data.frame. If that column is renamed as Product they are easily combined (with a separate column containing the Year). If this step can be taken earlier in the data collection or processing timeline, it is helpful.
library(tidyverse)
list(df1, df2, df3, df4) %>%
map(~.x %>%
mutate(Year = gsub("Product", "", names(.x)[1])) %>%
rename(Product = !!names(.[1]))) %>%
bind_rows() %>%
mutate(Min_Max = paste(Minimum, Maximum, sep = " to ")) %>%
pivot_wider(id_cols = Product, names_from = Year, values_from = c(Number, Min_Max), names_vary = "slowest")
Output
Product Number_2018 Min_Max_2018 Number_2019 Min_Max_2019 Number_2020 Min_Max_2020 Number_2021 Min_Max_2021
<int> <int> <chr> <int> <chr> <int> <chr> <int> <chr>
1 1 56 1 to 5 56 32 to 53 23 2 to 5 234 3 to 5
2 2 42 12 to 16 642 423 to 620 342 4 to 16 3242 4 to 16
3 3 6523 23 to 56 56423 432 to 560 223 3 to 5 2423 43 to 56
4 4 123 23 to 102 3 431 to 802 13 4 to 12 123 43 to 102
5 5 56 23 to 64 2 2 to 6 2 4 to 7 24 4 to 6
6 6 245623 56 to 87 4523 43 to 68 223 7 to 8 2423 4 to 18
7 7 546 25 to 540 555 23 to 54 5 34 to 50 565 234 to 540
8 8 54566 253 to 560 55646 3 to 6 46 3 to 6 5646 23 to 56
I need to duplicate those levels whose frequency in my factor variable called groups is less than 500.
> head(groups)
[1] 0000000 1000000 1000000 1000000 0000000 0000000
75 Levels: 0000000 0000001 0000010 0000100 0000110 0001000 0001010 0001100 0001110 0010000 0010010 0010100 0010110 ... 1111110
For example:
> table(group)
group
0000000 0000001 0000010 0000100 0000110 0001000 0001010 0001100 0001110 0010000 0010010 0010100 0010110 0011000 0011010 0011100
58674 6 1033 654 223 1232 31 222 17 818 132 32 15 42 9 9
0011110 0100000 0100001 0100010 0100100 0100101 0100110 0101000 0101010 0101100 0101110 0110000 0110010 0110100 0110110 0111000
1 10609 1 487 64 1 58 132 11 12 3 142 27 9 7 11
0111010 0111100 0111110 1000000 1000001 1000010 1000011 1000100 1000101 1000110 1001000 1001001 1001010 1001100 1001110 1010000
5 1 2 54245 10 1005 1 329 1 138 573 1 31 71 11 969
1010010 1010100 1010110 1011000 1011010 1011100 1011110 1100000 1100001 1100010 1100011 1100100 1100110 1101000 1101010 1101011
147 29 21 63 15 10 4 14161 6 770 1 142 96 260 23 1
1101100 1101110 1110000 1110001 1110010 1110100 1110110 1111000 1111010 1111100 1111110
34 16 439 2 103 13 26 36 13 8 5
Groups 0000001, 0000110, 0001010, 0001100... must be duplicated up to 500.
The ideal would be to have a "sample balanced data" of groups that duplicate those levels often less than 500 and penalize the rest (Levels more than 500 frequency) until reaching 500.
We can use rep on the levels of 'group' for the desired 'n'
factor(rep(levels(group), each = n))
If we need to use the table results as well
factor(rep(levels(group), table(group) + n-table(group)) )
Or with pmax
factor(rep(levels(group), pmax(n, table(levels(group)))))
data
set.seed(24)
group <- factor(sample(letters[1:6], 3000, replace = TRUE))
n <- 500
My question is similar to Fill region between two loess-smoothed lines in R with ggplot1
But I have two groups.
g1<-ggplot(NVIQ_predict,aes(cogn.age, predict, color=as.factor(NVIQ_predict$group)))+
geom_smooth(aes(x = cogn.age, y = upper,group=group),se=F)+
geom_line(aes(linetype = group), size = 0.8)+
geom_smooth(aes(x = cogn.age, y = lower,group=group),se=F)
I want to fill red and blue for each group.
I tried:
gg1 <- ggplot_build(g1)
df2 <- data.frame(x = gg1$data[[1]]$x,
ymin = gg1$data[[1]]$y,
ymax = gg1$data[[3]]$y)
g1 + geom_ribbon(data = df2, aes(x = x, ymin = ymin, ymax = ymax),fill = "grey", alpha = 0.4)
But it gave me the error: Aesthetics must either be length one, or the same length as the dataProblems
I get the same error every time my geom_ribbon() data and ggplot() data differ.
Can somebody help me with it? Thank you so much!
My data looks like:
> NVIQ_predict
cogn.age predict upper lower group
1 7 39.04942 86.68497 18.00000 1
2 8 38.34993 82.29627 18.00000 1
3 10 37.05174 74.31657 18.00000 1
4 11 36.45297 70.72421 18.00000 1
5 12 35.88770 67.39555 18.00000 1
6 13 35.35587 64.32920 18.00000 1
7 14 34.85738 61.52322 18.00000 1
8 16 33.95991 56.68024 18.00000 1
9 17 33.56057 54.63537 18.00000 1
10 18 33.19388 52.83504 18.00000 1
11 19 32.85958 51.27380 18.00000 1
12 20 32.55752 49.94791 18.00000 1
13 21 32.28766 48.85631 18.00000 1
14 24 31.67593 47.09206 18.00000 1
15 25 31.53239 46.91136 18.00000 1
16 28 31.28740 48.01764 18.00000 1
17 32 31.36627 50.55201 18.00000 1
18 35 31.73386 53.19630 18.00000 1
19 36 31.91487 54.22624 18.00000 1
20 37 32.13026 55.25721 18.00000 1
21 38 32.38237 56.26713 18.00000 1
22 40 32.98499 58.36229 18.00000 1
23 44 34.59044 62.80187 18.00000 1
24 45 35.06804 64.01951 18.00000 1
25 46 35.57110 65.31888 18.00000 1
26 47 36.09880 66.64696 17.93800 1
27 48 36.72294 67.60053 17.97550 1
28 49 37.39182 68.49995 18.03062 1
29 50 38.10376 69.35728 18.10675 1
30 51 38.85760 70.17693 18.18661 1
31 52 39.65347 70.95875 18.27524 1
32 53 40.49156 71.70261 18.38020 1
33 54 41.35332 72.44006 17.90682 1
34 59 46.37849 74.91802 18.63206 1
35 60 47.53897 75.66218 19.64432 1
36 61 48.74697 76.43933 20.82346 1
37 63 51.30607 78.02426 23.73535 1
38 71 63.43129 86.05467 40.43482 1
39 72 65.15618 87.44794 42.72704 1
40 73 66.92714 88.95324 45.01966 1
41 84 89.42079 114.27939 68.03834 1
42 85 91.73831 117.44007 69.83676 1
43 7 33.69504 54.03695 15.74588 2
44 8 34.99931 53.96500 18.00533 2
45 10 37.61963 54.05684 22.43516 2
46 11 38.93493 54.21969 24.60049 2
47 12 40.25315 54.45963 26.73027 2
48 13 41.57397 54.77581 28.82348 2
49 14 42.89710 55.16727 30.87982 2
50 16 45.54954 56.17193 34.88453 2
51 17 46.87877 56.78325 36.83632 2
52 18 48.21025 57.46656 38.75807 2
53 19 49.54461 58.22266 40.65330 2
54 20 50.88313 59.05509 42.52505 2
55 21 52.22789 59.97318 44.36944 2
56 24 56.24397 63.21832 49.26963 2
57 25 57.55394 64.33850 50.76938 2
58 28 61.45282 68.05043 54.85522 2
59 32 66.44875 72.85234 60.04517 2
60 35 69.96560 76.06171 63.86949 2
61 36 71.09268 77.06821 65.11714 2
62 37 72.19743 78.04559 66.34927 2
63 38 73.28041 78.99518 67.56565 2
64 40 75.37861 80.81593 69.94129 2
65 44 79.29028 84.20275 74.37780 2
66 45 80.20272 85.00888 75.39656 2
67 46 81.08645 85.80180 76.37110 2
68 47 81.93696 86.57689 77.29704 2
69 48 82.75920 87.34100 78.17739 2
70 49 83.55055 88.09165 79.00945 2
71 50 84.30962 88.82357 79.79567 2
72 51 85.03743 89.53669 80.53817 2
73 52 85.73757 90.23223 81.24291 2
74 53 86.41419 90.91607 81.91232 2
75 54 87.05716 91.58632 82.52800 2
76 59 89.75923 94.58218 84.93629 2
77 60 90.18557 95.05573 85.31541 2
78 61 90.58166 95.51469 85.64864 2
79 63 91.27115 96.31107 86.23124 2
80 71 92.40983 98.35031 86.46934 2
81 72 92.36362 98.52258 86.20465 2
82 73 92.27734 98.67161 85.88308 2
83 84 88.66150 98.84699 78.47602 2
84 85 88.08846 98.73625 77.44067 2
According to Gregor, I tried inherit.aes = FALSE, the error is gone. But my plot looks like:
We've got all the info we need. Now we just need to, ahem, connect the dots ;-)
First the input data:
NVIQ_predict <- read.table(text = "
id cogn.age predict upper lower group
1 7 39.04942 86.68497 18.00000 1
2 8 38.34993 82.29627 18.00000 1
3 10 37.05174 74.31657 18.00000 1
4 11 36.45297 70.72421 18.00000 1
5 12 35.88770 67.39555 18.00000 1
6 13 35.35587 64.32920 18.00000 1
7 14 34.85738 61.52322 18.00000 1
8 16 33.95991 56.68024 18.00000 1
9 17 33.56057 54.63537 18.00000 1
10 18 33.19388 52.83504 18.00000 1
11 19 32.85958 51.27380 18.00000 1
12 20 32.55752 49.94791 18.00000 1
13 21 32.28766 48.85631 18.00000 1
14 24 31.67593 47.09206 18.00000 1
15 25 31.53239 46.91136 18.00000 1
16 28 31.28740 48.01764 18.00000 1
17 32 31.36627 50.55201 18.00000 1
18 35 31.73386 53.19630 18.00000 1
19 36 31.91487 54.22624 18.00000 1
20 37 32.13026 55.25721 18.00000 1
21 38 32.38237 56.26713 18.00000 1
22 40 32.98499 58.36229 18.00000 1
23 44 34.59044 62.80187 18.00000 1
24 45 35.06804 64.01951 18.00000 1
25 46 35.57110 65.31888 18.00000 1
26 47 36.09880 66.64696 17.93800 1
27 48 36.72294 67.60053 17.97550 1
28 49 37.39182 68.49995 18.03062 1
29 50 38.10376 69.35728 18.10675 1
30 51 38.85760 70.17693 18.18661 1
31 52 39.65347 70.95875 18.27524 1
32 53 40.49156 71.70261 18.38020 1
33 54 41.35332 72.44006 17.90682 1
34 59 46.37849 74.91802 18.63206 1
35 60 47.53897 75.66218 19.64432 1
36 61 48.74697 76.43933 20.82346 1
37 63 51.30607 78.02426 23.73535 1
38 71 63.43129 86.05467 40.43482 1
39 72 65.15618 87.44794 42.72704 1
40 73 66.92714 88.95324 45.01966 1
41 84 89.42079 114.27939 68.03834 1
42 85 91.73831 117.44007 69.83676 1
43 7 33.69504 54.03695 15.74588 2
44 8 34.99931 53.96500 18.00533 2
45 10 37.61963 54.05684 22.43516 2
46 11 38.93493 54.21969 24.60049 2
47 12 40.25315 54.45963 26.73027 2
48 13 41.57397 54.77581 28.82348 2
49 14 42.89710 55.16727 30.87982 2
50 16 45.54954 56.17193 34.88453 2
51 17 46.87877 56.78325 36.83632 2
52 18 48.21025 57.46656 38.75807 2
53 19 49.54461 58.22266 40.65330 2
54 20 50.88313 59.05509 42.52505 2
55 21 52.22789 59.97318 44.36944 2
56 24 56.24397 63.21832 49.26963 2
57 25 57.55394 64.33850 50.76938 2
58 28 61.45282 68.05043 54.85522 2
59 32 66.44875 72.85234 60.04517 2
60 35 69.96560 76.06171 63.86949 2
61 36 71.09268 77.06821 65.11714 2
62 37 72.19743 78.04559 66.34927 2
63 38 73.28041 78.99518 67.56565 2
64 40 75.37861 80.81593 69.94129 2
65 44 79.29028 84.20275 74.37780 2
66 45 80.20272 85.00888 75.39656 2
67 46 81.08645 85.80180 76.37110 2
68 47 81.93696 86.57689 77.29704 2
69 48 82.75920 87.34100 78.17739 2
70 49 83.55055 88.09165 79.00945 2
71 50 84.30962 88.82357 79.79567 2
72 51 85.03743 89.53669 80.53817 2
73 52 85.73757 90.23223 81.24291 2
74 53 86.41419 90.91607 81.91232 2
75 54 87.05716 91.58632 82.52800 2
76 59 89.75923 94.58218 84.93629 2
77 60 90.18557 95.05573 85.31541 2
78 61 90.58166 95.51469 85.64864 2
79 63 91.27115 96.31107 86.23124 2
80 71 92.40983 98.35031 86.46934 2
81 72 92.36362 98.52258 86.20465 2
82 73 92.27734 98.67161 85.88308 2
83 84 88.66150 98.84699 78.47602 2
84 85 88.08846 98.73625 77.44067 2", header = TRUE)
NVIQ_predict$id <- NULL
Make sure the group column is a factor variable, so we can use it as a line type.
NVIQ_predict$group <- as.factor(NVIQ_predict$group)
Then build the plot.
library(ggplot2)
g1 <- ggplot(NVIQ_predict, aes(cogn.age, predict, color=group)) +
geom_smooth(aes(x = cogn.age, y = upper, group=group), method = loess, se = FALSE) +
geom_smooth(aes(x = cogn.age, y = lower, group=group), method = loess, se = FALSE) +
geom_line(aes(linetype = group), size = 0.8)
Finally, extract the (x,ymin) and (x,ymax) coordinates of the curves for group 1 as well as group 2. These pairs have identical x-coordinates, so connecting those points mimics shading the areas between both curves. This was explained in Fill region between two loess-smoothed lines in R with ggplot. The only difference here is that we need to be a bit more careful to select and connect the points that belong to the correct curves...
gp <- ggplot_build(g1)
d1 <- gp$data[[1]]
d2 <- gp$data[[2]]
df1 <- data.frame(x = d1[d1$group == 1,]$x,
ymin = d2[d2$group == 1,]$y,
ymax = d1[d1$group == 1,]$y)
df2 <- data.frame(x = d1[d1$group == 2,]$x,
ymin = d2[d2$group == 2,]$y,
ymax = d1[d1$group == 2,]$y)
g1 + geom_ribbon(data = df1, aes(x = x, ymin = ymin, ymax = ymax), inherit.aes = FALSE, fill = "grey", alpha = 0.4) +
geom_ribbon(data = df2, aes(x = x, ymin = ymin, ymax = ymax), inherit.aes = FALSE, fill = "grey", alpha = 0.4)
The result looks like this:
I am totally lost with using ggplot. I've tried with various solutions, but none were successful. Using numbers below, I want to create a line graph where the three lines, each representing df$c, df$d, and df$e, the x-axis representing df$a, and the y-axis representing the cumulative probability where 95=100%.
a b c d e
1 0 18 0.047368421 0.036842105 0.005263158
2 1 20 0.047368421 0.036842105 0.010526316
13 2 26 0.052631579 0.031578947 0.026315789
20 3 35 0.084210526 0.036842105 0.031578947
22 4 41 0.068421053 0.052631579 0.047368421
24 5 88 0.131578947 0.068421053 0.131578947
26 7 90 0.131578947 0.068421053 0.136842105
27 8 93 0.126315789 0.068421053 0.147368421
28 9 96 0.126315789 0.073684211 0.152631579
3 10 115 0.105263158 0.078947368 0.210526316
4 11 116 0.105263158 0.084210526 0.210526316
5 12 120 0.094736842 0.084210526 0.226315789
6 13 128 0.105263158 0.073684211 0.247368421
7 14 129 0.100000000 0.073684211 0.252631579
8 15 154 0.031578947 0.042105263 0.368421053
9 16 155 0.031578947 0.036842105 0.373684211
10 17 158 0.036842105 0.036842105 0.378947368
11 18 161 0.036842105 0.031578947 0.389473684
12 19 163 0.026315789 0.031578947 0.400000000
14 20 169 0.026315789 0.021052632 0.421052632
15 21 171 0.015789474 0.021052632 0.431578947
16 22 174 0.010526316 0.021052632 0.442105263
17 24 176 0.010526316 0.021052632 0.447368421
18 25 186 0.005263158 0.005263158 0.484210526
19 26 187 0.005263158 0.000000000 0.489473684
21 35 188 0.005263158 0.005263158 0.489473684
23 40 189 0.005263158 0.000000000 0.494736842
25 60 190 0.000000000 0.000000000 0.500000000
I was somewhat successful with using R base coding
plot(df$a, df$c, type="l",col="red")
lines(df$a, df$d, col="green")
lines(df$a, df$e, col="blue")
You first need to melt your data so that you have one column that designates from which variables the data comes from (call it variable) and another column that lists actual value (call it value). Study the example below to fully understand what happens to the variables from the original data.frame you want to keep constant.
library(reshape2)
xymelt <- melt(xy, id.vars = "a")
library(ggplot2)
ggplot(xymelt, aes(x = a, y = value, color = variable)) +
theme_bw() +
geom_line()
ggplot(xymelt, aes(x = a, y = value)) +
theme_bw() +
geom_line() +
facet_wrap(~ variable)
This code is also drawing column from your data called "d". You can remove it prior to melting, after melting, prior to plotting... or plot it.
I haven't been using pie graph a lot in r, is there a way to make a pie graph and only show the top 10 names with percentage?
For example, here's a simple version of my data:
> data
count METRIC_ID
1 8 71
2 2 1035
3 5 1219
4 4 1277
5 1 1322
6 3 1444
7 5 1462
8 17 1720
9 6 2019
10 2 2040
11 1 2413
12 11 2489
13 24 2610
14 29 2737
15 1 2907
16 1 2930
17 2 2992
18 1 2994
19 2 3020
20 4 3045
21 35 3222
22 2 3245
23 5 3306
24 2 3348
25 2 3355
26 2 3381
27 3 3383
28 4 3389
29 6 3404
30 1 3443
31 22 3465
32 3 3558
33 15 3600
34 3 3730
35 6 3750
36 1 3863
37 1 3908
38 5 3913
39 3 3968
40 9 3972
41 2 3978
42 5 4077
43 4 4086
44 3 4124
45 2 4165
46 3 4205
47 8 4206
48 4 4210
49 12 4222
50 4 4228
and I want to see the count of each METRIC_ID's distribution:
pie(data$count, data$METRIC_ID)
But this Chart marks every single METRIC_ID on the graph, when I have over 100 METRIC_ID, it looks like a mess. How can I only mark the top n (for example, n=5) METRIC_ID on the graph, and show the count of that n METRIC_ID only?
Thank you for your help!!!
To suppress plotting of some labels, set them to NA. Try this:
labls <- data$METRIC_ID
labls[data$count < 3] <- NA
pie(data$count, paste(labls))
Simply subset your data before creating the piechart. I'd do somehting like:
Sort your datasets using order.
Select the first ten rows.
Create the pie chart from the resulting data.
Pie charts are not the best way to visualize your data, just google pie chart problems, e.g. this link. I'd go for something like:
library(ggplot2)
dat = dat[order(-dat$count),]
dat = within(dat, {METRIC_ID = factor(METRIC_ID, levels = METRIC_ID)})
ggplot(dat, aes(x = METRIC_ID, y = count)) + geom_point()
Here I just plot all the data, which I think still leads to a readable graph. This graph is more formally known as a dotplot, and is heavily used in the graphics book of Cleveland. Here the height is linked to count, which is much easier to interpret that linking count to the fraction of the area of a circle, as in the case of the piechart.
Find a better type of chart for your data.
Here is a possibility to create the chart you want:
data2 <- data[data$count %in% tail(sort(data$count),5),]
pie(data2$count, data2$METRIC_ID)
Slightly better:
data3 <- data2
data3$METRIC_ID <- as.character(data3$METRIC_ID)
data3 <- rbind(data3,data.frame(count=sum(data[! data$count %in% tail(sort(data$count),5),"count"]),METRIC_ID="others"))
pie(data3$count, data3$METRIC_ID)