Can I use R to only analyze data past a certain date?

I have an excel sheet that I imported into RStudio which contains data for every subject of a certain population. Each subject has their own set of data with corresponding dates, but I only want to look at the data and perform statistical analyses on the dates past a unique date for each subject.
I'm assuming I can use the split function to create smaller dataframes, with each corresponding to that of each subject, and then use some function to analyze the data in a loop to run on all of the smaller dataframes I created from the split.
Some of these subjects will have over 1,000 data points. My two main questions are:
1) Is there a function I can use to analyze the data for each subject past a specific unique date to each subject?
2) Is the strategy I proposed above a viable one?
I unfortunately have very little experience in data analysis or any extensive background in computer science. Thanks for any help.
Edit: There was a request for an example of the kind of data I was talking about. If I had data similar to the following, could I still use the above strategy? P1 and P2 each have their own set of data that I want to analyze after the TxDate.
>data
1 Date BMI Glucose Cholesterol TxDate
2 P1 3/3/15
3 12/1/14 24 145 99
4 3/18/15 26 123 101
5 4/21/15 28 111 85
6 6/2/15 25 133 90
7
8
9 P2 4/6/16
10 1/3/16 33 145 200
11 3/30/16 31 162 178
12 5/13/16 34 190 134
13 6/12/16 34 183 168
14 7/9/16 35 200 189
15 9/10/16 31 175 190
16 11/23/17 27 121 120
17
18

Here are some suggestions to get you started:
1) Tidy your data. To do this you could look into ways to modify your input data so it looks more like this:
ID Date BMI Glucose Cholesterol TxDate
3 P1 12/1/14 24 145 99 3/3/15
4 P1 3/18/15 26 123 101 3/3/15
5 P1 4/21/15 28 111 85 3/3/15
6 P1 6/2/15 25 133 90 3/3/15
10 P2 1/3/16 33 145 200 4/6/16
11 P2 3/30/16 31 162 178 4/6/16
12 P2 5/13/16 34 190 134 4/6/16
13 P2 6/12/16 34 183 168 4/6/16
14 P2 7/9/16 35 200 189 4/6/16
15 P2 9/10/16 31 175 190 4/6/16
16 P2 11/23/17 27 121 120 4/6/16
Notice the ID and TxDate columns are filled in with the appropriate values and several rows were dropped. Also, the row containing ID, Date, etc. is actually a header, not a data row. Don't be too surprised if the tidying step takes longer than the analysis.
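The exact steps depend on how the sheet comes in from Excel. As a hypothetical sketch, if the import ends up as a data frame named raw with columns ID, Date, BMI, Glucose, Cholesterol, TxDate, where ID and TxDate are only present on each subject's first row and the measurement columns are NA on the header/blank rows, then tidyr::fill() plus a filter is one way to fill them down and drop the non-data rows:
# Hypothetical sketch: `raw` is the imported sheet described above
library(dplyr)
library(tidyr)

tidy <- raw %>%
  fill(ID, TxDate) %>%      # carry each subject's ID and TxDate downwards
  filter(!is.na(BMI))       # drop the header and blank rows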
Now, for the purpose of this example let's use this as your data:
library(lubridate)

df <- data.frame(
  ID = c(rep("P1", 4), rep("P2", 7)),
  Date = as.Date(mdy(c("12/1/14", "3/18/15", "4/21/15", "6/2/15", "1/3/16", "3/30/16",
                       "5/13/16", "6/12/16", "7/9/16", "9/10/16", "11/23/17"))),
  BMI = c(24, 26, 28, 25, 33, 31, 34, 34, 35, 31, 27),
  Glucose = c(145, 123, 111, 133, 145, 162, 190, 183, 200, 175, 121),
  Cholesterol = c(99, 101, 85, 90, 200, 178, 134, 168, 189, 190, 120),
  TxDate = as.Date(mdy(c("3/3/15", "3/3/15", "3/3/15", "3/3/15", "4/6/16", "4/6/16",
                         "4/6/16", "4/6/16", "4/6/16", "4/6/16", "4/6/16"))),
  stringsAsFactors = FALSE)
2) Check whether your Date and TxDate columns are being represented as Date objects. If your data.frame is named df, something like is.Date(df$Date) and is.Date(df$TxDate) from lubridate will tell you, or just str(df).
If not, read about ways to convert them to date objects, perhaps with the as.Date() function combined with mdy() from the lubridate package.
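For instance, a minimal sketch, assuming the two columns come in as "m/d/yy" character strings:
# Minimal sketch: convert month/day/year strings to Date objects
library(lubridate)

df$Date   <- mdy(df$Date)     # mdy() already returns Date objects
df$TxDate <- mdy(df$TxDate)
str(df)                       # both columns should now show class "Date"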
3) Once you have the dates represented as date objects you could subset the data frame with a simple logical statement, like this
# subset dataframe
df1 <- df[df$Date > df$TxDate, ]
Now df1 should look like this:
ID Date BMI Glucose Cholesterol TxDate
2 P1 2015-03-18 26 123 101 2015-03-03
3 P1 2015-04-21 28 111 85 2015-03-03
4 P1 2015-06-02 25 133 90 2015-03-03
7 P2 2016-05-13 34 190 134 2016-04-06
8 P2 2016-06-12 34 183 168 2016-04-06
9 P2 2016-07-09 35 200 189 2016-04-06
10 P2 2016-09-10 31 175 190 2016-04-06
11 P2 2017-11-23 27 121 120 2016-04-06
What's left is the data you seem to need for your analysis.
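From there, questions 1) and 2) from the original post can be handled either with split() plus a loop, as proposed, or with a grouped summary. A sketch, where mean(Glucose) is only a placeholder for whatever statistics are actually needed:
# Per-subject analysis sketch on the subsetted data df1
by_subject <- split(df1, df1$ID)                  # one data frame per subject
results <- lapply(by_subject, function(d) mean(d$Glucose))

# or the dplyr equivalent
library(dplyr)
df1 %>%
  group_by(ID) %>%
  summarise(n = n(), mean_glucose = mean(Glucose))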


Creating boxplot based on some conditions

Data given are a sample of cholesterol levels taken from 24 hospital employees who were on a standard American diet and who agreed to adopt a vegetarian diet for 1 month. Serum-cholesterol measurements were made before adopting the diet and 1 month after.
Subject Before After Difference
1 1 195 146 49
2 2 145 155 -10
3 3 205 178 27
4 4 159 146 13
5 5 244 208 36
6 6 166 147 19
7 7 250 202 48
8 8 236 215 21
9 9 192 184 8
10 10 224 208 16
11 11 238 206 32
12 12 197 169 28
13 13 169 182 -13
14 14 158 127 31
15 15 151 149 2
16 16 197 178 19
17 17 180 161 19
18 18 222 187 35
19 19 168 176 -8
20 20 168 145 23
21 21 167 154 13
22 22 161 153 8
23 23 178 137 41
24 24 137 125 12
Now here is the question I am trying to answer. Some investigators believe that the effects of diet
on cholesterol are more evident in people with high rather than low cholesterol levels. If you split the data  according to whether baseline cholesterol is above or below the median, can you comment descriptively on this issue?
Now, I am thinking of creating a boxplot based on two categories, using dplyr for the data manipulation. I will create a new column based on whether Before is less than or greater than the median of Before, giving a character vector with "high" for high baseline cholesterol and "low" for low baseline cholesterol. Then I will draw a boxplot of Difference by that new categorical column. Here is my code; I call the original data set df2.
library(dplyr)
library(ggplot2)

df2 %>%
  mutate(new_col = if_else(Before < median(Before), "low", "high")) %>%
  group_by(new_col) %>%
  ggplot(aes(x = new_col, y = Difference)) +
  geom_boxplot()
And following is the boxplot I get
Based on this, I conclude that the investigators are right and the effects of diet on cholesterol are more evident in people with high rather than low cholesterol levels. I want to know if this can be done more effectively.
This is more a statistical question than a programming one, so it would belong more on stats.stackexchange.com than on Stack Overflow.
Anyway, categorizing a variable depending on the median is not the recommended way of visualizing associations, as you are suppressing a lot of information. You can read about this in this very good article by Peter Flom.
It is better to keep all the points and apply some spline or smoothing algorithm.
For instance, you could consider something like this:
ggplot(df2, aes(x = Before, y = Difference)) +
  geom_point() +
  geom_smooth()
Here the relationship is clearly visible, while all the information is kept.
If you really have to generate subgroups, you could also try something like this:
df2 %>%
  mutate(new_col = if_else(Before < median(Before), "low", "high")) %>%
  ggplot(aes(x = Before, y = Difference, group = new_col, color = new_col)) +
  geom_point() +
  geom_smooth(span = 3)  # try some other values here
However, splitting at the median is still not a very good idea, especially with so few data points. You might want to assess the functional form of the relationship, but that would need a specific question on stats.stackexchange.com.
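For example, one hedged way to look at that functional form while keeping every point is a simple linear model of the change on the baseline value:
# Sketch: regress the change on the baseline instead of dichotomizing
fit <- lm(Difference ~ Before, data = df2)
summary(fit)   # a positive slope would support "bigger drops at higher baselines"

library(ggplot2)
ggplot(df2, aes(x = Before, y = Difference)) +
  geom_point() +
  geom_smooth(method = "lm")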
Not really an answer, but more of a different approach to visualising the data.
library( data.table )
library( ggplot2 )

DT.melt <- melt( DT, id.vars = "Subject", measure.vars = c( "Before", "After" ) )

ggplot() +
  geom_line( data = DT.melt,
             aes( x = variable, y = value, group = Subject ) ) +
  geom_line( data = DT.melt[ , .(mean = mean(value)), by = variable ],
             aes( x = variable, y = mean, group = 1 ), color = "red", size = 2 ) +
  labs( x = "", y = "" )
Sample data used:
DT <- fread(" Subject Before After Difference
1 195 146 49
2 145 155 -10
3 205 178 27
4 159 146 13
5 244 208 36
6 166 147 19
7 250 202 48
8 236 215 21
9 192 184 8
10 224 208 16
11 238 206 32
12 197 169 28
13 169 182 -13
14 158 127 31
15 151 149 2
16 197 178 19
17 180 161 19
18 222 187 35
19 168 176 -8
20 168 145 23
21 167 154 13
22 161 153 8
23 178 137 41
24 137 125 12")

How to use NSE inside fct_reorder() in ggplot2

I would like to know how to use NSE (Non-Standard Evaluation) expression in fct_reorder() in ggplot2 to replicate charts for different data frames.
This is an example of data frame that I use to draw a chart:
   travel_time_br30 travel_time_br30_int time_reduction shift not_shift total    share
1              0-30                    0             10  2780      3268  6048 45.96561
2              0-30                    0             20  2779      3269  6048 45.94907
3              0-30                    0             30  2984      3064  6048 49.33862
4              0-30                    0             40  3211      2837  6048 53.09193
5             30-60                   30             10  2139      2007  4146 51.59190
6             30-60                   30             20  2159      1987  4146 52.07429
7             30-60                   30             30  2363      1783  4146 56.99469
8             30-60                   30             40  2478      1668  4146 59.76845
9             60-90                   60             10   764       658  1422 53.72714
10            60-90                   60             20   721       701  1422 50.70323
11            60-90                   60             30   782       640  1422 54.99297
12            60-90                   60             40   801       621  1422 56.32911
13           90-120                   90             10   296       224   520 56.92308
14           90-120                   90             20   302       218   520 58.07692
15           90-120                   90             30   317       203   520 60.96154
16           90-120                   90             40   314       206   520 60.38462
17          120-150                  120             10    12        10    22 54.54545
18          120-150                  120             20    10        12    22 45.45455
19          120-150                  120             30    10        12    22 45.45455
20          120-150                  120             40    13         9    22 59.09091
21          150-180                  150             10    35        21    56 62.50000
22          150-180                  150             20    40        16    56 71.42857
23          150-180                  150             30    40        16    56 71.42857
24          150-180                  150             40    35        21    56 62.50000
These are the scripts to draw a chart from the above data frame:
g.var  <- "travel_time_br30"
go.var <- "travel_time_br30_int"

test %>%
  ggplot(., aes_(x = as.name(x.var), y = as.name("share"), group = as.name(g.var))) +
  geom_line(size = 1.4,
            aes(color = fct_reorder(travel_time_br30, order(travel_time_br30_int))))
As I have several data frames with different fields, such as access_time_br30 and access_time_br30_int instead of travel_time_br30 and travel_time_br30_int, I set two variables (g.var and go.var) so I can easily replicate multiple charts with the same scripts.
As I need to reorder the factor group numerically, in particular ordering travel_time_br30 by travel_time_br30_int, I am using the fct_reorder() function inside ggplot(., aes_(...)). However, if I use aes_ with fct_reorder() in geom_line(), as in the following script, it returns the error Error: `f` must be a factor (or character vector).
geom_line(size=1.4, aes_(color=fct_reorder(as.name(g.var),order(as.name(go.var)))))
fct_reorder() does not seem to have a standard-evaluation version such as fct_reorder_().
Is it impossible to use both aes_ and fct_reorder() in a sequence of scripts or are there any other solutions?
Based on my novice working knowledge of tidy-eval, you could transform your factor order in mutate() before passing the data into ggplot() and achieve your result.
Sorry, I couldn't easily read in your table above because of the line wrapping, so I made a new example from mtcars that I think captures your intent (let me know if it doesn't).
library(dplyr)
library(forcats)
library(ggplot2)
library(rlang)

mtcars2 <- mutate(mtcars,
                  gear_int = 6 - gear,
                  gear_intrev = rev(gear_int)) %>%
  mutate_at(vars(cyl, gear), as.factor)

gg_reorder <- function(data, col_var, col_order) {
  eq_var <- sym(col_var)    # sym() is flexible and my novice preference
  eq_ord <- sym(col_order)
  data %>%
    mutate(!!quo_name(eq_var) := fct_reorder(!!eq_var, !!eq_ord)) %>%
    ggplot(aes_(~mpg, ~hp, color = eq_var)) +
    geom_line()
}
And now put it to use plotting...
gg_reorder(mtcars2, "gear", "gear_int")
gg_reorder(mtcars2, "gear", "gear_intrev")
I didn't specify all of the aes_() variables as strings, but you could pass those as text and use the as.name() pattern. If you want more tidy-eval patterns, Edwin Thoen wrote up a bunch of common cases.
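As a hedged alternative (assuming a reasonably recent ggplot2/rlang), the .data pronoun lets you keep plain aes() while passing column names as strings, so no standard-evaluation version of fct_reorder() is needed. Here x.var is a hypothetical name for whatever goes on the x axis:
library(ggplot2)
library(forcats)

x.var  <- "time_reduction"        # hypothetical; replace with your x variable
g.var  <- "travel_time_br30"
go.var <- "travel_time_br30_int"

ggplot(test, aes(x = .data[[x.var]], y = share,
                 group = .data[[g.var]],
                 color = fct_reorder(.data[[g.var]], .data[[go.var]]))) +
  geom_line(size = 1.4)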

How to process multi-column data in a data.frame with plyr

I am trying to process DSC (differential scanning calorimetry) data with R, but I have run into some trouble. All of this used to be done tediously in Origin or QtiPlot in my lab, and I wonder if there is a way to do it in batch. So far it has not gone well. For example, maybe I have used the wrong column names for my data.frame; the code
dat$0.5min
Error: unexpected numeric constant in "dat$0.5"
cannot reach my data.
Below is a full description of my purpose. Thank you in advance!
The DSC data looks like this (I have stored the CSV file at my Google Drive link):
T1 0.5min T2 1min
40.59 -0.2904 40.59 -0.2545
40.81 -0.281 40.81 -0.2455
41.04 -0.2747 41.04 -0.2389
41.29 -0.2728 41.29 -0.2361
41.54 -0.2553 41.54 -0.2239
41.8 -0.07 41.8 -0.0732
42.06 0.1687 42.06 0.1414
42.32 0.3194 42.32 0.2817
42.58 0.3814 42.58 0.3421
42.84 0.3863 42.84 0.3493
43.1 0.3665 43.11 0.3322
43.37 0.3438 43.37 0.3109
43.64 0.3265 43.64 0.2937
43.9 0.3151 43.9 0.2819
44.17 0.3072 44.17 0.2735
44.43 0.2995 44.43 0.2656
44.7 0.2899 44.7 0.2563
44.96 0.2779 44.96 0.245
In fact I have merged the data into a data.frame and hope I can adjust it and do something further.
The command is:
dat<-read.csv("Book1.csv",header=F)
colnames(dat)<-c('T1','0.5min','T2','1min','T3','2min','T4','4min','T5','8min','T6','10min',
'T7','20min','T8','ascast1','T9','ascast2','T10','ascast3','T11','ascast4',
'T12','ascast5'
)
So dat is a data.frame with 1163 obs. of 24 variables.
T1, T2, ..., T12 are the temperatures at which the samples were measured in the DSC; although nominally on the same interval, they differ slightly due to the instability of the machine.
The column next to each of T1~T12 is the heat flow recorded by the machine for a given heat-treatment duration, and ascast1~ascast5 are untreated samples used to check the accuracy of the machine.
Now I need to do something like the following:
T1~T12 are in degrees Celsius; I need to convert them to Kelvin, which means adding 273.16 to every value.
Two temperatures are chosen to compare the results: Ts = 180.25 and Te = 240.45 (both in degrees Celsius; I checked in QtiPlot to make sure). To be clear, I list the two temperatures and the first 6 columns of data.
T1 0.5min T2 1min T3 2min T4 4min
180.25 -0.01710000 180.25 -0.01780000 180.25 -0.02120000 180.25 -0.02020000
. . . .
. . . .
240.45 0.05700000 240.45 0.04500000 240.45 0.05780000 240.45 0.05580000
All heat-flow values at Ts should be made the same; for convenience they can be set to 0. So for each of the columns (0.5min, 1min, 2min, 4min, 8min, 10min, 20min and ascast1~ascast5), the heat-flow value at Ts should be subtracted from every value in that column.
For the heat flow at Te, the values should be adjusted so that all columns agree at Te. The idea is the following: (1) calculate the mean of the 12 heat-flow values at Te; call it Hmean, the value every column should take at Te. (2) For the data in column 0.5min, which I denote col("0.5min"), the linear transform formula is the following:
col("0.5min")-[([0.05700000-(-0.01710000)]-Hmean)/(Te-Ts)]*(col(T1)-Ts)
Actually, [0.05700000 - (-0.01710000)] was already done in step 2, but I write it out for reference. This formula is applied to each pair of T1~T12 and its heat-flow column, i.e. (T1, 0.5min), (T2, 1min), (T3, 2min), ..., 12 pairs in all.
Then we can plot the 12 pairs of data on the same plot over the interval 180~240 (also in degrees Celsius) to magnify the differences between the different DSC scans.
I have been stuck on this problem for two days, so I am turning to Stack Overflow for help.
Thanks!
I am assuming that your question is the one right at the beginning, where you got the following error:
dat$0.5min
Error: unexpected numeric constant in "dat$0.5"
I could not find a question in the rest of the steps; they just seemed like a step-by-step procedure for an experiment.
To fix that error: the problem is that the column name starts with a number, so to reference the column the way you want you need to wrap the name in backticks (`).
> dataF <- data.frame("0.5min" = 1:10, "T2" = 11:20, check.names = F)
> dataF$`0.5min`
[1] 1 2 3 4 5 6 7 8 9 10
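For comparison, a small illustrative aside: with the default check.names = TRUE, data.frame() runs make.names() on the column names, so a name starting with a digit gets mangled.
# With the default check.names = TRUE, "0.5min" becomes "X0.5min"
dataF2 <- data.frame("0.5min" = 1:10, "T2" = 11:20)
names(dataF2)    # "X0.5min" "T2"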
Based on the additional information in the comments, you can add a constant to alternate columns in the following manner:
dataF <- data.frame(matrix(1:100,10,10))
const <- 237
> print(dataF)
X1 X2 X3 X4 X5 X6 X7 X8 X9 X10
1 1 11 21 31 41 51 61 71 81 91
2 2 12 22 32 42 52 62 72 82 92
3 3 13 23 33 43 53 63 73 83 93
4 4 14 24 34 44 54 64 74 84 94
5 5 15 25 35 45 55 65 75 85 95
6 6 16 26 36 46 56 66 76 86 96
7 7 17 27 37 47 57 67 77 87 97
8 8 18 28 38 48 58 68 78 88 98
9 9 19 29 39 49 59 69 79 89 99
10 10 20 30 40 50 60 70 80 90 100
dataF[,seq(1,ncol(dataF),by = 2)] <- dataF[,seq(1,ncol(dataF),by = 2)] + const
> print(dataF)
X1 X2 X3 X4 X5 X6 X7 X8 X9 X10
1 238 11 258 31 278 51 298 71 318 91
2 239 12 259 32 279 52 299 72 319 92
3 240 13 260 33 280 53 300 73 320 93
4 241 14 261 34 281 54 301 74 321 94
5 242 15 262 35 282 55 302 75 322 95
6 243 16 263 36 283 56 303 76 323 96
7 244 17 264 37 284 57 304 77 324 97
8 245 18 265 38 285 58 305 78 325 98
9 246 19 266 39 286 59 306 79 326 99
10 247 20 267 40 287 60 307 80 327 100
To generalize, we know that the columns of a dataframe can be referenced with a vector of numbers/column names. Most operations in R are vectorized. You can use column names or numbers based on the pattern you are looking for.
For example, if I change the names of my first two columns and want to access just those, I do this:
colnames(dataF)[c(1,2)] <- c("Y1","Y2")
# Reference all columns whose names contain "Y". You can do any operation you want on this.
dataF[,grep("Y",colnames(dataF))]
Y1 Y2
1 238 11
2 239 12
3 240 13
4 241 14
5 242 15
6 243 16
7 244 17
8 245 18
9 246 19
10 247 20
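Putting those pieces together for the DSC steps described in the question, a hedged sketch might look like the following. It assumes dat has the paired columns (T1, "0.5min"), (T2, "1min"), ... set up above, and the Hmean convention follows the question's formula (mean of the Ts-adjusted heat flows at Te); adjust if your definition differs.
Ts <- 180.25
Te <- 240.45

temp_cols <- seq(1, ncol(dat), by = 2)   # T1, T2, ...
flow_cols <- temp_cols + 1               # "0.5min", "1min", ...

# heat-flow value of each curve at a given temperature (match on the paired T column)
flow_at <- function(tcol, fcol, temp) dat[which.min(abs(dat[[tcol]] - temp)), fcol]

h_ts <- mapply(flow_at, temp_cols, flow_cols, MoreArgs = list(temp = Ts))
h_te <- mapply(flow_at, temp_cols, flow_cols, MoreArgs = list(temp = Te))

# step 2: zero every curve at Ts
dat[flow_cols] <- sweep(dat[flow_cols], 2, h_ts, "-")

# step 3: linear correction so every curve equals Hmean at Te
Hmean <- mean(h_te - h_ts)
for (k in seq_along(flow_cols)) {
  slope <- ((h_te[k] - h_ts[k]) - Hmean) / (Te - Ts)
  dat[[flow_cols[k]]] <- dat[[flow_cols[k]]] - slope * (dat[[temp_cols[k]]] - Ts)
}

# Celsius -> Kelvin (adding 273.16 as specified in the question)
dat[temp_cols] <- dat[temp_cols] + 273.16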

Error in sort.list(y) while using strata() in R

When I run the command:
H <-length(table(data$Team))
n.h <- rep(5,H)
strata(data, stratanames=data$Team, size=n.h, method="srswor")
I get the error statement:
'Error in sort.list(y) : 'x' must be atomic for 'sort.list' Have you called 'sort' on a list?'
Please help me understand how I can get this stratified sample. The variable 'Team' is a factor.
Data is as below:
zz <- "Team League.ID Player Salary POS G GS InnOuts PO A
ANA AL molinjo0 335000 C 73 57 1573 441 37
ANA AL percitr0 7833333 P 3 0 149 1 3
ARI NL bautida0 4000000 RF 141 135 3536 265 8
ARI NL estalbo0 550000 C 7 3 92 19 2
ARI NL finlest0 7000000 CF 104 102 2689 214 5
ARI NL koplomi0 330000 P 72 0 260 6 23
ARI NL sparkst0 500000 P 27 18 362 8 21
ARI NL villaos0 325000 P 17 0 54 0 4
ARI NL webbbr01 335000 P 33 35 624 13 41
ATL NL francju0 750000 1B 125 71 1894 627 48
ATL NL hamptmi0 14625000 P 35 29 517 13 37
ATL NL marreel0 3000000 LF 90 42 1125 80 4
ATL NL ortizru0 6200000 P 32 34 614 7 38
BAL AL surhobj0 800000 LF 100 31 805 69 0"
data <- read.table(text=zz, header=T)
This should work:
library(sampling)
H <- length(levels(data$Team))
n.h <- rep(5, H)
strata(data, stratanames=c("Team"), size=n.h, method="srswor")
stratanames should be a list of column names, not a reference to the actual column data.
Update:
Now that example data is available, I see another problem: you are sampling without replacement (wor), but your samples are bigger than the available data. You need to sample with replacement in this case:
smpl <- strata(data, stratanames=c("Team"), size=n.h, method="srswr")
BTW, you get the actual data with:
sampledData <- getdata(data, smpl)
This doesn't really answer your question, but a long time ago, I wrote a function called stratified that might be of use to you.
I've posted it here as a GitHub Gist.
Notice that when you have asked for samples that are bigger than your data, it just returns all of the relevant rows.
output <- stratified(data, "Team", 5)
# Some groups
# ---ANA, ATL, BAL---
# contain fewer observations than desired number of samples.
# All observations have been returned from those groups.
table(output$Team)
#
# ANA ARI ATL BAL
# 2 5 4 1
output
# Team League.ID Player Salary POS G GS InnOuts PO A
# 1 ANA AL molinjo0 335000 C 73 57 1573 441 37
# 2 ANA AL percitr0 7833333 P 3 0 149 1 3
# 9 ARI NL webbbr01 335000 P 33 35 624 13 41
# 7 ARI NL sparkst0 500000 P 27 18 362 8 21
# 8 ARI NL villaos0 325000 P 17 0 54 0 4
# 3 ARI NL bautida0 4000000 RF 141 135 3536 265 8
# 6 ARI NL koplomi0 330000 P 72 0 260 6 23
# 12 ATL NL marreel0 3000000 LF 90 42 1125 80 4
# 13 ATL NL ortizru0 6200000 P 32 34 614 7 38
# 10 ATL NL francju0 750000 1B 125 71 1894 627 48
# 11 ATL NL hamptmi0 14625000 P 35 29 517 13 37
# 14 BAL AL surhobj0 800000 LF 100 31 805 69 0
I'll add official documentation to the function at some point, but here's a summary to help you get the best use out of it, with a usage sketch after the list:
The arguments to stratified are:
df: The input data.frame
group: A character vector of the column or columns that make up the "strata".
size: The desired sample size.
If size is a value less than 1, a proportionate sample is taken from each stratum.
If size is a single integer of 1 or more, that number of samples is taken from each stratum.
If size is a vector of integers, the specified number of samples is taken for each stratum. It is recommended that you use a named vector. For example, if you have two strata, "A" and "B", and you wanted 5 samples from "A" and 10 from "B", you would enter size = c(A = 5, B = 10).
select: This allows you to subset the groups in the sampling process. This is a list. For instance, if your group variable was "Group", and it contained three strata, "A", "B", and "C", but you only wanted to sample from "A" and "C", you can use select = list(Group = c("A", "C")).
replace: For sampling with replacement.
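A hedged usage sketch based on the argument summary above (the values are illustrative, and exact behavior depends on the Gist version):
# 25% of each Team (size < 1 means proportionate sampling)
s1 <- stratified(data, "Team", 0.25)

# different counts per stratum, via a named vector
s2 <- stratified(data, "Team", c(ANA = 2, ARI = 5, ATL = 4, BAL = 1))

# sample only from selected teams, with replacement
s3 <- stratified(data, "Team", 5,
                 select = list(Team = c("ARI", "ATL")),
                 replace = TRUE)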

How to make a pie graph that only labels the top n categories

I haven't used pie graphs a lot in R. Is there a way to make a pie graph and only show the top 10 names with their percentages?
For example, here's a simple version of my data:
> data
count METRIC_ID
1 8 71
2 2 1035
3 5 1219
4 4 1277
5 1 1322
6 3 1444
7 5 1462
8 17 1720
9 6 2019
10 2 2040
11 1 2413
12 11 2489
13 24 2610
14 29 2737
15 1 2907
16 1 2930
17 2 2992
18 1 2994
19 2 3020
20 4 3045
21 35 3222
22 2 3245
23 5 3306
24 2 3348
25 2 3355
26 2 3381
27 3 3383
28 4 3389
29 6 3404
30 1 3443
31 22 3465
32 3 3558
33 15 3600
34 3 3730
35 6 3750
36 1 3863
37 1 3908
38 5 3913
39 3 3968
40 9 3972
41 2 3978
42 5 4077
43 4 4086
44 3 4124
45 2 4165
46 3 4205
47 8 4206
48 4 4210
49 12 4222
50 4 4228
and I want to see the count of each METRIC_ID's distribution:
pie(data$count, data$METRIC_ID)
But this chart labels every single METRIC_ID on the graph; since I have over 100 METRIC_IDs, it looks like a mess. How can I label only the top n (for example, n = 5) METRIC_IDs on the graph, and show the counts for those n METRIC_IDs only?
Thank you for your help!!!
To suppress plotting of some labels, set them to NA. Try this:
labls <- data$METRIC_ID
labls[data$count < 3] <- NA   # NA labels are simply skipped by pie()
pie(data$count, labls)
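If you want exactly the top n labels rather than a count threshold, a hedged variant of the same idea ranks the counts:
# Keep only the labels of the n largest counts (here n = 5)
n <- 5
labls <- data$METRIC_ID
labls[rank(-data$count, ties.method = "first") > n] <- NA
pie(data$count, labls)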
Simply subset your data before creating the pie chart. I'd do something like:
Sort your datasets using order.
Select the first ten rows.
Create the pie chart from the resulting data.
Pie charts are not the best way to visualize your data; just google "pie chart problems", e.g. this link. I'd go for something like:
library(ggplot2)
dat = dat[order(-dat$count),]
dat = within(dat, {METRIC_ID = factor(METRIC_ID, levels = METRIC_ID)})
ggplot(dat, aes(x = METRIC_ID, y = count)) + geom_point()
Here I just plot all the data, which I think still leads to a readable graph. This kind of graph is more formally known as a dotplot and is heavily used in Cleveland's graphics book. Here the height is linked to count, which is much easier to interpret than linking count to the fraction of the area of a circle, as in a pie chart.
Find a better type of chart for your data.
Here is a possibility to create the chart you want:
data2 <- data[data$count %in% tail(sort(data$count),5),]
pie(data2$count, data2$METRIC_ID)
Slightly better:
data3 <- data2
data3$METRIC_ID <- as.character(data3$METRIC_ID)
data3 <- rbind(data3,data.frame(count=sum(data[! data$count %in% tail(sort(data$count),5),"count"]),METRIC_ID="others"))
pie(data3$count, data3$METRIC_ID)
