I have some statistical data about process quality, presented in table form (result >> % of all cases):
# df <- read.csv(...)
detection_quality_algo1_pupil <- table(df$pupeuclid1)
detection_quality_algo1_pupil_percent = round(
detection_quality_algo1_pupil[names(detection_quality_algo1_pupil)]
/ nrow(df)
* 100
, digits = 1)
0 - 16.4%
1 - 50.6%
2 - 12.0%
3 - 2.4%
etc.
> detection_quality_algo1_pupil_percent
0 1 2 3 4 5 10 11 12 13 16 17 20 21 22 23 24 25 27 29 30 31 32 33
16.4 50.6 12.0 2.4 0.5 0.6 0.9 0.6 0.3 0.1 0.3 0.1 0.1 0.1 0.1 0.3 0.3 0.1 0.1 0.3 0.1 0.3 0.1 0.1
37 40 43 45 50 53 54 55 56 59 102 104 106 107 112 114 131 132 134 136 138 139 141 142
0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.4 0.1 0.3 0.1 0.1 0.3 0.1 0.1
145 149 150 151 152 153 154 155 156 157 158 160 161 164 166 167 168 169 170 171 173 175 187 191
0.3 0.6 0.1 0.3 0.1 0.5 0.3 0.1 0.1 0.4 0.1 0.1 0.4 0.1 0.1 0.3 0.3 0.3 0.1 0.3 0.1 0.1 0.1 0.1
194 208
0.1 0.1
> pie(detection_quality_algo1_pupil_percent)
My goal is to group the results with a value > 3 into one big group named "> 3" and show the result on a pie chart.
I think it's about applying some filter to the source table...
How can I do this?
Try:
x <- rep(0:5,c(20,50,20,4,4,2))
pie(table(x)) # 3 small groups
pie(table(cut(x, c(-Inf,0:2,Inf),labels=0:3))) # 1 group representing the 3 small groups
And, as @sebpardo notes, pie charts are terrible. Use a barplot instead:
barplot(table(cut(x, c(-Inf,0:2,Inf),labels=0:3)))
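The same collapsing can also be done on the raw values before tabulating, which keeps the percentages from the question. This is only a minimal sketch: `scores` is made-up stand-in data for `df$pupeuclid1`, and the names `grouped` and `pct` are illustrative.

```r
# Made-up stand-in for df$pupeuclid1 (assumption: non-negative integer scores)
set.seed(42)
scores <- c(rep(0:3, c(16, 50, 12, 2)), sample(4:200, 20))

# Collapse every value above 3 into a single "> 3" label
grouped <- ifelse(scores > 3, "> 3", as.character(scores))
grouped <- factor(grouped, levels = c("0", "1", "2", "3", "> 3"))

# Percentage of all cases per group, rounded to one decimal place
pct <- round(prop.table(table(grouped)) * 100, digits = 1)
print(pct)

pie(pct)      # what was asked for
barplot(pct)  # usually easier to read
```

Setting the factor levels explicitly keeps "> 3" as the last slice or bar instead of letting it sort alphabetically.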
You could try adding a new 'collapsed' column to your dataframe using mutate, e.g.
library(dplyr)
df <- mutate(df, new_group = ifelse(group > 3, ">3", group))
I agree with @sebpardo's suggestion in the comment above that there are better ways to visualize data than pie charts. Even the help page advises against them (see ?pie):
"Pie charts are a very bad way of displaying information. [...]"
I have two variables, A and B, that are significantly related if modeled in a piecewise regression. The model has two segments. The problem is that in the plot, the two segments do not connect to one another the way they should: they form a 'nose' at the break point. I've seen in other posts on Stack Overflow that problems with plotting segmented regressions correctly seem widespread.
Here's the dataframe with A and B:
dfrm <- read.table(text=" A B
1 0.04545455 1.3
2 0.09090909 1.1
3 0.13636364 1.6
4 0.18181818 1.8
5 0.22727273 3.4
6 0.27272727 1.8
7 0.31818182 1.9
8 0.36363636 0.7
9 0.40909091 2.9
10 0.45454545 1.2
11 0.50000000 0.8
12 0.54545455 0.7
13 0.59090909 0.6
14 0.63636364 1.7
15 0.68181818 0.7
16 0.72727273 2.0
17 0.77272727 1.2
18 0.81818182 0.5
19 0.86363636 2.8
20 0.90909091 1.0
21 0.95454545 0.5
22 1.00000000 1.0
23 0.06666667 0.2
24 0.13333333 0.6
25 0.20000000 1.6
26 0.26666667 0.4
27 0.33333333 1.7
28 0.40000000 2.5
29 0.46666667 0.5
30 0.53333333 1.5
31 0.60000000 0.4
32 0.66666667 0.3
33 0.73333333 0.2
34 0.80000000 0.2
35 0.86666667 0.7
36 0.93333333 2.2
37 1.00000000 2.3
38 0.05882353 1.4
39 0.11764706 2.7
40 0.17647059 0.7
41 0.23529412 0.2
42 0.29411765 0.8
43 0.35294118 2.9
44 0.41176471 0.4
45 0.47058824 0.5
46 0.52941176 2.1
47 0.58823529 0.4
48 0.64705882 0.6
49 0.70588235 1.0
50 0.76470588 0.3
51 0.82352941 0.9
52 0.88235294 1.4
53 0.94117647 0.6
54 1.00000000 0.4
55 0.10000000 1.7
56 0.20000000 1.4
57 0.30000000 1.5
58 0.40000000 0.6
59 0.50000000 0.4
60 0.60000000 0.5
61 0.70000000 0.4
62 0.80000000 1.0
63 0.90000000 0.8
64 1.00000000 3.0
65 0.03846154 1.5
66 0.07692308 2.7
67 0.11538462 2.2
68 0.15384615 0.6
69 0.19230769 0.7
70 0.23076923 0.5
71 0.26923077 0.5
72 0.30769231 0.6
73 0.34615385 1.2
74 0.38461538 0.8
75 0.42307692 1.8
76 0.46153846 2.1
77 0.50000000 0.6
78 0.53846154 0.7
79 0.57692308 1.3
80 0.61538462 0.4
81 0.65384615 0.7
82 0.69230769 1.2
83 0.73076923 0.8
84 0.76923077 1.2
85 0.80769231 1.0
86 0.84615385 1.4
87 0.88461538 0.9
88 0.92307692 0.8
89 0.96153846 1.7
90 1.00000000 5.8", header=TRUE)
## attach(df) NO, don't use attach and mistrust anyone who tells you differently
model <- lm(B ~ (A < 0.89394)*A + (A >= 0.89394)*A, data=dfrm) # 0.89394 = breakpoint
# Preparing the plot:
a <- sort(unique(dfrm$A))
# Plotting:
plot(B ~ A, data=dfrm)
lines(a, predict(model, list(A=a)), lwd=2, col="blue")
This is the plot: [figure: Piecewise regression]
How can the two segments be connected cleanly at the break point?
It might be easiest to attempt this with a GAM (generalized additive model), fitted with either the gam package or the mgcv package in R. This technique lets you fit a non-linear model in stages, smoothing out the joins (or 'knots') between functions. As a bonus, a GAM is basically a GLM anyway, so the learning curve should be gentle.
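As a rough illustration of the GAM idea, here is a minimal sketch using mgcv with its default smoother. The data are simulated, not the poster's, and the names `x`, `y`, `fit` and `grid` are assumptions made for the example.

```r
library(mgcv)

# Simulated data (assumption): flat at first, then rising sharply near x = 0.85
set.seed(2)
x <- seq(0, 1, length.out = 120)
y <- 1 + 20 * pmax(x - 0.85, 0) + rnorm(120, sd = 0.3)

# Fit a smooth function of x; the curve is continuous by construction
fit <- gam(y ~ s(x))

# Predict on a fine grid and overlay the fitted curve on the data
grid <- data.frame(x = seq(0, 1, length.out = 200))
grid$fit <- as.numeric(predict(fit, newdata = grid))

plot(x, y)
lines(grid$x, grid$fit, lwd = 2, col = "blue")
```

Because the smooth is a single continuous curve, there is no break point at which two separately fitted segments could fail to meet. The trade-off is that the model no longer reports an explicit break point.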
The nose and the disconnect between the segments may be due to lack of precision in the way the break point is determined.
After re-determining the break point for my data based on the method detailed in Crawley (2007: 427), the two segments perfectly connect.
The steps involved are:
1. Define a vector "breaks" of potential break points.
2. Run a for loop of piecewise regressions over all potential break points and extract the residual standard error (stored here as mse) of each model:
mse <- numeric(length(breaks))
for (i in seq_along(breaks)) {
  piecewise <- lm(V_indep ~ V_dep*(V_dep < breaks[i]) + V_dep*(V_dep >= breaks[i]))
  # sigma is the residual standard error (element 6 of the lm summary)
  mse[i] <- summary(piecewise)$sigma
}
3. Identify the break point with the smallest mse:
breaks[which(mse==min(mse))]
4. Fit the model using this break point.
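A different way to make the segments meet, shown here as a minimal sketch and not as the method from Crawley, is to parameterise the model with a hinge term `pmax(x - b, 0)`. The second segment is then forced to start where the first one ends, whatever break point is chosen. The data are simulated, and `x`, `y`, `sigma`, `best` and `fit` are illustrative names.

```r
# Simulated data (assumption): slope changes at x = 0.8
set.seed(1)
x <- seq(0, 1, length.out = 100)
y <- 1 + 0.5 * x + 8 * pmax(x - 0.8, 0) + rnorm(100, sd = 0.2)

# Candidate break points
breaks <- seq(0.1, 0.9, by = 0.01)

# Residual standard error of the hinge model at each candidate break point
sigma <- sapply(breaks, function(b) {
  summary(lm(y ~ x + pmax(x - b, 0)))$sigma
})

# Break point with the smallest residual standard error
best <- breaks[which.min(sigma)]

# Final model: the hinge term is zero left of `best`, so the fit is continuous
fit <- lm(y ~ x + pmax(x - best, 0))

plot(x, y)
lines(x, predict(fit), lwd = 2, col = "blue")
```

The grid search is the same idea as in the steps above; only the model formula differs. With separate intercepts and slopes per segment, continuity depends on hitting the break point precisely, whereas the hinge form has it built in.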
I am trying to plot a time series chart with ggplot2, using the alpha aesthetic to make the lines darker or lighter. I got it working in one function, but when I try it with another dataset the alpha doesn't work. I guess I am calling something incorrectly, because I have the alpha variable set to 0.2 but the lines still come out dark.
Here is the code and some sample data:
tsplot <- ggplot(xall, aes(x=Var1, y=value)) +
geom_line(size=.01) + guides(colour=FALSE) + xlab(x.lab) +ylab("Time Series")
tsplot <- tsplot + aes(alpha=alpha, group= factor(Var2)) +guides(alpha=F)
Sample data for xall
Var1 Var2 value alpha row
1 1 657 0 0.2 Other Rows
2 2 657 -0.006748957 0.2 Other Rows
3 3 657 -0.00088561 0.2 Other Rows
4 4 657 0.009399679 0.2 Other Rows
5 5 657 0.020216333 0.2 Other Rows
6 6 657 0.035222838 0.2 Other Rows
7 7 657 0.038869107 0.2 Other Rows
8 8 657 0.034068491 0.2 Other Rows
9 9 657 0.044237734 0.2 Other Rows
81 1 553 0 0.2 Other Rows
82 2 553 -0.006172511 0.2 Other Rows
83 3 553 -0.004779576 0.2 Other Rows
84 4 553 0.000116964 0.2 Other Rows
85 5 553 -0.013408332 0.2 Other Rows
86 6 553 -0.003200561 0.2 Other Rows
87 7 553 0.000574187 0.2 Other Rows
88 8 553 0.025227017 0.2 Other Rows
89 9 553 0.019984901 0.2 Other Rows
241 1 876 0 0.2 Other Rows
242 2 876 0.006348487 0.2 Other Rows
243 3 876 0.020292484 0.2 Other Rows
244 4 876 0.030155311 0.2 Other Rows
245 5 876 0.02664097 0.2 Other Rows
246 6 876 0.021992971 0.2 Other Rows
247 7 876 0.015871216 0.2 Other Rows
248 8 876 0.020519216 0.2 Other Rows
249 9 876 0.017004875 0.2 Other Rows
250 10 876 0.029588482 0.2 Other Rows
Any help would be greatly appreciated.
You need to add alpha to the global aesthetic. You should also add the group mapping:
ggplot(xall, aes(x=Var1, y=value, alpha=alpha, group= factor(Var2))) +
geom_line(size=.01) + guides(colour=FALSE) + xlab(x.lab) +ylab("Time Series")
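If the numbers in the alpha column are meant to be used literally, one thing worth trying (a sketch, not part of the answer above) is `scale_alpha_identity()`. By default ggplot2 rescales a mapped alpha variable onto its own output range, so a constant 0.2 in the data does not necessarily render as an opacity of 0.2. The data frame `demo` below is invented for illustration.

```r
library(ggplot2)

# Invented data: two series, each with a constant alpha of 0.2
demo <- data.frame(
  Var1  = rep(1:5, times = 2),
  Var2  = rep(c(657, 553), each = 5),
  value = c(0, 0.01, 0.02, 0.035, 0.04, 0, -0.006, -0.005, 0.001, -0.013),
  alpha = 0.2
)

p <- ggplot(demo, aes(x = Var1, y = value, alpha = alpha, group = factor(Var2))) +
  geom_line() +
  scale_alpha_identity() +  # use the alpha column as-is instead of rescaling it
  ylab("Time Series")
print(p)
```

The identity scale draws no legend by default, so a separate `guides()` call for alpha should not be needed in this variant.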
I want to create some basic grouped barplots with ggplot2 but it seems to exclude some data. If I review my input data everything is there, but some bars are missing and it is also messing with the error bars. I tried to convert into multiple variable types, regrouped, loaded again, saved everything in .csv and loaded all new... I just don't know what is wrong.
Here is my code:
library(ggplot2)
limits <- aes(ymax = DataCm$mean + DataCm$sd,
ymin = DataCm$mean - DataCm$sd)
p <- ggplot(data = DataCm, aes(x = factor(DataCm$Zeit), y = factor(DataCm$mean)
) )
p + geom_bar(stat = "identity",
position = position_dodge(0.9),fill =DataCm$group) +
geom_errorbar(limits, position = position_dodge(0.9),
width = 0.25) +
labs(x = "Time [min]", y = "Individuals per foodsource")
This is DataCm:
Zeit mean sd group
1 30 0.1 0.3162278 1
2 60 0.0 0.0000000 2
3 90 0.1 0.3162278 3
4 120 0.0 0.0000000 4
5 150 0.1 0.3162278 5
6 180 0.1 0.3162278 6
7 240 0.3 0.6749486 1
8 300 0.3 0.6749486 2
9 360 0.3 0.6749486 3
10 30 0.1 0.3162278 4
11 60 0.1 0.3162278 5
12 90 0.2 0.4216370 6
13 120 0.3 0.4830459 1
14 150 0.3 0.4830459 2
15 180 0.4 0.5163978 3
16 240 0.3 0.4830459 4
17 300 0.4 0.5163978 5
18 360 0.4 0.5163978 6
19 30 1.2 1.1352924 1
20 60 1.8 1.6865481 2
21 90 2.2 2.0976177 3
22 120 2.2 2.0976177 4
23 150 2.0 1.8856181 5
24 180 2.3 1.9465068 6
25 240 2.4 2.0655911 1
26 300 2.1 1.8529256 2
27 360 2.0 2.1602469 3
28 30 0.2 0.4216370 4
29 60 0.1 0.3162278 5
30 90 0.1 0.3162278 6
31 120 0.1 0.3162278 1
32 150 0.0 0.0000000 2
33 180 0.1 0.3162278 3
34 240 0.1 0.3162278 4
35 300 0.1 0.3162278 5
36 360 0.1 0.3162278 6
37 30 1.3 1.5670212 1
38 60 1.5 1.5811388 2
39 90 1.5 1.7159384 3
40 120 1.5 1.9002924 4
41 150 1.9 2.1317703 5
42 180 1.9 2.1317703 6
43 240 2.2 2.3475756 1
44 300 2.4 2.3190036 2
45 360 2.2 2.1499354 3
46 30 2.1 2.1317703 4
47 60 3.0 2.2110832 5
48 90 3.3 2.1628171 6
49 120 3.2 2.1499354 1
50 150 3.4 2.6331224 2
51 180 3.5 2.4152295 3
52 240 3.7 2.6267851 4
53 300 3.7 2.4060110 5
54 360 3.8 2.6583203 6
The output is: [plot not preserved]
Maybe you can help me. Thanks in advance!
Best wishes,
Benjamin
Solved it:
I reshaped everything in Excel and exported it another way. The group variable was also not the way I wanted it. Now it is fixed, but I can't really tell you why.
Your data looks malformed. I guess you wanted to have 6 different group values for each time point, but now the group variable just loops over, and you have:
1 30 0.1 0.3162278 1
...
10 30 0.1 0.3162278 4
...
19 30 1.2 1.1352924 1
...
28 30 0.2 0.4216370 4
geom_bar then probably omits rows that have identical mean and time. Although I am not sure why it chooses to do so, you should solve the group problem first anyway.
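Once every time point has exactly one row per group, a grouped bar chart with error bars could look like the sketch below. The data frame `demo` is invented, and mapping `fill` to the group inside `aes()` is an assumption about what was intended. Note that `y` stays numeric (no `factor()`) and that columns are referenced by name, without `DataCm$`.

```r
library(ggplot2)

# Invented, well-formed data: one row per combination of time and group
demo <- expand.grid(Zeit = c(30, 60, 90), group = 1:3)
demo$mean <- c(0.1, 0.3, 0.4, 1.2, 1.8, 2.2, 1.3, 1.5, 1.9)
demo$sd   <- c(0.1, 0.1, 0.2, 0.3, 0.4, 0.5, 0.3, 0.3, 0.4)

p <- ggplot(demo, aes(x = factor(Zeit), y = mean, fill = factor(group))) +
  geom_col(position = position_dodge(0.9)) +
  geom_errorbar(aes(ymin = mean - sd, ymax = mean + sd),
                position = position_dodge(0.9), width = 0.25) +
  labs(x = "Time [min]", y = "Individuals per foodsource", fill = "Group")
print(p)
```

Because `fill` is an aesthetic here, `position_dodge()` knows which bars belong side by side, and the error bars are dodged with the same grouping.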
I am using RStudio (version 0.99.903) on a Windows 8 PC. I have a follow-up question from yesterday, as the problem became more complicated. Here is what the data looks like:
Number Trial ID Open date Enrollment rate
420 NCT00091442 9 1/28/2005 0.2
1476 NCT00301457 26 2/22/2008 1
10559 NCT01307397 34 7/28/2011 0.6
6794 NCT00948675 53 5/12/2010 0
6451 NCT00917384 53 8/17/2010 0.3
8754 NCT01168973 53 1/19/2011 0.2
8578 NCT01140347 53 12/30/2011 2.4
11655 NCT01358877 53 4/2/2012 0.3
428 NCT00091442 55 9/7/2005 0.1
112 NCT00065325 62 10/15/2003 0.2
477 NCT00091442 62 11/11/2005 0.1
16277 NCT01843374 62 12/16/2013 0.2
17386 NCT01905657 62 1/8/2014 0.6
411 NCT00091442 66 1/12/2005 0
What I need to do is compare the enrollment rate of the most current date within a given ID to the average of those values that are up to one year prior to it. For instance, for ID 53, the date of 1/19/2011 has an enrollment rate of 0.2 and I would want to compare this against the average of the 8/17/2010 and 5/12/2010 enrollment rates (i.e., 0.15).
If there are no other dates within the ID prior to the current one, then the comparison should not be made. For instance, for ID 26, there would be no comparison. Similarly, for ID 53, there would be no comparison for 5/12/2010.
When I say "compare" I am not doing any analysis or visualization. I simply want a new column that takes the average value of those enrollment rates up to one year prior to the current one (I will be plotting them and percentile ranking them later). There are >20,000 data points. Any help would be much appreciated.
Verbose but possibly high performance way of doing this. No giant for loops looping over all the rows of the data frame. The two sapply loops only operate on a big numeric vector, which should be relatively quick regardless of your data row count. But I'm sure someone will waltz in with a trivial dplyr solution soon enough.
The approach assumes that your data is sorted first by ID and then by Opendate. If it is not, you need to sort it first.
# Find indices where the same ID is above and below it
A = which(unlist(sapply(X = rle(df$ID)$lengths,
FUN = function(x) {if(x == 1) return(F)
if(x == 2) return(c(F,F))
if(x >= 3) return(c(F,rep(T, x-2),F))})))
# Store the dates as a Date vector (the sample data holds them as m/d/Y text); should also speed up the code a tiny bit
V_opendate = as.Date(df$Opendate, format = "%m/%d/%Y")
# Further filter on A, where the date difference < 365 days
B = A[sapply(A, function(x) (abs(V_opendate[x]-V_opendate[x-1]) < 365) & (abs(V_opendate[x]-V_opendate[x+1]) < 365))]
# Return actual indices of rows - 1, rows +1
C = sapply(B, function(x) c(x-1, x+1), simplify = F)
# Actually take the mean of these cases
D = sapply(C, function(x) mean(df[x,]$Enrollmentrate))
# Create new column Rate and fill it in with the values of D. You can do the comparison from here.
df[B,"Rate"] = D
Number Trial ID Opendate Enrollmentrate Rate
1 420 NCT00091442 9 2005-01-28 0.2 NA
2 1476 NCT00301457 26 2008-02-22 1.0 NA
3 10559 NCT01307397 34 2011-07-28 0.6 NA
4 6794 NCT00948675 53 2010-05-12 0.0 NA
5 6451 NCT00917384 53 2010-08-17 0.3 0.10
6 8754 NCT01168973 53 2011-01-19 0.2 1.35
7 8578 NCT01140347 53 2011-12-30 2.4 0.25
8 11655 NCT01358877 53 2012-04-02 0.3 NA
9 428 NCT00091442 55 2005-09-07 0.1 NA
10 112 NCT00065325 62 2003-10-15 0.2 NA
11 477 NCT00091442 62 2005-11-11 0.1 NA
12 16277 NCT01843374 62 2013-12-16 0.2 NA
13 17386 NCT01905657 62 2014-01-08 0.6 NA
14 411 NCT00091442 66 2005-01-12 0.0 NA
The relevant rows are calculated. You can do your comparison with the newly created Rate column.
You might have to change the code a little, since I removed the spaces in the column names:
df = read.table(text = " Number Trial ID Opendate Enrollmentrate
420 NCT00091442 9 1/28/2005 0.2
1476 NCT00301457 26 2/22/2008 1
10559 NCT01307397 34 7/28/2011 0.6
6794 NCT00948675 53 5/12/2010 0
6451 NCT00917384 53 8/17/2010 0.3
8754 NCT01168973 53 1/19/2011 0.2
8578 NCT01140347 53 12/30/2011 2.4
11655 NCT01358877 53 4/2/2012 0.3
428 NCT00091442 55 9/7/2005 0.1
112 NCT00065325 62 10/15/2003 0.2
477 NCT00091442 62 11/11/2005 0.1
16277 NCT01843374 62 12/16/2013 0.2
17386 NCT01905657 62 1/8/2014 0.6
411 NCT00091442 66 1/12/2005 0", header = T)
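For comparison, here is a minimal base R sketch of the calculation as the question words it: for each row, the mean enrollment rate of earlier rows with the same ID whose open date lies within the preceding 365 days. The rows are a subset of the sample data above; the helper `prior_mean` and the column `PriorMean` are hypothetical names introduced for this example.

```r
# Subset of the sample data, with dates already converted to Date
trials <- data.frame(
  ID = c(26, 53, 53, 53, 53, 53),
  Opendate = as.Date(c("2008-02-22", "2010-05-12", "2010-08-17",
                       "2011-01-19", "2011-12-30", "2012-04-02")),
  Enrollmentrate = c(1, 0, 0.3, 0.2, 2.4, 0.3)
)

# Mean rate of rows with the same ID opened up to 365 days before row i;
# NA when there is no such row (hypothetical helper)
prior_mean <- function(i, d) {
  lag_days  <- as.numeric(d$Opendate[i] - d$Opendate)
  in_window <- d$ID == d$ID[i] & lag_days > 0 & lag_days <= 365
  if (any(in_window)) mean(d$Enrollmentrate[in_window]) else NA_real_
}

trials$PriorMean <- vapply(seq_len(nrow(trials)), prior_mean,
                           numeric(1), d = trials)
print(trials)
```

For the 2011-01-19 row of ID 53 this averages the 2010-05-12 and 2010-08-17 rates, giving the 0.15 mentioned in the question. Each row is compared against the whole data frame, so the cost grows quadratically with the number of rows; for much larger data, splitting by ID first would be the natural next step.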
I have the data frame below that I have graphed as shown. How can I limit the values shown in the legend to only the first three? In other words, I want it to only show "A", "B", and "C".
graph_table <- read.table(header=TRUE, text="
names freq rank percs sums sums_str
1 A 1208 'Top 3' 46.1 61.1 61.1%
2 B 289 'Top 3' 11.0 61.1 61.1%
3 C 105 'Top 3' 4.0 61.1 61.1%
4 D 388 D 14.8 14.8 14.8%
5 E 173 E 6.6 6.6 6.6%
6 F 102 F 3.9 3.9 3.9%
7 G 70 G 2.7 2.7 2.7%
8 H 54 H 2.1 2.1 2.1%
9 I 44 I 1.7 1.7 1.7%
10 J 32 J 1.2 1.2 1.2%
11 K 24 K 0.9 0.9 0.9%
12 L 20 L 0.8 0.8 0.8%
13 M 20 M 0.8 0.8 0.8%
14 N 18 N 0.7 0.7 0.7%
15 O 13 O 0.5 0.5 0.5%
16 P 10 P 0.4 0.4 0.4%
17 Q 10 Q 0.4 0.4 0.4%
18 R 10 R 0.4 0.4 0.4%
19 S 7 S 0.3 0.3 0.3%
20 T 5 T 0.2 0.2 0.2%
21 U 5 U 0.2 0.2 0.2%
22 V 5 V 0.2 0.2 0.2%
23 W 3 W 0.1 0.1 0.1%")
library(ggplot2)
p <- ggplot(graph_table[1:10,], aes(x=rank, y=percs,
fill=names))+geom_bar(stat="identity")
p <- p+geom_text(aes(label=sums_str, y=(sums+4)), size=4)
p
Was confused at first, but you want to show a top-3 so the other names don't need a legend. Here you go:
p <- ggplot(graph_table[1:10,], aes(x=rank, y=percs,
fill=names))+geom_bar(stat="identity")
p <- p+geom_text(aes(label=sums_str, y=(sums+4)), size=4)
p + scale_fill_discrete(breaks=c("A","B","C"))