Setting defaults with ggproto when extending ggplot2 - r

I ran the R code posted at http://ggplot2.tidyverse.org/articles/extending-ggplot2.html#picking-defaults.
The following is a modified version with some print() calls added to show the values of the variables at each step; my questions are marked as comments in the code:
StatDensityCommon <- ggproto("StatDensityCommon", Stat,
  required_aes = "x",

  setup_params = function(data, params) {
    print("PARAMS BEFORE:")
    print(params)
    if (!is.null(params$bandwidth))
      return(params)

    print("DATA: ")
    print(data)
    # 1. When and how is the data modified and the "group" field added?
    xs <- split(data$x, data$group)
    print("XS: ")
    print(xs)
    bws <- vapply(xs, bw.nrd0, numeric(1))
    print("BWS: ")
    print(bws)
    bw <- mean(bws)
    print("BW: ")
    print(bw)

    message("Picking bandwidth of ", signif(bw, 3))
    params$bandwidth <- bw
    print("PARAMS AFTER: ")
    print(params)
    params
  },

  compute_group = function(data, scales, bandwidth = 1) {
    # 2. How is the bandwidth computed in setup_params passed into
    # compute_group, even though bandwidth defaults to 1 in the arguments?
    d <- density(data$x, bw = bandwidth)
    data.frame(x = d$x, y = d$y)
  }
)

stat_density_common <- function(mapping = NULL, data = NULL, geom = "line",
                                position = "identity", na.rm = FALSE,
                                show.legend = NA, inherit.aes = TRUE,
                                bandwidth = NULL, ...) {
  layer(stat = StatDensityCommon, data = data, mapping = mapping, geom = geom,
        position = position, show.legend = show.legend,
        inherit.aes = inherit.aes,
        params = list(bandwidth = bandwidth, na.rm = na.rm, ...))
}

ggplot(mpg, aes(displ, colour = drv)) + stat_density_common()
The following is the output, except for the plot:
[1] "PARAMS BEFORE:"
$bandwidth
NULL
$na.rm
[1] FALSE
[1] "DATA: "
x colour PANEL group
1 1.8 f 1 2
2 1.8 f 1 2
3 2.0 f 1 2
4 2.0 f 1 2
5 2.8 f 1 2
6 2.8 f 1 2
7 3.1 f 1 2
8 1.8 4 1 1
9 1.8 4 1 1
10 2.0 4 1 1
11 2.0 4 1 1
12 2.8 4 1 1
13 2.8 4 1 1
14 3.1 4 1 1
15 3.1 4 1 1
16 2.8 4 1 1
17 3.1 4 1 1
18 4.2 4 1 1
19 5.3 r 1 3
20 5.3 r 1 3
21 5.3 r 1 3
22 5.7 r 1 3
23 6.0 r 1 3
24 5.7 r 1 3
25 5.7 r 1 3
26 6.2 r 1 3
27 6.2 r 1 3
28 7.0 r 1 3
29 5.3 4 1 1
30 5.3 4 1 1
31 5.7 4 1 1
32 6.5 4 1 1
33 2.4 f 1 2
34 2.4 f 1 2
35 3.1 f 1 2
36 3.5 f 1 2
37 3.6 f 1 2
38 2.4 f 1 2
39 3.0 f 1 2
40 3.3 f 1 2
41 3.3 f 1 2
42 3.3 f 1 2
43 3.3 f 1 2
44 3.3 f 1 2
45 3.8 f 1 2
46 3.8 f 1 2
47 3.8 f 1 2
48 4.0 f 1 2
49 3.7 4 1 1
50 3.7 4 1 1
51 3.9 4 1 1
52 3.9 4 1 1
53 4.7 4 1 1
54 4.7 4 1 1
55 4.7 4 1 1
56 5.2 4 1 1
57 5.2 4 1 1
58 3.9 4 1 1
59 4.7 4 1 1
60 4.7 4 1 1
61 4.7 4 1 1
62 5.2 4 1 1
63 5.7 4 1 1
64 5.9 4 1 1
65 4.7 4 1 1
66 4.7 4 1 1
67 4.7 4 1 1
68 4.7 4 1 1
69 4.7 4 1 1
70 4.7 4 1 1
71 5.2 4 1 1
72 5.2 4 1 1
73 5.7 4 1 1
74 5.9 4 1 1
75 4.6 r 1 3
76 5.4 r 1 3
77 5.4 r 1 3
78 4.0 4 1 1
79 4.0 4 1 1
80 4.0 4 1 1
81 4.0 4 1 1
82 4.6 4 1 1
83 5.0 4 1 1
84 4.2 4 1 1
85 4.2 4 1 1
86 4.6 4 1 1
87 4.6 4 1 1
88 4.6 4 1 1
89 5.4 4 1 1
90 5.4 4 1 1
91 3.8 r 1 3
92 3.8 r 1 3
93 4.0 r 1 3
94 4.0 r 1 3
95 4.6 r 1 3
96 4.6 r 1 3
97 4.6 r 1 3
98 4.6 r 1 3
99 5.4 r 1 3
100 1.6 f 1 2
101 1.6 f 1 2
102 1.6 f 1 2
103 1.6 f 1 2
104 1.6 f 1 2
105 1.8 f 1 2
106 1.8 f 1 2
107 1.8 f 1 2
108 2.0 f 1 2
109 2.4 f 1 2
110 2.4 f 1 2
111 2.4 f 1 2
112 2.4 f 1 2
113 2.5 f 1 2
114 2.5 f 1 2
115 3.3 f 1 2
116 2.0 f 1 2
117 2.0 f 1 2
118 2.0 f 1 2
119 2.0 f 1 2
120 2.7 f 1 2
121 2.7 f 1 2
122 2.7 f 1 2
123 3.0 4 1 1
124 3.7 4 1 1
125 4.0 4 1 1
126 4.7 4 1 1
127 4.7 4 1 1
128 4.7 4 1 1
129 5.7 4 1 1
130 6.1 4 1 1
131 4.0 4 1 1
132 4.2 4 1 1
133 4.4 4 1 1
134 4.6 4 1 1
135 5.4 r 1 3
136 5.4 r 1 3
137 5.4 r 1 3
138 4.0 4 1 1
139 4.0 4 1 1
140 4.6 4 1 1
141 5.0 4 1 1
142 2.4 f 1 2
143 2.4 f 1 2
144 2.5 f 1 2
145 2.5 f 1 2
146 3.5 f 1 2
147 3.5 f 1 2
148 3.0 f 1 2
149 3.0 f 1 2
150 3.5 f 1 2
151 3.3 4 1 1
152 3.3 4 1 1
153 4.0 4 1 1
154 5.6 4 1 1
155 3.1 f 1 2
156 3.8 f 1 2
157 3.8 f 1 2
158 3.8 f 1 2
159 5.3 f 1 2
160 2.5 4 1 1
161 2.5 4 1 1
162 2.5 4 1 1
163 2.5 4 1 1
164 2.5 4 1 1
165 2.5 4 1 1
166 2.2 4 1 1
167 2.2 4 1 1
168 2.5 4 1 1
169 2.5 4 1 1
170 2.5 4 1 1
171 2.5 4 1 1
172 2.5 4 1 1
173 2.5 4 1 1
174 2.7 4 1 1
175 2.7 4 1 1
176 3.4 4 1 1
177 3.4 4 1 1
178 4.0 4 1 1
179 4.7 4 1 1
180 2.2 f 1 2
181 2.2 f 1 2
182 2.4 f 1 2
183 2.4 f 1 2
184 3.0 f 1 2
185 3.0 f 1 2
186 3.5 f 1 2
187 2.2 f 1 2
188 2.2 f 1 2
189 2.4 f 1 2
190 2.4 f 1 2
191 3.0 f 1 2
192 3.0 f 1 2
193 3.3 f 1 2
194 1.8 f 1 2
195 1.8 f 1 2
196 1.8 f 1 2
197 1.8 f 1 2
198 1.8 f 1 2
199 4.7 4 1 1
200 5.7 4 1 1
201 2.7 4 1 1
202 2.7 4 1 1
203 2.7 4 1 1
204 3.4 4 1 1
205 3.4 4 1 1
206 4.0 4 1 1
207 4.0 4 1 1
208 2.0 f 1 2
209 2.0 f 1 2
210 2.0 f 1 2
211 2.0 f 1 2
212 2.8 f 1 2
213 1.9 f 1 2
214 2.0 f 1 2
215 2.0 f 1 2
216 2.0 f 1 2
217 2.0 f 1 2
218 2.5 f 1 2
219 2.5 f 1 2
220 2.8 f 1 2
221 2.8 f 1 2
222 1.9 f 1 2
223 1.9 f 1 2
224 2.0 f 1 2
225 2.0 f 1 2
226 2.5 f 1 2
227 2.5 f 1 2
228 1.8 f 1 2
229 1.8 f 1 2
230 2.0 f 1 2
231 2.0 f 1 2
232 2.8 f 1 2
233 2.8 f 1 2
234 3.6 f 1 2
[1] "XS: "
$`1`
[1] 1.8 1.8 2.0 2.0 2.8 2.8 3.1 3.1 2.8 3.1 4.2 5.3 5.3 5.7 6.5 3.7 3.7 3.9 3.9 4.7 4.7 4.7 5.2 5.2
[25] 3.9 4.7 4.7 4.7 5.2 5.7 5.9 4.7 4.7 4.7 4.7 4.7 4.7 5.2 5.2 5.7 5.9 4.0 4.0 4.0 4.0 4.6 5.0 4.2
[49] 4.2 4.6 4.6 4.6 5.4 5.4 3.0 3.7 4.0 4.7 4.7 4.7 5.7 6.1 4.0 4.2 4.4 4.6 4.0 4.0 4.6 5.0 3.3 3.3
[73] 4.0 5.6 2.5 2.5 2.5 2.5 2.5 2.5 2.2 2.2 2.5 2.5 2.5 2.5 2.5 2.5 2.7 2.7 3.4 3.4 4.0 4.7 4.7 5.7
[97] 2.7 2.7 2.7 3.4 3.4 4.0 4.0
$`2`
[1] 1.8 1.8 2.0 2.0 2.8 2.8 3.1 2.4 2.4 3.1 3.5 3.6 2.4 3.0 3.3 3.3 3.3 3.3 3.3 3.8 3.8 3.8 4.0 1.6
[25] 1.6 1.6 1.6 1.6 1.8 1.8 1.8 2.0 2.4 2.4 2.4 2.4 2.5 2.5 3.3 2.0 2.0 2.0 2.0 2.7 2.7 2.7 2.4 2.4
[49] 2.5 2.5 3.5 3.5 3.0 3.0 3.5 3.1 3.8 3.8 3.8 5.3 2.2 2.2 2.4 2.4 3.0 3.0 3.5 2.2 2.2 2.4 2.4 3.0
[73] 3.0 3.3 1.8 1.8 1.8 1.8 1.8 2.0 2.0 2.0 2.0 2.8 1.9 2.0 2.0 2.0 2.0 2.5 2.5 2.8 2.8 1.9 1.9 2.0
[97] 2.0 2.5 2.5 1.8 1.8 2.0 2.0 2.8 2.8 3.6
$`3`
[1] 5.3 5.3 5.3 5.7 6.0 5.7 5.7 6.2 6.2 7.0 4.6 5.4 5.4 3.8 3.8 4.0 4.0 4.6 4.6 4.6 4.6 5.4 5.4 5.4
[25] 5.4
[1] "BWS: "
1 2 3
0.4056219 0.2482564 0.3797632
[1] "BW: "
[1] 0.3445472
Picking bandwidth of 0.345
[1] "PARAMS AFTER: "
$bandwidth
[1] 0.3445472
$na.rm
[1] FALSE
Thanks in advance!
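A sketch of the mechanism behind the two comments, based on my reading of the extending-ggplot2 vignette rather than the ggplot2 source itself: the group column is added by ggplot2 during layer-data setup, computed from the discrete aesthetics in the mapping (here colour = drv), before setup_params is ever called; and the params list returned by setup_params is later matched by name against compute_group's formal arguments, so params$bandwidth overrides the default of 1. The snippet below is a hypothetical illustration of that name-matching idea only, not ggplot2's actual code:

# Hypothetical illustration: match a params list by name against a
# function's formal arguments, the way stat parameters reach compute_group.
compute_group <- function(data, scales, bandwidth = 1) bandwidth

params <- list(bandwidth = 0.345, na.rm = FALSE)
args <- params[intersect(names(params), names(formals(compute_group)))]
do.call(compute_group, c(list(data = NULL, scales = NULL), args))
#> [1] 0.345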

Related

qtgrace/xmgrace non-overlapping data sets

I'm using qtgrace on macOS, and when I plotted two data sets in qtgrace I got something like this:
[image: overlapping data sets]
However, I would like to plot something like this:
[image: non-overlapping data sets]
My data 1:
0 14
0.1 6
0.2 14
0.3 14
0.4 14
0.5 14
0.6 14
0.7 14
0.8 6
0.9 6
1 6
1.1 6
1.2 6
1.3 6
1.4 6
1.5 6
1.6 6
1.7 6
1.8 6
1.9 6
2 6
2.1 6
2.2 6
2.3 6
2.4 6
2.5 6
2.6 6
2.7 6
2.8 6
2.9 6
3 6
3.1 6
3.2 6
3.3 6
3.4 6
3.5 6
3.6 6
3.7 6
3.8 6
3.9 6
4 6
4.1 6
4.2 6
4.3 6
4.4 6
4.5 6
4.6 6
4.7 6
4.8 6
4.9 6
5 6
5.1 6
5.2 6
5.3 6
5.4 6
5.5 6
5.6 6
5.7 6
5.8 6
5.9 6
6 6
6.1 6
6.2 6
6.3 6
6.4 6
6.5 6
6.6 6
6.7 6
6.8 6
6.9 6
7 6
7.1 6
7.2 6
7.3 2
7.4 6
7.5 2
7.6 2
7.7 2
7.8 2
7.9 6
8 2
8.1 6
8.2 2
8.3 2
8.4 6
8.5 6
8.6 6
8.7 2
8.8 6
8.9 19
9 19
9.1 6
9.2 6
9.3 6
9.4 2
9.5 2
9.6 2
9.7 2
9.8 2
9.9 2
10 2
10.1 2
10.2 2
10.3 2
10.4 2
10.5 2
10.6 2
10.7 2
10.8 2
10.9 2
11 2
11.1 2
11.2 2
11.3 2
11.4 2
11.5 2
11.6 2
11.7 2
11.8 2
11.9 2
12 2
12.1 2
12.2 2
12.3 2
12.4 2
12.5 2
12.6 2
12.7 2
12.8 2
12.9 2
13 2
13.1 2
13.2 2
13.3 2
13.4 2
13.5 2
13.6 2
13.7 2
13.8 2
13.9 2
14 2
14.1 2
14.2 2
14.3 2
14.4 2
14.5 2
14.6 2
14.7 2
14.8 2
14.9 2
15 2
15.1 2
15.2 2
15.3 2
15.4 2
15.5 2
15.6 2
15.7 2
15.8 2
15.9 2
16 2
16.1 2
16.2 2
16.3 2
16.4 2
16.5 2
16.6 2
16.7 2
16.8 2
16.9 2
17 2
17.1 2
17.2 2
17.3 2
17.4 2
17.5 2
17.6 2
17.7 2
17.8 2
17.9 2
18 2
18.1 2
18.2 2
18.3 2
18.4 2
18.5 2
18.6 2
18.7 2
18.8 2
18.9 2
19 2
19.1 2
19.2 2
19.3 2
19.4 2
19.5 2
19.6 2
19.7 2
19.8 2
19.9 2
20 2
20.1 2
20.2 2
20.3 2
20.4 2
20.5 2
20.6 2
20.7 2
20.8 2
20.9 2
21 2
21.1 2
21.2 2
21.3 2
21.4 2
21.5 2
21.6 2
21.7 2
21.8 7
21.9 2
22 2
22.1 2
22.2 2
22.3 7
22.4 7
22.5 7
22.6 7
22.7 7
22.8 2
22.9 2
23 7
23.1 7
23.2 7
23.3 7
23.4 7
23.5 2
23.6 2
23.7 2
23.8 2
23.9 2
24 2
24.1 2
24.2 2
24.3 2
24.4 2
24.5 2
24.6 2
24.7 2
24.8 2
24.9 2
25 2
. .
. .
. .
Data 2:
0 4
0.1 4
0.2 4
0.3 4
0.4 4
0.5 4
0.6 4
0.7 4
0.8 4
0.9 4
1 2
1.1 4
1.2 4
1.3 4
1.4 4
1.5 4
1.6 4
1.7 4
1.8 4
1.9 4
2 4
2.1 4
2.2 4
2.3 4
2.4 4
2.5 4
2.6 4
2.7 4
2.8 4
2.9 4
3 4
3.1 4
3.2 4
3.3 4
3.4 4
3.5 4
3.6 4
3.7 4
3.8 4
3.9 4
4 4
4.1 4
4.2 4
4.3 4
4.4 4
4.5 4
4.6 4
4.7 4
4.8 4
4.9 4
5 4
5.1 4
5.2 4
5.3 4
5.4 4
5.5 4
5.6 4
5.7 4
5.8 4
5.9 4
6 4
6.1 4
6.2 4
6.3 4
6.4 4
6.5 4
6.6 4
6.7 4
6.8 4
6.9 4
7 4
7.1 4
7.2 4
7.3 4
7.4 4
7.5 4
7.6 4
7.7 4
7.8 4
7.9 4
8 4
8.1 4
8.2 4
8.3 4
8.4 2
8.5 4
8.6 4
8.7 4
8.8 4
8.9 4
9 4
9.1 4
9.2 4
9.3 4
9.4 4
9.5 4
9.6 4
9.7 4
9.8 4
9.9 4
10 4
10.1 4
10.2 4
10.3 4
10.4 4
10.5 2
10.6 2
10.7 4
10.8 2
10.9 2
11 2
11.1 2
11.2 4
11.3 4
11.4 2
11.5 2
11.6 2
11.7 2
11.8 2
11.9 2
12 2
12.1 2
12.2 2
12.3 2
12.4 4
12.5 4
12.6 2
12.7 2
12.8 4
12.9 2
13 2
13.1 4
13.2 4
13.3 4
13.4 4
13.5 10
13.6 2
13.7 2
13.8 2
13.9 2
14 2
14.1 2
14.2 2
14.3 10
14.4 2
14.5 2
14.6 4
14.7 2
14.8 2
14.9 4
15 2
15.1 10
15.2 2
15.3 2
15.4 2
15.5 2
15.6 2
15.7 2
15.8 2
15.9 2
16 2
16.1 2
16.2 2
16.3 2
16.4 2
16.5 2
16.6 2
16.7 2
16.8 2
16.9 2
17 2
17.1 2
17.2 2
17.3 2
17.4 2
17.5 2
17.6 2
17.7 2
17.8 2
17.9 2
18 2
18.1 2
18.2 2
18.3 2
18.4 2
18.5 2
18.6 2
18.7 2
18.8 2
18.9 2
19 2
19.1 2
19.2 2
19.3 2
19.4 2
19.5 2
19.6 2
19.7 2
19.8 2
19.9 2
20 2
20.1 2
20.2 2
20.3 2
20.4 2
20.5 2
20.6 2
20.7 2
20.8 2
20.9 2
21 2
21.1 2
21.2 2
21.3 2
21.4 2
21.5 2
21.6 2
21.7 2
21.8 2
21.9 2
22 2
22.1 2
22.2 2
22.3 2
22.4 2
22.5 2
22.6 2
22.7 2
22.8 2
22.9 2
23 2
23.1 2
23.2 2
23.3 2
23.4 2
23.5 2
23.6 2
23.7 2
23.8 2
23.9 2
24 2
24.1 2
24.2 2
24.3 2
24.4 2
24.5 2
24.6 2
24.7 2
24.8 2
24.9 2
25 2
. .
. .
. .
The data are in two separate .xvg files from GROMACS cluster analysis. I want to plot five different sets in a way that shows all the data without overlap.
Thank you!
I think the best approach would be to write a script that takes the original files and spits out new files with shifted y values (a sketch follows the steps below). However, since you have asked for a qtgrace/xmgrace solution, here is how you do it:
Load all the datasets into qtgrace.
Open the "Data -> Transformations -> Evaluate expression..." dialog.
Select a dataset in the left and right columns, enter the formula y = y + 0.1 in the textbox below, and click "Apply". This shifts the dataset up by 0.1.
Select the next dataset in the same way and use the formula y = y + 0.2. Click "Apply".
Rinse and repeat for all the datasets, changing the shift accordingly.
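A minimal sketch of the scripted alternative in R, assuming whitespace-separated x/y columns and GROMACS-style header lines starting with # or @; the file names are hypothetical:

shift_xvg <- function(infile, outfile, dy) {
  lines <- readLines(infile)
  keep <- !grepl("^[#@]", lines)      # drop GROMACS header lines
  d <- read.table(text = lines[keep])
  d[[2]] <- d[[2]] + dy               # shift the y column
  write.table(d, outfile, row.names = FALSE, col.names = FALSE)
}

files <- sprintf("data%d.xvg", 1:5)   # hypothetical file names
for (i in seq_along(files)) {
  shift_xvg(files[i], sub("\\.xvg$", "_shifted.xvg", files[i]), dy = 0.1 * i)
}

Each output file then carries its own vertical offset, so the five sets no longer superpose when plotted together.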

Reformat data frame into a single row

I want to format my data frame into a single row.
So I have this:
RT 200 201 202 203 204 205
2 2.5 3.5 4.5 5.5 6.5 7.5
3 2.6 3.6 4.6 5.6 6.6 7.6
4 2.7 3.7 4.7 5.7 6.7 7.7
And I want this:
m/z 200 201 202 203 204 205 200 201 202 203 204 205 200 201 202 203 204 205
RT 2 2 2 2 2 2 3 3 3 3 3 3 4 4 4 4 4 4
Sa 2.5 3.5 4.5 5.5 6.5 7.5 2.6 3.6 4.6 5.6 6.6 7.6 2.7 3.7 4.7 5.7 6.7 7.7
Can anyone provide code for this?
Note: I want to add the row names "m/z" and "Sa" to the rows instead of leaving them blank.
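One possible sketch in base R, assuming the input is a data frame with an RT column followed by the m/z intensity columns (the reshaping rule is taken from the example above):

df <- data.frame(RT = c(2, 3, 4),
                 `200` = c(2.5, 2.6, 2.7), `201` = c(3.5, 3.6, 3.7),
                 `202` = c(4.5, 4.6, 4.7), `203` = c(5.5, 5.6, 5.7),
                 `204` = c(6.5, 6.6, 6.7), `205` = c(7.5, 7.6, 7.7),
                 check.names = FALSE)

mz <- rep(names(df)[-1], times = nrow(df))   # m/z labels, repeated per RT
rt <- rep(df$RT, each = ncol(df) - 1)        # each RT repeated once per m/z
sa <- as.vector(t(as.matrix(df[-1])))        # intensities, row by row

rbind(`m/z` = mz, RT = rt, Sa = sa)

Everything becomes character in the rbind because the m/z labels are strings; keep the three vectors separate if you need them numeric.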

Keep the observations belonging to each cluster in R

A simple example with the iris dataset; I must use the apcluster library.
library("apcluster")
# use negDistMat() to create a negative similarity matrix
sim <- negDistMat(iris[, 1:4], r = 2)
# run the clustering algorithm and create the apcluster object apiris1
apiris1 <- apcluster(sim, details = TRUE)
apiris1 <- apclusterK(sim, details = TRUE, K = 2, verbose = TRUE)
Afterwards, I can see the clusters and the observations in them:
Cluster 1, exemplar 8:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21
22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39
40 41 42 43 44 45 46 47 48 49 50 58 99
Cluster 2, exemplar 124:
51 52 53 54 55 56 57 59 60 61 62 63 64 65 66 67 68 69
70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87
88 89 90 91 92 93 94 95 96 97 98 100 101 102 103 104
105 106 107 108 109 110 111 112 113 114 115 116 117 118
119 120 121 122 123 124 125 126 127 128 129 130 131 132
133 134 135 136 137 138 139 140 141 142 143 144 145 146
147 148 149 150
How do I keep the observations belonging to each cluster in R?
To make my post clearer, this is the kind of table I expect as output:
n Sepal.Length Sepal.Width Petal.Length Petal.Width Species Save.cluster
1 1 5.1 3.5 1.4 0.2 setosa 1
2 2 4.9 3.0 1.4 0.2 setosa 1
3 3 4.7 3.2 1.3 0.2 setosa 1
4 4 4.6 3.1 1.5 0.2 setosa 1
5 5 5.0 3.6 1.4 0.2 setosa 1
6 6 5.4 3.9 1.7 0.4 setosa 1
7 7 4.6 3.4 1.4 0.3 setosa 1
8 8 5.0 3.4 1.5 0.2 setosa 1
9 9 4.4 2.9 1.4 0.2 setosa 1
10 10 4.9 3.1 1.5 0.1 setosa 1
11 51 7.0 3.2 4.7 1.4 versicolor 2
12 52 6.4 3.2 4.5 1.5 versicolor 2
13 53 6.9 3.1 4.9 1.5 versicolor 2
14 54 5.5 2.3 4.0 1.3 versicolor 2
15 55 6.5 2.8 4.6 1.5 versicolor 2
The cluster indices are stored in apiris1@clusters. You can build a data.frame like the one you are requesting like this:
iris1 <- iris
iris1$Save.cluster <- 0
for (i in seq_along(apiris1@clusters)) {
  iris1$Save.cluster[apiris1@clusters[[i]]] <- i
}
head(iris1)
Sepal.Length Sepal.Width Petal.Length Petal.Width Species Save.cluster
1 5.1 3.5 1.4 0.2 setosa 1
2 4.9 3.0 1.4 0.2 setosa 1
3 4.7 3.2 1.3 0.2 setosa 1
4 4.6 3.1 1.5 0.2 setosa 1
5 5.0 3.6 1.4 0.2 setosa 1
6 5.4 3.9 1.7 0.4 setosa 1
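If you prefer to avoid the loop, a vectorized variant of the same assignment (a sketch; it relies only on apiris1@clusters being a list of index vectors):

iris1$Save.cluster[unlist(apiris1@clusters)] <-
  rep(seq_along(apiris1@clusters), lengths(apiris1@clusters))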

How to filter data by condition to make the number of rows the same for each group

This is my sample data:
date label type exdate x y z w
1 10 A 2 15 0.25 0.35 13.49
1 10 A 2 12.5 1.30 1.45 13.49
1 10 B 2 10 1.7 1.8 13.49
1 10 B 2 12.5 0.3 0.4 13.49
1 10 B 2 17.5 1.8 0.3 13.49
1 11 A 3 15 0.75 0.8 13.49
1 11 A 3 12.5 1.8 1.9 13.49
1 11 A 3 17.5 0.2 0.35 13.49
1 11 B 3 10 0.1 0.25 13.49
1 11 B 3 15 2.15 2.3 13.49
1 11 B 3 12.5 0.8 0.85 13.49
1 11 B 3 17.5 4.1 4.3 13.49
2 11 A 4 10 3.7 4 13.49
2 11 A 4 15 1 1.1 13.49
2 11 A 4 12.5 2.05 2.2 13.49
2 11 A 4 17.5 0.4 0.55 13.49
2 11 B 4 10 0.3 0.4 13.49
2 11 B 4 15 2.45 2.6 13.49
2 11 B 4 12.5 1.05 1.15 13.49
2 11 B 4 17.5 4.3 4.6 13.49
First, I group my data set by c(date, label, exdate); within each group the variable 'type' contains both A and B. BUT I want the number of rows for type A and type B to be the same.
Filter condition:
To equalize the number of rows, the distance between x and w should be the same, or almost the same, for any pair of type A and type B rows.
For example:
type x w
A 2 3.5
A 3 3.5
A 4 3.5
B 1 3.5
B 2 3.5
# The output after filter
type x w
A 2 3.5 (pair with type_B ; x = 1)
A 3 3.5 (pair with type_B ; x = 2)
B 1 3.5
B 2 3.5
So, for the sample data above, this is the result I hope for:
date label type exdate x y z w
1 10 A 2 15 0.25 0.35 13.49
1 10 A 2 12.5 1.30 1.45 13.49
1 10 B 2 12.5 0.3 0.4 13.49
1 10 B 2 17.5 1.8 0.3 13.49
1 11 A 3 15 0.75 0.8 13.49
1 11 A 3 12.5 1.8 1.9 13.49
1 11 A 3 17.5 0.2 0.35 13.49
1 11 B 3 15 2.15 2.3 13.49
1 11 B 3 12.5 0.8 0.85 13.49
1 11 B 3 17.5 4.1 4.3 13.49
2 11 A 4 10 3.7 4 13.49
2 11 A 4 15 1 1.1 13.49
2 11 A 4 12.5 2.05 2.2 13.49
2 11 A 4 17.5 0.4 0.55 13.49
2 11 B 4 10 0.3 0.4 13.49
2 11 B 4 15 2.45 2.6 13.49
2 11 B 4 12.5 1.05 1.15 13.49
2 11 B 4 17.5 4.3 4.6 13.49
How can I code this result? Should I insert an if/else condition inside filter()?
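The pairing rule isn't fully specified, but one reading that reproduces both examples above is: within each (date, label, exdate) group, sort both types by x and keep the contiguous run of the larger type whose x values are closest (smallest total absolute x difference) to the smaller type's. A sketch under that assumption, with the sample data in a data frame df and dplyr's group_modify doing the per-group work:

library(dplyr)

trim_to_pairs <- function(g) {
  a <- g %>% filter(type == "A") %>% arrange(x)
  b <- g %>% filter(type == "B") %>% arrange(x)
  if (nrow(a) == nrow(b)) return(bind_rows(a, b))
  small <- if (nrow(a) < nrow(b)) a else b
  large <- if (nrow(a) < nrow(b)) b else a
  n <- nrow(small)
  # total |x_A - x_B| for every contiguous window of the larger type
  starts <- seq_len(nrow(large) - n + 1)
  cost <- vapply(starts,
                 function(s) sum(abs(large$x[s:(s + n - 1)] - small$x)),
                 numeric(1))
  best <- starts[which.min(cost)]
  bind_rows(small, large[best:(best + n - 1), ]) %>% arrange(type, x)
}

result <- df %>%
  group_by(date, label, exdate) %>%
  group_modify(~ trim_to_pairs(.x)) %>%
  ungroup()

Treat this as a starting point: ties in the window costs can be broken either way, and the question's expected output doesn't pin the rule down further.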

Distance & cluster with dynamic time warping

I am using dtw to calculate distances between several series and am getting strange results. Notice that in the sample data below the first 9 customers are identical sets (A==B==C, D==E==F, and G==H==I). The remaining rows are only noise, to allow me to make 8 clusters.
I expect the first sets to be clustered with their identical partners. This happens when I calculate distances on the original data, but when I scale the data before computing distances and clustering I get different results.
The distances between identical rows in the original data are 0.0 (as expected), but with the scaled data the distances are not 0.0 (not even close). Any ideas why they are not the same?
library(TSdist)
library(dplyr)
library(tidyr)
mydata = as_data_frame(read.table(textConnection("
cust P1 P2 P3 P4 P5 P6 P7 P8 P9 P10
1 A 1.1 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0
2 B 1.1 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0
3 C 1.1 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0
4 D 0.0 1.0 2.0 1.0 0.0 1.0 2.0 1.0 0.0 1.0
5 E 0.0 1.0 2.0 1.0 0.0 1.0 2.0 1.0 0.0 1.0
6 F 0.0 1.0 2.0 1.0 0.0 1.0 2.0 1.0 0.0 1.0
7 G 2.0 1.5 1.0 0.5 0.0 0.5 1.0 1.5 2.0 1.5
8 H 2.0 1.5 1.0 0.5 0.0 0.5 1.0 1.5 2.0 1.5
9 I 2.0 1.5 1.0 0.5 0.0 0.5 1.0 1.5 2.0 1.5
10 D2 1.0 2.0 1.0 0.0 1.0 2.0 1.0 0.0 1.0 2.0
11 E2 5.0 6.0 5.0 4.0 5.0 6.0 5.0 4.0 5.0 6.0
12 F2 9.0 10.0 9.0 8.0 9.0 10.0 9.0 8.0 9.0 10.0
13 G2 1.5 1.0 0.5 0.0 0.5 1.0 1.5 2.0 1.5 1.0
14 H2 5.5 5.0 4.5 4.0 4.5 5.0 5.5 6.0 5.5 5.0
15 I2 9.5 9.0 8.5 8.0 8.5 9.0 9.5 10.0 9.5 9.0
16 A3 1.0 1.0 0.0 2.0 1.0 1.0 1.0 1.0 1.0 1.0
17 B3 5.0 5.0 5.0 5.0 5.0 3.0 8.0 5.0 5.0 5.0
18 C3 9.0 9.0 9.0 9.0 9.0 5.4 14.4 9.0 9.0 9.0
19 D3 0.0 1.0 2.0 1.0 0.0 1.0 1.0 2.0 0.0 1.0
20 E3 4.0 5.0 5.0 6.0 4.0 5.0 6.0 5.0 4.0 5.0
21 F3 8.0 9.0 10.0 9.0 9.0 9.0 9.0 9.0 8.0 9.0
22 G3 2.0 1.5 1.0 0.5 0.0 0.5 1.0 2.0 1.5 1.5
23 H3 6.0 5.5 5.0 4.5 4.0 5.0 4.5 5.5 6.0 5.5
24 I3 10.0 9.5 9.0 9.0 8.0 8.5 9.0 9.5 10.0 9.5
25 D4 0.0 3.0 6.0 3.0 0.0 3.0 6.0 3.0 0.0 5.0
26 E4 3.0 6.0 9.0 6.0 3.0 6.0 9.0 6.0 3.0 6.0
27 F4 4.0 6.0 10.0 7.0 5.0 6.0 11.0 8.0 5.0 7.0
28 D5 5.0 0.0 3.0 6.0 3.0 0.0 3.0 6.0 3.0 0.0
29 D6 9.0 6.0 3.0 6.0 9.0 6.0 3.0 6.0 9.0 6.0
30 D7 9.0 11.0 5.0 4.0 6.0 10.0 7.0 5.0 6.0 11.0
31 Dw 0.0 0.8 1.4 2.0 1.0 0.0 2.0 0.0 1.0 2.0
32 Ew 4.0 4.8 5.4 6.0 5.0 4.0 6.0 4.0 5.0 6.0
33 Fw 8.0 8.8 9.4 10.0 9.0 8.0 10.0 8.0 9.0 10.0
34 Gw 2.0 1.5 1.0 0.5 0.0 1.0 2.0 1.5 1.3 1.1
35 Hw 6.0 5.5 5.0 4.5 4.0 5.0 6.0 5.5 5.3 5.1
36 Iw 10.0 9.5 9.0 8.5 8.0 9.0 10.0 9.5 9.3 9.1"),
header = TRUE, stringsAsFactors = FALSE))
k <- 8
# create a scaled version of mydata: (raw data - mean) / std dev
mydata_long <- mydata %>%
  mutate(mean = apply(mydata[, 2:ncol(mydata)], 1, mean, na.rm = TRUE)) %>%
  mutate(sd = apply(mydata[, 2:ncol(mydata)], 1, sd, na.rm = TRUE)) %>%
  gather(period, value, -cust, -mean, -sd) %>%
  mutate(sc = (value - mean) / sd)
mydata_sc <- mydata_long[, -c(2, 3, 5)] %>%
  spread(period, sc)

# dtw
dtw_dist <- TSDatabaseDistances(mydata[2:ncol(mydata)],
                                distance = "dtw", lag.max = 2)  # distance
dtw_clus <- hclust(dtw_dist, method = "ward.D2")                # cluster
dtw_res <- data.frame(cutree(dtw_clus, k))   # cut dendrogram into k = 8 clusters

# dtw (with scaled data)
dtw_sc_dist <- TSDatabaseDistances(mydata_sc[2:ncol(mydata_sc)],
                                   distance = "dtw", lag.max = 2)  # distance
dtw_sc_clus <- hclust(dtw_sc_dist, method = "ward.D2")             # cluster
dtw_sc_res <- data.frame(cutree(dtw_sc_clus, k))  # cut dendrogram into k = 8 clusters

results <- cbind(dtw_res, dtw_sc_res)
names(results) <- c("dtw", "dtw_scaled")
print(results)
dtw dtw_scaled
1 1 1
2 1 2
3 1 1
4 1 2
5 1 1
6 1 2
7 1 3
8 1 4
9 1 3
10 1 3
11 2 3
12 3 4
13 1 5
14 2 6
15 3 3
16 1 4
17 2 3
18 4 3
19 1 6
20 2 3
21 3 4
22 1 3
23 2 3
24 3 6
25 5 7
26 6 8
27 7 7
28 5 7
29 6 7
30 8 8
31 1 7
32 2 7
33 3 7
34 1 8
35 2 7
36 3 7
A couple of issues:
You are scaling row-wise, not column-wise (take a look at the intermediate results of your dplyr chain -- do they make sense?).
The data manipulations you used to produce the scaled data changed the row ordering of your data frame to alphabetical:
> mydata_sc %>% head
cust P1 P2 P3 P4 P5 P6 P7 P8 P9 P10
(chr) (dbl) (dbl) (dbl) (dbl) (dbl) (dbl) (dbl) (dbl) (dbl) (dbl)
1 A 2.84604989 -0.31622777 -0.31622777 -0.31622777 -0.31622777 -0.3162278 -0.3162278 -0.31622777 -0.31622777 -0.31622777
2 A3 0.00000000 0.00000000 -2.12132034 2.12132034 0.00000000 0.0000000 0.0000000 0.00000000 0.00000000 0.00000000
3 B 2.84604989 -0.31622777 -0.31622777 -0.31622777 -0.31622777 -0.3162278 -0.3162278 -0.31622777 -0.31622777 -0.31622777
vs.
> mydata %>% head
Source: local data frame [6 x 11]
cust P1 P2 P3 P4 P5 P6 P7 P8 P9 P10
(chr) (dbl) (dbl) (dbl) (dbl) (dbl) (dbl) (dbl) (dbl) (dbl) (dbl)
1 A 1.1 1 1 1 1 1 1 1 1 1
2 B 1.1 1 1 1 1 1 1 1 1 1
(check the cust variable ordering!)
Here's my approach, and how I think you can avoid similar mistakes in the future:
Scale with the built-in scale() function:
mydata_sc <- mydata %>%
  select(-cust) %>%
  scale %>%
  as.data.frame %>%
  cbind(cust = mydata$cust, .) %>%
  as.tbl
Assert that your scaled data frame is equivalent to a scaled version of your original data frame:
> (scale(mydata_sc %>% select(-cust)) - scale(mydata %>% select(-cust))) %>% colSums %>% sum
[1] 0.000000000000005353357
Create one single function to perform your desired manipulations:
return_dtw <- function(df) {
  res_2 <- TSDatabaseDistances(df[2:ncol(df)], distance = "dtw", lag.max = 2) %>%
    hclust(method = "ward.D2")
  return(data.frame(cutree(res_2, k)))
}
Execute the function:
> mydata %>% return_dtw %>% cbind(mydata_sc %>% return_dtw)
cutree.res_2..k. cutree.res_2..k.
1 1 1
2 1 1
3 1 1
4 1 1
5 1 1
6 1 1
7 1 1
8 1 1
9 1 1
10 1 1
11 2 2
12 3 3
13 1 1
14 2 2
15 3 3
16 1 1
17 2 2
18 4 3
19 1 1
20 2 2
21 3 3
22 1 1
23 2 2
24 3 3
25 5 4
26 6 5
27 7 5
28 5 6
29 6 7
30 8 8
31 1 1
32 2 2
33 3 3
34 1 1
35 2 2
36 3 3
Some of the later customers are not grouped similarly, but that's for another question!
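As a quick sanity check of the column-wise fix (a sketch, using mydata as defined in the question): identical rows stay identical after column-wise scaling, so their DTW distance is 0 as expected.

sc <- scale(mydata %>% select(-cust))
all.equal(sc[1, ], sc[2, ])   # customers A and B are still identical
#> [1] TRUE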
