Reindexing a column in R

I'm dealing with the following dataset:
animal protein herd sire dam
6 416 189.29 2 15 236
7 417 183.27 2 6 295
9 419 193.24 3 11 268
10 420 198.84 2 12 295
11 421 205.25 3 3 251
12 422 204.15 2 2 281
13 423 200.20 2 3 248
14 424 197.22 2 11 222
15 425 201.14 1 10 262
17 427 196.20 1 11 290
18 428 208.13 3 9 294
19 429 213.01 3 14 254
21 431 203.38 2 4 273
22 432 190.56 2 8 248
25 435 196.59 3 9 226
26 436 193.31 3 10 249
27 437 207.89 3 7 272
29 439 202.98 2 10 260
30 440 177.28 2 4 291
31 441 182.04 1 6 282
32 442 217.50 2 3 265
33 443 190.43 2 11 248
35 445 197.24 2 4 256
37 447 197.16 3 5 240
42 452 183.07 3 5 293
43 453 197.99 2 6 293
44 454 208.27 2 6 254
45 455 187.61 3 12 271
46 456 173.18 2 6 280
47 457 187.89 2 6 235
48 458 191.96 1 7 286
49 459 196.39 1 4 275
50 460 178.51 2 13 262
52 462 204.17 1 6 253
53 463 203.77 2 11 273
54 464 206.25 1 13 249
55 465 211.63 2 13 222
56 466 211.34 1 6 228
57 467 194.34 2 1 217
58 468 201.53 2 12 247
59 469 198.01 2 3 251
60 470 188.94 2 7 290
61 471 190.49 3 2 220
62 472 197.34 2 3 224
63 473 194.04 1 15 229
64 474 202.74 2 1 287
67 477 189.98 1 6 300
69 479 206.37 3 2 293
70 480 183.81 2 10 274
72 482 190.70 2 12 265
74 484 194.25 3 2 262
75 485 191.15 3 10 297
76 486 193.23 3 15 255
77 487 193.29 2 4 266
78 488 182.20 1 15 260
81 491 195.89 2 12 294
82 492 200.77 1 8 278
83 493 179.12 2 7 281
85 495 172.14 3 13 252
86 496 183.82 1 4 264
88 498 195.32 1 6 249
89 499 197.19 1 13 274
90 500 178.07 1 8 293
92 502 209.65 2 7 241
95 505 199.66 3 5 220
96 506 190.96 2 11 259
98 508 206.58 3 3 230
100 510 196.60 2 5 231
103 513 193.25 2 15 280
104 514 181.34 2 3 227
I'm interested in the animals' indexes and the corresponding dams' indexes. Using the table function, I was able to check that some dams are matched to several animals. In fact, I got the following output:
217 220 222 224 226 227 228 229 230 231 235 236 240 241 247 248 249 251 252 253 254 255 256 259 260 262
1 2 2 1 1 1 1 1 1 1 1 1 1 1 1 3 3 2 1 1 2 1 1 1 2 3
264 265 266 268 271 272 273 274 275 278 280 281 282 286 287 290 291 293 294 295 297 300
1 2 1 1 1 1 2 2 1 1 2 2 1 1 1 2 1 4 2 2 1 1
Using the length function, I checked that there are only 48 distinct dams in this dataset.
I would like to 'reindex' them with the integers 1, ..., 48 instead of the values given in my set. Is there a way to do this?

You can use match and unique, which numbers the dams in order of first appearance.
df$index <- match(df$dam, unique(df$dam))
Or convert to factor and then integer, which numbers them in sorted order of the dam values.
df$index <- as.integer(factor(df$dam))
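Another option is group_indices from dplyr. In dplyr 1.0.0 and later, passing the grouping column directly is deprecated, so group the data first:
df$index <- dplyr::group_indices(dplyr::group_by(df, dam))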
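The two base R approaches number the groups differently, which may matter if the order of the new index is important. A minimal sketch, using the first seven dam values from the question as a stand-in for the full df$dam column:

```r
# First seven dam values from the question's data (a stand-in for df$dam)
dam <- c(236, 295, 268, 295, 251, 281, 248)

# match + unique: groups are numbered in order of first appearance
by_appearance <- match(dam, unique(dam))
by_appearance
# [1] 1 2 3 2 4 5 6

# factor + as.integer: groups are numbered in sorted order of the values
by_sorted <- as.integer(factor(dam))
by_sorted
# [1] 1 6 4 6 3 5 2
```

Either way, repeated dams (295 here) receive the same index; only the numbering order differs.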

We can use .GRP in data.table:
library(data.table)
setDT(df)[, index := .GRP, by = dam]

Related

I can't run a DESeq2 analysis

I need to run a DESeq2 analysis on my dataset for a homework assignment, but I'm really new to this package (I have never used it before).
When I run the following:
counts <- read.table("ProstateCancerCountData.txt",sep="", header=TRUE, row.names=1)
metadat<- read.table("mart_export.txt",sep=",", header=TRUE, row.names=1)
counts <- as.matrix(counts)
dds <- DESeqDataSetFromMatrix(countData = counts, colData = metadat, design = ~ GC.content+ Gene.type)
I get this error:
Error in DESeqDataSetFromMatrix(countData = counts, colData = metadat, :
ncol(countData) == nrow(colData) is not TRUE
I don't know how to fix it.
These are the two datasets I have to use for the analysis:
head(counts)
N_10 T_10 N_11 T_12 N_13 T_13 N_14 T_14 N_1 T_1 N_2 T_2 N_3
ENSG00000000003 401 442 1155 1095 788 754 852 938 774 520 808 648 891
ENSG00000000005 0 7 23 9 5 2 45 5 11 10 56 8 7
ENSG00000000419 112 96 424 468 385 452 751 491 247 222 509 363 706
ENSG00000000457 13 121 327 165 40 204 290 199 70 121 104 151 352
ENSG00000000460 24 66 162 137 71 159 174 156 86 94 120 91 166
ENSG00000000938 96 128 218 372 126 129 538 320 117 129 157 238 177
T_3 N_4 N_5 T_6 N_7 T_7 N_8 T_8 N_9 T_9
ENSG00000000003 1071 2059 737 1006 1146 653 1299 1306 1522 490
ENSG00000000005 0 18 0 7 1 4 1 2 0 3
ENSG00000000419 622 988 307 402 294 323 535 518 573 322
ENSG00000000457 333 328 58 153 138 115 179 200 86 85
ENSG00000000460 152 162 100 100 101 148 128 78 83 109
ENSG00000000938 86 113 410 230 64 76 93 61 121 68
head(metadat)
Chromosome.scaffold.name Gene.start..bp. Gene.end..bp.
ENSG00000271782 1 50902700 50902978
ENSG00000232753 1 103817769 103828355
ENSG00000225767 1 50927141 50936822
ENSG00000202140 1 50965430 50965529
ENSG00000207194 1 51048076 51048183
ENSG00000252825 1 51215968 51216025
GC.content Gene.type
ENSG00000271782 35.48 lincRNA
ENSG00000232753 33.99 lincRNA
ENSG00000225767 38.99 antisense
ENSG00000202140 43.00 misc_RNA
ENSG00000207194 37.96 snRNA
ENSG00000252825 36.21 snRNA
Thank you for your help and for any clarification.
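For context: DESeqDataSetFromMatrix expects colData to describe the samples, with one row per column of countData, whereas the mart_export.txt table describes genes. A minimal sketch of building a sample table from the column names shown in head(counts); reading the N_/T_ prefixes as normal/tumour, and the column names condition and patient, are assumptions rather than something stated in the question:

```r
# Column names of `counts`, copied from the head(counts) output
sample_names <- c("N_10", "T_10", "N_11", "T_12", "N_13", "T_13", "N_14", "T_14",
                  "N_1", "T_1", "N_2", "T_2", "N_3", "T_3", "N_4", "N_5", "T_6",
                  "N_7", "T_7", "N_8", "T_8", "N_9", "T_9")

# One row per sample; the normal/tumour interpretation is an assumption
coldata <- data.frame(
  condition = factor(ifelse(startsWith(sample_names, "T"), "tumour", "normal")),
  patient   = factor(sub("^[NT]_", "", sample_names)),
  row.names = sample_names
)

# The condition that failed in the question holds for this table
length(sample_names) == nrow(coldata)

# With DESeq2 loaded, the call would then look like this (not run here):
# dds <- DESeqDataSetFromMatrix(countData = counts, colData = coldata,
#                               design = ~ condition)
```

Gene-level annotation such as GC.content or Gene.type describes rows of the count matrix, so it cannot serve as a design variable across samples.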
EDIT:
Thank you for your previous answer.
I switched to another dataset for this homework, but I now get a different error.
This is my new dataset:
head(mycounts)
R1L1Kidney R1L2Liver R1L3Kidney R1L4Liver R1L6Liver
ENSG00000177757 2 1 0 0 1
ENSG00000187634 49 27 43 34 23
ENSG00000188976 73 34 77 56 45
ENSG00000187961 15 8 15 13 11
ENSG00000187583 1 0 1 1 0
ENSG00000187642 4 0 5 0 2
R1L7Kidney R1L8Liver R2L2Kidney R2L3Liver R2L6Kidney
ENSG00000177757 2 0 1 1 3
ENSG00000187634 41 35 42 25 47
ENSG00000188976 68 55 70 42 82
ENSG00000187961 13 12 12 20 15
ENSG00000187583 3 0 0 2 3
ENSG00000187642 12 1 9 4 9
head(myfactors)
Tissue TissueRun
R1L1Kidney Kidney Kidney_1
R1L2Liver Liver Liver_1
R1L3Kidney Kidney Kidney_1
R1L4Liver Liver Liver_1
R1L6Liver Liver Liver_1
R1L7Kidney Kidney Kidney_1
When I build my DESeq object, I want to include both Tissue and TissueRun to account for the batch effect, but I get an error:
dds2 <- DESeqDataSetFromMatrix(countData = mycounts, colData = myfactors, design = ~ Tissue + TissueRun)
Error in checkFullRank(modelMatrix) :
the model matrix is not full rank, so the model cannot be fit as specified.
One or more variables or interaction terms in the design formula are linear
combinations of the others and must be removed.
Please read the vignette section 'Model matrix not full rank':
vignette('DESeq2')
Thank you for your help
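The full-rank error can be reproduced without DESeq2, since it comes from the design matrix itself. A sketch under assumptions: the sample names are those shown in head(mycounts), tissue and run are parsed from those names, and the TissueRun labels for the R2 samples (Kidney_2, Liver_2) are guessed from the pattern in head(myfactors). Because each TissueRun level belongs to exactly one Tissue, the Tissue column is a sum of TissueRun columns:

```r
# Sample names copied from head(mycounts)
samples <- c("R1L1Kidney", "R1L2Liver", "R1L3Kidney", "R1L4Liver", "R1L6Liver",
             "R1L7Kidney", "R1L8Liver", "R2L2Kidney", "R2L3Liver", "R2L6Kidney")

tissue <- sub("^R[0-9]+L[0-9]+", "", samples)   # "Kidney" or "Liver"
run    <- sub("^R([0-9]+)L.*$", "\\1", samples) # "1" or "2"

# Hypothetical reconstruction of myfactors (R2 labels are an assumption)
myfactors <- data.frame(
  Tissue    = factor(tissue),
  Run       = factor(run),
  TissueRun = factor(paste(tissue, run, sep = "_")),
  row.names = samples
)

# Design from the question: TissueRun is nested in Tissue
mm_nested <- model.matrix(~ Tissue + TissueRun, data = myfactors)
# Alternative: a separate run factor next to Tissue
mm_additive <- model.matrix(~ Tissue + Run, data = myfactors)

# A design is full rank only when rank equals the number of columns
c(columns = ncol(mm_nested), rank = qr(mm_nested)$rank)
c(columns = ncol(mm_additive), rank = qr(mm_additive)$rank)
```

A separate run factor is one way to keep a batch term without the redundancy; whether it matches the actual experimental layout is a judgment for the analyst.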

In R how to fill in numbers that do not appear in the sequence?

I have a data set that lists the percentiles for a set of scores like this:
> percentiles
Score Percentile
1 231 0
2 385 1
3 403 2
4 413 3
5 418 4
6 424 5
7 429 6
8 434 7
9 437 8
10 441 9
11 443 10
I would like the "Score" column to run from 100 to 500. That is, I would like Scores 100 to 231 to be associated with a Percentile of 0, Scores 232 to 385 to be associated with a Percentile of 1, etc. Is there a simple way to fill in the values that do not appear in the sequence of "Score" values so it looks like the below data set?
> percentiles
Score Percentile
1 100 0
2 101 0
3 102 0
4 103 0
5 104 0
6 105 0
7 106 0
8 107 0
9 108 0
10 109 0
--------------------
130 229 0
131 230 0
132 231 0
133 232 1
134 233 1
135 234 1
136 235 1
137 236 1
138 237 1
139 238 1
140 239 1
If you convert percentiles to a data.table, you could do a rolling join with a new table of all scores 100:500. The rolling join with roll = -Inf gives a fill-backward behavior by itself, but still the 444:500 values are NA so a forward nafill is added at the end.
library(data.table)
setDT(percentiles)
percentiles[data.table(Score = 100:500), on = .(Score), roll = -Inf
][, Percentile := nafill(Percentile, 'locf')]
# Score Percentile
# 1: 100 0
# 2: 101 0
# 3: 102 0
# 4: 103 0
# 5: 104 0
# ---
# 397: 496 10
# 398: 497 10
# 399: 498 10
# 400: 499 10
# 401: 500 10
You might think about this differently: instead of treating it as a data frame to fill, treat the scores as a set of breaks for binning. Use the scores as the breaks, with -Inf prepended as the lower bound. If you need something different to happen for the scores above the highest break, add Inf to the end of the breaks, but you'll need to come up with an additional label.
library(dplyr)
dat <- data.frame(Score = 100:500) %>%
  mutate(Percentile = cut(Score, breaks = c(-Inf, percentiles$Score),
                          labels = percentiles$Percentile,
                          right = TRUE, include.lowest = FALSE))
Taking a look at a few of the breaking points:
slice(dat, c(129:135, 342:346))
#> Score Percentile
#> 1 228 0
#> 2 229 0
#> 3 230 0
#> 4 231 0
#> 5 232 1
#> 6 233 1
#> 7 234 1
#> 8 441 9
#> 9 442 10
#> 10 443 10
#> 11 444 <NA>
#> 12 445 <NA>
We could use complete from tidyr, followed by fill:
library(dplyr)
library(tidyr)
out <- complete(percentiles, Score = 100:500) %>%
fill(Percentile, .direction = "updown")
out %>%
slice(c(1:10, 130:140)) %>%
as.data.frame
# Score Percentile
#1 100 0
#2 101 0
#3 102 0
#4 103 0
#5 104 0
#6 105 0
#7 106 0
#8 107 0
#9 108 0
#10 109 0
#11 229 0
#12 230 0
#13 231 0
#14 232 1
#15 233 1
#16 234 1
#17 235 1
#18 236 1
#19 237 1
#20 238 1
#21 239 1
data
percentiles <- structure(list(Score = c(231L, 385L, 403L, 413L, 418L, 424L,
429L, 434L, 437L, 441L, 443L), Percentile = 0:10), class = "data.frame",
row.names = c("1",
"2", "3", "4", "5", "6", "7", "8", "9", "10", "11"))
In base R you could use the findInterval function to break up your sequence 100:500 into buckets determined by the Score, then index into the Percentile column:
x <- 100:500
ind <- findInterval(x, percentiles$Score, left.open = TRUE)
output <- data.frame(Score = x, Percentile = percentiles$Percentile[ind + 1])
Values of x above 443 will receive a percentile of NA.
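If scores above the highest break should instead reuse the top percentile (as the rolling-join answer does), the index can be capped at the last row. A minimal sketch of that variation, with the percentiles table rebuilt from the question's data; the capping step is an addition, not part of the answer above:

```r
percentiles <- data.frame(
  Score = c(231L, 385L, 403L, 413L, 418L, 424L, 429L, 434L, 437L, 441L, 443L),
  Percentile = 0:10
)

x <- 100:500
ind <- findInterval(x, percentiles$Score, left.open = TRUE)

# Cap the row index so scores above 443 reuse the last percentile
ind <- pmin(ind + 1L, nrow(percentiles))
output <- data.frame(Score = x, Percentile = percentiles$Percentile[ind])

# Inspect the rows around the first and last breaks
output[output$Score %in% c(231, 232, 443, 444, 500), ]
```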
Here is a base R solution in which cut() and match() do the work:
x <- 100:500
s <- cut(x, c(0, percentiles$Score))
df <- data.frame(Score = x,
                 percentile = percentiles$Percentile[match(s, levels(s))])
which gives
> df
Score percentile
1 100 0
2 101 0
3 102 0
4 103 0
5 104 0
6 105 0
7 106 0
8 107 0
9 108 0
10 109 0
11 110 0
12 111 0
13 112 0
14 113 0
15 114 0
16 115 0
17 116 0
18 117 0
19 118 0
20 119 0
21 120 0
22 121 0
23 122 0
24 123 0
25 124 0
26 125 0
27 126 0
28 127 0
29 128 0
30 129 0
31 130 0
32 131 0
33 132 0
34 133 0
35 134 0
36 135 0
37 136 0
38 137 0
39 138 0
40 139 0
41 140 0
42 141 0
43 142 0
44 143 0
45 144 0
46 145 0
47 146 0
48 147 0
49 148 0
50 149 0
51 150 0
52 151 0
53 152 0
54 153 0
55 154 0
56 155 0
57 156 0
58 157 0
59 158 0
60 159 0
61 160 0
62 161 0
63 162 0
64 163 0
65 164 0
66 165 0
67 166 0
68 167 0
69 168 0
70 169 0
71 170 0
72 171 0
73 172 0
74 173 0
75 174 0
76 175 0
77 176 0
78 177 0
79 178 0
80 179 0
81 180 0
82 181 0
83 182 0
84 183 0
85 184 0
86 185 0
87 186 0
88 187 0
89 188 0
90 189 0
91 190 0
92 191 0
93 192 0
94 193 0
95 194 0
96 195 0
97 196 0
98 197 0
99 198 0
100 199 0
101 200 0
102 201 0
103 202 0
104 203 0
105 204 0
106 205 0
107 206 0
108 207 0
109 208 0
110 209 0
111 210 0
112 211 0
113 212 0
114 213 0
115 214 0
116 215 0
117 216 0
118 217 0
119 218 0
120 219 0
121 220 0
122 221 0
123 222 0
124 223 0
125 224 0
126 225 0
127 226 0
128 227 0
129 228 0
130 229 0
131 230 0
132 231 0
133 232 1
134 233 1
135 234 1
136 235 1
137 236 1
138 237 1
139 238 1
140 239 1
141 240 1
142 241 1
143 242 1
144 243 1
145 244 1
146 245 1
147 246 1
148 247 1
149 248 1
150 249 1
151 250 1
152 251 1
153 252 1
154 253 1
155 254 1
156 255 1
157 256 1
158 257 1
159 258 1
160 259 1
161 260 1
162 261 1
163 262 1
164 263 1
165 264 1
166 265 1
167 266 1
168 267 1
169 268 1
170 269 1
171 270 1
172 271 1
173 272 1
174 273 1
175 274 1
176 275 1
177 276 1
178 277 1
179 278 1
180 279 1
181 280 1
182 281 1
183 282 1
184 283 1
185 284 1
186 285 1
187 286 1
188 287 1
189 288 1
190 289 1
191 290 1
192 291 1
193 292 1
194 293 1
195 294 1
196 295 1
197 296 1
198 297 1
199 298 1
200 299 1
201 300 1
202 301 1
203 302 1
204 303 1
205 304 1
206 305 1
207 306 1
208 307 1
209 308 1
210 309 1
211 310 1
212 311 1
213 312 1
214 313 1
215 314 1
216 315 1
217 316 1
218 317 1
219 318 1
220 319 1
221 320 1
222 321 1
223 322 1
224 323 1
225 324 1
226 325 1
227 326 1
228 327 1
229 328 1
230 329 1
231 330 1
232 331 1
233 332 1
234 333 1
235 334 1
236 335 1
237 336 1
238 337 1
239 338 1
240 339 1
241 340 1
242 341 1
243 342 1
244 343 1
245 344 1
246 345 1
247 346 1
248 347 1
249 348 1
250 349 1
251 350 1
252 351 1
253 352 1
254 353 1
255 354 1
256 355 1
257 356 1
258 357 1
259 358 1
260 359 1
261 360 1
262 361 1
263 362 1
264 363 1
265 364 1
266 365 1
267 366 1
268 367 1
269 368 1
270 369 1
271 370 1
272 371 1
273 372 1
274 373 1
275 374 1
276 375 1
277 376 1
278 377 1
279 378 1
280 379 1
281 380 1
282 381 1
283 382 1
284 383 1
285 384 1
286 385 1
287 386 2
288 387 2
289 388 2
290 389 2
291 390 2
292 391 2
293 392 2
294 393 2
295 394 2
296 395 2
297 396 2
298 397 2
299 398 2
300 399 2
301 400 2
302 401 2
303 402 2
304 403 2
305 404 3
306 405 3
307 406 3
308 407 3
309 408 3
310 409 3
311 410 3
312 411 3
313 412 3
314 413 3
315 414 4
316 415 4
317 416 4
318 417 4
319 418 4
320 419 5
321 420 5
322 421 5
323 422 5
324 423 5
325 424 5
326 425 6
327 426 6
328 427 6
329 428 6
330 429 6
331 430 7
332 431 7
333 432 7
334 433 7
335 434 7
336 435 8
337 436 8
338 437 8
339 438 9
340 439 9
341 440 9
342 441 9
343 442 10
344 443 10
345 444 NA
346 445 NA
347 446 NA
348 447 NA
349 448 NA
350 449 NA
351 450 NA
352 451 NA
353 452 NA
354 453 NA
355 454 NA
356 455 NA
357 456 NA
358 457 NA
359 458 NA
360 459 NA
361 460 NA
362 461 NA
363 462 NA
364 463 NA
365 464 NA
366 465 NA
367 466 NA
368 467 NA
369 468 NA
370 469 NA
371 470 NA
372 471 NA
373 472 NA
374 473 NA
375 474 NA
376 475 NA
377 476 NA
378 477 NA
379 478 NA
380 479 NA
381 480 NA
382 481 NA
383 482 NA
384 483 NA
385 484 NA
386 485 NA
387 486 NA
388 487 NA
389 488 NA
390 489 NA
391 490 NA
392 491 NA
393 492 NA
394 493 NA
395 494 NA
396 495 NA
397 496 NA
398 497 NA
399 498 NA
400 499 NA
401 500 NA
A bit hacky Base R:
# Create a dataframe with all score values in the range:
score_range <- merge(data.frame(Score = c(100:500)), percentiles, by = "Score", all.x = TRUE)
# Reverse the order of the dataframe:
score_range <- score_range[rev(order(score_range$Score)),]
# Change the first NA (the highest score, now in the first row) to the maximum percentile:
score_range$Percentile[which(is.na(score_range$Percentile))][1] <- max(score_range$Percentile, na.rm = TRUE)
# Replace all NAs with the value before them:
score_range$Percentile <- na.omit(score_range$Percentile)[cumsum(!is.na(score_range$Percentile))]
Data:
percentiles <- structure(list(Score = c(231L, 385L, 403L, 413L, 418L, 424L,
429L, 434L, 437L, 441L, 443L),
Percentile = 0:10), class = "data.frame",
row.names = c("1",
"2", "3", "4", "5", "6", "7", "8", "9", "10", "11"))

Legend without a range of color

The following script gives me a graph whose legend shows a continuous range of colors between Vc=30 and Vc=40. All I need is a legend with only two colors, one for Vc=30 and the other for Vc=40. How can I do that?
dados
Vc Tool Lu Fres Edge
1 30 1 10 466 1
2 30 1 10 416 2
3 30 1 10 465 1
4 30 1 10 416 2
5 30 1 10 464 1
6 30 1 10 416 2
7 30 1 10 476 1
8 30 1 10 412 2
9 30 1 10 468 1
10 30 1 10 410 2
11 30 1 10 470 1
12 30 1 10 407 2
13 30 1 10 468 1
14 30 1 10 412 2
15 30 1 10 469 1
16 30 1 10 414 2
17 30 1 10 469 1
18 30 1 10 412 2
19 30 1 10 467 1
20 30 1 10 409 2
21 30 1 10 469 1
22 30 1 10 415 2
23 30 1 10 471 1
24 30 1 10 420 2
25 30 1 10 469 1
26 30 1 10 416 2
27 30 1 10 464 1
28 30 1 10 409 2
29 30 1 10 465 1
30 30 1 10 412 2
31 30 1 10 464 1
32 30 1 10 409 2
33 30 1 10 466 1
34 30 1 10 417 2
35 30 1 10 466 1
36 30 1 10 417 2
37 30 1 10 464 1
38 30 1 10 414 2
39 30 1 10 466 1
40 30 1 10 415 2
41 40 1 38 457 1
42 40 1 38 416 2
43 40 1 38 460 1
44 40 1 38 438 2
45 40 1 38 465 1
46 40 1 38 441 2
47 40 1 38 467 1
48 40 1 38 442 2
49 40 1 38 473 1
50 40 1 38 452 2
51 40 1 38 469 1
52 40 1 38 446 2
53 40 1 38 478 1
54 40 1 38 450 2
55 40 1 38 476 1
56 40 1 38 454 2
57 40 1 38 479 1
58 40 1 38 452 2
59 40 1 38 480 1
60 40 1 38 450 2
61 40 1 38 481 1
62 40 1 38 443 2
63 40 1 38 476 1
64 40 1 38 447 2
65 40 1 38 472 1
66 40 1 38 450 2
67 40 1 38 479 1
68 40 1 38 449 2
69 40 1 38 478 1
70 40 1 38 455 2
71 40 1 38 478 1
72 40 1 38 457 2
73 40 1 38 481 1
74 40 1 38 447 2
75 40 1 38 504 1
76 40 1 38 452 2
77 40 1 38 472 1
78 40 1 38 447 2
79 40 1 38 472 1
80 40 1 38 451 2
81 40 1 66 622 1
82 40 1 66 377 2
83 40 1 66 619 1
84 40 1 66 378 2
85 40 1 66 622 1
86 40 1 66 369 2
87 40 1 66 616 1
88 40 1 66 374 2
89 40 1 66 619 1
90 40 1 66 374 2
91 40 1 66 616 1
92 40 1 66 374 2
93 40 1 66 621 1
94 40 1 66 375 2
95 40 1 66 618 1
96 40 1 66 397 2
97 40 1 66 633 1
98 40 1 66 406 2
99 40 1 66 652 1
100 40 1 66 412 2
101 40 1 66 652 1
102 40 1 66 419 2
103 40 1 66 658 1
104 40 1 66 423 2
105 40 1 66 659 1
106 40 1 66 409 2
107 40 1 66 650 1
108 40 1 66 405 2
109 40 1 66 653 1
110 40 1 66 405 2
111 40 1 66 652 1
112 40 1 66 403 2
113 40 1 66 656 1
114 40 1 66 408 2
115 40 1 66 644 1
116 40 1 66 406 2
117 40 1 66 649 1
118 40 1 66 412 2
119 40 1 66 650 1
120 40 1 66 406 2
121 30 1 94 585 1
122 30 1 94 234 2
123 30 1 94 589 1
124 30 1 94 231 2
125 30 1 94 585 1
126 30 1 94 223 2
127 30 1 94 586 1
128 30 1 94 223 2
129 30 1 94 572 1
130 30 1 94 233 2
131 30 1 94 585 1
132 30 1 94 233 2
133 30 1 94 589 1
134 30 1 94 234 2
135 30 1 94 598 1
136 30 1 94 237 2
137 30 1 94 605 1
138 30 1 94 237 2
139 30 1 94 586 1
140 30 1 94 233 2
141 30 1 94 588 1
142 30 1 94 227 2
143 30 1 94 585 1
144 30 1 94 230 2
145 30 1 94 586 1
146 30 1 94 230 2
147 30 1 94 591 1
148 30 1 94 237 2
149 30 1 94 586 1
150 30 1 94 234 2
151 30 1 94 592 1
152 30 1 94 237 2
153 30 1 94 595 1
154 30 1 94 236 2
155 30 1 94 600 1
156 30 1 94 227 2
157 30 1 94 592 1
158 30 1 94 237 2
159 30 1 94 592 1
160 30 1 94 240 2
161 40 1 122 853 1
162 40 1 122 330 2
163 40 1 122 859 1
164 40 1 122 323 2
165 40 1 122 842 1
166 40 1 122 308 2
167 40 1 122 842 1
168 40 1 122 324 2
169 40 1 122 831 1
170 40 1 122 334 2
171 40 1 122 838 1
172 40 1 122 341 2
173 40 1 122 836 1
174 40 1 122 328 2
175 40 1 122 840 1
176 40 1 122 324 2
177 40 1 122 836 1
178 40 1 122 321 2
179 40 1 122 831 1
180 40 1 122 328 2
181 40 1 122 833 1
182 40 1 122 328 2
183 40 1 122 840 1
184 40 1 122 330 2
185 40 1 122 831 1
186 40 1 122 321 2
187 40 1 122 833 1
188 40 1 122 328 2
189 40 1 122 833 1
190 40 1 122 321 2
191 40 1 122 840 1
192 40 1 122 319 2
193 40 1 122 838 1
194 40 1 122 317 2
195 40 1 122 831 1
196 40 1 122 319 2
197 40 1 122 827 1
198 40 1 122 323 2
199 40 1 122 836 1
200 40 1 122 328 2
201 30 2 10 468 1
202 30 2 10 408 2
203 30 2 10 471 1
204 30 2 10 405 2
205 30 2 10 475 1
206 30 2 10 403 2
207 30 2 10 470 1
208 30 2 10 409 2
209 30 2 10 478 1
210 30 2 10 405 2
211 30 2 10 474 1
212 30 2 10 403 2
213 30 2 10 472 1
214 30 2 10 402 2
215 30 2 10 478 1
216 30 2 10 408 2
217 30 2 10 477 1
218 30 2 10 406 2
219 30 2 10 473 1
220 30 2 10 406 2
221 30 2 10 474 1
222 30 2 10 406 2
223 30 2 10 477 1
224 30 2 10 411 2
225 30 2 10 480 1
226 30 2 10 413 2
227 30 2 10 479 1
228 30 2 10 408 2
229 30 2 10 476 1
230 30 2 10 406 2
231 30 2 10 476 1
232 30 2 10 404 2
233 30 2 10 472 1
234 30 2 10 407 2
235 30 2 10 474 1
236 30 2 10 411 2
237 30 2 10 473 1
238 30 2 10 415 2
239 30 2 10 479 1
240 30 2 10 409 2
241 40 2 38 442 1
242 40 2 38 407 2
243 40 2 38 437 1
244 40 2 38 410 2
245 40 2 38 444 1
246 40 2 38 412 2
247 40 2 38 440 1
248 40 2 38 414 2
249 40 2 38 439 1
250 40 2 38 413 2
251 40 2 38 436 1
252 40 2 38 416 2
253 40 2 38 446 1
254 40 2 38 412 2
255 40 2 38 438 1
256 40 2 38 414 2
257 40 2 38 443 1
258 40 2 38 408 2
259 40 2 38 446 1
260 40 2 38 407 2
261 40 2 38 445 1
262 40 2 38 413 2
263 40 2 38 453 1
264 40 2 38 414 2
265 40 2 38 449 1
266 40 2 38 417 2
267 40 2 38 447 1
268 40 2 38 411 2
269 40 2 38 443 1
270 40 2 38 417 2
271 40 2 38 447 1
272 40 2 38 410 2
273 40 2 38 449 1
274 40 2 38 409 2
275 40 2 38 442 1
276 40 2 38 413 2
277 40 2 38 451 1
278 40 2 38 412 2
279 40 2 38 447 1
280 40 2 38 420 2
281 40 2 66 526 1
282 40 2 66 467 2
283 40 2 66 532 1
284 40 2 66 470 2
285 40 2 66 528 1
286 40 2 66 474 2
287 40 2 66 529 1
288 40 2 66 472 2
289 40 2 66 533 1
290 40 2 66 480 2
291 40 2 66 542 1
292 40 2 66 487 2
293 40 2 66 545 1
294 40 2 66 504 2
295 40 2 66 549 1
296 40 2 66 507 2
297 40 2 66 546 1
298 40 2 66 517 2
299 40 2 66 541 1
300 40 2 66 518 2
301 40 2 66 554 1
302 40 2 66 514 2
303 40 2 66 564 1
304 40 2 66 514 2
305 40 2 66 571 1
306 40 2 66 522 2
307 40 2 66 575 1
308 40 2 66 525 2
309 40 2 66 582 1
310 40 2 66 533 2
311 40 2 66 588 1
312 40 2 66 536 2
313 40 2 66 591 1
314 40 2 66 553 2
315 40 2 66 592 1
316 40 2 66 557 2
317 40 2 66 592 1
318 40 2 66 563 2
319 40 2 66 583 1
320 40 2 66 568 2
321 30 2 94 578 1
322 30 2 94 370 2
323 30 2 94 570 1
324 30 2 94 378 2
325 30 2 94 575 1
326 30 2 94 367 2
327 30 2 94 579 1
328 30 2 94 371 2
329 30 2 94 576 1
330 30 2 94 362 2
331 30 2 94 579 1
332 30 2 94 372 2
333 30 2 94 588 1
334 30 2 94 375 2
335 30 2 94 586 1
336 30 2 94 372 2
337 30 2 94 589 1
338 30 2 94 378 2
339 30 2 94 587 1
340 30 2 94 375 2
341 30 2 94 578 1
342 30 2 94 368 2
343 30 2 94 575 1
344 30 2 94 375 2
345 30 2 94 574 1
346 30 2 94 376 2
347 30 2 94 575 1
348 30 2 94 367 2
349 30 2 94 580 1
350 30 2 94 382 2
351 30 2 94 583 1
352 30 2 94 368 2
353 30 2 94 591 1
354 30 2 94 386 2
355 30 2 94 595 1
356 30 2 94 379 2
357 30 2 94 593 1
358 30 2 94 384 2
359 30 2 94 607 1
360 30 2 94 399 2
361 30 2 122 760 1
362 30 2 122 625 2
363 30 2 122 746 1
364 30 2 122 612 2
365 30 2 122 762 1
366 30 2 122 625 2
367 30 2 122 783 1
368 30 2 122 637 2
369 30 2 122 778 1
370 30 2 122 640 2
371 30 2 122 778 1
372 30 2 122 638 2
373 30 2 122 791 1
374 30 2 122 638 2
375 30 2 122 782 1
376 30 2 122 635 2
377 30 2 122 792 1
378 30 2 122 640 2
379 30 2 122 783 1
380 30 2 122 637 2
381 30 2 122 774 1
382 30 2 122 622 2
383 30 2 122 777 1
384 30 2 122 618 2
385 30 2 122 777 1
386 30 2 122 622 2
387 30 2 122 765 1
388 30 2 122 623 2
389 30 2 122 769 1
390 30 2 122 625 2
391 30 2 122 775 1
392 30 2 122 622 2
393 30 2 122 777 1
394 30 2 122 628 2
395 30 2 122 769 1
396 30 2 122 620 2
397 30 2 122 778 1
398 30 2 122 623 2
399 30 2 122 788 1
400 30 2 122 634 2
> dadosc <- summarySE(dados, measurevar="Fres", groupvars=c("Vc","Lu"))
> dadosc
Vc Lu N Fres sd se ci
1 30 10 80 440.6875 30.91540 3.456447 6.879885
2 30 94 80 445.0250 150.97028 16.878990 33.596789
3 30 122 40 701.7000 75.06688 11.869115 24.007552
4 40 38 80 444.6125 23.31973 2.607225 5.189552
5 40 66 80 526.7125 90.77824 10.149316 20.201707
6 40 122 40 581.1250 259.74092 41.068645 83.069175
pd <- position_dodge(0.1) # Define the dodge before it is used below
ggplot(dadosc, aes(x=Lu, y=Fres, ymax=max(Fres), colour=Vc)) +
  geom_errorbar(aes(ymin=Fres-se, ymax=Fres+se), colour="black", width=2, position=pd) +
  geom_point(position=pd, size=3, shape=21, fill="white") +
  xlab("Machining length (mm)") +
  ylab("Machining forces (N)") +
  ggtitle("The Effect of Cutting Velocity on Machining Forces") +
  expand_limits(y=400) + # Expand y range
  scale_y_continuous(breaks=0:20*50) + # Set a tick every 50
  theme_bw() +
  theme(legend.justification=c(1,0),
        legend.position=c(1,0))
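One common way to get a two-entry legend is to map colour to a discrete variable, since ggplot2 draws a continuous colour bar for numeric columns. A sketch, assuming the summary values printed for dadosc (copied here so the example is self-contained) and that treating Vc as a factor is acceptable for the analysis; the plot is only drawn if ggplot2 is installed:

```r
# Summary table copied from the printed dadosc output
dadosc <- data.frame(
  Vc   = c(30, 30, 30, 40, 40, 40),
  Lu   = c(10, 94, 122, 38, 66, 122),
  Fres = c(440.6875, 445.0250, 701.7000, 444.6125, 526.7125, 581.1250),
  se   = c(3.456447, 16.878990, 11.869115, 2.607225, 10.149316, 41.068645)
)

# A factor gives a discrete colour scale with one legend key per level
dadosc$Vc <- factor(dadosc$Vc)

if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  pd <- position_dodge(0.1)
  p <- ggplot(dadosc, aes(x = Lu, y = Fres, colour = Vc)) +
    geom_errorbar(aes(ymin = Fres - se, ymax = Fres + se),
                  colour = "black", width = 2, position = pd) +
    geom_point(position = pd, size = 3, shape = 21, fill = "white") +
    theme_bw()
  print(p)
}
```

Writing colour = factor(Vc) inside aes() has the same effect without modifying the data frame.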

Function or other basic script that compares values on two variables in a dataframe using an id variable located in both

Let's say you have two data frames, both of which contain some, but not all of the same records. Where they are the same records, the id variable in both data frames matches. There is a particular variable in each data frame that needs to be checked for consistency across the data frames, and any discrepancies need to be printed:
d1 <- ## first dataframe
d2 <- ## second dataframe
colnames(d1) #column headings for dataframe 1
[1] "id" "variable1" "variable2" "variable3"
colnames(d2) #column headings for dataframe 2 are identical
[1] "id" "variable1" "variable2" "variable3"
length(d1$id) #there are 200 records in dataframe 1
[1] 200
length(d2$id) #there are not the same number in dataframe 2
[1] 150
##Some function that takes d1$id, matches with d2$id, then compares the values of the matched, returning any discrepancies
I constructed an elaborate loop for this, but feel as though this is not the right way of going about it. Surely there is some better way than this for-if-for-if-if statement.
for (i in seq(d1$id)){ ##Sets up counter for loop
if (d1$id[i] %in% d2$id){ ## Search, compares and saves a common id and variable
index <- d1$id[i];
variable_d1 <- d1$variable1[i];
for (p in seq(d2$id)){ ## Sets up inner loop over the second dataframe
if (d2$id[p] == index){ ## saves the corresponding value in the second dataframe
variable_d2 <- d2$variable1[p];
if (variable_d2 != variable_d1) { ## prints if they are not equal
print(index);
}
}
}
}
}
Here's a solution, using random input data with a 50% chance that a given cell will be discrepant between d1 and d2:
set.seed(1);
d1 <- data.frame(id=sample(300,200),variable1=sample(2,200,replace=T),variable2=sample(2,200,replace=T),variable3=sample(2,200,replace=T));
d2 <- data.frame(id=sample(300,150),variable1=sample(2,150,replace=T),variable2=sample(2,150,replace=T),variable3=sample(2,150,replace=T));
head(d1);
## id variable1 variable2 variable3
## 1 80 1 2 2
## 2 112 1 1 2
## 3 171 2 2 1
## 4 270 1 2 2
## 5 60 1 2 2
## 6 266 2 2 2
head(d2);
## id variable1 variable2 variable3
## 1 258 1 2 1
## 2 11 1 1 1
## 3 290 2 1 2
## 4 222 2 1 2
## 5 81 2 1 1
## 6 200 1 2 1
com <- intersect(d1$id,d2$id); ## derive common id values
d1com <- match(com,d1$id); ## find indexes of d1 that correspond to common id values, in order of com
d2com <- match(com,d2$id); ## find indexes of d2 that correspond to common id values, in order of com
v1diff <- com[d1$variable1[d1com]!=d2$variable1[d2com]]; ## get ids of variable1 discrepancies
v1diff;
## [1] 60 278 18 219 290 35 107 4 237 131 50 210 29 168 6 174 61 127 99 220 247 244 157 51 84 122 196 125 265 115 186 139 3 132 223 211 268 102 155 207 238 41 199 200 231 236 172 275 250 176 248 255 222 59 100 33 124
v2diff <- com[d1$variable2[d1com]!=d2$variable2[d2com]]; ## get ids of variable2 discrepancies
v2diff;
## [1] 112 60 18 198 219 290 131 50 210 29 168 258 215 291 127 161 99 220 110 293 87 164 84 122 196 125 186 139 81 132 82 89 223 268 98 14 155 241 207 231 172 62 275 176 248 255 59 298 100 12 156
v3diff <- com[d1$variable3[d1com]!=d2$variable3[d2com]]; ## get ids of variable3 discrepancies
v3diff;
## [1] 278 219 290 35 4 237 131 168 202 174 215 220 247 244 261 293 164 13 294 84 196 125 265 115 186 81 3 89 223 211 268 98 14 155 241 207 38 191 200 276 250 45 269 255 298 100 12 156 124
Here's a proof that all variable1 values for ids in v1diff are really discrepant between d1 and d2:
d1$variable1[match(v1diff,d1$id)]; d2$variable1[match(v1diff,d2$id)];
## [1] 1 2 2 1 1 2 2 1 1 1 2 2 2 2 1 2 2 1 2 2 1 1 2 1 1 2 1 1 1 1 1 1 1 1 1 2 2 2 1 2 2 1 1 2 1 1 2 1 2 1 2 2 1 2 2 1 1
## [1] 2 1 1 2 2 1 1 2 2 2 1 1 1 1 2 1 1 2 1 1 2 2 1 2 2 1 2 2 2 2 2 2 2 2 2 1 1 1 2 1 1 2 2 1 2 2 1 2 1 2 1 1 2 1 1 2 2
Here's a proof that all variable1 values for ids not in v1diff are not discrepant between d1 and d2:
with(subset(d1,id%in%com&!id%in%v1diff),variable1[order(id)]); with(subset(d2,id%in%com&!id%in%v1diff),variable1[order(id)]);
## [1] 1 1 2 1 1 1 2 2 1 2 2 1 2 2 1 1 2 1 2 1 2 1 1 1 1 1 1 2 2 2 2 1 1 1 2 2 2 1 1 1 1
## [1] 1 1 2 1 1 1 2 2 1 2 2 1 2 2 1 1 2 1 2 1 2 1 1 1 1 1 1 2 2 2 2 1 1 1 2 2 2 1 1 1 1
Here, I wrapped this solution in a function which returns the vectors of discrepant id values in a list, with each component named for the variable it represents:
compare <- function(d1,d2,cols=setdiff(intersect(colnames(d1),colnames(d2)),'id')) {
com <- intersect(d1$id,d2$id);
d1com <- match(com,d1$id);
d2com <- match(com,d2$id);
setNames(lapply(cols,function(col) com[d1[[col]][d1com]!=d2[[col]][d2com]]),cols);
};
compare(d1,d2);
## $variable1
## [1] 60 278 18 219 290 35 107 4 237 131 50 210 29 168 6 174 61 127 99 220 247 244 157 51 84 122 196 125 265 115 186 139 3 132 223 211 268 102 155 207 238 41 199 200 231 236 172 275 250 176 248 255 222 59 100 33 124
##
## $variable2
## [1] 112 60 18 198 219 290 131 50 210 29 168 258 215 291 127 161 99 220 110 293 87 164 84 122 196 125 186 139 81 132 82 89 223 268 98 14 155 241 207 231 172 62 275 176 248 255 59 298 100 12 156
##
## $variable3
## [1] 278 219 290 35 4 237 131 168 202 174 215 220 247 244 261 293 164 13 294 84 196 125 265 115 186 81 3 89 223 211 268 98 14 155 241 207 38 191 200 276 250 45 269 255 298 100 12 156 124
Here is an approach using merge.
First, merge the dataframes, keeping all columns.
x <- merge(d1, d2, by="id")
Then, find all rows which do not match:
x[x$variable1.x != x$variable1.y | x$variable2.x != x$variable2.y |
x$variable3.x != x$variable3.y, ]
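A small self-contained sketch of the merge approach, using two made-up data frames (the values are illustrative only) so the result is easy to verify by eye. By default merge keeps only ids present in both frames and suffixes the shared column names with .x and .y:

```r
# Illustrative data: ids 2, 3 and 4 are common to both frames
d1 <- data.frame(id = c(1, 2, 3, 4),
                 variable1 = c(1, 2, 1, 2),
                 variable2 = c(1, 1, 2, 2))
d2 <- data.frame(id = c(2, 3, 4, 5),
                 variable1 = c(2, 2, 2, 1),
                 variable2 = c(1, 2, 1, 1))

# Inner join on id; non-key columns become variable1.x, variable1.y, ...
x <- merge(d1, d2, by = "id")

# Keep the rows where at least one variable differs between the frames
mismatch <- x[x$variable1.x != x$variable1.y |
              x$variable2.x != x$variable2.y, ]
mismatch$id
# [1] 3 4
```

Here id 3 differs in variable1 and id 4 differs in variable2, while id 2 agrees in both.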

Applying function to every group in R

> head(m)
X id1 q_following topic_followed topic_answered nfollowers nfollowing
1 1 1 80 80 100 180 180
2 2 1 76 76 95 171 171
3 3 1 72 72 90 162 162
4 4 1 68 68 85 153 153
5 5 1 64 64 80 144 144
6 6 1 60 60 75 135 135
> head(d)
X id1 q_following topic_followed topic_answered nfollowers nfollowing
1 1 1 63 735 665 949 146
2 2 1 89 737 666 587 185
3 3 1 121 742 670 428 264
4 4 1 277 750 706 622 265
5 5 1 339 765 734 108 294
6 6 1 363 767 766 291 427
matcher <- function(x,y){ return(na.omit(m[which(d[,y]==x),y])) }
max_matcher <- function(x) { return(sum(matcher(x,3:13))) }
result <- foreach(1:1000, function(x) {
if(max(max_matcher(1:1000)) == max_matcher(x)) return(x)
})
I want to compute result across each group, grouped by id1 of dataframe m.
m %>% group_by(id1) %>% summarise(result) #doesn't work
by(m, m[,"id1"], result) #doesn't work
How should I proceed?
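For reference, by() expects a function that receives the rows of one group, and summarise() expects expressions evaluated within each group, so passing the object result does not fit either. A sketch of the general pattern, using a hypothetical per-group function score_group (a stand-in for the matching logic, which is not fully specified in the question) and a small data frame whose id1 = 2 rows are made up:

```r
# The id1 = 1 rows reuse values from head(m); the id1 = 2 rows are invented
m <- data.frame(
  id1         = c(1, 1, 1, 2, 2),
  q_following = c(80, 76, 72, 10, 30),
  nfollowers  = c(180, 171, 162, 5, 50)
)

# Hypothetical per-group function: takes the rows of one group,
# returns a single number
score_group <- function(g) {
  max(g$q_following + g$nfollowers)
}

# Split the data frame by id1 and apply the function to each piece
res <- sapply(split(m, m$id1), score_group)
res
```

The same function can be passed to by(m, m$id1, score_group), which is the form the second attempt in the question was aiming for.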
