Subset data.table using min condition

Subset data.table using min condition - r

There is probably a really simple solution to this problem, but I couldn't find it from googling, or the data.table FAQ.
I have a data.table like so:
> test
chr bp ID REF ALT AF AC AN EFFECT IMPACT FUNCLASS CODING GENE pos effRank
1: 1 860416 rs61464428 G A 0.5000000 14 28 UPSTREAM MODIFIER CODING SAMD11 1:860416 21
2: 1 860416 rs61464428 G A 0.5000000 14 28 UPSTREAM MODIFIER CODING SAMD11 1:860416 21
3: 1 860416 rs61464428 G A 0.5000000 14 28 DOWNSTREAM MODIFIER CODING AL645608.1 1:860416 22
4: 1 860461 rs57465118 G A 1.0000000 62 62 UPSTREAM MODIFIER CODING SAMD11 1:860461 21
5: 1 860461 rs57465118 G A 1.0000000 62 62 UPSTREAM MODIFIER CODING SAMD11 1:860461 21
6: 1 860461 rs57465118 G A 1.0000000 62 62 DOWNSTREAM MODIFIER CODING AL645608.1 1:860461 22
7: 1 860521 rs57924093 C A 0.9840000 61 62 UPSTREAM MODIFIER CODING SAMD11 1:860521 21
8: 1 860521 rs57924093 C A 0.9840000 61 62 UPSTREAM MODIFIER CODING SAMD11 1:860521 21
9: 1 860521 rs57924093 C A 0.9840000 61 62 DOWNSTREAM MODIFIER CODING AL645608.1 1:860521 22
10: 1 861261 rs144896029 G A 0.0027270 3 1100 UPSTREAM MODIFIER CODING SAMD11 1:861261 21
11: 1 861261 rs144896029 G A 0.0027270 3 1100 DOWNSTREAM MODIFIER CODING AL645608.1 1:861261 22
12: 1 861332 G A 0.0009074 1 1102 NON_SYNONYMOUS_CODING MODERATE MISSENSE CODING AL645608.1 1:861332 11
13: 1 861332 G A 0.0009074 1 1102 NON_SYNONYMOUS_CODING MODERATE MISSENSE CODING SAMD11 1:861332 11
14: 1 861332 G A 0.0009074 1 1102 NON_SYNONYMOUS_CODING MODERATE MISSENSE CODING SAMD11 1:861332 11
15: 1 861332 G A 0.0009074 1 1102 NON_SYNONYMOUS_CODING MODERATE MISSENSE CODING SAMD11 1:861332 11
16: 1 861332 G A 0.0009074 1 1102 UPSTREAM MODIFIER CODING SAMD11 1:861332 21
17: 1 865455 C G 0.0033190 3 904 UPSTREAM MODIFIER CODING SAMD11 1:865455 21
18: 1 865628 rs41285790 G A 0.0027780 3 1080 NON_SYNONYMOUS_CODING MODERATE MISSENSE CODING SAMD11 1:865628 11
19: 1 865628 rs41285790 G A 0.0027780 3 1080 NON_SYNONYMOUS_CODING MODERATE MISSENSE CODING SAMD11 1:865628 11
20: 1 865628 rs41285790 G A 0.0027780 3 1080 NON_SYNONYMOUS_CODING MODERATE MISSENSE CODING SAMD11 1:865628 11
21: 1 865628 rs41285790 G A 0.0027780 3 1080 SYNONYMOUS_CODING LOW SILENT CODING AL645608.1 1:865628 14
22: 1 865628 rs41285790 G A 0.0027780 3 1080 UPSTREAM MODIFIER CODING SAMD11 1:865628 21
23: 1 866437 rs139076934 C T 0.0009074 1 1102 SYNONYMOUS_CODING LOW SILENT CODING AL645608.1 1:866437 14
24: 1 866437 rs139076934 C T 0.0009074 1 1102 SYNONYMOUS_CODING LOW SILENT CODING SAMD11 1:866437 14
25: 1 866437 rs139076934 C T 0.0009074 1 1102 SYNONYMOUS_CODING LOW SILENT CODING SAMD11 1:866437 14
26: 1 866437 rs139076934 C T 0.0009074 1 1102 SYNONYMOUS_CODING LOW SILENT CODING SAMD11 1:866437 14
27: 1 866461 rs148884928 G A 0.0009074 1 1102 SYNONYMOUS_CODING LOW SILENT CODING SAMD11 1:866461 14
28: 1 866461 rs148884928 G A 0.0009074 1 1102 SYNONYMOUS_CODING LOW SILENT CODING SAMD11 1:866461 14
29: 1 866461 rs148884928 G A 0.0009074 1 1102 SYNONYMOUS_CODING LOW SILENT CODING SAMD11 1:866461 14
30: 1 866461 rs148884928 G A 0.0009074 1 1102 UPSTREAM MODIFIER CODING AL645608.1 1:866461 21
31: 1 866511 rs71576583 CCCCT CCCCTCCCT 1.0000000 148 148 UPSTREAM MODIFIER CODING AL645608.1 1:866511 21
32: 1 871057 C T 0.0009074 1 1102 UPSTREAM MODIFIER CODING SAMD11 1:871057 21
33: 1 871057 C T 0.0009074 1 1102 UPSTREAM MODIFIER CODING AL645608.1 1:871057 21
34: 1 871057 C T 0.0009074 1 1102 UPSTREAM MODIFIER CODING SAMD11 1:871057 21
35: 1 871215 rs28419423 C G 0.0036300 4 1102 SYNONYMOUS_CODING LOW SILENT CODING SAMD11 1:871215 14
36: 1 871215 rs28419423 C G 0.0036300 4 1102 SYNONYMOUS_CODING LOW SILENT CODING SAMD11 1:871215 14
37: 1 871215 rs28419423 C G 0.0036300 4 1102 UPSTREAM MODIFIER CODING SAMD11 1:871215 21
38: 1 871215 rs28419423 C G 0.0036300 4 1102 UPSTREAM MODIFIER CODING SAMD11 1:871215 21
39: 1 871215 rs28419423 C G 0.0036300 4 1102 UPSTREAM MODIFIER CODING AL645608.1 1:871215 21
40: 1 871215 rs28419423 C G 0.0036300 4 1102 DOWNSTREAM MODIFIER CODING SAMD11 1:871215 22
41: 1 871287 C G 0.0009107 1 1098 UPSTREAM MODIFIER CODING SAMD11 1:871287 21
42: 1 871287 C G 0.0009107 1 1098 UPSTREAM MODIFIER CODING SAMD11 1:871287 21
43: 1 871287 C G 0.0009107 1 1098 UPSTREAM MODIFIER CODING AL645608.1 1:871287 21
44: 1 871287 C G 0.0009107 1 1098 DOWNSTREAM MODIFIER CODING SAMD11 1:871287 22
45: 1 871334 rs4072383 G T 0.6680000 474 710 UPSTREAM MODIFIER CODING SAMD11 1:871334 21
46: 1 871334 rs4072383 G T 0.6680000 474 710 UPSTREAM MODIFIER CODING SAMD11 1:871334 21
47: 1 871334 rs4072383 G T 0.6680000 474 710 UPSTREAM MODIFIER CODING AL645608.1 1:871334 21
48: 1 871334 rs4072383 G T 0.6680000 474 710 DOWNSTREAM MODIFIER CODING SAMD11 1:871334 22
49: 1 874415 rs74047412 C T 0.0018250 2 1096 UPSTREAM MODIFIER CODING SAMD11 1:874415 21
50: 1 874415 rs74047412 C T 0.0018250 2 1096 UPSTREAM MODIFIER CODING SAMD11 1:874415 21
chr bp ID REF ALT AF AC AN EFFECT IMPACT FUNCLASS CODING GENE pos effRank
As you can see, the values in the many of the rows are repeats, for some of the columns. What I want to do is remove the duplicated rows, based on the value (the min) of the effRank variable. I have set the key to be chr, bp, and effRank. So the table should be sorted on the basis of those three columns. I got kind of close. The following command returns the rows that I want, but does not return all columns, which I want.
> test[,min(effRank), by=pos]
pos V1
1: 1:860416 21
2: 1:860461 21
3: 1:860521 21
4: 1:861261 21
5: 1:861332 11
6: 1:865455 21
7: 1:865628 11
8: 1:866437 14
9: 1:866461 14
10: 1:866511 21
11: 1:871057 21
12: 1:871215 14
13: 1:871287 21
14: 1:871334 21
15: 1:874415 21
All I need is a way to make the above command return all columns in the data.table, not just the ones mentioned in the expressions. Otherwise, works perfectly. Any help is appreciated. The output of dput is below, for those that with to make their own example.
Cheers,
Davy
> dput(test)
structure(list(chr = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), bp = c(860416L, 860416L, 860416L,
860461L, 860461L, 860461L, 860521L, 860521L, 860521L, 861261L,
861261L, 861332L, 861332L, 861332L, 861332L, 861332L, 865455L,
865628L, 865628L, 865628L, 865628L, 865628L, 866437L, 866437L,
866437L, 866437L, 866461L, 866461L, 866461L, 866461L, 866511L,
871057L, 871057L, 871057L, 871215L, 871215L, 871215L, 871215L,
871215L, 871215L, 871287L, 871287L, 871287L, 871287L, 871334L,
871334L, 871334L, 871334L, 874415L, 874415L), ID = structure(c(10L,
10L, 10L, 8L, 8L, 8L, 9L, 9L, 9L, 3L, 3L, 1L, 1L, 1L, 1L, 1L,
1L, 7L, 7L, 7L, 7L, 7L, 2L, 2L, 2L, 2L, 4L, 4L, 4L, 4L, 11L,
1L, 1L, 1L, 5L, 5L, 5L, 5L, 5L, 5L, 1L, 1L, 1L, 1L, 6L, 6L, 6L,
6L, 12L, 12L), .Label = c("", "rs139076934", "rs144896029", "rs148884928",
"rs28419423", "rs4072383", "rs41285790", "rs57465118", "rs57924093",
"rs61464428", "rs71576583", "rs74047412"), class = "factor"),
REF = structure(c(3L, 3L, 3L, 3L, 3L, 3L, 1L, 1L, 1L, 3L,
3L, 3L, 3L, 3L, 3L, 3L, 1L, 3L, 3L, 3L, 3L, 3L, 1L, 1L, 1L,
1L, 3L, 3L, 3L, 3L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 3L, 3L, 3L, 3L, 1L, 1L), .Label = c("C",
"CCCCT", "G"), class = "factor"), ALT = structure(c(1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 3L,
1L, 1L, 1L, 1L, 1L, 4L, 4L, 4L, 4L, 1L, 1L, 1L, 1L, 2L, 4L,
4L, 4L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 4L, 4L, 4L,
4L, 4L, 4L), .Label = c("A", "CCCCTCCCT", "G", "T"), class = "factor"),
AF = c(0.5, 0.5, 0.5, 1, 1, 1, 0.984, 0.984, 0.984, 0.002727,
0.002727, 0.0009074, 0.0009074, 0.0009074, 0.0009074, 0.0009074,
0.003319, 0.002778, 0.002778, 0.002778, 0.002778, 0.002778,
0.0009074, 0.0009074, 0.0009074, 0.0009074, 0.0009074, 0.0009074,
0.0009074, 0.0009074, 1, 0.0009074, 0.0009074, 0.0009074,
0.00363, 0.00363, 0.00363, 0.00363, 0.00363, 0.00363, 0.0009107,
0.0009107, 0.0009107, 0.0009107, 0.668, 0.668, 0.668, 0.668,
0.001825, 0.001825), AC = c(14L, 14L, 14L, 62L, 62L, 62L,
61L, 61L, 61L, 3L, 3L, 1L, 1L, 1L, 1L, 1L, 3L, 3L, 3L, 3L,
3L, 3L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 148L, 1L, 1L, 1L,
4L, 4L, 4L, 4L, 4L, 4L, 1L, 1L, 1L, 1L, 474L, 474L, 474L,
474L, 2L, 2L), AN = c(28L, 28L, 28L, 62L, 62L, 62L, 62L,
62L, 62L, 1100L, 1100L, 1102L, 1102L, 1102L, 1102L, 1102L,
904L, 1080L, 1080L, 1080L, 1080L, 1080L, 1102L, 1102L, 1102L,
1102L, 1102L, 1102L, 1102L, 1102L, 148L, 1102L, 1102L, 1102L,
1102L, 1102L, 1102L, 1102L, 1102L, 1102L, 1098L, 1098L, 1098L,
1098L, 710L, 710L, 710L, 710L, 1096L, 1096L), EFFECT = structure(c(4L,
4L, 1L, 4L, 4L, 1L, 4L, 4L, 1L, 4L, 1L, 2L, 2L, 2L, 2L, 4L,
4L, 2L, 2L, 2L, 3L, 4L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 4L, 4L,
4L, 4L, 4L, 3L, 3L, 4L, 4L, 4L, 1L, 4L, 4L, 4L, 1L, 4L, 4L,
4L, 1L, 4L, 4L), .Label = c("DOWNSTREAM", "NON_SYNONYMOUS_CODING",
"SYNONYMOUS_CODING", "UPSTREAM"), class = "factor"), IMPACT = structure(c(3L,
3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 2L, 2L, 2L, 2L, 3L,
3L, 2L, 2L, 2L, 1L, 3L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 3L, 3L,
3L, 3L, 3L, 1L, 1L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L,
3L, 3L, 3L, 3L), .Label = c("LOW", "MODERATE", "MODIFIER"
), class = "factor"), FUNCLASS = structure(c(1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 1L, 1L, 2L,
2L, 2L, 3L, 1L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 1L, 1L, 1L, 1L,
1L, 3L, 3L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L), .Label = c("", "MISSENSE", "SILENT"), class = "factor"),
CODING = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = "CODING", class = "factor"),
GENE = structure(c(2L, 2L, 1L, 2L, 2L, 1L, 2L, 2L, 1L, 2L,
1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 1L, 2L, 1L, 2L, 2L,
2L, 2L, 2L, 2L, 1L, 1L, 2L, 1L, 2L, 2L, 2L, 2L, 2L, 1L, 2L,
2L, 2L, 1L, 2L, 2L, 2L, 1L, 2L, 2L, 2L), .Label = c("AL645608.1",
"SAMD11"), class = "factor"), pos = structure(c(1L, 1L, 1L,
2L, 2L, 2L, 3L, 3L, 3L, 4L, 4L, 5L, 5L, 5L, 5L, 5L, 6L, 7L,
7L, 7L, 7L, 7L, 8L, 8L, 8L, 8L, 9L, 9L, 9L, 9L, 10L, 11L,
11L, 11L, 12L, 12L, 12L, 12L, 12L, 12L, 13L, 13L, 13L, 13L,
14L, 14L, 14L, 14L, 15L, 15L), .Label = c("1:860416", "1:860461",
"1:860521", "1:861261", "1:861332", "1:865455", "1:865628",
"1:866437", "1:866461", "1:866511", "1:871057", "1:871215",
"1:871287", "1:871334", "1:874415"), class = "factor"), effRank = c(21L,
21L, 22L, 21L, 21L, 22L, 21L, 21L, 22L, 21L, 22L, 11L, 11L,
11L, 11L, 21L, 21L, 11L, 11L, 11L, 14L, 21L, 14L, 14L, 14L,
14L, 14L, 14L, 14L, 21L, 21L, 21L, 21L, 21L, 14L, 14L, 21L,
21L, 21L, 22L, 21L, 21L, 21L, 22L, 21L, 21L, 21L, 22L, 21L,
21L)), .Names = c("chr", "bp", "ID", "REF", "ALT", "AF",
"AC", "AN", "EFFECT", "IMPACT", "FUNCLASS", "CODING", "GENE",
"pos", "effRank"), row.names = c(NA, -50L), class = c("data.table",
"data.frame"), .internal.selfref = <pointer: 0x0000000004260788>, sorted = c("chr",
"bp", "effRank"))

You can use the internal variable .I, which gives the row number. Then subset using those values, as follows:
DT[DT[, .I[which.min(effRank)], pos]$V1]
It's easier to understand if you write it in two lines as follows:
tmp <- DT[, .I[which.min(effRank)], pos]
DT[tmp$V1]
The first line generates a column V1 with all the row numbers of the minimum positions (from your j expression) grouped by pos. Then you just subset them.

Related

Divide Items based on their available stock to other ID

Hi Everyone i am facing a unique problem . I want to find out Transfer of Inventory based on condition if Quantity required for a particular Item id is more than Stock on hand. we should transfer the inventory from other ID.
For Example. item I60 is available for 7 IDs. For E1, E6 available stock is less than quantity so what i want to do is Transfer the excess inventory from E3 (i.e 6-2 =4) to E1 and E6. So transfer for E1 will be 1 and E6 will be 2 and remaining SOH from E3 will be 3. I hope everyone can understand it.
structure(list(ID = structure(c(1L, 6L, 7L, 3L, 5L, 2L, 4L, 8L,
1L, 7L, 3L, 5L, 2L, 9L, 8L, 1L, 7L, 3L, 5L, 2L, 9L, 8L, 1L, 7L,
3L, 5L, 2L, 9L, 8L, 1L, 7L, 3L, 5L, 2L, 9L, 8L, 1L, 7L, 3L, 5L,
2L, 9L, 8L, 1L, 7L, 3L, 5L, 2L, 9L), .Label = c("E1", "E2", "E3",
"E4", "E5", "E6", "E7", "E8", "E9"), class = "factor"), Item.Code = structure(c(1L,
1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 3L, 3L, 3L,
3L, 3L, 3L, 3L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 5L, 5L, 5L, 5L, 5L,
5L, 5L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 7L, 7L, 7L, 7L, 7L, 7L, 7L
), .Label = c("I60", "I67", "I68", "I69", "I70", "I71", "I72"
), class = "factor"), Quantity = c(2L, 2L, 2L, 2L, 2L, 2L, 2L,
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 4L, 2L, 2L, 2L,
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L,
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L), Stock_on_hand = c(1L,
0L, 2L, 6L, 2L, 2L, 2L, 0L, 6L, 3L, -1L, 1L, 2L, 9L, 1L, 5L,
-1L, 9L, 3L, 38L, 5L, 10L, 2L, 3L, 2L, 2L, 1L, 8L, 0L, 2L, 2L,
4L, 2L, 1L, 5L, 1L, -1L, 4L, 3L, 1L, 2L, 11L, 1L, 2L, 0L, 3L,
1L, 4L, 1L), Transfer = c(0L, 0L, 0L, 0L, 0L, 0L, 0L, 4L, 0L,
0L, 0L, 0L, 0L, 0L, 3L, 0L, 7L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L,
0L, 6L, 0L, 0L, 0L, 0L, 0L, 0L, 3L, 0L, 0L, 2L, 0L, 0L, 0L, 0L,
0L, 0L, 0L, 1L, 0L, 2L, 0L, 0L)), class = "data.frame", row.names = c(NA,
-49L))
highlighted columns should be generated with R code

Looks like an integer programming problem with Updated_SOH as the variable and objective function minimizing absolute difference between Quantity and SOH subject to constraint that total sum of SOH stays constant.
Here is a heuristic approach to solve this optimization problem:
1) Calculate the difference which is to be used to sort the dataset.
2) In a similar approach as here but shifting positive values and different aggregation, we use these excess SOH to net off SOH deficit in previous rows.
3) The final output is the sum of i) existing Quantity, ii) any unfulfilled Quantity and iii) excess SOH.
setDT(df)
df[, Diff := Stock_on_hand - Quantity]
setorder(df, Item.Code, Diff)
df[, Updated_SOH := {
posVal <- replace(Diff, Diff<0, 0)
negVal <- replace(Diff, Diff>0, 0)
n <- 1L
while (any(negVal < 0) && n < .N) {
negVal <- replace(negVal, negVal>0, 0) +
shift(posVal, 1L, type="lead", fill=0) +
c(posVal[1L], rep(0, .N-1L)) #for case where there are more Quantity than SOH
posVal <- replace(negVal, negVal<0, 0)
n <- n + 1L
}
excess <- negVal[negVal > 0]
Quantity + #existing Quantity
replace(negVal, negVal>0, 0) + #unfulfilled Quantity
c(rep(0, .N - length(excess)), excess) #shifting back down excess SOH
}, by=.(Item.Code)]
output:
ID Item.Code Quantity Stock_on_hand Transfer Diff Updated_SOH
1: E6 I60 2 0 0 -2 2
2: E1 I60 2 1 0 -1 2
3: E7 I60 2 2 0 0 2
4: E5 I60 2 2 0 0 2
5: E2 I60 2 2 0 0 2
6: E4 I60 2 2 0 0 2
7: E3 I60 2 6 0 4 3
8: E3 I67 2 -1 0 -3 2
9: E8 I67 2 0 4 -2 2
10: E5 I67 2 1 0 -1 2
11: E2 I67 2 2 0 0 2
12: E7 I67 2 3 0 1 2
13: E1 I67 2 6 0 4 2
14: E9 I67 2 9 0 7 8
15: E7 I68 2 -1 7 -3 2
16: E8 I68 2 1 3 -1 2
17: E5 I68 2 3 0 1 2
18: E1 I68 2 5 0 3 2
19: E9 I68 2 5 0 3 5
20: E3 I68 2 9 0 7 9
21: E2 I68 4 38 0 34 38
22: E2 I69 2 1 6 -1 2
23: E1 I69 2 2 0 0 2
24: E3 I69 2 2 0 0 2
25: E5 I69 2 2 0 0 2
26: E7 I69 2 3 0 1 2
27: E9 I69 2 8 0 6 8
28: E8 I69 2 10 0 8 10
29: E8 I70 2 0 0 -2 2
30: E2 I70 2 1 3 -1 2
31: E1 I70 2 2 0 0 2
32: E7 I70 2 2 0 0 2
33: E5 I70 2 2 0 0 2
34: E3 I70 2 4 0 2 2
35: E9 I70 2 5 0 3 4
36: E1 I71 2 -1 2 -3 2
37: E8 I71 2 1 0 -1 2
38: E5 I71 2 1 0 -1 2
39: E2 I71 2 2 0 0 2
40: E3 I71 2 3 0 1 2
41: E7 I71 2 4 0 2 2
42: E9 I71 2 11 0 9 9
43: E7 I72 2 0 1 -2 0
44: E8 I72 2 1 0 -1 2
45: E5 I72 2 1 2 -1 2
46: E9 I72 2 1 0 -1 2
47: E1 I72 2 2 0 0 2
48: E3 I72 2 3 0 1 2
49: E2 I72 2 4 0 2 2
ID Item.Code Quantity Stock_on_hand Transfer Diff Updated_SOH
data:
library(data.table)
df <- structure(list(ID = structure(c(1L, 6L, 7L, 3L, 5L, 2L, 4L, 8L,
1L, 7L, 3L, 5L, 2L, 9L, 8L, 1L, 7L, 3L, 5L, 2L, 9L, 8L, 1L, 7L,
3L, 5L, 2L, 9L, 8L, 1L, 7L, 3L, 5L, 2L, 9L, 8L, 1L, 7L, 3L, 5L,
2L, 9L, 8L, 1L, 7L, 3L, 5L, 2L, 9L), .Label = c("E1", "E2", "E3",
"E4", "E5", "E6", "E7", "E8", "E9"), class = "factor"), Item.Code = structure(c(1L,
1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 3L, 3L, 3L,
3L, 3L, 3L, 3L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 5L, 5L, 5L, 5L, 5L,
5L, 5L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 7L, 7L, 7L, 7L, 7L, 7L, 7L
), .Label = c("I60", "I67", "I68", "I69", "I70", "I71", "I72"
), class = "factor"), Quantity = c(2L, 2L, 2L, 2L, 2L, 2L, 2L,
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 4L, 2L, 2L, 2L,
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L,
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L), Stock_on_hand = c(1L,
0L, 2L, 6L, 2L, 2L, 2L, 0L, 6L, 3L, -1L, 1L, 2L, 9L, 1L, 5L,
-1L, 9L, 3L, 38L, 5L, 10L, 2L, 3L, 2L, 2L, 1L, 8L, 0L, 2L, 2L,
4L, 2L, 1L, 5L, 1L, -1L, 4L, 3L, 1L, 2L, 11L, 1L, 2L, 0L, 3L,
1L, 4L, 1L), Transfer = c(0L, 0L, 0L, 0L, 0L, 0L, 0L, 4L, 0L,
0L, 0L, 0L, 0L, 0L, 3L, 0L, 7L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L,
0L, 6L, 0L, 0L, 0L, 0L, 0L, 0L, 3L, 0L, 0L, 2L, 0L, 0L, 0L, 0L,
0L, 0L, 0L, 1L, 0L, 2L, 0L, 0L)), class = "data.frame", row.names = c(NA,
-49L))

Large data set cleaning: How to fill in missing data based on multiple categories and searching by row order

This is my first StackOverflow post, so I hope that it isn't too difficult to understand.
I have a large dataset (~14,000) rows of bird observations. These data were collected by standing in one place (point) and counting birds that you see within 3 minutes. Within each point-count a new bird observation becomes a new row, so that there are many repeated dates, times, sites, and point (specific location within a site). Again, each point count is 3 minutes long. So if you see a yellow warbler (coded as YEWA) during minute 1, then it will be associated with MINUTE=1 for that specific point count (date, site, point, and time). ID=observer intials and Number=number of birds spotted (not necessarily important here).
However, if NO BIRDS are seen, then a "NOBI" goes into the dataset for that specific minute. Thus, if there are NOBI for an entire 3 minute point count, their will be three rows with the same date, site, point, and time, and "NOBI" in the "BIRD" column for each of the three rows.
So I have TWO main problems. The first is that some observers entered "NOBI" once for all three minutes, instead of three times (once per minute). Anywhere where "MINUTE"
has been left blank (becoming NA), AND "BIRD"="NOBI", I need to add three rows of data, all with the same values for all columns except "MINUTE", which should be 1, 2, and 3 for the respective rows.
If it looks like this:
ID DATE SITE POINT TIME MINUTE BIRD NUMBER
1 BS 5/9/2018 CW2 U125 7:51 NA NOBI NA
2 BS 5/9/2018 CW1 D250 8:12 1 YEWA 2
3 BS 5/9/2018 CW1 D250 8:12 2 NOBI NA
4 BS 5/9/2018 CW1 D250 8:12 3 LABU 1
It should look like this instead:
ID DATE SITE POINT TIME MINUTE BIRD NUMBER
1 BS 5/9/2018 CW2 U125 7:51 1 NOBI NA
2 BS 5/9/2018 CW2 U125 7:51 2 NOBI NA
3 BS 5/9/2018 CW2 U125 7:51 3 NOBI NA
4 BS 5/9/2018 CW1 D250 8:12 1 YEWA 2
5 BS 5/9/2018 CW1 D250 8:12 2 NOBI NA
6 BS 5/9/2018 CW1 D250 8:12 3 LABU 1
note: If you are wanting to enter some of this data into your R console, I included some at the end of this post using dput, which should be easier to enter than copy-and-pasting the above
I have made failed attempts at reproducing if statements with multiple conditions (based on:
R multiple conditions in if statement & Ifelse in R with multiple categorical conditions) I tried writing this many ways, including with piping from dplyr, but see below for one example of some code, notes, and error messages.
>if(PC$BIRD == "NOBI" & PC$MINUTE==NA){PC$Fix<-TRUE}
Error in if (PC$BIRD == "NOBI" & PC$MINUTE == NA) { :
missing value where TRUE/FALSE needed
In addition: Warning message:
In if (PC$BIRD == "NOBI" & PC$MINUTE == NA) { :
the condition has length > 1 and only the first element will be used
## Then I need to do something like this:
>if(PC$Fix<-TRUE){duplicate(row where Fix==TRUE, times=2)} #I know this isn't
### even close, but I want the row to be replicated two more times so
### that there are 3 total rows witht he same values
### Fix indicates that a fix is needed in this example
# Then somehow I need to assign a 1 to PC$MINUTE for the first row (original row),
# a 2 to the next row (with other values from other columns being the same), and a 3
# to the last duplicated row (still other values from other columns being the same)
The second problem, which seems more difficult to me is to search the dataset in order or perhaps by DATE,SITE,POINT, and TIME in some way. The minute values should always go from 1... to 2... to 3, and then back to 1 for the next set of date, time, site, and point. That is, each point count should have all values 1:3. However, one count may have multiple sightings in MINUTE=1 so that there are 5 or 6 (or 20) MINUTE=1 before MINUTE=2. BUT, again, some observers in this dataset simply left a row out when there was NO BIRDS (NOBI), instead of writing a row with BIRD="NOBI" for each MINUTE. That is if the dataset goes:
ID DATE SITE POINT TIME MINUTE BIRD NUMBER
...
4 BS 5/9/2018 CW2 U125 7:54 1 AMRO 1
5 BS 5/9/2018 CW2 U125 7:54 1 SPTO 1
6 BS 5/9/2018 CW2 U125 7:57 1 AMRO 1
7 BS 5/9/2018 CW2 U125 7:57 1 SPTO 1
8 BS 5/9/2018 CW2 U125 7:57 1 AMCR 3
9 BS 5/9/2018 CW2 U125 7:57 2 SPTO 1
10 BS 5/9/2018 CW2 U125 7:57 2 HOWR 1
11 BS 5/9/2018 CW2 U125 7:57 3 UNBI 1
We can see that the 7:57 point count time is complete (there are MINUTE values of 1:3). However, the 7:54 point count time stops at MINUTE=1. Meaning, I need to enter two more rows underneath that have all of the same DATE,SITE,POINT,TIME information, but with MINUTE=2 and BIRD="NOBI" for the first added row and MINUTE=3 and BIRD="NOBI" for the second added row. So it should look like this:
ID DATE SITE POINT TIME MINUTE BIRD NUMBER
...
4 BS 5/9/2018 CW2 U125 7:54 1 AMRO 1
5 BS 5/9/2018 CW2 U125 7:54 1 SPTO 1
6 BS 5/9/2018 CW2 U125 7:54 2 NOBI NA
7 BS 5/9/2018 CW2 U125 7:54 3 NOBI NA
8 BS 5/9/2018 CW2 U125 7:57 1 AMRO 1
9 BS 5/9/2018 CW2 U125 7:57 1 SPTO 1
10 BS 5/9/2018 CW2 U125 7:57 1 AMCR 3
11 BS 5/9/2018 CW2 U125 7:57 2 SPTO 1
12 BS 5/9/2018 CW2 U125 7:57 2 HOWR 1
13 BS 5/9/2018 CW2 U125 7:57 3 UNBI 1
Lastly, I understand that this is a long and complicated question, and I may not have articulated it very well. Please let me know if there is any clarification needed, and I would be happy to hear any advice, even if it doesn't fully answer my problems. Thank you in advance!
Everything below this line is only useful for you if you want to enter a sample of my data into R
To enter my data into R console, copy and paste everything from "structure" function to end of code to enter it as dataframe in R console with code: dataframe<-structure(list...
See Example of using dput() for help.
PC<-read.csv("PC.csv") ### ORIGINAL FILE
dput(PC)
structure(list(ID = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = "BS", class = "factor"),
DATE = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = "5/9/2018", class = "factor"),
SITE = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = "CW2", class = "factor"),
POINT = structure(c(2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L,
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = c("M", "U125"), class = "factor"),
TIME = structure(c(8L, 8L, 8L, 9L, 9L, 10L, 10L, 10L, 10L,
10L, 10L, 11L, 1L, 1L, 1L, 2L, 2L, 2L, 3L, 3L, 4L, 4L, 4L,
4L, 5L, 5L, 5L, 5L, 5L, 5L, 6L, 7L), .Label = c("6:48", "6:51",
"6:54", "6:57", "7:12", "7:15", "7:18", "7:51", "7:54", "7:57",
"8:00"), class = "factor"), MINUTE = c(1L, 2L, 3L, 1L, 1L,
1L, 1L, 1L, 2L, 2L, 3L, 1L, 1L, 2L, 3L, 1L, 2L, 3L, 1L, 1L,
1L, 1L, 2L, 3L, 1L, 1L, 1L, 2L, 3L, 3L, NA, NA), BIRD = structure(c(6L,
6L, 6L, 2L, 7L, 2L, 7L, 1L, 7L, 5L, 8L, 8L, 6L, 6L, 6L, 6L,
6L, 6L, 7L, 7L, 7L, 7L, 6L, 8L, 3L, 7L, 9L, 5L, 4L, 2L, 6L,
6L), .Label = c("AMCR", "AMRO", "BRSP", "DUFL", "HOWR", "NOBI",
"SPTO", "UNBI", "VESP"), class = "factor"), NUMBER = c(NA,
NA, NA, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, NA, NA, NA, NA,
NA, NA, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, NA,
NA)), class = "data.frame", row.names = c(NA, -32L))
PCc<-read.csv("PC_Corrected.csv") #### WHAT I NEED MY DATABASE TO LOOK LIKE
dput(PCc)
structure(list(ID = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L), .Label = "BS", class = "factor"), DATE = structure(c(1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = "5/9/2018", class = "factor"),
SITE = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L), .Label = "CW2", class = "factor"), POINT = structure(c(2L,
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L,
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = c("M",
"U125"), class = "factor"), TIME = structure(c(8L, 8L, 8L,
9L, 9L, 9L, 9L, 10L, 10L, 10L, 10L, 10L, 10L, 11L, 11L, 11L,
1L, 1L, 1L, 2L, 2L, 2L, 3L, 3L, 3L, 3L, 4L, 4L, 4L, 4L, 5L,
5L, 5L, 5L, 5L, 5L, 6L, 6L, 6L, 7L, 7L, 7L), .Label = c("6:48",
"6:51", "6:54", "6:57", "7:12", "7:15", "7:18", "7:51", "7:54",
"7:57", "8:00"), class = "factor"), MINUTE = c(1L, 2L, 3L,
1L, 1L, 2L, 3L, 1L, 1L, 1L, 2L, 2L, 3L, 1L, 2L, 3L, 1L, 2L,
3L, 1L, 2L, 3L, 1L, 1L, 2L, 3L, 1L, 1L, 2L, 3L, 1L, 1L, 1L,
2L, 3L, 3L, 1L, 2L, 3L, 1L, 2L, 3L), BIRD = structure(c(6L,
6L, 6L, 2L, 7L, 6L, 6L, 2L, 7L, 1L, 7L, 5L, 8L, 8L, 6L, 6L,
6L, 6L, 6L, 6L, 6L, 6L, 7L, 7L, 6L, 6L, 7L, 7L, 6L, 8L, 3L,
7L, 9L, 5L, 4L, 2L, 6L, 6L, 6L, 6L, 6L, 6L), .Label = c("AMCR",
"AMRO", "BRSP", "DUFL", "HOWR", "NOBI", "SPTO", "UNBI", "VESP"
), class = "factor"), NUMBER = c(NA, NA, NA, 1L, 1L, NA,
NA, 1L, 1L, 1L, 1L, 1L, 1L, 1L, NA, NA, NA, NA, NA, NA, NA,
NA, 1L, 1L, NA, NA, 1L, 1L, NA, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
NA, NA, NA, NA, NA, NA)), class = "data.frame", row.names = c(NA,
-42L))

Here's a way to do it using dplyr and tidyr from the tidyverse meta-package.
# Step one - identify missing rows.
# For each DATE, SITE, POINT, TIME, count how many of each minute
library(tidyverse)
# Convert factors to character to make later joining simpler,
# and fix missing ID's by assuming prior line should be used,
# and make NOBI rows have a count of NA
PC_2_clean <- PC %>%
mutate_if(is.factor, as.character) %>%
fill(ID, .direction = "up") %>%
mutate(NUMBER = if_else(BIRD == "NOBI", NA_integer_, NUMBER))
# Create a wide table with spots for each minute. Missing will
# show up as NA's
# All the NA's here in the 1, 2, and 3 columns represent
# missing minutes that we should add.
PC_3_NA_find <- PC_2_clean %>%
count(ID, DATE, SITE, POINT, TIME, MINUTE) %>%
spread(MINUTE, n)
PC_3_NA_find
# A tibble: 11 x 9
# ID DATE SITE POINT TIME `1` `2` `3` `<NA>`
# <chr> <chr> <chr> <chr> <chr> <int> <int> <int> <int>
# 1 BS 5/9/2018 CW2 M 7:12 3 1 2 NA
# 2 BS 5/9/2018 CW2 M 7:15 NA NA NA 1
# 3 BS 5/9/2018 CW2 M 7:18 NA NA NA 1
# 4 BS 5/9/2018 CW2 U125 6:48 1 1 1 NA
# 5 BS 5/9/2018 CW2 U125 6:51 1 1 1 NA
# 6 BS 5/9/2018 CW2 U125 6:54 2 NA NA NA
# 7 BS 5/9/2018 CW2 U125 6:57 2 1 1 NA
# 8 BS 5/9/2018 CW2 U125 7:51 1 1 1 NA
# 9 BS 5/9/2018 CW2 U125 7:54 2 NA NA NA
# 10 BS 5/9/2018 CW2 U125 7:57 3 2 1 NA
# 11 BS 5/9/2018 CW2 U125 8:00 1 NA NA NA
# Take the NA minute entries and make the desired line for each
PC_4_rows_to_add <- PC_3_NA_find %>%
gather(MINUTE, count, `1`:`3`) %>%
filter(is.na(count)) %>%
select(-count, -`<NA>`) %>%
mutate(MINUTE = as.integer(MINUTE),
BIRD = "NOBI",
NUMBER = NA_integer_)
# Add these lines to the original, remove the NA minute rows
# (these have been replaced with minute rows), and sort
PC_5_with_NOBIs <- PC_2_clean %>%
bind_rows(PC_4_rows_to_add) %>%
filter(MINUTE != "NA") %>%
arrange(ID, DATE, SITE, POINT, TIME, MINUTE, BIRD)
# Check result
PC_5_with_NOBIs %>%
count(ID, DATE, SITE, POINT, TIME, MINUTE) %>%
spread(MINUTE, n)
PC_5_with_NOBIs
# Now to confirm it matches your desired output.
# Note, I convert to character to avoid mismatches between factors
PCc_char <- PCc %>%
mutate_if(is.factor, as.character) %>%
arrange(ID, DATE, SITE, POINT, TIME, MINUTE, BIRD)
identical(PC_5_with_NOBIs, PCc_char)
# [1] TRUE

How can I identify specific string row number

My data looks like this
df<- structure(list(Main = structure(c(5L, 3L, 1L, 2L, 4L, 4L, 2L,
1L, 5L, 2L, 5L, 4L, 5L, 2L), .Label = c("IsMainbody", "IsMainbodyCandidate",
"IsMainbodyRejected", "Main", "None"), class = "factor"), Group.IDs = c(52L,
NA, 2L, 12L, 38L, 38L, 6L, 3L, NA, 49L, 20L, 38L, 54L, 85L),
X..Number1 = c(12L, 6L, 1L, 5L, 1L, 1L, 1L, 1L, 17L, 1L,
4L, 1L, 1L, 4L), X..No = c(20L, 62L, 2L, 16L, 3L, 3L, 1L,
3L, 32L, 3L, 36L, 3L, 1L, 20L), X..Unique.N = c(0L, 0L, 1L,
0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L)), class = "data.frame", row.names = c(NA,
-14L))
I am trying to find the row number of for specific strings.
Based on main column, I want to find this how many of my sample has "Main" , how many have "IsmainbodyCandidate" and how many are "IsMainbodyRejected"
Then I want to make a new dataset that only consists of Main and Ismainbody and Ismainbodycandidates like below .
Main Group IDs # Number1 # No # Unique N
IsMainbody. 2 1 2 1
IsMainbodyCandidate 12 5 16 0
Main 38 1 3 0
Main 38 1 3 0
IsMainbodyCandidate 6 1 1 0
IsMainbody 3 1 3 0
IsMainbodyCandidate 49 1 3 0
IsMainbodyCandidate 85 4 20 0

# count by main
table(df$Main)
# new dataframe without "None"
df[df$Main != "None", ]
# or more explicitly
df[df$Main %in% c("Main", "IsMainbody", "IsMainbodyCandidate"), ]

window function is not working as expected

I have a monthly time series - monthlyTs:
monthlyTs <- ts(all.xts , frequency = 12, start=decimal_date(ymd("2012-01-29")))
head(index(monthlyTs))
1 "2012-01-29 00:00:00 UTC" "2012-02-26 01:22:47 UTC" "2012-03-25
02:45:35 UTC" "2012-04-29 04:29:04 UTC"
[5] "2012-05-27 05:51:52 UTC" "2012-06-24 07:14:39 UTC"
I want to apply a time windows that starts from 2013:
head(window(monthly, start = 2013))
2012-01-29 00:00:00 2
2012-02-26 01:22:47 8 2012-03-25 02:45:35 6 2012-04-29 04:29:04
5 2012-05-27 05:51:52 4 2012-06-24 07:14:39 4
So looks like window function is not filtering as expected. What is wrong?
Fully reproducible example as requested:
christmas.csv - tiny CSV file (google trends for 'Christmas' request)
#Reading data from the csv. Format - [week start date], [views per week]
data = read.csv('christmas.csv', sep=",", header = FALSE, skip = 3,col.names = c("Week","Views"))[[2]]
# creating time series
myTs <- ts(data[[2]], freq=365.25/7, start=decimal_date(ymd("2012-01-29")))
#converting from weekly to month time series
all.xts <- xts(myTs, date_decimal(index(myTs)))
monthlyTs <- ts(all.xts , frequency = 12, start=decimal_date(ymd("2012-01-29")))
head(window(monthlyTs, start = 2013))
2012-01-29 00:00:00 2
2012-02-26 01:22:47 8 2012-03-25 02:45:35 6 2012-04-29 04:29:04 5
2012-05-27 05:51:52 4 2012-06-24 07:14:39 4

There are two problems :
the object all.xts is a weekly and not a monthly time
The value your pass for the argument frequency is not correct
For the second point, try to change the value you pass for the argument start in your call of the function ts with
c(lubridate::year("2012-01-29"), lubridate::month("2012-01-29"))
and change the frequency to value 12. i.e use the line :
ts(all.xts , frequency = 12, start = c(lubridate::year("2012-01-29"), lubridate::month("2012-01-29")) )
Using the output from dput, your code rewrite as follow :
data <- c(2L, 2L, 2L, 2L, 2L, 2L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L,
2L, 3L, 3L, 3L, 4L, 5L, 5L, 6L, 8L, 11L, 16L, 22L, 33L, 42L,
45L, 55L, 64L, 8L, 4L, 2L, 2L, 2L, 2L, 2L, 2L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 3L, 3L, 3L, 4L, 4L, 5L, 6L, 8L,
12L, 16L, 21L, 27L, 43L, 47L, 56L, 79L, 10L, 5L, 2L, 2L, 2L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 3L,
3L, 3L, 4L, 5L, 5L, 6L, 8L, 12L, 17L, 21L, 27L, 43L, 47L, 53L,
87L, 12L, 5L, 2L, 2L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L,
2L, 2L, 2L, 2L, 2L, 2L, 3L, 3L, 3L, 4L, 5L, 6L, 6L, 8L, 13L,
17L, 20L, 27L, 44L, 50L, 54L, 100L, 15L, 6L, 3L, 2L, 2L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 3L, 3L,
3L, 4L, 5L, 5L, 6L, 8L, 11L, 16L, 21L, 29L, 43L, 48L, 53L, 80L,
46L, 8L, 3L, 2L)
myTs <- ts(data, freq=365.25/7, start=decimal_date(ymd("2012-01-29")))
all.xts <- xts::xts(myTs, date_decimal(index(myTs)))
monthlyTs <- ts(all.xts , frequency = 12, start = c(lubridate::year("2012-01-29"), lubridate::month("2012-01-29")) )
window(monthlyTs, start= c(2013))
The last line will print :
> window(monthlyTs, start= c(2013))
Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
2013 1 1 1 1 1 1 1 1 1 1 1 1
2014 1 1 1 1 2 2 2 2 3 3 3 4
2015 5 5 6 8 11 16 22 33 42 45 55 64
2016 8 4 2 2 2 2 2 2 1 1 1 1
2017 1 1 1 1 1 1 1 1 1 1 1 1
2018 1 1 1 1 1 1 1 2 2 2 2 2
2019 3 3 3 4 4 5 6 8 12 16 21 27
2020 43 47 56 79 10 5 2 2 2 1 1 1
2021 1 1 1 1 1 1 1 1 1 1 1 1
2022 1 1 1 1 1 1 1 1 1 1 2 2
2023 2 2 2 2 3 3 3 4 5 5 6 8
2024 12 17 21 27 43 47 53 87 12 5 2 2
2025 2 1 1 1 1 1 1 1 1 1 1 1
2026 1 1 1 1 1 1 1 1 1 1 1 1
2027 1 2 2 2 2 2 2 2 3 3 3 4
2028 5 6 6 8 13 17 20 27 44 50 54 100
2029 15 6 3 2 2 1 1 1 1 1 1 1
2030 1 1 1 1 1 1 1 1 1 1 1 1
2031 1 1 1 1 1 1 2 2 2 2 2 2
2032 3 3 3 4 5 5 6 8 11 16 21 29
2033 43 48 53 80 46 8 3 2

sum and conditionally count based on a second column

I have gotten frustrated trying to solve this seemingly simple problem. I have a dataset (df) like this:
structure(list(Year = c(2015L, 2015L, 2015L, 2015L, 2015L, 2015L,
2015L, 2015L, 2015L, 2015L, 2015L, 2015L, 2015L, 2015L, 2015L,
2015L, 2015L, 2015L, 2015L, 2015L), Unknown = c(1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L
), Temp = c(21L, 21L, 21L, 23L, 23L, 21L, 21L, 22L, 21L, 23L,
23L, 22L, 21L, 21L, 22L, 22L, 21L, 21L, 23L, 23L), Obs = structure(c(1L,
1L, 2L, 1L, 1L, 1L, 1L, 2L, 1L, 1L, 1L, 2L, 1L, 1L, 2L, 2L, 1L,
1L, 2L, 2L), .Label = c("mdk", "sde"), class = "factor"), State = structure(c(1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L), .Label = "ma", class = "factor"), Zone = c(2L, 2L,
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L,
2L, 2L), Segment = c(8L, 7L, 4L, 17L, 18L, 7L, 2L, 12L, 1L, 17L,
18L, 12L, 9L, 7L, 13L, 11L, 8L, 9L, 17L, 18L), Subseg = c(1L,
3L, 3L, 2L, 2L, 2L, 4L, 0L, 10L, 4L, 2L, 0L, 1L, 1L, 3L, 1L,
2L, 2L, 1L, 1L), Wdir = structure(c(2L, 2L, 1L, 3L, 3L, 2L, 2L,
1L, 2L, 3L, 3L, 1L, 2L, 2L, 1L, 1L, 2L, 2L, 1L, 1L), .Label = c("na",
"ne", "nw"), class = "factor"), Wvel = structure(c(1L, 1L, 2L,
1L, 1L, 1L, 1L, 2L, 1L, 1L, 1L, 2L, 1L, 1L, 2L, 2L, 1L, 1L, 2L,
2L), .Label = c("5", "na"), class = "factor"), Clouds = structure(c(1L,
1L, 3L, 1L, 1L, 1L, 1L, 3L, 1L, 1L, 1L, 3L, 1L, 1L, 3L, 3L, 1L,
1L, 3L, 3L), .Label = c("1", "4", "na"), class = "factor"), Temp.1 = structure(c(1L,
1L, 3L, 1L, 1L, 1L, 1L, 3L, 1L, 1L, 1L, 3L, 1L, 1L, 3L, 3L, 1L,
1L, 3L, 3L), .Label = c("20", "25", "na"), class = "factor"),
Species = structure(c(7L, 21L, 1L, 21L, 16L, 4L, 16L, 6L,
1L, 17L, 5L, 7L, 5L, 1L, 1L, 6L, 7L, 7L, 24L, 5L), .Label = c("ABDU",
"ABDU", "ABDU", "ABDU", "ABDU", "CAGO", "CAGO", "CAGO", "CAGO",
"CAGO", "GOLD", "GOLD", "GOLD", "GOLD", "GOLD", "MERG", "MERG",
"MERG", "MERG", "MERG", "SCOT", "SCOT", "SCOT", "SCOT",
"SCOT", "SCOT", "SCOT"), class = "factor"), Count = c(5L,
1L, 150L, 3L, 20L, 8L, 5L, 10L, 5L, 1L, 20L, 10L, 2L, 2L,
80L, 40L, 1L, 1000L, 2L, 20L)), .Names = c("Year", "Unknown",
"Temp", "Obs", "State", "Zone", "Segment", "Subseg", "Wdir",
"Wvel", "Clouds", "Temp.1", "Species", "Count"), row.names = c(666L,
614L, 2060L, 1738L, 1459L, 536L, 197L, 2467L, 98L, 1794L, 1449L,
2464L, 696L, 483L, 2644L, 2350L, 686L, 844L, 2989L, 2934L), class = "data.frame")
With a header that looks like this:
Year Unknown Temp Obs State Zone Segment Subseg Wdir Wvel
666 2015 1 21 mdk ma 2 8 1 ne 5
614 2015 1 21 mdk ma 2 7 3 ne 5
2060 2015 1 21 sde ma 2 4 3 na na
1738 2015 1 23 mdk ma 2 17 2 nw 5
1459 2015 1 23 mdk ma 2 18 2 nw 5
536 2015 1 21 mdk ma 2 7 2 ne 5
Clouds Temp.1 Species Count
666 1 20 CAGO 5
614 1 20 SCOT 1
2060 na na ABDU 150
1738 1 20 SCOT 3
1459 1 20 MERG 20
536 1 20 ABDU 8
Among other things within dplyr, I want to get a sum of each species as a new column, when I am grouping by segment. This is the final code I have tried with many variations.
df_group = df %>%
group_by(Segment) %>%
summarise(temp = round(mean(Temp)),
WDir = round(mean(Wdir)),
ABDU = sum(which(Species=="ABDU"),Count),
CAGO = sum(which(Species=="CAGO"),Count),
GOLD = sum(which(Species=="GOLD"),Count),
MERG = sum(which(Species=="MERG"),Count),
SCOT = sum(which(Species=="SCOT"),Count))
And this is what I get (to show correct format):
Segment temp WDir ABDU CAGO GOLD MERG SCOT
1 1 21 2 6 5 5 5 5
2 2 21 2 5 5 5 6 5
3 4 21 1 151 150 150 150 150
4 7 21 2 16 11 11 11 12
5 8 21 2 6 9 6 6 6
6 9 21 2 1003 1004 1002 1002 1002
The format and general idea are what I want, but the numbers are not adding up the way I want them to. I'm sure it is simple but need some help! Thanks.

The problem is that which returns a vector of the positions, but you're not using those to subset. So the sum you are getting is of the positions which are true in addition to the count variable. e.g.
x <- c("a", "b", "b")
count <- c(10, 11, 12)
sum(which(c("a", "b", "b") == "b"), count)
# 38 because it is 2 + 3 + 10 + 11 + 12
I believe what you want is (or at least one way of writing it):
sum(ifelse(x == "b", count, 0))
# 23 because it is equal to 0 + 11 + 12
Translating into dplyr syntax, your example could look like this:
df_group = df %>%
group_by(Segment) %>%
summarise(temp = round(mean(Temp)),
WDir = round(mean(Wdir)),
ABDU = sum(ifelse(Species=="ABDU", Count, 0L)),
CAGO = sum(ifelse(Species=="CAGO", Count, 0L)),
GOLD = sum(ifelse(Species=="GOLD", Count, 0L)),
MERG = sum(ifelse(Species=="MERG", Count, 0L)),
SCOT = sum(ifelse(Species=="SCOT", Count, 0L)))

Another approach, in case you don't want to type out the sum for all your species:
library(reshape2)
library(dplyr)
# I had a problem with duplicate factor levels from your dput,
# so I re-factored species
df$Species = as.factor(as.character(df$Species))
species.counts = select(df, Segment, Species, Count) %>%
dcast(formula = Segment ~ Species, value.var = "Count", fun.aggregate = sum)
> head(species.counts)
Segment ABDU CAGO MERG SCOT
1 1 5 0 0 0
2 2 0 0 5 0
3 4 150 0 0 0
4 7 10 0 0 1
5 8 0 6 0 0
6 9 2 1000 0 0
df %>% group_by(Segment) %>%
summarise(temp = round(mean(Temp))) %>%
left_join(species.counts)
Source: local data frame [11 x 6]
Segment temp ABDU CAGO MERG SCOT
1 1 21 5 0 0 0
2 2 21 0 0 5 0
3 4 21 150 0 0 0
4 7 21 10 0 0 1
5 8 21 0 6 0 0
6 9 21 2 1000 0 0
I also couldn't do the wind direction average, because your dput data only has that as a factor with the directions, not like the head() you showed, but the technique generalizes.

Develop Reference

r css asp.net wordpress firebase qt symfony nginx http apache-flex

Subset data.table using min condition - r

Related

Divide Items based on their available stock to other ID

Large data set cleaning: How to fill in missing data based on multiple categories and searching by row order

How can I identify specific string row number

window function is not working as expected

sum and conditionally count based on a second column

Categories

Resources