My data frame is below. All variables are numeric, and one of them (Total) has about 20 NAs. I would like the missForest package to impute values for the NAs in Total. I am running
R version 4.2.1 (2022-06-23 ucrt) on Windows.
imp <- structure(list(Years = c(21, 5, 5, 25, 4, 4, 4, 1, 12, 17, 5.5,
4, 13, 1, 1, 5, 1, 12, 8, 1, 14, 0.8, 6, 5, 4, 7, 4, 21, 3, 2,
20, 1, 2, 2, 20, 2, 1, 9, 12, 22, 1, 27, 5, 3, 1, 8, 5, 25, 1,
0.4, 4, 1, 1.5, 1, 1, 21, 5, 0.5, 3, 12, 3, 28, 7, 5, 22, 3.25,
4, 4, 12, 1, 3, 25, 17, 12, 40, 12, 6, 3, 8, 7, 17, 1, 3, 3,
6, 4, 7, 1, 7, 6, 4, 11, 1, 5, 2, 15, 1, 3, 7.5, 21, 4, 1.5,
7, 13, 5, 6, 9, 12.5, 2.5, 1, 17, 8, 5, 22, 25, 13, 5.5, 19,
9, 3.3, 14, 3, 22, 5, 6, 2.8, 9, 1, 8, 11, 8, 4, 2, 10, 1, 19,
13, 5, 1, 1.5, 7, 12, 2, 2.5, 1.5, 1, 2, 8, 5, 4, 3, 2, 2.5,
7, 11, 3, 8, 22, 5, 5, 8, 3.5, 1, 8, 11, 1, 5, 7, 9, 7, 4, 1,
14, 4, 20, 4, 5, 15.5, 9, 2, 7.5, 1, 13.5, 14, 1, 7, 4, 20, 9.5,
0, 10, 3, 8, 1, 3, 1, 19, 1, 20, 8, 25, 16, 14, 10, 24, 1, 2,
4, 0, 11, 2, 1.5, 2, 1, 21, 1, 20, 1.75, 5, 22, 5, 3), Staff = c(7,
8, 6, 10, 15, 6, 7, 17, 9, 5, 7, 12, 15, 8, 7, 5, 8, 8, 2, 8,
7, 8, 7, 7, 12, 8, 8, 7, 12, 10, 5, 7, 3, 6, 11, 4, 8, 8, 9,
6, 9, 9, 18, 10, 9, 5, 7, 20, 9, 4, 9, 6, 5, 4, 3, 5, 11, 8,
4, 7, 6, 16, 5, 5, 8, 8, 7, 4, 9, 9, 9, 14, 8, 5, 6, 6, 4, 3,
6, 7, 10, 7, 7, 3, 7, 13, 12, 4, 10, 8, 9, 5, 15, 7, 9, 9, 6,
5, 15, 7, 6, 5, 7, 8, 7, 7, 5, 9, 15, 12, 15, 5, 8, 7, 7, 5,
8, 12, 6, 6, 12, 9, 5, 4, 6, 7, 15, 5, 20, 6, 6, 11, 6, 8, 6,
2, 7, 4, 4, 2, 6, 15, 5, 15, 6, 3, 8, 15, 12, 7, 6, 9, 7, 1,
10, 5, 7, 4, 5, 1, 6, 5, 20, 8, 10, 1, 11, 9, 9, 5, 3, 8, 6,
5, 5, 5, 6, 8, 4, 7, 5, 4, 10, 8, 13, 5, 13, 3, 0, 15, 20, 5,
15, 14, 19, 20, 5, 7, 5, 9, 6, 6, 7, 20, 10, 25, 7, 5, 6, 10,
45, 10, 6, 5, 6, 8, 13, 12, 15, 7, 4, 1), JDs = c(64, 64, 120,
200, 30, 70, 370, 75, 300, 20, 68, 170, 77, 275, 132, 81, 875,
135, 75, 84, 74, 110, 120, 60, 1800, 94, 54, 125, 140, 150, 52,
190, 53, 170, 325, 18, 300, 86, 130, 375, 140, 200, 104, 50,
100, 95, 360, 40, 45, 52, 165, 20, 150, 58, 230, 95, 150, 95,
85, 120, 100, 265, 18, 90, 130, 77, 80, 75, 133, 73, 302, 500,
70, 50, 55, 72, 35, 60, 100, 90, 130, 41, 200, 29, 90, 35, 68,
30, 115, 51, 40, 125, 460, 400, 125, 400, 250, 51, 190, 200,
235, 150, 250, 137, 760, 90, 70, 100, 325, 200, 350, 150, 325,
23, 17, 50, 415, 650, 120, 96, 200, 4, 71, 700, 60, 224, 203,
16, 40, 62, 105, 41, 340, 22, 60, 11, 60, 30, 95, 27, 300, 120,
70, 96, 100, 6, 750, 14, 80, 60, 51, 90, 350, 250, 31, 78, 95,
32, 185, 65, 65, 30, 24, 65, 550, 100, 200, 80, 47, 45, 37, 250,
55, 25, 27, 90, 190, 65, 27, 80, 68, 110, 220, 325, 25, 43, 14,
5, 7, 17, 15, 135, 20, 26, 26, 29, 75, 93, 50, 127, 14, 75, 90,
50, 105, 190, 8, 45, 150, 300, 15, 25, 150, 60, 32, 85, 15, 144,
190, 155, 10, 20), Total = c(325000, 250000, 275000, 340000,
165000, 3e+05, 420000, 8e+05, 5e+05, 100776, 440000, 440000,
191500, NA, 4e+05, 145000, 6e+05, 4e+05, 125000, 155000, 230000,
250000, 240000, 2e+05, NA, 250000, 188000, 375000, 190000, 450000,
290558, 725000, 355000, 350000, 8e+05, 125000, 450000, 255000,
212500, 6e+05, 342000, 450000, 250000, 228000, 325000, 325000,
425000, 175000, NA, 240000, NA, 250000, 237000, 330000, 345000,
195000, 295000, 208000, 225000, NA, 445000, 253000, 75000, 285000,
4e+05, 2e+05, 308000, 236000, 470000, 190000, 1250000, 480000,
2e+05, 285000, 232000, 240000, 2e+05, 209000, 250000, 309000,
NA, 170000, 1e+06, 115200, 565000, 182500, 175000, 250000, 250000,
265000, 120000, 345000, 425000, 630000, 165000, 650000, 3e+05,
265000, 345000, 425000, 4e+05, 230000, 425000, 161500, 6e+05,
251000, 265000, 190000, 420000, 6e+05, 510000, 340000, 650000,
275000, 120000, 185000, 480000, 550000, 185000, 240000, 560000,
114000, 150000, 1050000, 230000, NA, 335000, 225000, 260000,
410000, 315000, 206000, 650000, 160000, 210000, 180000, 275000,
2e+05, 2e+05, 201094, 395000, 297000, 265000, 3e+05, 275000,
80000, 134000, 180000, 195000, 850000, 4e+05, 385000, 420000,
NA, 187000, 180000, 182700, 96597.28, 380000, 2e+05, 260000,
257500, 185000, 220000, 550000, 315000, 360000, 380000, 185000,
280000, 225000, 375000, 310000, 170000, 165000, 260000, 350000,
208000, 110000, 192500, 187500, 216000, 495000, 550000, 114500,
215000, 185000, NA, 114500, 110000, 250000, 350000, 180000, 118000,
191500, 1e+05, 230000, 350000, 240000, NA, 180000, 215000, 203000,
99800, 389900, NA, NA, NA, 4e+05, 6e+05, NA, NA, NA, 220000,
217500, NA, NA, 210000, 337000, 275000, NA, NA)), row.names = c(NA,
-222L), class = c("tbl_df", "tbl", "data.frame"))
library(missForest) # installed with dependencies = TRUE
impFor <- missForest(imp)
The statement above returns the following warnings and error.
Warning: argument is not numeric or logical: returning NA
Warning: argument is not numeric or logical: returning NA
Warning: argument is not numeric or logical: returning NA
Warning: argument is not numeric or logical: returning NA
Warning: The response has five or fewer unique values. Are you sure you want to do regression?
Error in randomForest.default(x = obsX, y = obsY, ntree = ntree, mtry = mtry, :
length of response must be the same as predictors
The first four warnings appear to say that my four variables are neither numeric nor logical, yet they are all numeric. The warning about regression with "five or fewer unique values" puzzles me, because the package's manual makes no reference to a minimum number of unique values. Finally, the error confounds me completely.
I have searched Stack Overflow, but the two questions that came up are not relevant.
Thank you for setting me right.
Your data should be a plain data.frame rather than a tibble. You can convert it with as.data.frame() like this:
library(missForest)
class(imp)
#> [1] "tbl_df" "tbl" "data.frame"
imp <- as.data.frame(imp)
class(imp)
#> [1] "data.frame"
imp <- missForest(imp)
imp
#> $ximp
#> Years Staff JDs Total
#> 1 21.00 7 64 325000.00
#> 2 5.00 8 64 250000.00
#> 3 5.00 6 120 275000.00
#> 4 25.00 10 200 340000.00
#> 5 4.00 15 30 165000.00
#> 6 4.00 6 70 300000.00
#> 7 4.00 7 370 420000.00
#> 8 1.00 17 75 800000.00
#> 9 12.00 9 300 500000.00
#> 10 17.00 5 20 100776.00
#> 11 5.50 7 68 440000.00
#> 12 4.00 12 170 440000.00
#> 13 13.00 15 77 191500.00
#> 14 1.00 8 275 422030.00
#> 15 1.00 7 132 400000.00
#> 16 5.00 5 81 145000.00
#> 17 1.00 8 875 600000.00
#> 18 12.00 8 135 400000.00
#> 19 8.00 2 75 125000.00
#> 20 1.00 8 84 155000.00
#> 21 14.00 7 74 230000.00
#> 22 0.80 8 110 250000.00
#> 23 6.00 7 120 240000.00
#> 24 5.00 7 60 200000.00
#> 25 4.00 12 1800 564720.00
#> 26 7.00 8 94 250000.00
#> 27 4.00 8 54 188000.00
#> 28 21.00 7 125 375000.00
#> 29 3.00 12 140 190000.00
#> 30 2.00 10 150 450000.00
#> 31 20.00 5 52 290558.00
#> 32 1.00 7 190 725000.00
#> 33 2.00 3 53 355000.00
#> 34 2.00 6 170 350000.00
#> 35 20.00 11 325 800000.00
#> 36 2.00 4 18 125000.00
#> 37 1.00 8 300 450000.00
#> 38 9.00 8 86 255000.00
#> 39 12.00 9 130 212500.00
#> 40 22.00 6 375 600000.00
#> 41 1.00 9 140 342000.00
#> 42 27.00 9 200 450000.00
#> 43 5.00 18 104 250000.00
#> 44 3.00 10 50 228000.00
#> 45 1.00 9 100 325000.00
#> 46 8.00 5 95 325000.00
#> 47 5.00 7 360 425000.00
#> 48 25.00 20 40 175000.00
#> 49 1.00 9 45 185352.00
#> 50 0.40 4 52 240000.00
#> 51 4.00 9 165 403167.00
#> 52 1.00 6 20 250000.00
#> 53 1.50 5 150 237000.00
#> 54 1.00 4 58 330000.00
#> 55 1.00 3 230 345000.00
#> 56 21.00 5 95 195000.00
#> 57 5.00 11 150 295000.00
#> 58 0.50 8 95 208000.00
#> 59 3.00 4 85 225000.00
#> 60 12.00 7 120 261252.00
#> 61 3.00 6 100 445000.00
#> 62 28.00 16 265 253000.00
#> 63 7.00 5 18 75000.00
#> 64 5.00 5 90 285000.00
#> 65 22.00 8 130 400000.00
#> 66 3.25 8 77 200000.00
#> 67 4.00 7 80 308000.00
#> 68 4.00 4 75 236000.00
#> 69 12.00 9 133 470000.00
#> 70 1.00 9 73 190000.00
#> 71 3.00 9 302 1250000.00
#> 72 25.00 14 500 480000.00
#> 73 17.00 8 70 200000.00
#> 74 12.00 5 50 285000.00
#> 75 40.00 6 55 232000.00
#> 76 12.00 6 72 240000.00
#> 77 6.00 4 35 200000.00
#> 78 3.00 3 60 209000.00
#> 79 8.00 6 100 250000.00
#> 80 7.00 7 90 309000.00
#> 81 17.00 10 130 279905.83
#> 82 1.00 7 41 170000.00
#> 83 3.00 7 200 1000000.00
#> 84 3.00 3 29 115200.00
#> 85 6.00 7 90 565000.00
#> 86 4.00 13 35 182500.00
#> 87 7.00 12 68 175000.00
#> 88 1.00 4 30 250000.00
#> 89 7.00 10 115 250000.00
#> 90 6.00 8 51 265000.00
#> 91 4.00 9 40 120000.00
#> 92 11.00 5 125 345000.00
#> 93 1.00 15 460 425000.00
#> 94 5.00 7 400 630000.00
#> 95 2.00 9 125 165000.00
#> 96 15.00 9 400 650000.00
#> 97 1.00 6 250 300000.00
#> 98 3.00 5 51 265000.00
#> 99 7.50 15 190 345000.00
#> 100 21.00 7 200 425000.00
#> 101 4.00 6 235 400000.00
#> 102 1.50 5 150 230000.00
#> 103 7.00 7 250 425000.00
#> 104 13.00 8 137 161500.00
#> 105 5.00 7 760 600000.00
#> 106 6.00 7 90 251000.00
#> 107 9.00 5 70 265000.00
#> 108 12.50 9 100 190000.00
#> 109 2.50 15 325 420000.00
#> 110 1.00 12 200 600000.00
#> 111 17.00 15 350 510000.00
#> 112 8.00 5 150 340000.00
#> 113 5.00 8 325 650000.00
#> 114 22.00 7 23 275000.00
#> 115 25.00 7 17 120000.00
#> 116 13.00 5 50 185000.00
#> 117 5.50 8 415 480000.00
#> 118 19.00 12 650 550000.00
#> 119 9.00 6 120 185000.00
#> 120 3.30 6 96 240000.00
#> 121 14.00 12 200 560000.00
#> 122 3.00 9 4 114000.00
#> 123 22.00 5 71 150000.00
#> 124 5.00 4 700 1050000.00
#> 125 6.00 6 60 230000.00
#> 126 2.80 7 224 756680.00
#> 127 9.00 15 203 335000.00
#> 128 1.00 5 16 225000.00
#> 129 8.00 20 40 260000.00
#> 130 11.00 6 62 410000.00
#> 131 8.00 6 105 315000.00
#> 132 4.00 11 41 206000.00
#> 133 2.00 6 340 650000.00
#> 134 10.00 8 22 160000.00
#> 135 1.00 6 60 210000.00
#> 136 19.00 2 11 180000.00
#> 137 13.00 7 60 275000.00
#> 138 5.00 4 30 200000.00
#> 139 1.00 4 95 200000.00
#> 140 1.50 2 27 201094.00
#> 141 7.00 6 300 395000.00
#> 142 12.00 15 120 297000.00
#> 143 2.00 5 70 265000.00
#> 144 2.50 15 96 300000.00
#> 145 1.50 6 100 275000.00
#> 146 1.00 3 6 80000.00
#> 147 2.00 8 750 134000.00
#> 148 8.00 15 14 180000.00
#> 149 5.00 12 80 195000.00
#> 150 4.00 7 60 850000.00
#> 151 3.00 6 51 400000.00
#> 152 2.00 9 90 385000.00
#> 153 2.50 7 350 420000.00
#> 154 7.00 1 250 434900.00
#> 155 11.00 10 31 187000.00
#> 156 3.00 5 78 180000.00
#> 157 8.00 7 95 182700.00
#> 158 22.00 4 32 96597.28
#> 159 5.00 5 185 380000.00
#> 160 5.00 1 65 200000.00
#> 161 8.00 6 65 260000.00
#> 162 3.50 5 30 257500.00
#> 163 1.00 20 24 185000.00
#> 164 8.00 8 65 220000.00
#> 165 11.00 10 550 550000.00
#> 166 1.00 1 100 315000.00
#> 167 5.00 11 200 360000.00
#> 168 7.00 9 80 380000.00
#> 169 9.00 9 47 185000.00
#> 170 7.00 5 45 280000.00
#> 171 4.00 3 37 225000.00
#> 172 1.00 8 250 375000.00
#> 173 14.00 6 55 310000.00
#> 174 4.00 5 25 170000.00
#> 175 20.00 5 27 165000.00
#> 176 4.00 5 90 260000.00
#> 177 5.00 6 190 350000.00
#> 178 15.50 8 65 208000.00
#> 179 9.00 4 27 110000.00
#> 180 2.00 7 80 192500.00
#> 181 7.50 5 68 187500.00
#> 182 1.00 4 110 216000.00
#> 183 13.50 10 220 495000.00
#> 184 14.00 8 325 550000.00
#> 185 1.00 13 25 114500.00
#> 186 7.00 5 43 215000.00
#> 187 4.00 13 14 185000.00
#> 188 20.00 3 5 132532.25
#> 189 9.50 0 7 114500.00
#> 190 0.00 15 17 110000.00
#> 191 10.00 20 15 250000.00
#> 192 3.00 5 135 350000.00
#> 193 8.00 15 20 180000.00
#> 194 1.00 14 26 118000.00
#> 195 3.00 19 26 191500.00
#> 196 1.00 20 29 100000.00
#> 197 19.00 5 75 230000.00
#> 198 1.00 7 93 350000.00
#> 199 20.00 5 50 240000.00
#> 200 8.00 9 127 259289.83
#> 201 25.00 6 14 180000.00
#> 202 16.00 6 75 215000.00
#> 203 14.00 7 90 203000.00
#> 204 10.00 20 50 99800.00
#> 205 24.00 10 105 389900.00
#> 206 1.00 25 190 466223.67
#> 207 2.00 7 8 153912.76
#> 208 4.00 5 45 249760.00
#> 209 0.00 6 150 400000.00
#> 210 11.00 10 300 600000.00
#> 211 2.00 45 15 190321.17
#> 212 1.50 10 25 143960.88
#> 213 2.00 6 150 350892.50
#> 214 1.00 5 60 220000.00
#> 215 21.00 6 32 217500.00
#> 216 1.00 8 85 193365.00
#> 217 20.00 13 15 193093.52
#> 218 1.75 12 144 210000.00
#> 219 5.00 15 190 337000.00
#> 220 22.00 7 155 275000.00
#> 221 5.00 4 10 143128.61
#> 222 3.00 1 20 149726.72
#>
#> $OOBerror
#> NRMSE
#> 0.4584988
#>
#> attr(,"class")
#> [1] "missForest"
Created on 2023-02-11 with reprex v2.0.2
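For background, these warnings are consistent with how tibbles behave under single-bracket subsetting: x[, j] on a tibble returns another one-column tibble rather than a vector, so internal calls such as mean() receive a data frame and return NA with exactly this warning. A minimal illustration of the subsetting behavior (not missForest's exact internals):
library(tibble)
tb <- tibble(a = 1:3)
mean(tb[, "a"])  # NA, with "argument is not numeric or logical: returning NA"
df <- as.data.frame(tb)
mean(df[, "a"])  # 2, because [ on a data.frame drops a single column to a vector
Note also that missForest is based on random forests, so repeated runs give slightly different imputations; call set.seed() before missForest() if you need a reproducible result.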
I have a data frame of time series data for multiple items, and each item's data starts on a different date.
I want to dynamically take the data from the month in which the first data point appears, ignore the leading 0 values, and then perform outlier cleansing. If I instead fix the start and end of the time frame for every item, the results are wrong.
I was planning to use a for loop to perform the outlier identification, but the issue is that I first need to find the start date and end date.
The start date applies to cases where there are 0s for at least 3 months before the first data point: the date of that first data point should be selected as the start year and month. The end date applies to cases where the value is 0 for 3 months after the last data point: that last data point should be selected as the end year and month. For cases where there is no 0 at the start or end, we can fix the dates.
structure(list(`Row Labels` = c("2019-01-01", "2019-02-01", "2019-03-01",
"2019-04-01", "2019-05-01", "2019-06-01", "2019-07-01", "2019-08-01",
"2019-09-01", "2019-10-01", "2019-11-01", "2019-12-01", "2020-01-01",
"2020-02-01", "2020-03-01", "2020-04-01", "2020-05-01", "2020-06-01",
"2020-07-01", "2020-08-01", "2020-09-01", "2020-10-01", "2020-11-01",
"2020-12-01", "2021-01-01", "2021-02-01", "2021-03-01", "2021-04-01",
"2021-05-01", "2021-06-01", "2021-07-01", "2021-08-01", "2021-09-01",
"2021-10-01", "2021-11-01", "2021-12-01", "2022-01-01", "2022-02-01",
"2022-03-01", "2022-04-01", "2022-05-01", "2022-06-01", "2022-07-01",
"2022-08-01", "2022-09-01", "2022-10-01"), `XYZ|146` = c(0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 12, 18, 16, 16, 17, 12, 22, 6,
7, 6, 0, 15, 0, 17, 17, 5, 19, 16, 7, 25, 19, 34, 26, 41, 50,
29, 42, 20, 14, 16, 27, 10, 28, 21), `XYZ|666` = c(0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 17, 45,
9, 21, 33, 3, 8, 11, 16, 3, 17, 14, 59, 26, 35, 26, 15, 7, 4,
4, 2, 7, 6, 2), `XYZ|574` = c(0, 0, 0, 0, 0, 0, 0, 0, 74, 179,
464, 880, 324, 184, 90, 170, 140, 96, 78, 83, 83, 121, 245, 740,
332, 123, 117, 138, 20, 42, 70, 70, 42, 103, 490, 641, 488, 245,
142, 95, 63, 343, 57, 113, 100, 105), `XYZ|851` = c(0, 0, 0,
0, 0, 0, 0, 0, 0, 206, 1814, 2324, 772, 1116, 1636, 1906, 957,
829, 911, 786, 938, 1313, 2384, 1554, 1777, 1635, 1534, 1015,
827, 982, 685, 767, 511, 239, 1850, 1301, 426, 261, 201, 33,
0, 0, 0, 0, 0, 0)), class = c("tbl_df", "tbl", "data.frame"), row.names = c(NA, -46L))
So can someone help me identify a method to determine the start and end dates dynamically in each column?
The code I have written so far:
library(readxl)
library(dplyr)
library(forecast)
Book2 <- read_excel("C:/X/X/X- Y/X/Book5.xlsx")
View(Book2)
dput(Book2)
Dates <- Book2 %>%
select(`Row Labels`)
for (i in 2:ncol(Book2)) {
  Start_Year  <- NA  # how can I set these four values dynamically?
  Start_Month <- NA
  End_Year    <- NA
  End_Month   <- NA
  Y <- ts(data = Book2[, i],
          frequency = 12,
          start = c(Start_Year, Start_Month),
          end = c(End_Year, End_Month))
  # print() is needed for the plot to render inside a loop
  print(autoplot(tsclean(Y), series = "Clean", color = "red", lwd = 0.9) +
          autolayer(Y, series = "original", color = "grey", lwd = 1))
}
Can someone help me figure out how to set Start_Year, Start_Month, End_Year, and End_Month dynamically based on the logic described above?
Is this doable, or is it too difficult?
I believe that NAs are much easier to deal with (as far as auto-removal goes) than 0s, so let's run a rolling window over the data to NA-ize stretches where three or more consecutive values are all 0. (Also, since the rows to remove vary between columns, there is no way to remove some rows from one column while keeping them for another. This way the frame never changes dimensions, so it retains its data.frame properties nicely.)
The biggest weakness here is the assumption that each row is one month; if you have gaps, you will need to adapt the width= argument of rollapply based on the actual time spans.
(There is no strict need to define fun; you can call rollapply directly inside across() as ~ zoo::rollapply(.x, 3, ...). I defined fun only to keep the code terse.)
fun <- function(z) {
  zoo::rollapply(
    z, 3, align = "right", partial = TRUE,
    # keep the right-most value of each window, unless the whole window
    # is 0/NA, in which case return NA (w[length(w)][NA] preserves type)
    FUN = function(w) if (all(w %in% c(NA, 0))) w[length(w)][NA] else w[length(w)]
  )
}
dplyr
library(dplyr)
quux %>%
mutate(across(-`Row Labels`, ~ fun(.))) %>%
print(n=99)
# # A tibble: 46 × 5
# `Row Labels` `XYZ|146` `XYZ|666` `XYZ|574` `XYZ|851`
# <chr> <dbl> <dbl> <dbl> <dbl>
# 1 2019-01-01 NA NA NA NA
# 2 2019-02-01 NA NA NA NA
# 3 2019-03-01 NA NA NA NA
# 4 2019-04-01 NA NA NA NA
# 5 2019-05-01 NA NA NA NA
# 6 2019-06-01 NA NA NA NA
# 7 2019-07-01 NA NA NA NA
# 8 2019-08-01 NA NA NA NA
# 9 2019-09-01 NA NA 74 NA
# 10 2019-10-01 NA NA 179 206
# 11 2019-11-01 NA NA 464 1814
# 12 2019-12-01 NA NA 880 2324
# 13 2020-01-01 12 NA 324 772
# 14 2020-02-01 18 NA 184 1116
# 15 2020-03-01 16 NA 90 1636
# 16 2020-04-01 16 NA 170 1906
# 17 2020-05-01 17 NA 140 957
# 18 2020-06-01 12 NA 96 829
# 19 2020-07-01 22 NA 78 911
# 20 2020-08-01 6 NA 83 786
# 21 2020-09-01 7 NA 83 938
# 22 2020-10-01 6 NA 121 1313
# 23 2020-11-01 0 17 245 2384
# 24 2020-12-01 15 45 740 1554
# 25 2021-01-01 0 9 332 1777
# 26 2021-02-01 17 21 123 1635
# 27 2021-03-01 17 33 117 1534
# 28 2021-04-01 5 3 138 1015
# 29 2021-05-01 19 8 20 827
# 30 2021-06-01 16 11 42 982
# 31 2021-07-01 7 16 70 685
# 32 2021-08-01 25 3 70 767
# 33 2021-09-01 19 17 42 511
# 34 2021-10-01 34 14 103 239
# 35 2021-11-01 26 59 490 1850
# 36 2021-12-01 41 26 641 1301
# 37 2022-01-01 50 35 488 426
# 38 2022-02-01 29 26 245 261
# 39 2022-03-01 42 15 142 201
# 40 2022-04-01 20 7 95 33
# 41 2022-05-01 14 4 63 0
# 42 2022-06-01 16 4 343 0
# 43 2022-07-01 27 2 57 NA
# 44 2022-08-01 10 7 113 NA
# 45 2022-09-01 28 6 100 NA
# 46 2022-10-01 21 2 105 NA
base
quux[-1] <- lapply(quux[-1], fun)
quux
# # A tibble: 46 × 5
# `Row Labels` `XYZ|146` `XYZ|666` `XYZ|574` `XYZ|851`
# <chr> <dbl> <dbl> <dbl> <dbl>
# 1 2019-01-01 NA NA NA NA
# 2 2019-02-01 NA NA NA NA
# 3 2019-03-01 NA NA NA NA
# 4 2019-04-01 NA NA NA NA
# 5 2019-05-01 NA NA NA NA
# 6 2019-06-01 NA NA NA NA
# 7 2019-07-01 NA NA NA NA
# 8 2019-08-01 NA NA NA NA
# 9 2019-09-01 NA NA 74 NA
# 10 2019-10-01 NA NA 179 206
# # … with 36 more rows
(all output the same)
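If you still need explicit start and end dates per column (for the start= and end= arguments of ts()), they fall out of the first and last non-NA rows of the NA-ized frame. A sketch, assuming the transformed frame is still called quux as above:
library(dplyr)
quux %>%
  summarise(across(
    -`Row Labels`,
    list(start = ~ `Row Labels`[which(!is.na(.x))[1]],
         end   = ~ `Row Labels`[tail(which(!is.na(.x)), 1)])
  ))
This returns one row with columns XYZ|146_start, XYZ|146_end, and so on; year and month can then be split out with, e.g., format(as.Date(start), "%Y") and format(as.Date(start), "%m").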
I have a large dataset of 70 000 rows that I want to perform some operations on, but I can't find an appropriate solution.
bib sta run course finish comment day
1 42 9 1 SG 19.88 99 1
2 42 17 2 A 19.96 11 1
3 42 27 3 B 20.92 22 1
4 42 39 4 A 19.60 11 1
5 42 48 5 SG 20.24 99 1
6 42 61 6 C 22.90 33 1
7 42 76 7 B 20.70 22 1
8 42 86 8 C 22.74 33 1
9 42 93 9 C 22.75 33 1
10 42 103 10 A 19.79 11 1
11 42 114 11 B 20.67 22 1
12 42 120 12 SG 20.10 99 1
I want to end up with a tibble that:
calculates the mean finish time on the SG course for each bib number on each day. For example, (19.88 + 20.24 + 20.10) / 3.
calculates a difference score for each observation in the dataset by subtracting this mean SG time from finish. For example, 19.88 - mean(SG), 19.96 - mean(SG).
I have tried the following approach:
First group by day, bib, and course, then filter to SG and calculate the mean:
avg.sgtime <- df %>%
group_by(day, bib, course) %>%
filter(course == 'SG') %>%
mutate(avg.sg = mean(finish))
Resulting in the following tibble
bib sta run course finish comment day avg.sg
<int> <int> <int> <chr> <dbl> <int> <chr> <dbl>
1 42 9 1 SG 19.9 99 1 20.1
2 42 48 5 SG 20.2 99 1 20.1
3 42 120 12 SG 20.1 99 1 20.1
4 42 6 1 SG 20.0 99 2 19.9
5 42 42 5 SG 19.8 77 2 19.9
6 42 130 15 SG 19.9 99 2 19.9
7 42 6 1 SG 20.6 99 3 20.5
8 42 68 12 SG 20.6 77 3 20.5
9 42 90 15 SG 20.4 77 3 20.5
Finally I join the two tibbles together using the following syntax:
df %>% full_join(avg.sgtime) %>%
mutate(diff = finish - avg.sg)
However, this doesn't fully work: diff is computed only for the SG rows, not for courses A, B, and C. Is there a way to fix this, or is there a better solution to the problem?
bib sta run course finish comment day avg.sg diff
1 42 9 1 SG 19.88 99 1 20.07333 -0.193333333
2 42 17 2 A 19.96 11 1 NA NA
3 42 27 3 B 20.92 22 1 NA NA
4 42 39 4 A 19.60 11 1 NA NA
5 42 48 5 SG 20.24 99 1 20.07333 0.166666667
You can filter the finish values within mutate() and calculate the mean based on just those:
df %>%
group_by(day,bib) %>%
mutate(
avg.sg = mean(finish[course=="SG"]),
diff = finish - avg.sg)
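As a quick sanity check against the numbers in the question, for bib 42 on day 1:
mean(c(19.88, 20.24, 20.10))          # 20.07333, the avg.sg for bib 42, day 1
19.88 - mean(c(19.88, 20.24, 20.10))  # -0.1933, matching the diff shown above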
Is the following what you are aiming for?
(Note that I added a few random values for a second bib just to make sure the join is done properly.)
The difference from your attempt is using summarise() instead of mutate() to consolidate the avg.sgtime data frame, and also dropping a few columns so that the join is not populated with NAs. Instead of dropping columns, you can also specify the relevant join columns by passing the by argument to left_join().
library(dplyr)
library(tidyr) # not strictly needed here; left_join() comes from dplyr
avg.sgtime <- df %>%
group_by(day, bib, course) %>%
filter(course == 'SG') %>%
summarise(avg.sg = mean(finish), .groups = "drop") %>%
select(c(bib, day, avg.sg))
avg.sgtime
#> # A tibble: 3 x 3
#> bib day avg.sg
#> <dbl> <dbl> <dbl>
#> 1 42 1 20.1
#> 2 43 1 19.1
#> 3 44 2 19.3
df %>% left_join(avg.sgtime) %>%
mutate(diff = finish - avg.sg)
#> Joining, by = c("bib", "day")
#> # A tibble: 36 x 9
#> bib sta run course finish comment day avg.sg diff
#> <dbl> <dbl> <dbl> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 42 9 1 SG 19.9 99 1 20.1 -0.193
#> 2 42 17 2 A 20.0 11 1 20.1 -0.113
#> 3 42 27 3 B 20.9 22 1 20.1 0.847
#> 4 42 39 4 A 19.6 11 1 20.1 -0.473
#> 5 42 48 5 SG 20.2 99 1 20.1 0.167
#> 6 42 61 6 C 22.9 33 1 20.1 2.83
#> 7 42 76 7 B 20.7 22 1 20.1 0.627
#> 8 42 86 8 C 22.7 33 1 20.1 2.67
#> 9 42 93 9 C 22.8 33 1 20.1 2.68
#> 10 42 103 10 A 19.8 11 1 20.1 -0.283
#> # … with 26 more rows
Created on 2021-07-04 by the reprex package (v2.0.0)
data
df <- tribble(~bib, ~sta, ~run, ~course, ~finish, ~comment, ~day,
42, 9, 1, "SG", 19.88, 99, 1,
42, 17, 2, "A", 19.96, 11, 1,
42, 27, 3, "B", 20.92, 22, 1,
42, 39, 4, "A", 19.60, 11, 1,
42, 48, 5, "SG", 20.24, 99, 1,
42, 61, 6, "C", 22.90, 33, 1,
42, 76, 7, "B", 20.70, 22, 1,
42, 86, 8, "C", 22.74, 33, 1,
42, 93, 9, "C", 22.75, 33, 1,
42, 103, 10, "A", 19.79, 11, 1,
42, 114, 11, "B", 20.67, 22, 1,
42, 120, 12, "SG", 20.10, 99, 1,
43, 9, 1, "SG", 19.12, 99, 1,
43, 17, 2, "A", 19.64, 11, 1,
43, 27, 3, "B", 20.62, 22, 1,
43, 39, 4, "A", 19.23, 11, 1,
43, 48, 5, "SG", 20.11, 99, 1,
43, 61, 6, "C", 22.22, 33, 1,
43, 76, 7, "B", 20.33, 22, 1,
43, 86, 8, "C", 22.51, 33, 1,
43, 93, 9, "C", 22.78, 33, 1,
43, 103, 10, "A", 19.98, 11, 1,
43, 114, 11, "B", 20.11, 22, 1,
43, 120, 12, "SG", 18.21, 99, 1,
44, 9, 1, "SG", 19.18, 99, 2,
44, 17, 2, "A", 19.56, 11, 2,
44, 27, 3, "B", 20.62, 22, 2,
44, 39, 4, "A", 19.20, 11, 2,
44, 48, 5, "SG", 20.74, 99, 2,
44, 61, 6, "C", 22.50, 33, 2,
44, 76, 7, "B", 20.60, 22, 2,
44, 86, 8, "C", 22.74, 33, 2,
44, 93, 9, "C", 22.85, 33, 2,
44, 103, 10, "A", 19.59, 11, 2,
44, 114, 11, "B", 20.27, 22, 2,
44, 120, 12, "SG", 18.10, 99, 2,
)
Thanks @Marcelo Avila for providing me with a very good hint.
I hope this is what you are looking for:
library(dplyr)
df %>%
group_by(bib, day) %>%
mutate(across(finish, ~ mean(.x[course == "SG"]), .names = "avg_{.col}"),
diff = finish - avg_finish,
avg_finish = ifelse(course == "SG", avg_finish, NA))
# A tibble: 12 x 9
# Groups: bib, day [1]
bib sta run course finish comment day avg_finish diff
<int> <int> <int> <chr> <dbl> <int> <int> <dbl> <dbl>
1 42 9 1 SG 19.9 99 1 20.1 -0.193
2 42 17 2 A 20.0 11 1 NA -0.113
3 42 27 3 B 20.9 22 1 NA 0.847
4 42 39 4 A 19.6 11 1 NA -0.473
5 42 48 5 SG 20.2 99 1 20.1 0.167
6 42 61 6 C 22.9 33 1 NA 2.83
7 42 76 7 B 20.7 22 1 NA 0.627
8 42 86 8 C 22.7 33 1 NA 2.67
9 42 93 9 C 22.8 33 1 NA 2.68
10 42 103 10 A 19.8 11 1 NA -0.283
11 42 114 11 B 20.7 22 1 NA 0.597
12 42 120 12 SG 20.1 99 1 20.1 0.0267
I also added an alternative solution with a minor change, using @Marcelo Avila's data set:
df %>%
group_by(bib, day) %>%
mutate(across(finish, ~ mean(.x[select(cur_data(), course) == "SG"]), .names = "avg_{.col}"),
diff = finish - avg_finish,
avg_finish = ifelse(course == "SG", avg_finish, NA))
# A tibble: 36 x 9
# Groups: bib, day [3]
bib sta run course finish comment day avg_finish diff
<dbl> <dbl> <dbl> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 42 9 1 SG 19.9 99 1 20.1 -0.193
2 42 17 2 A 20.0 11 1 NA -0.113
3 42 27 3 B 20.9 22 1 NA 0.847
4 42 39 4 A 19.6 11 1 NA -0.473
5 42 48 5 SG 20.2 99 1 20.1 0.167
6 42 61 6 C 22.9 33 1 NA 2.83
7 42 76 7 B 20.7 22 1 NA 0.627
8 42 86 8 C 22.7 33 1 NA 2.67
9 42 93 9 C 22.8 33 1 NA 2.68
10 42 103 10 A 19.8 11 1 NA -0.283
# ... with 26 more rows
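One caveat on the second variant: cur_data() was deprecated in dplyr 1.1.0. On a newer dplyr, pick() is the suggested replacement; a sketch that should give the same result as the first variant:
df %>%
  group_by(bib, day) %>%
  mutate(across(finish, ~ mean(.x[pick(course)$course == "SG"]),
                .names = "avg_{.col}"),
         diff = finish - avg_finish,
         avg_finish = ifelse(course == "SG", avg_finish, NA))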
I'm having trouble using pivot_longer() on blocks of variables. Suppose I have the wide data below, with blocks x1_a:x1_c and x2_a:x2_c, and I want a long version with one row per date and block (x1/x2) and one column for each of a, b, and c:
dfwide <- structure(list(date = structure(c(1577836800, 1577923200, 1578009600,
1578096000, 1578182400, 1578268800), class = c("POSIXct", "POSIXt"
), tzone = "UTC"), x1_a = c(20, 15, 12, NA, 25, 27), x1_b = c(33,
44, 85, 10, 12, 3), x1_c = c(70, 20, 87, 11, 20, 5), x2_a = c(85,
65, 33, 46, 82, 9), x2_b = c(87, 25, 55, 64, 98, 5), x2_c = c(77,
51, 92, 20, 37, 98)), row.names = c(NA, -6L), class = c("tbl_df",
"tbl", "data.frame"))
## Tried:
dfwide %>%
pivot_longer(cols = -date,
names_sep = c("x1", "x2"),
names_to = c("a", "b", "c"),
values_to = "value")
This solution takes advantage of the name-separation options of pivot_longer(): names_sep = "_" splits each column name at the underscore, the first piece goes into the which column, and the special ".value" sentinel turns the second piece back into the name of a value column.
pivot_longer(dfwide, -date, names_sep = "_",
names_to=c("which", ".value")) %>%
arrange(which)
# A tibble: 12 x 5
date which a b c
<dttm> <chr> <dbl> <dbl> <dbl>
1 2020-01-01 00:00:00 x1 20 33 70
2 2020-01-02 00:00:00 x1 15 44 20
3 2020-01-03 00:00:00 x1 12 85 87
4 2020-01-04 00:00:00 x1 NA 10 11
5 2020-01-05 00:00:00 x1 25 12 20
6 2020-01-06 00:00:00 x1 27 3 5
7 2020-01-01 00:00:00 x2 85 87 77
8 2020-01-02 00:00:00 x2 65 25 51
9 2020-01-03 00:00:00 x2 33 55 92
10 2020-01-04 00:00:00 x2 46 64 20
11 2020-01-05 00:00:00 x2 82 98 37
12 2020-01-06 00:00:00 x2 9 5 98
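The same reshape can also be written with names_pattern instead of names_sep, if you prefer a regex (this assumes every block prefix looks like x followed by digits):
pivot_longer(dfwide, -date,
             names_pattern = "(x\\d+)_(.*)",
             names_to = c("which", ".value")) %>%
  arrange(which)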
You can try this code:
library(tidyverse)
dfwide %>%
pivot_longer(cols = -date,
names_to = "which",
values_to = "value") %>%
separate(which, into = c("which","letter"), sep = "_") %>%
pivot_wider(names_from = "letter", values_from = "value") %>%
arrange(which)
This is the result:
# A tibble: 12 x 5
date which a b c
<dttm> <chr> <dbl> <dbl> <dbl>
1 2020-01-01 00:00:00 x1 20 33 70
2 2020-01-02 00:00:00 x1 15 44 20
3 2020-01-03 00:00:00 x1 12 85 87
4 2020-01-04 00:00:00 x1 NA 10 11
5 2020-01-05 00:00:00 x1 25 12 20
6 2020-01-06 00:00:00 x1 27 3 5
7 2020-01-01 00:00:00 x2 85 87 77
8 2020-01-02 00:00:00 x2 65 25 51
9 2020-01-03 00:00:00 x2 33 55 92
10 2020-01-04 00:00:00 x2 46 64 20
11 2020-01-05 00:00:00 x2 82 98 37
12 2020-01-06 00:00:00 x2 9 5 98
If you're okay with doing it in more than one step, this works: first lengthen the columns, then split the names at the underscore with extract(), and finally widen the values back out.
pivot_longer(dfwide, x1_a:x2_c, names_to="which") %>%
extract(which, into=c("var", "letter"), regex="(.*)_(.*)") %>%
pivot_wider(names_from=letter, values_from=value)
I would like to sum up the values in Col1 for the rows where Col2 equals 0, and add that sum to the next value in Col1 (the first following row where Col2 is nonzero). I show an example below. I have many products in the data frame. I started with this:
library(dplyr)
TD %>%
  group_by(Product, Date) %>%
  mutate(Sum = sum(Col1[Col2 == 0])[1]) %>%
  mutate(Expected = Col1)
Date <- seq(as.Date("2021-01-01"), as.Date("2021-01-07"), by = "day")
Product<-rep("A",7)
Col1 <- c(13, 10, 15, 7, 9, 4, 3)
Col2 <- c(7, 0, 0, 8, 0, 0, 27)
Expected <- c(13, 10, 15, 32, 9, 4, 16)
TD <- data.frame(Date, Product, Col1, Col2, Expected)
Other data:
Date Product Col1 Col2 Expected2
<date> <chr> <dbl> <dbl> <dbl>
1 2021-02-12 831 15 384 631
2 2021-02-13 831 11 373 631
3 2021-02-14 831 13 360 631
4 2021-02-15 831 14 826 631
5 2020-12-03 832 10 11 20
6 2020-12-04 832 10 1 20
7 2020-12-05 832 7 0 7
8 2020-12-06 832 11 0 11
9 2020-12-07 832 13 0 13
10 2020-12-08 832 10 0 10
structure(list(Date = structure(c(18670, 18671, 18672, 18673,
18599, 18600, 18601, 18602, 18603, 18604, 18605, 18606, 18607,
18608, 18609, 18610, 18611, 18612, 18613, 18614, 18615, 18616,
18617, 18618, 18619, 18620, 18621, 18622, 18623, 18624), class = "Date"),
Product = c("831", "831", "831", "831", "832", "832", "832",
"832", "832", "832", "832", "832", "832", "832", "832", "832",
"832", "832", "832", "832", "832", "832", "832", "832", "832",
"832", "832", "832", "832", "832"), Col1 = c(15, 11, 13,
14, 10, 10, 7, 11, 13, 10, 8, 11, 9, 8, 10, 17, 15, 17, 16,
16, 14, 14, 15, 17, 18, 16, 17, 18, 18, 8), Col2 = c(384,
373, 360, 826, 11, 1, 0, 0, 0, 0, 0, 70, 61, 53, 43, 26,
11, 0, 0, 84, 70, 56, 41, 24, 6, 0, 0, 0, 0, 0), Expected2 = c(631,
631, 631, 631, 20, 20, 7, 11, 13, 10, 8, 119, 119, 119, 119,
119, 119, 17, 16, 127, 127, 127, 127, 127, 127, 16, 17, 18,
18, 8)), row.names = c(NA, -30L), class = c("tbl_df", "tbl",
"data.frame"))
We can do the following:
library(dplyr)
TD %>%
group_by(Product) %>%
group_by(grp = cumsum(lag(Col2 != 0, default = FALSE)), .add = TRUE) %>%
mutate(Expected2 = sum(Col1)) %>%
group_by(Product) %>%
mutate(Expected2 = case_when(Col2 == 0 | !any(Col2 == 0) ~
Col1, TRUE ~ Expected2)) %>%
ungroup %>%
select(-grp)
-output
# A tibble: 7 x 6
# Date Product Col1 Col2 Expected Expected2
# <date> <chr> <dbl> <dbl> <dbl> <dbl>
#1 2021-01-01 A 13 7 13 13
#2 2021-01-02 A 10 0 10 10
#3 2021-01-03 A 15 0 15 15
#4 2021-01-04 A 7 8 32 32
#5 2021-01-05 A 9 0 9 9
#6 2021-01-06 A 4 0 4 4
#7 2021-01-07 A 3 27 16 16
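To see how the grouping works, trace the helper column for Product A's Col2:
x <- c(7, 0, 0, 8, 0, 0, 27)
dplyr::lag(x != 0, default = FALSE)
# FALSE  TRUE FALSE FALSE  TRUE FALSE FALSE
cumsum(dplyr::lag(x != 0, default = FALSE))
# 0 1 1 1 2 2 2
Rows 2-4 and rows 5-7 each form one group, so the zero rows are summed into the nonzero row that closes the group: 10 + 15 + 7 = 32 on 2021-01-04, and 9 + 4 + 3 = 16 on 2021-01-07.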
-testing on the second dataset
TD2 %>%
group_by(Product) %>%
group_by(grp = cumsum(lag(Col2 != 0, default = FALSE)), .add = TRUE)%>%
mutate(tmp = sum(Col1), Expected2 = case_when(any(Col2 == 0) &
(row_number() == n() & Col2 != 0) ~ tmp, TRUE ~ Col1)) %>%
ungroup %>%
select(-grp, -tmp)
-output
# A tibble: 15 x 5
# Date Product Col1 Col2 Expected2
# <chr> <chr> <int> <int> <int>
# 1 2020-12-03 00:00:00 B 10 206 10
# 2 2020-12-04 00:00:00 B 5 364 5
# 3 2020-12-05 00:00:00 B 10 354 10
# 4 2020-12-06 00:00:00 B 8 346 8
# 5 2020-12-07 00:00:00 B 5 341 5
# 6 2020-12-08 00:00:00 B 8 333 8
# 7 2020-12-09 00:00:00 B 12 321 12
# 8 2020-12-10 00:00:00 B 5 316 5
# 9 2020-12-11 00:00:00 B 7 309 7
#10 2020-12-12 00:00:00 B 13 296 13
#11 2020-12-13 00:00:00 B 9 287 9
#12 2020-12-14 00:00:00 B 11 276 11
#13 2020-12-15 00:00:00 B 10 266 10
#14 2020-12-16 00:00:00 B 17 249 17
#15 2020-12-17 00:00:00 B 14 235 14
Or with the data that was posted as an image (TD3):
TD3 %>%
group_by(Product) %>%
group_by(grp = cumsum(lag(Col2 != 0, default = FALSE)), .add = TRUE) %>%
mutate(tmp = sum(Col1), Expected2 = case_when(any(Col2 == 0) &
(row_number() == n() & Col2 != 0) ~ tmp, TRUE ~ Col1)) %>%
ungroup %>%
select(-grp, -tmp)
-output
# A tibble: 21 x 4
# Product Col1 Col2 Expected2
# <chr> <dbl> <dbl> <dbl>
# 1 C 11 52 11
# 2 C 7 45 7
# 3 C 6 39 6
# 4 C 15 24 15
# 5 C 14 10 14
# 6 C 10 0 10
# 7 C 8 0 8
# 8 C 10 125 28
# 9 C 12 113 12
#10 C 11 102 11
# … with 11 more rows
Or using the OP's latest dput data
TD4 %>%
group_by(Product) %>%
group_by(grp = cumsum(lag(Col2 != 0, default = FALSE)), .add = TRUE) %>%
mutate(tmp = sum(Col1), Expected2 = case_when(any(Col2 == 0) &
(row_number() == n() & Col2 != 0) ~ tmp, TRUE ~ Col1)) %>%
ungroup %>%
select(-grp, -tmp) %>%
as.data.frame
-output
# Date Product Col1 Col2 Expected2
#1 2021-02-12 831 15 384 15
#2 2021-02-13 831 11 373 11
#3 2021-02-14 831 13 360 13
#4 2021-02-15 831 14 826 14
#5 2020-12-03 832 10 11 10
#6 2020-12-04 832 10 1 10
#7 2020-12-05 832 7 0 7
#8 2020-12-06 832 11 0 11
#9 2020-12-07 832 13 0 13
#10 2020-12-08 832 10 0 10
#11 2020-12-09 832 8 0 8
#12 2020-12-10 832 11 70 60
#13 2020-12-11 832 9 61 9
#14 2020-12-12 832 8 53 8
#15 2020-12-13 832 10 43 10
#16 2020-12-14 832 17 26 17
#17 2020-12-15 832 15 11 15
#18 2020-12-16 832 17 0 17
#19 2020-12-17 832 16 0 16
#20 2020-12-18 832 16 84 49
#21 2020-12-19 832 14 70 14
#22 2020-12-20 832 14 56 14
#23 2020-12-21 832 15 41 15
#24 2020-12-22 832 17 24 17
#25 2020-12-23 832 18 6 18
#26 2020-12-24 832 16 0 16
#27 2020-12-25 832 17 0 17
#28 2020-12-26 832 18 0 18
#29 2020-12-27 832 18 0 18
#30 2020-12-28 832 8 0 8
data
TD2 <- structure(list(Date = c("2020-12-03 00:00:00", "2020-12-04 00:00:00",
"2020-12-05 00:00:00", "2020-12-06 00:00:00", "2020-12-07 00:00:00",
"2020-12-08 00:00:00", "2020-12-09 00:00:00", "2020-12-10 00:00:00",
"2020-12-11 00:00:00", "2020-12-12 00:00:00", "2020-12-13 00:00:00",
"2020-12-14 00:00:00", "2020-12-15 00:00:00", "2020-12-16 00:00:00",
"2020-12-17 00:00:00"), Product = c("B", "B", "B", "B", "B",
"B", "B", "B", "B", "B", "B", "B", "B", "B", "B"), Col1 = c(10L,
5L, 10L, 8L, 5L, 8L, 12L, 5L, 7L, 13L, 9L, 11L, 10L, 17L, 14L
), Col2 = c(206L, 364L, 354L, 346L, 341L, 333L, 321L, 316L, 309L,
296L, 287L, 276L, 266L, 249L, 235L), Expected2 = c(144L, 144L,
144L, 144L, 144L, 144L, 144L, 144L, 144L, 144L, 144L, 144L, 144L,
144L, 144L)), class = "data.frame", row.names = c("1", "2", "3",
"4", "5", "6", "7", "8", "9", "10", "11", "12", "13", "14", "15"
))
TD3 <- structure(list(Product = c("C", "C", "C", "C", "C", "C", "C",
"C", "C", "C", "C", "C", "C", "C", "C", "C", "C", "C", "C", "C",
"C"), Col1 = c(11, 7, 6, 15, 14, 10, 8, 10, 12, 11, 10, 20, 20,
22, 19, 23, 21, 20, 26, 26, 27), Col2 = c(52, 45, 39, 24, 10,
0, 0, 125, 113, 102, 92, 72, 52, 30, 11, 138, 117, 97, 71, 45,
18)), class = "data.frame", row.names = c(NA, -21L))