dplyr: Using poly function to generate polynomial coefficients - r

I want to append polynomial coefficients to a data.frame, as in the example given below:
df1 <-
structure(list(
Y = c(4, 4, 4, 4, 4, 8, 8, 8, 8, 8, 16, 16, 16,
16, 16, 32, 32, 32, 32, 32, 4, 4, 4, 4, 4, 8, 8, 8, 8, 8, 16,
16, 16, 16, 16, 32, 32, 32, 32, 32, 4, 4, 4, 4, 4, 8, 8, 8, 8,
8, 16, 16, 16, 16, 16, 32, 32, 32, 32, 32)),
class = "data.frame", row.names = c(NA, -60L))
library(tidyverse)
df1 %>%
  dplyr::mutate(
    Linear = poly(x = Y, degree = 3, raw = TRUE)[, 1]
    , Quadratic = poly(x = Y, degree = 3, raw = TRUE)[, 2]
    , Cubic = poly(x = Y, degree = 3, raw = TRUE)[, 3]
  )
I wonder if there is a more concise method, something like this:
df1 %>%
dplyr::mutate(poly(x = Y, degree = 3, raw = TRUE))
Thanks

Not exactly the way you were hoping for, but close enough. I first convert the output of poly (a matrix) to a data.frame, then use !!! to splice the columns (turning each element of a list/data.frame into its own argument). setNames is optional, for renaming the columns:
library(dplyr)
df1 %>%
mutate(!!!as.data.frame(poly(x = .$Y, degree = 3, raw = TRUE))) %>%
setNames(c("Y", "Linear", "Quadratic", "Cubic"))
Result:
Y Linear Quadratic Cubic
1 4 4 16 64
2 4 4 16 64
3 4 4 16 64
4 4 4 16 64
5 4 4 16 64
6 8 8 64 512
7 8 8 64 512
8 8 8 64 512
9 8 8 64 512
10 8 8 64 512
11 16 16 256 4096
12 16 16 256 4096
13 16 16 256 4096
14 16 16 256 4096
15 16 16 256 4096
16 32 32 1024 32768
17 32 32 1024 32768
18 32 32 1024 32768
19 32 32 1024 32768
20 32 32 1024 32768
...
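If you would rather avoid tidy evaluation altogether, here is a plain base R sketch of the same idea (using a shortened df1 for brevity): poly() already returns a matrix, so you can rename its columns and cbind it onto the data frame.

```r
# Base R sketch (shortened df1 for brevity): rename the poly() matrix
# columns, then bind them onto the original data frame
df1 <- data.frame(Y = c(4, 8, 16, 32))
poly_cols <- poly(df1$Y, degree = 3, raw = TRUE)
colnames(poly_cols) <- c("Linear", "Quadratic", "Cubic")
df2 <- cbind(df1, poly_cols)
df2
```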

Another option, although I really like #useR's solution:
df1 %>%
left_join(data.frame(Y = unique(.$Y), poly(unique(.$Y), degree = 3, raw = TRUE)),
by = c('Y' = 'Y')) %>%
setNames(c('Y', 'Linear', 'Quadratic', 'Cubic'))
Y Linear Quadratic Cubic
1 4 4 16 64
2 4 4 16 64
3 4 4 16 64
4 4 4 16 64
5 4 4 16 64
6 8 8 64 512
7 8 8 64 512
8 8 8 64 512
9 8 8 64 512
10 8 8 64 512
11 16 16 256 4096
12 16 16 256 4096
13 16 16 256 4096
14 16 16 256 4096
15 16 16 256 4096
16 32 32 1024 32768
17 32 32 1024 32768
18 32 32 1024 32768
19 32 32 1024 32768
20 32 32 1024 32768
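As a further note (this assumes dplyr >= 1.0, where an unnamed data-frame result inside mutate is automatically spliced into separate columns), the one-liner the question asks for almost works as-is; only a rename is needed, since poly() names its columns "1", "2", "3":

```r
library(dplyr)

df1 <- data.frame(Y = c(4, 8, 16, 32))  # shortened version of the data
res <- df1 %>%
  mutate(as.data.frame(poly(x = Y, degree = 3, raw = TRUE))) %>%  # unnamed data frame is auto-spliced
  rename(Linear = `1`, Quadratic = `2`, Cubic = `3`)
res
```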


Error message with missForest package (imputation using Random Forest)

My dataframe is below. All variables are numeric, one of them (Total) has about 20 NAs. I would like the missForest package to create imputed values for the NAs in Total. I am running
R version 4.2.1 (2022-06-23 ucrt) on Windows.
imp <- structure(list(Years = c(21, 5, 5, 25, 4, 4, 4, 1, 12, 17, 5.5,
4, 13, 1, 1, 5, 1, 12, 8, 1, 14, 0.8, 6, 5, 4, 7, 4, 21, 3, 2,
20, 1, 2, 2, 20, 2, 1, 9, 12, 22, 1, 27, 5, 3, 1, 8, 5, 25, 1,
0.4, 4, 1, 1.5, 1, 1, 21, 5, 0.5, 3, 12, 3, 28, 7, 5, 22, 3.25,
4, 4, 12, 1, 3, 25, 17, 12, 40, 12, 6, 3, 8, 7, 17, 1, 3, 3,
6, 4, 7, 1, 7, 6, 4, 11, 1, 5, 2, 15, 1, 3, 7.5, 21, 4, 1.5,
7, 13, 5, 6, 9, 12.5, 2.5, 1, 17, 8, 5, 22, 25, 13, 5.5, 19,
9, 3.3, 14, 3, 22, 5, 6, 2.8, 9, 1, 8, 11, 8, 4, 2, 10, 1, 19,
13, 5, 1, 1.5, 7, 12, 2, 2.5, 1.5, 1, 2, 8, 5, 4, 3, 2, 2.5,
7, 11, 3, 8, 22, 5, 5, 8, 3.5, 1, 8, 11, 1, 5, 7, 9, 7, 4, 1,
14, 4, 20, 4, 5, 15.5, 9, 2, 7.5, 1, 13.5, 14, 1, 7, 4, 20, 9.5,
0, 10, 3, 8, 1, 3, 1, 19, 1, 20, 8, 25, 16, 14, 10, 24, 1, 2,
4, 0, 11, 2, 1.5, 2, 1, 21, 1, 20, 1.75, 5, 22, 5, 3), Staff = c(7,
8, 6, 10, 15, 6, 7, 17, 9, 5, 7, 12, 15, 8, 7, 5, 8, 8, 2, 8,
7, 8, 7, 7, 12, 8, 8, 7, 12, 10, 5, 7, 3, 6, 11, 4, 8, 8, 9,
6, 9, 9, 18, 10, 9, 5, 7, 20, 9, 4, 9, 6, 5, 4, 3, 5, 11, 8,
4, 7, 6, 16, 5, 5, 8, 8, 7, 4, 9, 9, 9, 14, 8, 5, 6, 6, 4, 3,
6, 7, 10, 7, 7, 3, 7, 13, 12, 4, 10, 8, 9, 5, 15, 7, 9, 9, 6,
5, 15, 7, 6, 5, 7, 8, 7, 7, 5, 9, 15, 12, 15, 5, 8, 7, 7, 5,
8, 12, 6, 6, 12, 9, 5, 4, 6, 7, 15, 5, 20, 6, 6, 11, 6, 8, 6,
2, 7, 4, 4, 2, 6, 15, 5, 15, 6, 3, 8, 15, 12, 7, 6, 9, 7, 1,
10, 5, 7, 4, 5, 1, 6, 5, 20, 8, 10, 1, 11, 9, 9, 5, 3, 8, 6,
5, 5, 5, 6, 8, 4, 7, 5, 4, 10, 8, 13, 5, 13, 3, 0, 15, 20, 5,
15, 14, 19, 20, 5, 7, 5, 9, 6, 6, 7, 20, 10, 25, 7, 5, 6, 10,
45, 10, 6, 5, 6, 8, 13, 12, 15, 7, 4, 1), JDs = c(64, 64, 120,
200, 30, 70, 370, 75, 300, 20, 68, 170, 77, 275, 132, 81, 875,
135, 75, 84, 74, 110, 120, 60, 1800, 94, 54, 125, 140, 150, 52,
190, 53, 170, 325, 18, 300, 86, 130, 375, 140, 200, 104, 50,
100, 95, 360, 40, 45, 52, 165, 20, 150, 58, 230, 95, 150, 95,
85, 120, 100, 265, 18, 90, 130, 77, 80, 75, 133, 73, 302, 500,
70, 50, 55, 72, 35, 60, 100, 90, 130, 41, 200, 29, 90, 35, 68,
30, 115, 51, 40, 125, 460, 400, 125, 400, 250, 51, 190, 200,
235, 150, 250, 137, 760, 90, 70, 100, 325, 200, 350, 150, 325,
23, 17, 50, 415, 650, 120, 96, 200, 4, 71, 700, 60, 224, 203,
16, 40, 62, 105, 41, 340, 22, 60, 11, 60, 30, 95, 27, 300, 120,
70, 96, 100, 6, 750, 14, 80, 60, 51, 90, 350, 250, 31, 78, 95,
32, 185, 65, 65, 30, 24, 65, 550, 100, 200, 80, 47, 45, 37, 250,
55, 25, 27, 90, 190, 65, 27, 80, 68, 110, 220, 325, 25, 43, 14,
5, 7, 17, 15, 135, 20, 26, 26, 29, 75, 93, 50, 127, 14, 75, 90,
50, 105, 190, 8, 45, 150, 300, 15, 25, 150, 60, 32, 85, 15, 144,
190, 155, 10, 20), Total = c(325000, 250000, 275000, 340000,
165000, 3e+05, 420000, 8e+05, 5e+05, 100776, 440000, 440000,
191500, NA, 4e+05, 145000, 6e+05, 4e+05, 125000, 155000, 230000,
250000, 240000, 2e+05, NA, 250000, 188000, 375000, 190000, 450000,
290558, 725000, 355000, 350000, 8e+05, 125000, 450000, 255000,
212500, 6e+05, 342000, 450000, 250000, 228000, 325000, 325000,
425000, 175000, NA, 240000, NA, 250000, 237000, 330000, 345000,
195000, 295000, 208000, 225000, NA, 445000, 253000, 75000, 285000,
4e+05, 2e+05, 308000, 236000, 470000, 190000, 1250000, 480000,
2e+05, 285000, 232000, 240000, 2e+05, 209000, 250000, 309000,
NA, 170000, 1e+06, 115200, 565000, 182500, 175000, 250000, 250000,
265000, 120000, 345000, 425000, 630000, 165000, 650000, 3e+05,
265000, 345000, 425000, 4e+05, 230000, 425000, 161500, 6e+05,
251000, 265000, 190000, 420000, 6e+05, 510000, 340000, 650000,
275000, 120000, 185000, 480000, 550000, 185000, 240000, 560000,
114000, 150000, 1050000, 230000, NA, 335000, 225000, 260000,
410000, 315000, 206000, 650000, 160000, 210000, 180000, 275000,
2e+05, 2e+05, 201094, 395000, 297000, 265000, 3e+05, 275000,
80000, 134000, 180000, 195000, 850000, 4e+05, 385000, 420000,
NA, 187000, 180000, 182700, 96597.28, 380000, 2e+05, 260000,
257500, 185000, 220000, 550000, 315000, 360000, 380000, 185000,
280000, 225000, 375000, 310000, 170000, 165000, 260000, 350000,
208000, 110000, 192500, 187500, 216000, 495000, 550000, 114500,
215000, 185000, NA, 114500, 110000, 250000, 350000, 180000, 118000,
191500, 1e+05, 230000, 350000, 240000, NA, 180000, 215000, 203000,
99800, 389900, NA, NA, NA, 4e+05, 6e+05, NA, NA, NA, 220000,
217500, NA, NA, 210000, 337000, 275000, NA, NA)), row.names = c(NA,
-222L), class = c("tbl_df", "tbl", "data.frame"))
library(missForest) # installed with dependencies = TRUE
impFor <- missForest(imp)
The statement above returns the following warnings and error.
Warning: argument is not numeric or logical: returning NA
Warning: argument is not numeric or logical: returning NA
Warning: argument is not numeric or logical: returning NA
Warning: argument is not numeric or logical: returning NA
Warning: The response has five or fewer unique values. Are you sure you want to do regression?
Error in randomForest.default(x = obsX, y = obsY, ntree = ntree, mtry = mtry, :
length of response must be the same as predictors
The first four warnings appear to say that my four variables are neither numeric nor logical, but they are all numeric. The warning regarding regression and "five or fewer unique values" puzzles me because the package's manual makes no reference to a minimum number of unique values. Finally, the error confounds me completely.
I have searched StackOverflow, but the two questions that came up are not relevant.
Thank you for setting me right.
Your data should be a plain data.frame instead of a tibble. You could use as.data.frame like this:
library(missForest)
class(imp)
#> [1] "tbl_df" "tbl" "data.frame"
imp <- as.data.frame(imp)
class(imp)
#> [1] "data.frame"
imp <- missForest(imp)
imp
#> $ximp
#> Years Staff JDs Total
#> 1 21.00 7 64 325000.00
#> 2 5.00 8 64 250000.00
#> 3 5.00 6 120 275000.00
#> 4 25.00 10 200 340000.00
#> 5 4.00 15 30 165000.00
#> 6 4.00 6 70 300000.00
#> 7 4.00 7 370 420000.00
#> 8 1.00 17 75 800000.00
#> 9 12.00 9 300 500000.00
#> 10 17.00 5 20 100776.00
#> 11 5.50 7 68 440000.00
#> 12 4.00 12 170 440000.00
#> 13 13.00 15 77 191500.00
#> 14 1.00 8 275 422030.00
#> 15 1.00 7 132 400000.00
#> 16 5.00 5 81 145000.00
#> 17 1.00 8 875 600000.00
#> 18 12.00 8 135 400000.00
#> 19 8.00 2 75 125000.00
#> 20 1.00 8 84 155000.00
#> 21 14.00 7 74 230000.00
#> 22 0.80 8 110 250000.00
#> 23 6.00 7 120 240000.00
#> 24 5.00 7 60 200000.00
#> 25 4.00 12 1800 564720.00
#> 26 7.00 8 94 250000.00
#> 27 4.00 8 54 188000.00
#> 28 21.00 7 125 375000.00
#> 29 3.00 12 140 190000.00
#> 30 2.00 10 150 450000.00
#> 31 20.00 5 52 290558.00
#> 32 1.00 7 190 725000.00
#> 33 2.00 3 53 355000.00
#> 34 2.00 6 170 350000.00
#> 35 20.00 11 325 800000.00
#> 36 2.00 4 18 125000.00
#> 37 1.00 8 300 450000.00
#> 38 9.00 8 86 255000.00
#> 39 12.00 9 130 212500.00
#> 40 22.00 6 375 600000.00
#> 41 1.00 9 140 342000.00
#> 42 27.00 9 200 450000.00
#> 43 5.00 18 104 250000.00
#> 44 3.00 10 50 228000.00
#> 45 1.00 9 100 325000.00
#> 46 8.00 5 95 325000.00
#> 47 5.00 7 360 425000.00
#> 48 25.00 20 40 175000.00
#> 49 1.00 9 45 185352.00
#> 50 0.40 4 52 240000.00
#> 51 4.00 9 165 403167.00
#> 52 1.00 6 20 250000.00
#> 53 1.50 5 150 237000.00
#> 54 1.00 4 58 330000.00
#> 55 1.00 3 230 345000.00
#> 56 21.00 5 95 195000.00
#> 57 5.00 11 150 295000.00
#> 58 0.50 8 95 208000.00
#> 59 3.00 4 85 225000.00
#> 60 12.00 7 120 261252.00
#> 61 3.00 6 100 445000.00
#> 62 28.00 16 265 253000.00
#> 63 7.00 5 18 75000.00
#> 64 5.00 5 90 285000.00
#> 65 22.00 8 130 400000.00
#> 66 3.25 8 77 200000.00
#> 67 4.00 7 80 308000.00
#> 68 4.00 4 75 236000.00
#> 69 12.00 9 133 470000.00
#> 70 1.00 9 73 190000.00
#> 71 3.00 9 302 1250000.00
#> 72 25.00 14 500 480000.00
#> 73 17.00 8 70 200000.00
#> 74 12.00 5 50 285000.00
#> 75 40.00 6 55 232000.00
#> 76 12.00 6 72 240000.00
#> 77 6.00 4 35 200000.00
#> 78 3.00 3 60 209000.00
#> 79 8.00 6 100 250000.00
#> 80 7.00 7 90 309000.00
#> 81 17.00 10 130 279905.83
#> 82 1.00 7 41 170000.00
#> 83 3.00 7 200 1000000.00
#> 84 3.00 3 29 115200.00
#> 85 6.00 7 90 565000.00
#> 86 4.00 13 35 182500.00
#> 87 7.00 12 68 175000.00
#> 88 1.00 4 30 250000.00
#> 89 7.00 10 115 250000.00
#> 90 6.00 8 51 265000.00
#> 91 4.00 9 40 120000.00
#> 92 11.00 5 125 345000.00
#> 93 1.00 15 460 425000.00
#> 94 5.00 7 400 630000.00
#> 95 2.00 9 125 165000.00
#> 96 15.00 9 400 650000.00
#> 97 1.00 6 250 300000.00
#> 98 3.00 5 51 265000.00
#> 99 7.50 15 190 345000.00
#> 100 21.00 7 200 425000.00
#> 101 4.00 6 235 400000.00
#> 102 1.50 5 150 230000.00
#> 103 7.00 7 250 425000.00
#> 104 13.00 8 137 161500.00
#> 105 5.00 7 760 600000.00
#> 106 6.00 7 90 251000.00
#> 107 9.00 5 70 265000.00
#> 108 12.50 9 100 190000.00
#> 109 2.50 15 325 420000.00
#> 110 1.00 12 200 600000.00
#> 111 17.00 15 350 510000.00
#> 112 8.00 5 150 340000.00
#> 113 5.00 8 325 650000.00
#> 114 22.00 7 23 275000.00
#> 115 25.00 7 17 120000.00
#> 116 13.00 5 50 185000.00
#> 117 5.50 8 415 480000.00
#> 118 19.00 12 650 550000.00
#> 119 9.00 6 120 185000.00
#> 120 3.30 6 96 240000.00
#> 121 14.00 12 200 560000.00
#> 122 3.00 9 4 114000.00
#> 123 22.00 5 71 150000.00
#> 124 5.00 4 700 1050000.00
#> 125 6.00 6 60 230000.00
#> 126 2.80 7 224 756680.00
#> 127 9.00 15 203 335000.00
#> 128 1.00 5 16 225000.00
#> 129 8.00 20 40 260000.00
#> 130 11.00 6 62 410000.00
#> 131 8.00 6 105 315000.00
#> 132 4.00 11 41 206000.00
#> 133 2.00 6 340 650000.00
#> 134 10.00 8 22 160000.00
#> 135 1.00 6 60 210000.00
#> 136 19.00 2 11 180000.00
#> 137 13.00 7 60 275000.00
#> 138 5.00 4 30 200000.00
#> 139 1.00 4 95 200000.00
#> 140 1.50 2 27 201094.00
#> 141 7.00 6 300 395000.00
#> 142 12.00 15 120 297000.00
#> 143 2.00 5 70 265000.00
#> 144 2.50 15 96 300000.00
#> 145 1.50 6 100 275000.00
#> 146 1.00 3 6 80000.00
#> 147 2.00 8 750 134000.00
#> 148 8.00 15 14 180000.00
#> 149 5.00 12 80 195000.00
#> 150 4.00 7 60 850000.00
#> 151 3.00 6 51 400000.00
#> 152 2.00 9 90 385000.00
#> 153 2.50 7 350 420000.00
#> 154 7.00 1 250 434900.00
#> 155 11.00 10 31 187000.00
#> 156 3.00 5 78 180000.00
#> 157 8.00 7 95 182700.00
#> 158 22.00 4 32 96597.28
#> 159 5.00 5 185 380000.00
#> 160 5.00 1 65 200000.00
#> 161 8.00 6 65 260000.00
#> 162 3.50 5 30 257500.00
#> 163 1.00 20 24 185000.00
#> 164 8.00 8 65 220000.00
#> 165 11.00 10 550 550000.00
#> 166 1.00 1 100 315000.00
#> 167 5.00 11 200 360000.00
#> 168 7.00 9 80 380000.00
#> 169 9.00 9 47 185000.00
#> 170 7.00 5 45 280000.00
#> 171 4.00 3 37 225000.00
#> 172 1.00 8 250 375000.00
#> 173 14.00 6 55 310000.00
#> 174 4.00 5 25 170000.00
#> 175 20.00 5 27 165000.00
#> 176 4.00 5 90 260000.00
#> 177 5.00 6 190 350000.00
#> 178 15.50 8 65 208000.00
#> 179 9.00 4 27 110000.00
#> 180 2.00 7 80 192500.00
#> 181 7.50 5 68 187500.00
#> 182 1.00 4 110 216000.00
#> 183 13.50 10 220 495000.00
#> 184 14.00 8 325 550000.00
#> 185 1.00 13 25 114500.00
#> 186 7.00 5 43 215000.00
#> 187 4.00 13 14 185000.00
#> 188 20.00 3 5 132532.25
#> 189 9.50 0 7 114500.00
#> 190 0.00 15 17 110000.00
#> 191 10.00 20 15 250000.00
#> 192 3.00 5 135 350000.00
#> 193 8.00 15 20 180000.00
#> 194 1.00 14 26 118000.00
#> 195 3.00 19 26 191500.00
#> 196 1.00 20 29 100000.00
#> 197 19.00 5 75 230000.00
#> 198 1.00 7 93 350000.00
#> 199 20.00 5 50 240000.00
#> 200 8.00 9 127 259289.83
#> 201 25.00 6 14 180000.00
#> 202 16.00 6 75 215000.00
#> 203 14.00 7 90 203000.00
#> 204 10.00 20 50 99800.00
#> 205 24.00 10 105 389900.00
#> 206 1.00 25 190 466223.67
#> 207 2.00 7 8 153912.76
#> 208 4.00 5 45 249760.00
#> 209 0.00 6 150 400000.00
#> 210 11.00 10 300 600000.00
#> 211 2.00 45 15 190321.17
#> 212 1.50 10 25 143960.88
#> 213 2.00 6 150 350892.50
#> 214 1.00 5 60 220000.00
#> 215 21.00 6 32 217500.00
#> 216 1.00 8 85 193365.00
#> 217 20.00 13 15 193093.52
#> 218 1.75 12 144 210000.00
#> 219 5.00 15 190 337000.00
#> 220 22.00 7 155 275000.00
#> 221 5.00 4 10 143128.61
#> 222 3.00 1 20 149726.72
#>
#> $OOBerror
#> NRMSE
#> 0.4584988
#>
#> attr(,"class")
#> [1] "missForest"
Created on 2023-02-11 with reprex v2.0.2
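To see where the "argument is not numeric or logical" warnings come from, here is a base R sketch of my understanding (no packages needed): a tibble's single-bracket column extraction keeps a one-column table instead of dropping to a vector, and mean() on a data frame returns NA with exactly that warning, which is what happens inside missForest.

```r
d <- data.frame(x = c(1, 2, 3))
mean(d[, 1, drop = FALSE])  # mimics a tibble's d[, 1]: still a data frame, so NA + warning
mean(d[, 1])                # 2 -- a plain vector, which is what missForest expects
```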

Sequence with different intervals

seq can only use a single value in the by parameter. Is there a way to vectorize by, i.e. to use multiple intervals?
Something like this:
seq(1, 10, by = c(1, 2))
would return c(1, 2, 4, 5, 7, 8, 10). Now, it's possible to do this with e.g. seq(1, 10, by = 1)[c(T, T, F)] because it's a simple case, but is there a way to make it generalizable to more complex sequences?
Some examples
seq(1, 100, by = 1:5)
#[1] 1 2 4 7 11 16 17 19 22 26 31...
seq(8, -5, by = c(3, 8))
#[1] 8 5 -3
This looks like a close base R solution:
ans <- Reduce(`+`, rep(1:5, 100), init = 1, accumulate = TRUE)
ans[1:(which.max(ans >= 100) - 1)]
[1] 1 2 4 7 11 16 17 19 22 26 31 32 34 37 41 46 47 49 52 56 61 62 64
[24] 67 71 76 77 79 82 86 91 92 94 97
You would have to negate part of it if you want to count 'down':
ans <- Reduce(`+`, rep(c(-3, -8), 20), init = 8, accumulate = TRUE)
ans[1:(which.max(ans <= -5) - 1)]
[1] 8 5 -3
Still, you would have to 'guess' the number of repetitions needed (20 or 100 in the examples above) to create ans.
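The repetition count doesn't actually have to be guessed: each full cycle of by advances the sequence by sum(by), so the number of cycles can be computed up front. A cumsum-based sketch of the same idea for the ascending case (for the descending example you would negate by and flip the filter):

```r
from <- 1; to <- 100; by <- 1:5
n_cycles <- ceiling((to - from) / sum(by))    # each full cycle of `by` advances sum(by)
ans <- from + cumsum(c(0, rep(by, n_cycles)))
ans <- ans[ans <= to]
ans
```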
This doesn't seem possible, Maël. I suppose it's easy enough to write one:
seq2 <- function(from, to, by) {
  vals <- c(0, cumsum(rep(by, abs(ceiling((to - from) / sum(by))))))
  if (from > to) return((from - vals)[(from - vals) >= to])
  else (from + vals)[(from + vals) <= to]
}
Testing:
seq2(1, 10, by = c(1, 2))
#> [1] 1 2 4 5 7 8 10
seq2(1, 100, by = 1:5)
#> [1] 1 2 4 7 11 16 17 19 22 26 31 32 34 37 41 46 47 49 52 56 61 62 64 67 71
#> [26] 76 77 79 82 86 91 92 94 97
seq2(8, -5, by = c(3, 8))
#> [1] 8 5 -3
Created on 2022-12-23 with reprex v2.0.2

Sample part of a dataset while keeping subgroups intact

I have a dataframe which I would like to split into a 75% part and a 25% part of the original.
I thought a good first step would be to create the 25% dataset from the original dataset by randomly sampling a quarter of the data.
However, sampling shouldn't be entirely random: I want to preserve the groups of a certain variable.
So with the example below, I want to randomly sample 1/4 of the data frame, but the data needs to remain grouped by the 'team' variable. I have 8 teams, so I want to randomly sample 2 teams.
Data example (dput below)
team points assists
1 1 99 33
2 1 90 28
3 1 86 31
4 1 88 39
5 2 95 34
6 2 92 30
7 2 91 32
8 2 79 35
9 3 85 36
10 3 90 29
11 3 91 24
12 3 97 26
13 4 96 28
14 4 94 18
15 4 95 19
16 4 98 25
17 5 78 36
18 5 80 34
19 5 85 39
20 5 89 33
21 6 94 34
22 6 85 39
23 6 99 28
24 6 79 31
25 7 78 35
26 7 99 29
27 7 98 36
28 7 75 39
29 8 97 33
30 8 68 26
31 8 86 38
32 8 76 31
I've tried this using slice_sample from dplyr, but it does the exact opposite of what I want (it samples within every team):
testdata <- df %>% group_by(team) %>% slice_sample(n = 2)
My code results in
team points assists
<dbl> <dbl> <dbl>
1 1 90 28
2 1 99 33
3 2 95 34
4 2 92 30
5 3 91 24
6 3 85 36
7 4 95 19
8 4 98 25
9 5 80 34
10 5 78 36
11 6 85 39
12 6 94 34
13 7 78 35
14 7 98 36
15 8 76 31
16 8 86 38
Example of the dataframe:
structure(list(team = c(1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 4,
4, 4, 4, 5, 5, 5, 5, 6, 6, 6, 6, 7, 7, 7, 7, 8, 8, 8, 8), points = c(99,
90, 86, 88, 95, 92, 91, 79, 85, 90, 91, 97, 96, 94, 95, 98, 78,
80, 85, 89, 94, 85, 99, 79, 78, 99, 98, 75, 97, 68, 86, 76),
assists = c(33, 28, 31, 39, 34, 30, 32, 35, 36, 29, 24, 26,
28, 18, 19, 25, 36, 34, 39, 33, 34, 39, 28, 31, 35, 29, 36,
39, 33, 26, 38, 31)), class = "data.frame", row.names = c(NA,
-32L))
With dplyr, if you group_by(team) and then sample, you sample within each team, which is the opposite of what you want. Here's a direct approach:
test_teams = sample(unique(dataset$team), size = 2)
test = dataset %>% filter(team %in% test_teams)
train = dataset %>% filter(!team %in% test_teams)
Alternatively, with caTools (note that sample.split stratifies on team, so each team's rows end up in both sets at roughly the 75/25 ratio; it does not keep whole teams together):
library(caTools)
split <- sample.split(dataset$team, SplitRatio = 0.75)
training_set <- subset(dataset, split == TRUE)
test_set <- subset(dataset, split == FALSE)
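The whole-team sampling also works in plain base R with no packages. A sketch on a toy dataset with the same structure as the question's (8 teams, 4 rows each): sample a quarter of the team labels, then subset rows by membership.

```r
set.seed(42)  # reproducible draw
# Toy stand-in for the question's data: 8 teams, 4 rows each
dataset <- data.frame(team   = rep(1:8, each = 4),
                      points = sample(60:99, 32, replace = TRUE))
test_teams <- sample(unique(dataset$team), size = 2)   # 2 of 8 teams = 1/4
test  <- dataset[ dataset$team %in% test_teams, ]
train <- dataset[!dataset$team %in% test_teams, ]
```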

output of pmatch changes with vector length in R

I am trying to use pmatch in base R. The following example appears to work as expected:
treat1 <- c(1, 1, 1, 1, 1, 1, 1, 2, 2, 2,
2, 2, 2, 3, 3, 3, 3, 3, 3, 3,
4, 4, 4, 4, 4, 4, 5, 5, 5, 5,
5, 5, 5, 6, 6, 6, 6, 6, 6, 7,
7, 7, 7, 7, 7, 7, 8, 8, 8, 8,
8, 8, 9, 9, 9, 9, 9, 9, 9,10,
10,10,10,10,10,10,11,11,11,11,
11,11,12,12,12,12,12,12,12,13,
13,13,13,13,13,14,14,14,14,14,
14,14,15,15,15,15,15,15,16,16,
16,16,16,16,16,17,17,17,17,17,
17,18,18,18,18,18,18,18)
control1 <- c(1, 1, 1, 1, 1, 2, 2, 2, 2, 2,
2, 3, 3, 3, 3, 3, 4, 4, 4, 4,
4, 4, 5, 5, 5, 5, 5, 6, 6, 6,
6, 6, 6, 7, 7, 7, 7, 7, 8, 8,
8, 8, 8, 8, 9, 9, 9, 9, 9,10,
10,10,10,10,10,11,11,11,11,11,
12,12,12,12,12,12,13,13,13,13,
13,14,14,14,14,14,14,15,15,15,
15,15,16,16,16,16,16,16,17,17,
17,17,17,18,18,18,18,18,18)
pmatch(control1, treat1)
#[1] 1 2 3 4 5 8 9 10 11 12
# 13 14 15 16 17 18 21 22 23 24
# 25 26 27 28 29 30 31 34 35 36
# 37 38 39 40 41 42 43 44 47 48
# 49 50 51 52 53 54 55 56 57 60
# 61 62 63 64 65 67 68 69 70 71
# 73 74 75 76 77 78 80 81 82 83
# 84 86 87 88 89 90 91 93 94 95
# 96 97 99 100 101 102 103 104 106 107
# 108 109 110 112 113 114 115 116 117
However, the following example does not work as I expected. The only difference between the example above and the one below is the presence of a few additional elements of value 19 at the end of the vectors below. The output below contains numerous NA's and only seems to include the position in treat2 of the first element of a given value in control2. I have tried including some of the options for pmatch in the documentation but cannot get output similar to that shown above.
There are several similar questions on Stack Overflow, such as the following, but I have not found a solution to my issue:
Properties of pmatch function
treat2 <- c(1, 1, 1, 1, 1, 1, 1, 2, 2, 2,
2, 2, 2, 3, 3, 3, 3, 3, 3, 3,
4, 4, 4, 4, 4, 4, 5, 5, 5, 5,
5, 5, 5, 6, 6, 6, 6, 6, 6, 7,
7, 7, 7, 7, 7, 7, 8, 8, 8, 8,
8, 8, 9, 9, 9, 9, 9, 9, 9,10,
10,10,10,10,10,10,11,11,11,11,
11,11,12,12,12,12,12,12,12,13,
13,13,13,13,13,14,14,14,14,14,
14,14,15,15,15,15,15,15,16,16,
16,16,16,16,16,17,17,17,17,17,
17,18,18,18,18,18,18,18,19,19,
19,19,19,19,19)
control2 <- c(1, 1, 1, 1, 1, 2, 2, 2, 2, 2,
2, 3, 3, 3, 3, 3, 4, 4, 4, 4,
4, 4, 5, 5, 5, 5, 5, 6, 6, 6,
6, 6, 6, 7, 7, 7, 7, 7, 8, 8,
8, 8, 8, 8, 9, 9, 9, 9, 9,10,
10,10,10,10,10,11,11,11,11,11,
12,12,12,12,12,12,13,13,13,13,
13,14,14,14,14,14,14,15,15,15,
15,15,16,16,16,16,16,16,17,17,
17,17,17,18,18,18,18,18,18,19,
19,19,19,19)
pmatch(control2, treat2)
#[1] 1 NA NA NA NA 8 NA NA NA NA
# NA 14 NA NA NA NA 21 NA NA NA
# NA NA 27 NA NA NA NA 34 NA NA
# NA NA NA 40 NA NA NA NA 47 NA
# NA NA NA NA 53 NA NA NA NA 60
# NA NA NA NA NA 67 NA NA NA NA
# 73 NA NA NA NA NA 80 NA NA NA
# NA 86 NA NA NA NA NA 93 NA NA
# NA NA 99 NA NA NA NA NA 106 NA
# NA NA NA 112 NA NA NA NA NA 119
# NA NA NA NA
Given that your treat and control are always numbers, I think it might be easier (and faster) to just rewrite that function using Rcpp. Consider something like this
Rcpp::cppFunction('NumericVector cpmatch(NumericVector x, NumericVector table) {
  int n = x.size(), m = table.size();
  NumericVector out(n, NA_REAL), y = clone(table);
  for (int i = 0; i < n; i++) {
    if (ISNAN(x[i])) {
      continue;
    }
    for (int j = 0; j < m; j++) {
      if (!ISNAN(y[j]) && x[i] == y[j]) {
        y[j] = NA_REAL;  // consume this slot so each table element matches only once
        out[i] = j + 1;
        break;
      }
    }
  }
  return out;
}')
Test
> cpmatch(control2, treat2)
[1] 1 2 3 4 5 8 9 10 11 12 13 14 15 16 17 18 21 22 23 24 25 26 27 28 29 30 31 34 35 36 37 38 39 40 41 42 43
[38] 44 47 48 49 50 51 52 53 54 55 56 57 60 61 62 63 64 65 67 68 69 70 71 73 74 75 76 77 78 80 81 82 83 84 86 87 88
[75] 89 90 91 93 94 95 96 97 99 100 101 102 103 104 106 107 108 109 110 112 113 114 115 116 117 119 120 121 122 123
> cpmatch(control1, treat1)
[1] 1 2 3 4 5 8 9 10 11 12 13 14 15 16 17 18 21 22 23 24 25 26 27 28 29 30 31 34 35 36 37 38 39 40 41 42 43
[38] 44 47 48 49 50 51 52 53 54 55 56 57 60 61 62 63 64 65 67 68 69 70 71 73 74 75 76 77 78 80 81 82 83 84 86 87 88
[75] 89 90 91 93 94 95 96 97 99 100 101 102 103 104 106 107 108 109 110 112 113 114 115 116 117
Benchmark
> microbenchmark::microbenchmark(cpmatch(control1, treat1), pmatch(control1, treat1))
Unit: microseconds
expr min lq mean median uq max neval cld
cpmatch(control1, treat1) 16.9 17.3 19.795 17.55 18.1 55.7 100 a
pmatch(control1, treat1) 174.5 174.8 187.174 175.20 188.5 421.9 100 b
Perhaps there is a way to get the desired output from pmatch, but I have not been able to figure out how. I tried looking at the source code for the pmatch function here:
R-4.0.3\src\library\base\R\match.R
But was not able to make progress that way.
So, I wrote the following for-loop to apply to the output of pmatch and replace the NA's with the elements I wanted. It seems to work, at least for the example below.
my.vector <- c(1, NA, NA, NA, NA, 8, NA, NA, NA, NA,
               NA, 14, NA, NA, NA, NA, 21, NA, NA, NA,
               NA, NA, 27, NA, NA, NA, NA, 34, NA, NA, NA, NA, NA)
desired.result <- c(1, 2, 3, 4, 5, 8, 9, 10, 11, 12,
                    13, 14, 15, 16, 17, 18, 21, 22, 23, 24,
                    25, 26, 27, 28, 29, 30, 31, 34, 35, 36, 37, 38, 39)
pos.not.na <- which(!is.na(my.vector))
if (any(is.na(my.vector))) {
  my.output <- my.vector
  # fill the gap after each non-NA position with a run of consecutive integers
  for (i in 2:length(pos.not.na)) {
    idx <- pos.not.na[i - 1]:(pos.not.na[i] - 1)
    my.output[idx] <- seq(my.vector[pos.not.na[i - 1]], length.out = length(idx))
  }
  # fill from the last non-NA position to the end of the vector
  idx <- pos.not.na[length(pos.not.na)]:length(my.vector)
  my.output[idx] <- seq(my.vector[pos.not.na[length(pos.not.na)]], length.out = length(idx))
} else {
  my.output <- my.vector
}
my.output
all.equal(my.output, desired.result)
#[1] TRUE
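For completeness, the run-aligned matching can also be done in base R without pmatch. This sketch assumes (as in your data) that you want the k-th occurrence of each value in control paired with the position of the k-th occurrence of that value in treat:

```r
# Tag each element with its occurrence number within its value,
# then match (value, occurrence) pairs between the two vectors
occurrence <- function(x) ave(seq_along(x), x, FUN = seq_along)
match_runs <- function(control, treat) {
  match(paste(control, occurrence(control)),
        paste(treat, occurrence(treat)))
}
match_runs(c(1, 1, 2, 3), c(1, 1, 1, 2, 2, 3))
#[1] 1 2 4 6
```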

Plotting contour plots from melted dataframes

I have a dataset which is a function of time and frequency. When I melt it, I want to retain the actual time (date) and frequency values, as I want a 2D plot with frequency on the y axis and time on the x axis.
I tried to retain a column of the desired axis values, but melting turns it into a factor and stat_contour throws an error.
My data is some thing like the following
a = data.frame(date=time,power=power)
names(a) = c('date',period)
where period is
[1] 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8
[23] 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8
[45] 8 8 8 8 8 8 16 16 16 16 16 16 16 16 16 16 16 16 16 16 16 16
[67] 16 16 16 16 16 16 16 16 16 16 16 16 16 16 16 16 16 16 16 16 16 16
[89] 16 16 16 16 16 16 16 16 16 16 16 16 32 32 32 32 32 32 32 32 32 32
[111] 32 32 32 32 32 32 32 32 32 32 32 32 32 32 32 32 32 32 32 32 32 32
[133] 32 32 32 32 32 32 32 32 32 32 32 32 32 32 32 32 32 32 64 64 64 64
[155] 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64
[177] 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64
[199] 64 64 128 128 128 128 128 128 128 128 128 128 128 128 128 128 128 128 128 128 128 128
[221] 128 128 128 128 128 128 128 128 128 128 128 128 128 128 128 128 128 128 128 128 128 128
[243] 128 128 128 128 128 128 128 128 256
power = melt(a,id.vars = 'date')
date period power
1 850-01-01 8 0.05106766
2 851-01-01 8 0.05926821
3 852-01-01 8 0.06783015
4 853-01-01 8 0.07681627
5 854-01-01 8 0.08636516
6 855-01-01 8 0.09667054
ggplot(power, aes(x = date, y = period, z = power)) +
stat_contour()
this gives an error as the period column is a factor; if I make it numeric I lose the exact y axis labels. Is there any workaround?
thanks
You did not provide a reproducible example, but in principle you have to change the axis labels manually:
library(ggplot2)
d <- expand.grid(x = 1:100, y = paste("P", 1:100))
d$z <- rnorm(10000)
ggplot(d, aes(x, as.numeric(y), z = z)) +
  geom_contour() +
  scale_y_continuous(breaks = seq_along(levels(d$y)), labels = levels(d$y))
However, I find it a bit difficult to really understand what a contour plot with one qualitative axis should represent. My guess is that your y axis is a factor because of the melt operation: some column is wrongly specified as a factor, and hence the overall column becomes a factor when in fact it should be numeric (that is, it shows 8 and should represent 8, but is actually a factor with level 8).
If this is the case, a plain as.numeric does not work as expected, because it simply maps the first level to 1, the second level to 2, and so on. But you want it mapped to the numeric value of the string representation. In that case something like this should work:
library(ggplot2)
d <- expand.grid(x = 1:100, y = sample(1000, 100)) # y should be a numeric but is a factor
d$z <- rnorm(10000)
ggplot(d, aes(x, as.numeric(as.character(y)), z = z)) +
geom_contour()
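The pitfall in isolation, as a base R sketch: as.numeric() on a factor returns the underlying level codes, not the labels, which is why the detour through as.character() is needed.

```r
f <- factor(c(8, 16, 32))
as.numeric(f)                 # 1 2 3  -- the level codes, not the data
as.numeric(as.character(f))   # 8 16 32 -- the intended values
```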
This all is, however, guesswork, because you did not provide us with a full reproducible example. Please read and follow these guidelines in the future when posting questions here: how to ask a good question and how to give a reproducible example.
