How to reduce processing time of a code in R - r

Can you help me think of some way to reduce the computational time of a code that generates a certain value, which in this case I call coef, which will depend on id/date/category? Better explanations below.
I made two functions that generate the same result. As you can see in benchmark, the first function (return_values) takes twice as long as the second function (return_valuesX) to generate the same results. See that in the second function, I make some brief changes when calculating the coef variable. However, I strongly believe that there is a possibility of improving the code, as you can see in the second function, I managed to improve 50% of processing time compared to the first just with brief changes. But I'm out of ideas for new adjustments, so I would like your valuable opinion.
Code Explanations:
In general, the purpose of the code is to calculate a value, which I call a coef for each group of id, date and category. For this, the median of the values ​​resulting from the subtraction between DR1 and the values ​​of the DRM0 columns of the df1 database is first calculated. After obtaining the median (med variable), I add the values ​​found with the values ​​of the DRM0 columns of my df1 database. This calculation is my SPV variable. In both cases, I used the data.table function, which I believe is faster than using dplyr. After I get SPV, I need to calculate the coef variable for each id/date/category.
Below I will insert an example real easy to understand of the coef calculation. If for example I want to calculate coef of idd<-"3", dmda<-"2021-12-03", CategoryChosse<-"ABC", and I have the following:
> SPV %>% filter(Id==idd, date2 == ymd(dmda), Category == CategoryChosse)
Id date1 date2 Week Category DRM001_PV DRM002_PV DRM003_PV DRM004_PV DRM005_PV DRM006_PV DRM007_PV DRM008_PV DRM009_PV DRM010_PV DRM011_PV DRM012_PV
1: 3 2021-12-01 2021-12-03 Monday ABC -3 374 198 17 537 -54 330 -136 -116 534 18 -199
DRM013_PV DRM014_PV DRM015_PV DRM016_PV DRM017_PV DRM018_PV DRM019_PV DRM020_PV DRM021_PV DRM022_PV DRM023_PV DRM024_PV DRM025_PV DRM026_PV DRM027_PV DRM028_PV
1: 106 106 349 76 684 390 218 146 141 20 435 218 372 321 218 218
DRM029_PV DRM030_PV DRM031_PV DRM032_PV DRM033_PV DRM034_PV DRM035_PV DRM036_PV DRM037_PV DRM038_PV DRM039_PV DRM040_PV DRM041_PV DRM042_PV DRM043_PV DRM044_PV
1: 55 455 46 411 262 449 325 467 43 -114 191 167 63 -123 252 218
DRM045_PV DRM046_PV DRM047_PV DRM048_PV DRM049_PV DRM050_PV DRM051_PV DRM052_PV DRM053_PV DRM054_PV DRM055_PV DRM056_PV DRM057_PV DRM058_PV DRM059_PV DRM060_PV
1: 305 420 -296 596 200 218 190 203 607 218 442 -72 463 129 -39 333
DRM061_PV DRM062_PV DRM063_PV DRM064_PV DRM065_PV DRM066_PV DRM067_PV DRM068_PV DRM069_PV DRM070_PV DRM071_PV DRM072_PV DRM073_PV DRM074_PV DRM075_PV DRM076_PV
1: -26 160 -91 326 218 369 317 476 224 61 195 613 342 218 204 521
DRM077_PV DRM078_PV DRM079_PV DRM080_PV DRM081_PV DRM082_PV DRM083_PV DRM084_PV DRM085_PV DRM086_PV DRM087_PV DRM088_PV DRM089_PV DRM090_PV DRM091_PV DRM092_PV
1: 588 218 449 340 51 508 -72 42 492 510 328 818 -132 -105 210 -102
DRM093_PV DRM094_PV DRM095_PV DRM096_PV DRM097_PV DRM098_PV DRM099_PV DRM0100_PV DRM0101_PV DRM0102_PV DRM0103_PV DRM0104_PV DRM0105_PV DRM0106_PV DRM0107_PV
1: -137 94 639 265 -64 512 32 -53 414 340 -16 471 434 150 267
DRM0108_PV DRM0109_PV DRM0110_PV DRM0111_PV DRM0112_PV DRM0113_PV DRM0114_PV DRM0115_PV DRM0116_PV DRM0117_PV DRM0118_PV DRM0119_PV DRM0120_PV DRM0121_PV DRM0122_PV
1: 383 -162 434 -134 -39 450 212 146 -26 8 222 341 601 239 57
DRM0123_PV DRM0124_PV DRM0125_PV DRM0126_PV DRM0127_PV DRM0128_PV DRM0129_PV DRM0130_PV DRM0131_PV DRM0132_PV DRM0133_PV DRM0134_PV DRM0135_PV DRM0136_PV DRM0137_PV
1: 484 239 502 415 504 62 487 168 101 319 365 37 218 -50 230
DRM0138_PV DRM0139_PV DRM0140_PV DRM0141_PV DRM0142_PV DRM0143_PV DRM0144_PV DRM0145_PV DRM0146_PV DRM0147_PV DRM0148_PV DRM0149_PV DRM0150_PV DRM0151_PV DRM0152_PV
1: 493 159 150 132 58 21 468 -81 27 345 107 148 -66 -146 -185
DRM0153_PV DRM0154_PV DRM0155_PV DRM0156_PV DRM0157_PV DRM0158_PV DRM0159_PV DRM0160_PV DRM0161_PV DRM0162_PV DRM0163_PV DRM0164_PV DRM0165_PV DRM0166_PV DRM0167_PV
1: -14 562 68 140 353 120 130 301 76 441 218 370 218 378 -22
DRM0168_PV DRM0169_PV DRM0170_PV DRM0171_PV DRM0172_PV DRM0173_PV DRM0174_PV DRM0175_PV DRM0176_PV DRM0177_PV DRM0178_PV DRM0179_PV DRM0180_PV DRM0181_PV DRM0182_PV
1: -279 563 628 600 152 218 445 246 420 94 495 509 356 183 326
DRM0183_PV DRM0184_PV DRM0185_PV DRM0186_PV DRM0187_PV DRM0188_PV DRM0189_PV DRM0190_PV DRM0191_PV DRM0192_PV DRM0193_PV DRM0194_PV DRM0195_PV DRM0196_PV DRM0197_PV
1: 493 -190 -65 -123 376 357 473 112 -69 471 452 221 165 -44 87
DRM0198_PV DRM0199_PV DRM0200_PV DRM0201_PV DRM0202_PV DRM0203_PV DRM0204_PV DRM0205_PV DRM0206_PV DRM0207_PV DRM0208_PV DRM0209_PV DRM0210_PV DRM0211_PV DRM0212_PV
1: 239 285 521 -65 158 223 160 223 269 57 218 218 102 329 218
DRM0213_PV DRM0214_PV DRM0215_PV DRM0216_PV DRM0217_PV DRM0218_PV DRM0219_PV DRM0220_PV DRM0221_PV DRM0222_PV DRM0223_PV DRM0224_PV DRM0225_PV DRM0226_PV DRM0227_PV
1: 769 215 -68 218 347 18 218 547 759 278 -80 -37 629 -16 774
DRM0228_PV DRM0229_PV DRM0230_PV DRM0231_PV DRM0232_PV DRM0233_PV DRM0234_PV DRM0235_PV DRM0236_PV DRM0237_PV DRM0238_PV DRM0239_PV DRM0240_PV DRM0241_PV DRM0242_PV
1: 364 113 -132 31 536 118 248 385 218 202 218 41 23 218 379
DRM0243_PV DRM0244_PV DRM0245_PV DRM0246_PV DRM0247_PV DRM0248_PV DRM0249_PV DRM0250_PV DRM0251_PV DRM0252_PV DRM0253_PV DRM0254_PV DRM0255_PV DRM0256_PV DRM0257_PV
1: -158 462 600 221 218 221 442 218 53 218 176 504 -61 78 68
DRM0258_PV DRM0259_PV DRM0260_PV DRM0261_PV DRM0262_PV DRM0263_PV DRM0264_PV DRM0265_PV DRM0266_PV DRM0267_PV DRM0268_PV DRM0269_PV DRM0270_PV DRM0271_PV DRM0272_PV
1: 493 403 218 339 299 749 -18 465 686 -215 579 307 366 279 94
DRM0273_PV DRM0274_PV DRM0275_PV DRM0276_PV DRM0277_PV DRM0278_PV DRM0279_PV DRM0280_PV DRM0281_PV DRM0282_PV DRM0283_PV DRM0284_PV DRM0285_PV DRM0286_PV DRM0287_PV
1: 138 56 459 613 219 400 35 -74 516 218 -80 317 310 -231 229
DRM0288_PV DRM0289_PV DRM0290_PV DRM0291_PV DRM0292_PV DRM0293_PV DRM0294_PV DRM0295_PV DRM0296_PV DRM0297_PV DRM0298_PV DRM0299_PV DRM0300_PV DRM0301_PV DRM0302_PV
1: 345 -70 619 235 122 61 337 -163 210 586 127 -112 368 365 476
DRM0303_PV DRM0304_PV DRM0305_PV DRM0306_PV DRM0307_PV DRM0308_PV DRM0309_PV DRM0310_PV DRM0311_PV DRM0312_PV DRM0313_PV DRM0314_PV DRM0315_PV DRM0316_PV DRM0317_PV
1: 240 270 497 97 420 -184 212 -28 151 527 186 -32 60 96 -86
DRM0318_PV DRM0319_PV DRM0320_PV DRM0321_PV DRM0322_PV DRM0323_PV DRM0324_PV DRM0325_PV DRM0326_PV DRM0327_PV DRM0328_PV DRM0329_PV DRM0330_PV DRM0331_PV DRM0332_PV
1: 454 321 300 552 319 134 -63 622 441 297 507 578 198 360 542
DRM0333_PV DRM0334_PV DRM0335_PV DRM0336_PV DRM0337_PV DRM0338_PV DRM0339_PV DRM0340_PV DRM0341_PV DRM0342_PV DRM0343_PV DRM0344_PV DRM0345_PV DRM0346_PV DRM0347_PV
1: 153 318 68 763 370 337 633 469 453 146 428 418 169 468 526
DRM0348_PV DRM0349_PV DRM0350_PV DRM0351_PV DRM0352_PV DRM0353_PV DRM0354_PV DRM0355_PV DRM0356_PV DRM0357_PV DRM0358_PV DRM0359_PV DRM0360_PV DRM0361_PV DRM0362_PV
1: 441 674 21 -182 174 153 -158 268 191 460 10 82 543 -193 218
DRM0363_PV DRM0364_PV DRM0365_PV
1: -203 269 479
> SPV %>% filter(Id==idd, date2 == ymd(dmda), Category == CategoryChosse)
Id date1 date2 Week Category DRM001_PV DRM002_PV DRM003_PV DRM004_PV DRM005_PV DRM006_PV DRM007_PV DRM008_PV DRM009_PV DRM010_PV DRM011_PV DRM012_PV
1: 3 2021-12-01 2021-12-03 Monday ABC -3 374 198 17 537 -54 330 -136 -116 534 18 -199
DRM013_PV DRM014_PV DRM015_PV DRM016_PV DRM017_PV DRM018_PV DRM019_PV DRM020_PV DRM021_PV DRM022_PV DRM023_PV DRM024_PV DRM025_PV DRM026_PV DRM027_PV DRM028_PV
1: 106 106 349 76 684 390 218 146 141 20 435 218 372 321 218 218
DRM029_PV DRM030_PV DRM031_PV DRM032_PV DRM033_PV DRM034_PV DRM035_PV DRM036_PV DRM037_PV DRM038_PV DRM039_PV DRM040_PV DRM041_PV DRM042_PV DRM043_PV DRM044_PV
1: 55 455 46 411 262 449 325 467 43 -114 191 167 63 -123 252 218
DRM045_PV DRM046_PV DRM047_PV DRM048_PV DRM049_PV DRM050_PV DRM051_PV DRM052_PV DRM053_PV DRM054_PV DRM055_PV DRM056_PV DRM057_PV DRM058_PV DRM059_PV DRM060_PV
1: 305 420 -296 596 200 218 190 203 607 218 442 -72 463 129 -39 333
DRM061_PV DRM062_PV DRM063_PV DRM064_PV DRM065_PV DRM066_PV DRM067_PV DRM068_PV DRM069_PV DRM070_PV DRM071_PV DRM072_PV DRM073_PV DRM074_PV DRM075_PV DRM076_PV
1: -26 160 -91 326 218 369 317 476 224 61 195 613 342 218 204 521
DRM077_PV DRM078_PV DRM079_PV DRM080_PV DRM081_PV DRM082_PV DRM083_PV DRM084_PV DRM085_PV DRM086_PV DRM087_PV DRM088_PV DRM089_PV DRM090_PV DRM091_PV DRM092_PV
1: 588 218 449 340 51 508 -72 42 492 510 328 818 -132 -105 210 -102
DRM093_PV DRM094_PV DRM095_PV DRM096_PV DRM097_PV DRM098_PV DRM099_PV DRM0100_PV DRM0101_PV DRM0102_PV DRM0103_PV DRM0104_PV DRM0105_PV DRM0106_PV DRM0107_PV
1: -137 94 639 265 -64 512 32 -53 414 340 -16 471 434 150 267
DRM0108_PV DRM0109_PV DRM0110_PV DRM0111_PV DRM0112_PV DRM0113_PV DRM0114_PV DRM0115_PV DRM0116_PV DRM0117_PV DRM0118_PV DRM0119_PV DRM0120_PV DRM0121_PV DRM0122_PV
1: 383 -162 434 -134 -39 450 212 146 -26 8 222 341 601 239 57
DRM0123_PV DRM0124_PV DRM0125_PV DRM0126_PV DRM0127_PV DRM0128_PV DRM0129_PV DRM0130_PV DRM0131_PV DRM0132_PV DRM0133_PV DRM0134_PV DRM0135_PV DRM0136_PV DRM0137_PV
1: 484 239 502 415 504 62 487 168 101 319 365 37 218 -50 230
DRM0138_PV DRM0139_PV DRM0140_PV DRM0141_PV DRM0142_PV DRM0143_PV DRM0144_PV DRM0145_PV DRM0146_PV DRM0147_PV DRM0148_PV DRM0149_PV DRM0150_PV DRM0151_PV DRM0152_PV
1: 493 159 150 132 58 21 468 -81 27 345 107 148 -66 -146 -185
DRM0153_PV DRM0154_PV DRM0155_PV DRM0156_PV DRM0157_PV DRM0158_PV DRM0159_PV DRM0160_PV DRM0161_PV DRM0162_PV DRM0163_PV DRM0164_PV DRM0165_PV DRM0166_PV DRM0167_PV
1: -14 562 68 140 353 120 130 301 76 441 218 370 218 378 -22
DRM0168_PV DRM0169_PV DRM0170_PV DRM0171_PV DRM0172_PV DRM0173_PV DRM0174_PV DRM0175_PV DRM0176_PV DRM0177_PV DRM0178_PV DRM0179_PV DRM0180_PV DRM0181_PV DRM0182_PV
1: -279 563 628 600 152 218 445 246 420 94 495 509 356 183 326
DRM0183_PV DRM0184_PV DRM0185_PV DRM0186_PV DRM0187_PV DRM0188_PV DRM0189_PV DRM0190_PV DRM0191_PV DRM0192_PV DRM0193_PV DRM0194_PV DRM0195_PV DRM0196_PV DRM0197_PV
1: 493 -190 -65 -123 376 357 473 112 -69 471 452 221 165 -44 87
DRM0198_PV DRM0199_PV DRM0200_PV DRM0201_PV DRM0202_PV DRM0203_PV DRM0204_PV DRM0205_PV DRM0206_PV DRM0207_PV DRM0208_PV DRM0209_PV DRM0210_PV DRM0211_PV DRM0212_PV
1: 239 285 521 -65 158 223 160 223 269 57 218 218 102 329 218
DRM0213_PV DRM0214_PV DRM0215_PV DRM0216_PV DRM0217_PV DRM0218_PV DRM0219_PV DRM0220_PV DRM0221_PV DRM0222_PV DRM0223_PV DRM0224_PV DRM0225_PV DRM0226_PV DRM0227_PV
1: 769 215 -68 218 347 18 218 547 759 278 -80 -37 629 -16 774
DRM0228_PV DRM0229_PV DRM0230_PV DRM0231_PV DRM0232_PV DRM0233_PV DRM0234_PV DRM0235_PV DRM0236_PV DRM0237_PV DRM0238_PV DRM0239_PV DRM0240_PV DRM0241_PV DRM0242_PV
1: 364 113 -132 31 536 118 248 385 218 202 218 41 23 218 379
DRM0243_PV DRM0244_PV DRM0245_PV DRM0246_PV DRM0247_PV DRM0248_PV DRM0249_PV DRM0250_PV DRM0251_PV DRM0252_PV DRM0253_PV DRM0254_PV DRM0255_PV DRM0256_PV DRM0257_PV
1: -158 462 600 221 218 221 442 218 53 218 176 504 -61 78 68
DRM0258_PV DRM0259_PV DRM0260_PV DRM0261_PV DRM0262_PV DRM0263_PV DRM0264_PV DRM0265_PV DRM0266_PV DRM0267_PV DRM0268_PV DRM0269_PV DRM0270_PV DRM0271_PV DRM0272_PV
1: 493 403 218 339 299 749 -18 465 686 -215 579 307 366 279 94
DRM0273_PV DRM0274_PV DRM0275_PV DRM0276_PV DRM0277_PV DRM0278_PV DRM0279_PV DRM0280_PV DRM0281_PV DRM0282_PV DRM0283_PV DRM0284_PV DRM0285_PV DRM0286_PV DRM0287_PV
1: 138 56 459 613 219 400 35 -74 516 218 -80 317 310 -231 229
DRM0288_PV DRM0289_PV DRM0290_PV DRM0291_PV DRM0292_PV DRM0293_PV DRM0294_PV DRM0295_PV DRM0296_PV DRM0297_PV DRM0298_PV DRM0299_PV DRM0300_PV DRM0301_PV DRM0302_PV
1: 345 -70 619 235 122 61 337 -163 210 586 127 -112 368 365 476
DRM0303_PV DRM0304_PV DRM0305_PV DRM0306_PV DRM0307_PV DRM0308_PV DRM0309_PV DRM0310_PV DRM0311_PV DRM0312_PV DRM0313_PV DRM0314_PV DRM0315_PV DRM0316_PV DRM0317_PV
1: 240 270 497 97 420 -184 212 -28 151 527 186 -32 60 96 -86
DRM0318_PV DRM0319_PV DRM0320_PV DRM0321_PV DRM0322_PV DRM0323_PV DRM0324_PV DRM0325_PV DRM0326_PV DRM0327_PV DRM0328_PV DRM0329_PV DRM0330_PV DRM0331_PV DRM0332_PV
1: 454 321 300 552 319 134 -63 622 441 297 507 578 198 360 542
DRM0333_PV DRM0334_PV DRM0335_PV DRM0336_PV DRM0337_PV DRM0338_PV DRM0339_PV DRM0340_PV DRM0341_PV DRM0342_PV DRM0343_PV DRM0344_PV DRM0345_PV DRM0346_PV DRM0347_PV
1: 153 318 68 763 370 337 633 469 453 146 428 418 169 468 526
DRM0348_PV DRM0349_PV DRM0350_PV DRM0351_PV DRM0352_PV DRM0353_PV DRM0354_PV DRM0355_PV DRM0356_PV DRM0357_PV DRM0358_PV DRM0359_PV DRM0360_PV DRM0361_PV DRM0362_PV
1: 441 674 21 -182 174 153 -158 268 191 460 10 82 543 -193 218
DRM0363_PV DRM0364_PV DRM0365_PV
1: -203 269 479
So coef will be ymd(dmda) - ymd(min(df1$date1)). That is, if I do to this id/date/category that I mentioned I get a difference of 2 days, so the value I want is the DRM003_PV . So the value for this case is 198. Therefore, I made:
coef<-SPV %>%
filter(Id==idd, date2 == ymd(dmda), Category == CategoryChosse) %>%
pull(as.numeric(ymd(dmda)-ymd(min(df1$date1)))+6)
> coef
[1] 198
This issue has been resolved here: Adjust code to choose a specific column depending on the difference between dates
Libraries and database
library(tidyverse)
library(lubridate)
library(data.table)
library(bench)
set.seed(123)
df1 <- data.frame( Id = rep(1:5, length=800),
date1 = as.Date( "2021-12-01"),
date2= rep(seq( as.Date("2021-01-01"), length.out=400, by=1), each = 2),
Category = rep(c("ABC", "EFG"), length.out = 800),
Week = rep(c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday",
"Saturday", "Sunday"), length.out = 800),
DR1 = sample( 200:250, 800, repl=TRUE),
setNames( replicate(365, { sample(0:800, 800)}, simplify=FALSE),
paste0("DRM0", formatC(1:365, width = 2, format = "d", flag = "0"))))
First function
return_values <- function (df1,idd,dmda, CategoryChosse) {
# First idea: Calculate the median of the values resulting from the subtraction between DR1 and the values of the DRM0 columns
dt1 <- as.data.table(df1)
cols <- grep("^DRM0", colnames(dt1), value = TRUE)
med <-
dt1[, (paste0(cols, "_PV")) := DR1 - .SD, .SDcols = cols
][, lapply(.SD, median), by = .(Id, Category, Week), .SDcols = paste0(cols, "_PV") ]
# Second idea: After obtaining the median, I add the values found with the values of the DRM columns of my df1 database.
f2 <- function(nm, pat) grep(pat, nm, value = TRUE)
nm1 <- f2(names(df1), "^DRM0\\d+$")
nm2 <- f2(names(med), "_PV")
nm3 <- paste0("i.", nm2)
setDT(df1)[med,(nm2) := Map(`+`, mget(nm1), mget(nm3)), on = .(Id, Category, Week)]
SPV <- df1[, c('Id','date1', 'date2', 'Week','Category', nm2), with = FALSE]#%>%data.frame
# Third idea: Calculate the coef values
coef<-SPV %>%
filter(Id==idd, date2 == ymd(dmda), Category == CategoryChosse) %>%
pull(as.numeric(ymd(dmda)-ymd(min(df1$date1)))+6)
return(coef)
}
Results using first function
subset_df1 <- subset(df1, date2 > date1)
a<-subset_df1 %>%
rowwise %>%
select(-c(Week,starts_with('DR')))%>%
mutate(Result=return_values(df1,Id, date2, Category)) %>%
data.frame()
> a
Id date1 date2 Category Result
1 1 2021-12-01 2021-12-02 ABC 4.0
2 2 2021-12-01 2021-12-02 EFG 238.0
3 3 2021-12-01 2021-12-03 ABC 198.0
4 4 2021-12-01 2021-12-03 EFG 163.0
5 5 2021-12-01 2021-12-04 ABC 462.0
...........
Second function
return_valuesX <- function (df1,idd,dmda, CategoryChosse) {
# First idea: Calculate the median of the values resulting from the subtraction between DR1 and the values of the DRM columns
dt1 <- as.data.table(df1)
num_to_pull <- as.numeric(ymd(dmda)-ymd(min(df1$date1)))+6
cols <- grep("^DRM0", colnames(dt1), value = TRUE)[1:num_to_pull]
med <-
dt1[, (paste0(cols, "_PV")) := DR1 - .SD, .SDcols = cols
][, lapply(.SD, median), by = .(Id, Category, Week), .SDcols = paste0(cols, "_PV") ]
# Second idea: After obtaining the median, I add the values found with the values of the DRM columns of my df1 database.
f2 <- function(nm, pat) grep(pat, nm, value = TRUE)
nm1 <- f2(names(df1), "^DRM0\\d+$")[1:num_to_pull]
nm2 <- f2(names(med), "_PV")[1:num_to_pull]
nm3 <- paste0("i.", nm2)[1:num_to_pull]
setDT(df1)[med,(nm2) := Map(`+`, mget(nm1), mget(nm3)), on = .(Id, Category, Week)]
SPV <- df1[, c('Id','date1', 'date2', 'Week','Category', nm2), with = FALSE]#%>%data.frame
# Third idea: Calculate the coef values
coef<-SPV %>%
filter(Id==idd, date2 == ymd(dmda), Category == CategoryChosse) %>%
pull(num_to_pull)
return(coef)
}
Results using second function
b<-subset_df1 %>%
rowwise %>%
select(-c(Week,starts_with('DR')))%>%
mutate(Result = return_valuesX(df1,Id, date2, Category)) %>%
data.frame()
> b
Id date1 date2 Category Result
1 1 2021-12-01 2021-12-02 ABC 4.0
2 2 2021-12-01 2021-12-02 EFG 238.0
3 3 2021-12-01 2021-12-03 ABC 198.0
4 4 2021-12-01 2021-12-03 EFG 163.0
5 5 2021-12-01 2021-12-04 ABC 462.0
...............
Comparing the two results:
identical(a, b)
[1] TRUE
Calculate processing time using benchmark
subset_df1 <- subset(df1, date2 > date1)
bench::mark(a=subset_df1 %>%
rowwise %>%
select(-c(Week,starts_with('DR')))%>%
mutate(Result=return_values(df1,Id, date2, Category)),
b=subset_df1 %>%
rowwise %>%
select(-c(Week,starts_with('DR')))%>%
mutate(Result=return_valuesX(df1,Id, date2, Category)),iterations = 1)
# A tibble: 2 x 13
expression min median `itr/sec` mem_alloc `gc/sec` n_itr n_gc total_time result memory time gc
<bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl> <int> <dbl> <bch:tm> <list> <list> <list> <list>
1 a 53.7s 53.7s 0.0186 4.54GB 0.634 1 34 53.7s <rowwise_df [130 x 5]> <Rprofmem [981,580 x 3]> <bench_tm [1]> <tibble [1 x 3]>
2 b 21s 21s 0.0477 913.77MB 0.382 1 8 21s <rowwise_df [130 x 5]> <Rprofmem [278,340 x 3]> <bench_tm [1]> <tibble [1 x 3]>
To check df1 database

Here's an approach that is about 20x/10x faster than your functions for the example data, and would be faster yet for larger data sets. (When I run with 100k rows in df1, it's 572x faster.) I hope you'll find this approach easier to understand and debug.
This is written using tidyverse functions like tidyr::pivot_longer and dplyr::group_by. If you want to squeeze out a bit more speed, the data.table and collapse packages offer faster alternatives for many functions, especially around grouped calculations. But the main speed improvement here is from restructuring to avoid repeating the same calculations over and over and letting R rely more on vectorized calculations. https://www.noamross.net/archives/2014-04-16-vectorization-in-r-why/
pre_calc <- function(df) {
pre_calc <- df1 %>% # this calculates once on the full data
select(!ends_with("_PV")) %>%
pivot_longer(-c(1:6), values_to = "DRM", names_to = "day") %>%
mutate(day = parse_number(day)) %>%
group_by(Id, Category, Week, day) %>%
mutate(med = median(DR1 - DRM), Result = DRM + med) %>%
ungroup()
df %>% # starts from the subsetted data and joins to results from above
select(1:5) %>%
left_join(pre_calc) %>%
filter(day == date2 - date1 + 1) %>%
select(Id, date1, date2, Category, Result)
}
c <- subset_df1 %>% pre_calc()
c matches a and b from your tests, with the one difference that
date2 (originally date-integer, which is a nonstandard type) has in my approach been
coerced into a typical date-double, like date1. We can use
typeof(df1$date1) & typeof(df1$date2) to see this.
waldo::compare(b, c) confirms the results otherwise match. I
opened an issue with tidyr here since the subtle change seems
to have been caused by the pivot_longer step.
UPDATE: Apparently the creation of a date-integer object is a bug in base R's seq.Date / seq function, which was fixed in R 4.2: https://github.com/tidyverse/tidyr/issues/1356#issuecomment-1111078891
In the approach above, I pre-calculate all the results once, by taking the original data set df1, throwing out the existing _PV columns (I think they get overwritten?), and -- here's where the speed gains come from -- reshaping to long format. While this single operation is computationally expensive, it means we can more efficiently apply the same calculation to all the DRM_* columns at once, and we can rely on fast filtering instead of slow subsetting to extract our result.
The group_by(Id, Category, Week, day) line and the next mutate(... line let us calculate the median differences between DR1 and that day's DRM for each Id-Category-Week combination, so we can calculate all the Results at once.
The last part takes the df lines (e.g. the subsetted data in your example where date2 > date1) and attaches those to the pre-calculated results, filtering to get the right day (previously encoded by column name/position).

Related

How to calculate Williams %R in RStudio?

I am trying to write a function to calculate Williams %R on data in R. Here is my code:
getSymbols('AMD', src = 'yahoo', from = '2018-01-01')
wr = function(high, low, close, n) {
highh = runMax((high),n)
lowl = runMin((low),n)
-100 * ((highh - close) / (highh - lowl))
}
williampr = wr(AMD$AMD.High, AMD$AMD.Low, AMD$AMD.Close, n = 10)
After implementing a buy/sell/hold signal, it returns integer(0):
## 1 = BUY, 0 = HOLD, -1 = SELL
## implement Lag to shift the time back to the previous day
tradingSignal = Lag(
## if wpr is greater than 0.8, BUY
ifelse(Lag(williampr) > 0.8 & williampr < 0.8,1,
## if wpr signal is less than 0.2, SELL, else, HOLD
ifelse(Lag(williampr) > 0.2 & williampr < 0.2,-1,0)))
## make all missing values equal to 0
tradingSignal[is.na(tradingSignal)] = 0
## see how many SELL signals we have
which(tradingSignal == "-1")
What am I doing wrong?
It would have been a good idea to identify that you were using the package quantmod in your question.
There are two things preventing this from working.
You didn't inspect what you expected! Your results in williampr are all negative. Additionally, you multiplied the values by 100, so 80% is 80, not .8. I removed -100 *.
I have done the same thing so many times.
wr = function(high, low, close, n) {
highh = runMax((high),n)
lowl = runMin((low),n)
((highh - close) / (highh - lowl))
}
That's it. It works now.
which(tradingSignal == "-1")
# [1] 13 15 19 22 39 71 73 84 87 104 112 130 134 136 144 146 151 156 161 171 175
# [22] 179 217 230 255 268 288 305 307 316 346 358 380 386 404 449 458 463 468 488 492 494
# [43] 505 510 515 531 561 563 570 572 574 594 601 614 635 642 644 646 649 666 668 672 691
# [64] 696 698 719 729 733 739 746 784 807 819 828 856 861 872 877 896 900 922 940 954 968
# [85] 972 978 984 986 1004 1035 1048 1060

fast partial match checking in R (or Python or Julia)

I have two dataset with names and I need to compare names in both datasets. I just need to keep the union of the two datasets based on the names. However, a name is still considered 'matched' if it is part of the another name even if it is not a full match and vice versa. For example, "seb" should match to "seb", but also to "sebas". I am using str_detect(), but it is too slow. I am wondering if there is any way to speed up this process. I tried some other packages and functions, but nothing really improved the speed. I am open for any R or Python solution.
Create two dummy datasets
library(dplyr)
library(stringr)
set.seed(1)
data_set_A <- tibble(name = unique(replicate(2000, paste(sample(letters, runif(1, 3, 10), replace = T), collapse = "")))) %>%
mutate(ID_A = 1:n())
set.seed(2)
data_set_B <- tibble(name_2 = unique(replicate(2000, paste(sample(letters, runif(1, 3, 10), replace = T), collapse = "")))) %>%
mutate(ID_B = 1:n())
Test matching of full matches only
# This is almost instant
data_set_A %>%
rowwise() %>%
filter(any(name %in% data_set_B$name_2) | any(data_set_B$name_2 %in% name)) %>%
ungroup()
# A tibble: 4 x 2
name ID_A
<chr> <int>
1 vnt 112
2 fly 391
3 cug 1125
4 xgv 1280
Include partial matches (This is what I want to optimize)
This of course only gives me the subset of dataset A, but that is ok.
# This takes way too long
data_set_A %>%
rowwise() %>%
filter(any(str_detect(name, data_set_B$name_2)) | any(str_detect(data_set_B$name_2, name))) %>%
ungroup()
A tibble: 237 x 2
name ID_A
<chr> <int>
1 wknrsauuj 2
2 lyw 7
3 igwsvrzpk 16
4 zozxjpu 18
5 cgn 22
6 oqo 45
7 gkritbe 47
8 uuq 92
9 lhwfyksz 94
10 tuw 100
Fuzzyjoin method.
This also works, but is equally slow
bind_rows(
fuzzyjoin::fuzzy_inner_join(
data_set_A,
data_set_B,
by = c("name" = "name_2"),
match_fun = stringr::str_detect
) %>%
select(name, ID_A),
fuzzyjoin::fuzzy_inner_join(
data_set_B,
data_set_A,
by = c("name_2" = "name"),
match_fun = stringr::str_detect
) %>%
select(name, ID_A)
) %>%
distinct()
data.table solution
not much faster unfortunately
library(data.table)
setDT(data_set_A)
setDT(data_set_B)
data_set_A[data_set_A[, .I[any(str_detect(name, data_set_B$name_2)) |
any(str_detect(data_set_B$name_2, name))], by = .(ID_A)]$V1]
This is an [r] option aimed at reducing the number of times you are calling str_detect() (i.e., your example is slow because the function is called several thousand times; and for not using fixed() or fixed = TRUE as jpiversen already pointed out). Answer explained in comments in the code; I will try to jump on tomorrow to explain a bit more.
This should scale reasonably well and be more memory efficient than the current approach too because reduces the rowwise computations to an absolute minimum.
Benchmarks:
n = 2000
# A tibble: 4 × 13
expression min median `itr/sec` mem_alloc `gc/sec` n_itr
<bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl> <int>
1 original() 6.67s 6.67s 0.150 31.95MB 0.300 1
2 using_fixed() 496.54ms 496.54ms 2.01 61.39MB 4.03 1
3 using_map_fixed() 493.35ms 493.35ms 2.03 60.27MB 6.08 1
4 andrew_fun() 167.78ms 167.78ms 5.96 1.59MB 0 1
n = 4000
Note: I am not sure if you need the answer to scale; but the approach of reducing the memory-intensive part does seem to do just that (although the time difference is negligible for n = 4000 for 1 iteration, IMO).
# A tibble: 4 × 13
expression min median `itr/sec` mem_alloc `gc/sec` n_itr
<bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl> <int>
1 original() 26.63s 26.63s 0.0376 122.33MB 0.150 1
2 using_fixed() 1.91s 1.91s 0.525 243.96MB 3.67 1
3 using_map_fixed() 1.87s 1.87s 0.534 236.62MB 3.20 1
4 andrew_fun() 674.36ms 674.36ms 1.48 7.59MB 0 1
Code w/ comments:
# This is so we do not retain the strings with the max number of
# characters in our pattern because we are checking with %in% already
nchar_a = nchar(data_set_A$name)
nchar_b = nchar(data_set_B$name_2)
# Creating large patterns (excluding values w/ max number of characters)
pattern_a = str_c(unique(data_set_A$name[nchar_a != max(nchar_a, na.rm = TRUE)]), collapse = "|")
pattern_b = str_c(unique(data_set_B$name_2[nchar_b != max(nchar_b, na.rm = TRUE)]), collapse = "|")
# First checking using %in%
idx_a = data_set_A$name %in% data_set_B$name_2
# Next, IDing when a(string) matches b(pattern)
idx_a[!idx_a] = str_detect(data_set_A$name[!idx_a], pattern_b)
# IDing a(pattern) matches b(string) so we do not run every row of
# a(as a pattern) against all of b
b_to_check = data_set_B$name_2[str_detect(data_set_B$name_2, pattern_a)]
# Using unmatched values of a as a pattern for the reduced set for b
idx_a[!idx_a] = vapply(data_set_A$name[!idx_a], function(name) {
any(grepl(name, b_to_check, fixed = TRUE))
}, logical(1L), USE.NAMES = FALSE)
data_set_A[idx_a, ]
# A tibble: 237 × 2
name ID_A
<chr> <int>
1 wknrsauuj 2
2 lyw 7
3 igwsvrzpk 16
4 zozxjpu 18
5 cgn 22
6 oqo 45
7 gkritbe 47
8 uuq 92
9 lhwfyksz 94
10 tuw 100
# … with 227 more rows
Reproducible R code for benchmarks
The following code is largely taken from jpiversen who provided a great answer:
library(dplyr)
library(stringr)
n = 2000
set.seed(1)
data_set_A <- tibble(name = unique(replicate(n, paste(sample(letters, runif(1, 3, 10), replace = T), collapse = "")))) %>%
mutate(ID_A = 1:n())
set.seed(2)
data_set_B <- tibble(name_2 = unique(replicate(n, paste(sample(letters, runif(1, 3, 10), replace = T), collapse = "")))) %>%
mutate(ID_B = 1:n())
original <- function() {
data_set_A %>%
rowwise() %>%
filter(any(str_detect(name, data_set_B$name_2)) | any(str_detect(data_set_B$name_2, name))) %>%
ungroup()
}
using_fixed <- function() {
data_set_A %>%
rowwise() %>%
filter(any(str_detect(name, fixed(data_set_B$name_2))) | any(str_detect(data_set_B$name_2, fixed(name)))) %>%
ungroup()
}
using_map_fixed <- function() {
logical_vec <- data_set_A$name %>%
purrr::map_lgl(
~any(stringr::str_detect(.x, fixed(data_set_B$name_2))) ||
any(stringr::str_detect(data_set_B$name_2, fixed(.x)))
)
data_set_A[logical_vec, ]
}
andrew_fun = function() {
nchar_a = nchar(data_set_A$name)
nchar_b = nchar(data_set_B$name_2)
pattern_a = str_c(unique(data_set_A$name[nchar_a != max(nchar_a, na.rm = TRUE)]), collapse = "|")
pattern_b = str_c(unique(data_set_B$name_2[nchar_b != max(nchar_b, na.rm = TRUE)]), collapse = "|")
idx_a = data_set_A$name %in% data_set_B$name_2
idx_a[!idx_a] = str_detect(data_set_A$name[!idx_a], pattern_b)
b_to_check = data_set_B$name_2[str_detect(data_set_B$name_2, pattern_a)]
idx_a[!idx_a] = vapply(data_set_A$name[!idx_a], function(name) {
any(grepl(name, b_to_check, fixed = TRUE))
}, logical(1L), USE.NAMES = FALSE)
data_set_A[idx_a, ]
}
bm = bench::mark(
original(),
using_fixed(),
using_map_fixed(),
andrew_fun(),
iterations = 1
)
TL;DR
The slow part is str_detect(string, pattern).
To speed it up, wrap pattern in fixed() if you got simple strings, and in coll() if you got longer, typical human text.
To get another slight speed boost, rewrite your code using purrr::map_lgl() and use this to subset your data.
Under follows examples, explanations and benchmarks.
Rewriting str_detect() using fixed() or coll()
I believe the easiest fix is to modify how str_detect() uses regex with e.g. stringr::fixed() or stringr::coll().
From ?stringr::str_detect():
Match a fixed string (i.e. by comparing only bytes), using fixed(). This is fast, but approximate. Generally, for matching human text, you'll want coll() which respects character matching rules for the specified locale.
Under is a comparison with your original code:
original <- function() {
data_set_A %>%
rowwise() %>%
filter(any(str_detect(name, data_set_B$name_2)) | any(str_detect(data_set_B$name_2, name))) %>%
ungroup()
}
# Note the use of fixed()
using_fixed <- function() {
data_set_A %>%
rowwise() %>%
filter(any(str_detect(name, fixed(data_set_B$name_2))) | any(str_detect(data_set_B$name_2, fixed(name)))) %>%
ungroup()
}
# Note the use of coll()
using_coll <- function() {
data_set_A %>%
rowwise() %>%
filter(any(str_detect(name, coll(data_set_B$name_2))) | any(str_detect(data_set_B$name_2, coll(name)))) %>%
ungroup()
}
bm <- bench::mark(
original(),
using_fixed(),
using_coll(),
iterations = 20
)
#> Warning: Some expressions had a GC in every iteration; so filtering is disabled.
bm
#> # A tibble: 3 × 6
#> expression min median `itr/sec` mem_alloc `gc/sec`
#> <bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl>
#> 1 original() 6.58s 6.59s 0.152 32.4MB 0.371
#> 2 using_fixed() 501.64ms 505.51ms 1.97 61.4MB 3.94
#> 3 using_coll() 4.48s 4.5s 0.222 61.4MB 0.512
bm %>% ggplot2::autoplot(type = "violin")
#> Loading required namespace: tidyr
Created on 2022-04-02 by the reprex package (v2.0.1)
So, as we can see, wrapping your code in fixed() will make it very fast and works well on your test data. However, it might not work as well for real human text (especially non-ASCII character sets). You should test it on your original data, and use coll() as an alternative if fixed() doesn't work.
Removing rowwise()
Another step you can take to make your code a bit faster is to get rid of rowwise(). I would replace it using purrr::map_lgl() and use this logical vector to subset the dataframe. Under is an example and a benchmark against my functions defined above:
using_map_fixed <- function() {
logical_vec <- data_set_A$name %>%
purrr::map_lgl(
~any(stringr::str_detect(.x, fixed(data_set_B$name_2))) ||
any(stringr::str_detect(data_set_B$name_2, fixed(.x)))
)
data_set_A[logical_vec, ]
}
using_map_coll <- function() {
logical_vec <- data_set_A$name %>%
purrr::map_lgl(
~any(stringr::str_detect(.x, coll(data_set_B$name_2))) ||
any(stringr::str_detect(data_set_B$name_2, coll(.x)))
)
data_set_A[logical_vec, ]
}
bm <- bench::mark(
using_fixed(),
using_map_fixed(),
using_coll(),
using_map_coll(),
iterations = 20
)
#> Warning: Some expressions had a GC in every iteration; so filtering is disabled.
bm
#> # A tibble: 4 × 6
#> expression min median `itr/sec` mem_alloc `gc/sec`
#> <bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl>
#> 1 using_fixed() 503.4ms 507.24ms 1.95 62.9MB 5.37
#> 2 using_map_fixed() 474.28ms 477.63ms 2.09 60.3MB 3.14
#> 3 using_coll() 4.49s 4.5s 0.222 61.4MB 0.489
#> 4 using_map_coll() 4.37s 4.38s 0.228 60.2MB 0.354
Created on 2022-04-02 by the reprex package (v2.0.1)
As we see, this gives another slight speed boost.
Using fixed() with data.table or fuzzyjoin
You can also use fixed() with data.table and fuzzyjoin. I have not included it here for brevity, but my benchmark shows that data.table takes about the same amount of time as my using_map_fixed() above, and fuzzyjoin takes about twice as much time.
This makes sense to me, as the slow part is str_detect(), not the method of joining/filtering, or the underlying data structure.
If you would like to use base R, the code below might be one fast option
A <- data_set_A$name
B <- data_set_B$name_2
A2B <- sapply(A, function(x) grepl(x, B, fixed = TRUE))
B2A <- sapply(B, function(x) grepl(x, A, fixed = TRUE))
idx <- which(t(A2B) | B2A, arr.ind = TRUE)
res <- cbind(data_set_A[idx[, 1], ], data_set_B[idx[, 2], ])
which gives
> res
name ID_A name_2 ID_B
1 arh 1234 pimoarhd 8
2 qtj 720 aqtj 23
3 szcympsn 142 cym 43
4 cymvubnxg 245 cym 43
5 dppvtcymq 355 cym 43
6 kzi 690 kzii 48
7 eyajqchkn 498 chk 53
8 upfzh 522 upf 61
9 ioa 1852 ioadr 63
10 lya 1349 ibelyalvh 64
11 honod 504 ono 71
12 zozxjpu 18 zoz 72
13 jcz 914 cdjczpqg 88
14 ailmjf 623 ilm 99
15 upoux 609 oux 104
16 pouxifvp 1466 oux 104
17 mvob 516 vob 106
18 nqtotvhhm 1088 otv 115
19 wom 202 womtglapx 117
20 qkc 756 dqkcfqpps 118
21 qtl 600 ivqtlymzr 126
22 qqi 1605 owfsqqiyu 153
23 fmjalirze 1470 ali 172
24 ibwfwkyp 1588 fwk 175
25 iat 1258 iatjeg 185
26 osm 253 nviiqosm 199
27 wpj 373 wpjeb 204
28 hahx 515 ahx 213
29 keahxa 1565 ahx 213
30 psf 359 qnpsfo 223
31 saq 1859 saqhu 227
32 cvmkwtx 714 cvm 228
33 ilw 389 pyilwj 231
34 ohwysv 1590 ysv 237
35 utrl 698 trl 244
36 dmttrlcpj 1267 trl 244
37 cpv 236 btcpvmoc 247
38 uto 1047 utoi 257
39 yngunekl 1978 ekl 258
40 vceko 625 vce 265
41 fir 1934 firgk 278
42 qvd 983 eqvdfi 287
43 fir 1934 zwwefir 291
44 idvfkevdf 1380 vdf 312
45 qwdo 1921 qwd 322
46 kam 1205 tlkam 327
47 lck 488 clckjkyzn 329
48 gmspwckw 1015 msp 359
49 ynouuwqtz 1576 nou 360
50 tty 1209 bttyvt 361
51 vkc 999 fmrvkcl 366
52 ipw 1918 fipwjomdu 388
53 zdv 261 zdvkut 410
54 vku 1137 zdvkut 410
55 doby 246 oby 411
56 hycvuupgy 141 uup 421
57 uwlb 1249 wlb 431
58 auj 1452 lcmnauj 444
59 rwd 1667 ukwrwdczs 479
60 ylsihqqor 1290 ihq 483
61 feo 1649 feorvxbm 485
62 zff 755 dohzffujm 499
63 mqutujepu 904 epu 507
64 uiepu 1308 epu 507
65 vahepuk 1434 epu 507
66 cug 1125 accugl 509
67 fir 1934 firwe 517
68 dia 1599 dialeddd 527
69 temiwd 1725 tem 531
70 svofivl 1177 svo 545
71 flm 657 aflm 546
72 vnt 112 vnt 551
73 bhmoskrz 426 osk 558
74 wev 728 shemuwev 569
75 hzpi 1586 hzp 579
76 gvi 1064 mkgvivlfe 582
77 fjb 1398 vkfjbxnjl 589
78 qin 1013 qinp 593
79 ecn 1342 ecnzre 598
80 zre 1610 ecnzre 598
81 xvr 772 dpxvrfmo 623
82 tqr 1419 tqrmztdm 624
83 zmwnf 1571 mwn 626
84 ypil 1787 pil 630
85 mnxlqgfh 1132 nxl 643
86 gse 1563 gseice 646
87 ygk 1309 ygkqrk 655
88 fgm 933 vzfgmy 663
89 rlupd 977 upd 666
90 mcupdkuiy 1307 upd 666
91 fly 391 fly 669
92 vbkko 1603 kko 678
93 uvrew 465 rew 680
94 hgbhngwvd 901 wvd 690
95 wvdjprmo 1432 wvd 690
96 cgn 22 cgnd 698
97 dngnjv 967 njv 700
98 psqs 841 sqs 720
99 ywv 1180 ptywvlgc 730
100 ypil 1787 ypi 734
101 rwd 1667 srserwd 737
102 jqydasl 1294 jqy 742
103 ckujmc 717 ujm 751
104 dfzxta 662 xta 775
105 bjb 1562 jabjbei 779
106 adwknpll 1242 npl 780
107 kdv 1327 xhkdvqo 789
108 ghj 174 oghj 801
109 lhwfyksz 94 lhw 811
110 nwrrnlhw 929 lhw 811
111 xlhwm 1720 lhw 811
112 ncc 1602 wurhxnccn 814
113 jdslrf 1094 dsl 835
114 ktmw 1738 tmw 844
115 igwsvrzpk 16 gws 856
116 kug 591 pkugls 857
117 befgcpedr 339 fgc 862
118 ojf 1397 ojfpnkla 863
119 gyl 1203 gylxeqzw 872
120 ugcbb 1727 ugc 876
121 arh 1234 karhwhg 878
122 amm 458 ammqdc 883
123 azazryje 636 zaz 900
124 wczazw 1887 zaz 900
125 gkritbe 47 ritb 915
126 vku 1137 yjvkuxued 929
127 rnh 1633 kvyrnhugu 937
128 mzh 1135 xllmwmzhn 940
129 cug 1125 cug 960
130 xgv 1280 xgv 962
131 xusxgv 1436 xgv 962
132 umc 351 lwumcmvoo 980
133 zlb 1900 nkyazlb 991
134 llfkalao 1049 llf 1002
135 sflpbht 991 lpb 1048
136 rairmmcl 442 mmc 1087
137 mmckoln 780 mmc 1087
138 gfxmmcgb 1814 mmc 1087
139 aoj 402 taojlgp 1089
140 mypvzhp 121 ypv 1095
141 moctwaypv 611 ypv 1095
142 rngedn 306 ged 1106
143 djshecy 1408 ecy 1108
144 rairmmcl 442 rmm 1117
145 gzua 1594 zua 1124
146 ytj 416 yytj 1140
147 ubt 300 hubtcfr 1141
148 gqg 1854 ogqgsjqc 1144
149 tfg 1204 xiutfgru 1145
150 avrq 741 avr 1147
151 ytkpvss 440 tkp 1149
152 kug 591 yxsjkug 1176
153 vix 1846 vixsmn 1187
154 qtl 600 qtljkxz 1188
155 lgr 494 dlgrco 1189
156 ryg 864 xlmtryg 1203
157 yskvkxwj 1547 kvk 1205
158 kxhee 1795 xhe 1222
159 hzbcjs 1493 cjs 1224
160 kbi 270 itxlwkbi 1225
161 gdymcam 806 ymca 1232
162 tqr 1419 rxtqrdtl 1236
163 yyz 215 yyzw 1242
164 jyx 1735 mljjyxu 1248
165 aai 1928 umkpaaiwo 1254
166 dsd 1122 dndsdova 1257
167 tor 744 etor 1270
168 vhcyznp 1296 yzn 1278
169 xlc 1947 odxlcjwj 1280
170 mlm 1629 aomlmgtq 1303
171 owm 239 owmugb 1304
172 ynezwaml 507 nez 1308
173 jls 695 jlsve 1325
174 dvm 879 dvmv 1339
175 vsgx 944 dqpihvsgx 1352
176 wfo 768 wfokpjois 1354
177 tltbkinat 1986 nat 1362
178 gyl 1203 gylqte 1363
179 ngg 735 bsnggqbjd 1366
180 fkq 345 jdfkqf 1368
181 ojf 1397 ojfpgfga 1382
182 dqgd 1623 prqbndqgd 1398
183 siu 827 siuypucup 1412
184 yinsoivfd 1895 yin 1414
185 esm 1834 sesmeepz 1417
186 umc 351 umcj 1432
187 wny 866 wnyxamguw 1443
188 ujbhtvnin 399 vni 1444
189 dbq 630 bdbqq 1452
190 ebn 1405 ebngddw 1461
191 zcj 704 rbtjzcjod 1465
192 avn 500 avnspxv 1468
193 vkk 567 hvkk 1477
194 hmm 1441 bgjhmmthz 1483
195 aguakz 614 guak 1487
196 hycvuupgy 141 pgy 1493
197 tizpgymz 280 pgy 1493
198 guk 571 cncxdguk 1502
199 zyw 281 nzywuqs 1504
200 jnz 1558 rxdxsjnzw 1510
201 uuq 92 nxuuqtj 1514
202 qtj 720 nxuuqtj 1514
203 vkk 567 xpbpvkkdc 1518
204 iaa 460 sjiaa 1525
205 txsgmynng 1019 xsg 1526
206 yjvtwc 1107 jvt 1529
207 lnk 1113 hylnknwy 1546
208 szd 635 woszdm 1557
209 osm 253 sosmdp 1567
210 nbd 1067 nbdmmg 1570
211 mmg 1305 nbdmmg 1570
212 wqdsatbd 1536 sat 1585
213 sdlypo 1527 sdl 1596
214 inkynog 288 inky 1600
215 hpwoeclfy 1321 clf 1601
216 wodyqwqf 679 dyq 1603
217 lyw 7 xnalywyuw 1607
218 njm 1825 vjlnjmns 1617
219 njytqhaut 428 qha 1620
220 ilw 389 rilwbk 1647
221 oqo 45 ixoqowkpg 1650
222 odcbcvaun 1386 bcv 1652
223 mastn 434 stn 1662
224 xebhdssit 1091 xeb 1663
225 nmy 782 nmyxj 1671
226 fsqvgdw 673 gdw 1676
227 mwwczhs 482 wcz 1679
228 wczazw 1887 wcz 1679
229 anmryzm 915 ryz 1698
230 rteh 523 rte 1708
231 mlwrguae 817 lwr 1709
232 mbu 819 xpsuqmbuf 1729
233 mmckoln 780 cko 1733
234 lxpg 798 lxp 1734
235 ane 370 vxnanehvk 1746
236 tty 1209 vbttyozui 1752
237 igncdgyjx 332 ign 1753
238 ndignk 621 ign 1753
239 nmy 782 ivnmyba 1780
240 wknrsauuj 2 rsa 1799
241 tgd 165 qtgdidlf 1803
242 iaa 460 yziaazxto 1833
243 xto 1245 yziaazxto 1833
244 zff 755 dpzfft 1857
245 jyx 1735 jwjyxphe 1873
246 ytj 416 eytj 1881
247 lcggwonk 1596 onk 1882
248 zdv 261 zdvxfz 1889
249 xhskcb 417 kcb 1890
250 mrikqkcb 770 kcb 1890
251 psvxqnsap 1352 psv 1898
252 udjswzb 411 jsw 1900
253 rpfjswy 1840 jsw 1900
254 bjaywiso 1677 ayw 1902
255 zfli 130 fli 1906
256 vazx 1215 itvazxw 1918
257 tuw 100 tuwywtbwd 1921
258 vle 1437 ebvleaovm 1937
259 znycsygd 1757 nyc 1944
260 ynezwaml 507 ezw 1952
261 tseezwf 1276 ezw 1952
262 ezwzyfudo 1690 ezw 1952
263 oudiky 1503 dik 1964
264 dikjn 1615 dik 1964
265 oms 106 wpomsudi 1977
266 hhp 1864 hhpkm 1983
Benchmarking
It seems this base R option is slightly slower than #Andrew's approach.
TIC <- function() {
A <- data_set_A$name
B <- data_set_B$name_2
A2B <- sapply(A, function(x) grepl(x, B, fixed = TRUE))
B2A <- sapply(B, function(x) grepl(x, A, fixed = TRUE))
idx <- which(t(A2B) | B2A, arr.ind = TRUE)
cbind(data_set_A[idx[, 1], ], data_set_B[idx[, 2], ])
# data_set_A[unique(idx[, 1]), ]
}
jpiversen_fixed <- function() {
data_set_A %>%
rowwise() %>%
filter(any(str_detect(name, fixed(data_set_B$name_2))) | any(str_detect(data_set_B$name_2, fixed(name)))) %>%
ungroup()
}
andrew <- function() {
nchar_a <- nchar(data_set_A$name)
nchar_b <- nchar(data_set_B$name_2)
pattern_a <- str_c(unique(data_set_A$name[nchar_a != max(nchar_a, na.rm = TRUE)]), collapse = "|")
pattern_b <- str_c(unique(data_set_B$name_2[nchar_b != max(nchar_b, na.rm = TRUE)]), collapse = "|")
idx_a <- data_set_A$name %in% data_set_B$name_2
idx_a[!idx_a] <- str_detect(data_set_A$name[!idx_a], pattern_b)
b_to_check <- data_set_B$name_2[str_detect(data_set_B$name_2, pattern_a)]
idx_a[!idx_a] <- vapply(data_set_A$name[!idx_a], function(name) {
any(grepl(name, b_to_check, fixed = TRUE))
}, logical(1L), USE.NAMES = FALSE)
data_set_A[idx_a, ]
}
bm <- microbenchmark(
TIC(),
jpiversen_fixed(),
andrew(),
times = 20
)
shows that
> bm
Unit: milliseconds
expr min lq mean median uq max
TIC() 423.8410 441.3574 492.6091 478.2596 549.2376 611.3841
jpiversen_fixed() 1354.8954 1373.9502 1447.8649 1395.6766 1459.7058 1842.2574
andrew() 329.4821 335.3388 345.8890 341.4758 354.1298 381.6872
neval
20
20
20

How to find out the sequence of value in R

Supposed that I have generated 100 different value from -100 to 100, and I have cumsum all of those value.
set.seed(123)
x <- -100:100
z <- sample (x, size = 100,replace=T)
cumsum(z)
and I got
[1] 58 136 49 143 212 161 178 120 33 50 102 91 81 177 167 251 242 278 276 247 172 78 147
[24] 183 246 223 203 145 147 163 138 180 111 119 25 61 129 102 24 78 165 117 151 103 157 222
[47] 155 123 94 69 31 71 67 57 109 46 -34 -94 -20 -31 -72 -157 -142 -149 -244 -145 -160 -175 -237
[70] -179 -162 -213 -280 -377 -465 -497 -471 -419 -468 -547 -559 -500 -576 -642 -575 -564 -635 -596 -538 -518 -509 -452
[93] -489 -448 -350 -384 -334 -313 -335 -351
Now, I would like to stop or find out the value that is greater than 200 or lower than -200.
If I do it by my hand, I know that the 5th sequence (212) is greater than 200.
However, in R, is there any command to find out the first time that z is greater than 200 or lower than -200?
Thank you very much
A quick hack way to do this might be:
z <- as.data.frame(z)
z$lv <- if_else(z >200,T,F)
min(which(lv == TRUE))
The min(which(...)) solutions provided by others don't give a convenient answer in case none of the values meet the condition. For example,
set.seed(123)
x <- -100:100
z <- sample (x, size = 100,replace=T)
min(which(abs(cumsum(z)) > 200))
#> [1] 5
min(which(abs(cumsum(z)) > 1000)) # None meet this condition
#> Warning in min(which(abs(cumsum(z)) > 1000)): no non-missing arguments to min;
#> returning Inf
#> [1] Inf
A better way is given in the R help page for which.max:
match(TRUE, abs(cumsum(z)) > 200)
#> [1] 5
match(TRUE, abs(cumsum(z)) > 1000)
#> [1] NA

Divide paired matching columns

I have a data.frame df with matching columns that are also paired. The matching columns are defined in the factor patient. I would like to devide the matching columns by each other. Any suggestions how to do this?
I tried this, but this does not take the pairing from patient into account.
m1 <- m1[sort(colnames(df)]
m1_g <- m1[,grep("^n",colnames(df))]
m1_r <- m1[,grep("^t",colnames(df))]
m1_new <- m1_g/m1_r
m1_new
head(df)
na-008 ta-008 nc012 tb012 na020 na-018 ta-018 na020 tc020 tc093 nc093
hsa-let-7b-5p_TGAGGTAGTAGGTTGTGT 56 311 137 242 23 96 113 106 41 114
hsa-let-7b-5p_TGAGGTAGTAGGTTGTGTGG 208 656 350 713 49 476 183 246 157 306
hsa-let-7b-5p_TGAGGTAGTAGGTTGTGTGGT 631 1978 1531 2470 216 1906 732 850 665 909
hsa-let-7b-5p_TGAGGTAGTAGGTTGTGTGGTT 2760 8159 6067 9367 622 4228 2931 3031 2895 2974
hsa-let-7b-5p_TGAGGTAGTAGGTTGTGTGGTTT 1698 4105 3737 3729 219 1510 1697 1643 1527 1536
> head(patient)
$`008`
[1] "na-008" "ta-008"
$`012`
[1] "nc012" "tb012"
$`018`
[1] "na-018" "ta-018"
$`020`
[1] "na020" "tc020"
$`045`
[1] "nb045" "tc045"
$`080`
[1] "nb-080" "ta-080"

R aggregate data in one column based on 2 other columns

So, I have these data given below, and my goal is to aggregate column v3 in terms of columns v1 and v2 and add the v3 values for each bin of v1 and v2. For example, the first line correspond to interval v1=21, v2=16, so the value of v3 will be aggregated over its (v1,v2) interval. And repeat this for the rest of rows. I want to use the mean as the aggregation function!
> df
v1 v2 v3
1 21.359 16.234 24.283
2 47.340 9.184 21.328
3 35.363 -13.258 14.556
4 -29.888 14.154 17.718
5 -10.109 -16.994 20.200
6 -32.387 1.722 15.735
7 49.240 -5.266 17.601
8 -38.933 2.558 16.377
9 41.213 5.937 21.654
10 -33.287 -4.028 19.525
11 -10.223 11.961 16.756
12 -48.652 16.558 20.800
13 44.778 27.741 17.793
14 -38.546 29.708 13.948
15 -45.622 4.729 17.793
16 -36.290 12.383 18.014
17 -19.626 19.767 18.182
18 -32.248 29.480 15.108
19 -41.859 35.502 8.490
20 -36.058 21.191 16.714
21 -23.588 0.524 21.471
22 -24.423 39.963 18.257
23 -0.042 -45.899 17.654
24 -35.479 32.049 9.294
25 -24.632 20.603 17.757
26 -26.591 25.882 18.968
27 -34.364 43.959 13.905
28 -19.334 29.728 20.102
29 12.304 -39.997 17.002
30 0.958 37.162 20.779
31 -35.475 -40.611 14.719
32 -39.268 44.382 11.247
33 -10.154 39.053 19.458
34 -12.612 32.056 17.759
35 2.730 -1.473 20.228
36 -45.326 -52.299 9.305
37 -1.996 -15.551 13.295
38 -26.655 -37.319 19.148
39 -18.509 -30.047 18.889
40 -22.705 -25.577 19.007
41 -15.705 -15.397 19.112
42 -2.637 9.790 10.548
43 -14.107 -3.145 19.654
44 -29.272 -19.906 18.503
45 -9.569 -4.632 11.334
46 2.114 18.048 14.744
47 -4.241 16.073 15.420
48 31.869 -3.394 21.559
49 20.425 35.205 22.250
50 -18.605 -8.866 20.082
51 -26.677 -7.690 21.850
52 -5.240 4.805 11.399
53 -6.766 2.538 6.292
54 4.567 22.554 19.682
55 -20.701 6.430 20.996
56 -23.972 16.141 17.976
57 -6.651 24.048 18.082
58 -32.243 -6.100 19.517
59 2.236 29.736 19.667
60 18.830 15.586 15.969
61 -9.598 28.414 17.806
62 -30.825 12.194 22.346
63 -17.415 15.795 18.135
64 -14.823 5.931 17.915
65 -14.234 12.882 13.001
66 9.937 18.368 20.421
67 -38.766 9.590 21.648
68 -30.896 27.047 16.453
69 -4.432 -10.562 10.061
70 -4.290 33.170 22.942
71 7.285 41.416 23.906
72 24.411 40.531 23.584
73 45.409 -32.420 20.831
74 49.341 -34.047 15.269
75 -7.730 -47.724 21.692
76 -10.563 -29.082 17.984
77 4.412 -41.182 16.845
78 31.822 -37.297 19.665
79 -43.355 31.093 17.688
80 -44.353 -44.723 13.832
81 -16.961 38.438 20.715
82 -21.225 -39.244 18.156
83 -42.022 -8.686 20.362
84 -42.904 -25.498 18.394
85 43.822 -25.990 21.287
86 43.013 -9.071 19.285
87 -36.901 -24.185 21.938
88 -28.251 -36.583 19.330
89 -19.830 -22.412 21.677
90 -3.789 -15.663 17.439
91 40.453 -21.796 17.432
92 -40.778 -31.188 18.762
93 -27.072 -48.609 18.913
94 -18.035 -1.791 19.909
95 -20.781 -7.912 22.563
96 47.307 -15.432 19.101
97 30.700 5.097 22.801
98 46.453 0.171 17.810
99 -27.439 -5.860 22.626
100 -30.526 -18.007 23.219
101 -18.280 -15.187 25.302
102 -18.367 6.044 18.864
103 41.265 -1.686 22.743
104 29.227 -14.814 19.196
105 -36.080 -32.715 18.930
106 7.475 7.061 25.002
107 -18.586 -45.207 21.864
108 35.227 11.148 21.388
109 -7.581 38.773 22.048
110 -43.685 14.083 22.037
111 -29.533 39.735 17.613
112 8.760 -39.400 22.421
113 -14.962 24.624 12.030
114 18.627 -32.888 23.036
115 -31.300 33.612 15.608
116 -38.024 45.839 16.567
117 -15.104 36.893 18.162
118 -12.809 -23.029 21.589
119 -21.614 36.264 16.680
120 42.917 -36.838 18.738
121 6.104 -14.961 14.468
122 44.032 -41.556 17.618
123 -24.493 21.886 17.366
124 -24.361 29.941 14.374
125 -25.060 43.383 16.437
126 -6.017 -24.640 19.207
127 -32.617 -40.549 18.059
128 -43.285 -43.364 18.827
129 -29.856 -46.089 16.881
130 -16.547 -43.619 22.547
131 -16.257 42.814 18.932
132 -9.236 -11.694 14.455
133 13.488 -35.422 24.436
134 -47.456 -32.714 18.123
135 39.476 -28.008 16.087
136 -21.933 -43.522 15.390
137 -17.347 -38.250 16.738
138 -4.948 -39.747 21.598
139 -31.018 -28.912 21.332
140 -36.364 30.461 17.542
141 -39.639 18.272 23.663
142 -24.162 -13.582 19.136
143 -8.935 -32.699 22.108
144 0.001 -19.219 17.888
145 -6.912 -24.885 20.683
146 7.785 -31.229 15.972
147 22.176 -7.478 21.335
148 8.755 -13.323 20.831
149 44.081 41.160 11.938
150 -8.451 -37.721 17.465
151 18.671 -2.776 23.374
152 12.668 -26.749 18.071
153 1.582 -21.252 20.750
154 20.832 -27.718 16.190
155 44.220 -45.690 12.598
156 -0.226 -37.737 17.634
157 -25.130 -19.197 23.170
158 2.086 -31.271 18.180
159 -20.445 -33.083 19.984
160 23.801 1.116 24.230
161 18.283 -17.922 20.256
162 -38.985 -13.770 20.702
163 -26.264 -27.413 20.276
164 10.396 -19.375 20.415
165 -16.343 -22.847 16.516
166 29.992 -8.215 21.661
167 35.052 -19.475 16.953
168 3.052 -6.800 22.509
169 -10.350 -5.413 19.222
170 14.371 -10.383 23.471
171 11.896 -4.191 21.773
172 18.152 8.741 23.669
173 25.748 -47.786 18.578
174 31.613 -0.735 23.898
175 12.660 25.645 23.549
176 2.933 29.345 25.170
177 9.369 18.791 26.817
178 15.805 4.798 27.866
179 27.556 -25.571 14.796
180 -5.112 -7.835 21.201
181 -30.571 3.471 20.496
182 19.816 -22.114 21.210
183 2.826 47.437 22.911
184 25.488 -33.064 21.442
185 44.826 42.162 22.994
186 25.208 -48.487 25.325
187 14.635 -17.430 17.083
188 -1.901 -33.370 22.163
189 12.306 -47.265 20.052
190 42.552 35.750 23.213
191 37.318 -46.069 22.599
192 4.725 -22.289 21.600
193 -40.815 -37.793 17.371
194 11.890 -12.862 14.286
195 35.251 -31.746 17.816
196 27.121 -27.638 19.677
197 36.024 -39.105 20.202
198 -47.119 41.940 17.526
199 0.837 -40.694 23.063
200 23.797 -39.795 20.198
201 -42.859 -21.372 23.554
202 39.407 -20.211 21.246
203 25.782 -18.892 20.423
204 34.529 -9.576 20.411
205 44.397 -13.247 23.180
206 5.534 6.856 14.248
207 31.598 -18.085 22.350
208 7.250 -0.481 15.453
209 -43.458 -15.204 23.193
210 -38.296 -31.524 21.776
211 4.276 -3.483 12.145
212 25.757 -11.708 22.360
213 15.634 37.478 24.624
214 -43.669 -3.197 20.742
215 45.381 6.365 21.351
216 -38.755 -6.877 20.879
217 -6.925 3.994 21.120
218 8.059 12.831 26.032
219 3.572 22.105 26.920
220 16.042 30.267 21.039
221 26.629 13.042 23.633
222 -12.126 -0.151 21.261
223 -11.981 24.600 19.236
224 29.480 28.362 21.838
225 -2.500 22.858 23.177
226 -41.163 19.863 20.059
227 35.953 27.401 19.101
228 -16.641 13.248 17.984
229 -3.778 14.090 18.943
230 11.643 34.817 21.621
231 34.921 38.666 17.359
232 25.621 22.451 22.866
233 34.936 17.384 19.836
234 40.017 37.599 13.987
235 19.547 33.838 22.575
236 11.197 39.977 19.347
237 16.972 -33.927 14.205
238 22.938 38.064 20.351
239 40.234 18.672 23.030
240 -0.846 42.320 18.383
241 -11.437 18.284 16.502
242 19.552 43.222 21.370
243 13.925 -46.486 18.917
244 41.709 -39.559 16.143
245 19.014 -44.563 17.796
246 32.260 33.114 18.402
247 -4.693 29.228 18.622
248 21.765 -38.452 15.147
249 39.157 -31.135 19.800
250 32.638 46.241 18.943
251 2.797 10.089 21.330
252 8.256 46.910 18.834
253 38.634 -2.429 20.413
254 28.642 2.763 19.580
255 0.456 1.422 7.452
256 3.050 11.792 14.196
257 24.736 14.532 17.886
258 16.787 -10.155 18.607
259 12.676 11.651 18.656
260 13.184 1.081 15.385
261 27.365 26.576 25.486
262 -7.878 -18.191 14.547
263 -42.112 32.576 20.865
264 15.069 21.684 17.986
265 33.045 27.166 25.252
266 21.810 -0.186 19.477
267 18.227 26.690 20.415
268 33.759 18.366 21.255
269 39.491 13.272 23.036
270 30.662 9.368 20.192
271 5.470 35.303 22.685
272 21.663 -44.343 20.999
273 31.261 33.178 24.335
274 21.854 22.665 20.876
275 21.853 7.932 18.588
276 -40.168 3.682 19.642
277 -42.292 23.997 22.199
278 10.233 28.731 21.263
279 17.745 41.831 19.536
280 38.406 25.165 26.534
281 -49.329 -0.465 20.887
282 40.398 -8.120 21.362
283 -2.531 46.118 22.933
284 7.959 -30.856 20.497
285 -34.467 -23.724 22.206
286 30.541 44.284 25.878
287 45.682 29.897 21.964
288 -22.251 -0.089 20.756
289 21.484 16.532 23.513
290 46.912 10.195 21.908
291 35.320 -13.352 16.102
292 -30.431 14.048 17.362
293 -8.976 -17.325 21.645
294 -32.661 2.301 16.805
295 49.317 -5.509 17.711
296 -37.756 4.459 16.054
297 41.445 6.158 21.442
298 -33.148 -3.499 19.543
299 -10.065 12.238 16.649
300 -48.323 17.153 20.974
301 45.010 28.147 17.838
302 -39.630 29.183 13.254
303 -45.191 5.065 18.214
304 -35.936 11.953 16.540
305 -19.816 19.624 18.279
306 -32.055 29.757 15.358
307 -41.533 36.169 10.005
308 -35.448 20.960 16.720
309 -23.384 0.511 20.005
310 -25.101 40.569 18.180
311 -0.547 -45.779 17.603
312 -35.291 32.643 9.548
313 -25.109 20.826 17.494
314 -26.202 27.012 18.678
315 -34.805 43.850 14.006
316 -18.819 30.611 20.309
317 13.019 -40.248 16.874
318 -0.655 37.112 20.924
319 -34.142 -41.553 15.237
320 -39.509 43.886 12.464
321 -9.491 38.639 18.839
322 -12.164 31.977 17.598
323 3.437 -1.596 20.318
324 -45.713 -52.599 9.918
325 -2.062 -15.946 12.847
326 -27.435 -37.600 18.257
327 -18.094 -29.624 18.791
328 -22.647 -26.123 18.746
329 -16.775 -15.505 19.204
330 -2.628 9.599 11.219
331 -15.718 -1.797 19.491
332 -29.476 -20.107 17.485
333 -10.618 -4.938 12.227
334 1.423 17.458 14.706
335 -4.503 16.630 14.718
336 32.450 -2.029 21.591
337 20.529 35.464 21.630
338 -19.348 -7.844 19.464
339 -26.760 -6.856 21.422
340 -4.539 4.393 11.819
341 -5.741 1.934 7.121
342 4.781 21.919 18.908
343 -19.797 6.928 20.928
344 -24.555 16.834 19.796
345 -5.664 24.465 18.432
346 -32.891 -6.571 18.691
347 2.354 28.462 19.825
348 18.058 16.251 16.335
349 -9.603 28.582 17.743
350 -31.282 11.454 22.342
351 -17.580 16.428 18.401
352 -13.884 6.206 17.270
353 -13.631 13.767 11.761
354 9.712 18.008 18.896
355 -37.987 9.024 21.309
356 -29.969 27.506 16.964
357 -4.248 -10.813 9.284
358 -5.755 32.673 22.541
359 6.675 41.952 24.227
360 24.564 41.173 23.241
361 45.314 -32.299 20.778
362 -45.890 -33.510 16.314
363 -8.277 -47.943 21.573
364 -11.044 -29.464 17.708
365 3.972 -41.396 17.411
366 31.776 -36.643 19.998
367 -43.072 31.311 17.828
368 -45.805 -43.071 14.477
369 -15.628 39.837 19.709
370 -21.129 -39.101 18.814
371 -41.628 -8.980 19.850
372 -42.244 -23.659 18.856
373 44.149 -25.710 21.099
374 42.623 -9.185 20.147
375 -35.949 -23.979 22.255
376 -28.512 -36.367 19.378
377 -19.827 -21.781 21.621
378 -3.429 -15.706 18.677
379 39.741 -20.721 18.670
380 -41.663 -29.499 19.260
381 -26.931 -48.467 18.185
382 -17.571 -1.467 19.770
383 -20.039 -7.591 22.737
384 46.370 -14.790 19.922
385 30.710 4.167 22.987
386 46.755 0.417 18.088
387 -27.293 -4.398 22.168
388 -30.364 -17.573 23.869
389 -16.870 -14.893 25.817
390 -18.152 6.546 18.392
391 40.134 0.160 23.661
392 28.179 -14.323 19.301
393 -35.907 -32.647 19.306
394 8.486 7.101 24.551
395 -17.155 -45.435 22.745
396 34.226 10.748 19.773
397 -7.760 38.754 22.211
398 -42.899 13.804 22.628
399 -29.972 40.435 17.784
400 8.764 -39.195 22.070
401 -15.624 25.585 12.291
402 18.620 -33.314 23.282
403 -30.436 34.219 15.102
404 -37.665 44.955 15.257
405 -15.861 37.488 18.956
406 -13.375 -22.408 20.312
407 -20.972 36.906 17.387
408 43.162 -35.948 19.695
409 6.639 -15.783 14.608
410 44.186 -41.037 17.398
411 -23.917 22.236 18.702
412 -23.957 30.033 14.725
413 -25.056 43.824 15.489
414 -6.795 -24.375 18.537
415 -33.485 -40.651 17.538
416 -43.186 -43.071 17.481
417 -30.325 -46.122 16.440
418 -17.489 -43.551 22.006
419 -16.376 43.928 18.992
420 -9.076 -10.921 14.131
421 13.704 -36.352 23.812
422 -47.302 -31.918 18.719
423 39.459 -27.814 15.558
424 -22.509 -42.660 14.366
425 -17.920 -37.614 16.572
426 -5.780 -39.212 21.667
427 -30.519 -28.942 21.931
428 -35.937 31.435 17.106
429 -38.680 18.435 23.342
430 -24.796 -13.279 18.543
431 -9.283 -32.388 21.895
432 0.493 -19.505 17.276
433 -7.046 -25.243 20.741
434 7.884 -32.006 16.727
435 22.451 -7.834 21.082
436 8.379 -13.690 22.002
437 43.730 41.697 11.894
438 -9.040 -38.086 17.500
439 18.831 -2.759 23.252
440 12.732 -27.410 18.948
441 0.739 -21.091 21.354
442 20.339 -27.959 16.514
443 44.688 -46.449 12.356
444 -0.402 -36.951 17.891
445 -24.790 -18.139 23.337
446 2.173 -30.577 18.023
447 -18.995 -33.799 20.730
448 23.372 0.223 24.855
449 17.835 -17.372 19.878
450 -38.915 -13.815 20.923
451 -26.241 -27.800 19.877
452 11.074 -18.156 19.249
453 -16.478 -22.928 16.386
454 29.646 -8.349 21.115
455 33.910 -20.809 16.629
456 3.306 -6.830 22.059
457 -10.512 -5.322 19.876
458 14.024 -10.406 23.456
459 12.365 -3.699 21.818
460 18.186 8.532 23.951
461 25.140 -47.653 18.592
462 32.288 -2.117 23.423
463 10.836 24.937 23.310
464 4.531 28.913 25.238
465 9.944 18.397 26.661
466 16.274 4.852 27.837
467 27.316 -26.007 15.934
468 -4.508 -8.010 20.906
469 -29.858 2.412 19.958
470 20.376 -21.957 21.306
471 2.077 47.431 23.248
472 25.777 -33.367 21.695
473 44.854 42.801 22.904
474 25.356 -48.833 25.402
475 15.322 -16.926 17.318
476 -2.656 -33.400 20.365
477 11.950 -47.390 20.328
478 42.961 36.955 22.919
479 35.726 -45.402 24.272
480 4.675 -21.758 21.780
481 -40.568 -36.931 16.934
482 11.758 -12.859 14.206
483 35.483 -31.760 16.975
484 27.336 -27.577 19.429
485 36.689 -39.218 19.668
486 -46.357 41.618 17.456
487 0.002 -40.589 22.558
488 23.525 -39.918 21.247
489 -43.269 -21.304 22.699
490 40.191 -20.594 21.145
491 25.728 -18.024 20.298
492 34.964 -10.441 20.189
493 43.627 -13.279 23.038
494 5.766 6.876 14.077
495 32.432 -18.172 21.848
496 7.087 -1.122 15.098
497 -44.110 -14.034 23.080
498 -39.474 -31.289 22.312
499 4.118 -4.077 11.067
500 26.597 -11.667 22.641
so, using these commands I can find the intervals, as below
x.bin <- seq(floor(min(d[,1])), ceiling(max(df[,1])), by=2)
y.bin <- seq(floor(min(d[,2])), ceiling(max(df[,2])), by=2)
> x.bin
[1] -50 -48 -46 -44 -42 -40 -38 -36 -34 -32 -30 -28 -26 -24 -22 -20 -18 -16 -14
[20] -12 -10 -8 -6 -4 -2 0 2 4 6 8 10 12 14 16 18 20 22 24
[39] 26 28 30 32 34 36 38 40 42 44 46 48 50
> y.bin
[1] -53 -51 -49 -47 -45 -43 -41 -39 -37 -35 -33 -31 -29 -27 -25 -23 -21 -19 -17
[20] -15 -13 -11 -9 -7 -5 -3 -1 1 3 5 7 9 11 13 15 17 19 21
[39] 23 25 27 29 31 33 35 37 39 41 43 45 47
But, then I don't know how to assign each row of the raw data (df) to each x.bin and y.bin and calculate the aggregate (sum) of each bin.
library(plyr)
#I am using cut function with 50 breaks for both v1 and v2 and ddply from plyr package for computing the mean
newdata<-ddply(df,.(cut(v1,50),cut(v2,50)),summarise,mean.v3=mean(v3))
> head(newdata)
cut(v1, 50) cut(v2, 50) mean.v3
1 (-49.4,-47.5] (-34.7,-32.7] 18.123
2 (-49.4,-47.5] (-0.576,1.43] 20.887
3 (-49.4,-47.5] (15.5,17.5] 20.887
4 (-47.5,-45.5] (-52.7,-50.7] 9.918
5 (-47.5,-45.5] (-44.7,-42.7] 14.477
6 (-47.5,-45.5] (-34.7,-32.7] 16.314
Updated as per the comments: If you want the lower, middle and mid-points, you can use the following function or use with details as follow(you need to use the sub function to deal with ( and ]):
df$newv1<-with(df,cut(v1,50))
df$newv2<-with(df,cut(v2,50))
df$lowerv1<-with(df,as.numeric( sub("\\((.+),.*", "\\1", newv1))) #lower value
df$upperv1<-with(df,as.numeric( sub("[^,]*,([^]]*)\\]", "\\1", newv1))) # upper value
df$midv1<-with(df,(lowerv1+upperv1)/2) #mid value
df$lowerv2<-with(df,as.numeric( sub("\\((.+),.*", "\\1",newv2))) #lower value
df$upperv2<-with(df,as.numeric( sub("[^,]*,([^]]*)\\]", "\\1", newv2))) # upper value
df$midv2<-with(df,(lowerv2+upperv2)/2)#mid value
newdata<-ddply(df,.(newv1,newv2),transform,mean.v3=mean(v3))
> head(newdata)
v1 v2 v3 newv1 newv2 lowerv1 upperv1 midv1 lowerv2 upperv2 midv2 mean.v3
1 -47.456 -32.714 18.123 (-49.4,-47.5] (-34.7,-32.7] -49.4 -47.5 -48.45 -34.700 -32.70 -33.700 18.123
2 -49.329 -0.465 20.887 (-49.4,-47.5] (-0.576,1.43] -49.4 -47.5 -48.45 -0.576 1.43 0.427 20.887
3 -48.652 16.558 20.800 (-49.4,-47.5] (15.5,17.5] -49.4 -47.5 -48.45 15.500 17.50 16.500 20.887
4 -48.323 17.153 20.974 (-49.4,-47.5] (15.5,17.5] -49.4 -47.5 -48.45 15.500 17.50 16.500 20.887
5 -45.713 -52.599 9.918 (-47.5,-45.5] (-52.7,-50.7] -47.5 -45.5 -46.50 -52.700 -50.70 -51.700 9.918
6 -45.805 -43.071 14.477 (-47.5,-45.5] (-44.7,-42.7] -47.5 -45.5 -46.50 -44.700 -42.70 -43.700 14.477

Resources