How to reduce processing time of a code in R - r
Can you help me think of some way to reduce the computational time of a code that generates a certain value, which in this case I call coef, which will depend on id/date/category? Better explanations below.
I made two functions that generate the same result. As you can see in benchmark, the first function (return_values) takes twice as long as the second function (return_valuesX) to generate the same results. See that in the second function, I make some brief changes when calculating the coef variable. However, I strongly believe that there is a possibility of improving the code, as you can see in the second function, I managed to improve 50% of processing time compared to the first just with brief changes. But I'm out of ideas for new adjustments, so I would like your valuable opinion.
Code Explanations:
In general, the purpose of the code is to calculate a value, which I call a coef for each group of id, date and category. For this, the median of the values resulting from the subtraction between DR1 and the values of the DRM0 columns of the df1 database is first calculated. After obtaining the median (med variable), I add the values found with the values of the DRM0 columns of my df1 database. This calculation is my SPV variable. In both cases, I used the data.table function, which I believe is faster than using dplyr. After I get SPV, I need to calculate the coef variable for each id/date/category.
Below I will insert an example real easy to understand of the coef calculation. If for example I want to calculate coef of idd<-"3", dmda<-"2021-12-03", CategoryChosse<-"ABC", and I have the following:
> SPV %>% filter(Id==idd, date2 == ymd(dmda), Category == CategoryChosse)
Id date1 date2 Week Category DRM001_PV DRM002_PV DRM003_PV DRM004_PV DRM005_PV DRM006_PV DRM007_PV DRM008_PV DRM009_PV DRM010_PV DRM011_PV DRM012_PV
1: 3 2021-12-01 2021-12-03 Monday ABC -3 374 198 17 537 -54 330 -136 -116 534 18 -199
DRM013_PV DRM014_PV DRM015_PV DRM016_PV DRM017_PV DRM018_PV DRM019_PV DRM020_PV DRM021_PV DRM022_PV DRM023_PV DRM024_PV DRM025_PV DRM026_PV DRM027_PV DRM028_PV
1: 106 106 349 76 684 390 218 146 141 20 435 218 372 321 218 218
DRM029_PV DRM030_PV DRM031_PV DRM032_PV DRM033_PV DRM034_PV DRM035_PV DRM036_PV DRM037_PV DRM038_PV DRM039_PV DRM040_PV DRM041_PV DRM042_PV DRM043_PV DRM044_PV
1: 55 455 46 411 262 449 325 467 43 -114 191 167 63 -123 252 218
DRM045_PV DRM046_PV DRM047_PV DRM048_PV DRM049_PV DRM050_PV DRM051_PV DRM052_PV DRM053_PV DRM054_PV DRM055_PV DRM056_PV DRM057_PV DRM058_PV DRM059_PV DRM060_PV
1: 305 420 -296 596 200 218 190 203 607 218 442 -72 463 129 -39 333
DRM061_PV DRM062_PV DRM063_PV DRM064_PV DRM065_PV DRM066_PV DRM067_PV DRM068_PV DRM069_PV DRM070_PV DRM071_PV DRM072_PV DRM073_PV DRM074_PV DRM075_PV DRM076_PV
1: -26 160 -91 326 218 369 317 476 224 61 195 613 342 218 204 521
DRM077_PV DRM078_PV DRM079_PV DRM080_PV DRM081_PV DRM082_PV DRM083_PV DRM084_PV DRM085_PV DRM086_PV DRM087_PV DRM088_PV DRM089_PV DRM090_PV DRM091_PV DRM092_PV
1: 588 218 449 340 51 508 -72 42 492 510 328 818 -132 -105 210 -102
DRM093_PV DRM094_PV DRM095_PV DRM096_PV DRM097_PV DRM098_PV DRM099_PV DRM0100_PV DRM0101_PV DRM0102_PV DRM0103_PV DRM0104_PV DRM0105_PV DRM0106_PV DRM0107_PV
1: -137 94 639 265 -64 512 32 -53 414 340 -16 471 434 150 267
DRM0108_PV DRM0109_PV DRM0110_PV DRM0111_PV DRM0112_PV DRM0113_PV DRM0114_PV DRM0115_PV DRM0116_PV DRM0117_PV DRM0118_PV DRM0119_PV DRM0120_PV DRM0121_PV DRM0122_PV
1: 383 -162 434 -134 -39 450 212 146 -26 8 222 341 601 239 57
DRM0123_PV DRM0124_PV DRM0125_PV DRM0126_PV DRM0127_PV DRM0128_PV DRM0129_PV DRM0130_PV DRM0131_PV DRM0132_PV DRM0133_PV DRM0134_PV DRM0135_PV DRM0136_PV DRM0137_PV
1: 484 239 502 415 504 62 487 168 101 319 365 37 218 -50 230
DRM0138_PV DRM0139_PV DRM0140_PV DRM0141_PV DRM0142_PV DRM0143_PV DRM0144_PV DRM0145_PV DRM0146_PV DRM0147_PV DRM0148_PV DRM0149_PV DRM0150_PV DRM0151_PV DRM0152_PV
1: 493 159 150 132 58 21 468 -81 27 345 107 148 -66 -146 -185
DRM0153_PV DRM0154_PV DRM0155_PV DRM0156_PV DRM0157_PV DRM0158_PV DRM0159_PV DRM0160_PV DRM0161_PV DRM0162_PV DRM0163_PV DRM0164_PV DRM0165_PV DRM0166_PV DRM0167_PV
1: -14 562 68 140 353 120 130 301 76 441 218 370 218 378 -22
DRM0168_PV DRM0169_PV DRM0170_PV DRM0171_PV DRM0172_PV DRM0173_PV DRM0174_PV DRM0175_PV DRM0176_PV DRM0177_PV DRM0178_PV DRM0179_PV DRM0180_PV DRM0181_PV DRM0182_PV
1: -279 563 628 600 152 218 445 246 420 94 495 509 356 183 326
DRM0183_PV DRM0184_PV DRM0185_PV DRM0186_PV DRM0187_PV DRM0188_PV DRM0189_PV DRM0190_PV DRM0191_PV DRM0192_PV DRM0193_PV DRM0194_PV DRM0195_PV DRM0196_PV DRM0197_PV
1: 493 -190 -65 -123 376 357 473 112 -69 471 452 221 165 -44 87
DRM0198_PV DRM0199_PV DRM0200_PV DRM0201_PV DRM0202_PV DRM0203_PV DRM0204_PV DRM0205_PV DRM0206_PV DRM0207_PV DRM0208_PV DRM0209_PV DRM0210_PV DRM0211_PV DRM0212_PV
1: 239 285 521 -65 158 223 160 223 269 57 218 218 102 329 218
DRM0213_PV DRM0214_PV DRM0215_PV DRM0216_PV DRM0217_PV DRM0218_PV DRM0219_PV DRM0220_PV DRM0221_PV DRM0222_PV DRM0223_PV DRM0224_PV DRM0225_PV DRM0226_PV DRM0227_PV
1: 769 215 -68 218 347 18 218 547 759 278 -80 -37 629 -16 774
DRM0228_PV DRM0229_PV DRM0230_PV DRM0231_PV DRM0232_PV DRM0233_PV DRM0234_PV DRM0235_PV DRM0236_PV DRM0237_PV DRM0238_PV DRM0239_PV DRM0240_PV DRM0241_PV DRM0242_PV
1: 364 113 -132 31 536 118 248 385 218 202 218 41 23 218 379
DRM0243_PV DRM0244_PV DRM0245_PV DRM0246_PV DRM0247_PV DRM0248_PV DRM0249_PV DRM0250_PV DRM0251_PV DRM0252_PV DRM0253_PV DRM0254_PV DRM0255_PV DRM0256_PV DRM0257_PV
1: -158 462 600 221 218 221 442 218 53 218 176 504 -61 78 68
DRM0258_PV DRM0259_PV DRM0260_PV DRM0261_PV DRM0262_PV DRM0263_PV DRM0264_PV DRM0265_PV DRM0266_PV DRM0267_PV DRM0268_PV DRM0269_PV DRM0270_PV DRM0271_PV DRM0272_PV
1: 493 403 218 339 299 749 -18 465 686 -215 579 307 366 279 94
DRM0273_PV DRM0274_PV DRM0275_PV DRM0276_PV DRM0277_PV DRM0278_PV DRM0279_PV DRM0280_PV DRM0281_PV DRM0282_PV DRM0283_PV DRM0284_PV DRM0285_PV DRM0286_PV DRM0287_PV
1: 138 56 459 613 219 400 35 -74 516 218 -80 317 310 -231 229
DRM0288_PV DRM0289_PV DRM0290_PV DRM0291_PV DRM0292_PV DRM0293_PV DRM0294_PV DRM0295_PV DRM0296_PV DRM0297_PV DRM0298_PV DRM0299_PV DRM0300_PV DRM0301_PV DRM0302_PV
1: 345 -70 619 235 122 61 337 -163 210 586 127 -112 368 365 476
DRM0303_PV DRM0304_PV DRM0305_PV DRM0306_PV DRM0307_PV DRM0308_PV DRM0309_PV DRM0310_PV DRM0311_PV DRM0312_PV DRM0313_PV DRM0314_PV DRM0315_PV DRM0316_PV DRM0317_PV
1: 240 270 497 97 420 -184 212 -28 151 527 186 -32 60 96 -86
DRM0318_PV DRM0319_PV DRM0320_PV DRM0321_PV DRM0322_PV DRM0323_PV DRM0324_PV DRM0325_PV DRM0326_PV DRM0327_PV DRM0328_PV DRM0329_PV DRM0330_PV DRM0331_PV DRM0332_PV
1: 454 321 300 552 319 134 -63 622 441 297 507 578 198 360 542
DRM0333_PV DRM0334_PV DRM0335_PV DRM0336_PV DRM0337_PV DRM0338_PV DRM0339_PV DRM0340_PV DRM0341_PV DRM0342_PV DRM0343_PV DRM0344_PV DRM0345_PV DRM0346_PV DRM0347_PV
1: 153 318 68 763 370 337 633 469 453 146 428 418 169 468 526
DRM0348_PV DRM0349_PV DRM0350_PV DRM0351_PV DRM0352_PV DRM0353_PV DRM0354_PV DRM0355_PV DRM0356_PV DRM0357_PV DRM0358_PV DRM0359_PV DRM0360_PV DRM0361_PV DRM0362_PV
1: 441 674 21 -182 174 153 -158 268 191 460 10 82 543 -193 218
DRM0363_PV DRM0364_PV DRM0365_PV
1: -203 269 479
> SPV %>% filter(Id==idd, date2 == ymd(dmda), Category == CategoryChosse)
Id date1 date2 Week Category DRM001_PV DRM002_PV DRM003_PV DRM004_PV DRM005_PV DRM006_PV DRM007_PV DRM008_PV DRM009_PV DRM010_PV DRM011_PV DRM012_PV
1: 3 2021-12-01 2021-12-03 Monday ABC -3 374 198 17 537 -54 330 -136 -116 534 18 -199
DRM013_PV DRM014_PV DRM015_PV DRM016_PV DRM017_PV DRM018_PV DRM019_PV DRM020_PV DRM021_PV DRM022_PV DRM023_PV DRM024_PV DRM025_PV DRM026_PV DRM027_PV DRM028_PV
1: 106 106 349 76 684 390 218 146 141 20 435 218 372 321 218 218
DRM029_PV DRM030_PV DRM031_PV DRM032_PV DRM033_PV DRM034_PV DRM035_PV DRM036_PV DRM037_PV DRM038_PV DRM039_PV DRM040_PV DRM041_PV DRM042_PV DRM043_PV DRM044_PV
1: 55 455 46 411 262 449 325 467 43 -114 191 167 63 -123 252 218
DRM045_PV DRM046_PV DRM047_PV DRM048_PV DRM049_PV DRM050_PV DRM051_PV DRM052_PV DRM053_PV DRM054_PV DRM055_PV DRM056_PV DRM057_PV DRM058_PV DRM059_PV DRM060_PV
1: 305 420 -296 596 200 218 190 203 607 218 442 -72 463 129 -39 333
DRM061_PV DRM062_PV DRM063_PV DRM064_PV DRM065_PV DRM066_PV DRM067_PV DRM068_PV DRM069_PV DRM070_PV DRM071_PV DRM072_PV DRM073_PV DRM074_PV DRM075_PV DRM076_PV
1: -26 160 -91 326 218 369 317 476 224 61 195 613 342 218 204 521
DRM077_PV DRM078_PV DRM079_PV DRM080_PV DRM081_PV DRM082_PV DRM083_PV DRM084_PV DRM085_PV DRM086_PV DRM087_PV DRM088_PV DRM089_PV DRM090_PV DRM091_PV DRM092_PV
1: 588 218 449 340 51 508 -72 42 492 510 328 818 -132 -105 210 -102
DRM093_PV DRM094_PV DRM095_PV DRM096_PV DRM097_PV DRM098_PV DRM099_PV DRM0100_PV DRM0101_PV DRM0102_PV DRM0103_PV DRM0104_PV DRM0105_PV DRM0106_PV DRM0107_PV
1: -137 94 639 265 -64 512 32 -53 414 340 -16 471 434 150 267
DRM0108_PV DRM0109_PV DRM0110_PV DRM0111_PV DRM0112_PV DRM0113_PV DRM0114_PV DRM0115_PV DRM0116_PV DRM0117_PV DRM0118_PV DRM0119_PV DRM0120_PV DRM0121_PV DRM0122_PV
1: 383 -162 434 -134 -39 450 212 146 -26 8 222 341 601 239 57
DRM0123_PV DRM0124_PV DRM0125_PV DRM0126_PV DRM0127_PV DRM0128_PV DRM0129_PV DRM0130_PV DRM0131_PV DRM0132_PV DRM0133_PV DRM0134_PV DRM0135_PV DRM0136_PV DRM0137_PV
1: 484 239 502 415 504 62 487 168 101 319 365 37 218 -50 230
DRM0138_PV DRM0139_PV DRM0140_PV DRM0141_PV DRM0142_PV DRM0143_PV DRM0144_PV DRM0145_PV DRM0146_PV DRM0147_PV DRM0148_PV DRM0149_PV DRM0150_PV DRM0151_PV DRM0152_PV
1: 493 159 150 132 58 21 468 -81 27 345 107 148 -66 -146 -185
DRM0153_PV DRM0154_PV DRM0155_PV DRM0156_PV DRM0157_PV DRM0158_PV DRM0159_PV DRM0160_PV DRM0161_PV DRM0162_PV DRM0163_PV DRM0164_PV DRM0165_PV DRM0166_PV DRM0167_PV
1: -14 562 68 140 353 120 130 301 76 441 218 370 218 378 -22
DRM0168_PV DRM0169_PV DRM0170_PV DRM0171_PV DRM0172_PV DRM0173_PV DRM0174_PV DRM0175_PV DRM0176_PV DRM0177_PV DRM0178_PV DRM0179_PV DRM0180_PV DRM0181_PV DRM0182_PV
1: -279 563 628 600 152 218 445 246 420 94 495 509 356 183 326
DRM0183_PV DRM0184_PV DRM0185_PV DRM0186_PV DRM0187_PV DRM0188_PV DRM0189_PV DRM0190_PV DRM0191_PV DRM0192_PV DRM0193_PV DRM0194_PV DRM0195_PV DRM0196_PV DRM0197_PV
1: 493 -190 -65 -123 376 357 473 112 -69 471 452 221 165 -44 87
DRM0198_PV DRM0199_PV DRM0200_PV DRM0201_PV DRM0202_PV DRM0203_PV DRM0204_PV DRM0205_PV DRM0206_PV DRM0207_PV DRM0208_PV DRM0209_PV DRM0210_PV DRM0211_PV DRM0212_PV
1: 239 285 521 -65 158 223 160 223 269 57 218 218 102 329 218
DRM0213_PV DRM0214_PV DRM0215_PV DRM0216_PV DRM0217_PV DRM0218_PV DRM0219_PV DRM0220_PV DRM0221_PV DRM0222_PV DRM0223_PV DRM0224_PV DRM0225_PV DRM0226_PV DRM0227_PV
1: 769 215 -68 218 347 18 218 547 759 278 -80 -37 629 -16 774
DRM0228_PV DRM0229_PV DRM0230_PV DRM0231_PV DRM0232_PV DRM0233_PV DRM0234_PV DRM0235_PV DRM0236_PV DRM0237_PV DRM0238_PV DRM0239_PV DRM0240_PV DRM0241_PV DRM0242_PV
1: 364 113 -132 31 536 118 248 385 218 202 218 41 23 218 379
DRM0243_PV DRM0244_PV DRM0245_PV DRM0246_PV DRM0247_PV DRM0248_PV DRM0249_PV DRM0250_PV DRM0251_PV DRM0252_PV DRM0253_PV DRM0254_PV DRM0255_PV DRM0256_PV DRM0257_PV
1: -158 462 600 221 218 221 442 218 53 218 176 504 -61 78 68
DRM0258_PV DRM0259_PV DRM0260_PV DRM0261_PV DRM0262_PV DRM0263_PV DRM0264_PV DRM0265_PV DRM0266_PV DRM0267_PV DRM0268_PV DRM0269_PV DRM0270_PV DRM0271_PV DRM0272_PV
1: 493 403 218 339 299 749 -18 465 686 -215 579 307 366 279 94
DRM0273_PV DRM0274_PV DRM0275_PV DRM0276_PV DRM0277_PV DRM0278_PV DRM0279_PV DRM0280_PV DRM0281_PV DRM0282_PV DRM0283_PV DRM0284_PV DRM0285_PV DRM0286_PV DRM0287_PV
1: 138 56 459 613 219 400 35 -74 516 218 -80 317 310 -231 229
DRM0288_PV DRM0289_PV DRM0290_PV DRM0291_PV DRM0292_PV DRM0293_PV DRM0294_PV DRM0295_PV DRM0296_PV DRM0297_PV DRM0298_PV DRM0299_PV DRM0300_PV DRM0301_PV DRM0302_PV
1: 345 -70 619 235 122 61 337 -163 210 586 127 -112 368 365 476
DRM0303_PV DRM0304_PV DRM0305_PV DRM0306_PV DRM0307_PV DRM0308_PV DRM0309_PV DRM0310_PV DRM0311_PV DRM0312_PV DRM0313_PV DRM0314_PV DRM0315_PV DRM0316_PV DRM0317_PV
1: 240 270 497 97 420 -184 212 -28 151 527 186 -32 60 96 -86
DRM0318_PV DRM0319_PV DRM0320_PV DRM0321_PV DRM0322_PV DRM0323_PV DRM0324_PV DRM0325_PV DRM0326_PV DRM0327_PV DRM0328_PV DRM0329_PV DRM0330_PV DRM0331_PV DRM0332_PV
1: 454 321 300 552 319 134 -63 622 441 297 507 578 198 360 542
DRM0333_PV DRM0334_PV DRM0335_PV DRM0336_PV DRM0337_PV DRM0338_PV DRM0339_PV DRM0340_PV DRM0341_PV DRM0342_PV DRM0343_PV DRM0344_PV DRM0345_PV DRM0346_PV DRM0347_PV
1: 153 318 68 763 370 337 633 469 453 146 428 418 169 468 526
DRM0348_PV DRM0349_PV DRM0350_PV DRM0351_PV DRM0352_PV DRM0353_PV DRM0354_PV DRM0355_PV DRM0356_PV DRM0357_PV DRM0358_PV DRM0359_PV DRM0360_PV DRM0361_PV DRM0362_PV
1: 441 674 21 -182 174 153 -158 268 191 460 10 82 543 -193 218
DRM0363_PV DRM0364_PV DRM0365_PV
1: -203 269 479
So coef will be ymd(dmda) - ymd(min(df1$date1)). That is, if I do to this id/date/category that I mentioned I get a difference of 2 days, so the value I want is the DRM003_PV . So the value for this case is 198. Therefore, I made:
coef<-SPV %>%
filter(Id==idd, date2 == ymd(dmda), Category == CategoryChosse) %>%
pull(as.numeric(ymd(dmda)-ymd(min(df1$date1)))+6)
> coef
[1] 198
This issue has been resolved here: Adjust code to choose a specific column depending on the difference between dates
Libraries and database
library(tidyverse)
library(lubridate)
library(data.table)
library(bench)
set.seed(123)
df1 <- data.frame( Id = rep(1:5, length=800),
date1 = as.Date( "2021-12-01"),
date2= rep(seq( as.Date("2021-01-01"), length.out=400, by=1), each = 2),
Category = rep(c("ABC", "EFG"), length.out = 800),
Week = rep(c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday",
"Saturday", "Sunday"), length.out = 800),
DR1 = sample( 200:250, 800, repl=TRUE),
setNames( replicate(365, { sample(0:800, 800)}, simplify=FALSE),
paste0("DRM0", formatC(1:365, width = 2, format = "d", flag = "0"))))
First function
return_values <- function (df1,idd,dmda, CategoryChosse) {
# First idea: Calculate the median of the values resulting from the subtraction between DR1 and the values of the DRM0 columns
dt1 <- as.data.table(df1)
cols <- grep("^DRM0", colnames(dt1), value = TRUE)
med <-
dt1[, (paste0(cols, "_PV")) := DR1 - .SD, .SDcols = cols
][, lapply(.SD, median), by = .(Id, Category, Week), .SDcols = paste0(cols, "_PV") ]
# Second idea: After obtaining the median, I add the values found with the values of the DRM columns of my df1 database.
f2 <- function(nm, pat) grep(pat, nm, value = TRUE)
nm1 <- f2(names(df1), "^DRM0\\d+$")
nm2 <- f2(names(med), "_PV")
nm3 <- paste0("i.", nm2)
setDT(df1)[med,(nm2) := Map(`+`, mget(nm1), mget(nm3)), on = .(Id, Category, Week)]
SPV <- df1[, c('Id','date1', 'date2', 'Week','Category', nm2), with = FALSE]#%>%data.frame
# Third idea: Calculate the coef values
coef<-SPV %>%
filter(Id==idd, date2 == ymd(dmda), Category == CategoryChosse) %>%
pull(as.numeric(ymd(dmda)-ymd(min(df1$date1)))+6)
return(coef)
}
Results using first function
subset_df1 <- subset(df1, date2 > date1)
a<-subset_df1 %>%
rowwise %>%
select(-c(Week,starts_with('DR')))%>%
mutate(Result=return_values(df1,Id, date2, Category)) %>%
data.frame()
> a
Id date1 date2 Category Result
1 1 2021-12-01 2021-12-02 ABC 4.0
2 2 2021-12-01 2021-12-02 EFG 238.0
3 3 2021-12-01 2021-12-03 ABC 198.0
4 4 2021-12-01 2021-12-03 EFG 163.0
5 5 2021-12-01 2021-12-04 ABC 462.0
...........
Second function
return_valuesX <- function (df1,idd,dmda, CategoryChosse) {
# First idea: Calculate the median of the values resulting from the subtraction between DR1 and the values of the DRM columns
dt1 <- as.data.table(df1)
num_to_pull <- as.numeric(ymd(dmda)-ymd(min(df1$date1)))+6
cols <- grep("^DRM0", colnames(dt1), value = TRUE)[1:num_to_pull]
med <-
dt1[, (paste0(cols, "_PV")) := DR1 - .SD, .SDcols = cols
][, lapply(.SD, median), by = .(Id, Category, Week), .SDcols = paste0(cols, "_PV") ]
# Second idea: After obtaining the median, I add the values found with the values of the DRM columns of my df1 database.
f2 <- function(nm, pat) grep(pat, nm, value = TRUE)
nm1 <- f2(names(df1), "^DRM0\\d+$")[1:num_to_pull]
nm2 <- f2(names(med), "_PV")[1:num_to_pull]
nm3 <- paste0("i.", nm2)[1:num_to_pull]
setDT(df1)[med,(nm2) := Map(`+`, mget(nm1), mget(nm3)), on = .(Id, Category, Week)]
SPV <- df1[, c('Id','date1', 'date2', 'Week','Category', nm2), with = FALSE]#%>%data.frame
# Third idea: Calculate the coef values
coef<-SPV %>%
filter(Id==idd, date2 == ymd(dmda), Category == CategoryChosse) %>%
pull(num_to_pull)
return(coef)
}
Results using second function
b<-subset_df1 %>%
rowwise %>%
select(-c(Week,starts_with('DR')))%>%
mutate(Result = return_valuesX(df1,Id, date2, Category)) %>%
data.frame()
> b
Id date1 date2 Category Result
1 1 2021-12-01 2021-12-02 ABC 4.0
2 2 2021-12-01 2021-12-02 EFG 238.0
3 3 2021-12-01 2021-12-03 ABC 198.0
4 4 2021-12-01 2021-12-03 EFG 163.0
5 5 2021-12-01 2021-12-04 ABC 462.0
...............
Comparing the two results:
identical(a, b)
[1] TRUE
Calculate processing time using benchmark
subset_df1 <- subset(df1, date2 > date1)
bench::mark(a=subset_df1 %>%
rowwise %>%
select(-c(Week,starts_with('DR')))%>%
mutate(Result=return_values(df1,Id, date2, Category)),
b=subset_df1 %>%
rowwise %>%
select(-c(Week,starts_with('DR')))%>%
mutate(Result=return_valuesX(df1,Id, date2, Category)),iterations = 1)
# A tibble: 2 x 13
expression min median `itr/sec` mem_alloc `gc/sec` n_itr n_gc total_time result memory time gc
<bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl> <int> <dbl> <bch:tm> <list> <list> <list> <list>
1 a 53.7s 53.7s 0.0186 4.54GB 0.634 1 34 53.7s <rowwise_df [130 x 5]> <Rprofmem [981,580 x 3]> <bench_tm [1]> <tibble [1 x 3]>
2 b 21s 21s 0.0477 913.77MB 0.382 1 8 21s <rowwise_df [130 x 5]> <Rprofmem [278,340 x 3]> <bench_tm [1]> <tibble [1 x 3]>
To check df1 database
Here's an approach that is about 20x/10x faster than your functions for the example data, and would be faster yet for larger data sets. (When I run with 100k rows in df1, it's 572x faster.) I hope you'll find this approach easier to understand and debug.
This is written using tidyverse functions like tidyr::pivot_longer and dplyr::group_by. If you want to squeeze out a bit more speed, the data.table and collapse packages offer faster alternatives for many functions, especially around grouped calculations. But the main speed improvement here is from restructuring to avoid repeating the same calculations over and over and letting R rely more on vectorized calculations. https://www.noamross.net/archives/2014-04-16-vectorization-in-r-why/
pre_calc <- function(df) {
pre_calc <- df1 %>% # this calculates once on the full data
select(!ends_with("_PV")) %>%
pivot_longer(-c(1:6), values_to = "DRM", names_to = "day") %>%
mutate(day = parse_number(day)) %>%
group_by(Id, Category, Week, day) %>%
mutate(med = median(DR1 - DRM), Result = DRM + med) %>%
ungroup()
df %>% # starts from the subsetted data and joins to results from above
select(1:5) %>%
left_join(pre_calc) %>%
filter(day == date2 - date1 + 1) %>%
select(Id, date1, date2, Category, Result)
}
c <- subset_df1 %>% pre_calc()
c matches a and b from your tests, with the one difference that
date2 (originally date-integer, which is a nonstandard type) has in my approach been
coerced into a typical date-double, like date1. We can use
typeof(df1$date1) & typeof(df1$date2) to see this.
waldo::compare(b, c) confirms the results otherwise match. I
opened an issue with tidyr here since the subtle change seems
to have been caused by the pivot_longer step.
UPDATE: Apparently the creation of a date-integer object is a bug in base R's seq.Date / seq function, which was fixed in R 4.2: https://github.com/tidyverse/tidyr/issues/1356#issuecomment-1111078891
In the approach above, I pre-calculate all the results once, by taking the original data set df1, throwing out the existing _PV columns (I think they get overwritten?), and -- here's where the speed gains come from -- reshaping to long format. While this single operation is computationally expensive, it means we can more efficiently apply the same calculation to all the DRM_* columns at once, and we can rely on fast filtering instead of slow subsetting to extract our result.
The group_by(Id, Category, Week, day) line and the next mutate(... line let us calculate the median differences between DR1 and that day's DRM for each Id-Category-Week combination, so we can calculate all the Results at once.
The last part takes the df lines (e.g. the subsetted data in your example where date2 > date1) and attaches those to the pre-calculated results, filtering to get the right day (previously encoded by column name/position).
Related
How to calculate Williams %R in RStudio?
I am trying to write a function to calculate Williams %R on data in R. Here is my code: getSymbols('AMD', src = 'yahoo', from = '2018-01-01') wr = function(high, low, close, n) { highh = runMax((high),n) lowl = runMin((low),n) -100 * ((highh - close) / (highh - lowl)) } williampr = wr(AMD$AMD.High, AMD$AMD.Low, AMD$AMD.Close, n = 10) After implementing a buy/sell/hold signal, it returns integer(0): ## 1 = BUY, 0 = HOLD, -1 = SELL ## implement Lag to shift the time back to the previous day tradingSignal = Lag( ## if wpr is greater than 0.8, BUY ifelse(Lag(williampr) > 0.8 & williampr < 0.8,1, ## if wpr signal is less than 0.2, SELL, else, HOLD ifelse(Lag(williampr) > 0.2 & williampr < 0.2,-1,0))) ## make all missing values equal to 0 tradingSignal[is.na(tradingSignal)] = 0 ## see how many SELL signals we have which(tradingSignal == "-1") What am I doing wrong?
It would have been a good idea to identify that you were using the package quantmod in your question. There are two things preventing this from working. You didn't inspect what you expected! Your results in williampr are all negative. Additionally, you multiplied the values by 100, so 80% is 80, not .8. I removed -100 *. I have done the same thing so many times. wr = function(high, low, close, n) { highh = runMax((high),n) lowl = runMin((low),n) ((highh - close) / (highh - lowl)) } That's it. It works now. which(tradingSignal == "-1") # [1] 13 15 19 22 39 71 73 84 87 104 112 130 134 136 144 146 151 156 161 171 175 # [22] 179 217 230 255 268 288 305 307 316 346 358 380 386 404 449 458 463 468 488 492 494 # [43] 505 510 515 531 561 563 570 572 574 594 601 614 635 642 644 646 649 666 668 672 691 # [64] 696 698 719 729 733 739 746 784 807 819 828 856 861 872 877 896 900 922 940 954 968 # [85] 972 978 984 986 1004 1035 1048 1060
fast partial match checking in R (or Python or Julia)
I have two dataset with names and I need to compare names in both datasets. I just need to keep the union of the two datasets based on the names. However, a name is still considered 'matched' if it is part of the another name even if it is not a full match and vice versa. For example, "seb" should match to "seb", but also to "sebas". I am using str_detect(), but it is too slow. I am wondering if there is any way to speed up this process. I tried some other packages and functions, but nothing really improved the speed. I am open for any R or Python solution. Create two dummy datasets library(dplyr) library(stringr) set.seed(1) data_set_A <- tibble(name = unique(replicate(2000, paste(sample(letters, runif(1, 3, 10), replace = T), collapse = "")))) %>% mutate(ID_A = 1:n()) set.seed(2) data_set_B <- tibble(name_2 = unique(replicate(2000, paste(sample(letters, runif(1, 3, 10), replace = T), collapse = "")))) %>% mutate(ID_B = 1:n()) Test matching of full matches only # This is almost instant data_set_A %>% rowwise() %>% filter(any(name %in% data_set_B$name_2) | any(data_set_B$name_2 %in% name)) %>% ungroup() # A tibble: 4 x 2 name ID_A <chr> <int> 1 vnt 112 2 fly 391 3 cug 1125 4 xgv 1280 Include partial matches (This is what I want to optimize) This of course only gives me the subset of dataset A, but that is ok. # This takes way too long data_set_A %>% rowwise() %>% filter(any(str_detect(name, data_set_B$name_2)) | any(str_detect(data_set_B$name_2, name))) %>% ungroup() A tibble: 237 x 2 name ID_A <chr> <int> 1 wknrsauuj 2 2 lyw 7 3 igwsvrzpk 16 4 zozxjpu 18 5 cgn 22 6 oqo 45 7 gkritbe 47 8 uuq 92 9 lhwfyksz 94 10 tuw 100 Fuzzyjoin method. This also works, but is equally slow bind_rows( fuzzyjoin::fuzzy_inner_join( data_set_A, data_set_B, by = c("name" = "name_2"), match_fun = stringr::str_detect ) %>% select(name, ID_A), fuzzyjoin::fuzzy_inner_join( data_set_B, data_set_A, by = c("name_2" = "name"), match_fun = stringr::str_detect ) %>% select(name, ID_A) ) %>% distinct() data.table solution not much faster unfortunately library(data.table) setDT(data_set_A) setDT(data_set_B) data_set_A[data_set_A[, .I[any(str_detect(name, data_set_B$name_2)) | any(str_detect(data_set_B$name_2, name))], by = .(ID_A)]$V1]
This is an [r] option aimed at reducing the number of times you are calling str_detect() (i.e., your example is slow because the function is called several thousand times; and for not using fixed() or fixed = TRUE as jpiversen already pointed out). Answer explained in comments in the code; I will try to jump on tomorrow to explain a bit more. This should scale reasonably well and be more memory efficient than the current approach too because reduces the rowwise computations to an absolute minimum. Benchmarks: n = 2000 # A tibble: 4 × 13 expression min median `itr/sec` mem_alloc `gc/sec` n_itr <bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl> <int> 1 original() 6.67s 6.67s 0.150 31.95MB 0.300 1 2 using_fixed() 496.54ms 496.54ms 2.01 61.39MB 4.03 1 3 using_map_fixed() 493.35ms 493.35ms 2.03 60.27MB 6.08 1 4 andrew_fun() 167.78ms 167.78ms 5.96 1.59MB 0 1 n = 4000 Note: I am not sure if you need the answer to scale; but the approach of reducing the memory-intensive part does seem to do just that (although the time difference is negligible for n = 4000 for 1 iteration, IMO). # A tibble: 4 × 13 expression min median `itr/sec` mem_alloc `gc/sec` n_itr <bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl> <int> 1 original() 26.63s 26.63s 0.0376 122.33MB 0.150 1 2 using_fixed() 1.91s 1.91s 0.525 243.96MB 3.67 1 3 using_map_fixed() 1.87s 1.87s 0.534 236.62MB 3.20 1 4 andrew_fun() 674.36ms 674.36ms 1.48 7.59MB 0 1 Code w/ comments: # This is so we do not retain the strings with the max number of # characters in our pattern because we are checking with %in% already nchar_a = nchar(data_set_A$name) nchar_b = nchar(data_set_B$name_2) # Creating large patterns (excluding values w/ max number of characters) pattern_a = str_c(unique(data_set_A$name[nchar_a != max(nchar_a, na.rm = TRUE)]), collapse = "|") pattern_b = str_c(unique(data_set_B$name_2[nchar_b != max(nchar_b, na.rm = TRUE)]), collapse = "|") # First checking using %in% idx_a = data_set_A$name %in% data_set_B$name_2 # Next, IDing when a(string) matches b(pattern) idx_a[!idx_a] = str_detect(data_set_A$name[!idx_a], pattern_b) # IDing a(pattern) matches b(string) so we do not run every row of # a(as a pattern) against all of b b_to_check = data_set_B$name_2[str_detect(data_set_B$name_2, pattern_a)] # Using unmatched values of a as a pattern for the reduced set for b idx_a[!idx_a] = vapply(data_set_A$name[!idx_a], function(name) { any(grepl(name, b_to_check, fixed = TRUE)) }, logical(1L), USE.NAMES = FALSE) data_set_A[idx_a, ] # A tibble: 237 × 2 name ID_A <chr> <int> 1 wknrsauuj 2 2 lyw 7 3 igwsvrzpk 16 4 zozxjpu 18 5 cgn 22 6 oqo 45 7 gkritbe 47 8 uuq 92 9 lhwfyksz 94 10 tuw 100 # … with 227 more rows Reproducible R code for benchmarks The following code is largely taken from jpiversen who provided a great answer: library(dplyr) library(stringr) n = 2000 set.seed(1) data_set_A <- tibble(name = unique(replicate(n, paste(sample(letters, runif(1, 3, 10), replace = T), collapse = "")))) %>% mutate(ID_A = 1:n()) set.seed(2) data_set_B <- tibble(name_2 = unique(replicate(n, paste(sample(letters, runif(1, 3, 10), replace = T), collapse = "")))) %>% mutate(ID_B = 1:n()) original <- function() { data_set_A %>% rowwise() %>% filter(any(str_detect(name, data_set_B$name_2)) | any(str_detect(data_set_B$name_2, name))) %>% ungroup() } using_fixed <- function() { data_set_A %>% rowwise() %>% filter(any(str_detect(name, fixed(data_set_B$name_2))) | any(str_detect(data_set_B$name_2, fixed(name)))) %>% ungroup() } using_map_fixed <- function() { logical_vec <- data_set_A$name %>% purrr::map_lgl( ~any(stringr::str_detect(.x, fixed(data_set_B$name_2))) || any(stringr::str_detect(data_set_B$name_2, fixed(.x))) ) data_set_A[logical_vec, ] } andrew_fun = function() { nchar_a = nchar(data_set_A$name) nchar_b = nchar(data_set_B$name_2) pattern_a = str_c(unique(data_set_A$name[nchar_a != max(nchar_a, na.rm = TRUE)]), collapse = "|") pattern_b = str_c(unique(data_set_B$name_2[nchar_b != max(nchar_b, na.rm = TRUE)]), collapse = "|") idx_a = data_set_A$name %in% data_set_B$name_2 idx_a[!idx_a] = str_detect(data_set_A$name[!idx_a], pattern_b) b_to_check = data_set_B$name_2[str_detect(data_set_B$name_2, pattern_a)] idx_a[!idx_a] = vapply(data_set_A$name[!idx_a], function(name) { any(grepl(name, b_to_check, fixed = TRUE)) }, logical(1L), USE.NAMES = FALSE) data_set_A[idx_a, ] } bm = bench::mark( original(), using_fixed(), using_map_fixed(), andrew_fun(), iterations = 1 )
TL;DR The slow part is str_detect(string, pattern). To speed it up, wrap pattern in fixed() if you got simple strings, and in coll() if you got longer, typical human text. To get another slight speed boost, rewrite your code using purrr::map_lgl() and use this to subset your data. Under follows examples, explanations and benchmarks. Rewriting str_detect() using fixed() or coll() I believe the easiest fix is to modify how str_detect() uses regex with e.g. stringr::fixed() or stringr::coll(). From ?stringr::str_detect(): Match a fixed string (i.e. by comparing only bytes), using fixed(). This is fast, but approximate. Generally, for matching human text, you'll want coll() which respects character matching rules for the specified locale. Under is a comparison with your original code: original <- function() { data_set_A %>% rowwise() %>% filter(any(str_detect(name, data_set_B$name_2)) | any(str_detect(data_set_B$name_2, name))) %>% ungroup() } # Note the use of fixed() using_fixed <- function() { data_set_A %>% rowwise() %>% filter(any(str_detect(name, fixed(data_set_B$name_2))) | any(str_detect(data_set_B$name_2, fixed(name)))) %>% ungroup() } # Note the use of coll() using_coll <- function() { data_set_A %>% rowwise() %>% filter(any(str_detect(name, coll(data_set_B$name_2))) | any(str_detect(data_set_B$name_2, coll(name)))) %>% ungroup() } bm <- bench::mark( original(), using_fixed(), using_coll(), iterations = 20 ) #> Warning: Some expressions had a GC in every iteration; so filtering is disabled. bm #> # A tibble: 3 × 6 #> expression min median `itr/sec` mem_alloc `gc/sec` #> <bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl> #> 1 original() 6.58s 6.59s 0.152 32.4MB 0.371 #> 2 using_fixed() 501.64ms 505.51ms 1.97 61.4MB 3.94 #> 3 using_coll() 4.48s 4.5s 0.222 61.4MB 0.512 bm %>% ggplot2::autoplot(type = "violin") #> Loading required namespace: tidyr Created on 2022-04-02 by the reprex package (v2.0.1) So, as we can see, wrapping your code in fixed() will make it very fast and works well on your test data. However, it might not work as well for real human text (especially non-ASCII character sets). You should test it on your original data, and use coll() as an alternative if fixed() doesn't work. Removing rowwise() Another step you can take to make your code a bit faster is to get rid of rowwise(). I would replace it using purrr::map_lgl() and use this logical vector to subset the dataframe. Under is an example and a benchmark against my functions defined above: using_map_fixed <- function() { logical_vec <- data_set_A$name %>% purrr::map_lgl( ~any(stringr::str_detect(.x, fixed(data_set_B$name_2))) || any(stringr::str_detect(data_set_B$name_2, fixed(.x))) ) data_set_A[logical_vec, ] } using_map_coll <- function() { logical_vec <- data_set_A$name %>% purrr::map_lgl( ~any(stringr::str_detect(.x, coll(data_set_B$name_2))) || any(stringr::str_detect(data_set_B$name_2, coll(.x))) ) data_set_A[logical_vec, ] } bm <- bench::mark( using_fixed(), using_map_fixed(), using_coll(), using_map_coll(), iterations = 20 ) #> Warning: Some expressions had a GC in every iteration; so filtering is disabled. bm #> # A tibble: 4 × 6 #> expression min median `itr/sec` mem_alloc `gc/sec` #> <bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl> #> 1 using_fixed() 503.4ms 507.24ms 1.95 62.9MB 5.37 #> 2 using_map_fixed() 474.28ms 477.63ms 2.09 60.3MB 3.14 #> 3 using_coll() 4.49s 4.5s 0.222 61.4MB 0.489 #> 4 using_map_coll() 4.37s 4.38s 0.228 60.2MB 0.354 Created on 2022-04-02 by the reprex package (v2.0.1) As we see, this gives another slight speed boost. Using fixed() with data.table or fuzzyjoin You can also use fixed() with data.table and fuzzyjoin. I have not included it here for brevity, but my benchmark shows that data.table takes about the same amount of time as my using_map_fixed() above, and fuzzyjoin takes about twice as much time. This makes sense to me, as the slow part is str_detect(), not the method of joining/filtering, or the underlying data structure.
If you would like to use base R, the code below might be one fast option A <- data_set_A$name B <- data_set_B$name_2 A2B <- sapply(A, function(x) grepl(x, B, fixed = TRUE)) B2A <- sapply(B, function(x) grepl(x, A, fixed = TRUE)) idx <- which(t(A2B) | B2A, arr.ind = TRUE) res <- cbind(data_set_A[idx[, 1], ], data_set_B[idx[, 2], ]) which gives > res name ID_A name_2 ID_B 1 arh 1234 pimoarhd 8 2 qtj 720 aqtj 23 3 szcympsn 142 cym 43 4 cymvubnxg 245 cym 43 5 dppvtcymq 355 cym 43 6 kzi 690 kzii 48 7 eyajqchkn 498 chk 53 8 upfzh 522 upf 61 9 ioa 1852 ioadr 63 10 lya 1349 ibelyalvh 64 11 honod 504 ono 71 12 zozxjpu 18 zoz 72 13 jcz 914 cdjczpqg 88 14 ailmjf 623 ilm 99 15 upoux 609 oux 104 16 pouxifvp 1466 oux 104 17 mvob 516 vob 106 18 nqtotvhhm 1088 otv 115 19 wom 202 womtglapx 117 20 qkc 756 dqkcfqpps 118 21 qtl 600 ivqtlymzr 126 22 qqi 1605 owfsqqiyu 153 23 fmjalirze 1470 ali 172 24 ibwfwkyp 1588 fwk 175 25 iat 1258 iatjeg 185 26 osm 253 nviiqosm 199 27 wpj 373 wpjeb 204 28 hahx 515 ahx 213 29 keahxa 1565 ahx 213 30 psf 359 qnpsfo 223 31 saq 1859 saqhu 227 32 cvmkwtx 714 cvm 228 33 ilw 389 pyilwj 231 34 ohwysv 1590 ysv 237 35 utrl 698 trl 244 36 dmttrlcpj 1267 trl 244 37 cpv 236 btcpvmoc 247 38 uto 1047 utoi 257 39 yngunekl 1978 ekl 258 40 vceko 625 vce 265 41 fir 1934 firgk 278 42 qvd 983 eqvdfi 287 43 fir 1934 zwwefir 291 44 idvfkevdf 1380 vdf 312 45 qwdo 1921 qwd 322 46 kam 1205 tlkam 327 47 lck 488 clckjkyzn 329 48 gmspwckw 1015 msp 359 49 ynouuwqtz 1576 nou 360 50 tty 1209 bttyvt 361 51 vkc 999 fmrvkcl 366 52 ipw 1918 fipwjomdu 388 53 zdv 261 zdvkut 410 54 vku 1137 zdvkut 410 55 doby 246 oby 411 56 hycvuupgy 141 uup 421 57 uwlb 1249 wlb 431 58 auj 1452 lcmnauj 444 59 rwd 1667 ukwrwdczs 479 60 ylsihqqor 1290 ihq 483 61 feo 1649 feorvxbm 485 62 zff 755 dohzffujm 499 63 mqutujepu 904 epu 507 64 uiepu 1308 epu 507 65 vahepuk 1434 epu 507 66 cug 1125 accugl 509 67 fir 1934 firwe 517 68 dia 1599 dialeddd 527 69 temiwd 1725 tem 531 70 svofivl 1177 svo 545 71 flm 657 aflm 546 72 vnt 112 vnt 551 73 bhmoskrz 426 osk 558 74 wev 728 shemuwev 569 75 hzpi 1586 hzp 579 76 gvi 1064 mkgvivlfe 582 77 fjb 1398 vkfjbxnjl 589 78 qin 1013 qinp 593 79 ecn 1342 ecnzre 598 80 zre 1610 ecnzre 598 81 xvr 772 dpxvrfmo 623 82 tqr 1419 tqrmztdm 624 83 zmwnf 1571 mwn 626 84 ypil 1787 pil 630 85 mnxlqgfh 1132 nxl 643 86 gse 1563 gseice 646 87 ygk 1309 ygkqrk 655 88 fgm 933 vzfgmy 663 89 rlupd 977 upd 666 90 mcupdkuiy 1307 upd 666 91 fly 391 fly 669 92 vbkko 1603 kko 678 93 uvrew 465 rew 680 94 hgbhngwvd 901 wvd 690 95 wvdjprmo 1432 wvd 690 96 cgn 22 cgnd 698 97 dngnjv 967 njv 700 98 psqs 841 sqs 720 99 ywv 1180 ptywvlgc 730 100 ypil 1787 ypi 734 101 rwd 1667 srserwd 737 102 jqydasl 1294 jqy 742 103 ckujmc 717 ujm 751 104 dfzxta 662 xta 775 105 bjb 1562 jabjbei 779 106 adwknpll 1242 npl 780 107 kdv 1327 xhkdvqo 789 108 ghj 174 oghj 801 109 lhwfyksz 94 lhw 811 110 nwrrnlhw 929 lhw 811 111 xlhwm 1720 lhw 811 112 ncc 1602 wurhxnccn 814 113 jdslrf 1094 dsl 835 114 ktmw 1738 tmw 844 115 igwsvrzpk 16 gws 856 116 kug 591 pkugls 857 117 befgcpedr 339 fgc 862 118 ojf 1397 ojfpnkla 863 119 gyl 1203 gylxeqzw 872 120 ugcbb 1727 ugc 876 121 arh 1234 karhwhg 878 122 amm 458 ammqdc 883 123 azazryje 636 zaz 900 124 wczazw 1887 zaz 900 125 gkritbe 47 ritb 915 126 vku 1137 yjvkuxued 929 127 rnh 1633 kvyrnhugu 937 128 mzh 1135 xllmwmzhn 940 129 cug 1125 cug 960 130 xgv 1280 xgv 962 131 xusxgv 1436 xgv 962 132 umc 351 lwumcmvoo 980 133 zlb 1900 nkyazlb 991 134 llfkalao 1049 llf 1002 135 sflpbht 991 lpb 1048 136 rairmmcl 442 mmc 1087 137 mmckoln 780 mmc 1087 138 gfxmmcgb 1814 mmc 1087 139 aoj 402 taojlgp 1089 140 mypvzhp 121 ypv 1095 141 moctwaypv 611 ypv 1095 142 rngedn 306 ged 1106 143 djshecy 1408 ecy 1108 144 rairmmcl 442 rmm 1117 145 gzua 1594 zua 1124 146 ytj 416 yytj 1140 147 ubt 300 hubtcfr 1141 148 gqg 1854 ogqgsjqc 1144 149 tfg 1204 xiutfgru 1145 150 avrq 741 avr 1147 151 ytkpvss 440 tkp 1149 152 kug 591 yxsjkug 1176 153 vix 1846 vixsmn 1187 154 qtl 600 qtljkxz 1188 155 lgr 494 dlgrco 1189 156 ryg 864 xlmtryg 1203 157 yskvkxwj 1547 kvk 1205 158 kxhee 1795 xhe 1222 159 hzbcjs 1493 cjs 1224 160 kbi 270 itxlwkbi 1225 161 gdymcam 806 ymca 1232 162 tqr 1419 rxtqrdtl 1236 163 yyz 215 yyzw 1242 164 jyx 1735 mljjyxu 1248 165 aai 1928 umkpaaiwo 1254 166 dsd 1122 dndsdova 1257 167 tor 744 etor 1270 168 vhcyznp 1296 yzn 1278 169 xlc 1947 odxlcjwj 1280 170 mlm 1629 aomlmgtq 1303 171 owm 239 owmugb 1304 172 ynezwaml 507 nez 1308 173 jls 695 jlsve 1325 174 dvm 879 dvmv 1339 175 vsgx 944 dqpihvsgx 1352 176 wfo 768 wfokpjois 1354 177 tltbkinat 1986 nat 1362 178 gyl 1203 gylqte 1363 179 ngg 735 bsnggqbjd 1366 180 fkq 345 jdfkqf 1368 181 ojf 1397 ojfpgfga 1382 182 dqgd 1623 prqbndqgd 1398 183 siu 827 siuypucup 1412 184 yinsoivfd 1895 yin 1414 185 esm 1834 sesmeepz 1417 186 umc 351 umcj 1432 187 wny 866 wnyxamguw 1443 188 ujbhtvnin 399 vni 1444 189 dbq 630 bdbqq 1452 190 ebn 1405 ebngddw 1461 191 zcj 704 rbtjzcjod 1465 192 avn 500 avnspxv 1468 193 vkk 567 hvkk 1477 194 hmm 1441 bgjhmmthz 1483 195 aguakz 614 guak 1487 196 hycvuupgy 141 pgy 1493 197 tizpgymz 280 pgy 1493 198 guk 571 cncxdguk 1502 199 zyw 281 nzywuqs 1504 200 jnz 1558 rxdxsjnzw 1510 201 uuq 92 nxuuqtj 1514 202 qtj 720 nxuuqtj 1514 203 vkk 567 xpbpvkkdc 1518 204 iaa 460 sjiaa 1525 205 txsgmynng 1019 xsg 1526 206 yjvtwc 1107 jvt 1529 207 lnk 1113 hylnknwy 1546 208 szd 635 woszdm 1557 209 osm 253 sosmdp 1567 210 nbd 1067 nbdmmg 1570 211 mmg 1305 nbdmmg 1570 212 wqdsatbd 1536 sat 1585 213 sdlypo 1527 sdl 1596 214 inkynog 288 inky 1600 215 hpwoeclfy 1321 clf 1601 216 wodyqwqf 679 dyq 1603 217 lyw 7 xnalywyuw 1607 218 njm 1825 vjlnjmns 1617 219 njytqhaut 428 qha 1620 220 ilw 389 rilwbk 1647 221 oqo 45 ixoqowkpg 1650 222 odcbcvaun 1386 bcv 1652 223 mastn 434 stn 1662 224 xebhdssit 1091 xeb 1663 225 nmy 782 nmyxj 1671 226 fsqvgdw 673 gdw 1676 227 mwwczhs 482 wcz 1679 228 wczazw 1887 wcz 1679 229 anmryzm 915 ryz 1698 230 rteh 523 rte 1708 231 mlwrguae 817 lwr 1709 232 mbu 819 xpsuqmbuf 1729 233 mmckoln 780 cko 1733 234 lxpg 798 lxp 1734 235 ane 370 vxnanehvk 1746 236 tty 1209 vbttyozui 1752 237 igncdgyjx 332 ign 1753 238 ndignk 621 ign 1753 239 nmy 782 ivnmyba 1780 240 wknrsauuj 2 rsa 1799 241 tgd 165 qtgdidlf 1803 242 iaa 460 yziaazxto 1833 243 xto 1245 yziaazxto 1833 244 zff 755 dpzfft 1857 245 jyx 1735 jwjyxphe 1873 246 ytj 416 eytj 1881 247 lcggwonk 1596 onk 1882 248 zdv 261 zdvxfz 1889 249 xhskcb 417 kcb 1890 250 mrikqkcb 770 kcb 1890 251 psvxqnsap 1352 psv 1898 252 udjswzb 411 jsw 1900 253 rpfjswy 1840 jsw 1900 254 bjaywiso 1677 ayw 1902 255 zfli 130 fli 1906 256 vazx 1215 itvazxw 1918 257 tuw 100 tuwywtbwd 1921 258 vle 1437 ebvleaovm 1937 259 znycsygd 1757 nyc 1944 260 ynezwaml 507 ezw 1952 261 tseezwf 1276 ezw 1952 262 ezwzyfudo 1690 ezw 1952 263 oudiky 1503 dik 1964 264 dikjn 1615 dik 1964 265 oms 106 wpomsudi 1977 266 hhp 1864 hhpkm 1983 Benchmarking It seems this base R option is slightly slower than #Andrew's approach. TIC <- function() { A <- data_set_A$name B <- data_set_B$name_2 A2B <- sapply(A, function(x) grepl(x, B, fixed = TRUE)) B2A <- sapply(B, function(x) grepl(x, A, fixed = TRUE)) idx <- which(t(A2B) | B2A, arr.ind = TRUE) cbind(data_set_A[idx[, 1], ], data_set_B[idx[, 2], ]) # data_set_A[unique(idx[, 1]), ] } jpiversen_fixed <- function() { data_set_A %>% rowwise() %>% filter(any(str_detect(name, fixed(data_set_B$name_2))) | any(str_detect(data_set_B$name_2, fixed(name)))) %>% ungroup() } andrew <- function() { nchar_a <- nchar(data_set_A$name) nchar_b <- nchar(data_set_B$name_2) pattern_a <- str_c(unique(data_set_A$name[nchar_a != max(nchar_a, na.rm = TRUE)]), collapse = "|") pattern_b <- str_c(unique(data_set_B$name_2[nchar_b != max(nchar_b, na.rm = TRUE)]), collapse = "|") idx_a <- data_set_A$name %in% data_set_B$name_2 idx_a[!idx_a] <- str_detect(data_set_A$name[!idx_a], pattern_b) b_to_check <- data_set_B$name_2[str_detect(data_set_B$name_2, pattern_a)] idx_a[!idx_a] <- vapply(data_set_A$name[!idx_a], function(name) { any(grepl(name, b_to_check, fixed = TRUE)) }, logical(1L), USE.NAMES = FALSE) data_set_A[idx_a, ] } bm <- microbenchmark( TIC(), jpiversen_fixed(), andrew(), times = 20 ) shows that > bm Unit: milliseconds expr min lq mean median uq max TIC() 423.8410 441.3574 492.6091 478.2596 549.2376 611.3841 jpiversen_fixed() 1354.8954 1373.9502 1447.8649 1395.6766 1459.7058 1842.2574 andrew() 329.4821 335.3388 345.8890 341.4758 354.1298 381.6872 neval 20 20 20
How to find out the sequence of value in R
Supposed that I have generated 100 different value from -100 to 100, and I have cumsum all of those value. set.seed(123) x <- -100:100 z <- sample (x, size = 100,replace=T) cumsum(z) and I got [1] 58 136 49 143 212 161 178 120 33 50 102 91 81 177 167 251 242 278 276 247 172 78 147 [24] 183 246 223 203 145 147 163 138 180 111 119 25 61 129 102 24 78 165 117 151 103 157 222 [47] 155 123 94 69 31 71 67 57 109 46 -34 -94 -20 -31 -72 -157 -142 -149 -244 -145 -160 -175 -237 [70] -179 -162 -213 -280 -377 -465 -497 -471 -419 -468 -547 -559 -500 -576 -642 -575 -564 -635 -596 -538 -518 -509 -452 [93] -489 -448 -350 -384 -334 -313 -335 -351 Now, I would like to stop or find out the value that is greater than 200 or lower than -200. If I do it by my hand, I know that the 5th sequence (212) is greater than 200. However, in R, is there any command to find out the first time that z is greater than 200 or lower than -200? Thank you very much
A quick hack way to do this might be: z <- as.data.frame(z) z$lv <- if_else(z >200,T,F) min(which(lv == TRUE))
The min(which(...)) solutions provided by others don't give a convenient answer in case none of the values meet the condition. For example, set.seed(123) x <- -100:100 z <- sample (x, size = 100,replace=T) min(which(abs(cumsum(z)) > 200)) #> [1] 5 min(which(abs(cumsum(z)) > 1000)) # None meet this condition #> Warning in min(which(abs(cumsum(z)) > 1000)): no non-missing arguments to min; #> returning Inf #> [1] Inf A better way is given in the R help page for which.max: match(TRUE, abs(cumsum(z)) > 200) #> [1] 5 match(TRUE, abs(cumsum(z)) > 1000) #> [1] NA
Divide paired matching columns
I have a data.frame df with matching columns that are also paired. The matching columns are defined in the factor patient. I would like to devide the matching columns by each other. Any suggestions how to do this? I tried this, but this does not take the pairing from patient into account. m1 <- m1[sort(colnames(df)] m1_g <- m1[,grep("^n",colnames(df))] m1_r <- m1[,grep("^t",colnames(df))] m1_new <- m1_g/m1_r m1_new head(df) na-008 ta-008 nc012 tb012 na020 na-018 ta-018 na020 tc020 tc093 nc093 hsa-let-7b-5p_TGAGGTAGTAGGTTGTGT 56 311 137 242 23 96 113 106 41 114 hsa-let-7b-5p_TGAGGTAGTAGGTTGTGTGG 208 656 350 713 49 476 183 246 157 306 hsa-let-7b-5p_TGAGGTAGTAGGTTGTGTGGT 631 1978 1531 2470 216 1906 732 850 665 909 hsa-let-7b-5p_TGAGGTAGTAGGTTGTGTGGTT 2760 8159 6067 9367 622 4228 2931 3031 2895 2974 hsa-let-7b-5p_TGAGGTAGTAGGTTGTGTGGTTT 1698 4105 3737 3729 219 1510 1697 1643 1527 1536 > head(patient) $`008` [1] "na-008" "ta-008" $`012` [1] "nc012" "tb012" $`018` [1] "na-018" "ta-018" $`020` [1] "na020" "tc020" $`045` [1] "nb045" "tc045" $`080` [1] "nb-080" "ta-080"
R aggregate data in one column based on 2 other columns
So, I have these data given below, and my goal is to aggregate column v3 in terms of columns v1 and v2 and add the v3 values for each bin of v1 and v2. For example, the first line correspond to interval v1=21, v2=16, so the value of v3 will be aggregated over its (v1,v2) interval. And repeat this for the rest of rows. I want to use the mean as the aggregation function! > df v1 v2 v3 1 21.359 16.234 24.283 2 47.340 9.184 21.328 3 35.363 -13.258 14.556 4 -29.888 14.154 17.718 5 -10.109 -16.994 20.200 6 -32.387 1.722 15.735 7 49.240 -5.266 17.601 8 -38.933 2.558 16.377 9 41.213 5.937 21.654 10 -33.287 -4.028 19.525 11 -10.223 11.961 16.756 12 -48.652 16.558 20.800 13 44.778 27.741 17.793 14 -38.546 29.708 13.948 15 -45.622 4.729 17.793 16 -36.290 12.383 18.014 17 -19.626 19.767 18.182 18 -32.248 29.480 15.108 19 -41.859 35.502 8.490 20 -36.058 21.191 16.714 21 -23.588 0.524 21.471 22 -24.423 39.963 18.257 23 -0.042 -45.899 17.654 24 -35.479 32.049 9.294 25 -24.632 20.603 17.757 26 -26.591 25.882 18.968 27 -34.364 43.959 13.905 28 -19.334 29.728 20.102 29 12.304 -39.997 17.002 30 0.958 37.162 20.779 31 -35.475 -40.611 14.719 32 -39.268 44.382 11.247 33 -10.154 39.053 19.458 34 -12.612 32.056 17.759 35 2.730 -1.473 20.228 36 -45.326 -52.299 9.305 37 -1.996 -15.551 13.295 38 -26.655 -37.319 19.148 39 -18.509 -30.047 18.889 40 -22.705 -25.577 19.007 41 -15.705 -15.397 19.112 42 -2.637 9.790 10.548 43 -14.107 -3.145 19.654 44 -29.272 -19.906 18.503 45 -9.569 -4.632 11.334 46 2.114 18.048 14.744 47 -4.241 16.073 15.420 48 31.869 -3.394 21.559 49 20.425 35.205 22.250 50 -18.605 -8.866 20.082 51 -26.677 -7.690 21.850 52 -5.240 4.805 11.399 53 -6.766 2.538 6.292 54 4.567 22.554 19.682 55 -20.701 6.430 20.996 56 -23.972 16.141 17.976 57 -6.651 24.048 18.082 58 -32.243 -6.100 19.517 59 2.236 29.736 19.667 60 18.830 15.586 15.969 61 -9.598 28.414 17.806 62 -30.825 12.194 22.346 63 -17.415 15.795 18.135 64 -14.823 5.931 17.915 65 -14.234 12.882 13.001 66 9.937 18.368 20.421 67 -38.766 9.590 21.648 68 -30.896 27.047 16.453 69 -4.432 -10.562 10.061 70 -4.290 33.170 22.942 71 7.285 41.416 23.906 72 24.411 40.531 23.584 73 45.409 -32.420 20.831 74 49.341 -34.047 15.269 75 -7.730 -47.724 21.692 76 -10.563 -29.082 17.984 77 4.412 -41.182 16.845 78 31.822 -37.297 19.665 79 -43.355 31.093 17.688 80 -44.353 -44.723 13.832 81 -16.961 38.438 20.715 82 -21.225 -39.244 18.156 83 -42.022 -8.686 20.362 84 -42.904 -25.498 18.394 85 43.822 -25.990 21.287 86 43.013 -9.071 19.285 87 -36.901 -24.185 21.938 88 -28.251 -36.583 19.330 89 -19.830 -22.412 21.677 90 -3.789 -15.663 17.439 91 40.453 -21.796 17.432 92 -40.778 -31.188 18.762 93 -27.072 -48.609 18.913 94 -18.035 -1.791 19.909 95 -20.781 -7.912 22.563 96 47.307 -15.432 19.101 97 30.700 5.097 22.801 98 46.453 0.171 17.810 99 -27.439 -5.860 22.626 100 -30.526 -18.007 23.219 101 -18.280 -15.187 25.302 102 -18.367 6.044 18.864 103 41.265 -1.686 22.743 104 29.227 -14.814 19.196 105 -36.080 -32.715 18.930 106 7.475 7.061 25.002 107 -18.586 -45.207 21.864 108 35.227 11.148 21.388 109 -7.581 38.773 22.048 110 -43.685 14.083 22.037 111 -29.533 39.735 17.613 112 8.760 -39.400 22.421 113 -14.962 24.624 12.030 114 18.627 -32.888 23.036 115 -31.300 33.612 15.608 116 -38.024 45.839 16.567 117 -15.104 36.893 18.162 118 -12.809 -23.029 21.589 119 -21.614 36.264 16.680 120 42.917 -36.838 18.738 121 6.104 -14.961 14.468 122 44.032 -41.556 17.618 123 -24.493 21.886 17.366 124 -24.361 29.941 14.374 125 -25.060 43.383 16.437 126 -6.017 -24.640 19.207 127 -32.617 -40.549 18.059 128 -43.285 -43.364 18.827 129 -29.856 -46.089 16.881 130 -16.547 -43.619 22.547 131 -16.257 42.814 18.932 132 -9.236 -11.694 14.455 133 13.488 -35.422 24.436 134 -47.456 -32.714 18.123 135 39.476 -28.008 16.087 136 -21.933 -43.522 15.390 137 -17.347 -38.250 16.738 138 -4.948 -39.747 21.598 139 -31.018 -28.912 21.332 140 -36.364 30.461 17.542 141 -39.639 18.272 23.663 142 -24.162 -13.582 19.136 143 -8.935 -32.699 22.108 144 0.001 -19.219 17.888 145 -6.912 -24.885 20.683 146 7.785 -31.229 15.972 147 22.176 -7.478 21.335 148 8.755 -13.323 20.831 149 44.081 41.160 11.938 150 -8.451 -37.721 17.465 151 18.671 -2.776 23.374 152 12.668 -26.749 18.071 153 1.582 -21.252 20.750 154 20.832 -27.718 16.190 155 44.220 -45.690 12.598 156 -0.226 -37.737 17.634 157 -25.130 -19.197 23.170 158 2.086 -31.271 18.180 159 -20.445 -33.083 19.984 160 23.801 1.116 24.230 161 18.283 -17.922 20.256 162 -38.985 -13.770 20.702 163 -26.264 -27.413 20.276 164 10.396 -19.375 20.415 165 -16.343 -22.847 16.516 166 29.992 -8.215 21.661 167 35.052 -19.475 16.953 168 3.052 -6.800 22.509 169 -10.350 -5.413 19.222 170 14.371 -10.383 23.471 171 11.896 -4.191 21.773 172 18.152 8.741 23.669 173 25.748 -47.786 18.578 174 31.613 -0.735 23.898 175 12.660 25.645 23.549 176 2.933 29.345 25.170 177 9.369 18.791 26.817 178 15.805 4.798 27.866 179 27.556 -25.571 14.796 180 -5.112 -7.835 21.201 181 -30.571 3.471 20.496 182 19.816 -22.114 21.210 183 2.826 47.437 22.911 184 25.488 -33.064 21.442 185 44.826 42.162 22.994 186 25.208 -48.487 25.325 187 14.635 -17.430 17.083 188 -1.901 -33.370 22.163 189 12.306 -47.265 20.052 190 42.552 35.750 23.213 191 37.318 -46.069 22.599 192 4.725 -22.289 21.600 193 -40.815 -37.793 17.371 194 11.890 -12.862 14.286 195 35.251 -31.746 17.816 196 27.121 -27.638 19.677 197 36.024 -39.105 20.202 198 -47.119 41.940 17.526 199 0.837 -40.694 23.063 200 23.797 -39.795 20.198 201 -42.859 -21.372 23.554 202 39.407 -20.211 21.246 203 25.782 -18.892 20.423 204 34.529 -9.576 20.411 205 44.397 -13.247 23.180 206 5.534 6.856 14.248 207 31.598 -18.085 22.350 208 7.250 -0.481 15.453 209 -43.458 -15.204 23.193 210 -38.296 -31.524 21.776 211 4.276 -3.483 12.145 212 25.757 -11.708 22.360 213 15.634 37.478 24.624 214 -43.669 -3.197 20.742 215 45.381 6.365 21.351 216 -38.755 -6.877 20.879 217 -6.925 3.994 21.120 218 8.059 12.831 26.032 219 3.572 22.105 26.920 220 16.042 30.267 21.039 221 26.629 13.042 23.633 222 -12.126 -0.151 21.261 223 -11.981 24.600 19.236 224 29.480 28.362 21.838 225 -2.500 22.858 23.177 226 -41.163 19.863 20.059 227 35.953 27.401 19.101 228 -16.641 13.248 17.984 229 -3.778 14.090 18.943 230 11.643 34.817 21.621 231 34.921 38.666 17.359 232 25.621 22.451 22.866 233 34.936 17.384 19.836 234 40.017 37.599 13.987 235 19.547 33.838 22.575 236 11.197 39.977 19.347 237 16.972 -33.927 14.205 238 22.938 38.064 20.351 239 40.234 18.672 23.030 240 -0.846 42.320 18.383 241 -11.437 18.284 16.502 242 19.552 43.222 21.370 243 13.925 -46.486 18.917 244 41.709 -39.559 16.143 245 19.014 -44.563 17.796 246 32.260 33.114 18.402 247 -4.693 29.228 18.622 248 21.765 -38.452 15.147 249 39.157 -31.135 19.800 250 32.638 46.241 18.943 251 2.797 10.089 21.330 252 8.256 46.910 18.834 253 38.634 -2.429 20.413 254 28.642 2.763 19.580 255 0.456 1.422 7.452 256 3.050 11.792 14.196 257 24.736 14.532 17.886 258 16.787 -10.155 18.607 259 12.676 11.651 18.656 260 13.184 1.081 15.385 261 27.365 26.576 25.486 262 -7.878 -18.191 14.547 263 -42.112 32.576 20.865 264 15.069 21.684 17.986 265 33.045 27.166 25.252 266 21.810 -0.186 19.477 267 18.227 26.690 20.415 268 33.759 18.366 21.255 269 39.491 13.272 23.036 270 30.662 9.368 20.192 271 5.470 35.303 22.685 272 21.663 -44.343 20.999 273 31.261 33.178 24.335 274 21.854 22.665 20.876 275 21.853 7.932 18.588 276 -40.168 3.682 19.642 277 -42.292 23.997 22.199 278 10.233 28.731 21.263 279 17.745 41.831 19.536 280 38.406 25.165 26.534 281 -49.329 -0.465 20.887 282 40.398 -8.120 21.362 283 -2.531 46.118 22.933 284 7.959 -30.856 20.497 285 -34.467 -23.724 22.206 286 30.541 44.284 25.878 287 45.682 29.897 21.964 288 -22.251 -0.089 20.756 289 21.484 16.532 23.513 290 46.912 10.195 21.908 291 35.320 -13.352 16.102 292 -30.431 14.048 17.362 293 -8.976 -17.325 21.645 294 -32.661 2.301 16.805 295 49.317 -5.509 17.711 296 -37.756 4.459 16.054 297 41.445 6.158 21.442 298 -33.148 -3.499 19.543 299 -10.065 12.238 16.649 300 -48.323 17.153 20.974 301 45.010 28.147 17.838 302 -39.630 29.183 13.254 303 -45.191 5.065 18.214 304 -35.936 11.953 16.540 305 -19.816 19.624 18.279 306 -32.055 29.757 15.358 307 -41.533 36.169 10.005 308 -35.448 20.960 16.720 309 -23.384 0.511 20.005 310 -25.101 40.569 18.180 311 -0.547 -45.779 17.603 312 -35.291 32.643 9.548 313 -25.109 20.826 17.494 314 -26.202 27.012 18.678 315 -34.805 43.850 14.006 316 -18.819 30.611 20.309 317 13.019 -40.248 16.874 318 -0.655 37.112 20.924 319 -34.142 -41.553 15.237 320 -39.509 43.886 12.464 321 -9.491 38.639 18.839 322 -12.164 31.977 17.598 323 3.437 -1.596 20.318 324 -45.713 -52.599 9.918 325 -2.062 -15.946 12.847 326 -27.435 -37.600 18.257 327 -18.094 -29.624 18.791 328 -22.647 -26.123 18.746 329 -16.775 -15.505 19.204 330 -2.628 9.599 11.219 331 -15.718 -1.797 19.491 332 -29.476 -20.107 17.485 333 -10.618 -4.938 12.227 334 1.423 17.458 14.706 335 -4.503 16.630 14.718 336 32.450 -2.029 21.591 337 20.529 35.464 21.630 338 -19.348 -7.844 19.464 339 -26.760 -6.856 21.422 340 -4.539 4.393 11.819 341 -5.741 1.934 7.121 342 4.781 21.919 18.908 343 -19.797 6.928 20.928 344 -24.555 16.834 19.796 345 -5.664 24.465 18.432 346 -32.891 -6.571 18.691 347 2.354 28.462 19.825 348 18.058 16.251 16.335 349 -9.603 28.582 17.743 350 -31.282 11.454 22.342 351 -17.580 16.428 18.401 352 -13.884 6.206 17.270 353 -13.631 13.767 11.761 354 9.712 18.008 18.896 355 -37.987 9.024 21.309 356 -29.969 27.506 16.964 357 -4.248 -10.813 9.284 358 -5.755 32.673 22.541 359 6.675 41.952 24.227 360 24.564 41.173 23.241 361 45.314 -32.299 20.778 362 -45.890 -33.510 16.314 363 -8.277 -47.943 21.573 364 -11.044 -29.464 17.708 365 3.972 -41.396 17.411 366 31.776 -36.643 19.998 367 -43.072 31.311 17.828 368 -45.805 -43.071 14.477 369 -15.628 39.837 19.709 370 -21.129 -39.101 18.814 371 -41.628 -8.980 19.850 372 -42.244 -23.659 18.856 373 44.149 -25.710 21.099 374 42.623 -9.185 20.147 375 -35.949 -23.979 22.255 376 -28.512 -36.367 19.378 377 -19.827 -21.781 21.621 378 -3.429 -15.706 18.677 379 39.741 -20.721 18.670 380 -41.663 -29.499 19.260 381 -26.931 -48.467 18.185 382 -17.571 -1.467 19.770 383 -20.039 -7.591 22.737 384 46.370 -14.790 19.922 385 30.710 4.167 22.987 386 46.755 0.417 18.088 387 -27.293 -4.398 22.168 388 -30.364 -17.573 23.869 389 -16.870 -14.893 25.817 390 -18.152 6.546 18.392 391 40.134 0.160 23.661 392 28.179 -14.323 19.301 393 -35.907 -32.647 19.306 394 8.486 7.101 24.551 395 -17.155 -45.435 22.745 396 34.226 10.748 19.773 397 -7.760 38.754 22.211 398 -42.899 13.804 22.628 399 -29.972 40.435 17.784 400 8.764 -39.195 22.070 401 -15.624 25.585 12.291 402 18.620 -33.314 23.282 403 -30.436 34.219 15.102 404 -37.665 44.955 15.257 405 -15.861 37.488 18.956 406 -13.375 -22.408 20.312 407 -20.972 36.906 17.387 408 43.162 -35.948 19.695 409 6.639 -15.783 14.608 410 44.186 -41.037 17.398 411 -23.917 22.236 18.702 412 -23.957 30.033 14.725 413 -25.056 43.824 15.489 414 -6.795 -24.375 18.537 415 -33.485 -40.651 17.538 416 -43.186 -43.071 17.481 417 -30.325 -46.122 16.440 418 -17.489 -43.551 22.006 419 -16.376 43.928 18.992 420 -9.076 -10.921 14.131 421 13.704 -36.352 23.812 422 -47.302 -31.918 18.719 423 39.459 -27.814 15.558 424 -22.509 -42.660 14.366 425 -17.920 -37.614 16.572 426 -5.780 -39.212 21.667 427 -30.519 -28.942 21.931 428 -35.937 31.435 17.106 429 -38.680 18.435 23.342 430 -24.796 -13.279 18.543 431 -9.283 -32.388 21.895 432 0.493 -19.505 17.276 433 -7.046 -25.243 20.741 434 7.884 -32.006 16.727 435 22.451 -7.834 21.082 436 8.379 -13.690 22.002 437 43.730 41.697 11.894 438 -9.040 -38.086 17.500 439 18.831 -2.759 23.252 440 12.732 -27.410 18.948 441 0.739 -21.091 21.354 442 20.339 -27.959 16.514 443 44.688 -46.449 12.356 444 -0.402 -36.951 17.891 445 -24.790 -18.139 23.337 446 2.173 -30.577 18.023 447 -18.995 -33.799 20.730 448 23.372 0.223 24.855 449 17.835 -17.372 19.878 450 -38.915 -13.815 20.923 451 -26.241 -27.800 19.877 452 11.074 -18.156 19.249 453 -16.478 -22.928 16.386 454 29.646 -8.349 21.115 455 33.910 -20.809 16.629 456 3.306 -6.830 22.059 457 -10.512 -5.322 19.876 458 14.024 -10.406 23.456 459 12.365 -3.699 21.818 460 18.186 8.532 23.951 461 25.140 -47.653 18.592 462 32.288 -2.117 23.423 463 10.836 24.937 23.310 464 4.531 28.913 25.238 465 9.944 18.397 26.661 466 16.274 4.852 27.837 467 27.316 -26.007 15.934 468 -4.508 -8.010 20.906 469 -29.858 2.412 19.958 470 20.376 -21.957 21.306 471 2.077 47.431 23.248 472 25.777 -33.367 21.695 473 44.854 42.801 22.904 474 25.356 -48.833 25.402 475 15.322 -16.926 17.318 476 -2.656 -33.400 20.365 477 11.950 -47.390 20.328 478 42.961 36.955 22.919 479 35.726 -45.402 24.272 480 4.675 -21.758 21.780 481 -40.568 -36.931 16.934 482 11.758 -12.859 14.206 483 35.483 -31.760 16.975 484 27.336 -27.577 19.429 485 36.689 -39.218 19.668 486 -46.357 41.618 17.456 487 0.002 -40.589 22.558 488 23.525 -39.918 21.247 489 -43.269 -21.304 22.699 490 40.191 -20.594 21.145 491 25.728 -18.024 20.298 492 34.964 -10.441 20.189 493 43.627 -13.279 23.038 494 5.766 6.876 14.077 495 32.432 -18.172 21.848 496 7.087 -1.122 15.098 497 -44.110 -14.034 23.080 498 -39.474 -31.289 22.312 499 4.118 -4.077 11.067 500 26.597 -11.667 22.641 so, using these commands I can find the intervals, as below x.bin <- seq(floor(min(d[,1])), ceiling(max(df[,1])), by=2) y.bin <- seq(floor(min(d[,2])), ceiling(max(df[,2])), by=2) > x.bin [1] -50 -48 -46 -44 -42 -40 -38 -36 -34 -32 -30 -28 -26 -24 -22 -20 -18 -16 -14 [20] -12 -10 -8 -6 -4 -2 0 2 4 6 8 10 12 14 16 18 20 22 24 [39] 26 28 30 32 34 36 38 40 42 44 46 48 50 > y.bin [1] -53 -51 -49 -47 -45 -43 -41 -39 -37 -35 -33 -31 -29 -27 -25 -23 -21 -19 -17 [20] -15 -13 -11 -9 -7 -5 -3 -1 1 3 5 7 9 11 13 15 17 19 21 [39] 23 25 27 29 31 33 35 37 39 41 43 45 47 But, then I don't know how to assign each row of the raw data (df) to each x.bin and y.bin and calculate the aggregate (sum) of each bin.
library(plyr) #I am using cut function with 50 breaks for both v1 and v2 and ddply from plyr package for computing the mean newdata<-ddply(df,.(cut(v1,50),cut(v2,50)),summarise,mean.v3=mean(v3)) > head(newdata) cut(v1, 50) cut(v2, 50) mean.v3 1 (-49.4,-47.5] (-34.7,-32.7] 18.123 2 (-49.4,-47.5] (-0.576,1.43] 20.887 3 (-49.4,-47.5] (15.5,17.5] 20.887 4 (-47.5,-45.5] (-52.7,-50.7] 9.918 5 (-47.5,-45.5] (-44.7,-42.7] 14.477 6 (-47.5,-45.5] (-34.7,-32.7] 16.314 Updated as per the comments: If you want the lower, middle and mid-points, you can use the following function or use with details as follow(you need to use the sub function to deal with ( and ]): df$newv1<-with(df,cut(v1,50)) df$newv2<-with(df,cut(v2,50)) df$lowerv1<-with(df,as.numeric( sub("\\((.+),.*", "\\1", newv1))) #lower value df$upperv1<-with(df,as.numeric( sub("[^,]*,([^]]*)\\]", "\\1", newv1))) # upper value df$midv1<-with(df,(lowerv1+upperv1)/2) #mid value df$lowerv2<-with(df,as.numeric( sub("\\((.+),.*", "\\1",newv2))) #lower value df$upperv2<-with(df,as.numeric( sub("[^,]*,([^]]*)\\]", "\\1", newv2))) # upper value df$midv2<-with(df,(lowerv2+upperv2)/2)#mid value newdata<-ddply(df,.(newv1,newv2),transform,mean.v3=mean(v3)) > head(newdata) v1 v2 v3 newv1 newv2 lowerv1 upperv1 midv1 lowerv2 upperv2 midv2 mean.v3 1 -47.456 -32.714 18.123 (-49.4,-47.5] (-34.7,-32.7] -49.4 -47.5 -48.45 -34.700 -32.70 -33.700 18.123 2 -49.329 -0.465 20.887 (-49.4,-47.5] (-0.576,1.43] -49.4 -47.5 -48.45 -0.576 1.43 0.427 20.887 3 -48.652 16.558 20.800 (-49.4,-47.5] (15.5,17.5] -49.4 -47.5 -48.45 15.500 17.50 16.500 20.887 4 -48.323 17.153 20.974 (-49.4,-47.5] (15.5,17.5] -49.4 -47.5 -48.45 15.500 17.50 16.500 20.887 5 -45.713 -52.599 9.918 (-47.5,-45.5] (-52.7,-50.7] -47.5 -45.5 -46.50 -52.700 -50.70 -51.700 9.918 6 -45.805 -43.071 14.477 (-47.5,-45.5] (-44.7,-42.7] -47.5 -45.5 -46.50 -44.700 -42.70 -43.700 14.477