I am working on a data frame that contains information like this.
df<- as.data.frame(read.table("headen.bed",header = FALSE, sep="\t",stringsAsFactors=FALSE, quote=""))
C1 C2 C3
33 12249 0,300,3900,400,4500,400,4200
83 9213 0,49,66,75,158,160,170,183,218
146 680 0,3,13,129,274,278,383,481,482,496
I want to add C1 to each element of C3, so the result would look like this:
C1 C2 C3
33 12249 33,333,3933,433,4533,433,4233
83 9213 83,132,149,158,241,243,253,266,301
146 680 146,149,159,275,420,424,529,627,628,642
However, C3 comes in as character class. I tried different ways to convert it to numeric, using as.numeric, type.convert, and converting character to factor and then to numeric, but none of them worked. Can anyone suggest the best way to perform this?
You can try,
mapply(function(x, y) paste(x + as.numeric(y), collapse = ','), df$C1, strsplit(df$C3, ','))
[1] "33,333,3933,433,4533,433,4233" "83,132,149,158,241,243,253,266,301" "146,149,159,275,420,424,529,627,628,642"
DATA
df <- data.frame(C1 = c(33, 83, 146),
C2 = c(1, 2, 3),
C3 = c('0,300,3900,400,4500,400,4200', '0,49,66,75,158,160,170,183,218', '0,3,13,129,274,278,383,481,482,496'),
stringsAsFactors = FALSE)
EDIT
To make C3 into numeric you will have to split it into many columns. There are a bunch of ways to do it as shown here. I like the splitstackshape approach, i.e.
library(splitstackshape)
df1 <- cSplit(df, 'C3', sep = ',')
#C1 C2 C3_01 C3_02 C3_03 C3_04 C3_05 C3_06 C3_07 C3_08 C3_09 C3_10
#1: 33 1 33 333 3933 433 4533 433 4233 NA NA NA
#2: 83 2 83 132 149 158 241 243 253 266 301 NA
#3: 146 3 146 149 159 275 420 424 529 627 628 642
str(df1)
Classes ‘data.table’ and 'data.frame': 3 obs. of 12 variables:
$ C1 : num 33 83 146
$ C2 : num 1 2 3
$ C3_01: int 33 83 146
$ C3_02: int 333 132 149
$ C3_03: int 3933 149 159
$ C3_04: int 433 158 275
$ C3_05: int 4533 241 420
$ C3_06: int 433 243 424
$ C3_07: int 4233 253 529
$ C3_08: int NA 266 627
$ C3_09: int NA 301 628
$ C3_10: int NA NA 642
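If you prefer to stay in base R instead of splitstackshape, here is a minimal sketch of my own (assuming the df from the DATA block above) that splits C3, adds C1 element-wise, and pads with NA so the pieces bind into columns:
parts <- Map(`+`, df$C1, lapply(strsplit(df$C3, ","), as.numeric)) # split, convert, add C1
n <- max(lengths(parts))                                           # longest row sets the column count
mat <- t(vapply(parts, function(p) c(p, rep(NA, n - length(p))), numeric(n))) # pad short rows with NA
df2 <- cbind(df[c("C1", "C2")], as.data.frame(mat))                # bind back to the id columns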
Can you help me find a way to reduce the computational time of code that generates a certain value, which I call coef, for each id/date/category? A fuller explanation follows below.
I made two functions that generate the same result. As the benchmark shows, the first function (return_values) takes twice as long as the second (return_valuesX). In the second function I make only brief changes to how the coef variable is calculated, yet that alone cut processing time by 50%, so I strongly believe the code can be improved further. I'm out of ideas for new adjustments, though, so I would appreciate your opinion.
Code Explanations:
In general, the purpose of the code is to calculate a value, which I call coef, for each group of id, date, and category. First, the median of the differences between DR1 and the DRM0 columns of the df1 database is calculated. After obtaining the median (the med variable), I add the values found to the DRM0 columns of df1; this result is my SPV variable. In both cases I used data.table, which I believe is faster than dplyr. Once I have SPV, I need to calculate the coef variable for each id/date/category.
Below is a simple example of the coef calculation. Say I want to calculate coef for idd <- "3", dmda <- "2021-12-03", CategoryChosse <- "ABC", and I have the following:
> SPV %>% filter(Id==idd, date2 == ymd(dmda), Category == CategoryChosse)
Id date1 date2 Week Category DRM001_PV DRM002_PV DRM003_PV DRM004_PV DRM005_PV DRM006_PV DRM007_PV DRM008_PV DRM009_PV DRM010_PV DRM011_PV DRM012_PV
1: 3 2021-12-01 2021-12-03 Monday ABC -3 374 198 17 537 -54 330 -136 -116 534 18 -199
DRM013_PV DRM014_PV DRM015_PV DRM016_PV DRM017_PV DRM018_PV DRM019_PV DRM020_PV DRM021_PV DRM022_PV DRM023_PV DRM024_PV DRM025_PV DRM026_PV DRM027_PV DRM028_PV
1: 106 106 349 76 684 390 218 146 141 20 435 218 372 321 218 218
DRM029_PV DRM030_PV DRM031_PV DRM032_PV DRM033_PV DRM034_PV DRM035_PV DRM036_PV DRM037_PV DRM038_PV DRM039_PV DRM040_PV DRM041_PV DRM042_PV DRM043_PV DRM044_PV
1: 55 455 46 411 262 449 325 467 43 -114 191 167 63 -123 252 218
DRM045_PV DRM046_PV DRM047_PV DRM048_PV DRM049_PV DRM050_PV DRM051_PV DRM052_PV DRM053_PV DRM054_PV DRM055_PV DRM056_PV DRM057_PV DRM058_PV DRM059_PV DRM060_PV
1: 305 420 -296 596 200 218 190 203 607 218 442 -72 463 129 -39 333
DRM061_PV DRM062_PV DRM063_PV DRM064_PV DRM065_PV DRM066_PV DRM067_PV DRM068_PV DRM069_PV DRM070_PV DRM071_PV DRM072_PV DRM073_PV DRM074_PV DRM075_PV DRM076_PV
1: -26 160 -91 326 218 369 317 476 224 61 195 613 342 218 204 521
DRM077_PV DRM078_PV DRM079_PV DRM080_PV DRM081_PV DRM082_PV DRM083_PV DRM084_PV DRM085_PV DRM086_PV DRM087_PV DRM088_PV DRM089_PV DRM090_PV DRM091_PV DRM092_PV
1: 588 218 449 340 51 508 -72 42 492 510 328 818 -132 -105 210 -102
DRM093_PV DRM094_PV DRM095_PV DRM096_PV DRM097_PV DRM098_PV DRM099_PV DRM0100_PV DRM0101_PV DRM0102_PV DRM0103_PV DRM0104_PV DRM0105_PV DRM0106_PV DRM0107_PV
1: -137 94 639 265 -64 512 32 -53 414 340 -16 471 434 150 267
DRM0108_PV DRM0109_PV DRM0110_PV DRM0111_PV DRM0112_PV DRM0113_PV DRM0114_PV DRM0115_PV DRM0116_PV DRM0117_PV DRM0118_PV DRM0119_PV DRM0120_PV DRM0121_PV DRM0122_PV
1: 383 -162 434 -134 -39 450 212 146 -26 8 222 341 601 239 57
DRM0123_PV DRM0124_PV DRM0125_PV DRM0126_PV DRM0127_PV DRM0128_PV DRM0129_PV DRM0130_PV DRM0131_PV DRM0132_PV DRM0133_PV DRM0134_PV DRM0135_PV DRM0136_PV DRM0137_PV
1: 484 239 502 415 504 62 487 168 101 319 365 37 218 -50 230
DRM0138_PV DRM0139_PV DRM0140_PV DRM0141_PV DRM0142_PV DRM0143_PV DRM0144_PV DRM0145_PV DRM0146_PV DRM0147_PV DRM0148_PV DRM0149_PV DRM0150_PV DRM0151_PV DRM0152_PV
1: 493 159 150 132 58 21 468 -81 27 345 107 148 -66 -146 -185
DRM0153_PV DRM0154_PV DRM0155_PV DRM0156_PV DRM0157_PV DRM0158_PV DRM0159_PV DRM0160_PV DRM0161_PV DRM0162_PV DRM0163_PV DRM0164_PV DRM0165_PV DRM0166_PV DRM0167_PV
1: -14 562 68 140 353 120 130 301 76 441 218 370 218 378 -22
DRM0168_PV DRM0169_PV DRM0170_PV DRM0171_PV DRM0172_PV DRM0173_PV DRM0174_PV DRM0175_PV DRM0176_PV DRM0177_PV DRM0178_PV DRM0179_PV DRM0180_PV DRM0181_PV DRM0182_PV
1: -279 563 628 600 152 218 445 246 420 94 495 509 356 183 326
DRM0183_PV DRM0184_PV DRM0185_PV DRM0186_PV DRM0187_PV DRM0188_PV DRM0189_PV DRM0190_PV DRM0191_PV DRM0192_PV DRM0193_PV DRM0194_PV DRM0195_PV DRM0196_PV DRM0197_PV
1: 493 -190 -65 -123 376 357 473 112 -69 471 452 221 165 -44 87
DRM0198_PV DRM0199_PV DRM0200_PV DRM0201_PV DRM0202_PV DRM0203_PV DRM0204_PV DRM0205_PV DRM0206_PV DRM0207_PV DRM0208_PV DRM0209_PV DRM0210_PV DRM0211_PV DRM0212_PV
1: 239 285 521 -65 158 223 160 223 269 57 218 218 102 329 218
DRM0213_PV DRM0214_PV DRM0215_PV DRM0216_PV DRM0217_PV DRM0218_PV DRM0219_PV DRM0220_PV DRM0221_PV DRM0222_PV DRM0223_PV DRM0224_PV DRM0225_PV DRM0226_PV DRM0227_PV
1: 769 215 -68 218 347 18 218 547 759 278 -80 -37 629 -16 774
DRM0228_PV DRM0229_PV DRM0230_PV DRM0231_PV DRM0232_PV DRM0233_PV DRM0234_PV DRM0235_PV DRM0236_PV DRM0237_PV DRM0238_PV DRM0239_PV DRM0240_PV DRM0241_PV DRM0242_PV
1: 364 113 -132 31 536 118 248 385 218 202 218 41 23 218 379
DRM0243_PV DRM0244_PV DRM0245_PV DRM0246_PV DRM0247_PV DRM0248_PV DRM0249_PV DRM0250_PV DRM0251_PV DRM0252_PV DRM0253_PV DRM0254_PV DRM0255_PV DRM0256_PV DRM0257_PV
1: -158 462 600 221 218 221 442 218 53 218 176 504 -61 78 68
DRM0258_PV DRM0259_PV DRM0260_PV DRM0261_PV DRM0262_PV DRM0263_PV DRM0264_PV DRM0265_PV DRM0266_PV DRM0267_PV DRM0268_PV DRM0269_PV DRM0270_PV DRM0271_PV DRM0272_PV
1: 493 403 218 339 299 749 -18 465 686 -215 579 307 366 279 94
DRM0273_PV DRM0274_PV DRM0275_PV DRM0276_PV DRM0277_PV DRM0278_PV DRM0279_PV DRM0280_PV DRM0281_PV DRM0282_PV DRM0283_PV DRM0284_PV DRM0285_PV DRM0286_PV DRM0287_PV
1: 138 56 459 613 219 400 35 -74 516 218 -80 317 310 -231 229
DRM0288_PV DRM0289_PV DRM0290_PV DRM0291_PV DRM0292_PV DRM0293_PV DRM0294_PV DRM0295_PV DRM0296_PV DRM0297_PV DRM0298_PV DRM0299_PV DRM0300_PV DRM0301_PV DRM0302_PV
1: 345 -70 619 235 122 61 337 -163 210 586 127 -112 368 365 476
DRM0303_PV DRM0304_PV DRM0305_PV DRM0306_PV DRM0307_PV DRM0308_PV DRM0309_PV DRM0310_PV DRM0311_PV DRM0312_PV DRM0313_PV DRM0314_PV DRM0315_PV DRM0316_PV DRM0317_PV
1: 240 270 497 97 420 -184 212 -28 151 527 186 -32 60 96 -86
DRM0318_PV DRM0319_PV DRM0320_PV DRM0321_PV DRM0322_PV DRM0323_PV DRM0324_PV DRM0325_PV DRM0326_PV DRM0327_PV DRM0328_PV DRM0329_PV DRM0330_PV DRM0331_PV DRM0332_PV
1: 454 321 300 552 319 134 -63 622 441 297 507 578 198 360 542
DRM0333_PV DRM0334_PV DRM0335_PV DRM0336_PV DRM0337_PV DRM0338_PV DRM0339_PV DRM0340_PV DRM0341_PV DRM0342_PV DRM0343_PV DRM0344_PV DRM0345_PV DRM0346_PV DRM0347_PV
1: 153 318 68 763 370 337 633 469 453 146 428 418 169 468 526
DRM0348_PV DRM0349_PV DRM0350_PV DRM0351_PV DRM0352_PV DRM0353_PV DRM0354_PV DRM0355_PV DRM0356_PV DRM0357_PV DRM0358_PV DRM0359_PV DRM0360_PV DRM0361_PV DRM0362_PV
1: 441 674 21 -182 174 153 -158 268 191 460 10 82 543 -193 218
DRM0363_PV DRM0364_PV DRM0365_PV
1: -203 269 479
So coef is determined by ymd(dmda) - ymd(min(df1$date1)). For the id/date/category mentioned above, the difference is 2 days, so the value I want is DRM003_PV, which in this case is 198. Therefore, I did:
coef<-SPV %>%
filter(Id==idd, date2 == ymd(dmda), Category == CategoryChosse) %>%
pull(as.numeric(ymd(dmda)-ymd(min(df1$date1)))+6)
> coef
[1] 198
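To make the +6 in pull() concrete, a small sketch of my own (assuming the five leading columns Id, date1, date2, Week, Category shown in the output above): a difference of D days selects DRM00(D+1)_PV, which sits at column position D + 6.
day_diff <- as.numeric(ymd("2021-12-03") - ymd("2021-12-01")) # 2 days
day_diff + 6 # 8: DRM001_PV is column 6, so column 8 is DRM003_PV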
This issue has been resolved here: Adjust code to choose a specific column depending on the difference between dates
Libraries and database
library(tidyverse)
library(lubridate)
library(data.table)
library(bench)
set.seed(123)
df1 <- data.frame( Id = rep(1:5, length=800),
date1 = as.Date( "2021-12-01"),
date2= rep(seq( as.Date("2021-01-01"), length.out=400, by=1), each = 2),
Category = rep(c("ABC", "EFG"), length.out = 800),
Week = rep(c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday",
"Saturday", "Sunday"), length.out = 800),
DR1 = sample( 200:250, 800, repl=TRUE),
setNames( replicate(365, { sample(0:800, 800)}, simplify=FALSE),
paste0("DRM0", formatC(1:365, width = 2, format = "d", flag = "0"))))
First function
return_values <- function (df1,idd,dmda, CategoryChosse) {
# First idea: Calculate the median of the values resulting from the subtraction between DR1 and the values of the DRM0 columns
dt1 <- as.data.table(df1)
cols <- grep("^DRM0", colnames(dt1), value = TRUE)
med <-
dt1[, (paste0(cols, "_PV")) := DR1 - .SD, .SDcols = cols
][, lapply(.SD, median), by = .(Id, Category, Week), .SDcols = paste0(cols, "_PV") ]
# Second idea: After obtaining the median, I add the values found with the values of the DRM columns of my df1 database.
f2 <- function(nm, pat) grep(pat, nm, value = TRUE)
nm1 <- f2(names(df1), "^DRM0\\d+$")
nm2 <- f2(names(med), "_PV")
nm3 <- paste0("i.", nm2)
setDT(df1)[med,(nm2) := Map(`+`, mget(nm1), mget(nm3)), on = .(Id, Category, Week)]
SPV <- df1[, c('Id','date1', 'date2', 'Week','Category', nm2), with = FALSE]#%>%data.frame
# Third idea: Calculate the coef values
coef<-SPV %>%
filter(Id==idd, date2 == ymd(dmda), Category == CategoryChosse) %>%
pull(as.numeric(ymd(dmda)-ymd(min(df1$date1)))+6)
return(coef)
}
Results using first function
subset_df1 <- subset(df1, date2 > date1)
a<-subset_df1 %>%
rowwise %>%
select(-c(Week,starts_with('DR')))%>%
mutate(Result=return_values(df1,Id, date2, Category)) %>%
data.frame()
> a
Id date1 date2 Category Result
1 1 2021-12-01 2021-12-02 ABC 4.0
2 2 2021-12-01 2021-12-02 EFG 238.0
3 3 2021-12-01 2021-12-03 ABC 198.0
4 4 2021-12-01 2021-12-03 EFG 163.0
5 5 2021-12-01 2021-12-04 ABC 462.0
...........
Second function
return_valuesX <- function (df1,idd,dmda, CategoryChosse) {
# First idea: Calculate the median of the values resulting from the subtraction between DR1 and the values of the DRM columns
dt1 <- as.data.table(df1)
num_to_pull <- as.numeric(ymd(dmda)-ymd(min(df1$date1)))+6
cols <- grep("^DRM0", colnames(dt1), value = TRUE)[1:num_to_pull]
med <-
dt1[, (paste0(cols, "_PV")) := DR1 - .SD, .SDcols = cols
][, lapply(.SD, median), by = .(Id, Category, Week), .SDcols = paste0(cols, "_PV") ]
# Second idea: After obtaining the median, I add the values found with the values of the DRM columns of my df1 database.
f2 <- function(nm, pat) grep(pat, nm, value = TRUE)
nm1 <- f2(names(df1), "^DRM0\\d+$")[1:num_to_pull]
nm2 <- f2(names(med), "_PV")[1:num_to_pull]
nm3 <- paste0("i.", nm2)[1:num_to_pull]
setDT(df1)[med,(nm2) := Map(`+`, mget(nm1), mget(nm3)), on = .(Id, Category, Week)]
SPV <- df1[, c('Id','date1', 'date2', 'Week','Category', nm2), with = FALSE]#%>%data.frame
# Third idea: Calculate the coef values
coef<-SPV %>%
filter(Id==idd, date2 == ymd(dmda), Category == CategoryChosse) %>%
pull(num_to_pull)
return(coef)
}
Results using second function
b<-subset_df1 %>%
rowwise %>%
select(-c(Week,starts_with('DR')))%>%
mutate(Result = return_valuesX(df1,Id, date2, Category)) %>%
data.frame()
> b
Id date1 date2 Category Result
1 1 2021-12-01 2021-12-02 ABC 4.0
2 2 2021-12-01 2021-12-02 EFG 238.0
3 3 2021-12-01 2021-12-03 ABC 198.0
4 4 2021-12-01 2021-12-03 EFG 163.0
5 5 2021-12-01 2021-12-04 ABC 462.0
...............
Comparing the two results:
identical(a, b)
[1] TRUE
Calculate processing time using benchmark
subset_df1 <- subset(df1, date2 > date1)
bench::mark(a=subset_df1 %>%
rowwise %>%
select(-c(Week,starts_with('DR')))%>%
mutate(Result=return_values(df1,Id, date2, Category)),
b=subset_df1 %>%
rowwise %>%
select(-c(Week,starts_with('DR')))%>%
mutate(Result=return_valuesX(df1,Id, date2, Category)),iterations = 1)
# A tibble: 2 x 13
expression min median `itr/sec` mem_alloc `gc/sec` n_itr n_gc total_time result memory time gc
<bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl> <int> <dbl> <bch:tm> <list> <list> <list> <list>
1 a 53.7s 53.7s 0.0186 4.54GB 0.634 1 34 53.7s <rowwise_df [130 x 5]> <Rprofmem [981,580 x 3]> <bench_tm [1]> <tibble [1 x 3]>
2 b 21s 21s 0.0477 913.77MB 0.382 1 8 21s <rowwise_df [130 x 5]> <Rprofmem [278,340 x 3]> <bench_tm [1]> <tibble [1 x 3]>
To check df1 database
Here's an approach that is about 20x/10x faster than your functions for the example data, and would be faster yet for larger data sets. (When I run with 100k rows in df1, it's 572x faster.) I hope you'll find this approach easier to understand and debug.
This is written using tidyverse functions like tidyr::pivot_longer and dplyr::group_by. If you want to squeeze out a bit more speed, the data.table and collapse packages offer faster alternatives for many functions, especially around grouped calculations. But the main speed improvement here comes from restructuring to avoid repeating the same calculations over and over, and from letting R rely more on vectorized calculations (see https://www.noamross.net/archives/2014-04-16-vectorization-in-r-why/).
pre_calc <- function(df) {
pre_calc <- df1 %>% # computed once on the full df1, taken from the calling environment
select(!ends_with("_PV")) %>%
pivot_longer(-c(1:6), values_to = "DRM", names_to = "day") %>%
mutate(day = parse_number(day)) %>%
group_by(Id, Category, Week, day) %>%
mutate(med = median(DR1 - DRM), Result = DRM + med) %>%
ungroup()
df %>% # starts from the subsetted data and joins to results from above
select(1:5) %>%
left_join(pre_calc) %>%
filter(day == date2 - date1 + 1) %>%
select(Id, date1, date2, Category, Result)
}
c <- subset_df1 %>% pre_calc()
c matches a and b from your tests, with one difference: date2 (originally a date stored as integer, which is a nonstandard type) has in my approach been coerced into a typical date stored as double, like date1. We can use typeof(df1$date1) and typeof(df1$date2) to see this. waldo::compare(b, c) confirms the results otherwise match. I opened an issue with tidyr here, since the subtle change seems to have been caused by the pivot_longer step.
UPDATE: Apparently the creation of a date-integer object is a bug in base R's seq.Date / seq function, which was fixed in R 4.2: https://github.com/tidyverse/tidyr/issues/1356#issuecomment-1111078891
In the approach above, I pre-calculate all the results once, by taking the original data set df1, throwing out the existing _PV columns (I think they get overwritten?), and -- here's where the speed gains come from -- reshaping to long format. While this single operation is computationally expensive, it means we can more efficiently apply the same calculation to all the DRM_* columns at once, and we can rely on fast filtering instead of slow subsetting to extract our result.
The group_by(Id, Category, Week, day) line and the next mutate(... line let us calculate the median differences between DR1 and that day's DRM for each Id-Category-Week combination, so we can calculate all the Results at once.
The last part takes the df lines (e.g. the subsetted data in your example where date2 > date1) and attaches those to the pre-calculated results, filtering to get the right day (previously encoded by column name/position).
I have two datasets with names and I need to compare the names in both. I just need to keep the overlap of the two datasets based on the names. However, a name is still considered 'matched' if it is part of another name, even if it is not a full match, and vice versa. For example, "seb" should match "seb", but also "sebas". I am using str_detect(), but it is too slow. I am wondering if there is any way to speed up this process. I tried some other packages and functions, but nothing really improved the speed. I am open to any R or Python solution.
Create two dummy datasets
library(dplyr)
library(stringr)
set.seed(1)
data_set_A <- tibble(name = unique(replicate(2000, paste(sample(letters, runif(1, 3, 10), replace = T), collapse = "")))) %>%
mutate(ID_A = 1:n())
set.seed(2)
data_set_B <- tibble(name_2 = unique(replicate(2000, paste(sample(letters, runif(1, 3, 10), replace = T), collapse = "")))) %>%
mutate(ID_B = 1:n())
Test matching of full matches only
# This is almost instant
data_set_A %>%
rowwise() %>%
filter(any(name %in% data_set_B$name_2) | any(data_set_B$name_2 %in% name)) %>%
ungroup()
# A tibble: 4 x 2
name ID_A
<chr> <int>
1 vnt 112
2 fly 391
3 cug 1125
4 xgv 1280
Include partial matches (This is what I want to optimize)
This of course only gives me the subset of dataset A, but that is ok.
# This takes way too long
data_set_A %>%
rowwise() %>%
filter(any(str_detect(name, data_set_B$name_2)) | any(str_detect(data_set_B$name_2, name))) %>%
ungroup()
# A tibble: 237 x 2
name ID_A
<chr> <int>
1 wknrsauuj 2
2 lyw 7
3 igwsvrzpk 16
4 zozxjpu 18
5 cgn 22
6 oqo 45
7 gkritbe 47
8 uuq 92
9 lhwfyksz 94
10 tuw 100
Fuzzyjoin method.
This also works, but is equally slow
bind_rows(
fuzzyjoin::fuzzy_inner_join(
data_set_A,
data_set_B,
by = c("name" = "name_2"),
match_fun = stringr::str_detect
) %>%
select(name, ID_A),
fuzzyjoin::fuzzy_inner_join(
data_set_B,
data_set_A,
by = c("name_2" = "name"),
match_fun = stringr::str_detect
) %>%
select(name, ID_A)
) %>%
distinct()
data.table solution
Not much faster, unfortunately:
library(data.table)
setDT(data_set_A)
setDT(data_set_B)
data_set_A[data_set_A[, .I[any(str_detect(name, data_set_B$name_2)) |
any(str_detect(data_set_B$name_2, name))], by = .(ID_A)]$V1]
This is an R option aimed at reducing the number of times str_detect() is called (your example is slow because the function is called several thousand times, and because it does not use fixed() or fixed = TRUE, as jpiversen already pointed out). The answer is explained in comments in the code; I will try to jump on tomorrow to explain a bit more.
This should scale reasonably well and be more memory efficient than the current approach too, because it reduces the rowwise computations to an absolute minimum.
Benchmarks:
n = 2000
# A tibble: 4 × 13
expression min median `itr/sec` mem_alloc `gc/sec` n_itr
<bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl> <int>
1 original() 6.67s 6.67s 0.150 31.95MB 0.300 1
2 using_fixed() 496.54ms 496.54ms 2.01 61.39MB 4.03 1
3 using_map_fixed() 493.35ms 493.35ms 2.03 60.27MB 6.08 1
4 andrew_fun() 167.78ms 167.78ms 5.96 1.59MB 0 1
n = 4000
Note: I am not sure if you need the answer to scale; but the approach of reducing the memory-intensive part does seem to do just that (although the time difference is negligible for n = 4000 for 1 iteration, IMO).
# A tibble: 4 × 13
expression min median `itr/sec` mem_alloc `gc/sec` n_itr
<bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl> <int>
1 original() 26.63s 26.63s 0.0376 122.33MB 0.150 1
2 using_fixed() 1.91s 1.91s 0.525 243.96MB 3.67 1
3 using_map_fixed() 1.87s 1.87s 0.534 236.62MB 3.20 1
4 andrew_fun() 674.36ms 674.36ms 1.48 7.59MB 0 1
Code w/ comments:
# Exclude each set's longest strings from the patterns: at the maximum
# length a match could only be an exact match, which %in% already handles
nchar_a = nchar(data_set_A$name)
nchar_b = nchar(data_set_B$name_2)
# Creating large patterns (excluding values w/ max number of characters)
pattern_a = str_c(unique(data_set_A$name[nchar_a != max(nchar_a, na.rm = TRUE)]), collapse = "|")
pattern_b = str_c(unique(data_set_B$name_2[nchar_b != max(nchar_b, na.rm = TRUE)]), collapse = "|")
# First checking using %in%
idx_a = data_set_A$name %in% data_set_B$name_2
# Next, IDing when a(string) matches b(pattern)
idx_a[!idx_a] = str_detect(data_set_A$name[!idx_a], pattern_b)
# IDing a(pattern) matches b(string) so we do not run every row of
# a(as a pattern) against all of b
b_to_check = data_set_B$name_2[str_detect(data_set_B$name_2, pattern_a)]
# Using unmatched values of a as a pattern for the reduced set for b
idx_a[!idx_a] = vapply(data_set_A$name[!idx_a], function(name) {
any(grepl(name, b_to_check, fixed = TRUE))
}, logical(1L), USE.NAMES = FALSE)
data_set_A[idx_a, ]
# A tibble: 237 × 2
name ID_A
<chr> <int>
1 wknrsauuj 2
2 lyw 7
3 igwsvrzpk 16
4 zozxjpu 18
5 cgn 22
6 oqo 45
7 gkritbe 47
8 uuq 92
9 lhwfyksz 94
10 tuw 100
# … with 227 more rows
Reproducible R code for benchmarks
The following code is largely taken from jpiversen who provided a great answer:
library(dplyr)
library(stringr)
n = 2000
set.seed(1)
data_set_A <- tibble(name = unique(replicate(n, paste(sample(letters, runif(1, 3, 10), replace = T), collapse = "")))) %>%
mutate(ID_A = 1:n())
set.seed(2)
data_set_B <- tibble(name_2 = unique(replicate(n, paste(sample(letters, runif(1, 3, 10), replace = T), collapse = "")))) %>%
mutate(ID_B = 1:n())
original <- function() {
data_set_A %>%
rowwise() %>%
filter(any(str_detect(name, data_set_B$name_2)) | any(str_detect(data_set_B$name_2, name))) %>%
ungroup()
}
using_fixed <- function() {
data_set_A %>%
rowwise() %>%
filter(any(str_detect(name, fixed(data_set_B$name_2))) | any(str_detect(data_set_B$name_2, fixed(name)))) %>%
ungroup()
}
using_map_fixed <- function() {
logical_vec <- data_set_A$name %>%
purrr::map_lgl(
~any(stringr::str_detect(.x, fixed(data_set_B$name_2))) ||
any(stringr::str_detect(data_set_B$name_2, fixed(.x)))
)
data_set_A[logical_vec, ]
}
andrew_fun = function() {
nchar_a = nchar(data_set_A$name)
nchar_b = nchar(data_set_B$name_2)
pattern_a = str_c(unique(data_set_A$name[nchar_a != max(nchar_a, na.rm = TRUE)]), collapse = "|")
pattern_b = str_c(unique(data_set_B$name_2[nchar_b != max(nchar_b, na.rm = TRUE)]), collapse = "|")
idx_a = data_set_A$name %in% data_set_B$name_2
idx_a[!idx_a] = str_detect(data_set_A$name[!idx_a], pattern_b)
b_to_check = data_set_B$name_2[str_detect(data_set_B$name_2, pattern_a)]
idx_a[!idx_a] = vapply(data_set_A$name[!idx_a], function(name) {
any(grepl(name, b_to_check, fixed = TRUE))
}, logical(1L), USE.NAMES = FALSE)
data_set_A[idx_a, ]
}
bm = bench::mark(
original(),
using_fixed(),
using_map_fixed(),
andrew_fun(),
iterations = 1
)
TL;DR
The slow part is str_detect(string, pattern).
To speed it up, wrap the pattern in fixed() if you have simple strings, and in coll() if you have longer, typical human text.
To get another slight speed boost, rewrite your code using purrr::map_lgl() and use this to subset your data.
Below follow examples, explanations and benchmarks.
Rewriting str_detect() using fixed() or coll()
I believe the easiest fix is to modify how str_detect() uses regex with e.g. stringr::fixed() or stringr::coll().
From ?stringr::str_detect():
Match a fixed string (i.e. by comparing only bytes), using fixed(). This is fast, but approximate. Generally, for matching human text, you'll want coll() which respects character matching rules for the specified locale.
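To make the byte-comparison point concrete, a quick sketch of my own (with library(stringr) loaded): fixed() treats the pattern as a literal string, so regex metacharacters lose their special meaning.
str_detect("a.c", fixed("a.c")) # TRUE: literal match
str_detect("abc", fixed("a.c")) # FALSE: "." is not a wildcard inside fixed()
str_detect("abc", "a.c")        # TRUE: "." is a wildcard under regex matching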
Below is a comparison with your original code:
original <- function() {
data_set_A %>%
rowwise() %>%
filter(any(str_detect(name, data_set_B$name_2)) | any(str_detect(data_set_B$name_2, name))) %>%
ungroup()
}
# Note the use of fixed()
using_fixed <- function() {
data_set_A %>%
rowwise() %>%
filter(any(str_detect(name, fixed(data_set_B$name_2))) | any(str_detect(data_set_B$name_2, fixed(name)))) %>%
ungroup()
}
# Note the use of coll()
using_coll <- function() {
data_set_A %>%
rowwise() %>%
filter(any(str_detect(name, coll(data_set_B$name_2))) | any(str_detect(data_set_B$name_2, coll(name)))) %>%
ungroup()
}
bm <- bench::mark(
original(),
using_fixed(),
using_coll(),
iterations = 20
)
#> Warning: Some expressions had a GC in every iteration; so filtering is disabled.
bm
#> # A tibble: 3 × 6
#> expression min median `itr/sec` mem_alloc `gc/sec`
#> <bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl>
#> 1 original() 6.58s 6.59s 0.152 32.4MB 0.371
#> 2 using_fixed() 501.64ms 505.51ms 1.97 61.4MB 3.94
#> 3 using_coll() 4.48s 4.5s 0.222 61.4MB 0.512
bm %>% ggplot2::autoplot(type = "violin")
#> Loading required namespace: tidyr
Created on 2022-04-02 by the reprex package (v2.0.1)
So, as we can see, wrapping your code in fixed() will make it very fast and works well on your test data. However, it might not work as well for real human text (especially non-ASCII character sets). You should test it on your original data, and use coll() as an alternative if fixed() doesn't work.
Removing rowwise()
Another step you can take to make your code a bit faster is to get rid of rowwise(). I would replace it with purrr::map_lgl() and use the resulting logical vector to subset the data frame. Below is an example and a benchmark against the functions defined above:
using_map_fixed <- function() {
logical_vec <- data_set_A$name %>%
purrr::map_lgl(
~any(stringr::str_detect(.x, fixed(data_set_B$name_2))) ||
any(stringr::str_detect(data_set_B$name_2, fixed(.x)))
)
data_set_A[logical_vec, ]
}
using_map_coll <- function() {
logical_vec <- data_set_A$name %>%
purrr::map_lgl(
~any(stringr::str_detect(.x, coll(data_set_B$name_2))) ||
any(stringr::str_detect(data_set_B$name_2, coll(.x)))
)
data_set_A[logical_vec, ]
}
bm <- bench::mark(
using_fixed(),
using_map_fixed(),
using_coll(),
using_map_coll(),
iterations = 20
)
#> Warning: Some expressions had a GC in every iteration; so filtering is disabled.
bm
#> # A tibble: 4 × 6
#> expression min median `itr/sec` mem_alloc `gc/sec`
#> <bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl>
#> 1 using_fixed() 503.4ms 507.24ms 1.95 62.9MB 5.37
#> 2 using_map_fixed() 474.28ms 477.63ms 2.09 60.3MB 3.14
#> 3 using_coll() 4.49s 4.5s 0.222 61.4MB 0.489
#> 4 using_map_coll() 4.37s 4.38s 0.228 60.2MB 0.354
Created on 2022-04-02 by the reprex package (v2.0.1)
As we see, this gives another slight speed boost.
Using fixed() with data.table or fuzzyjoin
You can also use fixed() with data.table and fuzzyjoin. I have not included it here for brevity, but my benchmark shows that data.table takes about the same amount of time as my using_map_fixed() above, and fuzzyjoin takes about twice as much time.
This makes sense to me, as the slow part is str_detect(), not the method of joining/filtering, or the underlying data structure.
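For reference, a sketch of my own of the data.table variant, mirroring the filtering pattern from the question with fixed() added (assumes library(stringr) is loaded):
library(data.table)
dt_A <- as.data.table(data_set_A) # work on a copy so the tibble is left untouched
dt_A[dt_A[, .I[any(str_detect(name, fixed(data_set_B$name_2))) |
                 any(str_detect(data_set_B$name_2, fixed(name)))],
          by = .(ID_A)]$V1]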
If you would like to use base R, the code below might be one fast option:
A <- data_set_A$name
B <- data_set_B$name_2
A2B <- sapply(A, function(x) grepl(x, B, fixed = TRUE)) # A2B[i, j]: is A[j] a substring of B[i]?
B2A <- sapply(B, function(x) grepl(x, A, fixed = TRUE)) # B2A[i, j]: is B[j] a substring of A[i]?
idx <- which(t(A2B) | B2A, arr.ind = TRUE) # (row in A, column in B) pairs that match in either direction
res <- cbind(data_set_A[idx[, 1], ], data_set_B[idx[, 2], ])
which gives
> res
name ID_A name_2 ID_B
1 arh 1234 pimoarhd 8
2 qtj 720 aqtj 23
3 szcympsn 142 cym 43
4 cymvubnxg 245 cym 43
5 dppvtcymq 355 cym 43
6 kzi 690 kzii 48
7 eyajqchkn 498 chk 53
8 upfzh 522 upf 61
9 ioa 1852 ioadr 63
10 lya 1349 ibelyalvh 64
11 honod 504 ono 71
12 zozxjpu 18 zoz 72
13 jcz 914 cdjczpqg 88
14 ailmjf 623 ilm 99
15 upoux 609 oux 104
16 pouxifvp 1466 oux 104
17 mvob 516 vob 106
18 nqtotvhhm 1088 otv 115
19 wom 202 womtglapx 117
20 qkc 756 dqkcfqpps 118
21 qtl 600 ivqtlymzr 126
22 qqi 1605 owfsqqiyu 153
23 fmjalirze 1470 ali 172
24 ibwfwkyp 1588 fwk 175
25 iat 1258 iatjeg 185
26 osm 253 nviiqosm 199
27 wpj 373 wpjeb 204
28 hahx 515 ahx 213
29 keahxa 1565 ahx 213
30 psf 359 qnpsfo 223
31 saq 1859 saqhu 227
32 cvmkwtx 714 cvm 228
33 ilw 389 pyilwj 231
34 ohwysv 1590 ysv 237
35 utrl 698 trl 244
36 dmttrlcpj 1267 trl 244
37 cpv 236 btcpvmoc 247
38 uto 1047 utoi 257
39 yngunekl 1978 ekl 258
40 vceko 625 vce 265
41 fir 1934 firgk 278
42 qvd 983 eqvdfi 287
43 fir 1934 zwwefir 291
44 idvfkevdf 1380 vdf 312
45 qwdo 1921 qwd 322
46 kam 1205 tlkam 327
47 lck 488 clckjkyzn 329
48 gmspwckw 1015 msp 359
49 ynouuwqtz 1576 nou 360
50 tty 1209 bttyvt 361
51 vkc 999 fmrvkcl 366
52 ipw 1918 fipwjomdu 388
53 zdv 261 zdvkut 410
54 vku 1137 zdvkut 410
55 doby 246 oby 411
56 hycvuupgy 141 uup 421
57 uwlb 1249 wlb 431
58 auj 1452 lcmnauj 444
59 rwd 1667 ukwrwdczs 479
60 ylsihqqor 1290 ihq 483
61 feo 1649 feorvxbm 485
62 zff 755 dohzffujm 499
63 mqutujepu 904 epu 507
64 uiepu 1308 epu 507
65 vahepuk 1434 epu 507
66 cug 1125 accugl 509
67 fir 1934 firwe 517
68 dia 1599 dialeddd 527
69 temiwd 1725 tem 531
70 svofivl 1177 svo 545
71 flm 657 aflm 546
72 vnt 112 vnt 551
73 bhmoskrz 426 osk 558
74 wev 728 shemuwev 569
75 hzpi 1586 hzp 579
76 gvi 1064 mkgvivlfe 582
77 fjb 1398 vkfjbxnjl 589
78 qin 1013 qinp 593
79 ecn 1342 ecnzre 598
80 zre 1610 ecnzre 598
81 xvr 772 dpxvrfmo 623
82 tqr 1419 tqrmztdm 624
83 zmwnf 1571 mwn 626
84 ypil 1787 pil 630
85 mnxlqgfh 1132 nxl 643
86 gse 1563 gseice 646
87 ygk 1309 ygkqrk 655
88 fgm 933 vzfgmy 663
89 rlupd 977 upd 666
90 mcupdkuiy 1307 upd 666
91 fly 391 fly 669
92 vbkko 1603 kko 678
93 uvrew 465 rew 680
94 hgbhngwvd 901 wvd 690
95 wvdjprmo 1432 wvd 690
96 cgn 22 cgnd 698
97 dngnjv 967 njv 700
98 psqs 841 sqs 720
99 ywv 1180 ptywvlgc 730
100 ypil 1787 ypi 734
101 rwd 1667 srserwd 737
102 jqydasl 1294 jqy 742
103 ckujmc 717 ujm 751
104 dfzxta 662 xta 775
105 bjb 1562 jabjbei 779
106 adwknpll 1242 npl 780
107 kdv 1327 xhkdvqo 789
108 ghj 174 oghj 801
109 lhwfyksz 94 lhw 811
110 nwrrnlhw 929 lhw 811
111 xlhwm 1720 lhw 811
112 ncc 1602 wurhxnccn 814
113 jdslrf 1094 dsl 835
114 ktmw 1738 tmw 844
115 igwsvrzpk 16 gws 856
116 kug 591 pkugls 857
117 befgcpedr 339 fgc 862
118 ojf 1397 ojfpnkla 863
119 gyl 1203 gylxeqzw 872
120 ugcbb 1727 ugc 876
121 arh 1234 karhwhg 878
122 amm 458 ammqdc 883
123 azazryje 636 zaz 900
124 wczazw 1887 zaz 900
125 gkritbe 47 ritb 915
126 vku 1137 yjvkuxued 929
127 rnh 1633 kvyrnhugu 937
128 mzh 1135 xllmwmzhn 940
129 cug 1125 cug 960
130 xgv 1280 xgv 962
131 xusxgv 1436 xgv 962
132 umc 351 lwumcmvoo 980
133 zlb 1900 nkyazlb 991
134 llfkalao 1049 llf 1002
135 sflpbht 991 lpb 1048
136 rairmmcl 442 mmc 1087
137 mmckoln 780 mmc 1087
138 gfxmmcgb 1814 mmc 1087
139 aoj 402 taojlgp 1089
140 mypvzhp 121 ypv 1095
141 moctwaypv 611 ypv 1095
142 rngedn 306 ged 1106
143 djshecy 1408 ecy 1108
144 rairmmcl 442 rmm 1117
145 gzua 1594 zua 1124
146 ytj 416 yytj 1140
147 ubt 300 hubtcfr 1141
148 gqg 1854 ogqgsjqc 1144
149 tfg 1204 xiutfgru 1145
150 avrq 741 avr 1147
151 ytkpvss 440 tkp 1149
152 kug 591 yxsjkug 1176
153 vix 1846 vixsmn 1187
154 qtl 600 qtljkxz 1188
155 lgr 494 dlgrco 1189
156 ryg 864 xlmtryg 1203
157 yskvkxwj 1547 kvk 1205
158 kxhee 1795 xhe 1222
159 hzbcjs 1493 cjs 1224
160 kbi 270 itxlwkbi 1225
161 gdymcam 806 ymca 1232
162 tqr 1419 rxtqrdtl 1236
163 yyz 215 yyzw 1242
164 jyx 1735 mljjyxu 1248
165 aai 1928 umkpaaiwo 1254
166 dsd 1122 dndsdova 1257
167 tor 744 etor 1270
168 vhcyznp 1296 yzn 1278
169 xlc 1947 odxlcjwj 1280
170 mlm 1629 aomlmgtq 1303
171 owm 239 owmugb 1304
172 ynezwaml 507 nez 1308
173 jls 695 jlsve 1325
174 dvm 879 dvmv 1339
175 vsgx 944 dqpihvsgx 1352
176 wfo 768 wfokpjois 1354
177 tltbkinat 1986 nat 1362
178 gyl 1203 gylqte 1363
179 ngg 735 bsnggqbjd 1366
180 fkq 345 jdfkqf 1368
181 ojf 1397 ojfpgfga 1382
182 dqgd 1623 prqbndqgd 1398
183 siu 827 siuypucup 1412
184 yinsoivfd 1895 yin 1414
185 esm 1834 sesmeepz 1417
186 umc 351 umcj 1432
187 wny 866 wnyxamguw 1443
188 ujbhtvnin 399 vni 1444
189 dbq 630 bdbqq 1452
190 ebn 1405 ebngddw 1461
191 zcj 704 rbtjzcjod 1465
192 avn 500 avnspxv 1468
193 vkk 567 hvkk 1477
194 hmm 1441 bgjhmmthz 1483
195 aguakz 614 guak 1487
196 hycvuupgy 141 pgy 1493
197 tizpgymz 280 pgy 1493
198 guk 571 cncxdguk 1502
199 zyw 281 nzywuqs 1504
200 jnz 1558 rxdxsjnzw 1510
201 uuq 92 nxuuqtj 1514
202 qtj 720 nxuuqtj 1514
203 vkk 567 xpbpvkkdc 1518
204 iaa 460 sjiaa 1525
205 txsgmynng 1019 xsg 1526
206 yjvtwc 1107 jvt 1529
207 lnk 1113 hylnknwy 1546
208 szd 635 woszdm 1557
209 osm 253 sosmdp 1567
210 nbd 1067 nbdmmg 1570
211 mmg 1305 nbdmmg 1570
212 wqdsatbd 1536 sat 1585
213 sdlypo 1527 sdl 1596
214 inkynog 288 inky 1600
215 hpwoeclfy 1321 clf 1601
216 wodyqwqf 679 dyq 1603
217 lyw 7 xnalywyuw 1607
218 njm 1825 vjlnjmns 1617
219 njytqhaut 428 qha 1620
220 ilw 389 rilwbk 1647
221 oqo 45 ixoqowkpg 1650
222 odcbcvaun 1386 bcv 1652
223 mastn 434 stn 1662
224 xebhdssit 1091 xeb 1663
225 nmy 782 nmyxj 1671
226 fsqvgdw 673 gdw 1676
227 mwwczhs 482 wcz 1679
228 wczazw 1887 wcz 1679
229 anmryzm 915 ryz 1698
230 rteh 523 rte 1708
231 mlwrguae 817 lwr 1709
232 mbu 819 xpsuqmbuf 1729
233 mmckoln 780 cko 1733
234 lxpg 798 lxp 1734
235 ane 370 vxnanehvk 1746
236 tty 1209 vbttyozui 1752
237 igncdgyjx 332 ign 1753
238 ndignk 621 ign 1753
239 nmy 782 ivnmyba 1780
240 wknrsauuj 2 rsa 1799
241 tgd 165 qtgdidlf 1803
242 iaa 460 yziaazxto 1833
243 xto 1245 yziaazxto 1833
244 zff 755 dpzfft 1857
245 jyx 1735 jwjyxphe 1873
246 ytj 416 eytj 1881
247 lcggwonk 1596 onk 1882
248 zdv 261 zdvxfz 1889
249 xhskcb 417 kcb 1890
250 mrikqkcb 770 kcb 1890
251 psvxqnsap 1352 psv 1898
252 udjswzb 411 jsw 1900
253 rpfjswy 1840 jsw 1900
254 bjaywiso 1677 ayw 1902
255 zfli 130 fli 1906
256 vazx 1215 itvazxw 1918
257 tuw 100 tuwywtbwd 1921
258 vle 1437 ebvleaovm 1937
259 znycsygd 1757 nyc 1944
260 ynezwaml 507 ezw 1952
261 tseezwf 1276 ezw 1952
262 ezwzyfudo 1690 ezw 1952
263 oudiky 1503 dik 1964
264 dikjn 1615 dik 1964
265 oms 106 wpomsudi 1977
266 hhp 1864 hhpkm 1983
Benchmarking
It seems this base R option is slightly slower than @Andrew's approach.
TIC <- function() {
A <- data_set_A$name
B <- data_set_B$name_2
A2B <- sapply(A, function(x) grepl(x, B, fixed = TRUE))
B2A <- sapply(B, function(x) grepl(x, A, fixed = TRUE))
idx <- which(t(A2B) | B2A, arr.ind = TRUE)
cbind(data_set_A[idx[, 1], ], data_set_B[idx[, 2], ])
# data_set_A[unique(idx[, 1]), ]
}
jpiversen_fixed <- function() {
data_set_A %>%
rowwise() %>%
filter(any(str_detect(name, fixed(data_set_B$name_2))) | any(str_detect(data_set_B$name_2, fixed(name)))) %>%
ungroup()
}
andrew <- function() {
nchar_a <- nchar(data_set_A$name)
nchar_b <- nchar(data_set_B$name_2)
pattern_a <- str_c(unique(data_set_A$name[nchar_a != max(nchar_a, na.rm = TRUE)]), collapse = "|")
pattern_b <- str_c(unique(data_set_B$name_2[nchar_b != max(nchar_b, na.rm = TRUE)]), collapse = "|")
idx_a <- data_set_A$name %in% data_set_B$name_2
idx_a[!idx_a] <- str_detect(data_set_A$name[!idx_a], pattern_b)
b_to_check <- data_set_B$name_2[str_detect(data_set_B$name_2, pattern_a)]
idx_a[!idx_a] <- vapply(data_set_A$name[!idx_a], function(name) {
any(grepl(name, b_to_check, fixed = TRUE))
}, logical(1L), USE.NAMES = FALSE)
data_set_A[idx_a, ]
}
library(microbenchmark)
bm <- microbenchmark(
TIC(),
jpiversen_fixed(),
andrew(),
times = 20
)
shows that
> bm
Unit: milliseconds
              expr       min        lq      mean    median        uq       max neval
             TIC()  423.8410  441.3574  492.6091  478.2596  549.2376  611.3841    20
 jpiversen_fixed() 1354.8954 1373.9502 1447.8649 1395.6766 1459.7058 1842.2574    20
          andrew()  329.4821  335.3388  345.8890  341.4758  354.1298  381.6872    20
In a data frame dt of dimension 76x108, I want to reduce the sum of the values in each of columns 13 to 108 by an amount stored in a vector c, by whittling down the non-zero elements of each column, starting from the last row.
For example, if dt[76,13] > 0, the following happens:
dt[76,13] <- max((dt[76,13]-c),0)
If after this operation dt[76,13] == 0, the residual of c - dt[76,13] should get subtracted from the next non-zero element in column 13. This goes on until the sum of all rows in column 13 is reduced by an amount equivalent to c.
This needs to be done for the 96 columns in dt[,13:108].
Edited: Added an example with a smaller data frame below.
dt <- data.frame(Plant = sample(LETTERS,10,replace=T),
Type = rep("Base",10),
Ownership = rep("Pub",10))
caps = matrix(round(runif(10*5,0,500),0),nrow=10,ncol=5)
dt <- as.data.frame(cbind(dt,caps)) #this what the data frame looks like
for(i in 1:5){
colnames(dt)[i+3] <- (paste0("TB",i))
}
dt
Plant Type Ownership TB1 TB2 TB3 TB4 TB5
1 T Base Pub 454 32 162 271 478
2 S Base Pub 275 75 385 491 60
3 Y Base Pub 314 44 252 221 363
4 T Base Pub 170 122 490 332 123
5 J Base Pub 241 178 173 472 468
6 B Base Pub 243 316 152 411 434
7 T Base Pub 127 167 356 451 400
8 U Base Pub 20 102 54 182 57
9 O Base Pub 368 333 236 103 27
10 J Base Pub 343 189 0 494 184
c <- c(500,200,217,50,300)
#required output
Plant Type Ownership TB1 TB2 TB3 TB4 TB5
1 T Base Pub 454 32 162 271 478
2 S Base Pub 275 75 385 491 60
3 Y Base Pub 314 44 252 221 363
4 T Base Pub 170 122 490 332 123
5 J Base Pub 241 178 173 472 468
6 B Base Pub 243 316 152 411 434
7 T Base Pub 127 167 356 451 368
8 U Base Pub 20 102 54 182 0
9 O Base Pub 211 322 19 103 0
10 J Base Pub 0 0 0 444 0
#dt[10,4] is now max((343-500),0), while dt[9,4] is 368-(500-343).
#dt[10,5] is now max((189-200),0), while dt[9,5] is 333-(200-189).
#and so on.
What I've tried so far looks something like this:
for(i in 4:8){ # columns TB1..TB5; c[i-3] aligns c (length 5) with columns 4:8
j <- nrow(dt) #start from the last row
if(dt[j,i]>0){
res1 <- c[i-3] - dt[j,i] #residual value of the difference
dt[j,i] <- max((dt[j,i] - c[i-3]),0)
while(res1>0){ #the process should continue until an amount equivalent to c[i] is not subtracted from dt[j,i]
j <- j-1
p <- dt[j,i]
dt[j,i] <- max((dt[j,i] - res1),0)
res1 <- res1 - p
}
}
else if(dt[j,i]==0){ #if the last element of the column is already 0, process should start w/ the first non-zero element
j <- j-1
res1 <- c[i-3] - dt[j,i]
dt[j,i] <- max((dt[j,i] - c[i-3]),0)
while(res1>0){
j <- j-1
p <- dt[j,i]
dt[j,i] <- max((dt[j,i] - res1),0)
res1 <- res1 - p
}
}
}
This will do what you want. (I renamed your vector c to cc so that it does not clash with the base function c().)
df <- read.table(text = ' Plant Type Ownership TB1 TB2 TB3 TB4 TB5
1 T Base Pub 454 32 162 271 478
2 S Base Pub 275 75 385 491 60
3 Y Base Pub 314 44 252 221 363
4 T Base Pub 170 122 490 332 123
5 J Base Pub 241 178 173 472 468
6 B Base Pub 243 316 152 411 434
7 T Base Pub 127 167 356 451 400
8 U Base Pub 20 102 54 182 57
9 O Base Pub 368 333 236 103 27
10 J Base Pub 343 189 0 494 184', header =T)
cc <- c(500,200,217,50,300)
library(tidyverse)
df %>% arrange(rev(row_number())) %>%
mutate(across(starts_with('TB'), ~ . - c(first(pmin(cc[as.integer(str_remove(cur_column(), 'TB'))],
cumsum(pmin(., cc[as.integer(str_remove(cur_column(), 'TB'))])))),
diff(pmin(cc[as.integer(str_remove(cur_column(), 'TB'))],
cumsum(pmin(., cc[as.integer(str_remove(cur_column(), 'TB'))])))))
)) %>%
arrange(rev(row_number()))
#> Plant Type Ownership TB1 TB2 TB3 TB4 TB5
#> 1 T Base Pub 454 32 162 271 478
#> 2 S Base Pub 275 75 385 491 60
#> 3 Y Base Pub 314 44 252 221 363
#> 4 T Base Pub 170 122 490 332 123
#> 5 J Base Pub 241 178 173 472 468
#> 6 B Base Pub 243 316 152 411 434
#> 7 T Base Pub 127 167 356 451 368
#> 8 U Base Pub 20 102 54 182 0
#> 9 O Base Pub 211 322 19 103 0
#> 10 J Base Pub 0 0 0 444 0
Created on 2021-05-24 by the reprex package (v2.0.0)
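To see what the across() call is doing, here is a sketch of my own of the core trick for a single column (TB1, with cc[1] = 500): after reversing the column, pmin() caps each value at the removable amount, cumsum() accumulates what has been taken so far, and the outer pmin() stops the running total once it reaches cc.
x <- rev(df$TB1)                             # work from the last row upward
taken <- pmin(cc[1], cumsum(pmin(x, cc[1]))) # cumulative amount removed, capped at 500
x_new <- x - c(taken[1], diff(taken))        # per-row amounts actually subtracted
rev(x_new) # 454 275 314 170 241 243 127 20 211 0, matching the TB1 column above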
I have a dataset:
> k
EVTYPE FATALITIES INJURIES
198704 HEAT 583 0
862634 WIND 158 1150
68670 WIND 116 785
148852 WIND 114 597
355128 HEAT 99 0
67884 WIND 90 1228
46309 WIND 75 270
371112 HEAT 74 135
230927 HEAT 67 0
78567 WIND 57 504
The variables are as follows. As per the first answer by joran, unused levels can be dropped with droplevels, so don't worry about the 898 levels; the illustrative k I'm showing is the complete dataset, obtained from k <- d1[1:10, 3:4], where d1 is the original dataset.
> str(k)
'data.frame': 10 obs. of 3 variables:
$ EVTYPE : Factor w/ 898 levels " HIGH SURF ADVISORY",..: 243 NA NA NA 243 NA NA 243 243 NA
$ FATALITIES: num 583 158 116 114 99 90 75 74 67 57
$ INJURIES : num 0 1150 785 597 0 ...
I'm trying to overwrite the WIND factor:
> k[k$EVTYPE==factor("WIND"), ]$EVTYPE <- factor("AFDAF")
> k[k$EVTYPE=="WIND", ]$EVTYPE <- factor("AFDAF")
But both commands give me error messages: 'level sets of factors are different' or 'invalid factor level, NA generated'.
How should I do this?
Try this instead:
k <- droplevels(d1[1:10, 3:5])
Factors (as per the documentation) are simply a vector of integer codes and then a simple vector of labels for each code. These are called the "levels". The levels are an attribute, and persist with your data even when subsetting.
This is a feature, since for many statistical procedures it is vital to keep track of all the possible values that variable could have, even if they don't appear in the actual data.
Some people find this irritating and run R with options(stringsAsFactors = FALSE).
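A quick sketch of my own showing the integer-code structure directly:
f <- factor(c("WIND", "HEAT", "WIND"))
unclass(f) # integer codes 2 1 2, with a "levels" attribute
levels(f)  # "HEAT" "WIND"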
To simply change the levels, you can do something like this:
d <- read.table(text = " EVTYPE FATALITIES INJURIES
198704 HEAT 583 0
862634 WIND 158 1150
68670 WIND 116 785
148852 WIND 114 597
355128 HEAT 99 0
67884 WIND 90 1228
46309 WIND 75 270
371112 HEAT 74 135
230927 HEAT 67 0
78567 WIND 57 504",header = TRUE,sep = "",stringsAsFactors = TRUE)
> str(d)
'data.frame': 10 obs. of 3 variables:
$ EVTYPE : Factor w/ 2 levels "HEAT","WIND": 1 2 2 2 1 2 2 1 1 2
$ FATALITIES: int 583 158 116 114 99 90 75 74 67 57
$ INJURIES : int 0 1150 785 597 0 1228 270 135 0 504
> levels(d$EVTYPE) <- c('A','B')
> str(d)
'data.frame': 10 obs. of 3 variables:
$ EVTYPE : Factor w/ 2 levels "A","B": 1 2 2 2 1 2 2 1 1 2
$ FATALITIES: int 583 158 116 114 99 90 75 74 67 57
$ INJURIES : int 0 1150 785 597 0 1228 270 135 0 504
Or to just change one:
levels(d$EVTYPE)[2] <- 'C'
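Applied to your original goal (a sketch, assuming a freshly subset k where the WIND level is intact): rename the level itself instead of assigning into the column; this sidesteps the 'invalid factor level' error because only the label changes, not the underlying codes.
levels(k$EVTYPE)[levels(k$EVTYPE) == "WIND"] <- "AFDAF"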
What I need:
I have a huge data frame with the following columns (and some more, but these are not important). Here's an example:
user_id video_id group_id x y
1 1 0 0 39 108
2 1 0 0 39 108
3 1 10 0 135 180
4 2 0 0 20 123
User, video and group IDs are factors, of course. For example, there are 20 videos, but each of them has several "observations" for each user and group.
I'd like to transform this data frame into the following format, where there are as many x.N, y.N as there are users (N).
video_id x.1 y.1 x.2 y.2 …
0 39 108 20 123
So, for video 0, the x and y values from user 1 are in columns x.1 and y.1, respectively. For user 2, their values are in columns x.2, y.2, and so on.
What I've tried:
I made myself a list of data frames that are solely composed of the video_id, x, y observations for each user_id:
summaryList = dlply(allData, .(user_id), function(x) unique(x[c("video_id","x","y")]) )
This is what it looks like:
List of 15
$ 1 :'data.frame': 20 obs. of 3 variables:
..$ video_id: Factor w/ 20 levels "0","1","2","3",..: 1 11 8 5 12 9 20 13 7 10 ...
..$ x : int [1:20] 39 135 86 122 28 167 203 433 549 490 ...
..$ y : int [1:20] 108 180 164 103 187 128 185 355 360 368 ...
$ 2 :'data.frame': 20 obs. of 3 variables:
..$ video_id: Factor w/ 20 levels "0","1","2","3",..: 2 14 15 4 20 6 19 3 13 18 ...
..$ x : int [1:20] 128 688 435 218 528 362 299 134 83 417 ...
..$ y : int [1:20] 165 117 135 179 96 328 332 563 623 476 ...
Where I'm stuck:
What's left to do is:
Merge each data frame from summaryList with the others, based on video_id. I can't find a nice way to access the actual data frames in the list, which are summaryList[1]$`1`, summaryList[2]$`2`, et cetera.
@James found a partial solution:
Reduce(function(x,y) merge(x,y,by="video_id"),summaryList)
Ensure the column names are renamed after the user ID and not kept as-is. Right now my summaryList doesn't contain any info about the user ID, and the output of Reduce has duplicate column names like x.x y.x x.y y.y x.x y.x and so on.
How do I go about doing this? Or is there any easier way to get to the result than what I'm currently doing?
I am still somewhat confused. However, I guess you simply want to melt and dcast.
library(reshape2)
d <- melt(allData,id.vars=c("user_id","video_id"), measure.vars=c("x","y"))
dcast(d,video_id~user_id+variable,value.var="value",fun.aggregate=mean)
Resulting in:
video_id 1_x 1_y 2_x 2_y 3_x 3_y 4_x 4_y 5_x 5_y 6_x 6_y 7_x 7_y 8_x 8_y 9_x 9_y 10_x 10_y 11_x 11_y 12_x 12_y 14_x 14_y 15_x 15_y 16_x 16_y
1 0 39 108 899 132 61 357 149 298 1105 415 148 208 442 200 210 134 58 244 910 403 152 52 1092 617 1012 114 1105 424 548 394
2 1 1125 70 128 165 1151 390 171 587 623 623 80 643 866 310 994 114 854 129 781 306 672 -1 1096 354 525 524 150
Reduce does the trick:
reducedData <- Reduce(function(x,y) merge(x,y,by="video_id"),summaryList)
… but you need to fix the names afterwards:
names(reducedData)[-1] <- do.call(function(...) paste(...,sep="."),expand.grid(letters[24:25],names(summaryList)))
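What that renaming line produces (a sketch): expand.grid() pairs "x" and "y" (letters[24:25]) with each user ID from names(summaryList), varying the letter fastest, and paste() joins each pair with a dot.
do.call(function(...) paste(..., sep = "."),
        expand.grid(letters[24:25], names(summaryList)))[1:6]
# "x.1" "y.1" "x.2" "y.2" "x.3" "y.3"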
The result is:
video_id x.1 y.1 x.2 y.2 x.3 y.3 x.4 y.4 x.5 y.5 x.6 y.6 x.7 y.7 x.8
1 0 39 108 899 132 61 357 149 298 1105 415 148 208 442 200 210
2 1 1125 70 128 165 1151 390 171 587 623 623 80 643 866 310 994