generalizing net/gross in a bar chart

generalizing net/gross in a bar chart - r

I'm doing a particular operation quite a bit, and I need help generalizing it.
I have a lot of data that "looks" kind of like this:
> hflights::hflights %>% tbl_df %>% mutate(month=Month, carrier=UniqueCarrier) %>%
group_by(month, carrier) %>% summarize(delay=sum(ArrDelay, na.rm=T)) %>%
dcast(month ~ carrier)
month AA AS B6 CO DL EV F9 FL MQ OO UA US WN XE YV
1 1 18 296 229 27031 1026 1337 851 216 2322 3957 -219 -1068 31701 24248 NA
2 2 461 249 802 15769 1657 730 707 1079 4283 11486 323 -663 36729 27861 -44
3 3 317 476 1037 49061 905 2529 673 1111 2524 12955 1665 -606 28758 50702 -38
4 4 1147 465 518 52086 1856 4483 515 927 5085 17439 1803 -711 47084 69590 260
5 5 1272 56 654 63413 1381 3563 1334 1213 7899 22190 1798 1627 73771 66972 18
6 6 -262 172 504 60042 3736 2618 744 983 4519 21652 6260 2140 40191 66456 49
7 7 -460 112 1241 41300 2868 1628 321 506 1529 23432 2780 497 21200 98484 34
8 8 -1417 59 1659 36106 -949 808 42 -1366 310 11038 3546 -84 6991 33554 34
9 9 -841 -364 -202 24857 1022 -424 151 -747 -1373 4502 1743 248 15592 31846 NA
10 10 215 -112 -45 26437 1082 -1005 277 -537 522 13 1833 -1878 14725 27539 NA
11 11 97 -5 -72 20339 -101 207 180 449 2286 2628 230 -1093 8424 24199 NA
12 12 2287 -242 310 6644 1281 -1082 585 79 2311 5900 -491 -951 12735 65269 NA
There are positive and negative values with some groups; in this case, month & carrier. I can plot it like this:
> hflights::hflights %>% tbl_df %>% mutate(month=Month, carrier=UniqueCarrier) %>%
group_by(month, carrier) %>% summarize(delay=mean(ArrDelay, na.rm=T)) %>%
ggplot(aes(x=month, y=delay, fill=carrier)) + geom_bar(stat='identity')
Which gives me an eye-bleedy chart like this:
It also gives me the message:
Warning message:
Stacking not well defined when ymin != 0
This message is kind of what I'm after. I want to separate positive from negative so that I can see the "gross" amount, and also generate the sum per group and show the "net" amount.
For this dataset, I can do that like so:
> df <- hflights::hflights %>% tbl_df %>%
mutate(month=Month, carrier=UniqueCarrier) %>%
group_by(month, carrier) %>% summarize(delay=mean(ArrDelay, na.rm=T))
> ggplot(NULL, aes(x=month, y=delay, fill=carrier)) +
geom_bar(data=df %>% filter(delay > 0), stat='identity') +
geom_bar(data=df %>% filter(delay < 0), stat='identity') +
geom_bar(data=df %>% group_by(month) %>% summarize(delay=sum(delay, na.rm=T)), fill='black', width=0.25, alpha=0.5, stat='identity')
Which gives me this chestnut:
This is much nicer because in September, it doesn't do netting so I get a better sense of the magnitude of the positives and the negatives.
However, the above only works for this dataset. What happens when I have different groups? How do I generalize this?

Adding position = "identity" to geom_bar should get rid of the warning you are getting in your first plot.
The reason for this warning is related to interpreting that bars have negative height instead of just negative values.

Related

How to get year columnns to properly show up in a bar chart

Here is what I am trying to get into a bar chart
Daysoutfromelection 2018 2020 2022
-48 1 0 0
-47 1 0 0
-46 6 0 0
-45 29 9 3
-44 55 26 3
-43 90 63 84
-42 120 77 250
-41 141 105 356
-40 163 185 422
-39 187 201 487
-38 193 204 487
-37 208 212 487
-36 238 229 644
-35 265 237 745
-34 286 272 802
-33 309 336 881
-32 335 518 951
-31 342 564 953
-30 356 585 953
-29 389 623 1150
-28 423 705 1264
-27 1374 3262 1771
-26 1698 5347 2225
-25 1974 8006 3818
-24 1986 10649 4896
-23 2006 11011 7313
-22 15285 36989 27733
-21 114431 142780 52288
-20 190945 209455 165335
-19 262430 407002 253878
-18 337650 603601 324661
-17 398354 652945 328549
-16 402512 717160 328748
-15 456827 859288 460281
-14 587248 1045476 594185
-13 737361 1228168 708165
-12 820264 1361718 804466
-11 888008 1482940 894919
-10 954156 1553120 949524
-9 960346 1613982 949552
-8 1016452 1742538 1042852
-7 1132337 1956437 1203929
-6 1234037 2116439 1350648
-5 1392544 2278454 1425982
-4 1487850 2390960 1499179
-3 1514435 2420439 1525419
-2 1514571 2420789 1549176
-1 1528274 2510411 1579877
0 1530403 2810924 1596362
1 1562526 2880302 1865700
2 1563402 2887366 1884284
3 1563403 2887572 1888305
4 1563405 2887579 1888305
5 1563405 2887579 1888308
Here is a picture in case you cannot see that properly
I am having major trouble getting all three columns to appear. I would like there to be 3 x values all next to eachother on a bar chart, with the y showcasing total votes so that you can compare which year had the most votes. I only have dplyr, ggplot2, tidyr, stringr, and tidyverse.
I've tried it like
viz2 <- data2 %>%
ggplot(aes(x = Daysoutfromelection, y = '2018'))+
geom_bar()
print(viz2)
But I have also switched around the x & y, tried using geom_line.
and also some other ways. Any help would be greatly appreciated in getting this data to properly appear

You have to reshape your data first using e.g. tidyr::pivot_longer so that your years become categories of a new column which I name year and the values go into a new column called value. Afterwards you could create your bar chart where (as you already have a value column) you have to use geom_col. Additionally I added a position?"dodge" to put the bars side by side:
library(ggplot2)
library(tidyr)
dat <- data2 %>%
tidyr::pivot_longer(-Daysoutfromelection, names_to = "year", values_to = "value")
ggplot(dat, aes(x = Daysoutfromelection, y = value, fill = year)) +
geom_col(position = "dodge")
However, IMHO a line chart using geom_line is more appropriate for your data:
ggplot(dat, aes(x = Daysoutfromelection, y = value, color = year)) +
geom_line()

Gompertz-Makeham parameter estimation

I would like estimate the parameters of the Gompert-Makeham distribution, but I haven't got a result.
I would like a method in R, like this Weibull parameter estimation code:
weibull_loglik <- function(parm){
gamma <- parm[1]
lambda <- parm[2]
loglik <- sum(dweibull(vec, shape=gamma, scale=lambda, log=TRUE))
return(-loglik)
}
weibull <- nlm(weibull_loglik,parm<-c(1,1), hessian = TRUE, iterlim=100)
weibull$estimate
c=weibull$estimate[1];b=weibull$estimate[2]
My data:
[1] 872 52 31 26 22 17 11 17 17 8 20 12 25 14 17
[16] 20 17 23 32 37 28 24 43 40 34 29 26 32 34 51
[31] 50 67 84 70 71 137 123 137 172 189 212 251 248 272 314
[46] 374 345 411 494 461 505 506 565 590 535 639 710 733 795 786
[61] 894 963 1019 1149 1185 1356 1354 1460 1622 1783 1843 2049 2262 2316 2591
[76] 2730 2972 3187 3432 3438 3959 3140 3612 3820 3478 4054 3587 3433 3150 2881
[91] 2639 2250 1850 1546 1236 966 729 532 375 256 168 107 65 39 22
[106] 12 6 3 2 1 1
summary(vec)
Min. 1st Qu. Median Mean 3rd Qu. Max.
1.0 32.0 314.0 900.9 1355.0 4054.0

It would be nice to have a reproducible example, but something like:
library(bbmle)
library(eha)
set.seed(101)
vec <- rmakeham(1000, shape = c(2,3), scale = 2)
dmwrap <- function(x, shape1, shape2, scale, log) {
res <- try(dmakeham(x, c(shape1, shape2), scale, log = log), silent = TRUE)
if (inherits(res, "try-error")) return(NA)
res
}
m1 <- mle2(y ~ dmwrap(shape1, shape2, scale),
start = list(shape1=1,shape2=1, scale=1),
data = data.frame(y = vec),
method = "Nelder-Mead"
)
Define a wrapper that (1) takes shape parameters as separate values; (2) returns NA rather than throwing an error when e.g. parameters are negative
Use Nelder-Mead rather than default BFGS for robustness
the fitdistrplus package might help too
if you're going to do a lot of this it may help to fit parameters on the log scale (i.e. use parameters logshape1, etc., and use exp(logshape1) etc. in the fitting formula)
I had to work a little harder to fit your data; I scaled the variable by 1000 (and found that I could only compute the log-likelihood; the likelihood gave an error that I didn't bother trying to track down). Unfortunately, it doesn't look like a great fit (too many small values).
x <- scan(text = "872 52 31 26 22 17 11 17 17 8 20 12 25 14 17
20 17 23 32 37 28 24 43 40 34 29 26 32 34 51
50 67 84 70 71 137 123 137 172 189 212 251 248 272 314
374 345 411 494 461 505 506 565 590 535 639 710 733 795 786
894 963 1019 1149 1185 1356 1354 1460 1622 1783 1843 2049 2262 2316 2591
2730 2972 3187 3432 3438 3959 3140 3612 3820 3478 4054 3587 3433 3150 2881
2639 2250 1850 1546 1236 966 729 532 375 256 168 107 65 39 22
12 6 3 2 1 1")
m1 <- mle2(y ~ dmwrap(shape1, shape2, scale),
start = list(shape1=1,shape2=1, scale=10000),
data = data.frame(y = x/1000),
method = "Nelder-Mead"
)
cc <- as.list(coef(m1))
png("gm.png")
hist(x,breaks = 25, freq=FALSE)
with(cc,
curve(exp(dmwrap(x/1000, shape1, shape2, scale, log = TRUE))/1000, add = TRUE)
)
dev.off()

How to reduce processing time of a code in R

Can you help me think of some way to reduce the computational time of a code that generates a certain value, which in this case I call coef, which will depend on id/date/category? Better explanations below.
I made two functions that generate the same result. As you can see in benchmark, the first function (return_values) takes twice as long as the second function (return_valuesX) to generate the same results. See that in the second function, I make some brief changes when calculating the coef variable. However, I strongly believe that there is a possibility of improving the code, as you can see in the second function, I managed to improve 50% of processing time compared to the first just with brief changes. But I'm out of ideas for new adjustments, so I would like your valuable opinion.
Code Explanations:
In general, the purpose of the code is to calculate a value, which I call a coef for each group of id, date and category. For this, the median of the values resulting from the subtraction between DR1 and the values of the DRM0 columns of the df1 database is first calculated. After obtaining the median (med variable), I add the values found with the values of the DRM0 columns of my df1 database. This calculation is my SPV variable. In both cases, I used the data.table function, which I believe is faster than using dplyr. After I get SPV, I need to calculate the coef variable for each id/date/category.
Below I will insert an example real easy to understand of the coef calculation. If for example I want to calculate coef of idd<-"3", dmda<-"2021-12-03", CategoryChosse<-"ABC", and I have the following:
> SPV %>% filter(Id==idd, date2 == ymd(dmda), Category == CategoryChosse)
Id date1 date2 Week Category DRM001_PV DRM002_PV DRM003_PV DRM004_PV DRM005_PV DRM006_PV DRM007_PV DRM008_PV DRM009_PV DRM010_PV DRM011_PV DRM012_PV
1: 3 2021-12-01 2021-12-03 Monday ABC -3 374 198 17 537 -54 330 -136 -116 534 18 -199
DRM013_PV DRM014_PV DRM015_PV DRM016_PV DRM017_PV DRM018_PV DRM019_PV DRM020_PV DRM021_PV DRM022_PV DRM023_PV DRM024_PV DRM025_PV DRM026_PV DRM027_PV DRM028_PV
1: 106 106 349 76 684 390 218 146 141 20 435 218 372 321 218 218
DRM029_PV DRM030_PV DRM031_PV DRM032_PV DRM033_PV DRM034_PV DRM035_PV DRM036_PV DRM037_PV DRM038_PV DRM039_PV DRM040_PV DRM041_PV DRM042_PV DRM043_PV DRM044_PV
1: 55 455 46 411 262 449 325 467 43 -114 191 167 63 -123 252 218
DRM045_PV DRM046_PV DRM047_PV DRM048_PV DRM049_PV DRM050_PV DRM051_PV DRM052_PV DRM053_PV DRM054_PV DRM055_PV DRM056_PV DRM057_PV DRM058_PV DRM059_PV DRM060_PV
1: 305 420 -296 596 200 218 190 203 607 218 442 -72 463 129 -39 333
DRM061_PV DRM062_PV DRM063_PV DRM064_PV DRM065_PV DRM066_PV DRM067_PV DRM068_PV DRM069_PV DRM070_PV DRM071_PV DRM072_PV DRM073_PV DRM074_PV DRM075_PV DRM076_PV
1: -26 160 -91 326 218 369 317 476 224 61 195 613 342 218 204 521
DRM077_PV DRM078_PV DRM079_PV DRM080_PV DRM081_PV DRM082_PV DRM083_PV DRM084_PV DRM085_PV DRM086_PV DRM087_PV DRM088_PV DRM089_PV DRM090_PV DRM091_PV DRM092_PV
1: 588 218 449 340 51 508 -72 42 492 510 328 818 -132 -105 210 -102
DRM093_PV DRM094_PV DRM095_PV DRM096_PV DRM097_PV DRM098_PV DRM099_PV DRM0100_PV DRM0101_PV DRM0102_PV DRM0103_PV DRM0104_PV DRM0105_PV DRM0106_PV DRM0107_PV
1: -137 94 639 265 -64 512 32 -53 414 340 -16 471 434 150 267
DRM0108_PV DRM0109_PV DRM0110_PV DRM0111_PV DRM0112_PV DRM0113_PV DRM0114_PV DRM0115_PV DRM0116_PV DRM0117_PV DRM0118_PV DRM0119_PV DRM0120_PV DRM0121_PV DRM0122_PV
1: 383 -162 434 -134 -39 450 212 146 -26 8 222 341 601 239 57
DRM0123_PV DRM0124_PV DRM0125_PV DRM0126_PV DRM0127_PV DRM0128_PV DRM0129_PV DRM0130_PV DRM0131_PV DRM0132_PV DRM0133_PV DRM0134_PV DRM0135_PV DRM0136_PV DRM0137_PV
1: 484 239 502 415 504 62 487 168 101 319 365 37 218 -50 230
DRM0138_PV DRM0139_PV DRM0140_PV DRM0141_PV DRM0142_PV DRM0143_PV DRM0144_PV DRM0145_PV DRM0146_PV DRM0147_PV DRM0148_PV DRM0149_PV DRM0150_PV DRM0151_PV DRM0152_PV
1: 493 159 150 132 58 21 468 -81 27 345 107 148 -66 -146 -185
DRM0153_PV DRM0154_PV DRM0155_PV DRM0156_PV DRM0157_PV DRM0158_PV DRM0159_PV DRM0160_PV DRM0161_PV DRM0162_PV DRM0163_PV DRM0164_PV DRM0165_PV DRM0166_PV DRM0167_PV
1: -14 562 68 140 353 120 130 301 76 441 218 370 218 378 -22
DRM0168_PV DRM0169_PV DRM0170_PV DRM0171_PV DRM0172_PV DRM0173_PV DRM0174_PV DRM0175_PV DRM0176_PV DRM0177_PV DRM0178_PV DRM0179_PV DRM0180_PV DRM0181_PV DRM0182_PV
1: -279 563 628 600 152 218 445 246 420 94 495 509 356 183 326
DRM0183_PV DRM0184_PV DRM0185_PV DRM0186_PV DRM0187_PV DRM0188_PV DRM0189_PV DRM0190_PV DRM0191_PV DRM0192_PV DRM0193_PV DRM0194_PV DRM0195_PV DRM0196_PV DRM0197_PV
1: 493 -190 -65 -123 376 357 473 112 -69 471 452 221 165 -44 87
DRM0198_PV DRM0199_PV DRM0200_PV DRM0201_PV DRM0202_PV DRM0203_PV DRM0204_PV DRM0205_PV DRM0206_PV DRM0207_PV DRM0208_PV DRM0209_PV DRM0210_PV DRM0211_PV DRM0212_PV
1: 239 285 521 -65 158 223 160 223 269 57 218 218 102 329 218
DRM0213_PV DRM0214_PV DRM0215_PV DRM0216_PV DRM0217_PV DRM0218_PV DRM0219_PV DRM0220_PV DRM0221_PV DRM0222_PV DRM0223_PV DRM0224_PV DRM0225_PV DRM0226_PV DRM0227_PV
1: 769 215 -68 218 347 18 218 547 759 278 -80 -37 629 -16 774
DRM0228_PV DRM0229_PV DRM0230_PV DRM0231_PV DRM0232_PV DRM0233_PV DRM0234_PV DRM0235_PV DRM0236_PV DRM0237_PV DRM0238_PV DRM0239_PV DRM0240_PV DRM0241_PV DRM0242_PV
1: 364 113 -132 31 536 118 248 385 218 202 218 41 23 218 379
DRM0243_PV DRM0244_PV DRM0245_PV DRM0246_PV DRM0247_PV DRM0248_PV DRM0249_PV DRM0250_PV DRM0251_PV DRM0252_PV DRM0253_PV DRM0254_PV DRM0255_PV DRM0256_PV DRM0257_PV
1: -158 462 600 221 218 221 442 218 53 218 176 504 -61 78 68
DRM0258_PV DRM0259_PV DRM0260_PV DRM0261_PV DRM0262_PV DRM0263_PV DRM0264_PV DRM0265_PV DRM0266_PV DRM0267_PV DRM0268_PV DRM0269_PV DRM0270_PV DRM0271_PV DRM0272_PV
1: 493 403 218 339 299 749 -18 465 686 -215 579 307 366 279 94
DRM0273_PV DRM0274_PV DRM0275_PV DRM0276_PV DRM0277_PV DRM0278_PV DRM0279_PV DRM0280_PV DRM0281_PV DRM0282_PV DRM0283_PV DRM0284_PV DRM0285_PV DRM0286_PV DRM0287_PV
1: 138 56 459 613 219 400 35 -74 516 218 -80 317 310 -231 229
DRM0288_PV DRM0289_PV DRM0290_PV DRM0291_PV DRM0292_PV DRM0293_PV DRM0294_PV DRM0295_PV DRM0296_PV DRM0297_PV DRM0298_PV DRM0299_PV DRM0300_PV DRM0301_PV DRM0302_PV
1: 345 -70 619 235 122 61 337 -163 210 586 127 -112 368 365 476
DRM0303_PV DRM0304_PV DRM0305_PV DRM0306_PV DRM0307_PV DRM0308_PV DRM0309_PV DRM0310_PV DRM0311_PV DRM0312_PV DRM0313_PV DRM0314_PV DRM0315_PV DRM0316_PV DRM0317_PV
1: 240 270 497 97 420 -184 212 -28 151 527 186 -32 60 96 -86
DRM0318_PV DRM0319_PV DRM0320_PV DRM0321_PV DRM0322_PV DRM0323_PV DRM0324_PV DRM0325_PV DRM0326_PV DRM0327_PV DRM0328_PV DRM0329_PV DRM0330_PV DRM0331_PV DRM0332_PV
1: 454 321 300 552 319 134 -63 622 441 297 507 578 198 360 542
DRM0333_PV DRM0334_PV DRM0335_PV DRM0336_PV DRM0337_PV DRM0338_PV DRM0339_PV DRM0340_PV DRM0341_PV DRM0342_PV DRM0343_PV DRM0344_PV DRM0345_PV DRM0346_PV DRM0347_PV
1: 153 318 68 763 370 337 633 469 453 146 428 418 169 468 526
DRM0348_PV DRM0349_PV DRM0350_PV DRM0351_PV DRM0352_PV DRM0353_PV DRM0354_PV DRM0355_PV DRM0356_PV DRM0357_PV DRM0358_PV DRM0359_PV DRM0360_PV DRM0361_PV DRM0362_PV
1: 441 674 21 -182 174 153 -158 268 191 460 10 82 543 -193 218
DRM0363_PV DRM0364_PV DRM0365_PV
1: -203 269 479
> SPV %>% filter(Id==idd, date2 == ymd(dmda), Category == CategoryChosse)
Id date1 date2 Week Category DRM001_PV DRM002_PV DRM003_PV DRM004_PV DRM005_PV DRM006_PV DRM007_PV DRM008_PV DRM009_PV DRM010_PV DRM011_PV DRM012_PV
1: 3 2021-12-01 2021-12-03 Monday ABC -3 374 198 17 537 -54 330 -136 -116 534 18 -199
DRM013_PV DRM014_PV DRM015_PV DRM016_PV DRM017_PV DRM018_PV DRM019_PV DRM020_PV DRM021_PV DRM022_PV DRM023_PV DRM024_PV DRM025_PV DRM026_PV DRM027_PV DRM028_PV
1: 106 106 349 76 684 390 218 146 141 20 435 218 372 321 218 218
DRM029_PV DRM030_PV DRM031_PV DRM032_PV DRM033_PV DRM034_PV DRM035_PV DRM036_PV DRM037_PV DRM038_PV DRM039_PV DRM040_PV DRM041_PV DRM042_PV DRM043_PV DRM044_PV
1: 55 455 46 411 262 449 325 467 43 -114 191 167 63 -123 252 218
DRM045_PV DRM046_PV DRM047_PV DRM048_PV DRM049_PV DRM050_PV DRM051_PV DRM052_PV DRM053_PV DRM054_PV DRM055_PV DRM056_PV DRM057_PV DRM058_PV DRM059_PV DRM060_PV
1: 305 420 -296 596 200 218 190 203 607 218 442 -72 463 129 -39 333
DRM061_PV DRM062_PV DRM063_PV DRM064_PV DRM065_PV DRM066_PV DRM067_PV DRM068_PV DRM069_PV DRM070_PV DRM071_PV DRM072_PV DRM073_PV DRM074_PV DRM075_PV DRM076_PV
1: -26 160 -91 326 218 369 317 476 224 61 195 613 342 218 204 521
DRM077_PV DRM078_PV DRM079_PV DRM080_PV DRM081_PV DRM082_PV DRM083_PV DRM084_PV DRM085_PV DRM086_PV DRM087_PV DRM088_PV DRM089_PV DRM090_PV DRM091_PV DRM092_PV
1: 588 218 449 340 51 508 -72 42 492 510 328 818 -132 -105 210 -102
DRM093_PV DRM094_PV DRM095_PV DRM096_PV DRM097_PV DRM098_PV DRM099_PV DRM0100_PV DRM0101_PV DRM0102_PV DRM0103_PV DRM0104_PV DRM0105_PV DRM0106_PV DRM0107_PV
1: -137 94 639 265 -64 512 32 -53 414 340 -16 471 434 150 267
DRM0108_PV DRM0109_PV DRM0110_PV DRM0111_PV DRM0112_PV DRM0113_PV DRM0114_PV DRM0115_PV DRM0116_PV DRM0117_PV DRM0118_PV DRM0119_PV DRM0120_PV DRM0121_PV DRM0122_PV
1: 383 -162 434 -134 -39 450 212 146 -26 8 222 341 601 239 57
DRM0123_PV DRM0124_PV DRM0125_PV DRM0126_PV DRM0127_PV DRM0128_PV DRM0129_PV DRM0130_PV DRM0131_PV DRM0132_PV DRM0133_PV DRM0134_PV DRM0135_PV DRM0136_PV DRM0137_PV
1: 484 239 502 415 504 62 487 168 101 319 365 37 218 -50 230
DRM0138_PV DRM0139_PV DRM0140_PV DRM0141_PV DRM0142_PV DRM0143_PV DRM0144_PV DRM0145_PV DRM0146_PV DRM0147_PV DRM0148_PV DRM0149_PV DRM0150_PV DRM0151_PV DRM0152_PV
1: 493 159 150 132 58 21 468 -81 27 345 107 148 -66 -146 -185
DRM0153_PV DRM0154_PV DRM0155_PV DRM0156_PV DRM0157_PV DRM0158_PV DRM0159_PV DRM0160_PV DRM0161_PV DRM0162_PV DRM0163_PV DRM0164_PV DRM0165_PV DRM0166_PV DRM0167_PV
1: -14 562 68 140 353 120 130 301 76 441 218 370 218 378 -22
DRM0168_PV DRM0169_PV DRM0170_PV DRM0171_PV DRM0172_PV DRM0173_PV DRM0174_PV DRM0175_PV DRM0176_PV DRM0177_PV DRM0178_PV DRM0179_PV DRM0180_PV DRM0181_PV DRM0182_PV
1: -279 563 628 600 152 218 445 246 420 94 495 509 356 183 326
DRM0183_PV DRM0184_PV DRM0185_PV DRM0186_PV DRM0187_PV DRM0188_PV DRM0189_PV DRM0190_PV DRM0191_PV DRM0192_PV DRM0193_PV DRM0194_PV DRM0195_PV DRM0196_PV DRM0197_PV
1: 493 -190 -65 -123 376 357 473 112 -69 471 452 221 165 -44 87
DRM0198_PV DRM0199_PV DRM0200_PV DRM0201_PV DRM0202_PV DRM0203_PV DRM0204_PV DRM0205_PV DRM0206_PV DRM0207_PV DRM0208_PV DRM0209_PV DRM0210_PV DRM0211_PV DRM0212_PV
1: 239 285 521 -65 158 223 160 223 269 57 218 218 102 329 218
DRM0213_PV DRM0214_PV DRM0215_PV DRM0216_PV DRM0217_PV DRM0218_PV DRM0219_PV DRM0220_PV DRM0221_PV DRM0222_PV DRM0223_PV DRM0224_PV DRM0225_PV DRM0226_PV DRM0227_PV
1: 769 215 -68 218 347 18 218 547 759 278 -80 -37 629 -16 774
DRM0228_PV DRM0229_PV DRM0230_PV DRM0231_PV DRM0232_PV DRM0233_PV DRM0234_PV DRM0235_PV DRM0236_PV DRM0237_PV DRM0238_PV DRM0239_PV DRM0240_PV DRM0241_PV DRM0242_PV
1: 364 113 -132 31 536 118 248 385 218 202 218 41 23 218 379
DRM0243_PV DRM0244_PV DRM0245_PV DRM0246_PV DRM0247_PV DRM0248_PV DRM0249_PV DRM0250_PV DRM0251_PV DRM0252_PV DRM0253_PV DRM0254_PV DRM0255_PV DRM0256_PV DRM0257_PV
1: -158 462 600 221 218 221 442 218 53 218 176 504 -61 78 68
DRM0258_PV DRM0259_PV DRM0260_PV DRM0261_PV DRM0262_PV DRM0263_PV DRM0264_PV DRM0265_PV DRM0266_PV DRM0267_PV DRM0268_PV DRM0269_PV DRM0270_PV DRM0271_PV DRM0272_PV
1: 493 403 218 339 299 749 -18 465 686 -215 579 307 366 279 94
DRM0273_PV DRM0274_PV DRM0275_PV DRM0276_PV DRM0277_PV DRM0278_PV DRM0279_PV DRM0280_PV DRM0281_PV DRM0282_PV DRM0283_PV DRM0284_PV DRM0285_PV DRM0286_PV DRM0287_PV
1: 138 56 459 613 219 400 35 -74 516 218 -80 317 310 -231 229
DRM0288_PV DRM0289_PV DRM0290_PV DRM0291_PV DRM0292_PV DRM0293_PV DRM0294_PV DRM0295_PV DRM0296_PV DRM0297_PV DRM0298_PV DRM0299_PV DRM0300_PV DRM0301_PV DRM0302_PV
1: 345 -70 619 235 122 61 337 -163 210 586 127 -112 368 365 476
DRM0303_PV DRM0304_PV DRM0305_PV DRM0306_PV DRM0307_PV DRM0308_PV DRM0309_PV DRM0310_PV DRM0311_PV DRM0312_PV DRM0313_PV DRM0314_PV DRM0315_PV DRM0316_PV DRM0317_PV
1: 240 270 497 97 420 -184 212 -28 151 527 186 -32 60 96 -86
DRM0318_PV DRM0319_PV DRM0320_PV DRM0321_PV DRM0322_PV DRM0323_PV DRM0324_PV DRM0325_PV DRM0326_PV DRM0327_PV DRM0328_PV DRM0329_PV DRM0330_PV DRM0331_PV DRM0332_PV
1: 454 321 300 552 319 134 -63 622 441 297 507 578 198 360 542
DRM0333_PV DRM0334_PV DRM0335_PV DRM0336_PV DRM0337_PV DRM0338_PV DRM0339_PV DRM0340_PV DRM0341_PV DRM0342_PV DRM0343_PV DRM0344_PV DRM0345_PV DRM0346_PV DRM0347_PV
1: 153 318 68 763 370 337 633 469 453 146 428 418 169 468 526
DRM0348_PV DRM0349_PV DRM0350_PV DRM0351_PV DRM0352_PV DRM0353_PV DRM0354_PV DRM0355_PV DRM0356_PV DRM0357_PV DRM0358_PV DRM0359_PV DRM0360_PV DRM0361_PV DRM0362_PV
1: 441 674 21 -182 174 153 -158 268 191 460 10 82 543 -193 218
DRM0363_PV DRM0364_PV DRM0365_PV
1: -203 269 479
So coef will be ymd(dmda) - ymd(min(df1$date1)). That is, if I do to this id/date/category that I mentioned I get a difference of 2 days, so the value I want is the DRM003_PV . So the value for this case is 198. Therefore, I made:
coef<-SPV %>%
filter(Id==idd, date2 == ymd(dmda), Category == CategoryChosse) %>%
pull(as.numeric(ymd(dmda)-ymd(min(df1$date1)))+6)
> coef
[1] 198
This issue has been resolved here: Adjust code to choose a specific column depending on the difference between dates
Libraries and database
library(tidyverse)
library(lubridate)
library(data.table)
library(bench)
set.seed(123)
df1 <- data.frame( Id = rep(1:5, length=800),
date1 = as.Date( "2021-12-01"),
date2= rep(seq( as.Date("2021-01-01"), length.out=400, by=1), each = 2),
Category = rep(c("ABC", "EFG"), length.out = 800),
Week = rep(c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday",
"Saturday", "Sunday"), length.out = 800),
DR1 = sample( 200:250, 800, repl=TRUE),
setNames( replicate(365, { sample(0:800, 800)}, simplify=FALSE),
paste0("DRM0", formatC(1:365, width = 2, format = "d", flag = "0"))))
First function
return_values <- function (df1,idd,dmda, CategoryChosse) {
# First idea: Calculate the median of the values resulting from the subtraction between DR1 and the values of the DRM0 columns
dt1 <- as.data.table(df1)
cols <- grep("^DRM0", colnames(dt1), value = TRUE)
med <-
dt1[, (paste0(cols, "_PV")) := DR1 - .SD, .SDcols = cols
][, lapply(.SD, median), by = .(Id, Category, Week), .SDcols = paste0(cols, "_PV") ]
# Second idea: After obtaining the median, I add the values found with the values of the DRM columns of my df1 database.
f2 <- function(nm, pat) grep(pat, nm, value = TRUE)
nm1 <- f2(names(df1), "^DRM0\\d+$")
nm2 <- f2(names(med), "_PV")
nm3 <- paste0("i.", nm2)
setDT(df1)[med,(nm2) := Map(`+`, mget(nm1), mget(nm3)), on = .(Id, Category, Week)]
SPV <- df1[, c('Id','date1', 'date2', 'Week','Category', nm2), with = FALSE]#%>%data.frame
# Third idea: Calculate the coef values
coef<-SPV %>%
filter(Id==idd, date2 == ymd(dmda), Category == CategoryChosse) %>%
pull(as.numeric(ymd(dmda)-ymd(min(df1$date1)))+6)
return(coef)
}
Results using first function
subset_df1 <- subset(df1, date2 > date1)
a<-subset_df1 %>%
rowwise %>%
select(-c(Week,starts_with('DR')))%>%
mutate(Result=return_values(df1,Id, date2, Category)) %>%
data.frame()
> a
Id date1 date2 Category Result
1 1 2021-12-01 2021-12-02 ABC 4.0
2 2 2021-12-01 2021-12-02 EFG 238.0
3 3 2021-12-01 2021-12-03 ABC 198.0
4 4 2021-12-01 2021-12-03 EFG 163.0
5 5 2021-12-01 2021-12-04 ABC 462.0
...........
Second function
return_valuesX <- function (df1,idd,dmda, CategoryChosse) {
# First idea: Calculate the median of the values resulting from the subtraction between DR1 and the values of the DRM columns
dt1 <- as.data.table(df1)
num_to_pull <- as.numeric(ymd(dmda)-ymd(min(df1$date1)))+6
cols <- grep("^DRM0", colnames(dt1), value = TRUE)[1:num_to_pull]
med <-
dt1[, (paste0(cols, "_PV")) := DR1 - .SD, .SDcols = cols
][, lapply(.SD, median), by = .(Id, Category, Week), .SDcols = paste0(cols, "_PV") ]
# Second idea: After obtaining the median, I add the values found with the values of the DRM columns of my df1 database.
f2 <- function(nm, pat) grep(pat, nm, value = TRUE)
nm1 <- f2(names(df1), "^DRM0\\d+$")[1:num_to_pull]
nm2 <- f2(names(med), "_PV")[1:num_to_pull]
nm3 <- paste0("i.", nm2)[1:num_to_pull]
setDT(df1)[med,(nm2) := Map(`+`, mget(nm1), mget(nm3)), on = .(Id, Category, Week)]
SPV <- df1[, c('Id','date1', 'date2', 'Week','Category', nm2), with = FALSE]#%>%data.frame
# Third idea: Calculate the coef values
coef<-SPV %>%
filter(Id==idd, date2 == ymd(dmda), Category == CategoryChosse) %>%
pull(num_to_pull)
return(coef)
}
Results using second function
b<-subset_df1 %>%
rowwise %>%
select(-c(Week,starts_with('DR')))%>%
mutate(Result = return_valuesX(df1,Id, date2, Category)) %>%
data.frame()
> b
Id date1 date2 Category Result
1 1 2021-12-01 2021-12-02 ABC 4.0
2 2 2021-12-01 2021-12-02 EFG 238.0
3 3 2021-12-01 2021-12-03 ABC 198.0
4 4 2021-12-01 2021-12-03 EFG 163.0
5 5 2021-12-01 2021-12-04 ABC 462.0
...............
Comparing the two results:
identical(a, b)
[1] TRUE
Calculate processing time using benchmark
subset_df1 <- subset(df1, date2 > date1)
bench::mark(a=subset_df1 %>%
rowwise %>%
select(-c(Week,starts_with('DR')))%>%
mutate(Result=return_values(df1,Id, date2, Category)),
b=subset_df1 %>%
rowwise %>%
select(-c(Week,starts_with('DR')))%>%
mutate(Result=return_valuesX(df1,Id, date2, Category)),iterations = 1)
# A tibble: 2 x 13
expression min median `itr/sec` mem_alloc `gc/sec` n_itr n_gc total_time result memory time gc
<bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl> <int> <dbl> <bch:tm> <list> <list> <list> <list>
1 a 53.7s 53.7s 0.0186 4.54GB 0.634 1 34 53.7s <rowwise_df [130 x 5]> <Rprofmem [981,580 x 3]> <bench_tm [1]> <tibble [1 x 3]>
2 b 21s 21s 0.0477 913.77MB 0.382 1 8 21s <rowwise_df [130 x 5]> <Rprofmem [278,340 x 3]> <bench_tm [1]> <tibble [1 x 3]>
To check df1 database

Here's an approach that is about 20x/10x faster than your functions for the example data, and would be faster yet for larger data sets. (When I run with 100k rows in df1, it's 572x faster.) I hope you'll find this approach easier to understand and debug.
This is written using tidyverse functions like tidyr::pivot_longer and dplyr::group_by. If you want to squeeze out a bit more speed, the data.table and collapse packages offer faster alternatives for many functions, especially around grouped calculations. But the main speed improvement here is from restructuring to avoid repeating the same calculations over and over and letting R rely more on vectorized calculations. https://www.noamross.net/archives/2014-04-16-vectorization-in-r-why/
pre_calc <- function(df) {
pre_calc <- df1 %>% # this calculates once on the full data
select(!ends_with("_PV")) %>%
pivot_longer(-c(1:6), values_to = "DRM", names_to = "day") %>%
mutate(day = parse_number(day)) %>%
group_by(Id, Category, Week, day) %>%
mutate(med = median(DR1 - DRM), Result = DRM + med) %>%
ungroup()
df %>% # starts from the subsetted data and joins to results from above
select(1:5) %>%
left_join(pre_calc) %>%
filter(day == date2 - date1 + 1) %>%
select(Id, date1, date2, Category, Result)
}
c <- subset_df1 %>% pre_calc()
c matches a and b from your tests, with the one difference that
date2 (originally date-integer, which is a nonstandard type) has in my approach been
coerced into a typical date-double, like date1. We can use
typeof(df1$date1) & typeof(df1$date2) to see this.
waldo::compare(b, c) confirms the results otherwise match. I
opened an issue with tidyr here since the subtle change seems
to have been caused by the pivot_longer step.
UPDATE: Apparently the creation of a date-integer object is a bug in base R's seq.Date / seq function, which was fixed in R 4.2: https://github.com/tidyverse/tidyr/issues/1356#issuecomment-1111078891
In the approach above, I pre-calculate all the results once, by taking the original data set df1, throwing out the existing _PV columns (I think they get overwritten?), and -- here's where the speed gains come from -- reshaping to long format. While this single operation is computationally expensive, it means we can more efficiently apply the same calculation to all the DRM_* columns at once, and we can rely on fast filtering instead of slow subsetting to extract our result.
The group_by(Id, Category, Week, day) line and the next mutate(... line let us calculate the median differences between DR1 and that day's DRM for each Id-Category-Week combination, so we can calculate all the Results at once.
The last part takes the df lines (e.g. the subsetted data in your example where date2 > date1) and attaches those to the pre-calculated results, filtering to get the right day (previously encoded by column name/position).

fast partial match checking in R (or Python or Julia)

I have two dataset with names and I need to compare names in both datasets. I just need to keep the union of the two datasets based on the names. However, a name is still considered 'matched' if it is part of the another name even if it is not a full match and vice versa. For example, "seb" should match to "seb", but also to "sebas". I am using str_detect(), but it is too slow. I am wondering if there is any way to speed up this process. I tried some other packages and functions, but nothing really improved the speed. I am open for any R or Python solution.
Create two dummy datasets
library(dplyr)
library(stringr)
set.seed(1)
data_set_A <- tibble(name = unique(replicate(2000, paste(sample(letters, runif(1, 3, 10), replace = T), collapse = "")))) %>%
mutate(ID_A = 1:n())
set.seed(2)
data_set_B <- tibble(name_2 = unique(replicate(2000, paste(sample(letters, runif(1, 3, 10), replace = T), collapse = "")))) %>%
mutate(ID_B = 1:n())
Test matching of full matches only
# This is almost instant
data_set_A %>%
rowwise() %>%
filter(any(name %in% data_set_B$name_2) | any(data_set_B$name_2 %in% name)) %>%
ungroup()
# A tibble: 4 x 2
name ID_A
<chr> <int>
1 vnt 112
2 fly 391
3 cug 1125
4 xgv 1280
Include partial matches (This is what I want to optimize)
This of course only gives me the subset of dataset A, but that is ok.
# This takes way too long
data_set_A %>%
rowwise() %>%
filter(any(str_detect(name, data_set_B$name_2)) | any(str_detect(data_set_B$name_2, name))) %>%
ungroup()
A tibble: 237 x 2
name ID_A
<chr> <int>
1 wknrsauuj 2
2 lyw 7
3 igwsvrzpk 16
4 zozxjpu 18
5 cgn 22
6 oqo 45
7 gkritbe 47
8 uuq 92
9 lhwfyksz 94
10 tuw 100
Fuzzyjoin method.
This also works, but is equally slow
bind_rows(
fuzzyjoin::fuzzy_inner_join(
data_set_A,
data_set_B,
by = c("name" = "name_2"),
match_fun = stringr::str_detect
) %>%
select(name, ID_A),
fuzzyjoin::fuzzy_inner_join(
data_set_B,
data_set_A,
by = c("name_2" = "name"),
match_fun = stringr::str_detect
) %>%
select(name, ID_A)
) %>%
distinct()
data.table solution
not much faster unfortunately
library(data.table)
setDT(data_set_A)
setDT(data_set_B)
data_set_A[data_set_A[, .I[any(str_detect(name, data_set_B$name_2)) |
any(str_detect(data_set_B$name_2, name))], by = .(ID_A)]$V1]

This is an [r] option aimed at reducing the number of times you are calling str_detect() (i.e., your example is slow because the function is called several thousand times; and for not using fixed() or fixed = TRUE as jpiversen already pointed out). Answer explained in comments in the code; I will try to jump on tomorrow to explain a bit more.
This should scale reasonably well and be more memory efficient than the current approach too because reduces the rowwise computations to an absolute minimum.
Benchmarks:
n = 2000
# A tibble: 4 × 13
expression min median `itr/sec` mem_alloc `gc/sec` n_itr
<bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl> <int>
1 original() 6.67s 6.67s 0.150 31.95MB 0.300 1
2 using_fixed() 496.54ms 496.54ms 2.01 61.39MB 4.03 1
3 using_map_fixed() 493.35ms 493.35ms 2.03 60.27MB 6.08 1
4 andrew_fun() 167.78ms 167.78ms 5.96 1.59MB 0 1
n = 4000
Note: I am not sure if you need the answer to scale; but the approach of reducing the memory-intensive part does seem to do just that (although the time difference is negligible for n = 4000 for 1 iteration, IMO).
# A tibble: 4 × 13
expression min median `itr/sec` mem_alloc `gc/sec` n_itr
<bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl> <int>
1 original() 26.63s 26.63s 0.0376 122.33MB 0.150 1
2 using_fixed() 1.91s 1.91s 0.525 243.96MB 3.67 1
3 using_map_fixed() 1.87s 1.87s 0.534 236.62MB 3.20 1
4 andrew_fun() 674.36ms 674.36ms 1.48 7.59MB 0 1
Code w/ comments:
# This is so we do not retain the strings with the max number of
# characters in our pattern because we are checking with %in% already
nchar_a = nchar(data_set_A$name)
nchar_b = nchar(data_set_B$name_2)
# Creating large patterns (excluding values w/ max number of characters)
pattern_a = str_c(unique(data_set_A$name[nchar_a != max(nchar_a, na.rm = TRUE)]), collapse = "|")
pattern_b = str_c(unique(data_set_B$name_2[nchar_b != max(nchar_b, na.rm = TRUE)]), collapse = "|")
# First checking using %in%
idx_a = data_set_A$name %in% data_set_B$name_2
# Next, IDing when a(string) matches b(pattern)
idx_a[!idx_a] = str_detect(data_set_A$name[!idx_a], pattern_b)
# IDing a(pattern) matches b(string) so we do not run every row of
# a(as a pattern) against all of b
b_to_check = data_set_B$name_2[str_detect(data_set_B$name_2, pattern_a)]
# Using unmatched values of a as a pattern for the reduced set for b
idx_a[!idx_a] = vapply(data_set_A$name[!idx_a], function(name) {
any(grepl(name, b_to_check, fixed = TRUE))
}, logical(1L), USE.NAMES = FALSE)
data_set_A[idx_a, ]
# A tibble: 237 × 2
name ID_A
<chr> <int>
1 wknrsauuj 2
2 lyw 7
3 igwsvrzpk 16
4 zozxjpu 18
5 cgn 22
6 oqo 45
7 gkritbe 47
8 uuq 92
9 lhwfyksz 94
10 tuw 100
# … with 227 more rows
Reproducible R code for benchmarks
The following code is largely taken from jpiversen who provided a great answer:
library(dplyr)
library(stringr)
n = 2000
set.seed(1)
data_set_A <- tibble(name = unique(replicate(n, paste(sample(letters, runif(1, 3, 10), replace = T), collapse = "")))) %>%
mutate(ID_A = 1:n())
set.seed(2)
data_set_B <- tibble(name_2 = unique(replicate(n, paste(sample(letters, runif(1, 3, 10), replace = T), collapse = "")))) %>%
mutate(ID_B = 1:n())
original <- function() {
data_set_A %>%
rowwise() %>%
filter(any(str_detect(name, data_set_B$name_2)) | any(str_detect(data_set_B$name_2, name))) %>%
ungroup()
}
using_fixed <- function() {
data_set_A %>%
rowwise() %>%
filter(any(str_detect(name, fixed(data_set_B$name_2))) | any(str_detect(data_set_B$name_2, fixed(name)))) %>%
ungroup()
}
using_map_fixed <- function() {
logical_vec <- data_set_A$name %>%
purrr::map_lgl(
~any(stringr::str_detect(.x, fixed(data_set_B$name_2))) ||
any(stringr::str_detect(data_set_B$name_2, fixed(.x)))
)
data_set_A[logical_vec, ]
}
andrew_fun = function() {
nchar_a = nchar(data_set_A$name)
nchar_b = nchar(data_set_B$name_2)
pattern_a = str_c(unique(data_set_A$name[nchar_a != max(nchar_a, na.rm = TRUE)]), collapse = "|")
pattern_b = str_c(unique(data_set_B$name_2[nchar_b != max(nchar_b, na.rm = TRUE)]), collapse = "|")
idx_a = data_set_A$name %in% data_set_B$name_2
idx_a[!idx_a] = str_detect(data_set_A$name[!idx_a], pattern_b)
b_to_check = data_set_B$name_2[str_detect(data_set_B$name_2, pattern_a)]
idx_a[!idx_a] = vapply(data_set_A$name[!idx_a], function(name) {
any(grepl(name, b_to_check, fixed = TRUE))
}, logical(1L), USE.NAMES = FALSE)
data_set_A[idx_a, ]
}
bm = bench::mark(
original(),
using_fixed(),
using_map_fixed(),
andrew_fun(),
iterations = 1
)

TL;DR
The slow part is str_detect(string, pattern).
To speed it up, wrap pattern in fixed() if you got simple strings, and in coll() if you got longer, typical human text.
To get another slight speed boost, rewrite your code using purrr::map_lgl() and use this to subset your data.
Under follows examples, explanations and benchmarks.
Rewriting str_detect() using fixed() or coll()
I believe the easiest fix is to modify how str_detect() uses regex with e.g. stringr::fixed() or stringr::coll().
From ?stringr::str_detect():
Match a fixed string (i.e. by comparing only bytes), using fixed(). This is fast, but approximate. Generally, for matching human text, you'll want coll() which respects character matching rules for the specified locale.
Under is a comparison with your original code:
original <- function() {
data_set_A %>%
rowwise() %>%
filter(any(str_detect(name, data_set_B$name_2)) | any(str_detect(data_set_B$name_2, name))) %>%
ungroup()
}
# Note the use of fixed()
using_fixed <- function() {
data_set_A %>%
rowwise() %>%
filter(any(str_detect(name, fixed(data_set_B$name_2))) | any(str_detect(data_set_B$name_2, fixed(name)))) %>%
ungroup()
}
# Note the use of coll()
using_coll <- function() {
data_set_A %>%
rowwise() %>%
filter(any(str_detect(name, coll(data_set_B$name_2))) | any(str_detect(data_set_B$name_2, coll(name)))) %>%
ungroup()
}
bm <- bench::mark(
original(),
using_fixed(),
using_coll(),
iterations = 20
)
#> Warning: Some expressions had a GC in every iteration; so filtering is disabled.
bm
#> # A tibble: 3 × 6
#> expression min median `itr/sec` mem_alloc `gc/sec`
#> <bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl>
#> 1 original() 6.58s 6.59s 0.152 32.4MB 0.371
#> 2 using_fixed() 501.64ms 505.51ms 1.97 61.4MB 3.94
#> 3 using_coll() 4.48s 4.5s 0.222 61.4MB 0.512
bm %>% ggplot2::autoplot(type = "violin")
#> Loading required namespace: tidyr
Created on 2022-04-02 by the reprex package (v2.0.1)
So, as we can see, wrapping your code in fixed() will make it very fast and works well on your test data. However, it might not work as well for real human text (especially non-ASCII character sets). You should test it on your original data, and use coll() as an alternative if fixed() doesn't work.
Removing rowwise()
Another step you can take to make your code a bit faster is to get rid of rowwise(). I would replace it using purrr::map_lgl() and use this logical vector to subset the dataframe. Under is an example and a benchmark against my functions defined above:
using_map_fixed <- function() {
logical_vec <- data_set_A$name %>%
purrr::map_lgl(
~any(stringr::str_detect(.x, fixed(data_set_B$name_2))) ||
any(stringr::str_detect(data_set_B$name_2, fixed(.x)))
)
data_set_A[logical_vec, ]
}
using_map_coll <- function() {
logical_vec <- data_set_A$name %>%
purrr::map_lgl(
~any(stringr::str_detect(.x, coll(data_set_B$name_2))) ||
any(stringr::str_detect(data_set_B$name_2, coll(.x)))
)
data_set_A[logical_vec, ]
}
bm <- bench::mark(
using_fixed(),
using_map_fixed(),
using_coll(),
using_map_coll(),
iterations = 20
)
#> Warning: Some expressions had a GC in every iteration; so filtering is disabled.
bm
#> # A tibble: 4 × 6
#> expression min median `itr/sec` mem_alloc `gc/sec`
#> <bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl>
#> 1 using_fixed() 503.4ms 507.24ms 1.95 62.9MB 5.37
#> 2 using_map_fixed() 474.28ms 477.63ms 2.09 60.3MB 3.14
#> 3 using_coll() 4.49s 4.5s 0.222 61.4MB 0.489
#> 4 using_map_coll() 4.37s 4.38s 0.228 60.2MB 0.354
Created on 2022-04-02 by the reprex package (v2.0.1)
As we see, this gives another slight speed boost.
Using fixed() with data.table or fuzzyjoin
You can also use fixed() with data.table and fuzzyjoin. I have not included it here for brevity, but my benchmark shows that data.table takes about the same amount of time as my using_map_fixed() above, and fuzzyjoin takes about twice as much time.
This makes sense to me, as the slow part is str_detect(), not the method of joining/filtering, or the underlying data structure.

If you would like to use base R, the code below might be one fast option
A <- data_set_A$name
B <- data_set_B$name_2
A2B <- sapply(A, function(x) grepl(x, B, fixed = TRUE))
B2A <- sapply(B, function(x) grepl(x, A, fixed = TRUE))
idx <- which(t(A2B) | B2A, arr.ind = TRUE)
res <- cbind(data_set_A[idx[, 1], ], data_set_B[idx[, 2], ])
which gives
> res
name ID_A name_2 ID_B
1 arh 1234 pimoarhd 8
2 qtj 720 aqtj 23
3 szcympsn 142 cym 43
4 cymvubnxg 245 cym 43
5 dppvtcymq 355 cym 43
6 kzi 690 kzii 48
7 eyajqchkn 498 chk 53
8 upfzh 522 upf 61
9 ioa 1852 ioadr 63
10 lya 1349 ibelyalvh 64
11 honod 504 ono 71
12 zozxjpu 18 zoz 72
13 jcz 914 cdjczpqg 88
14 ailmjf 623 ilm 99
15 upoux 609 oux 104
16 pouxifvp 1466 oux 104
17 mvob 516 vob 106
18 nqtotvhhm 1088 otv 115
19 wom 202 womtglapx 117
20 qkc 756 dqkcfqpps 118
21 qtl 600 ivqtlymzr 126
22 qqi 1605 owfsqqiyu 153
23 fmjalirze 1470 ali 172
24 ibwfwkyp 1588 fwk 175
25 iat 1258 iatjeg 185
26 osm 253 nviiqosm 199
27 wpj 373 wpjeb 204
28 hahx 515 ahx 213
29 keahxa 1565 ahx 213
30 psf 359 qnpsfo 223
31 saq 1859 saqhu 227
32 cvmkwtx 714 cvm 228
33 ilw 389 pyilwj 231
34 ohwysv 1590 ysv 237
35 utrl 698 trl 244
36 dmttrlcpj 1267 trl 244
37 cpv 236 btcpvmoc 247
38 uto 1047 utoi 257
39 yngunekl 1978 ekl 258
40 vceko 625 vce 265
41 fir 1934 firgk 278
42 qvd 983 eqvdfi 287
43 fir 1934 zwwefir 291
44 idvfkevdf 1380 vdf 312
45 qwdo 1921 qwd 322
46 kam 1205 tlkam 327
47 lck 488 clckjkyzn 329
48 gmspwckw 1015 msp 359
49 ynouuwqtz 1576 nou 360
50 tty 1209 bttyvt 361
51 vkc 999 fmrvkcl 366
52 ipw 1918 fipwjomdu 388
53 zdv 261 zdvkut 410
54 vku 1137 zdvkut 410
55 doby 246 oby 411
56 hycvuupgy 141 uup 421
57 uwlb 1249 wlb 431
58 auj 1452 lcmnauj 444
59 rwd 1667 ukwrwdczs 479
60 ylsihqqor 1290 ihq 483
61 feo 1649 feorvxbm 485
62 zff 755 dohzffujm 499
63 mqutujepu 904 epu 507
64 uiepu 1308 epu 507
65 vahepuk 1434 epu 507
66 cug 1125 accugl 509
67 fir 1934 firwe 517
68 dia 1599 dialeddd 527
69 temiwd 1725 tem 531
70 svofivl 1177 svo 545
71 flm 657 aflm 546
72 vnt 112 vnt 551
73 bhmoskrz 426 osk 558
74 wev 728 shemuwev 569
75 hzpi 1586 hzp 579
76 gvi 1064 mkgvivlfe 582
77 fjb 1398 vkfjbxnjl 589
78 qin 1013 qinp 593
79 ecn 1342 ecnzre 598
80 zre 1610 ecnzre 598
81 xvr 772 dpxvrfmo 623
82 tqr 1419 tqrmztdm 624
83 zmwnf 1571 mwn 626
84 ypil 1787 pil 630
85 mnxlqgfh 1132 nxl 643
86 gse 1563 gseice 646
87 ygk 1309 ygkqrk 655
88 fgm 933 vzfgmy 663
89 rlupd 977 upd 666
90 mcupdkuiy 1307 upd 666
91 fly 391 fly 669
92 vbkko 1603 kko 678
93 uvrew 465 rew 680
94 hgbhngwvd 901 wvd 690
95 wvdjprmo 1432 wvd 690
96 cgn 22 cgnd 698
97 dngnjv 967 njv 700
98 psqs 841 sqs 720
99 ywv 1180 ptywvlgc 730
100 ypil 1787 ypi 734
101 rwd 1667 srserwd 737
102 jqydasl 1294 jqy 742
103 ckujmc 717 ujm 751
104 dfzxta 662 xta 775
105 bjb 1562 jabjbei 779
106 adwknpll 1242 npl 780
107 kdv 1327 xhkdvqo 789
108 ghj 174 oghj 801
109 lhwfyksz 94 lhw 811
110 nwrrnlhw 929 lhw 811
111 xlhwm 1720 lhw 811
112 ncc 1602 wurhxnccn 814
113 jdslrf 1094 dsl 835
114 ktmw 1738 tmw 844
115 igwsvrzpk 16 gws 856
116 kug 591 pkugls 857
117 befgcpedr 339 fgc 862
118 ojf 1397 ojfpnkla 863
119 gyl 1203 gylxeqzw 872
120 ugcbb 1727 ugc 876
121 arh 1234 karhwhg 878
122 amm 458 ammqdc 883
123 azazryje 636 zaz 900
124 wczazw 1887 zaz 900
125 gkritbe 47 ritb 915
126 vku 1137 yjvkuxued 929
127 rnh 1633 kvyrnhugu 937
128 mzh 1135 xllmwmzhn 940
129 cug 1125 cug 960
130 xgv 1280 xgv 962
131 xusxgv 1436 xgv 962
132 umc 351 lwumcmvoo 980
133 zlb 1900 nkyazlb 991
134 llfkalao 1049 llf 1002
135 sflpbht 991 lpb 1048
136 rairmmcl 442 mmc 1087
137 mmckoln 780 mmc 1087
138 gfxmmcgb 1814 mmc 1087
139 aoj 402 taojlgp 1089
140 mypvzhp 121 ypv 1095
141 moctwaypv 611 ypv 1095
142 rngedn 306 ged 1106
143 djshecy 1408 ecy 1108
144 rairmmcl 442 rmm 1117
145 gzua 1594 zua 1124
146 ytj 416 yytj 1140
147 ubt 300 hubtcfr 1141
148 gqg 1854 ogqgsjqc 1144
149 tfg 1204 xiutfgru 1145
150 avrq 741 avr 1147
151 ytkpvss 440 tkp 1149
152 kug 591 yxsjkug 1176
153 vix 1846 vixsmn 1187
154 qtl 600 qtljkxz 1188
155 lgr 494 dlgrco 1189
156 ryg 864 xlmtryg 1203
157 yskvkxwj 1547 kvk 1205
158 kxhee 1795 xhe 1222
159 hzbcjs 1493 cjs 1224
160 kbi 270 itxlwkbi 1225
161 gdymcam 806 ymca 1232
162 tqr 1419 rxtqrdtl 1236
163 yyz 215 yyzw 1242
164 jyx 1735 mljjyxu 1248
165 aai 1928 umkpaaiwo 1254
166 dsd 1122 dndsdova 1257
167 tor 744 etor 1270
168 vhcyznp 1296 yzn 1278
169 xlc 1947 odxlcjwj 1280
170 mlm 1629 aomlmgtq 1303
171 owm 239 owmugb 1304
172 ynezwaml 507 nez 1308
173 jls 695 jlsve 1325
174 dvm 879 dvmv 1339
175 vsgx 944 dqpihvsgx 1352
176 wfo 768 wfokpjois 1354
177 tltbkinat 1986 nat 1362
178 gyl 1203 gylqte 1363
179 ngg 735 bsnggqbjd 1366
180 fkq 345 jdfkqf 1368
181 ojf 1397 ojfpgfga 1382
182 dqgd 1623 prqbndqgd 1398
183 siu 827 siuypucup 1412
184 yinsoivfd 1895 yin 1414
185 esm 1834 sesmeepz 1417
186 umc 351 umcj 1432
187 wny 866 wnyxamguw 1443
188 ujbhtvnin 399 vni 1444
189 dbq 630 bdbqq 1452
190 ebn 1405 ebngddw 1461
191 zcj 704 rbtjzcjod 1465
192 avn 500 avnspxv 1468
193 vkk 567 hvkk 1477
194 hmm 1441 bgjhmmthz 1483
195 aguakz 614 guak 1487
196 hycvuupgy 141 pgy 1493
197 tizpgymz 280 pgy 1493
198 guk 571 cncxdguk 1502
199 zyw 281 nzywuqs 1504
200 jnz 1558 rxdxsjnzw 1510
201 uuq 92 nxuuqtj 1514
202 qtj 720 nxuuqtj 1514
203 vkk 567 xpbpvkkdc 1518
204 iaa 460 sjiaa 1525
205 txsgmynng 1019 xsg 1526
206 yjvtwc 1107 jvt 1529
207 lnk 1113 hylnknwy 1546
208 szd 635 woszdm 1557
209 osm 253 sosmdp 1567
210 nbd 1067 nbdmmg 1570
211 mmg 1305 nbdmmg 1570
212 wqdsatbd 1536 sat 1585
213 sdlypo 1527 sdl 1596
214 inkynog 288 inky 1600
215 hpwoeclfy 1321 clf 1601
216 wodyqwqf 679 dyq 1603
217 lyw 7 xnalywyuw 1607
218 njm 1825 vjlnjmns 1617
219 njytqhaut 428 qha 1620
220 ilw 389 rilwbk 1647
221 oqo 45 ixoqowkpg 1650
222 odcbcvaun 1386 bcv 1652
223 mastn 434 stn 1662
224 xebhdssit 1091 xeb 1663
225 nmy 782 nmyxj 1671
226 fsqvgdw 673 gdw 1676
227 mwwczhs 482 wcz 1679
228 wczazw 1887 wcz 1679
229 anmryzm 915 ryz 1698
230 rteh 523 rte 1708
231 mlwrguae 817 lwr 1709
232 mbu 819 xpsuqmbuf 1729
233 mmckoln 780 cko 1733
234 lxpg 798 lxp 1734
235 ane 370 vxnanehvk 1746
236 tty 1209 vbttyozui 1752
237 igncdgyjx 332 ign 1753
238 ndignk 621 ign 1753
239 nmy 782 ivnmyba 1780
240 wknrsauuj 2 rsa 1799
241 tgd 165 qtgdidlf 1803
242 iaa 460 yziaazxto 1833
243 xto 1245 yziaazxto 1833
244 zff 755 dpzfft 1857
245 jyx 1735 jwjyxphe 1873
246 ytj 416 eytj 1881
247 lcggwonk 1596 onk 1882
248 zdv 261 zdvxfz 1889
249 xhskcb 417 kcb 1890
250 mrikqkcb 770 kcb 1890
251 psvxqnsap 1352 psv 1898
252 udjswzb 411 jsw 1900
253 rpfjswy 1840 jsw 1900
254 bjaywiso 1677 ayw 1902
255 zfli 130 fli 1906
256 vazx 1215 itvazxw 1918
257 tuw 100 tuwywtbwd 1921
258 vle 1437 ebvleaovm 1937
259 znycsygd 1757 nyc 1944
260 ynezwaml 507 ezw 1952
261 tseezwf 1276 ezw 1952
262 ezwzyfudo 1690 ezw 1952
263 oudiky 1503 dik 1964
264 dikjn 1615 dik 1964
265 oms 106 wpomsudi 1977
266 hhp 1864 hhpkm 1983
Benchmarking
It seems this base R option is slightly slower than #Andrew's approach.
TIC <- function() {
A <- data_set_A$name
B <- data_set_B$name_2
A2B <- sapply(A, function(x) grepl(x, B, fixed = TRUE))
B2A <- sapply(B, function(x) grepl(x, A, fixed = TRUE))
idx <- which(t(A2B) | B2A, arr.ind = TRUE)
cbind(data_set_A[idx[, 1], ], data_set_B[idx[, 2], ])
# data_set_A[unique(idx[, 1]), ]
}
jpiversen_fixed <- function() {
data_set_A %>%
rowwise() %>%
filter(any(str_detect(name, fixed(data_set_B$name_2))) | any(str_detect(data_set_B$name_2, fixed(name)))) %>%
ungroup()
}
andrew <- function() {
nchar_a <- nchar(data_set_A$name)
nchar_b <- nchar(data_set_B$name_2)
pattern_a <- str_c(unique(data_set_A$name[nchar_a != max(nchar_a, na.rm = TRUE)]), collapse = "|")
pattern_b <- str_c(unique(data_set_B$name_2[nchar_b != max(nchar_b, na.rm = TRUE)]), collapse = "|")
idx_a <- data_set_A$name %in% data_set_B$name_2
idx_a[!idx_a] <- str_detect(data_set_A$name[!idx_a], pattern_b)
b_to_check <- data_set_B$name_2[str_detect(data_set_B$name_2, pattern_a)]
idx_a[!idx_a] <- vapply(data_set_A$name[!idx_a], function(name) {
any(grepl(name, b_to_check, fixed = TRUE))
}, logical(1L), USE.NAMES = FALSE)
data_set_A[idx_a, ]
}
bm <- microbenchmark(
TIC(),
jpiversen_fixed(),
andrew(),
times = 20
)
shows that
> bm
Unit: milliseconds
expr min lq mean median uq max
TIC() 423.8410 441.3574 492.6091 478.2596 549.2376 611.3841
jpiversen_fixed() 1354.8954 1373.9502 1447.8649 1395.6766 1459.7058 1842.2574
andrew() 329.4821 335.3388 345.8890 341.4758 354.1298 381.6872
neval
20
20
20

Using the surface plot function in plotly

I'm looking to create some 3d surface plots with some data in R. I've got a large dataframe of many variables, but I'm only interested in displaying 3 (the distance, the year, and the mortality rate). I've been trying to use the surface plot library for a day or two now, and I can't get anything to display. I can get the other libraries to work. Here's the code I'm working with at the moment -
cold3Dplot <- plot_ly(showscale = FALSE) %>%
add_surface(z = ~as.matrix(filtered$distance,2014,filtered$RelativeDeaths.2014),opacity = 0.98) %>%
add_surface(z = ~as.matrix(filtered$distance,2015,filtered$RelativeDeaths.2015),opacity = 0.98) %>%
add_surface(z = ~as.matrix(filtered$distance,2016,filtered$RelativeDeaths.2016),opacity = 0.98)%>%
add_surface(z = ~as.matrix(filtered$distance,2017,filtered$RelativeDeaths.2017),opacity = 0.98)
cold3Dplot
This is what my filtered data frame looks like:
datazone label code shape_area shape_leng stdareakm2 stdareaha hhcnt2011 respop2011 totpop2011
561 S01010318 S12000046S01010318 S01010318 128009.62 2950.044 0.128009 12.800961 304 505 1588
562 S01010319 S12000046S01010319 S01010319 96160.88 2282.814 0.096161 9.616090 407 711 711
599 S01010356 S12000046S01010356 S01010356 286149.89 4874.812 0.286150 28.614989 308 682 682
606 S01010363 S12000046S01010363 S01010363 157714.24 3083.966 0.157714 15.771425 394 758 758
563 S01010320 S12000046S01010320 S01010320 73061.83 1544.579 0.073063 7.306183 437 670 675
709 S01010466 S12000046S01010466 S01010466 529203.94 4393.694 0.529203 52.920395 458 879 879
name Feature.Name RelativeDeaths.2014 RelativeDeaths.2015 RelativeDeaths.2016 RelativeDeaths.2017
561 Ruchill - 04 NA -5 -5 -3 -4
562 Ruchill - 05 NA -6 -5 -6 -6
599 Maryhill West - 03 NA -5 -6 -2 -9
606 Kelvindale - 01 NA -2 -2 -5 -5
563 Ruchill - 06 NA -4 -6 -3 -6
709 Anniesland East - 02 NA -5 -7 -3 0
AverageRelativeDeaths distance lon lat ID OBJECTID Name TotPop2011 ResPop2011 HHCnt2011
561 -4 30.44933 -4.278552 55.88277 24 3813 Ruchill - 04 1588 505 304
562 -6 32.69077 -4.280502 55.88595 23 3814 Ruchill - 05 711 711 407
599 -6 50.42398 -4.287960 55.89284 32 3851 Maryhill West - 03 682 682 308
606 -4 105.39898 -4.309206 55.89306 10 3858 Kelvindale - 01 758 758 394
563 -5 149.88008 -4.282008 55.88644 23 3815 Ruchill - 06 675 670 437
709 -4 154.92867 -4.319477 55.89498 72 3961 Anniesland East - 02 879 879 458
Easting Northing coords.x1 coords.x2 optional
561 257542 667848 -4.278992 55.88265 TRUE
562 257447 668188 -4.280688 55.88567 TRUE
599 256994 669045 -4.288377 55.89323 TRUE
606 255670 668966 -4.309488 55.89213 TRUE
563 257256 668176 -4.283733 55.88551 TRUE
709 254973 669173 -4.320734 55.89378 TRUE
But like I said, I'm only interested in using the Relative Deaths, distance and year from it.
What am I doing wrong exactly? I'd like to get something to show at least and then I can see if I'm going in the right direction or not.
Cheers!

Develop Reference

r css asp.net wordpress firebase qt symfony nginx http apache-flex

generalizing net/gross in a bar chart - r

Adding position = "identity" to geom_bar should get rid of the warning you are getting in your first plot. The reason for this warning is related to interpreting that bars have negative height instead of just negative values.

Related

How to get year columnns to properly show up in a bar chart

Gompertz-Makeham parameter estimation

How to reduce processing time of a code in R

fast partial match checking in R (or Python or Julia)

Using the surface plot function in plotly

Categories

Resources