I'm working on creating an automated process to pull tables from a yearly PDF report. Ideally, I'd be able to take each year's report, pull the data from the table within it, combine all years into a large data frame, and then analyze it. Here is what I have so far (just focusing on one year of the report):
library(pdftools)
library(data.table)
library(dplyr)
download.file("https://higherlogicdownload.s3.amazonaws.com/NASBO/9d2d2db1-c943-4f1b-b750-0fca152d64c2/UploadedImages/SER%20Archive/State%20Expenditure%20Report%20(Fiscal%202014-2016)%20-%20S.pdf", "nasbo14_16.pdf", mode = "wb")
txt14_16 <- pdf_text("nasbo14_16.pdf")
## convert txt14_16 to data frame for analyzing
data <- toString(txt14_16[56])
data <- read.table(text = data, sep = "\n", as.is = TRUE)
data <- data[-c(1, 2, 3, 4, 5, 6, 7, 14, 20, 26, 34, 47, 52, 58, 65, 66, 67), ]
data <- gsub("[,]", "", data)
data <- gsub("[$]", "", data)
data <- gsub("\\s+", ",", gsub("^\\s+|\\s+$", "",data))
My problem is converting these raw table data into a dataframe that has each state for every row and their respective values for every column. I'm sure the solution is simple, but I'm just a little bit new to R! Any help?
EDIT: All of these solutions have been terrific and have worked perfectly. However, when I try a report from another year, I get some errors:
: ' 0' does not exist in current working directory ('C:/Users/joshua_hanson/Documents').
After trying this code for the next report:
## convert txt09_11 to data frame for analyzing
download.file("https://higherlogicdownload.s3.amazonaws.com/NASBO/9d2d2db1-c943-4f1b-b750-0fca152d64c2/UploadedImages/SER%20Archive/2010%20State%20Expenditure%20Report.pdf", "nasbo09_11.pdf", mode = "wb")
txt09_11 <- pdf_text("nasbo09_11.pdf")
df <- txt09_11[54] %>%
read_lines() %>% # separate lines
grep('^\\s{2}\\w', ., value = TRUE) %>% # select lines with states, which start with space, space, letter
paste(collapse = '\n') %>% # recombine
read_fwf(fwf_empty(.)) %>% # read as fixed-width file
mutate_at(-1, parse_number) %>% # make numbers numbers
mutate(X1 = sub('*', '', X1, fixed = TRUE)) # get rid of asterisks in state names
readr::read_fwf has a fwf_empty utility that will guess column widths for you, which makes the job a lot simpler:
library(tidyverse)
df <- txt14_16[56] %>%
read_lines() %>% # separate lines
grep('^\\s{2}\\w', ., value = TRUE) %>% # select lines with states, which start with space, space, letter
paste(collapse = '\n') %>% # recombine
read_fwf(fwf_empty(.)) %>% # read as fixed-width file
mutate_at(-1, parse_number) %>% # make numbers numbers
mutate(X1 = sub('*', '', X1, fixed = TRUE)) # get rid of asterisks in state names
df
#> # A tibble: 50 × 13
#> X1 X2 X3 X4 X5 X6 X7 X8 X9 X10
#> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 Connecticut 3779 2992 0 6771 3496 3483 0 6979 3612
#> 2 Maine 746 1767 267 2780 753 1510 270 2533 776
#> 3 Massachusetts 6359 5542 143 12044 6953 6771 174 13898 7411
#> 4 New Hampshire 491 660 175 1326 515 936 166 1617 523
#> 5 Rhode Island 998 1190 31 2219 998 1435 24 2457 953
#> 6 Vermont 282 797 332 1411 302 923 326 1551 337
#> 7 Delaware 662 1001 0 1663 668 1193 14 1875 689
#> 8 Maryland 2893 4807 860 8560 2896 5686 1061 9643 2812
#> 9 New Jersey 3961 6920 1043 11924 3831 8899 1053 13783 3955
#> 10 New York 10981 24237 4754 39972 11161 29393 5114 45668 11552
#> # ... with 40 more rows, and 3 more variables: X11 <dbl>, X12 <dbl>,
#> # X13 <dbl>
Obviously column names still need to be added, but the data is fairly usable at this point.
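If it helps, a minimal sketch of attaching names (the labels below are placeholders of my own, not the report's actual headers, which would have to be read off the PDF):
# placeholder column names; swap in the report's real headers
names(df) <- c("state", paste0("value_", seq_len(ncol(df) - 1)))
head(df)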
Your gsubs are a bit overly aggressive. You are doing fine through your data[-c(1,...)], so I'll pick up from there, replacing all of your calls to gsub:
# sloppy fixed-width parsing
dat2 <- read.fwf(textConnection(data), c(35,15,20,20,12,10,15,10,10,10,10,15,99))
# clean up extra whitespace
dat3 <- as.data.frame(lapply(dat2, trimws), stringsAsFactors = FALSE)
head(dat3)
# V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12 V13
# 1 Connecticut* $3,779 $2,992 $0 $6,771 $3,496 $3,483 $0 $6,979 $3,612 $3,604 $0 $7,216
# 2 Maine* 746 1,767 267 2,780 753 1,510 270 2,533 776 1,605 274 2,655
# 3 Massachusetts 6,359 5,542 143 12,044 6,953 6,771 174 13,898 7,411 7,463 292 15,166
# 4 New Hampshire 491 660 175 1,326 515 936 166 1,617 523 1,197 238 1,958
# 5 Rhode Island 998 1,190 31 2,219 998 1,435 24 2,457 953 1,527 22 2,502
# 6 Vermont* 282 797 332 1,411 302 923 326 1,551 337 948 338 1,623
Caution: the widths I used (35,15,20,...) were hastily-derived, and though I think they work, admittedly I did not go row-by-row to verify that I did not chop something. Please verify!
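One way to verify (a sketch, using the same widths as above and the `data` character vector that was fed to read.fwf): a value is only chopped if non-space characters sit on both sides of a column boundary, so this flags any offending line:
# cumulative boundary positions implied by the widths above
cuts <- cumsum(c(35, 15, 20, 20, 12, 10, 15, 10, 10, 10, 10, 15))
chopped <- sapply(data, function(line) {
  left  <- substring(line, cuts, cuts)          # last character of each field
  right <- substring(line, cuts + 1, cuts + 1)  # first character of the next field
  any(grepl("\\S", left) & grepl("\\S", right))
})
any(chopped)  # TRUE would mean some value was split across columns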
Lastly, from here you likely want to remove the $ and , characters and integerize; that's fairly straightforward:
dat3[-1] <- lapply(dat3[-1], function(a) as.integer(gsub("[^0-9]", "", a)))
head(dat3)
# V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12 V13
# 1 Connecticut* 3779 2992 0 6771 3496 3483 0 6979 3612 3604 0 7216
# 2 Maine* 746 1767 267 2780 753 1510 270 2533 776 1605 274 2655
# 3 Massachusetts 6359 5542 143 12044 6953 6771 174 13898 7411 7463 292 15166
# 4 New Hampshire 491 660 175 1326 515 936 166 1617 523 1197 238 1958
# 5 Rhode Island 998 1190 31 2219 998 1435 24 2457 953 1527 22 2502
# 6 Vermont* 282 797 332 1411 302 923 326 1551 337 948 338 1623
I'm guessing the asterisks in the state names are meaningful. This can be captured easily with grepl and then removed:
dat3$ast <- grepl("\\*", dat3$V1)
dat3[[1]] <- gsub("\\*", "", dat3[[1]])
head(dat3)
# V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12 V13 ast
# 1 Connecticut 3779 2992 0 6771 3496 3483 0 6979 3612 3604 0 7216 TRUE
# 2 Maine 746 1767 267 2780 753 1510 270 2533 776 1605 274 2655 TRUE
# 3 Massachusetts 6359 5542 143 12044 6953 6771 174 13898 7411 7463 292 15166 FALSE
# 4 New Hampshire 491 660 175 1326 515 936 166 1617 523 1197 238 1958 FALSE
# 5 Rhode Island 998 1190 31 2219 998 1435 24 2457 953 1527 22 2502 FALSE
# 6 Vermont 282 797 332 1411 302 923 326 1551 337 948 338 1623 TRUE
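For the question's end goal of stacking several years into one data frame, a rough sketch, assuming a hypothetical parse_report(path, page) wrapper around one of the page-parsing pipelines above (the page numbers 56 and 54 are the ones used for the two reports in this thread):
reports <- c(`2014-2016` = "nasbo14_16.pdf", `2009-2011` = "nasbo09_11.pdf")
pages   <- c(56, 54)

all_years <- dplyr::bind_rows(
  lapply(seq_along(reports), function(i) {
    df <- parse_report(reports[[i]], pages[[i]])  # hypothetical helper
    df$report <- names(reports)[i]                # keep track of the source report
    df
  })
)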
I have:
a number m of categorical features (x1, x2, ... xm)
1 categorical feature (y)
all in a dataframe (df).
I would like a function that gives a single table with all the crossings between xi and y, for example:
table1 = table(df$x1, df$y) ... tablem = table(df$xm, df$y)
and then aggregate the tables with rbind.
I'm almost there but it doesn't work.
How about this:
data(diamonds, package="ggplot2")
tabs <- lapply(diamonds[,c("color", "clarity")], \(x){
table(x, diamonds$cut)
})
do.call(rbind,tabs)
#> Fair Good Very Good Premium Ideal
#> D 163 662 1513 1603 2834
#> E 224 933 2400 2337 3903
#> F 312 909 2164 2331 3826
#> G 314 871 2299 2924 4884
#> H 303 702 1824 2360 3115
#> I 175 522 1204 1428 2093
#> J 119 307 678 808 896
#> I1 210 96 84 205 146
#> SI2 466 1081 2100 2949 2598
#> SI1 408 1560 3240 3575 4282
#> VS2 261 978 2591 3357 5071
#> VS1 170 648 1775 1989 3589
#> VVS2 69 286 1235 870 2606
#> VVS1 17 186 789 616 2047
#> IF 9 71 268 230 1212
Created on 2022-05-30 by the reprex package (v2.0.1)
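If it matters which x each block of rows came from (the mtcars example below keeps that in its row-name prefixes), one option, not part of the answer above, is to prefix the row names before stacking:
tabs2 <- Map(function(tab, nm) {
  rownames(tab) <- paste(nm, rownames(tab), sep = ".")  # e.g. "color.D", "clarity.IF"
  tab
}, tabs, names(tabs))
do.call(rbind, tabs2)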
An example with mtcars, c("vs","am","gear") (your x's) vs "carb" (your y):
do.call(
rbind,
sapply(
c("vs","am","gear"),
function(x){
as.data.frame(table(mtcars[,x],mtcars$carb))
},
simplify=F
)
)
Var1 Var2 Freq
vs.1 0 1 0
vs.2 1 1 7
vs.3 0 2 5
vs.4 1 2 5
vs.5 0 3 3
vs.6 1 3 0
vs.7 0 4 8
vs.8 1 4 2
...
Var1 is the value of the x variable (named in the row-name prefix), Var2 is the value of y.
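If you would rather have the x variable as an explicit column instead of a row-name prefix, a small variant (a sketch, not part of the answer above):
do.call(
  rbind,
  lapply(c("vs", "am", "gear"), function(x) {
    out <- as.data.frame(table(mtcars[, x], mtcars$carb))
    names(out) <- c("x_value", "carb", "Freq")
    out$x_var <- x  # which x variable the counts belong to
    out
  })
)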
Can you help me think of some way to reduce the computational time of code that generates a certain value, which in this case I call coef and which depends on id/date/category? Better explanations below.
I made two functions that generate the same result. As you can see in the benchmark, the first function (return_values) takes twice as long as the second function (return_valuesX) to generate the same results. In the second function I make some brief changes when calculating the coef variable, and those alone cut the processing time by about 50% compared to the first. I strongly believe there is still room to improve the code, but I'm out of ideas for new adjustments, so I would like your valuable opinion.
Code Explanations:
In general, the purpose of the code is to calculate a value, which I call coef, for each group of id, date and category. For this, I first calculate the median of the values resulting from subtracting the DRM0 columns from DR1 in the df1 database. After obtaining the median (the med variable), I add it back to the values of the DRM0 columns of df1; this calculation is my SPV variable. In both cases I used data.table, which I believe is faster than dplyr. After I get SPV, I need to calculate the coef variable for each id/date/category.
Below is a simple example of the coef calculation. Say I want to calculate coef for idd <- "3", dmda <- "2021-12-03", CategoryChosse <- "ABC", and I have the following:
> SPV %>% filter(Id==idd, date2 == ymd(dmda), Category == CategoryChosse)
Id date1 date2 Week Category DRM001_PV DRM002_PV DRM003_PV DRM004_PV DRM005_PV DRM006_PV DRM007_PV DRM008_PV DRM009_PV DRM010_PV DRM011_PV DRM012_PV
1: 3 2021-12-01 2021-12-03 Monday ABC -3 374 198 17 537 -54 330 -136 -116 534 18 -199
DRM013_PV DRM014_PV DRM015_PV DRM016_PV DRM017_PV DRM018_PV DRM019_PV DRM020_PV DRM021_PV DRM022_PV DRM023_PV DRM024_PV DRM025_PV DRM026_PV DRM027_PV DRM028_PV
1: 106 106 349 76 684 390 218 146 141 20 435 218 372 321 218 218
DRM029_PV DRM030_PV DRM031_PV DRM032_PV DRM033_PV DRM034_PV DRM035_PV DRM036_PV DRM037_PV DRM038_PV DRM039_PV DRM040_PV DRM041_PV DRM042_PV DRM043_PV DRM044_PV
1: 55 455 46 411 262 449 325 467 43 -114 191 167 63 -123 252 218
DRM045_PV DRM046_PV DRM047_PV DRM048_PV DRM049_PV DRM050_PV DRM051_PV DRM052_PV DRM053_PV DRM054_PV DRM055_PV DRM056_PV DRM057_PV DRM058_PV DRM059_PV DRM060_PV
1: 305 420 -296 596 200 218 190 203 607 218 442 -72 463 129 -39 333
DRM061_PV DRM062_PV DRM063_PV DRM064_PV DRM065_PV DRM066_PV DRM067_PV DRM068_PV DRM069_PV DRM070_PV DRM071_PV DRM072_PV DRM073_PV DRM074_PV DRM075_PV DRM076_PV
1: -26 160 -91 326 218 369 317 476 224 61 195 613 342 218 204 521
DRM077_PV DRM078_PV DRM079_PV DRM080_PV DRM081_PV DRM082_PV DRM083_PV DRM084_PV DRM085_PV DRM086_PV DRM087_PV DRM088_PV DRM089_PV DRM090_PV DRM091_PV DRM092_PV
1: 588 218 449 340 51 508 -72 42 492 510 328 818 -132 -105 210 -102
DRM093_PV DRM094_PV DRM095_PV DRM096_PV DRM097_PV DRM098_PV DRM099_PV DRM0100_PV DRM0101_PV DRM0102_PV DRM0103_PV DRM0104_PV DRM0105_PV DRM0106_PV DRM0107_PV
1: -137 94 639 265 -64 512 32 -53 414 340 -16 471 434 150 267
DRM0108_PV DRM0109_PV DRM0110_PV DRM0111_PV DRM0112_PV DRM0113_PV DRM0114_PV DRM0115_PV DRM0116_PV DRM0117_PV DRM0118_PV DRM0119_PV DRM0120_PV DRM0121_PV DRM0122_PV
1: 383 -162 434 -134 -39 450 212 146 -26 8 222 341 601 239 57
DRM0123_PV DRM0124_PV DRM0125_PV DRM0126_PV DRM0127_PV DRM0128_PV DRM0129_PV DRM0130_PV DRM0131_PV DRM0132_PV DRM0133_PV DRM0134_PV DRM0135_PV DRM0136_PV DRM0137_PV
1: 484 239 502 415 504 62 487 168 101 319 365 37 218 -50 230
DRM0138_PV DRM0139_PV DRM0140_PV DRM0141_PV DRM0142_PV DRM0143_PV DRM0144_PV DRM0145_PV DRM0146_PV DRM0147_PV DRM0148_PV DRM0149_PV DRM0150_PV DRM0151_PV DRM0152_PV
1: 493 159 150 132 58 21 468 -81 27 345 107 148 -66 -146 -185
DRM0153_PV DRM0154_PV DRM0155_PV DRM0156_PV DRM0157_PV DRM0158_PV DRM0159_PV DRM0160_PV DRM0161_PV DRM0162_PV DRM0163_PV DRM0164_PV DRM0165_PV DRM0166_PV DRM0167_PV
1: -14 562 68 140 353 120 130 301 76 441 218 370 218 378 -22
DRM0168_PV DRM0169_PV DRM0170_PV DRM0171_PV DRM0172_PV DRM0173_PV DRM0174_PV DRM0175_PV DRM0176_PV DRM0177_PV DRM0178_PV DRM0179_PV DRM0180_PV DRM0181_PV DRM0182_PV
1: -279 563 628 600 152 218 445 246 420 94 495 509 356 183 326
DRM0183_PV DRM0184_PV DRM0185_PV DRM0186_PV DRM0187_PV DRM0188_PV DRM0189_PV DRM0190_PV DRM0191_PV DRM0192_PV DRM0193_PV DRM0194_PV DRM0195_PV DRM0196_PV DRM0197_PV
1: 493 -190 -65 -123 376 357 473 112 -69 471 452 221 165 -44 87
DRM0198_PV DRM0199_PV DRM0200_PV DRM0201_PV DRM0202_PV DRM0203_PV DRM0204_PV DRM0205_PV DRM0206_PV DRM0207_PV DRM0208_PV DRM0209_PV DRM0210_PV DRM0211_PV DRM0212_PV
1: 239 285 521 -65 158 223 160 223 269 57 218 218 102 329 218
DRM0213_PV DRM0214_PV DRM0215_PV DRM0216_PV DRM0217_PV DRM0218_PV DRM0219_PV DRM0220_PV DRM0221_PV DRM0222_PV DRM0223_PV DRM0224_PV DRM0225_PV DRM0226_PV DRM0227_PV
1: 769 215 -68 218 347 18 218 547 759 278 -80 -37 629 -16 774
DRM0228_PV DRM0229_PV DRM0230_PV DRM0231_PV DRM0232_PV DRM0233_PV DRM0234_PV DRM0235_PV DRM0236_PV DRM0237_PV DRM0238_PV DRM0239_PV DRM0240_PV DRM0241_PV DRM0242_PV
1: 364 113 -132 31 536 118 248 385 218 202 218 41 23 218 379
DRM0243_PV DRM0244_PV DRM0245_PV DRM0246_PV DRM0247_PV DRM0248_PV DRM0249_PV DRM0250_PV DRM0251_PV DRM0252_PV DRM0253_PV DRM0254_PV DRM0255_PV DRM0256_PV DRM0257_PV
1: -158 462 600 221 218 221 442 218 53 218 176 504 -61 78 68
DRM0258_PV DRM0259_PV DRM0260_PV DRM0261_PV DRM0262_PV DRM0263_PV DRM0264_PV DRM0265_PV DRM0266_PV DRM0267_PV DRM0268_PV DRM0269_PV DRM0270_PV DRM0271_PV DRM0272_PV
1: 493 403 218 339 299 749 -18 465 686 -215 579 307 366 279 94
DRM0273_PV DRM0274_PV DRM0275_PV DRM0276_PV DRM0277_PV DRM0278_PV DRM0279_PV DRM0280_PV DRM0281_PV DRM0282_PV DRM0283_PV DRM0284_PV DRM0285_PV DRM0286_PV DRM0287_PV
1: 138 56 459 613 219 400 35 -74 516 218 -80 317 310 -231 229
DRM0288_PV DRM0289_PV DRM0290_PV DRM0291_PV DRM0292_PV DRM0293_PV DRM0294_PV DRM0295_PV DRM0296_PV DRM0297_PV DRM0298_PV DRM0299_PV DRM0300_PV DRM0301_PV DRM0302_PV
1: 345 -70 619 235 122 61 337 -163 210 586 127 -112 368 365 476
DRM0303_PV DRM0304_PV DRM0305_PV DRM0306_PV DRM0307_PV DRM0308_PV DRM0309_PV DRM0310_PV DRM0311_PV DRM0312_PV DRM0313_PV DRM0314_PV DRM0315_PV DRM0316_PV DRM0317_PV
1: 240 270 497 97 420 -184 212 -28 151 527 186 -32 60 96 -86
DRM0318_PV DRM0319_PV DRM0320_PV DRM0321_PV DRM0322_PV DRM0323_PV DRM0324_PV DRM0325_PV DRM0326_PV DRM0327_PV DRM0328_PV DRM0329_PV DRM0330_PV DRM0331_PV DRM0332_PV
1: 454 321 300 552 319 134 -63 622 441 297 507 578 198 360 542
DRM0333_PV DRM0334_PV DRM0335_PV DRM0336_PV DRM0337_PV DRM0338_PV DRM0339_PV DRM0340_PV DRM0341_PV DRM0342_PV DRM0343_PV DRM0344_PV DRM0345_PV DRM0346_PV DRM0347_PV
1: 153 318 68 763 370 337 633 469 453 146 428 418 169 468 526
DRM0348_PV DRM0349_PV DRM0350_PV DRM0351_PV DRM0352_PV DRM0353_PV DRM0354_PV DRM0355_PV DRM0356_PV DRM0357_PV DRM0358_PV DRM0359_PV DRM0360_PV DRM0361_PV DRM0362_PV
1: 441 674 21 -182 174 153 -158 268 191 460 10 82 543 -193 218
DRM0363_PV DRM0364_PV DRM0365_PV
1: -203 269 479
So coef depends on ymd(dmda) - ymd(min(df1$date1)). For the id/date/category mentioned above, the difference is 2 days, so the column I want is DRM003_PV, and its value for this case is 198. Therefore, I made:
coef<-SPV %>%
filter(Id==idd, date2 == ymd(dmda), Category == CategoryChosse) %>%
pull(as.numeric(ymd(dmda)-ymd(min(df1$date1)))+6)
> coef
[1] 198
This issue has been resolved here: Adjust code to choose a specific column depending on the difference between dates
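As a side note (a sketch of my own, not from the linked answer), the positional +6 offset can be replaced by building the column name for the wanted day, using the same formatC() scheme that names the DRM0 columns in the data-generation code below:
day <- as.numeric(ymd(dmda) - ymd(min(df1$date1))) + 1
col <- paste0("DRM0", formatC(day, width = 2, flag = "0"), "_PV")  # e.g. "DRM003_PV"

spv_row <- SPV %>%
  filter(Id == idd, date2 == ymd(dmda), Category == CategoryChosse)
coef <- spv_row[[col]]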
Libraries and database
library(tidyverse)
library(lubridate)
library(data.table)
library(bench)
set.seed(123)
df1 <- data.frame( Id = rep(1:5, length=800),
date1 = as.Date( "2021-12-01"),
date2= rep(seq( as.Date("2021-01-01"), length.out=400, by=1), each = 2),
Category = rep(c("ABC", "EFG"), length.out = 800),
Week = rep(c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday",
"Saturday", "Sunday"), length.out = 800),
DR1 = sample( 200:250, 800, repl=TRUE),
setNames( replicate(365, { sample(0:800, 800)}, simplify=FALSE),
paste0("DRM0", formatC(1:365, width = 2, format = "d", flag = "0"))))
First function
return_values <- function (df1,idd,dmda, CategoryChosse) {
# First idea: Calculate the median of the values resulting from the subtraction between DR1 and the values of the DRM0 columns
dt1 <- as.data.table(df1)
cols <- grep("^DRM0", colnames(dt1), value = TRUE)
med <-
dt1[, (paste0(cols, "_PV")) := DR1 - .SD, .SDcols = cols
][, lapply(.SD, median), by = .(Id, Category, Week), .SDcols = paste0(cols, "_PV") ]
# Second idea: After obtaining the median, I add the values found with the values of the DRM columns of my df1 database.
f2 <- function(nm, pat) grep(pat, nm, value = TRUE)
nm1 <- f2(names(df1), "^DRM0\\d+$")
nm2 <- f2(names(med), "_PV")
nm3 <- paste0("i.", nm2)
setDT(df1)[med,(nm2) := Map(`+`, mget(nm1), mget(nm3)), on = .(Id, Category, Week)]
SPV <- df1[, c('Id','date1', 'date2', 'Week','Category', nm2), with = FALSE]#%>%data.frame
# Third idea: Calculate the coef values
coef<-SPV %>%
filter(Id==idd, date2 == ymd(dmda), Category == CategoryChosse) %>%
pull(as.numeric(ymd(dmda)-ymd(min(df1$date1)))+6)
return(coef)
}
Results using first function
subset_df1 <- subset(df1, date2 > date1)
a<-subset_df1 %>%
rowwise %>%
select(-c(Week,starts_with('DR')))%>%
mutate(Result=return_values(df1,Id, date2, Category)) %>%
data.frame()
> a
Id date1 date2 Category Result
1 1 2021-12-01 2021-12-02 ABC 4.0
2 2 2021-12-01 2021-12-02 EFG 238.0
3 3 2021-12-01 2021-12-03 ABC 198.0
4 4 2021-12-01 2021-12-03 EFG 163.0
5 5 2021-12-01 2021-12-04 ABC 462.0
...........
Second function
return_valuesX <- function (df1,idd,dmda, CategoryChosse) {
# First idea: Calculate the median of the values resulting from the subtraction between DR1 and the values of the DRM columns
dt1 <- as.data.table(df1)
num_to_pull <- as.numeric(ymd(dmda)-ymd(min(df1$date1)))+6
cols <- grep("^DRM0", colnames(dt1), value = TRUE)[1:num_to_pull]
med <-
dt1[, (paste0(cols, "_PV")) := DR1 - .SD, .SDcols = cols
][, lapply(.SD, median), by = .(Id, Category, Week), .SDcols = paste0(cols, "_PV") ]
# Second idea: After obtaining the median, I add the values found with the values of the DRM columns of my df1 database.
f2 <- function(nm, pat) grep(pat, nm, value = TRUE)
nm1 <- f2(names(df1), "^DRM0\\d+$")[1:num_to_pull]
nm2 <- f2(names(med), "_PV")[1:num_to_pull]
nm3 <- paste0("i.", nm2)[1:num_to_pull]
setDT(df1)[med,(nm2) := Map(`+`, mget(nm1), mget(nm3)), on = .(Id, Category, Week)]
SPV <- df1[, c('Id','date1', 'date2', 'Week','Category', nm2), with = FALSE]#%>%data.frame
# Third idea: Calculate the coef values
coef<-SPV %>%
filter(Id==idd, date2 == ymd(dmda), Category == CategoryChosse) %>%
pull(num_to_pull)
return(coef)
}
Results using second function
b<-subset_df1 %>%
rowwise %>%
select(-c(Week,starts_with('DR')))%>%
mutate(Result = return_valuesX(df1,Id, date2, Category)) %>%
data.frame()
> b
Id date1 date2 Category Result
1 1 2021-12-01 2021-12-02 ABC 4.0
2 2 2021-12-01 2021-12-02 EFG 238.0
3 3 2021-12-01 2021-12-03 ABC 198.0
4 4 2021-12-01 2021-12-03 EFG 163.0
5 5 2021-12-01 2021-12-04 ABC 462.0
...............
Comparing the two results:
identical(a, b)
[1] TRUE
Calculate processing time using benchmark
subset_df1 <- subset(df1, date2 > date1)
bench::mark(a=subset_df1 %>%
rowwise %>%
select(-c(Week,starts_with('DR')))%>%
mutate(Result=return_values(df1,Id, date2, Category)),
b=subset_df1 %>%
rowwise %>%
select(-c(Week,starts_with('DR')))%>%
mutate(Result=return_valuesX(df1,Id, date2, Category)),iterations = 1)
# A tibble: 2 x 13
expression min median `itr/sec` mem_alloc `gc/sec` n_itr n_gc total_time result memory time gc
<bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl> <int> <dbl> <bch:tm> <list> <list> <list> <list>
1 a 53.7s 53.7s 0.0186 4.54GB 0.634 1 34 53.7s <rowwise_df [130 x 5]> <Rprofmem [981,580 x 3]> <bench_tm [1]> <tibble [1 x 3]>
2 b 21s 21s 0.0477 913.77MB 0.382 1 8 21s <rowwise_df [130 x 5]> <Rprofmem [278,340 x 3]> <bench_tm [1]> <tibble [1 x 3]>
The df1 database can be rebuilt with the code under Libraries and database above.
Here's an approach that is about 20x/10x faster than your functions for the example data, and would be faster yet for larger data sets. (When I run with 100k rows in df1, it's 572x faster.) I hope you'll find this approach easier to understand and debug.
This is written using tidyverse functions like tidyr::pivot_longer and dplyr::group_by. If you want to squeeze out a bit more speed, the data.table and collapse packages offer faster alternatives for many functions, especially around grouped calculations. But the main speed improvement here is from restructuring to avoid repeating the same calculations over and over and letting R rely more on vectorized calculations. https://www.noamross.net/archives/2014-04-16-vectorization-in-r-why/
pre_calc <- function(df) {
pre_calc <- df1 %>% # this calculates once on the full data
select(!ends_with("_PV")) %>%
pivot_longer(-c(1:6), values_to = "DRM", names_to = "day") %>%
mutate(day = parse_number(day)) %>%
group_by(Id, Category, Week, day) %>%
mutate(med = median(DR1 - DRM), Result = DRM + med) %>%
ungroup()
df %>% # starts from the subsetted data and joins to results from above
select(1:5) %>%
left_join(pre_calc) %>%
filter(day == date2 - date1 + 1) %>%
select(Id, date1, date2, Category, Result)
}
c <- subset_df1 %>% pre_calc()
c matches a and b from your tests, with the one difference that date2 (originally date-integer, which is a nonstandard type) has in my approach been coerced into a typical date-double, like date1. We can use typeof(df1$date1) and typeof(df1$date2) to see this. waldo::compare(b, c) confirms the results otherwise match. I opened an issue with tidyr here since the subtle change seems to have been caused by the pivot_longer step.
UPDATE: Apparently the creation of a date-integer object is a bug in base R's seq.Date / seq function, which was fixed in R 4.2: https://github.com/tidyverse/tidyr/issues/1356#issuecomment-1111078891
In the approach above, I pre-calculate all the results once, by taking the original data set df1, throwing out the existing _PV columns (I think they get overwritten?), and -- here's where the speed gains come from -- reshaping to long format. While this single operation is computationally expensive, it means we can more efficiently apply the same calculation to all the DRM_* columns at once, and we can rely on fast filtering instead of slow subsetting to extract our result.
The group_by(Id, Category, Week, day) line and the next mutate(... line let us calculate the median differences between DR1 and that day's DRM for each Id-Category-Week combination, so we can calculate all the Results at once.
The last part takes the df lines (e.g. the subsetted data in your example where date2 > date1) and attaches those to the pre-calculated results, filtering to get the right day (previously encoded by column name/position).
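As a quick spot check, the hand-worked example from the question (Id 3, date2 2021-12-03, category ABC, expected Result 198) can be pulled straight from the joined result:
# expected Result for this Id/date/Category is 198, per the question
c %>% filter(Id == 3, date2 == as.Date("2021-12-03"), Category == "ABC")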
I have two datasets with names and I need to compare the names in both. I just need to keep the union of the two datasets based on the names. However, a name is still considered 'matched' if it is part of another name, even if it is not a full match, and vice versa. For example, "seb" should match "seb", but also "sebas". I am using str_detect(), but it is too slow. I am wondering if there is any way to speed up this process. I tried some other packages and functions, but nothing really improved the speed. I am open to any R or Python solution.
Create two dummy datasets
library(dplyr)
library(stringr)
set.seed(1)
data_set_A <- tibble(name = unique(replicate(2000, paste(sample(letters, runif(1, 3, 10), replace = T), collapse = "")))) %>%
mutate(ID_A = 1:n())
set.seed(2)
data_set_B <- tibble(name_2 = unique(replicate(2000, paste(sample(letters, runif(1, 3, 10), replace = T), collapse = "")))) %>%
mutate(ID_B = 1:n())
Test matching of full matches only
# This is almost instant
data_set_A %>%
rowwise() %>%
filter(any(name %in% data_set_B$name_2) | any(data_set_B$name_2 %in% name)) %>%
ungroup()
# A tibble: 4 x 2
name ID_A
<chr> <int>
1 vnt 112
2 fly 391
3 cug 1125
4 xgv 1280
Include partial matches (This is what I want to optimize)
This of course only gives me the subset of dataset A, but that is ok.
# This takes way too long
data_set_A %>%
rowwise() %>%
filter(any(str_detect(name, data_set_B$name_2)) | any(str_detect(data_set_B$name_2, name))) %>%
ungroup()
A tibble: 237 x 2
name ID_A
<chr> <int>
1 wknrsauuj 2
2 lyw 7
3 igwsvrzpk 16
4 zozxjpu 18
5 cgn 22
6 oqo 45
7 gkritbe 47
8 uuq 92
9 lhwfyksz 94
10 tuw 100
Fuzzyjoin method.
This also works, but is equally slow
bind_rows(
fuzzyjoin::fuzzy_inner_join(
data_set_A,
data_set_B,
by = c("name" = "name_2"),
match_fun = stringr::str_detect
) %>%
select(name, ID_A),
fuzzyjoin::fuzzy_inner_join(
data_set_B,
data_set_A,
by = c("name_2" = "name"),
match_fun = stringr::str_detect
) %>%
select(name, ID_A)
) %>%
distinct()
data.table solution
not much faster unfortunately
library(data.table)
setDT(data_set_A)
setDT(data_set_B)
data_set_A[data_set_A[, .I[any(str_detect(name, data_set_B$name_2)) |
any(str_detect(data_set_B$name_2, name))], by = .(ID_A)]$V1]
This is an R option aimed at reducing the number of times str_detect() is called (your example is slow because the function is called several thousand times, and because it does not use fixed() or fixed = TRUE, as jpiversen already pointed out). The answer is explained in comments in the code; I will try to jump on tomorrow to explain a bit more.
This should scale reasonably well and be more memory efficient than the current approach too, because it reduces the rowwise computations to an absolute minimum.
Benchmarks:
n = 2000
# A tibble: 4 × 13
expression min median `itr/sec` mem_alloc `gc/sec` n_itr
<bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl> <int>
1 original() 6.67s 6.67s 0.150 31.95MB 0.300 1
2 using_fixed() 496.54ms 496.54ms 2.01 61.39MB 4.03 1
3 using_map_fixed() 493.35ms 493.35ms 2.03 60.27MB 6.08 1
4 andrew_fun() 167.78ms 167.78ms 5.96 1.59MB 0 1
n = 4000
Note: I am not sure if you need the answer to scale, but the approach of reducing the memory-intensive part does seem to do just that (although the time difference is negligible for n = 4000 with 1 iteration, IMO).
# A tibble: 4 × 13
expression min median `itr/sec` mem_alloc `gc/sec` n_itr
<bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl> <int>
1 original() 26.63s 26.63s 0.0376 122.33MB 0.150 1
2 using_fixed() 1.91s 1.91s 0.525 243.96MB 3.67 1
3 using_map_fixed() 1.87s 1.87s 0.534 236.62MB 3.20 1
4 andrew_fun() 674.36ms 674.36ms 1.48 7.59MB 0 1
Code w/ comments:
# This is so we do not retain the strings with the max number of
# characters in our pattern because we are checking with %in% already
nchar_a = nchar(data_set_A$name)
nchar_b = nchar(data_set_B$name_2)
# Creating large patterns (excluding values w/ max number of characters)
pattern_a = str_c(unique(data_set_A$name[nchar_a != max(nchar_a, na.rm = TRUE)]), collapse = "|")
pattern_b = str_c(unique(data_set_B$name_2[nchar_b != max(nchar_b, na.rm = TRUE)]), collapse = "|")
# First checking using %in%
idx_a = data_set_A$name %in% data_set_B$name_2
# Next, IDing when a(string) matches b(pattern)
idx_a[!idx_a] = str_detect(data_set_A$name[!idx_a], pattern_b)
# IDing a(pattern) matches b(string) so we do not run every row of
# a(as a pattern) against all of b
b_to_check = data_set_B$name_2[str_detect(data_set_B$name_2, pattern_a)]
# Using unmatched values of a as a pattern for the reduced set for b
idx_a[!idx_a] = vapply(data_set_A$name[!idx_a], function(name) {
any(grepl(name, b_to_check, fixed = TRUE))
}, logical(1L), USE.NAMES = FALSE)
data_set_A[idx_a, ]
# A tibble: 237 × 2
name ID_A
<chr> <int>
1 wknrsauuj 2
2 lyw 7
3 igwsvrzpk 16
4 zozxjpu 18
5 cgn 22
6 oqo 45
7 gkritbe 47
8 uuq 92
9 lhwfyksz 94
10 tuw 100
# … with 227 more rows
Reproducible R code for benchmarks
The following code is largely taken from jpiversen who provided a great answer:
library(dplyr)
library(stringr)
n = 2000
set.seed(1)
data_set_A <- tibble(name = unique(replicate(n, paste(sample(letters, runif(1, 3, 10), replace = T), collapse = "")))) %>%
mutate(ID_A = 1:n())
set.seed(2)
data_set_B <- tibble(name_2 = unique(replicate(n, paste(sample(letters, runif(1, 3, 10), replace = T), collapse = "")))) %>%
mutate(ID_B = 1:n())
original <- function() {
data_set_A %>%
rowwise() %>%
filter(any(str_detect(name, data_set_B$name_2)) | any(str_detect(data_set_B$name_2, name))) %>%
ungroup()
}
using_fixed <- function() {
data_set_A %>%
rowwise() %>%
filter(any(str_detect(name, fixed(data_set_B$name_2))) | any(str_detect(data_set_B$name_2, fixed(name)))) %>%
ungroup()
}
using_map_fixed <- function() {
logical_vec <- data_set_A$name %>%
purrr::map_lgl(
~any(stringr::str_detect(.x, fixed(data_set_B$name_2))) ||
any(stringr::str_detect(data_set_B$name_2, fixed(.x)))
)
data_set_A[logical_vec, ]
}
andrew_fun = function() {
nchar_a = nchar(data_set_A$name)
nchar_b = nchar(data_set_B$name_2)
pattern_a = str_c(unique(data_set_A$name[nchar_a != max(nchar_a, na.rm = TRUE)]), collapse = "|")
pattern_b = str_c(unique(data_set_B$name_2[nchar_b != max(nchar_b, na.rm = TRUE)]), collapse = "|")
idx_a = data_set_A$name %in% data_set_B$name_2
idx_a[!idx_a] = str_detect(data_set_A$name[!idx_a], pattern_b)
b_to_check = data_set_B$name_2[str_detect(data_set_B$name_2, pattern_a)]
idx_a[!idx_a] = vapply(data_set_A$name[!idx_a], function(name) {
any(grepl(name, b_to_check, fixed = TRUE))
}, logical(1L), USE.NAMES = FALSE)
data_set_A[idx_a, ]
}
bm = bench::mark(
original(),
using_fixed(),
using_map_fixed(),
andrew_fun(),
iterations = 1
)
TL;DR
The slow part is str_detect(string, pattern).
To speed it up, wrap pattern in fixed() if you got simple strings, and in coll() if you got longer, typical human text.
To get another slight speed boost, rewrite your code using purrr::map_lgl() and use this to subset your data.
Examples, explanations and benchmarks follow below.
Rewriting str_detect() using fixed() or coll()
I believe the easiest fix is to modify how str_detect() uses regex with e.g. stringr::fixed() or stringr::coll().
From ?stringr::str_detect():
Match a fixed string (i.e. by comparing only bytes), using fixed(). This is fast, but approximate. Generally, for matching human text, you'll want coll() which respects character matching rules for the specified locale.
Below is a comparison with your original code:
original <- function() {
data_set_A %>%
rowwise() %>%
filter(any(str_detect(name, data_set_B$name_2)) | any(str_detect(data_set_B$name_2, name))) %>%
ungroup()
}
# Note the use of fixed()
using_fixed <- function() {
data_set_A %>%
rowwise() %>%
filter(any(str_detect(name, fixed(data_set_B$name_2))) | any(str_detect(data_set_B$name_2, fixed(name)))) %>%
ungroup()
}
# Note the use of coll()
using_coll <- function() {
data_set_A %>%
rowwise() %>%
filter(any(str_detect(name, coll(data_set_B$name_2))) | any(str_detect(data_set_B$name_2, coll(name)))) %>%
ungroup()
}
bm <- bench::mark(
original(),
using_fixed(),
using_coll(),
iterations = 20
)
#> Warning: Some expressions had a GC in every iteration; so filtering is disabled.
bm
#> # A tibble: 3 × 6
#> expression min median `itr/sec` mem_alloc `gc/sec`
#> <bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl>
#> 1 original() 6.58s 6.59s 0.152 32.4MB 0.371
#> 2 using_fixed() 501.64ms 505.51ms 1.97 61.4MB 3.94
#> 3 using_coll() 4.48s 4.5s 0.222 61.4MB 0.512
bm %>% ggplot2::autoplot(type = "violin")
#> Loading required namespace: tidyr
Created on 2022-04-02 by the reprex package (v2.0.1)
So, as we can see, wrapping your code in fixed() will make it very fast and works well on your test data. However, it might not work as well for real human text (especially non-ASCII character sets). You should test it on your original data, and use coll() as an alternative if fixed() doesn't work.
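A side note of my own (not from the answer above): fixed() also guards against names that happen to contain regex metacharacters, which the plain-regex version would misinterpret:
library(stringr)
str_detect("o'neil jr.", "jr.")          # TRUE, but only because "." matches any character
str_detect("o'neil jrx", "jr.")          # TRUE as a regex, a false positive
str_detect("o'neil jrx", fixed("jr."))   # FALSE with a literal match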
Removing rowwise()
Another step you can take to make your code a bit faster is to get rid of rowwise(). I would replace it with purrr::map_lgl() and use the resulting logical vector to subset the dataframe. Below is an example and a benchmark against my functions defined above:
using_map_fixed <- function() {
logical_vec <- data_set_A$name %>%
purrr::map_lgl(
~any(stringr::str_detect(.x, fixed(data_set_B$name_2))) ||
any(stringr::str_detect(data_set_B$name_2, fixed(.x)))
)
data_set_A[logical_vec, ]
}
using_map_coll <- function() {
logical_vec <- data_set_A$name %>%
purrr::map_lgl(
~any(stringr::str_detect(.x, coll(data_set_B$name_2))) ||
any(stringr::str_detect(data_set_B$name_2, coll(.x)))
)
data_set_A[logical_vec, ]
}
bm <- bench::mark(
using_fixed(),
using_map_fixed(),
using_coll(),
using_map_coll(),
iterations = 20
)
#> Warning: Some expressions had a GC in every iteration; so filtering is disabled.
bm
#> # A tibble: 4 × 6
#> expression min median `itr/sec` mem_alloc `gc/sec`
#> <bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl>
#> 1 using_fixed() 503.4ms 507.24ms 1.95 62.9MB 5.37
#> 2 using_map_fixed() 474.28ms 477.63ms 2.09 60.3MB 3.14
#> 3 using_coll() 4.49s 4.5s 0.222 61.4MB 0.489
#> 4 using_map_coll() 4.37s 4.38s 0.228 60.2MB 0.354
Created on 2022-04-02 by the reprex package (v2.0.1)
As we see, this gives another slight speed boost.
Using fixed() with data.table or fuzzyjoin
You can also use fixed() with data.table and fuzzyjoin. I have not included it here for brevity, but my benchmark shows that data.table takes about the same amount of time as my using_map_fixed() above, and fuzzyjoin takes about twice as much time.
This makes sense to me, as the slow part is str_detect(), not the method of joining/filtering, or the underlying data structure.
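For completeness, a sketch of what the fuzzyjoin variant might look like (the answer omits it "for brevity"; this mirrors the earlier fuzzy_inner_join call, wrapping the pattern side in fixed() inside an anonymous match_fun):
fuzzyjoin::fuzzy_inner_join(
  data_set_A,
  data_set_B,
  by = c("name" = "name_2"),
  match_fun = function(x, y) stringr::str_detect(x, stringr::fixed(y))
)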
If you would like to use base R, the code below might be one fast option
A <- data_set_A$name
B <- data_set_B$name_2
A2B <- sapply(A, function(x) grepl(x, B, fixed = TRUE))
B2A <- sapply(B, function(x) grepl(x, A, fixed = TRUE))
idx <- which(t(A2B) | B2A, arr.ind = TRUE)
res <- cbind(data_set_A[idx[, 1], ], data_set_B[idx[, 2], ])
which gives
> res
name ID_A name_2 ID_B
1 arh 1234 pimoarhd 8
2 qtj 720 aqtj 23
3 szcympsn 142 cym 43
4 cymvubnxg 245 cym 43
5 dppvtcymq 355 cym 43
6 kzi 690 kzii 48
7 eyajqchkn 498 chk 53
8 upfzh 522 upf 61
9 ioa 1852 ioadr 63
10 lya 1349 ibelyalvh 64
11 honod 504 ono 71
12 zozxjpu 18 zoz 72
13 jcz 914 cdjczpqg 88
14 ailmjf 623 ilm 99
15 upoux 609 oux 104
16 pouxifvp 1466 oux 104
17 mvob 516 vob 106
18 nqtotvhhm 1088 otv 115
19 wom 202 womtglapx 117
20 qkc 756 dqkcfqpps 118
21 qtl 600 ivqtlymzr 126
22 qqi 1605 owfsqqiyu 153
23 fmjalirze 1470 ali 172
24 ibwfwkyp 1588 fwk 175
25 iat 1258 iatjeg 185
26 osm 253 nviiqosm 199
27 wpj 373 wpjeb 204
28 hahx 515 ahx 213
29 keahxa 1565 ahx 213
30 psf 359 qnpsfo 223
31 saq 1859 saqhu 227
32 cvmkwtx 714 cvm 228
33 ilw 389 pyilwj 231
34 ohwysv 1590 ysv 237
35 utrl 698 trl 244
36 dmttrlcpj 1267 trl 244
37 cpv 236 btcpvmoc 247
38 uto 1047 utoi 257
39 yngunekl 1978 ekl 258
40 vceko 625 vce 265
41 fir 1934 firgk 278
42 qvd 983 eqvdfi 287
43 fir 1934 zwwefir 291
44 idvfkevdf 1380 vdf 312
45 qwdo 1921 qwd 322
46 kam 1205 tlkam 327
47 lck 488 clckjkyzn 329
48 gmspwckw 1015 msp 359
49 ynouuwqtz 1576 nou 360
50 tty 1209 bttyvt 361
51 vkc 999 fmrvkcl 366
52 ipw 1918 fipwjomdu 388
53 zdv 261 zdvkut 410
54 vku 1137 zdvkut 410
55 doby 246 oby 411
56 hycvuupgy 141 uup 421
57 uwlb 1249 wlb 431
58 auj 1452 lcmnauj 444
59 rwd 1667 ukwrwdczs 479
60 ylsihqqor 1290 ihq 483
61 feo 1649 feorvxbm 485
62 zff 755 dohzffujm 499
63 mqutujepu 904 epu 507
64 uiepu 1308 epu 507
65 vahepuk 1434 epu 507
66 cug 1125 accugl 509
67 fir 1934 firwe 517
68 dia 1599 dialeddd 527
69 temiwd 1725 tem 531
70 svofivl 1177 svo 545
71 flm 657 aflm 546
72 vnt 112 vnt 551
73 bhmoskrz 426 osk 558
74 wev 728 shemuwev 569
75 hzpi 1586 hzp 579
76 gvi 1064 mkgvivlfe 582
77 fjb 1398 vkfjbxnjl 589
78 qin 1013 qinp 593
79 ecn 1342 ecnzre 598
80 zre 1610 ecnzre 598
81 xvr 772 dpxvrfmo 623
82 tqr 1419 tqrmztdm 624
83 zmwnf 1571 mwn 626
84 ypil 1787 pil 630
85 mnxlqgfh 1132 nxl 643
86 gse 1563 gseice 646
87 ygk 1309 ygkqrk 655
88 fgm 933 vzfgmy 663
89 rlupd 977 upd 666
90 mcupdkuiy 1307 upd 666
91 fly 391 fly 669
92 vbkko 1603 kko 678
93 uvrew 465 rew 680
94 hgbhngwvd 901 wvd 690
95 wvdjprmo 1432 wvd 690
96 cgn 22 cgnd 698
97 dngnjv 967 njv 700
98 psqs 841 sqs 720
99 ywv 1180 ptywvlgc 730
100 ypil 1787 ypi 734
101 rwd 1667 srserwd 737
102 jqydasl 1294 jqy 742
103 ckujmc 717 ujm 751
104 dfzxta 662 xta 775
105 bjb 1562 jabjbei 779
106 adwknpll 1242 npl 780
107 kdv 1327 xhkdvqo 789
108 ghj 174 oghj 801
109 lhwfyksz 94 lhw 811
110 nwrrnlhw 929 lhw 811
111 xlhwm 1720 lhw 811
112 ncc 1602 wurhxnccn 814
113 jdslrf 1094 dsl 835
114 ktmw 1738 tmw 844
115 igwsvrzpk 16 gws 856
116 kug 591 pkugls 857
117 befgcpedr 339 fgc 862
118 ojf 1397 ojfpnkla 863
119 gyl 1203 gylxeqzw 872
120 ugcbb 1727 ugc 876
121 arh 1234 karhwhg 878
122 amm 458 ammqdc 883
123 azazryje 636 zaz 900
124 wczazw 1887 zaz 900
125 gkritbe 47 ritb 915
126 vku 1137 yjvkuxued 929
127 rnh 1633 kvyrnhugu 937
128 mzh 1135 xllmwmzhn 940
129 cug 1125 cug 960
130 xgv 1280 xgv 962
131 xusxgv 1436 xgv 962
132 umc 351 lwumcmvoo 980
133 zlb 1900 nkyazlb 991
134 llfkalao 1049 llf 1002
135 sflpbht 991 lpb 1048
136 rairmmcl 442 mmc 1087
137 mmckoln 780 mmc 1087
138 gfxmmcgb 1814 mmc 1087
139 aoj 402 taojlgp 1089
140 mypvzhp 121 ypv 1095
141 moctwaypv 611 ypv 1095
142 rngedn 306 ged 1106
143 djshecy 1408 ecy 1108
144 rairmmcl 442 rmm 1117
145 gzua 1594 zua 1124
146 ytj 416 yytj 1140
147 ubt 300 hubtcfr 1141
148 gqg 1854 ogqgsjqc 1144
149 tfg 1204 xiutfgru 1145
150 avrq 741 avr 1147
151 ytkpvss 440 tkp 1149
152 kug 591 yxsjkug 1176
153 vix 1846 vixsmn 1187
154 qtl 600 qtljkxz 1188
155 lgr 494 dlgrco 1189
156 ryg 864 xlmtryg 1203
157 yskvkxwj 1547 kvk 1205
158 kxhee 1795 xhe 1222
159 hzbcjs 1493 cjs 1224
160 kbi 270 itxlwkbi 1225
161 gdymcam 806 ymca 1232
162 tqr 1419 rxtqrdtl 1236
163 yyz 215 yyzw 1242
164 jyx 1735 mljjyxu 1248
165 aai 1928 umkpaaiwo 1254
166 dsd 1122 dndsdova 1257
167 tor 744 etor 1270
168 vhcyznp 1296 yzn 1278
169 xlc 1947 odxlcjwj 1280
170 mlm 1629 aomlmgtq 1303
171 owm 239 owmugb 1304
172 ynezwaml 507 nez 1308
173 jls 695 jlsve 1325
174 dvm 879 dvmv 1339
175 vsgx 944 dqpihvsgx 1352
176 wfo 768 wfokpjois 1354
177 tltbkinat 1986 nat 1362
178 gyl 1203 gylqte 1363
179 ngg 735 bsnggqbjd 1366
180 fkq 345 jdfkqf 1368
181 ojf 1397 ojfpgfga 1382
182 dqgd 1623 prqbndqgd 1398
183 siu 827 siuypucup 1412
184 yinsoivfd 1895 yin 1414
185 esm 1834 sesmeepz 1417
186 umc 351 umcj 1432
187 wny 866 wnyxamguw 1443
188 ujbhtvnin 399 vni 1444
189 dbq 630 bdbqq 1452
190 ebn 1405 ebngddw 1461
191 zcj 704 rbtjzcjod 1465
192 avn 500 avnspxv 1468
193 vkk 567 hvkk 1477
194 hmm 1441 bgjhmmthz 1483
195 aguakz 614 guak 1487
196 hycvuupgy 141 pgy 1493
197 tizpgymz 280 pgy 1493
198 guk 571 cncxdguk 1502
199 zyw 281 nzywuqs 1504
200 jnz 1558 rxdxsjnzw 1510
201 uuq 92 nxuuqtj 1514
202 qtj 720 nxuuqtj 1514
203 vkk 567 xpbpvkkdc 1518
204 iaa 460 sjiaa 1525
205 txsgmynng 1019 xsg 1526
206 yjvtwc 1107 jvt 1529
207 lnk 1113 hylnknwy 1546
208 szd 635 woszdm 1557
209 osm 253 sosmdp 1567
210 nbd 1067 nbdmmg 1570
211 mmg 1305 nbdmmg 1570
212 wqdsatbd 1536 sat 1585
213 sdlypo 1527 sdl 1596
214 inkynog 288 inky 1600
215 hpwoeclfy 1321 clf 1601
216 wodyqwqf 679 dyq 1603
217 lyw 7 xnalywyuw 1607
218 njm 1825 vjlnjmns 1617
219 njytqhaut 428 qha 1620
220 ilw 389 rilwbk 1647
221 oqo 45 ixoqowkpg 1650
222 odcbcvaun 1386 bcv 1652
223 mastn 434 stn 1662
224 xebhdssit 1091 xeb 1663
225 nmy 782 nmyxj 1671
226 fsqvgdw 673 gdw 1676
227 mwwczhs 482 wcz 1679
228 wczazw 1887 wcz 1679
229 anmryzm 915 ryz 1698
230 rteh 523 rte 1708
231 mlwrguae 817 lwr 1709
232 mbu 819 xpsuqmbuf 1729
233 mmckoln 780 cko 1733
234 lxpg 798 lxp 1734
235 ane 370 vxnanehvk 1746
236 tty 1209 vbttyozui 1752
237 igncdgyjx 332 ign 1753
238 ndignk 621 ign 1753
239 nmy 782 ivnmyba 1780
240 wknrsauuj 2 rsa 1799
241 tgd 165 qtgdidlf 1803
242 iaa 460 yziaazxto 1833
243 xto 1245 yziaazxto 1833
244 zff 755 dpzfft 1857
245 jyx 1735 jwjyxphe 1873
246 ytj 416 eytj 1881
247 lcggwonk 1596 onk 1882
248 zdv 261 zdvxfz 1889
249 xhskcb 417 kcb 1890
250 mrikqkcb 770 kcb 1890
251 psvxqnsap 1352 psv 1898
252 udjswzb 411 jsw 1900
253 rpfjswy 1840 jsw 1900
254 bjaywiso 1677 ayw 1902
255 zfli 130 fli 1906
256 vazx 1215 itvazxw 1918
257 tuw 100 tuwywtbwd 1921
258 vle 1437 ebvleaovm 1937
259 znycsygd 1757 nyc 1944
260 ynezwaml 507 ezw 1952
261 tseezwf 1276 ezw 1952
262 ezwzyfudo 1690 ezw 1952
263 oudiky 1503 dik 1964
264 dikjn 1615 dik 1964
265 oms 106 wpomsudi 1977
266 hhp 1864 hhpkm 1983
Benchmarking
It seems this base R option is slightly slower than @Andrew's approach.
TIC <- function() {
A <- data_set_A$name
B <- data_set_B$name_2
A2B <- sapply(A, function(x) grepl(x, B, fixed = TRUE))
B2A <- sapply(B, function(x) grepl(x, A, fixed = TRUE))
idx <- which(t(A2B) | B2A, arr.ind = TRUE)
cbind(data_set_A[idx[, 1], ], data_set_B[idx[, 2], ])
# data_set_A[unique(idx[, 1]), ]
}
jpiversen_fixed <- function() {
data_set_A %>%
rowwise() %>%
filter(any(str_detect(name, fixed(data_set_B$name_2))) | any(str_detect(data_set_B$name_2, fixed(name)))) %>%
ungroup()
}
andrew <- function() {
nchar_a <- nchar(data_set_A$name)
nchar_b <- nchar(data_set_B$name_2)
pattern_a <- str_c(unique(data_set_A$name[nchar_a != max(nchar_a, na.rm = TRUE)]), collapse = "|")
pattern_b <- str_c(unique(data_set_B$name_2[nchar_b != max(nchar_b, na.rm = TRUE)]), collapse = "|")
idx_a <- data_set_A$name %in% data_set_B$name_2
idx_a[!idx_a] <- str_detect(data_set_A$name[!idx_a], pattern_b)
b_to_check <- data_set_B$name_2[str_detect(data_set_B$name_2, pattern_a)]
idx_a[!idx_a] <- vapply(data_set_A$name[!idx_a], function(name) {
any(grepl(name, b_to_check, fixed = TRUE))
}, logical(1L), USE.NAMES = FALSE)
data_set_A[idx_a, ]
}
library(microbenchmark)
bm <- microbenchmark(
TIC(),
jpiversen_fixed(),
andrew(),
times = 20
)
shows that
> bm
Unit: milliseconds
              expr       min        lq      mean    median        uq       max neval
             TIC()  423.8410  441.3574  492.6091  478.2596  549.2376  611.3841    20
 jpiversen_fixed() 1354.8954 1373.9502 1447.8649 1395.6766 1459.7058 1842.2574    20
          andrew()  329.4821  335.3388  345.8890  341.4758  354.1298  381.6872    20
I converted some .pdf's into .txt files using R and am having trouble finding a way to scrape them to ultimately construct a data frame. I am new to text scraping, so please have mercy on my ignorance.
This is the format of the .txt file and I am mainly interested in the numbers and headers. Any recommendations are much appreciated.
Township of Buena Vista
General Election Results - November 2, 2010
Prepared by the Office of Edward P. McGettigan, Atlantic County Clerk
Township Committee Public Count
Mary Ann
Peter C. Richard Henry L. Total Total Total Total Total
Micheletti-
Bylone, Sr. Harlan Coia, Jr. Machine Vote By Provisional Emergency Public
Levari
Ward Democratic Democratic Republican Count Mail Count Count Count
Republican
District
D-1 205 195 230 223 436 113 16 565
D-2 202 160 275 261 459 459
D-3 331 346 99 87 457 457
D-4 215 205 164 152 377 377
D-5 104 95 169 166 271 271
D-6 77 70 109 108 188 188
I would like the output to be something in tabular form like
Mary Ann
Peter C. Richard Henry L. Total Total Total Total Total
Micheletti-
Bylone, Sr. Harlan Coia, Jr. Machine Vote By Provisional Emergency Public
Levari
Democratic Democratic Republican Count Mail Count Count Count
Republican
District
D-1 205 195 230 223 436 113 16 565
D-2 202 160 275 261 459 459
D-3 331 346 99 87 457 457
D-4 215 205 164 152 377 377
D-5 104 95 169 166 271 271
D-6 77 70 109 108 188 188
except with the names and party affiliation as one character string. The goal is to merge this with other files like it to create a dataset.
It's always going to be ugly, but this should be somewhat automated:
# read it in as individual lines
rl <- readLines(textConnection(txt))
# drop all the extra info at top
rl <- rl[-(1:9)]
# just keep header
dist <- which(rl == "District")
hd <- head(rl, dist - 1)
# make everything same length and split characters
hd <- lapply(strsplit(hd, ""), `length<-`, max(nchar(hd)))
hd <- lapply(hd, function(x) replace(x, is.na(x), " "))
# find where spaces are in common in all rows
wdths <- rle(Reduce(`&`, lapply(hd, `==`, " ")))$lengths
# read it all in, ignoring district row
out <- read.fwf(textConnection(rl[-dist]), widths=wdths )
# keep those columns that aren't all NA
out <- out[!sapply(out, function(x) all(is.na(x)) )]
# collapse the header
hdr <- sapply(head(out, dist - 1),
function(x) trimws(gsub("\\s+", " ", paste(na.omit(x), collapse=" "))))
# finalise by joining
setNames(
data.frame(lapply(tail(out, -(dist-1)), type.convert, as.is=TRUE)),
hdr
)
Result:
# Ward Peter C. Bylone, Sr. Democratic Richard Harlan Democratic
#1 D-1 205 195
#2 D-2 202 160
#3 D-3 331 346
#4 D-4 215 205
#5 D-5 104 95
#6 D-6 77 70
# Mary Ann Micheletti- Levari Republican Henry L. Coia, Jr. Republican
#1 230 223
#2 275 261
#3 99 87
#4 164 152
#5 169 166
#6 109 108
# Total Machine Count Total Vote By Mail Total Provisional Count
#1 436 113 16
#2 459 NA NA
#3 457 NA NA
#4 377 NA NA
#5 271 NA NA
#6 188 NA NA
# Total Emergency Count Total Public Count
#1 NA 565
#2 NA 459
#3 NA 457
#4 NA 377
#5 NA 271
#6 NA 188
The example txt used was:
" Township of Buena Vista\n General Election Results - November 2, 2010\n Prepared by the Office of Edward P. McGettigan, Atlantic County Clerk\n\n\n\n\n Township Committee Public Count\n\n Mary Ann\n Peter C. Richard Henry L. Total Total Total Total Total\n Micheletti-\n Bylone, Sr. Harlan Coia, Jr. Machine Vote By Provisional Emergency Public\n Levari\nWard Democratic Democratic Republican Count Mail Count Count Count\n Republican\nDistrict\n D-1 205 195 230 223 436 113 16 565\n D-2 202 160 275 261 459 459\n D-3 331 346 99 87 457 457\n D-4 215 205 164 152 377 377\n D-5 104 95 169 166 271 271\n D-6 77 70 109 108 188 188"
Perhaps you can generalize this approach, but I don't think it is very stable when used with data other than the example data.
I put your example into a file named example.txt.
library(tidyverse)
input <- read_lines("example.txt")
input[as.logical(cumsum(input == "District"))] %>%
tibble() %>%
slice(-1) %>%
mutate(count = str_replace_all(string = ., "\\s{9,12}", ";")) %>%
select(-.) %>%
separate(col = count, into = c("District", as.character(1:9)), sep = ";") %>%
mutate(across(everything(), str_trim),
across(as.character(1:9), as.integer))
returns
# A tibble: 6 x 10
District `1` `2` `3` `4` `5` `6` `7` `8` `9`
<chr> <int> <int> <int> <int> <int> <int> <int> <int> <int>
1 D-1 205 195 230 223 436 113 16 NA 565
2 D-2 202 160 275 261 459 NA NA NA 459
3 D-3 331 346 99 87 457 NA NA NA 457
4 D-4 215 205 164 152 377 NA NA NA 377
5 D-5 104 95 169 166 271 NA NA NA 271
6 D-6 77 70 109 108 188 NA NA NA 188
Creating the column names (the candidate names) is a tricky task. Depending on the counts, it may be necessary to adjust which runs of spaces are replaced with ";": \\s{9,12} means replace runs of at least 9 and at most 12 space characters.
I am plotting three different sets as three boxplots on one page using ggplot2. In each set there is a point that I would like to highlight and show where it stands compared to the others: is it inside the box or outside?
Here is my data; the first row (CDH) holds the points I want to highlight:
CDH 1KG NHLBI
CDH 301 688 1762
RS0 204 560 21742
RS1 158 1169 1406
RS2 182 1945 1467
RS3 256 2371 1631
RS4 198 580 1765
RS5 193 524 1429
RS6 139 2551 1469
RS7 188 702 1584
RS8 142 4311 1461
RS9 223 916 1591
RS10 250 794 1406
RS11 185 539 1270
RS12 228 641 1786
RS13 152 557 1677
RS14 225 1970 1619
RS15 196 458 1543
RS16 203 2891 1528
RS17 221 1542 1780
RS18 258 1173 1850
RS19 202 718 1651
RS20 191 6314 1564
library(ggplot2)
rm(list = ls())
orig_table = read.table("thedata.csv", header = T, sep = ",")
bb = orig_table # have copy of the data
bb = bb[,-1] # the points in the first row are my interesting points, so I exclude them from the sets for the time being
tt = bb
mydata = cbind(c(tt[,1], tt[,2], tt[,3]), c(rep(1,22),rep(2,22),rep(3,22))) # I form the dataframe
data2 = cbind(c(301,688,1762),c(1,2,3)) # here is the points that I want to highlight - similar to the first raw
colnames(data2) = c("num","gro")
data2 = as.data.frame(data2) # I form them as a dataframe
colnames(mydata) = c("num","gro")
mydata = as.data.frame(mydata)
mydata$gro = factor(mydata$gro, levels=c(1,2,3))
qplot(gro, num, data=mydata, geom=c("boxplot"))+scale_y_log10() # I am making the dataframe out of 21 other ponts
# and here I want to highlight those three values in the "data2" dataframe
I appreciate your help
First, ggplot is a lot easier to use if you use data in long format. melt from reshape2 helps with that:
library(reshape2)
library(ggplot2)
df$highlight <- c(TRUE, rep(FALSE, nrow(df) - 1L)) # tag first row as interesting
df.2 <- melt(df) # convert df to long format
ggplot(subset(df.2, !highlight), aes(x=variable, y=value)) +
geom_boxplot() + scale_y_log10() +
geom_point( # add the highlight points
data=subset(df.2, highlight),
aes(x=variable, y=value),
color="red", size=5
)
Now, all I did was add a TRUE to the first row, melt the data to be compatible with ggplot, and plot the points with highlight==TRUE in addition to the boxplots.
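The same idea works with tidyr::pivot_longer() instead of reshape2::melt(), if you prefer; a sketch (not part of the original answer), reusing df and the highlight flag from above:
library(tidyr)
library(dplyr)

df_long <- pivot_longer(df, -highlight, names_to = "variable", values_to = "value")

ggplot(filter(df_long, !highlight), aes(variable, value)) +
  geom_boxplot() + scale_y_log10() +
  geom_point(data = filter(df_long, highlight), colour = "red", size = 5)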
EDIT: this is how I made the data:
df <- read.table(text=" CDH 1KG NHLBI
CDH 301 688 1762
RS0 204 560 21742
RS1 158 1169 1406
RS2 182 1945 1467
RS3 256 2371 1631
RS4 198 580 1765
RS5 193 524 1429
RS6 139 2551 1469
RS7 188 702 1584
RS8 142 4311 1461
RS9 223 916 1591
RS10 250 794 1406
RS11 185 539 1270
RS12 228 641 1786
RS13 152 557 1677
RS14 225 1970 1619
RS15 196 458 1543
RS16 203 2891 1528
RS17 221 1542 1780
RS18 258 1173 1850
RS19 202 718 1651
RS20 191 6314 1564", header=T)