I have a data table like this one, I would like to know which type of substrate (called "Litières" / "Branchages" / "Racines") contributes the most to each score.
in r :
Substrate<-c('Litières','Litières','Racines','Branchages','Branchages','Litières','Branchages','Litières','Litières' )
One<-c(0,22,216,36,288,351,28,12,0)
Two<-c(574,755,1248,504,882,810,431,537,56)
Three<-c(1352,1248,706,1476,846,855,1334,1152,1628)
Four<-c(261,162,17,171,171,171,394,486,503)
x<-data.frame(Substrate,One,Two,Three,Four)
or in a table :
Substrate
One
Two
Three
Four
Litières
0
574
1352
261
Litières
22
755
1248
162
Racines
216
1248
706
17
Branchages
36
504
1476
171
Branchages
288
882
846
171
Litières
351
810
855
171
Branchages
28
431
1334
394
Litières
12
537
1152
486
Litières
0
56
1628
503
However you will notice that the number of substrate is not the same between each type of substrate. How to cancel this bias?
Thank !
I have two dataset with names and I need to compare names in both datasets. I just need to keep the union of the two datasets based on the names. However, a name is still considered 'matched' if it is part of the another name even if it is not a full match and vice versa. For example, "seb" should match to "seb", but also to "sebas". I am using str_detect(), but it is too slow. I am wondering if there is any way to speed up this process. I tried some other packages and functions, but nothing really improved the speed. I am open for any R or Python solution.
Create two dummy datasets
library(dplyr)
library(stringr)
set.seed(1)
data_set_A <- tibble(name = unique(replicate(2000, paste(sample(letters, runif(1, 3, 10), replace = T), collapse = "")))) %>%
mutate(ID_A = 1:n())
set.seed(2)
data_set_B <- tibble(name_2 = unique(replicate(2000, paste(sample(letters, runif(1, 3, 10), replace = T), collapse = "")))) %>%
mutate(ID_B = 1:n())
Test matching of full matches only
# This is almost instant
data_set_A %>%
rowwise() %>%
filter(any(name %in% data_set_B$name_2) | any(data_set_B$name_2 %in% name)) %>%
ungroup()
# A tibble: 4 x 2
name ID_A
<chr> <int>
1 vnt 112
2 fly 391
3 cug 1125
4 xgv 1280
Include partial matches (This is what I want to optimize)
This of course only gives me the subset of dataset A, but that is ok.
# This takes way too long
data_set_A %>%
rowwise() %>%
filter(any(str_detect(name, data_set_B$name_2)) | any(str_detect(data_set_B$name_2, name))) %>%
ungroup()
A tibble: 237 x 2
name ID_A
<chr> <int>
1 wknrsauuj 2
2 lyw 7
3 igwsvrzpk 16
4 zozxjpu 18
5 cgn 22
6 oqo 45
7 gkritbe 47
8 uuq 92
9 lhwfyksz 94
10 tuw 100
Fuzzyjoin method.
This also works, but is equally slow
bind_rows(
fuzzyjoin::fuzzy_inner_join(
data_set_A,
data_set_B,
by = c("name" = "name_2"),
match_fun = stringr::str_detect
) %>%
select(name, ID_A),
fuzzyjoin::fuzzy_inner_join(
data_set_B,
data_set_A,
by = c("name_2" = "name"),
match_fun = stringr::str_detect
) %>%
select(name, ID_A)
) %>%
distinct()
data.table solution
not much faster unfortunately
library(data.table)
setDT(data_set_A)
setDT(data_set_B)
data_set_A[data_set_A[, .I[any(str_detect(name, data_set_B$name_2)) |
any(str_detect(data_set_B$name_2, name))], by = .(ID_A)]$V1]
This is an [r] option aimed at reducing the number of times you are calling str_detect() (i.e., your example is slow because the function is called several thousand times; and for not using fixed() or fixed = TRUE as jpiversen already pointed out). Answer explained in comments in the code; I will try to jump on tomorrow to explain a bit more.
This should scale reasonably well and be more memory efficient than the current approach too because reduces the rowwise computations to an absolute minimum.
Benchmarks:
n = 2000
# A tibble: 4 × 13
expression min median `itr/sec` mem_alloc `gc/sec` n_itr
<bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl> <int>
1 original() 6.67s 6.67s 0.150 31.95MB 0.300 1
2 using_fixed() 496.54ms 496.54ms 2.01 61.39MB 4.03 1
3 using_map_fixed() 493.35ms 493.35ms 2.03 60.27MB 6.08 1
4 andrew_fun() 167.78ms 167.78ms 5.96 1.59MB 0 1
n = 4000
Note: I am not sure if you need the answer to scale; but the approach of reducing the memory-intensive part does seem to do just that (although the time difference is negligible for n = 4000 for 1 iteration, IMO).
# A tibble: 4 × 13
expression min median `itr/sec` mem_alloc `gc/sec` n_itr
<bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl> <int>
1 original() 26.63s 26.63s 0.0376 122.33MB 0.150 1
2 using_fixed() 1.91s 1.91s 0.525 243.96MB 3.67 1
3 using_map_fixed() 1.87s 1.87s 0.534 236.62MB 3.20 1
4 andrew_fun() 674.36ms 674.36ms 1.48 7.59MB 0 1
Code w/ comments:
# This is so we do not retain the strings with the max number of
# characters in our pattern because we are checking with %in% already
nchar_a = nchar(data_set_A$name)
nchar_b = nchar(data_set_B$name_2)
# Creating large patterns (excluding values w/ max number of characters)
pattern_a = str_c(unique(data_set_A$name[nchar_a != max(nchar_a, na.rm = TRUE)]), collapse = "|")
pattern_b = str_c(unique(data_set_B$name_2[nchar_b != max(nchar_b, na.rm = TRUE)]), collapse = "|")
# First checking using %in%
idx_a = data_set_A$name %in% data_set_B$name_2
# Next, IDing when a(string) matches b(pattern)
idx_a[!idx_a] = str_detect(data_set_A$name[!idx_a], pattern_b)
# IDing a(pattern) matches b(string) so we do not run every row of
# a(as a pattern) against all of b
b_to_check = data_set_B$name_2[str_detect(data_set_B$name_2, pattern_a)]
# Using unmatched values of a as a pattern for the reduced set for b
idx_a[!idx_a] = vapply(data_set_A$name[!idx_a], function(name) {
any(grepl(name, b_to_check, fixed = TRUE))
}, logical(1L), USE.NAMES = FALSE)
data_set_A[idx_a, ]
# A tibble: 237 × 2
name ID_A
<chr> <int>
1 wknrsauuj 2
2 lyw 7
3 igwsvrzpk 16
4 zozxjpu 18
5 cgn 22
6 oqo 45
7 gkritbe 47
8 uuq 92
9 lhwfyksz 94
10 tuw 100
# … with 227 more rows
Reproducible R code for benchmarks
The following code is largely taken from jpiversen who provided a great answer:
library(dplyr)
library(stringr)
n = 2000
set.seed(1)
data_set_A <- tibble(name = unique(replicate(n, paste(sample(letters, runif(1, 3, 10), replace = T), collapse = "")))) %>%
mutate(ID_A = 1:n())
set.seed(2)
data_set_B <- tibble(name_2 = unique(replicate(n, paste(sample(letters, runif(1, 3, 10), replace = T), collapse = "")))) %>%
mutate(ID_B = 1:n())
original <- function() {
data_set_A %>%
rowwise() %>%
filter(any(str_detect(name, data_set_B$name_2)) | any(str_detect(data_set_B$name_2, name))) %>%
ungroup()
}
using_fixed <- function() {
data_set_A %>%
rowwise() %>%
filter(any(str_detect(name, fixed(data_set_B$name_2))) | any(str_detect(data_set_B$name_2, fixed(name)))) %>%
ungroup()
}
using_map_fixed <- function() {
logical_vec <- data_set_A$name %>%
purrr::map_lgl(
~any(stringr::str_detect(.x, fixed(data_set_B$name_2))) ||
any(stringr::str_detect(data_set_B$name_2, fixed(.x)))
)
data_set_A[logical_vec, ]
}
andrew_fun = function() {
nchar_a = nchar(data_set_A$name)
nchar_b = nchar(data_set_B$name_2)
pattern_a = str_c(unique(data_set_A$name[nchar_a != max(nchar_a, na.rm = TRUE)]), collapse = "|")
pattern_b = str_c(unique(data_set_B$name_2[nchar_b != max(nchar_b, na.rm = TRUE)]), collapse = "|")
idx_a = data_set_A$name %in% data_set_B$name_2
idx_a[!idx_a] = str_detect(data_set_A$name[!idx_a], pattern_b)
b_to_check = data_set_B$name_2[str_detect(data_set_B$name_2, pattern_a)]
idx_a[!idx_a] = vapply(data_set_A$name[!idx_a], function(name) {
any(grepl(name, b_to_check, fixed = TRUE))
}, logical(1L), USE.NAMES = FALSE)
data_set_A[idx_a, ]
}
bm = bench::mark(
original(),
using_fixed(),
using_map_fixed(),
andrew_fun(),
iterations = 1
)
TL;DR
The slow part is str_detect(string, pattern).
To speed it up, wrap pattern in fixed() if you got simple strings, and in coll() if you got longer, typical human text.
To get another slight speed boost, rewrite your code using purrr::map_lgl() and use this to subset your data.
Under follows examples, explanations and benchmarks.
Rewriting str_detect() using fixed() or coll()
I believe the easiest fix is to modify how str_detect() uses regex with e.g. stringr::fixed() or stringr::coll().
From ?stringr::str_detect():
Match a fixed string (i.e. by comparing only bytes), using fixed(). This is fast, but approximate. Generally, for matching human text, you'll want coll() which respects character matching rules for the specified locale.
Under is a comparison with your original code:
original <- function() {
data_set_A %>%
rowwise() %>%
filter(any(str_detect(name, data_set_B$name_2)) | any(str_detect(data_set_B$name_2, name))) %>%
ungroup()
}
# Note the use of fixed()
using_fixed <- function() {
data_set_A %>%
rowwise() %>%
filter(any(str_detect(name, fixed(data_set_B$name_2))) | any(str_detect(data_set_B$name_2, fixed(name)))) %>%
ungroup()
}
# Note the use of coll()
using_coll <- function() {
data_set_A %>%
rowwise() %>%
filter(any(str_detect(name, coll(data_set_B$name_2))) | any(str_detect(data_set_B$name_2, coll(name)))) %>%
ungroup()
}
bm <- bench::mark(
original(),
using_fixed(),
using_coll(),
iterations = 20
)
#> Warning: Some expressions had a GC in every iteration; so filtering is disabled.
bm
#> # A tibble: 3 × 6
#> expression min median `itr/sec` mem_alloc `gc/sec`
#> <bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl>
#> 1 original() 6.58s 6.59s 0.152 32.4MB 0.371
#> 2 using_fixed() 501.64ms 505.51ms 1.97 61.4MB 3.94
#> 3 using_coll() 4.48s 4.5s 0.222 61.4MB 0.512
bm %>% ggplot2::autoplot(type = "violin")
#> Loading required namespace: tidyr
Created on 2022-04-02 by the reprex package (v2.0.1)
So, as we can see, wrapping your code in fixed() will make it very fast and works well on your test data. However, it might not work as well for real human text (especially non-ASCII character sets). You should test it on your original data, and use coll() as an alternative if fixed() doesn't work.
Removing rowwise()
Another step you can take to make your code a bit faster is to get rid of rowwise(). I would replace it using purrr::map_lgl() and use this logical vector to subset the dataframe. Under is an example and a benchmark against my functions defined above:
using_map_fixed <- function() {
logical_vec <- data_set_A$name %>%
purrr::map_lgl(
~any(stringr::str_detect(.x, fixed(data_set_B$name_2))) ||
any(stringr::str_detect(data_set_B$name_2, fixed(.x)))
)
data_set_A[logical_vec, ]
}
using_map_coll <- function() {
logical_vec <- data_set_A$name %>%
purrr::map_lgl(
~any(stringr::str_detect(.x, coll(data_set_B$name_2))) ||
any(stringr::str_detect(data_set_B$name_2, coll(.x)))
)
data_set_A[logical_vec, ]
}
bm <- bench::mark(
using_fixed(),
using_map_fixed(),
using_coll(),
using_map_coll(),
iterations = 20
)
#> Warning: Some expressions had a GC in every iteration; so filtering is disabled.
bm
#> # A tibble: 4 × 6
#> expression min median `itr/sec` mem_alloc `gc/sec`
#> <bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl>
#> 1 using_fixed() 503.4ms 507.24ms 1.95 62.9MB 5.37
#> 2 using_map_fixed() 474.28ms 477.63ms 2.09 60.3MB 3.14
#> 3 using_coll() 4.49s 4.5s 0.222 61.4MB 0.489
#> 4 using_map_coll() 4.37s 4.38s 0.228 60.2MB 0.354
Created on 2022-04-02 by the reprex package (v2.0.1)
As we see, this gives another slight speed boost.
Using fixed() with data.table or fuzzyjoin
You can also use fixed() with data.table and fuzzyjoin. I have not included it here for brevity, but my benchmark shows that data.table takes about the same amount of time as my using_map_fixed() above, and fuzzyjoin takes about twice as much time.
This makes sense to me, as the slow part is str_detect(), not the method of joining/filtering, or the underlying data structure.
If you would like to use base R, the code below might be one fast option
A <- data_set_A$name
B <- data_set_B$name_2
A2B <- sapply(A, function(x) grepl(x, B, fixed = TRUE))
B2A <- sapply(B, function(x) grepl(x, A, fixed = TRUE))
idx <- which(t(A2B) | B2A, arr.ind = TRUE)
res <- cbind(data_set_A[idx[, 1], ], data_set_B[idx[, 2], ])
which gives
> res
name ID_A name_2 ID_B
1 arh 1234 pimoarhd 8
2 qtj 720 aqtj 23
3 szcympsn 142 cym 43
4 cymvubnxg 245 cym 43
5 dppvtcymq 355 cym 43
6 kzi 690 kzii 48
7 eyajqchkn 498 chk 53
8 upfzh 522 upf 61
9 ioa 1852 ioadr 63
10 lya 1349 ibelyalvh 64
11 honod 504 ono 71
12 zozxjpu 18 zoz 72
13 jcz 914 cdjczpqg 88
14 ailmjf 623 ilm 99
15 upoux 609 oux 104
16 pouxifvp 1466 oux 104
17 mvob 516 vob 106
18 nqtotvhhm 1088 otv 115
19 wom 202 womtglapx 117
20 qkc 756 dqkcfqpps 118
21 qtl 600 ivqtlymzr 126
22 qqi 1605 owfsqqiyu 153
23 fmjalirze 1470 ali 172
24 ibwfwkyp 1588 fwk 175
25 iat 1258 iatjeg 185
26 osm 253 nviiqosm 199
27 wpj 373 wpjeb 204
28 hahx 515 ahx 213
29 keahxa 1565 ahx 213
30 psf 359 qnpsfo 223
31 saq 1859 saqhu 227
32 cvmkwtx 714 cvm 228
33 ilw 389 pyilwj 231
34 ohwysv 1590 ysv 237
35 utrl 698 trl 244
36 dmttrlcpj 1267 trl 244
37 cpv 236 btcpvmoc 247
38 uto 1047 utoi 257
39 yngunekl 1978 ekl 258
40 vceko 625 vce 265
41 fir 1934 firgk 278
42 qvd 983 eqvdfi 287
43 fir 1934 zwwefir 291
44 idvfkevdf 1380 vdf 312
45 qwdo 1921 qwd 322
46 kam 1205 tlkam 327
47 lck 488 clckjkyzn 329
48 gmspwckw 1015 msp 359
49 ynouuwqtz 1576 nou 360
50 tty 1209 bttyvt 361
51 vkc 999 fmrvkcl 366
52 ipw 1918 fipwjomdu 388
53 zdv 261 zdvkut 410
54 vku 1137 zdvkut 410
55 doby 246 oby 411
56 hycvuupgy 141 uup 421
57 uwlb 1249 wlb 431
58 auj 1452 lcmnauj 444
59 rwd 1667 ukwrwdczs 479
60 ylsihqqor 1290 ihq 483
61 feo 1649 feorvxbm 485
62 zff 755 dohzffujm 499
63 mqutujepu 904 epu 507
64 uiepu 1308 epu 507
65 vahepuk 1434 epu 507
66 cug 1125 accugl 509
67 fir 1934 firwe 517
68 dia 1599 dialeddd 527
69 temiwd 1725 tem 531
70 svofivl 1177 svo 545
71 flm 657 aflm 546
72 vnt 112 vnt 551
73 bhmoskrz 426 osk 558
74 wev 728 shemuwev 569
75 hzpi 1586 hzp 579
76 gvi 1064 mkgvivlfe 582
77 fjb 1398 vkfjbxnjl 589
78 qin 1013 qinp 593
79 ecn 1342 ecnzre 598
80 zre 1610 ecnzre 598
81 xvr 772 dpxvrfmo 623
82 tqr 1419 tqrmztdm 624
83 zmwnf 1571 mwn 626
84 ypil 1787 pil 630
85 mnxlqgfh 1132 nxl 643
86 gse 1563 gseice 646
87 ygk 1309 ygkqrk 655
88 fgm 933 vzfgmy 663
89 rlupd 977 upd 666
90 mcupdkuiy 1307 upd 666
91 fly 391 fly 669
92 vbkko 1603 kko 678
93 uvrew 465 rew 680
94 hgbhngwvd 901 wvd 690
95 wvdjprmo 1432 wvd 690
96 cgn 22 cgnd 698
97 dngnjv 967 njv 700
98 psqs 841 sqs 720
99 ywv 1180 ptywvlgc 730
100 ypil 1787 ypi 734
101 rwd 1667 srserwd 737
102 jqydasl 1294 jqy 742
103 ckujmc 717 ujm 751
104 dfzxta 662 xta 775
105 bjb 1562 jabjbei 779
106 adwknpll 1242 npl 780
107 kdv 1327 xhkdvqo 789
108 ghj 174 oghj 801
109 lhwfyksz 94 lhw 811
110 nwrrnlhw 929 lhw 811
111 xlhwm 1720 lhw 811
112 ncc 1602 wurhxnccn 814
113 jdslrf 1094 dsl 835
114 ktmw 1738 tmw 844
115 igwsvrzpk 16 gws 856
116 kug 591 pkugls 857
117 befgcpedr 339 fgc 862
118 ojf 1397 ojfpnkla 863
119 gyl 1203 gylxeqzw 872
120 ugcbb 1727 ugc 876
121 arh 1234 karhwhg 878
122 amm 458 ammqdc 883
123 azazryje 636 zaz 900
124 wczazw 1887 zaz 900
125 gkritbe 47 ritb 915
126 vku 1137 yjvkuxued 929
127 rnh 1633 kvyrnhugu 937
128 mzh 1135 xllmwmzhn 940
129 cug 1125 cug 960
130 xgv 1280 xgv 962
131 xusxgv 1436 xgv 962
132 umc 351 lwumcmvoo 980
133 zlb 1900 nkyazlb 991
134 llfkalao 1049 llf 1002
135 sflpbht 991 lpb 1048
136 rairmmcl 442 mmc 1087
137 mmckoln 780 mmc 1087
138 gfxmmcgb 1814 mmc 1087
139 aoj 402 taojlgp 1089
140 mypvzhp 121 ypv 1095
141 moctwaypv 611 ypv 1095
142 rngedn 306 ged 1106
143 djshecy 1408 ecy 1108
144 rairmmcl 442 rmm 1117
145 gzua 1594 zua 1124
146 ytj 416 yytj 1140
147 ubt 300 hubtcfr 1141
148 gqg 1854 ogqgsjqc 1144
149 tfg 1204 xiutfgru 1145
150 avrq 741 avr 1147
151 ytkpvss 440 tkp 1149
152 kug 591 yxsjkug 1176
153 vix 1846 vixsmn 1187
154 qtl 600 qtljkxz 1188
155 lgr 494 dlgrco 1189
156 ryg 864 xlmtryg 1203
157 yskvkxwj 1547 kvk 1205
158 kxhee 1795 xhe 1222
159 hzbcjs 1493 cjs 1224
160 kbi 270 itxlwkbi 1225
161 gdymcam 806 ymca 1232
162 tqr 1419 rxtqrdtl 1236
163 yyz 215 yyzw 1242
164 jyx 1735 mljjyxu 1248
165 aai 1928 umkpaaiwo 1254
166 dsd 1122 dndsdova 1257
167 tor 744 etor 1270
168 vhcyznp 1296 yzn 1278
169 xlc 1947 odxlcjwj 1280
170 mlm 1629 aomlmgtq 1303
171 owm 239 owmugb 1304
172 ynezwaml 507 nez 1308
173 jls 695 jlsve 1325
174 dvm 879 dvmv 1339
175 vsgx 944 dqpihvsgx 1352
176 wfo 768 wfokpjois 1354
177 tltbkinat 1986 nat 1362
178 gyl 1203 gylqte 1363
179 ngg 735 bsnggqbjd 1366
180 fkq 345 jdfkqf 1368
181 ojf 1397 ojfpgfga 1382
182 dqgd 1623 prqbndqgd 1398
183 siu 827 siuypucup 1412
184 yinsoivfd 1895 yin 1414
185 esm 1834 sesmeepz 1417
186 umc 351 umcj 1432
187 wny 866 wnyxamguw 1443
188 ujbhtvnin 399 vni 1444
189 dbq 630 bdbqq 1452
190 ebn 1405 ebngddw 1461
191 zcj 704 rbtjzcjod 1465
192 avn 500 avnspxv 1468
193 vkk 567 hvkk 1477
194 hmm 1441 bgjhmmthz 1483
195 aguakz 614 guak 1487
196 hycvuupgy 141 pgy 1493
197 tizpgymz 280 pgy 1493
198 guk 571 cncxdguk 1502
199 zyw 281 nzywuqs 1504
200 jnz 1558 rxdxsjnzw 1510
201 uuq 92 nxuuqtj 1514
202 qtj 720 nxuuqtj 1514
203 vkk 567 xpbpvkkdc 1518
204 iaa 460 sjiaa 1525
205 txsgmynng 1019 xsg 1526
206 yjvtwc 1107 jvt 1529
207 lnk 1113 hylnknwy 1546
208 szd 635 woszdm 1557
209 osm 253 sosmdp 1567
210 nbd 1067 nbdmmg 1570
211 mmg 1305 nbdmmg 1570
212 wqdsatbd 1536 sat 1585
213 sdlypo 1527 sdl 1596
214 inkynog 288 inky 1600
215 hpwoeclfy 1321 clf 1601
216 wodyqwqf 679 dyq 1603
217 lyw 7 xnalywyuw 1607
218 njm 1825 vjlnjmns 1617
219 njytqhaut 428 qha 1620
220 ilw 389 rilwbk 1647
221 oqo 45 ixoqowkpg 1650
222 odcbcvaun 1386 bcv 1652
223 mastn 434 stn 1662
224 xebhdssit 1091 xeb 1663
225 nmy 782 nmyxj 1671
226 fsqvgdw 673 gdw 1676
227 mwwczhs 482 wcz 1679
228 wczazw 1887 wcz 1679
229 anmryzm 915 ryz 1698
230 rteh 523 rte 1708
231 mlwrguae 817 lwr 1709
232 mbu 819 xpsuqmbuf 1729
233 mmckoln 780 cko 1733
234 lxpg 798 lxp 1734
235 ane 370 vxnanehvk 1746
236 tty 1209 vbttyozui 1752
237 igncdgyjx 332 ign 1753
238 ndignk 621 ign 1753
239 nmy 782 ivnmyba 1780
240 wknrsauuj 2 rsa 1799
241 tgd 165 qtgdidlf 1803
242 iaa 460 yziaazxto 1833
243 xto 1245 yziaazxto 1833
244 zff 755 dpzfft 1857
245 jyx 1735 jwjyxphe 1873
246 ytj 416 eytj 1881
247 lcggwonk 1596 onk 1882
248 zdv 261 zdvxfz 1889
249 xhskcb 417 kcb 1890
250 mrikqkcb 770 kcb 1890
251 psvxqnsap 1352 psv 1898
252 udjswzb 411 jsw 1900
253 rpfjswy 1840 jsw 1900
254 bjaywiso 1677 ayw 1902
255 zfli 130 fli 1906
256 vazx 1215 itvazxw 1918
257 tuw 100 tuwywtbwd 1921
258 vle 1437 ebvleaovm 1937
259 znycsygd 1757 nyc 1944
260 ynezwaml 507 ezw 1952
261 tseezwf 1276 ezw 1952
262 ezwzyfudo 1690 ezw 1952
263 oudiky 1503 dik 1964
264 dikjn 1615 dik 1964
265 oms 106 wpomsudi 1977
266 hhp 1864 hhpkm 1983
Benchmarking
It seems this base R option is slightly slower than #Andrew's approach.
TIC <- function() {
A <- data_set_A$name
B <- data_set_B$name_2
A2B <- sapply(A, function(x) grepl(x, B, fixed = TRUE))
B2A <- sapply(B, function(x) grepl(x, A, fixed = TRUE))
idx <- which(t(A2B) | B2A, arr.ind = TRUE)
cbind(data_set_A[idx[, 1], ], data_set_B[idx[, 2], ])
# data_set_A[unique(idx[, 1]), ]
}
jpiversen_fixed <- function() {
data_set_A %>%
rowwise() %>%
filter(any(str_detect(name, fixed(data_set_B$name_2))) | any(str_detect(data_set_B$name_2, fixed(name)))) %>%
ungroup()
}
andrew <- function() {
nchar_a <- nchar(data_set_A$name)
nchar_b <- nchar(data_set_B$name_2)
pattern_a <- str_c(unique(data_set_A$name[nchar_a != max(nchar_a, na.rm = TRUE)]), collapse = "|")
pattern_b <- str_c(unique(data_set_B$name_2[nchar_b != max(nchar_b, na.rm = TRUE)]), collapse = "|")
idx_a <- data_set_A$name %in% data_set_B$name_2
idx_a[!idx_a] <- str_detect(data_set_A$name[!idx_a], pattern_b)
b_to_check <- data_set_B$name_2[str_detect(data_set_B$name_2, pattern_a)]
idx_a[!idx_a] <- vapply(data_set_A$name[!idx_a], function(name) {
any(grepl(name, b_to_check, fixed = TRUE))
}, logical(1L), USE.NAMES = FALSE)
data_set_A[idx_a, ]
}
bm <- microbenchmark(
TIC(),
jpiversen_fixed(),
andrew(),
times = 20
)
shows that
> bm
Unit: milliseconds
expr min lq mean median uq max
TIC() 423.8410 441.3574 492.6091 478.2596 549.2376 611.3841
jpiversen_fixed() 1354.8954 1373.9502 1447.8649 1395.6766 1459.7058 1842.2574
andrew() 329.4821 335.3388 345.8890 341.4758 354.1298 381.6872
neval
20
20
20
I am getting an error that I don't really understand at all. I was just messing around with generating some sequences, and I came across this problem:
This should create a sequence of 50 numbers.
seq.int(from=1,to=1000,by=5,length.out=50)
But if I enter this in the console I get the error message:
Error in seq.int(from = 1, to = 1000, by = 5, length.out = 50) :
too many arguments
If I look at the help (?seq), in the Usage section there is this line in there which makes it seem as though I called the function correctly, and it allows this many number of arguments:
seq.int(from, to, by, length.out, along.with, ...)
So what the heck is going on? I am I missing something fundamental, or are the docs out of date?
NOTE
The arguments I am providing to the function in the code sample are just for sake of example. I'm not trying to solve a particular problem, just curious as to why I get the error.
It's not clear what you expect as output from this line of code, and you're getting an error because R doesn't want to resolve the contradictions for you.
Here is some valid output, and the line of code you'd use to achieve each. This is a case where you need to decide for yourself which approach to use given the task you have in mind:
Override length.out
[1] 1 6 11 16 21 26 31 36 41 46 51 56 61 66 71 76 81 86
...
[199] 991 996
#via:
seq.int(from=1,to=1000,by=5)
Override by
[1] 1.00000 21.38776 41.77551 62.16327 82.55102 102.93878 123.32653
[8] 143.71429 164.10204 184.48980 204.87755 225.26531 245.65306 266.04082
[15] 286.42857 306.81633 327.20408 347.59184 367.97959 388.36735 408.75510
[22] 429.14286 449.53061 469.91837 490.30612 510.69388 531.08163 551.46939
[29] 571.85714 592.24490 612.63265 633.02041 653.40816 673.79592 694.18367
[36] 714.57143 734.95918 755.34694 775.73469 796.12245 816.51020 836.89796
[43] 857.28571 877.67347 898.06122 918.44898 938.83673 959.22449 979.61224
[50] 1000.00000
#via:
seq.int(from=1,to=1000,length.out=50)
Override to
[1] 1 6 11 16 21 26 31 36 41 46 51 56 61 66 71 76 81 86 91 96 101
[22] 106 111 116 121 126 131 136 141 146 151 156 161 166 171 176 181 186 191 196 201 206
[43] 211 216 221 226 231 236 241 246
#via:
seq.int(from=1,by=5,length.out=50)
Override from
[1] 755 760 765 770 775 780 785 790 795 800 805 810 815 820 825 830 835 840
[19] 845 850 855 860 865 870 875 880 885 890 895 900 905 910 915 920 925 930
[37] 935 940 945 950 955 960 965 970 975 980 985 990 995 1000
#via:
seq.int(to=1000,by=5,length.out=50)
A priori, R has no way of telling which of the above you'd like, nor should it. You as programmer need to decide which inputs take precedence.
And you're right that this should be documented; for now, take a look at the source of .Primitive("seq.int"), as linked originally by #nongkrong.
No, there is nothing fundamental to the R language that I was missing that was the source of the problem. The problem is that the documents, at least at time of writing, are misleading and/or incorrect.