marking a point in the boxplot - r

I am plotting three different sets as three boxplot in 1 page using ggplot2. In each set there is a point that I would like to highlight, and illustrate where the point stands compare to the others, is it inside the box ? or the outside.
Here is my datapoint
CDH 1KG NHLBI
CDH 301 688 1762
RS0 204 560 21742
RS1 158 1169 1406
RS2 182 1945 1467
RS3 256 2371 1631
RS4 198 580 1765
RS5 193 524 1429
RS6 139 2551 1469
RS7 188 702 1584
RS8 142 4311 1461
RS9 223 916 1591
RS10 250 794 1406
RS11 185 539 1270
RS12 228 641 1786
RS13 152 557 1677
RS14 225 1970 1619
RS15 196 458 1543
RS16 203 2891 1528
RS17 221 1542 1780
RS18 258 1173 1850
RS19 202 718 1651
RS20 191 6314 1564
library(ggplot2)
rm(list = ls())
orig_table = read.table("thedata.csv", header = T, sep = ",")
bb = orig_table # have copy of the data
bb = bb[,-1] # since these points, the ones in the first raw are my interesting point, I exclude them from the sets for the time being
tt = bb
mydata = cbind(c(tt[,1], tt[,2], tt[,3]), c(rep(1,22),rep(2,22),rep(3,22))) # I form the dataframe
data2 = cbind(c(301,688,1762),c(1,2,3)) # here is the points that I want to highlight - similar to the first raw
colnames(data2) = c("num","gro")
data2 = as.data.frame(data2) # I form them as a dataframe
colnames(mydata) = c("num","gro")
mydata = as.data.frame(mydata)
mydata$gro = factor(mydata$gro, levels=c(1,2,3))
qplot(gro, num, data=mydata, geom=c("boxplot"))+scale_y_log10() # I am making the dataframe out of 21 other ponts
# and here I want to highlight those three values in the "data2" dataframe
I appreciate your help

First, ggplot is a lot easier to use if you use data in long format. melt from reshape2 helps with that:
library(reshape2)
library(ggplot2)
df$highlight <- c(TRUE, rep(FALSE, nrow(df) - 1L)) # tag first row as interesting
df.2 <- melt(df) # convert df to long format
ggplot(subset(df.2, !highlight), aes(x=variable, y=value)) +
geom_boxplot() + scale_y_log10() +
geom_point( # add the highlight points
data=subset(df.2, highlight),
aes(x=variable, y=value),
color="red", size=5
)
Now, all I did was add a TRUE, to the first row, melted the data to be compatible with ggplot, and plotted the points with highlight==TRUE in addition to the boxplots.
EDIT: this is how I made the data:
df <- read.table(text=" CDH 1KG NHLBI
CDH 301 688 1762
RS0 204 560 21742
RS1 158 1169 1406
RS2 182 1945 1467
RS3 256 2371 1631
RS4 198 580 1765
RS5 193 524 1429
RS6 139 2551 1469
RS7 188 702 1584
RS8 142 4311 1461
RS9 223 916 1591
RS10 250 794 1406
RS11 185 539 1270
RS12 228 641 1786
RS13 152 557 1677
RS14 225 1970 1619
RS15 196 458 1543
RS16 203 2891 1528
RS17 221 1542 1780
RS18 258 1173 1850
RS19 202 718 1651
RS20 191 6314 1564", header=T)

Related

R: Different Ways to Name Matrices

I am working with the R programming language.
In the "datasets" library in R, there is a data set called "eurodist" that contains the distance between each combination of cities :
library(datasets)
This data set can be then converted into a "matrix":
eurodist = as.matrix(eurodist)
Athens Barcelona Brussels Calais Cherbourg Cologne Copenhagen Geneva Gibraltar Hamburg Hook of Holland Lisbon Lyons Madrid Marseilles Milan Munich Paris Rome Stockholm Vienna
Athens 0 3313 2963 3175 3339 2762 3276 2610 4485 2977 3030 4532 2753 3949 2865 2282 2179 3000 817 3927 1991
Barcelona 3313 0 1318 1326 1294 1498 2218 803 1172 2018 1490 1305 645 636 521 1014 1365 1033 1460 2868 1802
Brussels 2963 1318 0 204 583 206 966 677 2256 597 172 2084 690 1558 1011 925 747 285 1511 1616 1175
Calais 3175 1326 204 0 460 409 1136 747 2224 714 330 2052 739 1550 1059 1077 977 280 1662 1786 1381
Cherbourg 3339 1294 583 460 0 785 1545 853 2047 1115 731 1827 789 1347 1101 1209 1160 340 1794 2196 1588
Cologne 2762 1498 206 409 785 0 760 1662 2436 460 269 2290 714 1764 1035 911 583 465 1497 1403 937
Copenhagen 3276 2218 966 1136 1545 760 0 1418 3196 460 269 2971 1458 2498 1778 1537 1104 1176 2050 650 1455
Geneva 2610 803 677 747 853 1662 1418 0 1975 1118 895 1936 158 1439 425 328 591 513 995 2068 1019
Gibraltar 4485 1172 2256 2224 2047 2436 3196 1975 0 2897 2428 676 1817 698 1693 2185 2565 1971 2631 3886 2974
Hamburg 2977 2018 597 714 1115 460 460 1118 2897 0 550 2671 1159 2198 1479 1238 805 877 1751 949 1155
Hook of Holland 3030 1490 172 330 731 269 269 895 2428 550 0 2280 863 1730 1183 1098 851 457 1683 1500 1205
Lisbon 4532 1305 2084 2052 1827 2290 2971 1936 676 2671 2280 0 1178 668 1762 2250 2507 1799 2700 3231 2937
Lyons 2753 645 690 739 789 714 1458 158 1817 1159 863 1178 0 1281 320 328 724 471 1048 2108 1157
Madrid 3949 636 1558 1550 1347 1764 2498 1439 698 2198 1730 668 1281 0 1157 1724 2010 1273 2097 3188 2409
Marseilles 2865 521 1011 1059 1101 1035 1778 425 1693 1479 1183 1762 320 1157 0 618 1109 792 1011 2428 1363
Milan 2282 1014 925 1077 1209 911 1537 328 2185 1238 1098 2250 328 1724 618 0 331 856 586 2187 898
Munich 2179 1365 747 977 1160 583 1104 591 2565 805 851 2507 724 2010 1109 331 0 821 946 1754 428
Paris 3000 1033 285 280 340 465 1176 513 1971 877 457 1799 471 1273 792 856 821 0 1476 1827 1249
Rome 817 1460 1511 1662 1794 1497 2050 995 2631 1751 1683 2700 1048 2097 1011 586 946 1476 0 2707 1209
Stockholm 3927 2868 1616 1786 2196 1403 650 2068 3886 949 1500 3231 2108 3188 2428 2187 1754 1827 2707 0 2105
Vienna 1991 1802 1175 1381 1588 937 1455 1019 2974 1155 1205 2937 1157 2409 1363 898 428 1249 1209 2105 0
My Question: Suppose I have 6 cities and the Longitude/Latitude for each of these cities :
data_1 = data.frame(id = c(1,2,3), long = rnorm(3, -74, 1 ), lat = rnorm(3, 40, 1 ))
data_2 = data.frame(id = c(4,5,6), long = rnorm(3, -78, 1 ), lat = rnorm(3, 42, 1 ))
final_data = rbind(data_1, data_2)
final_data$names <- c("city_1", "city_2", "city_3", "city_4", "city_5", "city_6")
id long lat names
1 1 -75.28447 40.21079 city_1
2 2 -73.29385 40.09104 city_2
3 3 -75.12737 38.88355 city_3
4 4 -79.42325 42.61917 city_4
5 5 -77.82508 41.11707 city_5
6 6 -77.62831 39.94935 city_6
I can also make a similar matrix for these cities that contains the distance between each pair of cities:
library(geosphere)
N <- nrow(final_data)
dists <- outer(seq_len(N), seq_len(N), function(a,b) {
geosphere::distHaversine(final_data[a,2:3], final_data[b,2:3]) # Notes 1, 2
})
D <- as.matrix(dists)
[,1] [,2] [,3] [,4] [,5] [,6]
[1,] 0.0 169895.7 148361.1 437239.3 237056.7 201742.0
[2,] 169895.7 0.0 207068.8 584183.9 399577.9 369814.4
[3,] 148361.1 207068.8 0.0 551356.0 338698.3 245620.3
[4,] 437239.3 584183.9 551356.0 0.0 213326.6 332955.7
[5,] 237056.7 399577.9 338698.3 213326.6 0.0 131051.7
[6,] 201742.0 369814.4 245620.3 332955.7 131051.7 0.0
How can I make my matrix look the same way as the "eurodist" matrix?
I had thought of the following way to do this:
colnames(dists) <- c("city_1", "city_2", "city_3", "city_4", "city_5", "city_6")
rownames(dists) <- c("city_1", "city_2", "city_3", "city_4", "city_5", "city_6")
city_1 city_2 city_3 city_4 city_5 city_6
city_1 0.0 169895.7 148361.1 437239.3 237056.7 201742.0
city_2 169895.7 0.0 207068.8 584183.9 399577.9 369814.4
city_3 148361.1 207068.8 0.0 551356.0 338698.3 245620.3
city_4 437239.3 584183.9 551356.0 0.0 213326.6 332955.7
city_5 237056.7 399577.9 338698.3 213326.6 0.0 131051.7
city_6 201742.0 369814.4 245620.3 332955.7 131051.7 0.0
In the end, I would like to use the above matrix as input for a customized Travelling Salesman Problem (R: Customizing the Travelling Salesman Problem) - e.g. Try to find the optimal path when you are forced to start at "city 4" and the third city should be "city 5":
D <- dists
transformMatrix <- function(fixed_points, D){
if(length(fixed_points) == 0) return(D)
p <- integer(nrow(D))
pos <- match(names(fixed_points), colnames(D))
p[fixed_points] <- pos
p[-fixed_points] <- sample(setdiff(seq_len(nrow(D)), pos))
D[p, p]
}
fixed_points <- c(
"city_4" = 1, "city_5" = 3
)
D_perm <- transformMatrix(fixed_points, D)
feasiblePopulation <- function(n, size, fixed_points){
positions <- setdiff(seq_len(n), fixed_points)
m <- matrix(0, size, n)
if(length(fixed_points) > 0){
m[, fixed_points] <- rep(fixed_points, each = size)
for(i in seq_len(size))
m[i, -fixed_points] <- sample(positions)
} else {
for(i in seq_len(size))
m[i,] <- sample(positions)
}
m
}
mutation <- function(n, fixed_points){
positions <- setdiff(seq_len(n), fixed_points)
function(obj, parent){
vec <- obj#population[parent,]
if(length(positions) < 2) return(vec)
indices <- sample(positions, 2)
replace(vec, indices, vec[rev(indices)])
}
}
fitness <- function(tour, distMatrix) {
tour <- c(tour, tour[1])
route <- embed(tour, 2)[,2:1]
1/sum(distMatrix[route])
}
popSize = 500
res <- ga(
type = "permutation",
fitness = fitness,
distMatrix = D_perm,
lower = 1,
upper = nrow(D_perm),
mutation = mutation(nrow(D_perm), fixed_points),
crossover = gaperm_pmxCrossover,
suggestions = feasiblePopulation(nrow(D_perm), popSize, fixed_points),
popSize = popSize,
maxiter = 5000,
run = 500,
pmutation = 0.2
)
colnames(D_perm)[res#solution[1,]]
This results in the following error:
Error in if (object#run >= run) break :
missing value where TRUE/FALSE needed
In addition: Warning messages:
1: In max(fitness) : no non-missing arguments to max; returning -Inf
2: In max(Fitness, na.rm = TRUE) :
no non-missing arguments to max; returning -Inf
3: In max(fitness) : no non-missing arguments to max; returning -Inf
4: In max(x, na.rm = TRUE) :
no non-missing arguments to max; returning -Inf
Is the above error because I have not made "distance matrix" (i.e. "D") properly? Is there a different way to name the columns and rows of a matrix in R?
Thanks!
Note : If anyone knows another way to solve this constraint Travelling Salesman Problem with custom cities using the Genetic Algorithm in R (e.g. different objective function, different way to specify constraints, etc.), please let me know. I am open to different ways to solving this problem!
That’s not the problem. The error says the it encountered code:
if (object#run >= run) break
… and either object#run or run had length 0 which the if function cannot handle gracefully. It may be an error in the ga function itself or in the arguments to it.
To address the direct question about how to make the distance matrix look like the example in eurodist: There is a dimnames attribute for matrices. You need to assign a list with a rownames and a colnames value in it and assign that list to the dimnames attribute.
dimnames(D) <- list(rownames=final_data$names,
colnames=final_data$names)
Then when you run your code you get an error from the ga(...) call:
Error in gaperm_pmxCrossover_Rcpp(object, parents) : index error
Looking at the problem setup, your population size appears much larger than needed. If you drop it down a bit to say 100 or 200, then the results begin to be computed.
popSize=200;
# now calculate a res
colnames(D_perm)[res#solution[1,]]
#[1] "city_4" "city_6" "city_5" "city_1" "city_3" "city_2"
popSize=100
colnames(D_perm)[res#solution[1,]]
#[1] "city_4" "city_6" "city_5" "city_1" "city_3" "city_2"
popSiz=20
colnames(D_perm)[res#solution[1,]]
#[1] "city_4" "city_6" "city_5" "city_1" "city_3" "city_2"
It doesn't seem "proper" that a population size larger than needed should cause an obscure error, so you might contact the package maintainer with your example (now that it has been "dressed up" properly.)

fast partial match checking in R (or Python or Julia)

I have two dataset with names and I need to compare names in both datasets. I just need to keep the union of the two datasets based on the names. However, a name is still considered 'matched' if it is part of the another name even if it is not a full match and vice versa. For example, "seb" should match to "seb", but also to "sebas". I am using str_detect(), but it is too slow. I am wondering if there is any way to speed up this process. I tried some other packages and functions, but nothing really improved the speed. I am open for any R or Python solution.
Create two dummy datasets
library(dplyr)
library(stringr)
set.seed(1)
data_set_A <- tibble(name = unique(replicate(2000, paste(sample(letters, runif(1, 3, 10), replace = T), collapse = "")))) %>%
mutate(ID_A = 1:n())
set.seed(2)
data_set_B <- tibble(name_2 = unique(replicate(2000, paste(sample(letters, runif(1, 3, 10), replace = T), collapse = "")))) %>%
mutate(ID_B = 1:n())
Test matching of full matches only
# This is almost instant
data_set_A %>%
rowwise() %>%
filter(any(name %in% data_set_B$name_2) | any(data_set_B$name_2 %in% name)) %>%
ungroup()
# A tibble: 4 x 2
name ID_A
<chr> <int>
1 vnt 112
2 fly 391
3 cug 1125
4 xgv 1280
Include partial matches (This is what I want to optimize)
This of course only gives me the subset of dataset A, but that is ok.
# This takes way too long
data_set_A %>%
rowwise() %>%
filter(any(str_detect(name, data_set_B$name_2)) | any(str_detect(data_set_B$name_2, name))) %>%
ungroup()
A tibble: 237 x 2
name ID_A
<chr> <int>
1 wknrsauuj 2
2 lyw 7
3 igwsvrzpk 16
4 zozxjpu 18
5 cgn 22
6 oqo 45
7 gkritbe 47
8 uuq 92
9 lhwfyksz 94
10 tuw 100
Fuzzyjoin method.
This also works, but is equally slow
bind_rows(
fuzzyjoin::fuzzy_inner_join(
data_set_A,
data_set_B,
by = c("name" = "name_2"),
match_fun = stringr::str_detect
) %>%
select(name, ID_A),
fuzzyjoin::fuzzy_inner_join(
data_set_B,
data_set_A,
by = c("name_2" = "name"),
match_fun = stringr::str_detect
) %>%
select(name, ID_A)
) %>%
distinct()
data.table solution
not much faster unfortunately
library(data.table)
setDT(data_set_A)
setDT(data_set_B)
data_set_A[data_set_A[, .I[any(str_detect(name, data_set_B$name_2)) |
any(str_detect(data_set_B$name_2, name))], by = .(ID_A)]$V1]
This is an [r] option aimed at reducing the number of times you are calling str_detect() (i.e., your example is slow because the function is called several thousand times; and for not using fixed() or fixed = TRUE as jpiversen already pointed out). Answer explained in comments in the code; I will try to jump on tomorrow to explain a bit more.
This should scale reasonably well and be more memory efficient than the current approach too because reduces the rowwise computations to an absolute minimum.
Benchmarks:
n = 2000
# A tibble: 4 × 13
expression min median `itr/sec` mem_alloc `gc/sec` n_itr
<bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl> <int>
1 original() 6.67s 6.67s 0.150 31.95MB 0.300 1
2 using_fixed() 496.54ms 496.54ms 2.01 61.39MB 4.03 1
3 using_map_fixed() 493.35ms 493.35ms 2.03 60.27MB 6.08 1
4 andrew_fun() 167.78ms 167.78ms 5.96 1.59MB 0 1
n = 4000
Note: I am not sure if you need the answer to scale; but the approach of reducing the memory-intensive part does seem to do just that (although the time difference is negligible for n = 4000 for 1 iteration, IMO).
# A tibble: 4 × 13
expression min median `itr/sec` mem_alloc `gc/sec` n_itr
<bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl> <int>
1 original() 26.63s 26.63s 0.0376 122.33MB 0.150 1
2 using_fixed() 1.91s 1.91s 0.525 243.96MB 3.67 1
3 using_map_fixed() 1.87s 1.87s 0.534 236.62MB 3.20 1
4 andrew_fun() 674.36ms 674.36ms 1.48 7.59MB 0 1
Code w/ comments:
# This is so we do not retain the strings with the max number of
# characters in our pattern because we are checking with %in% already
nchar_a = nchar(data_set_A$name)
nchar_b = nchar(data_set_B$name_2)
# Creating large patterns (excluding values w/ max number of characters)
pattern_a = str_c(unique(data_set_A$name[nchar_a != max(nchar_a, na.rm = TRUE)]), collapse = "|")
pattern_b = str_c(unique(data_set_B$name_2[nchar_b != max(nchar_b, na.rm = TRUE)]), collapse = "|")
# First checking using %in%
idx_a = data_set_A$name %in% data_set_B$name_2
# Next, IDing when a(string) matches b(pattern)
idx_a[!idx_a] = str_detect(data_set_A$name[!idx_a], pattern_b)
# IDing a(pattern) matches b(string) so we do not run every row of
# a(as a pattern) against all of b
b_to_check = data_set_B$name_2[str_detect(data_set_B$name_2, pattern_a)]
# Using unmatched values of a as a pattern for the reduced set for b
idx_a[!idx_a] = vapply(data_set_A$name[!idx_a], function(name) {
any(grepl(name, b_to_check, fixed = TRUE))
}, logical(1L), USE.NAMES = FALSE)
data_set_A[idx_a, ]
# A tibble: 237 × 2
name ID_A
<chr> <int>
1 wknrsauuj 2
2 lyw 7
3 igwsvrzpk 16
4 zozxjpu 18
5 cgn 22
6 oqo 45
7 gkritbe 47
8 uuq 92
9 lhwfyksz 94
10 tuw 100
# … with 227 more rows
Reproducible R code for benchmarks
The following code is largely taken from jpiversen who provided a great answer:
library(dplyr)
library(stringr)
n = 2000
set.seed(1)
data_set_A <- tibble(name = unique(replicate(n, paste(sample(letters, runif(1, 3, 10), replace = T), collapse = "")))) %>%
mutate(ID_A = 1:n())
set.seed(2)
data_set_B <- tibble(name_2 = unique(replicate(n, paste(sample(letters, runif(1, 3, 10), replace = T), collapse = "")))) %>%
mutate(ID_B = 1:n())
original <- function() {
data_set_A %>%
rowwise() %>%
filter(any(str_detect(name, data_set_B$name_2)) | any(str_detect(data_set_B$name_2, name))) %>%
ungroup()
}
using_fixed <- function() {
data_set_A %>%
rowwise() %>%
filter(any(str_detect(name, fixed(data_set_B$name_2))) | any(str_detect(data_set_B$name_2, fixed(name)))) %>%
ungroup()
}
using_map_fixed <- function() {
logical_vec <- data_set_A$name %>%
purrr::map_lgl(
~any(stringr::str_detect(.x, fixed(data_set_B$name_2))) ||
any(stringr::str_detect(data_set_B$name_2, fixed(.x)))
)
data_set_A[logical_vec, ]
}
andrew_fun = function() {
nchar_a = nchar(data_set_A$name)
nchar_b = nchar(data_set_B$name_2)
pattern_a = str_c(unique(data_set_A$name[nchar_a != max(nchar_a, na.rm = TRUE)]), collapse = "|")
pattern_b = str_c(unique(data_set_B$name_2[nchar_b != max(nchar_b, na.rm = TRUE)]), collapse = "|")
idx_a = data_set_A$name %in% data_set_B$name_2
idx_a[!idx_a] = str_detect(data_set_A$name[!idx_a], pattern_b)
b_to_check = data_set_B$name_2[str_detect(data_set_B$name_2, pattern_a)]
idx_a[!idx_a] = vapply(data_set_A$name[!idx_a], function(name) {
any(grepl(name, b_to_check, fixed = TRUE))
}, logical(1L), USE.NAMES = FALSE)
data_set_A[idx_a, ]
}
bm = bench::mark(
original(),
using_fixed(),
using_map_fixed(),
andrew_fun(),
iterations = 1
)
TL;DR
The slow part is str_detect(string, pattern).
To speed it up, wrap pattern in fixed() if you got simple strings, and in coll() if you got longer, typical human text.
To get another slight speed boost, rewrite your code using purrr::map_lgl() and use this to subset your data.
Under follows examples, explanations and benchmarks.
Rewriting str_detect() using fixed() or coll()
I believe the easiest fix is to modify how str_detect() uses regex with e.g. stringr::fixed() or stringr::coll().
From ?stringr::str_detect():
Match a fixed string (i.e. by comparing only bytes), using fixed(). This is fast, but approximate. Generally, for matching human text, you'll want coll() which respects character matching rules for the specified locale.
Under is a comparison with your original code:
original <- function() {
data_set_A %>%
rowwise() %>%
filter(any(str_detect(name, data_set_B$name_2)) | any(str_detect(data_set_B$name_2, name))) %>%
ungroup()
}
# Note the use of fixed()
using_fixed <- function() {
data_set_A %>%
rowwise() %>%
filter(any(str_detect(name, fixed(data_set_B$name_2))) | any(str_detect(data_set_B$name_2, fixed(name)))) %>%
ungroup()
}
# Note the use of coll()
using_coll <- function() {
data_set_A %>%
rowwise() %>%
filter(any(str_detect(name, coll(data_set_B$name_2))) | any(str_detect(data_set_B$name_2, coll(name)))) %>%
ungroup()
}
bm <- bench::mark(
original(),
using_fixed(),
using_coll(),
iterations = 20
)
#> Warning: Some expressions had a GC in every iteration; so filtering is disabled.
bm
#> # A tibble: 3 × 6
#> expression min median `itr/sec` mem_alloc `gc/sec`
#> <bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl>
#> 1 original() 6.58s 6.59s 0.152 32.4MB 0.371
#> 2 using_fixed() 501.64ms 505.51ms 1.97 61.4MB 3.94
#> 3 using_coll() 4.48s 4.5s 0.222 61.4MB 0.512
bm %>% ggplot2::autoplot(type = "violin")
#> Loading required namespace: tidyr
Created on 2022-04-02 by the reprex package (v2.0.1)
So, as we can see, wrapping your code in fixed() will make it very fast and works well on your test data. However, it might not work as well for real human text (especially non-ASCII character sets). You should test it on your original data, and use coll() as an alternative if fixed() doesn't work.
Removing rowwise()
Another step you can take to make your code a bit faster is to get rid of rowwise(). I would replace it using purrr::map_lgl() and use this logical vector to subset the dataframe. Under is an example and a benchmark against my functions defined above:
using_map_fixed <- function() {
logical_vec <- data_set_A$name %>%
purrr::map_lgl(
~any(stringr::str_detect(.x, fixed(data_set_B$name_2))) ||
any(stringr::str_detect(data_set_B$name_2, fixed(.x)))
)
data_set_A[logical_vec, ]
}
using_map_coll <- function() {
logical_vec <- data_set_A$name %>%
purrr::map_lgl(
~any(stringr::str_detect(.x, coll(data_set_B$name_2))) ||
any(stringr::str_detect(data_set_B$name_2, coll(.x)))
)
data_set_A[logical_vec, ]
}
bm <- bench::mark(
using_fixed(),
using_map_fixed(),
using_coll(),
using_map_coll(),
iterations = 20
)
#> Warning: Some expressions had a GC in every iteration; so filtering is disabled.
bm
#> # A tibble: 4 × 6
#> expression min median `itr/sec` mem_alloc `gc/sec`
#> <bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl>
#> 1 using_fixed() 503.4ms 507.24ms 1.95 62.9MB 5.37
#> 2 using_map_fixed() 474.28ms 477.63ms 2.09 60.3MB 3.14
#> 3 using_coll() 4.49s 4.5s 0.222 61.4MB 0.489
#> 4 using_map_coll() 4.37s 4.38s 0.228 60.2MB 0.354
Created on 2022-04-02 by the reprex package (v2.0.1)
As we see, this gives another slight speed boost.
Using fixed() with data.table or fuzzyjoin
You can also use fixed() with data.table and fuzzyjoin. I have not included it here for brevity, but my benchmark shows that data.table takes about the same amount of time as my using_map_fixed() above, and fuzzyjoin takes about twice as much time.
This makes sense to me, as the slow part is str_detect(), not the method of joining/filtering, or the underlying data structure.
If you would like to use base R, the code below might be one fast option
A <- data_set_A$name
B <- data_set_B$name_2
A2B <- sapply(A, function(x) grepl(x, B, fixed = TRUE))
B2A <- sapply(B, function(x) grepl(x, A, fixed = TRUE))
idx <- which(t(A2B) | B2A, arr.ind = TRUE)
res <- cbind(data_set_A[idx[, 1], ], data_set_B[idx[, 2], ])
which gives
> res
name ID_A name_2 ID_B
1 arh 1234 pimoarhd 8
2 qtj 720 aqtj 23
3 szcympsn 142 cym 43
4 cymvubnxg 245 cym 43
5 dppvtcymq 355 cym 43
6 kzi 690 kzii 48
7 eyajqchkn 498 chk 53
8 upfzh 522 upf 61
9 ioa 1852 ioadr 63
10 lya 1349 ibelyalvh 64
11 honod 504 ono 71
12 zozxjpu 18 zoz 72
13 jcz 914 cdjczpqg 88
14 ailmjf 623 ilm 99
15 upoux 609 oux 104
16 pouxifvp 1466 oux 104
17 mvob 516 vob 106
18 nqtotvhhm 1088 otv 115
19 wom 202 womtglapx 117
20 qkc 756 dqkcfqpps 118
21 qtl 600 ivqtlymzr 126
22 qqi 1605 owfsqqiyu 153
23 fmjalirze 1470 ali 172
24 ibwfwkyp 1588 fwk 175
25 iat 1258 iatjeg 185
26 osm 253 nviiqosm 199
27 wpj 373 wpjeb 204
28 hahx 515 ahx 213
29 keahxa 1565 ahx 213
30 psf 359 qnpsfo 223
31 saq 1859 saqhu 227
32 cvmkwtx 714 cvm 228
33 ilw 389 pyilwj 231
34 ohwysv 1590 ysv 237
35 utrl 698 trl 244
36 dmttrlcpj 1267 trl 244
37 cpv 236 btcpvmoc 247
38 uto 1047 utoi 257
39 yngunekl 1978 ekl 258
40 vceko 625 vce 265
41 fir 1934 firgk 278
42 qvd 983 eqvdfi 287
43 fir 1934 zwwefir 291
44 idvfkevdf 1380 vdf 312
45 qwdo 1921 qwd 322
46 kam 1205 tlkam 327
47 lck 488 clckjkyzn 329
48 gmspwckw 1015 msp 359
49 ynouuwqtz 1576 nou 360
50 tty 1209 bttyvt 361
51 vkc 999 fmrvkcl 366
52 ipw 1918 fipwjomdu 388
53 zdv 261 zdvkut 410
54 vku 1137 zdvkut 410
55 doby 246 oby 411
56 hycvuupgy 141 uup 421
57 uwlb 1249 wlb 431
58 auj 1452 lcmnauj 444
59 rwd 1667 ukwrwdczs 479
60 ylsihqqor 1290 ihq 483
61 feo 1649 feorvxbm 485
62 zff 755 dohzffujm 499
63 mqutujepu 904 epu 507
64 uiepu 1308 epu 507
65 vahepuk 1434 epu 507
66 cug 1125 accugl 509
67 fir 1934 firwe 517
68 dia 1599 dialeddd 527
69 temiwd 1725 tem 531
70 svofivl 1177 svo 545
71 flm 657 aflm 546
72 vnt 112 vnt 551
73 bhmoskrz 426 osk 558
74 wev 728 shemuwev 569
75 hzpi 1586 hzp 579
76 gvi 1064 mkgvivlfe 582
77 fjb 1398 vkfjbxnjl 589
78 qin 1013 qinp 593
79 ecn 1342 ecnzre 598
80 zre 1610 ecnzre 598
81 xvr 772 dpxvrfmo 623
82 tqr 1419 tqrmztdm 624
83 zmwnf 1571 mwn 626
84 ypil 1787 pil 630
85 mnxlqgfh 1132 nxl 643
86 gse 1563 gseice 646
87 ygk 1309 ygkqrk 655
88 fgm 933 vzfgmy 663
89 rlupd 977 upd 666
90 mcupdkuiy 1307 upd 666
91 fly 391 fly 669
92 vbkko 1603 kko 678
93 uvrew 465 rew 680
94 hgbhngwvd 901 wvd 690
95 wvdjprmo 1432 wvd 690
96 cgn 22 cgnd 698
97 dngnjv 967 njv 700
98 psqs 841 sqs 720
99 ywv 1180 ptywvlgc 730
100 ypil 1787 ypi 734
101 rwd 1667 srserwd 737
102 jqydasl 1294 jqy 742
103 ckujmc 717 ujm 751
104 dfzxta 662 xta 775
105 bjb 1562 jabjbei 779
106 adwknpll 1242 npl 780
107 kdv 1327 xhkdvqo 789
108 ghj 174 oghj 801
109 lhwfyksz 94 lhw 811
110 nwrrnlhw 929 lhw 811
111 xlhwm 1720 lhw 811
112 ncc 1602 wurhxnccn 814
113 jdslrf 1094 dsl 835
114 ktmw 1738 tmw 844
115 igwsvrzpk 16 gws 856
116 kug 591 pkugls 857
117 befgcpedr 339 fgc 862
118 ojf 1397 ojfpnkla 863
119 gyl 1203 gylxeqzw 872
120 ugcbb 1727 ugc 876
121 arh 1234 karhwhg 878
122 amm 458 ammqdc 883
123 azazryje 636 zaz 900
124 wczazw 1887 zaz 900
125 gkritbe 47 ritb 915
126 vku 1137 yjvkuxued 929
127 rnh 1633 kvyrnhugu 937
128 mzh 1135 xllmwmzhn 940
129 cug 1125 cug 960
130 xgv 1280 xgv 962
131 xusxgv 1436 xgv 962
132 umc 351 lwumcmvoo 980
133 zlb 1900 nkyazlb 991
134 llfkalao 1049 llf 1002
135 sflpbht 991 lpb 1048
136 rairmmcl 442 mmc 1087
137 mmckoln 780 mmc 1087
138 gfxmmcgb 1814 mmc 1087
139 aoj 402 taojlgp 1089
140 mypvzhp 121 ypv 1095
141 moctwaypv 611 ypv 1095
142 rngedn 306 ged 1106
143 djshecy 1408 ecy 1108
144 rairmmcl 442 rmm 1117
145 gzua 1594 zua 1124
146 ytj 416 yytj 1140
147 ubt 300 hubtcfr 1141
148 gqg 1854 ogqgsjqc 1144
149 tfg 1204 xiutfgru 1145
150 avrq 741 avr 1147
151 ytkpvss 440 tkp 1149
152 kug 591 yxsjkug 1176
153 vix 1846 vixsmn 1187
154 qtl 600 qtljkxz 1188
155 lgr 494 dlgrco 1189
156 ryg 864 xlmtryg 1203
157 yskvkxwj 1547 kvk 1205
158 kxhee 1795 xhe 1222
159 hzbcjs 1493 cjs 1224
160 kbi 270 itxlwkbi 1225
161 gdymcam 806 ymca 1232
162 tqr 1419 rxtqrdtl 1236
163 yyz 215 yyzw 1242
164 jyx 1735 mljjyxu 1248
165 aai 1928 umkpaaiwo 1254
166 dsd 1122 dndsdova 1257
167 tor 744 etor 1270
168 vhcyznp 1296 yzn 1278
169 xlc 1947 odxlcjwj 1280
170 mlm 1629 aomlmgtq 1303
171 owm 239 owmugb 1304
172 ynezwaml 507 nez 1308
173 jls 695 jlsve 1325
174 dvm 879 dvmv 1339
175 vsgx 944 dqpihvsgx 1352
176 wfo 768 wfokpjois 1354
177 tltbkinat 1986 nat 1362
178 gyl 1203 gylqte 1363
179 ngg 735 bsnggqbjd 1366
180 fkq 345 jdfkqf 1368
181 ojf 1397 ojfpgfga 1382
182 dqgd 1623 prqbndqgd 1398
183 siu 827 siuypucup 1412
184 yinsoivfd 1895 yin 1414
185 esm 1834 sesmeepz 1417
186 umc 351 umcj 1432
187 wny 866 wnyxamguw 1443
188 ujbhtvnin 399 vni 1444
189 dbq 630 bdbqq 1452
190 ebn 1405 ebngddw 1461
191 zcj 704 rbtjzcjod 1465
192 avn 500 avnspxv 1468
193 vkk 567 hvkk 1477
194 hmm 1441 bgjhmmthz 1483
195 aguakz 614 guak 1487
196 hycvuupgy 141 pgy 1493
197 tizpgymz 280 pgy 1493
198 guk 571 cncxdguk 1502
199 zyw 281 nzywuqs 1504
200 jnz 1558 rxdxsjnzw 1510
201 uuq 92 nxuuqtj 1514
202 qtj 720 nxuuqtj 1514
203 vkk 567 xpbpvkkdc 1518
204 iaa 460 sjiaa 1525
205 txsgmynng 1019 xsg 1526
206 yjvtwc 1107 jvt 1529
207 lnk 1113 hylnknwy 1546
208 szd 635 woszdm 1557
209 osm 253 sosmdp 1567
210 nbd 1067 nbdmmg 1570
211 mmg 1305 nbdmmg 1570
212 wqdsatbd 1536 sat 1585
213 sdlypo 1527 sdl 1596
214 inkynog 288 inky 1600
215 hpwoeclfy 1321 clf 1601
216 wodyqwqf 679 dyq 1603
217 lyw 7 xnalywyuw 1607
218 njm 1825 vjlnjmns 1617
219 njytqhaut 428 qha 1620
220 ilw 389 rilwbk 1647
221 oqo 45 ixoqowkpg 1650
222 odcbcvaun 1386 bcv 1652
223 mastn 434 stn 1662
224 xebhdssit 1091 xeb 1663
225 nmy 782 nmyxj 1671
226 fsqvgdw 673 gdw 1676
227 mwwczhs 482 wcz 1679
228 wczazw 1887 wcz 1679
229 anmryzm 915 ryz 1698
230 rteh 523 rte 1708
231 mlwrguae 817 lwr 1709
232 mbu 819 xpsuqmbuf 1729
233 mmckoln 780 cko 1733
234 lxpg 798 lxp 1734
235 ane 370 vxnanehvk 1746
236 tty 1209 vbttyozui 1752
237 igncdgyjx 332 ign 1753
238 ndignk 621 ign 1753
239 nmy 782 ivnmyba 1780
240 wknrsauuj 2 rsa 1799
241 tgd 165 qtgdidlf 1803
242 iaa 460 yziaazxto 1833
243 xto 1245 yziaazxto 1833
244 zff 755 dpzfft 1857
245 jyx 1735 jwjyxphe 1873
246 ytj 416 eytj 1881
247 lcggwonk 1596 onk 1882
248 zdv 261 zdvxfz 1889
249 xhskcb 417 kcb 1890
250 mrikqkcb 770 kcb 1890
251 psvxqnsap 1352 psv 1898
252 udjswzb 411 jsw 1900
253 rpfjswy 1840 jsw 1900
254 bjaywiso 1677 ayw 1902
255 zfli 130 fli 1906
256 vazx 1215 itvazxw 1918
257 tuw 100 tuwywtbwd 1921
258 vle 1437 ebvleaovm 1937
259 znycsygd 1757 nyc 1944
260 ynezwaml 507 ezw 1952
261 tseezwf 1276 ezw 1952
262 ezwzyfudo 1690 ezw 1952
263 oudiky 1503 dik 1964
264 dikjn 1615 dik 1964
265 oms 106 wpomsudi 1977
266 hhp 1864 hhpkm 1983
Benchmarking
It seems this base R option is slightly slower than #Andrew's approach.
TIC <- function() {
A <- data_set_A$name
B <- data_set_B$name_2
A2B <- sapply(A, function(x) grepl(x, B, fixed = TRUE))
B2A <- sapply(B, function(x) grepl(x, A, fixed = TRUE))
idx <- which(t(A2B) | B2A, arr.ind = TRUE)
cbind(data_set_A[idx[, 1], ], data_set_B[idx[, 2], ])
# data_set_A[unique(idx[, 1]), ]
}
jpiversen_fixed <- function() {
data_set_A %>%
rowwise() %>%
filter(any(str_detect(name, fixed(data_set_B$name_2))) | any(str_detect(data_set_B$name_2, fixed(name)))) %>%
ungroup()
}
andrew <- function() {
nchar_a <- nchar(data_set_A$name)
nchar_b <- nchar(data_set_B$name_2)
pattern_a <- str_c(unique(data_set_A$name[nchar_a != max(nchar_a, na.rm = TRUE)]), collapse = "|")
pattern_b <- str_c(unique(data_set_B$name_2[nchar_b != max(nchar_b, na.rm = TRUE)]), collapse = "|")
idx_a <- data_set_A$name %in% data_set_B$name_2
idx_a[!idx_a] <- str_detect(data_set_A$name[!idx_a], pattern_b)
b_to_check <- data_set_B$name_2[str_detect(data_set_B$name_2, pattern_a)]
idx_a[!idx_a] <- vapply(data_set_A$name[!idx_a], function(name) {
any(grepl(name, b_to_check, fixed = TRUE))
}, logical(1L), USE.NAMES = FALSE)
data_set_A[idx_a, ]
}
bm <- microbenchmark(
TIC(),
jpiversen_fixed(),
andrew(),
times = 20
)
shows that
> bm
Unit: milliseconds
expr min lq mean median uq max
TIC() 423.8410 441.3574 492.6091 478.2596 549.2376 611.3841
jpiversen_fixed() 1354.8954 1373.9502 1447.8649 1395.6766 1459.7058 1842.2574
andrew() 329.4821 335.3388 345.8890 341.4758 354.1298 381.6872
neval
20
20
20

cast function results arguments imply differing number of rows: 9, 0

I am trying to find the mean for each variable uniquely but I don't know why it's giving error after applying cast function.
library(reshape)
> odata <- read.csv("dummy2.csv")
> msdata <- melt(odata, id=c("A","F"))
> subjmeans <- cast(msdata, A~ variable, mean)
Error in (function (..., row.names = NULL, check.rows = FALSE, check.names = TRUE, :
arguments imply differing number of rows: 9, 0
Here is the data I have used.
Timestamp A B C D E F G H I J
2586 01_Antwerpen_S1.jpg 9 250 1151 458 p1 color 261.8472837 13.27605282 50.20731621
2836 01_Antwerpen_S1.jpg 10 150 1371 316 p1 color 41.01219331 2.088502575 25.59470566
2986 01_Antwerpen_S1.jpg 11 283 1342 287 p1 color 580.2206477 28.92031693 84.62469724
3269 01_Antwerpen_S1.jpg 12 433 762 303 p1 color 138.1303732 7.026104125 36.45742907
3702 01_Antwerpen_S1.jpg 13 183 624 297 p1 color 88.20430828 4.489909458 30.87780081
3885 01_Antwerpen_S1.jpg 14 333 712 303 p1 color 42.20189569 2.149072905 25.72796039
4218 01_Antwerpen_S1.jpg 15 300 753 293 p1 color 51.7880295 2.637077062 26.80156954
6517 01_Antwerpen_S1.jpg 22 333 601 674 p1 color 466.0525721 23.40488212 72.49074066
9066 02_Berlin_S1.jpg 27 149 1067 681 p1 color 90.42676595 4.602920212 31.12642447
9215 02_Berlin_S1.jpg 28 266 1116 757 p1 color 101.8430165 5.18328435 32.40322557
9481 02_Berlin_S1.jpg 29 217 1020 723 p1 color 314.3962468 15.90906187 55.99993612
9698 02_Berlin_S1.jpg 30 183 711 781 p1 color 272.045952 13.78825606 51.33416332
9881 02_Berlin_S1.jpg 31 183 439 776 p1 color 249.9939999 12.68008164 48.8961796
10064 02_Berlin_S1.jpg 32 167 328 552 p1 color 193.8375609 9.847751174 42.66505258
10231 02_Berlin_S1.jpg 33 400 310 359 p1 color 68.00735254 3.462531847 28.61757006
10631 02_Berlin_S1.jpg 34 666 246 336 p1 color 93.40770846 4.754485399 31.45986788
11297 02_Berlin_S1.jpg 35 333 172 279 p1 color 1105.224412 52.32154317 136.107395
13679 03_Bordeaux_S1.jpg 40 316 1152 790 p1 color 280.8629559 14.23062355 52.30737182
13995 03_Bordeaux_S1.jpg 41 583 1424 860 p1 color 134.1827113 6.825784964 36.01672692
14578 03_Bordeaux_S1.jpg 42 283 1486 979 p1 color 133.9589489 6.814429158 35.99174415
14861 03_Bordeaux_S1.jpg 43 233 1419 863 p1 color 282.1772493 14.29652823 52.4523621
15094 03_Bordeaux_S1.jpg 44 266 1149 781 p1 color 998.5128943 47.86171758 126.2957787
17559 04_Köln_S1.jpg 49 200 151 813 p1 color 590.041524 29.38880547 85.65537204
17759 04_Köln_S1.jpg 50 183 741 806 p1 color 294.9779653 14.93791111 53.86340444
17943 04_Köln_S1.jpg 51 216 1035 782 p1 color 81.0246876 4.124771083 30.07449638
18159 04_Köln_S1.jpg 52 117 1068 708 p1 color 85.80209788 4.367748556 30.60904682
Result is same and error is same with IRIS Data too.
library(reshape)
ss <- iris
msdata <- melt(ss, id=c("Sepal.Length","Species"))
subjmeans <- cast(msdata, Species~ variable, mean)
Error in (function (..., row.names = NULL, check.rows = FALSE, check.names = TRUE, :
arguments imply differing number of rows: 9, 0

Divide paired matching columns

I have a data.frame df with matching columns that are also paired. The matching columns are defined in the factor patient. I would like to devide the matching columns by each other. Any suggestions how to do this?
I tried this, but this does not take the pairing from patient into account.
m1 <- m1[sort(colnames(df)]
m1_g <- m1[,grep("^n",colnames(df))]
m1_r <- m1[,grep("^t",colnames(df))]
m1_new <- m1_g/m1_r
m1_new
head(df)
na-008 ta-008 nc012 tb012 na020 na-018 ta-018 na020 tc020 tc093 nc093
hsa-let-7b-5p_TGAGGTAGTAGGTTGTGT 56 311 137 242 23 96 113 106 41 114
hsa-let-7b-5p_TGAGGTAGTAGGTTGTGTGG 208 656 350 713 49 476 183 246 157 306
hsa-let-7b-5p_TGAGGTAGTAGGTTGTGTGGT 631 1978 1531 2470 216 1906 732 850 665 909
hsa-let-7b-5p_TGAGGTAGTAGGTTGTGTGGTT 2760 8159 6067 9367 622 4228 2931 3031 2895 2974
hsa-let-7b-5p_TGAGGTAGTAGGTTGTGTGGTTT 1698 4105 3737 3729 219 1510 1697 1643 1527 1536
> head(patient)
$`008`
[1] "na-008" "ta-008"
$`012`
[1] "nc012" "tb012"
$`018`
[1] "na-018" "ta-018"
$`020`
[1] "na020" "tc020"
$`045`
[1] "nb045" "tc045"
$`080`
[1] "nb-080" "ta-080"

generalizing net/gross in a bar chart

I'm doing a particular operation quite a bit, and I need help generalizing it.
I have a lot of data that "looks" kind of like this:
> hflights::hflights %>% tbl_df %>% mutate(month=Month, carrier=UniqueCarrier) %>%
group_by(month, carrier) %>% summarize(delay=sum(ArrDelay, na.rm=T)) %>%
dcast(month ~ carrier)
month AA AS B6 CO DL EV F9 FL MQ OO UA US WN XE YV
1 1 18 296 229 27031 1026 1337 851 216 2322 3957 -219 -1068 31701 24248 NA
2 2 461 249 802 15769 1657 730 707 1079 4283 11486 323 -663 36729 27861 -44
3 3 317 476 1037 49061 905 2529 673 1111 2524 12955 1665 -606 28758 50702 -38
4 4 1147 465 518 52086 1856 4483 515 927 5085 17439 1803 -711 47084 69590 260
5 5 1272 56 654 63413 1381 3563 1334 1213 7899 22190 1798 1627 73771 66972 18
6 6 -262 172 504 60042 3736 2618 744 983 4519 21652 6260 2140 40191 66456 49
7 7 -460 112 1241 41300 2868 1628 321 506 1529 23432 2780 497 21200 98484 34
8 8 -1417 59 1659 36106 -949 808 42 -1366 310 11038 3546 -84 6991 33554 34
9 9 -841 -364 -202 24857 1022 -424 151 -747 -1373 4502 1743 248 15592 31846 NA
10 10 215 -112 -45 26437 1082 -1005 277 -537 522 13 1833 -1878 14725 27539 NA
11 11 97 -5 -72 20339 -101 207 180 449 2286 2628 230 -1093 8424 24199 NA
12 12 2287 -242 310 6644 1281 -1082 585 79 2311 5900 -491 -951 12735 65269 NA
There are positive and negative values with some groups; in this case, month & carrier. I can plot it like this:
> hflights::hflights %>% tbl_df %>% mutate(month=Month, carrier=UniqueCarrier) %>%
group_by(month, carrier) %>% summarize(delay=mean(ArrDelay, na.rm=T)) %>%
ggplot(aes(x=month, y=delay, fill=carrier)) + geom_bar(stat='identity')
Which gives me an eye-bleedy chart like this:
It also gives me the message:
Warning message:
Stacking not well defined when ymin != 0
This message is kind of what I'm after. I want to separate positive from negative so that I can see the "gross" amount, and also generate the sum per group and show the "net" amount.
For this dataset, I can do that like so:
> df <- hflights::hflights %>% tbl_df %>%
mutate(month=Month, carrier=UniqueCarrier) %>%
group_by(month, carrier) %>% summarize(delay=mean(ArrDelay, na.rm=T))
> ggplot(NULL, aes(x=month, y=delay, fill=carrier)) +
geom_bar(data=df %>% filter(delay > 0), stat='identity') +
geom_bar(data=df %>% filter(delay < 0), stat='identity') +
geom_bar(data=df %>% group_by(month) %>% summarize(delay=sum(delay, na.rm=T)), fill='black', width=0.25, alpha=0.5, stat='identity')
Which gives me this chestnut:
This is much nicer because in September, it doesn't do netting so I get a better sense of the magnitude of the positives and the negatives.
However, the above only works for this dataset. What happens when I have different groups? How do I generalize this?
Adding position = "identity" to geom_bar should get rid of the warning you are getting in your first plot.
The reason for this warning is related to interpreting that bars have negative height instead of just negative values.

Resources