I am having a few issues scaling a text matching program. I am using text2vec, which provides very good and fast results.
The main problem I am having is manipulating a large matrix which is returned by the text2vec::sim2() function.
First, some details of my hardware / OS setup: Windows 7 with 12 cores at about 3.5 GHz and 128 GB of memory. It's a pretty good machine.
Second, some basic details of what my R program is trying to achieve.
We have a database of 10 million unique canonical addresses for every house / business in the country. These reference addresses also have latitude and longitude information for each entry.
I am trying to match these reference addresses to customer addresses in our database. We have about 600,000 customer addresses. The quality of these customer addresses is not good. Not good at all! They are stored as a single string field with absolutely zero checks on input.
The technical strategy for matching these addresses is quite simple: create two document term matrices (DTMs) of the customer addresses and reference addresses, and use cosine similarity to find the reference address that is most similar to a specific customer address. Some customer addresses are so poor that they will result in a very low cosine similarity -- so, for these addresses a "no match" would be assigned.
Despite being a pretty simple solution, the results obtained are very encouraging.
But I am having problems scaling things, and I am wondering if anyone has any suggestions.
There is a copy of my code below. It's pretty simple. Obviously, I cannot include real data, but it should give readers a clear idea of what I am trying to do.
SECTION A - Works very well even on the full 600,000 * 10 million input data set.
SECTION B - the text2vec::sim2() function causes RStudio to shut down when the vocabulary exceeds about 140,000 tokens (i.e. columns). To avoid this, I process the customer addresses in chunks of about 200.
SECTION C - This is the most expensive section. When processing addresses in chunks of 200, SECTION A and SECTION B take about 2 minutes. But SECTION C, using what I would have thought to be super-quick functions, takes about 5 minutes to process a 10 million row * 200 column matrix.
Combined, SECTIONS A-C take about 7 minutes to process 200 addresses. As there are 600,000 addresses to process, this will take about 14 days.
Does anyone have any ideas to make this code run faster?
rm(list = ls())
library(text2vec)
library(dplyr)
# Create some test data
# example is 10 entries.
# but in reality we have 10 million addresses
vct_ref_address <- c("15 smith street beaconsfield 2506 NSW",
"107 orange grove linfield 2659 NSW",
"88 melon drive calton 3922 VIC",
"949 eyre street sunnybank 4053 QLD",
"12 black avenue kingston 2605 ACT",
"5 sweet lane 2004 wynyard NSW",
"32 mugga way 2688 manuka ACT",
"4 black swan avenue freemantle 5943 WA",
"832 big street narrabeet 2543 NSW",
"5 dust road 5040 NT")
# example is 4 entries
# but in reality, we have 1.5 million addresses
vct_test_address <- c("949 eyre street sunnybank 4053 QLD",
"1113 completely invalid suburb with no post code QLD",
"12 black road kingston 2605 ACT",
"949 eyre roaod sunnybank 4053 QLD" )
# ==========================
# SECTION A ===== prepare data
# A.1 create vocabulary
t2v_token <- text2vec::itoken(c(vct_test_address, vct_ref_address), progressbar = FALSE)
t2v_vocab <- text2vec::create_vocabulary(t2v_token)
t2v_vectorizer <- text2vec::vocab_vectorizer(t2v_vocab)
# A.2 create document term matrices dtm
t2v_dtm_test <- text2vec::create_dtm(itoken(vct_test_address, progressbar = FALSE), t2v_vectorizer)
t2v_dtm_reference <- text2vec::create_dtm(itoken(vct_ref_address, progressbar = FALSE), t2v_vectorizer)
# ===========================
# SECTION B ===== similarity matrix
mat_sim <- text2vec::sim2(t2v_dtm_reference, t2v_dtm_test, method = 'cosine', norm = 'l2')
# ===========================
# SECTION C ===== process matrix
vct_which_reference <- apply(mat_sim, 2, which.max)
vct_sim_score <- apply(mat_sim, 2, max)
# ============================
# SECTION D ===== apply results
# D.1 assemble results
df_results <- data.frame(
test_addr = vct_test_address,
matched_addr = vct_ref_address[vct_which_reference],
similarity = vct_sim_score )
# D.2 print results
df_results %>% arrange(desc(similarity))
The issue in step C is that mat_sim is sparse, and all the apply calls perform column/row subsetting, which is super slow (and converts sparse vectors to dense).
There could be several solutions:
If mat_sim is not very large, convert it to dense with as.matrix() and then use apply().
Better, you can convert mat_sim to a sparse matrix in triplet format with as(mat_sim, "TsparseMatrix") and then use data.table to get the indices of the max elements. Here is an example:
library(text2vec)
library(Matrix)
data("movie_review")
it = itoken(movie_review$review, tolower, word_tokenizer)
dtm = create_dtm(it, hash_vectorizer(2**14))
mat_sim = sim2(dtm[1:100, ], dtm[101:5000, ])
mat_sim = as(mat_sim, "TsparseMatrix")
library(data.table)
# we add 1 because indices in sparse matrices in the Matrix package start from 0
mat_sim_dt = data.table(row_index = mat_sim@i + 1L, col_index = mat_sim@j + 1L, value = mat_sim@x)
res = mat_sim_dt[,
{ k = which.max(value); list(max_sim = value[[k]], row_index = row_index[[k]]) },
keyby = col_index]
res
Also, as a side suggestion, I recommend trying char_tokenizer() with ngrams (for example, of size c(3, 3)) to "fuzzy"-match different spellings and abbreviations of addresses.
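A minimal sketch of that idea, reusing the vct_test_address / vct_ref_address vectors from the question (the c(3L, 3L) ngram setting and the object names are assumptions to tune, not tested on the real data):
library(text2vec)
# tokenize into single characters, then build a vocabulary of character 3-grams;
# misspellings such as "roaod" vs "road" still share most of their 3-grams
it_char <- itoken(c(vct_test_address, vct_ref_address), tokenizer = char_tokenizer, progressbar = FALSE)
vocab_char <- create_vocabulary(it_char, ngram = c(3L, 3L))
vectorizer_char <- vocab_vectorizer(vocab_char)
dtm_test_char <- create_dtm(itoken(vct_test_address, tokenizer = char_tokenizer, progressbar = FALSE), vectorizer_char)
dtm_ref_char <- create_dtm(itoken(vct_ref_address, tokenizer = char_tokenizer, progressbar = FALSE), vectorizer_char)
mat_sim_char <- sim2(dtm_ref_char, dtm_test_char, method = "cosine", norm = "l2")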
Say that I have two sets of coordinates, A and B. My goal is, for each element of A, to find the element of B with the shortest driving time (retaining B's index, driving time, and distance). Based on the answer by @Ben in this question (Calculating distance of multiple coordinates in R), I have come up with the code below. My question is how to make this faster.
library(osrm)
library(sf)
apotheke.df <- st_read(system.file("gpkg/apotheke.gpkg", package = "osrm"),
quiet = TRUE)
points <- data.frame(
id = 1:3,
longitude = c(13.4, 13.5, 13.3),
latitude = c(52.4, 52.5, 52.3)
)
route <- list()
#overall goal: for each point, find distance to nearest apotheke
#find distance from each of the 3 points to the first apotheke
for (x in 1:nrow(points)) {
route[[x]] <- osrmRoute(src = c(points$longitude[x], points$latitude[x]),
dst = apotheke.df[1,],
overview = FALSE, osrm.profile = "car")
#add index
route[[x]][3] <- 1
}
#replace if duration is less than the lowest one
for (x in 1:nrow(points)) {
for(y in 2:nrow(apotheke.df)) {
temp <- osrmRoute(src = c(points$longitude[x], points$latitude[x]),
dst = apotheke.df[y,],
overview = FALSE, osrm.profile = "car")
temp[3] <- y
print(y)
print(temp)
if(temp[2] < route[[x]][2]) route[[x]] <- temp
}
}
do.call(rbind, route)
Result:
duration distance
[1,] 3.52 1.84 18
[2,] 2.05 1.00 14
[3,] 17.10 17.76 76
In my actual application, one set has about 150 points and the other has many thousands. This takes a really long time. This leads to my question: how can I speed up the above code?
My best guess is to use parallel processing (though other aspects of the code may be slow), but I am not good at figuring out how to do this. It may be related to these questions: 1) parallelizing API in R (OSRM) and 2) How to apply an osrm function to every row of a dataframe.
Here is a reproducible example elaborating on my comment.
1. Define sample source and dest data:
This is just for the purpose of the answer.
library(osrm)
source <- data.frame(
id = 1:3,
longitude = c(13.4, 13.5, 13.3),
latitude = c(52.4, 52.5, 52.3)
)
dest <- data.frame(
id = 4:5,
longitude = c(13.9, 13.6),
latitude = c(52.2, 52.4)
)
2. Calculate distances
You will save a huge amount of time using osrmTable() rather than looping through your data and calling osrmRoute() each time. This is particularly true if you are using a remote server rather than hosting your own, as each request travels over the internet. Using osrmTable() makes one request to the server for the entire table, rather than a request for every pair of coordinates.
distances <- osrmTable(
src = source[c("longitude", "latitude")],
dst = dest[c("longitude", "latitude")],
osrm.profile = "car"
)
distances[["durations"]]
# 1 2
# 1 60.6 25.7
# 2 71.4 25.4
# 3 55.7 25.3
Even with just 100 source coordinates and 100 destinations, R needs to interpret one command, make one network request and do one high-level assignment (<-) operation, rather than 10,000 if you iterate over each pair individually.
3. Find the closest destination to each source
Again, it is much quicker if you skip the loop. I am going to use data.table in the example because it's fast. First, make the distance matrix into a data.table and get it into long form using melt():
library(data.table)
distances_dt <- data.table(
source = source$id,
distances[["durations"]],
key = "source"
) |>
setnames(
seq_len(nrow(dest)) + 1,
as.character(dest$id)
) |>
melt(
id.vars = "source",
variable.name = "dest",
value.name = "distance"
)
# source dest distance
# <int> <fctr> <num>
# 1: 1 4 60.6
# 2: 2 4 71.4
# 3: 3 4 55.7
# 4: 1 5 25.7
# 5: 2 5 25.4
# 6: 3 5 25.3
Then we can simply find the minimum distance for each source:
distances_dt[,
.(min_distance = min(distance)),
by = source
]
# source min_distance
# <int> <num>
# 1: 1 25.7
# 2: 2 25.4
# 3: 3 25.3
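Since the original goal was also to retain B's index, you can keep the id of the nearest destination in the same call. Here is a small sketch using the distances_dt built above (which.min() picks the row with the shortest duration within each source group):
distances_dt[,
  .(nearest_dest = dest[which.min(distance)], min_distance = min(distance)),
  by = source
]
# source nearest_dest min_distance
# <int> <fctr> <num>
# 1: 1 5 25.7
# 2: 2 5 25.4
# 3: 3 5 25.3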
4. Setting up your own osrm-backend instance
The only problem I can foresee is that you will get a rate limit error if you use the default R osrm package server, rather than a local one. According to the R osrm package docs:
The OSRM demo server does not allow large queries (more than 10000 distances or durations).
If you are not sure whether you are hosting your own server then you probably aren't. You can check the default server in your R session by running getOption("osrm.server"). The default is "https://routing.openstreetmap.de/".
If you make a request larger than the max-table-size parameter on the server you are using, you will get this response:
{"code":"TooBig","message":"Too many table coordinates"}
If this is the case, then if practical, you could break up your data into smaller chunks.
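For example, here is a rough sketch of chunking the requests, reusing the source and dest objects from above (the 10000 figure is the demo-server limit quoted earlier, and chunk_size is an assumption you would tune to the max-table-size of the server you use):
chunk_size <- floor(10000 / nrow(dest)) # keep n_src * n_dst per request under the limit
chunks <- split(source, ceiling(seq_len(nrow(source)) / chunk_size))
durations <- do.call(rbind, lapply(chunks, function(chunk) {
  osrmTable(
    src = chunk[c("longitude", "latitude")],
    dst = dest[c("longitude", "latitude")],
    osrm.profile = "car"
  )[["durations"]]
}))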
Alternatively, you can run your own osrm-backend instance. If you are going to be calculating distances regularly, or even a lot of distances once, I would recommend doing this. There are two approaches:
a. Install osrm-backend and change the max-table-size parameter (Linux/Mac)
This is the tutorial I followed some years ago, and I think it is a good way to do it. The instructions are for Ubuntu, although only minor changes are required to install on other popular Linux flavours or macOS.
You need to change one line in the tutorial. Rather than running:
osrm-routed map.xml.osrm
You can run:
osrm-routed map.xml.osrm --max-table-size 100000
(Or whatever number is sufficiently large.)
b. Expose the API in a Docker image (Windows/Linux/Mac)
Alternatively, if you're running Windows, or you're comfortable with Docker, then it's probably easier to use the Docker images. Even if you haven't used Docker before, this is a good set of instructions.
If you are running an osrm-backend instance in a Docker container, you can change the max-table-size parameter by passing it to docker run as described here, with syntax similar to:
docker run -t -i -p 5000:5000 -v "${PWD}:/data" osrm/osrm-backend osrm-routed --algorithm mld --max-table-size 10000 /data/berlin-latest.osrm
In either case, once you have set up your own osrm-routed instance, you need to tell the R osrm package to use it. If it is running on port 5000 of your localhost, you can do this with:
options(osrm.server = "http://127.0.0.1:5000/")
You can see an example of integrating a local osrm-routed instance with the R osrm package here.
I have two databases. The first one has about 70k rows and 3 columns. The second one has 790k rows and 2 columns. Both databases have a common variable, grantee_name. I want to match each row of the first database to one or more rows of the second database based on this grantee_name. Note that merge will not work because the grantee_name values do not match perfectly. There are different spellings, etc. So, I am using the fuzzyjoin package and trying the following:
library("haven"); library("fuzzyjoin"); library("dplyr")
forfuzzy<-read_dta("/path/forfuzzy.dta")
filings <- read_dta("/path/filings.dta")
> head(forfuzzy)
# A tibble: 6 x 3
grantee_name grantee_city grantee_state
<chr> <chr> <chr>
1 (ICS)2 MAINE CHAPTER CLEARWATER FL
2 (SUFFOLK COUNTY) VANDERBILT~ CENTERPORT NY
3 1 VOICE TREKKING A FUND OF ~ WESTMINSTER MD
4 10 CAN NEWBERRY FL
5 10 THOUSAND WINDOWS LIVERMORE CA
6 100 BLACK MEN IN CHICAGO INC CHICAGO IL
... 7 - 70000 rows to go
> head(filings)
# A tibble: 6 x 2
grantee_name ein
<chr> <dbl>
1 ICS-2 MAINE CHAPTER 123456
2 SUFFOLK COUNTY VANDERBILT 654321
3 VOICE TREKKING A FUND OF VOICES 789456
4 10 CAN 654987
5 10 THOUSAND MUSKETEERS INC 789123
6 100 BLACK MEN IN HOUSTON INC 987321
rows 7-790000 omitted for brevity
The above examples are clear enough to provide some good matches and some not-so-good matches. Note that, for example, 10 THOUSAND WINDOWS will match best with 10 THOUSAND MUSKETEERS INC but it does not mean it is a good match. There will be a better match somewhere in the filings data (not shown above). That does not matter at this stage.
So, I have tried the following:
df<-as.data.frame(stringdist_inner_join(forfuzzy, filings, by="grantee_name", method="jw", p=0.1, max_dist=0.1, distance_col="distance"))
I am totally new to R. This results in an error:
cannot allocate vector of size 375GB
(with the big database, of course). A sample of 100 rows from forfuzzy always works, so I thought of iterating over a list of 100 rows at a time.
I have tried the following:
n=100
lst = split(forfuzzy, cumsum((1:nrow(forfuzzy)-1)%%n==0))
df<-as.data.frame(lapply(lst, function(df_)
{
(stringdist_inner_join(df_, filings, by="grantee_name", method="jw", p=0.1, max_dist=0.1, distance_col="distance", nthread = getOption("sd_num_thread")))
}
)%>% bind_rows)
I have also tried the above with mclapply instead of lapply. The same error happens even though I have tried a high-performance cluster with 3 CPUs, each with 480G of memory, and used mclapply with the option mc.cores=3. Perhaps a foreach command could help, but I have no idea how to implement it.
I have been advised to use the purrr and repurrrsive packages, so I try the following:
purrr::map(lst, ~stringdist_inner_join(., filings, by="grantee_name", method="jw", p=0.1, max_dist=0.1, distance_col="distance", nthread = getOption("sd_num_thread")))
This seems to be working, after a novice error in the by="grantee_name" statement. However, it is taking forever and I am not sure it will work. A sample list in forfuzzy of 100 rows, with n=10 (so 10 lists of 10 rows each), has been running for 50 minutes with still no results.
If you split (with base::split or dplyr::group_split) your uniquegrantees data frame into a list of data frames, you can then call purrr::map on the list (map is pretty much lapply):
purrr::map(list_of_dfs, ~stringdist_inner_join(., filings, by="grantee_name", method="jw", p=0.1, max_dist=0.1, distance_col="distance"))
Your result will be a list of data frames each fuzzyjoined with filings. You can then call bind_rows (or you could do map_dfr) to get all the results in the same data frame again.
See R - Splitting a large dataframe into several smaller dateframes, performing fuzzyjoin on each one and outputting to a single dataframe
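A minimal sketch of the whole pipeline, assuming the forfuzzy and filings objects from the question (the chunk size of 100 is arbitrary):
library(dplyr)
library(purrr)
library(fuzzyjoin)
n <- 100
# split into chunks of n rows, fuzzy-join each chunk, then row-bind the results
chunks <- forfuzzy %>%
  mutate(chunk_id = (row_number() - 1) %/% n) %>%
  group_split(chunk_id, .keep = FALSE)
result <- map_dfr(
  chunks,
  ~ stringdist_inner_join(.x, filings, by = "grantee_name", method = "jw",
                          p = 0.1, max_dist = 0.1, distance_col = "distance")
)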
I haven't used foreach before, but maybe the variable x is already the individual rows of zz1?
Have you tried:
stringdist_inner_join(x, zz2, by="grantee_name", method="jw", p=0.1, max_dist=0.1, distance_col="distance")
?
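In case it helps, here is a hedged sketch of how the full foreach loop could look (zz1, zz2 and the chunk size of 100 are assumptions based on your snippet, and doParallel is only one possible backend):
library(foreach)
library(doParallel)
library(fuzzyjoin)
registerDoParallel(cores = 3)
# split zz1 into chunks of 100 rows, fuzzy-join each chunk in parallel, then rbind the pieces
zz1_chunks <- split(zz1, ceiling(seq_len(nrow(zz1)) / 100))
res <- foreach(x = zz1_chunks, .combine = rbind, .packages = "fuzzyjoin") %dopar% {
  stringdist_inner_join(x, zz2, by = "grantee_name", method = "jw",
                        p = 0.1, max_dist = 0.1, distance_col = "distance")
}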
I am trying to count the frequency of a multiword expression in quanteda. I know several articles in the corpus contain this expression, because when I search for it with re in Python it finds them. However, with quanteda it doesn't seem to be working. Can anybody tell me what I am doing wrong?
> mwes <- phrase(c("抗美 援朝"))
> tc <- tokens_compound(toks_NK, mwes, concatenator = "")
> dfm <- dfm(tc, select="抗美援朝")
> dfm
Document-feature matrix of: 2,337 documents, 0 features and 7 docvars.
[ reached max_ndoc ... 2,331 more documents ]
First off, apologies for not being able to use a fully Chinese text. But here's a presidential address into which I've taken the liberty of inserting your Mandarin words:
data <- "I stand here today humbled by the task before us 抗美 援朝,
grateful for the trust you have bestowed, mindful of the sacrifices borne by our ancestors.
I thank President Bush for his service to our nation,
as well as the generosity and cooperation he has shown throughout this transition.
Forty-four Americans 抗美 援朝 have now taken the presidential oath.
The words have been spoken during rising tides of prosperity
and the still waters of peace. Yet, every so often the oath 抗美 援朝
is taken amidst gathering clouds and raging storms. At these moments,
America has carried on not simply because of the skill or vision of those in high office,
but because We the People 抗美 援朝 have remained faithful to the ideals of our forbearers,
and true to our founding documents."
What you can do, if you want to use quanteda, is compute 4-grams (I take it your expression consists of four characters and will hence be treated as four words).
Step 1: split text into word tokens:
data_tokens <- tokens(data, remove_punct = TRUE, remove_numbers = TRUE)
Step 2: compute 4-grams and make a frequency list of them
fourgrams <- sort(table(unlist(as.character(tokens_ngrams(data_tokens, n = 4, concatenator = " ")))), decreasing = T)
You can inspect the first ten:
fourgrams[1:10]
抗 美 援 朝 美 援 朝 have America has carried on Americans 抗 美 援
4 2 1 1
amidst gathering clouds and ancestors I thank President and cooperation he has and raging storms At
1 1 1 1
and the still waters and true to our
1 1
If you just want to know the frequency of your target compound:
fourgrams["抗 美 援 朝"]
抗 美 援 朝
4
Alternatively, and much more simply, especially if your interest is really just in a single compound, you could use str_extract_all() from stringr. This will give you the frequency count immediately:
library(stringr)
length(unlist(str_extract_all(data, "抗美 援朝")))
[1] 4
Generally speaking, it is best to use a dictionary to look up or compound tokens in Chinese or Japanese, but the dictionary values should be segmented in the same way that tokens() segments the text.
require(quanteda)
require(stringi)
txt <- "10月初,聯合國軍逆轉戰情,向北開進,越過38度線,終促使中华人民共和国決定出兵介入,中国称此为抗美援朝。"
lis <- list(mwe1 = "抗美援朝", mwe2 = "向北開進")
## tokenize dictionary values
lis <- lapply(lis, function(x) stri_c_list(as.list(tokens(x)), sep = " "))
dict <- dictionary(lis)
## tokenize texts and count
toks <- tokens(txt)
dfm(tokens_lookup(toks, dict))
## Document-feature matrix of: 1 document, 2 features (0.0% sparse).
## features
## docs mwe1 mwe2
## text1 1 1
You're on the right track, but quanteda's default tokenizer seems to separate the tokens in your phrase into four characters:
> tokens("抗美 援朝")
Tokens consisting of 1 document.
text1 :
[1] "抗" "美" "援" "朝"
For this reason, you should consider an alternative tokenizer. Fortunately, the excellent spaCy Python library offers a means to do this and has Chinese language models. Using the spacyr package and quanteda, you can create tokens directly from the output of spacyr::spacy_tokenize() after loading the small Chinese language model.
To count just these expressions, you can use a combination of tokens_select() and then textstat_frequency() on the dfm.
library("quanteda")
## Package version: 2.1.0
txt <- "Forty-four Americans 抗美 援朝 have now taken the presidential oath.
The words have been spoken during rising tides of prosperity
and the still waters of peace. Yet, every so often the oath 抗美 援朝
is taken amidst gathering clouds and raging storms. At these moments,
America has carried on not simply because of the skill or vision of those in high office,
but because We the People 抗美 援朝 have remained faithful to the ideals of our forbearers,
and true to our founding documents."
library("spacyr")
# spacy_download_langmodel("zh_core_web_sm") # only needs to be done once
spacy_initialize(model = "zh_core_web_sm")
## Found 'spacy_condaenv'. spacyr will use this environment
## successfully initialized (spaCy Version: 2.3.2, language model: zh_core_web_sm)
## (python options: type = "condaenv", value = "spacy_condaenv")
spacy_tokenize(txt) %>%
as.tokens() %>%
tokens_compound(pattern = phrase("抗美 援朝"), concatenator = " ") %>%
tokens_select("抗美 援朝") %>%
dfm() %>%
textstat_frequency()
## feature frequency rank docfreq group
## 1 抗美 援朝 3 1 1 all
I need to randomise 380 samples (by age, sex and group [grp]) across four 96-well plates (8 rows, 12 columns), with A01 reserved in each plate for a positive control.
I tried the R package OSAT, and the recommended script is below. The only piece that does not work is excluding well A01 from each of the four plates.
library(OSAT)
samples <- read.table("~/file.csv", sep=";", header=T)
head(samples)
grp sex age
1 A F 45
2 A M 56
3 A F 57
4 A M 67
5 A F 45
6 A M 65
sample.list <- setup.sample(samples, optimal = c("grp", "sex", "age"))
excludedWells <- data.frame("plates"= 1:4, chips=rep(1,4), wells=rep(1,4))
container <- setup.container(IlluminaBeadChip96Plate, 4, batch = 'plates')
exclude(container) <- excludedWells
setup <- create.optimized.setup(fun ="optimal.shuffle", sample.list, container)
out <- map.to.MSA(setup, MSA4.plate)
The corresponding R help doc states:
"If for any reason we need to reserve certain wells for other usage, we can exclude them from the sample assignment process. For this one can create a data frame to mark these excluded wells. Any wells in the container can be identified by its location identified by three variable "plates", "chips", "wells". Therefore the data frame for the excluded wells should have these three columns.
For example, if we will use the first well of the first chip on each plate to hold QC samples, these wells will not be available for sample placement. We have 6 plates in our example so the following will reserve the 6 wells from sample assignment:
excludedWells <- data.frame(plates=1:6, chips=rep(1,6), wells=rep(1,6))
Our program can let you exclude multiple wells at the same position of plate/chip. For example, the following data frame will exclude the first well on each chips regardless how many plates we have:
ex2 <- data.frame(wells=1)
I tried both of these and they do not work - as they simply specify ANY well (and not well #1-A01).
Update: I emailed the developer of the package and he acknowledged the error and provided a workaround, which is incorporated here (exclude wells after setting up the container).
I have some data (from an R course assignment, but that doesn't matter) to which I want to apply a split-apply-combine strategy, but I'm having some problems. The data is in a DataFrame, called outcome, and each line represents a hospital. Each column has some information about that hospital, like name, location, rates, etc.
My objective is to obtain the Hospital with the lowest "Mortality by Heart Attack Rate" of each State.
I was playing around with some strategies, and got a problem using the by function:
best_heart_rate(df) = sort(df, cols = :Mortality)[end,:]
best_hospitals = by(hospitals, :State, best_heart_rate)
The idea was to split the hospitals DataFrame by State, sort each of the SubDataFrames by Mortality Rate, get the lowest one, and combine the rows in a new DataFrame.
But when I used this strategy, I got:
ERROR: no method nrow(SubDataFrame{Array{Int64,1}})
in sort at /home/paulo/.julia/v0.3/DataFrames/src/dataframe/sort.jl:311
in sort at /home/paulo/.julia/v0.3/DataFrames/src/dataframe/sort.jl:296
in f at none:1
in based_on at /home/paulo/.julia/v0.3/DataFrames/src/groupeddataframe/grouping.jl:144
in by at /home/paulo/.julia/v0.3/DataFrames/src/groupeddataframe/grouping.jl:202
I suppose the nrow function is not implemented for SubDataFrames, so I got an error. So I used a nastier code:
best_heart_rate(df) = (df[sortperm(df[:,:Mortality] , rev=true), :])[1,:]
best_hospitals = by(hospitals, :State, best_heart_rate)
Seems to work. But now there is an NA problem: how can I remove the rows of the SubDataFrames that have NA in the Mortality column? Is there a better strategy to accomplish my objective?
I think this might work, if I've understood you correctly:
# Let me make up some data about hospitals in states
using DataFrames, StatsBase  # StatsBase provides sample()
hospitals = DataFrame(State=sample(["CA", "MA", "PA"], 10), mortality=rand(10), hospital=split("abcdefghij", ""))
hospitals[3, :mortality] = NA
# You can use the indmax function to find the index of the maximum element
by(hospitals[complete_cases(hospitals), :], :State, df -> df[indmax(df[:mortality]), [:mortality, :hospital]])
State mortality hospital
1 CA 0.9469632421111882 j
2 MA 0.7137144590022733 f
3 PA 0.8811901895164764 e