R - grepl over 7 million observations - how to boost efficiency?

I've run into a bit of a dead end with some R code that I've written, and I'm hoping someone knows how to make this whole thing feasible by improving its efficiency.
So, what I'm trying to do is the following:
I've got a tweet dataset with ~7 million observations. Currently, I'm not interested in the text of the tweets or any of the other metadata, only in the "Location" field, so I've extracted that data into a new data.frame, which contains the location variable (string) and a new, currently empty, "isRelevant" variable (logical). Furthermore, I've got a vector containing place names formatted as follows: "Placename(1)|Placename(2)[...]|Placename(i)". What I'm trying to do is run grepl over every line of the location variable to see if there is a match with the place-names vector, and if so, set isRelevant to TRUE, and to FALSE otherwise.
To do this, I wrote some R code, which basically boils down to this line:
location.df$isRelevant <- sapply(location.df$location, function(s) grepl(grep_places, s, ignore.case = TRUE))
whereby grep_places is the list of possible matching terms separated by "|" characters, to let R know that it can match any of the elements in the vector. I am running this on a remote high-capacity computer with over 2 TB of RAM, using RStudio (R 3.2.0), and I'm running it with pbsapply, which gives me a progress bar. As it turns out, this is taking ridiculously long: it's about 45% done so far (I started it more than a week ago), and it says it still needs over 270 hours to complete. That's obviously not a workable situation, as I'm going to have to run similar code in the future on far larger datasets. Have you got any idea how I might get this job done in a more acceptable timeframe, perhaps a day or so (keeping in mind the high-capacity computer)?
EDIT
Here's some semi-simulated data to show approximately what I'm working with:
> print(grep_places)
"Acworth NH|Albany NH|Alexandria NH|Allenstown NH|Alstead NH|Alton NH|Amherst NH|Andover NH|Antrim NH|Ashland NH|Atkinson NH|Auburn NH|Barnstead NH|Barrington NH|Bartlett NH|Bath NH|Bedford NH|Belmont NH|Bennington NH|Benton NH|Berlin NH|Bethlehem NH|Boscawen NH|Bow NH|Bradford NH|Brentwood NH|Bridgewater NH|Bristol NH|Brookfield NH|Brookline NH|Campton NH|Canaan NH|Candia NH|Canterbury NH|Carroll NH|CenterHarbor NH|Charlestown NH|Chatham NH|Chester NH|Chesterfield NH|Chichester NH|Claremont NH|Clarksville NH|Colebrook NH|Columbia NH|Concord NH|Conway NH|Cornish NH|Croydon NH|Dalton NH|Danbury NH|Danville NH|Deerfield NH|Deering NH|Derry NH|Dorchester NH|Dover NH|Dublin NH|Dummer NH|Dunbarton NH|Durham NH|EastKingston NH|Easton NH|Eaton NH|Effingham NH|Ellsworth NH|Enfield NH|Epping NH|Epsom NH|Errol NH|Exeter NH|Farmington NH|Fitzwilliam NH|Francestown NH|Franconia NH|Franklin NH|Freedom NH|Fremont NH|Gilford NH|Gilmanton NH|Gilsum NH|Goffstown NH|Gorham NH|Goshen NH|Grafton NH|Grantham NH|Greenfield NH|Greenland NH|Greenville NH|Groton NH|Hampstead NH|Hampton NH|HamptonFalls NH|Hancock NH|Hanover NH|Harrisville NH|Hart'sLocation NH|Haverhill NH|Hebron NH|Henniker NH|Hill NH|Hillsborough NH|Hinsdale NH|Holderness NH|Hollis NH|Hooksett NH|Hopkinton NH|Hudson NH|Jackson NH|Jaffrey NH|Jefferson NH|Keene NH|Kensington NH|Kingston NH|Laconia NH|Lancaster NH|Landaff NH|Langdon NH|Lebanon NH|Lee NH|Lempster NH|Lincoln NH|Lisbon NH|Litchfield NH|Littleton NH|Londonderry NH|Loudon NH|Lyman NH|Lyme NH|Lyndeborough NH|Madbury NH|Madison NH|Manchester NH|Marlborough NH|Marlow NH|Mason NH|Meredith NH|Merrimack NH|Middleton NH|Milan NH|Milford NH|Milton NH|Monroe NH|MontVernon NH|Moultonborough NH|Nashua NH|Nelson NH|NewBoston NH|NewCastle NH|NewDurham NH|NewHampton NH|NewIpswich NH|NewLondon NH|Newbury NH|Newfields NH|Newington NH|Newmarket NH|Newport NH|Newton NH|NorthHampton NH|Northfield NH|Northumberland NH|Northwood NH|Nottingham NH|Orange NH|Orford NH|Ossipee NH|Pelham NH|Pembroke NH|Peterborough NH|Piermont NH|Pittsburg NH|Pittsfield NH|Plainfield NH|Plaistow NH|Plymouth NH|Portsmouth NH|Randolph NH|Raymond NH|Richmond NH|Rindge NH|Rochester NH|Rollinsford NH|Roxbury NH|Rumney NH|Rye NH|Salem NH|Salisbury NH|Sanbornton NH|Sandown NH|Sandwich NH|Seabrook NH|Sharon NH|Shelburne NH"
> head(location.df, n = 20)
location isRelevant
1 London NA
2 Orleans village VT USA NA
3 The World NA
4 D M V Towson NA
5 Playa del Sol Solidaridad NA
6 Beautiful Downtown Burbank NA
7 <NA> NA
8 US NA
9 Gaithersburg Md NA
10 <NA> NA
11 California NA
12 Indy NA
13 Florida NA
14 exsnaveen com NA
15 Houston TX NA
16 Tweaking NA
17 Phoenix AZ NA
18 Malibu Ca USA NA
19 Hermosa Beach CA NA
20 California USA NA
Thanks in advance, everyone; I'd seriously appreciate any help with this.

grepl is a vectorized function; there should be no need to apply it in a loop. Have you tried:
#dput(location.df)
location.df<-structure(list(location = structure(c(12L, 14L, 17L, 5L, 16L,
2L, 1L, 19L, 8L, 1L, 3L, 11L, 7L, 6L, 10L, 18L, 15L, 13L, 9L,
4L), .Label = c("<NA>", "Beautiful Downtown Burbank", "California",
"California USA", "D M V Towson", "exsnaveen com", "Florida",
"Gaithersburg Md", "Hermosa Beach CA", "Houston TX", "Indy",
"London", "Malibu Ca USA", "Orleans village VT USA", "Phoenix AZ",
"Playa del Sol Solidaridad", "The World", "Tweaking", "US"), class = "factor"),
isRelevant = c(NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA)), .Names = c("location",
"isRelevant"), row.names = c(NA, -20L), class = "data.frame")
#grep_places with places in the test data
grep_places <- "Gaithersburg Md|Phoenix AZ"
location.df$isRelevant[grepl(grep_places, location.df$location, ignore.case = TRUE)] <- TRUE
or, for a slightly faster implementation, as per David Arenburg's comment:
location.df$isRelevant <- grepl(grep_places, location.df$location, ignore.case = TRUE)
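If the single vectorized call is still too slow at ~7 million rows, two variations that are often worth benchmarking (just a sketch, not tested at that scale) are the PCRE engine via perl = TRUE, and the stringi package:
# Base R with the PCRE engine, often faster for long alternation patterns:
location.df$isRelevant <- grepl(grep_places, location.df$location,
                                ignore.case = TRUE, perl = TRUE)
# Or with the stringi package (an extra dependency; install.packages("stringi")):
library(stringi)
location.df$isRelevant <- stri_detect_regex(location.df$location, grep_places,
                                            case_insensitive = TRUE)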

Related

Create a column value based on a matching regular expression

I have the following character string in a column called "Sentences" of a df:
I like an apple
I would like to create a second column, called Type, whose values are determined by matching strings. I would like to take the regular expression \bapple\b, match it against the sentence, and if it matches, add the value Fruit_apple in the Type column.
In the long run I'd like to do this with several other strings and types.
Is there an easy way to do this using a function?
dataset (survey_1):
structure(list(slider_8.response = c(1L, 1L, 3L, 7L, 7L, 7L,
1L, 3L, 2L, 1L, 1L, 7L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 6L, 1L, 7L,
7L, 7L, 1L, 1L, 7L, 6L, 6L, 1L, 1L, 7L, 1L, 7L, 7L, 1L, 7L, 7L,
7L, 7L, 7L, 6L, 7L, 7L, 7L, 1L, 1L, 6L, 1L, 1L, 1L, 1L, 7L, 2L
), Sentences = c("He might could do it.", "I ever see the film.",
"I may manage to come visit soon.", "She’ll never be forgotten.",
"They might find something special.", "It might not be a good buy.",
"Maybe my pain will went away.", "Stephen maybe should fix your bicycle.",
"It used to didnʼt matter if you walked in late.", "He’d could climb the stairs.",
"Only Graeme would might notice that.", "I used to cycle a lot. ",
"Your dad belongs to disagree with this. ", "We can were pleased to see her.",
"He may should take us to the city.", "I could never forgot his deep voice.",
"I should can turn this thing over to Ann.", "They must knew who they really are.",
"We used to runs down three flights.", "I don’t care what he may be up to. ",
"That’s something I ain’t know about.", "That must be quite a skill.",
"We must be able to invite Jim.", "She used to play with a trolley.",
"He is done gone. ", "You might can check this before making a decision.",
"It would have a positive effect on the team. ", "Ruth can maybe look for it later.",
"You should tag along at the dance.", "They’re finna leave town.",
"A poem should looks like that.", "I can tell you didn’t do your homework. ",
"I can driving now.", "They should be able to put a blanket over it.",
"We could scarcely see each other.", "I might says I was never good at maths.",
"The next dance will be a quickstep. ", "I might be able to find myself a seat in this place.",
"Andrew thinks we shouldn’t do it.", "Jack could give a hand.",
"She’ll be able to come to the event.", "She’d maybe keep the car the way it is.",
"Sarah used to be able to agree with this proposal.", "I’d like to see your lights working. ",
"I’d be able to get a little bit more sleep.", "John may has a second name.",
"You must can apply for this job.", "I maybe could wait till the 8 o’clock train.",
"She used to could go if she finished early.", "That would meaned something else, eh?",
"You’ll can enjoy your holiday.", "We liketa drowned that day. ",
"I must say it’s a nice feeling.", "I eaten my lunch."), construct = c(NA,
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,
NA, NA, NA, NA, NA)), row.names = c(NA, 54L), class = "data.frame")
type_list:
list("DM_will_can"=c("ll can","will can"), "DM_would_could"=c("d could","would could"),
"DM_might_can"="might can","DM_might_could"="might could","DM_used_to_could"="used to could",
"DM_should_can"="should can","DM_would_might"=c("d might", "would might"),"DM_may_should"="may should",
"DM_must_can"="must can", "SP_will_be_able"=c("ll be able","will be able"),
"SP_would_be_able"=c("d be able","would be able"),"SP_might_be_able"="might be able",
"SP_maybe_could"="maybe could","SP_used_to_be_able"="used to be able","SP_should_be_able"=
"should be able","SP_would_maybe"=c("d maybe", "would maybe"), "SP_maybe_should"="maybe should",
"SP_must_be_able"="must be able", "Filler_will_a"="quickstep","Filler_will_b"="forgotten",
"Filler_would_a"="lights working","Filler_would_b"="positive effect","Filler_can_a"="homework",
"Filler_can_b"="Ruth","Filler_could_a"="scarcely","Filler_could_b"="Jack", "Filler_may_a"="may be up to",
"Filler_may_b"="visit soon", "Filler_might_a"="good buy","Filler_might_be"="something special",
"Filler_should_a"="tag along","Filler_should_b"="Andrew","Filler_used_to_a"="trolley",
"Filler_used_to_b"="cycle a lot","Filler_must_a"="quite a skill","Filler_must_b"="nice feeling",
"Dist_gram_will_went"="will went","Dist_gram_meaned"="meaned","Dist_gram_can_were"="can were",
"Dist_gram_forgot"="never forgot", "Dist_gram_may_has"="may has",
"Dist_gram_might_says"="might says","Dist_gram_used_to_runs"="used to runs",
"Dist_gram_should_looks"="should looks","Dist_gram_must_knew"="must knew","Dist_dial_liketa"="liketa",
"Dist_dial_belongs"="belongs to disagree","Dist_dial_finna"="finna","Dist_dial_used_to_didnt"="used to didn't matter",
"Dist_dial_eaten"="I eaten", "Dist_dial_can_driving"="can driving","Dist_dial_aint_know"="That's something",
"Dist_dial_ever_see"="ever see the film","Dist_dial_done_gone"="done gone")
I would normally do this with a Python dictionary, but we're talking about R, so I've more or less translated the approach. There is probably a more idiomatic way to do this in R than two for loops (see the loop-free sketch after the EDIT below), but this should work:
# Define data
df <- data.frame(
  id = 1:5,
  sentences = c("I like apples", "I like dogs", "I have cats", "Dogs are cute", "I like fish")
)
# id sentences
# 1 1 I like apples
# 2 2 I like dogs
# 3 3 I have cats
# 4 4 Dogs are cute
# 5 5 I like fish
type_list <- list(
  "fruit" = c("apples", "oranges"),
  "animals" = c("dogs", "cats")
)
types <- names(type_list)
df$type <- NA
df$item <- NA
for (type in types) {
  for (item in type_list[[type]]) {
    matches <- grep(item, df$sentences, ignore.case = TRUE)
    df[matches, "type"] <- type
    df[matches, "item"] <- item
  }
}
# Output:
# id sentences type item
# 1 1 I like apples fruit apples
# 2 2 I like dogs animals dogs
# 3 3 I have cats animals cats
# 4 4 Dogs are cute animals dogs
# 5 5 I like fish <NA> <NA>
EDIT
Added after the data was posted. If I read in your data and call it df, and read in your type list and call it type_list, the following works:
types <- names(type_list)
df$type <- NA
df$item <- NA
for (type in types) {
  for (item in type_list[[type]]) {
    matches <- grep(item, df$Sentences, ignore.case = TRUE)
    df[matches, "type"] <- type
    df[matches, "item"] <- item
  }
}
This is exactly the same as my previous code, except Sentences has an upper case S in your data frame.
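As for a more idiomatic, loop-free variant: one option (a sketch, assuming df and type_list as above; note it keeps the first matching item per sentence, whereas the loops above keep the last) is to flatten type_list into a lookup table and build a matrix of match indicators:
# Flatten type_list into a (type, item) lookup table
lookup <- data.frame(
  type = rep(names(type_list), lengths(type_list)),
  item = unlist(type_list, use.names = FALSE),
  stringsAsFactors = FALSE
)
# One column of logical match indicators per item
m <- sapply(lookup$item, function(p) grepl(p, df$Sentences, ignore.case = TRUE))
# Index of the first matching item for each sentence (NA if none matches)
first_hit <- apply(m, 1, function(r) if (any(r)) which(r)[1] else NA_integer_)
df$type <- lookup$type[first_hit]
df$item <- lookup$item[first_hit]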

Merging / combining columns is not working at all

I can't for the life of me figure out a way of getting this working without changing the class of the columns or getting a random level that isn't even in the original columns!
I have data that looks like this:
data <- structure(list(WHY = structure(1:4, .Label = c("WHY1", "WHY2",
"WHY3", "WHY4"), class = "factor"), HELP1 = structure(c(3L, NA,
1L, 2L), .Label = c("1", "2", "D/A"), class = "factor"), HELP2 = c(NA,
2L, NA, NA)), class = "data.frame", row.names = c(NA, -4L))
What I want to do:
If HELP2 is not NA and HELP1 is "D/A", then merge the columns WITHOUT changing class.
Here is what I tried:
data$HELP3 <-
  ifelse(
    !is.na(data$HELP2) &
      data$HELP1 == "D/A",
    data$HELP1, data$HELP2
  )
Result:
data
WHY HELP1 HELP2 HELP3
1 WHY1 D/A NA NA
2 WHY2 <NA> 2 NA
3 WHY3 1 NA NA
4 WHY4 2 NA NA
I would be so very grateful for any help with this. I have been on Stack Overflow for 5 hours now and am no closer to making this work :( I am not that hot with dplyr, so base R or anything else would be wonderful!
HELP2 and HELP1 have different classes, and ifelse also has trouble returning a vector of factor class. You can, however, do this without ifelse and without changing the classes of the columns.
data$HELP3 <- data$HELP1                            # copy HELP1, preserving its factor class
inds <- (!is.na(data$HELP2)) & data$HELP1 == "D/A"  # rows where the merge condition holds
data$HELP3[inds] <- data$HELP2[inds]                # fill those rows from HELP2
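To see why the ifelse route misbehaves with factors: ifelse builds its result from the test vector, so a factor passed as the yes/no argument comes back as its underlying integer codes rather than its labels. A quick illustration:
f <- factor(c("1", "2", "D/A"))
ifelse(rep(TRUE, 3), f, NA)
# [1] 1 2 3   <- the integer codes of the levels, not "1", "2", "D/A"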

RecordLinkage: how to pair only best matches and export a merged table?

I am trying to use the R package RecordLinkage to match items in the purchase orders list with entries in the master catalogue. Below is the R code and a reproducible example using two dummy datasets (DOrders and DCatalogue):
DOrders <- structure(list(Product = structure(c(1L, 2L, 7L, 3L, 4L, 5L,
6L), .Label = c("31471 - SOFTSILK 2.0 SCREW 7mm x 20mm", "Copier paper white A4 80gsm",
"High resilience memory foam standard mattress", "Liston forceps bone cutting 152mm",
"Micro reciprocating blade 25.4mm x 8.0mm x 0.38mm", "Micro reciprocating blade 39.5 x 7.0 x 0.38",
"microaire dual tooth 18 x 90 x 0.89"), class = "factor"), Supplier = structure(c(5L,
6L, 2L, 1L, 4L, 3L, 3L), .Label = c("KAROMED LTD", "Morgan Steer Ortho Limited",
"ORTHOPAEDIC SOLUTIONS", "SURGICAL HOLDINGS", "T J SMITH NEPHEW LTD",
"XEROX SOLUTIONS"), class = "factor"), UOI = structure(c(1L,
1L, 1L, 1L, 1L, 1L, 2L), .Label = c("Each", "Pack"), class = "factor"),
Price = c(5.99, 6.99, 40, 230, 35, 80, 79)), .Names = c("Product",
"Supplier", "UOI", "Price"), class = "data.frame", row.names = c(NA,
-7L))
DCatalogue <- structure(list(Product = structure(c(7L, 3L, 4L, 5L, 6L, 2L,
8L, 1L), .Label = c("7.0mm cann canc scr 32x80mm non sterile single use",
"A4 80gsm white copier paper", "High resilience memory foam standard hospital mattress with stitched seams has a fully enclosing cover",
"Liston bone cutting forceps with fluted handle straight 152mm",
"Micro reciprocating blade 25.4mm x 8.0mm x 0.38mm", "Micro reciprocating blade 39.5mm x 7.0mm x 0.38mm",
"microaire large osc dual tooth 18mm x 90mm x 0.89mm", "Softsilk 2.0 pkg 7x20 ster"
), class = "factor"), Supplier = structure(c(3L, 2L, 6L, 4L,
4L, 7L, 5L, 1L), .Label = c("BIOMET MERCK LTD", "KAROMED LIMITED",
"MORGAN STEER ORTHOPAEDICS LTD", "ORTHO SOLUTIONS", "SMITH & NEPHEW ADVANCED SURGICAL DEVICES",
"SURGICAL HOLDINGS", "XEROX"), class = "factor"), UOI = structure(c(1L,
1L, 1L, 2L, 2L, 1L, 1L, 1L), .Label = c("Each", "Pack"), class = "factor"),
RefPrice = c(38.7, 274.18, 34.96, 79.48, 81.29, 6.99, 5.99,
5)), .Names = c("Product", "Supplier", "UOI", "RefPrice"), class = "data.frame", row.names = c(NA,
-8L))
For the purpose of experimenting, DOrders has 7 entries, each of which matches one of the eight rows in the reference set DCatalogue. In the real data, not all orders would match.
> head(DOrders)
Product Supplier UOI Price
1 31471 - SOFTSILK 2.0 SCREW 7mm x 20mm T J SMITH NEPHEW LTD Each 5.99
2 Copier paper white A4 80gsm XEROX SOLUTIONS Each 6.99
3 microaire dual tooth 18 x 90 x 0.89 Morgan Steer Ortho Limited Each 40.00
4 High resilience memory foam standard mattress KAROMED LTD Each 230.00
5 Liston forceps bone cutting 152mm SURGICAL HOLDINGS Each 35.00
6 Micro reciprocating blade 25.4mm x 8.0mm x 0.38mm ORTHOPAEDIC SOLUTIONS Each 80.00
> head(DCatalogue)
Product Supplier UOI RefPrice
1 microaire large osc dual tooth 18mm x 90mm x 0.89mm MORGAN STEER ORTHOPAEDICS LTD Each 38.70
2 High resilience memory foam standard hospital mattress with stitched seams has a fully enclosing cover KAROMED LIMITED Each 274.18
3 Liston bone cutting forceps with fluted handle straight 152mm SURGICAL HOLDINGS Each 34.96
4 Micro reciprocating blade 25.4mm x 8.0mm x 0.38mm ORTHO SOLUTIONS Pack 79.48
5 Micro reciprocating blade 39.5mm x 7.0mm x 0.38mm ORTHO SOLUTIONS Pack 81.29
6 A4 80gsm white copier paper XEROX Each 6.99
The first step in the linkage is to ensure that items match by unit of issue (UOI). This is because a pack of items is obviously not the same as one unit, even if the items are exactly the same. E.g.:
Micro reciprocating blade 25.4mm x 8.0mm x 0.38mm ORTHOPAEDIC SOLUTIONS Each 80.00
Is the same item, but should be a non-match to:
Micro reciprocating blade 25.4mm x 8.0mm x 0.38mm ORTHO SOLUTIONS Pack 79.48
Hence, I am using the blocking argument blockfld = 3 to match only those entries with identical values in the 3rd column, and exclude = 4 to exclude the price from matching; the price will differ between the Orders and the Catalogue and is itself the main interest of the matching. The matching is done using the jarowinkler string comparator (as described here) on the Product and Supplier names:
library(RecordLinkage)
rpairs <- compare.linkage(DOrders, DCatalogue,
blockfld = 3,
exclude = 4,
strcmp = 1:2,
strcmpfun = jarowinkler)
Next, I am computing the weights for each pair using the Contiero et al. (2005) method:
rpairs <- epiWeights(rpairs)
> summary(rpairs)
Weight distribution:
[0.3,0.4] (0.4,0.5] (0.5,0.6] (0.6,0.7] (0.7,0.8] (0.8,0.9] (0.9,1]
1 1 19 10 3 0 4
Based on this distribution, I want to classify as matches only those pairs with a weight > 0.7
result <- epiClassify(rpairs, 0.7)
> summary(result)
7 links detected
0 possible links detected
31 non-links detected
This is as far as I got, but there are some problems with this.
First, getPairs(result) shows that one entry from DOrders can have a high-weight match with more than one entry in DCatalogue. E.g.
This pair is correctly matched, with a weight of 0.948
Micro reciprocating blade 39.5 x 7.0 x 0.38 ORTHOPAEDIC SOLUTIONS Pack 79
Micro reciprocating blade 39.5mm x 7.0mm x 0.38mm ORTHO SOLUTIONS Pack 81.29 0.9480503
but is also matched incorrectly with a weight of 0.928:
Micro reciprocating blade 39.5 x 7.0 x 0.38 ORTHOPAEDIC SOLUTIONS Pack 79
Micro reciprocating blade 25.4mm x 8.0mm x 0.38mm ORTHO SOLUTIONS Pack 79.48 0.9283522
Obviously, I need to restrict pairing to only one best match with the highest weight, but how to do it?
And finally, the end result that I am looking for is a merged dataset that contains matched entries from both Orders and Catalogue in one row, with all columns from both original sets side by side for comparison. getPairs produces an output in an awkward format:
> getPairs(result)
id Product Supplier UOI Price Weight
1 7 Micro reciprocating blade 39.5 x 7.0 x 0.38 ORTHOPAEDIC SOLUTIONS Pack 79
2 5 Micro reciprocating blade 39.5mm x 7.0mm x 0.38mm ORTHO SOLUTIONS Pack 81.29 0.9480503
3
4 5 Liston forceps bone cutting 152mm SURGICAL HOLDINGS Each 35
5 3 Liston bone cutting forceps with fluted handle straight 152mm SURGICAL HOLDINGS Each 34.96 0.9329244
...
First of all, thank you for providing a reproducible example, which makes answering your questions much easier. I will start with your second question:
And finally, the end result that I am looking for is a merged dataset that contains matched entries from both Orders and Catalogue in one row, with all columns from both original sets side by side for comparison.
With single.rows=TRUE, getPairs lists both entries in one line. Furthermore, show="links" limits the output to pairs classified as belonging together (see ?getPairs for details):
> matchedPairs <- getPairs(result, single.rows=TRUE, show="links")
However, this does not put matching columns next to each other; instead, all columns of record one are followed by all columns of record two (and finally the matching weight as the last column). I show only the column names here, as the whole table is really wide:
> names(matchedPairs)
[1] "id1" "Product.1" "Supplier.1" "UOI.1" "Price.1" "id2" "Product.2" "Supplier.2" "UOI.2" "RefPrice.2" "Weight"
So if you want direct column-to-column comparison in this format, you have to rearrange the columns to fit your needs.
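For example, to interleave the corresponding columns (using the names listed above):
matchedPairs <- matchedPairs[, c("id1", "id2", "Product.1", "Product.2",
                                 "Supplier.1", "Supplier.2", "UOI.1", "UOI.2",
                                 "Price.1", "RefPrice.2", "Weight")]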
Obviously, I need to restrict pairing to only one best match with the highest weight, but how to do it?
This functionality is not provided by the package and I believe the process of choosing one-to-one assignments from a record linkage result needs some conceptual attention on its own. I have never gone deeply into this step, so the following might just be an idea to start with. You can use the data.table library to choose from each group of pairs with the same left-hand id the one with maximum weight (compare How to select the row with the maximum value in each group):
> library(data.table)
> matchedPairs <- data.table(matchedPairs)
> matchedPairs[matchedPairs[,.I[which.max(Weight)],by=id1]$V1, list(id1,id2)]
id1 id2
1: 7 5
2: 5 3
3: 4 2
4: 2 6
5: 6 1
6: 3 1
Here, list(id1,id2) limits the output to the record ids.
In order to also eliminate double mappings for the right-hand ids (in this case, 1 appears twice for id2), you would have to repeat the process for id2. Note, however, that in some situations choosing pairs with the highest weight in step 1 (reducing to unique values for id1) may remove pairs where the weight is maximal for a given value of id2. So for choosing an optimal overall mapping (e.g. maximizing the sum of weights of all chosen mappings), a non-greedy optimization strategy is needed.
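One way to get such an optimal one-to-one mapping (a sketch, not part of RecordLinkage itself; it treats the problem as a linear assignment problem and assumes the clue package is installed) is the Hungarian algorithm via clue::solve_LSAP:
library(clue)  # provides solve_LSAP(), a Hungarian-algorithm solver
ids1 <- sort(unique(matchedPairs$id1))
ids2 <- sort(unique(matchedPairs$id2))
# Weight matrix over all id1/id2 combinations; pairs that were never
# compared keep weight 0
w <- matrix(0, nrow = length(ids1), ncol = length(ids2),
            dimnames = list(ids1, ids2))
w[cbind(match(matchedPairs$id1, ids1),
        match(matchedPairs$id2, ids2))] <- matchedPairs$Weight
# solve_LSAP() needs nrow(w) <= ncol(w); if not, solve on t(w) and invert
assignment <- solve_LSAP(w, maximum = TRUE)
data.frame(id1 = ids1, id2 = ids2[assignment])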
Update: Using classes and methods for big data sets
For large datasets, the so-called "big data" classes and methods can be used (see https://cran.r-project.org/web/packages/RecordLinkage/vignettes/BigData.pdf). These use file-backed data structures, so the size limit is the available disk space. The syntax is mostly, but not completely, identical. For this example, the necessary calls to achieve the same result as above would be:
rpairs <- RLBigDataLinkage(DOrders, DCatalogue,
blockfld = 3,
exclude = 4,
strcmp = 1:2,
strcmpfun = "jarowinkler")
rpairs <- epiWeights(rpairs)
result <- epiClassify(rpairs, 0.7)
matchedPairs <- getPairs(result, single.rows=TRUE, filter.link="link")
matchedPairs <- data.table(matchedPairs)
matchedPairs[matchedPairs[,.I[which.max(Weight)],by=id.1]$V1, list(id.1,id.2)]
However, concerning your size estimation of 2 TB, this is still not feasible. I think you have to further reduce the number of pairs by additional blocking.
The problem in this case is that the package only supports "hard" blocking criteria (i.e. two records must match exactly in the blocking field). When linking personal data (which was our use case while developing the package), the day, month and year components of the date of birth can usually be combined for blocking in such a way that the number of pairs is significantly reduced without missing match candidates. As far as I can judge from the examples, further "hard" blocking is not possible for your data, as the matching pairs have only similar, not equal, attribute values (aside from the "unit of issue", which you already use for blocking). A criterion like "only consider pairs where the string similarity of the product names is greater than [some threshold]" seems most appropriate to me. To achieve this, you would have to extend compare.linkage() or RLBigDataLinkage().
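As a rough illustration of that idea without touching the package internals (a sketch; the 0.8 threshold is arbitrary): build the UOI-blocked candidate pairs yourself and keep only those whose product names are sufficiently similar, before handing them to a finer comparison step.
library(RecordLinkage)  # for the vectorized jarowinkler() comparator
# All candidate pairs that agree on the unit of issue
cand <- merge(
  data.frame(id1 = seq_len(nrow(DOrders)), UOI = DOrders$UOI),
  data.frame(id2 = seq_len(nrow(DCatalogue)), UOI = DCatalogue$UOI),
  by = "UOI"
)
# Keep only pairs whose product names pass a similarity threshold
sim <- jarowinkler(as.character(DOrders$Product[cand$id1]),
                   as.character(DCatalogue$Product[cand$id2]))
cand <- cand[sim >= 0.8, ]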

R data.table get maximum value per row for multiple columns

I've got a data.table in R which looks like that one:
dat <- structure(list(de = c(1470L, 8511L, 3527L, 2846L, 2652L, 831L
), fr = c(14L, 81L, 36L, 16L, 30L, 6L), it = c(9L, 514L, 73L,
37L, 91L, 2L), ro = c(1L, 14L, 11L, 1L, 9L, 0L)), .Names = c("de",
"fr", "it", "ro"), class = c("data.table", "data.frame"), row.names = c(NA,
-6L))
I now want to create a new data.table (having exactly the same columns) but holding only the maximum value per row. The values in the other columns should simply be NA.
The data.table could have any number of columns (the data.table above is just an example).
The desired output table would look like this:
de fr it ro
1: 1470 NA NA NA
2: 8511 NA NA NA
3: 3527 NA NA NA
4: 2846 NA NA NA
5: 2652 NA NA NA
6: 831 NA NA NA
There are several issues with what the OP is attempting here: (1) this really looks like a case where data should be kept in a matrix rather than a data.frame or data.table; (2) there's no reason to want this sort of output that I can think of; and (3) doing any standard operations with the output will be a hassle.
With that said...
dat2 = dat
# flag everything except each row's maximum as NA, via linear (matrix-style) indexing
is.na(dat2)[-(1:nrow(dat) + (max.col(dat) - 1) * nrow(dat))] <- TRUE
# or, as @PierreLafortune suggested
is.na(dat2)[col(dat) != max.col(dat)] <- TRUE
# or using the data.table package
dat2 = dat[rep(NA_integer_, nrow(dat)), ]  # all-NA copy with the same structure
mc = max.col(dat)
for (i in seq_along(mc)) set(dat2, i = i, j = mc[i], v = dat[i, mc[i]])
It's not clear to me whether you mean that you want to use the data.table package, or if you are satisfied with making a data.frame using only base functions. It is certainly possible to do the latter.
Here is one solution, which uses only max() and which.max(): loop over the rows, writing each row's maximum into an all-NA copy of the data so that the result keeps its rectangular structure.
maxdat <- as.data.frame(dat)   # plain data.frame copy with the same shape and names
maxdat[] <- NA                 # blank out every cell
for (i in seq_len(nrow(maxdat))) {
  row <- unlist(dat[i, ])      # one row as a named numeric vector
  maxdat[i, which.max(row)] <- max(row)
}
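A quick check against the desired output (with dat from above, the row maximum is always in de):
maxdat
#     de fr it ro
# 1 1470 NA NA NA
# 2 8511 NA NA NA
# 3 3527 NA NA NA
# 4 2846 NA NA NA
# 5 2652 NA NA NA
# 6  831 NA NA NA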

Record how long a variable was above a level in R

I am working on converting a project that I currently have programmed in Excel to R. The reason for doing this is that the code includes lots of logic and data, which means that Excel's performance is very poor. So far I have coded up around 50% of this project in R and I am extremely impressed with the performance.
The code I have does the following:
Loads 5-minute time-series data for a stock and adds a day-of-the-year column, labeled doy in the example below.
The OHLC data looks like this:
Date Open High Low Close doy
1 2015-09-21 09:30:00 164.6700 164.7100 164.3700 164.5300 264
2 2015-09-21 09:35:00 164.5300 164.9000 164.5300 164.6400 264
3 2015-09-21 09:40:00 164.6600 164.8900 164.6000 164.8900 264
4 2015-09-21 09:45:00 164.9100 165.0900 164.9100 164.9736 264
5 2015-09-21 09:50:00 164.9399 165.0980 164.8200 164.8200 264
Converts that data to a table called df:
df <- tbl_df(DIA_5)
Using mainly plyr, with a hint of TTR, it filters through the data, creating a set of 10 new variables in a new data frame called data. See below:
data <- structure(list(doy = c(264, 265, 266, 267, 268, 271, 272, 11,12, 13),
Date = structure(c(1442824200, 1442910600, 1442997000,1443083400,
1443169800, 1443429000, 1443515400, 1452504600,
1452591000,1452677400), class = c("POSIXct", "POSIXt"), tzone = ""),
OR_High = c(164.71,162.96, 163.38, 161.37, 163.91, 162.06, 160.22,
164.5, 165.23,165.84), OR_Low = c(164.37, 162.62, 162.98, 161.06,
163.57, 161.66,159.7, 164.06, 164.84, 165.4), HOD = c(165.56, 163.36,
163.38,162.24, 164.43, 162.06, 160.96, 164.5, 165.78, 165.84), LOD =
c(165.22,163.1, 162.98, 161.95, 164.24, 161.66, 160.75, 164.06,
165.56,165.4), Close = c(164.92, 163.02, 162.58, 161.85, 162.94,
159.84,160.19, 163.83, 165.02, 161.38), Range =
c(0.340000000000003,0.260000000000019, 0.400000000000006,
0.29000000000002, 0.189999999999998,0.400000000000006,
0.210000000000008, 0.439999999999998, 0.219999999999999,0.439999999999998),
`A-val` = c(NA, NA, NA, NA, NA, NA, NA,
0.0673439999999994,0.0659639999999996, 0.0729499999999996),
`A-up` = c(NA, NA, NA,NA, NA, NA, NA, 164.567344, 165.295964,
165.91295), `A-down` = c(NA,NA, NA, NA, NA, NA, NA, 163.992656,
164.774036, 165.32705)), .Names = c("doy","Date", "OR_High", "OR_Low",
"HOD", "LOD", "Close", "Range","A-val", "A-up", "A-down"),
row.names = c(1L, 2L, 3L, 4L, 5L,6L, 7L, 78L, 79L, 80L),
class = "data.frame")
The next part is where it gets complicated. What I need to do is to analyse the high and low prices of each 5-minute bar of the day in relation to my A-up, A-down and close values as seen in the table. What I am looking for is to be able to compute a score for the day depending on the time spent above the A-up level or below the A-down level.
The way I handled this in Excel was to index each 5-minute high and low price of the time series, then use logic to score the activity in that 5-minute slice: a bar scored 1 if its low was > the A-up level, and -1 if its high was < A-down. If price stayed > the A-up level or < the A-down level for more than 30 minutes, I scored the day a 2 or -2. This was achieved with a running 5-period sum of the bar scores: if it showed a full run of ones, I knew that price had stayed > the A-up level long enough, and I would score it a 2.
For the days scoring I need to know the following;
Did price stay above or below an A level for > 30 minutes, or fail by spending < 30 minutes there?
If price went above and below both levels in one day, which level did it break first?
So, after a long-winded intro, my question: does anyone out there have a good idea of the best way to go about coding this? I don't need specific code so much as pointers to which packages may help to accomplish this. As I mentioned above, my reason for switching to R was mainly speed, so whatever code is used must be efficient. When I have this coded I intend to program a loop so that it can analyse several hundred instruments.
Thanks.
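For what it's worth, here is a minimal sketch of the per-bar scoring described above in base R (my reading of the Excel logic, with the 30-minute window taken as six 5-minute bars; score_day, bars, a_up and a_down are made-up names, with bars being one day's 5-minute OHLC rows and a_up/a_down that day's levels from data):
score_day <- function(bars, a_up, a_down) {
  # +1 if the whole bar sits above A-up, -1 if it sits below A-down
  s <- ifelse(bars$Low > a_up, 1L, ifelse(bars$High < a_down, -1L, 0L))
  # running sum over six bars detects a full half hour beyond a level
  run6 <- stats::filter(s, rep(1, 6), sides = 1)
  if (any(run6 == 6, na.rm = TRUE)) return(2)
  if (any(run6 == -6, na.rm = TRUE)) return(-2)
  sign(sum(s))   # otherwise score by which side dominated
}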
