R: Finding the most recent duplicate in a large data frame

I have a large (~18 million records) database of insurance policies, and I need to determine if each policy has been renewed or not. Imagine that a few records look like this: (today is October 5, 2022):
policy_number   prior_policy_number   zip_code   expiration_date
123456                                90210      2023-10-01
123456          987654                90210      2022-10-01
987654                                90210      2021-10-01
456654                                10234      2019-05-01
The first line is a current policy, because 2023-10-01 is in the future.
The second line was renewed (by the first line).
The third line was renewed by the second line--we can tell because the second line's prior policy number matches the third line's policy number.
The fourth line was not renewed.
So a policy is renewed if either:
a) there is another policy with the same policy number and zip code but a later expiration date
b) there is another policy whose prior policy number matches this policy number, they have the same zip code, and the other policy has a later expiration date.
(Zip code is necessary because some insurers use policy numbers like "00000002" and this disambiguates duplicates.)
I wrote the following code, which works but takes forever to execute. Basically, I sort the data frame by descending expiration date, and then for each observation I create a miniature data frame consisting of just the policies with the same policy number or prior policy number and zip code, and check whether the expiration date of the first (and therefore latest) one is later than that of the policy in question. I realize this is probably a horrible way to do this.
Does anyone have suggestions for how to make it more efficient?
non_renewals <- valid_zip_policies %>% arrange(desc(expiration_date))

check_renewed <- function(policy, zip, exp) {
  # Subset the main data frame to policies with this policy number (or with this
  # policy as the prior policy number) and the matching zip code
  cat(policy, zip, exp)
  test_renewed <- valid_zip_policies %>%
    select(c("policy_number", "prior_policy_number", "zip_code", "expiration_date")) %>%
    filter(policy_number == policy | prior_policy_number == policy) %>%
    filter(zip_code == zip)
  # These are all the policies for the given policy number, sorted from latest to
  # earliest expiration date. If the most recent expiration date is later than
  # this policy's, it was renewed.
  if (test_renewed$expiration_date[1] > exp) {
    return(TRUE)
  } else {
    return(FALSE)
  }
}

for (i in 1:nrow(non_renewals)) {
  non_renewals$renewed[i] <- check_renewed(non_renewals$policy_number[i],
                                           non_renewals$zip_code[i],
                                           non_renewals$expiration_date[i])
}

So I was able to answer my own question! The following code is literally about 100 times faster. Two things helped:
By far the greatest speed boost came from using data.tables from the data.table package rather than data frames. That package also provides the fifelse() function you see below.
Using the parallel package and its mclapply() function gave an additional speed boost on my system.
It may also have helped that instead of passing three items from the original table to the function, I just pass the row number and let the function retrieve the items as necessary.
non_renewals <- setDT(non_renewals)

check_renewed <- function(obs) {
  # If the expiration date of the latest policy with this number and zip is later, it was renewed
  if (non_renewals[policy_number == policy_number[obs] & zip_code == zip_code[obs],
                   expiration_date][1] > non_renewals$expiration_date[obs]) {
    return("RENEWED")
  }
  # If not, check the prior policies
  final <- fifelse(
    non_renewals[prior_policy_number == policy_number[obs] & zip_code == zip_code[obs],
                 expiration_date][1] > non_renewals$expiration_date[obs],
    "RENEWED", "NONRENEWED", na = "NONRENEWED"
  )
  return(final)
}

renewed <- character(10000)
system.time(renewed <- mclapply(1:10000, function(i) check_renewed(i)))
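For what it's worth, here is a fully vectorized sketch of the same logic (untested against the real 18-million-row table, and assuming missing prior policy numbers are stored as NA): build one long table of (key, zip, expiration) rows where the key is either the policy number or the prior policy number, take the latest expiration per key and zip, and compare each policy against that maximum. This avoids the per-row lookups entirely.

library(data.table)

dt <- as.data.table(valid_zip_policies)

# One row per (key, zip, expiration), where the key is the policy number itself
# or the prior policy number it points back to. Drop rows with no key.
keys <- rbind(
  dt[, .(key = policy_number,       zip_code, expiration_date)],
  dt[, .(key = prior_policy_number, zip_code, expiration_date)]
)[!is.na(key)]

# Latest expiration date seen for each key within each zip code
latest <- keys[, .(max_exp = max(expiration_date)), by = .(key, zip_code)]

# A policy is renewed if some other record under the same key and zip expires later
dt[latest, max_exp := i.max_exp, on = .(policy_number = key, zip_code)]
dt[, renewed := !is.na(max_exp) & max_exp > expiration_date]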

Related

Efficient way to reshape a dataset from long to wide

I have a medical dataset which looks like this:
patient_id disease_id
1111111111 DISEASE:1
1111111111 DISEASE:2
1111111111 DISEASE:3
1111111111 DISEASE:4
1111111111 DISEASE:5
1111111111 DISEASE:6
1111111111 DISEASE:6
1111111112 DISEASE:1
1111111112 DISEASE:2
1111111112 DISEASE:4
1111111113 DISEASE:1
1111111113 DISEASE:5
which I need to feed into a neural network/random forest model. So the only natural data representation I could think of to feed into the models was:
patient_id DISEASE:1 DISEASE:2 DISEASE:3 DISEASE:4 DISEASE:5 DISEASE:6 ...
11111111111 1 1 1 1 1 1 ...
11111111112 1 1 0 1 0 0 ...
11111111113 1 0 0 0 1 0 ...
But my dataset is very big (~50 GB, 1.5 GB compressed) and has tons of disease_ids, so reshaping this data in the most efficient way I could find in R would require 11.7 TB of space in compressed Rds format (I know this because I divided the dataset into 100 chunks and reshaping a single one produced a 117 GB Rds file; merging 100 of these would produce something larger than 11.7 TB).
Now, I have 5 datasets this big that I need to merge together, so I feel a bit stuck. I need to come up with a more efficient data representation but don't know how, as I am dealing with categorical variables that will require one-hot encoding. Can anyone suggest alternative ways to deal with data like this?
Given the size of the input you will want to do stream processing, and R is not well suited to that, so here we use a simple gawk program instead.
gawk is available in Rtools on Windows and comes natively with most UNIX/Linux systems.
In the first pass the gawk program creates an associative array disease from the disease field, i.e. the second field, of the input. Presumably the number of diseases is much smaller than the length of the file, so this should fit comfortably in memory.
Then in a second pass it reads each group of records corresponding to a patient, assuming that all records for a patient are consecutive. For each patient it outputs a single row with the patient id and a sequence of 0's and 1's such that the i-th value indicates the absence (0) or presence (1) of the i-th disease.
FNR == 1 { next }                 # skip header on both passes

# first pass - create disease array
FNR == NR {
    disease[$2] = 0;
    next;
}

# second pass - create and output flattened records
{
    if ($1 != prevkey && FNR > 2) {
        printf("%s ", prevkey);
        for (d in disease) printf("%d ", disease[d]);
        printf("\n");
        for (d in disease) disease[d] = 0;
    }
    disease[$2] = 1;
    prevkey = $1;
}

END {
    if (FNR == NR) {
        for (d in disease) print d;
    } else {
        printf("%s ", prevkey);
        for (d in disease) printf("%d ", disease[d]);
        printf("\n");
    }
}
If we put the above gawk code in model_mat.awk then we can run it like this -- note that the file must be specified twice -- once for each of the two passes:
gawk -f model_mat.awk disease.txt disease.txt
The output is the following, assuming each disease should be indicated by 1 if present and 0 if not:
1111111111 1 1 1 1 1 1
1111111112 1 1 0 1 0 0
1111111113 1 0 0 0 1 0
If we run it with only one disease.txt argument then it will only run the first pass and then at the end list the diseases without duplicates:
gawk -f model_mat.awk disease.txt
giving:
DISEASE:1
DISEASE:2
DISEASE:3
DISEASE:4
DISEASE:5
DISEASE:6
Listing diseases
An alternative for listing the diseases is this UNIX pipeline, which lists the diseases without duplicates and sorts them: sed removes the header, cut takes the third space-separated field (it is the third because there are two spaces between the two fields), and sort -u sorts it, keeping only unique elements.
sed 1d disease.txt | cut -f 3 -d " " | sort -u > diseases-sorted.txt
Sorting and Merging
The GNU sort utility can sort and merge files larger than memory and has a parallel option to speed it up. Also see the free cmsort utility (Windows only).
csvfix
Below are some scripts using the free csvfix command-line utility. You may need to adjust the quotes depending on the command-line processor/shell you are using, and you will need to put each pipeline on a single line or escape the newlines appropriately (backslash for bash, caret for Windows cmd). We have shown each pipeline spread over separate lines for clarity.
The first pipeline below creates a one-column list of diseases in disease-list.txt. The first csvfix command in it removes the header, the second csvfix command extracts the second field (i.e. drops the patient id) and the last csvfix command reduces it to unique diseases.
The second pipeline below creates a file with one row per patient with the patient id followed by the diseases for that patient. The first csvfix command in it removes the header, the second converts it to csv format and the last csvfix command flattens it.
csvfix remove -if "$line == 1" -smq disease.txt |
csvfix read_dsv -s " " -cm -f 2 |
csvfix uniq -smq > disease-list.txt
csvfix remove -if "$line == 1" -smq disease.txt |
csvfix read_dsv -s " " -cm -f 1,2 |
csvfix flatten -smq > flat.txt
You are raising an interesting question. Analyzing that volume of data with R will be a real challenge.
So, I can only give you general advice. First, I think you need to distinguish RAM from disk storage. Using Rds won't help you with the efficiency of the reshaping, but it will produce smaller data on disk than csv.
Concerning the efficiency of the reshaping
data.table
If you want an in-memory approach, I don't see any other possibility than using data.table::dcast. In that case, follow @Ronak Shah's recommendation:
library(data.table)
setDT(df)
df[, n := 1]
dcast(unique(df), patient_id~ disease_id, value.var = "n", fill = 0)
?data.table::dcast :
In the spirit of data.table, it is very fast and memory efficient, making it well-suited to handling large data sets in RAM. More importantly, it is capable of handling very large data quite efficiently in terms of memory usage.
Other solutions
With data that voluminous, I think an in-memory approach is not the most appropriate. You might have a look at database approaches (especially PostgreSQL) or Spark.
Databases
You have several options for using PostgreSQL from R. One of them is dbplyr: if you know the tidyverse syntax, you will find familiar verbs. The pivot operation is a little trickier on a database than on a standard R data frame, but you might find some ways to do it. You won't have difficulty finding people more expert on databases than me who can give you very interesting tricks.
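As a rough dbplyr sketch, assuming the raw pairs have been loaded into a PostgreSQL table called "patient_disease" with columns patient_id and disease_id (the table, database and connection details here are assumptions, not from the question):

library(DBI)
library(dplyr)

con <- dbConnect(RPostgres::Postgres(), dbname = "medical")

diseases <- tbl(con, "patient_disease")

# Deduplicate in the database, then add an indicator column; the wide step can
# be done with tidyr::pivot_wider(), which recent dbplyr versions translate to
# SQL, or chunk by patient after collect() if that translation is unavailable.
per_pair <- diseases %>%
  distinct(patient_id, disease_id) %>%
  mutate(n = 1L)

wide <- per_pair %>%
  tidyr::pivot_wider(names_from = disease_id,
                     values_from = n,
                     values_fill = 0L)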
Spark
Spark can be a very good candidate for performing the reshaping if you can spread your tasks among executors on a server. If you are on a personal computer (standalone mode), you can still parallelize tasks between your cores, but do not forget to change the spark.memory.fraction parameter of the session, otherwise I think you might run into out-of-memory problems. I am more used to pyspark than SparkR, but I think the logic will be the same.
Since Spark 1.6, you can pivot your data (e.g. pyspark doc). This enables a long-to-wide conversion. Something in this spirit (pyspark code):
df.withColumn("n", psf.lit(1)).groupBy("patient_id").pivot("disease_id").sum("n")
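If you prefer to stay in R, a sparklyr sketch of the same idea might look like this (the connection settings, file path and delimiter are assumptions; sparklyr is used here instead of SparkR):

library(sparklyr)
library(dplyr)

sc  <- spark_connect(master = "local")
sdf <- spark_read_csv(sc, name = "diseases", path = "disease.txt",
                      delimiter = " ", header = TRUE)

wide <- sdf %>%
  distinct(patient_id, disease_id) %>%
  sdf_pivot(patient_id ~ disease_id)   # default aggregate is a count, so present
                                       # pairs become 1 and absent ones stay null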
Concerning the size on disk
You use Rds. There are more compressed formats, e.g. fst. parquet files are also very compressed, maybe one of the best options to store voluminous data; you can read them with SparkR or with the arrow package.
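A quick illustration of those on-disk options (here df stands for the long patient/disease table and the file paths are placeholders):

library(fst)
library(arrow)

write_fst(df, "diseases.fst", compress = 100)   # fast, highly compressed, R-only format
write_parquet(df, "diseases.parquet")           # columnar format readable by Spark and arrow

df_back <- read_parquet("diseases.parquet")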

E-Mail Text Parsing and Extracting via Regular Expression in R

After over a year of struggling to no avail, I'm turning to the SO community for help. I've used various RegEx creator sites, standalone RegEx creator software, as well as manual editing, all in a futile attempt to create a pattern to parse and extract dynamic data from the e-mail samples below (sanitized to protect the innocent):
Action to Take: Buy shares of Facebook (Nasdaq: FB) at market. Use a 20% trailing stop to protect yourself. ...
Action to Take: Buy Google (Nasdaq: GOOG) at $42.34 or lower. If the stock is above $42.34, don't chase it. Wait for it to come down. Place a stop at $35.75. ...
***Action to Take***
 
Buy International Business Machines (NYSE: IBM) at market. And use a protective stop at $51. ...
What needs to be parsed is both forms of the "Action to Take" section, and the extracted data must include the direction (i.e. buy or sell, though I'm only concerned with buys here), the ticker, the limit price (if applicable) and the stop value as either a percentage or a number (if applicable). Sometimes there are also multiple "Action to Take" sections in a single e-mail.
Here's examples of what the pattern should not match (or ideally be flexible enough to deal with):
Action to Take: Sell half of your Apple (NYSE: AAPL) April $46 calls for $15.25 or higher. If the spread between the bid and the ask is $0.20 or more, place your order between the bid and the ask - even if the bid is higher than $15.25.
Action to Take: Raise your stop on Apple (NYSE: AAPL) to $75.15.
Action to Take: Sell one-quarter of your Facebook (Nasdaq: FB) position at market. ...
Here's my R code with the latest Perl-style pattern (so that lookarounds can be used in R); it sort of works, but not consistently or across multiple saved e-mails:
library(httr)
library(stringr)

filenames <- list.files("R:/TBIRD", pattern = "*.eml", full.names = TRUE)

parse <- function(input) {
  text <- readLines(input, warn = FALSE)
  text <- paste(text, collapse = "")
  trim <- regmatches(text, regexpr("Content-Type: text/plain.*Content-Type: text/html", text, perl = TRUE))
  pattern <- "(?is-)(?<=Action to Take).*(?i-s)(Buy|Sell).*(?:\\((?:NYSE|Nasdaq)\\:\\s(\\w+)\\)).*(?:for|at)\\s(\\$\\d*\\.\\d* or|market)\\s"
  df <- str_match(text, pattern)
  return(df)
}

list <- lapply(filenames, function(x) parse(x))
table <- do.call(rbind, list)
table <- data.frame(table)
table <- table[rowSums(is.na(table)) < 1, ]
table <- subset(table, select = c("X2", "X3", "X4"))
The parsing has to operate on the text copy, because the HTML appears far too complicated to handle due to the lack of standardization from e-mail to e-mail. Unfortunately, the text copy also commonly has different line endings than the regexp expects, which greatly aggravates things.
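Not a full solution, but a hedged sketch of a stringr pattern that matches the three "buy" samples above while skipping the "sell"/"raise your stop" examples; the capture groups are direction, exchange, ticker, and price (or "market"). Real e-mails, especially their line endings, will need more tweaking:

library(stringr)

samples <- c(
  "Action to Take: Buy shares of Facebook (Nasdaq: FB) at market. Use a 20% trailing stop.",
  "Action to Take: Buy Google (Nasdaq: GOOG) at $42.34 or lower. Place a stop at $35.75.",
  "***Action to Take***\n\nBuy International Business Machines (NYSE: IBM) at market."
)

pat <- regex(paste0(
  "Action to Take[:*]*\\s*",             # header, colon or *** variant
  "(Buy)\\s+(?:shares of\\s+)?",         # direction (buys only)
  "[A-Za-z. ]+",                         # company name
  "\\((NYSE|Nasdaq):\\s*(\\w+)\\)\\s+",  # exchange and ticker
  "at\\s+(market|\\$\\d+(?:\\.\\d+)?)"   # limit price or "at market"
), ignore_case = TRUE)

str_match(samples, pat)[, -1]   # drop the full match, keep the capture groups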

Importing option chain data from Bloomberg

I would like to import the entire option chain for a particular stock from Bloomberg into R for a specified day, i.e. all expiries and strikes for the exchange-traded options. I am able to import the option chain for a non-specified day (today):
bbgData <- bds(connection,sec,"OPT_CHAIN")
Where connection is a valid Bloomberg connection and sec is a Bloomberg security ticker such as "TLS AU Equity".
However, if I add extra fields it doesn't work, i.e.
bbgData <- bds(connection, sec,"OPT_CHAIN", testDate, "OPT_STRIKE_PX", "MATURITY", "PX_BID", "PX_ASK")
bbgData <- bds(connection, sec,"OPT_CHAIN", "OPT_STRIKE_PX", "MATURITY", "PX_BID", "PX_ASK")
Similarly, if I switch to using the historical data function it doesn't work
bbgData <- dateDataHist <- bdh(connection,sec,"OPT_CHAIN","20160201")
I just need the data for one day, but for a specified day, and including the additional fields.
Hint: I think the issue is that every field following "OPT_CHAIN" depends on the result of "OPT_CHAIN" (for example, the strike price is for the option identified by "OPT_CHAIN"), but I am unsure how to introduce this conditionality into the R Bloomberg query.
It's better to use the field CHAIN_TICKERS and related overrides when retrieving option data for a given underlying from Bloomberg. You can, for example, request points for a given moneyness by getting CHAIN_TICKERS with an override of CHAIN_STRIKE_PX_OVRD equal to 90%-110%.
In either case you need to use the tickers that are the result of your first request in a second request if you want to retrieve additional data. So:
option_tickers <- bds("TLS AU Equity", "CHAIN_TICKERS",
                      overrides = c(CHAIN_STRIKE_PX_OVRD = "90%-110%"))
option_prices <- bdp(sapply(option_tickers, paste, "equity"),
                     c("PX_BID", "PX_ASK"))

Getting variables out of a function in R

So here is some background info:
I have created a question-and-answer function in R. After the user calls the function, they are prompted with a succession of questions whose answers will eventually be used to populate a report via R Markdown. The function is divided into sections that follow the intended report, and each section ends with a data.frame containing the question category, the answer and the name of the variable. In total there are 17 sections, which means 17 data.frames get strung together with rbind before the function writes the final data.frame to a .csv, saves it to a directory and exits. This function works well and I have no problems with it at all.
My problem lies in the fact that once the function ends I am not able to call the variables back at the console. This is a problem because if I want to populate a report with the questions in R Markdown, I cannot: they only exist in the realm of the function.
What I have tried already:
I have already tried creating a list (using c()) containing the variables from each section and having the function return the list. However, this did not work: it returns only a small portion of the list, and it only contains the readline prompts I passed to the variables rather than the answers. I need to be able to call a variable and receive what was answered.
I have read back the .csv that was saved by the function and attempted to use the assign function to assign each variable name to its answer. This worked only when I entered one line at a time and fails when I attempt to assign column 1 to column 2. Considering there are 163 questions, assigning them one at a time is a waste of time. I have even tried using the lapply and sapply functions to do this, but there is always a failure with the assign function.
I need to be able to bring out the 163 variables that were created during the execution of the function. Here is a sample of the function for whoever is interested in playing around with it.
sv <- function() {
  Name    <- readline("What is your Name?")
  Date    <- readline("What date is the site audit set for? (mm/dd/yyyy)")
  Number  <- readline("What is the project number?")
  Bname   <- readline("What is the building's name?")
  ADD     <- readline("What is the building's address? (123 Fake Street)")
  City    <- readline("What city is the building located in?")
  Pcode   <- readline("What is the building's postal code?")
  HOO     <- readline("What are the building's hours of operation?")
  PHONE   <- readline("What is the building's telephone number? (555-555-5555)")
  FAX     <- readline("What is the Fire Department's fax number? (555-555-5555)")
  CONTACT <- readline("Who is the contact person for the building? (First name, Last name)")
}
I thank you in advance for your help. Also, please note that I have searched through the site and saw similar questions, but was not able to make the suggestions work, so I apologize if this is redundant. Remember, I need to be able to call Name and receive the name I entered once the function has done its thing.
Use the global assignment operator:
> sv <- function(){
+ Name <<- readline("What is your Name?")
+ }
> sv()
What is your Name?mkemp6
> print(Name)
[1] "mkemp6"

Retrieve time zone based on locale country information (OS-independent)

Yet another date/time related question ;-)
Before you aim and shoot
Things are kind of messed up with a Germany + MS Windows + R combination as the following yields an invalid time zone:
> Sys.timezone()
[1] "MST"
Warning message:
In as.POSIXlt.POSIXct(Sys.time()) : unknown timezone 'MET-1MST'
That's definitely not R's fault, it's Windows. Hence the question in the first place ;-)
Question
Is there an easy/alternative and OS-independent way to query your current country via locale info and then look up the corresponding time zone (format "<country>/<city>", e.g. "Europe/Berlin" for Germany)?
I should also add that I'd like the solution to be independent from internet resources such as stated in this post/answer.
The problem context
Suppose you don't know how to specify your time zone yet. You might have heard something about CET/CEST etc, but AFAIK that doesn't really get you anywhere when using base R functionality (at least being located in Germany ;-)).
You can get a list of available "<country>/<city>" pairs from the /share/zoneinfo/zone.tab file in your RHOME directory. Yet, in order to find the time zone corresponding to the current country you're in you need to know the ISO country code.
Of course we usually do for our native country, but let's suppose we don't (I'd like to end up with a generic approach). What do you do next?
Below is my "four-step" solution, but I'm not really happy with it because:
it relies on yet another contributed package (ISOcodes)
I can't test whether it works for other locales, as I don't know what the info would actually look like if you're in India, Russia, Australia, etc.
Anyone got a better idea? Also, it'd be great if some of you in countries other than Germany could run this through and post your locale info Sys.getlocale().
Step 1: get locale info
loc <- strsplit(unlist(strsplit(Sys.getlocale(), split = ";")), split = "=")
foo <- function(x) {
  out <- list(x[2])
  names(out) <- x[1]
  out
}
loc <- sapply(loc, foo)
> loc
$LC_COLLATE
[1] "German_Germany.1252"
$LC_CTYPE
[1] "German_Germany.1252"
$LC_MONETARY
[1] "German_Germany.1252"
$LC_NUMERIC
[1] "C"
$LC_TIME
[1] "German_Germany.1252"
Step 2: get country name from locale info
country.this <- unlist(strsplit(loc$LC_TIME, split="_|\\."))[2]
> country.this
[1] "Germany"
Step 3: get ISO country code
Use country.this to look up the associated country code in data set ISO_3166_1 of package ISOcodes
require("ISOcodes")
data("ISO_3166_1")
iso <- ISO_3166_1
idx <- which(iso$Name %in% country.this)
code <- iso[idx, "Alpha_2"]
> code
[1] "DE"
Step 4: get time zone
Use code to look up the time zone in the data frame that can be derived from file RHOME/share/zoneinfo/zone.tab
path <- file.path(Sys.getenv("R_HOME"), "share/zoneinfo/zone.tab")
tzones <- read.delim(
  path,
  row.names = NULL,
  header = FALSE,
  col.names = c("country", "coords", "name", "comments"),
  as.is = TRUE,
  fill = TRUE,
  comment.char = "#"
)
> tzones[which(tzones$country == code), "name"]
[4] "Europe/Berlin"
Specifically regarding your question:
Is there an easy/alternative and OS-independent way to query your current country via locale info and then look up the corresponding time zone?
No - there is not. This is because there are several countries that have multiple time zones. One cannot know the time zone from just the country alone.
This is why TZDB identifiers are in the form of Area/Location, rather than just a list of country codes.
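You can make this concrete by counting the Area/Location names per country code in the zone.tab table already read in the question: codes such as US, RU and AU each map to many zones.

# tzones is the data frame read from zone.tab in the question above
zones_per_country <- table(tzones$country)
head(sort(zones_per_country, decreasing = TRUE))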
Some simplifications to your workflow.
You can retrieve just the time part of the locale using
Sys.getlocale("LC_TIME")
which avoids the need to split strings.
The lubridate package contains a function to retrieve Olson-style time zone names, so you don't have to worry about reading and parsing zone.tab.
library(lubridate)
olson_time_zones()
