Creating multiple dataframes out of one based on string search in R

I am relatively new to R. I have a dataframe with more than 10 million rows that contain 500,000 PMIDs (a type of ID). However, the code I run on it can only handle 4,000-5,000 PMIDs at a time. Here is a sample of what the raw dataframe (it's all in one column) looks like:
PMID- 28524368
OT - cardiomyopathy
OT - encephalitis
LID - 10.1111/jmp.12273 [doi]
PL - Denmark
PMID- 28523858
OT - Pan troglodytes
PST - aheadofprint
LID - 10.1111/echo.13561 [doi]
STAT- Publisher
FAU - Ruivo, Catarina
PMID- 52528302
CI - (c) 2017, Wiley Periodicals, Inc.
DA - 20170518
OWN - NLM
PMID- 18325287
STAT- Publisher
OWN - NLM
DA - 20170519
LA - eng
PMID- 95625132
FAU - Oumerzouk, Jawad
JID - 0135232
PL - Australia
PMID- 47628853
LA - eng
STAT- Publisher
AID - 10.1111/jmp.12273 [doi]
As you can see, the example dataframe contains only 6 PMIDs. So for the sake of the example, let's say each dataframe should hold only 2 PMIDs (in my actual code it will be around 4,000 PMIDs). I would therefore like to split my dataframe into 3 dataframes that look like this (each starts at one PMID and ends just before the third PMID would begin):
df1:
PMID- 28524368
OT - cardiomyopathy
OT - encephalitis
LID - 10.1111/jmp.12273 [doi]
PL - Denmark
PMID- 28523858
OT - Pan troglodytes
PST - aheadofprint
LID - 10.1111/echo.13561 [doi]
STAT- Publisher
FAU - Ruivo, Catarina
df2:
PMID- 52528302
CI - (c) 2017, Wiley Periodicals, Inc.
DA - 20170518
OWN - NLM
PMID- 18325287
STAT- Publisher
OWN - NLM
DA - 20170519
LA - eng
df3:
PMID- 95625132
FAU - Oumerzouk, Jawad
JID - 0135232
PL - Australia
PMID- 47628853
LA - eng
STAT- Publisher
AID - 10.1111/jmp.12273 [doi]
Note that the number of rows between one PMID and the next varies, so the split has to be done by string matching on "PMID". I don't know how to do this on such a large dataset without creating the dataframes manually. A for loop?
Any suggestions would be appreciated.

Make a little counter whenever you hit the start of a new group, then split. Here's a simplified example:
x <- rep(1:3,5)
grpsize <- 2
split(x, (cumsum(x==1)+grpsize-1) %/% grpsize)
#$`1`
#[1] 1 2 3 1 2 3
#
#$`2`
#[1] 1 2 3 1 2 3
#
#$`3`
#[1] 1 2 3
On your full data then you could use grepl to identify the start of each group:
split(df, (cumsum(grepl("^PMID",df$var)) + grpsize - 1) %/% grpsize)
Arguably you could add the counter as a new column on your dataset and use it as an identifier to go from a long to a wide dataset.
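As a concrete sketch of the suggestion above on the question's data, keeping the placeholder column name var from the line above and using grpsize = 2 for this example:
# split the one-column dataframe into chunks of `grpsize` PMID records
grpsize <- 2   # would be ~4000 on the real data
grp <- (cumsum(grepl("^PMID", df$var)) + grpsize - 1) %/% grpsize
df_list <- split(df, grp)
# df_list[[1]], df_list[[2]], df_list[[3]] correspond to df1, df2, df3 above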

So although @thelatemail's solution seemed very promising, it did not work on my dataset. Even after I tried the code on a smaller subset of only 1 million rows, it would constantly freeze my computer and I would have to keep restarting it and re-loading all the code and the large file. Perhaps it works better on numerical data, or on less data, or with data.table or dplyr, or maybe I was just coding it wrong; I'm not sure exactly why I wasn't able to implement it correctly (I would have experimented more, but I want to go home soon, ha). In any case, I was able to come up with my own solution:
# row indices of each PMID line
a <- which(grepl("^PMID", original$V1))
a <- as.data.frame(a)
# create dataframes based on indices from `a`, 4000 PMIDs per chunk
# (each chunk runs up to the row just before the next chunk's first PMID,
#  so no rows of the 4000th record are dropped)
df1 <- original[a[1, 1]:(a[4001, 1] - 1), ]
df1 <- as.data.frame(df1)
df2 <- original[a[4001, 1]:(a[8001, 1] - 1), ]
df2 <- as.data.frame(df2)
etc...until df100, ha. Very tedious, but I couldn't think of a way to avoid doing it manually. Perhaps a function? Regardless, my code ran within seconds, so I'm not complaining; the tedious part was mindless work that only took 10-15 minutes anyway.
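For what it's worth, a possible sketch of the same index-based chunking without writing out df1 through df100 by hand (assuming, as above, that the raw text sits in original$V1):
# build every chunk in one go and keep them in a list
a <- which(grepl("^PMID", original$V1))
chunk_size <- 4000
starts <- a[seq(1, length(a), by = chunk_size)]   # first PMID row of each chunk
ends   <- c(starts[-1] - 1, nrow(original))       # row just before the next chunk starts
df_list <- lapply(seq_along(starts),
                  function(i) original[starts[i]:ends[i], , drop = FALSE])
# df_list[[1]] plays the role of df1, df_list[[2]] of df2, and so on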

Related

Suggestion with aggregation of data in R

Hello, I have a data frame with more than 3,632,200 obs, and I'm trying to find some useful information in it. I have cleaned it a bit, so this is what the data looks like now:
Order Lane Days
18852324 796005 - Ahmedabad 2
232313 796008 - Delhi 5
63963231 796005 - Ahmedabad 5
23501231 788152 - Chennai 1
2498732 796008 - Delhi 2
231413 796005 - Ahmedabad 3
75876876 796012 - Chennai 4
14598676 796008 - Delhi 4
Order holds the order IDs, which are all unique; Lane is the path on which the order was delivered (lanes can repeat across orders); and Days is calculated with the difftime function in R as the difference between the order's delivered date and its created date.
Now, what I'm trying to achieve is something like this:
I can calculate the date by which 98% of orders are achieved using the quantile function in R across the various lanes.
But how do I get the % of orders fulfilled by day 1 through day 5 across the various lanes?
Any help would be highly appreciated.
Thank You
Hard to tell without the data, but maybe something like this:
library(purrr)
# df = your data, with columns Lane and Days as in the question
max_days = max(df$Days)
aggregate_fun = function(x){
  # share of this lane's orders delivered in 1, 2, ..., max_days days
  days = factor(x$Days, levels = 1:max_days)
  prop.table(table(days))
}
# split by lane, compute the per-day shares, and stack lanes as rows
df = split(df, df$Lane)
results = reduce(lapply(df, aggregate_fun), rbind)
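If "% of orders fulfilled by day 1 to 5" means a cumulative share rather than a per-day share, one possible follow-up on the results matrix above (an assumption about what is wanted, not part of the original answer) is a running total across each lane's row:
# cumulative proportion of orders fulfilled by day 1, by day 2, ... for each lane
results_cum = t(apply(results, 1, cumsum))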

Updating Values within a Simulation in R

I am working on building a model that can predict NFL games, and am looking to run full season simulations and generate expected wins and losses for each team.
Part of the model is based on a rating that changes each week based on whether or not a team lost. For example, let's say the Bills and Ravens each started Sunday's game with a rating of 100; after the Ravens win, their rating increases to 120 and the Bills' decreases to 80.
While running the simulation, I would like to update each team's rating as the season progresses in order to get a more accurate representation of the number of ways a season could play out, but I am not sure how to include something like this within the loop.
My loop for the 2017 season:
full.sim <- NULL
for (i in 1:10000) {
  nflpredictions$sim.homewin <- with(nflpredictions, rbinom(nrow(nflpredictions), 1, homewinpredict))
  nflpredictions$winner <- with(nflpredictions, ifelse(sim.homewin, as.character(HomeTeam), as.character(AwayTeam)))
  winningteams <- table(nflpredictions$winner)
  projectedwins <- data.frame(Team = names(winningteams), Wins = as.numeric(winningteams))
  full.sim <- rbind(full.sim, projectedwins)
}
full.sim <- aggregate(full.sim$Wins, by= list(full.sim$Team), FUN = sum)
full.sim$expectedwins <- full.sim$x / 10000
full.sim$expectedlosses <- 16 - full.sim$expectedwins
This works great when running the simulation for 2017, where I already have the full season's worth of data, but I am having trouble adapting it to simulate 2018.
My first idea is to create another for loop within the loop that iterates through the rows and updates the ratings for each week, something along the lines of
full.sim <- NULL
for (i in 1:10000) {
  for (j in 1:nrow(nflpredictions)) {
The idea is to update a team's rating, then generate the win probability for the week using the GLM I have built, simulate who wins, and continue through the entire dataframe. The only thing really holding me back is not knowing how to set a value in a row based on a row that is not directly above it. So what would be the easiest way to update the ratings each week based on the result of the last game that team played in?
The dataframe is built like this, but obviously on a larger scale:
nflpredictions
Week HomeTeam AwayTeam HomeRating AwayRating HomeProb AwayProb
1 BAL BUF 105 85 .60 .40
1 NE HOU 120 90 .65 .35
2 BUF LAC NA NA NA NA
2 JAX NE NA NA NA NA
I hope I explained this well enough... Any input is greatly appreciated, thanks!
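A minimal sketch of the week-by-week updating idea described in this question, under stated assumptions: the Elo-style probability formula and the constant K stand in for the asker's GLM and rating rules, the column names follow the table above, and the update simply carries each team's rating forward to its next game:
# Sketch: carry ratings forward game by game within one simulated season
teams <- unique(c(as.character(nflpredictions$HomeTeam), as.character(nflpredictions$AwayTeam)))
ratings <- setNames(rep(100, length(teams)), teams)   # illustrative starting ratings
K <- 20                                               # illustrative adjustment size
nflpredictions$winner <- NA_character_
for (g in seq_len(nrow(nflpredictions))) {
  home <- as.character(nflpredictions$HomeTeam[g])
  away <- as.character(nflpredictions$AwayTeam[g])
  # stand-in for the GLM: turn the current rating gap into a home-win probability
  p_home <- 1 / (1 + 10 ^ ((ratings[away] - ratings[home]) / 400))
  home_wins <- rbinom(1, 1, p_home)
  # update both ratings from the simulated result before moving to the next game
  ratings[home] <- ratings[home] + K * (home_wins - p_home)
  ratings[away] <- ratings[away] - K * (home_wins - p_home)
  nflpredictions$winner[g] <- if (home_wins == 1) home else away
}
Wrapping this inner loop in the existing 1:10000 simulation loop (and resetting the ratings at the start of each simulated season) keeps the rest of the win-tallying code unchanged.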

Matching documents with text2vec -- scaling problems

I am having a few issues with scaling a text matching program. I am using text2vec which provides very good and fast results.
The main problem I am having is manipulating a large matrix which is returned by the text2vec::sim2() function.
First, some details of my hardware / OS setup: Windows 7 with 12 cores at about 3.5 GHz and 128 GB of memory. It's a pretty good machine.
Second, some basic details of what my R program is trying to achieve.
We have a database of 10 million unique canonical addresses for every house / business in the country. These reference addresses also have latitude and longitude information for each entry.
I am trying to match these reference addresses to customer addresses in our database. We have about 600,000 customer addresses. The quality of these customer addresses is not good. Not good at all! They are stored as a single string field with absolutely zero checks on input.
The technical strategy to match these addresses is quite simple: create two document term matrices (DTMs) of the customer addresses and reference addresses, and use cosine similarity to find the reference address that is most similar to a given customer address. Some customer addresses are so poor that they will result in a very low cosine similarity, so these addresses would be assigned a "no match".
Despite being a pretty simple solution, it produces very encouraging results.
But I am having problems scaling things, and I am wondering if anyone has any suggestions.
There is a copy of my code below. It's pretty simple. Obviously, I cannot include real data, but it should give readers a clear idea of what I am trying to do.
SECTION A - Works very well even on the full 600,000 * 10 million input data set.
SECTION B - the text2vec::sim2() function causes RStudio to shut down when the vocabulary exceeds about 140,000 tokens (i.e., columns). To avoid this, I process the customer addresses in chunks of about 200.
SECTION C - This is the most expensive section. When processing addresses in chunks of 200, SECTION A and SECTION B take about 2 minutes. But SECTION C, using (what I would have thought to be super quick) functions, takes about 5 minutes to process a 10 million row * 200 column matrix.
Combined, SECTIONS A-C take about 7 minutes to process 200 addresses. As there are 600,000 addresses to process, this will take about 14 days.
Are there any ideas to make this code run faster?
rm(list = ls())
library(text2vec)
library(dplyr)
# Create some test data
# example is 10 entries.
# but in reality we have 10 million addresses
vct_ref_address <- c("15 smith street beaconsfield 2506 NSW",
                     "107 orange grove linfield 2659 NSW",
                     "88 melon drive calton 3922 VIC",
                     "949 eyre street sunnybank 4053 QLD",
                     "12 black avenue kingston 2605 ACT",
                     "5 sweet lane 2004 wynyard NSW",
                     "32 mugga way 2688 manuka ACT",
                     "4 black swan avenue freemantle 5943 WA",
                     "832 big street narrabeet 2543 NSW",
                     "5 dust road 5040 NT")
# example is 4 entries
# but in reality, we have 1.5 million addresses
vct_test_address <- c("949 eyre street sunnybank 4053 QLD",
                      "1113 completely invalid suburb with no post code QLD",
                      "12 black road kingston 2605 ACT",
                      "949 eyre roaod sunnybank 4053 QLD")
# ==========================
# SECTION A ===== prepare data
# A.1 create vocabulary
t2v_token <- text2vec::itoken(c(vct_test_address, vct_ref_address), progressbar = FALSE)
t2v_vocab <- text2vec::create_vocabulary(t2v_token)
t2v_vectorizer <- text2vec::vocab_vectorizer(t2v_vocab)
# A.2 create document term matrices dtm
t2v_dtm_test <- text2vec::create_dtm(itoken(vct_test_address, progressbar = FALSE), t2v_vectorizer)
t2v_dtm_reference <- text2vec::create_dtm(itoken(vct_ref_address, progressbar = FALSE), t2v_vectorizer)
# ===========================
# SECTION B ===== similarity matrix
mat_sim <- text2vec::sim2(t2v_dtm_reference, t2v_dtm_test, method = 'cosine', norm = 'l2')
# ===========================
# SECTION C ===== process matrix
vct_which_reference <- apply(mat_sim, 2, which.max)
vct_sim_score <- apply(mat_sim, 2, max)
# ============================
# SECTION D ===== apply results
# D.1 assemble results
df_results <- data.frame(
  test_addr = vct_test_address,
  matched_addr = vct_ref_address[vct_which_reference],
  similarity = vct_sim_score)
# D.2 print results
df_results %>% arrange(desc(similarity))
The issue in step C is that mat_sim is sparse, and all the apply calls do column/row subsetting, which is super slow (and converts sparse vectors to dense).
There could be several solutions:
if mat_sim is not very huge, convert it to dense with as.matrix and then use apply
Better, you can convert mat_sim to a sparse matrix in triplet format with as(mat_sim, "TsparseMatrix") and then use data.table to get the indices of the max elements. Here is an example:
library(text2vec)
library(Matrix)
data("movie_review")
it = itoken(movie_review$review, tolower, word_tokenizer)
dtm = create_dtm(it, hash_vectorizer(2**14))
mat_sim = sim2(dtm[1:100, ], dtm[101:5000, ])
mat_sim = as(mat_sim, "TsparseMatrix")
library(data.table)
# we add 1 because indices in sparse matrices in the Matrix package start from 0
mat_sim_dt = data.table(row_index = mat_sim@i + 1L, col_index = mat_sim@j + 1L, value = mat_sim@x)
res = mat_sim_dt[,
                 { k = which.max(value); list(max_sim = value[[k]], row_index = row_index[[k]]) },
                 keyby = col_index]
res
Also, as a side suggestion, I recommend trying char_tokenizer() with ngrams (for example, of size c(3, 3)) to "fuzzy" match different spellings and abbreviations of addresses.
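A small sketch of that side suggestion, reusing the vct_test_address / vct_ref_address vectors from the question; the choice of character trigrams (ngram = c(3L, 3L)) is just an example:
# character-trigram vocabulary instead of word tokens, for fuzzier matching
it_chars <- itoken(c(vct_test_address, vct_ref_address),
                   tokenizer = char_tokenizer, progressbar = FALSE)
vocab_chars <- create_vocabulary(it_chars, ngram = c(3L, 3L))
vectorizer_chars <- vocab_vectorizer(vocab_chars)
dtm_test_chars <- create_dtm(itoken(vct_test_address, tokenizer = char_tokenizer, progressbar = FALSE),
                             vectorizer_chars)
dtm_ref_chars  <- create_dtm(itoken(vct_ref_address, tokenizer = char_tokenizer, progressbar = FALSE),
                             vectorizer_chars)
mat_sim_chars  <- sim2(dtm_ref_chars, dtm_test_chars, method = "cosine", norm = "l2")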

Name matching with different length data frames in R

I have two dataframes with numerous variables. Of primary concern are the following variables, df1.organization_name and df2.legal.name. I'm just using fully qualified SQL-esque names here.
df1 has dimensions of 15 x 2,700, whereas df2 has dimensions of 10 x 40,000. Essentially, the 'common' or 'matching' columns are name fields.
I reviewed the post Merging through fuzzy matching of variables in R, and it was very helpful, but I can't really figure out how to wrangle the script to get it to work with my dfs.
I keep getting an error:
Error in which(organization_name[i] == LEGAL.NAME) : object 'LEGAL.NAME' not found
Desired Matching and Outcome
What I am trying to do is compare each one of my df1.organization_name values to every one of the df2.legal_name values and flag a comparison if they are a very close match (say >= 85%). Then, as in the script from that post, take the matched customer name and the matched comparison name and put them into a data.frame for later analysis.
So, if one of my customer names is 'Johns Hopkins Auto Repair' and one of my public list names is, 'John Hopkins Microphone Repair', I would call that a good match and I want some sort of indicator appended to my customer list (in another column) that says, 'Partial Match' and the name from the public list.
Example(s) of the dfs for text wrangling:
df1.organization_name (these are fake names b/c I can't post customer names)
- My Company LLC
- John Johns DBA John's Repair
- Some Company Inc
- Ninja Turtles LLP
- Shredder Partners
df2.LEGAL.NAME (these are real names from the open source file)
- $1 & UP STORE CORP.
- $1 store 0713
- LLC 0baid/munir/gazem
- 1 2 3 MONEY EXCHANGE LLC
- 1 BOY & 3 GIRLS, LLC
- 1 STAR BEVERAGE INC
- 1 STOP LLC
- 1 STOP LLC
- 1 STOP LLC DBA TIENDA MEXICANA LA SAN JOSE
- 1 Stop Money Centers, LLC/Richard
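One possible sketch of the thresholded matching described above, using the stringdist package rather than the script from the linked post; the 0.85 cutoff and the Jaro-Winkler metric are illustrative choices, and the column names follow the question:
library(stringdist)

# similarity in [0, 1] between every df1 name and every df2 name
sim <- 1 - stringdistmatrix(tolower(df1$organization_name),
                            tolower(df2$LEGAL.NAME),
                            method = "jw")

best       <- apply(sim, 1, which.max)   # index of the closest public-list name
best_score <- apply(sim, 1, max)

# append the indicator and the matched public-list name to the customer list
df1$match_type   <- ifelse(best_score >= 0.85, "Partial Match", "No Match")
df1$matched_name <- ifelse(best_score >= 0.85, as.character(df2$LEGAL.NAME[best]), NA)
With tens of thousands of names on the df2 side, the full similarity matrix gets large; processing df2$LEGAL.NAME in chunks and keeping the running best match is one way to keep memory in check.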

R - Aggregate Values Based on a Date Interval and 3 Factor Variables

I have ExpandedGrid 11760 obs of 4 variables:
Date - date format
Device - factor
Creative - factor
Partner - factor
I also have a MediaPlanDF 215 obs of 6 variables:
Interval - an interval of dates I created using lubridate
Partner - factor
Device - factor
Creative - factor
Daily Spend - num
Daily Impressions - num
Here is my trouble.
I need to sum daily spend and daily impressions in respective columns in MediaPlanDF, based on the following 2 criteria:
Criterion 1
- ExpandedGrid$Device matches MediaPlanDF$Device
- ExpandedGrid$Creative matches MediaPlanDF$Creative
- ExpandedGrid$Partner matches MediaPlanDF$Partner
Criterion 2
- ExpandedGrid$Date falls within MediaPlanDF$Interval
Now I can pull this off for each criterion on its own, but I am having the hardest time putting them together without getting errors, and my search for answers hasn't ended in much success (a lot of great examples, but nothing I have the skill to adapt to my context). I've tried a variety of methods, but my mind is starting to wander towards overly complicated solutions and I need help.
I've tried indexing like so:
indexb <- as.character(ExpandedGrid$Device) == as.character(MediaPlanDF$Device);
indexc <- as.character(ExpandedGrid$Creative) == as.character(MediaPlanDF$Creative);
indexd <- as.character(ExpandedGrid$Partner) == as.character(MediaPlanDF$Partner);
index <- ExpandedGrid$Date %within% MediaPlanDF$Interval;
KEYDF <- data.frame(index, indexb, indexc, indexd)
KEYDF$Key <- apply(KEYDF, 1, function(x)(all(x) || all(!x)))
KEYDF$Key.cha <- as.character(KEYDF$Key)
outputbydim <- do.call(rbind, lapply(KEYDF$Key.cha, function(x) {
  index <- x == "TRUE";
  list(impressions = sum(MediaPlanDF$Daily.Impressions[index]),
       spend = sum(MediaPlanDF$Daily.Spend[index]))
}))
Unfortunately, this excludes some values that should be summed, and the sums for the rows flagged as TRUE are incorrect.
Here is a data snippet:
ExpandedGrid:
Date Device Creative Partner
2015-08-31 "Desktop" "Standard" "ACCUEN"
MediaPlanDF
Interval Device Creative Partner Daily Spend Daily Impressions
2015-08-30 17:00:00 PDT--2015-10-03 17:00:00 PDT "Desktop" "Standard" "ACCUEN" 1696.27 1000339.17
Does anyone know where to go from here?
Thanks in advance!
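For what it's worth, a possible sketch of a join-then-filter approach with dplyr and lubridate, under assumptions: the columns Daily Spend and Daily Impressions are assumed to import as Daily.Spend and Daily.Impressions, and the interval endpoints are pulled out with int_start()/int_end() so the date test compares plain dates:
library(dplyr)
library(lubridate)

# expose the interval endpoints as plain dates, then drop the interval column
mp <- MediaPlanDF %>%
  mutate(Start = as_date(int_start(Interval)),
         End   = as_date(int_end(Interval))) %>%
  select(-Interval)

# criterion 1: match on the three factors; criterion 2: Date inside the interval
joined <- ExpandedGrid %>%
  inner_join(mp, by = c("Device", "Creative", "Partner")) %>%
  filter(Date >= Start, Date <= End)

# sum spend and impressions for every grid combination
summed <- joined %>%
  group_by(Date, Device, Creative, Partner) %>%
  summarise(Spend = sum(Daily.Spend),
            Impressions = sum(Daily.Impressions))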
