I have three data frames that need to be merged. There are a few small differences between the competitor names in each data frame. For instance, one name might not have a space between the middle and last name, while the other data frame correctly displays the person's name (example: Sarah JaneDoe vs. Sarah Jane Doe). So, I used the fuzzyjoin package. When I run the code below, it just keeps running. I can't figure out how to fix this.
Can you identify where I went wrong?
library(fuzzyjoin)
library(tidyverse)
temp1 = read.csv('https://raw.githubusercontent.com/bandcar/bjj/main/temp1.csv')
stats=read.csv('https://raw.githubusercontent.com/bandcar/bjj/main/stats.csv')
winners = read.csv('https://raw.githubusercontent.com/bandcar/bjj/main/winners.csv')
#perform fuzzy matching full join
star = stringdist_join(temp1, stats,
                       by = 'Name',            # match based on Name
                       mode = 'full',          # use full join
                       method = "jw",          # use jw distance metric
                       max_dist = 99,
                       distance_col = 'dist') %>%
  group_by(Name.x)
Okay so I have a large collection of data I'm trying to analyze. It contains ~2 million police reports with information such as jurisdiction, type of offense, race, age, etc. My end goal is to determine the top 20 jurisdictions with the most reports, and then chart only the entries from those locations in various ways.
top20 <- police %>%
group_by(JURISDICTION) %>%
tally(sort=TRUE) %>%
filter(row_number() <= 20)
I'm using this to tally up the totals in each jurisdiction and then trim down to the top 20 entries. What I'd like to have happen next is, using each of the entries in column one of this top20 data.frame, create a new data frame with the name of the locality, as well as all the entries from police with a matching jurisdiction.
I've been experimenting with something along the lines of:
for (i in 1:20) {
  assign(paste(top20[i, 1]),
         filter(police, JURISDICTION == top20[i, 1]))
}
which does create data frames with the correct names, but the second portion isn't reading correctly and it's just creating blank data frames at the moment. Any advice on how to streamline this would be appreciated. I'm very capable of just making each frame individually, but if I can do it succinctly I'll be more satisfied.
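One likely cause of the blank frames: with a tibble, `top20[i, 1]` returns a one-cell tibble rather than a character value, so the `==` comparison inside `filter()` misbehaves; `top20$JURISDICTION[i]` would return the value itself. A more succinct route, sketched with a toy `police` data frame (column names assumed from the question), is to skip `assign()` entirely and keep the subsets in a named list:

```r
# Toy stand-in for the real 'police' data.
police <- data.frame(
  JURISDICTION = c("A", "A", "B", "C", "B", "A"),
  OFFENSE      = 1:6
)

# split() creates one data frame per jurisdiction in one step,
# named after the jurisdiction, without polluting the global
# environment with 20 assign()ed objects.
by_jurisdiction <- split(police, police$JURISDICTION)

# Access one jurisdiction's reports by name:
nrow(by_jurisdiction[["A"]])   # 3 rows in this toy example
```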
To get matched pairs via PSM (MatchIt package, method = "full"), I need to specify my command for my longitudinal data frame. Every case has several observations, but only the first observation per patient should be included in the matching. So the matching should be based on each patient's first observation, but my later analysis should include the complete dataset of each patient with all observations.
Does anyone have an idea how to achieve this?
I tried using a data subset (first observation per patient) but wasn't able to get the matching merged back into the data set with all observations per patient using match.data().
Thanks in advance
Simon (desperately writing his masters thesis)
My understanding is that you want to create matches at just the first time point but have those matches be identified for each unit at all time points. Fortunately, this is pretty straightforward: just perform the matching at the first time point and then merge the matched dataset with the full dataset. Here is how this might look. Let's say your original long dataset is d and has an ID column id and a time column time.
library(MatchIt)

m <- matchit(treat ~ X1 + X2, data = subset(d, time == 1), method = "full")
md1 <- match.data(m)
d <- merge(d, md1[c("id", "subclass", "weights")], by = "id", all.x = TRUE)
Your new dataset should have two new columns, subclass and weights, which contain the matching subclass and matching weight for each unit. Rows with identical IDs (i.e., rows corresponding to the same unit at multiple time points) will have the same values of subclass and weights.
I have a 200k-row dataframe with a character column named "departament_name"; some of the values in this column contain a specific character: "?". For example: "GENERAL SAN MART?N", "UNI?N", etc.
I want to replace those values using another 750k-row dataframe that contains a column also named "departament_name", but the values in this column are correct. Following the example, they would be: "GENERAL SAN MARTIN", "UNION", and so on.
Can I do this automatically using pattern recognition, without making a dictionary? (There are several values with this problem.)
My objective is to have a unified dataset from the two dataframes, with unique values for those problematic rows in "departament_name". I prefer tidyverse (mutate, stringr, etc.) if possible.
You can try using the stringdist_*_join functions from the fuzzyjoin package.
fuzzyjoin::stringdist_left_join(df1, df2, by = 'departament_name')
# departament_name.x departament_name.y
#1 GENERAL SAN MART?N GENERAL SAN MARTIN
#2 UNI?N UNION
Obviously, this works for the simple example you have shared but it might not give you 100% correct result for all the entries in your actual data. You can tweak the parameters max_dist and method as per your data. See ?fuzzyjoin::stringdist_join for more information about them.
data
df1 <- data.frame(departament_name = c("GENERAL SAN MART?N", "UNI?N"))
df2 <- data.frame(departament_name = c("GENERAL SAN MARTIN", "UNION"))
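To actually replace the bad values rather than just pair them up, the joined result can keep the corrected spelling wherever a match was found and fall back to the original otherwise. A sketch in tidyverse style, using the same toy data (the `.x`/`.y` suffixes come from the join, and `coalesce()` prefers the corrected `.y` value):

```r
library(fuzzyjoin)
library(dplyr)

df1 <- data.frame(departament_name = c("GENERAL SAN MART?N", "UNI?N"))
df2 <- data.frame(departament_name = c("GENERAL SAN MARTIN", "UNION"))

# Default max_dist is 2, enough for a single "?" -> "I"/"O" substitution.
fixed <- stringdist_left_join(df1, df2, by = "departament_name") %>%
  mutate(departament_name = coalesce(departament_name.y,    # corrected value
                                     departament_name.x)) %>% # fallback
  select(-departament_name.x, -departament_name.y)
```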
Need to create a usable dataframe using R or Excel
Variable1            ID       Variable2
Name of A person 1   002157   NULL
Drugs used           NULL     3.0
Days in hospital     NULL     2
Name of a surgeon    NULL     JOHN T.
Name of A person 2   002158   NULL
Drugs used           NULL     4.0
Days in hospital     NULL     5
Name of a surgeon    NULL     ADAM S.
I have a table exported from 1C (accounting software). It contains more than 20 thousand observations. The task is to analyze how many drugs were used and how many days each patient stayed in the hospital.
For that reason, I need to transform the one dataframe into a second dataframe which will be suitable for analysis (from horizontal to vertical). Basically, I have to create a dataframe consisting of 4 columns: ID, Drugs used, Hospital stay, and Name of a surgeon. I am guessing that it requires two functions:
for ID, it must read the first dataframe and extract the filled rows
for Name of a surgeon, Drugs used and Days in hospital, the function has to check that the row corresponds to one of those variables and extract the data from the third column, adding it to the second dataframe.
Shortly, I have no idea how to do that. Could you guys help me write functions for R, or give tips for Excel?
for R, I guess you want something like this:
load the table; make sure to substitute the "," with the separator that is used in your file (could be ";" or "\t" for tab, etc.). Adding na.strings = "NULL" turns the NULL entries into real NA values.
df = read.table("path/to/file", sep=",", header=TRUE, na.strings="NULL")
create subset tables that contain only one row for the patient
id = subset(df, !is.na(ID) & ID != "NULL")  # is.null() is not vectorized and would keep every row
drugs = subset(df, Variable1 %in% "Drugs used")
days = subset(df, Variable1 %in% "Days in hospital")
#...etc...
make a new data frame that contains these information
new_df = data.frame(
id = id$ID,
drugs = drugs$Variable2,
days = days$Variable2,
#...etc...no comma after the last!
)
EDIT:
Note that this approach only works if your table is basically perfect! Otherwise there might be shifts in the data.
#=====================================================
EDIT 2:
If you have an imperfect table, you might wanna do something like this:
Step 1.5): change all NA values (which in your table are labeled NULL; if you read the file with na.strings = "NULL", R will turn them into NA) to the patient ID from the row above. Note that the is.na() function in the code below is specifically for that, and will not work with NULL or "NULL" or other stuff:
for (i in seq_along(df$ID)) {
  if (is.na(df$ID[i])) df$ID[i] <- df$ID[i - 1]
}
Then go again to step 2) above (you don't need the id subset though), and then you have to change each data frame a little. As an example, for the drugs and days data frames:
drugs = drugs[, -1] #removes the first column
colnames(drugs) = c("ID","drugs") #renames the columns
days = days[, -1]
colnames(days) = c("ID", "days")
Then instead of doing step 3 as above, use merge and choose the ID column to be the merging column.
new_df = merge(drugs, days, by="ID")
Repeat this for other subsetted data frames:
new_df = merge(new_df, surgeon, by="ID")
# etc...
That is much more robust: even if some patients have a line that others don't have (e.g. days), their respective column in the new data frame will just contain an NA for that patient.
From looking through Stack Overflow and other sources, I believe that changing my dataframes to data.tables and using setkey, or similar, will give me what I want. But as of yet I have been unable to get a working syntax.
I have two data frames, one containing 26000 rows and the other containing 6410 rows.
The first dataframe contains the following columns:
Customer name, Base_Code, Idenity_Number, Financials
The second dataframe holds the following:
Customer name, Base_Code, Idenity_Number, Financials, Lapse
Both sets of data have identical formatting.
My goal is to join the Lapse column from the second dataframe to the first dataframe. The issue I have is that the numeric value in Financials does not match between the two datasets, and I only want the closest match in DF1 to have the value from the Lapse column in DF2 against it.
There will be examples where there are multiple entries for the same customer ID and Base Code in each dataframe, so I need to merge the two based on Idenity_Number and Base_Code (which is exact) and then match against the nearest financial numeric match for each entry only.
There will never be more entries in the DF2 then held within DF1 for each Customer and Base_Code.
Here is an example of DF1:
Here is an example of DF2:
And finally, here is what I want end up with:
If we use Jessica Rabbit as the example we have a match against DF1 and DF2, the financial value of 1240 from DF1 was matched against 1058 in DF2 as that was the closest match.
I could not work out how to get a working solution using data.table, so I re-thought my approach and have come up with a solution.
First of all I merged the two datasets, and then removed any entries that did not have a status of "LAP"; this gave me all of the non-lapsed entries:
NON_LAP <- merge(x=Merged,y=LapsesMonth,by=c("POLICY_NO","LOB_BASE"),all.x=TRUE)
NON_LAP <- NON_LAP [!grepl("LAP", NON_LAP$Status, ignore.case=FALSE),]
Next I merged again, this time looking specifically for the lapsed cases. To work out which was the closest match I used the abs() function, then I ordered by the lowest difference to get the closest matches in order. Finally I removed duplicates to show the closest matches, and then also kept the duplicates and stripped out the "LAP" status to ensure those that were not the closest match remained in the data.
Finally I merged them all together giving me the required outcome.
FIND_LAP <- merge(x=Merged,y=LapsesMonth,by=c("POLICY_NO","LOB_BASE"),all.y=FALSE)
FIND_LAP$Difference <- abs(FIND_LAP$GWP - FIND_LAP$ACTUAL_PRICE)
FIND_LAP <- FIND_LAP[order(FIND_LAP$Difference), ]  # name the column instead of the fragile [,27]
FOUND_LAP <- FIND_LAP [!duplicated(FIND_LAP[c("POLICY_NO","LOB_BASE")]),]
NOT_LAP <- FIND_LAP [duplicated(FIND_LAP[c("POLICY_NO","LOB_BASE")]),]
Hopefully this will help someone else who might be new to R and encounters the same issue.
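For reference, the data.table rolling join mentioned in the question can do the nearest-value match directly: with `roll = "nearest"`, the last key column is matched to the closest value within the earlier key columns. A sketch with toy data and the column names described in the question (values are illustrative); note it does not enforce one-to-one matching, so two DF2 rows could in principle claim the same DF1 row:

```r
library(data.table)

# Toy data with the column names described in the question.
df1 <- data.table(Idenity_Number = c(1, 1, 2),
                  Base_Code      = c("A", "A", "B"),
                  Financials     = c(500, 1240, 900))
df2 <- data.table(Idenity_Number = c(1, 2),
                  Base_Code      = c("A", "B"),
                  Financials     = c(1058, 910),
                  Lapse          = c("LAP", "LAP"))

df1[, row_id := .I]   # remember each DF1 row before keying

setkey(df1, Idenity_Number, Base_Code, Financials)
setkey(df2, Idenity_Number, Base_Code, Financials)

# For each DF2 row, find the DF1 row with the same ID and Base_Code
# whose Financials value is nearest (1058 -> 1240, not 500).
matched <- df1[df2, roll = "nearest"]

# Attach Lapse only to those matched DF1 rows; the rest stay NA.
out <- merge(df1, matched[, .(row_id, Lapse)], by = "row_id", all.x = TRUE)
```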