Replace strings in a dataframe based on another dataframe - r

I have a 200k-row dataframe with a character column named "departament_name"; some of the values in this column contain a stray "?" character. For example: "GENERAL SAN MART?N", "UNI?N", etc.
I want to replace those values using another 750k-row dataframe that contains a column also named "departament_name", but whose values are correct. Following the example, these would be "GENERAL SAN MARTIN", "UNION", and so on.
Can I do this automatically using pattern matching, without building a dictionary (there are many values with this problem)?
My objective is to have a unified dataset combining the two dataframes, with unique values for those problematic rows in "departament_name". I prefer tidyverse (mutate, stringr, etc.) if possible.

You can try the stringdist_*_join functions from the fuzzyjoin package.
fuzzyjoin::stringdist_left_join(df1, df2, by = 'departament_name')
#   departament_name.x departament_name.y
# 1 GENERAL SAN MART?N GENERAL SAN MARTIN
# 2              UNI?N              UNION
Obviously, this works for the simple example you have shared, but it might not give a 100% correct result for every entry in your actual data. You can tweak the max_dist and method parameters for your data; see ?fuzzyjoin::stringdist_join for more information about them.
Data:
df1 <- data.frame(departament_name = c("GENERAL SAN MART?N", "UNI?N"))
df2 <- data.frame(departament_name = c("GENERAL SAN MARTIN", "UNION"))
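To fold the corrected spellings back into a single column, one option is to coalesce the matched value over the original. A sketch; the choice of method = "osa" and max_dist = 1 are assumptions you would tune for your data:

```r
library(fuzzyjoin)
library(dplyr)

df1 <- data.frame(departament_name = c("GENERAL SAN MART?N", "UNI?N"))
df2 <- data.frame(departament_name = c("GENERAL SAN MARTIN", "UNION"))

fixed <- stringdist_left_join(df1, df2,
                              by = "departament_name",
                              method = "osa", max_dist = 1) %>%
  # keep the corrected value where a match was found, else the original
  mutate(departament_name = coalesce(departament_name.y, departament_name.x)) %>%
  select(departament_name)
```

Rows with no match within max_dist keep their original (corrupted) value, so you can inspect the leftovers separately.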

Related

stringdist_join not merging data

I have three data frames that need to be merged. There are a few small differences between the competitor names in each data frame. For instance, one name might not have a space between the middle and last name, while the other data frame displays the person's name correctly (example: Sarah JaneDoe vs. Sarah Jane Doe). So I used the fuzzyjoin package. When I run the code below, it just keeps running, and I can't figure out how to fix it.
Can you identify where I went wrong?
library(fuzzyjoin)
library(tidyverse)
temp1   <- read.csv('https://raw.githubusercontent.com/bandcar/bjj/main/temp1.csv')
stats   <- read.csv('https://raw.githubusercontent.com/bandcar/bjj/main/stats.csv')
winners <- read.csv('https://raw.githubusercontent.com/bandcar/bjj/main/winners.csv')

# perform fuzzy-matching full join
star <- stringdist_join(temp1, stats,
                        by = 'Name',          # match based on Name
                        mode = 'full',        # use full join
                        method = "jw",        # use the Jaro-Winkler distance metric
                        max_dist = 99,
                        distance_col = 'dist') %>%
  group_by(Name.x)
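One likely culprit: the "jw" (Jaro-Winkler) distance is bounded in [0, 1], so max_dist = 99 accepts every pair of names and the join degenerates into a full cross join of both tables, which is why it appears to run forever. A sketch with toy data; the threshold of 0.1 is an assumption to tune against your names:

```r
library(fuzzyjoin)
library(dplyr)

a <- data.frame(Name = c("Sarah JaneDoe", "John Smith"))
b <- data.frame(Name = c("Sarah Jane Doe", "John Smith"))

# a small max_dist keeps only genuinely similar names instead of all pairs
star <- stringdist_join(a, b,
                        by = "Name",
                        mode = "full",
                        method = "jw",
                        max_dist = 0.1,
                        distance_col = "dist")
```

With a bounded threshold the join stays close to linear in the number of true matches rather than quadratic in the table sizes.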

Fuzzy matching Countries in R

For an assignment I have to use fuzzy matching in R to merge two datasets that both have a "Country" column. The first dataset is from Kaggle (the Countries dataset), while the other is from the ISO 3166 standard. The fuzzy matching itself worked well. I added a new column to each dataset that numbers the observations from 1 to its respective length (a must for fuzzy matching, as far as I understand), which I named "Observation number". My first dataset has 227 observations and the ISO dataset has 249.
I want to create a new dataset that includes the columns from my first dataset (I have to use this dataset specifically; it has columns like migration, literacy, etc.) and the country codes from the ISO dataset. I couldn't manage to do it. The fuzzy-matching output tells me where each observation of the first dataset sits in the ISO dataset. (For example, the first dataset orders countries Afghanistan, Albania, Algeria..., whereas ISO orders them Albania, Algeria, Afghanistan..., so the fuzzy-match output gave me 3, 1, 2..., meaning the 1st observation in the Countries dataset is the 3rd in the ISO dataset.)
I want to create a new dataset that has all the information of the Countries dataset, ordered with respect to the ISO dataset's "Country" column. However, I cannot do it using:
a <- (Result1$matches)$observationnumber
# a tells me where the i-th observation of the Country dataset can be found in the ISO dataset
countryorderedlikeISO <- countries.of.the.world[match(c(a), countries.of.the.world$observation), ]
It seems to ignore the countries that are present in ISO but not in the Country dataset.
What can I do? I want the new dataset to have ISO's length, with NA values for the observations that are present in ISO but not in Country.
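One way to get an ISO-length result is to invert the mapping: match from the ISO side, so every ISO row picks up either its counterpart from the Countries dataset or an all-NA row. A sketch with toy stand-ins for the real objects (Result1$matches and countries.of.the.world), following the 3, 1, 2 example from the question plus one unmatched ISO country:

```r
# toy Countries dataset: Afghanistan, Albania, Algeria
countries <- data.frame(Country  = c("Afghanistan", "Albania", "Algeria"),
                        literacy = c(0.4, 0.9, 0.8))

# a[i] = position of the i-th Countries row in the ISO dataset
# (Afghanistan is 3rd in ISO, Albania 1st, Algeria 2nd); ISO has 4 rows,
# so ISO row 4 has no counterpart in the Countries dataset
a <- c(3, 1, 2)

# match from the ISO side: one output row per ISO row, NA where unmatched
idx <- match(seq_len(4), a)
ordered_like_iso <- countries[idx, ]
```

The result has ISO's length (4 rows here), ordered like ISO, with an all-NA row for the ISO-only country, which is exactly what match() in the original direction cannot produce.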

How to search for county names in a description column with multiple strings - R

I have a donation dataset with a field in it called "Description", where the donor described what they gave their gift for. This field has multiple words or strings in it (sometimes a full sentence), and several rows list specific counties where they wanted their donation to be designated.
I would like to identify which rows in this field contain a county name, and flag that somehow in a new field. I have a dataframe with the county names from the two states I need, but I'm struggling to find code that lets me use the county field in that county dataframe as the basis for identifying county names within the Description field.
I'm still at a low level in R, but I'll try to give some sample code. I have over 1000 rows, so searching for specific counties one string at a time would take too long; it would be more helpful to use a list of counties as the basis for searching.
df <- tibble(`Donor Type` = c("Single Donation", "Grant", "Recurring Donation"),
             Amount = c("10", "50", "100"),
             Description = c("This is for Person County", "Books for Beaufort County", "Brews for Books"))
#   `Donor Type`       Amount Description
#   <chr>              <chr>  <chr>
# 1 Single Donation    10     This is for Person County
# 2 Grant              50     Books for Beaufort County
# 3 Recurring Donation 100    Brews for Books
I have a dataframe with the county names from two states (named Carolina.Counties below). What code should I use to make an additional column in my donor dataframe indicating which descriptions were limited to a specific county? I've been playing around with the following, but am not getting the right results.
Df <- apply(Df, 1, function(x)
  ifelse(any(Df$Description %in% Carolina.Counties$county), 'yes', 'no'))
%in% looks for an exact match. You need some sort of regex match, which can be achieved with the help of grepl.
df$result <- ifelse(grepl(paste0(Carolina.Counties$county, collapse = '|'),
                          df$Description), 'Yes', 'No')
paste0(Carolina.Counties$county, collapse = '|') creates a single regex pattern that looks for any of the counties. We search for this pattern in the Description column; where it is found we assign "Yes", otherwise "No".
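Put together with the sample data, and extended with stringr::str_extract to also record which county matched (an assumption about what the new field should hold):

```r
library(stringr)

df <- data.frame(Description = c("This is for Person County",
                                 "Books for Beaufort County",
                                 "Brews for Books"))
Carolina.Counties <- data.frame(county = c("Person County", "Beaufort County"))

# one alternation pattern covering every county in the lookup table
pat <- paste0(Carolina.Counties$county, collapse = "|")

df$result       <- ifelse(grepl(pat, df$Description), "Yes", "No")
df$which_county <- str_extract(df$Description, pat)  # NA when nothing matched
```

str_extract returns the first county found in each description, so descriptions without any county come back as NA rather than "No", which can be handier for later grouping.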

Convert comma separated column into multiple columns

I have a dataset of film with several columns, one of which is a column for country. Because some films are produced by more than one country, the film can have different countries at the same time in the "country" column. For example,
[screenshot: films with comma-separated values in the "country" column]
I now want to create a new dataset in which each row of the "country" column contains only one country. For example, in the screenshot above, Bluebeard was produced by "France", "Germany", and "Italy". I want the dataset to list Bluebeard with each of those countries on its own row.
I tried the strsplit() and colsplit() functions, but they don't seem to turn the comma-separated "country" column into one country per row.
Any suggestions? Thank you!
Using tidyr:
separate_rows(data, country, sep = ", ")
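A minimal reproduction with made-up film data (the real column names are assumed from the question):

```r
library(tidyr)

films <- data.frame(title   = "Bluebeard",
                    country = "France, Germany, Italy")

# one output row per country; all other columns are repeated
result <- separate_rows(films, country, sep = ", ")
```

This yields three rows for Bluebeard, one each for France, Germany, and Italy, with the title duplicated on every row.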

Merge with multiple conditions and nearest numerical match

From looking through Stack Overflow and other sources, I believe that converting my dataframes to data.tables and using setkey, or similar, will give me what I want, but so far I have been unable to get a working syntax.
I have two data frames, one containing 26000 rows and the other containing 6410 rows.
The first dataframe contains the following columns:
Customer name, Base_Code, Idenity_Number, Financials
The second dataframe holds the following:
Customer name, Base_Code, Idenity_Number, Financials, Lapse
Both sets of data have identical formatting.
My goal is to join the Lapse column from the second dataframe onto the first. The issue is that the numeric value in Financials does not match exactly between the two datasets, and I only want the closest match in DF1 to receive the Lapse value from DF2.
There will be cases with multiple entries for the same customer ID and Base Code in each dataframe, so I need to merge the two on Idenity_Number and Base_Code (which match exactly) and then, for each entry, match against the nearest Financials value.
There will never be more entries in DF2 than in DF1 for a given customer and Base_Code.
[examples of DF1, DF2, and the desired result were shown as screenshots]
If we use Jessica Rabbit as the example, we have a match between DF1 and DF2: the financial value of 1240 in DF1 was matched against 1058 in DF2, as that was the closest match.
I could not work out how to get a working solution using data.table, so I re-thought my approach and have come up with a solution.
First of all I merged the two datasets and then removed any entries with a status of "LAP", which gave me all of the non-lapsed entries:
NON_LAP <- merge(x = Merged, y = LapsesMonth, by = c("POLICY_NO", "LOB_BASE"), all.x = TRUE)
NON_LAP <- NON_LAP[!grepl("LAP", NON_LAP$Status, ignore.case = FALSE), ]
Next I merged again, this time looking specifically for the lapsed cases. To work out which row was the closest match I used the abs() function, then ordered by the smallest difference so the closest matches come first. I then removed duplicates to keep only the closest matches, and separately kept the duplicates (stripping out the "LAP" status) so that the rows which were not the closest match remained in the data.
Finally I merged them all together, giving me the required outcome.
FIND_LAP <- merge(x = Merged, y = LapsesMonth, by = c("POLICY_NO", "LOB_BASE"), all.y = FALSE)
FIND_LAP$Difference <- abs(FIND_LAP$GWP - FIND_LAP$ACTUAL_PRICE)
FIND_LAP <- FIND_LAP[order(FIND_LAP[, 27]), ]   # order by the difference (column 27)
FOUND_LAP <- FIND_LAP[!duplicated(FIND_LAP[c("POLICY_NO", "LOB_BASE")]), ]
NOT_LAP  <- FIND_LAP[duplicated(FIND_LAP[c("POLICY_NO", "LOB_BASE")]), ]
Hopefully this will help someone else who might be new to R and encounters the same issue.
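For completeness, the data.table route the question originally asked about can be written as a rolling join with roll = "nearest". A sketch with toy data, where id stands in for the exact-match keys (Idenity_Number / Base_Code) and financials is the numeric column to match approximately:

```r
library(data.table)

df1 <- data.table(id         = c("A", "A", "B"),
                  customer   = c("Jessica Rabbit", "Jessica Rabbit", "Roger"),
                  financials = c(1240, 500, 900))
df2 <- data.table(id         = c("A", "B"),
                  financials = c(1058, 910),
                  lapse      = c("LAP", "LAP"))

# key on the exact-match column(s) first, then the numeric column;
# roll = "nearest" rolls the last key column to the closest value
setkey(df1, id, financials)
matched <- df1[df2, roll = "nearest"]
```

For df2's ("A", 1058) row this picks the DF1 row with financials 1240 (distance 182) over 500 (distance 558), matching the Jessica Rabbit example above, while the exact-match key prevents nearest-value matches across different customers.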
