Fuzzy matching by category in R

I am trying to fuzzy match two different dataframes based on company names, using the agrep function. To improve my matching, I would like to only match companies if they are located in the same country.
df1:
Company              ISO
Aalberts Industries  NL
Allison              NL
Allison              UK

df2:
Company               ISO
Aalberts              NL
Allison transmission  NL
Allison transmission  UK
I use the following function to match:
testb$test <- ""
for (i in 1:nrow(testb)) {
  x2 <- agrep(testb$name[i], testa$name, ignore.case = TRUE, value = TRUE,
              max.distance = Inf, useBytes = TRUE, fixed = TRUE)
  testb$test[i] <- paste0(x2, collapse = "")
}
I can create a subset for every country and then match each subset, which works, but it is time-consuming. Is there another way to let R match company names only if df1$ISO == df2$ISO? Thanks!
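One way to avoid building a subset per country is to restrict the candidate set inside the loop itself. A minimal sketch, assuming testa and testb are named as in the code above and both have name and ISO columns (max.distance = 0.1 is an illustrative choice, not a recommendation):

```r
testb$test <- ""
for (i in seq_len(nrow(testb))) {
  # only consider companies from the same country as row i
  candidates <- testa$name[testa$ISO == testb$ISO[i]]
  x2 <- agrep(testb$name[i], candidates, ignore.case = TRUE,
              value = TRUE, max.distance = 0.1)
  testb$test[i] <- paste0(x2, collapse = "; ")
}
```

This keeps a single pass over testb while agrep only ever scans the (much smaller) same-country slice of testa.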

Try indexing with the data.table package: https://www.r-bloggers.com/intro-to-the-data-table-package/.
Your company columns seem to be too dissimilar to match consistently and accurately with agrep(). For example, "Aalberts Industries" will match "Aalberts" only when you set max.distance to a value greater than 10. The same string distance would also report a match between "Algebra" and "Alleyway" — not very close at all. I recommend cleaning out the unnecessary words in your company columns before matching.
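A minimal sketch of that cleaning step, assuming a small vector of company names (the stop-word list is illustrative, not exhaustive):

```r
companies <- c("Aalberts Industries", "Allison Transmission")

# strip common corporate terms before fuzzy matching
stopwords <- c("Industries", "Transmission", "Holdings", "Group")
pattern   <- paste0("\\b(", paste(stopwords, collapse = "|"), ")\\b")
cleaned   <- trimws(gsub(pattern, "", companies, ignore.case = TRUE))
cleaned
# "Aalberts" "Allison"
```

With the filler words removed, agrep() can use a much smaller max.distance and produce fewer false positives.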

Related

Pattern matching character vectors in R

I am trying to match the characters between two vectors in two separate dataframes; let's call the dataframes "rentals" and "parcels". Both contain a character vector "address": the addresses of all rental parcels in a county, and the addresses of all parcels in a city, respectively. We would like to figure out which addresses in the "parcels" dataframe match an address in the "rentals" dataframe, by searching the vector of addresses in "parcels" for matches with an address in "rentals".
The values in rentals$address look like this:
rentals$address <- c("110 SW ARTHUR ST", "1610 NE 66TH AVE", "1420 SE 16TH AVE",...)
And the values in parcels$address look like this:
parcels$address <- c("635 N MARINE DR, PORTLAND, OR, 97217", "7023 N BANK ST, PORTLAND, OR, 97203", "5410 N CECELIA ST, PORTLAND, OR, 97203",...)
There are about 172,000 entries in the "parcels" dataframe and 285 in the "rentals" dataframe. My first solution was to match character values using grepl, which I don't think worked:
matches = grepl(rentals$address, parcels$address, fixed = TRUE)
This returns FALSE for every entry in parcels$address, but when I copy and paste some values of "address" from "rentals" into Excel's Ctrl+F window while viewing the "parcels" dataframe, I do find a few addresses. So some appear to match.
How would I best be able to find which observation's values in the "address" column of the "rentals" dataframe is a matching character sequence in the "parcels" dataframe?
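One likely cause: grepl() is not vectorized over its pattern argument and silently uses only the first element, which would explain the all-FALSE result. A sketch of looping over the shorter rentals vector instead, assuming the column names from the question:

```r
# for each rental address, test it as a fixed substring of every parcel address
hits <- sapply(rentals$address,
               function(a) grepl(a, parcels$address, fixed = TRUE))

# hits is a logical matrix: rows = parcels, columns = rentals
matched_parcels <- parcels$address[rowSums(hits) > 0]
```

With ~285 rentals against ~172,000 parcels this is 285 vectorized scans, which is usually fast enough for a one-off match.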
Are the addresses all exact matches? That is, no variations in spacing, capitalization, apartment number? If so, you might be able to use the dplyr function left_join to create a new df, using the address as the key, like so
library(dplyr)
df_compare <- df_rentals %>%
  left_join(df_parcels, by = "address")
Additionally, if you have columns along the lines of df_rentals$rentals = "yes" and df_parcels$parcels = "yes", you can filter the resulting new dataframe:
df_both <- filter(df_compare, rentals == "yes", parcels == "yes")

Creating a large frequency table from 'scratch' with specific ratios / values?

I have a problem that I can't figure out how to solve.
I have 3 (tibble) data frames with just names of different populations.
df1 is all unique surnames in Sweden and a column with a count.
There are 382,492 unique names, and (unique names * the count) sums to 10,002,985 people in df1.
10,002,985 is then the total population in this 'experiment'.
df2 is a list of all registered lawyers in Sweden.
6211 lawyers total in the population.
df3 is a list of all people with noble family surnames in Sweden
there are 542 unique names and 46851 people with noble surnames in the population.
We also know that the lawyer subgroup contains 106 lawyers with a noble surname.
Now my problem is that I want to create just one df with all this info.
The main idea is to create a df with one row per person in the population: 10,002,985 rows.
noble and lawyer are then dummy variables where 1 = yes, 0 = no. So, for example: for the tot_pop, 46,851 people should have noble = 1, and 106 out of that group should have lawyer = 1.
Notice that I don't really care what the names are - I just care about the ratios.
Notice also that the reason I want to create a new data frame without the names is that I think this is the easiest way to solve the problem. But if anyone insists, I can upload some sample data from each df.
In the end I want to run some probability tests.
Let me know if the question is confusing. Also, let me know if this is a really dumb way to go about this :p
SOLUTION:
It was quite easy once I realized what I was looking for :)
There is probably a more elegant solution.
library(tibble)

# pop: one row per person
pop <- 1:10002985

# noble: the first 46851 people are noble, the remaining 9956134 are not
n <- c(46851, 9956134)
noble <- rep(1:0, n)

# attorney: 106 noble lawyers, 46745 noble non-lawyers,
# 6105 non-noble lawyers (6211 - 106), 9950029 non-noble non-lawyers
a <- c(106, 46745, 6105, 9950029)
attorney <- rep(c(1, 0, 1, 0), a)

final_data <- tibble(pop, noble, attorney)
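The construction can be sanity-checked against the stated totals with table(), assuming the vectors above have been created:

```r
# cross-tabulate the two dummies; the noble = 1 / attorney = 1 cell should be 106
table(noble, attorney)

# marginal totals should match the population figures:
# 6211 lawyers and 46851 nobles
sum(final_data$attorney)
sum(final_data$noble)
```

If either sum disagrees with the known counts, one of the run lengths passed to rep() is off.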

simple Join error where some rows join and some don't

I have two dataframes which I'm trying to join. It should be straightforward, but I see some anomalous behavior.
Dataframe A

Name   Sample            Country  Path
John   S18902            UK       /Home/drive/John
BOB    135671            USA      /Home/drive/BOB
Tim    GB12345_serum_63  UK       /Home/drive/Tim
Wayne  12345_6789        UK       /Home/drive/Wayne

Dataframe B

Surname  Sample            State  FILE
Paul     S18902            NJ     John.csv
Gem      135671            PP     BOB.csv
Love     GB12345_serum_63  AP     Tim.csv
Dave     12345_6789        MK     Wayne.csv
I am using R markdown to do a simple join using the following command
Dataframec <- DataframeA %>%
  left_join(DataframeB, by = "Sample")
All rows join apart from the row where Sample == "GB12345_serum_63".
There should be a simple fix to this but I'm out of ideas.
Thank you
If you are cutting and pasting your data directly into your question, then the reason is that your key values are technically different: they contain different numbers of spaces.
I copied from your question from the beginning of the value to the start of the adjacent column name, i.e. up to 'Country' in the first case and up to 'State' in the second:
DataframeA: "GB12345_serum_63"
DataframeB: "GB12345_serum_63 "
You can see that for DataframeB there are 3 space characters after the value. This can be resolved by stripping leading and trailing whitespace from your key values with a regular expression: gsub("^\\s+|\\s+$", "", x)
DataframeA$Sample <- gsub("^\\s+|\\s+$", "", DataframeA$Sample)
DataframeB$Sample <- gsub("^\\s+|\\s+$", "", DataframeB$Sample)
Now your join should work.
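On R >= 3.2.0, the built-in trimws() does the same leading/trailing trim as the gsub() call above, assuming the same dataframe names:

```r
# trimws() strips leading and trailing whitespace by default
DataframeA$Sample <- trimws(DataframeA$Sample)
DataframeB$Sample <- trimws(DataframeB$Sample)
```

It is worth trimming both sides of the join even if only one looks dirty, since the padding is invisible when printed.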

Julia DataFrames: Problems with Split-Apply-Combine strategy

I have some data (from an R course assignment, but that doesn't matter) to which I want to apply the split-apply-combine strategy, but I'm having some problems. The data is in a DataFrame called outcome, and each row represents a hospital. Each column has some information about that hospital, like name, location, rates, etc.
My objective is to obtain the Hospital with the lowest "Mortality by Heart Attack Rate" of each State.
I was playing around with some strategies, and got a problem using the by function:
best_heart_rate(df) = sort(df, cols = :Mortality)[end,:]
best_hospitals = by(hospitals, :State, best_heart_rate)
The idea was to split the hospitals DataFrame by State, sort each of the SubDataFrames by Mortality Rate, get the lowest one, and combine the lines in a new DataFrame
But when I used this strategy, I got:
ERROR: no method nrow(SubDataFrame{Array{Int64,1}})
in sort at /home/paulo/.julia/v0.3/DataFrames/src/dataframe/sort.jl:311
in sort at /home/paulo/.julia/v0.3/DataFrames/src/dataframe/sort.jl:296
in f at none:1
in based_on at /home/paulo/.julia/v0.3/DataFrames/src/groupeddataframe/grouping.jl:144
in by at /home/paulo/.julia/v0.3/DataFrames/src/groupeddataframe/grouping.jl:202
I suppose the nrow function is not implemented for SubDataFrames, so I got an error. So I used nastier code:
best_heart_rate(df) = (df[sortperm(df[:,:Mortality] , rev=true), :])[1,:]
best_hospitals = by(hospitals, :State, best_heart_rate)
Seems to work. But now there is a NA problem: how can I remove the rows from the SubDataFrames that have NA on the Mortality column? Is there a better strategy to accomplish my objective?
I think this might work, if I've understood you correctly:
# Let me make up some data about hospitals in states
hospitals = DataFrame(State=sample(["CA", "MA", "PA"], 10), mortality=rand(10), hospital=split("abcdefghij", ""))
hospitals[3, :mortality] = NA
# You can use the indmax function to find the index of the maximum element
by(hospitals[complete_cases(hospitals), :], :State, df -> df[indmax(df[:mortality]), [:mortality, :hospital]])
State mortality hospital
1 CA 0.9469632421111882 j
2 MA 0.7137144590022733 f
3 PA 0.8811901895164764 e

Transforming character strings in R

I have to merge two data frames in R. The two data frames share a common id variable: the name of the subject. However, the names in one data frame are partly capitalized, while in the other they are in lower case. Furthermore, the names appear in reverse order. Here is a sample from the data frames:
DataFrame1$Name:
"Van Brempt Kathleen"
"Gräßle Ingeborg"
"Gauzès Jean-Paul"
"Winkler Iuliu"
DataFrame2$Name:
"Kathleen VAN BREMPT"
"Ingeborg GRÄSSLE"
"Jean-Paul GAUZÈS"
"Iuliu WINKLER"
Is there a way in R to make these two variables usable as an identifier for merging the data frames?
Best, Thomas
You can use gsub to convert the names around:
> names
[1] "Kathleen VAN BREMPT" "jean-paul GAULTIER"
> gsub("([^\\s]*)\\s(.*)","\\2 \\1",names,perl=TRUE)
[1] "VAN BREMPT Kathleen" "GAULTIER jean-paul"
>
This works by matching first anything up to the first space and then anything after that, and switching them around. Then add tolower() or toupper() if you want, and use match() for joining your data frames.
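A minimal sketch of that match() step, using hypothetical example vectors (the name `reordered` is illustrative, standing in for the output of the gsub() call above):

```r
# hypothetical example data
DataFrame1 <- data.frame(Name = c("Van Brempt Kathleen", "Winkler Iuliu"), x = 1:2)
reordered  <- c("WINKLER Iuliu", "VAN BREMPT Kathleen")

# case-insensitive lookup of each reordered name in DataFrame1
idx <- match(tolower(reordered), tolower(DataFrame1$Name))
idx
# 2 1
```

idx then gives, for each row of the second data frame, the matching row number in the first, which can be used to index DataFrame1 directly.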
Good luck matching Grassle with Graßle though. Lots of other things will probably bite you too, such as people with two first names separated by space, or someone listed with a title!
Barry
Here's a complete solution that combines the two partial methods offered so far (and overcomes the fears expressed by Spacedman about "matching Grassle with Graßle"):
DataFrame2$revname <- gsub("([^\\s]*)\\s(.*)","\\2 \\1",DataFrame2$Name,perl=TRUE)
DataFrame2$agnum <-sapply(tolower(DataFrame2$revname), agrep, tolower(DataFrame1$Name) )
DataFrame1$num <-1:nrow(DataFrame1)
merge(DataFrame1, DataFrame2, by.x="num", by.y="agnum")
Output:
num Name.x Name.y revname
1 1 Van Brempt Kathleen Kathleen VAN BREMPT VAN BREMPT Kathleen
2 2 Gräßle Ingeborg Ingeborg GRÄSSLE GRÄSSLE Ingeborg
3 3 Gauzès Jean-Paul Jean-Paul GAUZÈS GAUZÈS Jean-Paul
4 4 Winkler Iuliu Iuliu WINKLER WINKLER Iuliu
The third step would not be necessary if DataFrame1 had rownames that were still sequentially numbered (as they would be by default). The merge statement would then be:
merge(DataFrame1, DataFrame2, by.x="row.names", by.y="agnum")
--
David.
Can you add an additional column/variable to each data frame which is a lowercase version of the original name:
DataFrame1$NameLower <- tolower(DataFrame1$Name)
DataFrame2$NameLower <- tolower(DataFrame2$Name)
Then perform a merge on this:
MergedDataFrame <- merge(DataFrame1, DataFrame2, by="NameLower")
In addition to the answer using gsub to rearrange the names, you might want to also look at the agrep function, this looks for approximate matches. You can use this with sapply to find the matching rows from one data frame to the other, e.g.:
> sapply( c('newyork', 'NEWJersey', 'Vormont'), agrep, x=state.name, ignore.case=TRUE )
newyork NEWJersey Vormont
32 30 45
