How to subset IDs with partial match into data frame - r

I am trying to subset data to create a list of possible duplicates in a new data frame. The problem is that the names are in different formats and possibly only a small part of the ID may actually match.
I need R to output a list of possible duplicates for me to then check
I've found a few examples for formatting issues, or for when it's the first few characters that you are trying to match. I am not sure how to put the code together, and the characters that match may be anywhere in the name.
So far, this seems to get me the closest, but I'm still not sure how to adapt the code to work for me.
Subset a df using partial match with multiple criteria
This is what my data looks like (but with 1000000s of lines):
Supplier.Name Date.of.Record BMCC.avg
SG & JM Hammond 2018-07-21 292.2381
Mileshan Nominees Pty Ltd 2018-12-21 130.0000
RW & GJ Brown & Sons 2018-02-21 162.8333
BD & BA Smith 2018-02-21 478.0000
In the end, I would like a list of possible duplicates based on partial matches (maybe 4 or 5 characters in a row?).
Right now I can't seem to put together any code at all. Even a few starting-point suggestions would be helpful.
Thanks!
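One possible starting point, sketched in base R only. The supplier names below are taken from the question, plus a hypothetical near-duplicate ("S G Hammond") added so the test has something to find; flag a pair as a possible duplicate when the two names share a run of 5 or more characters.

```r
suppliers <- c("SG & JM Hammond", "Mileshan Nominees Pty Ltd",
               "RW & GJ Brown & Sons", "S G Hammond", "BD & BA Smith")

# TRUE if a and b share any run of n characters (case-insensitive)
has_common_run <- function(a, b, n = 5) {
  a <- tolower(a)
  chunks <- substring(a, 1:(nchar(a) - n + 1), n:nchar(a))  # all n-char substrings of a
  any(vapply(chunks, grepl, logical(1), x = tolower(b), fixed = TRUE))
}

# Compare every pair of names and keep the suspicious ones
pairs <- combn(suppliers, 2, simplify = FALSE)
dups  <- Filter(function(p) has_common_run(p[1], p[2]), pairs)
```

On millions of rows an all-pairs comparison is far too slow, so you would want to block first (e.g. only compare names that share a first letter or a soundex code) and run the same test within each block.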

Related

How to scrape a complex table which has columns spanning multiple rows (a pedigree chart) in R?

I've looked at all the other related Stack Overflow questions, and none of them are close enough to what I'm trying to do to be useful. In part because, while some of those questions address dealing with tables where the leftward columns span multiple rows (as in a pedigree chart), they don't address how to handle the messy HTML which is somehow generating the chart. When I try the usual ways of ingesting the table with rvest it really doesn't work.
The table I'm trying to scrape looks like this:
When I extract the HTML of the first row (tr) of the table, I see that it contains: Betty, Jack, Bo, Bob, Jim, Dan, b 1932 (the very top of the table).
Great, you say, but not so fast. Because with this structure there's no way to know that Betty's mom is Sue (because Sue is on a different row).
Sue's row doesn't include Betty, but instead starts with Sue herself.
So in this example, Sue's row would be: Sue, Owen, Jacob, Luca, Blane, b 1940.
Furthermore, row #2 in the HTML is actually just Ava b 1947.
I.e., here's the content of each HTML row:
I tried using rvest to download the page and then extract the table.
A la:
pedigree <- read_html(page) %>% html_nodes("#PedigreeTable") %>% html_table
It really didn't work. Oddly, I got every column duplicated twice, so not too bad, but I'd rather have a tibble/dataframe/matrix with the first column being 32 Bettys, then the next column 16 each of Jack and Sue, etc...
I hope this is all clear as mud!
Ideally, as far as output, I'd get a nice neat dataframe with the columns person, father, mother. Like so:
Thanks in advance!
Maybe writing an algorithm can do it, like:
Select only the last two columns:
father_name = first non-NA value of the penultimate column
then browse the column to find the next non-NA value, counting the rows as you go
count = 1 + number of NA values
mother_name = second non-NA value
then count all rows until you find a name
count = count + 1 + number of NA values
Create your final table with:
name | father | mother
Isolate all the family's child names, and save them in your final table.
Assign father_name and mother_name to the corresponding columns of your final table.
Delete all the rows used and start again.
Once you have assigned all the last-column people to their parents, delete the last column.
Then delete all blank rows to get a structure similar to the one needed in the first step, and run the algorithm again.
Hope that helps!
PS: I suggest that you give a unique ID to each person at some point, to avoid confusion between people who have the same name.
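The father/mother step above can be sketched on a toy stand-in for one adjacent column pair of the chart (NA marks the cells swallowed by the rowspans; the names are the ones from the question):

```r
# Toy stand-in for one adjacent column pair of the scraped chart:
# `child` is the nearer column, `parent` the next column to the right.
cols <- data.frame(
  child  = c("Betty", NA, NA, NA),
  parent = c("Jack",  NA, "Sue", NA),
  stringsAsFactors = FALSE
)

parents  <- cols$parent[!is.na(cols$parent)]  # first non-NA = father, second = mother
children <- cols$child[!is.na(cols$child)]

# One row per child, in name | father | mother form
family <- data.frame(name   = children,
                     father = parents[1],
                     mother = parents[2],
                     stringsAsFactors = FALSE)
```

On the real chart you would repeat this over each block of rows belonging to one couple, as the algorithm describes, rather than over the whole column at once.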

Grouping towns/villages into cities/communities

I have this problem in R where I have a list of Spanish communities and inside each community there is a list of towns/municipalities.
For example, this is a list of municipalities inside the community of Catalonia.
https://en.wikipedia.org/wiki/Municipalities_of_Catalonia
So: Catalonia is one community, and within this community there is a list of towns/cities to which I would like to assign a new value, 'Catalonia'.
I have a list of all the municipalities/towns/cities in my dataset and I would like to group them into communities such as; Andalusia, Catalonia, Basque Country, Madrid etc.
Firstly, how can I go about grouping these rows into the list of communities?
For example; el prat de llobregat is a municipality within Catalonia so I would like to assign this to the region of Catalonia. Getafe is a municipality of Madrid so I would like to assign this to a value of Madrid. Alicante is a municipality of Valencia so I would like to assign this to a value Valencia. Etc.
That was my first question and if you are able to help with just that, I would be very thankful.
However, my dataset is not that clean. I did my best to remove Spanish accents and unnecessary code identifiers in the municipality names, but some small errors remain. For example, castellbisbal is a municipality of Catalonia, but some entries have very small spelling mistakes, i.e. one 'l' instead of two (castelbisbal).
These errors are human errors and are very small, is there a way I can work around this?
I was thinking of a vector of all correctly spelt names, and then renaming the incorrectly spelt names based on a percentage of incorrectness; could this work? For instance, castellbisbal is 13 characters long and has an error of 1 character, i.e. less than an 8% error rate. Can I rename values based on an error rate?
Do you have any suggestions on how I can proceed with the second part?
Any tips/suggestions would be great.
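The error-rate idea from the question can be sketched with base R's adist(); the canonical and observed names below are just small illustrative stand-ins:

```r
# Hypothetical vectors: `canonical` holds correctly spelt municipality names,
# `observed` the possibly misspelt values from the dataset.
canonical <- c("castellbisbal", "getafe", "el prat de llobregat")
observed  <- c("castelbisbal", "getafe", "el prat de lobregat")

d    <- adist(observed, canonical)            # edit-distance matrix (rows = observed)
rel  <- sweep(d, 2, nchar(canonical), "/")    # relative error: edits / name length
best <- apply(rel, 1, which.min)              # index of the closest canonical name
ok   <- apply(rel, 1, min) <= 0.08            # accept only if error rate is at most 8%

cleaned <- ifelse(ok, canonical[best], observed)
```

For the grouping itself, a two-column municipality-to-community lookup table built from the Wikipedia lists could then be merge()d onto the cleaned names.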
As for the spelling errors, have you tried the soundex algorithm? It was designed for exactly that, and at least two R packages implement it.
library(stringdist)
phonetic("barradas")
# [1] "B632"
phonetic("baradas")
# [1] "B632"
And the soundex codes for the same words are the same with package phonics.
library(phonics)
soundex("barradas")
# [1] "B632"
soundex("baradas")
# [1] "B632"
All you would have to do is compare the soundex codes, not the words themselves. Note that soundex was designed for the English language, so it can only handle English characters, not accents. But you say you are already taking care of those, so it might work with the words you have to process.

R How to compare Rows and Delete by Matching Strings

Match Up    Date       Points  Opponent Points  Reb  Opponent Reb
Dal vs Den  8/16/2015  20      21               10   15
Den vs Dal  8/16/2015  21      20               15   10
I have a dataframe with sports data. However, I have two rows for every game due to the way that the data had to be collected. For example the two rows above are the same game, but again the data had to be collected twice for each game in this case: once for Dal and one for Den.
I'd like to find a way to delete duplicate games. I figure that one of my conditions will have to be the game date. How else can I tell R which rows to check so it can delete duplicates? I assume that I should be able to tell R to:
Check that game date matches
If game date match and if "Teams" match then delete duplicate. (Can this be done even though the strings are not an exact match, i.e. since Den vs Dal and Dal vs Den would not be a matching string?)
Move on to the next row and repeat until the end of the spreadsheet.
R would not need to check more than 50 rows down before moving on to the next row.
Is there a function to test for matching individual words? For example, I do not want to have to tell R "if the cell contains Den..." or "if the cell contains Dal...", as this would involve too many teams. R needs to be able to check the cells for ANY value that could be in them and then look for the same value as a string in later rows.
Please help.
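One way to sketch this in base R (on a two-row stand-in built from the rows above): sort the two team codes inside each "Match Up" so that "Den vs Dal" and "Dal vs Den" produce the same key, append the date, and drop rows whose key has already been seen.

```r
games <- data.frame(
  MatchUp = c("Dal vs Den", "Den vs Dal"),
  Date    = c("8/16/2015", "8/16/2015"),
  Points  = c(20, 21),
  stringsAsFactors = FALSE
)

# Order-independent key: alphabetically sorted team codes plus the date
teams <- strsplit(games$MatchUp, " vs ")
key   <- sapply(teams, function(t) paste(sort(t), collapse = "-"))
games$key <- paste(key, games$Date)

deduped <- games[!duplicated(games$key), ]  # keeps the first row of each game
```

Because duplicated() scans the whole column at once, there is no need for the "check 50 rows down" window.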

Record linking and fuzzy name matching in big datasets in R

I'm trying to merge two large datasets. The common variable, first and last name, vary in spelling between the datasets and there are many duplicates, even between similarly spelled names. I've included download links for the files and some R code below. I'll walk through what I've tried and what went wrong.
There are a few R tutorials that have tried to tackle the (common) problem of record linking, but none of them deal with large datasets. I'm hoping the SO community can help me solve this problem.
The first dataset is a large file (several hundred thousand rows) of Federal Elections Commission political contributions.
The second is a custom dataset of the names and companies of every Internet company founder (~5,000 rows):
https://www.dropbox.com/s/lfbr9lmurv791il/010614%20CB%20Founders%20%20-%20CB%20Founders.csv?dl=0
--Attempted code matching with regular expressions--
My first attempt, thanks to the help of previous SO suggestions, was to use agrep and regular string matching. This narrowed down the names, but resulted in too many duplicates.
#Load files#
expends12 <- fread("file path for FEC", sep="|", header=FALSE)
crunchbase.raw <- fread("file path for internet founders")
exp <- expends12
cr <- crunchbase.raw
# use regular string matching
exp$xsub= gsub("^([^,]+)\\, (.{7})(.+)", "\\2 \\1", tolower(expends12$V8))
cr$ysub= gsub("^(.{7})([^ ]+) (.+)", "\\1 \\3", tolower(cr$name))
#merge files#
fec.merge <- merge(exp,cr, by.x="xsub", by.y="ysub")
The result is 6,900 rows, so there are a lot of duplicates. Many rows are people with the same names as Internet founders, such as Alexander Black, but from different states and with different job titles. So now it's a question of finding the real Internet founder.
One option to narrow the results would be to filter them by state. I might only take the Alexander Black from California or New York, because that is where most startups are founded. I might also only take certain job titles, such as CEO or founder. But many founders had jobs before and after their companies, so I wouldn't want to narrow by job title too much.
Alternatively, there is an R package, RecordLinkage, but as far as I can tell it needs similar rows and columns between the datasets, which is a nonstarter for this task.
I'm familiar with R, but have somewhat limited statistical knowledge and programming ability. Any step-by-step help is very much appreciated. Thank you and please let me know if there's any trouble downloading the data.
Why don't you select the columns you need from both datasets, rename them to match, and run the comparison? The result object gives you the row indices of the matches, and as long as you don't reorder anything you can use those indices to link the two datasets.
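A minimal sketch of that idea, with tiny stand-ins for the two tables (V8 and name are the columns used in the question's code; everything else here is hypothetical):

```r
fec <- data.frame(V8   = c("Black, Alexander", "Smith, John"),
                  row  = 1:2, stringsAsFactors = FALSE)
cb  <- data.frame(name = c("Alexander Black", "Jane Doe"),
                  row  = 1:2, stringsAsFactors = FALSE)

# Normalise both name columns to lower-case "first last" form
fec$key <- sub("^([^,]+), *(.+)$", "\\2 \\1", tolower(fec$V8))
cb$key  <- tolower(cb$name)

# merge() carries the original row indices along, so each match
# can be traced back to its source rows in both tables
matches <- merge(fec, cb, by = "key", suffixes = c(".fec", ".cb"))
```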

R: creating factor using data from multiple columns

I want to create a column that codes for whether patients have had a comorbid diagnosis of depression or not. Problem is, the diagnosis can be recorded in one of 4 columns:
ComorbidDiagnosis;
OtherDiagnosis;
DischargeDiagnosis;
OtherDischargeDiagnosis.
I've been using
levels(dataframe$ynDepression)[levels(dataframe$ComorbidDiagnosis)=="Depression"]<-"Yes"
for all 4 columns but I don't know how to code those who don't have a diagnosis in any of the columns. I tried:
levels(dataframe$ynDepression)[levels(dataframe$DischOtherDiagnosis &
dataframe$OtherDiagnosis &
dataframe$ComorbidDiagnosis &
dataframe$DischComorbidDiagnosis)==""]<-"No"
I also tried using && instead but it didn't work. Am I missing something?
Thanks in advance!
Edit: I tried uploading an image of some example data but I don't have enough reputations to upload images yet. I'll try to put an example here but might not work:
Patient ID  PrimaryDiagnosis  OtherDiagnosis  ComorbidDiagnosis
            AN                Depression
            AN
            AN                Depression      PTSD
            AN                                Depression
What's inside the [] must be (transformable to) a boolean for the subset to work. For example:
x <- 1:5
x[x>3]
# [1] 4 5
x>3
# [1] FALSE FALSE FALSE TRUE TRUE
works because the condition is a boolean vector. Sometimes the boolean can be implicit, as in dataframe[,"var"], which means dataframe[,colnames(dataframe)=="var"], but R must be able to turn it into a boolean somehow.
EDIT: As pointed out by beginneR, you can also subset with something like df[,c(1,3)], which is numeric but works the same way as df[,"var"]. I like to see that kind of subset as an implicit boolean, since it enables a yes/no choice, but you may well disagree and simply consider that it lets R select columns and rows.
In your case, the conditions you use are invalid (dataframe$OtherDiagnosis, for example).
You would need something like rowSums(df[,c("var1","var2","var3")]=="")==3, which is a valid condition.
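Putting both conditions together, a sketch on a small made-up table (the four column names are the ones from the question); rather than editing factor levels, it derives the whole yes/no column in one step:

```r
df <- data.frame(
  ComorbidDiagnosis       = c("Depression", "", ""),
  OtherDiagnosis          = c("", "", "Depression"),
  DischargeDiagnosis      = c("", "", ""),
  OtherDischargeDiagnosis = c("", "", ""),
  stringsAsFactors = FALSE
)

cols <- c("ComorbidDiagnosis", "OtherDiagnosis",
          "DischargeDiagnosis", "OtherDischargeDiagnosis")

# "Yes" if Depression appears in any of the four columns, "No" otherwise
df$ynDepression <- ifelse(rowSums(df[, cols] == "Depression") > 0, "Yes", "No")
```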
