Remove duplicate entries after fuzzy matching between tables - r

I am trying to find data entry errors in the names and locations of my dataset by fuzzy matching. I have a unique key from the original data, siterow_id, and have made a new key, pi_key, where I already identified some hard matches (no fuzzy matching). After running the fuzzy matching I get duplicate values: for some of the siterow_ids, the match appears from both the left and the right side of the join. I can manually look at the data, see where this occurs, and hard-code the removal of those rows, but I want a more algorithmic way of doing this as I move to a larger dataset with many more matches.
I tried doing it this way but it removes the matches on the left and the right. If possible I would love a tidyverse way to do this and not a loop.
The table output is included below. You can see a duplicate in rows 8 and 9.
for (site in three_letter_matches$siterow_id.x) {
  if (any(three_letter_matches$siterow_id.y == site)) {
    three_letter_matches <- three_letter_matches[!three_letter_matches$siterow_id.y == site, ]
  }
}
pi_key.x siterow_id.x last_name.x first_name.x city.x country.x pi_key.y siterow_id.y
<chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
1 6309 1-9CH29M kim kevin san f~ united s~ 11870 1-HC3YY6
2 7198 1-CJGRSZ kim jinseok seoul korea re~ 2952 1-2QBRZ2
3 7198 1-CJGRSZ kim jinseok seoul korea re~ 2952 1-3AHHSU
4 7198 1-CJGRSZ kim jinseok seoul korea re~ 2952 1-3JYF8V
5 7567 1-CW4DXI bar jair ramat~ israel 8822 1-E3UILG
6 8822 1-E3UILG bar jair ramat~ israel 7567 1-CW4DXI
7 11870 1-HC3YY6 kim kevin san f~ united s~ 6309 1-9CH29M
8 12357 1-HUUEA6 lee hyojin daeje~ korea re~ 13460 1-IGKCPP
9 13460 1-IGKCPP lee hyo jin daeje~ korea re~ 12357 1-HUUEA6
I found another way to do it
update <- three_letter_matches[!is.na(match(three_letter_matches$siterow_id.x, three_letter_matches$siterow_id.y)),]
update %<>% arrange(last_name.x, first_name.x) %>%
filter(row_number() %% 2 != 0)
three_letter_matches_update <- three_letter_matches %>%
anti_join(update)
Still open to suggestions.

Not the easiest problem, but there are a few ways to do this. The first that comes to mind is a bit slow, because it uses rowwise(), which processes one row at a time much like map() or lapply() would:
NOTE: This only works if siterow_id.x/y are character vectors. It won't work for factors.
three_letter_matches <- three_letter_matches %>%
  rowwise() %>%
  mutate(both_values = paste0(sort(c(siterow_id.x, siterow_id.y)), collapse = ",")) %>%
  ungroup() %>%
  distinct(both_values, .keep_all = TRUE) %>%
  select(-both_values)
# pi_key.x siterow_id.x last_name.x first_name.x city.x country.x pi_key.y siterow_id.y
# 6309 1-9CH29M kim kevin san f~ united s~ 11870 1-HC3YY6
# 7198 1-CJGRSZ kim jinseok seoul korea re~ 2952 1-2QBRZ2
# 7198 1-CJGRSZ kim jinseok seoul korea re~ 2952 1-3AHHSU
# 7198 1-CJGRSZ kim jinseok seoul korea re~ 2952 1-3JYF8V
# 7567 1-CW4DXI bar jair ramat~ israel 8822 1-E3UILG
# 12357 1-HUUEA6 lee hyojin daeje~ korea re~ 13460 1-IGKCPP
Basically, I use rowwise() so that I work on one row at a time. I take the two siterow_id values and sort them so that every row has them in the same order, then paste them together into a single string that is easy to compare for equality. Next I ungroup() so that the operations apply to all rows again (this undoes the rowwise). Then I run distinct() to keep only the first row for each value in the new column, with the .keep_all option so that all the other columns are kept. Finally, I clean up by removing the extra column.
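The same "sort the pair, then deduplicate" idea can be sketched without rowwise(), using the vectorised pmin() and pmax(), which also accept character vectors. This is only a minimal base R sketch: the small matches data frame is made up for illustration and stands in for three_letter_matches, and the same character-not-factor caveat applies.

```r
# Made-up stand-in for three_letter_matches, holding only the two key columns.
matches <- data.frame(
  siterow_id.x = c("1-9CH29M", "1-CW4DXI", "1-E3UILG", "1-HC3YY6"),
  siterow_id.y = c("1-HC3YY6", "1-E3UILG", "1-CW4DXI", "1-9CH29M"),
  stringsAsFactors = FALSE
)

# Order each pair consistently: the smaller id first, the larger id second.
lo <- pmin(matches$siterow_id.x, matches$siterow_id.y)
hi <- pmax(matches$siterow_id.x, matches$siterow_id.y)

# Keep only the first row seen for each unordered pair.
deduped <- matches[!duplicated(paste(lo, hi, sep = ",")), ]
deduped
```

Because pmin() and pmax() work on whole columns at once, this avoids the per-row overhead, which may matter on the larger dataset.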

Related

I am trying to filter on two conditions, but I keep removing all patients with either condition

I'm a beginner on R so apologies for errors, and thank you for helping.
I have a dataset (liver) where rows are patient ID numbers, and columns include what region the patient resides in (London, Yorkshire etc) and what unit the patient was treated in (hospital name). Some of the units are private units. I've identified 120 patients from London, of whom 100 were treated across three private units. I want to remove the 100 London patients treated in private units but I keep accidentally removing all patients treated in the private units (around 900 patients). I'd be grateful for advice on how to just remove the London patients treated privately.
I've tried various combinations of using subset and filter with different exclamation points and brackets in different places including for example:
liver <- filter(liver, region_name != "London" & unit_name!="Primrose Hospital" & unit_name != "Oak Hospital" & unit_name != "Wilson Hospital")
Thank you very much.
Your unit_name condition is zeroing your results. Try using the match function which is more commonly seen in its infix form %in%:
liver <- filter(liver,
region_name != "London",
! unit_name %in% c("Primrose Hospital",
"Oak Hospital",
"Wilson Hospital"))
Also you can separate logical AND conditions using a comma.
Building on Pariksheet's great start, which still drops private-hospital patients from outside London: here we need to use the OR operator, |, within the filter function. I've made an example data frame that demonstrates how this works for your case. The example tibble contains your three private hospitals plus one non-private hospital that we want to keep. It also has Manchester patients who attend either Manch General or one of the private hospitals, all of whom we want to keep.
EDITED: Now includes character vectors to allow generalisation of combinations to exclude.
liver <- tibble(region_name = rep(c('London', 'Liverpool', 'Glasgow', 'Manchester'), each = 4),
unit_name = c(rep(c('Primrose Hospital',
'Oak Hospital',
'Wilson Hospital',
'State Hospital'), times = 3),
rep(c('Manch General', 'Primrose Hospital'), each = 2)))
liver
# A tibble: 16 x 2
region_name unit_name
<chr> <chr>
1 London Primrose Hospital
2 London Oak Hospital
3 London Wilson Hospital
4 London State Hospital
5 Liverpool Primrose Hospital
6 Liverpool Oak Hospital
7 Liverpool Wilson Hospital
8 Liverpool State Hospital
9 Glasgow Primrose Hospital
10 Glasgow Oak Hospital
11 Glasgow Wilson Hospital
12 Glasgow State Hospital
13 Manchester Manch General
14 Manchester Manch General
15 Manchester Primrose Hospital
16 Manchester Primrose Hospital
excl.private.regions <- c('London',
'Liverpool',
'Glasgow')
excl.private.hospitals <- c('Primrose Hospital',
'Oak Hospital',
'Wilson Hospital')
liver %>%
filter(! region_name %in% excl.private.regions |
! unit_name %in% excl.private.hospitals)
# A tibble: 7 x 2
region_name unit_name
<chr> <chr>
1 London State Hospital
2 Liverpool State Hospital
3 Glasgow State Hospital
4 Manchester Manch General
5 Manchester Manch General
6 Manchester Primrose Hospital
7 Manchester Primrose Hospital
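For the original question (drop only the London patients treated privately), it can be easier to write the condition for the rows to remove and then negate the whole thing. By De Morgan's law, !(A & B) is the same as !A | !B, which is why the OR form works. A minimal base R sketch follows; the six-row liver data frame and its patient_id column are made up for illustration.

```r
# Made-up example data: one row per patient.
liver <- data.frame(
  patient_id  = 1:6,
  region_name = c("London", "London", "London",
                  "Yorkshire", "Yorkshire", "Manchester"),
  unit_name   = c("Primrose Hospital", "Oak Hospital", "State Hospital",
                  "Primrose Hospital", "State Hospital", "Wilson Hospital"),
  stringsAsFactors = FALSE
)

private_units <- c("Primrose Hospital", "Oak Hospital", "Wilson Hospital")

# TRUE only for rows that are BOTH from London AND treated in a private unit.
drop_row <- liver$region_name == "London" & liver$unit_name %in% private_units

# Negate the combined condition to keep everything else.
kept <- liver[!drop_row, ]
kept

# The dplyr equivalent would be:
# filter(liver, !(region_name == "London" & unit_name %in% private_units))
```

Patients 1 and 2 are removed; the private-unit patients from Yorkshire and Manchester are kept.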

Is it possible to convert lines from a text file into columns to get a dataframe?

I have a text file containing information on book title, author name, and country of birth, which appear on separate lines as shown below:
Oscar Wilde
De Profundis
Ireland
Nathaniel Hawthorn
Birthmark
USA
James Joyce
Ulysses
Ireland
Walt Whitman
Leaves of Grass
USA
Is there any way to convert the text to a dataframe with these three items appearing as different columns:
ID Author Book Country
1 "Oscar Wilde" "De Profundis" "Ireland"
2 "Nathaniel Hawthorn" "Birthmark" "USA"
There are built-in functions for dealing with this kind of data:
data.frame(scan(text=xx, multi.line=TRUE,
what=list(Author="", Book="", Country=""), sep="\n"))
# Author Book Country
#1 Oscar Wilde De Profundis Ireland
#2 Nathaniel Hawthorn Birthmark USA
#3 James Joyce Ulysses Ireland
#4 Walt Whitman Leaves of Grass USA
You can create a 3-column matrix from one column of data.
dat <- read.table('data.txt', sep = ',')
result <- matrix(dat$V1, ncol = 3, byrow = TRUE) |>
data.frame() |>
setNames(c('Author', 'Book', 'Country'))
result <- cbind(ID = 1:nrow(result), result)
result
# ID Author Book Country
#1 1 Oscar Wilde De Profundis Ireland
#2 2 Nathaniel Hawthorn Birthmark USA
#3 3 James Joyce Ulysses Ireland
#4 4 Walt Whitman Leaves of Grass USA
There aren't any built in functions that handle data like this. But you can reshape your data after importing.
#Test data
xx <- "Oscar Wilde
De Profundis
Ireland
Nathaniel Hawthorn
Birthmark
USA
James Joyce
Ulysses
Ireland
Walt Whitman
Leaves of Grass
USA"
writeLines(xx, "test.txt")
And then the code
library(dplyr)
library(tidyr)
lines <- read.csv("test.txt", header=FALSE)
lines %>%
mutate(
rid = ((row_number()-1) %% 3)+1,
pid = (row_number()-1) %/%3 + 1) %>%
mutate(col=case_when(rid==1~"Author",rid==2~"Book", rid==3~"Country")) %>%
select(-rid) %>%
pivot_wider(names_from=col, values_from=V1)
Which returns
# A tibble: 4 x 4
pid Author Book Country
<dbl> <chr> <chr> <chr>
1 1 Oscar Wilde De Profundis Ireland
2 2 Nathaniel Hawthorn Birthmark USA
3 3 James Joyce Ulysses Ireland
4 4 Walt Whitman Leaves of Grass USA

Separating characters in dfm object R

all,
I have imported the sotu corpus from quanteda in R. I am somewhat new to dfm objects and want to separate the doc_id column to give me a name column and a year column. If this were a tibble, this code would work:
library(quanteda)
library(quanteda.corpora)
library(tidyverse)
sotu <- as_tibble(data_corpus_sotu)
sotusubsetted <- sotu %>%
separate(doc_id, c("name","year"),"-")
However, since I am new with dfm and regex, I am not sure if there is an equivalent process if I load in the data as:
library(quanteda)
library(quanteda.corpora)
library(tidyverse)
sotu <- corpus(data_corpus_sotu)
sotudfm <- dfm(sotu)
Is there some equivalent way to do this with dfm objects?
The safest method is also one that will work for any core quanteda object, meaning equally for a corpus, tokens, or dfm object. This involves using the accessor functions, not addressing the internals of the corpus or dfm objects directly, which is strongly discouraged. You can do that, but your code could break in the future if those object structures are changed. In addition, our accessor functions are generally also the most efficient method.
For this task, you want to use the docnames() function for accessing the document IDs, and this works for the corpus as well as for the dfm.
library("quanteda")
## Package version: 2.1.2
data("data_corpus_sotu", package = "quanteda.corpora")
data.frame(doc_id = docnames(data_corpus_sotu[1:5])) %>%
tidyr::separate(doc_id, c("name", "year"), "-")
## name year
## 1 Washington 1790
## 2 Washington 1790b
## 3 Washington 1791
## 4 Washington 1792
## 5 Washington 1793
data.frame(doc_id = docnames(dfm(data_corpus_sotu[1:5]))) %>%
tidyr::separate(doc_id, c("name", "year"), "-")
## name year
## 1 Washington 1790
## 2 Washington 1790b
## 3 Washington 1791
## 4 Washington 1792
## 5 Washington 1793
You could also have taken this from the "President" docvar field and the "Date":
data.frame(
name = data_corpus_sotu$President,
year = lubridate::year(data_corpus_sotu$Date)
) %>%
head()
## name year
## 1 Washington 1790
## 2 Washington 1790
## 3 Washington 1791
## 4 Washington 1792
## 5 Washington 1793
## 6 Washington 1794
Created on 2021-02-13 by the reprex package (v1.0.0)
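If you would rather not pull in tidyr for the split, the same separation can be sketched in base R with sub(). The doc_ids vector below is a hand-typed stand-in for the result of docnames(); with quanteda loaded you would use docnames(sotudfm) instead. Attaching the result through docvars<- is an assumption about how you might want to store it, not something shown in the answer above.

```r
# Hand-typed stand-in for docnames(sotudfm).
doc_ids <- c("Washington-1790", "Washington-1790b", "Adams-1797")

parts <- data.frame(
  name = sub("-.*$", "", doc_ids),    # everything before the first hyphen
  year = sub("^[^-]*-", "", doc_ids), # everything after the first hyphen
  stringsAsFactors = FALSE
)
parts

# Assumed follow-up with quanteda loaded (not run here):
# docvars(sotudfm, "name") <- parts$name
# docvars(sotudfm, "year") <- parts$year
```

Note that year stays a character column, since IDs such as "Washington-1790b" carry a letter suffix.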
The following code will do exactly what you want, although it might break some operations in quanteda that look for docid_ in sotudfm@docvars, the data frame that stores the documents' relational data. For instance, it will break any filtering by sotudfm@Dimnames$docs, which is where the dimension names of the documents are listed.
sotudfm@docvars <- sotudfm@docvars %>% separate(col = docid_, c("name","year"),"-")
sotudfm@docvars %>% as_tibble()
# A tibble: 241 x 10
docname_ name year segid_ FirstName President Date delivery type party
<chr> <chr> <chr> <int> <chr> <chr> <date> <fct> <fct> <fct>
1 Washington-1790 Washington 1790 1 George Washington 1790-01-08 spoken SOTU Independent
2 Washington-1790b Washington 1790b 1 George Washington 1790-12-08 spoken SOTU Independent
3 Washington-1791 Washington 1791 1 George Washington 1791-10-25 spoken SOTU Independent
4 Washington-1792 Washington 1792 1 George Washington 1792-11-06 spoken SOTU Independent
5 Washington-1793 Washington 1793 1 George Washington 1793-12-03 spoken SOTU Independent
6 Washington-1794 Washington 1794 1 George Washington 1794-11-19 spoken SOTU Independent
7 Washington-1795 Washington 1795 1 George Washington 1795-12-08 spoken SOTU Independent
8 Washington-1796 Washington 1796 1 George Washington 1796-12-07 spoken SOTU Independent
9 Adams-1797 Adams 1797 1 John Adams 1797-11-22 spoken SOTU Federalist
10 Adams-1798 Adams 1798 1 John Adams 1798-12-08 spoken SOTU Federalist
Here is the code that ended up working for me:
sotudfm@docvars <- sotudfm@docvars %>%
separate(col = docname_, c("name","year"),"-")
This kept the doc_id intact when I ran
head(sotudfm, 10)
It appears that docid_ and docname_ are identical.

Find airports that have flights connected to them

library(tidyverse)
library(nycflights13)
I want to find out which airports have flights to them. My attempt is shown below, but it is not correct (it yields a number that is far bigger than the number of airports):
airPortFlights <- airports %>% rename(dest=faa) %>% left_join(flights, "dest"=faa)
If anyone wonders why I do the rename above, that's because it won't let me do
airports %>% left_join(flights, "dest"=faa)
It gives
Error: `by` required, because the data sources have no common variables
I even tried airports %>% left_join(flights, by=c("dest"=faa)) and several other attempts, which are also not working.
Thanks in advance.
You want an inner_join and then either count the distinct flights, or just list the airports using distinct. Here I count them.
library(dplyr)
inner_join(airports, flights, by=c("faa"="dest")) %>%
count(faa, name) %>% # number of flights
arrange(-n)
# A tibble: 101 x 3
faa name n
<chr> <chr> <int>
1 ORD Chicago Ohare Intl 17283
2 ATL Hartsfield Jackson Atlanta Intl 17215
3 LAX Los Angeles Intl 16174
4 BOS General Edward Lawrence Logan Intl 15508
5 MCO Orlando Intl 14082
6 CLT Charlotte Douglas Intl 14064
7 SFO San Francisco Intl 13331
8 FLL Fort Lauderdale Hollywood Intl 12055
9 MIA Miami Intl 11728
10 DCA Ronald Reagan Washington Natl 9705
# ... with 91 more rows
So 101 of the 1,458 airports in this dataset have at least 1 record in the flights dataset, with Chicago's O'Hare Intl airport having the most flights from New York.
And just for fun, the following lists the airports that don't have any flights from NY:
anti_join(airports, flights, by=c("faa"="dest"))
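The reason the original left_join gives far more rows than there are airports is that each airport row is repeated once per matching flight. If only the list of airports is wanted, a filtering test avoids the duplication entirely. Below is a minimal base R sketch on made-up toy tables (the ZZZ airport is hypothetical); with dplyr, semi_join(airports, flights, by = c("faa" = "dest")) should return the same rows as with_flights.

```r
# Toy stand-ins for nycflights13::airports and nycflights13::flights.
airports <- data.frame(
  faa  = c("ORD", "ATL", "ZZZ"),
  name = c("Chicago Ohare Intl", "Hartsfield Jackson Atlanta Intl",
           "Hypothetical Field"),
  stringsAsFactors = FALSE
)
flights <- data.frame(
  origin = c("JFK", "LGA", "EWR"),
  dest   = c("ORD", "ORD", "ATL"),
  stringsAsFactors = FALSE
)

# Airports that appear at least once as a destination: one row per airport,
# no matter how many flights go there.
with_flights <- airports[airports$faa %in% flights$dest, ]

# Airports that never appear as a destination (the anti_join case).
without_flights <- airports[!airports$faa %in% flights$dest, ]

with_flights
without_flights
```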

standardizing customer ids based on the same company name

I need to take one of the many customer IDs and standardize it across all company names that are exactly the same.
Before
Customer.Ids Company Location
1211 Lightz New York
1325 Comput.Inc Seattle
1756 Lightz California
After
Customer.Ids Company Location
1211 Lightz New York
1325 Comput.Inc Seattle
1211 Lightz California
The customer ids for the two companies are now the same. Which code would be the best for this?
We can use match here as it returns the first matching position. We can match Company with Company. According to ?match
match returns a vector of the positions of (first) matches of its first argument in its second.
df$Customer.Ids <- df$Customer.Ids[match(df$Company, df$Company)]
df
# Customer.Ids Company Location
#1 1211 Lightz NewYork
#2 1325 Comput.Inc Seattle
#3 1211 Lightz California
where
match(df$Company, df$Company) #returns
#[1] 1 2 1
Some other options, using sapply
df$Customer.Ids <- df$Customer.Ids[sapply(df$Company, function(x)
which.max(x == df$Company))]
Here we loop over each Company and get the first instance of its occurrence.
Or another option using ave, which follows the same logic as that of @Shree, to get the first occurrence by group.
with(df, ave(Customer.Ids, Company, FUN = function(x) head(x, 1)))
#[1] 1211 1325 1211
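Since df itself is not defined in the snippets above, here is a self-contained sketch of the match() approach. The data frame is rebuilt by hand from the "Before" table in the question, so treat its exact column types as an assumption.

```r
# Rebuilt by hand from the "Before" table in the question.
df <- data.frame(
  Customer.Ids = c(1211L, 1325L, 1756L),
  Company      = c("Lightz", "Comput.Inc", "Lightz"),
  Location     = c("New York", "Seattle", "California"),
  stringsAsFactors = FALSE
)

# For every row, the position of the first row with the same Company.
first_pos <- match(df$Company, df$Company)

# Look up the id stored at that first position.
df$Customer.Ids <- df$Customer.Ids[first_pos]
df
```

The id that wins is whichever appears first for each company, so the row order decides the result; sort the data first if a particular id should be preferred.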
Here's a way using the dplyr package. It replaces every ID with the first one found for that company:
df %>%
group_by(Company) %>%
mutate(
Customer.Ids = Customer.Ids[1]
) %>%
ungroup()
# A tibble: 3 x 3
Customer.Ids Company Location
<int> <fct> <fct>
1 1211 Lightz New York
2 1325 Comput.Inc Seattle
3 1211 Lightz California

Resources