Identifying, reviewing, and deduplicating records in R

I'm looking to identify duplicate records in my data set based on multiple columns, review the records, and keep the ones with the most complete data in R. I would like to keep the row(s) associated with each name that have the maximum number of data points populated. In the case of date columns, I would also like to treat invalid dates as missing. My data looks like this:
df <- data.frame(Record = c(1, 2, 3, 4, 5),
                 First = c("Ed", "Sue", "Ed", "Sue", "Ed"),
                 Last = c("Bee", "Cord", "Bee", "Cord", "Bee"),
                 Address = c(123, NA, NA, 456, 789),
                 DOB = c("12/6/1995", "0056/12/5", NA, "12/5/1956", "10/4/1980"))
Record First Last Address       DOB
     1    Ed  Bee     123 12/6/1995
     2   Sue Cord         0056/12/5
     3    Ed  Bee
     4   Sue Cord     456 12/5/1956
     5    Ed  Bee     789 10/4/1980
So in this case I would keep records 1, 4, and 5. There are approximately 85000 records and 130 variables, so if there is a way to do this systematically, I'd appreciate the help. Also, I'm a total R newbie (as if you couldn't tell), so any explanation is also appreciated. Thanks!

# Add a new column to the dataframe containing the number of NA values in each row
df$nMissing <- apply(df, MARGIN = 1, FUN = function(x) sum(is.na(x)))
# Using ave, find the indices of the rows for each name with the minimum nMissing
# value and use them to filter your data
deduped_df <- df[which(df$nMissing == ave(df$nMissing, paste(df$First, df$Last), FUN = min)), ]
# If you like, remove the nMissing column
df$nMissing <- deduped_df$nMissing <- NULL
deduped_df
  Record First Last Address       DOB
1      1    Ed  Bee     123 12/6/1995
4      4   Sue Cord     456 12/5/1956
5      5    Ed  Bee     789 10/4/1980
Edit: Per your comment, if you also want to filter on invalid DOBs, you can start by converting the column to date format, which will automatically treat invalid dates as NA (missing data).
df$DOB <- as.Date(df$DOB, format = "%m/%d/%Y")
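Putting the two pieces together, a minimal end-to-end sketch (using the same column names as above): convert DOB first so invalid dates become NA and count as missing, then keep the most complete row(s) per name.
df$DOB <- as.Date(df$DOB, format = "%m/%d/%Y")  # invalid dates like "0056/12/5" become NA
df$nMissing <- rowSums(is.na(df))               # NAs per row
deduped_df <- df[df$nMissing == ave(df$nMissing, paste(df$First, df$Last), FUN = min), ]
deduped_df$nMissing <- NULL
# deduped_df now contains records 1, 4, and 5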

Related

Match rows across multiple columns but ignore NAs in Rstudio

I am using Rstudio to identify duplicate accounts in a data frame.
I want to find a way to identify any duplicates across certain columns but I am running into a problem with NAs.
In the lines below, I would want these two rows to be considered a match because they have the same first, last, DOB, and gender. But since one row has an NA in gender, they are not flagged as duplicates, because I create the is_duplicate flag from the concatenated Match column.
Any ideas how to adjust for that?
Id  -- First -- Last  -- DOB  -- Gender -- Match              -- Is_duplicate
123 -- Ali   -- Smith -- 1993 -- Female -- AliSmith1993Female -- 0
435 -- Ali   -- Smith -- 1993 -- NA     -- AliSmith1993NA     -- 0
Have you tried fuzzy matching using agrep?
https://www.rdocumentation.org/packages/base/versions/3.6.2/topics/agrep
or maybe this post will help you?
Smartest way to double loop over a data frame (comparing rows to each other with a Levenshtein Dist) in R?
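As a different angle from fuzzy matching, you could compare the key columns directly and treat NA as a wildcard, so two rows match whenever every non-NA field agrees. A minimal sketch using the two example rows above (this does pairwise comparisons, so it is quadratic and only suited to modestly sized data):
df <- data.frame(Id = c(123, 435),
                 First = c("Ali", "Ali"),
                 Last = c("Smith", "Smith"),
                 DOB = c(1993, 1993),
                 Gender = c("Female", NA))
# two rows match when every non-NA field agrees (NA acts as a wildcard)
rows_match <- function(a, b) all(a == b | is.na(a) | is.na(b))
keys <- df[, c("First", "Last", "DOB", "Gender")]
n <- nrow(keys)
df$Is_duplicate <- sapply(seq_len(n), function(i)
  as.integer(any(sapply(setdiff(seq_len(n), i), function(j)
    rows_match(unlist(keys[i, ]), unlist(keys[j, ]))))))
df$Is_duplicate
# [1] 1 1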

Merge columns with the same name R

I'm fairly new to R. I'm working with a data set that is incredibly redundant, with a lot of columns (~400). There are several duplicate column names; however, the data in them is not duplicated, so I need to sum those columns when collapsing them.
The columns all have a similar name that allows easy identification, so I'm hoping I can use that to my advantage.
I attempted to perform the following:
ColNames <- unique(colnames(df))
CombinedDf <- data.frame(sapply(ColNames, function(i) rowSums(df[, colnames(df) == i, drop = FALSE])))
This works if I sum over the range of columns that only contain integers, but the issue is that other columns have strings and such in them, so rowSums throws a fit.
Assuming that the identifier is "XXX", how can I aggregate all the columns that are of the same name leaving the other columns as is?
Thank you for your time.
Edit: Sample data has been asked for, I cannot give the exact data as it is sensitive, but I will give an example:
Name  COL1XXX COL2XXX COL1XXX COL3XXX COL2XXX Type
Henry       5      15      25      31       1 Orange
Tom         8      16      12       4       3 Green
Should return
Name  COL1XXX COL2XXX COL3XXX Type
Henry      30      16      31 Orange
Tom        20      19       4 Green
I'm not really sure, but you may try transposing the numeric columns and summing the rows that share a name, for example with rowsum():
# sum only the numeric columns that share a name; keep the rest as-is
num <- sapply(df, is.numeric)
summed <- t(rowsum(t(as.matrix(df[num])), group = colnames(df)[num]))
new_df <- cbind(df[!num], summed)
Again, without sample data I'm unsure if it'll work, but based on what you said, that might work.
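For reference, a quick check against the sample above (assuming the duplicate column names were preserved with check.names = FALSE when the data was read in, since data.frame() normally deduplicates names):
df <- data.frame(Name = c("Henry", "Tom"),
                 COL1XXX = c(5, 8), COL2XXX = c(15, 16), COL1XXX = c(25, 12),
                 COL3XXX = c(31, 4), COL2XXX = c(1, 3),
                 Type = c("Orange", "Green"), check.names = FALSE)
num <- sapply(df, is.numeric)
cbind(df[!num], t(rowsum(t(as.matrix(df[num])), group = colnames(df)[num])))
#    Name   Type COL1XXX COL2XXX COL3XXX
# 1 Henry Orange      30      16      31
# 2   Tom  Green      20      19       4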

Delete particular rows in R

In general, I know how to delete rows in R. However, for this particular requirement, I am unsure how to proceed. Here is an idea of what I need to do with data:
   ID       MONTH INCOME
1. 00000012     6     60
2. 00000012     8     65
3. 00000015    12     70
4. 00000025     4     45
5. 00000025     8     60
6. 00000032     6     10
7. 00000035     6     30
Quick explanation of each column:
The first 7 digits of ID identify an agent. So, in row one, 00000012 means agent 1. The last digit is the interview number. So, in row three, 00000015 means agent 1, interview 5.
Month and income are straightforward.
What Must Be Done
I need to delete every agent that does not have both a 2nd and a 5th interview.
For each agent, I then need to keep only the maximum month for the 2nd interview and for the 5th interview.
So, if I cleaned the data properly, I would have:
   ID       MONTH INCOME
2. 00000012     8     65
3. 00000015    12     70
6. 00000032     6     10
7. 00000035     6     30
Notice rows 4 and 5 are gone because there was no 2nd interview for agent 2. Row 1 is gone because there was a later month for agent 1, interview 2.
My current thoughts on how to do this seem overly complex. I am thinking of breaking ID into two columns: one with the first 7 digits, and another with the last digit. Then I would loop through the entire data and, at each row, run another loop to see if the ID that corresponds to the row has both an interview 2 and an interview 5. If it does, fine. If it doesn't, I then have to delete all rows with that ID.
Next, I have to do a similar thing for deleting non-max months.
I feel like I could do the above, but it is very cumbersome. Is there a better way to do this? Thank you.
You can do something like this (note that ID must be stored as character for substr() to work on it):
library(stringi)
Agents <- substr(df$ID, 1, nchar(df$ID) - 1)
A2 <- stri_endswith_fixed(df$ID, "2")
A5 <- stri_endswith_fixed(df$ID, "5")
A2and5 <- intersect(Agents[A5], Agents[A2])
df[Agents %in% A2and5, ]
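That handles the first requirement. For the second (keeping only the maximum month per remaining ID), a sketch in the same spirit, assuming MONTH is numeric, uses ave() to compare each row's month against its per-ID maximum:
kept <- df[Agents %in% A2and5, ]
# within each ID, keep only the row(s) whose MONTH equals the maximum
kept[kept$MONTH == ave(kept$MONTH, kept$ID, FUN = max), ]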

Find if a specific choice is in a Data Frame R

I have a Data Frame object which contains a list of possible choices. For example, an analogy of this would be:
FirstName, SurName, Subject, Grade
Brian, Smith, History, 75
Jenny, Jackson, English, 60
How would I...
1) Check to see if a certain pupil-subject combination is in my Data Frame
2) And for those who are there, extract their grade (And potentially other relevant fields)
?
Thanks so much
The only solutions I've found so far involve appending the values onto the end of the Data Frame and checking whether they are unique or not, which seems like a crude and ridiculous hack.
Learn data subsetting (extraction) using base R.
To subset any data frame by its rows and columns, you use [ ].
Let df be your data frame.
  FirstName SurName Subject Grade
1     Brian   Smith History    75
2     Jenny Jackson English    60
3       Tom Brandon Physics    50
You can subset it by its rows and columns using
df[rows,columns]
Here rows and columns can be:
1) Index (Number/Name)
This means: give me that particular row and column, like
df[2,3]
which returns the value in the second row, third column
[1] English
or
df[2,"Grade"]
returns
[1] 60
2) Range (Indices/List of Names)
This means: give me these rows and columns, like
df[1:2,2,drop=F]
Here drop=F prevents the result from being flattened to a vector, keeping the output as a data.frame. It will give you this:
  SurName
1   Smith
2 Jackson
A range also supports selecting everything, by leaving either the rows or the columns position empty, like
df[,3,drop=F]
this returns all rows of the third column
  Subject
1 History
2 English
3 Physics
or
df[1:2,c("Grade","Subject")]
  Grade Subject
1    75 History
2    60 English
3) Logical
Which means you want to subset using a logical condition.
df[df$FirstName=="Brian",]
meaning give me rows where FirstName is Brian and all columns for it.
  FirstName SurName Subject Grade
1     Brian   Smith History    75
or
df[df$FirstName=="Brian",1:3]
give me rows where FirstName is Brian, but only columns 1 to 3.
or create complex logicals
df[df$FirstName=="Brian" & df$SurName=="Smith",1:3]
output
  FirstName SurName Subject
1     Brian   Smith History
or use a complex logical and extract a column by name
df[df$FirstName=="Brian" & df$SurName=="Smith","Grade",drop=F]
  Grade
1    75
or use a complex logical and extract multiple columns by name
df[df$FirstName=="Brian" & df$SurName=="Smith",c("Grade","Subject")]
  Grade Subject
1    75 History
to use this in a function do
myfunc <- function(input_var1, input_var2, input_var3) {
  df[df$FirstName==input_var1 & df$SurName==input_var2 & df$Subject==input_var3, "Grade", drop=F]
}
run it like this
myfunc("Tom","Brandon","Physics")
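With the sample df above, that call should return something like
  Grade
3    50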
I think you are looking for this:
result <- data[data$FirstName == "Brian" & data$Subject == "History", c("Grade") ]
Try subset:
con <- textConnection("FirstName,SurName,Subject,Grade\nBrian,Smith,History,75\nJenny,Jackson,English,60")
dat <- read.csv(con, stringsAsFactors=FALSE)
subset(dat, FirstName=="Brian" & SurName=="Smith" & Subject=="History", Grade)
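With the sample data just read in, this should print
  Grade
1    75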
Maybe aggregate can be helpful, too. The following code gives the mean of the grades for all pupil/subject combinations:
dat <- transform(dat, FullName=paste(FirstName, SurName), stringsAsFactors=FALSE)
aggregate(Grade ~ FullName+Subject, data=dat, FUN=mean)
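If you need to check several pupil-subject combinations at once, a join is convenient. A small sketch with merge(), where lookup is a hypothetical table of the combinations to check; merge() does an inner join by default, so only combinations actually present in dat come back, along with their Grade and other fields:
lookup <- data.frame(FirstName = c("Brian", "Jenny"),
                     Subject = c("History", "Maths"),
                     stringsAsFactors = FALSE)
merge(dat, lookup, by = c("FirstName", "Subject"))
# only the Brian/History row survives, with its SurName and Grade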

How do I generate a dataframe displaying the number of unique pairs between two vectors, for each unique value in one of the vectors?

First of all, I apologize for the title. I really don't know how to succinctly explain this issue in one sentence.
I have a dataframe where each row represents some aspect of a hospital visit by a patient. A single patient might have thousands of rows for dozens of hospital visits, and each hospital visit could account for several rows.
One column is Medical.Record.Number, which corresponds to Patient IDs, and the other is Patient.ID.Visit, which corresponds to an ID for an individual hospital visit. I am trying to calculate the number of hospital visits each patient has had.
For example:
Medical.Record.Number Patient.ID.Visit
AAAXXX                            1111
AAAXXX                            1112
AAAXXX                            1113
AAAZZZ                            1114
AAAZZZ                            1114
AAABBB                            1115
AAABBB                            1116
would produce the following:
Medical.Record.Number Number.Of.Visits
AAAXXX                               3
AAAZZZ                               1
AAABBB                               2
The solution I am currently using is the following, where "data" is my dataframe:
# this function returns the number of unique hospital visits associated with the
# supplied record number
countVisits <- function(record.number){
  visits.by.number <- data$Patient.ID.Visit[which(data$Medical.Record.Number == record.number)]
  return(length(unique(visits.by.number)))
}
recordNumbers <- unique(data$Medical.Record.Number)
visits <- integer()
for (record in recordNumbers){
  visits <- c(visits, countVisits(record))
}
visit.counts <- data.frame(recordNumbers, visits)
This works, but it is pretty slow. I am dealing with potentially millions of rows of data, so I'd like something efficient. From what little I know about R, I know there's usually a faster way to do things without using a for-loop.
This essentially looks like a table() operation after you take out duplicates. First, some sample data
#sample data
dd<-read.table(text="Medical.Record.Number Patient.ID.Visit
AAAXXX 1111
AAAXXX 1112
AAAXXX 1113
AAAZZZ 1114
AAAZZZ 1114
AAABBB 1115
AAABBB 1116", header=T)
then you could do
tt <- table(Medical.Record.Number=unique(dd)$Medical.Record.Number)
as.data.frame(tt, responseName="Number.Of.Visits") #to get a data.frame rather than named vector (table)
# Medical.Record.Number Number.Of.Visits
# 1 AAABBB 2
# 2 AAAXXX 3
# 3 AAAZZZ 1
Or you could also think of this as an aggregation problem
aggregate(Patient.ID.Visit~Medical.Record.Number, dd, function(x) length(unique(x)))
# Medical.Record.Number Patient.ID.Visit
# 1 AAABBB 2
# 2 AAAXXX 3
# 3 AAAZZZ 1
There are many ways to do this; @MrFlick provided a handful of perfectly valid approaches. Personally, I'm fond of the data.table package. It's faster on large data frames, and I find its logic more intuitive than the base functions. I'd check it out if you are having problems with execution time.
library(data.table)
med.dt <- data.table(med_tbl)
num.visits.dt <- med.dt[, .(num_visits = length(unique(Patient.ID.Visit))),
                        by = Medical.Record.Number]
data.table should be much faster than base data frames on large tables.
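If the dplyr package is an option, the same count is also a one-liner with n_distinct(); a quick sketch against the sample dd from above:
library(dplyr)
dd %>%
  group_by(Medical.Record.Number) %>%
  summarise(Number.Of.Visits = n_distinct(Patient.ID.Visit))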
