Locate and merge duplicate rows in a data.frame but ignore column order - r

I have a data.frame with 1,000 rows and 3 columns. It contains a large number of duplicates and I've used plyr to combine the duplicate rows and add a count for each combination as explained in this thread.
Here's an example of what I have now (I still also have the original data.frame with all of the duplicates if I need to start from there):
name1 name2 name3 total
1 Bob Fred Sam 30
2 Bob Joe Frank 20
3 Frank Sam Tom 25
4 Sam Tom Frank 10
5 Fred Bob Sam 15
However, column order doesn't matter. I just want to know how many rows have the same three entries, in any order. How can I combine the rows that contain the same entries, ignoring order? In this example I would want to combine rows 1 and 5, and rows 3 and 4.

Define another column that's a "sorted paste" of the names, which would have the same value of "Bob~Fred~Sam" for rows 1 and 5. Then aggregate based on that.
Brief code snippet (assumes original data frame is dd): it's all really intuitive. We create a lookup column (take a look and should be self explanatory), get the sums of the total column for each combination, and then filter down to the unique combinations...
dd$lookup=apply(dd[,c("name1","name2","name3")],1,
function(x){paste(sort(x),collapse="~")})
tab1=tapply(dd$total,dd$lookup,sum)
ee=dd[match(unique(dd$lookup),dd$lookup),]
ee$newtotal=as.numeric(tab1)[match(ee$lookup,names(tab1))]
You now have in ee a set of unique rows and their corresponding total counts. Easy - and no external packages needed. And crucially, you can see at every stage of the process what is going on!
(Minor update to help OP:) And if you want a cleaned-up version of the final answer:
outdf = with(ee,data.frame(name1,name2,name3,
total=newtotal,stringsAsFactors=FALSE))
This gives you a neat data frame with the three all-important name columns, and with the aggregated totals in a column called total rather than newtotal.

Sort the index columns, then use ddply to aggregate and sum:
Define the data:
dat <- " name1 name2 name3 total
1 Bob Fred Sam 30
2 Bob Joe Frank 20
3 Frank Sam Tom 25
4 Sam Tom Frank 10
5 Fred Bob Sam 15"
x <- read.table(text=dat, header=TRUE)
Create a copy:
xx <- x
Use apply to sort the columns, then aggregate:
xx[, -4] <- t(apply(xx[, -4], 1, sort))
library(plyr)
ddply(xx, .(name1, name2, name3), numcolwise(sum))
name1 name2 name3 total
1 Bob Frank Joe 20
2 Bob Fred Sam 45
3 Frank Sam Tom 35

Related

Complex merging in R with duplicate matching values in y set producing problems

So I'm trying to merge two dataframes. Dataframe x looks something like:
Name ParentID
Steve 1
Kevin 1
Stacy 1
Paula 4
Evan 7
Dataframe y looks like:
ParentID OtherStuff
1 things
2 stuff
3 item
4 ideas
5 short
6 help
7 me
The dataframe I want would look like:
Name ParentID OtherStuff
Steve 1 things
Kevin 1 things
Stacy 1 things
Paula 4 ideas
Evan 7 me
Using a left merge gives me substantially more observations than I want, with many duplicates. Any idea how to merge things, where y is duplicated where appropriate to match x?
I'm working with a databases set up similarly to the example. x has 5013 observations, while y has 6432. Using the merge function as described by Joel and thelatemail gives me 1627727 observations.
We can use match from base R
df1$OtherStuff <- with(df1, df2$OtherStuff[match(ParentID, df2$ParentID)])
df1
# Name ParentID OtherStuff
#1 Steve 1 things
#2 Kevin 1 things
#3 Stacy 1 things
#4 Paula 4 ideas
#5 Evan 7 me

R merge dataframes with different columns with order

I have three large data frames below, and want to merge into one data frame with order.
df 1:
First Name Last Name
John Langham
Paul McAuley
Steven Hutchison
Sean Hamilton
N N
df2:
First Name Wage Location
John 500 HK
Paul 600 NY
Steven 1900 LDN
Sean 800 TL
N N N
df3:
Last Name Time
Langham 8
McAuley 9
Hutchison 12
Hamilton 7
N N
desired output:
First Name Last Name Wage Location Time
John Langham 500 HK 8
Paul McAuley 600 NY 9
Steven Hutchison 1900 LDN 12
Sean Hamilton 800 TL 7
N N N N N
I know how to merge df1 and df2 but df1+2 merges to df3 by second column changed the order in desired output, so I want to ask is there any recommendation? Thank you.
If you are trying to preserve the order as it appears in df1, create a column to memorialize that order, then use it again to set the order of df3
# record order
df1$original_order <- 1:nrow(df1)
# then do your merges...
# ...
# then restore the df1 order to df3
df3 <- df3[order(df3$original_order),]
If you want you can then also get rid of that column:
df3$original_order <- NULL

Add column to R dataframe that is length of string in another column

This should be EASY but I can't figure it out and search didn't help. I'd like to add a column to a dataframe that is just the length of the strings in another column.
So say I have a data frame of names like such:
Name Last
1 John Doe
2 Edgar Poe
3 Walt Whitman
4 Jane Austen
I'd like to append a new column with the string length of, say, the last name, so it would look like:
Name Last Length
1 John Doe 3
2 Edgar Poe 3
3 Walt Whitman 7
4 Jane Austen 6
Thanks
We can use str_count from stringr
library(stringr)
df1$Length <- str_count(df1$Last)
df1$Length
[1] 3 3 7 6
If you want to filter by the length, based on column, then do the following:
library(dplyr)
df<- df %>%
filter(nchar(Last) <= 3)

Efficiently joining two data tables with a condition

One data table (let's call is A) contains the ID numbers:
ID
3
5
12
8
...
and another table (let's call it B) contains the lower bound and the upper bound and the name for that ID.
ID_lower ID_upper Name
1 4 James
5 7 Arthur
8 11 Jacob
12 13 Sarah
so based on table B, given the ID from table A, we can find the matching name by finding the name on the row in table B such that
ID_lower <= ID <= ID upper
and I wanna create a table of ID and Name, so in the above example, it would be
ID Name
3 James
5 Arthur
12 Sarah
8 Jacob
... ...
I used for loop, so that for each row of A, I look for the row in B such that ID is between the ID_lower and ID_upper for that row and joined the name from there.
However, this method was a bit slow. Is there a fast way of doing it in R?
Using the new non-equi joins feature in the current development version of data.table, this is straightforward:
require(data.table) # v1.9.7+
dt2[dt1, .(ID, Name), on=.(ID_lower <= ID, ID_upper >= ID)]
See the installation instructions for devel version here.
where,
dt1=fread('ID
3
5
12
8')
dt2 = fread('ID_lower ID_upper Name
1 4 James
5 7 Arthur
8 11 Jacob
12 13 Sarah')
You can make a look-up table with your second data.frame (B):
lu <- do.call(rbind,
apply(B,1,function(x)
data.frame(ID=c(x[1]:x[2]),Name=x[3], row.names = NULL)))
then you query it with your first data.frame (A):
A$Name <- lu[A$ID,"Name"]
You can try this data.table solution:
data.table::setDT(B)[, .(Name, ID = Map(`:`, ID_lower, ID_upper))]
[, .(ID = unlist(ID)), .(Name)][ID %in% A$ID]
Name ID
1: James 3
2: Arthur 5
3: Sarah 12
4: Jacob 8
I believe findInterval() on ID_lower might be the ideal approach here:
A[,Name:=B[findInterval(ID,ID_lower),Name]];
A;
## ID Name
## 1: 3 James
## 2: 5 Arthur
## 3: 12 Sarah
## 4: 8 Jacob
This will only be correct if (1) B is sorted by ID_lower and (2) all values in A$ID are covered by the ranges in B.

R count number of Team members based on Team name

I have a df where each row represents an individual and each column a characteristic of these individuals. One of the columns is TeamName, which is the name of the Team that individual belongs to. Multiple individuals belong to a Team.
I'd like a function in R that creates a new column with the number of team members for each Team.
So, for example I have:
df
Name Surname TeamName
John Smith Champions
Mary Osborne Socceroos
Mark Johnson Champions
Rory Bradon Champions
Jane Bryant Socceroos
Bruce Harper
I'd like to have
df1
Name Surname TeamName TeamNo
John Smith Champions 3
Mary Osborne Socceroos 2
Mark Johnson Champions 3
Rory Bradon Champions 3
Jane Bryant Socceroos 2
Bruce Harper 0
So as you can see the counting includes that individual too, and if someone (e.g. Bruce Harper) has no Team name, then he gets a 0.
How can I do that? Thanks!
This is a solution based on using data.table which perhaps is too much for what you need, but here it goes:
library(data.table)
dt=data.table(df)
# First, let's convert the factors of TeamName, to characters
dt[,TeamName:=as.character(TeamName)]
# Now, let find all the team numbers
dt[,TeamNo:=.N, by='TeamName']
# Let's exclude the special cases
dt[is.na(TeamName),TeamNo:=NA]
dt[TeamName=="",TeamNo:=NA]
It is clearly not the best solution, but I hope this helps
If you need to know the number of unique members in the first two columns based on the 'TeamName' column, one option is n_distinct from dplyr
library(dplyr)
library(tidyr)
df %>%
unite(Var, Name, Surname) %>% #paste the columns together
group_by(TeamName) %>% #group by TeamName
mutate(TeamNo= n_distinct(Var)) %>% #create the TeamNo column
separate(Var, into=c('Name', 'Surname')) #split the 'Var' column
Or if it just the number of rows per 'TeamName', we can group by 'TeamName', get the number of rows per group with n(), create the 'TeamNo' column with mutate based on that n(), and if needed an ifelse condition can be used to give NA for 'TeamName' that are '' or NA.
df %>%
group_by(TeamName) %>%
mutate(TeamNo = ifelse(is.na(TeamName)|TeamName=='', NA_integer_, n()))
# Name Surname TeamName TeamNo
#1 John Smith Champions 3
#2 Mary Osborne Socceroos 2
#3 Mark Johnson Champions 3
#4 Rory Bradon Champions 3
#5 Jane Bryant Socceroos 2
#6 Bruce Harper NA
Or you can use ave from base R. Suppose if there are '' and NA, I would first convert the '' to NA and then use ave to get the length of 'TeamNo' grouped by that column. It will give NA for `NA' values. For example.
v1 <- c(df$TeamName, NA)# appending an NA with the example to show the case
is.na(v1) <- v1=='' #convert the `'' to `NA`
as.numeric(ave(v1, v1, FUN=length))
#[1] 3 2 3 3 2 NA NA
Using sqldf:
library(sqldf)
sqldf("SELECT Name, Surname, TeamName, n
FROM df
LEFT JOIN
(SELECT TeamName, COUNT(Name) AS n
FROM df
WHERE NOT TeamName IS '' GROUP BY TeamName)
USING (TeamName)")
Output:
Name Surname TeamName n
1 John Smith Champions 3
2 Mary Osborne Socceroos 2
3 Mark Johnson Champions 3
4 Rory Bradon Champions 3
5 Jane Bryant Socceroos 2
6 Bruce Harper NA

Resources