Regrouping data based on an indicator value in R

I have a dataframe with two columns as shown below,
Name               Indicator
DeAngelo Williams      1
Marcus Brown           1
Elaine Nelson          2
Steve Olson            3
Jennifer Carter        1
Michael Johnson        2
Angela Brawley         3
Dax Shepard            4
What I am trying to do is combine the names starting from each row where the Indicator column is 1 up to (but not including) the next row with a 1; the final output should look like this:
Name
-------
DeAngelo Williams
Marcus Brown, Elaine Nelson, Steve Olson
Jennifer Carter, Michael Johnson, Angela Brawley, Dax Shepard
I am unable to think of a solution for this, so any assistance in accomplishing it is much appreciated.

We can use aggregate from base R to do this. As @thelatemail mentioned in the comments, create a group with the cumulative sum of the logical vector Indicator==1; then, using the formula method, paste the elements of 'Name' together.
aggregate(Name~cbind(Group=cumsum(Indicator==1)), df1, FUN=toString)[2]
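For reference, a minimal reproducible sketch (assuming the data frame is named df1, as in the answer):
# reconstruct the question's data
df1 <- data.frame(
  Name = c("DeAngelo Williams", "Marcus Brown", "Elaine Nelson", "Steve Olson",
           "Jennifer Carter", "Michael Johnson", "Angela Brawley", "Dax Shepard"),
  Indicator = c(1, 1, 2, 3, 1, 2, 3, 4),
  stringsAsFactors = FALSE)
# cumsum(Indicator == 1) gives 1 2 2 2 3 3 3 3: a new group starts at every 1
aggregate(Name ~ cbind(Group = cumsum(Indicator == 1)), df1, FUN = toString)[2]
#                                                            Name
#1                                              DeAngelo Williams
#2                       Marcus Brown, Elaine Nelson, Steve Olson
#3 Jennifer Carter, Michael Johnson, Angela Brawley, Dax Shepard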

Related

apply.weekly for a non-unique date column?

I currently have the data.table below, with Name and Id repeating each day.
Date        Name         Id          Widgets
2016-12-31  Bob Jones    0052A00001  5
2016-12-31  James Smith  0052A00002  25
2016-12-31  Tom Wilson   0052A00003  29
...
2016-01-31  Bob Jones    0052A00001  8
2016-01-31  James Smith  0052A00002  18
2016-01-31  Tom Wilson   0052A00003  20
Is it possible to apply the xts function apply.weekly to this, given that the dates are not unique? If not, what is the easiest way to aggregate this by week (or by a period of another length, say 4 days) and create groupings according to that?
You can create a grouping table first and then join the week onto your data. You can play around with cut to get your desired grouping.
grpWeek <- data.table(Date=seq.Date(as.Date("2016-01-01"), as.Date("2016-12-31"), by="1 day"))[,
  list(Date,
       DT_Week=week(Date),
       Week_Num=format(Date, "%W"),
       User_Week=cut(Date, breaks=52, labels=paste0("Week",1:52)))]
dt <- fread("Date,Name,Id,Widgets
2016-12-31,Bob Jones,0052A00001,5
2016-12-31,James Smith,0052A00002,25
2016-12-31,Tom Wilson,0052A00003,29
2016-01-31,Bob Jones,0052A00001,8
2016-01-31,James Smith,0052A00002,18
2016-01-31,Tom Wilson,0052A00003,20")
dt[,Date:=as.Date(Date)]
grpWeek[dt, on="Date"]
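From there, a weekly roll-up is a single grouping step. A minimal sketch, assuming the goal is total widgets per person per week (DT_Week is one of the week columns built above):
res <- grpWeek[dt, on="Date"]  # join the week labels onto the data
res[, .(TotalWidgets = sum(Widgets)), by = .(Name, DT_Week)]  # weekly totals per person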

Complex merging in R with duplicate matching values in y set producing problems

So I'm trying to merge two dataframes. Dataframe x looks something like:
Name   ParentID
Steve  1
Kevin  1
Stacy  1
Paula  4
Evan   7
Dataframe y looks like:
ParentID  OtherStuff
1         things
2         stuff
3         item
4         ideas
5         short
6         help
7         me
The dataframe I want would look like:
Name   ParentID  OtherStuff
Steve  1         things
Kevin  1         things
Stacy  1         things
Paula  4         ideas
Evan   7         me
Using a left merge gives me substantially more observations than I want, with many duplicates. Any idea how to perform the merge so that rows of y are reused where appropriate to match x?
I'm working with databases set up similarly to the example. x has 5013 observations, while y has 6432. Using the merge function as described by Joel and thelatemail gives me 1627727 observations.
We can use match from base R:
df1$OtherStuff <- with(df1, df2$OtherStuff[match(ParentID, df2$ParentID)])
df1
# Name ParentID OtherStuff
#1 Steve 1 things
#2 Kevin 1 things
#3 Stacy 1 things
#4 Paula 4 ideas
#5 Evan 7 me
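If you prefer merge, the explosion of rows means ParentID is duplicated within y; de-duplicating y on the key first avoids it. A sketch, assuming each ParentID should carry a single OtherStuff value and df2 is the y table (run against the original df1, before the column was added above):
# keep one row per ParentID in y, then left-join
merge(df1, df2[!duplicated(df2$ParentID), ], by = "ParentID", all.x = TRUE)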

Find Duplicates in R based on multiple characters

I can't seem to remember how to code this properly in R:
I want to remove duplicates within a csv file based on multiple entries, first name and last name, which are stored in separate columns.
I can code file[!duplicated(file$First.Name),]
but that only looks at the first name; I want it to look at the last name simultaneously.
If this is my starting file:
Steve Jones
Eric Brown
Sally Edwards
Steve Jones
Eric Davis
I want the output to be
Steve Jones
Eric Brown
Sally Edwards
Eric Davis
Rows should be removed only when both the first and last name match.
You can use
file[!duplicated(file[c("First.Name", "Last.Name")]), ]
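If dplyr is already in your workflow, distinct is an equivalent alternative (not something the question requires):
library(dplyr)
# keep the first occurrence of each First.Name/Last.Name pair, retaining the other columns
distinct(file, First.Name, Last.Name, .keep_all = TRUE)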
Here is a solution using data.table that may perform better on large data (assuming first name and last name are stored in separate columns):
> df <- read.table(text = 'Steve Jones
+ Eric Brown
+ Sally Edwards
+ Steve Jones
+ Eric Davis')
> colnames(df) <- c("First.Name","Last.Name")
> df
First.Name Last.Name
1 Steve Jones
2 Eric Brown
3 Sally Edwards
4 Steve Jones
5 Eric Davis
Here is where the data.table-specific code begins:
> dt <- setDT(df)
> unique(dt,by=c('First.Name','Last.Name'))
First.Name Last.Name
1: Steve Jones
2: Eric Brown
3: Sally Edwards
4: Eric Davis
If the names are in a single column, use sub to remove the first name and the space that follows it, build a logical vector with !duplicated(..) on the result, and use it to subset the rows of the dataset.
df1[!duplicated(sub("\\w+\\s+", "", df1$Col1)),,drop=FALSE]
# Col1
#1 Steve Jones
#2 Eric Brown
#3 Sally Edwards
#5 Eric Davis
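To see what the pattern is doing, inspect the intermediate result (a quick check against the same df1):
sub("\\w+\\s+", "", df1$Col1)  # strips the first word and the space after it
#[1] "Jones"   "Brown"   "Edwards" "Jones"   "Davis"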
If it is based on two columns and the dataset has only those two columns, just apply duplicated directly to the dataset to get the logical vector, negate it, and subset the rows.
df1[!duplicated(df1), , drop=FALSE]
# first.name second.name
#1 Steve Jones
#2 Eric Brown
#3 Sally Edwards
#5 Eric Davis
try:
File[!duplicated(paste(File$First.Name, File$Last.Name)), ]

R count number of Team members based on Team name

I have a df where each row represents an individual and each column a characteristic of these individuals. One of the columns is TeamName, which is the name of the Team that individual belongs to. Multiple individuals belong to a Team.
I'd like a function in R that creates a new column with the number of team members for each Team.
So, for example I have:
df
Name   Surname  TeamName
John   Smith    Champions
Mary   Osborne  Socceroos
Mark   Johnson  Champions
Rory   Bradon   Champions
Jane   Bryant   Socceroos
Bruce  Harper
I'd like to have
df1
Name   Surname  TeamName   TeamNo
John   Smith    Champions  3
Mary   Osborne  Socceroos  2
Mark   Johnson  Champions  3
Rory   Bradon   Champions  3
Jane   Bryant   Socceroos  2
Bruce  Harper              0
So as you can see the counting includes that individual too, and if someone (e.g. Bruce Harper) has no Team name, then he gets a 0.
How can I do that? Thanks!
This is a solution based on data.table, which is perhaps more than you need, but here it goes:
library(data.table)
dt=data.table(df)
# First, convert the TeamName factor to character
dt[,TeamName:=as.character(TeamName)]
# Now, count the members per team
dt[,TeamNo:=.N, by='TeamName']
# Let's exclude the special cases
dt[is.na(TeamName),TeamNo:=NA]
dt[TeamName=="",TeamNo:=NA]
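If you want the literal 0 from the desired output rather than NA, one more line does it (my tweak, not part of the original answer):
dt[is.na(TeamName) | TeamName=="", TeamNo:=0L]  # 0L keeps the column integer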
It is clearly not the best solution, but I hope this helps
If you need to know the number of unique members in the first two columns based on the 'TeamName' column, one option is n_distinct from dplyr
library(dplyr)
library(tidyr)
df %>%
  unite(Var, Name, Surname) %>%               # paste the columns together
  group_by(TeamName) %>%                      # group by TeamName
  mutate(TeamNo = n_distinct(Var)) %>%        # create the TeamNo column
  separate(Var, into=c('Name', 'Surname'))    # split the 'Var' column
Or, if it is just the number of rows per 'TeamName', we can group by 'TeamName' and create the 'TeamNo' column with mutate based on n(); if needed, an ifelse condition can return NA for 'TeamName' values that are '' or NA.
df %>%
  group_by(TeamName) %>%
  mutate(TeamNo = ifelse(is.na(TeamName)|TeamName=='', NA_integer_, n()))
# Name Surname TeamName TeamNo
#1 John Smith Champions 3
#2 Mary Osborne Socceroos 2
#3 Mark Johnson Champions 3
#4 Rory Bradon Champions 3
#5 Jane Bryant Socceroos 2
#6 Bruce Harper NA
Or you can use ave from base R. Suppose there are both '' and NA values: first convert '' to NA, then use ave to get the group length, grouped by that column; it gives NA for the NA values. For example:
v1 <- c(df$TeamName, NA) # appending an NA to the example to show that case
is.na(v1) <- v1=='' # convert '' to NA
as.numeric(ave(v1, v1, FUN=length))
#[1] 3 2 3 3 2 NA NA
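To attach the counts back onto the data frame with the same trick (a sketch, without the extra NA appended above):
v0 <- df$TeamName
is.na(v0) <- v0 == ''                             # convert '' to NA
df$TeamNo <- as.numeric(ave(v0, v0, FUN=length))  # group sizes; NA stays NA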
Using sqldf:
library(sqldf)
sqldf("SELECT Name, Surname, TeamName, n
FROM df
LEFT JOIN
(SELECT TeamName, COUNT(Name) AS n
FROM df
WHERE NOT TeamName IS '' GROUP BY TeamName)
USING (TeamName)")
Output:
Name Surname TeamName n
1 John Smith Champions 3
2 Mary Osborne Socceroos 2
3 Mark Johnson Champions 3
4 Rory Bradon Champions 3
5 Jane Bryant Socceroos 2
6 Bruce Harper NA

Locate and merge duplicate rows in a data.frame but ignore column order

I have a data.frame with 1,000 rows and 3 columns. It contains a large number of duplicates and I've used plyr to combine the duplicate rows and add a count for each combination as explained in this thread.
Here's an example of what I have now (I still also have the original data.frame with all of the duplicates if I need to start from there):
  name1  name2  name3  total
1 Bob    Fred   Sam    30
2 Bob    Joe    Frank  20
3 Frank  Sam    Tom    25
4 Sam    Tom    Frank  10
5 Fred   Bob    Sam    15
However, column order doesn't matter. I just want to know how many rows have the same three entries, in any order. How can I combine the rows that contain the same entries, ignoring order? In this example I would want to combine rows 1 and 5, and rows 3 and 4.
Define another column that's a "sorted paste" of the names, which would have the same value of "Bob~Fred~Sam" for rows 1 and 5. Then aggregate based on that.
Brief code snippet (assumes the original data frame is dd); it's all quite intuitive: we create a lookup column (take a look; it should be self-explanatory), sum the total column for each combination, and then filter down to the unique combinations...
dd$lookup = apply(dd[,c("name1","name2","name3")], 1,
                  function(x){paste(sort(x), collapse="~")})
tab1 = tapply(dd$total, dd$lookup, sum)
ee = dd[match(unique(dd$lookup), dd$lookup),]
ee$newtotal = as.numeric(tab1)[match(ee$lookup, names(tab1))]
You now have in ee a set of unique rows and their corresponding total counts. Easy - and no external packages needed. And crucially, you can see at every stage of the process what is going on!
(Minor update to help OP:) And if you want a cleaned-up version of the final answer:
outdf = with(ee, data.frame(name1, name2, name3,
                            total=newtotal, stringsAsFactors=FALSE))
This gives you a neat data frame with the three all-important name columns, and with the aggregated totals in a column called total rather than newtotal.
Sort the index columns, then use ddply to aggregate and sum:
Define the data:
dat <- " name1 name2 name3 total
1 Bob Fred Sam 30
2 Bob Joe Frank 20
3 Frank Sam Tom 25
4 Sam Tom Frank 10
5 Fred Bob Sam 15"
x <- read.table(text=dat, header=TRUE)
Create a copy:
xx <- x
Use apply to sort the columns, then aggregate:
xx[, -4] <- t(apply(xx[, -4], 1, sort))
library(plyr)
ddply(xx, .(name1, name2, name3), numcolwise(sum))
name1 name2 name3 total
1 Bob Frank Joe 20
2 Bob Fred Sam 45
3 Frank Sam Tom 35
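The same "sort the key columns, then aggregate" idea also works in base R without plyr; a sketch using aggregate on the sorted copy xx:
# groups on the sorted name columns and sums total, matching the ddply result
aggregate(total ~ name1 + name2 + name3, data = xx, FUN = sum)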
