Replace NA with minimum group value in R

I'm struggling with transforming my data and would appreciate some help.
year name  start
2010 Emma  1998
2011 Emma  1998
2012 Emma  1998
2009 John  na
2010 John  na
2012 John  na
2007 Louis na
2012 Louis na
The aim is to replace all NAs in start with the minimum value of year within each name group, so that the data looks like this:
year name  start
2010 Emma  1998
2011 Emma  1998
2012 Emma  1998
2009 John  2009
2010 John  2009
2012 John  2009
2007 Louis 2007
2012 Louis 2007
Note: within each name group, either all start values are NA or none are.
I tried to use
mydf %>% group_by(name) %>% mutate(start= ifelse(is.na(start), min(year, na.rm = T), start))
but got this error:
x `start` must return compatible vectors across groups
There are a lot of similar questions here.
Some people used the ave function or worked with data.table, neither of which seems to fit my problem.
My base approach must be something like
df$A <- ifelse(is.na(df$A), df$B, df$A)
but I can't seem to combine it properly with min() and group_by().
Thank you for any help.

I renamed the column to 'Year' because the original name was causing a collision.
dat %>%
dplyr::group_by(name) %>%
dplyr::mutate(start = dplyr::if_else(start == "na", min(Year), start))
# A tibble: 8 x 3
# Groups: name [3]
Year name start
<chr> <chr> <chr>
1 2010 Emma 1998
2 2011 Emma 1998
3 2012 Emma 1998
4 2009 John 2009
5 2010 John 2009
6 2012 John 2009
7 2007 Louis 2007
8 2012 Louis 2007
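For comparison, the same replacement can be sketched in base R with ave(). This assumes, unlike the character data shown above, that year and start are numeric and that missing values are real NA rather than the string "na"; the helper name group_min_year is my own.

```r
# Example data, assuming numeric columns and real NA values
mydf <- data.frame(
  year  = c(2010, 2011, 2012, 2009, 2010, 2012, 2007, 2012),
  name  = c("Emma", "Emma", "Emma", "John", "John", "John", "Louis", "Louis"),
  start = c(1998, 1998, 1998, NA, NA, NA, NA, NA)
)

# Minimum year within each name group, repeated for every row of the group
group_min_year <- ave(mydf$year, mydf$name, FUN = min)

# Keep existing start values; fill the missing ones with the group minimum
mydf$start <- ifelse(is.na(mydf$start), group_min_year, mydf$start)
mydf
```

Because ave() returns a vector as long as its input, the result lines up row by row with start, so no explicit grouping step is needed.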

We can use na.aggregate
library(dplyr)
library(zoo)
dat %>%
group_by(name) %>%
mutate(start = na.aggregate(na_if(start, "na"), FUN = min))

Related

Transform "start-end" datasets to panel dataset in r [duplicate]

This problem is also known as 'transforming a "start-end" dataset to a panel dataset'
I have a data frame containing "name" of U.S. Presidents, the years when they start and end in office, ("from" and "to" columns). Here is a sample:
name from to
Bill Clinton 1993 2001
George W. Bush 2001 2009
Barack Obama 2009 2012
...and the output from dput:
dput(tail(presidents, 3))
structure(list(name = c("Bill Clinton", "George W. Bush", "Barack Obama"
), from = c(1993, 2001, 2009), to = c(2001, 2009, 2012)), .Names = c("name",
"from", "to"), row.names = 42:44, class = "data.frame")
I want to create a data frame with two columns ("name" and "year"), with a row for each year that a president was in office. Thus, I need to create a regular sequence with each year from "from" to "to". Here's my expected output:
name year
Bill Clinton 1993
Bill Clinton 1994
...
Bill Clinton 2000
Bill Clinton 2001
George W. Bush 2001
George W. Bush 2002
...
George W. Bush 2008
George W. Bush 2009
Barack Obama 2009
Barack Obama 2010
Barack Obama 2011
Barack Obama 2012
I know that I can use data.frame(name = "Bill Clinton", year = seq(1993, 2001)) to expand things for a single president, but I can't figure out how to iterate for each president.
How do I do this? I feel that I should know this, but I'm drawing a blank.
Update 1
OK, I've tried both solutions, and I'm getting an error:
foo<-structure(list(name = c("Grover Cleveland", "Benjamin Harrison", "Grover Cleveland"), from = c(1885, 1889, 1893), to = c(1889, 1893, 1897)), .Names = c("name", "from", "to"), row.names = 22:24, class = "data.frame")
ddply(foo, "name", summarise, year = seq(from, to))
Error in seq.default(from, to) : 'from' must be of length 1
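The error seems to come from Grover Cleveland having two rows: once the data is grouped by name, from and to are length-2 vectors for him, and seq() wants scalars. A minimal base R sketch that expands row by row instead of by name (years and panel are names I picked for illustration):

```r
foo <- data.frame(
  name = c("Grover Cleveland", "Benjamin Harrison", "Grover Cleveland"),
  from = c(1885, 1889, 1893),
  to   = c(1889, 1893, 1897)
)

# One sequence of years per row, so each call to seq() sees scalars
years <- Map(seq, foo$from, foo$to)

# Repeat each name once per year of its own term
panel <- data.frame(
  name = rep(foo$name, lengths(years)),
  year = unlist(years)
)
panel
```

Working per row rather than per name means the two Cleveland terms are expanded separately and never collide.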
Here's a data.table solution. It has the nice (if minor) feature of leaving the presidents in their supplied order:
library(data.table)
dt <- data.table(presidents)
dt[, list(year = seq(from, to)), by = name]
# name year
# 1: Bill Clinton 1993
# 2: Bill Clinton 1994
# ...
# ...
# 21: Barack Obama 2011
# 22: Barack Obama 2012
Edit: To handle presidents with non-consecutive terms, use this instead:
dt[, list(year = seq(from, to)), by = c("name", "from")]
You can use the plyr package:
library(plyr)
ddply(presidents, "name", summarise, year = seq(from, to))
# name year
# 1 Barack Obama 2009
# 2 Barack Obama 2010
# 3 Barack Obama 2011
# 4 Barack Obama 2012
# 5 Bill Clinton 1993
# 6 Bill Clinton 1994
# [...]
If it is important that the data be sorted by year, you can use the arrange function:
df <- ddply(presidents, "name", summarise, year = seq(from, to))
arrange(df, year)
# name year
# 1 Bill Clinton 1993
# 2 Bill Clinton 1994
# 3 Bill Clinton 1995
# [...]
# 21 Barack Obama 2011
# 22 Barack Obama 2012
Edit 1: Following @edgester's "Update 1", a more appropriate approach is to use adply, which accounts for presidents with non-consecutive terms:
adply(foo, 1, summarise, year = seq(from, to))[c("name", "year")]
An alternate tidyverse approach using unnest and map2. However many data columns you have (such as name), they will all be correctly present in the new data frame.
library(tidyverse)
presidents %>%
mutate(year = map2(from, to, seq)) %>%
unnest(year) %>%
select(-from, -to)
# name year
# 1 Bill Clinton 1993
# 2 Bill Clinton 1994
...
# 21 Barack Obama 2011
# 22 Barack Obama 2012
Before tidyr v1.0.0, one could create variables as part of unnest().
presidents %>%
unnest(year = map2(from, to, seq)) %>%
select(-from, -to)
Here's a dplyr solution:
library(dplyr)
# the data
presidents <-
structure(list(name = c("Bill Clinton", "George W. Bush", "Barack Obama"
), from = c(1993, 2001, 2009), to = c(2001, 2009, 2012)), .Names = c("name",
"from", "to"), row.names = 42:44, class = "data.frame")
# the expansion of the table
presidents %>%
rowwise() %>%
do(data.frame(name = .$name, year = seq(.$from, .$to, by = 1)))
# the output
Source: local data frame [22 x 2]
Groups: <by row>
name year
(chr) (dbl)
1 Bill Clinton 1993
2 Bill Clinton 1994
3 Bill Clinton 1995
4 Bill Clinton 1996
5 Bill Clinton 1997
6 Bill Clinton 1998
7 Bill Clinton 1999
8 Bill Clinton 2000
9 Bill Clinton 2001
10 George W. Bush 2001
.. ... ...
h/t: https://stackoverflow.com/a/24804470/1036500
Two base solutions.
Using sequence:
len = d$to - d$from + 1
data.frame(name = d$name[rep(1:nrow(d), len)], year = sequence(len, d$from))
Using mapply:
l <- mapply(`:`, d$from, d$to)
data.frame(name = d$name[rep(1:nrow(d), lengths(l))], year = unlist(l))
# name year
# 1 Bill Clinton 1993
# 2 Bill Clinton 1994
# ...snip
# 8 Bill Clinton 2000
# 9 Bill Clinton 2001
# 10 George W. Bush 2001
# 11 George W. Bush 2002
# ...snip
# 17 George W. Bush 2008
# 18 George W. Bush 2009
# 19 Barack Obama 2009
# 20 Barack Obama 2010
# 21 Barack Obama 2011
# 22 Barack Obama 2012
As noted by @Esteis in a comment, there may well be several columns that need to be repeated following the expansion of the ranges (not only 'name', as in the OP). In that case, instead of repeating the values of a single column, simply repeat the rows of the entire data frame, except for the 'from' and 'to' columns. A simple example:
d = data.frame(x = 1:2, y = 3:4, names = c("a", "b"),
from = c(2001, 2011), to = c(2003, 2012))
# x y names from to
# 1 1 3 a 2001 2003
# 2 2 4 b 2011 2012
len = d$to - d$from + 1
cbind(d[rep(1:nrow(d), len), setdiff(names(d), c("from", "to"))],
year = sequence(len, d$from))
x y names year
1 1 3 a 2001
1.1 1 3 a 2002
1.2 1 3 a 2003
2 2 4 b 2011
2.1 2 4 b 2012
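The two-argument form sequence(len, d$from) used above needs, as far as I know, R 4.0.0 or later. A minimal sketch of the same expansion that relies only on the one-argument sequence(), with d redefined here as the three-president sample and idx as a helper name of my own:

```r
d <- data.frame(
  name = c("Bill Clinton", "George W. Bush", "Barack Obama"),
  from = c(1993, 2001, 2009),
  to   = c(2001, 2009, 2012)
)

len <- d$to - d$from + 1           # number of years covered by each row
idx <- rep(seq_len(nrow(d)), len)  # row index repeated once per year

# sequence(len) counts 1..len within each row; shift it by the start year
panel <- data.frame(
  name = d$name[idx],
  year = d$from[idx] + sequence(len) - 1
)
panel
```

The same idx vector can be used to repeat whole rows, d[idx, ], when more columns than name have to be carried along.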
Here is a quick base-R solution, where Df is your data.frame:
do.call(rbind, apply(Df, 1, function(x) {
data.frame(name=x[1], year=seq(x[2], x[3]))}))
It gives some warnings about row names, but appears to return the correct data.frame.
Another option using tidyverse could be to gather data into long format, group_by name and create a sequence between from and to date.
library(tidyverse)
presidents %>%
gather(key, date, -name) %>%
group_by(name) %>%
complete(date = seq(date[1], date[2]))%>%
select(-key)
# A tibble: 22 x 2
# Groups: name [3]
# name date
# <chr> <dbl>
# 1 Barack Obama 2009
# 2 Barack Obama 2010
# 3 Barack Obama 2011
# 4 Barack Obama 2012
# 5 Bill Clinton 1993
# 6 Bill Clinton 1994
# 7 Bill Clinton 1995
# 8 Bill Clinton 1996
# 9 Bill Clinton 1997
#10 Bill Clinton 1998
# … with 12 more rows
Another solution using dplyr and tidyr. It correctly preserves any data columns you have.
library(magrittr) # for pipes
df <- data.frame(
tata = c('toto1', 'toto2'),
from = c(2000, 2004),
to = c(2001, 2009),
measure1 = rnorm(2),
measure2 = 10 * rnorm(2)
)
tata from to measure1 measure2
1 toto1 2000 2001 -0.575 -10.13
2 toto2 2004 2009 -0.258 17.37
df %>%
dplyr::rowwise() %>%
dplyr::mutate(year = list(seq(from, to))) %>%
dplyr::select(-from, -to) %>%
tidyr::unnest(c(year))
# A tibble: 8 x 4
tata measure1 measure2 year
<chr> <dbl> <dbl> <int>
1 toto1 -0.575 -10.1 2000
2 toto1 -0.575 -10.1 2001
3 toto2 -0.258 17.4 2004
4 toto2 -0.258 17.4 2005
5 toto2 -0.258 17.4 2006
6 toto2 -0.258 17.4 2007
7 toto2 -0.258 17.4 2008
8 toto2 -0.258 17.4 2009
Use by to create a by list L of data.frames, one data.frame per president, and then rbind them together. No packages are used.
L <- by(presidents, presidents$name, with, data.frame(name, year = from:to))
do.call("rbind", setNames(L, NULL))
If you don't mind row names then the last line could be reduced to just:
do.call("rbind", L)
An addition to the tidyverse solutions can be:
df %>%
uncount(to - from + 1) %>%
group_by(name) %>%
transmute(year = seq(first(from), first(to)))
name year
<chr> <dbl>
1 Bill Clinton 1993
2 Bill Clinton 1994
3 Bill Clinton 1995
4 Bill Clinton 1996
5 Bill Clinton 1997
6 Bill Clinton 1998
7 Bill Clinton 1999
8 Bill Clinton 2000
9 Bill Clinton 2001
10 George W. Bush 2001

Get the additions and drops for each person by year in R

I'm trying to get the additions and drops for each person by year in R (an example follows). I was trying to write a function that returns what each person adds and drops, as well as the number of adds and drops per person per year. For example, in this sample Mark has Add = 0, Add_act = NA, Drop = 2, Drop_act = c("Basketball", "Volleyball"). My first instinct is to use for loops; is there a more appropriate way to design the algorithm?
Thanks,
Anne
Year Name Activity
2010 Mark Tennis
2010 Mark Swim
2010 Mark Basketball
2010 Mark Volleyball
2010 Tom Swim
2010 Rachale Tennis
2010 Rachale Waterball
2010 Rachale Yoga
2010 Mary Volleyball
2010 Mary Yoga
2010 Kim Waterball
2011 Mark Tennis
2011 Mark Swim
2011 Tom Volleyball
2011 Tom Waterball
2011 Tom Swim
2011 Rachale Tennis
2011 Rachale Waterball
2011 Rachale Yoga
2011 Rachale Swim
2011 Mary Volleyball
2011 Jerry Basketball
The outcome I'm expecting looks like this:
Year Name Add Add_act Drop Drop_act
2010 Mark 4 "Tennis", "Swim", "Basketball", "Volleyball" 0 NA
2010 Tom 1 "Swim" 0 NA
2010 Rachale 3 "Tennis", "Waterball", "Yoga" 0 NA
2010 Mary 2 "Volleyball", "Yoga" 0 NA
2010 Kim 1 "Waterball" 0 NA
2011 Mark 0 NA 1 "Basketball"
2011 Tom 1 "Waterball" 0 NA
2011 Rachale 1 "Swim" 0 NA
2011 Mary 0 NA 1 "Yoga"
2011 Jerry 1 "Basketball" 0 NA
2011 Kim 0 NA 1 "Waterball"
EDITED: Okay, so, you will need to use loops now that I understand your desire to aggregate across the whole dataset. However, you can do so using the *apply functions in R, which will also put your output into a nice list.
We can use the simple function I wrote originally, with a minor modification that adds in the name and year (just for ease of interpreting the output).
The function takes an input data frame, the person you want to check, and the year you are evaluating. It then forms two vectors, one of the current year's activities and one of the previous year's activities. Then, we just use the %in% operator to subset each of those vectors to get additions and subtractions, and find the total using length.
Using expand.grid, we'll get all possible combinations of year and individual in the sample data. Then, using mapply, we can create the outputs from those combinations. The result is a list of lists (which I used because a data frame is not sensible in this context, since the activities added or dropped are of varying length).
I put your data into a text file that I read in using read.csv.
options(stringsAsFactors = FALSE)
df_example <- read.csv(file = "C:/Users/trehman/Desktop/input.txt", header = FALSE)
names(df_example) <- c("Year","Name","Activity")
func_find_changes <- function(data,person,year) {
curryr_acts <- data[data$Name == person
& data$Year == year,"Activity"]
prevyr_acts <- data[data$Name == person
& data$Year == year - 1,"Activity"]
added_acts <- curryr_acts[!(curryr_acts %in% prevyr_acts)]
dropped_acts <- prevyr_acts[!(prevyr_acts %in% curryr_acts)]
n_add <- length(added_acts)
n_drop <- length(dropped_acts)
return(list(Person = person,
Year = year,
Add = n_add,
Add_act = added_acts,
Drop = n_drop,
Drop_act = dropped_acts))
}
# Create all combinations to check
df_nameyears <- expand.grid(unique(df_example$Year),
unique(df_example$Name),
stringsAsFactors = FALSE)
# Use mapply() to get them
lst_changes <- mapply(FUN = func_find_changes,
year = df_nameyears$Var1,
person = df_nameyears$Var2,
MoreArgs = list(data = df_example),
SIMPLIFY = FALSE)
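The %in% subsetting inside func_find_changes can also be expressed with setdiff(), which may be easier to read. A small sketch for one person and one pair of years, using Mark's rows from the question (the object names are my own, not part of the answer above):

```r
# Mark's activities in both years, taken from the question's data
activities <- data.frame(
  Year     = c(2010, 2010, 2010, 2010, 2011, 2011),
  Name     = "Mark",
  Activity = c("Tennis", "Swim", "Basketball", "Volleyball", "Tennis", "Swim"),
  stringsAsFactors = FALSE
)

prev <- activities$Activity[activities$Year == 2010]
curr <- activities$Activity[activities$Year == 2011]

added   <- setdiff(curr, prev)  # in 2011 but not in 2010
dropped <- setdiff(prev, curr)  # in 2010 but not in 2011

list(Add = length(added), Add_act = added,
     Drop = length(dropped), Drop_act = dropped)
```

One difference to keep in mind: setdiff() also removes duplicates, whereas the %in% version keeps an activity twice if it is listed twice for the same person and year.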
I don't have time to go through the final outcome, but this may help you or someone else get started. The idea is to split the dataset by individual, then index the years and find co-occurrence frequencies.
Also, please use dput so we can recreate the data easily. Here's what I did (the code presumably needs magrittr for %>%, dplyr for count(), reshape2 for acast() and plyr for adply() to be attached):
a <- textConnection('Year Name Activity
2010 Mark Tennis
2010 Mark Swim
2010 Mark Basketball
2010 Mark Volleyball
2010 Tom Swim
2010 Rachale Tennis
2010 Rachale Waterball
2010 Rachale Yoga
2010 Mary Volleyball
2010 Mary Yoga
2011 Mark Tennis
2011 Mark Swim
2011 Tom Volleyball
2011 Tom Waterball
2011 Tom Swim
2011 Rachale Tennis
2011 Rachale Waterball
2011 Rachale Yoga
2011 Rachale Swim
2011 Mary Volleyball')%>% read.table %>% {
colnames(.) <- as.character(.[1,])
.[-1,]
}
lapply(split(a, a$Name), function(i){
counts <- count(i, Year)
n_change <- as.numeric(counts[nrow(counts),2] - counts[1,2])
if(n_change < 0){
add <- 0
drop <- n_change * -1
}else {
add <- n_change
drop <- 0
}
check_act <- acast(i, Activity ~ Year, value.var = "Year")
list(add = add, drop = drop, adply(check_act, 2, is.na))
})
# $Mark
# $Mark$add
# [1] 0
#
# $Mark$drop
# [1] 2
#
# $Mark[[3]]
# X1 Basketball Swim Tennis Volleyball
# 1 2010 FALSE FALSE FALSE FALSE
# 2 2011 TRUE FALSE FALSE TRUE
#
#
# $Mary
# $Mary$add
# [1] 0
#
# $Mary$drop
# [1] 1
#
# $Mary[[3]]
# X1 Volleyball Yoga
# 1 2010 FALSE FALSE
# 2 2011 FALSE TRUE
#
#
# $Rachale
# $Rachale$add
# [1] 1
#
# $Rachale$drop
# [1] 0
#
# $Rachale[[3]]
# X1 Swim Tennis Waterball Yoga
# 1 2010 TRUE FALSE FALSE FALSE
# 2 2011 FALSE FALSE FALSE FALSE
#
#
# $Tom
# $Tom$add
# [1] 2
#
# $Tom$drop
# [1] 0
#
# $Tom[[3]]
# X1 Swim Volleyball Waterball
# 1 2010 FALSE TRUE TRUE
# 2 2011 FALSE FALSE FALSE
#
#
You can easily use data.table to find the change in activities over the years grouped by individual:
DF <- structure(
list(Year = c(2010, 2010, 2010, 2010, 2010, 2010, 2010,
2010, 2010, 2010, 2011, 2011, 2011, 2011, 2011, 2011, 2011, 2011,
2011, 2011),
Name = c("Mark", "Mark", "Mark", "Mark", "Tom",
"Rachale", "Rachale", "Rachale", "Mary", "Mary", "Mark", "Mark",
"Tom", "Tom", "Tom", "Rachale", "Rachale", "Rachale", "Rachale",
"Mary"),
Activity = c("Tennis", "Swim", "Basketball", "Volleyball",
"Swim", "Tennis", "Waterball", "Yoga", "Volleyball", "Yoga",
"Tennis", "Swim", "Volleyball", "Waterball", "Swim", "Tennis",
"Waterball", "Yoga", "Swim", "Volleyball")),
.Names = c("Year", "Name", "Activity"),
row.names = c(NA, 20L), class = "data.frame")
library(data.table)
DT <- data.table(DF)
yearly_count <- DT[, .N, by = c('Name', 'Year')]
print(yearly_count)
change <- yearly_count[, list(change = diff(N)), by = Name]
print(change)
which results in the following output:
> print(yearly_count)
Name Year N
1: Mark 2010 4
2: Tom 2010 1
3: Rachale 2010 3
4: Mary 2010 2
5: Mark 2011 2
6: Tom 2011 3
7: Rachale 2011 4
8: Mary 2011 1
> print(change)
Name change
1: Mark -2
2: Tom 2
3: Rachale 1
4: Mary -1
You only have 2 years in your data so there is only a single value representing the change from 2010 to 2011. Mark dropped 2 activities, Tom added 2, etc.
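Note that diff(N) is a net change: one addition and one drop in the same year cancel each other out. A base R sketch with a hypothetical person, Alex (not part of the original data), who swaps one activity for another:

```r
# Hypothetical data: Alex replaces Yoga with Swim between 2010 and 2011
swap <- data.frame(
  Year     = c(2010, 2010, 2011, 2011),
  Name     = "Alex",
  Activity = c("Tennis", "Yoga", "Tennis", "Swim"),
  stringsAsFactors = FALSE
)

# Net change in the number of activities, as in the count-based approach
counts <- table(swap$Year)
net_change <- diff(as.vector(counts))

# A set-based comparison keeps the addition and the drop apart
prev <- swap$Activity[swap$Year == 2010]
curr <- swap$Activity[swap$Year == 2011]
added   <- setdiff(curr, prev)
dropped <- setdiff(prev, curr)
```

Here net_change is zero even though one activity was added and one was dropped, so the counts alone cannot answer the original question about which activities changed.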
The code below adds a value column that is 1 when a Name participates in an activity and 0 if not, then adds a Status column to mark changes in Activity. To capture the detailed structure of time-varying participation in each Activity, the Status column is built with a long series of ifelse conditions. There's probably a simpler approach.
The value column is needed only to create the Status column and can be removed at the end of the process, but I've left it in for illustration.
The Did not participate value of Status marks cases where a person did not participate in the previous year and continued to not participate in the current year, while Dropped indicates a change from participation to non-participation.
library(tidyverse)
dat = dat %>%
mutate(value=1) %>%
complete(Activity, nesting(Year, Name), fill=list(value=0)) %>%
arrange(Name, Activity, Year) %>%
group_by(Name, Activity) %>%
mutate(Status = ifelse(!lag(value) %in% 1 & value==1, "Added",
ifelse((!lag(value) %in% 0:1 & value==0) | all(value==0), "Did not participate",
ifelse(lag(value)==1 & value==1, "Continued",
ifelse(lag(value)==1 & value==0, "Dropped", NA_character_)))))
Activity Year Name value Status
1 Basketball 2010 Mark 1 Added
2 Basketball 2011 Mark 0 Dropped
3 Swim 2010 Mark 1 Added
4 Swim 2011 Mark 1 Continued
5 Tennis 2010 Mark 1 Added
6 Tennis 2011 Mark 1 Continued
7 Volleyball 2010 Mark 1 Added
8 Volleyball 2011 Mark 0 Dropped
9 Waterball 2010 Mark 0 Did not participate
10 Waterball 2011 Mark 0 Did not participate
# ... with 38 more rows
Summarize activities by person by year:
dat %>%
group_by(Year, Name, Status) %>%
tally %>%
ungroup %>%
complete(Status, nesting(Year, Name), fill=list(n=0)) %>%
spread(Status, n) %>%
arrange(Name, Year)
Year Name Added Continued `Did not participate` Dropped
1 2010 Mark 4 0 2 0
2 2011 Mark 0 2 2 2
3 2010 Mary 2 0 4 0
4 2011 Mary 0 1 4 1
5 2010 Rachale 3 0 3 0
6 2011 Rachale 1 3 2 0
7 2010 Tom 1 0 5 0
8 2011 Tom 2 1 3 0

Using R to remove duplicate rows from a data frame based on certain conditions [duplicate]

I'm working on a project where I need to sort data based on how people vote. I cannot find a function where I can delete duplicate rows based on certain conditions being met.
I'm looking for a function that will remove duplicate rows based on one column having duplicate values and another column meeting certain conditions.
For example in the table below I would like to remove voters who voted in three different elections. Paul needs to be removed from this data frame.
df <- data.frame(Name=c("Paul","Paul","Mary","Bill","Jane","Paul","Mary","John",
"Bill","John"),ElectionDay=c("November 2010","November 2014",
"November 2010","November 2010","November 2014","November 2006",
"November 2014","November 2010","November 2014","November 2014"))
df
# Name ElectionDay
# 1 Paul November 2010
# 2 Paul November 2014
# 3 Mary November 2010
# 4 Bill November 2010
# 5 Jane November 2014
# 6 Paul November 2006
# 7 Mary November 2014
# 8 John November 2010
# 9 Bill November 2014
# 10 John November 2014
Below is an example of the result I am looking for:
Name ElectionDay
1 Mary November 2010
2 Bill November 2010
3 Jane November 2014
4 Mary November 2014
5 John November 2010
6 Bill November 2014
7 John November 2014
We can use data.table. We convert the 'data.frame' to 'data.table' (setDT(df)), grouped by 'Name', we get the length of unique 'ElectionDay' (uniqueN(ElectionDay)). If the length is less than 3, we get the Subset of Data.Table (.SD).
library(data.table)#v1.9.6+
setDT(df)[, if(uniqueN(ElectionDay) < 3) .SD, by = Name]
A similar base R option would be using ave. We get the length of unique elements of 'ElectionDay' grouped by 'Name' and check whether it is less than 3 to get a logical index. The index can be used to subset the rows of dataset.
df[with(df, ave(as.character(ElectionDay), Name,
FUN=function(x) length(unique(x)))) < 3,]
# Name ElectionDay
#3 Mary November 2010
#4 Bill November 2010
#5 Jane November 2014
#7 Mary November 2014
#8 John November 2010
#9 Bill November 2014
#10 John November 2014
The names that occur in more than 2 rows are calculated as
names(which(table(df$Name) > 2))
#[1] "Paul"
So what you need is
df[!(df$Name %in% names(which(table(df$Name) > 2))), ]
# Name ElectionDay
#3 Mary November 2010
#4 Bill November 2010
#5 Jane November 2014
#7 Mary November 2014
#8 John November 2010
#9 Bill November 2014
#10 John November 2014
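Note that table(df$Name) counts rows per name, which matches the number of elections only if nobody appears twice for the same election. A base R sketch that counts distinct ElectionDay values per voter instead (n_elections, keep and result are names I made up):

```r
df <- data.frame(
  Name = c("Paul", "Paul", "Mary", "Bill", "Jane", "Paul", "Mary", "John",
           "Bill", "John"),
  ElectionDay = c("November 2010", "November 2014", "November 2010",
                  "November 2010", "November 2014", "November 2006",
                  "November 2014", "November 2010", "November 2014",
                  "November 2014"),
  stringsAsFactors = FALSE
)

# Number of distinct elections per voter
n_elections <- tapply(df$ElectionDay, df$Name, function(x) length(unique(x)))

# Keep only voters who took part in fewer than three distinct elections
keep <- names(n_elections)[n_elections < 3]
result <- df[df$Name %in% keep, ]
result
```

On the sample data both approaches drop only Paul; they would differ if a voter had repeated rows for one election.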
Or you can use dplyr, counting the number of elections in which each person voted and then removing the rows for which the count is 3:
library(dplyr)
df %>%
group_by(Name) %>%
mutate(NumberElections = length(unique(ElectionDay))) %>%
ungroup() %>%
filter(NumberElections != 3)

Expand ranges defined by "from" and "to" columns

This problem is also known as 'transforming a "start-end" dataset to a panel dataset'
I have a data frame containing "name" of U.S. Presidents, the years when they start and end in office, ("from" and "to" columns). Here is a sample:
name from to
Bill Clinton 1993 2001
George W. Bush 2001 2009
Barack Obama 2009 2012
...and the output from dput:
dput(tail(presidents, 3))
structure(list(name = c("Bill Clinton", "George W. Bush", "Barack Obama"
), from = c(1993, 2001, 2009), to = c(2001, 2009, 2012)), .Names = c("name",
"from", "to"), row.names = 42:44, class = "data.frame")
I want to create data frame with two columns ("name" and "year"), with a row for each year that a president was in office. Thus, I need to create a regular sequence with each year from "from", to "to". Here's my expected out:
name year
Bill Clinton 1993
Bill Clinton 1994
...
Bill Clinton 2000
Bill Clinton 2001
George W. Bush 2001
George W. Bush 2002
...
George W. Bush 2008
George W. Bush 2009
Barack Obama 2009
Barack Obama 2010
Barack Obama 2011
Barack Obama 2012
I know that I can use data.frame(name = "Bill Clinton", year = seq(1993, 2001)) to expand things for a single president, but I can't figure out how to iterate for each president.
How do I do this? I feel that I should know this, but I'm drawing a blank.
Update 1
OK, I've tried both solutions, and I'm getting an error:
foo<-structure(list(name = c("Grover Cleveland", "Benjamin Harrison", "Grover Cleveland"), from = c(1885, 1889, 1893), to = c(1889, 1893, 1897)), .Names = c("name", "from", "to"), row.names = 22:24, class = "data.frame")
ddply(foo, "name", summarise, year = seq(from, to))
Error in seq.default(from, to) : 'from' must be of length 1
Here's a data.table solution. It has the nice (if minor) feature of leaving the presidents in their supplied order:
library(data.table)
dt <- data.table(presidents)
dt[, list(year = seq(from, to)), by = name]
# name year
# 1: Bill Clinton 1993
# 2: Bill Clinton 1994
# ...
# ...
# 21: Barack Obama 2011
# 22: Barack Obama 2012
Edit: To handle presidents with non-consecutive terms, use this instead:
dt[, list(year = seq(from, to)), by = c("name", "from")]
You can use the plyr package:
library(plyr)
ddply(presidents, "name", summarise, year = seq(from, to))
# name year
# 1 Barack Obama 2009
# 2 Barack Obama 2010
# 3 Barack Obama 2011
# 4 Barack Obama 2012
# 5 Bill Clinton 1993
# 6 Bill Clinton 1994
# [...]
and if it is important that the data be sorted by year, you can use the arrange function:
df <- ddply(presidents, "name", summarise, year = seq(from, to))
arrange(df, df$year)
# name year
# 1 Bill Clinton 1993
# 2 Bill Clinton 1994
# 3 Bill Clinton 1995
# [...]
# 21 Barack Obama 2011
# 22 Barack Obama 2012
Edit 1: Following's #edgester's "Update 1", a more appropriate approach is to use adply to account for presidents with non-consecutive terms:
adply(foo, 1, summarise, year = seq(from, to))[c("name", "year")]
An alternate tidyverse approach using unnest and map2. However many data columns you have (such as name), they will all be correctly present in the new data frame.
library(tidyverse)
presidents %>%
mutate(year = map2(from, to, seq)) %>%
unnest(year) %>%
select(-from, -to)
# name year
# 1 Bill Clinton 1993
# 2 Bill Clinton 1994
...
# 21 Barack Obama 2011
# 22 Barack Obama 2012
Before tidyr v1.0.0, one could create variables as part of unnest().
presidents %>%
unnest(year = map2(from, to, seq)) %>%
select(-from, -to)
Here's a dplyr solution:
library(dplyr)
# the data
presidents <-
structure(list(name = c("Bill Clinton", "George W. Bush", "Barack Obama"
), from = c(1993, 2001, 2009), to = c(2001, 2009, 2012)), .Names = c("name",
"from", "to"), row.names = 42:44, class = "data.frame")
# the expansion of the table
presidents %>%
rowwise() %>%
do(data.frame(name = .$name, year = seq(.$from, .$to, by = 1)))
# the output
Source: local data frame [22 x 2]
Groups: <by row>
name year
(chr) (dbl)
1 Bill Clinton 1993
2 Bill Clinton 1994
3 Bill Clinton 1995
4 Bill Clinton 1996
5 Bill Clinton 1997
6 Bill Clinton 1998
7 Bill Clinton 1999
8 Bill Clinton 2000
9 Bill Clinton 2001
10 George W. Bush 2001
.. ... ...
h/t: https://stackoverflow.com/a/24804470/1036500
Two base solutions.
Using sequence:
len = d$to - d$from + 1
data.frame(name = d$name[rep(1:nrow(d), len)], year = sequence(len, d$from))
Using mapply:
l <- mapply(`:`, d$from, d$to)
data.frame(name = d$name[rep(1:nrow(d), lengths(l))], year = unlist(l))
# name year
# 1 Bill Clinton 1993
# 2 Bill Clinton 1994
# ...snip
# 8 Bill Clinton 2000
# 9 Bill Clinton 2001
# 10 George W. Bush 2001
# 11 George W. Bush 2002
# ...snip
# 17 George W. Bush 2008
# 18 George W. Bush 2009
# 19 Barack Obama 2009
# 20 Barack Obama 2010
# 21 Barack Obama 2011
# 22 Barack Obama 2012
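Note that the from argument of sequence() is, as far as I know, only available from R 4.0.0 onward. Below is a sketch of an equivalent that should also work on older versions. It assumes d is the three-president data frame from the question (rebuilt here so the snippet runs on its own); res is just an illustrative name.

```r
# Assumed input: the sample data from the question
d <- data.frame(
  name = c("Bill Clinton", "George W. Bush", "Barack Obama"),
  from = c(1993, 2001, 2009),
  to   = c(2001, 2009, 2012)
)

# Number of years covered by each row, endpoints included
len <- d$to - d$from + 1

# sequence(len) counts 1..len within each row; shifting it by the
# repeated start year gives from, from + 1, ..., to for every row
res <- data.frame(
  name = rep(d$name, len),
  year = rep(d$from, len) + sequence(len) - 1
)
nrow(res)
```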
As noted by @Esteis in a comment, there may well be several columns that need to be repeated following the expansion of the ranges (not only 'name', as in the OP). In that case, instead of repeating the values of a single column, simply repeat the rows of the entire data frame, except for the 'from' and 'to' columns. A simple example:
d = data.frame(x = 1:2, y = 3:4, names = c("a", "b"),
from = c(2001, 2011), to = c(2003, 2012))
# x y names from to
# 1 1 3 a 2001 2003
# 2 2 4 b 2011 2012
len = d$to - d$from + 1
cbind(d[rep(1:nrow(d), len), setdiff(names(d), c("from", "to"))],
year = sequence(len, d$from))
x y names year
1 1 3 a 2001
1.1 1 3 a 2002
1.2 1 3 a 2003
2 2 4 b 2011
2.1 2 4 b 2012
Here is a quick base-R solution, where Df is your data.frame:
do.call(rbind, apply(Df, 1, function(x) {
data.frame(name=x[1], year=seq(x[2], x[3]))}))
It gives some warnings about row names, but appears to return the correct data.frame.
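One thing to keep in mind with this approach: apply() first converts the data frame to a matrix, and with a character column such as name that matrix is character throughout, so the years reach seq() as strings. A possible variant, sketched here under the assumption that Df has the columns name, from and to as in the question, loops over row indices instead so that each column keeps its type. The name res is illustrative.

```r
# Assumed input: the sample data from the question
Df <- data.frame(
  name = c("Bill Clinton", "George W. Bush", "Barack Obama"),
  from = c(1993, 2001, 2009),
  to   = c(2001, 2009, 2012)
)

# Build one small data frame per row, indexing the columns directly
# so 'from' and 'to' stay numeric, then stack the pieces
res <- do.call(rbind, lapply(seq_len(nrow(Df)), function(i) {
  data.frame(name = Df$name[i], year = seq(Df$from[i], Df$to[i]))
}))
str(res)
```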
Another option using tidyverse is to gather the data into long format, group_by name, and complete the sequence between the from and to years.
library(tidyverse)
presidents %>%
gather(key, date, -name) %>%
group_by(name) %>%
complete(date = seq(date[1], date[2]))%>%
select(-key)
# A tibble: 22 x 2
# Groups: name [3]
# name date
# <chr> <dbl>
# 1 Barack Obama 2009
# 2 Barack Obama 2010
# 3 Barack Obama 2011
# 4 Barack Obama 2012
# 5 Bill Clinton 1993
# 6 Bill Clinton 1994
# 7 Bill Clinton 1995
# 8 Bill Clinton 1996
# 9 Bill Clinton 1997
#10 Bill Clinton 1998
# … with 12 more rows
Another solution using dplyr and tidyr. It correctly preserves any data columns you have.
library(magrittr) # for pipes
df <- data.frame(
tata = c('toto1', 'toto2'),
from = c(2000, 2004),
to = c(2001, 2009),
measure1 = rnorm(2),
measure2 = 10 * rnorm(2)
)
tata from to measure1 measure2
1 toto1 2000 2001 -0.575 -10.13
2 toto2 2004 2009 -0.258 17.37
df %>%
dplyr::rowwise() %>%
dplyr::mutate(year = list(seq(from, to))) %>%
dplyr::select(-from, -to) %>%
tidyr::unnest(c(year))
# A tibble: 8 x 4
tata measure1 measure2 year
<chr> <dbl> <dbl> <int>
1 toto1 -0.575 -10.1 2000
2 toto1 -0.575 -10.1 2001
3 toto2 -0.258 17.4 2004
4 toto2 -0.258 17.4 2005
5 toto2 -0.258 17.4 2006
6 toto2 -0.258 17.4 2007
7 toto2 -0.258 17.4 2008
8 toto2 -0.258 17.4 2009
Use by to create a by list L of data.frames, one data.frame per president, and then rbind them together. No packages are used.
L <- by(presidents, presidents$name, with, data.frame(name, year = from:to))
do.call("rbind", setNames(L, NULL))
If you don't mind row names then the last line could be reduced to just:
do.call("rbind", L)
Another tidyverse option uses uncount():
library(dplyr)
library(tidyr)
df %>%
uncount(to - from + 1) %>%
group_by(name) %>%
transmute(year = seq(first(from), first(to)))
name year
<chr> <dbl>
1 Bill Clinton 1993
2 Bill Clinton 1994
3 Bill Clinton 1995
4 Bill Clinton 1996
5 Bill Clinton 1997
6 Bill Clinton 1998
7 Bill Clinton 1999
8 Bill Clinton 2000
9 Bill Clinton 2001
10 George W. Bush 2001
