Split values in row duplicating other colums values R [duplicate] - r

This question already has answers here:
Split multiple comma-separated column into separate rows [duplicate]
(3 answers)
Splitting row-values into multiple rows in R dataframe [duplicate]
(2 answers)
Split df row into multiple rows
(4 answers)
Split delimited strings in multiple columns and separate them into rows
(4 answers)
Closed last year.
I have this kind of dataset
Name Year Subject State
Jack 2003 Math/ Sci/ Music MA/ AB/ XY
Sam 2004 Math/ PE CA/ AB
Nicole 2005 Math/ Life Sci/ Geography NY/ DE/ FG
This is what I want as output:
Name Year Subject State
Jack 2003 Math MA
Jack 2003 Sci AB
Jack 2003 Music XY
Sam 2004 Math CA
Sam 2004 PE AB
Nicole 2005 Math NY
Nicole 2005 Life Sci DE
Nicole 2005 Geography FG
Keep in mind that the first element of 'subject' corresponds to the first of 'State' and so on. I need to keep this correspondence. I think I have to use something like 'pivot_longer' but I do not use R every day and I'm not skilled enought. Thanks in advance :)
Thanks!!

Name <- c("Jack", "Sam", "Nicole")
Year = c(2003, 2004, 2005)
Subject = c("Math/ Sci/ Music", "Math/ PE", "Math/ Life Sci/ Geography")
State = c("MA/ AB/ XY", "CA/ AB", "NY/ DE/ FG")
library(tidyr)
df <- data.frame(Name, Year, Subject, State)
df %>% separate_rows(Subject, State, sep = "/ ")
#> # A tibble: 8 × 4
#> Name Year Subject State
#> <chr> <dbl> <chr> <chr>
#> 1 Jack 2003 Math MA
#> 2 Jack 2003 Sci AB
#> 3 Jack 2003 Music XY
#> 4 Sam 2004 Math CA
#> 5 Sam 2004 PE AB
#> 6 Nicole 2005 Math NY
#> 7 Nicole 2005 Life Sci DE
#> 8 Nicole 2005 Geography FG
Created on 2022-01-27 by the reprex package (v2.0.1)

Alternatively, using pivot_wider after pivot_longer:
library(tidyverse)
df <- data.frame(
stringsAsFactors = FALSE,
Name = c("Jack", "Sam", "Nicole"),
Year = c(2003L, 2004L, 2005L),
Subject = c("Math/ Sci/ Music","Math/ PE",
"Math/ Life Sci/ Geography"),
State = c("MA/ AB/ XY", "CA/ AB", "NY/ DE/ FG")
)
df %>%
pivot_longer(cols = 3:4) %>%
pivot_wider(id_cols = c(Name, Year), values_fn = \(x) str_split(x, "/ ")) %>%
unnest(everything())
#> # A tibble: 8 × 4
#> Name Year Subject State
#> <chr> <int> <chr> <chr>
#> 1 Jack 2003 Math MA
#> 2 Jack 2003 Sci AB
#> 3 Jack 2003 Music XY
#> 4 Sam 2004 Math CA
#> 5 Sam 2004 PE AB
#> 6 Nicole 2005 Math NY
#> 7 Nicole 2005 Life Sci DE
#> 8 Nicole 2005 Geography FG

Related

Transform "start-end" datasets to panel dataset in r [duplicate]

This problem is also known as 'transforming a "start-end" dataset to a panel dataset'
I have a data frame containing "name" of U.S. Presidents, the years when they start and end in office, ("from" and "to" columns). Here is a sample:
name from to
Bill Clinton 1993 2001
George W. Bush 2001 2009
Barack Obama 2009 2012
...and the output from dput:
dput(tail(presidents, 3))
structure(list(name = c("Bill Clinton", "George W. Bush", "Barack Obama"
), from = c(1993, 2001, 2009), to = c(2001, 2009, 2012)), .Names = c("name",
"from", "to"), row.names = 42:44, class = "data.frame")
I want to create data frame with two columns ("name" and "year"), with a row for each year that a president was in office. Thus, I need to create a regular sequence with each year from "from", to "to". Here's my expected out:
name year
Bill Clinton 1993
Bill Clinton 1994
...
Bill Clinton 2000
Bill Clinton 2001
George W. Bush 2001
George W. Bush 2002
...
George W. Bush 2008
George W. Bush 2009
Barack Obama 2009
Barack Obama 2010
Barack Obama 2011
Barack Obama 2012
I know that I can use data.frame(name = "Bill Clinton", year = seq(1993, 2001)) to expand things for a single president, but I can't figure out how to iterate for each president.
How do I do this? I feel that I should know this, but I'm drawing a blank.
Update 1
OK, I've tried both solutions, and I'm getting an error:
foo<-structure(list(name = c("Grover Cleveland", "Benjamin Harrison", "Grover Cleveland"), from = c(1885, 1889, 1893), to = c(1889, 1893, 1897)), .Names = c("name", "from", "to"), row.names = 22:24, class = "data.frame")
ddply(foo, "name", summarise, year = seq(from, to))
Error in seq.default(from, to) : 'from' must be of length 1
Here's a data.table solution. It has the nice (if minor) feature of leaving the presidents in their supplied order:
library(data.table)
dt <- data.table(presidents)
dt[, list(year = seq(from, to)), by = name]
# name year
# 1: Bill Clinton 1993
# 2: Bill Clinton 1994
# ...
# ...
# 21: Barack Obama 2011
# 22: Barack Obama 2012
Edit: To handle presidents with non-consecutive terms, use this instead:
dt[, list(year = seq(from, to)), by = c("name", "from")]
You can use the plyr package:
library(plyr)
ddply(presidents, "name", summarise, year = seq(from, to))
# name year
# 1 Barack Obama 2009
# 2 Barack Obama 2010
# 3 Barack Obama 2011
# 4 Barack Obama 2012
# 5 Bill Clinton 1993
# 6 Bill Clinton 1994
# [...]
and if it is important that the data be sorted by year, you can use the arrange function:
df <- ddply(presidents, "name", summarise, year = seq(from, to))
arrange(df, df$year)
# name year
# 1 Bill Clinton 1993
# 2 Bill Clinton 1994
# 3 Bill Clinton 1995
# [...]
# 21 Barack Obama 2011
# 22 Barack Obama 2012
Edit 1: Following's #edgester's "Update 1", a more appropriate approach is to use adply to account for presidents with non-consecutive terms:
adply(foo, 1, summarise, year = seq(from, to))[c("name", "year")]
An alternate tidyverse approach using unnest and map2. However many data columns you have (such as name), they will all be correctly present in the new data frame.
library(tidyverse)
presidents %>%
mutate(year = map2(from, to, seq)) %>%
unnest(year) %>%
select(-from, -to)
# name year
# 1 Bill Clinton 1993
# 2 Bill Clinton 1994
...
# 21 Barack Obama 2011
# 22 Barack Obama 2012
Before tidyr v1.0.0, one could create variables as part of unnest().
presidents %>%
unnest(year = map2(from, to, seq)) %>%
select(-from, -to)
Here's a dplyr solution:
library(dplyr)
# the data
presidents <-
structure(list(name = c("Bill Clinton", "George W. Bush", "Barack Obama"
), from = c(1993, 2001, 2009), to = c(2001, 2009, 2012)), .Names = c("name",
"from", "to"), row.names = 42:44, class = "data.frame")
# the expansion of the table
presidents %>%
rowwise() %>%
do(data.frame(name = .$name, year = seq(.$from, .$to, by = 1)))
# the output
Source: local data frame [22 x 2]
Groups: <by row>
name year
(chr) (dbl)
1 Bill Clinton 1993
2 Bill Clinton 1994
3 Bill Clinton 1995
4 Bill Clinton 1996
5 Bill Clinton 1997
6 Bill Clinton 1998
7 Bill Clinton 1999
8 Bill Clinton 2000
9 Bill Clinton 2001
10 George W. Bush 2001
.. ... ...
h/t: https://stackoverflow.com/a/24804470/1036500
Two base solutions.
Using sequence:
len = d$to - d$from + 1
data.frame(name = d$name[rep(1:nrow(d), len)], year = sequence(len, d$from))
Using mapply:
l <- mapply(`:`, d$from, d$to)
data.frame(name = d$name[rep(1:nrow(d), lengths(l))], year = unlist(l))
# name year
# 1 Bill Clinton 1993
# 2 Bill Clinton 1994
# ...snip
# 8 Bill Clinton 2000
# 9 Bill Clinton 2001
# 10 George W. Bush 2001
# 11 George W. Bush 2002
# ...snip
# 17 George W. Bush 2008
# 18 George W. Bush 2009
# 19 Barack Obama 2009
# 20 Barack Obama 2010
# 21 Barack Obama 2011
# 22 Barack Obama 2012
As noted by #Esteis in comment, there may well be several columns that needs to be repeated following the expansion of the ranges (not only 'name', like in OP). In such case, instead of repeating values of a single column, simply repeat the rows of the entire data frame, except the 'from' & 'to' columns. A simple example:
d = data.frame(x = 1:2, y = 3:4, names = c("a", "b"),
from = c(2001, 2011), to = c(2003, 2012))
# x y names from to
# 1 1 3 a 2001 2003
# 2 2 4 b 2011 2012
len = d$to - d$from + 1
cbind(d[rep(1:nrow(d), len), setdiff(names(d), c("from", "to"))],
year = sequence(len, d$from))
x y names year
1 1 3 a 2001
1.1 1 3 a 2002
1.2 1 3 a 2003
2 2 4 b 2011
2.1 2 4 b 2012
Here is a quick base-R solution, where Df is your data.frame:
do.call(rbind, apply(Df, 1, function(x) {
data.frame(name=x[1], year=seq(x[2], x[3]))}))
It gives some warnings about row names, but appears to return the correct data.frame.
Another option using tidyverse could be to gather data into long format, group_by name and create a sequence between from and to date.
library(tidyverse)
presidents %>%
gather(key, date, -name) %>%
group_by(name) %>%
complete(date = seq(date[1], date[2]))%>%
select(-key)
# A tibble: 22 x 2
# Groups: name [3]
# name date
# <chr> <dbl>
# 1 Barack Obama 2009
# 2 Barack Obama 2010
# 3 Barack Obama 2011
# 4 Barack Obama 2012
# 5 Bill Clinton 1993
# 6 Bill Clinton 1994
# 7 Bill Clinton 1995
# 8 Bill Clinton 1996
# 9 Bill Clinton 1997
#10 Bill Clinton 1998
# … with 12 more rows
Another solution using dplyr and tidyr. It correctly preserves any data columns you have.
library(magrittr) # for pipes
df <- data.frame(
tata = c('toto1', 'toto2'),
from = c(2000, 2004),
to = c(2001, 2009),
measure1 = rnorm(2),
measure2 = 10 * rnorm(2)
)
tata from to measure1 measure2
1 toto1 2000 2001 -0.575 -10.13
2 toto2 2004 2009 -0.258 17.37
df %>%
dplyr::rowwise() %>%
dplyr::mutate(year = list(seq(from, to))) %>%
dplyr::select(-from, -to) %>%
tidyr::unnest(c(year))
# A tibble: 8 x 4
tata measure1 measure2 year
<chr> <dbl> <dbl> <int>
1 toto1 -0.575 -10.1 2000
2 toto1 -0.575 -10.1 2001
3 toto2 -0.258 17.4 2004
4 toto2 -0.258 17.4 2005
5 toto2 -0.258 17.4 2006
6 toto2 -0.258 17.4 2007
7 toto2 -0.258 17.4 2008
8 toto2 -0.258 17.4 2009
Use by to create a by list L of data.frames, one data.frame per president, and then rbind them together. No packages are used.
L <- by(presidents, presidents$name, with, data.frame(name, year = from:to))
do.call("rbind", setNames(L, NULL))
If you don't mind row names then the last line could be reduced to just:
do.call("rbind", L)
An addition to the tidyverse solutions can be:
df %>%
uncount(to - from + 1) %>%
group_by(name) %>%
transmute(year = seq(first(from), first(to)))
name year
<chr> <dbl>
1 Bill Clinton 1993
2 Bill Clinton 1994
3 Bill Clinton 1995
4 Bill Clinton 1996
5 Bill Clinton 1997
6 Bill Clinton 1998
7 Bill Clinton 1999
8 Bill Clinton 2000
9 Bill Clinton 2001
10 George W. Bush 2001

Replace NA with minimum Group Value R

I'm struggeling with transforming my data and would appreciate some help
year
name
start
2010
Emma
1998
2011
Emma
1998
2012
Emma
1998
2009
John
na
2010
John
na
2012
John
na
2007
Louis
na
2012
Louis
na
the aim is to replace all NAs with the minimum value in year for every name group so the data looks like this
year
name
start
2010
Emma
1998
2011
Emma
1998
2012
Emma
1998
2009
John
2009
2010
John
2009
2012
John
2009
2007
Louis
2007
2012
Louis
2007
Note: either all start values of one name group are NAs or none
I tried to use
mydf %>% group_by(name) %>% mutate(start= ifelse(is.na(start), min(year, na.rm = T), start))
but got this error
x `start` must return compatible vectors across groups
There are a lot of similar problems here.
Some people here used the ave function or worked with data.table which both doesnt seem to fit my problem
My base function must be sth like
df$A <- ifelse(is.na(df$A), df$B, df$A)
however I cant seem to properly combine it with the min() and group by() function.
Thank you for any help
I changed the colname to 'Year' because it was colliding to
dat %>%
dplyr::group_by(name) %>%
dplyr::mutate(start = dplyr::if_else(start == "na", min(Year), start))
# A tibble: 8 x 3
# Groups: name [3]
Year name start
<chr> <chr> <chr>
1 2010 Emma 1998
2 2011 Emma 1998
3 2012 Emma 1998
4 2009 John 2009
5 2010 John 2009
6 2012 John 2009
7 2007 Louis 2007
8 2012 Louis 2007
We can use na.aggregate
library(dplyr)
library(zoo)
dat %>%
group_by(name) %>%
mutate(start = na.aggregate(na_if(start, "na"), FUN = min))

R: How do I avoid getting an error when merging two data frames (group by/summarise)?

I have a big data frame of 80,000 rows. It was created by combining individual data frames from different years. The origin variable indicates the year of the entry's original data frame.
Here is an example of the first few of the big data frame rows that show how data frames from 2003 and 2011 were combined.
df_1:
ID City State origin
1 NY NY 2003
2 NY NY 2003
3 SF CA 2003
1 NY NY 2011
3 SF CA 2011
2 NY NY 2011
4 LA CA 2011
5 SD CA 2011
Now I want to create a new variable called first_appearance that takes the min of the origin variable for each ID:
final_df:
ID City State origin first_appearance
1 NY NY 2003 2003
2 NY NY 2003 2003
3 SF CA 2003 2003
1 NY NY 2011 2003
3 SF CA 2011 2003
2 NY NY 2011 2003
4 LA CA 2011 2011
5 SD CA 2011 2011
So far, I've tried using:
prestep_final <- df_1 %>% group_by(ID) %>% summarise(first_apperance = min(origin))
final_df <- merge(prestep_final, df_1, by = "ID")
Prestep_final works and produces a data frame with the ID and the first_appearance.
Unfortunately, the merge step doesn't work and yields a data frame with NA entries only.
How can I improve my code so that I can produce a table like final_df above. I'd appreciate any suggestions and don't have package preferences.
If you change summarise to mutate you get your desired result without merging:
library(tidyverse)
df <- tibble::tribble(
~ID, ~City, ~State, ~origin,
1, 'NY', 'NY', 2003,
2, 'NY', 'NY', 2003,
3, 'SF', 'CA', 2003,
1, 'NY', 'NY', 2011,
3, 'SF', 'CA', 2011,
2, 'NY', 'NY', 2011,
4, 'LA', 'CA', 2011,
5, 'SD', 'CA', 2011
)
df %>% group_by(ID) %>%
mutate(first_appearance = min(origin))
#> # A tibble: 8 x 5
#> # Groups: ID [5]
#> ID City State origin first_appearance
#> <dbl> <chr> <chr> <dbl> <dbl>
#> 1 1 NY NY 2003 2003
#> 2 2 NY NY 2003 2003
#> 3 3 SF CA 2003 2003
#> 4 1 NY NY 2011 2003
#> 5 3 SF CA 2011 2003
#> 6 2 NY NY 2011 2003
#> 7 4 LA CA 2011 2011
#> 8 5 SD CA 2011 2011
Created on 2020-06-10 by the reprex package (v0.3.0)
An option with data.table
library(data.table)
setDT(df)[, first_appearance := min(origin), ID]
Or in base R
df$first_appearance <- with(df, ave(origin, ID, FUN = min))

Add column to display period of time, based on start year and end year [duplicate]

This problem is also known as 'transforming a "start-end" dataset to a panel dataset'
I have a data frame containing "name" of U.S. Presidents, the years when they start and end in office, ("from" and "to" columns). Here is a sample:
name from to
Bill Clinton 1993 2001
George W. Bush 2001 2009
Barack Obama 2009 2012
...and the output from dput:
dput(tail(presidents, 3))
structure(list(name = c("Bill Clinton", "George W. Bush", "Barack Obama"
), from = c(1993, 2001, 2009), to = c(2001, 2009, 2012)), .Names = c("name",
"from", "to"), row.names = 42:44, class = "data.frame")
I want to create data frame with two columns ("name" and "year"), with a row for each year that a president was in office. Thus, I need to create a regular sequence with each year from "from", to "to". Here's my expected out:
name year
Bill Clinton 1993
Bill Clinton 1994
...
Bill Clinton 2000
Bill Clinton 2001
George W. Bush 2001
George W. Bush 2002
...
George W. Bush 2008
George W. Bush 2009
Barack Obama 2009
Barack Obama 2010
Barack Obama 2011
Barack Obama 2012
I know that I can use data.frame(name = "Bill Clinton", year = seq(1993, 2001)) to expand things for a single president, but I can't figure out how to iterate for each president.
How do I do this? I feel that I should know this, but I'm drawing a blank.
Update 1
OK, I've tried both solutions, and I'm getting an error:
foo<-structure(list(name = c("Grover Cleveland", "Benjamin Harrison", "Grover Cleveland"), from = c(1885, 1889, 1893), to = c(1889, 1893, 1897)), .Names = c("name", "from", "to"), row.names = 22:24, class = "data.frame")
ddply(foo, "name", summarise, year = seq(from, to))
Error in seq.default(from, to) : 'from' must be of length 1
Here's a data.table solution. It has the nice (if minor) feature of leaving the presidents in their supplied order:
library(data.table)
dt <- data.table(presidents)
dt[, list(year = seq(from, to)), by = name]
# name year
# 1: Bill Clinton 1993
# 2: Bill Clinton 1994
# ...
# ...
# 21: Barack Obama 2011
# 22: Barack Obama 2012
Edit: To handle presidents with non-consecutive terms, use this instead:
dt[, list(year = seq(from, to)), by = c("name", "from")]
You can use the plyr package:
library(plyr)
ddply(presidents, "name", summarise, year = seq(from, to))
# name year
# 1 Barack Obama 2009
# 2 Barack Obama 2010
# 3 Barack Obama 2011
# 4 Barack Obama 2012
# 5 Bill Clinton 1993
# 6 Bill Clinton 1994
# [...]
and if it is important that the data be sorted by year, you can use the arrange function:
df <- ddply(presidents, "name", summarise, year = seq(from, to))
arrange(df, df$year)
# name year
# 1 Bill Clinton 1993
# 2 Bill Clinton 1994
# 3 Bill Clinton 1995
# [...]
# 21 Barack Obama 2011
# 22 Barack Obama 2012
Edit 1: Following's #edgester's "Update 1", a more appropriate approach is to use adply to account for presidents with non-consecutive terms:
adply(foo, 1, summarise, year = seq(from, to))[c("name", "year")]
An alternate tidyverse approach using unnest and map2. However many data columns you have (such as name), they will all be correctly present in the new data frame.
library(tidyverse)
presidents %>%
mutate(year = map2(from, to, seq)) %>%
unnest(year) %>%
select(-from, -to)
# name year
# 1 Bill Clinton 1993
# 2 Bill Clinton 1994
...
# 21 Barack Obama 2011
# 22 Barack Obama 2012
Before tidyr v1.0.0, one could create variables as part of unnest().
presidents %>%
unnest(year = map2(from, to, seq)) %>%
select(-from, -to)
Here's a dplyr solution:
library(dplyr)
# the data
presidents <-
structure(list(name = c("Bill Clinton", "George W. Bush", "Barack Obama"
), from = c(1993, 2001, 2009), to = c(2001, 2009, 2012)), .Names = c("name",
"from", "to"), row.names = 42:44, class = "data.frame")
# the expansion of the table
presidents %>%
rowwise() %>%
do(data.frame(name = .$name, year = seq(.$from, .$to, by = 1)))
# the output
Source: local data frame [22 x 2]
Groups: <by row>
name year
(chr) (dbl)
1 Bill Clinton 1993
2 Bill Clinton 1994
3 Bill Clinton 1995
4 Bill Clinton 1996
5 Bill Clinton 1997
6 Bill Clinton 1998
7 Bill Clinton 1999
8 Bill Clinton 2000
9 Bill Clinton 2001
10 George W. Bush 2001
.. ... ...
h/t: https://stackoverflow.com/a/24804470/1036500
Two base solutions.
Using sequence:
len = d$to - d$from + 1
data.frame(name = d$name[rep(1:nrow(d), len)], year = sequence(len, d$from))
Using mapply:
l <- mapply(`:`, d$from, d$to)
data.frame(name = d$name[rep(1:nrow(d), lengths(l))], year = unlist(l))
# name year
# 1 Bill Clinton 1993
# 2 Bill Clinton 1994
# ...snip
# 8 Bill Clinton 2000
# 9 Bill Clinton 2001
# 10 George W. Bush 2001
# 11 George W. Bush 2002
# ...snip
# 17 George W. Bush 2008
# 18 George W. Bush 2009
# 19 Barack Obama 2009
# 20 Barack Obama 2010
# 21 Barack Obama 2011
# 22 Barack Obama 2012
As noted by #Esteis in comment, there may well be several columns that needs to be repeated following the expansion of the ranges (not only 'name', like in OP). In such case, instead of repeating values of a single column, simply repeat the rows of the entire data frame, except the 'from' & 'to' columns. A simple example:
d = data.frame(x = 1:2, y = 3:4, names = c("a", "b"),
from = c(2001, 2011), to = c(2003, 2012))
# x y names from to
# 1 1 3 a 2001 2003
# 2 2 4 b 2011 2012
len = d$to - d$from + 1
cbind(d[rep(1:nrow(d), len), setdiff(names(d), c("from", "to"))],
year = sequence(len, d$from))
x y names year
1 1 3 a 2001
1.1 1 3 a 2002
1.2 1 3 a 2003
2 2 4 b 2011
2.1 2 4 b 2012
Here is a quick base-R solution, where Df is your data.frame:
do.call(rbind, apply(Df, 1, function(x) {
data.frame(name=x[1], year=seq(x[2], x[3]))}))
It gives some warnings about row names, but appears to return the correct data.frame.
Another option using tidyverse could be to gather data into long format, group_by name and create a sequence between from and to date.
library(tidyverse)
presidents %>%
gather(key, date, -name) %>%
group_by(name) %>%
complete(date = seq(date[1], date[2]))%>%
select(-key)
# A tibble: 22 x 2
# Groups: name [3]
# name date
# <chr> <dbl>
# 1 Barack Obama 2009
# 2 Barack Obama 2010
# 3 Barack Obama 2011
# 4 Barack Obama 2012
# 5 Bill Clinton 1993
# 6 Bill Clinton 1994
# 7 Bill Clinton 1995
# 8 Bill Clinton 1996
# 9 Bill Clinton 1997
#10 Bill Clinton 1998
# … with 12 more rows
Another solution using dplyr and tidyr. It correctly preserves any data columns you have.
library(magrittr) # for pipes
df <- data.frame(
tata = c('toto1', 'toto2'),
from = c(2000, 2004),
to = c(2001, 2009),
measure1 = rnorm(2),
measure2 = 10 * rnorm(2)
)
tata from to measure1 measure2
1 toto1 2000 2001 -0.575 -10.13
2 toto2 2004 2009 -0.258 17.37
df %>%
dplyr::rowwise() %>%
dplyr::mutate(year = list(seq(from, to))) %>%
dplyr::select(-from, -to) %>%
tidyr::unnest(c(year))
# A tibble: 8 x 4
tata measure1 measure2 year
<chr> <dbl> <dbl> <int>
1 toto1 -0.575 -10.1 2000
2 toto1 -0.575 -10.1 2001
3 toto2 -0.258 17.4 2004
4 toto2 -0.258 17.4 2005
5 toto2 -0.258 17.4 2006
6 toto2 -0.258 17.4 2007
7 toto2 -0.258 17.4 2008
8 toto2 -0.258 17.4 2009
Use by to create a by list L of data.frames, one data.frame per president, and then rbind them together. No packages are used.
L <- by(presidents, presidents$name, with, data.frame(name, year = from:to))
do.call("rbind", setNames(L, NULL))
If you don't mind row names then the last line could be reduced to just:
do.call("rbind", L)
An addition to the tidyverse solutions can be:
df %>%
uncount(to - from + 1) %>%
group_by(name) %>%
transmute(year = seq(first(from), first(to)))
name year
<chr> <dbl>
1 Bill Clinton 1993
2 Bill Clinton 1994
3 Bill Clinton 1995
4 Bill Clinton 1996
5 Bill Clinton 1997
6 Bill Clinton 1998
7 Bill Clinton 1999
8 Bill Clinton 2000
9 Bill Clinton 2001
10 George W. Bush 2001

Expand ranges defined by "from" and "to" columns

This problem is also known as 'transforming a "start-end" dataset to a panel dataset'
I have a data frame containing "name" of U.S. Presidents, the years when they start and end in office, ("from" and "to" columns). Here is a sample:
name from to
Bill Clinton 1993 2001
George W. Bush 2001 2009
Barack Obama 2009 2012
...and the output from dput:
dput(tail(presidents, 3))
structure(list(name = c("Bill Clinton", "George W. Bush", "Barack Obama"
), from = c(1993, 2001, 2009), to = c(2001, 2009, 2012)), .Names = c("name",
"from", "to"), row.names = 42:44, class = "data.frame")
I want to create data frame with two columns ("name" and "year"), with a row for each year that a president was in office. Thus, I need to create a regular sequence with each year from "from", to "to". Here's my expected out:
name year
Bill Clinton 1993
Bill Clinton 1994
...
Bill Clinton 2000
Bill Clinton 2001
George W. Bush 2001
George W. Bush 2002
...
George W. Bush 2008
George W. Bush 2009
Barack Obama 2009
Barack Obama 2010
Barack Obama 2011
Barack Obama 2012
I know that I can use data.frame(name = "Bill Clinton", year = seq(1993, 2001)) to expand things for a single president, but I can't figure out how to iterate for each president.
How do I do this? I feel that I should know this, but I'm drawing a blank.
Update 1
OK, I've tried both solutions, and I'm getting an error:
foo<-structure(list(name = c("Grover Cleveland", "Benjamin Harrison", "Grover Cleveland"), from = c(1885, 1889, 1893), to = c(1889, 1893, 1897)), .Names = c("name", "from", "to"), row.names = 22:24, class = "data.frame")
ddply(foo, "name", summarise, year = seq(from, to))
Error in seq.default(from, to) : 'from' must be of length 1
Here's a data.table solution. It has the nice (if minor) feature of leaving the presidents in their supplied order:
library(data.table)
dt <- data.table(presidents)
dt[, list(year = seq(from, to)), by = name]
# name year
# 1: Bill Clinton 1993
# 2: Bill Clinton 1994
# ...
# ...
# 21: Barack Obama 2011
# 22: Barack Obama 2012
Edit: To handle presidents with non-consecutive terms, use this instead:
dt[, list(year = seq(from, to)), by = c("name", "from")]
You can use the plyr package:
library(plyr)
ddply(presidents, "name", summarise, year = seq(from, to))
# name year
# 1 Barack Obama 2009
# 2 Barack Obama 2010
# 3 Barack Obama 2011
# 4 Barack Obama 2012
# 5 Bill Clinton 1993
# 6 Bill Clinton 1994
# [...]
and if it is important that the data be sorted by year, you can use the arrange function:
df <- ddply(presidents, "name", summarise, year = seq(from, to))
arrange(df, df$year)
# name year
# 1 Bill Clinton 1993
# 2 Bill Clinton 1994
# 3 Bill Clinton 1995
# [...]
# 21 Barack Obama 2011
# 22 Barack Obama 2012
Edit 1: Following's #edgester's "Update 1", a more appropriate approach is to use adply to account for presidents with non-consecutive terms:
adply(foo, 1, summarise, year = seq(from, to))[c("name", "year")]
An alternate tidyverse approach using unnest and map2. However many data columns you have (such as name), they will all be correctly present in the new data frame.
library(tidyverse)
presidents %>%
mutate(year = map2(from, to, seq)) %>%
unnest(year) %>%
select(-from, -to)
# name year
# 1 Bill Clinton 1993
# 2 Bill Clinton 1994
...
# 21 Barack Obama 2011
# 22 Barack Obama 2012
Before tidyr v1.0.0, one could create variables as part of unnest().
presidents %>%
unnest(year = map2(from, to, seq)) %>%
select(-from, -to)
Here's a dplyr solution:
library(dplyr)
# the data
presidents <-
structure(list(name = c("Bill Clinton", "George W. Bush", "Barack Obama"
), from = c(1993, 2001, 2009), to = c(2001, 2009, 2012)), .Names = c("name",
"from", "to"), row.names = 42:44, class = "data.frame")
# the expansion of the table
presidents %>%
rowwise() %>%
do(data.frame(name = .$name, year = seq(.$from, .$to, by = 1)))
# the output
Source: local data frame [22 x 2]
Groups: <by row>
name year
(chr) (dbl)
1 Bill Clinton 1993
2 Bill Clinton 1994
3 Bill Clinton 1995
4 Bill Clinton 1996
5 Bill Clinton 1997
6 Bill Clinton 1998
7 Bill Clinton 1999
8 Bill Clinton 2000
9 Bill Clinton 2001
10 George W. Bush 2001
.. ... ...
h/t: https://stackoverflow.com/a/24804470/1036500
Two base solutions.
Using sequence:
len = d$to - d$from + 1
data.frame(name = d$name[rep(1:nrow(d), len)], year = sequence(len, d$from))
Using mapply:
l <- mapply(`:`, d$from, d$to)
data.frame(name = d$name[rep(1:nrow(d), lengths(l))], year = unlist(l))
# name year
# 1 Bill Clinton 1993
# 2 Bill Clinton 1994
# ...snip
# 8 Bill Clinton 2000
# 9 Bill Clinton 2001
# 10 George W. Bush 2001
# 11 George W. Bush 2002
# ...snip
# 17 George W. Bush 2008
# 18 George W. Bush 2009
# 19 Barack Obama 2009
# 20 Barack Obama 2010
# 21 Barack Obama 2011
# 22 Barack Obama 2012
As noted by #Esteis in comment, there may well be several columns that needs to be repeated following the expansion of the ranges (not only 'name', like in OP). In such case, instead of repeating values of a single column, simply repeat the rows of the entire data frame, except the 'from' & 'to' columns. A simple example:
d = data.frame(x = 1:2, y = 3:4, names = c("a", "b"),
from = c(2001, 2011), to = c(2003, 2012))
# x y names from to
# 1 1 3 a 2001 2003
# 2 2 4 b 2011 2012
len = d$to - d$from + 1
cbind(d[rep(1:nrow(d), len), setdiff(names(d), c("from", "to"))],
year = sequence(len, d$from))
x y names year
1 1 3 a 2001
1.1 1 3 a 2002
1.2 1 3 a 2003
2 2 4 b 2011
2.1 2 4 b 2012
Here is a quick base-R solution, where Df is your data.frame:
do.call(rbind, apply(Df, 1, function(x) {
data.frame(name=x[1], year=seq(x[2], x[3]))}))
It gives some warnings about row names, but appears to return the correct data.frame.
Another option using tidyverse could be to gather data into long format, group_by name and create a sequence between from and to date.
library(tidyverse)
presidents %>%
gather(key, date, -name) %>%
group_by(name) %>%
complete(date = seq(date[1], date[2]))%>%
select(-key)
# A tibble: 22 x 2
# Groups: name [3]
# name date
# <chr> <dbl>
# 1 Barack Obama 2009
# 2 Barack Obama 2010
# 3 Barack Obama 2011
# 4 Barack Obama 2012
# 5 Bill Clinton 1993
# 6 Bill Clinton 1994
# 7 Bill Clinton 1995
# 8 Bill Clinton 1996
# 9 Bill Clinton 1997
#10 Bill Clinton 1998
# … with 12 more rows
Another solution using dplyr and tidyr. It correctly preserves any data columns you have.
library(magrittr) # for pipes
df <- data.frame(
tata = c('toto1', 'toto2'),
from = c(2000, 2004),
to = c(2001, 2009),
measure1 = rnorm(2),
measure2 = 10 * rnorm(2)
)
tata from to measure1 measure2
1 toto1 2000 2001 -0.575 -10.13
2 toto2 2004 2009 -0.258 17.37
df %>%
dplyr::rowwise() %>%
dplyr::mutate(year = list(seq(from, to))) %>%
dplyr::select(-from, -to) %>%
tidyr::unnest(c(year))
# A tibble: 8 x 4
tata measure1 measure2 year
<chr> <dbl> <dbl> <int>
1 toto1 -0.575 -10.1 2000
2 toto1 -0.575 -10.1 2001
3 toto2 -0.258 17.4 2004
4 toto2 -0.258 17.4 2005
5 toto2 -0.258 17.4 2006
6 toto2 -0.258 17.4 2007
7 toto2 -0.258 17.4 2008
8 toto2 -0.258 17.4 2009
Use by to create a by list L of data.frames, one data.frame per president, and then rbind them together. No packages are used.
L <- by(presidents, presidents$name, with, data.frame(name, year = from:to))
do.call("rbind", setNames(L, NULL))
If you don't mind row names then the last line could be reduced to just:
do.call("rbind", L)
An addition to the tidyverse solutions can be:
df %>%
uncount(to - from + 1) %>%
group_by(name) %>%
transmute(year = seq(first(from), first(to)))
name year
<chr> <dbl>
1 Bill Clinton 1993
2 Bill Clinton 1994
3 Bill Clinton 1995
4 Bill Clinton 1996
5 Bill Clinton 1997
6 Bill Clinton 1998
7 Bill Clinton 1999
8 Bill Clinton 2000
9 Bill Clinton 2001
10 George W. Bush 2001

Resources