Fill numeric variable while preserving group - r

[EDITED to reflect a better example]
Say I have a dataframe like this:
df <- data.frame(x = c("A","A","B", "B"), year = c(2001,2004,2002,2005))
> df
x year
1 A 2001
2 A 2004
3 B 2002
4 B 2005
How can I increment year by 1 while preserving x? I would like to fill in year so that the sequence is this:
x year
1 A 2001
2 A 2002
3 A 2003
4 A 2004
5 B 2002
6 B 2003
7 B 2004
8 B 2005
Can anyone recommend a good way of doing this?
@useR recommended this approach:
> data.frame(year = min(df$year):max(df$year)) %>%
full_join(df) %>%
fill(x)
Joining, by = "year"
year x
1 2001 A
2 2002 B
3 2003 B
4 2004 A
5 2005 B
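However, that does not match the desired output: the year sequence is built over the whole data frame rather than within each x group, so every year appears only once and fill() carries x down across groups.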

One option uses tidyr::complete and dplyr::lead:
library(tidyverse)
df <- data.frame(x = LETTERS[1:3], year = c(2001,2004,2007))
df %>% mutate(nextYear = ifelse(is.na(lead(year)),year, lead(year)-1)) %>%
group_by(x) %>%
complete(year = seq(year, nextYear, by=1)) %>%
select(-nextYear) %>%
as.data.frame()
# x year
# 1 A 2001
# 2 A 2002
# 3 A 2003
# 4 B 2004
# 5 B 2005
# 6 B 2006
# 7 C 2007
Edited: The solution for modified data
df <- data.frame(x = c("A","A","B", "B"), year = c(2001,2004,2002,2005))
library(tidyverse)
df %>% group_by(x) %>%
complete(year = seq(min(year), max(year), by=1)) %>%
as.data.frame()
# x year
# 1 A 2001
# 2 A 2002
# 3 A 2003
# 4 A 2004
# 5 B 2002
# 6 B 2003
# 7 B 2004
# 8 B 2005

Using base R (with a little help from zoo):
full_df = data.frame(year = min(df$year):max(df$year))
df = merge(df, full_df, all = TRUE)
df = df[order(df$year), ]
df$x = zoo::na.locf(df$x)
df
# year x
# 1 2001 A
# 2 2002 A
# 3 2003 A
# 4 2004 B
# 5 2005 B
# 6 2006 B
# 7 2007 C
Using the "tidyverse"
df <- data.frame(x = LETTERS[1:3], year = c(2001,2004,2007))
library(dplyr)
library(tidyr)
df = df %>% mutate(year = factor(year, levels = min(year):max(year))) %>%
complete(year) %>%
fill(x) %>%
mutate(year = as.numeric(as.character(year)))
df
# # A tibble: 7 x 2
# year x
# <dbl> <fctr>
# 1 2001 A
# 2 2002 A
# 3 2003 A
# 4 2004 B
# 5 2005 B
# 6 2006 B
# 7 2007 C

We can first split by x, then create a year vector for each x group, join with each group df, fill down x, then finally rbind all group df's together.
library(dplyr)
library(tidyr)
df %>%
split(.$x) %>%
lapply(function(y) data.frame(year = min(y$year):max(y$year)) %>%
full_join(y) %>%
fill(x)) %>%
unname() %>%
do.call(rbind, .)
Result:
year x
1 2001 A
2 2002 A
3 2003 A
4 2004 A
5 2002 B
6 2003 B
7 2004 B
8 2005 B

Here's a pretty simple base R method with tapply and stack.
stack(tapply(df$year, df["x"], function(x) min(x):max(x)))
Here, tapply splits the year vector by df$x groups and then constructs a sequence from the min to the max year. This returns a named list which is fed to stack to produce the following.
values ind
1 2001 A
2 2002 A
3 2003 A
4 2004 A
5 2002 B
6 2003 B
7 2004 B
8 2005 B
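To make the intermediate step visible, here is a minimal base R sketch of the same idea, with split and lapply standing in for tapply. The names seqs and out are illustrative and do not come from the answer above.

```r
df <- data.frame(x = c("A", "A", "B", "B"), year = c(2001, 2004, 2002, 2005))

# Step 1: build one min-to-max year sequence per group (a named list)
seqs <- lapply(split(df$year, df$x), function(y) seq(min(y), max(y)))

# Step 2: stack the named list into a two-column data frame (values, ind)
out <- stack(seqs)
out
```

Printing seqs before stacking is a quick way to check that each group received its own range.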
If you are curious how you might do this in data.table, it's also pretty straightforward:
library(data.table)
setDT(df)[, .(year=min(year):max(year)), by=x]
which returns
x year
1: A 2001
2: A 2002
3: A 2003
4: A 2004
5: B 2002
6: B 2003
7: B 2004
8: B 2005

Related

Apply function over list then iterate over second variable, in r

I am trying to apply a function over a list while iterating over a second argument of the function, in R.
Here is an example:
Create the data
A <- data.frame(var = 1:3, year = 2000:2002)
B <- data.frame(var = 4:6, year = 2000:2002)
C <- data.frame(var = 7:9, year = 2000:2002)
ABC <- list(A, B, C)
> ABC
[[1]]
var year
1 1 2000
2 2 2001
3 3 2002
[[2]]
var year
1 4 2000
2 5 2001
3 6 2002
[[3]]
var year
1 7 2000
2 8 2001
3 9 2002
Write the function: sum (which simply filters for a start year and sums the 'var' values - sorry this simple function got messier in this example than I had intended).
library(dplyr)
sum <- function(dat, start.year) {
dat %>%
filter(year >= start.year) %>%
select(var) %>%
colSums() %>%
data.frame(row.names = NULL) %>%
rename(var = '.') %>%
mutate(start = start.year)
}
Now I can apply the function to the list (and bind_rows to get a neat output):
lapply(ABC, sum, 2000) %>%
bind_rows()
var start
1 6 2000
2 15 2000
3 24 2000
What I want to do however is iterate over start.year creating dataframes for start.year = c(2000, 2001, 2002). This would ideally give:
var start
1 6 2000
2 15 2000
3 24 2000
4 5 2001
5 11 2001
6 17 2001
7 3 2002
8 6 2002
9 9 2002
I have looked at map2, but that talks about using vectors of the same length. That would work in this case, but imagine my list had 4 items in it and only 3 records per list. So I assume map2 is doing something different. I also thought about a nested for loop. When I started writing that, however, I realized I would be dealing with list.append functions in R and that seemed wrong. I assume this is an easy thing to do. Any help would be appreciated.
We can do this with a nested lapply/map
library(purrr)
map_dfr(2000:2002, ~ map_dfr(ABC, sum, .x))
# var start
#1 6 2000
#2 15 2000
#3 24 2000
#4 5 2001
#5 11 2001
#6 17 2001
#7 3 2002
#8 6 2002
#9 9 2002
Or, inspired by @thelatemail's suggestion with Map:
map2_dfr(rep(ABC, 3), rep(2000:2002,each=length(ABC)), sum)
With lapply
do.call(rbind, lapply(2000:2002, function(x) do.call(rbind, lapply(ABC, sum, x))))
# var start
#1 6 2000
#2 15 2000
#3 24 2000
#4 5 2001
#5 11 2001
#6 17 2001
#7 3 2002
#8 6 2002
#9 9 2002
Or, as @thelatemail mentioned:
do.call(rbind, Map(sum, ABC, start.year=rep(2000:2002,each=length(ABC))))
If the OP's function can be changed, another option is
library(dplyr)
library(tidyr)
map_dfr(ABC, ~ .x %>%
crossing(year2 = 2000:2002) %>%
filter(year >= year2) %>%
group_by(year2) %>%
summarise(var = base::sum(var)))
Or instead of doing this in a list, we can bind them together with bind_rows then do a group by sum after crossing with the input 'years'
bind_rows(ABC, .id = 'grp') %>%
group_by(grp) %>%
crossing(year2 = 2000:2002) %>%
filter(year >= year2) %>%
group_by(grp, year2) %>%
summarise(var = base::sum(var))
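The pairing of every list element with every start year can also be spelled out explicitly with expand.grid. This is only a base R sketch: sum_from is a hypothetical, simplified stand-in for the question's function (renamed so that it does not mask base::sum), and grid and res are illustrative names.

```r
ABC <- list(
  data.frame(var = 1:3, year = 2000:2002),
  data.frame(var = 4:6, year = 2000:2002),
  data.frame(var = 7:9, year = 2000:2002)
)

# Hypothetical helper: sum 'var' for rows at or after start.year
sum_from <- function(dat, start.year) {
  data.frame(var = sum(dat$var[dat$year >= start.year]), start = start.year)
}

# All (list index, start year) pairs; the first column varies fastest
grid <- expand.grid(i = seq_along(ABC), start = 2000:2002)

# Apply the helper to each pair and bind the one-row results together
res <- do.call(rbind, Map(function(i, s) sum_from(ABC[[i]], s), grid$i, grid$start))
res
```

Because the pairs are generated rather than recycled, the list and the vector of start years do not need to have related lengths.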

Create a new dataframe with rows for every value in a sequence between two columns in a previous dataframe [duplicate]

This question already has answers here:
R creating a sequence table from two columns
I have a dataframe, where two columns represent the beginning and end of a range of dates. So:
df <- data.frame(var=c("A", "B"), start_year=c(2000, 2002), end_year=c(2005, 2004))
> df
var start_year end_year
1 A 2000 2005
2 B 2002 2004
And I'd like to create a new dataframe, where there is a row for every value between start_year and end_year, for each var.
So the result should look like:
> newdf
var year
1 A 2000
2 A 2001
3 A 2002
4 A 2003
5 A 2004
6 A 2005
7 B 2002
8 B 2003
9 B 2004
Ideally this would involve something from the tidyverse. I've been trying different things with dplyr::group_by and tidyr::gather, but I'm not having any luck.
As akrun demonstrated, it's probably easier to do it without gather and group_by (as mentioned in the question). But in case you're curious how to do it that way, here it is:
df %>%
gather(key, value, -var) %>%
group_by(var) %>%
expand(year = value[1]:value[2])
# # A tibble: 9 x 2
# # Groups: var [2]
# var year
# <fct> <int>
# 1 A 2000
# 2 A 2001
# 3 A 2002
# 4 A 2003
# 5 A 2004
# 6 A 2005
# 7 B 2002
# 8 B 2003
# 9 B 2004
Here's the same idea, convert to long and expand, in data.table (same output)
library(data.table)
setDT(df)
melt(df, 'var')[, .(year = value[1]:value[2]), var]
Edit: As markus points out, you don't need to convert to long first with data.table, you can do it in one step (not counting the two lines library/setDT in the code block above). This is a similar approach to akrun's tidyverse answer.
df[, .(year = start_year:end_year), by=var]
We can use map2 to get the sequence from 'start_year' to 'end_year' and unnest the list column to expand the data into 'long' format
library(tidyverse)
df %>%
transmute(var, year = map2(start_year, end_year, `:`)) %>%
unnest(year)
# var year
#1 A 2000
#2 A 2001
#3 A 2002
#4 A 2003
#5 A 2004
#6 A 2005
#7 B 2002
#8 B 2003
#9 B 2004
Or another option is complete
df %>%
group_by(var) %>%
complete(start_year = start_year:end_year) %>%
select(var, year = start_year)
Or in base R with stack and Map
stack(setNames(do.call(Map, c(f = `:`, df[-1])), df$var))
NOTE: First posted the solution with Map and stack
In case of other variations,
stack(setNames(Map(`:`, df[[2]], df[[3]]), df$var))
stack(setNames(do.call(mapply, c(FUN = `:`, df[-1])), df$var))
A short base R solution with seq.
stack(setNames(Map(seq, df[[2]], df[[3]]), df[[1]]))
# values ind
# 1 2000 A
# 2 2001 A
# 3 2002 A
# 4 2003 A
# 5 2004 A
# 6 2005 A
# 7 2002 B
# 8 2003 B
# 9 2004 B
Data
df <- structure(list(var = structure(1:2, .Label = c("A", "B"), class = "factor"),
start_year = c(2000, 2002), end_year = c(2005, 2004)), class = "data.frame", row.names = c(NA,
-2L))

Expand data frame with intervening observations

I am trying to expand a data frame in R with missing observations that are not immediately obvious. Here is what I mean:
data.frame(id = c("a","b"),start = c(2002,2004), end = c(2005,2007))
Which is:
id start end
1 a 2002 2005
2 b 2004 2007
What I would like is a new data frame with 8 total observations, 4 each for "a" and "b", and a year that is one of the values between start and end (inclusive). So:
id year
a 2002
a 2003
a 2004
a 2005
b 2004
b 2005
b 2006
b 2007
As I understand, various versions of expand only work on unique values, but here my data frame doesn't have all the unique values (explicitly).
I was thinking to step through each row and then generate a data frame with sapply(), then join all the new data frames together. But this attempt fails:
sapply(test,function(x) { data.frame( id=rep(id,x[["end"]]-x[["start"]]), year = x[["start"]]:x[["end"]] )})
I know there must be some dplyr or other magic to solve this problem!
You could use tidyr and dplyr:
library(tidyr)
library(dplyr)
df %>%
gather(key = key, value = year, -id) %>%
select(-key) %>%
group_by(id) %>%
complete(year = full_seq(year,1))
# A tibble: 8 x 2
# Groups: id [2]
id year
<fct> <dbl>
1 a 2002
2 a 2003
3 a 2004
4 a 2005
5 b 2004
6 b 2005
7 b 2006
8 b 2007
Using dplyr and tidyr, I make a new column which contains the list of years, then unnest the dataframe.
library(tidyr)
library(dplyr)
df <-
data.frame(
id = c("a", "b"),
start = c(2002, 2004),
end = c(2005, 2007)
)
df %>%
rowwise() %>%
mutate(year = list(seq(start, end))) %>%
select(-start, -end) %>%
unnest(year)
Output
# A tibble: 8 x 2
id year
<fct> <int>
1 a 2002
2 a 2003
3 a 2004
4 a 2005
5 b 2004
6 b 2005
7 b 2006
8 b 2007
An easy solution with data.table:
library(data.table)
# option 1
setDT(df)[, .(year = seq(start, end)), by = id]
# option 2
setDT(df)[, .(year = start:end), by = id]
which gives:
id year
1: a 2002
2: a 2003
3: a 2004
4: a 2005
5: b 2004
6: b 2005
7: b 2006
8: b 2007
An approach with base R:
lst <- Map(seq, df$start, df$end)
data.frame(id = rep(df$id, lengths(lst)), year = unlist(lst))
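The row-by-row idea from the question can also be made to work. The following is a minimal base R sketch, assuming the data frame is called df as in the answers above; pieces and out are illustrative names. It iterates over row indices rather than over columns, which is what sapply does when it is handed a data frame directly.

```r
df <- data.frame(id = c("a", "b"), start = c(2002, 2004), end = c(2005, 2007))

# Build one small data frame per input row; id is recycled to match the years
pieces <- lapply(seq_len(nrow(df)), function(i) {
  data.frame(id = df$id[i], year = seq(df$start[i], df$end[i]))
})

# Join the per-row data frames into one
out <- do.call(rbind, pieces)
out
```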

Speeding up rowwise comparison

I have a data.frame like the following:
id year x y v1
1 2006 12 1 0.8510703
1 2007 12 1 0.5954527
1 2008 12 2 -1.9312854
1 2009 12 1 0.1558393
1 2010 8 1 0.9051487
2 2001 12 2 -0.5480566
2 2002 12 2 -0.7607420
2 2003 3 2 -0.8094283
2 2004 3 2 -0.1732794
I would like to sum up (grouped by id) v1 of consecutive years (so 2010 and 2009, 2009 and 2008 and so on) only if x and y match. Expected output:
id year res
1 2010 NA
1 2009 NA
1 2008 NA
1 2007 1.4465230
2 2004 -0.9827077
2 2003 NA
2 2002 -1.3087987
The oldest year per id is removed, as there is no preceding year.
I have a slow lapply solution in place but would like to speed things up, as my data is rather large.
Data:
set.seed(1)
dat <- data.frame(id = c(rep(1,5),rep(2,4)),year = c(2006:2010,2001:2004),
x = c(12,12,12,12,8,12,12,3,3), y = c(1,1,2,1,1,2,2,2,2),
v1 = rnorm(9))
Current Solution:
require(dplyr)
myfun <- function(dat) { do.call(rbind,lapply(rev(unique(dat$year)[-1]),
function(z) inner_join(dat[dat$year==z,2:5],
dat[dat$year==z-1,2:5],
by=c("x","y")) %>%
summarise(year = z, res = ifelse(nrow(.) < 1,NA,sum(v1.x,v1.y)))))
}
dat %>% group_by(id) %>% do(myfun(.))
Here is a data.table solution, I think.
library(data.table)
datNew <- setDT(dat)[, .(year=year, res=(v1+shift(v1)) * NA^(x != shift(x) | y != shift(y))),
by=id][-1, .SD, by=id][]
id year res
1: 1 2007 -0.4428105
2: 1 2008 NA
3: 1 2009 NA
4: 1 2010 NA
5: 2 2001 NA
6: 2 2002 -0.3330393
7: 2 2003 NA
8: 2 2004 1.3141061
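Here, the j expression contains a list with two elements: the year and a computed sum. The sum adds each value to its lagged value, using shift, and is multiplied by NA or 1 depending on whether x and y match their lagged values (NA^TRUE is NA, while NA^FALSE is 1). This calculation is performed by id. The output is fed to a second chain, which is meant to drop the first observation of each id, which is all NA. Note that, as written, the -1 in i is applied to the whole table before grouping, so only the very first row is dropped; this is why the 2001 row for id 2 still appears in the output above.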
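The NA^ masking trick can be checked in isolation with plain base R. In this sketch, prev is a hypothetical stand-in for shift, and the vectors are made-up toy values rather than the question's data.

```r
# Hypothetical base R stand-in for data.table::shift (lag by one position)
prev <- function(v) c(NA, head(v, -1))

v1 <- c(1, 2, 3, 4)
x  <- c(12, 12, 8, 8)

# TRUE where x differs from the previous row, NA for the first row
mismatch <- x != prev(x)

# NA^FALSE is NA^0, which is 1, so the sum is kept;
# NA^TRUE is NA^1, which is NA, so the sum is blanked out
res <- (v1 + prev(v1)) * NA^mismatch
res
```

Here the first element is NA because there is no previous row, and the third is NA because x changed from 12 to 8.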
You can adjust the order efficiently using setorder if desired.
setorder(datNew, id, -year)
datNew
id year res
1: 1 2010 NA
2: 1 2009 NA
3: 1 2008 NA
4: 1 2007 -0.4428105
5: 2 2004 1.3141061
6: 2 2003 NA
7: 2 2002 -0.3330393
8: 2 2001 NA
Assuming there are sorted years as in the example:
dat %>%
group_by(id) %>%
mutate(res = v1 + lag(v1), #sum of the current and previous year
res = ifelse(x == lag(x) & y == lag(y), res, NA)) %>% #NA if x and y don't match
slice(-1) #drop the first year
You can use %>% select(id, year, res), and %>% arrange(id, desc(year)) at the end if you want.

Subset panel data by group [duplicate]

This question already has answers here:
Select the first and last row by group in a data frame
I would like to subset an unbalanced panel data set by group. For each group, I would like to keep the two observations in the first and the last years.
How do I best do this in R? For example:
dt <- data.frame(name= rep(c("A", "B", "C"), c(3,2,3)),
year=c(2001:2003,2000,2002,2000:2001,2003))
> dt
name year
1 A 2001
2 A 2002
3 A 2003
4 B 2000
5 B 2002
6 C 2000
7 C 2001
8 C 2003
What I would like to have:
name year
1 A 2001
3 A 2003
4 B 2000
5 B 2002
6 C 2000
8 C 2003
dplyr should help. Check out first() and last() to get the values you are looking for, and then filter based on those values.
dt <- data.frame(name= rep(c("A", "B", "C"), c(3,2,3)), year=c(2001:2003,2000,2002,2000:2001,2003))
library(dplyr)
dt %>%
group_by(name) %>%
mutate(first = first(year)
,last = last(year)) %>%
filter(year == first | year == last) %>%
select(name, year)
name year
1 A 2001
2 A 2003
3 B 2000
4 B 2002
5 C 2000
6 C 2003
*Your example didn't mention any specific order, but if the rows are not already sorted, arrange() will help.
Here's a quick possible data.table solution
library(data.table)
setDT(dt)[, .SD[c(1L, .N)], by = name]
# name year
# 1: A 2001
# 2: A 2003
# 3: B 2000
# 4: B 2002
# 5: C 2000
# 6: C 2003
Or if you only have two columns
dt[, year[c(1L, .N)], by = name]
This is pretty simple using by to split the data.frame by group and then return the head and tail of each group.
> do.call(rbind, by(dt, dt$name, function(x) rbind(head(x,1),tail(x,1))))
name year
A.1 A 2001
A.3 A 2003
B.4 B 2000
B.5 B 2002
C.6 C 2000
C.8 C 2003
head and tail are convenient, but slow, so a slightly different alternative would probably be faster on a large data.frame:
do.call(rbind, by(dt, dt$name, function(x) x[c(1,nrow(x)),]))
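For comparison, a fully vectorized base R variant avoids splitting altogether by using duplicated() from both ends. This is a sketch that assumes the rows are already sorted by year within each name; keep and out are illustrative names.

```r
dt <- data.frame(name = rep(c("A", "B", "C"), c(3, 2, 3)),
                 year = c(2001:2003, 2000, 2002, 2000:2001, 2003))

# First row of each group: not a duplicate when scanning from the top.
# Last row of each group: not a duplicate when scanning from the bottom.
keep <- !duplicated(dt$name) | !duplicated(dt$name, fromLast = TRUE)

out <- dt[keep, ]
out
```

A group with a single row is kept once, since that row is both the first and the last.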

Resources