Subset panel data by group [duplicate] - r

I would like to subset an unbalanced panel data set by group. For each group, I would like to keep the two observations in the first and the last years.
How do I best do this in R? For example:
dt <- data.frame(name= rep(c("A", "B", "C"), c(3,2,3)),
year=c(2001:2003,2000,2002,2000:2001,2003))
> dt
name year
1 A 2001
2 A 2002
3 A 2003
4 B 2000
5 B 2002
6 C 2000
7 C 2001
8 C 2003
What I would like to have:
name year
1 A 2001
3 A 2003
4 B 2000
5 B 2002
6 C 2000
8 C 2003

dplyr should help. Check out first() and last() to get the values you are looking for, then filter on those values.
dt <- data.frame(name= rep(c("A", "B", "C"), c(3,2,3)), year=c(2001:2003,2000,2002,2000:2001,2003))
library(dplyr)
dt %>%
group_by(name) %>%
mutate(first = first(year)
,last = last(year)) %>%
filter(year == first | year == last) %>%
select(name, year)
name year
1 A 2001
2 A 2003
3 B 2000
4 B 2002
5 C 2000
6 C 2003
Note: your example didn't mention any specific row order. If the rows are not already sorted by year within each group, arrange() will help.
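Since first() and last() pick rows by position, the result depends on how the rows are ordered. Here is a minimal sketch that sorts first and then takes the first and last row of each group with slice(). It assumes "first" and "last" mean the earliest and latest year; the name `first_last` is only illustrative.

```r
library(dplyr)

dt <- data.frame(
  name = rep(c("A", "B", "C"), c(3, 2, 3)),
  year = c(2001:2003, 2000, 2002, 2000:2001, 2003)
)

# Assumption: "first" and "last" mean earliest and latest year, not row position.
first_last <- dt %>%
  arrange(name, year) %>%        # sort so position 1 is the earliest year
  group_by(name) %>%
  slice(unique(c(1, n()))) %>%   # unique() avoids returning a one-row group twice
  ungroup()

first_last
```

The unique() call only matters for groups with a single observation, where rows 1 and n() are the same row.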

Here's a quick possible data.table solution
library(data.table)
setDT(dt)[, .SD[c(1L, .N)], by = name]
# name year
# 1: A 2001
# 2: A 2003
# 3: B 2000
# 4: B 2002
# 5: C 2000
# 6: C 2003
Or if you only have two columns
dt[, year[c(1L, .N)], by = name]
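Both calls above take rows by position, so they assume the data is already in year order within each group, and a group with only one row would come back twice because c(1L, .N) is then c(1L, 1L). A hedged variant that sorts inside the call and guards against that case might look like this (the name `res` is only illustrative):

```r
library(data.table)

dt <- data.table(
  name = rep(c("A", "B", "C"), c(3, 2, 3)),
  year = c(2001:2003, 2000, 2002, 2000:2001, 2003)
)

# order(year) in i sorts the rows before grouping;
# unique() keeps a single-row group from being returned twice;
# keyby (rather than by) returns the groups sorted by name.
res <- dt[order(year), .SD[unique(c(1L, .N))], keyby = name]
res
```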

This is pretty simple using by to split the data.frame by group and then return the head and tail of each group.
> do.call(rbind, by(dt, dt$name, function(x) rbind(head(x,1),tail(x,1))))
name year
A.1 A 2001
A.3 A 2003
B.4 B 2000
B.5 B 2002
C.6 C 2000
C.8 C 2003
head and tail are convenient, but slow, so a slightly different alternative would probably be faster on a large data.frame:
do.call(rbind, by(dt, dt$name, function(x) x[c(1,nrow(x)),]))
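If you would rather select by value than by row position, a vectorised base R sketch with ave() is another possibility. It compares each year with its group's minimum and maximum, so it does not depend on row order. Note that it keeps every row that ties for the earliest or latest year, which is a slightly different rule from head/tail.

```r
dt <- data.frame(
  name = rep(c("A", "B", "C"), c(3, 2, 3)),
  year = c(2001:2003, 2000, 2002, 2000:2001, 2003)
)

# ave() returns, for every row, the min (or max) year of that row's group
is_first <- dt$year == ave(dt$year, dt$name, FUN = min)
is_last  <- dt$year == ave(dt$year, dt$name, FUN = max)

res <- dt[is_first | is_last, ]
res
```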


Add missing years to data frame (reshaping)

Say I have data frame as follow:
df <- structure(list(
year = c(2001, 2001, 2002, 2003, 2001, 2002, 2003),
name = c("A", "B", "B", "B", "C", "C", "C"),
revenue = c(10, 20, 30, 40, 30, 40, 50)),
.typeOf = c("numeric", "factor", "numeric"),
row.names = c(NA, -7L),
class = "data.frame")
The first column contains years, the second names, and the last one revenues. As you can see, company "A" has data only for the first year, while the other companies have more data. I want to add new rows for company "A" with NA as the revenue for the following years (i.e. 2002 and 2003). For this purpose I use the following code:
df %>%
spread(year, revenue) %>%
gather(year, revenue, 2:ncol(.)) %>%
arrange(name) %>%
View()
It works pretty well, especially for small data sets, but I am not sure my solution is correct from a programming point of view. There is probably a much better solution using melt, cast (dcast) or something else. Any ideas?
EDITED: any ideas how I can do it using the pipe "%>%"?
Alternatively you can use expand.grid of unique name and year and merge them to your df using all=TRUE.
merge(expand.grid(lapply(df[2:1], unique)), df, all=TRUE)
# name year revenue
#1 A 2001 10
#2 A 2002 NA
#3 A 2003 NA
#4 B 2001 20
#5 B 2002 30
#6 B 2003 40
#7 C 2001 30
#8 C 2002 40
#9 C 2003 50
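For the pipe-based variant asked about in the question's edit, tidyr::complete() is one option that I believe fits: it adds a row for every combination of the listed columns and fills the remaining columns with NA. A minimal sketch, rebuilding the example data with plain data.frame() (the name `filled` is only illustrative):

```r
library(dplyr)
library(tidyr)

df <- data.frame(
  year    = c(2001, 2001, 2002, 2003, 2001, 2002, 2003),
  name    = c("A", "B", "B", "B", "C", "C", "C"),
  revenue = c(10, 20, 30, 40, 30, 40, 50)
)

# complete() expands to all name/year combinations; missing revenue becomes NA
filled <- df %>%
  complete(name, year)

filled
```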
In data.table you can use dcast() to cast to wide, meanwhile creating a complete groupset using drop = FALSE (which keeps empty groups).
setorder( dcast( setDT(df), year + name ~ ., drop = FALSE ), name )[]
# year name .
# 1: 2001 A 10
# 2: 2002 A NA
# 3: 2003 A NA
# 4: 2001 B 20
# 5: 2002 B 30
# 6: 2003 B 40
# 7: 2001 C 30
# 8: 2002 C 40
# 9: 2003 C 50
Another data.table option:
library(data.table)
setDT(df)
df[CJ(year, name, unique = TRUE), on = c("year", "name")]
# year name revenue
# 1: 2001 A 10
# 2: 2001 B 20
# 3: 2001 C 30
# 4: 2002 A NA
# 5: 2002 B 30
# 6: 2002 C 40
# 7: 2003 A NA
# 8: 2003 B 40
# 9: 2003 C 50

Create a new dataframe with rows for every value in a sequence between two columns in a previous dataframe [duplicate]

I have a dataframe, where two columns represent the beginning and end of a range of dates. So:
df <- data.frame(var=c("A", "B"), start_year=c(2000, 2002), end_year=c(2005, 2004))
> df
var start_year end_year
1 A 2000 2005
2 B 2002 2004
And I'd like to create a new dataframe, where there is a row for every value between start_year and end_year, for each var.
So the result should look like:
> newdf
var year
1 A 2000
2 A 2001
3 A 2002
4 A 2003
5 A 2004
6 A 2005
7 B 2002
8 B 2003
9 B 2004
Ideally this would involve something from the tidyverse. I've been trying different things with dplyr::group_by and tidyr::gather, but I'm not having any luck.
As akrun demonstrated, it's probably easier to do it without gather and group_by (as mentioned in the question). But in case you're curious how to do it that way, here it is
df %>%
gather(key, value, -var) %>%
group_by(var) %>%
expand(year = value[1]:value[2])
# # A tibble: 9 x 2
# # Groups: var [2]
# var year
# <fct> <int>
# 1 A 2000
# 2 A 2001
# 3 A 2002
# 4 A 2003
# 5 A 2004
# 6 A 2005
# 7 B 2002
# 8 B 2003
# 9 B 2004
Here's the same idea, convert to long and expand, in data.table (same output)
library(data.table)
setDT(df)
melt(df, 'var')[, .(year = value[1]:value[2]), var]
Edit: As markus points out, you don't need to convert to long first with data.table, you can do it in one step (not counting the two lines library/setDT in the code block above). This is a similar approach to akrun's tidyverse answer.
df[, .(year = start_year:end_year), by=var]
We can use map2 to get the sequence from 'start_year' to 'end_year' and unnest the list column to expand the data into 'long' format
library(tidyverse)
df %>%
transmute(var, year = map2(start_year, end_year, `:`)) %>%
unnest(cols = year)
# var year
#1 A 2000
#2 A 2001
#3 A 2002
#4 A 2003
#5 A 2004
#6 A 2005
#7 B 2002
#8 B 2003
#9 B 2004
Or another option is complete
df %>%
group_by(var) %>%
complete(start_year = start_year:end_year) %>%
select(var, year = start_year)
Or in base R with stack and Map
stack(setNames(do.call(Map, c(f = `:`, df[-1])), df$var))
NOTE: First posted the solution with Map and stack
In case of other variations,
stack(setNames(Map(`:`, df[[2]], df[[3]]), df$var))
stack(setNames(do.call(mapply, c(FUN = `:`, df[-1])), df$var))
A short base R solution with seq.
stack(setNames(Map(seq, df[[2]], df[[3]]), df[[1]]))
# values ind
# 1 2000 A
# 2 2001 A
# 3 2002 A
# 4 2003 A
# 5 2004 A
# 6 2005 A
# 7 2002 B
# 8 2003 B
# 9 2004 B
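stack() always names its columns values and ind, so the result does not quite match the var/year layout asked for in the question. A small sketch of one way to rename and reorder them in base R (the names `long` and `newdf` are only illustrative):

```r
df <- data.frame(
  var        = c("A", "B"),
  start_year = c(2000, 2002),
  end_year   = c(2005, 2004)
)

# One sequence per row, named by var, then stacked into long format
long <- stack(setNames(Map(seq, df$start_year, df$end_year), as.character(df$var)))

# stack() returns columns "values" and "ind"; rebuild with the desired names
newdf <- data.frame(var = as.character(long$ind), year = long$values)
newdf
```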
Data
df <- structure(list(var = structure(1:2, .Label = c("A", "B"), class = "factor"),
start_year = c(2000, 2002), end_year = c(2005, 2004)), class = "data.frame", row.names = c(NA,
-2L))

How to get the first year information of every group? [duplicate]

I was wondering whether it is possible to get the row with the first year for every group.
library(data.table)
dt <- data.table(Group = c(rep("A", 4), rep("B", 3), rep("C", 3)),
A = c(1:10),
B = c(10:1),
Year = c(2003:2006, 2004:2006, 2007, 2008, 2009))
The data is as follows
Group A B Year
1: A 1 10 2003
2: A 2 9 2004
3: A 3 8 2005
4: A 4 7 2006
5: B 5 6 2004
6: B 6 5 2005
7: B 7 4 2006
8: C 8 3 2007
9: C 9 2 2008
10: C 10 1 2009
But what I would like to get is the earliest year per group, but I can't seem to get it right:
dt[min(Year) == Year, by = Group]
How should I do this selection?
Try:
dt[, .SD[which.min(Year)], by = Group]
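As far as I can tell, the attempt in the question does not work because the condition in `i` is evaluated once on the whole table, before any grouping, and `by` only takes effect together with a `j` expression. A common alternative to .SD, sketched here, is to collect one row number per group with .I and subset once (the names `idx` and `earliest` are only illustrative):

```r
library(data.table)

dt <- data.table(
  Group = c(rep("A", 4), rep("B", 3), rep("C", 3)),
  A     = 1:10,
  B     = 10:1,
  Year  = c(2003:2006, 2004:2006, 2007, 2008, 2009)
)

# .I holds row numbers in the full table; pick the one with the minimum Year per group
idx <- dt[, .I[which.min(Year)], by = Group]$V1

# Subset the table once with those row numbers
earliest <- dt[idx]
earliest
```

which.min() returns only the first minimum, so each group contributes exactly one row even if the earliest year is repeated.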
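With dplyr, you can get the earliest year per group like this (note that summarise() returns only Group and the minimum year, not the whole row):
library(dplyr)
dt_min <- dt %>% group_by(Group) %>% summarise(new_year = min(Year))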
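If the other columns of the earliest row are needed as well, slice_min() is one way to keep them. This is a sketch assuming dplyr 1.0.0 or later, where slice_min() is available; the data is rebuilt as a plain data frame and the name `earliest` is only illustrative.

```r
library(dplyr)

dt <- data.frame(
  Group = c(rep("A", 4), rep("B", 3), rep("C", 3)),
  A     = 1:10,
  B     = 10:1,
  Year  = c(2003:2006, 2004:2006, 2007, 2008, 2009)
)

# Keep the full row with the smallest Year in each group;
# with_ties = FALSE returns a single row even if the minimum is repeated
earliest <- dt %>%
  group_by(Group) %>%
  slice_min(Year, n = 1, with_ties = FALSE) %>%
  ungroup()

earliest
```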

Expand data frame with intervening observations

I am trying to expand a data frame in R with missing observations that are not immediately obvious. Here is what I mean:
data.frame(id = c("a","b"),start = c(2002,2004), end = c(2005,2007))
Which is:
id start end
1 a 2002 2005
2 b 2004 2007
What I would like is a new data frame with 8 total observations, 4 each for "a" and "b", and a year that is one of the values between start and end (inclusive). So:
id year
a 2002
a 2003
a 2004
a 2005
b 2004
b 2005
b 2006
b 2007
As I understand, various versions of expand only work on unique values, but here my data frame doesn't have all the unique values (explicitly).
I was thinking to step through each row and then generate a data frame with sapply(), then join all the new data frames together. But this attempt fails:
sapply(test,function(x) { data.frame( id=rep(id,x[["end"]]-x[["start"]]), year = x[["start"]]:x[["end"]] )})
I know there must be some dplyr or other magic to solve this problem!
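For what it's worth, the row-by-row idea can be made to work. sapply() over a data frame iterates over its columns rather than its rows, which is probably why the attempt above fails. A base R sketch that loops over row indices instead, assuming the data frame is stored as `test` as in the attempt (the names `pieces` and `expanded` are only illustrative):

```r
test <- data.frame(id = c("a", "b"), start = c(2002, 2004), end = c(2005, 2007))

# Build one small data frame per row; id is recycled to the length of the sequence
pieces <- lapply(seq_len(nrow(test)), function(i) {
  data.frame(id = test$id[i], year = test$start[i]:test$end[i])
})

# Stack the pieces into a single data frame
expanded <- do.call(rbind, pieces)
expanded
```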
You could use tidyr and dplyr:
library(tidyr)
library(dplyr)
df %>%
gather(key = key, value = year, -id) %>%
select(-key) %>%
group_by(id) %>%
complete(year = full_seq(year,1))
# A tibble: 8 x 2
# Groups: id [2]
id year
<fct> <dbl>
1 a 2002
2 a 2003
3 a 2004
4 a 2005
5 b 2004
6 b 2005
7 b 2006
8 b 2007
Using dplyr and tidyr, I make a new column which contains the list of years, then unnest the dataframe.
library(tidyr)
library(dplyr)
df <-
data.frame(
id = c("a", "b"),
start = c(2002, 2004),
end = c(2005, 2007)
)
df %>%
rowwise() %>%
mutate(year = list(seq(start, end))) %>%
select(-start, -end) %>%
unnest(cols = year)
Output
# A tibble: 8 x 2
id year
<fct> <int>
1 a 2002
2 a 2003
3 a 2004
4 a 2005
5 b 2004
6 b 2005
7 b 2006
8 b 2007
An easy solution with data.table:
library(data.table)
# option 1
setDT(df)[, .(year = seq(start, end)), by = id]
# option 2
setDT(df)[, .(year = start:end), by = id]
which gives:
id year
1: a 2002
2: a 2003
3: a 2004
4: a 2005
5: b 2004
6: b 2005
7: b 2006
8: b 2007
An approach with base R:
lst <- Map(seq, df$start, df$end)
data.frame(id = rep(df$id, lengths(lst)), year = unlist(lst))

Fill numeric variable while preserving group

[EDITED to reflect a better example]
Say I have a dataframe like this:
df <- data.frame(x = c("A","A","B", "B"), year = c(2001,2004,2002,2005))
> df
x year
1 A 2001
2 A 2004
3 B 2002
4 B 2005
How can I increment year by 1 while preserving x? I would like to fill in year so that the sequence is this:
x year
1 A 2001
2 A 2002
3 A 2003
4 A 2004
5 B 2002
6 B 2003
7 B 2004
8 B 2005
Can anyone recommend a good way of doing this?
@useR recommended this approach:
> data.frame(year = min(df$year):max(df$year)) %>%
full_join(df) %>%
fill(x)
Joining, by = "year"
year x
1 2001 A
2 2002 B
3 2003 B
4 2004 A
5 2005 B
However that does not match the desired output.
An option using tidyr::complete and dplyr::lead can be as:
library(tidyverse)
df <- data.frame(x = LETTERS[1:3], year = c(2001,2004,2007))
df %>% mutate(nextYear = ifelse(is.na(lead(year)),year, lead(year)-1)) %>%
group_by(x) %>%
complete(year = seq(year, nextYear, by=1)) %>%
select(-nextYear) %>%
as.data.frame()
# x year
# 1 A 2001
# 2 A 2002
# 3 A 2003
# 4 B 2004
# 5 B 2005
# 6 B 2006
# 7 C 2007
Edited: The solution for modified data
df <- data.frame(x = c("A","A","B", "B"), year = c(2001,2004,2002,2005))
library(tidyverse)
df %>% group_by(x) %>%
complete(year = seq(min(year), max(year), by=1)) %>%
as.data.frame()
# x year
# 1 A 2001
# 2 A 2002
# 3 A 2003
# 4 A 2004
# 5 B 2002
# 6 B 2003
# 7 B 2004
# 8 B 2005
Using base R (with a little help from zoo):
full_df = data.frame(year = min(df$year):max(df$year))
df = merge(df, full_df, all = TRUE)
df = df[order(df$year), ]
df$x = zoo::na.locf(df$x)
df
# year x
# 1 2001 A
# 2 2002 A
# 3 2003 A
# 4 2004 B
# 5 2005 B
# 6 2006 B
# 7 2007 C
Using the "tidyverse"
df <- data.frame(x = LETTERS[1:3], year = c(2001,2004,2007))
library(dplyr)
library(tidyr)
df = df %>% mutate(year = factor(year, levels = min(year):max(year))) %>%
complete(year) %>%
fill(x) %>%
mutate(year = as.numeric(as.character(year)))
df
# # A tibble: 7 x 2
# year x
# <dbl> <fctr>
# 1 2001 A
# 2 2002 A
# 3 2003 A
# 4 2004 B
# 5 2005 B
# 6 2006 B
# 7 2007 C
We can first split by x, then create a year vector for each x group, join with each group df, fill down x, then finally rbind all group df's together.
library(dplyr)
library(tidyr)
df %>%
split(.$x) %>%
lapply(function(y) data.frame(year = min(y$year):max(y$year)) %>%
full_join(y) %>%
fill(x)) %>%
unname() %>%
do.call(rbind, .)
Result:
year x
1 2001 A
2 2002 A
3 2003 A
4 2004 A
5 2002 B
6 2003 B
7 2004 B
8 2005 B
Here's a pretty simple base R method with tapply and stack.
stack(tapply(df$year, df["x"], function(x) min(x):max(x)))
Here, tapply splits the year vector by df$x groups and then constructs a sequence from the min to the max year. This returns a named list which is fed to stack to produce the following.
values ind
1 2001 A
2 2002 A
3 2003 A
4 2004 A
5 2002 B
6 2003 B
7 2004 B
8 2005 B
If you are curious how you might do this in data.table, it's also pretty straightforward:
library(data.table)
setDT(df)[, .(year=min(year):max(year)), by=x]
which returns
x year
1: A 2001
2: A 2002
3: A 2003
4: A 2004
5: B 2002
6: B 2003
7: B 2004
8: B 2005
