Add missing years to data frame (reshaping) - r

Say I have a data frame as follows:
df <- structure(list(
year = c(2001, 2001, 2002, 2003, 2001, 2002, 2003),
name = c("A", "B", "B", "B", "C", "C", "C"),
revenue = c(10, 20, 30, 40, 30, 40, 50)),
.typeOf = c("numeric", "factor", "numeric"),
row.names = c(NA, -7L),
class = "data.frame")
The first column contains years, the second names, and the last revenues. As you can see, company "A" has data only for the first year, while the other companies have data for all three years. I want to add new rows for company "A" with NA as the revenue for the following years (i.e. 2002 and 2003). For this purpose I use the following code:
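library(dplyr)
library(tidyr)
df %>%
# wide: one column per year, NA where a name/year pair is missing
spread(year, revenue) %>%
# back to long; the NA cells become explicit rows
gather(year, revenue, 2:ncol(.)) %>%
arrange(name) %>%
View()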
It works pretty well, especially for small data sets, but I am not sure my solution is sound from a programming point of view. There is probably a much better solution using melt, cast (dcast) or something else. Any ideas?
EDITED: any ideas how I can do this in a pipe (%>%)?

Alternatively, you can use expand.grid on the unique name and year values and merge the result with your df using all = TRUE.
merge(expand.grid(lapply(df[2:1], unique)), df, all=TRUE)
# name year revenue
#1 A 2001 10
#2 A 2002 NA
#3 A 2003 NA
#4 B 2001 20
#5 B 2002 30
#6 B 2003 40
#7 C 2001 30
#8 C 2002 40
#9 C 2003 50

In data.table you can use dcast() to cast to wide while creating a complete set of groups with drop = FALSE (which keeps empty groups).
setorder( dcast( setDT(df), year + name ~ ., drop = FALSE ), name )[]
# year name .
# 1: 2001 A 10
# 2: 2002 A NA
# 3: 2003 A NA
# 4: 2001 B 20
# 5: 2002 B 30
# 6: 2003 B 40
# 7: 2001 C 30
# 8: 2002 C 40
# 9: 2003 C 50

Another data.table option:
library(data.table)
setDT(df)
df[CJ(year, name, unique = TRUE), on = c("year", "name")]
# year name revenue
# 1: 2001 A 10
# 2: 2001 B 20
# 3: 2001 C 30
# 4: 2002 A NA
# 5: 2002 B 30
# 6: 2002 C 40
# 7: 2003 A NA
# 8: 2003 B 40
# 9: 2003 C 50
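For the pipe-based variant asked about in the edit to the question, a minimal sketch using tidyr::complete() might look like the following. This is not from any of the answers above; it assumes dplyr and tidyr are installed, and it rebuilds the example data with data.frame() rather than structure().

```r
library(dplyr)
library(tidyr)

# Same example data as in the question
df <- data.frame(
  year    = c(2001, 2001, 2002, 2003, 2001, 2002, 2003),
  name    = c("A", "B", "B", "B", "C", "C", "C"),
  revenue = c(10, 20, 30, 40, 30, 40, 50)
)

# complete() adds every name/year combination that is absent from the data;
# revenue is filled with NA for the added rows
res <- df %>%
  complete(name, year) %>%
  arrange(name, year)

res
```

Unlike the spread/gather round trip, this keeps year numeric instead of turning it into a character column.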

Related

Running a function in sapply is returning faulty results when first ID is all missing

I'm having a very odd, specific problem that I'm struggling to google, so I'm hoping I can just show someone.
I've written a function that will fill in some missing data according to a few conditions. For example, for panel data like this:
library(tidyverse)
library(data.table)
dt <- data.frame(id = c(rep('a', 5),
rep('b', 5),
rep('c', 5)),
var1 = c(rep('', 4), 'bonjour',
'bye', NA, 'bye', 'bye', NA,
'hi', 'hi', NA, 'hi', 'hi'),
year = c(2005:2009,
1995:1998, 2002,
1995:1999))
dt
id var1 year
1: a 2005
2: a 2006
3: a 2007
4: a 2008
5: a bonjour 2009
6: b bye 1995
7: b <NA> 1996
8: b bye 1997
9: b bye 1998
10: b <NA> 2002
11: c hi 1995
12: c hi 1996
13: c <NA> 1997
14: c hi 1998
15: c hi 1999
I use the following function to update some of the missing values:
fill.in <- function(var, yr, finyr) {
leadv <- lead(var, n=1, order_by = yr)
lagv <- lag(var, n=1, order_by = yr)
leadyr <- lead(yr, n=1, order_by = yr)
lagyr <- lag(yr, n=1, order_by = yr)
# ------- build the updated var w/ sequential conditions
# keep the var as it is if not missing
try1 <- ifelse(test = !is.na(var),
yes = var,
no = NA)
# fill in if the lead and lag match and no more than 2 missing years
try2 <- ifelse(test = is.na(try1) & leadv == lagv &
abs(leadyr-lagyr) <= 3 &
!is.na(leadv),
yes = leadv,
no = try1)
# fill in with the lag if it's the final year of observed data
ifelse(test = is.na(try2) & yr == finyr &
abs(yr-lagyr) <= 3 & !is.na(lagv),
yes = lagv,
no = try2)
}
After a little bit of set-up, by and large I get good results:
# ------------ Set-up
# real data is big so use data.table
setDT(dt)
dt[, finalyr := max(year), by = id]
# don't want to fill in factor values
dt$var1 <- as.character(dt$var1)
# make empty strings NAs
dt[, var1 := na_if(var1, '')]
# useful for when i'm filling in many variables
fill.in.vs <- c('var1')
fixed.vnames <- paste0('fixed.', fill.in.vs)
# ------------ Call the function and results
dt[, (fixed.vnames) := sapply(.SD,
FUN = fill.in,
year,
finalyr,
simplify = FALSE, USE.NAMES = FALSE),
by = id, .SDcols = fill.in.vs]
# this gives me what I want:
dt
id var1 year finalyr fixed.var1
1: a <NA> 2005 2009 <NA>
2: a <NA> 2006 2009 <NA>
3: a <NA> 2007 2009 <NA>
4: a <NA> 2008 2009 <NA>
5: a bonjour 2009 2009 bonjour
6: b bye 1995 2002 bye
7: b <NA> 1996 2002 bye
8: b bye 1997 2002 bye
9: b bye 1998 2002 bye
10: b <NA> 2002 2002 <NA>
11: c hi 1995 1999 hi
12: c hi 1996 1999 hi
13: c <NA> 1997 1999 hi
14: c hi 1998 1999 hi
15: c hi 1999 1999 hi
The problem is that when the first set of IDs--e.g. all the 'a' values--have empty strings that I turn into NAs, all values of the "fixed" variable end up NAs as well.
So using that same code but with the following data, I get all NAs in the new variable:
# id of 'a' now is all empty strings in var1:
dt <- data.frame(id = c(rep('a', 5),
rep('b', 5),
rep('c', 5)),
var1 = c(rep('', 5),
'bye', NA, 'bye', 'bye', NA,
'hi', 'hi', NA, 'hi', 'hi'),
year = c(2005:2009,
1995:1998, 2002,
1995:1999))
# which results in this final data after running the same code above:
dt
id var1 year finalyr fixed.var1
1: a <NA> 2005 2009 NA
2: a <NA> 2006 2009 NA
3: a <NA> 2007 2009 NA
4: a <NA> 2008 2009 NA
5: a <NA> 2009 2009 NA
6: b bye 1995 2002 NA
7: b <NA> 1996 2002 NA
8: b bye 1997 2002 NA
9: b bye 1998 2002 NA
10: b <NA> 2002 2002 NA
11: c hi 1995 1999 NA
12: c hi 1996 1999 NA
13: c <NA> 1997 1999 NA
14: c hi 1998 1999 NA
15: c hi 1999 1999 NA
For brevity I won't show you all the things I've tried, but a few observations about when it happens:
All empty strings in the first ID isn't a problem if I do not convert empty strings to NA.
I only get this result if it's the first ID that has all empty strings; if it's the second set, the results are fine.
How I convert "" to NA doesn't matter, i.e. it's not an issue with na_if because it also happens when I use ifelse.
Overall I'm pretty stumped as to what's happening or how to investigate it further. I would really appreciate any help.
When I run your code, I get this warning:
1: In `[.data.table`(dt, , `:=`((fixed.vnames), sapply(.SD, FUN = fill.in, : Coercing 'character' RHS to 'logical' to match the type of the target column (column 0 named '').
I get it twice: once for the second group and once for the third.
As it says, the variable fixed.var1 is initialised as a logical variable (for the group id == a); values that are added later are then coerced to the same class, 'logical'.
The major culprit here is your function fill.in(), since e.g.
logicalVar <- fill.in( var=rep(NA,5), yr=2005:2009, finyr=rep(2009,5)); class(logicalVar)
returns a logical variable.
So all you need to do is to wrap as.character() around the return of your function.
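A sketch of that change, assuming the rest of fill.in() stays exactly as posted in the question (the intermediate name out and the two test calls at the end are only for illustration):

```r
library(dplyr)

# fill.in() as in the question; the only change is the as.character()
# wrapper around the final result
fill.in <- function(var, yr, finyr) {
  leadv  <- lead(var, n = 1, order_by = yr)
  lagv   <- lag(var, n = 1, order_by = yr)
  leadyr <- lead(yr, n = 1, order_by = yr)
  lagyr  <- lag(yr, n = 1, order_by = yr)
  # keep the var as it is if not missing
  try1 <- ifelse(!is.na(var), var, NA)
  # fill in if the lead and lag match and the gap is small enough
  try2 <- ifelse(is.na(try1) & leadv == lagv &
                   abs(leadyr - lagyr) <= 3 & !is.na(leadv),
                 leadv, try1)
  # fill in with the lag if it's the final year of observed data
  out <- ifelse(is.na(try2) & yr == finyr &
                  abs(yr - lagyr) <= 3 & !is.na(lagv),
                lagv, try2)
  # an all-NA group would otherwise come back as logical
  as.character(out)
}

all_na <- fill.in(var = rep(NA, 5), yr = 2005:2009, finyr = rep(2009, 5))
class(all_na)
# "character"

group_b <- fill.in(var = c("bye", NA, "bye", "bye", NA),
                   yr = c(1995:1998, 2002), finyr = rep(2002, 5))
group_b
```

With the first group now returning character, data.table creates fixed.var1 as a character column and the later groups are no longer coerced to logical.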

Create a new dataframe with rows for every value in a sequence between two columns in a previous dataframe [duplicate]

I have a dataframe, where two columns represent the beginning and end of a range of dates. So:
df <- data.frame(var=c("A", "B"), start_year=c(2000, 2002), end_year=c(2005, 2004))
> df
var start_year end_year
1 A 2000 2005
2 B 2002 2004
And I'd like to create a new dataframe, where there is a row for every value between start_year and end_year, for each var.
So the result should look like:
> newdf
var year
1 A 2000
2 A 2001
3 A 2002
4 A 2003
5 A 2004
6 A 2005
7 B 2002
8 B 2003
9 B 2004
Ideally this would involve something from the tidyverse. I've been trying different things with dplyr::group_by and tidyr::gather, but I'm not having any luck.
As akrun demonstrated, it's probably easier to do it without gather and group_by (which the question mentions). But in case you're curious how to do it that way, here it is:
df %>%
gather(key, value, -var) %>%
group_by(var) %>%
expand(year = value[1]:value[2])
# # A tibble: 9 x 2
# # Groups: var [2]
# var year
# <fct> <int>
# 1 A 2000
# 2 A 2001
# 3 A 2002
# 4 A 2003
# 5 A 2004
# 6 A 2005
# 7 B 2002
# 8 B 2003
# 9 B 2004
Here's the same idea, convert to long and expand, in data.table (same output)
library(data.table)
setDT(df)
melt(df, 'var')[, .(year = value[1]:value[2]), var]
Edit: As markus points out, you don't need to convert to long first with data.table, you can do it in one step (not counting the two lines library/setDT in the code block above). This is a similar approach to akrun's tidyverse answer.
df[, .(year = start_year:end_year), by=var]
We can use map2 to get the sequence from 'start_year' to 'end_year' and unnest the list column to expand the data into 'long' format
library(tidyverse)
df %>%
transmute(var, year = map2(start_year, end_year, `:`)) %>%
unnest(year)
# var year
#1 A 2000
#2 A 2001
#3 A 2002
#4 A 2003
#5 A 2004
#6 A 2005
#7 B 2002
#8 B 2003
#9 B 2004
Or another option is complete
df %>%
group_by(var) %>%
complete(start_year = start_year:end_year) %>%
select(var, year = start_year)
Or in base R with stack and Map
stack(setNames(do.call(Map, c(f = `:`, df[-1])), df$var))
NOTE: First posted the solution with Map and stack
In case of other variations,
stack(setNames(Map(`:`, df[[2]], df[[3]]), df$var))
stack(setNames(do.call(mapply, c(FUN = `:`, df[-1])), df$var))
A short base R solution with seq.
stack(setNames(Map(seq, df[[2]], df[[3]]), df[[1]]))
# values ind
# 1 2000 A
# 2 2001 A
# 3 2002 A
# 4 2003 A
# 5 2004 A
# 6 2005 A
# 7 2002 B
# 8 2003 B
# 9 2004 B
Data
df <- structure(list(var = structure(1:2, .Label = c("A", "B"), class = "factor"),
start_year = c(2000, 2002), end_year = c(2005, 2004)), class = "data.frame", row.names = c(NA,
-2L))

Expand data frame with intervening observations

I am trying to expand a data frame in R with missing observations that are not immediately obvious. Here is what I mean:
data.frame(id = c("a","b"),start = c(2002,2004), end = c(2005,2007))
Which is:
id start end
1 a 2002 2005
2 b 2004 2007
What I would like is a new data frame with 8 total observations, 4 each for "a" and "b", and a year that is one of the values between start and end (inclusive). So:
id year
a 2002
a 2003
a 2004
a 2005
b 2004
b 2005
b 2006
b 2007
As I understand, various versions of expand only work on unique values, but here my data frame doesn't have all the unique values (explicitly).
I was thinking to step through each row and then generate a data frame with sapply(), then join all the new data frames together. But this attempt fails:
sapply(test,function(x) { data.frame( id=rep(id,x[["end"]]-x[["start"]]), year = x[["start"]]:x[["end"]] )})
I know there must be some dplyr or other magic to solve this problem!
You could use tidyr and dplyr:
library(tidyr)
library(dplyr)
df %>%
gather(key = key, value = year, -id) %>%
select(-key) %>%
group_by(id) %>%
complete(year = full_seq(year,1))
# A tibble: 8 x 2
# Groups: id [2]
id year
<fct> <dbl>
1 a 2002
2 a 2003
3 a 2004
4 a 2005
5 b 2004
6 b 2005
7 b 2006
8 b 2007
Using dplyr and tidyr, I make a new column which contains the list of years, then unnest the dataframe.
library(tidyr)
library(dplyr)
df <-
data.frame(
id = c("a", "b"),
start = c(2002, 2004),
end = c(2005, 2007)
)
df %>%
rowwise() %>%
mutate(year = list(seq(start, end))) %>%
select(-start, -end) %>%
unnest(year)
Output
# A tibble: 8 x 2
id year
<fct> <int>
1 a 2002
2 a 2003
3 a 2004
4 a 2005
5 b 2004
6 b 2005
7 b 2006
8 b 2007
An easy solution with data.table:
library(data.table)
# option 1
setDT(df)[, .(year = seq(start, end)), by = id]
# option 2
setDT(df)[, .(year = start:end), by = id]
which gives:
id year
1: a 2002
2: a 2003
3: a 2004
4: a 2005
5: b 2004
6: b 2005
7: b 2006
8: b 2007
An approach with base R:
lst <- Map(seq, df$start, df$end)
data.frame(id = rep(df$id, lengths(lst)), year = unlist(lst))

Subset panel data by group [duplicate]

I would like to subset an unbalanced panel data set by group. For each group, I would like to keep the two observations in the first and the last years.
How do I best do this in R? For example:
dt <- data.frame(name= rep(c("A", "B", "C"), c(3,2,3)),
year=c(2001:2003,2000,2002,2000:2001,2003))
> dt
name year
1 A 2001
2 A 2002
3 A 2003
4 B 2000
5 B 2002
6 C 2000
7 C 2001
8 C 2003
What I would like to have:
name year
1 A 2001
3 A 2003
4 B 2000
5 B 2002
6 C 2000
8 C 2003
dplyr should help. Check out first() and last() to get the values you are looking for, and then filter based on those values.
dt <- data.frame(name= rep(c("A", "B", "C"), c(3,2,3)), year=c(2001:2003,2000,2002,2000:2001,2003))
library(dplyr)
dt %>%
group_by(name) %>%
mutate(first = first(year)
,last = last(year)) %>%
filter(year == first | year == last) %>%
select(name, year)
name year
1 A 2001
2 A 2003
3 B 2000
4 B 2002
5 C 2000
6 C 2003
*Your example didn't mention any specific order, but if the rows are not already sorted, arrange() will help.
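A minimal sketch of the arrange() suggestion, assuming "first" and "last" mean the earliest and latest year within each name. The rows are shuffled here only to show that the ordering matters, and slice() is swapped in for the mutate/filter step as a shorter alternative; unique() guards against a one-row group being returned twice.

```r
library(dplyr)

dt <- data.frame(name = rep(c("A", "B", "C"), c(3, 2, 3)),
                 year = c(2001:2003, 2000, 2002, 2000:2001, 2003))

# Shuffle the rows so they are no longer in year order (illustration only)
shuffled <- dt[c(3, 1, 2, 5, 4, 8, 6, 7), ]

res <- shuffled %>%
  arrange(name, year) %>%        # sort within name so row 1 is the earliest year
  group_by(name) %>%
  slice(unique(c(1, n()))) %>%   # first and last row of each group
  ungroup()

res
```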
Here's a quick possible data.table solution
library(data.table)
setDT(dt)[, .SD[c(1L, .N)], by = name]
# name year
# 1: A 2001
# 2: A 2003
# 3: B 2000
# 4: B 2002
# 5: C 2000
# 6: C 2003
Or if you only have two columns
dt[, year[c(1L, .N)], by = name]
This is pretty simple using by to split the data.frame by group and then return the head and tail of each group.
> do.call(rbind, by(dt, dt$name, function(x) rbind(head(x,1),tail(x,1))))
name year
A.1 A 2001
A.3 A 2003
B.4 B 2000
B.5 B 2002
C.6 C 2000
C.8 C 2003
head and tail are convenient, but slow, so a slightly different alternative would probably be faster on a large data.frame:
do.call(rbind, by(dt, dt$name, function(x) x[c(1,nrow(x)),]))

Subset R data frame contingent on the value of duplicate variables

How can I subset the following example data frame to only return one
observation for the earliest occurrence [i.e. min(year)] of each id?
id <- c("A", "A", "C", "D", "E", "F")
year <- c(2000, 2001, 2001, 2002, 2003, 2004)
qty <- c(100, 300, 100, 200, 100, 500)
df=data.frame(year, qty, id)
In the example above there are two observations for the "A" id, at years 2000 and 2001. In the case of duplicate ids, I would like the subset data frame to include only the first occurrence (i.e. at 2000) of the observations for the duplicate id.
df2 = subset(df, ???)
This is what I am trying to return:
df2
year qty id
2000 100 A
2001 100 C
2002 200 D
2003 100 E
2004 500 F
Any assistance would be greatly appreciated.
You can aggregate on minimum year + id, then merge with the original data frame to get qty:
df2 <- merge(aggregate(year ~ id, df, min), df)
# > df2
# id year qty
# 1 A 2000 100
# 2 C 2001 100
# 3 D 2002 200
# 4 E 2003 100
# 5 F 2004 500
Is this what you're looking for? Your second row looks wrong to me (it's the duplicated year, not the first).
> duplicated(df$year)
[1] FALSE FALSE TRUE FALSE FALSE FALSE
> df[!duplicated(df$year), ]
year qty id
1 2000 100 A
2 2001 300 A
4 2002 200 D
5 2003 100 E
6 2004 500 F
Edit 1: Er, I completely misunderstood what you were asking for. I'll keep this here for completeness though.
Edit 2:
Ok, here's a solution: Sort by year (so the first entry per ID has the earliest year) and then use duplicated. I think this is the simplest solution:
> df.sort.year <- df[order(df$year), ]
> df.sort.year[!duplicated(df.sort.year$id), ]
year qty id
1 2000 100 A
3 2001 100 C
4 2002 200 D
5 2003 100 E
6 2004 500 F
Using plyr
library(plyr)
## make sure first row will be min (year)
df <- arrange(df, id, year)
df2 <- ddply(df, .(id), head, n = 1)
df2
## year qty id
## 1 2000 100 A
## 2 2001 100 C
## 3 2002 200 D
## 4 2003 100 E
## 5 2004 500 F
or using data.table. Setting the key as id, year will ensure the first row is the minimum of year.
library(data.table)
DF <- data.table(df, key = c('id','year'))
DF[,.SD[1], by = 'id']
## id year qty
## [1,] A 2000 100
## [2,] C 2001 100
## [3,] D 2002 200
## [4,] E 2003 100
## [5,] F 2004 500
There is likely a prettier way of doing this, but this is what came to mind
# use which() to get index for each id, saving only first
first_occurance <- with(df, sapply(unique(id), function(x) which(id %in% x)[1]))
df[first_occurance,]
# year qty id
#1 2000 100 A
#3 2001 100 C
#4 2002 200 D
#5 2003 100 E
#6 2004 500 F
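With a recent dplyr, the same result can probably be had without sorting first. The sketch below is not from the answers above; it assumes dplyr 1.0.0 or later, where slice_min() is available.

```r
library(dplyr)

id   <- c("A", "A", "C", "D", "E", "F")
year <- c(2000, 2001, 2001, 2002, 2003, 2004)
qty  <- c(100, 300, 100, 200, 100, 500)
df   <- data.frame(year, qty, id)

# Keep the row with the smallest year for each id;
# with_ties = FALSE guarantees exactly one row per id
df2 <- df %>%
  group_by(id) %>%
  slice_min(year, n = 1, with_ties = FALSE) %>%
  ungroup()

df2
```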
