Subset R data frame contingent on the value of duplicate variables

How can I subset the following example data frame to return only one
observation per id, namely the earliest occurrence [i.e. min(year)]?
id <- c("A", "A", "C", "D", "E", "F")
year <- c(2000, 2001, 2001, 2002, 2003, 2004)
qty <- c(100, 300, 100, 200, 100, 500)
df=data.frame(year, qty, id)
In the example above there are two observations for the "A" id, at years 2000 and 2001. For duplicate ids, I would like the subset data frame to include only the first occurrence (i.e. the one at 2000).
df2 = subset(df, ???)
This is what I am trying to return:
df2
year qty id
2000 100 A
2001 100 C
2002 200 D
2003 100 E
2004 500 F
Any assistance would be greatly appreciated.

You can aggregate on minimum year + id, then merge with the original data frame to get qty:
df2 <- merge(aggregate(year ~ id, df, min), df)
# > df2
# id year qty
# 1 A 2000 100
# 2 C 2001 100
# 3 D 2002 200
# 4 E 2003 100
# 5 F 2004 500

Is this what you're looking for? Your second row looks wrong to me (it's the duplicated year, not the first).
> duplicated(df$year)
[1] FALSE FALSE TRUE FALSE FALSE FALSE
> df[!duplicated(df$year), ]
year qty id
1 2000 100 A
2 2001 300 A
4 2002 200 D
5 2003 100 E
6 2004 500 F
Edit 1: Er, I completely misunderstood what you were asking for. I'll keep this here for completeness though.
Edit 2:
Ok, here's a solution: Sort by year (so the first entry per ID has the earliest year) and then use duplicated. I think this is the simplest solution:
> df.sort.year <- df[order(df$year), ]
> df.sort.year[!duplicated(df.sort.year$id), ]
year qty id
1 2000 100 A
3 2001 100 C
4 2002 200 D
5 2003 100 E
6 2004 500 F
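The sort step only matters when the rows are not already in year order. Here is a minimal base R sketch of that case, assuming a shuffled copy of the question's data. The shuffled row order and the names sorted and first_by_id are made up for illustration.

```r
# Same columns as the question, but shuffled so that the earliest year
# for "A" is NOT the first "A" row (hypothetical ordering).
df <- data.frame(
  year = c(2001, 2000, 2001, 2002, 2003, 2004),
  qty  = c(300, 100, 100, 200, 100, 500),
  id   = c("A", "A", "C", "D", "E", "F"),
  stringsAsFactors = FALSE
)

# Sort by id, then year, so the earliest year comes first within each id.
sorted <- df[order(df$id, df$year), ]

# Test for duplicates on the SORTED ids, not on the original column.
first_by_id <- sorted[!duplicated(sorted$id), ]
print(first_by_id)
```

Without the sort, df[!duplicated(df$id), ] would keep the 2001 row for "A" here, because duplicated() keeps whichever row comes first.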

Using plyr
library(plyr)
## make sure first row will be min (year)
df <- arrange(df, id, year)
df2 <- ddply(df, .(id), head, n = 1)
df2
## year qty id
## 1 2000 100 A
## 2 2001 100 C
## 3 2002 200 D
## 4 2003 100 E
## 5 2004 500 F
Or use data.table. Setting the key to id, year sorts the table by those columns, so the first row for each id has the minimum year.
library(data.table)
DF <- data.table(df, key = c('id','year'))
DF[,.SD[1], by = 'id']
## id year qty
## [1,] A 2000 100
## [2,] C 2001 100
## [3,] D 2002 200
## [4,] E 2003 100
## [5,] F 2004 500

There is likely a prettier way of doing this, but this is what came to mind:
# use which() to get index for each id, saving only first
first_occurance <- with(df, sapply(unique(id), function(x) which(id %in% x)[1]))
df[first_occurance,]
# year qty id
#1 2000 100 A
#3 2001 100 C
#4 2002 200 D
#5 2003 100 E
#6 2004 500 F
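One possibly tidier variant of the same idea uses match(), which returns the position of the first match for each value. This is only a sketch, and like the which() version it assumes the rows are already sorted so that the earliest year comes first within each id. The name first_rows is illustrative.

```r
id   <- c("A", "A", "C", "D", "E", "F")
year <- c(2000, 2001, 2001, 2002, 2003, 2004)
qty  <- c(100, 300, 100, 200, 100, 500)
df   <- data.frame(year, qty, id)

# match() gives the index of the first row holding each distinct id.
first_rows <- match(unique(df$id), df$id)

# Keep only those rows.
df2 <- df[first_rows, ]
print(df2)
```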

Related

Add missing years to data frame (reshaping)

Say I have a data frame as follows:
df <- structure(list(
year = c(2001, 2001, 2002, 2003, 2001, 2002, 2003),
name = c("A", "B", "B", "B", "C", "C", "C"),
revenue = c(10, 20, 30, 40, 30, 40, 50)),
.typeOf = c("numeric", "factor", "numeric"),
row.names = c(NA, -7L),
class = "data.frame")
The first column contains years, the second names and the last revenues. As you can see, company "A" has data only for the first year, while the other companies have more data. I want to add new rows for company "A" with NA as the revenue for the following years (i.e. 2002 and 2003). For this purpose I use the following code:
df %>%
spread(year, revenue) %>%
gather(year, revenue, 2:ncol(.)) %>%
arrange(name) %>%
View()
It works reasonably well, especially for small data sets, but I am not sure my solution is sound from a programming point of view. There is probably a better solution using melt, cast (dcast) or something else. Any ideas?
EDITED: any ideas how I can do this using the pipe "%>%"?

Alternatively, you can build every combination of the unique name and year values with expand.grid and merge it with your df using all=TRUE.
merge(expand.grid(lapply(df[2:1], unique)), df, all=TRUE)
# name year revenue
#1 A 2001 10
#2 A 2002 NA
#3 A 2003 NA
#4 B 2001 20
#5 B 2002 30
#6 B 2003 40
#7 C 2001 30
#8 C 2002 40
#9 C 2003 50
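The one-liner can be unpacked into two steps, which may make it easier to follow. This is a sketch using a plain data.frame() version of the question's data. The names grid and full are illustrative.

```r
df <- data.frame(
  year    = c(2001, 2001, 2002, 2003, 2001, 2002, 2003),
  name    = c("A", "B", "B", "B", "C", "C", "C"),
  revenue = c(10, 20, 30, 40, 30, 40, 50),
  stringsAsFactors = FALSE
)

# Step 1: every name/year combination (3 names x 3 years = 9 rows).
grid <- expand.grid(name = unique(df$name), year = unique(df$year),
                    stringsAsFactors = FALSE)

# Step 2: left-join the observed revenues onto the grid; combinations
# with no matching row in df get NA for revenue.
full <- merge(grid, df, all.x = TRUE)
print(full)
```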

In data.table you can use dcast() to cast to wide, meanwhile creating a complete groupset using drop = FALSE (which keeps empty groups).
setorder( dcast( setDT(df), year + name ~ ., drop = FALSE ), name )[]
# year name .
# 1: 2001 A 10
# 2: 2002 A NA
# 3: 2003 A NA
# 4: 2001 B 20
# 5: 2002 B 30
# 6: 2003 B 40
# 7: 2001 C 30
# 8: 2002 C 40
# 9: 2003 C 50

Another data.table option:
library(data.table)
setDT(df)
df[CJ(year, name, unique = TRUE), on = c("year", "name")]
# year name revenue
# 1: 2001 A 10
# 2: 2001 B 20
# 3: 2001 C 30
# 4: 2002 A NA
# 5: 2002 B 30
# 6: 2002 C 40
# 7: 2003 A NA
# 8: 2003 B 40
# 9: 2003 C 50
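For the pipe-based version the question asks about, tidyr's complete() may be the most direct route. The following is a sketch, not taken from the answers above, and it assumes tidyr and dplyr are installed. The name completed is illustrative.

```r
library(dplyr)
library(tidyr)

df <- data.frame(
  year    = c(2001, 2001, 2002, 2003, 2001, 2002, 2003),
  name    = c("A", "B", "B", "B", "C", "C", "C"),
  revenue = c(10, 20, 30, 40, 30, 40, 50),
  stringsAsFactors = FALSE
)

# complete() adds a row for every name/year combination that is missing,
# filling the remaining columns (here revenue) with NA.
completed <- df %>%
  complete(name, year) %>%
  arrange(name, year)
print(completed)
```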

Create a new dataframe with rows for every value in a sequence between two columns in a previous dataframe [duplicate]

I have a dataframe, where two columns represent the beginning and end of a range of dates. So:
df <- data.frame(var=c("A", "B"), start_year=c(2000, 2002), end_year=c(2005, 2004))
> df
var start_year end_year
1 A 2000 2005
2 B 2002 2004
And I'd like to create a new dataframe, where there is a row for every value between start_year and end_year, for each var.
So the result should look like:
> newdf
var year
1 A 2000
2 A 2001
3 A 2002
4 A 2003
5 A 2004
6 A 2005
7 B 2002
8 B 2003
9 B 2004
Ideally this would involve something from the tidyverse. I've been trying different things with dplyr::group_by and tidyr::gather, but I'm not having any luck.

As akrun demonstrated, it's probably easier to do it without gather and group_by (as mentioned in the question). But in case you're curious how to do it that way, here it is
df %>%
gather(key, value, -var) %>%
group_by(var) %>%
expand(year = value[1]:value[2])
# # A tibble: 9 x 2
# # Groups: var [2]
# var year
# <fct> <int>
# 1 A 2000
# 2 A 2001
# 3 A 2002
# 4 A 2003
# 5 A 2004
# 6 A 2005
# 7 B 2002
# 8 B 2003
# 9 B 2004
Here's the same idea, convert to long and expand, in data.table (same output)
library(data.table)
setDT(df)
melt(df, 'var')[, .(year = value[1]:value[2]), var]
Edit: As markus points out, you don't need to convert to long first with data.table, you can do it in one step (not counting the two lines library/setDT in the code block above). This is a similar approach to akrun's tidyverse answer.
df[, .(year = start_year:end_year), by=var]

We can use map2 to get the sequence from 'start_year' to 'end_year' and unnest the list column to expand the data into 'long' format
library(tidyverse)
df %>%
transmute(var, year = map2(start_year, end_year, `:`)) %>%
unnest(year)
# var year
#1 A 2000
#2 A 2001
#3 A 2002
#4 A 2003
#5 A 2004
#6 A 2005
#7 B 2002
#8 B 2003
#9 B 2004
Or another option is complete
df %>%
group_by(var) %>%
complete(start_year = start_year:end_year) %>%
select(var, year = start_year)
Or in base R with stack and Map
stack(setNames(do.call(Map, c(f = `:`, df[-1])), df$var))
NOTE: First posted the solution with Map and stack
In case of other variations,
stack(setNames(Map(`:`, df[[2]], df[[3]]), df$var))
stack(setNames(do.call(mapply, c(FUN = `:`, df[-1])), df$var))

A short base R solution with seq.
stack(setNames(Map(seq, df[[2]], df[[3]]), df[[1]]))
# values ind
# 1 2000 A
# 2 2001 A
# 3 2002 A
# 4 2003 A
# 5 2004 A
# 6 2005 A
# 7 2002 B
# 8 2003 B
# 9 2004 B
Data
df <- structure(list(var = structure(1:2, .Label = c("A", "B"), class = "factor"),
start_year = c(2000, 2002), end_year = c(2005, 2004)), class = "data.frame", row.names = c(NA,
-2L))
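A vectorised base R variant avoids building a list at all by using rep() and sequence(). This is a sketch rather than one of the posted answers. The names n and newdf are illustrative, and it assumes end_year is never smaller than start_year.

```r
df <- data.frame(var = c("A", "B"),
                 start_year = c(2000, 2002),
                 end_year = c(2005, 2004),
                 stringsAsFactors = FALSE)

# Number of years covered by each row (inclusive of both ends).
n <- df$end_year - df$start_year + 1

# Repeat each var n times; sequence(n) yields 1..n[1], 1..n[2], ...,
# so adding it to the repeated start year (minus 1) gives each year.
newdf <- data.frame(
  var  = rep(df$var, times = n),
  year = rep(df$start_year, times = n) + sequence(n) - 1
)
print(newdf)
```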

Impute/ fill in missing values between time periods

I have data that often contains missing observations between time periods. I want to fill in those observations, properly incrementing the time periods, but conditional on the values of the observations. Here's an example:
df <- data.frame(id=c("a","a","b","b"), group=c("x","x","y","z"), year=c(2000,2003,2003,2005))
Which gives the 4 observation data frame
id group year
1 a x 2000
2 a x 2003
3 b y 2003
4 b z 2005
I would like to have 2 additional observations here (between #1 and #2) for 2001 and 2002, since observation #1 and #2 match on id and group. But I don't want additional observation between #3 and #4 because the id and group do not match.

You can use full_seq from tidyr - it was created exactly for tasks like this (Create the full sequence of values in a vector):
library(tidyr)
library(dplyr)
df %>%
group_by(id, group) %>%
complete(year = full_seq(year, period = 1))
id group year
<fct> <fct> <dbl>
1 a x 2000
2 a x 2001
3 a x 2002
4 a x 2003
5 b y 2003
6 b z 2005

Or using data.table
library(data.table)
setDT(df)[, .(year = year[1]:year[.N]), .(id, group)]
# id group year
#1: a x 2000
#2: a x 2001
#3: a x 2002
#4: a x 2003
#5: b y 2003
#6: b z 2005
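The same grouping logic can be sketched in base R with split() and lapply(), in case neither package is available. The names pieces and filled are illustrative.

```r
df <- data.frame(id = c("a", "a", "b", "b"),
                 group = c("x", "x", "y", "z"),
                 year = c(2000, 2003, 2003, 2005),
                 stringsAsFactors = FALSE)

# Split into one piece per id/group combination that actually occurs.
pieces <- lapply(
  split(df, list(df$id, df$group), drop = TRUE),
  function(g) {
    # Fill every year between the first and last observed year.
    data.frame(id = g$id[1], group = g$group[1],
               year = seq(min(g$year), max(g$year)),
               stringsAsFactors = FALSE)
  }
)

# Stack the pieces back into one data frame.
filled <- do.call(rbind, pieces)
rownames(filled) <- NULL
print(filled)
```

Rows 3 and 4 of the input fall into different id/group pieces, so no years are filled in between them.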

R - select and assign value to group based on condition in column

Having a data frame that looks like the following:
d
year pos days sal
1 2009 A 31 2000
2 2009 B 60 4000
3 2009 C 10 600
4 2010 B 10 1000
5 2010 D 90 7000
I would like to group data by year, adding days and sal, and select pos where days is maximum in the group.
The result should be like:
year pos days sal
1 2009 B 101 6600
2 2010 D 100 8000
I could deal with numeric values such as days and sal using functions like tapply(d$days, d$year, sum).
However, I have no idea how I can select pos that meets a condition on days and assign it to the group.
Any comments will be greatly appreciated!

We can use dplyr. After grouping by 'year', get the 'pos' where 'days' is at its maximum (which.max(days)), as well as the sums of 'days' and 'sal'.
library(dplyr)
d %>%
group_by(year) %>%
summarise(pos = pos[which.max(days)], days = sum(days), sal = sum(sal))
# # A tibble: 2 × 4
# year pos days sal
# <int> <chr> <int> <int>
#1 2009 B 101 6600
#2 2010 D 100 8000

A solution with base R:
m1 <- d[as.logical(with(d, ave(days, year, FUN = function(x) seq_along(x) == which.max(x)) )), c('year','pos')]
m2 <- aggregate(cbind(days, sal) ~ year, d, sum)
merge(m1, m2, by = 'year')
Or with the data.table package:
library(data.table)
setDT(d)[order(days), .(pos = pos[.N], days = sum(days), sal = sum(sal)), by = year]
The resulting data.frame / data.table:
year pos days sal
1 2009 B 101 6600
2 2010 D 100 8000
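The question only shows d as printed output, so here is a self-contained base R sketch that rebuilds it and applies the same two steps (sum per year, then pick the pos with the most days). The column types and the names totals and top_pos are assumptions made for illustration.

```r
d <- data.frame(
  year = c(2009, 2009, 2009, 2010, 2010),
  pos  = c("A", "B", "C", "B", "D"),
  days = c(31, 60, 10, 10, 90),
  sal  = c(2000, 4000, 600, 1000, 7000),
  stringsAsFactors = FALSE
)

# Sum days and sal within each year.
totals <- aggregate(cbind(days, sal) ~ year, d, sum)

# For each year, take the pos on the row where days is largest.
top_pos <- sapply(split(d, d$year),
                  function(g) as.character(g$pos[which.max(g$days)]))

# Attach pos to the totals, matching on year.
totals$pos <- unname(top_pos[as.character(totals$year)])
print(totals)
```

Note that which.max() returns only the first maximum, so ties in days are resolved in favour of the earlier row.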

With sqldf:
library(sqldf)
cbind.data.frame(sqldf('select year, sum(days) as days, sum(sal) as sal
from d group by year'),
sqldf('select pos from d group by year having days=max(days)'))
year days sal pos
1 2009 101 6600 B
2 2010 100 8000 D

Subset panel data by group [duplicate]

I would like to subset an unbalanced panel data set by group. For each group, I would like to keep the two observations in the first and the last years.
How do I best do this in R? For example:
dt <- data.frame(name= rep(c("A", "B", "C"), c(3,2,3)),
year=c(2001:2003,2000,2002,2000:2001,2003))
> dt
name year
1 A 2001
2 A 2002
3 A 2003
4 B 2000
5 B 2002
6 C 2000
7 C 2001
8 C 2003
What I would like to have:
name year
1 A 2001
3 A 2003
4 B 2000
5 B 2002
6 C 2000
8 C 2003

dplyr should help. Check out first() and last() to get the values you are looking for, and then filter based on those values.
dt <- data.frame(name= rep(c("A", "B", "C"), c(3,2,3)), year=c(2001:2003,2000,2002,2000:2001,2003))
library(dplyr)
dt %>%
group_by(name) %>%
mutate(first = first(year)
,last = last(year)) %>%
filter(year == first | year == last) %>%
select(name, year)
name year
1 A 2001
2 A 2003
3 B 2000
4 B 2002
5 C 2000
6 C 2003
*Your example didn't mention any specific order, but if the rows are not already sorted by year within each name, arrange() will help.

Here's a quick possible data.table solution
library(data.table)
setDT(dt)[, .SD[c(1L, .N)], by = name]
# name year
# 1: A 2001
# 2: A 2003
# 3: B 2000
# 4: B 2002
# 5: C 2000
# 6: C 2003
Or if you only have two columns
dt[, year[c(1L, .N)], by = name]

This is pretty simple using by to split the data.frame by group and then return the head and tail of each group.
> do.call(rbind, by(dt, dt$name, function(x) rbind(head(x,1),tail(x,1))))
name year
A.1 A 2001
A.3 A 2003
B.4 B 2000
B.5 B 2002
C.6 C 2000
C.8 C 2003
head and tail are convenient, but slow, so a slightly different alternative would probably be faster on a large data.frame:
do.call(rbind, by(dt, dt$name, function(x) x[c(1,nrow(x)),]))
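Another base R option skips the split entirely by calling duplicated() twice, once from each end. This is a sketch: the explicit sort is an assumption added so that "first" and "last" mean the earliest and latest year, and the names is_first, is_last and res are illustrative.

```r
dt <- data.frame(name = rep(c("A", "B", "C"), c(3, 2, 3)),
                 year = c(2001:2003, 2000, 2002, 2000:2001, 2003))

# Make sure rows are in year order within each name.
dt <- dt[order(dt$name, dt$year), ]

# First row per name: not seen before when scanning from the top.
is_first <- !duplicated(dt$name)
# Last row per name: not seen before when scanning from the bottom.
is_last <- !duplicated(dt$name, fromLast = TRUE)

res <- dt[is_first | is_last, ]
print(res)
```

A group with a single row is returned once rather than twice, which differs from the head/tail approach above.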
