I was wondering whether it is possible to get the first row of every year per group.
library(data.table)
dt <- data.table(Group = c(rep("A", 4), rep("B", 3), rep("C", 3)),
A = c(1:10),
B = c(10:1),
Year = c(2003:2006, 2004:2006, 2007, 2008, 2009))
The data is as follows
Group A B Year
1: A 1 10 2003
2: A 2 9 2004
3: A 3 8 2005
4: A 4 7 2006
5: B 5 6 2004
6: B 6 5 2005
7: B 7 4 2006
8: C 8 3 2007
9: C 9 2 2008
10: C 10 1 2009
What I would like to get is the row with the earliest year per group, but I can't seem to get it right:
dt[min(Year) == Year, by = Group]
How should I do this selection?
Try:
dt[, .SD[which.min(Year)], by = Group]
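As a minimal sketch of what the answer above does, the following rebuilds the question's data and compares the `.SD` form with an alternative that first collects row numbers via `.I`. The `.I` variant and the names `res`, `idx` and `res2` are illustrative assumptions, not part of the original answer.

```r
library(data.table)

# Data from the question
dt <- data.table(Group = c(rep("A", 4), rep("B", 3), rep("C", 3)),
                 A = 1:10,
                 B = 10:1,
                 Year = c(2003:2006, 2004:2006, 2007, 2008, 2009))

# Form used in the answer: within each group, keep the row of .SD
# whose Year is smallest
res <- dt[, .SD[which.min(Year)], by = Group]

# Alternative sketch: collect the global row numbers first, then subset once
idx  <- dt[, .I[which.min(Year)], by = Group]$V1
res2 <- dt[idx]

print(res)
```

Both forms return one row per group. The `.I` form avoids building `.SD` for every group, which may matter on large tables.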
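With dplyr, you can get the earliest year per group as follows. Note that summarise() returns only Group and the minimum year, not the full row:
library(dplyr)
dt_min <- dt %>% group_by(Group) %>% summarise(new_year = min(Year))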
Related
I would like to subset a data.table by condition "Year". Basically I want the data from the dt that matches a given year, per group. However, some groups do not have a complete time line across all years, and therefore I would like to return the nearest year's data for every group, so there are data for every group present for any year chosen (whether that is exactly the right year, or not).
library(data.table)
# make dummy data
dt <- data.table(Group = c(rep("A", 5),rep("B", 3),rep("C", 5),rep("D", 2)),
x = sample(1:10,15, rep=T), Year = c(2011:2015, 2013:2015, 2011:2015, 2014:2015))
# subset by, e.g., Year == 2015 is fine, but I want a full result for ANY
# year chosen, such as 2012, by using the closest entry in time, per group.
# Attempt;
y <- 2012
dt[Year == which.min(abs(Year - y)), .SD, by = Group]
Empty data.table (0 rows and 3 cols): Group,x,Year
The result in this example should be;
Group x Year
1: A 4 2012
2: B 7 2013
3: C 2 2012
4: D 3 2014
You are close: the use of which.min(abs(Year - y)) is good, but needs to be within the .SD-subsetting in the j portion.
dt[, .SD[which.min(abs(Year - y)),], Group]
# Group x Year
# <char> <int> <int>
# 1: A 5 2012
# 2: B 4 2013
# 3: C 8 2012
# 4: D 5 2014
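One detail worth knowing about this approach is how ties behave: which.min() returns the first minimum, so when two years are equally close, the earlier row in the group wins. A small sketch with made-up, deterministic data (the helper `nearest_year` and the values below are hypothetical, chosen only for illustration):

```r
library(data.table)

# Hypothetical data, not the sampled data from the question
dt_demo <- data.table(Group = c("A", "A", "A", "B", "B"),
                      x     = c(10, 20, 30, 40, 50),
                      Year  = c(2011, 2012, 2013, 2013, 2015))

# Illustrative helper: per group, keep the row whose Year is closest to y
nearest_year <- function(d, y) {
  d[, .SD[which.min(abs(Year - y))], by = Group]
}

res <- nearest_year(dt_demo, 2012)  # A has 2012 exactly; B falls back to 2013
tie <- nearest_year(dt_demo, 2014)  # for B, 2013 and 2015 are both one year away

print(res)
print(tie)
```

In the tie case, group B returns 2013 because that row comes first. If the later year should win instead, the rows would need to be ordered accordingly before subsetting.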
Reproducible data
set.seed(42)
dt <- data.table(Group = c(rep("A", 5),rep("B", 3),rep("C", 5),rep("D", 2)), x = sample(1:10,15, rep=T), Year = c(2011:2015, 2013:2015, 2011:2015, 2014:2015))
dt
# Group x Year
# <char> <int> <int>
# 1: A 1 2011
# 2: A 5 2012
# 3: A 1 2013
# 4: A 9 2014
# 5: A 10 2015
# 6: B 4 2013
# 7: B 2 2014
# 8: B 10 2015
# 9: C 1 2011
# 10: C 8 2012
# 11: C 7 2013
# 12: C 4 2014
# 13: C 9 2015
# 14: D 5 2014
# 15: D 4 2015
y <- 2012
I have large quantities of datasets in Excel that I would like to analyze in R. The files have a format that organizes all information per block of the same year, which looks like:
Group <- c(2010, 'Group', 'A', 'B', 'C', 2011, 'Group', 'A', 'B', 'E', 2012, 'Group', 'A', 'B')
Value <- c(NA,'Value', 1, 2, 9, NA, 'Value', 3, 5, 2, NA, 'Value', 9, 1)
df <- cbind(Group, Value)
Group Value
1: 2010 NA
2: Group Value
3: A 1
4: B 2
5: C 9
6: 2011 NA
7: Group Value
8: A 3
9: B 5
10: E 2
11: 2012 NA
12: Group Value
13: A 9
14: B 1
To be able to analyze the data, I would like to automatically add a column for the year so that all data can be combined, as follows:
Year Group Value
1: 2010 A 1
2: 2010 B 2
3: 2010 C 9
4: 2011 A 3
5: 2011 B 5
6: 2011 E 2
7: 2012 A 9
8: 2012 B 1
library(data.table)
dt <- data.table(df)
dt[, Year := Group[1], cumsum(is.na(Value))][Value != 'Value']
Group Value Year
1: A 1 2010
2: B 2 2010
3: C 9 2010
4: A 3 2011
5: B 5 2011
6: E 2 2011
7: A 9 2012
8: B 1 2012
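The grouping expression cumsum(is.na(Value)) is what makes this work: every year row has an NA in Value, so the running count of NAs increases by one at the start of each block and stays constant within it. A base R sketch of just that step, using the vectors from the question (`block` and `first_of_block` are illustrative names):

```r
Group <- c(2010, 'Group', 'A', 'B', 'C', 2011, 'Group', 'A', 'B', 'E', 2012, 'Group', 'A', 'B')
Value <- c(NA, 'Value', 1, 2, 9, NA, 'Value', 3, 5, 2, NA, 'Value', 9, 1)

# Running count of NA rows: one id per year block
block <- cumsum(is.na(Value))
print(block)

# The first Group entry of each block is the year for that block
first_of_block <- tapply(Group, block, function(g) g[1])
print(first_of_block)
```

Assigning `Group[1]` within each block, as the data.table answer does, then copies the year down to every row of its block.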
In base R (the \(x) lambda syntax requires R 4.1 or later):
subset(transform(df, Year = ave(Group, cumsum(is.na(Value)), FUN=\(x)x[1])), Value != 'Value')
Group Value Year
3 A 1 2010
4 B 2 2010
5 C 9 2010
8 A 3 2011
9 B 5 2011
10 E 2 2011
13 A 9 2012
14 B 1 2012
Note that all of the resulting columns are character. To convert them to appropriate classes, use type.convert(new_df, as.is = TRUE), where new_df is the resulting data frame.
Here is one method with the tidyverse. Create the 'Year' column from the 'Group' values that are 4-digit numbers, drop the header rows where 'Group' equals 'Group', fill 'Year' downward with the previous non-NA value, use duplicated() to drop the first row of each year (the year header itself), and convert the column types with type.convert():
library(dplyr)
library(stringr)
library(tidyr)
df %>%
mutate(Year = case_when(str_detect(Group, "^\\d{4}$") ~ Group)) %>%
filter(Group != 'Group') %>%
fill(Year) %>%
filter(duplicated(Year)) %>%
type.convert(as.is = TRUE) %>%
select(Year, Group, Value)
Output:
Year Group Value
1 2010 A 1
2 2010 B 2
3 2010 C 9
4 2011 A 3
5 2011 B 5
6 2011 E 2
7 2012 A 9
8 2012 B 1
data
df <- data.frame(Group, Value)
I have a dataframe, where two columns represent the beginning and end of a range of dates. So:
df <- data.frame(var=c("A", "B"), start_year=c(2000, 2002), end_year=c(2005, 2004))
> df
var start_year end_year
1 A 2000 2005
2 B 2002 2004
And I'd like to create a new dataframe, where there is a row for every value between start_year and end_year, for each var.
So the result should look like:
> newdf
var year
1 A 2000
2 A 2001
3 A 2002
4 A 2003
5 A 2004
6 A 2005
7 B 2002
8 B 2003
9 B 2004
Ideally this would involve something from the tidyverse. I've been trying different things with dplyr::group_by and tidyr::gather, but I'm not having any luck.
As akrun demonstrated, it's probably easier to do it without gather and group_by (as mentioned in the question). But in case you're curious how to do it that way, here it is:
library(dplyr)
library(tidyr)
df %>%
gather(key, value, -var) %>%
group_by(var) %>%
expand(year = value[1]:value[2])
# # A tibble: 9 x 2
# # Groups: var [2]
# var year
# <fct> <int>
# 1 A 2000
# 2 A 2001
# 3 A 2002
# 4 A 2003
# 5 A 2004
# 6 A 2005
# 7 B 2002
# 8 B 2003
# 9 B 2004
Here's the same idea, convert to long and expand, in data.table (same output)
library(data.table)
setDT(df)
melt(df, 'var')[, .(year = value[1]:value[2]), var]
Edit: As markus points out, you don't need to convert to long first with data.table, you can do it in one step (not counting the two lines library/setDT in the code block above). This is a similar approach to akrun's tidyverse answer.
df[, .(year = start_year:end_year), by=var]
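The one-step form relies on each var appearing in exactly one row, so that start_year and end_year are single values inside each group. A hedged sketch of that case, plus one possible workaround when var repeats (the table `df2` and the helper column `row` are hypothetical, introduced only for illustration):

```r
library(data.table)

# Data from the question: one row per var
df <- data.table(var = c("A", "B"),
                 start_year = c(2000, 2002),
                 end_year = c(2005, 2004))
res <- df[, .(year = start_year:end_year), by = var]

# Hypothetical case where var repeats: group by row number instead,
# carry var along in j, then drop the helper column
df2 <- data.table(var = c("A", "A"),
                  start_year = c(2000, 2010),
                  end_year = c(2001, 2011))
res2 <- df2[, .(var, year = start_year:end_year),
            by = .(row = seq_len(nrow(df2)))][, row := NULL]

print(res)
print(res2)
```

Grouping by row number keeps each range separate, whereas grouping by a repeated var would hand `:` vectors of length greater than one.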
We can use map2 to get the sequence from 'start_year' to 'end_year' and unnest the list column to expand the data into 'long' format
library(tidyverse)
df %>%
transmute(var, year = map2(start_year, end_year, `:`)) %>%
unnest(year)
# var year
#1 A 2000
#2 A 2001
#3 A 2002
#4 A 2003
#5 A 2004
#6 A 2005
#7 B 2002
#8 B 2003
#9 B 2004
Or another option is complete
df %>%
group_by(var) %>%
complete(start_year = start_year:end_year) %>%
select(var, year = start_year)
Or in base R with stack and Map
stack(setNames(do.call(Map, c(f = `:`, df[-1])), df$var))
NOTE: First posted the solution with Map and stack
In case of other variations,
stack(setNames(Map(`:`, df[[2]], df[[3]]), df$var))
stack(setNames(do.call(mapply, c(FUN = `:`, df[-1])), df$var))
A short base R solution with seq.
stack(setNames(Map(seq, df[[2]], df[[3]]), df[[1]]))
# values ind
# 1 2000 A
# 2 2001 A
# 3 2002 A
# 4 2003 A
# 5 2004 A
# 6 2005 A
# 7 2002 B
# 8 2003 B
# 9 2004 B
Data
df <- structure(list(var = structure(1:2, .Label = c("A", "B"), class = "factor"),
start_year = c(2000, 2002), end_year = c(2005, 2004)), class = "data.frame", row.names = c(NA,
-2L))
I am trying to expand a data frame in R with missing observations that are not immediately obvious. Here is what I mean:
data.frame(id = c("a","b"),start = c(2002,2004), end = c(2005,2007))
Which is:
id start end
1 a 2002 2005
2 b 2004 2007
What I would like is a new data frame with 8 total observations, 4 each for "a" and "b", and a year that is one of the values between start and end (inclusive). So:
id year
a 2002
a 2003
a 2004
a 2005
b 2004
b 2005
b 2006
b 2007
As I understand, various versions of expand only work on unique values, but here my data frame doesn't have all the unique values (explicitly).
I was thinking to step through each row and then generate a data frame with sapply(), then join all the new data frames together. But this attempt fails:
sapply(test,function(x) { data.frame( id=rep(id,x[["end"]]-x[["start"]]), year = x[["start"]]:x[["end"]] )})
I know there must be some dplyr or other magic to solve this problem!
You could use tidyr and dplyr:
library(tidyr)
library(dplyr)
df %>%
gather(key = key, value = year, -id) %>%
select(-key) %>%
group_by(id) %>%
complete(year = full_seq(year,1))
# A tibble: 8 x 2
# Groups: id [2]
id year
<fct> <dbl>
1 a 2002
2 a 2003
3 a 2004
4 a 2005
5 b 2004
6 b 2005
7 b 2006
8 b 2007
Using dplyr and tidyr, I make a new column which contains the list of years, then unnest the dataframe.
library(tidyr)
library(dplyr)
df <-
data.frame(
id = c("a", "b"),
start = c(2002, 2004),
end = c(2005, 2007)
)
df %>%
rowwise() %>%
mutate(year = list(seq(start, end))) %>%
select(-start, -end) %>%
unnest(year)
Output
# A tibble: 8 x 2
id year
<fct> <int>
1 a 2002
2 a 2003
3 a 2004
4 a 2005
5 b 2004
6 b 2005
7 b 2006
8 b 2007
An easy solution with data.table:
library(data.table)
# option 1
setDT(df)[, .(year = seq(start, end)), by = id]
# option 2
setDT(df)[, .(year = start:end), by = id]
which gives:
id year
1: a 2002
2: a 2003
3: a 2004
4: a 2005
5: b 2004
6: b 2005
7: b 2006
8: b 2007
An approach with base R:
lst <- Map(seq, df$start, df$end)
data.frame(id = rep(df$id, lengths(lst)), year = unlist(lst))
I would like to subset an unbalanced panel data set by group. For each group, I would like to keep the two observations in the first and the last years.
How do I best do this in R? For example:
dt <- data.frame(name= rep(c("A", "B", "C"), c(3,2,3)),
year=c(2001:2003,2000,2002,2000:2001,2003))
> dt
name year
1 A 2001
2 A 2002
3 A 2003
4 B 2000
5 B 2002
6 C 2000
7 C 2001
8 C 2003
What I would like to have:
name year
1 A 2001
3 A 2003
4 B 2000
5 B 2002
6 C 2000
8 C 2003
dplyr should help. Check out first() and last() to get the values you are looking for, and then filter based on those values.
dt <- data.frame(name= rep(c("A", "B", "C"), c(3,2,3)), year=c(2001:2003,2000,2002,2000:2001,2003))
library(dplyr)
dt %>%
group_by(name) %>%
mutate(first = first(year)
,last = last(year)) %>%
filter(year == first | year == last) %>%
select(name, year)
name year
1 A 2001
2 A 2003
3 B 2000
4 B 2002
5 C 2000
6 C 2003
Note: your example didn't mention any specific order. first() and last() depend on row order, so if the data is not already sorted by year within each group, arrange() will help.
Here's a quick possible data.table solution
library(data.table)
setDT(dt)[, .SD[c(1L, .N)], by = name]
# name year
# 1: A 2001
# 2: A 2003
# 3: B 2000
# 4: B 2002
# 5: C 2000
# 6: C 2003
Or if you only have two columns
dt[, year[c(1L, .N)], by = name]
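One edge case to keep in mind with c(1L, .N): when a group has a single row, the first and last row are the same, so that row is returned twice. A minimal sketch with made-up data showing the effect and one possible guard using unique() (the data and the names `dup` and `dedup` are illustrative assumptions):

```r
library(data.table)

# Hypothetical data: group B has only one observation
dt_demo <- data.table(name = c("A", "A", "A", "B"),
                      year = c(2001, 2002, 2003, 2000))

# As in the answer: first and last row per group (B's row appears twice)
dup <- dt_demo[, .SD[c(1L, .N)], by = name]

# Guarded variant: unique() collapses c(1L, 1L) to a single index
dedup <- dt_demo[, .SD[unique(c(1L, .N))], by = name]

print(dup)
print(dedup)
```

Whether the duplicated row is a problem depends on what the result is used for; for the balanced example in the question it never occurs.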
This is pretty simple using by to split the data.frame by group and then return the head and tail of each group.
> do.call(rbind, by(dt, dt$name, function(x) rbind(head(x,1),tail(x,1))))
name year
A.1 A 2001
A.3 A 2003
B.4 B 2000
B.5 B 2002
C.6 C 2000
C.8 C 2003
head and tail are convenient, but slow, so a slightly different alternative would probably be faster on a large data.frame:
do.call(rbind, by(dt, dt$name, function(x) x[c(1,nrow(x)),]))