Select first row for each id for each year - r

Say I have a dataset below where each id can have multiple records per year. I would like to keep only the id's most recent record per year.
id<-c(1,1,1,2,2,2)
year<-c(2020,2020,2019,2020,2018,2018)
month<-c(12,6,4,5,4,1)
have<-as.data.frame(cbind(id,year,month))
have
id year month
1 2020 12
1 2020 6
1 2019 4
2 2020 5
2 2018 4
2 2018 1
This is what I would like the dataset to look like:
want
id year month
1 2020 12
1 2019 4
2 2020 5
2 2018 4
I know that I can get the first instance of each id with this code, however I want the latest record for each year.
want<-have[match(unique(have$id), have$id),]
id year month
1 2020 12
2 2020 5
I modified the code to add in year, but it outputs the same results as the code above:
want<-have[match(unique(have$id,have$year), have$id),]
id year month
1 2020 12
2 2020 5
How would I modify this so I can see one record displayed per year?

You can use dplyr::slice_max, which keeps the row with the largest month in each id/year group:
library(dplyr)
have %>%
group_by(id,year) %>%
slice_max(order_by = month)
Output:
id year month
<dbl> <dbl> <dbl>
1 1 2019 4
2 1 2020 12
3 2 2018 4
4 2 2020 5
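If a group can contain the same month more than once, slice_max keeps every tied row by default. A minimal sketch, assuming exactly one row per id and year is wanted even when there are ties (the extra month-12 row below is made up for illustration and is not in the question's data):

```r
library(dplyr)

# Sample data from the question plus one hypothetical tied row (id 1, 2020, month 12)
have <- data.frame(
  id    = c(1, 1, 1, 2, 2, 2, 1),
  year  = c(2020, 2020, 2019, 2020, 2018, 2018, 2020),
  month = c(12, 6, 4, 5, 4, 1, 12)
)

latest <- have %>%
  group_by(id, year) %>%
  # with_ties = FALSE returns exactly one row per group, even if months repeat
  slice_max(order_by = month, n = 1, with_ties = FALSE) %>%
  ungroup()

latest
```

Leaving with_ties at its default of TRUE would return both month-12 rows for id 1 in 2020.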

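We could group and then summarise with first(). This relies on the rows already being ordered so that the latest month comes first within each id and year, as in the sample data; otherwise sort with arrange(desc(month)) beforehand.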
library(dplyr)
have %>%
group_by(id, year) %>%
summarise(month = first(month))
id year month
<dbl> <dbl> <dbl>
1 1 2019 4
2 1 2020 12
3 2 2018 4
4 2 2020 5

You can also use group_by in dplyr with tally, grouping by both id and year so that each id gets its own row per year:
have %>% group_by(id, year) %>% tally(max(month), name = "month")
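For a base R route closer to the attempt in the question: unique() takes a single object, so in unique(have$id, have$year) the second vector is not treated as a second key (it is matched to the incomparables argument), which is why the result did not change. A minimal sketch, assuming it is acceptable to sort the data first and then keep the first row of each id/year pair with duplicated():

```r
have <- data.frame(
  id    = c(1, 1, 1, 2, 2, 2),
  year  = c(2020, 2020, 2019, 2020, 2018, 2018),
  month = c(12, 6, 4, 5, 4, 1)
)

# Sort so the latest month comes first within each id/year pair
sorted <- have[order(have$id, -have$year, -have$month), ]

# Keep only the first row of every id/year combination
want <- sorted[!duplicated(sorted[c("id", "year")]), ]
rownames(want) <- NULL
want
```

This needs no packages, at the cost of an explicit sort.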


Ranking of values in one quarter [duplicate]

I am trying to implement a calculation that will rank the Price values in a separate partition. Below you can see my data
df<-data.frame( year=c(2010,2010,2010,2010,2010,2010),
quarter=c("q1","q1","q1","q2","q2","q2"),
Price=c(10,20,30,10,20,30)
)
df
Now I want to rank within each quarter, and I expect 1 for the smallest Price and 3 for the highest:
df %>% group_by(quarter) %>% mutate(id = row_number(Price))
Instead of the expected result, I got something different: rather than ranking within each quarter separately, the ranking runs across both quarters.
Can anybody help me solve this so that I get the results shown in the table below?
You probably want rank.
transform(df, id=ave(Price, year, quarter, FUN=rank))
# year quarter Price id
# 1 2010 q1 10 1
# 2 2010 q1 20 2
# 3 2010 q1 30 3
# 4 2010 q2 10 1
# 5 2010 q2 20 2
# 6 2010 q2 30 3
With dplyr, use dense_rank
library(dplyr)
df %>%
group_by(quarter) %>%
mutate(id = dense_rank(Price)) %>%
ungroup
# A tibble: 6 × 4
year quarter Price id
<dbl> <chr> <dbl> <int>
1 2010 q1 10 1
2 2010 q1 20 2
3 2010 q1 30 3
4 2010 q2 10 1
5 2010 q2 20 2
6 2010 q2 30 3
In dplyr 1.1.0 and later, you can also use .by in mutate
df %>%
mutate(id = dense_rank(Price), .by = 'quarter')
year quarter Price id
1 2010 q1 10 1
2 2010 q1 20 2
3 2010 q1 30 3
4 2010 q2 10 1
5 2010 q2 20 2
6 2010 q2 30 3
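Alternatively with row_number(). Called without an argument it numbers rows in their current order within each group, so this matches the ranking only because the sample data are already sorted by Price; use row_number(Price) otherwise.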
library(tidyverse)
df %>% group_by(year, quarter) %>% mutate(id=row_number())
Created on 2023-02-12 with reprex v2.0.2
# A tibble: 6 × 4
# Groups: year, quarter [2]
year quarter Price id
<dbl> <chr> <dbl> <int>
1 2010 q1 10 1
2 2010 q1 20 2
3 2010 q1 30 3
4 2010 q2 10 1
5 2010 q2 20 2
6 2010 q2 30 3
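The ranking helpers mentioned in these answers differ only in how they treat ties. A small sketch on hypothetical data (df_ties is made up here, with a repeated Price in q1) to show the difference:

```r
library(dplyr)

# Hypothetical data with a tied Price inside q1
df_ties <- data.frame(
  quarter = c("q1", "q1", "q1", "q2", "q2", "q2"),
  Price   = c(10, 10, 30, 10, 20, 30)
)

ranked <- df_ties %>%
  group_by(quarter) %>%
  mutate(
    row_num = row_number(Price),  # ties broken by position
    min_rk  = min_rank(Price),    # ties share the lowest rank, leaving a gap
    dense   = dense_rank(Price)   # ties share a rank, no gap
  ) %>%
  ungroup()

ranked
```

With no ties, as in the question's data, all three give the same 1, 2, 3 within each quarter.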

Calculating cumulative sum for multiple columns in R

R newb, I'm trying to calculate the cumulative sum grouped by year, month, group and subgroup, also having multiple columns to calculate.
Sample of the data:
df <- data.frame("Year"=2020,
"Month"=c("Jan","Jan","Jan","Jan","Feb","Feb","Feb","Feb"),
"Group"=c("A","A","A","B","A","B","B","B"),
"SubGroup"=c("a","a","b","b","a","b","a","b"),
"V1"=c(10,10,20,20,50,50,10,10),
"V2"=c(0,1,2,2,0,5,1,1))
Year Month Group SubGroup V1 V2
1 2020 Jan A a 10 0
2 2020 Jan A a 10 1
3 2020 Jan A b 20 2
4 2020 Jan B b 20 2
5 2020 Feb A a 50 0
6 2020 Feb B b 50 5
7 2020 Feb B a 10 1
8 2020 Feb B b 10 1
Resulting Table wanted:
Year Month Group SubGroup V1 V2
1 2020 Jan A a 20 1
2 2020 Feb A a 70 1
3 2020 Jan A b 20 2
4 2020 Feb A b 20 2
5 2020 Jan B a 0 0
6 2020 Feb B a 10 1
7 2020 Jan B b 20 2
8 2020 Feb B b 80 8
From Sample Table, on Jan 2020, the sum of Group 'A' Subgroup 'a' was 10+10 = 20... On Feb 2020, the value was 50, therefore 20 from Jan + 50 = 70, and so on...
If there is no value, it should consider 0.
I've tried a few approaches, but none got even close to the output I need. I would really appreciate it if someone could give me some tips on this problem.
This is a simple group_by/mutate problem. The columns V1, V2 are chosen with across and cumsum applied to them. Note that this keeps one row per input row rather than one row per month.
library(dplyr)
df$Month <- factor(df$Month, levels = c("Jan", "Feb"))
df %>%
group_by(Year, Group, SubGroup) %>%
mutate(across(V1:V2, ~cumsum(.x))) %>%
ungroup() %>%
arrange(Year, Group, SubGroup, Month)
## A tibble: 8 x 6
# Year Month Group SubGroup V1 V2
# <chr> <fct> <chr> <chr> <dbl> <dbl>
#1 2020 Jan A a 10 0
#2 2020 Jan A a 20 1
#3 2020 Feb A a 70 1
#4 2020 Jan A b 20 2
#5 2020 Feb B a 10 1
#6 2020 Jan B b 20 2
#7 2020 Feb B b 70 7
#8 2020 Feb B b 80 8
If I understand what you are doing, you're taking the sum for each month, then doing the cumulative sums for the months. This is usually pretty easy in dplyr.
library(dplyr)
df %>%
group_by(Year, Month, Group, SubGroup) %>%
summarize(
V1_sum = sum(V1),
V2_sum = sum(V2)
) %>%
group_by(Year, Group, SubGroup) %>%
mutate(
V1_cumsum = cumsum(V1_sum),
V2_cumsum = cumsum(V2_sum)
)
# A tibble: 6 x 8
# Groups: Year, Group, SubGroup [4]
# Year Month Group SubGroup V1_sum V2_sum V1_cumsum V2_cumsum
# <dbl> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
# 1 2020 Feb A a 50 0 50 0
# 2 2020 Feb B a 10 1 10 1
# 3 2020 Feb B b 60 6 60 6
# 4 2020 Jan A a 20 1 70 1
# 5 2020 Jan A b 20 2 20 2
# 6 2020 Jan B b 20 2 80 8
But you'll notice that the monthly cumulative sums are backwards (i.e. January comes after February), because summarize returns the groups in sorted order and Month, being a character column, sorts alphabetically. Also, you don't see the empty values because dplyr doesn't fill them in.
To fix the order of the months, you can either make your months numeric (convert to dates) or turn them into factors. You can add back 'missing' combinations of the grouping variables by using aggregate in base R instead of dplyr::summarize. With drop = FALSE, aggregate includes all combinations of the grouping factors and sets the missing values to NA, which you can then replace with 0 using tidyr::replace_na, for example.
library(dplyr)
library(tidyr)
df <- data.frame("Year"=2020,
"Month"=c("Jan","Jan","Jan","Jan","Feb","Feb","Feb","Feb"),
"Group"=c("A","A","A","B","A","B","B","B"),
"SubGroup"=c("a","a","b","b","a","b","a","b"),
"V1"=c(10,10,20,20,50,50,10,10),
"V2"=c(0,1,2,2,0,5,1,1))
df$Month <- factor(df$Month, levels = c("Jan", "Feb"), ordered = TRUE)
# Get monthly sums
df1 <- with(df, aggregate(
list(V1_sum = V1, V2_sum = V2),
list(Year = Year, Month = Month, Group = Group, SubGroup = SubGroup),
FUN = sum, drop = FALSE
))
df1 <- df1 %>%
# Replace NA with 0
mutate(
V1_sum = replace_na(V1_sum, 0),
V2_sum = replace_na(V2_sum, 0)
) %>%
# Get cumulative sum across months
group_by(Year, Group, SubGroup) %>%
mutate(V1cumsum = cumsum(V1_sum),
V2cumsum = cumsum(V2_sum)) %>%
ungroup() %>%
select(Year, Month, Group, SubGroup, V1 = V1cumsum, V2 = V2cumsum)
This gives the same result as your example:
# # A tibble: 8 x 6
# Year Month Group SubGroup V1 V2
# <dbl> <ord> <chr> <chr> <dbl> <dbl>
# 1 2020 Jan A a 20 1
# 2 2020 Feb A a 70 1
# 3 2020 Jan B a 0 0
# 4 2020 Feb B a 10 1
# 5 2020 Jan A b 20 2
# 6 2020 Feb A b 20 2
# 7 2020 Jan B b 20 2
# 8 2020 Feb B b 80 8
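The same result can also be sketched entirely in the tidyverse, assuming tidyr::complete is used in place of aggregate to fill in the missing Month/Group/SubGroup combinations. This is an illustrative variant, not the method from the answer above:

```r
library(dplyr)
library(tidyr)

df <- data.frame(
  Year     = 2020,
  Month    = factor(c("Jan", "Jan", "Jan", "Jan", "Feb", "Feb", "Feb", "Feb"),
                    levels = c("Jan", "Feb")),
  Group    = c("A", "A", "A", "B", "A", "B", "B", "B"),
  SubGroup = c("a", "a", "b", "b", "a", "b", "a", "b"),
  V1       = c(10, 10, 20, 20, 50, 50, 10, 10),
  V2       = c(0, 1, 2, 2, 0, 5, 1, 1)
)

res <- df %>%
  # Monthly sums per Group/SubGroup
  group_by(Year, Month, Group, SubGroup) %>%
  summarise(across(V1:V2, sum), .groups = "drop") %>%
  # Add the combinations that never occur, with 0 instead of NA
  complete(Year, Month, Group, SubGroup, fill = list(V1 = 0, V2 = 0)) %>%
  # Month is a factor, so Jan sorts before Feb
  arrange(Year, Group, SubGroup, Month) %>%
  group_by(Year, Group, SubGroup) %>%
  mutate(across(V1:V2, cumsum)) %>%
  ungroup()

res
```

The final arrange puts the rows in the same order as the table requested in the question.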
Another option orders the months with zoo::as.yearmon before taking the cumulative sums:
library(dplyr)
library(zoo)
df %>%
arrange(as.yearmon(paste0(Year, '-', Month), '%Y-%b'), Group, SubGroup) %>%
group_by(Year, Group, SubGroup) %>%
mutate(
V1 = cumsum(V1),
V2 = cumsum(V2)
) %>%
arrange(Year, Group, SubGroup, as.yearmon(paste0(Year, '-', Month), '%Y-%b')) #for desired output ordering
# A tibble: 8 x 6
# Groups: Year, Group, SubGroup [4]
# Year Month Group SubGroup V1 V2
# <chr> <chr> <chr> <chr> <dbl> <dbl>
# 1 2020 Jan A a 10 0
# 2 2020 Jan A a 20 1
# 3 2020 Feb A a 70 1
# 4 2020 Jan A b 20 2
# 5 2020 Feb B a 10 1
# 6 2020 Jan B b 20 2
# 7 2020 Feb B b 70 7
# 8 2020 Feb B b 80 8

Calculate average of values in R and add result as new rows instead of as a new column

I have a dataframe like the following one:
day year value
1 2014 5
1 2015 16
1 2016 0
2 2014 3
2 2015 1
2 2016 4
and I want to calculate the average value by day for the three year period (2014, 2015, 2016). The following code works for this purpose:
data %>%
group_by(day) %>%
mutate(MEAN = mean(value))
and produces this output:
day year value MEAN
1 2014 5 7
1 2015 16 7
1 2016 0 7
2 2014 3 2.67
2 2015 1 2.67
2 2016 4 2.67
but I want to add the average values as new rows in the same dataframe as follows:
day year value
1 2014 5
1 2015 16
1 2016 0
2 2014 3
2 2015 1
2 2016 4
1 avg 7 <--
2 avg 2.67 <--
Any suggestions on how I can do this? Thanks!
We can use summarise (instead of mutate, which adds a new column to the original dataset) to calculate the mean, and then bind the result to the original data with bind_rows. The tidyverse functions are strict about types, so make sure the column classes match before binding.
library(dplyr)
data %>%
group_by(day) %>%
summarise(year = 'avg', value = mean(value)) %>%
bind_rows(data %>%
mutate(year = as.character(year)), .)
# day year value
#1 1 2014 5.00
#2 1 2015 16.00
#3 1 2016 0.00
#4 2 2014 3.00
#5 2 2015 1.00
#6 2 2016 4.00
#7 1 avg 7.00
#8 2 avg 2.67
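To see why the as.character() conversion matters, here is a small sketch (the data frame is typed in from the question; avg_rows and attempt are names chosen for illustration). Binding the numeric year column to the character "avg" rows fails, while converting first succeeds:

```r
library(dplyr)

data <- data.frame(
  day   = c(1, 1, 1, 2, 2, 2),
  year  = c(2014, 2015, 2016, 2014, 2015, 2016),
  value = c(5, 16, 0, 3, 1, 4)
)

avg_rows <- data %>%
  group_by(day) %>%
  summarise(year = "avg", value = mean(value))

# year is numeric in `data` but character in `avg_rows`, so bind_rows() refuses
attempt <- tryCatch(bind_rows(data, avg_rows), error = function(e) e)
inherits(attempt, "error")

# Converting year to character first makes the two tables compatible
combined <- bind_rows(mutate(data, year = as.character(year)), avg_rows)
combined
```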
Another option is to split by the 'day' and then with add_row (from tibble) create a new row on each of the list elements
library(tibble)
library(purrr)
data %>%
mutate(year = as.character(year)) %>%
group_split(day) %>%
map_dfr(~ .x %>% add_row(day = first(.$day),
year = 'avg', value = mean(.$value)))
Here is a base R option using aggregate (the data frame is called df here)
rbind(df,cbind(aggregate(value~day,df,mean),year = "avg")[c(1,3,2)])
or a variation (by @thelatemail from comments)
rbind(df, aggregate(df["value"], cbind(df["day"], year="avg"), FUN=mean))
which gives
day year value
1 1 2014 5.000000
2 1 2015 16.000000
3 1 2016 0.000000
4 2 2014 3.000000
5 2 2015 1.000000
6 2 2016 4.000000
7 1 avg 7.000000
8 2 avg 2.666667

r conditional subtract number

I am trying to do the following logic to create 'subtract' column.
I have years from 1986-2014 and around 100 firms.
year firm count sum_of_year subtract
1986 A 1 2 2
1986 B 1 2 4
1987 A 2 4 5
1987 C 1 4 2
1987 D 1 4 5
1988 C 3 5
1988 E 2 5
That is, if a firm i at t appears in t+1, then subtract its count at t+1 from the sum_of_year at t+1,
if a firm i does not appear in t+1, then just put sum_of_year at t+1 as shown in the sample.
I am having difficulties in creating this conditional code.
How can I do this in a generalized version?
Thank you for your help.
One way is to use dplyr with the help of tidyr::complete. We complete the missing combinations of year and firm and fill count with 0. For each year, we subtract each firm's count from the sum of count for that year, and finally, for each firm, we take that value from the next year using lead.
library(dplyr)
df %>%
tidyr::complete(year, firm, fill = list(count = 0)) %>%
group_by(year) %>%
mutate(n = sum(count) - count) %>%
group_by(firm) %>%
mutate(subtract = lead(n)) %>%
filter(count != 0) %>%
select(-n)
# year firm count sum_of_year subtract
# <int> <fct> <dbl> <int> <dbl>
#1 1986 A 1 2 2
#2 1986 B 1 2 4
#3 1987 A 2 4 5
#4 1987 C 1 4 2
#5 1987 D 1 4 5
#6 1988 C 3 5 NA
#7 1988 E 2 5 NA
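As a hedged alternative sketch, the next-year values can be attached with joins instead of completing the full year/firm grid. The helper names totals_shifted and counts_shifted are made up for illustration, and the sample data is typed in from the question:

```r
library(dplyr)

df <- data.frame(
  year  = c(1986, 1986, 1987, 1987, 1987, 1988, 1988),
  firm  = c("A", "B", "A", "C", "D", "C", "E"),
  count = c(1, 1, 2, 1, 1, 3, 2)
)

# Yearly totals, relabelled so each total lines up with the year before it
totals_shifted <- df %>%
  group_by(year) %>%
  summarise(next_total = sum(count)) %>%
  mutate(year = year - 1)

# Each firm's count, relabelled the same way
counts_shifted <- df %>%
  transmute(year = year - 1, firm, next_count = count)

res <- df %>%
  left_join(totals_shifted, by = "year") %>%
  left_join(counts_shifted, by = c("year", "firm")) %>%
  # Firms absent next year have no match, so their next_count is treated as 0
  mutate(subtract = next_total - coalesce(next_count, 0)) %>%
  select(year, firm, count, subtract)

res
```

Because the joins match on the following calendar year explicitly, a firm is only compared with year t+1 itself. With lead(), a year that is missing from the data altogether would be skipped silently.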

Group By and summaries with condition

I have data frame df. After group_by(id, Year, Month, new_used_ind) and summarise(n = n()) it looks like:
id Year Month new_used_ind n
1 2001 apr N 3
1 2001 apr U 2
2 2002 mar N 5
3 2003 mar U 3
4 2004 july N 4
4 2004 july U 2
I want to get the total for each id, Year and Month, and also the total of 'N' from new_used_ind in a new column.
Something like this
id Year Month Total_New total
1 2001 apr 3 5
2 2002 mar 5 8
4 2004 july 4 6
You can compute both totals per group with dplyr:
library(dplyr)
read.table(text= "id Year Month new_used_ind n
1 2001 apr N 3
1 2001 apr U 2
2 2002 mar N 5
3 2003 mar U 3
4 2004 july N 4
4 2004 july U 2", header = TRUE) -> df
df %>%
group_by(id, Year, Month) %>%
mutate(total_New=sum(n*(new_used_ind=="N"))) %>%
mutate(total_n=sum(n)) %>%
summarise_at(c("total_New", "total_n"), mean)
#> # A tibble: 4 x 5
#> # Groups: id, Year [4]
#> id Year Month total_New total_n
#> <int> <int> <fct> <dbl> <dbl>
#> 1 1 2001 apr 3 5
#> 2 2 2002 mar 5 5
#> 3 3 2003 mar 0 3
#> 4 4 2004 july 4 6
Created on 2019-06-11 by the reprex package (v0.3.0)
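If the unaggregated data is still available, both totals can be computed in a single summarise without the intermediate n column. A minimal sketch, assuming one row per record (raw is hypothetical data invented for illustration):

```r
library(dplyr)

# Hypothetical unaggregated data: one row per record
raw <- data.frame(
  id           = c(1, 1, 1, 1, 1, 3, 3, 3),
  Year         = c(rep(2001, 5), rep(2003, 3)),
  Month        = c(rep("apr", 5), rep("mar", 3)),
  new_used_ind = c("N", "N", "N", "U", "U", "U", "U", "U")
)

totals <- raw %>%
  group_by(id, Year, Month) %>%
  summarise(
    Total_New = sum(new_used_ind == "N"),  # number of rows flagged N
    total     = n(),                       # number of rows in the group
    .groups   = "drop"
  )

totals
```

Groups with no 'N' rows get a Total_New of 0 rather than being dropped.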
