Ranking of values in one quarter [duplicate]

I am trying to implement a calculation that ranks the Price values within each partition (here, each quarter). Below you can see my data:
df <- data.frame(
  year = c(2010, 2010, 2010, 2010, 2010, 2010),
  quarter = c("q1", "q1", "q1", "q2", "q2", "q2"),
  Price = c(10, 20, 30, 10, 20, 30)
)
df
Now I want to rank within each quarter, expecting 1 for the smallest Price and 3 for the highest Price:
df %>% group_by(quarter) %>% mutate(id = row_number(Price))
Instead of the expected results, the ranking ran across both quarters rather than restarting within each quarter. Can anybody help me solve this so that each quarter is ranked separately, starting again at 1?

You probably want rank.
transform(df, id = ave(Price, year, quarter, FUN = rank))
# year quarter Price id
# 1 2010 q1 10 1
# 2 2010 q1 20 2
# 3 2010 q1 30 3
# 4 2010 q2 10 1
# 5 2010 q2 20 2
# 6 2010 q2 30 3
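Note that rank() averages ties by default (two equal prices would both get rank 1.5). If distinct integer ranks are needed, ties.method can be set explicitly; a small sketch on the same data:
transform(df, id = ave(Price, year, quarter,
                       FUN = function(x) rank(x, ties.method = "first")))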

With dplyr, use dense_rank
library(dplyr)
df %>%
  group_by(quarter) %>%
  mutate(id = dense_rank(Price)) %>%
  ungroup()
# A tibble: 6 × 4
year quarter Price id
<dbl> <chr> <dbl> <int>
1 2010 q1 10 1
2 2010 q1 20 2
3 2010 q1 30 3
4 2010 q2 10 1
5 2010 q2 20 2
6 2010 q2 30 3
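With this data all three dplyr ranking helpers agree, but they differ once there are ties, so the choice matters for real data. A quick illustration on a toy vector (not from the question):
library(dplyr)
x <- c(10, 20, 20, 30)
min_rank(x)    # 1 2 2 4 - ties share the lowest rank, leaving a gap
dense_rank(x)  # 1 2 2 3 - ties share a rank, no gaps
row_number(x)  # 1 2 3 4 - ties broken by position, ranks always unique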
In newer versions of dplyr (>= 1.1.0), you can also use the .by argument in mutate:
df %>%
  mutate(id = dense_rank(Price), .by = 'quarter')
year quarter Price id
1 2010 q1 10 1
2 2010 q1 20 2
3 2010 q1 30 3
4 2010 q2 10 1
5 2010 q2 20 2
6 2010 q2 30 3

Alternatively, with row_number(). Note that row_number() without an argument returns the row position within each group; that matches the rank here only because the data is already sorted by Price within each quarter (see the sketch after the output).
library(tidyverse)
df %>% group_by(year, quarter) %>% mutate(id = row_number())
Created on 2023-02-12 with reprex v2.0.2
# A tibble: 6 × 4
# Groups: year, quarter [2]
year quarter Price id
<dbl> <chr> <dbl> <int>
1 2010 q1 10 1
2 2010 q1 20 2
3 2010 q1 30 3
4 2010 q2 10 1
5 2010 q2 20 2
6 2010 q2 30 3
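If the rows were not already sorted by Price within each quarter, passing Price to row_number() ranks by value rather than by row position; a sketch:
df %>%
  group_by(year, quarter) %>%
  mutate(id = row_number(Price)) %>%
  ungroup()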

Related

Select first row for each id for each year

Say I have a dataset below where each id can have multiple records per year. I would like to keep only the id's most recent record per year.
id<-c(1,1,1,2,2,2)
year<-c(2020,2020,2019,2020,2018,2018)
month<-c(12,6,4,5,4,1)
have<-as.data.frame(cbind(id,year,month))
have
id year month
1 2020 12
1 2020 6
1 2019 4
2 2020 5
2 2018 4
2 2018 1
This is what would like the dataset to look like:
want
id year month
1 2020 12
1 2019 4
2 2020 5
2 2018 4
I know that I can get the first instance of each id with this code, however I want the latest record for each year.
want<-have[match(unique(have$id), have$id),]
id year month
1 2020 12
2 2020 5
I modified the code to add in year, but it outputs the same results as the code above:
want<-have[match(unique(have$id,have$year), have$id),]
id year month
1 2020 12
2 2020 5
How would I modify this so I can see one record displayed per year?
You can use dplyr::slice_max like this (the most recent record in a year is the one with the largest month, so slice_max rather than slice_min):
library(dplyr)
have %>%
  group_by(id, year) %>%
  slice_max(order_by = month)
Output:
id year month
<dbl> <dbl> <dbl>
1 1 2019 4
2 1 2020 12
3 2 2018 4
4 2 2020 5
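One caveat: slice_max() keeps every row tied for the maximum by default, so an id/year with two records in its latest month would return both. If exactly one row per group is required, with_ties = FALSE keeps the first; a sketch:
have %>%
  group_by(id, year) %>%
  slice_max(order_by = month, with_ties = FALSE) %>%
  ungroup()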
We could group and then summarise with first(). This relies on each id/year group already listing its latest month first; an order-independent variant follows the output below.
library(dplyr)
have %>%
  group_by(id, year) %>%
  summarise(month = first(month))
id year month
<dbl> <dbl> <dbl>
1 1 2019 4
2 1 2020 12
3 2 2018 4
4 2 2020 5
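If the rows were not already ordered with the latest month first, arranging within each group (or simply taking the maximum) makes the result order-independent; a quick sketch:
have %>%
  group_by(id, year) %>%
  arrange(desc(month), .by_group = TRUE) %>%
  summarise(month = first(month))
# or, equivalently here: summarise(month = max(month))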
You can also get there with group_by and summarise, taking the maximum month per id and year:
have %>% group_by(id, year) %>% summarise(month = max(month))

Calculate average of values in R and add result as new rows instead of as a new column

I have a dataframe like the following one:
day year value
1 2014 5
1 2015 16
1 2016 0
2 2014 3
2 2015 1
2 2016 4
and I want to calculate the average value by day for the three year period (2014, 2015, 2016). The following code works for this purpose:
data %>%
  group_by(day) %>%
  mutate(MEAN = mean(value))
and produces this output:
day year value MEAN
1 2014 5 7
1 2015 16 7
1 2016 0 7
2 2014 3 2.67
2 2015 1 2.67
2 2016 4 2.67
but I want to add the average values as new rows in the same dataframe as follows:
day year value
1 2014 5
1 2015 16
1 2016 0
2 2014 3
2 2015 1
2 2016 4
1 avg 7 <--
2 avg 2.67 <--
Any suggestions on how I can do this? Thanks!
We can use summarise (instead of mutate, which adds a new column to the original dataset) to calculate the mean, and then bind the result to the original data with bind_rows. The tidyverse functions are strict about column types, so make sure the classes match before binding; here year must be converted to character first.
library(dplyr)
data %>%
  group_by(day) %>%
  summarise(year = 'avg', value = mean(value)) %>%
  bind_rows(data %>%
              mutate(year = as.character(year)), .)
# day year value
#1 1 2014 5.00
#2 1 2015 16.00
#3 1 2016 0.00
#4 2 2014 3.00
#5 2 2015 1.00
#6 2 2016 4.00
#7 1 avg 7.00
#8 2 avg 2.67
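The type conversion is what makes the bind work: year is numeric in the original data and character ('avg') in the summary. A minimal sketch of the failure mode without it (the exact error wording varies by dplyr version):
avg_rows <- data %>%
  group_by(day) %>%
  summarise(year = 'avg', value = mean(value))
try(bind_rows(data, avg_rows))
# Error: Can't combine `year` <double> and <character>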
Another option is to split by 'day' and then, with add_row() (from tibble), create a new row in each of the list elements:
library(tibble)
library(purrr)
data %>%
  mutate(year = as.character(year)) %>%
  group_split(day) %>%
  map_dfr(~ .x %>% add_row(day = first(.$day),
                           year = 'avg', value = mean(.$value)))
Here is a base R option using aggregate:
rbind(data, cbind(aggregate(value ~ day, data, mean), year = "avg")[c(1, 3, 2)])
or a variation (by @thelatemail, from the comments):
rbind(data, aggregate(data["value"], cbind(data["day"], year = "avg"), FUN = mean))
which gives
day year value
1 1 2014 5.000000
2 1 2015 16.000000
3 1 2016 0.000000
4 2 2014 3.000000
5 2 2015 1.000000
6 2 2016 4.000000
7 1 avg 7.000000
8 2 avg 2.666667

Group by and summarise with condition

I have data frame df. After group_by(id, Year, Month, new_used_ind) and summarise(n = n()) it looks like:
id Year Month new_used_ind n
1 2001 apr N 3
1 2001 apr U 2
2 2002 mar N 5
3 2003 mar U 3
4 2004 july N 4
4 2004 july U 2
I want to get the total n for each id, Year and Month, but I also want the total of the 'N' rows from new_used_ind in a new column.
Something like this
id Year Month Total_New total
1 2001 apr 3 5
2 2002 mar 5 8
4 2004 july 4 6
library(dplyr)
read.table(text= "id Year Month new_used_ind n
1 2001 apr N 3
1 2001 apr U 2
2 2002 mar N 5
3 2003 mar U 3
4 2004 july N 4
4 2004 july U 2", header = T) -> df
df %>%
  group_by(id, Year, Month) %>%
  mutate(total_New = sum(n * (new_used_ind == "N"))) %>%
  mutate(total_n = sum(n)) %>%
  summarise_at(c("total_New", "total_n"), mean)
#> # A tibble: 4 x 5
#> # Groups: id, Year [4]
#> id Year Month total_New total_n
#> <int> <int> <fct> <dbl> <dbl>
#> 1 1 2001 apr 3 5
#> 2 2 2002 mar 5 5
#> 3 3 2003 mar 0 3
#> 4 4 2004 july 4 6
Created on 2019-06-11 by the reprex package (v0.3.0)
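For what it's worth, the two mutate() calls plus summarise_at() (now superseded in dplyr) can be collapsed into a single summarise; a sketch:
df %>%
  group_by(id, Year, Month) %>%
  summarise(total_New = sum(n[new_used_ind == "N"]),
            total_n = sum(n),
            .groups = "drop")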

How many counts are there for each type in each year? With either group_by() or split() [duplicate]

I have a data frame df as follows:
df
Code Time Country Type
1 n001 2000 France 1
2 n002 2001 Japan 5
3 n003 2003 USA 2
4 n004 2004 USA 2
5 n005 2004 Canada 1
6 n006 2005 Britain 2
7 n007 2005 USA 1
8 n008 2005 USA 2
9 n010 2005 USA 1
10 n011 2005 Canada 1
11 n012 2005 USA 2
12 n013 2005 USA 5
13 n014 2005 Canada 1
14 n015 2006 USA 2
15 n017 2006 Canada 1
16 n018 2006 Britain 1
17 n019 2006 Canada 1
18 n020 2006 USA 1
...
where Type is the type of news, and Time is the year when the news was published.
My aim is to count the number of each type of news each year.
I was thinking about a result like this:
...
$2005
Type: 1 Count: 4
Type: 2 Count: 3
Type: 5 Count: 1
$2006
Type: 1 Count: 4
...
I used the following code:
gp = group_by(df, Time)
summarise(gp, table(Time))
Error in summarise_impl(.data, dots) :
  Evaluation error: unique() applies only to vectors.
Then I tried split(), thinking it might separate the data frame by year so I could count the number of each type per year:
split(df, 'Time')
$Time
Code Time Country Type
1 n001 2000 France 1
2 n002 2001 Japan 5
3 n003 2003 USA 2
4 n004 2004 USA 2
...
Everything is almost the same, apart from the "$Time" sign.
I was wondering what I did wrong, and how to fix it.
We can split the Type column by Time and tabulate its frequency with table. (The problem with split(df, 'Time') is that the string 'Time' is itself used as the grouping factor, recycled across every row, rather than the column df$Time; that is why the result was a single element named $Time containing the whole data frame.)
lapply(split(df$Type, df$Time), table)
#$`2000`
#1
#1
#$`2001`
#5
#1
#$`2003`
#2
#1
#$`2004`
#1 2
#1 1
#$`2005`
#1 2 5
#4 3 1
#$`2006`
#1 2
#4 1
How about this? (spread() is from tidyr.)
library(dplyr)
library(tidyr)
df %>%
  group_by(Time, Type) %>%
  count() %>%
  spread(Type, n)
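spread() still works but has been superseded; in current tidyr the same reshape is pivot_wider(), roughly:
df %>%
  count(Time, Type) %>%
  pivot_wider(names_from = Type, values_from = n, values_fill = 0)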
You could use something like this: split on Time, then group by Type and tally the result.
library(dplyr)
library(purrr)
df %>%
  split(.$Time) %>%
  map(~ group_by(., Type) %>% tally())
......
$`2004`
# A tibble: 2 x 2
Type n
<int> <int>
1 1 1
2 2 1
$`2005`
# A tibble: 3 x 2
Type n
<int> <int>
1 1 4
2 2 3
3 5 1
$`2006`
# A tibble: 2 x 2
......
Or use summarise instead of tally if you want a column called count instead of n
df %>%
  split(.$Time) %>%
  map(~ group_by(., Type) %>% summarise(count = n()))
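If a plain contingency table is acceptable, base R gets there in one call:
table(df$Time, df$Type)  # rows are years, columns are types, cells are counts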

How to de-cumulate variable in dplyr?

I have an issue. I have a panel of quarterly individual data that is "annually cumulative": values for the 1st quarter cover the 1st quarter only, values for the 2nd quarter are the sum of the 1st and 2nd, 3rd-quarter values are the sums of the first three quarters, and 4th-quarter values are annual sums. How can I easily de-cumulate these in dplyr, grouping by id and year?
Assuming we have two years, and in year one sales are 2 per quarter, and in year 2 sales are 3 per quarter, the original is:
df <- data.frame(
  quarter = c("Q1", "Q2", "Q3", "Q4", "Q1", "Q2", "Q3", "Q4"),
  year = c(rep(2017, 4), rep(2018, 4)),
  cum_tot = c(2, 4, 6, 8, 3, 6, 9, 12)
)
quarter year cum_tot
1 Q1 2017 2
2 Q2 2017 4
3 Q3 2017 6
4 Q4 2017 8
5 Q1 2018 3
6 Q2 2018 6
7 Q3 2018 9
8 Q4 2018 12
Then we can get the sales per quarter as:
library(dplyr)
df %>% group_by(year) %>% mutate(original = c(cum_tot[1], diff(cum_tot)))
Or, as per GGamba's comment:
df %>% group_by(year) %>% mutate(original = cum_tot - lag(cum_tot, default = 0))
They both result in:
quarter year cum_tot original
1 Q1 2017 2 2
2 Q2 2017 4 2
3 Q3 2017 6 2
4 Q4 2017 8 2
5 Q1 2018 3 3
6 Q2 2018 6 3
7 Q3 2018 9 3
8 Q4 2018 12 3
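The toy data here has no id column, but for the panel described in the question the grouping would include id as well; a sketch assuming columns id, year, and cum_tot:
df %>%
  group_by(id, year) %>%
  mutate(original = cum_tot - lag(cum_tot, default = 0)) %>%
  ungroup()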
Hope this helps!
