Get the ratios by group in R

EDIT: I would like to add two additional columns: mean and range (see below)
My data are as follows:
year species count
2020 chinook 10000
2020 chum 1450
2020 sockeye 600
2020 coho 1100
2021 chinook 8672
2021 sockeye
2021 coho 10100
2021 chum 200
I would like to get the chinook to other species ratio for each year. In some years, species do not have count data, so I would like to just leave the outcome blank for those species.
I would then like to get the mean and range for each species across years.
The finished dataset I am looking for is as follows:
year species count proportion mean range
2020 chinook 10000 1 1 1
2020 chum 1450 0.145 0.084 0.023-0.145
2020 sockeye 600 0.06 0.06 0.06
2020 coho 1100 0.11 0.637 0.11-1.164
2021 chinook 8672 1 1 1
2021 sockeye NA 0.06 0.06
2021 coho 10100 1.164 0.637 0.11-1.164
2021 chum 200 0.023 0.084 0.023-0.145
Thank you in advance!

library(dplyr)
df %>% group_by(year) %>%
mutate(proportion = count / count[species == "chinook"])
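The pipeline above only covers the proportion column. A minimal sketch of how the mean and range columns from the edit might be added follows; the helper `fmt_range` and the three-decimal rounding are assumptions for illustration, not part of the original answer. Note that the mean of the two coho proportions works out to about 0.637.

```r
library(dplyr)

# Data from the question; the missing 2021 sockeye count is entered as NA
df <- data.frame(
  year    = c(2020, 2020, 2020, 2020, 2021, 2021, 2021, 2021),
  species = c("chinook", "chum", "sockeye", "coho",
              "chinook", "sockeye", "coho", "chum"),
  count   = c(10000, 1450, 600, 1100, 8672, NA, 10100, 200)
)

# Hypothetical helper: format the range as "min-max", or a single value
# when min and max coincide after rounding
fmt_range <- function(x) {
  r <- round(range(x, na.rm = TRUE), 3)
  if (r[1] == r[2]) as.character(r[1]) else paste(r[1], r[2], sep = "-")
}

res <- df %>%
  group_by(year) %>%
  mutate(proportion = count / count[species == "chinook"]) %>%
  group_by(species) %>%                       # summaries across years
  mutate(mean  = mean(proportion, na.rm = TRUE),
         range = fmt_range(proportion)) %>%
  ungroup()
```

Rows with a missing count keep `NA` as their proportion but still receive the species-level mean and range, which matches the layout requested in the question.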

Related

Loop instead of iterative calculations in R

In my example below I have quarterly data from 2021Q1 to 2022Q3 for variable X. I have a forecasted growth rate of variable X (growth_x) from 2022Q4 to 2025Q4. I want to use the growth_x variable to calculate variable X from 2022Q4 to 2025Q4 iteratively. I am manually calculating it below and am still missing 2025Q4. Is it possible to write a function to do it? I am fairly new to writing loops. Any help will be greatly appreciated. Thank you in advance.
library(readxl)
library(dplyr)
library(lubridate)
# Quarterly Data
data <- data.frame(c("2021Q1","2021Q2","2021Q3","2021Q4",
"2022Q1","2022Q2","2022Q3","2022Q4",
"2023Q1","2023Q2","2023Q3","2023Q4",
"2024Q1","2024Q2","2024Q3","2024Q4",
"2025Q1","2025Q2","2025Q3","2025Q4"),
# Variable X - Actuals up to 2022Q3
c(804,511,479,462,
427,330,440,NA,
NA,NA,NA,NA,
NA,NA,NA,NA,
NA,NA,NA,NA),
# Forecasted Growth rates of X from 2022Q4
c(NA,NA,NA,NA,
NA,NA,NA,0.24,
0.49,0.65,0.25,0.71,
0.63,0.33,0.53,0.83,
0.87,0.19,0.99,0.16))
# Renaming the columns
data<-data%>%rename(yrqtr=1,x=2,growth_x=3)
# Creating Date Variable
data<-data%>%mutate(year=substr(yrqtr,1,4),
qtr=substr(yrqtr,5,6),
mon=ifelse(qtr=="Q1",3,
ifelse(qtr=="Q2",6,
ifelse(qtr=="Q3",9,12))),
date=make_date(year,mon,1))
# Computing Growth Rate from 2022Q3 to 2023Q3
Growth_2023_3<-data%>%mutate(forecast_x=(1+growth_x)*lag(x,4),
x=ifelse(date>"2022-09-01",forecast_x,x))%>%select(-forecast_x)
# Computing Growth Rate from 2023Q3 to 2024Q3
Growth_2024_3<-Growth_2023_3%>%mutate(forecast_x=(1+growth_x)*lag(x,4),
x=ifelse(date>"2023-09-01",forecast_x,x))%>%select(-forecast_x)
# Computing Growth Rate from 2024Q3 to 2025Q3
Growth_2025_3<-Growth_2024_3%>%mutate(forecast_x=(1+growth_x)*lag(x,4),
x=ifelse(date>"2024-09-01",forecast_x,x))%>%select(-forecast_x)
Does this do what you want?
n_years <- length(unique(data$year))
for (i in unique(data$year)[2:n_years]) {
  # Forecast x for quarters after Q3 of year i from the value four quarters earlier
  data <- data %>%
    mutate(forecast_x = (1 + growth_x) * lag(x, 4),
           x = ifelse(date > as.Date(paste0(i, "-09-01")), forecast_x, x))
}
As an aside, column names can be assigned at the time the data frame is created. For example:
# Quarterly Data
data <- data.frame(yrqtr = c("2021Q1","2021Q2"),
x = c(804,511),
growth_x = c(0.24,0.49))
If you want to avoid using a loop, you can use purrr::reduce().
library(tidyverse)
library(lubridate)
sol <- reduce(
.x = unique(data$year), # iterate over years
.init = data,
\(lhs, rhs) lhs %>%
mutate(x = ifelse(year == rhs & is.na(x), (1+growth_x)*lag(x,4), x))
)
sol
#> yrqtr x growth_x year qtr mon date
#> 1 2021Q1 804.0000 NA 2021 Q1 3 2021-03-01
#> 2 2021Q2 511.0000 NA 2021 Q2 6 2021-06-01
#> 3 2021Q3 479.0000 NA 2021 Q3 9 2021-09-01
#> 4 2021Q4 462.0000 NA 2021 Q4 12 2021-12-01
#> 5 2022Q1 427.0000 NA 2022 Q1 3 2022-03-01
#> 6 2022Q2 330.0000 NA 2022 Q2 6 2022-06-01
#> 7 2022Q3 440.0000 NA 2022 Q3 9 2022-09-01
#> 8 2022Q4 572.8800 0.24 2022 Q4 12 2022-12-01
#> 9 2023Q1 636.2300 0.49 2023 Q1 3 2023-03-01
#> 10 2023Q2 544.5000 0.65 2023 Q2 6 2023-06-01
#> 11 2023Q3 550.0000 0.25 2023 Q3 9 2023-09-01
#> 12 2023Q4 979.6248 0.71 2023 Q4 12 2023-12-01
#> 13 2024Q1 1037.0549 0.63 2024 Q1 3 2024-03-01
#> 14 2024Q2 724.1850 0.33 2024 Q2 6 2024-06-01
#> 15 2024Q3 841.5000 0.53 2024 Q3 9 2024-09-01
#> 16 2024Q4 1792.7134 0.83 2024 Q4 12 2024-12-01
#> 17 2025Q1 1939.2927 0.87 2025 Q1 3 2025-03-01
#> 18 2025Q2 861.7802 0.19 2025 Q2 6 2025-06-01
#> 19 2025Q3 1674.5850 0.99 2025 Q3 9 2025-09-01
#> 20 2025Q4 2079.5475 0.16 2025 Q4 12 2025-12-01
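
The same recursion can also be written as a plain row-wise loop in base R. This is a minimal sketch that rebuilds only the three columns it needs and assumes the rows are in chronological order with no gaps, so that the value four rows up is always the same quarter of the previous year.

```r
# Same values as in the question, with column names set at creation
data <- data.frame(
  yrqtr    = paste0(rep(2021:2025, each = 4), "Q", 1:4),
  x        = c(804, 511, 479, 462, 427, 330, 440, rep(NA, 13)),
  growth_x = c(rep(NA, 7), 0.24, 0.49, 0.65, 0.25, 0.71,
               0.63, 0.33, 0.53, 0.83, 0.87, 0.19, 0.99, 0.16)
)

# Walk through the missing rows in order; each forecast uses the value
# four quarters earlier, which has already been filled in when it is needed
for (i in which(is.na(data$x))) {
  data$x[i] <- (1 + data$growth_x[i]) * data$x[i - 4]
}
```

Because each row is filled exactly once, this version needs no helper column and no per-year pass over the data.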

Inflation rate with the CPI for multiple countries, with R

I have to calculate the inflation rate from 2015 to 2019. I have to do this with the CPI, which I have for each month of that period. This means that I have to calculate the percentage growth rate relative to the same month of the previous year.
I am asked to do the calculation for several countries and then show the average for each country over the period 2015-2019.
This is my database:
data <- read.table("https://pastebin.com/raw/6cetukKb")
I have tried the quantmod, dplyr, lubridate packages, but I can't do the CPI conversion.
I tried this but I know it is not correct:
data$year <- year(data$date)
anual_cpi <- data %>% group_by(year) %>% summarize(cpi = mean(Argentina))
anual_cpi$adj_factor <- anual_cpi$cpi/anual_cpi$cpi[anual_cpi$year == 2014]
UPDATE
My teacher gave us a hint on how to get the result, but when I try to add it to the code, I get an error.
data %>%
tidyr::pivot_longer(cols = Antigua_Barbuda:Barbados) %>%
group_by(name, year) %>%
summarise(value = mean(value)) %>%
mutate((change=(x-lag(x,1))/lag(x,1)*100))
| name            | value |
|-----------------|-------|
| Antigua_Barbuda | -1.55 |
| Argentina       | 1.03  |
| Aruba           | -1.52 |
| Bahamas         | -1.56 |
| Barbados        | -1.38 |
where "value" corresponds to the average inflation for each country during the entire period 2015-2019
We can use data.table methods
library(data.table)
melt(fread("https://pastebin.com/raw/6cetukKb"),
     id.vars = c('date', 'year', 'period', 'periodName'))[,
     .(value = mean(value)), .(variable, year)][,
     # divide by the 2014 value of the same country, hence by = variable
     adj_factor := value/value[year == 2014], by = variable][]
#           variable year     value adj_factor
# 1: Antigua_Barbuda 2014  96.40000  1.0000000
# 2: Antigua_Barbuda 2015  96.55833  1.0016425
# 3: Antigua_Barbuda 2016  96.08333  0.9967151
# 4: Antigua_Barbuda 2017  98.40833  1.0208333
# 5: Antigua_Barbuda 2018  99.62500  1.0334544
# 6: Antigua_Barbuda 2019 101.07500  1.0484959
# 7: Argentina 2014 56.60000 1.0000000
# ..
You should read your data with header = TRUE since the first row contains the column names. Then get your data into long format, which makes the calculation easy.
After this you can perform whichever calculation you want. For example, to perform the same steps as your attempt, i.e., divide all the values by the value in the year 2014 for each country, you can do:
library(dplyr)
data <- read.table("https://pastebin.com/raw/6cetukKb", header = TRUE)
data %>%
tidyr::pivot_longer(cols = Antigua_Barbuda:Barbados) %>%
group_by(name, year) %>%
summarise(value = mean(value)) %>%
mutate(adj_factor = value/value[year == 2014])
# name year value adj_factor
# <chr> <int> <dbl> <dbl>
# 1 Antigua_Barbuda 2014 96.4 1
# 2 Antigua_Barbuda 2015 96.6 1.00
# 3 Antigua_Barbuda 2016 96.1 0.997
# 4 Antigua_Barbuda 2017 98.4 1.02
# 5 Antigua_Barbuda 2018 99.6 1.03
# 6 Antigua_Barbuda 2019 101. 1.05
# 7 Argentina 2014 56.6 1
# 8 Argentina 2015 64.0 1.13
# 9 Argentina 2016 89.9 1.59
#10 Argentina 2017 113. 2.00
# … with 20 more rows
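
Neither answer computes the year-over-year rate the question describes. A minimal sketch of that step follows, using made-up monthly CPI values for two hypothetical countries `A` and `B` rather than the pastebin data; the `month` column and the column names `cpi`, `inflation` and `avg_inflation` are assumptions for illustration.

```r
library(dplyr)
library(tidyr)

# Toy monthly CPI series (not the real data): constant monthly growth
toy <- data.frame(
  year  = rep(2014:2016, each = 12),
  month = rep(1:12, times = 3),
  A     = 100 * 1.01^(0:35),
  B     = 100 * 1.02^(0:35)
)

avg_inflation <- toy %>%
  pivot_longer(cols = A:B, names_to = "name", values_to = "cpi") %>%
  arrange(name, year, month) %>%
  group_by(name) %>%
  # percentage change against the same month of the previous year
  mutate(inflation = (cpi / lag(cpi, 12) - 1) * 100) %>%
  filter(year >= 2015) %>%                 # 2014 only serves as the base year
  summarise(avg_inflation = mean(inflation, na.rm = TRUE))
```

With the real data, `cols` would name the country columns (for example `Antigua_Barbuda:Barbados`), and the rows must be sorted by date within each country so that `lag(cpi, 12)` lines up with the same month one year earlier.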

How to rearrange daily stream discharge data into monthly format and rank the discharge values for each month using R

I have a data set of daily stream discharge values from a gauging station for approximately 50 years. The data is arranged into three columns, namely "date", "month", and "discharge". Sample data is shown here:
Date <- as.Date(c('1938-10-01','1954-10-27', '1967-06-16','1943-01-01','1945-01-14','1945-03-14','1954-05-04','1960-04-23','1960-05-09','1962-01-18','1968-12-19','1972-01-15','1977-08-15','1981-04-11','1986-06-20','1989-01-20','1992-03-29'))
Months <- c('Oct','Oct','Jun','Jan','Jan','Mar','May','Apr','May','Jan','Dec','Jan','Aug','Apr','Jun','Jan','Mar')
Dis <- c('1000','1200','400','255','450','215','360','120','145','1204','752','635','1456','154','154','1204','450')
Sampledata <- data.frame("Date"=Date,"Months"=Months,"Disch"=Dis)
print(Sampledata)
Date Months Disch
1 1938-10-01 Oct 1000
2 1954-10-27 Oct 1200
3 1967-06-16 Jun 400
4 1943-01-01 Jan 255
5 1945-01-14 Jan 450
6 1945-03-14 Mar 215
7 1954-05-04 May 360
8 1960-04-23 Apr 120
9 1960-05-09 May 145
10 1962-01-18 Jan 1204
11 1968-12-19 Dec 752
12 1972-01-15 Jan 635
13 1977-08-15 Aug 1456
14 1981-04-11 Apr 154
15 1986-06-20 Jun 154
16 1989-01-20 Jan 1204
17 1992-03-29 Mar 450
I want to calculate ranks for each month separately across all the years. For example, rank the January discharges in ascending order over the 50 years, with the same rank assigned to duplicate discharge values. Desired output shown here:
Date Month Disch Rank
1 1943-01-01 Jan 255 1
2 1945-01-14 Jan 450 2
3 1962-01-18 Jan 1204 4
4 1972-01-15 Jan 635 3
5 1989-01-20 Jan 1204 4
Date Month Disch Rank
1 1945-03-14 Mar 215 1
2 1992-03-29 Mar 450 2
3 2001-03-19 Mar 450 2
Without using any packages, first convert columns 2 and 3 to numeric and then use ave and rank with the indicated ties method. Finally, order the result.
Note that the output shown in the question does not correspond to the input: for example, there are three Mar rows in the output but only two such rows in the input. The result below corresponds to the input and so will not be identical to the output shown.
Sampledata2 <- transform(Sampledata,
Disch = as.numeric(as.character(Disch)),
Months = as.numeric(format(Date, "%m")))
Rank <- function(x) rank(x, ties.method = "min")
Sampledata3 <- transform(Sampledata2,
Rank = ave(Disch, Months, FUN = Rank))
o <- with(Sampledata3, order(Months, Date))
Sampledata3[o, ]
Another option is to group by 'Months' and use one of the ranking functions (dense_rank, row_number, or min_rank, depending on the need) to rank the 'Disch' column after converting it to numeric:
library(dplyr)
Sampledata %>%
  group_by(Months) %>%
  mutate(Rank = dense_rank(as.numeric(as.character(Disch))))

Form a monthly series from a quarterly series

Assume that we have quarterly GDP change data like the following:
Country
1999Q3 0.01
1999Q4 0.01
2000Q1 0.02
2000Q2 0.00
2000Q3 -0.01
Now, I would like to turn this into a monthly series based on e.g. the mean of the previous two quarters, as one measure to represent the economic conditions. I.e. with the above data I would like to produce the following:
Country
2000-01 0.01
2000-02 0.01
2000-03 0.01
2000-04 0.015
2000-05 0.015
2000-06 0.015
2000-07 0.01
2000-08 0.01
2000-09 0.01
2000-10 -0.005
2000-11 -0.005
2000-12 -0.005
This is so that I can run regressions with other monthly series. Aggregating data from more frequent to less frequent is easy, but how would I do it in the opposite direction?
Edit.
It seems that using spline would be the right way to do this. The question is then, how does that handle a varying amount of NA's in the beginning of the country series, when doing spline with apply. There are multiple countries in the data frame as columns, as usual, and they have a varying amount of NA's in the beginning of the series.
Convert to zoo with "yearmon" class index assuming the values are at the ends of the quarters. Then perform the rolling mean giving z.mu. Now merge that with a zero width zoo object containing all the months and use na.spline to fill in the missing values (or use na.locf or na.approx for different forms of interpolation). Optionally use fortify.zoo to convert back to a data.frame.
library(zoo)
z <- zoo(coredata(DF), as.yearmon(as.yearqtr(rownames(DF)), frac = 1))
z.mu <- rollmeanr(z, 2, partial = TRUE)
ym <- seq(floor(start(z.mu)), floor(end(z.mu)) + 11/12, 1/12)
z.ym <- na.spline(merge(z.mu, zoo(, ym)))
fortify.zoo(z.ym)
giving:
Index Country
1 Jan 1999 -0.065000000
2 Feb 1999 -0.052222222
3 Mar 1999 -0.040555556
4 Apr 1999 -0.030000000
5 May 1999 -0.020555556
6 Jun 1999 -0.012222222
7 Jul 1999 -0.005000000
8 Aug 1999 0.001111111
9 Sep 1999 0.006111111
10 Oct 1999 0.010000000
11 Nov 1999 0.012777778
12 Dec 1999 0.014444444
13 Jan 2000 0.015000000
14 Feb 2000 0.014444444
15 Mar 2000 0.012777778
16 Apr 2000 0.010000000
17 May 2000 0.006111111
18 Jun 2000 0.001111111
19 Jul 2000 -0.005000000
20 Aug 2000 -0.012222222
21 Sep 2000 -0.020555556
22 Oct 2000 -0.030000000
23 Nov 2000 -0.040555556
24 Dec 2000 -0.052222222
Note: The input DF in reproducible form used is:
Lines <- " Country
1999Q3 0.01
1999Q4 0.01
2000Q1 0.02
2000Q2 0.00
2000Q3 -0.01"
DF <- read.table(text = Lines)
Update: The question originally asked to carry the last value forward but was changed to ask for spline interpolation, so the answer has been changed accordingly. It also now starts in Jan and ends in Dec, and assumes the data is for quarter end.
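
If the step-shaped series from the first part of the question is wanted instead of a spline, it can be built directly in base R. This is a minimal sketch for the single example column; the hard-coded start month `2000-01` and the names `gdp`, `pair_mean` and `monthly` are assumptions tied to this example.

```r
# Quarterly values for 1999Q3 .. 2000Q3, as in the question
gdp <- c(0.01, 0.01, 0.02, 0.00, -0.01)

# Mean of each pair of consecutive quarters; the k-th mean describes
# the quarter that follows the pair (2000Q1 .. 2000Q4)
pair_mean <- (head(gdp, -1) + tail(gdp, -1)) / 2

# Repeat each quarterly mean for the three months of its quarter
monthly <- data.frame(
  month   = format(seq(as.Date("2000-01-01"), by = "month", length.out = 12), "%Y-%m"),
  Country = rep(pair_mean, each = 3)
)
```

This reproduces the twelve monthly values listed in the question; leading NAs in a longer country series would simply propagate into the first few means.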

r ddply error undefined columns selected

I have a time series data set like below:
age time income
16 to 24 2004 q1 400
16 to 24 2004 q2 500
… …
65 and over 2014 q3 800
It has 60 quarters of income data for each age group. As the income data is seasonal, I am trying to apply a decomposition function to extract the trend. What I have done so far is below, but R consistently throws an error (error message: undefined columns selected). Any idea how to go about it?
fun =function(x){
ts = ts(x,frequency=4,start=c(2004,1))
ts.d =decompose(ts,type='additive')
as.vector(ts.d$trend)
}
trend.dt = ddply(my.dat,.(age),transform,trend=fun(income))
The expected result is below (the NAs are there because, after decomposition, the first and last observations will not have a trend value, but the rest should):
age time income trend
16 to 24 2004 q1 400 NA
16 to 24 2004 q2 500 489
… …
65 and over 2014 q3 800 760
65 and over 2014 q3 810 NA
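
One way to apply the decomposition per group without ddply is base R's ave(). This is a minimal sketch on made-up data with 12 quarters per group, not the real `my.dat`; whether it avoids the reported error depends on the actual column names, which are not shown. With quarterly data, decompose() uses a centred moving average, so the first two and last two trend values in each group are NA.

```r
# Toy data (not the real my.dat): two age groups, 12 quarters each,
# rows already in time order within each group
my.dat <- data.frame(
  age    = rep(c("16 to 24", "65 and over"), each = 12),
  income = c(seq(400, 950, by = 50), seq(800, 1350, by = 50))
)

# Trend component of an additive decomposition of one group's series
trend_fun <- function(x) {
  s <- ts(x, frequency = 4, start = c(2004, 1))
  as.vector(decompose(s, type = "additive")$trend)
}

# ave() applies trend_fun within each age group and keeps the row order
my.dat$trend <- ave(my.dat$income, my.dat$age, FUN = trend_fun)
```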
