R Panel data: Create new variable based on ifelse() statement and previous row - r

My question refers to the following (simplified) panel data, for which I would like to create some sort of xrd_stock.
#Setup data
library(tidyverse)
firm_id <- c(rep(1, 5), rep(2, 3), rep(3, 4))
firm_name <- c(rep("Cosco", 5), rep("Apple", 3), rep("BP", 4))
fyear <- c(seq(2000, 2004, 1), seq(2003, 2005, 1), seq(2005, 2008, 1))
xrd <- c(49,93,121,84,37,197,36,154,104,116,6,21)
df <- data.frame(firm_id, firm_name, fyear, xrd)
#Define variables
growth = 0.08
depr = 0.15
For a new variable called xrd_stock I'd like to apply the following mechanics:
each firm_id should be handled separately: group_by(firm_id)
where fyear is at minimum, calculate xrd_stock as: xrd/(growth + depr)
otherwise, calculate xrd_stock as: xrd + (1-depr) * [xrd_stock from previous row]
With the following code, I already succeeded with step 1. and 2. and parts of step 3.
df2 <- df %>%
ungroup() %>%
group_by(firm_id) %>%
arrange(firm_id, fyear, decreasing = TRUE) %>% #Ensure that data is arranged w/ in asc(fyear) order; not required in this specific example as df is already in correct order
mutate(xrd_stock = ifelse(fyear == min(fyear), xrd/(growth + depr), xrd + (1-depr)*lag(xrd_stock))))
Difficulties occur in the else part of the function, such that R returns:
Error: Problem with `mutate()` input `xrd_stock`.
x object 'xrd_stock' not found
i Input `xrd_stock` is `ifelse(...)`.
i The error occured in group 1: firm_id = 1.
Run `rlang::last_error()` to see where the error occurred.
From this error message, I understand that R cannot refer to the just created xrd_stock in the previous row (logical when considering/assuming that R is not strictly working from top to bottom); however, when simply putting a 9 in the else part, my above code runs without any errors.
Can anyone help me with this problem so that results look eventually as shown below. I am more than happy to answer additional questions if required. Thank you very much to everyone in advance, who looks at my question :-)
Target results (Excel-calculated):
id name fyear xrd xrd_stock Calculation for xrd_stock
1 Cosco 2000 49 213 =49/(0.08+0.15)
1 Cosco 2001 93 274 =93+(1-0.15)*213
1 Cosco 2002 121 354 …
1 Cosco 2003 84 385 …
1 Cosco 2004 37 364 …
2 Apple 2003 197 857 =197/(0.08+0.15)
2 Apple 2004 36 764 =36+(1-0.15)*857
2 Apple 2005 154 803 …
3 BP 2005 104 452 …
3 BP 2006 116 500 …
3 BP 2007 6 431 …
3 BP 2008 21 388 …

arrange the data by fyear so minimum year is always the 1st row, you can then use accumulate to calculate.
library(dplyr)
df %>%
arrange(firm_id, fyear) %>%
group_by(firm_id) %>%
mutate(xrd_stock = purrr::accumulate(xrd[-1], ~.y + (1-depr) * .x,
.init = first(xrd)/(growth + depr)))
# firm_id firm_name fyear xrd xrd_stock
# <dbl> <chr> <dbl> <dbl> <dbl>
# 1 1 Cosco 2000 49 213.
# 2 1 Cosco 2001 93 274.
# 3 1 Cosco 2002 121 354.
# 4 1 Cosco 2003 84 385.
# 5 1 Cosco 2004 37 364.
# 6 2 Apple 2003 197 857.
# 7 2 Apple 2004 36 764.
# 8 2 Apple 2005 154 803.
# 9 3 BP 2005 104 452.
#10 3 BP 2006 116 500.
#11 3 BP 2007 6 431.
#12 3 BP 2008 21 388.

Related

Joining two data frames using range of values

I have two data sets I would like to join. The income_range data is the master dataset and I would like to join data_occ to the income_range data based on what band the income falls inside. Where there are more than two observations(incomes) that are within the range I would like to take the lower income.
I was attempting to use data.table but was having trouble. I was would also like to keep all columns from both data.frames if possible.
The output dataset should only have 7 observations.
library(data.table)
library(dplyr)
income_range <- data.frame(id = "France"
,inc_lower = c(10, 21, 31, 41,51,61,71)
,inc_high = c(20, 30, 40, 50,60,70,80)
,perct = c(1,2,3,4,5,6,7))
data_occ <- data.frame(id = rep(c("France","Belgium"), each=50)
,income = sample(10:80, 50)
,occ = rep(c("manager","clerk","manual","skilled","office"), each=20))
setDT(income_range)
setDT(data_occ)
First attempt.
df2 <- income_range [data_occ ,
on = .(id, inc_lower <= income, inc_high >= income),
.(id, income, inc_lower,inc_high,perct,occ)]
Thank you in advance.
Since you tagged dplyr, here's one possible solution using that library:
library('fuzzyjoin')
# join dataframes on id == id, inc_lower <= income, inc_high >= income
joined <- income_range %>%
fuzzy_left_join(data_occ,
by = c('id' = 'id', 'inc_lower' = 'income', 'inc_high' = 'income'),
match_fun = list(`==`, `<=`, `>=`)) %>%
rename(id = id.x) %>%
select(-id.y)
# sort by income, and keep only the first row of every unique perct
result <- joined %>%
arrange(income) %>%
group_by(perct) %>%
slice(1)
And the (intermediate) results:
> head(joined)
id inc_lower inc_high perct income occ
1 France 10 20 1 10 manager
2 France 10 20 1 19 manager
3 France 10 20 1 14 manager
4 France 10 20 1 11 manager
5 France 10 20 1 17 manager
6 France 10 20 1 12 manager
> result
# A tibble: 7 x 6
# Groups: perct [7]
id inc_lower inc_high perct income occ
<chr> <dbl> <dbl> <dbl> <int> <chr>
1 France 10 20 1 10 manager
2 France 21 30 2 21 manual
3 France 31 40 3 31 manual
4 France 41 50 4 43 manager
5 France 51 60 5 51 clerk
6 France 61 70 6 61 manager
7 France 71 80 7 71 manager
I've added the intermediate dataframe joined for easy of understanding. You can omit it and just chain the two command chains together with %>%.
Here is one data.table approach:
cols = c("inc_lower", "inc_high")
data_occ[, (cols) := income]
result = data_occ[order(income)
][income_range,
on = .(id, inc_lower>=inc_lower, inc_high<=inc_high),
mult="first"]
data_occ[, (cols) := NULL]
# id income occ inc_lower inc_high perct
# 1: France 10 clerk 10 20 1
# 2: France 21 manager 21 30 2
# 3: France 31 clerk 31 40 3
# 4: France 41 clerk 41 50 4
# 5: France 51 clerk 51 60 5
# 6: France 62 manager 61 70 6
# 7: France 71 manager 71 80 7

Accessing previous rows to make a calculation in a current row in R

I have data that I need to make calculations within sub-groups of rows. Specifically, I am calculating the nitrogen (N) recovery efficiency of crops between different field treatments. This requires finding the difference in N uptake of crops (totN) between a plot with a given amount of N applied (Nrate) and the control plot with 0 N applied, then dividing by the Nrate.
This is how the data looks for just one year:
year drain Nrate totN
2016 C 0 190
2016 C 100 220
2016 C 200 230
2016 N 0 130
2016 N 100 200
2016 N 200 220
I have gotten this far, thanks to Performing calculations between rows in R, but I am not sure how to reference the control row (Nrate = 0) within the sub-group each row is in.
This is where I am:
library(tidyverse)
df <- data.frame(year=c(2016,2016,2016,2016,2016,2016),
drain=c("C","C","C","N","N","N"),
Nrate=c(0,100,200,0,100,200),
totN=c(190,220,230,130,200,220))
df %>%
group_by(year,drain) %>%
mutate(id = row_number()) %>%
mutate(RE = ifelse(id != 1,
(totN - <the totN where Nrate=0 for same year and drain>) / Nrate,
NA))
This is what I expect to get:
year drain Nrate totN RE
2016 C 0 190 NA
2016 C 100 220 0.3 #(220-190)/100
2016 C 200 230 0.2 #(230-190)/200
2016 N 0 130 NA
2016 N 100 200 0.7 #(200-130)/100
2016 N 200 220 0.45 #(220-130)/200
We may subset the 'totN' by indexing or if it is already ordered, use first(totN)
library(dplyr)
df %>%
group_by(year, drain) %>%
mutate(RE = na_if((totN - totN[Nrate == 0])/Nrate, "NaN")) %>%
ungroup
-output
# A tibble: 6 x 5
# year drain Nrate totN RE
# <dbl> <chr> <dbl> <dbl> <dbl>
#1 2016 C 0 190 NA
#2 2016 C 100 220 0.3
#3 2016 C 200 230 0.2
#4 2016 N 0 130 NA
#5 2016 N 100 200 0.7
#6 2016 N 200 220 0.45

Create a table out of a tibble

I do have the following dataframe with 45 million observations:
year month variable
1992 1 0
1992 1 1
1992 1 1
1992 2 0
1992 2 1
1992 2 0
My goal is to count the frequency of the variable for each month of a year.
I was already able to generate these sums with cps_data as my dataframe and SKILL_1 as my variable.
cps_data %>%
group_by(YEAR, MONTH) %>%
summarise_at(vars(SKILL_1),
list(name = sum))
Logically, I obtained 348 different rows as a tibble. Now, I struggle to create a new table with these values. My new table should look similar to my tibble. How can I do that? Is there even a way? I've already tried to read in an excel file with a date range from 01/1992 - 01/2021 in order to obtain exactly 349 rows and then merge it with the rows of the tibble, but it did not work..
# A tibble: 349 x 3
# Groups: YEAR [30]
YEAR MONTH name
<dbl> <int+lbl> <dbl>
1 1992 1 [January] 499
2 1992 2 [February] 482
3 1992 3 [March] 485
4 1992 4 [April] 457
5 1992 5 [May] 434
6 1992 6 [June] 470
7 1992 7 [July] 450
8 1992 8 [August] 438
9 1992 9 [September] 442
10 1992 10 [October] 427
# ... with 339 more rows
many thanks in advance!!
library(zoo)
createmonthyear <- function(start_date,end_date){
ym <- seq(as.yearmon(start_date), as.yearmon(end_date), 1/12)
data.frame(start = pmax(start_date, as.Date(ym)),
end = pmin(end_date, as.Date(ym, frac = 1)),
month = month.name[cycle(ym)],
year = as.integer(ym),
stringsAsFactors = FALSE)}
Once you create the function, you can specify the start and end date you want:
left_table <- data.frame(createmonthyear(1991-01-01,2021-01-01))
then left join the output with what you have
library(dplyr)
right_table <- data.frame(cps_data %>%
group_by(YEAR, MONTH) %>%
summarise_at(vars(SKILL_1),
list(name = sum)))
results <- left_join(left_table, right_table, by = c("Year" = "year", "Month" = "month")

Using dplyr mutate function to create new variable conditionally based on current row

I am working on creating conditional averages for a large data set that involves # of flu cases seen during the week for several years. The data is organized as such:
What I want to do is create a new column that tabulates that average number of cases for that same week in previous years. For instance, for the row where Week.Number is 1 and Flu.Year is 2017, I would like the new row to give the average count for any year with Week.Number==1 & Flu.Year<2017. Normally, I would use the case_when() function to conditionally tabulate something like this. For instance, when calculating the average weekly volume I used this code:
mutate(average = case_when(
Flu.Year==2016 ~ mean(chcc$count[chcc$Flu.Year==2016]),
Flu.Year==2017 ~ mean(chcc$count[chcc$Flu.Year==2017]),
Flu.Year==2018 ~ mean(chcc$count[chcc$Flu.Year==2018]),
Flu.Year==2019 ~ mean(chcc$count[chcc$Flu.Year==2019]),
),
However, since there are four years of data * 52 weeks which is a lot of iterations to spell out the conditions for. Is there a way to elegantly code this in dplyr? The problem I keep running into is that I want to call values in counts column based on Week.Number and Flu.Year values in other rows conditioned on the current value of Week.Number and Flu.Year, and I am not sure how to accomplish that. Please let me know if there is further information / detail I can provide.
Thanks,
Steven
dat <- tibble( Flu.Year = rep(2016:2019,each = 52), Week.Number = rep(1:52,4), count = sample(1000, size=52*4, replace=TRUE) )
It's bad-form and, in some cases, an error when you use $-indexing within dplyr verbs.
I think a better way to get that average field is to group_by(Flu.Year) and calculate it straight-up.
library(dplyr)
set.seed(42)
dat <- tibble(
Flu.Year = sample(2016:2020, size=100, replace=TRUE),
count = sample(1000, size=100, replace=TRUE)
)
dat %>%
group_by(Flu.Year) %>%
mutate(average = mean(count)) %>%
# just to show a quick summary
slice(1:3) %>%
ungroup()
# # A tibble: 15 x 3
# Flu.Year count average
# <int> <int> <dbl>
# 1 2016 734 578.
# 2 2016 356 578.
# 3 2016 411 578.
# 4 2017 217 436.
# 5 2017 453 436.
# 6 2017 920 436.
# 7 2018 963 558
# 8 2018 609 558
# 9 2018 536 558
# 10 2019 943 543.
# 11 2019 740 543.
# 12 2019 536 543.
# 13 2020 627 494.
# 14 2020 218 494.
# 15 2020 389 494.
An alternative approach is to generate a summary table (just one row per year) and join it back in to the original data.
dat %>%
group_by(Flu.Year) %>%
summarize(average = mean(count))
# # A tibble: 5 x 2
# Flu.Year average
# <int> <dbl>
# 1 2016 578.
# 2 2017 436.
# 3 2018 558
# 4 2019 543.
# 5 2020 494.
dat %>%
group_by(Flu.Year) %>%
summarize(average = mean(count)) %>%
full_join(dat, by = "Flu.Year")
# # A tibble: 100 x 3
# Flu.Year average count
# <int> <dbl> <int>
# 1 2016 578. 734
# 2 2016 578. 356
# 3 2016 578. 411
# 4 2016 578. 720
# 5 2016 578. 851
# 6 2016 578. 822
# 7 2016 578. 465
# 8 2016 578. 679
# 9 2016 578. 30
# 10 2016 578. 180
# # ... with 90 more rows
The result, after chat:
tibble( Flu.Year = rep(2016:2018,each = 3), Week.Number = rep(1:3,3), count = 1:9 ) %>%
arrange(Flu.Year, Week.Number) %>%
group_by(Week.Number) %>%
mutate(year_week.average = lag(cumsum(count) / seq_along(count)))
# # A tibble: 9 x 4
# # Groups: Week.Number [3]
# Flu.Year Week.Number count year_week.average
# <int> <int> <int> <dbl>
# 1 2016 1 1 NA
# 2 2016 2 2 NA
# 3 2016 3 3 NA
# 4 2017 1 4 1
# 5 2017 2 5 2
# 6 2017 3 6 3
# 7 2018 1 7 2.5
# 8 2018 2 8 3.5
# 9 2018 3 9 4.5
We can use aggregate from base R
aggregate(count ~ Flu.Year, data, FUN = mean)

Iteration for time series data, using purrr

I have a bunch of time series data stacked on top of one another in a data frame; one series for each region in a country. I'd like to apply the seas() function (from the seasonal package) to each series, iteratively, to make the series seasonally adjusted. To do this, I first have to convert the series to a ts class. I'm struggling to do all this using purrr.
Here's a minimum worked example:
library(seasonal)
library(tidyverse)
set.seed(1234)
df <- data.frame(region = rep(1:10, each = 20),
quarter = rep(1:20, 10),
var = sample(5:200, 200, replace = T))
For each region (indexed by a number) I'd like to perform the following operations. Here's the first region as an example:
tem1 <- df %>% filter(region==1)
tem2 <- ts(data = tem1$var, frequency = 4, start=c(1990,1))
tem3 <- seas(tem2)
tem4 <- as.data.frame(tem3$data)
I'd then like to stack the output (ie. the multiple tem4 data frames, one for each region), along with the region and quarter identifiers.
So, the start of the output for region 1 would be this:
final seasonaladj trend irregular region quarter
1 27 27 96.95 -67.97279 1 1
2 126 126 96.95 27.87381 1 2
3 124 124 96.95 27.10823 1 3
4 127 127 96.95 30.55075 1 4
5 173 173 96.95 75.01355 1 5
6 130 130 96.95 32.10672 1 6
The data for region 2 would be below this etc.
I started with the following but without luck so far. Basically, I'm struggling to get the time series into the tibble:
seas.adjusted <- df %>%
group_by(region) %>%
mutate(data.ts = map(.x = data$var,
.f = as.ts,
start = 1990,
freq = 4))
I don't know much about the seasonal adjustment part, so there may be things I missed, but I can help with moving your calculations into a map-friendly function.
After grouping by region, you can nest the data so there's a nested data frame for each region. Then you can run essentially the same code as you had, but inside a function in map. Unnesting the resulting column gives you a long-shaped data frame of adjustments.
Like I said, I don't have the expertise to know whether those last two columns having NAs is expected or not.
Edit: Based on #wibeasley's question about retaining the quarter column, I'm adding a mutate that adds a column of the quarters listed in the nested data frame.
library(seasonal)
library(tidyverse)
set.seed(1234)
df <- data.frame(region = rep(1:10, each = 20),
quarter = rep(1:20, 10),
var = sample(5:200, 200, replace = T))
df %>%
group_by(region) %>%
nest() %>%
mutate(data.ts = map(data, function(x) {
tem2 <- ts(x$var, frequency = 4, start = c(1990, 1))
tem3 <- seas(tem2)
as.data.frame(tem3$data) %>%
mutate(quarter = x$quarter)
})) %>%
unnest(data.ts)
#> # A tibble: 200 x 8
#> region final seasonaladj trend irregular quarter seasonal adjustfac
#> <int> <dbl> <dbl> <dbl> <dbl> <int> <dbl> <dbl>
#> 1 1 27 27 97.0 -68.0 1 NA NA
#> 2 1 126 126 97.0 27.9 2 NA NA
#> 3 1 124 124 97.0 27.1 3 NA NA
#> 4 1 127 127 97.0 30.6 4 NA NA
#> 5 1 173 173 97.0 75.0 5 NA NA
#> 6 1 130 130 97.0 32.1 6 NA NA
#> 7 1 6 6 97.0 -89.0 7 NA NA
#> 8 1 50 50 97.0 -46.5 8 NA NA
#> 9 1 135 135 97.0 36.7 9 NA NA
#> 10 1 105 105 97.0 8.81 10 NA NA
#> # ... with 190 more rows
I also gave a bit more thought to doing this without nesting, and instead tried doing it with a split. Passing that list of data frames into imap_dfr let me take each split piece of the data frame and its name (in this case, the value of region), then return everything rbinded back together into one data frame. I sometimes shy away from nested data just because I have trouble seeing what's going on, so this is an alternative that is maybe more transparent.
df %>%
split(.$region) %>%
imap_dfr(function(x, reg) {
tem2 <- ts(x$var, frequency = 4, start = c(1990, 1))
tem3 <- seas(tem2)
as.data.frame(tem3$data) %>%
mutate(region = reg, quarter = x$quarter)
}) %>%
select(region, quarter, everything()) %>%
head()
#> region quarter final seasonaladj trend irregular seasonal adjustfac
#> 1 1 1 27 27 96.95 -67.97274 NA NA
#> 2 1 2 126 126 96.95 27.87378 NA NA
#> 3 1 3 124 124 96.95 27.10823 NA NA
#> 4 1 4 127 127 96.95 30.55077 NA NA
#> 5 1 5 173 173 96.95 75.01353 NA NA
#> 6 1 6 130 130 96.95 32.10669 NA NA
Created on 2018-08-12 by the reprex package (v0.2.0).
I put all the action inside of f(), and then called it with purrr::map_df(). The re-inclusion of quarter is a hack.
f <- function( .region ) {
d <- df %>%
dplyr::filter(region == .region)
y <- d %>%
dplyr::pull(var) %>%
ts(frequency = 4, start=c(1990,1)) %>%
seas()
y$data %>%
as.data.frame() %>%
# dplyr::select(-seasonal, -adjustfac) %>%
dplyr::mutate(
quarter = d$quarter
)
}
purrr::map_df(1:10, f, .id = "region")
results:
region final seasonaladj trend irregular quarter seasonal adjustfac
1 1 27.00000 27.00000 96.95000 -6.797279e+01 1 NA NA
2 1 126.00000 126.00000 96.95000 2.787381e+01 2 NA NA
3 1 124.00000 124.00000 96.95000 2.710823e+01 3 NA NA
4 1 127.00000 127.00000 96.95000 3.055075e+01 4 NA NA
5 1 173.00000 173.00000 96.95000 7.501355e+01 5 NA NA
6 1 130.00000 130.00000 96.95000 3.210672e+01 6 NA NA
7 1 6.00000 6.00000 96.95000 -8.899356e+01 7 NA NA
8 1 50.00000 50.00000 96.95000 -4.647254e+01 8 NA NA
9 1 135.00000 135.00000 96.95000 3.671077e+01 9 NA NA
10 1 105.00000 105.00000 96.95000 8.806955e+00 10 NA NA
...
96 5 55.01724 55.01724 60.25848 9.130207e-01 16 1.9084928 1.9084928
97 5 60.21549 60.21549 59.43828 1.013076e+00 17 1.0462424 1.0462424
98 5 58.30626 58.30626 58.87065 9.904130e-01 18 0.1715082 0.1715082
99 5 61.68175 61.68175 58.07827 1.062045e+00 19 1.0537962 1.0537962
100 5 59.30138 59.30138 56.70798 1.045733e+00 20 2.5294523 2.5294523
...

Resources