I am working through Rob Hyndman's FPP3. I am on section 2.5 and there is an example about Australian holiday tourism. Here is the example with output:
holidays <- tourism %>%
filter(Purpose == "Holiday") %>%
group_by(State) %>%
summarise(Trips = sum(Trips))
#> # A tsibble: 640 x 3 [1Q]
#> # Key: State [8]
#> State Quarter Trips
#> <chr> <qtr> <dbl>
#> 1 ACT 1998 Q1 196.
#> 2 ACT 1998 Q2 127.
#> 3 ACT 1998 Q3 111.
#> 4 ACT 1998 Q4 170.
#> 5 ACT 1999 Q1 108.
#> 6 ACT 1999 Q2 125.
#> 7 ACT 1999 Q3 178.
#> 8 ACT 1999 Q4 218.
#> 9 ACT 2000 Q1 158.
#> 10 ACT 2000 Q2 155.
#> # … with 630 more rows
However, when I use the same code I get the following output:
> holidays
# A tibble: 8 x 2
State Trips
<chr> <dbl>
1 ACT 12089.
2 New South Wales 238741.
3 Northern Territory 14917.
4 Queensland 170787.
5 South Australia 52887.
6 Tasmania 31229.
7 Victoria 179228.
8 Western Australia 63349.
As you can see, the tsibble has been changed to a tibble. When I run everything but the summarise function, I still get a tsibble. I am thinking that perhaps the summarise function is somehow changing the type to tibble. Any help would be appreciated. Thanks!

I uninstalled and reinstalled the tsibble package, and that fixed the issue. My original version was 0.8.6; after reinstalling I now have 0.9.0. Thanks!
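Since the fix was a package update, it may help to check the installed version before rerunning the book's example. The following is a minimal base R sketch; `needs_update()` is a hypothetical helper, and the 0.9.0 threshold is taken only from the version numbers reported above, not from the tsibble changelog.

```r
# Hypothetical helper: TRUE if `installed` is older than `minimum`.
needs_update <- function(installed, minimum = "0.9.0") {
  package_version(installed) < package_version(minimum)
}

needs_update("0.8.6")  # the version that returned a plain tibble above
needs_update("0.9.0")  # the version that reproduced the book's output

# With tsibble installed, the same check on the real version would be:
# needs_update(as.character(packageVersion("tsibble")))
```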


Is it possible to purrr::map the function by using the elements within the same dataframe in r?

x = list(data.frame(age = c(1:4),period = c(2000:2003)),
data.frame(age = c(5:8),period = c(1998:2001)),
data.frame(age = c(11:19),period = c(1990:1998)))
map2(x, x$period, ~cbind(.x, difference = .y-.x$age))
Is it possible to map the function by using the elements within the same dataframe?
In your context x$period is NULL, since x is the list of data frames and the list itself has no element named "period". I think you want to access the period column within each unnamed data frame in the list. I would just use map, which passes each data frame in the list to the function; inside the function you can access any column directly, without having to pass it explicitly.
x = list(data.frame(age = c(1:4),period = c(2000:2003)),
data.frame(age = c(5:8),period = c(1998:2001)),
data.frame(age = c(11:19),period = c(1990:1998)))
#Original attempt
result <- map2(x, x$period, ~cbind(.x, difference = .y-.x$age))
#> list()
#My solution
result2 <- map(x, function(df) cbind(df, difference = df$period - df$age))
#> [[1]]
#> age period difference
#> 1 1 2000 1999
#> 2 2 2001 1999
#> 3 3 2002 1999
#> 4 4 2003 1999
#> [[2]]
#> age period difference
#> 1 5 1998 1993
#> 2 6 1999 1993
#> 3 7 2000 1993
#> 4 8 2001 1993
#> [[3]]
#> age period difference
#> 1 11 1990 1979
#> 2 12 1991 1979
#> 3 13 1992 1979
#> 4 14 1993 1979
#> 5 15 1994 1979
#> 6 16 1995 1979
#> 7 17 1996 1979
#> 8 18 1997 1979
#> 9 19 1998 1979
#A more readable solution using dplyr
result3 <- map(x, function(df) df %>% mutate(difference = period - age))
#> [[1]]
#> age period difference
#> 1 1 2000 1999
#> 2 2 2001 1999
#> 3 3 2002 1999
#> 4 4 2003 1999
#> [[2]]
#> age period difference
#> 1 5 1998 1993
#> 2 6 1999 1993
#> 3 7 2000 1993
#> 4 8 2001 1993
#> [[3]]
#> age period difference
#> 1 11 1990 1979
#> 2 12 1991 1979
#> 3 13 1992 1979
#> 4 14 1993 1979
#> 5 15 1994 1979
#> 6 16 1995 1979
#> 7 17 1996 1979
#> 8 18 1997 1979
#> 9 19 1998 1979
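To see why the original map2() call had nothing to iterate over, and what a two-argument version would have needed, here is a small base R sketch. It uses Map() as a stand-in for map2(); the `periods` object is introduced only for illustration.

```r
x <- list(
  data.frame(age = 1:4, period = 2000:2003),
  data.frame(age = 5:8, period = 1998:2001),
  data.frame(age = 11:19, period = 1990:1998)
)

# The list itself has no element called "period", so this is NULL.
is.null(x$period)

# The column has to be pulled out of each data frame individually.
periods <- lapply(x, function(df) df$period)

# Base R analogue of map2(): walk over the data frames and their
# period vectors in parallel.
result <- Map(function(df, p) cbind(df, difference = p - df$age), x, periods)
result[[2]]
```

This gives the same data frames as the map() solutions, which is why passing the column separately is unnecessary here.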
How to find the annual evolution rate for each firm in my data table?

So I have a data table of 5000 firms, each firm is assigned a numerical value ("id") which is 1 for the first firm, 2 for the second ...
Here is my table with only the profit variable :
| id | year | profit |
|:---|:-----|:-------|
| 1 | 2001 | -0.4 |
| 1 | 2002 | -0.89 |
| 2 | 2001 | 1.89 |
| 2 | 2002 | 2.79 |
Each firm appears twice: one row holds the 2001 data and the other the 2002 data (the "id" value is the same on both rows because it is the same firm one year apart).
How can I calculate the annual rate of change for each firm ("id") between 2001 and 2002?
I'm really new to R and I don't see where to start. Should I separate the 2001 and 2002 data?
I did this :
years <- sort(unique(group$year))
years
And I also found this on the internet but with no success :
res <-
group %>%
arrange(id,year) %>%
group_by(id) %>%
mutate(evol_rate = ("group$year$2002" / lag("group$year$2001") - 1) * 100) %>%
Thank you very much
From what you've written, I take it that you want to calculate the rate of change (ROC) between the 2001 and 2002 profit values:
ROC = (current_value / previous_value - 1) * 100
To accomplish this, I suggest tidyr::pivot_wider() which reshapes your dataframe from long to wide format (see: https://r4ds.had.co.nz/tidy-data.html#pivoting).
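library(dplyr)
library(tidyr)
# 250 firms, two rows each (2001 and 2002), with random placeholder values
id <- sort(rep(seq(1, 250, 1), 2))
year <- rep(seq(2001, 2002, 1), 250)
value <- sample(500:2000, 500)
df <- data.frame(id, year, value)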
head(df, 10)
#> id year value
#> 1 1 2001 856
#> 2 1 2002 1850
#> 3 2 2001 1687
#> 4 2 2002 1902
#> 5 3 2001 1728
#> 6 3 2002 1773
#> 7 4 2001 691
#> 8 4 2002 1691
#> 9 5 2001 1368
#> 10 5 2002 893
df_wide <- df %>%
pivot_wider(names_from = year,
names_prefix = "profit_",
values_from = value,
values_fn = mean)
res <- df_wide %>%
  mutate(evol_rate = (profit_2002 / profit_2001 - 1) * 100)
head(res, 10)
#> # A tibble: 10 x 4
#> id profit_2001 profit_2002 evol_rate
#> <dbl> <dbl> <dbl> <dbl>
#> 1 1 856 1850 116.
#> 2 2 1687 1902 12.7
#> 3 3 1728 1773 2.6
#> 4 4 691 1691 145.
#> 5 5 1368 893 -34.7
#> 6 6 883 516 -41.6
#> 7 7 1280 1649 28.8
#> 8 8 1579 1383 -12.4
#> 9 9 1907 1626 -14.7
#> 10 10 1227 1134 -7.58
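If you want to do it without reshaping your data into a wide format, you can use lag() within each group:
library(dplyr)
# 250 firms, two rows each (2001 and 2002), with random placeholder values
id <- sort(rep(seq(1, 250, 1), 2))
year <- rep(seq(2001, 2002, 1), 250)
value <- sample(500:2000, 500)
df <- data.frame(id, year, value)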
df %>% head(n = 10)
#> id year value
#> 1 1 2001 1173
#> 2 1 2002 1648
#> 3 2 2001 1560
#> 4 2 2002 1091
#> 5 3 2001 1736
#> 6 3 2002 667
#> 7 4 2001 1840
#> 8 4 2002 1202
#> 9 5 2001 1597
#> 10 5 2002 1797
new_df <- df %>%
group_by(id) %>%
mutate(ROC = ((value / lag(value) - 1) * 100))
new_df %>% head(n = 10)
#> # A tibble: 10 × 4
#> # Groups: id [5]
#> id year value ROC
#> <dbl> <dbl> <int> <dbl>
#> 1 1 2001 1173 NA
#> 2 1 2002 1648 40.5
#> 3 2 2001 1560 NA
#> 4 2 2002 1091 -30.1
#> 5 3 2001 1736 NA
#> 6 3 2002 667 -61.6
#> 7 4 2001 1840 NA
#> 8 4 2002 1202 -34.7
#> 9 5 2001 1597 NA
#> 10 5 2002 1797 12.5
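This groups the data by id and then uses lag() to compare each year's value with the previous row in the same group. The rows must be sorted by year within each id for this to work. The first year of each firm has no previous value, so its ROC is NA.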
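Applied to the four rows from the question, the grouped lag() approach looks as follows. This is a sketch, assuming the data frame is called `group` as in the question. One thing to watch: firm 1's profit falls from -0.4 to -0.89, yet the plain formula returns a positive rate because both values are negative. The `evol_rate_abs` column, which divides by the absolute base value, is an assumption added here for comparison and is not part of either answer.

```r
library(dplyr)

# The four rows shown in the question
group <- data.frame(
  id = c(1, 1, 2, 2),
  year = c(2001, 2002, 2001, 2002),
  profit = c(-0.4, -0.89, 1.89, 2.79)
)

res <- group %>%
  arrange(id, year) %>%   # make sure 2001 comes before 2002 within each firm
  group_by(id) %>%
  mutate(
    # Formula from the answer: (current / previous - 1) * 100
    evol_rate = (profit / lag(profit) - 1) * 100,
    # Variant (assumption): keeps the sign of the change when the base is negative
    evol_rate_abs = (profit - lag(profit)) / abs(lag(profit)) * 100
  ) %>%
  ungroup()

res
```

For firm 2 both columns agree; for firm 1 they differ only in sign, so which one to report depends on how a change from a negative base should be read.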

`dplyr::select` without reordering columns

I am looking for an easy, concise way to use dplyr::select without rearranging columns.
Consider this dataset:
#> # A tibble: 6 × 11
#> name genus vore order conservation sleep_total sleep_rem sleep_cycle awake
#> <chr> <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
#> 1 Cheetah Acin… carni Carn… lc 12.1 NA NA 11.9
#> 2 Owl mo… Aotus omni Prim… <NA> 17 1.8 NA 7
#> 3 Mounta… Aplo… herbi Rode… nt 14.4 2.4 NA 9.6
#> 4 Greate… Blar… omni Sori… lc 14.9 2.3 0.133 9.1
#> 5 Cow Bos herbi Arti… domesticated 4 0.7 0.667 20
#> 6 Three-… Brad… herbi Pilo… <NA> 14.4 2.2 0.767 9.6
#> # … with 2 more variables: brainwt <dbl>, bodywt <dbl>
If I select vore, genus and name, the resulting dataframe is arranged in the order in which the columns were provided.
msleep %>% select(vore, genus, name)
#> # A tibble: 83 × 3
#> vore genus name
#> <chr> <chr> <chr>
#> 1 carni Acinonyx Cheetah
#> 2 omni Aotus Owl monkey
#> 3 herbi Aplodontia Mountain beaver
#> 4 omni Blarina Greater short-tailed shrew
#> 5 herbi Bos Cow
#> 6 herbi Bradypus Three-toed sloth
#> 7 carni Callorhinus Northern fur seal
#> 8 <NA> Calomys Vesper mouse
#> 9 carni Canis Dog
#> 10 herbi Capreolus Roe deer
#> # … with 73 more rows
I would instead like to leave them in their default order: name, genus, then vore.
I have a solution (see below), but I do not like it because it is quite wordy, and not completely “tidyverse-esque”.
(I am teaching an intro to tidyverse course, and would like something that would not intimidate beginners.)
msleep %>%
select(all_of(names(msleep)[names(msleep) %in% c("vore", "genus", "name")]))
#> # A tibble: 83 × 3
#> name genus vore
#> <chr> <chr> <chr>
#> 1 Cheetah Acinonyx carni
#> 2 Owl monkey Aotus omni
#> 3 Mountain beaver Aplodontia herbi
#> 4 Greater short-tailed shrew Blarina omni
#> 5 Cow Bos herbi
#> 6 Three-toed sloth Bradypus herbi
#> 7 Northern fur seal Callorhinus carni
#> 8 Vesper mouse Calomys <NA>
#> 9 Dog Canis carni
#> 10 Roe deer Capreolus herbi
#> # … with 73 more rows
Is there such a thing? Thank you!
For context: In reality, we have a data frame with about 400 columns, from which we are selecting ~10-20 at a time to work with. The order of the columns in the original data frame is meaningful, but we don't want to have to labor over listing them in their correct order in the select statements. A very specific need, I'll admit.
We could use match with sort
msleep %>%
select(sort(match(c("vore", "genus", "name"), names(.))))
EDIT: Based on the OP's comments
In case of providing a vector we could do as akrun suggests in the comments:
nm1 <- c("vore", "genus", "name"); pattern <- str_c(nm1, collapse="|")
Original answer:
You could first define a string with the search items
and then use matches
pattern <- c("vore|genus|name")
select(msleep, matches(pattern))
name genus vore
<chr> <chr> <chr>
1 Cheetah Acinonyx carni
2 Owl monkey Aotus omni
3 Mountain beaver Aplodontia herbi
4 Greater short-tailed shrew Blarina omni
5 Cow Bos herbi
6 Three-toed sloth Bradypus herbi
7 Northern fur seal Callorhinus carni
8 Vesper mouse Calomys NA
9 Dog Canis carni
10 Roe deer Capreolus herbi
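One caveat with matches(): the pattern is a regular expression, so "vore|genus|name" also selects any column whose name merely contains one of those strings. With around 400 columns that may matter. A small base R sketch of the difference, using grepl() on a vector of column names; the "surname" column is hypothetical and only there to show the over-match.

```r
cols <- c("name", "genus", "vore", "surname", "sleep_total")  # "surname" is hypothetical
wanted <- c("vore", "genus", "name")

loose <- paste(wanted, collapse = "|")   # "vore|genus|name"
strict <- paste0("^(", loose, ")$")      # anchored: whole name must match

cols[grepl(loose, cols)]   # also picks up "surname"
cols[grepl(strict, cols)]  # exact names only, still in original column order
```

The anchored pattern can be passed to matches() in the same way as the unanchored one.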
You can use the power of eval_select() to create a function to select and sort the columns.
select_in_order <- function(data, ...) {
  # Resolve the selection to column positions, then sort them into data order
  ordered_cols <- sort(tidyselect::eval_select(expr(c(...)), data))
  select(data, all_of(ordered_cols))
}
So now this will do what you are asking. The benefit is that it accepts everything you are used to passing to select(), including selection helpers and column ranges.
# library(ggplot2) # msleep is in ggplot2
msleep %>%
select_in_order(vore, genus, name)
# this will work as well
msleep %>%
select_in_order(starts_with("sleep"), vore, name:genus)
As another option, simply use relocate() after your select() statement. This accomplishes the end goal of keeping the columns in their original order in a way that is easy for a beginner to understand.
msleep %>%
  select(vore, genus, name) %>%
  relocate(any_of(names(msleep)))
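Another short option, not from the answers above, relies on intersect() returning its result in the order of its first argument. A minimal sketch on a one-row stand-in for msleep (the data frame `df` below is made up for illustration):

```r
# One-row stand-in with columns in their "meaningful" original order
df <- data.frame(
  name = "Cheetah", genus = "Acinonyx", vore = "carni", order = "Carnivora"
)
wanted <- c("vore", "genus", "name")

# intersect() keeps the order of its first argument, i.e. the data's order
in_order <- intersect(names(df), wanted)
in_order

df[in_order]

# The dplyr equivalent would be:
# df %>% select(all_of(intersect(names(df), wanted)))
```

For a beginner course this keeps the column list as a plain character vector, at the cost of not supporting helpers such as starts_with().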

How can I use dplyr::filter with "|" operator? [duplicate]

I use dplyr's filter() function all the time for tidying my data. Today it has stopped working when using the | operator. I am certain I have been able to use the | to filter any observation that meets any of the criteria separated by the | but it isn't working all of a sudden. Any help/guidance is greatly appreciated, as always. Reprex is below.
#> Warning: package 'tibble' was built under R version 3.6.2
#> Warning: package 'tidyr' was built under R version 3.6.2
#> Warning: package 'purrr' was built under R version 3.6.2
id <- c(1:20)
YEAR <- c(2009,2009,2009,2009,2010,2010,2010,2010,2011,2011,2011,2011,2012,2012,2012,2012,2013,2013,2013,2013)
df1 <- data.frame(id,YEAR)
#> id YEAR
#> 1 1 2009
#> 2 2 2009
#> 3 3 2009
#> 4 4 2009
#> 5 5 2010
#> 6 6 2010
#> 7 7 2010
#> 8 8 2010
#> 9 9 2011
#> 10 10 2011
#> 11 11 2011
#> 12 12 2011
#> 13 13 2012
#> 14 14 2012
#> 15 15 2012
#> 16 16 2012
#> 17 17 2013
#> 18 18 2013
#> 19 19 2013
#> 20 20 2013
df1 <- df1 %>% dplyr::filter(YEAR == 2009|2010)
#> id YEAR
#> 1 1 2009
#> 2 2 2009
#> 3 3 2009
#> 4 4 2009
#> 5 5 2010
#> 6 6 2010
#> 7 7 2010
#> 8 8 2010
#> 9 9 2011
#> 10 10 2011
#> 11 11 2011
#> 12 12 2011
#> 13 13 2012
#> 14 14 2012
#> 15 15 2012
#> 16 16 2012
#> 17 17 2013
#> 18 18 2013
#> 19 19 2013
#> 20 20 2013
Expected results would be:
df1 <- df1 %>% dplyr::filter(YEAR == 2009|2010)
#> id YEAR
#> 1 1 2009
#> 2 2 2009
#> 3 3 2009
#> 4 4 2009
#> 5 5 2010
#> 6 6 2010
#> 7 7 2010
#> 8 8 2010
The following works filtering on a single condition:
df1 <- df1 %>% dplyr::filter(YEAR == 2009)
#> id YEAR
#> 1 1 2009
#> 2 2 2009
#> 3 3 2009
#> 4 4 2009
We can use %in% instead of == for more than one element
df1 %>%
dplyr::filter(YEAR %in% c(2009, 2010))
With |, we need to repeat the full comparison on each side:
df1 %>%
dplyr::filter(YEAR == 2009|YEAR == 2010)
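In YEAR == 2009|2010, the == is evaluated before the |, so the condition is read as (YEAR == 2009) | 2010. Any non-zero number combined with | counts as TRUE, so every row passes the filter:
2009|2010
#[1] TRUE
0|0
#[1] FALSE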
I think also your way would work with...
df1 <- df1 %>% dplyr::filter(YEAR == 2009|YEAR == 2010)
I think of it as two separate arguments. If you use each individually, the filter would work. In your provided YEAR == 2009|2010, the second part would simply be filter(2010), which doesn't make sense.
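The precedence issue can be checked on a plain vector, without filter(). A minimal base R sketch with a three-element YEAR vector made up for illustration:

```r
YEAR <- c(2009, 2010, 2011)

# What was written: == binds tighter than |, and the bare 2010 counts as TRUE
wrong <- YEAR == 2009 | 2010

# The same expression with the implicit parentheses made explicit
explicit <- (YEAR == 2009) | 2010

# What was intended: a full comparison on each side of |
right <- YEAR == 2009 | YEAR == 2010

wrong
right
```

`wrong` is TRUE for every element, including 2011, which is why the filter returned all rows, while `right` gives the same result as `YEAR %in% c(2009, 2010)`.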

compute deflation factor to index wages, by CPI, in panel data

I'm struggling to understand exactly how to compute a deflation factor for wages in a panel based on inflation.
I have the R example below to help illustrate the issue.
In Wooldridge (2009:452) Introductory Econometrics, 5th ed., he creates a deflation factor by dividing 107.6 by 65.2, i.e. 107.6/65.2 ≈ 1.65, but I can't figure out how to apply this to my own panel data. Wooldridge only mentions the deflation factor in passing.
Say I have a mini panel with two people, Jane and Tom, starting in 2009 and 2006 respectively and running until 2015, with their yearly wages,
# install.packages(c("dplyr"), dependencies = TRUE)
tbl <- tibble(id = rep(c('Jane', 'Tom'), c(7, 10)),
yr = c(2009:2015, 2006:2015),
wg = c(rnorm(7, mean=5.1*10^4, sd=9), rnorm(10, 4*10^4, 12))
); tbl
#> A tibble: 17 x 3
#> id yr wg
#> <chr> <int> <dbl>
#> 1 Jane 2009 50991.93
#> 2 Jane 2010 51001.66
#> 3 Jane 2011 51014.29
#> 4 Jane 2012 50989.83
#> 5 Jane 2013 50999.28
#> 6 Jane 2014 51001.19
#> 7 Jane 2015 51006.37
#> 8 Tom 2006 39997.12
#> 9 Tom 2007 40023.81
#> 10 Tom 2008 39998.33
#> 11 Tom 2009 40005.01
#> 12 Tom 2010 40011.78
#> 13 Tom 2011 39995.29
#> 14 Tom 2012 39987.52
#> 15 Tom 2013 40021.39
#> 16 Tom 2014 39972.27
#> 17 Tom 2015 40010.54
I now get the consumer price index (CPI) (using this answer)
# install.packages(c("Quandl"), dependencies = TRUE)
CPI00to16 <- Quandl::Quandl("FRED/CPIAUCSL", collapse="annual",
start_date="2000-01-01", end_date="2016-01-01")
#> # A tibble: 17 x 2
#> Date Value
#> <date> <dbl>
#> 1 2016-12-31 238.106
#> 2 2015-12-31 237.846
#> 3 2014-12-31 236.290
#> 4 2013-12-31 234.723
#> 5 2012-12-31 231.221
#> 6 2011-12-31 227.223
#> 7 2010-12-31 220.472
#> 8 2009-12-31 217.347
#> 9 2008-12-31 211.398
#> 10 2007-12-31 211.445
#> 11 2006-12-31 203.100
#> 12 2005-12-31 198.100
#> 13 2004-12-31 191.700
#> 14 2003-12-31 185.500
#> 15 2002-12-31 181.800
#> 16 2001-12-31 177.400
#> 17 2000-12-31 174.600
My question is: how do I deflate Jane's and Tom's wages, cf. Wooldridge (2009), selecting 2015 as the baseline year?
Update, following MrSmithGoesToWashington’s comment below:
CPI00to16$yr <- as.numeric(format(CPI00to16$Date,'%Y'))
CPI00to16 <- mutate(CPI00to16, deflation_factor = CPI00to16[2,2]/Value)
df <- tbl %>% inner_join(as_tibble(CPI00to16[,3:4]), by = "yr")
df <- mutate(df, wg_defl = deflation_factor*wg, wg_diff = wg_defl-wg)
#> # A tibble: 17 x 6
#> id yr wg deflation_factor wg_defl wg_diff
#> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 Jane 2009 50991.93 1.094315 55801.21 4809.2844
#> 2 Jane 2010 51001.66 1.078804 55020.78 4019.1176
#> 3 Jane 2011 51014.29 1.046751 53399.28 2384.9910
#> 4 Jane 2012 50989.83 1.028652 52450.80 1460.9728
#> 5 Jane 2013 50999.28 1.013305 51677.83 678.5477
#> 6 Jane 2014 51001.19 1.006585 51337.04 335.8494
#> 7 Jane 2015 51006.37 1.000000 51006.37 0.0000
#> 8 Tom 2006 39997.12 1.171078 46839.76 6842.6394
#> 9 Tom 2007 40023.81 1.124860 45021.18 4997.3691
#> 10 Tom 2008 39998.33 1.125110 45002.53 5004.1909
#> 11 Tom 2009 40005.01 1.094315 43778.07 3773.0575
#> 12 Tom 2010 40011.78 1.078804 43164.86 3153.0747
#> 13 Tom 2011 39995.29 1.046751 41865.12 1869.8369
#> 14 Tom 2012 39987.52 1.028652 41133.26 1145.7322
#> 15 Tom 2013 40021.39 1.013305 40553.87 532.4863
#> 16 Tom 2014 39972.27 1.006585 40235.49 263.2225
#> 17 Tom 2015 40010.54 1.000000 40010.54 0.0000
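The update picks the 2015 CPI by row position (CPI00to16[2,2]), which silently depends on the row order of the download. A hedged sketch of the same calculation with the base year looked up by value is shown below. The `deflate()` helper and its argument names are hypothetical, and the CPI figures are copied from the table above rather than fetched from Quandl.

```r
# Annual CPI values for 2006-2015, copied from the CPI output above
cpi <- data.frame(
  yr = 2006:2015,
  cpi = c(203.100, 211.445, 211.398, 217.347, 220.472,
          227.223, 231.221, 234.723, 236.290, 237.846)
)

# Hypothetical helper: express `wage` earned in `year` in base-year prices
deflate <- function(wage, year, cpi_table, base_year = 2015) {
  base_cpi <- cpi_table$cpi[cpi_table$yr == base_year]
  stopifnot(length(base_cpi) == 1)             # base year must appear exactly once
  cpi_then <- cpi_table$cpi[match(year, cpi_table$yr)]
  wage * base_cpi / cpi_then                   # deflation factor = CPI_base / CPI_year
}

# Jane's 2009 and 2015 wages from the panel above
deflate(c(50991.93, 51006.37), c(2009, 2015), cpi)
```

The factor is 1 in the base year, so 2015 wages are unchanged, and earlier wages are scaled up by the ratio of the 2015 CPI to that year's CPI, which is the same logic as the 107.6/65.2 ratio in Wooldridge.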
