How to add a new column with values specific to grouped variables - r

I'm new to R and have found similar solutions to my problem, but I'm struggling to apply these to my code. Please help...
These data are simplified, as the id variables are many:
df = data.frame(id = rep(c("a_10", "a_11", "b_10", "b_11"), each = 5),
site = rep(1:5, 4),
value = sample(1:20))
The aim is to add another column labelled "year" with values that are grouped by "id" but the true names are many - so I'm trying to simplify the code by using the ending digits.
I can use dplyr to split the dataframe into each id variable using this code (repeated for each id variable):
df %>%
select(site, id, value) %>%
filter(grepl("10$", id)) %>%
mutate(Year = "2010")`
Rather than using merge to re-combine the dataframes back into one, is there not a more simple method?
I tried modifying case_when with mutate as described in a previous answer:
[https://stackoverflow.com/a/63043920/12313457][1]
mutate(year = case_when(grepl(c("10$", "11$", id) == c("2010", "2011"))))
is something like this possible??
Thanks in advance

In case your id column has different string lengths you can use sub:
df %>%
mutate(Year = paste0("20", sub('^.*_(\\d+)$', '\\1', id)))
#> id site value Year
#> 1 a_10 1 2 2010
#> 2 a_10 2 7 2010
#> 3 a_10 3 16 2010
#> 4 a_10 4 10 2010
#> 5 a_10 5 11 2010
#> 6 a_11 1 5 2011
#> 7 a_11 2 13 2011
#> 8 a_11 3 14 2011
#> 9 a_11 4 6 2011
#> 10 a_11 5 12 2011
#> 11 b_10 1 17 2010
#> 12 b_10 2 1 2010
#> 13 b_10 3 4 2010
#> 14 b_10 4 15 2010
#> 15 b_10 5 9 2010
#> 16 b_11 1 8 2011
#> 17 b_11 2 20 2011
#> 18 b_11 3 19 2011
#> 19 b_11 4 18 2011
#> 20 b_11 5 3 2011
Created on 2022-04-21 by the reprex package (v2.0.1)

You can use substr to get the final two digits of id and then paste0 this to "20" to recreate the year.
df |> dplyr::mutate(Year = paste0("20", substr(id, 3, 4)))
#> id site value Year
#> 1 a_10 1 5 2010
#> 2 a_10 2 12 2010
#> 3 a_10 3 9 2010
#> 4 a_10 4 7 2010
#> 5 a_10 5 13 2010
#> 6 a_11 1 3 2011
#> 7 a_11 2 4 2011
#> 8 a_11 3 16 2011
#> 9 a_11 4 2 2011
#> 10 a_11 5 6 2011
#> 11 b_10 1 19 2010
#> 12 b_10 2 14 2010
#> 13 b_10 3 15 2010
#> 14 b_10 4 10 2010
#> 15 b_10 5 11 2010
#> 16 b_11 1 18 2011
#> 17 b_11 2 1 2011
#> 18 b_11 3 20 2011
#> 19 b_11 4 17 2011
#> 20 b_11 5 8 2011
Created on 2022-04-21 by the reprex package (v2.0.1)

Related

Sum up with the next line into a new colum

I'm having some trouble on figuring out how to create a new column with the sum of 2 subsequent cells.
I have :
df1<- tibble(Years=c(1990, 2000, 2010, 2020, 2030, 2050, 2060, 2070, 2080),
Values=c(1,2,3,4,5,6,7,8,9 ))
Now, I want a new column where the first line is the sum of 1+2, the second line is the sum of 1+2+3 , the third line is the sum 1+2+3+4 and so on.
As 1, 2, 3, 4... are hipoteticall values, I need to measure the absolute growth from a decade to another in order to create later on a new variable to measure the percentage change from a decade to another.
library(tibble)
df1<- tibble(Years=c(1990, 2000, 2010, 2020, 2030, 2050, 2060, 2070, 2080),
Values=c(1,2,3,4,5,6,7,8,9 ))
library(slider)
library(dplyr, warn.conflicts = F)
df1 %>%
mutate(xx = slide_sum(Values, after = 1, before = Inf))
#> # A tibble: 9 x 3
#> Years Values xx
#> <dbl> <dbl> <dbl>
#> 1 1990 1 3
#> 2 2000 2 6
#> 3 2010 3 10
#> 4 2020 4 15
#> 5 2030 5 21
#> 6 2050 6 28
#> 7 2060 7 36
#> 8 2070 8 45
#> 9 2080 9 45
Created on 2021-08-12 by the reprex package (v2.0.0)
Assuming the last row is to be repeated. Otherwise the fill part can be skipped.
library(dplyr)
library(tidyr)
df1 %>%
mutate(x = lead(cumsum(Values))) %>%
fill(x)
# Years Values x
# <dbl> <dbl> <dbl>
# 1 1990 1 3
# 2 2000 2 6
# 3 2010 3 10
# 4 2020 4 15
# 5 2030 5 21
# 6 2050 6 28
# 7 2060 7 36
# 8 2070 8 45
# 9 2080 9 45
Using base R
v1 <- cumsum(df1$Values)[-1]
df1$new <- c(v1, v1[length(v1)])
You want the cumsum() function. Here are two ways to do it.
### Base R
df1$cumsum <- cumsum(df1$Values)
### Using dplyr
library(dplyr)
df1 <- df1 %>%
mutate(cumsum = cumsum(Values))
Here is the output in either case.
df1
# A tibble: 9 x 3
Years Values cumsum
<dbl> <dbl> <dbl>
1 1990 1 1
2 2000 2 3
3 2010 3 6
4 2020 4 10
5 2030 5 15
6 2050 6 21
7 2060 7 28
8 2070 8 36
9 2080 9 45
A data.table option
> setDT(df)[, newCol := shift(cumsum(Values), -1, fill = sum(Values))][]
Years Values newCol
1: 1990 1 3
2: 2000 2 6
3: 2010 3 10
4: 2020 4 15
5: 2030 5 21
6: 2050 6 28
7: 2060 7 36
8: 2070 8 45
9: 2080 9 45
or a base R option following a similar idea
transform(
df,
newCol = c(cumsum(Values)[-1],sum(Values))
)

How to calculate the number of months from the initial date for each individual

This is a representation of my dataset
ID<-c(rep(1,10),rep(2,8))
year<-c(2007,2007,2007,2008,2008,2009,2010,2009,2010,2011,
2008,2008,2009,2010,2009,2010,2011,2011)
month<-c(2,7,12,4,11,6,11,1,9,4,3,6,7,4,9,11,2,8)
mydata<-data.frame(ID,year,month)
I want to calculate for each individual the number of months from the initial date. I am using two variables: year and month.
I firstly order years and months:
mydata2<-mydata%>%group_by(ID,year)%>%arrange(year,month,.by_group=T)
Then I created the variable date considering that the day begin with 01:
mydata2$date<-paste("01",mydata2$month,mydata2$year,sep = "-")
then I used lubridate to change this variable in date format
mydata2$date<-dmy(mydata2$date)
But after this, I really don't know what to do, in order to have such a dataset (preferably using dplyr code) below:
ID year month date dif_from_init
1 1 2007 2 01-2-2007 0
2 1 2007 7 01-7-2007 5
3 1 2007 12 01-12-2007 10
4 1 2008 4 01-4-2008 14
5 1 2008 11 01-11-2008 21
6 1 2009 1 01-1-2009 23
7 1 2009 6 01-6-2009 28
8 1 2010 9 01-9-2010 43
9 1 2010 11 01-11-2010 45
10 1 2011 4 01-4-2011 50
11 2 2008 3 01-3-2008 0
12 2 2008 6 01-6-2008 3
13 2 2009 7 01-7-2009 16
14 2 2009 9 01-9-2009 18
15 2 2010 4 01-4-2010 25
16 2 2010 11 01-11-2010 32
17 2 2011 2 01-2-2011 35
18 2 2011 8 01-8-2011 41
One way could be:
mydata %>%
group_by(ID) %>%
mutate(date = as.Date(sprintf('%d-%d-01',year, month)),
diff = as.numeric(round((date - date[1])/365*12)))
# A tibble: 18 x 5
# Groups: ID [2]
ID year month date diff
<dbl> <dbl> <dbl> <date> <dbl>
1 1 2007 2 2007-02-01 0
2 1 2007 7 2007-07-01 5
3 1 2007 12 2007-12-01 10
4 1 2008 4 2008-04-01 14
5 1 2008 11 2008-11-01 21
6 1 2009 6 2009-06-01 28
7 1 2010 11 2010-11-01 45
8 1 2009 1 2009-01-01 23
9 1 2010 9 2010-09-01 43
10 1 2011 4 2011-04-01 50
11 2 2008 3 2008-03-01 0
12 2 2008 6 2008-06-01 3
13 2 2009 7 2009-07-01 16
14 2 2010 4 2010-04-01 25
15 2 2009 9 2009-09-01 18
16 2 2010 11 2010-11-01 32
17 2 2011 2 2011-02-01 35
18 2 2011 8 2011-08-01 41

How can I use dplyr::filter with "|" operator? [duplicate]

This question already has answers here:
Filter data.frame rows by a logical condition
(9 answers)
Closed 2 years ago.
I use dplyr's filter() function all the time for tidying my data. Today it has stopped working when using the | operator. I am certain I have been able to use the | to filter any observation that meets any of the criteria separated by the | but it isn't working all of a sudden. Any help/guidance is greatly appreciated, as always. Reprex is below.
library(tidyverse)
#> Warning: package 'tibble' was built under R version 3.6.2
#> Warning: package 'tidyr' was built under R version 3.6.2
#> Warning: package 'purrr' was built under R version 3.6.2
id <- c(1:20)
YEAR <- c(2009,2009,2009,2009,2010,2010,2010,2010,2011,2011,2011,2011,2012,2012,2012,2012,2013,2013,2013,2013)
df1 <- data.frame(id,YEAR)
df1
#> id YEAR
#> 1 1 2009
#> 2 2 2009
#> 3 3 2009
#> 4 4 2009
#> 5 5 2010
#> 6 6 2010
#> 7 7 2010
#> 8 8 2010
#> 9 9 2011
#> 10 10 2011
#> 11 11 2011
#> 12 12 2011
#> 13 13 2012
#> 14 14 2012
#> 15 15 2012
#> 16 16 2012
#> 17 17 2013
#> 18 18 2013
#> 19 19 2013
#> 20 20 2013
df1 <- df1 %>% dplyr::filter(YEAR == 2009|2010)
df1
#> id YEAR
#> 1 1 2009
#> 2 2 2009
#> 3 3 2009
#> 4 4 2009
#> 5 5 2010
#> 6 6 2010
#> 7 7 2010
#> 8 8 2010
#> 9 9 2011
#> 10 10 2011
#> 11 11 2011
#> 12 12 2011
#> 13 13 2012
#> 14 14 2012
#> 15 15 2012
#> 16 16 2012
#> 17 17 2013
#> 18 18 2013
#> 19 19 2013
#> 20 20 2013
Expected results would be:
df1 <- df1 %>% dplyr::filter(YEAR == 2009|2010)
df1
#> id YEAR
#> 1 1 2009
#> 2 2 2009
#> 3 3 2009
#> 4 4 2009
#> 5 5 2010
#> 6 6 2010
#> 7 7 2010
#> 8 8 2010
The following works filtering on a single condition:
df1 <- df1 %>% dplyr::filter(YEAR == 2009)
df1
#> id YEAR
#> 1 1 2009
#> 2 2 2009
#> 3 3 2009
#> 4 4 2009
We can use %in% instead of == for more than one element
library(dplyr)
df1 %>%
dplyr::filter(YEAR %in% c(2009, 2010))
With |, we need to repeat
df1 %>%
dplyr::filter(YEAR == 2009|YEAR == 2010)
Any value greater than 0 with another, gives TRUE
2019|2020
#[1] TRUE
0|0
#[1] FALSE
I think also your way would work with...
df1 <- df1 %>% dplyr::filter(YEAR == 2009|YEAR == 2010)
I think of it as two separate arguments. If you use each individually, the filter would work. In your provided YEAR == 2009|2010, the second part would simply be filter(2010), which doesn't make sense.

compute deflation factor to index wages, by CPI, in panel data

I'm struggling to understand exactly how to compute a deflation factor for wages in a panel based on inflation.
I've teh R example below to help me illustrate the issue.
In Wooldridge (2009:452) Introductory Econometrics, 5th ed., he creates a deflation factor by dividing 107.6 by 65.2, i.e. 107.6/65.2 ≈ 1.65, but I can't figure out to to apply this to my own panel data. Wooldridge only mentions the deflation factor in passing.
Say I have a mini panel with two people, Jane and Tom, staring from 2006/2009 and running until 2015 with their yearly wage,
# install.packages(c("dplyr"), dependencies = TRUE)
library(dplyr)
set.seed(2)
tbl <- tibble(id = rep(c('Jane', 'Tom'), c(7, 10)),
yr = c(2009:2015, 2006:2015),
wg = c(rnorm(7, mean=5.1*10^4, sd=9), rnorm(10, 4*10^4, 12))
); tbl
#> A tibble: 17 x 3
#> id yr wg
#> <chr> <int> <dbl>
#> 1 Jane 2009 50991.93
#> 2 Jane 2010 51001.66
#> 3 Jane 2011 51014.29
#> 4 Jane 2012 50989.83
#> 5 Jane 2013 50999.28
#> 6 Jane 2014 51001.19
#> 7 Jane 2015 51006.37
#> 8 Tom 2006 39997.12
#> 9 Tom 2007 40023.81
#> 10 Tom 2008 39998.33
#> 11 Tom 2009 40005.01
#> 12 Tom 2010 40011.78
#> 13 Tom 2011 39995.29
#> 14 Tom 2012 39987.52
#> 15 Tom 2013 40021.39
#> 16 Tom 2014 39972.27
#> 17 Tom 2015 40010.54
I now get the consumer price index (CPI) (using this answer)
# install.packages(c("Quandl"), dependencies = TRUE)
CPI00to16 <- Quandl::Quandl("FRED/CPIAUCSL", collapse="annual",
start_date="2000-01-01", end_date="2016-01-01")
as_tibble(CPI00to16)
#> # A tibble: 17 x 2
#> Date Value
#> <date> <dbl>
#> 1 2016-12-31 238.106
#> 2 2015-12-31 237.846
#> 3 2014-12-31 236.290
#> 4 2013-12-31 234.723
#> 5 2012-12-31 231.221
#> 6 2011-12-31 227.223
#> 7 2010-12-31 220.472
#> 8 2009-12-31 217.347
#> 9 2008-12-31 211.398
#> 10 2007-12-31 211.445
#> 11 2006-12-31 203.100
#> 12 2005-12-31 198.100
#> 13 2004-12-31 191.700
#> 14 2003-12-31 185.500
#> 15 2002-12-31 181.800
#> 16 2001-12-31 177.400
#> 17 2000-12-31 174.600
my question is how do I deflate Jane and Tom's wages cf. Wooldridge 2009 selecting 2015 as the baseline year?
update; following MrSmithGoesToWashington’s comment below.
CPI00to16$yr <- as.numeric(format(CPI00to16$Date,'%Y'))
CPI00to16 <- mutate(CPI00to16, deflation_factor = CPI00to16[2,2]/Value)
df <- tbl %>% inner_join(as_tibble(CPI00to16[,3:4]), by = "yr")
df <- mutate(df, wg_defl = deflation_factor*wg, wg_diff = wg_defl-wg)
df
#> # A tibble: 17 x 6
#> id yr wg deflation_factor wg_defl wg_diff
#> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 Jane 2009 50991.93 1.094315 55801.21 4809.2844
#> 2 Jane 2010 51001.66 1.078804 55020.78 4019.1176
#> 3 Jane 2011 51014.29 1.046751 53399.28 2384.9910
#> 4 Jane 2012 50989.83 1.028652 52450.80 1460.9728
#> 5 Jane 2013 50999.28 1.013305 51677.83 678.5477
#> 6 Jane 2014 51001.19 1.006585 51337.04 335.8494
#> 7 Jane 2015 51006.37 1.000000 51006.37 0.0000
#> 8 Tom 2006 39997.12 1.171078 46839.76 6842.6394
#> 9 Tom 2007 40023.81 1.124860 45021.18 4997.3691
#> 10 Tom 2008 39998.33 1.125110 45002.53 5004.1909
#> 11 Tom 2009 40005.01 1.094315 43778.07 3773.0575
#> 12 Tom 2010 40011.78 1.078804 43164.86 3153.0747
#> 13 Tom 2011 39995.29 1.046751 41865.12 1869.8369
#> 14 Tom 2012 39987.52 1.028652 41133.26 1145.7322
#> 15 Tom 2013 40021.39 1.013305 40553.87 532.4863
#> 16 Tom 2014 39972.27 1.006585 40235.49 263.2225
#> 17 Tom 2015 40010.54 1.000000 40010.54 0.0000

Performing a dplyr full_join without a common variable to blend data frames

Using the dplyr full_join() operation, I am trying to perform the equivalent of a basic merge() operation in which no common variables exist (unable to satisfy the "by=" argument). This will blend two data frames and return all possible combinations.
However, the current full_join() function requires a common variable. I am unable to locate another dplyr function that can help with this. How can I perform this operation using functions specific to the dplyr library?
df_a = data.frame(department=c(1,2,3,4))
df_b = data.frame(period=c(2014,2015,2016,2017))
#This works as desired
big_df = merge(df_a,df_b)
#I'd like to perform the following in a much bigger operation:
big_df = dplyr::full_join(df_a,df_b)
#Error: No common variables. Please specify `by` param.
You can use crossing from tidyr:
crossing(df_a,df_b)
department period
1 1 2014
2 1 2015
3 1 2016
4 1 2017
5 2 2014
6 2 2015
7 2 2016
8 2 2017
9 3 2014
10 3 2015
11 3 2016
12 3 2017
13 4 2014
14 4 2015
15 4 2016
16 4 2017
If there are duplicate rows, crossing doesn't give the same result as merge.
Instead use full_join with by = character() to perform a cross-join which generates all combinations of df_a and df_b.
library("tidyverse") # version 1.3.2
# Add duplicate rows for illustration.
df_a <- tibble(department = c(1, 2, 3, 3))
df_b <- tibble(period = c(2014, 2015, 2016, 2017))
merge doesn't de-duplicate.
df_a_merge_b <- merge(df_a, df_b)
df_a_merge_b
#> department period
#> 1 1 2014
#> 2 2 2014
#> 3 3 2014
#> 4 3 2014
#> 5 1 2015
#> 6 2 2015
#> 7 3 2015
#> 8 3 2015
#> 9 1 2016
#> 10 2 2016
#> 11 3 2016
#> 12 3 2016
#> 13 1 2017
#> 14 2 2017
#> 15 3 2017
#> 16 3 2017
crossing drops duplicate rows.
df_a_crossing_b <- crossing(df_a, df_b)
df_a_crossing_b
#> # A tibble: 12 × 2
#> department period
#> <dbl> <dbl>
#> 1 1 2014
#> 2 1 2015
#> 3 1 2016
#> 4 1 2017
#> 5 2 2014
#> 6 2 2015
#> 7 2 2016
#> 8 2 2017
#> 9 3 2014
#> 10 3 2015
#> 11 3 2016
#> 12 3 2017
full_join doesn't remove duplicates either.
df_a_full_join_b <- full_join(df_a, df_b, by = character())
df_a_full_join_b
#> # A tibble: 16 × 2
#> department period
#> <dbl> <dbl>
#> 1 1 2014
#> 2 1 2015
#> 3 1 2016
#> 4 1 2017
#> 5 2 2014
#> 6 2 2015
#> 7 2 2016
#> 8 2 2017
#> 9 3 2014
#> 10 3 2015
#> 11 3 2016
#> 12 3 2017
#> 13 3 2014
#> 14 3 2015
#> 15 3 2016
#> 16 3 2017
packageVersion("tidyverse")
#> [1] '1.3.2'
Created on 2023-01-13 with reprex v2.0.2

Resources