Data manipulation: spread columns with different number of rows in dplyr - r

I am trying to spread the time column of my data frame into one column per year. left_join would be my choice, but the age and geo groups differ between years, so most year columns end up containing NA values and one of the age categories disappears.
library(dplyr)
library(tidyr) # spread() comes from tidyr
dt %>%
filter(time!=2001) %>%
group_by(time, geo, age, sex) %>%
filter(time==2011) %>%
left_join(.,dt %>%
group_by(time, sex, age, geo) %>%
mutate(time2 = 2011) %>%
filter(time != 2011) %>%
spread(time, value),
by = c('time' = 'time2', 'age', 'geo'))
What I obtain is this:
time geo sex.x age value sex.y `2000` `2001` `2002` `2003`
2011 51900 1 0 27933 1 NA 26193 NA NA
2011 51900 1 0 27933 2 NA 22760 NA NA
2011 51900 1 5 20627 1 NA 26213 NA NA
2011 51900 1 5 20627 2 NA 25647 NA NA
...
2011 51900 1 75 6400 1 NA 5313 NA NA
2011 51900 1 75 6400 2 NA 11500 NA NA
2011 51900 1 80 4520 NA NA NA NA NA
but there's a problem with the `value` column: it repeats the same values twice (and it shouldn't), and the years 2000, 2002, ..., 2020 are all NA.
What I would like is this:
geo sex age 2001 2011 2000 2002 2003 ... 2020
51900 1 0 39290 41900 69844 55281 55545 58045
51900 2 0 34140 38270 61192 65301 65429 65391
51902 1 0 4307 4193 69844 55281 55545 58045
51902 2 0 3753 3453 61192 65301 65429 65391
...
51900 1 80 NA 41900 104766 97952 98143 87068
51900 2 80 NA 38270 91788 89921 83317 98086
dt = structure(list(time = c(2001L, 2001L, 2001L, 2001L, 2001L, 2001L,
2001L, 2001L, 2001L, 2001L, 2001L, 2001L, 2001L, 2001L, 2001L, 2001L, 2011L, 2011L, 2011L, 2011L, 2011L, 2011L, 2011L, 2011L,
2011L, 2011L, 2011L, 2011L, 2011L, 2011L, 2011L, 2011L, 2011L, 2011L, 2011L, 2011L, 2000L, 2000L, 2000L, 2000L, 2000L, 2002L,
2002L, 2002L, 2002L, 2002L, 2003L, 2003L, 2003L, 2003L, 2003L, 2004L, 2004L, 2004L, 2004L, 2004L, 2005L, 2005L, 2005L, 2005L,
2005L, 2006L, 2006L, 2006L, 2006L, 2006L, 2007L, 2007L, 2007L, 2007L, 2007L, 2008L, 2008L, 2008L, 2008L, 2008L, 2009L, 2009L,
2009L, 2009L, 2009L, 2010L, 2010L, 2010L, 2010L, 2010L, 2012L, 2012L, 2012L, 2012L, 2012L, 2013L, 2013L, 2013L, 2013L, 2013L,
2014L, 2014L, 2014L, 2014L, 2014L, 2015L, 2015L, 2015L, 2015L, 2015L, 2016L, 2016L, 2016L, 2016L, 2016L, 2017L, 2017L, 2017L,
2017L, 2017L, 2018L, 2018L, 2018L, 2018L, 2018L, 2019L, 2019L, 2019L, 2019L, 2019L, 2020L, 2020L, 2020L, 2020L, 2020L, 2000L,
2000L, 2000L, 2000L, 2000L, 2002L, 2002L, 2002L, 2002L, 2002L, 2003L, 2003L, 2003L, 2003L, 2003L, 2004L, 2004L, 2004L, 2004L,
2004L, 2005L, 2005L, 2005L, 2005L, 2005L, 2006L, 2006L, 2006L, 2006L, 2006L, 2007L, 2007L, 2007L, 2007L, 2007L, 2008L, 2008L,
2008L, 2008L, 2008L, 2009L, 2009L, 2009L, 2009L, 2009L, 2010L, 2010L, 2010L, 2010L, 2010L, 2012L, 2012L, 2012L, 2012L, 2012L,
2013L, 2013L, 2013L, 2013L, 2013L, 2014L, 2014L, 2014L, 2014L, 2014L, 2015L, 2015L, 2015L, 2015L, 2015L, 2016L, 2016L, 2016L,
2016L, 2016L, 2017L, 2017L, 2017L, 2017L, 2017L, 2018L, 2018L, 2018L, 2018L, 2018L, 2019L, 2019L, 2019L, 2019L, 2019L, 2020L,
2020L, 2020L, 2020L, 2020L), geo = c(51900L, 51900L, 51900L, 51900L, 51900L, 51900L, 51900L, 51900L, 51902L, 51902L, 51902L,
51902L, 51902L, 51902L, 51902L, 51902L, 51900L, 51900L, 51900L, 51900L, 51900L, 51900L, 51900L, 51900L, 51900L, 51900L, 51902L,
51902L, 51902L, 51902L, 51902L, 51902L, 51902L, 51902L, 51902L, 51902L, 51L, 51L, 51L, 51L, 51L, 51L, 51L, 51L, 51L, 51L, 51L,
51L, 51L, 51L, 51L, 51L, 51L, 51L, 51L, 51L, 51L, 51L, 51L, 51L, 51L, 51L, 51L, 51L, 51L, 51L, 51L, 51L, 51L, 51L, 51L, 51L, 51L,
51L, 51L, 51L, 51L, 51L, 51L, 51L, 51L, 51L, 51L, 51L, 51L, 51L, 51L, 51L, 51L, 51L, 51L, 51L, 51L, 51L, 51L, 51L, 51L, 51L, 51L,
51L, 51L, 51L, 51L, 51L, 51L, 51L, 51L, 51L, 51L, 51L, 51L, 51L, 51L, 51L, 51L, 51L, 51L, 51L, 51L, 51L, 51L, 51L, 51L, 51L, 51L, 51L, 51L, 51L, 51L, 51L, 51L, 51L, 51L, 51L, 51L, 51L, 51L, 51L, 51L, 51L, 51L, 51L, 51L, 51L, 51L, 51L, 51L, 51L, 51L, 51L, 51L, 51L, 51L, 51L, 51L, 51L, 51L, 51L, 51L, 51L, 51L, 51L, 51L, 51L, 51L, 51L, 51L, 51L, 51L, 51L, 51L, 51L, 51L, 51L, 51L, 51L, 51L,
51L, 51L, 51L, 51L, 51L, 51L, 51L, 51L, 51L, 51L, 51L, 51L, 51L, 51L, 51L, 51L, 51L, 51L, 51L, 51L, 51L, 51L, 51L, 51L, 51L, 51L,
51L, 51L, 51L, 51L, 51L, 51L, 51L, 51L, 51L, 51L, 51L, 51L, 51L, 51L, 51L, 51L, 51L, 51L, 51L, 51L, 51L, 51L, 51L), sex = c(1L,
1L, 1L, 1L, 2L, 2L, 2L, 2L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L,
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L,
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L,
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L), age = c(0L, 5L, 10L, 75L, 0L, 5L, 10L, 75L, 0L, 5L, 10L, 75L, 0L, 5L, 10L, 75L, 0L, 5L, 10L, 75L, 80L, 0L, 5L, 10L, 75L,
80L, 0L, 5L, 10L, 75L, 80L, 0L, 5L, 10L, 75L, 80L, 0L, 5L, 10L, 75L, 80L, 0L, 5L, 10L, 75L, 80L, 0L, 5L, 10L, 75L, 80L, 0L, 5L, 10L, 75L, 80L, 0L, 5L, 10L, 75L, 80L, 0L, 5L, 10L, 75L, 80L, 0L, 5L, 10L, 75L, 80L, 0L, 5L, 10L, 75L, 80L, 0L, 5L, 10L, 75L, 80L, 0L, 5L, 10L, 75L, 80L, 0L, 5L, 10L, 75L, 80L, 0L, 5L, 10L, 75L, 80L, 0L, 5L, 10L, 75L, 80L, 0L, 5L, 10L, 75L, 80L, 0L, 5L, 10L, 75L, 80L, 0L, 5L, 10L, 75L, 80L, 0L, 5L, 10L, 75L, 80L, 0L, 5L, 10L, 75L, 80L, 0L, 5L, 10L, 75L, 80L, 0L, 5L, 10L, 75L, 80L, 0L, 5L, 10L, 75L, 80L, 0L, 5L, 10L, 75L, 80L, 0L, 5L, 10L, 75L, 80L, 0L, 5L, 10L, 75L, 80L, 0L, 5L, 10L, 75L, 80L, 0L, 5L, 10L, 75L, 80L, 0L, 5L, 10L, 75L, 80L, 0L, 5L, 10L, 75L, 80L,
0L, 5L, 10L, 75L, 80L, 0L, 5L, 10L, 75L, 80L, 0L, 5L, 10L, 75L, 80L, 0L, 5L, 10L, 75L, 80L, 0L, 5L, 10L, 75L, 80L, 0L, 5L, 10L, 75L, 80L, 0L, 5L, 10L, 75L, 80L, 0L, 5L, 10L, 75L, 80L, 0L, 5L,
10L, 75L, 80L, 0L, 5L, 10L, 75L, 80L), value = c(26193L, 26213L, 31653L, 5313L, 22760L, 25647L, 31393L, 11500L, 4307L, 4793L,
5947L, 667L, 3753L, 4500L, 5207L, 1440L, 27933L, 20627L, 20593L, 6400L, 4520L, 25513L, 17480L, 17800L, 9520L, 8560L, 4193L, 3027L,
3453L, 800L, 580L, 3453L, 2473L, 2980L, 1013L, 1167L, 61192L, 88249L, 105509L, 20595L, 18198L, 55281L, 76667L, 99967L, 25571L,
19187L, 55545L, 70490L, 95697L, 28376L, 19340L, 56564L, 64639L, 90809L, 30322L, 19579L, 57471L, 59755L, 85464L, 30949L, 20081L,
60145L, 55926L, 79537L, 30083L, 22373L, 61425L, 53664L, 73329L, 27916L, 24891L, 61683L, 52992L, 67148L, 25620L, 27118L, 61776L,
53403L, 61637L, 24601L, 28551L, 62477L, 53990L, 57438L, 25439L, 29074L, 64401L, 56247L, 52992L, 31317L, 30495L, 64691L, 58095L,
52582L, 35069L, 30691L, 64689L, 60083L, 52853L, 37023L, 31297L, 64391L, 61877L, 53538L, 36327L, 32537L, 63158L, 63367L, 54657L,
33260L, 35359L, 61961L, 64311L, 56249L, 28203L, 38591L, 60751L, 64639L, 58159L, 22742L, 41433L, 59469L, 64485L, 60081L, 18813L,
42936L, 58045L, 64127L, 61703L, 17280L, 42758L, 69844L, 93632L, 109773L, 11025L, 7397L, 65301L, 82373L, 103304L, 16130L, 7705L,
65429L, 77025L, 98764L, 18861L, 7835L, 66195L, 72123L, 93892L, 20763L, 8231L, 66949L, 68002L, 88909L, 21513L, 8973L, 69257L,
64759L, 83202L, 21269L, 10813L, 70402L, 62813L, 77601L, 20044L, 12820L, 70681L, 62125L, 72404L, 18627L, 14631L, 70818L, 62321L,
68099L, 17947L, 15893L, 71579L, 62729L, 65085L, 18379L, 16509L, 73653L, 64712L, 61851L, 21697L, 17861L, 73764L, 66737L, 61483L,
23663L, 18103L, 73537L, 68968L, 61599L, 24347L, 18455L, 73041L, 70867L, 62190L, 23305L, 18986L, 71645L, 72368L, 63235L, 21077L,
20717L, 70201L, 73275L, 64867L, 17653L, 22534L, 68704L, 73517L,
66893L, 14089L, 23935L, 67117L, 73238L, 68928L, 11606L, 24343L, 65391L, 72725L, 70609L, 10697L, 23592L)), .Names = c("time",
"geo", "sex", "age", "value"), class = "data.frame", row.names = c(NA, -226L))

You can use the spread() function from tidyr:
library(tidyr)
dt_final <- dt %>%
  spread(time,   # the variable used to create the new columns
         value)  # the variable used to fill the rows of the new columns
head(as_tibble(dt_final))
# geo sex age `2000` `2001` `2002` `2003` `2004` `2005` `2006` `2007` `2008` `2009` `2010` `2011` `2012` `2013` `2014` `2015` `2016` `2017` `2018` `2019` `2020`
# <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int>
# 1 51 1 0 69844 NA 65301 65429 66195 66949 69257 70402 70681 70818 71579 NA 73653 73764 73537 73041 71645 70201 68704 67117 65391
# 2 51 1 5 93632 NA 82373 77025 72123 68002 64759 62813 62125 62321 62729 NA 64712 66737 68968 70867 72368 73275 73517 73238 72725
# 3 51 1 10 109773 NA 103304 98764 93892 88909 83202 77601 72404 68099 65085 NA 61851 61483 61599 62190 63235 64867 66893 68928 70609
# 4 51 1 75 11025 NA 16130 18861 20763 21513 21269 20044 18627 17947 18379 NA 21697 23663 24347 23305 21077 17653 14089 11606 10697
# 5 51 1 80 7397 NA 7705 7835 8231 8973 10813 12820 14631 15893 16509 NA 17861 18103 18455 18986 20717 22534 23935 24343 23592
# 6 51 2 0 61192 NA 55281 55545 56564 57471 60145 61425 61683 61776 62477 NA 64401 64691 64689 64391 63158 61961 60751 59469 58045
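The layout asked for in the question seems to attach the national series (geo == 51) to every local area by sex and age. A minimal sketch of that idea follows, assuming this reading of the desired output is right and that tidyr 1.0.0 or later is available (where pivot_wider() supersedes spread()). The toy data and the names toy, local_wide and national_wide are made up for illustration.

```r
library(dplyr)
library(tidyr)

# Toy data: two census years for one local area, plus a national series (geo == 51)
toy <- data.frame(
  time  = c(2001L, 2011L, 2011L, 2000L, 2000L, 2002L, 2002L),
  geo   = c(51900L, 51900L, 51900L, 51L, 51L, 51L, 51L),
  sex   = c(1L, 1L, 1L, 1L, 1L, 1L, 1L),
  age   = c(0L, 0L, 80L, 0L, 80L, 0L, 80L),
  value = c(100L, 110L, 20L, 1000L, 200L, 1010L, 210L)
)

# Local areas: one row per geo/sex/age, one column per census year
local_wide <- toy %>%
  filter(geo != 51L) %>%
  pivot_wider(names_from = time, values_from = value)

# National series: drop geo so it can be matched on sex and age only
national_wide <- toy %>%
  filter(geo == 51L) %>%
  select(-geo) %>%
  pivot_wider(names_from = time, values_from = value)

# Every local row receives the national yearly values for its sex/age group
result <- left_join(local_wide, national_wide, by = c("sex", "age"))
print(result)
```

Age groups that are missing in one census year (age 80 in 2001 here) keep an NA in that column instead of dropping out of the result.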

Related

ANOVA error: why is each row of output *not* identified by a unique combination of keys?

I have a two-way ANOVA test (w/repeated measures) that I'm using with four almost identical datasets:
> res.aov <- anova_test(
+ data = LST_Weather_dataset_N, dv = LST, wid = Month,
+ within = c(Buffer, TimePeriod),
+ effect.size = "ges",
+ detailed = TRUE,
+ )
Where:
LST = surface temperature deviation in C
Month = 1-12
Buffer = a value 100-1900 - one of 19 areas outward from the boundary of a solar power plant (each 100m wide)
TimePeriod = a factor with a value of 1 or 2 corresponding to pre-/post-construction of a solar power plant.
For one dataset I get the error:
Error: Each row of output must be identified by a unique combination of keys.
Keys are shared for 38 rows:
* 10, 11
* 217, 218
* 240, 241
* 263, 264
* 286, 287
* 309, 310
* 332, 333
...
As far as I can tell I have unique combinations.
dplyr::count(LST_Weather_dataset_N, LST, Month, Buffer, TimePeriod, sort = TRUE)
returns
LST Month Buffer TimePeriod n
1 -6.309045316 12 100 2 1
2 -5.655279925 9 1000 2 1
3 -5.224196295 12 200 2 1
4 -5.194473224 9 1100 2 1
5 -5.025429891 12 400 2 1
6 -4.987575966 9 700 2 1
7 -4.979453868 12 600 2 1
8 -4.825298768 12 300 2 1
9 -4.668994574 12 500 2 1
10 -4.652282192 12 700 2 1
...
'n' is always 1.
I can't work out why this is happening.
Extract of the data frame below:
> dput(LST_Weather_dataset_N[sample(1:nrow(LST_Weather_dataset_N), 50),])
structure(list(Buffer = c(1400L, 700L, 300L, 1400L, 100L, 200L,
1700L, 100L, 800L, 1900L, 1100L, 100L, 700L, 800L, 1400L, 400L,
1300L, 200L, 1200L, 500L, 1200L, 1300L, 400L, 1000L, 1300L, 1100L,
100L, 300L, 300L, 600L, 1100L, 1400L, 1500L, 1600L, 1700L, 1800L,
1700L, 1300L, 1200L, 300L, 1100L, 1900L, 1700L, 700L, 1400L,
1200L, 1600L, 1700L, 1900L, 1300L), Date = c("02/05/2014", "18/01/2017",
"19/06/2014", "25/12/2013", "15/09/2017", "08/04/2017", "22/08/2014",
"21/07/2014", "13/07/2017", "25/12/2013", "22/10/2013", "02/05/2014",
"07/03/2017", "15/03/2014", "13/07/2017", "19/06/2014", "25/12/2013",
"17/10/2017", "16/04/2014", "06/10/2013", "15/09/2017", "18/01/2017",
"10/01/2014", "17/12/2016", "13/07/2017", "19/06/2014", "07/03/2017",
"15/03/2014", "11/02/2014", "22/10/2013", "06/10/2013", "15/09/2017",
"16/04/2014", "18/01/2017", "15/03/2014", "21/07/2014", "17/10/2017",
"15/09/2017", "10/01/2014", "23/09/2014", "16/04/2014", "22/10/2013",
"11/06/2017", "26/05/2017", "19/06/2014", "14/08/2017", "11/02/2014",
"26/02/2017", "26/02/2017", "11/02/2014"), LST = c(1.255502397,
4.33385966, 3.327025603, -0.388631166, -0.865430798, 4.386292648,
-0.243018665, 3.276865987, 0.957036835, -0.065821795, 0.69731779,
4.846851651, -1.437700684, 1.003808572, 0.572460421, 2.995902374,
-0.334633662, -1.231447567, 0.644520741, 0.808262029, -3.392959991,
2.324569449, 2.346707612, -3.124354627, 0.58719862, 1.904859254,
1.701580958, 2.792443253, 1.638270039, 1.460743317, 0.699767335,
-3.015643366, 0.930527864, 1.309519336, 0.477789664, 0.147584938,
-0.498188865, -3.506795723, -1.007487965, 1.149604087, 1.192366386,
0.197471474, 0.999391224, -0.190613618, 1.27324015, 2.686622796,
0.573109026, 0.97847983, 0.395005095, -0.40855426), Month = c(5L,
1L, 6L, 12L, 9L, 4L, 8L, 7L, 7L, 12L, 10L, 5L, 3L, 3L, 7L, 6L,
12L, 10L, 4L, 10L, 9L, 1L, 1L, 12L, 7L, 6L, 3L, 3L, 2L, 10L,
10L, 9L, 4L, 1L, 3L, 7L, 10L, 9L, 1L, 9L, 4L, 10L, 6L, 5L, 6L,
8L, 2L, 2L, 2L, 2L), Year = c(2014L, 2017L, 2014L, 2013L, 2017L,
2017L, 2014L, 2014L, 2017L, 2013L, 2013L, 2014L, 2017L, 2014L,
2017L, 2014L, 2013L, 2017L, 2014L, 2013L, 2017L, 2017L, 2014L,
2016L, 2017L, 2014L, 2017L, 2014L, 2014L, 2013L, 2013L, 2017L,
2014L, 2017L, 2014L, 2014L, 2017L, 2017L, 2014L, 2014L, 2014L,
2013L, 2017L, 2017L, 2014L, 2017L, 2014L, 2017L, 2017L, 2014L
), JulianDay = c(122L, 18L, 170L, 359L, 258L, 98L, 234L, 202L,
194L, 359L, 295L, 122L, 66L, 74L, 194L, 170L, 359L, 290L, 106L,
279L, 258L, 18L, 10L, 352L, 194L, 170L, 66L, 74L, 42L, 295L,
279L, 258L, 106L, 18L, 74L, 202L, 290L, 258L, 10L, 266L, 106L,
295L, 162L, 146L, 170L, 226L, 42L, 57L, 57L, 42L), TimePeriod = c(1L,
2L, 1L, 1L, 2L, 2L, 1L, 1L, 2L, 1L, 1L, 1L, 2L, 1L, 2L, 1L, 1L,
2L, 1L, 1L, 2L, 2L, 1L, 2L, 2L, 1L, 2L, 1L, 1L, 1L, 1L, 2L, 1L,
2L, 1L, 1L, 2L, 2L, 1L, 1L, 1L, 1L, 2L, 2L, 1L, 2L, 1L, 2L, 2L,
1L), Temperature = c(28L, 9L, 31L, 12L, 27L, 21L, 29L, 36L, 38L,
12L, 23L, 28L, 12L, 21L, 38L, 31L, 12L, 23L, 25L, 22L, 27L, 9L,
11L, 7L, 38L, 31L, 12L, 21L, 14L, 23L, 22L, 27L, 25L, 9L, 21L,
36L, 23L, 27L, 11L, 31L, 25L, 23L, 29L, 27L, 31L, 34L, 14L, 16L,
16L, 14L), Humidity = c(6L, 34L, 7L, 31L, 29L, 22L, 34L, 15L,
19L, 31L, 16L, 6L, 14L, 14L, 19L, 7L, 31L, 12L, 9L, 12L, 29L,
34L, 33L, 18L, 19L, 7L, 14L, 14L, 31L, 16L, 12L, 29L, 9L, 34L,
14L, 15L, 12L, 29L, 33L, 18L, 9L, 16L, 8L, 13L, 7L, 13L, 31L,
31L, 31L, 31L), Wind_speed = c(6L, 0L, 6L, 7L, 13L, 33L, 6L,
20L, 9L, 7L, 0L, 6L, 0L, 6L, 9L, 6L, 7L, 6L, 0L, 7L, 13L, 0L,
0L, 35L, 9L, 6L, 0L, 6L, 6L, 0L, 7L, 13L, 0L, 0L, 6L, 20L, 6L,
13L, 0L, 0L, 0L, 0L, 24L, 11L, 6L, 24L, 6L, 26L, 26L, 6L), Wind_gust = c(0L,
0L, 0L, 0L, 0L, 54L, 0L, 46L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L,
0L, 0L, 0L, 0L, 0L, 0L, 0L, 48L, 0L, 0L, 0L, 0L, 0L, 0L, 0L,
0L, 0L, 0L, 0L, 46L, 0L, 0L, 0L, 0L, 0L, 0L, 48L, 0L, 0L, 39L,
0L, 41L, 41L, 0L), Wind_trend = c(1L, 0L, 1L, 1L, 2L, 2L, 0L,
1L, 2L, 1L, 0L, 1L, 0L, 1L, 2L, 1L, 1L, 0L, 0L, 2L, 2L, 0L, 1L,
1L, 2L, 1L, 0L, 1L, 1L, 0L, 2L, 2L, 0L, 0L, 1L, 1L, 0L, 2L, 1L,
1L, 0L, 0L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), Wind_direction = c(0,
0, 0, 337.5, 360, 22.5, 0, 22.5, 0, 337.5, 0, 0, 0, 0, 0, 0,
337.5, 180, 0, 247.5, 360, 0, 0, 180, 0, 0, 0, 0, 337.5, 0, 247.5,
360, 0, 0, 0, 22.5, 180, 360, 0, 0, 0, 0, 360, 22.5, 0, 360,
337.5, 360, 360, 337.5), Pressure = c(940.2, 943.64, 937.69,
951.37, 932.69, 933.94, 937.07, 938.01, 937.69, 951.37, 939.72,
940.2, 948.33, 947.71, 937.69, 937.69, 951.37, 943.32, 932.69,
944.71, 932.69, 943.64, 942.31, 943.01, 937.69, 937.69, 948.33,
947.71, 941.94, 939.72, 944.71, 932.69, 932.69, 943.64, 947.71,
938.01, 943.32, 932.69, 942.31, 938.94, 932.69, 939.72, 928.31,
931.12, 937.69, 932.37, 941.94, 936.13, 936.13, 941.94), Pressure_trend = c(1L,
2L, 0L, 2L, 0L, 1L, 1L, 1L, 1L, 2L, 1L, 1L, 1L, 1L, 1L, 0L, 2L,
1L, 2L, 1L, 0L, 2L, 2L, 2L, 1L, 0L, 1L, 1L, 1L, 1L, 1L, 0L, 2L,
2L, 1L, 1L, 1L, 0L, 2L, 1L, 2L, 1L, 0L, 0L, 0L, 1L, 1L, 2L, 2L,
1L)), row.names = c(179L, 14L, 195L, 426L, 306L, 118L, 299L,
229L, 244L, 436L, 374L, 153L, 90L, 91L, 256L, 197L, 424L, 348L,
137L, 355L, 328L, 26L, 7L, 419L, 254L, 211L, 78L, 81L, 43L, 359L,
373L, 332L, 143L, 32L, 109L, 263L, 393L, 330L, 23L, 309L, 135L,
398L, 224L, 166L, 217L, 290L, 69L, 72L, 76L, 63L), class = "data.frame")
Well, this is a bit embarrassing.
The error arose because the months were not, in fact, properly paired. Instead of 38 observations (19 x 2) for each month, one month had 57 observations (19 x 3) because of an error in determining the month value. Correcting this, and checking that every month had the same number of paired observations, allowed the ANOVA to run successfully.
> res.aov <- anova_test(
+ data = LST_Weather_dataset_N, dv = LST, wid = Month,
+ within = c(Buffer, TimePeriod),
+ effect.size = "ges",
+ detailed = TRUE,
+ )
> get_anova_table(res.aov, correction = "auto")
ANOVA Table (type III tests)
Effect DFn DFd SSn SSd F p p<.05 ges
1 (Intercept) 1 11 600.135 974.584 6.774 2.50e-02 * 0.189
2 Buffer 18 198 332.217 331.750 11.015 2.05e-21 * 0.115
3 TimePeriod 1 11 29.561 977.945 0.333 5.76e-01 0.011
4 Buffer:TimePeriod 18 198 13.055 283.797 0.506 9.53e-01 0.005
I still don't understand how the error message was telling me this, though.
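One plausible reading of the message: anova_test() appears to reshape the data to wide format internally, which needs exactly one row per combination of wid and the within factors (Month, Buffer, TimePeriod). Counting with LST included makes every row unique, so the duplicated cells stay hidden. The sketch below uses made-up toy data to show the difference; the names toy, extra, with_dv and dupes are hypothetical.

```r
library(dplyr)

# Balanced toy design: 3 months x 2 buffers x 2 time periods
toy <- expand.grid(Month = 1:3, Buffer = c(100L, 200L), TimePeriod = 1:2)
toy$LST <- seq_len(nrow(toy)) / 10

# Simulate a mislabelled month: one extra row that repeats an existing design cell
extra <- data.frame(Month = 1L, Buffer = 100L, TimePeriod = 2L, LST = 9.9)
toy <- rbind(toy, extra)

# Counting with the dependent variable included: every n is 1, nothing looks wrong
with_dv <- count(toy, LST, Month, Buffer, TimePeriod)

# Counting on the design keys only: the duplicated cell shows up with n > 1
dupes <- toy %>%
  count(Month, Buffer, TimePeriod) %>%
  filter(n > 1)
print(dupes)
```

Running the second count on the real data before calling anova_test() would list exactly the cells that break the pairing.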

How to generate a lag variable (endogenous lag) that captures previous values?

I want to generate the following endogenous lag variable (Y):
Set Y=1 in the current routine year if submission==1 and routineyear==1 in the previous routine year
Set Y=2 in the current routine year if submission==0 and routineyear==1 in the previous routine year
Otherwise Y=0
Note, though, that the "previous routine year" is not the previous calendar year; the intervals between routine years vary. This is what makes the variable hard for me to generate.
Basically, I want to generate an endogenous variable that captures a state's behavior in its LAST routine year.
To illustrate what I want to do:
Assume that country A had a routine year in 1990, and that submission was also 1 in that year. This would generate Y=1.
The next routine year for country A is 1992, where submission=1 and routineyear=1. The endogenous lag in this case should indicate A's previous behavior, as in 1990 (Y=1).
The next routine year after that is 1996, where submission=0 and routineyear=1. The endogenous lag in this case reflects A's previous behavior in 1992 (Y=1).
Then the next routine year is 1998, where submission=1 and routineyear=1. The endogenous lag here should indicate A's behavior in the last routine year, 1996; that is, Y=2.
This is how the endogenous lag should look (based on the example above):
country year submission routineyear Y(endo lag)
A 1990 1 1 1
A 1991 0 0 0
A 1992 1 1 1
A 1993 1 0 0
A 1994 0 0 0
A 1995 0 0 0
A 1996 0 1 1
A 1997 0 0 0
A 1998 1 1 2
A 1999 0 0 0
A 2000 0 0 0
A 2001 0 1 1
A 2002 0 0 0
A 2003 1 1 2
I've been trying to do this using different approaches, but without success. One of the biggest problems is that the routine years differ for each country and the intervals between them are irregular.
I believe that someone who can write proper R code/functions would be able to solve this puzzle. If not, I would appreciate any recommendations on how to proceed from here.
A sample from my real data:
structure(list(ccode = c(31L, 31L, 31L, 31L, 31L, 31L, 31L, 31L, 31L,
31L, 31L, 31L, 31L, 31L, 31L, 31L, 31L, 31L, 31L, 31L, 31L, 31L, 40L,
40L, 40L, 40L, 40L, 40L, 40L, 40L, 40L, 40L, 40L, 40L, 40L, 40L, 40L,
40L, 40L, 40L, 40L, 40L, 40L, 40L, 41L, 41L, 41L, 41L, 41L, 41L, 41L,
41L, 41L, 41L, 41L, 41L, 41L, 41L, 41L, 41L, 41L, 41L, 41L, 41L, 41L,
41L, 42L, 42L, 42L, 42L, 42L, 42L, 42L, 42L, 42L, 42L, 42L, 42L, 42L,
42L, 42L, 42L, 42L, 42L, 42L, 42L, 42L, 42L, 51L, 51L, 51L, 51L, 51L,
51L, 51L, 51L, 51L, 51L, 51L, 51L, 51L, 51L, 51L, 51L, 51L, 51L, 51L,
51L, 51L, 51L, 51L, 52L, 52L, 52L, 52L, 52L, 52L, 52L, 52L, 52L, 52L,
52L, 52L, 52L, 52L, 52L, 52L, 52L, 52L, 52L, 52L, 52L, 52L, 53L, 53L,
53L, 53L, 53L, 53L, 53L, 53L, 53L, 53L, 53L, 53L, 53L, 53L, 53L, 53L,
53L, 53L, 53L, 53L, 53L, 53L, 54L, 54L, 54L, 54L, 54L, 54L, 54L, 54L,
54L, 54L, 54L, 54L, 54L, 54L, 54L, 54L, 54L, 54L, 54L, 54L, 54L, 54L,
70L, 70L, 70L, 70L, 70L, 70L, 70L, 70L, 70L, 70L, 70L, 70L, 70L, 70L,
70L, 70L, 70L, 70L, 70L, 70L, 70L, 70L, 80L, 80L, 80L, 80L, 80L, 80L,
80L, 80L, 80L, 80L, 80L, 80L, 80L, 80L, 80L, 80L, 80L, 80L, 80L, 80L,
80L, 80L, 90L, 90L, 90L, 90L, 90L, 90L, 90L, 90L, 90L, 90L, 90L, 90L,
90L, 90L, 90L, 90L, 90L, 90L, 90L, 90L, 90L, 90L), year = c(1990L,
1991L, 1992L, 1993L, 1994L, 1995L, 1996L, 1997L, 1998L, 1999L, 2000L,
2001L, 2002L, 2003L, 2004L, 2005L, 2006L, 2007L, 2008L, 2009L, 2010L,
2011L, 1990L, 1991L, 1992L, 1993L, 1994L, 1995L, 1996L, 1997L, 1998L,
1999L, 2000L, 2001L, 2002L, 2003L, 2004L, 2005L, 2006L, 2007L, 2008L,
2009L, 2010L, 2011L, 1990L, 1991L, 1992L, 1993L, 1994L, 1995L, 1996L,
1997L, 1998L, 1999L, 2000L, 2001L, 2002L, 2003L, 2004L, 2005L, 2006L,
2007L, 2008L, 2009L, 2010L, 2011L, 1990L, 1991L, 1992L, 1993L, 1994L,
1995L, 1996L, 1997L, 1998L, 1999L, 2000L, 2001L, 2002L, 2003L, 2004L,
2005L, 2006L, 2007L, 2008L, 2009L, 2010L, 2011L, 1990L, 1991L, 1992L,
1993L, 1994L, 1995L, 1996L, 1997L, 1998L, 1999L, 1999L, 2000L, 2001L,
2002L, 2003L, 2004L, 2005L, 2006L, 2007L, 2008L, 2009L, 2010L, 2011L,
1990L, 1991L, 1992L, 1993L, 1994L, 1995L, 1996L, 1997L, 1998L, 1999L,
2000L, 2001L, 2002L, 2003L, 2004L, 2005L, 2006L, 2007L, 2008L, 2009L,
2010L, 2011L, 1990L, 1991L, 1992L, 1993L, 1994L, 1995L, 1996L, 1997L,
1998L, 1999L, 2000L, 2001L, 2002L, 2003L, 2004L, 2005L, 2006L, 2007L,
2008L, 2009L, 2010L, 2011L, 1990L, 1991L, 1992L, 1993L, 1994L, 1995L,
1996L, 1997L, 1998L, 1999L, 2000L, 2001L, 2002L, 2003L, 2004L, 2005L,
2006L, 2007L, 2008L, 2009L, 2010L, 2011L, 1990L, 1991L, 1992L, 1993L,
1994L, 1995L, 1996L, 1997L, 1998L, 1999L, 2000L, 2001L, 2002L, 2003L,
2004L, 2005L, 2006L, 2007L, 2008L, 2009L, 2010L, 2011L, 1990L, 1991L,
1992L, 1993L, 1994L, 1995L, 1996L, 1997L, 1998L, 1999L, 2000L, 2001L,
2002L, 2003L, 2004L, 2005L, 2006L, 2007L, 2008L, 2009L, 2010L, 2011L,
1990L, 1991L, 1992L, 1993L, 1994L, 1995L, 1996L, 1997L, 1998L, 1999L,
2000L, 2001L, 2002L, 2003L, 2004L, 2005L, 2006L, 2007L, 2008L, 2009L,
2010L, 2011L), country = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 4L, 4L, 4L,
4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L,
4L, 4L, 8L, 8L, 8L, 8L, 8L, 8L, 8L, 8L, 8L, 8L, 8L, 8L, 8L, 8L, 8L,
8L, 8L, 8L, 8L, 8L, 8L, 8L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L,
6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 9L, 9L, 9L, 9L, 9L,
9L, 9L, 9L, 9L, 9L, 9L, 9L, 9L, 9L, 9L, 9L, 9L, 9L, 9L, 9L, 9L, 9L,
9L, 11L, 11L, 11L, 11L, 11L, 11L, 11L, 11L, 11L, 11L, 11L, 11L, 11L,
11L, 11L, 11L, 11L, 11L, 11L, 11L, 11L, 11L, 2L, 2L, 2L, 2L, 2L, 2L,
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 5L,
5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L,
5L, 5L, 5L, 5L, 10L, 10L, 10L, 10L, 10L, 10L, 10L, 10L, 10L, 10L,
10L, 10L, 10L, 10L, 10L, 10L, 10L, 10L, 10L, 10L, 10L, 10L, 3L, 3L,
3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L,
3L, 3L, 3L, 7L, 7L, 7L, 7L, 7L, 7L, 7L, 7L, 7L, 7L, 7L, 7L, 7L, 7L,
7L, 7L, 7L, 7L, 7L, 7L, 7L, 7L), .Label = c("Bahamas", "Barbados",
"Belize", "Cuba", "Dominica", "Dominican Republic", "Guatemala",
"Haiti", "Jamaica", "Mexico", "Trinidad and Tobago"), class =
"factor"),
submission = c(1L, 0L, 0L, 0L, 0L, 1L, 0L, 1L, 0L, 1L, 0L,
1L, 0L, 1L, 0L, 1L, 0L, 0L, 0L, 1L, 0L, 1L, 1L, 0L, 1L, 0L,
1L, 0L, 0L, 1L, 0L, 1L, 0L, 1L, 0L, 1L, 0L, 1L, 0L, 1L, 0L,
1L, 0L, 1L, 1L, 0L, 0L, 1L, 0L, 0L, 0L, 1L, 0L, 0L, 1L, 0L,
0L, 0L, 0L, 0L, 1L, 0L, 1L, 1L, 0L, 0L, 0L, 0L, 1L, 0L, 0L,
0L, 0L, 1L, 0L, 1L, 0L, 1L, 0L, 1L, 0L, 1L, 1L, 1L, 0L, 1L,
0L, 0L, 1L, 0L, 1L, 0L, 0L, 0L, 0L, 0L, 1L, 1L, 1L, 0L, 0L,
1L, 1L, 0L, 1L, 0L, 1L, 0L, 1L, 0L, 0L, 1L, 0L, 0L, 0L, 1L,
0L, 0L, 1L, 0L, 1L, 0L, 1L, 0L, 0L, 0L, 1L, 0L, 1L, 1L, 0L,
1L, 0L, 1L, 0L, 0L, 1L, 0L, 1L, 0L, 0L, 1L, 1L, 0L, 0L, 1L,
0L, 0L, 0L, 1L, 0L, 0L, 1L, 0L, 0L, 1L, 0L, 0L, 0L, 0L, 0L,
1L, 1L, 0L, 0L, 1L, 1L, 0L, 1L, 0L, 0L, 1L, 0L, 1L, 0L, 0L,
0L, 1L, 0L, 1L, 0L, 1L, 0L, 0L, 1L, 0L, 1L, 0L, 1L, 0L, 1L,
1L, 0L, 1L, 0L, 1L, 0L, 1L, 0L, 1L, 0L, 0L, 0L, 0L, 0L, 0L,
1L, 0L, 0L, 1L, 0L, 0L, 1L, 0L, 0L, 1L, 0L, 1L, 0L, 1L, 1L,
0L, 0L, 1L, 0L, 0L, 0L, 1L, 1L, 0L, 1L, 1L, 0L, 1L, 1L, 0L,
1L, 0L, 1L, 0L, 1L, 0L, 0L), routineyear = c(1L, 0L, 0L,
1L, 0L, 0L, 0L, 1L, 0L, 1L, 0L, 1L, 0L, 1L, 0L, 1L, 0L, 1L,
0L, 0L, 0L, 1L, 0L, 0L, 1L, 0L, 1L, 0L, 0L, 1L, 0L, 1L, 0L,
1L, 0L, 1L, 0L, 1L, 0L, 1L, 0L, 1L, 0L, 1L, 0L, 0L, 1L, 0L,
0L, 0L, 0L, 0L, 0L, 1L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L,
0L, 0L, 1L, 0L, 0L, 1L, 0L, 0L, 0L, 0L, 1L, 0L, 1L, 0L, 1L,
0L, 1L, 0L, 1L, 0L, 0L, 0L, 1L, 0L, 0L, 1L, 0L, 1L, 0L, 1L,
0L, 0L, 1L, 0L, 0L, 0L, 0L, 1L, 0L, 0L, 0L, 1L, 0L, 1L, 0L,
0L, 0L, 1L, 0L, 0L, 1L, 0L, 0L, 0L, 0L, 1L, 0L, 1L, 0L, 1L,
0L, 1L, 0L, 0L, 0L, 0L, 0L, 0L, 1L, 0L, 1L, 0L, 1L, 0L, 0L,
0L, 0L, 1L, 0L, 0L, 0L, 1L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L,
0L, 1L, 0L, 0L, 1L, 0L, 0L, 0L, 0L, 0L, 0L, 1L, 0L, 0L, 0L,
1L, 0L, 1L, 0L, 0L, 0L, 0L, 0L, 1L, 0L, 0L, 1L, 0L, 1L, 0L,
0L, 1L, 0L, 1L, 0L, 1L, 0L, 1L, 0L, 0L, 1L, 0L, 1L, 0L, 1L,
0L, 1L, 0L, 0L, 0L, 1L, 0L, 0L, 1L, 0L, 1L, 0L, 0L, 0L, 0L,
0L, 1L, 0L, 0L, 1L, 0L, 0L, 0L, 0L, 0L, 1L, 0L, 1L, 0L, 0L,
0L, 0L, 1L, 0L, 0L, 0L, 0L, 0L, 1L, 0L, 1L, 0L, 1L, 0L, 0L
)), .Names = c("ccode", "year", "country", "submission", "routineyear"), class = "data.frame", row.names = c(NA, -243L ))
Using data.table:
library(data.table)
setDT(DF)
DF[, Y := 0
][routineyear == 1
, Y := 1 + (shift(submission, fill = 1) == 0)
, by = country][]
which gives (first 15 rows shown):
> DF
ccode year country submission routineyear Y
1: 31 1990 Bahamas 1 1 1
2: 31 1991 Bahamas 0 0 0
3: 31 1992 Bahamas 0 0 0
4: 31 1993 Bahamas 0 1 1
5: 31 1994 Bahamas 0 0 0
6: 31 1995 Bahamas 1 0 0
7: 31 1996 Bahamas 0 0 0
8: 31 1997 Bahamas 1 1 2
9: 31 1998 Bahamas 0 0 0
10: 31 1999 Bahamas 1 1 1
11: 31 2000 Bahamas 0 0 0
12: 31 2001 Bahamas 1 1 1
13: 31 2002 Bahamas 0 0 0
14: 31 2003 Bahamas 1 1 1
15: 31 2004 Bahamas 0 0 0
........
What this does:
setDT(DF) converts your dataframe to a data.table
Y := 0 sets Y to 0 by reference first
Filter for routineyear == 1
Update Y by reference, within each country, such that Y is set to 1 if submission in the previous routine year is 1 and to 2 if it is 0 (the first routine year of a country defaults to 1 via fill = 1)
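The steps above can be checked against the small country A example from the question. This is a minimal sketch that rebuilds that example by hand; it uses integer constants (0L, 1L) so that Y stays an integer column, which is a small deviation from the code above.

```r
library(data.table)

# Country A example from the question, typed in by hand
DF <- data.table(
  country     = "A",
  year        = 1990:2003,
  submission  = c(1L, 0L, 1L, 1L, 0L, 0L, 0L, 0L, 1L, 0L, 0L, 0L, 0L, 1L),
  routineyear = c(1L, 0L, 1L, 0L, 0L, 0L, 1L, 0L, 1L, 0L, 0L, 1L, 0L, 1L)
)

# Start with Y = 0 everywhere
DF[, Y := 0L]

# Among routine-year rows only, look one routine year back within each country:
# previous submission 1 gives Y = 1, previous submission 0 gives Y = 2
DF[routineyear == 1L,
   Y := 1L + (shift(submission, fill = 1L) == 0L),
   by = country]

print(DF)
```

Because the shift happens after filtering on routineyear == 1, the lag steps from one routine year to the previous one regardless of how many calendar years lie in between.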
Using dplyr:
library(dplyr)
select(dat2, -Y) %>%
filter(routineyear == 1L) %>%
group_by(country) %>%
mutate(Y = 2L - lag(submission, default = 1L)) %>%
ungroup() %>%
right_join(select(dat2, -Y)) %>%
mutate(Y = replace(Y, is.na(Y), 0L))
# # A tibble: 14 x 5
# country year submission routineyear Y
# <fct> <int> <int> <int> <int>
# 1 A 1990 1 1 1
# 2 A 1991 0 0 0
# 3 A 1992 1 1 1
# 4 A 1993 1 0 0
# 5 A 1994 0 0 0
# 6 A 1995 0 0 0
# 7 A 1996 0 1 1
# 8 A 1997 0 0 0
# 9 A 1998 1 1 2
# 10 A 1999 0 0 0
# 11 A 2000 0 0 0
# 12 A 2001 0 1 1
# 13 A 2002 0 0 0
# 14 A 2003 1 1 2
all.equal(.Last.value, dat2)
# [1] TRUE
where dat2 is:
dat2 <- read.table(text =
"country year submission routineyear Y
A 1990 1 1 1
A 1991 0 0 0
A 1992 1 1 1
A 1993 1 0 0
A 1994 0 0 0
A 1995 0 0 0
A 1996 0 1 1
A 1997 0 0 0
A 1998 1 1 2
A 1999 0 0 0
A 2000 0 0 0
A 2001 0 1 1
A 2002 0 0 0
A 2003 1 1 2
", header = TRUE)

Conducting regression analysis using R via SQL Server 2017

I want to perform regression analysis using R code via SQL Server 2017 (R is integrated there).
Here is the native R code, working with a csv file.
The main point of the code is that the regression is performed separately for each group [CustomerName]+[ItemRelation]+[DocumentNum]+[DocumentYear]:
df=read.csv("C:/Users/synthex/Desktop/re.csv", sep=";",dec=",")
#load needed library
library(tidyverse)
library(broom)
#order dataset
df=df[ order(df[,5]),]
df=df[ order(df[,6]),]
#delete signs
df$Customer<-gsub("\\-","",df$Customer)
#create lm function for separately by group regression
my_lm <- function(df) {
lm(SaleCount~IsPromo, data = df)
}
reg=df %>%
group_by(CustomerName,ItemRelation,DocumentNum,DocumentYear) %>%
nest() %>%
mutate(fit = map(data, my_lm),
tidy = map(fit, tidy)) %>%
select(-fit, - data) %>%
unnest()
w=aggregate(df$action, by=list(CustomerName=df$CustomerName,ItemRelation=df$ItemRelation, DocumentNum=df$DocumentNum, DocumentYear=df$DocumentYear), FUN=sum)
View(w)
# multiply each group by the number of days of the action
EA<-data.frame(reg$CustomerName,reg$ItemRelation,reg$DocumentNum,reg$DocumentYear, reg$estimate*w$x)
#del intercepts
toDelete <- seq(2, nrow(EA), 2)
newdat=EA[ toDelete ,]
View(newdat)
The desired result is for this code to run in SSMS.
Here is what I did:
EXECUTE sp_execute_external_script
@language = N'R'
, @script = N' OutputDataSet <- InputDataSet;'
, @input_data_1 = N' SELECT [CustomerName]
,[ItemRelation]
,[SaleCount]
,[DocumentNum]
,[DocumentYear]
,[IsPromo]
FROM [Action].[dbo].[promo_data];'
WITH RESULT SETS (([CustomerName] nvarchar(max) NOT NULL, [ItemRelation] int NOT NULL,
[SaleCount] int NOT NULL,[DocumentNum] int NOT NULL,
[DocumentYear] int NOT NULL, [IsPromo] int NOT NULL));
df=as.data.frame(InputDataSet)
Message 102, level 15, state 1, line 17
Incorrect syntax near the "=" construct.
So, how can I perform regression analysis in SQL Server separately by group?
Note that all coefficients must be saved: when new data arrive in SQL Server, predictions should be calculated automatically from the fitted model equation for each group.
The code above simply estimates the impact of the action: the beta coefficient of each group is multiplied by the number of days of the action for that group.
If needed, here is a reproducible example:
df=structure(list(CustomerName = structure(c(1L, 2L, 3L, 3L, 1L,
2L, 3L, 3L, 4L, 4L, 4L, 1L, 2L, 3L, 3L, 4L, 4L, 4L, 4L, 4L, 4L,
4L, 1L, 2L, 3L, 3L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L,
1L, 2L, 3L, 3L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 1L,
2L, 3L, 3L, 4L, 4L, 4L, 4L, 4L), .Label = c("Attacks of the vehicle",
"Auchan TS", "Tape of the vehicle", "X5 Retail Group"), class = "factor"),
ItemRelation = c(13322L, 13322L, 158121L, 158122L, 13322L,
13322L, 158121L, 158122L, 11592L, 13189L, 13191L, 13322L,
13322L, 158121L, 158122L, 11592L, 13189L, 13191L, 158121L,
158121L, 158122L, 158122L, 13322L, 13322L, 158121L, 158122L,
11592L, 13189L, 13191L, 157186L, 157192L, 158009L, 158010L,
158121L, 158121L, 158122L, 158122L, 13322L, 13322L, 158121L,
158122L, 11592L, 13189L, 13191L, 157186L, 157192L, 158009L,
158010L, 158121L, 158121L, 158122L, 158122L, 13322L, 13322L,
158121L, 158122L, 11514L, 11592L, 11623L, 13189L, 13191L),
SaleCount = c(10L, 35L, 340L, 260L, 3L, 31L, 420L, 380L,
45L, 135L, 852L, 1L, 34L, 360L, 140L, 14L, 62L, 501L, 0L,
560L, 640L, 0L, 0L, 16L, 0L, 0L, 15L, 66L, 542L, 49L, 228L,
3360L, 5720L, 980L, 0L, 0L, 1280L, 9L, 29L, 200L, 120L, 46L,
68L, 569L, 52L, 250L, 2360L, 3140L, 1640L, 0L, 0L, 1820L,
5L, 33L, 260L, 220L, 665L, 25L, -10L, 62L, 281L), DocumentNum = c(36L,
4L, 41L, 41L, 36L, 4L, 41L, 41L, 33L, 33L, 33L, 36L, 4L,
41L, 41L, 33L, 33L, 33L, 63L, 62L, 62L, 63L, 36L, 4L, 41L,
41L, 33L, 33L, 33L, 57L, 56L, 12L, 12L, 62L, 63L, 63L, 62L,
36L, 4L, 41L, 41L, 33L, 33L, 33L, 57L, 56L, 12L, 12L, 62L,
63L, 63L, 62L, 36L, 4L, 41L, 41L, 60L, 33L, 71L, 33L, 33L
), DocumentYear = c(2017L, 2017L, 2017L, 2017L, 2017L, 2017L,
2017L, 2017L, 2017L, 2017L, 2017L, 2017L, 2017L, 2017L, 2017L,
2017L, 2017L, 2017L, 2017L, 2017L, 2017L, 2017L, 2017L, 2017L,
2017L, 2017L, 2017L, 2017L, 2017L, 2017L, 2017L, 2017L, 2017L,
2017L, 2017L, 2017L, 2017L, 2017L, 2017L, 2017L, 2017L, 2017L,
2017L, 2017L, 2017L, 2017L, 2017L, 2017L, 2017L, 2017L, 2017L,
2017L, 2017L, 2017L, 2017L, 2017L, 2017L, 2017L, 2017L, 2017L,
2017L), IsPromo = c(0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L,
0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L,
0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L,
0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L,
0L, 0L, 0L, 0L, 0L, 0L)), .Names = c("CustomerName", "ItemRelation",
"SaleCount", "DocumentNum", "DocumentYear", "IsPromo"), class = "data.frame", row.names = c(NA,
-61L))
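A minimal sketch of R code that could form the body of the @script string follows, assuming the default data frame names InputDataSet and OutputDataSet used by sp_execute_external_script. Here InputDataSet is built by hand so the snippet runs outside SQL Server; the toy values and the helper fit_one are made up for illustration. Only base R is used, so no extra packages need to be installed on the server. Note that the syntax error in the attempt above most likely comes from the line df=as.data.frame(InputDataSet) sitting outside the @script string, where it is parsed as T-SQL.

```r
# Stand-in for the data frame SQL Server would pass in as InputDataSet
InputDataSet <- data.frame(
  CustomerName = rep(c("A", "B"), each = 4),
  ItemRelation = rep(c(1L, 2L), each = 4),
  DocumentNum  = rep(c(10L, 20L), each = 4),
  DocumentYear = 2017L,
  IsPromo      = rep(c(0L, 0L, 1L, 1L), times = 2),
  SaleCount    = c(10L, 12L, 20L, 22L, 5L, 7L, 6L, 8L)
)

keys <- c("CustomerName", "ItemRelation", "DocumentNum", "DocumentYear")

# One data frame per observed combination of the grouping keys
groups <- split(InputDataSet, InputDataSet[keys], drop = TRUE)

# Fit SaleCount ~ IsPromo for one group and return its coefficients as rows
fit_one <- function(g) {
  cf <- coef(lm(SaleCount ~ IsPromo, data = g))
  out <- g[rep(1L, length(cf)), keys]  # repeat the group keys once per coefficient
  out$term <- names(cf)
  out$estimate <- unname(cf)
  out
}

# The data frame SQL Server would return as the result set
OutputDataSet <- do.call(rbind, lapply(groups, fit_one))
rownames(OutputDataSet) <- NULL
print(OutputDataSet)
```

The WITH RESULT SETS clause would then need to describe the columns of OutputDataSet (the four keys, term and estimate) rather than the input columns, and the result could be inserted into a table so the coefficients are saved for later scoring.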

Plot graphs next to a single output

I have a dataset like this dataframe:
structure(list(year = c(2001L, 2002L, 2003L, 2004L, 2005L, 2006L,
2007L, 2001L, 2002L, 2003L, 2004L, 2005L, 2006L, 2007L, 2001L,
2002L, 2003L, 2004L, 2005L, 2006L, 2007L, 2001L, 2002L, 2003L,
2004L, 2005L, 2006L, 2007L, 2001L, 2002L, 2003L, 2004L, 2005L,
2006L, 2007L, 2001L, 2002L, 2003L, 2004L, 2005L, 2006L, 2007L
), volume = c(21L, 44L, 37L, 23L, 46L, 21L, 69L, 21L, 44L, 37L,
23L, 46L, 21L, 69L, 21L, 44L, 37L, 23L, 46L, 21L, 69L, 21L, 44L,
37L, 23L, 46L, 21L, 69L, 21L, 44L, 37L, 23L, 46L, 21L, 69L, 21L,
44L, 37L, 23L, 46L, 21L, 69L), stock = structure(c(1L, 1L, 1L,
1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L, 3L,
3L, 3L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 5L, 5L, 5L, 5L, 5L, 5L, 5L,
6L, 6L, 6L, 6L, 6L, 6L, 6L), .Label = c("stock1", "stock2", "stock3",
"stock4", "stock5", "stock6"), class = "factor")), .Names = c("year",
"volume", "stock"), class = "data.frame", row.names = c(NA, -42L
))
I am trying to produce an output like this one.
What I have so far:
library(ggplot2)
# refer to columns by name inside aes(); set the colour as a constant outside it
p <- ggplot(df, aes(x = year, y = volume)) + geom_line(color = "red") +
facet_grid(stock ~ ., scales = "free_x") + theme(legend.position = "left")

dplyr data manipulation to multiply columns by a value

I have a dataframe that looks like the following (dput at the end):
region type age_group year value
AO1 p 0 1990 12
AO1 p 5 1990 10
AO1 p 10 1990 8
AO1 p 15 1990 14
AO1 p 20 1990 19
...
AO1 p 80 1990 12
AO1 p 1 1990 0.54
AO1 p 2 1990 0.46
AO1 p 3 1990 1
where the last three lines give the proportion of males (1) and females (2), and their total (3).
What I would like to do is create two more variables, value.m and value.f, by multiplying value by the matching proportion.
In this case, value.m would use 0.54 and value.f would use 0.46 for year 1990 in region AO1.
dt$value.m <- dt %>%
group_by(region, type, age_num, year) %>%
mutate(value.m=value*???)
Any ideas?
dt <- structure(list(region = structure(c(3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L,
3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 1L, 1L, 1L, 1L, 1L, 1L,1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L,
4L, 4L, 4L, 4L, 4L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L,
2L, 2L, 2L, 3L, 3L, 3L, 1L, 1L, 1L, 4L, 4L, 4L, 2L, 2L, 2L, 3L, 3L, 3L, 3L,
3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L,
4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L,
2L, 2L, 2L, 2L, 2L, 2L, 2L, 3L, 3L, 3L, 1L, 1L, 1L, 4L, 4L, 4L, 2L, 2L, 2L,
3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 4L, 4L, 4L, 4L,
4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 2L, 2L, 2L, 2L, 2L, 2L,
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 3L, 3L, 3L, 1L, 1L, 1L, 4L, 4L,
4L, 2L, 2L, 2L, 3L, 3L, 3L, 3L, 3L, 3L, 3L,
3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L,
4L, 4L, 4L, 4L, 4L, 4L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L,
2L, 2L, 2L, 2L, 3L, 3L, 3L, 1L, 1L, 1L, 4L, 4L, 4L, 2L, 2L, 2L), .Label =
c("AO1", "AO11", "AO22", "AO3"), class = "factor"), age = structure(c(1L,
10L, 2L, 3L, 4L, 5L, 6L, 7L, 8L, 9L, 11L, 12L, 13L, 14L, 15L, 16L,
17L, 1L, 10L, 2L, 3L, 4L, 5L, 6L, 7L, 8L, 9L, 11L, 12L, 13L,
14L, 15L, 16L, 17L, 1L, 10L, 2L, 3L, 4L, 5L, 6L, 7L, 8L, 9L,
11L, 12L, 13L, 14L, 15L, 16L, 17L, 1L, 10L, 2L, 3L, 4L, 5L, 6L,
7L, 8L, 9L, 11L, 12L, 13L, 14L, 15L, 16L, 17L, 19L, 18L, 20L,
19L, 18L, 20L, 19L, 18L, 20L, 19L, 18L, 20L, 21L, 30L, 22L, 23L,
24L, 25L, 26L, 27L, 28L, 29L, 31L, 32L, 33L, 34L, 35L, 36L, 37L,
21L, 30L, 22L, 23L, 24L, 25L, 26L, 27L, 28L, 29L, 31L, 32L, 33L,
34L, 35L, 36L, 37L, 21L, 30L, 22L, 23L, 24L, 25L, 26L, 27L, 28L,
29L, 31L, 32L, 33L, 34L, 35L, 36L, 37L, 21L, 30L, 22L, 23L, 24L,
25L, 26L, 27L, 28L, 29L, 31L, 32L, 33L, 34L, 35L, 36L, 37L, 39L,
38L, 40L, 39L, 38L, 40L, 39L, 38L, 40L, 39L, 38L, 40L, 1L, 10L,
2L, 3L, 4L, 5L, 6L, 7L, 8L, 9L, 11L, 12L, 13L, 14L, 15L, 16L,
17L, 1L, 10L, 2L, 3L, 4L, 5L, 6L, 7L, 8L, 9L, 11L, 12L, 13L,
14L, 15L, 16L, 17L, 1L, 10L, 2L, 3L, 4L, 5L, 6L, 7L, 8L, 9L,
11L, 12L, 13L, 14L, 15L, 16L, 17L, 1L, 10L, 2L, 3L, 4L, 5L, 6L,
7L, 8L, 9L, 11L, 12L, 13L, 14L, 15L, 16L, 17L, 19L, 18L, 20L,
19L, 18L, 20L, 19L, 18L, 20L, 19L, 18L, 20L, 21L, 30L, 22L, 23L,
24L, 25L, 26L, 27L, 28L, 29L, 31L, 32L, 33L, 34L, 35L, 36L, 37L,
21L, 30L, 22L, 23L, 24L, 25L, 26L, 27L, 28L, 29L, 31L, 32L, 33L,
34L, 35L, 36L, 37L, 21L, 30L, 22L, 23L, 24L, 25L, 26L, 27L, 28L,
29L, 31L, 32L, 33L, 34L, 35L, 36L, 37L, 21L, 30L, 22L, 23L, 24L,
25L, 26L, 27L, 28L, 29L, 31L, 32L, 33L, 34L, 35L, 36L, 37L, 39L,
38L, 40L, 39L, 38L, 40L, 39L, 38L, 40L, 39L, 38L, 40L), .Label = c("c_0_4",
"c_10_14", "c_15_19", "c_20_24", "c_25_29", "c_30_34", "c_35_39",
"c_40_44", "c_45_49", "c_5_9", "c_50_54", "c_55_59", "c_60_64",
"c_65_69", "c_70_74", "c_75_79", "c_80+", "c_f", "c_m", "c_total_sex",
"p_0_4", "p_10_14", "p_15_19", "p_20_24", "p_25_29", "p_30_34",
"p_35_39", "p_40_44", "p_45_49", "p_5_9", "p_50_54", "p_55_59",
"p_60_64", "p_65_69", "p_70_74", "p_75_79", "p_80+", "p_f", "p_m",
"p_total_sex"), class = "factor"), age_num = c(0L, 5L, 10L, 15L,
20L, 25L, 30L, 35L, 40L, 45L, 50L, 55L, 60L, 65L, 70L, 75L, 80L,
0L, 5L, 10L, 15L, 20L, 25L, 30L, 35L, 40L, 45L, 50L, 55L, 60L,
65L, 70L, 75L, 80L, 0L, 5L, 10L, 15L, 20L, 25L, 30L, 35L, 40L,
45L, 50L, 55L, 60L, 65L, 70L, 75L, 80L, 0L, 5L, 10L, 15L, 20L,
25L, 30L, 35L, 40L, 45L, 50L, 55L, 60L, 65L, 70L, 75L, 80L, 1L,
2L, 3L, 1L, 2L, 3L, 1L, 2L, 3L, 1L, 2L, 3L, 0L, 5L, 10L, 15L,
20L, 25L, 30L, 35L, 40L, 45L, 50L, 55L, 60L, 65L, 70L, 75L, 80L,
0L, 5L, 10L, 15L, 20L, 25L, 30L, 35L, 40L, 45L, 50L, 55L, 60L,
65L, 70L, 75L, 80L, 0L, 5L, 10L, 15L, 20L, 25L, 30L, 35L, 40L,
45L, 50L, 55L, 60L, 65L, 70L, 75L, 80L, 0L, 5L, 10L, 15L, 20L,
25L, 30L, 35L, 40L, 45L, 50L, 55L, 60L, 65L, 70L, 75L, 80L, 1L,
2L, 3L, 1L, 2L, 3L, 1L, 2L, 3L, 1L, 2L, 3L, 0L, 5L, 10L, 15L,
20L, 25L, 30L, 35L, 40L, 45L, 50L, 55L, 60L, 65L, 70L, 75L, 80L,
0L, 5L, 10L, 15L, 20L, 25L, 30L, 35L, 40L, 45L, 50L, 55L, 60L,
65L, 70L, 75L, 80L, 0L, 5L, 10L, 15L, 20L, 25L, 30L, 35L, 40L,
45L, 50L, 55L, 60L, 65L, 70L, 75L, 80L, 0L, 5L, 10L, 15L, 20L,
25L, 30L, 35L, 40L, 45L, 50L, 55L, 60L, 65L, 70L, 75L, 80L, 1L,
2L, 3L, 1L, 2L, 3L, 1L, 2L, 3L, 1L, 2L, 3L, 0L, 5L, 10L, 15L,
20L, 25L, 30L, 35L, 40L, 45L, 50L, 55L, 60L, 65L, 70L, 75L, 80L,
0L, 5L, 10L, 15L, 20L, 25L, 30L, 35L, 40L, 45L, 50L, 55L, 60L,
65L, 70L, 75L, 80L, 0L, 5L, 10L, 15L, 20L, 25L, 30L, 35L, 40L,
45L, 50L, 55L, 60L, 65L, 70L, 75L, 80L, 0L, 5L, 10L, 15L, 20L,
25L, 30L, 35L, 40L, 45L, 50L, 55L, 60L, 65L, 70L, 75L, 80L, 1L,
2L, 3L, 1L, 2L, 3L, 1L, 2L, 3L, 1L, 2L, 3L), year = c(2006L, 2006L,
2006L, 2006L, 2006L, 2006L, 2006L, 2006L, 2006L, 2006L, 2006L, 2006L,
2006L, 2006L, 2006L, 2006L, 2006L, 2006L, 2006L, 2006L, 2006L, 2006L,
2006L, 2006L, 2006L, 2006L, 2006L, 2006L, 2006L, 2006L, 2006L, 2006L, 2006L,
2006L, 2006L, 2006L, 2006L, 2006L, 2006L, 2006L, 2006L, 2006L, 2006L, 2006L,
2006L, 2006L,
2006L, 2006L, 2006L, 2006L, 2006L, 2006L, 2006L, 2006L, 2006L, 2006L, 2006L,
2006L, 2006L, 2006L, 2006L, 2006L, 2006L, 2006L, 2006L, 2006L, 2006L, 2006L,
2006L, 2006L, 2006L, 2006L, 2006L, 2006L, 2006L, 2006L, 2006L, 2006L, 2006L,
2006L, 2006L, 2006L, 2006L, 2006L, 2006L, 2006L, 2006L, 2006L, 2006L, 2006L,
2006L, 2006L, 2006L, 2006L, 2006L, 2006L, 2006L, 2006L, 2006L, 2006L, 2006L,
2006L, 2006L, 2006L, 2006L, 2006L, 2006L, 2006L, 2006L, 2006L, 2006L, 2006L,
2006L, 2006L, 2006L, 2006L, 2006L, 2006L, 2006L, 2006L, 2006L, 2006L, 2006L,
2006L, 2006L, 2006L, 2006L, 2006L, 2006L, 2006L, 2006L, 2006L, 2006L, 2006L,
2006L, 2006L, 2006L, 2006L, 2006L, 2006L, 2006L, 2006L, 2006L, 2006L, 2006L,
2006L, 2006L, 2006L, 2006L, 2006L, 2006L, 2006L, 2006L, 2006L, 2006L, 2006L,
2006L, 2006L, 2006L, 2006L, 2007L, 2007L, 2007L, 2007L, 2007L, 2007L, 2007L,
2007L, 2007L, 2007L, 2007L, 2007L, 2007L, 2007L, 2007L, 2007L, 2007L, 2007L,
2007L, 2007L, 2007L, 2007L, 2007L, 2007L, 2007L, 2007L, 2007L, 2007L, 2007L,
2007L, 2007L, 2007L, 2007L, 2007L, 2007L, 2007L, 2007L, 2007L, 2007L, 2007L,
2007L, 2007L, 2007L, 2007L, 2007L, 2007L, 2007L, 2007L, 2007L, 2007L, 2007L,
2007L, 2007L, 2007L, 2007L, 2007L, 2007L, 2007L, 2007L, 2007L, 2007L, 2007L,
2007L, 2007L, 2007L, 2007L, 2007L, 2007L, 2007L, 2007L, 2007L, 2007L, 2007L,
2007L, 2007L, 2007L, 2007L, 2007L, 2007L, 2007L, 2007L, 2007L, 2007L, 2007L,
2007L, 2007L, 2007L, 2007L, 2007L, 2007L, 2007L, 2007L, 2007L, 2007L, 2007L,
2007L, 2007L, 2007L, 2007L, 2007L, 2007L, 2007L, 2007L, 2007L, 2007L, 2007L,
2007L, 2007L, 2007L, 2007L, 2007L, 2007L, 2007L, 2007L, 2007L, 2007L, 2007L,
2007L, 2007L, 2007L, 2007L, 2007L, 2007L, 2007L, 2007L, 2007L, 2007L, 2007L,
2007L, 2007L, 2007L, 2007L, 2007L, 2007L, 2007L, 2007L, 2007L, 2007L, 2007L,
2007L, 2007L, 2007L, 2007L, 2007L, 2007L, 2007L, 2007L, 2007L, 2007L, 2007L,
2007L, 2007L, 2007L, 2007L, 2007L, 2007L, 2007L, 2007L, 2007L, 2007L), value
= c(79.6, 55.1, 44.6, 44.3,
26.8, 9.5, 7.2, 6.5, 5.6, 2.4, 0.6, 5.2, 7.6, 10.4, 12, 13.5,
13.5, 42.4, 23.1, 14.7, 12.5, 3.9, 1.4, 2.4, 5, 4.2, 7, 7.6,
10.2, 9.5, 11.1, 12.1, 13.8, 14.1, 30.5, 18.1, 14.6, 7.6, 1.4,
3.3, 4.1, 6.9, 8, 9.9, 9.8, 13.5, 13.1, 14.1, 14.2, 14.6, 14.6,
60.1, 52.1, 52.5, 64.1, 45.5, 26.9, 10.6, 7.7, 8.7, 0.4, 0.5,
4.1, 8.8, 9.9, 12.4, 13.3, 14, 216.8, 227.6, 459.7, 115.8, 112.3,
243.5, 85, 87.9, 188.2, 241.6, 253.9, 510.8, 0.2, 0.15, 0.13,
0.13, 0.09, 0.053, 0.05, 0.05, 0.04, 0.03, 0.03, 0.024, 0, 0.01,
0.016, 0, 0, 0.22, 0.15, 0.12, 0.11, 0.07, 0.05, 0.05, 0.04,
0.04, 0.03, 0.03, 0.02, 0.02, 0.02, 0.01, 0.01, 0, 0.2, 0.19,
0.15, 0.11, 0.07, 0.06, 0.06, 0.04, 0.04, 0.03, 0.03, 0.01, 0.01,
0.01, 0.01, 0, 0, 0.14, 0.13, 0.13, 0.15, 0.12, 0.08, 0.05, 0.04,
0.05, 0.03, 0.03, 0.02, 0.01, 0.01, 0.01, 0, 0, 0.49, 0.51, 1,
0.51, 0.49, 1, 0.49, 0.51, 1, 0.49, 0.51, 1, 241.9, 175.54, 146.5,
138.46, 108.14, 73.94, 66.58, 64.78, 58.9, 43.86, 49.1, 36.5,
33.38, 25.54, 21.66, 18.42, 18.58, 243.74, 163.86, 130.22, 121.42,
96.1, 80.3, 63.9, 55.02, 49.02, 41.78, 51.74, 35.22, 32.66, 25.78,
23.06, 18.66, 18.14, 152.5, 109.9, 93.34, 82.62, 61.7, 56.06,
44.38, 38.26, 33.02, 29.58, 30.86, 21.86, 21.18, 17.62, 17.86,
15.86, 15.58, 196.82, 175.74, 180.46, 182.3, 153.22, 118.18,
81.34, 70.46, 65.82, 47.7, 54.66, 38.54, 29.42, 25.58, 20.38,
18.18, 17.18, 547.58, 566.78, 1100.38, 519.1, 522.78, 1028.06,
310.54, 322.26, 618.82, 619.62, 647.02, 1252.66, 0.206, 0.15,
0.126, 0.122, 0.088, 0.052, 0.05, 0.05, 0.04, 0.03, 0.032, 0.02,
0.02, 0.01, 0.01, 0, 0.002, 0.222, 0.15, 0.118, 0.108, 0.074,
0.054, 0.05, 0.04, 0.038, 0.028, 0.032, 0.02, 0.02, 0.018, 0.01,
0.008, 0, 0.23, 0.158, 0.142, 0.11, 0.074, 0.064, 0.056, 0.04,
0.038, 0.028, 0.03, 0.012, 0.01, 0.01, 0.01, 0, 0, 0.144, 0.132,
0.134, 0.14, 0.118, 0.082, 0.054, 0.042, 0.046, 0.028, 0.032,
0.02, 0.01, 0.01, 0.008, 0, 0, 0.49, 0.51, 1, 0.57, 0.43, 1,
0.4, 0.6, 1, 0.3, 0.7, 1)), .Names = c("region", "age", "age_num",
"year", "value"), class = "data.frame", row.names = c(NA, -320L))
Step 1: combine region and year into a single key variable (I work on the dt that you dput-ed):
new.dt <- dt %>% mutate(regyear = paste(region, year))
Step 2: create a data frame with only the p_m values and regyear:
p.m.s<-new.dt %>%
filter(age=='p_m') %>%
select(regyear, value) %>%
rename(pm=value) # to avoid duplicated names in new.df and p.m.s
Step 3: the same with p_f's:
p.f.s<-new.dt %>% filter(age=='p_f') %>% select(regyear, value) %>% rename(pf=value)
Step 4: get what you need :)
new.dt %>%
left_join(p.m.s, by = "regyear") %>% # add p_m's
left_join(p.f.s, by = "regyear") %>% # add p_f's
mutate(value.m=value*pm, value.f=value*pf) %>%
select(-c(regyear,pm,pf)) # clean up
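To see what steps 1 to 4 produce, here is a minimal sketch on a toy data frame. The data frame `toy` and its values are made up to mirror the numbers in the question (0.54 for males, 0.46 for females); only `regyear`, `pm` and `pf` come from the steps above. It assumes dplyr is installed and that each region-year has exactly one p_m row and one p_f row.

```r
library(dplyr)

# Hypothetical toy data: three age-group rows plus the male/female
# share rows for a single region-year (values modelled on the question).
toy <- data.frame(
  region = "AO1",
  year   = 1990L,
  age    = c("p_0_4", "p_5_9", "p_10_14", "p_m", "p_f"),
  value  = c(12, 10, 8, 0.54, 0.46),
  stringsAsFactors = FALSE
)

# Step 1: one key per region-year.
toy_keyed <- toy %>% mutate(regyear = paste(region, year))

# Steps 2 and 3: lookup tables holding the male and female shares.
pm_tab <- toy_keyed %>% filter(age == "p_m") %>% select(regyear, pm = value)
pf_tab <- toy_keyed %>% filter(age == "p_f") %>% select(regyear, pf = value)

# Step 4: attach the shares to every row, multiply, drop the helpers.
res <- toy_keyed %>%
  left_join(pm_tab, by = "regyear") %>%
  left_join(pf_tab, by = "regyear") %>%
  mutate(value.m = value * pm, value.f = value * pf) %>%
  select(-c(regyear, pm, pf))

print(res)
```

The row count stays the same because each lookup table has one row per key. If a key appeared twice in a lookup table, the join would duplicate rows, so it is worth checking that before joining on real data.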
Hope this helped!
Hi, in the data you gave, the variable type is called age, so be careful about this. With your data you can do it like this:
dt %>%
  # look up the male share for each region and year
  left_join(dt %>% filter(age == "p_m") %>% select(region, year, p_m = value),
            by = c("region", "year")) %>%
  # look up the female share for each region and year
  left_join(dt %>% filter(age == "p_f") %>% select(region, year, p_f = value),
            by = c("region", "year")) %>%
  mutate(value.m = value * p_m, value.f = value * p_f) %>%
  select(-c(p_m, p_f))
This code filters the p_m and p_f rows for each region and year and joins them to the original table. The join has to include year as well as region, because the data holds more than one year per region and joining on region alone would duplicate rows.
Then mutate computes the new values, and the final select drops the helper columns p_m and p_f.
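Both answers rely on joins. As an alternative that stays closer to the group_by and mutate attempt in the question, the share can be picked out inside each group by subsetting. This is only a sketch, not taken from either answer: the data frame `toy` and its values are hypothetical, and it assumes every region-year group contains exactly one p_m row and one p_f row (with none, mutate fails; with several, the lengths no longer match).

```r
library(dplyr)

# Hypothetical toy data: two regions, one year, two age groups each,
# plus the male (p_m) and female (p_f) share rows.
toy <- data.frame(
  region = rep(c("AO1", "AO3"), each = 4),
  year   = 2006L,
  age    = rep(c("p_0_4", "p_5_9", "p_m", "p_f"), times = 2),
  value  = c(0.20, 0.15, 0.49, 0.51,
             0.22, 0.15, 0.57, 0.43),
  stringsAsFactors = FALSE
)

res <- toy %>%
  group_by(region, year) %>%
  # value[age == "p_m"] is a single number per group and is recycled
  # across all rows of that group.
  mutate(value.m = value * value[age == "p_m"],
         value.f = value * value[age == "p_f"]) %>%
  ungroup()

print(res)
```

This avoids the helper columns and the clean-up step, at the price of depending on the one-row-per-group assumption stated above.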

Resources