Long to wide: compacting rows - r

This is my first post.
I'm a beginner with R.
I have a data frame like this:
date value
2018-01-01 123
2018-02-01 12
2018-03-01 23
...
2019-01-01 3
2019-02-01 21
2019-03-01 2
...
2020-01-01 31
2020-02-01 23
2020-03-01 32
...
I want to transform it into:
year ene feb mar ...
2018 123 12 23 ...
2019 3 21 2 ...
2020 31 23 32 ...
I tried:
df <- mutate (df,year=year(as.Date(date)), month=month(as.Date(date), label=TRUE,abbr=TRUE))
I got:
date value month year
2018-01-01 123 ene 2018
2018-02-01 12 feb 2018
2018-03-01 23 mar 2018
...
2019-01-01 3 ene 2019
2019-02-01 21 feb 2019
2019-03-01 2 mar 2019
...
2020-01-01 31 ene 2020
2020-02-01 23 feb 2020
2020-03-01 32 mar 2020
...
Then I ran:
pivot_wider(df, names_from="month", values_from=value)
I got:
date year ene feb mar ...
2018-01-01 2018 123 NA NA ...
2018-02-01 2018 NA 12 NA ...
2018-03-01 2018 NA NA 23 ...
...
2019-01-01 2019 3 NA NA ...
2019-02-01 2019 NA 21 NA ...
2019-03-01 2019 NA NA 2 ...
...
2020-01-01 2020 31 NA NA ...
2020-02-01 2020 NA 23 NA ...
2020-03-01 2020 NA NA 32 ...
I need to collapse the rows upward, grouping by "year", so that each year takes a single row, but I don't know how to do it.
I'm close to a solution, but I can't find it.
Thanks in advance!

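This should do it. The leftover date column is what keeps the rows apart: pivot_wider() treats every column not used in names_from or values_from as an identifier, so drop date (or set id_cols) before pivoting: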
library(tidyverse)
library(lubridate)
df %>%
mutate(month = lubridate::month(as.Date(date), label=TRUE, abbr=TRUE),
year = lubridate::year(as.Date(date))) %>%
select(value, month, year) %>%
pivot_wider(id_cols = year, names_from = month, values_from = value)
Which returns:
# A tibble: 3 × 4
year Jan Feb Mar
<dbl> <int> <int> <int>
1 2018 123 12 23
2 2019 3 21 2
3 2020 31 23 32
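To see why the original attempt produced NAs, here is a minimal sketch that runs both versions side by side. The four-row data frame is an assumed subset of the question's data, and the names with_cols, sparse and compact are made up for illustration.

```r
library(dplyr)
library(tidyr)
library(lubridate)

# Assumed subset of the question's data (two months, two years)
df <- data.frame(
  date  = c("2018-01-01", "2018-02-01", "2019-01-01", "2019-02-01"),
  value = c(123, 12, 3, 21)
)

with_cols <- df %>%
  mutate(year  = lubridate::year(as.Date(date)),
         month = lubridate::month(as.Date(date), label = TRUE, abbr = TRUE))

# date is still an identifier here, so every input row stays a separate row
sparse <- pivot_wider(with_cols, names_from = month, values_from = value)

# Only year identifies a row, so each year collapses into a single row
compact <- pivot_wider(with_cols, id_cols = year,
                       names_from = month, values_from = value)
```

sparse keeps one row per date and is padded with NA, while compact has one row per year. The month column names depend on your locale (ene/feb or Jan/Feb).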

Related

Mutate Year with Month Column for a Time Series Data Input in R Using Lubridate Package

I have this time series data frame as follows:
df <- read.table(text =
"Year Month Value
2021 1 4
2021 2 11
2021 3 18
2021 4 6
2021 5 20
2021 6 5
2021 7 12
2021 8 4
2021 9 11
2021 10 18
2021 11 6
2021 12 20
2022 1 14
2022 2 11
2022 3 18
2022 4 9
2022 5 22
2022 6 19
2022 7 22
2022 8 24
2022 9 17
2022 10 28
2022 11 16
2022 12 26",
header = TRUE)
I want to turn this data frame into a time series object with only a date column and a value column, so that I can use the ts function to set the start and end points, as in ts(ts, start = starts, frequency = 12). R should know that 2022 is a year and that the corresponding 1:12 are its months; the same should apply to 2021. I would prefer to use the lubridate package.
pacman::p_load(
dplyr,
lubridate)
UPDATE
I now use the unite function from the tidyr package.
df|>
unite(col='date', c('Year', 'Month'), sep='')
Perhaps this?
df |>
tidyr::unite(col='date', c('Year', 'Month'), sep='-') |>
mutate(date = lubridate::ym(date))
# date Value
# 1 2021-01-01 4
# 2 2021-02-01 11
# 3 2021-03-01 18
# 4 2021-04-01 6
# 5 2021-05-01 20
# 6 2021-06-01 5
# 7 2021-07-01 12
# 8 2021-08-01 4
# 9 2021-09-01 11
# 10 2021-10-01 18
# 11 2021-11-01 6
# 12 2021-12-01 20
# 13 2022-01-01 14
# 14 2022-02-01 11
# 15 2022-03-01 18
# 16 2022-04-01 9
# 17 2022-05-01 22
# 18 2022-06-01 19
# 19 2022-07-01 22
# 20 2022-08-01 24
# 21 2022-09-01 17
# 22 2022-10-01 28
# 23 2022-11-01 16
# 24 2022-12-01 26
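Since the question mentions ts(), here is a rough sketch of going straight from the Year and Month columns to a monthly ts object with base R only. It assumes the rows form one gap-free monthly series once sorted; the names series and sub and the chosen window are just for illustration.

```r
# Same values as in the question, rebuilt so the example is self-contained
df <- data.frame(
  Year  = rep(c(2021, 2022), each = 12),
  Month = rep(1:12, times = 2),
  Value = c(4, 11, 18, 6, 20, 5, 12, 4, 11, 18, 6, 20,
            14, 11, 18, 9, 22, 19, 22, 24, 17, 28, 16, 26)
)

# ts() only stores values plus a start and a frequency,
# so the rows must already be in chronological order
df <- df[order(df$Year, df$Month), ]

series <- ts(df$Value,
             start = c(df$Year[1], df$Month[1]),  # c(year, month)
             frequency = 12)                       # 12 observations per year

# window() picks a start and end point from an existing ts object
sub <- window(series, start = c(2021, 6), end = c(2022, 3))
```

A ts object has no separate date column, so the lubridate step is only needed if you also want the dates in a data frame.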

How to change grouped data into ungrouped data

I have grouped data that I want to convert to ungrouped data.
year<-c(rep(2014,4),rep(2015,4))
Age<-rep(c(22,23,24,25),2)
n<-c(1,1,3,2,0,2,3,1)
mydata<-data.frame(year,Age,n)
I would like to have a dataset like the one below created from the previous one.
year Age
1 2014 22
2 2014 23
3 2014 24
4 2014 24
5 2014 24
6 2014 25
7 2014 25
8 2015 23
9 2015 23
10 2015 24
11 2015 24
12 2015 24
13 2015 25
Try
mydata[rep(1:nrow(mydata),mydata$n),]
year Age n
1 2014 22 1
2 2014 23 1
3 2014 24 3
3.1 2014 24 3
3.2 2014 24 3
4 2014 25 2
4.1 2014 25 2
6 2015 23 2
6.1 2015 23 2
7 2015 24 3
7.1 2015 24 3
7.2 2015 24 3
8 2015 25 1
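If the n column and the 3.1-style row names are not wanted, the same base R idea can be tidied up a little. This is only a sketch; expanded is a name made up here.

```r
year <- c(rep(2014, 4), rep(2015, 4))
Age  <- rep(c(22, 23, 24, 25), 2)
n    <- c(1, 1, 3, 2, 0, 2, 3, 1)
mydata <- data.frame(year, Age, n)

# Repeat each row index n times; a count of 0 drops the row entirely
idx <- rep(seq_len(nrow(mydata)), mydata$n)

# Keep only the columns of interest and reset the row names to 1, 2, 3, ...
expanded <- mydata[idx, c("year", "Age")]
rownames(expanded) <- NULL
```

seq_len(nrow(mydata)) is used instead of 1:nrow(mydata) so that an empty data frame does not yield the sequence 1:0.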
Here's a tidyverse solution:
library(tidyverse)
mydata %>%
uncount(n)
which gives:
year Age
1 2014 22
2 2014 23
3 2014 24
4 2014 24
5 2014 24
6 2014 25
7 2014 25
8 2015 23
9 2015 23
10 2015 24
11 2015 24
12 2015 24
13 2015 25
You can also use tidyr syntax for this:
library(tidyr)
year<-c(rep(2014,4),rep(2015,4))
Age<-rep(c(22,23,24,25),2)
n<-c(1,1,3,2,0,2,3,1)
mydata<-data.frame(year,Age,n)
uncount(mydata, n)
#> year Age
#> 1 2014 22
#> 2 2014 23
#> 3 2014 24
#> 4 2014 24
#> 5 2014 24
#> 6 2014 25
#> 7 2014 25
#> 8 2015 23
#> 9 2015 23
#> 10 2015 24
#> 11 2015 24
#> 12 2015 24
#> 13 2015 25
But of course you shouldn't use tidyr just because it is tidyr :) See "An alternate view of the Tidyverse 'dialect' of the R language, and its promotion by RStudio."
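We can use tidyr::complete. Note that this approach does not drop the row with n = 0 (year 2015, Age 22); it shows up as row 14 in the output below: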
library(tidyr)
library(dplyr)
mydata %>% group_by(year, Age) %>%
complete(n = seq_len(n)) %>%
select(-n) %>%
ungroup()
# A tibble: 14 × 2
year Age
<dbl> <dbl>
1 2014 22
2 2014 23
3 2014 24
4 2014 24
5 2014 24
6 2014 25
7 2014 25
8 2015 23
9 2015 23
10 2015 24
11 2015 24
12 2015 24
13 2015 25
14 2015 22

Trying to use gg_lag() but apparently have more than one time series

I'm trying to draw lag plots using gg_lag, but I keep getting the same error about my tsibble:
# A tsibble: 255 x 6 [7D]
# Key: Demand [163]
Week Demand Date Month year Quarter
<dbl> <dbl> <date> <mth> <chr> <qtr>
1 1 48 2018-01-01 2018 Jan 2018 2018 Q1
2 2 101 2018-01-08 2018 Jan 2018 2018 Q1
3 3 129 2018-01-15 2018 Jan 2018 2018 Q1
4 4 113 2018-01-22 2018 Jan 2018 2018 Q1
5 5 116 2018-01-29 2018 Jan 2018 2018 Q1
6 6 123 2018-02-05 2018 Feb 2018 2018 Q1
7 7 137 2018-02-12 2018 Feb 2018 2018 Q1
8 8 136 2018-02-19 2018 Feb 2018 2018 Q1
9 9 151 2018-02-26 2018 Feb 2018 2018 Q1
10 10 87 2018-03-05 2018 Mar 2018 2018 Q1
# ... with 245 more rows
Printer_Q %>% gg_lag(Demand, geom='point')
Error: The data provided to contains more than one time series. Please filter a single time series to use gg_lag()
I tried filtering my data with:
Printer_Q <- Demandts %>%
select(-Week, -year, -Month, -Quarter)
...so that I am left with Demand and Date but it still says I have more than one time series? What am I doing wrong?
The Demand column should not be a key variable. A key variable is a categorical variable used to distinguish multiple time series in a single tsibble. It appears you just have one time series here, so you don't need a key variable.
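As a rough sketch of what that looks like in practice, the tsibble can be rebuilt with Date as the index and no key at all. The six rows below are taken from the printout in the question; the names weekly and single are made up for illustration.

```r
library(tsibble)

# First six weekly observations from the question's printout
weekly <- data.frame(
  Date   = as.Date("2018-01-01") + 7 * 0:5,
  Demand = c(48, 101, 129, 113, 116, 123)
)

# No key argument: the whole table is treated as one time series
single <- as_tsibble(weekly, index = Date)

# key_vars() returns character(0) when there is no key
key_vars(single)
```

With a key-free tsibble like single, a call such as gg_lag(single, Demand, geom = "point") from the feasts package should no longer complain about multiple series. Note that select() on a tsibble keeps the key columns, which is why dropping the other columns did not help.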

Performing a rolling average with criteria in R

I've been trying to learn the most basic items first and then build up the complexity. For this one, how would I modify the last line so that it creates a rolling 12-month average for each seriescode? In this case, it would produce an average of 8 for seriescode 100 and 27 for seriescode 110.
First, here is the sample data:
Monthx <- c(201911, 201912, 20201, 20202, 20203, 20204, 20205, 20206, 20207,
20208, 20209, 202010, 202011,
201911, 201912, 20201, 20202, 20203, 20204, 20205, 20206, 20207,
20208, 20209, 202010, 202011)
empx <- c(1,2,3,4,5,6,7,8,9,10,11,12,13,21,22,23,24,25,26,27,28,29,20,31,32,33)
seriescode<-c(100,100,100,100,100,100,100,100,100,100,100,100,100,110,110,110,110,110,110,110,110,110,110,110,110,110)
ces12x <- data.frame(Monthx,empx,seriescode)
Manipulations
library(dplyr)
ces12x<- ces12x %>% mutate(year = substr(as.numeric(Monthx),1,4),
month = substr(as.numeric(Monthx),5,7),
date = as.Date(paste(year,month,"1",sep ="-")))
Month_ord <- order(Monthx)
ces12x<-ces12x %>% mutate(ravg = zoo::rollmeanr(empx, 12, fill = NA))
You just need to add group_by(seriescode), so that mutate() is computed separately for each seriescode:
Monthx <- c(201911, 201912, 20201, 20202, 20203, 20204, 20205, 20206, 20207,
20208, 20209, 202010, 202011,
201911, 201912, 20201, 20202, 20203, 20204, 20205, 20206, 20207,
20208, 20209, 202010, 202011)
empx <- c(1,2,3,4,5,6,7,8,9,10,11,12,13,21,22,23,24,25,26,27,28,29,20,31,32,33)
seriescode<-c(100,100,100,100,100,100,100,100,100,100,100,100,100,110,110,110,110,110,110,110,110,110,110,110,110,110)
ces12x <- data.frame(Monthx,empx,seriescode)
ces12x<- ces12x %>% mutate(year = substr(as.numeric(Monthx),1,4),
month = substr(as.numeric(Monthx),5,7),
date = as.Date(paste(year,month,"1",sep ="-")))
Month_ord <- order(Monthx)
ces12x<-ces12x %>% group_by(seriescode) %>% mutate(ravg = zoo::rollmeanr(empx, 12, fill = NA)) # add the group_by(seriescode)
This produces the output:
# A tibble: 26 x 7
# Groups: seriescode [2]
Monthx empx seriescode year month date ravg
<dbl> <dbl> <dbl> <chr> <chr> <date> <dbl>
1 201911 1 100 2019 11 2019-11-01 NA
2 201912 2 100 2019 12 2019-12-01 NA
3 20201 3 100 2020 1 2020-01-01 NA
4 20202 4 100 2020 2 2020-02-01 NA
5 20203 5 100 2020 3 2020-03-01 NA
6 20204 6 100 2020 4 2020-04-01 NA
7 20205 7 100 2020 5 2020-05-01 NA
8 20206 8 100 2020 6 2020-06-01 NA
9 20207 9 100 2020 7 2020-07-01 NA
10 20208 10 100 2020 8 2020-08-01 NA
11 20209 11 100 2020 9 2020-09-01 NA
12 202010 12 100 2020 10 2020-10-01 6.5
13 202011 13 100 2020 11 2020-11-01 7.5
14 201911 21 110 2019 11 2019-11-01 NA
15 201912 22 110 2019 12 2019-12-01 NA
16 20201 23 110 2020 1 2020-01-01 NA
17 20202 24 110 2020 2 2020-02-01 NA
18 20203 25 110 2020 3 2020-03-01 NA
19 20204 26 110 2020 4 2020-04-01 NA
20 20205 27 110 2020 5 2020-05-01 NA
21 20206 28 110 2020 6 2020-06-01 NA
22 20207 29 110 2020 7 2020-07-01 NA
23 20208 20 110 2020 8 2020-08-01 NA
24 20209 31 110 2020 9 2020-09-01 NA
25 202010 32 110 2020 10 2020-10-01 25.7
26 202011 33 110 2020 11 2020-11-01 26.7
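If you only need a single average over the most recent 12 months of each seriescode, rather than a rolling column, and want to continue using the tidyverse, the following should do the trick: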
library(dplyr)
ces12x %>%
group_by(seriescode) %>%
arrange(date) %>%
slice(tail(row_number(), 12)) %>%
summarize(ravg = mean(empx))
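One thing to watch with the rolling version is row order: rollmeanr() averages whatever rows happen to be adjacent, so each group should be sorted by date first. Here is a small sketch on made-up data (toy and rolled are hypothetical names, and a window of 3 is used instead of 12 to keep it short).

```r
library(dplyr)
library(zoo)

# Made-up data; series 100 is deliberately out of date order
toy <- data.frame(
  seriescode = c(100, 100, 100, 100, 110, 110, 110, 110),
  date = as.Date(c("2020-03-01", "2020-01-01", "2020-02-01", "2020-04-01",
                   "2020-01-01", "2020-02-01", "2020-03-01", "2020-04-01")),
  empx = c(3, 1, 2, 4, 10, 20, 30, 40)
)

rolled <- toy %>%
  group_by(seriescode) %>%
  arrange(date, .by_group = TRUE) %>%   # sort by date within each seriescode
  mutate(ravg = zoo::rollmeanr(empx, k = 3, fill = NA)) %>%  # right-aligned window
  ungroup()
```

The first k - 1 rows of each group are NA because the window is not yet full, which matches the NAs in the 12-month output above.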

Merging complementary rows of a dataframe with R

I have a data frame like this:
0 weekday day month year hour basal bolus carb period.h
1 Tuesday 01 03 2016 0.0 0.25 NA NA 0
2 Tuesday 01 03 2016 10.9 NA NA 67 10
3 Tuesday 01 03 2016 10.9 NA 4.15 NA 10
4 Tuesday 01 03 2016 12.0 0.30 NA NA 12
5 Tuesday 01 03 2016 17.0 0.50 NA NA 17
6 Tuesday 01 03 2016 17.6 NA NA 33 17
7 Tuesday 01 03 2016 17.6 NA 1.35 NA 17
8 Tuesday 01 03 2016 18.6 NA NA 44 18
9 Tuesday 01 03 2016 18.6 NA 1.80 NA 18
10 Tuesday 01 03 2016 18.9 NA NA 17 18
11 Tuesday 01 03 2016 18.9 NA 0.70 NA 18
12 Tuesday 01 03 2016 22.0 0.40 NA NA 22
13 Wednesday 02 03 2016 0.0 0.25 NA NA 0
14 Wednesday 02 03 2016 9.7 NA NA 39 9
15 Wednesday 02 03 2016 9.7 NA 2.65 NA 9
16 Wednesday 02 03 2016 11.2 NA NA 13 11
17 Wednesday 02 03 2016 11.2 NA 0.30 NA 11
18 Wednesday 02 03 2016 12.0 0.30 NA NA 12
19 Wednesday 02 03 2016 12.0 NA NA 16 12
20 Wednesday 02 03 2016 12.0 NA 0.65 NA 12
If you look at lines 2 and 3, you will notice that they correspond to exactly the same day and time: in line 2 only "carb" is not NA, and in line 3 only "bolus" is not NA (these are diabetes data).
I want to merge such lines into a single one:
2 Tuesday 01 03 2016 10.9 NA NA 67 10
3 Tuesday 01 03 2016 10.9 NA 4.15 NA 10
->
2 Tuesday 01 03 2016 10.9 NA 4.15 67 10
I could of course do a brute-force double loop over the lines, but I am looking for a cleverer and faster way.
You can group your data frame by the common identifier columns (weekday, day, month, year, hour and period.h here), then sort each remaining column and take its first element. By default, sort() removes NAs from the vector being sorted, so you end up with a non-NA element for each column within each group; if all elements of a column are NA, sort(col)[1] returns NA:
library(dplyr)
df %>%
group_by(weekday, day, month, year, hour, period.h) %>%
summarise(across(everything(), ~ sort(.x)[1]))  # across() replaces the deprecated funs()
# weekday day month year hour period.h basal bolus carb
# <fctr> <int> <int> <int> <dbl> <int> <dbl> <dbl> <int>
# 1 Tuesday 1 3 2016 0.0 0 0.25 NA NA
# 2 Tuesday 1 3 2016 10.9 10 NA 4.15 67
# 3 Tuesday 1 3 2016 12.0 12 0.30 NA NA
# 4 Tuesday 1 3 2016 17.0 17 0.50 NA NA
# 5 Tuesday 1 3 2016 17.6 17 NA 1.35 33
# 6 Tuesday 1 3 2016 18.6 18 NA 1.80 44
# 7 Tuesday 1 3 2016 18.9 18 NA 0.70 17
# 8 Tuesday 1 3 2016 22.0 22 0.40 NA NA
# 9 Wednesday 2 3 2016 0.0 0 0.25 NA NA
# 10 Wednesday 2 3 2016 9.7 9 NA 2.65 39
# 11 Wednesday 2 3 2016 11.2 11 NA 0.30 13
# 12 Wednesday 2 3 2016 12.0 12 0.30 0.65 16
Instead of sort(), maybe a more appropriate function to use here is na.omit():
df %>% group_by(weekday, day, month, year, hour, period.h) %>%
summarise(across(everything(), ~ na.omit(.x)[1]))  # across() replaces the deprecated funs()
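The two variants agree on this data because each group has at most one non-NA value per column. A quick base R sketch of where they would differ, using made-up vectors:

```r
# One non-NA value: both approaches return it
x <- c(NA, 4.15)
first_sorted <- sort(x)[1]      # sort() drops NAs by default
first_seen   <- na.omit(x)[1]   # na.omit() drops NAs but keeps the order

# Several non-NA values: sort() gives the smallest, na.omit() the first one seen
y <- c(NA, 5, 2)
smallest  <- sort(y)[1]
first_val <- na.omit(y)[1]

# All NA: both give NA, so the merged cell stays NA
z <- c(NA_real_, NA_real_)
all_na_sorted <- sort(z)[1]
all_na_omit   <- na.omit(z)[1]
```

So if a group could ever hold two real values for the same column, na.omit() keeps the first in row order while sort() silently picks the minimum.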
