I have this time series data frame as follows:
df <- read.table(text =
"Year Month Value
2021 1 4
2021 2 11
2021 3 18
2021 4 6
2021 5 20
2021 6 5
2021 7 12
2021 8 4
2021 9 11
2021 10 18
2021 11 6
2021 12 20
2022 1 14
2022 2 11
2022 3 18
2022 4 9
2022 5 22
2022 6 19
2022 7 22
2022 8 24
2022 9 17
2022 10 28
2022 11 16
2022 12 26",
header = TRUE)
I want to turn this data frame into a time series with only a date column and a value column, so that I can use the ts function to set the start and end points, as in ts(ts, start = starts, frequency = 12). R should recognise that 2021 and 2022 are years and that the corresponding 1:12 are their months. I would prefer to use the lubridate package.
pacman::p_load(
dplyr,
lubridate)
UPDATE
I now use the unite function from the tidyr package.
df |>
  tidyr::unite(col = 'date', c('Year', 'Month'), sep = '')
Perhaps this?
df |>
tidyr::unite(col='date', c('Year', 'Month'), sep='-') |>
mutate(date = lubridate::ym(date))
# date Value
# 1 2021-01-01 4
# 2 2021-02-01 11
# 3 2021-03-01 18
# 4 2021-04-01 6
# 5 2021-05-01 20
# 6 2021-06-01 5
# 7 2021-07-01 12
# 8 2021-08-01 4
# 9 2021-09-01 11
# 10 2021-10-01 18
# 11 2021-11-01 6
# 12 2021-12-01 20
# 13 2022-01-01 14
# 14 2022-02-01 11
# 15 2022-03-01 18
# 16 2022-04-01 9
# 17 2022-05-01 22
# 18 2022-06-01 19
# 19 2022-07-01 22
# 20 2022-08-01 24
# 21 2022-09-01 17
# 22 2022-10-01 28
# 23 2022-11-01 16
# 24 2022-12-01 26
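If a base R ts object is the end goal, the Year and Month columns can also be passed to ts() directly, without building a date column first. The following is a minimal sketch under that assumption; the names value_ts and sub_ts and the chosen window bounds are illustrative and not from the question.

```r
# Same data as in the question, rebuilt compactly
df <- data.frame(
  Year  = rep(2021:2022, each = 12),
  Month = rep(1:12, times = 2),
  Value = c(4, 11, 18, 6, 20, 5, 12, 4, 11, 18, 6, 20,
            14, 11, 18, 9, 22, 19, 22, 24, 17, 28, 16, 26)
)

# ts() assumes regularly spaced, chronologically ordered observations
df <- df[order(df$Year, df$Month), ]

# start = c(year, month) together with frequency = 12 marks the series as monthly
value_ts <- ts(df$Value, start = c(df$Year[1], df$Month[1]), frequency = 12)

# window() picks a start and end point, here June 2021 to March 2022
sub_ts <- window(value_ts, start = c(2021, 6), end = c(2022, 3))
```

ts() only needs the first period and the frequency; choosing a start and end point afterwards is the job of window().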
I have grouped data that I want to convert to ungrouped data.
year<-c(rep(2014,4),rep(2015,4))
Age<-rep(c(22,23,24,25),2)
n<-c(1,1,3,2,0,2,3,1)
mydata<-data.frame(year,Age,n)
I would like to have a dataset like the one below created from the previous one.
year Age
1 2014 22
2 2014 23
3 2014 24
4 2014 24
5 2014 24
6 2014 25
7 2014 25
8 2015 23
9 2015 23
10 2015 24
11 2015 24
12 2015 24
13 2015 25
Try
mydata[rep(1:nrow(mydata),mydata$n),]
year Age n
1 2014 22 1
2 2014 23 1
3 2014 24 3
3.1 2014 24 3
3.2 2014 24 3
4 2014 25 2
4.1 2014 25 2
6 2015 23 2
6.1 2015 23 2
7 2015 24 3
7.1 2015 24 3
7.2 2015 24 3
8 2015 25 1
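The result above still carries the n column and row names such as 3.1. A minimal base R sketch that also drops n and resets the row names, so the result matches the requested layout; idx and expanded are names chosen here for illustration.

```r
year <- c(rep(2014, 4), rep(2015, 4))
Age <- rep(c(22, 23, 24, 25), 2)
n <- c(1, 1, 3, 2, 0, 2, 3, 1)
mydata <- data.frame(year, Age, n)

# Repeat each row index n times; a row with n = 0 is repeated zero times and disappears
idx <- rep(seq_len(nrow(mydata)), times = mydata$n)

# Keep only the identifying columns and renumber the rows 1, 2, 3, ...
expanded <- mydata[idx, c("year", "Age")]
rownames(expanded) <- NULL
```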
Here's a tidyverse solution:
library(tidyverse)
mydata %>%
uncount(n)
which gives:
year Age
1 2014 22
2 2014 23
3 2014 24
4 2014 24
5 2014 24
6 2014 25
7 2014 25
8 2015 23
9 2015 23
10 2015 24
11 2015 24
12 2015 24
13 2015 25
You can also use tidyr syntax for this:
library(tidyr)
year<-c(rep(2014,4),rep(2015,4))
Age<-rep(c(22,23,24,25),2)
n<-c(1,1,3,2,0,2,3,1)
mydata<-data.frame(year,Age,n)
uncount(mydata, n)
#> year Age
#> 1 2014 22
#> 2 2014 23
#> 3 2014 24
#> 4 2014 24
#> 5 2014 24
#> 6 2014 25
#> 7 2014 25
#> 8 2015 23
#> 9 2015 23
#> 10 2015 24
#> 11 2015 24
#> 12 2015 24
#> 13 2015 25
But of course you shouldn't use tidyr just because it is tidyr :) See "An alternate view of the Tidyverse 'dialect' of the R language, and its promotion by RStudio."
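We can use tidyr::complete. Note that the group with n = 0 survives as an extra row (row 14 in the output below), so the result does not exactly match the requested one; adding filter(n > 0) before group_by() drops it.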
library(tidyr)
library(dplyr)
mydata %>% group_by(year, Age) %>%
complete(n = seq_len(n)) %>%
select(-n) %>%
ungroup()
# A tibble: 14 × 2
year Age
<dbl> <dbl>
1 2014 22
2 2014 23
3 2014 24
4 2014 24
5 2014 24
6 2014 25
7 2014 25
8 2015 23
9 2015 23
10 2015 24
11 2015 24
12 2015 24
13 2015 25
14 2015 22
I've been trying to learn the most basic items first and then build up the complexity. For this one, how would I modify the last line so that it creates a rolling 12-month average for each seriescode? In this case, it would produce an average of 8 for seriescode 100 and 27 for seriescode 110.
First, here is the sample data:
Monthx<- c(201911,201912,20201
,20202,20203,20204,20205,20206,20207
,20208,20209,202010,202011,201911,201912,20201
,20202,20203,20204,20205,20206,20207
,20208,20209,202010,202011)
empx <- c(1,2,3,4,5,6,7,8,9,10,11,12,13,21,22,23,24,25,26,27,28,29,20,31,32,33)
seriescode<-c(100,100,100,100,100,100,100,100,100,100,100,100,100,110,110,110,110,110,110,110,110,110,110,110,110,110)
ces12x <- data.frame(Monthx,empx,seriescode)
Manipulations
library(dplyr)
ces12x<- ces12x %>% mutate(year = substr(as.numeric(Monthx),1,4),
month = substr(as.numeric(Monthx),5,7),
date = as.Date(paste(year,month,"1",sep ="-")))
Month_ord <- order(Monthx)
ces12x<-ces12x %>% mutate(ravg = zoo::rollmeanr(empx, 12, fill = NA))
You would just need to add a group_by(seriescode), which makes the mutate run separately for each seriescode:
Monthx<- c(201911,201912,20201
,20202,20203,20204,20205,20206,20207
,20208,20209,202010,202011,201911,201912,20201
,20202,20203,20204,20205,20206,20207
,20208,20209,202010,202011)
empx <- c(1,2,3,4,5,6,7,8,9,10,11,12,13,21,22,23,24,25,26,27,28,29,20,31,32,33)
seriescode<-c(100,100,100,100,100,100,100,100,100,100,100,100,100,110,110,110,110,110,110,110,110,110,110,110,110,110)
ces12x <- data.frame(Monthx,empx,seriescode)
ces12x<- ces12x %>% mutate(year = substr(as.numeric(Monthx),1,4),
month = substr(as.numeric(Monthx),5,7),
date = as.Date(paste(year,month,"1",sep ="-")))
Month_ord <- order(Monthx)
ces12x<-ces12x %>% group_by(seriescode) %>% mutate(ravg = zoo::rollmeanr(empx, 12, fill = NA)) # add the group_by(seriescode)
This produces the output:
# A tibble: 26 x 7
# Groups: seriescode [2]
Monthx empx seriescode year month date ravg
<dbl> <dbl> <dbl> <chr> <chr> <date> <dbl>
1 201911 1 100 2019 11 2019-11-01 NA
2 201912 2 100 2019 12 2019-12-01 NA
3 20201 3 100 2020 1 2020-01-01 NA
4 20202 4 100 2020 2 2020-02-01 NA
5 20203 5 100 2020 3 2020-03-01 NA
6 20204 6 100 2020 4 2020-04-01 NA
7 20205 7 100 2020 5 2020-05-01 NA
8 20206 8 100 2020 6 2020-06-01 NA
9 20207 9 100 2020 7 2020-07-01 NA
10 20208 10 100 2020 8 2020-08-01 NA
11 20209 11 100 2020 9 2020-09-01 NA
12 202010 12 100 2020 10 2020-10-01 6.5
13 202011 13 100 2020 11 2020-11-01 7.5
14 201911 21 110 2019 11 2019-11-01 NA
15 201912 22 110 2019 12 2019-12-01 NA
16 20201 23 110 2020 1 2020-01-01 NA
17 20202 24 110 2020 2 2020-02-01 NA
18 20203 25 110 2020 3 2020-03-01 NA
19 20204 26 110 2020 4 2020-04-01 NA
20 20205 27 110 2020 5 2020-05-01 NA
21 20206 28 110 2020 6 2020-06-01 NA
22 20207 29 110 2020 7 2020-07-01 NA
23 20208 20 110 2020 8 2020-08-01 NA
24 20209 31 110 2020 9 2020-09-01 NA
25 202010 32 110 2020 10 2020-10-01 25.7
26 202011 33 110 2020 11 2020-11-01 26.7
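Because a rolling mean depends on row order, it can help to sort by date within each series before computing it (the Month_ord variable in the code is computed but never used). The following sketch does this in base R only, swapping zoo::rollmeanr for stats::filter with equal weights. trailing_mean and months13 are names introduced here for illustration, and the sketch assumes every series has at least 12 rows.

```r
months13 <- c(201911, 201912, 20201, 20202, 20203, 20204, 20205,
              20206, 20207, 20208, 20209, 202010, 202011)
ces12x <- data.frame(
  Monthx = rep(months13, 2),
  empx = c(1:13, 21:29, 20, 31:33),
  seriescode = rep(c(100, 110), each = 13)
)

# First four characters are the year, the remainder is the month (1 or 2 digits)
m <- as.character(ces12x$Monthx)
ces12x$date <- as.Date(paste(substr(m, 1, 4), substr(m, 5, 6), "1", sep = "-"),
                       format = "%Y-%m-%d")

# Sort by series and date so that "previous 12 rows" means "previous 12 months"
ces12x <- ces12x[order(ces12x$seriescode, ces12x$date), ]

# Right-aligned mean of the last k values; the first k - 1 positions are NA
trailing_mean <- function(x, k = 12) {
  as.numeric(stats::filter(x, rep(1 / k, k), sides = 1))
}

# ave() applies the helper separately within each seriescode
ces12x$ravg <- ave(ces12x$empx, ces12x$seriescode, FUN = trailing_mean)
```

The last row of each series then holds the average of its latest 12 months.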
If you want to continue using the tidyverse for this, the following should do the trick; it returns a single average of the latest 12 months per seriescode rather than a rolling column:
library(dplyr)
ces12x %>%
group_by(seriescode) %>%
arrange(date) %>%
slice(tail(row_number(), 12)) %>%
summarize(ravg = mean(empx))
I have a data frame like this:
0 weekday day month year hour basal bolus carb period.h
1 Tuesday 01 03 2016 0.0 0.25 NA NA 0
2 Tuesday 01 03 2016 10.9 NA NA 67 10
3 Tuesday 01 03 2016 10.9 NA 4.15 NA 10
4 Tuesday 01 03 2016 12.0 0.30 NA NA 12
5 Tuesday 01 03 2016 17.0 0.50 NA NA 17
6 Tuesday 01 03 2016 17.6 NA NA 33 17
7 Tuesday 01 03 2016 17.6 NA 1.35 NA 17
8 Tuesday 01 03 2016 18.6 NA NA 44 18
9 Tuesday 01 03 2016 18.6 NA 1.80 NA 18
10 Tuesday 01 03 2016 18.9 NA NA 17 18
11 Tuesday 01 03 2016 18.9 NA 0.70 NA 18
12 Tuesday 01 03 2016 22.0 0.40 NA NA 22
13 Wednesday 02 03 2016 0.0 0.25 NA NA 0
14 Wednesday 02 03 2016 9.7 NA NA 39 9
15 Wednesday 02 03 2016 9.7 NA 2.65 NA 9
16 Wednesday 02 03 2016 11.2 NA NA 13 11
17 Wednesday 02 03 2016 11.2 NA 0.30 NA 11
18 Wednesday 02 03 2016 12.0 0.30 NA NA 12
19 Wednesday 02 03 2016 12.0 NA NA 16 12
20 Wednesday 02 03 2016 12.0 NA 0.65 NA 12
If you look at lines 2 and 3, you will notice that they correspond to exactly the same day and time: in line 2 only "carb" is not NA, and in line 3 only "bolus" is not NA. (These are diabetes data.)
I want to merge such lines into a single one:
2 Tuesday 01 03 2016 10.9 NA NA 67 10
3 Tuesday 01 03 2016 10.9 NA 4.15 NA 10
->
2 Tuesday 01 03 2016 10.9 NA 4.15 67 10
I could of course do a brute-force double loop over the lines, but I am looking for a cleverer and faster way.
You can group your data frame by the common identifier columns (here weekday, day, month, year, hour and period.h), then sort each of the remaining columns within a group and take its first element. By default sort() removes NAs from the vector it sorts, so you end up with a non-NA element for each column within each group; if all elements of a column are NA, sort(col)[1] returns NA:
library(dplyr)
df %>%
group_by(weekday, day, month, year, hour, period.h) %>%
summarise_all(funs(sort(.)[1]))
# weekday day month year hour period.h basal bolus carb
# <fctr> <int> <int> <int> <dbl> <int> <dbl> <dbl> <int>
# 1 Tuesday 1 3 2016 0.0 0 0.25 NA NA
# 2 Tuesday 1 3 2016 10.9 10 NA 4.15 67
# 3 Tuesday 1 3 2016 12.0 12 0.30 NA NA
# 4 Tuesday 1 3 2016 17.0 17 0.50 NA NA
# 5 Tuesday 1 3 2016 17.6 17 NA 1.35 33
# 6 Tuesday 1 3 2016 18.6 18 NA 1.80 44
# 7 Tuesday 1 3 2016 18.9 18 NA 0.70 17
# 8 Tuesday 1 3 2016 22.0 22 0.40 NA NA
# 9 Wednesday 2 3 2016 0.0 0 0.25 NA NA
# 10 Wednesday 2 3 2016 9.7 9 NA 2.65 39
# 11 Wednesday 2 3 2016 11.2 11 NA 0.30 13
# 12 Wednesday 2 3 2016 12.0 12 0.30 0.65 16
Instead of sort(), a more appropriate function here may be na.omit(), which keeps the first non-NA value in row order rather than the smallest one:
df %>% group_by(weekday, day, month, year, hour, period.h) %>%
summarise_all(funs(na.omit(.)[1]))
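funs() has been deprecated since dplyr 0.8.0, so on a current installation the same idea is usually written with across(). A minimal sketch, assuming dplyr 1.0 or later and using only the first four rows of the example data; first_non_na and merged are names introduced here for illustration.

```r
library(dplyr)

# First four rows of the example data
df <- data.frame(
  weekday  = c("Tuesday", "Tuesday", "Tuesday", "Tuesday"),
  day      = 1,
  month    = 3,
  year     = 2016,
  hour     = c(0.0, 10.9, 10.9, 12.0),
  basal    = c(0.25, NA, NA, 0.30),
  bolus    = c(NA, NA, 4.15, NA),
  carb     = c(NA, 67, NA, NA),
  period.h = c(0, 10, 10, 12)
)

# First non-NA value in row order; NA if the whole group is NA
first_non_na <- function(x) x[!is.na(x)][1]

merged <- df %>%
  group_by(weekday, day, month, year, hour, period.h) %>%
  # across(everything()) covers only the non-grouping columns here
  summarise(across(everything(), first_non_na), .groups = "drop")
```

Rows 2 and 3 of the input collapse into one row that carries both the bolus and the carb value.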