I am working with some monthly data and I would like to convert it to daily data by creating and populating some dummy rows, as the title suggests.
For example, say I have the following data:
date index
2013-04-30 232
2013-05-31 232
2013-06-30 233
Is there an "easy" way, preferably with the tidyverse, to convert the above into daily data, keeping the index constant throughout each month? For example, I would like to create another 29 rows for April, ranging from 2013-04-01 to 2013-04-29, each with the index of the last day of the month, which is 232 for April. The same should apply to the rest of the months (I have more data than just these three).
Any intuitive suggestions will be greatly appreciated :)
Using complete and fill from tidyr you could do:
dat <- structure(list(
date = structure(c(15825, 15856, 15886), class = "Date"),
index = c(232L, 232L, 233L)
), class = "data.frame", row.names = c(
NA,
-3L
))
library(tidyr)
dat |>
  complete(date = seq(as.Date("2013-04-01"), as.Date("2013-06-30"), "day")) |>
  fill(index, .direction = "up")
#> # A tibble: 91 × 2
#> date index
#> <date> <int>
#> 1 2013-04-01 232
#> 2 2013-04-02 232
#> 3 2013-04-03 232
#> 4 2013-04-04 232
#> 5 2013-04-05 232
#> 6 2013-04-06 232
#> 7 2013-04-07 232
#> 8 2013-04-08 232
#> 9 2013-04-09 232
#> 10 2013-04-10 232
#> # … with 81 more rows
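If you'd rather not hardcode the start and end of the range, you can derive them from the data; a minimal sketch, assuming lubridate is available for floor_date and that the last date in the data is already a month end:

```r
library(tidyr)
library(lubridate)

dat <- data.frame(
  date = as.Date(c("2013-04-30", "2013-05-31", "2013-06-30")),
  index = c(232L, 232L, 233L)
)

daily <- dat |>
  # expand from the first day of the earliest month to the last observed date
  complete(date = seq(floor_date(min(date), "month"), max(date), by = "day")) |>
  fill(index, .direction = "up")
```

This produces the same 91 daily rows without spelling out the boundary dates.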
I have two datasets. The first has two date columns (Start, Stop) for each Id, and it can contain multiple Start-Stop ranges per Id.
Dataset1
Id Code Start Stop
431 279 2017-11-15 2019-08-15
431 578 2019-09-15 2021-01-15
832 590 2008-04-15 2020-05-15
832 519 2020-06-15 2021-04-15
The second dataset has Id and many time-stamped rows per Id, like this:
Id Weight Date
431 12.23 2018
832 15.12 2020
832 6.78 2020
832 4.27 2007
My goal is to create a column InBetween that is "Yes" if a row's Date in the second dataset falls between the two dates in the first dataset for that Id, and "No" otherwise, carrying over the other columns from the first dataset like this.
Id Weight Date Between Code Start Stop
431 12.23 2018 Yes 279 2017-11-15 2019-08-15
832 15.12 2020 Yes 590 2008-04-15 2020-05-15
832 6.78 2020 Yes 590 2008-04-15 2020-05-15
832 4.27 2007 No NA NA NA
I could do this with for loops, but I would prefer a solution using dplyr, inner_join, or other options without for loops. Thanks in advance.
It's pretty messy, but you may try:
df2 %>%
  full_join(df1, by = "Id") %>%
  mutate(Date = as.Date(ISOdate(Date, 1, 1)),
         Start = as.Date(Start),
         Stop = as.Date(Stop)) %>%
  rowwise() %>%
  mutate(Between = between(Date, Start, Stop)) %>%
  group_by(Id, Date) %>%
  mutate(check = any(Between)) %>%
  filter(!(Between == FALSE & check == TRUE)) %>%
  # if_else keeps the Date class; ifelse would coerce the dates to numeric
  mutate(Start = if_else(check, Start, as.Date(NA)),
         Stop = if_else(check, Stop, as.Date(NA)),
         Code = ifelse(check, Code, NA)) %>%
  distinct() %>%
  select(-check)
     Id Weight Date        Code Start      Stop       Between
  <int>  <dbl> <date>     <int> <date>     <date>     <lgl>
1   431  12.2  2018-01-01   279 2017-11-15 2019-08-15 TRUE
2   832  15.1  2020-01-01   590 2008-04-15 2020-05-15 TRUE
3   832   6.78 2020-01-01   590 2008-04-15 2020-05-15 TRUE
4   832   4.27 2007-01-01    NA NA         NA         FALSE
Here's a somewhat shorter version that also achieves what you are after.
library(tidyverse)
library(lubridate)
#>
#> Attaching package: 'lubridate'
#> The following objects are masked from 'package:base':
#>
#> date, intersect, setdiff, union
df1 <- tibble::tribble(
~Id, ~Code, ~Start, ~Stop,
431L, 279L, "2017-11-15", "2019-08-15",
431L, 578L, "2019-09-15", "2021-01-15",
832L, 590L, "2008-04-15", "2020-05-15",
832L, 519L, "2020-06-15", "2021-04-15"
)
df2 <- tibble::tribble(
~Id, ~Weight, ~Date,
431L, 12.23, 2018L,
832L, 15.12, 2020L,
832L, 6.78, 2020L,
832L, 4.27, 2007L
)
df1 <- df1 %>%
mutate(Start = ymd(Start),
Stop = ymd(Stop))
df2 <- df2 %>%
mutate(Date = ymd(Date, truncated = 2L))
full_join(df1, df2) %>%
  mutate(Between = Date %within% interval(Start, Stop))
#> Joining, by = "Id"
#> # A tibble: 8 × 7
#> Id Code Start Stop Weight Date Between
#> <int> <int> <date> <date> <dbl> <date> <lgl>
#> 1 431 279 2017-11-15 2019-08-15 12.2 2018-01-01 TRUE
#> 2 431 578 2019-09-15 2021-01-15 12.2 2018-01-01 FALSE
#> 3 832 590 2008-04-15 2020-05-15 15.1 2020-01-01 TRUE
#> 4 832 590 2008-04-15 2020-05-15 6.78 2020-01-01 TRUE
#> 5 832 590 2008-04-15 2020-05-15 4.27 2007-01-01 FALSE
#> 6 832 519 2020-06-15 2021-04-15 15.1 2020-01-01 FALSE
#> 7 832 519 2020-06-15 2021-04-15 6.78 2020-01-01 FALSE
#> 8 832 519 2020-06-15 2021-04-15 4.27 2007-01-01 FALSE
Created on 2021-10-11 by the reprex package (v2.0.1)
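With dplyr 1.1.0 or later this can also be written as a non-equi join, which keeps only the matching intervals instead of building every Id combination and filtering afterwards; a sketch using join_by(), with df1 and df2 prepared (dates parsed) as above:

```r
library(dplyr)  # >= 1.1.0 for join_by()

df2 %>%
  left_join(df1, by = join_by(Id, between(Date, Start, Stop))) %>%
  mutate(Between = ifelse(is.na(Code), "No", "Yes"))
```

Rows of df2 that fall in no interval keep NA for Code, Start, and Stop, matching the desired output.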
I have a dataframe like so:
Month CumulativeSum
2019-02-01 40
2019-03-01 70
2019-04-01 80
2019-07-01 100
2019-08-01 120
The problem is that nothing happened in May and June, hence there is no data. Plotting this in barcharts results in some empty space on the x-axis.
Is there some way to "fill" the missing spot like so, using the last known value?:
Month CumulativeSum
2019-02-01 40
2019-03-01 70
2019-04-01 80
**2019-05-01 80** <--
**2019-06-01 80** <--
2019-07-01 100
2019-08-01 120
We can use complete
library(dplyr)
library(tidyr)
df1 %>%
  complete(Month = seq(min(Month), max(Month), by = '1 month')) %>%
  fill(CumulativeSum)
-output
# A tibble: 7 x 2
# Month CumulativeSum
# <date> <int>
#1 2019-02-01 40
#2 2019-03-01 70
#3 2019-04-01 80
#4 2019-05-01 80
#5 2019-06-01 80
#6 2019-07-01 100
#7 2019-08-01 120
data
df1 <- structure(list(Month = structure(c(17928, 17956, 17987, 18078,
18109), class = "Date"), CumulativeSum = c(40L, 70L, 80L, 100L,
120L)), row.names = c(NA, -5L), class = "data.frame")
Here is a base R option using cummax, which works here because a cumulative sum is non-decreasing:
transform(
  data.frame(
    Month = seq(min(df1$Month), max(df1$Month), by = "1 month"),
    CumulativeSum = -Inf
  ),
  CumulativeSum = cummax(replace(CumulativeSum, Month %in% df1$Month, df1$CumulativeSum))
)
which gives
Month CumulativeSum
1 2019-02-01 40
2 2019-03-01 70
3 2019-04-01 80
4 2019-05-01 80
5 2019-06-01 80
6 2019-07-01 100
7 2019-08-01 120
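If an extra dependency is acceptable, zoo::na.locf ("last observation carried forward") fills the gaps without relying on the values being non-decreasing; a sketch using the df1 defined above:

```r
library(zoo)

all_months <- data.frame(Month = seq(min(df1$Month), max(df1$Month), by = "1 month"))
out <- merge(all_months, df1, all.x = TRUE)  # missing months become NA
out$CumulativeSum <- na.locf(out$CumulativeSum)
out
```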
I am trying to calculate the unemployment rate based on the data below and add it as new rows to the data table. I want to divide unemployed by labourforce for each date and append each result as a row.
Essentially, I am trying to go from this
date        series_1     value
2021-01-01  labourforce  13793
2021-02-01  labourforce  13812
2021-03-01  labourforce  13856
2021-01-01  unemployed     875
2021-02-01  unemployed     805
2021-03-01  unemployed     778
to this
date        series_1          value
2021-01-01  labourforce       13793
2021-02-01  labourforce       13812
2021-03-01  labourforce       13856
2021-01-01  unemployed          875
2021-02-01  unemployed          805
2021-03-01  unemployed          778
2021-01-01  unemploymentrate    6.3
2021-02-01  unemploymentrate    5.8
2021-03-01  unemploymentrate    5.6
Here is my code so far; I know the last line is wrong. Any suggestions or ideas are welcome!
longdata %>%
group_by(date) %>%
summarise(series_1 = 'unemploymentrate',
value = series_1$unemployed/series_1$labourforce))
For each day, you can compute the ratio of 'unemployed' to 'labourforce' and append it as new rows to your original dataset.
library(dplyr)
df %>%
  group_by(date) %>%
  summarise(value = value[series_1 == 'unemployed'] / value[series_1 == 'labourforce'] * 100,
            series_1 = 'unemploymentrate') %>%
  bind_rows(df) %>%
  arrange(series_1)
# date value series_1
# <chr> <dbl> <chr>
#1 2021-01-01 13793 labourforce
#2 2021-02-01 13812 labourforce
#3 2021-03-01 13856 labourforce
#4 2021-01-01 875 unemployed
#5 2021-02-01 805 unemployed
#6 2021-03-01 778 unemployed
#7 2021-01-01 6.34 unemploymentrate
#8 2021-02-01 5.83 unemploymentrate
#9 2021-03-01 5.61 unemploymentrate
Try:
library(dplyr)
library(tidyr)
df %>%
  pivot_wider(names_from = series_1, values_from = value) %>%
  mutate(unemploymentrate = round(unemployed * 100 / labourforce, 2)) %>%
  pivot_longer(-1, names_to = "series_1", values_to = "value") %>%
  mutate(series_1 = factor(series_1, levels = c("labourforce", "unemployed", "unemploymentrate"))) %>%
  arrange(series_1, date)
#> # A tibble: 9 x 3
#> date series_1 value
#> <chr> <fct> <dbl>
#> 1 2021-01-01 labourforce 13793
#> 2 2021-02-01 labourforce 13812
#> 3 2021-03-01 labourforce 13856
#> 4 2021-01-01 unemployed 875
#> 5 2021-02-01 unemployed 805
#> 6 2021-03-01 unemployed 778
#> 7 2021-01-01 unemploymentrate 6.34
#> 8 2021-02-01 unemploymentrate 5.83
#> 9 2021-03-01 unemploymentrate 5.61
Created on 2021-04-23 by the reprex package (v2.0.0)
data
df <- structure(list(date = c("2021-01-01", "2021-02-01", "2021-03-01",
"2021-01-01", "2021-02-01", "2021-03-01"), series_1 = c("labourforce",
"labourforce", "labourforce", "unemployed", "unemployed", "unemployed"
), value = c(13793L, 13812L, 13856L, 875L, 805L, 778L)), class = "data.frame", row.names = c(NA,
-6L))
Day Time Numbers
6388 2017-02-01 10:43 R33
7129 2017-02-04 15:32 N39.0, N39.0, N39.0
9689 2017-02-17 08:54 S72.11, S72.11, S72.11, S72.11
6703 2017-02-02 18:55 R11
9026 2017-02-13 17:34 S06.0, S06.0, S06.0
5013 2017-01-25 00:33 J18.1, J18.1, J18.1, J18.1
5849 2017-01-29 17:57 I21.4, I21.4, I21.4
9245 2017-02-14 19:03 J18.0, J18.0, J18.0
1978 2017-01-09 21:23 K59.0
5021 2017-01-25 02:46 I47.1, I47.1, I47.1
9258 2017-02-14 20:19 S42.3
541 2017-01-03 11:44 I63.8, I63.8, I63.8
4207 2017-01-20 19:52 E83.58, E83.58, E83.58
8650 2017-02-11 18:39 R55, R55, S06.0, S06.0, R55
9442 2017-02-15 21:30 K86.1
4186 2017-01-20 18:27 S05.1
4231 2017-01-20 22:10 M17.9
6847 2017-02-03 11:45 L02.4
1739 2017-01-08 21:19 S20.2
3685 2017-01-18 09:56 G40.9
9497 2017-02-16 09:52 S83.6
2563 2017-01-12 20:47 M13.16, M25.56, M25.56
9731 2017-02-17 13:10 B99, B99, N39.0, N39.0
7759 2017-02-07 14:25 R51, G43.0, G43.0
368 2017-01-02 15:05 T83.0, T83.0, T83.0, N13.3, N13.6
I want to aggregate this df in a special way: for each day, I want to count how many Numbers start with a given letter, e.g. "A". I want a new dataframe that looks like this:
Day GroupA GroupB GroupC .....
1 2017-01-01 2 2 0
2 2017-01-02 ..................
GroupA means Numbers starting with A. If there are multiple numbers starting with A in one single row, it should be counted as one. The class of my number column is character.
> class(df[1,3])
[1] "character"
> df[1,3]
[1] "A41.8, A41.51, A41.51"
My problem is how to combine the aggregate command with the counts. My real df is a lot bigger (it spans more than two years), so I need an automated solution.
EDIT: See data down below
structure(list(Day= c("2017-01-07", "2017-01-23", "2017-01-08",
"2017-01-13", "2017-02-10", "2017-01-07", "2017-01-24", "2017-01-02",
"2017-01-03", "2017-01-06", "2017-01-11", "2017-01-21", "2017-01-13",
"2017-01-10", "2017-02-18", "2017-01-10", "2017-01-31", "2017-01-27",
"2017-01-23", "2017-01-13", "2017-02-10", "2017-01-09", "2017-01-23",
"2017-01-09", "2017-01-08"), Time= c("02:02", "14:51", "02:12",
"17:49", "00:00", "21:30", "22:28", "17:27", "12:14", "22:52",
"14:19", "11:40", "19:33", "04:01", "15:59", "14:57", "08:34",
"13:21", "02:01", "14:29", "20:17", "14:30", "02:34", "04:56",
"14:34"), Number= c("H10.9", "K85.80, K85.20, K85.80, K85.20",
"R09.1", "I10.90", "I48.9, I48.0, I48.9, I48.0", "A09.0, A09.0, R42, R42",
"H16.1", "K92.1, K92.1, K92.1", "K40.90, J12.2, J18.0, J96.01, J12.2",
"B99, J15.8, J18.0, J15.8", "S01.55", "M21.33", "I10.01, I10.01, J44.81, J44.81",
"S00.95", "B08.2", "S05.1", "M20.1", "G40.2, S93.40, S93.40",
"M25.51", "J44.19, J44.11, J44.19, J44.11", "G40.9, G40.2, G40.2",
"E87.1, E87.1, J18.0, J18.0", "I10.91", "R22.0", "S06.5, S06.5, S06.5, R06.88, S12.22"
)), .Names = c("Day", "Time", "Number"), row.names = c(1336L,
4687L, 1536L, 2737L, 8272L, 1507L, 4994L, 400L, 550L, 1305L,
2325L, 4292L, 2748L, 2008L, 9974L, 2113L, 6144L, 5433L, 4577L,
2697L, 8468L, 1883L, 4578L, 1783L, 1657L), class = "data.frame")
This is a pretty interesting problem that takes a little digging into. The first thing to do is get all the unique capital letters in each set in Number per row. stringr::str_extract_all gets you a list-column of string vectors that match this regex, and after taking unique values from each list entry, you have this:
library(dplyr)
library(tidyr)
as_tibble(df1) %>%
  mutate(Day = lubridate::ymd(Day),
         letters = purrr::map(stringr::str_extract_all(Number, "[A-Z]"), unique)) %>%
  select(-Number) %>%
  head()
#> # A tibble: 6 x 3
#> Day Time letters
#> <date> <chr> <list>
#> 1 2017-01-07 02:02 <chr [1]>
#> 2 2017-01-23 14:51 <chr [1]>
#> 3 2017-01-08 02:12 <chr [1]>
#> 4 2017-01-13 17:49 <chr [1]>
#> 5 2017-02-10 00:00 <chr [1]>
#> 6 2017-01-07 21:30 <chr [2]>
Unnest it so you have one row per date and time per letter, then count the observations of each letter per day (the order of operations matters here). Then reshape into a wide format so each group gets its own column.
as_tibble(df1) %>%
  mutate(Day = lubridate::ymd(Day),
         letters = purrr::map(stringr::str_extract_all(Number, "[A-Z]"), unique)) %>%
  select(-Number) %>%
  unnest(letters) %>%
  count(Day, letters) %>%
  arrange(letters) %>%
  pivot_wider(names_from = letters, names_prefix = "group",
              values_from = n, values_fill = list(n = 0)) %>%
  head()
#> # A tibble: 6 x 12
#> Day groupA groupB groupE groupG groupH groupI groupJ groupK groupM
#> <date> <int> <int> <int> <int> <int> <int> <int> <int> <int>
#> 1 2017-01-07 1 0 0 0 1 0 0 0 0
#> 2 2017-01-06 0 1 0 0 0 0 1 0 0
#> 3 2017-02-18 0 1 0 0 0 0 0 0 0
#> 4 2017-01-09 0 0 1 0 0 0 1 0 0
#> 5 2017-01-27 0 0 0 1 0 0 0 0 0
#> 6 2017-02-10 0 0 0 1 0 1 0 0 0
#> # … with 2 more variables: groupR <int>, groupS <int>
In the first few rows of this data sample there aren't any 2s, but there are some further down in the data frame. (I don't yet understand how pivot_wider orders rows, but you can arrange by Day afterwards if you want.)
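Since the question mentions aggregate, the same per-day, once-per-row counts can also be built in base R with strsplit and table; a rough sketch, assuming every code starts with its group letter and df1 is the data frame from the question:

```r
# split each row's codes, keep each row's unique first letters,
# then cross-tabulate Day against letter (at most one count per row)
codes <- strsplit(df1$Number, ",\\s*")
first_letters <- lapply(codes, function(x) unique(substr(x, 1, 1)))
long <- data.frame(Day = rep(df1$Day, lengths(first_letters)),
                   letter = unlist(first_letters))
table(long$Day, long$letter)
```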