I have a data set with 20 variables, and I want to find the growth rate of applicants per year. The data covers 2020-2022. How would I go about that? I tried subsetting the data, but I'm stuck on how to approach it. Essentially, I want to assign each applicant to its corresponding year and then calculate the growth rate.
Observations ID# Date
1 1226 2022-10-16
2 1225 2021-10-15
3 1224 2020-08-14
4 1223 2021-12-02
5 1222 2022-02-25
One option is to use lubridate::year() to extract the year from your year-month-day variable, and then count the rows in each year with dplyr::summarize().
library(tidyverse)
library(lubridate)
set.seed(123)
id <- 1:100
# Draw 100 random dates between 2017-01-01 and 2023-01-01
date <- as.Date(sample( as.numeric(as.Date('2017-01-01') ): as.numeric(as.Date('2023-01-01') ), 100,
replace = TRUE),
origin = '1970-01-01')
df <- data.frame(id, date) %>%
mutate(year = year(date))
head(df)
#> id date year
#> 1 1 2018-06-10 2018
#> 2 2 2017-07-14 2017
#> 3 3 2022-01-16 2022
#> 4 4 2020-02-16 2020
#> 5 5 2020-06-06 2020
#> 6 6 2020-06-21 2020
df <- df %>%
group_by(year) %>%
summarize(n = n())
head(df)
#> # A tibble: 6 × 2
#> year n
#> <dbl> <int>
#> 1 2017 17
#> 2 2018 14
#> 3 2019 17
#> 4 2020 18
#> 5 2021 11
#> 6 2022 23
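The summary above gives the yearly counts but stops short of the growth rate itself. Here is a minimal sketch, assuming "growth rate" means the year-over-year relative change, (n_t - n_{t-1}) / n_{t-1}. The counts are copied from the output above, and the names counts and growth_rate are only illustrative.

```r
library(dplyr)

# Yearly counts copied from the summarized output above.
counts <- tibble(year = 2017:2022, n = c(17L, 14L, 17L, 18L, 11L, 23L))

growth <- counts %>%
  arrange(year) %>%
  # lag(n) is the previous year's count; the first year has no predecessor,
  # so its growth rate is NA.
  mutate(growth_rate = (n - lag(n)) / lag(n))

growth
```

Multiply growth_rate by 100 if you prefer percentages.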
I am trying to convert the data below to a daily basis, based on the range given in the start_date and end_date columns.
to this output (sum):
Please use dput() when posting data frames next time!
Example data
# A tibble: 4 × 4
id start end inventory
<int> <chr> <chr> <dbl>
1 1 01/05/2022 02/05/2022 100
2 2 10/05/2022 15/05/2022 50
3 3 11/05/2022 21/05/2022 80
4 4 14/05/2022 17/05/2022 10
Transform the data
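library(dplyr)
library(tidyr)
# Assign the result back to df, because the next step reads df$date
df <- df %>%
mutate(across(2:3, ~ as.Date(.x,
format = "%d/%m/%Y"))) %>%
pivot_longer(cols = c(start, end), values_to = "date") %>%
arrange(date) %>%
select(date, inventory)
df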
# A tibble: 8 × 2
date inventory
<date> <dbl>
1 2022-05-01 100
2 2022-05-02 100
3 2022-05-10 50
4 2022-05-11 80
5 2022-05-14 10
6 2022-05-15 50
7 2022-05-17 10
8 2022-05-21 80
Expand the dates and left_join
left_join(tibble(date = seq(first(df$date),
last(df$date),
by = "day")), df, by = "date")
# A tibble: 21 × 2
date inventory
<date> <dbl>
1 2022-05-01 100
2 2022-05-02 100
3 2022-05-03 NA
4 2022-05-04 NA
5 2022-05-05 NA
6 2022-05-06 NA
7 2022-05-07 NA
8 2022-05-08 NA
9 2022-05-09 NA
10 2022-05-10 50
# … with 11 more rows
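The join above leaves NA on the days between a start and an end date. If the goal is the total inventory held on each calendar day (an assumption, since the desired output is not shown in the post), one possible sketch is to expand every start-end range into one row per day and then sum by date. The object names raw and daily are hypothetical.

```r
library(dplyr)
library(tidyr)

# Example data as shown above.
raw <- tibble(
  id = 1:4,
  start = c("01/05/2022", "10/05/2022", "11/05/2022", "14/05/2022"),
  end = c("02/05/2022", "15/05/2022", "21/05/2022", "17/05/2022"),
  inventory = c(100, 50, 80, 10)
)

daily <- raw %>%
  mutate(across(c(start, end), ~ as.Date(.x, format = "%d/%m/%Y"))) %>%
  rowwise() %>%
  # One list element per row holding every day in that row's range.
  mutate(date = list(seq(start, end, by = "day"))) %>%
  ungroup() %>%
  unnest(date) %>%
  # Overlapping ranges contribute to the same day, so add them up.
  group_by(date) %>%
  summarize(inventory = sum(inventory), .groups = "drop")

daily
```

Days that fall in no range (here 3 to 9 May) do not appear at all; join against a full date sequence afterwards if you need them as zeros.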
I want to calculate the date on which daylight saving time begins in each year from 2003 through 2021, and keep only the days within 60 days before and after that date in each year.
The date changes each year (it always falls on a Sunday), and it moved from April in 2003-2006 to March in 2007-2021.
I need to create a running variable “days” that measures the distance from the daylight saving time begin date in each year, with days=0 on the first day of daylight saving time.
Here's the dataset
year month day propertycrimes violentcrimes
2003 1 1 94 34
2004 1 1 60 46
2005 1 1 106 41
2006 1 1 87 40
2007 1 1 72 36
2008 1 1 43 50
2009 1 1 35 32
2010 1 1 32 50
2011 1 1 29 45
2012 1 1 32 45
Here's my code so far
library(readr)
dailycrimedataRD <- read_csv("dailycrimedataRD.csv")
View(dailycrimedataRD)
days <- .POSIXct(month, tz="GMT")
How about this:
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
library(readr)
dailycrimedataRD <- read_csv("~/Downloads/dailycrimedataRD.csv")
#> Rows: 6940 Columns: 5
#> ── Column specification ────────────────────────────────────────────────────────
#> Delimiter: ","
#> dbl (5): year, month, day, propertycrimes, violentcrimes
#>
#> ℹ Use `spec()` to retrieve the full column specification for this data.
#> ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
tmp <- dailycrimedataRD %>%
mutate(date = lubridate::ymd(paste(year, month, day, sep="-"), tz='Canada/Eastern'),
dst = lubridate::dst(date)) %>%
arrange(date) %>%
group_by(year) %>%
mutate(dst_date = date[which(dst == TRUE & lag(dst) == FALSE)],
diff = (as.Date(dst_date) - as.Date(date))) %>%
filter(diff <= 60 & diff >= 0)
tmp
#> # A tibble: 1,159 × 9
#> # Groups: year [19]
#> year month day propertycrimes violentcrimes date dst
#> <dbl> <dbl> <dbl> <dbl> <dbl> <dttm> <lgl>
#> 1 2003 2 6 68 8 2003-02-06 00:00:00 FALSE
#> 2 2003 2 7 71 8 2003-02-07 00:00:00 FALSE
#> 3 2003 2 8 81 12 2003-02-08 00:00:00 FALSE
#> 4 2003 2 9 68 7 2003-02-09 00:00:00 FALSE
#> 5 2003 2 10 68 9 2003-02-10 00:00:00 FALSE
#> 6 2003 2 11 61 8 2003-02-11 00:00:00 FALSE
#> 7 2003 2 12 73 10 2003-02-12 00:00:00 FALSE
#> 8 2003 2 13 62 14 2003-02-13 00:00:00 FALSE
#> 9 2003 2 14 71 10 2003-02-14 00:00:00 FALSE
#> 10 2003 2 15 90 11 2003-02-15 00:00:00 FALSE
#> # … with 1,149 more rows, and 2 more variables: dst_date <dttm>, diff <drtn>
Created on 2022-04-14 by the reprex package (v2.0.1)
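Two things are worth checking in the code above. The filter diff >= 0 keeps only the 60 days leading up to the switch, not the 60 days after it. Also, each date is parsed as midnight while the clocks change at 2:00 a.m. local time, so dst() appears to become TRUE only on the day after the switch. Below is a hedged sketch of one way to get a signed days variable covering both sides. It uses a hypothetical stand-in table (crime) with one row per calendar day instead of the real CSV, and it evaluates DST at noon so the switch day itself counts as DST.

```r
library(dplyr)
library(lubridate)

# Hypothetical stand-in for dailycrimedataRD: one row per calendar day.
crime <- tibble(
  date = seq(as.Date("2003-01-01"), as.Date("2021-12-31"), by = "day")
) %>%
  mutate(year = year(date))

windowed <- crime %>%
  # Check DST at noon local time; the time zone is the one used in the answer.
  mutate(is_dst = dst(as.POSIXct(paste(date, "12:00:00"),
                                 tz = "Canada/Eastern"))) %>%
  group_by(year) %>%
  # The first DST day of the year gets days = 0; earlier days are negative.
  mutate(dst_start = min(date[is_dst]),
         days = as.integer(date - dst_start)) %>%
  ungroup() %>%
  filter(abs(days) <= 60)

windowed
```

With the real data you would build date from the year, month and day columns first, for example with lubridate::make_date(year, month, day).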
I have a dataframe containing a lot of tweets. Each tweet has a unique timestamp. Now, what I would like to calculate is how many tweets have been published in each week, based on the timestamp. Any ideas? I tried to do it with tidyverse and dplyr, sadly it didn't work.
Kind regards,
Daniel
library(dplyr)
set.seed(42)
tweets <- tibble(timestamp = sort(Sys.time() - runif(1000, 0, 365*86400)), tweet = paste("tweet", 1:1000))
tweets
# # A tibble: 1,000 x 2
# timestamp tweet
# <dttm> <chr>
# 1 2021-01-27 09:39:47 tweet 1
# 2 2021-01-28 02:38:29 tweet 2
# 3 2021-01-28 07:33:02 tweet 3
# 4 2021-01-29 08:42:47 tweet 4
# 5 2021-01-29 09:21:58 tweet 5
# 6 2021-01-29 16:01:09 tweet 6
# 7 2021-01-30 05:04:18 tweet 7
# 8 2021-01-30 21:45:05 tweet 8
# 9 2021-01-31 18:32:24 tweet 9
# 10 2021-02-02 02:57:51 tweet 10
# # ... with 990 more rows
tweets %>%
group_by(yearweek = format(timestamp, format = "%Y-%U")) %>%
summarize(date = min(as.Date(timestamp)), ntweets = n(), .groups = "drop")
# # A tibble: 54 x 3
# yearweek date ntweets
# <chr> <date> <int>
# 1 2021-04 2021-01-27 8
# 2 2021-05 2021-01-31 15
# 3 2021-06 2021-02-07 19
# 4 2021-07 2021-02-14 24
# 5 2021-08 2021-02-21 28
# 6 2021-09 2021-02-28 22
# 7 2021-10 2021-03-07 16
# 8 2021-11 2021-03-15 13
# 9 2021-12 2021-03-21 15
# 10 2021-13 2021-03-28 19
# # ... with 44 more rows
See ?strptime for definitions of the various "week of the year" options ("%U", "%V", "%W").
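To illustrate how the three options differ, here is a small sketch on a single date near a year boundary. The choice of date is arbitrary, and "%V" is an output-only format that may not be supported on every platform.

```r
d <- as.Date("2021-01-03")  # a Sunday

weeks <- c(
  U = format(d, "%U"),  # weeks start on Sunday; days before the first Sunday are week 00
  W = format(d, "%W"),  # weeks start on Monday; days before the first Monday are week 00
  V = format(d, "%V")   # ISO 8601 week number
)
weeks
```

Because the ISO week of an early-January date can belong to the previous year, pair "%V" with "%G" (the ISO week-based year) rather than "%Y" when building a year-week key.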
I have a dataset of a stock exchange's daily closing prices and their dates, covering several years. I have also created a counter that records which trading day of the month each day is (the dataset excludes weekends and holidays). It looks like this:
df$date <- as.Date(c("2017-03-25","2017-03-26","2017-03-27","2017-03-29","2017-03-30",
"2017-03-31","2017-04-03","2017-04-04","2017-04-05","2017-04-06",
"2017-04-07","2017-04-08","2017-04-09"))
df$DayofMonth <- c(18,19,20,21,22,23,1,2,3,4,5,6,7)
df$price <- c(100, 100.53, 101.3 ,100.94, 101.42, 101.40, 101.85, 102, 101.9, 102, 102.31, 102.1, 102.23)
I would now like to create a dummy variable that takes the value 1 for the last 3 trading days of each month and the first 5 trading days of the following month. In this case it would look something like this:
df$ToM_dummy <- c(0,0,0,1,1,1,1,1,1,1,1,0,0)
Thanks for helping out!
Here's a dplyr solution. It's probably a little more complex than it needs to be for your real data because your sample stops on the 7th day of a month, and the algorithm needs to know that 7 isn't really the end of the month - the data is just incomplete for that month.
I have therefore arbitrarily added a cutoff of 18 days: if a month has fewer trading days than that, we assume its data is incomplete. You may wish to change this if needed (I have no idea whether there are always more than 18 trading days in December or February, for example).
library(dplyr)
df %>%
mutate(month = lubridate::month(date)) %>%
group_by(month) %>%
mutate(ToM_dummy = +(DayofMonth < 6 |
(DayofMonth > (max(DayofMonth) - 3) &
max(DayofMonth) > 18))) # Change to appropriate number
#> # A tibble: 13 x 5
#> # Groups: month [2]
#> date DayofMonth price month ToM_dummy
#> <date> <dbl> <dbl> <dbl> <int>
#> 1 2017-03-25 18 100 3 0
#> 2 2017-03-26 19 101. 3 0
#> 3 2017-03-27 20 101. 3 0
#> 4 2017-03-29 21 101. 3 1
#> 5 2017-03-30 22 101. 3 1
#> 6 2017-03-31 23 101. 3 1
#> 7 2017-04-03 1 102. 4 1
#> 8 2017-04-04 2 102 4 1
#> 9 2017-04-05 3 102. 4 1
#> 10 2017-04-06 4 102 4 1
#> 11 2017-04-07 5 102. 4 1
#> 12 2017-04-08 6 102. 4 0
#> 13 2017-04-09 7 102. 4 0
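Since the real data spans several years, grouping by month alone would pool, say, every March together. Here is a hedged variant that groups by year and month instead. The names prices, min_days and flagged are illustrative, and the cutoff of 18 is the same arbitrary assumption as above.

```r
library(dplyr)
library(lubridate)

# Sample data from the question (price omitted, as it is not needed here).
prices <- data.frame(
  date = as.Date(c("2017-03-25", "2017-03-26", "2017-03-27", "2017-03-29",
                   "2017-03-30", "2017-03-31", "2017-04-03", "2017-04-04",
                   "2017-04-05", "2017-04-06", "2017-04-07", "2017-04-08",
                   "2017-04-09")),
  DayofMonth = c(18, 19, 20, 21, 22, 23, 1, 2, 3, 4, 5, 6, 7)
)

min_days <- 18  # assumed cutoff: months with fewer trading days count as incomplete

flagged <- prices %>%
  # Group by year as well, so the same month in different years stays separate.
  group_by(year = year(date), month = month(date)) %>%
  mutate(ToM_dummy = as.integer(
    DayofMonth <= 5 |
      (DayofMonth > max(DayofMonth) - 3 & max(DayofMonth) > min_days)
  )) %>%
  ungroup()

flagged
```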
Data
df <- structure(list(date = structure(c(17250, 17251, 17252, 17254,
17255, 17256, 17259, 17260, 17261, 17262, 17263, 17264, 17265
), class = "Date"), DayofMonth = c(18, 19, 20, 21, 22, 23, 1,
2, 3, 4, 5, 6, 7), price = c(100, 100.53, 101.3, 100.94, 101.42,
101.4, 101.85, 102, 101.9, 102, 102.31, 102.1, 102.23)), row.names = c(NA,
-13L), class = "data.frame")
df
#> date DayofMonth price
#> 1 2017-03-25 18 100.00
#> 2 2017-03-26 19 100.53
#> 3 2017-03-27 20 101.30
#> 4 2017-03-29 21 100.94
#> 5 2017-03-30 22 101.42
#> 6 2017-03-31 23 101.40
#> 7 2017-04-03 1 101.85
#> 8 2017-04-04 2 102.00
#> 9 2017-04-05 3 101.90
#> 10 2017-04-06 4 102.00
#> 11 2017-04-07 5 102.31
#> 12 2017-04-08 6 102.10
#> 13 2017-04-09 7 102.23