R: counting timestamps per week - r

I have a dataframe containing a lot of tweets. Each tweet has a unique timestamp. Now, what I would like to calculate is how many tweets have been published in each week, based on the timestamp. Any ideas? I tried to do it with tidyverse and dplyr, sadly it didn't work.
Kind regards,
Daniel

library(dplyr)
set.seed(42)
tweets <- tibble(timestamp = sort(Sys.time() - runif(1000, 0, 365*86400)), tweet = paste("tweet", 1:1000))
tweets
# # A tibble: 1,000 x 2
# timestamp tweet
# <dttm> <chr>
# 1 2021-01-27 09:39:47 tweet 1
# 2 2021-01-28 02:38:29 tweet 2
# 3 2021-01-28 07:33:02 tweet 3
# 4 2021-01-29 08:42:47 tweet 4
# 5 2021-01-29 09:21:58 tweet 5
# 6 2021-01-29 16:01:09 tweet 6
# 7 2021-01-30 05:04:18 tweet 7
# 8 2021-01-30 21:45:05 tweet 8
# 9 2021-01-31 18:32:24 tweet 9
# 10 2021-02-02 02:57:51 tweet 10
# # ... with 990 more rows
tweets %>%
group_by(yearweek = format(timestamp, format = "%Y-%U")) %>%
summarize(date = min(as.Date(timestamp)), ntweets = n(), .groups = "drop")
# # A tibble: 54 x 3
# yearweek date ntweets
# <chr> <date> <int>
# 1 2021-04 2021-01-27 8
# 2 2021-05 2021-01-31 15
# 3 2021-06 2021-02-07 19
# 4 2021-07 2021-02-14 24
# 5 2021-08 2021-02-21 28
# 6 2021-09 2021-02-28 22
# 7 2021-10 2021-03-07 16
# 8 2021-11 2021-03-15 13
# 9 2021-12 2021-03-21 15
# 10 2021-13 2021-03-28 19
# # ... with 44 more rows
See ?strptime for definitions of the various "week of the year" options ("%U", "%V", "%W").

Related

How to control the fill_gaps interval in tsibble?

I have two data frames that fill missing in different intervals.
I would like to fill the two to the same interval.
Consider two data frames with the same month-day but two years apart:
library(tidyverse)
library(fpp3)
df_2020 <- tibble(month_day = as_date(c('2020-1-1','2020-2-1','2020-3-1')),
amount = c(5, 2, 1))
df_2022 <- tibble(month_day = as_date(c('2022-1-1','2022-2-1','2022-3-1')),
amount = c(5, 2, 1))
These data frames both have three rows, with the same dates, 2 years apart.
Create tsibbles with a yearweek index:
ts_2020 <- df_2020 |> mutate(year_week = yearweek(month_day)) |>
as_tsibble(index = year_week)
ts_2022 <- df_2022 |> mutate(year_week = yearweek(month_day)) |>
as_tsibble(index = year_week)
ts_2020
#> # A tsibble: 3 x 3 [4W]
#> month_day amount year_week
#> <date> <dbl> <week>
#> 1 2020-01-01 5 2020 W01
#> 2 2020-02-01 2 2020 W05
#> 3 2020-03-01 1 2020 W09
ts_2022
#> # A tsibble: 3 x 3 [1W]
#> month_day amount year_week
#> <date> <dbl> <week>
#> 1 2022-01-01 5 2021 W52
#> 2 2022-02-01 2 2022 W05
#> 3 2022-03-01 1 2022 W09
Still three rows in each tsibble
Now fill gaps:
ts_2020_filled <- ts_2020 |> fill_gaps()
ts_2022_filled <- ts_2022 |> fill_gaps()
ts_2020_filled
#> # A tsibble: 3 x 3 [4W]
#> month_day amount year_week
#> <date> <dbl> <week>
#> 1 2020-01-01 5 2020 W01
#> 2 2020-02-01 2 2020 W05
#> 3 2020-03-01 1 2020 W09
ts_2022_filled
#> # A tsibble: 10 x 3 [1W]
#> month_day amount year_week
#> <date> <dbl> <week>
#> 1 2022-01-01 5 2021 W52
#> 2 NA NA 2022 W01
#> 3 NA NA 2022 W02
#> 4 NA NA 2022 W03
#> 5 NA NA 2022 W04
#> 6 2022-02-01 2 2022 W05
#> 7 NA NA 2022 W06
#> 8 NA NA 2022 W07
#> 9 NA NA 2022 W08
#> 10 2022-03-01 1 2022 W09
Here is the issue:
ts_2020_filled has 4-weekly steps, and ts_2022_filled has 1-weekly steps.
This is because the two tsibbles have different intervals:
tsibble::interval(ts_2020)
#> <interval[1]>
#> [1] 4W
tsibble::interval(ts_2022)
#> <interval[1]>
#> [1] 1W
This is because the tsibbles have different steps:
ts_2020 |>
pluck("year_week") |>
diff()
#> Time differences in weeks
#> [1] 4 4
ts_2022 |>
pluck("year_week") |>
diff()
#> Time differences in weeks
#> [1] 5 4
Therefore, the greatest common divisors are different (4 and 1). From the manual
for as_tibble:
regular Regular time interval (TRUE) or irregular (FALSE). The
interval is determined by the greatest common divisor of index column,
if TRUE.
Both tsibbles are
regular:
is_regular(ts_2020)
#> [1] TRUE
is_regular(ts_2020)
#> [1] TRUE
So, I would like to set the gap fill interval, so the periods are consistent.
I tried setting .full in fill_gaps and .regular in as_tsibble.
I could not find a way to set the interval of a tsibble.
Is there a way of manually setting the interval used by fill_gaps? Granted an interval of four weeks won't work for df_2022, but the LCM of one would work for both.

how to find the growth rate of applicants per year

I have this data set with 20 variables, and I want to find the growth rate of applicants per year. The data provided is from 2020-2022. How would I go about that? I tried subsetting the data but I'm stuck on how to approach it. So essentially, I want to put the respective applicants to its corresponding year and calculate the growth rate.
Observations ID# Date
1 1226 2022-10-16
2 1225 2021-10-15
3 1224 2020-08-14
4 1223 2021-12-02
5 1222 2022-02-25
One option is to use lubridate::year to split your year-month-day variable into years and then dplyr::summarize().
library(tidyverse)
library(lubridate)
set.seed(123)
id <- seq(1:100)
date <- as.Date(sample( as.numeric(as.Date('2017-01-01') ): as.numeric(as.Date('2023-01-01') ), 100,
replace = T),
origin = '1970-01-01')
df <- data.frame(id, date) %>%
mutate(year = year(date))
head(df)
#> id date year
#> 1 1 2018-06-10 2018
#> 2 2 2017-07-14 2017
#> 3 3 2022-01-16 2022
#> 4 4 2020-02-16 2020
#> 5 5 2020-06-06 2020
#> 6 6 2020-06-21 2020
df <- df %>%
group_by(year) %>%
summarize(n = n())
head(df)
#> # A tibble: 6 × 2
#> year n
#> <dbl> <int>
#> 1 2017 17
#> 2 2018 14
#> 3 2019 17
#> 4 2020 18
#> 5 2021 11
#> 6 2022 23

Dataframe with start & end date to daily data

I am trying to convert below data on daily basis based on range available in start_date & end_date_ column.
to this output (sum):
Please use dput() when posting data frames next time!
Example data
# A tibble: 4 × 4
id start end inventory
<int> <chr> <chr> <dbl>
1 1 01/05/2022 02/05/2022 100
2 2 10/05/2022 15/05/2022 50
3 3 11/05/2022 21/05/2022 80
4 4 14/05/2022 17/05/2022 10
Transform the data
df %>%
mutate(across(2:3, ~ as.Date(.x,
format = "%d/%m/%Y"))) %>%
pivot_longer(cols = c(start, end), values_to = "date") %>%
arrange(date) %>%
select(date, inventory)
# A tibble: 8 × 2
date inventory
<date> <dbl>
1 2022-05-01 100
2 2022-05-02 100
3 2022-05-10 50
4 2022-05-11 80
5 2022-05-14 10
6 2022-05-15 50
7 2022-05-17 10
8 2022-05-21 80
Expand the dates and left_join
left_join(tibble(date = seq(first(df$date),
last(df$date),
by = "day")), df)
# A tibble: 21 × 2
date inventory
<date> <dbl>
1 2022-05-01 100
2 2022-05-02 100
3 2022-05-03 NA
4 2022-05-04 NA
5 2022-05-05 NA
6 2022-05-06 NA
7 2022-05-07 NA
8 2022-05-08 NA
9 2022-05-09 NA
10 2022-05-10 50
# … with 11 more rows

How can I create a column in R that calculates a value of a row based on the previous row's value?

Here's what I would like to a achieve as a function in Excel, but I can't seem to find a solution to do it in R.
This is what I tried to do but it does not seem to allow me to operate with the previous values of the new column I'm trying to make.
Here is a reproducible example:
library(dplyr)
set.seed(42) ## for sake of reproducibility
dat <- data.frame(date=seq.Date(as.Date("2020-12-26"), as.Date("2020-12-31"), "day"))
This would be the output of the dataframe:
dat
date
1 2020-12-26
2 2020-12-27
3 2020-12-28
4 2020-12-29
5 2020-12-30
6 2020-12-31
Desired output:
date periodNumber
1 2020-12-26 1
2 2020-12-27 2
3 2020-12-28 3
4 2020-12-29 4
5 2020-12-30 5
6 2020-12-31 6
My try at this:
dat %>%
mutate(periodLag = dplyr::lag(date)) %>%
mutate(periodNumber = ifelse(is.na(periodLag)==TRUE, 1,
ifelse(date == periodLag, dplyr::lag(periodNumber), (dplyr::lag(periodNumber) + 1))))
Excel formula screenshot:
You could use dplyr's cur_group_id():
library(dplyr)
set.seed(42)
# I used a larger example
dat <- data.frame(date=sample(seq.Date(as.Date("2020-12-26"), as.Date("2020-12-31"), "day"), size = 30, replace = TRUE))
dat %>%
arrange(date) %>% # needs sorting because of the random example
group_by(date) %>%
mutate(periodNumber = cur_group_id())
This returns
# A tibble: 30 x 2
# Groups: date [6]
date periodNumber
<date> <int>
1 2020-12-26 1
2 2020-12-26 1
3 2020-12-26 1
4 2020-12-26 1
5 2020-12-26 1
6 2020-12-26 1
7 2020-12-26 1
8 2020-12-26 1
9 2020-12-27 2
10 2020-12-27 2
11 2020-12-27 2
12 2020-12-27 2
13 2020-12-27 2
14 2020-12-27 2
15 2020-12-27 2
16 2020-12-28 3
17 2020-12-28 3
18 2020-12-28 3
19 2020-12-29 4
20 2020-12-29 4
21 2020-12-29 4
22 2020-12-29 4
23 2020-12-29 4
24 2020-12-29 4
25 2020-12-30 5
26 2020-12-30 5
27 2020-12-30 5
28 2020-12-30 5
29 2020-12-30 5
30 2020-12-31 6

How to read a list of list in r

I have a txt file like this:
[["seller_id","product_id","buyer_id","sale_date","quantity","price"],[7,11,49,"2019-01-21",5,3330],[13,32,6,"2019-02-10",9,1089],[50,47,4,"2019-01-06",1,1343],[1,22,2,"2019-03-03",9,7677]]
I would like to read it by R as a table like this:
seller_id
product_id
buyer_id
sale_date
quantity
price
7
11
49
2019-01-21
5
3330
13
32
6
2019-02-10
9
1089
50
47
4
2019-01-06
1
1343
1
22
2
2019-03-03
9
7677
How to write the correct R code? Thanks very much for your time.
An easier option is fromJSON
library(jsonlite)
library(janitor)
fromJSON(txt = "file1.txt") %>%
as_tibble %>%
row_to_names(row_number = 1) %>%
type.convert(as.is = TRUE)
-output
# A tibble: 4 x 6
# seller_id product_id buyer_id sale_date quantity price
# <int> <int> <int> <chr> <int> <int>
#1 7 11 49 2019-01-21 5 3330
#2 13 32 6 2019-02-10 9 1089
#3 50 47 4 2019-01-06 1 1343
#4 1 22 2 2019-03-03 9 7677
You will need to parse the json from arrays into a data frame. Perhaps something like this:
# Get string
str <- '[["seller_id","product_id","buyer_id","sale_date","quantity","price"],[7,11,49,"2019-01-21",5,3330],[13,32,6,"2019-02-10",9,1089],[50,47,4,"2019-01-06",1,1343],[1,22,2,"2019-03-03",9,7677]]'
df_list <- jsonlite::parse_json(str)
do.call(rbind, lapply(df_list[-1], function(x) {
setNames(as.data.frame(x), unlist(df_list[1]))}))
#> seller_id product_id buyer_id sale_date quantity price
#> 1 7 11 49 2019-01-21 5 3330
#> 2 13 32 6 2019-02-10 9 1089
#> 3 50 47 4 2019-01-06 1 1343
#> 4 1 22 2 2019-03-03 9 7677
Created on 2020-12-11 by the reprex package (v0.3.0)
Some base R options using:
gsub + read.table
read.table(
text = gsub('"|\\[|\\]', "", gsub("\\],", "\n", s)),
sep = ",",
header = TRUE
)
gsub + read.csv
read.csv(text = gsub('"|\\[|\\]', "", gsub("\\],", "\n", s)))
which gives
seller_id product_id buyer_id sale_date quantity price
1 7 11 49 2019-01-21 5 3330
2 13 32 6 2019-02-10 9 1089
3 50 47 4 2019-01-06 1 1343
4 1 22 2 2019-03-03 9 7677
Data
s <- '[["seller_id","product_id","buyer_id","sale_date","quantity","price"],[7,11,49,"2019-01-21",5,3330],[13,32,6,"2019-02-10",9,1089],[50,47,4,"2019-01-06",1,1343],[1,22,2,"2019-03-03",9,7677]]'

Resources