How to perform lag in R when there are multiple repeating rows for a group

Suppose I have a data frame as follows:
date         price   company
2000-10-01   18      A
2001-10-01   20      A
2001-10-01   20      A
2001-10-01   20      A
I want to create a new variable lagged_price as follows:
date         price   company   lagged_price
2000-10-01   18      A         NA
2001-10-01   20      A         18
2001-10-01   20      A         18
2001-10-01   20      A         18
The new variable, lagged_price, takes the lagged value of price within each company. That is, lagged_price captures the company's price on the previous date, not in the previous row. Simply using group_by(company) with lag() is problematic because it takes the value from the preceding row, and with repeated dates that is the same date's price. I also do not want to run distinct() on the original dataset: although that would do the job in this example, I need to keep the duplicated rows.
My failed solution:
out <- data %>%
  group_by(company) %>%
  mutate(lagged_price = lag(price))
Any help is appreciated.

Lag first, before grouping, so the lag reaches back across dates; then, within each date, copy the first row's lagged value to its duplicates:
df %>%
  mutate(lagged_price = lag(price)) %>%        # lag over all rows, crossing dates
  group_by(date) %>%
  mutate(lagged_price = lagged_price[1]) %>%   # first row of each date carries the previous date's price
  ungroup()
# A tibble: 4 × 4
date price company lagged_price
<chr> <int> <chr> <int>
1 2000-10-01 18 A NA
2 2001-10-01 20 A 18
3 2001-10-01 20 A 18
4 2001-10-01 20 A 18
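If the data can contain more than one company, a sketch of the same idea is to build a per-company lookup of previous-date prices and join it back, so every row of the original data is kept; distinct() is applied only to the lookup table, never to the original dataset. This assumes the column names from the question (date, price, company):

library(dplyr)

# Lookup: one row per company/date, lagged within company by date.
prev_prices <- df %>%
  distinct(company, date, price) %>%
  arrange(company, date) %>%
  group_by(company) %>%
  mutate(lagged_price = lag(price)) %>%
  ungroup() %>%
  select(company, date, lagged_price)

# Join the previous-date price back onto the full data, keeping all rows.
out <- df %>%
  left_join(prev_prices, by = c("company", "date"))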

Related

Sum unique occurrences per night and create a new data frame in R

I have studied prey deliveries in a breeding owl and want to score the number of prey items delivered to the nestlings during the night. I define night as 21:00 to 05:00. How could I make a new data frame with the number of prey each night per location ID, based on this 24/7 observation dataset? In the new data frame, I wish to have the following columns: ID (A & B), No_prey_during_night (the sum of prey items), and Time (the date span, e.g. 4/6 to 5/6), with a unique row per night per ID.
https://drive.google.com/file/d/1y5VCoNWZCmYbyWCktKfMSBqjOIaLeumQ/view?usp=sharing. I have done it in Excel so far, but that is very time-consuming. I would be happy to get help with a simple script I could use in R.
To account for the fact that a night begins and ends on different dates, you can first assign all the morning hours to the prior day. The final label (the Time column in your question) then includes the next day. If the year of data collection contains a Feb 29, make sure the year is set correctly (I used 2022).
library(dplyr)
library(lubridate)
read.csv("Tot_prey_example.csv") %>%
mutate(time = make_datetime(year = 2022, month = Month, day = Day, hour = Hour),
night_time = if_else(between(Hour, 0, 5), time - days(1), time),
night_date = floor_date(night_time, unit = "day"),
night = Hour <= 5 | Hour >= 21) %>%
filter(night) %>%
group_by(ID, night_date) %>%
summarise(No_prey_during_night = sum(n), .groups = "drop") %>%
mutate(next_day = night_date + days(1),
Time = glue::glue("{day(night_date)}/{month(night_date)} to {day(next_day)}/{month(next_day)}")) %>%
select(ID, No_prey_during_night, Time)
#> # A tibble: 88 × 3
#> ID No_prey_during_night Time
#> <chr> <int> <glue>
#> 1 A 12 4/6 to 5/6
#> 2 A 22 5/6 to 6/6
#> 3 A 20 6/6 to 7/6
#> 4 A 14 7/6 to 8/6
#> 5 A 14 8/6 to 9/6
#> 6 A 27 9/6 to 10/6
#> 7 A 22 10/6 to 11/6
#> 8 A 18 11/6 to 12/6
#> 9 A 22 12/6 to 13/6
#> 10 A 25 13/6 to 14/6
#> # … with 78 more rows
Created on 2022-05-18 by the reprex package (v2.0.1)
You can do something like this:
library(dplyr)
library(lubridate)
read.csv("Tot_prey_example.csv") %>%
# create initial datetime variable, `night`
mutate(night = lubridate::make_datetime(2021, Month,Day,Hour)) %>%
# filter to nighttime hours
filter(Hour>=21 | Hour<=5) %>%
# flip datetime variable to the next day if hour is >=21
mutate(night = if_else(Hour>=21,night + 60*60*24, night)) %>%
# now group by the date part of `night`
group_by(ID,Night_No = as.Date(night)) %>%
# summarize the sum of prey
summarize(
No_prey_during_night = sum(n),
No_deliveries_during_night = sum(PreyDelivery)
) %>%
# replace the Night_No with a character variable showing both dates
mutate(Night_No = paste0(Night_No-1, "-", Night_No))
Output:
# A tibble: 88 × 4
# Groups: ID [2]
ID Night_No No_prey_during_night No_deliveries_during_night
<chr> <chr> <int> <int>
1 A 2021-06-04-2021-06-05 12 5
2 A 2021-06-05-2021-06-06 22 6
3 A 2021-06-06-2021-06-07 20 5
4 A 2021-06-07-2021-06-08 14 6
5 A 2021-06-08-2021-06-09 14 5
6 A 2021-06-09-2021-06-10 27 5
7 A 2021-06-10-2021-06-11 22 4
8 A 2021-06-11-2021-06-12 18 6
9 A 2021-06-12-2021-06-13 22 6
10 A 2021-06-13-2021-06-14 25 5
# … with 78 more rows
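A side note on the night + 60*60*24 step: because make_datetime() defaults to UTC, adding 86400 seconds is identical to adding one calendar day here, but lubridate's days(1) is the safer idiom if a civil time zone with DST were ever used. A quick illustration:

library(lubridate)

x <- make_datetime(2021, 6, 4, 22)  # UTC by default
x + 60 * 60 * 24  # adds exactly 86400 seconds
x + days(1)       # adds one calendar day; identical in UTC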

How to calculate duration of time between two dates

I'm working with a large data set in RStudio that includes multiple test scores for the same individuals. I've filtered my data set so that each individual's two scores appear in consecutive rows, with the test date for each administration in one column. My data appears as follows:
id test_date score baseline_number_1 baseline_number_2
1 08/15/2017 21.18 Baseline N/A
1 08/28/2019 28.55 N/A Baseline
2 11/22/2017 33.38 Baseline N/A
2 11/06/2019 35.3 N/A Baseline
3 07/25/2018 30.77 Baseline N/A
3 07/31/2019 33.42 N/A Baseline
I would like to calculate the total duration of time between the baseline 1 and baseline 2 administrations and store that value in a new column. So my first question is: what is the best way to calculate the duration of time between two dates? And second, what is the best way to condense each individual's data into one row, making the difference between test scores easier to calculate and store in a new column?
Thank you for any assistance!
This is a solution within the tidyverse; the packages we are going to use are dplyr and tidyr.
First, we create the dataset (you would read yours from a file instead) and convert the strings to Date format:
library(dplyr)
library(tidyr)
dataset <- read.table(text = "id test_date score baseline_number_1 baseline_number_2
1 08/15/2017 21.18 Baseline N/A
1 08/28/2019 28.55 N/A Baseline
2 11/22/2017 33.38 Baseline N/A
2 11/06/2019 35.3 N/A Baseline
3 07/25/2018 30.77 Baseline N/A
3 07/31/2019 33.42 N/A Baseline", header = TRUE, na.strings = "N/A")
dataset$test_date <- as.Date(dataset$test_date, format = "%m/%d/%Y")
# id test_date score baseline_number_1 baseline_number_2
# 1 1 2017-08-15 21.18 Baseline <NA>
# 2 1 2019-08-28 28.55 <NA> Baseline
# 3 2 2017-11-22 33.38 Baseline <NA>
# 4 2 2019-11-06 35.30 <NA> Baseline
# 5 3 2018-07-25 30.77 Baseline <NA>
# 6 3 2019-07-31 33.42 <NA> Baseline
The best solution to condense each individual's data into one row and compute the difference between the two baselines can be achieved as follows:
dataset %>%
  group_by(id) %>%
  mutate(number = row_number()) %>%
  ungroup() %>%
  pivot_wider(
    id_cols = id,
    names_from = number,
    values_from = c(test_date, score),
    names_glue = "{.value}_{number}"
  ) %>%
  mutate(
    time_between = test_date_2 - test_date_1
  )
Brief explanation: first we create the variable number, which indicates the baseline number in each row; then we use pivot_wider to make the dataset "wider" indeed, i.e. one row for each id along with its features; finally we create the variable time_between, which contains the difference in days between the two baselines. If you are not familiar with some of these functions, I suggest you break the pipeline after each operation and inspect the result step by step.
Final output
# A tibble: 3 x 6
# id test_date_1 test_date_2 score_1 score_2 time_between
# <int> <date> <date> <dbl> <dbl> <drtn>
# 1 1 2017-08-15 2019-08-28 21.2 28.6 743 days
# 2 2 2017-11-22 2019-11-06 33.4 35.3 714 days
# 3 3 2018-07-25 2019-07-31 30.8 33.4 371 days
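The <drtn> column type shows that time_between is a difftime. If you would rather store a plain number of days, a small follow-up sketch (result_wide is a hypothetical name for the tibble produced above):

library(dplyr)

# difftime() with an explicit unit, coerced to a plain numeric.
result_wide <- result_wide %>%
  mutate(days_between = as.numeric(difftime(test_date_2, test_date_1, units = "days")))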

Aggregation of the region's values in the dataset

df <- read.csv('https://raw.githubusercontent.com/ulklc/covid19-timeseries/master/countryReport/raw/rawReport.csv',
               stringsAsFactors = FALSE)
I processed the dataset. Can we find the day with the fewest deaths in the Asia region? The important thing here is to sum the deaths of all countries in the Asia region, then sort to find that day. As output, something like:
date        region  death
2020/02/17  asia    6300  (sum for the Asia region)
The values in this example output are made up, not real.
Since these are cumulative cases and deaths, we need to difference the data.
library(dplyr)
df %>%
  mutate(day = as.Date(day)) %>%
  filter(region == "Asia") %>%
  group_by(day) %>%
  summarise(deaths = sum(death)) %>%
  mutate(d = c(first(deaths), diff(deaths))) %>%  # daily increments from the cumulative totals
  arrange(d)
# A tibble: 107 x 3
day deaths d
<date> <int> <int>
1 2020-01-23 18 1 # <- this day saw only 1 death in the whole of Asia
2 2020-01-29 133 2
3 2020-02-21 2249 3
4 2020-02-12 1118 5
5 2020-01-24 26 8
6 2020-02-23 2465 10
7 2020-01-26 56 14
8 2020-01-25 42 16
9 2020-01-22 17 17
10 2020-01-27 82 26
# ... with 97 more rows
So the second day of records saw the least number of deaths recorded (so far).
Using the dplyr package for the data manipulation:
df <- read.csv('https://raw.githubusercontent.com/ulklc/covid19-timeseries/master/countryReport/raw/rawReport.csv',
               stringsAsFactors = FALSE)

library(dplyr)
df_sum <- df %>%
  group_by(region, day) %>%                      # grouping by region and day
  summarise(death = sum(death)) %>%              # summing within the groups
  filter(region == "Asia", death == min(death))  # keeping only the minimum for Asia
Then you have:
> df_sum
# A tibble: 1 x 3
# Groups: region [1]
region day death
<fct> <fct> <int>
1 Asia 2020/01/22 17
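Note that death in this dataset is cumulative, so its minimum is simply the first recorded day. As the previous answer points out, to find the day with the fewest new deaths you would difference first; a sketch of that (slice_min assumes dplyr >= 1.0.0):

df %>%
  filter(region == "Asia") %>%
  group_by(day) %>%
  summarise(deaths = sum(death)) %>%                      # cumulative total per day
  mutate(new_deaths = c(first(deaths), diff(deaths))) %>% # daily increments
  slice_min(new_deaths, n = 1)                            # day with the fewest new deaths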

How do I know what day of the week a date is

I've got the following problem: I have the daily stock exchange rates of a certain share from 2015 to 2017, stored in a vector together with the corresponding dates.
I need to extract the last exchange rate of every week.
This means I need to know which weekday corresponds to each date, and then store those rates in a vector (or delete the other rows from the existing vector). I did this by using wday (from lubridate) and then the following:
vector <- stockexchangerate
weekdays <- wday(stockexchangerate)  # length = 35; Monday = 2, Tuesday = 3, ...
for (i in 1:10) {
  if (weekdays[i] < 6) {
    vector <- vector[-c(i)]
  }
}
But the only consequence is that some seemingly random rows are deleted, and if I run this code 6 times, there is only 1 row left, although some values were taken on a Friday. Can anyone help me?
Yes, using lubridate was a good idea. I would extract the day of the week using lubridate::wday with the argument label = TRUE and filter on that column.
Assuming that you have a dataframe with 2 columns (one for the dates and, one for the value of rates) you can do:
library(tidyverse)
library(lubridate)
# DATA
df <- tibble(date = mdy("02/15/1980") + 1:300,
             value = 1:300)

df %>%
  mutate(day = wday(date, label = TRUE)) %>%
  filter(day == "Fri")
#> # A tibble: 42 x 3
#> date value day
#> <date> <int> <ord>
#> 1 1980-02-22 7 Fri
#> 2 1980-02-29 14 Fri
#> 3 1980-03-07 21 Fri
#> 4 1980-03-14 28 Fri
#> 5 1980-03-21 35 Fri
#> 6 1980-03-28 42 Fri
#> 7 1980-04-04 49 Fri
#> 8 1980-04-11 56 Fri
#> 9 1980-04-18 63 Fri
#> 10 1980-04-25 70 Fri
#> # … with 32 more rows
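Filtering on Friday misses weeks in which the exchange was closed that day. If that matters, an alternative sketch is to group by ISO year and week and keep the last available observation of each week (slice_max assumes dplyr >= 1.0.0):

df %>%
  group_by(year = isoyear(date), week = isoweek(date)) %>%
  slice_max(date, n = 1) %>%  # last trading day actually present in that week
  ungroup()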

Create a Table with Alternating Total Rows Followed by Sub-Rows Using Dplyr and Tidyverse

library(dplyr)
library(forcats)
Using the simple dataframe and code below, I want to create a table with total rows and sub-rows. For example, the first row would be "Region1" from the NEW column and 70 from the TotNumber column; below that would be three rows for "Town1", "Town2", and "Town3" with their associated numbers from the Number column; and the same for "Region2" and "Region3". I attached a pic of the desired table...
I'm also looking for a solution using dplyr and Tidyverse.
Number <- c(10, 30, 30, 10, 56, 30, 40, 50, 33, 10)
Town <- c("Town1", "Town2", "Town3", "Town4", "Town5", "Town6", "Town7", "Town8", "Town9", "Town10")
DF <- data_frame(Town, Number)
DF <- DF %>% mutate_at(vars(Town), funs(as.factor))
To create Region variable...
DF <- DF %>%
  mutate(NEW = fct_collapse(Town,
                            Region1 = c("Town1", "Town2", "Town3"),
                            Region2 = c("Town4", "Town5", "Town6"),
                            Region3 = c("Town7", "Town8", "Town9", "Town10"))) %>%
  group_by(NEW) %>%
  summarise(TotNumber = sum(Number))
Modifying your last pipe and adding some additional steps:
library(dplyr)
library(forcats)
DF %>%
  mutate(NEW = fct_collapse(Town,
                            Region1 = c("Town1", "Town2", "Town3"),
                            Region2 = c("Town4", "Town5", "Town6"),
                            Region3 = c("Town7", "Town8", "Town9", "Town10")),
         NEW = as.character(NEW)) %>%
  group_by(NEW) %>%
  mutate(TotNumber = sum(Number)) %>%  # attach the region total while keeping every town row
  ungroup() %>%
  split(.$NEW) %>%                     # one data frame per region
  lapply(function(x) rbind(setNames(x[1, 3:4], names(x)[1:2]), x[1:2])) %>%  # total row stacked on top of its town rows
  do.call(rbind, .)
Results:
# A tibble: 13 × 2
Town Number
* <chr> <dbl>
1 Region1 70
2 Town1 10
3 Town2 30
4 Town3 30
5 Region2 96
6 Town4 10
7 Town5 56
8 Town6 30
9 Region3 133
10 Town7 40
11 Town8 50
12 Town9 33
13 Town10 10
Data:
Number <- c(10, 30, 30, 10, 56, 30, 40, 50, 33, 10)
Town <- c("Town1", "Town2", "Town3", "Town4", "Town5", "Town6", "Town7", "Town8", "Town9", "Town10")
DF <- data_frame(Town, Number) %>%
  mutate_at(vars(Town), funs(as.factor))
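For what it's worth, the same reshaping can be sketched with purrr in place of split()/lapply()/do.call(); here regions is a hypothetical name for the data frame produced by the pipeline above just before split() (columns Town, Number, NEW, TotNumber):

library(dplyr)
library(purrr)

regions %>%
  group_split(NEW) %>%
  map_dfr(~ bind_rows(
    tibble(Town = .x$NEW[1], Number = .x$TotNumber[1]),  # region total row
    transmute(.x, Town = as.character(Town), Number)     # its town rows
  ))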
