Creating new variable based on reference date calculation [duplicate] - r

This question already has answers here:
Calculate number of days between two dates in r
(4 answers)
Closed 2 years ago.
I have a dataframe with multiple participants (distinguished by the variable "ID") and calendar dates (MM/DD/YYYY) associated with each row of data.
I would like to create a "Day" column to calculate the number of days that has elapsed since the first calendar date for each ID (i.e. using the first date for each participant as a reference date).
Example Structure:
ID Calendar.date Day
1 06/23/2020 1
1 06/25/2020 3
1 06/26/2020 4
2 03/24/2019 1
2 03/30/2019 7
2 03/31/2019 8

Here is a dplyr approach. If you group_by the ID, you can subtract dates from the first date for each ID. This assumes you have your data in a data frame df:
library(dplyr)
df %>%
  group_by(ID) %>%
  mutate(Calendar_date = as.Date(Calendar_date, format = "%m/%d/%Y"),
         Day = Calendar_date - first(Calendar_date) + 1)
For the output below, I modified your example data to avoid impossible dates in February. Note that the result for Day is a difftime object. If you want a plain numeric number of days, wrap the subtraction in as.numeric inside the mutate call:
Day = as.numeric(Calendar_date - first(Calendar_date)) + 1
Output
# A tibble: 6 x 3
# Groups: ID [2]
ID Calendar_date Day
<dbl> <date> <drtn>
1 1 2020-06-23 1 days
2 1 2020-06-25 3 days
3 1 2020-06-26 4 days
4 2 2019-02-20 1 days
5 2 2019-02-26 7 days
6 2 2019-02-27 8 days
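For comparison, the same calculation can be sketched in base R with ave(). This is only an illustration built from the example data in the question; it assumes the reference date is the earliest date per ID (min) rather than the first row, and the column name Calendar_date follows the answer above.

```r
# Example data taken from the question (dates as MM/DD/YYYY strings)
df <- data.frame(
  ID = c(1, 1, 1, 2, 2, 2),
  Calendar_date = c("06/23/2020", "06/25/2020", "06/26/2020",
                    "03/24/2019", "03/30/2019", "03/31/2019")
)

# Parse the strings into Date objects
df$Calendar_date <- as.Date(df$Calendar_date, format = "%m/%d/%Y")

# Within each ID, count days from the earliest date, starting at 1
df$Day <- ave(as.numeric(df$Calendar_date), df$ID,
              FUN = function(d) d - min(d) + 1)

print(df)
```

Because ave() works on plain numbers here, Day comes back as a numeric column rather than a difftime.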

Related

Calculate length of night in data frame

I have some test data with two columns. The column "hour" shows hourly values (p.m.). The column "day" indicates the corresponding day, i.e. on day 1 there are hourly values from 7 to 11 o'clock.
I now want to calculate how big the time span is for each day and store these values in a vector.
Something like:
timespan <- c(5,7,3)
How could I calculate this in a loop?
I thought about something like length(unique...)
Thanks in advance!
Here is the code:
day <- c(1,1,1,1,1,2,2,2,2,2,3,3,3,3,3)
hour <- c(7,7,8,10,11,5,6,6,7,11,9,10,10,11,11)
df <- data.frame(day,hour)
library(dplyr)
df %>%
group_by(day) %>%
summarise(time_span = max(hour) - min(hour) + 1)
## A tibble: 3 x 2
# day time_span
# <dbl> <dbl>
# 1 1 5
# 2 2 7
# 3 3 3
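Since the question asks for the result as a vector, here is a base R sketch using tapply() instead of an explicit loop. It reuses the day and hour vectors from the question and assumes, as the dplyr answer does, that the time span is max(hour) - min(hour) + 1.

```r
day  <- c(1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3)
hour <- c(7, 7, 8, 10, 11, 5, 6, 6, 7, 11, 9, 10, 10, 11, 11)

# Apply the span calculation to the hours of each day,
# then drop the names to get a plain numeric vector
timespan <- as.vector(tapply(hour, day, function(h) max(h) - min(h) + 1))

print(timespan)
```

For this data, timespan matches the c(5, 7, 3) given in the question.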

New data frame with unique values and counts [duplicate]

This question already has answers here:
Counting unique / distinct values by group in a data frame
(12 answers)
Closed 1 year ago.
I'd like to create a new data table from my old one that includes a count of all the "article_id" that occur for each date (i.e. there are three article_id's listed for the date 2001-10-01, so I'd like one column with the date and one column that has the article count, "3").
Here is the output of the data table:
date article_id N
1: 2001-09-01 FAS_200109_11104 3
2: 2001-10-01 FAS_200110_11126 6
3: 2001-10-01 FAS_200110_11157 21
4: 2001-10-01 FAS_200110_11160 5
5: 2001-11-01 FAS_200111_11220 26
---
7359: 2019-08-01 FAZ_201908_2958 7
7360: 2019-09-01 FAZ_201909_3316 8
7361: 2019-09-01 FAZ_201909_3515 13
7362: 2000-12-01 FAZ_200012_92981 3
7363: 2001-08-01 FAZ_200108_86041 14
So I'll have to move over the unique date values to a new data frame (so that each date is only shown once), as well as a count of article_id's shown for each date.
I've been trying to figure this out but haven't found exactly the right answer regarding how to count the occurrence of a character vector (the article_id) by group (date). I think this is something pretty simple in R, but I'm new to the program and don't have much support so I would very much appreciate your suggestions - thank you so much!
The expected output is not clear, so here are a few options under different assumptions, with the data.table called dt:
Sum of 'N' by 'date'
library(data.table)
dt[, .(N = sum(N, na.rm = TRUE)), by = date]
Count of unique 'article_id' for each date
dt[, .(N = uniqueN(article_id)), by = date]
Get the first count by 'date'
dt[, .(N = first(N)), by = date]
We could group and then summarise:
library(dplyr)
df %>%
group_by(date) %>%
summarise(n = n())
date n
<chr> <int>
1 2000-12-01 1
2 2001-08-01 1
3 2001-09-01 1
4 2001-10-01 3
5 2001-11-01 1
6 2019-08-01 1
7 2019-09-01 2
Here are two tidyverse solutions:
Libraries
library(tidyverse)
library(lubridate) # provides ymd(); not attached by library(tidyverse) before tidyverse 2.0.0
Example Data
df <-
  tibble(
    date = ymd(c("2001-09-01","2001-10-01","2001-10-01")),
    article_id = c("FAS_200109_11104","FAS_200110_11126","FAS_200110_11157"),
    N = c(3,6,21)
  )
Solution
Solution 1
df %>%
group_by(date) %>%
summarise(N = sum(N,na.rm = TRUE))
Solution 2
df %>%
count(date,wt = N)
Result
# A tibble: 2 x 2
date n
<date> <dbl>
1 2001-09-01 3
2 2001-10-01 27
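If the goal is the number of distinct article_id values per date (the "3" for 2001-10-01 mentioned in the question) rather than the sum of N, a base R sketch could look like the following. It uses only the first five rows shown in the question; the output column name n_articles is an assumption made for illustration.

```r
# First five rows of the data shown in the question
df <- data.frame(
  date = c("2001-09-01", "2001-10-01", "2001-10-01", "2001-10-01", "2001-11-01"),
  article_id = c("FAS_200109_11104", "FAS_200110_11126", "FAS_200110_11157",
                 "FAS_200110_11160", "FAS_200111_11220"),
  N = c(3, 6, 21, 5, 26)
)

# Count the distinct article_id values within each date
article_counts <- aggregate(article_id ~ date, data = df,
                            FUN = function(x) length(unique(x)))
names(article_counts)[2] <- "n_articles"  # hypothetical column name

print(article_counts)
```

Each date appears once in article_counts, with 2001-10-01 getting a count of 3.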

How to take an arithmetic average over common variable, rather than whole data?

I have a data frame of daily stock prices, along with a variable that indicates the week of the year (1, 2, 3, 4, ..., 51, 52); this is repeated for 22 companies. I would like to create a new variable that averages the daily prices, but only within each week.
The above equation has d = day and t = week. My challenge is taking this average of days across each week. Therefore, I should have 52 values per stock that I observe.
Using ave().
dat <- transform(dat, avg_week_price=ave(price, week, company))
head(dat, 9)
# week company wday price avg_week_price
# 1 1 1 a 16.16528 15.47573
# 2 2 1 a 18.69307 15.13812
# 3 3 1 a 11.01956 12.99854
# 4 1 2 a 15.92029 14.56268
# 5 2 2 a 12.26731 13.64916
# 6 3 2 a 17.40726 17.27226
# 7 1 3 a 11.83037 13.02894
# 8 2 3 a 13.09144 12.95284
# 9 3 3 a 12.08950 15.81040
Data:
set.seed(42)
dat <- expand.grid(week=1:3, company=1:5, wday=letters[1:7])
dat$price <- runif(nrow(dat), 10, 20)
An option with dplyr
library(dplyr)
dat %>%
group_by(week, company) %>%
mutate(avg_week_price = mean(price))
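Both answers above keep one row per day and repeat the weekly average on each row. If one value per week and company is wanted instead (the "52 values per stock" in the question), a sketch with aggregate() on the same simulated data might look like this; the name weekly is only illustrative.

```r
# Simulated data, generated the same way as in the ave() answer
set.seed(42)
dat <- expand.grid(week = 1:3, company = 1:5, wday = letters[1:7])
dat$price <- runif(nrow(dat), 10, 20)

# Collapse to one mean price per week and company
weekly <- aggregate(price ~ week + company, data = dat, FUN = mean)

# 3 weeks x 5 companies gives 15 rows
print(nrow(weekly))
head(weekly)
```

The values in weekly$price are the same averages that ave() or mutate() attach to every daily row, just stored once per group.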

How could I form date interval with counts in R?

I have a date variable called DATE as follows:
DATE
2019-12-31
2020-01-01
2020-01-05
2020-01-09
2020-01-25
I am trying to count the number of dates that fall in each week, where week 1 starts at the minimum of the DATE variable. The result would look something like this:
Week Count
1 3
2 1
3 0
4 1
Thanks in advance.
From base R
dates <- c('2019-12-31','2020-01-01','2020-01-05','2020-01-09','2020-01-25')
weeks <- strftime(dates, format = "%V")
table(weeks)
We subtract the minimum DATE from each DATE value to get the difference in days. Dividing that difference by 7 (rounded down) and adding 1 gives the week number, which we then count. Finally, complete fills in the weeks that have no dates with a count of 0.
df %>%
dplyr::count(week = floor(as.integer(DATE - min(DATE))/7) + 1) %>%
tidyr::complete(week = min(week):max(week), fill = list(n = 0))
# week n
# <dbl> <dbl>
#1 1 3
#2 2 1
#3 3 0
#4 4 1
If your DATE column is not of date class, first run this :
df$DATE <- as.Date(df$DATE)
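The same idea can be sketched in base R only. Week numbers are counted from the earliest date, and a factor with explicit levels keeps the empty weeks in the table. The names week, counts and result are illustrative.

```r
DATE <- as.Date(c("2019-12-31", "2020-01-01", "2020-01-05",
                  "2020-01-09", "2020-01-25"))

# Days since the first date, integer-divided into 7-day blocks (week 1 = days 0..6)
week <- as.integer(DATE - min(DATE)) %/% 7 + 1

# Using explicit factor levels makes table() report weeks with zero dates
counts <- table(factor(week, levels = seq_len(max(week))))

result <- data.frame(Week = seq_len(max(week)), Count = as.vector(counts))
print(result)
```

For the five dates in the question this gives counts of 3, 1, 0 and 1 for weeks 1 to 4, matching the expected output.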

R Conditional Summarizing [duplicate]

This question already has answers here:
Calculate group mean, sum, or other summary stats. and assign column to original data
(4 answers)
faster way to create variable that aggregates a column by id [duplicate]
(6 answers)
Closed 5 years ago.
I have a column for company, one for sales and another column for country. I need to sum all the sales in each of the countries separately, so that each company (name) row would also carry the total sales in its country. The sales in all of the countries are expressed in the same currency.
I have tried several ways of doing so, but none of them work:
df$total_country_sales = if(df$country[row] == df$country) { sum(df$sales)}
This sums all valuations, not only the ones that I need.
I would like to have a new column, Total Country Sales:
Name Sales Country Total Country Sales
abc 122 US 5022
abc 100 Canada
aad 4900 US
I need to have the values in the same dataframe, but in a new column.
Since it is a large dataset, I cannot make a function to do so, but rather need to save it directly as a variable. (Or have I understood incorrectly that making functions is not the best way to solve such issues?)
I am new to R and programming in general, so I might be addressing the issue in an incorrect way.
Sorry for probably a stupid question.
Thanks!
If I understand your question correctly, this solves your problem:
df = data.frame(sales=c(1,3,2,4,5),region=c("A","A","B","B","B"))
library(dplyr)
totals = df %>% group_by(region) %>% summarize(total = sum(sales))
df = left_join(df, totals, by = "region")
It adds the group totals as a separate column, like this:
sales region total
1 1 A 4
2 3 A 4
3 2 B 11
4 4 B 11
5 5 B 11
Hope this helps.
We can use base R to do this
df$total_country_sales <- with(df, ave(sales, country, FUN = sum))
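As a quick check, here is the ave() approach applied to the three rows from the question. The column names Name, Sales, Country and Total_Country_Sales are adapted from the question's table and are assumptions about the real data.

```r
# The three example rows from the question
df <- data.frame(
  Name = c("abc", "abc", "aad"),
  Sales = c(122, 100, 4900),
  Country = c("US", "Canada", "US")
)

# Sum Sales within each Country and repeat the total on every row of that country
df$Total_Country_Sales <- ave(df$Sales, df$Country, FUN = sum)

print(df)
```

Both US rows get 5022 (122 + 4900), as in the question, and the Canada row gets 100. No user-defined function or join is needed, and the result stays in the same data frame.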
It can be achieved using dplyr's mutate()
df = data.frame(sales=c(1,3,2,4,5),country=c("A","A","B","B","B"))
df
# sales country
# 1 1 A
# 2 3 A
# 3 2 B
# 4 4 B
# 5 5 B
df %>% group_by(country) %>% mutate(total_sales = sum(sales))
# Source: local data frame [5 x 3]
# Groups: country [2]
#
# # A tibble: 5 x 3
# sales country total_sales
# <dbl> <fctr> <dbl>
# 1 1 A 4
# 2 3 A 4
# 3 2 B 11
# 4 4 B 11
# 5 5 B 11
using data.table
library(data.table)
setDT(df)[, total_sales := sum(sales), by = country]
df
# sales country total_sales
# 1: 1 A 4
# 2: 3 A 4
# 3: 2 B 11
# 4: 4 B 11
# 5: 5 B 11
