Create multiple columns in R using a formula

I'm a bit new to R and trying to find a simplified way of creating multiple columns based on a formula.
I have a dataset that has a base date followed by scores that were taken weekly (score1 = score from 1 week after base date). I would like to generate a date for each week i.e. adding X*7 to the base date. I have found a way to do this by simply creating each date variable one at a time (see below) but since I have over 500 scores, I was wondering if there is a simplified way of doing this that does not take up hundreds of lines of code.
Dataset$score1_date <- Dataset$base_date + (1*7)
Dataset$score2_date <- Dataset$base_date + (2*7)
Dataset$score3_date <- Dataset$base_date + (3*7)
Here is an example dataset:
Dataset <- structure(list(id = c(1, 2, 3), base_date = structure(c(18628, 18633, 18641), class = "Date"), score1 = c(4, 5, 5), score2 = c(6, 5, 2), score3 = c(5, 5, 1)), row.names = c(NA, -3L), class = c("tbl_df", "tbl", "data.frame"))
Thank you!

We can use lapply to loop over the multiplier index (i.e. 1:3 in the OP's post), multiply it by 7 and add the result to base_date, then assign the resulting list of vectors to new columns whose names are built by pasting 'score', the index, and '_date' together.
Dataset[paste0('score', 1:3, '_date')] <- lapply(1:3,
function(i) Dataset$base_date + i*7)
Or using dplyr, loop across the 'score' columns, extract the numeric part of the current column name (cur_column()) with readr::parse_number, multiply it by 7 and add it to 'base_date'. The .names argument appends '_date' to each column name so that new columns are created instead of overwriting the scores.
library(dplyr)
Dataset <- Dataset %>%
mutate(across(starts_with('score'), ~ base_date +
(readr::parse_number(cur_column())) * 7, .names = '{.col}_date'))
Output:
Dataset
# A tibble: 3 x 8
# id base_date score1 score2 score3 score1_date score2_date score3_date
# <dbl> <date> <dbl> <dbl> <dbl> <date> <date> <date>
#1 1 2021-01-01 4 6 5 2021-01-08 2021-01-15 2021-01-22
#2 2 2021-01-06 5 5 5 2021-01-13 2021-01-20 2021-01-27
#3 3 2021-01-14 5 2 1 2021-01-21 2021-01-28 2021-02-04

You can try using a for loop and indicating a column of a data.frame using double brackets (i.e. [[.]]). For example:
for (i in c(1:500)){
Dataset[[paste0("score", i, "_date")]] <- Dataset$base_date + (i*7)
}
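If you would rather not hard-code the number of scores, here is a minimal base R sketch that reads the week numbers from the column names. It assumes the score columns are named exactly score1, score2, ... (as in the example data); the names score_cols and weeks are just illustrative.

```r
# Example data from the question, rebuilt as a plain data.frame
Dataset <- data.frame(
  id = c(1, 2, 3),
  base_date = as.Date(c("2021-01-01", "2021-01-06", "2021-01-14")),
  score1 = c(4, 5, 5),
  score2 = c(6, 5, 2),
  score3 = c(5, 5, 1)
)

# Find the score columns (the trailing $ skips anything like score1_date)
score_cols <- grep("^score[0-9]+$", names(Dataset), value = TRUE)

# Week number = numeric suffix of each column name
weeks <- as.integer(sub("^score", "", score_cols))

# One new date column per score column: base_date + 7 days per week
Dataset[paste0(score_cols, "_date")] <- lapply(
  weeks,
  function(w) Dataset$base_date + w * 7
)
```

This way the same code works whether there are 3 or 500 score columns.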

Related

Count the number of entries that fall within a range of dates in a separate dataframe in R

I am trying to count the number of rows in df1, which contains the date of an event,
df1 = data.frame(date = c("2021-07-31", "2021-08-01", "2021-08-12", "2021-08-14"))
that fall within the start and end dates of df2,
df2 = data.frame(Id = c(1,2),
Start = c("2021-06-01", "2021-08-01"),
End = c("2021-08-15", "2021-09-15"))
In this example, the output would look like
Id Start End Count
1 1 2021-06-01 2021-08-15 3
2 2 2021-08-01 2021-09-15 3
I have tried similar examples How to get the number of counts between two dates in R? and
count row if date falls within date range for all dates in series in R without any success.
Any help or suggestions would be greatly appreciated. Thank you!
Please note: should Id 1 count be 4 in your expected output?
You can group_by your data and sum the dates that fall %within% the interval like this:
df1 = data.frame(date = c("2021-07-31", "2021-08-01", "2021-08-12", "2021-08-14"))
df2 = data.frame(Id = c(1,2),
Start = c("2021-06-01", "2021-08-01"),
End = c("2021-08-15", "2021-09-15"))
library(dplyr)
library(lubridate)
df2 %>%
group_by(Id) %>%
mutate(Count = sum(as_date(df1$date) %within% lubridate::interval(Start, End)))
#> # A tibble: 2 × 4
#> # Groups: Id [2]
#> Id Start End Count
#> <dbl> <chr> <chr> <int>
#> 1 1 2021-06-01 2021-08-15 4
#> 2 2 2021-08-01 2021-09-15 3
Created on 2022-07-12 by the reprex package (v2.0.1)
Using data.table::between in outer.
f <- Vectorize(\(i, j) data.table::between(df1[i, 1L], df2[j, 2], df2[j, 3]))
transform(df2, count=colSums(outer(seq_len(nrow(df1)), seq_len(nrow(df2)), f)))
# Id Start End count
# 1 1 2021-06-01 2021-08-15 4
# 2 2 2021-08-01 2021-09-15 3
Note that the date columns need to be of class "Date", so you may want to convert them beforehand:
df1[] <- lapply(df1, as.Date)
df2[-1] <- lapply(df2[-1], as.Date)
Data:
df1 <- structure(list(date = structure(c(18839, 18840, 18851, 18853), class = "Date")), row.names = c(NA,
-4L), class = "data.frame")
df2 <- structure(list(Id = c(1, 2), Start = structure(c(18779, 18840
), class = "Date"), End = structure(c(18854, 18885), class = "Date")), row.names = c(NA,
-2L), class = "data.frame")
Or with base:
df2$Count <- apply(df2, 1, function(x) sum(as.Date(df1$date) %in% seq(as.Date(x["Start"]), as.Date(x["End"]), by = "1 day")))
Output:
Id Start End Count
1 1 2021-06-01 2021-08-15 4
2 2 2021-08-01 2021-09-15 3
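A variation on the base approach, as a rough sketch: instead of building a day-by-day sequence for every range, compare the event dates against the two bounds directly. It assumes the bounds are inclusive and that the columns have been converted to Date first.

```r
# Example data from the question, with the columns converted to Date
df1 <- data.frame(
  date = as.Date(c("2021-07-31", "2021-08-01", "2021-08-12", "2021-08-14"))
)
df2 <- data.frame(
  Id = c(1, 2),
  Start = as.Date(c("2021-06-01", "2021-08-01")),
  End = as.Date(c("2021-08-15", "2021-09-15"))
)

# For each row of df2, count the df1 dates inside [Start, End]
df2$Count <- vapply(
  seq_len(nrow(df2)),
  function(j) sum(df1$date >= df2$Start[j] & df1$date <= df2$End[j]),
  integer(1)
)
df2
```

For the example data this gives counts of 4 and 3, in line with the other answers.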

Group consecutive dates [duplicate question, but can't make it work with my data]

I have a database with 142 columns, one of which is called "Date" (of class POSIXct). I'd like to make a new column from it that groups consecutive dates together. Dates with more than 2 days separating one another are categorized into separate groups.
I'd also like to name the level of the group with the name of month the consecutive dates start in (For example: Jan. 3rd, 2018 -> Jan. 12th 2018 = group level called "January sampling event"; Feb 27th, 2018 -> March 1st, 2018 = group level called "February sampling event"; etc...).
I've seen very similar questions like Group consecutive dates in R and R: group dates that are next to each other, but just can't get it to work for my data.
EDIT:
My data example (for some reason, the last row shows that dates separated by over a year are grouped together):
> dput(df)
structure(list(Date = structure(c(17534, 17535, 17536, 17537,
18279, 18280, 18281, 18282, 17932), class = "Date"), group = c(1,
1, 1, 1, 2, 2, 2, 2, 2)), row.names = c(NA, -9L), class = c("tbl_df",
"tbl", "data.frame"))
My attempt:
df$group <- 1 + c(0, cumsum(ifelse(diff(df$Date) > 1, 1, 0)))
Remove time from date time
It's hard to tell exactly what the problem is without seeing your data (or similar example data), but my guess is that the date-time format (the 00:00:00 part) is tripping up as.Date.
One solution would be to extract just the date part and then try again with just the date part:
# here are your date times
date_time <- "2018-01-03 00:00:00"
# this looks for 4 digits between 0 and 9, followed by a dash, followed by 2 digits between 0 and 9,followed by a dash, followed by 2 digits between 0 and 9
date_pattern <- " ?([0-9]{4}-[0-9]{2}-[0-9]{2}) ?"
#need this library
library(stringr)
library(magrittr) #for pipes
#this pulls out text matching the pattern we specified in date pattern
date_new <- str_extract(date_time, date_pattern) %>%
str_squish() # this removes white space
# this is the new date without the time
date_new
# then we convert to as date
date_new <- as.Date(date_new)
See if converting your date column to just dates and then rerunning your grouping works.
If you have dates in different formats and need to adapt the regular expression, here's something about regular expressions: https://stackoverflow.com/a/49286794/16502170
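If the column is already POSIXct rather than text, the regular expression may not be needed at all. A minimal sketch, assuming the timestamps are meant to be read in UTC (the tz value here is my assumption, so use whatever zone your data was recorded in):

```r
# A POSIXct date-time like the one in the example above
date_time <- as.POSIXct("2018-01-03 00:00:00", tz = "UTC")

# Drop the time part; passing tz explicitly keeps the calendar day
# from shifting when the session time zone differs from the data's
date_only <- as.Date(date_time, tz = "UTC")

class(date_only)
format(date_only)
```

Passing tz explicitly matters mostly for times close to midnight, where a different zone can land on the previous or next day.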
Group dates
Let's start with an example data frame that contains a date column
# here's a bunch of example dates:
library(lubridate)
dates2 <- seq.Date(as.Date("2018-03-01"),by="days",length.out = 60)
#here's the dataframe
exampl_df <- data.frame(animals = rep(c("cats","dogs","rabbits"),20), dates=dates2,
numbers= rep(1:3,20))
Here's what it looks like:
head(exampl_df)
animals dates numbers
1 cats 2018-03-01 1
2 dogs 2018-03-02 2
3 rabbits 2018-03-03 3
4 cats 2018-03-04 1
5 dogs 2018-03-05 2
6 rabbits 2018-03-06 3
Then let's make a sequence of every day between the minimum and maximum date in the sequence. This step is important because there may be missing dates in our data that we still want counting towards the separation between days.
# this is a day by day sequence from the earliest day in your data to the latest day
date_sequence <- seq.Date(from = min(dates2),max(dates2),by="day")
Then let's make a sequence of numbers each repeated seven times. If you wanted to group every three days, you could change each to 3. Then the length.out= length(date_sequence) tells R to make this vector have as many entries as the min to max date sequence has:
# and then if you want a new group every seven days you can make this number sequence
groups <- rep(1:length(date_sequence),each= 7, length.out = length(date_sequence) )
Then let's attach the groups to the date_sequence to make a grouping index
date_grouping_index <- data.frame(a=date_sequence,b=groups)
Then you can do a join to attach the groups to the original dataframe:
library(dplyr)
example_df2 <- exampl_df %>%
inner_join(date_grouping_index, by=c("dates"="a"))
This is what we get:
head(example_df2,n=10)
animals dates numbers b
1 cats 2018-03-01 1 1
2 dogs 2018-03-02 2 1
3 rabbits 2018-03-03 3 1
4 cats 2018-03-04 1 1
5 dogs 2018-03-05 2 1
6 rabbits 2018-03-06 3 1
7 cats 2018-03-07 1 1
8 dogs 2018-03-08 2 2
9 rabbits 2018-03-09 3 2
10 cats 2018-03-10 1 2
Then you should be able to group_by() or aggregate() your data using column b
Using the data provided in the question
#original data
df <- structure(list(Date = structure(c(17534, 17535, 17536, 17537,
18279, 18280, 18281, 18282, 17932), class = "Date"), group = c(1,
1, 1, 1, 2, 2, 2, 2, 2)), row.names = c(NA, -9L), class = c("tbl_df",
"tbl", "data.frame"))
#plus extra step
df$group2 <- 1 + c(0, cumsum(ifelse(diff(df$Date) > 1, 1, 0)))
Method described above
date_sequence <- seq.Date(from = min(df$Date),max(df$Date),by="day")
groups <- rep(1:length(date_sequence),each= 7, length.out = length(date_sequence) )
date_grouping_index <- data.frame(a=date_sequence,groups=groups)
example_df2<- df %>%
inner_join(date_grouping_index, by=c("Date"="a"))
Looks like it worked?
example_df2
# A tibble: 9 x 4
Date group group2 groups
<date> <dbl> <dbl> <int>
1 2018-01-03 1 1 1
2 2018-01-04 1 1 1
3 2018-01-05 1 1 1
4 2018-01-06 1 1 1
5 2020-01-18 2 2 107
6 2020-01-19 2 2 107
7 2020-01-20 2 2 107
8 2020-01-21 2 2 107
9 2019-02-05 2 2 57
Here's something you could do to make group names with the month and year in them:
example_df2$group_name <- paste0("sampling number ",
example_df2$groups,
" (",
month.name[month(example_df2$Date)],
"-",
year(example_df2$Date),
")")

R: Loops, Dplyr and lubridate, how to combine them

I'm new to R and I'm facing a problem. I have a date vector and a dataframe containing sales values and coverage start and end dates.
I need to defer the sale value at each analysis date. For the first analysis period, I can write code that gives me the desired answer. However, my real data has 200K+ rows and 50+ analysis periods.
I'm not able to build a loop or find an alternative function in R that creates the variables Aux[i] and Test[i] for each of the dates in the vec_date vector.
The following is an example of code that works for the first analysis period.
library(tidyverse)
library(lubridate)
df <- tibble(DateIn = c(ymd("2021-10-21", "2021-12-25", "2022-05-11")),
DateFin = c(ymd("2022-03-10", "2022-07-12", "2023-02-15")),
Premium = c(11000, 5000, 24500))
date <- ymd("2021-12-31")
vec_date <- date %m+% months(seq(0, 12, by = 6))
df_new <- df |>
mutate(duration = as.numeric(DateFin - DateIn),
Pr_day = Premium/duration,
Aux1 = if_else(DateIn > vec_date[1] | DateFin < vec_date[1], "N", "Y"),
test1 = if_else(Aux1 == "Y" & DateFin > vec_date[1], as.numeric(DateFin - vec_date[1])*Pr_day,
if_else(DateIn > vec_date[1], Premium, 0)))
Does anyone have any idea how I could build this loop, or is there any R function/package that allows me to perform this interaction between my df dataframe and vec_date vector?
Edit: an outline of the format you would need as a result would be:
df_final <- tibble(DateIn = c(ymd("2021-10-21", "2021-12-25", "2022-05-11")),
DateFin = c(ymd("2022-03-10", "2022-07-12", "2023-02-15")),
Premium = c(11000, 5000, 24500),
Aux1 = c("Y", "Y", "N"),
test1 = c(5421.429, 4849.246, 24500.000),
Aux2 = c("N", "Y", "Y"),
test2 = c(0.0000, 301.5075, 20125.0000),
Aux3 = c("N", "N", "Y"),
test3 = c(0, 0, 4025))
Where, Aux1 and test1 are the results referring to vec_date[1], 2 = vec_date[2], 3 = vec_date[3]. For me it is important to keep the resulting variables in the same dataframe because later analysis will be done.
As @Jon Spring suggests in the comments, probably the preferred approach here
would be to use tidyr::complete() (or the closely related tidyr::crossing(),
which is what the code below uses) to extend your data frame, repeating each
row in it for each of your analysis dates. Then, you can stick to vectorized
calculations and get the analysis date column in the resulting data, too.
Below is how to do just that with the example data you provided. I took the
liberty of renaming some columns, and simplifying the control-flow based
calculation according to my understanding of the problem, based on what you
shared.
First, the example data slightly reframed:
library(tidyverse)
library(lubridate)
policies <- tibble(
policy_id = seq_len(3),
start = ymd("2021-10-21", "2021-12-25", "2022-05-11"),
end = ymd("2022-03-10", "2022-07-12", "2023-02-15"),
premium = c(11000, 5000, 24500)
)
policies
#> # A tibble: 3 x 4
#> policy_id start end premium
#> <int> <date> <date> <dbl>
#> 1 1 2021-10-21 2022-03-10 11000
#> 2 2 2021-12-25 2022-07-12 5000
#> 3 3 2022-05-11 2023-02-15 24500
Then, finding remaining prorated premiums for policies at given dates:
start_date <- ymd("2021-12-31")
dates <- start_date %m+% months(seq(0, 12, by = 6))
policies %>%
mutate(
days = as.numeric(end - start),
daily_premium = premium / days
) %>%
crossing(date = dates) %>%
mutate(
days_left = pmax(0, end - pmax(start, date)),
premium_left = days_left * daily_premium
) %>%
select(policy_id, date, days_left, premium_left)
#> # A tibble: 9 x 4
#> policy_id date days_left premium_left
#> <int> <date> <dbl> <dbl>
#> 1 1 2021-12-31 69 5421.
#> 2 1 2022-06-30 0 0
#> 3 1 2022-12-31 0 0
#> 4 2 2021-12-31 193 4849.
#> 5 2 2022-06-30 12 302.
#> 6 2 2022-12-31 0 0
#> 7 3 2021-12-31 280 24500
#> 8 3 2022-06-30 230 20125
#> 9 3 2022-12-31 46 4025
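If you do need the wide layout from the question (Aux1/test1, Aux2/test2, ... in the same data frame), a plain loop over the analysis dates is one option. This is only a sketch in base R: it keeps the question's column names, hard-codes the three analysis dates shown in the output above, and uses the same simplified "days left times daily premium" rule as this answer.

```r
df <- data.frame(
  DateIn  = as.Date(c("2021-10-21", "2021-12-25", "2022-05-11")),
  DateFin = as.Date(c("2022-03-10", "2022-07-12", "2023-02-15")),
  Premium = c(11000, 5000, 24500)
)
vec_date <- as.Date(c("2021-12-31", "2022-06-30", "2022-12-31"))

# Premium earned per day of coverage
pr_day <- df$Premium / as.numeric(df$DateFin - df$DateIn)

for (i in seq_along(vec_date)) {
  d <- vec_date[i]
  # "Y" when the analysis date falls inside the coverage period
  in_force <- df$DateIn <= d & df$DateFin >= d
  # Days of coverage remaining after max(DateIn, d), never negative
  days_left <- pmax(
    0,
    as.numeric(df$DateFin) - pmax(as.numeric(df$DateIn), as.numeric(d))
  )
  df[[paste0("Aux", i)]] <- ifelse(in_force, "Y", "N")
  df[[paste0("test", i)]] <- days_left * pr_day
}
df
```

Each pass through the loop is still vectorized over the rows, so the cost grows with the number of analysis dates rather than with rows times dates in R-level iterations.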

Combine the contents of two columns into one column using R [duplicate]

I got some data like this
structure(list(id = c(1, 1, 1), time1 = c(10, 20, 30), time2 = c(15, 25, 35)), row.names = c(NA, 3L), class = "data.frame")
and I want to create a single column from the two columns in the above data
structure(list(id = c(1, 1, 1, 1, 1, 1), time = c(10, 15, 20, 25, 30, 35)), row.names = c(NA, 6L), class = "data.frame")
I don't think it's the same as converting into long format, because I don't want two columns as a result of gather(), one with the names of the columns used and one with the values.
We can use pivot_longer and this should be more general as it can also do reshaping based on other patterns and multiple columns as well. Note that pivot_longer succeeds the reshape2 function melt with more enhanced capabilities and bug fixes
library(dplyr)
library(tidyr)
pivot_longer(df1, cols = time1:time2, values_to = 'time') %>%
select(-name)
Output:
# A tibble: 6 x 2
# id time
# <dbl> <dbl>
#1 1 10
#2 1 15
#3 1 20
#4 1 25
#5 1 30
#6 1 35
Or using base R with stack
transform(stack(df1[-1])[1], id = rep(df1$id, 2))[2:1]
Or can use data.frame with unlist
data.frame(id = df1$id, value = unlist(df1[-1], use.names = FALSE))
Alternative to tidyr, though that's a good way to do it:
reshape2::melt(dat, "id")[,-2]
# id value
# 1 1 10
# 2 1 20
# 3 1 30
# 4 1 15
# 5 1 25
# 6 1 35
(Normally it includes the pivoted column names as a column itself, so the [,-2] removes that since your expected output didn't have it. You can do just melt(.) if you want/need to keep it.)
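Note that the unlist and melt versions stack the columns one after the other (10, 20, 30, 15, 25, 35), while the expected output in the question interleaves them row by row. If that order matters, here is a small base R sketch; the names time_cols and long are just illustrative.

```r
df1 <- data.frame(
  id = c(1, 1, 1),
  time1 = c(10, 20, 30),
  time2 = c(15, 25, 35)
)
time_cols <- c("time1", "time2")

long <- data.frame(
  # Repeat each id once per time column
  id = rep(df1$id, each = length(time_cols)),
  # Transposing first makes as.vector() read the values row by row
  time = as.vector(t(as.matrix(df1[time_cols])))
)
long
```

The transpose is what produces 10, 15, 20, 25, 30, 35: as.vector() reads a matrix column by column, and after t() each column is one original row.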

Create a user generated function in R which creates a new column of dates based on values from other columns

Let's say these are the first few columns of my dataset:
library(tidyverse)
df <- tibble(Year = rep(2020, times = 5),
Month = seq(1:5),
DayOfMonth = seq(1:5),
DayOfWeek = seq(1:5))
# A tibble: 5 x 4
Year Month DayOfMonth DayOfWeek
<dbl> <int> <int> <int>
1 2020 1 1 1
2 2020 2 2 2
3 2020 3 3 3
4 2020 4 4 4
5 2020 5 5 5
At the moment the date is split into different columns as above. I would like to create a function that takes the value from each relevant column for each row in the dataset and adds a new column which combines those values into a yyyy/mm/dd format.
For example, I want the first row to have a corresponding value in a new "Date" column to be 2020/01/01 (same as 1 January 2020)
Since I'm new to R I don't have a great understanding of making functions myself so I'm struggling to find a starting point. Any help is much appreciated :)
This should do it:
library(tidyverse)
df <- tibble(Year = rep(2020, times = 5),
Month = seq(1:5),
DayOfMonth = seq(1:5),
DayOfWeek = seq(1:5))
out <- apply(df, MARGIN = 1, paste, collapse="/")
[1] "2020/1/1/1" "2020/2/2/2" "2020/3/3/3" "2020/4/4/4" "2020/5/5/5"
apply with MARGIN = 1 goes through each row and calls paste on the row vector, collapsing its values into a single string separated by "/".
Edit: the "DayOfWeek" column is probably not meant to be part of the date, so leave it out by restricting the call to the first three columns:
out <- apply(df[,1:3], MARGIN = 1, paste,collapse="/")
test <- as.Date(out)
test
[1] "2020-01-01" "2020-02-02" "2020-03-03" "2020-04-04" "2020-05-05"
