FOR Loop with multiple parameters from a dataframe in R

I would like to know whether it is possible to build a FOR loop in R which would change multiple parameters at every run.
I have a parameter dataframe [df_params] which looks like this:
group person date_from           date_to
1     Mike   2020-10-01 12:00:00 2020-10-01 13:00:00
2     Mike   2020-10-04 09:00:00 2020-10-07 17:00:00
3     Dave   2020-10-07 12:00:00 2020-10-07 13:00:00
4     Dave   2020-10-09 09:00:00 2020-10-11 17:00:00
I would like to loop over a larger dataframe [df] and get only the rows matching parameters of individual rows in the "df_params" dataframe.
The large dataframe [df] looks like this:
person datetime            books tasks done
Mike   2020-10-01 12:15:00 5     7     2
Mike   2020-10-01 12:17:00 5     7     3
Mike   2020-10-01 18:00:00 5     7     4
Mike   2020-10-02 12:00:00 5     5     0
Mike   2020-10-04 09:08:00 5     3     3
Mike   2020-10-09 12:00:00 5     7     1
Dave   2020-10-07 12:22:00 7     5     1
Dave   2020-10-08 02:34:00 7     5     2
Dave   2020-10-09 07:00:00 7     3     3
Dave   2020-10-09 08:00:00 7     8     5
Dave   2020-10-09 09:48:00 7     7     2
Nick   2020-10-01 13:00:00 3     7     3
Nick   2020-10-02 12:58:00 3     3     2
Nick   2020-10-03 10:02:00 3     7     1
The desired result would look like this:
person datetime            books tasks done group
Mike   2020-10-01 12:15:00 5     7     2    1
Mike   2020-10-01 12:17:00 5     7     3    1
Mike   2020-10-04 09:08:00 5     3     3    2
Dave   2020-10-07 12:22:00 7     5     1    3
Dave   2020-10-09 09:48:00 7     7     2    4
Is something like this possible in R?
Thank you very much for any suggestions.

This might be a slightly expensive solution if your datasets are very large, but it outputs the desired result.
I don't know if your date variables are already in date format; below I convert them with the lubridate package just in case they aren't.
Also, I create the variable date_interval that will be used later for a filtering condition.
library(dplyr)
library(lubridate)
# convert to date format
df_params <- df_params %>%
  mutate(
    date_from = ymd_hms(date_from),
    date_to = ymd_hms(date_to),
    # create interval
    date_interval = interval(date_from, date_to)
  )
df <- df %>%
  mutate(datetime = ymd_hms(datetime))
After this manipulation step, I use a left_join on the person name, which pairs every row of df with every parameter row for the same person. The intermediate dataframe is therefore larger than df, which is why I said before that this operation might be a little expensive. I then keep only the rows where datetime falls within the interval created above.
left_join(df, df_params, by = "person") %>%
  filter(datetime %within% date_interval) %>%
  select(person:group)
# person datetime books tasks done group
# 1 Mike 2020-10-01 12:15:00 5 7 2 1
# 2 Mike 2020-10-01 12:17:00 5 7 3 1
# 3 Mike 2020-10-04 09:08:00 5 3 3 2
# 4 Dave 2020-10-07 12:22:00 7 5 1 3
# 5 Dave 2020-10-09 09:48:00 7 7 2 4
Starting data
df_params <- read.table(text="
group person date_from date_to
1 Mike 2020-10-01T12:00:00 2020-10-01T13:00:00
2 Mike 2020-10-04T09:00:00 2020-10-07T17:00:00
3 Dave 2020-10-07T12:00:00 2020-10-07T13:00:00
4 Dave 2020-10-09T09:00:00 2020-10-11T17:00:00", header=T)
df <- read.table(text="
person datetime books tasks done
Mike 2020-10-01T12:15:00 5 7 2
Mike 2020-10-01T12:17:00 5 7 3
Mike 2020-10-01T18:00:00 5 7 4
Mike 2020-10-02T12:00:00 5 5 0
Mike 2020-10-04T09:08:00 5 3 3
Mike 2020-10-09T12:00:00 5 7 1
Dave 2020-10-07T12:22:00 7 5 1
Dave 2020-10-08T02:34:00 7 5 2
Dave 2020-10-09T07:00:00 7 3 3
Dave 2020-10-09T08:00:00 7 8 5
Dave 2020-10-09T09:48:00 7 7 2
Nick 2020-10-01T13:00:00 3 7 3
Nick 2020-10-02T12:58:00 3 3 2
Nick 2020-10-03T10:02:00 3 7 1 ", header=T)
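Since the question asks specifically about a FOR loop, the same filtering can also be sketched as an explicit loop over the rows of df_params in base R. This is only an illustration under stated assumptions, not part of the answer above: the datetimes are parsed as UTC, both ends of each interval are treated as inclusive, and the names pieces, keep, piece and result are made up for the example.

```r
# Rebuild the question's data with POSIXct columns (UTC is an assumption).
df_params <- data.frame(
  group = 1:4,
  person = c("Mike", "Mike", "Dave", "Dave"),
  date_from = as.POSIXct(c("2020-10-01 12:00:00", "2020-10-04 09:00:00",
                           "2020-10-07 12:00:00", "2020-10-09 09:00:00"), tz = "UTC"),
  date_to = as.POSIXct(c("2020-10-01 13:00:00", "2020-10-07 17:00:00",
                         "2020-10-07 13:00:00", "2020-10-11 17:00:00"), tz = "UTC"),
  stringsAsFactors = FALSE
)
df <- data.frame(
  person = rep(c("Mike", "Dave", "Nick"), times = c(6, 5, 3)),
  datetime = as.POSIXct(c(
    "2020-10-01 12:15:00", "2020-10-01 12:17:00", "2020-10-01 18:00:00",
    "2020-10-02 12:00:00", "2020-10-04 09:08:00", "2020-10-09 12:00:00",
    "2020-10-07 12:22:00", "2020-10-08 02:34:00", "2020-10-09 07:00:00",
    "2020-10-09 08:00:00", "2020-10-09 09:48:00",
    "2020-10-01 13:00:00", "2020-10-02 12:58:00", "2020-10-03 10:02:00"
  ), tz = "UTC"),
  books = rep(c(5, 7, 3), times = c(6, 5, 3)),
  tasks = c(7, 7, 7, 5, 3, 7, 5, 5, 3, 8, 7, 7, 3, 7),
  done = c(2, 3, 4, 0, 3, 1, 1, 2, 3, 5, 2, 3, 2, 1),
  stringsAsFactors = FALSE
)

# One pass per parameter row: each iteration changes person, date_from,
# date_to and group at the same time.
pieces <- vector("list", nrow(df_params))
for (i in seq_len(nrow(df_params))) {
  p <- df_params[i, ]
  keep <- df$person == p$person &
    df$datetime >= p$date_from &
    df$datetime <= p$date_to
  piece <- df[keep, ]
  piece$group <- rep(p$group, nrow(piece))  # tag the matching rows
  pieces[[i]] <- piece
}
result <- do.call(rbind, pieces)
rownames(result) <- NULL
result
```

Collecting the pieces in a preallocated list and binding them once at the end avoids growing a dataframe inside the loop, which matters if df_params has many rows.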


How to combine two datasets in R using two 'by' arguments for the dplyr *_join() functions?

Let's say these are my two datasets:
# This is dataset "df"
df <- tibble(country = rep(c("US", "UK", "FRA", "SPA"), times = 3),
             date = rep(c("2020-01-01", "2020-02-01", "2020-03-01"), each = 4))
country date
1 FRA 2020-01-01
2 SPA 2020-01-01
3 UK 2020-01-01
4 US 2020-01-01
5 FRA 2020-02-01
6 SPA 2020-02-01
7 UK 2020-02-01
8 US 2020-02-01
9 FRA 2020-03-01
10 SPA 2020-03-01
11 UK 2020-03-01
12 US 2020-03-01
# This is dataset "dd"
dd <- tibble(country = rep(c("US", "UK", "FRA"), times = 3),
             date = rep(c("2020-01-01", "2020-02-01", "2020-03-01"), each = 3),
             cases = seq(1:9),
             deaths = seq(from = 2, to = 18, by = 2))
country date cases deaths
1 FRA 2020-01-01 1 2
2 UK 2020-01-01 2 4
3 US 2020-01-01 3 6
4 FRA 2020-02-01 4 8
5 UK 2020-02-01 5 10
6 US 2020-02-01 6 12
7 FRA 2020-03-01 7 14
8 UK 2020-03-01 8 16
9 US 2020-03-01 9 18
I would like to create a new dataset called dj which combines df and the cases column in dd by country and date. I also want NA to be printed in the cases column (in dj) where there is missing data (i.e. for country "SPA").
In essence, I am trying to get dj to have 12 rows and 3 columns (country, date, cases), with NA for cases in "SPA". Like this:
country date cases
1 FRA 2020-01-01 1
2 SPA 2020-01-01 NA
3 UK 2020-01-01 2
4 US 2020-01-01 3
5 FRA 2020-02-01 4
6 SPA 2020-02-01 NA
7 UK 2020-02-01 5
8 US 2020-02-01 6
9 FRA 2020-03-01 7
10 SPA 2020-03-01 NA
11 UK 2020-03-01 8
12 US 2020-03-01 9
I know you can combine two datasets using the dplyr package's *_join() functions, but how can I use these when I only want to merge a dataset with one particular column from another dataset and use more than one 'by' argument (i.e. country and date)?
I'm having difficulty in figuring this out myself since I'm new to R, so any help would be very much appreciated :)
With dplyr you can use left_join and then select which fields you do or do not want. As ?left_join explains, the "left" dataframe (df here) keeps all of its records, which preserves the SPA rows.
You could also perform a select on dd before the join to remove 'deaths'. The important piece here is making sure you're using the correct join to keep all the records within df.
library(dplyr)
dj <- left_join(df, dd, by = c('date', 'country')) %>%
  select(-deaths) # OR select(country, date, cases)
Output:
# A tibble: 12 x 3
country date cases
<chr> <chr> <int>
1 FRA 2020-01-01 3
2 FRA 2020-02-01 6
3 FRA 2020-03-01 9
4 SPA 2020-01-01 NA
5 SPA 2020-02-01 NA
6 SPA 2020-03-01 NA
7 UK 2020-01-01 2
8 UK 2020-02-01 5
9 UK 2020-03-01 8
10 US 2020-01-01 1
11 US 2020-02-01 4
12 US 2020-03-01 7
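The "select before the join" variant mentioned above can be sketched without dplyr as well, using base R's merge(). This is a hedged illustration rather than the answer's own code: the data are rebuilt as plain data frames instead of tibbles, and dd_cases is a name made up for the example.

```r
df <- data.frame(
  country = rep(c("US", "UK", "FRA", "SPA"), times = 3),
  date = rep(c("2020-01-01", "2020-02-01", "2020-03-01"), each = 4),
  stringsAsFactors = FALSE
)
dd <- data.frame(
  country = rep(c("US", "UK", "FRA"), times = 3),
  date = rep(c("2020-01-01", "2020-02-01", "2020-03-01"), each = 3),
  cases = 1:9,
  deaths = seq(from = 2, to = 18, by = 2),
  stringsAsFactors = FALSE
)

# Keep only the two join keys plus 'cases', so 'deaths' never enters the result.
dd_cases <- dd[, c("country", "date", "cases")]

# all.x = TRUE keeps every row of df (a left join); SPA has no match in dd,
# so its 'cases' value is NA.
dj <- merge(df, dd_cases, by = c("country", "date"), all.x = TRUE)
dj
```

Note that merge() sorts the result by the key columns by default, whereas left_join() keeps the row order of the left dataframe.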
You can use left_join(), as follows:
dj <- left_join(df, dd, by = c('country', 'date')) %>%
  select(-deaths)
dj
Here is the output:
# A tibble: 12 x 3
country date cases
<chr> <chr> <int>
1 FRA 2020-01-01 3
2 FRA 2020-02-01 6
3 FRA 2020-03-01 9
4 SPA 2020-01-01 NA
5 SPA 2020-02-01 NA
6 SPA 2020-03-01 NA
7 UK 2020-01-01 2
8 UK 2020-02-01 5
9 UK 2020-03-01 8
10 US 2020-01-01 1
11 US 2020-02-01 4
12 US 2020-03-01 7

How to print a date when the input is number of days since 01-01-60?

I received a set of dates, but it turns out that time is reported in days since 01-01-1960 in this specific data set.
D_INDDTO
1 20758
2 20856
3 21062
4 19740
5 21222
6 21203
The specific date of interest for Patient 1 is 20758 days since 01-01-60.
I want to create a new covariate u$date containing the specific date of interest in %d%m%y format. I tried
library(tidyverse)
u %>% mutate(date=as.date(D_INDDTO,origin="1960-01-01")
But that did not solve it.
u <- structure(list(D_INDDTO = c(20758, 20856, 21062, 19740, 21222,
21203, 20976, 20895, 18656, 18746)), row.names = c(NA, 10L), class = "data.frame")
Try this:
#Code 1
u %>% mutate(date=as.Date("1960-01-01")+D_INDDTO)
Output:
D_INDDTO date
1 20758 2016-10-31
2 20856 2017-02-06
3 21062 2017-08-31
4 19740 2014-01-17
5 21222 2018-02-07
6 21203 2018-01-19
7 20976 2017-06-06
8 20895 2017-03-17
9 18656 2011-01-29
10 18746 2011-04-29
Or this:
#Code 2
u %>% mutate(date=as.Date(D_INDDTO,origin="1960-01-01"))
Output:
D_INDDTO date
1 20758 2016-10-31
2 20856 2017-02-06
3 21062 2017-08-31
4 19740 2014-01-17
5 21222 2018-02-07
6 21203 2018-01-19
7 20976 2017-06-06
8 20895 2017-03-17
9 18656 2011-01-29
10 18746 2011-04-29
Or this:
#Code 3
u %>% mutate(date=format(as.Date(D_INDDTO,origin="1960-01-01"),'%d%m%y'))
Output:
D_INDDTO date
1 20758 311016
2 20856 060217
3 21062 310817
4 19740 170114
5 21222 070218
6 21203 190118
7 20976 060617
8 20895 170317
9 18656 290111
10 18746 290411
If more customization is required:
#Code 4
u %>% mutate(date=format(as.Date(D_INDDTO,origin="1960-01-01"),'%d-%m-%Y'))
Output:
D_INDDTO date
1 20758 31-10-2016
2 20856 06-02-2017
3 21062 31-08-2017
4 19740 17-01-2014
5 21222 07-02-2018
6 21203 19-01-2018
7 20976 06-06-2017
8 20895 17-03-2017
9 18656 29-01-2011
10 18746 29-04-2011
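One thing worth keeping in mind with Code 3 and Code 4 is that format() returns character strings, not dates. A minimal sketch of keeping both a real Date column and a display column, using the first three values from the question (the names origin, date_label and days_back are made up for the example):

```r
u <- data.frame(D_INDDTO = c(20758, 20856, 21062))
origin <- as.Date("1960-01-01")

# Keep a real Date column for sorting and date arithmetic ...
u$date <- origin + u$D_INDDTO
# ... and a separate character column purely for display.
u$date_label <- format(u$date, "%d%m%y")

# Round trip: subtracting the origin recovers the original day counts.
days_back <- as.numeric(u$date - origin)

class(u$date)        # "Date"
class(u$date_label)  # "character"
u
```

Sorting or subtracting on date_label would operate on text, so any further date logic should use the date column.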

Cumulative sums in R with multiple conditions?

I am trying to figure out how to create a cumulative or rolling sum in R based on a few conditions.
The data set in question is a few million observations of library loans, and the question is to determine how many copies of a given book/title would be necessary to meet demand.
So for each Title.ID, begin with 1 copy for the first instance (ID.Index). Then for each instance after, determine whether another copy is needed based on whether the REQUEST.DATE is within 16 weeks (112 days) of the previous request.
# A tibble: 15 x 3
# Groups: Title.ID [2]
REQUEST.DATE Title.ID ID.Index
<date> <int> <int>
1 2013-07-09 2 1
2 2013-08-07 2 2
3 2013-08-20 2 3
4 2013-09-08 2 4
5 2013-09-28 2 5
6 2013-12-27 2 6
7 2014-02-10 2 7
8 2014-03-12 2 8
9 2014-03-14 2 9
10 2014-08-27 2 10
11 2014-04-27 6 1
12 2014-08-01 6 2
13 2014-11-13 6 3
14 2015-02-14 6 4
15 2015-05-14 6 5
The tricky part is that determining whether a new copy is needed is based not only on the number of request (ID.Index) and the REQUEST.DATE of some previous loan, but also on the preceding accumulating sum.
For instance, for the third request for title 2 (Title.ID 2, ID.Index 3), there are now two copies, so to determine whether a new copy is needed, you have to see whether the REQUEST.DATE is within 112 days of the first (not second) request (ID.Index 1). By contrast, for the third request for title 6 (Title.ID 6, ID.Index 3), there is only one copy available (since request 2 was not within 112 days), so determining whether a new copy is needed is based on looking back to the REQUEST.DATE of ID.Index 2.
The desired output ("Copies") would take each new request (ID.Index), then look back to the relevant REQUEST.DATE based on the number of available copies, and doing that would mean looking at the accumulating sum for the preceding calculation. (Note: The max number of copies would be 10.)
I've provided the desired output for the sample below ("Copies").
# A tibble: 15 x 4
# Groups: Title.ID [2]
REQUEST.DATE Title.ID ID.Index Copies
<date> <int> <int> <dbl>
1 2013-07-09 2 1 1
2 2013-08-07 2 2 2
3 2013-08-20 2 3 3
4 2013-09-08 2 4 4
5 2013-09-28 2 5 5
6 2013-12-27 2 6 5
7 2014-02-10 2 7 5
8 2014-03-12 2 8 5
9 2014-03-14 2 9 5
10 2014-08-27 2 10 5
11 2014-04-27 6 1 1
12 2014-08-01 6 2 2
13 2014-11-13 6 3 2
14 2015-02-14 6 4 2
15 2015-05-14 6 5 2
I recognize that the solution will be way beyond my abilities, so I will be extremely grateful for any solution or advice about how to solve this type of problem in the future.
Thanks a million!
*4/19 update: new examples where new copy may be added after delay, i.e., not in sequence. I've also added columns showing days since a given previous request, which helps checking whether a new copy should be added, based on how many copies there are.
Sample 2: new copy should be added with third request, since it has only been 96 days since last request (and there is only one copy)
  REQUEST.NUMBER REQUEST.DATE Title.ID ID.Index Days.Since Days.Since2 Days.Since3 Days.Since4 Days.Since5 Copies
  <fct>          <date>       <int>    <int>    <drtn>     <drtn>      <drtn>      <drtn>      <drtn>      <int>
1 BRO-10680332   2013-10-17   6        1        NA days    NA days     NA days     NA days     NA days     1
2 PEN-10835735   2014-04-27   6        2        192 days   NA days     NA days     NA days     NA days     1
3 PEN-10873506   2014-08-01   6        3        96 days    288 days    NA days     NA days     NA days     1
4 PEN-10951264   2014-11-13   6        4        104 days   200 days    392 days    NA days     NA days     1
5 PEN-11029526   2015-02-14   6        5        93 days    197 days    293 days    485 days    NA days     1
6 PEN-11106581   2015-05-14   6        6        89 days    182 days    286 days    382 days    574 days    1
Sample 3: new copy should be added with last request, since there are two copies, and the oldest request is 45 days.
  REQUEST.NUMBER REQUEST.DATE Title.ID ID.Index Days.Since Days.Since2 Days.Since3 Days.Since4 Days.Since5 Copies
  <fct>          <date>       <int>    <int>    <drtn>     <drtn>      <drtn>      <drtn>      <drtn>      <int>
1 BRO-10999392   2015-01-20   76       1        NA days    NA days     NA days     NA days     NA days     1
2 YAL-11004302   2015-01-22   76       2        2 days     NA days     NA days     NA days     NA days     2
3 COR-11108471   2015-05-18   76       3        116 days   118 days    NA days     NA days     NA days     2
4 HVD-11136632   2015-07-27   76       4        70 days    186 days    188 days    NA days     NA days     2
5 MIT-11164843   2015-09-09   76       5        44 days    114 days    230 days    232 days    NA days     2
6 HVD-11166239   2015-09-10   76       6        1 days     45 days     115 days    231 days    233 days    2
You can use the runner package to apply any R function over a cumulative window.
Here we execute the function f with x = REQUEST.DATE. It simply counts the observations that fall within 112 days of the earliest date in the window, i.e. up to min(x) + 112.
library(dplyr)
library(runner)
data %>%
  group_by(Title.ID) %>%
  mutate(
    Copies = runner(
      x = REQUEST.DATE,
      f = function(x) {
        length(x[x <= (min(x + 112))])
      }
    )
  )
# # A tibble: 15 x 4
# # Groups: Title.ID [2]
# REQUEST.DATE Title.ID ID.Index Copies
# <date> <int> <int> <int>
# 1 2013-07-09 2 1 1
# 2 2013-08-07 2 2 2
# 3 2013-08-20 2 3 3
# 4 2013-09-08 2 4 4
# 5 2013-09-28 2 5 5
# 6 2013-12-27 2 6 5
# 7 2014-02-10 2 7 5
# 8 2014-03-12 2 8 5
# 9 2014-03-14 2 9 5
# 10 2014-08-27 2 10 5
# 11 2014-04-27 6 1 1
# 12 2014-08-01 6 2 2
# 13 2014-11-13 6 3 2
# 14 2015-02-14 6 4 2
# 15 2015-05-14 6 5 2
data
data <- read.table(
text = " REQUEST.DATE Title.ID ID.Index
1 2013-07-09 2 1
2 2013-08-07 2 2
3 2013-08-20 2 3
4 2013-09-08 2 4
5 2013-09-28 2 5
6 2013-12-27 2 6
7 2014-02-10 2 7
8 2014-03-12 2 8
9 2014-03-14 2 9
10 2014-08-27 2 10
11 2014-04-27 6 1
12 2014-08-01 6 2
13 2014-11-13 6 3
14 2015-02-14 6 4
15 2015-05-14 6 5",
header = TRUE)
data$REQUEST.DATE <- as.Date(as.character(data$REQUEST.DATE))
I was able to find a workable solution based on finding the max number of other requests within 112 days of a request (after creating return date), for each title.
data$RETURN.DATE <- as.Date(data$REQUEST.DATE + 112)
data <- data %>%
  group_by(Title.ID) %>%
  mutate(
    Copies = sapply(REQUEST.DATE, function(x)
      sum(as.Date(REQUEST.DATE) <= as.Date(x) &
        as.Date(RETURN.DATE) >= as.Date(x)
      ))
  )
Then I de-duplicated the list of titles, using the max number for each title, and added it back to the original data.
I still think there's a solution to the original problem, where I could go back and see at which point new copies needed to be added (for analysis based on when a title is published), but this works for now.
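For the original, sequential version of the problem, one possible reading of the rule described in the question is: with k copies on hand, compare the current request with the request k positions earlier, and add a copy if the gap is within the window. The sketch below implements that reading with a plain loop. It is an assumption-laden illustration, not a confirmed solution: count_copies and its arguments are hypothetical names, "within 112 days" is taken as <= 112, the dates are assumed to be sorted within a title, and the cap of 10 copies comes from the note in the question.

```r
# Hypothetical helper: running number of copies for ONE title,
# given its request dates in ascending order.
count_copies <- function(dates, window_days = 112, max_copies = 10L) {
  n <- length(dates)
  copies <- integer(n)
  k <- 1L                       # the first request always uses one copy
  for (i in seq_len(n)) {
    if (i > 1L && k < max_copies) {
      # With k copies, the copy that frees up first is the one taken
      # k requests ago, so that is the request to look back to.
      gap <- as.numeric(dates[i] - dates[i - k])
      if (gap <= window_days) k <- k + 1L
    }
    copies[i] <- k
  }
  copies
}

title2 <- as.Date(c("2013-07-09", "2013-08-07", "2013-08-20", "2013-09-08",
                    "2013-09-28", "2013-12-27", "2014-02-10", "2014-03-12",
                    "2014-03-14", "2014-08-27"))
title6 <- as.Date(c("2014-04-27", "2014-08-01", "2014-11-13",
                    "2015-02-14", "2015-05-14"))
title76 <- as.Date(c("2015-01-20", "2015-01-22", "2015-05-18",
                     "2015-07-27", "2015-09-09", "2015-09-10"))

count_copies(title2)   # 1 2 3 4 5 5 5 5 5 5
count_copies(title6)   # 1 2 2 2 2
count_copies(title76)  # 1 2 2 2 2 3
```

The first two results match the "Copies" column of the desired output in the question, and the third adds a copy at the last request of Sample 3, as the question's update describes. With dplyr, the helper could be applied per title inside group_by(Title.ID) %>% mutate(Copies = count_copies(REQUEST.DATE)), provided the rows are sorted by date first.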

How to create a column based on two conditions from other data frame?

I'm trying to create a column that identifies if the row meets two conditions. For example, I have a table similar to this:
> dat <- data.frame(Date = c(rep(c("2019-01-01", "2019-02-01","2019-03-01", "2019-04-01"), 4)),
+ Rep = c(rep("Mike", 4), rep("Tasha", 4), rep("Dane", 4), rep("Trish", 4)),
+ Manager = c(rep("Amber", 2), rep("Michelle", 2), rep("Debbie", 4), rep("Brian", 4), rep("Tim", 3), "Trevor"),
+ Sales = floor(runif(16, min = 0, max = 10)))
> dat
Date Rep Manager Sales
1 2019-01-01 Mike Amber 6
2 2019-02-01 Mike Amber 3
3 2019-03-01 Mike Michelle 9
4 2019-04-01 Mike Michelle 2
5 2019-01-01 Tasha Debbie 9
6 2019-02-01 Tasha Debbie 6
7 2019-03-01 Tasha Debbie 0
8 2019-04-01 Tasha Debbie 4
9 2019-01-01 Dane Brian 3
10 2019-02-01 Dane Brian 6
11 2019-03-01 Dane Brian 6
12 2019-04-01 Dane Brian 1
13 2019-01-01 Trish Tim 6
14 2019-02-01 Trish Tim 7
15 2019-03-01 Trish Tim 6
16 2019-04-01 Trish Trevor 1
Out of the Reps that have switched manager, I would like to identify whether this manager is the first or the second manager with respect to the date. The ideal output would look something like:
Date Rep Manager Sales New_Column
1 2019-01-01 Mike Amber 6 1
2 2019-02-01 Mike Amber 3 1
3 2019-03-01 Mike Michelle 9 2
4 2019-04-01 Mike Michelle 2 2
5 2019-01-01 Trish Tim 6 1
6 2019-02-01 Trish Tim 7 1
7 2019-03-01 Trish Tim 6 1
8 2019-04-01 Trish Trevor 1 2
I have tried a few things but they're not quite working out. I have created two separate data frames where one consists of the first instance of that Rep and associated manager (df1) and the other one consists of the last instance of that rep and associated manager (df2). The code that I have tried that has gotten the closest is:
dat$New_Column <- ifelse(dat$Rep %in% df1$Rep & dat$Manager %in% df1$Manager, 1,
ifelse(dat$Rep %in% df2$Rep & dat$Manager %in% df2$Manager, 2, NA))
However this reads as two separate conditions, rather than having a condition of a condition (i.e. If Mike exists in the first instance and Amber exists in the first instance assign 1 rather than If Mike exists with the manager Amber in the first instance assign 1). Any help would be really appreciated. Thank you!
One option is to group by 'Rep', filter the groups where the number of unique 'Manager' values is 2, and then add a column by matching 'Manager' against the unique elements of 'Manager' to get the indices.
library(dplyr)
dat %>%
  group_by(Rep) %>%
  filter(n_distinct(Manager) == 2) %>%
  mutate(New_Column = match(Manager, unique(Manager)))
# A tibble: 8 x 5
# Groups: Rep [2]
# Date Rep Manager Sales New_Column
# <chr> <chr> <chr> <int> <int>
#1 2019-01-01 Mike Amber 6 1
#2 2019-02-01 Mike Amber 3 1
#3 2019-03-01 Mike Michelle 9 2
#4 2019-04-01 Mike Michelle 2 2
#5 2019-01-01 Trish Tim 6 1
#6 2019-02-01 Trish Tim 7 1
#7 2019-03-01 Trish Tim 6 1
#8 2019-04-01 Trish Trevor 1 2
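The match(Manager, unique(Manager)) idiom numbers managers in order of first appearance, so it only means "first or second with respect to the date" when the rows of each Rep are already in date order. A small base R sketch with Trish's rows deliberately shuffled shows the difference (the object names trish, unsorted_idx and sorted_idx are made up for the example):

```r
# Trish's rows from the question, with the April row moved to the top.
trish <- data.frame(
  Date = c("2019-04-01", "2019-01-01", "2019-02-01", "2019-03-01"),
  Manager = c("Trevor", "Tim", "Tim", "Tim"),
  stringsAsFactors = FALSE
)

# unique() keeps values in order of first appearance, so on unsorted rows
# Trevor is seen first and would be numbered 1.
unsorted_idx <- match(trish$Manager, unique(trish$Manager))
unsorted_idx  # 1 2 2 2

# After sorting by date, Tim is the first manager and Trevor the second.
trish <- trish[order(as.Date(trish$Date)), ]
sorted_idx <- match(trish$Manager, unique(trish$Manager))
sorted_idx    # 1 1 1 2
```

In the dplyr pipeline, adding an arrange(Rep, Date) step before the mutate would give the same guarantee.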

Fixing dates that were coerced into the wrong format

I have a large df with dates that were accidentally coerced into the wrong format.
Data:
id <- c(1:12)
date <- c("2014-01-03","2001-08-14","2001-08-14","2014-06-02","2006-06-14", "2006-06-14",
"2014-08-08","2014-08-08","2008-04-14","2009-12-13","2010-09-14","2012-09-14")
df <- data.frame(id,date)
Structure:
id date
1 1 2014-01-03
2 2 2001-08-14
3 3 2001-08-14
4 4 2014-06-02
5 5 2006-06-14
6 6 2006-06-14
7 7 2014-08-08
8 8 2014-08-08
9 9 2008-04-14
10 10 2009-12-13
11 11 2010-09-14
12 12 2012-09-14
The data set should only include the years 2013 and 2014. The dates 2001-08-14 and 2006-06-14 are most likely 2014-08-01 and 2014-06-06, respectively.
Output:
id date
1 1 2014-01-03
2 2 2014-08-01
3 3 2014-08-01
4 4 2014-06-02
5 5 2014-06-06
6 6 2014-06-06
7 7 2014-08-08
8 8 2014-08-08
9 9 2014-04-08
10 10 2013-12-09
11 11 2014-09-10
12 12 2014-09-12
How can I reconcile this mess?
Package lubridate has the convenient function year that will be useful here.
library(lubridate)
# Convert date to proper date class variable
df$date <- as.Date(df$date)
# Isolate problematic indices; when year is not in 2013 or 2014,
# we'll go to and from character representation. We'll trim
# the "20" in front of the "false year" and then specify the
# proper format to read the character back into a Date class.
tmp.indices <- which(!year(df$date) %in% c(2013, 2014))
df$date[tmp.indices] <- as.Date(substring(as.character(df$date[tmp.indices]),
                                          first = 3), format = "%d-%m-%y")
Result:
id date
1 1 2014-01-03
2 2 2014-08-01
3 3 2014-08-01
4 4 2014-06-02
5 5 2014-06-06
6 6 2014-06-06
7 7 2014-08-08
8 8 2014-08-08
9 9 2014-04-08
10 10 2013-12-09
11 11 2014-09-10
12 12 2014-09-12
We could convert the 'date' column to 'Date' class and extract the year to create a logical index ('indx') that flags the rows whose year is not 2013 or 2014.
df$date <- as.Date(df$date)
indx <- !format(df$date, '%Y') %in% 2013:2014
By using lubridate, convert to 'Date' class using dmy after removing the first two characters.
library(lubridate)
df$date[indx] <- dmy(sub('^..', '', df$date[indx]))
df
# id date
#1 1 2014-01-03
#2 2 2014-08-01
#3 3 2014-08-01
#4 4 2014-06-02
#5 5 2014-06-06
#6 6 2014-06-06
#7 7 2014-08-08
#8 8 2014-08-08
#9 9 2014-04-08
#10 10 2013-12-09
#11 11 2014-09-10
#12 12 2014-09-12
