I have df1 sorted by date like this:
Date <- c("12/17/17","12/19/17","12/20/17","12/30/17","12/31/17","1/1/18")
Jon <- c(388,299,412,NA,NA,353)
Eric <- c(121,NA,321,473,832,NA)
Scott <- c(NA,122,NA,NA,NA,424)
df1 <- data.frame(Date,Jon,Eric,Scott)
df1$Date <- as.Date(df1$Date,format='%m/%d/%y')
#df1
Date Jon Eric Scott
1 12/17/17 388 121 NA
2 12/19/17 299 NA 122
3 12/20/17 412 321 NA
4 12/30/17 NA 473 NA
5 12/31/17 NA 832 NA
6 1/1/18 353 NA 424
I'm trying to create a new list that includes only the data that is within the last 12 days of each person's most recent date with a non-NA value. If there is only one non-NA value within 12 days of the person's most recent non-NA value, then I want to take the 2 most recent non-NA values for that person, even if one falls outside of the 12-day date range.
The code below successfully puts data within the last 12 days of each person's most recent non-NA value in a new list:
df2 <- lapply(df1[-1],function(x) x[which((m=tail(df1$Date[!is.na(x)],1)-df1$Date)>=0&m<=12)])
This code successfully takes the 2 most recent non-NA entries, regardless of whether or not they are within the 12-day range:
df3 <- lapply(df1[-1], function(x) tail(x[!is.na(x)], n = 2))
This code comes very close to doing what I want it to do, except it loses the column names. Notice that the column names are replaced with numbers, unlike the lapply statements above, which both keep the column names.
withinRange <-lapply(df1[-1],function(x)x[which((m=tail(df1$Date[!is.na(x)],1)-df1$Date)>=0&m<=12)]) %>%
lapply(function(x)length(x[!is.na(x)])) %>%
as.data.frame()
df4 <- ifelse(withinRange[colnames(df1[-1])]>1,lapply(df1[-1],function(x) x[which((m=tail(df1$Date[!is.na(x)],1)-df1$Date)>=0&m<=12)]),lapply(df1[-1], function(x) tail(x[!is.na(x)], n = 2)))
How can I maintain the column names?
I would approach this problem using the tidyverse packages.
Data
library(tidyr)
library(dplyr)
library(lubridate)
df <- tibble(
my_date = as.Date(
c("12/17/17", "12/19/17", "12/20/17", "12/30/17", "12/31/17", "1/1/18"),
"%m/%d/%y"
),
jon = c(388, 299, 412, NA, NA, 353),
eric = c(121, NA, 321, 473, 832, NA),
scott = c(NA, 122, NA, NA, NA, 424)
)
Long format data frame
This output feels more natural.
df_long <- df %>%
gather(key, value, -my_date) %>%
drop_na %>%
group_by(key) %>%
mutate(
in_date = if_else(my_date >= max(my_date) - days(12), TRUE, FALSE),
count = sum(in_date)
) %>%
filter(in_date | count < 2) %>%
top_n(2, my_date) %>%
ungroup %>%
select(-c(in_date, count))
df_long
# # A tibble: 6 x 3
# my_date key value
# <date> <chr> <dbl>
# 1 2017-12-20 jon 412
# 2 2018-01-01 jon 353
# 3 2017-12-30 eric 473
# 4 2017-12-31 eric 832
# 5 2017-12-19 scott 122
# 6 2018-01-01 scott 424
Wide format
Thankfully, it is only one additional step to spread to your original columns.
df_long %>% spread(key, value)
# # A tibble: 5 x 4
# my_date eric jon scott
# * <date> <dbl> <dbl> <dbl>
# 1 2017-12-19 NA NA 122
# 2 2017-12-20 NA 412 NA
# 3 2017-12-30 473 NA NA
# 4 2017-12-31 832 NA NA
# 5 2018-01-01 NA 353 424
Seems like the easiest thing to do for me is to store the column headers in a variable and then reattach them:
myHeaders <- names(df1[-1])
withinRange <-lapply(df1[-1],function(x)x[which((m=tail(df1$Date[!is.na(x)],1)-df1$Date)>=0&m<=12)]) %>%
lapply(function(x)length(x[!is.na(x)])) %>%
as.data.frame()
df4 <- ifelse(withinRange[colnames(df1[-1])]>1,lapply(df1[-1],function(x) x[which((m=tail(df1$Date[!is.na(x)],1)-df1$Date)>=0&m<=12)]),lapply(df1[-1], function(x) tail(x[!is.na(x)], n = 2)))
names(df4) <- myHeaders
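For what it's worth, the name loss can also be sidestepped by building both candidate lists first and then choosing between them per column with Map(), which keeps the list names (a sketch using the same df1 as above; recent_window and last_two are just illustrative helper names):
# candidate 1: values within 12 days of each column's most recent non-NA date
recent_window <- lapply(df1[-1], function(x) {
  m <- tail(df1$Date[!is.na(x)], 1) - df1$Date
  x[which(m >= 0 & m <= 12)]
})
# candidate 2: the 2 most recent non-NA values, regardless of date
last_two <- lapply(df1[-1], function(x) tail(x[!is.na(x)], 2))
# keep the 12-day window unless it holds fewer than 2 non-NA values;
# Map() preserves the names of its first list argument
df4 <- Map(function(a, b) if (sum(!is.na(a)) > 1) a else b,
           recent_window, last_two)
names(df4)   # still "Jon", "Eric", "Scott"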
Related
My Tibble:
df1 <- tibble(a = c("123*", "123", "124", "678*", "678", "679", "677"))
# A tibble: 7 x 1
a
<chr>
1 123*
2 123
3 124
4 678*
5 678
6 679
7 677
What it should become:
# A tibble: 3 x 2
a b
<chr> <chr>
1 123 124
2 678 679
3 678 677
The values with stars refer to the following values without stars, until a new starred value appears, and so on.
Each starred value should go to the first column, and the other values (except the ones that are identical to the starred value, minus the star) should go to the second column. If one starred value is followed by several values, they should still be linked to each other, so the value in the first column is duplicated to keep the connection.
I know how to filter and bring the values into each column, but I'm not sure how I would keep the connection.
We can use the tidyverse. Create a grouping column based on the occurrence of * in 'a', extract the numeric part with parse_number, and keep the distinct rows; then, grouped by 'grp', create a new column 'a' from the first value of 'b', drop that first row of each group, and select the two columns.
library(dplyr)
library(stringr)
df1 %>%
transmute(grp = cumsum(str_detect(a, fixed("*"))),
b = readr::parse_number(a)) %>%
distinct(b, .keep_all = TRUE) %>%
group_by(grp) %>%
mutate(a = first(b)) %>%
slice(-1) %>%
ungroup %>%
select(a, b)
Output:
# A tibble: 3 × 2
a b
<dbl> <dbl>
1 123 124
2 678 679
3 678 677
Here is one base R option:
Using cumsum and grepl we split the data on occurrence of *.
In each group, we drop the values which are similar to the star values and create a dataframe with two columns.
Finally, combine the list of dataframes in one combined dataframe.
result <- do.call(rbind, lapply(split(df1,
cumsum(grepl('*', df1$a, fixed = TRUE))), function(x) {
a <- x[[1]]
a[1] <- sub('*', '', a[1], fixed = TRUE)
data.frame(a = a[1], b = a[a != a[1]])
}))
rownames(result) <- NULL
result
# a b
#1 123 124
#2 678 679
#3 678 677
I have time-series data with three columns: a value column, a group_var column (used for grouping), and a date column. For each row in the data frame, I'd like to get the mean of that row's group after further subsetting by a specific timeframe. Here's an example of the code for subsetting:
df$value[df$date >= (current_row$date - 545) & df$date <= (current_row$date - 365)]
After I get this subset I can easily apply mean(), but where I'm stuck is how to get this code to work with something like this:
df %>%
group_by(group_var) %>%
mutate(subset_mean = mean(df$value[df$date >= (current_row$date - 545) & df$date <= (current_row$date - 365)])
)
The issue I see is that I don't think I can use 'df' inside the mutate() line after I group the original 'df'. Also, I'm not sure how I can create a 'current_row' variable for referencing the current row to calculate the data subset.
Edit:
Added example data and reproducible code
library(dplyr)
date <- c("2016-02-03", "2016-06-14", "2016-03-15", "2017-04-16","2016-01-27", "2016-01-13", "2017-04-24", "2017-06-15")
date <- date %>% as.Date(format = "%Y-%m-%d")
val <- c(10, 20, 50, 70, 30, 44, 67, 42)
group_var <- c("A", "B", "B", "A", "B", "A", "A", "B")
df <- data.frame(date, val, group_var)
df %>%
group_by(group_var)
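For reference, here is a minimal sketch of one way to refer to the "current row" inside a grouped mutate(): map over the group's own date column, so that .x plays the role of current_row$date (column names follow the example data above):
library(dplyr)
library(purrr)

df %>%
  group_by(group_var) %>%
  mutate(subset_mean = map_dbl(
    date,
    # inside a grouped mutate, `date` and `val` are the current group's
    # vectors, and `.x` is the current row's date
    ~ mean(val[date >= .x - 545 & date <= .x - 365])
  )) %>%
  ungroup()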
I would suggest using slider::slide_index_dbl for this:
library(dplyr)
df %>%
group_by(group_var) %>%
arrange(group_var, date) %>% # slider 0.1.5 requires the window variable to be ascending
mutate(subset_mean = slider::slide_index_dbl(
val, date, mean, .before = 545, .after = -365
# negative ".after" means the window ends before the current date
)) %>%
ungroup()
With the updated data, I get
#date <- c("2016-02-03", "2016-06-14", "2016-03-15", "2017-04-16","2016-01-27", "2016-01-13", "2017-04-24", "2017-06-15")
# A tibble: 8 x 4
date val group_var subset_mean
<date> <dbl> <chr> <dbl>
1 2016-01-13 44 A NaN
2 2016-02-03 10 A NaN
3 2017-04-16 70 A 27
4 2017-04-24 67 A 27
5 2016-01-27 30 B NaN
6 2016-03-15 50 B NaN
7 2016-06-14 20 B NaN
8 2017-06-15 42 B 33.3
1) This can be done with a self join using SQL:
library(sqldf)
sqldf("select a.date, a.val, a.group_var, avg(b.val) as mean
from df a
left join df b on a.group_var = b.group_var and
b.date between a.date - 595 and a.date - 365
group by a.rowid")
giving:
date val group_var mean
1 2016-02-03 10 A NA
2 2016-06-14 20 B NA
3 2016-03-15 50 B NA
4 2017-04-16 70 A 27.00000
5 2016-01-27 30 B NA
6 2016-01-13 44 A NA
7 2017-04-24 67 A 27.00000
8 2017-06-15 42 B 33.33333
2) or we can use SQL window functions:
sqldf("select date, val, group_var,
avg(val) over (partition by group_var
order by date
range between 595 preceding and 365 preceding) as mean
from df"
)
giving:
date val group_var mean
1 2016-01-13 44 A NA
2 2016-02-03 10 A NA
3 2017-04-16 70 A 27.00000
4 2017-04-24 67 A 27.00000
5 2016-01-27 30 B NA
6 2016-03-15 50 B NA
7 2016-06-14 20 B NA
8 2017-06-15 42 B 33.33333
Lubridate provides a very elegant solution...
library(tidyverse)
library(lubridate)
df = tibble(
value = runif(100,1,100),
group = rep(1:4,25),
dt = as.Date(round(runif(100,1000,2000)), origin = "1970-01-01")
)
first_year <- interval(ymd("1972-01-01"), ymd("1972-12-31"))
sec_year <- interval(ymd("1973-01-01"), ymd("1973-12-31"))
further <- interval(ymd("1974-01-01"), ymd("1975-12-31"))
df <- df %>%
mutate(
range = case_when(
dt %within% first_year ~"1972",
dt %within% sec_year ~"1973",
TRUE ~"1974-1975"
)
)
mean_by_group_interval <- df %>%
group_by(
group,
range
) %>%
summarise(
mean = mean(value)
)
Here is a solution that utilizes the dplyr package.
library(dplyr)
date <- c("2016-02-03", "2016-06-14", "2016-03-15", "2017-04-16","2016-01-27", "2016-01-13", "2017-04-24", "2017-06-15")
date <- date %>% as.Date(format = "%Y-%m-%d")
val <- c(10, 20, 50, 70, 30, 44, 67, 42)
group_var <- c("A", "B", "B", "A", "B", "A", "A", "B")
df <- data.frame(date, val, group_var)
df %>%
group_by(group_var) %>%
arrange(group_var, date) %>%
mutate(
# Determine if the current date - the first date of each group is between 365 and 595 days.
match = between(date - first(date), 365, 595),
# Count the number of dates that are not within the range described above to be used in calculating the mean.
count_false = sum(match == FALSE),
# Calculate the cumulative sum for rows in each group that are not within the range described above.
sum_match_false = ifelse(match == FALSE, cumsum(val), NA),
# Calculate the mean.
mean_match_true = ifelse(match == TRUE, max(sum_match_false, na.rm = TRUE) / count_false, NA)
) %>%
# Return only these variables.
select(date, val, group_var, mean_match_true)
#> date val group_var mean_match_true
#> <date> <dbl> <chr> <dbl>
#> 1 2016-01-13 44 A NA
#> 2 2016-02-03 10 A NA
#> 3 2017-04-16 70 A 27
#> 4 2017-04-24 67 A 27
#> 5 2016-01-27 30 B NA
#> 6 2016-03-15 50 B NA
#> 7 2016-06-14 20 B NA
#> 8 2017-06-15 42 B 33.3
Created on 2021-03-12 by the reprex package (v0.3.0)
I have a strange problem with a calculation, and I am not sure what I should do. I have data that looks like this:
First I need to sort by ID and Date, which I did. Then I need to find the baseline date: only a date whose Duration is <= 0 and closest to 0 can be used as the baseline. Then I need to calculate usable = current score / baseline date score, so the final results should look like this:
What should I do? How can I find the oldest day and build "usable" as score / oldest score?
The codes for sample data are:
ID <-c("1","1","1","1","2","2","2","2")
Date<- c("4/19/2018","7/27/2018","8/24/2018","9/21/2018","10/19/2018","12/14/2018","1/11/2019","1/24/2019")
Duration <- c("-13","-7","95","142","2","36","75","81")
score <- c("0.06","0.071","0.054","0.0258","0.0208","0.0448","0.0638","0.0227")
Sample.data <- data.frame(ID, Date, Duration, score)
The columns in 'Sample.data' are all character class because the values were quoted (this was run on R 4.0.0; on R < 4.0, stringsAsFactors = TRUE by default would have made them factors), so we use type.convert to set each column's class automatically based on its values. Then convert 'Date' to Date class and arrange by 'ID' and 'Date' (in case the original data is not already in order). Finally, after grouping by 'ID', create the new column 'Useable' with an if/else condition: if the first 'Duration' is <= 0, return 'score' standardized by the first value of 'score', otherwise return NA.
library(dplyr)
library(lubridate)
Sample.data <- Sample.data %>%
type.convert(as.is = TRUE) %>%
mutate(Date = mdy(Date)) %>%
arrange(ID, Date) %>%
group_by(ID) %>%
mutate(Useable = if(first(Duration) <=0) c(NA, score[-1]/first(score))
else NA_real_)
Sample.data
# A tibble: 8 x 5
# Groups: ID [2]
# ID Date Duration score Useable
# <int> <date> <int> <dbl> <dbl>
#1 1 2018-04-19 -13 0.06 NA
#2 1 2018-07-27 -7 0.071 1.18
#3 1 2018-08-24 95 0.054 0.9
#4 1 2018-09-21 142 0.0258 0.43
#5 2 2018-10-19 2 0.0208 NA
#6 2 2018-12-14 36 0.0448 NA
#7 2 2019-01-11 75 0.0638 NA
#8 2 2019-01-24 81 0.0227 NA
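If the baseline should instead be the non-positive Duration closest to 0 (rather than necessarily the first row of each ID, as the question describes), a sketch of that variant, starting again from the raw Sample.data defined in the question, might look like this:
library(dplyr)
library(lubridate)
Sample.data %>%
  type.convert(as.is = TRUE) %>%
  mutate(Date = mdy(Date)) %>%
  arrange(ID, Date) %>%
  group_by(ID) %>%
  mutate(
    # baseline score: the non-positive Duration closest to 0, if any
    baseline = if (any(Duration <= 0))
      score[Duration <= 0][which.max(Duration[Duration <= 0])]
    else NA_real_,
    Useable = ifelse(Duration <= 0, NA_real_, score / baseline)
  ) %>%
  ungroup() %>%
  select(-baseline)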
I have a dataset like this.
ID EQP_ID DATE ENTRY EXIT
10 1232 10/01/2018 0058 NA
10 8123 10/01/2018 NA 0059
11 8231 10/02/2018 0063 NA
11 233 10/03/2018 0064 NA
11 2512 10/04/2018 NA 0099
11 2111 10/05/2018 NA 1000
I want to collapse the observations such that the earliest row I see with an 'ENTRY' for a given ID is combined with the latest row with an EXIT value, and I also get the EQP_ID associated with the exit record:
ID EQP_ID ENTRY EXIT
10 8123 0058 0059
11 2111 0063 1000
I'm fairly new to R and this was complicated enough that I couldn't think of a good way to do it without resorting to a loop, and performance is predictably not very good.
Edit
I think this does it, but I'd still be curious if other more experienced folks have a better answer
group_by(dataset, ID) %>%
  arrange(ENTRY) %>%
  summarize(ENTRY = first(ENTRY), EXIT = last(EXIT), EQP_ID = last(EQP_ID))
Using dplyr::first and dplyr::last we can do the below; another option is to use min and max (a sketch of that variant follows the output below).
library(dplyr)
df %>% group_by(ID) %>%
summarise(EQP_ID=dplyr::last(EQP_ID), First=dplyr::first(ENTRY),Last=dplyr::last(EXIT))
# A tibble: 2 x 4
ID EQP_ID First Last
<int> <int> <int> <int>
1 10 8123 58 59
2 11 2111 63 1000
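A sketch of the min/max variant mentioned above (still assuming, as the answer does, that rows within each ID are ordered by date, so the last EQP_ID is the exit record):
df %>%
  group_by(ID) %>%
  summarise(EQP_ID = dplyr::last(EQP_ID),
            First  = min(ENTRY, na.rm = TRUE),
            Last   = max(EXIT, na.rm = TRUE))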
This solution uses dplyr. First, define the data frame.
df <- read.table(text = "ID EQP_ID DATE ENTRY EXIT
10 1232 10/01/2018 0058 NA
10 8123 10/01/2018 NA 0059
11 8231 10/02/2018 0063 NA
11 233 10/03/2018 0064 NA
11 2512 10/04/2018 NA 0099
11 2111 10/05/2018 NA 1000", header = TRUE)
Next, group by ID and take either the first or last value of variables in the group using head or tail, respectively.
df %>%
group_by(ID) %>%
summarise(EQP_ID = tail(EQP_ID, 1),
ENTRY = head(ENTRY, 1),
EXIT = tail(EXIT, 1))
This gives,
# # A tibble: 2 x 4
# ID EQP_ID ENTRY EXIT
# <int> <int> <int> <int>
# 1 10 8123 58 59
# 2 11 2111 63 1000
One option with data.table:
library(data.table)
#create example data
dt <- data.table(
id = c(10, 10, 11, 11, 11, 11),
date = seq(as.Date("2018-10-1"), as.Date("2018-10-6"), by="day"),
entry = c(58, NA, 63, 64, NA, NA),
exit = c(NA, 59, NA, NA, 99, 100)
)
# number rows by id
dt[order(id, date), num := 1:.N, by=id]
# get first-entry and last-exit values by id
dt[ , keepentry := entry[1],by=id]
dt[ , keepexit := exit[.N],by=id]
# keep one row per id
dt[num==1, .(id, keepentry, keepexit)]
Not my most elegant work, but it will get the job done.
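For reference, the same first-entry / last-exit logic can also be written as a single grouped call (a sketch reusing the dt created above):
# first row's entry and last row's exit per id, ordered by date,
# mirroring the step-by-step version above
dt[order(id, date), .(keepentry = entry[1], keepexit = exit[.N]), by = id]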
I need to count future visits by a specific customer in the next 7 days. I solved this with purrr::map2 but I'm experiencing very slow performance. I think I must be missing something basic about how to use purrr. How do I speed this up? Thanks
This toy example takes 2.3 secs with 100 rows, but 3.3 minutes with 1000 rows on my machine. My actual data has 400K rows!
library(tidyverse)
set.seed(123)
rows <- 1000
df= data.frame(cust_num = sample(c("123","124","128"),rows,replace=T),
date = sample(seq(as.Date('2017/01/01'), as.Date('2017/01/31'), by="day"), rows, replace=T))
df <- df %>%
rowwise() %>%
mutate( visits.next.7.days = map2_lgl(df$cust_num,df$date,~.x==cust_num&.y>date&.y<(date+7)) %>% sum() )
Here's an option that uses purrr::reduce to sum the list of vectors returned by data.table::shift (a vectorized version of lead/lag). pmap_int with sum would do the same as reduce with + if you like, but it is a little slower. You could similarly do map(1:7, ~lead(n, .x, default = 0L)) instead of data.table::shift, but it's more code and slower.
library(tidyverse)
set.seed(123)
rows <- 1000
df = data.frame(cust_num = sample(c("123","124","128"), rows, replace = TRUE),
date = sample(seq(as.Date('2017/01/01'),
as.Date('2017/01/31'),
by = "day"),
rows, replace = TRUE))
df2 <- df %>%
count(cust_num, date) %>%
group_by(cust_num) %>%
# add dates with no occurrences; none in sample data, but quite possible in real
complete(date = seq(min(date), max(date), by = 'day'), fill = list(n = 0L)) %>%
mutate(visits_next_7 = reduce(data.table::shift(n, 1:7, type = 'lead', fill = 0L), `+`)) %>%
right_join(df)
df2
#> # A tibble: 1,000 x 4
#> # Groups: cust_num [?]
#> cust_num date n visits_next_7
#> <fctr> <date> <int> <int>
#> 1 123 2017-01-09 10 78
#> 2 128 2017-01-19 12 70
#> 3 124 2017-01-05 15 73
#> 4 128 2017-01-27 14 37
#> 5 128 2017-01-27 14 37
#> 6 123 2017-01-15 19 74
#> 7 124 2017-01-24 12 59
#> 8 128 2017-01-10 10 78
#> 9 124 2017-01-03 19 77
#> 10 124 2017-01-14 8 84
#> # ... with 990 more rows
This may not be the most efficient algorithm, as depending on the spacing of your data, complete could potentially expand your data dramatically.
Further, with data this size, you may find data.table is more practical unless you want to put your data in a database and access it with dplyr.
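As a rough sketch of that data.table route (assuming the same df as above), a non-equi self-join can produce the per-row counts directly, without expanding the data with complete():
library(data.table)

dt <- as.data.table(df)
dt[, end := date + 7]
# for each row of the inner dt, count rows of the same customer whose date is
# strictly after that row's date and strictly before date + 7 (same logic as
# the question); sum(!is.na(x.date)) gives 0 when there are no matches
res <- dt[dt, on = .(cust_num, date > date, date < end),
          .(visits_next_7 = sum(!is.na(x.date))), by = .EACHI]
dt[, visits_next_7 := res$visits_next_7]
dt[, end := NULL]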
A solution using the zoo package. The idea is to count the rows by cust_num and date first, then use the lead function to shift the counts by 1 and use rollapply to sum over the next six days (not including the beginning date). Finally, use left_join to merge the results back into the original data frame. This should be much faster than your original approach. df3 is the final output.
library(dplyr)
library(zoo)
df2 <- df %>%
count(cust_num, date) %>%
ungroup() %>%
mutate(n2 = lead(n)) %>%
mutate(visits.next.7.days = rollapply(n2, width = 6, FUN = sum, na.rm = TRUE,
align = "left", partial = TRUE)) %>%
select(cust_num, date, visits.next.7.days)
df3 <- df %>% left_join(df2, by = c("cust_num", "date"))
head(df3)
# cust_num date visits.next.7.days
# 1 123 2017-01-09 70
# 2 128 2017-01-19 54
# 3 124 2017-01-05 58
# 4 128 2017-01-27 37
# 5 128 2017-01-27 37
# 6 123 2017-01-15 68