How to obtain daily time series of categorical frequencies in R

I have a data frame as such:
data <- data.frame(daytime = c('2005-05-03 11:45:23', '2005-05-03 11:47:45',
                               '2005-05-03 12:00:32', '2005-05-03 12:25:01',
                               '2006-05-02 10:45:15', '2006-05-02 11:15:14',
                               '2006-05-02 11:16:15', '2006-05-02 11:18:03'),
                   category = c("A", "A", "A", "B", "B", "B", "B", "A"))
print(data)
              daytime category
1 2005-05-03 11:45:23        A
2 2005-05-03 11:47:45        A
3 2005-05-03 12:00:32        A
4 2005-05-03 12:25:01        B
5 2006-05-02 10:45:15        B
6 2006-05-02 11:15:14        B
7 2006-05-02 11:16:15        B
8 2006-05-02 11:18:03        A
I would like to turn this data frame into a time series of daily categorical frequencies like this:
         day cat_A_freq cat_B_freq
1 2005-05-03          3          1
2 2006-05-02          1          3
I have tried doing:
library(anytime)
data$daytime <- anytime(data$daytime)
data$day <- factor(format(data$daytime, "%D"))
table(data$day, data$category)
A B
05/02/06 1 3
05/03/05 3 1
But as you can see, formatting a new variable, day, changes the appearance of the date. You can also see that the table does not return the days in proper order (the years are out of order), which prevents me from converting to a time series easily.
Any ideas on how to get the frequencies in an easier way, or, if this is the way, how to get the frequencies in correct date order and into a data frame for easy conversion to a time series object?

A solution using the tidyverse. The format of the daytime column in your data is good, so we can use as.Date directly without specifying other formats or using other functions.
library(tidyverse)
data2 <- data %>%
  mutate(day = as.Date(daytime)) %>%
  count(day, category) %>%
  spread(category, n)
data2
# # A tibble: 2 x 3
# day A B
# * <date> <int> <int>
# 1 2005-05-03 3 1
# 2 2006-05-02 1 3
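If you then want an actual time series object, one possible sketch (assuming the xts package is available; zoo would work similarly) is to index the counts by the day column:
library(xts)
# Convert the daily counts to an xts object indexed by day; columns A and B come from spread()
freq_ts <- xts(as.matrix(data2[, c("A", "B")]), order.by = data2$day)
freq_ts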

Related

R: Long to wide without time

I am working with a medication prescription dataset which I want to transfer from long to wide format.
I tried to use the reshape function; however, it requires a time variable, which I don't have (at least not in a useful format, I believe).
Concept dataset:
id <- c(1, 1, 1, 2, 2, 3, 3, 3)
prescription_date <- c("17JAN2009", "02MAR2009", "20MAR2009", "05JUL2009", "10APR2009", "09MAY2009", "13JUN2009", "29MAY2009")
med <- c("A", "B", "A", "B", "A", "B", "A", "B")
df <- data.frame(id, prescription_date, med)
I have tried to create a time variable of the form 1st, 2nd, etc. med per id, but I didn't succeed.
Background: I want this in a wide format to eventually create definitions for diagnoses (i.e. if a patient had >1 prescriptions of A, diagnosis is confirmed). This has to be combined with factors from other datasets, hence the idea to go from long to wide.
Any help is much appreciated, thank you.
You might consider keeping the data in long format to perform some of these calculations. I would also suggest changing your dates into a Date format that can be calculated upon; this will show, for instance, that the last two rows are not chronological:
library(dplyr)
df %>%
  mutate(prescription_date = lubridate::dmy(prescription_date)) %>%
  arrange(id, prescription_date) %>%
  group_by(id) %>%
  mutate(A_cuml = cumsum(med == "A"),
         A_ttl = sum(med == "A")) %>%
  ungroup()
# A tibble: 8 × 5
id prescription_date med A_cuml A_ttl
<dbl> <date> <chr> <int> <int>
1 1 2009-01-17 A 1 2
2 1 2009-03-02 B 1 2
3 1 2009-03-20 A 2 2
4 2 2009-04-10 A 1 1
5 2 2009-07-05 B 1 1
6 3 2009-05-09 B 0 1
7 3 2009-05-29 B 0 1
8 3 2009-06-13 A 1 1
If you calculate summary stats for each id, you might save this in a summarized table with one row per id and use joins (e.g. left_join) to append the results of each of these summaries.
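As a minimal sketch of that idea, using the diagnosis rule mentioned in the question (>1 prescription of A) as the example definition; the names diagnosis, A_ttl, diagnosis_confirmed and df_flagged are just illustrative:
library(dplyr)
# One row per id: total number of "A" prescriptions and a confirmed-diagnosis flag
diagnosis <- df %>%
  group_by(id) %>%
  summarise(A_ttl = sum(med == "A"),
            diagnosis_confirmed = A_ttl > 1)
# Join the per-id flags back onto the long data (or onto other per-id datasets)
df_flagged <- left_join(df, diagnosis, by = "id")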

How to take an arithmetic average over common variable, rather than whole data?

So I have a data frame of daily stock prices, and I also have a variable that indicates the week of the year (1, 2, 3, ..., 51, 52); this is repeated for 22 companies. I would like to create a new variable that takes an average of the daily prices but only across each week.
The equation I have in mind is avg_price_t = (1/n_t) * sum of price_{d,t} over the days d in week t, where d = day and t = week. My challenge is taking this average of days across each week. Therefore, I should have 52 values per stock that I observe.
Using ave().
dat <- transform(dat, avg_week_price=ave(price, week, company))
head(dat, 9)
# week company wday price avg_week_price
# 1 1 1 a 16.16528 15.47573
# 2 2 1 a 18.69307 15.13812
# 3 3 1 a 11.01956 12.99854
# 4 1 2 a 15.92029 14.56268
# 5 2 2 a 12.26731 13.64916
# 6 3 2 a 17.40726 17.27226
# 7 1 3 a 11.83037 13.02894
# 8 2 3 a 13.09144 12.95284
# 9 3 3 a 12.08950 15.81040
Data:
set.seed(42)
dat <- expand.grid(week=1:3, company=1:5, wday=letters[1:7])
dat$price <- runif(nrow(dat), 10, 20)
An option with dplyr
library(dplyr)
dat %>%
  group_by(week, company) %>%
  mutate(avg_week_price = mean(price))
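If you want one row per week and company (the 52 values per stock mentioned in the question) rather than the average repeated on every daily row, a summarise sketch of the same grouping would be:
dat %>%
  group_by(week, company) %>%
  summarise(avg_week_price = mean(price)) %>%
  ungroup()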

R dplyr: filter common values by group

I need to find values common to all groups, ideally using dplyr in R.
From my dataset here:
group val
<fct> <dbl>
1 a 1
2 a 2
3 a 3
4 b 3
5 b 4
6 b 5
7 c 1
8 c 3
the expected output is
group val
<fct> <dbl>
1 a 3
2 b 3
3 c 3
as only number 3 occurs in all groups.
This code does not seem to work:
# Filter the data
dd %>%
  group_by(group) %>%
  filter(all(val)) # does not work
The example here solves a similar issue but has a defined vector of shared values. What if I do not know which ones are shared?
Dummy example:
# Reproducible example: filter all id by group
group = c("a", "a", "a",
          "b", "b", "b",
          "c", "c")
val = c(1, 2, 3,
        3, 4, 5,
        1, 3)
dd <- data.frame(group, val)
group_by isolates each group, so we can't very well group_by(group) and compare between groups. Instead, we can group_by(val) and see which values have all the groups:
dd %>%
  group_by(val) %>%
  filter(n_distinct(group) == n_distinct(dd$group))
# # A tibble: 3 x 2
# # Groups: val [1]
# group val
# <chr> <dbl>
# 1 a 3
# 2 b 3
# 3 c 3
This is one of the rare cases where we want to use data$column in a dplyr verb: n_distinct(dd$group) refers explicitly to the ungrouped original data to get the total number of groups. (It could also be pre-computed, as shown below.) By contrast, n_distinct(group) uses the grouped data piped into filter, so it gives the number of distinct groups for each value (because we group_by(val)).
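A minimal sketch of that pre-computed variant (n_groups is just an illustrative name):
library(dplyr)
n_groups <- n_distinct(dd$group)  # total number of groups, computed once
dd %>%
  group_by(val) %>%
  filter(n_distinct(group) == n_groups)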
A base R approach can be:
# split val by group, intersect across the groups, then keep rows whose val is in the common set
newd <- dd[dd$val %in% Reduce(intersect, split(dd$val, dd$group)),]
Output:
group val
3 a 3
4 b 3
8 c 3
A similar option in data.table to that of @GregorThomas's solution is
library(data.table)
# for each val, keep the rows where the number of distinct groups equals the total number of groups
setDT(dd)[dd[, .I[uniqueN(group) == uniqueN(dd$group)], val]$V1]

How to perform an R equivalent of Excel's COUNTIFS function across multiple variables in a data frame

I'm working on a project and trying to create a plot of the number of open cases we have on any given date. An example of the data table is as follows.
library(tibble)
case_files <- tibble(case_id = 1:10,
                     date_opened = c("2017-1-1", "2017-1-1", "2017-3-4", "2017-4-4", "2017-5-5",
                                     "2017-5-6", "2017-6-7", "2017-6-6", "2017-7-8", "2017-7-8"),
                     date_closed = c("2017-4-1", "2017-4-1", "2017-5-4", "2017-7-4", "2017-7-5",
                                     "2017-7-6", "2017-8-7", "2017-8-6", "2017-9-8", "2017-10-8"))
case_files$date_opened <- as.Date(case_files$date_opened)
case_files$date_closed <- as.Date(case_files$date_closed)
What I'm trying to do is create another data frame with the dates from the past year and the number of cases that are considered "Open" during each date. I would then be able to plot from this data frame.
daily_open_cases <- tibble(n = 0:365,
date = today() - n,
qty_open = .....)
Cases are considered open on dates on or after date_opened AND on or before date_closed.
I've considered doing conditional subsetting and then using nrow(), but can't seem to get it to work. There must be an easier way to do this. I can do this easily in Excel using the COUNTIFS function.
Thanks!
The Excel function basically does a sum of logical 1's and 0's, which is easy to do in R with the sum function. I'd build a structure that has all the dates and then march through those dates, summing up the logical vectors given by the two inequalities below across all paired rows in the case_files structure. The &-function in R is vectorized:
daily_open_cases <- tibble(dt = as.Date("2017-01-01") + 0:365,
                           qty_open = NA)
daily_open_cases$qty_open <- sapply(daily_open_cases$dt,
                                    function(d) sum(case_files$date_opened <= d & case_files$date_closed >= d))
> head( daily_open_cases)
# A tibble: 6 x 2
dt qty_open
<date> <int>
1 2017-01-01 2
2 2017-01-02 2
3 2017-01-03 2
4 2017-01-04 2
5 2017-01-05 2
6 2017-01-06 2
>
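Since the stated goal is a plot of the number of open cases on any given date, a minimal ggplot2 sketch over this result (assuming ggplot2 is installed) could be:
library(ggplot2)
ggplot(daily_open_cases, aes(x = dt, y = qty_open)) +
  geom_line() +
  labs(x = "Date", y = "Open cases")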
Here's a 'tidyverse' solution; the approach is the same as in 42's answer, I just used dplyr's group_by and mutate instead of base R's sapply.
library(tidyverse)
library(magrittr)
days_files <- tibble(
  date = as.Date("2017-01-01") + 0:365,
  no_open = NA_integer_
)
days_files %<>%
  group_by(date) %>%
  mutate(
    no_open = sum(case_files$date_opened <= date & case_files$date_closed >= date)
  )
# A tibble: 366 x 2
# Groups: date [366]
date no_open
<date> <int>
1 2017-01-01 2
2 2017-01-02 2
3 2017-01-03 2
4 2017-01-04 2
5 2017-01-05 2
6 2017-01-06 2
7 2017-01-07 2
8 2017-01-08 2
9 2017-01-09 2
10 2017-01-10 2
# ... with 356 more rows

Transform and Count Difference of Unique Customers over Time in R

I've got a data frame in R that looks like the following:
cust = c("A", "B", "C", "A", "B", "E", "A", "F", "A", "G")
period = as.Date(c("2013/1/1", "2013/1/1", "2013/1/1", "2013/1/2", "2013/1/2",
                   "2013/1/2", "2013/1/3", "2013/1/3", "2013/1/4", "2013/1/4"))
df = data.frame(cust, period)
I want to transform it in a way that I can arrive at the following format as output:
period NumCust_Initial GainedCust LostCust NumCust_EndUpWith
1/1/2013 3 NA NA NA
2/1/2013 3 1 1 3
3/1/2013 2 1 2 2
4/1/2013 2 1 1 2
The idea is that I'd arrive at a count of unique customers for each period (NumCust_Initial). Then, I'd calculate the number of new customers acquired (GainedCust) and the number of customers lost (LostCust), both relative to the previous period. Finally, we'd do a calculation that would get the number of customers we end up with in each period (NumCust_EndUpWith).
From df in 2/1/2013 I had 3 unique customers. I gained 1 (relative to 1/1/2013) but lost another 1 (relative to 1/1/2013) so I ended up with 3 customers (which is calculated as 3 from NumCust_Initial in 1/1/2013 plus the number of new customers GainedCust in 2/1/2013 and minus the number of lost customers LostCust in 2/1/2013).
Similarly, we can see from df that in 3/1/2013 we started with 2 customers. We then gained 1 new customer (relative to 2/1/2013) and lost 2 customers (relative to 2/1/2013). And so forth and so on.
How can I perform all these transformations / calculations in R? I've tried looking at some of the functions in dplyr and reshape2 but could not arrive at anything yet. Has anybody faced similar data transformation challenges in R before? How can I achieve the desired outcome in R?
You can do this with a combination of tidyr and dplyr. Be sure your version of tidyr includes complete() (at the time of writing, the development version).
# required packages
require(tidyr) # development version
require(dplyr)
df %>%
  mutate(current = TRUE) %>%
  complete(period, cust, fill = list(current = FALSE)) %>%
  group_by(cust) %>%
  mutate(gain = c(NA, diff(current))) %>%
  group_by(period) %>%
  summarise(GainedCust = sum(gain > 0),
            LostCust = sum(gain < 0),
            NumCust_EndUpWith = sum(current))
## Source: local data frame [4 x 4]
##
## period GainedCust LostCust NumCust_EndUpWith
## 1 2013-01-01 NA NA 3
## 2 2013-01-02 1 1 3
## 3 2013-01-03 1 2 2
## 4 2013-01-04 1 1 2
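The NumCust_Initial column from the desired output (the count of distinct customers seen in each period) is not produced by this pipeline; as a sketch, it can be computed separately and joined on (initial_counts is just an illustrative name):
library(dplyr)
# Distinct customers observed in each period
initial_counts <- df %>%
  group_by(period) %>%
  summarise(NumCust_Initial = n_distinct(cust))
# left_join the summary above with initial_counts by "period" to add the column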
