Transform and Count Difference of Unique Customers over Time in R

I've got a data frame in R that looks like the following:
cust = c("A", "B", "C", "A", "B", "E", "A", "F", "A", "G")
period = as.Date(c("2013/1/1", "2013/1/1", "2013/1/1", "2013/1/2", "2013/1/2",
"2013/1/2", "2013/1/3", "2013/1/3", "2013/1/4", "2013/1/4"))
df = data.frame(cust, period)
I want to transform it so that I arrive at the following output format:
period NumCust_Initial GainedCust LostCust NumCust_EndUpWith
1/1/2013 3 NA NA NA
2/1/2013 3 1 1 3
3/1/2013 2 1 2 2
4/1/2013 2 1 1 2
The idea is that I'd arrive at a count of unique customers for each period. Then I'd calculate the number of new customers acquired (GainedCust) and the number of customers lost (LostCust), both relative to the previous period. Finally, we'd do a calculation that would get the number of customers we end up with (NumCust_EndUpWith).
From df in 2/1/2013 I had 3 unique customers. I gained 1 (relative to 1/1/2013) but lost another 1 (relative to 1/1/2013) so I ended up with 3 customers (which is calculated as 3 from NumCust_Initial in 1/1/2013 plus the number of new customers GainedCust in 2/1/2013 and minus the number of lost customers LostCust in 2/1/2013).
Similarly, we can see from df that in 3/1/2013 we started with 2 customers. We then gained 1 new customer (relative to 2/1/2013) and lost 2 customers (relative to 2/1/2013). And so forth and so on.
How can I perform all these transformations / calculations in R? I've tried looking at some of the functions in dplyr and reshape2 but have not arrived at anything yet. Has anybody faced similar data transformation challenges in R before?

You can do this with a combination of tidyr and dplyr. Be sure to install the development version of tidyr.
# required packages
require(tidyr) # development version
require(dplyr)
df %>%
  mutate(current = TRUE) %>%
  complete(period, cust, fill = list(current = FALSE)) %>%
  group_by(cust) %>%
  mutate(gain = c(NA, diff(current))) %>%
  group_by(period) %>%
  summarise(GainedCust = sum(gain > 0),
            LostCust = sum(gain < 0),
            NumCust_EndUpWith = sum(current))
## Source: local data frame [4 x 4]
##
## period GainedCust LostCust NumCust_EndUpWith
## 1 2013-01-01 NA NA 3
## 2 2013-01-02 1 1 3
## 3 2013-01-03 1 2 2
## 4 2013-01-04 1 1 2
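For comparison, the same per-period bookkeeping can be sketched in base R with setdiff(). This is only an illustrative alternative, not the method used in the answer above; the names by_period, gained, lost and result are made up for the example, and it assumes each period is compared with the period immediately before it.

```r
cust <- c("A", "B", "C", "A", "B", "E", "A", "F", "A", "G")
period <- as.Date(c("2013/1/1", "2013/1/1", "2013/1/1", "2013/1/2", "2013/1/2",
                    "2013/1/2", "2013/1/3", "2013/1/3", "2013/1/4", "2013/1/4"))
df <- data.frame(cust, period, stringsAsFactors = FALSE)

# One vector of unique customers per period, in chronological order
by_period <- lapply(split(df$cust, df$period), unique)
n <- length(by_period)

# Customers present now but not in the previous period
gained <- c(NA, vapply(2:n, function(i)
  length(setdiff(by_period[[i]], by_period[[i - 1]])), integer(1)))

# Customers present in the previous period but not now
lost <- c(NA, vapply(2:n, function(i)
  length(setdiff(by_period[[i - 1]], by_period[[i]])), integer(1)))

result <- data.frame(period = as.Date(names(by_period)),
                     GainedCust = gained,
                     LostCust = lost,
                     NumCust_EndUpWith = unname(lengths(by_period)))
result
# GainedCust:        NA 1 1 1
# LostCust:          NA 1 2 1
# NumCust_EndUpWith:  3 3 2 2
```

The set-based version makes the definition of "gained" and "lost" explicit, at the cost of a loop over periods.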

Related

R: Long to wide without time

I am working with a medication prescription dataset which I want to transfer from long to wide format.
I tried to use the reshape function, however, this requires a time variable, which I don't have (at least not in a useful format I believe).
Concept dataset:
id <- c(1, 1, 1, 2, 2, 3, 3, 3)
prescription_date <- c("17JAN2009", "02MAR2009", "20MAR2009", "05JUL2009", "10APR2009", "09MAY2009", "13JUN2009", "29MAY2009")
med <- c("A", "B", "A", "B", "A", "B", "A", "B")
df <- data.frame(id, prescription_date, med)
To create a time variable, I tried numbering the prescriptions per id (1st, 2nd, etc.), but I didn't succeed.
Background: I want this in a wide format to eventually create definitions for diagnoses (i.e. if a patient had >1 prescriptions of A, diagnosis is confirmed). This has to be combined with factors from other datasets, hence the idea to go from long to wide.
Any help is much appreciated, thank you.
You might consider keeping the data in long format to perform some of these calculations. I would also suggest changing your dates into a date format that can be calculated upon. This will show, for instance, that the last two rows are not chronological. For instance:
library(dplyr)
df %>%
  mutate(prescription_date = lubridate::dmy(prescription_date)) %>%
  arrange(id, prescription_date) %>%
  group_by(id) %>%
  mutate(A_cuml = cumsum(med == "A"),
         A_ttl = sum(med == "A")) %>%
  ungroup()
# A tibble: 8 × 5
id prescription_date med A_cuml A_ttl
<dbl> <date> <chr> <int> <int>
1 1 2009-01-17 A 1 2
2 1 2009-03-02 B 1 2
3 1 2009-03-20 A 2 2
4 2 2009-04-10 A 1 1
5 2 2009-07-05 B 1 1
6 3 2009-05-09 B 0 1
7 3 2009-05-29 B 0 1
8 3 2009-06-13 A 1 1
If you calculate summary stats for each id, you might save this in a summarized table with one row per id and use joins (e.g. left_join) to append the results of each of these summaries.
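The summarise-then-join idea might look like the sketch below. The table other and its age column are hypothetical stand-ins for "factors from other datasets", and dx_confirmed simply encodes the ">1 prescriptions of A" rule from the question.

```r
library(dplyr)

id <- c(1, 1, 1, 2, 2, 3, 3, 3)
prescription_date <- c("17JAN2009", "02MAR2009", "20MAR2009", "05JUL2009",
                       "10APR2009", "09MAY2009", "13JUN2009", "29MAY2009")
med <- c("A", "B", "A", "B", "A", "B", "A", "B")
df <- data.frame(id, prescription_date, med)

# One row per id: prescription counts and the diagnosis rule
per_id <- df %>%
  group_by(id) %>%
  summarise(n_A = sum(med == "A"),
            n_B = sum(med == "B")) %>%
  mutate(dx_confirmed = n_A > 1)

# Hypothetical table from another dataset (made-up values)
other <- data.frame(id = c(1, 2, 3), age = c(54, 61, 47))

# Append the other dataset's columns to the per-id summary
combined <- left_join(per_id, other, by = "id")
combined
```

Because every summary table has exactly one row per id, further summaries can be appended with additional left_join() calls without duplicating rows.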

How to order grouped rows while keeping duplicates together [duplicate]

I have a dataframe with several "people".
There are repeat instances for "people", however, the measured "value" is different in each instance.
Here is an example of dataframe.
df2 <- data.frame(
value = c(1, 2, 3, 4, 5),
people = c("d", "c", "b", "d", "b")
)
which looks like:
value people
1 d
2 c
3 b
4 d
5 b
I would like to group the data by "people", order the groups by each group's smallest "value", and, within each group, sort ascending by "value".
That is, I want to keep duplicates together while sorting by value.
Here is how I would like the data to look:
value people
1 d
4 d
2 c
3 b
5 b
I have made multiple attempts with group_by and arrange using {dplyr}, but it seems I am missing something.
Thanks for the help.
I have made a change - for clarity, I do not want "people" sorted alphabetically - this is a schedule in reality - person D has the first appointment (1), and his second appointment is 4. I want them to appear first and together. Person C has a 2nd appointment. Person B has a 3rd appointment, his other appointment is 5. I hope this makes it more clear. Thanks again
You can use arrange in this form :
library(dplyr)
df2 %>%
arrange(value) %>%
arrange(match(people, unique(people)))
# value people
#1 1 d
#2 4 d
#3 2 c
#4 3 b
#5 5 b
Although the code is longer, this will also work:
df2 %>% group_by(people) %>% arrange(value) %>%
mutate(d = first(value)) %>% arrange(d) %>% ungroup() %>% select(-d)
# A tibble: 5 x 2
value people
<dbl> <chr>
1 1 d
2 4 d
3 2 c
4 3 b
5 5 b
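I got your result with the following base-R one-liner. Note that it works for this example only because the reverse alphabetical order of people (d, c, b) happens to match the appointment order: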
df2[order(df2$people, decreasing = TRUE),]
# value people
# 1 1 d
# 4 4 d
# 2 2 c
# 3 3 b
# 5 5 b
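A base-R variant that orders the groups by their earliest value rather than by name could look like the following sketch. The helper column name first_value is made up for the example, and the sketch assumes the values are unique, as they are in a schedule.

```r
df2 <- data.frame(
  value = c(1, 2, 3, 4, 5),
  people = c("d", "c", "b", "d", "b")
)

# Smallest value within each person's group, repeated for every row of the group
first_value <- ave(df2$value, df2$people, FUN = min)

# Order groups by their first appointment, then rows within a group by value
sorted <- df2[order(first_value, df2$value), ]
sorted
#   value people
# 1     1      d
# 4     4      d
# 2     2      c
# 3     3      b
# 5     5      b
```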

R: Vlookup for a 'for' loop

I'm new to R and trying to use it in place of Excel (where I have more experience). I'm still working out the full 'for' logic, but I can't look up the values I need to check whether it works the way I think it should, and that is stopping me in my tracks. The goal is to generate what will be used as a factor with 3 levels: 0 = no duplicates, 1 = duplicate, oldest, 2 = duplicate, newest.
I have a dataframe that looks like this
Person <- c("A", "B", "C", "C", "D", "E","E")
Date <- c(1/1/20, 1/1/20,12/25/19, 1/1/20, 1/1/20, 12/25/19, 1/1/20)
ID <- c(1,2,3,4,5,6,7)
DuplicateStatus <- c(0,0,0,0,0,0,0)
IdealResult <- c(0,0,1,2,0,1,2)
mydata <- cbind(Person, Date, ID, DuplicateStatus, IdealResult)
I am trying to use a for loop to evaluate if person duplicates. If a person does not duplicate, value= 0 and if they do duplicate, they should have a 1 for the oldest value and a 2 for the newest value (see ideal result). NOTE: I have already sorted the data to be by person and then date, so if duplicated, first appearance is oldest.
Previous answers about Vlookup in R are aimed at merging datasets based on identical values in multiple datasets. Here, I am attempting to modify a column based on the relationship between columns within a single dataset.
currentID = 0
nextID =0
for(i in mydata$ID){
currentID = i
nextID = currentID++1
CurrentPerson ##Vlookup function that does - find currentID in ID, return associated value in Person column in same position.
NextPerson ##Vlookup function that does - find nextID in ID, return associated value in Person column in same position.
if CurrentPerson = NextPerson, then DuplicateStatus at ID associated with current person should be 1, and DuplicateStatus at ID associated with NextPerson = 2.
**This should end when current person = total number of people
Thanks!
You really need to spend some time with a simple tutorial on R. Your cbind() function converts all of your data to a character matrix which is probably not what you want. Look at the results of str(mydata). Instead of looping, this creates an index number within each Person group and then zeros out the groups with a single observation:
mydata <- data.frame(Person, Date, ID, DuplicateStatus, IdealResult)
IR <- ave(mydata$ID, mydata$Person, FUN=seq_along)
IR
# [1] 1 1 1 2 1 1 2
tbl <- table(mydata$Person)
tozero <- mydata$Person %in% names(tbl[tbl == 1])
IR[tozero] <- 0
IR
# [1] 0 0 1 2 0 1 2
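Two details of the question's setup are worth seeing side by side: cbind() on mixed vectors yields a character matrix, and an unquoted 1/1/20 is arithmetic (1 divided by 1 divided by 20, i.e. 0.05), not a date. The sketch below assumes the dates were meant as month/day/two-digit-year strings; the name m is made up for the illustration.

```r
Person <- c("A", "B", "C", "C", "D", "E", "E")
# Quoted strings parsed as dates; the %m/%d/%y format is an assumption
Date <- as.Date(c("1/1/20", "1/1/20", "12/25/19", "1/1/20",
                  "1/1/20", "12/25/19", "1/1/20"), format = "%m/%d/%y")
ID <- 1:7

# cbind() coerces everything to one type: a character matrix
m <- cbind(Person, ID)
is.character(m)   # TRUE

# data.frame() keeps each column's own type
mydata <- data.frame(Person, Date, ID)

# Sort by person, then date, so the first row of each group is the oldest
mydata <- mydata[order(mydata$Person, mydata$Date), ]

# Index within each person; groups with a single row are set to 0
mydata$DuplicateStatus <- ave(mydata$ID, mydata$Person,
                              FUN = function(x) seq_along(x) * (length(x) > 1))
mydata$DuplicateStatus
# [1] 0 0 1 2 0 1 2
```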
Is what you are looking for just to count the number of observations for a person, in one column (like a column ID)? If so, this will work using tidyverse:
Person <- c("A", "B", "C", "C", "D", "E","E")
Date <- c(1/1/20, 1/1/20,12/25/19, 1/1/20, 1/1/20, 12/25/19, 1/1/20)
ID <- c(1,2,3,4,5,6,7)
DuplicateStatus <- c(0,0,0,0,0,0,0)
IdealResult <- c(0,0,1,2,0,1,2)
mydata <- data.frame(Person, Date, ID, DuplicateStatus, IdealResult)
library(tidyverse)
mydata <- mydata %>%
group_by(Person) %>%
mutate(Duplicate = seq_along(Person))
mydata
# A tibble: 7 x 6
# Groups: Person [5]
Person Date ID DuplicateStatus IdealResult Duplicate
<fct> <dbl> <dbl> <dbl> <dbl> <int>
1 A 0.05 1 0 0 1
2 B 0.05 2 0 0 1
3 C 0.0253 3 0 1 1
4 C 0.05 4 0 2 2
5 D 0.05 5 0 0 1
6 E 0.0253 6 0 1 1
7 E 0.05 7 0 2 2
You could assign the row number within each group, provided the group has more than one row.
This can be implemented in base R, dplyr as well as data.table
In base R :
mydata$ans <- with(mydata, ave(ID, Person, FUN = function(x)
seq_along(x) * (length(x) > 1)))
# Person Date ID IdealResult ans
#1 A 0.0500000 1 0 0
#2 B 0.0500000 2 0 0
#3 C 0.0252632 3 1 1
#4 C 0.0500000 4 2 2
#5 D 0.0500000 5 0 0
#6 E 0.0252632 6 1 1
#7 E 0.0500000 7 2 2
Using dplyr:
library(dplyr)
mydata %>% group_by(Person) %>% mutate(ans = row_number() * (n() > 1))
and with data.table
library(data.table)
setDT(mydata)[, ans := seq_along(ID) * (.N > 1), Person]
data
mydata <- data.frame(Person, Date, ID, IdealResult)
I would argue that n() is the ideal function for your problem:
library(tidyverse)
mydata <- mydata %>%
group_by(Person) %>%
mutate(Duplicate = n())

How to obtain daily time series of categorical frequencies in R

I have a data frame as such:
data <- data.frame(daytime = c('2005-05-03 11:45:23', '2005-05-03 11:47:45',
'2005-05-03 12:00:32', '2005-05-03 12:25:01',
'2006-05-02 10:45:15', '2006-05-02 11:15:14',
'2006-05-02 11:16:15', '2006-05-02 11:18:03'),
category = c("A", "A", "A", "B", "B", "B", "B", "A"))
print(data)
daytime category date2
1 2005-05-03 11:45:23 A 05/03/05
2 2005-05-03 11:47:45 A 05/03/05
3 2005-05-03 12:00:32 A 05/03/05
4 2005-05-03 12:25:01 B 05/03/05
5 2006-05-02 10:45:15 B 05/02/06
6 2006-05-02 11:15:14 B 05/02/06
7 2006-05-02 11:16:15 B 05/02/06
8 2006-05-02 11:18:03 A 05/02/06
I would like to turn this data frame into a time series of daily categorical frequencies like this:
day cat_A_freq cat_B_freq
1 2005-05-01 3 1
2 2005-05-02 1 3
I have tried doing:
library(anytime)
data$daytime <- anytime(data$daytime)
data$day <- factor(format(data$daytime, "%D"))
table(data$day, data$category)
A B
05/02/06 1 3
05/03/05 3 1
But as you can see, formatting a new variable, day, changes the appearance of the date. You can also see that the table does not return the days in proper order (the years are out of order), which makes it hard to convert the result to a time series.
Any ideas on how to get frequencies in an easier way, or if this is the way, how to get the frequencies in correct date order and into a dataframe for easy conversion to a time series object?
A solution using tidyverse. The format of your daytime column in your data is good, so we can use as.Date directly without specifying other formats or using other functions.
library(tidyverse)
data2 <- data %>%
mutate(day = as.Date(daytime)) %>%
count(day, category) %>%
spread(category, n)
data2
# # A tibble: 2 x 3
# day A B
# * <date> <int> <int>
# 1 2005-05-03 3 1
# 2 2006-05-02 1 3
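The ordering problem in the question comes from formatting the date as "%D" text, which sorts by month first. A base-R sketch that keeps real Date values, so that table() orders the rows chronologically, might look like this; the names day, freq and out are made up for the example, and the column names follow the desired output in the question.

```r
data <- data.frame(daytime = c('2005-05-03 11:45:23', '2005-05-03 11:47:45',
                               '2005-05-03 12:00:32', '2005-05-03 12:25:01',
                               '2006-05-02 10:45:15', '2006-05-02 11:15:14',
                               '2006-05-02 11:16:15', '2006-05-02 11:18:03'),
                   category = c("A", "A", "A", "B", "B", "B", "B", "A"))

# Keep only the calendar date; ISO-formatted dates sort chronologically
day <- as.Date(data$daytime)

# Cross-tabulate day by category
freq <- table(day, data$category)

# Rebuild a plain data frame with one row per day
out <- data.frame(day = as.Date(rownames(freq)),
                  cat_A_freq = as.vector(freq[, "A"]),
                  cat_B_freq = as.vector(freq[, "B"]))
out
#          day cat_A_freq cat_B_freq
# 1 2005-05-03          3          1
# 2 2006-05-02          1          3
```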

Assign an ID based on multiple criteria

I am new to R and am trying to put together a script to automate a now very manual task of triangulating different reports.
In my job I receive reports from different sources, which I need to triangulate and aggregate if needed. To simplify (and anonymise) my example let's say that I get reports about sales made by different merchants in a market. This data includes "Observer", "Seller", "Buyer", and "Date of sale".
Example:
market <- data.frame(observer=c("Tom", "Fred", "Hank", "Tom"),
seller=c("A", "A", "B", "A"),
buyer=c("X", "X", "Y", "X"),
date_sale=c("2017/01/01", "2017/01/03", "2017/01/04", "2017/01/05"))
Now, some of this data might overlap, so I need to make sure that I know if a transaction has been reported already across merchants in a similar time period (+/- 7 days) and assign the same ID to it (so later I can merge the two). If, however, the same observer reports the same transaction again a short time later I can assume that in that case it's a separate one.
In my example, I can see that Tom and Fred both reported a purchase from A to X within 2 days of each other, while Tom reported a second one in the same time period. So ideally R should give the same ID to the first two transactions and a separate one to the third.
The result should be:
market <- data.frame(observer=c("Tom", "Fred", "Hank", "Tom"),
seller=c("A", "A", "B", "A"),
buyer=c("X", "X", "Y", "X"),
date_sale=c("2017/01/01", "2017/01/03", "2017/01/04", "2017/01/05"),
id=c(1, 1, 2, 3))
I have tried with getanID from the splitstackshape package, but I cannot manage to find out how to give a parameter of "within +/- 7 days of an earlier transaction". I am open to any suggestions, thank you very much!
For completeness' sake, I added another data point to your data.frame that is more than 7 days away from the first sale. I also converted your dates to the Date class to simplify date arithmetic:
market <- data.frame(observer=c("Tom", "Fred", "Hank", "Tom", "Joe"),
seller=c("A", "A", "B", "A", "A"),
buyer=c("X", "X", "Y", "X", "X"),
date_sale=as.Date(c("2017/01/01", "2017/01/03",
"2017/01/04","2017/01/05", "2017/01/09")) )
The first step you'll want to do is bin your data into 7-day bins:
library( dplyr ) # We'll make extensive use of this package
m1 <- market %>% mutate( date_bin = as.integer((date_sale - min(date_sale)) / 7) )
# observer seller buyer date_sale date_bin
# 1 Tom A X 2017-01-01 0
# 2 Fred A X 2017-01-03 0
# 3 Hank B Y 2017-01-04 0
# 4 Tom A X 2017-01-05 0
# 5 Joe A X 2017-01-09 1
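One caveat about fixed 7-day bins: two sales only a couple of days apart can still land in different bins if they straddle a bin boundary. The sketch below illustrates this with made-up dates and contrasts it with a gap-based grouping that starts a new group only when consecutive sales are more than 7 days apart. The names fixed_bin, gaps and gap_group are hypothetical, and a real version would first sort by date and apply this within each seller/buyer pair.

```r
# Made-up, already sorted dates for a single seller/buyer pair
date_sale <- as.Date(c("2017-01-01", "2017-01-07", "2017-01-09", "2017-01-20"))

# Fixed bins counted from the earliest date, as in the step above
fixed_bin <- as.numeric(date_sale - min(date_sale)) %/% 7
fixed_bin
# [1] 0 0 1 2   (Jan 7 and Jan 9 are split although only 2 days apart)

# Gap-based alternative: days since the previous sale, 0 for the first row
gaps <- c(0, diff(as.numeric(date_sale)))
gap_group <- cumsum(gaps > 7)
gap_group
# [1] 0 0 0 1   (a new group starts only after a gap of more than 7 days)
```

Note that gap-based groups can chain: a long run of sales each within 7 days of the previous one ends up in a single group, which may or may not be what is wanted.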
The final id is going to be a product of two "sub-IDs": the outer ID which comes from grouping your data by date_bin, seller and buyer (i.e., what are all the possible versions of transactions that can happen within a 7-day period), and the inner ID which enumerates duplicate transactions made by the same observer within each group.
The two IDs can be computed as follows:
i1 <- m1 %>% group_by( date_bin, seller, buyer ) %>% group_indices()
m2 <- m1 %>% mutate( outID = i1 ) %>% group_by( outID, observer ) %>%
mutate( inID = 1:n() )
# observer seller buyer date_sale date_bin outID inID
# 1 Tom A X 2017-01-01 0 1 1
# 2 Fred A X 2017-01-03 0 1 1
# 3 Hank B Y 2017-01-04 0 2 1
# 4 Tom A X 2017-01-05 0 1 2
# 5 Joe A X 2017-01-09 1 3 1
Finally, we create the final id from all the unique pairs of outID and inID:
market %>% mutate( id = group_by( m2, outID, inID ) %>% group_indices() )
# observer seller buyer date_sale id
# 1 Tom A X 2017-01-01 1
# 2 Fred A X 2017-01-03 1
# 3 Hank B Y 2017-01-04 3
# 4 Tom A X 2017-01-05 2
# 5 Joe A X 2017-01-09 4
Note that the indices are not exactly in the same order as what you requested in your question, but since these are arbitrary integers, you can reassign them to desired values without loss of generality.
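If IDs numbered in order of first appearance are preferred, one way to relabel them is match() against unique(). A minimal sketch, using the id values from the output above:

```r
id <- c(1, 1, 3, 2, 4)

# Position of each id among the distinct ids, in order of first appearance
id_ordered <- match(id, unique(id))
id_ordered
# [1] 1 1 2 3 4
```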
