I am new to R and am trying to put together a script to automate what is currently a very manual task of triangulating different reports.
In my job I receive reports from different sources, which I need to triangulate and aggregate if needed. To simplify (and anonymise) my example let's say that I get reports about sales made by different merchants in a market. This data includes "Observer", "Seller", "Buyer", and "Date of sale".
Example:
market <- data.frame(observer=c("Tom", "Fred", "Hank", "Tom"),
                     seller=c("A", "A", "B", "A"),
                     buyer=c("X", "X", "Y", "X"),
                     date_sale=c("2017/01/01", "2017/01/03", "2017/01/04", "2017/01/05"))
Now, some of this data might overlap, so I need to make sure I know when a transaction has already been reported by another merchant in a similar time period (+/- 7 days) and assign the same ID to it (so later I can merge the two). If, however, the same observer reports the same transaction again a short time later, I can assume that it is a separate one.
In my example, I can see that Tom and Fred both reported a purchase from A to X within 2 days of each other, while Tom reported a second one in the same time period. So ideally R should give the same ID to the first two transactions and a separate one to the third.
The result should be:
market <- data.frame(observer=c("Tom", "Fred", "Hank", "Tom"),
                     seller=c("A", "A", "B", "A"),
                     buyer=c("X", "X", "Y", "X"),
                     date_sale=c("2017/01/01", "2017/01/03", "2017/01/04", "2017/01/05"),
                     id=c(1, 1, 2, 3))
I have tried getanID from the splitstackshape package, but I cannot figure out how to specify a condition like "within +/- 7 days of an earlier transaction". I am open to any suggestions, thank you very much!
For completeness' sake, I added another data point to your data.frame that is more than 7 days away. I also converted your dates to the Date class to simplify date arithmetic:
market <- data.frame(observer=c("Tom", "Fred", "Hank", "Tom", "Joe"),
                     seller=c("A", "A", "B", "A", "A"),
                     buyer=c("X", "X", "Y", "X", "X"),
                     date_sale=as.Date(c("2017/01/01", "2017/01/03",
                                         "2017/01/04", "2017/01/05", "2017/01/09")))
The first step you'll want to do is bin your data into 7-day bins:
library( dplyr ) # We'll make extensive use of this package
m1 <- market %>% mutate( date_bin = as.integer((date_sale - min(date_sale)) / 7) )
# observer seller buyer date_sale date_bin
# 1 Tom A X 2017-01-01 0
# 2 Fred A X 2017-01-03 0
# 3 Hank B Y 2017-01-04 0
# 4 Tom A X 2017-01-05 0
# 5 Joe A X 2017-01-09 1
The final id is going to be a combination of two "sub-IDs": the outer ID, which comes from grouping your data by date_bin, seller and buyer (i.e., all the distinct transactions that can occur within a 7-day bin), and the inner ID, which enumerates duplicate transactions made by the same observer within each group.
The two IDs can be computed as follows:
i1 <- m1 %>% group_by( date_bin, seller, buyer ) %>% group_indices()
m2 <- m1 %>% mutate( outID = i1 ) %>% group_by( outID, observer ) %>%
mutate( inID = 1:n() )
# observer seller buyer date_sale date_bin outID inID
# 1 Tom A X 2017-01-01 0 1 1
# 2 Fred A X 2017-01-03 0 1 1
# 3 Hank B Y 2017-01-04 0 2 1
# 4 Tom A X 2017-01-05 0 1 2
# 5 Joe A X 2017-01-09 1 3 1
Finally, we create the final id from all the unique pairs of outID and inID:
market %>% mutate( id = group_by( m2, outID, inID ) %>% group_indices() )
# observer seller buyer date_sale id
# 1 Tom A X 2017-01-01 1
# 2 Fred A X 2017-01-03 1
# 3 Hank B Y 2017-01-04 3
# 4 Tom A X 2017-01-05 2
# 5 Joe A X 2017-01-09 4
Note that the indices are not exactly in the same order as what you requested in your question, but since these are arbitrary integers, you can reassign them to desired values without loss of generality.
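For example, a small hedged follow-up (reusing m2 from above): remapping the ids with match() renumbers them by order of first appearance, which reproduces the labelling in the question.
market %>%
  mutate( id = group_by( m2, outID, inID ) %>% group_indices() ) %>%
  mutate( id = match( id, unique(id) ) )  # renumber ids by order of first appearance
# id is now 1, 1, 2, 3, 4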
I am working with a medication prescription dataset which I want to convert from long to wide format.
I tried to use the reshape function; however, it requires a time variable, which I don't have (at least not in a useful format, I believe).
Concept dataset:
id <- c(1, 1, 1, 2, 2, 3, 3, 3)
prescription_date <- c("17JAN2009", "02MAR2009", "20MAR2009", "05JUL2009", "10APR2009", "09MAY2009", "13JUN2009", "29MAY2009")
med <- c("A", "B", "A", "B", "A", "B", "A", "B")
df <- data.frame(id, prescription_date, med)
To create a time variable, I have tried numbering the medications per id (1st, 2nd, etc.), but I didn't succeed.
Background: I want this in a wide format to eventually create definitions for diagnoses (e.g. if a patient had more than one prescription of A, the diagnosis is confirmed). This has to be combined with factors from other datasets, hence the idea to go from long to wide.
Any help is much appreciated, thank you.
You might consider keeping the data in long format to perform some of these calculations. I would also suggest converting your dates into a date class that can be calculated upon; this will show, for instance, that the last two rows are not in chronological order:
library(dplyr)
df %>%
mutate(prescription_date = lubridate::dmy(prescription_date)) %>%
arrange(id, prescription_date) %>%
group_by(id) %>%
mutate(A_cuml = cumsum(med=="A"),
A_ttl = sum(med=="A")) %>%
ungroup()
# A tibble: 8 × 5
id prescription_date med A_cuml A_ttl
<dbl> <date> <chr> <int> <int>
1 1 2009-01-17 A 1 2
2 1 2009-03-02 B 1 2
3 1 2009-03-20 A 2 2
4 2 2009-04-10 A 1 1
5 2 2009-07-05 B 1 1
6 3 2009-05-09 B 0 1
7 3 2009-05-29 B 0 1
8 3 2009-06-13 A 1 1
If you calculate summary stats for each id, you might save this in a summarized table with one row per id and use joins (e.g. left_join) to append the results of each of these summaries.
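To illustrate that suggestion, here is a minimal sketch; the second table (other_data) is hypothetical and only stands in for the "factors from other datasets":
library(dplyr)

summary_tbl <- df %>%
  mutate(prescription_date = lubridate::dmy(prescription_date)) %>%
  group_by(id) %>%
  summarise(A_ttl = sum(med == "A"),
            diagnosis_confirmed = A_ttl > 1)  # e.g. "more than one prescription of A"

# hypothetical table from another data source, keyed by the same id
other_data <- data.frame(id = c(1, 2, 3), age = c(54, 61, 47))

other_data %>% left_join(summary_tbl, by = "id")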
I have a dataframe:
UserId <- c("A", "A", "A", "B", "B", "B")
SellerId <- c("X", "X", "Y", "Y", "Z", "Z")
Product <- c("ball", "ball", "ball", "ball", "doll", "doll")
SalesDate <- c("2022-01-01", "2022-01-01", "2022-01-02", "2022-01-04", "2022-01-06", "2022-01-07")
sales <- data.frame(UserId, SellerId, Product, SalesDate)
And I want to find sales for which:
the same user bought the same product twice from the same seller on the same day, but of course I need to do it on a larger scale.
I've been thinking for a long time about how to even apply one of these criteria, and nothing comes to mind. The table I should be left with in this case is:
UserId  SellerId  Product  SalesDate
A       X         ball     2022-01-01
A       X         ball     2022-01-01
UserId is the same, the seller is the same, the product is the same, and the sales date is the same. The problem is that I am not looking for specific users or specific products.
I would like to find all users who bought the same product twice (no matter what the product is; the list is long), and the same goes for the purchase date (the specific date doesn't matter, it just needs to be the same for the same user).
Do you have any ideas how to do even a part of the code?
Using dplyr, you can group_by_all() variables and filter out anything that does not have more than one record.
library(dplyr)
sales %>% group_by_all() %>% filter(n() > 1)
# A tibble: 2 × 4
# Groups: UserId, SellerId, Product, SalesDate [1]
UserId SellerId Product SalesDate
<chr> <chr> <chr> <chr>
1 A X ball 2022-01-01
2 A X ball 2022-01-01
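For comparison, a hedged base R equivalent uses duplicated() in both directions so that every copy of a duplicated row is kept:
sales[duplicated(sales) | duplicated(sales, fromLast = TRUE), ]
#   UserId SellerId Product  SalesDate
# 1      A        X    ball 2022-01-01
# 2      A        X    ball 2022-01-01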
Group by all and use filter. The difference from @benson23's answer (+1) is the use of across():
library(dplyr)
sales %>%
group_by(across(everything())) %>%
filter( n() > 1 )
or even as everything() is default:
sales %>%
group_by(across()) %>%
filter( n() > 1 )
Using add_count() will give you the count of each occurrence.
sales %>%
add_count(UserId, SellerId, Product, SalesDate)
UserId SellerId Product SalesDate n
1 A X ball 2022-01-01 2
2 A X ball 2022-01-01 2
3 A Y ball 2022-01-02 1
4 B Y ball 2022-01-04 1
5 B Z doll 2022-01-06 1
6 B Z doll 2022-01-07 1
From there on, you can filter for n == 2 or n > 1, depending on your question.
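For example, a minimal sketch of that filtering step, dropping the helper column n afterwards:
library(dplyr)
sales %>%
  add_count(UserId, SellerId, Product, SalesDate) %>%
  filter(n > 1) %>%  # keep only rows that occur more than once
  select(-n)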
I'm new to R and trying to use it in place of Excel (where I have more experience). I'm still working out the full 'for' logic, but not having the values to determine if it's working how I think it should is stopping me in my tracks. The goal is to generate what will be used as a factor with 3 levels: 0 = no duplicates; 1 = if duplicate, oldest; 2 = if duplicate, newest.
I have a dataframe that looks like this
Person <- c("A", "B", "C", "C", "D", "E","E")
Date <- c(1/1/20, 1/1/20,12/25/19, 1/1/20, 1/1/20, 12/25/19, 1/1/20)
ID <- c(1,2,3,4,5,6,7)
DuplicateStatus <- c(0,0,0,0,0,0,0)
IdealResult <- c(0,0,1,2,0,1,2)
mydata <- cbind(Person, Date, ID, DuplicateStatus, IdealResult)
I am trying to use a for loop to evaluate whether a person is duplicated. If a person is not duplicated, the value = 0; if they are duplicated, they should have a 1 for the oldest record and a 2 for the newest record (see IdealResult). NOTE: I have already sorted the data by person and then date, so if duplicated, the first appearance is the oldest.
Previous "VLOOKUP in R" answers here are aimed at merging datasets based on identical values across multiple datasets. Here, I am attempting to modify a column based on the relationship between columns within a single dataset.
currentID <- 0
nextID <- 0
for (i in mydata$ID) {
  currentID <- i
  nextID <- currentID + 1
  # CurrentPerson: Vlookup-style step - find currentID in ID and return the
  #   value in the Person column at the same position.
  # NextPerson: Vlookup-style step - find nextID in ID and return the
  #   value in the Person column at the same position.
  # If CurrentPerson equals NextPerson, then DuplicateStatus at the ID
  #   associated with CurrentPerson should be 1, and DuplicateStatus at the
  #   ID associated with NextPerson should be 2.
}
# This should end when the current person equals the total number of people.
Thanks!
You really need to spend some time with a simple tutorial on R. Your cbind() call converts all of your data to a character matrix, which is probably not what you want; look at the results of str(mydata). Instead of looping, this creates an index number within each Person group and then zeros out the groups with a single observation:
mydata <- data.frame(Person, Date, ID, DuplicateStatus, IdealResult)
IR <- ave(mydata$ID, mydata$Person, FUN=seq_along)
IR
# [1] 1 1 1 2 1 1 2
tbl <- table(mydata$Person)
tozero <- mydata$Person %in% names(tbl[tbl == 1])
IR[tozero] <- 0
IR
# [1] 0 0 1 2 0 1 2
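Since the question mentions using the result as a factor with 3 levels, a hedged final step could be to attach it to the data:
mydata$DuplicateStatus <- factor(IR, levels = c(0, 1, 2))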
Is what you are looking for just a count of the observations for each person, in one column (like an ID column)? If so, this will work using the tidyverse:
Person <- c("A", "B", "C", "C", "D", "E","E")
Date <- c(1/1/20, 1/1/20,12/25/19, 1/1/20, 1/1/20, 12/25/19, 1/1/20)
ID <- c(1,2,3,4,5,6,7)
DuplicateStatus <- c(0,0,0,0,0,0,0)
IdealResult <- c(0,0,1,2,0,1,2)
mydata <- data.frame(Person, Date, ID, DuplicateStatus, IdealResult)
library(tidyverse)
mydata <- mydata %>%
group_by(Person) %>%
mutate(Duplicate = seq_along(Person))
mydata
# A tibble: 7 x 6
# Groups: Person [5]
Person Date ID DuplicateStatus IdealResult Duplicate
<fct> <dbl> <dbl> <dbl> <dbl> <int>
1 A 0.05 1 0 0 1
2 B 0.05 2 0 0 1
3 C 0.0253 3 0 1 1
4 C 0.05 4 0 2 2
5 D 0.05 5 0 0 1
6 E 0.0253 6 0 1 1
7 E 0.05 7 0 2 2
You could assign the row number within each group, provided there is more than one row in it.
This can be implemented in base R, dplyr, as well as data.table.
In base R:
mydata$ans <- with(mydata, ave(ID, Person, FUN = function(x)
seq_along(x) * (length(x) > 1)))
# Person Date ID IdealResult ans
#1 A 0.0500000 1 0 0
#2 B 0.0500000 2 0 0
#3 C 0.0252632 3 1 1
#4 C 0.0500000 4 2 2
#5 D 0.0500000 5 0 0
#6 E 0.0252632 6 1 1
#7 E 0.0500000 7 2 2
Using dplyr:
library(dplyr)
mydata %>% group_by(Person) %>% mutate(ans = row_number() * (n() > 1))
and with data.table
library(data.table)
setDT(mydata)[, ans := seq_along(ID) * (.N > 1), Person]
data
mydata <- data.frame(Person, Date, ID, IdealResult)
I would argue that n() is the ideal function for your problem:
library(tidyverse)
mydata <- mydata %>%
group_by(Person) %>%
mutate(Duplicate = n())
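If you also need the 0/1/2 coding from IdealResult rather than the raw group sizes, a hedged tweak (assuming mydata was built with data.frame() as in the other answers, and is sorted by person and date) combines n() with row_number():
library(dplyr)
mydata %>%
  group_by(Person) %>%
  mutate(Duplicate = if_else(n() > 1, row_number(), 0L)) %>%  # 0 if unique, 1/2 if duplicated
  ungroup()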
I have an easy question related to the dplyr library in R.
My actual data frame looks like this:
Players <- data.frame(Group = c("A", "A", "A", "A", "B", "B", "B", "C","C","C"), Players= c("Jhon", "Jhon", "Jhon", "Charles", "Mike", "Mike","Carl", "Max", "Max","Max"))
which gives:
Group Players
A Jhon
A Jhon
A Jhon
A Charles
B Mike
B Mike
B Carl
C Max
C Max
C Max
And I would like to get another data frame with the most repeated player of each group and how many times they are listed. So I would like to get this data frame:
Group Players TimesListed
A Jhon 3
B Mike 2
C Max 3
I have tried this:
Station <- Players %>% group_by(Group,Players) %>%
summarise(TimesListed=length(Players)) %>%
summarise(TimesListed=max(TimesListed))
But I get a data frame without the names of the players like this:
Group TimesListed
1 A 3
2 B 2
3 C 3
Any idea? Thank you!
This should get you what you want:
library(dplyr)
Players %>%
group_by(Group) %>%
count(Players) %>%
top_n(1, n)
# A tibble: 3 x 3
# Groups: Group [3]
Group Players n
<fctr> <fctr> <int>
1 A Jhon 3
2 B Mike 2
3 C Max 3
You could do the following to convert the factors to characters:
Players[] <- lapply(Players, as.character)
And if you need to change variable n to TimesListed, add the following to the end of the chain:
rename(TimesListed = n)
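Putting the pieces together, the full chain might then look like this (the rename() step being optional):
library(dplyr)
Players %>%
  group_by(Group) %>%
  count(Players) %>%
  top_n(1, n) %>%
  rename(TimesListed = n)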
You can use the aggregate function in base R (with Players being the data frame from the question):
aggregate(. ~ Group, Players, function(x) max(table(x)))
Group Players
1 A 3
2 B 2
3 C 3
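Since the aggregate() call above collapses the player names into counts, here is a hedged base R variant that also keeps the name of the most repeated player in each group:
# count each Group/Players pair, then keep the row with the maximum count per Group
counts <- aggregate(TimesListed ~ Group + Players,
                    data = transform(Players, TimesListed = 1),
                    FUN = sum)
do.call(rbind, lapply(split(counts, counts$Group),
                      function(d) d[which.max(d$TimesListed), ]))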
For completeness, here is a solution using data.table.
library(data.table)
setDT(Players)
Players[, .(TimesListed = .N), by = .(Group, Players)][
, .SD[which.max(TimesListed)], by = Group]
# Group Players TimesListed
# 1: A Jhon 3
# 2: B Mike 2
# 3: C Max 3
The above solution returns the first row with the maximum in TimesListed. If we want to return all the rows equal to the maximum, we can do the following (in this case, the two solutions lead to the same result).
Players[, .(TimesListed = .N), by = .(Group, Players)][
, .SD[TimesListed == max(TimesListed)], by = Group]
# Group Players TimesListed
# 1: A Jhon 3
# 2: B Mike 2
# 3: C Max 3
I've got a data frame in R that looks like the following:
cust = c("A", "B", "C", "A", "B", "E", "A", "F", "A", "G")
period = as.Date(c("2013/1/1", "2013/1/1", "2013/1/1", "2013/1/2", "2013/1/2",
"2013/1/2", "2013/1/3", "2013/1/3", "2013/1/4", "2013/1/4"))
df = data.frame(cust, period)
I wanted to transform it in a way that I can arrive at the following format as an output:
period NumCust_Initial GainedCust LostCust NumCust_EndUpWith
1/1/2013 3 NA NA NA
2/1/2013 3 1 1 3
3/1/2013 2 1 2 2
4/1/2013 2 1 1 2
The idea is that I'd arrive at a count of unique customers for each period (NumCust_Initial). Then, I'd calculate the number of new customers acquired (GainedCust) and the number of customers lost (LostCust), both relative to the previous period. Finally, I'd do a calculation that would get the number of customers I end up with (NumCust_EndUpWith).
From df, in 2/1/2013 I had 3 unique customers. I gained 1 (relative to 1/1/2013) but lost another 1 (relative to 1/1/2013), so I ended up with 3 customers (calculated as the NumCust_Initial of 3 from 1/1/2013, plus the GainedCust in 2/1/2013, minus the LostCust in 2/1/2013).
Similarly, we can see from df that in 3/1/2013 we started with 2 customers. We then gained 1 new customer (relative to 2/1/2013) and lost 2 customers (relative to 2/1/2013). And so on.
How can I perform all these transformations / calculations in R? I've tried looking at some of the functions in dplyr and reshape2 but have not arrived at anything yet. Has anybody faced similar data transformation challenges in R before?
You can do this with a combination of tidyr and dplyr. Be sure to install the development version of tidyr.
# required packages
require(tidyr) # development version
require(dplyr)
df %>%
mutate(current = TRUE) %>%
complete(period, cust, fill = list(current = FALSE)) %>%
group_by(cust) %>%
mutate(gain = c(NA, diff(current))) %>%
group_by(period) %>%
summarise(GainedCust = sum(gain > 0),
LostCust = sum(gain < 0),
NumCust_EndUpWith = sum(current))
## Source: local data frame [4 x 4]
##
## period GainedCust LostCust NumCust_EndUpWith
## 1 2013-01-01 NA NA 3
## 2 2013-01-02 1 1 3
## 3 2013-01-03 1 2 2
## 4 2013-01-04 1 1 2
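If you also want the NumCust_Initial column from the desired output (the number of distinct customers seen in each period), a hedged addition, assuming the result of the pipeline above has been saved as res, is to count distinct customers per period and join:
library(dplyr)
df %>%
  group_by(period) %>%
  summarise(NumCust_Initial = n_distinct(cust)) %>%
  left_join(res, by = "period")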