Pairing observations in R - r

Let's take the following sample dataset:
counterparty1 <- c("A","B","B","B","B")
counterparty2 <- c("B","C","A","A","C")
counterparty1_side <- c("buy","sell","buy","sell","sell")
price <- c(1.2,3.7,2.5,1.2,3.7)
sample.data <- data.frame(counterparty1,counterparty2,counterparty1_side,price)
Rows 1 and 4 actually give identical observations - the only issue is that row 1 says that "A" buys the asset (implying that "B" sells) and in row 4 it says that "B" sells the asset (implying that "A" buys).
I'd like code to create the following dataset:
counterparty1 <- c("A","B","B","B","B")
counterparty2 <- c("B","C","A","A","C")
counterparty1_side <- c("buy","sell","buy","sell","sell")
price <- c(1.2,3.7,2.5,1.2,3.7)
transaction_number <- c(1,2,3,1,4)
duplicate <- c(1,0,0,1,0)
clean.data <- data.frame(counterparty1,counterparty2,counterparty1_side,price,transaction_number,duplicate)
In reality of course my dataset is much, much larger so I can't hard-code.
Update: I added row 5, which is identical to row 2, including the fact that counterparty 1 and 2 are in the same order. I want the "duplicate" variable to only flag rows 1 and 4 as duplicates (since they are inverses), not rows 2 and 5.

Updated Answer:
Addressing OP's follow up question, stating that if the very same transaction happens twice, it should not be picked up as duplicates.(for instance party B selling something to party C for $3.7K on two occasions); read the comments and updated question.
library(dplyr)
sample.data %>%
mutate(transaction=if_else(counterparty1_side=="buy",
paste0(counterparty1,counterparty2),
paste0(counterparty2,counterparty1))) %>%
group_by_all %>%
mutate(dup_dum = 1:n()) %>%
group_by(transaction, dup_dum) %>%
mutate(transaction_number = group_indices(),
duplicate = +(n()!=n_distinct(transaction, dup_dum))) %>%
ungroup() %>% select(-transaction, -dup_dum)
#> # A tibble: 5 x 6
#> counterparty1 counterparty2 counterparty1_s~ price transaction_num~ duplicate
#> <fct> <fct> <fct> <dbl> <int> <int>
#> 1 A B buy 1.2 1 1
#> 2 B C sell 3.7 3 0
#> 3 B A buy 2.5 2 0
#> 4 B A sell 1.2 1 1
#> 5 B C sell 3.7 4 0
Original Answer:
Considering dupes (doesn't matter if they are dupes just because counter-party role has changed or they are actual dupes) (look at the edits to the question to see the first version of the question).
library(dplyr)
sample.data %>%
mutate(transaction=if_else(counterparty1_side=="buy",
paste0(counterparty1,counterparty2),
paste0(counterparty2,counterparty1))) %>%
group_by(transaction) %>%
mutate(transaction_number = group_indices(),
duplicate = +(n()!=n_distinct(transaction))) %>%
ungroup() %>% select(-transaction)
# # A tibble: 4 x 6
# counterparty1 counterparty2 counterparty1_side price transaction_number duplicate
# <fct> <fct> <fct> <dbl> <int> <int>
# 1 A B buy 1.2 1 1
# 2 B C sell 3.7 3 0
# 3 B A buy 2.5 2 0
# 4 B A sell 1.2 1 1

Related

R tables with proportion

ID <- c(1,2,3,4,5,6,7,8)
Hospital <- c("A","A","A","A","B","B","B","B")
risk <- c("Low","Low","High","High","Low","Low","High","High")
retest <- c(1,0,1,1,1,1,0,1)
df <- data.frame(ID, Hospital, risk, retest)
# freq. table
df %>% group_by(risk, Hospital) %>%
summarise(n=n())%>%
spread(Hospital,n)
# A tibble: 2 × 3
# Groups: risk [2]
risk A B
<chr> <int> <int>
1 High 2 2
2 Low 2 2
#freq. table of retest by risk and Hospital
df %>%
group_by(risk, Hospital) %>%
#summarise(n=n()) %>%
summarise(retestsum = sum(retest))%>%
spread(Hospital, retestsum)
# A tibble: 2 × 3
# Groups: risk [2]
risk A B
<chr> <dbl> <dbl>
1 High 2 1
2 Low 1 2
I want to get the proportions of retest by Hospital and by risk categories.
For example, Hospital A, low risk , retested 1 person / 2 person = 50.
Need to create A% B% columns to get the final result of the table below.
Please help me get the prop. columns and also (n=x) part in the final table.
Just divide the second table's numeric values by those of the first. Fortunately elementwise division does not destroy the structure if the two tibbles have the same dimensions:
d2 <- df1 %>% group_by(risk, Hospital) %>%
summarise(n=n())%>%
spread(Hospital,n)
`summarise()` has grouped output by 'risk'. You can override using the `.groups` argument.
d3 <- df1 %>%
group_by(risk, Hospital) %>%
#summarise(n=n()) %>%
summarise(retestsum = sum(retest))%>%
spread(Hospital, retestsum)
You can deliver a proportion or a percentage
# proportion
> d3[-1]/d2[-1]
A B
1 1.0 0.5
2 0.5 1.0
#percentage
> 100*d3[-1]/d2[-1]
A B
1 100 50
2 50 100
``

Divide multiple variable values by a specific value in R

I'm trying to pull something that is simple but can't seem to get my head over it. My data looks like this
|Assay|Sample|Number|
|A|1|10|
|B|1|25|
|C|1|30|
|A|2|45|
|B|2|65|
|C|2|8|
|A|3|10|
|B|3|81|
|C|3|12|
What I need to do is to divide each "Number" value for each sample by the value of the respective assay A. That is, for sample 1, I would like to have 10/10, 25/10 and 30/10. Then for sample 2, I would need 45/45, 65/45 and 8/45 and so on with the rest of the samples.
I have already tried doing:
mutate(Normalised = Number/Number[Assay == "A"])
as suggested in another post but the results are not correct.
Any help would be great. Thank you very much!
Using dplyr
df <- data.frame(Assay=rep(c('A','B','C'),3),
Sample=rep(1:3,each=3),
Number=c(10,25,30,45,65,8,10,81,12))
df <- df %>%
group_by(Sample) %>%
arrange(Assay) %>%
mutate(Normalised=Number/first(Number)) %>%
ungroup() %>%
arrange(Sample)
gives out
> df
# A tibble: 9 × 4
Assay Sample Number Normalised
<chr> <int> <dbl> <dbl>
1 A 1 10 1
2 B 1 25 2.5
3 C 1 30 3
4 A 2 45 1
5 B 2 65 1.44
6 C 2 8 0.178
7 A 3 10 1
8 B 3 81 8.1
9 C 3 12 1.2
Note: I added arrange(Assay) just to make sure "A" is always the first row within each group. Also, arrange(Sample) is there just to get the output in the same order as it was but it doesn't really need to be there if you don't care about the display order.

mutate new column based on conditions in another row in R

I am working with a dataset of animal behaviors, and am trying to create a new column ("environment") based on conditions fulfilled in another row. Specifically, I want the new column to return "water" if the behavior falls between the start/stop times of the behavior "o_water", and "land" if it falls outside these bounds. If this is unclear here is a minimal example:
library(dplyr)
library(magrittr)
otters <- data.frame(
observation_id = 1,
subject = 1,
behavior = c("o_water", "swim", "float", "o_land", "walk", "o_water", "float"),
start_time = c(1,1,2,6,6,11,11),
stop_time = c(5,3,4,10,9,15,14)
)
#this does it, but manually. need to go over very large dataset and search for conditions
otters <- otters %>%
group_by(subject, observation_id, behavior) %>%
mutate(environment = ifelse((start_time >= 1 & stop_time <= 5) |
(start_time >= 11 & stop_time <= 15), "water", "land"))
This is the output desired.
Groups: subject, observation_id, behavior [5]
observation_id subject behavior start_time stop_time environment
<dbl> <dbl> <fct> <dbl> <dbl> <chr>
1 1 1 o_water 1 5 water
2 1 1 swim 1 3 water
3 1 1 float 2 4 water
4 1 1 o_land 6 10 land
5 1 1 walk 6 9 land
6 1 1 o_water 11 15 water
7 1 1 float 11 14 water
The second set of commands is sort of what I want, but I need this to search out and apply it to an entire dataset rather than typing out each parameter. The grouping is so the functions are performed over the applicable rows; in the full dataset, there are multiple subjects and observation_id's.
I've tried using when() and case_when() to no avail, but I am very novice level at R so would appreciate any help!
Apologies for any missteps I've done. I haven't been able to find a problem quite like this elsewhere on stackoverflow.
Here is another approach with dplyr that also uses fuzzyjoin package.
You can separate your o_water behavior rows from otters and designate the environment as water.
Then, with fuzzy_left_join, merge the o_water rows with the rest of your data, where the start_time and end_time fall between the o_water range.
The remaining NA in environment will be non-merged rows, which can be land or other designation.
library(dplyr)
library(fuzzyjoin)
otters_water <- otters %>%
filter(behavior == "o_water") %>%
mutate(environment = "water") %>%
select(-behavior)
otters %>%
fuzzy_left_join(otters_water,
by = c("subject", "observation_id", "start_time", "stop_time"),
match_fun = list(`==`, `==`, `>=`, `<=`)) %>%
replace_na(list(environment = "land")) %>%
select(c(observation_id.x:stop_time.x, environment))
Output
observation_id.x subject.x behavior start_time.x stop_time.x environment
1 1 1 o_water 1 5 water
2 1 1 swim 1 3 water
3 1 1 float 2 4 water
4 1 1 o_land 6 10 land
5 1 1 walk 6 9 land
6 1 1 o_water 11 15 water
7 1 1 float 11 14 water
Great job on your question.
I think rearranging your dataset will help a lot here. I'd suggest rearranging it so that each time point has only one record (per individual otter, perhaps), and individual behaviors each have their own column, with binary data indicating whether or not that behavior is occurring at each time point.
There's a lot of rearranging that happens in the first few lines; I'd suggest stepping through the code one line at a time just to see how each line moves the data around.
Using the data you provided:
library(tidyverse)
otters_wide <- otters %>%
# first pivot to a longer form, so the time values are all in one column
pivot_longer(cols = c("start_time", "stop_time"), names_to = "start_stop", values_to = "time", names_pattern = "(.*)_time") %>%
# then pivot to w wider format, so each behavior has its own column.
pivot_wider(names_from = "behavior", values_from = "start_stop") %>%
#Then arrange everything in order of time.
arrange(time) %>%
#Fill behavior columns downward
fill(o_water, swim, float, o_land, walk) %>%
#change all "start"s and the first "stop" in each series to "yes", and all other "stop"s to "no"
mutate_at(.vars = c("o_water", "swim", "float", "o_land", "walk"), ~ if_else(. == "start" | lag(.) == "start", "yes", "no")) %>%
# this column is a little redundant now, but here's the water/land column, at last.
mutate(environment = if_else(o_water == "yes", "water", "land"))
otters_wide
# A tibble: 11 x 9
observation_id subject time o_water swim float o_land walk environment
<dbl> <dbl> <dbl> <chr> <chr> <chr> <chr> <chr> <chr>
1 1 1 1 yes yes NA NA NA water
2 1 1 2 yes yes yes NA NA water
3 1 1 3 yes yes yes NA NA water
4 1 1 4 yes no yes NA NA water
5 1 1 5 yes no no NA NA water
6 1 1 6 no no no yes yes land
7 1 1 9 no no no yes yes land
8 1 1 10 no no no yes no land
9 1 1 11 yes no yes no no water
10 1 1 14 yes no yes no no water
11 1 1 15 yes no no no no water
In this alternate format you could go even further and include time points that weren't explicitly recorded (e.g. times 7, 8, 12 & 13, in this example) which, once filled in, would make summarizing things like total time spent on each behavior much more straightforward.
Since the behavioral data is binary, those columns could certainly contain logical data rather than character data, but because they started as character data, keeping them that way seemed simplest.
Hope this helps.
Here is another approach, in which I extract all the start_ and stop_times of each behavior of o_water into a list. As there are 2 entries for o_water these lists have two elements.
With the help of purrr I create a list pu of vectors which span the start_ and stop_time for each entry.
purrr also helps to find which row in otters falls within the sequences defined in pu. l_true is the sum of all columns which have a FALSE/TRUE, it contains either a 1, if there is one entry matching the time sequence or a 0 if it does not.
This list is bound to otters and with ifelse the new column is named.
library(tidyverse)
otters <- data.frame(
observation_id = 1,
subject = 1,
behavior = c("o_water", "swim", "float", "o_land", "walk", "o_water", "float"),
start_time = c(1,1,2,6,6,11,11),
stop_time = c(5,3,4,10,9,15,14)
)
# otters
# find all start_times of 'o_water'
otters %>%
dplyr::filter(grepl('water', behavior)) %>%
select(ends_with('time')) %>%
`[[`(1) -> start
start
#> [1] 1 11
# find all stop_times
otters %>%
dplyr::filter(grepl('water', behavior)) %>%
select(ends_with('time')) %>%
`[[`(2) -> stop
stop
#> [1] 5 15
# bring start and stop_times together in one
# list per 'o_water'
pu <- purrr::map2(start, stop, ~ .x : .y)
pu
#> [[1]]
#> [1] 1 2 3 4 5
#>
#> [[2]]
#> [1] 11 12 13 14 15
# check with pu, if start_ and stop_time of each row is
# in pu, and combine the row of FALSE/TRUE into a list
l_true <- map_dfc(pu, ~ otters$start_time %in% . & otters$stop_time %in% .) %>%
mutate(l = rowSums(.[1:ncol(.)])) %>% `[[`(ncol(.))
otters %>%
cbind(., l_true) %>%
mutate(ev = ifelse(l_true == 1, 'water', 'land'))
#> observation_id subject behavior start_time stop_time l_true ev
#> 1 1 1 o_water 1 5 1 water
#> 2 1 1 swim 1 3 1 water
#> 3 1 1 float 2 4 1 water
#> 4 1 1 o_land 6 10 0 land
#> 5 1 1 walk 6 9 0 land
#> 6 1 1 o_water 11 15 1 water
#> 7 1 1 float 11 14 1 water

dplyr: calculate the days for product replenishment

I am working on a dataset in which I need to calculate how long does it take for a retail store to replenish some products from shortage, and here is a quick view of the dataset in the simplest form:
Date <- c("2019-1-1","2019-1-2","2019-1-3","2019-1-4","2019-1-5","2019-1-6","2019-1-7","2019-1-8")
Product <- rep("Product A",8)
Net_Available_Qty <- c(-2,-2,10,8,-5,-6,-7,0)
sample_df <- data.frame(Date,Product,Net_Available_Qty)
When the Net_Available_Qty becomes negative, it means there is a shortage. When it turns back to 0 or positive qty, it means the supply has been recovered. What I need to calculate is the days between when we first see shortage and when it is recovered. In this case, for the 1st shortage, it took 2 days to recover and for the second shortage, it took 3 days to recover.
A tidyverse solution would be most welcome.
I hope someone else finds a cleaner solution. But this produces diffDate which assigns the date difference from when a negative turns positive/zero.
sample_df %>%
mutate(sign = ifelse(Net_Available_Qty > 0, "pos", ifelse(Net_Available_Qty < 0, "neg", "zero")),
sign_lag = lag(sign, default = sign[1]), # get previous value (exception in the first place)
change = ifelse(sign != sign_lag, 1 , 0), # check if there's a change
sequence=sequence(rle(as.character(sign))$lengths)) %>%
group_by(sequence) %>%
mutate(diffDate = as.numeric(difftime(Date, lag(Date,1))),
diffDate=ifelse(Net_Available_Qty <0, NA, ifelse((sign=='pos'| sign=='zero') & sequence==1, diffDate, NA))) %>%
ungroup() %>%
select(Date, Product, Net_Available_Qty, diffDate)
#Schilker had a great idea using rle. I am building on his answer and offering a slightly shorter version including the use of cumsum
Date <- c("2019-1-1","2019-1-2","2019-1-3","2019-1-4","2019-1-5","2019-1-6","2019-1-7","2019-1-8")
Product <- rep("Product A",8)
Net_Available_Qty <- c(-2,-2,10,8,-5,-6,-7,0)
sample_df <- data.frame(Date,Product,Net_Available_Qty)
library(tidyverse)
sample_df %>%
mutate(
diffDate = c(1, diff(as.Date(Date))),
sequence = sequence(rle(Net_Available_Qty >= 0)$lengths),
group = cumsum(c(TRUE, diff(sequence)) != 1L)
) %>%
group_by(group) %>%
mutate(n_days = max(cumsum(diffDate)))
#> # A tibble: 8 x 7
#> # Groups: group [4]
#> Date Product Net_Available_Qty diffDate sequence group n_days
#> <fct> <fct> <dbl> <dbl> <int> <int> <dbl>
#> 1 2019-1-1 Product A -2 1 1 0 2
#> 2 2019-1-2 Product A -2 1 2 0 2
#> 3 2019-1-3 Product A 10 1 1 1 2
#> 4 2019-1-4 Product A 8 1 2 1 2
#> 5 2019-1-5 Product A -5 1 1 2 3
#> 6 2019-1-6 Product A -6 1 2 2 3
#> 7 2019-1-7 Product A -7 1 3 2 3
#> 8 2019-1-8 Product A 0 1 1 3 1
Created on 2020-02-23 by the reprex package (v0.3.0)

Filtering on grouped variable maintaining sequence

I have a dataframe :
df <- data.frame(
Group=c('A','A','A','A','B','B','B','B'),
Activity = c('EOSP','NOR','EOSP','COSP','NOR','EOSP','WL','NOR'),
TimeLine=c(1,2,3,4,1,2,3,4)
)
I want to filter for only two activities for each group and in the order in which I am filtering. For example, I am only looking for the activities EOSP and NOR but in the order too. This code:
df %>% group_by(Group) %>%
filter(all(c('EOSP','NOR') %in% Activity) & Activity %in% c('EOSP','NOR'))
results in:
# A tibble: 6 x 3
# Groups: Group [2]
Group Activity TimeLine
<fct> <fct> <dbl>
1 A EOSP 1
2 A NOR 2
3 A EOSP 3
4 B NOR 1
5 B EOSP 2
6 B NOR 4
I don't want row 3 as EOSP occurs after NOR. Similarly for group B, I don't want row 4, as NOR is occurring before EOSP. How do I achieve this?
You can use match to get the first instance of Activity == EOSP and use slice to remove everything before that. Once you do that, then you can remove duplicates and filter on EOSP and NOR, i.e.
library(tidyverse)
df %>%
group_by(Group) %>%
mutate(new = match('EOSP', Activity)) %>%
slice(new:n()) %>%
distinct(Activity, .keep_all = TRUE) %>%
filter(Activity %in% c('EOSP', 'NOR'))
which gives,
# A tibble: 4 x 4
# Groups: Group [2]
Group Activity TimeLine new
<fct> <fct> <dbl> <int>
1 A EOSP 1 1
2 A NOR 2 1
3 B EOSP 2 2
4 B NOR 4 2
NOTE 1: You can ungroup() and select(-new)
NOTE 2: The warning messages being issued here
(Warning messages:
1: In new:4L : numerical expression has 4 elements: only the first used
2: In new:4L : numerical expression has 4 elements: only the first used
)
do not affect us since we only need it to use the first element since all are the same anyway
here is an option with data.table package: you join df with itself, subsetted it to keep only EOSP Activity and computing the min of TimeLine by group, then you can keep only the rows with TimeLine greater or equal to this TimeLine, in order to be sure you keep NOR only if there is EOSP before. Then you drop duplicated Group and Activity if you want to only keep 2 activities per group:
df[df[Activity=="EOSP", min(TimeLine), by=Group], on="Group"][Activity %in% c("NOR", "EOSP") & TimeLine >= V1][!duplicated(paste(Group, Activity))]
# Group Activity TimeLine V1
#1: A EOSP 1 1
#2: A NOR 2 1
#3: B EOSP 2 2
#4: B NOR 4 2
Here is a dplyr idea:
df %>%
filter(Activity %in% c('EOSP','NOR')) %>%
group_by(Group) %>%
mutate(tmp = which(Activity == 'EOSP' & !duplicated(Activity))) %>%
filter(row_number() %in% c(tmp, tmp+1))
# A tibble: 4 x 4
# Groups: Group [2]
Group Activity TimeLine tmp
<fct> <fct> <dbl> <int>
1 A EOSP 1 1
2 A NOR 2 1
3 B EOSP 2 2
4 B NOR 4 2

Resources