Matching rows within the same dataset in R

I have a dataset of unique matches like this. Each row is a match with its result.
date <- c('2017/12/01','2017/11/01','2017/10/01','2017/09/01','2017/08/01','2017/07/01','2017/06/01')
team1 <- c('A','B','B','C','D','A','B')
team1_score <- c(1,0,4,3,5,6,7)
team2 <- c('B','A','A','B','C','C','A')
team2_score <- c(0,1,5,4,6,9,10)
matches <- data.frame(date, team1, team1_score, team2, team2_score)
I want to create 2 new columns: the form of team 1 and the form of team 2 going into each match. The result of a match is determined by which team has the larger score (or a draw if the scores are equal), and the form is a team's results in its previous 2 matches. For example, for the first 3 rows, the forms of team 1 and team 2 are, respectively:
Form1: W-W, L-W, W-L
Form2: L-L, W-L, L-W
When a team has fewer than 2 previous matches, a result of NULL is sufficient.
In the actual data set there are many more than 4 unique teams. I have been thinking about this but can't find a good way to create these 2 variables.

Here is my solution:
library(tidyverse)
date <- as.Date(c('2017/12/01','2017/11/01','2017/10/01','2017/09/01','2017/08/01','2017/07/01','2017/06/01', '2017/05/30'))
team1 <- c('A','B','B','C','D','A','B','A')
team1_score <- c(1,0,4,3,5,6,7,0)
team2 <- c('B','A','A','B','C','C','A','D')
team2_score <- c(0,1,5,4,6,9,10,0)
matches <- data.frame(date, team1, team1_score, team2, team2_score)
## 1. Create a unique identifier for each match. It assumes that teams can only play each other once a day.
matches$UID <- paste(matches$date, matches$team1, matches$team2, sep = "-")
## 2. Create a score difference variable reflecting team1's score
matches <- matches %>% mutate(score_dif_team1 = team1_score - team2_score)
## 3. Create a Result (WDL) reflecting team1's results
matches <- matches %>% mutate(results_team1 = if_else(score_dif_team1 < 0, true = "L", false = if_else(score_dif_team1 > 0, true = "W", false = "D")))
## 4. Cosmetic step: Reorder variables for easier comparison across variables
matches <- matches %>% select(UID, date:results_team1)
## 5. Reshape the table into a long format based on the teams. Each observation will now reflect the results of 1 team within a match. Each game will have two observations.
matches <- matches %>% gather(key = old_team_var, value = team, team1, team2)
## 6. Establish a common results variable for each observation. It inverts the results_team1 variable for team2 rows and keeps results_team1 unchanged for team1 rows.
matches <- matches %>%
  mutate(results = if_else(old_team_var == "team2",
                           true = if_else(results_team1 == "W",
                                          true = "L",
                                          false = if_else(results_team1 == "L",
                                                          true = "W",
                                                          false = "D")),
                           false = results_team1))
## Final step: Filter the matches table to the dates you are interested in, then reshape it to show the counts of D, L and W per team in wide format.
Results_table <- matches %>%
  filter(date <= as.Date("2017-12-01")) %>%
  group_by(team, results) %>%
  summarise(cases = n()) %>%
  spread(key = results, value = cases, fill = 0)
## Results:
# A tibble: 4 x 4
# Groups: team [4]
team D L W
* <chr> <dbl> <dbl> <dbl>
1 A 1 1 4
2 B 0 4 1
3 C 0 1 2
4 D 1 1 0
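The table above counts results per team; the question also asks for the form of each team going into a match. A minimal sketch of one way to get there, assuming a team plays at most once per date and that form is written most recent result first (which matches the W-W, L-W, W-L example in the question). The names long, with_form, form1 and form2 are made up for illustration, and NA stands in for the requested NULL:

```r
library(dplyr)

matches <- data.frame(
  date = as.Date(c("2017/12/01", "2017/11/01", "2017/10/01", "2017/09/01",
                   "2017/08/01", "2017/07/01", "2017/06/01")),
  team1 = c("A", "B", "B", "C", "D", "A", "B"),
  team1_score = c(1, 0, 4, 3, 5, 6, 7),
  team2 = c("B", "A", "A", "B", "C", "C", "A"),
  team2_score = c(0, 1, 5, 4, 6, 9, 10)
)

# One row per team per match, with the score difference from that team's side
long <- bind_rows(
  transmute(matches, date, team = team1, score_diff = team1_score - team2_score),
  transmute(matches, date, team = team2, score_diff = team2_score - team1_score)
) %>%
  mutate(result = case_when(score_diff > 0 ~ "W",
                            score_diff < 0 ~ "L",
                            TRUE ~ "D")) %>%
  group_by(team) %>%
  arrange(date, .by_group = TRUE) %>%
  mutate(
    prev1 = lag(result, 1),   # result of the previous match
    prev2 = lag(result, 2),   # result of the match before that
    # NA when the team has fewer than 2 earlier matches
    form = if_else(is.na(prev1) | is.na(prev2),
                   NA_character_,
                   paste(prev1, prev2, sep = "-"))
  ) %>%
  ungroup() %>%
  select(date, team, form)

# Attach the form of each side back onto the original one-row-per-match table
with_form <- matches %>%
  left_join(long, by = c("date", "team1" = "team")) %>%
  rename(form1 = form) %>%
  left_join(long, by = c("date", "team2" = "team")) %>%
  rename(form2 = form)
```

Because the lag is taken within each team after sorting by date, the form only uses matches played before the current one. To use more than 2 previous matches, add further lag() columns and paste them in the same way.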

Related

Filtering dataframe to only show 1 pair of two variables

I have information on physicians working in different hospitals at different points in time. I would like to output a dataframe that lists each pair of physicians in each hospital. I would like to see each pair only once in the dataset, meaning that if physicians A and B work together in the same hospital I would like to see either the pair A-B or the pair B-A, but not both.
Consider the very simple example of hospitals x-y-w, periods 1-2 and physicians A-B-C-D.
mydf <- data.frame(hospital = c("x","x","x","x","x","y","y","y","w","w","w","w"),
period = c(1,1,1,2,2,1,2,2,1,1,2,2),
physician = c("A","B","C","A","B","A","A","C","C","D","A","D"))
Below I manage to get all pairs, but each pair shows up twice (with from and to swapped). How can I get each pair to show up only once in the output?
pairs_df <- mydf %>%
  rename(from = physician) %>%
  left_join(mydf, by=c("hospital","period")) %>%
  rename(to = physician) %>%
  filter(from!=to)
We can use pmin/pmax to order the 'from' and 'to' values within each row, apply duplicated to the sorted pairs, and negate it (!) inside filter to keep only the first occurrence of each pair
library(dplyr)
pairs_df %>%
filter(!duplicated(cbind(pmax(from, to), pmin(from, to))))
Or use base R
subset(pairs_df, !duplicated(cbind(pmax(from, to), pmin(from, to))))
-output
hospital period from to
1 x 1 A B
2 x 1 A C
4 x 1 B C
11 w 1 C D
13 w 2 A D
NOTE: Here, we assume that the columns are of class character, based on the input data: data.frame uses stringsAsFactors = FALSE by default from R 4.0.0 onwards, whereas previously the default was TRUE. If the columns are factors, we can convert them to character with type.convert
pairs_df <- type.convert(pairs_df, as.is = TRUE)
Or convert the factor columns to character before the filter
pairs_df %>%
  mutate(across(where(is.factor), as.character)) %>%
  filter(!duplicated(cbind(pmax(from, to), pmin(from, to))))
Another option is using igraph
library(igraph)
get.data.frame(
  simplify(
    graph_from_data_frame(
      pairs_df[c("from", "to", "hospital", "period")],
      directed = FALSE
    ),
    edge.attr.comb = "first"
  )
)
which gives
from to hospital period
1 A B x 1
2 A C x 1
3 A D w 2
4 B C x 1
5 C D w 1
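Note that both approaches above keep a pair only once across the whole data set, so A-B in hospital x, period 2 is dropped because it already appeared in period 1. If each pair should instead appear once per hospital and period, a minimal base R sketch is to self-join and keep only the rows where from sorts before to. This assumes the physician column is character, and the name pairs_once is made up for illustration:

```r
mydf <- data.frame(
  hospital = c("x","x","x","x","x","y","y","y","w","w","w","w"),
  period = c(1,1,1,2,2,1,2,2,1,1,2,2),
  physician = c("A","B","C","A","B","A","A","C","C","D","A","D"),
  stringsAsFactors = FALSE
)

# Self-join on hospital and period to get every ordered pair, including self-pairs
pairs_all <- merge(mydf, mydf, by = c("hospital", "period"),
                   suffixes = c("_from", "_to"))

# Keeping from < to drops self-pairs and exactly one of A-B / B-A
pairs_once <- subset(pairs_all, physician_from < physician_to)
names(pairs_once)[names(pairs_once) == "physician_from"] <- "from"
names(pairs_once)[names(pairs_once) == "physician_to"] <- "to"
```

With the example data this keeps 7 rows: three pairs in hospital x in period 1, and one pair each for x in period 2, y in period 2, w in period 1 and w in period 2.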

match values in 2 columns with the corresponding position in another character column

An example dataframe:
example_df = data.frame(Gene.names = c("A", "B"),
Score = c("3.69,2.97,2.57,3.09,2.94",
"3.99,2.27,2.89,2.89,2.00,2.52,2.09,2.83"),
ResidueAA = c("S", "Y"),
ResidueNo = c(3, 3),
Sequence = c("MSSYT", "MSSYTRAP") )
I want to check whether the character in the ResidueAA column matches the character in the 'Sequence' column at the position given by the ResidueNo column. The output should be another column, say, 'Check', with a Yes or No.
This is working code:
example_df$Check=sapply(1:nrow(example_df),FUN=function(i){d=example_df[i,]; substr(d$Sequence,d$ResidueNo,d$ResidueNo)==d$ResidueAA})
Is there an easier/elegant way to do this? Ideally, I want something that works within a dplyr pipe.
Also, related to this, how can I extract the corresponding value from the 'Score' column into a new column, say, 'Score_1'?
Thanks
We can use substr directly
library(dplyr)
example_df %>%
mutate(Check = substr(Sequence, ResidueNo, ResidueNo) == ResidueAA)
-output
# Gene.names Score ResidueAA ResidueNo Sequence Check
#1 A 3.69,2.97,2.57,3.09,2.94 S 3 MSSYT TRUE
#2 B 3.99,2.27,2.89,2.89,2.00,2.52,2.09,2.83 Y 3 MSSYTRAP FALSE
To create a new column with the matching 'Score', use match instead of == (which does an elementwise comparison) to get an index, and use that index to extract the 'Score' element. Note that match returns the position of the first match within the whole column rather than comparing row by row
example_df %>%
  mutate(Score2 = Score[match(ResidueAA,
                              substr(Sequence, ResidueNo, ResidueNo))])
-output
#Gene.names Score ResidueAA ResidueNo Sequence
#1 A 3.69,2.97,2.57,3.09,2.94 S 3 MSSYT
#2 B 3.99,2.27,2.89,2.89,2.00,2.52,2.09,2.83 Y 3 MSSYTRAP
# Score2
#1 3.69,2.97,2.57,3.09,2.94
#2 <NA>
Update
Based on the comments, we need to extract the element of 'Score' at position 'ResidueNo' when the substring of 'Sequence' at that position is the same as 'ResidueAA'. This can be done by splitting 'Score' with strsplit into a list, extracting its first element ([[1]], after a rowwise operation), and then using 'ResidueNo' to pick the split value at that position
example_df %>%
  rowwise %>%
  mutate(Score2 = if(substr(Sequence, ResidueNo, ResidueNo) ==
        ResidueAA) strsplit(Score, ",")[[1]][ResidueNo] else NA_character_) %>%
  ungroup
-output
# A tibble: 2 x 6
# Gene.names Score ResidueAA ResidueNo Sequence Score2
# <chr> <chr> <chr> <dbl> <chr> <chr>
#1 A 3.69,2.97,2.57,3.09,2.94 S 3 MSSYT 2.57
#2 B 3.99,2.27,2.89,2.89,2.00,2.52,2.09,2.83 Y 3 MSSYTRAP <NA>
Or another option is separate_rows to split 'Score' and expand the data into one row per value, then group by 'Gene.names', summarise to get the corresponding 'Score2' element (similar to the previous solution), and join with the original dataset
library(tidyr)
example_df %>%
  separate_rows(Score, sep= ",") %>%
  group_by(Gene.names) %>%
  summarise(Score2 = if(substr(first(Sequence), first(ResidueNo), first(ResidueNo)) ==
        first(ResidueAA)) Score[first(ResidueNo)] else
          NA_character_, .groups = 'drop') %>%
  right_join(example_df)
To get an individual score, you would need to split the string and return the index corresponding to the position. You could vectorize this, e.g.:
getScore <- Vectorize(function(x, pos) unlist(strsplit(x, ",", TRUE), use.names = FALSE)[pos])
example_df %>% mutate(check=substr(Sequence, ResidueNo, ResidueNo) == ResidueAA,
MyScore=ifelse(check, as.numeric(getScore(Score, ResidueNo)), NA))
#> Gene.names Score ResidueAA ResidueNo
#> 1 A 3.69,2.97,2.57,3.09,2.94 S 3
#> 2 B 3.99,2.27,2.89,2.89,2.00,2.52,2.09,2.83 Y 3
#> Sequence check MyScore
#> 1 MSSYT TRUE 2.57
#> 2 MSSYTRAP FALSE NA

Track changes for a given field given some ID and date

I have the following dataframe:
Each client's cap could be upgraded at some point in time defined by column Date. I would like to aggregate on ID and show on what date the cap has been upgraded. Sometimes this could happen twice. The output should look like this:
Thank you in advance !
library(tidyverse)
df <- tibble(
ID = c(1,1,1,2,2,2,2,3,3),
Cap = c("S", "S", "M", "S", "M", "L", "L", "S", "L"),
Date = paste("01", c(1:2, 4, 3:6, 2:3), "2000") %>% lubridate::dmy()
)
df2 <- df %>%
group_by(ID) %>% # looking at each ID separately
mutate(prev = lag(Cap), # what is the row - 1 value
change = !(Cap == prev)) %>% # is the row - 1 value different than the current row value
filter(change) %>% # filtering where there are changes
select(ID, "From" = prev, "To" = Cap, Date) # renaming columns and selecting the relevant ones
You can use lag here to create a column holding the previous row's value of Cap. Then you simply filter out the first entry for each ID and the rows where the value did not change. Here dat stands for your data frame.
library(dplyr)
library(tidyr)
out <- dat %>%
  ## calculate lag within unique subjects
  group_by(ID) %>%
  mutate(
    ## copy previous row value to new column
    from=lag(Cap),
    to=Cap
  ) %>%
  ungroup() %>%
  ## ignore first entry for each subject
  drop_na(from) %>%
  ## ignore all rows where Cap didn't change
  filter(from != to) %>%
  ## reorder columns
  select(ID, from, to, Date)
This gives us output matching your expected format
> out
# A tibble: 4 x 4
ID from to Date
<dbl> <fct> <fct> <dbl>
1 1 S M 4
2 2 S M 4
3 2 M L 5
4 3 S L 3

Vectorising linear interpolation function for use with mutate

I have a data frame that looks like this:
library(dplyr)
# Set RNG
set.seed(33550336)
# Create toy data frame
df <- expand.grid(day = 1:10, dist = seq(0, 100, by = 10))
df1 <- df %>% mutate(region = "Here")
df2 <- df %>% mutate(region = "There")
df3 <- df %>% mutate(region = "Everywhere")
df_ref <- do.call(rbind, list(df1, df2, df3))
df_ref$value <- runif(nrow(df_ref))
# > head(df_ref)
# day dist region value
# 1 1 0 Here 0.39413117
# 2 2 0 Here 0.44224203
# 3 3 0 Here 0.44207487
# 4 4 0 Here 0.08007335
# 5 5 0 Here 0.02836093
# 6 6 0 Here 0.94475814
This represents a reference data frame and I'd like to compare observations against it. My observations are taken on a specific day that is found in this reference data frame (i.e., day is an integer from 1 to 10) in a region that is also found in this data frame (i.e., Here, There, or Everywhere), but the distance (dist) is not necessarily an integer between 0 and 100. For example, my observation data frame (df_obs) might look like this:
# Observations
df_obs <- data.frame(day = sample(1:10, 3, replace = TRUE),
region = sample(c("Here", "There", "Everywhere")),
dist = runif(3, 0, 100))
# day region dist
# 1 6 Everywhere 68.77991
# 2 7 There 57.78280
# 3 10 Here 85.71628
Since dist is not an integer, I can't just lookup the value corresponding to my observations in df_ref like this:
df_ref %>% filter(day == 6, region == "Everywhere", dist == 68.77991)
So, I created a lookup function that uses the linear interpolation function approx:
lookup <- function(re, di, da){
# Filter to day and region
df_tmp <- df_ref %>% filter(region == re, day == da)
# Approximate answer from distance
approx(unlist(df_tmp$dist), unlist(df_tmp$value), xout = di)$y
}
Applying this to my first observation gives,
lookup("Everywhere", 68.77991, 6)
#[1] 0.8037013
Nevertheless, when I apply the function using mutate I get a different answer.
df_obs %>% mutate(ref = lookup(region, dist, day))
# day region dist ref
# 1 6 Everywhere 68.77991 0.1881132
# 2 7 There 57.78280 0.1755198
# 3 10 Here 85.71628 0.1730285
I suspect that this is because lookup is not vectorised correctly. Why am I getting different answers and how do I fix my lookup function to avoid this?
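The difference comes from how filter handles vectors. Inside mutate, lookup receives the whole region, dist and day columns at once, so region == re and day == da compare the 330 reference rows against recycled length-3 vectors. The filter then returns a mix of rows from several days and regions, and approx interpolates over that mix. A minimal sketch of one fix, assuming it is acceptable to call lookup once per observation, is to wrap it with Vectorize. The name lookup_vec is made up for illustration, and df_obs is typed in from the question rather than re-sampled:

```r
library(dplyr)

# Rebuild the reference data from the question
set.seed(33550336)
df <- expand.grid(day = 1:10, dist = seq(0, 100, by = 10))
df_ref <- bind_rows(
  mutate(df, region = "Here"),
  mutate(df, region = "There"),
  mutate(df, region = "Everywhere")
)
df_ref$value <- runif(nrow(df_ref))

# Observations copied from the question
df_obs <- data.frame(
  day = c(6, 7, 10),
  region = c("Everywhere", "There", "Here"),
  dist = c(68.77991, 57.78280, 85.71628)
)

# Scalar lookup: expects a single region, distance and day
lookup <- function(re, di, da) {
  df_tmp <- df_ref %>% filter(region == re, day == da)
  approx(df_tmp$dist, df_tmp$value, xout = di)$y
}

# Vectorize calls lookup once per element of its arguments (via mapply)
lookup_vec <- Vectorize(lookup, USE.NAMES = FALSE)

df_out <- df_obs %>% mutate(ref = lookup_vec(region, dist, day))
```

Each value in ref should now agree with the corresponding single call, e.g. lookup("Everywhere", 68.77991, 6) for the first row. This still filters df_ref once per observation, so for many observations a join on day and region followed by a grouped approx may be faster.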

Gather twice in same data frame

I have a dataframe where I want to do two separate gathers
library(tidyverse)
id <- c("A","B","C","D","E")
test_1_baseline <- c(1,2,4,5,6)
test_2_baseline <- c(21000, 23400, 26800,29000,30000)
test_1_followup <- c(0,4,2,3,1)
test_2_followup <- c(10000,12000,13000,15000,21000)
layout_1 <-data.frame(id,test_1_baseline,test_1_followup,test_2_baseline,test_2_followup)
This is the current layout.
Each person is 1 line.
The result of Test 1 at baseline is one variable
The result of Test 2 at baseline is a second variable
The same applies to Test 1/2 follow-up results
I would like the data to be tidier. One column for timepoint, one for result of test A, one for result of test B.
id2 <- c("A","B","C","D","E","A","B","C","D","E")
time <- c(rep("baseline",5),rep("followup",5))
test_1_result <- c(1,2,4,5,6,0,4,2,3,1)
test_2_result <- c(21000, 23400, 26800,29000,30000,10000,12000,13000,15000,21000)
layout_2 <- data.frame(id2, time,test_1_result,test_2_result)
I'm currently using what seems to me an odd process, where I first gather the test 1 data
test_1 <- select(layout_1,id,test_1_baseline,test_1_followup) %>%
gather("Timepoint","test_1",c(test_1_baseline,test_1_followup)) %>%
mutate(Timepoint = replace(Timepoint,Timepoint=="test_1_baseline", "baseline")) %>%
mutate(Timepoint = replace(Timepoint,Timepoint=="test_1_followup", "followup"))
Then I do the same for test 2 and join them
test_2 <- select(layout_1,id,test_2_baseline,test_2_followup) %>%
gather("Timepoint","test_2",c(test_2_baseline,test_2_followup)) %>%
mutate(Timepoint = replace(Timepoint,Timepoint=="test_2_baseline", "baseline")) %>%
mutate(Timepoint = replace(Timepoint,Timepoint=="test_2_followup", "followup"))
test_combined <- full_join(test_1,test_2)
I tried doing the first gather and then the second on the same dataframe, but then you end up with duplicates, i.e. you end up with
ID 1 Test_1 Baseline Test_2 Baseline
ID 1 Test_1 Baseline Test_2 Followup
ID 1 Test_1 Followup Test_2 Baseline
ID 1 Test_1 Followup Test_2 Followup
That is 4 rows where there should only be 2.
I feel there must be a cleaner tidyverse way to do this.
Guidance welcomed
One option with data.table using melt which can take multiple measure patterns
library(data.table)
nm1 <- unique(sub(".*_", "", names(layout_1)[-1]))
melt(setDT(layout_1), measure = patterns("test_1", "test_2"),
value.name = c('test_1_result', 'test_2_result'),
variable.name = 'time')[, time := nm1[time]][]
You could gather all columns except id, then use separate to split into result and time.
Note that this code assumes that the result name is always 6 characters (test_1, test_2), and separates based on that assumption. You'll need to devise a different separate if that is not the case.
library(tidyr)
library(dplyr)
layout_1 %>%
  gather(Var, Val, -id) %>%
  separate(Var, into = c("result", "time"), sep = 6) %>%
  spread(result, Val) %>%
  mutate(time = gsub("_", "", time))
Result:
id time test_1 test_2
1 A baseline 1 21000
2 A followup 0 10000
3 B baseline 2 23400
4 B followup 4 12000
5 C baseline 4 26800
6 C followup 2 13000
7 D baseline 5 29000
8 D followup 3 15000
9 E baseline 6 30000
10 E followup 1 21000
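With tidyr 1.0.0 or later, the same reshape can be sketched in a single step with pivot_longer. The special ".value" entry in names_to tells tidyr that the first captured part of each column name (test_1, test_2) should become an output column, while the second part goes into time. This assumes the columns follow the test_<number>_<timepoint> naming used in the question, and the name layout_long is made up for illustration:

```r
library(tidyr)

layout_1 <- data.frame(
  id = c("A", "B", "C", "D", "E"),
  test_1_baseline = c(1, 2, 4, 5, 6),
  test_1_followup = c(0, 4, 2, 3, 1),
  test_2_baseline = c(21000, 23400, 26800, 29000, 30000),
  test_2_followup = c(10000, 12000, 13000, 15000, 21000)
)

layout_long <- pivot_longer(
  layout_1,
  cols = -id,
  # first capture group -> output column name, second -> value of `time`
  names_to = c(".value", "time"),
  names_pattern = "(test_\\d+)_(.*)"
)
```

Unlike the fixed-width separate above, the regular expression does not depend on the test name being exactly 6 characters long.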
