R - Extracting values from other rows - r

As suggested by the title, I would like to extract values from other rows.
In particular, as an example please consider the following dataset:
id.in.group <- c(1,2,3,1,2,3,1,2,3,1,2,3,1,2,3,1,2,3,1,2,3,1,2,3,1,2,3,1,2,3,1,2,3,1,2,3)
group <- c(1,1,1,2,2,2,3,3,3,4,4,4,1,1,1,2,2,2,3,3,3,4,4,4,1,1,1,2,2,2,3,3,3,4,4,4)
trial <- c(1,1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2,2,2,2,3,3,3,3,3,3,3,3,3,3,3,3)
subject <- c("s7","s11","s3","s6","s9","s4","s12","s10","s1","s8","s2","s5","s5","s9","s6","s10","s1","s3","s4","s7","s2","s8","s12","s11","s5","s3","s9","s12","s11","s10","s1","s6","s7","s4","s2","s8")
df <- data.frame(group, id.in.group, trial, subject)
df$other1.id <- 0
df$other2.id <- 0
df$other1.id <- ifelse(df$id.in.group == "1" , 2, df$other1.id)
df$other2.id <- ifelse(df$id.in.group == "1" , 3, df$other2.id)
df$other1.id <- ifelse(df$id.in.group == "2" , 1, df$other1.id)
df$other2.id <- ifelse(df$id.in.group == "2" , 3, df$other2.id)
df$other1.id <- ifelse(df$id.in.group == "3" , 1, df$other1.id)
df$other2.id <- ifelse(df$id.in.group == "3" , 2, df$other2.id)
View(df)
Given the group number (df$group) and the id of the others in the group (df$other1.id and df$other2.id), I would like to create two further variables showing, for each trial and each subject, the value of the other 2 subjects rather than their relative id.in.group, so as to get the two following columns
df$other1.subject<-c("s11","s7","s7","s9","s6","s6","s10","s12","s12","s2","s8","s8","s9","s5","s5","s1","s10","s10","s7","s4","s4","s12","s8","s8", "s3","s5","s5","s11","s12","s12","s6","s1","s1","s2","s4","s4")
df$other2.subject<-c("s3","s3","s11","s4","s4","s9","s1","s1","s10","s5","s5","s2","s6","s6","s9","s3","s3","s1","s2","s2","s7","s11","s11","s12","s9","s9","s3","s10","s10","s11","s7","s7","s6","s8","s8","s2")
View(df)
For instance, if trial = 1 and id.in.group = 1 (or alternatively, subject = s7), then other1.subject = s11 while other2.subject = s3. I would like to extract such values for each id.in.group (or each subject) or for each row.
I apologize for not providing a previous attempt, but I honestly have no idea how to tackle the problem. I am happy to clarify anything further.
Many thanks for all your help!

You need to left-join df with itself twice, once for other1 and once for other2:
library(dplyr)
df %>%
left_join(
df %>%
select(group, trial, other1.id = id.in.group, other1.subject = subject),
by = c("group", "trial", "other1.id")
) %>%
left_join(
df %>%
select(group, trial, other2.id = id.in.group, other2.subject = subject),
by = c("group", "trial", "other2.id")
)
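The same lookup can also be sketched in base R without a join. This is only a minimal sketch, assuming a cut-down version of the question's data (two groups, two trials); the composite `key` vector is a name introduced here for illustration and is not part of the original answer.

```r
# Cut-down version of the question's data: 2 groups x 3 members x 2 trials
group       <- rep(rep(1:2, each = 3), times = 2)
id.in.group <- rep(1:3, times = 4)
trial       <- rep(1:2, each = 6)
subject     <- c("s7", "s11", "s3", "s6", "s9", "s4",
                 "s5", "s9", "s6", "s10", "s1", "s3")
df <- data.frame(group, id.in.group, trial, subject)

# ids of the two other members of the group (same mapping as the ifelse() calls)
df$other1.id <- c(2, 1, 1)[df$id.in.group]
df$other2.id <- c(3, 3, 2)[df$id.in.group]

# Composite key identifying each row by group, trial and position in the group
key <- paste(df$group, df$trial, df$id.in.group)

# Look up the subject sitting at the "other" position within the same group and trial
df$other1.subject <- df$subject[match(paste(df$group, df$trial, df$other1.id), key)]
df$other2.subject <- df$subject[match(paste(df$group, df$trial, df$other2.id), key)]
```

For the first row (trial 1, group 1, subject s7) this gives other1.subject = s11 and other2.subject = s3, as in the question's example.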

Related

R: Finding indirect links between colleagues. Code works with string ids, but not numeric ids

I am trying to extract the set of indirect colleagues of doctors. I call doctors who work together in the same hospital colleagues. An indirect colleague is a doctor who works with a colleague of a doctor in another hospital. In the example below, doctor "a" works with doctor "b" in hospital 1, who in turn works with doctor "c" in hospital 2. Therefore "c" is an indirect colleague of "a".
The code below works well when the physician ids are strings (df0) or low numeric values (df1), but not when they are high numeric values (df2). I would like to fix the code so that it works with high numeric values (while keeping the physicians' original ids).
library(tidyverse) # provides tribble(), unnest(), chop(), deframe(), enframe() and map()
df0 <- tribble(
~hospital, ~doctors,
1, c("a", "b"),
2, c("b", "c"),
3, c("a", "d"),
) %>%
unnest(doctors)
# Below, I replaced doctor id with numeric values
df1 <- tribble(
~hospital, ~doctors,
1, c(1, 2),
2, c(2, 3),
3, c(1, 4),
) %>%
unnest(doctors)
# Now I added +5 to each physician id
df2 <- tribble(
~hospital, ~doctors,
1, c(6, 7),
2, c(7, 8),
3, c(6, 9)
) %>%
unnest(doctors)
df <- df2 # The code only works with df0 and df1, not with df2
colleagues <- full_join(df, df, by = c("hospital")) %>%
rename(doctor = doctors.x, colleagues = doctors.y) %>%
filter(doctor != colleagues) %>%
distinct(doctor, colleagues) %>%
chop(colleagues) %>%
deframe()
colleagues %>%
enframe(name = "ego",
value = "alter") %>%
unnest(alter) %>%
mutate(ego_colleagues = map(ego, ~ colleagues[[.x]]),
alter_colleagues = map(alter, ~ colleagues[[.x]]),
alter_colleague_only = map2(alter_colleagues, ego_colleagues, ~ .x[!(.x %in% .y)])) %>%
unnest(alter_colleague_only) %>%
filter(ego != alter_colleague_only) %>%
select(ego, alter, alter_colleague_only)
The issue is in your map calls. With df2, a call such as map(ego, ~ colleagues[[.x]]) can end up indexing by position rather than by name: whenever .x is numeric, colleagues[[.x]] is a positional lookup. With character ids the lookup is always by name. With the numeric ids 1, 2, 3, 4 it happens to work by luck, because the positions coincide with the names. But when you have a list of 4 elements and you call colleagues[[6]], you get a subscript out of bounds error. If that's not totally clear, print these:
colleagues[[1]] vs. colleagues[[6]] vs. colleagues$`6` .
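A minimal sketch of the difference, assuming a hand-built list with the same shape as the colleagues object derived from df2 (names "6" to "9"); the variable names by_name, by_position and out_of_range are introduced here for illustration:

```r
# Named list shaped like `colleagues` for df2: names are the doctor ids as strings
colleagues <- list(`6` = c(7, 9), `7` = c(6, 8), `8` = 7, `9` = 6)

# A character index looks the element up by name
by_name <- colleagues[["6"]]

# A numeric index looks the element up by position (here the first element)
by_position <- colleagues[[1]]

# Position 6 does not exist in a list of length 4, so this raises an error,
# which is captured here instead of stopping the script
out_of_range <- tryCatch(colleagues[[6]], error = function(e) e)
```

Both by_name and by_position return the colleagues of doctor 6, but only because doctor 6 happens to be stored first; colleagues[[6]] fails outright.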
A quick fix would be to wrap the first part of those map statements in as.character like this:
colleagues %>%
enframe(name = "ego",
value = "alter") %>%
unnest(alter) %>%
mutate(ego_colleagues = map(as.character(ego), ~ colleagues[[.x]]),
alter_colleagues = map(as.character(alter), ~ colleagues[[.x]]),
# the two list columns are left as they are: as.character() on a list column would deparse each vector into a single string
alter_colleague_only = map2(alter_colleagues, ego_colleagues, ~ .x[!(.x %in% .y)])) %>%
unnest(alter_colleague_only) %>%
filter(ego != alter_colleague_only) %>%
select(ego, alter, alter_colleague_only)
UPDATE:
Depending on your setup, you could try using the furrr package with future_map and future_map2, but at least in this minimal example that was a much slower approach. I don't know if that holds true on your real data.
Here's another option. While ugly because it has a lot of intermediate objects, it may be helpful. It uses matrices and leverages the fact that you have these reciprocal relationships (if I'm interpreting correctly). I benchmarked it and it takes half as long.
t1 <- colleagues %>%
enframe(name = "ego",
value = "alter") %>%
unnest(alter) %>%
filter(!duplicated(paste0(pmax(ego, alter), pmin(ego, alter)))) %>%
as.matrix()
t2 <- t1 %>%
rbind(t1[1:nrow(t1),c(2,1)])
alter_colleague_only <- t2[match(t2[,2], t2[,1]), "alter"]
t3 <- cbind(t2, alter_colleague_only)
t4 <- t3[which(t2[,1] != t3[,3]),]
t5 <- t4[,c(3,2,1)]
t6 <- rbind(t4, t5) %>%
as_tibble() %>%
arrange(ego)

Using a loop to create columns based on two data frames

I have a situation where I think a loop would be appropriate to avoid repeating chunks of code.
I have two data frames which look like the following:
patid <- seq(1,10)
date_of_session <- sample(seq(as.Date("2010-01-01"), as.Date("2020-01-01"), by = "day"), 10)
date_of_referral <- sample(seq(as.Date("2010-01-01"), as.Date("2020-01-01"), by = "day"), 10)
df1 <- data.frame(patid, date_of_session, date_of_referral)
patid1 <- sample(seq(1,10), 50, replace = TRUE)
eventdate <- sample(seq(as.Date("2010-01-01"), as.Date("2020-01-01"), by = "day"), 50)
comorbidity <- sample(c("hypertension", "stroke", "AF"), 50, replace = TRUE)
df2 <- data.frame(patid1, eventdate, comorbidity)
I need to repeat the following code for each comorbidity in df2 which basically generates a binary (1/0) column for each comorbidity based on whether the earliest "eventdate" (diagnosis) came before "date of session" OR "date of referral" (if "date of session" is NA) for each patient.
df_comorb <- df2 %>%
filter(comorbidity == "hypertension") %>%
group_by(patid1) %>% # the patient id column in df2 is called patid1
filter(eventdate == min(eventdate))
df1 <- left_join(df1, df_comorb, by = c("patid" = "patid1"))
df1 <- df1 %>%
mutate(hypertension_baseline = ifelse(eventdate < date_of_session | eventdate < date_of_referral, 1, 0)) %>%
replace_na(list(hypertension_baseline = 0)) %>%
select(-eventdate)
I'd like to avoid repeating the code for each of the 27 comorbid conditions in the full dataset. I figured a loop would be the best way to repeat this for each comorbidity but I don't know how to approach writing one for this problem.
Any help would be appreciated.
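One possible way to structure such a loop is sketched below in base R. It is only a sketch under stated assumptions: it uses a small hand-made dataset instead of the random one above, it reads the rule as "compare against date_of_session, falling back to date_of_referral only when date_of_session is NA", and the helper names baseline, first_event and earliest are introduced here for illustration.

```r
# Small deterministic stand-in for df1 and df2 (assumed values, for illustration)
df1 <- data.frame(
  patid = 1:3,
  date_of_session  = as.Date(c("2015-01-01", NA, "2015-01-01")),
  date_of_referral = as.Date(c("2014-06-01", "2016-01-01", "2014-06-01"))
)
df2 <- data.frame(
  patid1 = c(1, 1, 2, 3),
  eventdate = as.Date(c("2014-01-01", "2016-01-01", "2015-06-01", "2017-01-01")),
  comorbidity = c("hypertension", "hypertension", "stroke", "hypertension")
)

# Baseline date: session date, or referral date where the session date is missing
baseline <- df1$date_of_session
no_session <- is.na(baseline)
baseline[no_session] <- df1$date_of_referral[no_session]

for (cond in unique(as.character(df2$comorbidity))) {
  # Events for this comorbidity, earliest first, one row per patient
  events <- df2[df2$comorbidity == cond, ]
  events <- events[order(events$eventdate), ]
  first_event <- events[!duplicated(events$patid1), ]

  # Earliest event date per patient in df1 (NA if the patient has no such event)
  earliest <- first_event$eventdate[match(df1$patid, first_event$patid1)]

  # 1 if the earliest diagnosis precedes the baseline date, otherwise 0
  df1[[paste0(cond, "_baseline")]] <- as.integer((earliest < baseline) %in% TRUE)
}
```

With 27 conditions the loop body stays the same; each pass adds one `<condition>_baseline` column to df1.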

Calculate mean after last time of specific column specification

example.df <- data.frame(GY = sample(300:600, 200, replace = T), sacc
= rep("f", each = 100), trial.number = rep(1:2,
each = 100), stringsAsFactors = F)
example.df$sacc[50:70] <- "s"
example.df$sacc[164:170] <- "s"
I have data that looks similar to this. I would like to calculate the mean of GY over all the values after the last appearance of "s", i.e. the remaining rows where sacc is "f". In this example I could of course just average over index numbers 71:100, but in the real data this isn't the case.
What I tried after the comment of Ronak (thanks!):
library(dplyr)
example.df %>%
group_by(trial.number) %>%
summarise(mean_tr = mean(GY[(max(which(sacc == "s")) + 1) : n()]))
%>%
data.frame()
I can't get it to work. Can someone help me out? My original data.frame is 70k rows and consists of a lot of variables. class = data.frame.
Update
As we need to do this by group, we can split it on trial.number and then apply the same operation to each group.
sapply(split(example.df, example.df$trial.number), function(x)
mean(x$GY[(max(which(x$sacc == "s")) + 1) : nrow(x)]))
# 1 2
#446.2333 471.7000
The same using dplyr could be achieved by
library(dplyr)
example.df %>%
group_by(trial.number) %>%
summarise(mean_tr = mean(GY[(max(which(sacc == "s")) + 1) : n()])) %>%
data.frame()
# trial.number mean_tr
#1 1 446.2333
#2 2 471.7000
Confirming again,
mean(example.df$GY[71:100])
#[1] 446.2333
mean(example.df$GY[171:200])
#[1] 471.7
Original Answer
We could do
mean(example.df$GY[(max(which(example.df$sacc == "s")) + 1) : nrow(example.df)])
#[1] 443.6667
Here, we first get all the indices where sacc is "s" then take max of it to get last occurrence. We get the mean of GY values from that index to end of the dataframe (nrow(example.df)).
To confirm,
mean(example.df$GY[71:100])
#[1] 443.6667
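The indexing logic can be checked on a tiny deterministic example. This is a minimal sketch, assuming a made-up six-row data frame called toy; it uses a logical mask instead of the (last + 1):nrow range, which is a swapped-in variant rather than the code from the answer.

```r
# Toy data: the last "s" is in row 4, so rows 5 and 6 should be averaged
toy <- data.frame(
  GY   = c(10, 20, 30, 40, 50, 60),
  sacc = c("f", "s", "f", "s", "f", "f")
)

# Row index of the last occurrence of "s"
last_s <- max(which(toy$sacc == "s"))

# Logical mask selecting every row after that index
after_last <- seq_len(nrow(toy)) > last_s

# Mean of GY over the rows after the last "s"
mean_after <- mean(toy$GY[after_last])
```

The mask avoids one pitfall of (last_s + 1):nrow(toy): if the last "s" sits in the final row, that range counts backwards, whereas the mask simply selects no rows and the mean is NaN.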

Create column based on multiple conditions in r

I have a data frame with 3 columns: individual ID, trip (which is sequenced by ID), and forage (yes or no):
example <- data.frame(IDs = c(rep("A",30),rep("B",30)),
timestamp = seq(c(ISOdate(2016,10,01)), by = "day", length.out = 60),
trip = c(rep("1",15),rep("2",15)),
forage = c(rep("Yes",3),rep("No",5),rep("Yes",3),rep("No",4),rep("Yes",7),rep("No",8)))
I want to create two separate columns that will list foraging events for each observation. In the first column, I want to number each observation with foraging = "yes" within ID and trip (so, each trip within individual will have x number of foraging events, starting over again with "1" for the next trip within individual). This column would look like:
example$forageEvent1 <- c(rep(1,3),rep("NA",5),rep(2,3),rep("NA",4),rep(1,7),rep("NA",8),rep(1,3),rep("NA",5),rep(2,3),rep("NA",4),rep(1,7),rep("NA",8))
The second column will number the foraging events by ID only:
example$forageEvent2 <- c(rep(1,3),rep("NA",5),rep(2,3),rep("NA",4),rep(3,7),rep("NA",8),rep(1,3),rep("NA",5),rep(2,3),rep("NA",4),rep(3,7),rep("NA",8))
I can subset/pipe down to individual and trip and have tried ifelse(), but I have no idea how to write code that will create a sequence of events. Thanks all.
EDIT: the code below, edited from a comment, gets close. However, it prints starting with "Forage0" instead of "Forage1".
library(dplyr)
Test_example <- example %>%
group_by(IDs) %>%
mutate(
ForagebyID = case_when(
forage == "Yes" ~ "Forage",
forage == "No" ~"NonForage"),
rleid = cumsum(ForagebyID != lag(ForagebyID, 1, default = "NA")),
ForagebyID = case_when(
ForagebyID == "Forage" ~ paste0(ForagebyID, rleid %/% 2),
TRUE ~ "NonForage"),
rleid = NULL
)
I think this will do what you want:
library(dplyr)
example <- data.frame(IDs = c(rep("A",30),rep("B",30)),
timestamp = seq(c(ISOdate(2016,10,01)), by = "day", length.out = 60),
trip = c(rep("1",15),rep("2",15)),
forage = c(rep("Yes",3),rep("No",5),rep("Yes",3),rep("No",4),rep("Yes",7),rep("No",8)))
Test_example <- example %>%
arrange(IDs, timestamp) %>%
group_by(IDs, trip) %>%
# lag() needs a default of the same type as forage (character); "" never equals "Yes" or "No"
mutate(forageEvent1 = case_when(forage == "No" ~ 0,
TRUE ~ cumsum(forage != lag(forage, default = "")) %/% 2 + 1)) %>%
group_by(IDs) %>%
mutate(forageEvent2 = case_when(forage == "No" ~ 0,
TRUE ~ cumsum(forage != lag(forage, default = "")) %/% 2 + 1))
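The run-numbering idea can also be sketched in base R. This is a minimal sketch, assuming a small made-up data frame toy and a hypothetical helper number_events(); unlike the code above it returns NA rather than 0 for non-foraging rows, and it counts the start of each "Yes" run directly, so a group that begins with "No" still starts at event 1.

```r
# Number consecutive runs of TRUE as 1, 2, 3, ...; FALSE rows get NA
number_events <- function(is_yes) {
  is_yes <- as.logical(is_yes)
  # A run starts where the current value is TRUE and the previous one is not
  starts <- is_yes & !c(FALSE, head(is_yes, -1))
  ifelse(is_yes, cumsum(starts), NA_integer_)
}

toy <- data.frame(
  IDs    = rep(c("A", "B"), each = 5),
  forage = c("Yes", "No", "Yes", "Yes", "No",
             "No", "Yes", "No", "Yes", "Yes")
)

# ave() applies the helper within each ID and returns the results in row order
toy$forageEvent2 <- ave(as.integer(toy$forage == "Yes"), toy$IDs,
                        FUN = number_events)
```

Passing both toy$IDs and a trip column as grouping arguments to ave() would restart the numbering within each trip as well.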

Decision table in R

Given a data frame and a set of rules defined in a table.
df <- rbind(c("blue","M",9),c("blue","F",11))
colnames(df)<-c("eyes","gender","age")
rule <- rbind(c("blue","M","<10",1),c("blue","M",">10",2),c("blue","F","<10",3),c("blue","F",">10",4))
colnames(rule)<-c("eyes","gender","age","category" )
Is there a way in R to apply the rules without rewriting the if else script?
The result should look like this:
eyes gender age category
"blue" "M" 9 1
"blue" "F" 11 4
require(magrittr); require(dplyr)
# Convert to data.frame
rule %<>% data.frame(stringsAsFactors = F)
df %<>% data.frame(stringsAsFactors = F)
#left join on eyes and gender, then remove rows where age doesn't match
result <- df %>%
left_join(rule, by = c('eyes', 'gender'))%>%
filter(paste(age.x, age.y) %>% sapply(function(x) eval(parse(text = x)))) %>%
select(-age.y)
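The same join-then-filter idea can be sketched without eval(parse()). This is a minimal sketch in base R, assuming every rule is a single "<" or ">" character followed by a number, as in the example; the op and threshold vectors are names introduced here for illustration.

```r
df <- data.frame(eyes = c("blue", "blue"),
                 gender = c("M", "F"),
                 age = c(9, 11))
rule <- data.frame(eyes = rep("blue", 4),
                   gender = c("M", "M", "F", "F"),
                   age = c("<10", ">10", "<10", ">10"),
                   category = 1:4)

# Pair every observation with all rules for the same eyes and gender;
# the rule's age condition ends up in the column age.rule
joined <- merge(df, rule, by = c("eyes", "gender"), suffixes = c("", ".rule"))

# Split each condition into its operator and its numeric threshold
op        <- substr(joined$age.rule, 1, 1)
threshold <- as.numeric(substring(joined$age.rule, 2))

# Keep only the rows whose age satisfies the paired rule
keep   <- ifelse(op == "<", joined$age < threshold, joined$age > threshold)
result <- joined[keep, c("eyes", "gender", "age", "category")]
```

Note that with rules written as "<10" and ">10", an age of exactly 10 matches no rule and drops out of the result.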
