I am a novice R programmer. Below is the dataframe I am using.
I am currently running into a filtering problem with the full_join() from tidyverse.
library(tidyverse)
set.seed(1234)
df <- data.frame(
trial = rep(0:1, each = 8),
sex = rep(c('M','F'), 4),
participant = rep(1:4, 4),
x = runif(16, 1, 10),
y = runif(16, 1, 10))
df
I currently perform the full_join() like this:
df <- df %>% mutate(k = 1)
df <- df %>%
full_join(df, by = "k")
I am restricting the results to obtain the combination of points for the same participant between the trials
df2 <- filter(df, sex.x == sex.y, participant.x == participant.y, trial.x != trial.y)
df3 <- filter(df2, participant.x == 1)
df3
Here, at this step, I am running into trouble. I do not care about the order of the points. How do I condense the duplicates into one row?
Thank you
Depending on which columns you consider, use the duplicated() function. The first line below removes duplicates based on the first 5 columns; the second removes duplicates based on columns 7 to 11.
df3[!duplicated(df3[,1:5]),]
df3[!duplicated(df3[,7:11]),]
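If the goal is to count each unordered pair of points only once, one alternative is to keep a single orientation of every pair rather than removing duplicates afterwards. The sketch below rebuilds the question's data and replaces trial.x != trial.y with trial.x < trial.y. The names pairs and unique_pairs are made up for illustration, and this assumes the trial number is enough to order the two points.

```r
library(dplyr)

set.seed(1234)
df <- data.frame(
  trial = rep(0:1, each = 8),
  sex = rep(c("M", "F"), 8),
  participant = rep(1:4, 4),
  x = runif(16, 1, 10),
  y = runif(16, 1, 10)
)

# Self-join as in the question: every row paired with every row.
# (Newer dplyr versions may warn about a many-to-many join here.)
pairs <- df %>%
  mutate(k = 1) %>%
  full_join(mutate(df, k = 1), by = "k")

# Keep same-participant pairs across trials, one orientation only:
# trial.x < trial.y keeps (trial 0, trial 1) and drops the mirrored row.
unique_pairs <- pairs %>%
  filter(sex.x == sex.y,
         participant.x == participant.y,
         trial.x < trial.y)

filter(unique_pairs, participant.x == 1)
```

Participant 1 has two points in each trial, so this leaves four rows instead of eight.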
I have a situation where I think a loop would be appropriate to avoid repeating chunks of code.
I have two data frames which look like the following:
patid <- seq(1,10)
date_of_session <- sample(seq(as.Date("2010-01-01"), as.Date("2020-01-01"), by = "day"), 10)
date_of_referral <- sample(seq(as.Date("2010-01-01"), as.Date("2020-01-01"), by = "day"), 10)
df1 <- data.frame(patid, date_of_session, date_of_referral)
patid1 <- sample(seq(1,10), 50, replace = TRUE)
eventdate <- sample(seq(as.Date("2010-01-01"), as.Date("2020-01-01"), by = "day"), 50)
comorbidity <- sample(c("hypertension", "stroke", "AF"), 50, replace = TRUE)
df2 <- data.frame(patid1, eventdate, comorbidity)
I need to repeat the following code for each comorbidity in df2 which basically generates a binary (1/0) column for each comorbidity based on whether the earliest "eventdate" (diagnosis) came before "date of session" OR "date of referral" (if "date of session" is NA) for each patient.
df_comorb <- df2 %>%
filter(comorbidity == "hypertension") %>%
group_by(patid1) %>%
filter(eventdate == min(eventdate)) %>%
ungroup() %>%
select(patid = patid1, eventdate)
df1 <- left_join(df1, df_comorb, by = "patid")
df1 <- df1 %>%
mutate(hypertension_baseline = ifelse(eventdate < date_of_session | eventdate < date_of_referral, 1, 0)) %>%
replace_na(list(hypertension_baseline = 0)) %>%
select(-eventdate)
I'd like to avoid repeating the code for each of the 27 comorbid conditions in the full dataset. I figured a loop would be the best way to repeat this for each comorbidity but I don't know how to approach writing one for this problem.
Any help would be appreciated.
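One possible approach is to compute the earliest event date per patient and comorbidity once, then loop over the comorbidity names and add one column per condition. This is only a sketch under some assumptions: the patient ID column is called patid in both data frames (the question uses patid1 in df2), the reference date is date_of_session with date_of_referral as the fallback when the session date is NA (as described in the text, which differs slightly from the `|` condition in the posted code), and first_dx, dx, ref_date and flag are names invented for this example.

```r
library(dplyr)

set.seed(1)
days <- seq(as.Date("2010-01-01"), as.Date("2020-01-01"), by = "day")
df1 <- data.frame(
  patid = 1:10,
  date_of_session = sample(days, 10),
  date_of_referral = sample(days, 10)
)
df2 <- data.frame(
  patid = sample(1:10, 50, replace = TRUE),
  eventdate = sample(days, 50),
  comorbidity = sample(c("hypertension", "stroke", "AF"), 50, replace = TRUE)
)

# Earliest diagnosis date per patient and comorbidity.
first_dx <- df2 %>%
  group_by(patid, comorbidity) %>%
  summarise(eventdate = min(eventdate), .groups = "drop")

# Add one <comorbidity>_baseline column per condition.
for (cm in unique(df2$comorbidity)) {
  dx <- first_dx %>%
    filter(comorbidity == cm) %>%
    select(patid, eventdate)
  flag <- df1 %>%
    left_join(dx, by = "patid") %>%
    mutate(
      # Use the session date, falling back to the referral date if it is NA.
      ref_date = coalesce(date_of_session, date_of_referral),
      # 1 if a diagnosis exists and predates the reference date, else 0.
      flag = as.integer(!is.na(eventdate) & eventdate < ref_date)
    ) %>%
    pull(flag)
  df1[[paste0(cm, "_baseline")]] <- flag
}

df1
```

Because dx has at most one row per patient, the left join keeps df1's row order and row count, so the flag vector can be assigned straight back as a new column.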
site <- rep(1:4, each = 8, len = 32)
rep <- rep(1:8, times = 4, len = 32)
treatment <- rep(c("A.low","A.low","A.high","A.high","A.mix","A.mix","B.mix","B.mix"), 4)
sp.1 <- sample(0:3,size=32,replace=TRUE)
sp.2 <- sample(0:2,size=32,replace=TRUE)
df.dummy <- data.frame(site, rep, treatment, sp.1, sp.2)
The final dataframe looks like this
For each site, I want to summarize various groups. Two for example: "A.low / A.high" = "sp.1/sp.1"; "A.low/ A.mix" = "sp.1/sp.2". As you will notice, there are two for each site and I want all permutations of that in my final columns. My final product would resemble something like:
site  rep  treatment     value
1     1/3  A.low/A.high  Inf
1     1/4  A.low/A.high  1
I started to use dplyr but I am really not sure how to proceed especially with all the combinations
df.dummy %>%
group_by(site) %>%
summarise(value.1 = sp.1[treatment == "A.low"] / sp.1[treatment == "A.high"])
You could use reshape2 to get the data in a format that is easier to work with.
The code below separates out the sp.1 and sp.2 data. acast is used so that each dataframe consists of a single row per site, and each column is a unique sample with the values being from sp.1 and sp.2.
Name the columns something unique and combine the dataframes with cbind.
Now each column can be compared based on your requirements.
library(dplyr)
library(reshape2)
##your setup
site <- rep(1:4, each = 8, len = 32)
rep <- rep(1:8, times = 4, len = 32)
treatment <- rep(c("A.low","A.low","A.high","A.high","A.mix","A.mix","B.mix","B.mix"), 4)
sp.1 <- sample(0:3,size=32,replace=TRUE)
sp.2 <- sample(0:2,size=32,replace=TRUE)
df.dummy <- data.frame(site, rep, treatment, sp.1, sp.2)
##create unique ids and create a dataframe containing 1 value column
sp1 <- df.dummy %>% mutate(id = paste(rep, treatment, sep = "_")) %>% select(id, site, rep, treatment, sp.1)
sp2 <- df.dummy %>% mutate(id = paste(rep, treatment, sep = "_")) %>% select(id, site, rep, treatment, sp.2)
##reshape the data so that each treatment and replicate is assigned a single column
##each row will be a single site
##each column will contain the values from sp.1 or sp.2
sp1 <- reshape2::acast(data = sp1, formula = site ~ id, value.var = "sp.1")
sp2 <- reshape2::acast(data = sp2, formula = site ~ id, value.var = "sp.2")
##rename columns something sensible and unique
colnames(sp1) <- c("low.1.sp1", "low.2.sp1", "high.3.sp1", "high.4.sp1",
"mix.5.sp1", "mix.6.sp1", "mix.7.sp1", "mix.8.sp1")
colnames(sp2) <- c("low.1.sp2", "low.2.sp2", "high.3.sp2", "high.4.sp2",
"mix.5.sp2", "mix.6.sp2", "mix.7.sp2", "mix.8.sp2")
##combine datasets
##acast returns matrices, so convert to a data frame before using mutate
dat <- as.data.frame(cbind(sp1, sp2))
##choose which columns to compare. Some examples shown below
dat <- dat %>% mutate(low.1.sp1/high.3.sp1, low.1.sp1/high.4.sp1,
low.2.sp1/high.3.sp2)
I'm trying to do something extremely simple, but I can't work out how to do it.
Basically this:
Where A and B are columns in a data frame.
If I use:
df$B <- lag(df$B,1) + df$A
It obviously results in NA because there is no lag of B before row 1.
We could use accumulate
library(tidyverse)
df %>%
mutate(B = accumulate(A, `+`))
Or it could be just cumsum
df %>%
mutate(B = cumsum(A))
data
df <- data.frame(A= c(10, 9, 3, 1, 7))
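To see why cumsum fits, here is a small check in base R. It assumes the intended rule is that the first B equals the first A and every later B is the previous B plus the current A; B_loop is just an illustrative name.

```r
df <- data.frame(A = c(10, 9, 3, 1, 7))

# Running total: each B is the previous B plus the current A.
df$B <- cumsum(df$A)

# The same recurrence written as an explicit loop, for comparison.
B_loop <- numeric(nrow(df))
B_loop[1] <- df$A[1]
for (i in seq_len(nrow(df))[-1]) {
  B_loop[i] <- B_loop[i - 1] + df$A[i]
}

df$B
# 10 19 22 23 30
```

Both versions give the same vector; cumsum simply avoids the loop.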
I am trying to filter the data within a group until a condition is met (in this case until the status is "Completed") and drop the rest of the rows within that group. I've managed to come up with this ranking solution, but I've run into a few issues with it when applying the code to my "real data". The function would sometimes not keep the last row (the one with the max rank). Is there a more elegant solution to this?
Code I've used:
require(dplyr)
time <- seq(as.Date('2017/01/01'), as.Date('2017/01/15'), by="day")
set.seed(42); status <- sample(c("Completed", "On hold", "Active"), 15, replace = T)
ID <- c(rep(1, 5),rep(2, 5),rep(3, 5))
DF <- data.frame(Time = time,
Status = status,
ID = ID)
DF <- DF %>% group_by(ID) %>% mutate(ID_Rank = row_number())
DF$ID_Rank[DF$Status == "Completed"] <- max(DF$ID_Rank)+1
DF2 <- DF %>% group_by(ID) %>% filter(row_number() <= which.max(ID_Rank))
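A possible alternative that avoids the rank column is to count, within each group, how many "Completed" rows have already been passed, and keep rows only while that count is zero. The sketch below uses a fixed Status vector rather than sample() so the result is reproducible; it assumes you want to keep everything up to and including the first "Completed" row, and all rows of a group that never completes.

```r
library(dplyr)

DF <- data.frame(
  Time = seq(as.Date("2017-01-01"), as.Date("2017-01-15"), by = "day"),
  Status = c("Active", "On hold", "Completed", "Active", "Completed",
             "On hold", "Active", "Active", "On hold", "Active",
             "Completed", "Active", "On hold", "Active", "Active"),
  ID = rep(1:3, each = 5)
)

DF2 <- DF %>%
  group_by(ID) %>%
  # lag() shifts the "Completed" flag down one row, so the running count
  # only becomes positive on the row AFTER the first "Completed".
  filter(cumsum(dplyr::lag(Status == "Completed", default = FALSE)) == 0) %>%
  ungroup()

DF2
```

Here ID 1 keeps its first three rows, ID 2 (never completed) keeps all five, and ID 3 keeps only its first row.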
Change from baseline for repeated ids with missing baseline points
A similar question has been asked and answered below:
Change from baseline for repeated ids
My question differs from the original question in that I have missing baseline values. I am including a small reproducible example below:
df1 <- data.frame( probeID = c( rep("A", 19), rep("B",19), rep("C",19)),
Subject_ID = c( rep( c( rep(1,5), rep(2,4), rep(3,5), rep(4,5)),3)),
time = c(rep( c( c(1:5), c(2:5), rep( 1:5,2)),3)))
df1$measure <- df1$Subject_ID*c( 1:nrow(df1))
df2 <- subset( df1, Subject_ID != 2)
df2 %>%
group_by(probeID, Subject_ID) %>%
mutate(change = measure - measure[time==1])
However, when I replace df2 with df1 in the pipe above, it fails because the time = 1 data point is missing for Subject_ID = 2. My desired output in the df1 case should be identical to the output from df2. I would appreciate any help.
Thanks
JJ
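I had some trouble working out exactly what your question is asking for. Does this work? It uses the first available measurement in each group as the baseline, so it assumes the rows are ordered by time within each group.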
df1 %>%
group_by(probeID, Subject_ID) %>%
mutate(change = measure - first(measure))
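If the baseline must be the time == 1 measurement specifically, another option is to look it up with match(), which returns NA when no such row exists, so groups without a baseline get NA instead of causing an error. This is a sketch assuming that NA (or dropping those rows) is the desired result for subjects with no baseline; res and res_complete are illustrative names.

```r
library(dplyr)

df1 <- data.frame(
  probeID = c(rep("A", 19), rep("B", 19), rep("C", 19)),
  Subject_ID = rep(c(rep(1, 5), rep(2, 4), rep(3, 5), rep(4, 5)), 3),
  time = rep(c(1:5, 2:5, rep(1:5, 2)), 3)
)
df1$measure <- df1$Subject_ID * seq_len(nrow(df1))

res <- df1 %>%
  group_by(probeID, Subject_ID) %>%
  # match(1, time) is the position of the baseline row, or NA if absent;
  # indexing with NA yields NA, so change is NA for the whole group.
  mutate(change = measure - measure[match(1, time)]) %>%
  ungroup()

# Dropping the NA rows leaves only subjects that have a baseline.
res_complete <- filter(res, !is.na(change))
```

With this data, every row for Subject_ID 2 has change equal to NA, and res_complete contains the same subjects as the df2 example.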