Aggregation of variables by date in R

I am more used to working with Stata and have been trying to switch to R, but I am having trouble getting this aggregation to work with dplyr/summarise.
I have a dataframe with admission/discharge variables and a series of columns with binary (0/1) indicators of the drug received on 'DrugDate'.
# ID AdmitDate DCdate DrugDate DrugA DrugB .. DrugZ
# 1 03/01/2017 03/04/2017 03/01/2017 1 0 0
# 1 03/01/2017 03/04/2017 03/02/2017 1 0 0
# 1 03/01/2017 03/04/2017 03/02/2017 0 1 0
# 1 03/01/2017 03/04/2017 03/03/2017 1 0 0
# 1 03/01/2017 03/04/2017 03/04/2017 1 0 0
Each row is essentially a series of indicators of which drugs a patient received that day.
STEP 1.
I would like to first consolidate the dataset like so:
# ID AdmitDate DCdate DrugDate DrugA DrugB .. DrugZ
# 1 03/01/2017 03/04/2017 03/01/2017 1 0 0
# 1 03/01/2017 03/04/2017 03/02/2017 1 1 0
# 1 03/01/2017 03/04/2017 03/03/2017 1 0 0
# 1 03/01/2017 03/04/2017 03/04/2017 1 0 0
So that there is now one row per day (whereas before, duplicate DrugDates existed when more than one drug was given on the same day).
STEP 2
I would then like to create a new dataset that counts "drug days" i.e.
# ID AdmitDate DCdate TotDays DrugDaysA DrugDaysB .. DrugDaysZ
# 1 03/01/2017 03/04/2017 4 4 1 0
Step 2 I figured out how to do, but I thought the community might have opinions about the fastest way to compute it, since the dataset is quite large. My understanding is that dplyr is usually computationally efficient.
I would prefer not to simply do something like:
DF %>% group_by(id, drugdate) %>% summarise(NewVar = max(DrugA))
Because there are many variables.
It would be ideal to define a list of variable names, then use an apply function or for-loop to automate the process.

You can reshape or melt the different drug columns into one column using a package like reshape2 or a tidyverse package.
Then the dplyr call stays the same no matter how many variables (drugs) you have. I provide a simple example below that should illustrate the point; you can extend it as needed.
library(dplyr)
library(reshape2)
# set up for data
set.seed(5)
n <- 9
#create data frame
df <- data.frame(id = as.factor(rep(1:3, n/3)),
                 date = as.character(sample(size=n, 1:10)),
                 drugA = sample(size=n, 1:2, replace=TRUE),
                 drugB = sample(size=n, 1:2, replace=TRUE))
#melt data
dfm <- melt(df, id.vars=c("id", "date"))
#call to dplyr
dfms <- dfm %>% group_by(id, date, variable) %>% summarise(max = max(value))
> head(dfms)
Source: local data frame [6 x 4]
Groups: id, date [3]
id date variable max
<fctr> <fctr> <fctr> <int>
1 1 6 drugA 1
2 1 6 drugB 2
3 1 7 drugA 2
4 1 7 drugB 2
5 1 9 drugA 2
6 1 9 drugB 1
To get back into wide format you can use the cast functions.
> head(dcast(dfms, id + date ~ variable, value.var = "max"))
id date drugA drugB
1 1 6 1 2
2 1 7 2 2
3 1 9 2 1
4 2 10 1 2
5 2 2 2 1
6 2 8 1 1
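A sketch of the same two steps without reshaping, for anyone on a recent tidyverse: across() (available since dplyr 1.0) applies a function to many columns at once, so no drug column has to be named individually. The starts_with("drug") selector is based on this example's column names; with a predefined list of variable names, as the question suggests, you could use all_of(drug_vars) instead.
library(dplyr)
# Step 1: collapse to one row per id/date, taking the max of every drug column
df_daily <- df %>%
  group_by(id, date) %>%
  summarise(across(starts_with("drug"), max), .groups = "drop")
# Step 2: count "drug days" per id by summing the daily indicators
df_daily %>%
  group_by(id) %>%
  summarise(across(starts_with("drug"), sum))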

Related

Remove if unit only has one observation

I have a long form of clinical data that looks something like this:
patientid <- c(100,100,100,101,101,101,102,102,102,104,104,104)
outcome <- c(1,1,1,1,1,NA,1,NA,NA,NA,NA,NA)
time <- c(1,2,3,1,2,3,1,2,3,1,2,3)
Data <- data.frame(patientid=patientid, outcome=outcome, time=time)
A patient should be kept in the database only if they have 2 or 3 observations (patients that have complete data for 0 or only 1 time point should be thrown out). So for this example my desired result is this:
patientid <- c(100,100,100,101,101,101)
outcome <- c(1,1,1,1,1,NA)
time <- c(1,2,3,1,2,3)
Data <- data.frame(patientid=patientid, outcome=outcome, time=time)
Hence patients 102 and 104 are thrown out of the database because they were missing the outcome variable in 2 or 3 of the time points.
We can count the non-NA elements, grouped by 'patientid', and filter to keep patientids having more than one non-NA 'outcome':
library(dplyr)
Data %>%
  group_by(patientid) %>%
  filter(sum(!is.na(outcome)) > 1) %>%
  ungroup
Output:
# A tibble: 6 x 3
# patientid outcome time
# <dbl> <dbl> <dbl>
#1 100 1 1
#2 100 1 2
#3 100 1 3
#4 101 1 1
#5 101 1 2
#6 101 NA 3
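Note that > 1 encodes the question's "2 or 3 observations" rule; it is equivalent to sum(!is.na(outcome)) >= 2.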
A base R option using subset + ave
subset(
  Data,
  ave(!is.na(outcome), patientid, FUN = sum) > 1
)
giving
patientid outcome time
1 100 1 1
2 100 1 2
3 100 1 3
4 101 1 1
5 101 1 2
6 101 NA 3
A data.table option
setDT(Data)[, Y := sum(!is.na(outcome)), patientid][Y > 1, ][, Y := NULL][]
or a simpler one (thanks @akrun)
setDT(Data)[Data[, .I[sum(!is.na(outcome)) > 1], .(patientid)]$V1]
which gives
patientid outcome time
1: 100 1 1
2: 100 1 2
3: 100 1 3
4: 101 1 1
5: 101 1 2
6: 101 NA 3
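For readers new to data.table: in the second form the inner call returns, per patientid, the row numbers (.I) of groups meeting the condition; data.table names that unnamed column V1 by default, and $V1 feeds the row numbers to the outer subset.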
library(dplyr)
Data %>%
  group_by(patientid) %>%
  mutate(observation = sum(!is.na(outcome))) %>% # count the non-NA observations per patient
  filter(observation >= 2) %>%
  ungroup
output:
# A tibble: 6 x 4
patientid outcome time observation
<dbl> <dbl> <dbl> <dbl>
1 100 1 1 3
2 100 1 2 3
3 100 1 3 3
4 101 1 1 2
5 101 1 2 2
6 101 NA 3 2

Longitudinal dataset - difference between two dates

I have a longitudinal dataset that I imported in R from Excel that looks like this:
STUDYID VISIT# VISITDate
1 1 2012-12-19
1 2 2018-09-19
2 1 2013-04-03
2 2 2014-05-14
2 3 2016-05-12
In this dataset, each patient/study ID has a different number of visits to the hospital, and the first visit date is likely to differ from individual to individual. I want to create a new time variable, essentially time in years since the first visit, so the dataset will look like this:
STUDYID VISIT# VISITDate Time(years)
1 1 2012-12-19 0
1 2 2018-09-19 5
2 1 2013-04-03 0
2 2 2014-05-14 1
2 3 2016-05-12 3
The reason for creating a time variable like this is to assess differential regression effects over time (as a continuous variable). Is there a way to create such a time variable in R so I can use it as an independent variable in my regression analyses?
Consider ave to calculate the minimum of VISITDate by STUDYID group, then take the date difference with conversion to integer years:
df <- within(df, {
  minVISITDate <- ave(VISITDate, STUDYID, FUN = min)
  Time <- floor(as.double(difftime(VISITDate, minVISITDate, unit = "days") / 365))
  rm(minVISITDate)
})
df
# STUDYID VISIT# VISITDate Time
# 1 1 1 2012-12-19 0
# 2 1 2 2018-09-19 5
# 3 2 1 2013-04-03 0
# 4 2 2 2014-05-14 1
# 5 2 3 2016-05-12 3
Loading up packages:
library(tibble)
library(dplyr)
library(lubridate)
Setting up the data:
dat <- tribble(~STUDYID, ~VISIT, ~VISITDate,
               1, 1, "2012-12-19",
               1, 2, "2018-09-19",
               2, 1, "2013-04-03",
               2, 2, "2014-05-14",
               2, 3, "2016-05-12") %>%
  mutate(VISITDate = as.Date(VISITDate))
Creating the wanted variable:
dat %>%
  group_by(STUDYID) %>%
  mutate(Time = first(VISITDate) %--% VISITDate,
         Time = as.numeric(Time, "years")) %>%
  ungroup()
# A tibble: 5 x 4
STUDYID VISIT VISITDate Time
<dbl> <dbl> <date> <dbl>
1 1 1 2012-12-19 0
2 1 2 2018-09-19 5.75
3 2 1 2013-04-03 0
4 2 2 2014-05-14 1.11
5 2 3 2016-05-12 3.11
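If you want whole years, matching the integer Time of the ave() answer above, a small variation on the same lubridate code is to floor the interval length:
dat %>%
  group_by(STUDYID) %>%
  mutate(Time = floor(as.numeric(first(VISITDate) %--% VISITDate, "years"))) %>%
  ungroup()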

Determine percentage of rows with missing values in a dataframe in R

I have a data frame with three variables and some missing values in one of the variables that looks like this:
subject <- c(1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2)
part <- c(0,0,0,0,1,1,1,1,2,2,2,2,3,3,3,3,0,0,0,0,1,1,1,1,2,2,2,2,3,3,3,3)
sad <- c(1,7,7,4,NA,NA,2,2,NA,2,3,NA,NA,2,2,1,NA,5,NA,6,6,NA,NA,3,3,NA,NA,5,3,NA,7,2)
df1 <- data.frame(subject,part,sad)
I have created a new data frame with the mean values of 'sad' per subject and part, using a loop like this:
columns <- c("sad.m", "part", "subject")
df2 <- matrix(data = NA, nrow = 1, ncol = length(columns))
df2 <- data.frame(df2)
names(df2) <- columns
tn <- unique(df1$subject)
row <- 1
for (s in tn) {
  for (i in 0:3) {
    TN <- df1[df1$subject == s & df1$part == i, ]
    df2[row, "sad.m"] <- mean(as.numeric(TN$sad), na.rm = TRUE)
    df2[row, "part"] <- i
    df2[row, "subject"] <- s
    row <- row + 1
  }
}
Now I want to include an additional variable 'missing' that indicates the percentage of rows per subject and part with missing values, so that I get df3:
subject <- c(1,1,1,1,2,2,2,2)
part<-c(0,1,2,3,0,1,2,3)
sad.m<-df2$sad.m
missing <- c(0,50,50,25,50,50,50,25)
df3 <- data.frame(subject,part,sad.m,missing)
I'd really appreciate any help on how to go about this!
It's best to avoid loops in R where possible, since they can get messy and tend to be quite slow. For this sort of thing, the dplyr library is perfect and well worth learning; it can save you a lot of time.
You can create a data frame with both variables by first grouping by subject and part, and then performing a summary of the grouped data frame:
df2 = df1 %>%
  dplyr::group_by(subject, part) %>%
  dplyr::summarise(
    sad_mean = mean(na.omit(sad)),
    na_count = sum(is.na(sad)) / n() * 100
  )
df2
# A tibble: 8 x 4
# Groups: subject [2]
subject part sad_mean na_count
<dbl> <dbl> <dbl> <dbl>
1 1 0 4.75 0
2 1 1 2 50
3 1 2 2.5 50
4 1 3 1.67 25
5 2 0 5.5 50
6 2 1 4.5 50
7 2 2 4 50
8 2 3 4 25
For each subject and part you can calculate the mean of sad, and get the percentage of missing values using is.na and mean: the mean of a logical vector is the proportion of TRUE values.
library(dplyr)
df1 %>%
  group_by(subject, part) %>%
  summarise(sad.m = mean(sad, na.rm = TRUE),
            perc_missing = mean(is.na(sad)) * 100)
# subject part sad.m perc_missing
# <dbl> <dbl> <dbl> <dbl>
#1 1 0 4.75 0
#2 1 1 2 50
#3 1 2 2.5 50
#4 1 3 1.67 25
#5 2 0 5.5 50
#6 2 1 4.5 50
#7 2 2 4 50
#8 2 3 4 25
Same logic with data.table :
library(data.table)
setDT(df1)[, .(sad.m = mean(sad, na.rm = TRUE),
               perc_missing = mean(is.na(sad)) * 100), .(subject, part)]
Try this dplyr approach to compute df3:
library(dplyr)
#Code
df3 <- df1 %>%
  group_by(subject, part) %>%
  summarise(N = 100 * length(which(is.na(sad))) / length(sad))
Output:
# A tibble: 8 x 3
# Groups: subject [2]
subject part N
<dbl> <dbl> <dbl>
1 1 0 0
2 1 1 50
3 1 2 50
4 1 3 25
5 2 0 50
6 2 1 50
7 2 2 50
8 2 3 25
And to combine with df2 you can use left_join():
#Left join
df3 <- df1 %>%
  group_by(subject, part) %>%
  summarise(N = 100 * length(which(is.na(sad))) / length(sad)) %>%
  left_join(df2)
Output:
# A tibble: 8 x 4
# Groups: subject [2]
subject part N sad.m
<dbl> <dbl> <dbl> <dbl>
1 1 0 0 4.75
2 1 1 50 2
3 1 2 50 2.5
4 1 3 25 1.67
5 2 0 50 5.5
6 2 1 50 4.5
7 2 2 50 4
8 2 3 25 4
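For completeness, a base R sketch of the same two summaries, assuming df1 as defined in the question. na.action = na.pass keeps the NA rows so they can be counted, and merge() joins the two results; the renamed columns (sad.m, missing) follow the question's df3:
means <- aggregate(sad ~ subject + part, df1,
                   function(x) mean(x, na.rm = TRUE), na.action = na.pass)
names(means)[3] <- "sad.m"
miss <- aggregate(sad ~ subject + part, df1,
                  function(x) mean(is.na(x)) * 100, na.action = na.pass)
names(miss)[3] <- "missing"
merge(means, miss, by = c("subject", "part"))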

Filter Data frame using 3 different vector conditions

My sample dataset
df <- data.frame(period = rep(1:3, 3),
                 product = rep('A', 9),
                 account = c(rep('1001', 3), rep('1002', 3), rep('1003', 3)),
                 findme = c(0, 0, 0, 1, 0, 1, 4, 2, 0))
My Desired output dataset
output <- data.frame(period = rep(1:3, 2),
                     product = rep('A', 6),
                     account = c(rep('1002', 3), rep('1003', 3)),
                     findme = c(1, 0, 1, 4, 2, 0))
Here are my conditions:
I want to eliminate 3 of the 9 records, based on the conditions below:
a record is dropped if the "findme" value is zero for all periods (1, 2 and 3), and that happens for the same product and the same account.
Rule 1: It should cover periods 1, 2 and 3
Rule 2: The findme value for all periods = 0
Rule 3: All 3 records (periods 1, 2, 3) should have the same product
Rule 4: All 3 records (periods 1, 2, 3) should have the same account
If I understand correctly, you want to drop all records from a product-account combination where findme == 0, if all periods are included in this combination?
library(dplyr)
df %>%
  group_by(product, account, findme) %>%
  mutate(all.periods = all(1:3 %in% period)) %>%
  ungroup() %>%
  filter(!(findme == 0 & all.periods)) %>%
  select(-all.periods)
# A tibble: 6 x 4
period product account findme
<int> <fctr> <fctr> <dbl>
1 1 A 1002 1
2 2 A 1002 0
3 3 A 1002 1
4 1 A 1003 4
5 2 A 1003 2
6 3 A 1003 0
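The trick is grouping on findme as well: a findme == 0 group can only span all three periods when every period's value is zero for that product/account combination, which is exactly the drop condition.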
Here is an option with data.table, where !all(!findme) reads as "not every findme value is zero":
library(data.table)
setDT(df)[df[, .I[all(1:3 %in% period) & !all(!findme)], .(product, account)]$V1]
# period product account findme
#1: 1 A 1002 1
#2: 2 A 1002 0
#3: 3 A 1002 1
#4: 1 A 1003 4
#5: 2 A 1003 2
#6: 3 A 1003 0

Summarise in dplyr package: How to go around `Error: expecting a single value`

I want to make a summary by ID, DRUG and FED, taking the sum of CONC over DVID == 1 and DVID == 2.
df<-
ID DRUG FED DVID CONC
1 1 1 1 20
1 1 1 2 40
2 2 0 1 30
2 2 0 2 100
I tried using this:
df2 <- df %>%
  group_by(ID, DRUG, FED) %>%
  summarise(SumConc = CONC + lag(CONC))
However I am getting this error:
Error: expecting a single value
I don't get the error when I use mutate. Is there a way to go around it so I use summarise in the case described above?
The output should basically be this:
ID DRUG FED SumConc
1 1 1 60
2 2 0 130
This seems pretty straightforward: summarise() expects each expression to return a single value per group, but CONC + lag(CONC) returns one value per row (two per group here), hence the error. Just use sum() and don't mess around with lag().
Get data:
df <- read.table(header = TRUE, text = "
ID DRUG FED DVID CONC
1 1 1 1 20
1 1 1 2 40
2 2 0 1 30
2 2 0 2 100
")
Process:
library(dplyr)
df %>%
  group_by(ID, DRUG, FED) %>%
  summarise(SumConc = sum(CONC))
## ID DRUG FED SumConc
## 1 1 1 1 60
## 2 2 2 0 130
A simple Base R approach would be using aggregate
aggregate(CONC ~ ID + DRUG + FED, df, sum)
# ID DRUG FED CONC
#1 2 2 0 130
#2 1 1 1 60
Or from data.table
library(data.table)
setDT(df)[, .(SumConc = sum(CONC)), .(ID, DRUG, FED)]
# ID DRUG FED SumConc
#1: 1 1 1 60
#2: 2 2 0 130
