How can I identify and extract duplicates from data frame?


My objective is to check whether a patient is using two drugs on the same date.
In the example, patient 1 is using drug A and drug B on the same date, and I want to extract those rows with code.
df <- data.frame(id = c(1,1,1,2,2,2),
date = c("2020-02-01","2020-02-01","2020-03-02","2019-10-02","2019-10-18","2019-10-26"),
drug_type = c("A","B","A","A","A","B"))
df$date <- as.factor(df$date)
df$drug_type <- as.factor(df$drug_type)
In order to do this, I first converted date and drug type to factor variables.
Next I used the following code:
df %>%
mutate(lev_actdate = as.factor(actdate))%>%
filter(nlevels(drug_type)>1 & nlevels(date) < nrow(date))
But it failed. I assumed that if a patient is using two drugs on the same date, the number of levels in the date column would be less than the number of rows. However, I don't know how to express that in code.
Additionally, the following confuses me:
if I use nlevels(df$date), the right result is returned, but when I use df %>% nlevels(date), I get the error
"Error in nlevels(., df$date) : unused argument (df$date)"
Could you please tell me why this occurs and how I can fix it?
Thank you for your time.

You could use something like
library(dplyr)
df %>%
group_by(id, date) %>%
filter(n_distinct(drug_type) >= 2)
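df %>% nlevels(date) is the same as nlevels(df, date), which is not the same as nlevels(df$date). The pipe inserts df as the first argument unless the dot appears as a direct argument, so df %>% nlevels(.$date) still expands to nlevels(df, df$date) and fails with the same error. Instead you could try df %>% {nlevels(.$date)} or df %>% pull(date) %>% nlevels().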
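As a minimal sketch of both points, the snippet below rebuilds the question's df, keeps the rows where one patient has two or more distinct drugs on the same date, and counts the distinct dates through the pipe. The names same_day and n_dates are only illustrative, and the date column is left as character and converted with factor() inside the pipe.

```r
library(dplyr)

# Sample data from the question
df <- data.frame(
  id = c(1, 1, 1, 2, 2, 2),
  date = c("2020-02-01", "2020-02-01", "2020-03-02",
           "2019-10-02", "2019-10-18", "2019-10-26"),
  drug_type = c("A", "B", "A", "A", "A", "B")
)

# Rows where the same patient has two or more distinct drugs on one date
same_day <- df %>%
  group_by(id, date) %>%
  filter(n_distinct(drug_type) >= 2) %>%
  ungroup()

# Pull the column out first, so nlevels() receives a vector rather than df
n_dates <- df %>% pull(date) %>% factor() %>% nlevels()

print(same_day)
print(n_dates)
```

Grouping by both id and date matters here: grouping by date alone would also flag two different patients who happen to take different drugs on the same day.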

Do you need something like this?
library(dplyr)
df %>%
group_by(date) %>%
distinct() %>%
summarise(drug_type_sum = toString(drug_type))
date drug_type_sum
<fct> <chr>
1 2019-10-02 A
2 2019-10-18 A
3 2019-10-26 B
4 2020-02-01 A, B
5 2020-03-02 A

Related

How to subtract using max(date) and second latest (month) date

I'm trying to create a new variable which equals the latest month's value minus the previous month's (or 3 months prior, etc.).
A quick df:
country <- c("XYZ", "XYZ", "XYZ")
my_dates <- c("2021-10-01", "2021-09-01", "2021-08-01")
var1 <- c(1, 2, 3)
df1 <- country %>% cbind(my_dates) %>% cbind(var1) %>% as.data.frame()
df1$my_dates <- as.Date(df1$my_dates)
df1$var1 <- as.numeric(df1$var1)
For example, I've tried (partially from: How to subtract months from a date in R?)
library(tidyverse)
df2 <- df1 %>%
mutate(dif_1month = var1[my_dates==max(my_dates)] -var1[my_dates==max(my_dates) %m-% months(1)])
I've also tried different variations of using lag():
df2 <- df1 %>%
mutate(dif_1month = var1[my_dates==max(my_dates)] - var1[my_dates==max(my_dates)-lag(max(my_dates), n=1L)])
Any suggestions on how to grab the value of a variable when dates equal the second latest observation?
Thanks for help, and apologies for not including any data. Can edit if necessary.
Edited with a few potential answers:
#this gives me the value of var1 of the latest date
df2 <- df1 %>%
mutate(value_1month = var1[my_dates==max(my_dates)])
#this gives me the date of the second latest date
df2 <- df1 %>%
mutate(month1 = max(my_dates) %m-%months(1))
#This gives me the second to latest value
df2 <- df1 %>%
mutate(var1_1month = var1[my_dates==max(my_dates) %m-%months(1)])
#This gives me the difference of the latest value and the second to last of var1
df2 <- df1 %>%
mutate(diff_1month = var1[my_dates==max(my_dates)] - var1[my_dates==max(my_dates) %m-%months(1)])

mutate requires each output to have length 1 or the same length as the number of rows of the original data. When we subset like this, the length can differ. We may need ifelse or case_when:
library(dplyr)
library(lubridate)
df1 %>%
mutate(diff_1month = case_when(my_dates==max(my_dates) ~
my_dates %m-% months(1)))
NOTE: Without a reproducible example, the column types and values are not clear.
Based on the OP's update, we may do an arrange first, grab the last two var1 values and take their difference:
df1 %>%
arrange(my_dates) %>%
mutate(dif_1month = diff(tail(var1, 2)))
. my_dates var1 dif_1month
1 XYZ 2021-08-01 3 -1
2 XYZ 2021-09-01 2 -1
3 XYZ 2021-10-01 1 -1
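If the comparison should be tied to the calendar (exactly one month before the latest date) rather than to the previous row, a sketch along the following lines may work. It assumes there is exactly one row per month, so each subset has length 1; the names latest and df2 are illustrative.

```r
library(dplyr)
library(lubridate)

# Sample data from the question, built directly with the intended column types
df1 <- data.frame(
  country = "XYZ",
  my_dates = as.Date(c("2021-10-01", "2021-09-01", "2021-08-01")),
  var1 = c(1, 2, 3)
)

# Latest date, computed once outside the pipeline
latest <- max(df1$my_dates)

# Value at the latest date minus the value one calendar month earlier.
# Each subset has length 1 here, so mutate() recycles the result to every row.
df2 <- df1 %>%
  mutate(dif_1month = var1[my_dates == latest] -
           var1[my_dates == latest %m-% months(1)])

print(df2)
```

If the month being looked up is missing from the data, the subset has length 0 and mutate() will stop with an error, so a real pipeline would need to handle that case. Replacing months(1) with months(3) gives the three-month comparison mentioned in the question.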

Using mutate and summarize to find elements in a vector

I'm trying to replace VBA code with R code. Currently in VBA I use SUMIF over a range to find the total value for an ID depending on some dates. In R I'm using mutate and summarize, but there's always an error and I don't know how to fix it.
If I want to find the value for ID=1 recorded within the last 2 days:
#sys.Date() = 2016-01-06
df
DATES ID VALUE
2016/01/01 1 10
2016/01/02 2 15
2016/01/05 1 13
the result must be:
ID Value
1 13
Currently, the code is:
df%>%
group_by(ID) %>%
mutate(Total_op = if (Sys.Date()-as.Date(Dates,format="%YYYY-%mm-
%dd")>=1) Value else 0)))%>%
summarize(SumTotal = sum(Total_op))%>%
collect
But the error showed is:
Error: Column 'sumTotal' must be length X (the group size) or one, not Y

With lubridate we can convert the DATES strings to Date objects and filter accordingly:
library(lubridate)
library(tidyverse)
Dat <- ymd("2016-01-06") #Set a date. Can be done by Sys.Date()
df %>%
mutate_at("DATES",ymd) %>% #convert to datetime
filter(DATES %within% interval(Dat-2,Dat)) %>% #filter entries in the last 2 days
group_by(ID) %>% #group by ID
summarise(SumTotal = sum(VALUE)) #summarise value as Sum
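For comparison, here is a sketch of the same SUMIF-style logic without lubridate. Two things in the question's attempt stand out: if is not vectorised, so it cannot test a whole column at once, and the date format should be written with single-letter codes ("%Y/%m/%d" for this data), not "%YYYY-%mm-%dd". The reference date is fixed to 2016-01-06 so the result is reproducible; today and result are illustrative names.

```r
library(dplyr)

# Sample data from the question
df <- data.frame(
  DATES = c("2016/01/01", "2016/01/02", "2016/01/05"),
  ID = c(1, 2, 1),
  VALUE = c(10, 15, 13)
)

# Fixed reference date; Sys.Date() could be used instead
today <- as.Date("2016-01-06")

result <- df %>%
  mutate(DATES = as.Date(DATES, format = "%Y/%m/%d")) %>%  # parse the strings
  filter(DATES >= today - 2, DATES <= today) %>%            # keep the last 2 days
  group_by(ID) %>%
  summarise(SumTotal = sum(VALUE))

print(result)
```

Filtering before summarising keeps only the rows that SUMIF would have matched, so IDs with no recent rows simply do not appear in the result.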

Filling missing dates in R

I would like some help regarding a data frame transformation required for an analysis. My data consists of a large amount of individuals with all their historic employment. "EX" is a code representing the reason for ending employment. Something like this:
id Date_start Date_end EX
13 "2001-02-01" "2001-05-30" A
13 "2002-03-01" "2010-06-02" B
14 ... ...
...
So what I would like to do is to "fill in the gaps". This may not be easy, and it's even more difficult because I want it done per id, and each new row should have the EX value of the row before, like this:
id Date_start Date_end EX
13 "2001-02-01" "2001-05-30" A
13 "2001-05-31" "2002-02-28" A
13 "2002-03-01" "2010-06-02" B
14 ... ...
...
I believe the trick would be some kind of lag and aggregate but I'm totally lost.

This is a little bit tricky. You can do most of the manipulation with the dplyr package and use lubridate to convert the dates (as.Date() works too, but lubridate makes it easier).
library(dplyr)
library(lubridate)
1. Create the sample data you provided.
names <- c("id", "Date_start", "Date_end", "EX")
row1 <- c(13 , "2001-02-01" , "2001-05-30" , "A")
row2 <- c(13 , "2002-03-01" , "2010-06-02" , "B")
testdata <- rbind(row1,row2) %>% data.frame(stringsAsFactors = F)
row.names(testdata) <- NULL
names(testdata) <- names
testdata$Date_start <- testdata$Date_start %>% as_date()
testdata$Date_end <- testdata$Date_end %>% as_date()
testdata
2. Create a new data set containing the rows you want to add.
id: we keep the same id value, since we are grouping by id.
Date_start: the day after the previous Date_end if there is a gap, otherwise 0 (those rows are filtered out afterwards).
Date_end: same logic, using the day before the current Date_start.
EX: the EX value of the previous row, as you stated.
new_data <- testdata %>%
group_by(id) %>%
mutate(Date_start1 = ifelse(Date_start-lag(Date_end) == 1,0,lag(Date_end)+1),
Date_end1 = ifelse(Date_start-lag(Date_end) == 1,0,Date_start-1),
EX=lag(EX)) %>%
filter(!Date_start1 ==0) %>%
select(id, Date_start=Date_start1,Date_end=Date_end1,EX) %>%
distinct() %>%
ungroup()
3. ifelse() drops the Date class and returns plain numbers, so we use as_date() from lubridate to convert the new columns back to dates.
new_data$Date_start <- as_date(new_data$Date_start)
new_data$Date_end <- as_date(new_data$Date_end)
4. Combine it with your sample data and arrange it by Date_start.
final <- rbind(testdata,new_data) %>% data.frame() %>% arrange(Date_start)
final
The final data frame contains the two original rows plus the gap row between them, matching the desired output in the question.
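A more compact variant of the same idea is sketched below. It uses lead() to look at the next period's start, which keeps the columns as Date throughout and avoids the numeric round trip through ifelse(). It assumes the periods within an id do not overlap; gaps and next_start are illustrative names.

```r
library(dplyr)

# Sample data from the question, with proper Date columns
testdata <- data.frame(
  id = c(13, 13),
  Date_start = as.Date(c("2001-02-01", "2002-03-01")),
  Date_end = as.Date(c("2001-05-30", "2010-06-02")),
  EX = c("A", "B")
)

# One row per gap: starts the day after a period ends,
# ends the day before the next period starts, and inherits that period's EX
gaps <- testdata %>%
  arrange(id, Date_start) %>%
  group_by(id) %>%
  mutate(next_start = lead(Date_start)) %>%
  ungroup() %>%
  filter(!is.na(next_start), as.numeric(next_start - Date_end) > 1) %>%
  transmute(id, Date_start = Date_end + 1, Date_end = next_start - 1, EX)

# Original rows plus gap rows, in chronological order per id
final <- bind_rows(testdata, gaps) %>%
  arrange(id, Date_start)

print(final)
```

Because each gap row is derived from the row that precedes the gap, it carries that row's EX value automatically, with no need for lag() on EX.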

To create a frequency table with dplyr to count the factor levels and missing values and report it

Some questions are similar to this topic (here or here, as an example) and I know one solution that works, but I want a more elegant response.
I work in epidemiology and I have variables 1 and 0 (or NA). Example:
Does the patient have cancer?
NA or 0 is no
1 is yes
Let's say I have several variables in my dataset and I want to count only the values equal to "1". It's a classic frequency table, but dplyr is making things more complicated than I imagined at first glance.
My code is working:
dataset %>%
select(VISimpair, HEARimpai, IntDis, PhyDis, EmBehDis, LearnDis,
ComDis, ASD, HealthImpair, DevDelays) %>% # replace to your needs
summarise_all(funs(sum(1-is.na(.))))
And you can reproduce this code here:
library(tidyverse)
dataset <- data.frame(var1 = rep(c(NA,1),100), var2=rep(c(NA,1),100))
dataset %>% select(var1, var2) %>% summarise_all(funs(sum(1-is.na(.))))
But I really want to select the variables I want, count how many 0 (or NA) and how many 1 each has, and report that in a table.
Thanks.

What about the following frequency table per variable?
First, I edit your sample data to also include 0's and load the necessary libraries.
library(tidyr)
library(dplyr)
dataset <- data.frame(var1 = rep(c(NA,1,0),100), var2=rep(c(NA,1,0),100))
Second, I convert the data using gather to make it easier to group_by later for the frequency table created by count, as mentioned by CPak.
dataset %>%
select(var1, var2) %>%
gather(var, val) %>%
mutate(val = factor(val)) %>%
group_by(var, val) %>%
count()
# A tibble: 6 x 3
# Groups: var, val [6]
var val n
<chr> <fct> <int>
1 var1 0 100
2 var1 1 100
3 var1 NA 100
4 var2 0 100
5 var2 1 100
6 var2 NA 100
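Since the question treats NA and 0 both as "no", the same reshaping idea can be sketched with those two collapsed into one category. This version uses pivot_longer(), the newer tidyr replacement for gather(); the status column and the "yes"/"no" labels are illustrative choices, not part of the original answer.

```r
library(dplyr)
library(tidyr)

# Sample data from the answer above: each variable holds NA, 1 and 0
dataset <- data.frame(var1 = rep(c(NA, 1, 0), 100),
                      var2 = rep(c(NA, 1, 0), 100))

freq <- dataset %>%
  pivot_longer(everything(), names_to = "var", values_to = "val") %>%
  # NA and 0 both count as "no"; only 1 counts as "yes"
  mutate(status = if_else(is.na(val) | val == 0, "no", "yes")) %>%
  count(var, status)

print(freq)
```

To restrict the table to particular variables, everything() can be replaced by a selection such as c(var1, var2).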

A quick and dirty method to do this is to coerce your input into factors:
dataset$var1 = as.factor(dataset$var1)
dataset$var2 = as.factor(dataset$var2)
summary(dataset$var1)
summary(dataset$var2)
summary() then reports the number of occurrences of each factor level, including the NAs.

how to use dplyr() to subset observations based on the difference between two date

I've got a data frame (df1) with an ID variable and two date variables (dat1 and dat2).
I'd like to subset the data frame so that I get the observations for which the difference between dat2 and dat1 is less than or equal to 30 days.
I'm trying to use dplyr() but I can't get it to work.
Any help would be much appreciated.
Starting point (df):
df1 <- data.frame(ID=c("a","b","c","d","e","f"),dat1=c("01/05/2017","01/05/2017","01/05/2017","01/05/2017","01/05/2017","01/05/2017"),dat2=c("14/05/2017","05/06/2017","23/05/2017","15/10/2017","15/11/2017","15/12/2017"), stringsAsFactors = FALSE)
Desired outcome (df):
dfgoal <- data.frame(ID=c("a","c"),dat1=c("01/05/2017","01/05/2017"),dat2=c("14/05/2017","23/05/2017"),newvar=c(13,22))
Current code:
library(dplyr)
df2 <- df1 %>% mutate(newvar = as.Date(dat2) - as.Date(dat1)) %>%
filter(newvar <= 30)

We need to convert both columns to Date class before doing the subtraction:
library(dplyr)
library(lubridate)
df1 %>%
mutate_at(2:3, dmy) %>%
mutate(newvar = as.numeric(dat2- dat1)) %>%
filter(newvar <=30)
as.Date also needs the format argument; otherwise it assumes one of its default formats, such as %Y-%m-%d. Here, the dates are in %d/%m/%Y:
df1 %>%
mutate(newvar = as.numeric(as.Date(dat2, "%d/%m/%Y") - as.Date(dat1, "%d/%m/%Y"))) %>%
filter(newvar <= 30)
# ID dat1 dat2 newvar
#1 a 01/05/2017 14/05/2017 13
#2 c 01/05/2017 23/05/2017 22
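On dplyr 1.0.0 or later, the same conversion can be written with across(), which supersedes mutate_at(). The sketch below assumes that version and, unlike the second snippet above, leaves dat1 and dat2 as Date columns in the result; df2 is the name used in the question.

```r
library(dplyr)

# Sample data from the question
df1 <- data.frame(
  ID = c("a", "b", "c", "d", "e", "f"),
  dat1 = rep("01/05/2017", 6),
  dat2 = c("14/05/2017", "05/06/2017", "23/05/2017",
           "15/10/2017", "15/11/2017", "15/12/2017"),
  stringsAsFactors = FALSE
)

df2 <- df1 %>%
  # Parse both columns as day/month/year
  mutate(across(c(dat1, dat2), ~ as.Date(.x, format = "%d/%m/%Y"))) %>%
  # Difference in days as a plain number
  mutate(newvar = as.numeric(dat2 - dat1)) %>%
  filter(newvar <= 30)

print(df2)
```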
