Using mutate and summarize to find elements in a vector - r

I'm trying to replace VBA code with R code. Currently in VBA I use SUMIF over a range to find the total value for an ID depending on some dates. In R I'm using mutate and summarize, but there's always an error, and I don't know how to fix it.
If I want to find the total for ID = 1 over values recorded within the last 2 days:
# Sys.Date() = 2016-01-06
df
DATES ID VALUE
2016/01/01 1 10
2016/01/02 2 15
2016/01/05 1 13
the result must be:
ID Value
1 13
Currently, the code is:
df%>%
group_by(ID) %>%
mutate(Total_op = if (Sys.Date()-as.Date(Dates,format="%YYYY-%mm-
%dd")>=1) Value else 0)))%>%
summarize(SumTotal = sum(Total_op))%>%
collect
But the error showed is:
Error: Column 'sumTotal' must be length X (the group size) or one, not Y

With lubridate we can convert the DATES strings to Date objects and filter accordingly:
library(lubridate)
library(tidyverse)
Dat <- ymd("2016-01-06") # set the reference date; Sys.Date() also works
df %>%
  mutate_at("DATES", ymd) %>% # convert the strings to Date
  filter(DATES %within% interval(Dat - 2, Dat)) %>% # keep entries from the last 2 days
  group_by(ID) %>% # group by ID
  summarise(SumTotal = sum(VALUE)) # sum VALUE within each ID
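For comparison, the same windowed sum can be sketched in base R. It sidesteps two problems in the question's code: `if` is not vectorised, so it cannot be applied to a whole column, and the format string "%YYYY-%mm-%dd" does not match dates written as 2016/01/01. The data frame is retyped from the question; `ref_date`, `recent` and `totals` are names chosen here for illustration, and the reference date is fixed instead of taken from Sys.Date() so the result is reproducible.

```r
# Sample data retyped from the question
df <- data.frame(
  DATES = c("2016/01/01", "2016/01/02", "2016/01/05"),
  ID    = c(1, 2, 1),
  VALUE = c(10, 15, 13)
)

ref_date <- as.Date("2016-01-06")                  # stands in for Sys.Date()
df$DATES <- as.Date(df$DATES, format = "%Y/%m/%d") # parse the strings as Date

# Keep rows dated within the 2 days up to and including ref_date
recent <- df[df$DATES >= ref_date - 2 & df$DATES <= ref_date, ]

# Sum VALUE per ID over the remaining rows
totals <- aggregate(VALUE ~ ID, data = recent, FUN = sum)
print(totals)
```

Only the 2016/01/05 row falls inside the window, so ID 1 ends up with a total of 13 and ID 2 drops out.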

Related

How can I identify and extract duplicates from data frame?

My objective is to check whether a patient is using two drugs on the same date.
In the example, patient 1 is using drug A and drug B on the same date, but I want to extract that with code.
df <- data.frame(id = c(1,1,1,2,2,2),
date = c("2020-02-01","2020-02-01","2020-03-02","2019-10-02","2019-10-18","2019-10-26"),
drug_type = c("A","B","A","A","A","B"))
df$date <- as.factor(df$date)
df$drug_type <- as.factor(df$drug_type)
To do this, I first converted date and drug type to factor variables.
Next I used the following code:
df %>%
mutate(lev_actdate = as.factor(actdate))%>%
filter(nlevels(drug_type)>1 & nlevels(date) < nrow(date))
But I failed. I assumed that if a patient is using two drugs at the same date, the number of levels in the date column will be less than its row number. However, now I don't know how to make it with code.
Additionally, I'm puzzled by the following:
if I use nlevels(df$date), the right result is returned, but when I use df %>% nlevels(date), an error is returned:
"Error in nlevels(., df$date) : unused argument (df$date)"
Could you please tell me why this occurred and how can I fix it?
Thank you for your time.
You could use something like
library(dplyr)
df %>%
group_by(id, date) %>%
filter(n_distinct(drug_type) >= 2)
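df %>% nlevels(date) is the same as nlevels(df, date), which is not the same as nlevels(df$date). Note that df %>% nlevels(.$date) does not help either: a dot that only appears inside a nested call does not stop the pipe from inserting df as the first argument, so it expands to nlevels(df, df$date), which is exactly the error shown. Wrap the call in braces instead, df %>% {nlevels(.$date)}, or pipe the column itself, df$date %>% nlevels().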
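The grouping idea and the pipe expansion can both be checked without dplyr. The following is a base R sketch, not the answer's own code: `n_drugs`, `overlap` and `pipe_error` are names made up here, and `ave()` stands in for `group_by()` plus `n_distinct()`.

```r
# Sample data from the question, with date and drug_type as factors
df <- data.frame(
  id = c(1, 1, 1, 2, 2, 2),
  date = factor(c("2020-02-01", "2020-02-01", "2020-03-02",
                  "2019-10-02", "2019-10-18", "2019-10-26")),
  drug_type = factor(c("A", "B", "A", "A", "A", "B"))
)

# Number of distinct drugs within each id/date combination, repeated per row
n_drugs <- ave(as.integer(df$drug_type), df$id, df$date,
               FUN = function(x) length(unique(x)))

# Rows where a patient has two or more drugs on the same date
overlap <- df[n_drugs >= 2, ]
print(overlap)

# What `df %>% nlevels(date)` effectively calls: df becomes the first argument
pipe_error <- tryCatch(nlevels(df, df$date),
                       error = function(e) conditionMessage(e))
print(pipe_error)

# Passing the column itself works
print(nlevels(df$date))
```

The `overlap` rows are the two entries for patient 1 on 2020-02-01.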
Do you need something like this?
library(dplyr)
df %>%
group_by(date) %>%
distinct() %>%
summarise(drug_type_sum = toString(drug_type))
date drug_type_sum
<fct> <chr>
1 2019-10-02 A
2 2019-10-18 A
3 2019-10-26 B
4 2020-02-01 A, B
5 2020-03-02 A

R -- Always grab the last day of the previous year in R

I am an aspiring data scientist, and this will be my first ever question on Stack Overflow.
I have this line of code to help wrangle my data. My date filter is static, and I would prefer not to have to go in and change this hardcoded value every year. What is the best alternative to make my date filter more dynamic? The date column is also difficult to work with because it is not a "date", it is a "dbl".
library(dplyr)
library(lubridate)
# create a sample dataframe
df <- data.frame(
DATE = c(20191230, 20191231, 20200122)
)
Tried so far:
df %>%
filter(DATE >= 20191231)
# load packages (lubridate for dates)
library(dplyr)
library(lubridate)
# create a sample dataframe
df <- data.frame(
DATE = c(20191230, 20191231, 20200122)
)
This looks like this:
DATE
1 20191230
2 20191231
3 20200122
# and now...
df %>% # take the dataframe
mutate(DATE = ymd(DATE)) %>% # turn the DATE column actually into a date
filter(DATE >= floor_date(Sys.Date(), "year") - days(1))
...and this keeps the rows where DATE is on or after the last day of the previous year, i.e. one day before the first day of the current year (floor_date(Sys.Date(), "year")). Run during 2020, it gives:
DATE
1 2019-12-31
2 2020-01-22
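The same cutoff can be sketched in base R. The helper `last_day_prev_year()` and its `today` argument are hypothetical names introduced here; passing a fixed date instead of relying on Sys.Date() keeps the example reproducible.

```r
# Last day of the year before `today` (hypothetical helper, base R only)
last_day_prev_year <- function(today = Sys.Date()) {
  as.Date(format(today, "%Y-01-01")) - 1  # 1 January of this year, minus one day
}

# Sample data from the question: dates stored as numbers like 20191230
df <- data.frame(DATE = c(20191230, 20191231, 20200122))

# Turn the numeric column into a real Date column
df$DATE <- as.Date(as.character(df$DATE), format = "%Y%m%d")

# Pretend "today" is some day in 2020 so the result does not change over time
cutoff <- last_day_prev_year(as.Date("2020-03-15"))
kept <- df[df$DATE >= cutoff, , drop = FALSE]
print(kept)
```

In real use, calling `last_day_prev_year()` with no argument picks up the current date, so the filter never needs editing at the turn of the year.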

Filling missing dates in R

I would like some help regarding a data frame transformation required for an analysis. My data consists of a large amount of individuals with all their historic employment. "EX" is a code representing the reason for ending employment. Something like this:
id Date_start Date_end EX
13 "2001-02-01" "2001-05-30" A
13 "2002-03-01" "2010-06-02" B
14 ... ...
...
So what I would like to do is to "fill in the gaps". This may not be easy, and it's even more difficult because I want it done per id, and each new row should have the EX value of the row before, like this:
id Date_start Date_end EX
13 "2001-02-01" "2001-05-30" A
13 "2001-05-31" "2002-02-28" A
13 "2002-03-01" "2010-06-02" B
14 ... ...
...
I believe the trick would be some kind of lag and aggregate but I'm totally lost.
This is a little bit tricky. You can do most of the manipulation with dplyr, and use lubridate to convert the dates (as.Date() works too, but lubridate makes it easier).
library(dplyr)
library(lubridate)
1. Create the sample data you provided.
names <- c("id", "Date_start", "Date_end", "EX")
row1 <- c(13 , "2001-02-01" , "2001-05-30" , "A")
row2 <- c(13 , "2002-03-01" , "2010-06-02" , "B")
testdata <- rbind(row1,row2) %>% data.frame(stringsAsFactors = F)
row.names(testdata) <- NULL
names(testdata) <- names
testdata$Date_start <- testdata$Date_start %>% as_date()
testdata$Date_end <- testdata$Date_end %>% as_date()
testdata
2. Create a new data set holding the rows you want to add.
id: we keep the same id value, since we are grouping by id.
Date_start: the day after the previous Date_end if there is a gap, otherwise 0 (those rows are filtered out).
Date_end: same logic, using the day before the current Date_start.
EX: the EX value of the previous row, as you stated.
new_data <- testdata %>%
  group_by(id) %>%
  # 0 marks "no gap"; ifelse() returns the dates as plain numbers
  mutate(Date_start1 = ifelse(Date_start - lag(Date_end) == 1, 0, lag(Date_end) + 1),
         Date_end1 = ifelse(Date_start - lag(Date_end) == 1, 0, Date_start - 1),
         EX = lag(EX)) %>% # EX of the row before the gap
  filter(!Date_start1 == 0) %>% # also drops the first row per id (NA)
  select(id, Date_start = Date_start1, Date_end = Date_end1, EX) %>%
  distinct() %>%
  ungroup()
3. ifelse() drops the Date class and returns plain numbers, so we use as_date() from lubridate to convert the new columns back to dates.
new_data$Date_start <- as_date(new_data$Date_start)
new_data$Date_end <- as_date(new_data$Date_end)
4. Combine it with your sample data and arrange it by Date_start.
final <- rbind(testdata,new_data) %>% data.frame() %>% arrange(Date_start)
final
The final data frame contains the original rows plus one filler row per gap, sorted by Date_start.
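The gap-filling steps can also be sketched as one base R function. This is an illustrative alternative rather than the answer's method: `fill_gaps()` is a made-up name, and the id 14 row in the sample data is invented here to show that gaps are never bridged across different ids.

```r
# Insert one filler row for every gap between consecutive spells of the same id.
# The filler row carries the EX value of the spell before the gap.
fill_gaps <- function(d) {
  d <- d[order(d$id, d$Date_start), ]
  n <- nrow(d)
  if (n < 2) return(d)
  prev <- seq_len(n - 1)  # earlier row of each consecutive pair
  nxt  <- prev + 1        # later row of each consecutive pair
  gap_days <- as.numeric(d$Date_start[nxt] - d$Date_end[prev])
  has_gap  <- d$id[prev] == d$id[nxt] & gap_days > 1
  gaps <- data.frame(
    id         = d$id[prev][has_gap],
    Date_start = d$Date_end[prev][has_gap] + 1,   # day after the earlier spell ends
    Date_end   = d$Date_start[nxt][has_gap] - 1,  # day before the later spell starts
    EX         = d$EX[prev][has_gap],
    stringsAsFactors = FALSE
  )
  out <- rbind(d, gaps)
  out <- out[order(out$id, out$Date_start), ]
  rownames(out) <- NULL
  out
}

# Rows for id 13 come from the question; the id 14 row is hypothetical
jobs <- data.frame(
  id         = c(13, 13, 14),
  Date_start = as.Date(c("2001-02-01", "2002-03-01", "2005-01-01")),
  Date_end   = as.Date(c("2001-05-30", "2010-06-02", "2005-12-31")),
  EX         = c("A", "B", "C"),
  stringsAsFactors = FALSE
)

filled <- fill_gaps(jobs)
print(filled)
```

Working on Date columns directly avoids the round trip through numbers that ifelse() forces.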

Earliest Date for each id in R

I have a dataset where each individual (id) has an e_date, and since each individual could have more than one e_date, I'm trying to get the earliest date for each individual. So basically I would like to have a dataset with one row per each id showing his earliest e_date value.
I've used the aggregate function to find the minimum values, created a new variable combining the date and the id, and finally subset the original dataset based on the one containing the minimums, using the new variable. I've come to this:
new <- aggregate(e_date ~ id, data_full, min)
data_full["comb"] <- NULL
data_full$comb <- paste(data_full$id,data_full$e_date)
new["comb"] <- NULL
new$comb <- paste(new$lopnr,new$EDATUM)
data_fixed <- data_full[which(new$comb %in% data_full$comb),]
The first thing is that the aggregate function doesn't seem to work at all: it reduces the number of rows, but viewing the data I can clearly see that some ids appear more than once with different e_date values. Plus, the code gives me different results when I convert the date with as.Date instead of leaving it in its original format (integer). I think the answer is simple but I'm stuck on this one.
We can use data.table. Convert the 'data.frame' to a 'data.table' (setDT(data_full)), order by 'e_date', and, grouped by 'id', take the first row (head(.SD, 1L)).
library(data.table)
setDT(data_full)[order(e_date), head(.SD, 1L), by = id]
Or using dplyr, after grouping by 'id', arrange the 'e_date' (assuming it is of Date class) and get the first row with slice.
library(dplyr)
data_full %>%
group_by(id) %>%
arrange(e_date) %>%
slice(1L)
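If we need a base R option, ave can be used to compare each e_date with the minimum of its id (ave returns numbers, so the comparison is done outside the call)
data_full[with(data_full, as.numeric(e_date) == ave(as.numeric(e_date), id, FUN = min)), ]
Another answer that uses dplyr's filter command (ties for the earliest date are all kept):
data_full %>%
  group_by(id) %>%
  filter(e_date == min(e_date))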
You may use library(sqldf) to get the minimum date as follows:
data1<-data.frame(id=c("789","123","456","123","123","456","789"),
e_date=c("2016-05-01","2016-07-02","2016-08-25","2015-12-11","2014-03-01","2015-07-08","2015-12-11"))
library(sqldf)
data2 = sqldf("SELECT id,
min(e_date) as 'earliest_date'
FROM data1 GROUP BY 1", method = "name__class")
head(data2)
id earliest_date
123 2014-03-01
456 2015-07-08
789 2015-12-11
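The same result can be reached in base R by sorting and keeping the first row per id. This sketch reuses the data1 values from the sqldf answer but stores e_date as a Date; `sorted` and `earliest` are names chosen here.

```r
# Same ids and dates as the sqldf example, with e_date parsed as Date
data1 <- data.frame(
  id = c("789", "123", "456", "123", "123", "456", "789"),
  e_date = as.Date(c("2016-05-01", "2016-07-02", "2016-08-25", "2015-12-11",
                     "2014-03-01", "2015-07-08", "2015-12-11")),
  stringsAsFactors = FALSE
)

# Sort by id, then by date, so the earliest date comes first within each id
sorted <- data1[order(data1$id, data1$e_date), ]

# duplicated() flags every row after the first one for each id
earliest <- sorted[!duplicated(sorted$id), ]
rownames(earliest) <- NULL
print(earliest)
```

Because whole rows are kept, any other columns in the data frame travel along with the earliest date.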
I made a reproducible example, supposing that you grouped some dates by which quarter they were in.
library(lubridate)
library(dplyr)
rand_weeks <- now() + weeks(sample(100))
which_quarter <- quarter(rand_weeks)
df <- data.frame(rand_weeks, which_quarter)
df %>%
group_by(which_quarter) %>% summarise(sort(rand_weeks)[1])
# A tibble: 4 x 2
which_quarter sort(rand_weeks)[1]
<dbl> <time>
1 1 2017-01-05 05:46:32
2 2 2017-04-06 05:46:32
3 3 2016-08-18 05:46:32
4 4 2016-10-06 05:46:32

Find difference between dates in consecutive rows

I know this has been asked before, but the answers I have found seem to rely on POSIXct, whereas I don't see why I can't do this with Date.
I have data like
Person Event VisitDate
1 RFA 2004-06-04
1 EMR 2016-06-03
1 Nil 2016-06-05
I want to get the difference between the dates in a separate column (eventually to average the difference of the dates over all Person ids).
Expected output:
Person Event VisitDate Date Difference in days
1 RFA 2004-06-04
1 EMR 2016-06-03 4382
1 Nil 2016-06-05 2
So far I have used:
EndoSubsetOnSurveil %>%
arrange(Person, as.Date(EndoSubsetOnSurveil$VisitDate, '%d/%m/%y')) %>%
difftime(VisitDate[1:(length(VisitDate)-1)] , VisitDate[2:length(VisitDate)])
but I get the error
Error in as.POSIXct.default(time1, tz = tz) :
do not know how to convert 'time1' to class “POSIXct”
Explanation:
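(i) The format passed to as.Date should be %Y-%m-%d, matching how the dates are written. (ii) The column itself has to be converted with as.Date and stored; in your code the conversion is only used inside arrange, so VisitDate is still not a Date afterwards. Piping the data frame into difftime also passes the whole data frame as time1, which is what triggers the error, so compute the difference inside mutate instead. (iii) lag makes the row-to-row comparison simpler than indexing the vector by hand.
Code:
I think the output of the last chunk (grouped by Person) is what you want, rather than the ungrouped one before it.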
# SAMPLE DATA -------------------------------------------------------------
EndoSubsetOnSurveil <-
data.frame(Person = c(1,1,2,2),
VisitDate = c("2004-06-04", "2016-06-03", "2016-07-01",
"2016-08-01"))
EndoSubsetOnSurveil$VisitDate <-
as.Date(EndoSubsetOnSurveil$VisitDate, '%Y-%m-%d')
# DIFFERENCE BETWEEN VISIT WITHOUT GROUPING -------------------------------
library(dplyr)
EndoSubsetOnSurveil %>% arrange(Person, VisitDate) %>%
mutate(diffDate = difftime(VisitDate, lag(VisitDate,1)))
# DIFFERENCE BETWEEN VISIT BY PATIENT -------------------------------------
EndoSubsetOnSurveil %>% arrange(Person, VisitDate) %>% group_by(Person) %>%
mutate(diffDate = difftime(VisitDate, lag(VisitDate,1))) %>% ungroup()
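The per-person difference, and the average the question was ultimately after, can be sketched in base R as well. The data frame is retyped from the question; `visits`, `diff_days` and `avg_gap` are names chosen here, and `ave()` with `diff()` stands in for `group_by()` plus `lag()`.

```r
# Sample data retyped from the question
visits <- data.frame(
  Person = c(1, 1, 1),
  Event = c("RFA", "EMR", "Nil"),
  VisitDate = as.Date(c("2004-06-04", "2016-06-03", "2016-06-05")),
  stringsAsFactors = FALSE
)

# Sort so that consecutive rows are consecutive visits of the same person
visits <- visits[order(visits$Person, visits$VisitDate), ]

# Days since the previous visit of the same person; NA for each first visit
visits$diff_days <- ave(as.numeric(visits$VisitDate), visits$Person,
                        FUN = function(x) c(NA, diff(x)))
print(visits)

# Average gap per person, ignoring the NA of the first visit
avg_gap <- tapply(visits$diff_days, visits$Person, mean, na.rm = TRUE)
print(avg_gap)
```

Plain Date columns are enough here; nothing needs to be converted to POSIXct. The gap between 2004-06-04 and 2016-06-03 comes out as 4382 days.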
