R: expand and fill data frame by date in series

I have the following raw data frame:
group <- c("A", "B", "C")
demo_df <- data.frame(date = c("2018-11-28", "2018-12-17", "2019-01-23"), group)
Raw data frame:
        date group
1 2018-11-28     A
2 2018-12-17     B
3 2019-01-23     C
I want a data frame that expands the dates into a daily sequence but still keeps the group information. For example, dates from 2018-11-28 to 2018-12-16 belong to group A, dates from 2018-12-17 to 2019-01-22 belong to group B, and 2019-01-23 belongs to group C.
This is the output (result_df) I want:
time <- c(seq(as.Date("2018-11-28"), as.Date("2018-12-17") - 1, by = 1),
          seq(as.Date("2018-12-17"), as.Date("2019-01-23") - 1, by = 1),
          as.Date("2019-01-23"))
group1 <- c(rep("A", as.numeric(as.Date("2018-12-17") - as.Date("2018-11-28"))),
            rep("B", as.numeric(as.Date("2019-01-23") - as.Date("2018-12-17"))),
            "C")
result_df <- data.frame(time, group1)
result_df
I am wondering if there is any more efficient way (using dplyr) to handle this issue.
Thanks in advance.

First, make sure date is stored as a date object:
demo_df$date <- as.Date(demo_df$date, format = "%Y-%m-%d")
Then using tidyverse, we first complete the sequence, then fill the group down:
library(tidyverse)
demo_df %>%
  complete(date = seq.Date(min(date), max(date), by = "day")) %>%
  fill(group)
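As a quick check, running the whole pipeline on the demo data should give one row per calendar day between the first and last date, with the group carried forward (a minimal, self-contained sketch):
library(tidyverse)

demo_df <- data.frame(date = c("2018-11-28", "2018-12-17", "2019-01-23"),
                      group = c("A", "B", "C"))

expanded <- demo_df %>%
  mutate(date = as.Date(date)) %>%                                 # parse the strings as Dates
  complete(date = seq.Date(min(date), max(date), by = "day")) %>%  # one row per calendar day
  fill(group)                                                      # carry the last group forward

nrow(expanded)          # 57 days, 2018-11-28 through 2019-01-23
table(expanded$group)   # A: 19, B: 37, C: 1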

Going through this years later, here is a variation on Mako212's answer:
demo_df %>% complete(date = full_seq(date, 1)) %>% fill(group)

Related

How to filter date numbers, incomplete dates, and NAs from database and convert to uniform date class in r

I have a large database with a date column that contains date numbers coming from Excel, incomplete dates that are missing the year (the year is in another column), and some cells with the date missing entirely. I found out how to change the format of the dates, but the problem is how to separate the three types of cells in the date variable (Excel date numbers, incomplete dates, and empty cells). I managed to do it by filtering on a created column (value) that I DON'T have in the real database.
This is my original database (reproduced in the read.csv call below). The end result I require is the same data with a properly formatted date column (data_formatted), with NA where no date is available. What I managed to do was to filter the dataset by the fictitious value column and convert the dates to the required format. This is what I did:
library(dplyr)
data_a <- read.csv(text = "
year,date,value
2018,43238,1
2017,43267,2
2020,7/25,3
2018,,4
2013,,5
2000,8/23,6
2000,9/21,7")
data_b <- data_a %>%
  filter(value %in% c(1, 2)) %>%
  mutate(data_formatted = as.Date(as.numeric(date), origin = "1899-12-30"))
data_c <- data_a %>%
  filter(value %in% c(3, 6, 7)) %>%
  mutate(data_formatted = as.Date(paste0(year, "/", date)))
data_d <- data_a %>%
  filter(value %in% c(4, 5)) %>%
  mutate(data_formatted = NA)
data_final <- rbind(data_b, data_c, data_d)
I need to do the same all at once WITHOUT using the value column.
You can use a conditional (case_when) to handle the scenarios and apply a different conversion function to each.
Code
library(dplyr)
library(stringr)
library(lubridate)
data_a %>%
  mutate(
    data_formatted = case_when(
      !str_detect(date, "/") ~ as.Date(as.numeric(date), origin = "1899-12-30"),
      TRUE ~ ymd(paste0(year, "/", date))
    )
  )
Output
  year  date value data_formatted
1 2018 43238     1     2018-05-18
2 2017 43267     2     2018-06-16
3 2020  7/25     3     2020-07-25
4 2018           4           <NA>
5 2013           5           <NA>
6 2000  8/23     6     2000-08-23
7 2000  9/21     7     2000-09-21
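One caveat: case_when() evaluates every right-hand side for every row, so as.numeric(date) also runs on the "7/25"-style values and ymd() also runs on the serial-number rows, producing harmless "NAs introduced by coercion" / "failed to parse" warnings. A sketch of the same idea with the warnings silenced (quiet = TRUE is a lubridate argument, suppressWarnings() is base R):
library(dplyr)
library(stringr)
library(lubridate)

data_a %>%
  mutate(
    data_formatted = case_when(
      # Excel serial numbers (no "/"): convert via the Excel origin
      !str_detect(date, "/") ~ as.Date(suppressWarnings(as.numeric(date)),
                                       origin = "1899-12-30"),
      # month/day fragments: prepend the year column and parse quietly
      TRUE ~ ymd(paste0(year, "/", date), quiet = TRUE)
    )
  )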
Please try
data_a2 <- data_a %>%
  mutate(date2 = as.numeric(as.Date(as.numeric(ifelse(str_detect(date, '/'), NA, date)),
                                    origin = "1899-12-30")),
         date2_ = as.numeric(as.Date(ifelse(str_detect(date, '/'), paste0(year, '/', date), NA),
                                     format = '%Y/%m/%d')),
         date_formatted = as.Date(coalesce(date2, date2_), origin = "1970-01-01")) %>%
  dplyr::select(-date2, -date2_)

How can I identify and extract duplicates from a data frame?

My objective is to check whether a patient is using two drugs on the same date.
In the example, patient 1 is using drug A and drug B on the same date, and I want to extract those rows with code.
df <- data.frame(id = c(1, 1, 1, 2, 2, 2),
                 date = c("2020-02-01", "2020-02-01", "2020-03-02",
                          "2019-10-02", "2019-10-18", "2019-10-26"),
                 drug_type = c("A", "B", "A", "A", "A", "B"))
df$date <- as.factor(df$date)
df$drug_type <- as.factor(df$drug_type)
To do this, I first made date and drug_type factor variables.
Next I used the following code:
df %>%
  mutate(lev_actdate = as.factor(actdate)) %>%
  filter(nlevels(drug_type) > 1 & nlevels(date) < nrow(date))
But I failed. I assumed that if a patient was using two drugs on the same date, the number of levels in the date column would be less than the number of rows. However, I don't know how to express that in code.
Additionally, something else puzzles me:
if I use nlevels(df$date), the right result is returned, but when I use df %>% nlevels(date), an error is returned:
"Error in nlevels(., df$date) : unused argument (df$date)"
Could you please tell me why this occurs and how I can fix it?
Thank you for your time.
You could use something like
library(dplyr)
df %>%
  group_by(id, date) %>%
  filter(n_distinct(drug_type) >= 2)
df %>% nlevels(date) is the same as nlevels(df, date), which is not the same as nlevels(df$date). To get the latter inside a pipe you could try df %>% {nlevels(.$date)} (the braces keep the pipe from inserting df as the first argument of nlevels()) or df %>% pull(date) %>% nlevels().
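A quick demonstration, using the df defined in the question (its date column has five distinct dates, hence five factor levels):
library(dplyr)

nlevels(df$date)                  # 5 -- the intended call
df %>% {nlevels(.$date)}          # 5 -- braces: only the dot is substituted
df %>% pull(date) %>% nlevels()   # 5 -- extract the column first, then count levels
# df %>% nlevels(date)            # error: this becomes nlevels(df, date)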
Do you need something like this?
library(dplyr)
df %>%
  group_by(date) %>%
  distinct() %>%
  summarise(drug_type_sum = toString(drug_type))
  date       drug_type_sum
  <fct>      <chr>
1 2019-10-02 A
2 2019-10-18 A
3 2019-10-26 B
4 2020-02-01 A, B
5 2020-03-02 A

How to reorder a file in chronological order

I have a dataset with multiple columns, but I'd like to change the order of the rows so they are in chronological order by date! This is a really bad example, but here is what the data look like:
Station  year  ID
1        2020  D
2        2019  C
3        2017  A
4        2018  B
Would there be code to reorder the rows by date, oldest to newest?
Station  year  ID
3        2017  A
4        2018  B
2        2019  C
1        2020  D
To look something like this!
Any help would be amazing! :)
Thank you
Well... "2020" on its own is not a date; you can sort that column as a regular integer (see the one-liner after the code below).
But if you had full dates like "2020-01-25", transforming the strings to dates is as easy as:
library(dplyr)

df <- tibble(n  = c(1, 2, 3, 4),
             dt = c("2020-01-01", "2019-01-01", "2017-01-01", "2018-01-01"),
             l  = c("D", "C", "A", "B"))

df <- df %>%
  mutate(dt = as.Date(dt)) %>%  # parse the strings as Dates
  arrange(dt)                   # sort oldest to newest
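If the column really is just an integer year, no date conversion is needed at all; a minimal sketch, where station_data is a placeholder name for your real data frame with the Station / year / ID columns:
library(dplyr)

# 'station_data' stands in for your actual data frame
station_data %>%
  arrange(year)         # oldest to newest; use arrange(desc(year)) for newest first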
Use the ymd() function from the lubridate package to convert dt to Date format and year() to extract the year. In this format you can sort the dates with arrange():
library(dplyr)
library(lubridate)
# data borrowed from abreums
df <- tibble(n  = c(1, 2, 3, 4),
             dt = c("2020-01-01", "2019-01-01", "2017-01-01", "2018-01-01"),
             l  = c("D", "C", "A", "B"))

df1 <- df %>%
  mutate(dt = ymd(dt),      # "2020-01-01"
         dt = year(dt)) %>% # 2020
  arrange(dt)

Always grab the last day of the previous year in R

I am an aspiring data scientist, and this will be my first ever question on StackOF.
I have this line of code to help wrangle my data. My date filter is static, and I would prefer not to have to go in and change the hardcoded value every year. What is the best alternative to make my date filter more dynamic? The date column is also difficult to work with because it is not a "date", it is a "dbl".
library(dplyr)
library(lubridate)
# create a sample dataframe
df <- data.frame(
DATE = c(20191230, 20191231, 20200122)
)
Tried so far:
df %>%
filter(DATE >= 20191231)
# load packages (lubridate for dates)
library(dplyr)
library(lubridate)
# create a sample dataframe
df <- data.frame(
DATE = c(20191230, 20191231, 20200122)
)
This looks like this:
      DATE
1 20191230
2 20191231
3 20200122
# and now...
df %>%                                   # take the dataframe
  mutate(DATE = ymd(DATE)) %>%           # turn the DATE column into an actual date
  filter(DATE >= floor_date(Sys.Date(), "year") - days(1))
...and filter the rows where DATE is greater than or equal to one day before the first day of the current year (floor_date(Sys.Date(), "year") returns the first day of the year).
        DATE
1 2019-12-31
2 2020-01-22
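To see why floor_date(Sys.Date(), "year") - days(1) always lands on 31 December of the previous year, here is the arithmetic on its own (the example values assume the code is run some time in 2020, as in the output above):
library(lubridate)

Sys.Date()                                 # e.g. "2020-01-22"
floor_date(Sys.Date(), "year")             # "2020-01-01", the first day of the current year
floor_date(Sys.Date(), "year") - days(1)   # "2019-12-31", the last day of the previous year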

Earliest Date for each id in R

I have a dataset where each individual (id) has an e_date, and since each individual can have more than one e_date, I'm trying to get the earliest date for each individual. So basically I would like a dataset with one row per id showing its earliest e_date value.
I've used the aggregate function to find the minimum values, created a new variable combining the date and the id, and finally subset the original dataset based on the one containing the minimums, using the new variable. I've come to this:
new <- aggregate(e_date ~ id, data_full, min)
data_full["comb"] <- NULL
data_full$comb <- paste(data_full$id,data_full$e_date)
new["comb"] <- NULL
new$comb <- paste(new$lopnr,new$EDATUM)
data_fixed <- data_full[which(new$comb %in% data_full$comb),]
The first problem is that the aggregate function doesn't seem to work at all: it reduces the number of rows, but viewing the data I can clearly see that some ids still appear more than once with different e_date values. Also, the code gives me different results when I use the as.Date format instead of the original format for the date (integer). I think the answer is simple but I'm stuck on this one.
We can use data.table. Convert the 'data.frame' to a 'data.table' (setDT(data_full)), order by 'e_date', and, grouped by 'id', take the first row (head(.SD, 1L)).
library(data.table)
setDT(data_full)[order(e_date), head(.SD, 1L), by = id]
Or using dplyr, after grouping by 'id', arrange the 'e_date' (assuming it is of Date class) and get the first row with slice.
library(dplyr)
data_full %>%
  group_by(id) %>%
  arrange(e_date) %>%
  slice(1L)
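As a side note, dplyr 1.0.0 and later also provide slice_min(), which combines the arrange and slice steps; a minimal sketch assuming the same data_full:
library(dplyr)

data_full %>%
  group_by(id) %>%
  slice_min(e_date, n = 1, with_ties = FALSE) %>%  # one earliest row per id
  ungroup()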
If we need a base R option, ave can be used to compute the per-id minimum and keep only the matching rows:
data_full[with(data_full, as.numeric(e_date) == ave(as.numeric(e_date), id, FUN = min)), ]
Another answer that uses dplyr's filter command:
dta %>%
  group_by(id) %>%
  filter(date == min(date))
You may use library(sqldf) to get the minimum date as follows:
data1 <- data.frame(id = c("789", "123", "456", "123", "123", "456", "789"),
                    e_date = c("2016-05-01", "2016-07-02", "2016-08-25", "2015-12-11",
                               "2014-03-01", "2015-07-08", "2015-12-11"))
library(sqldf)
data2 <- sqldf("SELECT id,
                       min(e_date) as 'earliest_date'
                FROM data1
                GROUP BY 1", method = "name__class")
head(data2)
   id earliest_date
1 123    2014-03-01
2 456    2015-07-08
3 789    2015-12-11
I made a reproducible example, supposing that you grouped some dates by which quarter they were in.
library(lubridate)
library(dplyr)
rand_weeks <- now() + weeks(sample(100))
which_quarter <- quarter(rand_weeks)
df <- data.frame(rand_weeks, which_quarter)
df %>%
  group_by(which_quarter) %>%
  summarise(sort(rand_weeks)[1])
# A tibble: 4 x 2
  which_quarter sort(rand_weeks)[1]
          <dbl>              <time>
1             1 2017-01-05 05:46:32
2             2 2017-04-06 05:46:32
3             3 2016-08-18 05:46:32
4             4 2016-10-06 05:46:32
