I would like some help regarding a data frame transformation required for an analysis. My data consists of a large amount of individuals with all their historic employment. "EX" is a code representing the reason for ending employment. Something like this:
id Date_start Date_end EX
13 "2001-02-01" "2001-05-30" A
13 "2002-03-01" "2010-06-02" B
14 ... ...
...
So what I would like to do is to "fill in the gaps". This may not be easy but its even more difficult because I want it aggregated by id and each new row should have the EX value of the row before, like this:
id Date_start Date_end EX
13 "2001-02-01" "2001-05-30" A
13 "2001-05-31" "2002-02-28" A
13 "2002-03-01" "2010-06-02" B
14 ... ...
...
I believe the trick would be some kind of lag and aggregate but I'm totally lost.
This is a little bit tricky, and you can mainly utilize the dplyr package to do the manipulation and lubridate packages to convert the date format(you can use as.Date() for sure, but lubridate makes it easier).
library(dplyr)
library(lubridate)
1.Creating the sample data you provided.
names <- c("id", "Date_start", "Date_end", "EX")
row1 <- c(13 , "2001-02-01" , "2001-05-30" , "A")
row2 <- c(13 , "2002-03-01" , "2010-06-02" , "B")
testdata <- rbind(row1,row2) %>% data.frame(stringsAsFactors = F)
row.names(testdata) <- NULL
names(testdata) <- names
testdata$Date_start <- testdata$Date_start %>% as_date()
testdata$Date_end <- testdata$Date_end %>% as_date()
testdata
2.Creating a new data set that has the data you want to add.
id: we are using the same id value since it is grouping by id.
Date_start: we are creating the Date_start with a value if there is gap, otherwise "" (empty column, and we are filtering them out).
Date_end: Same logic for Date_end.
EX: we are using the second last EX value as you stated.
new_data <- test_data %>%
group_by(id) %>%
mutate(Date_start1 = ifelse(Date_start-lag(Date_end) == 1,0,lag(Date_end)+1),
Date_end1 = ifelse(Date_start-lag(Date_end) == 1,0,Date_start-1),
EX=first(EX)) %>%
filter(!Date_start1 ==0) %>%
select(id, Date_start=Date_start1,Date_end=Date_end1,EX) %>%
distinct() %>%
ungroup()
3.Since we want to fill the gap days, mutate made it into numeric value, and we are using as_date() from lubriate to convert it into date format.
new_data$Date_start <- as_date(new_data$Date_start)
new_data$Date_end <- as_date(new_data$Date_end)
4.Combine it with your sample data and arrange it by Date_state.
final <- rbind(testdata,new_data) %>% data.frame() %>% arrange(Date_start)
final
Your final result is as below.
Related
My objective is to check if a patient is using two drugs at the same date.
In the example, patient 1 is using drug A and drug B at the same date, but I want to extract it with code.
df <- data.frame(id = c(1,1,1,2,2,2),
date = c("2020-02-01","2020-02-01","2020-03-02","2019-10-02","2019-10-18","2019-10-26"),
drug_type = c("A","B","A","A","A","B"))
df$date <- as.factor(df$date)
df$drug_type <- as.factor(df$drug_type)
In order to do this, I firstly made date and drug type factor variables.
Next I used following code:
df %>%
mutate(lev_actdate = as.factor(actdate))%>%
filter(nlevels(drug_type)>1 & nlevels(date) < nrow(date))
But I failed. I assumed that if a patient is using two drugs at the same date, the number of levels in the date column will be less than its row number. However, now I don't know how to make it with code.
Additionally, I feel weird about following:
if I use nlevels(df$date), right result will be returned, but when I use df %>% nlevels(date), the error will be return with showing
"Error in nlevels(., df$date) : unused argument (df$date)"
Could you please tell me why this occurred and how can I fix it?
Thank you for your time.
You could use something like
library(dplyr)
df %>%
group_by(id, date) %>%
filter(n_distinct(drug_type) >= 2)
df %>% nlevels(date) is the same as nlevels(df, date) which is not the same as nlevels(df$date). Instead of the latter youcould try df %>% nlevels(.$date) or perhaps df %>% {nlevels(.$date)}.
Do you need something like this?
library(dplyr)
df %>%
group_by(date) %>%
distinct() %>%
summarise(drug_type_sum = toString(drug_type))
date drug_type_sum
<fct> <chr>
1 2019-10-02 A
2 2019-10-18 A
3 2019-10-26 B
4 2020-02-01 A, B
5 2020-03-02 A
I'm trying to create a new variable which equals the latest month's value minus the previous month's (or 3 months prior, etc.).
A quick df:
country <- c("XYZ", "XYZ", "XYZ")
my_dates <- c("2021-10-01", "2021-09-01", "2021-08-01")
var1 <- c(1, 2, 3)
df1 <- country %>% cbind(my_dates) %>% cbind(var1) %>% as.data.frame()
df1$my_dates <- as.Date(df1$my_dates)
df1$var1 <- as.numeric(df1$var1)
For example, I've tried (partially from: How to subtract months from a date in R?)
library(tidyverse)
df2 <- df1 %>%
mutate(dif_1month = var1[my_dates==max(my_dates)] -var1[my_dates==max(my_dates) %m-% months(1)]
I've also tried different variations of using lag():
df2 <- df1 %>%
mutate(dif_1month = var1[my_dates==max(my_dates)] - var1[my_dates==max(my_dates)-lag(max(my_dates), n=1L)])
Any suggestions on how to grab the value of a variable when dates equal the second latest observation?
Thanks for help, and apologies for not including any data. Can edit if necessary.
Edited with a few potential answers:
#this gives me the value of var1 of the latest date
df2 <- df1 %>%
mutate(value_1month = var1[my_dates==max(my_dates)])
#this gives me the date of the second latest date
df2 <- df1 %>%
mutate(month1 = max(my_dates) %m-%months(1))
#This gives me the second to latest value
df2 <- df1 %>%
mutate(var1_1month = var1[my_dates==max(my_dates) %m-%months(1)])
#This gives me the difference of the latest value and the second to last of var1
df2 <- df1 %>%
mutate(diff_1month = var1[my_dates==max(my_dates)] - var1[my_dates==max(my_dates) %m-%months(1)])
mutate requires the output to be of the same length as the number of rows of the original data. When we do the subsetting, the length is different. We may need ifelse or case_when
library(dplyr)
library(lubridate)
df1 %>%
mutate(diff_1month = case_when(my_dates==max(my_dates) ~
my_dates %m-% months(1)))
NOTE: Without a reproducible example, it is not clear about the column types and values
Based on the OP's update, we may do an arrange first, grab the last two 'val' and get the difference
df1 %>%
arrange(my_dates) %>%
mutate(dif_1month = diff(tail(var1, 2)))
. my_dates var1 dif_1month
1 XYZ 2021-08-01 3 -1
2 XYZ 2021-09-01 2 -1
3 XYZ 2021-10-01 1 -1
I have a dataframe that has one column which contains json data. I want to extract some attributes from this json data into named columns of the data frame.
Sample data
json_col = c('{"name":"john"}','{"name":"doe","points": 10}', '{"name":"jane", "points": 20}')
id = c(1,2,3)
df <- data.frame(id, json_col)
I was able to achieve this using
library(tidyverse)
library(jsonlite)
extract_json_attr <- function(from, attr, default=NA) {
value <- from %>%
as.character() %>%
jsonlite::fromJSON(txt = .) %>%
.[attr]
return(ifelse(is.null(value[[1]]), default, value[[1]]))
}
df <- df %>%
rowwise() %>%
mutate(name = extract_json_attr(json_col, "name"),
points = extract_json_attr(json_col, "points", 0))
In this case the extract_json_attr needs to parse the json column multiple times for each attribute to be extracted.
Is there a better way to extract all attributes at one shot?
I tried this function to return multiple values as a list, but I am not able to use it with mutate to set multiple columns.
extract_multiple <- function(from, attributes){
values <- from %>%
as.character() %>%
jsonlite::fromJSON(txt = .) %>%
.[attributes]
return (values)
}
I am able to extract the desired values using this function
extract_multiple(df$json_col[1],c('name','points'))
extract_multiple(df$json_col[2],c('name','points'))
But cannot apply this to set multiple columns in a single go. Is there a better way to do this efficiently?
Here is one way using bind_rows from dplyr
dplyr::bind_rows(lapply(as.character(df$json_col), jsonlite::fromJSON))
# A tibble: 3 x 2
# name points
# <chr> <int>
#1 john NA
#2 doe 10
#3 jane 20
To subset specific attribute from the function, we can do
bind_rows(lapply(as.character(df$json_col), function(x)
jsonlite::fromJSON(x)[c('name', 'points')]))
On the R4DS slack channel I received an alternative approach for handling json arrays as columns. Using that, I found another approach that seems to work better on larger datasets.
library(tidyverse)
library(jsonlite)
extract <- function(input, fields){
json_df <- fromJSON(txt=input)
missing <- setdiff(fields, names(json_df))
json_df[missing] <- NA
return (json_df %>% select(fields))
}
df <- data.frame(id=c(1,2,3),
json_col=c('{"name":"john"}','{"name":"doe","points": 10}', '{"name":"jane", "points": 20}'),
stringsAsFactors=FALSE)
df %>%
mutate(json_col = paste0('[',json_col,']'),
json_col = map(json_col, function(x) extract(input=x, fields=c('name', 'points')))) %>%
unnest(cols=c(json_col))
I have a data frame like DF below which will be imported directly from the database (as tibble).
library(tidyverse)
library(lubridate)
date_until <- dmy("31.05.2019")
date_val <- dmy("30.06.2018")
DF <- data.frame( date_bal = as.Date(c("2018-04-30", "2018-05-31", "2018-06-30", "2018-05-31", "2018-06-30")),
department = c("A","A","A","B","B"),
amount = c(10,20,30,40,50)
)
DF <- DF %>%
as_tibble()
DF
It represents the amount of money spent by each department in a specific month. My task is to project how much money will be spent by each department in the following months until a specified date in the future (in this case date_until=31.05.2019)
I would like to use tidyverse in order to generate additional rows for each department where the first column date_bal would be a sequence of dates from the last one from "original" DF up until date_until which is predefined. Then I would like to add additional column called "DIFF" which would represent the difference between DATE_BAL and DATE_VAL, where DATE_VAL is also predefined. My final result would look like this:
Final result
I have managed to do this in the following way:
first filter data from DF for department A
Create another DF2 by populating it with date sequence from min(dat_bal) to date_until from 1.
Merge data frames from 1. and 2. and then add calculated columns using mutate
Since I will have to repeat this procedure for many departments I wonder if it's possible to add rows (create date sequence) in existing DF (without creating a second DF and then merging).
Thanks in advance for your help and time.
I add one day to the dates, create a sequence and then rollback to the last day of the previous month.
seq(min(date_val + days(1)), date_until + days(1), by = 'months')[-1] %>%
rollback() %>%
tibble(date_bal = .) %>%
crossing(DF %>% distinct(department)) %>%
bind_rows(DF %>% select(date_bal, department)) %>%
left_join(DF) %>%
arrange(department, date_bal) %>%
mutate(
amount = if_else(is.na(amount), 0, amount),
DIFF = interval(
rollback(date_val, roll_to_first = TRUE),
rollback(date_bal, roll_to_first = TRUE)) %/% months(1)
)
I've got a data frame (df1) with an ID variable and two date variables (dat1 and dat2).
I'd like to subset the data frame so that I get the observations for which the difference between dat2 and dat1 is less than or equal to 30 days.
I'm trying to use dplyr() but I can't get it to work.
Any help would be much appreciated.
Starting point (df):
df1 <- data.frame(ID=c("a","b","c","d","e","f"),dat1=c("01/05/2017","01/05/2017","01/05/2017","01/05/2017","01/05/2017","01/05/2017"),dat2=c("14/05/2017","05/06/2017","23/05/2017","15/10/2017","15/11/2017","15/12/2017"), stringsAsFactors = FALSE)
Desired outcome (df):
dfgoal <- data.frame(ID=c("a","c"),dat1=c("01/05/2017","01/05/2017"),dat2=c("14/05/2017","23/05/2017"),newvar=c(13,22))
Current code:
library(dplyr)
df2 <- df1 %>% mutate(newvar = as.Date(dat2) - as.Date(dat1)) %>%
filter(newvar <= 30)
We need to convert to Date class before doing the subtraction
library(dplyr)
library(lubridate)
df1 %>%
mutate_at(2:3, dmy) %>%
mutate(newvar = as.numeric(dat2- dat1)) %>%
filter(newvar <=30)
The as.Date also needs to include the format argument, otherwise, it will think that the format is in the accepted %Y-%m-%d. Here, it is in %d/%m/%Y
df1 %>%
mutate(newvar = as.numeric(as.Date(dat2, "%d/%m/%Y") - as.Date(dat1, "%d/%m/%Y"))) %>%
filter(newvar <= 30)
# ID dat1 dat2 newvar
#1 a 01/05/2017 14/05/2017 13
#2 c 01/05/2017 23/05/2017 22