I have a dataset that looks like this:
group id date1 date2 date3 date4
1 1 1 1991-10-14 1992-05-20 1992-12-09 1993-06-30
2 1 2 <NA> 1992-05-21 1992-12-10 1993-06-29
3 1 3 <NA> <NA> 1992-12-08 1993-06-29
4 1 4 1991-10-14 1992-05-19 <NA> <NA>
5 1 5 1991-10-15 1992-05-21 <NA> 1993-06-30
6 1 6 1991-10-15 <NA> <NA> 1993-06-30
Here the data is in R format:
structure(list(group = c(1L, 1L, 1L, 1L, 1L, 1L),
id = 1:6,
date1 = structure(c(7956, NA, NA, 7956, 7957, 7957), class = "Date"),
date2 = structure(c(8175, 8176, NA, 8174, 8176, NA), class = "Date"),
date3 = structure(c(8378, 8379, 8377, NA, NA, NA), class = "Date"),
date4 = structure(c(8581, 8580, 8580, NA, 8581, 8581), class = "Date")),
.Names = c("group", "id", "date1", "date2", "date3", "date4"),
row.names = c(NA, 6L), class = "data.frame")
That is, we have a grouping variable, several individuals and four possible dates of interest.
Now I want to construct a linear monthly time trend for each individual from this. In other words, I am trying to construct a trend with value 1 on the first non-NA date. After that, the trend for each remaining non-NA period is the number of months passed since the first non-NA date.
My goal is this structure (individual 1, group 1):
group id period trend
1 1 1 1 1
2 1 1 2 8
3 1 1 3 15
4 1 1 4 21
That is, a molten data.frame with the months passed since t = 1.
I've played around with the ideas from this thread: Number of months between two dates. However, I can't find a solution that does not involve a for-loop and an excruciating number of if-statements.
Any help appreciated!
Here is one potential solution using dplyr and tidyr:
library(dplyr)
library(tidyr)
library(stringr)
df %>%
gather(period, date, -group, -id) %>%
arrange(group, id, period) %>%
mutate(date = as.Date(date)) %>%
group_by(group, id) %>%
filter(!all(is.na(date))) %>%
mutate(
trend = as.integer(
floor(difftime(date, date[which.max(!is.na(date))], units = 'days') / 30)
) + 1,
period = str_replace(period, 'date', '')
) %>%
select(-date)
Output is as follows:
# A tibble: 24 x 4
# Groups: group, id [6]
group id period trend
<int> <int> <chr> <dbl>
1 1 1 1 1
2 1 1 2 8
3 1 1 3 15
4 1 1 4 21
5 1 2 1 NA
6 1 2 2 1
7 1 2 3 7
8 1 2 4 14
9 1 3 1 NA
10 1 3 2 NA
# ... with 14 more rows
NOTE: Edited to add a filter that drops cases where ALL dates are NA for a given group / id. Otherwise, which.max has no non-NA date to select as the starting point.
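Dividing the day difference by 30 only approximates a month. As a hedged alternative, here is a minimal base R sketch that counts calendar months instead (year * 12 + month), ignoring the day of the month. The helper name months_since_first is made up for illustration, and its results can differ from the 30-day rule for other individuals.

```r
# Hypothetical helper: calendar months since the first non-NA date, plus 1.
# Assumption: the day of the month is ignored entirely.
months_since_first <- function(d) {
  lt <- as.POSIXlt(d)
  idx <- lt$year * 12L + lt$mon          # running month index per date
  first <- idx[which.max(!is.na(d))]     # index of the first non-NA date
  as.integer(idx - first) + 1L
}

# Individual 1 from the question
d <- as.Date(c("1991-10-14", "1992-05-20", "1992-12-09", "1993-06-30"))
months_since_first(d)
# [1]  1  8 15 21
```

Applied per individual (for example inside the grouped mutate above), NA dates simply stay NA.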
data.table approach
I leave the rounding and/or adding +1 to you; this is always tricky with months. I personally try to avoid this and calculate with days or weeks (or just about anything BUT months).
library(data.table)
library(lubridate)  # provides interval() and months() for numeric input
# Reshape to long format: one row per group / id / date column
dt <- melt(as.data.table(df), id.vars = c("group", "id"), variable.name = "date_id", value.name = "date")
setkey(dt, id, group, date_id)
# Months elapsed since the earliest non-NA date of each individual
dt[, diff := lubridate::interval(date[which.min(date)], date) / months(1), by = c("group", "id")]
head(dt)
# group id date_id date diff
# 1: 1 1 date1 1991-10-14 0.000000
# 2: 1 1 date2 1992-05-20 7.193548
# 3: 1 1 date3 1992-12-09 13.833333
# 4: 1 1 date4 1993-06-30 20.533333
# 5: 1 2 date1 <NA> NA
# 6: 1 2 date2 1992-05-21 0.000000
I want to delete the rows of any id that doesn't appear in EVERY month of the month column.
The dataset is as follows:
id month
1 01
2 01
3 01
1 02
2 02
1 03
2 03
I want to delete id = 3 from the dataset, since it doesn't appear in month = 02.
I'm using R.
Thank you for helping.
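You can split the ids by month and use Reduce with intersect to find the ids that occur in every month, i.e.
keep <- Reduce(intersect, split(df$id, df$month))
df[df$id %in% keep,]
id month
1 1 1
2 2 1
4 1 2
5 2 2
6 1 3
7 2 3
(Chaining setdiff instead would only flag ids from the first month that are missing from every later month, so an id that skips a single month could slip through.)
As @jay.sf mentioned, you need to assign it back to your dataframe,
df <- df[df$id %in% keep,]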
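As a cross-check of the "appears in every month" rule, here is a minimal base R sketch that counts the distinct months per id. The data frame df_demo and its values are made up for illustration: id 3 skips only month 2 and reappears in month 3, which is the case a filter has to catch.

```r
# Hypothetical data: id 3 is present in months 1 and 3 but not in month 2
df_demo <- data.frame(id    = c(1, 2, 3, 1, 2, 1, 2, 3),
                      month = c(1, 1, 1, 2, 2, 3, 3, 3))

n_months <- length(unique(df_demo$month))

# Number of distinct months in which each id occurs
months_per_id <- tapply(df_demo$month, df_demo$id,
                        function(m) length(unique(m)))

# Keep only ids that occur in all months
keep_ids <- names(months_per_id)[months_per_id == n_months]
res <- df_demo[as.character(df_demo$id) %in% keep_ids, ]
res
```

Only the rows for ids 1 and 2 remain.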
Using dplyr
library(dplyr)
df %>%
group_by(id) %>%
filter(n_distinct(month) == n_distinct(df$month)) %>%
ungroup
-output
# A tibble: 6 × 2
id month
<int> <int>
1 1 1
2 2 1
3 1 2
4 2 2
5 1 3
6 2 3
Or using data.table
library(data.table)
# Compare each id's month count with the month count of the whole table
data_hh[, if (uniqueN(month) == uniqueN(data_hh$month)) .SD, by = id]
data
data_hh <- structure(list(id = c(18354L, 18815L, 19014L, 63960L, 72996L,
73930L), month = c(1, 1, 1, 1, 1, 1), value = c(113.33, 251.19,
160.15, 278.8, 254.39, 733.22), x1 = c(96.75, 186.78, 106.02,
195.23, 184.57, 473.92), x2 = c(1799.1, 5399.1, 1799.1, 1349.1,
2924.1, 2024.1), x3 = c(85.37, 74.36, 66.2, 70.02, 72.55, 64.63
), x4 = c(6.29, 4.65, 8.9, 20.66, 8.69, 36.22)), row.names = c(NA,
-6L), class = c("data.table", "data.frame"))
Let's say I have the following data frame:
dt <- data.frame(id= c(1),
parameter= c("a","b","c"),
start_day = c(1,8,4),
end_day = c(16,NA,30))
I need to combine the start_day and end_day columns (let's call the new column day) such that I preserve all the other columns. I also need to create another column that indicates whether each row shows start_day or end_day. To clarify, I am looking to create the following data frame
I am creating the above data frame using the following code:
library(dplyr)  # for %>%, rename() and bind_rows()
dt1 <- subset(dt, select = -c(end_day))
dt1 <- dt1 %>% rename(day = start_day)
dt1$start <- 1
dt2 <- subset(dt, select = -c(start_day))
dt2 <- dt2 %>% rename(day = end_day)
dt2$end <- 1
dt <- bind_rows(dt1, dt2)
dt <- dt[order(dt$id, dt$parameter),]
Although my code works, I am not happy with my solution. I am certain that there is a better and cleaner way to do this. I would appreciate any input on better alternatives for tackling this problem.
(tidyr::pivot_longer(dt, cols = c(start_day, end_day), values_to = "day")
|> dplyr::mutate(start = ifelse(name == "start_day", 1, NA),
end = ifelse(name == "end_day", 1, NA))
)
Result:
# A tibble: 6 × 6
id parameter name day start end
<dbl> <chr> <chr> <dbl> <dbl> <dbl>
1 1 a start_day 1 1 NA
2 1 a end_day 16 NA 1
3 1 b start_day 8 1 NA
4 1 b end_day NA NA 1
5 1 c start_day 4 1 NA
6 1 c end_day 30 NA 1
You could get rid of the name column, but maybe it would be more useful than your new start/end columns?
Using base R (faster than data.table up to ~300 rows; faster than tidyr up to ~1k rows):
cbind(dt[1:2], day = c(dt$start_day,dt$end_day)) |>
(\(x) x[order(x$id, x$parameter),])() |>
(`[[<-`)("start", value = c(1, NA)) |>
(`[[<-`)("end", value = c(NA, 1))
id parameter day start end
1 1 a 1 1 NA
4 1 a 16 NA 1
2 1 b 8 1 NA
5 1 b NA NA 1
3 1 c 4 1 NA
6 1 c 30 NA 1
Using the data.table package (faster than tidyr up to ~500k rows):
library(data.table)
dt <- as.data.table(dt)
dt[,.(day = c(start_day, end_day),
start = rep(c(1, NA), .N),
end = rep(c(NA, 1), .N)),
by = .(id, parameter)]
id parameter day start end
1: 1 a 1 1 NA
2: 1 a 16 NA 1
3: 1 b 8 1 NA
4: 1 b NA NA 1
5: 1 c 4 1 NA
6: 1 c 30 NA 1
I have a data set set up like the example below:
Name df Value
A 1 .5
A 2 2
A 3 3
B 1 1
B 2 .5
I would like to get the difference between the values until the Name column changes; then I would like it to stop and start computing the differences afresh. Like below:
Name df Value Diff
A 1 .5 NA
A 2 2 1.5
A 3 3 2.5
B 1 1 NA
B 2 .5 -.5
Is there any way I can do this? I have tried making the data set into wide format but I cannot figure out a way to make that work either.
One option is a grouped diff:
library(dplyr)
df1 %>%
group_by(Name) %>%
mutate(Diff = c(NA, cumsum(diff(Value))))
# A tibble: 5 x 4
# Groups: Name [2]
# Name df Value Diff
# <chr> <int> <dbl> <dbl>
#1 A 1 0.5 NA
#2 A 2 2 1.5
#3 A 3 3 2.5
#4 B 1 1 NA
#5 B 2 0.5 -0.5
data
df1 <- structure(list(Name = c("A", "A", "A", "B", "B"), df = c(1L,
2L, 3L, 1L, 2L), Value = c(0.5, 2, 3, 1, 0.5)),
class = "data.frame", row.names = c(NA,
-5L))
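Since c(NA, cumsum(diff(Value))) is just each value minus the first value of its group, the same result can be sketched in base R with ave. This is a minimal sketch on the sample data above, not a replacement for the dplyr version:

```r
df1 <- data.frame(Name  = c("A", "A", "A", "B", "B"),
                  df    = c(1L, 2L, 3L, 1L, 2L),
                  Value = c(0.5, 2, 3, 1, 0.5))

# Within each Name: NA for the first row, then the difference to the first Value
df1$Diff <- ave(df1$Value, df1$Name,
                FUN = function(v) c(NA, v[-1] - v[1]))
df1$Diff
# [1]   NA  1.5  2.5   NA -0.5
```

Groups with a single row simply get NA.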
@akrun's answer is the way to go, but just as a riddle, this works too:
df1 %>%
group_by(Name) %>%
mutate(Diff = cumsum(Value - lag(Value, default = Value[1])))
# # A tibble: 5 x 4
# # Groups: Name [2]
# Name df Value Diff
# <chr> <int> <dbl> <dbl>
# 1 A 1 0.5 0
# 2 A 2 2 1.5
# 3 A 3 3 2.5
# 4 B 1 1 0
# 5 B 2 0.5 -0.5
I have data in the below format
name date x y z
a March-2018 1 2 2
a Feb-2018 2 3 3
b June-2017 3 4 4
b April-2017 4 5 5
c Sep-2018 5 5 6
c Aug-2017 7 7 8
Need to select the name and other columns based on latest Month as below.
name date x y z
a March-2018 1 2 2
b June-2017 3 4 4
c Sep-2018 5 5 6
I tried with distinct names and selecting the max date, but it didn't work out.
We convert the date column to an actual Date column by pasting an arbitrary day ("01") in front, then group_by name and keep the row with the max date.
library(dplyr)
df %>%
mutate(newdate = as.Date(paste0("01-", date), "%d-%b-%Y")) %>%
group_by(name) %>%
slice(which.max(newdate)) %>%
select(-newdate)
# name date x y z
# <fct> <fct> <int> <int> <int>
#1 a March-2018 1 2 2
#2 b June-2017 3 4 4
#3 c Sep-2018 5 5 6
A base R option using ave: we first convert the dates, then get the max date by group (name) and subset the matching rows from the original dataframe.
df$new_date <- as.Date(paste0("01-", df$date), "%d-%b-%Y")
# I was trying to use which.max instead of max, but it was giving me an error; not sure why
df[with(df, new_date %in% ave(new_date, name, FUN = max)), ]
# name date x y z new_date
#1 a March-2018 1 2 2 2018-03-01
#3 b June-2017 3 4 4 2017-06-01
#5 c Sep-2018 5 5 6 2018-09-01
Note - As mentioned by @IceCreamToucan, the ave method works here because each name has a different max date. If the same date occurs under several names, it can give different results, since we are using %in% here.
Using tidyverse you can do:
df %>%
mutate(temp = match(gsub("-.*$", "", date), month.abb),
temp2 = ifelse(is.na(temp), match(gsub("-.*$", "", date), month.name), temp)) %>%
group_by(name) %>%
filter(temp2 == max(temp2)) %>%
select(-starts_with("temp"))
name date x y z
<fct> <fct> <int> <int> <int>
1 a March-2018 1 2 2
2 b June-2017 3 4 4
3 c Sep-2018 5 5 6
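First, it extracts the month name from "date" and assigns a number to abbreviated month names, with January being 1 and December being 12. Second, it does the same for non-abbreviated month names. Third, it keeps the rows per group with the highest month number. Finally, it removes the temporary variables. Note that only the month number is compared and the year is ignored, so this works for the example data but would pick the wrong row for a name whose latest date falls in an earlier calendar month of a later year.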
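To take the year into account as well, and to avoid depending on the locale when parsing month names, one could build a combined year-month key by hand. The following is a minimal base R sketch; df_demo, key and latest are illustrative names, and it assumes English month names whose first three letters match month.abb.

```r
df_demo <- data.frame(
  name = c("a", "a", "b", "b", "c", "c"),
  date = c("March-2018", "Feb-2018", "June-2017",
           "April-2017", "Sep-2018", "Aug-2017"),
  x    = c(1L, 2L, 3L, 4L, 5L, 7L),
  stringsAsFactors = FALSE
)

# Split "Month-Year" into its two parts
mon_txt <- sub("-.*$", "", df_demo$date)
yr      <- as.integer(sub("^.*-", "", df_demo$date))

# First three letters are enough to identify the month
mon <- match(substr(mon_txt, 1, 3), month.abb)

# Comparable key: later year or later month gives a larger value
key <- yr * 12L + mon

# Keep the row(s) holding the largest key within each name
latest <- df_demo[key == ave(key, df_demo$name, FUN = max), ]
latest
```

For the sample data this keeps March-2018, June-2017 and Sep-2018, the same rows as the other answers.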
Below is a roundabout way of replicating group_by and slice in base using split and lapply with [.
do.call(rbind,
lapply(split(df, df$name),
function(x) x[which.max(as.Date(paste0("01-", x$date), "%d-%b-%Y")),])
)
# name date x y z
# a a March-2018 1 2 2
# b b June-2017 3 4 4
# c c Sep-2018 5 5 6
Another option is to aggregate and then merge. There may be some other simple way to do this in base that I'm missing.
to.keep <-
aggregate(date ~ name, data = df,
function(x) x[which.max(as.Date(paste0("01-", x), "%d-%b-%Y"))])
merge(df, to.keep, by = names(to.keep))
# name date x y z
# 1 a March-2018 1 2 2
# 2 b June-2017 3 4 4
# 3 c Sep-2018 5 5 6
Data used
structure(list(name = c("a", "a", "b", "b", "c", "c"), date = c("March-2018",
"Feb-2018", "June-2017", "April-2017", "Sep-2018", "Aug-2017"
), x = c(1L, 2L, 3L, 4L, 5L, 7L), y = c(2L, 3L, 4L, 5L, 5L, 7L
), z = c(2L, 3L, 4L, 5L, 6L, 8L)), row.names = c(NA, -6L), class = "data.frame")
I have the following situation.
library(lubridate)  # needed for thisDate - months(1) below
df <- rbind(
data.frame(thisDate = rep(seq(as.Date("2018-1-1"), as.Date("2018-1-2"), by="day")) ),
data.frame(thisDate = rep(seq(as.Date("2018-2-1"), as.Date("2018-2-2"), by="day")) ))
df <- cbind(df,lastMonth = as.Date(format(as.Date(df$thisDate - months(1)),"%Y-%m-01")))
df <- cbind(df, prod1Quantity= seq(1:4) )
I have quantities for different days of a month for an unknown number of products. I want one column per product holding that product's total quantity for the whole previous month, i.e. grouped by lastMonth, as in the desired output below. I just don't get how to group by, mutate and summarise dynamically, if that is indeed the right approach.
I came across "data.table generate multiple columns and summarize them". It appears to do what I need, but I just don't get how it works!
Desired Output
thisDate lastMonth prod1Quantity prod1prevMonth
1 2018-01-01 2017-12-01 1 NA
2 2018-01-02 2017-12-01 2 NA
3 2018-02-01 2018-01-01 3 3
4 2018-02-02 2018-01-01 4 3
Another approach could be
library(dplyr)
library(lubridate)
temp_df <- df %>%
mutate(thisDate_forJoin = as.Date(format(thisDate,"%Y-%m-01")))
final_df <- temp_df %>%
mutate(thisDate_forJoin = thisDate_forJoin %m-% months(1)) %>%
left_join(temp_df %>%
group_by(thisDate_forJoin) %>%
summarise_if(is.numeric, sum),
by="thisDate_forJoin") %>%
select(-thisDate_forJoin)
Output is:
thisDate prod1Quantity.x prod2Quantity.x prod1Quantity.y prod2Quantity.y
1 2018-01-01 1 10 NA NA
2 2018-01-02 2 11 NA NA
3 2018-02-01 3 12 3 21
4 2018-02-02 4 13 3 21
Sample data:
df <- structure(list(thisDate = structure(c(17532, 17533, 17563, 17564
), class = "Date"), prod1Quantity = 1:4, prod2Quantity = 10:13), class = "data.frame", row.names = c(NA,
-4L))
# thisDate prod1Quantity prod2Quantity
#1 2018-01-01 1 10
#2 2018-01-02 2 11
#3 2018-02-01 3 12
#4 2018-02-02 4 13
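A solution can be reached by calculating the month-wise production quantity and then joining the month of lastMonth to the month of thisDate.
The lubridate::month function is used to extract the month from a date. Because only the month number serves as the join key, the same calendar month in different years would be pooled together.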
library(dplyr)
library(lubridate)
df %>% group_by(month = as.integer(month(thisDate))) %>%
summarise(prodQuantMonth = sum(prod1Quantity)) %>%
right_join(., mutate(df, prevMonth = month(lastMonth)), by=c("month" = "prevMonth")) %>%
select(thisDate, lastMonth, prod1Quantity, prodQuantLastMonth = prodQuantMonth)
# # A tibble: 4 x 4
# thisDate lastMonth prod1Quantity prodQuantLastMonth
# <date> <date> <int> <int>
# 1 2018-01-01 2017-12-01 1 NA
# 2 2018-01-02 2017-12-01 2 NA
# 3 2018-02-01 2018-01-01 3 3
# 4 2018-02-02 2018-01-01 4 3
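For an unknown number of product columns, and for data spanning several years, the same idea can be sketched in base R with a running month index (year * 12 + month) as the key. This is a minimal sketch under assumptions: product columns are assumed to end in "Quantity", and the names df_demo, month_idx and the "prevMonth" suffix are illustrative only.

```r
df_demo <- data.frame(
  thisDate      = as.Date(c("2018-01-01", "2018-01-02",
                            "2018-02-01", "2018-02-02")),
  prod1Quantity = 1:4,
  prod2Quantity = 10:13
)

# Running month index, so December 2017 and January 2018 are neighbours
lt        <- as.POSIXlt(df_demo$thisDate)
month_idx <- lt$year * 12L + lt$mon

# Assumption: every product column ends in "Quantity"
prod_cols <- grep("Quantity$", names(df_demo), value = TRUE)

# Monthly totals for all product columns at once
totals <- aggregate(df_demo[prod_cols],
                    by = list(month_idx = month_idx), FUN = sum)

# Look up the totals of the previous month for every row (NA if absent)
prev <- totals[match(month_idx - 1L, totals$month_idx), prod_cols, drop = FALSE]
names(prev)    <- sub("Quantity$", "prevMonth", prod_cols)
rownames(prev) <- NULL

result <- cbind(df_demo, prev)
result
```

Rows from January get NA because the data contain no December totals, while the February rows get the January sums (3 for prod1 and 21 for prod2).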