How to remove all rows in R that have a specific month? - r

I have a data.frame where one column contains many different dates in Year-Month-Day format, and I would like to keep only the rows whose month is 12, i.e. December.
I tried two different codes:
First version:
IBES1985_1990[IBES1985_1990$`Forecast Period End Date, SAS Format` !=
month(1,2,3,4,5,6,7,8,9,10,11, )]
But here I get an error saying that undefined columns were selected.
Second version:
IBES1985_1990 <- IBES1985_1990 %>%
mutate(`Forecast Period End Date, SAS Format`= ifelse(month(`Forecast Period End Date, SAS Format`)
%in% c(1,2,3,4,5,6,7,8,9,10,11),NA,`Forecast Period End Date, SAS Format`))
Here I then wanted to delete all the rows containing NA, but the date format changed to plain numbers, and I couldn't change it back to check whether the non-December dates had already been deleted or not.
In summary, I would like code that deletes all rows that are not in December.

If your data looks like this
library(lubridate)
df <- data.frame(dates = seq.Date(ymd("2022-09-02"), ymd("2023-02-02"), "month"),
data = 1:6)
df
dates data
1 2022-09-02 1
2 2022-10-02 2
3 2022-11-02 3
4 2022-12-02 4
5 2023-01-02 5
6 2023-02-02 6
keep all December dates, e.g. by using strftime:
df[strftime(df$dates, format="%b") == "Dec", ]
dates data
4 2022-12-02 4
With dplyr you can do
library(dplyr)
df %>%
rowwise() %>%
summarize(dates = dates[strftime(dates, format="%b") == "Dec"], data)
# A tibble: 1 × 2
dates data
<date> <int>
1 2022-12-02 4
or, if you want to use lubridate's month()
library(dplyr)
library(lubridate)
df %>%
rowwise() %>%
summarize(dates = dates[month(dates) == 12], data)
# A tibble: 1 × 2
dates data
<date> <int>
1 2022-12-02 4
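A simpler route, if you only need the filtered rows, is to subset directly rather than going through rowwise()/summarize(). A base R sketch using the same example data; note that comparing the month number via format(..., "%m") avoids the locale dependence of the abbreviated name "Dec":

```r
# Same example data, built without lubridate
df <- data.frame(dates = seq.Date(as.Date("2022-09-02"), as.Date("2023-02-02"),
                                  by = "month"),
                 data = 1:6)

# "%m" gives the zero-padded month number ("12"), which works in any locale,
# unlike the abbreviated month name produced by "%b" ("Dec", "Dez", ...)
dec <- df[format(df$dates, "%m") == "12", ]
dec
#        dates data
# 4 2022-12-02    4
```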

Related

Filtering multiple time series values in R

I have a problem with a time series which I don't know how to solve.
I have a tibble with 4 different variables. In my real dataset there are over 10,000 documents.
document date author label
1 2018-04-05 Mr.X 1
2 2018-02-05 Mr.Y 0
3 2018-04-17 Mr.Z 1
So now my problem is that, as a first step, I want to count the articles that occur in a specific month of a specific year, for every month in my time series. I know that I can filter for a specific month in a year like this:
tibble%>%
filter(date > "2018-02-01" & date < "2018-02-28")
The result of this would be a tibble with one observation, but my problem is that I have 360 different time periods in my data. Can I write a function to solve this, or do I need to do 360 separate calculations?
The best solution for me would be a table with 360 different columns, where each column holds the number of articles counted in that month. Is this possible?
Thank you so much in advance.
If you want each result in a separate list element, you can do something like this:
suppressMessages(library(dplyr))
df %>% mutate(date = as.Date(date)) %>%
group_split(substr(date, 1, 7), .keep = FALSE)
<list_of<
tbl_df<
document: integer
date : date
author : character
label : integer
>
>[2]>
[[1]]
# A tibble: 1 x 4
document date author label
<int> <date> <chr> <int>
1 2 2018-02-05 Mr.Y 0
[[2]]
# A tibble: 2 x 4
document date author label
<int> <date> <chr> <int>
1 1 2018-04-05 Mr.X 1
2 3 2018-04-17 Mr.Z 1
You can further use list2env() to save each element of this list as a separate object.
To count the number of rows for each month-year combination, in the tidyverse you can do:
library(dplyr)
library(tidyr)
df %>%
mutate(date = as.Date(date),
year_mon = format(date, '%Y-%m')) %>%
select(year_mon) %>%
pivot_wider(names_from = year_mon, values_from = year_mon,
values_fn = length, values_fill = 0)
# `2018-04` `2018-02`
# <int> <int>
#1 2 1
and in base R:
df$date <- as.Date(df$date)
table(format(df$date, '%Y-%m'))
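If the wide 360-column layout is not a hard requirement, the long form of the same counts is usually easier to work with. A base R sketch on the example data:

```r
df <- data.frame(document = 1:3,
                 date = c("2018-04-05", "2018-02-05", "2018-04-17"),
                 author = c("Mr.X", "Mr.Y", "Mr.Z"),
                 label = c(1, 0, 1))

# table() counts rows per year-month; as.data.frame() turns the
# named counts into a tidy two-column frame
counts <- as.data.frame(table(year_mon = format(as.Date(df$date), "%Y-%m")),
                        responseName = "n")
counts
#   year_mon n
# 1  2018-02 1
# 2  2018-04 2
```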

Number of days spent in each STATE in r

I'm trying to calculate the number of days that a patient spent in a given state in R.
An image of example data is included below. I only have columns 1 to 3 and I want to get the answer in column 5. I am thinking that if I can create a date column in column 4 holding the first recorded date for each state, then I can subtract it from column 2 and get the days I am looking for.
I tried group_by(MRN, STATE), but the problem is that it groups the second set of 1's together with the first set of 1's (and likewise for the 2's), which is not what I want.
Use mdy_hm to convert OBS_DTM to POSIXct, then group_by MRN and the rleid of STATE so that the first set of 1's is handled separately from the second set. Use difftime to calculate the difference, in days, between each OBS_DTM and the minimum value in the group.
If your data is called data :
library(dplyr)
data %>%
mutate(OBS_DTM = lubridate::mdy_hm(OBS_DTM)) %>%
group_by(MRN, grp = data.table::rleid(STATE)) %>%
mutate(Answer = as.numeric(difftime(OBS_DTM, min(OBS_DTM),units = 'days'))) %>%
ungroup %>%
select(-grp) -> result
result
You could try the following:
library(dplyr)
df %>%
group_by(ID, State) %>%
mutate(priorObsDTM = lag(OBS_DTM)) %>%
filter(!is.na(priorObsDTM)) %>%
ungroup() %>%
mutate(Answer = as.numeric(OBS_DTM - priorObsDTM, units = 'days'))
The dataframe I used for this example:
df <- data.frame(
ID = 1,
OBS_DTM = as.POSIXlt(
c('2020-07-27 8:44', '2020-7-27 8:56', '2020-8-8 20:12',
'2020-8-14 10:13', '2020-8-15 13:32')
),
State = c(1, 1, 2, 2, 2),
stringsAsFactors = FALSE
)
df
# A tibble: 3 x 5
# ID OBS_DTM State priorObsDTM Answer
# <dbl> <dttm> <dbl> <dttm> <dbl>
# 1 1 2020-07-27 08:56:00 1 2020-07-27 08:44:00 0.00833
# 2 1 2020-08-14 10:13:00 2 2020-08-08 20:12:00 5.58
# 3 1 2020-08-15 13:32:00 2 2020-08-14 10:13:00 1.14
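The rleid() trick in the first answer can also be reproduced without data.table: base R's rle() gives each consecutive run of STATE its own group id, so a later run of the same state is not merged with an earlier one. A sketch on the example timestamps (single patient, so no MRN grouping needed):

```r
obs <- as.POSIXct(c("2020-07-27 08:44", "2020-07-27 08:56",
                    "2020-08-08 20:12", "2020-08-14 10:13",
                    "2020-08-15 13:32"), tz = "UTC")
state <- c(1, 1, 2, 2, 2)

# run ids: each consecutive run of the same state gets its own id,
# mirroring data.table::rleid(STATE)
r   <- rle(state)
grp <- rep(seq_along(r$lengths), times = r$lengths)

# days since the first observation of each run
ans <- ave(as.numeric(obs), grp, FUN = function(x) (x - min(x)) / 86400)
round(ans, 5)
# 0.00000 0.00833 0.00000 5.58403 6.72222
```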

Count date observations in a month

I have a dataframe containing daily prices of a stock exchange with corresponding dates for several years. These dates are tradingdates and is thus excluded weekends and holidays. Ex:
df$date <- as.Date(c("2017-03-30", "2017-03-31", "2017-04-03", "2017-04-04"))
I have used lubridate to extract a column containing the month of each date, but what I struggle with is creating a column that, for each month of every year, gives the trading-day number within that month. I.e., from the example, a counter that starts at 1 for 2017-04-03 (as this is the first observation of the month, not 3 even though it is the third calendar day of the month) and ends at the last observation of the month. The column would look like this:
df$DayofMonth <- c(22, 23, 1, 2)
and not
df$DayofMonth <- c(30, 31, 3, 4)
Is there anybody that can help me?
Maybe this helps:
library(data.table)
library(stringr)
df <- setDT(df)
df[, YearMonth := str_sub(Date, 1, 7)]
df[, DayofMonth := seq(.N), by = YearMonth]
This gives you a column called YearMonth with values like '2017-03'.
Then, within each group (month), every date gets an index, which in your case corresponds to the trading day.
As you can see, this yields 1 for the date '2017-04-03', since it is the first trading day of that month. This works as long as your df is sorted from earliest to latest date.
Another way is to extract the date components with lubridate and combine them with dplyr.
library(dplyr)
library(lubridate)
df <- data.frame(date = as.Date(c("2017-03-30", "2017-03-31", "2017-04-03", "2017-04-04")))
df %>%
mutate(month = month(date),
year = year(date),
day = day(date)) %>%
group_by(year, month) %>%
mutate(DayofMonth = row_number())
# A tibble: 4 x 5
# Groups: year, month [2]
date month year day DayofMonth
<date> <dbl> <dbl> <int> <int>
1 2017-03-30 3 2017 30 1
2 2017-03-31 3 2017 31 2
3 2017-04-03 4 2017 3 1
4 2017-04-04 4 2017 4 2
(row_number() counts observations within each year-month group, so it still works when trading days skip weekends; day - min(day) + 1 would not.)
You can try the following:
For each date find out the first day of that month.
Count how many working days are present between first_day_of_month and current date.
library(dplyr)
library(lubridate)
df %>%
mutate(first_day_of_month = floor_date(date, 'month'),
day_of_month = purrr::map2_dbl(first_day_of_month, date,
~sum(!weekdays(seq(.x, .y, by = 'day')) %in% c('Saturday', 'Sunday'))))
# date first_day_of_month day_of_month
#1 2017-03-30 2017-03-01 22
#2 2017-03-31 2017-03-01 23
#3 2017-04-03 2017-04-01 1
#4 2017-04-04 2017-04-01 2
You can drop the first_day_of_month column if not needed.
data
df <- data.frame(Date = as.Date(c("2017-03-30", "2017-03-31",
"2017-04-03", "2017-04-04")))

Dplyr doesn't respect groups when ranking data

Using the code below in dplyr 0.7.6, I try to calculate the rank of a variable for each day in a dataset, but dplyr doesn't account for the group_by(CREATIONDATE_DAY)
dates <- sample(seq(from=as.POSIXct("2019-03-12",tz="UTC"),to=as.POSIXct("2019-03-20",tz="UTC"),by = "day"),size = 100,replace=TRUE)
group <- sample(c("A","B","C"),100,TRUE)
df <- data.frame(CREATIONDATE_DAY = dates,GROUP = group)
# calculate the occurrences for each day and group
dfMod <- df %>% group_by(CREATIONDATE_DAY,GROUP) %>%
dplyr::summarise(COUNT = n()) %>% ungroup()
# Compute the rank by count for each day
dfMod <- dfMod %>% group_by(CREATIONDATE_DAY) %>%
mutate(rank = rank(-COUNT, ties.method ="min"))
But the rank values are calculated over the entire data set instead of per creation day. As seen in the image, the row with id 24 should be rank 1, since 4 is the highest value for 16.03.2019, and row 23 should be rank 2 for that day. Where is my mistake?
Edit: added desired output:
Edit #2: as MrFlick pointed out, I checked my dplyr version (0.7.6), and upgrading to the most current version fixed the issue for me.
There may be a conflict with another package. If you have lubridate loaded, try reversing the order in which you load lubridate and dplyr (I tried your example and got the right answer). Alternatively, you can still try:
dfMod <- dfMod %>% group_by(CREATIONDATE_DAY) %>% mutate(rank = row_number(desc(COUNT)))
> head(dfMod)
# A tibble: 6 x 4
# Groups: CREATIONDATE_DAY [2]
CREATIONDATE_DAY GROUP COUNT rank
<dttm> <fct> <int> <int>
1 2019-03-12 00:00:00 A 2 3
2 2019-03-12 00:00:00 B 5 1
3 2019-03-12 00:00:00 C 4 2
4 2019-03-13 00:00:00 A 4 1
5 2019-03-13 00:00:00 B 3 2
6 2019-03-13 00:00:00 C 2 3
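As a cross-check that does not depend on the dplyr version at all, the per-day rank can be computed with base R's ave(), which applies rank() within each day. A sketch with counts chosen to match the tibble above:

```r
dfMod <- data.frame(
  CREATIONDATE_DAY = rep(c("2019-03-12", "2019-03-13"), each = 3),
  GROUP = rep(c("A", "B", "C"), times = 2),
  COUNT = c(2, 5, 4, 4, 3, 2))

# rank(-COUNT) within each day; ties.method = "min" as in the question
dfMod$rank <- ave(-dfMod$COUNT, dfMod$CREATIONDATE_DAY,
                  FUN = function(x) rank(x, ties.method = "min"))
dfMod$rank
# 3 1 2 1 2 3
```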

Calculate largest value for multiple overlapping events in a specific range

I have multiple large data frames that capture events that last a certain amount of time. This example gives a simplified version of my data set
Data frame 1:
ID Days Date Value
1 10 80 30
1 10 85 30
2 20 75 20
2 10 80 20
3 5 90 30
Data frame 2:
ID Days Date Value
1 20 0 30
1 10 3 20
2 20 5 30
3 20 1 10
3 10 10 10
The same ID is used for the same person in all datasets
Days specifies the length of the event (if Days has the value 10 then the event lasts 10 days)
Date specifies the day on which the event starts. In this case, Date can be any number between 0 and 90 or 91 (the data represent days in a quarter).
Value is an attribute that is repeated for the number of Days specified. For example, for the first row in df1, the value 30 is repeated for 10 days starting from day 80.
What I am interested in is, for each ID in each data frame, the highest value on any single day. Keep in mind that multiple events can overlap, and their values then have to be summed.
The final data frame should look like this:
ID HighestValuedf1 HighestValuedf2
1 60 80
2 40 30
3 30 20
For example, for ID 1, three events overlapped and resulted in the highest value of 80 in data frame 2. There was no overlap between the events of df1 and df2 for ID 3, only an overlap within df2.
I would prefer a solution that avoids merging all data frames into one data frame because of the size of my files.
EDIT
I rearranged my data so that all events that overlap are in one data frame. I only need the highest overlap value for every data frame.
Code to reproduce the data frames:
ID = c(1,1,2,2,3)
Date = c(80,85,75,80,90)
Days = c(10,10,20,10,5)
Value = c(30,30,20,20,30)
df1 = data.frame(ID,Days, Date,Value)
ID = c(1,1,2,3,3)
Date = c(1,3,5,1,10)
Days = c(20,10,20,20,10 )
Value =c(30,20,30,10,10)
df2 = data.frame(ID,Days, Date,Value)
ID= c(1,2,3)
HighestValuedf1 = c(60,40,30)
HighestValuedf2 = c(80,30,20)
df3 = data.frame(ID, HighestValuedf1, HighestValuedf2)
I am interpreting highest value per day to mean highest value on a single day throughout the time period. This is probably not the most efficient solution, since I expect something can be done with map or apply functions, but I didn't see how on a first look. Using df1 and df2 as defined above:
EDIT: Modified code upon understanding that df1 and df2 are supposed to represent sequential quarters. I think the easiest way to handle this is simply to stack the data frames, so that anything that overlaps is caught automatically (i.e. day 1 of df2 is day 91 overall). You will probably need to adjust this code manually because of the different lengths of quarters, or, preferably, convert days-of-quarter into actual calendar dates (df1 day 1 becomes January 1st 2017, for example). The code below just rearranges the data to achieve this and then produces the desired results for each quarter by filtering on days 1:90 and 91:180, as shown.
ID = c(1,1,2,2,3)
Date = c(80,85,75,80,90)
Days = c(10,10,20,10,5)
Value = c(30,30,20,20,30)
df1 = data.frame(ID,Days, Date,Value)
ID = c(1,1,2,3,3)
Date = c(1,3,5,1,10)
Days = c(20,10,20,20,10 )
Value =c(30,20,30,10,10)
df2 = data.frame(ID,Days, Date,Value)
library(tidyverse)
#> -- Attaching packages --------------------------------------------------------------------- tidyverse 1.2.1 --
#> v ggplot2 2.2.1.9000 v purrr 0.2.4
#> v tibble 1.4.2 v dplyr 0.7.4
#> v tidyr 0.7.2 v stringr 1.2.0
#> v readr 1.1.1 v forcats 0.2.0
#> -- Conflicts ------------------------------------------------------------------------ tidyverse_conflicts() --
#> x dplyr::filter() masks stats::filter()
#> x dplyr::lag() masks stats::lag()
df2 <- df2 %>%
mutate(Date = Date + 90)
# Make a dataframe with complete set of day-ID combinations
df_completed <- df1 %>%
mutate(day = factor(Date, levels = 1:180)) %>% # set to total day length
complete(ID, day) %>%
mutate(daysum = 0) %>%
select(ID, day, daysum)
# Function to apply to each data frame containing events
# Should take each event and add value to the appropriate days
sum_df_daily <- function(df_complete, df){
for (i in 1:nrow(df)){
event_days <- seq(df[i, "Date"], df[i, "Date"] + df[i, "Days"] - 1)
df_complete <- df_complete %>%
mutate(
to_add = case_when(
ID == df[i, "ID"] & day %in% event_days ~ df[i, "Value"],
!(ID == df[i, "ID"] & day %in% event_days) ~ 0
),
daysum = daysum + to_add
)
}
return(df_complete)
}
df_filled <- df_completed %>%
sum_df_daily(df1) %>%
sum_df_daily(df2) %>%
mutate(
quarter = case_when(
day %in% 1:90 ~ "q1",
day %in% 91:180 ~ "q2"
)
)
df_filled %>%
group_by(quarter, ID) %>%
summarise(maxsum = max(daysum))
#> # A tibble: 6 x 3
#> # Groups: quarter [?]
#> quarter ID maxsum
#> <chr> <dbl> <dbl>
#> 1 q1 1.00 60.0
#> 2 q1 2.00 40.0
#> 3 q1 3.00 30.0
#> 4 q2 1.00 80.0
#> 5 q2 2.00 30.0
#> 6 q2 3.00 40.0
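The same stack-expand-sum idea can be sketched more compactly in base R: shift df2 onto days 91-180, expand every event into one row per covered day, sum overlapping values per ID and day, and take the per-ID maximum within each quarter. This reproduces the tibble above, including the 40 for ID 3 in q2 that comes from the cross-quarter overlap:

```r
df1 <- data.frame(ID = c(1, 1, 2, 2, 3), Days = c(10, 10, 20, 10, 5),
                  Date = c(80, 85, 75, 80, 90), Value = c(30, 30, 20, 20, 30))
df2 <- data.frame(ID = c(1, 1, 2, 3, 3), Days = c(20, 10, 20, 20, 10),
                  Date = c(1, 3, 5, 1, 10), Value = c(30, 20, 30, 10, 10))

df2$Date <- df2$Date + 90                  # day 1 of df2 becomes day 91 overall
events <- rbind(df1, df2)

# expand: one row per event-day, carrying the event's value
long <- do.call(rbind, lapply(seq_len(nrow(events)), function(i) {
  data.frame(ID    = events$ID[i],
             day   = seq(events$Date[i], length.out = events$Days[i]),
             Value = events$Value[i])
}))

daily <- aggregate(Value ~ ID + day, long, sum)  # overlapping events summed per day
daily$quarter <- ifelse(daily$day <= 90, "q1", "q2")
res <- aggregate(Value ~ quarter + ID, daily, max)
res
#   quarter ID Value
# 1      q1  1    60
# 2      q2  1    80
# 3      q1  2    40
# 4      q2  2    30
# 5      q1  3    30
# 6      q2  3    40
```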

Resources