R count difference in days based on column value

I'm trying to count the difference in dates within a single column, based on another column's value.
This is the result I'm looking for

Try this
library('dplyr')
df <- data.frame(id = c(1, 2, 3, 1, 2, 3),
                 Date = c('1/1/2020', '1/3/2020', '1/1/2020', '1/7/2020', '1/6/2020', '1/5/2020'))
df %>% mutate(Date = as.Date(Date, format = '%m/%d/%Y')) %>%
  group_by(id) %>%
  mutate(DIFF = Date - lag(Date))

Here is a way using dplyr and lubridate (needed to make the dates behave when subtracting). It looks like you want the calculation to determine the number of days between the dates in a group by ID and the earliest date for that ID.
library(dplyr)
library(lubridate)
df %>%
mutate(Date = dmy(Date)) %>%
group_by(ID) %>%
mutate(Diff = Date - min(Date))
If you want to have NA instead of 0, you can do the following:
df %>%
mutate(Date = dmy(Date)) %>%
group_by(ID) %>%
mutate(Diff = if_else(Date == min(Date), NA_real_, as.numeric(Date - min(Date)))) # as.numeric() so both branches share one type
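The two answers compute different things: days since the previous row for the same id, versus days since the earliest date for that id. A minimal base R sketch of both, assuming the month/day/year dates from the first answer and rows already ordered by date within each id (the column names days_since_first and days_since_prev are made up for illustration):

```r
df <- data.frame(
  id = c(1, 2, 3, 1, 2, 3),
  Date = c("1/1/2020", "1/3/2020", "1/1/2020", "1/7/2020", "1/6/2020", "1/5/2020")
)
df$Date <- as.Date(df$Date, format = "%m/%d/%Y")

# Work on the numeric day count so that ave() returns plain numbers
days <- as.numeric(df$Date)

# Days since the earliest date within each id
df$days_since_first <- ave(days, df$id, FUN = function(x) x - min(x))

# Days since the previous row within each id (NA for the first row of an id)
df$days_since_prev <- ave(days, df$id, FUN = function(x) c(NA, diff(x)))

print(df)
```

With only two rows per id the two columns differ only in the first row (0 versus NA); with three or more rows per id they diverge further.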

Related

complete() in R for 6-month dates

I have data in 6-month intervals (ID, 6-month-start-date, outcome value), but for some IDs, there are half years where the outcome is missing. Simplified example:
id = c("aa", "aa", "ab", "ab", "ab")
date = as.Date(c("2021-07-01", "2022-07-01", "2021-07-01", "2022-01-01", "2022-07-01"))
col3 = c(1,2,1,2,1)
df <- data.frame(id, date, col3)
For similar datasets where the date is monthly, I used complete(date = seq.Date(start_date, end_date, by = "month")) to fill in the missing months and set the outcome field in the third column to 0.
I could do the following and expand the data to monthly, then create a new 6-month-start-date column, group by it and ID, and sum col3.
df_complete <- df %>% group_by(id) %>%
complete(date = seq.Date(as.Date(min(date)), as.Date(max(date) %m+% months(5)), by="month")) %>%
mutate (col3 = replace_na(col3, 0))
df_complete_6mth <- df_complete %>% mutate(
halfyear = ifelse(as.integer(format(date, '%m')) <= 6,
paste0(format(date, '%Y'), '-01-01'),
paste0(format(date, '%Y'), '-07-01'))) %>%
group_by(id, halfyear) %>%
summarise(col3_halfyear = sum(col3))
However, is there a solution where the "by =" argument specifies 6 months? I tried
df_complete <- df %>% group_by(id) %>%
complete(date = seq.Date(as.Date(min(date)), as.Date(max(date) %m+% months(5)), by="months(6)")) %>%
mutate (col3 = replace_na(col3, 0))
but it didn't work.
From the help for seq.Date:
by can be specified in several ways.
A number, taken to be in days.
A object of class difftime
A character string, containing one of "day", "week", "month",
"quarter" or "year". This can optionally be preceded by a (positive or
negative) integer and a space, or followed by "s".
So I expect you want:
library(dplyr); library(tidyr)
df %>%
  group_by(id) %>%
  complete(date = seq.Date(min(date), max(date), by = "6 month"),
           fill = list(col3 = 0))
Could you do something like this? You make a sequence of dates by month and then take every sixth one, starting from the first.
library(lubridate)
dates <- seq(mdy("01-01-2020"), mdy("01-01-2023"), by="month")
dates[seq(1, length(dates), by=6)]
#> [1] "2020-01-01" "2020-07-01" "2021-01-01" "2021-07-01" "2022-01-01"
#> [6] "2022-07-01" "2023-01-01"
Created on 2023-02-08 by the reprex package (v2.0.1)
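As a quick check of the "integer, space, unit" form of by described in the help text, here is a small base R sketch. It assumes the dates of id "aa" from the question's example; the variable names are made up for illustration.

```r
# Observed half-year start dates for one ID (id "aa" in the example data)
observed <- as.Date(c("2021-07-01", "2022-07-01"))

# "6 months" is an integer, a space, and a unit, as described in ?seq.Date
half_years <- seq(min(observed), max(observed), by = "6 months")

# Half-year starts that are absent from the observed data
missing_starts <- half_years[!half_years %in% observed]

print(half_years)
print(missing_starts)
```

These are the dates that complete() would add as new rows for that id.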

Counting Times within a Range

I have this code that counts all instances within one day (12am to 12am). I want it to count all instances from 12pm to 12pm instead:
days_analyzed_data <- data %>% # (Liat's code, changed to use day instead of minute)
  filter(PERCENTAGE != "100.00%") %>%
  mutate(TIME = dmy_hms(TIME, tz = "Asia/Jerusalem")) %>%
  mutate(DATE = date(TIME),
         TIME = as_hms(floor_date(TIME, unit = "1 day"))) %>% # here is where you choose minute/hour/day, etc.
  group_by(CAMERA, DATE, TIME) %>%
  summarise(count = n())
Thanks
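One possible approach, sketched in base R and not tested against the original data: shift every timestamp back by 12 hours before taking the date, so that each noon-to-noon window maps onto a single calendar date. The timestamps below are invented, and UTC is used instead of "Asia/Jerusalem" to keep daylight-saving changes out of the example.

```r
# Hypothetical timestamps; UTC is used here to sidestep daylight-saving shifts
times <- as.POSIXct(
  c("2021-03-01 11:59:00", "2021-03-01 12:00:00",
    "2021-03-02 11:30:00", "2021-03-02 13:00:00"),
  tz = "UTC"
)

# Shift back 12 hours so that each noon-to-noon window falls on one calendar date
window_start <- as.Date(times - 12 * 3600, tz = "UTC")

# Count events per window; each label is the date on which the window opens at noon
counts <- table(window_start)
print(counts)
```

In the pipeline above, the same idea would presumably mean flooring TIME - hours(12) instead of TIME, but that variant is an assumption and would need checking around daylight-saving changes.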

Filtering uneven data sets

I am trying to filter a data set down to two months. I would like to keep each ID and year that has data for both months, and remove any ID and year that does not have the matching pair.
For example if an ID and year has both the January and July month in the data set, I would like to include this ID and the year in my filtered data. If an ID has only the month of January and not July, I would like to remove this data and not include it in the filtered data set. Is there a good way to do this? Just a note that I wasn't sure how to simulate the uneven data set in the example.
After filtering for my desired output, I test by creating a list for each seasonal month where each ID and year has at least 15 rows associated with it.
library(lubridate)
library(dplyr)
set.seed(12345)
df <- tibble(
date = sample(seq(dmy("01-01-2010"), dmy("31-12-2013"), by = "days"),
1000, replace = TRUE),
x = runif(length(date), min = 60000, max = 80000),
y = runif(length(date), min = 800000, max = 900000),
ID = rep(1:5, 200),
month = month(date),
year =year(date)) %>%
arrange(ID, date)
df %>%
filter(month %in% c(1,7)) %>%
group_by(ID, year) %>%
mutate(complete = length(unique(month)) == 2) %>%
group_by(ID) %>%
filter(all(complete)) %>%
group_by(ID, year)
# Creates a list for each year and by ID
summer_list <- df %>%
filter(month %in% 7) %>%
filter(n() >= 15) %>%
group_split(year, ID)
# Renames the names in the list to AnimalID and year
names(summer_list) <- sapply(summer_list,
function(x) paste(x$ID[1],
x$year[1], sep = '_'))
# Creates a list for each year and by ID
winter_list <- df1 %>%
filter(month %in% 1) %>%
filter(n() >= 15) %>%
group_split(year, ID)
# Renames the names in the list to ID and year
names(winter_list) <- sapply(winter_list,
function(x) paste(x$ID[1],
x$year[1], sep = '_'))
You were really close. I think your filter can be simplified to the following. Just be sure to save it to df.
df <- df %>%
filter(month %in% c(1,7)) %>%
group_by(ID, year) %>%
mutate(complete = length(unique(month)) == 2) %>%
filter(complete)
# could add "%>% select(-c(complete))" to get rid of complete
On summer_list and winter_list, add a group_by between the filters. With the dataset you provided, there were no groups with 15 records, but I tested that this works by bumping up the size of df until I got some.
summer_list <- df %>%
filter(month == 7) %>% # used == since there's only one test value
group_by(ID, year) %>% # added this
filter(n() >= 15) %>%
group_split()
There's also a typo in your first use of winter_list -- the input data is df1, but I think you want df. Hope this works!
Here's the complete code including the larger df:
library(lubridate)
library(dplyr)
set.seed(12345)
df <- tibble(
date = sample(seq(dmy("01-01-2010"), dmy("31-12-2013"), by = "days"),
4000, replace = TRUE),
x = runif(length(date), min = 60000, max = 80000),
y = runif(length(date), min = 800000, max = 900000),
ID = rep(1:5, 800),
month = month(date),
year =year(date)) %>%
arrange(ID, date)
df <- df %>%
filter(month %in% c(1,7)) %>%
group_by(ID, year) %>%
mutate(complete = length(unique(month)) == 2) %>%
filter(complete)
# could add "%>% select(-c(complete))" to get rid of complete
# Creates a list for each year and by ID
summer_list <- df %>%
filter(month == 7) %>%
group_by(ID, year) %>%
filter(n() >= 15) %>%
group_split()
# Renames the names in the list to AnimalID and year
names(summer_list) <- sapply(summer_list,
function(x) paste(x$ID[1],
x$year[1], sep = '_'))
# Creates a list for each year and by ID
winter_list <- df %>%
filter(month == 1) %>%
group_by(ID, year) %>%
filter(n() >= 15) %>%
group_split()
# Renames the names in the list to ID and year
names(winter_list) <- sapply(winter_list,
function(x) paste(x$ID[1],
x$year[1], sep = '_'))
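To see the pairing rule on its own, here is a small base R sketch on a made-up data frame (toy, key and n_months are illustrative names, not part of the original code). It applies the same test as the answer, namely that an ID and year must contain two distinct months once only January and July remain.

```r
# Toy data that already contains only months 1 (January) and 7 (July)
toy <- data.frame(
  ID    = c(1, 1, 1, 2, 2),
  year  = c(2010, 2010, 2011, 2010, 2010),
  month = c(1, 7, 1, 1, 1)
)

# One key per ID-year combination
key <- paste(toy$ID, toy$year)

# Number of distinct months observed for each ID-year
n_months <- tapply(toy$month, key, function(m) length(unique(m)))

# Keep only rows whose ID-year has both January and July
kept <- toy[n_months[key] == 2, ]
print(kept)
```

Only ID 1 in 2010 survives: ID 1 in 2011 has January only, and ID 2 in 2010 has two January rows but no July.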

How to fill in missing value of a data.frame in R?

I have multiple columns that have missing values. I want to fill the missing data in each column with the mean of the same day across all years. For example, DF is my fake data, where two columns (A and X) have missing values.
library(lubridate)
library(tidyverse)
library(naniar)
set.seed(123)
DF <- data.frame(Date = seq(as.Date("1985-01-01"), to = as.Date("1987-12-31"), by = "day"),
A = sample(1:10,1095, replace = T), X = sample(5:15,1095, replace = T)) %>%
replace_with_na(replace = list(A = 2, X = 5))
To fill in column A, I use the following code
Fill_DF_A <- DF %>%
mutate(Year = year(Date), Month = month(Date), Day = day(Date)) %>%
group_by(Year, Day) %>%
mutate(A = ifelse(is.na(A), mean(A, na.rm=TRUE), A))
I have many columns in my data.frame. How can I generalize this to fill in the missing values in all of them?
We can use na.aggregate from zoo
library(dplyr)
library(zoo)
# Grouping by Month and Day (rather than Year and Day) averages the same calendar day across all years
DF %>%
  mutate(Year = year(Date), Month = month(Date), Day = day(Date)) %>%
  group_by(Month, Day) %>%
  mutate(across(A:X, na.aggregate))
Or if we prefer to use conditional statements
DF %>%
  mutate(Year = year(Date), Month = month(Date), Day = day(Date)) %>%
  group_by(Month, Day) %>%
  mutate(across(A:X, ~ case_when(is.na(.) ~ mean(., na.rm = TRUE),
                                 TRUE ~ as.numeric(.))))
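For readers without zoo, the group-mean fill can also be sketched in base R. This is a minimal illustration on made-up numbers; fill_mean is a hypothetical helper, and grouping by month and day of month is one reading of "the same day across all years".

```r
# Toy data: three January 1sts and two February 1sts, with gaps in A
DF_toy <- data.frame(
  Month = c(1, 1, 1, 2, 2),
  Day   = c(1, 1, 1, 1, 1),
  A     = c(2, NA, 4, NA, 10)
)

# Replace the NAs in a vector with the mean of its non-missing values
fill_mean <- function(x) {
  x[is.na(x)] <- mean(x, na.rm = TRUE)
  x
}

# Apply the fill separately within each Month/Day group
DF_toy$A <- ave(DF_toy$A, DF_toy$Month, DF_toy$Day, FUN = fill_mean)
print(DF_toy)
```

If every value in a group is missing, mean(x, na.rm = TRUE) returns NaN, so such groups stay unfilled.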

How to mutate values of a tibble in long format

I want to normalize using min/max the values of two indicators. Is it possible to do it keeping the tibble in long format? (Below I use left join to do it in wide format).
library(tidyverse)
df <- tibble(ind =c(1, 2),
`2015` = c(3,10),
`2016` = c(7,18),
`2017` = c(1,4))
# long format
df2 <- df %>%
gather("year", "value", 2:4)
df3 <- df2 %>%
group_by(ind) %>%
summarise(mn = min(value),
mx = max(value))
# wide format?
df4 <- left_join(df2, df3, by = c("ind"="ind"))
df5 <- df4 %>%
mutate(value2 = (value-mn)/(mx-mn))
Created on 2019-10-07 by the reprex package (v0.3.0)
Instead of doing the left_join, we can create the columns with mutate and skip the summarise step
library(dplyr)
df2 %>%
group_by(ind) %>%
mutate(mn = min(value), mx = max(value)) %>%
ungroup %>%
mutate(value2 = (value - mn)/(mx-mn))
NOTE: Here, we assumed the OP wanted the columns 'mx' and 'mn' in the final output. But if the intention is to get only 'value2', there is no need to create the additional columns, as @Gregor mentioned in the comments
df2 %>%
group_by(ind) %>%
mutate(value2 = (value - min(value))/(max(value) - min(value)))
Also, as of tidyr 1.0.0, we can use pivot_longer instead of gather. It is more general, as it can handle multiple sets of columns when reshaping from 'wide' to 'long'
library(tidyr)
df %>%
pivot_longer(cols = -ind) %>%
group_by(ind) %>%
mutate(mn = min(value), mx = max(value)) %>%
ungroup %>%
mutate(value2 = (value - mn)/(mx-mn))
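The per-group min/max rescaling can be checked by hand with a short base R sketch. It assumes the long-format values from the question; rescale01 is a hypothetical helper name.

```r
# Long-format data equivalent to df2 in the question
df_long <- data.frame(
  ind   = c(1, 2, 1, 2, 1, 2),
  year  = c("2015", "2015", "2016", "2016", "2017", "2017"),
  value = c(3, 10, 7, 18, 1, 4)
)

# Min/max normalisation of a single vector to the range [0, 1]
rescale01 <- function(x) (x - min(x)) / (max(x) - min(x))

# Apply it within each indicator, keeping the long format
df_long$value2 <- ave(df_long$value, df_long$ind, FUN = rescale01)
print(df_long)
```

For indicator 1 the values 3, 7, 1 become 1/3, 1, 0, and for indicator 2 the values 10, 18, 4 become 3/7, 1, 0. If all values of an indicator are equal, the denominator is zero and the result is NaN, so that case needs separate handling.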

Resources