How to do a conditional calculation with a strange requirement in R

I have a strange problem with a calculation, and I am not sure what I should do. I have data that looks like the sample below.
I need to sort by ID and Date first, which I did. Then I need to find the baseline date: a date can only be used as the baseline if its Duration is <= 0 and closest to 0. Then I need to calculate Useable = current score / baseline date's score, so the final results should look like the output shown in the answer.
What should I do? How can I check the oldest day and build "Useable" as score / oldest score?
The code for the sample data is:
ID <-c("1","1","1","1","2","2","2","2")
Date<- c("4/19/2018","7/27/2018","8/24/2018","9/21/2018","10/19/2018","12/14/2018","1/11/2019","1/24/2019")
Duration <- c("-13","-7","95","142","2","36","75","81")
score <- c("0.06","0.071","0.054","0.0258","0.0208","0.0448","0.0638","0.0227")
Sample.data <- data.frame(ID, Date, Duration, score)

The columns in 'Sample.data' are all character class because the values were quoted (this was run on R 4.0.0; with R < 4.0.0 they would have been factors, as stringsAsFactors = TRUE was the default). So we use type.convert to set each column's class automatically based on its values. Then, before arranging on 'ID' and 'Date', we convert 'Date' to Date class (in case the original data are ordered inconsistently). After grouping by 'ID', we create the new column 'Useable' with an if/else condition: if the group's first 'Duration' is <= 0, return the 'score' standardized by the first value of 'score'; otherwise return NA.
library(dplyr)
library(lubridate)
Sample.data <- Sample.data %>%
  type.convert(as.is = TRUE) %>%
  mutate(Date = mdy(Date)) %>%
  arrange(ID, Date) %>%
  group_by(ID) %>%
  mutate(Useable = if (first(Duration) <= 0) c(NA, score[-1] / first(score))
                   else NA_real_)
Sample.data
# A tibble: 8 x 5
# Groups: ID [2]
# ID Date Duration score Useable
# <int> <date> <int> <dbl> <dbl>
#1 1 2018-04-19 -13 0.06 NA
#2 1 2018-07-27 -7 0.071 1.18
#3 1 2018-08-24 95 0.054 0.9
#4 1 2018-09-21 142 0.0258 0.43
#5 2 2018-10-19 2 0.0208 NA
#6 2 2018-12-14 36 0.0448 NA
#7 2 2019-01-11 75 0.0638 NA
#8 2 2019-01-24 81 0.0227 NA
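Note that the code above takes the first row of each ID as the baseline. If the baseline should instead be the row whose Duration is <= 0 and closest to 0, as the question states (-7 rather than -13 for ID 1), a sketch of that stricter rule might look like the following; the column names baseline and Useable2 are made up for illustration:
Sample.data %>%
  group_by(ID) %>%
  mutate(
    # baseline score: the Duration <= 0 that is closest to zero
    baseline = if (any(Duration <= 0)) score[Duration <= 0][which.max(Duration[Duration <= 0])]
               else NA_real_,
    # rows at or before the baseline get NA, as in the output above
    Useable2 = ifelse(Duration <= 0, NA_real_, score / baseline)
  )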

New data frame with unique values and counts [duplicate]

I'd like to create a new data table from my old one that includes a count of all the "article_id" values that occur for each date (e.g. there are three article_id values listed for the date 2001-10-01, so I'd like one column with the date and one column with the article count, "3").
Here is the output of the data table:
date article_id N
1: 2001-09-01 FAS_200109_11104 3
2: 2001-10-01 FAS_200110_11126 6
3: 2001-10-01 FAS_200110_11157 21
4: 2001-10-01 FAS_200110_11160 5
5: 2001-11-01 FAS_200111_11220 26
---
7359: 2019-08-01 FAZ_201908_2958 7
7360: 2019-09-01 FAZ_201909_3316 8
7361: 2019-09-01 FAZ_201909_3515 13
7362: 2000-12-01 FAZ_200012_92981 3
7363: 2001-08-01 FAZ_200108_86041 14
So I'll have to move over the unique date values to a new data frame (so that each date is only shown once), as well as a count of article_id's shown for each date.
I've been trying to figure this out but haven't found exactly the right answer regarding how to count the occurrence of a character vector (the article_id) by group (date). I think this is something pretty simple in R, but I'm new to the program and don't have much support so I would very much appreciate your suggestions - thank you so much!
The expected output is not clear, so here are solutions under a few assumptions about what is wanted.
Sum of 'N' by 'date'
library(data.table)
dt[, .(N = sum(N, na.rm = TRUE)), by = date]
Count of unique 'article_id' for each date
dt[, .(N = uniqueN(article_id)), by = date]
Get the first count by 'date'
dt[, .(N = first(N)), by = date]
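Since the poster's data.table is not shown, here is a small self-contained illustration of the uniqueN() option; dt below holds toy values, not the real data:
library(data.table)
# Toy data: one article on 2001-09-01, three on 2001-10-01
dt <- data.table(
  date = c("2001-09-01", "2001-10-01", "2001-10-01", "2001-10-01"),
  article_id = c("FAS_200109_11104", "FAS_200110_11126",
                 "FAS_200110_11157", "FAS_200110_11160"),
  N = c(3, 6, 21, 5)
)
dt[, .(N = uniqueN(article_id)), by = date]
#          date N
# 1: 2001-09-01 1
# 2: 2001-10-01 3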
We could group and then summarise:
library(dplyr)
df %>%
  group_by(date) %>%
  summarise(n = n())
date n
<chr> <int>
1 2000-12-01 1
2 2001-08-01 1
3 2001-09-01 1
4 2001-10-01 3
5 2001-11-01 1
6 2019-08-01 1
7 2019-09-01 2
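dplyr also offers count() as a shorthand for this group_by()/summarise() pair:
# Equivalent one-liner for the row counts above
df %>%
  count(date)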
Here are two tidyverse solutions:
Libraries
library(tidyverse)
library(lubridate) # for ymd(); lubridate is not attached by older tidyverse versions
Example Data
df <-
  tibble(
    date = ymd(c("2001-09-01", "2001-10-01", "2001-10-01")),
    article_id = c("FAS_200109_11104", "FAS_200110_11126", "FAS_200110_11157"),
    N = c(3, 6, 21)
  )
Solution
Solution 1
df %>%
  group_by(date) %>%
  summarise(N = sum(N, na.rm = TRUE))
Solution 2
df %>%
  count(date, wt = N)
Result
# A tibble: 2 x 2
date n
<date> <dbl>
1 2001-09-01 3
2 2001-10-01 27
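For comparison, the distinct-count interpretation can also be done in base R; a sketch over the example df defined above:
# Count distinct article_id per date without extra packages
aggregate(article_id ~ date, data = df, FUN = function(x) length(unique(x)))
#         date article_id
# 1 2001-09-01          1
# 2 2001-10-01          2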

Is there a way to find daily maximum from hourly data with missing values

I have measured hourly data of ground-level O3, but with some missing values (marked as NA). I want to calculate daily maximums, but only when there are more than 17 hourly measurements for a date; for dates with fewer than 18 measurements, I want to write NA.
head(o3sat)
date hour O3
1/1/2010 0 50.2
1/1/2010 1 39.8
1/1/2010 2 41.8
1/1/2010 3 NA
1/1/2010 4 9.2
1/1/2010 5 6.0
Is there a way to add an argument to this function to indicate that at least 75% of the data must be available in a day for the value to be calculated, and to write NA otherwise?
maximums <- aggregate(o3sat["O3"], list(Date = as.Date(o3sat$date, "%m/%d/%Y")), max, na.rm = TRUE)
It is better to provide a reproducible example when asking a question. Here, I created an example data frame based on the information you provided. This data frame contains hourly O3 measurements from 2010-01-01 to 2010-01-03.
library(dplyr)
library(tidyr)
library(lubridate)
o3sat <- read.table(text = " date hour O3
'1/1/2010' 0 50.2
'1/1/2010' 1 39.8
'1/1/2010' 2 41.8
'1/1/2010' 3 NA
'1/1/2010' 4 9.2
'1/1/2010' 5 6.0 ",
stringsAsFactors = FALSE, header = TRUE)
set.seed(1234)
o3sat_ex <- o3sat %>%
  mutate(date = mdy(date)) %>%
  complete(date = seq.Date(ymd("2010-01-01"), ymd("2010-01-03"), 1), hour = 0:23) %>%
  mutate(O3 = c(o3sat$O3, rnorm(66, 30, 10))) %>%
  mutate(O3 = ifelse(row_number() %in% sample(7:72, 18), NA, O3))
We can count the number of non-NA values per day using the following code.
o3sat_ex %>%
  group_by(date) %>%
  summarize(sum(!is.na(O3)))
# # A tibble: 3 x 2
# date `sum(!is.na(O3))`
# <date> <int>
# 1 2010-01-01 18
# 2 2010-01-02 17
# 3 2010-01-03 18
Based on your description, we would like to calculate the maximum for 2010-01-01 and 2010-01-03, but not 2010-01-02 as it only contains 17 non-NA values.
Here is one way to achieve the task: we can define a function, max_helper, that only returns the maximum if the count of non-NA values reaches the threshold (18 here).
max_helper <- function(x, threshold) {
  if (sum(!is.na(x)) >= threshold) {
    r <- max(x, na.rm = TRUE)
  } else {
    r <- NA
  }
  return(r)
}
We can apply this function using dplyr to get the answer.
o3sat_ex2 <- o3sat_ex %>%
  group_by(date) %>%
  summarize(O3 = max_helper(O3, 18))
o3sat_ex2
# # A tibble: 3 x 2
# date O3
# <date> <dbl>
# 1 2010-01-01 50.2
# 2 2010-01-02 NA
# 3 2010-01-03 47.8

Dplyr doesn't respect groups when ranking data

Using the code below with dplyr 0.7.6, I try to calculate the rank of a variable for each day in a dataset, but dplyr doesn't account for the group_by(CREATIONDATE_DAY):
dates <- sample(seq(from=as.POSIXct("2019-03-12",tz="UTC"),to=as.POSIXct("2019-03-20",tz="UTC"),by = "day"),size = 100,replace=TRUE)
group <- sample(c("A","B","C"),100,TRUE)
df <- data.frame(CREATIONDATE_DAY = dates,GROUP = group)
# calculate the occurrences for each day and group
dfMod <- df %>%
  group_by(CREATIONDATE_DAY, GROUP) %>%
  dplyr::summarise(COUNT = n()) %>%
  ungroup()
# Compute the rank by count for each day
dfMod <- dfMod %>%
  group_by(CREATIONDATE_DAY) %>%
  mutate(rank = rank(-COUNT, ties.method = "min"))
But the rank values are calculated over the entire dataset instead of within each creation day. As seen in the image, the row with id 24 should be rank 1, since 4 is the highest value for 16.03.2019, and row 23 should be rank 2 for that particular day. Where is my mistake?
Edit: added desired output:
Edit #2: as MrFlick pointed out, I checked my dplyr version (0.7.6), and upgrading to the most current version fixed the issue for me.
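For reference, checking the installed version and updating looks like this:
packageVersion("dplyr")   # 0.7.6 was the version showing the problem
install.packages("dplyr") # installs the current CRAN release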
It seems there may be a conflict with another package. If you have lubridate attached, try reversing the order in which you load lubridate and dplyr (I tried your example and it gave me the right answer). You can also try:
dfMod <- dfMod %>% group_by(CREATIONDATE_DAY) %>% mutate(rank = row_number(desc(COUNT)))
> head(dfMod)
# A tibble: 6 x 4
# Groups: CREATIONDATE_DAY [2]
CREATIONDATE_DAY GROUP COUNT rank
<dttm> <fct> <int> <int>
1 2019-03-12 00:00:00 A 2 3
2 2019-03-12 00:00:00 B 5 1
3 2019-03-12 00:00:00 C 4 2
4 2019-03-13 00:00:00 A 4 1
5 2019-03-13 00:00:00 B 3 2
6 2019-03-13 00:00:00 C 2 3
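Note that row_number() and rank() treat ties differently: rank(..., ties.method = "min") gives tied counts the same rank, while row_number() breaks ties by position. A minimal illustration:
x <- c(4, 4, 2)
rank(-x, ties.method = "min")      # 1 1 3: tied values share the lowest rank
dplyr::row_number(dplyr::desc(x))  # 1 2 3: ties broken by row order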

Add sequence of week count aligned to a date column with infrequent dates

I'm building a dataset and want to add a week count to it, starting from the first date and ending on the last. I'm using it to summarize a much larger dataset, which I'd like summarized by week eventually.
Using this sample:
library(dplyr)
df <- tibble(Date = seq(as.Date("1944/06/1"), as.Date("1944/09/1"), "days"),
             Week = seq_along(Date) / 7) # row index / 7, not a real week number
# A tibble: 93 x 2
Date Week
<date> <dbl>
1 1944-06-01 0.143
2 1944-06-02 0.286
3 1944-06-03 0.429
4 1944-06-04 0.571
5 1944-06-05 0.714
6 1944-06-06 0.857
7 1944-06-07 1
8 1944-06-08 1.14
9 1944-06-09 1.29
10 1944-06-10 1.43
# … with 83 more rows
Which definitely isn't right. Also, my real dataset isn't structured sequentially; there are many days missing between weeks, so a straight sequential count won't work.
An ideal end result is an additional "week" column based upon the actual dates (rather than hard-coded with a seq_along() type of result).
Similar solution to Ronak's but with lubridate:
library(lubridate)
(df <- tibble(Date = seq(as.Date("1944/06/1"), as.Date("1944/09/1"), "days"),
              week = interval(min(Date), Date) %>%
                as.duration() %>%
                as.numeric("weeks") %>%
                floor() + 1))
You could subtract the first Date from all the Date values, compute the difference with difftime in "weeks" units, floor the values, and add 1 to start the counter from 1.
df$week <- floor(as.numeric(difftime(df$Date, df$Date[1], units = "weeks"))) + 1
df
# A tibble: 93 x 2
# Date week
# <date> <dbl>
# 1 1944-06-01 1
# 2 1944-06-02 1
# 3 1944-06-03 1
# 4 1944-06-04 1
# 5 1944-06-05 1
# 6 1944-06-06 1
# 7 1944-06-07 1
# 8 1944-06-08 2
# 9 1944-06-09 2
#10 1944-06-10 2
# … with 83 more rows
To use this in your dplyr pipe you could do
library(dplyr)
df %>%
  mutate(week = floor(as.numeric(difftime(Date, first(Date), units = "weeks"))) + 1)
data
df <- tibble::tibble(Date = seq(as.Date("1944/06/1"), as.Date("1944/09/1"), "days"))
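Because the counter is derived from the dates themselves rather than from row positions, it still behaves as expected when days are missing; a quick check on a made-up gappy sequence:
# Dates with gaps: the week index follows the calendar, not the row count
gappy <- tibble::tibble(Date = as.Date(c("1944-06-01", "1944-06-03",
                                         "1944-06-20", "1944-07-05")))
gappy$week <- floor(as.numeric(difftime(gappy$Date, gappy$Date[1],
                                        units = "weeks"))) + 1
gappy$week
# [1] 1 1 3 5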

Use lapply within ifelse and maintain column names

I have df1 sorted by date like this:
Date <- c("12/17/17","12/19/17","12/20/17","12/30/17","12/31/17","1/1/18")
Jon <- c(388,299,412,NA,NA,353)
Eric <- c(121,NA,321,473,832,NA)
Scott <- c(NA,122,NA,NA,NA,424)
df1 <- data.frame(Date,Jon,Eric,Scott)
df1$Date <- as.Date(df1$Date,format='%m/%d/%y')
#df1
        Date Jon Eric Scott
1 2017-12-17 388  121    NA
2 2017-12-19 299   NA   122
3 2017-12-20 412  321    NA
4 2017-12-30  NA  473    NA
5 2017-12-31  NA  832    NA
6 2018-01-01 353   NA   424
I'm trying to create a new list that includes only the data that is within the last 12 days of each person's most recent date with a non-NA value. If there is only one non-NA value within 12 days of the person's most recent non-NA value, then I want to take the 2 most recent non-NA values for that person, even if one falls outside of the 12 day date range.
The code below successfully puts data within the last 12 days of each person's most recent non-NA value in a new list:
df2 <- lapply(df1[-1],function(x) x[which((m=tail(df1$Date[!is.na(x)],1)-df1$Date)>=0&m<=12)])
This code successfully takes the 2 most recent non-NA entries, regardless of whether or not it's within the 12 day range:
df3 <- lapply(df1[-1], function(x) tail(x[!is.na(x)], n = 2))
This code comes very close to doing what I want, except it loses the column names. Notice that the column names are replaced with numbers, unlike in the lapply statements above, which both keep the column names.
withinRange <- lapply(df1[-1], function(x) x[which((m = tail(df1$Date[!is.na(x)], 1) - df1$Date) >= 0 & m <= 12)]) %>%
  lapply(function(x) length(x[!is.na(x)])) %>%
  as.data.frame()
df4 <- ifelse(withinRange[colnames(df1[-1])] > 1,
              lapply(df1[-1], function(x) x[which((m = tail(df1$Date[!is.na(x)], 1) - df1$Date) >= 0 & m <= 12)]),
              lapply(df1[-1], function(x) tail(x[!is.na(x)], n = 2)))
How can I maintain the column names?
I would approach this problem using the tidyverse packages.
Data
library(tidyr)
library(dplyr)
library(lubridate)
df <- tibble(
  my_date = as.Date(
    c("12/17/17", "12/19/17", "12/20/17", "12/30/17", "12/31/17", "1/1/18"),
    "%m/%d/%y"
  ),
  jon = c(388, 299, 412, NA, NA, 353),
  eric = c(121, NA, 321, 473, 832, NA),
  scott = c(NA, 122, NA, NA, NA, 424)
)
Long format data frame
This output feels more natural.
df_long <- df %>%
  gather(key, value, -my_date) %>%
  drop_na %>%
  group_by(key) %>%
  mutate(
    in_date = if_else(my_date >= max(my_date) - days(12), TRUE, FALSE),
    count = sum(in_date)
  ) %>%
  filter(in_date | count < 2) %>%
  top_n(2, my_date) %>%
  ungroup %>%
  select(-c(in_date, count))
df_long
# # A tibble: 6 x 3
# my_date key value
# <date> <chr> <dbl>
# 1 2017-12-20 jon 412
# 2 2018-01-01 jon 353
# 3 2017-12-30 eric 473
# 4 2017-12-31 eric 832
# 5 2017-12-19 scott 122
# 6 2018-01-01 scott 424
Wide format
Thankfully, it is only one additional step to spread to your original columns.
df_long %>% spread(key, value)
# # A tibble: 5 x 4
# my_date eric jon scott
# * <date> <dbl> <dbl> <dbl>
# 1 2017-12-19 NA NA 122
# 2 2017-12-20 NA 412 NA
# 3 2017-12-30 473 NA NA
# 4 2017-12-31 832 NA NA
# 5 2018-01-01 NA 353 424
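In current tidyr (1.0.0 and later), gather() and spread() are superseded; the same reshape can be sketched with pivot_longer() and pivot_wider():
# Long form (replaces gather(); values_drop_na replaces the drop_na step)
df %>% pivot_longer(-my_date, names_to = "key", values_to = "value",
                    values_drop_na = TRUE)
# Back to wide form (replaces spread())
df_long %>% pivot_wider(names_from = key, values_from = value)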
It seems like the easiest thing for me is to store the column headers in a variable and then reattach them:
myHeaders <- names(df1[-1])
withinRange <- lapply(df1[-1], function(x) x[which((m = tail(df1$Date[!is.na(x)], 1) - df1$Date) >= 0 & m <= 12)]) %>%
  lapply(function(x) length(x[!is.na(x)])) %>%
  as.data.frame()
df4 <- ifelse(withinRange[colnames(df1[-1])] > 1,
              lapply(df1[-1], function(x) x[which((m = tail(df1$Date[!is.na(x)], 1) - df1$Date) >= 0 & m <= 12)]),
              lapply(df1[-1], function(x) tail(x[!is.na(x)], n = 2)))
names(df4) <- myHeaders
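The names disappear because ifelse() copies the attributes of its test argument (here an unnamed logical) onto the result, dropping any names on the yes/no values; a tiny demonstration:
x <- c(a = 1, b = 2)
ifelse(c(TRUE, FALSE), x, -x)
# [1]  1 -2   -- the names "a" and "b" are gone because the unnamed
#              logical test supplies the result's attributes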
