I want to check whether there is a discontinuity in the years of record in my streamflow data. I wrote a loop to select the stations with more than 30 years of records, and I also want to keep only the ones with continuous data (that is, continuity in years: 1998, 1999, 2000, 2001, but not a jump from 1998 to 2000).
How can I make a loop to select continuity in the years or even months? At the moment this is my code working as I expected:
Daily_Streamflow <- Daily_Streamflow %>% mutate(year = year(Date))
Daily_Streamflow <- Daily_Streamflow %>% mutate(month = month(Date))
if (length(unique(Daily_Streamflow$year)) > 30)
{
(name <- paste("DSF", siteNumber[i], sep = "_"))
(name2 <- paste("Site_Info", siteNumber[i], sep = "_"))
assign(name, value = Daily_Streamflow)
assign(name2, value = siteINFO)
#print(name)
}
How can I add the continuity condition to the if statement?
My desired output is to save, as data frames in R, only the records of streamflow gages with a period of record longer than 30 years (length(unique(Daily_Streamflow$year)) > 30), with the added condition that there is no gap in the records of more than 3 months. What I am missing is this second condition, the continuity check.
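One possible way to express that second condition, as a minimal sketch: sort the distinct dates, take the differences between consecutive dates, and require that the largest gap stays under a threshold. The helper name is_continuous and the 92-day threshold (a rough stand-in for 3 months) are assumptions for illustration, not part of the code above.

```r
# Hypothetical helper: TRUE when the record covers more than `min_years`
# distinct years and no gap between consecutive dates exceeds `max_gap_days`.
# The 92-day default is only an approximation of "3 months".
is_continuous <- function(dates, min_years = 30, max_gap_days = 92) {
  dates <- sort(unique(as.Date(dates)))
  n_years <- length(unique(format(dates, "%Y")))
  gaps <- as.numeric(diff(dates))  # days between consecutive records
  n_years > min_years && (length(gaps) == 0 || max(gaps) <= max_gap_days)
}

# Toy records: 33 complete years, and the same record with 1995 removed
full_record  <- seq(as.Date("1980-01-01"), as.Date("2012-12-31"), by = "day")
gappy_record <- full_record[format(full_record, "%Y") != "1995"]

is_continuous(full_record)
is_continuous(gappy_record)
```

Under these assumptions the condition in the loop could become if (is_continuous(Daily_Streamflow$Date)), which covers both the 30-year requirement and the gap check.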
I want to count the number of patients waiting to be seen by a doctor every 15 minutes over a 3.5-year time frame.
I have a first data.frame (dates) which has 122880 rows of dates (one every 15 minutes).
I have another data.frame (episode) which has 225000 rows of patient ID with the time when they came to the ER, the time when they left and their ID.
I have a third data.frame (care) with the time the doctor saw the patients and the patients' ID.
Here is my code:
for(hour in 1:122880){
for(patient in 1:nrow(episode)){
if(episode$begin[patient] <dates[hour]&episode$end[patient]>dates[hour]){
no_episode = episode$id[patient]
if(care$begin[care$id==no_episode]>dates[hour]){
nb_wait = nb_wait + 1
delay = delay + dates[hour]-episode$begin[patient]
}
}
}
nb_wait_total = rbind(nb_wait_total, nb_wait)
delay_total = rbind(delay_total, delay)
nb_wait = 0
delay = 0
}
The first loop is to set the date and write the results.
The second loop + first if statement is to search in the episode data.frame which patients are in the ER during the time frame.
The second if is to count the patients who haven't seen the doctor yet. I also sum how long those patients have been waiting.
It is extremely slow (estimated 30 days) and, from the tests I have done, the slowest line is this one:
if(episode$begin[patient]<dates[hour]&episode$end[patient]>dates[hour])
Any idea how to clean my code so it doesn't take so much time?
I have tried cutting the episode data.frame into 4, which speeds up the process but would still take 7 days!
Thank you!
Update! Went from 30 days to 11 hours, thanks to the comments I had on the post!
Here is the code after modification:
for(hour in 1:122880){
temp <- episode$id[episode$begin<dates[hour]&episode$end>dates[hour]]
for(element in seq_along(temp)){
no_episode = temp[element]
if(care$begin[care$id==no_episode]>dates[hour]){
nb_wait = nb_wait + 1
delay = delay + dates[hour]-episode$begin[episode$id==no_episode]
}
}
nb_wait_total = rbind(nb_wait_total, nb_wait)
delay_total = rbind(delay_total, delay)
nb_wait = 0
delay = 0
}
Getting rid of the first if statement and the 2nd (longest) loop did the trick!
Not sure if this does exactly what you need, but it uses data of similar size and calculates the wait counts at each 15 minute interval in the date range in under 1 second.
First, here's some fake data. I assume here that "id" + "date" are unique identifiers in both the "care" and "episode" tables, and there's a 1:1 match for each. I also assume patient always arrives before doctor arrival, and ends after doctor arrival.
library(tidyverse); library(lubridate)
set.seed(42)
epi_n <- 500000 # before filtering; results in 230k data
care <- tibble(
id = sample(1:99999, epi_n, replace = TRUE),
begin = ymd_h(2010010100) + runif(epi_n, max = 3.5*365*24*60*60),
date = as_date(begin)) %>%
filter(hour(begin) >= 7, hour(begin) <= 17) %>%
distinct(id, date, .keep_all = TRUE) # make "id" and "date" unique identifiers
episode <- care %>%
transmute(id, date,
begin = begin - rpois(nrow(care), 30)*60,
end = begin + rgamma(nrow(care), 1)*600) %>%
arrange(begin)
Matching the data between the two data sets takes 0.2 sec.
tictoc::tic()
combined <- left_join(
episode %>% rename("patient_arrival" = "begin"),
care %>% rename("dr_arrival" = "begin"),
by = c("id", "date")
)
tictoc::toc()
Counting how many patients had arrived but doctors hadn't at each 15 minute interval takes another 0.2 sec.
Here, I isolate each patient arrival and each doctor arrival; the first moment adds one to the wait at that moment, and the second reduces the wait count by one. We can sort those and count the cumulative number of waiting patients. Finally, I collate in a table like your "dates" table, and fill the wait counts that were in effect prior to each 15 minute interval. Then I just show those.
tictoc::tic()
combined %>%
select(id, patient_arrival, dr_arrival) %>%
pivot_longer(-id, values_to = "hour") %>%
mutate(wait_chg = if_else(name == "patient_arrival", 1, -1)) %>%
arrange(hour) %>%
mutate(wait_count = cumsum(wait_chg)) %>%
bind_rows(.,
tibble(name = "15 minute interval",
hour = seq.POSIXt(ymd_h(2010010100),
to = ymd_h(2013070112),
by = "15 min")) %>%
filter(hour(hour) >= 6, hour(hour) < 20)
) %>%
arrange(hour) %>%
fill(wait_count) %>%
replace_na(list(wait_count = 0)) %>%
filter(name == "15 minute interval") %>%
select(hour,wait_count)
tictoc::toc()
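The same event-counting idea can be sketched in base R on a handful of made-up times: each arrival is a +1 event, each doctor contact a -1 event, and the running sum gives the number waiting. The toy values and the use of findInterval to look up the count at each 15-minute point are assumptions for illustration, not the code timed above.

```r
# Toy data, in minutes since the start of the day (hypothetical values)
arrival <- c(0, 5, 20, 40)   # patient arrives
seen    <- c(10, 30, 35, 50) # doctor sees the patient

# One +1 event per arrival and one -1 event per doctor contact, sorted by time
events <- data.frame(
  time = c(arrival, seen),
  chg  = c(rep(1, length(arrival)), rep(-1, length(seen)))
)
events <- events[order(events$time, events$chg), ]
events$waiting <- cumsum(events$chg)

# Waiting count in effect at each 15-minute grid point:
# findInterval returns the index of the last event at or before each grid time
grid <- seq(0, 60, by = 15)
idx  <- findInterval(grid, events$time)
wait_at_grid <- ifelse(idx == 0, 0, events$waiting[pmax(idx, 1)])

data.frame(grid, wait_at_grid)
```

Because the events are sorted once and then scanned, the cost grows with the number of events plus the number of grid points rather than with their product, which is where the nested loops lose time.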
I have a dataset with patients who were under observation for 72 hours. Patient's oxygen levels were measured every 4 seconds but some observations had to be removed due to issues with accuracy of the data. As a result, patients have a varying number of observations.
While patients were observed, they underwent various interventions. The objective of my analysis is to assess whether interventions affected the patient's oxygen levels or not. Therefore, I am doing a comparison of the oxygen levels pre-intervention, during intervention and post-intervention.
While the actual analysis isn't too difficult, I am having a hard time subsetting the data. For example, I only want observations 300 seconds prior to the start of an intervention and 300 seconds post intervention. I have to take into account the fact that an individual may have multiple interventions over the course of the time window and there are multiple subjects.
I have provided some sample code below to generate a dataset but please let me know if I'm missing anything.
id <- rep(c(1,2,3), each = 1000)
intervention <- c(rep(0,200), rep(1,10), rep(0,153), rep(0,5), rep(0,284), rep(0,20), rep(0,159), rep(0,23), rep(0,146),
rep(0,123), rep(1,23), rep(0,356), rep(1,8), rep(0,234), rep(1,23), rep(0,233),
rep(0,345), rep(1,12), rep(0,48), rep(1,15), rep(0,74), rep(1,4), rep(0,233), rep(1,82), rep(0,187))
final <- data.frame(id, intervention)
final <- final %>%
group_by(id) %>%
mutate(time = row_number() * 4)
So far, I have tried this method but I was only able to isolate single observations 5 mins pre and post an intervention and not all observations between those time windows (ie, the single observation 5mins prior to start of an intervention and the single observation 5 mins post an intervention but not all the observations in between these three points)
data <- final4 %>%
filter(intervention == 1) %>%
mutate(five_mins_after = time + 300, #5 mins after intervention
five_mins_before = time - 300) %>% #5 mins before intervention %>%
filter(id == "1")
data2 <- final4 %>%
filter(intervention == 0,
id == "1")
data_after <- data %>%
dplyr::select(five_mins_after)
data_before <- data %>%
dplyr::select(five_mins_before)
data3 <- merge(data2, data_after, by.x = "time", by.y = "five_mins_after")
data4 <- merge(data2, data_before, by.x = "time", by.y = "five_mins_before")
final <- final %>%
dplyr::bind_rows(data3) %>%
dplyr::bind_rows(data4)
Please let me know if you need any additional information and thanks for your time!
PS: Apologies if I missed anything, first time asking for help here
Here is the answer. Although long, it worked fine for gathering the observations from 300 seconds before to 300 seconds after the beginning of an intervention.
Let me know if you'd like further explanation or if I have misunderstood anything.
library(magrittr)
library(tidyverse)
### Sample code
id <- rep(c(1,2,3), each = 1000)
intervention <- c(rep(0,200), rep(1,10), rep(0,153), rep(0,5), rep(0,284), rep(0,20), rep(0,159), rep(0,23), rep(0,146),
rep(0,123), rep(1,23), rep(0,356), rep(1,8), rep(0,234), rep(1,23), rep(0,233),
rep(0,345), rep(1,12), rep(0,48), rep(1,15), rep(0,74), rep(1,4), rep(0,233), rep(1,82), rep(0,187))
final <- data.frame(id, intervention)
final <- final %>%
group_by(id) %>%
mutate(time = row_number() * 4)
### Start of data processing to get wanted observations
# Order it by id and time
final %<>% arrange(id, time)
# Loop over the unique ids
obs_to_keep <- list()
for(i in unique(final$id)) {
# Get the times with no intervention
time_zero_intervention <- final %>%
filter(id == i & intervention == 0) %>%
select(time)
# Obtain all times after zero interventions, that could be intervention == 1
time_plus_4 <- time_zero_intervention$time + 4
# Where in the times after 0 intervention there is a 1 intervention
starts_of_interventions <- final %>%
filter(id == i & time %in% time_plus_4) %>%
filter(intervention == 1)
# Loop over each one of the times where intervention starts
all_times <- list()
for(n in seq_along(starts_of_interventions$time)) {
# Gather 300 secs prior and post
time_300_before <- starts_of_interventions$time[n] - 300
time_300_after <- starts_of_interventions$time[n] + 300
# Filter for observations in this interval
all_times[[n]] <- final %>%
filter(id == i) %>%
filter(time >= time_300_before & time <= time_300_after)
}
if(length(all_times) == 1){
obs_to_keep[[i]] <- as.data.frame(all_times)
}
else {
obs_to_keep[[i]] <- do.call(rbind, all_times)
}
}
# Make a data frame from the list
df <- do.call(rbind, obs_to_keep)
# Order it by id and time
df %<>% arrange(id, time)
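For comparison, a more compact sketch of the same idea in base R: an intervention start is a row where the flag is 1 and the previous row's flag is 0, and a row is kept if it lies within 300 seconds of any start for the same id. The helper name keep_windows and the toy data are assumptions for illustration. Unlike the loop above, this version also treats an intervention that begins on a subject's very first row as a start.

```r
# Toy data: two subjects, readings every 4 s; only subject 1 has an intervention
toy <- data.frame(
  id           = rep(1:2, each = 500),
  time         = rep(seq(4, 2000, by = 4), times = 2),
  intervention = 0
)
toy$intervention[toy$id == 1 & toy$time >= 1000 & toy$time <= 1040] <- 1

# Hypothetical helper: keep rows within `half_width` seconds of any start
keep_windows <- function(d, half_width = 300) {
  d <- d[order(d$time), ]
  prev   <- c(0, head(d$intervention, -1))          # flag on the previous row
  starts <- d$time[d$intervention == 1 & prev == 0] # 0 -> 1 transitions
  near <- vapply(d$time,
                 function(t) any(abs(t - starts) <= half_width),
                 logical(1))
  d[near, ]
}

# Apply per subject and stack the results
windows <- do.call(rbind, lapply(split(toy, toy$id), keep_windows))
```

Overlapping windows from nearby interventions are kept only once here, because each row is tested against all starts rather than being collected once per start.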
I wish to generate some Tidy data.
26 companies are observed every day for 10 days.
Each day a value is recorded.
The first day is: 2020/1/1
How do I create a sequence of dates so that the first 26 rows of the date column of the data frame are "2020/1/1" (Year, Month, Day), the next 26 rows are "2020/1/2", and so on?
Here is the data frame without the date column:
library(tidyverse)
set.seed(33)
date_chunk <- rep(as.Date("2020/1/1"), 26)
# Tidy data. 10 sequential days starting 2020/1/1/
df <- tibble(
company = rep(letters, 10),
value = sample(0:5, 260, replace = TRUE),
color = "grey"
)
You can try this
rep(seq(as.Date("2020-01-01"),as.Date("2020-01-10"),1),each=26)
This will return a vector of dates from 2020-01-01 to 2020-01-10 in which each date is repeated 26 times.
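As a small sketch of how that vector lines up with the 260 rows, here it is attached to a data frame built the same way as in the question. The name df_dates is only for illustration.

```r
# Ten consecutive days, each repeated once per company (26 letters)
dates <- rep(seq(as.Date("2020-01-01"), by = "day", length.out = 10), each = 26)

# Same layout as the question's data: companies a..z repeated for each day
df_dates <- data.frame(
  company = rep(letters, 10),
  date    = dates
)

head(df_dates, 3)
table(df_dates$date)  # 26 rows per date
```

This relies on the rows being ordered day by day, with all 26 companies listed within each day, as in the question.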
Alternatively, for each company we can add row_number() - 1 to the first element of date_chunk to get an incrementing sequence of dates.
library(dplyr)
df %>%
group_by(company) %>%
mutate(date = first(date_chunk) + row_number() - 1)
I have a question on how to count the occurrences of specified permutations in a data set in R.
I am currently working on continuous-glucose-monitoring data sets. In short, each data set has between 1500 and 2000 observations (each observation is a plasma glucose value measured every 5 minutes over 6 days).
I need to count the occurrence of glucose values below 3.9 occurring for 15 minutes or more and less than 120 minutes in a row (>3 observations and <24 observations for values <3.9 in a row) on a numeric scale.
I have made a new variable with a factor 1 or 0 for whether the plasma glucose value is below 3.9 or not.
I would then like to count the number of occurrences of permutations > three 1’s in a row and < twenty-four 1’s in a row.
Is there a function in R for this or what would be the easiest approach?
I'm not sure if I got your data structure right, but maybe the following code can still help.
I'm assuming a data structure that includes the measurement, a person id and a measurement id.
library(dplyr)
# create dummy data
set.seed(123)
data_test = data.frame(measure = rnorm(100, 3.5,2), person_id = rep(1:10, each = 10), measure_id = rep(1:10, 10))
data_test$below_criterion = 0 # indicator for measures below crit-value
data_test$below_criterion[which(data_test$measure < 3.9)] = 1 # indicator for measures below crit-value
# indicator, that shows if the current measurement is the first one below crit_val in a possible series
# shift columns, to compare current value with previous one
data_test = data_test %>% group_by(person_id) %>% mutate(prev_below_crit = c(below_criterion[1], below_criterion[1:(n()-1)]))
data_test$start_of_run = 0 # create the indicator variable
data_test$start_of_run[which(data_test$below_criterion == 1 & data_test$prev_below_crit == 0)] = 1 # if current value is below crit and previous value is above, this is the start of a series
data_test = data_test %>% group_by(person_id) %>% mutate(grouper = cumsum(start_of_run)) # helper-variable to group all the possible series within a person
data_test = data_test %>% select(measure, person_id, measure_id, below_criterion, grouper) # get rid of the previous created helper-variables
data_results = data_test %>% group_by(person_id, grouper) %>% summarise(count_below_crit = sum(below_criterion)) # count the length of each series by summing up all below_crit indicators within a person and series
data_results = data_results %>% group_by(person_id) %>% filter(count_below_crit >= 3 & count_below_crit <=24) %>% summarise(n()) # count all series within a desired length for each person
data_results
data.frame(data_test)
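For a single series, base R's rle (run-length encoding) gives the lengths of consecutive runs directly, which may be a shorter route than the helper columns above. This is a minimal sketch on made-up glucose values. The bounds min_len and max_len are assumptions, since the question states the limits both in minutes and in observation counts, so adjust them to the definition you need.

```r
# Toy glucose series (hypothetical values), one reading every 5 minutes
glucose <- c(5.0, 3.5, 3.6, 3.7, 3.8, 4.2, 3.0, 3.1, 4.5, 3.2, 3.3, 3.4, 5.1)

low  <- glucose < 3.9  # TRUE where the value is below the threshold
runs <- rle(low)       # lengths and values of consecutive runs

# Lengths of the runs that consist of low values only
low_run_lengths <- runs$lengths[runs$values]

# Assumed bounds: at least 3 and fewer than 24 consecutive low readings
min_len <- 3
max_len <- 24
n_episodes <- sum(low_run_lengths >= min_len & low_run_lengths < max_len)
```

With several people in one data frame, the same steps could be applied per person, for example with tapply or a grouped summarise.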
I have a data set that looks something like the one below. Basically, I want to check that if a particular id is present at the beginning of the year (in this case Jan 1, 2003), it is present every day until the end of the year (Dec 31, 2003). The check should then start over at the beginning of the next year, as people might change from year to year but should not change within a year. If an id is not present on a certain day, I would like to know which day and which id.
I first started with a for loop that compared consecutive days, but this is very inefficient since my data set spans roughly 50 years and will grow later on with new data.
dates <- rep(seq(as.Date("2003/01/01"), as.Date("2004/12/31"), "days"),each = 3)
id <- rep(1:3,times = length(unique(dates)))
df <- data.frame( dates = dates,id = id)
Edit: The above chunk has all the dates in it, but if I delete, for example, id = 1 on the second day, the code should tell me it is missing, so the count shouldn't be the same. Below I added the piece that deletes id = 1 on the second day.
df <- df[-4,]
The code below will make the same data set but delete id = 1 for jan 2, 2003 and jan 3, 2003. I am trying to get something that returns the id that is missing and the date.
dates <- rep(seq(as.Date("2003/01/01"), as.Date("2004/12/31"), "days"),each = 3)
id <- rep(1:3,times = length(unique(dates)))
df <- data.frame( dates = dates,id = id)
df <- df[-4,]
df <- df[-6,]
This code chunk counts the number of times a person appears in each year. If the answer is 365 (or 366 in leap years), the person was there every day of the year.
library(dplyr)
library(tidyr)
dates <- rep(seq(as.Date("2003/01/01"), as.Date("2004/12/31"), "days"),each = 3)
id <- rep(1:3,times = length(unique(dates)))
df <- data.frame( dates = dates,id = id)
dfx <- df %>%
mutate(yrs = lubridate::year(dates)) %>%
group_by(id, dates) %>%
filter(row_number()==1) %>%
group_by(id, yrs) %>%
tally
#remove values
dfa <- df[c(-4,-6),]
Then, in order to find the dates of the missing values, add an indicator column to the data set and fill in the missing dates by id. After this, the val column will have missing values. Filter the data to get the dates where an id went missing.
dfx <- dfa %>%
mutate(val = 1) %>%
complete(nesting(id),
dates = seq(min(dates),max(dates),by = "day")) %>%
filter(is.na(val))
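If the check has to restart every year with whichever ids are present on 1 January, one option is to build the expected id-by-day grid per year and report the combinations that are absent. The sketch below uses base R only. The names missing_by_year and missing_rows are for illustration, and it assumes each year's data includes 1 January.

```r
# Same toy data as in the question, with id 1 removed on 2003-01-02 and 2003-01-03
dates <- rep(seq(as.Date("2003-01-01"), as.Date("2004-12-31"), by = "day"), each = 3)
df <- data.frame(dates = dates, id = rep(1:3, times = length(unique(dates))))
df <- df[-c(4, 7), ]

df$yr <- as.integer(format(df$dates, "%Y"))

missing_by_year <- lapply(split(df, df$yr), function(d) {
  yr <- d$yr[1]
  all_days <- seq(as.Date(paste0(yr, "-01-01")),
                  as.Date(paste0(yr, "-12-31")), by = "day")
  ids <- unique(d$id[d$dates == all_days[1]])  # ids present on 1 January
  # Every id that started the year is expected on every day of that year
  expected <- data.frame(
    dates = rep(all_days, times = length(ids)),
    id    = rep(ids, each = length(all_days))
  )
  observed <- paste(d$dates, d$id)
  expected[!(paste(expected$dates, expected$id) %in% observed), ]
})

# One row per missing id/day combination across all years
missing_rows <- do.call(rbind, missing_by_year)
```

An id that first appears in the middle of a year is ignored by this check, since only the ids seen on 1 January define what is expected for that year.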