Sampling not completely at random, with boundary conditions (R)

I have summary-level data that tells me how often a group of patients actually went to the doctor up to a certain cut-off date. I do not have individual-level data; I only know that, for example, some went 5 times and some only once.
I also know that some were already patients at the beginning of the observation interval and would be expected to come more often, whereas some were new patients who entered later. If they only joined a month before the cut-off date, they would be expected to come less often than someone who was in the group from the beginning.
Of course, the patients are not well behaved, so they sometimes miss a visit, or they come more often than expected. I am setting boundary conditions that define the expected minimum and maximum number of doctor visits relative to the month in which they started appearing at the doctor.
Now, I want to distribute the actual summary level data to individuals, i.e. create a data frame that tells me during which month each individual started appearing at the doctor, and how many times they came for check-up until the cut-off date.
I am assuming this can be done with some type of random sampling, but the result needs to fit both the summary level information I have about the actual subjects as well as the boundary conditions telling how often a subject would be expected to come to the doctor relative to their joining time.
Here is some code that generates the target data frame that contains the month when the observation period starts, the respective number of doctor's visits that is expected (including boundary for minimum and maximum visits), and the associated percentage of subjects who start coming to the doctor during this month:
library(tidyverse)

months <- c("Nov", "Dec", "Jan", "Feb", "Mar", "Apr")
target.visits <- c(6, 5, 4, 3, 2, 1)
percent <- c(0.8, 0.1, 0.05, 0.03, 0.01, 0.01)

df.target <- data.frame(month = months,
                        target.visits = target.visits,
                        percent = percent) %>%
  mutate(max.visits = c(7, 6, 5, 4, 3, 2),
         min.visits = c(5, 4, 3, 2, 1, 1))
This is the data frame:
month target.visits percent max.visits min.visits
Nov 6 0.80 7 5
Dec 5 0.10 6 4
Jan 4 0.05 5 3
Feb 3 0.03 4 2
Mar 2 0.01 3 1
Apr 1 0.01 2 1
In addition, I can create the data frame that shows the actual subject n with the actual number of visits:
subj.n <- 1000
actual.visits <- c(7, 6, 5, 4, 3, 2, 1)
actual.subject.perc <- c(0.05, 0.6, 0.2, 0.06, 0.035, 0.035, 0.02)

df.observed <- data.frame(actual.visits = actual.visits,
                          actual.subj.perc = actual.subject.perc,
                          actual.subj.n = subj.n * actual.subject.perc)
Here is the data frame with the actual observations:
actual.visits actual.subj.perc actual.subj.n
7 0.050 50
6 0.600 600
5 0.200 200
4 0.060 60
3 0.035 35
2 0.035 35
1 0.020 20
Unfortunately I do not have any idea how to bring these together. I just know that if I have e.g. 60 subjects who come to the doctor 4 times during their observation period, I would like to randomly assign a starting month to each of them. However, based on the boundary conditions min.visits and max.visits, I know that it can only be a month from Dec to Feb.
Any thoughts are much appreciated.
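One possible starting point, sketched under the assumptions above and not meant as a full solution: df.target and df.observed are the data frames defined earlier, and assign_month is a helper introduced here purely for illustration. For each subject, the eligible months are those whose min.visits and max.visits bracket the observed number of visits, and one of them is drawn with probabilities proportional to the percent column.
set.seed(1)   # only to make the example reproducible

# hypothetical helper: eligible months are those whose min/max visit
# boundaries bracket the observed visit count; draw one, weighted by 'percent'
assign_month <- function(n.visits) {
  eligible <- df.target[df.target$min.visits <= n.visits &
                          n.visits <= df.target$max.visits, ]
  sample(as.character(eligible$month), 1, prob = eligible$percent)
}

# one row per subject, carrying the observed number of visits
df.subjects <- data.frame(
  subject = seq_len(subj.n),
  actual.visits = rep(df.observed$actual.visits, times = df.observed$actual.subj.n)
)
df.subjects$month <- sapply(df.subjects$actual.visits, assign_month)

head(df.subjects)
This reproduces the observed visit counts exactly and never violates the min/max boundaries, but the resulting shares of starting months will only roughly follow the percent column; matching both margins exactly would need an additional rebalancing step.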

Related

Calculate number of years worked with different end dates

Consider the following two datasets. The first dataset contains an id variable that identifies a person and the date when his or her unemployment benefits start.
The second dataset shows the number of service years, which makes it possible to calculate the maximum entitlement period. More precisely, each year is represented by a dummy variable that equals one if someone built up unemployment benefit rights in that year (i.e. if someone worked), and zero otherwise.
df1 <- data.frame(c("R005", "R006", "R007"), c(20120610, 20130115, 20141221))
colnames(df1) <- c("id", "start_UI")
df1$start_UI <- as.character(df1$start_UI)
df1$start_UI <- as.Date(df1$start_UI, "%Y%m%d")

df2 <- data.frame(c("R005", "R006", "R007"), c(1,1,1), c(1,1,1), c(0,1,1), c(1,0,1), c(1,0,1))
colnames(df2) <- c("id", "worked2010", "worked2011", "worked2012", "worked2013", "worked2014")
Just to summarize the information from the above two datasets: person R005 worked in 2010 and 2011. In 2012 this person filed for unemployment insurance. Thereafter person R005 worked again in 2013 and 2014 (we see this in dataset df2). When his unemployment spell started in 2012, his entitlement was based on the work history before he became unemployed, so the work history equals 2. In a similar vein, the employment history for R006 and R007 equals 3 and 5, respectively (for R007 we assume he worked in 2014, since he only filed for unemployment benefits in December of that year; therefore the number is 5 rather than 4).
Now my question is how I can merge these two datasets effectively so that I get the following table:
df_final<- data.frame(c("R005", "R006", "R007"), c(20120610, 20130115, 20141221), c(2,3,5))
colnames(df_final)<-c("id", "start_UI", "employment_history")
id start_UI employment_history
1 R005 20120610 2
2 R006 20130115 3
3 R007 20141221 5
I tried using aggregate, but in that case I also include work history after the year someone filed for unemployment benefits, which is something I do not want. Does anyone have an efficient way to combine the information from the two datasets above and calculate the employment history?
I appreciate any help.
base R
You should use Reduce with accumulate = TRUE: the cumulative any() flags every year from the first 0 onwards, so negating it and summing counts the working years before the unemployment spell.
df2$employment_history <- apply(df2[, -1], 1, function(x)
  sum(!Reduce(any, x == 0, accumulate = TRUE)))  # count the 1s before the first 0
merge(df1, df2[c("id", "employment_history")])
dplyr
Or use the built-in dplyr::cumany function:
library(dplyr)
library(tidyr)  # pivot_longer() comes from tidyr

df2 %>%
  pivot_longer(-id) %>%
  group_by(id) %>%
  summarise(employment_history = sum(value[!cumany(value == 0)])) %>%
  left_join(df1, .)
Output
id start_UI employment_history
1 R005 2012-06-10 2
2 R006 2013-01-15 3
3 R007 2014-12-21 5

Sum of lag functions

Within one person's data from a behavioral task, I am trying to sum the clock time at which a target appears (data$onset) and the reaction time of their response (data$Latency) to find the clock time at which they entered their response. For future data-processing reasons, these calculated values have to be placed in the data$onset column two rows down from where the target appeared on the screen. In the example below:
Item       onset  Latency
Prime       9.97        0
Target     10.70     0.45
Mask       11.02        0
Response      NA        0
Onset is how many seconds into the task the stimuli appeared, and Latency is the reaction time to the target. Latency for non-targets will always be 0, as subjects don't respond to them. In the "NA" under onset, I need that value to be the sum of the onset of the target and the reaction time to the target (10.70 + 0.45). Here is the code I have tried:
data$onset=if_else(is.na(data$onset), sum(lag(data$onset, n = 2)+lag(data$Latency, n = 2)), data$onset)
If any clarification is needed please let me know.
Since you used if_else(), I'm adding a dplyr solution:
library(dplyr)

data %>%
  mutate(onset = ifelse(is.na(onset), lag(onset, n = 2) + lag(Latency, n = 2), onset))
Output:
Item onset Latency
<fct> <dbl> <dbl>
1 Prime 9.97 0
2 Target 10.7 0.45
3 Mask 11.0 0
4 Response 11.1 0
Also note that, if you want to stick to your own syntax, the fix is simply to drop the sum(), which collapsed the two lagged vectors into a single value:
data$onset <- if_else(is.na(data$onset), lag(data$onset, n = 2) + lag(data$Latency, n = 2), data$onset)
With the example data this fills the Response row's onset with 10.70 + 0.45 = 11.15, the same value shown rounded in the output above.

R: calculate closing time for chamber N2O flux measurements

I performed static N2O chamber measurements that I would now like to analyse using the gasfluxes package (https://cran.r-project.org/web/packages/gasfluxes/gasfluxes.pdf).
I measured different samples (POTS) during 10 min intervals. Each sample was measured two times a day (SESSION: AM, PM) for 9 days. The N2O analyzer saved data (conc.) every second!
My data now looks like this:
DATE POT SESSION TIME Concentration
1: 2017-10-18T00:00:00Z O11 AM 10:16:00.746 0.3512232
2: 2017-10-18T00:00:00Z O11 AM 10:16:01.382 0.3498687
3: 2017-10-18T00:00:00Z O11 AM 10:16:02.124 0.3482681
4: 2017-10-18T00:00:00Z O11 AM 10:16:03.216 0.3459306
5: 2017-10-18T00:00:00Z O11 AM 10:16:04.009 0.3459124
6: 2017-10-18T00:00:00Z O11 AM 10:16:04.326 0.3456660
To use the package, I need to calculate closing times out of the exact time (TIME) data points. The time should look like this (table taken from the package PDF linked above):
serie V A time C
1: ID1 0.522625 1 0.0000000 0.3317823
2: ID1 0.522625 1 0.3333333 0.3304053
3: ID1 0.522625 1 0.6666667 0.3394311
4: ID1 0.522625 1 1.0000000 0.4469102
5: ID2 0.523625 1 0.0000000 0.4572708
How can I calculate this for each individual 10-minute measurement period for each pot? Basically it should list the increasing number of seconds, as my machine measured the concentration every second.
My idea is to group by "POT", "DATE" and "Session", which creates a unique identifier for one complete chamber measurement, and then loop over the groups.
I also learned that I should use "lubridate" as I'm working with times (https://data.library.virginia.edu/working-with-dates-and-time-in-r-using-the-lubridate-package/). I still don't know how to calculate time durations for my case. Do I need to write a loop?
I tried something like this, but I always get error messages (see my former question, "R: Calculate measurement time-points for separate samples"):
df.HMR %>%
  group_by(DATE, Series, Session) %>%
  mutate(dt = as.POSIXct(df.HMR$TIME, format = "%H:%M:%S"),
         time_diff = dt - lag(dt))
Error message: Column dt must be length 838 (the group size) or one, not 379698
Can anyone help me or suggest another approach? Any help is very welcome.
Many thanks!
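In case it is useful, here is a minimal sketch of the grouped calculation described above, assuming the column names from the data preview (DATE, POT, SESSION, TIME). The error quoted above comes from using df.HMR$TIME inside mutate(): that pulls the full, ungrouped column, so its length no longer matches the group size. Referring to the column simply as TIME keeps the calculation within each group, and no explicit loop is needed.
library(dplyr)

df.HMR %>%
  group_by(DATE, POT, SESSION) %>%                      # one 10-minute chamber measurement
  mutate(dt = as.POSIXct(TIME, format = "%H:%M:%OS"),   # parse clock time incl. fractional seconds
         time = as.numeric(difftime(dt, min(dt), units = "secs"))) %>%
  ungroup()
The time column then starts at 0 for every chamber closing and increases in roughly one-second steps; divide by 3600 if the closing time needs to be in hours rather than seconds, which is what the values in the example table above suggest.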

Sample exactly four maintaining almost equal sample distances

I am trying to generate appointment times for yearly scheduled visits. The available days are days = 1:365 and the first appointment is randomly chosen with first = sample(days, 1, replace = F).
Now, given the first appointment, I want to generate 3 more appointments within the 1:365 space so that there are exactly 4 appointments in total, spaced as equally as possible.
I have tried
point <- sort(c(first - 1:5*364/4, first + 1:5*364/4))
point <- point[point > 0 & point < 365]
but it does not always give me 4 appointments. I have eventually run this many times and picked only the samples with 4 appointments, but I wanted to ask if there is a more elegant way to get exactly 4 points, as equally distanced as possible.
I was thinking of equal spacing (around 91 days between appointments) in a year starting at the first appointment... Essentially one appointment per quarter of the year.
# Find how many days in a quarter of the year
quarter <- floor(365/4)
first <- sample(days, 1)                  # 'days' is 1:365, as defined in the question
all <- c(first, first + (1:3)*quarter)    # one appointment per quarter, starting at 'first'
all[all > 365] <- all[all > 365] - 365    # wrap appointments past day 365 back into 1:365
all
sort(all)
Is this what you're looking for?
set.seed(1) # for reproducible example ONLY - you need to take this out.
first <- sample(1:365,1)
points <- c(first+(0:3)*(365-first)/4)
points
# [1] 97 164 231 298
Another way uses
points <- c(first+(0:3)*(365-first)/3)
This creates 4 points equally spaced on [first, 365], but the last point will always be 365.
The reason your code gives unexpected results is that you use first - 1:5*364/4. This creates points prior to first, some of which can be negative; those are then dropped by point[point > 0 & point < 365], so you do not always end up with exactly 4 points.

R: Very basic example for averaging a Time Series

I am quite new to R and have studied several posts and websites about time series and moving averages, but I simply cannot find a useful hint for averaging over a specific period of time.
My data is a table read via read.csv with a date and time in one column and several other columns with values. The time steps in the data are not constant: sometimes 5 minutes, sometimes 2 hours. E.g.:
2014-01-25 14:50:00, 4, 8
2014-01-25 14:55:00, 3, 7
2014-01-25 15:00:00, 1, 4
2014-01-25 15:20:24, 12, 34
2014-01-25 17:19:00, 150, 225
2014-01-25 19:00:00, 300, 400
2014-01-25 21:00:00, NA, NA
2014-01-25 23:19:00, 312, 405
So I am looking for an averaging/plotting approach that
1. calculates data averages over arbitrary intervals like 30 minutes, 1 hour, 1 day etc. Finer steps should be aggregated and coarser steps should be disaggregated.
2. (removed, since it is trivial to get the value per hour from a time series D which is averaged over X hours with D/X.)
3. does not take data flagged as NA into account: the function should not interpolate/smooth through NA gaps, and a line plot should not connect the points on either side of a NA gap.
I already tried
aggregate(list(value1 = data$value1, value2 = data$value2),
          list(time = cut(data$time, "1 hour")), sum)
but this does not fulfill needs 1 and 3 and is not able to disaggregate 2-hourly data steps.
Answering point 3: plot() automatically skips NA values and breaks the line.
Try this example:
plot(c(1:5, NA, NA, 6:10), t = 'l')
Now, if you want to 'smooth' or average over time intervals purely for graphical purposes, it's probably easiest to start by splitting your data at each NA row and then doing a spline or other smoothing operation on each subsection separately.
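For the aggregation part, here is a minimal sketch building on the aggregate()/cut() attempt from the question. It assumes the data frame is called data with columns time, value1 and value2 (as in the example above), uses mean instead of sum, and keeps intervals that contain only NA as NA, so a line plot breaks at the gaps rather than interpolating across them.
data$time <- as.POSIXct(data$time)

binned <- aggregate(
  list(value1 = data$value1, value2 = data$value2),
  list(time = cut(data$time, "30 min")),    # any interval works: "30 min", "1 hour", "1 day", ...
  FUN = function(x) if (all(is.na(x))) NA_real_ else mean(x, na.rm = TRUE)
)
binned$time <- as.POSIXct(as.character(binned$time))

plot(binned$time, binned$value1, type = "l")   # the line breaks at all-NA intervals
This only covers the aggregation direction of point 1; truly disaggregating the coarse 2-hourly steps would need an extra step, e.g. repeating the coarse values over the finer intervals before binning.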
