How can I get data between two defined characters in a string in R?

I've seen this problem answered here in other languages, but can't find a solution in R:
I have a dataset where order of interactions is crucial, and depending on how the experiment has progressed, the apparatus can have one of two states. The hardware doesn't note the current state though, so the only way to separate the states is to filter the data between 'start' and 'stop' interactions. State 1 is outside of the 'start'-'stop' and state 2 is everything between a 'start' and a 'stop'.
My data is in the following format:
Time Individual Interaction
11:57:31 XX002 2
12:00:00 XX123 Start
12:00:03 XX123 1
12:00:37 XX334 2
12:01:00 NA Stop
12:04:12 XX441 2
How can I filter the data to get two separate data frames, one for all occurrences outside of 'start'-'stop', and another for everything between 'start' and 'stop'? Ideally the data would be searched chronologically for a 'start' interaction, everything between that and the next 'stop' would be filtered out, and the process would repeat (there can sometimes be multiple 'start' interactions before the next 'stop').
In this example it would result in:
Time Individual Interaction
11:57:31 XX002 2
12:04:12 XX441 2
and
Time Individual Interaction
12:00:00 XX123 Start
12:00:03 XX123 1
12:00:37 XX334 2
12:01:00 NA Stop
Thanks in advance

Using cumsum we accumulate the changes in Start and Stop. Subtracting one from the other gives 1 between a start and a stop and 0 outside. The Stop row itself also gets a 0, so we use dplyr's lag() to pull it into dfin as well.
library(dplyr) # provides lag() with a default argument
z = cumsum(df$Interaction=="Start")-cumsum(df$Interaction=="Stop")
sep = ifelse(z==0 & lag(z,default=z[1])==1,1,z)
dfin=df[sep==1,]
dfout=df[sep==0,]
> dfin
Time Individual Interaction
3 12:00:00 XX123 Start
4 12:00:03 XX123 1
5 12:00:37 XX334 2
6 12:01:00 <NA> Stop
> dfout
Time Individual Interaction
2 11:57:31 XX002 2
7 12:04:12 XX441 2
Using dplyr piping
df2=df%>%mutate(n=cumsum(Interaction=="Start")-cumsum(Interaction=="Stop"))%>%
mutate(n=ifelse(n==0 & lag(n,default=n[1])==1,1,n))%>%split(.$n)
> df2
$`0`
Time Individual Interaction n
1 11:57:31 XX002 2 0
6 12:04:12 XX441 2 0
$`1`
Time Individual Interaction n
2 12:00:00 XX123 Start 1
3 12:00:03 XX123 1 1
4 12:00:37 XX334 2 1
5 12:01:00 <NA> Stop 1
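The question mentions that several 'Start' rows can occur before the next 'Stop'. In that case the cumsum difference above can climb to 2 or more, so the 0/1 split no longer holds. The following is a minimal base R sketch, not part of the original answer, that tracks the state row by row instead. The sample data frame is rebuilt from the question, and the names state, prev and inside are made up for illustration.

```r
# Sample data rebuilt from the question (Individual is NA on the Stop row)
df <- data.frame(
  Time = c("11:57:31", "12:00:00", "12:00:03", "12:00:37", "12:01:00", "12:04:12"),
  Individual = c("XX002", "XX123", "XX123", "XX334", NA, "XX441"),
  Interaction = c("2", "Start", "1", "2", "Stop", "2"),
  stringsAsFactors = FALSE
)

# Walk through the rows once: "Start" switches the state on, "Stop" switches
# it off, anything else keeps the previous state. Repeated "Start" rows are
# harmless because the state is set, not incremented.
state <- Reduce(
  function(s, x) if (x == "Start") 1L else if (x == "Stop") 0L else s,
  df$Interaction, accumulate = TRUE, 0L
)[-1]  # drop the initial value

# A "Stop" row belongs to the block it closes, so look at the state before it
prev <- c(0L, head(state, -1))
inside <- state == 1L | (df$Interaction == "Stop" & prev == 1L)

df_in <- df[inside, ]
df_out <- df[!inside, ]
```

With the sample data, df_in holds the four rows from 12:00:00 to 12:01:00 and df_out holds the other two.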

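You may try finding the times of the start and stop interactions, and then subset the data frame based on them (this assumes a single 'Start' and a single 'Stop', and times that sort correctly, such as zero-padded HH:MM:SS strings):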
time_start <- df$Time[df$Interaction == "Start"]
time_stop <- df$Time[df$Interaction == "Stop"]
df_in <- df[df$Time >= time_start & df$Time <= time_stop,]
df_out <- df[df$Time < time_start | df$Time > time_stop,]
df_in
Time Individual Interaction
2 12:00:00 XX123 Start
3 12:00:03 XX123 1
4 12:00:37 XX334 2
5 12:01:00 <NA> Stop
df_out
Time Individual Interaction
1 11:57:31 XX002 2
6 12:04:12 XX441 2

Related

Identify if a day of the week is 2nd/3rd etc Mon/Tues/etc day of the month in R

Given a date and the day of the week it is, I want to know if there is a code that tells me which of those days of the month it is. For example in the picture below, given 2/12/2020 and "Wednesday" I want to be given the output "2" for it being the second Wednesday of the month.
You can do that in base R in essentially one operation. You also do not need the second input column.
Here is a slower walkthrough:
Code
dates <- c("2/12/2020","2/11/2020","2/10/2020","2/7/2020","2/6/2020", "2/5/2020")
Dates <- anytime::anydate(dates) ## one of several parsers
dow <- weekdays(Dates) ## for illustration, base R function
cnt <- (as.integer(format(Dates, "%d")) - 1) %/% 7 + 1
res <- data.frame(dt=Dates, dow=dow, cnt=cnt)
res
(Final) Output
R> res
dt dow cnt
1 2020-02-12 Wednesday 2
2 2020-02-11 Tuesday 2
3 2020-02-10 Monday 2
4 2020-02-07 Friday 1
5 2020-02-06 Thursday 1
6 2020-02-05 Wednesday 1
R>
Functionality like this is often found in dedicated date/time libraries. I wrapped some code from the (C++) Boost date_time library in the package RcppBDH, which made it easy to find 'the third Wednesday in the last month each quarter' and the like.
(lubridate::day(your_date) - 1) %/% 7 + 1
The idea here is that the first 7 days of the month are all the first for their weekday. Next 7 are 2nd, etc.
> (1:30 - 1) %/% 7 + 1
# [1] 1 1 1 1 1 1 1 2 2 2 2 2 2 2 3 3 3 3 3 3 3 4 4 4 4 4 4 4 5 5
Just to offer an alternative calculation for the nth weekday of the month: divide the day by 7 and always round up:
date <- lubridate::mdy("02/12/2020")
ceiling(lubridate::day(date)/7)
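The integer-division formula and the round-up formula given above should agree for every possible day of the month. A quick base R check, written here only as a sketch (the variable names are illustrative), confirms this and applies the formula to a few dates without any extra packages:

```r
# Day-of-month values 1..31 cover every possible case
d <- 1:31
by_integer_division <- (d - 1) %/% 7 + 1
by_ceiling <- ceiling(d / 7)
all(by_integer_division == by_ceiling)

# The same calculation on real dates, using only base R
dates <- as.Date(c("2020-02-12", "2020-02-05", "2020-02-29"))
nth <- (as.integer(format(dates, "%d")) - 1L) %/% 7L + 1L
data.frame(date = dates, weekday = weekdays(dates), nth = nth)
```

Here nth comes out as 2, 1 and 5: the second and first Wednesday of February 2020, and the fifth Saturday. The weekday names printed by weekdays() depend on the locale.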

Count number of rows for each row that meet a logical condition

So I have some data with a time stamp, and for each row, I want to count the number of rows that fall within a certain time window. For example, if I have the data below with a time stamp in h:mm (column ts), I want to count the number of rows that occur from that time stamp to five minutes in the past (column count). The first n rows that are less than five minutes from the first data point should be NAs.
ts data count
1:01 123 NA
1:02 123 NA
1:03 123 NA
1:04 123 NA
1:06 123 5
1:07 123 5
1:10 123 3
1:11 123 4
1:12 123 4
This is straightforward to do with a for loop, but I've been trying to implement with the apply() family and have not yet found any success. Any suggestions?
EDIT: modified to account for the potential for multiple readings per minute, raised in comment.
Data with new mid-minute reading:
library(dplyr)
df %>%
# Take the text above and convert to datetime
mutate(ts = lubridate::ymd_hms(paste(Sys.Date(), ts))) %>%
# Count how many observations per minute
group_by(ts_min = lubridate::floor_date(ts, "1 minute")) %>%
summarize(obs_per_min = sum(!is.na(data))) %>%
# Add rows for any missing minutes, count as zero observations
padr::pad(interval = "1 min") %>%
tidyr::replace_na(list(obs_per_min = 0)) %>%
# Count cumulative observations, and calc how many in window that
# begins 5 minutes ago and ends at end of current minute
mutate(cuml_count = cumsum(obs_per_min),
prior_cuml = lag(cuml_count) %>% tidyr::replace_na(0),
in_window = cuml_count - lag(prior_cuml, 5)) %>%
# Exclude unneeded columns and rows
select(-cuml_count, -prior_cuml) %>%
filter(obs_per_min > 0)
Output (now reflects add'l reading at 1:06:30)
# A tibble: 12 x 3
ts_min obs_per_min in_window
<dttm> <dbl> <dbl>
1 2018-09-26 01:01:00 1 NA
2 2018-09-26 01:02:00 1 NA
3 2018-09-26 01:03:00 1 NA
4 2018-09-26 01:04:00 1 NA
5 2018-09-26 01:06:00 2 6
6 2018-09-26 01:07:00 1 6
7 2018-09-26 01:10:00 1 4
8 2018-09-26 01:11:00 1 5
9 2018-09-26 01:12:00 1 4
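Since the question asked about the apply() family, here is a small base R sketch of the same rolling count with sapply. It is not from the answer above. It assumes the h:mm strings from the question, converts them to minutes, and treats the window as inclusive at both ends (the current minute and the five minutes before it), which is one reading of the expected output. The names mins and count are illustrative.

```r
ts <- c("1:01", "1:02", "1:03", "1:04", "1:06", "1:07", "1:10", "1:11", "1:12")

# Convert "h:mm" to minutes since midnight
parts <- strsplit(ts, ":", fixed = TRUE)
mins <- sapply(parts, function(p) as.numeric(p[1]) * 60 + as.numeric(p[2]))

# For each row, count the rows whose time lies in [m - 5, m]
count <- sapply(mins, function(m) sum(mins >= m - 5 & mins <= m))

# Rows less than five minutes after the first reading have no full window
count[mins - min(mins) < 5] <- NA
count
```

For the question's data this gives NA NA NA NA 5 5 3 4 4. Each row is compared against every other row, so the cost grows quadratically with the number of rows; for large data the cumulative-sum approach above is likely to be faster.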

Accounting for Time in R

I have a log of times for 2 periods (1 & 2) in a data frame. I need to account for the time accumulated for each person based on a third column 'in' vs 'out'. I then need to create an additional column to track the sum of accumulated time for both periods.
Period Time Subs
1 10:00 'Peter in'
1 .
1 .
1 8:00 'Peter out' #In this period he has accumulated 2 minutes
2 10:00 'Peter in'
2 .
2 2:00 'Peter out' #In this period he has accumulated 8 minutes
I know I need to use an if and ifelse statement but I'm not sure how to start. I started and stopped learning R and now I'm trying to pick back up where I left off.
It depends a lot on how your data is formatted, of course. If you have something like
df <- data.frame(Period=c(1,1,1,1,2,2,2), Time=c("10:00",NA,NA,"8:00","10:00",NA,"2:00"))
> df
Period Time
1 1 10:00
2 1 <NA>
3 1 <NA>
4 1 8:00
5 2 10:00
6 2 <NA>
7 2 2:00
If the Time variable is formatted as character, you can extract the minutes into a new column like so:
df$Min <- as.numeric(sapply(strsplit(as.character(df$Time), ":"), "[[", 1))
> df
Period Time Min
1 1 10:00 10
2 1 <NA> NA
3 1 <NA> NA
4 1 8:00 8
5 2 10:00 10
6 2 <NA> NA
7 2 2:00 2
This is much easier if you can have the Min column already as numeric!
Then, an easy way to return the total time accumulated for each period is the diff of the range for each period, within a tapply() call.
tapply(df$Min, df$Period, function(x) diff(range(x, na.rm=TRUE)))
1 2
2 8
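If the clock values can carry non-zero seconds, taking only the part before the colon loses information. The sketch below is an assumption-laden extension of the idea above: it reads Time as minutes:seconds, converts to seconds, and adds the running total across periods that the question asked for. The helper to_seconds and the column names Secs, Minutes and Cumulative are hypothetical.

```r
df <- data.frame(
  Period = c(1, 1, 1, 1, 2, 2, 2),
  Time = c("10:00", NA, NA, "8:00", "10:00", NA, "2:00"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: "mm:ss" -> seconds, NA for anything else
to_seconds <- function(x) {
  vapply(strsplit(x, ":", fixed = TRUE), function(p) {
    if (length(p) == 2) as.numeric(p[1]) * 60 + as.numeric(p[2]) else NA_real_
  }, numeric(1))
}
df$Secs <- to_seconds(df$Time)

# Elapsed minutes per period, then the running total across periods
per_period <- tapply(df$Secs, df$Period, function(x) diff(range(x, na.rm = TRUE))) / 60
result <- data.frame(
  Period = names(per_period),
  Minutes = as.vector(per_period),
  Cumulative = cumsum(as.vector(per_period))
)
result
```

With the sample data the periods come out as 2 and 8 minutes, with a cumulative total of 2 and then 10.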

Count event occurrence and assign it to individuals according to date and place of interview

I have two dataframes in R: A.df and B.df. The first contains N rows where each row is an event that happened in a certain date and place.
The second is a list of individuals that have been interviewed in a certain date and place.
For each individual, I would like to count the number of events that happened within a certain timeframe before the interview date in the same location of the individual's place of interview.
Let's say that the time frame is x days before the date of interview, and that I have computed that date and stored in the variable xdaysbefore.
Here is what the data frames look like:
A.df
#Event Date Place
1 2015-05-01 1
2 2015-03-11 1
3 2015-07-04 2
4 2015-05-10 3
B.df
#Individual Date of Interview Place xdaysbefore
1 2016-07-11 1 2014-09-11
2 2016-05-07 3 2014-07-04
3 2016-08-09 2 2014-03-22
4 2016-01-10 3 2014-09-17
Note that Date, Date of Interview and xdaysbefore are all of class Date in R.
How can I count, for each individual in B.df, the events that happened between xdaysbefore and Date of Interview in the same place as the individual's interview?
What I would expect in B.df would look like this:
B.df
#Individual Date of Interview Place xdaysbefore CountedEvents
1 2016-07-11 1 2014-09-11 2
2 2016-05-07 3 2014-07-04 1
3 2016-08-09 2 2014-03-22 1
4 2016-01-10 3 2014-09-17 1
where CountedEvents is the number of events that happened between xdaysbefore and Date of Interview, in the same location where individual i was interviewed.
You can use apply on every row of B.df.
Take the subset of A.df where the places are equal, then count the dates in that subset that fall between xdaysbefore and Date_of_Interview (sum() counts the TRUE values of the logical vector).
B.df$CountedEvents <- apply(B.df, 1, function(x) {
temp = A.df[A.df$Place %in% x[3],]
sum(temp$Date < as.Date(x[2]) & temp$Date > as.Date(x[4]))
})
B.df
# Individual Date_of_Interview Place xdaysbefore CountedEvents
#1 1 2016-07-11 1 2014-09-11 2
#2 2 2016-05-07 3 2014-07-04 1
#3 3 2016-08-09 2 2014-03-22 1
#4 4 2016-01-10 3 2014-09-17 1
EDIT
If you want to access columns with names instead of indexes, you can use
apply(B.df, 1, function(x) {
temp = A.df[A.df$Place %in% x["Place"],]
sum(temp$Date < as.Date(x["Date_of_Interview"]) &
temp$Date > as.Date(x["xdaysbefore"]))
})
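apply() converts each row of a data frame to character, which is why the dates have to be passed through as.Date() again. A variant that avoids the conversion is to loop over row indices with vapply. This is only a sketch; the data frames are rebuilt from the question with the column name Date_of_Interview assumed, as in the output above.

```r
A.df <- data.frame(
  Event = 1:4,
  Date = as.Date(c("2015-05-01", "2015-03-11", "2015-07-04", "2015-05-10")),
  Place = c(1, 1, 2, 3)
)
B.df <- data.frame(
  Individual = 1:4,
  Date_of_Interview = as.Date(c("2016-07-11", "2016-05-07", "2016-08-09", "2016-01-10")),
  Place = c(1, 3, 2, 3),
  xdaysbefore = as.Date(c("2014-09-11", "2014-07-04", "2014-03-22", "2014-09-17"))
)

# For each interview row, count events at the same place strictly inside
# the window (xdaysbefore, Date_of_Interview)
B.df$CountedEvents <- vapply(seq_len(nrow(B.df)), function(i) {
  sum(A.df$Place == B.df$Place[i] &
      A.df$Date > B.df$xdaysbefore[i] &
      A.df$Date < B.df$Date_of_Interview[i])
}, integer(1))
B.df
```

For the sample data CountedEvents is 2, 1, 1, 1, matching the expected result in the question.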
You can achieve that by using a combination of merge and aggregate:
# merge into a new dataset
AB <- merge(A, B, by = 'Place', all = TRUE)
# create a logical variable which indicates whether 'Date' falls within the range
AB$count <- AB$xdaysbefore < AB$Date & AB$Date_of_Interview > AB$Date
# aggregate into a count variable
aggregate(count ~ Individual + Date_of_Interview + xdaysbefore, AB, sum)
which gives:
Individual Date_of_Interview xdaysbefore count
1 3 2016-08-09 2014-03-22 1
2 2 2016-05-07 2014-07-04 1
3 1 2016-07-11 2014-09-11 2
4 4 2016-01-10 2014-09-17 1
Alternatively you could use a non-equi join from the data.table package (available in data.table 1.9.8 and later):
library(data.table)
# convert the dataframes to data.table's (which are enhanced dataframes)
setDT(A)
setDT(B)
# join and count
A[B, on = .(Place, Date < Date_of_Interview, Date > xdaysbefore)
][, .(count = .N), .(Individual, Place, Date_of_Interview = Date, xdaysbefore = Date.1)]
which gives:
Individual Place Date_of_Interview xdaysbefore count
1: 1 1 2016-07-11 2014-09-11 2
2: 2 3 2016-05-07 2014-07-04 1
3: 3 2 2016-08-09 2014-03-22 1
4: 4 3 2016-01-10 2014-09-17 1

R programming - Split up a group of time series indexed by ID with irregular observation periods into regular monthly observations

I have data, in a data.frame in R, on the amounts of something that users with unique IDs used between two dates.
ID start date end date amount
1 1-15-2012 2-15-2012 6000
1 2-15-2012 3-25-2012 4000
1 3-25-2012 5-26-2012 3000
1 5-26-2012 6-13-2012 1000
2 1-16-2012 2-27-2012 7000
2 2-27-2012 3-18-2012 2000
2 3-18-2012 5-23-2012 3000
....
10000 1-12-2012 2-24-2012 12000
10000 2-24-2012 3-11-2012 22000
10000 3-11-2012 5-27-2012 33000
10000 5-27-2012 6-10-2012 5000
The time series for each ID starts and ends at inconsistent times, and contain an inconsistent number of observations. However, they are all formatted in the above manner; the start and end dates are Date objects.
I would like to standardize the breakdowns for each ID to a monthly time series, with data points at the start of each month, weighing the observed amount numbers which happen to straddle two or more months accordingly.
In other words, I would like to turn this series into something like
ID start date end date amount
1 1-1-2012 2-1-2012 3096 = 6000 * 16/31
1 2-1-2012 3-1-2012 4339 = 6000*15/31+4000*14/39
1 3-1-2012 4-1-2012 etc
....
1 6-1-2012 7-1-2012 etc
2 1-1-2012 2-1-2012 etc
2 2-1-2012 3-1-2012 etc
2 3-1-2012 4-1-2012 etc
2 4-1-2012 5-1-2012 etc
2 5-1-2012 6-1-2012 etc
....
10000 1-1-2012 2-1-2012 etc
....
10000 6-1-2012 7-1-2012 etc
Where the value for ID 1 between 2/1/12 and 3/1/12 is calculated by weighing the number of days in the 1-15-2012 to 2-15-2012 observation that land in February (15 days / 31 days) with the amount in that observation span (6000) with the number of days in the 2-15 to 3-25 observation span that fall in February (14 days/ 39 days, as 2012 was a leap year) times the amount in that observation span (4000), yielding 6000*15/31+4000*14/39 = 4339. This should be done for each ID time series. We do not consider the case where the observation periods all fit into one month; but if they are spread out over more than two months they should be split up over that number of months with the appropriate weighings.
I'm rather new to r and could certainly use some help on this!
Here is a solution using base R:
#The data
df=read.table(text='ID start_date end_date amount
1 1-15-2012 2-15-2012 6000
1 2-15-2012 3-25-2012 4000
1 3-25-2012 5-26-2012 3000
1 5-26-2012 6-13-2012 1000
2 1-16-2012 2-27-2012 7000
2 2-27-2012 3-18-2012 2000
2 3-18-2012 5-23-2012 3000
10000 1-12-2012 2-24-2012 12000
10000 2-24-2012 3-11-2012 22000
10000 3-11-2012 5-27-2012 33000
10000 5-27-2012 6-10-2012 5000',
header=T,row.names = NULL,stringsAsFactors =FALSE)
df[,2]=as.Date(df[,2],"%m-%d-%Y")
df[,3]=as.Date(df[,3],"%m-%d-%Y")
df1=data.frame(n=1:length(df$ID),ID=df$ID)
df1$startm=as.Date(levels(cut(df[,2],"month"))[cut(df[,2],"month")],"%Y-%m-%d")
df1$endm=as.Date(levels(cut(df[,3],"month"))[cut(df[,3],"month")],"%Y-%m-%d")
df1=df1[,-1]
#compute days in month and total days
df$dayin=as.numeric((df1$endm-1)-df$start_date)
df$daytot=as.numeric(df$end_date-df$start_date)
#separate amount this month and next month
df$ammt=df$amount*df$dayin/df$daytot
df$ammt.1=df$amount*(df$daytot-df$dayin)/df$daytot
#using by compute new amount
df1$amount=do.call(c,
by(df[,c("ammt","ammt.1")],df$ID,function(d)d[,1]+c(0,d[-nrow(d),2]))
)
df1
> df1
ID startm endm amount
1 1 2012-01-01 2012-02-01 3096.774
2 1 2012-02-01 2012-03-01 4339.123
3 1 2012-03-01 2012-05-01 4306.038
4 1 2012-05-01 2012-06-01 1535.842
5 2 2012-01-01 2012-02-01 2500.000
6 2 2012-02-01 2012-03-01 4700.000
7 2 2012-03-01 2012-05-01 3754.545
8 10000 2012-01-01 2012-02-01 5302.326
9 10000 2012-02-01 2012-03-01 13572.674
10 10000 2012-03-01 2012-05-01 36553.571
11 10000 2012-05-01 2012-06-01 13000.000
To solve this I think the easiest way is to break it down into two problems.
How can I get a daily breakdown of the figures I'm interested in? This is my assumption based on the information you provided above.
How do I group by a date range and summarise to what I'm interested in?
For the following example, I will use the data set which I created using the code below:
df <- data.frame(
id=c(1,1,1,1,2,2,2),
start_date=as.Date(c("1-15-2012",
"2-15-2012",
"3-25-2012",
"5-26-2012",
"1-16-2012",
"2-27-2012",
"3-18-2012"), "%m-%d-%Y"),
end_date=as.Date(c("2-15-2012",
"3-25-2012",
"5-26-2012",
"6-13-2012",
"2-27-2012",
"3-18-2012",
"5-23-2012"), "%m-%d-%Y"),
amount=c(6000,
4000,
3000,
1000,
7000,
2000,
3000)
)
1. Provide daily figures
To provide the daily figures, firstly we get the daily contribution:
df$daily_contribution = df$amount/as.numeric(df$end_date - df$start_date)
Then, we will expand the date range using the start and end dates. There are a couple of ways to do this, but since you used the dplyr tag, here is the dplyr way:
library(dplyr)
df <- df %>%
rowwise() %>%
do(data.frame(id=.$id,
date=as.Date(seq(from=.$start_date, to=(.$end_date), by="day")),
daily_contribution=.$daily_contribution))
which has some output which looks like this:
Source: local data frame [285 x 3]
Groups: <by row>
id date daily_contribution
1 1 2012-01-15 193.5484
2 1 2012-01-16 193.5484
3 1 2012-01-17 193.5484
4 1 2012-01-18 193.5484
5 1 2012-01-19 193.5484
6 1 2012-01-20 193.5484
7 1 2012-01-21 193.5484
8 1 2012-01-22 193.5484
9 1 2012-01-23 193.5484
10 1 2012-01-24 193.5484
.. .. ... ...
2. Create a grouping variable
Next we create some kind of grouping variable that we're interested in. I've used lubridate for ease to get the month and year of the dates:
library(lubridate)
df$mnth=month(df$date)
df$yr=year(df$date)
Now with all of this we can easily use dplyr to summarise our information by the dates as required.
df %>%
group_by(id, mnth, yr) %>%
summarise(amount=sum(daily_contribution))
with output:
Source: local data frame [11 x 4]
Groups: id, mnth
id mnth yr amount
1 1 1 2012 3290.3226
2 1 2 2012 4441.6873
3 1 3 2012 2902.8122
4 1 4 2012 1451.6129
5 1 5 2012 1591.3978
6 1 6 2012 722.2222
7 2 1 2012 2666.6667
8 2 2 2012 4800.0000
9 2 3 2012 2436.3636
10 2 4 2012 1363.6364
11 2 5 2012 1045.4545
To get it precisely in the format you specified:
df %>% rowwise() %>%
mutate(start_date=as.Date(ISOdate(yr, mnth, 1)),
end_date=as.Date(ISOdate(yr, mnth+1, 1))) %>%
select(id, start_date, end_date, amount)
with output:
Source: local data frame [11 x 4]
Groups: <by row>
id start_date end_date amount
1 1 2012-01-01 2012-02-01 3290.3226
2 1 2012-02-01 2012-03-01 4441.6873
3 1 2012-03-01 2012-04-01 2902.8122
4 1 2012-04-01 2012-05-01 1451.6129
5 1 2012-05-01 2012-06-01 1591.3978
6 1 2012-06-01 2012-07-01 722.2222
7 2 2012-01-01 2012-02-01 2666.6667
8 2 2012-02-01 2012-03-01 4800.0000
9 2 2012-03-01 2012-04-01 2436.3636
10 2 2012-04-01 2012-05-01 1363.6364
11 2 2012-05-01 2012-06-01 1045.4545
as needed.
Note: your example has 3096 = 6000 * 16/31 and 4339 = 6000*15/31+4000*14/39, whereas the code above counts both the start date and the end date of each observation. For the first observation, for example, 15 Jan to 31 Jan is 17 days when the range is inclusive. You can easily adjust the sequence of dates if a different convention is required.
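The question's figures (16/31 and 15/31 for the first observation, 14/39 for the second) are consistent with counting each observation from the day after its start date up to and including its end date. Under that assumption, which is my reading of the numbers rather than something the question states, a base R sketch of the daily expansion looks like this. The names daily and monthly are illustrative, and only the first two rows of ID 1 are used.

```r
df <- data.frame(
  id = c(1, 1),
  start_date = as.Date(c("2012-01-15", "2012-02-15")),
  end_date = as.Date(c("2012-02-15", "2012-03-25")),
  amount = c(6000, 4000)
)

# Expand each observation to one row per day in (start_date, end_date]
daily <- do.call(rbind, lapply(seq_len(nrow(df)), function(i) {
  days <- seq(df$start_date[i] + 1, df$end_date[i], by = "day")
  data.frame(id = df$id[i], date = days, amount = df$amount[i] / length(days))
}))

# Sum the daily amounts per id and calendar month
daily$month <- format(daily$date, "%Y-%m")
monthly <- aggregate(amount ~ id + month, data = daily, FUN = sum)
monthly
```

This gives 6000*16/31 (about 3096.77) for January and 6000*15/31 + 4000*14/39 (about 4339.12) for February, the two values quoted in the question. Because each day is assigned to its own month, observations that span three or more months are split across all of them.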
Here's a solution using plyr and reshape. The numbers aren't the same as what you provided, so I may have misunderstood your intent though this seems to meet your stated goal (weighted average of amount by month).
df$index <- 1:nrow(df) #Create a unique index number
#Format the dates from factors to dates
df$start.date <- as.Date(df$start.date, format="%m/%d/%Y")
df$end.date <- as.Date(df$end.date, format="%m/%d/%Y")
library(plyr); library(reshape) #Load the libraries
#dlply = (d)ataframe to (l)ist using (ply)r
#Subset on dataframe by "index" and perform a function on each subset called "X"
#Create a list containing:
# ID, each day from start to end date, amount recorded over that day
df2 <- dlply(df, .(index), function(X) {
ID <- X$ID #Keep the ID value
n.days <- as.numeric(difftime( X$end.date, X$start.date )) #Calculate time difference in days, report the result as a number
day <- seq(X$start.date, X$end.date, by="days") #Sequence of days
amount.per.day <- X$amount/n.days #Amount for that day
data.frame(ID, day, amount.per.day) #Last line is the output
})
#Change list back into data.frame
df3 <- ldply(df2, data.frame) #ldply = (l)ist to (d)ataframe using (ply)r
df3$mon <- as.numeric(format(df3$day, "%m")) #Assign a month to all dates
#Summarize by each ID and month: add up the daily amounts
ddply(df3, .(ID, mon), summarise, amount = sum(amount.per.day))
# ID mon amount
# 1 1 1 3290.3226
# 2 1 2 4441.6873
# 3 1 3 2902.8122
# 4 1 4 1451.6129
# 5 1 5 1591.3978
# 6 1 6 722.2222
# 7 2 1 2666.6667
# 8 2 2 4800.0000
# 9 2 3 2436.3636
# 10 2 4 1363.6364
# 11 2 5 1045.4545
Incidentally, for future posts, you can get faster answers if you provide the code to replicate your data. If your code is somewhat complicated, you can use dput(yourdata).
HTH!

Resources