Time intervals between resightings of several individuals

In R, I need to calculate several time interval variables between resightings of marked individuals. I have a dataset similar to this:
ID Time Day Month
a 11.15 13 6
a 12.35 13 6
a 10.02 14 6
a 19.30 15 6
a 20.46 15 6
.
.
.
b 11.12 8 7
etc
Each ID represents a different animal marked for individual recognition, and each row contains the date and time at which it was resighted.
For each individual, I need to calculate the number of days the animal was observed, the mean and standard deviation of the number of relocations per day, and the mean and standard deviation of the days elapsed between relocations (counting 0 days between observations made on the same day).
Ideally, I need to obtain a data frame such as this:
ID N.Obs N.days mean.Obs.per.Day m.O.D.sd mean.days.elapsed mde.sd
a 27 7 4.2 1.1 1.5 0.5
b 32 5 3.4 0.4 3.2 0.7
c 17 6 4.4 0.2 4.5 1.2
d etc
I've been doing this with the tapply function and transferring the results to Excel, but I am sure there must be a relatively simple piece of code which could do the whole process in R.

The OP has requested six statistics aggregated per ID. Four of them can be computed directly by grouping on ID. Two (mean.Obs.per.Day and m.O.D.sd) need to be grouped by date and ID first.
Unfortunately, the time stamps are split across three fields, Time, Day, and Month, with the year missing. As four of the statistics are based on dates, we need to construct a Date column which combines Day, Month, and a dummy year.
The code below utilises the data.table and lubridate packages for efficiency.
library(data.table)
# coerce to data.table and add a Date column; the year argument of
# make_date() is deliberately left empty so that it falls back to its
# default dummy year (1970)
setDT(DF)[, Date := lubridate::make_date(, Month, Day)]
# aggregate by ID,
# using a temporary variable to hold the day differences between resightings
agg_per_id <- DF[, {
  tmp <- as.numeric(diff(Date))
  .(N.Obs = .N, N.days = uniqueN(Date),
    mean.days.elapsed = mean(tmp),
    mde.sd = sd(tmp))
}, by = ID]
# count observations per Date and ID, then aggregate the daily counts by ID
agg_per_day_and_id <- DF[, .N, by = .(ID, Date)][
  , .(mean.Obs.per.Day = mean(N), m.O.D.sd = sd(N)), by = ID]
# join partial results
result <- agg_per_day_and_id[agg_per_id, on = "ID"]
# reorder columns (for comparison with expected result)
setcolorder(result, c("ID", "N.Obs", "N.days", "mean.Obs.per.Day",
"m.O.D.sd", "mean.days.elapsed", "mde.sd"))
result
ID N.Obs N.days mean.Obs.per.Day m.O.D.sd mean.days.elapsed mde.sd
1: a 5 3 1.666667 0.5773503 0.5 0.5773503
2: b 1 1 1.000000 NA NaN NA
Note that the figures differ from the expected result of the OP due to different input data.
Data
As far as provided by the OP
DF <- readr::read_table(
"ID Time Day Month
a 11.15 13 6
a 12.35 13 6
a 10.02 14 6
a 19.30 15 6
a 20.46 15 6
b 11.12 8 7"
)
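For comparison, here is a roughly equivalent dplyr pipeline. This is a sketch only, assuming DF already carries the Date column constructed above; it has not been benchmarked against the data.table version.
library(dplyr)
# per-ID statistics, reusing diff(Date) for the elapsed-day figures
per_id <- DF %>%
  group_by(ID) %>%
  summarise(N.Obs = n(),
            N.days = n_distinct(Date),
            mean.days.elapsed = mean(as.numeric(diff(Date))),
            mde.sd = sd(as.numeric(diff(Date))))
# per-day counts, then their mean and sd per ID
per_day <- DF %>%
  count(ID, Date) %>%
  group_by(ID) %>%
  summarise(mean.Obs.per.Day = mean(n), m.O.D.sd = sd(n))
left_join(per_id, per_day, by = "ID")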

Related

Cumsum function step wise in R

I am facing a problem: I calculated the monthly interest for a mortgage, but I need to sum the results so that I have them yearly (always 12 months).
H <- 2000000 # mortgage
i.m <- 0.03/12 # rate per month
year <- 15 # years
a <- (H*i.m*(1+i.m)^(12*year))/
((1+i.m)^(12*year)-1)
a # monthly payment
interest <- a*(1-(1/(1+i.m)^(0:(year*12))))
interest
cumsum(a*(1-(1/(1+i.m)^(0:(year*12))))) # I want one cumulative value per year: the first 12 values, then the next 12 on top of those, and so on
You may do this with tapply in base R.
monthly <- cumsum(a*(1-(1/(1+i.m)^(0:(year*12)))))
# monthly is a non-decreasing running total, so the yearly figure is its
# last (i.e. largest) value within each 12-month block
yearly <- tapply(monthly, ceiling(seq_along(monthly)/12), max)
I think you can use the following solution:
monthly <- cumsum(a*(1-(1/(1+i.m)^(0:(year*12)))))
sapply(split(monthly, ceiling(seq_along(monthly) / 12)), function(x) x[length(x)])
1 2 3 4 5 6 7 8
2254.446 9334.668 21098.218 37406.855 58126.414 83126.695 112281.337 145467.712
9 10 11 12 13 14 15 16
182566.812 223463.138 268044.605 316202.434 367831.057 422828.023 481093.905 486093.905
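Since monthly is a running total, the same "last value per year" can also be pulled out by plain indexing. A minimal sketch (it only covers full 12-month blocks, so the trailing partial block is dropped):
idx <- seq(12, length(monthly), by = 12) # positions 12, 24, ..., 180
monthly[idx]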

I need to find the mean for the data with cells without values

I need to find the average prices for all the different weeks, and to make a ggplot showing how the price develops during the year.
When you calculate the mean, how do the empty cells affect it?
I have tried several things, including the melt() function, so that I only have 3 variables. The variables are factors of which I want to find the mean.
Company variable value
ns Price week 24 1749
ns Price week 24
ns Price week 24 1599
ns Price week 24
ns Price week 24
ns Price week 24 359
ns Price week 24 460
I have more than 300K observations and would love to have a small data.frame containing only the Company and the mean price per week. Right now I have all observations for each week, and I need their means for ggplot.
When I use the following code
dat %>% mutate(means = mean(value), na.rm = TRUE)
I get a warning message saying argument is not numeric or logical: returning NA.
I am looking forward to getting your help!
Clean code from PavoDive's comment:
dt[!is.na(value), mean(value), by = .(Price, Week)]
and even better
dt[, mean(value, na.rm = TRUE), by = .(Price, Week)]
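As to the question of how the empty cells affect the mean: without na.rm = TRUE, a single NA makes the whole result NA, which is easy to verify on a toy vector:
mean(c(1749, NA, 1599)) # NA: the missing cell propagates
mean(c(1749, NA, 1599), na.rm = TRUE) # 1674: NAs are dropped first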
Original:
This works using data.table. The first part keeps only the rows that have a number in value (an NA fails both comparisons and drops out). Next we ask for the average of the value column. Finally, the by defines how to group the rows.
Code:
dt[value > 0 | value < 1, .(MeanValues = mean(value)), by = c("Price", "Week")][]
Input:
library(data.table)
dt <- data.table(Price = c("A","B","B","A","A","B","B","A"),
                 Week  = c(1,2,1,1,2,2,1,2),
                 value = c(3,7,2,NA,1,46,1,NA))
Price Week value
1: A 1 3
2: B 2 7
3: B 1 2
4: A 1 NA
5: A 2 1
6: B 2 46
7: B 1 1
8: A 2 NA
Output:
1: A 1 3.0
2: B 2 26.5
3: B 1 1.5
4: A 2 1.0
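Since the OP was attempting a dplyr pipeline, a roughly equivalent sketch with the same toy dt (untested on the full 300K rows) would be:
library(dplyr)
dt %>%
  group_by(Price, Week) %>%
  summarise(MeanValues = mean(value, na.rm = TRUE), .groups = "drop")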

How to calculate aggregate statistics on a dataframe in R by applying conditions on time values?

I am working on climate data analysis. After loading the file in R, my interest is to subset the data based on the hours of the day.
For time analysis we can use $hour on the variable in which the time vector has been stored, if we want to deal with hours.
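For instance, if the time vector is stored as POSIXlt, the hour really is a component:
lt <- as.POSIXlt("2018-01-31 12:43:33")
lt$hour # 12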
I want to subset my data for each hour of the day across 365 days and then average the data at a particular hour over the year. Say I am interested in the values of irradiation/wind speed etc. at 12:00 PM for a year, and then in the mean of those values.
I know how to subset a data frame based on conditions. If, for example, my data is in a matrix called data with two columns, say time and wind speed, and I'm interested in the rows in which irradiation isn't zero, I can do that with the following code:
my_data <- subset(data, data[,1]>0)
but how can I subset on the hour values in the time column stored in data?
I hope I made sense in this question.
Thanks in advance!
Here is a possible solution. You can create an hourly grouping with format(df$time, '%H'), which extracts just the hour from each timestamp; we can then simply group by this new column and calculate the mean of each group.
df <- data.frame(time = seq(Sys.time(), Sys.time() + 2*60*60*24, by = 'hour'),
                 val = sample(seq(5), 49, replace = TRUE))
library(dplyr)
df %>%
  mutate(hour = format(time, '%H')) %>%
  group_by(hour) %>%
  summarize(mean_val = mean(val))
To subset the non-zero values first, you can do either:
df <- subset(df, val != 0)
or start the dplyr chain with:
df %>% filter(val != 0)
Hope this helps!
df looks as follows:
time val
1 2018-01-31 12:43:33 4
2 2018-01-31 13:43:33 2
3 2018-01-31 14:43:33 2
4 2018-01-31 15:43:33 3
5 2018-01-31 16:43:33 3
6 2018-01-31 17:43:33 1
7 2018-01-31 18:43:33 2
8 2018-01-31 19:43:33 4
... ... ... ...
And the output:
# A tibble: 24 x 2
hour mean_val
<chr> <dbl>
1 00 3.50
2 01 3.50
3 02 4.00
4 03 2.50
5 04 3.00
6 05 2.00
.... ....
This assumes your time column is already of class POSIXct; otherwise you'd first have to convert it, for example with as.POSIXct(x, format = '%Y-%m-%d %H:%M:%S').
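If you'd rather avoid dplyr, the same hourly means can be computed in base R; a minimal sketch using the df from above:
df$hour <- format(df$time, '%H')
aggregate(val ~ hour, data = df, FUN = mean)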

Dates are not keeping specified format in R data frame

Simply put, I'm grabbing dates for events that meet certain conditions in df1 and putting them in a new data frame (df2). The formatting of the dates in df2 should be the same as in df1 ("2000-09-12", or %Y-%m-%d). However, the dates in df2 read "11212", "11213", etc.
To generate the data:
"Date"<-c("2000-09-08", "2000-09-11","2000-09-12","2000-09-13","2000-09-14","2000-09-15","2000-09-18","2000-09-19","2000-09-20","2000-09-21", "2000-09-22","2000-09-25")
"Event"<-c("A","N","O","O","O","O","N","N","N","N","N","A")
df1<-data.frame(Date,Event)
df1
Date Event
1 2000-09-08 A
2 2000-09-11 N
3 2000-09-12 O
4 2000-09-13 O
5 2000-09-14 O
6 2000-09-15 O
7 2000-09-18 N
8 2000-09-19 N
9 2000-09-20 N
10 2000-09-21 N
11 2000-09-22 N
12 2000-09-25 A
Here is the code:
df2 <- data.frame()
tmp <- data.frame(1, 2)
i <- c(1:4)
for (x in i) {
  date1 <- df1$Date[df1$Event == "O"][x]
  date2 <- df1$Date[df1$Event == "A" & df1$Date >= date1][1]
  tmp[1, 2] <- as.numeric(difftime(date2, date1))
  tmp[1, 1] <- as.Date(as.character(df1$Date[df1$Event == "O"][x]), "%Y-%m-%d") ## the culprit
  df2 <- rbind(df2, tmp)
}
Loop output looks like this:
X1 X2
1 11212 13
2 11213 12
3 11214 11
4 11215 10
I want it to look like this:
X1 X2
1 "2000-09-12" 13
2 "2000-09-13" 12
3 "2000-09-14" 11
4 "2000-09-14" 10
If I understand correctly, the OP wants to find for each "O" event the difference in days to the next following "A" event.
This can be solved using a rolling join. We extract the "O" events and the "A" events into two separate data.tables and join both on date.
This will avoid all the hassle with the data format and works also if df1 is not already ordered by Date.
library(data.table)
setDT(df1)[Event == "A"][df1[Event == "O"],
on = "Date", roll = -Inf, .(Date, x.Date - i.Date)]
Date V2
1: 2000-09-12 13 days
2: 2000-09-13 12 days
3: 2000-09-14 11 days
4: 2000-09-15 10 days
Note that roll = -Inf rolls backwards (next observation carried backward (NOCB)) because the date of the next "A" event is required.
Data
Date <- as.Date(c("2000-09-08", "2000-09-11","2000-09-12","2000-09-13","2000-09-14","2000-09-15",
"2000-09-18","2000-09-19","2000-09-20","2000-09-21", "2000-09-22","2000-09-25"))
Event <- c("A","N","O","O","O","O","N","N","N","N","N","A")
df1 <- data.frame(Date,Event)
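For comparison, the same "days to the next A event" can be computed without data.table; a base R sketch, assuming Date is of class Date as in the Data block above:
o <- df1$Date[df1$Event == "O"]
a <- df1$Date[df1$Event == "A"]
# for each O date, days to the earliest A date on or after it
sapply(seq_along(o), function(i) as.numeric(min(a[a >= o[i]]) - o[i]))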

(In)correct use of a linear time trend variable, and most efficient fix?

I have 3133 rows representing payments made on some of the 5296 days between 7/1/2000 and 12/31/2014; that is, the "Date" feature is non-continuous:
> head(d_exp_0014)
Year Month Day Amount Count myDate
1 2000 7 6 792078.6 9 2000-07-06
2 2000 7 7 140065.5 9 2000-07-07
3 2000 7 11 190553.2 9 2000-07-11
4 2000 7 12 119208.6 9 2000-07-12
5 2000 7 16 1068156.3 9 2000-07-16
6 2000 7 17 0.0 9 2000-07-17
I would like to fit a linear time trend variable,
t <- 1:3133
to a linear model explaining the variation in the Amount of the expenditure.
fit_t <- lm(Amount ~ t + Count, d_exp_0014)
However, this is obviously wrong: t increments by one per row, while the underlying dates advance by varying numbers of days:
> head(exp)
Year Month Day Amount Count Date t
1 2000 7 6 792078.6 9 2000-07-06 1
2 2000 7 7 140065.5 9 2000-07-07 2
3 2000 7 11 190553.2 9 2000-07-11 3
4 2000 7 12 119208.6 9 2000-07-12 4
5 2000 7 16 1068156.3 9 2000-07-16 5
6 2000 7 17 0.0 9 2000-07-17 6
To me this is the exact opposite of a linear trend.
What is the most efficient way to get this data.frame merged to a continuous date-index? Will a date vector like
CTS_date_V <- as.data.frame(seq(as.Date("2000/07/01"), as.Date("2014/12/31"), "days"), colnames = "Date")
yield different results?
I'm open to any packages (using fpp, forecast, timeSeries, xts, ts, as of right now); just looking for a good answer to deploy in functional form, since these payments are going to be updated every week and I'd like to automate the append to this data.frame.
I think some kind of transformation to a regular (continuous) time series is a good idea.
You can use xts to transform the time series data (it is handy because it can be used in other packages like a regular ts).
Filling the gaps
library(xts)
# convert myDate to POSIXct if necessary
# create an xts object from data frame x
ts1 <- xts(data.frame(a = x$Amount, c = x$Count), x$myDate)
ts1
# create an empty (zero-width) time series with one entry per day
ts_empty <- xts(, seq(from = start(ts1), to = end(ts1), by = "DSTday"))
# merge the empty series with the data and fill the gaps with 0
ts2 <- merge(ts1, ts_empty, fill = 0)
# or interpolate, for example:
ts2 <- merge(ts1, ts_empty, fill = NA)
ts2 <- na.locf(ts2)
# zoo/xts-ready fill functions are:
# na.locf   - carry the previous value forward
# na.approx - linear interpolation
# na.spline - cubic spline interpolation
Deduplicate dates
In your sample there is no sign of duplicated dates, but based on a newer question of yours it is very likely that they occur. I think you want to aggregate such values with the sum function:
ts1 <- period.apply(ts1, endpoints(ts1, 'days'), colSums) # colSums keeps the columns separate; a plain sum would collapse them into one value
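Once the gaps are filled, the continuous trend index the question asks for is simply the row number of the regularised series. A sketch (the column names a and c come from the ts1 construction above):
d_filled <- data.frame(Date = index(ts2), coredata(ts2))
d_filled$t <- seq_len(nrow(d_filled))
fit_t <- lm(a ~ t + c, data = d_filled)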
