I am looking to convert the following R data frame into one that is indexed by seconds, but I have no idea how to do it. Maybe dcast, but then I'm confused about how to expand out the word that's being spoken.
startTime endTime word
1 1.900s 2.300s hey
2 2.300s 2.800s I'm
3 2.800s 3s John
4 3s 3.400s right
5 3.400s 3.500s now
6 3.500s 3.800s I
7 3.800s 4.300s help
Time word
1.900s hey
2.000s hey
2.100s hey
2.200s hey
2.300s I'm
2.400s I'm
2.500s I'm
2.600s I'm
2.700s I'm
2.800s John
2.900s John
3.000s right
3.100s right
3.200s right
3.300s right
One solution can be achieved using tidyr::expand.
EDITED: Based on feedback from the OP, whose data contains duplicate startTime values.
library(tidyverse)
step <- 0.1
df %>%
  group_by(rnum = row_number()) %>%
  expand(Time = seq(startTime, max(startTime, endTime - step), by = step), word = word) %>%
  arrange(Time) %>%
  ungroup() %>%
  select(-rnum)
# # A tibble: 24 x 2
# # Groups: word [7]
# Time word
# <dbl> <chr>
# 1 1.90 hey
# 2 2.00 hey
# 3 2.10 hey
# 4 2.20 hey
# 5 2.30 I'm
# 6 2.40 I'm
# 7 2.50 I'm
# 8 2.60 I'm
# 9 2.70 I'm
# 10 2.80 John
# ... with 14 more rows
Data
df <- read.table(text =
"startTime endTime word
1.900 2.300 hey
2.300 2.800 I'm
2.800 3 John
3 3.400 right
3.400 3.500 now
3.500 3.800 I
3.800 4.300 help",
header = TRUE, stringsAsFactors = FALSE)
dcast() is used for reshaping data from long to wide format (thereby aggregating), while the OP wants to reshape from wide to long format, thereby filling in the missing timestamps.
There is an alternative approach which uses a non-equi join.
Prepare data
However, startTime and endTime need to be turned into numeric variables after removing the trailing "s" before we can proceed.
library(data.table)
cols <- stringr::str_subset(names(DF), "Time$")
setDT(DF)[, (cols) := lapply(.SD, function(x) as.numeric(stringr::str_replace(x, "s", ""))),
.SDcols = cols]
Non-equi join
A sequence of timestamps covering the whole period is created and right joined to the dataset, but only those timestamps which fall within the given interval are retained. From the accepted answer, it seems that endTime must not be included in the result, so the join condition has to be adjusted accordingly.
DF[DF[, CJ(time = seq(min(startTime), max(endTime), 0.1))],
on = .(startTime <= time, endTime > time), nomatch = 0L][
, endTime := NULL][] # a bit of clean-up
startTime word
1: 1.9 hey
2: 2.0 hey
3: 2.1 hey
4: 2.2 hey
5: 2.3 I'm
6: 2.4 I'm
7: 2.5 I'm
8: 2.6 I'm
9: 2.7 I'm
10: 2.8 John
11: 2.9 John
12: 3.0 right
13: 3.1 right
14: 3.2 right
15: 3.3 right
16: 3.4 now
17: 3.5 I
18: 3.6 I
19: 3.7 I
20: 3.8 help
21: 3.9 help
22: 4.0 help
23: 4.1 help
24: 4.2 help
startTime word
Note that this approach does not require introducing row numbers.
nomatch = 0L avoids NA rows in case of gaps in the dialogue.
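A hypothetical illustration of what nomatch = 0L prevents: if the dialogue had a pause after 4.3s (extending the sample data with a made-up extra row), the default nomatch = NA would return rows with NA in word for the uncovered timestamps:
# made-up extra row to create a gap between 4.3s and 5.0s
DF2 <- rbind(DF, data.table(startTime = 5.0, endTime = 5.4, word = "again"))
# with the default nomatch = NA, the uncovered timestamps (e.g. 4.3 ... 4.9) appear as NA rows
DF2[DF2[, CJ(time = seq(min(startTime), max(endTime), 0.1))],
    on = .(startTime <= time, endTime > time)]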
Data
library(data.table)
DF <- fread("
rn startTime endTime word
1 1.900s 2.300s hey
2 2.300s 2.800s I'm
3 2.800s 3s John
4 3s 3.400s right
5 3.400s 3.500s now
6 3.500s 3.800s I
7 3.800s 4.300s help
", drop = 1L)
Related
Does anyone know if it is possible to use the na.approx function to interpolate depending on a varying timeframe?
Let's say we have a df like this (the real df has over 5,000 rows):
Date, Value
2022-05-01, 6
2022-05-02, 5
2022-05-03, NA
2022-05-06, NA
2022-05-07, 14
2022-05-08, 15
2022-05-09, NA
2022-05-10, 67
I want to conduct a linear interpolation depending on the date range. For example, the two NA's at the beginning: 1) 14 - 5 = 9, 2) counting the days from 2022-05-02 until 2022-05-06 = 5 days, so we divide: 3) 9/5 = 1.8.
Value for NA at 2022-05-03 is 6.8 and for 2022-05-06 is 8.6.
Second example at 2022-05-09: 1) 67-15 = 52, 2) 2022-05-08 until 2022-05-10 = 3 days, 3) 52/3 = 17.33333. Value for NA at 2022-05-09 is 32.33333 (= 15 + 17.33333)
Is this possible to conduct it with the na.approx function? If not, how can I approach this?
Using the data frame DF defined reproducibly in the Note at the end, we see that what is asked for, i.e. linear interpolation with respect to Date, is what you get (after correcting the calculations in the question) if you apply na.approx to a zoo series:
library(zoo)
z <- read.zoo(DF)
na.approx(z)
## 2022-05-01 2022-05-02 2022-05-03 2022-05-06 2022-05-07 2022-05-08 2022-05-09
## 6.0 5.0 6.8 12.2 14.0 15.0 41.0
## 2022-05-10
## 67.0
Or, in terms of the original data frame, we can use the x= argument of na.approx to specify that interpolation is with respect to Date.
DF$Date <- as.Date(DF$Date)
transform(DF, Value = na.approx(Value, Date, na.rm = FALSE))
## Date Value
## 1 2022-05-01 6.0
## 2 2022-05-02 5.0
## 3 2022-05-03 6.8
## 4 2022-05-06 12.2
## 5 2022-05-07 14.0
## 6 2022-05-08 15.0
## 7 2022-05-09 41.0
## 8 2022-05-10 67.0
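To see where these values come from, the interpolation can be checked by hand (a quick sketch using the slope between the surrounding known points):
# slope between 2022-05-02 (5) and 2022-05-07 (14): (14 - 5) / 5 days = 1.8 per day
5 + 1.8 * 1    # 2022-05-03 ->  6.8
5 + 1.8 * 4    # 2022-05-06 -> 12.2
# slope between 2022-05-08 (15) and 2022-05-10 (67): (67 - 15) / 2 days = 26 per day
15 + 26 * 1    # 2022-05-09 -> 41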
I suggest you review the documentation via ?na.approx in R.
Note
Lines <- "Date, Value
2022-05-01, 6
2022-05-02, 5
2022-05-03, NA
2022-05-06, NA
2022-05-07, 14
2022-05-08, 15
2022-05-09, NA
2022-05-10, 67"
DF <- read.csv(text = Lines, strip.white = TRUE)
I can do it with a bit of a slog, but I may need some help with the date diffs. For the first case there's 5 days between 02/05 & 07/05. In the second there's 2 days, not 3 between 08/05 & 10/05. Have I missed something? :)
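For reference, those day counts can be checked directly with base Date arithmetic:
as.Date("2022-05-07") - as.Date("2022-05-02")   # Time difference of 5 days
as.Date("2022-05-10") - as.Date("2022-05-08")   # Time difference of 2 days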
Code below:
# get data into required shape, and using data.table package
df <- read.table(text="
Date, Value
2022-05-01, 6
2022-05-02, 5
2022-05-03, NA
2022-05-06, NA
2022-05-07, 14
2022-05-08, 15
2022-05-09, NA
2022-05-10, 67
", header=T)
library(data.table)
library(zoo)
library(lubridate)
dt <- as.data.table(df)
dt[, Date := lubridate::ymd(gsub(",","",`Date.`))]
setorder(dt, Date)
# first step, fill in to get the starting value
dt[, Value2 := zoo::na.locf0(Value)]
# group together the rows, only really interested in the NA ones,
# ensuring they are grouped together. rleid makes a group where it finds new values
dt[, Group := rleid(is.na(Value))]
# find the value after the NA
dt[, ValueNext := shift(Value2, n=1, type="lead")]
# find the dates before and after the NA period
dt[, DatePre := shift(Date, n=1, type="lag")]
dt[, DateNext := shift(Date, n=1, type="lead")]
# find the differences in the values & dates
dt[, ValueDiff := ValueNext[.N]-Value2[1], by=Group]
dt[, DateDiff := as.integer(DateNext[.N]-DatePre[1]), by=Group]
# divide through to get the addition
dt[, ValueAdd := ValueDiff/DateDiff]
# by group, use cumulative sum to add to the starting value
dt[, ValueOut := Value2+cumsum(ValueAdd), by=Group]
# we only care about NA groups, so revert back to original value for other
# cases
dt[!is.na(Value), ValueOut := Value]
# check the NA rows
# ! only difference is I get 2 as the date diff for 2022-05-09, not 3
dt[is.na(Value),]
# Final output
dt[, .(Date, Value, ValueOut)]
I have a data.frame df that has monthly data:
Date Value
2008-01-01 3.5
2008-02-01 9.5
2008-03-01 0.1
I want there to be data on every day in the month (and I will assume Value does not change during each month) since I will be merging this into a different table that has monthly data.
I want the output to look like this:
Date Value
2008-01-02 3.5
2008-01-03 3.5
2008-01-04 3.5
2008-01-05 3.5
2008-01-06 3.5
2008-01-07 3.5
2008-01-08 3.5
2008-01-09 3.5
2008-01-10 3.5
2008-01-11 3.5
2008-01-12 3.5
2008-01-13 3.5
2008-01-14 3.5
2008-01-15 3.5
2008-01-16 3.5
2008-01-17 3.5
2008-01-18 3.5
2008-01-19 3.5
2008-01-20 3.5
2008-01-21 3.5
2008-01-22 3.5
2008-01-23 3.5
2008-01-24 3.5
2008-01-25 3.5
2008-01-26 3.5
2008-01-27 3.5
2008-01-28 3.5
2008-01-29 3.5
2008-01-30 3.5
2008-01-31 3.5
2008-02-01 9.5
I have tried to.daily but my call:
df <- to.daily(df$Date)
returns
Error in to.period(x, "days", name = name, ...) : ‘x’ contains no data
Not sure if I understood perfectly, but I think something like this may work.
First, I define the monthly data table:
library(data.table)
DT_month = data.table(Date = as.Date(c("2008-01-01", "2008-02-01", "2008-03-01", "2008-05-01", "2008-07-01")),
                      Value = c(3.5, 9.5, 0.1, 5, 8))
Then, you have to do the following
DT_month[,Month:=month(Date)]
DT_month[,Year:=year(Date)]
start_date=min(DT_month$Date)
end_date=max(DT_month$Date)
DT_daily=data.table(Date=seq.Date(start_date,end_date,by="day"))
DT_daily[,Month:=month(Date)]
DT_daily[,Year:=year(Date)]
DT_daily[,Value:=-100]
for (i in unique(DT_daily$Year)) {
  for (j in unique(DT_daily$Month)) {
    if (length(DT_month[Year == i & Month == j, Value]) != 0) {
      DT_daily[Year == i & Month == j, Value := DT_month[Year == i & Month == j, Value]]
    }
  }
}
Basically, the code will define the month and year of each monthly value in separate columns.
Then, it will create a vector of daily data using the minimum and maximum dates in your monthly data, and will create two separate columns for year and month for the daily data as well.
Finally, it goes through every combination of year and month, filling the daily values with the monthly ones. In case there is no data for a certain combination of month and year, it will show -100 (see the quick check below).
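A small sketch to spot which daily rows were left unfilled, using the -100 sentinel defined above:
# rows where no monthly value existed for that year/month combination
DT_daily[Value == -100]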
Please let me know if it works.
An option using tidyr::expand to expand each row from the 1st day of the month to the last day of the month. lubridate::floor_date() provides the 1st day of the month and lubridate::ceiling_date() - days(1) provides the last day of the month.
library(tidyverse)
library(lubridate)
df %>%
  mutate(Date = ymd(Date)) %>%
  group_by(Date) %>%
  expand(Date = seq(floor_date(Date, unit = "month"),
                    ceiling_date(Date, unit = "month") - days(1), by = "day"),
         Value) %>%
  as.data.frame()
# Date Value
# 1 2008-01-01 3.5
# 2 2008-01-02 3.5
# 3 2008-01-03 3.5
# 4 2008-01-04 3.5
# 5 2008-01-05 3.5
#.....so on
# 32 2008-02-01 9.5
# 33 2008-02-02 9.5
# 34 2008-02-03 9.5
# 35 2008-02-04 9.5
# 36 2008-02-05 9.5
#.....so on
# 85 2008-03-25 0.1
# 86 2008-03-26 0.1
# 87 2008-03-27 0.1
# 88 2008-03-28 0.1
# 89 2008-03-29 0.1
# 90 2008-03-30 0.1
# 91 2008-03-31 0.1
Data:
df <- read.table(text =
"Date Value
2008-01-01 3.5
2008-02-01 9.5
2008-03-01 0.1",
header = TRUE, stringsAsFactors = FALSE)
to.daily can only be applied to xts/zoo objects and can only convert to a LOWER frequency, i.e. from daily to monthly, but not the other way round.
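For reference, a minimal sketch of the direction these to.* conversions do support (the data here is made up):
library(xts)
# an illustrative daily series
daily <- xts(1:60, order.by = seq(as.Date("2008-01-01"), by = "day", length.out = 60))
to.monthly(daily)   # daily -> monthly works (an OHLC-style summary per month)
# there is no conversion from monthly down to daily, which is why to.daily() fails here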
One easy way to accomplish what you want is converting df to an xts object:
library(xts)
df$Date <- as.Date(df$Date)   # xts needs a time-based index
df.xts <- xts(df$Value, order.by = df$Date)
And merge, like so:
na.locf(merge(df.xts, foo = zoo(NA, order.by = seq(start(df.xts), end(df.xts),
                                                   by = "day")))[, 1])
df.xts
2018-01-01 3.5
2018-01-02 3.5
2018-01-03 3.5
2018-01-04 3.5
2018-01-05 3.5
2018-01-06 3.5
2018-01-07 3.5
….
2018-01-27 3.5
2018-01-28 3.5
2018-01-29 3.5
2018-01-30 3.5
2018-01-31 3.5
2018-02-01 9.5
2018-02-02 9.5
2018-02-03 9.5
2018-02-04 9.5
2018-02-05 9.5
2018-02-06 9.5
2018-02-07 9.5
2018-02-08 9.5
….
2018-02-27 9.5
2018-02-28 9.5
2018-03-01 0.1
If you want to adjust the price continuously over the course of a month use na.spline in place of na.locf.
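A sketch of that substitution, reusing the daily merge from above with na.spline as the fill function:
# interpolate smoothly between the monthly anchor points instead of repeating them
na.spline(merge(df.xts, foo = zoo(NA, order.by = seq(start(df.xts), end(df.xts),
                                                     by = "day")))[, 1])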
Maybe not the most efficient one, but with base R we can do:
do.call("rbind", lapply(1:nrow(df), function(i)
data.frame(Date = seq(df$Date[i],
(seq(df$Date[i],length=2,by="months") - 1)[2], by = "1 days"),
value = df$Value[i])))
We basically generate a sequence of dates from the start date to the last day of that month, which is calculated by
(seq(df$Date[i], length = 2, by = "months") - 1)[2]
and repeat the same value for all the dates and put them in the data frame.
We get a list of data frames and then we can rbind them using do.call.
Another way:
library(lubridate)
d <- read.table(text = "Date Value
2008-01-01 3.5
2008-02-01 9.5
2008-03-01 0.1",
stringsAsFactors = FALSE, header = TRUE)
Dates <- seq(from = min(as.Date(d$Date)),
to = ceiling_date(max(as.Date(d$Date)), "month") - days(1),
by = "1 days")
data.frame(Date = Dates,
Value = setNames(d$Value, d$Date)[format(Dates, format = "%Y-%m-01")])
I want to connect two datasets with each other by adding a new column called Average. This column is the average of the durations between Date and Date - diff. I got two datasets, the first one is called data and looks like this:
Date Weight diff Loc.nr
2013-01-24 1040 7 2
2013-01-31 1000 7 2
2013-01-19 500 4 9
2013-01-23 1040 4 9
2013-01-28 415 5 9
2013-01-31 650 3 9
The other one is called Rain.duration, in the column Duration are the hours of rain on that day. This dataset looks like this:
Date Duration
2013-01-14 4.5
2013-01-15 0.0
2013-01-16 6.9
2013-01-17 0.0
2013-01-18 1.8
2013-01-19 2.1
2013-01-20 0.0
2013-01-21 0.0
2013-01-22 4.3
2013-01-23 0.0
2013-01-24 7.5
2013-01-25 4.7
2013-01-26 0.0
2013-01-27 0.7
2013-01-28 5.0
2013-01-29 0.0
2013-01-30 3.1
2013-01-31 2.8
I made a code to do this:
for (i in 1:nrow(data)) {
  for (j in 1:nrow(Rain.duration)) {
    if (data$Date[i] == Rain.duration$Date[j]) {
      average <- as.array(Rain.duration$Duration[(j - data$diff[i]):j])
      j <- nrow(Rain.duration)
    }
  }
  data$Average[i] <- mean(average)
}
The problem with this code is that, because of the size of my datasets, it takes about 3 days to run. Is there a faster way to do this?
My expected outcome is:
Date Weight diff Loc.nr Average
2013-01-24 1040 7 2 1.96
2013-01-31 1000 7 2 2.98
2013-01-19 500 4 9 2.16
2013-01-23 1040 4 9 1.28
2013-01-28 415 5 9 2.98
2013-01-31 650 3 9 2.73
Here's a dplyr solution:
library(dplyr)
# add row number as a new column just to make it easier to read
weather_with_rows <- Weather %>%
  mutate(Rownum = row_number())
# write function to filter by row number, then return the average duration
getavgduration <- function(mydate, mydiff) {
  myrow = weather_with_rows %>%
    filter(Date == mydate) %>%
    pluck("Rownum")
  mystartrow = myrow - mydiff
  myduration = weather_with_rows %>%
    filter(
      Rownum <= myrow,
      Rownum >= mystartrow
    )
  mean(myduration$Duration)
}
# get the average duration for each Date/diff pair
averages <- data %>%
  group_by(Date, Diff) %>%
  summarize(Average = getavgduration(Date, Diff)) %>%
  ungroup()
# join this back into the original data frame
# this step might not be necessary
# and might be a big drag on performance,
# depending on the size of your real data
data_with_avg_duration <- data %>%
  left_join(averages, by = c('Date', 'Diff'))
This old question does not have an accepted answer yet, so I feel obliged to post an alternative solution which aggregates in a non-equi join.
The OP has requested to compute the average duration of rain from a table Rain.duration of daily hours of rain fall for each date interval given in data.
library(data.table)
# make sure Date columns are of class Date
setDT(data)[, Date := as.Date(Date)]
setDT(Rain.duration)[, Date := as.Date(Date)]
# aggregate in a non-equi join and assign the result to a new column
data[, Average := Rain.duration[data[, .(upper = Date, lower = Date - diff)],
on = .(Date <= upper, Date >= lower),
mean(Duration), by = .EACHI]$V1][]
Date Weight diff Loc.nr Average
1: 2013-01-24 1040 7 2 1.962500
2: 2013-01-31 1000 7 2 2.975000
3: 2013-01-19 500 4 9 2.160000
4: 2013-01-23 1040 4 9 1.280000
5: 2013-01-28 415 5 9 2.983333
6: 2013-01-31 650 3 9 2.725000
The key part is
Rain.duration[data[, .(upper = Date, lower = Date - diff)],
on = .(Date <= upper, Date >= lower),
mean(Duration), by = .EACHI]
Date Date V1
1: 2013-01-24 2013-01-17 1.962500
2: 2013-01-31 2013-01-24 2.975000
3: 2013-01-19 2013-01-15 2.160000
4: 2013-01-23 2013-01-19 1.280000
5: 2013-01-28 2013-01-23 2.983333
6: 2013-01-28 2013-01-23 2.983333
7: 2013-01-31 2013-01-28 2.725000
which does a non-equi join with the date ranges derived from data:
data[, .(upper = Date, lower = Date - diff)]
upper lower
1: 2013-01-24 2013-01-17
2: 2013-01-31 2013-01-24
3: 2013-01-19 2013-01-15
4: 2013-01-23 2013-01-19
5: 2013-01-28 2013-01-23
6: 2013-01-28 2013-01-23
7: 2013-01-31 2013-01-28
by = .EACHI requests to compute the aggregate mean(Duration) for each date interval on the fly, which avoids creating and copying temporary subsets.
Note that this solution will give correct answers even if Rain.duration has gaps or is unordered, as it relies only on Date, as opposed to the other solutions which use row numbers.
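As a quick illustrative check (just a sketch), reshuffling Rain.duration and repeating the aggregation returns the same averages, since the join matches on Date rather than on row position:
set.seed(42)
shuffled <- Rain.duration[sample(.N)]
shuffled[data[, .(upper = Date, lower = Date - diff)],
         on = .(Date <= upper, Date >= lower),
         mean(Duration), by = .EACHI]$V1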
So let's take the following data.table. It has dates and a column of numbers. I'd like to get the week of each date and then aggregate (sum) over each two-week period.
Date <- as.Date(c("1980-01-01", "1980-01-02", "1981-01-05", "1981-01-05", "1982-01-08", "1982-01-15", "1980-01-16", "1980-01-17",
"1981-01-18", "1981-01-22", "1982-01-24", "1982-01-26"))
Runoff <- c(2, 1, 0.1, 3, 2, 5, 1.5, 0.5, 0.3, 2, 1.5, 4)
DT <- data.table(Date, Runoff)
DT
So from the date, I can easily get the year and week.
DT[,c("Date_YrWeek") := paste(substr(Date,1,4), week(Date), sep="-")][]
What I'm struggling with is aggregating every two weeks.
I thought that I'd get the first date for each week and filter using those values. Unfortunately, that would be pretty foolish.
DT[,.(min(Date)),by=.(Date_YrWeek)][order(Date)]
The final result would end up being the sum of every two weeks.
weeks sum_value
1 and 2 ...
3 and 4 ...
5 and 6 ...
Anyone have an efficient way to do this with data.table?
1) Define the two week periods as starting from the minimum Date. Then we can get the total Runoff for each such period like this.
DT[, .(sum_value = sum(Runoff)),
keyby = .(Date = 14 * (as.numeric(Date - min(Date)) %/% 14) + min(Date))]
giving the following where the Date column is the date of the first day of the two week period.
Date sum_value
1: 1980-01-01 3.0
2: 1980-01-15 2.0
3: 1980-12-30 3.1
4: 1981-01-13 2.3
5: 1981-12-29 2.0
6: 1982-01-12 6.5
7: 1982-01-26 4.0
2) If you prefer the text shown in the question for the first column then:
DT[, .(sum_value = sum(Runoff)),
keyby = .(two_week = as.numeric(Date - min(Date)) %/% 14)][
, .(weeks = paste(2*two_week + 1, "and", 2*two_week + 2), sum_value)]
giving:
weeks sum_value
1: 1 and 2 3.0
2: 3 and 4 2.0
3: 53 and 54 3.1
4: 55 and 56 2.3
5: 105 and 106 2.0
6: 107 and 108 6.5
7: 109 and 110 4.0
Update: Revised and added (2).
With tidyverse and lubridate:
library(tidyverse)
library(lubridate)
summary <- DT %>%
  mutate(TwoWeeks = round_date(Date, "2 weeks")) %>%
  group_by(TwoWeeks) %>%
  summarise(sum_value = sum(Runoff))
summary
# A tibble: 9 × 2
TwoWeeks sum_value
<date> <dbl>
1 1979-12-30 3.0
2 1980-01-13 1.5
3 1980-01-20 0.5
4 1981-01-04 3.1
5 1981-01-18 0.3
6 1981-01-25 2.0
7 1982-01-10 2.0
8 1982-01-17 5.0
9 1982-01-24 5.5
Lubridate's round_date() will aggregate dates within ranges you can specify through size and unit, in this case, "2 weeks". round_date()'s output is the first calendar day of that period.
I am loading a data.table from CSV file that has date, orders, amount etc. fields.
The input file occasionally does not have data for all dates. For example, as shown below:
> NADayWiseOrders
date orders amount guests
1: 2013-01-01 50 2272.55 149
2: 2013-01-02 3 64.04 4
3: 2013-01-04 1 18.81 0
4: 2013-01-05 2 77.62 0
5: 2013-01-07 2 35.82 2
In the above 03-Jan and 06-Jan do not have any entries.
Would like to fill the missing entries with default values (say, zero for orders, amount etc.), or carry the last value forward (e.g., 03-Jan will reuse the 02-Jan values and 06-Jan will reuse the 05-Jan values, etc.).
What is the best/optimal way to fill in such gaps of missing dates with such default values?
The answer here suggests using allow.cartesian = TRUE and expand.grid for missing weekdays - it may work for weekdays (since there are just 7 of them) - but I am not sure if that would be the right way to go about dates as well, especially if we are dealing with multi-year data.
The idiomatic data.table way (using rolling joins) is this:
setkey(NADayWiseOrders, date)
all_dates <- seq(from = as.Date("2013-01-01"),
to = as.Date("2013-01-07"),
by = "days")
NADayWiseOrders[J(all_dates), roll=Inf]
date orders amount guests
1: 2013-01-01 50 2272.55 149
2: 2013-01-02 3 64.04 4
3: 2013-01-03 3 64.04 4
4: 2013-01-04 1 18.81 0
5: 2013-01-05 2 77.62 0
6: 2013-01-06 2 77.62 0
7: 2013-01-07 2 35.82 2
Here is how you fill in the gaps within subgroups:
# a toy dataset with gaps in the time series
dt <- as.data.table(read.csv(textConnection('"group","date","x"
"a","2017-01-01",1
"a","2017-02-01",2
"a","2017-05-01",3
"b","2017-02-01",4
"b","2017-04-01",5')))
dt[,date := as.Date(date)]
# the desired dates by group
indx <- dt[,.(date=seq(min(date),max(date),"months")),group]
# key the tables and join them using a rolling join
setkey(dt,group,date)
setkey(indx,group,date)
dt[indx,roll=TRUE]
#> group date x
#> 1: a 2017-01-01 1
#> 2: a 2017-02-01 2
#> 3: a 2017-03-01 2
#> 4: a 2017-04-01 2
#> 5: a 2017-05-01 3
#> 6: b 2017-02-01 4
#> 7: b 2017-03-01 4
#> 8: b 2017-04-01 5
Not sure if it's the fastest, but it'll work if there are no NAs in the data:
# just in case these aren't Dates.
NADayWiseOrders$date <- as.Date(NADayWiseOrders$date)
# all desired dates.
alldates <- data.table(date=seq.Date(min(NADayWiseOrders$date), max(NADayWiseOrders$date), by="day"))
# merge
dt <- merge(NADayWiseOrders, alldates, by="date", all=TRUE)
# now carry forward last observation (alternatively, set NA's to 0)
require(xts)
na.locf(dt)
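Alternatively, to set the NAs to zero instead of carrying the last value forward, data.table's setnafill can be used (a sketch assuming the merged dt from above and data.table >= 1.12.4):
library(data.table)
# replace, in place, the NAs introduced by the merge with zeros
setnafill(dt, fill = 0, cols = c("orders", "amount", "guests"))
dt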