Match values with multiple conditions using two data.frames - r

I'm fairly new in R and need some help.
I have two dataframes with rather similar information. The first dataframe has information about misconnections for an airline, whereas the other one is the entire timetable for the same airline. Now, what I need is to make a new column in the misconnection data.frame including flights from the timetable that can replace the delayed flights on the transit.
The flights that I want to replace need to meet a range of conditions (within a certain time-horizon, needs to be the same weekday and it needs to fly to the same destination). I addition, I want R to choose the flight that is closest (by time) to the new arrival time at a transit(from the misconnection data.frame).
The misconnection data.frame looks like the following (1620 lines in total):
miscon <- data.frame(flight.date = as.Date(c("2019-08-05", "2019-10-03", "2019-07-21", "2019-05-29"), format="%Y-%m-%d"),
Outbound.airport = c("MXP", "KRK", "KLU", "OTP"),
arr.time = as.POSIXct(c("19:25:00", "20:52:00", "07:33:00", "18:49:00"), format="%H:%M:%S"),
next.pos.dep = as.POSIXct(c("19:36:00", "21:17:00", "07:58:00", "19:14:00"), format="%H:%M:%S"),
weekday = c("4", "7", "7", "3"))
view(miscon)
flight.date Outbound.airport arr.time next.pos.dep Weekday
1 2019-08-05 MXP 19:25:00 19:36:00 4
2 2019-10-03 KRK 20:52:00 21:17:00 7
3 2019-07-21 KLU 07:33:00 07:58:00 7
4 2019-05-29 OTP 18:49:00 19:14:00 3
And the timetable data.frame would look like this:
tt <- data.frame(start.date = as.Date(c("2019-03-25", "2019-05-02", "2019-07-30", "2019-05-29"), format="%Y-%m-%d"),
end.date = as.Date(c("2019-10-21", "2019-10-27", "2019-08-26", "2019-06-01"), format="%Y-%m-%d"),
weekday = c("1234567", "1.3..67", "1.34567", "..3.5.."),
Outbound.airport = c("KLU", "KLU", "MXP", "OTP"),
dep.time = as.POSIXct(c("12:20:00", "15:55:00", "19:55:00", "20:34:00"), format="%H:%M:%S"))
view(tt)
start.date end.date Weekday Outbound.airport dep.time
1 2019-03-25 2019-10-21 1234567 KLU 12:20:00
2 2019-05-02 2019-10-27 1.3..67 KLU 15:55:00
3 2019-07-30 2019-08-26 1.34567 MXP 19:55:00
4 2019-03-30 2019-06-01 ..3.5.. OTP 20:34:00
In Excel, this problem is solved using Index matching, which I've managed. However, the problem is slightly to big for excel to handle which is why I need to convert this to R. Did try with the match and mutate function in R, but seems like the values I'm matching must be equal - which I do not expect mine to be.
Also found an interesting solution to a similar problem using the DescTools package, which I tried to implemt with no success.
get_close2 <- function(xx=tt, yy=miscon) {
pos <- vector(mode = "numeric")
for(i in 1:dim(yy)[1]) {
pos[i] <- DescTools::Closest(xx$dep.time, yy$next.pos.dep[i])
#print(pos[i])
yy$new.flight[i] <- pos[i]
}
out <- yy
return(out)
}
get_close2()
For this one, I tried with only one condition. It generated a column, but with NA's only. Obviously, I am far out right now, which is why I'm reaching out for help. Hope the problem was clear. The end result would preferrably look something like the following:
miscon
flight.date Outbound.airport arr.time next.pos.dep Weekday new.flight.time
1 2019-12-05 MXP 19:25:00 19:36:00 4 19:55:00
2 2019-10-03 KRK 20:52:00 21:17:00 7 NA
3 2019-07-21 KLU 07:33:00 07:58:00 7 12:20:00
4 2019-05-29 OTP 18:49:00 19:14:00 3 20:34:00

I think you can do it as follows. First, I would rearrange the Weekday column so that you have one row for each weekday a flight is going:
library(data.table)
library(dplyr)
library(tidyr)
tt <- tt %>% separate(weekday, into = as.character(1:7), sep = 1:6) %>%
gather(key="key", value="weekday", -c(start.date, end.date, Outbound.airport, dep.time)) %>%
filter(weekday %in% 1:7) %>%
select(-key)
Then I would do a left join of miscon and tt on the airport and weekday.
tt <- data.table(tt)
miscon <- data.table(miscon)
setkey(miscon, Outbound.airport, weekday)
setkey(tt, Outbound.airport, weekday)
df <- tt[miscon]
Check if flight date is on a valid date:
df = df[flight.date>=start.date & flight.date<=end.date]
Now you have a data.frame of all possible connections. The only thing left is to find the minimum time between the flights for each connection.
df[,timediff:= dep.time-arr.time, by=.(weekday, Outbound.airport)]
Now you can filter the rows by the minimum time delay (timediff):
df = df[ , .SD[which.min(timediff)], by=.(weekday, Outbound.airport, flight.date, arr.time, next.pos.dep)]
setnames(df, "dep.time", "new.flight.time")
> df
weekday Outbound.airport flight.date arr.time next.pos.dep start.date end.date new.flight.time timediff
1: 7 KLU 2019-07-21 2020-04-27 07:33:00 2020-04-27 07:58:00 2019-03-25 2019-10-21 2020-04-27 12:20:00 17220 secs
2: 4 MXP 2019-08-05 2020-04-27 19:25:00 2020-04-27 19:36:00 2019-07-30 2019-08-26 2020-04-27 19:55:00 1800 secs
3: 3 OTP 2019-05-29 2020-04-27 18:49:00 2020-04-27 19:14:00 2019-05-29 2019-06-01 2020-04-27 20:34:00 6300 secs
The solution is a bit of a mix of dplyr and data.table.

Ok, it's not pretty but you have a fairly complex issue, and it's not fully clear to me if this gives you what you are looking for - you will need to check it on a larger dataset than the small example you provide to be sure first.
# setup
library(data.table)
setDT(tt)
setDT(miscon)
# make tt long format splitting weekdays out
tt <- melt(tt[, paste("V", 1:7, sep = "") := tstrsplit(weekday, "")][, -"weekday"], measure.vars = paste("V", 1:7, sep = ""))[value != "."][, c("weekday", "value", "variable") := .(value, NULL, NULL)]
# join, calculate time difference, convert format of times, rank on new.dep.time within group, and filter
newDT <- miscon[tt, on = c("Outbound.airport", "weekday"), nomatch = 0][
, new.dep.time := as.numeric(dep.time - arr.time)][
, c("arr.time", "dep.time", "next.pos.dep") := .(format(arr.time, "%H:%M"), format(dep.time, "%H:%M"), format(next.pos.dep, "%H:%M"))][
, new.dep.rank := rank(new.dep.time), by = c("Outbound.airport", "weekday")][
new.dep.rank == 1, -c("new.dep.rank", "new.dep.time")]

Related

Separate operations on groups of time series values identified by same flag in R

Does anyone have a solution to perform
separate operations on
groups of consecutive values that are a
subset of a time series and are
identified by a reoccurring, identical flag
with R ?
In the example data set created by the code below, this would refer for example to calculating the mean of “value” separately for each group where “flag” == 1 on consecutive days.
A typical case in science would be a data set recorded by an instrument that repeatedly executes a calibration procedure and flags the corresponding data with the same flag, but the user needs to evaluate each calibration separately with the same procedure.
Thanks for your suggestions. Jens
library(lubridate)
df <- data.frame(
date = seq(ymd("2018-01-01"), ymd("2018-06-29"), by = "days"),
flag = rep( c(rep(1,10), rep(0, 20)), 6),
value = seq(1,180,1)
)
The data.table function rleid is great for giving group IDs to runs of consecutive values. I continue to use data.table, but you could everything but the rleid part just as well in dplyr or base.
My answer comes down to use data.table::rleid and then pick your favorite way to take the mean by group (R-FAQ link).
library(data.table)
setDT(df)
df[, r_id := rleid(flag)]
df[flag == 1, list(
min_date = min(date),
max_date = max(date),
mean_value = mean(value)
), by = r_id]
# r_id min_date max_date mean_value
# 1: 1 2018-01-01 2018-01-10 5.5
# 2: 3 2018-01-31 2018-02-09 35.5
# 3: 5 2018-03-02 2018-03-11 65.5
# 4: 7 2018-04-01 2018-04-10 95.5
# 5: 9 2018-05-01 2018-05-10 125.5
# 6: 11 2018-05-31 2018-06-09 155.5

How to use TTR::adjRatios() with tidyquant?

I'm trying to reproduce this example which adjusts stock prices for dividends using the tidyquant framework.
Here is the original example:
library(quantmod)
library(tidyquant)
library(timetk)
SPY.Close <- Cl(getSymbols("SPY", auto.assign=FALSE))
SPY.Div <- getDividends("SPY", auto.assign=FALSE)
# Within xts framework
SPY <- merge(SPY.Close, SPY.Div)
# now adjust close for dividends
ratios <- adjRatios(dividends=SPY[,"SPY.div"], close=SPY[,"SPY.Close"])
SPY$SPY.Adjusted <- (ratios$Split * ratios$Div) * SPY$SPY.Close
# only keep dates from the original object
SPY <- SPY[index(SPY.Close), ]
Here is my attempt to do this in tidyquant:
#convert xts to tibble
spy.tbl <- tk_tbl(merge(SPY.Close, SPY.Div), preserve_index = TRUE)
#add a splits placeholder because adjRatios() complains if its not there.
spy.tbl$SPY.splits <- 0
spy.adj <- spy.tbl %>%
tq_mutate(
select = c(index, SPY.Close, SPY.div, SPY.splits),
mutate_fun = adjRatios,
splits = SPY.splits,
dividends = SPY.div,
close = SPY.Close
)
but this gives the error:
Error in fun_transmute(., ...) : unused argument (.)
I've tried various combinations of arguments, but I can't seem to make it work.
Just in case anyone searches for this topic, I've solved my own problem with the following code. The benefit of this is that it is done in the tidyverse framework and is easily expanded to many tickers with group_by(ticker).
data is a data frame with Close and previous close:
Date ticker Close Cl.prev
1 2017-08-14 SPY_US 246.54 244.12
2 2017-08-15 SPY_US 246.51 246.54
3 2017-08-16 SPY_US 246.94 246.51
4 2017-08-17 SPY_US 243.09 246.94
5 2017-08-18 SPY_US 242.71 243.09
6 2017-08-21 SPY_US 242.90 242.71
div.data is a tibble with only the dividend payments, Date here is the Ex-Div date.
ticker Date div
2 SPY_US 2017-09-15 1.234574
3 SPY_US 2017-12-15 1.351333
4 SPY_US 2018-03-16 1.096775
5 SPY_US 2018-06-15 1.245568
This chain merges the price data into the div.data in order to get the prices to calculate the adjRatio
div.data <- div.data %>%
left_join(., data[, c("Date", "ticker", "Close", "Cl.prev")], by = c("ticker", "Date"))
This chain calculates the adjRatio:
div.data <- div.data %>%
mutate(ratio = 1-div / Cl.prev) %>%
mutate(adjRatio = rev(cumprod(rev(ratio)))) %>%
select(-Close, -Cl.prev, -ratio)
This chain merges the div.data back into the price series, propagates the adjRatio and calculates the Adjusted Close:
data.adj <- data %>%
left_join(., div.data, by = c("ticker", "Date") ) %>%
mutate(adjRatio = dplyr::lead(adjRatio, n=1)) %>%
mutate(adjRatio = na.locf(adjRatio, fromLast = TRUE, na.rm = FALSE)) %>%
mutate(adjRatio = na.fill(adjRatio, fill = 1.0)) %>%
mutate(Cl.adj = Close * adjRatio) %>%
select(-Cl.prev, -div, -adjRatio)
Here is the final data:
> head(data.adj)
Date ticker Close Cl.adj
1 2017-08-14 SPY_US 246.54 242.0153
2 2017-08-15 SPY_US 246.51 241.9858
3 2017-08-16 SPY_US 246.94 242.4079
4 2017-08-17 SPY_US 243.09 238.6286
5 2017-08-18 SPY_US 242.71 238.2556
6 2017-08-21 SPY_US 242.90 238.4421
At the moment, there are only two forms of tq_mutate() and tq_mutate_xy(). The adjRatios() function has 3 inputs, which would require x,y,z.

Moving average over 5 years with irregular dates

I have a large number of files (~1200) which each contains a large timeserie with data about the height of the groundwater. The starting date and length of the serie is different for each file. There can be large data gaps between dates, for example (small part of such a file):
Date Height (cm)
14-1-1980 7659
28-1-1980 7632
14-2-1980 7661
14-3-1980 7638
28-3-1980 7642
14-4-1980 7652
25-4-1980 7646
14-5-1980 7635
29-5-1980 7622
13-6-1980 7606
27-6-1980 7598
14-7-1980 7654
28-7-1980 7654
14-8-1980 7627
28-8-1980 7600
12-9-1980 7617
14-10-1980 7596
28-10-1980 7601
14-11-1980 7592
28-11-1980 7614
11-12-1980 7650
29-12-1980 7670
14-1-1981 7698
28-1-1981 7700
13-2-1981 7694
17-3-1981 7740
30-3-1981 7683
14-4-1981 7692
14-5-1981 7682
15-6-1981 7696
17-7-1981 7706
28-7-1981 7699
28-8-1981 7686
30-9-1981 7678
17-11-1981 7723
11-12-1981 7803
18-2-1982 7757
16-3-1982 7773
13-5-1982 7753
11-6-1982 7740
14-7-1982 7731
15-8-1982 7739
14-9-1982 7722
14-10-1982 7794
15-11-1982 7764
14-12-1982 7790
14-1-1983 7810
28-3-1983 7836
28-4-1983 7815
31-5-1983 7857
29-6-1983 7801
28-7-1983 7774
24-8-1983 7758
28-9-1983 7748
26-10-1983 7727
29-11-1983 7782
27-1-1984 7801
28-3-1984 7764
27-4-1984 7752
28-5-1984 7795
27-7-1984 7748
27-8-1984 7729
28-9-1984 7752
26-10-1984 7789
28-11-1984 7797
18-12-1984 7781
28-1-1985 7833
21-2-1985 7778
22-4-1985 7794
28-5-1985 7768
28-6-1985 7836
26-8-1985 7765
19-9-1985 7760
31-10-1985 7756
26-11-1985 7760
20-12-1985 7781
17-1-1986 7813
28-1-1986 7852
26-2-1986 7797
25-3-1986 7838
22-4-1986 7807
27-5-1986 7785
24-6-1986 7787
26-8-1986 7744
23-9-1986 7742
22-10-1986 7752
1-12-1986 7749
17-12-1986 7758
I want to calculate the average height over 5 years. So, in case of the example 14-1-1980 + 5 years, 14-1-1985 + 5 years, .... The amount of datapoints is different for each calculation of the average. It is very likely that the date 5 years later will not be in the dataset as a datapoint. Hence, I think I need to tell R somehow to take an average in a certain timespan.
I searched on the internet but didn't find something that fitted my needs. A lot of useful packages like uts, zoo, lubridate and the function aggregate passed by. Instead of getting closer to the solution I get more and more confused about which approach is the best for my problem.
Thanks a lot in advance!
As #vagabond points out, you'll want to combine your 1200 files into a single data frame (the plyr package would allow you to do something simple like: data.all <- adply(dir([DATA FOLDER]), 1, read.csv).
Once you have the data, the first step would be to transform the Date column into proper POSIXct date data. Right now the data appear to be strings, and we want them to have an underlying numerical representation (which POSIXct does):
library(lubridate)
df$date.new <- as.Date(dmy(df$Date))
Date Height date.new
1 14-1-1980 7659 1980-01-14
2 28-1-1980 7632 1980-01-28
3 14-2-1980 7661 1980-02-14
4 14-3-1980 7638 1980-03-14
5 28-3-1980 7642 1980-03-28
6 14-4-1980 7652 1980-04-14
Note that the date.new column looks like a string, but is in fact Date data, and can be handled with numerical operations (addition, comparison, etc.).
Next, we might construct a set of date periods, over which we want to compute averages. Your example mentions 5 years, but with the data you provided, that's not a very illustrative example. So here I'm creating 1-year periods starting at every day between Jan 14 1980 and Jan 14 1985
date.start <- as.Date(as.Date('1980-01-14') : as.Date('1985-01-14'), origin = '1970-01-01')
date.end <- date.start + years(1)
dates <- data.frame(start = date.start, end = date.end)
start end
1 1980-01-14 1981-01-14
2 1980-01-15 1981-01-15
3 1980-01-16 1981-01-16
4 1980-01-17 1981-01-17
5 1980-01-18 1981-01-18
6 1980-01-19 1981-01-19
Then we can use the dplyr package to move through each row of this data frame and compute a summary average of Height:
library(dplyr)
df.mean <- dates %>%
group_by(start, end) %>%
summarize(height.mean = mean(df$Height[df$date.new >= start & df$date.new < end]))
start end height.mean
<date> <date> <dbl>
1 1980-01-14 1981-01-14 7630.273
2 1980-01-15 1981-01-15 7632.045
3 1980-01-16 1981-01-16 7632.045
4 1980-01-17 1981-01-17 7632.045
5 1980-01-18 1981-01-18 7632.045
6 1980-01-19 1981-01-19 7632.045
The foverlaps function is IMHO the perfect candidate for such a situation:
library(data.table)
library(lubridate)
# convert to a data.table with setDT()
# convert the 'Date'-column to date-format
# create a begin & end date for the required period
setDT(dat)[, Date := as.Date(Date, '%d-%m-%Y')
][, `:=` (begindate = Date, enddate = Date + years(1))]
# set the keys (necessary for the foverlaps function)
setkey(dat, begindate, enddate)
res <- foverlaps(dat, dat, by.x = c(1,3))[, .(moving.average = mean(i.Height)), Date]
the result:
> head(res,15)
Date moving.average
1: 1980-01-14 7633.217
2: 1980-01-28 7635.000
3: 1980-02-14 7637.696
4: 1980-03-14 7636.636
5: 1980-03-28 7641.273
6: 1980-04-14 7645.261
7: 1980-04-25 7644.955
8: 1980-05-14 7646.591
9: 1980-05-29 7647.143
10: 1980-06-13 7648.400
11: 1980-06-27 7652.900
12: 1980-07-14 7655.789
13: 1980-07-28 7660.550
14: 1980-08-14 7660.895
15: 1980-08-28 7664.000
Now you have for each date an average of all the values that lie the date and one year ahead of that date.
Hey I just tried after seeing your question!!! Ran on a sample data frame. Try it on yours after understanding the code and then let me know!
Bdw instead of having an interval of 5 years, I used just 2 months (2*30 = approx 2 months) as the interval!
df = data.frame(Date = c("14-1-1980", "28-1-1980", "14-2-1980", "14-3-1980", "28-3-1980",
"14-4-1980", "25-4-1980", "14-5-1980", "29-5-1980", "13-6-1980:",
"27-6-1980", "14-7-1980", "28-7-1980", "14-8-1980"), height = 1:14)
# as.Date(df$Date, "%d-%m-%Y")
df1 = data.frame(orig = NULL, dest = NULL, avg_ht = NULL)
orig = as.Date(df$Date, "%d-%m-%Y")[1]
dest = as.Date(df$Date, "%d-%m-%Y")[1] + 2*30 #approx 2 months
dest_final = as.Date(df$Date, "%d-%m-%Y")[14]
while (dest < dest_final){
m = mean(df$height[which(as.Date(df$Date, "%d-%m-%Y")>=orig &
as.Date(df$Date, "%d-%m-%Y")<dest )])
df1 = rbind(df1,data.frame(orig=orig,dest=dest,avg_ht=m))
orig = dest
dest = dest + 2*30
print(paste("orig:",orig, " + ","dest:",dest))
}
> df1
orig dest avg_ht
1 1980-01-14 1980-03-14 2.0
2 1980-03-14 1980-05-13 5.5
3 1980-05-13 1980-07-12 9.5
I hope this works for you as well
This is my best try, but please keep in mind that I am working with the years instead of the full date, i.e. based on the example you provided I am averaging over beginning of 1980- end of 1984.
dat<-read.csv("paixnidi.csv")
install.packages("stringr")
library(stringr)
dates<-dat[,1]
#extract the year of each measurement
years<-as.integer(str_sub(dat[,1], start= -4))
spread_y<-years[length(years)]-years[1]
ind<-list()
#find how many 5-year intervals there are
groups<-ceiling(spread_y/4)
meangroups<-matrix(0,ncol=2,nrow=groups)
k<-0
for (i in 1:groups){
#extract the indices of the dates vector whithin the 5-year period
ind[[i]]<-which(years>=(years[1]+k)&years<=(years[1]+k+4),arr.ind=TRUE)
meangroups[i,2]<-mean(dat[ind[[i]],2])
meangroups[i,1]<-(years[1]+k)
k<-k+5
}
colnames(meangroups)<-c("Year:Year+4","Mean Height (cm)")

Convert hour to date-time

I have a data frame with hour stamp and corresponding temperature measured. The measurements are taken at random intervals over time continuously. I would like to convert the hours to respective date-time and temperature measured. My data frame looks like this: (The measurement started at 20/05/2016)
Time, Temp
09.25,28
10.35,28.2
18.25,29
23.50,30
01.10,31
12.00,36
02.00,25
I would like to create a data.frame with respective date-time and Temp like below:
Time, Temp
2016-05-20 09:25,28
2016-05-20 10:35,28.2
2016-05-20 18:25,29
2016-05-20 23:50,30
2016-05-21 01:10,31
2016-05-21 12:00,36
2016-05-22 02:00,25
I am thankful for any comments and tips on the packages or functions in R, I can have a look to do this. Thanks for your time.
A possible solution in base R:
df$Time <- as.POSIXct(strptime(paste('2016-05-20', sprintf('%05.2f',df$Time)), format = '%Y-%m-%d %H.%M', tz = 'GMT'))
df$Time <- df$Time + cumsum(c(0,diff(df$Time)) < 0) * 86400 # 86400 = 60 * 60 * 24
which gives:
> df
Time Temp
1 2016-05-20 09:25:00 28.0
2 2016-05-20 10:35:00 28.2
3 2016-05-20 18:25:00 29.0
4 2016-05-20 23:50:00 30.0
5 2016-05-21 01:10:00 31.0
6 2016-05-21 12:00:00 36.0
7 2016-05-22 02:00:00 25.0
An alternative with data.table (off course you can also use cumsum with diff instead of rleid & shift):
setDT(df)[, Time := as.POSIXct(strptime(paste('2016-05-20', sprintf('%05.2f',Time)), format = '%Y-%m-%d %H.%M', tz = 'GMT')) +
(rleid(Time < shift(Time, fill = Time[1]))-1) * 86400]
Or with dplyr:
library(dplyr)
df %>%
mutate(Time = as.POSIXct(strptime(paste('2016-05-20',
sprintf('%05.2f',Time)),
format = '%Y-%m-%d %H.%M', tz = 'GMT')) +
cumsum(c(0,diff(Time)) < 0)*86400)
which will both give the same result.
Used data:
df <- read.table(text='Time, Temp
09.25,28
10.35,28.2
18.25,29
23.50,30
01.10,31
12.00,36
02.00,25', header=TRUE, sep=',')
You can use a custom date format combined with some code that detects when a new day begins (assuming the first measurement takes place earlier in the day than the last measurement of the previous day).
# starting day
start_date = "2016-05-20"
values=read.csv('values.txt', colClasses=c("character",NA))
last=c(0,values$Time[1:nrow(values)-1])
day=cumsum(values$Time<last)
Time = strptime(paste(start_date,values$Time), "%Y-%m-%d %H.%M")
Time = Time + day*86400
values$Time = Time

Convert unstructured csv file to a data frame

I am learning R for text mining. I have a TV program schedule in form of CSV. The programs usually start at 06:00 AM and goes on until 05:00 AM the next day which is called a broadcast day. For example: the programs for 15/11/2015 start at 06:00 AM and ends at 05:00 AM the next day.
Here is a sample code showing how the schedule looks like:
read.table(textConnection("Sunday|\n 01-Nov-15|\n 6|Tom\n some information about the program|\n 23.3|Jerry\n some information about the program|\n 5|Avatar\n some information about the program|\nMonday|\n 02-Nov-15|\n 6|Tom\n some information about the program|\n 23.3|Jerry\n some information about the program|\n 5|Avatar\n some information about the program|"), header = F, sep = "|", stringsAsFactors = F)
whose output is as follows:
V1|V2
Sunday |
01-Nov-15 |
6 | Tom
some information about the program |
23.3 | Jerry
some information about the program |
5 | Avatar
some information about the program |
5.3 | Panda
some information about the program |
Monday |
02-Nov-15|
6 Jerry
some information about the program |
6.25 | Panda
some information about the program |
23.3 | Avatar
some information about the program |
7.25 | Tom
some information about the program |
I want to convert the above data into a form of data.frame
Date |Program|Synopsis
2015-11-1 06:00 |Tom | some information about the program
2015-11-1 23:30 |Jerry | some information about the program
2015-11-2 05:00 |Avatar | some information about the program
2015-11-2 05:30 |Panda | some information about the program
2015-11-2 06:00 |Jerry | some information about the program
2015-11-2 06:25 |Panda | some information about the program
2015-11-2 23:30 |Avatar | some information about the program
2015-11-3 07:25 |Tom | some information about the program
I am thankful for any suggestions/tips regarding functions or packages I should have a look at.
An alternative solution with data.table:
library(data.table)
library(zoo)
library(splitstackshape)
txt <- textConnection("Sunday|\n 01-Nov-15|\n 6|Tom\n some information about the program|\n 23.3|Jerry\n some information about the program|\n 5|Avatar\n some information about the program|\nMonday|\n 02-Nov-15|\n 6|Tom\n some information about the program|\n 23.3|Jerry\n some information about the program|\n 5|Avatar\n some information about the program|")
tv <- readLines(txt)
DT <- data.table(tv)[, tv := gsub('[|]$', '', tv)]
wd <- levels(weekdays(1:7, abbreviate = FALSE))
DT <- DT[, temp := tv %chin% wd
][, day := tv[temp], by = 1:nrow(tvDT)
][, day := na.locf(day)
][, temp := NULL
][, idx := rleid(day)
][, date := tv[2], by = idx
][, .SD[-c(1,2)], by = idx]
DT <- cSplit(DT, sep="|", "tv", "long")[, lbl := rep(c("Time","Program","Info")), by = idx]
DT <- dcast(DT, idx + day + date + rowid(lbl) ~ lbl, value.var = "tv")[, lbl := NULL]
DT <- DT[, datetime := as.POSIXct(paste(as.character(date), sprintf("%01.2f",as.numeric(as.character(Time)))), format = "%d-%b-%y %H.%M")
][, datetime := datetime + (+(datetime < shift(datetime, fill=datetime[1]) & datetime < 6) * 24 * 60 * 60)
][, .(datetime, Program, Info)]
The result:
> DT
datetime Program Info
1: 2015-11-01 06:00:00 Tom some information about the program
2: 2015-11-01 23:30:00 Jerry some information about the program
3: 2015-11-02 05:00:00 Avatar some information about the program
4: 2015-11-02 06:00:00 Tom some information about the program
5: 2015-11-02 23:30:00 Jerry some information about the program
6: 2015-11-03 05:00:00 Avatar some information about the program
Explanation:
1: read data, convert to a data.table & remove trailing |:
txt <- textConnection("Sunday|\n 01-Nov-15|\n 6|Tom\n some information about the program|\n 23.3|Jerry\n some information about the program|\n 5|Avatar\n some information about the program|\nMonday|\n 02-Nov-15|\n 6|Tom\n some information about the program|\n 23.3|Jerry\n some information about the program|\n 5|Avatar\n some information about the program|")
tv <- readLines(txt)
DT <- data.table(tv)[, tv := gsub('[|]$', '', tv)]
2: extract the weekdays into a new column
wd <- levels(weekdays(1:7, abbreviate = FALSE)) # a vector with the full weekdays
DT[, temp := tv %chin% wd
][, day := tv[temp], by = 1:nrow(tvDT)
][, day := na.locf(day)
][, temp := NULL]
3: create an index per day & create a column with the dates
DT[, idx := rleid(day)][, date := tv[2], by = idx]
4: remove unnecessary lines
DT <- DT[, .SD[-c(1,2)], by = idx]
5: split the time and the program-name into separate rows & create a label column
DT <- cSplit(DT, sep="|", "tv", "long")[, lbl := rep(c("Time","Program","Info")), by = idx]
6: reshape into wide format using the 'rowid' function from the development version of data.table
DT <- dcast(DT, idx + day + date + rowid(idx2) ~ idx2, value.var = "tv")[, idx2 := NULL]
7: create a dattime column & set the late night time to the next day
DT[, datetime := as.POSIXct(paste(as.character(date), sprintf("%01.2f",as.numeric(as.character(Time)))), format = "%d-%b-%y %H.%M")
][, datetime := datetime + (+(datetime < shift(datetime, fill=datetime[1]) & datetime < 6) * 24 * 60 * 60)]
8: keep the needed columns
DT <- DT[, .(datetime, Program, Info)]
It's a bit of a mess, but it seems to work:
df <- read.table(textConnection(txt <- "Sunday|\n 01-Nov-15|\n 6|Tom\n some information about the program|\n 23.3|Jerry\n some information about the program|\n 5|Avatar\n some information about the program|\nMonday|\n 02-Nov-15|\n 6|Tom\n some information about the program|\n 23.3|Jerry\n some information about the program|\n 5|Avatar\n some information about the program|"), header = F, sep = "|", stringsAsFactors = F)
cat(txt)
Sys.setlocale("LC_TIME", "English") # if needed
weekdays <- format(seq.Date(Sys.Date(), Sys.Date()+6, 1), "%A")
days <- split(df, cumsum(df$V1 %in% weekdays))
lapply(days, function(dayDF) {
tmp <- cbind.data.frame(V1=dayDF[2, 1], do.call(rbind, split(unlist(dayDF[-c(1:2), ]), cumsum(!dayDF[-(1:2), 2]==""))), stringsAsFactors = F)
tmp[, 1] <- as.Date(tmp[, 1], "%d-%B-%y")
tmp[, 2] <- as.numeric(tmp[, 2])
tmp[, 5] <- NULL
idx <- c(FALSE, diff(tmp[, 2])<0)
tmp[idx, 1] <- tmp[idx, 1] + 1
return(tmp)
}) -> days
days <- transform(do.call(rbind.data.frame, days), V1=as.POSIXct(paste(V1, sprintf("%.2f", V11)), format="%Y-%m-%d %H.%M"), V11=NULL)
names(days) <- c("Date", "Synopsis", "Program")
rownames(days) <- NULL
days[, c(1, 3, 2)]
# Date Program Synopsis
# 1 2015-11-01 06:00:00 Tom some information about the program
# 2 2015-11-01 23:30:00 Jerry some information about the program
# 3 2015-11-02 05:00:00 Avatar some information about the program
# 4 2015-11-02 06:00:00 Tom some information about the program
# 5 2015-11-02 23:30:00 Jerry some information about the program
# 6 2015-11-03 05:00:00 Avatar some information about the program
1) This sets up some functions and then consists of four transform(...) %>% subset(...) code fragments linked together using a magrittr pipeline. We assume DF is the output of the read.table in the question.
First, load the zoo package so get access to na.locf. Define a Lead function which shifts each element by 1 position. Also define a datetime function which converts a date plus a h.m number to a datetime.
Now convert the dates to "Date" class. The rows that are not dates will become NA. Use Lead to shift that vector by 1 position and then extract the NA positions effectively removing the weekday rows. Now use na.locf to fill in the dates and keep only rows with duplicated dates effectively removing the rows containing only a date. Next set Program as V1 and Synopsis as V2 except we must shift V2 using Lead since the Synopsis is on the second row of each pair. Keep only the odd positioned rows. Produce datetime and pick out desired columns.
library(magrittr)
library(zoo) # needed for na.locf
Lead <- function(x, fill = NA) c(x[-1], fill) # shift down and fill
datetime <- function(date, time) {
time <- as.numeric(time)
as.POSIXct(sprintf("%s %.0f:%02f", date, time, 100 * (time %% 1))) +
24 * 60 * 60 * (time < 6) # add day if time < 6
}
DF %>%
transform(date = as.Date(V1, "%d-%b-%y")) %>%
subset(Lead(is.na(date), TRUE)) %>% # rm weekday rows
transform(date = na.locf(date)) %>% # fill in dates
subset(duplicated(date)) %>% # rm date rows
transform(Program = V2, Synopsis = Lead(V1)) %>%
subset(c(TRUE, FALSE)) %>% # keep odd positioned rows only
transform(Date = datetime(date, V1)) %>%
subset(select = c("Date", "Program", "Synopsis"))
giving:
Date Program Synopsis
1 2015-11-01 06:00:00 Tom some information about the program
2 2015-11-01 23:30:00 Jerry some information about the program
3 2015-11-02 05:00:00 Avatar some information about the program
4 2015-11-02 06:00:00 Tom some information about the program
5 2015-11-02 23:30:00 Jerry some information about the program
6 2015-11-03 05:00:00 Avatar some information about the program
2) dplyr and here it is using dplyr and the datetime function above. We could have replaced the transform and subset functions in (1) with dplyr mutate and filter and Lead with lead but for variety we do it another way:
library(dplyr)
library(zoo) # na.locf
DF %>%
mutate(date = as.Date(V1, "%d-%b-%t")) %>%
filter(lead(is.na(date), default = TRUE)) %>% # rm weekday rows
mutate(date = na.locf(date)) %>% # fill in dates
group_by(date) %>%
mutate(Program = V2, Synopsis = lead(V1)) %>%
slice(seq(2, n(), by = 2)) %>%
ungroup() %>%
mutate(Date = datetime(date, V1)) %>%
select(Date, Program, Synopsis)
giving:
Source: local data frame [6 x 3]
Date Program Synopsis
(time) (chr) (chr)
1 2015-11-01 06:00:00 Tom some information about the program
2 2015-11-01 23:30:00 Jerry some information about the program
3 2015-11-02 05:00:00 Avatar some information about the program
4 2015-11-02 06:00:00 Tom some information about the program
5 2015-11-02 23:30:00 Jerry some information about the program
6 2015-11-03 05:00:00 Avatar some information about the program
3) data.table This also uses na.locf from zoo and datetime defined in (1):
library(data.table)
library(zoo)
dt <- data.table(DF)
dt <- dt[, date := as.Date(V1, "%d-%b-%y")][
shift(is.na(date), type = "lead", fill = TRUE)][, # rm weekday rows
date := na.locf(date)][duplicated(date)][, # fill in dates & rm date rows
Synopsis := shift(V1, type = "lead")][seq(1, .N, 2)][, # align Synopsis
c("Date", "Program") := list(datetime(date, V1), V2)][,
list(Date, Program, Synopsis)]
giving:
> dt
Date Program Synopsis
1: 2015-11-01 06:00:00 Tom some information about the program
2: 2015-11-01 23:30:00 Jerry some information about the program
3: 2015-11-02 05:00:00 Avatar some information about the program
4: 2015-11-02 06:00:00 Tom some information about the program
5: 2015-11-02 23:30:00 Jerry some information about the program
6: 2015-11-03 05:00:00 Avatar some information about the program
UPDATE: Simplified (1) and added (2) and (3).

Resources