I am learning R for text mining. I have a TV program schedule in the form of a CSV file. The programs usually start at 06:00 AM and go on until 05:00 AM the next day, which is called a broadcast day. For example, the programs for 15/11/2015 start at 06:00 AM and end at 05:00 AM the next day.
Here is some sample code showing what the schedule looks like:
read.table(textConnection("Sunday|\n 01-Nov-15|\n 6|Tom\n some information about the program|\n 23.3|Jerry\n some information about the program|\n 5|Avatar\n some information about the program|\nMonday|\n 02-Nov-15|\n 6|Tom\n some information about the program|\n 23.3|Jerry\n some information about the program|\n 5|Avatar\n some information about the program|"), header = F, sep = "|", stringsAsFactors = F)
whose output is as follows:
V1|V2
Sunday |
01-Nov-15 |
6 | Tom
some information about the program |
23.3 | Jerry
some information about the program |
5 | Avatar
some information about the program |
5.3 | Panda
some information about the program |
Monday |
02-Nov-15 |
6 | Jerry
some information about the program |
6.25 | Panda
some information about the program |
23.3 | Avatar
some information about the program |
7.25 | Tom
some information about the program |
I want to convert the above data into a data.frame of the following form:
Date |Program|Synopsis
2015-11-1 06:00 |Tom | some information about the program
2015-11-1 23:30 |Jerry | some information about the program
2015-11-2 05:00 |Avatar | some information about the program
2015-11-2 05:30 |Panda | some information about the program
2015-11-2 06:00 |Jerry | some information about the program
2015-11-2 06:25 |Panda | some information about the program
2015-11-2 23:30 |Avatar | some information about the program
2015-11-3 07:25 |Tom | some information about the program
I am thankful for any suggestions/tips regarding functions or packages I should have a look at.
An alternative solution with data.table:
library(data.table)
library(zoo)
library(splitstackshape)
txt <- textConnection("Sunday|\n 01-Nov-15|\n 6|Tom\n some information about the program|\n 23.3|Jerry\n some information about the program|\n 5|Avatar\n some information about the program|\nMonday|\n 02-Nov-15|\n 6|Tom\n some information about the program|\n 23.3|Jerry\n some information about the program|\n 5|Avatar\n some information about the program|")
tv <- readLines(txt)
DT <- data.table(tv)[, tv := gsub('[|]$', '', tv)]
wd <- weekdays(Sys.Date() + 0:6)  # vector with the full weekday names
DT <- DT[, temp := tv %chin% wd
         ][, day := ifelse(temp, tv, NA)
][, day := na.locf(day)
][, temp := NULL
][, idx := rleid(day)
][, date := tv[2], by = idx
][, .SD[-c(1,2)], by = idx]
DT <- cSplit(DT, sep="|", "tv", "long")[, lbl := rep(c("Time","Program","Info"), length.out = .N), by = idx]
DT <- dcast(DT, idx + day + date + rowid(lbl) ~ lbl, value.var = "tv")[, lbl := NULL]
DT <- DT[, datetime := as.POSIXct(paste(as.character(date), sprintf("%01.2f",as.numeric(as.character(Time)))), format = "%d-%b-%y %H.%M")
         ][, datetime := datetime + (+(datetime < shift(datetime, fill = datetime[1]) & hour(datetime) < 6) * 24 * 60 * 60)
][, .(datetime, Program, Info)]
The result:
> DT
datetime Program Info
1: 2015-11-01 06:00:00 Tom some information about the program
2: 2015-11-01 23:30:00 Jerry some information about the program
3: 2015-11-02 05:00:00 Avatar some information about the program
4: 2015-11-02 06:00:00 Tom some information about the program
5: 2015-11-02 23:30:00 Jerry some information about the program
6: 2015-11-03 05:00:00 Avatar some information about the program
Explanation:
1: read data, convert to a data.table & remove trailing |:
txt <- textConnection("Sunday|\n 01-Nov-15|\n 6|Tom\n some information about the program|\n 23.3|Jerry\n some information about the program|\n 5|Avatar\n some information about the program|\nMonday|\n 02-Nov-15|\n 6|Tom\n some information about the program|\n 23.3|Jerry\n some information about the program|\n 5|Avatar\n some information about the program|")
tv <- readLines(txt)
DT <- data.table(tv)[, tv := gsub('[|]$', '', tv)]
2: extract the weekdays into a new column
wd <- weekdays(Sys.Date() + 0:6) # a vector with the full weekdays
DT[, temp := tv %chin% wd
   ][, day := ifelse(temp, tv, NA)
][, day := na.locf(day)
][, temp := NULL]
3: create an index per day & create a column with the dates
DT[, idx := rleid(day)][, date := tv[2], by = idx]
4: remove unnecessary lines
DT <- DT[, .SD[-c(1,2)], by = idx]
5: split the time and the program-name into separate rows & create a label column
DT <- cSplit(DT, sep="|", "tv", "long")[, lbl := rep(c("Time","Program","Info"), length.out = .N), by = idx]
6: reshape into wide format using the 'rowid' function from the development version of data.table
DT <- dcast(DT, idx + day + date + rowid(lbl) ~ lbl, value.var = "tv")[, lbl := NULL]
7: create a datetime column & shift the late-night times to the next day
DT[, datetime := as.POSIXct(paste(as.character(date), sprintf("%01.2f",as.numeric(as.character(Time)))), format = "%d-%b-%y %H.%M")
  ][, datetime := datetime + (+(datetime < shift(datetime, fill = datetime[1]) & hour(datetime) < 6) * 24 * 60 * 60)]
8: keep the needed columns
DT <- DT[, .(datetime, Program, Info)]
It's a bit of a mess, but it seems to work:
df <- read.table(textConnection(txt <- "Sunday|\n 01-Nov-15|\n 6|Tom\n some information about the program|\n 23.3|Jerry\n some information about the program|\n 5|Avatar\n some information about the program|\nMonday|\n 02-Nov-15|\n 6|Tom\n some information about the program|\n 23.3|Jerry\n some information about the program|\n 5|Avatar\n some information about the program|"), header = F, sep = "|", stringsAsFactors = F)
cat(txt)
Sys.setlocale("LC_TIME", "English") # if needed
weekdays <- format(seq.Date(Sys.Date(), Sys.Date()+6, 1), "%A")
days <- split(df, cumsum(df$V1 %in% weekdays))
lapply(days, function(dayDF) {
  # row 2 of each block holds the date; the remaining rows are regrouped so that
  # each programme (time, synopsis, name) ends up on one row
  tmp <- cbind.data.frame(V1 = dayDF[2, 1],
                          do.call(rbind, split(unlist(dayDF[-c(1:2), ]), cumsum(!dayDF[-(1:2), 2] == ""))),
                          stringsAsFactors = FALSE)
  tmp[, 1] <- as.Date(tmp[, 1], "%d-%B-%y")
  tmp[, 2] <- as.numeric(tmp[, 2])
  tmp[, 5] <- NULL
  idx <- c(FALSE, diff(tmp[, 2]) < 0)  # time earlier than the previous programme -> after midnight
  tmp[idx, 1] <- tmp[idx, 1] + 1       # push those rows to the next day
  return(tmp)
}) -> days
days <- transform(do.call(rbind.data.frame, days), V1=as.POSIXct(paste(V1, sprintf("%.2f", V11)), format="%Y-%m-%d %H.%M"), V11=NULL)
names(days) <- c("Date", "Synopsis", "Program")
rownames(days) <- NULL
days[, c(1, 3, 2)]
# Date Program Synopsis
# 1 2015-11-01 06:00:00 Tom some information about the program
# 2 2015-11-01 23:30:00 Jerry some information about the program
# 3 2015-11-02 05:00:00 Avatar some information about the program
# 4 2015-11-02 06:00:00 Tom some information about the program
# 5 2015-11-02 23:30:00 Jerry some information about the program
# 6 2015-11-03 05:00:00 Avatar some information about the program
1) This sets up some functions and then consists of four transform(...) %>% subset(...) code fragments linked together using a magrittr pipeline. We assume DF is the output of the read.table in the question.
First, load the zoo package to get access to na.locf. Define a Lead function which shifts each element by 1 position. Also define a datetime function which converts a date plus an h.m number to a datetime.
Now convert the dates to "Date" class. The rows that are not dates will become NA. Use Lead to shift that vector by 1 position and then extract the NA positions, effectively removing the weekday rows. Now use na.locf to fill in the dates and keep only rows with duplicated dates, effectively removing the rows that contain only a date. Next set Program to V2 and Synopsis to V1, except we must shift V1 using Lead since the synopsis is on the second row of each pair. Keep only the odd positioned rows. Produce the datetime and pick out the desired columns.
library(magrittr)
library(zoo) # needed for na.locf
Lead <- function(x, fill = NA) c(x[-1], fill) # shift down and fill
datetime <- function(date, time) {
time <- as.numeric(time)
as.POSIXct(sprintf("%s %.0f:%02.0f", date, time, 100 * (time %% 1))) +
24 * 60 * 60 * (time < 6) # add day if time < 6
}
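# For example (a quick check, not from the original answer; the exact printed value
# depends on the session time zone):
# datetime("2015-11-01", "23.3")   # "2015-11-01 23:30:00"
# datetime("2015-11-01", "5")      # "2015-11-02 05:00:00" (times before 6 roll over to the next day)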
DF %>%
transform(date = as.Date(V1, "%d-%b-%y")) %>%
subset(Lead(is.na(date), TRUE)) %>% # rm weekday rows
transform(date = na.locf(date)) %>% # fill in dates
subset(duplicated(date)) %>% # rm date rows
transform(Program = V2, Synopsis = Lead(V1)) %>%
subset(c(TRUE, FALSE)) %>% # keep odd positioned rows only
transform(Date = datetime(date, V1)) %>%
subset(select = c("Date", "Program", "Synopsis"))
giving:
Date Program Synopsis
1 2015-11-01 06:00:00 Tom some information about the program
2 2015-11-01 23:30:00 Jerry some information about the program
3 2015-11-02 05:00:00 Avatar some information about the program
4 2015-11-02 06:00:00 Tom some information about the program
5 2015-11-02 23:30:00 Jerry some information about the program
6 2015-11-03 05:00:00 Avatar some information about the program
2) dplyr. Here it is using dplyr and the datetime function defined above. We could have replaced the transform and subset functions in (1) with dplyr's mutate and filter, and Lead with lead, but for variety we do it another way (a sketch of that direct translation is given after the output below):
library(dplyr)
library(zoo) # na.locf
DF %>%
mutate(date = as.Date(V1, "%d-%b-%y")) %>%
filter(lead(is.na(date), default = TRUE)) %>% # rm weekday rows
mutate(date = na.locf(date)) %>% # fill in dates
group_by(date) %>%
mutate(Program = V2, Synopsis = lead(V1)) %>%
slice(seq(2, n(), by = 2)) %>%
ungroup() %>%
mutate(Date = datetime(date, V1)) %>%
select(Date, Program, Synopsis)
giving:
Source: local data frame [6 x 3]
Date Program Synopsis
(time) (chr) (chr)
1 2015-11-01 06:00:00 Tom some information about the program
2 2015-11-01 23:30:00 Jerry some information about the program
3 2015-11-02 05:00:00 Avatar some information about the program
4 2015-11-02 06:00:00 Tom some information about the program
5 2015-11-02 23:30:00 Jerry some information about the program
6 2015-11-03 05:00:00 Avatar some information about the program
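For reference, the direct translation mentioned in (2), replacing transform/subset with mutate/filter and Lead with lead, would look roughly like this (an untested sketch that assumes dplyr and zoo are loaded as above and reuses DF and the datetime helper from (1)):
DF %>%
  mutate(date = as.Date(V1, "%d-%b-%y")) %>%
  filter(lead(is.na(date), default = TRUE)) %>%       # rm weekday rows
  mutate(date = na.locf(date)) %>%                    # fill in dates
  filter(duplicated(date)) %>%                        # rm date rows
  mutate(Program = V2, Synopsis = lead(V1)) %>%
  filter(rep(c(TRUE, FALSE), length.out = n())) %>%   # keep odd positioned rows only
  mutate(Date = datetime(date, V1)) %>%
  select(Date, Program, Synopsis)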
3) data.table. This also uses na.locf from zoo and the datetime function defined in (1):
library(data.table)
library(zoo)
dt <- data.table(DF)
dt <- dt[, date := as.Date(V1, "%d-%b-%y")][
shift(is.na(date), type = "lead", fill = TRUE)][, # rm weekday rows
date := na.locf(date)][duplicated(date)][, # fill in dates & rm date rows
Synopsis := shift(V1, type = "lead")][seq(1, .N, 2)][, # align Synopsis
c("Date", "Program") := list(datetime(date, V1), V2)][,
list(Date, Program, Synopsis)]
giving:
> dt
Date Program Synopsis
1: 2015-11-01 06:00:00 Tom some information about the program
2: 2015-11-01 23:30:00 Jerry some information about the program
3: 2015-11-02 05:00:00 Avatar some information about the program
4: 2015-11-02 06:00:00 Tom some information about the program
5: 2015-11-02 23:30:00 Jerry some information about the program
6: 2015-11-03 05:00:00 Avatar some information about the program
UPDATE: Simplified (1) and added (2) and (3).
Related
I have an R script that I run monthly. I'd like to subset my data frame to only show data within a 6 month time period, but each month I'd like the time period to move forward one month.
Original data frame from Sept.:
ID Name Date
1 John 1/1/2020
2 Adam 5/2/2020
3 Kate 9/30/2020
4 Jill 10/15/2020
After subsetting for only dates from May 1, 2020 - Sept. 30, 2020:
ID Name Date
2 Adam 5/2/2020
3 Kate 9/30/2020
The next month when I run my script, I'd like the dates it's subsetting to move forward by one month, so June 1, 2020 - Oct. 31, 2020:
ID Name Date
3 Kate 9/30/2020
4 Jill 10/15/2020
Right now, I'm changing this part of my script manually each month, ie:
df <- subset(df, Date >= '2020-05-01' & Date <= '2020-09-30')
Is there a way to make this automatic, so that I don't have to manually move forward the date one month every time?
We can use between after converting the 'Date' to Date class
library(dplyr)
library(lubridate)
start <- as.Date("2020-05-01")
end <- as.Date("2020-09-30")
df1 %>%
mutate(Date = mdy(Date)) %>%
filter(between(Date, start, end))
# ID Name Date
#1 2 Adam 2020-05-02
#2 3 Kate 2020-09-30
In the next month, we can change 'start' and 'end' by adding one month:
start <- start %m+% months(1)
end <- ceiling_date(end %m+% months(1), 'months') - days(1)
start
#[1] "2020-06-01"
end
#[1] "2020-10-31"
Using base R with no package dependency.
Data:
dt <- read.table(text = 'ID Name Date
1 John 1/1/2020
2 Adam 3/2/2021
3 Kate 12/30/2020
4 Jill 5/15/2021', header = TRUE, stringsAsFactors = FALSE)
Code:
date_format <- "%m/%d/%Y"
dt$Date <- as.Date(dt$Date, format = date_format)
today <- Sys.Date()
six_month <- today+(6*30)
start <- as.Date(paste(format(today, "%m"), "01",
format(today, "%Y"), sep = "/"),
format = date_format)
end <- as.Date(paste(format(six_month, "%m"), "31",
format(six_month, "%Y"), sep = "/"),
format = date_format)
dt[with(dt, Date >= start & Date <= end), ]
# ID Name Date
# 2 2 Adam 2021-03-02
# 3 3 Kate 2020-12-30
# 4 4 Jill 2021-05-15
This is a very simple solution:
library(lubridate)
t <- today() #automatic
t <- as.Date('2020-11-26') # manual (you can change it as you like)
start <- floor_date(t %m-% months(6), unit="months")
end <- floor_date(t %m-% months(1), unit="months")-1
df <- subset(df, Date >= start & Date <= end)
I'm fairly new in R and need some help.
I have two dataframes with rather similar information. The first dataframe has information about misconnections for an airline, whereas the other one is the entire timetable for the same airline. Now, what I need is to make a new column in the misconnection data.frame including flights from the timetable that can replace the delayed flights on the transit.
The flights that I want to replace need to meet a range of conditions (they must be within a certain time horizon, depart on the same weekday, and fly to the same destination). In addition, I want R to choose the flight that is closest (by time) to the new arrival time at the transit (from the misconnection data.frame).
The misconnection data.frame looks like the following (1620 lines in total):
miscon <- data.frame(flight.date = as.Date(c("2019-08-05", "2019-10-03", "2019-07-21", "2019-05-29"), format="%Y-%m-%d"),
Outbound.airport = c("MXP", "KRK", "KLU", "OTP"),
arr.time = as.POSIXct(c("19:25:00", "20:52:00", "07:33:00", "18:49:00"), format="%H:%M:%S"),
next.pos.dep = as.POSIXct(c("19:36:00", "21:17:00", "07:58:00", "19:14:00"), format="%H:%M:%S"),
weekday = c("4", "7", "7", "3"))
view(miscon)
flight.date Outbound.airport arr.time next.pos.dep Weekday
1 2019-08-05 MXP 19:25:00 19:36:00 4
2 2019-10-03 KRK 20:52:00 21:17:00 7
3 2019-07-21 KLU 07:33:00 07:58:00 7
4 2019-05-29 OTP 18:49:00 19:14:00 3
And the timetable data.frame would look like this:
tt <- data.frame(start.date = as.Date(c("2019-03-25", "2019-05-02", "2019-07-30", "2019-05-29"), format="%Y-%m-%d"),
end.date = as.Date(c("2019-10-21", "2019-10-27", "2019-08-26", "2019-06-01"), format="%Y-%m-%d"),
weekday = c("1234567", "1.3..67", "1.34567", "..3.5.."),
Outbound.airport = c("KLU", "KLU", "MXP", "OTP"),
dep.time = as.POSIXct(c("12:20:00", "15:55:00", "19:55:00", "20:34:00"), format="%H:%M:%S"))
view(tt)
start.date end.date Weekday Outbound.airport dep.time
1 2019-03-25 2019-10-21 1234567 KLU 12:20:00
2 2019-05-02 2019-10-27 1.3..67 KLU 15:55:00
3 2019-07-30 2019-08-26 1.34567 MXP 19:55:00
4 2019-03-30 2019-06-01 ..3.5.. OTP 20:34:00
In Excel, this problem is solved using index matching, which I've managed. However, the problem is slightly too big for Excel to handle, which is why I need to convert this to R. I did try the match and mutate functions in R, but it seems like the values I'm matching must be equal, which I do not expect mine to be.
I also found an interesting solution to a similar problem using the DescTools package, which I tried to implement with no success.
get_close2 <- function(xx=tt, yy=miscon) {
pos <- vector(mode = "numeric")
for(i in 1:dim(yy)[1]) {
pos[i] <- DescTools::Closest(xx$dep.time, yy$next.pos.dep[i])
#print(pos[i])
yy$new.flight[i] <- pos[i]
}
out <- yy
return(out)
}
get_close2()
For this one, I tried with only one condition. It generated a column, but with NAs only. Obviously, I am far off right now, which is why I'm reaching out for help. I hope the problem is clear. The end result would preferably look something like the following:
miscon
flight.date Outbound.airport arr.time next.pos.dep Weekday new.flight.time
1 2019-12-05 MXP 19:25:00 19:36:00 4 19:55:00
2 2019-10-03 KRK 20:52:00 21:17:00 7 NA
3 2019-07-21 KLU 07:33:00 07:58:00 7 12:20:00
4 2019-05-29 OTP 18:49:00 19:14:00 3 20:34:00
I think you can do it as follows. First, I would rearrange the Weekday column so that you have one row for each weekday a flight is going:
library(data.table)
library(dplyr)
library(tidyr)
tt <- tt %>% separate(weekday, into = as.character(1:7), sep = 1:6) %>%
gather(key="key", value="weekday", -c(start.date, end.date, Outbound.airport, dep.time)) %>%
filter(weekday %in% 1:7) %>%
select(-key)
Then I would do a left join of miscon and tt on the airport and weekday.
tt <- data.table(tt)
miscon <- data.table(miscon)
setkey(miscon, Outbound.airport, weekday)
setkey(tt, Outbound.airport, weekday)
df <- tt[miscon]
Check if flight date is on a valid date:
df = df[flight.date>=start.date & flight.date<=end.date]
Now you have a data.frame of all possible connections. The only thing left is to find the minimum time between the flights for each connection.
df[,timediff:= dep.time-arr.time, by=.(weekday, Outbound.airport)]
Now you can filter the rows by the minimum time delay (timediff):
df = df[ , .SD[which.min(timediff)], by=.(weekday, Outbound.airport, flight.date, arr.time, next.pos.dep)]
setnames(df, "dep.time", "new.flight.time")
> df
weekday Outbound.airport flight.date arr.time next.pos.dep start.date end.date new.flight.time timediff
1: 7 KLU 2019-07-21 2020-04-27 07:33:00 2020-04-27 07:58:00 2019-03-25 2019-10-21 2020-04-27 12:20:00 17220 secs
2: 4 MXP 2019-08-05 2020-04-27 19:25:00 2020-04-27 19:36:00 2019-07-30 2019-08-26 2020-04-27 19:55:00 1800 secs
3: 3 OTP 2019-05-29 2020-04-27 18:49:00 2020-04-27 19:14:00 2019-05-29 2019-06-01 2020-04-27 20:34:00 6300 secs
The solution is a bit of a mix of dplyr and data.table.
Ok, it's not pretty but you have a fairly complex issue, and it's not fully clear to me if this gives you what you are looking for - you will need to check it on a larger dataset than the small example you provide to be sure first.
# setup
library(data.table)
setDT(tt)
setDT(miscon)
# make tt long format splitting weekdays out
tt <- melt(tt[, paste("V", 1:7, sep = "") := tstrsplit(weekday, "")][, -"weekday"], measure.vars = paste("V", 1:7, sep = ""))[value != "."][, c("weekday", "value", "variable") := .(value, NULL, NULL)]
# join, calculate time difference, convert format of times, rank on new.dep.time within group, and filter
newDT <- miscon[tt, on = c("Outbound.airport", "weekday"), nomatch = 0][
, new.dep.time := as.numeric(dep.time - arr.time)][
, c("arr.time", "dep.time", "next.pos.dep") := .(format(arr.time, "%H:%M"), format(dep.time, "%H:%M"), format(next.pos.dep, "%H:%M"))][
, new.dep.rank := rank(new.dep.time), by = c("Outbound.airport", "weekday")][
new.dep.rank == 1, -c("new.dep.rank", "new.dep.time")]
Problem: I am trying to add a column to the data.table object below that holds, for each row, the list of weeks covered by that row. I.e. if START = "2020-01-01" and END = "2020-01-15", the week column should contain the respective weeks for this time interval (2020 W01, 2020 W02, 2020 W03). I want to keep the function that prepares the data separate for the sake of code structure. However, the current function results in an error.
Question: Is there a way to keep it that simple, i.e. without referring to the data.table object inside the get_weeks call? What could a modified function look like? Cheers!
library(data.table)
library(lubridate) # for ymd() used in get_weeks
dt <- data.table(
ID = c(1, 2, 3),
START = c("2020-01-01", "2020-03-01", "2020-03-14"),
END = c("2020-01-15", "2020-03-12", "2020-03-26")
)
get_weeks <- function(start_date, end_date){
date_range <- c(start_date, end_date)
date_range <- ymd(date_range)
dt_range <- seq.Date(date_range[1], date_range[2], "day")
dt_range_week <- list(unique(format(as.Date(dt_range), "%G W%V")))
dt_range_week
}
dt[, weeks_for_filter_table := get_weeks("START", "END")]
You could use Map/mapply (a Map variant is sketched after the output below):
library(data.table)
dt[, weeks_for_filter_table := mapply(get_weeks, START, END)]
dt
# ID START END weeks_for_filter_table
#1: 1 2020-01-01 2020-01-15 2020 W01,2020 W02,2020 W03
#2: 2 2020-03-01 2020-03-12 2020 W09,2020 W10,2020 W11
#3: 3 2020-03-14 2020-03-26 2020 W11,2020 W12,2020 W13
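Map works here as well. Since get_weeks already wraps its result in list(), a slightly simpler variant (a sketch using a hypothetical helper get_weeks2 that returns the character vector directly) could look like this:
library(lubridate) # for ymd()
get_weeks2 <- function(start_date, end_date) {  # hypothetical variant of get_weeks without the list() wrapper
  dt_range <- seq.Date(ymd(start_date), ymd(end_date), "day")
  unique(format(dt_range, "%G W%V"))            # ISO year + week for every day in the interval
}
dt[, weeks_for_filter_table := Map(get_weeks2, START, END)]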
Does anyone have a solution to perform separate operations on groups of consecutive values that are a subset of a time series and are identified by a recurring, identical flag, using R?
In the example data set created by the code below, this would refer for example to calculating the mean of “value” separately for each group where “flag” == 1 on consecutive days.
A typical case in science would be a data set recorded by an instrument that repeatedly executes a calibration procedure and flags the corresponding data with the same flag, but the user needs to evaluate each calibration separately with the same procedure.
Thanks for your suggestions. Jens
library(lubridate)
df <- data.frame(
date = seq(ymd("2018-01-01"), ymd("2018-06-29"), by = "days"),
flag = rep( c(rep(1,10), rep(0, 20)), 6),
value = seq(1,180,1)
)
The data.table function rleid is great for giving group IDs to runs of consecutive values. I continue to use data.table, but you could do everything except the rleid part just as well in dplyr or base R (a dplyr sketch follows the output below).
My answer comes down to using data.table::rleid and then picking your favorite way to take the mean by group (R-FAQ link).
library(data.table)
setDT(df)
df[, r_id := rleid(flag)]
df[flag == 1, list(
min_date = min(date),
max_date = max(date),
mean_value = mean(value)
), by = r_id]
# r_id min_date max_date mean_value
# 1: 1 2018-01-01 2018-01-10 5.5
# 2: 3 2018-01-31 2018-02-09 35.5
# 3: 5 2018-03-02 2018-03-11 65.5
# 4: 7 2018-04-01 2018-04-10 95.5
# 5: 9 2018-05-01 2018-05-10 125.5
# 6: 11 2018-05-31 2018-06-09 155.5
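As noted above, only the rleid step really needs data.table; a rough dplyr equivalent (an untested sketch that still borrows data.table::rleid for the run-length group IDs) could be:
library(dplyr)
df %>%
  mutate(r_id = data.table::rleid(flag)) %>%   # group id per run of identical flags
  filter(flag == 1) %>%
  group_by(r_id) %>%
  summarise(min_date = min(date),
            max_date = max(date),
            mean_value = mean(value))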
I have a data frame with an hour stamp and the corresponding measured temperature. The measurements are taken at random intervals, continuously over time. I would like to convert the hours to the respective date-times, keeping the measured temperature. My data frame looks like this (the measurements started on 20/05/2016):
Time, Temp
09.25,28
10.35,28.2
18.25,29
23.50,30
01.10,31
12.00,36
02.00,25
I would like to create a data.frame with respective date-time and Temp like below:
Time, Temp
2016-05-20 09:25,28
2016-05-20 10:35,28.2
2016-05-20 18:25,29
2016-05-20 23:50,30
2016-05-21 01:10,31
2016-05-21 12:00,36
2016-05-22 02:00,25
I am thankful for any comments and tips on the packages or functions in R that I could have a look at to do this. Thanks for your time.
A possible solution in base R:
df$Time <- as.POSIXct(strptime(paste('2016-05-20', sprintf('%05.2f',df$Time)), format = '%Y-%m-%d %H.%M', tz = 'GMT'))
df$Time <- df$Time + cumsum(c(0,diff(df$Time)) < 0) * 86400 # 86400 = 60 * 60 * 24
which gives:
> df
Time Temp
1 2016-05-20 09:25:00 28.0
2 2016-05-20 10:35:00 28.2
3 2016-05-20 18:25:00 29.0
4 2016-05-20 23:50:00 30.0
5 2016-05-21 01:10:00 31.0
6 2016-05-21 12:00:00 36.0
7 2016-05-22 02:00:00 25.0
An alternative with data.table (of course you can also use cumsum with diff instead of rleid & shift):
setDT(df)[, Time := as.POSIXct(strptime(paste('2016-05-20', sprintf('%05.2f',Time)), format = '%Y-%m-%d %H.%M', tz = 'GMT')) +
(rleid(Time < shift(Time, fill = Time[1]))-1) * 86400]
Or with dplyr:
library(dplyr)
df %>%
mutate(Time = as.POSIXct(strptime(paste('2016-05-20',
sprintf('%05.2f',Time)),
format = '%Y-%m-%d %H.%M', tz = 'GMT')) +
cumsum(c(0,diff(Time)) < 0)*86400)
which will both give the same result.
Used data:
df <- read.table(text='Time, Temp
09.25,28
10.35,28.2
18.25,29
23.50,30
01.10,31
12.00,36
02.00,25', header=TRUE, sep=',')
You can use a custom date format combined with some code that detects when a new day begins (assuming the first measurement takes place earlier in the day than the last measurement of the previous day).
# starting day
start_date = "2016-05-20"
values = read.csv('values.txt', colClasses = c("character", NA))
last = c("0", head(values$Time, -1))  # time of the previous row (zero-padded HH.MM strings)
day = cumsum(values$Time < last)      # a new day starts whenever the time goes backwards
Time = strptime(paste(start_date, values$Time), "%Y-%m-%d %H.%M")
Time = Time + day * 86400             # shift each row forward by the number of elapsed days
values$Time = Time
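If there is no values.txt at hand, the same frame can be built inline from the data in the question for a quick test (just an illustration, not part of the original code):
values <- read.csv(text = 'Time,Temp
09.25,28
10.35,28.2
18.25,29
23.50,30
01.10,31
12.00,36
02.00,25', colClasses = c("character", NA))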