Splitting a factor at a space in R

I want to split x (which is a factor)
dd = data.frame(x = c("29-4-2014 06:00:00", "9-4-2014 12:00:00", "9-4-2014 00:00:00", "6-5-2014 00:00:00" ,"7-4-2014 00:00:00" , "29-5-2014 00:00:00"))
x
29-4-2014 06:00:00
9-4-2014 12:00:00
9-4-2014 00:00:00
6-5-2014 00:00:00
7-4-2014 00:00:00
29-5-2014 00:00:00
at the space and get two columns, like this:
x.date x.time
29-4-2014 06:00:00
9-4-2014 12:00:00
9-4-2014 00:00:00
6-5-2014 00:00:00
7-4-2014 00:00:00
29-5-2014 00:00:00
Any suggestion is appreciated!

strsplit is typically used here, but you can also use read.table:
read.table(text = as.character(dd$x))
# V1 V2
# 1 29-4-2014 06:00:00
# 2 9-4-2014 12:00:00
# 3 9-4-2014 00:00:00
# 4 6-5-2014 00:00:00
# 5 7-4-2014 00:00:00
# 6 29-5-2014 00:00:00
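If you prefer strsplit, a minimal sketch (the x.date / x.time names are simply the ones asked for in the question):
# Split each value at the space and bind the two pieces back onto dd
parts <- do.call(rbind, strsplit(as.character(dd$x), " ", fixed = TRUE))
dd$x.date <- parts[, 1]
dd$x.time <- parts[, 2]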

Another (better) option is to convert to proper date-time objects:
# Convert to POSIXct objects
times <- as.POSIXct(dd$x, format="%d-%m-%Y %T")
# You may also want to specify the time zone
times <- as.POSIXct(dd$x, format="%d-%m-%Y %T", tz="GMT")
Then, to extract times
strftime(times, "%T")
[1] "06:00:00" "12:00:00" "00:00:00" "00:00:00" "00:00:00" "00:00:00"
or dates
strftime(times, "%D")
[1] "04/29/14" "04/09/14" "04/09/14" "05/06/14" "04/07/14" "05/29/14"
or, any format you want, really
strftime(times, "%d %b %Y at %T")
[1] "29 Apr 2014 at 06:00:00" "09 Apr 2014 at 12:00:00"
[3] "09 Apr 2014 at 00:00:00" "06 May 2014 at 00:00:00"
[5] "07 Apr 2014 at 00:00:00" "29 May 2014 at 00:00:00"
For more info, see ?as.POSIXct and ?strftime.

Here is another approach using lubridate:
dd = data.frame(x = c("29-4-2014 06:00:00", "9-4-2014 12:00:00", "9-4-2014 00:00:00", "6-5-2014 00:00:00" ,"7-4-2014 00:00:00" , "29-5-2014 00:00:00"),
stringsAsFactors = FALSE)
Note the use of stringsAsFactors = FALSE, which prevents your dates from being read as factors.
library(lubridate)
dd2 <- transform(dd,x2 = dmy_hms(x))
transform(dd2, the_year = year(x2))
x x2 the_year
1 29-4-2014 06:00:00 2014-04-29 06:00:00 2014
2 9-4-2014 12:00:00 2014-04-09 12:00:00 2014
3 9-4-2014 00:00:00 2014-04-09 00:00:00 2014
4 6-5-2014 00:00:00 2014-05-06 00:00:00 2014
5 7-4-2014 00:00:00 2014-04-07 00:00:00 2014
6 29-5-2014 00:00:00 2014-05-29 00:00:00 2014

Related

Convert dd/mm/yyyy H to date format

I have the following column in R
dteday = c("01/01/2011 0", "01/01/2011 1" , "01/01/2011 2", "01/01/2011 19")
df = data.frame(dteday)
dteday
1 01/01/2011 0
2 01/01/2011 1
3 01/01/2011 2
4 01/01/2011 19
I want the column to be converted into a proper %d/%m/%Y %H:%M format.
The string on the left is in %d/%m/%Y format and the integer on the right represents the hour. This is my desired output:
dteday
1 01/01/2011 00:00
2 01/01/2011 01:00
3 01/01/2011 02:00
4 01/01/2011 19:00
That specifically formatted output can be achieved by combining strftime and as.POSIXct, but the result will still be a character string:
df$dteday = strftime(as.POSIXct(df$dteday, format = "%d/%m/%Y %H"), format = "%d/%m/%Y %H:%M")
# dteday
# 1 01/01/2011 00:00
# 2 01/01/2011 01:00
# 3 01/01/2011 02:00
# 4 01/01/2011 19:00
Without converting to a date you could do:
sapply(
  strsplit(dteday, ' '),
  function(x) sprintf('%s %02d:00', x[1], as.integer(x[2]))
)
# [1] "01/01/2011 00:00" "01/01/2011 01:00" "01/01/2011 02:00" "01/01/2011 19:00"
You can use lubridate's dmy_h to convert the string to POSIXct.
df$dteday <- lubridate::dmy_h(df$dteday)
df$dteday
#[1] "2011-01-01 00:00:00 UTC" "2011-01-01 01:00:00 UTC"
# "2011-01-01 02:00:00 UTC" "2011-01-01 19:00:00 UTC"
Then use format to get the data in a format of your choice.
format(df$dteday, '%d/%m/%Y %H:%M')
#[1] "01/01/2011 00:00" "01/01/2011 01:00" "01/01/2011 02:00" "01/01/2011 19:00"

How can I split one row spanning a time period into multiple hourly rows, based on date and time, in R?

How to do below task in R?
df <- tribble(
~ID, ~StartTime, ~EndTime
, 01, "2018-05-14 09:30:00", "2018-05-14 12:10:00"
, 02, "2018-05-14 21:30:00", "2018-05-15 02:00:00"
, 03, "2018-05-15 07:00:00", "2018-05-16 22:30:00"
, 04, "2018-05-16 23:00:00", "2018-05-16 23:40:00"
, 05, "2018-05-17 01:00:00", "2018-05-19 15:00:00"
)
df$StartTime <- as.POSIXct(df$StartTime, tryFormats = "%Y-%m-%d %H:%M:%S")
df$EndTime <- as.POSIXct(df$EndTime, tryFormats = "%Y-%m-%d %H:%M:%S")
Note: multiple rows need to be created from a single row. For example:
Original single row:
01, "2018-05-14 09:30:00", "2018-05-14 12:10:00"
After processing, multiple rows:
01, "2018-05-14 09:30:00", "2018-05-14 10:00:00"
01, "2018-05-14 10:00:00", "2018-05-14 11:00:00"
01, "2018-05-14 11:00:00", "2018-05-14 12:00:00"
01, "2018-05-14 12:00:00", "2018-05-14 12:10:00"
Hoping my request is clear.
We can write a function which generates an hourly sequence between two timestamps. Using map2, we call that function for every pair of StartTime and EndTime and expand the data frame.
library(dplyr)
library(lubridate)
generate_hourly_time <- function(x, y) {
  EndTime <- ceiling_date(x, 'hour')
  EndTime2 <- seq(EndTime, floor_date(y, 'hour'), 'hour')
  tibble(StartTime = c(x, EndTime2), EndTime = c(EndTime2, y))
}
df %>%
  mutate(across(-1, ymd_hms)) %>%
  # For dplyr < 1.0.0: mutate_at(-1, ymd_hms) %>%
  mutate(time = purrr::map2(StartTime, EndTime, generate_hourly_time)) %>%
  select(ID, time) %>%
  tidyr::unnest(time)
# A tibble: 117 x 3
# ID StartTime EndTime
# <dbl> <dttm> <dttm>
# 1 1 2018-05-14 09:30:00 2018-05-14 10:00:00
# 2 1 2018-05-14 10:00:00 2018-05-14 11:00:00
# 3 1 2018-05-14 11:00:00 2018-05-14 12:00:00
# 4 1 2018-05-14 12:00:00 2018-05-14 12:10:00
# 5 2 2018-05-14 21:30:00 2018-05-14 22:00:00
# 6 2 2018-05-14 22:00:00 2018-05-14 23:00:00
# 7 2 2018-05-14 23:00:00 2018-05-15 00:00:00
# 8 2 2018-05-15 00:00:00 2018-05-15 01:00:00
# 9 2 2018-05-15 01:00:00 2018-05-15 02:00:00
#10 2 2018-05-15 02:00:00 2018-05-15 02:00:00
# … with 107 more rows
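As a quick sanity check (a sketch, assuming the unnested result above is stored in out), the hourly slices should add back up to each original interval:
out %>%
  group_by(ID) %>%
  summarise(total_mins = sum(as.numeric(difftime(EndTime, StartTime, units = "mins"))))
# ID 1 should give 160 minutes (09:30 to 12:10), and so on.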
I hope it's useful:
df <- tribble(
~ID, ~StartTime, ~EndTime
, 01, "2018-05-14 09:30:00", "2018-05-14 12:10:00"
, 01, "2018-05-14 09:30:00", "2018-05-14 12:10:00"
, 01, "2018-05-14 09:30:00", "2018-05-14 12:10:00"
, 01, "2018-05-14 09:30:00", "2018-05-14 12:10:00"
, 01, "2018-05-14 09:30:00", "2018-05-14 12:10:00"
, 02, "2018-05-14 21:30:00", "2018-05-15 02:00:00"
, 03, "2018-05-15 07:00:00", "2018-05-16 22:30:00"
, 04, "2018-05-16 23:00:00", "2018-05-16 23:40:00"
, 05, "2018-05-17 01:00:00", "2018-05-19 15:00:00"
)
nrow(df)
id.unique <- unique(df[,'ID'])
id.unique.numeric <- as.numeric(unlist(id.unique))
id.i <- id.unique.numeric
for (i in id.i) {
  out.pre <- subset(df, ID == i)
  # build and evaluate an assignment like df1 <- out.pre for each ID
  name.out <- paste('df', i, '<-out.pre', sep = '')
  eval(parse(text = name.out))
}
df1
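For what it's worth, a more idiomatic sketch of the same per-ID splitting, without eval(parse()), is base split() (the list is named by the ID values):
df_by_id <- split(df, df$ID)
df_by_id[["1"]]   # the same rows you get in df1 above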
You could also do:
library(tidyverse)
df %>%
  pivot_longer(-ID) %>%
  group_by(ID) %>%
  mutate(start = list(unique(c(value[1],
                               seq(strptime(value[1], "%F %H"), value[2], "1 hour")[-1],
                               value[2]))),
         name = NULL, value = NULL) %>%
  slice(1) %>%
  unnest(start) %>%
  mutate(end = lead(start, 1, last(start)))
# A tibble: 117 x 3
# Groups: ID [5]
ID start end
<dbl> <dttm> <dttm>
1 1 2018-05-14 09:30:00 2018-05-14 10:00:00
2 1 2018-05-14 10:00:00 2018-05-14 11:00:00
3 1 2018-05-14 11:00:00 2018-05-14 12:00:00
4 1 2018-05-14 12:00:00 2018-05-14 12:10:00
5 1 2018-05-14 12:10:00 2018-05-14 12:10:00
6 2 2018-05-14 21:30:00 2018-05-14 22:00:00
7 2 2018-05-14 22:00:00 2018-05-14 23:00:00
8 2 2018-05-14 23:00:00 2018-05-15 00:00:00
9 2 2018-05-15 00:00:00 2018-05-15 01:00:00
10 2 2018-05-15 01:00:00 2018-05-15 02:00:00
# ... with 107 more rows

Create a time series with a row every 15 minutes

I'm having trouble creating a time series (POSIXct or dttm column) with a row every 15 minutes.
Something that will look like this for every 15 minutes between Jan 1st 2015 and Dec 31st 2016 (here as month/day/year hour:minutes):
1/15/2015 0:00
1/15/2015 0:15
1/15/2015 0:30
1/15/2015 0:45
1/15/2015 1:00
Perhaps a loop starting at 01/01/2015 0:00 and then adding 15 minutes until 12/31/2016 23:45?
Does anyone have an idea of how this can be done easily?
A little bit easier to read, with lubridate:
library(lubridate)
seq(ymd_hm('2015-01-01 00:00'),ymd_hm('2016-12-31 23:45'), by = '15 mins')
Or in base R (this example builds the year 2015; adjust the range and cutoff for a longer span):
intervals.15.min <- 0:(366 * 24 * 60 * 60 / 15 / 60)
res <- as.POSIXct("2015-01-01", tz = "GMT") + intervals.15.min * 15 * 60
res <- res[res < as.POSIXct("2016-01-01 00:00:00", tz = "GMT")]
head(res)
# "2015-01-01 00:00:00 GMT" "2015-01-01 00:15:00 GMT" "2015-01-01 00:30:00 GMT"
tail(res)
# "2015-12-31 23:15:00 GMT" "2015-12-31 23:30:00 GMT" "2015-12-31 23:45:00 GMT"

R: Find missing timestamps in csv

As I failed to solve my problem with PHP/MySQL or Excel due to the data size, I'm now taking my very first steps with R and struggling a bit. The problem is this: I have a second-by-second CSV file with half a year of data that looks like this:
metering,timestamp
123,2016-01-01 00:00:00
345,2016-01-01 00:00:01
243,2016-01-01 00:00:02
101,2016-01-01 00:00:04
134,2016-01-01 00:00:06
As you see, some seconds are missing every once in a while (don't ask me why the values are written before the timestamp, but that's how I received the data…). Now I'm trying to calculate the number of values (= seconds) that are missing.
So my idea was
to create a vector that is correct (includes all sec-by-sec timestamps),
match the given CSV file with that new vector, and
sum up all the timestamps with no value.
I managed to make step 1 happen with the following code:
RegularTimeSeries <- seq(as.POSIXct("2016-01-01 00:00:00", tz = "UTC"), as.POSIXct("2016-01-01 00:00:30", tz = "UTC"), by = "1 sec")
write.csv(RegularTimeSeries, file = "RegularTimeSeries.csv")
To have an idea what I did I also exported the vector to a CSV that looks like this:
"1",2016-01-01 00:00:00
"2",2016-01-01 00:00:01
"3",2016-01-01 00:00:02
"4",2016-01-01 00:00:03
"5",2016-01-01 00:00:04
"6",2016-01-01 00:00:05
"7",2016-01-01 00:00:06
Unfortunately I have no idea how to go on with step 2 and 3. I found some very similar examples (http://www.r-bloggers.com/fix-missing-dates-with-r/, R: Insert rows for missing dates/times), but as a total R noob I struggled to translate these examples to my given sec-by-sec data.
Some hints for the greenhorn would be very very helpful – thank you very much in advance :)
In the tidyverse,
library(dplyr)
library(tidyr)
# parse datetimes
df %>% mutate(timestamp = as.POSIXct(timestamp)) %>%
# complete sequence to full sequence from min to max by second
complete(timestamp = seq.POSIXt(min(timestamp), max(timestamp), by = 'sec'))
## # A tibble: 7 x 2
## timestamp metering
## <time> <int>
## 1 2016-01-01 00:00:00 123
## 2 2016-01-01 00:00:01 345
## 3 2016-01-01 00:00:02 243
## 4 2016-01-01 00:00:03 NA
## 5 2016-01-01 00:00:04 101
## 6 2016-01-01 00:00:05 NA
## 7 2016-01-01 00:00:06 134
If you want the number of NAs (i.e. the number of seconds with no data), add on
%>% tally(is.na(metering))
## # A tibble: 1 x 1
## n
## <int>
## 1 2
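Note that df here is assumed to already hold the CSV; a sketch of reading it in (assuming the file is called metering.csv and has the two columns shown in the question):
df <- read.csv("metering.csv", stringsAsFactors = FALSE)
str(df)   # metering: int, timestamp: chr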
You can check which values of your RegularTimeSeries are in your broken time series using which and %in%. First create BrokenTimeSeries from your example:
RegularTimeSeries <- seq(as.POSIXct("2016-01-01 00:00:00", tz = "UTC"), as.POSIXct("2016-01-01 00:00:30", tz = "UTC"), by = "1 sec")
BrokenTimeSeries <- RegularTimeSeries[-c(3,6,9)] # remove some seconds
This will give you the indices of values within RegularTimeSeries that are not in BrokenTimeSeries:
> which(!(RegularTimeSeries %in% BrokenTimeSeries))
[1] 3 6 9
This will return the actual values:
> RegularTimeSeries[which(!(RegularTimeSeries %in% BrokenTimeSeries))]
[1] "2016-01-01 00:00:02 UTC" "2016-01-01 00:00:05 UTC" "2016-01-01 00:00:08 UTC"
Maybe I'm misunderstanding your problem, but you can count the number of missing seconds simply by subtracting the length of your broken time series from that of RegularTimeSeries, or by taking the length of either of the two resulting vectors above.
> length(RegularTimeSeries) - length(BrokenTimeSeries)
[1] 3
> length(which(!(RegularTimeSeries %in% BrokenTimeSeries)))
[1] 3
> length(RegularTimeSeries[which(!(RegularTimeSeries %in% BrokenTimeSeries))])
[1] 3
If you want to merge the files together to see the missing values you can do something like this:
# data with the regular time series and a "step"
df <- data.frame(RegularTimeSeries)
df$BrokenTimeSeries[RegularTimeSeries %in% BrokenTimeSeries] <- df$RegularTimeSeries
# the values are stored as seconds since the epoch, so convert back with the 1970 origin
df$BrokenTimeSeries <- as.POSIXct(df$BrokenTimeSeries, origin = "1970-01-01", tz = "UTC")
resulting in:
> df[1:12,]
RegularTimeSeries BrokenTimeSeries
1 2016-01-01 00:00:00 2016-01-01 00:00:00
2 2016-01-01 00:00:01 2016-01-01 00:00:01
3 2016-01-01 00:00:02 <NA>
4 2016-01-01 00:00:03 2016-01-01 00:00:02
5 2016-01-01 00:00:04 2016-01-01 00:00:03
6 2016-01-01 00:00:05 <NA>
7 2016-01-01 00:00:06 2016-01-01 00:00:04
8 2016-01-01 00:00:07 2016-01-01 00:00:05
9 2016-01-01 00:00:08 <NA>
10 2016-01-01 00:00:09 2016-01-01 00:00:06
11 2016-01-01 00:00:10 2016-01-01 00:00:07
12 2016-01-01 00:00:11 2016-01-01 00:00:08
If all you want is the number of missing seconds, it can be done much more simply. First find the number of seconds in your time range, then subtract the number of rows in your dataset. In R this could look along these lines:
n.seconds <- difftime("2016-06-01 00:00:00", "2016-01-01 00:00:00", units="secs")
n.rows <- nrow(my.data.frame)
n.missing.values <- n.seconds - n.rows
Adjust the time range and the column name to match your data frame.
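For the sample data above, a concrete version of that count (a sketch assuming the timestamps have already been parsed to POSIXct in a data frame df) would be:
# +1 because both endpoints belong to the expected second-by-second series
n.seconds <- as.numeric(difftime(max(df$timestamp), min(df$timestamp), units = "secs")) + 1
n.missing <- n.seconds - nrow(df)
n.missing
# [1] 2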
Hope it helps
d <- (c("2016-01-01 00:00:01",
"2016-01-01 00:00:02",
"2016-01-01 00:00:03",
"2016-01-01 00:00:04",
"2016-01-01 00:00:05",
"2016-01-01 00:00:06",
"2016-01-01 00:00:10",
"2016-01-01 00:00:12",
"2016-01-01 00:00:14",
"2016-01-01 00:00:16",
"2016-01-01 00:00:18",
"2016-01-01 00:00:20",
"2016-01-01 00:00:22"))
d <- as.POSIXct(d)
# collect the timestamps that follow a gap of more than one second
gaps <- rep(NA_real_, length(d))
for (i in 2:length(d)) {
  if (difftime(d[i - 1], d[i], units = "secs") < -1) {
    gaps[i] <- d[i]
  }
}
class(gaps) <- c('POSIXt', 'POSIXct')
gaps
[1] NA NA NA NA NA
[6] NA "2016-01-01 00:00:10 EST" "2016-01-01 00:00:12 EST" "2016-01-01 00:00:14 EST" "2016-01-01 00:00:16 EST"
[11] "2016-01-01 00:00:18 EST" "2016-01-01 00:00:20 EST" "2016-01-01 00:00:22 EST"
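The same gaps can be found without a loop; a vectorized sketch using diff():
gap_secs <- as.numeric(diff(d), units = "secs")
d[which(gap_secs > 1) + 1]   # the seven timestamps from 00:00:10 to 00:00:22, as above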

How to rearrange date and time

Could you please tell me how to rearrange the datetime columns of data set A so they are compatible with the datetime columns of data set B (which are in GMT+10)?
Thank you.
**data set A**
sitecode status start end
ANS0009 spike 11/09/2013 04:45:00 PM (GMT+11) 11/09/2013 05:00:00 PM (GMT+11)
ARM0064 spike 05/03/2014 11:00:00 AM (GMT+10) 05/03/2014 11:15:00 AM (GMT+10)
BAS0059 dry 13/01/2013 00:00:00 AM (GMT+11) 29/03/2013 11:45:00 PM (GMT+11)
BAS0059 spike 11/03/2014 10:15:00 AM (GMT+10) 11/03/2014 10:30:00 AM (GMT+10)
BLC0097 failure 12/20/2012 05:00:00 PM (GMT+11) 12/31/2012 11:45:00 PM (GMT+11)
BLC0097 spike 24/12/2015 04:59:45 PM (GMT+10) 24/12/2015 05:01:50 PM (GMT+10)
**data set B**
sitecode status start end
EUM0056 record 2012-12-01 11:00:00 2013-10-06 01:45:00
EUM0056 missing 2013-10-06 01:45:00 2013-10-06 03:00:00
EUM0056 record 2013-10-06 03:00:00 2014-03-11 20:15:00
MDL0026 record 2012-12-07 11:00:00 2013-04-04 19:45:00
MDL0026 missing 2013-04-04 19:45:00 2014-02-27 23:00:00
MDL0026 record 2014-02-27 23:00:00 2014-10-05 01:45:00
We can use lubridate to parse multiple formats after splitting each string in two to remove the (GMT+...) part.
library(lubridate)
library(stringr)
v1 <- strsplit(str1, "\\s+(?=\\()", perl = TRUE)[[1]]
parse_date_time(v1[1], c("%d/%m/%Y %I:%M:%S %p", "%m/%d/%Y %I:%M:%S %p"),
tz= "GMT", exact = TRUE) + lubridate::hours(str_extract(v1[2], "\\d+"))
#[1] "2013-09-12 03:45:00 GMT"
Applying the same idea to the full dataset:
datA[c("start", "end")] <- lapply(datA[c("start", "end")], function(x) {
  m1 <- do.call(rbind, strsplit(x, "\\s+(?=\\()", perl = TRUE))
  parse_date_time(m1[, 1], c("%d/%m/%Y %I:%M:%S %p", "%m/%d/%Y %I:%M:%S %p"),
                  tz = "GMT", exact = TRUE) +
    lubridate::hours(str_extract(m1[, 2], "\\d+"))
})
data
str1 <- "11/09/2013 04:45:00 PM (GMT+11)"
require(lubridate)
exampleA <- c("11/09/2013 04:45:00 PM (GMT+11)",
"11/09/2013 04:45:00 PM (GMT+10)")
exampleA <- as.data.frame(exampleA)
exampleA$flag <- 0
exampleA$flag[grep(" PM \\(GMT\\+11\\)", exampleA$exampleA)] <- 1
exampleA$exampleA <- gsub(" PM \\(GMT\\+11\\)","", exampleA$exampleA)
exampleA$exampleA <- gsub(" PM \\(GMT\\+10\\)","", exampleA$exampleA)
exampleA$exampleA <- mdy_hms(exampleA$exampleA)
exampleA$exampleA[exampleA$flag == 1] <- exampleA$exampleA[exampleA$flag == 1] - 3600
exampleB <- c("2013-11-09 03:45:00", "2013-11-09 04:45:00")
exampleB <- ymd_hms(exampleB)
# Proof it works
exampleA$exampleA == exampleB
[1] TRUE TRUE
If you have a mix of formats in one data set (e.g. mdy, dmy, etc.) you can deal with this by using if statements -- either in a function which you can apply or in a for loop -- and test whether a certain position has a value > 12 to determine the format, then use the appropriate lubridate function to convert it.
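A hedged sketch of that idea (assuming the "(GMT+..)" suffix has already been stripped, as in the answers above, and that ambiguous strings whose first field is 12 or less are month-first; note that %p AM/PM parsing is locale-dependent):
v <- c("24/12/2015 04:59:45 PM", "12/20/2012 05:00:00 PM")   # from data set A, suffix removed
first_field <- as.integer(sub("^(\\d+)[/-].*", "\\1", v))
d_first <- as.POSIXct(v, format = "%d/%m/%Y %I:%M:%S %p", tz = "GMT")
m_first <- as.POSIXct(v, format = "%m/%d/%Y %I:%M:%S %p", tz = "GMT")
# if the first number cannot be a month it must be the day;
# ifelse() drops the POSIXct class, so convert back afterwards
out <- ifelse(first_field > 12, d_first, m_first)
as.POSIXct(out, origin = "1970-01-01", tz = "GMT")
# [1] "2015-12-24 16:59:45 GMT" "2012-12-20 17:00:00 GMT"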
