Text process using R

Text process using R - r

I am quite new in programming and R Software.
My data-set includes date-time variables as following:
2007/11/0103
2007/11/0104
2007/11/0105
2007/11/0106
I need an operator which count from left up to the character number 10 and then execute a space and copy the last two characters and then add :00 for all columns.
Expected results:
2007/11/01 03:00
2007/11/01 04:00
2007/11/01 05:00
2007/11/01 06:00

If you want to actually turn your data into a "POSIXlt" "POSIXt" class in R (so you could subtract/add days, minutes and etc from/to it) you could do
# Your data
temp <- c("2007/11/0103", "2007/11/0104", "2007/11/0105", "2007/11/0106")
temp2 <- strptime(temp, "%Y/%m/%d%H")
## [1] "2007-11-01 03:00:00 IST" "2007-11-01 04:00:00 IST" "2007-11-01 05:00:00 IST" "2007-11-01 06:00:00 IST"
You could then extract hours for example
temp2$hour
## [1] 3 4 5 6
Add hours
temp2 + 3600
## [1] "2007-11-01 04:00:00 IST" "2007-11-01 05:00:00 IST" "2007-11-01 06:00:00 IST" "2007-11-01 07:00:00 IST"
And so on. If you just want the format you mentioned in your question (which is just a character string), you can also do
format(strptime(temp, "%Y/%m/%d%H"), format = "%Y/%m/%d %H:%M")
#[1] "2007/11/01 03:00" "2007/11/01 04:00" "2007/11/01 05:00" "2007/11/01 06:00"

Try
library(lubridate)
dat <- read.table(text="2007/11/0103
2007/11/0104
2007/11/0105
2007/11/0106",header=F,stringsAsFactors=F)
dat$V1 <- format(ymd_h(dat$V1),"%Y/%m/%d %H:%M")
dat
# V1
# 1 2007/11/01 03:00
# 2 2007/11/01 04:00
# 3 2007/11/01 05:00
# 4 2007/11/01 06:00

Suppose your dates are a vector named dates
library(stringr)
paste0(paste(str_sub(dates, end=10), str_sub(dates, 11)), ":00")

paste and substr are your friends here. Type ? before either to see the documentation
my.parser <- function(a){
paste0(substr(a, 0,10),' ',substr(a,11,12),':00') # paste0 is like paste but does not add whitespace
}
a<- '2007/11/0103'
my.parser(a) # = "2007/11/01 03:00"

Related

How can I add a midnight time stamp to dates in a column in R?

I have a column of dates in an R data frame, that look like this,
Date
2020-08-05
2020-08-05
2020-08-05
2020-08-07
2020-08-08
2020-08-08
So the dates are formatted as 'yyyy-mm-dd'.
I am writing this data frame to a CSV that needs to be formatted in a very specific manner. I need to convert these dates to the format 'mm/dd/yyyy hh:mm:ss', so this is what I want the columns to look like:
Date
8/5/2020 12:00:00 AM
8/5/2020 12:00:00 AM
8/5/2020 12:00:00 AM
8/7/2020 12:00:00 AM
8/8/2020 12:00:00 AM
8/8/2020 12:00:00 AM
The dates do not have a timestamp attached to begin with, so all dates will need a midnight timestamp in the format shown above.
I spent quite some time trying to coerce this format yesterday and was unable. I am easily able to change 2020-08-05 to 8/5/2020 using as.Date(), but the issue arises when I attempt to add the midnight time stamp.
How can I add a midnight timestamp to these reformatted dates?
Thanks so much for any help!

You can use format:
df <- data.frame(Date = as.Date(c("2020-08-05", "2020-08-07")))
format(df$Date, "%d-%m-%Y 12:00:00 AM")
[1] "05-08-2020 12:00:00 AM" "07-08-2020 12:00:00 AM"

dat <- data.frame(
Date = as.Date("2020-08-05") + c(0, 0, 0, 2, 3, 3)
)
dat[["Date"]] <- format(dat[["Date"]], "%m/%d/%Y %I:%M:%S %p")
dat[["Date"]] <- sub("([ap]m)$", "\\U\\1", dat[["Date"]], perl = T)
dat
## Date
## 1 08/05/2020 12:00:00 AM
## 2 08/05/2020 12:00:00 AM
## 3 08/05/2020 12:00:00 AM
## 4 08/07/2020 12:00:00 AM
## 5 08/08/2020 12:00:00 AM
## 6 08/08/2020 12:00:00 AM

Try this:
format(as.POSIXct("2022-11-08", tz = "Australia/Sydney"), "%Y-%m-%d %H:%M:%S")

How to deal with one column of two formats and single class?

I have one column with two different formats but the same class 'factor'.
D$date
2009-05-12 11:30:00
2009-05-13 11:30:00
2009-05-14 11:30:00
2009-05-15 11:30:00
42115.652
2876
8765
class(D$date)
factor
What I need is to convert the number to date.
D$date <- as.character(D$date)
D$date=ifelse(!is.na(as.numeric(D$date)),
as.POSIXct(as.numeric(D$date) * (60*60*24), origin="1899-12-30", tz="UTC"),
D$date)
Now the number was converted but to a strange number "1429630800".
I tried without ifelse:
as.POSIXct(as.numeric(42115.652) * (60*60*24), origin="1899-12-30", tz="UTC")
[1] "2015-04-21 15:38:52 UTC"
It was converted nicely.

The problem is that you are mixing up classes in the true/false halves of your ifelse. You can fix this by adding as.character like this
D$date = ifelse(!is.na(as.numeric(D$date)),
as.character(as.POSIXct(as.numeric(D$date) * (60*60*24), origin="1899-12-30", tz="UTC")),
D$date)
#D
# date
#1 2009-05-12 11:30:00
#2 2009-05-13 11:30:00
#3 2009-05-14 11:30:00
#4 2009-05-15 11:30:00
#5 2015-04-21 15:38:52
#6 1907-11-15 00:00:00
#7 1923-12-30 00:00:00

You can also create a function which transforms each value in POSIX, then using lapply and do.call.
b <- c("2009-05-12 11:30:00", "2009-05-13 11:30:00", "2009-05-14 11:30:00",
"2009-05-15 11:30:00", "42115.652", "2876", "8765")
foo <- function(x){
if(!is.na(as.numeric(x))){
as.POSIXct(as.numeric(x) * (60*60*24), origin="1899-12-30", tz="UTC")
}else{
as.POSIXct(x, origin="1899-12-30", tz="UTC")
}
}
do.call("c", lapply(b, foo))
[1] "2009-05-12 13:30:00 CEST" "2009-05-13 13:30:00 CEST" "2009-05-14 13:30:00 CEST" "2009-05-15 13:30:00 CEST"
[5] "2015-04-21 17:38:52 CEST" "1907-11-15 01:00:00 CET" "1923-12-30 01:00:00 CET"

R: Find missing timestamps in csv

as I failed to solve my problem with PHP/MySQL or Excel due to the data size, I'm trying to do my very first steps with R now and struggle a bit. The problem is this: I have a second-by-second CSV-file with half a year of data, that looks like this:
metering,timestamp
123,2016-01-01 00:00:00
345,2016-01-01 00:00:01
243,2016-01-01 00:00:02
101,2016-01-01 00:00:04
134,2016-01-01 00:00:06
As you see, there are some seconds missing every once in a while (don't ask me, why the values are written before the timestamp, but that's how I received the data…). Now I try to calculate the amount of values (= seconds) that are missing.
So my idea was
to create a vector that is correct (includes all sec-by-sec timestamps),
match the given CSV file with that new vector, and
sum up all the timestamps with no value.
I managed to make step 1 happen with the following code:
RegularTimeSeries <- seq(as.POSIXct("2016-01-01 00:00:00", tz = "UTC"), as.POSIXct("2016-01-01 00:00:30", tz = "UTC"), by = "1 sec")
write.csv(RegularTimeSeries, file = "RegularTimeSeries.csv")
To have an idea what I did I also exported the vector to a CSV that looks like this:
"1",2016-01-01 00:00:00
"2",2016-01-01 00:00:01
"3",2016-01-01 00:00:02
"4",2016-01-01 00:00:03
"5",2016-01-01 00:00:04
"6",2016-01-01 00:00:05
"7",2016-01-01 00:00:06
Unfortunately I have no idea how to go on with step 2 and 3. I found some very similar examples (http://www.r-bloggers.com/fix-missing-dates-with-r/, R: Insert rows for missing dates/times), but as a total R noob I struggled to translate these examples to my given sec-by-sec data.
Some hints for the greenhorn would be very very helpful – thank you very much in advance :)

In the tidyverse,
library(dplyr)
library(tidyr)
# parse datetimes
df %>% mutate(timestamp = as.POSIXct(timestamp)) %>%
# complete sequence to full sequence from min to max by second
complete(timestamp = seq.POSIXt(min(timestamp), max(timestamp), by = 'sec'))
## # A tibble: 7 x 2
## timestamp metering
## <time> <int>
## 1 2016-01-01 00:00:00 123
## 2 2016-01-01 00:00:01 345
## 3 2016-01-01 00:00:02 243
## 4 2016-01-01 00:00:03 NA
## 5 2016-01-01 00:00:04 101
## 6 2016-01-01 00:00:05 NA
## 7 2016-01-01 00:00:06 134
If you want the number of NAs (i.e. the number of seconds with no data), add on
%>% tally(is.na(metering))
## # A tibble: 1 x 1
## n
## <int>
## 1 2

You can check which values of your RegularTimeSeries are in your broken time series using which and %in%. First create BrokenTimeSeries from your example:
RegularTimeSeries <- seq(as.POSIXct("2016-01-01 00:00:00", tz = "UTC"), as.POSIXct("2016-01-01 00:00:30", tz = "UTC"), by = "1 sec")
BrokenTimeSeries <- RegularTimeSeries[-c(3,6,9)] # remove some seconds
This will give you the indeces of values within RegularTimeSeries that are not in BrokenTimeSeries:
> which(!(RegularTimeSeries %in% BrokenTimeSeries))
[1] 3 6 9
This will return the actual values:
> RegularTimeSeries[which(!(RegularTimeSeries %in% BrokenTimeSeries))]
[1] "2016-01-01 00:00:02 UTC" "2016-01-01 00:00:05 UTC" "2016-01-01 00:00:08 UTC"
Maybe I'm misunderstanding your problem but you can count the number of missing seconds simply subtracting the length of your broken time series from RegularTimeSeries or getting the length of any of the two resulting vectors above.
> length(RegularTimeSeries) - length(BrokenTimeSeries)
[1] 3
> length(which(!(RegularTimeSeries %in% BrokenTimeSeries)))
[1] 3
> length(RegularTimeSeries[which(!(RegularTimeSeries %in% BrokenTimeSeries))])
[1] 3
If you want to merge the files together to see the missing values you can do something like this:
#data with regular time series and a "step"
df <- data.frame(
RegularTimeSeries
)
df$BrokenTimeSeries[RegularTimeSeries %in% BrokenTimeSeries] <- df$RegularTimeSeries
df$BrokenTimeSeries <- as.POSIXct(df$BrokenTimeSeries, origin="2015-01-01", tz="UTC")
resulting in:
> df[1:12,]
RegularTimeSeries BrokenTimeSeries
1 2016-01-01 00:00:00 2016-01-01 00:00:00
2 2016-01-01 00:00:01 2016-01-01 00:00:01
3 2016-01-01 00:00:02 <NA>
4 2016-01-01 00:00:03 2016-01-01 00:00:02
5 2016-01-01 00:00:04 2016-01-01 00:00:03
6 2016-01-01 00:00:05 <NA>
7 2016-01-01 00:00:06 2016-01-01 00:00:04
8 2016-01-01 00:00:07 2016-01-01 00:00:05
9 2016-01-01 00:00:08 <NA>
10 2016-01-01 00:00:09 2016-01-01 00:00:06
11 2016-01-01 00:00:10 2016-01-01 00:00:07
12 2016-01-01 00:00:11 2016-01-01 00:00:08

If all you want is the number of missing seconds, it can be done much more simply. First find the number of seconds in your timerange, and then subtract the number of rows in your dataset. This could be done in R along these lines:
n.seconds <- difftime("2016-06-01 00:00:00", "2016-01-01 00:00:00", units="secs")
n.rows <- nrow(my.data.frame)
n.missing.values <- n.seconds - n.rows
You might change the time range and the variable of your data frame.

Hope it helps
d <- (c("2016-01-01 00:00:01",
"2016-01-01 00:00:02",
"2016-01-01 00:00:03",
"2016-01-01 00:00:04",
"2016-01-01 00:00:05",
"2016-01-01 00:00:06",
"2016-01-01 00:00:10",
"2016-01-01 00:00:12",
"2016-01-01 00:00:14",
"2016-01-01 00:00:16",
"2016-01-01 00:00:18",
"2016-01-01 00:00:20",
"2016-01-01 00:00:22"))
d <- as.POSIXct(d)
for (i in 2:length(d)){
if(difftime(d[i-1],d[i], units = "secs") < -1 ){
c[i] <- d[i]
}
}
class(c) <- c('POSIXt','POSIXct')
c
[1] NA NA NA
NA NA
[6] NA "2016-01-01 00:00:10 EST" "2016-01-01 00:00:12
EST" "2016-01-01 00:00:14 EST" "2016-01-01 00:00:16 EST"
[11] "2016-01-01 00:00:18 EST" "2016-01-01 00:00:20 EST" "2016-01-01
00:00:22 EST"

Associate numbers to datetime/timestamp

I have a dataframe df with a certain number of columns. One of them, ts, is timestamps:
1462147403122 1462147412990 1462147388224 1462147415651 1462147397069 1462147392497
...
1463529545634 1463529558639 1463529556798 1463529558788 1463529564627 1463529557370.
I have also at my disposal the corresponding datetime in the datetime column:
"2016-05-02 02:03:23 CEST" "2016-05-02 02:03:32 CEST" "2016-05-02 02:03:08 CEST" "2016-05-02 02:03:35 CEST" "2016-05-02 02:03:17 CEST" "2016-05-02 02:03:12 CEST"
...
"2016-05-18 01:59:05 CEST" "2016-05-18 01:59:18 CEST" "2016-05-18 01:59:16 CEST" "2016-05-18 01:59:18 CEST" "2016-05-18 01:59:24 CEST" "2016-05-18 01:59:17 CEST"
As you can see my dataframe contains data accross several day. Let's say there are 3. I would like to add a column containing number 1, 2 or 3. 1 if the line belongs to the first day, 2 for the second day, etc...
Thank you very much in advance,
Clement

One way to do this is to keep track of total days elapsed each time the date changes, as demonstrated below.
# Fake data
dat = data.frame(datetime = c(seq(as.POSIXct("2016-05-02 01:03:11"),
as.POSIXct("2016-05-05 01:03:11"), length.out=6),
seq(as.POSIXct("2016-05-09 01:09:11"),
as.POSIXct("2016-05-16 02:03:11"), length.out=4)))
tz(dat$datetime) = "UTC"
Note, if your datetime column is not already in a datetime format, convert it to one using as.POSIXct.
Now, create a new column with the day number, counting the first day in the sequence as day 1.
dat$day = c(1, cumsum(as.numeric(diff(as.Date(dat$datetime, tz="UTC")))) + 1)
dat
datetime day
1 2016-05-02 01:03:11 1
2 2016-05-02 15:27:11 1
3 2016-05-03 05:51:11 2
4 2016-05-03 20:15:11 2
5 2016-05-04 10:39:11 3
6 2016-05-05 01:03:11 4
7 2016-05-09 01:09:11 8
8 2016-05-11 09:27:11 10
9 2016-05-13 17:45:11 12
10 2016-05-16 02:03:11 15
I specified the timezone in the code above to avoid getting tripped up by potential silent shifts between my local timezone and UTC. For example, note the silent shift from my default local time zone ("America/Los_Angeles") to UTC when converting a POSIXct datetime to a date:
# Fake data
datetime = seq(as.POSIXct("2016-05-02 01:03:11"), as.POSIXct("2016-05-05 01:03:11"), length.out=6)
tz(datetime)
[1] ""
date = as.Date(datetime)
tz(date)
[1] "UTC"
data.frame(datetime, date)
datetime date
1 2016-05-02 01:03:11 2016-05-02
2 2016-05-02 15:27:11 2016-05-02
3 2016-05-03 05:51:11 2016-05-03
4 2016-05-03 20:15:11 2016-05-04 # Note day is different due to timezone shift
5 2016-05-04 10:39:11 2016-05-04
6 2016-05-05 01:03:11 2016-05-05

R add specific (different) amounts of times to entire column

I have a table in R like:
start duration
02/01/2012 20:00:00 5
05/01/2012 07:00:00 6
etc... etc...
I got to this by importing a table from Microsoft Excel that looked like this:
date time duration
2012/02/01 20:00:00 5
etc...
I then merged the date and time columns by running the following code:
d.f <- within(d.f, { start=format(as.POSIXct(paste(date, time)), "%m/%d/%Y %H:%M:%S") })
I want to create a third column called 'end', which will be calculated as the number of hours after the start time. I am pretty sure that my time is a POSIXct vector. I have seen how to manipulate one datetime object, but how can I do that for the entire column?
The expected result should look like:
start duration end
02/01/2012 20:00:00 5 02/02/2012 01:00:00
05/01/2012 07:00:00 6 05/01/2012 13:00:00
etc... etc... etc...

Using lubridate
> library(lubridate)
> df$start <- mdy_hms(df$start)
> df$end <- df$start + hours(df$duration)
> df
# start duration end
#1 2012-02-01 20:00:00 5 2012-02-02 01:00:00
#2 2012-05-01 07:00:00 6 2012-05-01 13:00:00
data
df <- structure(list(start = c("02/01/2012 20:00:00", "05/01/2012 07:00:00"
), duration = 5:6), .Names = c("start", "duration"), class = "data.frame", row.names = c(NA,
-2L))

You can simply add dur*3600 to start column of the data frame. E.g. with one date:
start = as.POSIXct("02/01/2012 20:00:00",format="%m/%d/%Y %H:%M:%S")
start
[1] "2012-02-01 20:00:00 CST"
start + 5*3600
[1] "2012-02-02 01:00:00 CST"

Develop Reference

r css asp.net wordpress firebase qt symfony nginx http apache-flex

Text process using R - r

Try library(lubridate) dat <- read.table(text="2007/11/0103 2007/11/0104 2007/11/0105 2007/11/0106",header=F,stringsAsFactors=F) dat$V1 <- format(ymd_h(dat$V1),"%Y/%m/%d %H:%M") dat # V1 # 1 2007/11/01 03:00 # 2 2007/11/01 04:00 # 3 2007/11/01 05:00 # 4 2007/11/01 06:00

Suppose your dates are a vector named dates library(stringr) paste0(paste(str_sub(dates, end=10), str_sub(dates, 11)), ":00")

paste and substr are your friends here. Type ? before either to see the documentation my.parser <- function(a){ paste0(substr(a, 0,10),' ',substr(a,11,12),':00') # paste0 is like paste but does not add whitespace } a<- '2007/11/0103' my.parser(a) # = "2007/11/01 03:00"

Related

How can I add a midnight time stamp to dates in a column in R?

How to deal with one column of two formats and single class?

R: Find missing timestamps in csv

Associate numbers to datetime/timestamp

R add specific (different) amounts of times to entire column

Categories

Resources