Adding missing rows - r

The format of my excel data file is:
day value
01-01-2000 00:00:00 4
01-01-2000 00:01:00 3
01-01-2000 00:02:00 1
01-01-2000 00:04:00 1
I open my file with this:
ts = read.csv(file=pathfile, header=TRUE, sep=",")
How can I add rows with a zero in the "value" column for the missing minutes in the data frame? Output example:
day value
01-01-2000 00:00:00 4
01-01-2000 00:01:00 3
01-01-2000 00:02:00 1
01-01-2000 00:03:00 0
01-01-2000 00:04:00 1

This is now completely automated in the padr package. Takes only one line of code.
original <- data.frame(
  day = as.POSIXct(c("01-01-2000 00:00:00",
                     "01-01-2000 00:01:00",
                     "01-01-2000 00:02:00",
                     "01-01-2000 00:04:00"), format = "%m-%d-%Y %H:%M:%S"),
  value = c(4, 3, 1, 1))
library(padr)
library(dplyr) # for the pipe operator
original %>% pad %>% fill_by_value(value)
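The padded result should look roughly like this (the missing 00:03:00 minute is added with value 0):
                  day value
1 2000-01-01 00:00:00     4
2 2000-01-01 00:01:00     3
3 2000-01-01 00:02:00     1
4 2000-01-01 00:03:00     0
5 2000-01-01 00:04:00     1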
See vignette("padr") or this blog post for how it works.

I think this is a more general solution: create a sequence of all timestamps, use it as the basis for a new data frame, and then fill in your original values where applicable.
# convert original `day` to POSIX
ts$day <- as.POSIXct(ts$day, format="%m-%d-%Y %H:%M:%S", tz="GMT")
# generate a sequence of all minutes in a day
minAsNumeric <- 946684800 + seq(0, 60*60*24, by=60) # all minutes of your first day (946684800 = 2000-01-01 00:00:00 GMT)
minAsPOSIX <- as.POSIXct(minAsNumeric, origin="1970-01-01", tz="GMT") # convert those minutes to POSIX
# build complete dataframe
newdata <- as.data.frame(minAsPOSIX)
newdata$value <- ts$value[match(newdata$minAsPOSIX, ts$day)] # fill in original `value`s where present
newdata$value[is.na(newdata$value)] <- 0 # replace NAs with 0
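A variant of the same idea (a sketch; allMins and the reuse of newdata are illustrative names): build the minute sequence from the data itself instead of hard-coding the epoch offset, so the approach works for any date range.
allMins <- seq(min(ts$day), max(ts$day), by = "min")   # every minute between first and last observation
newdata <- data.frame(minAsPOSIX = allMins)
newdata$value <- ts$value[match(newdata$minAsPOSIX, ts$day)]
newdata$value[is.na(newdata$value)] <- 0                # unobserved minutes get 0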

Try:
ts = read.csv(file=pathfile, header=TRUE, sep=",", stringsAsFactors=F)
ts.tmp = rbind(ts,list("01-01-2000 00:03:00",0))
ts.out = ts.tmp[order(ts.tmp$day),]
Notice that you need to force the strings in the first column to be read as character and not as factors, otherwise you will have issues with the rbind. To get the day column back to a factor afterwards, just do:
ts.out$day = as.factor(ts.out$day)
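To illustrate the factor problem (a sketch): if day had been left as a factor, rbind() could not add a timestamp that is not an existing level and would produce an NA plus a warning instead.
ts.fac <- ts
ts.fac$day <- as.factor(ts.fac$day)
rbind(ts.fac, list("01-01-2000 00:03:00", 0))   # the new day becomes NA
# Warning message: invalid factor level, NA generated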

tidyr offers the nice complete() function to generate rows for implicitly missing data. I use replace_na() to turn NA values into 0 in a second step.
ts %>%
  tidyr::complete(day = seq.POSIXt(min(day), max(day), by = "min")) %>%
  dplyr::mutate(value = tidyr::replace_na(value, 0))
Notice that I set the granularity of the dates to minutes since your dataset expects a row every minute.
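If the data were at a different resolution, the same pattern only needs a different by argument. A sketch for hourly data, assuming a data frame ts_hourly with a POSIXct day column:
ts_hourly %>%
  tidyr::complete(day = seq.POSIXt(min(day), max(day), by = "hour")) %>%
  dplyr::mutate(value = tidyr::replace_na(value, 0))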

Related

Converting character to dates with hours and minutes

I'm having trouble converting character values into dates with hours and minutes. I have the following code:
start <- c("2022-01-10 9:35PM","2022-01-10 10:35PM")
end <- c("2022-01-11 7:00AM","2022-01-11 8:00AM")
dat <- data.frame(start,end)
These are all in character form. I would like to:
Convert all the datetimes into date format and into 24-hour format, e.g. "2022-01-10 9:35PM" into "2022-01-10 21:35"
and "2022-01-11 7:00AM" into "2022-01-11 7:00", because I would like to calculate the difference between the dates in hours.
Also, I would like to add an ID column with a specific ID. The desired data would look like this:
ID <- c(101,101)
start <- c("2022-01-10 21:35","2022-01-10 22:35")
end <- c("2022-01-11 7:00","2022-01-11 8:00")
diff <- c(9,10) # I'm not sure how the calculations would turn out to be
dat <- data.frame(ID,start,end,diff)
I would appreciate all the help there is! Thanks!!!
You can use lubridate::ymd_hm. Don't use floor if you want the exact value.
library(dplyr)
library(lubridate)
dat %>%
  mutate(ID = 101,
         across(c(start, end), ymd_hm),
         diff = floor(end - start))
                start                 end  ID    diff
1 2022-01-10 21:35:00 2022-01-11 07:00:00 101 9 hours
2 2022-01-10 22:35:00 2022-01-11 08:00:00 101 9 hours
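If you want the exact, non-floored difference, one option (a sketch) is difftime() with explicit units:
dat %>%
  mutate(across(c(start, end), ymd_hm),
         diff = as.numeric(difftime(end, start, units = "hours")))
# roughly:
#                 start                 end     diff
# 1 2022-01-10 21:35:00 2022-01-11 07:00:00 9.416667
# 2 2022-01-10 22:35:00 2022-01-11 08:00:00 9.416667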
A base R approach uses strptime with %I (12-hour clock) together with %p:
strptime(dat$start, "%Y-%m-%d %I:%M%p")
[1] "2022-01-10 21:35:00 CET" "2022-01-10 22:35:00 CET"

Filling not observed observations

I want to make a time series with the frequency a date and time is observed. The raw data looked something like this:
dd-mm-yyyy hh:mm
28-2-2018 0:12
28-2-2018 11:16
28-2-2018 12:12
28-2-2018 13:22
28-2-2018 14:23
28-2-2018 14:14
28-2-2018 16:24
The date and time format is not the one R expects, so I had to convert it:
extracted_times <- as.POSIXct(bedrijf.CSV$viewed_at, format = "%d-%m-%Y %H:%M")
I ordered the data with frequency in a table using the following code:
timeserieswithoutzeros <- table(extracted_times)
The data looks something like this now:
2018-02-28 00:11:00 2018-02-28 01:52:00 2018-02-28 03:38:00
                  1                   2                   5
2018-02-28 04:10:00 2018-02-28 04:40:00 2018-02-28 04:45:00
                  2                   1                   1
As you may see there are a lot of unobserved dates and times.
I want to add these unobserved dates and times with the frequency of 0.
I tried the complete function, but the error states that it can't be used because I use as.POSIXct().
Any ideas?
As already mentioned in the comments by #eric-lecoutre, you can combine your observations with a sequence beginning at the earliest and ending at the latest date using seq, and subtract 1 from the frequency table.
timeseriesWithzeros <- table(c(extracted_times, seq(min(extracted_times), max(extracted_times), "1 min")))-1
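To see why the "- 1" works (a toy sketch): every minute of the grid appears once in the combined vector, plus once per real observation, so the table counts are the observation counts plus one:
x <- as.POSIXct(c("2018-02-28 00:01", "2018-02-28 00:01", "2018-02-28 00:03"), tz = "GMT")
table(c(x, seq(min(x), max(x), "1 min"))) - 1
# 2018-02-28 00:01:00 2018-02-28 00:02:00 2018-02-28 00:03:00
#                   2                   0                   1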
Maybe the following is what you want.
First, coerce the data to class "POSIXt" and create the sequence of all date/time between min and max by steps of 1 minute.
bedrijf.CSV$viewed_at <- as.POSIXct(bedrijf.CSV$viewed_at, format = "%d-%m-%Y %H:%M")
new <- seq(min(bedrijf.CSV$viewed_at),
max(bedrijf.CSV$viewed_at),
by = "1 mins")
tmp <- data.frame(viewed_at = new)
Now see if these values are in the original data.
tmp$viewed <- tmp$viewed_at %in% bedrijf.CSV$viewed_at
tbl <- xtabs(viewed ~ viewed_at, tmp)
sum(tbl != 0)
#[1] 7
Final clean up.
rm(new, tmp)
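If a plain data frame of per-minute counts (zeros included) is preferred over the table object, it can be converted afterwards (a sketch; counts is just an illustrative name):
counts <- as.data.frame(tbl)   # columns viewed_at and Freq, with 0 for unobserved minutes
head(counts)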

Convert different date-time-formats at once (strptime)

Hi :) I have a column in my data.frame which contains dates in two formats. Here is a short minimal example:
D = data.frame(dates = c("3/31/2016", "01.12.2015"))
dates
1 3/31/2016
2 01.12.2015
With the nice function strptime I can easily get date-times for each format:
D$date1 <- strptime(D$dates, format = "%m/%d/%Y")
D$date2 <- strptime(D$dates, format = "%d.%m.%Y")
I already managed a workaround with:
D$date12 <- do.call(pmin, c(D[c("date1","date2")], na.rm=TRUE) )
To achieve this:
dates date1 date2 date12
1 3/31/2016 2016-03-31 <NA> 2016-03-31
2 01.12.2015 <NA> 2015-12-01 2015-12-01
Is there a more sophisticated way to do this transformation (from dates to date12) at once?
Regards
You can use the anytime package.
library(anytime)
anytime::addFormats("%d.%m.%Y")
anydate(D$dates)
Note that the argument of anydate has to be a vector, so just select the column dates.
Or use lubridate
parse_date_time(D$dates, c("mdy", "dmy"))

R: plotting data.frame against time with "hardcoded" date column

I have googled a lot and yet I still can't figure this one out. I am trying to plot one column of a data frame against time; however, my date column is "hardcoded" (for lack of a better word) as the row index of the data frame, not as a DATE column, i.e. a variable by itself.
This is what I get (the 1st column is RETURNS):
> head(tmp)[1]
RETURNS
2010-01-13 00:00:00 0.8291384
2010-01-14 00:00:00 0.2423567
2010-01-15 00:00:00 -1.0882186
2010-01-19 00:00:00 1.2422194
2010-01-20 00:00:00 -1.0654438
2010-01-21 00:00:00 -1.9126605
If I plot it like:
plot(tmp$RETURNS)
I get a plot of returns against the index from 1 to 1500 (the number of obs.) and not against time. If I had a distinct time column I would plot it like this and it would be fine:
plot(tmp$DATE, tmp$RETURNS)
However, I don't know how to extract the date from that "hardcoded" date column, if that makes sense. I tried to convert it to other objects (timeSeries, zoo etc.), but that didn't help. I am sure there is some kind of simple function, I just can't find it. Thanks for any help guys.
EDIT:
Thanks guys, your help is very much appreciated. All answers are great, too bad that I can't accept them all ;) Of course it was rownames that I was looking for.
Reproducing your data (you should really have used dput to make life easier for us):
df <- as.data.frame(c(0.8291384, 0.2423567,-1.0882186, 1.2422194,-1.0654438,-1.9126605))
names(df) <- c("RETURNS")
rownames(df) <- c("2010-01-13 00:00:00", "2010-01-14 00:00:00", "2010-01-15 00:00:00", "2010-01-19 00:00:00","2010-01-20 00:00:00","2010-01-21 00:00:00")
df
RETURNS
2010-01-13 00:00:00 0.8291384
2010-01-14 00:00:00 0.2423567
2010-01-15 00:00:00 -1.0882186
2010-01-19 00:00:00 1.2422194
2010-01-20 00:00:00 -1.0654438
2010-01-21 00:00:00 -1.9126605
Cleaning up:
df$Date <- as.Date(rownames(df))
rownames(df) <- NULL
df
RETURNS Date
1 0.8291384 2010-01-13
2 0.2423567 2010-01-14
3 -1.0882186 2010-01-15
4 1.2422194 2010-01-19
5 -1.0654438 2010-01-20
6 -1.9126605 2010-01-21
Plotting:
plot(df$Date, df$RETURNS)
or
library(ggplot2)
ggplot(df, aes(x=Date, y=RETURNS)) + geom_point() + scale_x_date()
Assuming that the input is as in the Note below, then using zoo we can plot with classic graphics, ggplot2 and lattice as follows. We also show a base R solution at the end, and a variation. Note that since the time is always 00:00:00, we use the "Date" class for the time index in all cases.
library(zoo)
z <- zoo(df$RETURNS, as.Date(rownames(df)))
plot(z)
library(ggplot2)
autoplot(z)
library(lattice)
xyplot(z)
# this one does not use any packages
df2 <- data.frame(time = as.Date(rownames(df)), RETURNS = df$RETURNS)
plot(RETURNS ~ time, df2)
# this also works using df2 just calculated
plot(df2)
Note: We assume the input is:
df <- data.frame(
  RETURNS = c(0.8291384, 0.2423567, -1.0882186, 1.2422194, -1.0654438, -1.9126605),
  row.names = c("2010-01-13 00:00:00", "2010-01-14 00:00:00", "2010-01-15 00:00:00",
                "2010-01-19 00:00:00", "2010-01-20 00:00:00", "2010-01-21 00:00:00"))

How to merge pairs of Dates and values contained in a single csv

We have a csv file with Dates in Excel numeric format and NAV values for Manager A and Manager B, as follows:
Date,Manager A,Date,Manager B
41346.6666666667,100,40932.6666666667,100
41347.6666666667,100,40942.6666666667,99.9999936329992
41348.6666666667,100,40945.6666666667,99.9999936397787
41351.6666666667,100,40946.6666666667,99.9999936714362
41352.6666666667,100,40947.6666666667,100.051441180137
41353.6666666667,100,40948.6666666667,100.04877283951
41354.6666666667,100.000077579585,40949.6666666667,100.068400298752
41355.6666666667,100.00007861475,40952.6666666667,100.070263374822
41358.6666666667,100.000047950872,40953.6666666667,99.9661095940006
41359.6666666667,99.9945012295984,40954.6666666667,99.8578245935173
41360.6666666667,99.9944609274138,40955.6666666667,99.7798031949116
41361.6666666667,99.9944817907402,40956.6666666667,100.029523604978
41366.6666666667,100,40960.6666666667,100.14859511024
41367.6666666667,99.4729804387476,40961.6666666667,99.7956029017769
41368.6666666667,99.4729804387476,40962.6666666667,99.7023420799123
41369.6666666667,99.185046151864,40963.6666666667,99.6124531927299
41372.6666666667,99.1766469096966,40966.6666666667,99.5689030038018
41373.6666666667,98.920738006398,40967.6666666667,99.5701493637685
,,40968.6666666667,99.4543885041996
,,40969.6666666667,99.3424528379521
We want to create a zoo object with the following structure: [Dates, Manager A NAV, Manager B NAV].
After reading the csv file with:
data = read.csv("...", header=TRUE, sep=",")
we set an index for splitting the object and use lapply to split
INDEX <- seq(1, by = 2, length = ncol(data) / 2)
data.zoo <- lapply(INDEX, function(i, data) data[i:(i+1)], data = zoo(data))
I'm stuck with the fact that the Dates are in Excel format and don't know how to fix that. Is the problem set up in the correct way?
If all you want to do is convert the dates to proper dates, you can do this easily enough. The thing you need to know is the origin date: your numbers represent the integer and fractional number of days that have passed since that origin. Usually this is Jan 0 1900!!! Go figure, but be careful as I don't think this is always the case. You can try this...
# Excel origin is day 0 on Jan 0 1900, but treats 1900 as leap year so...
data$Date <- as.Date( data$Date , origin = "1899/12/30")
data$Date.1 <- as.Date( data$Date.1 , origin = "1899/12/30")
# For more info see ?as.Date
If you are interested in keeping the times as well, you can use as.POSIXct on the original numeric values, converting days to seconds, and specify the timezone explicitly (e.g. UTC) to avoid local-time shifts:
# (apply this to the raw numeric column, i.e. before the as.Date() conversion above)
data$Date <- as.POSIXct(data$Date * 86400, origin = "1899-12-30", tz = "UTC")
head(data)
# Date Manager.A Date.1 Manager.B
# 1 2013-03-13 16:00:00 100 2012-01-24 100.00000
# 2 2013-03-14 16:00:00 100 2012-02-03 99.99999
# 3 2013-03-15 16:00:00 100 2012-02-06 99.99999
# 4 2013-03-18 16:00:00 100 2012-02-07 99.99999
# 5 2013-03-19 16:00:00 100 2012-02-08 100.05144
# 6 2013-03-20 16:00:00 100 2012-02-09 100.04877
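Once both Date columns have been converted as above, one way to finish building the requested [Dates, Manager A NAV, Manager B NAV] zoo object might be the following sketch (okA, zA, zB and z are illustrative names; the trailing rows where the Manager A columns are empty are dropped first):
library(zoo)
okA <- !is.na(data$Date)                        # drop the trailing empty Manager A rows
zA  <- zoo(data$Manager.A[okA], data$Date[okA])
zB  <- zoo(data$Manager.B, data$Date.1)
z   <- merge(Manager.A = zA, Manager.B = zB)    # NAV series aligned on the union of dates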
