How do I convert a data frame dataset to time series? - r

I seem to have some trouble converting my data frame data into a time series. I have a typical data set consisting of date, export quantity, GDP, FDI etc.
# A tibble: 252 x 10
Date `Maize Exports (m/t)` `Rainfall (mm)` `Temperature (C)` `Exchange rate (R/$)` `Maize price (R)` `FDI (Million R)` GDP (Million~1 Oil p~2 Infla~3
<dttm> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 2000-05-01 00:00:00 21000 30.8 14.4 0.144 678. 4337 9056 192. 5.1
2 2000-06-01 00:00:00 54000 14.9 14.0 0.147 583. -4229 9056 205. 5.1
3 2000-07-01 00:00:00 134000 11.1 12.6 0.144 518. -4229 8841 196. 5.9
4 2000-08-01 00:00:00 213000 6.1 15.3 0.143 526. -4229 8841 205. 6.8
5 2000-09-01 00:00:00 123000 38.5 17.8 0.138 576. 6315 8841 234. 6.8
6 2000-10-01 00:00:00 94000 61.9 20.1 0.132 636. 6315 4487 231. 7.1
7 2000-11-01 00:00:00 192000 93.9 19.9 0.129 685. 6315 4487 250. 7.1
8 2000-12-01 00:00:00 134000 85.6 22.3 0.132 747. -2143 4487 192. 7
9 2001-01-01 00:00:00 133000 92.4 23.4 0.0875 1066. -5651 7365 226. 5
10 2001-02-01 00:00:00 168000 51 22.0 0.0879 1042. -5651 7365 233. 5.9
I've installed the right packages (readxl), I've used the as.Date function to ensure my Date column is recognized as such, and I've used the as.ts function to convert the dataset. However, after using as.ts, the date column turns into seemingly random numbers instead of dates. What am I doing wrong? Please help!
Date Maize Exports (m/t) Rainfall (mm) Temperature (C) Exchange rate (R/$) Maize price (R) FDI (Million R) GDP (Million R) Oil prices (R/barrel)
[1,] 957139200 21000 30.8 14.36 0.1435235 677.88 4337 9056 192.35
[2,] 959817600 54000 14.9 13.96 0.1474926 583.48 -4229 9056 205.36
[3,] 962409600 134000 11.1 12.61 0.1437298 518.10 -4229 8841 196.38
[4,] 965088000 213000 6.1 15.27 0.1433075 525.59 -4229 8841 204.66
[5,] 967766400 123000 38.5 17.83 0.1382170 576.08 6315 8841 233.64
[6,] 970358400 94000 61.9 20.10 0.1322751 635.79 6315 4487 231.27

In short, nothing is wrong. While this response should really be a comment, I wanted to use a full answer to have a bit more space to explain.
Behind each date is a numeric value tethered to an origin, so this is just R's way of handling it. And since you originally imported from Excel, those origins may not line up if you try to cross-check the values (see below).
You didn't make your question reproducible, but I put some similar data together to demonstrate what's going on:
Data
df <- data.frame(date = as.Date(c("2000-05-01",
"2000-06-01",
"2000-07-01",
"2000-08-01",
"2000-09-01",
"2000-10-01",
"2000-11-01")),
maize = c(21, 54, 132, 213, 123, 94, 192) * 1000,
rainfall = c(30, 14, 11, 6, 38, 61, 93))
tb <- tidyr::as_tibble(df)
Turning this into a time series object using as.ts()
tb_ts <- as.ts(tb)
# Time Series:
# Start = 1
# End = 7
# Frequency = 1
# date maize rainfall
# 1 11078 21000 30
# 2 11109 54000 14
# 3 11139 132000 11
# 4 11170 213000 6
# 5 11201 123000 38
# 6 11231 94000 61
# 7 11262 192000 93
Since I created these data in R, the "origin" is January 1, 1970. The numbers in the date column of the time series object are days since that origin, so we can convert them back into dates:
as.Date(tb_ts[1:7], origin = '1970-01-01')
# [1] "2000-05-01" "2000-06-01" "2000-07-01" "2000-08-01"
# [5] "2000-09-01" "2000-10-01" "2000-11-01"
Note that if you import data from Excel, Excel's origin is December 30th, 1899 (i.e., as.Date(xx, origin = "1899-12-30")). Applying that origin to the values above, which were created in R, gives the wrong dates:
as.Date(tb_ts[1:7], origin = "1899-12-30")
# [1] "1930-04-30" "1930-05-31" "1930-06-30" "1930-07-31"
# [5] "1930-08-31" "1930-09-30" "1930-10-31"
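If the goal is a monthly time series rather than a matrix that happens to contain a numeric date column, one option is to leave the date column out and give ts() the start and frequency instead. This is a minimal sketch, assuming the observations are regularly spaced monthly values starting in May 2000; the name monthly_ts is only illustrative.

```r
# Same toy data as above, built with base R only
df <- data.frame(
  date     = seq(as.Date("2000-05-01"), by = "month", length.out = 7),
  maize    = c(21, 54, 132, 213, 123, 94, 192) * 1000,
  rainfall = c(30, 14, 11, 6, 38, 61, 93)
)

# Drop the date column; the time index is described by start and frequency
monthly_ts <- ts(df[, c("maize", "rainfall")],
                 start = c(2000, 5),   # year 2000, 5th period (May)
                 frequency = 12)       # 12 observations per year

start(monthly_ts)
end(monthly_ts)
frequency(monthly_ts)
```

The time index then lives in the attributes of the ts object, so no date column needs to be converted.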

The function worked as it's supposed to. Keeping the date format you're familiar with isn't practical for computation, so R stores dates as a number, usually the number of days (or minutes or seconds) since a fixed origin, usually Jan. 1 1970. For example, here is a little set to make the point:
# a test vector of dates
> del1 <- seq(as.Date("2012-04-01"), length.out=4, by=30)
# looks like
> del1
[1] "2012-04-01" "2012-05-01" "2012-05-31" "2012-06-30"
# use the as.ts
> as.ts(del1)
Time Series:
Start = 1
End = 4
Frequency = 1
[1] 15431 15461 15491 15521
So you can see that the dates, which are 30 days apart, are converted to a series of integers that differ by 30.
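The much larger values in the question's output follow the same logic but count seconds instead of days, because that column was a date-time (<dttm>, i.e. POSIXct) rather than a Date. A small check, assuming the times are in UTC:

```r
# A Date is stored as days since 1970-01-01
d <- as.Date("2000-05-01")
as.numeric(d)
# [1] 11078

# A POSIXct date-time is stored as seconds since 1970-01-01 00:00:00 UTC
p <- as.POSIXct("2000-05-01 00:00:00", tz = "UTC")
as.numeric(p)
# [1] 957139200

# Dividing by the number of seconds per day recovers the day count
as.numeric(p) / 86400 == as.numeric(d)
# [1] TRUE
```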

Related

Replace from NA to random values

I want to replace NA values with random values. This data frame has columns like "DayOfWeek", and I don't know how to complete it. I tried the missForest function, but I think it only works on integer columns. Do you have any idea how I can complete all of the columns?
travel <- read.csv("https://openmv.net/file/travel-times.csv")
library(missForest)
summary(travel)
set.seed(82)
travel1 <- prodNA(travel, noNA = 0.2)
travel2 <- missForest(travel1)
You can use the imputeTS package for inserting random values to your time series. The function na_random can be used for this. The function can be used for numeric columns (the other columns will be left untouched, which might be useful, since you probably do not need random texts for the comments column)
You can call
library("imputeTS")
na_random(yourData)
and the function will look for the lowest and highest value of each column and insert random values between these bounds for you.
But you can also define your own bounds for the random values like this:
library("imputeTS")
na_random(yourData, lower_bound = 0, upper_bound = 25)
For your data this could look like this:
library("imputeTS")
# To read the input correctly and have the right data types
travel <- read.csv("https://openmv.net/file/travel-times.csv", na.strings = "")
travel$FuelEconomy <- as.numeric(travel$FuelEconomy)
# To perform the missing data replacement
travel <- na_random(travel)
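To make the idea concrete without any package, here is a base R sketch of the same kind of replacement: numeric NAs are filled with uniform random draws between the column's observed minimum and maximum, and other columns are left alone. The helper fill_uniform and the toy data are hypothetical, and this is not imputeTS's actual implementation.

```r
# Hypothetical helper: fill NAs in a numeric vector with uniform random values
fill_uniform <- function(x) {
  if (!is.numeric(x)) return(x)          # leave non-numeric columns untouched
  is_na <- is.na(x)
  lower <- min(x, na.rm = TRUE)          # observed lower bound
  upper <- max(x, na.rm = TRUE)          # observed upper bound
  x[is_na] <- runif(sum(is_na), min = lower, max = upper)
  x
}

# Toy data standing in for the travel data set
toy <- data.frame(
  speed = c(80, NA, 90, NA, 100),
  day   = c("Mon", NA, "Tue", "Wed", NA)
)

set.seed(82)
filled <- data.frame(lapply(toy, fill_uniform))
filled
```

The NAs in the text column are still there afterwards, which matches the behaviour described above for non-numeric columns.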
First, if you want to read "" strings as NAs, you need the additional argument na.strings = "" in read.csv. Then, do you mean replacing each NA in a variable with another, randomly chosen observation of the same variable? If so, consider the following procedure:
travel <- read.csv("https://openmv.net/file/travel-times.csv", na.strings = "")
set.seed(82)
res <- data.frame(lapply(travel, function(x) {
is_na <- is.na(x)
replace(x, is_na, sample(x[!is_na], sum(is_na), replace = TRUE))
}))
res looks like this
Date StartTime DayOfWeek GoingTo Distance MaxSpeed AvgSpeed AvgMovingSpeed FuelEconomy TotalTime MovingTime Take407All Comments
1 1/6/2012 16:37 Friday Home 51.29 127.4 78.3 84.8 8.5 39.3 36.3 No Medium amount of rain
2 1/6/2012 08:20 Friday GSK 51.63 130.3 81.8 88.9 8.5 37.9 34.9 No Put snow tires on
3 1/4/2012 16:17 Wednesday Home 51.27 127.4 82.0 85.8 8.5 37.5 35.9 No Heavy rain
4 1/4/2012 07:53 Wednesday GSK 49.17 132.3 74.2 82.9 8.31 39.8 35.6 No Accident blocked 407 exit
5 1/3/2012 18:57 Tuesday Home 51.15 136.2 83.4 88.1 9.08 36.8 34.8 No Rain, rain, rain
6 1/3/2012 07:57 Tuesday GSK 51.80 135.8 84.5 88.8 8.37 36.8 35.0 No Backed up at Bronte
7 1/2/2012 17:31 Monday Home 51.37 123.2 82.9 87.3 - 37.2 35.3 No Pumped tires up: check fuel economy improved?
8 1/2/2012 07:34 Monday GSK 49.01 128.3 77.5 85.9 - 37.9 34.3 No Pumped tires up: check fuel economy improved?
9 12/23/2011 08:01 Friday GSK 52.91 130.3 80.9 88.3 8.89 39.3 36.0 No Police slowdown on 403
10 12/22/2011 17:19 Thursday Home 51.17 122.3 70.6 78.1 8.89 43.5 39.3 No Start early to run a batch

Group by weekly data and summarize by month in R with dplyr

I have a dataset of weekly mortgage rate data.
The data looks very simple:
library(tibble)
library(lubridate)
df <- tibble(
Date = as_date(c("2/7/2008", "2/14/2008", "2/21/2008", "2/28/2008", "3/6/2008"), format = "%m/%d/%Y"),
Rate = c(5.67, 5.72, 6.04, 6.24, 6.03)
)
I am trying to group it and summarize by month.
This blogpost and this answer are not what I want, because they just add the month column.
They give me the output:
month Date summary_variable
2008-02-01 2008-02-07 5.67
2008-02-01 2008-02-14 5.72
2008-02-01 2008-02-21 6.04
2008-02-01 2008-02-28 6.24
My desired output (ideally the last day of the month):
Month Average rate
2/28/2008 6
3/31/2008 6.1
4/30/2008 5.9
In the output above I put random numbers, not real calculations.
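We can convert each date to the last day of its month (the second argument of as.Date on a yearmon object, frac = 1, selects the end of the month), use that as the grouping column, and take the mean per group. Here df1 is my copy of the data, with columns DATE and MORTGAGE30US: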
library(dplyr)
library(lubridate)
library(zoo)
df1 %>%
group_by(Month = as.Date(as.yearmon(mdy(DATE)), 1)) %>%
summarise(Average_rate = mean(MORTGAGE30US))
-output
# A tibble: 151 x 2
# Month Average_rate
# <date> <dbl>
# 1 2008-02-29 5.92
# 2 2008-03-31 5.97
# 3 2008-04-30 5.92
# 4 2008-05-31 6.04
# 5 2008-06-30 6.32
# 6 2008-07-31 6.43
# 7 2008-08-31 6.48
# 8 2008-09-30 6.04
# 9 2008-10-31 6.2
#10 2008-11-30 6.09
# … with 141 more rows
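For the five rows given in the question, the same grouping can also be sketched with lubridate alone. This is one possible variant, assuming the question's column names Date and Rate; ceiling_date() moves each date to the first day of the following month, and subtracting one day lands on the month's last day.

```r
library(dplyr)
library(lubridate)

# The sample rows from the question
df <- tibble(
  Date = mdy(c("2/7/2008", "2/14/2008", "2/21/2008", "2/28/2008", "3/6/2008")),
  Rate = c(5.67, 5.72, 6.04, 6.24, 6.03)
)

monthly <- df %>%
  # first day of the next month, minus one day = last day of this month
  group_by(Month = ceiling_date(Date, "month") - days(1)) %>%
  summarise(Average_rate = mean(Rate))

monthly
```

With this sample the result has one row per month, dated 2008-02-29 and 2008-03-31.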

How to create a matrix with time in columns and date in rows for a long time-data frame?

I'm new here so...
I have a data frame with two variables (R is new to me; I used Matlab for a long time). One is a classic POSIXlt column of timestamps with 30 minutes between each data point. The second one is the data itself (for example, air temperature) and has the same length as the time vector. I used this pair to get nice plots.
I want to reshape data using time in this fashion: I want to sort the data using days in the row-direction and time (up to 48 columns, using the 30-minute interval between 0:00 and 23:30) in the column-direction, to use this data in another R package to fill missing data.
>> head(data_f, 10)
time data
1 2013-08-01 00:30:00 8.001
2 2013-08-01 01:00:00 7.918
3 2013-08-01 01:30:00 7.621
4 2013-08-01 02:00:00 7.564
5 2013-08-01 02:30:00 7.718
6 2013-08-01 03:00:00 7.846
7 2013-08-01 03:30:00 7.481
8 2013-08-01 04:00:00 7.351
9 2013-08-01 04:30:00 7.275
10 2013-08-01 05:00:00 7.291
More data
48 2013-08-02 00:00:00 9.372
49 2013-08-02 00:30:00 9.485
50 2013-08-02 01:00:00 9.151
51 2013-08-02 01:30:00 8.870
52 2013-08-02 02:00:00 8.504
53 2013-08-02 02:30:00 8.404
54 2013-08-02 03:00:00 8.342
55 2013-08-02 03:30:00 8.278
56 2013-08-02 04:00:00 8.229
57 2013-08-02 04:30:00 8.163
58 2013-08-02 05:00:00 8.092
59 2013-08-02 05:30:00 8.038
I want an ideally rectangular output (could be a matrix instead of a data frame), with NAs where there is no data available for that time. Something like this:
(30-min span in this direction -->)
2013-08-01 NA 8.001 7.918 7.621 7.564 7.718 7.846 7.481 7.351 7.275 7.291 ...
2013-08-02 9.372 9.485 9.151 8.870 8.504 8.404 8.342 8.278 8.229 8.092 8.038 ...
2013-08-03 ... ... ... ... ... ... ... ... ... ... ... ...
2013-08-04 ... ... ... ... ... ... ... ... ... ... ... ...
...
...
I have tried porting a Matlab function (which I wrote myself) to accomplish this, but without success, because of the way R interprets dates and times.
Update: How to generate data. (Consider that original data is from a 7-yr database from my work)
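library(lubridate)
time_seq = seq(from = as_datetime("2013-08-01 00:30:00"),
to = as_datetime("2013-10-12 18:00:00"),
by = "30 min")
# Draw one value per timestamp so both columns always have the same length
data_f = data.frame(time = time_seq,
data = runif(length(time_seq), 2, 14))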
Thanks in advance.
One approach you could follow is separating date and time and then reshaping the data. Here is the code with tidyverse functions:
#Data
df <- structure(list(time = structure(c(1375317000, 1375318800, 1375320600,
1375322400, 1375324200, 1375326000, 1375327800, 1375329600, 1375331400,
1375333200, 1375401600, 1375403400, 1375405200, 1375407000, 1375408800,
1375410600, 1375412400, 1375414200, 1375416000, 1375417800, 1375419600,
1375421400), class = c("POSIXct", "POSIXt"), tzone = "GMT"),
data = c(8.001, 7.918, 7.621, 7.564, 7.718, 7.846, 7.481,
7.351, 7.275, 7.291, 9.372, 9.485, 9.151, 8.87, 8.504, 8.404,
8.342, 8.278, 8.229, 8.163, 8.092, 8.038)), class = "data.frame", row.names = c(NA,
-22L))
Code:
#Split and reshape
library(dplyr)
library(tidyr)
df %>% separate(time,into = c('V1','V2'),sep = ' ') %>%
pivot_wider(names_from = V2,values_from=data)
Output:
# A tibble: 2 x 13
V1 `00:30:00` `00:59:59` `01:30:00` `02:00:00` `02:29:59` `03:00:00` `03:30:00` `03:59:59` `04:30:00`
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 2013~ 8.00 7.92 7.62 7.56 7.72 7.85 7.48 7.35 7.28
2 2013~ 9.48 9.15 8.87 8.50 8.40 8.34 8.28 8.23 8.16
# ... with 3 more variables: `05:00:00` <dbl>, `00:00:00` <dbl>, `05:29:59` <dbl>
The new columns appear in the order in which each time first occurs in the data, so you may want to rearrange them afterwards.
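If a plain matrix with all 48 half-hour columns in clock order is preferred, a base R alternative is tapply() with a factor whose levels list every time slot. This is a sketch under the assumption that timestamps fall exactly on the half hour and that the data are in UTC; the names slots, day, slot and wide are illustrative.

```r
# Small stand-in for the question's data: 60 half-hourly readings
data_f <- data.frame(
  time = seq(as.POSIXct("2013-08-01 00:30:00", tz = "UTC"),
             by = "30 min", length.out = 60),
  data = seq_len(60)
)

# All 48 possible slots of a day, "00:00" ... "23:30", in clock order
slots <- format(seq(as.POSIXct("2000-01-01 00:00:00", tz = "UTC"),
                    by = "30 min", length.out = 48), "%H:%M")

day  <- format(data_f$time, "%Y-%m-%d", tz = "UTC")
slot <- factor(format(data_f$time, "%H:%M", tz = "UTC"), levels = slots)

# One row per day, one column per slot; empty cells are filled with NA
wide <- tapply(data_f$data, list(day, slot), function(v) v[1])

dim(wide)
wide[, 1:4]
```

Because the column order comes from the factor levels, missing slots such as 00:00 on the first day show up as NA columns in the right position instead of being dropped.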

Use time values for x-axis labels

I have some climate data with temperature and humidity as well as a timestamp which is transformed to the time in %H:%M.
When using ggplot2 for visualization, the time values get sorted, which scrambles the order of the measurements: the first measurement was taken at 14:00 (2pm) and the last one at 10:27 (10:27am) the following day.
How do I prevent ggplot2 from sorting the x-values? (see plot)
MVE:
library(tidyverse)
df = read_csv('./climate_stats_incl_time.csv')
colnames(df)[1] <- c('sample')
head(df)
tail(df)
ggplot(data=df, mapping=aes(x=time)) +
geom_line(aes(y=temperature, color='red')) +
geom_line(aes(y=humidity, color='blue'))
> head(df)
# A tibble: 6 x 5
sample timestamp temperature humidity time
<dbl> <dbl> <dbl> <dbl> <drtn>
1 0 1581253210. 21.9 47.6 14:00
2 1 1581253275. 21.7 47.8 14:01
3 2 1581253336. 21.7 47.8 14:02
4 3 1581253397. 21.8 47.8 14:03
5 4 1581253457. 21.7 47.8 14:04
6 5 1581253520. 21.8 47.8 14:05
> tail(df)
# A tibble: 6 x 5
sample timestamp temperature humidity time
<dbl> <dbl> <dbl> <dbl> <drtn>
1 1203 1581326567. 19.1 49.8 10:22
2 1204 1581326628. 19.1 49.7 10:23
3 1205 1581326688. 19.1 49.9 10:24
4 1206 1581326749. 19.1 49.9 10:25
5 1207 1581326812. 19.1 49.7 10:26
6 1208 1581326873. 19.1 49.8 10:27
Format your timestamps to a proper date-time (assuming the origin is 1970):
df$date_time <- as.POSIXct(df$timestamp, origin="1970-01-01", tz = "GMT")
Then use this new date_time variable instead of time for plotting
Edit:
I accidentally submitted a wrong solution (I reformatted the date-time to a date). Now the solution should work for your problem (i.e. it makes a date-time!)
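Put together, the plot could look roughly like this. The data frame below is a hypothetical stand-in for the question's CSV, and scale_x_datetime(date_labels = "%H:%M") is one way to keep clock-time labels on a continuous date-time axis.

```r
library(ggplot2)

# Hypothetical stand-in for the question's data: one reading per minute
df <- data.frame(
  timestamp   = 1581253210 + 60 * 0:9,
  temperature = seq(21.9, 21.0, length.out = 10),
  humidity    = seq(47.6, 48.5, length.out = 10)
)

# Seconds since 1970-01-01 converted to a proper date-time
df$date_time <- as.POSIXct(df$timestamp, origin = "1970-01-01", tz = "GMT")

p <- ggplot(df, aes(x = date_time)) +
  geom_line(aes(y = temperature, color = "temperature")) +
  geom_line(aes(y = humidity, color = "humidity")) +
  scale_x_datetime(date_labels = "%H:%M")   # show only hours and minutes
p
```

In GMT the first timestamp formats as 13:00, while the question's time column shows 14:00, which suggests the readings were recorded in a time zone one hour ahead; the tz argument may need adjusting accordingly.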
A workaround
df %>%
mutate(orig_seq = seq(1,nrow(df),1)) %>%
# group = 1 joins the points into one line, since the reordered x is a factor
ggplot(mapping=aes(x=reorder(time, orig_seq), group=1)) +
geom_line(aes(y=temperature, color='red')) +
geom_line(aes(y=humidity, color='blue'))

How to reverse the order of a dataframe in R

I've endlessly looked for this and somehow nothing has solved this simple problem.
I have a dataframe called Prices in which there are 4 columns, one of which is a list of historical dates - the other 3 are lists of prices for products.
1 10/10/2016 53.14 50.366 51.87
2 07/10/2016 51.93 49.207 50.38
3 06/10/2016 52.51 49.655 50.98
4 05/10/2016 51.86 49.076 50.38
5 04/10/2016 50.87 48.186 49.3
6 03/10/2016 50.89 48.075 49.4
7 30/09/2016 50.19 47.384 48.82
8 29/09/2016 49.81 46.924 48.4
9 28/09/2016 49.24 46.062 47.65
10 27/09/2016 46.52 43.599 45.24
The list is 252 prices long. How can I have my output stored with the latest date at the bottom of the list and the corresponding prices listed with the latest prices at the bottom of the list?
Another tidyverse solution and I think the simplest one is:
df %>% map_df(rev)
or using just purrr::map_df we can do map_df(df, rev).
If you just want to reverse the order of the rows in a dataframe, you can do the following:
df<- df[seq(dim(df)[1],1),]
Just for completeness' sake: there is actually no need to call seq here. You can just use R's : operator:
### Create some sample data
n=252
sampledata<-data.frame(a=sample(letters,n,replace=TRUE),b=rnorm(n,1,0.7),
c=rnorm(n,1,0.6),d=runif(n))
### Compare some different ways to reorder the dataframe
myfun1<-function(df=sampledata){df<-df[seq(nrow(df),1),]}
myfun2<-function(df=sampledata){df<-df[seq(dim(df)[1],1),]}
myfun3<-function(df=sampledata){df<-df[dim(df)[1]:1,]}
myfun4<-function(df=sampledata){df<-df[nrow(df):1,]}
### Microbenchmark the functions
microbenchmark::microbenchmark(myfun1(),myfun2(),myfun3(),myfun4(),times=1000L)
Unit: microseconds
expr min lq mean median uq max neval
myfun1() 63.994 67.686 117.61797 71.3780 87.3765 5818.494 1000
myfun2() 63.173 67.686 99.29120 70.9680 87.7865 2299.258 1000
myfun3() 56.610 60.302 92.18913 62.7635 76.9155 3241.522 1000
myfun4() 56.610 60.302 99.52666 63.1740 77.5310 4440.582 1000
The fastest way in my trial here was to use df<-df[dim(df)[1]:1,]. However, using nrow instead of dim is only slightly slower, which makes this a question of personal preference.
Using seq here definitely slows the process down.
UPDATE September 2018:
From a speed perspective there is little reason to use dplyr here. For maybe 90% of users the base R functionality should suffice. The other 10% need dplyr for querying a database or for translating code into another language.
## hmhensen's function
dplyr_fun<-function(df=sampledata){df %>% arrange(rev(rownames(.)))}
microbenchmark::microbenchmark(myfun3(),myfun4(),dplyr_fun(),times=1000L)
Unit: microseconds
expr min lq mean median uq max neval
myfun3() 55.8 69.75 132.8178 103.85 139.95 8949.3 1000
myfun4() 55.9 68.40 115.6418 100.05 135.00 2409.1 1000
dplyr_fun() 1364.8 1541.15 2173.0717 1786.10 2757.80 8434.8 1000
Yet another tidyverse solution is:
df %>% arrange(desc(row_number()))
Another option is to order the data frame by the column you want to sort it by:
> data[order(data$Date), ]
# A tibble: 10 x 4
Date priceA priceB priceC
<dttm> <dbl> <dbl> <dbl>
1 2016-09-27 00:00:00 46.5 43.6 45.2
2 2016-09-28 00:00:00 49.2 46.1 47.6
3 2016-09-29 00:00:00 49.8 46.9 48.4
4 2016-09-30 00:00:00 50.2 47.4 48.8
5 2016-10-03 00:00:00 50.9 48.1 49.4
6 2016-10-04 00:00:00 50.9 48.2 49.3
7 2016-10-05 00:00:00 51.9 49.1 50.4
8 2016-10-06 00:00:00 52.5 49.7 51.0
9 2016-10-07 00:00:00 51.9 49.2 50.4
10 2016-10-10 00:00:00 53.1 50.4 51.9
Then, if you want to flip the order, reverse it:
> data[rev(order(data$Date)), ]
# A tibble: 10 x 4
Date priceA priceB priceC
<dttm> <dbl> <dbl> <dbl>
1 2016-10-10 00:00:00 53.1 50.4 51.9
2 2016-10-07 00:00:00 51.9 49.2 50.4
3 2016-10-06 00:00:00 52.5 49.7 51.0
4 2016-10-05 00:00:00 51.9 49.1 50.4
5 2016-10-04 00:00:00 50.9 48.2 49.3
6 2016-10-03 00:00:00 50.9 48.1 49.4
7 2016-09-30 00:00:00 50.2 47.4 48.8
8 2016-09-29 00:00:00 49.8 46.9 48.4
9 2016-09-28 00:00:00 49.2 46.1 47.6
10 2016-09-27 00:00:00 46.5 43.6 45.2
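Note that order() only sorts chronologically if the column holds real dates. The dates in the question look like day/month/year text, which would otherwise be sorted alphabetically. A minimal sketch, assuming that format; the three-row prices data frame and the column name priceA are only illustrative.

```r
# A few rows in the question's layout, newest first, dates stored as text
prices <- data.frame(
  Date   = c("10/10/2016", "07/10/2016", "30/09/2016"),
  priceA = c(53.14, 51.93, 50.19)
)

# Parse day/month/year text into Date objects before sorting
prices$Date <- as.Date(prices$Date, format = "%d/%m/%Y")

# Oldest first, so the latest date ends up at the bottom
prices <- prices[order(prices$Date), ]
prices
```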
If you wanted to do this in base R use:
df <- df[rev(seq_len(nrow(df))), , drop = FALSE]
All other base R solutions posted here will have problems in the edge cases of zero row data frames (seq(0,1) == c(0, 1), that's why we use seq_len) or single column data frames (data.frame(a=7:9)[3:1,] == 9:7, that's why we use , drop = FALSE).
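The two edge cases can be checked directly. In this sketch rev_rows is a hypothetical wrapper around the line above, and one_col and empty are toy inputs.

```r
# Hypothetical wrapper around the base R one-liner
rev_rows <- function(df) df[rev(seq_len(nrow(df))), , drop = FALSE]

one_col <- data.frame(a = 7:9)          # single-column data frame
empty   <- one_col[0, , drop = FALSE]   # zero-row data frame

# The result is still a data frame, with the rows reversed
rev_rows(one_col)
is.data.frame(rev_rows(one_col))

# A zero-row input stays a zero-row data frame
nrow(rev_rows(empty))

# Without drop = FALSE the single column collapses to a plain vector
is.data.frame(one_col[3:1, ])
```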
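If you want to stick with base R, you could also use lapply(). Note that do.call(cbind, ...) returns a matrix rather than a data frame, so columns of mixed types are coerced to a common type: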
do.call(cbind, lapply(df, rev))
