I have some climate data with temperature and humidity as well as a timestamp, which is transformed to a time of day in %H:%M format.
When I plot it with ggplot2, the time values get sorted, which scrambles the order of measurements: the first measurement was taken at 14:00 (2 pm) and the last one at 10:27 (10:27 am) the following day.
How do I prevent ggplot2 from sorting the x-values? (see plot)
MVE:
library(tidyverse)
df = read_csv('./climate_stats_incl_time.csv')
colnames(df)[1] <- c('sample')
head(df)
tail(df)
ggplot(data=df, mapping=aes(x=time)) +
geom_line(aes(y=temperature, color='red')) +
geom_line(aes(y=humidity, color='blue'))
> head(df)
# A tibble: 6 x 5
sample timestamp temperature humidity time
<dbl> <dbl> <dbl> <dbl> <drtn>
1 0 1581253210. 21.9 47.6 14:00
2 1 1581253275. 21.7 47.8 14:01
3 2 1581253336. 21.7 47.8 14:02
4 3 1581253397. 21.8 47.8 14:03
5 4 1581253457. 21.7 47.8 14:04
6 5 1581253520. 21.8 47.8 14:05
> tail(df)
# A tibble: 6 x 5
sample timestamp temperature humidity time
<dbl> <dbl> <dbl> <dbl> <drtn>
1 1203 1581326567. 19.1 49.8 10:22
2 1204 1581326628. 19.1 49.7 10:23
3 1205 1581326688. 19.1 49.9 10:24
4 1206 1581326749. 19.1 49.9 10:25
5 1207 1581326812. 19.1 49.7 10:26
6 1208 1581326873. 19.1 49.8 10:27
Format your timestamps to a proper date-time (assuming the origin is 1970):
df$date_time <- as.POSIXct(df$timestamp, origin="1970-01-01", tz = "GMT")
Then use this new date_time variable instead of time for plotting
Edit:
I accidentally submitted a wrong solution (I re-formatted the date-time to a date). The solution should now work for your problem (i.e. it makes a date-time!)
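As a minimal sketch of why this works, the snippet below rebuilds four of the question's readings (the first two and the last two rows) and converts the epoch seconds to POSIXct. The axis label format and the requireNamespace() guard are my own additions, not part of the answer above, and the clock times are computed in GMT, so they can differ from the local times shown in the question.

```r
# Four readings taken from the question's head() and tail() output.
df <- data.frame(
  timestamp   = c(1581253210, 1581253275, 1581326812, 1581326873),
  temperature = c(21.9, 21.7, 19.1, 19.1),
  humidity    = c(47.6, 47.8, 49.7, 49.8)
)

# Epoch seconds -> POSIXct keeps the date, so next-day readings stay last.
df$date_time <- as.POSIXct(df$timestamp, origin = "1970-01-01", tz = "GMT")

# A clock-time label alone drops the date, so sorting it breaks the order.
df$time <- format(df$date_time, "%H:%M")
print(df[, c("date_time", "time")])

# Plot only if ggplot2 is installed (guard added for this sketch).
if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  p <- ggplot(df, aes(x = date_time)) +
    geom_line(aes(y = temperature, colour = "temperature")) +
    geom_line(aes(y = humidity, colour = "humidity")) +
    scale_x_datetime(date_labels = "%H:%M")  # show clock time on the axis
  print(p)
}
```

The x axis is now continuous in real time, while scale_x_datetime() still labels it with hours and minutes only.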
A workaround
df %>%
mutate(orig_seq = seq(1,nrow(df),1)) %>%
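# reorder() turns time into a factor ordered by orig_seq;
# group = 1 lets geom_line() connect points on the resulting discrete axis
ggplot(mapping=aes(x=reorder(time, orig_seq), group=1)) +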
geom_line(aes(y=temperature, color='red')) +
geom_line(aes(y=humidity, color='blue'))
Related
I seem to have some trouble converting my data frame data into a time series. I have a typical data set consisting of date, export quantity, GDP, FDI etc.
# A tibble: 252 x 10
Date `Maize Exports (m/t)` `Rainfall (mm)` `Temperature ©` `Exchange rate (R/$)` `Maize price (R)` `FDI (Million R)` GDP (Million~1 Oil p~2 Infla~3
<dttm> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 2000-05-01 00:00:00 21000 30.8 14.4 0.144 678. 4337 9056 192. 5.1
2 2000-06-01 00:00:00 54000 14.9 14.0 0.147 583. -4229 9056 205. 5.1
3 2000-07-01 00:00:00 134000 11.1 12.6 0.144 518. -4229 8841 196. 5.9
4 2000-08-01 00:00:00 213000 6.1 15.3 0.143 526. -4229 8841 205. 6.8
5 2000-09-01 00:00:00 123000 38.5 17.8 0.138 576. 6315 8841 234. 6.8
6 2000-10-01 00:00:00 94000 61.9 20.1 0.132 636. 6315 4487 231. 7.1
7 2000-11-01 00:00:00 192000 93.9 19.9 0.129 685. 6315 4487 250. 7.1
8 2000-12-01 00:00:00 134000 85.6 22.3 0.132 747. -2143 4487 192. 7
9 2001-01-01 00:00:00 133000 92.4 23.4 0.0875 1066. -5651 7365 226. 5
10 2001-02-01 00:00:00 168000 51 22.0 0.0879 1042. -5651 7365 233. 5.9
I've installed the right packages (readxl), I've used the as.Date function to ensure my Date is recognized as such, and I've used the as.ts function to convert the dataset. However, after using the as.ts function, the date column is muddled up into a seemingly random number and is not a date anymore. What am I doing wrong? Please help!
Date Maize Exports (m/t) Rainfall (mm) Temperature © Exchange rate (R/$) Maize price (R) FDI (Million R) GDP (Million R) Oil prices (R/barrel)
[1,] 957139200 21000 30.8 14.36 0.1435235 677.88 4337 9056 192.35
[2,] 959817600 54000 14.9 13.96 0.1474926 583.48 -4229 9056 205.36
[3,] 962409600 134000 11.1 12.61 0.1437298 518.10 -4229 8841 196.38
[4,] 965088000 213000 6.1 15.27 0.1433075 525.59 -4229 8841 204.66
[5,] 967766400 123000 38.5 17.83 0.1382170 576.08 6315 8841 233.64
[6,] 970358400 94000 61.9 20.10 0.1322751 635.79 6315 4487 231.27
In short, nothing is wrong. This response should really be a comment, but I wanted to use a full answer to have a bit more space to explain.
Behind each date is a numeric value tethered to an origin, so this is just R's way of handling it: a time series object stores plain numbers, so the date class is dropped and only the underlying count remains. And since you imported from Excel originally, those origins may not line up if you tried to cross-check it (see below).
You didn't make your question reproducible, but I put some similar data together to demonstrate what's going on:
Data
df <- data.frame(date = as.Date(c("2000-05-01",
"2000-06-01",
"2000-07-01",
"2000-08-01",
"2000-09-01",
"2000-10-01",
"2000-11-01")),
maize = c(21, 54, 132, 213, 123, 94, 192) * 1000,
rainfall = c(30, 14, 11, 6, 38, 61, 93))
tb <- tidyr::as_tibble(df)
Turning this into a time series object using as.ts()
tb_ts <- as.ts(tb)
# Time Series:
# Start = 1
# End = 7
# Frequency = 1
# date maize rainfall
# 1 11078 21000 30
# 2 11109 54000 14
# 3 11139 132000 11
# 4 11170 213000 6
# 5 11201 123000 38
# 6 11231 94000 61
# 7 11262 192000 93
Since I created these data in R, the "origin" is January 1, 1970, and we can see this in numerical dates from the time series object and convert them back into date formats:
as.Date(tb_ts[1:7], origin = '1970-01-01')
# [1] "2000-05-01" "2000-06-01" "2000-07-01" "2000-08-01"
# [5] "2000-09-01" "2000-10-01" "2000-11-01"
Note that Excel stores dates as day counts from a different origin, December 30th, 1899 (i.e., as.Date(xx, origin = "1899-12-30")). Applying that origin to the numbers above, which count from 1970, gives the wrong dates:
as.Date(tb_ts[1:7], origin = "1899-12-30")
# [1] "1930-04-30" "1930-05-31" "1930-06-30" "1930-07-31"
# [5] "1930-08-31" "1930-09-30" "1930-10-31"
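The numbers in the question's output (957139200 and so on) are much larger than the day counts above because that Date column was a date-time (<dttm>), which R stores as seconds rather than days since 1970-01-01. A quick check, assuming the timestamps are in UTC:

```r
# First value of the Date column after as.ts() in the question.
secs <- 957139200

# Interpreted as seconds since 1970-01-01 UTC it is the original date-time.
date_time <- as.POSIXct(secs, origin = "1970-01-01", tz = "UTC")
print(date_time)

# The same day stored as a Date is a count of days since 1970-01-01.
days <- as.numeric(as.Date("2000-05-01"))
print(days)

# Both numbers describe the same instant: days * seconds per day.
print(days * 86400 == secs)
```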
The function worked as it's supposed to. A time series object cannot keep the formatted date you're familiar with, so it converts each date to its underlying number, typically the number of days (or seconds) since an origin, usually Jan. 1, 1970. For example, here is a small set to make the point:
# a test vector of dates
> del1 <- seq(as.Date("2012-04-01"), length.out=4, by=30)
# looks like
> del1
[1] "2012-04-01" "2012-05-01" "2012-05-31" "2012-06-30"
# use the as.ts
> as.ts(del1)
Time Series:
Start = 1
End = 4
Frequency = 1
[1] 15431 15461 15491 15521
So you can see the dates, which are 30 days apart, are converted to a series of values that are 30 integers apart.
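If the goal is a monthly series that still knows its calendar position, one common approach is to leave the date column out and describe the time axis through the start and frequency arguments of ts() instead. A sketch using the maize figures from the sample data in the previous answer (the object name maize_ts is just for illustration):

```r
# Monthly maize exports from May 2000 onward (values from the sample data).
maize <- c(21, 54, 132, 213, 123, 94, 192) * 1000

# start = c(year, month) together with frequency = 12 encodes the dates,
# so no separate date column is needed.
maize_ts <- ts(maize, start = c(2000, 5), frequency = 12)
print(maize_ts)

# The series knows where it begins and ends.
print(start(maize_ts))
print(end(maize_ts))
```

Printing a monthly ts lays the values out by year and month, which is usually what people expect when they reach for as.ts().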
I have a dataset of weekly mortgage rate data.
The data looks very simple:
library(tibble)
library(lubridate)
df <- tibble(
Date = as_date(c("2/7/2008", "2/14/2008", "2/21/2008", "2/28/2008", "3/6/2008"), format = "%m/%d/%Y"),
Rate = c(5.67, 5.72, 6.04, 6.24, 6.03)
)
I am trying to group it and summarize by month.
This blogpost and this answer are not what I want, because they just add the month column.
They give me the output:
month Date summary_variable
2008-02-01 2008-02-07 5.67
2008-02-01 2008-02-14 5.72
2008-02-01 2008-02-21 6.04
2008-02-01 2008-02-28 6.24
My desired output (ideally the last day of the month):
Month Average rate
2/28/2008 6
3/31/2008 6.1
4/30/2008 5.9
In the output above I put random numbers, not real calculations.
We can get the month extracted as column and do a group by mean
library(dplyr)
library(lubridate)
library(zoo)
df1 %>%
group_by(Month = as.Date(as.yearmon(mdy(DATE)), 1)) %>%
summarise(Average_rate = mean(MORTGAGE30US))
-output
# A tibble: 151 x 2
# Month Average_rate
# <date> <dbl>
# 1 2008-02-29 5.92
# 2 2008-03-31 5.97
# 3 2008-04-30 5.92
# 4 2008-05-31 6.04
# 5 2008-06-30 6.32
# 6 2008-07-31 6.43
# 7 2008-08-31 6.48
# 8 2008-09-30 6.04
# 9 2008-10-31 6.2
#10 2008-11-30 6.09
# … with 141 more rows
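For comparison, here is a base R sketch of the same idea applied to the small df from the question, without zoo or lubridate. The helper month_end() is hypothetical (written for this example), and the data frame is rebuilt with as.Date() so the snippet runs on its own.

```r
# The five weekly rates from the question.
df <- data.frame(
  Date = as.Date(c("2/7/2008", "2/14/2008", "2/21/2008", "2/28/2008", "3/6/2008"),
                 format = "%m/%d/%Y"),
  Rate = c(5.67, 5.72, 6.04, 6.24, 6.03)
)

# Hypothetical helper: last day of the month that contains each date.
month_end <- function(d) {
  first <- as.Date(format(d, "%Y-%m-01"))
  # 32 days after the 1st always falls in the following month.
  next_first <- as.Date(format(first + 32, "%Y-%m-01"))
  next_first - 1
}

# Label every row with its month end, then average the rate per label.
df$Month <- month_end(df$Date)
monthly <- aggregate(Rate ~ Month, data = df, FUN = mean)
print(monthly)
```

This returns one row per month rather than one row per observation, which is the difference from the approaches the question rejected.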
I have a data frame in R that contains readings taken every five minutes over a couple of months. I want to calculate the daily mean of Var3 (data frame below) and add it to this data frame as Var4.
Here is my df:
>df
timestamp Var1 Var2 Var3
1 2018-07-20 13:50:00 32.0358 28.1 3.6
2 2018-07-20 13:55:00 32.0358 28.0 2.5
3 2018-07-20 14:00:00 32.0358 28.1 2.2
I found this solution by searching the forum, but it raises an error.
Here is the solution I am applying:
aggregate(ts(df$var3[, 2], freq = 288), 1, mean)
This is the error I am getting:
Error in df$var3[, 2] : incorrect number of dimensions
I think this should work for my data frame too but not able to remove this error. Please help.
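Here's an approach with dplyr and lubridate. Note that day() returns the day of the month (1 to 31), so readings from the same day number in different months end up in one group; for data that spans several months, group by as_date(ymd_hms(timestamp)) instead.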
library(dplyr)
library(lubridate)
df %>%
group_by(Day = day(ymd_hms(timestamp))) %>%
mutate(Var4 = mean(Var3))
## A tibble: 1,000 x 6
## Groups: Day [5]
# timestamp Var1 Var2 Var3 Day Var4
# <dttm> <dbl> <dbl> <dbl> <int> <dbl>
# 1 2018-07-20 13:55:30 32.2 22.9 2.35 20 2.99
# 2 2018-07-20 14:00:30 37.7 24.8 2.99 20 2.99
# 3 2018-07-20 14:05:30 38.7 29.6 3.47 20 2.99
# 4 2018-07-20 14:10:30 30.4 24.2 3.02 20 2.99
# 5 2018-07-20 14:15:30 32.0 28.4 2.95 20 2.99
## … with 995 more rows
Sample Data
df <- data.frame(timestamp = ymd_hms("2018-07-20 13:50:30") + 60*5 * 1:1000,
Var1 = runif(100,30,40),
Var2 = runif(100,20,30),
Var3 = runif(100,2,4))
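Because the real data covers a couple of months, it may be safer to group by the full calendar date than by the day-of-month number. A base R sketch with ave(), assuming the timestamps are in UTC; the first two Var3 values come from the question and the August rows are made up to show two months side by side:

```r
# Two readings on 20 July and two on 20 August (August values are invented).
df <- data.frame(
  timestamp = as.POSIXct(c("2018-07-20 13:50:00", "2018-07-20 13:55:00",
                           "2018-08-20 13:50:00", "2018-08-20 13:55:00"),
                         tz = "UTC"),
  Var3 = c(3.6, 2.5, 1.0, 2.0)
)

# The full calendar date keeps 20 July and 20 August in separate groups.
df$Day <- as.Date(df$timestamp, tz = "UTC")

# ave() returns the group mean repeated for every row of the group.
df$Var4 <- ave(df$Var3, df$Day, FUN = mean)
print(df)
```

Grouping by the day-of-month number would instead give all four rows the same mean, because both dates fall on the 20th.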
Trying to create a time series per hour in R.
I have a data frame recording the number of vehicles per hour; it looks like this:
> head(df)
# A tibble: 6 x 8
interval cars vans trucks total `mean speed` `% occupation` hour
<dttm> <int> <int> <int> <int> <dbl> <dbl> <int>
1 2017-10-09 00:00:00 7 0 0 7 7.37 1. 0
2 2017-10-09 01:00:00 24 0 0 24 16.1 3. 1
3 2017-10-09 02:00:00 27 0 0 27 18.1 2. 2
4 2017-10-09 03:00:00 47 3 0 50 31.5 3. 3
5 2017-10-09 04:00:00 122 1 5 128 48.0 16. 4
6 2017-10-09 05:00:00 353 6 2 361 66.3 20. 5
> tail(df,1)
# A tibble: 1 x 8
interval cars vans trucks total `mean speed` `% occupation` hour
<dttm> <int> <int> <int> <int> <dbl> <dbl> <int>
1 2018-03-15 20:00:00 48 0 2 50 31.5 5. 20
Looking at the answer at starting a daily time series in R that clearly explains how to create a ts by day
I've converted this df to a time series as:
ts2Start <- df$interval[1]
ts2End <- df$interval[nrow(df)]
indexPerHour <- seq(ts2Start, ts2End, by = 'hour')
Since we have 365 days in a year and 24h per day, I created the ts as:
> df.ts <- ts(df$total, start = c(2017, as.numeric(format(indexPerHour[1], '%j'))),
+ frequency=24*365)
where
as.numeric(format(indexPerHour[1], '%j'))
returns 282
In order to validate what I'm doing I checked if the date obtained from the index is the same as the first row in my data frame
head(date_decimal(index(df.ts)),1)
but while my first date/time should be: "2017-10-09 00:00:00 "
I'm getting: "2017-01-12 16:59:59 UTC"
It looks as if the first index in the df.ts series starts about 282/24 days into the year.
I do not understand what I'm doing wrong. How does the start parameter work in ts()?
I also checked the post: How to Create a R TimeSeries for Hourly data
where it is suggested to use xts package.
The issue is that I'm just learning from a book where tslm() is used and xts object does not seem to be supported.
Can I use ts() to create hourly time series ?
You should use the xts package instead. For example:
library(xts)
# order.by needs exactly one timestamp per row of df
time_index <- seq(from = as.POSIXct("2016-01-01 00:00:00"),
to = as.POSIXct("2018-10-01 00:00:00"), by = "hour")
traff = xts(df, order.by = time_index)
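On the original question of how start works in ts(): the second element of start is the position of the first observation within the period, counted in steps of 1/frequency. With frequency = 24*365 one step is an hour, so start = c(2017, 282) means the 282nd hour of 2017, not the 282nd day. A sketch of the adjustment, using placeholder counts instead of the real totals and ignoring leap years:

```r
# Placeholder hourly counts; the real series would use df$total.
total <- seq_len(48)

# Day of the year for the first observation (9 October 2017).
first_stamp <- as.POSIXct("2017-10-09 00:00:00", tz = "UTC")
day_of_year <- as.numeric(format(first_stamp, "%j"))

# Convert the day position to an hour position within the year:
# 281 full days have passed, and positions are counted from 1.
obs_per_year <- 24 * 365
start_pos <- (day_of_year - 1) * 24 + 1

hourly_ts <- ts(total, start = c(2017, start_pos), frequency = obs_per_year)

# The first time index, expressed as days elapsed since 1 January 2017.
first_time <- as.numeric(time(hourly_ts))[1]
days_into_year <- (first_time - 2017) * 365
print(days_into_year)
```

Whether one year is a sensible seasonal period for hourly traffic is a separate modelling question; frequency = 24 (one day) is another option some people use, in which case start would be expressed in days and hours instead.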
I've endlessly looked for this and somehow nothing has solved this simple problem.
I have a dataframe called Prices in which there are 4 columns, one of which is a list of historical dates - the other 3 are lists of prices for products.
1 10/10/2016 53.14 50.366 51.87
2 07/10/2016 51.93 49.207 50.38
3 06/10/2016 52.51 49.655 50.98
4 05/10/2016 51.86 49.076 50.38
5 04/10/2016 50.87 48.186 49.3
6 03/10/2016 50.89 48.075 49.4
7 30/09/2016 50.19 47.384 48.82
8 29/09/2016 49.81 46.924 48.4
9 28/09/2016 49.24 46.062 47.65
10 27/09/2016 46.52 43.599 45.24
The list is 252 prices long. How can I have my output stored with the latest date at the bottom of the list and the corresponding prices listed with the latest prices at the bottom of the list?
Another tidyverse solution and I think the simplest one is:
df %>% map_df(rev)
or using just purrr::map_df we can do map_df(df, rev).
If you just want to reverse the order of the rows in a dataframe, you can do the following:
df<- df[seq(dim(df)[1],1),]
Just for completeness' sake: there is actually no need to call seq here. You can just use R's : operator:
### Create some sample data
n=252
sampledata<-data.frame(a=sample(letters,n,replace=TRUE),b=rnorm(n,1,0.7),
c=rnorm(n,1,0.6),d=runif(n))
### Compare some different ways to reorder the dataframe
myfun1<-function(df=sampledata){df<-df[seq(nrow(df),1),]}
myfun2<-function(df=sampledata){df<-df[seq(dim(df)[1],1),]}
myfun3<-function(df=sampledata){df<-df[dim(df)[1]:1,]}
myfun4<-function(df=sampledata){df<-df[nrow(df):1,]}
### Microbenchmark the functions
microbenchmark::microbenchmark(myfun1(),myfun2(),myfun3(),myfun4(),times=1000L)
Unit: microseconds
expr min lq mean median uq max neval
myfun1() 63.994 67.686 117.61797 71.3780 87.3765 5818.494 1000
myfun2() 63.173 67.686 99.29120 70.9680 87.7865 2299.258 1000
myfun3() 56.610 60.302 92.18913 62.7635 76.9155 3241.522 1000
myfun4() 56.610 60.302 99.52666 63.1740 77.5310 4440.582 1000
The fastest way in my trial here was to use df<-df[dim(df)[1]:1,]. However, using nrow instead of dim is only slightly slower, so the choice between them is a matter of personal preference.
Using seq here definitely slows the process down.
UPDATE September 2018:
From a speed view there is little reason to use dplyr here. For maybe 90% of users the basic R functionality should suffice. The other 10% need to use dplyr for querying a database or need code translation into another language.
## hmhensen's function
dplyr_fun<-function(df=sampledata){df %>% arrange(rev(rownames(.)))}
microbenchmark::microbenchmark(myfun3(),myfun4(),dplyr_fun(),times=1000L)
Unit: microseconds
expr min lq mean median uq max neval
myfun3() 55.8 69.75 132.8178 103.85 139.95 8949.3 1000
myfun4() 55.9 68.40 115.6418 100.05 135.00 2409.1 1000
dplyr_fun() 1364.8 1541.15 2173.0717 1786.10 2757.80 8434.8 1000
Yet another tidyverse solution is:
df %>% arrange(desc(row_number()))
Another option is to order the list by the vector you want to sort it by,
> data[order(data$Date), ]
# A tibble: 10 x 4
Date priceA priceB priceC
<dttm> <dbl> <dbl> <dbl>
1 2016-09-27 00:00:00 46.5 43.6 45.2
2 2016-09-28 00:00:00 49.2 46.1 47.6
3 2016-09-29 00:00:00 49.8 46.9 48.4
4 2016-09-30 00:00:00 50.2 47.4 48.8
5 2016-10-03 00:00:00 50.9 48.1 49.4
6 2016-10-04 00:00:00 50.9 48.2 49.3
7 2016-10-05 00:00:00 51.9 49.1 50.4
8 2016-10-06 00:00:00 52.5 49.7 51.0
9 2016-10-07 00:00:00 51.9 49.2 50.4
10 2016-10-10 00:00:00 53.1 50.4 51.9
Then, if you want to flip the order, reverse it:
> data[rev(order(data$Date)), ]
# A tibble: 10 x 4
Date priceA priceB priceC
<dttm> <dbl> <dbl> <dbl>
1 2016-10-10 00:00:00 53.1 50.4 51.9
2 2016-10-07 00:00:00 51.9 49.2 50.4
3 2016-10-06 00:00:00 52.5 49.7 51.0
4 2016-10-05 00:00:00 51.9 49.1 50.4
5 2016-10-04 00:00:00 50.9 48.2 49.3
6 2016-10-03 00:00:00 50.9 48.1 49.4
7 2016-09-30 00:00:00 50.2 47.4 48.8
8 2016-09-29 00:00:00 49.8 46.9 48.4
9 2016-09-28 00:00:00 49.2 46.1 47.6
10 2016-09-27 00:00:00 46.5 43.6 45.2
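Ordering by the date column only works chronologically once the dd/mm/yyyy text has been parsed into a real date; sorted as plain text, "30/09/2016" would come after "10/10/2016". A minimal sketch with four rows from the question (the column names Date and priceA follow the output above, and the object name prices is just for illustration):

```r
# Four rows from the question, with dates still stored as text.
prices <- data.frame(
  Date   = c("10/10/2016", "07/10/2016", "06/10/2016", "30/09/2016"),
  priceA = c(53.14, 51.93, 52.51, 50.19)
)

# Parse day/month/year text into Date so that sorting is chronological.
prices$Date <- as.Date(prices$Date, format = "%d/%m/%Y")

# Oldest first, so the latest date ends up at the bottom.
prices <- prices[order(prices$Date), ]
print(prices)
```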
If you wanted to do this in base R use:
df <- df[rev(seq_len(nrow(df))), , drop = FALSE]
All other base R solutions posted here will have problems in the edge cases of zero row data frames (seq(0,1) == c(0, 1), that's why we use seq_len) or single column data frames (data.frame(a=7:9)[3:1,] == 9:7, that's why we use , drop = FALSE).
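The two edge cases can be checked directly. In this sketch, reverse_rows() is a hypothetical wrapper around the one-liner above, and the tiny data frames are made up for the demonstration:

```r
# Hypothetical wrapper around the one-liner from this answer.
reverse_rows <- function(df) df[rev(seq_len(nrow(df))), , drop = FALSE]

one_col <- data.frame(a = 7:9)
empty   <- one_col[0, , drop = FALSE]

# Single column: drop = FALSE keeps the result a data frame.
safe <- reverse_rows(one_col)
print(class(safe))

# Without drop = FALSE the single column collapses to a bare vector.
collapsed <- one_col[nrow(one_col):1, ]
print(class(collapsed))

# Zero rows: seq_len(0) is empty, so the result still has zero rows.
print(nrow(reverse_rows(empty)))

# nrow(df):1 becomes 0:1 here, which selects a row that does not exist.
print(nrow(empty[nrow(empty):1, , drop = FALSE]))
```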
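If you want to stick with base R, you could also use lapply(). Note that do.call(cbind, ...) on plain vectors returns a matrix, which coerces every column to a single type, so convert the reversed list back to a data frame instead:
as.data.frame(lapply(df, rev))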