Using Time Diary Data with TraMineR - r

I am trying to do sequence analysis on time-diary data (American Time Use Survey) with TraMineR in R. I have the data in SPELL format (id, start time, stop time, event), but I receive the following error when trying to convert it to STS or SPS format:
Error in as.matrix.data.frame(subset(data, , 2)) : dims [product 0] do not match the length of object [9]
I believe it has something to do with how I convert my times (stored as character) to date/time types. Does TraMineR require a POSIXlt format?
Here is a snippet of my raw data (trcode is the event):
head(atus.act.short)
tucaseid tustarttim tustoptime trcode
1 2.00701e+13 04:00:00 08:00:00 10101
2 2.00701e+13 08:00:00 08:20:00 110101
3 2.00701e+13 08:20:00 08:50:00 10201
4 2.00701e+13 08:50:00 09:30:00 20102
5 2.00701e+13 09:30:00 09:40:00 180201
6 2.00701e+13 09:40:00 11:40:00 20102
I use strptime to convert the character strings to POSIXlt:
atus.act.short$starttime.new <- strptime(atus.act.short$tustarttim, format="%X")
atus.act.short$stoptime.new <- strptime(atus.act.short$tustoptime, format="%X")
I also cut the ID down to only two digits:
atus.act.short$id <- atus.act.short$tucaseid-20070101070000
I end up with a new data frame as follows:
id starttime.new stoptime.new trcode
1 44 2012-08-03 04:00:00 2012-08-03 08:00:00 10101
2 44 2012-08-03 08:00:00 2012-08-03 08:20:00 110101
3 44 2012-08-03 08:20:00 2012-08-03 08:50:00 10201
4 44 2012-08-03 08:50:00 2012-08-03 09:30:00 20102
5 44 2012-08-03 09:30:00 2012-08-03 09:40:00 180201
6 44 2012-08-03 09:40:00 2012-08-03 11:40:00 20102
7 44 2012-08-03 11:40:00 2012-08-03 11:50:00 180201
8 44 2012-08-03 11:50:00 2012-08-03 12:05:00 20102
9 44 2012-08-03 12:05:00 2012-08-03 13:05:00 120303
10 44 2012-08-03 13:05:00 2012-08-03 13:20:00 180704
11 44 2012-08-03 13:20:00 2012-08-03 15:20:00 70104
12 44 2012-08-03 15:20:00 2012-08-03 15:35:00 180704
13 44 2012-08-03 15:35:00 2012-08-03 17:00:00 120303
14 44 2012-08-03 17:00:00 2012-08-03 17:20:00 180701
15 44 2012-08-03 17:20:00 2012-08-03 17:25:00 180701
16 44 2012-08-03 17:25:00 2012-08-03 17:55:00 70101
17 44 2012-08-03 17:55:00 2012-08-03 18:00:00 181203
18 44 2012-08-03 18:00:00 2012-08-03 19:00:00 120303
19 44 2012-08-03 19:00:00 2012-08-03 19:30:00 110101
20 44 2012-08-03 19:30:00 2012-08-03 21:30:00 120303
21 44 2012-08-03 21:30:00 2012-08-03 23:00:00 10101
22 44 2012-08-03 23:00:00 2012-08-03 23:03:00 10201
26 48 2012-08-03 06:45:00 2012-08-03 08:15:00 10201
27 48 2012-08-03 08:15:00 2012-08-03 08:45:00 180209
28 48 2012-08-03 08:45:00 2012-08-03 09:00:00 20902
29 48 2012-08-03 09:00:00 2012-08-03 11:00:00 50101
30 48 2012-08-03 11:00:00 2012-08-03 11:45:00 120312
Then I try to create a sequence object [using library(TraMineR)]:
atus.seq <- seqdef(atus.act.short, informat = "SPELL", id="id")
And I get the following error:
Error in as.matrix.data.frame(subset(data, , 2)) : dims [product 0] do not match the length of object [9]
Thoughts?

I've managed to work around this by converting the times to minutes (following another question on Stack Overflow), making the status code a character (as.character), using seqformat, and assigning it to a time axis. The new code reads:
atus.seq2 <- seqformat(atus.act.short2, id = "id", from = "SPELL", to = "STS", begin = "startmin", end = "stopmin", status = "trcode", process = FALSE)
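For reference, a minimal sketch of how atus.act.short2 could be built from the frame above (the minute arithmetic and the startmin/stopmin names are illustrative assumptions, since the question doesn't show this step):
atus.act.short2 <- atus.act.short
# Minutes since midnight, taken from the POSIXlt components
atus.act.short2$startmin <- atus.act.short2$starttime.new$hour * 60 + atus.act.short2$starttime.new$min
atus.act.short2$stopmin <- atus.act.short2$stoptime.new$hour * 60 + atus.act.short2$stoptime.new$min
# As noted above, the status code is made a character
atus.act.short2$trcode <- as.character(atus.act.short2$trcode)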

Related

Find the value of a variable 30 days into the future

Hi, below is my R data frame. It is an excerpt from S&P 500 data for the past 10 years or so. As you can see, I have created a column called Date30, which is the date plus 30 days. I want to add a new column (using dplyr if I can) called Close30, which is the "Close" value on the date given by Date30, so that I can look into the future from a given date (obviously it won't work for the most recent 30 days). It is like offsetting a column, but it needs a filter/lookup function rather than a constant offset, because the data covers business days only and I need to add 30 calendar days.
I have tried a few things but am getting nowhere...
Thanks so much if you can help!
tidySP500 = na.omit(SP500_Raw) # remove NAs in case future data have NAs
tidySP500$Date = as.Date(tidySP500$Date)
tidySP500 = tidySP500 %>%
select("Date", "Open", "High", "Low", "Price") %>% # select and re-order required variables
rename("Close" = "Price") %>%
filter(Date >= as.Date("2014-01-05") & Date <= (as.Date("2014-01-05")+100)) %>%
mutate(Date30 = Date + 30)# %>% #WORKS UP TO HERE
mutate(Close30 = Close[Date == Date30]) %>% # FAILS
mutate(Close30 = filter(Close, Date == Date30)) #FAILS
Date Open High Low Close Date30
1 2014-04-15 1831.45 1844.02 1816.29 1842.98 2014-05-15
2 2014-04-14 1818.18 1834.19 1815.80 1830.61 2014-05-14
3 2014-04-11 1830.65 1835.07 1814.36 1815.69 2014-05-11
4 2014-04-10 1872.28 1872.53 1830.87 1833.08 2014-05-10
5 2014-04-09 1852.64 1872.43 1852.38 1872.18 2014-05-09
6 2014-04-08 1845.48 1854.95 1837.49 1851.96 2014-05-08
7 2014-04-07 1863.92 1864.04 1841.48 1845.04 2014-05-07
8 2014-04-04 1890.25 1897.28 1863.26 1865.09 2014-05-04
9 2014-04-03 1891.43 1893.80 1882.65 1888.77 2014-05-03
10 2014-04-02 1886.61 1893.17 1883.79 1890.90 2014-05-02
11 2014-04-01 1873.96 1885.84 1873.96 1885.52 2014-05-01
12 2014-03-31 1859.16 1875.18 1859.16 1872.34 2014-04-30
13 2014-03-28 1850.07 1866.63 1850.07 1857.62 2014-04-27
14 2014-03-27 1852.11 1855.55 1842.11 1849.04 2014-04-26
15 2014-03-26 1867.09 1875.92 1852.56 1852.56 2014-04-25
16 2014-03-25 1859.48 1871.87 1855.96 1865.62 2014-04-24
17 2014-03-24 1867.67 1873.34 1849.69 1857.44 2014-04-23
18 2014-03-21 1874.53 1883.97 1863.46 1866.52 2014-04-20
19 2014-03-20 1860.09 1873.49 1854.63 1872.01 2014-04-19
20 2014-03-19 1872.25 1874.14 1850.35 1860.77 2014-04-18
21 2014-03-18 1858.92 1873.76 1858.92 1872.25 2014-04-17
22 2014-03-17 1842.81 1862.30 1842.81 1858.83 2014-04-16
23 2014-03-14 1845.07 1852.44 1839.57 1841.13 2014-04-13
24 2014-03-13 1869.06 1874.40 1841.86 1846.34 2014-04-12
25 2014-03-12 1866.15 1868.38 1854.38 1868.20 2014-04-11
26 2014-03-11 1878.26 1882.35 1863.88 1867.63 2014-04-10
27 2014-03-10 1877.86 1877.87 1867.04 1877.17 2014-04-09
28 2014-03-07 1878.52 1883.57 1870.56 1878.04 2014-04-06
29 2014-03-06 1874.18 1881.94 1874.18 1877.03 2014-04-05
30 2014-03-05 1874.05 1876.53 1871.11 1873.81 2014-04-04
31 2014-03-04 1849.23 1876.23 1849.23 1873.91 2014-04-03
32 2014-03-03 1857.68 1857.68 1834.44 1845.73 2014-04-02
33 2014-02-28 1855.12 1867.92 1847.67 1859.45 2014-03-30
34 2014-02-27 1844.90 1854.53 1841.13 1854.29 2014-03-29
35 2014-02-26 1845.79 1852.65 1840.66 1845.16 2014-03-28
36 2014-02-25 1847.66 1852.91 1840.19 1845.12 2014-03-27
37 2014-02-24 1836.78 1858.71 1836.78 1847.61 2014-03-26
38 2014-02-21 1841.07 1846.13 1835.60 1836.25 2014-03-23
39 2014-02-20 1829.24 1842.79 1824.58 1839.78 2014-03-22
40 2014-02-19 1838.90 1847.50 1826.99 1828.75 2014-03-21
41 2014-02-18 1839.03 1842.87 1835.01 1840.76 2014-03-20
42 2014-02-14 1828.46 1841.65 1825.59 1838.63 2014-03-16
43 2014-02-13 1814.82 1830.25 1809.22 1829.83 2014-03-15
44 2014-02-12 1820.12 1826.55 1815.97 1819.26 2014-03-14
45 2014-02-11 1800.45 1823.54 1800.41 1819.75 2014-03-13
46 2014-02-10 1796.20 1799.94 1791.83 1799.84 2014-03-12
47 2014-02-07 1776.01 1798.03 1776.01 1797.02 2014-03-09
48 2014-02-06 1752.99 1774.06 1752.99 1773.43 2014-03-08
49 2014-02-05 1753.38 1755.79 1737.92 1751.64 2014-03-07
50 2014-02-04 1743.82 1758.73 1743.82 1755.20 2014-03-06
51 2014-02-03 1782.68 1784.83 1739.66 1741.89 2014-03-05
52 2014-01-31 1790.88 1793.88 1772.26 1782.59 2014-03-02
53 2014-01-30 1777.17 1798.77 1777.17 1794.19 2014-03-01
54 2014-01-29 1790.15 1790.15 1770.45 1774.20 2014-02-28
55 2014-01-28 1783.00 1793.87 1779.49 1792.50 2014-02-27
56 2014-01-27 1791.03 1795.98 1772.88 1781.56 2014-02-26
57 2014-01-24 1826.96 1826.96 1790.29 1790.29 2014-02-23
58 2014-01-23 1842.29 1842.29 1820.06 1828.46 2014-02-22
59 2014-01-22 1844.71 1846.87 1840.88 1844.86 2014-02-21
60 2014-01-21 1841.05 1849.31 1832.38 1843.80 2014-02-20
61 2014-01-17 1844.23 1846.04 1835.23 1838.70 2014-02-16
62 2014-01-16 1847.99 1847.99 1840.30 1845.89 2014-02-15
63 2014-01-15 1840.52 1850.84 1840.52 1848.38 2014-02-14
64 2014-01-14 1821.36 1839.26 1821.36 1838.88 2014-02-13
65 2014-01-13 1841.26 1843.45 1815.52 1819.20 2014-02-12
66 2014-01-10 1840.06 1843.15 1832.43 1842.37 2014-02-09
67 2014-01-09 1839.00 1843.23 1830.38 1838.13 2014-02-08
68 2014-01-08 1837.90 1840.02 1831.40 1837.49 2014-02-07
69 2014-01-07 1828.71 1840.10 1828.71 1837.88 2014-02-06
70 2014-01-06 1832.31 1837.16 1823.73 1826.77 2014-02-05
Something like this?
library(tidyverse)
tidySP500 %>% left_join(select(tidySP500, Close, Date30 = Date), by = c('Date30'))
#> # A tibble: 70 x 7
#> Date Open High Low Close.x Date30 Close.y
#> <date> <dbl> <dbl> <dbl> <dbl> <date> <dbl>
#> 1 2014-04-15 1831. 1844. 1816. 1843. 2014-05-15 NA
#> 2 2014-04-14 1818. 1834. 1816. 1831. 2014-05-14 NA
#> 3 2014-04-11 1831. 1835. 1814. 1816. 2014-05-11 NA
#> 4 2014-04-10 1872. 1873. 1831. 1833. 2014-05-10 NA
#> 5 2014-04-09 1853. 1872. 1852. 1872. 2014-05-09 NA
#> 6 2014-04-08 1845. 1855. 1837. 1852. 2014-05-08 NA
#> 7 2014-04-07 1864. 1864. 1841. 1845. 2014-05-07 NA
#> 8 2014-04-04 1890. 1897. 1863. 1865. 2014-05-04 NA
#> 9 2014-04-03 1891. 1894. 1883. 1889. 2014-05-03 NA
#> 10 2014-04-02 1887. 1893. 1884. 1891. 2014-05-02 NA
#> # … with 60 more rows
Created on 2020-02-22 by the reprex package (v0.3.0)
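A variation on the same self-join that returns the looked-up value directly under the asker's desired name (Close30 is just that name; nothing in the join requires it):
library(tidyverse)
# Self-join: match each row's Date30 against another row's Date,
# bringing that row's Close along as Close30
tidySP500 %>% left_join(select(tidySP500, Date30 = Date, Close30 = Close), by = 'Date30')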
DATA
tidySP500 <- read.so::read_so('Date Open High Low Close Date30
1 2014-04-15 1831.45 1844.02 1816.29 1842.98 2014-05-15
2 2014-04-14 1818.18 1834.19 1815.80 1830.61 2014-05-14
3 2014-04-11 1830.65 1835.07 1814.36 1815.69 2014-05-11
4 2014-04-10 1872.28 1872.53 1830.87 1833.08 2014-05-10
5 2014-04-09 1852.64 1872.43 1852.38 1872.18 2014-05-09
6 2014-04-08 1845.48 1854.95 1837.49 1851.96 2014-05-08
7 2014-04-07 1863.92 1864.04 1841.48 1845.04 2014-05-07
8 2014-04-04 1890.25 1897.28 1863.26 1865.09 2014-05-04
9 2014-04-03 1891.43 1893.80 1882.65 1888.77 2014-05-03
10 2014-04-02 1886.61 1893.17 1883.79 1890.90 2014-05-02
11 2014-04-01 1873.96 1885.84 1873.96 1885.52 2014-05-01
12 2014-03-31 1859.16 1875.18 1859.16 1872.34 2014-04-30
13 2014-03-28 1850.07 1866.63 1850.07 1857.62 2014-04-27
14 2014-03-27 1852.11 1855.55 1842.11 1849.04 2014-04-26
15 2014-03-26 1867.09 1875.92 1852.56 1852.56 2014-04-25
16 2014-03-25 1859.48 1871.87 1855.96 1865.62 2014-04-24
17 2014-03-24 1867.67 1873.34 1849.69 1857.44 2014-04-23
18 2014-03-21 1874.53 1883.97 1863.46 1866.52 2014-04-20
19 2014-03-20 1860.09 1873.49 1854.63 1872.01 2014-04-19
20 2014-03-19 1872.25 1874.14 1850.35 1860.77 2014-04-18
21 2014-03-18 1858.92 1873.76 1858.92 1872.25 2014-04-17
22 2014-03-17 1842.81 1862.30 1842.81 1858.83 2014-04-16
23 2014-03-14 1845.07 1852.44 1839.57 1841.13 2014-04-13
24 2014-03-13 1869.06 1874.40 1841.86 1846.34 2014-04-12
25 2014-03-12 1866.15 1868.38 1854.38 1868.20 2014-04-11
26 2014-03-11 1878.26 1882.35 1863.88 1867.63 2014-04-10
27 2014-03-10 1877.86 1877.87 1867.04 1877.17 2014-04-09
28 2014-03-07 1878.52 1883.57 1870.56 1878.04 2014-04-06
29 2014-03-06 1874.18 1881.94 1874.18 1877.03 2014-04-05
30 2014-03-05 1874.05 1876.53 1871.11 1873.81 2014-04-04
31 2014-03-04 1849.23 1876.23 1849.23 1873.91 2014-04-03
32 2014-03-03 1857.68 1857.68 1834.44 1845.73 2014-04-02
33 2014-02-28 1855.12 1867.92 1847.67 1859.45 2014-03-30
34 2014-02-27 1844.90 1854.53 1841.13 1854.29 2014-03-29
35 2014-02-26 1845.79 1852.65 1840.66 1845.16 2014-03-28
36 2014-02-25 1847.66 1852.91 1840.19 1845.12 2014-03-27
37 2014-02-24 1836.78 1858.71 1836.78 1847.61 2014-03-26
38 2014-02-21 1841.07 1846.13 1835.60 1836.25 2014-03-23
39 2014-02-20 1829.24 1842.79 1824.58 1839.78 2014-03-22
40 2014-02-19 1838.90 1847.50 1826.99 1828.75 2014-03-21
41 2014-02-18 1839.03 1842.87 1835.01 1840.76 2014-03-20
42 2014-02-14 1828.46 1841.65 1825.59 1838.63 2014-03-16
43 2014-02-13 1814.82 1830.25 1809.22 1829.83 2014-03-15
44 2014-02-12 1820.12 1826.55 1815.97 1819.26 2014-03-14
45 2014-02-11 1800.45 1823.54 1800.41 1819.75 2014-03-13
46 2014-02-10 1796.20 1799.94 1791.83 1799.84 2014-03-12
47 2014-02-07 1776.01 1798.03 1776.01 1797.02 2014-03-09
48 2014-02-06 1752.99 1774.06 1752.99 1773.43 2014-03-08
49 2014-02-05 1753.38 1755.79 1737.92 1751.64 2014-03-07
50 2014-02-04 1743.82 1758.73 1743.82 1755.20 2014-03-06
51 2014-02-03 1782.68 1784.83 1739.66 1741.89 2014-03-05
52 2014-01-31 1790.88 1793.88 1772.26 1782.59 2014-03-02
53 2014-01-30 1777.17 1798.77 1777.17 1794.19 2014-03-01
54 2014-01-29 1790.15 1790.15 1770.45 1774.20 2014-02-28
55 2014-01-28 1783.00 1793.87 1779.49 1792.50 2014-02-27
56 2014-01-27 1791.03 1795.98 1772.88 1781.56 2014-02-26
57 2014-01-24 1826.96 1826.96 1790.29 1790.29 2014-02-23
58 2014-01-23 1842.29 1842.29 1820.06 1828.46 2014-02-22
59 2014-01-22 1844.71 1846.87 1840.88 1844.86 2014-02-21
60 2014-01-21 1841.05 1849.31 1832.38 1843.80 2014-02-20
61 2014-01-17 1844.23 1846.04 1835.23 1838.70 2014-02-16
62 2014-01-16 1847.99 1847.99 1840.30 1845.89 2014-02-15
63 2014-01-15 1840.52 1850.84 1840.52 1848.38 2014-02-14
64 2014-01-14 1821.36 1839.26 1821.36 1838.88 2014-02-13
65 2014-01-13 1841.26 1843.45 1815.52 1819.20 2014-02-12
66 2014-01-10 1840.06 1843.15 1832.43 1842.37 2014-02-09
67 2014-01-09 1839.00 1843.23 1830.38 1838.13 2014-02-08
68 2014-01-08 1837.90 1840.02 1831.40 1837.49 2014-02-07
69 2014-01-07 1828.71 1840.10 1828.71 1837.88 2014-02-06
70 2014-01-06 1832.31 1837.16 1823.73 1826.77 2014-02-05')

How to transform a datetime column from a `Non UTC` format to `UTC` format without losing data on the days in which there is a time change in R

I have a data frame df1 with a datetime column in UTC. I need to merge this data frame with the data frame df2 by the column datetime. My problem is that df2 is in Europe/Paris time, and when I transform df2$datetime from Europe/Paris to UTC, I lose or duplicate data at the moments of the time change between summer and winter time. As an example:
df1 <- data.frame(datetime = c("2016-10-29 22:00:00","2016-10-29 23:00:00","2016-10-30 00:00:00","2016-10-30 01:00:00","2016-10-30 02:00:00","2016-10-30 03:00:00","2016-10-30 04:00:00","2016-10-30 05:00:00","2017-03-25 22:00:00","2017-03-25 23:00:00","2017-03-26 00:00:00","2017-03-26 01:00:00","2017-03-26 02:00:00","2017-03-26 03:00:00","2017-03-26 04:00:00"), Var1 = c(4, 56, 76, 54, 34, 3, 4, 6, 78, 23, 12, 3, 5, 6, 7))
df1$datetime<- as.POSIXct(df1$datetime, format = "%Y-%m-%d %H", tz= "UTC")
df2 <- data.frame(datetime = c("2016-10-29 22:00:00","2016-10-29 23:00:00","2016-10-30 00:00:00","2016-10-30 01:00:00","2016-10-30 02:00:00","2016-10-30 03:00:00","2016-10-30 04:00:00","2016-10-30 05:00:00","2017-03-25 22:00:00","2017-03-25 23:00:00","2017-03-26 00:00:00","2017-03-26 01:00:00","2017-03-26 02:00:00","2017-03-26 03:00:00","2017-03-26 04:00:00"), Var2 = c(56, 43, 23, 14, 51, 27, 89, 76, 56, 4, 35, 23, 4, 62, 84))
df2$datetime<- as.POSIXct(df2$datetime, format = "%Y-%m-%d %H", tz= "Europe/Paris")
df1
datetime Var1
1 2016-10-29 22:00:00 4
2 2016-10-29 23:00:00 56
3 2016-10-30 00:00:00 76
4 2016-10-30 01:00:00 54
5 2016-10-30 02:00:00 34
6 2016-10-30 03:00:00 3
7 2016-10-30 04:00:00 4
8 2016-10-30 05:00:00 6
9 2017-03-25 22:00:00 78
10 2017-03-25 23:00:00 23
11 2017-03-26 00:00:00 12
12 2017-03-26 01:00:00 3
13 2017-03-26 02:00:00 5
14 2017-03-26 03:00:00 6
15 2017-03-26 04:00:00 7
df2
datetime Var2
1 2016-10-29 22:00:00 56
2 2016-10-29 23:00:00 43
3 2016-10-30 00:00:00 23
4 2016-10-30 01:00:00 14
5 2016-10-30 02:00:00 51
6 2016-10-30 03:00:00 27
7 2016-10-30 04:00:00 89
8 2016-10-30 05:00:00 76
9 2017-03-25 22:00:00 56
10 2017-03-25 23:00:00 4
11 2017-03-26 00:00:00 35
12 2017-03-26 01:00:00 23
13 2017-03-26 02:00:00 4
14 2017-03-26 03:00:00 62
15 2017-03-26 04:00:00 84
When I change df2$datetime format from Europe/Paris to UTC, this happens:
library(lubridate)
df2$datetime<-with_tz(df2$datetime,"UTC")
df2
datetime Var2
1 2016-10-29 20:00:00 56
2 2016-10-29 21:00:00 43
3 2016-10-29 22:00:00 23
4 2016-10-29 23:00:00 14
5 2016-10-30 00:00:00 51
6 2016-10-30 02:00:00 27 # Data at 01:00:00 is missing
7 2016-10-30 03:00:00 89
8 2016-10-30 04:00:00 76
9 2017-03-25 21:00:00 56
10 2017-03-25 22:00:00 4
11 2017-03-25 23:00:00 35
12 2017-03-26 00:00:00 23
13 2017-03-26 00:00:00 4 # There is a duplicate at 00:00:00
14 2017-03-26 01:00:00 62
15 2017-03-26 02:00:00 84
16 2017-03-26 03:00:00 56
Is there another way to transform df2$datetime from Europe/Paris to UTC that lets me merge the two data frames without this problem of lost or duplicated data? I don't understand why I have to lose or duplicate info in df2.
Is the transformation I did on df2$datetime right for merging this data frame with df1? What I've done so far to solve this is to add a new row in df2 on 2016-10-30 at 01:00:00 that is the mean between 2016-10-30 00:00:00 and 2016-10-30 02:00:00, and to remove one row on 2017-03-26 at 00:00:00.
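In code, that manual patch looks roughly like this (a sketch; the interpolated value is just the mean of the two neighbouring observations shown in the output above):
# Fill the autumn gap with the mean of its neighbours at 00:00 and 02:00 UTC
gap <- data.frame(datetime = as.POSIXct("2016-10-30 01:00:00", tz = "UTC"), Var2 = mean(c(51, 27)))
df2 <- rbind(df2, gap)
# Drop one of the two rows duplicated at the spring change, then re-sort
df2 <- df2[!duplicated(df2$datetime), ]
df2 <- df2[order(df2$datetime), ]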
Thanks for your help.
I found out that my original df2 should be like this:
df2
datetime Var1
1 2016-10-29 22:00:00 4 # This is time in format "GMT+2". It corresponds to 20:00 UTC
2 2016-10-29 23:00:00 56 # This is time in format "GMT+2". It corresponds to 21:00 UTC
3 2016-10-30 00:00:00 76 # This is time in format "GMT+2". It corresponds to 22:00 UTC
4 2016-10-30 01:00:00 54 # This is time in format "GMT+2". It corresponds to 23:00 UTC
5 2016-10-30 02:00:00 34 # This is time in format "GMT+2". It corresponds to 00:00 UTC
6 2016-10-30 02:00:00 3 # This is time in format "GMT+1". It corresponds to 01:00 UTC
7 2016-10-30 03:00:00 4 # This is time in format "GMT+1". It corresponds to 02:00 UTC
8 2016-10-30 04:00:00 6 # This is time in format "GMT+1". It corresponds to 03:00 UTC
9 2016-10-30 05:00:00 78 # This is time in format "GMT+1". It corresponds to 04:00 UTC
10 2017-03-25 22:00:00 23 # This is time in format "GMT+1". It corresponds to 21:00 UTC
11 2017-03-25 23:00:00 12 # This is time in format "GMT+1". It corresponds to 22:00 UTC
12 2017-03-26 00:00:00 3 # This is time in format "GMT+1". It corresponds to 23:00 UTC
13 2017-03-26 01:00:00 5 # This is time in format "GMT+1". It corresponds to 00:00 UTC
14 2017-03-26 03:00:00 6 # This is time in format "GMT+2". It corresponds to 01:00 UTC
15 2017-03-26 04:00:00 7 # This is time in format "GMT+2". It corresponds to 02:00 UTC
16 2017-03-26 05:00:00 76 # This is time in format "GMT+2". It corresponds to 03:00 UTC
However, my original df2 doesn't have duplicated or lost time data. It is like this:
df2
datetime Var1
1 2016-10-29 22:00:00 4
2 2016-10-29 23:00:00 56
3 2016-10-30 00:00:00 76
4 2016-10-30 01:00:00 54
5 2016-10-30 02:00:00 34
6 2016-10-30 03:00:00 3
7 2016-10-30 04:00:00 4
8 2016-10-30 05:00:00 6
9 2017-03-25 22:00:00 78
10 2017-03-25 23:00:00 23
11 2017-03-26 00:00:00 12
12 2017-03-26 01:00:00 3
13 2017-03-26 02:00:00 5
14 2017-03-26 03:00:00 6
15 2017-03-26 04:00:00 7
16 2017-03-26 05:00:00 76
When I applied the R code df2$datetime <- with_tz(df2$datetime, "UTC"), this happened:
df2
datetime Var1
1 2016-10-29 20:00:00 4
2 2016-10-29 21:00:00 56
3 2016-10-29 22:00:00 76
4 2016-10-29 23:00:00 54
5 2016-10-30 00:00:00 34
6 2016-10-30 02:00:00 3 # I have to manually add a new row between the times "00:00" and "02:00"
7 2016-10-30 03:00:00 4
8 2016-10-30 04:00:00 6
9 2017-03-25 21:00:00 78
10 2017-03-25 22:00:00 23
11 2017-03-25 23:00:00 12
12 2017-03-26 00:00:00 3
13 2017-03-26 01:00:00 5 # I have to manually remove one of the rows referring to the time "01:00".
14 2017-03-26 01:00:00 6
15 2017-03-26 02:00:00 7
16 2017-03-26 03:00:00 76
If my original df2 had one duplicate at "02:00:00" on 30th October and a gap on 26th March between "01:00" and "03:00", the R code df2$datetime <- with_tz(df2$datetime, "UTC") would give me this:
df2
datetime Var1
1 2016-10-29 20:00:00 4
2 2016-10-29 21:00:00 56
3 2016-10-29 22:00:00 76
4 2016-10-29 23:00:00 54
5 2016-10-30 00:00:00 34
6 2016-10-30 00:00:00 3 # I just have to change "00:00:00" to "01:00:00"
7 2016-10-30 02:00:00 4
8 2016-10-30 03:00:00 6
9 2016-10-30 04:00:00 78
10 2017-03-25 21:00:00 23
11 2017-03-25 22:00:00 12
12 2017-03-25 23:00:00 3
13 2017-03-26 00:00:00 5
14 2017-03-26 01:00:00 6
15 2017-03-26 02:00:00 7
16 2017-03-26 03:00:00 76
# As there are several versions of df2, I use the one shown in the question
df2 <- read.table(text = "
datetime Var2
1 '2016-10-29 22:00:00' 56
2 '2016-10-29 23:00:00' 43
3 '2016-10-30 00:00:00' 23
4 '2016-10-30 01:00:00' 14
5 '2016-10-30 02:00:00' 51
6 '2016-10-30 03:00:00' 27
7 '2016-10-30 04:00:00' 89
8 '2016-10-30 05:00:00' 76
9 '2017-03-25 22:00:00' 56
10 '2017-03-25 23:00:00' 4
11 '2017-03-26 00:00:00' 35
12 '2017-03-26 01:00:00' 23
13 '2017-03-26 02:00:00' 4
14 '2017-03-26 03:00:00' 62
15 '2017-03-26 04:00:00' 84
", header = TRUE)
library(lubridate)
# When you now assign the time zone, the content of df2 is already changed
df2$datetimeEP <- as.POSIXct(df2$datetime, format = "%Y-%m-%d %H", tz= "Europe/Paris")
#df2[13,]
# datetime Var2 datetimeEP
#13 2017-03-26 02:00:00 4 2017-03-26 01:00:00
# It looks like your recorded times don't account for daylight saving time,
# so you have to use e.g. "Etc/GMT-1" instead of "Europe/Paris"
df2$datetimeG1 <- as.POSIXct(df2$datetime, format = "%Y-%m-%d %H", tz= "Etc/GMT-1")
data.frame(datetime=df2$datetime, utc=with_tz(df2$datetimeG1,"UTC"))
# datetime utc
#1 2016-10-29 22:00:00 2016-10-29 21:00:00
#2 2016-10-29 23:00:00 2016-10-29 22:00:00
#3 2016-10-30 00:00:00 2016-10-29 23:00:00
#4 2016-10-30 01:00:00 2016-10-30 00:00:00
#5 2016-10-30 02:00:00 2016-10-30 01:00:00
#6 2016-10-30 03:00:00 2016-10-30 02:00:00
#7 2016-10-30 04:00:00 2016-10-30 03:00:00
#8 2016-10-30 05:00:00 2016-10-30 04:00:00
#9 2017-03-25 22:00:00 2017-03-25 21:00:00
#10 2017-03-25 23:00:00 2017-03-25 22:00:00
#11 2017-03-26 00:00:00 2017-03-25 23:00:00
#12 2017-03-26 01:00:00 2017-03-26 00:00:00
#13 2017-03-26 02:00:00 2017-03-26 01:00:00
#14 2017-03-26 03:00:00 2017-03-26 02:00:00
#15 2017-03-26 04:00:00 2017-03-26 03:00:00
# You can use dst() to see whether a datetime in a given time zone is on daylight saving time
dst(df2$datetimeEP)
dst(df2$datetimeG1)
dst(with_tz(df2$datetimeEP,"UTC"))
dst(with_tz(df2$datetimeG1,"UTC"))
# If your recorded times do account for daylight saving time, then you really do have a gap and an overlap.
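With the fixed-offset interpretation the conversion is one-to-one, so the merge the question asks for becomes straightforward; a minimal sketch, assuming df1 from the question:
# Convert the Etc/GMT-1 times to UTC, then merge the two frames on datetime
df2$datetime <- with_tz(df2$datetimeG1, "UTC")
merged <- merge(df1, df2[, c("datetime", "Var2")], by = "datetime")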

as.POSIXct gives inexplicable NA value [duplicate]

This question already has answers here:
How do I clear an NA flag for a posix value?
(3 answers)
Closed 5 years ago.
I have a large dataset (21683 records) and I've managed to combine date and time into a datetime correctly using as.POSIXct. Nevertheless, this did not work for 6 records (17463:17468). This is the dataset I'm using:
> head(solar.angle)
Date Time sol.elev.angle ID Datetime
1 2016-11-24 15:00:00 41.32397 1 2016-11-24 15:00:00
2 2016-11-24 15:10:00 39.11225 2 2016-11-24 15:10:00
3 2016-11-24 15:20:00 36.88180 3 2016-11-24 15:20:00
4 2016-11-24 15:30:00 34.63507 4 2016-11-24 15:30:00
5 2016-11-24 15:40:00 32.37418 5 2016-11-24 15:40:00
6 2016-11-24 15:50:00 30.10096 6 2016-11-24 15:50:00
> solar.angle[17460:17470,]
Date Time sol.elev.angle ID Datetime
17488 2017-03-26 01:30:00 -72.01821 17460 2017-03-26 01:30:00
17489 2017-03-26 01:40:00 -69.53832 17461 2017-03-26 01:40:00
17490 2017-03-26 01:50:00 -67.05409 17462 2017-03-26 01:50:00
17491 2017-03-26 02:00:00 -64.56682 17463 <NA>
17492 2017-03-26 02:10:00 -62.07730 17464 <NA>
17493 2017-03-26 02:20:00 -59.58609 17465 <NA>
17494 2017-03-26 02:30:00 -57.09359 17466 <NA>
17495 2017-03-26 02:40:00 -54.60006 17467 <NA>
17496 2017-03-26 02:50:00 -52.10572 17468 <NA>
17497 2017-03-26 03:00:00 -49.61071 17469 2017-03-26 03:00:00
17498 2017-03-26 03:10:00 -47.11515 17470 2017-03-26 03:10:00
This is the code I'm using:
solar.angle$Datetime <- as.POSIXct(paste(solar.angle$Date,solar.angle$Time), format="%Y-%m-%d %H:%M:%S")
I've already tried to fill them in manually but this did not make any difference:
> solar.angle$Datetime[17463] <- as.POSIXct('2017-03-26 02:00:00', format = "%Y-%m-%d %H:%M:%S")
> solar.angle$Datetime[17463]
[1] NA
Any help will be appreciated!
The problem here is that this is the time at which clocks switch to summer time, so you need to specify the time zone; otherwise there is ambiguity.
If you specify a time zone, it will work:
as.POSIXct('2017-03-26 02:00:00', format = "%Y-%m-%d %H:%M:%S", tz = "GMT")
Which returns:
"2017-03-26 02:00:00 GMT"
You can check ?timezones for more information.
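Applied to the whole column, the same fix would look like this (a sketch; GMT matches the answer above, but use whatever zone the data were actually recorded in):
# Re-parse the full column with an explicit, DST-free time zone
solar.angle$Datetime <- as.POSIXct(paste(solar.angle$Date, solar.angle$Time), format = "%Y-%m-%d %H:%M:%S", tz = "GMT")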

R: Compare data.table and pass variable while respecting key

I have two data.tables:
original <- data.frame(id = c(rep("RE01",5),rep("RE02",5)),date.time = head(seq.POSIXt(as.POSIXct("2015-11-01 01:00:00"),as.POSIXct("2015-11-05 01:00:00"),60*60*10),10))
compare <- data.frame(id = c("RE01","RE02"),seq = c(1,2),start = as.POSIXct(c("2015-11-01 20:00:00","2015-11-04 08:00:00")),end = as.POSIXct(c("2015-11-02 08:00:00","2015-11-04 20:00:00")))
setDT(original)
setDT(compare)
I would like to check the date in each row of original and see if it lies between the start and end dates of compare while respecting the id. If it does, the matching value of compare$seq should be passed to original as diff.seq. The output should look like this:
original
id date.time diff.seq
1 RE01 2015-11-01 01:00:00 NA
2 RE01 2015-11-01 11:00:00 NA
3 RE01 2015-11-01 21:00:00 1
4 RE01 2015-11-02 07:00:00 1
5 RE01 2015-11-02 17:00:00 NA
6 RE02 2015-11-03 03:00:00 NA
7 RE02 2015-11-03 13:00:00 NA
8 RE02 2015-11-03 23:00:00 NA
9 RE02 2015-11-04 09:00:00 2
10 RE02 2015-11-04 19:00:00 2
I've been reading the manual and SO for hours and trying "on", "by" and so on.. without any success. Can anybody point me in the right direction?
As said in the comments, this is very straightforward using data.table::foverlaps.
You basically have to create an additional column in the original data set to set the join boundaries, key the two data sets by the columns you want to join on, and then simply run foverlaps and select the desired columns:
original[, end := date.time]
setkey(original, id, date.time, end)
setkey(compare, id, start, end)
foverlaps(original, compare)[, .(id, date.time, seq)]
# id date.time seq
# 1: RE01 2015-11-01 01:00:00 NA
# 2: RE01 2015-11-01 11:00:00 NA
# 3: RE01 2015-11-01 21:00:00 1
# 4: RE01 2015-11-02 07:00:00 1
# 5: RE01 2015-11-02 17:00:00 NA
# 6: RE02 2015-11-03 03:00:00 NA
# 7: RE02 2015-11-03 13:00:00 NA
# 8: RE02 2015-11-03 23:00:00 NA
# 9: RE02 2015-11-04 09:00:00 2
# 10: RE02 2015-11-04 19:00:00 2
Alternatively, you can run foverlaps the other way around and then just update the original data set by reference while selecting the correct rows to update
indx <- foverlaps(compare, original, which = TRUE)
original[indx$yid, diff.seq := indx$xid]
original
# id date.time end diff.seq
# 1: RE01 2015-11-01 01:00:00 2015-11-01 01:00:00 NA
# 2: RE01 2015-11-01 11:00:00 2015-11-01 11:00:00 NA
# 3: RE01 2015-11-01 21:00:00 2015-11-01 21:00:00 1
# 4: RE01 2015-11-02 07:00:00 2015-11-02 07:00:00 1
# 5: RE01 2015-11-02 17:00:00 2015-11-02 17:00:00 NA
# 6: RE02 2015-11-03 03:00:00 2015-11-03 03:00:00 NA
# 7: RE02 2015-11-03 13:00:00 2015-11-03 13:00:00 NA
# 8: RE02 2015-11-03 23:00:00 2015-11-03 23:00:00 NA
# 9: RE02 2015-11-04 09:00:00 2015-11-04 09:00:00 2
# 10: RE02 2015-11-04 19:00:00 2015-11-04 19:00:00 2
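Either way, the helper column added for the join can be dropped afterwards (a small follow-up, not part of the answer above):
# Remove the temporary join-boundary column by reference
original[, end := NULL]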

Partitioning data set by time intervals in R

I have some observed data by hour. I am trying to subset this data by day or even week intervals. I am not sure how to proceed with this task in R.
The sample of the data is below.
date obs
2011-10-24 01:00:00 12
2011-10-24 02:00:00 4
2011-10-24 19:00:00 18
2011-10-24 20:00:00 7
2011-10-24 21:00:00 4
2011-10-24 22:00:00 2
2011-10-25 00:00:00 4
2011-10-25 01:00:00 2
2011-10-25 02:00:00 2
2011-10-25 15:00:00 12
2011-10-25 18:00:00 2
2011-10-25 19:00:00 3
2011-10-25 21:00:00 2
2011-10-25 23:00:00 9
2011-10-26 00:00:00 13
2011-10-26 01:00:00 11
First I entered the data with the multiple spaces replaced with tabs.
dat$date <- as.POSIXct(dat$date, format="%Y-%m-%d %H:%M:%S")
split(dat , as.POSIXlt(dat$date)$yday)
# Notice these are not the same functions
#---------------------
$`296`
date obs
1 2011-10-24 01:00:00 12
2 2011-10-24 02:00:00 4
3 2011-10-24 19:00:00 18
4 2011-10-24 20:00:00 7
5 2011-10-24 21:00:00 4
6 2011-10-24 22:00:00 2
$`297`
date obs
7 2011-10-25 00:00:00 4
8 2011-10-25 01:00:00 2
9 2011-10-25 02:00:00 2
10 2011-10-25 15:00:00 12
11 2011-10-25 18:00:00 2
12 2011-10-25 19:00:00 3
13 2011-10-25 21:00:00 2
14 2011-10-25 23:00:00 9
$`298`
date obs
15 2011-10-26 00:00:00 13
16 2011-10-26 01:00:00 11
The POSIXlt class does not work well inside data frames, but it can be very handy for creating time-based groups. It's a list structure with these components: 'yday', 'wday', 'year', 'mon', 'mday', 'hour', 'min', 'sec' and 'isdst'. The cut.POSIXt function adds divisions at other natural boundaries; e.g.
?cut.POSIXt
split(dat , cut(dat$date, "week") )
If you wanted to sum within date:
tapply(dat$obs, as.POSIXlt(dat$date)$yday, sum)
#-------
296 297 298
47 36 24
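The same idea extends to weekly sums by grouping with the cut() call shown above (a sketch; with this small sample everything falls in a single week):
# Sum obs within each week produced by cut.POSIXt
tapply(dat$obs, cut(dat$date, "week"), sum)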
I'd use a time series class such as xts
dat <- read.table(text="2011-10-24 01:00:00 12
2011-10-24 02:00:00 4
2011-10-24 19:00:00 18
2011-10-24 20:00:00 7
2011-10-24 21:00:00 4
2011-10-24 22:00:00 2
2011-10-25 00:00:00 4
2011-10-25 01:00:00 2
2011-10-25 02:00:00 2
2011-10-25 15:00:00 12
2011-10-25 18:00:00 2
2011-10-25 19:00:00 3
2011-10-25 21:00:00 2
2011-10-25 23:00:00 9
2011-10-26 00:00:00 13
2011-10-26 01:00:00 11", header=FALSE, stringsAsFactors=FALSE)
library(xts)
xobj <- xts(dat[, 3], as.POSIXct(paste(dat[, 1], dat[, 2])))
xts subsetting is very intuitive. For all data on "2011-10-25", do this:
xobj["2011-10-25"]
# [,1]
#2011-10-25 00:00:00 4
#2011-10-25 01:00:00 2
#2011-10-25 02:00:00 2
#2011-10-25 15:00:00 12
#2011-10-25 18:00:00 2
#2011-10-25 19:00:00 3
#2011-10-25 21:00:00 2
#2011-10-25 23:00:00 9
You can also subset out time spans like this (all data between and including 2011-10-24 and 2011-10-25)
xobj["2011-10-24/2011-10-25"]
Or, if you want all data from October 2011,
xobj["2011-10"]
If you want to get all data from any day that is between 19:00 and 20:00,
xobj['T19:00:00/T20:00:00']
# [,1]
#2011-10-24 19:00:00 18
#2011-10-24 20:00:00 7
#2011-10-25 19:00:00 3
You can use the endpoints function to find the rows that are the last rows of a time period ("hours", "days", "weeks", etc.)
endpoints(xobj, "days")
[1] 0 6 14 16
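Those endpoints pair naturally with period.apply when you want to aggregate within each period (a sketch; sum is just an example statistic):
# Sum the observations within each day, using the daily endpoints as boundaries
period.apply(xobj, endpoints(xobj, "days"), sum)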
Or you can convert to a lower frequency
to.weekly(xobj)
# xobj.Open xobj.High xobj.Low xobj.Close
#2011-10-26 12 18 2 11
to.daily(xobj)
# xobj.Open xobj.High xobj.Low xobj.Close
#2011-10-25 12 18 2 2
#2011-10-26 4 12 2 9
#2011-10-26 13 13 11 11
Notice that the above creates columns for Open, High, Low, and Close. If you only want the data at the endpoints, you can use OHLC=FALSE
to.daily(xobj, OHLC=FALSE)
# [,1]
#2011-10-25 2
#2011-10-26 9
#2011-10-26 11
For more basic subsetting, and much more, visit http://www.quantmod.com/examples/
As #JoshuaUlrich mentions in the comments, split.xts is INCREDIBLY useful.
You can split by day (or week, or month, etc), apply a function, then recombine
split(xobj, 'days') #create a list where each element is the data for a different day
#[[1]]
# [,1]
#2011-10-24 01:00:00 12
#2011-10-24 02:00:00 4
#2011-10-24 19:00:00 18
#2011-10-24 20:00:00 7
#2011-10-24 21:00:00 4
#2011-10-24 22:00:00 2
#
#[[2]]
# [,1]
#2011-10-25 00:00:00 4
#2011-10-25 01:00:00 2
#2011-10-25 02:00:00 2
#2011-10-25 15:00:00 12
#2011-10-25 18:00:00 2
#2011-10-25 19:00:00 3
#2011-10-25 21:00:00 2
#2011-10-25 23:00:00 9
#
#[[3]]
# [,1]
#2011-10-26 00:00:00 13
#2011-10-26 01:00:00 11
Suppose you want only the first value of each day: split by day, lapply the first function, and rbind back together.
do.call(rbind, lapply(split(xobj, 'days'), first))
# [,1]
#2011-10-24 01:00:00 12
#2011-10-25 00:00:00 4
#2011-10-26 00:00:00 13
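xts also wraps this split/lapply/rbind pattern for common cases; a sketch using apply.daily (mean is just an example function):
# One-liner equivalent for a per-day aggregate
apply.daily(xobj, mean)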
