I have a data set of swimming times that I would like to plot over time. Is there a quick way to change these variables from character to numeric?
I started by trying to convert the times to a POSIX date-time format, but that proved not to be helpful, especially because I would like to run some ARIMA predictions on the data.
Here is my data:
times <- c("47.45","47.69",
"47.69","47.82",
"47.84","47.92",
"47.96","48.13",
"48.16","48.16",
"48.16","48.31",
"49.01","49.27",
"49.33","49.40",
"49.48","49.51",
"52.85","52.89",
"53.14","54.31",
"54.63","56.91",
"1:18.39","1:20.26",
"1:38.30")
dates <- c("2017-02-24 MST",
"2017-02-24 MST",
"2016-02-26 MST",
"2018-02-23 MST",
"2015-12-04 MST",
"2015-03-06 MST",
"2015-03-06 MST",
"2016-12-02 MST",
"2016-02-26 MST",
"2017-11-17 MST",
"2016-12-02 MST",
"2017-11-17 MST",
"2014-11-22 MST",
"2017-01-13 MST",
"2017-01-21 MST",
"2015-10-17 MDT",
"2017-01-27 MST",
"2016-01-29 MST",
"2017-10-20 MDT",
"2016-11-05 MDT",
"2015-11-07 MST",
"2015-10-30 MDT",
"2014-11-22 MST",
"2016-11-11 MST",
"2014-02-28 MST",
"2014-02-28 MST",
"2014-02-28 MST",)
df <- cbind(as.data.frame(dates),as.data.frame(times))
I hope to get a column for time, probably in seconds, so the first 24 obs would stay the same, but the last 3 obs would change to 78.39, 80.26, and 98.30.
One way is to prepend "00:" to the times that don't have minutes.
Then you can use lubridate::ms to do the time conversion.
library(dplyr)
library(lubridate)
data.frame(times = times, stringsAsFactors = FALSE) %>%
  mutate(times2 = ifelse(grepl(":", times), times, paste0("00:", times)),
         seconds = as.numeric(ms(times2)))
Result:
times times2 seconds
1 47.45 00:47.45 47.45
2 47.69 00:47.69 47.69
3 47.69 00:47.69 47.69
4 47.82 00:47.82 47.82
5 47.84 00:47.84 47.84
6 47.92 00:47.92 47.92
7 47.96 00:47.96 47.96
8 48.13 00:48.13 48.13
9 48.16 00:48.16 48.16
10 48.16 00:48.16 48.16
11 48.16 00:48.16 48.16
12 48.31 00:48.31 48.31
13 49.01 00:49.01 49.01
14 49.27 00:49.27 49.27
15 49.33 00:49.33 49.33
16 49.40 00:49.40 49.40
17 49.48 00:49.48 49.48
18 49.51 00:49.51 49.51
19 52.85 00:52.85 52.85
20 52.89 00:52.89 52.89
21 53.14 00:53.14 53.14
22 54.31 00:54.31 54.31
23 54.63 00:54.63 54.63
24 56.91 00:56.91 56.91
25 1:18.39 1:18.39 78.39
26 1:20.26 1:20.26 80.26
27 1:38.30 1:38.30 98.30
as.difftime, and a quick regex to add the minutes when they are not present, should handle it:
as.difftime(sub("(^\\d{1,2}\\.)", "0:\\1", times), format="%M:%OS")
#Time differences in secs
# [1] 47.45 47.69 47.69 47.82 47.84 47.92 47.96 48.13 48.16 48.16 48.16 48.31
#[13] 49.01 49.27 49.33 49.40 49.48 49.51 52.85 52.89 53.14 54.31 54.63 56.91
#[25] 78.39 80.26 98.30
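If a plain numeric vector is needed (for example, for the ARIMA step), the difftime result can be wrapped in as.numeric:
secs <- as.numeric(as.difftime(sub("(^\\d{1,2}\\.)", "0:\\1", times), format = "%M:%OS"))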
You can use separate from the tidyr package to split the strings into minutes and seconds:
library(tidyr)
library(dplyr)
separate(tibble(times = times), times, sep = ":",
         into = c("min", "sec"), fill = "left", convert = TRUE) %>%
  mutate(min = ifelse(is.na(min), 0, min),
         seconds = 60 * min + sec)
# A tibble: 27 x 3
min sec seconds
<dbl> <dbl> <dbl>
1 0 47.4 47.4
2 0 47.7 47.7
3 0 47.7 47.7
4 0 47.8 47.8
5 0 47.8 47.8
6 0 47.9 47.9
7 0 48.0 48.0
8 0 48.1 48.1
9 0 48.2 48.2
10 0 48.2 48.2
# ... with 17 more rows
The new column seconds is the total number of seconds, computed by multiplying the number of minutes by 60 and adding the seconds.
My goal is to apply the geosphere::bearing function to a very large data frame. Because the data frame concerns multiple individuals, I split it into a list using the split function.
I have seen 'lists' and 'for loops' used in the past, but I have no experience with them.
Below is a fraction of my dataset; I have split the data frame by ID into a list with 43 elements. I have attached long and lat in WGS84 to the initial data frame.
ID Date Time Datetime Long Lat x y
10_17 4/18/2017 15:02:00 4/18/2017 15:02 379800.5 5181001 -91.72272 46.35156
10_17 4/20/2017 6:00:00 4/20/2017 6:00 383409 5179885 -91.7044 46.34891
10_17 4/21/2017 21:02:00 4/21/2017 21:02 383191.2 5177960 -91.72297 46.35134
10_24 4/22/2017 10:03:00 4/22/2017 10:03 383448.6 5179918 -91.72298 46.35134
10_17 4/23/2017 12:01:00 4/23/2017 12:01 378582.5 5182110 -91.7242 46.34506
10_24 4/24/2017 1:00:00 4/24/2017 1:00 383647.4 5180009 -91.72515 46.34738
10_24 4/25/2017 16:01:00 4/25/2017 16:01 383407.9 5179872 -91.7184 46.32236
10_17 4/26/2017 18:02:00 4/26/2017 18:02 380691.9 5179353 -91.65361 46.34712
10_36 4/27/2017 20:00:00 4/27/2017 20:00 382521.9 5175266 -91.66127 46.3485
10_36 4/29/2017 11:01:00 4/29/2017 11:01 383443.8 5179909 -91.70303 46.35451
10_36 4/30/2017 0:00:00 4/30/2017 0:00 383060.8 5178361 -91.6685 46.32941
10_40 4/30/2017 13:02:00 4/30/2017 13:02 383426.3 5179873 -91.70263 46.35481
10_40 5/2/2017 17:02:00 5/2/2017 17:02 383393.7 5179883 -91.67099 46.34138
10_40 5/3/2017 6:01:00 5/3/2017 6:01 382875.8 5179376 -91.66324 46.34763
10_88 5/3/2017 19:02:00 5/3/2017 19:02 383264.3 5179948 -91.73075 46.3684
10_88 5/4/2017 8:01:00 5/4/2017 8:01 378554.4 5181966 -91.70413 46.35429
10_88 5/4/2017 21:03:00 5/4/2017 21:03 379830.5 5177232 -91.66452 46.37274
I then try the following:
library(geosphere)
library(sf)
library(magrittr)
dis_list <- split(data, data$ID)
answer <- lapply(dis_list, function(df) {
  start <- df[-1, c("x", "y")] %>%
    st_as_sf(coords = c("x", "y"))
  end <- df[-nrow(df), c("x", "y")] %>%
    st_as_sf(coords = c("x", "y"))
  angles <- geosphere::bearing(start, end)
  df$angles <- c(NA, angles)
  df
})
answer
which gives the error
Error in .pointsToMatrix(p1) :
'list' object cannot be coerced to type 'double'
A Google search on "pass sf points to geosphere bearings" brings up this GIS StackExchange answer that seems to address the issue, which I would characterize as "how to extract numeric vectors from items that are sf-classed POINTs": https://gis.stackexchange.com/questions/416316/compute-east-west-or-north-south-orientation-of-polylines-sf-linestring-in-r
I needed to work with a single section first and then apply the lessons from @Spacedman to this task:
> st_coordinates( st_as_sf(dis_list[[1]], coords = c('x', 'y')) )
X Y
1 -91.72272 46.35156
2 -91.70440 46.34891
3 -91.72297 46.35134
4 -91.72420 46.34506
5 -91.65361 46.34712
So st_coordinates will extract the POINT-classed values into a two-column matrix that can then be passed to geosphere::bearing:
dis_list <- split(dat, dat$ID)
answer <- lapply(dis_list, function(df) {
  start <- df[-1, c("x", "y")] %>%
    st_as_sf(coords = c("x", "y")) %>%
    st_coordinates()
  end1 <- df[-nrow(df), c("x", "y")] %>%
    st_as_sf(coords = c("x", "y")) %>%
    st_coordinates()
  angles <- geosphere::bearing(start, end1)
  df$angles <- c(NA, angles)
  df
})
answer
#------------------------
$`10_17`
ID Date Time date time Long Lat x y
1 10_17 4/18/2017 15:02:00 4/18/2017 15:02 379800.5 5181001 -91.72272 46.35156
2 10_17 4/20/2017 6:00:00 4/20/2017 6:00 383409.0 5179885 -91.70440 46.34891
3 10_17 4/21/2017 21:02:00 4/21/2017 21:02 383191.2 5177960 -91.72297 46.35134
5 10_17 4/23/2017 12:01:00 4/23/2017 12:01 378582.5 5182110 -91.72420 46.34506
8 10_17 4/26/2017 18:02:00 4/26/2017 18:02 380691.9 5179353 -91.65361 46.34712
Datetime angles
1 4/18/2017 15:02 NA
2 4/20/2017 6:00 -78.194383
3 4/21/2017 21:02 100.694352
5 4/23/2017 12:01 7.723513
8 4/26/2017 18:02 -92.387473
$`10_24`
ID Date Time date time Long Lat x y
4 10_24 4/22/2017 10:03:00 4/22/2017 10:03 383448.6 5179918 -91.72298 46.35134
6 10_24 4/24/2017 1:00:00 4/24/2017 1:00 383647.4 5180009 -91.72515 46.34738
7 10_24 4/25/2017 16:01:00 4/25/2017 16:01 383407.9 5179872 -91.71840 46.32236
Datetime angles
4 4/22/2017 10:03 NA
6 4/24/2017 1:00 20.77910
7 4/25/2017 16:01 -10.58228
$`10_36`
ID Date Time date time Long Lat x y
9 10_36 4/27/2017 20:00:00 4/27/2017 20:00 382521.9 5175266 -91.66127 46.34850
10 10_36 4/29/2017 11:01:00 4/29/2017 11:01 383443.8 5179909 -91.70303 46.35451
11 10_36 4/30/2017 0:00:00 4/30/2017 0:00 383060.8 5178361 -91.66850 46.32941
Datetime angles
9 4/27/2017 20:00 NA
10 4/29/2017 11:01 101.72602
11 4/30/2017 0:00 -43.60192
$`10_40`
ID Date Time date time Long Lat x y
12 10_40 4/30/2017 13:02:00 4/30/2017 13:02 383426.3 5179873 -91.70263 46.35481
13 10_40 5/2/2017 17:02:00 5/2/2017 17:02 383393.7 5179883 -91.67099 46.34138
14 10_40 5/3/2017 6:01:00 5/3/2017 6:01 382875.8 5179376 -91.66324 46.34763
Datetime angles
12 4/30/2017 13:02 NA
13 5/2/2017 17:02 -58.48235
14 5/3/2017 6:01 -139.34297
$`10_88`
ID Date Time date time Long Lat x y
15 10_88 5/3/2017 19:02:00 5/3/2017 19:02 383264.3 5179948 -91.73075 46.36840
16 10_88 5/4/2017 8:01:00 5/4/2017 8:01 378554.4 5181966 -91.70413 46.35429
17 10_88 5/4/2017 21:03:00 5/4/2017 21:03 379830.5 5177232 -91.66452 46.37274
Datetime angles
15 5/3/2017 19:02 NA
16 5/4/2017 8:01 -52.55217
17 5/4/2017 21:03 -123.91920
The help page for st_coordinates characterizes its function as "retrieve coordinates in matrix form".
Given that the data already includes WGS84 longitude and latitude (the x and y columns), just using bearing() and distGeo() from geosphere on the split data frames will work. There is no need to create separate start and end points.
library(geosphere)
dfs <- split(data, data$ID)
answer <- lapply(dfs, function(df) {
  # x and y hold the WGS84 longitude/latitude; distGeo() and bearing() on a
  # single set of points return n - 1 sequential values, so pad with NA to
  # keep the column the same length as the data frame
  df$distances <- c(NA, distGeo(df[, c("x", "y")]))
  df$bearings  <- c(NA, bearing(df[, c("x", "y")]))
  df
})
answer
The sf package is useful for converting between coordinate systems, but with the data set above that step can be skipped. I find the geosphere package more straightforward and simpler to use.
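For completeness, here is a sketch of that sf conversion step, in case only the projected Long/Lat columns (UTM metres) were available. EPSG:32615 (UTM zone 15N) is an assumption based on the longitudes of roughly -92 in the sample:
library(sf)
pts <- st_as_sf(data, coords = c("Long", "Lat"), crs = 32615)  # projected metres
pts_wgs84 <- st_transform(pts, crs = 4326)                     # convert to WGS84 lon/lat
lonlat <- st_coordinates(pts_wgs84)                            # two-column matrix for geosphere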
Suppose I have a dataframe like so:
contracts
Dates Last.Price Last.Price.1 id carry
1 1998-11-30 94.50 98.50 QS -0.040609137
2 1998-11-30 31.32 32.13 HO -0.025210084
3 1998-12-31 95.50 98.00 QS -0.025510204
4 1998-12-31 34.00 34.28 HO -0.008168028
5 1999-01-29 100.00 100.50 QS -0.004975124
6 1999-01-29 33.16 33.42 HO -0.007779773
7 1999-02-26 100.25 100.25 QS 0.000000000
8 1999-02-26 32.29 32.37 HO -0.002471424
9 1999-02-26 10.88 11.00 CO -0.010909091
10 1999-03-31 131.50 130.75 QS 0.005736138
11 1999-03-31 44.68 44.00 HO 0.015454545
12 1999-03-31 15.24 15.16 CO 0.005277045
I want to calculate the weights of each id in each month. I have a function that does this, and I use dplyr to apply it by group:
library(dplyr)
library(lubridate)
contracts <- contracts %>%
mutate(Dates = ymd(Dates)) %>%
group_by(Dates) %>%
mutate(weights = weight(carry))
which gives:
contracts
Dates Last.Price Last.Price.1 id carry weights
1 1998-11-30 94.50 98.50 QS -0.040609137 0.616979910
2 1998-11-30 31.32 32.13 HO -0.025210084 0.383020090
3 1998-12-31 95.50 98.00 QS -0.025510204 0.757468623
4 1998-12-31 34.00 34.28 HO -0.008168028 0.242531377
5 1999-01-29 100.00 100.50 QS -0.004975124 0.390056023
6 1999-01-29 33.16 33.42 HO -0.007779773 0.609943977
7 1999-02-26 100.25 100.25 QS 0.000000000 NA
8 1999-02-26 32.29 32.37 HO -0.002471424 0.184703218
9 1999-02-26 10.88 11.00 CO -0.010909091 0.815296782
10 1999-03-31 131.50 130.75 QS 0.005736138 0.005736138
11 1999-03-31 44.68 44.00 HO 0.015454545 0.015454545
12 1999-03-31 15.24 15.16 CO 0.005277045 0.005277045
Now I want to lag the weights, such that the weights calculated in November are applied in December. So I essentially want to shift the weights column by group, the group being the dates, so that the values from November become the values for December, and so on.
Now I also want the shift to match by id, such that if a new id is included, the group where the id first appears will have an NA in the lagged column.
The desired output is given below:
desired
Dates Last.Price Last.Price.1 id carry weights w
1 1998-11-30 94.50 98.50 QS -0.040609137 0.616979910 NA
2 1998-11-30 31.32 32.13 HO -0.025210084 0.383020090 NA
3 1998-12-31 95.50 98.00 QS -0.025510204 0.757468623 0.61697991
4 1998-12-31 34.00 34.28 HO -0.008168028 0.242531377 0.38302009
5 1999-01-29 100.00 100.50 QS -0.004975124 0.390056023 0.75746862
6 1999-01-29 33.16 33.42 HO -0.007779773 0.609943977 0.24253138
7 1999-02-26 100.25 100.25 QS 0.000000000 NA 0.39005602
8 1999-02-26 32.29 32.37 HO -0.002471424 0.184703218 0.60994398
9 1999-02-26 10.88 11.00 CO -0.010909091 0.815296782 NA
10 1999-03-31 131.50 130.75 QS 0.005736138 0.005736138 NA
11 1999-03-31 44.68 44.00 HO 0.015454545 0.015454545 0.18470322
12 1999-03-31 15.24 15.16 CO 0.005277045 0.005277045 0.81529678
Take note of February 1999. CO has an NA because it first appears in February.
Now look at March 1999: CO has the value from February, and QS has an NA only because the February value was NA (due to division by zero).
Can this be done?
Data:
contracts <- read.table(text = "Dates, Last.Price, Last.Price.1, id,carry
1998-11-30, 94.500, 98.500, QS, -0.0406091371
1998-11-30, 31.320, 32.130, HO, -0.0252100840
1998-12-31, 95.500, 98.000, QS, -0.0255102041
1998-12-31, 34.000, 34.280, HO, -0.0081680280
1999-01-29, 100.000, 100.500, QS, -0.0049751244
1999-01-29, 33.160, 33.420, HO, -0.0077797726
1999-02-26, 100.250, 100.250, QS, 0.0000000000
1999-02-26, 32.290, 32.370, HO, -0.0024714242
1999-02-26, 10.880, 11.000, CO, -0.0109090909
1999-03-31, 131.500, 130.750, QS, 0.0057361377
1999-03-31, 44.680, 44.000, HO, 0.0154545455
1999-03-31, 15.240, 15.160, CO, 0.0052770449", sep = ",", header = T)
desired <- read.table(text = "Dates,Last.Price,Last.Price.1,id,carry,weights,w
1998-11-30,94.5,98.5, QS,-0.0406091371,0.616979909839741,NA
1998-11-30,31.32,32.13, HO,-0.025210084,0.383020090160259,NA
1998-12-31,95.5,98, QS,-0.0255102041,0.757468623182272,0.616979909839741
1998-12-31,34,34.28, HO,-0.008168028,0.242531376817728,0.383020090160259
1999-01-29,100,100.5, QS,-0.0049751244,0.390056023188584,0.757468623182272
1999-01-29,33.16,33.42, HO,-0.0077797726,0.609943976811416,0.242531376817728
1999-02-26,100.25,100.25, QS,0,NA,0.390056023188584
1999-02-26,32.29,32.37, HO,-0.0024714242,0.184703218189261,0.609943976811416
1999-02-26,10.88,11, CO,-0.0109090909,0.815296781810739,NA
1999-03-31,131.5,130.75, QS,0.0057361377,0.0057361377,NA
1999-03-31,44.68,44, HO,0.0154545455,0.0154545455,0.184703218189261
1999-03-31,15.24,15.16, CO,0.0052770449,0.0052770449,0.815296782", sep = ",", header = TRUE)
The weight function:
weight <- function(vec) {
  neg <- which(vec < 0)
  w <- vec
  # negative carries are normalized by the sum of the negatives,
  # non-negative carries by the sum of the non-negatives
  w[neg] <- vec[vec < 0] / sum(vec[vec < 0])
  w[-neg] <- vec[vec >= 0] / sum(vec[vec >= 0])
  # note: when no carry is negative, `neg` is empty and `w[-neg]` selects
  # nothing, so the weights are left equal to the raw carries (see March
  # 1999); an all-zero non-negative group divides by zero and yields NaN
  # (QS in February 1999)
  w
}
contracts %>%
group_by(Dates) %>%
mutate(weights = weight(carry)) %>%
arrange(Dates) %>%
group_by(id) %>%
mutate(w = dplyr::lag(weights)) %>%
ungroup()
# # A tibble: 12 x 7
# Dates Last.Price Last.Price.1 id carry weights w
# <chr> <dbl> <dbl> <chr> <dbl> <dbl> <dbl>
# 1 1998-11-30 94.5 98.5 " QS" -0.0406 0.617 NA
# 2 1998-11-30 31.3 32.1 " HO" -0.0252 0.383 NA
# 3 1998-12-31 95.5 98 " QS" -0.0255 0.757 0.617
# 4 1998-12-31 34 34.3 " HO" -0.00817 0.243 0.383
# 5 1999-01-29 100 100. " QS" -0.00498 0.390 0.757
# 6 1999-01-29 33.2 33.4 " HO" -0.00778 0.610 0.243
# 7 1999-02-26 100. 100. " QS" 0 NaN 0.390
# 8 1999-02-26 32.3 32.4 " HO" -0.00247 0.185 0.610
# 9 1999-02-26 10.9 11 " CO" -0.0109 0.815 NA
# 10 1999-03-31 132. 131. " QS" 0.00574 0.00574 NaN
# 11 1999-03-31 44.7 44 " HO" 0.0155 0.0155 0.185
# 12 1999-03-31 15.2 15.2 " CO" 0.00528 0.00528 0.815
Notes:
I used dplyr::lag instead of just lag because of the possibility of confusion with stats::lag, which behaves significantly differently from dplyr::lag. Most of the time plain lag will work just fine, until it doesn't, and it usually won't warn you :-)
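For instance, a minimal illustration of how differently the two behave on the same input:
x <- ts(1:5)
stats::lag(x, 1)   # same values, but the time index is shifted back one step
dplyr::lag(1:5)    # NA 1 2 3 4: values shifted within the vector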
This lags by row within each id, regardless of month. I'll assume that you are certain the Dates are always perfectly regular. If you think there is any possibility of a gap (where lagging by row would be incorrect), then you'll need to break out the year/month into a new field and join the table on itself instead of doing a lag, as sketched below.
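Here is a minimal sketch of that join-on-self approach, assuming contracts already carries the weights column and that its Dates parse with ymd:
library(dplyr)
library(lubridate)
prev <- contracts %>%
  transmute(id,
            ym = floor_date(ymd(Dates), "month") %m+% months(1),  # weights apply one month later
            w  = weights)
contracts %>%
  mutate(ym = floor_date(ymd(Dates), "month")) %>%
  left_join(prev, by = c("id", "ym"))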
I'm new here so...
I have a data frame with two variables (R is new to me; I used Matlab for a long time). One is a classic POSIXlt vector of timestamps with 30 minutes between data points. The second is the data itself (for example, air temperature data), with the same dimensions as the time vector. I used this pair to get nice plots.
I want to reshape the data using time in this fashion: days in the row direction and time of day (up to 48 columns, using the 30-minute intervals between 0:00 and 23:30) in the column direction, so that I can use the data in another R package to fill missing data.
> head(data_f, 10)
time data
1 2013-08-01 00:30:00 8.001
2 2013-08-01 01:00:00 7.918
3 2013-08-01 01:30:00 7.621
4 2013-08-01 02:00:00 7.564
5 2013-08-01 02:30:00 7.718
6 2013-08-01 03:00:00 7.846
7 2013-08-01 03:30:00 7.481
8 2013-08-01 04:00:00 7.351
9 2013-08-01 04:30:00 7.275
10 2013-08-01 05:00:00 7.291
More data
48 2013-08-02 00:00:00 9.372
49 2013-08-02 00:30:00 9.485
50 2013-08-02 01:00:00 9.151
51 2013-08-02 01:30:00 8.870
52 2013-08-02 02:00:00 8.504
53 2013-08-02 02:30:00 8.404
54 2013-08-02 03:00:00 8.342
55 2013-08-02 03:30:00 8.278
56 2013-08-02 04:00:00 8.229
57 2013-08-02 04:30:00 8.163
58 2013-08-02 05:00:00 8.092
59 2013-08-02 05:30:00 8.038
I want an ideally rectangular output (it could be a matrix instead of a data frame), putting NAs where no data is available for that time. Something like this:
(30-min span in this direction -->)
2013-08-01 NA 8.001 7.918 7.621 7.564 7.718 7.846 7.481 7.351 7.275 7.291 ...
2013-08-02 9.372 9.485 9.151 8.870 8.504 8.404 8.342 8.278 8.229 8.092 8.038 ...
2013-08-03 ... ... ... ... ... ... ... ... ... ... ... ...
2013-08-04 ... ... ... ... ... ... ... ... ... ... ... ...
...
...
I tried porting a Matlab function I wrote myself to accomplish this, but with no success, because of the way R interprets dates and times.
Update: how to generate the data. (The original data come from a 7-year database at my work.)
library(lubridate)
data_f = data.frame(time = seq(from = as_datetime("2013-08-01 00:30:00"),
to = as_datetime("2013-10-12 18:00:00"),
by = "30 min"),
data = runif(3491, 2, 14))
Thanks in advance.
One approach you could follow is separating date and time and then reshaping the data. Here is the code with tidyverse functions:
#Data
df <- structure(list(time = structure(c(1375317000, 1375318800, 1375320600,
1375322400, 1375324200, 1375326000, 1375327800, 1375329600, 1375331400,
1375333200, 1375401600, 1375403400, 1375405200, 1375407000, 1375408800,
1375410600, 1375412400, 1375414200, 1375416000, 1375417800, 1375419600,
1375421400), class = c("POSIXct", "POSIXt"), tzone = "GMT"),
data = c(8.001, 7.918, 7.621, 7.564, 7.718, 7.846, 7.481,
7.351, 7.275, 7.291, 9.372, 9.485, 9.151, 8.87, 8.504, 8.404,
8.342, 8.278, 8.229, 8.163, 8.092, 8.038)), class = "data.frame", row.names = c(NA,
-22L))
Code:
#Split and reshape
library(dplyr)
library(tidyr)
df %>%
  separate(time, into = c('V1', 'V2'), sep = ' ') %>%
  pivot_wider(names_from = V2, values_from = data)
Output:
# A tibble: 2 x 13
V1 `00:30:00` `00:59:59` `01:30:00` `02:00:00` `02:29:59` `03:00:00` `03:30:00` `03:59:59` `04:30:00`
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 2013~ 8.00 7.92 7.62 7.56 7.72 7.85 7.48 7.35 7.28
2 2013~ 9.48 9.15 8.87 8.50 8.40 8.34 8.28 8.23 8.16
# ... with 3 more variables: `05:00:00` <dbl>, `00:00:00` <dbl>, `05:29:59` <dbl>
As the names of the new variables can vary, you may need to rearrange them.
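If the shifted names are a problem, one variant (a sketch on the same df) is to format the clock time explicitly, which yields stable column names:
library(dplyr)
library(tidyr)
df %>%
  mutate(day = as.Date(time),                # calendar day for the rows
         clock = format(time, "%H:%M")) %>%  # stable clock-time column names
  pivot_wider(id_cols = day, names_from = clock, values_from = data)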
I have some data that looks like
CustomerID InvoiceDate
<fctr> <dttm>
1 13313 2011-01-04 10:00:00
2 18097 2011-01-04 10:22:00
3 16656 2011-01-04 10:23:00
4 16875 2011-01-04 10:37:00
5 13094 2011-01-04 10:37:00
6 17315 2011-01-04 10:38:00
7 16255 2011-01-04 11:30:00
8 14606 2011-01-04 11:34:00
9 13319 2011-01-04 11:40:00
10 16282 2011-01-04 11:42:00
It tells me when a person makes a transaction. I would like to know the time between transactions for each customer, preferably in days. I do this in the following way:
d <- data %>%
arrange(CustomerID,InvoiceDate) %>%
group_by(CustomerID) %>%
mutate(delta.t = InvoiceDate - lag(InvoiceDate), #calculating the difference
delta.day = as.numeric(delta.t, unit = 'days')) %>%
na.omit() %>%
arrange(CustomerID) %>%
inner_join(Ntrans) %>% #Existing data.frame telling me the number of transactions per customer
filter(N>=10) %>% #only want people with more than 10 transactions
select(-N)
However, the result doesn't make sense (shown below):
CustomerID InvoiceDate delta.t delta.day
<fctr> <dttm> <time> <dbl>
1 12415 2011-01-10 09:58:00 5686 days 5686
2 12415 2011-02-15 09:52:00 51834 days 51834
3 12415 2011-03-03 10:59:00 23107 days 23107
4 12415 2011-04-01 14:28:00 41969 days 41969
5 12415 2011-05-17 15:42:00 66314 days 66314
6 12415 2011-05-20 14:13:00 4231 days 4231
7 12415 2011-06-15 13:37:00 37404 days 37404
8 12415 2011-07-13 15:30:00 40433 days 40433
9 12415 2011-07-13 15:31:00 1 days 1
10 12415 2011-07-19 10:51:00 8360 days 8360
The differences measured in days are way off. What I want is something close to SQL's rolling window functions partitioned by CustomerID. How can I implement this?
If you just want to convert the difference to days, you can use the lubridate package:
> library('lubridate')
> library('dplyr')
>
> InvoiceDate <- c('2011-01-10 09:58:00', '2011-02-15 09:52:00', '2011-03-03 10:59:00')
> CustomerID <- c(111, 111, 111)
>
> dat <- data.frame('Invo' = InvoiceDate, 'ID' = CustomerID)
>
> dat %>% mutate('Delta' = as_date(Invo) - as_date(lag(Invo)))
Invo ID Delta
1 2011-01-10 09:58:00 111 NA days
2 2011-02-15 09:52:00 111 36 days
3 2011-03-03 10:59:00 111 16 days
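If sub-day precision matters, a variant (a sketch, assuming Invo is stored as character) keeps the full timestamps and converts the difference to fractional days with difftime:
dat %>%
  mutate(Invo = as.POSIXct(Invo),
         Delta = as.numeric(difftime(Invo, lag(Invo), units = "days")))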
I've downloaded a list of every Bitcoin transaction on a large exchange since 2013. What I have now looks like this:
Time Price Volume
1 2013-03-31 22:07:49 93.3 80.628518
2 2013-03-31 22:08:13 100.0 20.000000
3 2013-03-31 22:08:14 100.0 1.000000
4 2013-03-31 22:08:16 100.0 5.900000
5 2013-03-31 22:08:19 100.0 29.833879
6 2013-03-31 22:08:21 100.0 20.000000
7 2013-03-31 22:08:25 100.0 10.000000
8 2013-03-31 22:08:29 100.0 1.000000
9 2013-03-31 22:08:31 100.0 5.566121
10 2013-03-31 22:09:27 93.3 33.676862
I'm trying to work with the data in R, but my computer isn't powerful enough to handle processing it when I run getSymbols(BTC_XTS). I'm trying to convert it to a format like the following (price action over a day):
Date Open High Low Close Volume Adj.Close
1 2014-04-11 32.64 33.48 32.15 32.87 28040700 32.87
2 2014-04-10 34.88 34.98 33.09 33.40 33970700 33.40
3 2014-04-09 34.19 35.00 33.95 34.87 21597500 34.87
4 2014-04-08 33.10 34.43 33.02 33.83 35440300 33.83
5 2014-04-07 34.11 34.37 32.53 33.07 47770200 33.07
6 2014-04-04 36.01 36.05 33.83 34.26 41049900 34.26
7 2014-04-03 36.66 36.79 35.51 35.76 16792000 35.76
8 2014-04-02 36.68 36.86 36.56 36.64 14522800 36.64
9 2014-04-01 36.16 36.86 36.15 36.49 15734000 36.49
10 2014-03-31 36.46 36.58 35.73 35.90 15153200 35.90
I'm new to R, and any response would be greatly appreciated!
I don't know what you could mean when you say your "computer isn't powerful enough to handle processing it when [you] run getSymbols(BTC_XTS)". getSymbols retrieves data... why do you need to retrieve data you already have?
Also, you have no adjusted close data, so it's not possible to have an Adj.Close column in the output.
You can get what you want by coercing your input data to xts and calling to.daily on it. For example:
require(xts)
Data <- structure(list(Time = c("2013-03-31 22:07:49", "2013-03-31 22:08:13",
"2013-03-31 22:08:14", "2013-03-31 22:08:16", "2013-03-31 22:08:19",
"2013-03-31 22:08:21", "2013-03-31 22:08:25", "2013-03-31 22:08:29",
"2013-03-31 22:08:31", "2013-03-31 22:09:27"), Price = c(93.3,
100, 100, 100, 100, 100, 100, 100, 100, 93.3), Volume = c(80.628518,
20, 1, 5.9, 29.833879, 20, 10, 1, 5.566121, 33.676862)), .Names = c("Time",
"Price", "Volume"), class = "data.frame", row.names = c(NA, -10L))
x <- xts(Data[,-1], as.POSIXct(Data[,1]))
d <- to.daily(x, name="BTC")
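A quick sanity check of the result (expected shape, not verified output):
head(d)
# Expect columns BTC.Open, BTC.High, BTC.Low, BTC.Close and, because the
# input has a Volume column, BTC.Volume (volume aggregated within each day)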