Suppose I have a dataframe like so:
contracts
Dates Last.Price Last.Price.1 id carry
1 1998-11-30 94.50 98.50 QS -0.040609137
2 1998-11-30 31.32 32.13 HO -0.025210084
3 1998-12-31 95.50 98.00 QS -0.025510204
4 1998-12-31 34.00 34.28 HO -0.008168028
5 1999-01-29 100.00 100.50 QS -0.004975124
6 1999-01-29 33.16 33.42 HO -0.007779773
7 1999-02-26 100.25 100.25 QS 0.000000000
8 1999-02-26 32.29 32.37 HO -0.002471424
9 1999-02-26 10.88 11.00 CO -0.010909091
10 1999-03-31 131.50 130.75 QS 0.005736138
11 1999-03-31 44.68 44.00 HO 0.015454545
12 1999-03-31 15.24 15.16 CO 0.005277045
I want to calculate the weights of each id in each month. I have a function that does this. I use dplyr to achieve this:
library(dplyr)
library(lubridate)
contracts <- contracts %>%
mutate(Dates = ymd(Dates)) %>%
group_by(Dates) %>%
mutate(weights = weight(carry))
which gives:
contracts
Dates Last.Price Last.Price.1 id carry weights
1 1998-11-30 94.50 98.50 QS -0.040609137 0.616979910
2 1998-11-30 31.32 32.13 HO -0.025210084 0.383020090
3 1998-12-31 95.50 98.00 QS -0.025510204 0.757468623
4 1998-12-31 34.00 34.28 HO -0.008168028 0.242531377
5 1999-01-29 100.00 100.50 QS -0.004975124 0.390056023
6 1999-01-29 33.16 33.42 HO -0.007779773 0.609943977
7 1999-02-26 100.25 100.25 QS 0.000000000 NA
8 1999-02-26 32.29 32.37 HO -0.002471424 0.184703218
9 1999-02-26 10.88 11.00 CO -0.010909091 0.815296782
10 1999-03-31 131.50 130.75 QS 0.005736138 0.005736138
11 1999-03-31 44.68 44.00 HO 0.015454545 0.015454545
12 1999-03-31 15.24 15.16 CO 0.005277045 0.005277045
Now I want to lag the weights, such that the weights calculated in November are applied in December. So I essentially want to shift the weights column by group, the group being the dates: the values in November end up being the values in December, and so on.
I also want the shift to match by id, such that if a new id is included, the group where the id first appears will have an NA in the lagged column.
The desired output is given below:
desired
Dates Last.Price Last.Price.1 id carry weights w
1 1998-11-30 94.50 98.50 QS -0.040609137 0.616979910 NA
2 1998-11-30 31.32 32.13 HO -0.025210084 0.383020090 NA
3 1998-12-31 95.50 98.00 QS -0.025510204 0.757468623 0.61697991
4 1998-12-31 34.00 34.28 HO -0.008168028 0.242531377 0.38302009
5 1999-01-29 100.00 100.50 QS -0.004975124 0.390056023 0.75746862
6 1999-01-29 33.16 33.42 HO -0.007779773 0.609943977 0.24253138
7 1999-02-26 100.25 100.25 QS 0.000000000 NA 0.39005602
8 1999-02-26 32.29 32.37 HO -0.002471424 0.184703218 0.60994398
9 1999-02-26 10.88 11.00 CO -0.010909091 0.815296782 NA
10 1999-03-31 131.50 130.75 QS 0.005736138 0.005736138 NA
11 1999-03-31 44.68 44.00 HO 0.015454545 0.015454545 0.18470322
12 1999-03-31 15.24 15.16 CO 0.005277045 0.005277045 0.81529678
Take note of February 1999: CO has an NA because it first appears in February.
Now look at March 1999: CO has the value from February, and QS has an NA only because its February value was NA (due to division by 0).
Can this be done?
Data:
contracts <- read.table(text = "Dates, Last.Price, Last.Price.1, id,carry
1998-11-30, 94.500, 98.500, QS, -0.0406091371
1998-11-30, 31.320, 32.130, HO, -0.0252100840
1998-12-31, 95.500, 98.000, QS, -0.0255102041
1998-12-31, 34.000, 34.280, HO, -0.0081680280
1999-01-29, 100.000, 100.500, QS, -0.0049751244
1999-01-29, 33.160, 33.420, HO, -0.0077797726
1999-02-26, 100.250, 100.250, QS, 0.0000000000
1999-02-26, 32.290, 32.370, HO, -0.0024714242
1999-02-26, 10.880, 11.000, CO, -0.0109090909
1999-03-31, 131.500, 130.750, QS, 0.0057361377
1999-03-31, 44.680, 44.000, HO, 0.0154545455
1999-03-31, 15.240, 15.160, CO, 0.0052770449", sep = ",", header = T)
desired <- read.table(text = "Dates,Last.Price,Last.Price.1,id,carry,weights,w
1998-11-30,94.5,98.5, QS,-0.0406091371,0.616979909839741,NA
1998-11-30,31.32,32.13, HO,-0.025210084,0.383020090160259,NA
1998-12-31,95.5,98, QS,-0.0255102041,0.757468623182272,0.616979909839741
1998-12-31,34,34.28, HO,-0.008168028,0.242531376817728,0.383020090160259
1999-01-29,100,100.5, QS,-0.0049751244,0.390056023188584,0.757468623182272
1999-01-29,33.16,33.42, HO,-0.0077797726,0.609943976811416,0.242531376817728
1999-02-26,100.25,100.25, QS,0,NA,0.390056023188584
1999-02-26,32.29,32.37, HO,-0.0024714242,0.184703218189261,0.609943976811416
1999-02-26,10.88,11, CO,-0.0109090909,0.815296781810739,NA
1999-03-31,131.5,130.75, QS,0.0057361377,0.0057361377,NA
1999-03-31,44.68,44, HO,0.0154545455,0.0154545455,0.184703218189261
1999-03-31,15.24,15.16, CO,0.0052770449,0.0052770449,0.815296782", sep = ",", header = TRUE)
weights function:
weight <- function(vec) {
neg <- which(vec<0)
w <- vec
w[neg] <- vec[vec<0] / sum(vec[vec<0])
w[-neg] <- vec[vec>=0] / sum(vec[vec>=0])
w
}
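A side note on the function above: when a group contains no negative values, which() returns integer(0), and w[-neg] then selects nothing, so the positive values are left unnormalised. That is why the March weights are identical to the carry values. The following is only a sketch of one way around it, using a logical mask instead of which(). It assumes that positive carries should be normalised within the group just like negative ones and that the input has no NA values; weight2 is a hypothetical name, not part of the original code.

```r
# Sketch: same idea as weight(), but with a logical mask so that an
# all-positive (or all-negative) group is still normalised.
# Assumption: vec contains no NA values.
weight2 <- function(vec) {
  neg <- vec < 0                              # TRUE for negative entries
  w <- vec
  w[neg]  <- vec[neg]  / sum(vec[neg])        # negatives sum to 1
  w[!neg] <- vec[!neg] / sum(vec[!neg])       # non-negatives sum to 1
  w
}

# March 1999 carries from the data above (all positive)
w_march <- weight2(c(0.0057361377, 0.0154545455, 0.0052770449))
sum(w_march)
```

A zero carry with no other non-negative value in its group still gives 0/0, so the NaN for QS in February is unchanged by this variant.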
contracts %>%
group_by(Dates) %>%
mutate(weights = weight(carry)) %>%
arrange(Dates) %>%
group_by(id) %>%
mutate(w = dplyr::lag(weights)) %>%
ungroup()
# # A tibble: 12 x 7
# Dates Last.Price Last.Price.1 id carry weights w
# <chr> <dbl> <dbl> <chr> <dbl> <dbl> <dbl>
# 1 1998-11-30 94.5 98.5 " QS" -0.0406 0.617 NA
# 2 1998-11-30 31.3 32.1 " HO" -0.0252 0.383 NA
# 3 1998-12-31 95.5 98 " QS" -0.0255 0.757 0.617
# 4 1998-12-31 34 34.3 " HO" -0.00817 0.243 0.383
# 5 1999-01-29 100 100. " QS" -0.00498 0.390 0.757
# 6 1999-01-29 33.2 33.4 " HO" -0.00778 0.610 0.243
# 7 1999-02-26 100. 100. " QS" 0 NaN 0.390
# 8 1999-02-26 32.3 32.4 " HO" -0.00247 0.185 0.610
# 9 1999-02-26 10.9 11 " CO" -0.0109 0.815 NA
# 10 1999-03-31 132. 131. " QS" 0.00574 0.00574 NaN
# 11 1999-03-31 44.7 44 " HO" 0.0155 0.0155 0.185
# 12 1999-03-31 15.2 15.2 " CO" 0.00528 0.00528 0.815
Notes:
I used dplyr::lag instead of just lag because of the possibility of confusion with stats::lag, which behaves significantly differently than dplyr::lag. While most of the time it'll work just fine, it works until it doesn't ... and it doesn't usually warn you :-)
This is lagging by Dates regardless of month. I'll assume that you are certain that Dates are always perfectly regular, one row per id per month. If you think there's the possibility of a gap (where lagging by row is not correct), then you'll need to break out the year/month into a new field and join the data on itself instead of doing a lag.
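The self-join mentioned in the last note could look roughly like the sketch below. It uses base R only so that it runs without extra packages; the toy data (with HO missing in February), the month_index helper and the column name ym are all made up for illustration and are not from the original post.

```r
# Toy data: HO has no row in February 1999 (values are invented)
wts <- data.frame(
  Dates   = as.Date(c("1999-01-29", "1999-01-29", "1999-02-26",
                      "1999-03-31", "1999-03-31")),
  id      = c("QS", "HO", "QS", "QS", "HO"),
  weights = c(0.4, 0.6, 1.0, 0.7, 0.3)
)

# Month counter: consecutive calendar months differ by exactly 1
month_index <- function(d) {
  lt <- as.POSIXlt(d)
  (lt$year + 1900L) * 12L + lt$mon
}
wts$ym <- month_index(wts$Dates)

# Previous month's weights, re-keyed to the month they apply to
prev <- data.frame(id = wts$id, ym = wts$ym + 1L, w = wts$weights)

# Left join: ids with no row in the previous month get NA
out <- merge(wts, prev, by = c("id", "ym"), all.x = TRUE)
out <- out[order(out$Dates, out$id), ]
out
```

With a plain by-row lag, HO in March would pick up its January weight (0.6); with the join it gets NA because there is no HO row for February.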
Related
I have the following sample df from 20m sprint testing athletes with split times. They do 3 trials. I want to create new columns for each split that average their two fastest trials (drop the slowest trial).
Here is a sample of the df:
Athlete 0_10m_1 10_20m_1 0_20m_1 0_10m_2 10_20m_2 0_20m_2 0_10m_3 10_20m_3 0_20m_3
1 Athlete 1 2.005 1.320 3.325 1.904 1.306 3.210 1.993 1.316 3.309
2 Athlete 2 1.967 1.383 3.350 1.931 1.391 3.322 2.005 1.399 3.404
3 Athlete 3 2.008 1.381 3.389 2.074 1.365 3.439 2.047 1.408 3.455
4 Athlete 4 1.817 1.286 3.103 1.924 1.285 3.209 NA NA NA
The end result would be 3 new columns with the mean values of the 2 fastest trials (based on the 0_20m time): "Avg_0_10m", "Avg_10_20m", "Avg_0_20m". Ideally the solution is robust enough to handle NA values, as there will be some within the dataset.
Any suggestions on how to approach this? I'm not sure how to be able to filter out the slowest 0_20m trial with the related split times and average the other trials.
library(tidyverse)
x <- read.table(text=" Athlete 0_10m_1 10_20m_1 0_20m_1 0_10m_2 10_20m_2 0_20m_2 0_10m_3 10_20m_3 0_20m_3
'Athlete 1' 2.005 1.320 3.325 1.904 1.306 3.210 1.993 1.316 3.309
'Athlete 2' 1.967 1.383 3.350 1.931 1.391 3.322 2.005 1.399 3.404
'Athlete 3' 2.008 1.381 3.389 2.074 1.365 3.439 2.047 1.408 3.455
'Athlete 4' 1.817 1.286 3.103 1.924 1.285 3.209 NA NA NA", header=TRUE, check.names=FALSE)
x %>%
gather(trial,time,-Athlete) %>%
separate(trial, sep = "(?<=m)_", into = c("trial_time", "trial_try")) %>%
group_by(Athlete, trial_time) %>%
group_split() %>%
purrr::map(function(x) {
x %>%
arrange(time) %>%
group_by(Athlete, trial_time) %>%
summarise(time_avg = mean(time[1:2], na.rm = TRUE))
}) %>%
bind_rows() %>%
spread(trial_time, time_avg)
First to create the data.frame.
x <- read.table(text="x Athlete 0_10m_1 10_20m_1 0_20m_1 0_10m_2 10_20m_2 0_20m_2 0_10m_3 10_20m_3 0_20m_3
1 Athlete 1 2.005 1.320 3.325 1.904 1.306 3.210 1.993 1.316 3.309
2 Athlete 2 1.967 1.383 3.350 1.931 1.391 3.322 2.005 1.399 3.404
3 Athlete 3 2.008 1.381 3.389 2.074 1.365 3.439 2.047 1.408 3.455
4 Athlete 4 1.817 1.286 3.103 1.924 1.285 3.209 NA NA NA", header=T, check.names=F)
x %>% select(-x) %>%
gather("split", "time", -Athlete) %>%
mutate(split = gsub("_\\d$","", split)) %>%
group_by(Athlete, split) %>%
arrange(time) %>%
slice(1:2) %>%
summarize(Avg = mean(time))
# A tibble: 12 x 3
# Groups: Athlete [4]
# Athlete split Avg
# <int> <chr> <dbl>
# 1 1 0_10m 1.95
# 2 1 0_20m 3.26
# 3 1 10_20m 1.31
# 4 2 0_10m 1.95
# 5 2 0_20m 3.34
# 6 2 10_20m 1.39
# 7 3 0_10m 2.03
# 8 3 0_20m 3.41
# 9 3 10_20m 1.37
#10 4 0_10m 1.87
#11 4 0_20m 3.16
#12 4 10_20m 1.29
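Both answers above rank every split on its own, so the two values averaged for 0_10m may come from different trials than the two averaged for 0_20m. If the trials should instead be picked by the 0_20m time, as the question describes, one possible base R sketch is shown below. It assumes exactly three trials and the column naming used in the question; the names splits, times, avg and result are introduced here only for illustration.

```r
x <- read.table(text = "Athlete 0_10m_1 10_20m_1 0_20m_1 0_10m_2 10_20m_2 0_20m_2 0_10m_3 10_20m_3 0_20m_3
'Athlete 1' 2.005 1.320 3.325 1.904 1.306 3.210 1.993 1.316 3.309
'Athlete 2' 1.967 1.383 3.350 1.931 1.391 3.322 2.005 1.399 3.404
'Athlete 3' 2.008 1.381 3.389 2.074 1.365 3.439 2.047 1.408 3.455
'Athlete 4' 1.817 1.286 3.103 1.924 1.285 3.209 NA NA NA",
                header = TRUE, check.names = FALSE)

splits <- c("0_10m", "10_20m", "0_20m")
times  <- as.matrix(x[, -1])               # numeric matrix, one row per athlete

avg <- t(apply(times, 1, function(r) {
  total <- r[paste0("0_20m_", 1:3)]        # total time of each trial
  # indices of the (up to) two fastest trials; NA trials are dropped
  keep  <- head(order(total, na.last = NA), 2)
  # average every split over the same selected trials
  sapply(splits, function(s) mean(r[paste0(s, "_", keep)]))
}))
colnames(avg) <- paste0("Avg_", splits)

result <- cbind(x["Athlete"], avg)
result
```

For Athlete 3 this keeps trials 1 and 2, so Avg_0_10m is the mean of 2.008 and 2.074, whereas the per-split ranking averages 2.008 and 2.047.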
I have data on swimming times that I would like to plot over time. Is there a quick way to change these variables from character to numeric?
I started by trying to convert the times to a POSIX date-time format, but that proved to not be helpful, especially because I would like to do some ARIMA predictions on the data.
Here is my data
times <- c("47.45","47.69",
"47.69","47.82",
"47.84","47.92",
"47.96","48.13",
"48.16","48.16",
"48.16","48.31",
"49.01","49.27",
"49.33","49.40",
"49.48","49.51",
"52.85","52.89",
"53.14","54.31",
"54.63","56.91",
"1:18.39","1:20.26",
"1:38.30")
dates <- c("2017-02-24 MST",
"2017-02-24 MST",
"2016-02-26 MST",
"2018-02-23 MST",
"2015-12-04 MST",
"2015-03-06 MST",
"2015-03-06 MST",
"2016-12-02 MST",
"2016-02-26 MST",
"2017-11-17 MST",
"2016-12-02 MST",
"2017-11-17 MST",
"2014-11-22 MST",
"2017-01-13 MST",
"2017-01-21 MST",
"2015-10-17 MDT",
"2017-01-27 MST",
"2016-01-29 MST",
"2017-10-20 MDT",
"2016-11-05 MDT",
"2015-11-07 MST",
"2015-10-30 MDT",
"2014-11-22 MST",
"2016-11-11 MST",
"2014-02-28 MST",
"2014-02-28 MST",
"2014-02-28 MST")
df <- cbind(as.data.frame(dates),as.data.frame(times))
I hope to get a column for time, probably in seconds, so the first 24 observations would stay the same, but the last 3 would change to 78.39, 80.26, and 98.30.
One way is to pre-pend those times that don't have minutes with "00:".
Then you can use lubridate::ms to do the time conversion.
library(dplyr)
library(lubridate)
data.frame(times = times,
stringsAsFactors = FALSE) %>%
mutate(times2 = ifelse(grepl(":", times), times, paste0("00:", times)),
seconds = as.numeric(ms(times2)))
Result:
times times2 seconds
1 47.45 00:47.45 47.45
2 47.69 00:47.69 47.69
3 47.69 00:47.69 47.69
4 47.82 00:47.82 47.82
5 47.84 00:47.84 47.84
6 47.92 00:47.92 47.92
7 47.96 00:47.96 47.96
8 48.13 00:48.13 48.13
9 48.16 00:48.16 48.16
10 48.16 00:48.16 48.16
11 48.16 00:48.16 48.16
12 48.31 00:48.31 48.31
13 49.01 00:49.01 49.01
14 49.27 00:49.27 49.27
15 49.33 00:49.33 49.33
16 49.40 00:49.40 49.40
17 49.48 00:49.48 49.48
18 49.51 00:49.51 49.51
19 52.85 00:52.85 52.85
20 52.89 00:52.89 52.89
21 53.14 00:53.14 53.14
22 54.31 00:54.31 54.31
23 54.63 00:54.63 54.63
24 56.91 00:56.91 56.91
25 1:18.39 1:18.39 78.39
26 1:20.26 1:20.26 80.26
27 1:38.30 1:38.30 98.30
as.difftime, and a quick regex to add the minutes when they are not present, should handle it:
as.difftime(sub("(^\\d{1,2}\\.)", "0:\\1", times), format="%M:%OS")
#Time differences in secs
# [1] 47.45 47.69 47.69 47.82 47.84 47.92 47.96 48.13 48.16 48.16 48.16 48.31
#[13] 49.01 49.27 49.33 49.40 49.48 49.51 52.85 52.89 53.14 54.31 54.63 56.91
#[25] 78.39 80.26 98.30
You can use separate in the Tidyverse tidyr package to split the strings into minutes and seconds:
library(tidyr)
library(dplyr)
separate(tibble(times = times), times, sep = ":",
into = c("min", "sec"), fill = "left", convert = T) %>%
mutate(min = ifelse(is.na(min), 0, min),
seconds = 60 * min + sec)
# A tibble: 27 x 3
min sec seconds
<dbl> <dbl> <dbl>
1 0 47.4 47.4
2 0 47.7 47.7
3 0 47.7 47.7
4 0 47.8 47.8
5 0 47.8 47.8
6 0 47.9 47.9
7 0 48.0 48.0
8 0 48.1 48.1
9 0 48.2 48.2
10 0 48.2 48.2
# ... with 17 more rows
The new column seconds holds the total time in seconds: the number of minutes multiplied by 60, plus the seconds.
I have a dataframe named "Historical_Stock_Prices_R" like this:
Date1 MSFT AAPL GOOGL
25-01-05 21.03 4.87 88.56
26-01-05 21.02 4.89 94.62
27-01-05 21.10 4.91 94.04
28-01-05 21.16 5.00 95.17
I use the following code to get a list of monthly max and monthly mean log returns from the daily price data:
return<- cbind.data.frame(date=Historical_Stock_Prices_R$Date1[2:nrow(Historical_Stock_Prices_R)],apply(Historical_Stock_Prices_R[,2:4],2,function(x) log(x[-1]/x[-length(x)])*100))
return$Date <- as.Date(return$date,format="%d-%m-%y")
RMax <- aggregate(return[,-1],
by=list(Month=format(return$Date,"%y-%m")),
FUN=max)
RMean <- aggregate(return[,-1],
by=list(Month=format(return$Date,"%y-%m")),
FUN=mean)
But now I have a matrix (not a dataframe) named "df" like this
AAPL.High ABT.High ABBV.High ACN.High ADBE.High
07-01-02 NA NA NA NA NA
03-01-07 12.37 24.74 NA 37 41.32
04-01-07 12.28 25.12 NA 37.23 41
05-01-07 12.31 25 NA 36.99 40.9
How can I calculate the same monthly mean and monthly max using similar code?
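One possible approach, sketched under the assumption that the matrix row names hold the dates in day-month-year ("%d-%m-%y") form: parse the row names into Date values, build the month key from them, and pass the matrix to aggregate() as a data frame. The toy matrix below reuses a few values from the question; the two February rows are invented so that there is more than one month, and only two of the columns are shown.

```r
# Toy matrix in the same shape as the question's "df"
# (February rows are made up for illustration)
df <- matrix(c(12.37, 12.28, 12.31, 12.50, 12.40,
               24.74, 25.12, 25.00,    NA, 25.30),
             ncol = 2,
             dimnames = list(c("03-01-07", "04-01-07", "05-01-07",
                               "01-02-07", "02-02-07"),
                             c("AAPL.High", "ABT.High")))

# Row names -> Date -> year-month key
dates <- as.Date(rownames(df), format = "%d-%m-%y")
month <- format(dates, "%y-%m")

# Same aggregate() calls as before, with NA values ignored
RMean <- aggregate(as.data.frame(df), by = list(Month = month),
                   FUN = mean, na.rm = TRUE)
RMax  <- aggregate(as.data.frame(df), by = list(Month = month),
                   FUN = max, na.rm = TRUE)
RMean
RMax
```

If log returns are needed first, the apply() step from the data frame version can be run on the matrix columns before aggregating. A month in which a column is entirely NA would give NaN for the mean and -Inf (with a warning) for the max, so such months may need separate handling.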
I have daily interest rate data covering 15 years, from 01-01-2000 to 01-01-2015.
I want to convert this to monthly data, keyed only by month and year.
For each month, I want to take the mean of the values of all its days and use that as the single value for the month.
How can I do this in R?
> str(mibid)
'data.frame': 4263 obs. of 6 variables:
$ Days: int 1 2 3 4 5 6 7 8 9 10 ...
$ Date: Date, format: "2000-01-03" "2000-01-04" "2000-01-05" "2000-01-06" ...
$ BID : num 8.82 8.82 8.88 8.79 8.78 8.8 8.81 8.82 8.86 8.78 ...
$ I.S : num 0.092 0.0819 0.0779 0.0801 0.074 0.0766 0.0628 0.0887 0.0759 0.073 ...
$ BOR : num 9.46 9.5 9.52 9.36 9.33 9.37 9.42 9.39 9.4 9.33 ...
$ R.S : num 0.0822 0.0817 0.0828 0.0732 0.084 0.0919 0.0757 0.0725 0.0719 0.0564 ...
> head(mibid)
Days Date BID I.S BOR R.S
1 1 2000-01-03 8.82 0.0920 9.46 0.0822
2 2 2000-01-04 8.82 0.0819 9.50 0.0817
3 3 2000-01-05 8.88 0.0779 9.52 0.0828
4 4 2000-01-06 8.79 0.0801 9.36 0.0732
5 5 2000-01-07 8.78 0.0740 9.33 0.0840
6 6 2000-01-08 8.80 0.0766 9.37 0.0919
>
I'd do this with xts:
set.seed(21)
mibid <- data.frame(Date=Sys.Date()-100:1,
BID=rnorm(100, 8, 0.1), I.S=rnorm(100, 0.08, 0.01),
BOR=rnorm(100, 9, 0.1), R.S=rnorm(100, 0.08, 0.01))
require(xts)
# convert to xts
xmibid <- xts(mibid[,-1], mibid[,1])
# aggregate
agg_xmibid <- apply.monthly(xmibid, colMeans)
# convert back to data.frame
agg_mibid <- data.frame(Date=index(agg_xmibid), agg_xmibid, row.names=NULL)
head(agg_mibid)
# Date BID I.S BOR R.S
# 1 2015-04-30 8.079301 0.07189111 9.074807 0.06819096
# 2 2015-05-31 7.987479 0.07888328 8.999055 0.08090253
# 3 2015-06-30 8.043845 0.07885779 9.018338 0.07847999
# 4 2015-07-31 7.990822 0.07799489 8.980492 0.08162038
# 5 2015-08-07 8.000414 0.08535749 9.044867 0.07755017
A small example of how this might be done using dplyr and lubridate
set.seed(321)
dat <- data.frame(day=seq.Date(as.Date("2010-01-01"), length.out=200, by="day"),
x = rnorm(200),
y = rexp(200))
head(dat)
day x y
1 2010-01-01 1.7049032 2.6286754
2 2010-01-02 -0.7120386 0.3916089
3 2010-01-03 -0.2779849 0.1815379
4 2010-01-04 -0.1196490 0.1234461
5 2010-01-05 -0.1239606 2.2237404
6 2010-01-06 0.2681838 0.3217511
require(dplyr)
require(lubridate)
dat %>%
mutate(year = year(day),
monthnum = month(day),
month = month(day, label=T)) %>%
group_by(year, month) %>%
arrange(year, monthnum) %>%
select(-monthnum) %>%
summarise(x = mean(x),
y = mean(y))
Source: local data frame [7 x 4]
Groups: year
year month x y
1 2010 Jan 0.02958633 0.9387509
2 2010 Feb 0.07711820 1.0985411
3 2010 Mar -0.06429982 1.2395438
4 2010 Apr -0.01787658 1.3627864
5 2010 May 0.19131861 1.1802712
6 2010 Jun -0.04894075 0.8224855
7 2010 Jul -0.22410057 1.1749863
Another option is using data.table, which has several very convenient datetime functions. Using the data of @SamThomas:
library(data.table)
setDT(dat)[, lapply(.SD, mean), by=.(year(day), month(day))]
this gives:
year month x y
1: 2010 1 0.02958633 0.9387509
2: 2010 2 0.07711820 1.0985411
3: 2010 3 -0.06429982 1.2395438
4: 2010 4 -0.01787658 1.3627864
5: 2010 5 0.19131861 1.1802712
6: 2010 6 -0.04894075 0.8224855
7: 2010 7 -0.22410057 1.1749863
On the data of @JoshuaUlrich:
setDT(mibid)[, lapply(.SD, mean), by=.(year(Date), month(Date))]
gives:
year month BID I.S BOR R.S
1: 2015 5 7.997178 0.07794925 8.999625 0.08062426
2: 2015 6 8.034805 0.07940600 9.019823 0.07823314
3: 2015 7 7.989371 0.07822263 8.996015 0.08195401
4: 2015 8 8.010541 0.08364351 8.982793 0.07748399
If you want the names of the months instead of numbers, you will have to convert the date column to IDate right after the setDT() part (for example [, Date:=as.IDate(Date)]) and use months instead of month:
setDT(mibid)[, Date:=as.IDate(Date)][, lapply(.SD, mean), by=.(year(Date), months(Date))]
Note: Especially on larger datasets, data.table will probably be (a lot) faster than the other two solutions.
This is my first question on this forum.
I would like to re-model the structure of my dataset.
I would like to split the column "Teams" into two columns. One with the hometeam and another with the awayteam.
I would also like to split the result into two columns, Homegoals and Awaygoals. The new columns should not have a zero in front of the "real" goals scored.
BEFORE
Date Time Teams Results Homewin Draw Awaywin
18 May 19:45 AC Milan - Sassuolo 02:01 1.26 6.22 10.47
18 May 19:45 Chievo - Inter 02:01 3.73 3.42 2.05
18 May 19:45 Fiorentina - Torino 02:02 2.84 3.58 2.39
AFTER
Date Time Hometeam Awayteam Homegoals Awaygoals Homewin Draw Awaywin
18 May 19:45 AC Milan Sassuolo 2 1 1.26 6.22 10.47
18 May 19:45 Chievo Inter 2 1 3.73 3.42 2.05
18 May 19:45 Fiorentina Torino 2 2 2.84 3.58 2.39
Can R fix this problem for me? Which packages do I need?
I want to be able to do this for many excel spreadsheets with different leagues and divisions but all with the same structure.
Can someone help me and my data.frame?
tidyr solution:
separate(your.data.frame, Teams, c('Home', 'Away'), sep = " - ")
Base R solution (following this answer):
df <- data.frame(do.call(rbind, strsplit(as.character(your.df$Teams), " - ")))
names(df) <- c("Home", "Away")
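The base R lines above only cover the Teams column. A sketch that also splits Results and drops the leading zero could look like this; the small your.df below is rebuilt from the question's sample rows (only the two relevant columns), and out is a name introduced here for illustration.

```r
# Sample rows from the question, Teams and Results only
your.df <- data.frame(
  Teams   = c("AC Milan - Sassuolo", "Chievo - Inter", "Fiorentina - Torino"),
  Results = c("02:01", "02:01", "02:02"),
  stringsAsFactors = FALSE
)

# Split each column on its separator; rbind stacks the pieces into a matrix
teams <- do.call(rbind, strsplit(your.df$Teams, " - ", fixed = TRUE))
goals <- do.call(rbind, strsplit(your.df$Results, ":", fixed = TRUE))

out <- data.frame(
  Hometeam  = teams[, 1],
  Awayteam  = teams[, 2],
  Homegoals = as.integer(goals[, 1]),   # as.integer("02") gives 2
  Awaygoals = as.integer(goals[, 2]),
  stringsAsFactors = FALSE
)
out
```

The remaining columns (Date, Time and the odds) can be added back with cbind(). This assumes every team name pair is separated by exactly " - " and that no team name contains that sequence itself.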
Here's an approach that uses cSplit from the splitstackshape package, which uses and returns a data.table. Presuming your original data frame is named df,
library(splitstackshape)
setnames(
cSplit(df, 3:4, c(" - ", ":"))[, c(1:2, 6:9, 3:5), with = FALSE],
3:6,
paste0(c("Home", "Away"), rep(c("Team", "Goals"), each = 2))
)[]
# Date Time HomeTeam AwayTeam HomeGoals AwayGoals Homewin Draw Awaywin
# 1: 18 May 19:45 AC Milan Sassuolo 2 1 1.26 6.22 10.47
# 2: 18 May 19:45 Chievo Inter 2 1 3.73 3.42 2.05
# 3: 18 May 19:45 Fiorentina Torino 2 2 2.84 3.58 2.39