Sort Data by decade in R - r

So I am fairly new to R and I am having a bit of trouble getting the hang of it. What I am trying to do is to sort my data into decades so that I can analyze the mean value for each decade. So far this is what I have tried:
fred$decade = cut(as.numeric(format(fred$DATE, "%Y")),breaks=seq(1940, 2020, 10))
Error in format.default(structure(as.character(x), names = names(x),
dim = dim(x), :
invalid 'trim' argument
Here is part of the data I am using: I am looking at CPI data since 1948 for every month until 9/1/2016. I want to get the mean CPI of each decade since then:
DATE CPI
8/1/49 23.7
9/1/49 23.75
10/1/49 23.67
11/1/49 23.7
12/1/49 23.61
1/1/50 23.51
2/1/50 23.61
3/1/50 23.64
4/1/50 23.65
5/1/50 23.77
6/1/50 23.88
7/1/50 24.07
8/1/50 24.2
When I use this I always get an error message. I cannot seem to figure out what I am doing wrong. I went through my data to make sure it was fine. Thanks for your help!

Considering dput(stsample) as
structure(list(Date = structure(c(8L, 10L, 11L, 12L, 13L, 1L,
2L, 3L, 4L, 5L, 6L, 7L, 9L), .Label = c("01-01-1950", "02-01-1950",
"03-01-1950", "04-01-1950", "05-01-1950", "06-01-1950", "07-01-1950",
"08-01-1949", "08-01-1950", "09-01-1949", "10-01-1949", "11-01-1949",
"12-01-1949"), class = "factor"), CPI = c(23.7, 23.75, 23.67,
23.7, 23.61, 23.51, 23.61, 23.64, 23.65, 23.77, 23.88, 24.07,
24.2)), .Names = c("Date", "CPI"), class = "data.frame", row.names = c(NA,
-13L))
you can try something like
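# the sample dates are month-day-year (monthly observations on the 1st)
stsample$Date <- as.Date(stsample$Date, "%m-%d-%Y")
stsample$year <- as.numeric(format(stsample$Date, "%Y"))
# right = FALSE gives [1940,1950), [1950,1960), ... so 1950 lands in the 1950s
stsample$decade <- cut(stsample$year, seq(from = 1940, to = 2020, by = 10), right = FALSE)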
Note that the breaks work only on the year part of the date, not on the whole date object. If you have date-time objects, it might be worth looking into cut.POSIXt.
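The error in the question most likely comes from fred$DATE still being a factor: format() then dispatches to the factor/default method, where the second positional argument is trim, not a date format. Here is a minimal sketch of the whole path from the raw "8/1/49" strings to decade means, using a few rows from the question. The name fred_demo and the 2016 cutoff are assumptions for illustration (the cutoff comes from the stated end of the data). Note that %y maps two-digit years 00 to 68 to 20xx, so the century has to be corrected by hand.

```r
# A few rows copied from the question; fred_demo is a hypothetical name
fred_demo <- data.frame(
  DATE = c("8/1/49", "12/1/49", "1/1/50", "8/1/50"),
  CPI  = c(23.7, 23.61, 23.51, 24.2),
  stringsAsFactors = TRUE   # mimic a column that was read in as a factor
)

# Convert the factor to character first, then parse month/day/two-digit year
d  <- as.Date(as.character(fred_demo$DATE), format = "%m/%d/%y")
yr <- as.integer(format(d, "%Y"))

# %y turns "49" into 2049; the data ends in 2016, so later years belong to the 1900s
yr <- ifelse(yr > 2016L, yr - 100L, yr)

# Integer division floors each year to the start of its decade (1949 -> 1940)
fred_demo$decade <- (yr %/% 10L) * 10L

decade_means <- aggregate(CPI ~ decade, data = fred_demo, FUN = mean)
print(decade_means)
```

Using integer division instead of cut() sidesteps the question of which end of each interval is closed.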

You can try this too (output shown with some randomly generated data):
# assuming 40-49 is the decade 40s
fred$DECADE <- 10*as.integer(as.numeric(substring(as.character(fred$DATE), 7, 8)) / 10)
head(fred)
DATE CPI DECADE
1 08/01/49 23.41955 40
2 09/01/49 26.99772 40
3 10/02/49 29.53724 40
4 11/02/49 19.84247 40
5 12/03/49 26.75672 40
6 01/03/50 30.97788 50
# mean value for each DECADE
aggregate(CPI~DECADE, data=fred, FUN=mean)
DECADE CPI
1 40 25.31074
2 50 25.27004
3 60 24.72269

Related

random.forest.importance and XTS package

I am quite new to R coding, the TTR/XTS package and random.forest.importance functions.
I am extracting trading data into an xts object, calculating whether the difference between Close and Open is positive/negative/flat, applying a handful of technical indicators from the TTR package, and then combining the indicators to pass them to random.forest.importance.
When I run the code, I get the following error:
Error in model.frame.default(formula, data, na.action = NULL) : variable lengths differ (found for 'Close').
Data:
Date Time Open High Low Close TVolume
2017-10-12 14:00:00 1.18462 1.18487 1.18334 1.18347 1165
2017-10-12 15:00:00 1.18351 1.18377 1.18295 1.18347 884
2017-10-12 16:00:00 1.18348 1.18348 1.18265 1.18276 1000
2017-10-12 17:00:00 1.18245 1.18329 1.18242 1.18303 1184
2017-10-12 18:00:00 1.18305 1.18373 1.18284 1.18343 469
2017-10-12 19:00:00 1.18343 1.18343 1.18247 1.18303 886
Code as follows:
pkgs <- c('class', 'gmodels', 'quantmod', 'TTR','xts','corrplot','caret','FSelector')
z <- head(tail(hist_r, samples+retro), samples)
z <- as.xts(z[,2:6], order.by=as.POSIXct(z$Timestamp, origin='1970-01-01 00:00', tz='UTC'))
hist <- getHist(z)
h <- as.xts(hist)
price <- z$Close-z$Open
class = ifelse(price > 0, "UP", ifelse(price < 0, "DOWN", "FLAT"))
forceindex <- (z$Close-z$Open) * z$TVolume
WillR5 <- WPR(z[,c("High","Low","Close")], n = 5)
dataset = data.frame(class,forceindex,WillR5)
dataset = na.omit(dataset)
dput(head(dataset, 10))
set.seed(5)
weights <- random.forest.importance(class~., dataset, importance.type = 1)
print(weights)
When I run dput, I get the following:
structure(list(Close = structure(c(1L, 3L, 1L, 1L, 1L, 3L, 1L, 3L, 1L, 3L), .Label = c("DOWN", "FLAT", "UP"), class = "factor"), Close.1 = c(-12.9382400000007, 0.107400000000227, -1.66915000000001, -0.290530000000006, -1.18667999999979, 0.0752800000000753, -0.244080000000094, 0.0653999999999928, -0.395999999999996, 0.372089999999928)
Would sincerely appreciate any help that anyone can give me
Many thanks in advance
Kkel

record how long a variable was above a level in r

I am working on converting a project that I currently have programmed in Excel to R. The reason for doing this is that the code includes lots of logic and data, which means that Excel's performance is very poor. So far I have coded up around 50% of this project in R and I am extremely impressed with the performance.
The code I have does the following:
Loads a 5min time-series data of a stock and adds a day of the year column labeled doy in this example below.
The OHLC data looks like this:
Date Open High Low Close doy
1 2015-09-21 09:30:00 164.6700 164.7100 164.3700 164.5300 264
2 2015-09-21 09:35:00 164.5300 164.9000 164.5300 164.6400 264
3 2015-09-21 09:40:00 164.6600 164.8900 164.6000 164.8900 264
4 2015-09-21 09:45:00 164.9100 165.0900 164.9100 164.9736 264
5 2015-09-21 09:50:00 164.9399 165.0980 164.8200 164.8200 264
Converts that data to a table called df: df <- tbl_df(DIA_5)
Using mainly plyr with a hint of TTR, it filters through the data, creating a set of 10 new variables in a new data frame called data. See below:
data <- structure(list(doy = c(264, 265, 266, 267, 268, 271, 272, 11,12, 13),
Date = structure(c(1442824200, 1442910600, 1442997000,1443083400,
1443169800, 1443429000, 1443515400, 1452504600,
1452591000,1452677400), class = c("POSIXct", "POSIXt"), tzone = ""),
OR_High = c(164.71,162.96, 163.38, 161.37, 163.91, 162.06, 160.22,
164.5, 165.23,165.84), OR_Low = c(164.37, 162.62, 162.98, 161.06,
163.57, 161.66,159.7, 164.06, 164.84, 165.4), HOD = c(165.56, 163.36,
163.38,162.24, 164.43, 162.06, 160.96, 164.5, 165.78, 165.84), LOD =
c(165.22,163.1, 162.98, 161.95, 164.24, 161.66, 160.75, 164.06,
165.56,165.4), Close = c(164.92, 163.02, 162.58, 161.85, 162.94,
159.84,160.19, 163.83, 165.02, 161.38), Range =
c(0.340000000000003,0.260000000000019, 0.400000000000006,
0.29000000000002, 0.189999999999998,0.400000000000006,
0.210000000000008, 0.439999999999998, 0.219999999999999,0.439999999999998),
`A-val` = c(NA, NA, NA, NA, NA, NA, NA,
0.0673439999999994,0.0659639999999996, 0.0729499999999996),
`A-up` = c(NA, NA, NA,NA, NA, NA, NA, 164.567344, 165.295964,
165.91295), `A-down` = c(NA,NA, NA, NA, NA, NA, NA, 163.992656,
164.774036, 165.32705)), .Names = c("doy","Date", "OR_High", "OR_Low",
"HOD", "LOD", "Close", "Range","A-val", "A-up", "A-down"),
row.names = c(1L, 2L, 3L, 4L, 5L,6L, 7L, 78L, 79L, 80L),
class = "data.frame")
The next part is where it gets complicated. What I need to do is to analyse the high and low prices of each 5 minute bar of the day in relation to my A-up & A-down and close values as seen in the table. What I am looking for is to be able to compute a score for the day depending on the time spent above the A-up level or below the A-down level.
The way I got by this in Excel was to index each 5 minute high & low price of the time series and then use logic to score the activity in that 5 minute time slice. If the low was > A-up level it was given a 1, and -1 if the high was < A-down. For the scoring, if price stays > A-up level or < A-down level for greater than 30 minutes I score it a 2 or -2. This was achieved by using a running 5 period sum of the results: if one had more than 5 ones, I knew that price had stayed > the A-up level, and I would score it a 2.
For the day's scoring I need to know the following:
Did price stay above or below an A level for > 30 minutes, or fail by spending < 30 minutes there?
If price went above and below both levels in one day, which level did it break first?
So, after a long-winded intro, my question: does anyone out there have a good idea of the best way to go about coding this? I don't need specific code, but rather which packages may help to accomplish this. As I mentioned above, my reason for switching to R was mainly speed, so whatever code is used must be efficient. When I have this coded I intend to program a loop so that it can analyse several hundred instruments.
Thanks.
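One way to approach the scoring described above without any extra packages is run-length encoding with base R's rle(), which is vectorised and should scale to many instruments. The sketch below is only an illustration: score_day is a hypothetical helper, the bar prices are made up, and treating "more than 30 minutes" as at least 7 consecutive 5-minute bars is an assumption. The A-up and A-down levels are taken from the doy 11 row of the data above.

```r
# Hypothetical helper: classify each 5-minute bar of one day and summarise the runs.
# min_bars = 7 assumes "more than 30 minutes" means at least 7 consecutive bars.
score_day <- function(high, low, a_up, a_down, min_bars = 7L) {
  # +1 if the whole bar is above A-up, -1 if it is wholly below A-down, else 0
  state <- ifelse(low > a_up, 1L, ifelse(high < a_down, -1L, 0L))
  runs  <- rle(state)   # lengths of consecutive identical states
  list(
    state       = state,
    held_up     = any(runs$values ==  1L & runs$lengths >= min_bars),
    held_down   = any(runs$values == -1L & runs$lengths >= min_bars),
    first_break = state[state != 0L][1]   # NA if neither level was broken
  )
}

# Made-up bars: three neutral bars, eight bars above A-up, one neutral bar
low  <- c(164.2, 164.3, 164.4, rep(164.6, 8), 164.3)
high <- low + 0.1
res  <- score_day(high, low, a_up = 164.567344, a_down = 163.992656)
print(res$held_up)
print(res$first_break)
```

To score every day, the same function could be applied per doy group, for example with split() and lapply().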

Decimal hours in r (excluding todays date)

For a sample dataframe:
light <- structure(list(daylight.hours = structure(c(62L, 22L, 60L, 58L,
34L, 37L), .Label = c("07:12:05", "07:14:41", "07:18:24", "07:28:59",
"07:31:07", "07:45:51", "07:48:08", "07:51:29", "07:52:06", "07:58:18",
"08:01:16", "08:07:25", "08:10:08", "08:18:16", "08:23:33", "08:27:03",
"08:30:36", "08:34:13", "08:41:35", "08:46:01", "08:53:52", "08:54:17",
"09:31:16", "09:35:29", "09:39:44", "10:27:19", "10:31:45", "10:36:12",
"11:53:41", "12:11:39", "12:16:10", "12:20:23", "12:34:10", "14:18:26",
"14:22:41", "14:26:55", "14:35:21", "14:39:49", "14:44:00", "14:48:09",
"14:54:29", "14:59:08", "15:03:18", "15:11:01", "15:15:38", "15:15:52",
"15:19:09", "15:58:22", "16:07:10", "16:08:33", "16:24:12", "16:27:14",
"16:42:57", "16:55:32", "16:57:52", "17:00:06", "17:02:15", "17:03:49",
"17:04:17", "17:05:24", "17:06:14", "17:06:53", "17:08:05", "17:09:38",
"17:11:04", "17:12:24", "17:13:26", "17:13:47", "17:14:22", "17:14:32",
"17:14:42", "17:14:44", "17:15:39", "17:15:40", "17:16:22", "17:16:51",
"17:17:55"), class = "factor"), school.id = c(4L, 4L, 4L, 4L,
14L, 14L)), .Names = c("daylight.hours", "school.id"), row.names = c(NA,
6L), class = "data.frame")
I want to create another variable called d.daylight that converts the daylight hours variable to a decimal (i.e. 18:30:00 would read 18.5).
When I use the following, it automatically adds today's date, which is not what I am after (everything is under 24 hours).
light$d.daylight <- as.POSIXlt(light$daylight.hours, format="%H:%M:%S")
Could anyone advise me how to rectify this?
The times function from package chron is useful if you need to deal with times (without dates).
library(chron)
light$d.daylight <- as.numeric(times(light$daylight.hours)) * 24
# daylight.hours school.id d.daylight
#1 17:06:53 4 17.114722
#2 08:54:17 4 8.904722
#3 17:05:24 4 17.090000
#4 17:03:49 4 17.063611
#5 14:18:26 14 14.307222
#6 14:35:21 14 14.589167
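If you would rather not add a package dependency, the same conversion can be sketched in base R by splitting on ":" and weighting hours, minutes and seconds. The helper name to_decimal_hours and the light_demo data frame are made up for illustration; the first two times are taken from the sample data and the third from the example in the question.

```r
# Hypothetical helper: "HH:MM:SS" (character or factor) -> decimal hours
to_decimal_hours <- function(x) {
  parts <- strsplit(as.character(x), ":", fixed = TRUE)
  # hours + minutes / 60 + seconds / 3600 for each element
  vapply(parts, function(p) sum(as.numeric(p) / c(1, 60, 3600)), numeric(1))
}

light_demo <- data.frame(
  daylight.hours = factor(c("17:06:53", "08:54:17", "18:30:00"))
)
light_demo$d.daylight <- to_decimal_hours(light_demo$daylight.hours)
print(light_demo)
```

This assumes every value has exactly three colon-separated fields; malformed entries would need handling first.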

Iterate by month with lubridate and merge combined

I am trying to write a function that merges based on two columns both found in two dataframes. One of the columns is an identifier string and the other is a date.
The first df ("model") includes identifiers, starting dates, and some other relevant info.
The second df ("futurevalues") is a melted df that includes the identifier, multiple months for each identifier, and the relevant value for each identifier-month pair.
I would like to merge values for each identifier based on a certain period of time in the future. So for instance, for Identifier= Mary and starting month="2005-01-31" in "model" I would like to pull in the relevant value for the next month and 11 more months after (so 12 data points for Mary for months starting month+1:starting month+12).
I can merge my dfs by the two columns to get the as-of date value (see below), but this isn't what I need.
testmerge=merge(model,futurevalues,by=c("month","identifier"),all=TRUE)
To solve this, I am trying to use the lubridate date functions. For instance, the function below will allow me to enter a month (and then lapply across the df maybe) to get the values for each of the starting months (which vary across the df, meaning it's not a standard time period across the entire thing).
monthiterate=function (x) {
x %m+% months(1:12)
}
Thanks a lot for your help.
EDIT: adding toy data (first one is model, second one is futurevalues)
structure(list(month = structure(c(12814, 12814, 12814, 12814,
12814, 12814, 12814, 12814, 12814, 12814), class = "Date"), identifier = structure(c(1L,
3L, 2L, 4L, 5L, 7L, 8L, 6L, 9L, 10L), .Label = c("AB1", "AC5",
"BB9", "C99", "D81", "GG8", "Q11", "R45", "ZA1", "ZZ9"), class = "factor"),
value = c(0.831876072999969, 0.218494398256579, 0.550872926656984,
1.81882711231324, -0.245597705276932, -0.964277509916354,
-1.84714556574606, -0.916239506529079, -0.475649743547525,
-0.227721186387637)), .Names = c("month", "identifier", "value"
), class = "data.frame", row.names = c(NA, 10L))
structure(list(identifier = structure(c(1L, 3L, 2L, 4L, 5L, 7L,
8L, 6L, 9L, 10L), .Label = c("AB1", "AC5", "BB9", "C99", "D81",
"GG8", "Q11", "R45", "ZA1", "ZZ9"), class = "factor"), month = structure(c(12814,
13238, 12814, 12814, 12964, 12903, 12903, 12842, 13148, 13148
), class = "Date"), futurereturns = c(-0.503033205660682, 1.22446988772542,
-0.825490985851348, 1.03902417581908, 0.172595565260429, 0.894967582911769,
-0.242324006922964, 0.415520398113024, -0.734437328639625, 2.64184935856802
)), .Names = c("identifier", "month", "futurereturns"), class = "data.frame", row.names
= c(NA, 10L))
You need to create a table of all the combinations of ID and month that you want. Starting with a table of each ID and their starting month:
library(lubridate)
set.seed(1834)
# 3 people, each with a different starting month
x <- data.frame(id = sample(LETTERS, 3)
, month = ymd("2005-01-01") + months(sample(0:11, 3)) - days(1))
> x
id month
1 D 2005-03-31
2 R 2005-07-31
3 Y 2005-02-28
Now add rows for the following two months, per ID. I use dplyr for this kind of thing.
library(dplyr)
y <- x %>%
rowwise %>%
do(data.frame(id = .$id
, month = seq(.$month + days(1)
, by = "1 month"
, length.out = 3) - days(1)))
> y
Source: local data frame [9 x 2]
Groups: <by row>
id month
1 D 2005-03-31
2 D 2005-04-30
3 D 2005-05-31
4 R 2005-07-31
5 R 2005-08-31
6 R 2005-09-30
7 Y 2005-02-28
8 Y 2005-03-31
9 Y 2005-04-30
Now you can use merge() (or left_join() from dplyr) to retrieve the rows you want from the full dataset.
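As a rough end-to-end illustration of that last step, here is a base R sketch that expands each starting month into the following 12 month-ends and then merges. Everything in it is an assumption for demonstration: the start dates, the futurereturns values (just a counter), and the helper next_month_ends, which also assumes each start date is a month-end.

```r
# Toy "model" table: one starting month-end per identifier (made-up dates)
model <- data.frame(
  identifier = c("AB1", "BB9"),
  start      = as.Date(c("2005-01-31", "2005-03-31")),
  stringsAsFactors = FALSE
)

# Hypothetical helper: the n month-ends after a month-end start date.
# Step to the 1st of the next month, sequence by month, then step back one day.
next_month_ends <- function(start, n = 12) {
  firsts <- seq(start + 1, by = "month", length.out = n + 1)
  firsts[-1] - 1
}

# One row per identifier-month pair that we want to look up
wanted <- do.call(rbind, lapply(seq_len(nrow(model)), function(i) {
  data.frame(identifier = model$identifier[i],
             month = next_month_ends(model$start[i]),
             stringsAsFactors = FALSE)
}))

# Toy "futurevalues" table covering 2005-01-31 to 2006-06-30 (placeholder values)
month_ends   <- seq(as.Date("2005-02-01"), by = "month", length.out = 18) - 1
futurevalues <- data.frame(
  identifier    = rep(model$identifier, each = length(month_ends)),
  month         = rep(month_ends, times = nrow(model)),
  futurereturns = seq_len(2 * length(month_ends)) / 100,
  stringsAsFactors = FALSE
)

# Inner join keeps only the requested identifier-month pairs
result <- merge(wanted, futurevalues, by = c("identifier", "month"))
print(head(result))
```

Going through the first of the month avoids the problem that adding a month to the 31st does not always land on a valid date.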

How can i convert a dataframe with a factor column to a xts object?

I have a csv file and when I use this command
SOLK<-read.table('Book1.csv',header=TRUE,sep=';')
I get this output
> SOLK
Time Close Volume
1 10:27:03,6 0,99 1000
2 10:32:58,4 0,98 100
3 10:34:16,9 0,98 600
4 10:35:46,0 0,97 500
5 10:35:50,6 0,96 50
6 10:35:50,6 0,96 1000
7 10:36:10,3 0,95 40
8 10:36:10,3 0,95 100
9 10:36:10,4 0,95 500
10 10:36:10,4 0,95 100
. . . .
. . . .
. . . .
285 17:09:44,0 0,96 404
Here is the result of dput(SOLK[1:10,]):
> dput(SOLK[1:10,])
structure(list(Time = structure(c(1L, 2L, 3L, 4L, 5L, 5L, 6L,
6L, 7L, 7L), .Label = c("10:27:03,6", "10:32:58,4", "10:34:16,9",
"10:35:46,0", "10:35:50,6", "10:36:10,3", "10:36:10,4", "10:36:30,8",
"10:37:23,3", "10:37:38,2", "10:37:39,3", "10:37:45,9", "10:39:07,5",
"10:39:07,6", "10:39:46,6", "10:41:21,8", "10:43:20,6", "10:43:36,4",
"10:43:48,8", "10:43:48,9", "10:43:54,6", "10:44:01,5", "10:44:08,4",
"10:45:47,2", "10:46:16,7", "10:47:03,6", "10:47:48,6", "10:47:55,0",
"10:48:09,9", "10:48:30,6", "10:49:20,6", "10:50:31,9", "10:50:34,6",
"10:50:38,1", "10:51:02,8", "10:51:11,5", "10:55:57,7", "10:57:57,2",
"10:59:06,9", "10:59:33,5", "11:00:31,0", "11:00:31,1", "11:04:46,4",
"11:04:53,4", "11:04:54,6", "11:04:56,1", "11:04:58,9", "11:05:02,0",
"11:05:02,6", "11:05:24,7", "11:05:56,7", "11:06:15,8", "11:13:24,1",
"11:13:24,2", "11:13:32,1", "11:13:36,2", "11:13:37,2", "11:13:44,5",
"11:13:46,8", "11:14:12,7", "11:14:19,4", "11:14:19,8", "11:14:21,2",
"11:14:38,7", "11:14:44,0", "11:14:44,5", "11:15:10,5", "11:15:10,6",
"11:15:12,9", "11:15:16,6", "11:15:23,3", "11:15:31,4", "11:15:36,4",
"11:15:37,4", "11:15:49,5", "11:16:01,4", "11:16:06,0", "11:17:56,2",
"11:19:08,1", "11:20:17,2", "11:26:39,4", "11:26:53,2", "11:27:39,5",
"11:28:33,0", "11:30:42,3", "11:31:00,7", "11:33:44,2", "11:39:56,1",
"11:40:07,3", "11:41:02,1", "11:41:30,1", "11:45:07,0", "11:45:26,6",
"11:49:50,8", "11:59:58,1", "12:03:49,9", "12:04:12,6", "12:06:05,8",
"12:06:49,2", "12:07:56,0", "12:09:37,7", "12:14:25,5", "12:14:32,1",
"12:15:42,1", "12:15:55,2", "12:16:36,9", "12:16:44,2", "12:18:00,3",
"12:18:12,8", "12:28:17,8", "12:28:17,9", "12:28:23,7", "12:28:51,1",
"12:36:33,2", "12:37:45,0", "12:39:22,2", "12:40:19,5", "12:42:22,1",
"12:58:46,3", "13:06:05,8", "13:06:05,9", "13:07:17,6", "13:07:17,7",
"13:09:01,3", "13:09:01,4", "13:09:11,3", "13:09:31,0", "13:10:07,8",
"13:35:43,8", "13:38:27,7", "14:11:16,0", "14:17:31,5", "14:26:13,9",
"14:36:11,8", "14:38:43,7", "14:38:47,8", "14:38:51,8", "14:48:26,7",
"14:52:07,4", "14:52:13,8", "15:09:24,7", "15:10:25,8", "15:29:12,1",
"15:31:55,9", "15:34:04,1", "15:44:10,8", "15:45:07,1", "15:57:04,9",
"15:57:13,9", "16:16:27,9", "16:21:41,7", "16:36:01,5", "16:36:13,2",
"16:46:10,5", "16:46:10,6", "16:47:37,3", "16:50:52,4", "16:50:52,5",
"16:51:44,5", "16:55:11,5", "16:56:21,8", "16:56:37,5", "16:57:37,9",
"16:58:18,6", "16:58:44,5", "17:00:39,1", "17:01:50,7", "17:03:13,2",
"17:03:28,3", "17:03:46,7", "17:03:47,0", "17:04:30,4", "17:08:41,8",
"17:09:44,0"), class = "factor"), Close = structure(c(8L, 7L,
7L, 6L, 5L, 5L, 4L, 4L, 4L, 4L), .Label = c("0,92", "0,93", "0,94",
"0,95", "0,96", "0,97", "0,98", "0,99"), class = "factor"), Volume = c(1000L,
100L, 600L, 500L, 50L, 1000L, 40L, 100L, 500L, 100L)), .Names = c("Time",
"Close", "Volume"), row.names = c(NA, 10L), class = "data.frame")
The first column includes the time stamp of every transaction during a stock's exchange daily session. I would like to convert the Close and Volume columns to an xts object ordered by the Time column.
UPDATE: From your edits, it appears you imported your data using two different commands. It also appears you should be using read.csv2. I've updated my answer with Lines that (I assume) look more like your original CSV (I have to guess because you don't say what the file looks like). The rest of the answer doesn't change.
You have to add a date to your times because xts stores all index values internally as POSIXct (I just used today's date).
I had to convert the "," decimal notation to the "." convention (using gsub), but that may be locale-dependent and you may not need to. paste today's date with the (possibly converted) time and then convert it to POSIXct to create an index suitable for xts.
I've also formatted the index so you can see the fractional seconds.
Lines <- "Time;Close;Volume
10:27:03,6;0,99;1000
10:32:58,4;0,98;100
10:34:16,9;0,98;600
10:35:46,0;0,97;500
10:35:50,6;0,96;50
10:35:50,6;0,96;1000
10:36:10,3;0,95;40
10:36:10,3;0,95;100
10:36:10,4;0,95;500
10:36:10,4;0,95;100"
SOLK <- read.csv2(con <- textConnection(Lines))
close(con)
solk <- xts(SOLK[,c("Close","Volume")],
as.POSIXct(paste("2011-09-02", gsub(",",".",SOLK[,1]))))
indexFormat(solk) <- "%Y-%m-%d %H:%M:%OS6"
solk
# Close Volume
# 2011-09-02 10:27:03.599999 0.99 1000
# 2011-09-02 10:32:58.400000 0.98 100
# 2011-09-02 10:34:16.900000 0.98 600
# 2011-09-02 10:35:46.000000 0.97 500
# 2011-09-02 10:35:50.599999 0.96 50
# 2011-09-02 10:35:50.599999 0.96 1000
# 2011-09-02 10:36:10.299999 0.95 40
# 2011-09-02 10:36:10.299999 0.95 100
# 2011-09-02 10:36:10.400000 0.95 500
# 2011-09-02 10:36:10.400000 0.95 100
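For anyone who wants to check the parsing step on its own before involving xts, here is a small base R sketch. It reads the first three rows with the text argument of read.csv2 instead of a textConnection and builds the POSIXct index with an explicit format. The names SOLK_demo, stamp and idx, and the choice of UTC as the time zone, are assumptions for illustration.

```r
Lines <- "Time;Close;Volume
10:27:03,6;0,99;1000
10:32:58,4;0,98;100
10:34:16,9;0,98;600"

# read.csv2 defaults to sep = ";" and dec = ",", so Close is parsed as numeric
SOLK_demo <- read.csv2(text = Lines, stringsAsFactors = FALSE)

# Swap the decimal comma in the time stamp, then prepend a date
stamp <- paste("2011-09-02", sub(",", ".", SOLK_demo$Time, fixed = TRUE))

# %OS reads seconds including the fractional part
idx <- as.POSIXct(stamp, format = "%Y-%m-%d %H:%M:%OS", tz = "UTC")

# Gaps between consecutive trades, in seconds
print(diff(as.numeric(idx)))
```

The resulting idx can then be passed as the order.by argument when building the xts object.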
That's an odd structure. Translating it to dput syntax
SOLK <- structure(list(structure(c(1L, 2L, 3L, 4L, 5L, 5L, 6L, 6L, 7L,
7L), .Label = c("10:27:03,6", "10:32:58,4", "10:34:16,9", "10:35:46,0",
"10:35:50,6", "10:36:10,3", "10:36:10,4"), class = "factor"),
Close = c(0.99, 0.98, 0.98, 0.97, 0.96, 0.96, 0.95, 0.95,
0.95, 0.95), Volume = c(1000L, 100L, 600L, 500L, 50L, 1000L,
40L, 100L, 500L, 100L)), .Names = c("", "Close", "Volume"
), class = "data.frame", row.names = c("1", "2", "3", "4", "5",
"6", "7", "8", "9", "10"))
I'm assuming the comma in the timestamp is the decimal separator.
library("chron")
time.idx <- times(gsub(",",".",as.character(SOLK[[1]])))
Unfortunately, it seems xts won't take this as a valid order.by; so a date (today, for lack of a better choice) must be included to make xts happy.
xts(SOLK[[2]], order.by=chron(Sys.Date(), time.idx))
