I am quite new to R, the TTR and xts packages, and the random.forest.importance function.
I am extracting trading data into an xts object, calculating whether the difference between Close and Open is positive, negative or flat, applying a handful of technical indicators from TTR, and then combining the indicators in a data frame to pass to random.forest.importance.
When I run the code, I get the following error:
Error in model.frame.default(formula, data, na.action = NULL) : variable lengths differ (found for 'Close')
Data:
Date Time Open High Low Close TVolume
2017-10-12 14:00:00 1.18462 1.18487 1.18334 1.18347 1165
2017-10-12 15:00:00 1.18351 1.18377 1.18295 1.18347 884
2017-10-12 16:00:00 1.18348 1.18348 1.18265 1.18276 1000
2017-10-12 17:00:00 1.18245 1.18329 1.18242 1.18303 1184
2017-10-12 18:00:00 1.18305 1.18373 1.18284 1.18343 469
2017-10-12 19:00:00 1.18343 1.18343 1.18247 1.18303 886
Code as follows:
pkgs <- c('class', 'gmodels', 'quantmod', 'TTR','xts','corrplot','caret','FSelector')
z <- head(tail(hist_r, samples+retro), samples)
z <- as.xts(z[,2:6], order.by=as.POSIXct(z$Timestamp, origin='1970-01-01 00:00', tz='UTC'))
hist <- getHist(z)
h <- as.xts(hist)
price <- z$Close-z$Open
class <- ifelse(price > 0, "UP", ifelse(price < 0, "DOWN", "FLAT"))
forceindex <- (z$Close-z$Open) * z$TVolume
WillR5 <- WPR(z[, c("High", "Low", "Close")], n = 5)
dataset = data.frame(class,forceindex,WillR5)
dataset = na.omit(dataset)
dput(head(dataset, 10))
set.seed(5)
weights <- random.forest.importance(class~., dataset, importance.type = 1)
print(weights)
When I run dput, I get the following:
structure(list(Close = structure(c(1L, 3L, 1L, 1L, 1L, 3L, 1L, 3L, 1L, 3L), .Label = c("DOWN", "FLAT", "UP"), class = "factor"), Close.1 = c(-12.9382400000007, 0.107400000000227, -1.66915000000001, -0.290530000000006, -1.18667999999979, 0.0752800000000753, -0.244080000000094, 0.0653999999999928, -0.395999999999996, 0.372089999999928)
Would sincerely appreciate any help that anyone can give me
Many thanks in advance
Kkel
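One possible cause, offered as an assumption since the dput output above is truncated: the data frame has columns named Close and Close.1 but no column called class, so the formula class~. may pick up the full-length class object from the workspace while the other columns have already been shortened by na.omit. A minimal base-R sketch of naming the columns explicitly is below. The vectors open_px, close_px and volume are hypothetical stand-ins for the xts columns, and model.frame stands in for the call that random.forest.importance makes internally.

```r
# Plain numeric vectors stand in for z$Open, z$Close and z$TVolume
# (hypothetical names; values copied from the sample rows in the question).
open_px  <- c(1.18462, 1.18351, 1.18348, 1.18245, 1.18305, 1.18343)
close_px <- c(1.18347, 1.18347, 1.18276, 1.18303, 1.18343, 1.18303)
volume   <- c(1165, 884, 1000, 1184, 469, 886)

price <- close_px - open_px

# Name every column explicitly so the formula finds 'class' inside the
# data frame instead of a same-named object in the workspace.
dataset <- data.frame(
  class      = factor(ifelse(price > 0, "UP", ifelse(price < 0, "DOWN", "FLAT"))),
  forceindex = price * volume
)
dataset <- na.omit(dataset)

# model.frame() is the step that raised the error; here it succeeds
# because the response and the predictors have the same length.
mf <- model.frame(class ~ ., data = dataset)
print(names(dataset))
```

With real xts objects, wrapping each column in as.numeric() before building the data frame should drop the shared "Close" column name in the same way.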
Noob here, I'm stuck trying to use S3 to summarise proportion data for a data.frame where there are four columns of character data. My goal is to build a summary method to show the proportions for every level of every variable at one time.
I can see how to get the proportions for each column:
a50survey1 <- table(Student1995$alcohol)
a50survey2 <- table(Student1995$drugs)
a50survey3 <- table(Student1995$smoke)
a50survey4 <- table(Student1995$sport)
prop.table(a50survey1)
                  Not  Once or Twice a week          Once a month           Once a week More than once a week
                 0.10                  0.32                  0.24                  0.28                  0.06
But I cannot find a way to combine all of the prop.table outputs into one summary output.
Unless I'm really wrong, I cannot find an S3 method like summary.prop.table which would work for me. The goal is to set this up for the current data frame and then drop in new data frames with the same columns and number of observations in the future.
I'm really a step by step guy and if you can help me, that would be great - thank you
Dataframe info here. There are four columns, where each column has a different number of categorical options for observations.
> dput(head(Student1995,5))
structure(list(alcohol = structure(c(3L, 2L, 2L, 2L, 3L), .Label = c("Not",
"Once or Twice a week", "Once a month", "Once a week", "More than once a week"
), class = "factor"), drugs = structure(c(1L, 2L, 1L, 1L, 1L), .Label = c("Not",
"Tried once", "Occasional", "Regular"), class = "factor"), smoke = structure(c(2L,
3L, 1L, 1L, 1L), .Label = c("Not", "Occasional", "Regular"), class = "factor"),
sport = structure(c(2L, 1L, 1L, 2L, 2L), .Label = c("Not regular",
"Regular"), class = "factor")), row.names = c(NA, 5L), class = "data.frame")
The Summary data if it helps - edit
> summary(Student1995)
                  alcohol          drugs          smoke           sport
 Not                  : 5   Not       :36   Not       :38   Not regular:13
 Once or Twice a week :16   Tried once: 6   Occasional: 5   Regular    :37
 Once a month         :12   Occasional: 7   Regular   : 7
 Once a week          :14   Regular   : 1
 More than once a week: 3
Maybe this is what you wanted. Values in each category sum up to 100%.
lis <- sapply( Student1995, function(x) t( sapply( x, table ) ) )
sapply( lis, function(x) colSums(prop.table(x)) )
$alcohol
                  Not  Once.or.Twice.a.week          Once.a.month
                  0.0                   0.6                   0.4
          Once.a.week More.than.once.a.week
                  0.0                   0.0

$drugs
       Not Tried.once Occasional    Regular
       0.8        0.2        0.0        0.0

$smoke
       Not Occasional    Regular
       0.6        0.2        0.2

$sport
Not.regular     Regular
        0.4         0.6
and the whole summary...
prop.table( table(as.vector( sapply( Student1995, unlist ))) )
                 Not          Not regular           Occasional
                0.35                 0.10                 0.05
        Once a month Once or Twice a week              Regular
                0.10                 0.15                 0.20
          Tried once
                0.05
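A shorter route to the per-column proportions might be to apply prop.table(table(...)) to each column with lapply. The sketch below is only an illustration: survey is a hypothetical stand-in rebuilt from the dput(head(Student1995, 5)) rows in the question, and prop_summary is a made-up helper name, not an existing S3 method.

```r
# Hypothetical stand-in for Student1995, rebuilt from the five dput rows above.
survey <- data.frame(
  alcohol = factor(
    c("Once a month", "Once or Twice a week", "Once or Twice a week",
      "Once or Twice a week", "Once a month"),
    levels = c("Not", "Once or Twice a week", "Once a month",
               "Once a week", "More than once a week")),
  drugs = factor(
    c("Not", "Tried once", "Not", "Not", "Not"),
    levels = c("Not", "Tried once", "Occasional", "Regular")),
  smoke = factor(
    c("Occasional", "Regular", "Not", "Not", "Not"),
    levels = c("Not", "Occasional", "Regular")),
  sport = factor(
    c("Regular", "Not regular", "Not regular", "Regular", "Regular"),
    levels = c("Not regular", "Regular"))
)

# prop_summary is a hypothetical helper: one proportion table per column,
# returned as a named list so every variable is shown at once.
prop_summary <- function(df) {
  lapply(df, function(col) prop.table(table(col)))
}

props <- prop_summary(survey)
print(props)
```

Because the helper takes any data frame of factors, a new survey with the same columns can be passed in unchanged.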
I am trying to forecast multiple time series data that are present in a single data frame.
The dataframe df looks like below. The dput(df) is given below as well to reproduce quickly.
Date Group Value
01-04-2019 Saffron 62.78
01-04-2019 Green 75.65
01-05-2019 Saffron 67.89
01-06-2019 Saffron 54.56
01-06-2019 Green 77.00
01-07-2019 Green 71.22
structure(list(Date = structure(c(1L, 1L, 2L, 3L, 3L, 4L), .Label = c("01-04-2019", "01-05-2019", "01-06-2019", "01-07-2019"), class = "factor"),
Group = structure(c(2L, 1L, 2L, 2L, 1L, 1L), .Label = c("Green",
"Saffron"), class = "factor"), Value = c(62.78, 75.65, 67.89,
54.56, 77, 71.22)), .Names = c("Date", "Group", "Value"), class = "data.frame", row.names = c(NA, -6L))
Objective: I want to forecast for each Group using forecast package.
So my approach was as follows:
col_name_date <- "Date"
col_name_measure <- "Value"
col_name_sku_depo <- "Group"
dates_to_forecast <- 3
for (v in unique(as.character(df$Group))) {
temp <- subset(df, Group == v)
assign(paste0("df_",tolower(v)),temp)
temp <- temp [order(temp[, col_name_date]), ]
start_date <- as.Date(temp[1, col_name_date], date_format) #< ---library(lubridate)
ts_historic <- ts(temp[, col_name_measure],
start = c(year(start_date), month(start_date)),
frequency = 12)
# ---- Forecasting process using the forecast package, omitted as it is out of scope ----
forecast_mean <- rep(NA, dates_to_forecast)
forecast_mean <- ts_forecast$mean
forecast_upper <- ts_forecast$upper
forecast_lower <- ts_forecast$lower
dates_all_mean <- as.numeric(c(as.numeric(ts_historic), as.numeric(forecast_mean)))
dates_all_lower <- as.numeric(c(rep(NA, length(ts_historic)), as.numeric(forecast_lower)))
dates_all_upper <- as.numeric(c(rep(NA, length(ts_historic)), as.numeric(forecast_upper)))
result <- data.frame(
MONTH = dates_all,
MEASURETYPE = date_types,
GROUP = v,
MEASURE = dates_all_mean,
MEASURELOWER = dates_all_lower,
MEASUREUPPER = dates_all_upper,
MODEL = model_descr)
}
The above code works fine for a single Group i.e. Saffron. But this doesn't produce the result for Green group.
I am looking for the following output:
MONTH      MEASURETYPE GROUP   MEASURE MEASUREUPPER MEASURELOWER MODEL
01-04-2019 Actual      Saffron 62.78   NA           NA           Test
01-05-2019 Actual      Saffron 67.89   NA           NA           Test
01-06-2019 Actual      Saffron 54.56   NA           NA           Test
01-07-2019 Forecast    Saffron 55.35   56.15        54.23        Test
01-08-2019 Forecast    Saffron 57.29   58.15        56.39        Test
01-04-2019 Actual      Green   75.65   NA           NA           Test
01-05-2019 Actual      Green   77.00   NA           NA           Test
01-06-2019 Actual      Green   71.22   NA           NA           Test
01-07-2019 Forecast    Green   76.35   77.15        75.23        Test
01-08-2019 Forecast    Green   73.29   74.29        72.30        Test
As you can see from the code, I am able to generate the above output only for Saffron.
How can I also add Green as shown in the above output?
What am I missing in the for loop?
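One likely explanation is that result is overwritten on every pass of the loop, so only one group survives. A common pattern is to store each group's data frame in a list and bind the pieces together after the loop. The sketch below assumes the dates are day-month-year, and it replaces the forecast package step with a naive stand-in (repeating the historic mean) purely so that it runs on its own; the names results and future_dates are introduced here for illustration.

```r
df <- data.frame(
  Date  = c("01-04-2019", "01-04-2019", "01-05-2019",
            "01-06-2019", "01-06-2019", "01-07-2019"),
  Group = c("Saffron", "Green", "Saffron", "Saffron", "Green", "Green"),
  Value = c(62.78, 75.65, 67.89, 54.56, 77.00, 71.22),
  stringsAsFactors = FALSE
)
dates_to_forecast <- 2

results <- list()  # one data frame per group is collected here
for (v in unique(df$Group)) {
  temp <- df[df$Group == v, ]
  temp$Date <- as.Date(temp$Date, format = "%d-%m-%Y")  # assumed day-month-year
  temp <- temp[order(temp$Date), ]

  # Naive stand-in for the omitted forecast step: repeat the historic mean.
  forecast_mean <- rep(mean(temp$Value), dates_to_forecast)
  future_dates <- seq(max(temp$Date), by = "month",
                      length.out = dates_to_forecast + 1)[-1]

  # Store the result under the group name instead of overwriting one object.
  results[[v]] <- data.frame(
    MONTH       = c(temp$Date, future_dates),
    MEASURETYPE = rep(c("Actual", "Forecast"),
                      times = c(nrow(temp), dates_to_forecast)),
    GROUP       = v,
    MEASURE     = c(temp$Value, forecast_mean)
  )
}

# Stack all groups into a single data frame.
result <- do.call(rbind, results)
rownames(result) <- NULL
print(result)
```

The upper and lower bounds and the model description can be added to each per-group data frame in the same way before the final rbind.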
So I am fairly new to R and I am having a bit of trouble getting the hang of it. What I am trying to do is to sort my data into decades so that I can analyze the mean value for each decade. So far this is what I have tried:
fred$decade = cut(as.numeric(format(fred$DATE, "%Y")),breaks=seq(1940, 2020, 10))
Error in format.default(structure(as.character(x), names = names(x),
dim = dim(x), :
invalid 'trim' argument
Here is part of the data I am using: I am looking at CPI data since 1948 for every month until 9/1/2016. I want to get the mean CPI of each decade since then:
DATE CPI
8/1/49 23.7
9/1/49 23.75
10/1/49 23.67
11/1/49 23.7
12/1/49 23.61
1/1/50 23.51
2/1/50 23.61
3/1/50 23.64
4/1/50 23.65
5/1/50 23.77
6/1/50 23.88
7/1/50 24.07
8/1/50 24.2
When I use this I always get an error message. I cannot seem to figure out what I am doing wrong. I went through my data to make sure it was fine. Thanks for your help!
Considering dput(stsample) as
structure(list(Date = structure(c(8L, 10L, 11L, 12L, 13L, 1L,
2L, 3L, 4L, 5L, 6L, 7L, 9L), .Label = c("01-01-1950", "02-01-1950",
"03-01-1950", "04-01-1950", "05-01-1950", "06-01-1950", "07-01-1950",
"08-01-1949", "08-01-1950", "09-01-1949", "10-01-1949", "11-01-1949",
"12-01-1949"), class = "factor"), CPI = c(23.7, 23.75, 23.67,
23.7, 23.61, 23.51, 23.61, 23.64, 23.65, 23.77, 23.88, 24.07,
24.2)), .Names = c("Date", "CPI"), class = "data.frame", row.names = c(NA,
-13L))
you can try something like
# the dates are month-day-year (monthly data, day is always 01)
stsample$Date <- as.Date(stsample$Date, "%m-%d-%Y")
stsample$year<-as.numeric(format(stsample$Date, "%Y"))
stsample$decade = cut(stsample$year, seq(from = 1940, to = 2020, by = 10))
Note that the breaks work only on the year part of the date and not the whole object. If you have datetime objects, it might be worth looking into
cut.POSIXt
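To get from the decade column to the mean per decade, something like the sketch below may help. It uses a hypothetical four-row subset of the CPI data. The arguments right = FALSE and dig.lab = 4 are choices made here, not part of the answer above: with the default right = TRUE, the year 1950 would fall into the (1940,1950] bin, and dig.lab = 4 keeps the labels from being printed in scientific notation.

```r
# Hypothetical four-row subset of the CPI data, dates as month-day-year.
stsample <- data.frame(
  Date = c("08-01-1949", "12-01-1949", "01-01-1950", "08-01-1950"),
  CPI  = c(23.70, 23.61, 23.51, 24.20)
)

stsample$Date <- as.Date(stsample$Date, format = "%m-%d-%Y")
stsample$year <- as.numeric(format(stsample$Date, "%Y"))

# Left-closed bins, so 1950 belongs to [1950,1960) rather than (1940,1950].
stsample$decade <- cut(stsample$year,
                       breaks = seq(from = 1940, to = 2020, by = 10),
                       right = FALSE, dig.lab = 4)

# Mean CPI for each decade that actually occurs in the data.
decade_means <- aggregate(CPI ~ decade, data = stsample, FUN = mean)
print(decade_means)
```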
You can try this too (output shown with some randomly generated data):
# assuming 40-49 is the decade 40s
fred$DECADE <- 10*as.integer(as.numeric(substring(as.character(fred$DATE), 7, 8)) / 10)
head(fred)
      DATE      CPI DECADE
1 08/01/49 23.41955     40
2 09/01/49 26.99772     40
3 10/02/49 29.53724     40
4 11/02/49 19.84247     40
5 12/03/49 26.75672     40
6 01/03/50 30.97788     50
# mean value for each DECADE
aggregate(CPI~DECADE, data=fred, FUN=mean)
  DECADE      CPI
1     40 25.31074
2     50 25.27004
3     60 24.72269
I have a 2 dimensional data set (matrix/data frame) that looks like this
                779      482      859     1156
maxs       56916.00 78968.00 51156.00 44827.01
Means+Stdv 41784.70 64440.83 38319.10 42767.14
Mean_Cost  31863.18 44407.40 29365.78 38711.29
Means_Stdv 21941.66 24373.97 20412.45 34655.43
mins       21088.00 13768.00 24132.00 31452.00
The values 779, 482, 859 and 1156 are what I want on the x-axis.
The remaining values in each column are the y values that correspond to that x.
Now I want to plot the entire data set, so that I have a graph with the following points
(779,56916) , (779, 41784)......
(482,78968) , (482, 64440)..... and so on
The way I did it so far is like this (it gives me the plot I am looking for)
plot(colnames(resultsSummary),resultsSummary[1,],ylim=c(0,80000),pch=6)
points(colnames(resultsSummary),resultsSummary[2,],pch=3)
points(colnames(resultsSummary),resultsSummary[3,])
and so on..... plotting row by row
I am sure there is a better way to do it, but I don't know how. Any suggestions?
DF <- read.table(text=" 779 482 859 1156
maxs 56916.00 78968.00 51156.00 44827.01
Means+Stdv 41784.70 64440.83 38319.10 42767.14
Mean_Cost 31863.18 44407.40 29365.78 38711.29
Means_Stdv 21941.66 24373.97 20412.45 34655.43
mins 21088.00 13768.00 24132.00 31452.00",
header=TRUE, check.names=FALSE)
m <- as.matrix(DF)
# t(m) has one column per row of m, so use one plotting symbol per row
matplot(as.integer(colnames(m)),
t(m), pch=seq_len(nrow(m)))
Following also works:
ddf = structure(list(var = structure(c(1L, 4L, 2L, 3L, 5L), .Label = c("maxs",
"Mean_Cost", "Means_Stdv", "Means+Stdv", "mins"), class = "factor"),
X779 = c(56916, 41784.7, 31863.18, 21941.66, 21088), X482 = c(78968,
64440.83, 44407.4, 24373.97, 13768), X859 = c(51156, 38319.1,
29365.78, 20412.45, 24132), X1156 = c(44827.01, 42767.14,
38711.29, 34655.43, 31452)), .Names = c("var", "X779", "X482",
"X859", "X1156"), class = "data.frame", row.names = c(NA, -5L
))
ddf
var X779 X482 X859 X1156
1 maxs 56916.00 78968.00 51156.00 44827.01
2 Means+Stdv 41784.70 64440.83 38319.10 42767.14
3 Mean_Cost 31863.18 44407.40 29365.78 38711.29
4 Means_Stdv 21941.66 24373.97 20412.45 34655.43
5 mins 21088.00 13768.00 24132.00 31452.00
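library(reshape2)  # melt() is assumed to come from reshape2
library(ggplot2)
# recover the x values from the column names by dropping the leading "X"
ddf[6,2:5]=as.numeric(sub("^X", "", names(ddf)[2:5]))
ddf2 = data.frame(t(ddf))
ddf2 = ddf2[-1,]
mm = melt(ddf2, id='X6')
ggplot(mm)+geom_point(aes(x=X6, y=value, color=variable))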
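If a long table of (x, y) pairs is all that is needed, it can also be built in base R without reshaping packages. This is a minimal sketch under the assumption that the data sit in a numeric matrix whose column names are the x values; the names long and stat are introduced here for illustration.

```r
m <- matrix(
  c(56916.00, 41784.70, 31863.18, 21941.66, 21088.00,
    78968.00, 64440.83, 44407.40, 24373.97, 13768.00,
    51156.00, 38319.10, 29365.78, 20412.45, 24132.00,
    44827.01, 42767.14, 38711.29, 34655.43, 31452.00),
  nrow = 5,
  dimnames = list(
    c("maxs", "Means+Stdv", "Mean_Cost", "Means_Stdv", "mins"),
    c("779", "482", "859", "1156"))
)

# as.vector() reads the matrix column by column, so each x value is
# repeated once per row and the row names cycle within each column.
long <- data.frame(
  x    = rep(as.integer(colnames(m)), each = nrow(m)),
  stat = rep(rownames(m), times = ncol(m)),
  y    = as.vector(m)
)
print(head(long))

# One call then draws every point, with a symbol per statistic:
# plot(long$x, long$y, pch = as.integer(factor(long$stat)))
```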
I am working with bluetooth sensor data and need to identify possible duplicate readings for each unique ID. The bluetooth sensor made a scan every five seconds, and may pick up the same device in subsequent readings if the device wasn't moving quickly (i.e. sitting in traffic). There may be multiple readings from the same device if that device made a round trip, but those should be separated by several minutes. I can't wrap my head around how to get rid of the duplicate data. I need to calculate a time difference column if the macid's match.
The data has the format:
macid time
00:03:7A:4D:F3:59 82333
00:03:7A:EF:58:6F 223556
00:03:7A:EF:58:6F 223601
00:03:7A:EF:58:6F 232731
00:03:7A:EF:58:6F 232736
00:05:4F:0B:45:F7 164141
And I need to create:
macid time timediff
00:03:7A:4D:F3:59 82333 NA
00:03:7A:EF:58:6F 223556 NA
00:03:7A:EF:58:6F 223601 45
00:03:7A:EF:58:6F 232731 9130
00:03:7A:EF:58:6F 232736 5
00:05:4F:0B:45:F7 164141 NA
My first attempt at this is extremely slow and not really usable:
dedupeIDs <- function (zz) {
#Order by macid and then time
zz <- zz[order(zz$macid, zz$time) ,]
zz$timediff <- c(999999, diff(zz$time))
for (i in 2:nrow(zz)) {
# a new macid starts here, so there is no previous reading to compare with
if (zz[i, "macid"] != zz[i - 1, "macid"]) {
zz[i, "timediff"] <- 999999
}
}
return(zz)
}
I'll then be able to filter the data.frame based on the time difference column.
Sample data:
structure(list(macid = structure(c(1L, 2L, 2L, 2L, 2L, 3L),
.Label = c("00:03:7A:4D:F3:59", "00:03:7A:EF:58:6F",
"00:05:4F:0B:45:F7"), class = "factor"),
time = c(82333, 223556, 223601, 232731, 232736, 164141)),
.Names = c("macid", "time"), row.names = c(NA, -6L),
class = "data.frame")
How about:
x <- structure(list(macid= structure(c(1L, 2L, 2L, 2L, 2L, 3L),
.Label = c("00:03:7A:4D:F3:59", "00:03:7A:EF:58:6F", "00:05:4F:0B:45:F7"),
class = "factor"), time = c(82333, 223556, 223601, 232731, 232736, 164141)),
.Names = c("macid", "time"), row.names = c(NA, -6L), class = "data.frame")
# ensure 'x' is ordered properly
x <- x[order(x$macid,x$time),]
# add timediff column by macid
x$timediff <- ave(x$time, x$macid, FUN=function(x) c(NA,diff(x)))
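Once the timediff column exists, the repeat readings can be dropped with a logical filter. The sketch below is an illustration only: the cutoff of 60 is a hypothetical value, and time is treated as a plain number exactly as in the code above. If the column actually encodes HHMMSS, it would need converting to seconds first for the differences to be meaningful.

```r
x <- data.frame(
  macid = c("00:03:7A:4D:F3:59", rep("00:03:7A:EF:58:6F", 4),
            "00:05:4F:0B:45:F7"),
  time  = c(82333, 223556, 223601, 232731, 232736, 164141)
)

# Order by device and time, then difference within each device.
x <- x[order(x$macid, x$time), ]
x$timediff <- ave(x$time, x$macid, FUN = function(t) c(NA, diff(t)))

# Hypothetical cutoff: readings closer together than this count as repeats.
threshold <- 60

# Keep the first reading of each device (NA) and any reading that follows
# a gap larger than the cutoff.
deduped <- x[is.na(x$timediff) | x$timediff > threshold, ]
print(deduped)
```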