I am confused about what to scale for Keras in R.
I have a univariate time series vector data = [1, 2, 3, ..., 1000] which I want an LSTM to predict. I split this vector into train = [1, 2, ..., 997] and test = [998, 999, 1000] vectors. Then I took the train vector and created two sliding-window matrices for training.
X_train, y_train
[1, 2, 3] [4, 5, 6]
[2, 3, 4] [5, 6, 7]
[3, 4, 5] [6, 7, 8]
[4, 5, 6] [7, 8, 9]
[5, 6, 7] [8, 9, 10]
... ...
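For reference, the windows above could be built roughly as follows. This is only a sketch using base R's embed(); the names n_in, n_out and windows are illustrative, and the window widths of 3 are taken from the example rows.

```r
data  <- 1:1000
train <- data[1:997]

n_in  <- 3  # inputs per window (assumed from the example rows)
n_out <- 3  # targets per window (assumed from the example rows)

# embed() returns each window in reverse time order, so flip the columns
windows <- embed(train, n_in + n_out)[, (n_in + n_out):1]

X_train <- windows[, 1:n_in]
y_train <- windows[, (n_in + 1):(n_in + n_out)]

head(X_train, 3)
head(y_train, 3)
```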
I am confused about when to scale. Should I scale my initial data = [1, 2, 3, ..., 1000] vector, or should I scale the train = [1, 2, ..., 997] and test = [998, 999, 1000] vectors separately? Is there any difference between these two approaches?
I want to try two sorts of scaling: between -1 and 1, and between 0 and 1.
EDIT
My real data lies between -9 and +2
Well, scaling depends on the dataset. You should also consider whether you want your supervised model to predict new data one input at a time or in batches. A general approach is to compute the scaling "coefficients" (say, min and max, or the mean and standard deviation for a z-score) on the training set and apply them to the test set, because that makes the test data look like the training data, which is what your model was trained on.
So, to be direct: yes, there is a difference between scaling everything and then doing the train/test split versus splitting first and then scaling each part separately. I suggest you read this question from stats exchange.
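As a rough illustration of reusing the training-set coefficients, here is a minimal min-max sketch for both ranges mentioned in the question. The helper names scale01, scale_pm1 and unscale01 are made up for this example and are not part of Keras or any package.

```r
data  <- 1:1000
train <- data[1:997]
test  <- data[998:1000]

# Coefficients come from the training set only
train_min <- min(train)
train_max <- max(train)

# Hypothetical helpers: map to [0, 1] or [-1, 1] given a min and max
scale01   <- function(x, lo, hi) (x - lo) / (hi - lo)
scale_pm1 <- function(x, lo, hi) 2 * (x - lo) / (hi - lo) - 1
# Inverse of scale01, for bringing predictions back to the original units
unscale01 <- function(z, lo, hi) z * (hi - lo) + lo

train_01 <- scale01(train, train_min, train_max)
test_01  <- scale01(test,  train_min, train_max)  # same coefficients

train_pm1 <- scale_pm1(train, train_min, train_max)
test_pm1  <- scale_pm1(test,  train_min, train_max)

range(train_01)
test_01
```

Note that test values outside the training range end up slightly outside [0, 1] (or [-1, 1]); that is expected when the coefficients come from the training set alone.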
Related
I am trying to build a model of points in space, where each point has constraints with other points (meaning that if points a and b have a constraint of 5, then the distance between them must be exactly 5).
Here is a basic model, where green marks the nodes and red marks the constraints.
I need to find the x1,y1,x2,y2,x3,y3.
The model receives a matrix of constraints.
In the case of the model above, the matrix will be:
[[ 0,  4, -1],
 [ 4,  0,  5],
 [-1,  5,  0]]
Now, when the model is simple, this is an easy task.
But when adding more constraints, as in this model,
we get the matrix:
[[ 0,  4, -1,  4],
 [ 4,  0,  5, -1],
 [-1,  5,  0,  5],
 [ 4, -1,  5,  0]]
Does anyone have an idea how to create this model when the input is a matrix of constraints?
I have a requirement where I have a set of numeric values, for example: 2, 4, 2, 5, 0
As we can see, the trend in this set of numbers is mixed, but since the latest number is 0, I would consider the value to be going DOWN. Is there any way to measure the trend (whether it is going up or down)?
Is there any R package available for that?
Thanks
Suppose your vector is c(2, 4, 2, 5, 0) and you want to know whether the last value is increasing, constant or decreasing relative to the one before it. Then you could use the diff function with a lag of 1 and look at the last difference. Below is an example.
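MyVec <- c(2, 4, 2, 5, 0)
# First differences: MyVec[i + 1] - MyVec[i]
Lagged_vec <- diff(MyVec, lag = 1)
# The sign of the last difference gives the most recent direction
last_diff <- Lagged_vec[length(Lagged_vec)]
if (last_diff < 0) {
  print("Decreasing")
} else if (last_diff == 0) {
  print("Constant")
} else {
  print("Increasing")
}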
Please let me know if this is what you wanted.
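If you need this for several vectors, the same idea can be wrapped in a small function. This is just a sketch; last_trend is a made-up name, and it only looks at the final step rather than the overall trend.

```r
# Hypothetical helper: label the direction of the last step in a numeric vector
last_trend <- function(x) {
  if (length(x) < 2) return(NA_character_)
  steps <- sign(diff(x))            # -1, 0 or 1 for each consecutive pair
  last_step <- steps[length(steps)]
  # Map -1 / 0 / 1 to positions 1 / 2 / 3
  c("Decreasing", "Constant", "Increasing")[last_step + 2]
}

last_trend(c(2, 4, 2, 5, 0))
last_trend(c(1, 2, 2))
last_trend(c(3, 1, 4))
```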
I am setting up R code for a Monte Carlo simulation, and I need a sample of one number drawn from a random distribution. To test how sample works in R, I wrote the code below, but I do not understand why the results differ.
x <- rnorm(1,8,0)
x
#8
y <-sample(x=rnorm(1,8,0), size=1)
y
#4
Quoting ?sample,
If x has length 1, is numeric (in the sense of is.numeric) and x >= 1,
sampling via sample takes place from 1:x.
so you're actually drawing from c(1, 2, 3, 4, 5, 6, 7, 8) and not from c(8).
However, it works if we draw from "character" class.
as.numeric(sample(as.character(rnorm(1,8,0)), size=1))
# [1] 8
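Another way to avoid the length-one special case, without converting to character, is to sample an index instead of the values. The sketch below follows the resample pattern shown in the Examples section of ?sample; the variable names are only for illustration.

```r
# Sample positions with sample.int(), then index into x,
# so a length-one x is never expanded to 1:x
resample <- function(x, ...) x[sample.int(length(x), ...)]

x <- rnorm(1, 8, 0)       # sd = 0, so x is exactly 8
y <- resample(x, size = 1)
y

# For comparison, plain sample() on a single number >= 1 draws from 1:x
z <- sample(8, size = 1)
z %in% 1:8
```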
Consider a matrix A and an array b. I would like to calculate the distance between b and each row of A. For instance consider the following data:
A <- matrix(c(1,2,3,4,5,6,7,8,9,10,11,12,13,14,15), 3, 5, byrow=TRUE)
b <- c(1, 2, 3, 4, 5)
I would expect as output some array of the form:
distance_array = c(0, 11.18, 22.36)
where the value 11.18 comes from the Euclidean distance between A[2,] and b:
sqrt(sum((A[2,]-b)^2))
This seems pretty basic, but so far all the R functions I have found compute distance matrices between all pairs of rows of a matrix, not this vector-to-matrix calculation.
I would recommend putting the rows of A in a list instead of a matrix, as it might allow for faster processing. But here's how I would do it with respect to your example:
A <- matrix(c(1,2,3,4,5,6,7,8,9,10,11,12,13,14,15), 3, 5, byrow=TRUE)
b <- c(1, 2, 3, 4, 5)
apply(A,1,function(x)sqrt(sum((x-b)^2)))
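A vectorised variant is also possible. The sketch below uses sweep() to subtract b from every row and rowSums() to add up the squared differences; the names dist_apply and dist_vec are only for illustration, and I have not benchmarked it against apply.

```r
A <- matrix(c(1,2,3,4,5,6,7,8,9,10,11,12,13,14,15), 3, 5, byrow = TRUE)
b <- c(1, 2, 3, 4, 5)

# Row-by-row version from above
dist_apply <- apply(A, 1, function(x) sqrt(sum((x - b)^2)))

# Vectorised version: sweep() subtracts b[j] from column j of A
dist_vec <- sqrt(rowSums(sweep(A, 2, b)^2))

round(dist_vec, 2)
all.equal(dist_apply, dist_vec)
```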
Consider the treering dataset.
library("datasets", lib.loc="C:/Program Files/R/R-3.3.1/library")
tr<-treering
length(tr)
[1] 7980
class(tr)
[1] "ts"
From my understanding, it is a time series of length 7980.
How can I find out what the time stamps are for each value?
After plotting the time series and looking at the x axis of the plot, it appears that the time stamps range from about -6000 to 2000. But to me the time stamps appear to be "hidden".
plot(tr)
More generally, I'm trying to understand what exactly a ts object is and what the benefits of using this type of object are.
A univariate or multivariate time series can easily be stored in a data frame with 2 or more columns: Time and the variables.
univariatetimeseries <- data.frame(Time = c(0, 1, 2, 3, 4, 5, 6), y = c(1, 2, 3, 4, 5, 6, 7))
multivariatetimeseries <- data.frame(Time = c(0,1,2,3,4,5,6), y = c(1, 2, 3, 4, 5, 6, 7), z = c(7,6,5,4,3,2,1))
This seems simple and straightforward to me, and it is consistent with the basic science examples that I learned in high school. Additionally, the time stamps are not "hidden" as they are in the treering example. So what are the benefits of using a ts object?
Objects of a class come with many methods and helper functions for convenience. For the "ts" class, for example, there are ts.plot, plot.ts, etc. If you store your time series as a data frame, you have to do a lot of the work yourself when plotting it.
For seasonal time series, the advantage of using "ts" is perhaps more evident. For example, x <- ts(rnorm(36), start = c(2000, 1), frequency = 12) generates a monthly time series covering 3 years. The print method will arrange it neatly like a matrix when you print x.
A "ts" object has a number of attributes. Model fitting routines like arima0 and arima can see these attributes, so you don't need to specify them manually.
For your question, there are a number of functions to extract / set attributes of a time series. Have a look at ?start, ?tsp, ?time, ?window.
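As a quick illustration with the treering data, here is a minimal sketch of how those functions expose the "hidden" time stamps. The names tr_df and recent are made up for this example.

```r
tr <- datasets::treering

tsp(tr)          # start, end and frequency in one vector
start(tr)        # first time stamp
end(tr)          # last time stamp
head(time(tr))   # time stamp of each observation

# Rebuild the explicit Time / value layout from the question
tr_df <- data.frame(Time = as.numeric(time(tr)), y = as.numeric(tr))
head(tr_df, 3)

# Subset by time rather than by position
recent <- window(tr, start = 1900)
length(recent)
```

So the time stamps are not stored value by value; they are implied by the start, end and frequency kept in the tsp attribute, and time() expands them on demand.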