R keras LSTM input shape

There are lots of answers on how to reshape data for Keras LSTM, but they are all about Python, not R.
Array transformation for KerasR LSTM in R
This answer shows the transformation method, but I still have a question: what if the number of features is 2?
This is my data.
It is temperature data with 290 rows and 122 columns. Each column is a time series for one station, and each row is the max temperature for one day. I want to predict the next day's max temp using the historical data, so the number of features is 122, but I do not know what the samples and timesteps are.

Based on my experience, you should reshape the data into a 3D array whose dimensions are samples x timesteps x features.
Originally I have an input matrix, X, with n columns (features) and r rows (observations, days). I apply a time-lag of m periods on each column of the matrix, so now I have n separate matrices (one for each feature) with the same r rows, but with m columns, corresponding to the number of time-lags I implemented. I squish all of these individual matrices together in the z-dimension, so I now have an array with r rows, m columns, and a depth of n.
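To make that concrete, here is a minimal sketch in R of the reshaping just described. It is not the exact code from the linked answer; the sizes and names (r, n, m, X) are placeholders chosen for illustration.

# r observations (days), n features (stations), m time-lags -- hypothetical sizes
r <- 290; n <- 122; m <- 7
X <- matrix(rnorm(r * n), nrow = r, ncol = n)   # stand-in for the temperature matrix

# target array: samples x timesteps x features
X.arr <- array(NA_real_, dim = c(r - m, m, n))
for (i in seq_len(r - m)) {
  # rows i .. i+m-1 are the m lagged values that feed the prediction for day i+m
  X.arr[i, , ] <- X[i:(i + m - 1), ]
}
y <- X[(m + 1):r, ]   # next-day max temps for all stations
dim(X.arr)            # (r - m) samples, m timesteps, n features

For the temperature question above, this would give 122 features, m timesteps, and roughly 290 - m samples.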
I actually had my own question about this and I will be posting my own query soon on either StackOverflow or CrossValidated.

Related

R - filtering negative and positive spikes in data

I am plotting data that consists of intervals that are more or less constant, plus spikes that originate from the data being a quotient of two parameters. The very large and very small quotients aren't relevant for my purpose, so I have been looking for a way to filter them out. The dataset contains 40k+ values, so I cannot remove the high/low quotients manually.
Is there any function that can trim/filter out the very large/small quotients?
You can use the filter() function from dplyr. This can create a new dataframe without outliers that you can then plot. For example:
library(dplyr)
no_spikes <- filter(original_df, x > -100 & x < 100)
This would create a new dataframe, no_spikes, that only contains observations where the variable x is between the values -100 and 100.

How to choose one row from each of seven dataframes to form a new dataframe?

I have seven dataframes, all with the same structure: 41 rows and 12 columns. I want to choose one row from each dataframe to build a new dataframe. Theoretically I can get 41^7 new dataframes and then run a regression on each of them. My goal is to get a range for the coefficient of my most important independent variable, but the crucial step is getting these 41^7 new dataframes. What should I do?
I have run a panel regression with 30 individuals and 12 semi-annual periods. My most important variable is partly missing, so I calculated it myself. Now I want to adjust its method of calculation to get an interval for its coefficient, and this is where my process hits an obstacle.
FIT <- t(cbind(FITgan[1,],FITxin[1,],FITqing[1,],FITnei[1,],FIThe[1,],FITshan[1,],FITshann[1,]))
This is my attempt for one combination. The total number of combinations is 41^7 (7 dataframes, each with 41 rows). I would like to get a list with all these results. Thanks!!
I am not sure this question belongs here; it sounds more like a topic for Cross Validated. Since you probably should not run a regression on 41^7 dataframes, I would suggest a bootstrapping approach:
1. sample from your true observations WITH replacement 41 times
2. run your regression
3. save the coefficients of interest
4. repeat steps 1-3 maybe 10,000 times
5. estimate a simulated distribution from your coefficients of interest
This will lead you to a better understanding of the underlying uncertainty of your parameters.
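A minimal sketch of those steps, assuming your observations are stacked in a data frame df, the model is lm()-style, and the coefficient of interest belongs to a variable called x (df, y, x, and z are all hypothetical names):

set.seed(1)
n.boot <- 10000
boot.coefs <- numeric(n.boot)
for (b in seq_len(n.boot)) {                            # the loop itself is step 4
  idx <- sample(nrow(df), size = 41, replace = TRUE)    # step 1: resample with replacement
  fit <- lm(y ~ x + z, data = df[idx, ])                # step 2: run the regression (placeholder formula)
  boot.coefs[b] <- coef(fit)["x"]                       # step 3: save the coefficient of interest
}
hist(boot.coefs)                                        # step 5: simulated distribution
quantile(boot.coefs, c(0.025, 0.975))                   # interval for the coefficient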

R code to generate random pairs of rows and do simulation

I have a data matrix in R with 45 rows. Each row represents the value of an individual sample. I need to run a trial simulation: I want to pair up samples randomly and calculate their differences. I want a large sample (maybe 10,000) from all the possible permutations and combinations.
This is how I have managed to do it so far:
My data matrix ("data") has 45 rows and 2 columns. I selected 45 rows randomly and subtracted them from another 45 randomly selected rows.
n1<-(data[sample(nrow(data),size=45,replace=F),])-(data[sample(nrow(data),size=45,replace=F),])
This gave me a random set of differences of size 45.
I made 50 such vectors (n1 to n50) and did rbind, which gave me a big data matrix containing random differences.
Of course, many rows in the first random set and the second random set were the same and cancelled out. I removed them with the following code:
row_sub = apply(new, 1, function(row) all(row !=0 ))
new.remove.zero<-new[row_sub,]
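For reference, the steps described above (building n1 to n50, rbind-ing them, and dropping the all-zero rows) boil down to something like the following sketch; the helper name one.diff is made up, while data, new, row_sub, and new.remove.zero are the names already used above:

# one random set of 45 pairwise differences
one.diff <- function(data) {
  data[sample(nrow(data), size = 45, replace = FALSE), ] -
    data[sample(nrow(data), size = 45, replace = FALSE), ]
}

# n1 .. n50, bound into one big matrix
new <- do.call(rbind, replicate(50, one.diff(data), simplify = FALSE))

# drop all-zero rows (cases where a row was subtracted from itself)
row_sub <- apply(new, 1, function(row) all(row != 0))
new.remove.zero <- new[row_sub, ]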
BUT, is there a cleaner way to do this? A simpler way to generate all possible random pairs of rows, calculate their differences, and bind them together as a new matrix?
Thanks in advance.

nzv filter for continuous features in caret

I am a beginner to practical machine learning using R, specifically caret.
I am currently applying a random forest algorithm to a microbiome dataset. The values are relative-abundance transformed, so with features as columns, the sum across all columns of Row 1 == 1.
It is common to have cells with a lot of 0 values.
Typically I used the default nzv preprocessing feature in caret.
Default:
a. One unique value across the entire dataset
b. few unique values relative to the number of samples in the dataset (<10 %)
c. large ratio of the frequency of the most common value to the frequency of the second most common value (cutoff used is > 19)
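Those defaults correspond to caret's nearZeroVar() arguments freqCut = 95/5 (i.e. 19) and uniqueCut = 10. A minimal sketch of inspecting them on your table (the object name abund is hypothetical):

library(caret)

# per-feature metrics: freqRatio, percentUnique, zeroVar, nzv
metrics <- nearZeroVar(abund, freqCut = 95/5, uniqueCut = 10, saveMetrics = TRUE)
head(metrics)

# drop the flagged columns
abund.filtered <- abund[, !metrics$nzv]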
So is this function not actually calculating variance, but rather determining the frequency of occurrence of values and filtering based on that frequency? If so, is it only safe to use for discrete/categorical variables?
I have a large number of features in my dataset (~12k), many of which might be singletons or have a zero value in a lot of samples.
My question: Is nzv suitable for such a continuous, zero inflated dataset?
What pre-processing options would you recommend?
When I use the default nzv I am dropping a tonne of features (from ~12k to ~2,700) in the final table.
I do want a less noisy dataset, but at the same time I do not want to lose good features.
This is my first question and I am willing to revise, edit, and resubmit if required.
Any solutions will be appreciated.
Thanks a tonne!

Sampling according to distribution from a large vector in R

I have a large vector of 11 billion values. The distribution of the data is not known, and therefore I would like to sample 500k data points based on the existing probabilities/distribution. In R there is a limit on the number of values that can be held in a vector (2^31 - 1), which is why I plan to do the sampling manually.
Some information about the data: the values are just integers, and many of them are repeated multiple times.
large.vec <- c(1, 2, 3, 4, 1, 1, 8, 7, 4, 1, ..., 216280)
To spread 500k samples across the distribution, I first create a probability sequence.
prob.vec <- seq(0, 1, length.out = 500000)
Next, convert these probabilities to position in the original sequence.
position.vec <- prob.vec*11034432564
The reason I created the position vector is so that I can pick the data point at that specific position after I order the population data.
Now I count the occurrences of each integer value in the population and create a data frame with the integer values and their counts. I also create the interval for each of these values:
integer.values       counts   lw.interval   up.interval
             0  300,000,034             0   300,000,034
             1  169,345,364   300,000,034   469,345,398
             2  450,555,321   469,345,399   919,900,719
...
Now, using the position vector, I identify which interval each position falls in and, based on that, take the integer value associated with that interval.
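A minimal sketch of that lookup step using base R's findInterval() on the cumulative counts (the data frame name counts.df and its column names are hypothetical):

# order by integer value and build cumulative counts (upper bound of each interval)
counts.df <- counts.df[order(counts.df$integer.values), ]
cum.counts <- cumsum(as.numeric(counts.df$counts))
total.n <- cum.counts[length(cum.counts)]              # ~11 billion

prob.vec <- seq(0, 1, length.out = 500000)
position.vec <- pmax(1, ceiling(prob.vec * total.n))   # positions in the ordered population

# index of the interval each position falls into, then the matching integer value
idx <- findInterval(position.vec, cum.counts, left.open = TRUE) + 1
sample.vec <- counts.df$integer.values[idx]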
This way I believe I have a sample of the population. I got a large chunk of the idea from this reference,
Calculate quantiles for large data.
I wanted to know if there is a better approach, or whether this approach could reasonably, albeit crudely, give me a good sample of the population.
This process does take a fair amount of time, as the position vector has to go through all possible intervals in the data frame. For that reason I have parallelized it using RHIPE.
I understand that I will be able to do this only because the data can be ordered.
I am not trying to randomly sample here; I am trying to "sample" the data while keeping the underlying distribution intact, mainly to reduce 11 billion values to 500k.
