Closed. This question is off-topic. It is not currently accepting answers.
Closed 9 years ago.
I am working with intraday time series data for a financial instrument. I have to predict its price from some statistical parameters (Var1, Var2, Var3) and from the intraday time series of the previous period (Observ.1, Observ.2, ..., Observ.80). That is, I have to predict the price of the instrument in the 81st period.
The rows of the table are shuffled, so the information in row i is useless for predicting row j.
I am planning to solve this problem using R. I am new to financial modelling. What approach can I take for the prediction? Any help is appreciated.
The data set looks like this:
[image: Sample Data]
In the past I've done this kind of thing for a living. The general concept is to think of a model that would have some predictive power, then fit e.g. the first half of the data set to the model, and finally test to see whether the model has any predictive power on the second half of the data set.
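The fit-on-one-half, test-on-the-other idea can be sketched as follows. This is only a minimal illustration on simulated data: the column names `var1`, `last_obs` and `target`, the linear model, and the naive baseline are all assumptions made for the example, not anything taken from the question's data set.

```r
# Simulated stand-in data; var1, last_obs and target are hypothetical names.
set.seed(42)
n <- 400
dat <- data.frame(var1 = rnorm(n), last_obs = rnorm(n))
dat$target <- 0.5 * dat$var1 + 0.8 * dat$last_obs + rnorm(n, sd = 0.5)

# Fit on the first half of the rows, hold out the second half.
half <- n / 2
train <- dat[seq_len(half), ]
holdout <- dat[(half + 1):n, ]
fit <- lm(target ~ var1 + last_obs, data = train)

# Compare out-of-sample error with a naive baseline (the training mean).
pred <- predict(fit, newdata = holdout)
rmse_model <- sqrt(mean((holdout$target - pred)^2))
rmse_naive <- sqrt(mean((holdout$target - mean(train$target))^2))
print(c(model = rmse_model, naive = rmse_naive))
```

If the model's holdout error is not clearly below the naive baseline, the model probably has no real predictive power.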
If you haven't tried anything before, a good place to start is with ARMA models (see e.g. the Autoregressive–moving-average model article on Wikipedia).
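As a rough starting point, an ARMA(1,1) model can be fitted to one series of 80 observations with `arima()` from base R's stats package and used to forecast the 81st value. The series here is simulated with `arima.sim()`, and the order (1, 0, 1) is an arbitrary assumption; in practice the order would have to be chosen from the data.

```r
# Simulate one hypothetical series of 80 intraday observations.
set.seed(1)
y <- arima.sim(model = list(ar = 0.6, ma = 0.3), n = 80)

# Fit an ARMA(1,1) model (no differencing, hence the middle 0).
fit <- arima(y, order = c(1, 0, 1))

# One-step-ahead forecast for period 81, with its standard error.
fc <- predict(fit, n.ahead = 1)
print(fc$pred)
print(fc$se)
```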
Closed. This question needs to be more focused. It is not currently accepting answers.
Closed 3 years ago.
I have data for countries in Southeast Asia which looks like this:
[image not reproduced]
The list goes on for Cambodia, Indonesia, Laos etc.
I am planning to conduct a regression analysis with HDI as the independent variable and GTD (Number of terrorist incidents) as the dependent variable. However, from my understanding, conducting an OLS regression would only account for one year?
How could I go about conducting a time-series analysis and subsequent regression using all the years (1996-2017) with the country data? I hope to seek some clarification on this. Thank you :)
This seems like a question more suited to https://stats.stackexchange.com/ (as @Jason pointed out in his comment, there are some dangers). In short, you would be using all the years, because what a simple regression asks is 'how much does one variable depend on another'. In your case you would use all your HDI and GTD data; the observations just happen to be spread over many years.
Simple example:
library(gapminder) # if you do not have these packages, run install.packages("gapminder") and install.packages("tidyverse")
library(tidyverse)
head(gapminder) # first 6 rows of the data
plot(gapminder$gdpPercap, gapminder$lifeExp) # predictor on x, response on y; probably not a linear relationship, but let's go on anyway for demonstration purposes
fit <- lm(lifeExp ~ gdpPercap, data = gapminder) # simple regression using all data from all countries for all years
summary(fit) # coefficients, standard errors, R-squared
plot(fit) # standard diagnostic plots (residuals, Q-Q, leverage)
Ideally you will want to do a full exploration (is linear regression really appropriate?), including the distribution of the residuals, and experiment with subsetting by country. I recommend reading up on some of the many resources on how to do simple regression.
P.S. Ignore the poor realism of the simple example, that would in reality be a very poor use of linear regression!
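For data shaped like the question's (one row per country and year), a pooled regression over all years can be compared with a version that gives each country its own baseline through a country factor. The country factor is an assumption added here, not something the answer prescribes, and all numbers below are arbitrary simulated values rather than real HDI or GTD figures. The column names `hdi` and `gtd` are hypothetical.

```r
# Simulated country-year panel; values are made up for illustration only.
set.seed(123)
countries <- c("Cambodia", "Indonesia", "Laos")
panel <- expand.grid(year = 1996:2017, country = countries,
                     stringsAsFactors = FALSE)
base_hdi <- c(Cambodia = 0.45, Indonesia = 0.60, Laos = 0.48)
base_gtd <- c(Cambodia = 30, Indonesia = 40, Laos = 25)
panel$hdi <- unname(base_hdi[panel$country]) +
  0.005 * (panel$year - 1996) + rnorm(nrow(panel), sd = 0.01)
panel$gtd <- unname(base_gtd[panel$country]) -
  20 * panel$hdi + rnorm(nrow(panel), sd = 0.2)

# Pooled OLS: every country-year is one observation, all years are used.
pooled <- lm(gtd ~ hdi, data = panel)

# Same regression with a separate intercept for each country.
fe <- lm(gtd ~ hdi + factor(country), data = panel)

print(coef(pooled))
print(coef(fe))
```

When countries differ in their baseline level, the two `hdi` slopes can differ noticeably, which is one of the dangers of pooling everything into a single simple regression.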
Closed. This question needs to be more focused. It is not currently accepting answers.
Closed 3 years ago.
I'm working with a large, national survey that was collected using complex survey methods. As such, I need to account for sample weights and other survey design features (e.g., sampling strata). I'm new to this methodology, so apologies if the answers here are obvious.
I've had success running path analysis models using the 'lavaan' package paired with the 'lavaan.survey' package. However, some of my models involve only a subset of the data (e.g., only female participants).
How can I adjust the sample weights to reflect the fact that I am only analyzing a subsample (e.g., females)?
The subset() function in the survey package handles subpopulations correctly, and since lavaan.survey uses the survey package to get the basic standard errors for the population covariance matrix, it should all flow through properly.
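A small sketch of that workflow, assuming the survey package is installed (the code skips the survey calls otherwise). The data are simulated, and the column names `stratum`, `sex`, `wt` and `y` are hypothetical. The key point is that the design is built on the full sample first and subsetted afterwards, rather than filtering the data frame before calling `svydesign()`.

```r
# Simulated survey data; all column names are hypothetical.
set.seed(2024)
n <- 200
svy <- data.frame(
  stratum = rep(1:4, each = 50),
  sex = sample(c("female", "male"), n, replace = TRUE),
  wt = runif(n, min = 1, max = 5),
  y = rnorm(n)
)

# Plain weighted mean among females, for comparison with the design-based one.
is_f <- svy$sex == "female"
wm_female <- weighted.mean(svy$y[is_f], svy$wt[is_f])

est_female <- NA_real_
if (requireNamespace("survey", quietly = TRUE)) {
  # Declare the design on the FULL sample.
  des <- survey::svydesign(ids = ~1, strata = ~stratum, weights = ~wt,
                           data = svy)
  # Subset the design object, not the raw data frame.
  des_f <- subset(des, sex == "female")
  est_female <- as.numeric(coef(survey::svymean(~y, des_f)))
  print(survey::svymean(~y, des_f))
}
```

The subsetted design (`des_f` here) is what would then be passed to `lavaan.survey()` in place of the full design; that step is not run in this sketch.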
Closed. This question needs to be more focused. It is not currently accepting answers.
Closed 5 years ago.
I'm doing credit risk modelling and the data have a large number of features. I am using the Boruta package for feature selection, but it is too computationally expensive to run on the complete training dataset. What I'm trying to do is take a subset of the training data (say about 20-30%), run Boruta on that subset, and get the important features. But when I train the random forest, I have to use the full dataset. My question is: is it right to select features on only part of the training data and then build the model on the whole of the training data?
Since the question is logical in nature, I will give my two cents.
A single random sample of 20% of the population is good enough, I believe.
A step further would be to take 3-4 such random samples and keep the intersection of the significant variables from all of them; that is an improvement on the above.
You could also use feature selection from multiple methods (xgboost, some caret feature selection methods), with a different random sample for each of them, and then take the common significant features.
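The intersection idea can be sketched in base R as below. Note the loud assumption: `select_features()` is a hypothetical stand-in that uses a simple correlation test, not Boruta; in practice the call inside it would be replaced by whichever selector is actually used. The data, the 25% sample size, and the 0.01 threshold are likewise made up for illustration.

```r
# Simulated data: 20 features, of which only x1, x2 and x3 drive the outcome.
set.seed(99)
n <- 2000
p <- 20
X <- matrix(rnorm(n * p), nrow = n,
            dimnames = list(NULL, paste0("x", 1:p)))
lin <- 1.5 * X[, "x1"] - 1.5 * X[, "x2"] + 1.5 * X[, "x3"]
y <- rbinom(n, size = 1, prob = plogis(lin))

# Hypothetical stand-in selector (NOT Boruta): keep features whose
# correlation with the outcome is significant at level alpha.
select_features <- function(X, y, alpha = 0.01) {
  pvals <- apply(X, 2, function(col) cor.test(col, y)$p.value)
  names(pvals)[pvals < alpha]
}

# Run the selector on 4 independent 25% random samples of the rows.
selected <- lapply(1:4, function(i) {
  idx <- sample(n, size = round(0.25 * n))
  select_features(X[idx, , drop = FALSE], y[idx])
})

# Keep only the features chosen in every sample.
common <- Reduce(intersect, selected)
print(common)
```

Taking the intersection is conservative: a feature that is only weakly informative may be dropped because it misses the cut in one of the samples.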
Closed. This question needs to be more focused. It is not currently accepting answers.
Closed 7 years ago.
I’m analyzing a medical dataset containing 15 variables and 1.5 million data points. I would like to predict hospitalization and, more importantly, which type of medication may be responsible. The medicine variable has around 700 types of drugs. Does anyone know how to calculate the importance of a "value" (a type of drug in this case) within a variable for boosting? I need to know whether ‘drug A’ is better for prediction than ‘drug B’, both being values of a variable called ‘medicine’.
The logistic regression model is able to give such information in terms of p-values for each drug, but I would like to use a more complex method. Of course you can create a binary variable for each type of drug, but this gives 700 extra variables and does not seem to work very well. I’m currently using R. I really hope you can help me solve this problem. Thanks in advance! Kind regards, Peter
See varImp() in the caret package, which supports all the ML algorithms you referenced.
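As far as I know, importance measures are reported per model column, so a per-drug score needs one column per drug, which is the indicator encoding the question mentions. A minimal base R sketch with three hypothetical drugs and simulated outcomes (all values made up) shows that encoding with `model.matrix()`, alongside the per-drug logistic regression output the question refers to:

```r
# Simulated data with three hypothetical drugs; risks are made up.
set.seed(7)
n <- 600
med <- data.frame(
  medicine = factor(sample(c("drugA", "drugB", "drugC"), n, replace = TRUE))
)
risk <- c(drugA = 0.6, drugB = 0.3, drugC = 0.3)
med$hospitalized <- rbinom(n, size = 1,
                           prob = risk[as.character(med$medicine)])

# One indicator column per drug (no intercept, so every level gets a column).
X <- model.matrix(~ medicine - 1, data = med)
print(colnames(X))

# Logistic regression: one coefficient and p-value per drug,
# each relative to the baseline level (drugA here).
fit <- glm(hospitalized ~ medicine, data = med, family = binomial)
print(coef(summary(fit)))
```

A matrix like `X` could then be given to a boosting model so that each drug is scored separately; with 700 levels it will be wide and mostly zeros.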
Closed. This question is off-topic. It is not currently accepting answers.
Closed 10 years ago.
I got the following time series of residuals from another regression.
Each index is one day. You can directly observe the yearly cycle.
The aim is to fit a harmonic function through it to explain a further part of the underlying time series.
I would really appreciate your ideas about which function to use for estimating the right parameters! From the ACF we learn that there is also a weekly cycle. However, I will address that issue later with SARIMA.
This seems to be the sort of thing a Fourier transform is designed for.
Try
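fftobj <- fft(x) # discrete Fourier transform of the residual series x
plot(Mod(fftobj)[1:floor(length(x)/2)]) # magnitudes for the first half of the spectrum (the second half mirrors it for real x)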
The peaks in this plot correspond to frequencies with high coefficients in the fit. Arg(fftobj) will give you the phases.
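To make the link between the peak and a fitted harmonic concrete, here is a small sketch on a simulated series. Two years of daily data with a known yearly cosine are assumed; the amplitude 3, phase 0.5 and noise level are arbitrary. It locates the largest non-constant peak and rebuilds the matching cosine from `Mod()` and `Arg()`.

```r
# Simulated daily residuals: yearly cosine plus noise (made-up parameters).
set.seed(10)
n <- 730
day <- 0:(n - 1)
x <- 3 * cos(2 * pi * day / 365 + 0.5) + rnorm(n, sd = 0.5)

fftobj <- fft(x)
half <- floor(n / 2)
mags <- Mod(fftobj)[1:half]

# Index 1 is the constant (mean) term, so skip it when looking for the peak.
k <- which.max(mags[-1]) + 1

# Index k corresponds to (k - 1) full cycles over the n observations.
period <- n / (k - 1)
amplitude <- 2 * mags[k] / n
phase <- Arg(fftobj[k])

# Harmonic implied by the dominant peak.
harmonic <- amplitude * cos(2 * pi * (k - 1) * day / n + phase)
print(c(period = period, amplitude = amplitude, phase = phase))
```

This works cleanly here because the series covers a whole number of cycles; with a series length that is not a multiple of the period, the peak spreads over neighbouring frequencies and the recovered amplitude and phase are only approximate.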
Well, I tried it, but it provides a forecast that looks like an exponential distribution. Meanwhile I solved the problem another way: I added a factor component for each month and ran a regression. In the next step I smoothed the results from this regression and got an intra-year pattern that is more accurate than a harmonic function. E.g. during June and July (around index 185) there is generally a low level but also a high number of peaks.
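The month-factor approach described above might look roughly like this. It is a sketch on simulated data, since the original series is not available; the start date, the shape of the seasonal signal, and the choice of `lowess()` with span `f = 0.1` as the smoother are all assumptions, not the poster's actual settings.

```r
# Simulated daily residuals over two years (made-up seasonal shape).
set.seed(1)
dates <- seq(as.Date("2010-01-01"), by = "day", length.out = 730)
day_of_year <- as.numeric(format(dates, "%j"))
res <- 2 * cos(2 * pi * day_of_year / 365) + rnorm(730, sd = 0.5)

# Step 1: regression with one factor level per calendar month.
month <- factor(format(dates, "%m"))
fit <- lm(res ~ month)
step_pattern <- fitted(fit)   # piecewise-constant monthly levels

# Step 2: smooth the monthly steps into a continuous intra-year pattern.
smooth_pattern <- lowess(seq_along(res), step_pattern, f = 0.1, iter = 0)$y

plot(res, type = "l", col = "grey")
lines(step_pattern, col = "blue")
lines(smooth_pattern, col = "red")
```

Unlike a single harmonic, the monthly levels are free to take any shape, so an asymmetric pattern such as a summer dip can be captured.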