Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.
This question does not appear to be about programming within the scope defined in the help center.
Closed 8 years ago.
I have a vector of yearly population data from 1980 to 2020 in which only two values (for the years 2000 and 2010) are observed, and I need to predict the missing data.
My first thought was to use na.approx to fill in the missing data between 2000 and 2010 and then use the ARIMA model. However, as the population is declining, in the remote future its values would become negative, which is illogical.
My second thought was to take the difference of the logarithms of the two sample values, divide it by 10 (since there is a 10-year gap between the actual values), and use it as an annual percentage change to predict the missing data.
However, I am new to R and statistics so I am not sure if this is the best way to get the predictions. Any ideas would be really appreciated.
Since the line that the two data points provide does not make intuitive sense, I would recommend just using the average of the two unless you can get additional data. If you are able to get either more yearly data, or even expected variation values, then you can do some additional analysis. But for now, you're kinda stuck.
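For what it is worth, a minimal sketch of the asker's second idea, i.e. deriving a constant annual growth rate from the log difference of the two observed values; the population figures below are invented purely for illustration:
years <- 1980:2020
p2000 <- 120000   # invented value for the year 2000
p2010 <- 105000   # invented value for the year 2010
# Implied constant annual growth factor from the 10-year log difference
r <- exp((log(p2010) - log(p2000)) / 10)
# Geometric interpolation/extrapolation: the series declines but never turns negative
pop <- p2000 * r^(years - 2000)
plot(years, pop, type = "b")
Unlike a straight line through the two points, this keeps the series positive, although it still assumes that the 2000-2010 rate of decline holds across the whole 1980-2020 range.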
Closed. This question needs to be more focused. It is not currently accepting answers.
Closed 3 years ago.
I have data for countries in Southeast Asia which looks like this:
[screenshot of the data omitted: one row per country and year, with HDI and GTD values]
The list goes on for Cambodia, Indonesia, Laos etc.
I am planning to conduct a regression analysis with HDI as the independent variable and GTD (number of terrorist incidents) as the dependent variable. However, from my understanding, an OLS regression would only account for one year?
How could I go about conducting a time-series analysis and subsequent regression using all the years (1996-2017) with the country data? I hope to seek some clarification on this. Thank you :)
This seems like a question more suited to https://stats.stackexchange.com/ (as @Jason pointed out in his comment, there are some dangers). In short, you would be using all the years, because what you are asking in a simple regression is 'how much does one variable depend on another?'. In your case you would use all your HDI and GTD data; the observations just happen to be spread over many years.
Simple example:
library(gapminder) #if you do not have these packages then install.packages("gapminder"), install.packages("tidyverse")
library(tidyverse)
head(gapminder) #first 6 rows of the data
plot(gapminder$gdpPercap, gapminder$lifeExp) #plot the predictor against the response; probably not a linear relationship, but let's go on anyway for demonstration purposes
fit <- lm(lifeExp ~ gdpPercap, data = gapminder) #Simple regression using all data from all countries for all years
summary(fit)
plot(fit)
Ideally you will want to do a fuller exploration (is linear regression even appropriate?), including checking the distribution of the residuals and experimenting with subsetting by country, so I recommend reading up on some of the many resources on how to do simple regression.
P.S. Ignore the poor realism of the simple example; in reality that would be a very poor use of linear regression!
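For the country-year setup in the question, a minimal sketch could look like the following; the data frame and its column names (country, year, HDI, GTD) are assumptions, and the values are invented:
# Tiny invented example of the panel layout (values are not real)
df <- data.frame(
  country = rep(c("Cambodia", "Indonesia", "Laos"), each = 3),
  year    = rep(2015:2017, times = 3),
  HDI     = c(0.56, 0.58, 0.59, 0.69, 0.70, 0.72, 0.58, 0.60, 0.61),
  GTD     = c(2, 1, 3, 10, 12, 9, 0, 1, 0)
)
# Pooled OLS over all country-years
pooled <- lm(GTD ~ HDI, data = df)
summary(pooled)
# One common way to use the panel structure instead of a single year:
# add country and year fixed effects
fe <- lm(GTD ~ HDI + factor(country) + factor(year), data = df)
summary(fe)
Whether fixed effects (or a count model such as Poisson regression, given that GTD is a count) are appropriate is exactly the kind of question better suited to Cross Validated.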
Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.
This question does not appear to be about programming within the scope defined in the help center.
Closed 5 years ago.
I have time-series data containing 16 values (numbers of vehicles) from 2001 to 2016. I wanted to predict, based on the underlying trend, the values up to 2050 (which is a long shot, I agree).
Upon doing some research, I found that this can be done with methods like HoltWinters or TBATS, even though that did not fit my own plan of using some machine-learning algorithm.
I am using R for all my work. Now, after using the HoltWinters() and then forecast() methods, I did get an extrapolated curve up to 2050, but it is a simple exponential curve from 2017 to 2050 which I think I could have obtained through much simpler calculations.
My question is twofold:
1) What would be the best approach to obtain a meaningful extrapolation?
2) Can my current approach be modified to give me a more meaningful extrapolation?
By meaningful I mean a curve whose details are closer to reality.
Thanks a lot.
I guess you need more data to make predictions. HoltWinters or TBATS may work but there are many other ML models for time series data you can try.
http://a-little-book-of-r-for-time-series.readthedocs.io/en/latest/src/timeseries.html
This link has R sample code for HoltWinters and the plots.
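In case it helps, a minimal sketch of the HoltWinters()/forecast() route mentioned in the question, plus a damped-trend alternative that often extrapolates more conservatively over long horizons; the vehicle counts are invented for illustration:
library(forecast)
# Invented yearly vehicle counts for 2001-2016
vehicles <- ts(c(210, 225, 243, 260, 281, 300, 324, 350,
                 371, 398, 420, 447, 472, 501, 530, 558),
               start = 2001, frequency = 1)
# Annual data has no seasonal component, so use the trend-only
# (double exponential smoothing) form of Holt-Winters
fit <- HoltWinters(vehicles, gamma = FALSE)
fc  <- forecast(fit, h = 2050 - 2016)
plot(fc)
# Damped-trend exponential smoothing, which flattens the long-run extrapolation
fit2 <- holt(vehicles, damped = TRUE, h = 2050 - 2016)
plot(fit2)
With only 16 annual observations, any curve out to 2050 is an extrapolation of the trend, so the choice of damping matters more than the particular method.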
Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.
This question does not appear to be about programming within the scope defined in the help center.
Closed 8 years ago.
So I have a lot of points, kind of like this:
animalid1;A;time
animalid1;B;time
animalid1;C;time
animalid2;A;time
animalid2;B;time
animalid2;A;time
animalid2;B;time
animalid2;C;time
animalid3;A;time
animalid3;B;time
animalid3;C;time
animalid3;B;time
animalid3;A;time
What I want to do is, first of all, make R understand that the points A, B, C are connected. Then I want to compare movements from A to C: how long they take, how many steps were used, etc. So maybe I have a movement sequence like ABC for 20 animals, ABABC for 10 animals, and ABCBA for 5 animals. I want to run some sort of statistical test to see whether the total time is different between these groups, and so on.
I bet this has been done before. But my Google skills are not good enough to find it.
Look at the msm package (msm stands for Multi-State Model). Given observations of states at different times, it will estimate transition probabilities and the average time spent in the different states.
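In case it helps, a rough sketch of getting such records into shape, comparing total times between sequence groups, and tabulating transitions for msm; the times below are invented, since the question only shows placeholders:
library(msm)
# Rebuild the example records as a data frame (times invented)
dat <- data.frame(
  animal = c(rep("animalid1", 3), rep("animalid2", 5), rep("animalid3", 5)),
  point  = c("A", "B", "C",  "A", "B", "A", "B", "C",  "A", "B", "C", "B", "A"),
  time   = c(0, 4, 9,  0, 3, 7, 11, 16,  0, 5, 9, 14, 20)
)
# Movement sequence and total elapsed time per animal
seqs  <- tapply(dat$point, dat$animal, paste, collapse = "")
total <- tapply(dat$time, dat$animal, function(t) max(t) - min(t))
# With enough animals per sequence group, a non-parametric test of
# whether total time differs between the groups
kruskal.test(total ~ factor(seqs))
# statetable.msm() tabulates the observed transitions, a useful first
# look before fitting a full multi-state model with msm()
dat$state <- as.integer(factor(dat$point))  # msm wants numeric state codes
statetable.msm(state, animal, data = dat)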
Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.
Questions asking for code must demonstrate a minimal understanding of the problem being solved. Include attempted solutions, why they didn't work, and the expected results. See also: Stack Overflow question checklist
Closed 8 years ago.
This is more of a general question, but since I am using R, hence the tags.
My training data set has 15,000 entries, of which around 20 I would like to use as the positive set for building the SVM. I wanted to use the remaining data, resampled, as my negative set, but I was wondering whether it might be better to take a negative set of the same size (around 20), since otherwise the data are highly imbalanced. Is there an easy approach to then pool the classifiers (ensemble-based) in R after 1000 rounds of resampling (or even with the e1071 package)?
Follow-up question: I would like to calculate a score for each prediction afterwards; is it fine just to take the probabilities times 100?
Thanks.
You can try "class weight" approach in which the smaller class gets more weight, thus taking more cost to mis-classify the positive labelled class.
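A minimal sketch of class weighting with svm() from e1071; the toy data and the 20-versus-1000 split are invented to mimic the imbalance described in the question:
library(e1071)
# Invented, highly imbalanced toy data: 20 positives, 1000 negatives
set.seed(1)
x <- rbind(matrix(rnorm(20 * 2, mean = 2), ncol = 2),
           matrix(rnorm(1000 * 2, mean = 0), ncol = 2))
y <- factor(c(rep("pos", 20), rep("neg", 1000)))
# Weight the classes inversely to their frequency so that
# mis-classifying a positive example costs more
wts <- c(pos = 1000 / 20, neg = 1)
fit <- svm(x, y, kernel = "radial", class.weights = wts, probability = TRUE)
# Class probabilities for each prediction; these could be scaled by 100
# if a 0-100 "score" is wanted
pred <- predict(fit, x, probability = TRUE)
head(attr(pred, "probabilities"))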
Closed. This question is off-topic. It is not currently accepting answers.
Closed 10 years ago.
I got the following time series of residuals from another regression.
One index step corresponds to one day. You can directly see the yearly cycle.
The aim is to fit a harmonic function through it to explain a further part of the underlying time series.
I would really appreciate your ideas about which function to use for estimating the right parameters! From the ACF we learn that there is also a weekly cycle; however, I will address this issue later with SARIMA.
This seems to be the sort of thing a Fourier transform is designed for.
Try
fftobj = fft(x)  # discrete Fourier transform of the residual series x
plot(Mod(fftobj)[1:floor(length(x)/2)])  # magnitude spectrum up to the Nyquist frequency
The peaks in this plot correspond to frequencies with high coefficients in the fit. Arg(fftobj) will give you the phases.
Well, I tried it, but it produces a forecast that looks like an exponential distribution. In the meantime I solved the problem in another way: I added a factor component for each month and ran a regression. In the next step I smoothed the results from this regression and got an intra-year pattern that is more accurate than a harmonic function. E.g. during June and July (around index 185) there is generally a low level but also a high number of peaks.
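For completeness, fitting the harmonic directly (the original aim of the question) can be sketched as a harmonic regression; the daily residual series is simulated here, since the real one is not available:
# Simulated daily residuals with an annual cycle (for illustration only)
set.seed(1)
t <- 1:(3 * 365)
x <- 2 * sin(2 * pi * t / 365) + 0.5 * cos(2 * pi * t / 365) + rnorm(length(t))
# Harmonic regression at the annual frequency; further harmonics
# (e.g. terms in 2 * 2 * pi * t / 365) can be added in the same way
fit <- lm(x ~ sin(2 * pi * t / 365) + cos(2 * pi * t / 365))
summary(fit)
# Fitted annual cycle overlaid on the series
plot(t, x, type = "l", col = "grey")
lines(t, fitted(fit), col = "red", lwd = 2)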