I have data for countries in Southeast Asia which looks like this:
[table of country-year rows (1996-2017) with HDI and GTD values per country; image not reproduced]
The list goes on for Cambodia, Indonesia, Laos etc.
I am planning to conduct a regression analysis with HDI as the independent variable and GTD (number of terrorist incidents) as the dependent variable. However, from my understanding, an OLS regression would only account for a single year.
How can I conduct a time-series analysis, and a subsequent regression, that uses all the years (1996-2017) of the country data? I would appreciate some clarification on this. Thank you :)
This seems like a question better suited to https://stats.stackexchange.com/ (and, as @Jason pointed out in his comment, there are some dangers). In short, you would use all the years: what a simple regression asks is "how much does one variable depend on another?". In your case you would use all of your HDI and GTD data; the observations just happen to be spread over many years.
Simple example:
library(gapminder)  # if missing: install.packages("gapminder")
library(tidyverse)  # if missing: install.packages("tidyverse")
head(gapminder)  # first 6 rows of the data
plot(gapminder$gdpPercap, gapminder$lifeExp)  # probably not a linear relationship, but let's continue for demonstration purposes
fit <- lm(lifeExp ~ gdpPercap, data = gapminder)  # simple regression using all data from all countries for all years
summary(fit)
plot(fit)  # diagnostic plots for the fitted model
Ideally you will want to do a full exploration (is linear regression really appropriate?), including checking the distribution of the residuals, and experiment with subsetting by country, so I recommend reading some of the many resources on how to do simple regression.
P.S. Ignore the poor realism of the simple example; in reality it would be a very poor use of linear regression!
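That said, if you want the regression to use the panel structure explicitly rather than just pooling everything, here is a minimal sketch with country and year fixed effects. It assumes a hypothetical data frame df with columns country, year, HDI and GTD:
# pooled OLS: every country-year observation enters one regression
pooled <- lm(GTD ~ HDI, data = df)

# country and year fixed effects absorb time-invariant country traits
# and common yearly shocks (factor() turns each column into dummies)
fe <- lm(GTD ~ HDI + factor(country) + factor(year), data = df)

summary(pooled)
summary(fe)
If you go further down this road, the plm package implements the same idea with panel-aware standard errors.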
I am modeling bird nesting success, and my advisor wants me to use events/trials syntax in R to model the number of eggs that hatched vs. the total number of eggs per nest (i.e., the events/trials) against a variety of predictor variables, essentially in the logistic regression format.
This is totally new to me, so any online resources or code help would be incredibly useful! Thank you!
I haven't tried much yet because I can't find the resources; I can only find information for SAS.
When specifying a logistic regression with glm(), the response can be given in several different ways. One is as a two-column matrix: the first column is the number of successes, the second the number of failures.
So if you have two variables, total_eggs and hatched, try
mod <- glm(cbind(hatched, total_eggs - hatched) ~ x + ...,
           data = your_data_frame, family = binomial)
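An equivalent specification, which some people find closer to SAS's events/trials syntax, uses the proportion of successes as the response with the number of trials as weights. A minimal sketch, assuming the same hypothetical columns hatched and total_eggs and a single predictor x:
# proportion response with trials as prior weights
mod2 <- glm(hatched / total_eggs ~ x,
            weights = total_eggs,
            data = your_data_frame, family = binomial)

summary(mod2)  # same coefficient estimates as the cbind() form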
I have time-series data containing 16 values (the number of vehicles) from 2001 to 2016. I want to predict, based on the underlying trend, the values up to 2050 (which is a long shot, I agree).
Upon doing some research, I found that this can be done with methods like HoltWinters or TBATS, which, however, did not fit my original plan of using a machine-learning algorithm.
I am using R for all my work. After using the HoltWinters() and then forecast() functions, I did get a curve extrapolated to 2050, but it is a simple exponential curve from 2017 to 2050, which I think I could have obtained with simple hand calculations.
My question is twofold:
1) What would be the best approach to obtain a meaningful extrapolation?
2) Can my current approach be modified to give me a more meaningful extrapolation?
By meaningful I mean a curve whose details are closer to reality.
Thanks a lot.
I suspect you need more data to make good predictions. HoltWinters or TBATS may work, but there are many other ML models for time-series data you can try.
http://a-little-book-of-r-for-time-series.readthedocs.io/en/latest/src/timeseries.html
This link has sample R code for HoltWinters and the plots.
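For reference, a minimal sketch of the HoltWinters/forecast workflow mentioned above, assuming vehicle_counts is a hypothetical numeric vector holding the 16 annual values:
library(forecast)  # install.packages("forecast") if needed

vehicles <- ts(vehicle_counts, start = 2001, frequency = 1)  # annual series

# annual data has no seasonal component, so gamma (seasonality) is off
fit_hw <- HoltWinters(vehicles, gamma = FALSE)
fc_hw  <- forecast(fit_hw, h = 2050 - 2016)  # 34 steps ahead
plot(fc_hw)

# TBATS alternative from the same package
fit_tbats <- tbats(vehicles)
fc_tbats  <- forecast(fit_tbats, h = 34)
plot(fc_tbats)
Keep in mind that with only 16 points, any 34-year extrapolation will mostly reflect the fitted trend rather than real structure.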
I'm working with a large, national survey that was collected using complex survey methods. As such, I need to account for sample weights and other survey design features (e.g., sampling strata). I'm new to this methodology, so apologies if the answers here are obvious.
I've had success running path analysis models using the 'lavaan' package paired with the 'lavaan.survey' package. However, some of my models involve only a subset of the data (e.g., only female participants).
How can I adjust the sample weights to reflect the fact that I am only analyzing a subsample (e.g., females)?
The subset() function in the survey package handles subpopulations correctly, and since lavaan.survey uses the survey package to get the basic standard errors for the population covariance matrix, it should all flow through properly.
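A minimal sketch of that workflow, assuming hypothetical design variables psu, stratum and wt in a data frame dat, a sex variable for the subsample, and a placeholder path model:
library(survey)
library(lavaan)
library(lavaan.survey)

# full-sample design: subset the design object, not the raw data,
# so the variance-estimation information is preserved
des <- svydesign(ids = ~psu, strata = ~stratum, weights = ~wt, data = dat)

# domain (subpopulation) analysis for female participants
des_f <- subset(des, sex == "female")

model <- "y ~ x1 + x2"           # placeholder path model
fit   <- sem(model, data = dat)  # ordinary (unweighted) lavaan fit
fit_f <- lavaan.survey(lavaan.fit = fit, survey.design = des_f)
summary(fit_f)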
I'm analyzing a medical dataset containing 15 variables and 1.5 million data points. I would like to predict hospitalization and, more importantly, which type of medication may be responsible. The medicine variable has around 700 types of drugs. Does anyone know how to calculate the importance of a "value" (a type of drug, in this case) within a variable for boosting? I need to know whether 'drug A' is better for prediction than 'drug B', both being levels of a variable called 'medicine'.
A logistic regression model can give such information in terms of p-values for each drug, but I would like to use a more complex method. Of course, you can create a binary variable for each type of drug, but this gives 700 extra variables and does not seem to work very well. I'm currently using R. I really hope you can help me solve this problem. Thanks in advance! Kind regards, Peter
See varImp() in the caret package, which supports all the ML algorithms you referenced.
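A minimal sketch of how that might look, assuming a hypothetical data frame dat with a binary hospitalized factor and the 700-level medicine factor; the formula interface dummy-codes the factor, so each drug level gets its own importance score:
library(caret)

# boosted trees via caret; the formula interface expands 'medicine'
# into one indicator column per drug level
# (hospitalized should be a factor for classification)
fit <- train(hospitalized ~ ., data = dat,
             method = "gbm", verbose = FALSE)

vi <- varImp(fit)   # model-specific variable importance
plot(vi, top = 20)  # the 20 most important predictors, drug levels included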
I am working with a financial instrument's intraday time-series data. I have to predict the price of the instrument in the 81st period on the basis of some statistical parameters (Var1, Var2, Var3) and of the previous periods' intraday observations (Observ.1, Observ.2, ..., Observ.80).
The rows of the table are shuffled, so the information in any row i is useless for predicting row j.
I am planning to solve this problem using R. I am new to this financial modelling field. What approach can I take for this prediction? Please help me out.
The data set looks like this:
[sample rows with columns Var1-Var3 and Observ.1-Observ.80; image not reproduced]
In the past I've done this kind of thing for a living. The general approach is to think of a model that might have some predictive power, fit it to, say, the first half of the data set, and finally test whether the model has any predictive power on the second half.
If you haven't tried anything before, a good place to start is with ARMA models (see e.g. the autoregressive-moving-average model article on Wikipedia).
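A minimal sketch of that fit-then-validate idea with an ARMA model, assuming price_series is a hypothetical numeric vector of one instrument's period-by-period prices:
library(forecast)  # for forecast() and accuracy()

n     <- length(price_series)
train <- price_series[1:(n %/% 2)]      # first half: fit
test  <- price_series[(n %/% 2 + 1):n]  # second half: validate

# ARMA(1,1) is ARIMA with no differencing; the order is a guess to tune
fit <- arima(train, order = c(1, 0, 1))
fc  <- forecast(fit, h = length(test))

accuracy(fc, test)  # out-of-sample error on the held-out half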