How can I test the Environmental Kuznets Curve theory with R?

I need help writing the code to test the environmental Kuznets curve in R. The theory behind it is that environmental degradation increases with economic growth up to a certain point, after which further growth has the opposite effect and environmental degradation decreases. If correct, this relationship should look like an inverted U-shape. I have GDP and greenhouse gas emissions data for 90 countries over multiple years and would like to see whether the theory holds. I know that I somehow need to use the lm() function, but other than that I really need help on how to go about it. If anyone could offer any insights/ideas I would be really thankful.
My data looks like this:
Country.Name Country.Code GDP1980 GDP2020 GDPGrowth
1 Aruba ABW NA 24487.9 NA
2 Africa Eastern and Southern AFE 738.9 1353.8 614.9
3 Afghanistan AFG 291.6 516.9 225.3
4 Africa Western and Central AFW 709.8 1683.4 973.6
5 Angola AGO 711.9 1604.0 892.1
6 Albania ALB NA 5332.2 NA
I tried ggplot(aes(GDPGrowth, GHP.new), data=DataM) + geom_line() and the output is a graph, but I feel like it does not match the theory.
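The usual way to test for the inverted U with lm() is to include both GDP and its square as regressors and check the signs of the two coefficients. Below is a minimal sketch, assuming the names DataM and GHP.new (emissions) from the ggplot call above and using GDP2020 as the income measure; with several years per country, reshaping to long format and adding country/year effects would be the natural next step.
# Quadratic EKC specification: emissions ~ GDP + GDP^2
# (DataM, GHP.new and GDP2020 are assumed from the snippets above --
#  substitute whichever income and emissions columns you actually use)
ekc <- lm(GHP.new ~ GDP2020 + I(GDP2020^2), data = DataM)
summary(ekc)

# An inverted U is consistent with a positive coefficient on GDP2020 and a
# negative coefficient on the squared term; the implied turning point is
#   -coef(ekc)["GDP2020"] / (2 * coef(ekc)["I(GDP2020^2)"])

# Visual check: scatter plot with a quadratic fit instead of geom_line()
library(ggplot2)
ggplot(DataM, aes(GDP2020, GHP.new)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x + I(x^2))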

Related

How to perform a Chi^2 Test with two data tables where x and y have different lengths

So I have the following tables (simplified here):
This is Ost_data:

Raumeinheit    Langzeitarbeitslose
Hamburg        22
Koln           45

This is West_data:

Raumeinheit    Langzeitarbeitslose
Hamburg        42
Koln           11
Ost_data has 76 rows and West_data has 324 rows.
I am tasked with showing that the variable "Langzeitarbeitslose" is statistically significantly higher in Ost_data than in West_data. Because that variable is not normally distributed, I am trying to use Pearson's chi-square test.
I tried
chisq.test(Ost_data$Langzeitarbeitslose, West_data$Langzeitarbeitslose)
but that just returns an error saying it can't be performed because x and y differ in length.
Is there a way around that problem so that I can still perform the chi-square test with my two tables of different lengths?
Pearson's chi-square test is for when the rows of the two tables measure the same thing (counts in matching categories). Here your rows just record a quantity for separate samples, so you should use a t-test:
t.test(Ost_data$Langzeitarbeitslose, West_data$Langzeitarbeitslose)
The most important aspect of your variable "Langzeitarbeitslose" (long-term unemployed) is not whether it is normally distributed but its scale level. I assume it is a dichotomous variable (either yes or no).
a t-test needs interval-scale data
a Wilcoxon test needs ordinal-scale data
a chi-square test works for nominal (and therefore also dichotomous) data
If you have both the number of long-term unemployed and the number of not long-term unemployed per city, you can compare the probability of being unemployed in the east and in the west.
# Supply these four counts from your data:
# l_west: absolute number of long-term unemployed in the west
# l_ost:  absolute number of long-term unemployed in the east
# n_west: absolute number of observed people (unemployed or not) in the west
# n_ost:  absolute number of observed people (unemployed or not) in the east
N <- n_west + n_ost  # absolute number of observations
chisq.test(c(l_west, l_ost), p = c(n_west / N, n_ost / N))
# this tests whether the relative frequency of unemployment in the east (l_ost / n_ost)
# differs from the equivalent relative frequency in the west (l_west / n_west),
# while taking the absolute number of observations in east and west into account
I know the words Ost (east), West, and Langzeitarbeitslose (long-term unemployed). I also know that Hamburg and Köln are in the west, not the east (so they should not appear in your "Ost_data"). Somebody who does not know this cannot help you, so bear this in mind in the future.
Best,
ajj

Comparing Kmeans and Agglomerative Clustering

I'm working to compare kmeans and agglomerative clustering methods for a country-level analysis, but I'm struggling because they are assigning numerical clusters differently, so even if a country (let's say Canada) is in a cluster by itself, that cluster is '1' in one version and '2' in another. How would I reconcile that both methods clustered this the same way, but assigned a different order?
I tried some hacky things with averaging the two, but am struggling to figure the logic out.
geo_group a_cluster k_cluster
<chr> <int> <int>
1 United States 1 1
2 Canada 2 3
3 United Kingdom 3 5
4 Australia 4 5
5 Germany 4 5
6 Mexico 5 6
7 France 5 5
8 Sweden 6 8
9 Brazil 6 6
10 Netherlands 6 6
Agglomerative clustering and kmeans are different methods to define a partition of a set of samples (e.g. samples 1 and 2 belong to cluster A and sample 3 belongs to cluster B).
kmeans is built around Euclidean distance: each sample is assigned to the nearest cluster centroid, minimising the within-cluster squared Euclidean distances. This is only possible for numerical features and works best where Euclidean distance is meaningful, e.g. for spatial data (longitude and latitude), where it is simply the distance as the crow flies.
Agglomerative clustering, however, can be used with many other dissimilarity measures, not just metric distances; the Jaccard distance, for example, allows categorical as well as numerical data.
Furthermore, the number of clusters can be chosen after the fact, whereas in kmeans the chosen k affects the clustering from the start. In agglomerative clustering, clusters are merged in a hierarchical manner; the merging criterion can be single, complete, or average linkage, which gives not one but many different agglomerative algorithms.
It is very normal to get different results from these methods.
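If, on top of that, you want to check how far the two label sets agree despite the arbitrary numbering, a label-invariant comparison is one option. A minimal sketch, assuming your results sit in a data frame df with the a_cluster and k_cluster columns shown above:
# Cross-tabulate the two labelings: if the partitions were identical, every
# row and every column would contain exactly one non-zero cell, regardless
# of which numbers each method happened to assign.
table(df$a_cluster, df$k_cluster)

# A single agreement score that ignores label permutations (needs the mclust package)
library(mclust)
adjustedRandIndex(df$a_cluster, df$k_cluster)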

Post-hoc test for lmer Error message

I am running a Linear Mixed Effect Model in R and I was able to successfully run my code and get results.
My code is as follow:
library(lme4)
library(multcomp)
read.csv(file="bh_new_all_woas.csv")
whb=read.csv(file="bh_new_all_woas.csv")
attach(whb)
head(whb)
whb.model = lmer(Density ~ distance + (1|Houses) + Cats, data = whb)
summary(whb.model)
However, I would like to compare the levels of my distance fixed factor, which has 4 levels. I tried running lsmeans as follows:
lsmeans(whb.model, pairwise ~ distance, adjust = "tukey")
This error popped up:
Error in match(x, table, nomatch = 0L) : 'match' requires vector arguments
I also tried glht using this code:
glht(whb.model, linfct=mcp(distance="tukey"))
and got the same results. A sample of my data is as follows:
Houses distance abund density
House 1 20 0 0
House 1 120 6.052357 0.00077061
House 1 220 3.026179 0.000385305
House 1 320 7.565446 0.000963263
House 2 20 0 0
House 2 120 4.539268 0.000577958
House 2 220 6.539268 0.000832606
House 2 320 5.026179 0.000639953
House 3 20 0 0
House 3 120 6.034696 0.000768362
House 3 220 8.565446 0.001090587
House 3 320 5.539268 0.000705282
House 4 20 0 0
House 4 120 6.052357 0.00077061
House 4 220 8.052357 0.001025258
House 4 320 2.521606 0.000321061
House 5 20 4.513089 0.000574624
House 5 120 6.634916 0.000844784
House 5 220 4.026179 0.000512629
House 5 320 5.121827 0.000652131
House 6 20 2.513089 0.000319976
House 6 120 9.308185 0.001185155
House 6 220 7.803613 0.000993587
House 6 320 6.130344 0.00078054
House 7 20 3.026179 0.000385305
House 7 120 9.052357 0.001152582
House 7 220 7.052357 0.000897934
House 7 320 6.547785 0.00083369
House 8 20 5.768917 0.000734521
House 8 120 4.026179 0.000512629
House 8 220 4.282007 0.000545202
House 8 320 7.537835 0.000959747
House 9 20 3.513089 0.0004473
House 9 120 5.026179 0.000639953
House 9 220 8.052357 0.001025258
House 9 320 9.573963 0.001218995
House 10 20 2.255828 0.000287221
House 10 120 5.255828 0.000669193
House 10 220 10.060874 0.001280991
House 10 320 8.539268 0.001087254
Does anyone have any suggestions on how to fix this problem?
So which problem is it that needs fixing? One issue is the model, and another is the follow-up to it.
The model displayed is fitted using the fixed effects ~ distance + Cats. Now, Cats is not in the dataset provided, so that's an issue. But aside from that, distance enters the model as a quantitative predictor (if I am to believe the read.csv statements etc.). This model implies that changes in the expected Density are proportional to changes in distance. Is that a reasonable model? Maybe, maybe not. But is it reasonable to follow that up with multiple comparisons for distance? Definitely not. From this model, the change between distances of 20 to 120 will be exactly the same as the change between distances of 120 and 220. The estimated slope of distance, from the model summary, embodies everything you need to know about the effect of distance. Multiple comparisons should not be done.
Now, one might guess from the question that what you really wanted to do was to fit a model where each of the four distances has its own effect, separate from the other distances. That would require a model with factor(distance) as a predictor; in that case, factor(distance) will account for 3 degrees of freedom rather than the 1 d.f. used by distance as a quantitative predictor. For such a model, it is appropriate to follow up with multiple comparisons (unless possibly distance also interacts with some other predictors). If you were to fit such a model, I believe you will find there are no errors in your lsmeans call (though you need a library("lsmeans") statement, which is not shown in your code).
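In code, a minimal sketch of that refit, assuming the column names from the posted data sample (density, distance, Houses) and leaving out Cats, which is not in the posted data:
library(lme4)
library(lsmeans)

# Treat distance as a factor so each of the four distances gets its own effect
whb$distance <- factor(whb$distance)

whb.model2 <- lmer(density ~ distance + (1 | Houses), data = whb)
summary(whb.model2)

# Tukey-adjusted pairwise comparisons among the four distance levels
lsmeans(whb.model2, pairwise ~ distance, adjust = "tukey")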
Ultimately, getting programs to run without error is not necessarily the same as producing sensible or meaningful answers. So my real answer is to consider carefully what is a reasonable model for the data. I might suggest seeking one-on-one help from a statistical consultant to make sure you understand the modeling issues. Once that is settled, then appropriate interpretation of that model is the next step; and again, that may require some advice.
Additional minor notes about the code provided:
The first read.csv call accomplishes nothing because it doesn't store the data.
R is case-sensitive, so technically, Density isn't in your dataset either (the sample shows density).
When the data frame is attached, you don't also need the data argument in the lmer call.
The apparent fact that Houses has levels "House 1", "House 2", etc. is messed-up in your listing because the comma delimiters in your data file are not shown.

Forecasting multivariate data with Auto.arima

I am trying to forecast sales from weekly data. The data consists of these variables: week number, sales, average price per unit, holiday (whether that week contains a holiday or not) and promotion (whether any promotion is running), for 104 weeks. The last 6 observations of the data set look like this:
Week   Sales    Avg.price.unit   Holiday   Promotion
101    8,970    50               0         1
102    17,000   50               1         1
103    23,000   80               1         0
104    28,000   180              1         0
105    NA       176              1         0
106    NA       75               0         1
Now I want to forecast sales for the 105th and 106th week. So I created a univariate time series x using the ts function and then ran auto.arima:
x <- ts(sales$Sales, frequency = 7)
fit <- auto.arima(x, xreg = external, test = c("kpss", "adf", "pp"), seasonal.test = c("ocsb", "ch"), allowdrift = TRUE)
fit
ARIMA(1,1,1)

Coefficients:
          ar1      ma1  Avg.price.unit   Holiday  Promotion
      -0.1497  -0.9180          0.0363  -10.4181    -4.8971
s.e.   0.1012   0.0338          0.0646    5.1999     5.5148

sigma^2 estimated as 479.3: log likelihood=-465.09
AIC=942.17   AICc=943.05   BIC=957.98
Now when I want to forecast the values for the last 2 weeks (105th and 106th), I supply the external values of the regressors for those weeks:
forecast(fit, xreg=ext)
where ext consists of the future values of the regressors for the last 2 weeks.
The output is:
Point Forecast Lo 80 Hi 80 Lo 95 Hi 95
15.85714 44.13430 16.07853 72.19008 1.226693 87.04191
16.00000 45.50166 17.38155 73.62177 2.495667 88.50765
The output looks incorrect, since the forecast sales values are far too small: the sales values in the training data are generally in the thousands.
If anyone can tell me why the output is incorrect/unexpected, that would be great.
If you knew a priori that certain weeks of the year or certain events in the year were possibly important, you could form a transfer function that could be useful. You might have to include some ARIMA structure to deal with short-term autoregressive structure and/or some pulse/level-shift/local-trend terms to deal with unspecified deterministic series (omitted variables). If you would like to post all of your data, I would be glad to demonstrate that for you, thus providing ground-zero help. Alternatively you can email it to me at dave#autobox.com and I will analyze it and post the data and the results to the list. Other commentators on this question might also want to do the same for comparative analytics.
Where are the 51 weekly dummies in your model? Without them you have no way to capture seasonality.
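To make the comment about weekly dummies concrete, here is a minimal sketch, assuming the training regressors are in external and the future ones in ext as in the question, and treating the series as weekly data with an annual cycle; whether 51 dummies can sensibly be estimated from only 104 weeks is a separate issue.
library(forecast)

# Declare an annual seasonal period for weekly observations
x <- ts(sales$Sales, frequency = 52)

# 51 weekly dummy columns for the training sample, alongside the other regressors
xreg_train <- cbind(as.matrix(external), seasonaldummy(x))
fit <- auto.arima(x, xreg = xreg_train)

# Matching dummy columns for the 2 weeks to be forecast (105 and 106)
xreg_future <- cbind(as.matrix(ext), seasonaldummy(x, h = 2))
forecast(fit, xreg = xreg_future)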

panel regression with non-individual-specific fixed effects

I have weekly observations of revenues from the sale of different products, separately for different countries, like so:
df <- data.frame(year = rep(c(2002, 2003), each = 16),
                 week = rep(1:4, 4),
                 product = rep(c('A', 'B'), each = 8, times = 2),
                 country = rep(c('usa', 'germany'), each = 4, times = 4),
                 revenue = abs(rnorm(32)))
That means revenue observations are only unique for a combination of year, week, country and product.
I would now like to estimate a model that includes fixed effects for the interaction of country and year and for each product but cannot figure out how to do this:
estimating via summary(lm(revenue~factor(paste(country,year)) + factor(product) + ..., data=df)) fails for lack of memory because my data set is rather larger than the example above, which means I have to estimate something on the order of 1000 fixed effects
as far as I understand it, panels are better estimated using the plm package, but my case doesn't seem to fit neatly within the standard framework of a panel, in which observations differ only across one time dimension and one cross-sectional dimension and fixed effects are estimated for each. I can generate a time index from year and week, but that (a) still leaves me with two cross-sectional dimensions and (b) would give me fixed effects for each year-week interaction, which is finer-grained than I want.
Are there any ways of estimating this with plm or are there other packages which do this sort of thing? I know I could demean the data within the groups described above, estimate via lm and then do a df-correction, but I'd rather avoid this.
First, create a variable, "fe", that identifies unique combinations of country, year, product.
library(data.table)
# convert data.frame to data.table
setDT(df)
# create a new group variable
df[, fe := .GRP, by = list(country, year, product)]
head(df)
year week product country revenue fe
1: 2002 1 A usa 0.84131750 1
2: 2002 2 A usa 0.07530538 1
3: 2002 3 A usa 0.56183346 1
4: 2002 4 A usa 0.80720792 1
5: 2002 1 A germany 1.25329883 2
6: 2002 2 A germany 0.44860296 2
Now use plm or felm. I like felm, since it also works with multiple fixed effects and interactive fixed effects:
library(lfe)
felm(revenue ~ week | fe, df)
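If you want country-by-year and product as two separate fixed-effect dimensions (as described in the question) rather than one combined group, here is a sketch of the same felm approach, assuming the df built above:
library(data.table)
library(lfe)

setDT(df)
# One factor per fixed-effect dimension: country x year, and product
df[, country_year := factor(paste(country, year, sep = "_"))]
df[, product := factor(product)]

# felm sweeps out both sets of fixed effects without building the dummy matrix in lm()
felm(revenue ~ week | country_year + product, data = df)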
