Creating predictions from a model - r

I am trying to create predictions from my model, but i wish to keep the Country_Name in the predictions if possible. Is there any way this can be done as i'm having no lucking using the standard predict() function
My model is;
mod = gam(gdp_per_capita ~ s(fisheries_production_pc, k = 10, bs = 'cs') + s(food_yield_pc, k = 10, bs = 'cs') +
s(freshwaster_production_pc, k = 5, bs = 'cs') + s(co2, k = 5, bs = 'cs') + Country_Name,
data = economy_df,
family = gaussian(link = "log"))
data snipet;
economy_df
Country_Name year gdp_per_capita Agriculture_GDP_per fisheries_production_pc food_yield_pc freshwaster_production_pc co2
Albania 2018 5287.6637 18.4294792 0.0052701739 1.688718e-03 3.342199e-07 1.782739
Albania 2019 5396.2159 18.3893474 0.0053295312 1.765194e-03 3.342199e-07 1.692248
Albania 2020 5332.1605 19.2644408 0.0059591472 1.835616e-03 3.342199e-07 3.926145
Algeria 2018 4142.0186 11.8742008 0.0028456292 4.622480e-05 2.321186e-07 3.920109
Algeria 2019 3989.6683 12.3362121 0.0024478768 4.105168e-05 2.321186e-07 3.977650
Algeria 2020 3306.8582 14.1347926 0.0019817330 3.467192e-05 2.321186e-07 2.448906
Bosnia 2018 6070.3530 5.8854355 0.0011864874 1.651028e-03 1.206103e-07 6.799183
Bosnia 2019 6119.7624 5.6030922 0.0012912459 1.622146e-03 1.206103e-07 6.382918
Bosnia 2020 6082.3667 6.0844855 0.0012438373 1.844267e-03 1.206103e-07 4.962175
Croatia 2018 15227.5601 2.9570919 0.0220747984 1.725996e-03 1.646345e-07 4.019235
Croatia 2019 15311.7669 2.8687641 0.0209151509 1.760604e-03 1.646345e-07 4.063708
Croatia 2020 14132.4866 3.2165075 0.0230609534 1.727508e-03 1.646345e-07 8.057848
Cyprus 2018 29334.1113 1.7335399 0.0074306923 8.853390e-04 1.740575e-07 6.054175
Cyprus 2019 29206.0762 1.8086052 0.0079922641 2.216217e-03 1.740575e-07 5.998795
Cyprus 2020 27681.5664 1.9308417 0.0071299388 1.961717e-03 1.740575e-07 5.614297
Egypt 2018 2537.1252 11.2250002 0.0199902966 6.887169e-05 7.874128e-07 2.518806
Egypt 2019 3019.0923 11.0489759 0.0203110909 6.022130e-05 7.874128e-07 2.484060
Egypt 2020 3569.2068 11.5676091 0.0196471464 6.046745e-05 7.874128e-07 5.295201
What I'm looking for would look something like this i imagine:
Country_Name prediction
Albania <value>
Albania <value>
Albania <value>

To check the order of values you can perform a correlation between observed versus predicted values:
cor(economy_df$gdp_per_capita,preds[["fit"]])
Then make new df with your desired columns:
mypred<-data.frame(Country_Name =
economy_df$Country_Name
,prediction=preds[["fit"]])
head(mypred)
Country_Name prediction
1 Albania 5259.356
2 Albania 5382.758
3 Albania 5373.350
4 Algeria 4099.978
5 Algeria 3951.231
6 Algeria 3402.162

Related

R transform data with year begin and year end into time series data

I have a problem that is very similar to this:
R transform data frame with start and end year into time series however, none of the solutions have worked for me.
This the original df:
df <- data.frame(country = c("Albania", "Albania", "Albania"), leader = c("Sali Berisha", "Sali Berisha", "Sali Berisha"), term = c(2, 2, 2), yearbegin = c(2009,2009, 2009), yearend = c(2013, 2013, 2013))
And it currently looks like this:
#> country leader term yearbegin yearend
#> 1 Albania Sali Berisha 2 2009 2013
#> 2 Albania Sali Berisha 2 2009 2013
#> 3 Albania Sali Berisha 2 2009 2013
And I'm trying to get it to look like this:
#> 1 Albania Sali Berisha 2 2009
#> 2 Albania Sali Berisha 2 2010
#> 3 Albania Sali Berisha 2 2011
#> 4 Albania Sali Berisha 2 2012
#> 5 Albania Sali Berisha 2 2013
When using this code:
library(tidyverse)
gpd_df<- gpd_df %>%
mutate(year = map2(yearbegin, yearend, `:`)) %>%
select(-yearbegin, -yearend) %>%
unnest```
I get a column that looks like this:
```year
2009:2013
2009:2013
2009:2013
Many thanks in advance for your help!!
trying to transform date into time-series form year begin/year end. Have just found errors :')
Use distinct first:
library(dplyr)
library(tidyr)
gpd_df %>%
distinct() %>%
mutate(year = map2(yearbegin, yearend, `:`), .keep = "unused") %>%
unnest_longer(year)
country leader term year
1 Albania Sali Berisha 2 2009
2 Albania Sali Berisha 2 2010
3 Albania Sali Berisha 2 2011
4 Albania Sali Berisha 2 2012
5 Albania Sali Berisha 2 2013

Why does lag() in this data frame repeat and reverse calculations?

This is a data frame I've called CA_less, which I want to use to calculate GDP change over five decades:
CountryName
Year
GDP
Costa Rica
1960
507513829.9949
Costa Rica
2010
37268635287.0856
Guatemala
1960
1043599900
Guatemala
2010
41338595380.8159
Honduras
1960
335650000
Honduras   
2010
15839344591.9842
Panama     
1960
537147100
Panama
2010
28917200000
I used this code:
CA_GDP_decade <- mutate(CA_less, Year2 = lag(Year, 1),GDP2 = lag(GDP, 1), CHANGE_PERC = ((GDP - GDP2) / GDP2 ) * 100 %>%
mutate_if(is.numeric,
round,
digits = 0)
CA_GDP_decade
I was expecting this:
CountryName
Year
GDP
Year2
GDP2
Change_perc
Costa Rica
1960
507513830
NA
NA
NA
Costa Rica
2010
37268635287
1960
507513830
7243
Guatemala
2010
41338595381
1960
1043599900
3861
Honduras
2010
15839344592
1960
335650000
4619
Panama
2010
28917200000
1960
537147100
5283
However, I got this instead:
CountryName
Year
GDP
Year2
GDP2
Change_perc
Costa Rica
1960
507513830
NA
NA
NA
Costa Rica
2010
37268635287
1960
507513830
7243
Guatemala
1960
1043599900
2010
37268635287
-97
Guatemala
2010
41338595381
1960
1043599900
3861
Honduras
1960
335650000
2010
41338595381
-99
Honduras
2010
15839344592
1960
335650000
4619
Panama
1960
537147100
2010
15839344592
-97
Panama
2010
28917200000
1960
537147100
5283
How could I use lag() in such a way that I avoid the duplication and reversal of the operations?
Taking your original code you could so something like this.
library(dplyr)
df = data.frame(
CountryName = c("Costa Rica","Costa Rica","Guatemala","Guatemala","Honduras","Honduras","Panama","Panama"),
Year = c(1960,2010,1960,2010,1960,2010,1960,2010),
GDP = c(507513829.9949,37268635287.0856, 1043599900, 41338595380.8159, 335650000, 15839344591.9842, 537147100, 28917200000.)
)
df %>%
group_by(CountryName) %>%
mutate(Year2 = lag(Year, 1),GDP2 = lag(GDP, 1), CHANGE_PERC = ((GDP - GDP2) / GDP2 ) * 100) %>%
mutate_if(is.numeric,
round,
digits = 0) %>%
na.omit()
# A tibble: 4 × 6
CountryName Year GDP Year2 GDP2 CHANGE_PERC
<fct> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Costa Rica 2010 37268635287 1960 507513830 7243
2 Guatemala 2010 41338595381 1960 1043599900 3861
3 Honduras 2010 15839344592 1960 335650000 4619
4 Panama 2010 28917200000 1960 537147100 5283

Creating a Variable Initial Values from a base variable in Panel Data Structure in R

I'm trying to create a new variable in R containing the initial values of another variable (crime) based on groups (countries) considering the initial period of time observable per group (on panel data framework), my current data looks like this:
country
year
Crime
Albania
2016
2.7369478
Albania
2017
2.0109779
Argentina
2002
9.474084
Argentina
2003
7.7898825
Argentina
2004
6.0739941
And I want it to look like this:
country
year
Crime
Initial_Crime
Albania
2016
2.7369478
2.7369478
Albania
2017
2.0109779
2.7369478
Argentina
2002
9.474084
9.474084
Argentina
2003
7.7898825
9.474084
Argentina
2004
6.0739941
9.474084
I saw that ddply could make it work this way, but the problem is that it is not longer supported by the latest R updates.
Thank you in advance.
Maybe arrange by year, then after grouping by country set Initial_Crime to be the first Crime in the group.
library(tidyverse)
df %>%
arrange(year) %>%
group_by(country) %>%
mutate(Initial_Crime = first(Crime))
Output
country year Crime Initial_Crime
<chr> <int> <dbl> <dbl>
1 Argentina 2002 9.47 9.47
2 Argentina 2003 7.79 9.47
3 Argentina 2004 6.07 9.47
4 Albania 2016 2.74 2.74
5 Albania 2017 2.01 2.74
library(data.table)
setDT(data)[, Initial_Crime:=.SD[1,Crime], by=country]
country year Crime Initial_Crime
1: Albania 2016 2.736948 2.736948
2: Albania 2017 2.010978 2.736948
3: Argentina 2002 9.474084 9.474084
4: Argentina 2003 7.789883 9.474084
5: Argentina 2004 6.073994 9.474084
A data.table solution
setDT(df)
df[, x := 1:.N, country
][x==1, initial_crime := crime
][, initial_crime := nafill(initial_crime, type = "locf")
][, x := NULL
]

Revaluing many observations with a for loop in R

I have a data set where I am looking at longitudinal data for countries.
master.set <- data.frame(
Country = c(rep("Afghanistan", 3), rep("Albania", 3)),
Country.ID = c(rep("Afghanistan", 3), rep("Albania", 3)),
Year = c(2015, 2016, 2017, 2015, 2016, 2017),
Happiness.Score = c(3.575, 3.360, 3.794, 4.959, 4.655, 4.644),
GDP.PPP = c(1766.593, 1757.023, 1758.466, 10971.044, 11356.717, 11803.282),
GINI = NA,
Status = 2,
stringsAsFactors = F
)
> head(master.set)
Country Country.ID Year Happiness.Score GDP.PPP GINI Status
1 Afghanistan Afghanistan 2015 3.575 1766.593 NA 2
2 Afghanistan Afghanistan 2016 3.360 1757.023 NA 2
3 Afghanistan Afghanistan 2017 3.794 1758.466 NA 2
4 Albania Albania 2015 4.959 10971.044 NA 2
5 Albania Albania 2016 4.655 11356.717 NA 2
6 Albania Albania 2017 4.644 11803.282 NA 2
I created that Country.ID variable with the intent of turning them into numerical values 1:159.
I am hoping to avoid doing something like this to replace the value at each individual observation:
master.set$Country.ID <- master.set$Country.ID[master.set$Country.ID == "Afghanistan"] <- 1
As I implied, there are 159 countries listed in the data set. Because it' longitudinal, there are 460 observations.
Is there any way to use a for loop to save me a lot of time? Here is what I attempted. I made a couple of lists and attempted to use an ifelse command to tell R to label each country the next number.
Here is what I have:
#List of country names
N.Countries <- length(unique(master.set$Country))
Country <- unique(master.set$Country)
Country.ID <- unique(master.set$Country.ID)
CountryList <- unique(master.set$Country)
#For Loop to make Country ID numerically match Country
for (i in 1:460){
for (j in N.Countries){
master.set[[Country.ID[i]]] <- ifelse(master.set[[Country[i]]] == CountryList[j], j, master.set$Country)
}
}
I received this error:
Error in `[[<-.data.frame`(`*tmp*`, Country.ID[i], value = logical(0)) :
replacement has 0 rows, data has 460
Does anyone know how I can accomplish this task? Or will I be stuck using the ifelse command 159 times?
Thanks!
Maybe something like
master.set$Country.ID <- as.numeric(as.factor(master.set$Country.ID))
Or alternatively, using dplyr
library(tidyverse)
master.set <- master.set %>% mutate(Country.ID = as.numeric(as.factor(Country.ID)))
Or this, which creates a new variable Country.ID2based on a key-value pair between Country.ID and a 1:length(unique(Country)).
library(tidyverse)
master.set <- left_join(master.set,
data.frame( Country = unique(master.set$Country),
Country.ID2 = 1:length(unique(master.set$Country))))
master.set
#> Country Country.ID Year Happiness.Score GDP.PPP GINI Status
#> 1 Afghanistan Afghanistan 2015 3.575 1766.593 NA 2
#> 2 Afghanistan Afghanistan 2016 3.360 1757.023 NA 2
#> 3 Afghanistan Afghanistan 2017 3.794 1758.466 NA 2
#> 4 Albania Albania 2015 4.959 10971.044 NA 2
#> 5 Albania Albania 2016 4.655 11356.717 NA 2
#> 6 Albania Albania 2017 4.644 11803.282 NA 2
#> Country.ID2
#> 1 1
#> 2 1
#> 3 1
#> 4 2
#> 5 2
#> 6 2
library(dplyr)
df<-data.frame("Country"=c("Afghanistan","Afghanistan","Afghanistan","Albania","Albania","Albania"),
"Year"=c(2015,2016,2017,2015,2016,2017),
"Happiness.Score"=c(3.575,3.360,3.794,4.959,4.655,4.644),
"GDP.PPP"=c(1766.593,1757.023,1758.466,10971.044,11356.717,11803.282),
"GINI"=NA,
"Status"=rep(2,6))
df1<-df %>% arrange(Country) %>% mutate(Country_id = group_indices_(., .dots="Country"))
View(df1)

R: Calculating 5 year averages in panel data

I have a balanced panel by country from 1951 to 2007 in a data frame. I'd like to transform it into a new data frame of five year averages of my other variables. When I sat down to do this I realized the only way I could think to do this involved a for loop and then decided that it was time to come to stackoverflow for help.
So, is there an easy way to turn data that looks like this:
country country.isocode year POP ci grgdpch
Argentina ARG 1951 17517.34 18.445022145 3.4602044759
Argentina ARG 1952 17876.96 17.76066507 -7.887407586
Argentina ARG 1953 18230.82 18.365255769 2.3118720688
Argentina ARG 1954 18580.56 16.982113434 1.5693778844
Argentina ARG 1955 18927.82 17.488907008 5.3690276523
Argentina ARG 1956 19271.51 15.907756547 0.3125559183
Argentina ARG 1957 19610.54 17.028450999 2.4896639667
Argentina ARG 1958 19946.54 17.541597134 5.0025894968
Argentina ARG 1959 20281.15 16.137310492 -6.763501447
Argentina ARG 1960 20616.01 20.519539628 8.481742144
...
Venezuela VEN 1997 22361.80 21.923577413 5.603872759
Venezuela VEN 1998 22751.36 24.451736863 -0.781844721
Venezuela VEN 1999 23128.64 21.585034168 -8.728234466
Venezuela VEN 2000 23492.75 20.224310777 2.6828641218
Venezuela VEN 2001 23843.87 23.480311721 0.2476965412
Venezuela VEN 2002 24191.77 16.290691319 -8.02535946
Venezuela VEN 2003 24545.43 10.972153646 -8.341989049
Venezuela VEN 2004 24904.62 17.147693312 14.644028806
Venezuela VEN 2005 25269.18 18.805970212 7.3156977879
Venezuela VEN 2006 25641.46 22.191098769 5.2737381326
Venezuela VEN 2007 26023.53 26.518210052 4.1367897561
into something like this:
country country.isocode period AvPOP Avci Avgrgdpch
Argentina ARG 1 18230 17.38474 1.423454
...
Venezuela VEN 12 25274 21.45343 5.454334
Do I need to transform this data frame using a specific panel data package? Or is there another easy way to do this that I'm missing?
This is the stuff aggregate is made for. :
Df <- data.frame(
year=rep(1951:1970,2),
country=rep(c("Arg","Ven"),each=20),
var1 = c(1:20,51:70),
var2 = c(20:1,70:51)
)
Level <-cut(Df$year,seq(1951,1971,by=5),right=F)
id <- c("var1","var2")
> aggregate(Df[id],list(Df$country,Level),mean)
Group.1 Group.2 var1 var2
1 Arg [1951,1956) 3 18
2 Ven [1951,1956) 53 68
3 Arg [1956,1961) 8 13
4 Ven [1956,1961) 58 63
5 Arg [1961,1966) 13 8
6 Ven [1961,1966) 63 58
7 Arg [1966,1971) 18 3
8 Ven [1966,1971) 68 53
The only thing you might want to do, is to rename the categories and the variable names.
For this type of problem, the plyr package is truely phenomenal. Here is some code that gives you what you want in essentially a single line of code plus a small helper function.
library(plyr)
library(zoo)
library(pwt)
# First recreate dataset, using package pwt
data(pwt6.3)
pwt <- pwt6.3[
pwt6.3$country %in% c("Argentina", "Venezuela"),
c("country", "isocode", "year", "pop", "ci", "rgdpch")
]
# Use rollmean() in zoo as basis for defining a rolling 5-period rolling mean
rollmean5 <- function(x){
rollmean(x, 5)
}
# Use ddply() in plyr package to create rolling average per country
pwt.ma <- ddply(pwt, .(country), numcolwise(rollmean5))
Here is the output from this:
> head(pwt, 10)
country isocode year pop ci rgdpch
ARG-1950 Argentina ARG 1950 17150.34 13.29214 7736.338
ARG-1951 Argentina ARG 1951 17517.34 18.44502 8004.031
ARG-1952 Argentina ARG 1952 17876.96 17.76067 7372.721
ARG-1953 Argentina ARG 1953 18230.82 18.36526 7543.169
ARG-1954 Argentina ARG 1954 18580.56 16.98211 7661.550
ARG-1955 Argentina ARG 1955 18927.82 17.48891 8072.900
ARG-1956 Argentina ARG 1956 19271.51 15.90776 8098.133
ARG-1957 Argentina ARG 1957 19610.54 17.02845 8299.749
ARG-1958 Argentina ARG 1958 19946.54 17.54160 8714.951
ARG-1959 Argentina ARG 1959 20281.15 16.13731 8125.515
> head(pwt.ma)
country year pop ci rgdpch
1 Argentina 1952 17871.20 16.96904 7663.562
2 Argentina 1953 18226.70 17.80839 7730.874
3 Argentina 1954 18577.53 17.30094 7749.694
4 Argentina 1955 18924.25 17.15450 7935.100
5 Argentina 1956 19267.39 16.98977 8169.456
6 Argentina 1957 19607.51 16.82080 8262.250
Note that rollmean(), by default, calculates the centred moving mean. You can modify this behaviour to get the left or right moving mean by passing this parameter to the helper function.
EDIT:
#Joris Meys gently pointed out that you might in fact be after the average for five-year periods.
Here is the modified code to do this:
pwt$period <- cut(pwt$year, seq(1900, 2100, 5))
pwt.ma <- ddply(pwt, .(country, period), numcolwise(mean))
pwt.ma
And the output:
> pwt.ma
country period year pop ci rgdpch
1 Argentina (1945,1950] 1950.0 17150.336 13.29214 7736.338
2 Argentina (1950,1955] 1953.0 18226.699 17.80839 7730.874
3 Argentina (1955,1960] 1958.0 19945.149 17.42693 8410.610
4 Argentina (1960,1965] 1963.0 21616.623 19.09067 9000.918
5 Argentina (1965,1970] 1968.0 23273.736 18.89005 10202.665
6 Argentina (1970,1975] 1973.0 25216.339 19.70203 11348.321
7 Argentina (1975,1980] 1978.0 27445.430 23.34439 11907.939
8 Argentina (1980,1985] 1983.0 29774.778 17.58909 10987.538
9 Argentina (1985,1990] 1988.0 32095.227 15.17531 10313.375
10 Argentina (1990,1995] 1993.0 34399.829 17.96758 11221.807
11 Argentina (1995,2000] 1998.0 36512.422 19.03551 12652.849
12 Argentina (2000,2005] 2003.0 38390.719 15.22084 12308.493
13 Argentina (2005,2010] 2006.5 39831.625 21.11783 14885.227
14 Venezuela (1945,1950] 1950.0 5009.006 41.07972 7067.947
15 Venezuela (1950,1955] 1953.0 5684.009 44.60849 8132.041
16 Venezuela (1955,1960] 1958.0 6988.078 37.87946 9468.001
17 Venezuela (1960,1965] 1963.0 8451.073 26.93877 9958.935
18 Venezuela (1965,1970] 1968.0 10056.910 28.66512 11083.242
19 Venezuela (1970,1975] 1973.0 11903.185 32.02671 12862.966
20 Venezuela (1975,1980] 1978.0 13927.882 36.35687 13530.556
21 Venezuela (1980,1985] 1983.0 16082.694 22.21093 10762.718
22 Venezuela (1985,1990] 1988.0 18382.964 19.48447 10376.123
23 Venezuela (1990,1995] 1993.0 20680.645 19.82371 10988.096
24 Venezuela (1995,2000] 1998.0 22739.062 20.93509 10837.580
25 Venezuela (2000,2005] 2003.0 24550.973 17.33936 10085.322
26 Venezuela (2005,2010] 2006.5 25832.495 24.35465 11790.497
Use cut on your year variable to make the period variable, then use melt and cast from the reshape package to get the averages. There's a lot of other answers that can show you how; see https://stackoverflow.com/questions/tagged/r+reshape
There is a base stats and a plyr answer, so for completeness, here is a dplyr based answer. Using the toy data given by Joris, we have
Df <- data.frame(
year=rep(1951:1970,2),
country=rep(c("Arg","Ven"),each=20),
var1 = c(1:20,51:70),
var2 = c(20:1,70:51)
)
Now, using cut to create the periods, we can then group on them and get the means:
Df %>% mutate(period = cut(Df$year,seq(1951,1971,by=5),right=F)) %>%
group_by(country, period) %>% summarise(V1 = mean(var1), V2 = mean(var2))
Source: local data frame [8 x 4]
Groups: country
country period V1 V2
1 Arg [1951,1956) 3 18
2 Arg [1956,1961) 8 13
3 Arg [1961,1966) 13 8
4 Arg [1966,1971) 18 3
5 Ven [1951,1956) 53 68
6 Ven [1956,1961) 58 63
7 Ven [1961,1966) 63 58
8 Ven [1966,1971) 68 53

Resources