Extracting slopes by categorical variable from a new data frame? - r

I have a data set depicting the BMI change for 199 countries over a span of 29 years (1980 to 2008). The data was originally in a wide format, and I (thought I) tidied it up:
BMI <- read.csv("Indicator_BMI female ASM.csv", header = TRUE)
BMI2 <- gather(BMI, "year", "BMI", X1980:X2008)
BMI2$year <- as.numeric(BMI2$year)
BMI <- BMI2
until it ended up in this format:
Country Year BMI
Afghanistan X1980 20.443
Afghanistan X1981 20.477
Afghanistan X1982 20.523
...
Albania X1980 25.174
Albania X1981 25.191
.......
Algeria X1980 20.241
.....
etc, you get the idea..
What I want to do is get, for each country, the slope of the relationship between year and BMI.
First I tried very simply to extract the slope for one country at a time:
thefit <- lm(BMI ~ year, subset(BMI, Country == "Albania"))
gradientAlbania <- round(coef(thefit)[2],4)
but the output from lm alone is quite unexpected (a separate coefficient for each year instead of a single slope):
Coefficients:
(Intercept) yearX1981 yearX1982 yearX1983 yearX1984 yearX1985 yearX1986 yearX1987 yearX1988 yearX1989 yearX1990
25.17427 0.01661 0.02605 0.04479 0.04932 0.03830 0.03512 0.01491 -0.00462 -0.02057 -0.03550
yearX1991 yearX1992 yearX1993 yearX1994 yearX1995 yearX1996 yearX1997 yearX1998 yearX1999 yearX2000 yearX2001
-0.12757 -0.20812 -0.23869 -0.23865 -0.23721 -0.20784 -0.20673 -0.17207 -0.11173 -0.04630 0.02905
yearX2002 yearX2003 yearX2004 yearX2005 yearX2006 yearX2007 yearX2008
0.09655 0.15771 0.22377 0.29098 0.35901 0.41967 0.48299
I guess it has to do with the format of the data frame (like the X in front of the year, which I couldn't get rid of).
However, I successfully created a BMI ~ year scatter plot of this data by country, so I assumed the format should be OK in principle?
[Image: BMI vs. year scatter plot]
Thanks for the help in advance.
In case it matters, later I want to include a slope function as part of the dplyr group_by, but this will come in a separate post (I decided to break down the question in different posts for clarity).

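As @aosmith mentioned, you need to get rid of those "X"s. While they are there, year is treated as a categorical variable, which is why lm returns one coefficient per year.
Does this work on your data and accomplish what you want?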
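library(dplyr)
library(tidyr)
library(purrr)
BMI <- read.csv("Indicator_BMI female ASM.csv", header = TRUE)
BMI2 <- gather(BMI, "year", "BMI", X1980:X2008)
# Strip the "X" prefix so that year is numeric rather than categorical
BMI2$year <- as.numeric(gsub("X", "", BMI2$year))
# Adapted from "R for Data Science"
country_model <- function(df) {
  lm(BMI ~ year, data = df)
}
BMI2 %>%
  group_by(Country) %>%
  nest() %>%
  mutate(model = map(data, country_model)) %>%
  # Pull the year coefficient out of each country's own model
  mutate(slope = map_dbl(model, ~ coef(.x)[["year"]]))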
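If only the slopes are needed, a shorter route is to fit the model inside summarise(). The following is a minimal sketch on made-up data (the toy data frame and its values are assumptions, not the real BMI file); it also shows the "X" prefix being stripped before fitting:

```r
library(dplyr)

# Toy long-format data with made-up values (not the real BMI file)
toy <- data.frame(
  Country = rep(c("A", "B"), each = 3),
  year    = rep(c("X1980", "X1981", "X1982"), times = 2),
  BMI     = c(20, 21, 22, 25, 25.5, 26)
)

# Remove the leading "X" so that year is numeric, not categorical
toy$year <- as.numeric(sub("^X", "", toy$year))

# One row per country, holding the coefficient on year
slopes <- toy %>%
  group_by(Country) %>%
  summarise(slope = coef(lm(BMI ~ year))[["year"]])

print(slopes)
```

Selecting the coefficient by name ("year") rather than by position avoids picking up the intercept by mistake.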

Related

Setting a class and avoid turning output from a tibble to a list

I'm trying to get the function below to work with an S3 class. It works when I don't add the class and run each part separately, but when I add the class assignment the output changes from a tibble to a list, and the rest of the code no longer works. Does anyone know how I can fix this?
When I try to apply the functions below, it returns:
Error in UseMethod("summarise_") : no applicable method for 'summarise_' applied to an object of class "growth_ind"
project <- function(countries)
{
if(!require(dplyr)){
install.packages("dplyr")
library(dplyr)}
if(!require(ggplot2)){
install.packages("ggplot2")
library(ggplot2)}
if(!require(pwt9)){
install.packages("pwt9")
library(pwt9)}
data("pwt9.1")
#first subsetting the data set to the variables required
data <- pwt9.1 %>%
filter(country %in% countries) %>%
select(year,isocode,rgdpna, rkna, emp, labsh) %>%
filter(year >= 1954,
year<= 2017)%>%
na.omit()
#calculating the real output and capital per worker, taking the logs and the
#first difference to obtain percentage changes
data <- data %>%
mutate(y_pc = log(rgdpna / emp), #GDP per worker
k_pc = log(rkna / emp), #Capital per worker
a = 1-labsh) %>% #Calculate the capital share
arrange(year) %>% #order by year
group_by(isocode) %>% #for each country calculate the following
mutate(g=(y_pc - lag(y_pc))*100, # calculating the growth of GDP per capita
dk = (k_pc - lag(k_pc))*100, # calculating the growth rate of capital per capita
dsolow = g - a*dk) %>% #the solow residual
na.omit()
class(data) <- "growth_ind"
print(data)
}
print.growth_ind <- function(data)
{
return(data)
}
summary.growth_ind <- function(data)
{
solow <- data %>%
summarise("Growth rate"=mean(g),
"Solow residual" = mean(dsolow),
"Capital deepening" = mean(a*dk),
"TFP share" = mean(dsolow) / mean(g),
"Capital share" = mean(a))
return(solow)
}
plot.growth_ind <- function(data)
{
ggplot(data, aes(x = year, y = g, colour = isocode)) +
geom_line()+
labs(title="Growth rate comparison",
subtitle = "Growth rate per capita for each country from 1954 to 2017",
caption = "Data taken from Penn World Table 9.1",
x="Year",
y="Growth rate")
}
proj <- project(c("Ireland"))
summary(proj)
print(proj)
plot(proj)
Nice. My good old friend the Solow growth model ... (;
Your main issue can be fixed by prepending the new class instead of overwriting the existing ones, i.e. class(data) <- append("growth_ind", class(data)). This way the object remains a tibble and works with all dplyr verbs. Additionally, I ungrouped the data by adding ungroup() at the end of your data-wrangling pipeline, and in your print method I used print.data.frame, as otherwise nothing was printed:
library(dplyr)
library(ggplot2)
library(pwt9)
project <- function(countries) {
#first subsetting the data set to the variables required
data <- pwt9.1 %>%
filter(country %in% countries) %>%
select(year,isocode,rgdpna, rkna, emp, labsh) %>%
filter(year >= 1954,
year<= 2017)%>%
na.omit()
#calculating the real output and capital per worker, taking the logs and the
#first difference to obtain percentage changes
data <- data %>%
mutate(y_pc = log(rgdpna / emp), #GDP per worker
k_pc = log(rkna / emp), #Capital per worker
a = 1-labsh) %>% #Calculate the capital share
arrange(year) %>% #order by year
group_by(isocode) %>% #for each country calculate the following
mutate(g=(y_pc - lag(y_pc))*100, # calculating the growth of GDP per capita
dk = (k_pc - lag(k_pc))*100, # calculating the growth rate of capital per capita
dsolow = g - a*dk) %>% #the solow residual
ungroup() %>%
na.omit()
class(data) <- append("growth_ind", class(data))
data
}
print.growth_ind <- function(data)
{
print.data.frame(data)
}
summary.growth_ind <- function(data)
{
solow <- data %>%
summarise("Growth rate"=mean(g),
"Solow residual" = mean(dsolow),
"Capital deepening" = mean(a*dk),
"TFP share" = mean(dsolow) / mean(g),
"Capital share" = mean(a))
return(solow)
}
plot.growth_ind <- function(data)
{
ggplot(data, aes(x = year, y = g, colour = isocode)) +
geom_line()+
labs(title="Growth rate comparison",
subtitle = "Growth rate per capita for each country from 1954 to 2017",
caption = "Data taken from Penn World Table 9.1",
x="Year",
y="Growth rate")
}
proj <- project(c("Ireland"))
summary(proj)
#> # A tibble: 1 x 5
#> `Growth rate` `Solow residual` `Capital deepening` `TFP share` `Capital share`
#> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 3.54 1.71 1.83 0.483 0.476
print(head(proj))
#> year isocode rgdpna rkna emp labsh y_pc k_pc
#> 1 1955 IRL 23016.36 0.07782067 1.194252 0.5573497 9.866440 -2.730868
#> 2 1956 IRL 22227.86 0.07958002 1.178225 0.5573497 9.845093 -2.695001
#> 3 1957 IRL 21593.21 0.07986675 1.135285 0.5573497 9.853251 -2.654279
#> 4 1958 IRL 21526.32 0.07999133 1.118528 0.5573497 9.865018 -2.637851
#> 5 1959 IRL 22753.12 0.08304243 1.110150 0.5573497 9.927963 -2.592898
#> 6 1960 IRL 23636.37 0.08525588 1.104913 0.5573497 9.970775 -2.561865
#> a g dk dsolow
#> 1 0.4426503 5.8672548 6.944282 2.7933665
#> 2 0.4426503 -2.1347642 3.586724 -3.7224284
#> 3 0.4426503 0.8157955 4.072170 -0.9867516
#> 4 0.4426503 1.1767534 1.642884 0.4495303
#> 5 0.4426503 6.2944526 4.495221 4.3046418
#> 6 0.4426503 4.2812699 3.103357 2.9075680
plot(proj)
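The difference between overwriting and prepending the class can be seen without any of the growth data. This is a small base-R sketch on a made-up data frame (df, replaced and prepended are illustrative names only):

```r
# Made-up data frame standing in for the wrangled tibble
df <- data.frame(g = c(1, 2, 3))

# Overwriting the class vector drops "data.frame", so methods written
# for data frames (including dplyr verbs) no longer dispatch on it
replaced <- df
class(replaced) <- "growth_ind"

# Prepending keeps the original classes as fallbacks
prepended <- df
class(prepended) <- c("growth_ind", class(df))

inherits(replaced, "data.frame")   # FALSE
inherits(prepended, "data.frame")  # TRUE
class(prepended)                   # "growth_ind" "data.frame"
```

With the prepended version, S3 dispatch tries the growth_ind method first and falls back to the data frame method when none is defined.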

Error: Problem with `mutate()` input `gap`. x non-numeric argument to binary operator i Input

I am new to R and am working with the gapminder dataset to try out some plots.
This dataset is the latest data downloaded from the gapminder website. It gives me an error in mutate, whereas the gapminder data from library(gapminder) doesn't.
df: gapminder_new
gapminder_new %>%
select(country, "2010", "2019") %>% head()
############## output ################
country 2010 2019
<chr> <dbl> <dbl>
Afghanistan 543 571
Albania 4090 5210
Algeria 4480 4710
Andorra 40900 45900
Angola 3590 3100
Antigua and Barbuda 13000 15700
I get an error when subtracting one year's column from another in the gapminder_new data frame:
gapminder_new %>%
select(country, "2010", "2019") %>%
mutate(gap = "2019" - "2010") %>% head()
Error: Problem with `mutate()` input `gap`. x non-numeric argument to binary operator i Input `gap` is `"2019" - "2010"`. Run `rlang::last_error()` to see where the error occurred.
The strange thing is that when I take the gapminder data from the library and reshape it to a similar wide format, there are no issues:
df: with library(gapminder) it works
library(gapminder)
gapminder %>%
filter(year == 1967 | year == 2007) %>%
select(country, year, lifeExp) %>%
spread(year, lifeExp) %>%
mutate(gap = `2007` - `1967`) %>% head()
#################### output #########################
country 1967 2007 gap
<fctr> <dbl> <dbl> <dbl>
Afghanistan 34.020 43.828 9.808
Albania 66.220 76.423 10.203
Algeria 51.407 72.301 20.894
Angola 35.985 42.731 6.746
Argentina 65.634 75.320 9.686
Both datasets look exactly the same, and in both cases I am taking the difference of year columns that are dbl, but it works in one case and not in the other.
What is going on here, where am I going wrong, and how do I fix it?
These small data type problems, where R data frames look the same visually but behave differently, are really frustrating.
Moving from Excel to Python was not as hard as moving from Python to R seems to be.
It now works using mutate(gap = .[["2019"]] - .[["2010"]]) %>% head(), but I still don't know why this is required for the new gapminder dataset when I didn't need it with the library(gapminder) dataset.
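One visible difference between the two snippets is the quoting: the failing call uses double quotes ("2019" - "2010"), while the working one uses backticks (`2007` - `1967`). select() accepts column names given as strings, but inside mutate() a double-quoted value is an ordinary character string, so the subtraction is attempted on two strings. A minimal sketch on a made-up tibble (the values are copied from the head() output above; toy is an illustrative name):

```r
library(dplyr)

# Made-up two-row table with non-syntactic column names
toy <- tibble(
  country = c("Afghanistan", "Albania"),
  `2010`  = c(543, 4090),
  `2019`  = c(571, 5210)
)

# Double quotes create character strings, so this is a string subtraction
string_attempt <- try("2019" - "2010", silent = TRUE)
inherits(string_attempt, "try-error")  # TRUE

# Backticks refer to the columns named 2019 and 2010
res <- toy %>%
  mutate(gap = `2019` - `2010`)

print(res)
```

The .[["2019"]] form works for the same reason the backticks do: it looks the column up by name instead of treating "2019" as a value.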

How can I use only one year from my dummy?

I have data on individuals for whom I know their age (18-98) and which country they are from (1 = Germany, 2 = France, etc.).
I have another variable, and I want to see the effect on it of, e.g., 18-year-old people from Germany.
With dummy(data$age, sep='_') and dummy(data$sg2, sep='_') (sg2 is the country variable) I was able to create dummy variables for these two variables.
But in the regression, the output shows the effect of every age and every country separately.
How can I combine 18-year-olds from Germany so that I see their joint effect on the other variable?
#Dummy age
x1 <- dummy(dat$age, sep='_')
#Dummy country
x2 <- dummy(dat$sg2, sep = '_')
fm <- lm(myvariable~x1+x2, data=dat)
summary(fm)
Estimate Std. Error t value Pr(>|t|)
x1age_18 1.547691 0.567995 2.725 0.006437 **
x1age_19 1.632648 0.567939 2.875 0.004047 **
x2sg2_1 0.083239 0.030118 2.764 0.005717 **
x2sg2_2 0.056555 0.030655 1.845 0.065063 .
This is what I get, but how can I get x1age_18 and x2sg2_1 combined into one term?

Problem with factor and reordering facet_grid

I have constructed a dataset from the GSS data (https://gss.norc.org/), grouping the years into decades:
env_data <- select(gss, year, sex, degree, natenvir) %>% na.omit()
env_datadecades <- env_data %>%
mutate(decade=as.factor(ifelse(year<1980,
"70s",
ifelse(year>1980 & year<=1990,
"80s",
ifelse(year>1990 & year<2000, "90s", "00s")))))
I want to plot it with ggplot2 and facet_grid(), but the order of the facets is not right, so I reordered the factor the way I had seen it done elsewhere:
set.seed(6809)
env_datadecades$decade <- factor(env_datadecades$decade,
levels = c("Seventies", "Eighties", "Nineties", "Twothous"))
It worked the first time, but when I run the code again I get NA for all values in decade. What is happening?
I just made a simple dataset of years:
df <- data.frame(Years = sample(1970:2010, 20, replace = T))
Convert it into the required factor like this:
df <- df %>%
mutate(Decades = case_when(Years < 1980 ~ "Seventies",
1980 <= Years & Years < 1990 ~ "Eighties",
1990 <= Years & Years < 2000 ~ "Nineties",
2000 <= Years ~ "TwoThousands"))
df$Decades <- factor(df$Decades, levels = c("Seventies", "Eighties", "Nineties", "TwoThousands"), ordered = T)
and now try faceting.
I think the problem with your code is that you gave the levels one set of names ("70s", "80s", ...) when you first converted the variable to a factor, and then a different set of names ("Seventies", "Eighties", ...) in the second piece of code. Values that match none of the new levels become NA. Stick to the same set of names and it should work.
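The effect of mismatched level names can be reproduced on a tiny vector. This base-R sketch uses a made-up four-element factor (decade and the other object names are illustrative); the labels argument shown at the end is one way to rename and reorder in a single step:

```r
# Made-up factor using the first naming scheme
decade <- factor(c("70s", "80s", "90s", "00s"))

# None of the existing values match these level names, so every entry becomes NA
mismatched <- factor(decade,
                     levels = c("Seventies", "Eighties", "Nineties", "Twothous"))
all(is.na(mismatched))  # TRUE

# Reordering with the names that are actually present keeps the values
reordered <- factor(decade, levels = c("70s", "80s", "90s", "00s"))
levels(reordered)  # "70s" "80s" "90s" "00s"

# To rename as well, match on the old names and supply new labels
renamed <- factor(decade,
                  levels = c("70s", "80s", "90s", "00s"),
                  labels = c("Seventies", "Eighties", "Nineties", "Twothous"))
as.character(renamed)  # "Seventies" "Eighties" "Nineties" "Twothous"
```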

Automate basic calculations with residuals in R

I have some basic calculations I want to apply on residuals of a plm model but I am stuck on how to automate the steps for a lot of data.
Let's assume the input is a data.frame (df) with the following data:
Id Year Population Y X1 X2 X3
country A 2009 977612 212451.009 19482.7995 0.346657979 0.001023221
country A 2010 985332 221431.632 18989.3 0.345142551 0.001015205
country A 2011 998211 219939.296 18277.79286 0.344020453 0.001002106
country A 2012 1010001 218487.503 17916.2765 0.342434314 0.000990409
country B 2009 150291 177665.268 18444.04522 0.330864789 0.001940218
country B 2010 150841 183819.407 18042 0.327563461 0.001933143
country B 2011 152210 183761.566 17817.3515 0.32539255 0.001915756
country B 2012 153105 182825.112 17626.62261 0.321315437 0.001904557
country c 2009 83129 132328.034 17113.64268 0.359525557 0.005862866
country c 2010 83752 137413.878 16872.5 0.357854141 0.005819254
country c 2011 84493 136002.537 16576.17856 0.356479235 0.005768219
country c 2012 84958 133064.911 16443.3057 0.355246122 0.005736648
A model was applied and the residuals are stored:
fixed <- plm(Y ~ X1 + X2 + X3,
data=df, drop.unused.levels = TRUE, index=c("Id", "Year"), model="within")
residuals <- resid(fixed)
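In my next step, I want to calculate "weighted averages" of my residuals with:
$$\bar{e}_t = \sum_i \frac{n_{it}}{n_t}\, e_{it}$$
with $e_{it}$ being the residual for country i at time t, $n_{it}$ standing for the population in country i at time t, and $n_t$ being the total population at t.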
My approach so far is:
First I compute the total population nt for every year:
year_range <- seq(from=2009,to=2012,by=1)
tot_pop = NULL
for (n in year_range)
{
tot_pop[n] = with(df, sum(Population[Year == n]))
}
Before taking the sum of the "weighted" residuals, my next step would be to automate the calculation of my "new" residuals:
res1 <- df$Population[1]/tot_pop[2009] * residuals[1]
res2 <- df$Population[2]/tot_pop[2010] * residuals[2]
res3 <- df$Population[3]/tot_pop[2011] * residuals[3]
...
res12 <- df$Population[12]/tot_pop[2012] * residuals[12]
Edit: Applying the solution of JTT to my problem, the last step would then be:
year_range1 <- rep(year_range, 3)
df_res <- data.frame(year = year_range1, res=as.vector(res))
aggr_res <- aggregate(df_res$res, list(df_res$year), sum)
colnames(aggr_res) <- c("Year", "Aggregated residual")
Is that correct?
I have tried the lapply function and a double "for-loop" without success. I don't know how to do this. Your help would be appreciated. If my question is unclear, please comment and I will try to improve it.
First, instead of a for-loop, you might want to calculate the total population using the aggregate function, e.g.:
a<-aggregate(df$Population, list(df$Year), sum)
Notice the column names of a (Group.1 and x).
Then you could match the results in a to the data in df using the match() function. It returns, for each row of df, the row number in a with the same year, which can be used to pick the right total population for the division before multiplying by the residuals. For example:
res<-df$Population/a$x[match(df$Year, a$Group.1)]*residuals
Now you should have a vector of "new" residuals in object res.
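The whole chain (totals per year, population weights, weighted residuals, sum per year) can be checked on a small example. This is a sketch with made-up populations and made-up residuals standing in for resid(fixed); all object names other than df and res are illustrative:

```r
# Made-up panel: two countries, two years
df <- data.frame(
  Id         = c("A", "A", "B", "B"),
  Year       = c(2009, 2010, 2009, 2010),
  Population = c(100, 200, 300, 200)
)

# Made-up residuals, standing in for resid(fixed) in the same row order as df
residuals_toy <- c(1, -1, 2, 3)

# Total population per year (columns are named Group.1 and x)
a <- aggregate(df$Population, list(df$Year), sum)

# Weight each residual by its country's share of that year's population
res <- df$Population / a$x[match(df$Year, a$Group.1)] * residuals_toy

# Sum the weighted residuals within each year
aggr_res <- aggregate(res, list(Year = df$Year), sum)
colnames(aggr_res) <- c("Year", "Aggregated residual")

print(aggr_res)
```

Grouping by df$Year directly avoids having to rebuild a year vector with rep(), which only lines up if the rows are sorted by country and then year.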
