Average of entity values in panel data - r

I have a panel dataset with entries for every country in 5-year intervals (earliest 1960, latest 2000). Each entry has values such as a democracy index, log GDP, etc. I want to find the average democracy index for each country over all periods for which it has entries. There are some NA values.
An example is
Andorra 1960 NA
Andorra 1965 NA
Andorra 1970 0.50
Andorra 1975 NA
Andorra 1980 NA
Andorra 1985 NA
Andorra 1990 NA
Andorra 1995 1.00
Afghanistan 1960 0.14
and so on.
Each country also has a code value, starting at 1 for Andorra, increasing as you go down the alphabet (so Andorra is 1, Afghanistan is 2, Angola is 3, and so on).
I have looked at other panel data questions but they seem either irrelevant or the code is too complex for me to see if it is relevant. Do you have any recommendations?
Thank you in advance.

We can use aggregate from base R to get the mean of the 'democracy_index' column, grouped by the 'country' column:
aggregate(democracy_index ~ country, df1, mean, na.rm = TRUE, na.action = NULL)
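If you prefer the tidyverse, a dplyr equivalent would be the following sketch, assuming the data frame is df1 and the columns are named country and democracy_index (mean_democracy is an illustrative name):
library(dplyr)

df1 %>%
  group_by(country) %>%
  summarise(mean_democracy = mean(democracy_index, na.rm = TRUE))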

Related

Scatter plot with variables that have multiple different years

I'm currently trying to make a scatter plot of child mortality rate and child labor. My problem is that I don't have a lot of data: some countries only have values for certain years, and other countries only have data for different years, so I can't plot all the data together, nor is the data for any single year complete enough to restrict the plot to that year alone. I was wondering if there is a function that takes the last value available in the dataset for any given variable. For instance, if my last data point for child labor in Germany is from 2015 and my last one for Italy is from 2014, and so on for the rest of the countries, is there a way I can plot the last values for each country?
Code goes like this:
head(data2)
# A tibble: 6 x 5
Entity Code Year mortality labor
<chr> <chr> <dbl> <dbl> <dbl>
1 Afghanistan AFG 1962 34.5 NA
2 Afghanistan AFG 1963 33.9 NA
3 Afghanistan AFG 1964 33.3 NA
4 Afghanistan AFG 1965 32.8 NA
5 Afghanistan AFG 1966 32.2 NA
6 Afghanistan AFG 1967 31.7 NA
Never mind about those NA's. Labor data just doesn't go back there. But I do have it in the dataset, for more recent years. Child mortality data, on the other hand, is actually pretty complete.
Thanks.
I'm not sure which variables you want to plot, but the following code selects only the last record for each country.
data2 %>%
group_by(Entity) %>%
filter(Year == max(Year)) %>%
ungroup
The result looks like:
Entity Code Year mortality labor
<chr> <chr> <dbl> <dbl> <lgl>
1 Afghanistan AFG 1967 31.7 NA
Now you can plot whichever variables you need.
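If your dplyr version has slice_max, a newer idiom for the same selection would be this sketch:
data2 %>%
  group_by(Entity) %>%
  slice_max(Year, n = 1) %>%   # keep the row(s) with the latest Year per country
  ungroup()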
You might want to define what you mean by 'last' value per group - as in most recent, last occurrence in the data or something else?
dplyr::last picks out the last occurrence in the data, so you could use it along with arrange to order your data. In this example we sort the data by Year (ascending order by default), so the last observation will be the most recent. Assuming you don't want to include NA values, we also use filter to remove them from the data.
data2 %>%
# first remove NAs from the data
filter(
!is.na(labor)
) %>%
# then sort the data by Year
arrange(Year) %>%
# then extract the last observation per country
group_by(Entity) %>%
summarise(
last_record = last(labor)
)
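If the goal is the scatter plot itself, a sketch along these lines should work, assuming ggplot2 and the column names shown above (last_mortality and last_labor are illustrative names; the values come from the most recent year in which both variables are present):
library(dplyr)
library(ggplot2)

last_values <- data2 %>%
  filter(!is.na(labor), !is.na(mortality)) %>%  # drop rows missing either value
  arrange(Year) %>%                             # sort so last() picks the most recent
  group_by(Entity) %>%
  summarise(last_mortality = last(mortality),
            last_labor = last(labor))

ggplot(last_values, aes(x = last_labor, y = last_mortality)) +
  geom_point()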

How to add variable from one dataframe to another dataframe (several conditions)

I had a read through the existing topics, but nothing I've read matched the thing I want to do.
dataframe 1: newdata (excerpt)
country year sector emissions
Austria 1990 Total 6.229223e+04
Austria 1990 Regulated 3.826440e+04
Austria 1990 Unregulated 2.402783e+04
Austria 1991 Total 6.589968e+04
Austria 1991 Regulated 3.931820e+04
Austria 1991 Unregulated 2.658148e+04
dataframe 2: EUETS (excerpt)
country year emissions
Austria 2005 164925659
Belgium 2005 282762153
Croatia 2005 0
Cyprus 2005 16021583
Czech Republic 2005 288986144
Denmark 2005 171815416
Estonia 2005 71336242
What I want to do:
Add information from EUETS$emissions to a new column newdata$EUETS
this insertion should be based on country and year, and the value should go in the row for that country and year where newdata$sector = "Regulated"
newdata$sector = "Unregulated" and newdata$sector = "Total" need to receive NA, and under no circumstances 0
if there is no corresponding information in EUETS$country and/or EUETS$year, NA should be inserted into newdata$EUETS
if there is information in EUETS$emissions but no matching year and/or country in newdata, a new row shall be created for this information, filling in the values from EUETS as above but inserting NA in the new cells for newdata$emissions in the Total and Unregulated rows.
This should look like this:
country year sector emissions EUETS
Austria 1990 Total 6.229223e+04 NA
Austria 1990 Regulated 3.826440e+04 2516843
Austria 1990 Unregulated 2.402783e+04 NA
Austria 1991 Total 6.589968e+04 NA
Austria 1991 Regulated 3.931820e+04 446656
Austria 1991 Unregulated 2.658148e+04 NA
Liechtenstein 2005 Total NA NA
Liechtenstein 2005 Regulated NA 654612641
Liechtenstein 2005 Unregulated NA NA
Liechtenstein was only in EUETS$country and didn't exist in newdata$country and was consequently added to the latter dataframe.
This may be several questions in one post, but I hope this is appropriate to ask here. I tried a few things myself, but didn't manage, especially when it comes to filling in the values matched on the existing columns in newdata (country and year).
I appreciate help with any part of this task.
Thanks so much in advance!
Nordsee
First, change the EUETS column name and add the sector column as you want them to show up in the end:
names(EUETS)[3] = "EUETS"
EUETS$sector = "Regulated"
Make sure your original sector column is a character, not a factor:
newdata$sector = as.character(newdata$sector)
Merge the data:
result = merge(newdata, EUETS, all = TRUE)
For adding unrepresented countries back into EUETS, I'm not sure what year and emissions values you want to add in, so I'll ignore that for now. But basically you want to use merge again.
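A dplyr alternative, as a sketch (assuming the column names shown above; the join keys are country, year and sector, so the EUETS value only attaches to the Regulated rows):
library(dplyr)

euets2 <- EUETS %>%
  rename(EUETS = emissions) %>%   # the value column, renamed as it should appear
  mutate(sector = "Regulated")    # so it only matches the Regulated rows

result <- full_join(newdata, euets2, by = c("country", "year", "sector"))
As with merge(all = TRUE), this only creates the Regulated row for countries that exist solely in EUETS; the extra Total and Unregulated rows would need a second step.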

Removing certain rows whose column value does not match another column (all within the same data frame)

I'm attempting to remove all of the rows (cases) within a data frame in which a certain column's value does not match another column value.
The data frame bilat_total contains these 10 columns/variables:
bilat_total[,c("year", "importer1", "importer2", "flow1",
"flow2", "country", "imports", "exports", "bi_tot",
"mother")]
Thus the table's head is:
year importer1 importer2 flow1 flow2 country
6 2009 Afghanistan Bhutan NA NA Afghanistan
11 2009 Afghanistan Solomon Islands NA NA Afghanistan
12 2009 Afghanistan India 516.13 120.70 Afghanistan
13 2009 Afghanistan Japan 124.21 0.46 Afghanistan
15 2009 Afghanistan Maldives NA NA Afghanistan
19 2009 Afghanistan Bangladesh 4.56 1.09 Afghanistan
imports exports bi_tot mother
6 6689.35 448.25 NA United Kingdom
11 6689.35 448.25 NA United Kingdom
12 6689.35 448.25 1.804361e-02 United Kingdom
13 6689.35 448.25 6.876602e-05 United Kingdom
15 6689.35 448.25 NA United Kingdom
19 6689.35 448.25 1.629456e-04 United Kingdom
I've attempted to remove all the cases in which importer2 does not match mother by making a subset:
subset(bilat_total, importer2 == mother)
But each time I do, I get the error:
Error in Ops.factor(importer2, mother) : level sets of factors are different
How would I go about dropping all the rows/cases in which importer 2 and mother don't match?
The error may be because the columns are factor class. We can convert the columns to character class and then compare to subset the rows.
subset(bilat_total, as.character(importer2) == as.character(mother))
Based on the data example shown:
subset(bilat_total, importer2 == mother)
# Error in Ops.factor(importer2, mother) :
# level sets of factors are different
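For completeness, a dplyr version of the same idea, as a sketch (convert both columns to character before comparing):
library(dplyr)

bilat_total %>%
  mutate(importer2 = as.character(importer2),
         mother = as.character(mother)) %>%
  filter(importer2 == mother)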

How to reshape this complicated data frame?

Here is first 4 rows of my data;
X...Country.Name Country.Code Indicator.Name
1 Turkey TUR Inflation, GDP deflator (annual %)
2 Turkey TUR Unemployment, total (% of total labor force)
3 Afghanistan AFG Inflation, GDP deflator (annual %)
4 Afghanistan AFG Unemployment, total (% of total labor force)
Indicator.Code X2010
1 NY.GDP.DEFL.KD.ZG 5.675740
2 SL.UEM.TOTL.ZS 11.900000
3 NY.GDP.DEFL.KD.ZG 9.437322
4 SL.UEM.TOTL.ZS NA
I want my data reshaped into two columns, one for each indicator code, and I want each row to correspond to a country, something like this:
Country Name NY.GDP.DEFL.KD.ZG SL.UEM.TOTL.ZS
Turkey 5.6 11.9
Afghanistan 9.43 NA
I think I could do this with Excel, but I want to learn the R way, so that I don't need to rely on Excel every time I have a problem. Here is the dput of the data if you need it.
Edit: I actually want 3 columns, one for each indicator and one for the country's name.
Sticking with base R, use reshape. I took the liberty of cleaning up the column names. Here, I'm only showing you a few rows of the output. Remove head to see the full output. This assumes your data.frame is named "mydata".
names(mydata) <- c("CountryName", "CountryCode",
"IndicatorName", "IndicatorCode", "X2010")
head(reshape(mydata[-c(2:3)],
direction = "wide",
idvar = "CountryName",
timevar = "IndicatorCode"))
# CountryName X2010.NY.GDP.DEFL.KD.ZG X2010.SL.UEM.TOTL.ZS
# 1 Turkey 5.675740 11.9
# 3 Afghanistan 9.437322 NA
# 5 Albania 3.459343 NA
# 7 Algeria 16.245617 11.4
# 9 American Samoa NA NA
# 11 Andorra NA NA
Another option in base R is xtabs, but NA gets replaced with 0:
head(xtabs(X2010 ~ CountryName + IndicatorCode, mydata))
# IndicatorCode
# CountryName NY.GDP.DEFL.KD.ZG SL.UEM.TOTL.ZS
# Afghanistan 9.437322 0.0
# Albania 3.459343 0.0
# Algeria 16.245617 11.4
# American Samoa 0.000000 0.0
# Andorra 0.000000 0.0
# Angola 22.393924 0.0
The result of xtabs is a matrix, so if you want a data.frame, wrap the output with as.data.frame.matrix.
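If you later want a tidyverse route, tidyr's pivot_wider does the same reshape; a sketch, assuming the cleaned column names from above:
library(tidyr)

pivot_wider(mydata[-c(2:3)],
            names_from = IndicatorCode,
            values_from = X2010)
This keeps CountryName as the identifier and makes one column per indicator code, with NA where a country has no value.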

How to get column mean for specific rows only?

I need to get the mean of one column (here: score) for specific rows (here: years). Specifically, I would like to know the average score for three periods:
period 1: year <= 1983
period 2: year >= 1984 & year <= 1990
period 3: year >= 1991
This is the structure of my data:
country year score
Algeria 1980 -1.1201501
Algeria 1981 -1.0526943
Algeria 1982 -1.0561565
Algeria 1983 -1.1274560
Algeria 1984 -1.1353926
Algeria 1985 -1.1734330
Algeria 1986 -1.1327666
Algeria 1987 -1.1263586
Algeria 1988 -0.8529455
Algeria 1989 -0.2930265
Algeria 1990 -0.1564207
Algeria 1991 -0.1526328
Algeria 1992 -0.9757842
Algeria 1993 -0.9714060
Algeria 1994 -1.1422258
Algeria 1995 -0.3675797
...
The calculated mean values should be added to the df in an additional column ("mean"), i.e. same mean value for years of period 1, for those of period 2 etc.
This is how it should look:
country year score mean
Algeria 1980 -1.1201501 -1.089
Algeria 1981 -1.0526943 -1.089
Algeria 1982 -1.0561565 -1.089
Algeria 1983 -1.1274560 -1.089
Algeria 1984 -1.1353926 -0.839
Algeria 1985 -1.1734330 -0.839
Algeria 1986 -1.1327666 -0.839
Algeria 1987 -1.1263586 -0.839
Algeria 1988 -0.8529455 -0.839
Algeria 1989 -0.2930265 -0.839
Algeria 1990 -0.1564207 -0.839
...
Every approach I tried quickly became very complicated, and I have to calculate the mean scores for different periods of time for over 90 countries ...
Many many thanks for your help!
datfrm$mean <- with(datfrm,
  ave(score, findInterval(year, c(-Inf, 1984, 1991, Inf)), FUN = mean))
The title question is a bit different from the real question and would be answered by using logical indexing. If one wanted only the mean for a particular subset, say year >= 1984 & year <= 1990, it would be done via:
mn84_90 <- with(datfrm, mean(score[year >= 1984 & year <= 1990]) )
Since findInterval requires year to be sorted (as it is in your example) I'd be tempted to use cut in case it isn't sorted [proved wrong, thanks #DWin]. For completeness the data.table equivalent (scales for large data) is :
require(data.table)
DT = as.data.table(DF) # or just start with a data.table in the first place
DT[, mean:=mean(score), by=cut(year,c(-Inf,1984,1991,Inf))]
or, using findInterval as DWin did, which is likely faster:
DT[, mean:=mean(score), by=findInterval(year,c(-Inf,1984,1991,Inf))]
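A dplyr version of the same grouped mean, as a sketch (note the breaks differ from the findInterval call because cut defaults to right-closed intervals):
library(dplyr)

datfrm %>%
  group_by(period = cut(year, c(-Inf, 1983, 1990, Inf))) %>%  # three periods
  mutate(mean = mean(score)) %>%                              # same mean for every row in a period
  ungroup()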
If the rows are ordered by year, I think the easiest way to accomplish this would be:
m80_83 <- mean(dataframe[1:4, 3])  # finds the mean of the values of column 3 for rows 1 through 4
m84_90 <- mean(dataframe[5:10,3])
#etc.
If the rows are not ordered by year, I would use tapply like this.
list.of.means <- c(tapply(dataframe$score, cut(dataframe$year, c(0, 1983.5, 1990.5, 3000)), mean))
Here, tapply takes three parameters:
First, the data you want to do stuff with (in this case, dataframe$score).
Second, a grouping factor, here produced by cut(), which splits that data into groups. In this case, it cuts the data into three groups based on the dataframe$year values: Group 1 includes all rows with dataframe$year values from 0 to 1983.5, Group 2 those from 1983.5 to 1990.5, and Group 3 those from 1990.5 to 3000.
Third, a function that is applied within each group to the data you selected as your first parameter (here, mean).
So, list.of.means should be a list of the 3 values you are looking for.
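To attach those period means back to every row, which is what the question ultimately asks for, the same grouping can be fed to ave, as in the first answer; a sketch using the cut breaks from above:
dataframe$mean <- ave(dataframe$score,
                      cut(dataframe$year, c(0, 1983.5, 1990.5, 3000)),
                      FUN = mean)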
