Creating a Variable Initial Values from a base variable in Panel Data Structure in R - r

I'm trying to create a new variable in R containing the initial values of another variable (crime) based on groups (countries) considering the initial period of time observable per group (on panel data framework), my current data looks like this:
country
year
Crime
Albania
2016
2.7369478
Albania
2017
2.0109779
Argentina
2002
9.474084
Argentina
2003
7.7898825
Argentina
2004
6.0739941
And I want it to look like this:
country
year
Crime
Initial_Crime
Albania
2016
2.7369478
2.7369478
Albania
2017
2.0109779
2.7369478
Argentina
2002
9.474084
9.474084
Argentina
2003
7.7898825
9.474084
Argentina
2004
6.0739941
9.474084
I saw that ddply could make it work this way, but the problem is that it is not longer supported by the latest R updates.
Thank you in advance.

Maybe arrange by year, then after grouping by country set Initial_Crime to be the first Crime in the group.
library(tidyverse)
df %>%
arrange(year) %>%
group_by(country) %>%
mutate(Initial_Crime = first(Crime))
Output
country year Crime Initial_Crime
<chr> <int> <dbl> <dbl>
1 Argentina 2002 9.47 9.47
2 Argentina 2003 7.79 9.47
3 Argentina 2004 6.07 9.47
4 Albania 2016 2.74 2.74
5 Albania 2017 2.01 2.74

library(data.table)
setDT(data)[, Initial_Crime:=.SD[1,Crime], by=country]
country year Crime Initial_Crime
1: Albania 2016 2.736948 2.736948
2: Albania 2017 2.010978 2.736948
3: Argentina 2002 9.474084 9.474084
4: Argentina 2003 7.789883 9.474084
5: Argentina 2004 6.073994 9.474084

A data.table solution
setDT(df)
df[, x := 1:.N, country
][x==1, initial_crime := crime
][, initial_crime := nafill(initial_crime, type = "locf")
][, x := NULL
]

Related

Reshaping Dataframe in R (melt?)

So, I currently have a dataframe that looks like:
country continent year lifeExp pop gdpPercap
<fctr> <fctr> <int> <dbl> <int> <dbl>
1 Afghanistan Asia 1952 28.801 8425333 779.4453
2 Afghanistan Asia 1957 30.332 9240934 820.8530
3 Afghanistan Asia 1962 31.997 10267083 853.1007
4 Afghanistan Asia 1967 34.020 11537966 836.1971
5 Afghanistan Asia 1972 36.088 13079460 739.9811
6 Afghanistan Asia 1977 38.438 14880372 786.1134
There are 140+ countries. The years are in 5 year intervals. From 1952- 2007 I want to reshape my dataframe such that I get.
Country gdpPercap(1952) gdpPercap(1957) ... gdpPercap(2007)
<fctr> <dbl>
1 Afghanistan 974.5803 .... ...
2 Albania 5937.0295 ... ...
3 Algeria 6223.3675 ... ...
4 Angola 4797.2313
5 Argentina 12779.3796
6 Australia 34435.3674
7 Austria 36126.4927
8 Bahrain 29796.0483
9 Bangladesh 1391.2538
10 Belgium 33692.6051
My attempt is this:
gapminder %>% #my dataframe
filter(year >= 1952) %>%
group_by(country) %>%
summarise(gdpPercap = mean(gdpPercap))
OUTPUT:
country gdpPercap <- but this takes the mean of gdpPercap from 1952-2007
<fctr> <dbl>
1 Afghanistan 802.6746
2 Albania 3255.3666
3 Algeria 4426.0260
4 Angola 3607.1005
5 Argentina 8955.5538
6 Australia 19980.5956
7 Austria 20411.9163
8 Bahrain 18077.6639
9 Bangladesh 817.5588
10 Belgium 19900.7581
# ... with 132 more rows
Any ideas? PS: I'm new to R. I'm also looking at melt(). Any help will be appreciated!
tidyr::spread() would solve your problem
library(dplyr); library(tidyr)
gapminder %>%
select(country, year, gdpPercap) %>%
spread(year, gdpPercap)
You should use year also in group_by, and after summary, just reshape the data the way you want using dcast or rehape
Here is a sample solution :
library(dplyr)
library(reshape2)
gapminder <- data.frame(cbind(gdpPercap=runif(10000), year =as.integer(seq(from=1952, to=2007, by=5)), country = c("India", "US", "UK")))
gapminder$gdpPercap <- as.numeric(as.character(gapminder$gdpPercap))
gapminder$year <- as.integer(as.character(gapminder$year))
gapminder %>% #my dataframe
filter(year >= 1952) %>%
group_by(country, year) %>%
summarise(gdpPercap = mean(gdpPercap)) %>%
dcast(country ~ year, value.var="gdpPercap")
I have to generate a new data, because your example is not reproducible. Go through the link How to make a great R reproducible example?. It helps in answering and understanding the problem, as well as, quicker answers.
Built-in reshape can do this.
foo.data.frame <- data.frame(
Country=rep(c("Here", "There"), each=3),
year=rep(c(1952, 1957, 1962),2),
gdpPercap=779:784
# ... other variables
)
reshape(foo.data.frame[, c("Country", "year", "gdpPercap")],
timevar="year", idvar="Country", direction="wide", sep=" ")
# Country gdpPercap 1952 gdpPercap 1957 gdpPercap 1962
# 1 Here 779 780 781
# 4 There 782 783 784

decompose() for yearly time series in R

I'm trying to perform analysis on a time series data of inflation rates from the year 1960 to 2015. The dataset is a yearly time series over 56 years with 1 real value per each year, which is the following:
Year Inflation percentage
1960 1.783264746
1961 1.752021563
1962 3.57615894
1963 2.941176471
1964 13.35403727
1965 9.479452055
1966 10.81081081
1967 13.0532972
1968 2.996404315
1969 0.574712644
1970 5.095238095
1971 3.081105573
1972 6.461538462
1973 16.92815855
1974 28.60169492
1975 5.738605162
1976 -7.63438068
1977 8.321619342
1978 2.517518817
1979 6.253164557
1980 11.3652609
1981 13.11510484
1982 7.887270664
1983 11.86886396
1984 8.32157969
1985 5.555555556
1986 8.730811404
1987 8.798689021
1988 9.384775808
1989 3.26256011
1990 8.971233545
1991 13.87024609
1992 11.78781925
1993 6.362038664
1994 10.21150033
1995 10.22488756
1996 8.977149075
1997 7.16425362
1998 13.2308409
1999 4.669821024
2000 4.009433962
2001 3.684807256
2002 4.392199745
2003 3.805865922
2004 3.76723848
2005 4.246353323
2006 6.145522388
2007 6.369996746
2008 8.351816444
2009 10.87739112
2010 11.99229692
2011 8.857845297
2012 9.312445605
2013 10.90764331
2014 6.353194544
2015 5.872426595
'stock1' contains my data where the first column stands for Year, and the second for 'Inflation.percentage', as follows:
stock1<-read.csv("India-Inflation time series.csv", header=TRUE, stringsAsFactors=FALSE, as.is=TRUE)
The following is my code for creating the time series object:
stock <- ts(stock1$Inflation.percentage,start=(1960), end=(2015),frequency=1)
Following this, I am trying to decompose the time series object 'stock' using the following line of code:
decom_add <- (decompose(stock, type ="additive"))
Here I get an error:
Error in decompose(stock, type = "additive") : time series has no
or less than 2 periods
Why is this so? I initially thought it has something to do with frequency, but since the data is annual, the frequency has to be 1 right? If it is 1, then aren't there definitely more than 2 periods in the data?
Why isn't decompose() working? What am I doing wrong?
Thanks a lot in advance!
Please try for frequency=2, because frequency needs to be greater than 1. Because this action will change your model, for me the better way is to load data which contain and month column, so the frequency will be 12.

How to reshape this complicated data frame?

Here is first 4 rows of my data;
X...Country.Name Country.Code Indicator.Name
1 Turkey TUR Inflation, GDP deflator (annual %)
2 Turkey TUR Unemployment, total (% of total labor force)
3 Afghanistan AFG Inflation, GDP deflator (annual %)
4 Afghanistan AFG Unemployment, total (% of total labor force)
Indicator.Code X2010
1 NY.GDP.DEFL.KD.ZG 5.675740
2 SL.UEM.TOTL.ZS 11.900000
3 NY.GDP.DEFL.KD.ZG 9.437322
4 SL.UEM.TOTL.ZS NA
I want my data reshaped into two colums, one of each Indicator code, and I want each row correspond to a country, something like this;
Country Name NY.GDP.DEFL.KD.ZG SL.UEM.TOTL.ZS
Turkey 5.6 11.9
Afghanistan 9.43 NA
I think I could do this with Excel, but I want to learn the R way, so that I don't need to rely on excel everytime I have a problem. Here is dput of data if you need it.
Edit: I actually want 3 colums, one for each indicator and one for the country's name.
Sticking with base R, use reshape. I took the liberty of cleaning up the column names. Here, I'm only showing you a few rows of the output. Remove head to see the full output. This assumes your data.frame is named "mydata".
names(mydata) <- c("CountryName", "CountryCode",
"IndicatorName", "IndicatorCode", "X2010")
head(reshape(mydata[-c(2:3)],
direction = "wide",
idvar = "CountryName",
timevar = "IndicatorCode"))
# CountryName X2010.NY.GDP.DEFL.KD.ZG X2010.SL.UEM.TOTL.ZS
# 1 Turkey 5.675740 11.9
# 3 Afghanistan 9.437322 NA
# 5 Albania 3.459343 NA
# 7 Algeria 16.245617 11.4
# 9 American Samoa NA NA
# 11 Andorra NA NA
Another option in base R is xtabs, but NA gets replaced with 0:
head(xtabs(X2010 ~ CountryName + IndicatorCode, mydata))
# IndicatorCode
# CountryName NY.GDP.DEFL.KD.ZG SL.UEM.TOTL.ZS
# Afghanistan 9.437322 0.0
# Albania 3.459343 0.0
# Algeria 16.245617 11.4
# American Samoa 0.000000 0.0
# Andorra 0.000000 0.0
# Angola 22.393924 0.0
The result of xtabs is a matrix, so if you want a data.frame, wrap the output with as.data.frame.matrix.

How to get column mean for specific rows only?

I need to get the mean of one column (here: score) for specific rows (here: years). Specifically, I would like to know the average score for three periods:
period 1: year <= 1983
period 2: year >= 1984 & year <= 1990
period 3: year >= 1991
This is the structure of my data:
country year score
Algeria 1980 -1.1201501
Algeria 1981 -1.0526943
Algeria 1982 -1.0561565
Algeria 1983 -1.1274560
Algeria 1984 -1.1353926
Algeria 1985 -1.1734330
Algeria 1986 -1.1327666
Algeria 1987 -1.1263586
Algeria 1988 -0.8529455
Algeria 1989 -0.2930265
Algeria 1990 -0.1564207
Algeria 1991 -0.1526328
Algeria 1992 -0.9757842
Algeria 1993 -0.9714060
Algeria 1994 -1.1422258
Algeria 1995 -0.3675797
...
The calculated mean values should be added to the df in an additional column ("mean"), i.e. same mean value for years of period 1, for those of period 2 etc.
This is how it should look like:
country year score mean
Algeria 1980 -1.1201501 -1.089
Algeria 1981 -1.0526943 -1.089
Algeria 1982 -1.0561565 -1.089
Algeria 1983 -1.1274560 -1.089
Algeria 1984 -1.1353926 -0.839
Algeria 1985 -1.1734330 -0.839
Algeria 1986 -1.1327666 -0.839
Algeria 1987 -1.1263586 -0.839
Algeria 1988 -0.8529455 -0.839
Algeria 1989 -0.2930265 -0.839
Algeria 1990 -0.1564207 -0.839
...
Every possible path I tried got easily super complicated - and I have to calculate the mean scores for different periods of time for over 90 countries ...
Many many thanks for your help!
datfrm$mean <-
with (datfrm, ave( score, findInterval(year, c(-Inf, 1984, 1991, Inf)), FUN= mean) )
The title question is a bit different than the real question and would be answered by using logical indexing. If one wanted only the mean for a particular subset say year >= 1984 & year <= 1990 it would be done via:
mn84_90 <- with(datfrm, mean(score[year >= 1984 & year <= 1990]) )
Since findInterval requires year to be sorted (as it is in your example) I'd be tempted to use cut in case it isn't sorted [proved wrong, thanks #DWin]. For completeness the data.table equivalent (scales for large data) is :
require(data.table)
DT = as.data.table(DF) # or just start with a data.table in the first place
DT[, mean:=mean(score), by=cut(year,c(-Inf,1984,1991,Inf))]
or findInterval is likely faster as DWin used :
DT[, mean:=mean(score), by=findInterval(year,c(-Inf,1984,1991,Inf))]
If the rows are ordered by year, I think the easiest way to accomplish this would be:
m80_83 <- mean(dataframe[1:4,3]) #Finds the mean of the values of column 3 for rows 1 through 4
m84_90 <- mean(dataframe[5:10,3])
#etc.
If the rows are not ordered by year, I would use tapply like this.
list.of.means <- c(tapply(dataframe$score, cut(dataframe$year, c(0,1983.5, 1990.5, 3000)), mean)
Here, tapply takes three parameters:
First, the data you want to do stuff with (in this case, datafram$score).
Second, a function that cuts that data up into groups. In this case, it will cut the data into three groups based on the dataframe$year values. Group 1 will include all rows with dataframe$year values from 0 to 1983.5, Group 2 will include all rows with dataframe$year values from 1983.5 to 1990.5, and Group 3 will include all rows with dataframe$year values from 1983.5 to 3000.
Third, a function that is applied to each group. This function will apply to the data you selected as your first parameter.
So, list.of.means should be a list of the 3 values you are looking for.

R: Calculating 5 year averages in panel data

I have a balanced panel by country from 1951 to 2007 in a data frame. I'd like to transform it into a new data frame of five year averages of my other variables. When I sat down to do this I realized the only way I could think to do this involved a for loop and then decided that it was time to come to stackoverflow for help.
So, is there an easy way to turn data that looks like this:
country country.isocode year POP ci grgdpch
Argentina ARG 1951 17517.34 18.445022145 3.4602044759
Argentina ARG 1952 17876.96 17.76066507 -7.887407586
Argentina ARG 1953 18230.82 18.365255769 2.3118720688
Argentina ARG 1954 18580.56 16.982113434 1.5693778844
Argentina ARG 1955 18927.82 17.488907008 5.3690276523
Argentina ARG 1956 19271.51 15.907756547 0.3125559183
Argentina ARG 1957 19610.54 17.028450999 2.4896639667
Argentina ARG 1958 19946.54 17.541597134 5.0025894968
Argentina ARG 1959 20281.15 16.137310492 -6.763501447
Argentina ARG 1960 20616.01 20.519539628 8.481742144
...
Venezuela VEN 1997 22361.80 21.923577413 5.603872759
Venezuela VEN 1998 22751.36 24.451736863 -0.781844721
Venezuela VEN 1999 23128.64 21.585034168 -8.728234466
Venezuela VEN 2000 23492.75 20.224310777 2.6828641218
Venezuela VEN 2001 23843.87 23.480311721 0.2476965412
Venezuela VEN 2002 24191.77 16.290691319 -8.02535946
Venezuela VEN 2003 24545.43 10.972153646 -8.341989049
Venezuela VEN 2004 24904.62 17.147693312 14.644028806
Venezuela VEN 2005 25269.18 18.805970212 7.3156977879
Venezuela VEN 2006 25641.46 22.191098769 5.2737381326
Venezuela VEN 2007 26023.53 26.518210052 4.1367897561
into something like this:
country country.isocode period AvPOP Avci Avgrgdpch
Argentina ARG 1 18230 17.38474 1.423454
...
Venezuela VEN 12 25274 21.45343 5.454334
Do I need to transform this data frame using a specific panel data package? Or is there another easy way to do this that I'm missing?
This is the stuff aggregate is made for. :
Df <- data.frame(
year=rep(1951:1970,2),
country=rep(c("Arg","Ven"),each=20),
var1 = c(1:20,51:70),
var2 = c(20:1,70:51)
)
Level <-cut(Df$year,seq(1951,1971,by=5),right=F)
id <- c("var1","var2")
> aggregate(Df[id],list(Df$country,Level),mean)
Group.1 Group.2 var1 var2
1 Arg [1951,1956) 3 18
2 Ven [1951,1956) 53 68
3 Arg [1956,1961) 8 13
4 Ven [1956,1961) 58 63
5 Arg [1961,1966) 13 8
6 Ven [1961,1966) 63 58
7 Arg [1966,1971) 18 3
8 Ven [1966,1971) 68 53
The only thing you might want to do, is to rename the categories and the variable names.
For this type of problem, the plyr package is truely phenomenal. Here is some code that gives you what you want in essentially a single line of code plus a small helper function.
library(plyr)
library(zoo)
library(pwt)
# First recreate dataset, using package pwt
data(pwt6.3)
pwt <- pwt6.3[
pwt6.3$country %in% c("Argentina", "Venezuela"),
c("country", "isocode", "year", "pop", "ci", "rgdpch")
]
# Use rollmean() in zoo as basis for defining a rolling 5-period rolling mean
rollmean5 <- function(x){
rollmean(x, 5)
}
# Use ddply() in plyr package to create rolling average per country
pwt.ma <- ddply(pwt, .(country), numcolwise(rollmean5))
Here is the output from this:
> head(pwt, 10)
country isocode year pop ci rgdpch
ARG-1950 Argentina ARG 1950 17150.34 13.29214 7736.338
ARG-1951 Argentina ARG 1951 17517.34 18.44502 8004.031
ARG-1952 Argentina ARG 1952 17876.96 17.76067 7372.721
ARG-1953 Argentina ARG 1953 18230.82 18.36526 7543.169
ARG-1954 Argentina ARG 1954 18580.56 16.98211 7661.550
ARG-1955 Argentina ARG 1955 18927.82 17.48891 8072.900
ARG-1956 Argentina ARG 1956 19271.51 15.90776 8098.133
ARG-1957 Argentina ARG 1957 19610.54 17.02845 8299.749
ARG-1958 Argentina ARG 1958 19946.54 17.54160 8714.951
ARG-1959 Argentina ARG 1959 20281.15 16.13731 8125.515
> head(pwt.ma)
country year pop ci rgdpch
1 Argentina 1952 17871.20 16.96904 7663.562
2 Argentina 1953 18226.70 17.80839 7730.874
3 Argentina 1954 18577.53 17.30094 7749.694
4 Argentina 1955 18924.25 17.15450 7935.100
5 Argentina 1956 19267.39 16.98977 8169.456
6 Argentina 1957 19607.51 16.82080 8262.250
Note that rollmean(), by default, calculates the centred moving mean. You can modify this behaviour to get the left or right moving mean by passing this parameter to the helper function.
EDIT:
#Joris Meys gently pointed out that you might in fact be after the average for five-year periods.
Here is the modified code to do this:
pwt$period <- cut(pwt$year, seq(1900, 2100, 5))
pwt.ma <- ddply(pwt, .(country, period), numcolwise(mean))
pwt.ma
And the output:
> pwt.ma
country period year pop ci rgdpch
1 Argentina (1945,1950] 1950.0 17150.336 13.29214 7736.338
2 Argentina (1950,1955] 1953.0 18226.699 17.80839 7730.874
3 Argentina (1955,1960] 1958.0 19945.149 17.42693 8410.610
4 Argentina (1960,1965] 1963.0 21616.623 19.09067 9000.918
5 Argentina (1965,1970] 1968.0 23273.736 18.89005 10202.665
6 Argentina (1970,1975] 1973.0 25216.339 19.70203 11348.321
7 Argentina (1975,1980] 1978.0 27445.430 23.34439 11907.939
8 Argentina (1980,1985] 1983.0 29774.778 17.58909 10987.538
9 Argentina (1985,1990] 1988.0 32095.227 15.17531 10313.375
10 Argentina (1990,1995] 1993.0 34399.829 17.96758 11221.807
11 Argentina (1995,2000] 1998.0 36512.422 19.03551 12652.849
12 Argentina (2000,2005] 2003.0 38390.719 15.22084 12308.493
13 Argentina (2005,2010] 2006.5 39831.625 21.11783 14885.227
14 Venezuela (1945,1950] 1950.0 5009.006 41.07972 7067.947
15 Venezuela (1950,1955] 1953.0 5684.009 44.60849 8132.041
16 Venezuela (1955,1960] 1958.0 6988.078 37.87946 9468.001
17 Venezuela (1960,1965] 1963.0 8451.073 26.93877 9958.935
18 Venezuela (1965,1970] 1968.0 10056.910 28.66512 11083.242
19 Venezuela (1970,1975] 1973.0 11903.185 32.02671 12862.966
20 Venezuela (1975,1980] 1978.0 13927.882 36.35687 13530.556
21 Venezuela (1980,1985] 1983.0 16082.694 22.21093 10762.718
22 Venezuela (1985,1990] 1988.0 18382.964 19.48447 10376.123
23 Venezuela (1990,1995] 1993.0 20680.645 19.82371 10988.096
24 Venezuela (1995,2000] 1998.0 22739.062 20.93509 10837.580
25 Venezuela (2000,2005] 2003.0 24550.973 17.33936 10085.322
26 Venezuela (2005,2010] 2006.5 25832.495 24.35465 11790.497
Use cut on your year variable to make the period variable, then use melt and cast from the reshape package to get the averages. There's a lot of other answers that can show you how; see https://stackoverflow.com/questions/tagged/r+reshape
There is a base stats and a plyr answer, so for completeness, here is a dplyr based answer. Using the toy data given by Joris, we have
Df <- data.frame(
year=rep(1951:1970,2),
country=rep(c("Arg","Ven"),each=20),
var1 = c(1:20,51:70),
var2 = c(20:1,70:51)
)
Now, using cut to create the periods, we can then group on them and get the means:
Df %>% mutate(period = cut(Df$year,seq(1951,1971,by=5),right=F)) %>%
group_by(country, period) %>% summarise(V1 = mean(var1), V2 = mean(var2))
Source: local data frame [8 x 4]
Groups: country
country period V1 V2
1 Arg [1951,1956) 3 18
2 Arg [1956,1961) 8 13
3 Arg [1961,1966) 13 8
4 Arg [1966,1971) 18 3
5 Ven [1951,1956) 53 68
6 Ven [1956,1961) 58 63
7 Ven [1961,1966) 63 58
8 Ven [1966,1971) 68 53

Resources