Filling NAs within a variable depending on where they are located - r

I have data of the following sort:
> df <- data.frame(Date = rep(seq(2000:2014), 2),
+ Country=c(rep("Italy",15),rep("Germany",15)),
+ var1= c(NA, NA, NA, NA, NA, 20:21, NA, NA, NA, 27:28, NA, NA, NA, NA, NA, NA, NA, 74:77, NA, NA, 68:70, NA, NA))
>
> df
Date Country var1
1 1 Italy NA
2 2 Italy NA
3 3 Italy NA
4 4 Italy NA
5 5 Italy NA
6 6 Italy 20
7 7 Italy 21
8 8 Italy NA
9 9 Italy NA
10 10 Italy NA
11 11 Italy 27
12 12 Italy 28
13 13 Italy NA
14 14 Italy NA
15 15 Italy NA
16 1 Germany NA
17 2 Germany NA
18 3 Germany NA
19 4 Germany NA
20 5 Germany 74
21 6 Germany 75
22 7 Germany 76
23 8 Germany 77
24 9 Germany NA
25 10 Germany NA
26 11 Germany 68
27 12 Germany 69
28 13 Germany 70
29 14 Germany NA
30 15 Germany NA
>
> df1 <- data.frame(Date = rep(seq(2000:2014), 2),
+ Country=c(rep("Italy",15),rep("Germany",15)),
+ var1= c(15.67052, 16.45405, 17.27675, 18.14059, 19.04762, 20:21, 22.36173, 23.81176, 25.35582, 27:28, 29.03704, 30.11249, 31.22777, 63.12417, 65.68326, 68.34609, 71.11688, 74:77, 73.87488, 70.8766, 68:70, 72.05882, 76.3599))
>
> df1
Date Country var1
1 1 Italy 15.67052
2 2 Italy 16.45405
3 3 Italy 17.27675
4 4 Italy 18.14059
5 5 Italy 19.04762
6 6 Italy 20.00000
7 7 Italy 21.00000
8 8 Italy 22.36173
9 9 Italy 23.81176
10 10 Italy 25.35582
11 11 Italy 27.00000
12 12 Italy 28.00000
13 13 Italy 29.03704
14 14 Italy 30.11249
15 15 Italy 31.22777
16 1 Germany 63.12417
17 2 Germany 65.68326
18 3 Germany 68.34609
19 4 Germany 71.11688
20 5 Germany 74.00000
21 6 Germany 75.00000
22 7 Germany 76.00000
23 8 Germany 77.00000
24 9 Germany 73.87488
25 10 Germany 70.87660
26 11 Germany 68.00000
27 12 Germany 69.00000
28 13 Germany 70.00000
29 14 Germany 72.05882
30 15 Germany 76.35990
As you can see, I have:
"beginning NAs", i.e. NAs at the beginning of the sampling period for each country;
"within NAs", i.e. NAs in the middle of the sampling period, as gaps in the data;
"ending NAs", i.e. NAs at the end of the sampling period where the data is not available yet, that is for the last couple of years.
I have a threefold goal:
Make the "beginning NAs" grow at the same rate as the first two non-NAs are growing, or at the rate a proxy country is growing.
Make the "within NAs" grow at a CAGR (Compound Annual Growth Rate), i.e. each "within NA" growth factor is the ratio between the next available non-NA and the previous value, raised to one over the number of years separating the two. E.g. for Italy in year 9 the growth rate should be (27/(value in year 8))^(1/(11-8)) = (27/(value in year 8))^(1/3).
Make the "ending NAs" grow in different ways. The default should simply be to grow as the last two non-NAs are growing, but when needed I have to make country-specific adjustments depending on assumptions (e.g. some countries where data is not available will be assumed to grow at the same rate as a proxy country). Perhaps for this goal what's best is a loop that breaks for the exceptions.
The reason it's so messy is that I have to automate an Excel data-processing routine that was done in the past. Therefore, as much as I'd like to change things (e.g. simply interpolate the within NAs), I should follow what was done before.
Please use all your imagination; however, I'd love to stay within the dplyr framework.
Thanks in advance
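For reference, here is a minimal dplyr sketch of the default rules only (no proxy-country exceptions). fill_var() is a hypothetical helper applied per country; it assumes rows are ordered by Date, that each country has at least two non-NA values, and that the first two and last two non-NA values sit in consecutive years:
library(dplyr)
fill_var <- function(x) {
  obs <- which(!is.na(x))
  first_obs <- min(obs)
  last_obs <- max(obs)
  # "within NAs": compound growth between the previous (possibly already filled) value
  # and the next observed value
  for (i in seq_along(x)) {
    if (is.na(x[i]) && i > first_obs && i < last_obs) {
      nxt <- obs[obs > i][1]                              # index of the next observed value
      g <- (x[nxt] / x[i - 1])^(1 / (nxt - (i - 1)))
      x[i] <- x[i - 1] * g
    }
  }
  # "beginning NAs": extrapolate backwards at the growth rate of the first two observed values
  g_start <- x[first_obs + 1] / x[first_obs]
  for (i in rev(seq_len(first_obs - 1))) x[i] <- x[i + 1] / g_start
  # "ending NAs": extrapolate forwards at the growth rate of the last two values
  if (last_obs < length(x)) {
    g_end <- x[last_obs] / x[last_obs - 1]
    for (i in (last_obs + 1):length(x)) x[i] <- x[i - 1] * g_end
  }
  x
}
df %>%
  group_by(Country) %>%
  mutate(var1_filled = fill_var(var1)) %>%
  ungroup()
With the df above this reproduces the Italy rows of df1 and Germany's within-gap fills; Germany's beginning and ending values in df1 follow proxy-country rates, so those would still need the country-specific adjustments on top of the default fill.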

Related

Dividing a column cell with a different number based on number of observations in a panel long format

I have the following data, which has a panel structure. I need to normalize each cell: the observation for a country is scaled by the number of non-missing observations for that country divided by the total number of observations for that country (here 10 - in my data 1100). Also, I have showcased three countries (AL, UK, FR) but I have 92 in total, so I need a general formula (mutate with by = country?).
This is my data
df1 <- data_frame(Country =
c("AL","AL","AL","AL","AL","AL","AL","AL","AL","AL",
"UK","UK","UK","UK","UK","UK","UK","UK","UK","UK",
"FR","FR","FR","FR","FR","FR","FR","FR","FR","FR"),
Obs = c(NA,NA,2,3,2,3,2,3,2,NA,1,2,1,2,1,2,1,2,1,2,NA,NA,NA,NA,NA,NA,NA,NA,4,NA))
df1
Country Obs
<chr> <dbl>
1 AL NA
2 AL NA
3 AL 2
4 AL 3
5 AL 2
6 AL 3
7 AL 2
8 AL 3
9 AL 2
10 AL NA
11 UK 1
12 UK 2
13 UK 1
14 UK 2
15 UK 1
16 UK 2
17 UK 1
18 UK 2
19 UK 1
20 UK 2
21 FR NA
22 FR NA
23 FR NA
24 FR NA
25 FR NA
26 FR NA
27 FR NA
28 FR NA
29 FR 4
30 FR NA
Now, what I want is to scale each cell by the number of non-missing observations available for that country divided by the total number of observations, like so:
df2 <- data_frame(Country =
c("AL","AL","AL","AL","AL","AL","AL","AL","AL","AL",
"UK","UK","UK","UK","UK","UK","UK","UK","UK","UK",
"FR","FR","FR","FR","FR","FR","FR","FR","FR","FR"),
Obs = c(NA,NA,2*7/10,3*7/10,2*7/10,3*7/10,2*7/10,3*7/10,2*7/10,
NA,1*10/10,2*10/10,1*10/10,2*10/10,1*10/10,2*10/10,1*10/10,
2*10/10,1*10/10,2*10/10,NA,NA,NA,NA,NA,NA,NA,NA,4*1/10,NA))
df2
Country Obs
<chr> <dbl>
1 AL NA
2 AL NA
3 AL 1.4
4 AL 2.1
5 AL 1.4
6 AL 2.1
7 AL 1.4
8 AL 2.1
9 AL 1.4
10 AL NA
11 UK 1
12 UK 2
13 UK 1
14 UK 2
15 UK 1
16 UK 2
17 UK 1
18 UK 2
19 UK 1
20 UK 2
21 FR NA
22 FR NA
23 FR NA
24 FR NA
25 FR NA
26 FR NA
27 FR NA
28 FR NA
29 FR 0.4
30 FR NA
I am obviously interested in solving the problem, BUT I would really, really appreciate it if you could show me how to do this for multiple columns, as my original data needs the same operation done for many columns where the country tickers (AL, UK, FR in the example) remain the same.
You can do:
library(dplyr)
df1 %>%
group_by(Country) %>%
mutate(Obs = Obs * sum(!is.na(Obs))/n()) %>%
ungroup
# Country Obs
# <chr> <dbl>
# 1 AL NA
# 2 AL NA
# 3 AL 1.4
# 4 AL 2.1
# 5 AL 1.4
# 6 AL 2.1
# 7 AL 1.4
# 8 AL 2.1
# 9 AL 1.4
#10 AL NA
# … with 20 more rows
sum(!is.na(Obs)) counts the number of non-NA values within each Country, whereas n() gives the total number of rows for that Country.
For multiple columns -
df1 %>%
group_by(Country) %>%
mutate(across(col1:col4, ~. * sum(!is.na(.))/n())) %>%
ungroup
This will be applied to col1 to col4 in your dataframe.
Using data.table
library(data.table)
setDT(df1)[, Obs := Obs * mean(!is.na(Obs)), Country]
Or using dplyr
library(dplyr)
df1 %>%
group_by(Country) %>%
mutate(Obs = Obs * mean(!is.na(Obs)))
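For several columns with data.table, a possible sketch (assuming the target columns are named col1 to col4, as in the dplyr example above):
library(data.table)
cols <- c("col1", "col2", "col3", "col4")   # assumed column names
setDT(df1)[, (cols) := lapply(.SD, function(x) x * mean(!is.na(x))),
           by = Country, .SDcols = cols]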

apply hpfilter to grouped variables with NAs using dplyr

I am trying to apply the hpfilter to one of the variables in my dataset that has a panel structure (id + year) and then add the filtered series to my dataset. It works perfectly fine as long as I do not have any NAs in one of the variables, but it yields an error if one of the ids has missing values. The reason for this is that the hpfilter function does not work with NAs (it yields only NAs).
Here's a reproducible example:
df1 <- read.table(text="country year X1 X2 W
A 1990 10 20 40
A 1991 12 15 NA
A 1992 14 17 41
A 1993 17 NA 44
B 1990 20 NA 45
B 1991 NA 13 61
B 1992 12 12 67
B 1993 14 10 68
C 1990 10 20 70
C 1991 11 14 50
C 1992 12 15 NA
C 1993 14 16 NA
D 1990 20 17 80
D 1991 16 20 91
D 1992 15 21 70
D 1993 14 22 69
", header=TRUE, stringsAsFactors=FALSE)
My approach was to use the dplyr group_by function to apply the hpfilter by country to variable X1:
library(mFilter)
library(plm)
# Organizing the Data as a Panel
df1 <- pdata.frame(df1, index = c("country","year"))
# Apply hpfilter to X1 and add trend to the sample
df1 <- df1 %>% group_by(country) %>% mutate(X1_trend = mFilter::hpfilter(na.exclude(X1), type = "lambda", freq = 6.25)$trend)
However, this yields the following error:
Error in `[[<-.data.frame`(`*tmp*`, col, value = c(11.1695436493374, 12.7688604220353, :
replacement has 15 rows, data has 16
The error occurs because the filtered series is shortened by the NAs after applying the HP filter.
Since I have a large dataset with many countries, it would be really great if there were a workaround that ignores the NAs when passing the series to hpfilter but does not remove them from the data. Thank you!
Here is a way to drop NA and calculate trend:
df2 <- df1 %>% group_by(country) %>%
filter(!is.na(X1)) %>%
pdata.frame(., index = c("country","year")) %>%
mutate(X1_trend = mFilter::hpfilter(X1, type = "lambda", freq = 6.25)$trend)
> df2
country year X1 X2 W X1_trend
1 A 1990 10 20 40 11.16954
2 A 1991 12 15 NA 12.76886
3 A 1992 14 17 41 14.18105
4 A 1993 17 NA 44 15.09597
5 B 1990 20 NA 45 15.17450
6 B 1992 12 12 67 14.38218
7 B 1993 14 10 68 13.45663
8 C 1990 10 20 70 12.75429
9 C 1991 11 14 50 12.71858
10 C 1992 12 15 NA 13.35221
11 C 1993 14 16 NA 14.38293
12 D 1990 20 17 80 15.32211
13 D 1991 16 20 91 15.61990
14 D 1992 15 21 70 15.47486
15 D 1993 14 22 69 15.14639
EDIT: To keep missing values in the final output, we do one more operation:
df3 <- merge(df1,df2, by = colnames(df1),all.x = T)
> df3
country year X1 X2 W X1_trend
1 A 1990 10 20 40 11.16954
2 A 1991 12 15 NA 12.76886
3 A 1992 14 17 41 14.18105
4 A 1993 17 NA 44 15.09597
5 B 1990 20 NA 45 15.17450
6 B 1991 NA 13 61 NA
7 B 1992 12 12 67 14.38218
8 B 1993 14 10 68 13.45663
9 C 1990 10 20 70 12.75429
10 C 1991 11 14 50 12.71858
11 C 1992 12 15 NA 13.35221
12 C 1993 14 16 NA 14.38293
13 D 1990 20 17 80 15.32211
14 D 1991 16 20 91 15.61990
15 D 1992 15 21 70 15.47486
16 D 1993 14 22 69 15.14639
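Alternatively, a minimal sketch that computes the trend on the non-NA values only and writes it straight back into a full-length column, so the NA rows stay in place (this works on the plain data frame read in above, before the pdata.frame conversion, and assumes every country has enough non-missing observations for the filter):
library(dplyr)
library(mFilter)
df1 %>%
  group_by(country) %>%
  mutate(X1_trend = {
    trend <- rep(NA_real_, n())   # full-length result, NAs preserved
    ok <- !is.na(X1)
    trend[ok] <- mFilter::hpfilter(X1[ok], type = "lambda", freq = 6.25)$trend
    trend
  }) %>%
  ungroup()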

Add lines with NA values

I have a data frame like this:
indx country year death value
1 1 Italy 2000 hiv 1
2 1 Italy 2001 hiv 2
3 1 Italy 2005 hiv 3
4 1 Italy 2000 cancer 4
5 1 Italy 2001 cancer 5
6 1 Italy 2002 cancer 6
7 1 Italy 2003 cancer 7
8 1 Italy 2004 cancer 8
9 1 Italy 2005 cancer 9
10 4 France 2000 hiv 10
11 4 France 2004 hiv 11
12 4 France 2005 hiv 12
13 4 France 2001 cancer 13
14 4 France 2002 cancer 14
15 4 France 2003 cancer 15
16 4 France 2004 cancer 16
17 2 Spain 2000 hiv 17
18 2 Spain 2001 hiv 18
19 2 Spain 2002 hiv 19
20 2 Spain 2003 hiv 20
21 2 Spain 2004 hiv 21
22 2 Spain 2005 hiv 22
23 2 Spain ... ... ...
indx is a value linked to the country (same country = same indx).
In this example I used only 3 countries (country) and 2 diseases (death); the original data frame contains many more.
I would like to have one row for each country for each disease from 2000 to 2005.
What I would like to get is:
indx country year death value
1 1 Italy 2000 hiv 1
2 1 Italy 2001 hiv 2
3 1 Italy 2002 hiv NA
4 1 Italy 2003 hiv NA
5 1 Italy 2004 hiv NA
6 1 Italy 2005 hiv 3
7 1 Italy 2000 cancer 4
8 1 Italy 2001 cancer 5
9 1 Italy 2002 cancer 6
10 1 Italy 2003 cancer 7
11 1 Italy 2004 cancer 8
12 1 Italy 2005 cancer 9
13 4 France 2000 hiv 10
14 4 France 2001 hiv NA
15 4 France 2002 hiv NA
16 4 France 2003 hiv NA
17 4 France 2004 hiv 11
18 4 France 2005 hiv 12
19 4 France 2000 cancer NA
20 4 France 2001 cancer 13
21 4 France 2002 cancer 14
22 4 France 2003 cancer 15
23 4 France 2004 cancer 16
24 4 France 2005 cancer NA
25 2 Spain 2000 hiv 17
26 2 Spain 2001 hiv 18
27 2 Spain 2002 hiv 19
28 2 Spain 2003 hiv 20
29 2 Spain 2004 hiv 21
30 2 Spain 2005 hiv 22
31 2 Spain ... ... ...
I.e. I would like to add rows with value = NA for the missing years, for each country and each disease.
For example, HIV data for Italy is missing between 2002 and 2004, so I add those rows with value = NA.
How can I do that?
For a reproducible example:
indx <- c(rep(1, times=9), rep(4, times=7), rep(2, times=6))
country <- c(rep("Italy", times=9), rep("France", times=7), rep("Spain", times=6))
year <- c(2000, 2001, 2005, 2000:2005, 2000, 2004, 2005, 2001:2004, 2000:2005)
death <- c(rep("hiv", times=3), rep("cancer", times=6), rep("hiv", times=3), rep("cancer", times=4), rep("hiv", times=6))
value <- c(1:22)
dfl <- data.frame(indx, country, year, death, value)
Using base R, you could do:
# setDF(dfl) # run this first if you have a data.table
merge(expand.grid(lapply(dfl[c("country", "death", "year")], unique)), dfl, all.x = TRUE)
This first creates all combinations of the unique values in country, death, and year, and then merges them to the original data to pull in the values; where a combination was not in the original data, NAs are added.
In the package tidyr, there's a special function that does this for you with a single command:
library(tidyr)
complete(dfl, country, year, death)
Here is a longer base R method. You create two new data.frames, one that contains all combinations of the country, year, and death, and a second that contains an index key.
# get data.frame with every combination of country, year, and death
dfNew <- with(df, expand.grid("country"=unique(country), "year"=unique(year),
"death"=unique(death)))
# get index key
indexKey <- unique(df[, c("indx", "country")])
# merge these together
dfNew <- merge(indexKey, dfNew, by="country")
# merge onto original data set
dfNew <- merge(df, dfNew, by=c("indx", "country", "year", "death"), all=TRUE)
This returns
dfNew
indx country year death value
1 1 Italy 2000 cancer 4
2 1 Italy 2000 hiv 1
3 1 Italy 2001 cancer 5
4 1 Italy 2001 hiv 2
5 1 Italy 2002 cancer 6
6 1 Italy 2002 hiv NA
7 1 Italy 2003 cancer 7
8 1 Italy 2003 hiv NA
9 1 Italy 2004 cancer 8
10 1 Italy 2004 hiv NA
11 1 Italy 2005 cancer 9
12 1 Italy 2005 hiv 3
13 2 Spain 2000 cancer NA
14 2 Spain 2000 hiv 17
15 2 Spain 2001 cancer NA
...
If df is a data.table, here are the corresponding lines of code:
# CJ is a cross-join
setkey(df, country, year, death)
dfNew <- df[CJ(country, year, death, unique=TRUE),
.(country, year, death, value)]
indexKey <- unique(df[, .(indx, country)])
dfNew <- merge(indexKey, dfNew, by="country")
dfNew <- merge(df, dfNew, by=c("indx", "country", "year", "death"), all=TRUE)
Note that rather than using CJ, it is also possible to use expand.grid as in the data.frame version:
dfNew <- df[, expand.grid("country"=unique(country), "year"=unique(year),
"death"=unique(death))]
tidyr::complete helps create all combinations of the variables you pass it, but if you have two columns that are essentially identical (like indx and country here), it will over-expand or leave NAs where you don't want them. As a workaround you can use dplyr grouping (df %>% group_by(indx, country) %>% complete(death, year), shown as a fuller sketch after the output below) or just merge the two columns into one temporarily:
library(tidyr)
# merge indx and country into a single column so they won't over-expand
df %>% unite(indx_country, indx, country) %>%
# fill in missing combinations of new column, death, and year
complete(indx_country, death, year) %>%
# separate indx and country back to how they were
separate(indx_country, c('indx', 'country'))
# Source: local data frame [36 x 5]
#
# indx country death year value
# (chr) (chr) (fctr) (int) (int)
# 1 1 Italy cancer 2000 4
# 2 1 Italy cancer 2001 5
# 3 1 Italy cancer 2002 6
# 4 1 Italy cancer 2003 7
# 5 1 Italy cancer 2004 8
# 6 1 Italy cancer 2005 9
# 7 1 Italy hiv 2000 1
# 8 1 Italy hiv 2001 2
# 9 1 Italy hiv 2002 NA
# 10 1 Italy hiv 2003 NA
# .. ... ... ... ... ...
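As a code block, the grouped variant mentioned above might look like this (a sketch; spelling out year = 2000:2005 guarantees the full range even if a country were missing a year entirely):
library(dplyr)
library(tidyr)
df %>%
  group_by(indx, country) %>%
  complete(death, year = 2000:2005) %>%
  ungroup()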

Calculating yearly growth-rates from quarterly, long form data in r

My data takes the following form:
df <- data.frame(Sector=c(rep("A",8),rep("B",8)), Country = c(rep("USA", 16)),
Quarter=rep(1:8,2),Income=20:35)
df2 <- data.frame(Sector=c(rep("A",8),rep("B",8)), Country = c(rep("UK", 16)),
Quarter=rep(1:8,2),Income=32:47)
df <- rbind(df, df2)
What I want to do is to calculate the growth rate from the first quarter each year to the first quarter the second year, within country and sector. In the example above it would be the growth rate from quarter 1 to quarter 5. So for Sector A, in the USA, it would be (24/20)-1=0.2
I then want to append this data to the dataframe as a new column.
I looked at the solutions in:
How calculate growth rate in long format data frame?
But I didn't have the R skills to get it to work when the lag is more than one time unit. Any suggestions?
ADDITION
So what I want is the growth rate, that is (24/20)-1=0.2 in the example below, not 1-(24/20), which I first wrote. The desired output should look something like this:
Sector Country Quarter Income growth
(fctr) (fctr) (int) (int) (dbl)
1 A USA 1 20 NA
2 A USA 2 21 NA
3 A USA 3 22 NA
4 A USA 4 23 NA
5 A USA 5 24 0.2
6 A USA 6 25 0.1904
7 A USA 7 26 0.1818
I think you need something like this:
library(dplyr)
df %>%
#group by sector and country
group_by(Sector, Country) %>%
#calculate growth as (Income / 4-quarter-lagged Income) - 1
mutate(growth = Income / lag(Income, 4) - 1)
Output
Source: local data frame [32 x 5]
Groups: Sector, Country [4]
Sector Country Quarter Income growth
(fctr) (fctr) (int) (int) (dbl)
1 A USA 1 20 NA
2 A USA 2 21 NA
3 A USA 3 22 NA
4 A USA 4 23 NA
5 A USA 5 24 0.2000000
6 A USA 6 25 0.1904762
7 A USA 7 26 0.1818182
8 A USA 8 27 0.1739130
9 B USA 1 28 NA
10 B USA 2 29 NA
.. ... ... ... ... ...
df3 = df
df3$Quarter = df3$Quarter - 4
# after the merge, Income_prev holds the income 4 quarters ahead, attached to the earlier quarter
df = merge(df, df3, c('Sector','Country','Quarter'), suffixes = c('','_prev'), all.x = T)
df$growth = 1 - (df$Income_prev/df$Income)
> df
Sector Country Quarter Income Income_prev growth
1 A USA 1 20 24 -0.20000000
2 A USA 2 21 25 -0.19047619
3 A USA 3 22 26 -0.18181818
4 A USA 4 23 27 -0.17391304
5 A USA 5 24 NA NA
6 A USA 6 25 NA NA
7 A USA 7 26 NA NA
8 A USA 8 27 NA NA
9 A UK 1 32 36 -0.12500000
10 A UK 2 33 37 -0.12121212
11 A UK 3 34 38 -0.11764706
12 A UK 4 35 39 -0.11428571
13 A UK 5 36 NA NA
14 A UK 6 37 NA NA
15 A UK 7 38 NA NA
16 A UK 8 39 NA NA
17 B USA 1 28 32 -0.14285714
18 B USA 2 29 33 -0.13793103
19 B USA 3 30 34 -0.13333333
20 B USA 4 31 35 -0.12903226
21 B USA 5 32 NA NA
22 B USA 6 33 NA NA
23 B USA 7 34 NA NA
24 B USA 8 35 NA NA
25 B UK 1 40 44 -0.10000000
26 B UK 2 41 45 -0.09756098
27 B UK 3 42 46 -0.09523810
28 B UK 4 43 47 -0.09302326
29 B UK 5 44 NA NA
30 B UK 6 45 NA NA
31 B UK 7 46 NA NA
32 B UK 8 47 NA NA
>

How to remove rows in data frame after frequency tables in R

I have 3 data frames in which I have to find the continents with fewer than 3 countries and remove those countries (rows). The data frames are structured in a manner similar to the data frame called x below:
row Country Continent Ranking
1 Kenya Africa 17
2 Gabon Africa 23
3 Spain Europe 04
4 Belgium Europe 03
5 China Asia 10
6 Nigeria Africa 14
7 Holland Europe 01
8 Italy Europe 05
9 Japan Asia 06
First I wanted to know the number of countries per continent, so I did
x2<-table(x$Continent)
x2
Africa Europe Asia
3 4 2
Then I wanted to identify the continents with fewer than 3 countries
x3 <- x2[x2 < 3]
x3
Asia
2
My problem now is how to remove these countries. For the example above it would be the 2 countries in Asia, and I want my final data set to look like the one presented below:
row Country Continent Ranking
1 Kenya Africa 17
2 Gabon Africa 23
3 Spain Europe 04
4 Belgium Europe 03
5 Nigeria Africa 14
6 Holland Europe 01
7 Italy Europe 05
The number of continents with fewer than 3 countries will vary among the different data frames, so I need one universal method that I can apply to all of them.
Try
library(dplyr)
x %>%
group_by(Continent) %>%
filter(n()>2)
# row Country Continent Ranking
#1 1 Kenya Africa 17
#2 2 Gabon Africa 23
#3 3 Spain Europe 04
#4 4 Belgium Europe 03
#5 6 Nigeria Africa 14
#6 7 Holland Europe 01
#7 8 Italy Europe 05
Or using the x2 table from above
subset(x, Continent %in% names(x2)[x2>2])
# row Country Continent Ranking
#1 1 Kenya Africa 17
#2 2 Gabon Africa 23
#3 3 Spain Europe 04
#4 4 Belgium Europe 03
#6 6 Nigeria Africa 14
#7 7 Holland Europe 01
#8 8 Italy Europe 05
A very easy way with "data.table" would be:
library(data.table)
as.data.table(x)[, N := .N, by = Continent][N > 2]
# row Country Continent Ranking N
# 1: 1 Kenya Africa 17 3
# 2: 2 Gabon Africa 23 3
# 3: 3 Spain Europe 4 4
# 4: 4 Belgium Europe 3 4
# 5: 6 Nigeria Africa 14 3
# 6: 7 Holland Europe 1 4
# 7: 8 Italy Europe 5 4
In base R you can try:
x[with(x, ave(rep(TRUE, nrow(x)), Continent, FUN = function(y) length(y) > 2)), ]
# row Country Continent Ranking
# 1 1 Kenya Africa 17
# 2 2 Gabon Africa 23
# 3 3 Spain Europe 4
# 4 4 Belgium Europe 3
# 6 6 Nigeria Africa 14
# 7 7 Holland Europe 1
# 8 8 Italy Europe 5
