Projecting data in R

Here is my dataset. I have 194 countries with values from 2014 to 2018:
# A tibble: 10 x 3
iso3 time y
<chr> <chr> <dbl>
1 AFG 2014 0.50
2 AFG 2015 0.55
3 AFG 2016 0.63
4 AFG 2017 0.68
5 AFG 2018 0.69
6 AGO 2014 0.54
7 AGO 2015 0.58
8 AGO 2016 0.57
9 AGO 2017 0.51
10 AGO 2018 0.61
What I would like to do is project the data to 2023 using this function:
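proj <- function(y, time=2014:2018, target=2023){
  # quasibinomial responses must lie in [0, 1]; NAs are skipped here and dropped by glm
  stopifnot(all(y >= 0 & y <= 1, na.rm = TRUE))
  period <- time[1]:target
  # predict() returns values on the link (logit) scale by default
  yhat <- predict(glm(y ~ time, family=quasibinomial), newdata=data.frame(time=period))
  return(data.frame(time=period, y=invlogit(yhat)))
}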
Now the problem is that I don't know how to use functions. How do I apply the above function to my dataset to create a new dataset that has both the historical data from 2014 to 2018 and the projected data to 2023 for all countries, in the same format as above?
Could you help?
Thank you very much

Let's say this is your data:
df = structure(list(iso3 = structure(c(1L, 1L, 1L, 1L, 1L, 2L, 2L,
2L, 2L, 2L), .Label = c("AFG", "AGO"), class = "factor"), time = c(2014L,
2015L, 2016L, 2017L, 2018L, 2014L, 2015L, 2016L, 2017L, 2018L
), y = c(0.5, 0.55, 0.63, 0.68, 0.69, 0.54, 0.58, 0.57, 0.51,
0.61)), class = "data.frame", row.names = c(NA, 10L))
You still need an inverse logit function, which should be:
invlogit <- function(x) { 1 / (1 + exp(-x)) }
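As a side note, the stats package that ships with R already provides plogis(), the logistic distribution function, which computes the same quantity. A quick check, assuming the invlogit definition given above:

```r
# Inverse logit as defined in the answer.
invlogit <- function(x) 1 / (1 + exp(-x))

x <- c(-2, 0, 0.5, 3)

# plogis() is the logistic CDF, i.e. the inverse logit.
all.equal(invlogit(x), plogis(x))
```

Either form can be used inside proj(); the hand-written version simply makes the formula explicit.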
Then you need to supply the y values over the time period for each country, for example:
subset(df,iso3=="AFG" & time %in% 2014:2018)
gives you a subset of the data.frame in country AFG from 2014 to 2018. This only works in your function if the data is complete (goes from 2014 to 2018) and sorted. We try that:
proj(subset(df,iso3=="AFG" & time %in% 2014:2018)$y)
time y
1 2014 0.5057644
2 2015 0.5598027
3 2016 0.6124600
4 2017 0.6626145
5 2018 0.7093585
6 2019 0.7520495
7 2020 0.7903234
8 2021 0.8240714
9 2022 0.8533952
10 2023 0.8785516
I suggest writing a function that handles situations when data is not sorted or is missing:
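# Return y for 2014:2018 in year order; years absent from the data become NA
func = function(data,country){
  data = subset(data,iso3==country)
  data[match(2014:2018,data$time),]$y
}
# Project each country and stack the results into one data.frame
res = lapply(unique(df$iso3),function(i){
  data.frame(country=i,proj(func(df,country=i)))
})
res = do.call(rbind,res)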
country time y
1 AFG 2014 0.5057644
2 AFG 2015 0.5598027
3 AFG 2016 0.6124600
4 AFG 2017 0.6626145
5 AFG 2018 0.7093585
6 AFG 2019 0.7520495
7 AFG 2020 0.7903234
8 AFG 2021 0.8240714
9 AFG 2022 0.8533952
10 AFG 2023 0.8785516
11 AGO 2014 0.5479759
21 AGO 2015 0.5550113
31 AGO 2016 0.5620247
41 AGO 2017 0.5690134
51 AGO 2018 0.5759748
61 AGO 2019 0.5829061
71 AGO 2020 0.5898048
81 AGO 2021 0.5966684
91 AGO 2022 0.6034943
101 AGO 2023 0.6102802
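The output above replaces the 2014 to 2018 values with fitted values. If the goal is to keep the observed values and append only the projected years, one possible variation is sketched below. The helper name proj_country is hypothetical, it uses predict(..., type = "response") instead of a manual inverse logit, and it assumes every country has enough non-missing observations to fit the model.

```r
# Small example data in the same layout as the question.
df <- data.frame(
  iso3 = rep(c("AFG", "AGO"), each = 5),
  time = rep(2014:2018, times = 2),
  y    = c(0.50, 0.55, 0.63, 0.68, 0.69, 0.54, 0.58, 0.57, 0.51, 0.61)
)

# Hypothetical helper: fit the trend for one country, then return the
# observed rows followed by projected rows up to `target`.
proj_country <- function(d, target = 2023) {
  fit <- glm(y ~ time, family = quasibinomial, data = d)
  future <- data.frame(time = (max(d$time) + 1):target)
  # type = "response" returns predictions already on the 0-1 scale.
  future$y <- as.numeric(predict(fit, newdata = future, type = "response"))
  rbind(d[, c("time", "y")], future)
}

# Apply the helper to each country and stack the pieces.
pieces <- lapply(split(df, df$iso3), function(d) {
  data.frame(iso3 = d$iso3[1], proj_country(d))
})
res <- do.call(rbind, pieces)
rownames(res) <- NULL
res
```

The result has the same three columns as the original data (iso3, time, y), with ten rows per country.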

Related

Reshape R dataframe from long format to a wide format

I have a table I want to reshape/pivot. The Agency No will have duplicates as this is looking at years worth of data but they are grouped by Agency No, Fiscal Year, and Type currently. The table is provided below as well as a desired output.
Agency No   Fiscal Year   Type      Total Gross Weight
W1000FP     2018          Dry       1000
W1004CSFP   2018          Dry       2000
W1000FP     2018          Produce   500
W1004CSFP   2018          Produce   1000
W1004DR     2018          Produce   1000
W1004DR     2018          Dry       1000
W1005DR     2019          Dry       2000
W1000FP     2019          Dry       1000
W1005DR     2019          Produce   1000
W1000FP     2019          Produce   1000
Desired Output:
Agency No   Fiscal Year   Produce Weight   Dry Weight
W1000FP     2018          500              1000
W1004CSFP   2018          1000             2000
W1004DR     2018          1000             1000
W1005DR     2019          1000             2000
W1000FP     2019          1000             1000
Here is the script that I ran, but it did not produce the desired output:
reshape(df, idvar = "Agency No", timevar = "Type", direction = "wide")
We could use pivot_wider
library(tidyr)
pivot_wider(df1, names_from = Type, values_from = `Total Gross Weight`)
Output:
# A tibble: 5 × 4
`Agency No` `Fiscal Year` Dry Produce
<chr> <int> <int> <int>
1 W1000FP 2018 1000 500
2 W1004CSFP 2018 2000 1000
3 W1004DR 2018 1000 1000
4 W1005DR 2019 2000 1000
5 W1000FP 2019 1000 1000
With reshape, also specify 'Fiscal Year' as an idvar:
reshape(df1, idvar = c("Agency No", "Fiscal Year"),
timevar = "Type", direction = "wide")
Agency No Fiscal Year Total Gross Weight.Dry Total Gross Weight.Produce
1 W1000FP 2018 1000 500
2 W1004CSFP 2018 2000 1000
5 W1004DR 2018 1000 1000
7 W1005DR 2019 2000 1000
8 W1000FP 2019 1000 1000
data
df1 <- structure(list(`Agency No` = c("W1000FP", "W1004CSFP", "W1000FP",
"W1004CSFP", "W1004DR", "W1004DR", "W1005DR", "W1000FP", "W1005DR",
"W1000FP"), `Fiscal Year` = c(2018L, 2018L, 2018L, 2018L, 2018L,
2018L, 2019L, 2019L, 2019L, 2019L), Type = c("Dry", "Dry", "Produce",
"Produce", "Produce", "Dry", "Dry", "Dry", "Produce", "Produce"
), `Total Gross Weight` = c(1000L, 2000L, 500L, 1000L, 1000L,
1000L, 2000L, 1000L, 1000L, 1000L)), class = "data.frame", row.names = c(NA,
-10L))
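To get the exact column names and order from the desired output ("Produce Weight", "Dry Weight"), the reshape() result can be renamed afterwards. The following is a minimal base R sketch under the assumption that the wide columns are named "Total Gross Weight.<Type>", as in the reshape() output shown above:

```r
df1 <- data.frame(
  `Agency No` = c("W1000FP", "W1004CSFP", "W1000FP", "W1004CSFP", "W1004DR",
                  "W1004DR", "W1005DR", "W1000FP", "W1005DR", "W1000FP"),
  `Fiscal Year` = c(2018L, 2018L, 2018L, 2018L, 2018L,
                    2018L, 2019L, 2019L, 2019L, 2019L),
  Type = c("Dry", "Dry", "Produce", "Produce", "Produce",
           "Dry", "Dry", "Dry", "Produce", "Produce"),
  `Total Gross Weight` = c(1000L, 2000L, 500L, 1000L, 1000L,
                           1000L, 2000L, 1000L, 1000L, 1000L),
  check.names = FALSE  # keep the column names with spaces as they are
)

wide <- reshape(df1, idvar = c("Agency No", "Fiscal Year"),
                timevar = "Type", direction = "wide")

# Turn "Total Gross Weight.Dry" into "Dry Weight", and so on.
names(wide) <- sub("^Total Gross Weight\\.(.*)$", "\\1 Weight", names(wide))

# Reorder the columns to match the desired output and reset row names.
wide <- wide[, c("Agency No", "Fiscal Year", "Produce Weight", "Dry Weight")]
rownames(wide) <- NULL
wide
```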

How should I impute NA with mean of that column, not just replacing the na value? [closed]

I have a dataset that looks like this:
TYPE YEAR NUMBERS
A 2020 60
A 2019 NA
A 2018 88
A 2017 NA
A 2016 90
I want to impute the missing values with the mean of the values in the column 'NUMBERS'.
I have searched through a lot of tutorials, but they just directly replace the missing value with the mean, which is not what I want. I tried using mice and Hmisc, but they produced errors. Is there a better way to do this? Thanks!
I would have done this:
df <- read.table(text = 'TYPE YEAR NUMBERS
A 2020 60
A 2019 NA
A 2018 88
A 2017 NA
A 2016 90', header=T)
a= mean(na.omit(df$NUMBERS))
df[is.na(df$NUMBERS),"NUMBERS"]=a
df
Output:
TYPE YEAR NUMBERS
1 A 2020 60.00000
2 A 2019 79.33333
3 A 2018 88.00000
4 A 2017 79.33333
5 A 2016 90.00000
Is this what you wanted?
I'm inferring from the presence of the TYPE column that you should be imputing based on the group's mean, not the population's mean.
Modified data:
dat <- structure(list(TYPE = c("A", "A", "A", "A", "A", "B", "B", "B", "B", "B"), YEAR = c(2020L, 2019L, 2018L, 2017L, 2016L, 2020L, 2019L, 2018L, 2017L, 2016L), NUMBERS = c(60L, NA, 88L, NA, 90L, 160L, NA, 188L, NA, 190L)), class = "data.frame", row.names = c(NA, -10L))
base R
do.call(rbind, by(dat, dat$TYPE, function(z) {
  # replace the NAs in this group with the group's own mean
  z$NUMBERS[is.na(z$NUMBERS)] <- mean(z$NUMBERS, na.rm = TRUE)
  z
}))
# TYPE YEAR NUMBERS
# A.1 A 2020 60.00000
# A.2 A 2019 79.33333
# A.3 A 2018 88.00000
# A.4 A 2017 79.33333
# A.5 A 2016 90.00000
# B.6 B 2020 160.00000
# B.7 B 2019 179.33333
# B.8 B 2018 188.00000
# B.9 B 2017 179.33333
# B.10 B 2016 190.00000
or
do.call(rbind, by(dat, dat$TYPE,
function(z) transform(z, NUMBERS = ifelse(is.na(NUMBERS), mean(NUMBERS, na.rm = TRUE), NUMBERS))))
dplyr
library(dplyr)
dat %>%
  group_by(TYPE) %>%
  mutate(NUMBERS = if_else(is.na(NUMBERS), mean(NUMBERS, na.rm = TRUE), as.numeric(NUMBERS))) %>%
  ungroup()
# # A tibble: 10 x 3
# TYPE YEAR NUMBERS
# <chr> <int> <dbl>
# 1 A 2020 60
# 2 A 2019 79.3
# 3 A 2018 88
# 4 A 2017 79.3
# 5 A 2016 90
# 6 B 2020 160
# 7 B 2019 179.
# 8 B 2018 188
# 9 B 2017 179.
# 10 B 2016 190
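Another base R option is ave(), which applies a function within groups and returns a vector aligned with the original rows, so no rbind step is needed. A minimal sketch using the same values as the modified data above (NUMBERS is stored as numeric here so the group mean fits without coercion):

```r
dat <- data.frame(
  TYPE = rep(c("A", "B"), each = 5),
  YEAR = rep(2020:2016, times = 2),
  NUMBERS = c(60, NA, 88, NA, 90, 160, NA, 188, NA, 190)
)

# Replace each NA with the mean of the non-missing values in its TYPE group.
dat$NUMBERS <- ave(dat$NUMBERS, dat$TYPE, FUN = function(x) {
  x[is.na(x)] <- mean(x, na.rm = TRUE)
  x
})
dat
```

The row order and row names of dat are left unchanged, which can be convenient when the data must stay aligned with other objects.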

R | Adding index numbers

I have two datasets, which look like this:
Sales
Region ReviewYear Sales Index
South Asia 2006 1.5 NA
South Asia 2009 4.5 NA
South Asia 2011 11 0
South Asia 2014 16.7 NA
Africa 2008 0.4 NA
Africa 2013 3.5 0
Africa 2017 9.7 NA
Strategy
Region StrategyYear
South Asia 2011
Africa 2013
Japan 2007
SE Asia 2009
There are multiple regions and many review years, which are not periodic and not even the same for all regions. I have added a column 'Index' to the dataframe 'Sales' such that, for the strategy year from the second dataframe, the index value is zero. I now want to replace each NA with the number of rows that row lies before (negative) or after (positive) the 0 row, grouped by 'Region'.
I can do this using a for loop, but that is tedious, so I am checking whether there is a cleaner way to do this. The final output should look like this:
Sales
Region ReviewYear Sales Index
South Asia 2006 1.5 -2
South Asia 2009 4.5 -1
South Asia 2011 11 0
South Asia 2014 16.7 1
Africa 2008 0.4 -1
Africa 2013 3.5 0
Africa 2017 9.7 1
Join the two datasets by Region and, for each Region, create the Index column by subtracting from each row number the position of the row where ReviewYear matches StrategyYear.
library(dplyr)
left_join(Sales, Strategy, by = 'Region') %>%
  # sort chronologically within each Region so row positions are meaningful
  arrange(Region, ReviewYear) %>%
  group_by(Region) %>%
  mutate(Index = row_number() - match(first(StrategyYear), ReviewYear))
# Region ReviewYear Sales Index StrategyYear
# <chr> <int> <dbl> <int> <int>
#1 Africa 2008 0.4 -1 2013
#2 Africa 2013 3.5 0 2013
#3 Africa 2017 9.7 1 2013
#4 SouthAsia 2006 1.5 -2 2011
#5 SouthAsia 2009 4.5 -1 2011
#6 SouthAsia 2011 11 0 2011
#7 SouthAsia 2014 16.7 1 2011
data
Sales <- structure(list(Region = c("SouthAsia", "SouthAsia", "SouthAsia",
"SouthAsia", "Africa", "Africa", "Africa"), ReviewYear = c(2006L,
2009L, 2011L, 2014L, 2008L, 2013L, 2017L), Sales = c(1.5, 4.5,
11, 16.7, 0.4, 3.5, 9.7), Index = c(NA, NA, 0L, NA, NA, 0L, NA
)), class = "data.frame", row.names = c(NA, -7L))
Strategy <- structure(list(Region = c("SouthAsia", "Africa", "Japan", "SEAsia"
), StrategyYear = c(2011L, 2013L, 2007L, 2009L)), class = "data.frame",
row.names = c(NA, -4L))
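For comparison, the same idea can be sketched in base R with match() and ave(). This is only one possible translation of the logic described above; the intermediate names pos and zero_pos are illustrative, and a region whose strategy year never appears as a review year would get NA.

```r
Sales <- data.frame(
  Region = c(rep("SouthAsia", 4), rep("Africa", 3)),
  ReviewYear = c(2006L, 2009L, 2011L, 2014L, 2008L, 2013L, 2017L),
  Sales = c(1.5, 4.5, 11, 16.7, 0.4, 3.5, 9.7)
)
Strategy <- data.frame(
  Region = c("SouthAsia", "Africa", "Japan", "SEAsia"),
  StrategyYear = c(2011L, 2013L, 2007L, 2009L)
)

# Look up each row's strategy year, then sort chronologically within region.
Sales$StrategyYear <- Strategy$StrategyYear[match(Sales$Region, Strategy$Region)]
Sales <- Sales[order(Sales$Region, Sales$ReviewYear), ]

# Position of each row within its region (1, 2, 3, ...).
pos <- ave(seq_len(nrow(Sales)), Sales$Region, FUN = seq_along)

# Position of the strategy-year row, repeated across the whole region.
zero_pos <- ave(
  ifelse(Sales$ReviewYear == Sales$StrategyYear, pos, NA_integer_),
  Sales$Region,
  FUN = function(p) p[!is.na(p)][1]
)

Sales$Index <- pos - zero_pos
Sales
```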

Deflate numbers by price index (CPI), when there are varying observation numbers for each year

I have some Revenue data in a format like that shown below. So the years are not sequential and they can also repeat (because of a different firm).
Firm Year Revenue
1 A 2018 100
2 B 2017 90
3 B 2018 80
4 C 2016 20
And I want to adjust the Revenue for inflation, by dividing through by the appropriate CPI for each year. The CPI data looks like this:
Year CPI
1 2016 98
2 2017 100
3 2018 101
I have a solution that works, but is this the best way to do it? Is it clumsy to mutate an entire calculating column in there?
revenue <- data.frame(stringsAsFactors=FALSE,
Firm = c("A", "B", "B", "C"),
Year = c(2018L, 2017L, 2018L, 2016L),
Revenue = c(100L, 90L, 80L, 20L)
)
cpi <- data.frame(
Year = c(2016L, 2017L, 2018L),
CPI = c(98L, 100L, 101L)
)
library(dplyr)
df <- left_join(revenue, cpi, by = 'Year')
mutate(df, real_revenue = (Revenue*100)/CPI)
The output is correct, shown below. But is this the best way to do it?
Firm Year Revenue CPI real_revenue
1 A 2018 100 101 99.00990
2 B 2017 90 100 90.00000
3 B 2018 80 101 79.20792
4 C 2016 20 98 20.40816
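Joining and then mutating is a common and readable pattern for this kind of lookup. If the extra CPI column is not wanted in the result, one alternative is to look the index up with match() and divide directly. A minimal base R sketch, where cpi_for_row is just an illustrative intermediate name:

```r
revenue <- data.frame(
  Firm = c("A", "B", "B", "C"),
  Year = c(2018L, 2017L, 2018L, 2016L),
  Revenue = c(100, 90, 80, 20)
)
cpi <- data.frame(
  Year = c(2016L, 2017L, 2018L),
  CPI = c(98, 100, 101)
)

# Find the CPI for each revenue row by year; years missing from cpi give NA.
cpi_for_row <- cpi$CPI[match(revenue$Year, cpi$Year)]

# Deflate revenue to the index base (CPI = 100).
revenue$real_revenue <- revenue$Revenue * 100 / cpi_for_row
revenue
```

Repeated and non-sequential years are handled the same way as in the join, because each row is matched on its own Year value.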

Months to integer R

This is part of the dataframe I am working on. The first column represents the year, the second the month, and the third one the number of observations for that month of that year.
2005 07 2
2005 10 4
2005 12 2
2006 01 4
2006 02 1
2006 07 2
2006 08 1
2006 10 3
I have observations from 2000 to 2018. I would like to run a kernel regression on this data, so I need to create a continuous integer index from a date class vector. For instance, Jan 2000 would be 1, Jan 2001 would be 13, Jan 2002 would be 25, and so on. With that I will be able to run the kernel regression. Later on, I need to translate that back (1 would be Jan 2000, 2 would be Feb 2000, and so on) to plot my model.
Just use a little algebra:
df$cont <- (df$year - 2000L) * 12L + df$month
You could go backward with modulus and integer division.
# Subtract 1 first so that December (e.g. cont = 12) stays in its own year.
df$year <- (df$cont - 1L) %/% 12L + 2000L
df$month <- (df$cont - 1L) %% 12L + 1L
Here, %% is the modulus operator and %/% is the integer division operator. See ?"%%" for an explanation of these and other arithmetic operators.
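Since the conversion is easy to get wrong by one at December, a small round-trip check can help. The sketch below wraps the arithmetic in two hypothetical helpers, to_index and from_index, with Jan 2000 as index 1; it subtracts 1 before dividing so that month 12 maps back to the same year.

```r
# Year/month -> continuous month index (Jan of base_year is 1).
to_index <- function(year, month, base_year = 2000L) {
  (year - base_year) * 12L + month
}

# Continuous month index -> year/month.
from_index <- function(idx, base_year = 2000L) {
  data.frame(
    year  = (idx - 1L) %/% 12L + base_year,
    month = (idx - 1L) %% 12L + 1L
  )
}

idx <- to_index(c(2000L, 2000L, 2001L, 2005L), c(1L, 12L, 1L, 7L))
idx              # 1 12 13 67
from_index(idx)  # recovers the original year/month pairs
```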
What you can do is something like the following. First create a dates data.frame with expand.grid so we have all the years and months from 2000 01 to 2018 12. Next put this in the correct order and last add an order column so that 2000 01 starts with 1 and 2018 12 is 228. If you merge this with your original table you get the below result. You can then remove columns you don't need. And because you have a dates table you can return the year and month columns based on the order column.
dates <- expand.grid(year = seq(2000, 2018), month = seq(1, 12))
dates <- dates[order(dates$year, dates$month), ]
dates$order <- seq_along(dates$year)
merge(df, dates, by.x = c("year", "month"), by.y = c("year", "month"))
year month obs order
1 2005 10 4 70
2 2005 12 2 72
3 2005 7 2 67
4 2006 1 4 73
5 2006 10 3 82
6 2006 2 1 74
7 2006 7 2 79
8 2006 8 1 80
data:
df <- structure(list(year = c(2005L, 2005L, 2005L, 2006L, 2006L, 2006L, 2006L, 2006L),
month = c(7L, 10L, 12L, 1L, 2L, 7L, 8L, 10L),
obs = c(2L, 4L, 2L, 4L, 1L, 2L, 1L, 3L)),
class = "data.frame",
row.names = c(NA, -8L))
An option is to use the yearmon type from the zoo package and then calculate the difference in months from Jan 2000 as the difference between two yearmon values.
library(zoo)
# +1 has been added to the difference so that Jan 2000 is treated as 1
df$slNum = (as.yearmon(paste0(df$year, df$month),"%Y%m")-as.yearmon("200001","%Y%m"))*12+1
# year month obs slNum
# 1 2005 7 2 67
# 2 2005 10 4 70
# 3 2005 12 2 72
# 4 2006 1 4 73
# 5 2006 2 1 74
# 6 2006 7 2 79
# 7 2006 8 1 80
# 8 2006 10 3 82
Data:
df <- read.table(text =
"year month obs
2005 07 2
2005 10 4
2005 12 2
2006 01 4
2006 02 1
2006 07 2
2006 08 1
2006 10 3",
header = TRUE, stringsAsFactors = FALSE)
