Generalise a function to vector format data.table - r

I have the following data structure, where I want to interpolate the data row-wise until a certain year:
require('data.table')
test_dt <- data.table(iso1 = c('BTN', 'IND', 'BGD'),
iso2 = c('AFG', 'AFG', 'AFG'),
year = c(2006, 2003, 2006))
I came up with the following function that works well for the single-row case, but does not work for the general case:
interpolate_rows <- function(dt, stop_year = 2008) {
year <- as.integer(dt[, .SD, .SDcols = 'year'])
# If year is less than stop year, fill in observations:
if (year < stop_year) {
time_delta <- seq(year, stop_year)
# Explode bilateral country observation:
dt <- dt[rep(dt[, .I], length(time_delta))]
# Replace year column w/ time_delta sequence:
dt <- dt[, year := time_delta]
}
return(dt)
}
## Output
bar <- interpolate_rows(test_dt[1])
bar
iso1 iso2 year
1: BTN AFG 2006
2: BTN AFG 2007
3: BTN AFG 2008
What I'd like to have is the following:
bar <- interpolate_rows(test_dt)
bar
iso1 iso2 year
1: BTN AFG 2006
2: BTN AFG 2007
3: BTN AFG 2008
4: IND AFG 2003
5: IND AFG 2004
6: IND AFG 2005
7: IND AFG 2006
8: IND AFG 2007
9: IND AFG 2008
10: BGD AFG 2006
11: BGD AFG 2007
12: BGD AFG 2008
I know the culprit is most likely this line:
year <- as.integer(dt[, .SD, .SDcols = 'year'])
but I have no clue how to replace it with a working vectorised solution. I tried nesting an lapply() call within interpolate_rows() to extract the year for each unique group and played around with Map(), but neither yielded a working solution.
Any help pointing me to a feasible vectorised solution would be greatly appreciated.

What about simply using by (with stop_year defined first):
stop_year <- 2008
test_dt[, .(year = min(year):stop_year), by = .(iso1, iso2)]
# iso1 iso2 year
# 1: BTN AFG 2006
# 2: BTN AFG 2007
# 3: BTN AFG 2008
# 4: IND AFG 2003
# 5: IND AFG 2004
# 6: IND AFG 2005
# 7: IND AFG 2006
# 8: IND AFG 2007
# 9: IND AFG 2008
# 10: BGD AFG 2006
# 11: BGD AFG 2007
# 12: BGD AFG 2008
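If the grouped expression needs to live inside a function like the original interpolate_rows(), one possible sketch is below. The name interpolate_by_pair is made up for illustration, and it assumes each (iso1, iso2) pair should start at its earliest year. Pairs whose year already reaches stop_year are kept as a single row, mirroring the if-branch in the question.

```r
library(data.table)

test_dt <- data.table(iso1 = c("BTN", "IND", "BGD"),
                      iso2 = c("AFG", "AFG", "AFG"),
                      year = c(2006, 2003, 2006))

# Hypothetical wrapper around the grouped expression.
interpolate_by_pair <- function(dt, stop_year = 2008) {
  # max() guards against a descending sequence when year > stop_year.
  dt[, .(year = seq(min(year), max(min(year), stop_year))),
     by = .(iso1, iso2)]
}

res  <- interpolate_by_pair(test_dt)
late <- interpolate_by_pair(test_dt, stop_year = 2004)
```

With stop_year = 2004, only the IND pair is expanded; the other two pairs come back as one row each.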

One way using dplyr and tidyr libraries.
library(dplyr)
library(tidyr)
interpolate_rows <- function(dt, stop_year = 2008) {
dt %>%
group_by(iso1, iso2) %>%
complete(year = year : stop_year) %>%
ungroup
}
interpolate_rows(test_dt)
# iso1 iso2 year
# <chr> <chr> <dbl>
# 1 BGD AFG 2006
# 2 BGD AFG 2007
# 3 BGD AFG 2008
# 4 BTN AFG 2006
# 5 BTN AFG 2007
# 6 BTN AFG 2008
# 7 IND AFG 2003
# 8 IND AFG 2004
# 9 IND AFG 2005
#10 IND AFG 2006
#11 IND AFG 2007
#12 IND AFG 2008
Another way -
library(data.table)
interpolate_rows <- function(dt, stop_year = 2008) {
vals <- seq(dt$year, stop_year)
dt[rep(1, length(vals))][, year := vals]
}
rbindlist(by(test_dt, seq(nrow(test_dt)), interpolate_rows))
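For a version without grouping or an explicit loop over rows, the row repetition from the question can be vectorised directly. The sketch below is one way to do it and is not taken from the answers above; the helper name interpolate_rows_vec is hypothetical.

```r
library(data.table)

test_dt <- data.table(iso1 = c("BTN", "IND", "BGD"),
                      iso2 = c("AFG", "AFG", "AFG"),
                      year = c(2006, 2003, 2006))

interpolate_rows_vec <- function(dt, stop_year = 2008) {
  # Rows each input row expands to (at least 1, so rows at or beyond
  # stop_year are kept unchanged).
  n <- as.integer(pmax(stop_year - dt$year, 0) + 1)
  # Repeat row i n[i] times; subsetting returns a new data.table,
  # so the input is not modified by the := below.
  out <- dt[rep(seq_len(nrow(dt)), times = n)]
  # sequence(n) counts 1..n[i] within each repeated block.
  out[, year := year + sequence(n) - 1]
  out[]
}

res_vec <- interpolate_rows_vec(test_dt)
```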

Related

Is it possible to interpolate a list of dataframes in r?

According to the answer of lhs,
https://stackoverflow.com/a/72467827/11124121
#From lhs
library(tidyverse)
data("population")
# create some data to interpolate
population_5 <- population %>%
filter(year %% 5 == 0) %>%
mutate(female_pop = population / 2,
male_pop = population / 2)
interpolate_func <- function(variable, data) {
data %>%
group_by(country) %>%
# can't interpolate if only one year
filter(n() >= 2) %>%
group_modify(~as_tibble(approx(.x$year, .x[[variable]],
xout = min(.x$year):max(.x$year)))) %>%
set_names(c("country", "year", paste0(variable, "_interpolated"))) %>%
ungroup()
}
Years that already have data, i.e. 2000 and 2005, are also interpolated. I want to keep the original data and only interpolate the missing years, that is,
2001-2004 ; 2006-2009
Therefore, I would like to construct a list:
population_5_list = list(population_5 %>% filter(year %in% c(2000,2005)),population_5 %>% filter(year %in% c(2005,2010)))
and impute the data frames in the list one by one.
However, an error appeared:
Error in UseMethod("group_by") :
no applicable method for 'group_by' applied to an object of class "list"
How can I adapt interpolate_func to purrr so that it can be applied to a list?
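We need to loop over the list with map. vars_to_interpolate is not defined in the excerpt above; judging by the output, it is the character vector c("population", "female_pop", "male_pop").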
library(purrr)
library(dplyr)
map(population_5_list,
~ map(vars_to_interpolate, interpolate_func, data = .x) %>%
reduce(full_join, by = c("country", "year")))
-output
[[1]]
# A tibble: 1,266 × 5
country year population_interpolated female_pop_interpolated male_pop_interpolated
<chr> <int> <dbl> <dbl> <dbl>
1 Afghanistan 2000 20595360 10297680 10297680
2 Afghanistan 2001 21448459 10724230. 10724230.
3 Afghanistan 2002 22301558 11150779 11150779
4 Afghanistan 2003 23154657 11577328. 11577328.
5 Afghanistan 2004 24007756 12003878 12003878
6 Afghanistan 2005 24860855 12430428. 12430428.
7 Albania 2000 3304948 1652474 1652474
8 Albania 2001 3283184. 1641592. 1641592.
9 Albania 2002 3261421. 1630710. 1630710.
10 Albania 2003 3239657. 1619829. 1619829.
# … with 1,256 more rows
# ℹ Use `print(n = ...)` to see more rows
[[2]]
# A tibble: 1,278 × 5
country year population_interpolated female_pop_interpolated male_pop_interpolated
<chr> <int> <dbl> <dbl> <dbl>
1 Afghanistan 2005 24860855 12430428. 12430428.
2 Afghanistan 2006 25568246. 12784123. 12784123.
3 Afghanistan 2007 26275638. 13137819. 13137819.
4 Afghanistan 2008 26983029. 13491515. 13491515.
5 Afghanistan 2009 27690421. 13845210. 13845210.
6 Afghanistan 2010 28397812 14198906 14198906
7 Albania 2005 3196130 1598065 1598065
8 Albania 2006 3186933. 1593466. 1593466.
9 Albania 2007 3177735. 1588868. 1588868.
10 Albania 2008 3168538. 1584269. 1584269.
# … with 1,268 more rows
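To try the pattern without the population data, here is a small self-contained sketch on made-up numbers. The tibble toy, the helper interpolate_one and the list toy_list are all hypothetical; only the group_modify()/approx() core follows the function quoted in the question.

```r
library(dplyr)
library(purrr)

# Made-up data: two countries observed every five years.
toy <- tibble(
  country    = rep(c("A", "B"), each = 3),
  year       = rep(c(2000L, 2005L, 2010L), times = 2),
  population = c(100, 150, 250, 10, 20, 40)
)

interpolate_one <- function(data, variable) {
  data %>%
    group_by(country) %>%
    # approx() needs at least two points per country
    filter(n() >= 2) %>%
    group_modify(~ as_tibble(approx(.x$year, .x[[variable]],
                                    xout = min(.x$year):max(.x$year)))) %>%
    ungroup() %>%
    set_names(c("country", "year", paste0(variable, "_interpolated")))
}

# One list element per period, as in the question.
toy_list <- list(
  filter(toy, year %in% c(2000, 2005)),
  filter(toy, year %in% c(2005, 2010))
)

# map() passes each list element as the first argument (data).
res <- map(toy_list, interpolate_one, variable = "population")
```

Because linear interpolation passes through the observed points, the observed years keep their original values in the output.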

r - Select a numerical value in a Matrix which is based of another string dataset (Distance between two countries)

I have a dataset (1) of investments where the target and the home country are listed.
In addition, I have a matrix (2) which shows the distance between all countries.
I would like to add a column to the first data set which contains the distance between the target and home country in each row.
The first data set looks like this (the last column, Distance_Target_Home, is the goal):
targetC  year  Comp_id  homeC  sales  assets  profit  Distance_Target_Home
ABW      2008  AL8234   ALB    74839  75342   976857  8543
and the second (with the distances between the countries):
     ABW   AFG   AGO   ALB   ANT
ABW        3455  2456  8543  1342
AFG
AGO
ALB
ANT
Thanks a lot
Assuming that df_distances is your second dataframe (with distances), you can reshape it to long format like this:
## for testing you can use this minimal distance dataframe:
df_distances <- structure(list(ABW = c(9L, 3L), AFG = c(10L, 4L)), class = "data.frame", row.names = c("ABW", "AFG"))
## > df_distances
##
## ABW AFG
## ABW 9 10
## AFG 3 4
... now reshape (stack) df_distances to long format so that homeC and targetC receive their own column each:
library(tidyr)
library(dplyr)
library(tibble) # provides rownames_to_column()
df_distances <- df_distances %>%
rownames_to_column('targetC') %>%
pivot_longer(cols = -1,
names_to = 'homeC',
values_to = 'Distance_Target_Home'
)
## > df_distances
## # A tibble: 4 x 3
## targetC homeC Distance_Target_Home
## <chr> <chr> <int>
## 1 ABW ABW 9
## 2 ABW AFG 10
## 3 AFG ABW 3
## 4 AFG AFG 4
... join with df1:
df1 %>% left_join(df_distances, by = c("targetC", "homeC"))
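If the distances are kept as a matrix, a lookup by row and column name avoids the reshape altogether. This is a sketch on made-up values (dist_mat and df1 below are illustrative, not the asker's data), and it assumes rows are target countries and columns are home countries.

```r
# Made-up distance matrix with country codes as dimnames.
dist_mat <- matrix(c(9, 3, 10, 4), nrow = 2,
                   dimnames = list(c("ABW", "AFG"), c("ABW", "AFG")))

df1 <- data.frame(targetC = c("ABW", "AFG"),
                  homeC   = c("AFG", "ABW"),
                  stringsAsFactors = FALSE)

# A two-column character matrix indexes by (row name, column name) pairs,
# returning one distance per row of df1.
df1$Distance_Target_Home <- dist_mat[cbind(df1$targetC, df1$homeC)]
```

A country code that is missing from the matrix dimnames would raise a "subscript out of bounds" error, so the codes in both objects need to match.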

Creating a Variable Initial Values from a base variable in Panel Data Structure in R

I'm trying to create a new variable in R that holds the initial value of another variable (Crime) within each group (country), i.e. the value in the first year observed for that country in a panel data set. My current data looks like this:
country    year  Crime
Albania    2016  2.7369478
Albania    2017  2.0109779
Argentina  2002  9.474084
Argentina  2003  7.7898825
Argentina  2004  6.0739941
And I want it to look like this:
country    year  Crime      Initial_Crime
Albania    2016  2.7369478  2.7369478
Albania    2017  2.0109779  2.7369478
Argentina  2002  9.474084   9.474084
Argentina  2003  7.7898825  9.474084
Argentina  2004  6.0739941  9.474084
I saw that ddply could do this, but the problem is that it is no longer supported by the latest R updates.
Thank you in advance.
Maybe arrange by year, then after grouping by country set Initial_Crime to be the first Crime in the group.
library(tidyverse)
df %>%
arrange(year) %>%
group_by(country) %>%
mutate(Initial_Crime = first(Crime))
Output
country year Crime Initial_Crime
<chr> <int> <dbl> <dbl>
1 Argentina 2002 9.47 9.47
2 Argentina 2003 7.79 9.47
3 Argentina 2004 6.07 9.47
4 Albania 2016 2.74 2.74
5 Albania 2017 2.01 2.74
library(data.table)
setDT(data)[, Initial_Crime:=.SD[1,Crime], by=country]
country year Crime Initial_Crime
1: Albania 2016 2.736948 2.736948
2: Albania 2017 2.010978 2.736948
3: Argentina 2002 9.474084 9.474084
4: Argentina 2003 7.789883 9.474084
5: Argentina 2004 6.073994 9.474084
A data.table solution
setDT(df)
df[, x := 1:.N, country
][x==1, initial_crime := Crime
][, initial_crime := nafill(initial_crime, type = "locf")
][, x := NULL
]
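The data.table snippets above take the first row per country, which is the initial value only if the rows are already sorted by year. Below is a minimal sketch that does not depend on row order, using which.min(). The data.table crime_dt is rebuilt from the question's table with the rows deliberately shuffled.

```r
library(data.table)

# Question's data, shuffled so that the first row per country
# is not the earliest year.
crime_dt <- data.table(
  country = c("Argentina", "Albania", "Argentina", "Albania", "Argentina"),
  year    = c(2003L, 2017L, 2002L, 2016L, 2004L),
  Crime   = c(7.7898825, 2.0109779, 9.474084, 2.7369478, 6.0739941)
)

# Within each country, pick the Crime value at the earliest year.
crime_dt[, Initial_Crime := Crime[which.min(year)], by = country]
```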

Revaluing many observations with a for loop in R

I have a data set where I am looking at longitudinal data for countries.
master.set <- data.frame(
Country = c(rep("Afghanistan", 3), rep("Albania", 3)),
Country.ID = c(rep("Afghanistan", 3), rep("Albania", 3)),
Year = c(2015, 2016, 2017, 2015, 2016, 2017),
Happiness.Score = c(3.575, 3.360, 3.794, 4.959, 4.655, 4.644),
GDP.PPP = c(1766.593, 1757.023, 1758.466, 10971.044, 11356.717, 11803.282),
GINI = NA,
Status = 2,
stringsAsFactors = F
)
> head(master.set)
Country Country.ID Year Happiness.Score GDP.PPP GINI Status
1 Afghanistan Afghanistan 2015 3.575 1766.593 NA 2
2 Afghanistan Afghanistan 2016 3.360 1757.023 NA 2
3 Afghanistan Afghanistan 2017 3.794 1758.466 NA 2
4 Albania Albania 2015 4.959 10971.044 NA 2
5 Albania Albania 2016 4.655 11356.717 NA 2
6 Albania Albania 2017 4.644 11803.282 NA 2
I created that Country.ID variable with the intent of turning them into numerical values 1:159.
I am hoping to avoid doing something like this to replace the value at each individual observation:
master.set$Country.ID <- master.set$Country.ID[master.set$Country.ID == "Afghanistan"] <- 1
As I implied, there are 159 countries listed in the data set. Because it's longitudinal, there are 460 observations.
Is there any way to use a for loop to save me a lot of time? Here is what I attempted. I made a couple of lists and attempted to use an ifelse command to tell R to label each country the next number.
Here is what I have:
#List of country names
N.Countries <- length(unique(master.set$Country))
Country <- unique(master.set$Country)
Country.ID <- unique(master.set$Country.ID)
CountryList <- unique(master.set$Country)
#For Loop to make Country ID numerically match Country
for (i in 1:460){
for (j in N.Countries){
master.set[[Country.ID[i]]] <- ifelse(master.set[[Country[i]]] == CountryList[j], j, master.set$Country)
}
}
I received this error:
Error in `[[<-.data.frame`(`*tmp*`, Country.ID[i], value = logical(0)) :
replacement has 0 rows, data has 460
Does anyone know how I can accomplish this task? Or will I be stuck using the ifelse command 159 times?
Thanks!
Maybe something like
master.set$Country.ID <- as.numeric(as.factor(master.set$Country.ID))
Or alternatively, using dplyr
library(tidyverse)
master.set <- master.set %>% mutate(Country.ID = as.numeric(as.factor(Country.ID)))
Or this, which creates a new variable Country.ID2 based on a key-value pair between Country and 1:length(unique(Country)).
library(tidyverse)
master.set <- left_join(master.set,
data.frame( Country = unique(master.set$Country),
Country.ID2 = 1:length(unique(master.set$Country))))
master.set
#> Country Country.ID Year Happiness.Score GDP.PPP GINI Status
#> 1 Afghanistan Afghanistan 2015 3.575 1766.593 NA 2
#> 2 Afghanistan Afghanistan 2016 3.360 1757.023 NA 2
#> 3 Afghanistan Afghanistan 2017 3.794 1758.466 NA 2
#> 4 Albania Albania 2015 4.959 10971.044 NA 2
#> 5 Albania Albania 2016 4.655 11356.717 NA 2
#> 6 Albania Albania 2017 4.644 11803.282 NA 2
#> Country.ID2
#> 1 1
#> 2 1
#> 3 1
#> 4 2
#> 5 2
#> 6 2
library(dplyr)
df<-data.frame("Country"=c("Afghanistan","Afghanistan","Afghanistan","Albania","Albania","Albania"),
"Year"=c(2015,2016,2017,2015,2016,2017),
"Happiness.Score"=c(3.575,3.360,3.794,4.959,4.655,4.644),
"GDP.PPP"=c(1766.593,1757.023,1758.466,10971.044,11356.717,11803.282),
"GINI"=NA,
"Status"=rep(2,6))
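# group_indices_() is deprecated; cur_group_id() (dplyr >= 1.0.0) returns the group index
df1 <- df %>% arrange(Country) %>% group_by(Country) %>% mutate(Country_id = cur_group_id()) %>% ungroup()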
View(df1)
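Note that as.factor() numbers the countries in sorted order of their names, which may differ from their order of appearance in the data. If first-appearance order matters, match() against unique() is a base R alternative. A small sketch on made-up country names:

```r
# Made-up vector; the first country seen is not first alphabetically.
country <- c("Zambia", "Albania", "Zambia", "Albania", "Chile")

# IDs follow the sorted factor levels: Albania, Chile, Zambia.
id_sorted <- as.integer(factor(country))

# IDs follow order of first appearance: Zambia, Albania, Chile.
id_first <- match(country, unique(country))
```

Both approaches are vectorised, so no loop over the 460 rows or the 159 countries is needed.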

Panel data, from wide to long with multiple variables [duplicate]

I'm struggling with a sizeable panel data set with multiple variables, where the years are spread across columns. It looks like this:
set.seed(42)
dat_0=
data.frame(
c(rep('AFG',2),rep('UK',2)),
c(rep(c('GDP','pop'),2)),
runif(4),
runif(4))
colnames(dat_0)<-c('country','variable','2010','2011')
Producing a data frame like this:
country variable 2010 2011
1 AFG GDP 0.9148060 0.6417455
2 AFG pop 0.9370754 0.5190959
3 UK GDP 0.2861395 0.7365883
4 UK pop 0.8304476 0.1346666
And I am struggling to coerce it to this structure:
country year GDP pop
1 AFG 2010 0.9148060 0.9370754
2 AFG 2011 0.6417455 0.5190959
3 UK 2010 0.2861395 0.8304476
4 UK 2011 0.7365883 0.1346666
Apologies if repeated, I seem to be struggling with reshape/tidyr/dplyr
You need to gather and then spread:
library(tidyverse)
set.seed(42)
dat_0 <- data.frame(c(rep("AFG", 2), rep("UK", 2)), c(rep(c("GDP", "pop"), 2)), runif(4), runif(4))
colnames(dat_0) <- c("country", "variable", "2010", "2011")
dat_0 %>%
gather(year, value, `2010`, `2011`) %>%
spread(variable, value)
#> country year GDP pop
#> 1 AFG 2010 0.9148060 0.9370754
#> 2 AFG 2011 0.6417455 0.5190959
#> 3 UK 2010 0.2861395 0.8304476
#> 4 UK 2011 0.7365883 0.1346666
Created on 2019-02-20 by the reprex package (v0.2.1)
Looks like you could solve your problem with a mix of the spread and gather functions from tidyr, which is part of the tidyverse.
You can solve this problem in two steps.
First: gather the year columns into a key column called "year" and a value column called "measurement"
> dat_1 <- dat_0 %>% gather(key="year",value="measurement","2010":"2011")
> dat_1
country variable year measurement
1 AFG GDP 2010 0.9148060
2 AFG pop 2010 0.9370754
3 UK GDP 2010 0.2861395
4 UK pop 2010 0.8304476
5 AFG GDP 2011 0.6417455
6 AFG pop 2011 0.5190959
7 UK GDP 2011 0.7365883
8 UK pop 2011 0.1346666
Second: spread the "variable" column into new columns, filled with the values from "measurement"
> dat_2 <- dat_1 %>% spread(key="variable",value="measurement")
> dat_2
country year GDP pop
1 AFG 2010 0.9148060 0.9370754
2 AFG 2011 0.6417455 0.5190959
3 UK 2010 0.2861395 0.8304476
4 UK 2011 0.7365883 0.1346666
I sincerely hope this solves your problem.
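In current tidyr, gather() and spread() are superseded by pivot_longer() and pivot_wider(). The same two steps might look like the sketch below, which assumes tidyr >= 1.0.0; the object names dat_long and dat_wide are chosen here for illustration.

```r
library(tidyr)

set.seed(42)
dat_0 <- data.frame(
  country  = rep(c("AFG", "UK"), each = 2),
  variable = rep(c("GDP", "pop"), times = 2),
  runif(4),
  runif(4)
)
colnames(dat_0) <- c("country", "variable", "2010", "2011")

# Step 1: stack the year columns into year/value pairs.
dat_long <- pivot_longer(dat_0, cols = c(`2010`, `2011`),
                         names_to = "year", values_to = "value")

# Step 2: spread the variable column into one column per variable.
dat_wide <- pivot_wider(dat_long, names_from = variable, values_from = value)
```

The year column comes out as character because it is built from column names; as.integer(dat_wide$year) converts it if a numeric year is needed.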

Resources