Is it possible to interpolate a list of dataframes in R?

According to the answer of lhs,
https://stackoverflow.com/a/72467827/11124121
#From lhs
library(tidyverse)
data("population")
# create some data to interpolate
population_5 <- population %>%
filter(year %% 5 == 0) %>%
mutate(female_pop = population / 2,
male_pop = population / 2)
interpolate_func <- function(variable, data) {
data %>%
group_by(country) %>%
# can't interpolate if only one year
filter(n() >= 2) %>%
group_modify(~as_tibble(approx(.x$year, .x[[variable]],
xout = min(.x$year):max(.x$year)))) %>%
set_names(c("country", "year", paste0(variable, "_interpolated"))) %>%
ungroup()
}
The data that already exist, i.e. the years 2000 and 2005, are also interpolated. I want to keep the original data and only interpolate the missing parts, that is,
2001-2004 ; 2006-2009
Therefore, I would like to construct a list:
population_5_list = list(population_5 %>% filter(year %in% c(2000,2005)),population_5 %>% filter(year %in% c(2005,2010)))
And impute the dataframes in the list one by one.
However, an error appeared:
Error in UseMethod("group_by") :
no applicable method for 'group_by' applied to an object of class "list"
I am wondering how I can change interpolate_func into purrr format so that it can be applied to a list.

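We need to loop over the list with map. Here vars_to_interpolate is assumed to be a character vector naming the columns to interpolate (population, female_pop and male_pop in the output below):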
library(purrr)
library(dplyr)
map(population_5_list,
~ map(vars_to_interpolate, interpolate_func, data = .x) %>%
reduce(full_join, by = c("country", "year")))
Output:
[[1]]
# A tibble: 1,266 × 5
country year population_interpolated female_pop_interpolated male_pop_interpolated
<chr> <int> <dbl> <dbl> <dbl>
1 Afghanistan 2000 20595360 10297680 10297680
2 Afghanistan 2001 21448459 10724230. 10724230.
3 Afghanistan 2002 22301558 11150779 11150779
4 Afghanistan 2003 23154657 11577328. 11577328.
5 Afghanistan 2004 24007756 12003878 12003878
6 Afghanistan 2005 24860855 12430428. 12430428.
7 Albania 2000 3304948 1652474 1652474
8 Albania 2001 3283184. 1641592. 1641592.
9 Albania 2002 3261421. 1630710. 1630710.
10 Albania 2003 3239657. 1619829. 1619829.
# … with 1,256 more rows
# ℹ Use `print(n = ...)` to see more rows
[[2]]
# A tibble: 1,278 × 5
country year population_interpolated female_pop_interpolated male_pop_interpolated
<chr> <int> <dbl> <dbl> <dbl>
1 Afghanistan 2005 24860855 12430428. 12430428.
2 Afghanistan 2006 25568246. 12784123. 12784123.
3 Afghanistan 2007 26275638. 13137819. 13137819.
4 Afghanistan 2008 26983029. 13491515. 13491515.
5 Afghanistan 2009 27690421. 13845210. 13845210.
6 Afghanistan 2010 28397812 14198906 14198906
7 Albania 2005 3196130 1598065 1598065
8 Albania 2006 3186933. 1593466. 1593466.
9 Albania 2007 3177735. 1588868. 1588868.
10 Albania 2008 3168538. 1584269. 1584269.
# … with 1,268 more rows
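The call above relies on vars_to_interpolate, which is not defined in the snippet. Here is a minimal sketch of the whole pipeline on toy data, assuming vars_to_interpolate is simply every column other than country and year. The toy tibble and its values are invented for illustration; interpolate_func is copied from the question.

```r
library(dplyr)
library(purrr)
library(tibble)

# Toy stand-in for population_5; the numbers are invented.
toy <- tibble(
  country    = rep(c("A", "B"), each = 3),
  year       = rep(c(2000L, 2005L, 2010L), times = 2),
  population = c(100, 150, 200, 10, 20, 40)
)

# Same function as in the question.
interpolate_func <- function(variable, data) {
  data %>%
    group_by(country) %>%
    # can't interpolate if only one year
    filter(n() >= 2) %>%
    group_modify(~ as_tibble(approx(.x$year, .x[[variable]],
                                    xout = min(.x$year):max(.x$year)))) %>%
    set_names(c("country", "year", paste0(variable, "_interpolated"))) %>%
    ungroup()
}

# Assumption: interpolate every column except the two key columns.
vars_to_interpolate <- setdiff(names(toy), c("country", "year"))

# One data frame per 5-year window, as in the question.
toy_list <- list(
  filter(toy, year %in% c(2000, 2005)),
  filter(toy, year %in% c(2005, 2010))
)

# Outer map: one list element per window; inner map: one column per variable.
result <- map(toy_list, function(d) {
  map(vars_to_interpolate, function(v) interpolate_func(v, d)) %>%
    reduce(full_join, by = c("country", "year"))
})
```

Note that linear interpolation with approx() returns the observed value unchanged at an observed year, so the 2000, 2005 and 2010 rows still carry the original data even though their column names end in _interpolated.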

Related

How to calculate difference in only one variable without including others from the dataset?

I want to calculate the change in life expectancy over the years for each country in the gapminder dataset. I am using the lag function to calculate this difference for each country with the data set.
For example:
v <- 1:10
print(v)
v-v
v-lag(v)
When I try implementing this to the gapminder dataset, I end up calculating the difference in life expectancy between two different countries, which is not what I want to find. The code calculates the difference between the life expectancy of country B's earliest year and country A's earliest year in error.
my code:
library(gapminder)
library(tidyverse)
library(tidyr)
library(readr)
library(dplyr)
gm <- gapminder
explife <- gm %>%
group_by(country) %>%
mutate(inc = lifeExp - lag(lifeExp)) %>%
arrange(desc(inc)) %>%
select(country, year, lifeExp, inc)
print(explife)
I also tried grouping by year as well, but all the values are NA.
library(gapminder)
library(tidyverse)
library(tidyr)
library(readr)
library(dplyr)
gm <- gapminder
explife <- gm %>%
group_by(country, year) %>%
mutate(inc = lifeExp - lag(lifeExp)) %>%
arrange(desc(inc)) %>%
select(country, year, lifeExp, inc)
print(explife)
The key lines of your code do seem to give the expected results, with all countries receiving NAs in their earliest years:
library(dplyr)
gapminder::gapminder |>
group_by(country) |>
mutate(inc = lifeExp - lag(lifeExp)) |>
filter(is.na(inc))
#> # A tibble: 142 × 7
#> # Groups: country [142]
#> country continent year lifeExp pop gdpPercap inc
#> <fct> <fct> <int> <dbl> <int> <dbl> <dbl>
#> 1 Afghanistan Asia 1952 28.8 8425333 779. NA
#> 2 Albania Europe 1952 55.2 1282697 1601. NA
#> 3 Algeria Africa 1952 43.1 9279525 2449. NA
#> 4 Angola Africa 1952 30.0 4232095 3521. NA
#> 5 Argentina Americas 1952 62.5 17876956 5911. NA
#> 6 Australia Oceania 1952 69.1 8691212 10040. NA
#> 7 Austria Europe 1952 66.8 6927772 6137. NA
#> 8 Bahrain Asia 1952 50.9 120447 9867. NA
#> 9 Bangladesh Asia 1952 37.5 46886859 684. NA
#> 10 Belgium Europe 1952 68 8730405 8343. NA
#> # … with 132 more rows
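The apparent cross-country subtraction most likely comes from arrange(desc(inc)), which sorts the whole table by inc and interleaves the countries; lag itself respects the grouping. Grouping by country and year leaves one row per group, so lag never has a previous row and every value is NA. A small sketch on made-up data (the toy tibble is an assumption, not the gapminder data):

```r
library(dplyr)

toy <- tibble::tibble(
  country = rep(c("A", "B"), each = 3),
  year    = rep(c(1952L, 1957L, 1962L), times = 2),
  lifeExp = c(30, 32, 35, 60, 61, 63)
)

# lag() restarts within each country, so the first year per country is NA.
by_country <- toy %>%
  group_by(country) %>%
  mutate(inc = lifeExp - lag(lifeExp)) %>%
  ungroup()

# One row per (country, year) group: lag() has nothing to look back at.
by_country_year <- toy %>%
  group_by(country, year) %>%
  mutate(inc = lifeExp - lag(lifeExp)) %>%
  ungroup()

# Sorting by country and year keeps each country's changes together.
by_country %>% arrange(country, year)
```

This assumes the rows are already ordered by year within each country, as they are in gapminder; otherwise arrange(country, year) should come before the mutate.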

Revaluing many observations with a for loop in R

I have a data set where I am looking at longitudinal data for countries.
master.set <- data.frame(
Country = c(rep("Afghanistan", 3), rep("Albania", 3)),
Country.ID = c(rep("Afghanistan", 3), rep("Albania", 3)),
Year = c(2015, 2016, 2017, 2015, 2016, 2017),
Happiness.Score = c(3.575, 3.360, 3.794, 4.959, 4.655, 4.644),
GDP.PPP = c(1766.593, 1757.023, 1758.466, 10971.044, 11356.717, 11803.282),
GINI = NA,
Status = 2,
stringsAsFactors = F
)
> head(master.set)
Country Country.ID Year Happiness.Score GDP.PPP GINI Status
1 Afghanistan Afghanistan 2015 3.575 1766.593 NA 2
2 Afghanistan Afghanistan 2016 3.360 1757.023 NA 2
3 Afghanistan Afghanistan 2017 3.794 1758.466 NA 2
4 Albania Albania 2015 4.959 10971.044 NA 2
5 Albania Albania 2016 4.655 11356.717 NA 2
6 Albania Albania 2017 4.644 11803.282 NA 2
I created that Country.ID variable with the intent of turning them into numerical values 1:159.
I am hoping to avoid doing something like this to replace the value at each individual observation:
master.set$Country.ID <- master.set$Country.ID[master.set$Country.ID == "Afghanistan"] <- 1
As I implied, there are 159 countries listed in the data set. Because it's longitudinal, there are 460 observations.
Is there any way to use a for loop to save me a lot of time? Here is what I attempted. I made a couple of lists and attempted to use an ifelse command to tell R to label each country the next number.
Here is what I have:
#List of country names
N.Countries <- length(unique(master.set$Country))
Country <- unique(master.set$Country)
Country.ID <- unique(master.set$Country.ID)
CountryList <- unique(master.set$Country)
#For Loop to make Country ID numerically match Country
for (i in 1:460){
for (j in N.Countries){
master.set[[Country.ID[i]]] <- ifelse(master.set[[Country[i]]] == CountryList[j], j, master.set$Country)
}
}
I received this error:
Error in `[[<-.data.frame`(`*tmp*`, Country.ID[i], value = logical(0)) :
replacement has 0 rows, data has 460
Does anyone know how I can accomplish this task? Or will I be stuck using the ifelse command 159 times?
Thanks!
Maybe something like
master.set$Country.ID <- as.numeric(as.factor(master.set$Country.ID))
Or alternatively, using dplyr
library(tidyverse)
master.set <- master.set %>% mutate(Country.ID = as.numeric(as.factor(Country.ID)))
Or this, which creates a new variable Country.ID2 based on a key-value pair between Country and 1:length(unique(Country)).
library(tidyverse)
master.set <- left_join(master.set,
data.frame( Country = unique(master.set$Country),
Country.ID2 = 1:length(unique(master.set$Country))))
master.set
#> Country Country.ID Year Happiness.Score GDP.PPP GINI Status
#> 1 Afghanistan Afghanistan 2015 3.575 1766.593 NA 2
#> 2 Afghanistan Afghanistan 2016 3.360 1757.023 NA 2
#> 3 Afghanistan Afghanistan 2017 3.794 1758.466 NA 2
#> 4 Albania Albania 2015 4.959 10971.044 NA 2
#> 5 Albania Albania 2016 4.655 11356.717 NA 2
#> 6 Albania Albania 2017 4.644 11803.282 NA 2
#> Country.ID2
#> 1 1
#> 2 1
#> 3 1
#> 4 2
#> 5 2
#> 6 2
Another option, numbering the groups with dplyr:
library(dplyr)
df<-data.frame("Country"=c("Afghanistan","Afghanistan","Afghanistan","Albania","Albania","Albania"),
"Year"=c(2015,2016,2017,2015,2016,2017),
"Happiness.Score"=c(3.575,3.360,3.794,4.959,4.655,4.644),
"GDP.PPP"=c(1766.593,1757.023,1758.466,10971.044,11356.717,11803.282),
"GINI"=NA,
"Status"=rep(2,6))
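# group_indices_() is deprecated; cur_group_id() (dplyr >= 1.0.0) returns the index of the current group
df1 <- df %>% arrange(Country) %>% group_by(Country) %>% mutate(Country_id = cur_group_id()) %>% ungroup()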
View(df1)
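For reference, the numbering can also be done in base R alone, without a loop. A minimal sketch, assuming the IDs should follow either order of first appearance (match) or alphabetical order (factor); the toy data frame and the column names id_appearance and id_alphabetical are illustrative:

```r
toy <- data.frame(
  Country = c("Albania", "Albania", "Afghanistan", "Afghanistan"),
  Year    = c(2015, 2016, 2015, 2016),
  stringsAsFactors = FALSE
)

# IDs in order of first appearance: Albania = 1, Afghanistan = 2.
toy$id_appearance <- match(toy$Country, unique(toy$Country))

# IDs in sorted order of the names: Afghanistan = 1, Albania = 2.
toy$id_alphabetical <- as.integer(factor(toy$Country))
```

Both lines are vectorised over all rows, so they scale to 159 countries and 460 observations without any ifelse calls.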

How to subtract each Country's value by year

I have data for each Country's happiness (https://www.kaggle.com/unsdsn/world-happiness), and I made data for each year of the reports. Now, I don't know how to get the values for each year subtracted from each other e.g. how did happiness rank change from 2015 to 2017/2016 to 2017? I'd like to make a new df of differences for each.
I was able to bind the tables for columns in common and started to work on removing Countries that don't have data for all 3 years. I'm not sure if I'm going down a complicated path.
keepcols <- c("Country","Happiness.Rank","Economy..GDP.per.Capita.","Family","Health..Life.Expectancy.","Freedom","Trust..Government.Corruption.","Generosity","Dystopia.Residual","Year")
mydata2015 = read.csv("C:\\Users\\mmcgown\\Downloads\\2015.csv")
mydata2015$Year <- "2015"
data2015 <- subset(mydata2015, select = keepcols )
mydata2016 = read.csv("C:\\Users\\mmcgown\\Downloads\\2016.csv")
mydata2016$Year <- "2016"
data2016 <- subset(mydata2016, select = keepcols )
mydata2017 = read.csv("C:\\Users\\mmcgown\\Downloads\\2017.csv")
mydata2017$Year <- "2017"
data2017 <- subset(mydata2017, select = keepcols )
df <- rbind(data2015,data2016,data2017)
head(df, n=10)
tail(df, n=10)
df15 <- df[df['Year']=='2015',]
df16 <- df[df['Year']=='2016',]
df17 <- df[df['Year']=='2017',]
nocon <- rbind(setdiff(unique(df16['Country']),unique(df17['Country'])),setdiff(unique(df15['Country']),unique(df16['Country'])))
I don't have a clear path to accomplish what I want, but the result would look like:
df16_to_17
Country Happiness.Rank ...(other columns)
Yemen (Yemen[Happiness Rank in 2017] - Yemen[Happiness Rank in 2016])
USA (USA[Happiness Rank in 2017] - USA[Happiness Rank in 2016])
(other countries)
df15_to_16
Country Happiness.Rank ...(other columns)
Yemen (Yemen[Happiness Rank in 2016] - Yemen[Happiness Rank in 2015])
USA (USA[Happiness Rank in 2016] - USA[Happiness Rank in 2015])
(other countries)
It's very straightforward with dplyr, and involves grouping by country and then finding the differences between consecutive values with base R's diff. Just make sure to use df and not df15, etc.:
library(dplyr)
rank_diff_df <- df %>%
group_by(Country) %>%
mutate(Rank.Diff = c(NA, diff(Happiness.Rank)))
The above assumes that the data are arranged by year, which they are in your case because of the way you combined the dataframes. If not, you'll need to call arrange(Year) before the call to mutate. Filtering out countries with missing year data isn't necessary, but can be done after group_by() with filter(n() == 3).
If you would like to view the differences it would make sense to drop some variables and rearrange the data:
rank_diff_df %>%
select(Year, Country, Happiness.Rank, Rank.Diff) %>%
arrange(Country)
Which returns:
# A tibble: 470 x 4
# Groups: Country [166]
Year Country Happiness.Rank Rank.Diff
<chr> <fct> <int> <int>
1 2015 Afghanistan 153 NA
2 2016 Afghanistan 154 1
3 2017 Afghanistan 141 -13
4 2015 Albania 95 NA
5 2016 Albania 109 14
6 2017 Albania 109 0
7 2015 Algeria 68 NA
8 2016 Algeria 38 -30
9 2017 Algeria 53 15
10 2015 Angola 137 NA
# … with 460 more rows
The above data frame will work well with ggplot2 if you are planning on plotting the results.
If you don't feel comfortable with dplyr you can use base R's merge to combine the dataframes, and then create a new dataframe with the differences as columns:
df_wide <- merge(merge(df15, df16, by = "Country"), df17, by = "Country")
rank_diff_df <- data.frame(Country = df_wide$Country,
Y2015.2016 = df_wide$Happiness.Rank.y -
df_wide$Happiness.Rank.x,
Y2016.2017 = df_wide$Happiness.Rank -
df_wide$Happiness.Rank.y
)
Which returns:
head(rank_diff_df, 10)
Country Y2015.2016 Y2016.2017
1 Afghanistan 1 -13
2 Albania 14 0
3 Algeria -30 15
4 Angola 4 -1
5 Argentina -4 -2
6 Armenia -6 0
7 Australia -1 1
8 Austria -1 1
9 Azerbaijan 1 4
10 Bahrain -7 -1
Assuming the three datasets are present in your environment with the names data2015, data2016 and data2017, we can add a Year column with the respective year and keep the columns listed in the keepcols vector. Then arrange the data by Country and Year, group_by Country, keep only the countries that are present in all 3 years, and subtract the previous row's values using lag or diff.
library(dplyr)
data2015$Year <- 2015
data2016$Year <- 2016
data2017$Year <- 2017
df <- bind_rows(data2015, data2016, data2017)
data <- df[keepcols]
data %>%
arrange(Country, Year) %>%
group_by(Country) %>%
filter(n() == 3) %>%
mutate_at(-1, ~. - lag(.)) #OR
#mutate_at(-1, ~c(NA, diff(.)))
# A tibble: 438 x 10
# Groups: Country [146]
# Country Happiness.Rank Economy..GDP.pe… Family Health..Life.Ex… Freedom
# <chr> <int> <dbl> <dbl> <dbl> <dbl>
# 1 Afghan… NA NA NA NA NA
# 2 Afghan… 1 0.0624 -0.192 -0.130 -0.0698
# 3 Afghan… -13 0.0192 0.471 0.00731 -0.0581
# 4 Albania NA NA NA NA NA
# 5 Albania 14 0.0766 -0.303 -0.0832 -0.0387
# 6 Albania 0 0.0409 0.302 0.00109 0.0628
# 7 Algeria NA NA NA NA NA
# 8 Algeria -30 0.113 -0.245 0.00038 -0.0757
# 9 Algeria 15 0.0392 0.313 -0.000455 0.0233
#10 Angola NA NA NA NA NA
# … with 428 more rows, and 4 more variables: Trust..Government.Corruption. <dbl>,
# Generosity <dbl>, Dystopia.Residual <dbl>, Year <dbl>
The first row for each Country is always NA; every other row holds the value minus the previous year's value.
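mutate_at() is superseded in current dplyr; the same per-country differences can be written with across(). A sketch on invented data, assuming dplyr 1.0.0 or later. Unlike the code above, it leaves Year untouched so each row stays identifiable:

```r
library(dplyr)

toy <- tibble::tibble(
  Country        = rep(c("A", "B"), each = 3),
  Year           = rep(c(2015, 2016, 2017), times = 2),
  Happiness.Rank = c(153L, 154L, 141L, 95L, 109L, 109L),
  Freedom        = c(0.2, 0.3, 0.1, 0.5, 0.4, 0.6)
)

diffs <- toy %>%
  arrange(Country, Year) %>%
  group_by(Country) %>%
  # keep only countries observed in all three years
  filter(n() == 3) %>%
  # across() skips the grouping column; -Year excludes Year as well
  mutate(across(-Year, ~ .x - lag(.x))) %>%
  ungroup()
```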

Using a conditional in a for loop to create a unique panel id

I have a dataset which looks as follows:
# A tibble: 5,458 x 539
# Groups: country, id1 [2,729]
idstd id2 xxx id1 country year
<dbl+> <dbl> <dbl+lbl> <dbl+lbl> <chr> <dbl>
1 445801 NA NA 7 Albania 2009
2 542384 4616555 1163 7 Albania 2013
3 445802 NA NA 8 Albania 2009
4 542386 4616355 1162 8 Albania 2013
5 445803 NA NA 25 Albania 2009
6 542371 4616545 1161 25 Albania 2013
7 445804 NA NA 30 Albania 2009
8 542152 4616556 475 30 Albania 2013
9 445805 NA NA 31 Albania 2009
10 542392 4616542 1160 31 Albania 2013
The data is panel data, but there is no unique panel ID. The first two observations are, for example, respondent number 7 from Albania, but number 7 is used again for other countries. id2, however, is unique. My plan is therefore to copy id2 into the NA entry of the corresponding respondent.
I wrote the following code:
for (i in 1:nrow(df)) {
if (df$id1[i]== df$id1[i+1] & df$country[i] == df$country[i+1]) {
df$id2[i] <- df$id2[i+1]
}}
Which gives the following error:
Error in if (df$id1[i] == df1$id1[i + 1] & : missing value where TRUE/FALSE needed
It does however seem to work. As my dataset is quite large and I am not very skilled, I am reluctant to accept the solution I came up with, especially when it gives an error.
Could anyone help explain the error to me?
In addition, is there a more efficient (for example data.table) and maybe error free way to deal with this?
Could you do something along these lines?
library(tidyverse)
df %>%
group_by(country, id1) %>%
mutate(uniqueId = id2 %>% discard(is.na) %>% unique) %>%
ungroup()
Also, from looking at your loop I judge that each NA sits one row above the matching unique ID, so you could also take the value from the next row with lead():
df %>%
mutate(id2Lead = lead(id2),
uniqueId = ifelse(is.na(id2), id2Lead, id2)) %>%
select(-id2Lead)
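As for the error itself: on the last iteration i + 1 points past the final row, so df$id1[i + 1] is NA and if() cannot evaluate the condition. Because that only happens at the very end, all earlier rows were already updated, which is why the loop seemed to work. A loop-free sketch, assuming tidyr is available and using a made-up data frame with the same layout:

```r
library(dplyr)
library(tidyr)

toy <- tibble::tibble(
  country = c("Albania", "Albania", "Albania", "Albania", "Bulgaria", "Bulgaria"),
  id1     = c(7, 7, 8, 8, 7, 7),
  year    = c(2009, 2013, 2009, 2013, 2009, 2013),
  id2     = c(NA, 4616555, NA, 4616355, NA, 5000001)
)

filled <- toy %>%
  group_by(country, id1) %>%
  # copy the non-missing id2 upwards onto the NA row of the same respondent
  fill(id2, .direction = "up") %>%
  ungroup()
```

Because the fill happens within each (country, id1) group, a respondent with no id2 at all simply stays NA instead of picking up a neighbour's ID.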

Reshaping Dataframe in R (melt?)

So, I currently have a dataframe that looks like:
country continent year lifeExp pop gdpPercap
<fctr> <fctr> <int> <dbl> <int> <dbl>
1 Afghanistan Asia 1952 28.801 8425333 779.4453
2 Afghanistan Asia 1957 30.332 9240934 820.8530
3 Afghanistan Asia 1962 31.997 10267083 853.1007
4 Afghanistan Asia 1967 34.020 11537966 836.1971
5 Afghanistan Asia 1972 36.088 13079460 739.9811
6 Afghanistan Asia 1977 38.438 14880372 786.1134
There are 140+ countries. The years are in 5-year intervals from 1952 to 2007. I want to reshape my dataframe so that I get:
Country gdpPercap(1952) gdpPercap(1957) ... gdpPercap(2007)
<fctr> <dbl>
1 Afghanistan 974.5803 .... ...
2 Albania 5937.0295 ... ...
3 Algeria 6223.3675 ... ...
4 Angola 4797.2313
5 Argentina 12779.3796
6 Australia 34435.3674
7 Austria 36126.4927
8 Bahrain 29796.0483
9 Bangladesh 1391.2538
10 Belgium 33692.6051
My attempt is this:
gapminder %>% #my dataframe
filter(year >= 1952) %>%
group_by(country) %>%
summarise(gdpPercap = mean(gdpPercap))
OUTPUT:
country gdpPercap <- but this takes the mean of gdpPercap from 1952-2007
<fctr> <dbl>
1 Afghanistan 802.6746
2 Albania 3255.3666
3 Algeria 4426.0260
4 Angola 3607.1005
5 Argentina 8955.5538
6 Australia 19980.5956
7 Austria 20411.9163
8 Bahrain 18077.6639
9 Bangladesh 817.5588
10 Belgium 19900.7581
# ... with 132 more rows
Any ideas? PS: I'm new to R. I'm also looking at melt(). Any help will be appreciated!
tidyr::spread() would solve your problem
library(dplyr); library(tidyr)
gapminder %>%
select(country, year, gdpPercap) %>%
spread(year, gdpPercap)
You should also use year in group_by, and after summarise, just reshape the data the way you want using dcast or reshape.
Here is a sample solution :
library(dplyr)
library(reshape2)
gapminder <- data.frame(cbind(gdpPercap=runif(10000), year =as.integer(seq(from=1952, to=2007, by=5)), country = c("India", "US", "UK")))
gapminder$gdpPercap <- as.numeric(as.character(gapminder$gdpPercap))
gapminder$year <- as.integer(as.character(gapminder$year))
gapminder %>% #my dataframe
filter(year >= 1952) %>%
group_by(country, year) %>%
summarise(gdpPercap = mean(gdpPercap)) %>%
dcast(country ~ year, value.var="gdpPercap")
I had to generate new data because your example is not reproducible. See "How to make a great R reproducible example?"; a reproducible example makes the problem easier to understand and gets you quicker answers.
Built-in reshape can do this.
foo.data.frame <- data.frame(
Country=rep(c("Here", "There"), each=3),
year=rep(c(1952, 1957, 1962),2),
gdpPercap=779:784
# ... other variables
)
reshape(foo.data.frame[, c("Country", "year", "gdpPercap")],
timevar="year", idvar="Country", direction="wide", sep=" ")
# Country gdpPercap 1952 gdpPercap 1957 gdpPercap 1962
# 1 Here 779 780 781
# 4 There 782 783 784
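In current tidyr, spread() is superseded by pivot_wider(). A sketch of the same reshape on toy data, assuming tidyr 1.0.0 or later; names_prefix is only there to mimic the requested gdpPercap(year) column names, and the "gdpPercap_" prefix is an arbitrary choice:

```r
library(tidyr)

toy <- data.frame(
  country   = rep(c("Here", "There"), each = 3),
  year      = rep(c(1952, 1957, 1962), times = 2),
  gdpPercap = 779:784
)

# One row per country, one gdpPercap column per year.
wide <- pivot_wider(
  toy,
  id_cols      = country,
  names_from   = year,
  values_from  = gdpPercap,
  names_prefix = "gdpPercap_"
)
```

With the real gapminder data, the other columns (continent, lifeExp, pop) would need to be dropped or left out via id_cols, as above, so that each country collapses to a single row.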
