How to use a loop to create panel data by subsetting and merging a lot of different data frames in R? - r

I've looked around but I can't find an answer to this!
I've imported a large number of datasets to R.
Each dataset contains information for a single year (ex. df_2012, df_2013, df_2014 etc).
All the datasets have the same variables/columns (ex. varA_2012 in df_2012 corresponds to varA_2013 in df_2013).
I want to create a df with my id variable and varA_2012, varB_2012, varA_2013, varB_2013, varA_2014, varB_2014 etc
I'm trying to create a loop that helps me extract the few columns that I'm interested in (varA_XXXX, varB_XXXX) in each data frame and then do a full join based on my id var.
I haven't used R in a very long time...
So far, I've tried this:
id <- c("France", "Belgium", "Spain")
varA_2012 <- c(1,2,3)
varB_2012 <- c(7,2,9)
varC_2012 <- c(1,56,0)
varD_2012 <- c(13,55,8)
varA_2013 <- c(34,3,56)
varB_2013 <- c(2,53,5)
varC_2013 <- c(24,3,45)
varD_2013 <- c(27,13,8)
varA_2014 <- c(9,10,5)
varB_2014 <- c(95,30,75)
varC_2014 <- c(99,0,51)
varD_2014 <- c(9,40,1)
df_2012 <-data.frame(id, varA_2012, varB_2012, varC_2012, varD_2012)
df_2013 <-data.frame(id, varA_2013, varB_2013, varC_2013, varD_2013)
df_2014 <-data.frame(id, varA_2014, varB_2014, varC_2014, varD_2014)
year = c(2012:2014)
for(i in 1:length(year)) {
df_[i] <- df_[I][df_[i]$id, df_[i]$varA_[i], df_[i]$varB_[i], ]
list2env(df_[i], .GlobalEnv)
}
panel_df <- Reduce(function(x, y) merge(x, y, by="if"), list(df_2012, df_2013, df_2014))
I know that there are probably loads of errors in here.

Here are a couple of options; however, it's unclear what you want the expected output to look like.
If you want a wide format, then we can use tidyverse to do:
library(tidyverse)
results <-
map(list(df_2012, df_2013, df_2014), function(x)
x %>% dplyr::select(id, starts_with("varA"), starts_with("varB"))) %>%
reduce(., function(x, y)
left_join(x, y, all = TRUE, by = "id"))
Output
id varA_2012 varB_2012 varA_2013 varB_2013 varA_2014 varB_2014
1 Belgium 2 2 3 53 10 30
2 France 1 7 34 2 9 95
3 Spain 3 9 56 5 5 75
However, if you need it in a long format, then we could pivot the data:
results %>%
pivot_longer(-id, names_to = c("variable", "year"), names_sep = "_")
Output
id variable year value
<chr> <chr> <chr> <dbl>
1 France varA 2012 1
2 France varB 2012 7
3 France varA 2013 34
4 France varB 2013 2
5 France varA 2014 9
6 France varB 2014 95
7 Belgium varA 2012 2
8 Belgium varB 2012 2
9 Belgium varA 2013 3
10 Belgium varB 2013 53
11 Belgium varA 2014 10
12 Belgium varB 2014 30
13 Spain varA 2012 3
14 Spain varB 2012 9
15 Spain varA 2013 56
16 Spain varB 2013 5
17 Spain varA 2014 5
18 Spain varB 2014 75
Or if using base R for the wide format, then we can do:
results <-
lapply(list(df_2012, df_2013, df_2014), function(x)
subset(x, select = c("id", names(x)[startsWith(names(x), "varA")], names(x)[startsWith(names(x), "varB")])))
results <-
Reduce(function(x, y)
merge(x, y, all = TRUE, by = "id"), results)

From your initial for loop attempt, it seems the code below may help
> (df <- Reduce(merge, list(df_2012, df_2013, df_2014)))[grepl("^(id|var(A|B))",names(df))]
id varA_2012 varB_2012 varA_2013 varB_2013 varA_2014 varB_2014
1 Belgium 2 2 3 53 10 30
2 France 1 7 34 2 9 95
3 Spain 3 9 56 5 5 75

Related

R moving average between data frame variables

I am trying to find a solution but haven't, yet.
I have a dataframe structured as follows:
country City 2014 2015 2016 2017 2018 2019
France Paris 23 34 54 12 23 21
US NYC 1 2 2 12 95 54
I want to find the moving average for every 3 years (i.e. 2014-16, 2015-17, etc) to be placed in ad-hoc columns.
country City 2014 2015 2016 2017 2018 2019 2014-2016 2015-2017 2016-2018 2017-2019
France Paris 23 34 54 12 23 21 37 33.3 29.7 18.7
US NYC 1 2 2 12 95 54 etc etc etc etc
Any hint?
1) Using the data shown reproducibly in the Note at the end we apply rollmean to each column in the transpose of the data and then transpose back. We rollapply the appropriate paste command to create the names.
library(zoo)
DF2 <- DF[-(1:2)]
cbind(DF, setNames(as.data.frame(t(rollmean(t(DF2), 3))),
rollapply(names(DF2), 3, function(x) paste(range(x), collapse = "-"))))
giving:
country City 2014 2015 2016 2017 2018 2019 2014-2016 2015-2017 2016-2018 2017-2019
1 France Paris 23 34 54 12 23 21 37.000000 33.333333 29.66667 18.66667
2 US NYC 1 2 2 12 95 54 1.666667 5.333333 36.33333 53.66667
2) This could also be expressed using dplyr/tidyr/zoo like this:
library(dplyr)
library(tidyr)
library(zoo)
DF %>%
pivot_longer(-c(country, City)) %>%
group_by(country, City) %>%
mutate(value = rollmean(value, 3, fill = NA),
name = rollapply(name, 3, function(x) paste(range(x), collapse="-"), fill=NA)) %>%
ungroup %>%
drop_na %>%
pivot_wider %>%
left_join(DF, ., by = c("country", "City"))
Note
Lines <- "country City 2014 2015 2016 2017 2018 2019
France Paris 23 34 54 12 23 21
US NYC 1 2 2 12 95 54 "
DF <- read.table(text = Lines, header = TRUE, as.is = TRUE, check.names = FALSE)

How to subtract each Country's value by year

I have data for each Country's happiness (https://www.kaggle.com/unsdsn/world-happiness), and I made data for each year of the reports. Now, I don't know how to get the values for each year subtracted from each other e.g. how did happiness rank change from 2015 to 2017/2016 to 2017? I'd like to make a new df of differences for each.
I was able to bind the tables for columns in common and started to work on removing Countries that don't have data for all 3 years. I'm not sure if I'm going down a complicated path.
keepcols <- c("Country","Happiness.Rank","Economy..GDP.per.Capita.","Family","Health..Life.Expectancy.","Freedom","Trust..Government.Corruption.","Generosity","Dystopia.Residual","Year")
mydata2015 = read.csv("C:\\Users\\mmcgown\\Downloads\\2015.csv")
mydata2015$Year <- "2015"
data2015 <- subset(mydata2015, select = keepcols )
mydata2016 = read.csv("C:\\Users\\mmcgown\\Downloads\\2016.csv")
mydata2016$Year <- "2016"
data2016 <- subset(mydata2016, select = keepcols )
mydata2017 = read.csv("C:\\Users\\mmcgown\\Downloads\\2017.csv")
mydata2017$Year <- "2017"
data2017 <- subset(mydata2017, select = keepcols )
df <- rbind(data2015,data2016,data2017)
head(df, n=10)
tail(df, n=10)
df15 <- df[df['Year']=='2015',]
df16 <- df[df['Year']=='2016',]
df17 <- df[df['Year']=='2017',]
nocon <- rbind(setdiff(unique(df16['Country']),unique(df17['Country'])),setdiff(unique(df15['Country']),unique(df16['Country'])))
Don't have a clear path to accomplish what I want but it would look like
df16_to_17
Country Happiness.Rank ...(other columns)
Yemen (Yemen[Happiness Rank in 2017] - Yemen[Happiness Rank in 2016])
USA (USA[Happiness Rank in 2017] - USA[Happiness Rank in 2016])
(other countries)
df15_to_16
Country Happiness.Rank ...(other columns)
Yemen (Yemen[Happiness Rank in 2016] - Yemen[Happiness Rank in 2015])
USA (USA[Happiness Rank in 2016] - USA[Happiness Rank in 2015])
(other countries)
It's very straightforward with dplyr, and involves grouping by country and then finding the differences between consecutive values with base R's diff. Just make sure to use df and not df15, etc.:
library(dplyr)
rank_diff_df <- df %>%
group_by(Country) %>%
mutate(Rank.Diff = c(NA, diff(Happiness.Rank)))
The above assumes that the data are arranged by year, which they are in your case because of the way you combined the dataframes. If not, you'll need to call arrange(Year) before the call to mutate. Filtering out countries with missing year data isn't necessary, but can be done after group_by() with filter(n() == 3).
If you would like to view the differences it would make sense to drop some variables and rearrange the data:
rank_diff_df %>%
select(Year, Country, Happiness.Rank, Rank.Diff) %>%
arrange(Country)
Which returns:
# A tibble: 470 x 4
# Groups: Country [166]
Year Country Happiness.Rank Rank.Diff
<chr> <fct> <int> <int>
1 2015 Afghanistan 153 NA
2 2016 Afghanistan 154 1
3 2017 Afghanistan 141 -13
4 2015 Albania 95 NA
5 2016 Albania 109 14
6 2017 Albania 109 0
7 2015 Algeria 68 NA
8 2016 Algeria 38 -30
9 2017 Algeria 53 15
10 2015 Angola 137 NA
# … with 460 more rows
The above data frame will work well with ggplot2 if you are planning on plotting the results.
If you don't feel comfortable with dplyr you can use base R's merge to combine the dataframes, and then create a new dataframe with the differences as columns:
df_wide <- merge(merge(df15, df16, by = "Country"), df17, by = "Country")
rank_diff_df <- data.frame(Country = df_wide$Country,
Y2015.2016 = df_wide$Happiness.Rank.y -
df_wide$Happiness.Rank.x,
Y2016.2017 = df_wide$Happiness.Rank -
df_wide$Happiness.Rank.y
)
Which returns:
head(rank_diff_df, 10)
Country Y2015.2016 Y2016.2017
1 Afghanistan 1 -13
2 Albania 14 0
3 Algeria -30 15
4 Angola 4 -1
5 Argentina -4 -2
6 Armenia -6 0
7 Australia -1 1
8 Austria -1 1
9 Azerbaijan 1 4
10 Bahrain -7 -1
Assuming the three datasets are present in your environment with the name data2015, data2016 and data2017, we can add a year column with the respective year and keep the columns which are present in keepcols vector. arrange the data by Country and Year, group_by Country, keep only those countries which are present in all 3 years and then subtract the values from previous rows using lag or diff.
library(dplyr)
data2015$Year <- 2015
data2016$Year <- 2016
data2017$Year <- 2017
df <- bind_rows(data2015, data2016, data2017)
data <- df[keepcols]
data %>%
arrange(Country, Year) %>%
group_by(Country) %>%
filter(n() == 3) %>%
mutate_at(-1, ~. - lag(.)) #OR
#mutate_at(-1, ~c(NA, diff(.)))
# A tibble: 438 x 10
# Groups: Country [146]
# Country Happiness.Rank Economy..GDP.pe… Family Health..Life.Ex… Freedom
# <chr> <int> <dbl> <dbl> <dbl> <dbl>
# 1 Afghan… NA NA NA NA NA
# 2 Afghan… 1 0.0624 -0.192 -0.130 -0.0698
# 3 Afghan… -13 0.0192 0.471 0.00731 -0.0581
# 4 Albania NA NA NA NA NA
# 5 Albania 14 0.0766 -0.303 -0.0832 -0.0387
# 6 Albania 0 0.0409 0.302 0.00109 0.0628
# 7 Algeria NA NA NA NA NA
# 8 Algeria -30 0.113 -0.245 0.00038 -0.0757
# 9 Algeria 15 0.0392 0.313 -0.000455 0.0233
#10 Angola NA NA NA NA NA
# … with 428 more rows, and 4 more variables: Trust..Government.Corruption. <dbl>,
# Generosity <dbl>, Dystopia.Residual <dbl>, Year <dbl>
The value of first row for each Year would always be NA, rest of the values would be subtracted by it's previous values.

merge two data frames based on matching rows of multiple columns

Below is the summary and structure of the two data sets I tried to merge claimants and unemp, they can me found here claims.csv and unemp.csv
> tbl_df(claimants)
# A tibble: 6,960 × 5
X County Month Year Claimants
<int> <fctr> <fctr> <int> <int>
1 1 ALAMEDA Jan 2007 13034
2 2 ALPINE Jan 2007 12
3 3 AMADOR Jan 2007 487
4 4 BUTTE Jan 2007 3496
5 5 CALAVERAS Jan 2007 644
6 6 COLUSA Jan 2007 1244
7 7 CONTRA COSTA Jan 2007 8475
8 8 DEL NORTE Jan 2007 328
9 9 EL DORADO Jan 2007 2120
10 10 FRESNO Jan 2007 19974
# ... with 6,950 more rows
> tbl_df(unemp)
# A tibble: 6,960 × 7
County Year Month laborforce emplab unemp unemprate
* <chr> <int> <chr> <int> <int> <int> <dbl>
1 Alameda 2007 Jan 743100 708300 34800 4.7
2 Alameda 2007 Feb 744800 711000 33800 4.5
3 Alameda 2007 Mar 746600 713200 33300 4.5
4 Alameda 2007 Apr 738200 705800 32400 4.4
5 Alameda 2007 May 739100 707300 31800 4.3
6 Alameda 2007 Jun 744900 709100 35800 4.8
7 Alameda 2007 Jul 749600 710900 38700 5.2
8 Alameda 2007 Aug 746700 709600 37000 5.0
9 Alameda 2007 Sep 748200 712100 36000 4.8
10 Alameda 2007 Oct 749000 713000 36100 4.8
# ... with 6,950 more rows
I thought first I should change all the factor columns to character columns.
unemp[sapply(unemp, is.factor)] <- lapply(unemp[sapply(unemp, is.factor)], as.character)
claimants[sapply(claimants, is.factor)] <- lapply(claimants[sapply(claimants, is.factor)], as.character)
m <-merge(unemp, claimants, by = c("County", "Month", "Year"))
dim(m)
[1] 0 10
In the output of dim(m), 0 rows are in the resulting dataframe. All the 6960 rows should match each other uniquely.
To verify that the two data frames have unique combination of the the 3 columns 'County', 'Month', and 'Year' I reorder and rearrange these columns within the dataframes as below:
a <- claimants[ order(claimants[,"County"], claimants[,"Month"], claimants[,"Year"]), ]
b <- unemp[ order(unemp[,"County"], unemp[,"Month"], unemp[,"Year"]), ]
b[2:4] <- b[c(2,4,3)]
a[2:4] %in% b[2:4]
[1] TRUE TRUE TRUE
This last output confirms that all 'County', 'Month', and 'Year' columns match each other in these two dataframes.
I have tried looking into the documentation for merge and could not gather where do I go wrong, I have also tried the inner_join function from dplyr:
> m <- inner_join(unemp[2:8], claimants[2:5])
Joining, by = c("County", "Year", "Month")
> dim(m)
[1] 0 8
I am missing something and don't know what, would appreciate the help with understanding this, I know I should not have to rearrange the rows by the three columns to run merge R should identify the matching rows and merge the non-matching columns.
The claimants df has the counties in all uppercase, the unemp df has them in lower case.
I used the options(stringsAsFactors = FALSE) when reading in your data. A few suggestions drop the X column in both, it doesn't seem useful.
options(stringsAsFactors = FALSE)
claims <- read.csv("claims.csv",header=TRUE)
claims$X <- NULL
unemp <- read.csv("unemp.csv",header=TRUE)
unemp$X <- NULL
unemp$County <- toupper(unemp$County)
m <- inner_join(unemp, claims)
dim(m)
# [1] 6960 8

replace NA in a dplyr chain

Question has been edited from the original.
After reading this interesting discussion I was wondering how to replace NAs in a column using dplyr in, for example, the Lahman batting data:
Source: local data frame [96,600 x 3]
Groups: teamID
yearID teamID G_batting
1 2004 SFN 11
2 2006 CHN 43
3 2007 CHA 2
4 2008 BOS 5
5 2009 SEA 3
6 2010 SEA 4
7 2012 NYA NA
The following does not work as I expected
library(dplyr)
library(Lahman)
df <- Batting[ c("yearID", "teamID", "G_batting") ]
df <- group_by(df, teamID )
df$G_batting[is.na(df$G_batting)] <- mean(df$G_batting, na.rm = TRUE)
Source: local data frame [20 x 3]
Groups: yearID, teamID
yearID teamID G_batting
1 2004 SFN 11.00000
2 2006 CHN 43.00000
3 2007 CHA 2.00000
4 2008 BOS 5.00000
5 2009 SEA 3.00000
6 2010 SEA 4.00000
7 2012 NYA **49.07894**
> mean(Batting$G_battin, na.rm = TRUE)
[1] **49.07894**
In fact it imputed the overall mean and not the group mean. How would you do this in a dplyr chain? Using transform from base R also does not work as it imputed the overall mean and not the group mean. Also this approach converts the data to a regular dat. a frame. Is there a better way to do this?
df %.%
group_by( yearID ) %.%
transform(G_batting = ifelse(is.na(G_batting),
mean(G_batting, na.rm = TRUE),
G_batting)
)
Edit: Replacing transform with mutate gives the following error
Error in mutate_impl(.data, named_dots(...), environment()) :
INTEGER() can only be applied to a 'integer', not a 'double'
Edit: Adding as.integer seems to resolve the error and does produce the expected result. See also #eddi's answer.
df %.%
group_by( teamID ) %.%
mutate(G_batting = ifelse(is.na(G_batting), as.integer(mean(G_batting, na.rm = TRUE)), G_batting))
Source: local data frame [96,600 x 3]
Groups: teamID
yearID teamID G_batting
1 2004 SFN 11
2 2006 CHN 43
3 2007 CHA 2
4 2008 BOS 5
5 2009 SEA 3
6 2010 SEA 4
7 2012 NYA 47
> mean_NYA <- mean(filter(df, teamID == "NYA")$G_batting, na.rm = TRUE)
> as.integer(mean_NYA)
[1] 47
Edit: Following up on #Romain's comment I installed dplyr from github:
> head(df,10)
yearID teamID G_batting
1 2004 SFN 11
2 2006 CHN 43
3 2007 CHA 2
4 2008 BOS 5
5 2009 SEA 3
6 2010 SEA 4
7 2012 NYA NA
8 1954 ML1 122
9 1955 ML1 153
10 1956 ML1 153
> df %.%
+ group_by(teamID) %.%
+ mutate(G_batting = ifelse(is.na(G_batting), mean(G_batting, na.rm = TRUE), G_batting))
Source: local data frame [96,600 x 3]
Groups: teamID
yearID teamID G_batting
1 2004 SFN 0
2 2006 CHN 0
3 2007 CHA 0
4 2008 BOS 0
5 2009 SEA 0
6 2010 SEA 1074266112
7 2012 NYA 90693125
8 1954 ML1 122
9 1955 ML1 153
10 1956 ML1 153
.. ... ... ...
So I didn't get the error (good) but I got a (seemingly) strange result.
The main issue you're having is that mean returns a double while the G_batting column is an integer. So wrapping the mean in as.integer would work, or you'd need to convert the entire column to numeric I guess.
That said, here are a couple of data.table alternatives - I didn't check which one is faster.
library(data.table)
# using ifelse
dt = data.table(a = 1:2, b = c(1,2,NA,NA,3,4,5,6,7,8))
dt[, b := ifelse(is.na(b), mean(b, na.rm = T), b), by = a]
# using a temporary column
dt = data.table(a = 1:2, b = c(1,2,NA,NA,3,4,5,6,7,8))
dt[, b.mean := mean(b, na.rm = T), by = a][is.na(b), b := b.mean][, b.mean := NULL]
And this is what I'd want to do ideally (there is an FR about this):
# again, atm this is pure fantasy and will not work
dt[, b[is.na(b)] := mean(b, na.rm = T), by = a]
The dplyr version of the ifelse is (as in OP):
dt %>% group_by(a) %>% mutate(b = ifelse(is.na(b), mean(b, na.rm = T), b))
I'm not sure how to implement the second data.table idea in a single line in dplyr. I'm also not sure how you can stop dplyr from scrambling/ordering the data (aside from creating an index column).

R - Function to create a data.frame containing manipulated data from another data.frame

Hi I am new to R and have a question. I have a data.frame (df) containing about 30 different types of statistics from years 1960-2012 for about 100 different countries. Here is an example of what it looks like:
Country Statistic.Type 1960 1961 1962 1963 ... 2012
__________________________________________________________________________________
1 Albania Death Rate 10 21 13 24 25
2 Albania Birth Rate 7 15 6 10 9
3 Albania Life Expectancy 8 12 10 7 20
4 Albania Population 10 30 27 18 13
5 Brazil Death Rate 14 20 22 13 18
6 Brazil Birth Rate ...
7 Brazil Life Expectancy ...
8 Brazil Population ...
9 Cambodia Death Rate ...
10 Cambodia Birth Rate ... etc...
Note that there are 55 columns in total and the values in each of the 53 year columns are made up for the purposes of this question.
I need help writing a function which takes as inputs the country and statistic type and returns a new data.frame with 2 columns which shows the year and value in each year for a given country and statistic type. For example, if I input country=Brazil and statistic.type=Death Rate into the function, the new data.frame should look like:
Year Value
_____________________
1 1960 14
2 1961 20
3 1962 22
...
51 2012 18
I have no idea on how to do this, if anyone can give me any ideas/code/packages to install then that would be very helpful.
Thank you so much!
If df is your data.frame, all you need is this:
f <- function(country, statistic.type, data=df)
{
values <- data[data$Country==country & data$Statistic.Type==statistic.type,-(1:2)]
cbind(Year=names(df)[-(1:2)], Value=values)
}
Use it as
f(country="Brazil", statistic.type="Death Rate")
You will probably have to do some split operation on the total data set to have country individual datasets.
https://stat.ethz.ch/pipermail/r-help/2008-February/155328.html
Then use the melt function for each subset of data. In your case, adapted from
http://www.statmethods.net/management/reshape.html, where mydata is the already splitted data:
% example of melt function
library(reshape)
mdata <- melt(mydata, id=c("Year"))
That is it.
You could just combine subset with stack, with maybe a gsub in there to leave only the numbers in your column of years:
df <- expand.grid(
"country" = c("A", "B"),
"statistic" = c("c", "d", "e", "f"),
stringsAsFactors = FALSE)
df$year1980 <- rnorm(8)
df$year1990 <- rnorm(8)
df$year2000 <- rnorm(8)
getYears <- function(input, cntry, stat) {
x <- subset(input, country == cntry & stat == statistic,
select = -c(country, statistic))
x <- stack(x)[,c("ind", "values")]
x$ind <- gsub("\\D", "", x$ind)
x
}
getYears(df, "A", "c")
ind values
1 1980 1.1421309
2 1990 1.0777974
3 2000 -0.2010913

Resources