Reshaping df into data panel model - r

I have the following sets of data:
df1 <- data.frame( country = c("A", "B","A","B"), year = c(2011,2011,2012,2012), variable_1= c(1,3,5,7))
df2 <- data.frame( country = c("A", "B","A","B"), year = c(2011,2012,2012,2013), variable_2= c(2,4,6,8))
df3 <- data.frame( country = c("A", "C","C"), year = c(2011,2011,2013), variable_3= c(9,9,9))
I want to reshape them into a panel data model, so I can get the following result:
df4 <- data.frame( country = c("A","A","A","B","B","B","C","C","C"), year = c(2011,2012,2013,2011,2012,2013,2011,2012,2013), variable_1 = c(1,5,NA,3,7,NA,NA,NA,NA), variable_2 = c(2,6,NA,NA,4,8,NA,NA,NA), variable_3 = c(9,NA,NA,NA,NA,NA,9,NA,9) )
I have searched for this info, but the topics I found (Reshaping panel data) didn't help me.
Any ideas on how to do that? My real data sets have thousands of rows ("countries"), several variables, many years, and many NAs, so please take that into account.

Try
library(tidyr)
library(dplyr)
Reduce(full_join, list(df1, df2, df3)) %>%
  complete(country, year)
Which gives:
#Source: local data frame [9 x 5]
#
#  country  year variable_1 variable_2 variable_3
#    (chr) (dbl)      (dbl)      (dbl)      (dbl)
#1       A  2011          1          2          9
#2       A  2012          5          6         NA
#3       A  2013         NA         NA         NA
#4       B  2011          3         NA         NA
#5       B  2012          7          4         NA
#6       B  2013         NA          8         NA
#7       C  2011         NA         NA          9
#8       C  2012         NA         NA         NA
#9       C  2013         NA         NA          9
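For readers who would rather avoid extra packages, here is a minimal base R sketch of the same idea: a full outer join of all tables, followed by a join against every country-year combination. It assumes the key columns are always named country and year; the object names merged, grid, and panel are illustrative and not part of the answer above.

```r
df1 <- data.frame(country = c("A", "B", "A", "B"), year = c(2011, 2011, 2012, 2012), variable_1 = c(1, 3, 5, 7))
df2 <- data.frame(country = c("A", "B", "A", "B"), year = c(2011, 2012, 2012, 2013), variable_2 = c(2, 4, 6, 8))
df3 <- data.frame(country = c("A", "C", "C"), year = c(2011, 2011, 2013), variable_3 = c(9, 9, 9))

# Full outer join of all three tables on the two key columns
merged <- Reduce(function(x, y) merge(x, y, by = c("country", "year"), all = TRUE),
                 list(df1, df2, df3))

# Every country-year combination that should appear in the panel
grid <- expand.grid(country = unique(merged$country),
                    year = unique(merged$year),
                    stringsAsFactors = FALSE)

# Left join the merged data onto the full grid; missing cells become NA
panel <- merge(grid, merged, by = c("country", "year"), all.x = TRUE)
panel <- panel[order(panel$country, panel$year), ]
rownames(panel) <- NULL
panel
```

With thousands of countries, passing the list of data frames to Reduce() keeps the code the same length no matter how many tables are involved.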

Related

Pulling out values that match row and column name in another dataframe in R

I am trying to take a data frame that looks like this:
desc<- c("a", "b", "c", "d")
df <- data.frame(matrix(ncol = 3 , nrow = 12))
colnames(df)<- c("Date", "Value", "Desc")
df[1:4,'Date'] <- "2017-01-01"
df[5:8,'Date'] <- "2018-01-01"
df[9:12,'Date'] <- "2019-01-01"
df[1:12,'Desc'] <- desc
df[1:12,'Value'] <- 1:12
head(df)
And change it into one that has the dates as columns and "Desc" as row names, like this:
  2017-01-01 2018-01-01 2019-01-01
a         NA         NA         NA
b         NA         NA         NA
c         NA         NA         NA
d         NA         NA         NA
I have gotten as far as a few if loops to get my DF set up but cannot for the life of me think of how to pull out the value and put it into the new dataframe. Any help would be so appreciated at this point! I am also a student so I apologize if I missed any steps in posting this or did it the wrong way.
My brain couldn't wrap itself around how to match both values and then put it in the right position.
We may use xtabs in base R
xtabs(Value ~ Desc + Date, df)
-output
    Date
Desc 2017-01-01 2018-01-01 2019-01-01
   a          1          5          9
   b          2          6         10
   c          3          7         11
   d          4          8         12
Not sure if this is within the scope of what you are working on but I'd use the tidyverse functions for this since this seems to require pivoting the data frame from long to wide:
install.packages("tidyverse")
library("tidyverse")
df_wide <- df %>%
  # Blank out the values so the result matches the empty table in the question
  mutate(Value = NA) %>%
  pivot_wider(names_from = Date, values_from = Value)
df_wide
# A tibble: 4 × 4
Desc `2017-01-01` `2018-01-01` `2019-01-01`
<chr> <lgl> <lgl> <lgl>
1 a NA NA NA
2 b NA NA NA
3 c NA NA NA
4 d NA NA NA
More info about tidyverse functions here:
https://www.tidyverse.org
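If the goal is to carry the actual values into the wide table rather than an NA skeleton, the same pivot should work without the mutate() step. The sketch below rebuilds the question's data in a more compact way (an assumption made for brevity; the contents match the question's df).

```r
library(tidyr)

# Same content as the question's df, built in one call
df <- data.frame(
  Date  = rep(c("2017-01-01", "2018-01-01", "2019-01-01"), each = 4),
  Value = 1:12,
  Desc  = rep(c("a", "b", "c", "d"), times = 3)
)

# No mutate(Value = NA) step, so the original values are kept
df_wide <- pivot_wider(df, names_from = Date, values_from = Value)
df_wide
```

Each Desc/Date pair occurs exactly once here, so every cell receives a single value; duplicated pairs would need an aggregation choice first.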

compare sets of columns in R dataframe and keep one value from each set of two columns

Basically, I have a large dataset with many different variables. The data comes in column pairs (2019 and 2020): for some variables no data is available for either year, for some only 2019, and for some only 2020. I would like the 2020 data to override the 2019 data whenever a 2020 value is available; if only the 2019 value is available, it should be kept. If no data is available for either year, the value should stay missing. I currently do this with a little helper function, but it should be more scalable so that I can apply it to 200+ column pairs. What am I missing in mutate(across(...))?
# Create data
library(dplyr)  # provides tibble(), mutate(), across(), select() and %>%
mydf <- tibble(ID = 1:5,
               var1_2019 = c(9, NA, 3, 2, NA),
               var1_2020 = c(NA, NA, 3, 2, 4),
               var2_2019 = c("A", "B", NA, "D", "C"),
               var2_2020 = c(NA, "B", NA, "R", NA),
               var3_2019 = c(T, F, NA, NA, NA),
               var3_2020 = c(NA, NA, NA, NA, F))
# create little helper function. this is good because
# it could be made more complex in the future,
# for example for numeric variables keeping the larger of the two
which_to_keep_f <-
  function(x, y) {
    if (is.na(x) && is.na(y)) {
      output <- NA
    }
    if (is.na(x) && !is.na(y)) {
      output <- y
    }
    if (!is.na(x) && is.na(y)) {
      output <- x
    }
    if (!is.na(x) && !is.na(y)) {
      output <- y
    }
    output
  }
# vectorize it
which_to_keep_f_vec <- Vectorize(which_to_keep_f)
# use function inside mutate
mydf %>%
  mutate(var1 = which_to_keep_f_vec(var1_2019, var1_2020)) %>%
  mutate(var2 = which_to_keep_f_vec(var2_2019, var2_2020)) %>%
  mutate(var3 = which_to_keep_f_vec(var3_2019, var3_2020)) %>%
  select(-contains("_20"))
Solution
Thanks to TarJae and micahkimel I got to 99% of the solution. This is the complete solution (including dropping the variables that are no longer needed and renaming the variables to their desired format)
mydf %>%
  mutate(across(ends_with('_2019'),
                ~(which_to_keep_f_vec(.,
                    get(stringr::str_replace(cur_column(), "_2019$", "_2020"))))) %>%
           tidyr::unnest(cols=c())) %>%
  select(-contains("_2020")) %>%
  rename_all(~ stringr::str_replace(., stringr::regex("_2019$", ignore_case = TRUE), ""))
Update: Thanks to micahkimel removing list to not duplicate the data:
Is this what you are looking for? Here we apply your function to each pair of columns:
library(dplyr)
library(stringr)
library(tidyr)  # for unnest()
mydf %>%
  mutate(across(ends_with('_2019'),
                ~(which_to_keep_f_vec(.,
                    get(str_replace(cur_column(), "_2019$", "_2020"))))) %>%
           unnest(cols=c()))
     ID var1_2019 var1_2020 var2_2019 var2_2020 var3_2019 var3_2020
  <int>     <dbl>     <dbl> <chr>     <chr>     <lgl>     <lgl>
1     1         9        NA A         NA        TRUE      NA
2     2        NA        NA B         B         FALSE     NA
3     3         3         3 NA        NA        NA        NA
4     4         2         2 R         R         NA        NA
5     5         4         4 C         NA        FALSE     FALSE
Here's an approach that results in just one variable for each pair of variables in your input table. First, use pivot_longer() to collapse the pairs into single variables, and add year as a column (with twice as many observations).
mydf_long = mydf %>%
  pivot_longer(cols = matches("_20"), names_to = c(".value", "year"),
               names_sep = "_")
ID year var1 var2 var3
<int> <chr> <dbl> <chr> <lgl>
1 1 2019 9 A TRUE
2 1 2020 NA NA NA
3 2 2019 NA B FALSE
4 2 2020 NA B NA
5 3 2019 3 NA NA
6 3 2020 3 NA NA
7 4 2019 2 D NA
8 4 2020 2 R NA
9 5 2019 NA C NA
10 5 2020 4 NA FALSE
Next, use fill() to populate later NA values with earlier non-missing values. Then we can just filter to the most recent year (2020). For each variable, that year will have its own value if it had one before; otherwise, it will carry over the value from the previous year.
mydf_long %>%
  group_by(ID) %>%
  fill(var1, var2, var3) %>%
  filter(year == 2020)
ID year var1 var2 var3
<int> <chr> <dbl> <chr> <lgl>
1 1 2020 9 A TRUE
2 2 2020 NA B FALSE
3 3 2020 3 NA NA
4 4 2020 2 R NA
5 5 2020 4 C FALSE
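Because the helper in the question always returns the 2020 value when it is present and the 2019 value otherwise, the same rule can also be sketched as a plain loop over column-name stems in base R. This is only an illustration under that assumption: the names stems and out are hypothetical, and a more elaborate rule (for example keeping the larger of two numbers) would need a different expression inside the loop.

```r
mydf <- data.frame(ID = 1:5,
                   var1_2019 = c(9, NA, 3, 2, NA),
                   var1_2020 = c(NA, NA, 3, 2, 4),
                   var2_2019 = c("A", "B", NA, "D", "C"),
                   var2_2020 = c(NA, "B", NA, "R", NA),
                   var3_2019 = c(TRUE, FALSE, NA, NA, NA),
                   var3_2020 = c(NA, NA, NA, NA, FALSE),
                   stringsAsFactors = FALSE)

# Derive "var1", "var2", ... from the *_2019 column names
stems <- sub("_2019$", "", grep("_2019$", names(mydf), value = TRUE))

out <- mydf["ID"]
for (s in stems) {
  x <- mydf[[paste0(s, "_2019")]]
  y <- mydf[[paste0(s, "_2020")]]
  # Prefer the 2020 value; fall back to 2019 when 2020 is missing
  out[[s]] <- ifelse(is.na(y), x, y)
}
out
```

The loop works on whole columns at once, so it should scale to 200+ pairs without Vectorize().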

R: How to fill up missing year values in a data frame

I have a quite basic R question. I have the following data, where the year column does not advance in 1-year steps.
year <- c(1991,1993,1996)
value <-c(3, NA, 4)
However, for plotting a line chart, I want to fill the missing years so that I have a series from 1990 to 2000 in 1-year steps. The additional years shall be filled with NA values.
Is there a smart solution to this problem?
We can use complete from tidyr.
dat <- data.frame(
year = c(1991,1993,1996),
value = c(3, NA, 4)
)
library(dplyr)
library(tidyr)
dat2 <- dat %>%
  complete(year = 1990:2000)
print(dat2)
# # A tibble: 11 x 2
# year value
# <dbl> <dbl>
# 1 1990 NA
# 2 1991 3
# 3 1992 NA
# 4 1993 NA
# 5 1994 NA
# 6 1995 NA
# 7 1996 4
# 8 1997 NA
# 9 1998 NA
# 10 1999 NA
# 11 2000 NA
Using base R to generate a sequence from 1990 to 2000 and merge with original data.frame.
df1 <- data.frame(year = c(1991, 1993, 1996),
value = c(3, NA, 4))
merge(df1,
      data.frame(full = seq(1990, 2000)),
      by.x = "year",
      by.y = "full",
      all = TRUE)
year value
1 1990 NA
2 1991 3
3 1992 NA
4 1993 NA
5 1994 NA
6 1995 NA
7 1996 4
8 1997 NA
9 1998 NA
10 1999 NA
11 2000 NA
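The same table can also be built without a join by looking up each target year with match(), which returns NA for years that are absent from the original vector. A minimal sketch, where all_years and filled are illustrative names:

```r
year  <- c(1991, 1993, 1996)
value <- c(3, NA, 4)

all_years <- 1990:2000

# match() gives the position of each target year in `year`, or NA if absent;
# indexing `value` with NA yields NA, which fills the gaps
filled <- data.frame(year = all_years,
                     value = value[match(all_years, year)])
filled
```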
We assume that what you have is:
dd <- data.frame(year, value)
This is a time series so it makes sense to represent it using a time series representation such as ts, zoo or xts. We convert it to zoo and then to ts. The latter conversion will fill in the missing years.
library(zoo)
z <- read.zoo(dd)
tt <- as.ts(z)
tt
## Time Series:
## Start = 1991
## End = 1996
## Frequency = 1
## [1] 3 NA NA NA NA 4
If you really want to convert it to a data frame then use fortify.zoo(tt).
Plotting
If the only reason to do this is to plot a line chart, then alternatively just remove the missing values. Any of these will work.
plot(na.omit(dd), type = "l", xlab = "year", ylab = "value")
plot(na.omit(z), xlab = "year", ylab = "value")
library(ggplot2)
autoplot(na.omit(z)) + xlab("year") + ylab("value")
The last plot is shown here:

Creating a variable by group for sample data

I have a sample data base (which I did not make myself) as follows:
panelID= c(1:50)
year= c(2005, 2010)
country = c("A", "B", "C", "D", "E", "F", "G", "H", "I", "J")
n <- 2
library(data.table)
set.seed(123)
DT <- data.table(country = rep(sample(country, length(panelID), replace = TRUE), each = n),
                 year = c(replicate(length(panelID), sample(year, n))))
DT[, uniqueID := .I] # Creates a unique ID
DT[DT == 0] <- NA
DT$sales[DT$sales< 0] <- NA
DT <- as.data.frame(DT)
I am always struggling when I want to create a new variable which has to meet certain conditions.
I would like to create a tax rate for my sample database. The tax rate has to be the same per country-year, lie between 10% and 40%, and differ by no more than 5 percentage points within a country.
I cannot seem to figure out how to do it. It would be great if someone could point me in the right direction.
Not 100% sure what you are looking for. You could use dplyr:
library(dplyr)
DT %>%
  group_by(country) %>%
  mutate(base_rate = as.integer(runif(1, 12.5, 37.5))) %>%
  group_by(country, year) %>%
  mutate(tax_rate = base_rate + as.integer(runif(1,-2.5,+2.5)))
which returns
# A tibble: 100 x 6
# Groups: country, year [20]
country year uniqueID sales base_rate tax_rate
<chr> <dbl> <int> <lgl> <int> <int>
1 C 2005 1 NA 26 26
2 C 2010 2 NA 26 26
3 C 2010 3 NA 26 26
4 C 2005 4 NA 26 26
5 J 2005 5 NA 21 21
6 J 2010 6 NA 21 20
7 B 2010 7 NA 20 20
8 B 2005 8 NA 20 22
9 F 2010 9 NA 26 26
10 F 2005 10 NA 26 26
I first created a random base_rate per country and then a random tax_rate per country and year.
I used integer but you could easily replace them with real percentage values.
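As a rough illustration of the non-integer variant, the base R sketch below draws one base rate per country and one offset per country-year, then merges the rates back onto the rows. The toy DT, the seed, and the names countries, base_rate, and keys are assumptions made for this example; only the interval bounds are taken from the answer above.

```r
set.seed(42)

# Toy stand-in for the question's data: 3 countries, 2 years, repeated rows
DT <- data.frame(country = rep(c("A", "B", "C"), each = 4),
                 year = rep(c(2005, 2010), times = 6),
                 stringsAsFactors = FALSE)

# One base rate per country, chosen so that base +/- 2.5 stays within [10, 40]
countries <- unique(DT$country)
base_rate <- setNames(runif(length(countries), 12.5, 37.5), countries)

# One offset per country-year, at most 2.5 points either side of the base rate
keys <- unique(DT[c("country", "year")])
keys$tax_rate <- unname(base_rate[keys$country]) + runif(nrow(keys), -2.5, 2.5)

# Attach the rate to every row of the matching country-year
DT <- merge(DT, keys, by = c("country", "year"))
head(DT)
```

Since every offset lies within 2.5 points of the country's base rate, two rates for the same country can differ by at most 5 points.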

Fill missing values using last or previous observation R

Suppose I have the following table:
ID  Name  Country
1   A     Nor
2   B     Nor
3   C     Nor
4   D     Nor
and I have another table:
ID  Name  Country
1   A
2         Bel
3         Bel
4         Bel
the result I want to get is:
ID  Name  Country
1   A     Nor
2   B     Bel
3   C     Bel
4   D     Bel
Basically, I would like to create a third table that gives priority to the second table but fills its missing fields with values from the first table, matched by ID. Any help on how to do this in base R will be much appreciated.
You can get the logical vector representing the locations of the NA values using is.na(df2).
You can then set the NA elements of df2 to be the corresponding elements in df.
df <- data.frame(
ID = 1:4,
Name = LETTERS[1:4],
Country = "Nor",
stringsAsFactors = F)
df2 <- data.frame(
ID = 1:4,
Name = c("A", NA, NA, NA),
Country = c(NA, "Bel", "Bel", "Bel"),
stringsAsFactors = F)
df2[is.na(df2)] <- df[is.na(df2)]
df2
#> ID Name Country
#> 1 1 A Nor
#> 2 2 B Bel
#> 3 3 C Bel
#> 4 4 D Bel
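The assignment above relies on both tables having their rows in the same order. If that cannot be guaranteed, one option is to align the rows by ID with match() first. The sketch below shuffles df2 on purpose to show this; idx, gaps, and result are illustrative names, and the column list is assumed to be Name and Country.

```r
df <- data.frame(ID = 1:4, Name = LETTERS[1:4], Country = "Nor",
                 stringsAsFactors = FALSE)
# Second table with its rows deliberately out of order
df2 <- data.frame(ID = c(3, 1, 4, 2),
                  Name = c(NA, "A", NA, NA),
                  Country = c("Bel", NA, "Bel", "Bel"),
                  stringsAsFactors = FALSE)

# For each row of df2, the row of df that has the same ID
idx <- match(df2$ID, df$ID)

result <- df2
for (col in c("Name", "Country")) {
  gaps <- is.na(result[[col]])
  # Fill only the missing cells, taking the value from the matching ID in df
  result[[col]][gaps] <- df[[col]][idx[gaps]]
}
result <- result[order(result$ID), ]
result
```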
You can try a tidyverse solution
library(tidyverse)
d1 %>%
  left_join(d2, by="ID") %>%
  mutate(Country=case_when(
    is.na(Country.y) ~ as.character(Country.x),
    is.na(Name.y) ~ as.character(Country.y)
  )) %>%
  select(ID, Name=Name.x, Country)
ID Name Country
1 1 A Nor
2 2 B Bel
3 3 C Bel
4 4 D Bel
The case_when part is easily and freely expandable.
Data
d1 <- read.table(text="ID Name Country
1 A Nor
2 B Nor
3 C Nor
4 D Nor", header=T)
d2 <- read.table(text="ID Name Country
1 A NA
2 NA Bel
3 NA Bel
4 NA Bel", header=T)
Supposing the order is strictly the same and that df1 and df2 have the same size and that df1 has all the names defined (if not you need to go through a left_join). And well it is not base R but dplyr is a must have ;)
df3 <- dplyr::mutate(df1, Country = ifelse(is.na(df2$Country), Country, df2$Country))
Basically, this takes df1 as the baseline (so as to keep the Name column) and replaces the Country column with the value from df2 unless that value is NA.
(if you already have dplyr called then remove dplyr::).
with df1
ID  Name  Country
1   A     Nor
2   B     Nor
3   C     Nor
4   D     Nor
and df2
ID  Name  Country
1   A
2         Bel
3         Bel
4         Bel
ps: I voted for #Paul for the base solution ... very neat.

Resources