R: How to fill in missing year values in a data frame

I have a fairly basic R question. I have the following data frame, whose year column has gaps rather than 1-year steps.
year <- c(1991, 1993, 1996)
value <- c(3, NA, 4)
However, for plotting a line chart, I want to fill in the missing years so that I have a series from 1990 to 2000 in 1-year steps. The additional years should be filled with NA values.
Is there a smart solution to this problem?

We can use complete from tidyr.
dat <- data.frame(
  year = c(1991, 1993, 1996),
  value = c(3, NA, 4)
)
library(dplyr)
library(tidyr)
dat2 <- dat %>%
  complete(year = 1990:2000)
print(dat2)
# # A tibble: 11 x 2
# year value
# <dbl> <dbl>
# 1 1990 NA
# 2 1991 3
# 3 1992 NA
# 4 1993 NA
# 5 1994 NA
# 6 1995 NA
# 7 1996 4
# 8 1997 NA
# 9 1998 NA
# 10 1999 NA
# 11 2000 NA
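If you would rather derive the sequence from the data instead of hardcoding it, tidyr's full_seq can build it. A minimal sketch (note that full_seq only spans the observed minimum and maximum, 1991:1996 here, so the wider 1990:2000 window still needs the explicit range):
dat %>%
  complete(year = full_seq(year, period = 1))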

Using base R: generate a sequence from 1990 to 2000 and merge it with the original data.frame.
df1 <- data.frame(year = c(1991, 1993, 1996),
                  value = c(3, NA, 4))
merge(df1,
      data.frame(full = seq(1990, 2000)),
      by.x = "year",
      by.y = "full",
      all = TRUE)
year value
1 1990 NA
2 1991 3
3 1992 NA
4 1993 NA
5 1994 NA
6 1995 NA
7 1996 4
8 1997 NA
9 1998 NA
10 1999 NA
11 2000 NA

We assume that what you have is:
dd <- data.frame(year, value)
This is a time series so it makes sense to represent it using a time series representation such as ts, zoo or xts. We convert it to zoo and then to ts. The latter conversion will fill in the missing years.
library(zoo)
z <- read.zoo(dd)
tt <- as.ts(z)
tt
## Time Series:
## Start = 1991
## End = 1996
## Frequency = 1
## [1] 3 NA NA NA NA 4
If you really want to convert it to a data frame, then use fortify.zoo(tt).
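For example, a minimal sketch (the renaming is an assumption, since fortify.zoo uses its own column names):
dd2 <- fortify.zoo(tt)
names(dd2) <- c("year", "value")
dd2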
Plotting
If the only reason to do this is for plotting a line chart, then alternatively just remove the missing values. Any of these will work.
plot(na.omit(dd), type = "l", xlab = "year", ylab = "value")
plot(na.omit(z), xlab = "year", ylab = "value")
library(ggplot2)
autoplot(na.omit(z)) + xlab("year") + ylab("value")

Related

R: subset dataframe where certain variables have no observations

I have a dataframe that looks like
country   sector   data1   data2
France    1        7       .
France    2        10      .
belgium   1        12      7
belgium   2        14      8
I want to drop columns that are missing for a country in all sectors. In this example I would like to drop/exclude data2 because it is missing for sectors 1 and 2 for France. To be clear, I would also be throwing out the values of data2 for belgium in this example.
My expected output would look like
country   sector   data1
France    1        7
France    2        10
belgium   1        12
belgium   2        14
data2 is now excluded because it had a complete set of missing values across all sectors in France.
We may group by country, create logical columns where the count of NA elements equals the group size, ungroup, replace the corresponding columns with NA based on the logical columns, and then remove those columns in select.
library(dplyr)
library(stringr)
df1 %>%
  group_by(country) %>%
  mutate(across(everything(), ~ sum(is.na(.x)) == n(),
                .names = "{.col}_lgl")) %>%
  ungroup %>%
  mutate(across(names(df1)[-1],
                ~ if(any(get(str_c(cur_column(), "_lgl")))) NA else .x)) %>%
  select(c(where(~ !is.logical(.x) && any(complete.cases(.x)))))
-output
# A tibble: 4 × 3
country sector data1
<chr> <int> <int>
1 France 1 7
2 France 2 10
3 belgium 1 12
4 belgium 2 14
If we don't use group_by, the steps can be simplified as shown in Maël's post, i.e. do the grouping with a base R function within select; either tapply or ave can work.
df1 %>%
  select(where(~ !any(tapply(is.na(.x), df1[["country"]],
                             FUN = all))))
data
df1 <- structure(list(country = c("France", "France", "belgium", "belgium"
), sector = c(1L, 2L, 1L, 2L), data1 = c(7L, 10L, NA, 14L), data2 = c(NA,
NA, 7L, 8L)), row.names = c(NA, -4L), class = "data.frame")
In base R:
df1 <- read.table(header = T, text = "country sector data1 data2
France 1 7 NA
France 2 10 2
belgium 1 12 7
belgium 2 14 8")
df2 <- read.table(header = T, text = "country sector data1 data2
France 1 7 NA
France 2 10 NA
belgium 1 12 7
belgium 2 14 8")
df1[!sapply(df1, \(x) any(ave(x, df1$country, FUN = \(y) all(is.na(y)))))]
# country sector data1 data2
# 1 France 1 7 NA
# 2 France 2 10 2
# 3 belgium 1 12 7
# 4 belgium 2 14 8
df2[!sapply(df2, \(x) any(ave(x, df2$country, FUN = \(y) all(is.na(y)))))]
# country sector data1
# 1 France 1 7
# 2 France 2 10
# 3 belgium 1 12
# 4 belgium 2 14
Note: \(x) is shorthand for function(x) (available from R 4.1.0).
For a base R solution, you can use the apply family on the column names and keep only the columns that contain no NA values:
keep_remove <- sapply(names(data), \(x) all(!is.na(data[[x]])))
data <- data[, keep_remove]

How to speed up a nested for-loop in R

I have two datasets, and one of them is very big. I'm trying to run the following loop to create a treatment column, treatment, in the dataset a. However, it is way too slow. I looked for ways to speed up for-loops, like vectorization or defining conditions outside the loops; however, I'm having a hard time applying those methods since I have two datasets I'm conditioning on.
Here is my code:
reform_loop <- function(a, b){
  for(i in 1:nrow(a)) {
    for(j in 1:nrow(b)){
      if(!is.na(a[i, "treatment"])){break}
      a[i, "treatment"] <- case_when(
        a[i, "country_code"] == b[j, "country_code"] &
          a[i, "birth_year"] >= b[j, "cohort"] &
          a[i, "birth_year"] <= b[j, "upper_cutoff"] ~ 1,
        a[i, "country_code"] == b[j, "country_code"] &
          a[i, "birth_year"] < b[j, "cohort"] &
          a[i, "birth_year"] >= b[j, "lower_cutoff"] ~ 0)
    }
  }
  return(a)
}
a <- reform_loop(a, b)
a <- reform_loop(a, b)
You can find a sample dataset below. Dataset a is an individual-level dataset with birth year information and dataset b is country-level data with some country reform information. treatment is 1 if the birth_year is between cohort and upper_cutoff, and 0 if it is between lower_cutoff and cohort, in a specific country, which means the country_code variables should also be matched. Anything else should be NA.
#individual level data, birth years
a <- data.frame(country_code = c(2, 2, 2, 10, 10, 10, 10, 8),
                birth_year = c(1920, 1930, 1940, 1970, 1980, 1990, 2000, 1910))
#country level reform info with affected cohorts
b <- data.frame(country_code = c(2, 10, 10, 11),
                lower_cutoff = c(1928, 1975, 1907, 1934),
                upper_cutoff = c(1948, 1995, 1927, 1948),
                cohort = c(1938, 1985, 1917, 1942))
The following is the result I want to get:
treatment <- c(NA, 0, 1, NA, 0, 1, NA, NA)
Unfortunately, I cannot merge these two datasets since most of the countries in my dataset have more than one reform.
Any ideas on how I can speed up this code? Thank you so much in advance!
This is a range-based non-equi join. As such, this can be done with data.table or fuzzyjoin or sqldf.
data.table
library(data.table)
setDT(a)
setDT(b)
b[, treatment := 1L]
a[b, treatment := i.treatment, on = .(country_code, birth_year >= lower_cutoff, birth_year <= upper_cutoff)]
a[is.na(treatment), treatment := 0L]
a
# country_code birth_year treatment
# <num> <num> <int>
# 1: 2 1920 0
# 2: 2 1930 1
# 3: 2 1940 1
# 4: 10 1970 0
# 5: 10 1980 1
# 6: 10 1990 1
# 7: 10 2000 0
# 8: 8 1910 0
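As an aside, the expected output in the question keeps NA outside the reform windows and splits 0/1 at the cohort. A sketch of that variant with two non-equi update joins (the exact boundary handling is an assumption):
a[, treatment := NA_integer_]
a[b, treatment := 0L, on = .(country_code, birth_year >= lower_cutoff, birth_year < cohort)]
a[b, treatment := 1L, on = .(country_code, birth_year >= cohort, birth_year <= upper_cutoff)]
a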
sqldf
out <- sqldf::sqldf(
  "select a.*, b.treatment from a left join b
     on a.country_code = b.country_code
    and a.birth_year between b.lower_cutoff and b.upper_cutoff")
out$treatment[is.na(out$treatment)] <- 0L
out
# country_code birth_year treatment
# 1 2 1920 0
# 2 2 1930 1
# 3 2 1940 1
# 4 10 1970 0
# 5 10 1980 1
# 6 10 1990 1
# 7 10 2000 0
# 8 8 1910 0
fuzzyjoin
fuzzyjoin::fuzzy_left_join(
  a, b,
  by = c("country_code" = "country_code",
         "birth_year" = "lower_cutoff",
         "birth_year" = "upper_cutoff"),
  match_fun = list(`==`, `>=`, `<=`))
# country_code.x birth_year country_code.y lower_cutoff upper_cutoff cohort treatment
# 1 2 1920 NA NA NA NA NA
# 2 2 1930 2 1928 1948 1938 1
# 3 2 1940 2 1928 1948 1938 1
# 4 10 1970 NA NA NA NA NA
# 5 10 1980 10 1975 1995 1985 1
# 6 10 1990 10 1975 1995 1985 1
# 7 10 2000 NA NA NA NA NA
# 8 8 1910 NA NA NA NA NA
and then you need to clean up the extra columns (and fill 0 for NA).
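A minimal base R sketch of that cleanup, assuming the join result above is stored in joined (a hypothetical name) and that b still carries the treatment column created in the data.table section:
joined <- fuzzyjoin::fuzzy_left_join(
  a, b,
  by = c("country_code" = "country_code",
         "birth_year" = "lower_cutoff",
         "birth_year" = "upper_cutoff"),
  match_fun = list(`==`, `>=`, `<=`))
out <- joined[, c("country_code.x", "birth_year", "treatment")]
names(out)[1] <- "country_code"
out$treatment[is.na(out$treatment)] <- 0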

Conditionally filling in dataframe values using information from a separate df in R

Suppose I have a table with a country-year unit, like below, that records information about a few variables (the actual dataset is very large). Some values in some of the columns are missing (not all columns are affected). However, some of the 'missing' values in the affected columns are really zeros, because only non-zero values were initially recorded.
data <- tibble::tibble(country = c(rep("USA", 8), rep("MEX", 8))
                       ,year = c(1990:1997, 1990:1997)
                       ,var1 = c(1:4, rep(NA, 4), c(3, 3, 3, 3), rep(NA, 4))
                       ,var2 = c(rep(c(rep(1, 6), rep(NA, 2)), 2))
                       ,var3 = c(1:length(country))
                       ,var4 = c(length(country):1)
)
So, I have information regarding when those problematic variables in the data df were observed, such that anything outside these ranges should be NA and anything inside the ranges should be 0:
when_observed <- tibble::tibble(variable = c(rep("var1", 6), rep("var2", 7))
                                ,year = c(c(1990:1995), c(1990:1996))
)
I need something that will use the information regarding when the variable columns are observed (using when_observed) and fill in those values with zeros in the data df, but without altering actual missing values. It should produce the following table, but at scale (handling multiple column types beyond numerics would be great too):
goal_data <- tibble::tibble(country = c(rep("USA", 8), rep("MEX", 8))
                            ,year = c(1990:1997, 1990:1997)
                            ,var1 = c(1:4, 0, 0, rep(NA, 2), c(3, 3, 3, 3), 0, 0, rep(NA, 2))
                            ,var2 = c(rep(c(rep(1, 6), 0, NA), 2))
                            ,var3 = c(1:length(country))
                            ,var4 = c(length(country):1)
)
Thanks for any ideas/help.
One dplyr option could be:
data %>%
  mutate(across(var1:var4,
                ~ ifelse(is.na(.) & year %in% when_observed$year[which(when_observed$variable %in% cur_column())], 0, .)))
country year var1 var2 var3 var4
<chr> <int> <dbl> <dbl> <int> <int>
1 USA 1990 1 1 1 16
2 USA 1991 2 1 2 15
3 USA 1992 3 1 3 14
4 USA 1993 4 1 4 13
5 USA 1994 0 1 5 12
6 USA 1995 0 1 6 11
7 USA 1996 NA 0 7 10
8 USA 1997 NA NA 8 9
9 MEX 1990 3 1 9 8
10 MEX 1991 3 1 10 7
11 MEX 1992 3 1 11 6
12 MEX 1993 3 1 12 5
13 MEX 1994 0 1 13 4
14 MEX 1995 0 1 14 3
15 MEX 1996 NA 0 15 2
16 MEX 1997 NA NA 16 1
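An alternative sketch that pivots to long form first, which may scale more naturally when many columns are affected (assuming the data and when_observed frames from the question):
library(dplyr)
library(tidyr)
data %>%
  pivot_longer(starts_with("var"), names_to = "variable") %>%
  left_join(mutate(when_observed, observed = TRUE),
            by = c("variable", "year")) %>%
  mutate(value = ifelse(is.na(value) & !is.na(observed), 0, value)) %>%
  select(-observed) %>%
  pivot_wider(names_from = variable, values_from = value)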

Attribute value to new column based on values in similarly named columns

I have a data frame which has distances from a unit's centroid to different points. The points are identified by numbers, and what I am trying to obtain is a new column with the distance to the closest object.
So the data frame looks like this:
FID <- c(12, 12, 14, 15, 17, 18)
year <- c(1990, 1994, 1983, 1953, 1957, 2000)
centroid_distance_1 <- c(220.3, 220.3, 515.6, NA, 200.2, 22)
centroid_distance_2 <- c(520, 520, 24.3, NA, NA, 51.8)
centroid_distance_3 <- c(NA, 12.8, 124.2, NA, NA, 18.8)
centroid_distance_4 <- c(725.3, 725.3, 44.2, NA, 62.9, 217.9)
sample2 <- data.frame(FID, year, centroid_distance_1, centroid_distance_2, centroid_distance_3, centroid_distance_4)
sample2
FID year centroid_distance_1 centroid_distance_2 centroid_distance_3 centroid_distance_4
1 12 1990 220.3 520.0 NA 725.3
2 12 1994 220.3 520.0 12.8 725.3
3 14 1983 515.6 24.3 124.2 44.2
4 15 1953 NA NA NA NA
5 17 1957 200.2 NA NA 62.9
6 18 2000 22.0 51.8 18.8 217.9
FID is an identifier of each unit and year a year indicator. Each row is a FID*year pair. centroid_distance_x is the row's distance between its centroid and object x. This is a small sample of the data frame, which contains many more columns and rows.
What I am looking for is something like this:
short_distance <- c(220.3, 12.8, 24.3, NA, 62.9,18.8)
unit <- c(1, 3, 2, NA, 4, 3)
ideal.df <- data.frame(FID, year, short_distance, unit)
ideal.df
FID year short_distance unit
1 12 1990 220.3 1
2 12 1994 12.8 3
3 14 1983 24.3 2
4 15 1953 NA NA
5 17 1957 62.9 4
6 18 2000 18.8 3
Basically, I want to add one column named short_distance, which is the lowest value a row takes across all the centroid_distance_* columns above, and one named unit, which identifies the object to which each row has the smallest distance (so if a row has its smallest value in centroid_distance_1, it takes the value 1 for unit).
I have tried a bunch of things with dplyr, pivoting and re-pivoting the data frame, but I'm really not getting there.
Thanks a lot for the help!
Another solution based in the tidyverse - using pivot_longer - could look as follows.
library(dplyr)
library(tidyr)
library(stringr)
sample2 %>%
  pivot_longer(-c(FID, year)) %>%
  group_by(year, FID) %>%
  slice_min(value, n = 1, with_ties = FALSE) %>%
  mutate(unit = str_sub(name, -1)) %>%
  select(-name, short_distance = value)
# Groups: year, FID [6]
# FID year short_distance unit
# <dbl> <dbl> <dbl> <chr>
# 1 15 1953 NA 1
# 2 17 1957 62.9 4
# 3 14 1983 24.3 2
# 4 12 1990 220. 1
# 5 12 1994 12.8 3
# 6 18 2000 18.8 3
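Note that str_sub(name, -1) only keeps the final character, so with ten or more distance columns it would break; extracting the full number, e.g. with readr::parse_number, may be safer (a sketch):
sample2 %>%
  pivot_longer(-c(FID, year)) %>%
  group_by(year, FID) %>%
  slice_min(value, n = 1, with_ties = FALSE) %>%
  mutate(unit = readr::parse_number(name)) %>%
  select(-name, short_distance = value)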
My first couple of attempts at this weren't working like I imagined, either - couldn't always get the NA behavior you want - but here's one that works:
library(dplyr)
library(reshape2) # Or use tidyr if you prefer
sample2 %>%
  # Melt/unpivot to one value per row
  melt(id.vars = c("FID", "year")) %>%
  # Extract the unit number
  mutate(
    unit = sub(x = variable,
               pattern = "^centroid_distance_",
               replacement = "")
  ) %>%
  group_by(FID, year) %>% # Group by FID and year to get one row of output for each
  arrange(value) %>%      # Put smallest distance at the top of each group
  slice_head(n = 1)       # Take one row from the top of each group
Base R solution
FID <- c(12, 12, 14, 15, 17, 18)
year <- c(1990, 1994, 1983, 1953, 1957, 2000)
centroid_distance_1 <- c(220.3, 220.3, 515.6, NA, 200.2, 22)
centroid_distance_2 <- c(520, 520, 24.3, NA , NA, 51.8)
centroid_distance_3 <- c(NA, 12.8, 124.2, NA, NA, 18.8)
centroid_distance_4 <- c(725.3, 725.3, 44.2, NA, 62.9, 217.9)
sample2 <- data.frame(FID, year, centroid_distance_1, centroid_distance_2, centroid_distance_3, centroid_distance_4)
Apply function min for each row and add it to the data frame as column short_distance. Ignore the warning and handle it in the next operation.
sample2$short_distance <- apply(sample2[,3:6], 1, min, na.rm = TRUE)
#> Warning in FUN(newX[, i], ...): no non-missing arguments to min; returning Inf
sample2$short_distance[is.infinite(sample2$short_distance)] <- NA # Change Inf created by the min function to NA
Get units with which.min. ifelse is required because which.min would drop NA rows.
sample2$unit <- apply(sample2[,3:6], 1, function(x) ifelse(length(which.min(x)) == 0, NA, which.min(x)))
Keep only relevant columns
sample2 <- sample2[, c(1,2,7,8)]
sample2
#> FID year short_distance unit
#> 1 12 1990 220.3 1
#> 2 12 1994 12.8 3
#> 3 14 1983 24.3 2
#> 4 15 1953 NA NA
#> 5 17 1957 62.9 4
#> 6 18 2000 18.8 3
Created on 2021-01-18 by the reprex package (v0.3.0)
Here is a solution using dplyr & stringr packages (but you can just import tidyverse):
library(tidyverse)
df <- sample2 %>%
  gather('centroid', 'dist', 3:length(.)) %>%
  group_by(year) %>%
  slice(if(all(is.na(dist))) 1L else which.min(dist)) %>%
  mutate(centroid = str_replace(centroid, "centroid_distance_", ""))
df
Returns:
# A tibble: 6 x 4
# Groups: year [6]
FID year centroid dist
<dbl> <dbl> <chr> <dbl>
1 15 1953 1 NA
2 17 1957 4 62.9
3 14 1983 2 24.3
4 12 1990 1 220.
5 12 1994 3 12.8
6 18 2000 3 18.8
A data.table solution
setDT(sample2)
s <- melt(sample2, id = 1:2, variable.name = "object", value.name = "distance") ## pivot to long form
s[, obj := as.numeric(object)  ## transform factor into numeric
  ][, .(shortest = min(distance, na.rm = TRUE),
        unit = which.min(distance)), by = .(FID, year)  ## calculate the shortest distance and which unit
  ][is.infinite(shortest), shortest := NA               ## transform Inf into NA
  ][]                                                   ## report

Reshaping df into a panel data model

I have the following sets of data:
df1 <- data.frame( country = c("A", "B","A","B"), year = c(2011,2011,2012,2012), variable_1= c(1,3,5,7))
df2 <- data.frame( country = c("A", "B","A","B"), year = c(2011,2012,2012,2013), variable_2= c(2,4,6,8))
df3 <- data.frame( country = c("A", "C","C"), year = c(2011,2011,2013), variable_3= c(9,9,9))
I want to reshape them into a panel data model, so I can get the following result:
df4 <- data.frame( country = c("A","A","A","B","B","B","C","C","C"), year = c(2011,2012,2013,2011,2012,2013,2011,2012,2013), variable_1 = c(1,5,NA,3,7,NA,NA,NA,NA), variable_2 = c(2,6,NA,NA,4,8,NA,NA,NA), variable_3 = c(9,NA,NA,NA,NA,NA,9,NA,9) )
I have searched for this info, but the topics I found (Reshaping panel data) didn't help me.
Any ideas on how to do that? My real data sets have thousands of lines ("countries"), several variables, years and NAs, so please take that into account.
Try
library(tidyr)
library(dplyr)
Reduce(full_join, list(df1, df2, df3)) %>%
  complete(country, year)
Which gives:
#Source: local data frame [9 x 5]
#
# country year variable_1 variable_2 variable_3
# (chr) (dbl) (dbl) (dbl) (dbl)
#1 A 2011 1 2 9
#2 A 2012 5 6 NA
#3 A 2013 NA NA NA
#4 B 2011 3 NA NA
#5 B 2012 7 4 NA
#6 B 2013 NA 8 NA
#7 C 2011 NA NA 9
#8 C 2012 NA NA NA
#9 C 2013 NA NA 9
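A base R sketch of the same idea: merge everything, then left-join against the full country-year grid (row order may differ from df4):
merged <- Reduce(function(x, y) merge(x, y, all = TRUE), list(df1, df2, df3))
grid <- expand.grid(country = unique(merged$country),
                    year = unique(merged$year),
                    stringsAsFactors = FALSE)
merge(grid, merged, all.x = TRUE)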
