Creating a categorical variable for sequence breaks in R?

I have a dataframe with two columns for year and age, e.g.:
df <- data.frame(year = 1980:2000, age = c(40:45, 31:40, 32:36))
I need to create a categorical variable that identifies each age sequence. That would look something like this:
df$seq <- as.character(c(rep(1,6), rep(2,10), rep(3,5)))
Any ideas on how to do this efficiently? I have managed to create a dummy for sequence breaks:
library(dplyr)
df <- df %>% mutate(brk = case_when(age - lag(age) != 1 ~ 1, TRUE ~ 0))
but I'm struggling with filling in the rest.

You have almost done it already. You just need to create a cumulative sum (cumsum) of your brk column:
df %>% mutate(brk = cumsum(case_when(age - lag(age) != 1 ~ 1, TRUE ~ 0)))
You can add 1 to the whole vector if you want to start the first sequence from 1 instead of 0.
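Putting it all together (a sketch; the + 1 shift and as.character() reproduce the desired output from the question):
library(dplyr)
df <- df %>%
  mutate(seq = as.character(cumsum(case_when(age - lag(age) != 1 ~ 1, TRUE ~ 0)) + 1))
identical(df$seq, as.character(c(rep(1, 6), rep(2, 10), rep(3, 5))))
# [1] TRUE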

Related

Create columns based on other columns' names in R

I need to operate on columns based on a condition on their names. In the following reproducible example, for each column that ends with 'x', I create a column that multiplies the respective variable by 2:
library(dplyr)
set.seed(8)
id <- seq(1,700, by = 1)
a1_x <- runif(700, 0, 10)
a1_y <- runif(700, 0, 10)
a2_x <- runif(700, 0, 10)
df <- data.frame(id, a1_x, a1_y, a2_x)
# Create variables manually: for every column that ends with 'x',
# create one column that multiplies the respective column by 2
df <- df %>%
  mutate(a1_x_new = a1_x * 2,
         a2_x_new = a2_x * 2)
Since I'm working with several columns, I need to automate this process. Does anybody know how to achieve this? Thanks in advance!
Try this:
df %>% mutate(
  across(ends_with("x"), ~ .x * 2, .names = "{.col}_new")
)
Thanks @RicardoVillalba for the correction.
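As a side note, across() also accepts a named list of functions, in which case .names can reference both the column and the function name. A sketch producing the same a1_x_new and a2_x_new columns:
df %>% mutate(
  across(ends_with("x"), list(new = ~ .x * 2), .names = "{.col}_{.fn}")
)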
You could use transmute and across to generate the new columns for the column names ending in "x", then rename_with to add the "_new" suffix, and bind_cols to bind the result back to the original data frame.
library(dplyr)
df <- df %>%
  transmute(across(ends_with("x"), ~ .x * 2)) %>%
  rename_with(~ paste0(.x, "_new")) %>%
  bind_cols(df, .)
Result:
head(df)
  id     a1_x      a1_y     a2_x  a1_x_new  a2_x_new
1  1 4.662952 0.4152313 8.706219  9.325905 17.412438
2  2 2.078233 1.4834044 3.317145  4.156466  6.634290
3  3 7.996580 1.4035441 4.834126 15.993159  9.668252
4  4 6.518713 7.0844794 8.457379 13.037426 16.914759
5  5 3.215092 3.5578827 8.196574  6.430184 16.393149
6  6 7.189275 5.2277208 3.712805 14.378550  7.425611

Apply a function within list-column to another column (compare to reference ecdf by group)

I have a dataset that is organized by groups (site) and has baseline observations (trt == 0) and observations collected from a modified environment (trt == 1, although it's not experimental data which is why I'm doing this). For the trt == 1 observations, I would like to calculate the quantile of each observation within the baseline ecdf for that group (i.e. site). My instinct was to use map2_dbl() but the ecdf to compare to is within the list-column itself, not external to the data. I'm struggling to get the correct syntax (in the R tidyverse).
library(dplyr)
df <- tibble(site = rep(letters[1:4], length.out = 2000),
             trt = rep(c(0, 1), each = 1000),
             value = c(rnorm(1000), rnorm(1000, mean = 0.1)))
# calculate ecdf for baseline:
baseline <- df %>%
  filter(trt == 0) %>%
  group_by(site) %>%
  summarize(ecdf0 = list(ecdf(value)))
# compare each trt = 1 observation to ecdf for that site:
trtQuantile <- df %>%
  filter(trt == 1) %>%
  inner_join(baseline, by = "site")
# what would be next line is where I'm struggling to get the correct map syntax
head(trtQuantile)
# for the first row I am aiming for the result given by:
trtQuantile$ecdf0[[1]](trtQuantile$value[[1]])
Any advice from the purrr masters is appreciated! Thanks.
You can use map2_dbl:
library(dplyr)
library(purrr)
trtQuantile %>% mutate(out = map2_dbl(ecdf0, value, ~ .x(.y)))
Or mapply in base R:
trtQuantile$out <- mapply(function(x, y) x(y), trtQuantile$ecdf0, trtQuantile$value)
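If you would rather avoid the list-column and the join entirely, the baseline ecdf can also be built inside a grouped mutate (a sketch, using the same df as above):
library(dplyr)
df %>%
  group_by(site) %>%
  mutate(out = ecdf(value[trt == 0])(value)) %>%  # ecdf from the baseline rows, applied to all rows
  filter(trt == 1) %>%
  ungroup()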

Remove observation before certain row

I have a data frame and I want to compute the mean of value across all periods, excluding the two observations before and after any row where crisis is 1, as well as that row itself (I don't care about missing values). The calculation should be done by country (even though the example below has only one country). Example:
country <- rep("AT",10)
value <- seq(1,10,1)
crisis <- c(0,0,0,NA,0,1,0,NA,0,0)
df <- data.frame(country, value, crisis)
df
mean(df$value[df$crisis == 0], na.rm=TRUE)
# expected result
exp_mean <- (1+2+3+9+10)/5
exp_mean
edit:
I would like a general solution that takes into account other possible 1s in the dataset. For instance, if we also had
crisis[10] <- 1
the result should be (3+9)/2,
so as not to count periods that come after the first crisis but fall within two observations of the second one. Any ideas?
Another base R solution, using outer + c + unique to filter out rows, i.e.,
r <- mean(na.omit(df[-unique(c(outer(which(df$crisis == 1), -2:2, "+"))), "value"]))
such that
> r
[1] 5
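To see how the index arithmetic works, evaluate the pieces separately (the values below follow from the sample data):
which(df$crisis == 1)
# [1] 6
c(outer(which(df$crisis == 1), -2:2, "+"))  # every row within two steps of a crisis
# [1] 4 5 6 7 8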
We can write a function which excludes the observations within +-2 rows of each crisis = 1.
custom_mean <- function(c, v) {
  inds <- which(c == 1)
  mean(v[-unique(c(sapply(inds, `+`, -2:2)))], na.rm = TRUE)
}
sapply is used assuming there could be multiple crisis = 1 situations for a country.
We can then apply this function for each country.
library(dplyr)
df %>% group_by(country) %>% summarise(exp_mean = custom_mean(crisis, value))
# A tibble: 1 x 2
#   country exp_mean
#   <fct>      <dbl>
# 1 AT             5
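One caveat: if a country has no crisis == 1 row, inds is empty and the negative indexing misbehaves, and the offsets can also run past the ends of the vector. A guarded variant (a sketch):
custom_mean <- function(c, v) {
  inds <- which(c == 1)
  if (length(inds) == 0) return(mean(v, na.rm = TRUE))  # no crisis: keep all rows
  drop <- unique(c(sapply(inds, `+`, -2:2)))
  drop <- drop[drop >= 1 & drop <= length(v)]  # clamp out-of-range offsets
  mean(v[-drop], na.rm = TRUE)
}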
This base R solution works as long as there is only one row with crisis == 1, and as long as there are always two rows before and after that row:
country <- rep("AT",10)
value <- seq(1,10,1)
crisis <- c(0,0,0,NA,0,1,0,NA,0,0)
df <- data.frame(country, value, crisis)
df
df[(which(df$crisis == 1) - 2):(which(df$crisis == 1) + 2), ]
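For the original df this selects the five rows around the crisis, i.e. the rows to exclude:
#   country value crisis
# 4      AT     4     NA
# 5      AT     5      0
# 6      AT     6      1
# 7      AT     7      0
# 8      AT     8     NA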
This solution does not work for this data:
country <- rep("AT",11)
value <- seq(1,11,1)
crisis <- c(0,0,0,NA,0,1,0,NA,0,0,1)
df2 <- data.frame(country, value, crisis)
df2[(which(df2$crisis == 1) - 2):(which(df2$crisis == 1) + 2), ]
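The outer() answer above generalizes to this case; a sketch applied to df2:
drop <- unique(c(outer(which(df2$crisis == 1), -2:2, "+")))
drop <- drop[drop >= 1 & drop <= nrow(df2)]  # offsets can run past row 11
mean(df2$value[-drop], na.rm = TRUE)         # rows 4-11 are dropped, leaving 1, 2, 3
# [1] 2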

Time series function in dplyr

I am working with data that stops in a specific year and is NA afterwards, and I need to calculate a lot of variables based on lagged values of other variables. I would like to find a way to calculate the whole series at once, instead of one year at a time whenever one of the variables is NA. I was looking at dplyr, given that I am working with panel data and thus need to group by ID.
I provide the example below:
library(dplyr)
set.seed(1)
df <- data.frame(year = rep(seq(2000, 2018), 2), id = c(rep(1, 19), rep(2, 19)),
                 varA = floor(rnorm(38) * 100), varB = floor(rnorm(38) * 100), varC = floor(rnorm(38) * 100))
df <- df %>%
  mutate(varA = if_else(year > 2010, NA_real_, varA),
         varB = if_else(year > 2010, NA_real_, varB),
         varC = if_else(year > 2010, NA_real_, varC)) %>%
  group_by(id) %>% arrange(year)
What I would like is a way to calculate a variable that is equal to varC when it is available, and afterwards equal to a formula based on lagged values of varC, varB, and varA. When executing the code below, varRESULT and varD are only calculated for one additional year, given that the lags are only available for one year:
df <- df %>%
  mutate(varD = lag(varA) * lag(varB),
         varRESULT = if_else(is.na(varC), lag(varC, 1) / lag(varD, 2) * lag(varD, 1), varC))
But I would like to find a way to calculate the whole series immediately (taking into account the panel dimension of the data) instead of having to repeat the code 7 times, preferably a solution where varD can be calculated separately from varRESULT, given that in the final application I have multiple variables that are linked to each other.
Proposed solution:
From the first NA onward, the "recursive" lags of varA, varB, and varC are equal to the last observed value of each variable.
Thus, starting from these initial variables, we can create new variables: varA1, varB1, and varC1 where we fill the NAs with the last value, by id:
library(dplyr)
library(tidyr) # for the function `fill`
df <- df %>%
  mutate(varA1 = varA, varB1 = varB, varC1 = varC) %>%
  group_by(id) %>%
  arrange(year) %>%
  fill(varA1, varB1, varC1)  # fills NAs with the last non-missing value
Then, we apply the formula:
df <- df %>%
  mutate(varD = lag(varA1) * lag(varB1),
         varRESULT = if_else(is.na(varC), lag(varC1, 1) / lag(varD, 2) * lag(varD, 1), varC)) %>%
  select(-varA1, -varB1, -varC1)
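A quick sanity check (a sketch; exact values depend on the seed) that the whole series is now filled in one pass rather than one year at a time:
df %>%
  filter(year > 2011) %>%
  summarise(any_na = any(is.na(varRESULT)))  # should be FALSE for both ids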

Using functionals instead of for loops to identify sequential changes in a vector

My data look like this:
[scatter plot of measurement against timestamp, showing repeated downward runs]
I want to identify which "downward trend" each observation is part of, so I can group them and do things like make this graph:
[the same data drawn with geom_line, one colored line per downward trend]
My logic for distinguishing "downward trends" is that they end when the next observation has a higher measurement.
I've written a loop to do this, but I'm wondering if there's a better way to do it with one of the apply functions or something like them.
## Create sample data (measurement runs from 10 down to 1, twice)
df <- data.frame(timestamp = 1:20,
                 measurement = rep(seq(10, 1, by = -1), 2))
## This is the for loop I'm hoping to improve
df$downward.trend.seq <- 0
seq <- 1
for (i in 1:nrow(df)) {
  df$downward.trend.seq[i] <- seq
  if (i < nrow(df) && df$measurement[i] < df$measurement[i + 1]) {
    seq <- seq + 1
  }
}
## Code for plots
library(ggplot2)
library(dplyr)
ggplot(df, aes(x = timestamp, y = measurement)) + geom_point()
ggplot(df, aes(x = timestamp, y = measurement, group = downward.trend.seq)) + geom_line(aes(color = factor(downward.trend.seq)))
You can use which and diff to identify where the downward trends change, and cumsum to fill out the group membership.
# set up new column with all 0s
df$downward.trend.seq <- 0
# use diff to identify indices to change to 1
df$downward.trend.seq[which(c(NA, diff(df$measurement)) > 0)] <- 1
# use cumsum to fill in proper group membership
df$downward.trend.seq <- cumsum(df$downward.trend.seq)
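With the sample data (10 down to 1, twice), the only positive first difference is at row 11, so rows 1-10 end up in group 0 and rows 11-20 in group 1:
table(df$downward.trend.seq)
#  0  1
# 10 10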
Here is a dplyr solution
df %>% mutate(data_group = cumsum(c(0, diff(measurement)) > 0))
This performs the cumulative sum over a logical vector and assigns the result to data_group. The 0 prepended by c() keeps the vector the same length as measurement and starts the first group at 0.
