I am working with data that stops in a specific year and is NA afterwards, and I need to calculate a lot of variables based on lagged values of other variables. I would like to find a way to calculate a whole series at once, instead of one year at a time whenever one of the variables is NA. I was looking at dplyr, given that I am working with panel data and thus need to group by ID.
I provide the example below:
library(dplyr)
set.seed(1)
df <- data.frame( year = c(seq(2000, 2018), seq(2000, 2018)) , id = c(rep(1, 19),rep(2, 19)), varA = floor(rnorm(38)*100), varB= floor(rnorm(38)*100), varC= floor(rnorm(38)*100))
df <- df %>% mutate(varA = if_else(year>2010, as.double(NA) , varA) ,
varB = if_else(year>2010, as.double(NA) , varB),
varC = if_else(year>2010, as.double(NA) , varC)) %>% group_by(id) %>% arrange(year)
What I would like is to find a way to calculate a variable that is equal to varC when it is available, but afterwards is equal to a formula based on lagged values of varC, varB and varA. When executing the code below, varRESULT and varD are only calculated for one year, given that the lags are only available for one year:
df <- df %>% mutate( varD = lag(varA)*lag(varB),
varRESULT = if_else(is.na(varC), lag(varC, 1)/lag(varD, 2)*lag(varD, 1), varC))
But I would like to find a way to calculate the whole series immediately (taking into account the panel dimension of the data) instead of having to repeat the code 7 times. Preferably a solution where varD can be calculated separately from varRESULT, given that in the final application I have multiple variables that are linked to each other.
Proposed solution:
Starting with the first NA, the "recursive" lags of vars varA, varB, and varC are equal to the last value of these variables.
Thus, starting from these initial variables, we can create new variables: varA1, varB1, and varC1 where we fill the NAs with the last value, by id:
library(dplyr)
library(tidyr) # for the function `fill`
df <- df %>%
mutate(varA1 = varA, varB1 = varB, varC1 = varC) %>%
group_by(id) %>%
arrange(year) %>%
fill(varA1, varB1, varC1) # fills with last value
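To make the fill step concrete, here is a minimal sketch on a toy panel. The data frame `toy` and its values are made up for illustration and are not part of the original data. It only shows that `fill` carries the last observed value forward within each id when the data is grouped:

```r
library(dplyr)
library(tidyr)

# Hypothetical toy panel: varC stops being observed at different years per id
toy <- tibble(
  id   = c(1, 1, 1, 2, 2, 2),
  year = rep(2009:2011, 2),
  varC = c(5, 7, NA, 3, NA, NA)
)

filled <- toy %>%
  group_by(id) %>%
  arrange(year, .by_group = TRUE) %>%  # sort by year within each id
  fill(varC) %>%                       # default direction is "down"
  ungroup()

filled$varC
```

Because the fill is done per group, the last value of id 1 never leaks into id 2.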
Then, we apply the formula:
df <- df %>%
mutate( varD = lag(varA1)*lag(varB1),
varRESULT = if_else(is.na(varC), lag(varC1, 1)/lag(varD, 2)*lag(varD, 1), varC)) %>%
select(-varA1, -varB1, -varC1)
Related
I have a table of house prices and sale dates. I want to calculate the rolling median price over a time window of 365 days using the runner package. I only want one median price per date.
My problem is when I try the below code, I get more than one median price for a date if that date appears more than once. This isn't what I expected to occur. I thought there'd be one result for each day if I used group_by/summarise.
library(runner)
library(tidyverse)
library(lubridate)
startDate = as_date("2018-01-01")
endDate = as_date("2020-01-01")
# Create data
soldData <- tibble(
price = round(rnorm(100, mean=500000, sd=100000),-3),
date = sample(seq.Date(startDate,endDate,by="days"),100,replace=T))
# Fill in the missing dates between startDate and endDate
soldData <- bind_rows(soldData,anti_join(tibble(date=seq.Date(startDate,endDate,by="day")),soldData)) %>%
arrange(date)
# Find the duplicated dates
duplicatedDates <- soldData[duplicated(soldData$date),]$date
# I thought using group_by/summarise would return one medianPrice per date
results <- soldData %>%
group_by(date) %>%
summarise(medianPrice = runner(
price,
k = "365 days",
idx = date,
f = function(x) {median(x,na.rm=T)}))
# These are the problem rows.
duplicatedResults <- results %>%
filter(date %in% duplicatedDates)
Any idea where I'm going wrong?
From dplyr 1.0.0, you can have output that returns multiple rows from summarise.
First you need to deal with the duplicate dates that you already have in your data. What do you want to do with dates that occur more than once? One way would be to take the median or mean of their prices.
library(dplyr)
library(runner)
soldData %>%
group_by(date) %>%
summarise(price = median(price, na.rm = TRUE)) -> df
So now in df we only have one value for each date. You can now apply the runner function.
df %>%
mutate(medianPrice = runner(price,
k = "365 days",
idx = date,
f = function(x) {median(x,na.rm=T)}))
There is also zoo::rollmedianr, which calculates a right-aligned rolling median over a fixed number of observations rather than over a date window.
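Note that taking the median of daily medians is not necessarily the same as taking the median of all sales in the window. As a rough sketch of the second interpretation, the base R code below computes one value per distinct date from all raw prices. The helper `window_median`, the toy data `sold`, and the window definition (the 365 days up to and including the date) are assumptions for illustration and may not match how runner defines its window:

```r
# Hypothetical toy data with a duplicated date
sold <- data.frame(
  date  = as.Date(c("2018-01-01", "2018-01-01", "2018-06-01", "2019-03-01")),
  price = c(100, 300, 500, 700)
)

# Median of all prices sold in the `days` days up to and including `d`
window_median <- function(d, dates, prices, days = 365) {
  in_window <- dates > d - days & dates <= d
  median(prices[in_window], na.rm = TRUE)
}

# One row per distinct date, each using every sale inside its window
dates <- sort(unique(sold$date))
per_date <- data.frame(
  date = dates,
  medianPrice = vapply(
    seq_along(dates),
    function(i) window_median(dates[i], sold$date, sold$price),
    numeric(1)
  )
)
per_date
```

This loops over every distinct date, so it is only meant for small data; for larger tables the runner approach on deduplicated data is likely the more practical choice.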
I have a data frame like DF below which will be imported directly from the database (as tibble).
library(tidyverse)
library(lubridate)
date_until <- dmy("31.05.2019")
date_val <- dmy("30.06.2018")
DF <- data.frame( date_bal = as.Date(c("2018-04-30", "2018-05-31", "2018-06-30", "2018-05-31", "2018-06-30")),
department = c("A","A","A","B","B"),
amount = c(10,20,30,40,50)
)
DF <- DF %>%
as_tibble()
DF
It represents the amount of money spent by each department in a specific month. My task is to project how much money will be spent by each department in the following months, until a specified date in the future (in this case date_until = 31.05.2019).
I would like to use tidyverse in order to generate additional rows for each department where the first column date_bal would be a sequence of dates from the last one from "original" DF up until date_until which is predefined. Then I would like to add additional column called "DIFF" which would represent the difference between DATE_BAL and DATE_VAL, where DATE_VAL is also predefined. My final result would look like this:
Final result
I have managed to do this in the following way:
1. Filter the data from DF for department A.
2. Create another data frame, DF2, by populating it with a date sequence from min(date_bal) to date_until, using the data from 1.
3. Merge the data frames from 1. and 2., and then add the calculated columns using mutate.
Since I will have to repeat this procedure for many departments I wonder if it's possible to add rows (create date sequence) in existing DF (without creating a second DF and then merging).
Thanks in advance for your help and time.
I add one day to the dates, create a sequence and then rollback to the last day of the previous month.
seq(min(date_val + days(1)), date_until + days(1), by = 'months')[-1] %>%
rollback() %>%
tibble(date_bal = .) %>%
crossing(DF %>% distinct(department)) %>%
bind_rows(DF %>% select(date_bal, department)) %>%
left_join(DF) %>%
arrange(department, date_bal) %>%
mutate(
amount = if_else(is.na(amount), 0, amount),
DIFF = interval(
rollback(date_val, roll_to_first = TRUE),
rollback(date_bal, roll_to_first = TRUE)) %/% months(1)
)
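The add-one-day-then-rollback step can be seen in isolation in the short sketch below. It uses a shorter, made-up `date_until` than the question so the output stays small. Stepping by month from the 30th would land on the 30th of each following month, which is why the sequence is built from the first day of the next month and then rolled back:

```r
library(lubridate)

date_val   <- dmy("30.06.2018")
date_until <- dmy("31.08.2018")  # shortened horizon, for illustration only

# Naive monthly steps from a month end do not stay on month ends
naive <- seq(date_val, by = "month", length.out = 3)
# 2018-06-30, 2018-07-30, 2018-08-30

# First days of the following months: 2018-07-01, 2018-08-01, 2018-09-01
first_days <- seq(date_val + days(1), date_until + days(1), by = "month")

# Drop the first element, then roll back to the last day of the previous month
month_ends <- rollback(first_days[-1])
month_ends
```

The result contains the month ends strictly after date_val, up to and including date_until.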
I have 11 variables in my dataframe. The first is unique identifier of observation (a plane). The second one is a number from 1 to 21 representing flight of a given plane. The rest of the variables are time, velocity, distance, etc.
What I want to do is make new variables for every group (number) of flight e.g. time_1, time_2,..., velocity_1, velocity_2, etc. and consequently, reduce the number of observations (the repeating ones).
I don't really have an idea of how to start. I was thinking about a mutate function like:
mutate(df, time_1 = ifelse(n_flight == 1, time, NA))
But that would be a lot of typing and a new problem may appear, perhaps.
Basically, you want to convert long to wide data for each variable. You can lapply over these with tidyr::spread in that case. Suppose the data looks like the following:
library(dplyr)
library(tidyr)
df <- data.frame(
ID = c(rep("A", 3), rep("B", 3)),
n_flight = rep(seq(3), 2),
time = seq(19, 24),
velocity = rev(seq(65, 60))
)
Then the following will generate your outcome of interest, as long as you get rid of the extra ID variables.
lapply(
setdiff(names(df), c("ID", "n_flight")), function(x) {
df %>%
select(ID, n_flight, !!x) %>%
tidyr::spread(., key = "n_flight", value = x) %>%
setNames(paste(x, names(.), sep = "_"))
}
) %>%
bind_cols()
Let me know if this wasn't what you were going for.
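As an alternative sketch, assuming tidyr 1.0.0 or later is available, `pivot_wider` (which superseded `spread`) can widen several value columns in one call, so no `lapply` or duplicate ID columns are involved. The data is the same example as above; the name `wide` is just for illustration:

```r
library(tidyr)

df <- data.frame(
  ID = c(rep("A", 3), rep("B", 3)),
  n_flight = rep(seq(3), 2),
  time = seq(19, 24),
  velocity = rev(seq(65, 60))
)

# One row per ID; columns are named <variable>_<n_flight>
wide <- pivot_wider(
  df,
  id_cols = ID,
  names_from = n_flight,
  values_from = c(time, velocity)
)
names(wide)
```

With more than one column in `values_from`, the new column names are built as the value name and the flight number joined by an underscore, for example time_1 and velocity_3.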
I have a dataframe with two columns for year and age, e.g.:
df <- data.frame(year = 1980:2000, age = c(40:45, 31:40, 32:36))
I need to create a categorical variable that identifies each age sequence. That would look something like this:
df$seq <- as.character(c(rep(1,6), rep(2,10), rep(3,5)))
Any ideas how to do this efficiently? I have managed to create a dummy for sequence breaks
require(dplyr)
df <- df %>% mutate(brk = case_when(age - lag(age) != 1 ~ 1, T ~ 0))
but I'm struggling with filling in the rest.
You have almost done it already. You just need to create a cumulative sum (cumsum) of your brk column:
df %>% mutate(brk = cumsum(case_when(age - lag(age) != 1 ~ 1, T ~ 0)))
You can add 1 to the whole vector if you want to start the first sequence from 1 instead of 0.
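The same idea can be sketched in base R without the intermediate dummy column. This is only an equivalent formulation, not a different method: a new sequence starts wherever the difference to the previous age is not 1, and the first row is forced to start sequence 1.

```r
df <- data.frame(year = 1980:2000, age = c(40:45, 31:40, 32:36))

# TRUE marks the first row of each sequence; cumsum numbers the sequences
starts <- c(TRUE, diff(df$age) != 1)
df$seq <- as.character(cumsum(starts))

table(df$seq)
```

Prepending TRUE makes the counter start at 1, so no extra `+ 1` is needed here.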
Working with dplyr, I am trying to match a row with n other rows on a variable, so I can feed the matching set to summarise(). I've only succeeded with a loop so far. Example data:
dfraw <- data.frame( id = c(1:20), age = c(30:35, 32:37, 34:41) )
set.seed(1)
df <- dfraw %>%
mutate( var = age + runif(20) - 0.5 ) %>%
arrange( age )
To calculate a z-score of var from the five closest matches on age, I can do
for ( i in 1:nrow(df) ) {
df$windowedz[i] <- df %>%
arrange( abs( df$age[i] - age) ) %>%
head(n=6) %>% tail(n=5) %>% # 5 closest matches excluding row `i`
summarise( (df$var[i] - mean(var) ) / sd(var) ) %>%
as.numeric
}
Is there a more elegant way to achieve this? If I use group_by, I can't seem to generate a matching group from the individual variable (df$age[i] in the example).
Edit: Minor changes for clarification, arrange as part of the example data definition, modified loop to insert a scalar rather than a list in column windowedz
Edit: With the package RcppRoll I was partially successful:
library(RcppRoll)
df <- df %>%
mutate(
mean = roll_mean( var, n = 5, fill = NA ),
sd = roll_sd( var, n = 5, fill = NA ),
roll_z = (var - mean) / sd
)
The issue with this solution is that the window contains the value that is to be transformed. So there is no equivalent to the head-tail manoeuvre that removes the matched row from the matching set. Also, this approach gives strange results if I calculate roll_z directly instead of calculating mean and sd first.
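One possible way to avoid both the loop over data frames and the window that includes the current row is to work on plain vectors and drop row `i` explicitly before picking the nearest matches. This is a sketch under assumptions: the helper `windowed_z` is hypothetical, and because it excludes row `i` by index rather than by position after sorting, its results may differ from the loop above when several rows share the same age.

```r
# z-score of var[i] relative to the n rows closest in age, excluding row i
windowed_z <- function(i, age, var, n = 5) {
  others  <- setdiff(seq_along(age), i)                        # every row except i
  nearest <- others[order(abs(age[others] - age[i]))][seq_len(n)]
  (var[i] - mean(var[nearest])) / sd(var[nearest])
}

# Same example data as in the question, built with base R
set.seed(1)
df <- data.frame(id = 1:20, age = c(30:35, 32:37, 34:41))
df$var <- df$age + runif(20) - 0.5
df <- df[order(df$age), ]

df$windowedz <- vapply(
  seq_len(nrow(df)), windowed_z, numeric(1),
  age = df$age, var = df$var
)
head(df)
```

Ties in age distance are broken by row order, which is an arbitrary choice; a different tie rule would need a different ordering key.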