Consider a data frame with 3 columns: A, a name; B, the yearly food intake (one person can eat several foods); C, the year in which the person stops eating that food.
For example:
A B C
Peter 400 2035
Peter 500 2050
Peter 350 2024
John 700 2050
I need to create a time series that sums all the food intake for each person, from today (2022) to 2050. John's case is easy: 700 * (2050-2022). But for Peter I need some restrictions: sum the 3 lines until 2024, after which one of them drops to zero; the time series keeps summing the other two lines until, eventually, there is only one line left to sum.
So year 2022 would be (400+500+350), and the same for 2023 and 2024. Then it would be (400+500) until 2035, and so on.
This allows me to have a time-series, per person, which contains the yearly intake of food, taking into consideration that the yearly food intake will decrease throughout the years.
Are you after the total intake over the period? Then this will calculate it:
library(tidyverse)
data <- tribble(~"A", ~"B", ~"C",
                "Peter", 400, 2035,
                "Peter", 500, 2050,
                "Peter", 350, 2024,
                "John", 700, 2050)
data %>%
  mutate(line_total = B * (C - 2022)) %>% # 2022 being the start year
  group_by(A) %>%
  summarise(person_total = sum(line_total))
If you actually want a time series, with a column for each year and the row total at the end, then try this:
years <- 2022:max(data$C)
# empty placeholder matrix with one column per year
mat <- matrix(nrow = nrow(data), ncol = length(years))
colnames(mat) <- years
timeseries <- cbind(data, mat) %>%
  as_tibble() %>%
  pivot_longer(-c(A, B, C)) %>%
  # column names are character, so convert before comparing with C
  mutate(value = ifelse(as.integer(name) <= C, B, 0)) %>%
  pivot_wider() %>%
  select(-c(B, C)) %>%
  mutate(rowsum = rowSums(across(where(is.numeric))))
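If what is wanted is one value per person per year (rather than one row per food line), a long-format variant may be easier to work with. The following is only a sketch under the assumption that a food still counts in its stop year C, as in the question's description of 2022 to 2024; the names food, per_person and intake are made up for illustration.

```r
library(dplyr)
library(tidyr)

# The question's example data
food <- data.frame(
  A = c("Peter", "Peter", "Peter", "John"),
  B = c(400, 500, 350, 700),
  C = c(2035, 2050, 2024, 2050)
)

per_person <- food %>%
  # one row per food line and year
  expand_grid(year = 2022:2050) %>%
  # assumption: a food is still eaten in its stop year C
  filter(year <= C) %>%
  # total intake per person and year
  group_by(A, year) %>%
  summarise(intake = sum(B), .groups = "drop")
```

For Peter this gives 1250 for 2022 to 2024, 900 for 2025 to 2035 and 500 from 2036 onwards.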
I've got a dplyr query which runs on a large data frame and it's painfully slow. Reprex:
Start with a dataframe df which has duplicate rows (because it was formed by a left_join() call). If I see a duplicate index value then the name, year and city values will be duplicated too.
df <- data.frame(index = c(1, 1, 1, 2, 2, 3),
name = c("Andy", "Andy", "Andy", "Bob", "Bob", "Charles"),
year = c(1970, 1970, 1970, 1971, 1971, 1972),
city = c("Amsterdam", "Amsterdam", "Amsterdam", "Barcelona", "Barcelona", "Copenhagen"),
amount = c(123, 234, 345, 456, 567, 678))
I want to know the sum of the amount field for each value of index. However I want to retain name, year and city.
output_i_want <- data.frame(index = c(1, 2, 3),
name = c("Andy", "Bob", "Charles"),
year = c(1970, 1971, 1972),
city = c("Amsterdam", "Barcelona", "Copenhagen"),
total_amount = c(702, 1023, 678))
It's easy enough to do it like this:
df |>
group_by(index) |>
summarise(name = first(name),
year = first(year),
city = first(city),
total_amount = sum(amount)) |>
ungroup()
...but in my real world case (where first() appears about 20 times and sum() appears 8 times) it's horribly slow.
If I instead do:
df |>
group_by(index) |>
summarise(total_amount = sum(amount)) |>
ungroup()
then it runs fast, but I then lose name, year and city - and I'm not sure how best to get them back. Do I need a different type of join afterwards, or some other technique?
Thank you.
1) Since the columns that use first() in the question are constant within index, try grouping by all 4 of them, which eliminates the need for first(). The code below assumes that all non-grouping columns are to be summed, but you could instead specify where(is.numeric) to sum all numeric non-grouping columns, amount1:amount3 if the columns are side by side, or starts_with("amount") if they all start with amount.
library(dplyr) # version 1.1.0 or later
df %>%
summarize(across(everything(), sum, .names = "total_{.col}"), .by = index:city)
giving:
index name year city total_amount
1 1 Andy 1970 Amsterdam 702
2 2 Bob 1971 Barcelona 1023
3 3 Charles 1972 Copenhagen 678
2) A base solution using the same idea is the following. Change the numbers as needed. Omit the last 2 statements if having a total_ prefix is not important.
ag <- aggregate(df[5], df[-5], sum)
names(ag)[5] <- paste("total", names(ag)[5], sep = "_")
ag
giving:
index name year city total_amount
1 1 Andy 1970 Amsterdam 702
2 2 Bob 1971 Barcelona 1023
3 3 Charles 1972 Copenhagen 678
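The question also asks whether the descriptive columns can be recovered with a join after a plain sum. A minimal sketch of that idea follows; it assumes, as the question states, that name, year and city are constant within each index, and the names totals and result are hypothetical. Whether it is faster than the first() version on the real data has not been measured here.

```r
library(dplyr) # version 1.1.0 or later for .by

df <- data.frame(index = c(1, 1, 1, 2, 2, 3),
                 name = c("Andy", "Andy", "Andy", "Bob", "Bob", "Charles"),
                 year = c(1970, 1970, 1970, 1971, 1971, 1972),
                 city = c("Amsterdam", "Amsterdam", "Amsterdam",
                          "Barcelona", "Barcelona", "Copenhagen"),
                 amount = c(123, 234, 345, 456, 567, 678))

# Step 1: sum only the numeric column, grouped by index
totals <- df |>
  summarise(total_amount = sum(amount), .by = index)

# Step 2: keep one copy of the descriptive columns and join the totals back
result <- df |>
  distinct(index, name, year, city) |>
  left_join(totals, by = "index")
```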
I'm calculating the percent change in enrollment from one academic year to the next, but some academic years are missing. In those cases I don't want a two-year difference calculated; the result should stay blank instead. I am doing this across multiple years, schools, and groups. An example data frame and the code I am currently using are below. In this example 2016-17 is missing, so no change should be calculated for 2017-18.
School Academic Year Group Enrollment pct_change
1 School 1 2018-19 Overall 450 ANSWER
2 School 1 2017-18 Overall 630 NA
3 School 1 2015-16 Overall 635 ANSWER
4 School 1 2014-15 Overall 750 ANSWER
5 School 1 2013-14 Overall 704 ANSWER
data <- data %>%
group_by(School, Group) %>%
mutate(pct_change = (((Enrollment-lead(Enrollment, order_by = `Academic Year`))/Enrollment)) * 100) %>%
ungroup()
One option is to expand the data so that every year is present:
library(dplyr)
library(tidyr)
data %>%
  # split "2018-19" into Year = 2018 and Day = 19
  separate(`Academic Year`, into = c("Year", "Day"),
           remove = FALSE, convert = TRUE) %>%
  group_by(School, Group) %>%
  # add placeholder rows for the missing years
  complete(Year = full_seq(Year, period = 1)) %>%
  mutate(pct_change = (((Enrollment - lead(Enrollment,
                         order_by = Year)) / Enrollment)) * 100) %>%
  ungroup() %>%
  # drop the placeholder rows again
  filter(complete.cases(Enrollment)) %>%
  select(-Year, -Day)
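An alternative that avoids expanding the data is to compare each row with the previous available year and blank the result when the gap is not exactly one year. This is only a sketch, and it makes an assumption that differs from the code above: the change is measured against the previous year's enrollment with lag(), which is what makes 2017-18 come out as NA as in the question's table. The names enrol, prev_year and result are illustrative.

```r
library(dplyr)

enrol <- data.frame(
  School = "School 1",
  `Academic Year` = c("2018-19", "2017-18", "2015-16", "2014-15", "2013-14"),
  Group = "Overall",
  Enrollment = c(450, 630, 635, 750, 704),
  check.names = FALSE
)

result <- enrol %>%
  # starting calendar year of each academic year, e.g. 2018 for "2018-19"
  mutate(Year = as.integer(substr(`Academic Year`, 1, 4))) %>%
  group_by(School, Group) %>%
  arrange(Year, .by_group = TRUE) %>%
  mutate(
    prev_year = lag(Year),
    # assumption: change is relative to the previous year's enrollment;
    # NA when there is no previous row or the previous row is not last year
    pct_change = if_else(Year - prev_year == 1,
                         (Enrollment - lag(Enrollment)) / lag(Enrollment) * 100,
                         NA_real_)
  ) %>%
  ungroup() %>%
  select(-Year, -prev_year)
```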
I have a growth rate, calculated from individual measurements 4 times a year, that I am trying to assign to a different time frame called Year2 (August 1st of year 1 to July 31st of year 2, see attached photo).
My Dataframe:
ID  Date        Year  Year2  Lag         Lapse    Growth       Daily_growth
1   2009-07-30  2009  2009   NA          NA       35.004       NA
1   2009-10-29  2009  2010   2009-07-30  91 days  31.585       0.347
1   2010-01-27  2010  2010   2009-10-29  90 days  63.769       0.709
1   2010-04-27  2010  2010   2010-01-27  90 days  28.329       0.315
1   2010-07-29  2010  2010   2010-04-27  93 days  32.068       0.345
1   2010-11-02  2010  2011   2010-07-29  96 days  128.1617320  1.335
I took the growth rate as follows:
Growth_df <- Growth_df%>%
group_by(ID) %>% # Individuals we measured
mutate(Lag = lag(Date), #Last date measured
Lapse = round(difftime(Date, Lag, units = "days")), #days between Dates monitored
Daily_growth = as.numeric(Growth) / as.numeric(Lapse))
What I am trying to do is assign the daily growth rate between each measurement, matching to the Year2 timeframe:
Growth_df <- Growth_df %>%
mutate(Year = as.numeric(Year),
Year2_growth = ifelse(Year == Year2, Daily_growth*Lapse, 0)) %>%
group_by(Year2) %>%
mutate(Year2_growth = sum(Year2_growth, na.rm = TRUE))
My problem is that I do not know how to handle the dates that fall between the years (something in place of the 0 in the ifelse statement). I need some way to calculate how many days there are from the new start date (August 1st) to the most recent measurement and multiply that by the growth rate, as well as to cut the period off at the end date (July 31st).
I tried making a second dataframe with nothing but years and days, then assigning the growth rate by comparing the two dataframes, but I have been stuck on the same issue: partitioning the time frame.
I am sure there's a much, much more efficient way to deal with this, but this is how I sorted it out:
Make my timeframes
Create a function for the ranges I wanted
Make a dataframe for both the start and the end ranges
Join them together
Marvel at my lack of R skills.
library(dplyr)
library(lubridate)
Start_dates <- seq(ymd('2008-08-01'), ymd('2021-08-01'), by = '12 months')
End_dates <- seq(ymd('2009-07-31'), ymd('2022-07-31'), by = '12 months')
Year2_dates <- data.frame(Start_dates, End_dates)
# Year2 is labelled by the calendar year in which the period ends
Year2_dates <- Year2_dates %>%
  mutate(Year2 = year(Start_dates) + 1)
Vegetation <- Vegetation %>%
  left_join(Year2_dates, by = "Year2")
# number of days from y to x
Range_finder <- function(x, y){
  as.numeric(difftime(x, y, units = "days"))
}
Range_start <- Vegetation %>%
  group_by(Year2, ID) %>%
  filter(row_number()==1) %>%
  filter(Year != Year2) #had to get rid of first year samples as they were the top row but didn't have a change in year
Range_start <- Range_start %>%
  mutate(Number_days_start = Range_finder(Date, Start_dates),
         Border_range = Number_days_start * Daily_veg) %>%
  ungroup() %>%
  select(ID, Year2, Date, Border_range)
Range_end <- Vegetation %>%
  group_by(Year2, ID) %>%
  filter(row_number()==n(),
         Year2 != 2022)
Range_end <- Range_end %>%
  mutate(Number_days_end = Range_finder(End_dates, Date),
         Border_range = Number_days_end * Daily_veg) %>%
  ungroup() %>%
  select(ID, Year2, Date, Border_range)
Ranges <- full_join(Range_start, Range_end)
Vegetation <- Vegetation %>%
  left_join(Ranges)
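A different way to partition the time frame, offered only as a sketch and not as the approach used above, is to spread each measurement interval over its individual days and then sum the daily rate within each Year2. It assumes the growth rate is constant within an interval, that Year2 is labelled by the year in which the 1 August to 31 July period ends (as in the question's table), and that each interval covers the days after Lag up to and including Date. The names growth_df, Day and yearly are illustrative.

```r
library(dplyr)
library(tidyr)
library(lubridate)

# The five measured intervals from the question's table (one individual)
growth_df <- data.frame(
  ID     = 1,
  Lag    = ymd(c("2009-07-30", "2009-10-29", "2010-01-27", "2010-04-27", "2010-07-29")),
  Date   = ymd(c("2009-10-29", "2010-01-27", "2010-04-27", "2010-07-29", "2010-11-02")),
  Growth = c(31.585, 63.769, 28.329, 32.068, 128.161732)
)

yearly <- growth_df %>%
  mutate(Daily_growth = Growth / as.numeric(Date - Lag)) %>%
  # one row per day covered by the interval (Lag excluded, Date included)
  rowwise() %>%
  mutate(Day = list(seq(Lag + 1, Date, by = "day"))) %>%
  ungroup() %>%
  unnest(Day) %>%
  # days from 1 August onwards belong to the next Year2
  mutate(Year2 = year(Day) + (month(Day) >= 8)) %>%
  group_by(ID, Year2) %>%
  summarise(Year2_growth = sum(Daily_growth), .groups = "drop")
```

Because every day carries its own share of the interval's growth, the yearly totals add up to the total measured growth, and no separate handling of the start and end borders is needed. The cost is one row per day, which may matter for long records.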
As a novice, I was hoping to understand how to aggregate data using an arbitrary look-back window (e.g. the previous 30 days from a date). See my data below as an example. I want to group by name and sum sales for the 30 days leading up to, say, 02-15-2019. So it should look back 30 calendar days from 02-15-2019 and give me the total sales by Name (e.g. Person1 = $60; Person2 = $30).
Name Date Sales
Person1 01-31-2019 $10
Person1 02-15-2019 $50
Person1 06-18-2019 $100
Person2 01-31-2019 $25
Person2 02-15-2019 $5
Person2 06-18-2019 $200
Simple example (if I understood your question correctly):
library(dplyr)
set.seed(123)
df <- data.frame(Name = sample(c("Person1", "Person2"), 6, TRUE),
                 Date = c("01-31-2019", "02-15-2019", "06-18-2019", "01-31-2019", "02-15-2019", "06-18-2019"),
                 Sales = runif(6, 10, 100), stringsAsFactors = FALSE)
df$Date <- lubridate::mdy(df$Date)
target <- lubridate::mdy("02-15-2019")
sales <- df %>%
  filter(between(Date, target - 30, target)) %>%
  group_by(Name) %>%
  summarise(Sales = sum(Sales))
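The same filter can be checked against the question's own figures instead of random ones. This sketch assumes the dollar signs have already been stripped so that Sales is numeric; sales_df and lookback are illustrative names.

```r
library(dplyr)
library(lubridate)

# The question's data, with Sales stored as plain numbers
sales_df <- data.frame(
  Name  = rep(c("Person1", "Person2"), each = 3),
  Date  = mdy(rep(c("01-31-2019", "02-15-2019", "06-18-2019"), times = 2)),
  Sales = c(10, 50, 100, 25, 5, 200)
)

target <- mdy("02-15-2019")

lookback <- sales_df %>%
  # keep rows from 30 calendar days before the target up to the target itself
  filter(Date >= target - days(30), Date <= target) %>%
  group_by(Name) %>%
  summarise(Sales = sum(Sales))
```

This returns 60 for Person1 and 30 for Person2, matching the totals the question expects.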
SELECT Name, SUM(Sales)
FROM orders
WHERE DATEDIFF(day, OrderDate, GETDATE()) BETWEEN 0 AND 30
GROUP BY Name
I'm a relative newbie to dplyr. I have a data.frame organized with each store name and source (made up of the results for 2018) making up the observations. The variables are total revenue, quantity, customer experience score, and a few others.
I'd like to rank each variable in the data.frame and add the ranks as new observations. All variables would be ranked in descending order, except customer experience and one additional column, which would be ranked in ascending order. I'd like the source for these new observations to be called "ranks".
store <- c("NYC", "Chicago", "Boston")
source <- c("2018", "2018", "2018")
revenue <- c(10000, 50000, 2000)
quantity <- c(100, 50, 20)
satisfaction <- c(3, 2, 5)
table <- cbind(store, source, revenue, quantity, satisfaction)
I was able to get what I needed using mutate, but I had to manually name each new column. I'm sure there is a more efficient way to rank these values out there!
Here is what I originally did:
table <- table %>%
mutate(revenue_rank = rank(-revenue), quantity_rank = rank(-quantity), satisfaction_rank = rank(satisfaction))
In general, if you're having to do something repeatedly in a data frame, such as calculating ranks, you probably want to reshape to long data. Also note that what you get from cbind is a matrix, not a data frame, which is probably not what you want: the numeric variables come through as characters. Instead of cbind, use data.frame, or tibble for a tibble (data_frame is deprecated).
What I did here is gathered into a long data frame, grouped by the measures (revenue, quantity, or satisfaction), then gave ranks based on the value, keeping in mind that you wanted different orders for satisfaction and the other measures.
library(tidyverse)
store <- c("NYC", "Chicago", "Boston")
source <- c("2018", "2018", "2018")
revenue <- c(10000, 50000, 2000)
quantity <- c(100, 50, 20)
satisfaction <- c(3, 2, 5)
df <- tibble(store, source, revenue, quantity, satisfaction)
df %>%
  gather(key = measure, value = value, revenue:satisfaction) %>%
  group_by(measure) %>%
  # satisfaction is ranked ascending, the other measures descending
  mutate(rank = ifelse(measure == "satisfaction", rank(value), rank(-value))) %>%
  ungroup() %>%
  select(-value) %>%
  mutate(measure = paste(measure, "rank", sep = "_")) %>%
  spread(key = measure, value = rank)
#> # A tibble: 3 x 5
#> store source quantity_rank revenue_rank satisfaction_rank
#> <chr> <chr> <dbl> <dbl> <dbl>
#> 1 Boston 2018 3 3 3
#> 2 Chicago 2018 2 1 1
#> 3 NYC 2018 1 2 2
Created on 2018-05-04 by the reprex package (v0.2.0).
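With more recent versions of dplyr (1.0.0 or later), across() offers a way to build the rank columns without reshaping or naming each one by hand. This is a sketch of that alternative, not part of the original answer; stores and ranked are illustrative names.

```r
library(dplyr)

stores <- data.frame(
  store = c("NYC", "Chicago", "Boston"),
  source = "2018",
  revenue = c(10000, 50000, 2000),
  quantity = c(100, 50, 20),
  satisfaction = c(3, 2, 5)
)

ranked <- stores %>%
  mutate(
    # descending rank: the largest value gets rank 1
    across(c(revenue, quantity), ~ rank(-.x), .names = "{.col}_rank"),
    # ascending rank: the smallest value gets rank 1
    across(satisfaction, ~ rank(.x), .names = "{.col}_rank")
  )
```

The original columns are kept alongside the new *_rank columns; add select(store, source, ends_with("_rank")) to keep only the ranks.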