How to add an increasing index based on multiple columns in R

I have a data frame that contains the columns "hour", "day", "month" and "count".
library(tidyverse)
set.seed(0)
df <- expand_grid(expand_grid(
hour = seq(0:23),
day = c("Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun")),
month = c("Jan", "Feb", "Mar", "Apr", "May", "Jun")) %>%
mutate(count = sample(0:100, n(), replace = TRUE))
head(df)
# A tibble: 6 × 4
hour day month count
<int> <chr> <chr> <int>
1 1 Mon Jan 13
2 1 Mon Feb 67
3 1 Mon Mar 38
4 1 Mon Apr 0
5 1 Mon May 33
6 1 Mon Jun 86
I would like to add a new column named "id" that contains an increasing index which can be used to sort the data in chronological order. The solution I found is not particularly concise and requires me to set factor levels before calling arrange(). Is there another way to solve this issue that capitalises on the fact that I am working with (unformatted) dates?
This is my solution with arrange():
df2 <- df %>%
mutate(day = factor(day, levels = c("Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun")),
month = factor(month, levels = c("Jan", "Feb", "Mar", "Apr", "May", "Jun"))) %>%
arrange(month, day, hour) %>%
mutate(id = row_number())
head(df2)
# A tibble: 6 × 5
hour day month count id
<int> <fct> <fct> <int> <int>
1 1 Mon Jan 13 1
2 2 Mon Jan 43 2
3 3 Mon Jan 82 3
4 4 Mon Jan 66 4
5 5 Mon Jan 49 5
6 6 Mon Jan 79 6
Any suggestions are much appreciated. Thank you!
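One possible way to avoid the factor conversion, sketched here in base R: look up each label's position with match() and sort on those positions. This is only an illustration, not an answer from the thread. The names day_levels, m, d and sorted are made up for the example, and the direct formula for id assumes a complete grid in which every month/day/hour combination appears exactly once, with hours running from 1 to 24 as in the data above.

```r
# Lookup vector for weekdays; month.abb ("Jan", "Feb", ...) ships with base R
day_levels <- c("Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun")

# Small stand-in for the question's data (hours 1..24, six months)
set.seed(0)
df <- expand.grid(hour = 1:24, day = day_levels, month = month.abb[1:6],
                  stringsAsFactors = FALSE)
df$count <- sample(0:100, nrow(df), replace = TRUE)
df <- df[sample(nrow(df)), ]          # shuffle so the sort has work to do

# Position of each label in its lookup vector
m <- match(df$month, month.abb)
d <- match(df$day, day_levels)

# Option 1: sort on the matched positions, then number the rows
sorted <- df[order(m, d, df$hour), ]
sorted$id <- seq_len(nrow(sorted))

# Option 2: compute the id directly, without sorting (complete grid only)
df$id <- (m - 1L) * 7L * 24L + (d - 1L) * 24L + df$hour

head(sorted)
```

Option 2 leaves the row order untouched, which may be handy if the id is only needed as a sort key later on.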

Related

How to create new column based off information in other columns in R

I have a large dataset that spans over 20 years. I have a column for the date and another column for the hour ending (HE). I'm trying to add a new column to provide the hour by hour (hrxhr) information in a given year (so running total). So date: Jan 1, 2023, HE: 1 should be hrxhr: 1 and Dec 31, 2023, HE: 24, should be hrxhr:8760 (8784 on leap years).
Should look like this:
YEAR
MONTH
DAY
HOUR OF DAY
Month_num
Date
Date1
NEW COLUMN hrxhr
2023
Dec
31
22
12
2023-12-31
365
8758
2023
Dec
31
23
12
2023-12-31
365
8759
2023
Dec
31
24
12
2023-12-31
365
8760
2024
Jan
01
1
01
2024-01-01
1
1
2024
Jan
01
2
01
2024-01-01
1
2
At first I thought I could get the Julian date and then multiply that by the HE, but that is incorrect, since Jan 2, 2023, HE: 1 would then equal 2, whereas the hrxhr/running total should equal 25.
In base R:
df <- data.frame(
YEAR = c(2023L, 2023L, 2023L, 2024L, 2024L),
MONTH = c("Dec", "Dec", "Dec", "Jan", "Jan"), DAY = c(31L, 31L, 31L, 1L, 1L),
HOUR_OF_DAY = c(22L, 23L, 24L, 1L, 2L), Month_num = c(12L,
12L, 12L, 12L, 12L), Date = c("2023-12-31", "2023-12-31",
"2023-12-31", "2024-01-01", "2024-01-01"), Date1 = c(365L,
365L, 365L, 1L, 1L))
df$hrxhr <- mapply(\(from, to, by) length(seq.POSIXt(from, to, by)),
from = trunc(as.POSIXlt(df$Date), "years"),
to = as.POSIXlt(df$Date),
by="1 hour") + df$HOUR_OF_DAY - 1
df
#> YEAR MONTH DAY HOUR_OF_DAY Month_num Date Date1 hrxhr
#> 1 2023 Dec 31 22 12 2023-12-31 365 8758
#> 2 2023 Dec 31 23 12 2023-12-31 365 8759
#> 3 2023 Dec 31 24 12 2023-12-31 365 8760
#> 4 2024 Jan 1 1 12 2024-01-01 1 1
#> 5 2024 Jan 1 2 12 2024-01-01 1 2
If you are open to a tidyverse / lubridate solution, you could use
library(dplyr)
library(lubridate)
df1 %>%
mutate(
begin = ymd_hms(paste0(year(Date), "-01-01 00:00:00")),
target = ymd_hms(paste0(Date, " ", HOUR_OF_DAY, ":00:00")),
hrxhr = time_length(interval(begin, target), "hours")) %>%
select(-begin, -target)
This returns
# A tibble: 5 × 7
YEAR MONTH DAY HOUR_OF_DAY Month_num Date hrxhr
<dbl> <chr> <chr> <dbl> <dbl> <date> <dbl>
1 2023 Dec 31 22 12 2023-12-31 8758
2 2023 Dec 31 23 12 2023-12-31 8759
3 2023 Dec 31 24 12 2023-12-31 8760
4 2024 Jan 01 1 12 2024-01-01 1
5 2024 Jan 01 2 12 2024-01-01 2
Data
structure(list(YEAR = c(2023, 2023, 2023, 2024, 2024), MONTH = c("Dec",
"Dec", "Dec", "Jan", "Jan"), DAY = c("31", "31", "31", "01",
"01"), HOUR_OF_DAY = c(22, 23, 24, 1, 2), Month_num = c(12, 12,
12, 12, 12), Date = structure(c(19722, 19722, 19722, 19723, 19723
), class = "Date")), row.names = c(NA, -5L), class = "data.frame")
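The running total asked for in this question can also be written as plain arithmetic: the number of full days before the date, times 24, plus the hour ending. The sketch below is an illustration in base R and not part of either answer. The function name hour_of_year is hypothetical, and it assumes every day has exactly 24 hours (no daylight-saving adjustment), which matches the 8760/8784 totals in the question.

```r
# Hour-of-year from a date and an hour ending (HE, 1..24)
hour_of_year <- function(date, hour_ending) {
  # $yday is the zero-based day of the year, so Jan 1 gives 0
  days_before <- as.POSIXlt(as.Date(date))$yday
  days_before * 24L + hour_ending
}

hour_of_year("2023-01-02", 1)    # 25, the case that plain multiplication got wrong
hour_of_year("2023-12-31", 24)   # 8760
hour_of_year("2024-12-31", 24)   # 8784 (leap year)
```

The function is vectorised, so it can be applied to whole columns, for example hour_of_year(df$Date, df$HOUR_OF_DAY).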

Applying a function iteratively in a grouped dplyr dataframe to create a column in R

Suppose I'm given the following input dataframe:
ID Date
1 20th May, 2020
1 21st May, 2020
1 28th May, 2020
1 29th May, 2020
2 20th May, 2020
2 1st June, 2020
I want to generate the following dataframe:
ID Date Delta
1 20th May, 2020 0
1 21st May, 2020 1
1 28th May, 2020 7
1 29th May, 2020 1
2 20th May, 2020 0
2 1st June, 2020 12
The idea is to group by ID first. Then, within each ID, I iterate over the dates and subtract the previous date from the current one; the first date in each group has no predecessor, so its delta is 0.
I have been using dplyr, but I am unsure how to do this per group and iteratively.
My goal is to filter the deltas and retain 0 and anything larger than 7, but the deltas must follow the 'preceding date' logic within a specific ID.
library(dplyr)
dat %>%
mutate(Date = as.Date(gsub("[a-z]{2} ", " ", Date), format = "%d %b, %Y")) %>%
group_by(ID) %>%
mutate(Delta = c(0, diff(Date))) %>%
ungroup()
# # A tibble: 6 x 3
# ID Date Delta
# <dbl> <date> <dbl>
# 1 1 2020-05-20 0
# 2 1 2020-05-21 1
# 3 1 2020-05-28 7
# 4 1 2020-05-29 1
# 5 2 2020-05-20 0
# 6 2 2020-06-01 12
Steps:
1. remove the ordinal suffix from the day numbers, so that we can
2. convert them to proper Date-class objects, then
3. diff them within ID groups.
Data
dat <- structure(list(ID = c(1, 1, 1, 1, 2, 2), Date = c(" 20th May, 2020", " 21st May, 2020", " 28th May, 2020", " 29th May, 2020", " 20th May, 2020", " 1st June, 2020")), class = "data.frame", row.names = c(NA, -6L))
Similar logic as @r2evans but with different functions.
library(dplyr)
library(lubridate)
df %>%
mutate(Date = dmy(Date)) %>%
group_by(ID) %>%
mutate(Delta = as.integer(Date - lag(Date, default = first(Date)))) %>%
ungroup()
# ID Date Delta
# <int> <date> <int>
#1 1 2020-05-20 0
#2 1 2020-05-21 1
#3 1 2020-05-28 7
#4 1 2020-05-29 1
#5 2 2020-05-20 0
#6 2 2020-06-01 12
data
df <- structure(list(ID = c(1L, 1L, 1L, 1L, 2L, 2L), Date = c("20th May, 2020",
"21st May, 2020", "28th May, 2020", "29th May, 2020", "20th May, 2020",
"1st June, 2020")), class = "data.frame", row.names = c(NA, -6L))
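Neither answer shows the final filtering step the question mentions (keep deltas equal to 0 or larger than 7). A minimal base R sketch of the whole idea follows. It is an illustration only: the data frame name visits is hypothetical, and the dates are assumed to be already parsed to Date class so the example does not depend on the locale.

```r
# Example data with dates already converted to Date class (assumption)
visits <- data.frame(
  ID   = c(1, 1, 1, 1, 2, 2),
  Date = as.Date(c("2020-05-20", "2020-05-21", "2020-05-28",
                   "2020-05-29", "2020-05-20", "2020-06-01"))
)

# Days since the previous date within each ID; the first row of each ID gets 0
visits$Delta <- ave(as.numeric(visits$Date), visits$ID,
                    FUN = function(x) c(0, diff(x)))

# Keep group starts (0) and gaps strictly larger than 7 days
kept <- visits[visits$Delta == 0 | visits$Delta > 7, ]
kept
```

Note that the delta of exactly 7 in the third row is dropped, because the condition is "larger than 7"; use >= 7 if week-long gaps should be kept.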

How to summarize the top 3 highest values in a dataset when there are ties

I have a data frame (my_data) and want to calculate the sum of only the 3 highest values even though there might be ties. I am quite new to R and I've used dplyr.
# A tibble: 15 x 3
city month number
<chr> <chr> <dbl>
1 Lund jan 12
2 Lund feb 12
3 Lund mar 18
4 Lund apr 28
5 Lund may 28
6 Stockholm jan 15
7 Stockholm feb 15
8 Stockholm mar 30
9 Stockholm apr 30
10 Stockholm may 10
11 Uppsala jan 22
12 Uppsala feb 30
13 Uppsala mar 40
14 Uppsala apr 60
15 Uppsala may 30
This is the code I have tried:
# For each city, count the top 3 of variable number
my_data %>% group_by(city) %>% top_n(3, number) %>% summarise(top_nr = sum(number))
The expected (wanted) output is:
# A tibble: 3 x 2
city top_nr
<chr> <dbl>
1 Lund 74
2 Stockholm 75
3 Uppsala 130
but the actual R output is:
# A tibble: 3 x 2
city top_nr
<chr> <dbl>
1 Lund 74
2 Stockholm 90
3 Uppsala 160
It seems like if there are ties, all tied values are included in the summation. I wanted only 3 unique instances with highest values to be counted.
Any help would be much appreciated! :)
We can use distinct() to remove duplicate values within each city. top_n() keeps every row that ties with the n-th highest value, so it can return more than n rows.
my_data %>%
distinct(city, number, .keep_all = TRUE) %>%
group_by(city) %>%
top_n(3, number) %>%
summarise(top_nr = sum(number))
Update
Based on the OP's new output: after top_n() (whose result is not sorted), arrange 'number' in descending order within each city and sum its first 3 values.
my_data %>%
group_by(city) %>%
top_n(3, number) %>%
arrange(city, desc(number)) %>%
summarise(number = sum(head(number, 3)))
# A tibble: 3 x 2
# city number
# <chr> <int>
#1 Lund 74
#2 Stockholm 75
#3 Uppsala 130
data
my_data <- structure(list(city = c("Lund", "Lund", "Lund", "Lund", "Lund",
"Stockholm", "Stockholm", "Stockholm", "Stockholm", "Stockholm",
"Uppsala", "Uppsala", "Uppsala", "Uppsala", "Uppsala"), month = c("jan",
"feb", "mar", "apr", "may", "jan", "feb", "mar", "apr", "may",
"jan", "feb", "mar", "apr", "may"), number = c(12L, 12L, 18L,
28L, 28L, 15L, 15L, 30L, 30L, 10L, 22L, 30L, 40L, 60L, 30L)),
class = "data.frame", row.names = c("1",
"2", "3", "4", "5", "6", "7", "8", "9", "10", "11", "12", "13",
"14", "15"))
Life might be way simpler without top_n():
my_data %>%
group_by(city) %>%
summarize(
top_nr = sum(tail(sort(number), 3))
)
This tidyverse (actually, dplyr) solution is almost equal to akrun's, but filters the dataframe instead of getting the top_n.
library(tidyverse)
my_data %>%
group_by(city) %>%
arrange(desc(number), .by_group = TRUE) %>%
filter(row_number() %in% 1:3) %>%
summarise(top_nr = sum(number))
## A tibble: 3 x 2
# city top_nr
# <chr> <int>
#1 Lund 74
#2 Stockholm 75
#3 Uppsala 130
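For comparison, the same "sort, take three, sum" idea can be written without any packages. The sketch below is an illustration in base R, using the numbers from the question; the helper name top3 is made up for the example.

```r
# Data from the question (city and number columns only)
my_data <- data.frame(
  city   = rep(c("Lund", "Stockholm", "Uppsala"), each = 5),
  number = c(12, 12, 18, 28, 28,
             15, 15, 30, 30, 10,
             22, 30, 40, 60, 30)
)

# Sum of the three largest values; ties beyond the third position are ignored
top3 <- function(x) sum(head(sort(x, decreasing = TRUE), 3))

res <- aggregate(number ~ city, data = my_data, FUN = top3)
res
```

Within dplyr itself, slice_max(number, n = 3, with_ties = FALSE) is the newer replacement for top_n() and drops the extra tied rows, which may be the most direct fix if the pipeline should stay in dplyr.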

Efficient algorithm to calculate values in data.frame without loop

Here is the situation where I got kinda stuck with R. I have a data table with one row for each day, something like this:
Date = c(as.Date("2015-12-31"), as.Date("2016-01-01"));
Month1 = c("DEC", "JAN");
Year1 = c("15", "16");
Price1 = c(100, 110);
Month2 = c(NA_character_, NA_character_);
Year2 = c(NA_character_, NA_character_);
Price2 = c(NA_integer_, NA_integer_);
Month3 = c(NA_character_, NA_character_);
Year3 = c(NA_character_, NA_character_);
Price3 = c(NA_integer_, NA_integer_);
Month4 = c(NA_character_, NA_character_);
Year4 = c(NA_character_, NA_character_);
Price4 = c(NA_integer_, NA_integer_);
dataSample = data.frame(Date, Month1, Year1, Price1, Month2, Year2, Price2, Month3, Year3, Price3, Month4, Year4, Price4);
Which gives such a table:
Date Month1 Year1 Price1 Month2 Year2 Price2 Month3 Year3 Price3 Month4 Year4 Price4
1 2015-12-31 DEC 15 100 <NA> <NA> NA <NA> <NA> NA <NA> <NA> NA
2 2016-01-01 JAN 16 110 <NA> <NA> NA <NA> <NA> NA <NA> <NA> NA
Now I need to calculate all months and prices for each. For that I have 2 other data frames:
Date = c(as.Date("2015-12-31"), as.Date("2015-12-31"), as.Date("2015-12-31"), as.Date("2016-01-01"), as.Date("2016-01-01"), as.Date("2016-01-01"));
Month.Start = c("DEC", "JAN", "FEB", "JAN", "FEB", "MAR");
Year.Start = c("15", "16", "16", "16", "16", "16")
Month.End = c("JAN", "FEB", "MAR", "FEB", "MAR", "APR");
Year.End = c("16", "16", "16", "16", "16", "16")
Diff = c(10, 15, -15, 19, -20, -5);
diffsOneMonth = data.frame(Date, Month.Start, Year.Start, Month.End, Year.End, Diff)
Date = c(as.Date("2015-12-31"), as.Date("2016-01-01"));
Month.Start = c("DEC", "MAR");
Year.Start = c("15", "16")
Month.End = c("MAR", "JUN");
Year.End = c("16", "16")
Diff = c(11, 25);
diffsThreeMonth = data.frame(Date, Month.Start, Year.Start, Month.End, Year.End, Diff)
Which gives me these tables:
One month price differences
Date Month.Start Year.Start Month.End Year.End Diff
1 2015-12-31 DEC 15 JAN 16 10
2 2015-12-31 JAN 16 FEB 16 15
3 2015-12-31 FEB 16 MAR 16 -15
4 2016-01-01 JAN 16 FEB 16 19
5 2016-01-01 FEB 16 MAR 16 -20
6 2016-01-01 MAR 16 APR 16 -5
Three month price differences
Date Month.Start Year.Start Month.End Year.End Diff
1 2015-12-31 DEC 15 MAR 16 11
2 2016-01-01 MAR 16 JUN 16 25
Now I must fill the dataSample data frame using the data from the difference tables. I check which start/end months/years are available there and fill those months/years into dataSample. Then I take the price difference and set the calculated price in dataSample. For example, in dataSample we start with DEC 15; in diffsOneMonth we have the entry DEC 15 - JAN 16 with a difference of 10, so we add it to the DEC 15 price of 100 and get a JAN 16 price of 110:
Date Month1 Year1 Price1 Month2 Year2 Price2 Month3 Year3 Price3 Month4 Year4 Price4
1 2015-12-31 DEC 15 100 JAN 16 110 <NA> <NA> NA <NA> <NA> NA
2 2016-01-01 JAN 16 110 <NA> <NA> NA <NA> <NA> NA <NA> <NA> NA
Now it's possible to do the next month, then the next, and so on. Using diffsOneMonth only, we would get a result like this:
Date Month1 Year1 Price1 Month2 Year2 Price2 Month3 Year3 Price3 Month4 Year4 Price4
1 2015-12-31 DEC 15 100 JAN 16 110 FEB 16 125 MAR 16 110
2 2016-01-01 JAN 16 110 FEB 16 129 MAR 16 109 APR 16 104
However, there is an additional requirement: I must use the widest available month spread to calculate prices where possible. For 2015-12-31 there is a three-month spread from DEC 15 to MAR 16, which should override the price from the one-month differences. The DEC 15 price is 100 and the DEC 15 - MAR 16 difference is 11, which makes the MAR 16 price 111 rather than 110:
Date Month1 Year1 Price1 Month2 Year2 Price2 Month3 Year3 Price3 Month4 Year4 Price4
1 2015-12-31 DEC 15 100 JAN 16 110 FEB 16 125 MAR 16 111
2 2016-01-01 JAN 16 110 FEB 16 129 MAR 16 109 APR 16 104
So for this sample, that would be my desired final output.
Real data is much more complex, with 6 and 12 month differences and 64 months forward for each date. Also, some months can be missing. I tried to do it with a loop but it was very slow, and I am not sure how to approach such a problem without a loop. I have created a few helper functions to calculate the next year/month:
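The two rules described above (chain the one-month differences, then let a wider spread override) can be sketched without a loop. This is a simplified illustration in base R, not the method from the answers. It assumes the one-month rows are already sorted by contract within each Date, it only handles wider spreads that start at the outright (first) contract, and the names prices, one, three, key, wide and hit are made up for the example. Dates are kept as character strings for simplicity.

```r
# Outright prices, copied from the question's dataSample
prices <- data.frame(Date = c("2015-12-31", "2016-01-01"),
                     Month1 = c("DEC", "JAN"), Year1 = c("15", "16"),
                     Price1 = c(100, 110))

# One-month spreads (end contract and difference), sorted within each Date
one <- data.frame(Date = rep(c("2015-12-31", "2016-01-01"), each = 3),
                  Month.End = c("JAN", "FEB", "MAR", "FEB", "MAR", "APR"),
                  Year.End = "16",
                  Diff = c(10, 15, -15, 19, -20, -5))

# Three-month spreads
three <- data.frame(Date = c("2015-12-31", "2016-01-01"),
                    Month.Start = c("DEC", "MAR"), Year.Start = c("15", "16"),
                    Month.End = c("MAR", "JUN"), Year.End = "16",
                    Diff = c(11, 25))

# Step 1: outright price plus the running sum of one-month differences
base <- prices$Price1[match(one$Date, prices$Date)]
one$Price <- base + ave(one$Diff, one$Date, FUN = cumsum)

# Step 2: a three-month spread that starts at the outright contract
# overrides the chained price for its end contract
key <- function(...) paste(..., sep = "_")
from_outright <- key(three$Date, three$Month.Start, three$Year.Start) %in%
  key(prices$Date, prices$Month1, prices$Year1)
wide <- three[from_outright, ]
wide$Price <- prices$Price1[match(wide$Date, prices$Date)] + wide$Diff
hit <- match(key(wide$Date, wide$Month.End, wide$Year.End),
             key(one$Date, one$Month.End, one$Year.End))
one$Price[hit[!is.na(hit)]] <- wide$Price[!is.na(hit)]

one
```

For this sample the MAR 16 price on 2015-12-31 becomes 111 and the other prices keep their chained values. Spreads that start at a later contract (such as MAR 16 to JUN 16 on 2016-01-01) would need the already computed price of their start contract and are left out of this sketch.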
nextContract = function(currentMonth, currentYear, length = 1,
years = c("10", "11", "12", "13", "14", "15", "16", "17", "18"),
months = c("JAN", "FEB", "MAR", "APR", "MAY", "JUN", "JUL", "AUG", "SEP", "OCT", "NOV", "DEC")) {
mIdx <- match(currentMonth, months)+length;
yDiff = ifelse(length(months) < mIdx, mIdx / length(months) - ifelse(mIdx %% length(months) == 0, 1, 0), 0);
return(data.frame(nextMonth(currentMonth, length, months), nextYear(currentYear, length = yDiff)))
}
nextYear = function(currentYear, length = 1, years = c("10", "11", "12", "13", "14", "15", "16", "17", "18")) {
return(years[match(currentYear, years)+length]);
}
nextMonth = function(currentMonth, length = 1, months = c("JAN", "FEB", "MAR", "APR", "MAY", "JUN", "JUL", "AUG", "SEP", "OCT", "NOV", "DEC")) {
mIdx <- match(currentMonth, months)+length;
return(months[ifelse(length(months) < mIdx, ifelse(mIdx %% length(months) != 0, mIdx %% length(months), length(months)), mIdx)]);
}
Example of usage could be:
> nextContract("DEC", "15")
nextMonth.currentMonth..length..months. nextYear.currentYear..length...yDiff.
1 JAN 16
or:
> nextContract("DEC", "15", length = 3)
nextMonth.currentMonth..length..months. nextYear.currentYear..length...yDiff.
1 MAR 16
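The month and year roll-over handled by these helpers can also be expressed with integer division and remainder on a single month counter. The sketch below is an alternative illustration, not the code from the question. The name next_contract and the column names Month and Year are hypothetical, and it assumes two-digit years that all fall in the same century.

```r
# Upper-case month codes built from base R's month.abb
months <- toupper(month.abb)

next_contract <- function(month, year, step = 1L) {
  # Zero-based count of months, e.g. DEC 15 -> 15 * 12 + 11
  idx <- as.integer(year) * 12L + (match(month, months) - 1L) + as.integer(step)
  data.frame(Month = months[idx %% 12L + 1L],            # remainder picks the month
             Year  = sprintf("%02d", as.integer(idx %/% 12L)),  # quotient is the year
             stringsAsFactors = FALSE)
}

next_contract("DEC", "15")            # JAN 16
next_contract("DEC", "15", step = 3)  # MAR 16
```

Because everything is vectorised, whole columns can be passed at once, for example next_contract(dataSample$Month1, dataSample$Year1).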
This turned out to be a pretty long question, but I hope someone will take the time to review it :)
Thanks in advance!
EDIT
With a few changes to the proposed solution, I got what I needed:
outrightAndForwardRows <- list("1" = diffsOneMonth, "3" = diffsThreeMonth) %>%
bind_rows(.id = "time_step") %>%
left_join(dataSample %>%
select(Date, Price1, Month1, Year1) ) %>%
mutate(Day.Start = 1) %>%
mutate(Day.End = 1) %>%
mutate(Outright.Day = 1) %>%
unite("Contract.Start", Day.Start, Month.Start, Year.Start) %>%
unite("Contract.End", Day.End, Month.End, Year.End) %>%
unite("Contract.Outright", Outright.Day, Month1, Year1) %>%
mutate(time_step = as.numeric(time_step),
Contract.Start =
Contract.Start %>%
parse_date_time("%d_%b_%y")) %>%
mutate(Contract.End =
Contract.End %>%
parse_date_time("%d_%b_%y")) %>%
mutate(Contract.Outright =
Contract.Outright %>%
parse_date_time("%d_%b_%y")) %>%
group_by(time_step, Date) %>%
arrange(Contract.End) %>%
mutate(Price = cumsum(Diff) + Price1) %>%
group_by(Date, Contract.End) %>%
slice(time_step %>% which.max) %>%
ungroup() %>%
select(-time_step, -Diff, -Contract.Start)
#### add outright and forward months to the same columns
outright <- outrightAndForwardRows %>% select(Date, Price=Price1, Contract=Contract.Outright) %>% unique
forwardMonths <- outrightAndForwardRows %>% select(Date, Contract=Contract.End, Price)
# join and sort rows
joined <- rbind(outright, forwardMonths) %>% arrange(Date, Contract)
# add contract sequence (data.table provides data.table(), := and this dcast)
library(data.table)
joined = data.table(joined)
joined = joined[, Contract.seq:=seq(.N), by=Date];
dcast(joined, Date ~ Contract.seq, value.var=c("Price", "Contract"))
Something like this:
library(dplyr)
library(tidyr)
library(lubridate)
list(`1` = diffsOneMonth,
`3` = diffsThreeMonth) %>%
bind_rows(.id = "time_step") %>%
left_join(dataSample %>%
select(Date, Price1, Month1, Year1) ) %>%
mutate(Day.Start = 1) %>%
unite("Date.Start", Day.Start, Month.Start, Year.Start) %>%
mutate(time_step = as.numeric(time_step),
Date.Start =
Date.Start %>%
parse_date_time("%d_%b_%y")) %>%
group_by(time_step, Date) %>%
arrange(Date.Start) %>%
mutate(Price = cumsum(Diff) + Price1) %>%
group_by(Date, Date.Start) %>%
slice(time_step %>% which.max)

R-Having trouble with aggregating data over dates

I am trying to aggregate a table using ddply().
My table looks like this:
Year Month Count
2000 Jan 1
2000 Jan 2
2001 Feb 2
2001 Feb 1
I want to sum up the counts based on year and month. So I would have 2000, Jan, 3 and 2001, Feb, 3.
My code is
ddply(df,???,sum(Count))
I am not sure how to add in multiple variables.
We group by 'Year' and 'Month' and get the sum of 'Count', explicitly calling summarise from plyr.
Using plyr
library(plyr)
ddply(df, .(Year, Month), plyr::summarise, Count=sum(Count))
# Year Month Count
#1 2000 Jan 3
#2 2001 Feb 3
Or we can use the formula method of aggregate from base R.
aggregate(Count~., df, FUN=sum)
# Year Month Count
#1 2001 Feb 3
#2 2000 Jan 3
Or with dplyr, we group by the variables and summarise
library(dplyr)
df %>%
group_by(Year, Month) %>%
dplyr::summarise(Count=sum(Count))
# Year Month Count
# (int) (chr) (int)
#1 2000 Jan 3
#2 2001 Feb 3
Or we convert the 'data.frame' to 'data.table' (setDT(df)), group by the columns, and get the sum of 'Count'.
library(data.table)
setDT(df)[, list(Count=sum(Count)), .(Year, Month)]
# Year Month Count
#1: 2000 Jan 3
#2: 2001 Feb 3
NOTE: When several loaded packages export functions with the same name, it is safer to call them as packagename::function (plyr::summarise and dplyr::summarise).
data
df <- structure(list(Year = c(2000L, 2000L, 2001L, 2001L),
Month = c("Jan",
"Jan", "Feb", "Feb"), Count = c(1L, 2L, 2L, 1L)), .Names = c("Year",
"Month", "Count"), class = "data.frame",
row.names = c(NA, -4L))
