I am working with a large dataset (over 1 million rows) that includes, among others, two columns: a date and a delay number.
ID col1 Date Delay
1: A 100 2021-05-01 1
2: B 200 2018-04-03 3
3: C 300 2020-02-17 2
I want to duplicate each row according to its Delay value, incrementing the date for each duplicate in a new column:
ID col1 Date Delay New_Date
1: A 100 2021-05-01 1 2021-05-02
2: B 200 2018-04-03 3 2018-04-04
3: B 200 2018-04-03 3 2018-04-05
4: B 200 2018-04-03 3 2018-04-06
5: C 300 2020-02-17 2 2020-02-18
6: C 300 2020-02-17 2 2020-02-19
I am currently doing this with a nested for loop, which is extremely inefficient and takes a long time:
result <- df[0, ]  # accumulator, grown row by row (the slow part)
for (row in 1:nrow(df)) {
  delay <- as.numeric(df[row, "Delay"])
  tempdf <- df[0, ]
  for (numberDelay in 1:delay) {
    tempdf[numberDelay, ] <- df[row, ]
    tempdf[numberDelay, "New_Date"] <- as.Date.character(tempdf[numberDelay, "Date"] + as.numeric(numberDelay),
                                                         tryFormats = "%Y-%m-%d")
  }
  result <- rbind(result, tempdf)
}
Context: this would let me flag delays that fell on a weekend or a national holiday by comparing the new date against a list of blacklisted dates.
Is there an efficient way to do this in R?
You can try with dplyr and tidyr:
library(dplyr)
library(tidyr)
df %>%
  rowwise() %>%
  mutate(New_Date = list(seq.Date(Date + 1, Date + Delay, by = "day"))) %>%
  unnest(New_Date)
#> # A tibble: 6 x 5
#> ID col1 Date Delay New_Date
#> <chr> <int> <date> <int> <date>
#> 1 A 100 2021-05-01 1 2021-05-02
#> 2 B 200 2018-04-03 3 2018-04-04
#> 3 B 200 2018-04-03 3 2018-04-05
#> 4 B 200 2018-04-03 3 2018-04-06
#> 5 C 300 2020-02-17 2 2020-02-18
#> 6 C 300 2020-02-17 2 2020-02-19
However, considering the context you explained, I think something like this could be more effective for you:
# example of vector of blacklisted days
blacklist_days <- as.Date(c("2020-02-18", "2018-04-04", "2018-04-05"))
df %>%
  rowwise() %>%
  mutate(Dates = list(seq.Date(Date + 1, Date + Delay, by = "day"))) %>%
  mutate(n_bl = sum(Dates %in% blacklist_days)) %>%
  ungroup()
#> # A tibble: 3 x 6
#> ID col1 Date Delay Dates n_bl
#> <chr> <int> <date> <int> <list> <int>
#> 1 A 100 2021-05-01 1 <date [1]> 0
#> 2 B 200 2018-04-03 3 <date [3]> 2
#> 3 C 300 2020-02-17 2 <date [2]> 1
This way you avoid duplicating rows, which is what hurts your performance.
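If you do end up needing the duplicated rows at that scale, a fully vectorised base R sketch (my addition, not from the answer above) avoids the per-row overhead of rowwise():

# Repeat each row index Delay times, then add the per-row day offsets 1..Delay.
# rep() and sequence() are vectorised, so nothing is done row by row.
idx <- rep(seq_len(nrow(df)), df$Delay)
result <- df[idx, ]
result$New_Date <- result$Date + sequence(df$Delay)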
You can create a data frame of duplicates separately, and then combine them with the original. This uses a loop to go through the different values of Delay.
dat <- data.frame(ID = LETTERS[1:3], col1 = 1:3 * 100,
                  date = as.Date(c('2021-05-01', '2018-04-03', '2020-02-17')),
                  delay = c(1, 3, 2))
dat
  ID col1       date delay
1  A  100 2021-05-01     1
2  B  200 2018-04-03     3
3  C  300 2020-02-17     2

dat$sk <- 1:nrow(dat)   # sort key to keep each duplicate next to its original
ddup <- data.frame()
for (i in 2:max(dat$delay)) {
  dd <- dat[dat$delay >= i, ]
  dd$date <- dd$date + (i - 1)
  ddup <- rbind(ddup, dd)
}
dat <- rbind(dat, ddup)
dat <- dat[order(dat$sk, dat$date), ]
dat
   ID col1       date delay sk
1   A  100 2021-05-01     1  1
2   B  200 2018-04-03     3  2
21  B  200 2018-04-04     3  2
22  B  200 2018-04-05     3  2
3   C  300 2020-02-17     2  3
31  C  300 2020-02-18     2  3
I want to create new rows based on the values of existing rows in my dataset. There are two catches: first, some cell values need to remain constant while others have to increase by +1. Second, I need to cycle through every row the same number of times.
I think it will be easier to understand with data
Here is where I am starting from:
mydata <- data.frame(id = c(10012000, 10012002, 10022000, 10022002),
                     col1 = c(100, 201, 44, 11),
                     col2 = c("A", "C", "B", "A"))
Here is what I want:
mydata2 <- data.frame(id = c(10012000, 10012001, 10012002, 10012003, 10022000, 10022001, 10022002, 10022003),
                      col1 = c(100, 100, 201, 201, 44, 44, 11, 11),
                      col2 = c("A", "A", "C", "C", "B", "B", "A", "A"))
Note how I add +1 in the id column cell for each new row but col1 and col2 remain constant.
Thank you
library(tidyverse)
mydata |>
  mutate(id = map(id, \(x) c(x, x + 1))) |>
  unnest(id)
#> # A tibble: 8 × 3
#> id col1 col2
#> <dbl> <dbl> <chr>
#> 1 10012000 100 A
#> 2 10012001 100 A
#> 3 10012002 201 C
#> 4 10012003 201 C
#> 5 10022000 44 B
#> 6 10022001 44 B
#> 7 10022002 11 A
#> 8 10022003 11 A
Created on 2022-04-14 by the reprex package (v2.0.1)
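If you ever need more than two copies per row, the same idea generalizes; a sketch, assuming a fixed copy count n (a name I'm introducing):

n <- 3  # hypothetical number of copies per row
mydata |>
  mutate(id = map(id, \(x) x + 0:(n - 1))) |>
  unnest(id)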
You could use a tidyverse approach:
library(dplyr)
library(tidyr)
mydata %>%
  group_by(id) %>%
  uncount(2) %>%
  mutate(id = first(id) + row_number() - 1) %>%
  ungroup()
This returns
# A tibble: 8 x 3
id col1 col2
<dbl> <dbl> <chr>
1 10012000 100 A
2 10012001 100 A
3 10012002 201 C
4 10012003 201 C
5 10022000 44 B
6 10022001 44 B
7 10022002 11 A
8 10022003 11 A
library(data.table)
setDT(mydata)
# copy() snapshots the original first; `id := id + 1` then increments by reference
final <- setorder(rbind(copy(mydata), mydata[, id := id + 1]), id)
# id col1 col2
# 1: 10012000 100 A
# 2: 10012001 100 A
# 3: 10012002 201 C
# 4: 10012003 201 C
# 5: 10022000 44 B
# 6: 10022001 44 B
# 7: 10022002 11 A
# 8: 10022003 11 A
I think this should do it:
library(dplyr)
df1 <- arrange(rbind(mutate(mydata, id = id + 1), mydata), id, col2)
Gives:
id col1 col2
1 10012000 100 A
2 10012001 100 A
3 10012002 201 C
4 10012003 201 C
5 10022000 44 B
6 10022001 44 B
7 10022002 11 A
8 10022003 11 A
In base R, for nostalgic reasons:
mydata2 <- as.data.frame(lapply(mydata, function(col) rep(col, each = 2)))
mydata2$id <- mydata2$id + 0:1
I have a dataset that is similar to the following:
df <- data.frame(
  date = c("2020-02-01", "2020-02-02", "2020-02-03", "2020-02-04", "2020-02-05", "2020-02-06"),
  value = c(0, 1, 2, 7, 3, 4))
I would like to split my data frame into two smaller ones: the first containing the rows before the value reaches its max (i.e. 7), and the second containing the rest of the original data frame, as follows:
df1 <- data.frame(
  date = c("2020-02-01", "2020-02-02", "2020-02-03"),
  value = c(0, 1, 2)
)
df2 <- data.frame(
  date = c("2020-02-04", "2020-02-05", "2020-02-06"),
  value = c(7, 3, 4)
)
The second part of the question: now assume the dataset contains more than one object, identified by ID. I would like to do the same thing as explained above, applied to each object (ID):
df <- data.frame(ID = c(1, 1, 1, 1, 1, 1, 2, 2, 2, 2),
                 date = c("2020-02-01", "2020-02-02", "2020-02-03", "2020-02-04", "2020-02-05", "2020-02-06", "2020-02-01", "2020-02-02", "2020-02-03", "2020-02-04"),
                 value = c(0, 1, 2, 7, 3, 4, 10, 16, 11, 12))
Thanks for your time.
You can use which.max to get the index of the max value and use it to subset the dataframe.
ind <- which.max(df$value)
df1 <- df[seq_len(ind - 1), ]
df2 <- df[ind:nrow(df), ]
df1
# A tibble: 3 x 2
# date value
# <chr> <dbl>
#1 2020-02-01 0
#2 2020-02-02 1
#3 2020-02-03 2
df2
# A tibble: 3 x 2
# date value
# <chr> <dbl>
#1 2020-02-04 7
#2 2020-02-05 3
#3 2020-02-06 4
We could create a list of dataframes if there are a lot of IDs and we have to do this for each ID.
library(dplyr)
result <- df %>%
  group_split(ID) %>%
  purrr::map(~ {.x %>%
    group_split(row_number() < which.max(value), .keep = FALSE)})
# In case someone is interested, you could make a data frame from the list above as follows:
result_df <- result %>%
  bind_rows()
Another approach using base R:
> df
date value
1 2020-02-01 0
2 2020-02-02 1
3 2020-02-03 2
4 2020-02-04 7
5 2020-02-05 3
6 2020-02-06 4
> df1 <- df[1:(which(df$value == max(df$value)) - 1), ]
> df2 <- df[which(df$value == max(df$value)):nrow(df), ]
> df1
date value
1 2020-02-01 0
2 2020-02-02 1
3 2020-02-03 2
> df2
date value
4 2020-02-04 7
5 2020-02-05 3
6 2020-02-06 4
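One caveat with the 1:(... - 1) indexing above (my note, not part of the original answer): if the max sat in the first row, the subscript would be 1:0, which counts down instead of being empty. The seq_len() version from the earlier answer sidesteps that:

1:0          # counts down, so df1 would wrongly keep rows 1 and 0
# [1] 1 0
seq_len(0)   # empty, so df1 would correctly have zero rows
# integer(0)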
For the grouped data:
> mylist <- df %>% split(f = df$ID)
> mylist
$`1`
ID date value
1 1 2020-02-01 0
2 1 2020-02-02 1
3 1 2020-02-03 2
4 1 2020-02-04 7
5 1 2020-02-05 3
6 1 2020-02-06 4
$`2`
ID date value
7 2 2020-02-01 10
8 2 2020-02-02 16
9 2 2020-02-03 11
10 2 2020-02-04 12
> split_list <- lapply(mylist, function(x) x[1:(which.max(x$value) - 1),])
> split_list <- append(split_list, lapply(mylist, function(x) x[which.max(x$value): nrow(x),]))
> split_list
$`1`
ID date value
1 1 2020-02-01 0
2 1 2020-02-02 1
3 1 2020-02-03 2
$`2`
ID date value
7 2 2020-02-01 10
$`1`
ID date value
4 1 2020-02-04 7
5 1 2020-02-05 3
6 1 2020-02-06 4
$`2`
ID date value
8 2 2020-02-02 16
9 2 2020-02-03 11
10 2 2020-02-04 12
I have a data frame that stores the amount someone spends per transaction for this month. I'm trying to create a loop that checks for repeated user IDs, then sums the amounts they spent and stores the total in the first record in which they appear. It should set the amount spent in all other occurrences to 0.
I keep getting "Error: No loop for break/next, jumping to top level" when I stop it from running:
# Number of trips
numTrips <- NROW(tripData)
# For each trip in data
for (i in 1:numTrips) {
  # For each trip after i
  for (j in ((i + 1):numTrips)) {
    # If the user IDs match, sum prices
    if (tripData[i, ]$user_id == tripData[j, ]$user_id) {
      tripData[i, ]$original_price <- tripData[i, ]$original_price + tripData[j, ]$original_price
      tripData[j, ]$original_price <- 0
    }
  }
}
Can someone please help?
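As an aside (my observation, not from the thread): the error message usually just means a long-running loop was interrupted, but there is also an off-by-one lurking here, because R's : operator counts down when its left side exceeds its right:

# When i == numTrips, the inner index (i + 1):numTrips is not empty --
# it descends, so j runs past the last row:
numTrips <- 5
(numTrips + 1):numTrips
# [1] 6 5
# A guard such as `if (i < numTrips)` avoids this.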
I'll go with #MrFlick's comment and give you a sample:
library(dplyr)  # for tibble() and the pipe used below
set.seed(42)
dat <- tibble(
  id = rep(1:3, each = 3),
  when = sort(Sys.Date() - sample(10, size = 9)),
  amt = sample(1e4, size = 9))
dat
# # A tibble: 9 x 3
# id when amt
# <int> <date> <int>
# 1 1 2020-06-19 356
# 2 1 2020-06-20 7700
# 3 1 2020-06-21 3954
# 4 2 2020-06-22 9091
# 5 2 2020-06-23 5403
# 6 2 2020-06-24 932
# 7 3 2020-06-25 9189
# 8 3 2020-06-27 5637
# 9 3 2020-06-28 4002
It sounds like you want to sum the amounts for each id, but preserve the individual rows with the rest of the amounts zeroed out.
dat %>%
  group_by(id) %>%
  mutate(amt2 = c(sum(amt), rep(0, n() - 1)))
# # A tibble: 9 x 4
# # Groups: id [3]
# id when amt amt2
# <int> <date> <int> <dbl>
# 1 1 2020-06-19 356 12010
# 2 1 2020-06-20 7700 0
# 3 1 2020-06-21 3954 0
# 4 2 2020-06-22 9091 15426
# 5 2 2020-06-23 5403 0
# 6 2 2020-06-24 932 0
# 7 3 2020-06-25 9189 18828
# 8 3 2020-06-27 5637 0
# 9 3 2020-06-28 4002 0
If instead you just want the summaries, you can use this:
dat %>%
  group_by(id) %>%
  summarize(amt = sum(amt))
# # A tibble: 3 x 2
# id amt
# <int> <int>
# 1 1 12010
# 2 2 15426
# 3 3 18828
or if you want to preserve the date range, then
dat %>%
  group_by(id) %>%
  summarize(whenfrom = min(when), whento = max(when), amt = sum(amt))
# # A tibble: 3 x 4
# id whenfrom whento amt
# <int> <date> <date> <int>
# 1 1 2020-06-19 2020-06-21 12010
# 2 2 2020-06-22 2020-06-24 15426
# 3 3 2020-06-25 2020-06-28 18828
I have these two toy example tables:
Table 1:
attendance_events <- data.frame(
  student_id = c("RA123", "RB123", "RC123", "RA456", "RB456", "RC456", "RA123", "RB123", "RC123", "RA456", "RB456", "RC456"),
  dates = c("2020-02-01", "2020-02-01", "2020-02-01", "2020-02-01", "2020-02-01", "2020-02-01", "2020-02-02", "2020-02-02", "2020-02-02", "2020-02-02", "2020-02-02", "2020-02-02"),
  attendance = c(1, 1, 1, 0, 1, 1, 0, 0, 1, 0, 0, 1),
  stringsAsFactors = FALSE)
attendance_events
student_id dates attendance
1 RA123 2020-02-01 1
2 RB123 2020-02-01 1
3 RC123 2020-02-01 1
4 RA456 2020-02-01 0
5 RB456 2020-02-01 1
6 RC456 2020-02-01 1
7 RA123 2020-02-02 0
8 RB123 2020-02-02 0
9 RC123 2020-02-02 1
10 RA456 2020-02-02 0
11 RB456 2020-02-02 0
12 RC456 2020-02-02 1
Table 2:
all_students <- data.frame(
  student_id = c("RA123", "RB123", "RC123", "RA456", "RB456", "RC456"),
  school_id = c(1, 1, 1, 1, 1, 2),
  grade_level = c(10, 10, 9, 9, 11, 11),
  date_of_birth = c("1990-02-02", "1990-02-02", "1991-01-01", "1991-02-01", "1989-02-02", "1989-02-02"),
  hometown = c("farm", "farm", "farm", "farm", "farm", "city"),
  stringsAsFactors = FALSE)
> all_students
student_id school_id grade_level date_of_birth hometown
1 RA123 1 10 1990-02-02 farm
2 RB123 1 10 1990-02-02 farm
3 RC123 1 9 1991-01-01 farm
4 RA456 1 9 1991-02-01 farm
5 RB456 1 11 1989-02-02 farm
6 RC456 2 11 1989-02-02 city
attendance in attendance_events is 0 if the student was absent that day.
My question is: what is the most efficient way in R to find the grade_level with the largest drop-off in attendance between "2020-02-01" and "2020-02-02"?
My code is:
library(dplyr)
library(reshape2)

# Only include absences because it will be a smaller dataset
att_ws_alt <- inner_join(attendance_events, all_students[, c("student_id", "grade_level")], by = "student_id") %>%
  filter(attendance == 0)

# Set days to check between
date_from <- "2020-02-01"
date_to <- "2020-02-02"

# Continuously pipe to avoid storing and referencing intermediates
att_drop_alt <- att_ws_alt %>%
  filter(dates %in% c(date_from, date_to)) %>%
  group_by(grade_level, dates) %>%
  summarize(absence_bydate = n()) %>%
  dcast(grade_level ~ dates) %>%
  sapply(FUN = function(x) { x[is.na(x)] <- 0; x }) %>%
  as.data.frame() %>%
  mutate("absence_change" = .[, 3] - .[, 2]) %>%
  select(grade_level, absence_change) %>%
  arrange(desc(absence_change))

att_drop_alt
grade_level absence_change
1 10 2
2 11 1
3 9 0
However, this feels a bit complex for what seems like a reasonably simple question. I'd like to see other ways R programmers would answer it, ideally for better performance, but more readable approaches would be good to see too.
Thanks community!
With data.table
library(data.table)
setDT(attendance_events)[all_students, .SD[, .(sum(attendance)),
  .(grade_level, dates)], on = .(student_id)][,
  .(attendance_change = diff(rev(V1))), .(grade_level)]
#    grade_level attendance_change
# 1:          10                 2
# 2:           9                 0
# 3:          11                 1
I guess this is a little more concise:
library(dplyr)
left_join(attendance_events, all_students, by = "student_id") %>%
  group_by(grade_level, dates) %>%
  summarise(attendance = sum(attendance)) %>%
  group_by(grade_level) %>%
  summarize(attendance_change = diff(attendance))
#> # A tibble: 3 x 2
#> grade_level attendance_change
#> <dbl> <dbl>
#> 1 9 0
#> 2 10 -2
#> 3 11 -1
Of course, if you want to count absences instead of attendances, just put a minus sign in front of the diff on the last line, as sketched below.
Sorry if this doesn't exactly answer your question, but I wouldn't want to unfairly accuse the students of being more absent than they were ;)
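A sketch of that variant (same pipeline, sign flipped on the final line):

left_join(attendance_events, all_students, by = "student_id") %>%
  group_by(grade_level, dates) %>%
  summarise(attendance = sum(attendance)) %>%
  group_by(grade_level) %>%
  summarize(absence_change = -diff(attendance))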
library(dplyr)
all_students %>%
  left_join(attendance_events) %>%
  mutate(dates = as.Date(dates)) %>%
  group_by(grade_level, dates) %>%
  summarise(NAbs = sum(ifelse(attendance == 0, 1, 0)),
            N = n(),
            pctAbs = NAbs / n() * 100) %>%
  arrange(dates) %>%
  mutate(change = pctAbs - lag(pctAbs)) %>%
  ungroup() %>%
  arrange(change)
# A tibble: 6 x 6
dates grade_level NAbs N pctAbs change
<date> <dbl> <dbl> <int> <dbl> <dbl>
1 2020-02-02 9 1 2 50 0
2 2020-02-02 11 1 2 50 50
3 2020-02-02 10 2 2 100 100
4 2020-02-01 9 1 2 50 NA
5 2020-02-01 10 0 2 0 NA
6 2020-02-01 11 0 2 0 NA
I have a dataframe:
dat <- data.frame(
  date = c("2015-01-01", "2015-01-01", "2015-01-01", "2015-01-01", "2015-02-02", "2015-02-02", "2015-02-02", "2015-02-02", "2015-02-02"),
  val = c(10, 20, 30, 50, 300, 100, 200, 200, 400),
  type = c("A", "A", "B", "C", "A", "A", "B", "C", "C"))
dat
date val type
1 2015-01-01 10 A
2 2015-01-01 20 A
3 2015-01-01 30 B
4 2015-01-01 50 C
5 2015-02-02 300 A
6 2015-02-02 100 A
7 2015-02-02 200 B
8 2015-02-02 200 C
9 2015-02-02 400 C
and I would like to have one row for each day with averages by type, so the output would be:
Date A B C
2015-01-01 15 30 50
2015-02-02 200 200 300
Additionally, how would I get the counts, so the results are:
Date A B C
2015-01-01 2 1 1
2015-02-02 2 1 2
library(reshape2)
dcast(data = dat, formula = date ~ type, fun.aggregate = mean, value.var = "val")
# date A B C
# 1 2015-01-01 15 30 50
# 2 2015-02-02 200 200 300
With dcast, the LHS of the formula defines rows, the RHS defines columns, the value.var is the name of the column that becomes values, and the fun.aggregate is how those values are computed. The default fun.aggregate is length, i.e., the number of values. You asked for the average, so we use mean. You could also do min, max, sd, IQR, or any function that takes a vector and returns a single value.
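Since the default fun.aggregate is length, the counts requested in the second part of the question fall out of the same call; a quick check:

dcast(data = dat, formula = date ~ type, fun.aggregate = length, value.var = "val")
#         date A B C
# 1 2015-01-01 2 1 1
# 2 2015-02-02 2 1 2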
You may also use table for the updated question
table(dat[c(1,3)])
# type
#date A B C
#2015-01-01 2 1 1
#2015-02-02 2 1 2
For the first question, I think #Gregor's solution is the best (so far); a possible option with dplyr/tidyr would be:
library(dplyr)
library(tidyr)
dat %>%
  group_by(date, type) %>%
  summarise(val = mean(val)) %>%
  spread(type, val)
Or a base R option would be the following (at 50 characters to dcast's 44, so not too bad :-)):
with(dat, tapply(val, list(date, type), FUN=mean))
# A B C
#2015-01-01 15 30 50
#2015-02-02 200 200 300
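The same tapply call answers the counts question with FUN = length (my addition):

with(dat, tapply(val, list(date, type), FUN = length))
#            A B C
# 2015-01-01 2 1 1
# 2015-02-02 2 1 2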
Personally I would go with Gregor's solution using reshape2. But for the sake of completeness I'll include a base R solution.
agg <- with(dat, aggregate(val, by = list(date = date, type = type), FUN = mean))
out <- reshape(agg, timevar = "type", idvar = "date", direction = "wide")
out
# date x.A x.B x.C
# 1 2015-01-01 15 30 50
# 2 2015-02-02 200 200 300
If you want to get rid of the x. on the column names, you can remove it with gsub.
colnames(out) <- gsub("^x\\.", "", colnames(out))
To get the counts of rows, replace FUN = mean with FUN = length in the call to aggregate.
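For completeness, a sketch of that swap with the same reshape steps:

agg_n <- with(dat, aggregate(val, by = list(date = date, type = type), FUN = length))
out_n <- reshape(agg_n, timevar = "type", idvar = "date", direction = "wide")
colnames(out_n) <- gsub("^x\\.", "", colnames(out_n))
out_n
#         date A B C
# 1 2015-01-01 2 1 1
# 2 2015-02-02 2 1 2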
Using data.table v1.9.5 (current devel), we can do:
require(data.table) ## v1.9.5+
dcast(setDT(dat), date ~ type, fun = list(mean, length), value.var="val")
# date A_mean_val B_mean_val C_mean_val A_length_val B_length_val C_length_val
# 1: 2015-01-01 15 30 50 2 1 1
# 2: 2015-02-02 200 200 300 2 1 2
I'll add the pivot_wider solution, which is meant to replace the earlier tidyverse options. Using pivot_wider with the values_fn argument, we can do the following:
library(tidyr) # At least 1.0.0
dat %>% pivot_wider(names_from = type, values_from = val, values_fn = list(val = mean))
#> # A tibble: 2 x 4
#> date A B C
#> <fct> <dbl> <dbl> <dbl>
#> 1 2015-01-01 15 30 50
#> 2 2015-02-02 200 200 300
and
dat %>% pivot_wider(names_from = type, values_from = val, values_fn = list(val = length))
#> # A tibble: 2 x 4
#> date A B C
#> <fct> <int> <int> <int>
#> 1 2015-01-01 2 1 1
#> 2 2015-02-02 2 1 2
Of course, if we want to get fancy, we can do both at once:
library(purrr)
library(rlang)
map(quos(mean, length),
    ~ pivot_wider(dat, names_from = type, values_from = val, values_fn = list(val = eval_tidy(.))))
#> [[1]]
#> # A tibble: 2 x 4
#> date A B C
#> <fct> <dbl> <dbl> <dbl>
#> 1 2015-01-01 15 30 50
#> 2 2015-02-02 200 200 300
#>
#> [[2]]
#> # A tibble: 2 x 4
#> date A B C
#> <fct> <int> <int> <int>
#> 1 2015-01-01 2 1 1
#> 2 2015-02-02 2 1 2
Created on 2019-12-04 by the reprex package (v0.3.0)
Note that if you're concerned about speed, it may be worth updating to the dev version of tidyr.