R - manipulate last rows depending on group and previous elements

I'm fairly new to R and struggling to find a solution to the following problem:
I have a tibble consisting of 3 columns. The first column contains stock IDs (e.g. ID1, ID2, ...), the second the date of observation, and the third the corresponding return (ID | Date | Return).
To tidy my dataset I need to delete all zero returns, starting from the end of the sample period, until I reach the first non-zero return.
The following picture further visualises my issue:
[image: DatasetExample]
In the example dataset depicted above, I need to delete the yellow-coloured elements (the zero returns at the end of each ID's sample period).
Hence, one needs to first group by ID and then iterate over the table from bottom to top until reaching a non-zero return.
I already found a way by converting the tibble into a matrix and looping over each element, but this approach is rather naive and does not perform well on large datasets (more than 2 million observations), which is exactly my case.
Is there a more efficient way to achieve this? Solutions using dplyr would be highly appreciated.
Thanks in advance.
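As an aside, one way to express that bottom-up walk in dplyr is to reverse a cumulative condition within each group. A minimal sketch, assuming the tibble is called df and is already sorted by Date within each ID:
library(dplyr)
df %>%
  group_by(ID) %>%
  arrange(Date, .by_group = TRUE) %>%          # oldest to newest within each ID
  filter(rev(cumany(rev(Return != 0)))) %>%    # keep rows up to and including the last non-zero return
  ungroup()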

Here is a dplyr solution. I believe it's a bit complicated, but it works.
library(dplyr)
df1 %>%
  mutate(Date = as.Date(Date, format = "%d.%m.%Y")) %>%
  group_by(ID) %>%
  arrange(desc(Date), .by_group = TRUE) %>%
  mutate(flag = min(which(Return == 0)),
         flag = cumsum(Return != 0 & flag <= row_number())) %>%
  filter(flag > 0) %>%
  select(-flag) %>%
  arrange(Date, .by_group = TRUE)
# A tibble: 7 x 3
# Groups:   ID [2]
#      ID Date        Return
#   <int> <date>       <dbl>
# 1     1 2020-09-20   0.377
# 2     1 2020-09-21   0
# 3     1 2020-09-22  -1.10
# 4     2 2020-09-20   0.721
# 5     2 2020-09-21   0
# 6     2 2020-09-22   0
# 7     2 2020-09-23   1.76
Test data creation code
set.seed(2020)
df1 <- data.frame(ID = rep(1:2, each = 5), Date = Sys.Date() - 5:1, Return = rnorm(10))
df1$Date <- format(df1$Date, "%d.%m.%Y")
df1$Return[sample(1:5, 2)] <- 0
df1$Return[sample(6:10, 2)] <- 0
df1$Return[10] <- 0

There might be a more elegant way but this could work:
split_data <- split(data, data$ID)   # 'data' is the original ID | Date | Return tibble, sorted by Date within ID
split_tidy_data <- lapply(split_data, function(x) x[seq_len(max(which(x$Return != 0))), ])   # keep rows up to the last non-zero return
tidy_data <- do.call(rbind, split_tidy_data)
Note: This only works if there is at least one Return per ID that is not equal to 0.
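Given the performance concern (more than 2 million rows), a data.table sketch of the same idea may also be worth trying; it assumes the rows are ordered by Date within each ID and, as noted above, that every ID has at least one non-zero Return (here data again stands for the original tibble from the question):
library(data.table)
dt <- as.data.table(data)            # 'data' is the original ID | Date | Return tibble
setorder(dt, ID, Date)               # oldest to newest within each ID
tidy_dt <- dt[, .SD[seq_len(max(which(Return != 0)))], by = ID]   # keep rows up to the last non-zero return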

In R, use Lubridate to get a conditional average duration between events

Background
I've got a dataframe d:
d <- data.frame(ID = c("a","a","a","a","a","a","a","b","b","c","c"),
                treatment = c(0,1,0,0,0,1,0,1,0,1,1),
                #event = c(0,0,1,1,1,1,1,0,1,1,1),
                service_date = as.Date(c("2011-01-01",
                                         "2011-08-21",
                                         "2011-12-23",
                                         "2012-02-23",
                                         "2013-09-14",
                                         "2013-04-07",
                                         "2014-10-14",
                                         "2013-01-01",
                                         "2013-12-12",
                                         "2014-06-17",
                                         "2015-09-29")),
                stringsAsFactors = FALSE)
> d
ID treatment service_date
1 a 0 2011-01-01
2 a 1 2011-08-21
3 a 0 2011-12-23
4 a 0 2012-02-23
5 a 0 2013-09-14
6 a 1 2013-04-07
7 a 0 2014-10-14
8 b 1 2013-01-01
9 b 0 2013-12-12
10 c 1 2014-06-17
11 c 1 2015-09-29
It describes some people (ID), whether or not they had a treatment, and the date of each entry (row).
The Problem
I want to calculate the mean duration between the first and last treatment==1 for IDs who have more than 1 row where treatment==1.
To make that more clear, let's lay out the steps as if we were doing this manually, and also see what answer I want:
Take ID a. Mr. A has 7 rows of data, but only two rows in which treatment==1: one from 2011-08-21 (row 2) and another from 2013-04-07 (row 6). If you hand-calculate the difference, you find that there are 595 days between the two.
For ID b, we do nothing, as they only have 1 treatment==1. (We'll use filter to skip people like b in the code.)
For Mr. c, we get a difference of 469 days.
Average duration of treatment in this group: (595 days + 469 days) / 2 people = 532 days. This is the desired result.
(It's entirely possible I've done this hand-calculation wrong, and that's fine, as long as it suffices to understand what I'm trying to do. Happy to clarify further if needed; let me know!)
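For what it's worth, the hand calculation can be checked directly with Date arithmetic; a quick sketch using the dates from d above:
as.numeric(as.Date("2013-04-07") - as.Date("2011-08-21"))   # 595 days for ID a
as.numeric(as.Date("2015-09-29") - as.Date("2014-06-17"))   # 469 days for ID c
mean(c(595, 469))                                           # 532 days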
What I've tried
I'm trying to adapt some old code from a similar query to work for this:
d %>%
  group_by(ID) %>%
  filter(sum(treatment) > 1) %>%
  mutate(treatment_years = lubridate::time_length(max(service_date) - min(service_date), unit = "year")) %>%
  ungroup() %>%
  summarise(avg = mean(treatment_years),
            sd = sd(treatment_years))
This code runs, and gets me nearly there. It's filtering out the unwanted IDs and making a mean (and SD) calculation for a defined interval of time for each person.
But it's not quite correct: inside lubridate::time_length it takes "max service date" minus "min service date", when what's needed is "max service date where treatment==1" minus "min service date where treatment==1". The "where treatment==1" restriction is the missing piece.
How do I get it to do that?
I've tried something like this but it just throws an error:
d %>%
  group_by(ID) %>%
  filter(sum(treatment) > 1) %>%
  mutate(treatment_years = lubridate::time_length(max(service_date) & treatment==1 - min(service_date) & treatment==1, unit = "year")) %>%
  ungroup() %>%
  summarise(avg = mean(treatment_years),
            sd = sd(treatment_years))
We can subset service_date with the logical vector treatment == 1, i.e. service_date[treatment == 1] (assuming there is at least one treatment value of 1 per remaining ID):
library(dplyr)
library(lubridate)
d %>%
  group_by(ID) %>%
  filter(sum(treatment) > 1) %>%
  summarise(treatment_years = lubridate::time_length(max(service_date[treatment == 1]) -
                                                       min(service_date[treatment == 1]), unit = "day"),
            .groups = 'drop') %>%
  summarise(avg = mean(treatment_years),
            sd = sd(treatment_years))
Output:
# A tibble: 1 × 2
#     avg    sd
#   <dbl> <dbl>
# 1   532  89.1
An option using by and just subtracting the treated dates.
by(d, d$ID, \(x) {
  if (all(x$treatment == 0)) NA_real_
  else diff(x$service_date[x$treatment == 1]) |> as.numeric()
}) |>
  unlist() |>
  {\(x) c(mean = mean(x, na.rm = TRUE), sd = sd(x, na.rm = TRUE))}()
# mean sd
# 532.00000 89.09545

How to create columns from a list in a for loop using mutate

I was wondering if there was a way to create multiple columns from a list in R using the mutate() function within a for loop.
Here is an example of what I mean:
The Problem:
I have a data frame df that has 2 columns: category and rating. I want to add a column for every element of df$category and in that column, I want a 1 if the category column matches the iterator.
library(dplyr)
df <- tibble(
  category = c("Art", "Technology", "Finance"),
  rating = c(100, 95, 50)
)
Doing it manually, I could do:
df <- df %>%
  mutate(art = ifelse(category == "Art", 1, 0))
However, what happens when I have 50 categories? (Which is close to what I have in my original problem. That would take a lot of time!)
What I tried:
category_names <- df$category
for(name in category_names){
  df <- df %>%
    mutate(name = ifelse(category == name, 1, 0))
}
Unfortunately, it doesn't seem to work.
I'd appreciate any light on the subject!
Full Code:
library(dplyr)
# Creates tibble
df <- tibble(
  category = c("Art", "Technology", "Finance"),
  rating = c(100, 95, 50)
)

# Showcases the operation I would like to loop over df
df <- df %>%
  mutate(art = ifelse(category == "Art", 1, 0))

# Creates a variable for clarity
category_names <- df$category

# For loop I tried
for(name in category_names){
  df <- df %>%
    mutate(name = ifelse(category == name, 1, 0))
}
I am aware that what I am essentially doing is a form of model.matrix(); however, before I found out about that function I was still perplexed why what I was doing before wasn't working.
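For reference, a minimal model.matrix() sketch of that idea (the resulting indicator columns come out prefixed with "category"):
# base R: one 0/1 indicator column per level of category; "- 1" drops the intercept
cbind(df, model.matrix(~ category - 1, data = df))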
We can use pivot_wider after creating a sequence column
library(dplyr)
library(tidyr)
df %>%
  mutate(rn = row_number(), n = 1) %>%
  pivot_wider(names_from = category, values_from = n,
              values_fill = list(n = 0)) %>%
  select(-rn)
# A tibble: 3 x 4
# rating Art Technology Finance
# <dbl> <dbl> <dbl> <dbl>
#1 100 1 0 0
#2 95 0 1 0
#3 50 0 0 1
Or another option is map
library(purrr)
map_dfc(unique(df$category), ~ df %>%
          transmute(!! .x := +(category == .x))) %>%
  bind_cols(df, .)
# A tibble: 3 x 5
# category rating Art Technology Finance
#* <chr> <dbl> <int> <int> <int>
#1 Art 100 1 0 0
#2 Technology 95 0 1 0
#3 Finance 50 0 0 1
If we need a for loop
for(name in category_names) df <- df %>% mutate(!! name := +(category == name))
Or in base R with table
cbind(df, as.data.frame.matrix(table(seq_len(nrow(df)), df$category)))
# category rating Art Finance Technology
#1 Art 100 1 0 0
#2 Technology 95 0 0 1
#3 Finance 50 0 1 0
Wanted to throw something in for anyone who stumbles across this question. The problem in the OP's loop is that the column name "name" gets re-used during each iteration: you end up with only one new column when you really wanted three (or 50). I consistently find myself wanting to create multiple new columns within loops, and I recently found out that mutate() can now take "glue"-like inputs to do this. The following code now also solves the original question:
for(name in category_names){
  df <- df %>%
    mutate("{name}" := ifelse(category == name, 1, 0))
}
This is equivalent to akrun's answer using a for loop, but it doesn't involve the !! operator. Note that you still need the "walrus" := operator, and that the column name needs to be a string (I think since it's using "glue" in the background). I'm thinking some people might find this format easier to understand.
Reference: https://www.tidyverse.org/blog/2020/02/glue-strings-and-tidy-eval/

Drop all rows from data frame that follow a filter threshold using dplyr

This feels like a common enough task that I assume there's an established function/method for accomplishing it. I'm imagining a function like dplyr::filter_after() but there doesn't seem to be one.
Here's the method I'm using as a starting point:
# Setup:
library(dplyr)
threshold <- 3
test.df <- data.frame("num" = c(1:5, 1:5), "let" = letters[1:10])

# Drop every row that follows the first 3, including that row:
out.df <- test.df %>%
  mutate(pastThreshold = cumsum(num >= threshold)) %>%
  filter(pastThreshold == 0) %>%
  dplyr::select(-pastThreshold)
This produces the desired output:
> out.df
num let
1 1 a
2 2 b
Is there another solution that's less verbose?
dplyr provides the window functions cumany() and cumall(), which can be used inside filter() to keep all rows after/before a condition becomes false for the first time (see the dplyr window functions documentation).
test.df %>%
  filter(cumall(num < threshold))  # all rows until condition violated for first time
# num let
# 1 1 a
# 2 2 b
You can do:
test.df %>%
  slice(1:(which.max(num == threshold) - 1))
num let
1 1 a
2 2 b
We can use the same in filter without the need for creating extra column and later removing it
library(dplyr)
test.df %>%
  filter(cumsum(num >= threshold) == 0)
# num let
#1 1 a
#2 2 b
Or another option is match with slice
test.df %>%
  slice(seq_len(match(threshold - 1, num)))
Or another option is rleid
library(data.table)
test.df %>%
  filter(rleid(num >= threshold) == 1)
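For completeness, the same cumulative-sum idea also works in base R without creating a helper column; a minimal sketch:
# keep rows before the running count of num >= threshold becomes positive
test.df[cumsum(test.df$num >= threshold) == 0, ]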

Gather twice in same data frame

I have a dataframe where I want to do two separate gathers
library(tidyverse)
id <- c("A","B","C","D","E")
test_1_baseline <- c(1,2,4,5,6)
test_2_baseline <- c(21000, 23400, 26800,29000,30000)
test_1_followup <- c(0,4,2,3,1)
test_2_followup <- c(10000,12000,13000,15000,21000)
layout_1 <- data.frame(id, test_1_baseline, test_1_followup, test_2_baseline, test_2_followup)
This is the current layout.
Each person is 1 line.
The result of Test 1 at baseline is one variable
The result of Test 2 at baseline is a second variable
The same applies to Test 1/2 follow-up results
I would like the data to be tidier: one column for timepoint, one for the result of test 1, one for the result of test 2.
id2 <- c("A","B","C","D","E","A","B","C","D","E")
time <- c(rep("baseline",5),rep("followup",5))
test_1_result <- c(1,2,4,5,6,0,4,2,3,1)
test_2_result <- c(21000, 23400, 26800,29000,30000,10000,12000,13000,15000,21000)
layout_2 <- data.frame(id2, time,test_1_result,test_2_result)
I'm currently doing what seems to me an odd process where first of all I gather the test 1 data:
test_1 <- select(layout_1, id, test_1_baseline, test_1_followup) %>%
  gather("Timepoint", "test_1", c(test_1_baseline, test_1_followup)) %>%
  mutate(Timepoint = replace(Timepoint, Timepoint == "test_1_baseline", "baseline")) %>%
  mutate(Timepoint = replace(Timepoint, Timepoint == "test_1_followup", "followup"))
Then I do the same for test 2 and join them:
test_2 <- select(layout_1, id, test_2_baseline, test_2_followup) %>%
  gather("Timepoint", "test_2", c(test_2_baseline, test_2_followup)) %>%
  mutate(Timepoint = replace(Timepoint, Timepoint == "test_2_baseline", "baseline")) %>%
  mutate(Timepoint = replace(Timepoint, Timepoint == "test_2_followup", "followup"))
test_combined <- full_join(test_1, test_2)
I tried doing the first gather and then the second on the same dataframe, but then you end up with duplicates; i.e. you end up with
ID 1 Test_1 Baseline Test_2 Baseline
ID 1 Test_1 Baseline Test_2 Followup
ID 1 Test_1 Followup Test_2 Baseline
ID 1 Test_1 Followup Test_2 Followup
== 4 rows where there should only be 2
I feel there must be a cleaner tidyverse way to do this.
Guidance welcomed
One option with data.table using melt which can take multiple measure patterns
library(data.table)
nm1 <- unique(sub(".*_", "", names(layout_1)[-1]))
melt(setDT(layout_1), measure = patterns("test_1", "test_2"),
     value.name = c('test_1_result', 'test_2_result'),
     variable.name = 'time')[, time := nm1[time]][]
You could gather all columns except id, then use separate to split into result and time.
Note that this code assumes that the result name is always 6 characters (test_1, test_2), and separates based on that assumption. You'll need to devise a different separate() call if that is not the case.
library(tidyr)
library(dplyr)
layout_1 %>%
  gather(Var, Val, -id) %>%
  separate(Var, into = c("result", "time"), sep = 6) %>%
  spread(result, Val) %>%
  mutate(time = gsub("_", "", time))
Result:
id time test_1 test_2
1 A baseline 1 21000
2 A followup 0 10000
3 B baseline 2 23400
4 B followup 4 12000
5 C baseline 4 26800
6 C followup 2 13000
7 D baseline 5 29000
8 D followup 3 15000
9 E baseline 6 30000
10 E followup 1 21000
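On newer tidyr (1.0.0 or later), gather()/spread() are superseded by pivot_longer()/pivot_wider(), and the whole reshape can be done in one call. A sketch, assuming the column names keep the test_X_timepoint pattern:
library(tidyr)
layout_1 %>%
  pivot_longer(-id,
               names_to = c(".value", "time"),
               names_pattern = "(test_\\d)_(.*)")
This returns columns id, time, test_1 and test_2 directly.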

R: sum row based on several conditions

I am working on my thesis with little knowledge of R, so the answer to this question may be pretty obvious.
I have a dataset looking like this:
county<-c('1001','1001','1001','1202','1202','1303','1303')
naics<-c('423620','423630','423720','423620','423720','423550','423720')
employment<-c(5,6,5,5,5,6,5)
data<-data.frame(county,naics,employment)
For every county, I want to sum the value of employment of rows with naics '423620' and '423720'. (So two conditions: 1. the same county code; 2. those two naics codes.) The summed value should go in the first row ('423620'), and the second row ('423720') should be removed.
The final dataset should look like this:
county2<-c('1001','1001','1202','1303','1303')
naics2<-c('423620','423630','423620','423550','423720')
employment2<-c(10,6,10,6,5)
data2<-data.frame(county2,naics2,employment2)
I have tried to do it myself with aggregate and rowSum, but because of the two conditions, I have failed thus far. Thank you very much.
We can do
library(dplyr)
data$naics <- as.character(data$naics)
data %>%
  filter(naics %in% c(423620, 423720)) %>%
  group_by(county) %>%
  summarise(naics = "423620", employment = sum(employment)) %>%
  bind_rows(., filter(data, !naics %in% c(423620, 423720)))
# A tibble: 5 x 3
# county naics employment
# <fctr> <chr> <dbl>
#1 1001 423620 10
#2 1202 423620 10
#3 1303 423620 5
#4 1001 423630 6
#5 1303 423550 6
With such a condition, I'd first write a small helper and then pass it on to dplyr mutate:
# replace 423720 by 423620 only if both exist
onlyThoseNAICS <- function(v){
  if( ("423620" %in% v) & ("423720" %in% v) ) v[v == "423720"] <- "423620"
  v
}
data %>%
  dplyr::group_by(county) %>%
  dplyr::mutate(naics = onlyThoseNAICS(naics)) %>%
  dplyr::group_by(county, naics) %>%
  dplyr::summarise(employment = sum(employment)) %>%
  dplyr::ungroup()
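For comparison, a base R sketch of the same recode-then-aggregate idea (merging 423720 into 423620 only when a county has both codes); the helper column naics2 is just an illustrative name:
data$naics <- as.character(data$naics)                               # ensure naics is character, as in the dplyr answer
has_620 <- ave(data$naics == "423620", data$county, FUN = any)       # counties that have a 423620 row
data$naics2 <- ifelse(data$naics == "423720" & has_620, "423620", data$naics)
aggregate(employment ~ county + naics2, data = data, FUN = sum)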
