counting rows between two specific rows with a condition - r

df <- structure(list(inv = c("INV_1", "INV_1", "INV_1", "INV_1", "INV_1"), ass = c("x", "x", "x", "x", "x"), datetime = c("2010-01-01",
"2010-01-02", "2010-01-03", "2010-01-08", "2010-01-19"), portfolio = c(10,
0, 5, 2, 0), operation = c(10, -10, 5, -3, -2)), class = "data.frame", row.names = c(NA, -5L))
I have 4000 investors and 6000 different assets. For each investor I have his trading operations in two variables: operation tells me whether he is buying or selling, and portfolio tells me how much he holds.
I want to compute the number of days a position stays open in the portfolio, so I thought about taking the difference between the day on which the portfolio goes back to zero and the day on which it became positive (a negative portfolio is not possible).
So in the dataset above I would compute row 2 - row 1 ==> 2010-01-02 - 2010-01-01
and row 5 - row 3 ==> 2010-01-19 - 2010-01-03, and so on.
I want to do this for every investor and asset in my dataset, for all the rows in which portfolio > 0.
My dataset will then have a further column called duration, which in this case would equal c(0, 1, 0, 5, 16) (so of course I also have to compute row 1 - row 1 and row 3 - row 3).
Hence my problem is to restart the count every time portfolio goes back to zero.

library(dplyr)
df %>%
mutate(datetime = as.Date(datetime, "%Y-%m-%d")) %>%
group_by(investor, asset) %>%
arrange(datetime) %>%
mutate(grp.pos = cumsum(lag(portfolio, default = 1) == 0)) %>%
group_by(investor, asset, grp.pos) %>%
mutate(`Open (#days)` = datetime - datetime[1])
#> # A tibble: 5 x 6
#> # Groups: investor, asset, grp.pos [2]
#> investor asset datetime portfolio grp.pos `Open (#days)`
#> <chr> <chr> <date> <dbl> <int> <drtn>
#> 1 INV_1 x 2010-01-01 10 0 0 days
#> 2 INV_1 x 2010-01-02 0 0 1 days
#> 3 INV_1 x 2010-01-03 5 1 0 days
#> 4 INV_1 x 2010-01-08 2 1 5 days
#> 5 INV_1 x 2010-01-19 0 1 16 days
Data:
df <- structure(list(investor = c("INV_1", "INV_1", "INV_1", "INV_1", "INV_1"),
asset = c("x", "x", "x", "x", "x"),
datetime = c("2010-01-01", "2010-01-02", "2010-01-03",
"2010-01-08", "2010-01-19"),
portfolio = c(10, 0, 5, 2, 0)),
operation = c(10, -10, 5, -3, -2),
class = "data.frame", row.names = c(NA, -5L))
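For intuition, the grouping step can be reproduced on plain vectors. This is only a sketch in base R using the five example rows; the names prev, grp and duration are made up for illustration and are not part of the answer above. It mirrors lag(portfolio, default = 1) by shifting the vector by one position, so a new group starts on the row right after the portfolio hits zero.

```r
# Example rows for one investor/asset pair (values taken from the question)
portfolio <- c(10, 0, 5, 2, 0)
datetime  <- as.Date(c("2010-01-01", "2010-01-02", "2010-01-03",
                       "2010-01-08", "2010-01-19"))

# Previous row's portfolio; the first row gets a non-zero filler (default = 1)
prev <- c(1, head(portfolio, -1))

# The counter increases on every row that follows a zero portfolio
grp <- cumsum(prev == 0)

# Days since the first date of each group
days     <- as.numeric(datetime)
duration <- days - ave(days, grp, FUN = min)

print(data.frame(datetime, portfolio, grp, duration))
```

With real data the same idea would have to be applied separately per investor and asset, which is what the group_by() calls in the answer take care of.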

Here is one way to do it; it can be extended to ass if necessary.
First we group by inv. Then we convert datetime to Date to make the calculations easy (here with the ymd() function).
The next step could be done in different ways:
The main idea is to group the rows so that each group ends at a row where portfolio is 0. To do this we arrange datetime in descending order, so that cumsum(portfolio == 0) gives a grouping id.
After re-arranging datetime in ascending order, we can subtract the first date from the last, as intended:
library(dplyr)
library(lubridate)
df %>%
group_by(inv) %>%
mutate(datetime = ymd(datetime)) %>%
arrange(desc(datetime)) %>%
group_by(position_Group = cumsum(portfolio==0)) %>%
arrange(datetime) %>%
mutate(position_open = last(datetime)-first(datetime)) %>%
ungroup()
inv ass datetime portfolio operation position_Group position_open
<chr> <chr> <date> <dbl> <dbl> <int> <drtn>
1 INV_1 x 2010-01-01 10 10 2 1 days
2 INV_1 x 2010-01-02 0 -10 2 1 days
3 INV_1 x 2010-01-03 5 5 1 16 days
4 INV_1 x 2010-01-08 2 -3 1 16 days
5 INV_1 x 2010-01-19 0 -2 1 16 days
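The reversed cumulative sum used above can also be checked on plain vectors. This is a minimal base R sketch with the example values, assuming the rows are already sorted by date; is_close, position_group and position_open are illustrative names only.

```r
portfolio <- c(10, 0, 5, 2, 0)
datetime  <- as.Date(c("2010-01-01", "2010-01-02", "2010-01-03",
                       "2010-01-08", "2010-01-19"))

# TRUE on the rows where the position is closed
is_close <- portfolio == 0

# Counting zeros from the end makes each closing row the last row of its group
position_group <- rev(cumsum(rev(is_close)))

# Span of each group in days (last date minus first date), repeated per row
days          <- as.numeric(datetime)
position_open <- ave(days, position_group, FUN = function(d) max(d) - min(d))

print(data.frame(datetime, portfolio, position_group, position_open))
```

Note that this gives the total length of each position on every row, whereas the duration column asked for in the question counts days since the position was opened.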

Related

R: Loops, Dplyr and lubridate, how to combine them

I'm new to R and I'm facing a problem. I have a date vector and a data frame containing sales values and coverage start and end dates.
I need to defer the sale value at each analysis date. For the first analysis period, I can write code that gives me the desired answer. However, my real data has 200K+ rows and 50+ analysis periods.
I'm not able to build a loop or find an alternative function in R that creates the variables Aux[i] and test[i] for each date in the vec_date vector.
The following is an example of code that works for the first analysis period.
library(tidyverse)
library(lubridate)
df <- tibble(DateIn = c(ymd("2021-10-21", "2021-12-25", "2022-05-11")),
DateFin = c(ymd("2022-03-10", "2022-07-12", "2023-02-15")),
Premium = c(11000, 5000, 24500))
date <- ymd("2021-12-31")
vec_date <- date %m+% months(seq(0, 12, by = 6))
df_new <- df |>
mutate(duration = as.numeric(DateFin - DateIn),
Pr_day = Premium/duration,
Aux1 = if_else(DateIn > vec_date[1] | DateFin < vec_date[1], "N", "Y"),
test1 = if_else(Aux1 == "Y" & DateFin > vec_date[1], as.numeric(DateFin - vec_date[1])*Pr_day,
if_else(DateIn > vec_date[1], Premium, 0)))
Does anyone have any idea how I could build this loop, or is there any R function/package that allows me to perform this interaction between my df dataframe and vec_date vector?
Edit: an outline of the format I would need as a result would be:
df_final <- tibble(DateIn = c(ymd("2021-10-21", "2021-12-25", "2022-05-11")),
DateFin = c(ymd("2022-03-10", "2022-07-12", "2023-02-15")),
Premium = c(11000, 5000, 24500),
Aux1 = c("Y", "Y", "N"),
test1 = c(5421.429, 4849.246, 24500.000),
Aux2 = c("N", "Y", "Y"),
test2 = c(0.0000, 301.5075, 20125.0000),
Aux3 = c("N", "N", "Y"),
test3 = c(0, 0, 4025))
Here, Aux1 and test1 are the results for vec_date[1], Aux2 and test2 for vec_date[2], and Aux3 and test3 for vec_date[3]. It is important for me to keep the resulting variables in the same data frame because further analysis will be done on it later.
As @Jon Spring suggests in the comments, probably the preferred approach here
would be to use tidyr::complete() to extend your data frame, repeating each
row in it for each of your analysis dates. Then, you can stick to vectorized
calculations and get the analysis date column in the resulting data, too.
Below is how to do that with the example data you provided; the code uses
tidyr::crossing() for the expansion. I took the liberty of renaming some
columns and simplifying the control-flow based calculation according to my
understanding of the problem, based on what you shared.
First, the example data slightly reframed:
library(tidyverse)
library(lubridate)
policies <- tibble(
policy_id = seq_len(3),
start = ymd("2021-10-21", "2021-12-25", "2022-05-11"),
end = ymd("2022-03-10", "2022-07-12", "2023-02-15"),
premium = c(11000, 5000, 24500)
)
policies
#> # A tibble: 3 x 4
#> policy_id start end premium
#> <int> <date> <date> <dbl>
#> 1 1 2021-10-21 2022-03-10 11000
#> 2 2 2021-12-25 2022-07-12 5000
#> 3 3 2022-05-11 2023-02-15 24500
Then, finding remaining prorated premiums for policies at given dates:
start_date <- ymd("2021-12-31")
dates <- start_date %m+% months(seq(0, 12, by = 6))
policies %>%
mutate(
days = as.numeric(end - start),
daily_premium = premium / days
) %>%
crossing(date = dates) %>%
mutate(
days_left = pmax(0, end - pmax(start, date)),
premium_left = days_left * daily_premium
) %>%
select(policy_id, date, days_left, premium_left)
#> # A tibble: 9 x 4
#> policy_id date days_left premium_left
#> <int> <date> <dbl> <dbl>
#> 1 1 2021-12-31 69 5421.
#> 2 1 2022-06-30 0 0
#> 3 1 2022-12-31 0 0
#> 4 2 2021-12-31 193 4849.
#> 5 2 2022-06-30 12 302.
#> 6 2 2022-12-31 0 0
#> 7 3 2021-12-31 280 24500
#> 8 3 2022-06-30 230 20125
#> 9 3 2022-12-31 46 4025
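Since the question asks for one column per analysis date in the same data frame, here is a rough base R sketch of that wide layout. It assumes the three analysis dates shown in the output above and reuses the test1, test2, test3 names from the question; days_left, daily and wide are illustrative names, and this is not the answerer's code.

```r
start   <- as.Date(c("2021-10-21", "2021-12-25", "2022-05-11"))
end     <- as.Date(c("2022-03-10", "2022-07-12", "2023-02-15"))
premium <- c(11000, 5000, 24500)
dates   <- as.Date(c("2021-12-31", "2022-06-30", "2022-12-31"))

# Premium earned per day of coverage
daily <- premium / as.numeric(end - start)

# days_left[i, j]: days of policy i still to run at analysis date j
days_left <- sapply(as.numeric(dates), function(d) {
  pmax(0, as.numeric(end) - pmax(as.numeric(start), d))
})

# Row i is scaled by daily[i]; one column per analysis date
premium_left <- days_left * daily
colnames(premium_left) <- paste0("test", seq_along(dates))

wide <- data.frame(start, end, premium, premium_left)
print(wide)
```

With 50+ analysis dates this yields 50+ columns, which is why the long format above is usually easier to work with afterwards.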

In R: Create a data frame which includes consecutive observations only and a variable indicating sequence number

I have a data frame with column names similar to below, in which each row is one observation:
user_id; date; Var_1; Var_2
Each user_id can have 0 or 1 observations for each date. Only dates with an observation are included in the data frame for each user.
From this data, I want to create a data frame containing these variables (and the ones described below) but it should only contain observations for 3 successive dates (for each user). Days in each succession should be numbered 1 to 3 and each succession should be numbered as well.
For instance, if the user with user_id == 1 has observations on the following dates: 2020-01-01, 2020-01-03, 2020-01-04, 2020-01-05, 2020-01-06, 2020-01-10, 2020-01-12, 2020-01-13, 2020-01-14.
And user with user_id == 2 has observations on the following dates: 2020-01-01, 2020-01-03, 2020-01-04, 2020-01-06, 2020-01-10, 2020-01-12, 2020-01-15, 2020-01-16, 2020-01-17
Then the new data frame should include:
user_id      date      Sequence      Day      Var_1      Var_2      
1            20-01-03      1                  1            value            value      
1            20-01-04      1                  2            value            value      
1            20-01-05      1                  3            value            value      
1            20-01-04      2                  1            value            value      
1            20-01-05      2                  2            value            value      
1            20-01-06      2                  3            value            value      
1            20-01-12      3                  1            value            value      
1            20-01-13      3                  2            value            value      
1            20-01-14      3                  3            value            value      
2            20-01-15      1                  1            value            value      
2            20-01-16      1                  2            value            value      
2            20-01-17      1                  3            value            value      
(where value is the value of the observation for the variable)
Thank you for your help with this tricky problem !
Best wishes,
Eric
Here is something you can try working with. I'm sure there are better ways than this, but it seems to work.
Filter by identifying rows that are the beginning of 3-day sequences. To do that, calculate difference diff between dates, and identify dates where the subsequent two rows have a diff of one day.
Knowing the start dates of sequences, you can enumerate these as sequence. Then, use map to expand to 3 day sequences based on these starting dates. After that, you can enumerate again for day after grouping by both user_id and sequence.
Finally, join the result back to your original data to get your Var_1, Var_2, etc.
library(dplyr)
library(tidyr)
library(purrr) # for map()
df %>%
select(user_id, date) %>%
group_by(user_id) %>%
mutate(diff = c(0, diff(date))) %>%
filter((lead(diff, 1L) == 1 & lead(diff, 2L) == 1)) %>%
mutate(sequence = row_number(),
date = map(date, seq.Date, length = 3, by = "1 day")) %>%
unnest(cols = date) %>%
group_by(user_id, sequence) %>%
mutate(day = row_number()) %>%
inner_join(df, by = c("user_id", "date")) %>%
select(-diff)
Output
user_id date sequence day Var_1 Var_2
<dbl> <date> <int> <int> <int> <int>
1 1 2020-01-03 1 1 2 17
2 1 2020-01-04 1 2 3 16
3 1 2020-01-05 1 3 4 15
4 1 2020-01-04 2 1 3 16
5 1 2020-01-05 2 2 4 15
6 1 2020-01-06 2 3 5 14
7 1 2020-01-12 3 1 7 12
8 1 2020-01-13 3 2 8 11
9 1 2020-01-14 3 3 9 10
10 2 2020-01-15 1 1 16 3
11 2 2020-01-16 1 2 17 2
12 2 2020-01-17 1 3 18 1
Data
df <- structure(list(user_id = c(1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2,
2, 2, 2, 2, 2, 2), date = structure(c(18262, 18264, 18265, 18266,
18267, 18271, 18273, 18274, 18275, 18262, 18264, 18265, 18267,
18271, 18273, 18276, 18277, 18278), class = "Date"), Var_1 = 1:18,
Var_2 = 18:1), class = "data.frame", row.names = c(NA, -18L
))
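As a cross-check of which dates start a 3-day run, here is a small base R sketch for user 1. It uses a different test than the lead(diff) filter above: a date starts a run if the next two calendar days are also present. The names d and starts are made up for illustration.

```r
# Observation dates of user_id == 1 from the example
dates <- as.Date("2020-01-01") + c(0, 2, 3, 4, 5, 9, 11, 12, 13)
d <- as.numeric(dates)

# A date starts a sequence if both of the following two days were observed
starts <- dates[(d + 1) %in% d & (d + 2) %in% d]
print(starts)

# Expand each start into its three days, numbered by sequence and day
expanded <- data.frame(
  date     = rep(starts, each = 3) + rep(0:2, times = length(starts)),
  sequence = rep(seq_along(starts), each = 3),
  day      = rep(1:3, times = length(starts))
)
print(expanded)
```

This relies on each user having at most one observation per date, as stated in the question.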

30 sec rows to 1 min rows

I have a dataset with time stamps every 30 seconds and a binary value alongside each time stamp, with 0 meaning active and 1 meaning inactive. I want to combine each pair of 30-second intervals into one one-minute interval: if there is a 0 in either of the two, the new interval is marked 0 (active), and if there are two 1s, it is marked inactive. I could probably do it with a macro in Excel, but I think it would be easier in R.
11:00.20 1
11:00.50 0
11:01.20 1
11:01.50 1
Here's a way using dplyr -
library(dplyr)
df %>%
mutate(group = rep(1:n(), each = 2, length.out = n())) %>%
group_by(group) %>%
summarise(
timestamp = first(timestamp),
value = case_when(
sum(value) == 1 ~ "0",
sum(value) == 2 ~ "inactive",
TRUE ~ "active"
)
)
# A tibble: 2 x 3
group timestamp value
<int> <chr> <chr>
1 1 11:00.20 0
2 2 11:01.20 inactive
Data -
df <- read.table(text = "11:00.20 1
11:00.50 0
11:01.20 1
11:01.50 1", header = F, stringsAsFactors = F, col.names = c("timestamp", "value"))
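The pairing index is the key step here, so this is a small base R sketch of it on the four example values. It assumes the rows are already in time order and that consecutive rows belong together; group and minute_value are illustrative names.

```r
value <- c(1, 0, 1, 1)   # 0 = active, 1 = inactive
n <- length(value)

# 1 1 2 2 ...: every two consecutive rows share a group id
group <- rep(seq_len(ceiling(n / 2)), each = 2, length.out = n)

# The minimum is 0 as soon as either half-minute was active
minute_value <- as.vector(tapply(value, group, min))
print(minute_value)
```

Taking the minimum keeps the original 0/1 coding instead of mixing numbers and labels in one column.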
Your data:
df <- tibble(
time = c("11:00.20", "11:00.50",
"11:01.20", "11:01.50"),
active = c(1, 0, 1, 1))
I have tried lubridate:
library(lubridate)
library(tidyverse)
df %>%
mutate(
date = hms(time),
) %>%
group_by(hour= hour(date), minute=minute(date)) %>%
summarize(
active = min(active)
)
and got the following tibble:
# A tibble: 2 x 3
# Groups: hour [1]
hour minute active
<dbl> <dbl> <dbl>
1 11 0 0
2 11 1 1

Multiplication of several columns

I want to multiply some counts in specific locations (0-100, 0-12, etc., each in its own column) by the number of days a count is present (days).
Here is an example of my data:
df <- structure(list(month = c("Apr", "Apr", "Aug", "Aug", "Aug", "Sep"
), Year = c(2018, 2018, 2018, 2018, 2018, 2018), First =
structure(c(17995,
17998, 17750, 17758, 17770, 17778), class = "Date"), Last =
structure(c(17999,
17998, 17750, 17761, 17771, 17778), class = "Date"), days = c(5,
1, 1, 4, 2, 1), `0-100` = c(1, 0, 1, 1, 1, 1), `0-12` = c(0,
0, 1, 1, 1, 1), `0-25` = c(1, 1, 1, 1, 1, 1), `0-50` = c(1, 0,
1, 1, 1, 1)), row.names = c(NA, -6L), class = c("tbl_df", "tbl",
"data.frame"))
So i was thinking something along the lines of:
df2 <- df %>%
mutate("0-100b" = days * "0-100", "0-12b" = days * "0-12", "0-25b" = days * "0-25", "0-50b" = days * "0-25")
One, this doesn't seem to work, and two, there must be a more concise way than writing out each multiplication. If I had many more columns this would get tedious.
ok edit for col names:
colnames(df) <- c("month", "Year", "First", "Last" , "days", "V", "I",
"II", "III")
df2 <- df %>%
mutate(Vb = days * V, Ib = days * I, IIb = days *
II, IIIb = days * III)
Like I said above, you can select improperly named columns by wrapping them in backticks. One of the places that naming rules are laid out is in the docs of the base function make.names.
The easiest solution to having improper names is to just create data with valid names to begin with...but in practice, that isn't always possible. There are several ways to change the names into valid ones. The aforementioned make.names does this from a character vector.
If you're working in a larger piped workflow, you can use rename_all with a few string manipulation functions to 1) convert to lowercase, 2) replace - with _, and 3) prepend an x before any leading digits. You can also use janitor::clean_names, which cleans all the names in a data frame.
library(dplyr)
df %>%
rename_all(~tolower(.) %>%
stringr::str_replace_all(., "\\-", "_") %>%
stringr::str_replace("^\\b(?=\\d)", "x"))
# omitted: same names as below
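To make the backtick point concrete, here is a short base R sketch on a cut-down version of the data (only days and one count column; the object names valid and b are made up). In the original attempt, "0-100" in double quotes is a character string, so days * "0-100" fails; backticks refer to the column instead.

```r
# check.names = FALSE keeps the non-syntactic column name as it is
df <- data.frame(days = c(5, 1, 1), `0-100` = c(1, 0, 1), check.names = FALSE)

# What make.names() would turn the names into
valid <- make.names(names(df))
print(valid)

# Backticks let you use the column under its original name
b <- with(df, days * `0-100`)
print(b)
```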
With clean names, you can use mutate_at, select the columns, and pass it a function to multiply by days. If you use a named list, the name is appended to create new columns, instead of replacing them.
df %>%
janitor::clean_names() %>%
mutate_at(vars(x0_100:x0_50), list(b = ~. * days))
#> # A tibble: 6 x 13
#> month year first last days x0_100 x0_12 x0_25 x0_50 x0_100_b
#> <chr> <dbl> <date> <date> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 Apr 2018 2019-04-09 2019-04-13 5 1 0 1 1 5
#> 2 Apr 2018 2019-04-12 2019-04-12 1 0 0 1 0 0
#> 3 Aug 2018 2018-08-07 2018-08-07 1 1 1 1 1 1
#> 4 Aug 2018 2018-08-15 2018-08-18 4 1 1 1 1 4
#> 5 Aug 2018 2018-08-27 2018-08-28 2 1 1 1 1 2
#> 6 Sep 2018 2018-09-04 2018-09-04 1 1 1 1 1 1
#> # … with 3 more variables: x0_12_b <dbl>, x0_25_b <dbl>, x0_50_b <dbl>
In this case, it might also make sense to select columns by regex:
df %>%
janitor::clean_names() %>%
mutate_at(vars(matches("^x\\d")), list(b = ~. * days))
# same output as above
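In dplyr 1.0.0 and later, mutate_at() is superseded by across(). The following is a minimal sketch of the same multiplication with across(), assuming a recent dplyr and keeping the original column names; the small tibble and the name out are only for illustration.

```r
library(dplyr)

df <- tibble(days = c(5, 1), `0-100` = c(1, 0), `0-12` = c(0, 0))

out <- df %>%
  # multiply every column whose name starts with "0-" by days,
  # writing the results to new columns with a "b" suffix
  mutate(across(matches("^0-"), ~ .x * days, .names = "{.col}b"))

print(out)
```

The .names argument controls how the new columns are called, so the original columns are kept rather than overwritten.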

R - Yelp data Business category column has multiple categories per business. Want to separate into category specific columns with values of 1 and 0

Thank you in advance to anyone who tries to help with this.
I'm using the Yelp data set and the question I want to answer is "which categories are positively correlated with higher stars for X category (Bars for example)"
The issue I'm encountering is that for each business the categories are lumped together into one column and row per business_id. So I need a means to separate out each category, turn them into columns and then check if the original category column contains the category that the column was created for.
My current train of thought is to use group_by with business_id and then unnest_tokens the column, then model.matrix() that column into the split I want and then join it onto the df I'm using. But I can't get model.matrix to pass and keep business_id connected to each row.
# an example of what I am using #
df <-
data_frame(business_id = c("bus_1",
"bus_2",
"bus_3"),
categories=c("Pizza, Burgers, Caterers",
"Pizza, Restaurants, Bars",
"American, Barbeque, Restaurants"))
# what I want it to look like #
desired_df <-
data_frame(business_id = c("bus_1",
"bus_2",
"bus_3"),
categories=c("Pizza, Burgers, Caterers",
"Pizza, Restaurants, Bars",
"American, Barbeque, Restaurants"),
Pizza = c(1, 1, 0),
Burgers = c(1, 0, 0),
Caterers = c(1, 0, 0),
Restaurants = c(0, 1, 1),
Bars = c(0, 1, 0),
American = c(0, 0, 1),
Barbeque = c(0, 0, 1))
# where I am stuck #
df %>%
select(business_id, categories) %>%
group_by(business_id) %>%
unnest_tokens(categories, categories, token = 'regex', pattern=", ") %>%
model.matrix(business_id ~ categories, data = .) %>%
as_data_frame
Edit: After this post and the answers below I encountered a duplicate identifiers error using spread(). That brought me to this thread, https://github.com/tidyverse/tidyr/issues/426, where the answer to my problem was posted. I've repasted it below.
# duplicating the error with a smaller data.frame #
library(tidyverse)
df <- structure(list(age = c("21", "17", "32", "29", "15"),
gender = structure(c(2L, 1L, 1L, 2L, 2L), .Label = c("Female", "Male"), class = "factor")),
row.names = c(NA, -5L), class = c("tbl_df", "tbl", "data.frame"), .Names = c("age", "gender"))
df
#> # A tibble: 5 x 2
#> age gender
#> <chr> <fct>
#> 1 21 Male
#> 2 17 Female
#> 3 32 Female
#> 4 29 Male
#> 5 15 Male
df %>%
spread(key=gender, value=age)
#> Error: Duplicate identifiers for rows (2, 3), (1, 4, 5)
# fixing the problem #
df %>%
group_by_at(vars(-age)) %>% # group by everything other than the value column.
mutate(row_id=1:n()) %>% ungroup() %>% # build group index
spread(key=gender, value=age) %>% # spread
select(-row_id) # drop the index
#> # A tibble: 3 x 2
#> Female Male
#> <chr> <chr>
#> 1 17 21
#> 2 32 29
#> 3 NA 15
Building from your nice use of tidytext::unnest_tokens(), you can also use this alternative solution
library(dplyr)
library(tidyr)
library(tidytext)
df %>%
select(business_id, categories) %>%
group_by(business_id) %>%
unnest_tokens(categories, categories, token = 'regex', pattern=", ") %>%
mutate(value = 1) %>%
spread(categories, value, fill = 0)
# business_id american barbeque bars burgers caterers pizza restaurants
# <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# bus_1 0 0 0 1 1 1 0
# bus_2 0 0 1 0 0 1 1
# bus_3 1 1 0 0 0 0 1
Here is a simple tidyverse solution:
library(tidyverse)
df %>%
mutate(
ind = 1,
tmp = strsplit(categories, ", ")
) %>%
unnest(tmp) %>%
spread(tmp, ind, fill = 0)
## A tibble: 3 x 9
# business_id categories American Barbeque Bars Burgers Caterers Pizza Restaurants
# <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#1 bus_1 Pizza, Burgers, Caterers 0 0 0 1 1 1 0
#2 bus_2 Pizza, Restaurants, Bars 0 0 1 0 0 1 1
#3 bus_3 American, Barbeque, Restaurants 1 1 0 0 0 0 1
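For comparison, the same indicator columns can be built without any packages. This is a rough base R sketch on the three example businesses; tokens, lvls and indicators are illustrative names, and it assumes the categories are always separated by a comma followed by a space.

```r
business_id <- c("bus_1", "bus_2", "bus_3")
categories  <- c("Pizza, Burgers, Caterers",
                 "Pizza, Restaurants, Bars",
                 "American, Barbeque, Restaurants")

# One character vector of categories per business
tokens <- strsplit(categories, ", ", fixed = TRUE)

# All distinct categories, used as column names
lvls <- sort(unique(unlist(tokens)))

# 1 if the business has the category, 0 otherwise (one row per business)
indicators <- t(sapply(tokens, function(x) as.integer(lvls %in% x)))
dimnames(indicators) <- list(business_id, lvls)

print(indicators)
```

The matrix can be attached to the original data with cbind() if the columns are needed in the same data frame.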
