How to create a dataset using columns from another table in R?

This may be complicated and hard to explain. Let's say I have a dataframe that has 4 columns: date, id, response_1, and response_2. The id column identifies subjects (an id appears at most once per date), response_1 contains values of 1 and 0, and response_2 is derived from response_1 for each id. While an id only has 0s in response_1, response_2 is 0, but once the id gets a 1 in response_1, response_2 stays 1 from then on, regardless of later response_1 values (please see ids 1 and 3).
sample <- data.frame(
  date = c("2020-04-17", "2020-04-17", "2020-04-17",
           "2020-05-13", "2020-05-13", "2020-05-13",
           "2020-06-12", "2020-06-12", "2020-06-12",
           "2020-06-19", "2020-06-19"),
  id = c(1, 2, 3, 1, 2, 3, 1, 3, 4, 5, 1),
  response_1 = c(0, 1, 0, 1, 0, 1, 0, 0, 0, 1, 1),
  response_2 = c(0, 1, 0, 1, 1, 1, 1, 1, 0, 1, 1)
)
date id response_1 response_2
1 2020-04-17 1 0 0
2 2020-04-17 2 1 1
3 2020-04-17 3 0 0
4 2020-05-13 1 1 1
5 2020-05-13 2 0 1
6 2020-05-13 3 1 1
7 2020-06-12 1 0 1
8 2020-06-12 3 0 1
9 2020-06-12 4 0 0
10 2020-06-19 5 1 1
11 2020-06-19 1 1 1
What I want to calculate from this dataset is, for each day, how many unique ids we have seen since the beginning of the dataset and how many of them have turned into 1. For instance, by June 12 we had seen a total of 4 unique ids in the whole dataset (1, 2, 3, and 4), and 3 of them had turned into 1 (ids 1, 2, and 3); id 4 was still 0.
Like this:
result <- data.frame(
  date = c("04-17-2020", "05-13-2020", "06-12-2020", "06-19-2020"),
  count_id = c(3, 3, 4, 5),
  total = c(1, 3, 3, 4)
)
date count_id total
1 04-17-2020 3 1
2 05-13-2020 3 3
3 06-12-2020 4 3
4 06-19-2020 5 4
What will be the best way to accomplish this in R?

You can use duplicated with cumsum to get a running count of unique ids, and take the cumsum of response_1, counting each id's first 1 only once. For each date we then select the last row to get the final counts.
library(dplyr)
sample %>%
  group_by(id) %>%
  # response_11 is 1 only at each id's first response_1 == 1
  mutate(response_11 = response_1 * as.integer(!duplicated(response_1))) %>%
  ungroup() %>%
  mutate(count_id = cumsum(!duplicated(id)),  # running count of new ids
         total = cumsum(response_11)) %>%     # running count of ids that turned 1
  group_by(date) %>%
  slice(n()) %>%                              # last row per date = totals as of that date
  select(date, count_id, total)
# date count_id total
# <chr> <int> <dbl>
#1 2020-04-17 3 1
#2 2020-05-13 3 3
#3 2020-06-12 4 3
#4 2020-06-19 5 4
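As a side note, the response_2 column in the question is just a running maximum of response_1 within each id (once it hits 1, it stays 1), so it can be derived rather than typed in by hand. A minimal sketch, assuming rows are already sorted by date (response_2_check is my hypothetical name):

sample %>%
  group_by(id) %>%
  mutate(response_2_check = cummax(response_1)) %>%  # 0 until the first 1, then 1 forever
  ungroup()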

Related

Filtering every positive value for every negative in R

I have a dataset with financial data. Sometimes, a product gets refunded, resulting in a negative count of the product (so the money gets returned). I want to conditionally filter these rows out of the dataset.
Example:
library(tidyverse)
set.seed(1)
df <- tibble(
  count = sample(c(-1, 1), 80, replace = TRUE, prob = c(.2, .8)),
  id = rep(1:4, 20)
)
df %>%
  group_by(id) %>%
  summarize(total = sum(count))
# A tibble: 4 x 2
id total
<int> <dbl>
1 1 10
2 2 14
3 3 16
4 4 10
id = 1 has 15 positive counts and 5 negative ones (15 - 5 = 10), so I want to keep 10 rows with id = 1, all with positive values.
id = 2 has 17 positive counts and 3 negative ones (17 - 3 = 14), so I want to keep 14 rows with id = 2, all with positive values.
In the end, this condition should be TRUE: nrow(df) == sum(df$count)
Unfortunately, a filtering join such as anti_join() will remove all the rows. For some reason I cannot think of another option to filter the tibble.
Thanks for helping me!
You can "uncount" using the total column to get the number of repeats of each row.
df %>%
  group_by(id) %>%
  summarize(total = sum(count)) %>%
  uncount(total) %>%
  mutate(count = 1)
#> # A tibble: 50 x 2
#> id count
#> <int> <dbl>
#> 1 1 1
#> 2 1 1
#> 3 1 1
#> 4 1 1
#> 5 1 1
#> 6 1 1
#> 7 1 1
#> 8 1 1
#> 9 1 1
#> 10 1 1
#> # ... with 40 more rows
Created on 2022-10-21 with reprex v2.0.2
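If you would rather keep the original rows of df instead of rebuilding them, here is a sketch of one alternative, assuming every count is exactly -1 or 1 so that sum(count) per id equals the number of positive rows to keep:

df %>%
  group_by(id) %>%
  mutate(keep_n = sum(count)) %>%      # net count: positives minus refunds
  filter(count > 0) %>%                # drop the refund rows themselves
  filter(row_number() <= keep_n) %>%   # drop one positive row per refund
  ungroup() %>%
  select(-keep_n)

The result again satisfies nrow(result) == sum(df$count).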

Count observations over rolling 30 days windows with restrictions

My idea is to count observations (grouped by id) within rolling 30-day windows. My problem is that I want to introduce an exception in the counting process: if the 30-day window contains an observation that will be discarded (because n > 1), the count should be built only from the observations that are not discarded. (n is the variable that counts the number of observations within the 30-day window.)
Example
id date
1 1/1/2021
1 22/1/2021
1 1/2/2021
Code:
library(dplyr)
library(lubridate)

# assumes `date` has already been parsed to Date class
test <- test %>%
  group_by(id) %>%
  mutate(n = sapply(seq_along(date),
                    function(x) sum(between(date[1:x], date[x] - days(30), date[x]))))
id date n
1 1/1/2021 1
1 22/1/2021 2
1 1/2/2021 2
1 3/3/2021 2
1 2/2/2021 3
1 7/7/2021 1
Expected result:
id date n nexpected
1 1/1/2021 1 1
1 22/1/2021 2 2
1 1/2/2021 2 1
1 3/3/2021 2 2
1 2/2/2021 3 1
1 7/7/2021 1 1
Alternative explanation
I just want to keep an observation (grouped by ID) for every 30 days. I want to do this by creating a variable that tells me which observations are left inside (1) and which ones are outside (0) of the filter.
Not sure this is what you want, but lubridate::floor_date is often useful in those situations:
library(tidyverse)
library(lubridate)
test %>%
  mutate(date = dmy(date)) %>%
  group_by(id, floor = floor_date(date, 'month')) %>%
  mutate(n = row_number())
id date floor n
<int> <date> <date> <int>
1 1 2021-01-01 2021-01-01 1
2 1 2021-01-22 2021-01-01 2
3 1 2021-02-01 2021-02-01 1
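Note that floor_date groups by calendar month rather than by rolling 30-day windows. If you want the sequential rule from the alternative explanation (keep an observation only when it falls at least 30 days after the last kept observation for that id), here is a minimal sketch; the >= 30 boundary is an assumption, since the post does not pin down the edge case:

library(dplyr)
library(lubridate)

# returns 1 for rows to keep, 0 for rows discarded by the 30-day rule
keep_flag <- function(d, gap = 30) {
  keep <- logical(length(d))
  last_kept <- as.Date(NA)
  for (i in seq_along(d)) {
    if (is.na(last_kept) || as.numeric(d[i] - last_kept) >= gap) {
      keep[i] <- TRUE
      last_kept <- d[i]
    }
  }
  as.integer(keep)
}

test %>%
  mutate(date = dmy(date)) %>%
  arrange(id, date) %>%
  group_by(id) %>%
  mutate(keep = keep_flag(date)) %>%
  ungroup()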

Find duplicate rows based on 2 columns and keep rows based on the value of a 3rd column in R

I have a dataset with ID numbers, dates, and test results, and need to create a final dataset where each row consists of a unique ID, date, and test result value. How can I find duplicates based on ID and date, and then keep rows based on a specific test result value?
df <- data.frame(
  id_number = c(1, 1, 2, 2, 3, 3, 3, 4),
  date = c('2021-11-03', '2021-11-19', '2021-11-11', '2021-11-11',
           '2021-11-05', '2021-11-05', '2021-11-16', '2021-11-29'),
  result = c(0, 1, 0, 0, 0, 9, 0, 9)
)
id_number date result
<dbl> <chr> <dbl>
1 1 2021-11-03 0
2 1 2021-11-19 1
3 2 2021-11-11 0
4 2 2021-11-11 0
5 3 2021-11-05 0
6 3 2021-11-05 9
7 3 2021-11-16 0
8 4 2021-11-29 9
df <- unique(df)
After using the unique function, I am still left with rows that have duplicate id_number and date, and different test results. Of these, I need to keep only the row that equals 0 or 1, and exclude any 9s.
In the example below, I'd want to keep row 4 and exclude row 5. I can't simply exclude rows where result = 9 because I want to keep those for any non-duplicate observations.
id_number date result
<dbl> <chr> <dbl>
1 1 2021-11-03 0
2 1 2021-11-19 1
3 2 2021-11-11 0
4 3 2021-11-05 0
5 3 2021-11-05 9
6 3 2021-11-16 0
7 4 2021-11-29 9
You can do:
library(tidyverse)
df %>%
  group_by(id_number, date) %>%
  filter(!(result == 9 & row_number() > 1)) %>%
  ungroup()
# A tibble: 6 x 3
id_number date result
<dbl> <chr> <dbl>
1 1 2021-11-03 0
2 1 2021-11-19 1
3 2 2021-11-11 0
4 3 2021-11-05 0
5 3 2021-11-16 0
6 4 2021-11-29 9
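Note that the filter above relies on the 9 not being the first row within its id_number/date pair. An order-independent variation (my sketch, not part of the original answer) sorts the 9s last within each pair and keeps the first row:

df %>%
  arrange(id_number, date, result == 9) %>%  # FALSE sorts first, so non-9 rows come first
  group_by(id_number, date) %>%
  slice(1) %>%                               # keep one row per id_number/date pair
  ungroup()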
For simplicity of understanding, use:
a) keep the rows different than 9:
df <- subset(df, df$result != 9)
And then
b) remove the duplicates:
df <- subset(df, duplicated(df) == FALSE)
If you want to deduplicate on specific columns only:
df <- subset(df, duplicated(df$result) == FALSE)
Or:
df <- subset(df, duplicated(df[, 2:3]) == FALSE)

How can I find the column index of the first non-zero value in a row with R dplyr?

I'm working in R. I have a dataset of COVID case totals that looks like this:
Facility Day_1 Day_2 Day_3
A        0     0     1
B        1     2     5
C        0     2     6
D        0     0     0
I would like to use mutate() to create a new column, first_case, that has the column index of the first non-zero element in each row -- or "NA" if there is no non-zero element. I thought about using where(), but couldn't quite figure out how to get a column index instead of a row index.
Any help is much appreciated!
We can use max.col to get the first position where the value is non-zero in each row.
library(dplyr)
df %>%
  mutate(first_case = {
    tmp <- select(., starts_with('Day'))
    ifelse(rowSums(tmp) == 0, NA, max.col(tmp != 0, ties.method = 'first'))
  })
# Facility Day_1 Day_2 Day_3 first_case
#1 A 0 0 1 3
#2 B 1 2 5 1
#3 C 0 2 6 2
#4 D 0 0 0 NA
first_case holds the column number among the 'Day' columns; if you need the column number within the whole data frame, add 1 to the output above.
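For comparison, a base-R sketch of the same idea (my addition, assuming the day columns are exactly those whose names start with 'Day'):

day_cols <- as.matrix(df[, startsWith(names(df), "Day")])
df$first_case <- apply(day_cols, 1, function(r) {
  hit <- which(r != 0)                       # positions of the non-zero entries
  if (length(hit)) hit[1] else NA_integer_   # first one, or NA if the row is all zero
})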
This is probably unnecessarily complex, because the data is not in the long ('tidy') format that dplyr and friends expect.
library(tidyr)

datlong <- dat %>%
  pivot_longer(cols = starts_with("Day"), names_to = "day", names_pattern = "_(\\d+)")
## A tibble: 12 x 3
# Facility day value
# <chr> <chr> <int>
# 1 A 1 0
# 2 A 2 0
# 3 A 3 1
# 4 B 1 1
# 5 B 2 2
# 6 B 3 5
# 7 C 1 0
# 8 C 2 2
# 9 C 3 6
#10 D 1 0
#11 D 2 0
#12 D 3 0
It's then simple to get the first/second/third/[n]th day above whatever value, as well as to calculate minimums, maximums, means, weekly averages, rolling averages, and so on, because you are now dealing with a plain vector of values rather than values spread across multiple columns.
datlong %>%
  group_by(Facility) %>%
  filter(value > 0, .preserve = TRUE) %>%
  summarise(first_day = first(day))
#`summarise()` ungrouping output (override with `.groups` argument)
## A tibble: 4 x 2
# Facility first_day
# <chr> <chr>
#1 A 3
#2 B 1
#3 C 2
#4 D <NA>
Alternative using indexes and stuff, which is less dplyr-like:
datlong %>%
  group_by(Facility) %>%
  summarise(first_day = day[value > 0][1])

Subsetting panel data based on two variables in R

library(dplyr)

id <- c(rep(1, 4), rep(2, 3), rep(3, 4))
missing <- c(rep(0, 4), rep(0, 3), 1, 0, 0, 0)
wave <- c(1:4, 1, 2, 3, 1:4)
df <- as.data.frame(cbind(id, missing, wave))
df
id missing wave
1 1 0 1
2 1 0 2
3 1 0 3
4 1 0 4
5 2 0 1
6 2 0 2
7 2 0 3
8 3 1 1
9 3 0 2
10 3 0 3
11 3 0 4
I am trying to delete cases if they have missing=1 or if they are missing a wave (1:4). For example, ID=3 should be dropped because at wave=1 they have missing=1 and ID=2 should be dropped because they only have values of 1, 2, and 3 in Wave.
I tried to use dplyr's group_by and filter functions but this removes all cases. I want to only end up with cases for ID=1.
df <- df %>% group_by(id) %>% filter(missing==0, wave==1, wave==2, wave==3, wave==4)
df
Try this. We first group_by id, and then create a list column with the sorted unique values of wave for each id. Then we check that this list equals 1:4. We also create a missing_check variable, which is just the max of missing for each id. We filter on both missing_check and wave_list_check.
df %>%
  group_by(id) %>%
  mutate(wave_list = I(list(sort(unique(wave))))) %>%
  mutate(wave_list_check = all(unlist(wave_list) == 1:4),
         missing_check = max(missing)) %>%
  filter(missing_check == 0, wave_list_check) %>%
  select(id:wave)
id missing wave
<dbl> <dbl> <dbl>
1 1 0 1
2 1 0 2
3 1 0 3
4 1 0 4
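A shorter sketch of the same two checks (my variation, assuming the complete set of waves should be exactly 1:4):

df %>%
  group_by(id) %>%
  filter(all(missing == 0),         # drop ids flagged missing in any wave
         setequal(wave, 1:4)) %>%   # keep ids observed in all four waves
  ungroup()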
