I have the two dataframes mentioned below in R, and I have tried various methods but couldn't achieve the required output yet.
DF:
ID Date city code uid
I-1 2020-01-01 10:12:15 New York 123 K-1
I-1 2020-01-01 10:12:15 Utha 103 K-1
I-2 2020-01-02 10:12:15 Washington 122 K-1
I-3 2020-02-01 10:12:15 Tokyo 123 K-2
I-3 2020-02-01 10:12:15 Osaka 193 K-2
I-4 2020-02-02 10:12:15 London 144 K-3
I-5 2020-02-04 10:12:15 Dubai 101 K-4
I-6 2019-11-01 10:12:15 Dubai 101 K-4
I-7 2019-11-01 10:12:15 London 144 K-3
I-8 2018-12-13 10:12:15 Tokyo 143 K-5
I-9 2019-05-17 10:12:15 Dubai 101 K-4
I-19 2020-03-11 10:12:15 Dubai 150 K-7
Dput:
structure(list(ID = c("I-1", "I-1",
"I-2", "I-3", "I-3", "I-4",
"I-5", "I-6", "I-7", "I-8", "I-9","I-19"
), DATE = c("2020-01-01 11:49:40.842", "2020-01-01 09:35:33.607",
"2020-01-02 06:14:58.731", "2020-02-01 16:51:27.190", "2020-02-01 05:35:46.952",
"2020-02-02 05:48:49.443", "2020-02-04 10:00:41.616", "2019-11-01 09:10:46.536",
"2019-11-01 11:54:05.655", "2018-12-13 14:24:31.617", "2019-05-17 14:24:31.617", "2020-03-11 14:24:31.617"), CITY = c("New York",
"UTAH", "Washington", "Tokyo",
"Osaka", "London", "Dubai",
"Dubai", "London", "Tokyo", "Dubai",
"Dubai"), CODE = c("221010",
"411017", "638007", "583101", "560029", "643102", "363001", "452001",
"560024", "509208"), UID = c("K-1",
"K-1", "K-1", "K-2", "K-2",
"K-3", "K-4", "K-4", "K-3",
"K-5","K-4","K-7")), .Names = c("ID", "DATE",
"CITY", "CODE", "UID"), row.names = c(NA,
10L), class = "data.fram)
Using the two dataframes mentioned above, I want to fetch the records between 1st Jan 2020 and 29th Feb 2020, compare those IDs against the entire database to check whether city and code together match any other ID, and categorize the matches further by how many have the same uid and how many have a different one.
Where,
Match - the combination of city and code matches another ID in the database
Same_uid - classification of Match IDs to identify how many IDs have the same uid
different_uid - classification of Match IDs to identify how many IDs don't have the same uid
uid_count - count of that particular ID's uid in the entire database
Note - I have more than 10M records in the dataframe.
Required Output
ID Date city code uid Match Same_uid different_uid uid_count
I-1 2020-01-01 10:12:15 New York 123 K-1 No 0 0 2
I-2 2020-01-02 10:12:15 Washington 122 K-1 No 0 0 2
I-3 2020-02-01 10:12:15 Tokyo 123 K-2 No 0 0 1
I-4 2020-02-02 10:12:15 London 144 K-3 Yes 1 0 2
I-5 2020-02-04 10:12:15 Dubai 101 K-4 Yes 2 0 3
An approach:
Load in the dataset
library(tidyverse)
library(lubridate)
mydata <- tibble(
ID = c("I-1","I-1",
"I-2","I-3",
"I-3","I-4",
"I-5","I-6",
"I-7","I-8",
"I-9","I-19"),
Date = c("2020-01-01", "2020-01-01",
"2020-01-02", "2020-02-01",
"2020-02-01", "2020-02-02",
"2020-02-04", "2019-11-01",
"2019-11-01", "2018-12-13",
"2019-05-17", "2020-03-11"),
city = c("New York", "Utha",
"Washington", "Tokyo",
"Osaka", "London",
"Dubai", "Dubai",
"London", "Tokyo",
"Dubai", "Dubai"),
code = c("123", "103", "122", "123", "193, "144",
"101", "101", "144", "143", "101", "150"),
uid = c("K-1", "K-1", "K-1", "K-2", "K-2", "K-3",
"K-4", "K-4", "K-3", "K-5", "K-4", "K-7"))
mydata <- mydata %>%
mutate(Date = ymd(str_remove(Date, " .*")),
code = as.character(code))
Where clause number 1
I use count from dplyr to count the codes by cities. Then case_when to further identify with a "Yes" or "No" as requested.
# This counts city and code, and fulfills your "Match" column requirement
startdate <- "2017-01-01"
enddate <- "2020-03-29"
mydata %>%
filter(Date >= startdate,
Date <= enddate) %>%
count(city, code, name = "count_samecode") %>%
mutate(Match = case_when(
count_samecode > 1 ~ "Yes",
T ~ "No")) %>%
head()
# # A tibble: 6 x 4
# city code count_samecode Match
# <chr> <chr> <int> <chr>
# 1 Dubai 101 3 Yes
# 2 Dubai 150 1 No
# 3 London 144 2 Yes
# 4 New York 123 1 No
# 5 Osaka 193 1 No
# 6 Tokyo 123 1 No
Where clause number 2
I will do the same with UID
mydata %>%
filter(Date >= startdate,
Date <= enddate ) %>%
count(city, uid, name = "UIDs_#_filtered") %>%
head()
# # A tibble: 6 x 3
# city uid `UIDs_#_filtered`
# <chr> <chr> <int>
# 1 Dubai K-4 3
# 2 Dubai K-7 1
# 3 London K-3 2
# 4 New York K-1 1
# 5 Osaka K-2 1
# 6 Tokyo K-2 1
Where clause number 3
I can repeat the count from clause number 2 to find how many distinct UIDs each of these cities has, where > 1 signals differing UIDs.
mydata %>%
filter(Date >= startdate,
Date <= enddate ) %>%
count(city, uid, name = "UIDs_#_filtered") %>%
count(city, name = "UIDs_#_different") %>%
head()
# # A tibble: 6 x 2
# city `UIDs_#_different`
# <chr> <int>
# 1 Dubai 2
# 2 London 1
# 3 New York 1
# 4 Osaka 1
# 5 Tokyo 2
# 6 Utha 1
Where clause number 4
Taking the same code from #2, I can remove the date filter to count over the entire dataset
mydata %>%
count(city, uid, name = "UIDs_#_all") %>%
head()
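For the sample data above, this should give:
# # A tibble: 6 x 3
# city uid `UIDs_#_all`
# <chr> <chr> <int>
# 1 Dubai K-4 3
# 2 Dubai K-7 1
# 3 London K-3 2
# 4 New York K-1 1
# 5 Osaka K-2 1
# 6 Tokyo K-2 1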
Putting it all together
Using several left_joins, we can get closer to your desired output.
EDIT: this now brings in the first instance of the ID from the first city/code combination.
check_duplicates_filterview.f <- function(df, startdate, enddate) {
  # df should be a tibble
  # startdate should be a string "yyyy-mm-dd"
  # enddate should be a string "yyyy-mm-dd"
  cityfilter <- df %>%
    filter(Date >= startdate, Date <= enddate) %>%
    distinct(city) %>%
    pull(1)
  df <- df %>%
    filter(city %in% cityfilter) %>%
    mutate(Date = ymd(str_remove(Date, " .*")),
           code = as.character(code))
  # count of each code in the entire DB
  entire.db.countcodes <- df %>%
    count(city, code)
  where.1 <- df %>%
    filter(Date >= startdate, Date <= enddate) %>%
    distinct(city, code, .keep_all = TRUE) %>%
    left_join(entire.db.countcodes) %>%
    rename("count_samecode" = n) %>%
    mutate(Match = case_when(
      count_samecode > 1 ~ "Yes",
      TRUE ~ "No"))
  where.2 <- df %>%
    filter(Date >= startdate, Date <= enddate) %>%
    count(city, uid, name = "UIDs_#_filtered")
  where.3 <- df %>%
    filter(Date >= startdate, Date <= enddate) %>%
    distinct(city, uid) %>%
    count(city, name = "UIDs_#_distinct")
  where.4 <- df %>%
    filter(city %in% cityfilter) %>%
    count(city, uid, name = "UIDs_#_all")
  first_half <- left_join(where.1, where.2)
  second_half <- left_join(where.4, where.3)
  full <- left_join(first_half, second_half)
  return(full)
}
# > check_duplicates_filterview.f(mydata, "2018-01-01", "2020-01-01")
# Joining, by = "city"
# Joining, by = "city"
# Joining, by = c("city", "uid")
# # A tibble: 5 x 8
# city code count_samecode Match uid `UIDs_#_filtered` `UIDs_#_all` `UIDs_#_distinct`
# <chr> <chr> <int> <chr> <chr> <int> <int> <int>
# 1 Dubai 101 2 Yes K-4 2 3 1
# 2 London 144 1 No K-3 1 2 1
# 3 New York 123 1 No K-1 1 1 1
# 4 Tokyo 143 1 No K-5 1 1 1
# 5 Utha 103 1 No K-1 1 1 1
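Since the question notes more than 10M records, the counting step may be worth doing in data.table instead; a minimal sketch of just the Match computation, assuming mydata with Date already parsed as above (an alternative, not part of the function above):
library(data.table)
dt <- as.data.table(mydata)
# count every city/code combination across the entire table in one pass
dt[, count_samecode := .N, by = .(city, code)]
dt[, Match := fifelse(count_samecode > 1, "Yes", "No")]
# then restrict to the requested window
dt[Date >= as.Date("2020-01-01") & Date <= as.Date("2020-02-29")]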
Related
I need to detect (among other things) the first occurrence of a non-"F" code in a patient's list, after the first "F" code occurrence. The code below seems to succeed at this; however, it has proven too inefficient on the server, running on a data set of one million observations.
The final data set should have a variable for the number of non-F codes (nhosp) and the first non-F code found after the first F-code appearance in the DAIGNOSTICO variable, with no duplicates of ID.
How can I improve both in terms of complexity and speed? Tidyverse pipe preferred.
This is how the result should look like:
# A tibble: 7 × 6
# Groups: ID [7]
ID DAIGNOSTICO data_entrada data_saida nhosp ficd
<dbl> <chr> <date> <date> <dbl> <chr>
1 1555 F180 1930-04-05 2005-03-15 1 T124
2 1234 F100 1980-04-01 2005-03-02 2 O155
3 16666 F120 1990-06-05 2005-03-18 0 <NA>
4 123456 F145 2001-03-07 2005-03-11 2 T123
5 177778 F155 2001-04-13 2005-03-22 2 G123
6 166666 F125 2002-03-12 2005-03-19 2 W345
7 12345 F150 2002-06-03 2005-03-07 4 K709
This is what my code currently looks like:
library(readr)
library(dplyr)
library(tidyr)
simulation <- read_csv("SIMULADO.txt", col_types = cols(
data_entrada = col_date("%d/%m/%Y"),
data_saida = col_date("%d/%m/%Y")
)
)
simulation <- as.data.frame(simulation)
simulation[, "nhosp"] <- 0
oldpos <- 1
for (i in 1:nrow(simulation)) {
if (grepl("F", simulation[i, "DAIGNOSTICO"], )) { # Has F?
oldpos <- i
clin <- 0
simulation[i, "hasF"] <- T
} else {
simulation[i, "hasF"] <-F
}
if (simulation[i, "ID"] == simulation[oldpos, "ID"]) { # same person?
if (simulation[oldpos, "hasF"] == T) { # did she/he have F?
simulation[i, "hasF"] <- T
if (simulation[i, "data_entrada"] > simulation[oldpos, "data_entrada"]) { # is it subsequent?
if (!grepl("F", simulation[i, "DAIGNOSTICO"])) { # not-F?
simulation[i,"hasC"] <- T
clin <- 1
simulation[i, "ficd"] <- simulation[i, "DAIGNOSTICO"]
simulation[i, "nhosp"] <- clin
first_cc <- simulation[i, "DAIGNOSTICO"]
}
}
}
}
}
dt1 <- simulation %>%
arrange(data_entrada) %>%
group_by(ID) %>%
select(ficd) %>%
drop_na() %>%
slice(1)
dt2 <- simulation %>%
arrange(data_entrada) %>%
group_by(ID) %>%
filter(hasF == T) %>%
mutate(nhosp = cumsum(nhosp),
nhosp = max(nhosp)) %>%
select(-ficd,-hasF, -hasC) %>%
distinct(ID, .keep_all = TRUE) %>%
full_join(dt1, by = "ID")
dt2
And this is an example data set, with some errors to check robustness of the code:
ID, DAIGNOSTICO, data_entrada, data_saida
123490, O100, 01/04/1980, 02/03/2005
123490, O100, 01/04/1981, 02/03/2005
123491, O101, 01/04/1980, 02/03/2005
123491, O101, 01/04/1981, 02/03/2005
1234, F100, 01/04/1980, 02/03/2005
1234, O155, 02/04/1980, 03/03/2005
1234, G123, 05/05/1982, 04/03/2005
12345, T124, 01/06/2002, 05/03/2005
12345, Y124, 02/06/2002, 06/03/2005
12345, F150, 03/06/2002, 07/03/2005
12345, K709, 04/06/2002, 08/03/2005
12345, Y709, 05/06/2002, 09/03/2005
12345, F150, 03/06/2002, 07/03/2005
12345, K710, 06/06/2002, 08/03/2005
12345, K711, 07/06/2002, 10/03/2005
12345, F150, 08/06/2002, 07/03/2005
123456, F145, 07/03/2001, 11/03/2005
123456, T123, 08/03/2001, 12/03/2005
123456, P123, 09/03/2001, 13/03/2005
1555 ,R155, 04/04/1930, 14/03/2005
1555 ,F180, 05/04/1930, 15/03/2005
1555 ,T124, 06/04/1930, 16/03/2005
1555 ,F708, 07/04/1930, 17/03/2005
16666 ,F120, 05/06/1990, 18/03/2005
166666, F125, 12/03/2002, 19/03/2005
166666, W345, 13/03/2002, 20/03/2005
166666, L123, 14/03/2002, 21/03/2005
177778, F155, 13/04/2001, 22/03/2005
177778, G123, 14/04/2001, 23/03/2005
177778, F190, 15/04/2001, 24/03/2005
177778, E124, 16/04/2001, 25/03/2005
177779, G155, 13/04/2001, 22/03/2005
177779, G123, 14/04/2001, 23/03/2005
177779, G190, 15/04/2001, 24/03/2005
177779, E124, 16/04/2001, 25/03/2005
You could use
library(dplyr)
library(stringr)
df %>%
group_by(ID) %>%
filter(cumsum(str_detect(DAIGNOSTICO, "^F")) > 0) %>%
mutate(nhosp = sum(str_detect(DAIGNOSTICO, "^[^F]")),
ficd = lead(DAIGNOSTICO)) %>%
filter(str_detect(DAIGNOSTICO, "^F")) %>%
slice(1) %>%
ungroup()
This returns
# A tibble: 7 x 6
ID DAIGNOSTICO data_entrada data_saida nhosp ficd
<dbl> <chr> <chr> <chr> <int> <chr>
1 1234 F100 01/04/1980 02/03/2005 2 O155
2 1555 F180 05/04/1930 15/03/2005 1 T124
3 12345 F150 03/06/2002 07/03/2005 4 K709
4 16666 F120 05/06/1990 18/03/2005 0 NA
5 123456 F145 07/03/2001 11/03/2005 2 T123
6 166666 F125 12/03/2002 19/03/2005 2 W345
7 177778 F155 13/04/2001 22/03/2005 2 G123
Edit
I think there might be a flaw: in the version above, lead(DAIGNOSTICO) can pick up a later F code instead of the first non-F code. Perhaps
library(dplyr)
library(stringr)
df %>%
group_by(ID) %>%
filter(
cumsum(str_detect(DAIGNOSTICO, "^F")) == 1 |
!str_detect(DAIGNOSTICO, "^F") & cumsum(str_detect(DAIGNOSTICO, "^F")) > 0
) %>%
mutate(nhosp = sum(str_detect(DAIGNOSTICO, "^[^F]")),
ficd = lead(DAIGNOSTICO)) %>%
filter(str_detect(DAIGNOSTICO, "^F")) %>%
slice(1) %>%
ungroup()
is a better solution.
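A minimal illustration of the flaw, using a hypothetical three-row patient whose second record is another F code (toy data, not from the question):
library(dplyr)
library(stringr)
toy <- tibble::tibble(ID = c(1, 1, 1),
DAIGNOSTICO = c("F100", "F101", "T124"))
# first version: lead() on the first F row picks up the later F code
toy %>%
group_by(ID) %>%
filter(cumsum(str_detect(DAIGNOSTICO, "^F")) > 0) %>%
mutate(ficd = lead(DAIGNOSTICO)) %>%
filter(str_detect(DAIGNOSTICO, "^F")) %>%
slice(1) %>%
pull(ficd)
# [1] "F101"
# the second version drops the later F rows before lead(), so ficd correctly lands on "T124"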
Here's a brief look at my data
X name sex X1880 X1881
1 1 Mary F 7065 6919
2 2 Anna F 2604 2698
3 3 Emma F 2003 2034
4 4 Elizabeth F 1939 1852
5 5 Minnie F 1746 1653
Each "X----" represents a year (up to 2010), the column "name" represents a unique name for a child, and so the corresponding number between any name and year is the number of children born in year "X---" with the specified name (for example, there were 7065 Marys born in 1880).
I would like to loop through the columns covering the years 1931 to 2010, find the total number of children born in each year, and then find the total number of children born in that year whose name begins with each letter of the alphabet. Finally, I would like to get the percentage of children born in each year whose name begins with each letter and store it in a list, so I can plot trend lines for all letters across all years on the same graph.
Here is the code I have
allnames <- read.csv("SSA-longtail-names.csv")
girls <- subset(allnames, allnames$sex=="F")
year_columns <- as.vector(names(girls)[54:134])
percs <- list()
years <- length(year_columns)
letters <- length(LETTERS)
for (i in range(1:years)){
total = sum(girls[year_columns[i]])
for (n in range(1:letters)){
l <- toString(LETTERS[n])
sub <- girls[(grep(l, girls$name)),year_columns[i]]
sub_total <- sum(sub[year_columns[i]])
percent <- (sub_total / total) * 100
percs <- append(percs, percent)
}
}
But the for loops only go through 8 iterations, and the list percs (which is supposed to store the calculated percentages) is full of NAs. Can anyone suggest a way to fix these loops, or perhaps an even easier way to accomplish this task?
Here is an approach using dplyr, tidyr, and stringr to make a long data table by pivoting your year columns.
library(dplyr)
library(tidyr)
library(stringr)
data2 <- data %>%
pivot_longer(cols = c(-X, -name, -sex), names_to = "year", values_to = "births") %>%
drop_na(births) %>% # remove NA rows
mutate(year = as.integer(str_remove(year, "X")),
first_letter = str_sub(name, start = 1, end = 1)) %>%
filter(year >= 1931 & year <= 2010)
Now you can do something like:
data3 <- data2 %>%
group_by(first_letter, year) %>%
summarize(total = sum(births))
This gives you a data.frame of three columns:
first_letter year total
A 1880 17972
A 1881 16426
# etc.
Now you can do some plotting, for example with ggplot2
library(ggplot2)
# this only looks at the English vowels to make a manageable example
ggplot(data = data3 %>% filter(first_letter %in% c("A", "E", "I", "O", "U")),
aes(x = year, y = total, color = first_letter)) +
geom_line()
As mentioned, consider reshaping data to long format (the better format in data analytics for merging, cleaning, aggregating, modeling, and plotting).
Reshape
girls_long <- reshape(girls, varying = names(girls)[4:ncol(girls)], times = names(girls)[4:ncol(girls)],
idvar = c("X", "name", "sex"),
v.names = "count", timevar = "year", ids=NULL,
new.row.names = 1:1E5, direction = "long")
girls_long$year <- as.integer(gsub("X", "", girls_long$year))
girls_long
# X name sex year count
# 1 1 Mary FALSE 1880 7065
# 2 2 Anna FALSE 1880 2604
# 3 3 Emma FALSE 1880 2003
# 4 4 Elizabeth FALSE 1880 1939
# 5 5 Minnie FALSE 1880 1746
# 6 1 Mary FALSE 1881 6919
# 7 2 Anna FALSE 1881 2698
# 8 3 Emma FALSE 1881 2034
# 9 4 Elizabeth FALSE 1881 1852
# 10 5 Minnie FALSE 1881 1653
Aggregations
# Total number of children born in that year
total_df <- aggregate(count ~ year, girls_long, FUN=sum)
total_df
# year count
# 1 1880 15357
# 2 1881 15156
# Total number of children born in that year whose name begins with each letter of the alphabet
girls_long$name_letter <- substring(girls_long$name, 1, 1)
girls_agg <- aggregate(count ~ name_letter + year, girls_long, FUN=sum)
girls_agg
# name_letter year count
# 1 A 1880 2604
# 2 E 1880 3942
# 3 M 1880 8811
# 4 A 1881 2698
# 5 E 1881 3886
# 6 M 1881 8572
# Percent of children born in each year whose name begins with each letter
girls_agg$percent <- with(girls_agg, count / ave(count, year, FUN=sum))
girls_agg
# name_letter year count percent
# 1 A 1880 2604 0.1695644
# 2 E 1880 3942 0.2566908
# 3 M 1880 8811 0.5737449
# 4 A 1881 2698 0.1780153
# 5 E 1881 3886 0.2564001
# 6 M 1881 8572 0.5655846
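If you also want the trend lines in base R, stats::interaction.plot() draws one line per trace level; a quick sketch on girls_agg:
# one trend line per first letter, percent share by year
with(girls_agg, interaction.plot(year, name_letter, percent, xlab = "year", ylab = "percent", trace.label = "letter"))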
I've split the solution into the three parts you describe. If you are only after the percentages, you can ignore the first part (total) and combine the second and third:
library(dplyr)
library(stringr)
library(tidyr)
data <- tibble(name = c('Mary', 'Anna', 'Emma', 'Elizabeth', 'Minnie'),
sex = rep('F', 5),
X1880 = c(7065, 2604, 2003, 1939, 1746),
X1881 = c(6919, 2698, 2034, 1852, 1653))
total <- data %>%
summarise(across(X1880:X1881, sum)) %>%
pivot_longer(everything(), names_to = 'year', values_to = 'total')
total
# year total
# <chr> <dbl>
# 1 X1880 15357
# 2 X1881 15156
totalPerLetter <- data %>%
mutate(letter = str_extract(name, '^.')) %>%
select(letter, starts_with('X')) %>%
pivot_longer(-letter, names_to = 'year', values_to = 'count') %>%
group_by(letter, year) %>%
mutate(count = sum(count)) %>%
distinct()
totalPerLetter
# letter year count
# <chr> <chr> <dbl>
# 1 M X1880 8811
# 2 M X1881 8572
# 3 A X1880 2604
# 4 A X1881 2698
# 5 E X1880 3942
# 6 E X1881 3886
pctPerLetter <- totalPerLetter %>%
group_by(year) %>%
mutate(total = sum(count)) %>%
ungroup() %>%
mutate(percent = count/(total/100))
pctPerLetter
# letter year count total percent
# <chr> <chr> <dbl> <dbl> <dbl>
# 1 M X1880 8811 15357 57.4
# 2 M X1881 8572 15156 56.6
# 3 A X1880 2604 15357 17.0
# 4 A X1881 2698 15156 17.8
# 5 E X1880 3942 15357 25.7
# 6 E X1881 3886 15156 25.6
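To draw the trend lines the question ultimately asks for, a ggplot2 sketch on top of pctPerLetter (stripping the X prefix so year plots as a number):
library(ggplot2)
pctPerLetter %>%
mutate(year = as.integer(str_remove(year, "X"))) %>%
ggplot(aes(x = year, y = percent, color = letter)) +
geom_line()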
How can I scrape the data and add an additional column showing the year being scraped?
nba_drafts <- function(year) {
url <- glue("https://www.basketball-reference.com/draft/NBA_{year}.html")
tables<-read_html(url) %>%
html_nodes("#stats") %>%
html_table() %>%
as.tibble() %>%
add_column(year = year)
write.csv(tables, year, file = "nba_draftsR.csv", na ="")
}
2000:2017 %>%
walk(function(year) {
nba_drafts(year)
})
Error: Column 1 must be named.
I checked your code; the error is happening at the step highlighted in the code below.
tables<-read_html(url) %>%
html_nodes("#stats") %>%
html_table() %>%
as.tibble() # error is happening at this step
Debug Step:
The reason for this error is that the first three column names are blank (""), which you need to assign first; only then can you convert to a tibble or data frame.
tables<-read_html(url) %>%
html_nodes("#stats") %>%
html_table() %>%
purrr::simplify() %>%
first()
names(tables)
[1] "" "" "" "Round 1" "Round 1" "" "Totals" "Totals" "Totals" "Totals" "Totals"
[12] "Shooting" "Shooting" "Shooting" "Per Game" "Per Game" "Per Game" "Per Game" "Advanced" "Advanced" "Advanced" "Advanced"
I have added a for loop to update the names
nba_drafts <- function(year) {
url <- glue("https://www.basketball-reference.com/draft/NBA_{year}.html")
tables<-read_html(url) %>%
html_nodes("#stats") %>%
html_table() %>%
purrr::simplify() %>%
first()
oldName<-names(tables)
#updating names with col_
for(i in 1:length(oldName)){
oldName[i]<- paste0("col_",i,oldName[i])
}
names(tables)<-oldName
tables<-tables %>%
as.tibble() %>%
add_column(year = year)
return(tables)
}
Output:
> nba_drafts("2019")
# A tibble: 63 x 23
col_1 col_2 col_3 `col_4Round 1` `col_5Round 1` col_6 col_7Totals col_8Totals col_9Totals col_10Totals col_11Totals
<chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
1 Rk Pk Tm Player College Yrs G MP PTS TRB AST
2 1 1 NOP Zion Williams… Duke 1 19 565 448 129 41
3 2 2 MEM Ja Morant Murray State 1 59 1771 1041 208 409
4 3 3 NYK RJ Barrett Duke 1 56 1704 803 279 143
5 4 4 LAL De'Andre Hunt… Virginia 1 63 2018 778 286 112
6 5 5 CLE Darius Garland Vanderbilt 1 59 1824 728 111 229
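As a side note, the renaming loop can be collapsed into a single vectorized call, since paste0() recycles over the whole vector; and on current tibble versions, as_tibble() replaces the deprecated as.tibble():
names(tables) <- paste0("col_", seq_along(tables), names(tables))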
Below is a subsetted dataset. For each set of IDs, sorted from earliest to latest date, how do I create a new column that indicates the row before the "LTD" status? The purpose is to identify the diagnosis in the row before hitting the "LTD" status for that unique id. Thanks in advance!
Dataset:
id <- c(123,123,123,123,123,321,321)
diag <- c("injury1", "injury2" , "cancer","injury4","cancer", "injury5", "cancer")
date <- as.Date(c('2008-11-1','2009-3-25','2010-3-14',"2010-10-14","2010-11-14", '2015-3-14', '2015-4-15'))
status <- (c("STD", "STD", "LTD", "STD","LTD","STD", "LTD"))
data <- data.frame(id, diag, date, status)
Result (N for no, Y for yes):
123 injury1 2008-11-01 STD N
123 injury2 2009-03-25 STD Y
123 cancer 2010-03-14 LTD NA
123 injury4 2010-10-14 STD Y
123 cancer 2010-11-14 LTD NA
321 injury5 2015-03-14 STD Y
321 cancer 2015-04-15 LTD NA
We can convert the date to a Date object, arrange by date, group_by id, and use case_when based on the conditions.
library(dplyr)
data %>%
mutate(date = as.Date(date)) %>%
arrange(date) %>%
group_by(id) %>%
mutate(result = case_when(lead(status == "LTD") ~ "Y",
status == "LTD" ~ NA_character_,
TRUE~ "N"))
# id diag date status result
# <dbl> <fct> <date> <fct> <chr>
#1 123 injury1 2008-11-01 STD N
#2 123 injury2 2009-03-25 STD Y
#3 123 cancer 2010-03-14 LTD NA
#4 123 injury4 2010-10-14 STD Y
#5 123 cancer 2010-11-14 LTD NA
#6 321 injury5 2015-03-14 STD Y
#7 321 cancer 2015-04-15 LTD NA
Using by() and step-by-step assignment. Note that this overwrites the diag column with the Y/N flag (as in the output below) rather than adding a new column.
do.call(rbind, by(data[order(data$date), ], data$id, function(x) {
x$diag <- "N"
x$diag[which(x$status == "LTD") - 1] <- "Y"
x$diag[x$status == "LTD"] <- NA
return(x[c(1, 3:4, 2)])
}))
# id date status diag
# 123.1 123 2008-11-01 STD N
# 123.2 123 2009-03-25 STD Y
# 123.3 123 2010-03-14 LTD <NA>
# 123.4 123 2010-10-14 STD Y
# 123.5 123 2010-11-14 LTD <NA>
# 321.6 321 2015-03-14 STD Y
# 321.7 321 2015-04-15 LTD <NA>
For a list of events at the country-day level, we would like to create a unique ID for each sequence of consecutive days in a specific country (if two or more event days in a country are consecutive, they should share one unique ID), so that we can ultimately reduce the data frame to sequences of events rather than event days.
I did not manage to aggregate the data based on the sequence of events. I believe this question is similar (Creating groups of consecutive days meeting a given criteria), but it is in SQL.
The data has the following format:
country <- c("Angola","Angola","Angola","Angola","Angola", "Benin","Benin","Benin","Benin","Benin","Benin")
event_date <- as.Date(c("2017-06-16", "2017-06-17", "2017-06-18", "2017-08-22", "2017-08-23", "2019-04-18", "2019-04-19", "2019-04-20", "2018-03-15", "2018-03-16", "2016-03-17"))
mydata <- data.frame(country, event_date)
In the output, I expect to have a new column with the ID that is unique to each series of events in a country:
seq.ID <- c(1,1,1,2,2,3,3,3,4,4,4)
mydata2 <- data.frame(country, event_date, seq.ID)
So that ultimately, I can reduce the data to the level of country and sequence of events:
mydata3 <- mydata2[!duplicated(mydata2$seq.ID),]
Try:
library(dplyr)
mydata %>%
group_by(country) %>%
distinct(seq.ID = cumsum(event_date != lag(event_date, default = first(event_date)) + 1L))
Output:
# A tibble: 5 x 2
# Groups: country [2]
seq.ID country
<int> <fct>
1 1 Angola
2 2 Angola
3 1 Benin
4 2 Benin
5 3 Benin
You can also use the .keep_all argument in distinct and preserve the first date of each sequence:
mydata %>%
group_by(country) %>%
distinct(seq.ID = cumsum(event_date != lag(event_date, default = first(event_date)) + 1L),
.keep_all = TRUE)
# A tibble: 5 x 3
# Groups: country [2]
country event_date seq.ID
<fct> <date> <int>
1 Angola 2017-06-16 1
2 Angola 2017-08-22 2
3 Benin 2019-04-18 1
4 Benin 2018-03-15 2
5 Benin 2016-03-17 3
If you want non-aggregated output with distinct sequence IDs, you could do:
mydata %>%
mutate(
seq.ID = cumsum(
(event_date != lag(event_date, default = first(event_date)) + 1L) |
country != lag(country, default = first(country))
)
)
country event_date seq.ID
1 Angola 2017-06-16 1
2 Angola 2017-06-17 1
3 Angola 2017-06-18 1
4 Angola 2017-08-22 2
5 Angola 2017-08-23 2
6 Benin 2019-04-18 3
7 Benin 2019-04-19 3
8 Benin 2019-04-20 3
9 Benin 2018-03-15 4
10 Benin 2018-03-16 4
11 Benin 2016-03-17 5
Note that there is a typo in your last event_date; this is why the outputs don't correspond 100% to your desired output.
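For what it's worth, the same non-aggregated IDs can also be built with data.table; a sketch of equivalent logic (an alternative, not from the answer above):
library(data.table)
dt <- as.data.table(mydata)
setorder(dt, country, event_date)
# a new run starts at each country's first row, or wherever the gap to the previous day is not exactly 1
dt[, new_run := c(TRUE, diff(event_date) != 1L), by = country]
dt[, seq.ID := cumsum(new_run)][, new_run := NULL]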