Removing data after different start date based on id [duplicate]

This question already has answers here:
How to combine multiple conditions to subset a data-frame using "OR"?
(5 answers)
Closed 1 year ago.
I have a data set with name, Date and earliest_date columns, in which only some names have an earliest_date. For each name, I want to remove all rows dated after that name's earliest_date, and leave the rows that have NA in earliest_date untouched. Since different names have different earliest_date values, I am pretty sure I can't use filter() with one fixed date. Any help will be much appreciated.
Part of the data is below:
dput(mydata[1:10,])
structure(list(name = c("a", "b", "c",
"d", "e", "f", "g",
"a", "h", "i"), Date = structure(c(13214,
17634, 15290, 18046, 16326, 18068, 10234, 12647, 15485, 15182
), class = "Date"), earliest_date = structure(c(12647, NA, NA,
NA, NA, NA, NA, 12647, NA, 15552), class = "Date")), row.names = c(NA,
10L), class = "data.frame")
Desired output:
The first row is removed because its Date falls after its earliest_date:
dput(mydata[2:10,])
structure(list(name = c("b", "c",
"d", "e", "f", "g",
"a", "h", "i"), Date = structure(c(17634, 15290,
18046, 16326, 18068, 10234, 12647, 15485, 15182), class = "Date"),
earliest_date = structure(c(NA, NA, NA, NA, NA, NA, 12647,
NA, 15552), class = "Date")), row.names = 2:10, class = "data.frame")

This may help:
library(dplyr)
mydata %>%
  filter(is.na(earliest_date) | Date <= earliest_date)
name Date earliest_date
1 b 2018-04-13 <NA>
2 c 2011-11-12 <NA>
3 d 2019-05-30 <NA>
4 e 2014-09-13 <NA>
5 f 2019-06-21 <NA>
6 g 1998-01-08 <NA>
7 a 2004-08-17 2004-08-17
8 h 2012-05-25 <NA>
9 i 2011-07-27 2012-07-31
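For readers without dplyr, the same condition can be written in base R. This is only a minimal sketch that rebuilds the ten sample rows from the question; the names `keep` and `result` are introduced here for illustration. The `is.na()` term matters because a bare `Date <= earliest_date` comparison yields NA for rows without an earliest_date, while `TRUE | NA` evaluates to TRUE, so those rows are kept.

```r
# Rebuild the sample data from the question (dates stored as day counts).
mydata <- data.frame(
  name = c("a", "b", "c", "d", "e", "f", "g", "a", "h", "i"),
  Date = as.Date(c(13214, 17634, 15290, 18046, 16326,
                   18068, 10234, 12647, 15485, 15182),
                 origin = "1970-01-01"),
  earliest_date = as.Date(c(12647, NA, NA, NA, NA, NA, NA, 12647, NA, 15552),
                          origin = "1970-01-01")
)

# Keep a row when it has no earliest_date, or when its Date is not later.
keep <- is.na(mydata$earliest_date) | mydata$Date <= mydata$earliest_date
result <- mydata[keep, ]

nrow(result)  # 9: only the first row (Date after earliest_date) is dropped
```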

Or try:
library(data.table)
setDT(mydata)[is.na(earliest_date) | Date <= earliest_date]

Related

R - use Dplyr mutate with Purrr for string manipulation

I have two tibbles, each containing a list of strings. I need to compare one list of strings with the other and, depending on the comparison, create a new column.
Small example below:
## Tibble 1 - the 'master'
structure(list(terms = c("This", "is", "a", "stri", "of", "areas",
"times", "two", "to", "see", "what", "will", "be", "in", "the",
"magic", "will", "rally", "for", "a", "cry", "from", "the", "deepest",
"part", "of", "the", "ocean", "com", "en", "au", "us"), rank = c("A",
"B", "C", "D", "E", "F", "G", "H", "I", "J", "K", "L", "M", "N",
"O", "P", "A", "B", "C", "D", "E", "F", "G", "H", "I", "J", "K",
"L", "M", "N", "O", "P"), id = 1:32), row.names = c(NA, -32L), class = c("tbl_df",
"tbl", "data.frame"))
## Tibble 2 - the 'comparison'
structure(list(conds = c("this.com", "two.org", "magic.edu",
"cry/en/org", "magic.com"), ind = structure(c(2L, 1L, 5L, 3L,
4L), .Label = c("bad", "good", "Indifferent", "Maybe", "Ugly"
), class = "factor")), row.names = c(NA, -5L), class = c("tbl_df",
"tbl", "data.frame"))
Ideally the output would be a mutated 'master' tibble with the ind value inserted depending on the comparison of the strings.
Attempt so far:
terms <- terms %>% mutate(
test = ifelse(
sapply(lapply(terms, grepl, condition_str$conds), any) == TRUE,
condition_str$ind,
'NA'))
terms
result
# A tibble: 32 x 4
terms rank id test
<chr> <chr> <int> <chr>
1 This A 1 NA
2 is B 2 1
3 a C 3 5
4 stri D 4 NA
5 of E 5 NA
6 areas F 6 NA
7 times G 7 NA
8 two H 8 5
9 to I 9 NA
10 see J 10 NA
It gives me a result: the factor's integer codes are carried across, but the level names are not. It also fails on a larger data set I am working on.
Questions:
Is there a purrr solution that uses stringr or stringi? My problem might be in my string matching
Is there a way to incorporate fixed = TRUE into the grepl function?
Is there a way to get the classification levels into the mutated column?
Thanks for any assistance.
James
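A possible starting point, sketched in base R rather than purrr/stringr. Two things seem to go wrong in the attempt above: `ifelse()` returns the integer codes of a factor rather than its labels, and `condition_str$ind` is picked by row position instead of by the condition that actually matched. The sketch below uses a handful of values from the two tibbles; the helper `lookup_ind` and the rule "the first matching condition wins" are assumptions made here, since the question does not say what should happen when a term occurs in several conditions.

```r
# Sample values taken from the two tibbles in the question.
terms <- c("This", "is", "two", "magic", "ocean")
conds <- c("this.com", "two.org", "magic.edu", "cry/en/org", "magic.com")
ind   <- factor(c("good", "bad", "Ugly", "Indifferent", "Maybe"))

# Hypothetical helper: label of the first condition that contains the term.
lookup_ind <- function(term) {
  # fixed = TRUE treats the term as a literal string, not a regex;
  # the match is case-sensitive.
  hits <- which(grepl(term, conds, fixed = TRUE))
  # as.character() keeps the level names instead of the integer codes.
  if (length(hits) > 0) as.character(ind)[hits[1]] else NA_character_
}

matched <- vapply(terms, lookup_ind, character(1), USE.NAMES = FALSE)
matched  # NA "good" "bad" "Ugly" NA
```

The result could then be attached to the master tibble as a new column. "This" stays NA because the match is case-sensitive; lower-casing both sides first would be one way around that.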

check if subset of rows is NA then move adjacent rows to replace them

I have a dataframe that's a result of combining multiple sheets from excel. The columns did not align properly. I need to check if a subset of rows is all NA. If they are NA, then I need to check if the adjacent equally sized subset has content, and if it does, I need to copy over that row to replace the NAs.
This is what the data looks like from my dput:
structure(list(id = 1:20, A = c(NA, NA, NA, NA, NA, "c", "d",
"q", "p", "m", NA, NA, NA, NA, NA, "k", "o", "i", "a", "b"),
B = c(NA, NA, NA, NA, NA, "h", "a", "f", "b", "e", NA, NA,
NA, NA, NA, "m", "c", "s", "g", "p"), C = c(NA, NA, NA, NA,
NA, "a", "f", "j", "s", "g", NA, NA, NA, NA, NA, "l", "m",
"o", "k", "t"), D = c(NA, NA, NA, NA, NA, "n", "r", "l",
"h", "g", NA, NA, NA, NA, NA, "j", "p", "f", "d", "q"), E = c("j",
"p", "n", "i", "g", NA, NA, NA, NA, NA, "k", "e", "s", "m",
"l", NA, NA, NA, NA, NA), F = c("o", "d", "r", "q", "a",
NA, NA, NA, NA, NA, "h", "s", "f", "j", "k", NA, NA, NA,
NA, NA), G = c("f", "c", "a", "l", "m", NA, NA, NA, NA, NA,
"n", "t", "s", "e", "r", NA, NA, NA, NA, NA), H = c("r",
"c", "h", "i", "j", NA, NA, NA, NA, NA, "f", "e", "b", "l",
"n", NA, NA, NA, NA, NA)), row.names = c(NA, -20L), class = "data.frame")

If every row has the same number of non-missing values, as in the shared example, you can simply drop the NA values in each row.
df1 <- as.data.frame(t(apply(df, 1, na.omit)))
# V1 V2 V3 V4 V5
#1 1 j o f r
#2 2 p d c c
#3 3 n r a h
#4 4 i q l i
#5 5 g a m j
#6 6 c h a n
#7 7 d a f r
#8 8 q f j l
#9 9 p b s h
#10 10 m e g g
#11 11 k h n f
#12 12 e s t e
#13 13 s f s b
#14 14 m j e l
#15 15 l k r n
#16 16 k m l j
#17 17 o c m p
#18 18 i s o f
#19 19 a g k d
#20 20 b p t q
To check whether the first half of the values in each row are all NA and, if so, take the second half instead, we can do:
cbind(df[1], t(apply(df[-1], 1, function(x) {
x1 <- (length(x)/2)
if(all(is.na(x[1:x1]))) x[(x1+1):length(x)]
else x[1:x1]
})))
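The half-selection logic above can be tried on a smaller frame. This is a sketch under the same assumption as the answer (the value columns split into two equally sized halves); the four-row data, the helper name `pick_half` and the final renaming step are made up here for illustration.

```r
# Small stand-in for the question's data: columns A, B are the first half,
# E, F the misaligned second half.
df <- data.frame(id = 1:4,
                 A = c(NA, "c", NA, "k"),
                 B = c(NA, "h", NA, "m"),
                 E = c("j", NA, "k", NA),
                 F = c("o", NA, "h", NA),
                 stringsAsFactors = FALSE)

# For one row: return the second half if the first half is all NA.
pick_half <- function(x) {
  half <- length(x) / 2
  if (all(is.na(x[1:half]))) x[(half + 1):length(x)] else x[1:half]
}

shifted <- cbind(df[1], t(apply(df[-1], 1, pick_half)))
names(shifted)[-1] <- names(df)[2:3]  # reuse the first-half column names
shifted
```

Rows 1 and 3 take their values from E and F, while rows 2 and 4 keep A and B.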

Create date range based on sparse variable by group in R

I have sparse data for multiple subjects: a score taken at periodic intervals and a measurement taken at more regular intervals, each with a corresponding date. I would like to generate date ranges based on the score dates for each subject ID, i.e. each range starts at a score date and ends at the next score date (or starts/ends at the subject's first/last observation if no score falls on those dates).
I would then like to average the measurement variable within these date ranges. The averaging step should be straightforward but I am stuck on generating the date ranges.
Below is a sample of the data and an example of how I would envision the resulting data
sample data:
structure(list(ID = c("A", "A", "A", "A", "A", "A", "A", "A",
"A", "A", "A", "A", "A", "A", "A", "A", "A", "B", "B", "B", "B",
"B", "B", "C", "C", "C", "D", "D", "D", "D", "D", "D", "D", "D",
"D", "D", "D", "D", "D", "D", "D"), date = c("1/21/2020", "1/27/2020",
"2/1/2020", "2/3/2020", "2/5/2020", "2/6/2020", "2/8/2020", "2/9/2020",
"2/11/2020", "2/12/2020", "2/13/2020", "2/15/2020", "2/18/2020",
"2/20/2020", "2/21/2020", "2/22/2020", "2/25/2020", "2/1/2020",
"2/5/2020", "2/7/2020", "2/8/2020", "2/11/2020", "2/12/2020",
"1/30/2020", "2/10/2020", "2/11/2020", "2/6/2020", "2/7/2020",
"2/8/2020", "2/9/2020", "2/11/2020", "2/13/2020", "2/14/2020",
"2/16/2020", "2/17/2020", "2/20/2020", "2/23/2020", "2/26/2020",
"3/1/2020", "3/3/2020", "3/5/2020"), score = c(0.5, 2, NA, NA,
3, NA, NA, NA, NA, NA, 2.5, NA, NA, 1.5, NA, NA, NA, 3, NA, NA,
2.5, NA, 1, 0.5, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 14,
NA, NA, 11.5, NA, 9.5, NA), measure = c(0.394160734, 0.722462998,
0.82984815, 0.738432745, 0.321792398, 0.167492308, 0.218020898,
0.929210786, 0.686818585, 0.939678073, 0.708172942, 0.299863884,
0.48216267, 0.290307369, 0.801947902, 0.579418467, 0.78101844,
0.219494852, 0.875129822, 0.517971003, 0.475625007, 0.723003744,
0.257473477, 0.629818537, 0.817369151, 0.628573413, 0.364660834,
0.5971024, 0.002274261, 0.318937617, 0.983917106, 0.685933928,
0.487922831, 0.151769304, 0.392413694, 0.012429414, 0.149627658,
0.011724992, 0.536998203, 0.798399999, 0.763353822)), class = "data.frame", row.names = c(NA,
-41L))
answer data:
structure(list(ID = c("A", "A", "A"), startDate = c("1/21/2020",
"1/27/2020", "2/5/2020"), endDate = c("1/27/2020", "2/5/2020",
"2/13/2020"), score = c(0.5, 2, 3), measure = c(0.394160734,
0.763581298, 0.543835508)), class = "data.frame", row.names = c(NA,
-3L))

Here's a way with dplyr :
library(dplyr)
df %>%
group_by(ID, grp = cumsum(!is.na(score))) %>%
summarise(start_date = first(date),
score = first(score),
measure = mean(measure)) %>%
mutate(end_date = lead(start_date, default = last(start_date))) %>%
select(-grp)
# ID start_date score measure end_date
# <chr> <chr> <dbl> <dbl> <chr>
# 1 A 1/21/2020 0.5 0.394 1/27/2020
# 2 A 1/27/2020 2 0.764 2/5/2020
# 3 A 2/5/2020 3 0.544 2/13/2020
# 4 A 2/13/2020 2.5 0.497 2/20/2020
# 5 A 2/20/2020 1.5 0.613 2/20/2020
# 6 B 2/1/2020 3 0.538 2/8/2020
# 7 B 2/8/2020 2.5 0.599 2/12/2020
# 8 B 2/12/2020 1 0.257 2/12/2020
# 9 C 1/30/2020 0.5 0.692 1/30/2020
#10 D 2/6/2020 NA 0.449 2/17/2020
#11 D 2/17/2020 14 0.185 2/26/2020
#12 D 2/26/2020 11.5 0.274 3/3/2020
#13 D 3/3/2020 9.5 0.781 3/3/2020
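The key step in the pipeline above is `cumsum(!is.na(score))`, which starts a new group at every row that has a score. A small base R sketch of that idea, using only the first six rows of subject A from the sample data (the names `grp` and `avg` are introduced here for illustration):

```r
# First six rows of subject A from the question's sample data.
score   <- c(0.5, 2, NA, NA, 3, NA)
measure <- c(0.394160734, 0.722462998, 0.82984815,
             0.738432745, 0.321792398, 0.167492308)

# The counter increases by one at each non-NA score, so rows without a
# score inherit the group of the most recent score above them.
grp <- cumsum(!is.na(score))
grp  # 1 2 2 2 3 3

# Average the measurement within each group.
avg <- tapply(measure, grp, mean)
avg["2"]  # about 0.7636, the value for the 1/27/2020 range in the answer data
```

Because only six rows are used, the third group here is incomplete and its mean will differ from the full-data result.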

Using data.table
library(data.table)
setDT(df)[, .(start_date = first(date),
score = first(score),
measure = mean(measure)),
by = .(ID, grp = cumsum(!is.na(score)))
][, end_date := shift(start_date, type= 'lead', fill = last(start_date))
][, grp := NULL][]

r - how to fill in values on stepped data hierarchy

Is there an elegant/tidy way to fill in the data if there are non-null values to the right? I have a wonky work-around but wanted to know if there was a nice dplyr way to do this.
actual <-
tibble(
a = c("A", NA, NA, NA, NA, NA, NA, "B", NA, NA, NA),
b = c(NA, "A", NA, NA, NA, "C", NA, NA, "E", NA, NA),
c = c(NA, NA, "B", NA, NA, NA, "D", NA, NA, "F", "G"),
d = c(NA, NA, NA, "C", "D", NA, NA, NA, NA, NA, NA)
)
desired <-
tibble(
w = c("A", "A", "A", "A", "A", "A", "A", "B", "B", "B", "B"),
x = c(NA, "A", "A", "A", "A", "C", "C", NA, "E", "E", "E"),
y = c(NA, NA, "B", "B", "B", NA, "D", NA, NA, "F", "G"),
z = c(NA, NA, NA, "C", "D", NA, NA, NA, NA, NA, NA)
)

We can use fill from tidyr together with dplyr like the following.
library(dplyr)
library(tidyr)
dat <- actual %>%
fill(a) %>%
group_by(a) %>%
fill(b) %>%
group_by(b) %>%
fill(c) %>%
group_by(c) %>%
fill(d) %>%
ungroup()
print(dat)
# # A tibble: 11 x 4
# a b c d
# <chr> <chr> <chr> <chr>
# 1 A NA NA NA
# 2 A A NA NA
# 3 A A B NA
# 4 A A B C
# 5 A A B D
# 6 A C NA NA
# 7 A C D NA
# 8 B NA NA NA
# 9 B E NA NA
# 10 B E F NA
# 11 B E G NA
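For intuition, filling downward amounts to carrying the last non-missing value forward. The sketch below is a base R illustration of that idea for a single column, not tidyr's actual implementation; `locf` is a hypothetical helper name.

```r
# Carry the last non-NA value forward; leading NAs stay NA.
locf <- function(x) {
  seen <- cumsum(!is.na(x))        # number of non-NA values seen so far
  c(NA, x[!is.na(x)])[seen + 1]    # slot 1 is the NA used before any value
}

a <- c("A", NA, NA, NA, NA, NA, NA, "B", NA, NA, NA)
locf(a)  # "A" seven times, then "B" four times
```

The grouped fills in the answer apply the same idea within each value of the column to the left, which is what stops "A" in column b from leaking into the rows that belong to "B" in column a.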

Distinct in dplyr does not work (sometimes)

I have the following data frame which I have obtained from a count. I have used dput to make the data frame available and then edited the data frame so there is a duplicate of A.
df <- structure(list(Procedure = structure(c(4L, 1L, 2L, 3L), .Label = c("A", "A", "C", "D", "-1"),
class = "factor"), n = c(10717L, 4412L, 2058L, 1480L)),
class = c("tbl_df", "tbl", "data.frame"), row.names = c(NA, -4L), .Names = c("Procedure", "n"))
print(df)
# A tibble: 4 x 2
Procedure n
<fct> <int>
1 D 10717
2 A 4412
3 A 2058
4 C 1480
Now I would like to take distinct on Procedure and only keep the first A.
df %>%
distinct(Procedure, .keep_all=TRUE)
# A tibble: 4 x 2
Procedure n
<fct> <int>
1 D 10717
2 A 4412
3 A 2058
4 C 1480
It does not work. Strange...

If we print the Procedure column, we can see that the level A is duplicated, which is what trips up the distinct function.
df$Procedure
[1] D A A C
Levels: A A C D -1
Warning message:
In print.factor(x) : duplicated level [2] in factor
One way to fix this is to rebuild the factor so that the duplicated and unused levels are dropped, which the factor function does. Another way is to convert the Procedure column to character.
df <- structure(list(Procedure = structure(c(4L, 1L, 2L, 3L), .Label = c("A", "A", "C", "D", "-1"),
class = "factor"), n = c(10717L, 4412L, 2058L, 1480L)),
class = c("tbl_df", "tbl", "data.frame"), row.names = c(NA, -4L), .Names = c("Procedure", "n"))
library(tidyverse)
df %>%
mutate(Procedure = factor(Procedure)) %>%
distinct(Procedure, .keep_all=TRUE)
# # A tibble: 3 x 2
# Procedure n
# <fct> <int>
# 1 D 10717
# 2 A 4412
# 3 C 1480
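The character-conversion route mentioned above could look like the following in base R. This is a sketch only: it rebuilds the malformed factor with `levels =` instead of `.Label =`, uses a plain data frame rather than a tibble, and `out` is a name introduced here.

```r
# Recreate the data with the duplicated "A" level, as in the question.
df <- structure(
  list(Procedure = structure(c(4L, 1L, 2L, 3L),
                             levels = c("A", "A", "C", "D", "-1"),
                             class = "factor"),
       n = c(10717L, 4412L, 2058L, 1480L)),
  class = "data.frame", row.names = c(NA, -4L))

# Once the column is character, the two "A" rows compare as equal.
df$Procedure <- as.character(df$Procedure)

# Keep the first row for each Procedure value.
out <- df[!duplicated(df$Procedure), ]
out$Procedure  # "D" "A" "C"
```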

You have a duplicated value in the label parameter, .Label = c("A", "A", "C", "D", "-1"), and that is the issue. By the way, your way of initializing a tibble seems very strange (I do not know exactly what your goal is, but still).
Why not use:
df <- tibble(
Procedure = c("D", "A", "A", "C"),
n = c(10717L, 4412L, 2058L, 1480L)
)
