Remove rows within a group depending on multiple criteria in R

I have a dataset with some repeated values in the Date variable, and I would like to filter these rows based on several conditions. As an example, the data frame looks like:
df <- read.table(text =
"Date column_A column_B column_C Column_D
1 2020-01-01 10 15 15 20
2 2020-01-02 10 15 15 20
3 2020-01-03 10 13 15 20
4 2020-01-04 10 15 15 20
5 2020-01-05 NA 14 15 20
6 2020-01-05 7 NA NA 28
7 2020-01-06 10 15 15 20
8 2020-01-07 10 15 15 20
9 2020-01-07 10 NA NA 20
10 2020-01-08 10 15 15 20", header=TRUE)
df$Date <- as.Date(df$Date)
The conditions should apply ONLY to rows with a duplicated Date:
If "column_A" is NA in one row and numeric in the other, select the row with the numeric value.
If both values are of the same kind (both NA or both numeric), select the row with fewer NAs.
My best approach, after trying several options, is:
df$cnt_na <- apply(df[,2:5], 1, function(x) sum(is.na(x)))
df <- df %>% group_by(Date) %>% slice(which.min(all_of(cnt_na))) %>% select(-cnt_na)
However, in my case this doesn't apply the first condition. The main problem is that if I filter by !is.na(Date), I also remove other rows that are not duplicated.
Thanks in advance

I would sort your table based on your conditions and then pick the first row for every group:
library(dplyr)
df %>%
rowwise() %>%
mutate(cnt_na = sum(across(-Date, ~ sum(is.na(.))))) %>%
arrange(Date, is.na(column_A), cnt_na) %>%
group_by(Date) %>%
slice_head() %>%
ungroup()
which gives
# A tibble: 8 x 6
Date column_A column_B column_C Column_D cnt_na
<date> <int> <int> <int> <int> <int>
1 2020-01-01 10 15 15 20 0
2 2020-01-02 10 15 15 20 0
3 2020-01-03 10 13 15 20 0
4 2020-01-04 10 15 15 20 0
5 2020-01-05 7 NA NA 28 2
6 2020-01-06 10 15 15 20 0
7 2020-01-07 10 15 15 20 0
8 2020-01-08 10 15 15 20 0
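The same sort-then-keep-first idea can also be sketched in base R, without rowwise(). This is only an illustration on a shortened copy of the example data; the names ord and res are placeholders introduced here, not part of the answer above.

```r
# Shortened copy of the example data (four dates, two of them duplicated)
df <- data.frame(
  Date     = as.Date(c("2020-01-04", "2020-01-05", "2020-01-05",
                       "2020-01-07", "2020-01-07", "2020-01-08")),
  column_A = c(10, NA, 7, 10, 10, 10),
  column_B = c(15, 14, NA, 15, NA, 15),
  column_C = c(15, 15, NA, 15, NA, 15),
  Column_D = c(20, 20, 28, 20, 20, 20)
)

# Number of NAs per row, counted over every column except Date
df$cnt_na <- rowSums(is.na(df[, names(df) != "Date"]))

# Sort so that, within each Date, rows with a non-NA column_A come first,
# followed by rows with fewer NAs
ord <- order(df$Date, is.na(df$column_A), df$cnt_na)
res <- df[ord, ]

# Keep only the first row of each Date
res <- res[!duplicated(res$Date), ]
res
```

Because FALSE sorts before TRUE, ordering by is.na(column_A) puts the numeric row ahead of the NA row, which is what implements the first condition.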


Given a series of dates and a birth day, is there a way to obtain the age at every date entry along with a final age using the lubridate package?

I have a database of information pertaining to individuals observed over time. I would like to find a way to obtain the age of these individuals whenever a record was taken. Assuming that BIRTH corresponds to an age of 0, I would like to obtain the age, in either days or months, at each later visit. It would also be helpful to obtain a final age (in days or months) for each individual (*not included in the code). For example, for ID (A), the final age would be 10 months. I would like to use the lubridate package, as its built-in date features make it easier to work with dates. Any help with this is much appreciated.
date<-c("2000-01-01","2000-01-14","2000-01-25","2000-02-12","2000-02-27","2000-06-05","2000-10-30",
"2001-02-04","2001-06-15","2001-12-26","2002-05-22","2002-06-04",
"2000-01-08","2000-07-11","2000-08-18","2000-11-27")
ID<-c("A","A","A","A","A","A","A",
"B","B","B","B","B",
"C","C","C","C")
status<-c("BIRTH","ETC","ETC","ETC","ETC","ETC","ETC",
"BIRTH","ETC","ETC","ETC","ETC",
"BIRTH","ETC","ETC","ETC")
df1<-data.frame(date,ID,status)
print(df1)
date ID status
1 2000-01-01 A BIRTH
2 2000-01-14 A ETC
3 2000-01-25 A ETC
4 2000-02-12 A ETC
5 2000-02-27 A ETC
6 2000-06-05 A ETC
7 2000-10-30 A ETC
8 2001-02-04 B BIRTH
9 2001-06-15 B ETC
10 2001-12-26 B ETC
11 2002-05-22 B ETC
12 2002-06-04 B ETC
13 2000-01-08 C BIRTH
14 2000-07-11 C ETC
15 2000-08-18 C ETC
16 2000-11-27 C ETC
date.new<-c("2000-01-01","2000-01-14","2000-01-25","2000-02-12","2000-02-27","2000-06-05","2000-10-30",
"2001-02-04","2001-06-15","2001-12-26","2002-05-22","2001-02-04",
"2000-01-08","2000-07-11","2000-08-18","2000-11-27")
ID.new<-c("A","A","A","A","A","A","A",
"B","B","B","B","B",
"C","C","C","C")
status.new<-c("BIRTH","ETC","ETC","ETC","ETC","ETC","ETC",
"BIRTH","ETC","ETC","ETC","ETC",
"BIRTH","ETC","ETC","ETC")
age<-c(0,1,1,2,2,6,10,
0,4,10,15,16,
0,6,7,10)
df2<-data.frame(date.new,ID.new,status.new,age)
print(df2)
date.new ID.new status.new age
1 2000-01-01 A BIRTH 0
2 2000-01-14 A ETC 1
3 2000-01-25 A ETC 1
4 2000-02-12 A ETC 2
5 2000-02-27 A ETC 2
6 2000-06-05 A ETC 6
7 2000-10-30 A ETC 10
8 2001-02-04 B BIRTH 0
9 2001-06-15 B ETC 4
10 2001-12-26 B ETC 10
11 2002-05-22 B ETC 15
12 2001-02-04 B ETC 16
13 2000-01-08 C BIRTH 0
14 2000-07-11 C ETC 6
15 2000-08-18 C ETC 7
16 2000-11-27 C ETC 10
For calculations related to age in years or months, I'd like to encourage you to try the clock package rather than lubridate. lubridate is a great package, but produces some unexpected results with these kinds of calculations if you aren't 100% sure of what you are doing. In clock, the function to do this is date_count_between(). Notice that one of the results is different between clock and lubridate here:
library(clock)
library(lubridate, warn.conflicts = FALSE)
library(dplyr, warn.conflicts = FALSE)
df <- tibble(
date = c("2000-01-01","2000-01-14",
"2000-01-25","2000-02-12","2000-02-27","2000-06-05",
"2000-10-30","2001-02-04","2001-06-15","2001-12-26",
"2002-05-22","2002-06-04","2000-01-08","2000-07-11",
"2000-08-18","2000-11-27"),
ID = c("A","A","A","A","A","A",
"A","B","B","B","B","B","C","C","C","C"),
status = c("BIRTH","ETC","ETC","ETC",
"ETC","ETC","ETC","BIRTH","ETC","ETC","ETC","ETC",
"BIRTH","ETC","ETC","ETC")
)
df %>%
mutate(date = date_parse(date)) %>%
group_by(ID) %>%
mutate(birth_date = date[status == "BIRTH"]) %>%
ungroup() %>%
mutate(
age_clock = date_count_between(birth_date, date, "month"),
age_lubridate = as.period(date - birth_date) %/% months(1))
#> # A tibble: 16 × 6
#> date ID status birth_date age_clock age_lubridate
#> <date> <chr> <chr> <date> <int> <dbl>
#> 1 2000-01-01 A BIRTH 2000-01-01 0 0
#> 2 2000-01-14 A ETC 2000-01-01 0 0
#> 3 2000-01-25 A ETC 2000-01-01 0 0
#> 4 2000-02-12 A ETC 2000-01-01 1 1
#> 5 2000-02-27 A ETC 2000-01-01 1 1
#> 6 2000-06-05 A ETC 2000-01-01 5 5
#> 7 2000-10-30 A ETC 2000-01-01 9 9
#> 8 2001-02-04 B BIRTH 2001-02-04 0 0
#> 9 2001-06-15 B ETC 2001-02-04 4 4
#> 10 2001-12-26 B ETC 2001-02-04 10 10
#> 11 2002-05-22 B ETC 2001-02-04 15 15
#> 12 2002-06-04 B ETC 2001-02-04 16 15
#> 13 2000-01-08 C BIRTH 2000-01-08 0 0
#> 14 2000-07-11 C ETC 2000-01-08 6 6
#> 15 2000-08-18 C ETC 2000-01-08 7 7
#> 16 2000-11-27 C ETC 2000-01-08 10 10
clock says that 2001-02-04 to 2002-06-04 is 16 months, while the lubridate method here only says it is 15 months. This has to do with the fact that the lubridate calculation uses the length of an average month, which doesn't always accurately reflect how we think about months.
Consider this simple example. I think most people would agree that a child born on the February date below would be considered "1 month and 1 day" old on the March date. But lubridate shows 0 months!
library(clock)
library(lubridate, warn.conflicts = FALSE)
# "1 month and 1 day apart"
feb <- as.Date("2020-02-28")
mar <- as.Date("2020-03-29")
# As expected when thinking about age in months
date_count_between(feb, mar, "month")
#> [1] 1
# Not expected
as.period(mar - feb) %/% months(1)
#> [1] 0
secs_in_day <- 86400
secs_in_month <- as.numeric(months(1))
secs_in_month / secs_in_day
#> [1] 30.4375
# Less than 30.4375 days, so not 1 month
mar - feb
#> Time difference of 30 days
The issue is that lubridate uses the length of an average month in the computation, which is 30.4375 days. But there are only 30 days between these two dates, so it isn't considered a full month.
clock, on the other hand, uses the day component of the starting date to determine if a "full month" has passed or not. In other words, because we have passed the 28th of March, clock decides that 1 month has passed, which is consistent with how we generally think about age.
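To make the "day component" rule concrete, here is a rough base R sketch of it. The helper months_between() is hypothetical and written only for illustration. It is not how clock is implemented, it assumes end is not before start, and it makes no attempt to reproduce clock's handling of month-end edge cases (for example a start date on the 31st).

```r
# Hypothetical helper: whole months between two dates, judged by the
# day of the month rather than by an average month length.
months_between <- function(start, end) {
  s <- as.POSIXlt(start)
  e <- as.POSIXlt(end)
  # Difference in calendar months, ignoring the day of the month
  whole <- (e$year - s$year) * 12 + (e$mon - s$mon)
  # Subtract one if the day of the start date has not been reached yet
  whole - as.integer(e$mday < s$mday)
}

feb <- as.Date("2020-02-28")
mar <- as.Date("2020-03-29")

# Day-of-month rule: the 28th of March has passed, so one month counts
months_between(feb, mar)

# Average-month rule: 30 days is less than 30.4375 days, so zero months
floor(as.numeric(mar - feb) / 30.4375)

# The pair of dates on which the two packages disagreed above
months_between(as.Date("2001-02-04"), as.Date("2002-06-04"))
```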
Using dplyr and lubridate, we can do the following. We first turn the date column into a date. Then we group by ID, find the birth date and calculate the number of months since that date via some lubridate magic (see How do I use the lubridate package to calculate the number of months between two date vectors where one of the vectors has NA values?).
library(dplyr)
library(lubridate)
df1 %>%
mutate(date = as_date(date)) %>%
group_by(ID) %>%
mutate(birth_date = date[status == "BIRTH"],
age = as.period(date - birth_date) %/% months(1)) %>%
ungroup()
Which gives:
date ID status birth_date age
<date> <fct> <fct> <date> <dbl>
1 2000-01-01 A BIRTH 2000-01-01 0
2 2000-01-14 A ETC 2000-01-01 0
3 2000-01-25 A ETC 2000-01-01 0
4 2000-02-12 A ETC 2000-01-01 1
5 2000-02-27 A ETC 2000-01-01 1
6 2000-06-05 A ETC 2000-01-01 5
7 2000-10-30 A ETC 2000-01-01 9
8 2001-02-04 B BIRTH 2001-02-04 0
9 2001-06-15 B ETC 2001-02-04 4
10 2001-12-26 B ETC 2001-02-04 10
11 2002-05-22 B ETC 2001-02-04 15
12 2002-06-04 B ETC 2001-02-04 15
13 2000-01-08 C BIRTH 2000-01-08 0
14 2000-07-11 C ETC 2000-01-08 6
15 2000-08-18 C ETC 2000-01-08 7
16 2000-11-27 C ETC 2000-01-08 10
Which is your expected output except for some rounding differences. See my comment on your question.
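The question also asks for a final age per individual, which the code above does not compute. A minimal base R sketch, assuming age in days is acceptable (days avoid the month-length ambiguity entirely) and using a shortened copy of the data; the names is_birth, birth and final_age are placeholders introduced here:

```r
# Shortened copy of the example data: three records per individual
df1 <- data.frame(
  date   = as.Date(c("2000-01-01", "2000-06-05", "2000-10-30",
                     "2001-02-04", "2001-12-26", "2002-06-04",
                     "2000-01-08", "2000-08-18", "2000-11-27")),
  ID     = c("A", "A", "A", "B", "B", "B", "C", "C", "C"),
  status = c("BIRTH", "ETC", "ETC", "BIRTH", "ETC", "ETC",
             "BIRTH", "ETC", "ETC"),
  stringsAsFactors = FALSE
)

# Birth date of each individual, named by ID so it can be looked up per row
is_birth <- df1$status == "BIRTH"
birth <- setNames(df1$date[is_birth], df1$ID[is_birth])

# Age in days at every record
df1$age_days <- as.numeric(df1$date - birth[df1$ID])

# Final age: the largest age observed for each individual
final_age <- tapply(df1$age_days, df1$ID, max)
final_age
```

This assumes exactly one BIRTH record per ID, as in the example data.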

Selecting distinct entries based on specific variables in R

I want to select distinct entries for my dataset based on two specific variables. I may, in fact, like to create a subset and do analysis using each subset.
The data set looks like this
id <- c(3,3,6,6,4,4,3,3)
date <- c("2017-1-1", "2017-3-3", "2017-4-3", "2017-4-7", "2017-10-1", "2017-11-1", "2018-3-1", "2018-4-3")
date_cat <- c(1,1,1,1,2,2,3,3)
measurement <- c(10, 13, 14,13, 12, 11, 14, 17)
myData <- data.frame(id, date, date_cat, measurement)
myData
myData$date1 <- as.Date(myData$date)
myData
id date date_cat measurement date1
1 3 2017-1-1 1 10 2017-01-01
2 3 2017-3-3 1 13 2017-03-03
3 6 2017-4-3 1 14 2017-04-03
4 6 2017-4-7 1 13 2017-04-07
5 4 2017-10-1 2 12 2017-10-01
6 4 2017-11-1 2 11 2017-11-01
7 3 2018-3-1 3 14 2018-03-01
8 3 2018-4-3 3 17 2018-04-03
#select the last date for the ID in each date category.
Here date_cat is the date category and date1 is date formatted as date. How can I get the last date for each ID in each date_category?
I want my data to show up as
id date date_cat measurement date1
1 3 2017-3-3 1 13 2017-03-03
2 6 2017-4-7 1 13 2017-04-07
3 4 2017-11-1 2 11 2017-11-01
4 3 2018-4-3 3 17 2018-04-03
Thanks!
I am not sure if you want something like below
subset(myData,ave(date1,id,date_cat,FUN = function(x) tail(sort(x),1))==date1)
which gives
> subset(myData,ave(date1,id,date_cat,FUN = function(x) tail(sort(x),1))==date1)
id date date_cat measurement date1
2 3 2017-3-3 1 13 2017-03-03
4 6 2017-4-7 1 13 2017-04-07
6 4 2017-11-1 2 11 2017-11-01
8 3 2018-4-3 3 17 2018-04-03
Using data.table:
library(data.table)
myData_DT <- as.data.table(myData)
myData_DT[, .SD[.N] , by = .(date_cat, id)]
We could create a group with rleid on the 'id' column, slice the last row, remove the temporary grouping column
library(dplyr)
library(data.table)
myData %>%
group_by(grp = rleid(id)) %>%
slice(n()) %>%
ungroup %>%
select(-grp)
# A tibble: 4 x 5
# id date date_cat measurement date1
# <dbl> <chr> <dbl> <dbl> <date>
#1 3 2017-3-3 1 13 2017-03-03
#2 6 2017-4-7 1 13 2017-04-07
#3 4 2017-11-1 2 11 2017-11-01
#4 3 2018-4-3 3 17 2018-04-03
Or this can be done on the fly without creating a temporary column
myData %>%
filter(!duplicated(rleid(id), fromLast = TRUE))
Or using base R with subset and rle
subset(myData, !duplicated(with(rle(id),
rep(seq_along(values), lengths)), fromLast = TRUE))
# id date date_cat measurement date1
#2 3 2017-3-3 1 13 2017-03-03
#4 6 2017-4-7 1 13 2017-04-07
#6 4 2017-11-1 2 11 2017-11-01
#8 3 2018-4-3 3 17 2018-04-03
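Using dplyr (ordering by date1, because date is not a Date column and would not be compared chronologically):
library(dplyr)
myData %>%
group_by(id, date_cat) %>%
top_n(1, date1)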
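Several of the answers above rely on the rows already being in date order, and the date column in the example is text, which does not sort chronologically. As a hedged illustration, here is a base R sketch that sorts on date1 explicitly before keeping the last row of each id and date_cat combination; sorted and last_rows are names introduced here.

```r
myData <- data.frame(
  id          = c(3, 3, 6, 6, 4, 4, 3, 3),
  date        = c("2017-1-1", "2017-3-3", "2017-4-3", "2017-4-7",
                  "2017-10-1", "2017-11-1", "2018-3-1", "2018-4-3"),
  date_cat    = c(1, 1, 1, 1, 2, 2, 3, 3),
  measurement = c(10, 13, 14, 13, 12, 11, 14, 17),
  stringsAsFactors = FALSE
)
myData$date1 <- as.Date(myData$date)

# Why the text column is risky: as strings, October sorts before March
"2017-10-1" < "2017-3-3"

# Sort by group and by the real Date column
sorted <- myData[order(myData$id, myData$date_cat, myData$date1), ]

# Keep the last row of every (id, date_cat) combination
last_rows <- sorted[!duplicated(sorted[c("id", "date_cat")], fromLast = TRUE), ]
last_rows
```

The result is ordered by id and date_cat rather than by the original row order, so it lists the same four rows as the desired output in a different sequence.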

Group records with time interval overlap

I have a data frame (N = 16) containing ID (character), w_from (date), and w_to (date). Each record represents a task.
Here’s the data in R.
ID <- c(1,1,1,1,1,1,1,1,1,1,1,1,1,2,2,2)
w_from <- c("2010-01-01","2010-01-05","2010-01-29","2010-01-29",
"2010-03-01","2010-03-15","2010-07-15","2010-09-10",
"2010-11-01","2010-11-30","2010-12-15","2010-12-31",
"2011-02-01","2012-04-01","2011-07-01","2011-07-01")
w_to <- c("2010-01-31","2010-01-15", "2010-02-13","2010-02-28",
"2010-03-16","2010-03-16","2010-08-14","2010-10-10",
"2010-12-01","2010-12-30","2010-12-20","2011-02-19",
"2011-03-23","2012-06-30","2011-07-31","2011-07-06")
df <- data.frame(ID, w_from, w_to)
df$w_from <- as.Date(df$w_from)
df$w_to <- as.Date(df$w_to)
I need to generate a group number, by ID, for the records whose time intervals overlap. As an example, and in general terms, if record#1 overlaps with record#2, and record#2 overlaps with record#3, then record#1, record#2, and record#3 all belong to the same group.
Also, if record#1 overlaps with record#2 and record#3, but record#2 doesn't overlap with record#3, then record#1, record#2, and record#3 still all belong to the same group.
In the example above and for ID=1, the first four records overlap.
Here is the final output:
Also, if this can be done using dplyr, that would be great!
Try this:
library(dplyr)
df %>%
group_by(ID) %>%
arrange(w_from, .by_group = TRUE) %>% # sort within each ID; arrange() ignores groups by default
mutate(group = 1+cumsum(
cummax(lag(as.numeric(w_to), default = first(as.numeric(w_to)))) < as.numeric(w_from)))
# A tibble: 16 x 4
# Groups: ID [2]
ID w_from w_to group
<dbl> <date> <date> <dbl>
1 1 2010-01-01 2010-01-31 1
2 1 2010-01-05 2010-01-15 1
3 1 2010-01-29 2010-02-13 1
4 1 2010-01-29 2010-02-28 1
5 1 2010-03-01 2010-03-16 2
6 1 2010-03-15 2010-03-16 2
7 1 2010-07-15 2010-08-14 3
8 1 2010-09-10 2010-10-10 4
9 1 2010-11-01 2010-12-01 5
10 1 2010-11-30 2010-12-30 5
11 1 2010-12-15 2010-12-20 5
12 1 2010-12-31 2011-02-19 6
13 1 2011-02-01 2011-03-23 6
14 2 2011-07-01 2011-07-31 1
15 2 2011-07-01 2011-07-06 1
16 2 2012-04-01 2012-06-30 2
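The cummax() and lag() combination can be hard to read, so here is a small base R sketch of the same idea for a single ID. The helper overlap_groups() is hypothetical, and it assumes the intervals are already sorted by their start date. A record starts a new group only when its start date is later than every end date seen so far.

```r
# Hypothetical helper: group numbers for intervals sorted by start date
overlap_groups <- function(from, to) {
  from <- as.numeric(from)
  to   <- as.numeric(to)
  # Latest end date among all earlier records (-Inf for the first record)
  prev_max_to <- c(-Inf, head(cummax(to), -1))
  # A new group begins whenever a record starts after that latest end date
  cumsum(from > prev_max_to)
}

# First six records of ID 1 from the question
w_from <- as.Date(c("2010-01-01", "2010-01-05", "2010-01-29",
                    "2010-01-29", "2010-03-01", "2010-03-15"))
w_to   <- as.Date(c("2010-01-31", "2010-01-15", "2010-02-13",
                    "2010-02-28", "2010-03-16", "2010-03-16"))

overlap_groups(w_from, w_to)
```

Using the running maximum rather than just the previous row's end date is what keeps record 3 in the first group: it starts after record 2 ends, but before record 1 ends.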

In search of a more efficient solution converting Wide data to long data

I want to convert the data from wide to long. I have solved the problem with reshape(), but I had to define manually which columns belong to each gathered variable. With hundreds of columns (which is the case in my data), that would be time-consuming and error-prone.
Does anyone know how to make a more efficient function to reach this result?
id <- 1001:1003
qA2 <- c(10,5,1)
qB2 <- c(11,6,3)
qC2 <- c(10,7,5)
qA3 <- c(15,12,8)
qB3 <- c(18,15,7)
qC3 <- c(19,11,10)
df <- data.frame(id,qA2,qB2,qC2, qA3, qB3, qC3)
df
id qA2 qB2 qC2 qA3 qB3 qC3
1 1001 10 11 10 15 18 19
2 1002 5 6 7 12 15 11
3 1003 1 3 5 8 7 10
Solution with reshape() (a base R function, so no extra package needs to be loaded):
df_test <- reshape(df, idvar="id", direction="long", varying=list(c(2,5), c(3,6), c(4,7)),v.names=c("qA", "qB", "qC"),times=2:3)
df_test <- df_test[order(df_test$id, df_test$time),]
df_test
id time qA qB qC
1001.2 1001 2 10 11 10
1001.3 1001 3 15 18 19
1002.2 1002 2 5 6 7
1002.3 1002 3 12 15 11
1003.2 1003 2 1 3 5
1003.3 1003 3 8 7 10
Using dplyr and tidyr, here is one way, though I'm not sure about its efficiency:
library(dplyr)
library(tidyr)
df %>%
gather(key, value, -id) %>%
mutate(key = sub("\\d+", "", key)) %>%
group_by(key) %>%
mutate(row = row_number()) %>%
spread(key, value) %>%
select(-row)
# A tibble: 6 x 4
# id qA qB qC
# <int> <dbl> <dbl> <dbl>
#1 1001 10 11 10
#2 1001 15 18 19
#3 1002 5 6 7
#4 1002 12 15 11
#5 1003 1 3 5
#6 1003 8 7 10
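With the new version of tidyr (1.0.0) (already on CRAN, just update it):
library(dplyr)
library(tidyr)
# values_to accepts a single column name, so the qA/qB/qC columns are
# created with the ".value" sentinel in names_to instead
df %>%
pivot_longer(cols = starts_with("q"),
names_to = c(".value", "time"),
names_pattern = "(q[A-Z])(\\d+)")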
Here is a base R one liner,
df1 <- cbind(id = df$id, (do.call(cbind, lapply(split.default(df[-1],
gsub('\\d+', '', names(df)[-1])), stack))[c(TRUE, FALSE)]))
df1[with(df1, order(id)),]
# id qA.values qB.values qC.values
#1 1001 10 11 10
#4 1001 15 18 19
#2 1002 5 6 7
#5 1002 12 15 11
#3 1003 1 3 5
#6 1003 8 7 10
We can use names_pattern with pivot_longer
library(tidyr)
pivot_longer(df, -id, names_to = c(".value", "time"), names_pattern= "(\\D+)(\\d+)")
# A tibble: 6 x 5
# id time qA qB qC
# <int> <chr> <dbl> <dbl> <dbl>
#1 1001 2 10 11 10
#2 1001 3 15 18 19
#3 1002 2 5 6 7
#4 1002 3 12 15 11
#5 1003 2 1 3 5
#6 1003 3 8 7 10
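If you prefer to stay with base R reshape(), the varying list does not have to be typed by hand; it can be derived from the column names. The following is a sketch under two assumptions: every column name is a letter stem followed by a numeric time suffix, and the columns of each stem appear in ascending time order with no time missing. The names cols, stems, times and varying are introduced here.

```r
df <- data.frame(id  = 1001:1003,
                 qA2 = c(10, 5, 1),  qB2 = c(11, 6, 3),  qC2 = c(10, 7, 5),
                 qA3 = c(15, 12, 8), qB3 = c(18, 15, 7), qC3 = c(19, 11, 10))

cols  <- names(df)[-1]                  # every column except id
stems <- sub("\\d+$", "", cols)         # "qA", "qB", "qC", ...
times <- sort(unique(as.integer(sub("^\\D+", "", cols))))

# One character vector of column names per stem, e.g. qA = c("qA2", "qA3")
varying <- split(cols, stems)

long <- reshape(df, idvar = "id", direction = "long",
                varying = unname(varying),
                v.names = names(varying),
                times   = times)
long <- long[order(long$id, long$time), ]
long
```

With hundreds of columns only the two regular expressions need to match your naming scheme; nothing else is listed manually.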

Repeated measures in messy format, need help to tidy

I have a very large data set containing weekly weights that have been coded with week of study and the weight at that visit. There are some missing visits and the data is not currently aligned.
df <- data.frame(ID=1:3, Week_A=c(6,6,7), Weight_A=c(23,24,23), Week_B=c(7,7,8),
Weight_B=c(25,26,27), Week_C=c(8,9,9), Weight_C=c(27,26,28))
df
ID Week_A Weight_A Week_B Weight_B Week_C Weight_C
1 1 6 23 7 25 8 27
2 2 6 24 7 26 9 26
3 3 7 23 8 27 9 28
I would like to align the data by week number (ideal output below).
df_ideal <- data.frame (ID=1:3, Week_6=c(23,24,NA), Week_7=c(25,26,23),
Week_8=c(27,NA,27), Week_9=c(NA,26,28))
df_ideal
ID Week_6 Week_7 Week_8 Week_9
1 1 23 25 27 NA
2 2 24 26 NA 26
3 3 NA 23 27 28
I would appreciate some help with this, even to find a starting point to manipulate this data to an easier to manage format.
A tidyverse solution:
df <- data.frame(ID=1:3,
Week_A=c(6,6,7),
Weight_A=c(23,24,23),
Week_B=c(7,7,8),
Weight_B=c(25,26,27),
Week_C=c(8,9,9),
Weight_C=c(27,26,28))
library(tidyverse)
df_long <- df %>% gather(key="v", value="value", -ID) %>%
separate(v, into=c("v1", "v2")) %>%
spread(v1, value) %>%
complete(ID, Week) %>%
arrange(Week, ID)
df_long
# A tibble: 12 x 4
# ID Week v2 Weight
# <int> <dbl> <chr> <dbl>
# 1 1 6 A 23
# 2 2 6 A 24
# 3 3 6 <NA> NA
# 4 1 7 B 25
# 5 2 7 B 26
# 6 3 7 A 23
# 7 1 8 C 27
# 8 2 8 <NA> NA
# 9 3 8 B 27
#10 1 9 <NA> NA
#11 2 9 C 26
#12 3 9 C 28
df_wide <- df_long %>% select(-v2) %>%
spread(Week, Weight, sep="_")
df_wide
# A tibble: 3 x 5
# ID Week_6 Week_7 Week_8 Week_9
# <int> <dbl> <dbl> <dbl> <dbl>
#1 1 23 25 27 NA
#2 2 24 26 NA 26
#3 3 NA 23 27 28
Personally, I'd keep using df_long instead of df_wide, as it is a tidy data frame, while df_wide is not.
Here is a possible approach using the data.table package
library(data.table)
#convert into a data.table
setDT(df)
#convert into a long format
mdat <- melt(df, id.vars="ID", measure.vars=patterns("^Week", "^Weight", cols=names(df)))
#pivot into desired output
ans <- dcast(mdat, ID ~ value1, value.var="value2")
ans output:
ID 6 7 8 9
1: 1 23 25 27 NA
2: 2 24 26 NA 26
3: 3 NA 23 27 28
And if you really need the "Week_" prefix in your column names, you can use paste0() (plain paste() would insert a space after the underscore):
setnames(ans, names(ans)[-1L], paste0("Week_", names(ans)[-1L]))
Another tidyverse solution using a double-gather with a final spread
df %>%
gather(k, v, -ID, -starts_with("Weight")) %>%
separate(k, into = c("k1", "k2")) %>%
unite(k1, k1, v) %>%
gather(k, v, starts_with("Weight")) %>%
separate(k, into = c("k3", "k4")) %>%
filter(k2 == k4) %>%
select(-k2, -k3, -k4) %>%
spread(k1, v)
# ID Week_6 Week_7 Week_8 Week_9
#1 1 23 25 27 NA
#2 2 24 26 NA 26
#3 3 NA 23 27 28
In base R, it's a double reshape, firstly to long and then back to wide on a different variable:
tmp <- reshape(df, idvar="ID", varying=lapply(c("Week_","Weight_"), grep, names(df)),
v.names=c("time","Week"), direction="long")
reshape(tmp, idvar="ID", direction="wide", sep="_")
# ID Week_6 Week_7 Week_8 Week_9
#1.1 1 23 25 27 NA
#2.1 2 24 26 NA 26
#3.1 3 NA 23 27 28
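As one more base R illustration of the same two steps (stack the Week/Weight pairs, then spread by week), the sketch below builds the long table by hand and uses tapply() for the final alignment. It assumes every Week_X column has a matching Weight_X column and that each ID has at most one weight per week; week_cols, weight_cols, long and wide are names introduced here.

```r
df <- data.frame(ID = 1:3,
                 Week_A = c(6, 6, 7), Weight_A = c(23, 24, 23),
                 Week_B = c(7, 7, 8), Weight_B = c(25, 26, 27),
                 Week_C = c(8, 9, 9), Weight_C = c(27, 26, 28))

# Pair each Week_X column with its Weight_X column by name
week_cols   <- grep("^Week_", names(df), value = TRUE)
weight_cols <- sub("^Week_", "Weight_", week_cols)

# Long format: one row per ID and visit
long <- data.frame(
  ID     = rep(df$ID, times = length(week_cols)),
  Week   = unlist(df[week_cols], use.names = FALSE),
  Weight = unlist(df[weight_cols], use.names = FALSE)
)

# Align by week: an ID x Week matrix, with NA where a visit is missing
wide <- tapply(long$Weight, list(ID = long$ID, Week = long$Week),
               FUN = function(x) x[1])
wide
```

The result is a matrix rather than a data frame; the long table is usually the more convenient form for further analysis.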
