Subsetting the earliest date by a factor - R

I have the following dataset:
id<-c("1a","1a","1a","1a","1a",
"2a","2a","2a","2a","2a",
"3a","3a","3a","3a","3a")
fch<-c("22/05/2020","12/01/2020","01/01/2019","10/11/2020","01/01/2019",
"10/10/2015","01/01/2015","20/10/2015","08/04/2020","12/12/2019",
"01/05/2020","01/01/2013","10/08/2019","12/01/2020","20/10/2019")
dat<-c(25,35,48,97,112,
65,85,77,89,555,
58,98,25,45,336)
data<-as.data.frame(cbind(id,fch,dat))
My intention is to extract the row corresponding to the earliest date for each level of the factor "id".
So my resulting data frame would look like this:
id<-c("1a","2a","3a")
fch<-c("01/01/2019","01/01/2015","01/01/2013")
dat<-c(48,85,98)
data_result<-as.data.frame(cbind(id,fch,dat))
This was my unsuccessful attempt:
DF1 <- data %>%
mutate(fch = as.Date(as.character(data$fch),format="%d/%m/%Y")) %>%
group_by(id) %>%
mutate(fch = min(fch)) %>%
ungroup

A slightly different method from @akrun's. Note that for one id the earliest date appears in two rows. Without a time component there is no way to know which occurred first (or maybe you want both?).
library(tidyverse)
library(lubridate)
data.frame(id = c(rep("1a",5), rep("2a",5), rep("3a",5)),
fch = c("22/05/2020","12/01/2020","01/01/2019","10/11/2020","01/01/2019",
"10/10/2015","01/01/2015","20/10/2015","08/04/2020","12/12/2019",
"01/05/2020","01/01/2013","10/08/2019","12/01/2020","20/10/2019"),
dat = c(25,35,48,97,112,65,85,77,89,555,58,98,25,45,336)) %>%
group_by(id) %>%
mutate(fch = dmy(fch)) %>%
filter(fch == min(fch)) %>%
ungroup()
# A tibble: 4 x 3
id fch dat
<chr> <date> <dbl>
1 1a 2019-01-01 48
2 1a 2019-01-01 112
3 2a 2015-01-01 85
4 3a 2013-01-01 98
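If only one row per id is wanted even when the earliest date is tied, one option is dplyr's slice_min() with with_ties = FALSE. The following is a minimal sketch, assuming dplyr >= 1.0.0; the name `earliest` is only illustrative, and which of the tied rows survives depends on row order, not on any real notion of time.

```r
library(dplyr)

# Same data as in the question, built directly with data.frame()
data <- data.frame(
  id  = rep(c("1a", "2a", "3a"), each = 5),
  fch = c("22/05/2020", "12/01/2020", "01/01/2019", "10/11/2020", "01/01/2019",
          "10/10/2015", "01/01/2015", "20/10/2015", "08/04/2020", "12/12/2019",
          "01/05/2020", "01/01/2013", "10/08/2019", "12/01/2020", "20/10/2019"),
  dat = c(25, 35, 48, 97, 112, 65, 85, 77, 89, 555, 58, 98, 25, 45, 336)
)

earliest <- data %>%
  mutate(fch = as.Date(fch, format = "%d/%m/%Y")) %>%
  group_by(id) %>%
  # with_ties = FALSE keeps exactly one row per group, even for tied dates
  slice_min(fch, n = 1, with_ties = FALSE) %>%
  ungroup()

earliest
```

With with_ties = TRUE (the default), both tied rows for id "1a" would be returned, matching the filter() approach above.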

We arrange the data by 'id' and by 'fch' converted to Date class, group by 'id', and use slice_head to take the first row of each group.
library(dplyr)
library(lubridate)
data %>%
arrange(id, dmy(fch)) %>%
group_by(id) %>%
slice_head(n = 1) %>%
ungroup
Output:
# A tibble: 3 x 3
# id fch dat
# <chr> <chr> <dbl>
#1 1a 01/01/2019 48
#2 2a 01/01/2015 85
#3 3a 01/01/2013 98
NOTE: cbind returns a matrix by default, and a matrix can hold only a single type, so the numeric 'dat' column is coerced to character. Instead, we can create the data.frame directly:
data <- data.frame(id, fch, dat)
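The coercion described in the note can be checked on a small example. This is only a sketch with shortened vectors; the names `m`, `bad` and `good` are made up for illustration.

```r
id  <- c("1a", "2a")
dat <- c(25, 35)

# cbind() on atomic vectors builds a matrix, which has a single type
m <- cbind(id, dat)
is.matrix(m)   # TRUE
typeof(m)      # "character": the numbers were coerced to strings

# Going through the matrix loses the numeric type ...
bad <- as.data.frame(m)
is.numeric(bad$dat)

# ... while data.frame() keeps each column's own type
good <- data.frame(id, dat)
is.numeric(good$dat)
```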

Related

R - Filter data to only include date X and following date

I have data structured like below, but with many more columns.
I need to filter the data to include only instances where a person has a date of X and X+1.
In this example only person B and C should remain, and only the rows with directly adjacent dates. So rows 2,3,5,6 should be the only remaining ones.
Once it is filtered I need to count how many times this occurred as well as do calculations on the other values, likely summing up the Values column for the X+1 date.
Person <- c("A","B","B","B","C","C","D","D")
Date <- c("2021-01-01","2021-01-01","2021-01-02","2021-01-04","2021-01-09","2021-01-10","2021-01-26","2021-01-29")
Values <- c(10,15,6,48,71,3,1,3)
df <- data.frame(Person, Date, Values)
df
How would I accomplish this?
library(dplyr)
end_points <- df %>%
mutate(Date = as.Date(Date)) %>%
group_by(Person) %>%
filter(Date - lag(Date) == 1 | lead(Date) - Date == 1) %>%
ungroup()
Result
end_points
# A tibble: 4 x 3
Person Date Values
<chr> <date> <dbl>
1 B 2021-01-01 15
2 B 2021-01-02 6
3 C 2021-01-09 71
4 C 2021-01-10 3
2nd part:
end_points %>%
group_by(Person) %>%
slice_max(Date) %>%
ungroup() %>%
summarize(total = sum(Values))
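The question also asks how many times an X / X+1 pair occurs. One way to get the count and the X+1 sum in one step is to keep only the rows that directly follow the previous day. This is a sketch, assuming each person has at most one row per date; `next_day` and `pair_summary` are illustrative names.

```r
library(dplyr)

Person <- c("A", "B", "B", "B", "C", "C", "D", "D")
Date   <- c("2021-01-01", "2021-01-01", "2021-01-02", "2021-01-04",
            "2021-01-09", "2021-01-10", "2021-01-26", "2021-01-29")
Values <- c(10, 15, 6, 48, 71, 3, 1, 3)
df <- data.frame(Person, Date, Values)

# Keep only the X+1 rows: those exactly one day after the person's previous row
next_day <- df %>%
  mutate(Date = as.Date(Date)) %>%
  group_by(Person) %>%
  arrange(Date, .by_group = TRUE) %>%
  filter(as.numeric(Date - lag(Date), units = "days") == 1) %>%
  ungroup()

# One row per pair, so n() counts the pairs
pair_summary <- next_day %>%
  summarise(n_pairs = n(), total = sum(Values))

pair_summary
```

Unlike slice_max(Date), this also counts several pairs for the same person, since every X+1 row is kept.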

Mutate column based on list of lists in R

I have a dataframe that I want to gather so that it is in tall format, and then mutate on another column with values based on membership of a string from another column in a list of lists. For example, I have the following data frame and list of lists:
dummy_data <- data.frame("id" = 1:20,"test1_10" = sample(1:100, 20),"test2_11" = sample(1:100, 20),
"test3_12" = sample(1:100, 20),"check1_20" = sample(1:100, 20),
"check2_21" = sample(1:100, 20),"sound1_30" = sample(1:100, 20),
"sound2_31" = sample(1:100, 20),"sound3_32" = sample(1:100, 20))
dummylist <- list(c('test1_','test2_','test3_'),c('check1_','check2_'),c('sound1_','sound2_','sound3_'))
names(dummylist) <- c('shipments','arrivals','departures')
And then I gather the data frame like so:
dummy_data <- dummy_data %>%
gather("part", "number", 2:ncol(.))
What I want to do is add a column that has the name of the list found in dummylist where the string before the underscore in the part column is a member. And I can do that like this:
dummy_data <- dummy_data %>%
mutate(Group = case_when(
str_extract(part,'.*_') %in% dummylist[[1]] ~ names(dummylist[1]),
str_extract(part,'.*_') %in% dummylist[[2]] ~ names(dummylist[2]),
str_extract(part,'.*_') %in% dummylist[[3]] ~ names(dummylist[3])
))
However, this requires a separate str_extract line for each list/group within the dummylist. And my real data has way more than 3 lists/groups. So I'm wondering if there is a more efficient way to do this mutate step to get the names of the lists in?
Any help is much appreciated, thanks!
It may be easier with a regex_left_join after converting 'dummylist' to a two-column dataset:
library(fuzzyjoin)
library(dplyr)
library(tidyr)
library(tibble)
dummy_data %>%
# // reshape to long format - pivot_longer instead of gather
pivot_longer(cols = -id, names_to = 'part', values_to = 'number') %>%
# // join with the tibble/data.frame converted dummylist
regex_left_join(dummylist %>%
enframe(name = 'Group', value = 'part') %>%
unnest(part)) %>%
rename(part = part.x) %>%
select(-part.y)
Output:
# A tibble: 160 × 4
id part number Group
<int> <chr> <int> <chr>
1 1 test1_10 72 shipments
2 1 test2_11 62 shipments
3 1 test3_12 17 shipments
4 1 check1_20 89 arrivals
5 1 check2_21 54 arrivals
6 1 sound1_30 39 departures
7 1 sound2_31 94 departures
8 1 sound3_32 95 departures
9 2 test1_10 77 shipments
10 2 test2_11 4 shipments
# … with 150 more rows
If you prepare your lookup table beforehand, you don't need any extra libraries beyond dplyr and tidyr:
lookup <- sapply(
names(dummylist),
\(nm) { setNames(rep(nm, length(dummylist[[nm]])), dummylist[[nm]]) }
) |>
setNames(nm = NULL) |>
unlist()
lookup
# test1_ test2_ test3_ check1_ check2_ sound1_ sound2_ sound3_
# "shipments" "shipments" "shipments" "arrivals" "arrivals" "departures" "departures" "departures"
Now you can just apply gsub() on the fly to translate each part inside the usual mutate() verb:
dummy_data |>
pivot_longer(-id, names_to = 'part', values_to = 'number') |>
mutate(group = lookup[gsub('^(\\w+_).*$', '\\1', part)])
# # A tibble: 160 × 4
# id part number group
# <int> <chr> <int> <chr>
# 1 1 test1_10 91 shipments
# 2 1 test2_11 74 shipments
# 3 1 test3_12 46 shipments
# 4 1 check1_20 62 arrivals
# 5 1 check2_21 7 arrivals
# 6 1 sound1_30 35 departures
# 7 1 sound2_31 23 departures
# 8 1 sound3_32 84 departures
# 9 2 test1_10 59 shipments
# 10 2 test2_11 73 shipments
# # … with 150 more rows
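The same lookup idea can also be written as an ordinary exact join. The sketch below assumes every column name contains exactly one underscore; it uses base R's stack() to flatten the named list, and the tiny `dummy_data` is a made-up stand-in for the question's data. `lookup_tbl`, `prefix` and `long` are illustrative names.

```r
library(dplyr)
library(tidyr)

dummylist <- list(
  shipments  = c("test1_", "test2_", "test3_"),
  arrivals   = c("check1_", "check2_"),
  departures = c("sound1_", "sound2_", "sound3_")
)

# Small stand-in for the question's data (values are arbitrary)
dummy_data <- data.frame(id = 1:2,
                         test1_10  = c(5, 6),
                         check2_21 = c(7, 8),
                         sound3_32 = c(9, 10))

# stack() turns the named list into two columns: values and ind (the list name)
lookup_tbl <- stack(dummylist)
names(lookup_tbl) <- c("prefix", "Group")
lookup_tbl$Group <- as.character(lookup_tbl$Group)

long <- dummy_data %>%
  pivot_longer(-id, names_to = "part", values_to = "number") %>%
  # keep everything up to and including the underscore, e.g. "test1_"
  mutate(prefix = sub("_.*$", "_", part)) %>%
  left_join(lookup_tbl, by = "prefix") %>%
  select(-prefix)

long
```

Parts whose prefix is not in the list simply get NA in Group, which makes unmatched names easy to spot.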

Sum data frame rows according to column date

I have a data frame resembling this structure:
Name 2021-01-01 2021-01-02 2021-01-03
Banana 5 23 23
Apple 90 2 15
Pear 39 7 18
The actual dataframe has dates spanning a much larger period of time.
How do I aggregate the columns together so that each column represents a week, with the data from each day being summed to form the weekly value? Giving something like this:
Name 2021-01-01 2021-01-08 2021-01-15
Banana 50 23 62
Apple 34 34 81
Pear 13 18 29
I've looked at the aggregate function but it doesn't seem quite right for this purpose.
I found a nice solution from which I learnt a lot. R really is powerful. After the edit, the column names of the output are the start dates of the respective weeks; see below.
Data
example <- data.frame(Name = "Banana",
"2021-01-01" = 1,
"2021-01-02" = 3,
"2021-01-10" = 2,
"2021-02-02" = 3)
> example
Name X2021.01.01 X2021.01.02 X2021.01.10 X2021.02.02
1 Banana 1 3 2 3
Code
library(dplyr)
out <- example %>%
tidyr::pivot_longer(cols = c(-Name)) %>%
mutate(Name2 = as.Date(name, format = "X%Y.%m.%d")) %>%
mutate(week = lubridate::week(Name2)) %>%
# group by Name as well, so weekly sums are not pooled across different names
group_by(Name, week) %>%
mutate(Sum = sum(value)) %>%
mutate(Dates = lubridate::ymd("2021-01-01") + lubridate::weeks(week - 1)) %>%
ungroup %>%
select(-name, -value, -Name2, -week) %>%
group_by_all %>%
unique %>%
tidyr::pivot_wider(id_cols = Name, values_from = Sum, names_from = Dates)
Output
# A tibble: 1 x 4
# Groups: Name [1]
Name `2021-01-01` `2021-01-08` `2021-01-29`
<chr> <dbl> <dbl> <dbl>
1 Banana 4 2 3
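A shorter variant of the same idea uses summarise() and computes each week's start date from a fixed origin, so it does not depend on the week-of-year number. This is a sketch with made-up values; it assumes the date column names were kept intact (check.names = FALSE) and that weeks are counted in 7-day blocks from the first date, here stored in the hypothetical variable `start`.

```r
library(dplyr)
library(tidyr)

# Illustrative data; check.names = FALSE keeps "2021-01-01" as a column name
wide <- data.frame(Name = c("Banana", "Apple"),
                   "2021-01-01" = c(1, 5),
                   "2021-01-02" = c(3, 2),
                   "2021-01-10" = c(2, 4),
                   "2021-02-02" = c(3, 1),
                   check.names = FALSE)

start <- as.Date("2021-01-01")  # assumed origin of the first week

weekly <- wide %>%
  pivot_longer(-Name, names_to = "date", values_to = "value") %>%
  mutate(date = as.Date(date),
         # start date of the 7-day block each date falls into
         week_start = start + 7 * (as.integer(date - start) %/% 7)) %>%
  group_by(Name, week_start) %>%
  summarise(value = sum(value), .groups = "drop") %>%
  pivot_wider(names_from = week_start, values_from = value, values_fill = 0)

weekly
```

Because the block index is computed from the day difference, the same code keeps working when the dates span more than one year.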

How can I match two sets of factor levels in a new data frame?

I have a large data frame and I want to export a new data frame that contains summary statistics of the first based on the id column.
library(tidyverse)
set.seed(123)
id = rep(c(letters[1:5]), 2)
species = c("dog","dog","cat","cat","bird","bird","cat","cat","bee","bee")
study = rep("UK",10)
freq = rpois(10, lambda=12)
df1 <- data.frame(id,species, freq,study)
df1$id<-sort(df1$id)
df1
df2 <- df1 %>% group_by(id) %>%
summarise(meanFreq= mean(freq),minFreq=min(freq))
df2
I want to keep the species name in the new data frame with the summary statistics. But if I merge by id I get redundant rows. I should only have one row per id but with the species name appended.
df3<-merge(df2,df1,by = "id")
This is what it should look like but my real data is messier than this neat set up here:
df4 = df3[seq(1, nrow(df3), 2), ]
df4
From the summarised output ('df2') we can join with the distinct rows of the selected columns of original data
library(dplyr)
df2 %>%
left_join(df1 %>%
distinct(id, species, study), by = 'id')
# A tibble: 5 x 5
# id meanFreq minFreq species study
# <fct> <dbl> <dbl> <fct> <fct>
#1 a 10.5 10 dog UK
#2 b 14.5 12 cat UK
#3 c 14.5 12 bird UK
#4 d 10 7 cat UK
#5 e 11 6 bee UK
Or use the same logic in base R:
merge(df2,unique(df1[c(1:2, 4)]),by = "id", all.x = TRUE)
Time for mutate followed by distinct:
df1 %>% group_by(id) %>%
mutate(meanFreq = mean(freq), minFreq = min(freq)) %>%
distinct(id, .keep_all = T)
Now actually there are two possibilities: either id and species are essentially the same in your df, one is just a label for the other, or the same id can have several species.
If the latter is the case, you will need to replace the last line with distinct(id, species, .keep_all = T).
This would get you:
# A tibble: 5 x 6
# Groups: id [5]
id species freq study meanFreq minFreq
<fct> <fct> <int> <fct> <dbl> <dbl>
1 a dog 10 UK 10.5 10
2 b cat 17 UK 14.5 12
3 c bird 12 UK 14.5 12
4 d cat 13 UK 10 7
5 e bee 6 UK 11 6
If your only goal is to keep the species & they are indeed the same as id, you could also just include it in the group_by:
df1 %>% group_by(id, species) %>%
summarise(meanFreq = mean(freq), minFreq = min(freq))
This would then remove study and freq - if you have the need to keep them, you can again replace summarise with mutate and then distinct with .keep_all = T argument.
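If each id is known to carry a single species and study, another option is to pick those labels inside summarise() itself with first(). This is a sketch under that assumption; the freq values below are fixed illustrative numbers rather than the rpois() draws from the question.

```r
library(dplyr)

df1 <- data.frame(
  id      = rep(letters[1:5], each = 2),
  species = c("dog", "dog", "cat", "cat", "bird", "bird", "cat", "cat", "bee", "bee"),
  study   = "UK",
  freq    = c(10, 11, 17, 12, 12, 17, 13, 7, 6, 16)  # illustrative values
)

df2 <- df1 %>%
  group_by(id) %>%
  summarise(
    species  = first(species),  # assumes one species per id
    study    = first(study),    # assumes one study per id
    meanFreq = mean(freq),
    minFreq  = min(freq)
  )

df2
```

If an id can have several species, first() silently keeps only one of them, so the distinct() or group_by(id, species) variants above are safer in that case.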

How do I create a row combining certain values of previous rows in R?

In my data, I have some lines that represent results from a repeated test. Only certain values are captured in the repeat. What I'd like to do is to create a new row with the repeat values but pulling from the initial test if the repeat values are NA or blank.
E.g. for,
Patient ID Initial/Repeat Value Value 2 Accept/Reject
A1 Initial 95 NA Reject
A1 Repeat NA 80 Accept
A2 Initial 80 70 Accept
I'd like to transform into:
Patient ID Initial/Repeat Value Value 2 Accept/Reject
A1 Repeat 95 80 Accept
A2 Initial 80 70 Accept
Thank you.
Try this:
library(zoo)
library(dplyr)
df %>%
group_by(Patient_ID) %>%
# carry the last non-NA value forward within each patient
mutate(across(everything(), ~ na.locf(.x, na.rm = FALSE, fromLast = FALSE))) %>%
filter(row_number() == n())
Output:
# A tibble: 2 x 5
# Groups: Patient_ID [2]
Patient_ID Initial_Repeat Value Value2 Accept_Reject
<chr> <chr> <int> <int> <chr>
1 A1 Repeat 95 80 Accept
2 A2 Initial 80 70 Accept
Is it always a series of NA's with a single valid value? If yes, you could take the mean of the rows, throwing away any NA's. I do this using dplyr's grouping and summarising functionality:
# Sample data:
df = read.table(text="PatientID Initial_Repeat Value Value2 Accept_Reject
A1 Initial 95 NA Reject
A1 Repeat NA 80 Accept
A2 Initial 80 70 Accept", header = TRUE)
# My solution uses the dplyr package:
library(dplyr)
answer = df %>%
group_by(PatientID) %>%
summarise(Value = mean(Value, na.rm = TRUE), Value2 = mean(Value2, na.rm = TRUE))
answer:
# A tibble: 2 x 3
PatientID Value Value2
<fctr> <dbl> <dbl>
1 A1 95 80
2 A2 80 70
Without extra libraries:
df1 <- with(df, data.frame(PatientID=tapply(PatientID, PatientID,
function(x) x[length(x)])))
df1$Inital_Repeat <- with(df, tapply(Initial_Repeat, PatientID,
function(x) levels(Initial_Repeat)[x[length(x)]]))
for (v in c('Value', 'Value2'))
df1[[v]] <- tapply(df[[v]], df$PatientID, function(x) x[!is.na(x)][1])
df1$Accept_Reject <- with(df, tapply(Accept_Reject, PatientID,
function(x) levels(Accept_Reject)[x[length(x)]]))
Output:
PatientID Inital_Repeat Value Value2 Accept_Reject
A1 1 Repeat 95 80 Accept
A2 2 Initial 80 70 Accept
Note that Inital_Repeat and Accept_Reject are factors.
EDIT: PatientID is also a factor, which is why we have 1 and 2 for PatientID. To have "A1" and "A2", change x[length(x)] on line 2 to levels(x)[x[length(x)]]. Also, levels(Initial_Repeat) on line 4 can be replaced with levels(x), so can levels(Accept_Reject) on line 8.
I have found that tools within the tidyverse also accomplish the job. This is a little slower than zoo but offers better readability and requires fewer packages to be loaded.
library(tidyverse)
df <- df %>%
group_by(Patient_ID) %>%
fill(everything(), .direction = "down") %>%
filter(row_number() == n())
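The fill-then-keep-last-row idea can also be collapsed into a single summarise() that takes the last non-missing value of every column per patient. This is a sketch using the sample rows from the question; it assumes rows are already ordered Initial before Repeat and that every column has at least one non-NA value per patient. `result` is an illustrative name.

```r
library(dplyr)

df <- read.table(text = "Patient_ID Initial_Repeat Value Value2 Accept_Reject
A1 Initial 95 NA Reject
A1 Repeat NA 80 Accept
A2 Initial 80 70 Accept", header = TRUE, stringsAsFactors = FALSE)

result <- df %>%
  group_by(Patient_ID) %>%
  # for each column, keep the last value that is not NA
  summarise(across(everything(), ~ tail(.x[!is.na(.x)], 1)))

result
```

This gives one row per patient, with the Repeat values where they exist and the Initial values filling the gaps.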
