Create list of dataframes subset by date range - r

I am trying to create a series of dataframes which are subset from a larger dataframe by a date range (2-year blocks), in order to do a separate survival analysis for each new dataframe. I cannot use "split" to split the dataframe based on one factor, as the data will need to be present in more than one subset.
I have some example data as follows:
library(dplyr)  # provides %>% and mutate_at()
Patient <- c(1:10)
First.Appt <- c("2014-01-01","2014-03-02","2015-05-17","2015-06-03","2016-01-12","2016-11-07","2017-07-08","2017-09-09","2018-04-12","2018-05-13")
DOD <- c("2014-01-29","2014-03-30","2015-06-14","2015-07-01","2016-02-09","2016-12-05","2017-08-05","2017-10-07","2018-05-10","2018-06-10")
First.Appt.Year <- c(2014,2014,2015,2015,2016,2016,2017,2017,2018,2018)
df <- as.data.frame(cbind(Patient, First.Appt, DOD, First.Appt.Year))%>%
mutate_at("First.Appt.Year", as.numeric)
I have created a start date (the minimum First.Appt.Year), the final start date (maximum First.Appt.Year - 1), and then a vector containing all my start dates from which to subset full 2-year blocks as follows:
Start.year <- as.numeric(min(df$First.Appt.Year))
Final.start.year <- max(df$First.Appt.Year) - 1
Start.vec <- c(Start.year:Final.start.year)
I thought to use a for loop with lapply to create a subset based on First.Appt.Year falling within the range of Start.vec and Start.vec + 1, for each value of Start.vec as follows:
for (i in 1:length(Start.vec)){
new.df = lapply(Start.vec, function(x)
subset(df, First.Appt.Year == Start.vec[i] | First.Appt.Year == Start.vec[i] + 1))
}
This almost works, but instead of creating four different dataframes (e.g. 2014-2015, 2015-2016, 2016-2017 and 2017-2018), all four of the dataframes in the output list only contain 2017-2018 values as below.
Patient  First.Appt  DOD         First.Appt.Year
7        08/07/2017  05/08/2017  2017
8        09/09/2017  07/10/2017  2017
9        12/04/2018  10/05/2018  2018
10       13/05/2018  10/06/2018  2018
Can anyone help me with what I am doing wrong and how to return the different subsets into each list object?
If there are other ways of doing this that seem more logical then please let me know.

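It looks like a simple misunderstanding about the use of lapply. You don't need to wrap it in a for loop. Inside your anonymous function you never use x; you index with Start.vec[i] instead, so every list element is the subset for the current i, and each pass of the loop overwrites new.df. After the last pass (i = 4) all four elements hold the 2017-2018 rows. Just replace your last block with: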
new.df = lapply(Start.vec, function(x) subset(df, First.Appt.Year == x | First.Appt.Year == x + 1))
And that should work. At least, it does on my side.

You are close! Instead of using both the for loop and the lapply, you need only one.
For example, with the lapply:
new.df <- lapply(Start.vec, function(x) subset(df, First.Appt.Year == x | First.Appt.Year == x + 1))
And using only the for loop:
df_list <- list()
for (i in 1:length(Start.vec)){
new.df <- subset(df, First.Appt.Year == Start.vec[i] | First.Appt.Year == Start.vec[i] + 1)
df_list <- c(df_list, list(new.df))
}
df_list
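As a small variation on the lapply approach above, the list elements can be named after the block they cover, which may make it easier to keep track of each survival analysis later. This is only a sketch: the cut-down data frame and the name blocks are illustrative, and %in% is used in place of the two == comparisons.

```r
# Cut-down version of the example data (only the columns needed here)
Patient <- 1:10
First.Appt.Year <- rep(2014:2018, each = 2)
df <- data.frame(Patient, First.Appt.Year)

# Start years of every full two-year block
Start.vec <- min(df$First.Appt.Year):(max(df$First.Appt.Year) - 1)

# One overlapping two-year subset per start year
blocks <- lapply(Start.vec, function(x) df[df$First.Appt.Year %in% c(x, x + 1), ])

# Label each element with the years it covers, e.g. "2014-2015"
names(blocks) <- paste(Start.vec, Start.vec + 1, sep = "-")

blocks[["2017-2018"]]
```

Each element can then be passed to the same analysis function with lapply(blocks, ...).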

Related

How to aggregate specific row ranges into a new data frame?

I am trying to write a bit of code to look at three imported csv2 tables; each table has a column titled 'Year'. The code will look at the Year in each and calculate the compatible year 'range' across all tables. Please see below:
table_a <- Football
min_a <- min(Football$Year)
max_a <- max(Football$Year)
table_b <- UK_Population
min_b <- min(UK_Population$Year)
max_b <- max(UK_Population$Year)
table_c <- filter(UK_House_Prices, Quarter == 'Q4')
min_c <- min(UK_House_Prices$Year)
max_c <- max(UK_House_Prices$Year)
min_high <- max(min_a,min_b,min_c)
max_low <- min(max_a,max_b,max_c)
which(with(table_a, Year == min_high))
which(with(table_b, Year == min_high))
which(with(table_c, Year == min_high))
which(with(table_a, Year == max_low))
which(with(table_b, Year == max_low))
which(with(table_c, Year == max_low))
Once I assign the which function (currently unassigned) I will have the start and end row for each table I want to use to bring that row 'range' into a data frame.
So I would like to create a data frame that combines the relevant row range from each table. Let's say each table has a column called 'xyz' to import into the new table, so the new table has four columns: 'Year' and the 'xyz_[1:3]' columns, one from each of the three tables.
I am a bit puzzled about how to do this. Should I be using a loop to create the aggregate data frame, or is there a more sensible way to do it? Any guidance would be very much appreciated.
We can place the datasets in a list and apply the code once over the list:
library(dplyr)  # for filter()
# place the datasets in a list
lst1 <- list(Football, UK_Population, filter(UK_House_Prices, Quarter == 'Q4'))
# loop over the list; sapply returns a 2-row matrix (row 1 = min, row 2 = max) with one column per dataset
m1 <- sapply(lst1, \(x) range(x$Year, na.rm = TRUE))
# find the max of the mins (first row)
min_high <- max(m1[1, ], na.rm = TRUE)
# find the min of the maxs (second row)
max_low <- min(m1[2, ], na.rm = TRUE)
# loop over the list, get the index from each of the list elements
lapply(lst1, \(x) which(with(x, Year == min_high)))
lapply(lst1, \(x) which(with(x, Year == max_low)))
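For the second half of the question (one data frame with 'Year' plus an 'xyz' column from each table), one possible route is to skip the row indices and merge on Year instead. The sketch below uses made-up toy tables in place of the three imported CSVs, and it assumes each table has at most one row per Year, so treat the names and values as hypothetical.

```r
# Hypothetical toy tables standing in for the three imported CSVs
a  <- data.frame(Year = 2000:2005, xyz = 1:6)
b  <- data.frame(Year = 2002:2008, xyz = 11:17)
c3 <- data.frame(Year = 2001:2004, xyz = 21:24)
lst <- list(a, b, c3)

# Keep Year and xyz, renaming xyz to xyz_1, xyz_2, xyz_3
lst <- Map(function(d, i) setNames(d[c("Year", "xyz")], c("Year", paste0("xyz_", i))),
           lst, seq_along(lst))

# merge() does an inner join by default, so only years present in every table survive
combined <- Reduce(function(x, y) merge(x, y, by = "Year"), lst)
combined
```

An inner join keeps the years shared by all tables, which matches the min_high to max_low range only when no table has gaps inside that range.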

subset the data frame based on multiple ranges and save each range as element in the list

I want to split a data frame into a list according to which of several ranges each value falls into, so that each range becomes one element of the list. For example, if I have 10 ranges and a data frame with nrow = n, I will get a list of 10 data frames.
The data
df<- data.frame(x=seq(33, 37, 0.12), y=seq(31,35, 0.12))
library(data.table)
range<- data.table(start =c(36.15,36.08,36.02,35.95,35.89,35.82,35.76,35.69),
end = c(36.08,36.02,35.95,35.89,35.82,35.76,35.69,35.63))
I tried
nlist<-list(
df[which(df$x>36.15),],
df[which(df$x<=36.15 & df$x>36.08),],
df[which(df$x<=36.08 & df$x>36.02),],
df[which(df$x<=36.02 & df$x>35.95),],
df[which(df$x<=35.95 & df$x>35.89),],
df[which(df$x<=35.89 & df$x>35.82),],
df[which(df$x<=35.82 & df$x>35.76),],
df[which(df$x<=35.76 & df$x>35.69),],
df[which(df$x<=35.69 & df$x>35.63),],
df[which(df$x <= 35.63),])
There are two problems. Firstly, I want to do this in a loop instead of writing out the values of each range limit. Secondly, this code:
Reduce('+', lapply(nlist, nrow))
produces the sum of rows = 35 whereas my data frame has nrow = 34. Where does this extra value come from?
You could apply over the rows of your range object:
apply(range, 1, function(z) df[df$x > z[2] & df$x <= z[1],])
You can split the data frame according to levels obtained by cutting df$x by range$start. You don't even need a loop for this:
nlist <- split(df, cut(df$x, breaks = c(-Inf, range$start, Inf)))
Or, if you want it in the same format (an unnamed list in reverse order), you can do:
nlist <- setNames(rev(split(df, cut(df$x, breaks=c(-Inf, range$start, Inf)))),NULL)
This also gives the correct answer for Reduce:
Reduce('+', lapply(nlist, nrow))
#> [1] 34
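The reason cut plus split cannot double-count is that cut assigns every value to exactly one right-closed interval. A minimal sketch of that, using plain vectors in place of the data.table; adding the last end value (35.63) as a break is an assumption made here so that the result has the same 10 bins as the hand-written list in the question:

```r
df <- data.frame(x = seq(33, 37, 0.12), y = seq(31, 35, 0.12))
start <- c(36.15, 36.08, 36.02, 35.95, 35.89, 35.82, 35.76, 35.69)
end   <- c(36.08, 36.02, 35.95, 35.89, 35.82, 35.76, 35.69, 35.63)

# All distinct boundaries, plus open-ended bins on both sides
breaks <- c(-Inf, sort(unique(c(start, end))), Inf)

# cut() puts every x into exactly one interval of the form (a, b]
bins <- cut(df$x, breaks = breaks)
nlist <- split(df, bins)

length(nlist)                          # number of bins
sum(sapply(nlist, nrow)) == nrow(df)   # every row lands in exactly one bin
```

split keeps empty factor levels, so bins that contain no rows still appear as zero-row data frames.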

How to subset a column by triplicates?

I am wondering how to subset my data based on the appearance of triplicates in a column.
t <- c(1,1,2,2,3,3,4,4,5,5,5,6,6,7,7,7,8,8)
mydf <- data.frame(t, 1:18)
I want to be able to grab only the rows that correspond to a triplicate in column t, so that I can form a new dataframe of only those rows. That would look like this where p is the vector of rows I'm looking for:
p <- c(9,10,11,14,15,16)
myidealdf[p,]
Sorry if this isn't clear, it's my first post
This should do it
keeps <- unique(t)[table(as.factor(t)) == 3]
keeps <- t %in% keeps
mydf <- mydf[keeps, ]
Using the rle function (note that rle counts consecutive runs, so this relies on equal values of t being adjacent):
which(t %in% with(rle(t), values[lengths==3]))
#> [1]  9 10 11 14 15 16
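If the values of t are not guaranteed to be sorted or adjacent, a variant that counts total occurrences per value may be safer. This is a sketch using base R's ave, not something from the answers above; the column name id is made up for the example.

```r
t <- c(1,1,2,2,3,3,4,4,5,5,5,6,6,7,7,7,8,8)
mydf <- data.frame(t, id = 1:18)

# For each row, how many times does its t value occur in the whole column?
n_per_value <- ave(t, t, FUN = length)

# Keep only rows whose t value occurs exactly three times
triplicates <- mydf[n_per_value == 3, ]
triplicates
```

Unlike the rle version, this treats three occurrences scattered through the column the same as three in a row.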

match function with data frames that are differently constructed

I am relatively new to R and I have hit a wall with trying to figure out how to do what I want to do. I went through many questions on StackOF but still could not figure it out (exactly). Here is what I am trying to do:
data frame 1:
d1 =c("2005/01/02")
d2 = c("2005/01/08")
rm = c(13)
df1 = data.frame(d1, d2, rm)
data frame 2:
df2 <- as.data.frame(seq(as.Date("2005-01-02"), as.Date("2005-01-08"), by="days"))
colnames(df2)<-c("dtime")
What I hope to create:
df2$new <- if (df2$dtime >= df1$d1 AND df2$dtime <= df1$d2) return df1$rm   # pseudocode
with the hope of creating a variable df2$new that looks like this in the end:
df2$new <- 13
View(df2)
I am essentially trying to match the value that corresponds to the week (df1$rm) to the individual days (df2$new) within that week.
I think what you might be looking for is sapply
df2$new <- sapply(df2$dtime, function(row){df1[((row >= as.Date(df1$d1)) + (row <= as.Date(df1$d2))) == 2,]$rm})
In R, vectorised operations are generally preferred to for loops. What I'm doing is taking df2's dtime column and applying function(row) to each element in turn. This gets me a lookup into df1 for each date. Will there always be an entry in df1, or do we need a default case?
If the dataframe is not too big, this would be really easy using a simple for loop:
d1 =c("2005/01/02")
d2 = c("2005/01/08")
rm = c(13)
df1 = data.frame(d1, d2, rm)
df2 <- as.data.frame(seq(as.Date("2005-01-02"), as.Date("2005-01-08"), by="days"))
colnames(df2)<-c("dtime")
df2$new <- NA
for(i in 1:nrow(df1)) df2$new[df2$dtime >= as.Date(df1$d1[i]) & df2$dtime <= as.Date(df1$d2[i])] <- df1$rm[i]
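When df1 holds many weeks, a vectorised lookup with findInterval is one alternative to looping over its rows. The sketch below assumes the weeks in df1 are sorted by start date and do not overlap; the second week and its value 21 are invented for illustration.

```r
# First row is from the question; the second week is hypothetical
df1 <- data.frame(d1 = as.Date(c("2005-01-02", "2005-01-09")),
                  d2 = as.Date(c("2005-01-08", "2005-01-15")),
                  rm = c(13, 21))
df2 <- data.frame(dtime = seq(as.Date("2005-01-02"), as.Date("2005-01-15"), by = "days"))

# Index of the latest week start that is <= each day (0 if the day precedes every week)
idx <- findInterval(as.numeric(df2$dtime), as.numeric(df1$d1))
idx[idx == 0] <- NA

# A day only matches if it also falls on or before that week's end date
inside <- !is.na(idx) & df2$dtime <= df1$d2[idx]
df2$new <- ifelse(inside, df1$rm[idx], NA)
df2
```

Days that fall outside every week get NA, which doubles as the default case.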

Creating a dataframe from an lapply function with different numbers of rows

I have a list of dates (df2) and a separate data frame with weekly dates and a measurement on that day (df1). What I need is to output a data frame containing the rows of df1 that fall within the year prior to each sample date in df2, together with their measurements.
eg1 <- data.frame(Date=seq(as.Date("2008-12-30"), as.Date("2012-01-04"), by="weeks"))
eg2 <- as.data.frame(matrix(sample(0:1000, 79*2, replace=TRUE), ncol=1))
df1 <- cbind(eg1,eg2)
df2 <- as.Date(c("2011-07-04","2010-07-28"))
A similar question I have previously asked (Outputting various subsets from one data frame based on dates) was answered effectively with daily data (where there is a balanced number of rows) through this function...
output <- as.data.frame(lapply(df2, function(x) {
df1[difftime(df1[,1], x - days(365)) >= 0 & difftime(df1[,1], x) <= 0, ]
}))
However, with weekly data the subsets have unequal numbers of rows, so this is not possible. When the 'as.data.frame' call is removed, the code works but I get a list of data frames. What I would like to do is append rows of NAs to the data frames containing fewer observations, so that I can output one data frame and then apply functions that simply ignore the NA values, e.g.
df2 <- as.Date(c("2011-01-04","2010-07-28"))
output <- as.data.frame(lapply(df2, function(x) {
df1[difftime(df1[,1], x - days(365)) >= 0 & difftime(df1[,1], x) <= 0, ]
}))
col <- c(2,4)
output_two <- output[,col]
Mean <- as.data.frame(apply(output_two, 2, mean, na.rm = TRUE))  # na.rm must be passed to mean, not to as.data.frame
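Try the following (days() comes from the lubridate package):
library(lubridate)
lst <- lapply(df2, function(x) {df1[difftime(df1[,1], x - days(365)) >= 0 &
difftime(df1[,1], x) <= 0, ]})
# largest number of rows among the subsets
n1 <- max(sapply(lst, nrow))
# indexing past the last row pads the shorter data frames with NA rows up to n1
output <- data.frame(lapply(lst, function(x) x[seq_len(n1),]))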
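If the end goal is only a summary per sample date, padding may not be needed at all: the list can be summarised directly. A minimal base R sketch under that assumption, with a toy df1 built like the one in the question (the seed and the column name V1 are arbitrary choices made here) and x - 365 used in place of lubridate's days(365):

```r
set.seed(1)  # arbitrary seed so the toy data is reproducible
df1 <- data.frame(Date = seq(as.Date("2008-12-30"), as.Date("2012-01-04"), by = "weeks"))
df1$V1 <- sample(0:1000, nrow(df1), replace = TRUE)
df2 <- as.Date(c("2011-07-04", "2010-07-28"))

# One subset per sample date: rows in the 365 days up to and including that date
lst <- lapply(df2, function(x) df1[df1$Date >= x - 365 & df1$Date <= x, ])

# Summarise each subset directly; unequal row counts are not a problem here
means <- sapply(lst, function(d) mean(d$V1))
means
```

The same pattern works for any per-subset function, and do.call(rbind, ...) can stack the results if a single data frame is wanted.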
