How to aggregate specific row ranges into a new data frame? - r

I am trying to write a bit of code to look at three imported csv2 tables; each table has a column titled 'Year'. The code should look at the Year column in each table and calculate the year range common to all three tables. Please see below:
table_a <- Football
min_a <- min(Football$Year)
max_a <- max(Football$Year)
table_b <- UK_Population
min_b <- min(UK_Population$Year)
max_b <- max(UK_Population$Year)
table_c <- filter(UK_House_Prices, Quarter == 'Q4')
min_c <- min(UK_House_Prices$Year)
max_c <- max(UK_House_Prices$Year)
min_high <- max(min_a,min_b,min_c)
max_low <- min(max_a,max_b,max_c)
which(with(table_a, Year == min_high))
which(with(table_b, Year == min_high))
which(with(table_c, Year == min_high))
which(with(table_a, Year == max_low))
which(with(table_b, Year == max_low))
which(with(table_c, Year == max_low))
Once I assign the results of the which() calls (currently unassigned), I will have the start and end row for each table, which I want to use to bring that row range into a data frame.
So I would like to create a data frame that combines the relevant row range from each table. Let's say each table has a column called 'xyz' to import into the new table, so the new table has four columns: 'Year' and 'xyz_[1:3]', one from each of the three tables.
I am a bit puzzled about how to do this. Should I be using a loop to create the aggregate data frame, or is there a more sensible way to do it? Any guidance would be very much appreciated.

We can place the datasets in a list and apply the code once to every element of the list
library(dplyr)  # for filter()
# place the datasets in a list
lst1 <- list(Football, UK_Population, filter(UK_House_Prices, Quarter == 'Q4'))
# loop over the list, get the range in a matrix
# (sapply returns one column per dataset: row 1 holds the mins, row 2 the maxs)
m1 <- sapply(lst1, \(x) range(x$Year, na.rm = TRUE))
# find the max of the mins from the first row
min_high <- max(m1[1, ], na.rm = TRUE)
# find the min of the maxs from the second row
max_low <- min(m1[2, ], na.rm = TRUE)
# loop over the list, get the index from each of the list elements
lapply(lst1, \(x) which(with(x, Year == min_high)))
lapply(lst1, \(x) which(with(x, Year == max_low)))
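The question also asks how to bring the common year range into a single data frame. A minimal base R sketch follows, assuming each table has a 'Year' column and a value column called 'xyz' as described in the question. The toy tables and the names rng, trimmed and combined are made up for illustration; filtering on Year directly avoids having to look up row positions with which().

```r
# Toy stand-ins for the three tables (hypothetical data)
table_a <- data.frame(Year = 2000:2010, xyz = 1:11)
table_b <- data.frame(Year = 1995:2008, xyz = 101:114)
table_c <- data.frame(Year = 2003:2015, xyz = 201:213)
lst1 <- list(table_a, table_b, table_c)

# 2 x 3 matrix: row 1 = min Year per table, row 2 = max Year per table
rng <- sapply(lst1, function(x) range(x$Year, na.rm = TRUE))
min_high <- max(rng[1, ])  # latest starting year
max_low <- min(rng[2, ])   # earliest ending year

# Keep only the common years and rename xyz to xyz_1, xyz_2, xyz_3
trimmed <- lapply(seq_along(lst1), function(i) {
  x <- lst1[[i]]
  keep <- which(x$Year >= min_high & x$Year <= max_low)
  out <- x[keep, c("Year", "xyz")]
  names(out)[2] <- paste0("xyz_", i)
  out
})

# Merge the trimmed tables on Year, one after the other
combined <- Reduce(function(a, b) merge(a, b, by = "Year"), trimmed)
combined
```

Because merge() matches on Year, this sketch does not require the tables to be sorted or to have the years in the same row positions.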

Related

Create list of dataframes subset by date range

I am trying to create a series of dataframes which are subset from a larger dataframe by a date range (2-year blocks), in order to do a separate survival analysis for each new dataframe. I cannot use "split" to split the dataframe based on one factor, as the data will need to be present in more than one subset.
I have some example data as follows:
Patient <- c(1:10)
First.Appt <- c("2014-01-01","2014-03-02","2015-05-17","2015-06-03","2016-01-12","2016-11-07","2017-07-08","2017-09-09","2018-04-12","2018-05-13")
DOD <- c("2014-01-29","2014-03-30","2015-06-14","2015-07-01","2016-02-09","2016-12-05","2017-08-05","2017-10-07","2018-05-10","2018-06-10")
First.Appt.Year <- c(2014,2014,2015,2015,2016,2016,2017,2017,2018,2018)
df <- as.data.frame(cbind(Patient, First.Appt, DOD, First.Appt.Year))%>%
mutate_at("First.Appt.Year", as.numeric)
I have created a start date (the minimum First.Appt.Year), the final start date (maximum First.Appt.Year - 1), and then a vector containing all my start dates from which to subset full 2-year blocks as follows:
Start.year <- as.numeric(min(df$First.Appt.Year))
Final.start.year <- max(df$First.Appt.Year) - 1
Start.vec <- c(Start.year:Final.start.year)
I thought to use a for loop with lapply to create a subset based on First.Appt.Year falling within the range of Start.vec and Start.vec + 1, for each value of Start.vec as follows:
for (i in 1:length(Start.vec)){
new.df = lapply(Start.vec, function(x)
subset(df, First.Appt.Year == Start.vec[i] | First.Appt.Year == Start.vec[i] + 1))
}
This almost works, but instead of creating four different dataframes (e.g. 2014-2015, 2015-2016, 2016-2017 and 2017-2018), all four of the dataframes in the output list only contain 2017-2018 values as below.
Patient  First.Appt  DOD         First.Appt.Year
7        08/07/2017  05/08/2017  2017
8        09/09/2017  07/10/2017  2017
9        12/04/2018  10/05/2018  2018
10       13/05/2018  10/06/2018  2018
Can anyone help me with what I am doing wrong and how to return the different subsets into each list object?
If there are other ways of doing this that seem more logical then please let me know.
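It looks like a simple misunderstanding about the use of lapply. You don't need to wrap it in a for loop: the anonymous function never uses its argument x and reads Start.vec[i] instead, so every list element is the subset for the current i, and each pass of the loop overwrites new.df until only the last block (2017-2018) remains. Just replace your last block with: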
new.df = lapply(Start.vec, function(x) subset(df, First.Appt.Year == x | First.Appt.Year == x + 1))
And that should work. At least, it does on my side.
You are close! Instead of using both the for loop and the lapply, you need only one.
For example, with the lapply:
new.df <- lapply(Start.vec, function(x) subset(df, First.Appt.Year == x | First.Appt.Year == x + 1))
And using only the for loop:
df_list <- vector("list", length(Start.vec))
for (i in seq_along(Start.vec)) {
df_list[[i]] <- subset(df, First.Appt.Year == Start.vec[i] | First.Appt.Year == Start.vec[i] + 1)
}
df_list
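As a small extension of the lapply approach, the list elements can be named after the block they cover, which makes it easier to pick the right subset for each survival analysis. This is only a sketch on a cut-down version of the example data; the names starts and blocks are illustrative.

```r
# Cut-down version of the example data (only the columns needed here)
df <- data.frame(
  Patient = 1:10,
  First.Appt.Year = rep(2014:2018, each = 2)
)

# Start year of every full two-year block
starts <- min(df$First.Appt.Year):(max(df$First.Appt.Year) - 1)

# One subset per block; %in% keeps rows from the start year or the next year
blocks <- lapply(starts, function(s) df[df$First.Appt.Year %in% c(s, s + 1), ])

# Label each element with its block, e.g. "2014-2015"
names(blocks) <- paste(starts, starts + 1, sep = "-")

sapply(blocks, nrow)
```

A block can then be retrieved by name, for example blocks[["2016-2017"]].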

Substituting or summing based on condition

I have a dataset that looks something like this
df <- data.frame("id" = c("Alpha", "Alpha", "Alpha","Alpha","Beta","Beta","Beta","Beta"),
"Year" = c(1970,1971,1972,1973,1974,1977,1978,1990),
"Group" = c(1,NA,1,NA,NA,2,2,NA),
"Val" = c(2,3,3,5,2,5,3,5))
And I would like to create a cumulative sum of "Val". I know how to do the simple cumulative sum
df <- df %>% group_by(id) %>% mutate(cumval=cumsum(Val))
However, I would like my final data to look like this
final <- data.frame("id" = c("Alpha", "Alpha", "Alpha","Alpha","Beta","Beta","Beta","Beta"),
"Year" = c(1970,1971,1972,1973,1974,1977,1978,1990),
"Group" = c(1,NA,1,NA,NA,2,2,NA),
"Val" = c(2,3,3,5,2,5,3,5),
"cumval" = c(2,5,6,11,2,7,5,10))
The basic idea is that when two "Val"s belong to the same "Group", the later one (by Year) replaces the earlier one.
For instance, in the sample dataset, observation 3 has a "cumval" of 6 rather than 8 because the "Val" at 1972 replaced the "Val" at 1970. The same applies to Beta.
Thank you in advance for your help.
In my head, this requires a for loop. First we split the dataframe by the id column into a list of two. Then we create two empty lists. In the og list, we will put the row where the first unique non-NA group identifier occurs. For Alpha this is the first row and for Beta this is the second row. We will use this to subtract from the cumulative sum when the value gets substituted.
library(dplyr)  # provides %>% and bind_rows(), both used below
mylist <- split(df, f = df$id)
og <- list()
vals <- list()
df_num <- 1
We use a nested loop: the outer loop runs over each object (a dataframe in this case) in the list, and the inner loop runs over each value in the Group column.
We need to keep track of the row numbers, which we do with the r variable. We set it to 0 before the inner loop and add 1 at the start of each iteration. First we check if we are in the first row of the data frame, in which case the cumulative sum is simply equal to the value in the first row of the Val column. Then, within that if test, we use another if test to check whether the Group id is NA. If it isn't, then this is the first occurrence of a group number that will signal a substitution if it appears again. So we save the number to the temporary variable temp. We also extract and save the row that contains the value to the og list.
After this, it goes to the next iteration. We check if the current Group value is NA. If it is, then we just add the value to the cumulative sum. If it isn't NA, we check whether it is equal to the value stored in temp. If it is, then we need to substitute. We extract the original value stored in the og list and save it as old. We then subtract the old value from the cumulative sum and add the current value. We also replace the original value in og with the current replacement value. This is because if the value needs to be replaced again, we will need to subtract the current value and not the original value.
If j is not NA but it is not equal to temp, then this is a new instance of Group. So we save the row with the original value to the og list, and save the Group. The sum continues as normal, as this is not an instance of replacing a value. Note that the variable x that is used to count the elements in the og list is only incremented when a new occurrence is added to the list. Thus, og[[x-1]] will always hold the most recently stored value for the current group, which is the one to subtract on the next substitution.
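for (my_df in mylist) {
x <- 1          # next free slot in og
r <- 0          # current row number within my_df
temp <- NA      # most recent non-NA Group seen in this data frame
vals <- list()  # running sums, reset for each data frame
for (j in my_df$Group) {
r <- r + 1
if (r == 1) {
# first row: the cumulative sum is just the first value
vals[[1]] <- my_df$Val[1]
if (!is.na(j)) {
og[[x]] <- my_df[r, c('Group', 'Val'), drop = FALSE]
temp <- j
x <- x + 1
}
next
}
if (is.na(j)) {
# no group: plain cumulative sum
vals[[r]] <- vals[[r-1]] + my_df$Val[r]
} else if (!is.na(temp) && j == temp) {
# same group as before: swap the stored value for the current one
old <- og[[x-1]][, 2]
vals[[r]] <- vals[[r-1]] - old + my_df$Val[r]
og[[x-1]] <- my_df[r, c('Group', 'Val'), drop = FALSE]
} else {
# new group: add as normal and remember this row
vals[[r]] <- vals[[r-1]] + my_df$Val[r]
og[[x]] <- my_df[r, c('Group', 'Val'), drop = FALSE]
temp <- j
x <- x + 1
}
}
cumval <- unlist(vals) %>% as.data.frame()
colnames(cumval) <- 'cumval'
my_df <- cbind(my_df, cumval)
mylist[[df_num]] <- my_df
df_num <- df_num + 1
}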
Lastly, we combine the two dataframes in the list by binding them on rows with bind_rows from the dplyr package. Then I check whether the final dataframe is identical to your desired output with identical(), and it evaluates to TRUE.
final_df <- bind_rows(mylist)
identical(final_df, final)
[1] TRUE
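For comparison, the same substitution rule can be sketched without an explicit loop. This is an alternative technique, not the loop-based method described above. The idea: each row contributes its Val minus the previous Val of the same id and Group (or minus 0 if there is none), and the cumulative sum of those contributions per id gives cumval. The sketch assumes rows are already ordered by Year within each id; key, prev and delta are illustrative names.

```r
df <- data.frame(
  id = c("Alpha", "Alpha", "Alpha", "Alpha", "Beta", "Beta", "Beta", "Beta"),
  Year = c(1970, 1971, 1972, 1973, 1974, 1977, 1978, 1990),
  Group = c(1, NA, 1, NA, NA, 2, 2, NA),
  Val = c(2, 3, 3, 5, 2, 5, 3, 5)
)

# Grouping key: id plus Group; rows with NA Group get a key of their own
key <- ifelse(is.na(df$Group),
              paste(df$id, "row", seq_len(nrow(df))),
              paste(df$id, df$Group))

# Previous Val within the same key (0 when the row is the first of its key)
prev <- ave(df$Val, key, FUN = function(v) c(0, head(v, -1)))

# Net contribution of each row, then cumulative sum within id
delta <- df$Val - prev
df$cumval <- ave(delta, df$id, FUN = cumsum)
df$cumval
```

Unlike the loop, this handles several groups interleaved within one id, because each group tracks its own previous value.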

R - count occurrences in long vectors

I have a dataframe that is 6249 rows long, filled with character-type data and will likely get a lot bigger.
I want to count the number of occurrences of each string. Normally I'd use table(df) or count(df), but they both seem to stop after 250 rows.
Is there a different function or a way to force count() or table() to continue for 6000+ results?
A simple way to do this with any sized data frame is to add a count field to the data frame and then summarize the string field by count with the doBy package - like so:
require(doBy)
df$count <- 1
result <- summaryBy(count ~ string, data = df, FUN = sum, keep.names = TRUE)
As @Gregor noticed, it seems you misread the table output; it is actually counting correctly. Anyway, here is a solution using Reduce. Replace df with your own data frame and string with the name of the column you are counting over.
# let's create some dataframe with three strings randomly distributed of length 1000
df <- data.frame(string = unlist(lapply(round(runif(1000, 1, 3)), function(i) c('hi', 'ok', 'my cat')[i])))
my.count <- function(word, df) {
# count how many rows have a string equal to `word`
Reduce(function(acc, r) {
# replace 'string' by the name of the column of your dataframe over which you want to count
if(r$string == word)
acc + 1
else
acc
}, apply(df, 1, as.list), init = 0)
}
# count how many 'my cat' strings are in the df dataframe at column 'string', replace with yours
my.count('my cat', df)
# now let's try to find the frequency of all of them
uniq <- unique(df$string)
freq <- unlist(lapply(uniq, my.count, df))
names(freq) <- uniq
freq
# output
# ok my cat hi
# 490 261 249
# we can check indeed that the sum is 1000
sum(freq)
# [1] 1000
Well, this won't be popular, but in the end I achieved the desired result with a for loop, taking the number of rows in a subset.
y <- as.numeric(vector())
x <- as.numeric(vector())
for (i in test$token){
x <- as.numeric(nrow(df[(df$token == i),]))
y <- c(y, x)
}
y then becomes a vector with the number of occurrences of each string.
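A quick way to check whether table() really stops early is to compare the sum of its counts with the length of the input. The sketch below uses made-up data of the same size as in the question (tokens is a hypothetical vector) and converts the result to a data frame so it can be sorted and inspected in full.

```r
set.seed(1)  # only to make the toy data reproducible

# Hypothetical character vector with 6249 entries drawn from 500 distinct strings
tokens <- sample(paste0("token", 1:500), 6249, replace = TRUE)

counts <- table(tokens)

# If every element was counted, the counts add up to the input length
sum(counts) == length(tokens)

# Data frame form (columns: tokens, Freq), sorted by descending frequency
counts_df <- as.data.frame(counts, stringsAsFactors = FALSE)
counts_df <- counts_df[order(-counts_df$Freq), ]
head(counts_df)
```

If the sum matches, any apparent cutoff is most likely a matter of how the result is printed rather than of how it is counted.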

Creating a dataframe from an lapply function with different numbers of rows

I have a list of dates (df2) and a separate data frame with weekly dates and a measurement on each of those days (df1). What I need, for each sample date in df2, is a data frame of the dates and measurements from df1 that fall within the year before that sample date.
eg1 <- data.frame(Date=seq(as.Date("2008-12-30"), as.Date("2012-01-04"), by="weeks"))
eg2 <- as.data.frame(matrix(sample(0:1000, 79*2, replace=TRUE), ncol=1))
df1 <- cbind(eg1,eg2)
df2 <- as.Date(c("2011-07-04","2010-07-28"))
A similar question I have previously asked (Outputting various subsets from one data frame based on dates) was answered effectively with daily data (where there is a balanced number of rows) through this function...
output <- as.data.frame(lapply(df2, function(x) {
df1[difftime(df1[,1], x - days(365)) >= 0 & difftime(df1[,1], x) <= 0, ]
}))
However, with weekly data the subsets have an uneven number of rows, so this is not possible. When the 'as.data.frame' call is removed, the code works but I get a list of data frames. What I would like to do is append rows of NAs to the data frames containing fewer observations so that I can output one data frame and then apply functions that simply ignore the NA values, e.g.:
df2 <- as.Date(c("2011-01-04","2010-07-28"))
output <- as.data.frame(lapply(df2, function(x) {
df1[difftime(df1[,1], x - days(365)) >= 0 & difftime(df1[,1], x) <= 0, ]
}))
col <- c(2,4)
output_two <- output[,col]
Mean <- as.data.frame(apply(output_two, 2, mean, na.rm = TRUE))
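Try the following (days() comes from the lubridate package):
library(lubridate)
# one data frame per sample date, covering the year up to that date
lst <- lapply(df2, function(x) {df1[difftime(df1[,1], x - days(365)) >= 0 &
difftime(df1[,1], x) <= 0, ]})
# largest number of rows among the subsets
n1 <- max(sapply(lst, nrow))
# indexing past the last row pads the shorter subsets with NA rows
output <- data.frame(lapply(lst, function(x) x[seq_len(n1),]))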
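The padding step relies on the fact that indexing a data frame with row numbers beyond its last row returns rows of NA. A small self-contained sketch with two made-up data frames (a and b are hypothetical) shows the effect; resetting the row names is an extra step added here to keep the combined frame tidy.

```r
# Two toy subsets with different numbers of rows (hypothetical data)
a <- data.frame(Date = as.Date("2011-01-01") + 0:2, V1 = c(10, 20, 30))
b <- data.frame(Date = as.Date("2010-07-01") + 0:1, V1 = c(5, 15))
lst <- list(a, b)

# Largest number of rows among the subsets
n1 <- max(sapply(lst, nrow))

# Pad each subset to n1 rows; missing rows come back as NA
padded <- lapply(lst, function(x) {
  out <- x[seq_len(n1), ]
  rownames(out) <- NULL
  out
})

# Combine side by side: columns 2 and 4 hold the measurements
output <- data.frame(padded)

# Column means that skip the NA padding
colMeans(output[, c(2, 4)], na.rm = TRUE)
```

Passing na.rm = TRUE to the summary function itself is what makes the padded rows harmless.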

Subset columns based on list of column names and bring the column before it

I have a larger dataset following the same order: a unique date column, data, a unique date column, data, etc. I am trying to subset not just the data columns by name but also the date column that goes with each. The code below selects columns based on a list of names, which is part of what I want, but any ideas on how I can also grab the column immediately before each selected column?
Looking to end up with a DF containing Date1, Fire, Date3, Earth columns (using just the NameList).
Here is my reproducible code:
Cnames <- c("Date1","Fire","Date2","Water","Date3","Earth")
MAINDF <- data.frame(replicate(6,runif(120,-0.03,0.03)))
colnames(MAINDF) <- Cnames
NameList <- c("Fire","Earth")
NewDF <- MAINDF[,colnames(MAINDF) %in% NameList]
How about
NameList <- c("Fire","Earth")
idx <- match(NameList, names(MAINDF))
idx <- sort(c(idx-1, idx))
NewDF <- MAINDF[,idx]
Here we use match() to find the index of the desired column, and then we can use index subtraction to grab the column before it
Use which to get the column numbers from the names, and then it's just simple arithmetic:
col.num <- which(colnames(MAINDF) %in% NameList)
NewDF <- MAINDF[,sort(c(col.num, col.num - 1))]
Produces
Date1 Fire Date3 Earth
1 -0.010908003 0.007700453 -0.022778726 -0.016413307
2 0.022300509 0.021341360 0.014204445 -0.004492150
3 -0.021544992 0.014187158 -0.015174048 -0.000495121
4 -0.010600955 -0.006960160 -0.024535954 -0.024210771
5 -0.004694499 0.007198620 0.005543146 -0.021676692
6 -0.010623787 0.015977135 -0.027741109 -0.021102651
...
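Both answers assume that every name in NameList exists and is never the first column; otherwise idx - 1 would produce 0 or NA. A hedged sketch that checks those assumptions up front and keeps each date column next to its data column is shown below. The integer toy data replaces the random values from the question, and pairs is an illustrative name.

```r
# Toy data with the same column layout as in the question
MAINDF <- data.frame(Date1 = 1:3, Fire = 4:6, Date2 = 7:9,
                     Water = 10:12, Date3 = 13:15, Earth = 16:18)
NameList <- c("Fire", "Earth")

# Positions of the requested columns
idx <- match(NameList, names(MAINDF))

# Fail early if a name is missing or has no column before it
stopifnot(!anyNA(idx), all(idx > 1))

# Interleave "column before" and "column" for each requested name
pairs <- as.vector(rbind(idx - 1, idx))

NewDF <- MAINDF[, pairs, drop = FALSE]
names(NewDF)
```

Interleaving instead of sorting keeps the columns in the order given in NameList, which differs from the sorted versions above only when NameList is not in column order.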
