Creating a dataframe from an lapply function with different numbers of rows - r

I have a vector of sample dates (df2) and a separate data frame of weekly dates with a measurement on each date (df1). For each sample date in df2, I need to output the rows of df1 (dates and measurements) that fall within the year before that date.
eg1 <- data.frame(Date=seq(as.Date("2008-12-30"), as.Date("2012-01-04"), by="weeks"))
eg2 <- as.data.frame(matrix(sample(0:1000, 79*2, replace=TRUE), ncol=1))
df1 <- cbind(eg1,eg2)
df2 <- as.Date(c("2011-07-04","2010-07-28"))
A similar question I have previously asked (Outputting various subsets from one data frame based on dates) was answered effectively with daily data (where there is a balanced number of rows) through this function...
output <- as.data.frame(lapply(df2, function(x) {
df1[difftime(df1[,1], x - days(365)) >= 0 & difftime(df1[,1], x) <= 0, ]
}))
However, with weekly data the subsets have unequal numbers of rows, so this no longer works. When the as.data.frame call is removed, the code runs but returns a list of data frames. What I would like to do is append rows of NAs to the data frames with fewer observations so that I can output a single data frame and then apply functions that simply ignore the NA values, e.g.:
df2 <- as.Date(c("2011-01-04","2010-07-28"))
output <- as.data.frame(lapply(df2, function(x) {
df1[difftime(df1[,1], x - days(365)) >= 0 & difftime(df1[,1], x) <= 0, ]
}))
col <- c(2,4)
output_two <- output[,col]
Mean <- as.data.frame(apply(output_two, 2, mean, na.rm = TRUE))

Try this (days() comes from the lubridate package):
lst <- lapply(df2, function(x) {df1[difftime(df1[,1], x - days(365)) >= 0 &
difftime(df1[,1], x) <= 0, ]})
n1 <- max(sapply(lst, nrow))
output <- data.frame(lapply(lst, function(x) x[seq_len(n1),]))
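For reference, here is a minimal base-R sketch of the same padding idea, without lubridate. The column name Value and the seed are made up for illustration, and `x - 365` stands in for `x - days(365)`. It relies on the fact that indexing a data frame past its last row returns rows of NA.

```r
# Weekly measurements; 'Value' is a hypothetical column name for this sketch
set.seed(1)
dates <- seq(as.Date("2008-12-30"), as.Date("2012-01-04"), by = "weeks")
df1 <- data.frame(Date = dates,
                  Value = sample(0:1000, length(dates), replace = TRUE))
df2 <- as.Date(c("2011-07-04", "2010-07-28"))

# One data frame per sample date, covering the 365 days up to that date
lst <- lapply(df2, function(x) df1[df1$Date >= x - 365 & df1$Date <= x, ])

# Pad every element to the longest length; out-of-range indices give NA rows
n1 <- max(sapply(lst, nrow))
padded <- lapply(lst, function(d) d[seq_len(n1), ])

# Bind side by side and make the repeated column names unique
output <- do.call(cbind, padded)
names(output) <- make.unique(names(output))

# Column means of the two measurement columns, skipping the NA padding
colMeans(output[, c(2, 4)], na.rm = TRUE)
```

Because the padding is NA, any summary that accepts na.rm = TRUE can be applied to the combined data frame directly.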

Related

subset the data frame based on multiple ranges and save each range as element in the list

I want to split a data frame into a list according to which of several ranges its values fall into, so that the rows belonging to each range become one element of the list. For example, with 10 ranges and a data frame of n rows, I should get a list of 10 data frames.
The data
df<- data.frame(x=seq(33, 37, 0.12), y=seq(31,35, 0.12))
library(data.table)
range<- data.table(start =c(36.15,36.08,36.02,35.95,35.89,35.82,35.76,35.69),
end = c(36.08,36.02,35.95,35.89,35.82,35.76,35.69,35.63))
I tried
nlist<-list(
df[which(df$x>36.15),],
df[which(df$x<=36.15 & df$x>36.08),],
df[which(df$x<=36.08 & df$x>36.02),],
df[which(df$x<=36.02 & df$x>35.95),],
df[which(df$x<=35.95 & df$x>35.89),],
df[which(df$x<=35.89 & df$x>35.82),],
df[which(df$x<=35.82 & df$x>35.76),],
df[which(df$x<=35.76 & df$x>35.69),],
df[which(df$x<=35.69 & df$x>35.63),],
df[which(df$x <= 35.63),])
There are two problems. Firstly, I want to do this in a loop instead of writing out the values of each range limit. Secondly, this code:
Reduce('+', lapply(nlist, nrow))
produces the sum of rows = 35 whereas my data frame has nrow = 34. Where does this extra value come from?
You could apply over the rows of your range object:
apply(range, 1, function(z) df[df$x > z[2] & df$x <= z[1],])
You can split the data frame according to levels obtained by cutting df$x by range$start. You don't even need a loop for this:
nlist <- split(df, cut(df$x, breaks = c(-Inf, range$start, Inf)))
Or, if you want it in the same format (an unnamed list in reverse order), you can do:
nlist <- setNames(rev(split(df, cut(df$x, breaks=c(-Inf, range$start, Inf)))),NULL)
This also gives the correct answer for Reduce:
Reduce('+', lapply(nlist, nrow))
#> [1] 34
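One thing to watch: the breaks above are built from range$start only, so 35.63 (the last value of range$end) is not a boundary and the two lowest groups of the hand-written list end up in one element. A possible variation, sketched here with plain vectors instead of a data.table (the names start, end and breaks are just for this example), takes the boundaries from both columns:

```r
df <- data.frame(x = seq(33, 37, 0.12), y = seq(31, 35, 0.12))

# Range limits as plain vectors (same values as the 'range' table)
start <- c(36.15, 36.08, 36.02, 35.95, 35.89, 35.82, 35.76, 35.69)
end   <- c(36.08, 36.02, 35.95, 35.89, 35.82, 35.76, 35.69, 35.63)

# Every distinct limit becomes a break, plus open ends on both sides
breaks <- c(-Inf, sort(unique(c(start, end))), Inf)

# cut() uses intervals of the form (lower, upper] by default,
# which matches the "x <= upper & x > lower" conditions in the question
nlist <- split(df, cut(df$x, breaks = breaks))

length(nlist)             # number of groups
sum(sapply(nlist, nrow))  # total rows across all groups
```

Since every value of x falls into exactly one interval, the row counts always add up to nrow(df).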

Delete rows after a negative value in multiple data frames

I have multiple data frames, each an individual sequence consisting of the same columns. I need to delete all the rows after a negative value is encountered in the column "OnsetTime": not the row with the negative value itself, but every row after it. All sequences have 16 rows in total.
I think it must be possible with a loop, but I have no experience with loops in R, and I have 499 data frames whose rows I am currently deleting one sequence at a time, like this:
sequence_6 <- sequence_6[-c(11:16), ]
sequence_7 <- sequence_7[-c(11:16), ]
sequence_9 <- sequence_9[-c(6:16), ]
Is there a faster way of doing this? An example of a sequence can be seen here: example sequence
Regarding this example, I want to delete rows 7 to 16.
Data
Since the odd web configuration at work prevents me from accessing your data, I created three dataframes based on random numbers
set.seed(123); data_1 <- data.frame( value = runif(25, min = -0.1) )
set.seed(234); data_2 <- data.frame( value = runif(20, min = -0.1) )
set.seed(345); data_3 <- data.frame( value = runif(30, min = -0.1) )
First, you could create a list containing all your dataframes:
list_df <- list(data_1, data_2, data_3)
Now you can go through this list with a for loop. Since there are several steps, I find it convenient to use the package dplyr because it allows for a more readable notation:
library(dplyr)
for( i in 1:length(list_df) ){
min_row <-
list_df[[i]] %>%
mutate( id = row_number() ) %>% # add a column with row number
filter(value < 0) %>% # get the rows with negative values
summarise( min(id) ) %>% # get the first row number
as.numeric() # transform this value to a scalar (not a dataframe)
list_df[[i]] <- list_df[[i]] %>% slice(1:min_row) # get rows 1 to min_row
}
Hope it helps!
We can get the datasets into a list, assuming that the object names start with 'sequence' followed by an underscore and one or more digits. Then use lapply to loop over the list and subset the rows based on the condition:
lst1 <- lapply(mget(ls(pattern="^sequence_\\d+$")), function(x) {
i1 <- Reduce(`|`, lapply(x, `<`, 0))
#or use rowSums
#i1 <- rowSums(x < 0) > 0
i2 <- which(i1)[1]
x[seq(i2),]
}
)
data
set.seed(42)
sequence_6 <- as.data.frame(matrix(sample(-1:10, 16 *5, replace = TRUE), nrow = 16))
sequence_7 <- as.data.frame(matrix(sample(-2:10, 16 *5, replace = TRUE), nrow = 16))
sequence_9 <- as.data.frame(matrix(sample(-2:10, 16 *5, replace = TRUE), nrow = 16))
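If only the OnsetTime column should be checked, and some sequences may contain no negative value at all, a small helper along these lines might be safer. This is only a sketch: trim_after_negative and the two toy sequences are made up for illustration.

```r
# Keep rows up to and including the first negative value in 'col';
# return the data frame unchanged when there is no negative value
trim_after_negative <- function(d, col = "OnsetTime") {
  first_neg <- which(d[[col]] < 0)[1]   # NA when nothing is negative
  if (is.na(first_neg)) return(d)
  d[seq_len(first_neg), , drop = FALSE]
}

# Toy data: one sequence with a negative value, one without
sequences <- list(
  sequence_a = data.frame(OnsetTime = c(5, 3, -1, 4, 2)),
  sequence_b = data.frame(OnsetTime = c(1, 2, 3, 4, 5))
)

trimmed <- lapply(sequences, trim_after_negative)
sapply(trimmed, nrow)
```

With the real data, the list could be built with mget(ls(pattern = "^sequence_\\d+$")) as shown above.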

for loop and a function combined to calculate a formula and then a regression

A little confused about how to achieve the results I want.
I have an environment in R which consists of 5 data.frames called df[i]
So;
df1
df2
df3
df4
df5
Inside of these df´s I have 5 columns called col[j]
col1
col2
col3
col4
col5
In total I have 25 columns across 5 data frames (5 df x 5 col).
I also have a static variable called R which is a vector of numbers
I am trying to calculate for each column of each dataframe a basic formula using a function/loop. The formula for column 1 of df1 would be;
Y = df1$col1 - R
I am trying to calculate this, repeat it for each column[j:5] in df[i:5], and store the result in a new data.frame
j <- 1:5
i <- 1:5
fun <- function(x){
for(i in 1:col[j](df[i])){
Y[j] <- col[j] - R
}
}
EDIT: Added comment below for easier reading.
Y1a = df1$col1 - R
Y2a = df1$col2 - R
Y3a = df1$col3 - R
.....
.....
Y1b = df2$col1 - R
Y2b = df2$col2 - R
Y3b = df2$col3 - R
..... etc
# Put your data in a list:
dflist = mget(paste0("df", 1:5))
# Apply your function to every data frame
ylist = lapply(dflist, function(x) x - R)
# Name the resulting columns y1:y5
ylist = lapply(ylist, setNames, paste0("y", 1:5))
Have a look at How to make a list of data frames for examples and discussion of why using lists is better.
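To see the list approach end to end, here is a small self-contained sketch with made-up data. It assumes R has one value per row of each data frame, so that `x - R` subtracts R from every column; the frames are built straight into a named list here instead of being collected with mget.

```r
set.seed(1)
R <- c(0.5, 1, 1.5, 2)   # hypothetical values, one per row

# Five toy data frames df1..df5, each with columns col1..col5 and 4 rows
dflist <- setNames(
  lapply(1:5, function(i) {
    as.data.frame(matrix(rnorm(4 * 5), nrow = 4,
                         dimnames = list(NULL, paste0("col", 1:5))))
  }),
  paste0("df", 1:5)
)

# Subtract R from every column of every data frame and rename to y1..y5
ylist <- lapply(dflist, function(x) setNames(x - R, paste0("y", 1:5)))

ylist$df1   # the Y values for df1
```

Each element of ylist has the same shape as its input, so a regression can then be fitted per element with another lapply.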
tidyverse version
dplyr::mutate_all applies a function to each column of a data.frame.
So I would do it like this:
library(dplyr) # for mutate_all
library(purrr) # for map
all_df <- list(df1, df2, df3, df4, df5)
map(all_df, function(x) mutate_all(x, function(y) y - R))
It should return a list of length 5, in which each data frame contains your desired values.

extract highest and lowest values for columns in R, as well as row identifiers

Say I have some data of the following kind:
df<-as.data.frame(matrix(rnorm(10*10000, 1, .5), ncol=10))
I want a new dataframe that keeps the 10 original columns, but for every column retains only the highest 10 and lowest 10 values. Importantly, the rows have names corresponding to id values that need to be kept in the new data frame.
Thus, the end result data.frame is gonna be of dimensions m by 10, where m is very likely to be more than 20. But for every column, I want only 20 valid values.
The only way I can think of doing this is doing it manually per column, using dplyr and arrange, grabbing the top and bottom rows, and then creating a matrix from all the individual vectors. Clearly this is inefficient. Help?
Assuming you want to keep all the rows from the original dataset, where there is at least one value satisfying your condition (value among ten largest or ten smallest in the given column), you could do it like this:
# create a data frame
df<-as.data.frame(matrix(rnorm(10*10000, 1, .5), ncol=10))
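# function to find lowest 10 and highest 10 values
lowHigh <- function(x)
{
test <- x
# rank() gives each value's position in sorted order (order() gives indices instead)
r <- rank(x, ties.method = "first")
test[!(r <= 10 | r > (length(x) - 10))] <- NA
test
}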
# apply the function defined above
test2 <- apply(df, 2, lowHigh)
# use the original rownames
rownames(test2) <- rownames(df)
# keep only rows where there is value of interest
finalData <- test2[apply(apply(test2, 2, is.na), 1, sum) < 10, ]
Please note that there is definitely some smarter way of doing it...
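One possibly tidier variant of the same idea, as a sketch: build a logical mask with rank(), blank out everything else, and keep the rows that still hold at least one value. The names keep and masked are made up here, and a smaller data set is used to keep it quick.

```r
set.seed(1)
df <- as.data.frame(matrix(rnorm(10 * 1000, 1, .5), ncol = 10))

# TRUE where a value is among the 10 lowest or 10 highest of its column
keep <- sapply(df, function(x) {
  r <- rank(x, ties.method = "first")
  r <= 10 | r > length(x) - 10
})

# Blank out everything that is not of interest
masked <- as.matrix(df)
masked[!keep] <- NA
rownames(masked) <- rownames(df)   # carry over the row identifiers

# Keep only rows with at least one retained value
finalData <- masked[rowSums(keep) > 0, , drop = FALSE]

colSums(!is.na(finalData))   # number of retained values per column
dim(finalData)
```

The row names of finalData are the original row identifiers, so each retained value can be traced back to its row in df.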
Here is the data matrix with 10 highest and 10 lowest in each column,
x<-apply(df,2,function(k) k[order(k,decreasing=T)[c(1:10,(length(k)-9):length(k))]])
x is your 20 by 10 matrix.
Your row-name requirement conflicts from column to column: this matrix has only 20 rows, and the same 20 row names cannot apply to all 10 columns. Instead, here is the matrix of row orders:
x_roworder<-apply(df,2,function(k) order(k,decreasing=T)[c(1:10,(length(k)-9):length(k))])
This will give you corresponding rows in original data matrix within each column.
I offer a couple of answers to this.
A base R implementation ( I have used %>% to make it easier to read)
ix = lapply(df, function(x) order(x)[-(1:(length(x)-20)+10)]) %>%
unlist %>% unique %>% sort
df[ix,]
This abuses the fact that data frames are lists, finds the row id satisfying the condition for each column, then takes the unique ones in order as the row indices you want to keep. This should retain any row names attached to df
An alternative using dplyr (since you mentioned it) which, if I remember correctly, doesn't particularly like row names
# add id as a variable
df$id = 1:nrow(df) # or row names
df %>%
gather("col",value,-id) %>%
group_by(col) %>%
filter(min_rank(value) <= 10 | min_rank(desc(value)) <= 10) %>%
ungroup %>%
select(id) %>%
left_join(df)
Edited: To fix code alignment and make a neater filter
I'm not entirely sure what you're expecting for your return / output. But this will get you the appropriate indices
# example data
set.seed(41234L)
N <- 1000
df<-data.frame(id= 1:N, matrix(rnorm(10*N, 1, .5), ncol=10))
# for each column, extract ID's for top 10 and bottom 10 values
l1 <- lapply(df[,2:11], function(x,y, n) {
xy <- data.frame(x,y)
xy <- xy[order(xy[,1]),]
return(xy[c(1:10, (n-9):n),2])
}, y= df[,1], n = N)
# check:
xx <- sort(df[,2])
all.equal(sort(df[l1[[1]], 2]), xx[c(1:10, 991:1000)])
[1] TRUE
If you want an m * 10 matrix with these unique values, where m is the number of unique indices, you could do:
l2 <- do.call("c", l1)
l2 <- unique(l2)
df2 <- df[l2,] # in this case, m == 189
This doesn't 0 / NA the columns which you're not searching on for each row. But it's unclear what your question is trying to do.
Note
This isn't as efficient as using data.table since you're going to get a copy of the data in xy <- data.frame(x,y)
Benchmark
library(microbenchmark)
microbenchmark(ira= {
test2 <- apply(df[,2:11], 2, lowHigh);
rownames(test2) <- rownames(df);
finalData <- test2[apply(apply(test2, 2, is.na), 1, sum) < 10, ]
},
alex= {
l1 <- lapply(df[,2:11], function(x,y, n) {
xy <- data.frame(x,y)
xy <- xy[order(xy[,1]),]
return(xy[c(1:10, (n-9):n),2])
}, y= df[,1], n = N);
l2 <- unique(do.call("c", l1));
df2 <- df[l2,]
}, times= 50L)
Unit: milliseconds
expr min lq mean median uq max neval cld
ira 4.360452 4.522082 5.328403 5.140874 5.560295 8.369525 50 b
alex 3.771111 3.854477 4.054388 3.936716 4.158801 5.654280 50 a

Sorting and finding values in other data frames

I have a dataframe named commodities_3. It contains 28 columns with different commodities and 403 rows representing end-of-month data. What I need is to find the position for each row separately:
max value,
min value,
all other positives
all other negatives
Those index should then be used to locate the corresponding data in another dataframe with the same column and row characteristics called commodities_3_returns. These data should then be copied into 4 new dataframes (one dataframe for each sorting).
I know how to find the positions of the values for each row using which and which.min and which.max. But I don't know how to put this in a loop in order to do it for all 403 rows. And subsequently how to use this data to locate the corresponding data in the other dataframe commodities_3_returns.
Unfortunately I have to use a data frame because it has dates as row names, which I need to keep for indexing later, as well as NAs. It looks roughly like this:
commodities_3 <- as.data.frame(matrix(rnorm(15), nrow=5, ncol=3))
mydates <- as.Date(c("2011-01-01", "2011-01-02", "2011-01-03", "2011-01-04", "2011-01-05"))
rownames(commodities_3) <- mydates
commodities_3[3,2] <- NA
commodities_3_returns <- as.data.frame(matrix(rnorm(15), nrow=5, ncol=3))
mydates <- as.Date(c("2011-01-01", "2011-01-02", "2011-01-03", "2011-01-04", "2011-01-05"))
rownames(commodities_3_returns) <- mydates
commodities_3_returns[3,3] <- NA
As I said, I have in total 403 rows and 27 columns. In every row, there are some NA's which I have to keep as well. max.col doesn't seem to be able to handle NA's.
My desired output for the above mentioned example would be sth like this:
max_values <- as.data.frame(matrix(data=c(1:5,3,2,1,3,1), nrow=5, ncol=2, byrow=F))
If all the columns in commodities_3 are numeric, then you want a matrix, not a data frame. Then use the apply function. Some sample data, for reproducibility.
commodities_3 <- matrix(rnorm(12), nrow = 4)
commodities_3_returns <- matrix(1:12, nrow = 4)
The stats.
mins <- apply(commodities_3, 1, which.min)
maxs <- apply(commodities_3, 1, which.max)
pos <- apply(commodities_3, 1, function(x) which(x > 0)) #which is optional
neg <- apply(commodities_3, 1, function(x) which(x < 0))
Now use these in the index for commodities_3_returns. In the absence of coffee, my brain has only a clunky solution with a for loop
n_months <- nrow(commodities_3_returns)
min_returns <- numeric(n_months)
for(i in seq_len(n_months))
{
min_returns[i] <- commodities_3_returns[i, mins[i]]
}
Here is an alternate approach to get the min and max using max.col which is a C function internally. If you have a large data set, max.col works extremely fast compared to apply based solutions
mins = max.col(-commodities_3)
maxs = max.col(commodities_3)
N = NROW(commodities_3)
commodities_3_returns[cbind(1:N, mins)] # returns min
commodities_3_returns[cbind(1:N, maxs)] # returns max
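Since the question mentions NAs in every row, here is a sketch of an NA-tolerant variant that combines which.max / which.min (both skip NA) with the matrix indexing shown above. The small matrices are made up for illustration, and the sketch assumes every row has at least one non-NA value; otherwise apply would return a list.

```r
# Toy data with NAs; values chosen by hand for this sketch
commodities_3 <- matrix(c( 1, -2,  3,
                          NA,  5, -1,
                          -4, -6, NA), nrow = 3, byrow = TRUE)
commodities_3_returns <- matrix(1:9, nrow = 3, byrow = TRUE)

# Column position of the max / min in each row; NAs are ignored
maxs <- apply(commodities_3, 1, which.max)
mins <- apply(commodities_3, 1, which.min)

# Pick the matching cells from the returns matrix via (row, column) pairs
rows <- seq_len(nrow(commodities_3))
max_returns <- commodities_3_returns[cbind(rows, maxs)]
min_returns <- commodities_3_returns[cbind(rows, mins)]

max_returns
min_returns
```

The same (row, column) indexing also works on a data frame, so the dates can stay as row names.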
