Looping apply function over list of dataframes

I have looked through various Overflow pages with similar questions (some linked) but haven't found anything that seems to help with this complicated task.
I have a series of data frames in my workspace and I would like to loop the same function (rollmean or some version of that) over all of them, then save the results to new data frames.
I have written a couple of lines to generate a list of all data frames and a for loop that should iterate an apply statement over each data frame; however, I'm having problems accomplishing everything I'm hoping to achieve (my code and some sample data are included below):
1) I would like to restrict the rollmean function to all columns, except the 1st (or first several), so that the column(s) 'info' does not get averaged. I would also like to add this column(s) back to the output data frame.
2) I want to save the output as a new data frame (with a unique name). I do not care if it is saved to the workspace or exported as an xlsx, as I already have batch import codes written.
3) Ideally, I would like the resultant data frame to have the same number of observations as the input, whereas rollmean shrinks the data. I also do not want these to become NA, so I don't want to use fill = NA. This could be accomplished by writing a new function, passing type = "partial" in rollmean (though that still shrinks my data by 1 in my hands), or by starting the rolling mean on the nth+2 term and binding the non-averaged nth and nth+1 terms to the resulting data frame. Any way is fine.
(see picture for detail; it illustrates what the latter would look like)
My code only accomplishes parts of these things. I cannot get the for loop to work as a whole, but I can get parts to work if I run them on single data frames.
Any input is greatly appreciated because I'm out of ideas.
#reproducible data frames
a = as.data.frame(cbind(info = 1:10, matrix(rexp(200), 10)))
b = as.data.frame(cbind(info = 1:10, matrix(rexp(200), 10)))
c = as.data.frame(cbind(info = 1:10, matrix(rexp(200), 10)))
colnames(a) = c("info", 1:20)
colnames(b) = c("info", 1:20)
colnames(c) = c("info", 1:20)
#identify all dataframes for looping rollmean
dflist = as.list(ls()[sapply(mget(ls(), .GlobalEnv), is.data.frame)])
#for loop to create rolling average and save as new dataframe
for (j in 1:length(dflist)){
list = as.list(ls()[sapply(mget(ls(), .GlobalEnv), is.data.frame)])
new.names = as.character(unique(list))
smoothed = as.data.frame(
apply(
X = names(list), MARGIN = 1, FUN = rollmean, k = 3, align = 'right'))
assign(new.names[i], smoothed)
}
I also tried a nested apply approach but couldn't get it to call the rollmean/rollapply function (similar to the issue here), so I went back to for loops. If someone can make this work with nested applies, I'm down!
Picture is ideal output: Top is single input dataframe with colored boxes demonstrating a rolling average across all columns, to be iterated over each column; bottom is ideal output with colors reflecting the location of output for each colored window above

To approach this, think about one column, then one frame (which is just a list of columns), then a list of frames.
(My data used is at the bottom of the answer.)
One Column
If you don't like the reduction of zoo::rollmean, then write your own:
myrollmean <- function(x, k, ..., type=c("normal","rollin","keep"), na.rm=FALSE) {
type <- match.arg(type)
out <- zoo::rollmean(x, k, ...)
aug <- c()
if (type == "rollin") {
# effectively:
# c(mean(x[1]), mean(x[1:2]), ..., mean(x[1:j]))
# for the j=k-1 elements that precede the first from rollmean,
# when it'll become something like:
# c(mean(x[3:5]), mean(x[4:6]), ...)
aug <- sapply(seq_len(k-1), function(i) mean(x[seq_len(i)], na.rm=na.rm))
} else if (type == "keep") {
aug <- x[seq_len(k-1)]
}
out <- c(aug, out)
out
}
myrollmean(1:8, k=3) # "normal", default behavior
# [1] 2 3 4 5 6 7
myrollmean(1:8, k=3, type="rollin")
# [1] 1.0 1.5 2.0 3.0 4.0 5.0 6.0 7.0
myrollmean(1:8, k=3, type="keep")
# [1] 1 2 2 3 4 5 6 7
I caution that this implementation is a bit naïve at best, and needs to be fixed. Make sure that you understand what it is doing when you pick other than "normal" (which will not work for you, I'm just defaulting to the normal zoo::rollmean behavior). This function could easily be applied to other zoo::roll* functions.
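For readers without zoo installed, the "keep" and "rollin" behaviours can be approximated in base R. This is only a sketch under stated assumptions: roll_keep is a hypothetical name, it swaps zoo::rollmean for a cumsum-based window sum, and it assumes length(x) >= k with no NA values.

```r
# Base R sketch: rolling mean over windows of k values that keeps the
# input length. roll_keep() is a hypothetical helper, not part of zoo.
roll_keep <- function(x, k, type = c("keep", "rollin")) {
  type <- match.arg(type)
  n <- length(x)
  stopifnot(k >= 1, n >= k)
  cs <- cumsum(x)
  # mean of each full window ending at positions k..n
  full <- (cs[k:n] - c(0, head(cs, -k))) / k
  lead <- if (type == "keep") {
    x[seq_len(k - 1)]                      # first k-1 values unchanged
  } else {
    cs[seq_len(k - 1)] / seq_len(k - 1)    # growing-window means
  }
  c(lead, full)
}

roll_keep(1:8, k = 3)
# [1] 1 2 2 3 4 5 6 7
roll_keep(1:8, k = 3, type = "rollin")
# [1] 1.0 1.5 2.0 3.0 4.0 5.0 6.0 7.0
```

The two results match the "keep" and "rollin" outputs of myrollmean shown above for the same input.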
On one column of the data:
rbind(
dflist[[1]][,2], # for comparison
myrollmean(dflist[[1]][,2], k=3, type="keep")
)
# [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
# [1,] 1.865352 0.4047481 0.1466527 1.7307097 0.08952618 0.6668976 1.0743669 1.511629 1.314276 0.1565303
# [2,] 1.865352 0.4047481 0.8055844 0.7607035 0.65562952 0.8290445 0.6102636 1.084298 1.300091 0.9941452
One "frame"
Simple use of lapply, omitting the first column:
str(dflist[[1]][1:4, 1:3])
# 'data.frame': 4 obs. of 3 variables:
# $ info: num 1 2 3 4
# $ 1 : num 1.865 0.405 0.147 1.731
# $ 2 : num 0.745 1.243 0.674 1.59
dflist[[1]][-1] <- lapply(dflist[[1]][-1], myrollmean, k=3, type="keep")
str(dflist[[1]][1:4, 1:3])
# 'data.frame': 4 obs. of 3 variables:
# $ info: num 1 2 3 4
# $ 1 : num 1.865 0.405 0.806 0.761
# $ 2 : num 0.745 1.243 0.887 1.169
(For validation, column $ 1 matches the second row in the "one column" example above.)
List of "frames"
(I reset the data to what it was before I modified it above ... see the "data" code at the bottom of the answer.)
We nest the previous technique into another lapply:
dflist2 <- lapply(dflist, function(ldf) {
ldf[-1] <- lapply(ldf[-1], myrollmean, k=3, type="keep")
ldf
})
str(lapply(dflist2, function(a) a[1:4, 1:3]))
# List of 3
# $ :'data.frame': 4 obs. of 3 variables:
# ..$ info: num [1:4] 1 2 3 4
# ..$ 1 : num [1:4] 1.865 0.405 0.806 0.761
# ..$ 2 : num [1:4] 0.745 1.243 0.887 1.169
# $ :'data.frame': 4 obs. of 3 variables:
# ..$ info: num [1:4] 1 2 3 4
# ..$ 1 : num [1:4] 0.271 3.611 2.36 3.095
# ..$ 2 : num [1:4] 0.127 0.722 0.346 0.73
# $ :'data.frame': 4 obs. of 3 variables:
# ..$ info: num [1:4] 1 2 3 4
# ..$ 1 : num [1:4] 1.278 0.346 1.202 0.822
# ..$ 2 : num [1:4] 0.341 1.296 1.244 1.528
(Again, for simple validation, see that the first frame's $ 1 row shows the same rolled means as the second row of the "one column" example, above.)
PS:
if you need to skip more than just the first column, then inside the outer lapply, use instead ldf[-(1:n)] <- lapply(ldf[-(1:n)], myrollmean, k=3, type="keep") to skip the first n columns
to use a window function other than zoo::rollmean, you'll want to change the special-cases of myrollmean, though it should be straight-forward enough given this example
I use a concocted str(...) to shorten the output for display here. You should verify all of your data that it is doing what you expect for the whole of each frame.
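The first point above (skipping the first n columns) can be sketched as follows, together with giving each result a unique name as the question asks. The frames, the n_skip value and the roll_keep helper are all made up for illustration; roll_keep is a base R stand-in so that the block runs without zoo.

```r
# Hypothetical base R stand-in for myrollmean(..., type = "keep"):
# rolling mean over k values, first k-1 values left unchanged.
roll_keep <- function(x, k) {
  cs <- cumsum(x)
  c(x[seq_len(k - 1)], (cs[k:length(x)] - c(0, head(cs, -k))) / k)
}

# Two small frames whose first two columns must not be averaged (assumed layout).
frames <- list(
  first  = data.frame(id = 1:5, grp = "a", v1 = c(1, 2, 3, 4, 5), v2 = c(2, 4, 6, 8, 10)),
  second = data.frame(id = 1:5, grp = "b", v1 = c(5, 4, 3, 2, 1), v2 = c(1, 1, 1, 1, 1))
)
n_skip <- 2  # number of leading columns to leave untouched

smoothed <- lapply(frames, function(ldf) {
  keep_out <- seq_len(n_skip)
  ldf[-keep_out] <- lapply(ldf[-keep_out], roll_keep, k = 3)
  ldf
})

# Unique names for the results; the untouched columns are still in place.
names(smoothed) <- paste0(names(smoothed), "_smoothed")
str(smoothed)
```

Keeping the results in a named list makes it easy to export them later, for example by looping over names(smoothed).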
Reproducible Data
set.seed(2)
a = as.data.frame(cbind(info = 1:10, matrix(rexp(200), 10)))
b = as.data.frame(cbind(info = 1:10, matrix(rexp(200), 10)))
c = as.data.frame(cbind(info = 1:10, matrix(rexp(200), 10)))
colnames(a) = c("info", 1:20)
colnames(b) = c("info", 1:20)
colnames(c) = c("info", 1:20)
dflist <- list(a,b,c)
str(lapply(dflist, function(a) a[1:3, 1:4]))
# List of 3
# $ :'data.frame': 3 obs. of 4 variables:
# ..$ info: num [1:3] 1 2 3
# ..$ 1 : num [1:3] 1.865 0.405 0.147
# ..$ 2 : num [1:3] 0.745 1.243 0.674
# ..$ 3 : num [1:3] 0.356 0.689 0.833
# $ :'data.frame': 3 obs. of 4 variables:
# ..$ info: num [1:3] 1 2 3
# ..$ 1 : num [1:3] 0.271 3.611 3.198
# ..$ 2 : num [1:3] 0.127 0.722 0.188
# ..$ 3 : num [1:3] 1.99 2.74 4.78
# $ :'data.frame': 3 obs. of 4 variables:
# ..$ info: num [1:3] 1 2 3
# ..$ 1 : num [1:3] 1.278 0.346 1.981
# ..$ 2 : num [1:3] 0.341 1.296 2.094
# ..$ 3 : num [1:3] 1.1159 3.05877 0.00506

Below dfnames is the names of the data frames in env, the global environment -- we have named it env in case you want to later change where they are located. Note that ls has a pattern= argument and if the data frame names have a distinct pattern then dfnames <- ls(pattern=whatever) could be used instead where whatever is a suitable regular expression.
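As a small illustration of the pattern= idea, here is a sketch that uses a scratch environment instead of the global one. The object names and the "^run_" pattern are made up for the example.

```r
# Scratch environment with two data frames and two other objects
# (all names here are hypothetical).
env <- new.env()
assign("run_a", data.frame(x = 1:3), envir = env)
assign("run_b", data.frame(x = 4:6), envir = env)
assign("run_note", "not a data frame", envir = env)
assign("other", 42, envir = env)

# Select by name pattern first ...
dfnames <- ls(envir = env, pattern = "^run_")
# ... then guard against matching names that are not data frames.
dfnames <- Filter(function(nm) is.data.frame(get(nm, envir = env)), dfnames)
dfnames
# [1] "run_a" "run_b"
```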
Now define make_new, which calls rollapplyr with a new mean function mean3 that returns the last value of its input if the input vector has a length less than 3 and the mean otherwise. Then loop over the names, using rollapplyr with FUN=mean3 and partial=TRUE.
library(zoo)
env <- .GlobalEnv
dfnames <- Filter(function(x) is.data.frame(get(x, env)), ls(env))
# make_new - first version
mean3 <- function(x, k = 3) if (length(x) < k) tail(x, 1) else mean(x)
make_new <- function(df) replace(df, -1, rollapplyr(df[-1], 3, mean3, partial = TRUE))
for(nm in dfnames) env[[paste(nm, "new", sep = "_")]] <- make_new(get(nm, env))
Alternative version of make_new
An alternative to the first version of make_new shown above is the following second version. In the second version instead of defining mean3 we use just plain mean but specify a vector of widths w in rollapplyr such that w equals c(1, 1, 3, 3, ..., 3). Thus it takes the mean of just the last element for the first two input components and the mean of the 3 last elements for the rest. Note that now that we specify the widths explicitly we no longer need to specify partial= .
# make_new -- second version
make_new <- function(df) {
w <- replace(rep(3, nrow(df)), 1:2, 1)
replace(df, -1, rollapplyr(df[-1], w, mean))
}
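To see what the width vector w does, the following base R sketch builds w for a short vector and computes the same right-aligned means by hand. It does not call zoo; it only mirrors the description above (width 1 for the first two positions, width 3 afterwards).

```r
x <- c(1, 2, 3, 4, 5, 6, 7, 8)

# Same construction as in make_new: 3 everywhere except the first two positions.
w <- replace(rep(3, length(x)), 1:2, 1)
w
# [1] 1 1 3 3 3 3 3 3

# Right-aligned window: position i averages the last w[i] values ending at i.
manual <- sapply(seq_along(x), function(i) mean(x[(i - w[i] + 1):i]))
manual
# [1] 1 2 2 3 4 5 6 7
```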
Note
Normally when writing R and manipulating a set of objects, one stores the objects in a list rather than leaving them loose in the global environment. We could create such a list L like this and then use lapply to create a second list L2 containing the new versions. Either version of make_new would work here.
L <- mget(dfnames, env)
L2 <- lapply(L, make_new)
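Because mget returns a named list, L2 carries the original data frame names. If the new versions are still wanted as separate objects, one possible sketch uses list2env. The stand-in list and the target_env name below are assumptions for illustration; in the setting above the target would be env.

```r
# Stand-in for L2: a named list of already-processed data frames.
L2 <- list(a = data.frame(x = 1:2), b = data.frame(x = 3:4))

# Scratch environment for the example; use .GlobalEnv (env above) in practice.
target_env <- new.env()

# Export every element under "<name>_new".
list2env(setNames(L2, paste0(names(L2), "_new")), envir = target_env)
ls(target_env)
# [1] "a_new" "b_new"
```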

Related

Apply na.locf to multiple datasets

I have multiple datasets (Eg: data01, data02..). In all these datasets, I want to apply na.locf to var1, and create a new variable 'var2' from the locf applied 'var1'. I tried using the following code:
L=list(data01,data02)
for (i in L){i$var2 <- na.locf(i$var1)}
However, when I try to read the locf column using code:
head(data01$var2)
The result given is NULL.

There are a few problems:
in the question i is a copy of each data frame so L is not changed. Index into L to ensure that it is the data frame in L that is changed.
use na.locf0 or equivalently na.locf(..., na.rm = FALSE) to ensure that the output is the same length as the input
the data01 and data02 in L are copies of data01 and data02 and modifying one does not modify the other. That is why you get NULL.
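The copy behaviour described in these points can be checked with a tiny base R sketch. The data frame d1 is made up, and a plain column copy stands in for na.locf so that zoo is not needed.

```r
d1 <- data.frame(var1 = c(NA, 2, NA))
L <- list(d1)

# The loop variable i is a copy of each element: neither L nor d1 changes.
for (i in L) i$var2 <- i$var1
names(L[[1]])
# [1] "var1"

# Indexing into L changes the list element, but still not d1.
for (k in seq_along(L)) L[[k]]$var2 <- L[[k]]$var1
names(L[[1]])
# [1] "var1" "var2"
is.null(d1$var2)
# [1] TRUE
```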
Using the built-in BOD data frame to construct sample input:
library(zoo)
# construct sample input
BOD1 <- BOD2 <- BOD
BOD1$Time[c(1, 3)] <- BOD2$Time[c(3, 5)] <- NA
L <- list(BOD1, BOD2)
for(i in seq_along(L)) L[[i]]$Time2 <- na.locf0(L[[i]]$Time)
giving:
str(L)
List of 2
$ :'data.frame': 6 obs. of 3 variables:
..$ Time : num [1:6] NA 2 NA 4 5 7
..$ demand: num [1:6] 8.3 10.3 19 16 15.6 19.8
..$ Time2 : num [1:6] NA 2 2 4 5 7
..- attr(*, "reference")= chr "A1.4, p. 270"
$ :'data.frame': 6 obs. of 3 variables:
..$ Time : num [1:6] 1 2 NA 4 NA 7
..$ demand: num [1:6] 8.3 10.3 19 16 15.6 19.8
..$ Time2 : num [1:6] 1 2 2 4 4 7
..- attr(*, "reference")= chr "A1.4, p. 270"
Any of these would also work and instead of modifying L produce a new list:
L2 <- lapply(L, function(x) { x$Time2 <- na.locf0(x$Time); x })
L3 <- lapply(L, transform, Time2 = na.locf0(Time))
If your aim is to modify BOD1 and BOD2 themselves, as opposed to creating a list with the modified BOD1 and BOD2, then the following would do that (although it is usually better to organize objects in a list if you intend to iterate over them rather than leave them loose in the global environment).
nms <- c("BOD1", "BOD2")
for(nm in nms) assign(nm, transform(get(nm), Time2 = na.locf0(Time)))
or
nms <- c("BOD1", "BOD2")
for(nm in nms) .GlobalEnv[[nm]]$Time2 <- na.locf0(.GlobalEnv[[nm]]$Time)
or other variations.

How to retrieve name of element in list (data frame) to use it as a title of the plot?

So briefly and without further ado - is it possible to retrieve only a name of element in list and use it as a main title of plot?
Let me explain - example:
Let's create a random df:
a <- c(1,2,3,4)
b <- runif(4)
c <- runif(4)
d <- runif(4)
e <- runif(4)
f <- runif(4)
df <- data.frame(a,b,c,d,e,f)
head(df)
a b c d e f
1 1 0.9694204 0.9869154 0.5386678 0.39331278 0.15054698
2 2 0.8949330 0.9910894 0.1009689 0.03632476 0.15523628
3 3 0.4930752 0.7179144 0.6957262 0.36579883 0.32006026
4 4 0.4850141 0.5539939 0.3196953 0.14348259 0.05292068
Then I want to create a list of data frames (based on the one above) with specific columns to make a plot. In other words, I'd like to make plots where the first column of df (a) is the x axis and columns b, c, d, e and f represent the values on the y axis. Yes, there'll be 5 plots - that's the point!
So my idea was to write a simple function that creates a list of data frames based on the one created above:
my_fun <- function(x){
a <- df[1]
b <- x
aname <- "x_label"
bname <- "y_label"
df <- data.frame(a,b)
names(df) <- c(aname,bname)
return(df)
}
Run it for all (specified) columns:
df_s <- apply(df[,2:6], 2, function(x) my_fun(x))
So I have now:
class(df_s)
[1] "list"
str(df_s)
List of 5
$ b:'data.frame': 4 obs. of 2 variables:
..$ x_label: num [1:4] 1 2 3 4
..$ y_label: num [1:4] 0.969 0.895 0.493 0.485
$ c:'data.frame': 4 obs. of 2 variables:
..$ x_label: num [1:4] 1 2 3 4
..$ y_label: num [1:4] 0.987 0.991 0.718 0.554
$ d:'data.frame': 4 obs. of 2 variables:
..$ x_label: num [1:4] 1 2 3 4
..$ y_label: num [1:4] 0.539 0.101 0.696 0.32
$ e:'data.frame': 4 obs. of 2 variables:
..$ x_label: num [1:4] 1 2 3 4
..$ y_label: num [1:4] 0.3933 0.0363 0.3658 0.1435
$ f:'data.frame': 4 obs. of 2 variables:
..$ x_label: num [1:4] 1 2 3 4
..$ y_label: num [1:4] 0.1505 0.1552 0.3201 0.0529
This is what I wanted, but here's the question: I'd like to create a plot for every df in my list. As a result I want 5 plots with the main titles b, c, d, e, f respectively. The axis labels are the same in every plot; the plot title isn't. So I tried:
lapply(df_s, function(x) plot(x[2] ~ x[1], data = x, main = ???))
What should go in place of the question marks? I tried main = names(df_s)[x]; however, it didn't work...

I think the following works. However, I think it might be best to use ggplot2 instead of the plot function (unless you are saving the plots inside lapply).
lapply(1 : length(df_s), function(x)
plot(df_s[[x]][,2] ~ df_s[[x]][,1],
xlab = names(df_s[[x]])[1],
ylab = names(df_s[[x]])[2],
main = names(df_s[x])))
With ggplot2
library(ggplot2)
plot_lst <- lapply(seq_along(df_s), function(i) {
ggplot(df_s[[i]], aes(x=x_label, y=y_label)) +
geom_point() +
theme(plot.title = element_text(hjust = 0.5)) +
ggtitle(names(df_s)[i]) })
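The attempt in the question fails because lapply(df_s, ...) hands the function each data frame, not its position or name, so names(df_s)[x] has nothing valid to index with. Iterating over the names instead is one way around that. The sketch below uses a made-up two-element list in place of df_s and draws to a null PDF device so that it also runs non-interactively; drop the pdf(NULL) and dev.off() lines in an interactive session.

```r
# Small stand-in for df_s (values are invented for illustration).
df_s <- list(
  b = data.frame(x_label = 1:4, y_label = c(0.97, 0.89, 0.49, 0.49)),
  c = data.frame(x_label = 1:4, y_label = c(0.99, 0.99, 0.72, 0.55))
)

pdf(NULL)  # null device: nothing is written to disk
titles <- vapply(names(df_s), function(nm) {
  d <- df_s[[nm]]
  plot(d$x_label, d$y_label, xlab = "x_label", ylab = "y_label", main = nm)
  nm  # return the title that was used
}, character(1))
dev.off()

unname(titles)
# [1] "b" "c"
```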

Extract multiple objects from list in R

I have some output from the vegan function specaccum. It is a list of 8 objects of varying lengths;
> str(SPECIES)
List of 8
$ call : language specaccum(comm = PRETEND.DATA, method = "rarefaction")
$ method : chr "rarefaction"
$ sites : num [1:5] 1 2 3 4 5
$ richness : num [1:5] 20.9 34.5 42.8 47.4 50
$ sd : num [1:5] 1.51 2.02 1.87 1.35 0
$ perm : NULL
$ individuals: num [1:5] 25 50 75 100 125
$ freq : num [1:50] 1 2 3 2 4 3 3 3 4 2 ...
- attr(*, "class")= chr "specaccum"
I want to extract three of the lists ('richness', 'sd' and 'individuals') and convert them to columns in a data frame. I have developed a workaround;
SPECIES.rich <- data.frame(SPECIES[["richness"]])
SPECIES.sd <- data.frame(SPECIES[["sd"]])
SPECIES.individuals <- data.frame(SPECIES[["individuals"]])
SPECIES.df <- cbind(SPECIES.rich, SPECIES.sd, SPECIES.individuals)
But this seems clumsy and protracted. I wonder if anyone could suggest a neater solution? (Should I be looking at something with lapply??) Thanks!
Example data to generate the specaccum output;
set.seed(100)
PRETEND.DATA <- matrix(sample(0:1, 250, replace = TRUE), 5, 50)
library(vegan)
SPECIES <- specaccum(PRETEND.DATA, method = "rarefaction")

We can concatenate the names in a vector and extract it
SPECIES.df <- data.frame(SPECIES[c("richness", "sd", "individuals")])
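The same extraction can be tried without vegan on a mock list. The sketch below copies the rounded values from the str() output in the question; SPECIES_mock is a hypothetical stand-in, not real specaccum output.

```r
# Mock list shaped like part of the str(SPECIES) output in the question.
SPECIES_mock <- list(
  method      = "rarefaction",
  sites       = 1:5,
  richness    = c(20.9, 34.5, 42.8, 47.4, 50),
  sd          = c(1.51, 2.02, 1.87, 1.35, 0),
  individuals = c(25, 50, 75, 100, 125)
)

# Subset the list by name with [ and convert the result in one step.
SPECIES.df <- data.frame(SPECIES_mock[c("richness", "sd", "individuals")])
str(SPECIES.df)
```

The element names become the column names, so no renaming is needed afterwards.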

Another alternative, similar to akrun, is:
ctoc1 = as.data.frame(cbind(SPECIES$richness, SPECIES$sd, SPECIES$individuals))
Please note that in both cases (my answer and akrun's) you will get an error if the lengths of the columns do not match.
e.g.: SPECIES.df <- data.frame(SPECIES[c( "sd", "freq")])
Error in data.frame(richness = c(20.5549865665613, 33.5688503093388, 41.4708434700877, :
arguments imply differing number of rows:7, 47
If so, you can pad the shorter element with the length() replacement function:
length(SPECIES$sd) <- 47 # this will add NAs to increase the column length.
SPECIES.df <- data.frame(SPECIES[c("sd", "freq")])
SPECIES.df # dataframe with 2 columns and 47 rows.
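The padding behaviour of length<- is easy to verify on two short made-up vectors:

```r
short <- c(1.5, 2.0, 1.9)
long  <- 1:5

# Growing a vector with length<- fills the new positions with NA.
length(short) <- length(long)
short
# [1] 1.5 2.0 1.9  NA  NA

padded <- data.frame(short, long)
nrow(padded)
# [1] 5
```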

How can I assign every 5th sample in my dataset as a Testing dataset and the rest of the samples as the training dataset?

I have a dataset with 133 samples (133 rows and 2 columns (attributes)). I need to assign every 5th sample starting from 1 (1st, 6th, 11th, 16th, etc.) as the testing dataset and the rest as the training dataset. How can I do it?

Using the colon operator in MATLAB:
testset = dataset(1:5:end,:)
trainingset = setdiff(dataset, testset, 'rows')
You can also do this without setdiff:
testset = dataset(1:5:end,:)
trainingset = dataset;
trainingset(1:5:end, :) = [];

In R you can just use seq to generate the sequence of rows you want to keep. Then, you can either take a basic [ based subset, or you can use %in% to split the dataset into two parts.
Here's an example of the latter, with some sample data:
set.seed(1)
mydf <- data.frame(matrix(rnorm(266), ncol = 2))
mySamples <- setNames(split(mydf, rownames(mydf) %in% seq(1, 133, by = 5)),
c("training", "testing"))
str(mySamples)
# List of 2
# $ training:'data.frame': 106 obs. of 2 variables:
# ..$ X1: num [1:106] 0.184 -0.836 1.595 0.33 0.487 ...
# ..$ X2: num [1:106] 0.3066 -1.5364 -0.301 -0.5283 -0.0569 ...
# $ testing :'data.frame': 27 obs. of 2 variables:
# ..$ X1: num [1:27] -0.6265 -0.8205 1.5118 -0.0449 0.919 ...
# ..$ X2: num [1:27] -1.518 -0.652 -0.464 -1.286 -0.929 ...
Here's the alternative slightly more manual approach using [:
testingRows <- seq(1, 133, by = 5)
trainingRows <- setdiff(sequence(nrow(mydf)), testingRows)
testing <- mydf[testingRows, ]
training <- mydf[trainingRows, ]
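A third variant, sketched here under the same sample data, builds a logical index with the modulo operator, so the row count does not have to be typed in:

```r
set.seed(1)
mydf <- data.frame(matrix(rnorm(266), ncol = 2))

# TRUE for rows 1, 6, 11, ...: row number leaves remainder 1 when divided by 5.
isTest <- seq_len(nrow(mydf)) %% 5 == 1

testing  <- mydf[isTest, ]
training <- mydf[!isTest, ]
c(testing = nrow(testing), training = nrow(training))
```

With 133 rows this gives 27 testing rows and 106 training rows, the same split sizes as in the str() output above.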

R Summary based on column name length

I have the following problem:
I have a matrix with 80 columns whose names have either 10/11, 21/22, 31/32 or 42/43 characters. The names are totally different, but the length always fits into one of the four groups. Now I would like to add four columns where I get the sum of all the values of the columns corresponding to one group. Here is a little example of what I mean:
a<-rnorm(1:100)
b<-rnorm(1:100)
cc<-rnorm(1:100)
dd<-rnorm(1:100)
eee<-rnorm(1:100)
fff<-rnorm(1:100)
g<-data.frame(a,b,cc,dd,eee,fff)
g$group1<-"sum of all columns of with headers of length 1 (in this case a+b)"
g$group2<-"sum of all columns of with headers of length 2 (in this case cc+dd)"
g$group3<-"sum of all columns of with headers of length 3 (in this case eee+fff)"
I was able to transfer the matrix to a dataframe using melt() and carrying out the operation using stringr::str_length(). However, I could not transform this back to a matrix which I really need as final output. The columns are not in order and ordering would not help me much, since the number of columns depends on the outcome of the previous calculation and it would be too tedious to define dataframe ranges every time again.
Hope you can help.

You want this:
tmp <- nchar(names(g))
chargroups <- split(1:dim(g)[2], tmp)
# `chargroups` is a list of groups of columns with same number of letters in name
sapply(chargroups, function(x) {
if(length(x)>1) # rowSums can only accept 2+-dimensional object
rowSums(g[,x])
else
g[,x]
})
# `x` is, for each number of letters, a vector of column indices of `g`
The key part of this is that nchar is going to determine how long the column names are. The rest is pretty straightforward.
EDIT: In your actual code, though, you should deal with the ranges of name lengths by doing something like the following after you define tmp but before the sapply statement:
tmp[tmp==10] <- 11
tmp[tmp==21] <- 22
tmp[tmp==31] <- 32
tmp[tmp==42] <- 43
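Since the question asks for a matrix as the final output, here is a sketch of the same grouping idea applied to a matrix, with the group sums bound back on as extra columns. The toy matrix m and the "group" column prefix are assumptions; drop = FALSE replaces the if/else branch because it keeps single-column groups two-dimensional for rowSums.

```r
set.seed(42)
m <- matrix(rnorm(12), nrow = 2,
            dimnames = list(NULL, c("a", "b", "cc", "dd", "eee", "fff")))

# Column indices grouped by the length of the column name.
groups <- split(seq_len(ncol(m)), nchar(colnames(m)))

# One column of row sums per group (a matrix here because m has 2+ rows).
sums <- sapply(groups, function(idx) rowSums(m[, idx, drop = FALSE]))
colnames(sums) <- paste0("group", names(groups))

result <- cbind(m, sums)
colnames(result)
```

Any merging of neighbouring name lengths, as in the EDIT above, would be applied to the nchar() result before the split call.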

Another approach
set.seed(123)
a <- rnorm(1:100)
b <- rnorm(1:100)
cc <- rnorm(1:100)
dd <- rnorm(1:100)
eee <- rnorm(1:100)
fff <- rnorm(1:100)
g <- data.frame(a,b,cc,dd,eee,fff)
for ( i in 1:3 )
eval(parse(text = sprintf("g$group%s <- rowSums(g[nchar(names(g)) == %s])", i, i)))
str(g)
## 'data.frame': 100 obs. of 9 variables:
## $ a : num -0.5605 -0.2302 1.5587 0.0705 0.1293 ...
## $ b : num -0.71 0.257 -0.247 -0.348 -0.952 ...
## $ cc : num 2.199 1.312 -0.265 0.543 -0.414 ...
## $ dd : num -0.715 -0.753 -0.939 -1.053 -0.437 ...
## $ eee : num -0.0736 -1.1687 -0.6347 -0.0288 0.6707 ...
## $ fff : num -0.602 -0.994 1.027 0.751 -1.509 ...
## $ group1: num -1.2709 0.0267 1.312 -0.277 -0.8223 ...
## $ group2: num 1.484 0.56 -1.204 -0.509 -0.851 ...
## $ group3: num -0.675 -2.162 0.392 0.722 -0.838 ...
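The eval(parse(...)) step can usually be avoided by building the column name as a string and assigning with [[<-. A minimal sketch, using a shorter made-up data frame with the same column names:

```r
set.seed(123)
g <- data.frame(a = rnorm(5), b = rnorm(5), cc = rnorm(5),
                dd = rnorm(5), eee = rnorm(5), fff = rnorm(5))

for (i in 1:3) {
  # Columns whose names have exactly i characters; the new group columns
  # have 6-character names, so they never match i in 1:3.
  g[[paste0("group", i)]] <- rowSums(g[nchar(names(g)) == i])
}
names(g)
```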
