I have a dataframe that has two columns, x and y (both populated with numbers). I am trying to look at a moving window within the data, and I've done it like this:
# Extract just x and y from the original data frame
df <- dat_fin %>% select(x, y)
# Moving window creation
nr <- nrow(df)
windowSize <- 10
windfs <- lapply(seq_len(nr - windowSize + 1), function(i) df[i:(i + windowSize - 1), ])
This lapply creates a list of tibbles that are each 10 (x, y) pairs. At this point, I am trying to compute a single quantity using each of the sets of 10 pairs; my current (not working) code looks like this:
library(shotGroups)
for (f in 1:length(windfs)) {
  tsceps[f] = getCEP(windfs[f], accuracy = TRUE)
}
When I run this, I get the error:
Error in getCEP.default(windfs, accuracy = TRUE) : xy must be numeric
My goal is that the variable I've called tsceps should be a 1 x length(windfs) data frame, each value of which comes from the getCEP calculation for the corresponding windowed subset.
I've tried various things with unnest and unlist, all of which were unsuccessful.
What am I missing?
Working code:
df <- dat_fin %>% select(x, y)
nr <- nrow(df)
windowSize <- 10
windfs <- lapply(seq_len(nr - windowSize + 1), function(i) df[i:(i + windowSize - 1), ])
library(shotGroups)
# getCEP() returns a list per window, so collect the results in a list,
# not a numeric vector
tsceps <- vector(mode = "list", length = length(windfs))
for (j in 1:length(windfs)) {
  # note the double brackets: windfs[[j]] extracts the data frame itself,
  # while windfs[j] passes a one-element list (hence "xy must be numeric")
  tsceps[[j]] <- getCEP(windfs[[j]], type = "CorrNormal", CEPlevel = 0.50, accuracy = TRUE)
}
ults <- unlist(tsceps)
ults_cep <- as.numeric(ults)
To get this working with multiple type arguments to getCEP, just use additional code blocks for each type required.
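A more compact alternative (a sketch, not part of the original fix): since getCEP() returns a list per window, sapply() can pull out just the numeric estimate in one pass. The $CEP component name is an assumption about the return structure; inspect str(getCEP(windfs[[1]], ...)) for your shotGroups version before relying on it.
library(shotGroups)
# one numeric CEP estimate per window
# NOTE: "$CEP" is an assumption about getCEP()'s return value --
# confirm the component name with str() on a single result first
cep_values <- sapply(windfs, function(w) {
  unlist(getCEP(w, type = "CorrNormal", CEPlevel = 0.50, accuracy = TRUE)$CEP)[1]
})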
I am trying to write code that will automatically calculate the Wilcoxon test p-value for several comparisons.
Data used: two data sets with the same structure, representing two groups of participants who completed the same 5 tasks, which means that each table contains 5 columns (tasks) and X rows of task scores.
data_17_18_G2 # first data set (in data.table format)
data_18_20_G2 # second data set (in data.table format)
Both data sets have identical column names, which are to be used in the W-test as follows:
wilcox.test(Group1Task1, Group2Task1, paired = F)
wilcox.test(Group1Task2, Group2Task2, paired = F)
and so on.
The inputs (e.g., Group1Task1) are two vectors of task scores (the first one comes from data_17_18_G2 and the other from data_18_20_G2).
Desired output: a data table with a column of p-values.
The problem I faced is that no matter how I manipulated the val1 and val2 empty objects, in the second and third lines the right side "as.numeric(unlist(data_17_18_G2[, ..i]))" gives a correct output (a numeric vector), but its left side "val1[i]" always stores only one value from that vector. That gave me the idea that the main problem arose at the step of creating an empty vector; however, I wasn't able to solve it.
Empty objects:
result <- data.table(matrix(ncol=2))
val1 <- as.numeric() # here I also tried functions "numeric" and "vector"
val2 <- as.numeric()
res <- vector(mode = "list", length = 7)
For loop:
for (i in 1:5) {
  val1[i] <- as.numeric(unlist(data_17_18_G2[ , ..i]))
  val2[i] <- as.numeric(unlist(data_18_20_G2[ , ..i]))
  res[i] <- wilcox.test(val1[i], val2[i], paired = F)
  result[i, 1] <- i
  result[i, 2] <- res$p.value
}
Output:
Error in `[<-.data.table`(`*tmp*`, i, 2, value = NULL) :
When deleting columns, i should not be provided
In addition: Warning messages:
1: In val1[i] <- as.numeric(unlist(data_17_18_G2[, ..i])) :
  number of items to replace is not a multiple of replacement length
2: In val2[i] <- as.numeric(unlist(data_18_20_G2[, ..i])) :
  number of items to replace is not a multiple of replacement length
3: In res[i] <- wilcox.test(val1[i], val2[i], paired = F) :
  number of items to replace is not a multiple of replacement length
Alternative:
I changed the second and third lines:
for (i in 1:5) {
  val1[i] <- as.numeric(data_17_18_G2[ , ..i])
  val2[i] <- as.numeric(data_18_20_G2[ , ..i])
  res[i] <- wilcox.test(val1[i], val2[i], paired = F)
  result[i, 1] <- i
  result[i, 2] <- res$p.value
}
And got this:
Error in as.numeric(data_17_18_G2[, ..i]) :
(list) object cannot be coerced to type 'double'
which means that this type of input cannot be coerced to numeric for wilcox.test (a one-column data.table is a list).
How can I improve the code so that I get a data table of p-values?
There appear to be some bugs in the code. I have rewritten it using the cars dataset as an example.
## use the cars dataset as an example (replace with the real data)
library(data.table)
data(cars)
data_17_18_G2 <- as.data.table(cars)
data_18_20_G2 <- data_17_18_G2[, 2:1]
## Fixed code
result <- data.table(matrix(NA_real_, nrow = ncol(data_17_18_G2), ncol = 2))
res <- vector(mode = "list", length = ncol(data_17_18_G2))
for (i in 1:ncol(data_17_18_G2)) {
  # assign the whole column vectors; no [i] on the left-hand side
  val1 <- as.numeric(unlist(data_17_18_G2[ , ..i]))
  val2 <- as.numeric(unlist(data_18_20_G2[ , ..i]))
  # [[i]] because wilcox.test() returns a list (an "htest" object)
  res[[i]] <- wilcox.test(val1, val2, paired = F)
  result[i, 1] <- as.numeric(i)
  result[i, 2] <- as.numeric(res[[i]]$p.value)
}
Hope this gives you the output you are after.
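For what it's worth, the same table can also be built without an explicit loop. A minimal sketch (my own variant, not part of the fix above), relying on the fact that a data.table is a list of columns, so mapply() pairs the i-th columns of the two tables:
library(data.table)
# one p-value per task, pairing corresponding columns of the two tables
pvals <- mapply(function(a, b) wilcox.test(a, b, paired = FALSE)$p.value,
                data_17_18_G2, data_18_20_G2)
result <- data.table(task = seq_along(pvals), p.value = pvals)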
I have a list of lists containing data.frames, from which I want to select only a few rows. I can achieve this in a for-loop, where I create a sequence based on the number of rows and select only the row indices in that sequence.
But with more deeply nested lists it doesn't work anymore. I am also sure that there is a better way of doing this without a loop.
What would be an efficient and generic approach to sample from nested lists that vary in their dimensions and contain data.frames or matrices?
## Dummy Data
n1=100;n2=300;n3=100
crdOrig <- list(
list(data.frame(x = runif(n1,10,20), y = runif(n1,40,60))),
list(data.frame(x = runif(n2,10,20), y = runif(n2,40,60))),
list(data.frame(x = runif(n3,10,20), y = runif(n3,40,60)))
)
## Code to optimize
FiltRef <- list()
filterBy = 10
for (r in 1:length(crdOrig)) {
  tmp <- do.call(rbind, crdOrig[[r]])
  filterInd <- seq(1, nrow(tmp), by = filterBy)
  FiltRef[[r]] <- tmp[filterInd, ]
}
crdResult <- do.call(rbind, FiltRef)
# Plotting
crdOrigPl <- do.call(rbind, unlist(crdOrig, recursive = F))
plot(crdOrigPl[,1], crdOrigPl[,2], col="red", pch=20)
points(crdResult[,1], crdResult[,2], col="green", pch=20)
The code above works also if a list contains several data.frames (data below).
## Dummy Data (Multiple DF)
crdOrig <- list(
list(data.frame(x = runif(n1,10,20), y = runif(n1,40,60)),
data.frame(x = runif(n1,10,20), y = runif(n1,40,60))),
list(data.frame(x = runif(n2,10,20), y = runif(n2,40,60))),
list(data.frame(x = runif(n3,10,20), y = runif(n3,40,60)))
)
But if a list contains multiple lists, it throws an error trying to bind the result (FiltRef) together.
The result can be a data.frame with 2 columns (x, y), like crdResult, or a one-dimensional list like FiltRef (from the first example).
## Dummy Data (Multiple Lists)
crdOrig <- list(
list(list(data.frame(x = runif(n1,10,20), y = runif(n1,40,60))),
list(data.frame(x = runif(n1,10,20), y = runif(n1,40,60)))),
list(data.frame(x = runif(n2,10,20), y = runif(n2,40,60))),
list(data.frame(x = runif(n3,10,20), y = runif(n3,40,60)))
)
+1 and thank you all for your brilliant answers! They all work and there is a lot to learn from each one of them. I will give this one to @Gwang-Jin Kim, as his solution is the most flexible and extensive, although they all deserve to be checked!
Preparation and implementation of flatten
Well, there are many other answers which are in principle the same.
In the meantime, I implemented the flattening of nested lists for fun.
Since I am thinking in Lisp:
I first implemented car and cdr from Lisp.
car <- function(l) {
  if (is.list(l)) {
    if (null(l)) {
      list()
    } else {
      l[[1]]
    }
  } else {
    stop("Not a list.")  # base R has no error(); stop() signals the error
  }
}
cdr <- function(l) {
  if (is.list(l)) {
    if (null(l) || length(l) == 1) {
      list()
    } else {
      l[2:length(l)]
    }
  } else {
    stop("Not a list.")
  }
}
Some predicate functions:
null <- function(l) length(l) == 0
# this is Lisp's `null` checking whether list is empty (`length(l) == 0`)
# R's `is.null()` checks for the value NULL and not `length(obj) == 0`
# upon @Martin Morgan's comment, removed other predicate functions
# thank you, @Martin Morgan!
# instead using `is.data.frame()` and `is.list()`, since they are
# not only already there but also safer.
These are necessary to build flatten (for data frame lists):
flatten <- function(nested.list.construct) {
# Implemented Lisp's flatten tail call recursively. (`..flatten()`)
# Instead of (atom l) (is.df l).
..flatten <- function(l, acc.l) {
if (null(l)) {
acc.l
} else if (is.data.frame(l)) { # originally one checks here for is.atom(l)
acc.l[[length(acc.l) + 1]] <- l
acc.l # kind of (list* l acc.l)
} else {
..flatten(car(l), ..flatten(cdr(l), acc.l))
}
}
..flatten(nested.list.construct, list())
}
# an atom is in the widest sense a non-list object
After this, the actual function is defined using a sampling function.
Defining sampling function
# helper function
nrow <- function(df) dim(df)[1L]
# sampling function
sample.one.nth.of.rows <- function(df, fraction = 1/10) {
# Randomly selects a fraction of the rows of a data frame
nr <- nrow(df)
df[sample(nr, fraction * nr), , drop = FALSE]
}
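A quick sanity check of the sampler on a toy data frame (my own example; the seed is fixed only to make the draw reproducible):
set.seed(1)
toy <- data.frame(x = 1:20, y = runif(20))
sample.one.nth.of.rows(toy, fraction = 1/10)  # returns 2 randomly drawn rows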
The actual collector function (from nested data-frame-lists)
collect.df.samples <- function(df.list.construct, fraction = 1/10) {
do.call(rbind,
lapply(flatten(df.list.construct),
function(df) sample.one.nth.of.rows(df, fraction)
)
)
}
# thanks for the improvement with `do.call(rbind, [list])`, @Ryan!
# and for the hint that with `require(data.table)`,
# `data.table::rbindlist([list])` would be even faster.
collect.df.samples first flattens the nested list construct of data frames, df.list.construct, into a flat list of data frames. It then applies the function sample.one.nth.of.rows to each element of that list (lapply), thereby producing a list of sampled data frames, each containing the given fraction (here 1/10th) of the original data frame's rows. These sampled data frames are row-bound across the list, and the resulting data frame is returned; it consists of the sampled rows of each of the data frames.
Testing on example
## Dummy Data (Multiple Lists)
n1=100;n2=300;n3=100
crdOrig <- list(
list(list(data.frame(x = runif(n1,10,20), y = runif(n1,40,60))),
list(data.frame(x = runif(n1,10,20), y = runif(n1,40,60)))),
list(data.frame(x = runif(n2,10,20), y = runif(n2,40,60))),
list(data.frame(x = runif(n3,10,20), y = runif(n3,40,60)))
)
collect.df.samples(crdOrig, fraction = 1/10)
Refactoring for later modifications
By rewriting the collect.df.samples function as:
# sampler function
sample.10th.fraction <- function(df) sample.one.nth.of.rows(df, fraction = 1/10)
# refactored:
collect.df.samples <-
function(df.list.construct,
df.sampler.fun = sample.10th.fraction) {
do.call(rbind,
lapply(flatten(df.list.construct), df.sampler.fun))
}
one can make the sampler function replaceable.
(And if not: by changing the fraction parameter, one can increase or reduce the number of rows collected from each data frame.)
The sampler function is easily exchangeable in this definition.
For choosing every nth (e.g. every 10th) row of a data frame instead of a random sample, you could use a sampler function whose body is
df[seq(from = 1, to = nrow(df), by = nth), , drop = FALSE]
and pass it as df.sampler.fun = to collect.df.samples. This function will then be applied to every data frame in the nested list object, and the results collected into one data frame.
every.10th.rows <- function(df, nth = 10) {
df[seq(from=1, to=nrow(df), by = nth), , drop = FALSE]
}
a.10th.of.all.rows <- function(df, fraction = 1/10) {
sample.one.nth.of.rows(df, fraction)
}
collect.df.samples(crdOrig, a.10th.of.all.rows)
collect.df.samples(crdOrig, every.10th.rows)
I would just flatten the whole darn thing and work on a clean list.
library(rlist)
# list.flatten() flattens the nesting; each data.frame is itself a list,
# so its x and y columns come out as separate vectors
out <- list.flatten(crdOrig)
# prepare a vector for which columns belong together
vc <- rep(1:(length(out)/2), each = 2)
vc <- split(1:length(vc), vc)
# prepare the final list
ll <- vector("list", length(unique(vc)))
for (i in 1:length(vc)) {
  ll[[i]] <- as.data.frame(out[vc[[i]]])
}
result <- lapply(ll, FUN = function(x) {
  x[sample(1:nrow(x), size = 10, replace = FALSE), ]
})
do.call(rbind, result)
x y
98 10.32912 52.87113
52 16.42912 46.07026
92 18.85397 46.26403
90 12.04884 57.79290
23 18.20997 40.57904
27 18.98340 52.55919
...
Here's an answer in base R, borrowing from a custom "rapply" function mentioned in rapply to nested list of data frames in R:
df_samples <- list()
i <- 0  # start at 0 so the first sample lands in df_samples[[1]]
f <- function(x) {
  i <<- i + 1
  df_samples[[i]] <<- x[sample(rownames(x), 10), ]
}
recurse <- function(L, f) {
  if (inherits(L, "data.frame")) {
    f(L)
  }
  else lapply(L, recurse, f)
}
recurse(crdOrig, f)
res <- do.call("rbind", df_samples)
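The <<- assignments into the global environment work, but a closure keeps the accumulator self-contained. A small sketch of that variant (my own refactoring, reusing recurse() from above; make_collector() is a hypothetical helper, not part of the original answer):
make_collector <- function(n = 10) {
  samples <- list()  # accumulator lives in this closure, not in .GlobalEnv
  list(
    visit  = function(x) samples[[length(samples) + 1]] <<- x[sample(rownames(x), n), ],
    result = function() do.call(rbind, samples)
  )
}
col <- make_collector(10)
recurse(crdOrig, col$visit)
res <- col$result()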
I too would flatten the list-of-lists into a standard representation (and do all analysis on the flattened representation, not just the subsetting), but keep track of relevant indexing information, e.g.,
flatten_recursive = function(x) {
i <- 0L
.f = function(x, depth) {
if (is.data.frame(x)) {
i <<- i + 1L
cbind(i, depth, x)
} else {
x = lapply(x, .f, depth + 1L)
do.call(rbind, x)
}
}
.f(x, 0L)
}
The internal function .f() visits each element of a list. If the element is a data.frame, it adds a unique identifier to index it. If it's a list, it calls itself on each element of the list (incrementing a depth counter; in case this is useful, one could also add a 'group' counter) and then row-binds the elements. I use an internal function so that I can have a variable i that increments across function calls. The end result is a single data frame with an index to use for referencing the original results.
> library(dplyr)  # for %>% and as_tibble()
> tbl <- flatten_recursive(crdOrig) %>% as_tibble()
> tbl %>% group_by(i, depth) %>% summarize(n())
# A tibble: 4 x 3
# Groups: i [?]
i depth `n()`
<int> <int> <int>
1 1 3 100
2 2 3 100
3 3 2 300
4 4 2 100
> tbl %>% group_by(i) %>% slice(seq(1, n(), by = 10)) %>% summarize(n())
# A tibble: 4 x 2
i `n()`
<int> <int>
1 1 10
2 2 10
3 3 30
4 4 10
The overall pattern of .f() can be adjusted for additional data types, e.g., (some details omitted)
.f <- function(x) {
if (is.data.frame(x)) {
x
} else if (is.matrix(x)) {
x <- as.data.frame(x)
setNames(x, c("x", "y"))
} else {
do.call(rbind, lapply(x, .f))
}
}
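For concreteness, here is a filled-in sketch of that adjusted pattern (my own reconstruction of the omitted details, assuming two-column (x, y) matrices as in the question's data, and reusing the i/depth counters from flatten_recursive() above):
flatten_recursive2 <- function(x) {
  i <- 0L
  .f <- function(x, depth) {
    if (is.data.frame(x)) {
      i <<- i + 1L
      cbind(i, depth, x)
    } else if (is.matrix(x)) {
      i <<- i + 1L
      # assumes two-column (x, y) matrices, as in the question's data
      cbind(i, depth, setNames(as.data.frame(x), c("x", "y")))
    } else {
      do.call(rbind, lapply(x, .f, depth + 1L))
    }
  }
  .f(x, 0L)
}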
Consider a recursive call that conditionally checks whether the first item is of data.frame or list class.
stack_process <- function(lst) {
  if (class(lst[[1]]) == "data.frame") {
    tmp <- lst[[1]]
  }
  if (class(lst[[1]]) == "list") {
    inner <- lapply(lst, stack_process)
    tmp <- do.call(rbind, inner)
  }
  return(tmp)
}
filterBy <- 10  # thinning step, as defined in the question
new_crdOrig <- lapply(crdOrig, function(x) {
  df <- stack_process(x)
  filterInd <- seq(1, nrow(df), by = filterBy)
  return(df[filterInd, ])
})
final_df <- do.call(rbind, new_crdOrig)
I have read a series of 332 files, like below, storing the data from each file as a data frame in a list.
files <- list.files()
data <- list()
for (i in 1:332) {
  data[[i]] <- read.csv(files[[i]])
}
The data has 3 columns with the names id, city, and town. Now I need to calculate the mean of all values under city corresponding to the id values 1:10, for which I wrote the code below:
for (j in 1:10) {
  req.data <- data[[j]]$city
}
mean(na.omit(req.data))
But it is giving me a wrong value, and when I call it in a function it returns NULL values. Any help is highly appreciated.
Each time you iterate through j = 1:10 you assign data[[j]]$city to the object req.data. In doing so, for steps j = 2:10 you overwrite the previous version of req.data with the contents of the jth data set. Hence req.data only ever contains, at any one time, a single city's worth of data, and you are getting the wrong answer because you are computing the mean for the last city only, not all 10.
Also note that you could do mean(req.data, na.rm = TRUE) to remove the NAs.
You can do this without an explicit loop at the user R level using lapply(), for example, with dummy data,
set.seed(42)
data <- list(data.frame(city = rnorm(100)),
             data.frame(city = rnorm(100)),
             data.frame(city = rnorm(100)))
mean(unlist(lapply(data, `[`, "city")), na.rm = TRUE)
which gives
> mean(unlist(lapply(data, `[`, "city")), na.rm = TRUE)
[1] -0.02177902
So in your case, you need:
mean(unlist(lapply(data[1:10], `[`, "city")), na.rm = TRUE)
If you want to write a loop, then perhaps
req.data <- vector("list", length = 3)  ## allocate, adjust to length = 10
for (j in 1:3) {                        ## adjust to 1:10 for your data / Q
  req.data[[j]] <- data[[j]]$city       ## fill in
}
mean(unlist(req.data), na.rm = TRUE)
> mean(unlist(req.data), na.rm = TRUE)
[1] -0.02177902
is one way. Or alternatively, compute the mean of the individual cities and then average those means
vec <- numeric(length = 3)  ## allocate, adjust to length = 10
for (j in 1:3) {            ## adjust to 1:10 for your question
  vec[j] <- mean(data[[j]]$city, na.rm = TRUE)
}
mean(vec)
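As an aside (not part of the original answer), the per-city means can also be collected without preallocating by hand, e.g. with vapply(), which additionally checks that each result is a single number:
## on the dummy data above; use data[1:10] for the real data / Q
vec <- vapply(data, function(d) mean(d$city, na.rm = TRUE), numeric(1))
mean(vec)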
I have the following function taken from R: iterative outliers detection (this is an updated version):
dropout <- function(x) {
  outliers <- NULL
  res <- NULL
  if (length(x) < 2) return(1)
  vals <- rep.int(1, length(x))
  r <- chisq.out.test(x)
  while (r$p.value < .05 & sum(vals == 1) > 2) {
    if (grepl("highest", r$alternative)) {
      d <- which.max(ifelse(vals == 1, x, NA))
      res <- rbind(list(as.numeric(strsplit(r$alternative, " ")[[1]][3]), as.numeric(r$p.value)), fill = TRUE)
    }
    else {
      d <- which.min(ifelse(vals == 1, x, NA))
    }
    vals[d] <- r$p.value
    r <- chisq.out.test(x[vals == 1])
  }
  return(res)
}
The problem is that in each round it gives me some missing rows to fill in the data.frame.
I want to fill res, but in some iterations it contains missing values.
I have tried all the obvious things, e.g. rbindlist, rbind.fill, and rbind (with fill = TRUE), but nothing is working.
When I do something like:
res <- c(res, as.numeric(strsplit(r$alternative, " ")[[1]][3]), as.numeric(r$p.value))
it works, but it creates 2 rows for each set of (V1, V2): one with r$alternative in the last column, and a second row with the same first 2 columns but with the p-value in the last column instead.
That's how I'm calling the function, on data similar to the one in the question mentioned above:
outliers <- d[, dropout(V3), list(V1, V2)]
and I'm always getting this error: j doesn't evaluate to the same number of columns for each group
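Since the question ends here, a minimal sketch of one way to make the return shape consistent (my own suggestion, not an accepted answer): accumulate one row per detected outlier and always return the same two columns, so that data.table's j evaluates to the same number of columns for every group. Note that I record a row in both the highest and lowest branches, which I assume was the intent; chisq.out.test() is from the outliers package.
library(outliers)  # provides chisq.out.test()
dropout <- function(x) {
  empty <- data.frame(value = numeric(0), p.value = numeric(0))
  if (length(x) < 2) return(empty)
  rows <- list()
  vals <- rep.int(1, length(x))
  r <- chisq.out.test(x)
  while (r$p.value < .05 && sum(vals == 1) > 2) {
    d <- if (grepl("highest", r$alternative)) {
      which.max(ifelse(vals == 1, x, NA))
    } else {
      which.min(ifelse(vals == 1, x, NA))
    }
    # one fixed-shape row per outlier: the flagged value and its p-value
    rows[[length(rows) + 1]] <- data.frame(
      value   = as.numeric(strsplit(r$alternative, " ")[[1]][3]),
      p.value = as.numeric(r$p.value))
    vals[d] <- r$p.value
    r <- chisq.out.test(x[vals == 1])
  }
  if (length(rows)) do.call(rbind, rows) else empty
}
# groups with no outliers now contribute zero rows instead of breaking j
outliers_dt <- d[, dropout(V3), by = list(V1, V2)]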