I have a data frame, attended, with 12 pairs of pre/post numerical metrics (columns), and I am computing a t-test for each pair.
Here is the function that does a single test:
attended_test <- function(pre, post) {
  tryCatch(
    t.test(log10(attended[pre] + 1), log10(attended[post] + 1),
           alternative = "greater", paired = FALSE,
           var.equal = FALSE, conf.level = 0.95),
    error = function(e) c("NA","NA","NA","NA","NA","NA","NA","NA","NA")
  )
}
Creating vectors that correspond to data frame's columns:
pre <- as.list(c(4,5,6,7,8,9,16,17,18,19,20,21))
post <- as.list(c(10,11,12,13,14,15,22,23,24,25,26,27))
Applying test function to each pair of columns:
attended_test_results_list <- mapply(attended_test, pre, post, SIMPLIFY = FALSE)
The problem I'm having is unlisting attended_test_results_list into a single data frame. This structure is a list of 12 list objects for each test (aka nested list).
I identified the attributes I want from each test result's list:
data.frame(t(unlist(attended_test_results_list[[1]][c("estimate","p.value","statistic","conf.int")])))
Which has an output like so:
estimate.mean.of.x estimate.mean.of.y p.value statistic.t conf.int1 conf.int2
1 0.2476742 0.2530888 0.5950925 -0.2407039 -0.04243605 Inf
I want to create a single data frame with 1 row for each test (12 rows) like above. I've used lapply plenty of times, and I understand that I need to execute the code above for each of the 12 lists in attended_test_results_list and row bind to a single data frame.
But with this function I am getting this error:
attended_unpacked_test_results <- lapply(attended_test_results_list,
function(x){
data.frame(t(unlist(attended_test_results_list[[x]]
[c("estimate","p.value","statistic","conf.int")])))
})
Error in attended_test_results_list[[x]] : invalid subscript type 'list'
Do I need to be using a second lapply somewhere? How can I create the data frame in the format I want?
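One lapply should be enough. lapply passes each element of attended_test_results_list to your function as x, and each element is itself a list (the t-test result), not an index. Using it as a subscript in attended_test_results_list[[x]] is what produces the error invalid subscript type 'list'.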
I am not sure, but this should work:
attended_unpacked_test_results <- lapply(attended_test_results_list, function(x) {
data.frame(t(unlist(x[c("estimate","p.value","statistic","conf.int")])))
})
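This returns a list of one-row data frames. sapply will not return a data frame either; to get a single data frame, row-bind the list elements, for example with do.call(rbind, attended_unpacked_test_results).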
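As a minimal sketch of the whole pipeline, here is the extraction followed by a row-bind. The data frame below is made up (two pre/post pairs instead of twelve), run_test is a hypothetical simplified version of attended_test, and columns are pulled out with [[ ]] so that t.test receives plain numeric vectors.

```r
# Made-up stand-in for the asker's data: columns 1-2 are "pre", 3-4 are "post".
set.seed(1)
attended <- data.frame(pre1 = rexp(30), pre2 = rexp(30),
                       post1 = rexp(30), post2 = rexp(30))
pre  <- c(1, 2)
post <- c(3, 4)

# Hypothetical simplified test function (no tryCatch, for brevity).
run_test <- function(pre, post) {
  t.test(log10(attended[[pre]] + 1), log10(attended[[post]] + 1),
         alternative = "greater", paired = FALSE, var.equal = FALSE)
}
results <- mapply(run_test, pre, post, SIMPLIFY = FALSE)

# x is the test result itself, so subset x directly.
rows <- lapply(results, function(x) {
  data.frame(t(unlist(x[c("estimate", "p.value", "statistic", "conf.int")])))
})

# Stack the one-row data frames into a single data frame, one row per test.
all_tests <- do.call(rbind, rows)
print(all_tests)
```

If some tests fail and the tryCatch fallback is returned instead, those elements would need to be handled separately before binding, since they do not have the same named components.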
I want to get the mean of specific columns in a dataframe and store those means in a vector in R.
The specific variable names of the columns are stored in a vector. For those specific variables (depends on user input) I want to calculate the mean and store those in a vector, over which I can loop then to use it in another part of my code.
I tried as follows, e.g.:
specific_variables <- c("variable1", "variable2") # can be of a different length depending on user input
data <- data.frame(...) # a data frame with multiple columns, including "variable1" and "variable2"
mean_xm <- 0 # empty variable for storage purposes
# for loop over the variables
for (i in length(specific_variables)) {
mean_xm[i] <- mean(data$specific_variables[i], na.rm = TRUE)
}
print(mean_xm)
I get an error saying
Error: object of type 'closure' is not subsettable
Second attempt using sapply:
colMeans(data[sapply(data, is.numeric)])
This gives me the means of all numeric columns of the data frame, but I only want those of the columns specified in specific_variables. Ideally, I'd like to store those means in a vector, as I tried to do in my first attempt.
We may use colMeans on only the selected columns:
v1 <- unname(colMeans(data[specific_variables], na.rm = TRUE))
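For illustration, here is that one-liner next to a repaired version of the loop, on a small made-up data frame. In the loop, the column is selected with [[ ]] because data$specific_variables looks for a column literally named "specific_variables", and seq_along() is used so that every index is visited rather than only the last one.

```r
# Made-up data; "other" is a column that should be ignored.
data <- data.frame(variable1 = c(1, 2, NA),
                   variable2 = c(10, 20, 30),
                   other     = c(5, 5, 5))
specific_variables <- c("variable1", "variable2")

# Vectorised version: subset by name, then take column means.
v1 <- unname(colMeans(data[specific_variables], na.rm = TRUE))

# Loop version with the two fixes described above.
mean_xm <- numeric(length(specific_variables))
for (i in seq_along(specific_variables)) {
  mean_xm[i] <- mean(data[[specific_variables[i]]], na.rm = TRUE)
}

print(v1)
print(mean_xm)
```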
I am attempting to write a function which accepts a dataframe, and then generates subset dataframes within a for() loop. As a first step, I tried the following:
dfcreator<-function(X,Z){
for(i in 1:Z){
df<-subset(X,Stratum==Z) #build dataframe from observations where index=value
assign(paste0("pop", Z),df) #name dataframe
}
}
This, however, does not save anything into memory, and when I try to add a return() I still do not get what I need. For reference, I am using the Sweden data set (which is native to RStudio).
EDIT Per Melissa's Advice!
I tried to implement the following code:
sampler <- function(df, n,...) {
return(df[sample(nrow(df),n),])
}
sample_list<-map2(data_list, stratumSizeVec, sampler)
where stratumSizeVec is a 1X7 df and data_list is a list of seven dfs. When I do this, I get seven samples in sample_list, all of the same size, equal to stratumSizeVec[1]. Why is map2 not passing the arguments in the following manner?
sampler(data_list$pop0,stratumSizeVec[1])
sampler(data_list$pop1,stratumSizeVec[2])
...
sampler(data_list$pop6,stratumSizeVec[7])
Furthermore, is there a way to "nest" the map2 function within lapply?
I'm confused as to why you never actually utilize i anywhere in your loop. It looks like you're creating Z copies of the data set where Stratum == Z - is that what you are after?
As for your code, I would use the following:
data_list <- split(df, df$Stratum)
names(data_list) <- paste0("pop", sort(unique(df$Stratum)))
This doesn't define a function. Instead, we call a base-R function (namely split), which splits up a data frame based on some vector (here, we use df$Stratum). The result is a list of data frames, each containing a single value of Stratum.
Random sampling from rows
# n is the number of rows to take; the dots let you pass other arguments on to `sample`.
sampled_data <- lapply(data_list, function(df, n, ...) {
    df[sample(nrow(df), n, ...), ]
  },
  n = 5,
  replace = FALSE  # the default; shown only to illustrate that ... forwards options to `sample`
)
You can also define the function separately:
sampler <- function(df, n,...) {
df[sample(nrow(df), n, ...),]
}
sampled_data <- lapply(data_list, sampler, n = 10) # replace 10 with however many samples you want.
purrr::map2 method
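As defined, the sampler function does not need to be modified: each element of the first list (data_list) is put into the first argument of sampler, and the corresponding element of the 2nd "list" (sampleSizeVec, i.e. the vector of per-stratum sample sizes) is put into the 2nd argument. For this pairing to work, sampleSizeVec must have one element per data frame.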
library(purrr)
map2(data_list, sampleSizeVec, sampler, replace = FALSE) # replace = FALSE not needed, there as an example only.
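To see the element-by-element pairing without installing purrr, here is a sketch using base R's Map, which pairs its arguments the same way map2 does. The data frame and the sizes are made up, and sampleSizeVec is assumed to be a plain numeric vector; if the sizes live in a one-row data frame, converting it with unlist() first is one option.

```r
# Made-up data with three strata of 10, 20 and 30 rows.
set.seed(42)
df <- data.frame(Stratum = rep(0:2, times = c(10, 20, 30)),
                 y       = rnorm(60))

data_list <- split(df, df$Stratum)
names(data_list) <- paste0("pop", names(data_list))

sampler <- function(df, n, ...) {
  df[sample(nrow(df), n, ...), ]
}

# One sample size per stratum, as a plain numeric vector.
sampleSizeVec <- c(2, 4, 6)

# Map calls sampler(data_list[[k]], sampleSizeVec[[k]]) for each k.
sample_list <- Map(sampler, data_list, sampleSizeVec)
print(sapply(sample_list, nrow))
```

With purrr loaded, map2(data_list, sampleSizeVec, sampler) is the corresponding call.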
I am trying to write some functions that operate on matrices within a list, such that a particular operation is performed for a set of columns within each matrix, for every matrix in the list. The code deals with a project involving data that isn't mine, so I can't share the exact code and data. However, here is some code for a reproducible example of the struggles I'm having. Can anyone enlighten me on how to write a function for the ith column in the jth matrix using *apply?
### This will generate a list of data frames so there is a reproducible example.
list = lapply(seq(1:10), function(x) dplyr::sample_n(iris[,1:4], size=30))
### List the number of columns per matrix for ease
n_i = 4
### If I manually set the list index the function works fine (for that list). Let's suppose I want the residuals for all variables being predicted by Sepal.Width
get.residuals = function(i) {
residuals(lm(y ~ ., data=cbind.data.frame(y=list[[5]][,i], list[[5]][,2])))
}
### This gives the residuals for each variable given all other variables. This doesn't make sense here of course.
sapply(c(1,3,4), function(i) get.residuals(i))
### Looks perfect. The output is the residuals for the three variables specified in c(1,3,4) being predicted by Sepal.Width
### Now I want to extract the residuals for every column i in each matrix j in the list. I won't show them all here, but I've tried probably 50 different variations of the functions and uses of lapply, sapply, apply, and for loops and just can't seem to get it to do what I'd like.
### As an example, here is an attempt that throws the error "Error in terms.formula(formula, data = data) : '.' in formula and no 'data' argument"
get.residuals = function(i,j) {
residuals(lm(y ~ ., data=cbind.data.frame(y=list[[j]][,i], list[[j]][,2])))
}
lapply(list, function(j) get.residuals(i))
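One way to approach this is to hand each data frame to the function directly instead of indexing a global list, and to nest an sapply over columns inside an lapply over the list. This is a minimal sketch under assumptions: base-R sampling replaces dplyr::sample_n, and df_list and get_residuals are hypothetical names chosen to avoid masking the built-in list function.

```r
# Ten random 30-row subsets of the four numeric iris columns.
set.seed(1)
df_list <- lapply(1:10, function(k) iris[sample(nrow(iris), 30), 1:4])

# Residuals of column i regressed on column 2 (Sepal.Width) of data frame d.
get_residuals <- function(d, i) {
  residuals(lm(y ~ x, data = data.frame(y = d[[i]], x = d[[2]])))
}

# Outer lapply walks the list; inner sapply walks the chosen columns.
res <- lapply(df_list, function(d) {
  sapply(c(1, 3, 4), function(i) get_residuals(d, i))
})

# Each element is a 30 x 3 matrix: one column of residuals per response.
print(dim(res[[1]]))
```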
This R code is setting up an example of the issue I am attempting to resolve. The data set measures a release of particles over non-uniform time intervals. The particle release is integrated over time using the trapezoid rule.
library(caTools)
test.data.frame <- data.frame(
sample = c('sample 1','sample 1','sample 1','sample 1',
'sample 2','sample 2','sample 2','sample 2'))
test.data.frame$time <- c(1,2,4,6,1,4,5,6)
test.data.frame$material.released.g <- c(5,3,2,1,2,4,5,1)
split.test <- split(test.data.frame, test.data.frame$sample)
integrate.test <- function(x){
dataframe.segment <- do.call(rbind.data.frame,x)
return(trapz(dataframe.segment$time,dataframe.segment$material.released.g))
}
So far the integrate.test function appears to work on a single element of a list.
> integrate.test(split.test[1])
[1] 12
> integrate.test(split.test[2])
[1] 16.5
The lapply function gives zeros in the output.
> lapply(split.test, integrate.test)
$`sample 1`
[1] 0
$`sample 2`
[1] 0
The output I am looking for is a data frame equivalent to:
expected.output <- data.frame(
sample = c('sample 1','sample 2'),
total.material.released = c(12 , 16.5))
Is anyone able to help resolve this? Thanks!
It's the difference between split.test[1], which is a one-element list containing a data frame, and split.test[[1]], which is the data frame stored in list element [[1]].
Your function, by calling do.call(rbind.data.frame, x), is expecting that x will be a list. But lapply(split.test, integrate.test) actually feeds it a data frame. Here's what happens when you feed integrate.test a data frame rather than a (generic) list:
x = do.call(rbind.data.frame, split.test[[1]])
x
                    c.1..1..5. c.1..2..3. c.1..4..2. c.1..6..1.
sample                       1          1          1          1
time                         1          2          4          6
material.released.g          5          3          2          1
do.call operates over a list. If you feed it a generic list (like split.test[1], which is a one-element list) it tries to rbind each list element. If the list contained several data frames, it would stack them into a single data frame. But there's only one element--the data frame contained in element 1 of split.test--so that's what gets returned.
However, when you run do.call(rbind.data.frame, split.test[[1]]) you're giving do.call a data frame to operate on. A data frame is a special kind of list in which each column is a list element. So do.call takes the columns of your original data frame, transposes them into rows and stacks them. The integration returns 0 because the columns it wants to operate on no longer exist. When you reference those non-existent columns, NULL is returned instead of the data you were expecting, and trapz(NULL, NULL) is zero.
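The difference between the two kinds of subsetting can be checked on a small made-up example (d and lst are hypothetical names):

```r
# A one-element list holding a small data frame.
d   <- data.frame(time = c(1, 2, 4), g = c(5, 3, 2))
lst <- list(first = d)

print(class(lst[1]))    # single brackets keep the list wrapper
print(class(lst[[1]]))  # double brackets return the data frame itself

# Feeding the data frame to do.call row-binds its columns,
# so the original column names are gone afterwards.
flipped <- do.call(rbind.data.frame, lst[[1]])
print(names(flipped))
print(is.null(flipped$time))
```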
The function will work if you use the data frame directly and skip the do.call step:
integrate.test <- function(x){
#dataframe.segment <- do.call(rbind.data.frame,x)
dataframe.segment = x
return(trapz(dataframe.segment$time,dataframe.segment$material.released.g))
}
lapply(split.test, integrate.test)
$`sample 1`
[1] 12
$`sample 2`
[1] 16.5
Of course this can be shortened to:
integrate.test <- function(x){
return(trapz(x$time,x$material.released.g))
}
Or you can just use trapz directly, without wrapping it in a function.
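To get from the list to the data frame the question asks for, one option is to let sapply collect the totals into a named vector. The sketch below is self-contained, so it uses trapz_base, a hypothetical base-R stand-in for caTools::trapz that applies the same trapezoid rule.

```r
# Hypothetical stand-in for caTools::trapz: trapezoid rule over (x, y).
trapz_base <- function(x, y) {
  sum(diff(x) * (head(y, -1) + tail(y, -1)) / 2)
}

test.data.frame <- data.frame(
  sample = rep(c("sample 1", "sample 2"), each = 4),
  time = c(1, 2, 4, 6, 1, 4, 5, 6),
  material.released.g = c(5, 3, 2, 1, 2, 4, 5, 1))

split.test <- split(test.data.frame, test.data.frame$sample)

# One integrated total per sample, returned as a named numeric vector.
totals <- sapply(split.test, function(d) {
  trapz_base(d$time, d$material.released.g)
})

expected.output <- data.frame(
  sample = names(totals),
  total.material.released = unname(totals))
print(expected.output)
```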
I am looking for a best practice to store multiple vector results of an evaluation performed at several different values. Currently, my working code does this:
q <- 55
value <- c(0.95, 0.99, 0.995)
a <- rep(0,q) # Just initialize the vector
b <- rep(0,q) # Just initialize the vector
for(j in 1:length(value)){
for(i in 1:q){
a[i]<-rnorm(1, i, value[j]) # just as an example function
b[i]<-rnorm(1, i, value[j]) # just as an example function
}
df[j] <- data.frame(a,b)
}
I am trying to find the best way to:
store individual a and b for each value level;
be able to iterate through the variable "value" later for graphing;
have the value of the variable "value" and/or a description of it available.
I'm not exactly sure what you're trying to do, so let me know if this is what you're looking for.
q = 55
value <- c(sd95=0.95, sd99=0.99, sd995=0.995)
a = sapply(value, function(v) {
rnorm(q, 1:q, v)
})
In the code above, we avoid the inner loop by vectorizing. For example, rnorm(55, 1:55, 0.95) will give you 55 random normal deviates, the first drawn from a distribution with mean=1, the second from a distribution with mean=2, etc. Also, you don't need to initialize a.
sapply takes the place of the outer loop. It applies a function to each element of value and returns the three vectors of random draws as the columns of a matrix a (wrap it in as.data.frame if you need a data frame). I've added names to the values in value, and sapply uses those as the column names of a. (You could also make value a list rather than a vector with named elements, with value <- list(sd95=0.95, sd99=0.99, sd995=0.995); the code will otherwise run the same.)
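A quick check of what sapply hands back here, with a fixed seed so the run is repeatable. Note that the simplified result is a matrix; a_df is a hypothetical name for the converted data frame.

```r
set.seed(123)
q <- 55
value <- c(sd95 = 0.95, sd99 = 0.99, sd995 = 0.995)

# One column of q draws per standard deviation in value.
a <- sapply(value, function(v) rnorm(q, 1:q, v))

print(is.matrix(a))
print(dim(a))
print(colnames(a))

# Convert when a data frame is needed, e.g. for plotting functions.
a_df <- as.data.frame(a)
print(head(a_df, 3))
```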
You can create multiple data frames and store them in a list as follows:
q <- list(a=10, b=20)
value <- list(sd95=0.95, sd99=0.99, sd995=0.995)
df.list = sapply(q, function(i) {
sapply(value, function(v) {
rnorm(i, 1:i, v)
})
})
This time we have two different values for q and we wrap the sapply code from above inside another call to sapply. The inner sapply does the same thing as before, but now it gets the value of q from the outer sapply (using the dummy variable i). We're creating two matrices, one called a and the other called b. a has 10 rows and b has 20 (due to the values we set in q). Because the two results differ in size, the outer sapply cannot simplify them and returns a list, which we store as df.list.
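If actual data frames are wanted, a variant is to use lapply for the outer step (which always returns a list, whatever the sizes) and convert each inner result. This is a sketch of that variant rather than the code above; the loop at the end shows one way to iterate over the stored results by name.

```r
set.seed(123)
q <- list(a = 10, b = 20)
value <- c(sd95 = 0.95, sd99 = 0.99, sd995 = 0.995)

# Outer lapply always yields a list; each element becomes a data frame
# with one column per entry of value.
df.list <- lapply(q, function(n) {
  as.data.frame(sapply(value, function(v) rnorm(n, 1:n, v)))
})

# Iterate over the results; the names carry the labels from q and value.
for (nm in names(df.list)) {
  cat(nm, ":", nrow(df.list[[nm]]), "rows, columns:",
      paste(names(df.list[[nm]]), collapse = ", "), "\n")
}
```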