I have a data frame that looks like this:
set.seed(42)
data <- runif(1000)
utility <- sample(c("abc","bcd","cde","def"),1000,replace=TRUE)
stage <- sample(c("vwx","wxy","xyz"),1000,replace=TRUE)
x <- data.frame(data,utility,stage)
head(x)
data utility stage
1 0.9148060 def xyz
2 0.9370754 abc wxy
3 0.2861395 def xyz
4 0.8304476 cde xyz
5 0.6417455 bcd xyz
6 0.5190959 abc xyz
and I want to generate cumulative distribution functions (CDFs) for the unique combinations of utility and stage. In my real application I'll end up generating about 100 CDFs, but this random data has 12 (4x3) unique combinations. I'll be using each of those CDFs thousands of times, so I don't want to calculate the CDF on the fly each time. The ecdf() function works exactly as I'd like, except that I'd need to vectorize it. The following code doesn't work, but it's the gist of what I'm trying to do:
ecdf_multiple <- function(x)
{
i=0
utilities <- levels(x$utilities)
stages <- levels(x$stages)
for(utility in utilities)
{
for(stage in stages)
{
i <- i + 1
y <- ecdf(x[x$utilities == utility & x$stage == stage,1])
# calculate ecdf for the unique util/stage combo
z[i] <- list(y,utility,stage)
# then assign it to a data element (list, data frame, json, whatever) note-this doesn't actually work
}
}
z # return value
}
So after running ecdf_multiple and assigning its result to a variable, I'd reference that variable somehow by passing a value (for which I want the CDF), the utility, and the stage.
Is there a way to vectorize the ecdf function (or use/build another) so that I can use the output several times without needing to generate the distributions over and over?
-------Added to respond to @Pascal's excellent suggestion.-------
How might one expand this to a more general case of taking "n" dimensions of categories? This is my stab, based on Pascal's case of two dimensions. Notice how I tried to assign "y":
set.seed(42)
data <- runif(1000)
utility <- sample(c("abc","bcd","cde","def"),1000,replace=TRUE)
stage <- sample(c("vwx","wxy","xyz"),1000,replace=TRUE)
openclose <- sample(c("open","close"),1000,replace=TRUE)
x <- data.frame(data,utility,stage,openclose)
numlabels <- length(names(x))-1
y <- split(x, list(x[,2:(numlabels+1)]))
l <- lapply(y,function(x) ecdf(x[,"data"]))
#execute
utility <- "abc"
stage <- "xyz"
openclose <- "close"
comb <- paste(utility, stage, openclose, sep = ".")
# call the function
l[[comb]](.25)
During the assignment of "y" above, I get this error message:
"Error in sort.list(y) : 'x' must be atomic for 'sort.list'
Have you called 'sort' on a list?"
The following might help:
# we create a list of criteria by excluding
# the first column of the data.frame
y <- split(x, as.list(x[,-1]))
l <- lapply(y, function(x) ecdf(x[,"data"]))
utility <- "abc"
stage <- "xyz"
comb <- paste(utility, stage, sep = ".")
l[[comb]](0.25)
# [1] 0.2613636
plot(l[[comb]])
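For the n-dimensional case raised in the question, the same split/lapply idea can be wrapped in two small helpers. This is only a sketch: build_ecdfs and lookup_cdf are hypothetical names, not part of the answer above, and it assumes the value column is called "data" and every other column is a category.

```r
set.seed(42)
x <- data.frame(
  data      = runif(1000),
  utility   = sample(c("abc", "bcd", "cde", "def"), 1000, replace = TRUE),
  stage     = sample(c("vwx", "wxy", "xyz"), 1000, replace = TRUE),
  openclose = sample(c("open", "close"), 1000, replace = TRUE)
)

# Build one ecdf per observed combination of all columns except value_col.
# (Hypothetical helper; list names look like "abc.xyz.close".)
build_ecdfs <- function(df, value_col = "data") {
  keys <- df[setdiff(names(df), value_col)]
  lapply(split(df[[value_col]], keys, drop = TRUE), ecdf)
}

# Evaluate the stored ecdfs for vectors of values and category labels.
# The labels in ... must be given in the same order as the key columns.
lookup_cdf <- function(cdfs, value, ...) {
  key <- paste(..., sep = ".")
  mapply(function(v, k) cdfs[[k]](v), value, key, USE.NAMES = FALSE)
}

cdfs <- build_ecdfs(x)
res <- lookup_cdf(cdfs, c(0.25, 0.75),
                  c("abc", "def"), c("xyz", "vwx"), c("close", "open"))
```

The ecdfs are computed once and stored in cdfs, so repeated lookups only pay for the list access and the function call.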
I am trying to simulate changes in a data frame through successive steps that depend on one another. Here is a very simple example to illustrate my problem.
I create a dataframe with two columns
a=runif(10)
b=runif(10)
data_1=data.frame(a,b)
data_1
a b
1 0.94922669 0.47418098
2 0.26702201 0.79179699
3 0.57398333 0.25158378
4 0.52724079 0.61531202
5 0.03999831 0.95233479
6 0.15171673 0.64564561
7 0.51353129 0.75676464
8 0.60312432 0.85318316
9 0.52900913 0.06297818
10 0.75459362 0.40209925
Then, I would like to create n steps, where each step consists of creating a new data frame at i+1 that is a function (let's call it "whatever") of the data frame at i: data_2 is a transformation of data_1, data_3 a transformation of data_2, etc.
iterations=function(nsteps)
{
lapply(1:nsteps,function(i)
{
data_i+1=whatever(data_i)
})
}
Whatever function I use, I get an error message saying:
Error in whatever(data_i) : object 'data_i' not found
Can someone help me figure out what I am missing?
See if you can get some inspiration from the following example.
First, a whatever function to be applied to the previous dataframe.
whatever <- function(DF) {
DF[[2]] <- DF[[2]]*2
DF
}
Now the function you want. I have added an extra argument, the data frame x.
The function starts by creating the object to be returned. Each member of the list data_list will be a data frame computed from the previous one.
iterations <- function(nsteps, x){
data_list <- vector("list", length = nsteps)
data_list[[1]] <- x
for(i in seq_len(nsteps)[-1]){
data_list[[i]] <- whatever(data_list[[i - 1]])
}
names(data_list) <- sprintf("data_%d", seq_len(nsteps))
data_list
}
And apply iterations to an example dataframe.
df1 <- data.frame(A = letters[1:10], X = 1:10)
iterations(10, df1)
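The same chain of data frames can also be built without an explicit loop by using Reduce() with accumulate = TRUE. The sketch below is an alternative to the loop above, not a replacement for it; iterations_reduce and its argument start are hypothetical names, and it assumes nsteps is at least 2.

```r
# Same example transformation as above: double the second column.
whatever <- function(DF) {
  DF[[2]] <- DF[[2]] * 2
  DF
}

# Hypothetical variant of iterations() based on Reduce().
iterations_reduce <- function(nsteps, start) {
  stopifnot(nsteps >= 2)
  # Each step receives the previous data frame; the step index is ignored.
  # simplify = FALSE keeps the result a list of data frames.
  steps <- Reduce(function(prev, i) whatever(prev),
                  seq_len(nsteps - 1),
                  init = start, accumulate = TRUE, simplify = FALSE)
  names(steps) <- sprintf("data_%d", seq_along(steps))
  steps
}

df1 <- data.frame(A = letters[1:10], X = 1:10)
res <- iterations_reduce(4, df1)
```

Here res$data_1 is the starting data frame and each later element is whatever() applied to the one before it.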
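You might be looking for a combination of assign, get, and paste. Note that data_i is a literal object name, not "data_" followed by the value of i, which is why it is not found; get() looks the object up by its constructed name. This works in a for loop; inside lapply the new object would only exist within the anonymous function.
assign(paste0("data_", i + 1), whatever(get(paste0("data_", i))))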
I have three populations stored as individual vectors. I need to run a statistical test (wilcoxon, if it matters) on each pair of these three populations.
I want to input three vectors into some block of code and get as output a vector of 6 p-values (one p-value is the result of one test and is a double).
I have a method that works but I am new to R and from what I've been reading I feel like there should be a better way, possibly involving storing the vectors as a data frame and using vectorization, to write this code.
Here is the code I have:
library(arrangements)
runAllTests <- function(pop1,pop2,pop3) {
populations <- list(pop1=pop1,pop2=pop2,pop3=pop3)
colLabels <- c("pop1", "pop2", "pop3")
#This line makes a data frame where each column is a pair of labels
perms <- data.frame(t(permutations(colLabels,2)))
pvals <- vector()
#This for loop gets each column of that data frame
for (pair in perms[,]) {
pair <- as.vector(pair)
p1 <- as.numeric(unlist(populations[pair[1]]))
p2 <- as.numeric(unlist(populations[pair[2]]))
pvals <- append(pvals, wilcox.test(p1, p2,alternative=c("less"))$p.value)
}
return(pvals)
}
What is a more idiomatic R way to write this code?
Note: Generating populations and comparing them all to each other is a common enough thing (and tricky enough to code) that I think this question will apply to more people than myself.
EDIT: I forgot that my actual populations are of different sizes. This means I cannot make a data frame out of the vectors (as far as I know). I can make a list of vectors though. I have updated my code with a version that works.
Yes, this is common; so common, in fact, that R has a built-in function for exactly this scenario: pairwise.table.
p <- list(pop1, pop2, pop3)
# p.adjust.method has no default in pairwise.table, so it is supplied explicitly
pairwise.table(function(i, j) {
  wilcox.test(p[[i]], p[[j]])$p.value
}, 1:3, p.adjust.method = "holm")
There are also specific versions for t tests, proportion tests, and Wilcoxon tests; here's an example using pairwise.wilcox.test.
p <- list(pop1, pop2, pop3)
d <- data.frame(x=unlist(p), g=rep(seq_along(p), sapply(p, length)))
with(d, pairwise.wilcox.test(x, g))
Also, make sure you look into the p.adjust.method parameter to correctly adjust for multiple comparisons.
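To see what p.adjust.method changes, here is a small sketch that runs pairwise.wilcox.test twice on made-up populations of different sizes, once without adjustment and once with Holm's method. The populations and the names pops, long, raw and holm are assumptions for illustration only.

```r
set.seed(1)
# Three made-up populations with different sizes.
pops <- list(pop1 = rnorm(20),
             pop2 = rnorm(35, mean = 0.5),
             pop3 = rnorm(15, mean = 1))

# Long format: one row per observation, plus a group label.
long <- data.frame(value = unlist(pops, use.names = FALSE),
                   group = rep(names(pops), lengths(pops)))

# Unadjusted and Holm-adjusted p-value tables.
raw  <- pairwise.wilcox.test(long$value, long$group,
                             p.adjust.method = "none")$p.value
holm <- pairwise.wilcox.test(long$value, long$group,
                             p.adjust.method = "holm")$p.value

# The adjusted table is p.adjust() applied to the unadjusted lower triangle.
lt <- lower.tri(raw, diag = TRUE)
manual <- p.adjust(raw[lt], method = "holm")
```

Both tables have one row and column fewer than the number of groups, with NA above the diagonal.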
Per your comments, you're interested in tests where the order matters; that's really hard to imagine (and isn't true for the two-sided Wilcoxon test, which is what wilcox.test runs by default) but still...
This is the pairwise.table function, edited to do tests in both directions.
pairwise.table.all <- function (compare.levels, level.names, p.adjust.method) {
ix <- setNames(seq_along(level.names), level.names)
pp <- outer(ix, ix, function(ivec, jvec)
sapply(seq_along(ivec), function(k) {
i <- ivec[k]; j <- jvec[k]
if (i != j) compare.levels(i, j) else NA }))
pp[] <- p.adjust(pp[], p.adjust.method)
pp
}
This is a version of pairwise.wilcox.test which uses the above function, and also runs on a list of vectors, instead of a data frame in long format.
pairwise.lazerbeam.test <- function(dat, p.adjust.method=p.adjust.methods) {
p.adjust.method <- match.arg(p.adjust.method)
level.names <- if(!is.null(names(dat))) names(dat) else seq_along(dat)
PVAL <- pairwise.table.all(function(i, j) {
wilcox.test(dat[[i]], dat[[j]])$p.value
}, level.names, p.adjust.method = p.adjust.method)
ans <- list(method = "Lazerbeam's special method",
data.name = paste(level.names, collapse=", "),
p.value = PVAL, p.adjust.method = p.adjust.method)
class(ans) <- "pairwise.htest"
ans
}
Output, both before and after tidying, looks like this:
> p <- list(a=1:5, b=2:8, c=10:16)
> out <- pairwise.lazerbeam.test(p)
> out
Pairwise comparisons using Lazerbeam's special method
data: a, b, c
a b c
a - 0.2821 0.0101
b 0.2821 - 0.0035
c 0.0101 0.0035 -
P value adjustment method: holm
> pairwise.lazerbeam.test(p) %>% broom::tidy()
# A tibble: 6 x 3
group1 group2 p.value
<chr> <chr> <dbl>
1 b a 0.282
2 c a 0.0101
3 a b 0.282
4 c b 0.00350
5 a c 0.0101
6 b c 0.00350
Here is an example of one approach that uses combn() which has a function argument that can be used to easily apply wilcox.test() to all variable combinations.
set.seed(234)
# Create dummy data
df <- data.frame(replicate(3, sample(1:5, 100, replace = TRUE)))
# Apply wilcox.test to all combinations of variables in data frame.
res <- combn(names(df), 2, function(x) {
  list(data = paste(x[1], x[2]),
       p = wilcox.test(x = df[[x[1]]], y = df[[x[2]]])$p.value)
}, simplify = FALSE)
# Bind results
do.call(rbind, res)
data p
[1,] "X1 X2" 0.45282
[2,] "X1 X3" 0.06095539
[3,] "X2 X3" 0.3162251
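combn() yields each unordered pair once. If the test is one-sided, as with alternative = "less" in the question, both orderings are needed, which gives 6 p-values for three populations. A minimal sketch, assuming made-up populations stored in a named list (pops and pairs are illustrative names):

```r
set.seed(2)
# Made-up populations of different sizes, stored as a named list.
pops <- list(pop1 = rnorm(12), pop2 = rnorm(20, mean = 1), pop3 = rnorm(8))

# All ordered pairs of population names, excluding self-comparisons.
pairs <- expand.grid(first = names(pops), second = names(pops),
                     stringsAsFactors = FALSE)
pairs <- pairs[pairs$first != pairs$second, ]

# One-sided test for every ordered pair.
pairs$p.value <- mapply(function(a, b) {
  wilcox.test(pops[[a]], pops[[b]], alternative = "less")$p.value
}, pairs$first, pairs$second)
```

Because the populations live in a list rather than a data frame, their differing lengths are not a problem.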
In R I have a list of 100 phylo objects called Newick1, Newick2, Newick3, etc. I want to do pairwise comparisons between the trees (e.g. all.equal.phylo(Newick1, Newick2)) but am having difficulty figuring out how to do this efficiently since each object has a different name.
I think something like the for loop below will work, but how do I designate a different file for each iteration of the loop? For obvious reasons the [i] and [j] I put in the code below don't work, but I don't know what to replace them with.
Thank you very much!
for (i in 1:99) {
for (j in i+1:100) {
all.equal.phylo(Newick[i], Newick[j]) -> output[i,j]
} }
Try mget() to reference multiple objects by name:
> x1 <- x2 <- x3 <-1
> mget(paste0("x",1:3))
$x1
[1] 1
$x2
[1] 1
$x3
[1] 1
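Once the objects are gathered into a list with mget(), the pairwise loop can index that list. The sketch below uses plain numeric vectors and identical() as stand-ins for the phylo objects and all.equal.phylo(), since the ape package may not be available; trees, idx and result are illustrative names.

```r
# Stand-ins for the tree objects (assumption: the real ones are phylo objects).
Newick1 <- c(1, 2, 3)
Newick2 <- c(1, 2, 3)
Newick3 <- c(3, 2, 1)

# Collect the individually named objects into one named list.
trees <- mget(paste0("Newick", 1:3))

# Each column of idx is one unordered pair (i, j) with i < j.
idx <- combn(length(trees), 2)

# Compare each pair; swap identical() for all.equal.phylo() with real trees.
same <- apply(idx, 2, function(p) identical(trees[[p[1]]], trees[[p[2]]]))

result <- data.frame(i = idx[1, ], j = idx[2, ], equal = same)
```

Note the double brackets: trees[[i]] extracts the object itself, while trees[i] would return a one-element list.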
You can try a variation on the following:
# make a two column dataframe
# and filter the identical values
df <- expand.grid(1:100,1:100)
names(df) <- c('i','j')
df <- df[!df$i == df$j,]
# example function that takes two parameters
addtwo <- function(i,j){i + j}
# apply that function across rows of the dataframe
results <- mapply(addtwo, df$i, df$j)
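# using the same logic, your function would look something like this
# (NEWICKS is assumed to be a list holding the trees, e.g. built with mget();
# [[ ]] extracts a single tree, whereas [ ] would return a one-element list)
getdistance <- function(i, j, newicks = NEWICKS) {
  all.equal.phylo(newicks[[i]], newicks[[j]])
}
# and apply it like this
results <- mapply(getdistance, df$i, df$j)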
Key concepts:
expand.grid()
mapply()
x1=c(55,60,75,80)
x2=c(30,20,15,23)
x3=c(4,3,2,6)
x=data.frame(x1,x2,x3)
From this function:
NAins=function(x,alpha=0.3){
x.n=NULL
for (i in 1:ncol(x)){
S= sort(x[,i], decreasing=TRUE)
N= S[ceiling(alpha*nrow(x))]
x.n= ifelse(x[,i]>N, NA, x[,i])
print(x.n) }
}
How can I save the final result as a data frame that looks like the original dataset? I tried using data.frame(x.nmar).
And how do I get the result out of the loop?
It is better to use lapply here, which avoids the side effects of the for loop:
NAins <- function(x, alpha = 0.3) {
  Nr <- nrow(x)
  lapply(x, function(col) {
    S <- sort(col, decreasing = TRUE)
    N <- S[ceiling(alpha * Nr)]
    ifelse(col > N, NA, col)
  })
}
Then you can coerce the result to a data.frame:
as.data.frame(NAins(x))
Converting my comment to an answer.
If you want to achieve this the loop way, you will need to predefine a matrix or a data frame and then fill it up (in your case you can just use your original x data.frame, because the function will not update the original data set in the global environment). After the loop ends, you will need to return it, because all the variables you've created within the function will be removed. The output of print isn't saved anywhere either. Also, running ceiling(alpha*nrow(x)) inside the loop doesn't make sense, as it always stays the same. The ifelse isn't needed either if you only have a single replacement value each time. See below:
NAins=function(x, alpha = 0.3){
N <- ceiling(alpha * nrow(x)) ## Run this only once (take out of the loop)
for(i in 1:ncol(x)){
S <- sort(x[, i], decreasing = TRUE)
x[x[, i] > S[N], i] <- NA # no need for `ifelse`: only one value is being inserted
}
x # return the result after the loop ends
}
Test
NAins(x)
# x1 x2 x3
# 1 55 NA 4
# 2 60 20 3
# 3 75 15 2
# 4 NA 23 NA
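A loop-free variant that still returns a data frame is to assign the lapply() result into x[], which keeps the original column names and class. This is a sketch combining the two answers above; NAins_df is a hypothetical name.

```r
x <- data.frame(x1 = c(55, 60, 75, 80),
                x2 = c(30, 20, 15, 23),
                x3 = c(4, 3, 2, 6))

# Hypothetical variant: replace values above the k-th largest with NA.
NAins_df <- function(x, alpha = 0.3) {
  k <- ceiling(alpha * nrow(x))          # computed once, as in the loop version
  # Assigning into x[] keeps x a data frame with its original column names.
  x[] <- lapply(x, function(col) {
    cutoff <- sort(col, decreasing = TRUE)[k]
    replace(col, col > cutoff, NA)
  })
  x
}

res <- NAins_df(x)
```

The result matches the output of the loop version shown above, and the original x is left unchanged.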
I have a for loop in R in which I want to store the result of each calculation (for all the values looped through). In the for loop a function is called and the output is currently stored in a variable r. However, this is overwritten in each successive iteration. How could I store the result of each pass through the function and access it afterwards?
Thanks,
example
for (par1 in 1:n) {
var<-function(par1,par2)
c(var,par1)->var2
print(var2)
}
So print returns every instance of var2, but var2 itself only holds the value for the last n. Is there any way to get an array of the data or something similar?
Initialise an empty object and then assign the values by indexing:
a <- numeric(10)  # preallocate one slot per iteration
for (i in 1:10) {
  a[i] <- mean(rnorm(50))
}
print(a)
EDIT:
To include an example with two output variables, in the most basic case, create an empty matrix with the number of columns corresponding to your output parameters and the number of rows matching the number of iterations. Then save the output in the matrix, by indexing the row position in your for loop:
n <- 10
mat <- matrix(ncol=2, nrow=n)
for (i in 1:n) {
var1 <- function_one(i,par1)
var2 <- function_two(i,par2)
mat[i,] <- c(var1,var2)
}
print(mat)
The iteration number i corresponds to the row number in the mat object. So there is no need to explicitly keep track of it.
However, this is just to illustrate the basics. Once you understand the above, it is more efficient to use the elegant solution given by @eddi, especially if you are handling many output variables.
To get a list of results:
n = 3
lapply(1:n, function(par1) {
# your function and whatnot, e.g.
par1*par1
})
Or sapply if you want a vector instead.
A bit more complicated example:
n = 3
some_fn = function(x, y) { x + y }
par2 = 4
lapply(1:n, function(par1) {
var = some_fn(par1, par2)
return(c(var, par1)) # don't have to type return, but I chose to make it explicit here
})
#[[1]]
#[1] 5 1
#
#[[2]]
#[1] 6 2
#
#[[3]]
#[1] 7 3
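If a list is not the most convenient shape, the per-iteration results can be bound into a matrix, or collected directly into a vector with vapply(). A short sketch reusing some_fn and par2 from the example above (res, mat and vec are illustrative names):

```r
n <- 3
par2 <- 4
some_fn <- function(x, y) x + y

# One named vector per iteration, collected in a list.
res <- lapply(seq_len(n), function(par1) {
  c(var = some_fn(par1, par2), par1 = par1)
})

# Stack the list elements into a matrix: one row per iteration.
mat <- do.call(rbind, res)

# For a single numeric result per iteration, vapply returns a plain vector
# and checks that each result has the declared type and length.
vec <- vapply(seq_len(n), function(par1) some_fn(par1, par2), numeric(1))
```

Row i of mat holds the results of iteration i, so no separate counter is needed.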