Related
I'm new to R and can't seem to get to grips with how to call a previous value of "self", in this case previous "b" b[-1].
b <- ( ( 1 / 14 ) * MyData$High + (( 13 / 14 )*b[-1]))
Obviously I need a NA somewhere in there for the first calculation, but I just couldn't figure this out on my own.
Adding example of what the sought after result should be (A=MyData$High):
A b
1 5 NA
2 10 0.7142...
3 15 3.0393...
4 20 4.6079...
1) for loop Normally one would just use a simple loop for this:
MyData <- data.frame(A = c(5, 10, 15, 20))
MyData$b <- 0
n <- nrow(MyData)
if (n > 1) for(i in 2:n) MyData$b[i] <- ( MyData$A[i] + 13 * MyData$b[i-1] )/ 14
MyData$b[1] <- NA
giving:
> MyData
A b
1 5 NA
2 10 0.7142857
3 15 1.7346939
4 20 3.0393586
2) Reduce It would also be possible to use Reduce. One first defines a function f that carries out the body of the loop and then we have Reduce invoke it repeatedly like this:
f <- function(b, A) (A + 13 * b) / 14
MyData$b <- Reduce(f, MyData$A[-1], 0, acc = TRUE)
MyData$b[1] <- NA
giving the same result.
This gives the appearance of being vectorized but in fact if you look at the source of Reduce it does a for loop itself.
3) filter Noting that the form of the problem is a recursive filter with coefficient 13/14 operating on A/14 (but with A[1] replaced with 0) we can write the following. Since filter returns a time series we use c(...) to convert it back to an ordinary vector. This approach actually is vectorized as the filter operation is performed in C.
MyData$b <- c(filter(replace(MyData$A, 1, 0)/14, 13/14, method = "recursive"))
MyData$b[1] <- NA
again giving the same result.
Note: All solutions assume that MyData has at least 1 row.
There are a couple of ways you could do this.
The first method is a simple loop
df <- data.frame(A = seq(5, 25, 5))
df$b <- 0
for(i in 2:nrow(df)){
df$b[i] <- (1/14)*df$A[i]+(13/14)*df$b[i-1]
}
df
A b
1 5 0.0000000
2 10 0.7142857
3 15 1.7346939
4 20 3.0393586
5 25 4.6079758
This doesn't give the exact values given in the expected answer, but it's close enough that I've assumed you made a transcription mistake. Note that we have to assume that we can take the NA in df$b[1] as being zero or we get NA all the way down.
If you have heaps of data or need to do this a bunch of time the speed could be improved by implementing the code in C++ and calling it from R.
The second method uses the R function sapply
The form you present the problem in
is recursive, which makes it impossible to vectorise, however we can do some maths and find that it is equivalent to
We can then write a function which calculates b_i and use sapply to calculate each element
calc_b <- function(n,A){
(1/14)*sum((13/14)^(n-1:n)*A[1:n])
}
df2 <- data.frame(A = seq(10,25,5))
df2$b <- sapply(seq_along(df2$A), calc_b, df2$A)
df2
A b
1 10 0.7142857
2 15 1.7346939
3 20 3.0393586
4 25 4.6079758
Note: We need to drop the first row (where A = 5) in order for the calculation to perform correctly.
I am trying to do sampling of variables for a statistical analysis. I have 10 variables, and I want to examine every possible combination of 5 of them. However, I only want those that follow certain rules. I only want those with 1 xor 2, 3 xor 4, 5 xor 6, 7 xor 8 and 9 xor 10. In other words, all combinations given 5 binary choices (32).
Any idea how to do this efficiently?
A simple idea is to find all the 5 out 10 using:
library(gtools)
sets = combinations(10,5) # choose 5 out of 10, all possibilities
sets = split(sets, seq.int(nrow(sets))) #so it's loopable
And then loop over these keeping only the ones that meet the criteria and thus ending up with the 32 ones desired.
But surely there is a more efficient way than this.
This will construct a matrix whose 32 rows enumerate all the possible combinations satisfying your contraint:
m <- as.matrix(expand.grid(1:2, 3:4, 5:6, 7:8, 9:10))
## Inspect a few of the rows to see that this works:
m[c(1,4,9,16,25),]
# Var1 Var2 Var3 Var4 Var5
# [1,] 1 3 5 7 9
# [2,] 2 4 5 7 9
# [3,] 1 3 5 8 9
# [4,] 2 4 6 8 9
# [5,] 1 3 5 8 10
I found a solution too, but it's not quite as elegant as Josh O'Brien's above.
library(R.utils) #for intToBin()
binaries = intToBin(0:31) #binary numbers 0 to 31
sets = list() #empty list
for (set in binaries) { #loop over each binary number string
vars = numeric() #empty vector
for (cif in 1:5) { #loop over each char in the string
if (substr(set,cif,cif)=="0"){ #if its 0
vars = c(vars,cif*2-1) #add the first var
}
else {
vars = c(vars,cif*2) #else, add the second var
}
}
sets[[set]] = as.vector(vars) #add result to list
}
Based on the idea in your answer, an alternative for the record:
n = 5
sets = matrix(1:10, ncol = 2, byrow = TRUE)
#the "on-off" combinations for each position
combs = lapply(0:(2^n - 1), function(x) as.integer(intToBits(x)[seq_len(n)]))
#a way to get the actual values
matrix(sets[cbind(seq_len(n), unlist(combs) + 1L)], ncol = n, byrow = TRUE)
I have a dataframe with multiple columns. For each row in the dataframe, I want to call a function on the row, and the input of the function is using multiple columns from that row. For example, let's say I have this data and this testFunc which accepts two args:
> df <- data.frame(x=c(1,2), y=c(3,4), z=c(5,6))
> df
x y z
1 1 3 5
2 2 4 6
> testFunc <- function(a, b) a + b
Let's say I want to apply this testFunc to columns x and z. So, for row 1 I want 1+5, and for row 2 I want 2 + 6. Is there a way to do this without writing a for loop, maybe with the apply function family?
I tried this:
> df[,c('x','z')]
x z
1 1 5
2 2 6
> lapply(df[,c('x','z')], testFunc)
Error in a + b : 'b' is missing
But got error, any ideas?
EDIT: the actual function I want to call is not a simple sum, but it is power.t.test. I used a+b just for example purposes. The end goal is to be able to do something like this (written in pseudocode):
df = data.frame(
delta=c(delta_values),
power=c(power_values),
sig.level=c(sig.level_values)
)
lapply(df, power.t.test(delta_from_each_row_of_df,
power_from_each_row_of_df,
sig.level_from_each_row_of_df
))
where the result is a vector of outputs for power.t.test for each row of df.
You can apply apply to a subset of the original data.
dat <- data.frame(x=c(1,2), y=c(3,4), z=c(5,6))
apply(dat[,c('x','z')], 1, function(x) sum(x) )
or if your function is just sum use the vectorized version:
rowSums(dat[,c('x','z')])
[1] 6 8
If you want to use testFunc
testFunc <- function(a, b) a + b
apply(dat[,c('x','z')], 1, function(x) testFunc(x[1],x[2]))
EDIT To access columns by name and not index you can do something like this:
testFunc <- function(a, b) a + b
apply(dat[,c('x','z')], 1, function(y) testFunc(y['z'],y['x']))
A data.frame is a list, so ...
For vectorized functions do.call is usually a good bet. But the names of arguments come into play. Here your testFunc is called with args x and y in place of a and b. The ... allows irrelevant args to be passed without causing an error:
do.call( function(x,z,...) testFunc(x,z), df )
For non-vectorized functions, mapply will work, but you need to match the ordering of the args or explicitly name them:
mapply(testFunc, df$x, df$z)
Sometimes apply will work - as when all args are of the same type so coercing the data.frame to a matrix does not cause problems by changing data types. Your example was of this sort.
If your function is to be called within another function into which the arguments are all passed, there is a much slicker method than these. Study the first lines of the body of lm() if you want to go that route.
Use mapply
> df <- data.frame(x=c(1,2), y=c(3,4), z=c(5,6))
> df
x y z
1 1 3 5
2 2 4 6
> mapply(function(x,y) x+y, df$x, df$z)
[1] 6 8
> cbind(df,f = mapply(function(x,y) x+y, df$x, df$z) )
x y z f
1 1 3 5 6
2 2 4 6 8
New answer with dplyr package
If the function that you want to apply is vectorized,
then you could use the mutate function from the dplyr package:
> library(dplyr)
> myf <- function(tens, ones) { 10 * tens + ones }
> x <- data.frame(hundreds = 7:9, tens = 1:3, ones = 4:6)
> mutate(x, value = myf(tens, ones))
hundreds tens ones value
1 7 1 4 14
2 8 2 5 25
3 9 3 6 36
Old answer with plyr package
In my humble opinion,
the tool best suited to the task is mdply from the plyr package.
Example:
> library(plyr)
> x <- data.frame(tens = 1:3, ones = 4:6)
> mdply(x, function(tens, ones) { 10 * tens + ones })
tens ones V1
1 1 4 14
2 2 5 25
3 3 6 36
Unfortunately, as Bertjan Broeksema pointed out,
this approach fails if you don't use all the columns of the data frame
in the mdply call.
For example,
> library(plyr)
> x <- data.frame(hundreds = 7:9, tens = 1:3, ones = 4:6)
> mdply(x, function(tens, ones) { 10 * tens + ones })
Error in (function (tens, ones) : unused argument (hundreds = 7)
Others have correctly pointed out that mapply is made for this purpose, but (for the sake of completeness) a conceptually simpler method is just to use a for loop.
for (row in 1:nrow(df)) {
df$newvar[row] <- testFunc(df$x[row], df$z[row])
}
Many functions are vectorization already, and so there is no need for any iterations (neither for loops or *pply functions). Your testFunc is one such example. You can simply call:
testFunc(df[, "x"], df[, "z"])
In general, I would recommend trying such vectorization approaches first and see if they get you your intended results.
Alternatively, if you need to pass multiple arguments to a function which is not vectorized, mapply might be what you are looking for:
mapply(power.t.test, df[, "x"], df[, "z"])
Here is an alternate approach. It is more intuitive.
One key aspect I feel some of the answers did not take into account, which I point out for posterity, is apply() lets you do row calculations easily, but only for matrix (all numeric) data
operations on columns are possible still for dataframes:
as.data.frame(lapply(df, myFunctionForColumn()))
To operate on rows, we make the transpose first.
tdf<-as.data.frame(t(df))
as.data.frame(lapply(tdf, myFunctionForRow()))
The downside is that I believe R will make a copy of your data table.
Which could be a memory issue. (This is truly sad, because it is programmatically simple for tdf to just be an iterator to the original df, thus saving memory, but R does not allow pointer or iterator referencing.)
Also, a related question, is how to operate on each individual cell in a dataframe.
newdf <- as.data.frame(lapply(df, function(x) {sapply(x, myFunctionForEachCell()}))
data.table has a really intuitive way of doing this as well:
library(data.table)
sample_fxn = function(x,y,z){
return((x+y)*z)
}
df = data.table(A = 1:5,B=seq(2,10,2),C = 6:10)
> df
A B C
1: 1 2 6
2: 2 4 7
3: 3 6 8
4: 4 8 9
5: 5 10 10
The := operator can be called within brackets to add a new column using a function
df[,new_column := sample_fxn(A,B,C)]
> df
A B C new_column
1: 1 2 6 18
2: 2 4 7 42
3: 3 6 8 72
4: 4 8 9 108
5: 5 10 10 150
It's also easy to accept constants as arguments as well using this method:
df[,new_column2 := sample_fxn(A,B,2)]
> df
A B C new_column new_column2
1: 1 2 6 18 6
2: 2 4 7 42 12
3: 3 6 8 72 18
4: 4 8 9 108 24
5: 5 10 10 150 30
#user20877984's answer is excellent. Since they summed it up far better than my previous answer, here is my (posibly still shoddy) attempt at an application of the concept:
Using do.call in a basic fashion:
powvalues <- list(power=0.9,delta=2)
do.call(power.t.test,powvalues)
Working on a full data set:
# get the example data
df <- data.frame(delta=c(1,1,2,2), power=c(.90,.85,.75,.45))
#> df
# delta power
#1 1 0.90
#2 1 0.85
#3 2 0.75
#4 2 0.45
lapply the power.t.test function to each of the rows of specified values:
result <- lapply(
split(df,1:nrow(df)),
function(x) do.call(power.t.test,x)
)
> str(result)
List of 4
$ 1:List of 8
..$ n : num 22
..$ delta : num 1
..$ sd : num 1
..$ sig.level : num 0.05
..$ power : num 0.9
..$ alternative: chr "two.sided"
..$ note : chr "n is number in *each* group"
..$ method : chr "Two-sample t test power calculation"
..- attr(*, "class")= chr "power.htest"
$ 2:List of 8
..$ n : num 19
..$ delta : num 1
..$ sd : num 1
..$ sig.level : num 0.05
..$ power : num 0.85
... ...
I came here looking for tidyverse function name - which I knew existed. Adding this for (my) future reference and for tidyverse enthusiasts: purrrlyr:invoke_rows (purrr:invoke_rows in older versions).
With connection to standard stats methods as in the original question, the broom package would probably help.
If data.frame columns are different types, apply() has a problem.
A subtlety about row iteration is how apply(a.data.frame, 1, ...) does
implicit type conversion to character types when columns are different types;
eg. a factor and numeric column. Here's an example, using a factor
in one column to modify a numeric column:
mean.height = list(BOY=69.5, GIRL=64.0)
subjects = data.frame(gender = factor(c("BOY", "GIRL", "GIRL", "BOY"))
, height = c(71.0, 59.3, 62.1, 62.1))
apply(height, 1, function(x) x[2] - mean.height[[x[1]]])
The subtraction fails because the columns are converted to character types.
One fix is to back-convert the second column to a number:
apply(subjects, 1, function(x) as.numeric(x[2]) - mean.height[[x[1]]])
But the conversions can be avoided by keeping the columns separate
and using mapply():
mapply(function(x,y) y - mean.height[[x]], subjects$gender, subjects$height)
mapply() is needed because [[ ]] does not accept a vector argument. So the column
iteration could be done before the subtraction by passing a vector to [],
by a bit more ugly code:
subjects$height - unlist(mean.height[subjects$gender])
A really nice function for this is adply from plyr, especially if you want to append the result to the original dataframe. This function and its cousin ddply have saved me a lot of headaches and lines of code!
df_appended <- adply(df, 1, mutate, sum=x+z)
Alternatively, you can call the function you desire.
df_appended <- adply(df, 1, mutate, sum=testFunc(x,z))
In R with a matrix:
one two three four
[1,] 1 6 11 16
[2,] 2 7 12 17
[3,] 3 8 11 18
[4,] 4 9 11 19
[5,] 5 10 15 20
I want to extract the submatrix whose rows have column three = 11. That is:
one two three four
[1,] 1 6 11 16
[3,] 3 8 11 18
[4,] 4 9 11 19
I want to do this without looping. I am new to R so this is probably very obvious but the
documentation is often somewhat terse.
This is easier to do if you convert your matrix to a data frame using as.data.frame(). In that case the previous answers (using subset or m$three) will work, otherwise they will not.
To perform the operation on a matrix, you can define a column by name:
m[m[, "three"] == 11,]
Or by number:
m[m[,3] == 11,]
Note that if only one row matches, the result is an integer vector, not a matrix.
I will choose a simple approach using the dplyr package.
If the dataframe is data.
library(dplyr)
result <- filter(data, three == 11)
m <- matrix(1:20, ncol = 4)
colnames(m) <- letters[1:4]
The following command will select the first row of the matrix above.
subset(m, m[,4] == 16)
And this will select the last three.
subset(m, m[,4] > 17)
The result will be a matrix in both cases.
If you want to use column names to select columns then you would be best off converting it to a dataframe with
mf <- data.frame(m)
Then you can select with
mf[ mf$a == 16, ]
Or, you could use the subset command.
Subset is a very slow function , and I personally find it useless.
I assume you have a data.frame, array, matrix called Mat with A, B, C as column names; then all you need to do is:
In the case of one condition on one column, lets say column A
Mat[which(Mat[,'A'] == 10), ]
In the case of multiple conditions on different column, you can create a dummy variable. Suppose the conditions are A = 10, B = 5, and C > 2, then we have:
aux = which(Mat[,'A'] == 10)
aux = aux[which(Mat[aux,'B'] == 5)]
aux = aux[which(Mat[aux,'C'] > 2)]
Mat[aux, ]
By testing the speed advantage with system.time, the which method is 10x faster than the subset method.
If your matrix is called m, just use :
R> m[m$three == 11, ]
If the dataset is called data, then all the rows meeting a condition where value of column 'pm2.5' > 300 can be received by -
data[data['pm2.5'] >300,]
I'm using R, and I have two data.frames, A and B. They both have 6 rows, but A has 25000 columns (genes), and B has 30 columns. I'd like to apply a function with two arguments f(x,y) where x is every column of A and y is every column of B. So far it looks like this:
i = 1
for (x in A){
j = 1
for (y in B){
out[i,j] <- f(x,y)
j = j + 1
}
i = i + 1
}
I have two issues with this: from my Python programming I associate keeping track of counters like this as crufty, and from my R programming I am nervous of for loops. However, I can't quite see how to apply apply (or even if I should apply apply) to this problem and was hoping someone might enlighten me. I need to treat f() as atomic (it's actually cor.test()) for now.
Since you are using data frames, it might be faster to use lapply or sapply to do this (specially given the scope of your data frames). For example,
x <- data.frame(col1=c(1,2,3,4), col2=c(5,6,7,8), col3=c(9,10,11,12))
y <- data.frame(col1=c(1,2,3,4), col2=c(5,6,7,8))
bl <- lapply(x, function(u){
lapply(y, function(v){
f(u,v) # Function with column from x and column from y as inputs
})
})
out = matrix(unlist(bl), ncol=ncol(y), byrow=T)
Some data
nrows <- 6
A <- data.frame(a = runif(nrows), b = runif(nrows), c = runif(nrows))
B <- data.frame(z = rnorm(nrows), y = rnorm(nrows))
The trick: remember columns with expand.grid
counter <- expand.grid(seq_along(A), seq_along(B))
f <- function(x)
{
cor.test(A[, x["Var1"]], B[, x["Var2"]])$estimate
}
Now we only need 1 call to apply.
stats <- apply(counter, 1, f)
names(stats) <- paste(names(A)[counter$Var1], names(B)[counter$Var2], sep = ",")
stats
Nesting the applies works, not the easiest syntax, though.
x<-data.frame(col1=c(1,2,3,4), col2=c(5,6,7,8), col3=c(9,10,11,12))
y<-data.frame(col1=c(1,2,3,4), col2=c(5,6,7,8))
z<-apply(x,2,function(col,df2)
{
apply(df2,2,function(col2,col1)
{
col2+col1
},col)
},y)
z
col1 col2 col3
[1,] 2 6 10
[2,] 4 8 12
[3,] 6 10 14
[4,] 8 12 16
[5,] 6 10 14
[6,] 8 12 16
[7,] 10 14 18
[8,] 12 16 20