Despite reading the documentation, I'm struggling to understand how the function argument works in the combn utility.
I have a table with two columns of data. For each column, I want to calculate the ratio of each unique combination of data pairs in that column. Let's just focus on one column for simplicity:
V1
1 342.3
2 123.5
3 472.0
4 678.3
...
14 567.2
I can use the following to return all the unique combinations:
combn(table[,1], 2)
but of course this just returns each pair of values. I want to divide them to get a ratio, but can't seem to figure out how to set this up.
I understand that for something like outer, for example, you can just provide the operator as the argument but how does this transfer to combn?
combn(table[,1], 2, FUN = "/")
# obviously not correct
The issue is that the function will receive exactly one parameter, and that parameter will be a vector of the elements in that particular combination. The / function requires two separate parameters, not a single vector of values. Instead you could write
combn(table[,1], 2, FUN = function(x) x[1]/x[2])
So here we get one parameter x and we divide the first value by the second.
Other functions such as
combn(1:4, 2, FUN = sum)
work just fine because they expect to receive a single vector of values.
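As a quick illustration of the point above, here is a minimal sketch using the four values visible in the question. The names v, pairs and result are made up for the example. It computes each ratio and keeps the index pair it came from alongside it:

```r
# Four of the values shown in the question; `v` is an illustrative name
v <- c(342.3, 123.5, 472.0, 678.3)

# FUN receives one vector of length 2 per combination
ratios <- combn(v, 2, FUN = function(x) x[1] / x[2])

# Index pairs in the same order, so each ratio can be traced back
pairs <- combn(seq_along(v), 2)
result <- data.frame(first = pairs[1, ], second = pairs[2, ], ratio = ratios)
print(result)
```

Since combn only produces each unordered pair once, you get first/second but not second/first; take 1 / ratio if you need the inverse.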
I have a data frame, for example:
name age
1 "Danny" 20
2 "Mitt" 35
3 "Dylan" 8
When I get a new entry, I want to add it to this df.
I have used nrow(df) + 1 for the next row:
df[nrow(df) + 1, ] <- c("Tom", 4)
Is there any other way to do this?
You may use rbind:
rbind(df,list("Tom",4))
See ?rbind:
The functions cbind and rbind are S3 generic, with methods for data
frames. The data frame method will be used if at least one argument is
a data frame and the rest are vectors or matrices. There can be other
methods; in particular, there is one for time series objects. See the
section on ‘Dispatch’ for how the method to be used is selected. If
some of the arguments are of an S4 class, i.e., isS4(.) is true, S4
methods are sought also, and the hidden cbind / rbind functions from
package methods maybe called, which in turn build on cbind2 or rbind2,
respectively. In that case, deparse.level is obeyed, similarly to the
default method.
In the default method, all the vectors/matrices must be atomic (see
vector) or lists. Expressions are not allowed. Language objects (such
as formulae and calls) and pairlists will be coerced to lists: other
objects (such as names and external pointers) will be included as
elements in a list result. Any classes the inputs might have are
discarded (in particular, factors are replaced by their internal
codes).
If there are several matrix arguments, they must all have the same
number of columns (or rows) and this will be the number of columns (or
rows) of the result. If all the arguments are vectors, the number of
columns (rows) in the result is equal to the length of the longest
vector. Values in shorter arguments are recycled to achieve this
length (with a warning if they are recycled only fractionally).
Let me suggest the add_row function from the tibble package. You could do it simply as follows:
df = add_row (df, name="Tom", age=4)
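One thing worth checking with any of these approaches is whether the column types survive. The sketch below uses base R only and rebuilds the example data frame from the question (new_row, df2 and df3 are illustrative names). It compares appending a one-row data frame via rbind with assigning a c() vector, which is a character vector because c() coerces "Tom" and 4 to a common type:

```r
df <- data.frame(name = c("Danny", "Mitt", "Dylan"),
                 age = c(20, 35, 8),
                 stringsAsFactors = FALSE)

# Appending a one-row data frame keeps each column's type
new_row <- data.frame(name = "Tom", age = 4, stringsAsFactors = FALSE)
df2 <- rbind(df, new_row)
str(df2)

# c("Tom", 4) is a character vector, so age gets converted to character
df3 <- df
df3[nrow(df3) + 1, ] <- c("Tom", 4)
class(df3$age)
```

Using list("Tom", 4) instead of c("Tom", 4) avoids the coercion, since a list can hold elements of different types.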
I am trying to change the final correcttot function from a for loop to apply, but have been running into issues in trying to get the apply function to take the underlying values in df, the array which I will be applying it to.
correcttot<-function(v,p,r){
df<-expand.grid(i=1:10,j=1:10,k=1:10,l=1:10,m=2:10,n=2:10,o=1:10)
df$correct3<-0
df$correct3<- apply(df, 1:7, function(x)
percentcorrect((x$i)/10,(x$j)/10,(x$k)*20,(x$l)*20,x$m,x$n,x$o,v,p,r)
)
df$correct3
}
newvec2<-correcttot(v,p,r)
The second argument of apply is not the column numbers, it's the number of the dimension. Your data frame only has two dimensions: rows (1) and columns (2).
For your analysis, set the second argument to 1 indicating you're applying the function to each row.
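A small sketch of what that looks like, with one extra caveat: inside apply each row arrives as a named atomic vector, so x$i fails ("$ operator is invalid for atomic vectors") and you need x["i"] instead. Since percentcorrect isn't shown in the question, the function score below is a hypothetical stand-in, and the grid is shrunk to three columns:

```r
# Hypothetical stand-in for the asker's percentcorrect()
score <- function(i, j, k) i / 10 + j * k

grid <- expand.grid(i = 1:3, j = 1:2, k = 1:2)

# MARGIN = 1: the function is called once per row;
# each row is a named vector, so index by name with [ ]
res_apply <- unname(apply(grid, 1, function(x) score(x["i"], x["j"], x["k"])))

# Alternative that skips the row-to-vector conversion entirely
res_mapply <- mapply(score, grid$i, grid$j, grid$k)

grid$correct3 <- res_apply
head(grid)
```

mapply is often the more natural fit here because it passes the columns as separate arguments, which is what a function like percentcorrect expects.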
Just been playing with some basic functions and it seems rather strange how ifelse behaves if I use which() function as one of the arguments when the ifelse condition is true, e.g.:
#I want to identify the location of all values above 6.5
#only if there are more than 90 values in the vector a:
set.seed(100)
a <- rnorm(100, mean=5, sd=1)
ifelse(length(a)>90, which(a>6.5), NA)
I get this output:
[1] 4
When in fact it should be the following:
[1] 4 15 25 40 44 47 65
How then can I make ifelse return the correct values using which() function?
It seems it only outputs the first value that matches the condition. Why does it do that?
You actually don't want to use ifelse in this case. As BondedDust pointed out, you should think of ifelse as a function that takes three vectors and picks values out of the second two based on the TRUE/FALSE values in the first. Or, as the documentation puts it:
ifelse returns a value with the same shape as test which is filled
with elements selected from either yes or no depending on whether the
element of test is TRUE or FALSE.
You probably simply wanted to use a regular if statement instead.
One potential confusion with ifelse is that it does recycle arguments. Specifically, if we do
ifelse(rnorm(10) < 0,-1,1)
you'll note that the first argument is a logical vector of length 10, but our second two "vectors" are both of length one. R will simply extend them as needed to match the length of the first argument. This will happen even if the lengths are not evenly extendable to the correct length.
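To make the difference concrete, here is a short sketch built on the code from the question (short and idx are illustrative names). It shows that ifelse returns one element because the test has length one, that a plain if/else returns the full result of which(), and how recycling behaves when the lengths don't divide evenly:

```r
set.seed(100)
a <- rnorm(100, mean = 5, sd = 1)

# test has length 1, so ifelse() returns a single element
short <- ifelse(length(a) > 90, which(a > 6.5), NA)
length(short)  # 1

# A plain if/else returns whatever the chosen branch evaluates to
idx <- if (length(a) > 90) which(a > 6.5) else NA
print(idx)

# Recycling: yes (length 2) is stretched to the length of test (3)
ifelse(c(TRUE, FALSE, TRUE), 1:2, 0)  # 1 0 1
```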
My data frame (m*n) has a few hundred columns. I need to compare each column with all other columns (contingency table), perform a chisq test, and save the results for each column in a different variable.
It's working for one column at a time, like this:
s <- function(x) {
a <- table(x,data[,1])
b <- chisq.test(a)
}
c1 <- apply(data,2,s)
The results are stored in c1 for column 1, but how will I loop this over all columns and save result for each column for further analysis?
If you're sure you want to do this (I wouldn't, thinking about the multitesting problem), work with lists :
Data <- data.frame(
x=sample(letters[1:3],20,TRUE),
y=sample(letters[1:3],20,TRUE),
z=sample(letters[1:3],20,TRUE)
)
# Make a nice list of indices
ids <- combn(names(Data),2,simplify=FALSE)
# use the appropriate apply
my.results <- lapply(ids,
function(z) chisq.test(table(Data[,z]))
)
# use some paste voodoo to give the results the names of the column indices
names(my.results) <- sapply(ids,paste,collapse="-")
# select all values for y :
my.results[grep("y",names(my.results))]
Not harder than that. As I show you in the last line, you can easily get all tests for a specific column, so there is no need to make a list for each column. That just takes longer and takes more space, but gives the same information. You can write a small convenience function to extract the data you need :
extract <- function(col,l){
l[grep(col,names(l))]
}
extract("y",my.results)
This means you can even loop over the column names of your data frame and get a list of lists returned :
lapply(names(Data),extract,my.results)
I strongly suggest you get yourself acquainted with working with lists, they're one of the most powerful and clean ways of doing things in R.
PS : Be aware that you save the whole chisq.test object in your list. If you only need the value for Chi square or the p-value, select them first.
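For the PS, a minimal sketch of keeping only the p-values might look like this. It reuses the Data and ids setup from above, adds a seed so the run is reproducible, and wraps the test in suppressWarnings because chisq.test tends to warn about the approximation with samples this small (p.values is an illustrative name):

```r
set.seed(1)
Data <- data.frame(
  x = sample(letters[1:3], 20, TRUE),
  y = sample(letters[1:3], 20, TRUE),
  z = sample(letters[1:3], 20, TRUE),
  stringsAsFactors = FALSE
)

ids <- combn(names(Data), 2, simplify = FALSE)

# Keep just the p-value of each test instead of the whole htest object
p.values <- sapply(ids, function(z) {
  suppressWarnings(chisq.test(table(Data[, z]))$p.value)
})
names(p.values) <- sapply(ids, paste, collapse = "-")
print(p.values)
```

The result is a plain named numeric vector, which is much easier to sort, filter or correct for multiple testing than a list of test objects.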
Fundamentally, you have a few problems here:
You're relying heavily on global variables rather than function arguments. This makes the double usage of "data" confusing.
Similarly, you rely on a hard-coded value (column 1) instead of passing it as an argument to the function.
You're not extracting the one value you need from the chisq.test(). This means your result gets returned as a list.
You didn't provide any example data. So here's some:
m <- 10
n <- 4
mytable <- matrix(runif(m*n),nrow=m,ncol=n)
Once you fix the above problems, simply run a loop over various columns (since you've now avoided hard-coding the column) and store the result.
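As a rough sketch of what the fixed version could look like: the helper pair_pvalue and the data frame dat below are hypothetical, and I've assumed categorical columns since that is what a contingency table needs. Both the data and the two column indices are passed as arguments, and only the p-value is kept:

```r
# Hypothetical helper: data and both columns are arguments, nothing is global
pair_pvalue <- function(dat, i, j) {
  suppressWarnings(chisq.test(table(dat[, i], dat[, j]))$p.value)
}

set.seed(42)
dat <- data.frame(
  a = sample(letters[1:3], 30, TRUE),
  b = sample(letters[1:3], 30, TRUE),
  c = sample(letters[1:3], 30, TRUE),
  stringsAsFactors = FALSE
)

# Loop over every pair of columns and store the p-values in a matrix
n <- ncol(dat)
pmat <- matrix(NA_real_, n, n, dimnames = list(names(dat), names(dat)))
for (i in seq_len(n - 1)) {
  for (j in (i + 1):n) {
    pmat[i, j] <- pmat[j, i] <- pair_pvalue(dat, i, j)
  }
}
print(pmat)
```

Row (or column) k of pmat then holds every result involving column k, so there is no need for a separate variable per column.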
What if one wants to apply a function, e.g. to each row of a matrix, but also wants to use the number of that row as an argument to the function? As an example, suppose you wanted to get the n-th root of the numbers in each row of a matrix, where n is the row number. Is there another way (using apply only) than column-binding the row numbers to the initial matrix, like this?
test <- data.frame(x=c(26,21,20),y=c(34,29,28))
t(apply(cbind(as.numeric(rownames(test)),test),1,function(x) x[2:3]^(1/x[1])))
P.S. Actually if test was really a matrix : test <- matrix(c(26,21,20,34,29,28),nrow=3) , rownames(test) doesn't help :(
Thank you.
What I usually do is to run sapply on the row numbers 1:nrow(test) instead of test, and use test[i,] inside the function:
t(sapply(1:nrow(test), function(i) test[i,]^(1/i)))
I am not sure this is really efficient, though.
If you give the function a name rather than making it anonymous, you can pass arguments more easily. We can use nrow to get the number of rows and pass a vector of the row numbers in as a parameter, along with the frame to be indexed this way.
For clarity I used a different example function; this example multiplies column x by column y for a two-column data frame:
test <- data.frame(x=c(26,21,20),y=c(34,29,28))
myfun <- function(position, df) {
print(df[position,1] * df[position,2])
}
positions <- 1:nrow(test)
lapply(positions, myfun, test)
cbind()ing the row numbers seems a pretty straightforward approach. For a matrix (or a data frame) the following should work:
apply( cbind(1:(dim(test)[1]), test), 1, function(x) plot(x[-1], main=x[1]) )
or whatever you want to plot.
Actually, in the case of a matrix, you don't even need apply. Just:
test^(1/row(test))
does what you want, I think. I think the row() function is the thing you are looking for.
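A quick check of that claim, using the matrix version of test from the question's P.S. (roots and via_sapply are illustrative names). It compares the row() one-liner against the sapply approach from the other answer:

```r
test <- matrix(c(26, 21, 20, 34, 29, 28), nrow = 3)

# row(test) is a matrix of the same shape holding each cell's row number
row(test)

# Element-wise n-th root, where n is the row number
roots <- test^(1 / row(test))

# Same computation done row by row for comparison
via_sapply <- t(sapply(seq_len(nrow(test)), function(i) test[i, ]^(1 / i)))

all.equal(roots, via_sapply)
```

Row 1 is left unchanged (first root), row 2 holds square roots, and row 3 holds cube roots.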
I'm a little confused, so excuse me if I get this wrong, but you want to work out the n-th root of the numbers in each row of a matrix, where n is the row number. If this is the case then it's really simple: create a new array with the same dimensions as the original, with each column holding the row numbers:
test_row_order = array(seq_len(nrow(test)), dim = dim(test))
Then simply apply a function (the n-th root in this case):
n_root = test^(1/test_row_order)