Given LSD's and values - output significance letters - r

I've been looking for library functions to do this, but I'm surprised that I cannot find one.
There are quite a few stats functions in R that do the statistical test then output
a table that includes letters denoting significance groups, for example LSD.test.
(An example of how LSDs might be calculated and used to make multicomparison letters, used in a graph.)
There are others. All of the examples I could find tend to work from a model object, and then do their job. However, I already have the LSD values and the means -- and want to work directly with them. I've been looking for a common function that all of these multicomparison methods use to do this final step, but can't find one.
So, this is what I want to do... given the least significant difference between values (LSD) and the mean values themselves:
lsd <- 1.0
vals <- c(2,3,3.5,4,4.2,6.0)
I want to produce output something like:
2 a
3 b
3.5 bc
4 c
4.2 c
6.0 d
where values followed by the same letter are not significantly different, based on the least significant difference value.
Ideally, it would be best if it could handle the list of values un-ordered...
vals <- c(6.0, 2, 3.5, 4.0, 4.2, 3)
producing the output:
6.0 d
2 a
3.5 bc
4.0 c
4.2 c
3 b
I've been thinking that most of these LSD.test and multicompare functions
are probably using a base function to put together the letter list -- but I have not been able to find it.
Working through the problem, I think this does the trick, but it's pretty ugly...
lsd.letters <- function(vals, lsd) {
  # record the sort order of the values
  indx <- order(vals)
  # sorted values
  srt <- vals[indx]
  # pool of letters to assign
  lts <- letters
  # one (initially empty) letter string per value
  siglets <- rep("", length(vals))
  # single pass through the sorted means, starting with "a" for the lowest value
  itlet <- 1
  for (i in seq_along(vals)) {
    crnt <- srt[i]
    clet <- lts[itlet]
    # positions (from i onward) whose value is within the LSD of the current value
    ix <- which(srt[i:length(srt)] < (crnt + lsd)) + i - 1
    for (ix2 in ix) {
      newletter <- 0
      # a new letter is needed if this mean shares no letter with the current mean
      if (length(intersect(unlist(strsplit(siglets[i], "")), unlist(strsplit(siglets[ix2], "")))) == 0) {
        newletter <- 1
      }
    }
    if (newletter == 1) {
      siglets[ix] <- paste0(siglets[ix], clet)
      itlet <- itlet + 1
    }
  }
  # letters are returned in sorted order of vals, not in the input order
  siglets
}
It's ugly, and I am not yet sorting the output (sorting it is easy).
Is there a library function to do this? Or has anyone written a better approach to do this?
Thanks for your help!
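For what it's worth, here is a minimal sketch of a shorter version that also puts the letters back in the input order. It is only one possible approach, not a library function: the name lsd_letters is made up, and it assumes lsd > 0, no missing values, and no more than 26 groups. The idea is that, after sorting, every maximal run of values spanning less than the LSD gets its own letter.

```r
# Hypothetical helper (not from any package): letters for means given an LSD.
lsd_letters <- function(vals, lsd) {
  ord <- order(vals)
  srt <- vals[ord]
  n <- length(srt)
  # for each sorted position i, the last position whose value is within lsd of srt[i]
  last <- vapply(seq_len(n), function(i) max(which(srt < srt[i] + lsd)), integer(1))
  # a window i..last[i] is maximal only if it reaches further than the previous window
  starts <- which(last > c(0L, head(last, -1)))
  labs <- character(n)
  for (g in seq_along(starts)) {
    idx <- starts[g]:last[starts[g]]
    labs[idx] <- paste0(labs[idx], letters[g])
  }
  # undo the sort so the letters line up with the original vals
  out <- character(n)
  out[ord] <- labs
  out
}

lsd_letters(c(2, 3, 3.5, 4, 4.2, 6.0), 1)
lsd_letters(c(6.0, 2, 3.5, 4.0, 4.2, 3), 1)
```

With the two example vectors from the question this gives a, b, bc, c, c, d and d, a, bc, c, c, b respectively. Note that a difference exactly equal to the LSD is treated as significant here, as in the original function.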


initialise multiple variables at once in R [duplicate]

I am using the example of calculating the length of an arc of a circle and the area under that arc, based on the radius of the circle (r) and the angle of the arc (theta). The area and the length are both based on r and theta, and you can calculate them simultaneously in Python.
In python, I can assign two values at the same time by doing this.
from math import pi
def circle_set(r, theta):
    return theta * r, .5*theta*r*r
arc_len, arc_area = circle_set(1, .5*pi)
Implementing the same structure in R gives me this.
circle_set <- function(r, theta){
return(theta * r, .5 * theta * r *r)
}
arc_len, arc_area <- circle_set(1, .5*3.14)
But it returns this error:
arc_len, arc_area <- circle_set(1, .5*3.14)
Error: unexpected ',' in "arc_len,"
Is there a way to use the same structure in R?
No, you can't do that in R (at least, not in base or any packages I'm aware of).
The closest you could come would be to assign objects to different elements of a list. If you really wanted, you could then use list2env to put the list elements in an environment (e.g., the global environment), or use attach to make the list elements accessible, but I don't think you gain much from these approaches.
If you want a function to return more than one value, just put them in a list. See also r - Function returning more than one value.
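A minimal sketch of the list approach applied to the circle example from the question. The name res is just for illustration, and the list2env call at the end is optional; it copies the list elements into the current environment as ordinary variables.

```r
# Return both values in a named list instead of trying to return two objects.
circle_set <- function(r, theta) {
  list(arc_len = theta * r, arc_area = 0.5 * theta * r^2)
}

res <- circle_set(1, 0.5 * pi)
res$arc_len    # arc length, pi/2 here
res$arc_area   # sector area, pi/4 here

# Optional: turn the list elements into variables arc_len and arc_area.
list2env(res, envir = environment())
```

Accessing res$arc_len directly is usually clearer than unpacking, since it keeps the two results visibly tied to the call that produced them.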
You can assign multiple variables the same value as below. Even here, I think the code is unusual and less clear, and that this outweighs any benefit of brevity. (Though I suppose it makes it crystal clear that all of the variables are the same value... perhaps in the right context it makes sense.)
x <- y <- z <- 1
# the above is equivalent to
x <- 1
y <- 1
z <- 1
As Gregor said, there's no way to do it exactly as you said and his method is a good one, but you could also have a vector represent your two values like so:
# Function that adds one to every element and returns the whole vector.
plusOne <- function(vec) {
vec <- vec + 1
return(vec)
}
# Creating variables and applying the function.
x <- 1
y <- 2
z <- 3
vec <- c(x, y, z)
vec <- plusOne(vec)
So you could pack your values into a vector and have your function return a vector, which effectively fills 3 values at once. Again, not exactly what you want, just a suggestion.
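Applied to the circle example from the question, the vector idea might look like the sketch below. circle_set_vec and out are illustrative names only, and this only works because both results are numbers; for mixed types a list is the safer choice.

```r
# Return both results as a named numeric vector.
circle_set_vec <- function(r, theta) {
  c(arc_len = theta * r, arc_area = 0.5 * theta * r^2)
}

out <- circle_set_vec(1, 0.5 * pi)
out[["arc_len"]]    # pi/2
out[["arc_area"]]   # pi/4
```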

Indexing variables in R

I am normally a Maple user currently working with R, and I have a problem with correctly indexing variables.
Say I want to define 2 vectors, v1 and v2, and I want to call the nth element in v1. In Maple this is easily done:
v[1]:=some vector,
and the nth element is then called by the command
v[1][n].
How can this be done in R? The actual problem is as follows:
I have a sequence M (say of length 10, indexed by k) of simulated negbin variables. For each of these simulated variables I want to construct a vector X of length M[k] with entries given by some formula. So I should end up with 10 different vectors, each of different length. My incorrect code looks like this
sims<-10
M<-rnegbin(sims, eks_2016_kasko*exp(-2.17173), 840.1746)
for(k in 1:sims){
x[k]<-rep(NA,M[k])
X[k]<-rep(NA,M[k])
for(i in 1:M[k]){x[k][i]<-runif(1,min=0,max=1)
if(x[k][i]>=0 & x[i]<=0.1056379){
X[k][i]<-rlnorm(1, 6.228244, 0.3565041)}
else{
X[k][i]<-rlnorm(1, 8.910837, 1.1890874)
}
}
}
The error appears to be that x[k] is not a valid name for a variable. Any way to make this work?
Thanks a lot :)
I've edited your R script slightly to get it working and make it reproducible. To do this I had to assume that eks_2016_kasko was an integer value of 10.
require(MASS)
sims<-10
# Because R is not zero-indexed, add one
M<-rnegbin(sims, 10*exp(-2.17173), 840.1746) + 1
# Create a list
x <- list()
X <- list()
for(k in 1:sims){
  x[[k]]<-rep(NA,M[k])
  X[[k]]<-rep(NA,M[k])
  for(i in 1:M[k]){
    x[[k]][i]<-runif(1,min=0,max=1)
    if(x[[k]][i]>=0 & x[[k]][i]<=0.1056379){
      X[[k]][i]<-rlnorm(1, 6.228244, 0.3565041)}
    else{
      X[[k]][i]<-rlnorm(1, 8.910837, 1.1890874)
    }
  }
}
This will work, and I think it is what you were trying to do, but it is not great R code. I strongly recommend using the lapply family instead of for loops, and learning data.table and parallelisation if you need things to scale. Additionally, if you want to read more about indexing and subsetting in R, Hadley Wickham has a comprehensive breakdown here.
Hope this helps!
Let me start with a few remarks and then show you how your problem can be solved using R.
In R, there is most of the time no need to use a for loop in order to assign several values to a vector. So, for example, to fill a vector of length 100 with uniformly distributed random variables, you do something like:
set.seed(1234)
x1 <- rep(NA, 100)
for (i in 1:100) {
x1[i] <- runif(1, 0, 1)
}
(set.seed() is used to set the random seed, such that you get the same result each time.) It is much simpler (and also much faster) to do this instead:
set.seed(1234)  # reset the seed so that both versions draw the same numbers
x2 <- runif(100, 0, 1)
identical(x1, x2)
## [1] TRUE
As you see, results are identical.
The reason that x[k]<-rep(NA,M[k]) does not work is that x[k] is indeed not a variable name in R. [ is used for indexing, so x[k] refers to element k of a vector x. If x does not exist yet, the assignment fails with an "object not found" error; if it does exist, assigning a vector of length larger than 1 to a single element gives a warning and keeps only the first value. What you probably want to use is a list, as you will see in the example below.
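To make the parallel with Maple's v[1][n] concrete, here is a small sketch with made-up values. The names v and n are only for illustration.

```r
# A list can hold vectors of different lengths, one per element.
v <- list()
v[[1]] <- c(10, 20, 30)
v[[2]] <- c(1, 2)

n <- 2
v[[1]][n]   # nth element of the first vector, i.e. 20

# [[ returns the stored vector itself, [ returns a list of length one.
is.numeric(v[[1]])  # TRUE
is.list(v[1])       # TRUE
```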
So here comes the code that I would use instead of what you proposed in your post. Note that I am not sure that I correctly understood what you intend to do, so I will also describe below what the code does. Let me know if this fits your intentions.
# define M
library(MASS)
eks_2016_kasko <- 486689.1
sims<-10
M<-rnegbin(sims, eks_2016_kasko*exp(-2.17173), 840.1746)
# define the function that calculates X for a single value from M
calculate_X <- function(m) {
  x <- runif(m, min=0, max=1)
  # same branches as in the question: x <= 0.1056379 uses the first distribution
  X <- ifelse(x <= 0.1056379, rlnorm(m, 6.228244, 0.3565041),
              rlnorm(m, 8.910837, 1.1890874))
  X
}
# apply that function to each element of M
X <- lapply(M, calculate_X)
As you can see, there are no loops in that solution. I'll start to explain at the end:
lapply is used to apply a function (calculate_X) to each element of a list or vector (here it is the vector M). It returns a list. So, you can get, e.g. the third of the vectors with X[[3]] (note that [[ is used to extract elements from a list). And the contents of X[[3]] will be the result of calculate_X(M[3]).
The function calculate_X() does the following: It creates a vector of m uniformly distributed random values (remember that m runs over the elements of M) and stores that in x. Then it creates a vector X that contains log normally distributed random variables. The parameters of the distribution depend on the value x.
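One side benefit of drawing whole vectors at once, sketched here with made-up counts (M_demo and draw are illustrative names, not part of the code above): a count of zero simply gives an empty vector, so there is no need to add 1 to M. An explicit loop over 1:M[k] would misbehave in that case, because 1:0 counts down from 1 to 0.

```r
set.seed(42)
M_demo <- c(3, 0, 5)            # made-up counts, including a zero
draw <- function(m) runif(m)    # runif(0) returns an empty numeric vector
X_demo <- lapply(M_demo, draw)

lengths(X_demo)   # 3 0 5
1:0               # 1 0  (two iterations, not zero)
seq_len(0)        # integer(0), the safe way to write such a loop
```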

Categorical Features in Distance Matrix

I'm calculating the cosine similarity between two feature vectors and wondering if someone might have a neat solution to the below problem around categorical features.
Currently I have (example):
# define the similarity function
cosineSim <- function(x){
as.matrix(x%*%t(x)/(sqrt(rowSums(x^2) %*% t(rowSums(x^2)))))
}
# define some feature vectors
A <- c(1,1,0,0.5)
B <- c(1,1,0,0.5)
C <- c(1,1,0,1.2)
D <- c(1,0,0,0.7)
dataTest <- data.frame(A,B,C,D)
dataTest <- data.frame(t(dataTest))
dataMatrix <- as.matrix(dataTest)
# get similarity matrix
cosineSim(dataMatrix)
which works fine.
But say I want to add in a categorical variable such as city, to generate a feature that is 1 when two cities are equal and 0 otherwise.
In this case, example feature vectors would be:
A <- c(1,1,0,0.5,"Dublin")
B <- c(1,1,0,0.5,"London")
C <- c(1,1,0,1.2,"Dublin")
D <- c(1,0,0,0.7,"New York")
I'm wondering is there a neat way to generate the pairwise equality of the last feature on the fly within the function in a way that keeps it a vectorised implementation?
I have tried pre-processing to make binary flags for each category such that above example would become something like:
A <- c(1,1,0,0.5,1,0,0)
B <- c(1,1,0,0.5,0,1,0)
C <- c(1,1,0,1.2,1,0,0)
D <- c(1,0,0,0.7,0,0,1)
This works, but the problem is that I have to pre-process each variable, and in some cases I can see the number of categories becoming quite large. This seems quite expensive/inefficient when all I want is to generate a feature that returns 1 for equality and 0 otherwise (granted there is complexity here in that it is essentially a feature dependent on two records and shared between them).
One solution I can see is to just write a loop to build each pair of feature vectors (where I can build a feature such as [is_same_city]=1/0 and set to 1 for each vector when we have equality and 0 otherwise) and then get distance - but this approach will kill me when I try to scale.
I am hoping my R skills are not well enough developed and there is a neat solution that ticks most of the boxes...
Any suggestions at all are very welcome, Thanks
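One possible vectorised sketch of the [is_same_city] idea described above, offered as an assumption-laden illustration rather than a recommended metric. outer() with "==" builds the pairwise equality matrix in one call; that 0/1 value is then treated as an extra coordinate shared by both vectors of a pair, so it is added to the dot product and to both squared norms. The function name cosine_sim_with_match is made up, the numeric features are kept in a matrix and the category in a separate vector, and rows that are all zero would still give NaN.

```r
# Hypothetical helper: cosine similarity with a pairwise "same category" feature.
cosine_sim_with_match <- function(x, category) {
  same <- outer(category, category, "==") * 1   # n x n matrix of 0/1
  sq <- rowSums(x^2)                            # squared norms of the numeric part
  n <- length(sq)
  sq_i <- matrix(sq, n, n)                      # sq_i[i, j] = sq[i]
  num <- tcrossprod(x) + same                   # dot products plus the shared feature
  den <- sqrt((sq_i + same) * (t(sq_i) + same))
  num / den
}

x <- rbind(A = c(1, 1, 0, 0.5),
           B = c(1, 1, 0, 0.5),
           C = c(1, 1, 0, 1.2),
           D = c(1, 0, 0, 0.7))
city <- c("Dublin", "London", "Dublin", "New York")
S <- cosine_sim_with_match(x, city)
S
```

One consequence of this definition is that A and B still get similarity 1, because the shared feature is 0 for both when the cities differ. If a mismatch should lower the similarity, the one-hot encoding from the question behaves differently and may be the better fit.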

R - vectorizing a which operation

Hi I have a function in R that I'm trying to optimize for performance. I need to vectorize a for loop. My problem is the slightly convoluted data structure and the way I need to perform lookups using the 'which' command.
Let's say we are dealing with 5 elements (1,2,3,4,5). The 10x2 matrix pairs holds all unique pairs of the 5 elements (i.e. (1,2), (1,3), (1,4) .... (4,5)). all_prods is a 10x1 matrix that I need to look up using the pairs while iterating through all the 5 elements.
So for 1, I need to index rows 1, 2, 3, 4 (pairs 1,2 1,3 1,4 and 1,5) from all_prods and so on for 1, 2, 3, 4, 5.
I have only recently switched to R from MATLAB so would really appreciate any help.
foo <- function(AA , BB , CC ){
pa <- AA*CC;
pairs <- t(combn(seq_len(length(AA)),2));
all_prods <- pa[pairs[,1]] * pa[pairs[,2]];
result <- matrix(0,1,length(AA));
# WANT TO VECTORIZE THIS BLOCK
for(st in seq(from=1,to=length(AA))){
result[st] <- sum(all_prods[c(which(pairs[,1]==st), which(pairs[,2]==st))])*BB[st];
}
return(result);
}
AA <- seq(from=1,to=5); BB<-seq(from=11,to=15); CC<-seq(from=21,to=25);
results <- foo(AA,BB,CC);
#final results is [77154 164208 256542 348096 431250]
I want to convert the for loop into a vectorised version. Instead of looping through every element st, I'd like to do it in one command that gives me a results vector (rather than building it up element by element)
You could write your function like this:
foo <- function(AA, BB, CC) {
  pa <- AA*CC
  x <- outer(pa, pa)
  diag(x) <- 0
  res <- colSums(x)*BB
  return(res)
}
The key idea is to not break the symmetry. Your use of ordered pairs corresponds to the upper right triangle of my matrix x. Although this seems like just half as many values to compute, the syntactic and computational overhead becomes quite large. You are distinguishing situations where st is the first element in the pair from those where it is the second. Later on this leads to quite some trouble to get rid of that distinction. Having the full symmetric matrix, you don't have to worry about order, and things vectorize smoothly.
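Taking the symmetry argument one step further: each column sum of the zero-diagonal matrix is just sum(pa) minus the element itself, so the matrix is not strictly needed. A small sketch comparing the two, where foo_outer and foo_sum are illustrative names and as.numeric is only there to avoid integer overflow on larger inputs:

```r
# Matrix version, as in the answer above.
foo_outer <- function(AA, BB, CC) {
  pa <- AA * CC
  x <- outer(pa, pa)
  diag(x) <- 0
  colSums(x) * BB
}

# Same result without building the n x n matrix.
foo_sum <- function(AA, BB, CC) {
  pa <- as.numeric(AA) * CC
  pa * (sum(pa) - pa) * BB
}

AA <- 1:5; BB <- 11:15; CC <- 21:25
foo_outer(AA, BB, CC)
foo_sum(AA, BB, CC)   # 77154 164208 256542 348096 431250
```

The second form uses memory proportional to length(AA) rather than its square, which matters once the vectors get long.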

Numbering elements in a vector

I would like to number the elements of a vector, assigning '1' to the smallest element in the vector. I know how to do this, but my solution (code included below) seems overly complex. Is there a much simpler solution?
In my example below there are 5 unique numbers in the vector 'data'. The number 3 is the smallest and should be assigned the number '1'; the number 100 is the largest and should be assigned the number '5'.
The desired solution for the vector 'data' is: c(2,3,4,4,3,1,5).
data <- c(5,8,12,12,8,3,100)
unique.numbers <- sort(unique(data))
numbering <- seq(1:length(unique(data)))
template <- cbind(numbering,unique.numbers)
output <- rep(NA, length(data))
for(i in 1:length(data)) {
for(j in 1:dim(template)[1]) {
if(data[i]==template[j,2]) output[i]=j
}
}
output
Thank you for any advice. I am trying to become more efficient with my programming.
Mark Miller
More compact version of your program.
dat <- c(5,8,12,12,8,3,100)
dat_sorted <- sort(unique(dat))
match(dat,dat_sorted)
If you're using numeric or integer data you can use as.numeric(factor())
dat <- c(5,8,12,12,8,3,100)
as.numeric(factor(dat))
Also, as a side note, you should avoid using data as a variable name in R since it's already a built-in function.
Another possibility is:
> rank(data)
[1] 2.0 3.5 5.5 5.5 3.5 1.0 7.0
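See the "ties.method" argument for how ties are handled. Note that with the default (average ranks for ties) this is not the consecutive numbering asked for in the question.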
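A quick side-by-side on the vector from the question, as a sketch: the match() approach gives the consecutive ("dense") numbering that was asked for, while rank() with ties.method = "min" still skips numbers after a tie. The name dense is only for illustration.

```r
dat <- c(5, 8, 12, 12, 8, 3, 100)

# Position of each value among the sorted unique values.
dense <- match(dat, sort(unique(dat)))
dense                            # 2 3 4 4 3 1 5

# rank() leaves gaps after ties, so it is not a drop-in replacement.
rank(dat, ties.method = "min")   # 2 3 5 5 3 1 7
```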
