Related
Context
I asked this question recently:
Comparing partitions from split() using a nested for loop containing an if statement
where I needed to compare partitions generated by split() from a distance matrix using the code fix provided by #robertdj
set.seed(1234) # set random seed for reproducibility
# generate random normal variates
x <- rnorm(5)
y <- rnorm(5)
df <- data.frame(x, y) # merge vectors into dataframe
d <- dist(x) # generate distance matrix
splt <- split(d, 1:5) # split data with 5 values in each partition
for (i in 1:length(splt)) {
for (j in 1:length(splt)) {
if (i != j) {
a <- length(which(splt[[i]] >= min(splt[[j]]))) / length(splt[[i]])
b <- length(which(splt[[j]] <= max(splt[[i]]))) / length(splt[[j]])
}
}
}
I generated a MWE where each split contained the same number of elements. I did this just for illustrative purposes, fully knowing that this would not necessarily hold for real data.
As per #Robert Hacken's comment if I instead do
d <- na.omit(d[lower.tri(d)])
I get partitions of unequal length.
Real Data
However my real data does not have the "same size" property. My real data contains many more partitions than only 5 in my MWE.
Here is my code
splt <- split(dist_matrix, sub("(?:(.*)\\|){2}(\\w+)\\|(\\w+)\\|.*?$", "\\1-\\2", colnames(dist_matrix)))
The distance matrix dist_matrix contains FASTA headers from which I extract the species names.
I then use splt above in the doubly nested loop.
For instance, splt[[4]] contains 5 values, whereas splt[[10]] contains 9.
splt[[4]]
[1] 0.1316667 0.1383333 0.1166667 0.1333333 0.1216667
splt[[10]]
[1] 0.1450000 0.1483333 0.1316667 0.1316667 0.1333333 0.1333333 0.1166667 0.1166667 0.1200000
Expected Output
For my real problem, each partition corresponds to distances for a single species to all other unique species. So, if Species X has two DNA sequences representing it and there are 10 species in total, the partition for Species X should contain 20 distances. However I don't want the partition to include the distance between the two sequences for species A.
splt would thus contain 10 partitions (each not necessarily of the same length) for all species
The expected output of a and b is a number between 0-1 inclusive. I think these numbers should be small in my real example, but they are large when I try to run my code, which I think is a consequence of the warning().
What I've Done
I've read on SO that %in% is typically used to resolve the warning
In splt[[i]] == splt[[j]] :
longer object length is not a multiple of shorter object length
except in my case, I believe I would need %notin% <- Negate(%in%).
However, %notin% gives the error in my original post
the condition has length > 1
Question
How can my nested loop be altered to remove the warning?
I'm going to go out on a limb by interpreting parts of what you say, discarding your code, and seeing what I can come up with. If nothing else, it may spark conversation to explain what about my interpretations are correct (and which are incorrect).
Starting with the splt as generated by the random data, then replacing elements 4 and 5 with longer vectors,
set.seed(1234)
x <- rnorm(5)
y <- rnorm(5)
df <- data.frame(x, y)
d <- dist(x)
splt <- split(d, 1:5)
splt[[4]] <- rnorm(4)
splt[[5]] <- rnorm(10)
We have:
splt <- list("1" = c(1.48449499149608, 2.62312694474001), "2" = c(2.29150692606848, 0.15169544670039), "3" = c(1.13863195324393, 3.43013887931241), "4" = c(-0.477192699753547, -0.998386444859704, -0.77625389463799, 0.0644588172762693), "5" = c(-0.693720246937475, -1.44820491038647, 0.574755720900728, -1.02365572296388, -0.0151383003641817, -0.935948601168394, 1.10229754620026, -0.475593078869057, -0.709440037512506, -0.501258060594761))
splt
# $`1`
# [1] 1.484495 2.623127
# $`2`
# [1] 2.2915069 0.1516954
# $`3`
# [1] 1.138632 3.430139
# $`4`
# [1] -0.47719270 -0.99838644 -0.77625389 0.06445882
# $`5`
# [1] -0.6937202 -1.4482049 0.5747557 -1.0236557 -0.0151383 -0.9359486 1.1022975 -0.4755931 -0.7094400 -0.5012581
You reference expressions like which(splt[[i]] >= min(splt[[j]])), which I'm interpreting to mean *"what is the ratio of splt[[i]] that is above the max value in splt[[j]]. Since we're comparing (for example) splt[[1]] with all of splt[[2]] through splt[[5]] here, and likewise for the others, we're going to have a square matrix where the diagonal is splt[[i]]-vs-splt[[i]] (likely not interesting).
Some quick math so we know what we should end up with:
splt[[1]]
# [1] 1.484495 2.623127
range(splt[[2]])
# [1] 0.1516954 2.2915069
Since 1 from [[1]] is greater than 2's max of 2.29, we expect 0.5 in a comparison between the two (for >= max(.)); similarly, none of [[1]] is below 0.15, so we expect a 0 there.
Similarly, [[5]] over [[4]]:
splt[[5]]
# [1] -0.6937202 -1.4482049 0.5747557 -1.0236557 -0.0151383 -0.9359486 1.1022975 -0.4755931 -0.7094400 -0.5012581
range(splt[[4]])
# [1] -0.99838644 0.06445882
### 2 of 10 are greater than the max
sum(splt[[5]] >= max(splt[[4]])) / length(splt[[5]])
# [1] 0.2
### 9 of 10 are lesser than the min
sum(splt[[5]] <= min(splt[[4]])) / length(splt[[5]])
# [1] 0.2
We can use outer, but sometimes that can be confusing, especially since in this case we'd need to Vectorize the anon-func passed to it. I'll adapt your double-for loop premise into nested sapply calls.
Greater than the other's max
sapply(splt, function(y) sapply(setNames(splt, paste0("max", seq_along(splt))), function(z) sum(y >= max(z)) / length(y)))
# 1 2 3 4 5
# max1 0.5 0.0 0.5 0.00 0.0
# max2 0.5 0.5 0.5 0.00 0.0
# max3 0.0 0.0 0.5 0.00 0.0
# max4 1.0 1.0 1.0 0.25 0.2
# max5 1.0 0.5 1.0 0.00 0.1
Interpretation and subset validation:
1 with max of 2: comparing [[1]] (first column) with the max value from [[2]] (second row), half of 1's values are greater, so we have 0.5 (as expected).
5 with max of 4: comparing [[5]] (fifth column) with the max value from [[4]] (fourth row), 0.2 meet the condition.
Less than the other's min
sapply(splt, function(y) sapply(setNames(splt, paste0("min", seq_along(splt))), function(z) sum(y <= min(z)) / length(y)))
# 1 2 3 4 5
# min1 0.5 0.5 0.5 1.00 1.0
# min2 0.0 0.5 0.0 1.00 0.8
# min3 0.0 0.5 0.5 1.00 1.0
# min4 0.0 0.0 0.0 0.25 0.2
# min5 0.0 0.0 0.0 0.00 0.1
Same two pairs:
1 with min of 2 (row 2, column 1) is 0, as expected
5 with min of 4 (row 4, column 5) is 0.2, as expected
Edit: #compbiostats pointed out that while sum(..) should produce the same results as length(which(..)), the latter may be more robust to missing-data (e.g., NA values, c.f., Difference between sum(), length(which()), and nrow() in R). For sum(..) to share that resilience, we should add na.rm=TRUE) to both sum(.) and min(.) in the above calls. Thanks #compbiostats!
By definition, the entropy is defined as:
entropy <- function (p) sum(-p * log(p))
I'm performing LCA using the poLCA package and trying to calculate entropy, which for some of my models are outputting NaN.
error_prior <- entropy(lca2$P) # Class proportions model 2
error_post <- mean(apply(lca2$posterior, 1, entropy), na.rm = TRUE)
results[2,8] <- round(((error_prior - error_post) / error_prior), 3)
From the answer to this question: Entropy output is NaN for some class solutions and not others, I learnt that it is caused by zeros in p and it can be resolved by adding na.omit to the function as follows:
entropy <- function (p) sum(na.omit(-p * log(p)))
My question is - is this technical tweak mathematically valid without affecting the integrity of the calculation?
In my case, around 1/3 of the values in p are zeros. I'm really unsure if I should use na.omit or find another way to resolve this problem.
It is valid, but not transparent at first glance. The reason is that the mathematical limit of xlog(x) as x -> 0 is 0 (we can prove this using L'Hospital Rule). In this regard, the most robust definition of the function should be
entropy.safe <- function (p) {
if (any(p > 1 | p < 0)) stop("probability must be between 0 and 1")
log.p <- numeric(length(p))
safe <- p != 0
log.p[safe] <- log(p[safe])
sum(-p * log.p)
}
But simply dropping p = 0 cases gives identical results, because the result at p = 0 is 0 and contributes nothing to the sum anyway.
entropy.brutal <- function (p) {
if (any(p > 1 | p < 0)) stop("probability must be between 0 and 1")
log.p <- log(p)
## as same as sum(na.omit(-p * log.p))
sum(-p * log.p, na.rm = TRUE)
}
## p has a single 0
( p <- seq(0, 1, by = 0.1) )
#[1] 0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0
entropy.brutal(p)
#[1] 2.455935
entropy.safe(p)
#[1] 2.455935
## half of p are zeros
p[1:5] <- 0
p
#[1] 0.0 0.0 0.0 0.0 0.0 0.5 0.6 0.7 0.8 0.9 1.0
entropy.brutal(p)
#[1] 1.176081
entropy.safe(p)
#[1] 1.176081
In conclusion, we can use either entropy.brutal or entropy.safe.
I have multiple files with the same structure. I would like to calculate the sum of each first element of each file.
Here is the head of one file:
> head(NodesA1to10)
NodeAMean NodeBMean NodeCMean NodeBMeanclo NodeCMeanclo NodeAMeanclo NodeBMeanin NodeCMeanin
1 0.6 0.5 0.1 0.06978355 0.08988359 0.06957624 0.3890288 0.2389382
2 0.6 1.1 0.4 0.04403793 0.04955660 0.04052408 0.4039787 0.2533702
3 0.5 1.6 0.4 0.03244537 0.03386485 0.02925561 0.4301655 0.2714006
4 0.6 2.7 1.3 0.02629242 0.02545891 0.02101667 0.4710782 0.2987464
5 1.0 2.9 1.4 0.02194354 0.02050273 0.01643439 0.5221127 0.3531018
6 1.8 5.1 2.0 0.01895985 0.01674231 0.01369418 0.5651066 0.3922610
NodeAMeanin NodeBMeanout NodeCMeanout NodeAMeanout
1 0.3550979 0.3236412 0.3807460 0.2786777
2 0.6173852 0.4587713 0.4784292 0.4671891
3 0.7020968 0.5933347 0.5594147 0.5536033
4 0.7686962 0.7336590 0.6230537 0.5872133
5 0.8024617 0.8738159 0.7147912 0.6457293
6 0.8289504 1.0158102 0.7665834 0.7186818
I have in total 10 files with the same format. I would like to calculate sum in as below:
N<-10
B1_1 <-sum(NodesA1to10$NodeBMean[1],NodesA11to20$NodeBMean[1],NodesA21to30$NodeBMean[1],NodesA31to40$NodeBMean[1],
NodesA41to50$NodeBMean[1],NodesA61to70$NodeBMean[1],NodesA71to80$NodeBMean[1],NodesA81to90$NodeBMean[1],
NodesA91to100$NodeBMean[1])/N
B1_2 <-sum(NodesA1to10$NodeBMean[2],NodesA11to20$NodeBMean[2],NodesA21to30$NodeBMean[2],NodesA31to40$NodeBMean[2],
NodesA41to50$NodeBMean[2],NodesA61to70$NodeBMean[2],NodesA71to80$NodeBMean[2],NodesA81to90$NodeBMean[2],
NodesA91to100$NodeBMean[2])/N
This code works for me, but I have to repeat it 10 times and further do it for different conditions, therefore I would really like to do it in faster way. I tried to use list:
allNodesA <- list(NodesA1to10,NodesA11to20,NodesA21to30,
NodesA31to40,NodesA41to50,NodesA51to60,NodesA61to70,
NodesA71to80,NodesA81to90,NodesA91to100)
B1<-lapply(allNodesA, function(z) { z$NodeBMean <- sum(z$NodeBMean[[1]]);z})
NodesBb<-do.call(rbind.data.frame, B1)
B1<-list()
for (j in 1:10)
{
B1[[j]]<-lapply(allNodesA, function(z) { sum(z[[j]]$NodeBMean[j])})
}
But this solution gives me an error:
Error in z[[j]]$NodeBMean : $ operator is invalid for atomic vectors
Could someone please help me how to find more optimal ways to get a sum of every 1st, 2nd....10th element from many files?
Something like this may work for you.
N <-10
allNodesA <- list(NodesA1to10,NodesA11to20,NodesA21to30,
NodesA31to40,NodesA41to50,NodesA51to60,NodesA61to70,
NodesA71to80,NodesA81to90,NodesA91to100)
vals <- numeric()
for(i in 1:nrow(NodesA1to10)){
vals <- append(vals, sum(sapply(allNodesA, function(x) x[i,"NodeBMean"])) / N)
}
vals
I have a long vector in R in which successive value often repeat themselves. For example
x = c(rep(0.2, 1500), rep(0.1, 10007), 0.7, 0.9, rep(0.1, 9784))
I am trying to write a function that takes this vector as input and return either of the two following strings
s = "R 0.2 1500 R 0.1 10007 R 0.7 1 R 0.9 1 R 0.1 9784"
or preferentially
s = "R 0.2 1500 R 0.1 10007 0.7 0.9 R 0.1 9784"
, where R 0.7 1 R 0.9 1 became 0.7 0.9.
For your intuition, R stands for repeat or rep. The string therefore closely ressemble the way I constructed the vector x to start with.
I tried looping through each value but that was too slow for my needs. Can you help me out finding a fast solution?
#Data
x = c(rep(0.2, 1500), rep(0.1, 10007), 0.7, 0.9, rep(0.1, 9784))
#Run rle and paste values and lengths together
y = paste("R", rle(x)$values, rle(x)$lengths)
#There may be an easier way to do this using regex
#But here is one solution using strsplit
#Remove 1 and R
y = sapply(strsplit(y," "), function(a)
if (gsub(" ","",a[3]) == "1"){
a = a[2]
} else {
a = a
}
)
#Collapse everything together
paste(unlist(y), collapse = " ")
#[1] "R 0.2 1500 R 0.1 10007 0.7 0.9 R 0.1 9784"
I'm basically looking for a way to do a variation of this Ruby script in R.
I have an arbitrary list of numbers (steps of a moderator for a regression plot in this case) which have unequal distances from each other, and I'd like to round values which are within a range around these numbers to the nearest number in the list.
The ranges don't overlap.
arbitrary.numbers <- c(4,10,15) / 10
numbers <- c(16:1 / 10, 0.39, 1.45)
range <- 0.1
Expected output:
numbers
## 1.6 1.5 1.4 1.3 1.2 1.1 1.0 0.9 0.8 0.7 0.6 0.5 0.4 0.3 0.2 0.1 0.39 1.45
round_to_nearest_neighbour_in_range(numbers,arbitrary.numbers,range)
## 1.5 1.5 1.5 1.3 1.2 1.0 1.0 1.0 0.8 0.7 0.6 0.4 0.4 0.4 0.2 0.1 0.4 1.5
I've got a little helper function that might do for my specific problem, but it's not very flexible and it contains a loop. I can post it here, but I think a real solution would look completely different.
The different answers timed for speed (on a million numbers)
> numbers = rep(numbers,length.out = 1000000)
> system.time({ mvg.round(numbers,arbitrary.numbers,range) })[3]
elapsed
0.067
> system.time({ rinker.loop.round(numbers,arbitrary.numbers,range) })[3]
elapsed
0.289
> system.time({ rinker.round(numbers,arbitrary.numbers,range) })[3]
elapsed
1.403
> system.time({ nograpes.round(numbers,arbitrary.numbers,range) })[3]
elapsed
1.971
> system.time({ january.round(numbers,arbitrary.numbers,range) })[3]
elapsed
16.12
> system.time({ shariff.round(numbers,arbitrary.numbers,range) })[3]
elapsed
15.833
> system.time({ mplourde.round(numbers,arbitrary.numbers,range) })[3]
elapsed
9.613
> system.time({ kohske.round(numbers,arbitrary.numbers,range) })[3]
elapsed
26.274
MvG's function is the fastest, about 5 times faster than Tyler Rinker's second function.
A vectorized solution, without any apply family functions or loops:
The key is findInterval, which finds the "space" in arbitrary.numbers where each element in numbers is "between". So, findInterval(6,c(2,4,7,8)) returns 2, because 6 is between the 2nd and 3rd index of c(2,4,7,8).
# arbitrary.numbers is assumed to be sorted.
# find the index of the number just below each number, and just above.
# So for 6 in c(2,4,7,8) we would find 2 and 3.
low<-findInterval(numbers,arbitrary.numbers) # find index of number just below
high<-low+1 # find the corresponding index just above.
# Find the actual absolute difference between the arbitrary number above and below.
# So for 6 in c(2,4,7,8) we would find 2 and 1.
# (The absolute differences to 4 and 7).
low.diff<-numbers-arbitrary.numbers[ifelse(low==0,NA,low)]
high.diff<-arbitrary.numbers[ifelse(high==0,NA,high)]-numbers
# Find the minimum difference.
# In the example we would find that 6 is closest to 7,
# because the difference is 1.
mins<-pmin(low.diff,high.diff,na.rm=T)
# For each number, pick the arbitrary number with the minimum difference.
# So for 6 pick out 7.
pick<-ifelse(!is.na(low.diff) & mins==low.diff,low,high)
# Compare the actual minimum difference to the range.
ifelse(mins<=range+.Machine$double.eps,arbitrary.numbers[pick],numbers)
# [1] 1.5 1.5 1.5 1.3 1.2 1.0 1.0 1.0 0.8 0.7 0.6 0.4 0.4 0.4 0.2 0.1 0.4 1.5
Yet another solution using findInterval:
arbitrary.numbers<-sort(arbitrary.numbers) # need them sorted
range <- range*1.000001 # avoid rounding issues
nearest <- findInterval(numbers, arbitrary.numbers - range) # index of nearest
nearest <- c(-Inf, arbitrary.numbers)[nearest + 1] # value of nearest
diff <- numbers - nearest # compute errors
snap <- diff <= range # only snap near numbers
numbers[snap] <- nearest[snap] # snap values to nearest
print(numbers)
The nearest in the above code is not really mathematically the nearest number. Instead, it is the largest arbitrary number such that nearest[i] - range <= numbers[i], or equivalently nearest[i] <= numbers[i] + range. So in one go we find the largest arbitrary number which is either in the snapping range for a given input number, or still too small for that. For this reason, we only need to check one way for snap. No absolute value required, and even the squaring from a previous revision of this post was unneccessary.
Thanks to Interval search on a data frame for the pointer at findInterval, as I found it there before recognizing it in the answer by nograpes.
If, in contrast to your original question, you had overlapping ranges, you could write things like this:
arbitrary.numbers<-sort(arbitrary.numbers) # need them sorted
range <- range*1.000001 # avoid rounding issues
nearest <- findInterval(numbers, arbitrary.numbers) + 1 # index of interval
hi <- c(arbitrary.numbers, Inf)[nearest] # next larger
nearest <- c(-Inf, arbitrary.numbers)[nearest] # next smaller
takehi <- (hi - numbers) < (numbers - nearest) # larger better than smaller
nearest[takehi] <- hi[takehi] # now nearest is really nearest
snap <- abs(nearest - numbers) <= range # only snap near numbers
numbers[snap] <- nearest[snap] # snap values to nearest
print(numbers)
In this code, nearestreally ends up being the nearest number. This is achieved by considering both endpoints of every interval. In spirit, this is very much like the version by nograpes, but it avoids using ifelse and NA, which should benefit performance as it reduces the number of branching instructions.
Is this what you want?
> idx <- abs(outer(arbitrary.numbers, numbers, `-`)) <= (range+.Machine$double.eps)
> rounded <- arbitrary.numbers[apply(rbind(idx, colSums(idx) == 0), 2, which)]
> ifelse(is.na(rounded), numbers, rounded)
[1] 1.5 1.5 1.5 1.3 1.2 1.0 1.0 1.0 0.8 0.7 0.6 0.4 0.4 0.4 0.2 0.1 0.4 1.5
Please note that due to rounding errors (most likely), I use range = 0.1000001 to achieve the expected effect.
range <- range + 0.0000001
blah <- rbind( numbers, sapply( numbers, function( x ) abs( x - arbitrary.numbers ) ) )
ff <- function( y ) { if( min( y[-1] ) <= range + 0.000001 ) arbitrary.numbers[ which.min( y[ -1 ] ) ] else y[1] }
apply( blah, 2, ff )
This is still shorter:
sapply(numbers, function(x) ifelse(min(abs(arbitrary.numbers - x)) >
range + .Machine$double.eps, x, arbitrary.numbers[which.min
(abs(arbitrary.numbers - x))] ))
Thanks #MvG
Another option:
arb.round <- function(numbers, arbitrary.numbers, range) {
arrnd <- function(x, ns, r){
ifelse(abs(x - ns) <= range +.00000001, ns, x)
}
lapply(1:length(arbitrary.numbers), function(i){
numbers <<- arrnd(numbers, arbitrary.numbers[i], range)
}
)
numbers
}
arb.round(numbers, arbitrary.numbers, range)
Yields:
> arb.round(numbers, arbitrary.numbers, range)
[1] 1.5 1.5 1.5 1.3 1.2 1.0 1.0 1.0 0.8 0.7 0.6 0.4 0.4 0.4 0.2 0.1 0.4 1.5
EDIT: I removed the return call at the end of the function as it's not necessary adn can burn time.
EDIT: I think a loop will be even faster here:
loop.round <- function(numbers, arbitrary.numbers, range) {
arrnd <- function(x, ns, r){
ifelse(abs(x - ns) <= range +.00000001, ns, x)
}
for(i in seq_along(arbitrary.numbers)){
numbers <- arrnd(numbers, arbitrary.numbers[i], range)
}
numbers
}