How to use apply function and which function of apply-family? - r

I'm still struggling with the different apply functions and how they can replace a for loop. What I want to do is sort a vector of strings (value labels) according to the sorted order of a set of values, in my case odds ratios.
I have the unordered odds ratios in the "oo" object and the sorted odds ratios in the "so" object. I also have value labels in the same order as "oo", which should now be reordered to match the values in the "so" object:
# sort labels in ascending order of
# odds ratio values
oo <- exp(coef(x))[-1]
so <- sort(exp(coef(x))[-1])
nlab <- NULL
for (k in 1:length(categoryLabels)) {
  nlab <- c(nlab, categoryLabels[which(so[k]==oo)])
}
categoryLabels <- nlab
e.g.
"oo" is (0.3, 0.7, 0.5)
"so" is (0.3, 0.5, 0.7)
categoryLabels (of oo) is ("A", "B", "C") and should be re-ordered according to "so": ("A", "C", "B")
What I'd like to know is whether it's possible to replace the for loop with an apply function, and if so, how?
Thanks in advance,
Daniel

It looks like all you're trying to do is order categoryLabels based on oo, which could be done with:
categoryLabels = categoryLabels[order(oo)]
order gives you a vector of indices that, when used to index a vector, will turn it into the sorted order. In your example:
oo = c(0.3, 0.7, 0.5)
order(oo)
# [1] 1 3 2
Though if we did start with so and oo, much easier than using any apply function in this case would be using match:
categoryLabels = categoryLabels[match(so, oo)]
match finds, for each element of its first vector, the index of its first occurrence in the second vector. Here we look up each sorted value in oo, which gives the positions to pull the labels from. In your example:
oo = c(0.3, 0.7, 0.5)
so = c(0.3, 0.5, 0.7)
match(so, oo)
# [1] 1 3 2
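To see that the two approaches agree, and what a literal apply-family translation of the loop could look like, here is a small sketch. The values are made up for illustration (they are not the question's data) and it assumes the odds ratios are unique, since ties would make which() return more than one index.

```r
# Illustrative values only; assumes no ties among the odds ratios
oo <- c(0.7, 0.3, 0.5)
categoryLabels <- c("A", "B", "C")
so <- sort(oo)

# Reorder labels using the sorting permutation of oo
by_order <- categoryLabels[order(oo)]

# Look up each sorted value in oo and pull the label at that position
by_match <- categoryLabels[match(so, oo)]

# Direct translation of the for loop with vapply
by_apply <- vapply(
  so,
  function(v) categoryLabels[which(v == oo)],
  character(1),
  USE.NAMES = FALSE
)

print(by_order)
print(identical(by_order, by_match))
print(identical(by_order, by_apply))
```

The vapply version works, but it does the same lookup as match with more code, so it is mainly useful for seeing how the loop maps onto the apply family.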

Related

Numeric Matching / Extracting with Hard Coded Values in R

Having trouble understanding numeric matching / indexing in R.
If I have a situation where I create a dataframe such as:
options(digits = 3)
x <- seq(from = 0, to = 5, by = 0.10)
TestDF <- data.frame(x = x, y = dlnorm(x))
and I wanted to compare a hardcoded value to my y column -
> TestDF[TestDF$y == 0.0230,]$x
numeric(0)
That being said, it works if I compare to the value straight out of the dataframe (for an x value of 4.9, the y value should be 0.0230):
> TestDF[TestDF$y == TestDF[50,]$y,]$x
[1] 4.9
Does this have to do with exact matching? If I limit the digits to 3 decimal places, is 0.0230000 no longer the same as the original value in y that I'm comparing to? If so, is there a way around it when I need to extract values based on rounded, hard-coded values?
You can use the round() function to round y to the desired number of decimal places before comparing it with the hard-coded value. See below.
set.seed(1L)
x <- seq(from = 0, to = 5, by = 0.10)
TestDF <- data.frame(x = x, y = dlnorm(x))
constant <- 0.023
TestDF[ with(TestDF, round(y, 3) == constant), ]
# x y
# 50 4.9 0.02302884
You can compare the rounded y with the stated value:
> any(TestDF$y == 0.0230)
[1] FALSE
> any(round(TestDF$y, 3) == 0.0230)
[1] TRUE
I'm not certain you grok the meaning of the digits option. From ?options it says about digits
digits: controls the number of significant digits to print when printing numeric values.
Note the word "print": this setting only affects how the values are printed, not how they are stored.
You generated a set of reals, none of which are exactly 0.0230. This has nothing to do with exact matching. The value you indicated should be 0.0230 is actually stored as
> with(TestDF, print(y[50], digits = 22))
[1] 0.02302883835550340041465
regardless of the digits setting in options because that setting only affects the printed value. And the issue is not exact matching because even with the small fudge allowed by the recommended way to do comparisons, all.equal(), y[50] and 0.0230 are still not equal
> with(TestDF, all.equal(0.0230, y[50]))
[1] "Mean relative difference: 0.001253842"
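If the goal is to pick out rows whose y is close to a hard-coded value, one option is to compare within an explicit tolerance instead of testing for equality. The sketch below assumes a tolerance of half a unit in the third decimal place; the names target and tol are illustrative, not from the original posts.

```r
x <- seq(from = 0, to = 5, by = 0.10)
TestDF <- data.frame(x = x, y = dlnorm(x))

target <- 0.023   # hard-coded value to look for
tol <- 0.0005     # assumed tolerance: half a unit in the 3rd decimal place

# Keep rows whose y lies within tol of the target
hits <- TestDF[abs(TestDF$y - target) < tol, ]
print(hits)
```

This makes the allowed error explicit, which can be easier to reason about than relying on how round() treats values near a rounding boundary.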

Check if decimal values are in a range in R

I need to re-categorize codes that represent various diseases so as to form appropriate groups for later analysis.
Many of the groupings include ranges that look like this:
1.0 to 1.5, 1.8 to 2.5, 3.0
where another might be 37.0
Originally I thought that something like this might work:
x <-c(0:.9, 1.9:2.9, 7.9:8.9, 4.0:4.9, 3:3.9, 5:5.9, 6:6.9, 11:11.9, 9:9.9, 10:10.9, 12.9, 13:13.9, 14,14.2, 14.8)
df$disease_cat[df$site_code %in% x] <- "disease a"
The problem is, 0.1,0.2 etc. are not being recognized as being in the range of 0:0.9.
I now understand that 5:10 (for example) in R is actually 5,6,7...10
What is a better way to code these intervals so that the decimals will be recognized as being in the interval 0 to 0.9? (keeping in mind that there will be many "mini" ranges and the idea of coding them all explicitly isn't particularly appealing)
You can see the problem by printing the content of c(1.1:4). The result is [1] 1.1 2.1 3.1. What you need is the findInterval function. Check out this solution:
findInterval(c(1,2,3,4.5), c(1.1,4)) == 1
If you would like the right boundary to be inclusive, i.e. the interval [1.1, 4], you can use the rightmost.closed parameter:
findInterval(c(1,2,3,4.5), c(1.1,4), rightmost.closed = TRUE) == 1
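The two calls above give the same result for those inputs, because none of the values is exactly 4. A small sketch that adds 4 to the input (my own addition, for illustration) shows what rightmost.closed changes:

```r
values <- c(1, 2, 3, 4, 4.5)
breaks <- c(1.1, 4)

# 0 = below the first break, 1 = in [1.1, 4), 2 = at or above 4
idx <- findInterval(values, breaks)
print(idx)

# With rightmost.closed = TRUE the last interval becomes [1.1, 4]
idx_closed <- findInterval(values, breaks, rightmost.closed = TRUE)
print(idx_closed)

print(idx == 1)
print(idx_closed == 1)
```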
EDIT:
Here is the solution for a more general problem you have described:
d = data.frame(disease = c('d1', 'd2', 'd3'), minValue = c(0.3, 1.2, 2.2), maxValue = c(0.6, 1.9, 2.5))
measurements = c(0.1, 0.5, 2.2, 0.3, 2.7)
findDiagnosis <- function(data, measurement) {
  diagnosis = data[data$minValue <= measurement & measurement <= data$maxValue,]
  if (nrow(diagnosis) == 0) {
    return(NA)
  } else {
    return(diagnosis$disease)
  }
}
sapply(measurements, findDiagnosis, data = d)
I think you want this:
c(1,2,3,4.5) >= 1.1 & c(1,2,3,4.5) <= 4
[1] FALSE TRUE TRUE FALSE
Examine the output of 1.1:4:
1.1:4
[1] 1.1 2.1 3.1
You are actually testing whether elements from your vector are exactly equal to 1.1, 2.1, or 3.1
# This is the list of ranges that you want to check
ranges = list(c(0,.9), c(1.9,2.9), c(7.9,8.9), c(4.0,4.9), c(3,3.9), c(5,5.9), c(6,6.9), c(11,11.9), c(9,9.9), c(10,10.9), c(12.9), c(13,13.9), c(14),c(14.2), c(14.8))
# These are the values that you want to check against each range in ranges
values = c(1,2,3,4.5)
# Check each value against each range; the bounds are inclusive so that
# single-value ranges such as c(14) can also match
output = data.frame(t(sapply(ranges, function(x) (min(x) <= values & max(x) >= values))))
# Set column names to the values so it is clear what is being checked.
# Column names are values, row names are indexes of the ranges
colnames(output) = values
output$ranges = sapply(ranges, function(x) paste(x, collapse = "-"))
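To get back to the original goal of labelling rows of a data frame, the range check can be wrapped in a helper that returns one TRUE/FALSE per value. This is a sketch under assumptions: the bounds below are a shortened, made-up subset of the question's ranges, in_any_range is a hypothetical helper name, and single codes are stored with equal lower and upper bounds.

```r
# Hypothetical bounds for "disease a"; single codes have lower == upper
lower <- c(0,   1.9, 3,   14.2)
upper <- c(0.9, 2.9, 3.9, 14.2)

in_any_range <- function(values, lower, upper) {
  # outer() builds a values-by-ranges logical matrix for each bound
  inside <- outer(values, lower, ">=") & outer(values, upper, "<=")
  # A value qualifies if it falls inside at least one range
  rowSums(inside) > 0
}

df <- data.frame(site_code = c(0.1, 0.95, 2.5, 14.2, 14.3))
df$disease_cat <- NA_character_
df$disease_cat[in_any_range(df$site_code, lower, upper)] <- "disease a"
print(df)
```

Keeping the bounds in two parallel vectors (or a two-column data frame per disease) avoids spelling out every decimal code explicitly.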

Replacing specific values in vector with different samples from another vector

I have a vector "a" with integer values, some of which might have become 0 due to other parts of the code that are running. I would like to replace the occurrences of 0 in this vector with a random sample from another vector "b" that I have. However, if there are multiple 0 values in "a", I would like them all to be different samples from "b". So for instance:
a <- c(1, 2, 3, 0, 0, 0)
b <- 1:100
I would like the last three 0 values of "a" to be replaced with random values within "b", but I would like to avoid using 1, 2, or 3. Those are already in a.
Currently, I am using a while loop, so:
while (0 %in% a) {
  s = sample(b, 1)
  while (s %in% a) {
    s = sample(b, 1)
  }
  a[a == 0][1] = s
}
Is there a better way to do this? It seems like this double while loop might take a long time to run.
You could do something like the following
indx <- which(!a) # identify the zeroes locations
# Or less golfed `indx <- which(a == 0)`
a[indx] <- sample(setdiff(b, a), length(indx)) # replace by a sample from `setdiff(b, a)`
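We haven't specified replace = TRUE, so sample draws without replacement and the new values will always be different from each other. Because the pool is setdiff(b, a), they will also differ from the values already in a.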
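Put together as a runnable sketch, with a seed added only so the run is reproducible and a guard (my own addition) for the case where the pool is too small:

```r
set.seed(42)  # only to make this sketch reproducible
a <- c(1, 2, 3, 0, 0, 0)
b <- 1:100

indx <- which(a == 0)    # positions of the zeroes
pool <- setdiff(b, a)    # values in b that are not already in a

# Without replacement, sample() errors if the pool is smaller than the request
stopifnot(length(pool) >= length(indx))

a[indx] <- sample(pool, length(indx))
print(a)
```

One caveat: if pool could ever shrink to a single number, sample(pool, 1) would draw from 1:pool instead. Indexing with pool[sample.int(length(pool), length(indx))] avoids that special case.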

R - classification the number - assign labels

How can I convert numeric data to strings in R, not as a data type change but as a classification? Say I have 100 numbers between 0 and 1; if a number is > 0.5, I need to assign it the name "Good", otherwise "Bad".
You could try
nums <- seq(0,1, by = .01)
res <- c('Bad', 'Good')[(nums > 0.5)+1]
Do you wish to do it using factors?
a = runif(100, 0, 1) > 0.5
b = factor(a, levels = c(FALSE, TRUE), labels = c("Bad", "Good"))
labs = as.character(b)  # named labs rather than c to avoid masking base::c
Alternatively, if you just want to change the names in the vector, a, then:
a = runif(100, 0, 1) > 0.5
labs = ifelse(a, "Good", "Bad")
names(a) = labs
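Another option, not mentioned in the answers above, is base R's cut(), which bins numeric values into labelled intervals and extends naturally to more than two classes. A minimal sketch, assuming the same 0.5 threshold:

```r
nums <- seq(0, 1, by = 0.01)

# Intervals are (-Inf, 0.5] and (0.5, Inf), so a value of exactly 0.5 is "Bad"
res <- cut(nums, breaks = c(-Inf, 0.5, Inf), labels = c("Bad", "Good"))

print(table(res))
print(head(as.character(res)))
```

cut() returns a factor; wrap it in as.character() if plain strings are needed.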

Is there a concise (built-in) way to sample an index from an array by treating values as a probabilities?

Suppose I have a vector of probabilities that sum to 1, such as foo = c(0.2,0.5,0.3).
I would like to sample an index from this vector by treating the values as probabilities.
In particular, I'd like to sample 1 with probability 0.2, 2 with probability 0.5, and 3 with probability 0.3.
Here is one implementation, similar to what I would write in C:
sample_index = function(probs) {
  r = runif(1)
  sum = 0
  for (i in 1:length(probs)) {
    sum <- sum + probs[i]
    if (r < sum) return(i)
  }
}
foo = c(0.2,0.5,0.3)
print(sample_index(foo))
Is there a more direct / built-in / canonical way to do this in R?
It always makes me smile and think R is doing a good job when people are looking for a function and repeatedly use its name in their question.
foo <- c(0.2, 0.5, 0.3)
sample(x = 1:3, size = 1, prob = foo)
Depending on your use case, you could make it a little more general:
sample(x = seq_along(foo), size = 1, prob = foo)
But do be careful: sample has sometimes convenient but very often unexpected behavior if its x argument is a single number. For a single number n >= 1, sample(n) draws from 1:n rather than from the one-element vector. If you're wrapping this up in a function, check the input length:
if (length(foo) == 1) foo else sample(x = seq_along(foo), size = 1, prob = foo)
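As a quick sanity check that prob behaves as intended, one can draw many indices with replacement and compare the observed frequencies with foo. The sample size and seed below are arbitrary choices for this sketch:

```r
set.seed(1)  # arbitrary seed, only for reproducibility
foo <- c(0.2, 0.5, 0.3)

# Draw many indices at once; replace = TRUE is needed when size > length(x)
draws <- sample(seq_along(foo), size = 10000, replace = TRUE, prob = foo)

# Observed relative frequency of each index
counts <- table(factor(draws, levels = seq_along(foo)))
freq <- as.numeric(counts) / length(draws)

print(round(freq, 2))
```

The observed frequencies should land close to 0.2, 0.5 and 0.3, with the gap shrinking as the number of draws grows.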
