I need to re-categorize codes that represent various diseases so as to form appropriate groups for later analysis.
Many of the groupings include ranges that look like this:
1.0 to 1.5, 1.8 to 2.5, 3.0
where another might be 37.0
Originally I thought that something like this might work:
x <-c(0:.9, 1.9:2.9, 7.9:8.9, 4.0:4.9, 3:3.9, 5:5.9, 6:6.9, 11:11.9, 9:9.9, 10:10.9, 12.9, 13:13.9, 14,14.2, 14.8)
df$disease_cat[df$site_code %in% x] <- "disease a"
The problem is, 0.1,0.2 etc. are not being recognized as being in the range of 0:0.9.
I now understand that 5:10 (for example) in R is actually 5, 6, 7, ..., 10.
What is a better way to code these intervals so that the decimals will be recognized as being in the interval 0 to 0.9? (keeping in mind that there will be many "mini" ranges and the idea of coding them all explicitly isn't particularly appealing)
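You can see what is going on by printing the content of c(1.1:4). The result is [1] 1.1 2.1 3.1: the colon operator steps by 1 from the start value, so it never produces the decimals in between. What you need is the findInterval function. Check out this solution: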
findInterval(c(1,2,3,4.5), c(1.1,4)) == 1
If you would like the right boundary to be inclusive, i.e. the interval [1.1, 4], you can use the rightmost.closed parameter:
findInterval(c(1,2,3,4.5), c(1.1,4), rightmost.closed = TRUE) == 1
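To make the boundary behaviour concrete, here is a small sketch. The sample values are chosen for illustration and are not from the question. findInterval returns, for each value, how many break points are less than or equal to it, so a result of 1 means the value lies between the first and second break.

```r
breaks <- c(1.1, 4)
x <- c(1, 1.1, 4, 4.5)  # illustrative values, including both boundaries

# Default: intervals are closed on the left and open on the right
idx_open <- findInterval(x, breaks)

# rightmost.closed = TRUE also counts the last break as part of the last interval
idx_closed <- findInterval(x, breaks, rightmost.closed = TRUE)

in_half_open <- idx_open == 1    # membership in [1.1, 4)
in_closed    <- idx_closed == 1  # membership in [1.1, 4]
```

The only value whose classification changes between the two calls is 4, the right boundary itself.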
EDIT:
Here is the solution for a more general problem you have described:
d = data.frame(disease = c('d1', 'd2', 'd3'), minValue = c(0.3, 1.2, 2.2), maxValue = c(0.6, 1.9, 2.5))
measurements = c(0.1, 0.5, 2.2, 0.3, 2.7)
findDiagnosis <- function(data, measurement) {
  diagnosis = data[data$minValue <= measurement & measurement <= data$maxValue,]
  if (nrow(diagnosis) == 0) {
    return(NA)
  } else {
    return(diagnosis$disease)
  }
}
sapply(measurements, findDiagnosis, data = d)
I think you want this:
c(1,2,3,4.5) >= 1.1 & c(1,2,3,4.5) <= 4
[1] FALSE TRUE TRUE FALSE
Examine the output of 1.1:4:
1.1:4
[1] 1.1 2.1 3.1
You are actually testing whether elements from your vector are exactly equal to 1.1, 2.1, or 3.1
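Applied to the original recoding problem, the logical comparison can be run against a small table of bounds, one row per "mini" range. This is only a sketch: the bounds, the toy df, and the helper name in_any_range are assumptions made for illustration, not part of the question.

```r
# Hypothetical bounds for one disease group (one row per "mini" range;
# a single code such as 14.2 is a range with equal lower and upper bound)
bounds <- data.frame(lower = c(0, 1.9, 3, 14.2),
                     upper = c(0.9, 2.9, 3.9, 14.2))

# Toy data standing in for the asker's df
df <- data.frame(site_code = c(0.1, 0.95, 2.5, 3.9, 14.2, 37))
df$disease_cat <- NA_character_

# TRUE if a code lies inside at least one closed range [lower, upper]
in_any_range <- function(codes, bounds) {
  vapply(codes,
         function(v) any(v >= bounds$lower & v <= bounds$upper),
         logical(1))
}

df$disease_cat[in_any_range(df$site_code, bounds)] <- "disease a"
```

Codes that fall in the gaps between ranges (0.95 here) or outside all of them (37) keep NA, so further groups can be filled in the same way with their own bounds table.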
# This is the list of ranges that you want to check
ranges = list(c(0,.9), c(1.9,2.9), c(7.9,8.9), c(4.0,4.9), c(3,3.9), c(5,5.9), c(6,6.9), c(11,11.9), c(9,9.9), c(10,10.9), c(12.9), c(13,13.9), c(14),c(14.2), c(14.8))
# These are the values that you want to check against each range in ranges
values = c(1,2,3,4.5)
# You can check each value against each range with the following command
# (bounds are inclusive, otherwise single-value ranges such as c(14) could never match)
output = data.frame(t(sapply(ranges, function(x) (min(x)<=values & max(x)>=values))))
# Maybe set column names to values so you know clearly what you are checking.
# Column names are values, row names are indexes of the ranges
colnames(output) = values
output$ranges = sapply(ranges, function(x) paste(x,collapse = "-"))
Related
I have a data.frame of intervals given row-wise: the interval starts in column one and ends in column two. The numbers are not integers. How can I find the overlapping range, if any, of all intervals? E.g.:
df <- cbind(c(1.5, 3, 2.1, 1), c(6, 5, 3.7, 10.1))
plot(1:11, ylim = c(0, 5), col = NA)
segments(x0 = c(1.5, 3, 2.1, 1), y0 = 1:4, x1 = c(6, 5, 3.7, 10.1), y1 = 1:4)
abline(v = 3, col = "red", lty = 2)
abline(v = 3.7, col = "red", lty = 2)
somefunc(df)
[1] 3 3.7
A nice, fast base R (or common package like dplyr etc.) solution is preferred. I already know of foverlaps (data.table) and IRanges, but they do not seem to address my problem. For bonus points, if there were interval(s) that prevented total overlap, e.g. rbind'ing c(20, 25) to df above, then the function should still return the largest possible overlap of any of the vectors, i.e. still return c(3, 3.7).
EDIT: the solution linked by Henrik is good, but relies on generating a sequence with a given step (e.g. seq(start, end, by = 1)) and then reducing the sequences to get the intersection. My intervals may be narrower than this step. Ideally I need a solution that operates using logical comparison or something like that. The second solution in the linked page is also not quite right (see below).
EDIT EDIT: The intersection should be returned only if it is common to all ranges. Solution 2 in the post linked by Henrik groups together intervals even if not all intervals in the group intersect with every other interval
Here is a solution which seems to return the expected result for the given sample datasets.
It takes the vector of all unique interval endpoints and counts the number of intervals they are intersecting (by aggregating in a non-equi join). Among the subset of points with the maximum number of intersections, the range is taken.
library(data.table)
# enhanced dataset with 2 additional intervals
dt <- fread("lb, ub
1.5, 6
3 , 5
2.1, 3.7
1 , 10.1
8.3 , 12
20 , 25")
mdt <- dt[, .(b = unique(unlist(.SD)))]
res <- dt[mdt, on = .(lb <= b, ub >= b), .N, by = .EACHI][N == max(N), range(lb)]
res
[1] 3.0 3.7
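Since the question asked for base R, the same counting idea can be sketched without data.table. This is a translation made for illustration, not the code above; it counts, for every end point, how many closed intervals cover it and then takes the range of the end points with the highest count.

```r
# Same six intervals as in the enhanced dataset
lb <- c(1.5, 3, 2.1, 1, 8.3, 20)
ub <- c(6, 5, 3.7, 10.1, 12, 25)

# All distinct end points
pts <- sort(unique(c(lb, ub)))

# For each end point, count the intervals [lb, ub] that contain it
n_cover <- vapply(pts, function(p) sum(lb <= p & p <= ub), integer(1))

# Range of the end points that are covered most often
res_base <- range(pts[n_cover == max(n_cover)])
```

Like the data.table version, this returns one overall range, so it shares the limitation when several separate regions reach the same maximum.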
Visualisation
library(ggplot2)
ggplot(dt) +
  aes(x = lb, y = seq_along(lb), xend = ub, yend = seq_along(ub)) +
  geom_segment() +
  geom_vline(xintercept = res, col = "red", lty = 2)
EDIT: Handling of no overlaps
The OP has pointed out that the case where there are no overlaps needs to be recognized and handled separately. So I have modified the code:
mdt <- dt[, .(b = unique(unlist(.SD)))]
res <- dt[mdt, on = .(lb <= b, ub >= b), .N, by = .EACHI][
  N == max(N), {
    if (max(N) > 1) {
      cat("Maximum overlaps found:", max(N), "out of", nrow(dt), "intervals\n")
      range(lb)
    } else {
      cat("No overlaps found\n")
      NULL
    }
  }]
This code will recognize the situation where there are no overlaps and will return NULL in these cases. In addition, a message is printed.
In all other cases, it will print an informative message, e.g.,
Maximum overlaps found: 4 out of 6 intervals
For OP's sample dataset without overlaps
dt <- data.table(lb = c(3, 6, 10), ub = c(5, 9, 15))
it will print
No overlaps found
Caveat
In case of multiple solutions the code above will return the overall range, i.e., the start of the first interval and the end of the last interval, instead of a list of separate intervals.
Sample data for this use case:
dt <- fread("lb, ub
1.5, 6
3 , 5
2.1, 3.7
1 , 10.1
11.5, 16
13 , 15
12.1, 13.7
11 , 20.1
0 , 22
")
So, there is a 5-fold overlap between 3 and 3.7 and a second 5-fold overlap between 13 and 13.7.
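One possible way to report the two regions separately is sketched below in base R. It is an assumption-laden illustration rather than an extension of the data.table code: intervals are treated as closed, and the helper name coverage is made up here. Neighbouring end points are merged into one region only if the midpoint between them is covered equally often, which guards against joining regions separated by a gap.

```r
# Sample data from above, as plain vectors
lb <- c(1.5, 3, 2.1, 1, 11.5, 13, 12.1, 11, 0)
ub <- c(6, 5, 3.7, 10.1, 16, 15, 13.7, 20.1, 22)

# Number of closed intervals [lb, ub] covering each point in p
coverage <- function(p) {
  vapply(p, function(v) sum(lb <= v & v <= ub), integer(1))
}

pts    <- sort(unique(c(lb, ub)))
n_pts  <- coverage(pts)
best   <- max(n_pts)
is_max <- n_pts == best

# Coverage halfway between neighbouring end points
mids   <- (head(pts, -1) + tail(pts, -1)) / 2
n_mids <- coverage(mids)

# A maximal point continues the previous region only if the previous
# point and the gap in between are covered equally often
linked <- c(FALSE, is_max[-1] & head(is_max, -1) & n_mids == best)
seg_id <- cumsum(is_max & !linked)

# One c(start, end) pair per region of maximum overlap
overlaps <- unname(lapply(split(pts[is_max], seg_id[is_max]), range))
```

For the sample data this yields two pairs, one per 5-fold overlap. Because the comparison uses <= on both sides, intervals that merely touch at a single point count as overlapping in this sketch.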
Furthermore, there is another use case which needs to be considered: How shall intervals be treated which overlap only in one point, i.e. one interval ends where another starts?
How can I round up starting at .6 (not at .5)?
For example, round(53.51245, 4) returns 53.5125, but I want 53.5124.
How can I specify the threshold (that is, round up only from .6 onwards)?
I'm not sure if this is a duplicate of the post linked to in the comments (but the post may certainly be relevant). From what I understand, OP would like to "round" values up if the part beyond the last kept digit is >= 0.6 and down if it is < 0.6. (The linked post refers to the number of digits a number should be rounded to, which is a different issue.)
In response to OP's question, here is an option where we define a custom function my.round:
my.round <- function(x, digits = 4, val = 0.6) {
  z <- x * 10^digits
  z <- ifelse(signif(z - trunc(z), 1) >= val, trunc(z + 1), trunc(z))
  z / 10^digits
}
Then
x <- 53.51245
my.round(x, 4)
#[1] 53.5124
x <- 53.51246
my.round(x, 4)
#[1] 53.5125
my.round is vectorised, so we could have done
my.round(c(53.51245, 53.51246, 53.51246789), digits = 4)
#[1] 53.5124 53.5125 53.5125
Having trouble understanding numeric matching / indexing in R.
If I have a situation where I create a dataframe such as:
options(digits = 3)
x <- seq(from = 0, to = 5, by = 0.10)
TestDF <- data.frame(x = x, y = dlnorm(x))
and I wanted to compare a hardcoded value to my y column -
> TestDF[TestDF$y == 0.0230,]$x
numeric(0)
That being said, if I compare to the value that's straight out of the dataframe (which for an x value of 4.9, should be a y value of 0.0230).
> TestDF[TestDF$y == TestDF[50,]$y,]$x
[1] 4.9
Does this have to do with exact matching? If I limit the digits to 3 decimal places, is 0.0230000 then not the same as the original value in y I'm comparing to? If this is the case, is there a way around it if I do need to extract values based on rounded, hard-coded values?
You can use the round() function to reduce the floating-point number to the number of decimal digits you want to compare at. See below.
set.seed(1L)
x <- seq(from = 0, to = 5, by = 0.10)
TestDF <- data.frame(x = x, y = dlnorm(x))
constant <- 0.023
TestDF[ with(TestDF, round(y, 3) == constant), ]
# x y
# 50 4.9 0.02302884
You can compare the rounded y with the stated value:
> any(TestDF$y == 0.0230)
[1] FALSE
> any(round(TestDF$y, 3) == 0.0230)
[1] TRUE
I'm not certain you grok the meaning of the digits option. From ?options it says about digits
digits: controls the number of significant digits to print when printing numeric values.
(emphasis mine.) So this only affects how the values are printed, not how they are stored.
You generated a set of reals, none of which are exactly 0.0230. This has nothing to do with exact matching. The value you indicated should be 0.0230 is actually stored as
> with(TestDF, print(y[50], digits = 22))
[1] 0.02302883835550340041465
regardless of the digits setting in options because that setting only affects the printed value. And the issue is not exact matching because even with the small fudge allowed by the recommended way to do comparisons, all.equal(), y[50] and 0.0230 are still not equal
> with(TestDF, all.equal(0.0230, y[50]))
[1] "Mean relative difference: 0.001253842"
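If the goal is to pick rows by a hard-coded, rounded value, one option is to compare with an explicit tolerance instead of ==. The sketch below is illustrative; the tolerance of 0.0005 is an assumption chosen to mean "equal when shown to three decimal places".

```r
x <- seq(from = 0, to = 5, by = 0.10)
TestDF <- data.frame(x = x, y = dlnorm(x))

target <- 0.0230
tol <- 0.0005  # assumption: half a unit in the third decimal place

# Rows whose y is within tol of the target
hits <- which(abs(TestDF$y - target) < tol)
TestDF$x[hits]
```

This selects the same row as comparing round(y, 3) with the constant, but makes the accepted error explicit and adjustable.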
I'd like to make an R function which does a very particular job, and I'm looking to find a more efficient way of doing it.
Basically, I'd like a function
indicies<-function(increasing.series, multiple)
which picks out the indices of an increasing series where the value of the series first exceeds each multiple of a certain level. So for example, if the input is a vector
testvector <- c(0.1, 0.5, 1.7, 2.1, 3.2, 4.5, 6.2, 6.3, 6.4, 7.9, 8.1)
the results would be
[1] 1 4 6 7 11
where it holds that
testvector[c(1,4,6,7,11)] == c(0.1, 2.1, 4.5, 6.2, 8.1)
so that the function picks out the indices where the values of the series first exceed 2 (index 4, value 2.1), 4 (index 6, value 4.5), 6 (index 7, value 6.2) and 8 (index 11, value 8.1). For perspective, I plan on using this as an easy way to pick weekly / monthly / quarterly series out of a daily time series. I was hoping to implement it by running some sort of functional-style aggregate over the windowed pairs of the input series, but I'm not sure how to do so succinctly. Currently, I've implemented the function in the following much more long-winded manner:
indicies <- function(increasing.series, multiple)
{
  # Create matrix with three columns: previous, current and orig.index, yielding
  # the previous and current value corresponding to an index in the original
  # series.
  pairs <- zoo::rollapply(data = increasing.series, width = 2, identity)
  pairs <- rbind(c(NA, increasing.series[1]), pairs)
  pairs <- cbind(pairs, 1:dim(pairs)[1])
  colnames(pairs) <- c("previous", "current", "orig.index")
  # This predicate returns TRUE if the index corresponding to a row of the above
  # matrix should be included in the output.
  predicate <- function(row)
  {
    first <- row["previous"]
    second <- row["current"]
    orig.index <- row["orig.index"]
    firstRemainder <- first %% multiple
    secondRemainder <- second %% multiple
    # Include if the previous remainder is larger than the current or if the current timepoint
    # is a whole period in front of the previous.
    return(orig.index == 1 || firstRemainder > secondRemainder || second > first + multiple)
  }
  bool.indicies <- apply(pairs, 1, predicate)
  return((1:length(bool.indicies))[bool.indicies])
}
Is there a better, shorter, more readable way?
Here is a simpler solution:
indicies <- function(increasing.series, multiple) {
  multiples <- (0:floor(max(increasing.series)/multiple)) * multiple
  sapply(multiples, function(x) which.max(increasing.series > x))
}
indicies(testvector, 2)
#[1] 1 4 6 7 11
Here is my approach:
c(1, which( diff( testvector %/% 2)>0) + 1)
This does not require defining variables or calling sapply.
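For reuse with other step sizes, the one-liner can be wrapped in a function. The name indices_by_multiple is made up for this sketch. Integer division assigns each value to a bucket, and an index is kept whenever the bucket number increases.

```r
indices_by_multiple <- function(increasing.series, multiple) {
  # Bucket number of each value: 0 for [0, multiple), 1 for [multiple, 2 * multiple), ...
  bucket <- increasing.series %/% multiple
  # Always keep the first index, then every position where a new bucket starts
  c(1, which(diff(bucket) > 0) + 1)
}

testvector <- c(0.1, 0.5, 1.7, 2.1, 3.2, 4.5, 6.2, 6.3, 6.4, 7.9, 8.1)
result <- indices_by_multiple(testvector, 2)
```

One difference from the question's wording: a value exactly equal to a multiple starts a new bucket here, whereas "exceeds" would require it to be strictly larger.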
I'm still struggling with the different apply functions and how they can replace a for loop. What I want to do is sort a vector of strings (value labels) according to the sorted order of a vector of values, in my case odds ratios.
I have the (unordered) odds ratios in the "oo" object and the sorted odds ratios in the "so" object. Further, I have value labels sorted in the same order as "oo", which should now be re-ordered to match the values in the "so" object:
# sort labels descending in order of
# odds ratio values
oo <- exp(coef(x))[-1]
so <- sort(exp(coef(x))[-1])
nlab <- NULL
for (k in 1:length(categoryLabels)) {
  nlab <- c(nlab, categoryLabels[which(so[k]==oo)])
}
categoryLabels <- nlab
e.g.
"oo" is (0.3, 0.7, 0.5)
"so" is (0.3, 0.5, 0.7)
categoryLabels (of oo) is ("A", "B", "C") and should be re-ordered according to "so": ("A", "C", "B")
What I'd like to know is whether it's possible to replace the for loop with an apply function, and if so, how?
Thanks in advance,
Daniel
It looks like all you're trying to do is order categoryLabels based on oo, which could be done with:
categoryLabels = categoryLabels[order(oo)]
order gives you a vector of indices that, when used to index a vector, will turn it into the sorted order. In your example:
oo = c(0.3, 0.7, 0.5)
order(oo)
# [1] 1 3 2
Though if we did start with so and oo, much easier than using any apply function in this case would be using match:
categoryLabels = categoryLabels[match(so, oo)]
match is a function that finds the indices of the elements of the first vector in the second vector. Here it looks up each sorted value in oo, which is the same lookup the loop performs with which(so[k]==oo). In your example:
oo = c(0.3, 0.7, 0.5)
so = c(0.3, 0.5, 0.7)
match(so, oo)
# [1] 1 3 2
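The example in the question happens to give the same indices whichever way round the lookup is done, so here is a sketch with a different, made-up ordering that checks both approaches against the original loop. Note that the lookup direction matters: match(so, oo) finds each sorted value in the unsorted vector, which is not the same as match(oo, so) for an ordering like this one.

```r
oo <- c(0.5, 0.7, 0.3)               # illustrative unsorted odds ratios
categoryLabels <- c("A", "B", "C")   # labels in the same order as oo
so <- sort(oo)

by_order <- categoryLabels[order(oo)]
by_match <- categoryLabels[match(so, oo)]

# The original loop, for comparison
by_loop <- NULL
for (k in seq_along(categoryLabels)) {
  by_loop <- c(by_loop, categoryLabels[which(so[k] == oo)])
}
```

All three results agree, with the label of the smallest odds ratio first. If oo contained duplicated values, match would return only the first position for each, so order is the safer choice in that case.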