Matching elements in a list - r

Just starting to program in R... Got stumped on this one, perhaps because I don't know where to begin.
Define a random variable to be equal to the number of trials before there is a match. So if you have a list of numbers, (4,5,7,11,3,11,12,8,8,1,...), the first value of the random variable is 6, because by then there are two 11s: (4,5,7,11,3,11). The second value is 3, because you then have two 8s: 12,8,8.
The code below creates the list of numbers, u, by simulating from a uniform distribution.
Thank you for any help or pointers. I've included a full description of the problem I am solving below, in case anyone is interested (I'm trying to learn by coding my way through a statistics text).
set.seed(1); u = matrix(runif(1000), nrow=1000)
u[u > 0 & u <= 1/12] <- 1
u[u > 1/12 & u <= 2/12] <- 2
u[u > 2/12 & u <= 3/12] <- 3
u[u > 3/12 & u <= 4/12] <- 4
u[u > 4/12 & u <= 5/12] <- 5
u[u > 5/12 & u <= 6/12] <- 6
u[u > 6/12 & u <= 7/12] <- 7
u[u > 7/12 & u <= 8/12] <- 8
u[u > 8/12 & u <= 9/12] <- 9
u[u > 9/12 & u <= 10/12] <- 10
u[u > 10/12 & u <= 11/12] <- 11
u[u > 11/12 & u < 12/12] <- 12
table(u); u[1:10,]
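As an aside (this is not part of the original question), the twelve interval assignments above can be collapsed into a single line, since runif draws lie strictly between 0 and 1; applied to the raw draws it would look like:
u <- matrix(runif(1000), nrow = 1000)
u <- ceiling(12 * u)  # each value in (0, 1) maps to one of the integers 1..12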
Example 2.6-3 Concepts in Probability and Stochastic Modeling, Higgins
Suppose we were to ask people at random in which month they were born. Let the random variable X denote the number of people we would need to ask before we found two people born in the same month. The possible values for X are 2, 3, ..., 13. That is, at least two people must be asked in order to have a match, and no more than 13 need to be asked. With the simplifying assumption that every month is an equally likely candidate for a response, a computer simulation was used to estimate the probability mass function of X. The simulation generated birth months until a match was found. Based on 1000 repetitions of this experiment, the following empirical distribution and sample statistics were obtained...

R has a steep initial learning curve. I don't think it's fair to assume this is your homework, and yes, it's possible to find solutions if you know what you're looking for. However, I remember it being difficult at times to research problems online simply because I didn't know what to search for (I wasn't familiar enough with the terminology).
Below is an explanation of one approach to solving the problem in R. Read the commented code and try and figure out exactly what it's doing. Still, I would recommend working through a good beginner resource. From memory, a good one to get up and running is icebreakeR, but there are many out there...
# set the number of simulations
nsim <- 10000
# Create a matrix, with nsim columns, and fill it with something.
# The something with which you'll populate it is a random sample,
# with replacement, of month names (held in a built-in vector called
# 'month.abb'). We're telling the sample function that it should take
# 13*nsim samples, and these will be used to fill the matrix, which
# has nsim columns (and hence 13 rows). We've chosen to take samples
# of length 13, because as your textbook states, 13 is the maximum
# number of month names necessary for a month name to be duplicated.
mat <- matrix(sample(month.abb, 13*nsim, replace=TRUE), ncol=nsim)
# If you like, take a look at the first 10 columns
mat[, 1:10]
# We want to find the position of the first duplicated value for each column.
# Here's one way to do this, but it might be a bit confusing if you're just
# starting out. The 'apply' family of functions is very useful for
# repeatedly applying a function to columns/rows/elements of an object.
# Here, 'apply(mat, 2, foo)' means that for each column (2 represents columns,
# 1 would apply to rows, and 1:2 would apply to every cell), do 'foo' to that
# column. Our function below extends this a little with a custom function. It
# says: for each column of mat in turn, call that column 'x' and perform
# 'match(1, duplicated(x))'. This match function will return the position
# of the first '1' in the vector 'duplicated(x)'. The vector 'duplicated(x)'
# is a logical (boolean) vector that indicates, for each element of x,
# whether that element has already occurred earlier in the vector (i.e. if
# the month name has already occurred earlier in x, the corresponding element
# of duplicated(x) will be TRUE (which equals 1), otherwise FALSE (which equals 0)).
# So the match function returns the position of the first duplicated month
# name (well, actually the second instance of that month name). e.g. if
# x consists of 'Jan', 'Feb', 'Jan', 'Mar', then duplicated(x) will be
# FALSE, FALSE, TRUE, FALSE, and match(1, duplicated(x)) will return 3.
# Referring back to your textbook problem, this is x, a realisation of the
# random variable X.
# Because we've used the apply function, the object 'res' will end up with
# nsim realisations of X, and these can be plotted as a histogram.
res <- apply(mat, 2, function(x) match(1, duplicated(x)))
hist(res, breaks=seq(0.5, 13.5, 1))
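If you want to compare the simulation with the empirical distribution and sample statistics your textbook reports, a small extension (not part of the original answer) would be:
# Estimated probability mass function of X, and the sample mean of the simulated values
table(res) / nsim
mean(res)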

Related

In R, need to find best combination of 8 columns, only being able to select one value from each row

In R, I'm attempting to find the best combination of 8 different columns of values but with the caveat of only being able to select one value from each row. It sounds relatively simple, but I'm trying to avoid a nasty looping scenario to evaluate all possible options, so I'm hopeful there is a function available that could make this a possibility. There are scenarios where I will need to run this on datasets with over 2000 rows, so efficiency is really important.
Here is an example:
I've been racking my brain and searching forever, but every scenario and solution I'm able to find can maximize a series of columns but can't handle the condition of allowing only a single value per row. Are there any functions where this is possible?
I will take a risk here and assume I interpreted you right: that you seek the group of 8 numbers in the table with the maximum sum, given, of course, that no two of them share a column or a row.
There is no easy answer to this question. I am not a computer scientist, but I believe this is what is called an NP-hard problem. So efficiency will always be a problem. Fortunately, in practical terms, I think you can get an answer for a 2000+ table in a matter of seconds, as long as the number of columns remains small.
The algorithm I used to solve this problem is essentially a depth-first search that takes advantage of existing R functions to make it faster. You can think of the problem as jumping from column to column, each time selecting the highest value, with a twist: every time you select a value, all cells in that row are set to zero. So in essence, when you get to the last column, there is only one value left to choose.
However, because rows are excluded as you go, your results will differ depending on the order in which you visit the columns (let's call that a path). Thus, you have to test all paths.
So our code must be something of the sort:
1- Enumerate all paths (all permutations of column numbers);
2- For each path, "walk" it, taking the maximum value of each column and setting the values in its row to 0; store the values;
3- For each set of values, calculate its sum and select the best path based on that.
Below is the code I have used to do it:
library(combinat) # loads the permn function, which enumerates all permutations
# Create fake data
data = sample(1:25)
data = matrix(data, 5, 5)
# Walking function
walker = function(path, data) {
  bestn = numeric(length(path))    # placeholder for the max value of each column
  usedrows = numeric(length(path)) # placeholder for the row of each max value
  data.reduced = data              # copy data to a new object
  for (a in 1:length(path)) {      # iterate through columns in the order given by path
    bestn[a] = max(data.reduced[, path[a]])          # find the maximum value
    usedrows[a] = which.max(data.reduced[, path[a]]) # find the maximum value's row
    data.reduced[usedrows[a], ] = 0 # set all values in that row to 0
    data.reduced[, path[a]] = 0     # set the current column to 0
  }
  return(bestn)
}
# Create all permutations, walk each of them, get their sums, and choose based on that
paths = permn(1:5)
values = lapply(paths, walker, data)
values.sum = sapply(values, sum)
values[[which.max(values.sum)]]
The code handles a 2000 x 5 matrix in less than a second on a laptop. I did not include that case here because, the more rows there are, the less the result depends on the path taken, and it becomes harder to follow what the code is doing with large numbers.
This problem can be solved simply as a binary integer optimization problem, here using the ROI and ompr optimization packages. ompr is a formulation manager that calls ROI functions for the optimization and processing. Here is an example:
require(ROI)
require(ROI.plugin.glpk)
require(ompr)
require(ompr.roi)
require(dplyr)  # provides the %>% pipe used below
set.seed(7)
n <- runif(77, 80, 120)
n <- c(n, rep(0, 179))
n <- sample(n)
m <- matrix(n, ncol = 8)
nrows <- nrow(m)
ncols <- ncol(m)
model <- MIPModel() %>%
  add_variable(x[i, j], i = 1:nrows, j = 1:ncols, type = 'binary', lb = 0) %>%
  set_objective(sum_expr(colwise(m[i, j]) * x[i, j], i = 1:nrows, j = 1:ncols), 'max') %>%
  add_constraint(sum_expr(x[i, j], i = 1:nrows) <= 1, j = 1:ncols) %>%
  add_constraint(sum_expr(x[i, j], j = 1:ncols) <= 1, i = 1:nrows)
result <- solve_model(model, with_ROI(solver = "glpk", verbose = TRUE))
<SOLVER MSG> ----
GLPK Simplex Optimizer, v4.47
40 rows, 256 columns, 512 non-zeros
* 0: obj = 0.000000000e+000 infeas = 0.000e+000 (0)
* 20: obj = 9.321807877e+002 infeas = 0.000e+000 (0)
OPTIMAL SOLUTION FOUND
GLPK Integer Optimizer, v4.47
40 rows, 256 columns, 512 non-zeros
256 integer variables, all of which are binary
Integer optimization begins...
+ 20: mip = not found yet <= +inf (1; 0)
+ 20: >>>>> 9.321807877e+002 <= 9.321807877e+002 0.0% (1; 0)
+ 20: mip = 9.321807877e+002 <= tree is empty 0.0% (0; 1)
INTEGER OPTIMAL SOLUTION FOUND
<!SOLVER MSG> ----
solution <- get_solution(result, x[i, j])
solution <- subset(solution, value != 0)
solution
variable i j value
27 x 27 1 1
43 x 11 2 1
88 x 24 3 1
99 x 3 4 1
146 x 18 5 1
173 x 13 6 1
209 x 17 7 1
246 x 22 8 1
The first code chunk generates a 32 x 8 random matrix; the sample gives roughly a 30% fill of nonzero values. The constraints restrict each column and each row to at most one active variable. You can use this code directly for a matrix of any dimension.
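As a quick check (a small addition, assuming the solution data frame and matrix m defined above), matrix indexing pulls out the selected cells, whose sum should match the objective value reported by the solver:
picked <- m[cbind(solution$i, solution$j)]  # the eight selected values
sum(picked)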

How to select a specific amount of rows before and after predefined values

I am trying to select relevant rows from a large time-series data set. The tricky bit is that the rows I need are the ones before and after certain values in a column.
# example data
x <- rnorm(100)
y <- rep(0,100)
y[c(13,44,80)] <- 1
y[c(20,34,92)] <- 2
df <- data.frame(x,y)
In this case the critical values are 1 and 2 in the df$y column. If, e.g., I want to select 2 rows before and 4 after df$y==1 I can do:
ones <- which(df$y == 1)
selection <- NULL
for (i in ones) {
  jj <- (i - 2):(i + 4)
  selection <- c(selection, jj)
}
df$selection <- 0
df$selection[selection] <- 1
This, arguably, scales poorly for more values. For df$y==2 I would have to repeat with:
twos <- which(df$y == 2)
selection <- NULL
for (i in twos) {
  jj <- (i - 2):(i + 4)
  selection <- c(selection, jj)
}
df$selection[selection] <- 2
The ideal scenario would be a function along the lines of this imaginary selector(data=df$y, values=c(1,2), before=2, after=5, afterafter=FALSE, beforebefore=FALSE), where values takes the critical values, before the number of rows to select before each one, and after the number to select after.
afterafter would additionally allow selecting from a certain number of rows after the value up to a later row, e.g. after=5, afterafter=10 (and beforebefore the same, going in the other direction).
Any tips and suggestions are very welcome!
Thanks!
This is easy enough with rep and its each argument.
df$selection[rep(which(df$y == 2), each=7L) + -2:4] <- 2
Here, rep repeats each row index that matches your criterion 7 times (the L indicates that the argument should be an integer). Adding the values -2 through 4 to those repeated indices gives the two rows before, the row itself, and the four rows after. Now, replace.
Note that for some comparisons, == will not be adequate due to numerical precision. See the SO post "Why are these numbers not equal?" for a detailed discussion of this topic. In these cases, you could use something like
which(abs(df$y - 2) < 0.001)
or whatever precision measure will work for your problem.
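If you want something closer to the selector() function sketched in the question, here is a minimal version built on the same rep() idea (the function name and arguments are illustrative, and the afterafter/beforebefore options are ignored):
selector <- function(y, values, before = 2, after = 4) {
  out <- numeric(length(y))
  for (v in values) {
    idx <- rep(which(y == v), each = before + after + 1) + (-before):after
    idx <- idx[idx >= 1 & idx <= length(y)]  # drop indices that fall outside the data
    out[idx] <- v
  }
  out
}
df$selection <- selector(df$y, values = c(1, 2))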

subset slow in large matrix

I have a numeric vector of length 5,000,000
>head(coordvec)
[1] 47286545 47286546 47286547 47286548 47286549 472865
and a 1,400,000 x 3 numeric matrix
>head(subscores)
V1 V2 V3
1 47286730 47286725 0.830
2 47286740 47286791 0.065
3 47286750 47286806 -0.165
4 47288371 47288427 0.760
5 47288841 47288890 0.285
6 47288896 47288945 0.225
What I am trying to accomplish: for each number in coordvec, find the average of V3 over the rows of subscores whose V1 and V2 bracket that number. To do that, I am taking the following approach:
results <- numeric(length(coordvec))
for (i in 1:length(coordvec)) {
  select_rows <- subscores[, 1] < coordvec[i] & subscores[, 2] > coordvec[i]
  scores_subset <- subscores[select_rows, 3]
  results[i] <- mean(scores_subset)
}
This is very slow, and would take a few days to finish. Is there a faster way?
Thanks,
Dan
I think there are two challenging parts to this question. The first is finding the overlaps. I'd use the IRanges package from Bioconductor (?findInterval in the base package might also be useful)
library(IRanges)
Create width-1 ranges representing the coordinate vector and a set of ranges representing the scores; I sort the coordinate vector for convenience, assuming that duplicate coordinates can be treated the same:
coord <- sort(sample(.Machine$integer.max, 5000000))
starts <- sample(.Machine$integer.max, 1200000)
scores <- runif(length(starts))
q <- IRanges(coord, width=1)
s <- IRanges(starts, starts + 100L)
Here we find which query overlaps which subject
system.time({
  olaps <- findOverlaps(q, s)
})
This takes about 7s on my laptop. There are different types of overlaps (see ?findOverlaps) so maybe this step requires a bit of refinement.
The result is a pair of vectors indexing the query and overlapping subject.
> olaps
Hits of length 281909
queryLength: 5000000
subjectLength: 1200000
queryHits subjectHits
<integer> <integer>
1 19 685913
2 35 929424
3 46 1130191
4 52 37417
I think this is the end of the first complicated part, finding the 281909 overlaps. (I don't think the data.table answer offered elsewhere addresses this, though I could be mistaken...)
The next challenging part is calculating a large number of means. The built-in way would be something like
olaps0 <- head(olaps, 10000)
system.time({
  res0 <- tapply(scores[subjectHits(olaps0)], queryHits(olaps0), mean)
})
which takes about 3.25s on my computer and appears to scale linearly, so maybe 90s for the 280k overlaps. But I think we can accomplish this tabulation efficiently with data.table. The original coordinates are start(q)[queryHits(olaps)], so
require(data.table)
dt <- data.table(coord = start(q)[queryHits(olaps)],
                 score = scores[subjectHits(olaps)])
res1 <- dt[, mean(score), by = coord]$V1
which takes about 2.5s for all 280k overlaps.
Some more speed can be had by recognizing that the query hits are ordered. We want to calculate a mean for each run of query hits. We start by creating a variable to indicate the ends of each query hit run
idx <- c(queryHits(olaps)[-1] != queryHits(olaps)[-length(olaps)], TRUE)
and then calculate the cumulative scores at the ends of each run, the length of each run, and the difference between the cumulative score at the end and at the start of the run
scoreHits <- cumsum(scores[subjectHits(olaps)])[idx]
n <- diff(c(0L, seq_along(idx)[idx]))
xt <- diff(c(0L, scoreHits))
And finally, the mean is
res2 <- xt / n
This takes about 0.6s for all the data, and is identical to (though more cryptic than?) the data.table result
> identical(res1, res2)
[1] TRUE
The original coordinates corresponding to the means are
start(q)[ queryHits(olaps)[idx] ]
Something like this might be faster:
require(data.table)
subscores <- as.data.table(subscores)
subscores[, cond := V1 < coordvec & V2 > coordvec]
subscores[list(cond)[[1]], mean(V3)]
list(cond)[[1]] because: "When i is a single variable name, it is not considered an expression of column names and is instead evaluated in calling scope." source: ?data.table
Since your example isn't easily reproducible, and even if it were, none of your subscores meet your boolean condition, I'm not sure if this does exactly what you're looking for, but you can use one of the apply family and a function.
myfun <- function(x) {
  y <- subscores[, 1] < x & subscores[, 2] > x
  mean(subscores[y, 3])
}
sapply(coordvec, myfun)
You can also take a look at mclapply. If you have enough memory this will probably speed things up significantly. However, you could also look at the foreach package with similar results. You've got your for loop "correct" by assigning into results rather than growing it, but really, you're doing a lot of comparisons. It will be hard to speed this up much.
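For example, the mclapply route might look something like the following (a sketch only: mclapply forks processes, so it is for Unix-alikes, and mc.cores should be set for your machine):
library(parallel)
results <- unlist(mclapply(coordvec, myfun, mc.cores = 4))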

identify if a number is within a range of numbers in R

I know the title makes this sound very easy, but I have a for loop that graphs my data. As the values vary, so too do the limits of the x-axis (a different start and end for each plot). However, I want a defined x-axis range that stays the same for all similar plots. This is complicated by the fact that there are ~40 possible x-ranges I want to choose from.
I have a data frame of the x-lims I would like to choose from, for each plot. It basically looks like:
Trait start end
A 123456 134567
B 234546 245678
C 234546 245678
D 345678 356789
and so on. So, if one loop gives me the values: Trait C, start = 235000 and end = 240000, I would like to automatically use the third set of default X-lims.
edit: added more info (Trait).
If your displayed data.frame is called df (and you are sure your variable end will be greater than start), this should work:
which(start > df$start & end < df$end)[1]
Revised to answer revised question
As you add more conditions, you can extend the logic above:
## Make your data easily reproducible for others
df <- read.table(text="Trait start end
A 123456 134567
B 234546 245678
C 234546 245678
D 345678 356789", header=TRUE)
## Set values from within your example loop
Trait <- "C"
start <- 235000
end <- 240000
## Get index of desired row
i <- which(Trait==df$Trait & start > df$start & end < df$end)[1]
## Extract xlim values in the form of a numeric vector
myxlim <- unname(unlist(df[i, c("start", "end")]))
myxlim
[1] 234546 245678
I'm not sure I totally follow the issue. If I'm understanding you correctly, you want to ensure that the range of each of your plots is 11111 (in this example, anyway), but the upper and lower values are going to vary significantly. So, right now, you're looking for a way to create a table of all the possible upper and lower bounds you might want, and then you want to look them up when you plot.
I'd propose that you could do it a lot more easily by simply writing the plot statement to ensure that the minimum and maximum are always 11111 apart.
Let's say you got start <- 235000, end <- 240000, and trait <- B during one loop iteration. Could you structure your code like this?
diff <- end - start
gap <- 11111 - diff
plot(thing_you_plot, xlim = c(start - gap/2, end + gap/2))
With numbers:
diff <- 240000 - 235000   # 5000
gap <- 11111 - 5000       # 6111
plot(thing_you_plot, xlim = c(235000 - 6111/2, 240000 + 6111/2))
(The x limits are 231944.5 and 243055.5, making the plot 11111 wide.)
Obviously, if you wanted, you could use floor and ceiling functions to get round numbers instead of decimals. It's not clear from your question how the "trait" even really affects the dimensions; if you can have two traits (B and C) that are plotted using the same dimensions, why do you need the table at all? I think you can do a lot better just doing it for each plot using simple functions.
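To make the floor/ceiling point concrete (an illustrative line, not from the original answer), rounding the limits to whole numbers could look like:
plot(thing_you_plot, xlim = c(floor(start - gap/2), ceiling(end + gap/2)))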
EDITED BASED ON REVISION:
Josh beat me to it, but here it is again since I had it almost all typed.
df <- data.frame(trait = c("A", "B", "C", "D"),
                 start = c(123456, 234546, 234546, 345678),
                 end = c(134567, 245678, 245678, 356789))
trait <- "C"
start <- 235000
end <- 240000
xmin <- df[which(start > df$start & end < df$end & trait == df$trait), 2]
xmax <- df[which(start > df$start & end < df$end & trait == df$trait), 3]
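The extracted limits would then go into the plot call, e.g. (xvals and yvals here stand in for whatever is being plotted):
plot(xvals, yvals, xlim = c(xmin, xmax))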

Aligning sequences with missing values

The language I'm using is R, but you don't necessarily need to know about R to answer the question.
Question:
I have a sequence that can be considered the ground truth, and another sequence that is a shifted version of the first, with some missing values. I'd like to know how to align the two.
setup
I have a sequence ground.truth that is basically a set of times:
ground.truth <- rep( seq(1, by=4, length.out=10), 5 ) +
  rep( seq(0, length.out=5, by=4*10+30), each=10 )
Think of ground.truth as times where I'm doing the following:
{take a sample every 4 seconds for 10 times, then wait 30 seconds} x 5
I have a second sequence observations, which is ground.truth shifted with 20% of the values missing:
nSamples <- length(ground.truth)
idx_to_keep <- sort(sample( 1:nSamples, .8*nSamples ))
theLag <- runif(1)*100
observations <- ground.truth[idx_to_keep] + theLag
nObs <- length(observations)
If I plot these vectors, this is what it looks like (remember, think of these as times): [plot of ground.truth and observations omitted]
What I've tried. I want to:
1. calculate the shift (theLag in my example above)
2. calculate a vector idx such that ground.truth[idx] == observations - theLag
First, assume we know theLag. Note that ground.truth[1] is not necessarily observations[1]-theLag. In fact, we have ground.truth[1] == observations[1+lagI]-theLag for some lagI.
To calculate this, I thought I'd use cross-correlation (ccf function).
However, whenever I do this I get a lag with a max. cross-correlation of 0, meaning ground.truth[1] == observations[1] - theLag. But I've tried this in examples where I've explicitly made sure that observations[1] - theLag is not ground.truth[1] (i.e. modify idx_to_keep to make sure it doesn't have 1 in it).
The shift theLag shouldn't affect the cross-correlation (isn't ccf(x,y) == ccf(x,y-constant)?) so I was going to work it out later.
Perhaps I'm misunderstanding though, because observations doesn't have as many values in it as ground.truth? Even in the simpler case where I set theLag==0, the cross correlation function still fails to identify the correct lag, which leads me to believe I'm thinking about this wrong.
Does anyone have a general methodology for me to go about this, or know of some R functions/packages that could help?
Thanks a lot.
For the lag, you can compute all the differences (distances) between your two sets of points:
diffs <- outer(observations, ground.truth, '-')
Your lag should be the value that appears length(observations) times:
which(table(diffs) == length(observations))
# 55.715382960625
# 86
Double check:
theLag
# [1] 55.71538
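If you want the lag as a plain number rather than reading it off the table's names, one way (a small addition, here taking the most frequent difference) is:
tab <- table(diffs)
estLag <- as.numeric(names(tab)[which.max(tab)])
estLag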
The second part of your question is easy once you have found theLag:
idx <- which(ground.truth %in% (observations - theLag))
The following should work if your time series are not too long.
You have two vectors of time-stamps,
the second one being a shifted and incomplete copy of the first,
and you want to find by how much it was shifted.
# Sample data
n <- 10
x <- cumsum(rexp(n,.1))
theLag <- rnorm(1)
y <- theLag + x[sort(sample(1:n, floor(.8*n)))]
We can try all possible lags and, for each one,
compute how bad the alignment is,
by matching each observed timestamp with the closest
"truth" timestamp.
# Loss function
library(sqldf)
f <- function(u) {
  # Put all the values in a data.frame
  d1 <- data.frame(g="truth", value=x)
  d2 <- data.frame(g="observed", value=y+u)
  d <- rbind(d1, d2)
  # For each observed value, find the next truth value
  # (we could take the nearest, on either side,
  # but it would be more complicated)
  d <- sqldf("
    SELECT A.g, A.value,
           ( SELECT MIN(B.value)
             FROM d AS B
             WHERE B.g = 'truth'
             AND B.value >= A.value
           ) AS next
    FROM d AS A
    WHERE A.g = 'observed'
  ")
  # If u is greater than the lag, there are missing values.
  # If u is smaller, the differences decrease
  # as we approach the lag.
  if (any(is.na(d))) {
    return(Inf)
  } else {
    return( sum(d$`next` - d$value, na.rm=TRUE) )
  }
}
We can now search for the best lag.
# Look at the loss function
sapply( seq(-2,2,by=.1), f )
# Minimize the loss function.
# Change the interval if it does not converge,
# i.e., if it seems in contradiction with the values above
# or if the minimum is Inf
(r <- optimize(f, c(-3,3)))
-r$minimum
theLag # Same value, most of the time

Resources